# File I/O and NumPy

Now that we can compute with and manipulate NumPy arrays and know how to construct a record array, it's time to do some real-world analysis by reading files into a NumPy array and writing the resulting array to a file for further analysis.

In this section, you will learn how to load (import) your data and save it. There are many ways of loading data, and the right one depends on your file type. You can load text files, SAS/Stata files, HDF5 files, and many others. HDF (Hierarchical Data Format) is a popular format for storing and organizing large amounts of data, and it is particularly useful when working with multidimensional homogeneous arrays. For example, the pandas library has a handy class named `HDFStore` that makes it easy to work with HDF5 files. While working on data science projects, you will most likely see many of these file types, but in this book we will cover the most popular ones: **NumPy binary files**, **text files** (`.txt`), and **Comma Separated Values** (`.csv`) files.

## Text and CSV Files

Normally we would talk about reading a file first and then exporting it. Here, however, we are going to reverse the process: we create a record array first and then output it to a CSV file. We then read the exported CSV file back into a NumPy record array and compare it with our original record array. The sample array we're going to create will contain an `id` field with consecutive integers, a `value` field containing random floats, and a `date` field built from `numpy.datetime64` values with day (`'D'`) resolution. This exercise will use all the knowledge you gained from the previous sections and chapters. Let's start creating the record array:

In [1]:
import numpy as np

In [2]:
id_ = np.arange(1000)

In [3]:
value = np.random.random(1000) 

In [4]:
day = np.random.randint(0, 365, 1000) * np.timedelta64(1, 'D') 

In [5]:
date = np.datetime64('2014-01-01') + day 

In [6]:
rec_array = np.core.records.fromarrays(
    [id_, value, date],
    names='id, value, date',
    formats='i4, f4, a10'
) 

In [8]:
rec_array[:5]

rec.array([(0, 0.94466794, b'2014-03-21'), (1, 0.990071  , b'2014-06-12'),
           (2, 0.39904514, b'2014-11-22'), (3, 0.92618376, b'2014-10-26'),
           (4, 0.15883118, b'2014-01-12')],
          dtype=[('id', '<i4'), ('value', '<f4'), ('date', 'S10')])

We first create three NumPy arrays representing the fields we need: `id`, `value`, and `date`. When creating the `date` field, we combine the `numpy.datetime64` with a random NumPy array with size 1000 to simulate random dates in the range from `2014-01-01` to `2014-12-31` (365 days).

Then we use the `numpy.core.records.fromarrays()` function to merge the three arrays into one record array and assign the `names` (field names) and the `formats` (data types). One thing to notice here is that we do not keep the dates as `numpy.datetime64` values; we store them in the record array as date strings with a length of 10 (the `a10` format, which appears as `S10` in the output).

If you are using Python 3, you will find a prefix `b` in front of the date strings in the record array, such as `b'2014-09-25'`. The `b` marks a bytes literal: the value is a sequence of raw bytes (here, ASCII characters) rather than a Unicode string (all `str` values in Python 3 are Unicode, which is one major change between Python 2 and 3). Because the `a10` format stores each date as a 10-byte string, Python 3 displays it with the prefix to distinguish it from a normal string. However, this doesn't affect what we are going to do next, which is exporting the record array to a CSV file:

In [9]:
np.savetxt('./record.csv', rec_array, fmt='%i, %.4f, %s')

We use the `numpy.savetxt()` function to handle the export. The first argument is the location of the exported file, the second is the array, and the `fmt` argument gives the format of each row. We have three fields with three different data types, and we separate them with `,` in the CSV file. If you prefer another delimiter, replace the commas in the `fmt` argument. We also trim redundant digits in the value field by writing only four digits after the decimal point with `%.4f`. Now you can go to the file location we specified in the first argument to check the CSV file. Open it in a spreadsheet program and you will see the following:

```csv
0, 0.5038, b'2014-04-22'
1, 0.6160, b'2014-06-05'
2, 0.7191, b'2014-08-05'
3, 0.3750, b'2014-12-14'
4, 0.9078, b'2014-07-01'
5, 0.9822, b'2014-03-01'
6, 0.5804, b'2014-01-03'
7, 0.7590, b'2014-04-02'
8, 0.8939, b'2014-08-28'
9, 0.1700, b'2014-07-30'
...
```
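The `b'...'` prefix ends up in the file because the `date` field holds bytes. The following is a minimal sketch of one way to avoid it, assuming you are free to store the dates as Unicode strings (`U10`) instead of `a10`. The in-memory buffer, the two sample rows, and the `header` line are illustrative choices and not part of the original example:

```python
import io

import numpy as np

# Two made-up rows; the date field uses a Unicode dtype ('U10') instead of bytes.
rec = np.array(
    [(0, 0.5038, '2014-04-22'), (1, 0.6160, '2014-06-05')],
    dtype=[('id', 'i4'), ('value', 'f4'), ('date', 'U10')],
)

# Write to an in-memory text buffer so the sketch leaves no file behind.
buffer = io.StringIO()
np.savetxt(
    buffer,
    rec,
    fmt='%i,%.4f,%s',        # one format specifier per field
    header='id,value,date',  # optional header line with the field names
    comments='',             # suppress the default '# ' prefix on the header
)

text = buffer.getvalue()
print(text)
```

If you write a header line like this, remember to skip it (or parse it as field names) when reading the file back.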

Next, we are going to read the CSV file into a record array and use the `value` field to generate a mask field, named `mask`, which flags values larger than or equal to 0.75. Then we will append the new mask field to the record array. Let's read the CSV file first:

In [13]:
read_array = np.genfromtxt(
    './record.csv',
    dtype='i4,f4,a10',
    delimiter=',',
    skip_header=0
)

In [14]:
read_array.dtype

dtype([('f0', '<i4'), ('f1', '<f4'), ('f2', 'S10')])

We use `numpy.genfromtxt()` to read the file into a NumPy record array. The first argument is again the file location, and the `dtype` argument is optional. It defaults to `float`; if you pass `dtype=None`, NumPy determines the type of each column individually from its contents. Since we know the data well, it's recommended to specify `dtype` explicitly every time you read the file.

The `delimiter` argument is also optional, and by default, any consecutive whitespace characters act as delimiters. However, we used `","` for the CSV file. The last optional argument we use in the function is `skip_header`, which tells NumPy how many lines to skip at the beginning of the file. Our file has no field names above the records, so we set it to 0.

Other than `skip_header`, the `numpy.genfromtxt()` function supports 22 more optional parameters to fine-tune the array, such as defining missing and filling values. For more details, please refer to https://docs.scipy.org/doc/numpy-1.10.0/reference/generated/numpy.genfromtxt.html.
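To make the missing and filling values concrete, here is a small sketch that assumes a CSV with a header line and one empty `value` cell. The sample text and the fill value `-1.0` are made up for illustration. It uses `names=True` to take the field names from the header and `filling_values` to replace the empty cell:

```python
import io

import numpy as np

# Hypothetical CSV content: a header line, and a missing value in the second row.
csv_text = "id,value\n0,0.5038\n1,\n2,0.7191\n"

data = np.genfromtxt(
    io.StringIO(csv_text),
    delimiter=',',
    names=True,                # read the field names from the first line
    dtype='i4,f4',
    filling_values={1: -1.0},  # column index 1: use -1.0 where the cell is empty
)

print(data.dtype.names)
print(data['value'])
```

A sentinel such as `-1.0` only makes sense when it cannot occur in the real data; otherwise a NaN fill or a masked array (`usemask=True`) may be the safer choice.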

Now that the data has been read into the record array, you may find that the second field shows more than four digits after the decimal point, even though we wrote only four when exporting the CSV. The reason is that we use `f4` (a 32-bit float) as its data type when reading, and most four-digit decimal values cannot be represented exactly in that type. NumPy stores the nearest representable value, which still agrees with the file when rounded to four digits. You may also notice that we lost the field names, so let's specify them:

In [25]:
read_array.dtype.names = ('id', 'value', 'date') 
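The masking step described earlier can be sketched as follows. This is one possible approach rather than the only one: it assumes the helper `append_fields` from `numpy.lib.recfunctions`, and it uses a three-row stand-in for `read_array` with made-up values so that the snippet runs on its own:

```python
import numpy as np
from numpy.lib import recfunctions as rfn

# Small stand-in for the array read from the CSV file (values are made up).
read_array = np.array(
    [(0, 0.5038, b'2014-04-22'), (1, 0.7590, b'2014-04-02'), (2, 0.9822, b'2014-03-01')],
    dtype=[('id', 'i4'), ('value', 'f4'), ('date', 'S10')],
)

# Boolean array: True where the value is larger than or equal to 0.75.
mask = read_array['value'] >= 0.75

# Return a new array with the extra field; usemask=False gives a plain ndarray
# instead of a masked array.
with_mask = rfn.append_fields(read_array, 'mask', mask, usemask=False)

print(with_mask.dtype.names)
print(with_mask['mask'])
```

Note that `append_fields` builds a new array; the original `read_array` is left unchanged.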

## `.npy` or `.npz`

When you are working with arrays, you will usually save them as NumPy binary files once you have finished working with them, because the binary format stores the array shape and data type along with the data. When you reload your array, NumPy restores this information, and you can continue working from where you left off. Moreover, NumPy binary files record enough information about an array for it to be read correctly even on another machine with a different architecture. In NumPy, the `load()`, `save()`, `savez()`, and `savez_compressed()` functions help you to load and save NumPy binary files as follows:

In [24]:
example_array = np.arange(12).reshape(3,4)

In [25]:
# See the note below for why `allow_pickle` is set to False.
np.save('example.npy', example_array, allow_pickle=False)

In [26]:
d = np.load('example.npy')

In [27]:
d.shape

(3, 4)

In the preceding code, we practice saving an array as a binary file and loading it back without affecting its shape, in the following steps:

- Create an array with a shape `(3, 4)`
- Save the array as a binary file
- Load back the array
- Check whether the shape is still the same

> **Note:** Set `allow_pickle=False` in `numpy.save` and `numpy.load` unless the array `dtype` includes Python objects, in which case pickling is required. Pickles are not secure against erroneous or maliciously constructed data.
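As a quick illustration of this note, the sketch below saves a numeric array and an object array with `allow_pickle=False`. It writes to in-memory buffers (our choice, to avoid creating files) and assumes that NumPy raises a `ValueError` when asked to save an object array without pickling:

```python
import io

import numpy as np

# A purely numeric array round-trips without pickling.
numeric = np.arange(12).reshape(3, 4)
buffer = io.BytesIO()
np.save(buffer, numeric, allow_pickle=False)
buffer.seek(0)
restored = np.load(buffer, allow_pickle=False)

# An object array holds arbitrary Python objects, so it needs pickling.
objects = np.array([{'a': 1}, 'text'], dtype=object)
try:
    np.save(io.BytesIO(), objects, allow_pickle=False)
    rejected = False
except ValueError:
    rejected = True

print(restored.shape, restored.dtype == numeric.dtype, rejected)
```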

Similarly, you can use the `savez()` function to save several arrays into a single file. If you want to save your files as compressed NumPy binary files, you can use `savez_compressed()` as follows:

In [28]:
x = np.arange(5)
y = np.arange(10)

In [52]:
np.savez('x_y.npz', x, y)

In [53]:
npzfile = np.load('x_y.npz')

In [54]:
npzfile.files

['arr_0', 'arr_1']

In [55]:
npzfile['arr_0']

array([0, 1, 2, 3, 4])

In [56]:
npzfile['arr_1']

array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

When you save several arrays in a single file, you can name each one by passing it as a keyword argument such as `first_array=x`. Otherwise, arrays passed positionally are given default names: `arr_0` for the first, `arr_1` for the second, and so on.

In [57]:
np.savez_compressed('x_y_compressed.npz', first_array=x, second_array=y)

In [58]:
npzfile = np.load('x_y_compressed.npz')

In [59]:
npzfile.files

['first_array', 'second_array']

In [60]:
npzfile['first_array']

array([0, 1, 2, 3, 4])

In [61]:
npzfile['second_array']

array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
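To get a feel for what compression buys you, the following sketch compares the sizes produced by `savez()` and `savez_compressed()` for the same data. The array of zeros is a deliberately compressible, made-up example, and the in-memory buffers stand in for real files; actual savings depend on your data:

```python
import io

import numpy as np

# Highly repetitive data compresses very well; real data usually less so.
zeros = np.zeros(100_000)

plain = io.BytesIO()
np.savez(plain, zeros=zeros)

compressed = io.BytesIO()
np.savez_compressed(compressed, zeros=zeros)

plain_size = len(plain.getvalue())
compressed_size = len(compressed.getvalue())
print(plain_size, compressed_size)

# Loading works the same way for both; the context manager closes the archive.
compressed.seek(0)
with np.load(compressed) as npzfile:
    restored = npzfile['zeros']
```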

> **Note:** In general, prefer `numpy.save` and `numpy.load` to `numpy.ndarray.tofile` and `numpy.fromfile`. The latter pair loses information on endianness and precision and so is unsuitable for anything but scratch storage.
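The sketch below shows what this loss of information looks like in practice. It writes a small integer array with `tofile()` into a temporary directory (our choice for the example) and reads it back with `fromfile()`, first with the default settings and then with the data type and shape supplied by hand:

```python
import os
import tempfile

import numpy as np

original = np.arange(12, dtype='i4').reshape(3, 4)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'raw.bin')
    original.tofile(path)  # writes the raw bytes only

    # Without extra information, fromfile() assumes float64 and a flat shape.
    default_read = np.fromfile(path)

    # The caller has to remember both the data type and the shape.
    restored = np.fromfile(path, dtype='i4').reshape(3, 4)

print(default_read.shape, default_read.dtype)
print(np.array_equal(restored, original))
```

The default read reinterprets the 48 bytes as six `float64` values, which is why the `.npy` format, with its stored header, is the safer option.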

## `json`

> **Warning:** NumPy arrays are not directly JSON serializable.
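A common workaround, sketched here under the assumption that the array holds plain numeric data, is to convert the array to nested Python lists with `tolist()` before serializing. The `data` and `dtype` keys in the payload are hypothetical names chosen for this example:

```python
import json

import numpy as np

example_array = np.arange(12).reshape(3, 4)

# Passing the array straight to json.dumps() fails with a TypeError.
try:
    json.dumps(example_array)
    direct_ok = True
except TypeError:
    direct_ok = False

# Convert to nested lists and keep the dtype as a string alongside the data.
payload = json.dumps({
    'dtype': str(example_array.dtype),
    'data': example_array.tolist(),
})

# Rebuild the array from the decoded payload.
decoded = json.loads(payload)
restored = np.array(decoded['data'], dtype=decoded['dtype'])

print(direct_ok, restored.shape)
```

JSON stores numbers as text, so this is much less compact than the `.npy` format and is best kept for small arrays or for exchanging data with non-Python tools.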