# Saving data in numpy/pandas - a recap

Last year we covered how to store data to disk. Let's recap that, as it will be useful in this module later!

## Storing numpy data

Let's create four arrays (`aa`, `x1`, `y1` and `z1`) that we might want to store.

In [1]:
import numpy as np
def func_makedata(a):
    x1 = a**2
    y1 = np.cos(a)
    z1 = 3*a**2 
    return x1, y1, z1

aa = np.linspace(0.,10.,50)
x1, y1, z1 = func_makedata(aa)


Let's start by storing a single array to disk.

In [2]:
np.savetxt('data_array_aa.dat', aa)

This stores the contents of the array `aa` in the file `data_array_aa.dat`. You can open this file in the Colab/Drive window to check what is in it.

However, we might want to store all 4 of the arrays. To do this we need to *arrange* the data into a single 2D array before saving it. We can do

In [3]:
np.savetxt('data_array_aa2.dat', np.array([aa, x1, y1, z1]))

Again, you can see this in the Colab/Drive window. In this case we have 4 rows, each containing N columns (where N is the length of aa and the other arrays). If we want to swap this to 4 columns and N rows we can do:

In [4]:
np.savetxt('data_array_aa3.dat', np.array([aa, x1, y1, z1]).T)

which will transpose the array and store it nicely.
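
`np.savetxt` also accepts optional `header` and `delimiter` arguments, which can make the text file easier to interpret later. The following is a minimal sketch of that idea; the file name `data_with_header.dat` and the temporary directory are illustrative choices rather than part of the recap above.

```python
import os
import tempfile

import numpy as np

aa = np.linspace(0., 10., 50)
x1 = aa**2

# np.column_stack places each 1D array in its own column,
# giving the same layout as np.array([aa, x1]).T
table = np.column_stack([aa, x1])

# Illustrative output location (a temporary directory)
out_path = os.path.join(tempfile.mkdtemp(), 'data_with_header.dat')

# The header is written as a comment line (prefixed with '# ' by default)
np.savetxt(out_path, table, header='aa x1', delimiter=',')

# np.loadtxt skips lines starting with '#', but must be told the delimiter
table_from_file = np.loadtxt(out_path, delimiter=',')
print(table_from_file.shape)
```

The header line records which column is which, something the bare numbers in the file cannot tell you.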

In all these cases the operation to read the corresponding array back from disk is

In [5]:
file_data = np.loadtxt('data_array_aa3.dat')

`file_data` now stores the 2D array we read from disk. To recover the `y1` array (which was stored in the 3rd column, i.e. column index 2) we use *slicing*, as demonstrated previously:

In [6]:
y1_from_file = file_data[:,2]
print(y1 - y1_from_file)

[0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0.]
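
As an alternative to slicing, `np.loadtxt` can hand back the columns directly. The sketch below uses its `unpack` and `usecols` arguments; the file name `data_columns.dat` and the temporary directory are assumptions made for illustration.

```python
import os
import tempfile

import numpy as np

aa = np.linspace(0., 10., 50)
x1, y1, z1 = aa**2, np.cos(aa), 3*aa**2

# Illustrative output location (a temporary directory)
out_path = os.path.join(tempfile.mkdtemp(), 'data_columns.dat')
np.savetxt(out_path, np.array([aa, x1, y1, z1]).T)

# unpack=True transposes the result, so each column comes back as its own array
aa_f, x1_f, y1_f, z1_f = np.loadtxt(out_path, unpack=True)

# usecols reads only the requested column (here the third one, index 2)
y1_only = np.loadtxt(out_path, usecols=2)

print(np.allclose(y1, y1_f), np.allclose(y1, y1_only))
```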


An alternative to the `savetxt` command is the `savez` command. The advantages to `savez` are:
 * You can easily save and read back in numerous arrays
 * The arrays do not have to be the same length (or be shaped into an N x M 2D array)
 * It's much faster to read/save data using this command.

But the one major disadvantage is:

 * These files are not human-readable and are not easily read by anything other than numpy. Plain numeric arrays are stored together with their data type and byte order, so they can normally be read on machines with a different CPU architecture, but arrays of Python objects are pickled and are less portable.

Here's how this works with our datasets above:

In [7]:
np.savez('data_array_aa.npz', aa=aa, x1=x1, y1=y1, z1=z1)

Try reading this file in Colab/Drive!

To read this file back in with numpy you use:

In [8]:
file_data = np.load('data_array_aa.npz')
print(list(file_data))

['aa', 'x1', 'y1', 'z1']


`file_data` behaves much like a Python dictionary (it is an `NpzFile` object, which loads each array from the file when you ask for it). To access each of the arrays we index it by name, as we would access entries in a Python dictionary:

In [9]:
y1_from_file = file_data['y1']
print(y1_from_file - y1)

[0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0.]
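
To illustrate the point that `savez` does not need the arrays to share a shape, here is a small sketch that stores a 1D and a 2D array together and reads them back inside a `with` block. The array names, the file name `mixed_shapes.npz` and the temporary directory are hypothetical choices for this example.

```python
import os
import tempfile

import numpy as np

short = np.arange(3)         # 1D array with 3 entries
long_2d = np.ones((4, 5))    # 2D array with a different shape

# Illustrative output location (a temporary directory)
out_path = os.path.join(tempfile.mkdtemp(), 'mixed_shapes.npz')
np.savez(out_path, short=short, long_2d=long_2d)

# The with-statement closes the underlying file once we are done with it
with np.load(out_path) as data:
    names = data.files            # names of the stored arrays
    short_back = data['short']    # each array is read from disk on access
    long_back = data['long_2d']

print(names)
print(short_back.shape, long_back.shape)
```

Using `with` is optional in a short notebook, but it avoids leaving the file open when many files are read in a loop.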


## EXERCISE

Let's practice using this functionality. Write a function to do the following:

 * Create an array containing 1000 numbers uniformly distributed between 0 and $\pi$. Both 0 and $\pi$ should be in the array as the 1st and 1000th entries.
 * Create a second array storing $\sin(x)$
 * Create a third array storing $\cos^2(x) +1$
 * Create a fourth array storing $\mathrm{cosech}(x)$
 
Then write these 4 arrays to a file, read them back in, and check that you can recover the original arrays. Try using the different options illustrated above.

In [10]:
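def trig_save(min_in, max_in, n_values):
    arr1 = np.linspace(min_in, max_in, n_values)
    arr2 = np.sin(arr1)
    arr3 = np.cos(arr1) ** 2 + 1
    # cosech(x) = 1/sinh(x); this is infinite at x = 0, so we silence
    # numpy's divide-by-zero warning for this one line
    with np.errstate(divide='ignore'):
        arr4 = 1 / np.sinh(arr1)

    return arr1, arr2, arr3, arr4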

In [11]:
arr1, arr2, arr3, arr4 = trig_save(0, np.pi, 1000)

np.savez('trig_save.npz', arr1=arr1, arr2=arr2, arr3=arr3, arr4=arr4)
trig_load = np.load('trig_save.npz')

print((trig_load['arr1'] == arr1).all())
print((trig_load['arr2'] == arr2).all())
print((trig_load['arr3'] == arr3).all())
print((trig_load['arr4'] == arr4).all())

True
True
True
True
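
The `==` check above works because `savez` stores the values exactly. With a text file the result depends on how many digits are written, so a tolerance-based comparison can be the safer check. The sketch below shows the difference using a deliberately short format string; the file name `sin_low_precision.dat`, the temporary directory and the choice of `fmt='%.6e'` are assumptions made for illustration.

```python
import os
import tempfile

import numpy as np

x = np.linspace(0, np.pi, 1000)
s = np.sin(x)

# Illustrative output location (a temporary directory)
out_path = os.path.join(tempfile.mkdtemp(), 'sin_low_precision.dat')

# fmt='%.6e' keeps only 7 significant digits, so some precision is lost
np.savetxt(out_path, s, fmt='%.6e')
s_back = np.loadtxt(out_path)

exact = np.array_equal(s, s_back)   # element-by-element equality
close = np.allclose(s, s_back)      # equality within a small tolerance
print(exact, close)
```

Here the exact comparison fails while the tolerance-based one passes, which is the expected behaviour when digits are dropped on writing.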


## Reading/Writing data to file with pandas

pandas is very nice for reading, writing and manipulating tables of data. Let's illustrate some of the basic functionality here. First, let's generate our data arrays again:


In [12]:
import numpy as np
import pandas as pd
def func_makedata(a):
    x1 = a**2
    y1 = np.cos(a)
    z1 = 3*a**2 
    return x1, y1, z1

aa = np.linspace(0.,10.,50)
x1, y1, z1 = func_makedata(aa)

Then we create a pandas `DataFrame` object to store our data. This is initialized from a dictionary, so we first put our data into a dictionary and then initialize the `DataFrame` object.

In [13]:
data_dict = {}
data_dict['aa'] = aa
data_dict['x1'] = x1
data_dict['y1'] = y1
data_dict['z1'] = z1
pd_dataframe = pd.DataFrame(data_dict)
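
The same construction can be written more compactly with a dictionary literal, and a column can be pulled back out as a numpy array with `to_numpy()`. This is a small sketch with a reduced set of columns; the variable name `df` is an illustrative choice.

```python
import numpy as np
import pandas as pd

aa = np.linspace(0., 10., 50)

# Each dictionary key becomes a column name, each array a column
df = pd.DataFrame({'aa': aa, 'x1': aa**2, 'y1': np.cos(aa)})

print(df.shape)           # (number of rows, number of columns)
print(list(df.columns))   # column names, in insertion order

# Selecting a column gives a pandas Series; to_numpy() converts it to an array
x1_back = df['x1'].to_numpy()
print(np.array_equal(x1_back, aa**2))
```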

Note that if we display the DataFrame it looks much nicer than a numpy array! (To get the nicely formatted table, end the cell with the variable name rather than calling `print`; the formatting comes from Jupyter's integration with pandas.)

In [14]:
pd_dataframe

|   | aa | x1 | y1 | z1 |
|---|----|----|----|----|
| 0 | 0.0 | 0.0 | 1.0 | 0.0 |
| 1 | 0.204082 | 0.041649 | 0.979248 | 0.124948 |
| 2 | 0.408163 | 0.166597 | 0.917851 | 0.499792 |
| 3 | 0.612245 | 0.374844 | 0.81836 | 1.124531 |
| 4 | 0.816327 | 0.666389 | 0.684902 | 1.999167 |
| 5 | 1.020408 | 1.041233 | 0.523018 | 3.123698 |
| 6 | 1.22449 | 1.499375 | 0.339426 | 4.498126 |
| 7 | 1.428571 | 2.040816 | 0.141746 | 6.122449 |
| 8 | 1.632653 | 2.665556 | -0.061817 | 7.996668 |
| 9 | 1.836735 | 3.373594 | -0.262815 | 10.120783 |


We can save this to file using methods built into the `DataFrame` class. There are several options here, but let's just show two: `to_csv`, which writes a human-readable file, and `to_hdf`, which writes a binary file in a standard format (HDF5) that many tools other than pandas can read, making it more portable than numpy's binary files. Note that `to_hdf` requires the optional PyTables package.

In [15]:
pd_dataframe.to_csv('data_array_pandas.csv')

In [16]:
# to_hdf can be used to store *multiple* DataFrames in a single file, each under its own key!
pd_dataframe.to_hdf('data_array_pandas.hdf', key='mydata')
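
To make the remark about multiple DataFrames concrete, the following sketch writes two small tables into one HDF file under different keys and lists what the file contains. The keys `squares` and `cosines`, the file name `two_frames.hdf` and the temporary directory are hypothetical, and the example only runs the HDF part if PyTables is installed.

```python
import importlib.util
import os
import tempfile

import numpy as np
import pandas as pd

aa = np.linspace(0., 10., 50)
df_squares = pd.DataFrame({'aa': aa, 'x1': aa**2})
df_cosines = pd.DataFrame({'aa': aa, 'y1': np.cos(aa)})

# Illustrative output location (a temporary directory)
out_path = os.path.join(tempfile.mkdtemp(), 'two_frames.hdf')
stored_keys = None

# to_hdf relies on the optional PyTables package (imported as "tables")
if importlib.util.find_spec('tables') is not None:
    df_squares.to_hdf(out_path, key='squares', mode='w')  # create/overwrite the file
    df_cosines.to_hdf(out_path, key='cosines')            # default mode 'a' adds to it

    # HDFStore lets us inspect which keys the file holds
    with pd.HDFStore(out_path, mode='r') as store:
        stored_keys = store.keys()

    cosines_back = pd.read_hdf(out_path, key='cosines')
    print(stored_keys)
    print(np.array_equal(cosines_back['y1'].to_numpy(), df_cosines['y1'].to_numpy()))
else:
    print('PyTables is not installed, so to_hdf is unavailable')
```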

To read this back in we can use pandas' `read_csv` and `read_hdf` functions:

In [17]:
# NOTE: to_csv writes the index as the first column of the file, so we pass index_col=0 to read it back
# as the index rather than as a data column. See what happens when this argument is removed.
data_from_csv = pd.read_csv('data_array_pandas.csv', index_col=0)
data_from_hdf = pd.read_hdf('data_array_pandas.hdf', key='mydata')
data_from_csv

|   | aa | x1 | y1 | z1 |
|---|----|----|----|----|
| 0 | 0.0 | 0.0 | 1.0 | 0.0 |
| 1 | 0.204082 | 0.041649 | 0.979248 | 0.124948 |
| 2 | 0.408163 | 0.166597 | 0.917851 | 0.499792 |
| 3 | 0.612245 | 0.374844 | 0.81836 | 1.124531 |
| 4 | 0.816327 | 0.666389 | 0.684902 | 1.999167 |
| 5 | 1.020408 | 1.041233 | 0.523018 | 3.123698 |
| 6 | 1.22449 | 1.499375 | 0.339426 | 4.498126 |
| 7 | 1.428571 | 2.040816 | 0.141746 | 6.122449 |
| 8 | 1.632653 | 2.665556 | -0.061817 | 7.996668 |
| 9 | 1.836735 | 3.373594 | -0.262815 | 10.120783 |
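
The effect of the index column can be seen directly by reading a CSV file back without `index_col=0`, and by writing one with `index=False` so that no index is stored at all. This is a minimal sketch; the file names and the temporary directory are illustrative assumptions.

```python
import os
import tempfile

import numpy as np
import pandas as pd

aa = np.linspace(0., 10., 50)
df = pd.DataFrame({'aa': aa, 'x1': aa**2})
tmp_dir = tempfile.mkdtemp()  # illustrative output location

# Default behaviour: the index is written as an unnamed first column
with_index = os.path.join(tmp_dir, 'with_index.csv')
df.to_csv(with_index)

# Without index_col=0 the saved index comes back as an extra data column
raw = pd.read_csv(with_index)
print(list(raw.columns))

# index=False leaves the index out of the file entirely
no_index = os.path.join(tmp_dir, 'no_index.csv')
df.to_csv(no_index, index=False)
clean = pd.read_csv(no_index)
print(list(clean.columns))
```

Which approach to prefer depends on whether the index carries information; for a plain 0, 1, 2, ... index, `index=False` keeps the file simpler.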


## EXERCISE

Let's repeat the exercise above, but now use pandas to read/write the arrays. Write a function to do the following:
 * Create an array containing 1000 numbers uniformly distributed between 0 and $\pi$. Both 0 and $\pi$ should be in the array as the 1st and 1000th entries.
 * Create a second array storing $\sin(x)$
 * Create a third array storing $\cos^2(x) +1$
 * Create a fourth array storing $\mathrm{cosech}(x)$
 
Then write these 4 arrays to a file, read them back in, and check that you can recover the original arrays. Use the different pandas options illustrated above.

In [18]:
arr1, arr2, arr3, arr4 = trig_save(0, np.pi, 1000)

data_dict = {
    'arr1': arr1,
    'arr2': arr2,
    'arr3': arr3,
    'arr4': arr4,
}

pd_dataframe = pd.DataFrame(data_dict)
pd_dataframe.to_hdf('data_array_pandas_exercise.hdf', key='data')

data_from_hdf = pd.read_hdf('data_array_pandas_exercise.hdf', key='data')

print((data_from_hdf['arr1'] == arr1).all())
print((data_from_hdf['arr2'] == arr2).all())
print((data_from_hdf['arr3'] == arr3).all())
print((data_from_hdf['arr4'] == arr4).all())

True
True
True
True
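
The CSV route can be checked in a similar way. Because a text file may not reproduce every last digit, this sketch compares the tables with `pd.testing.assert_frame_equal`, which by default allows a small tolerance on floating-point columns. It uses a reduced set of columns, and the column names, the file name `trig_table.csv` and the temporary directory are hypothetical.

```python
import os
import tempfile

import numpy as np
import pandas as pd

x = np.linspace(0, np.pi, 1000)
df = pd.DataFrame({'x': x, 'sin': np.sin(x), 'cos2_plus_1': np.cos(x)**2 + 1})

# Illustrative output location (a temporary directory)
out_path = os.path.join(tempfile.mkdtemp(), 'trig_table.csv')

# index=False means there is no index column to deal with when reading back
df.to_csv(out_path, index=False)
df_back = pd.read_csv(out_path)

# Raises an AssertionError if the tables differ by more than a small tolerance
pd.testing.assert_frame_equal(df, df_back)
print('CSV round trip OK')
```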
