## Example methods for reading and writing data files

If the data is a simple, purely numerical CSV file, you can use np.loadtxt to load it into a NumPy array.

In [2]:
import numpy as np
import scipy.io as io
data_filename = "01HIVseries/HIVseries.csv"

data_set = np.loadtxt(data_filename, delimiter=',')
print(data_set)

[[0.0000e+00 1.0610e+05]
 [8.3100e-02 9.3240e+04]
 [1.4650e-01 1.6672e+05]
 [2.5870e-01 1.5378e+05]
 [4.8280e-01 1.1880e+05]
 [7.4480e-01 1.1690e+05]
 [9.8170e-01 1.0957e+05]
 [1.2563e+00 1.1135e+05]
 [1.4926e+00 7.4388e+04]
 [1.7299e+00 8.3291e+04]
 [1.9915e+00 6.6435e+04]
 [3.0011e+00 3.5408e+04]
 [4.0109e+00 2.1125e+04]
 [5.0090e+00 2.0450e+04]
 [5.9943e+00 1.5798e+04]
 [7.0028e+00 4.7852e+03]]
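As a small variation on the call above, here is a sketch of two np.loadtxt options that often come in handy: skiprows for skipping a header line and unpack=True for getting one array per column. The inline text stands in for a file, its header line "first,second" is hypothetical, and the numbers are the first three rows of the printed output (rounded as displayed).

```python
from io import StringIO

import numpy as np

# Inline stand-in for a small CSV file. The header row is hypothetical;
# the values are the first three rows printed above, as displayed.
csv_text = """first,second
0.0,106100.0
0.0831,93240.0
0.1465,166720.0
"""

# skiprows=1 skips the header line; unpack=True returns one 1-D array per column.
first_col, second_col = np.loadtxt(
    StringIO(csv_text), delimiter=",", skiprows=1, unpack=True
)

print(first_col)
print(second_col)
```

Without unpack=True you get a single 2-D array, as in the cell above, and can slice columns with data_set[:, 0] and data_set[:, 1].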


MATLAB .mat files can be read similarly, using scipy.io. 
### **An important note about this:** 
Starting with MAT-file version 7.3, MATLAB stores data in an HDF5-based format. You can read these files with the h5py package (install it with conda by running "conda install h5py" in a terminal). scipy.io will not read v7.3 files, so you can only use it with MAT-file versions up to 7.2.

If you are saving a mat file in MATLAB and want to open it with scipy.io, pass the '-v7' flag to save: save('data.mat', '-v7').
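If you do not control how a mat file was saved, one option is to try scipy.io first and fall back to h5py. The helper below is only a sketch: load_mat_any is a hypothetical name, it assumes scipy.io.loadmat raises NotImplementedError for v7.3 files, and the h5py branch only collects top-level numeric datasets (cell arrays and structs need more work). Arrays read through h5py typically come back transposed relative to MATLAB.

```python
import numpy as np
import scipy.io


def load_mat_any(path):
    """Hypothetical helper: return a dict mapping variable names to arrays."""
    try:
        # Works for MAT-file versions up to 7.2.
        return scipy.io.loadmat(path)
    except NotImplementedError:
        # Assumed fallback for HDF5-based (v7.3) files; requires h5py.
        import h5py

        with h5py.File(path, "r") as f:
            return {
                name: np.asarray(value[()])
                for name, value in f.items()
                if isinstance(value, h5py.Dataset)
            }
```

For files that scipy.io can read, the result is the same dictionary that io.loadmat returns, including the '__header__' style metadata keys.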

In [3]:
mat_data = io.loadmat(data_filename[:-4]+'.mat')
# In this case, you get out a dictionary which tells you the variable names and the values assigned to them.
print(mat_data)

{'__header__': b'MATLAB 5.0 MAT-file, Platform: MACI, Created on: Thu Sep  4 20:40:56 2008', '__version__': '1.0', '__globals__': [], 'a': array([[0.00000000e+00, 1.06096242e+05],
       [8.31000000e-02, 9.32395138e+04],
       [1.46500000e-01, 1.66724721e+05],
       [2.58700000e-01, 1.53780051e+05],
       [4.82800000e-01, 1.18795503e+05],
       [7.44800000e-01, 1.16896094e+05],
       [9.81700000e-01, 1.09572104e+05],
       [1.25630000e+00, 1.11352507e+05],
       [1.49260000e+00, 7.43875063e+04],
       [1.72990000e+00, 8.32913689e+04],
       [1.99150000e+00, 6.64354682e+04],
       [3.00110000e+00, 3.54078861e+04],
       [4.01090000e+00, 2.11251597e+04],
       [5.00900000e+00, 2.04503149e+04],
       [5.99430000e+00, 1.57979233e+04],
       [7.00280000e+00, 4.78519896e+03]])}


In [4]:
# The HIV data was called "a"
print(mat_data['a'])

[[0.00000000e+00 1.06096242e+05]
 [8.31000000e-02 9.32395138e+04]
 [1.46500000e-01 1.66724721e+05]
 [2.58700000e-01 1.53780051e+05]
 [4.82800000e-01 1.18795503e+05]
 [7.44800000e-01 1.16896094e+05]
 [9.81700000e-01 1.09572104e+05]
 [1.25630000e+00 1.11352507e+05]
 [1.49260000e+00 7.43875063e+04]
 [1.72990000e+00 8.32913689e+04]
 [1.99150000e+00 6.64354682e+04]
 [3.00110000e+00 3.54078861e+04]
 [4.01090000e+00 2.11251597e+04]
 [5.00900000e+00 2.04503149e+04]
 [5.99430000e+00 1.57979233e+04]
 [7.00280000e+00 4.78519896e+03]]


scipy.io can save mat files too, which you can then open in MATLAB. Just use scipy.io.savemat.
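A minimal sketch of that round trip: savemat takes a file name and a dictionary mapping MATLAB variable names to arrays. The variable names "a" and "x" and the temporary file path are illustrative choices, not part of the HIV data set.

```python
import os
import tempfile

import numpy as np
import scipy.io

values = np.arange(6.0).reshape(3, 2)
x = np.linspace(0, 1, 5)

# Write to a temporary directory so the example leaves no files behind.
mat_path = os.path.join(tempfile.mkdtemp(), "example.mat")
scipy.io.savemat(mat_path, {"a": values, "x": x})

loaded = scipy.io.loadmat(mat_path)
print(loaded["a"])        # 2-D arrays come back with the same shape
print(loaded["x"].shape)  # 1-D arrays come back 2-D, since MATLAB has no 1-D arrays
```

If you need the original 1-D shape back in Python, call .ravel() on the loaded array.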

NumPy has its own native file format, .npy, for quickly saving and loading NumPy arrays. It is a binary format, which means it is not human readable, and you should not expect to open it with tools other than NumPy. If other people need to read your data, or you might need to open it outside of Python, use a different format.

In [5]:
x = np.linspace(0,1,1001)
y = 3*np.sin(x)**3 - np.sin(x)

np.save('x_values', x)
np.save('y_values', y)
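When the data does need to be readable outside NumPy, a plain-text format is one alternative. The sketch below uses np.savetxt to write two columns to a CSV file and np.loadtxt to read them back; the file name and the short five-point arrays are illustrative.

```python
import os
import tempfile

import numpy as np

x = np.linspace(0, 1, 5)
y = 3 * np.sin(x) ** 3 - np.sin(x)

csv_path = os.path.join(tempfile.mkdtemp(), "xy_values.csv")

# Stack the arrays as columns. The header is written as a "# x,y" comment line,
# which np.loadtxt skips by default when reading the file back.
np.savetxt(csv_path, np.column_stack([x, y]), delimiter=",", header="x,y")

reloaded = np.loadtxt(csv_path, delimiter=",")
print(reloaded)
```

Text files are larger and slower to read than .npy files, so this trade-off mostly makes sense for small data sets or for sharing.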

You can save several variables zipped together into an npz file using np.savez.

In [6]:
np.savez('xy_values', x_vals=x, y_vals=y)

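You can load npy files using np.load, much as we loaded the csv file earlier: you get the array back directly. Note that np.save and np.savez append the .npy and .npz extensions to the file name if they are missing, so include the extension when loading. npz files work a bit differently: np.load returns an NpzFile object.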

The names of the variables are set by the keyword argument names we used when we called np.savez. You can get a list of all of the variable names by looking at xydata.files. Then you can access the data by using the NpzFile object like a dictionary.

In [7]:
xydata = np.load('xy_values.npz')
print(xydata.files)
x_loaded = xydata['x_vals']
y_loaded = xydata['y_vals']
print(x_loaded)
print(y_loaded)

['x_vals', 'y_vals']
[0.    0.001 0.002 ... 0.998 0.999 1.   ]
[ 0.         -0.001      -0.00199997 ...  0.94019283  0.94309582
  0.94599872]
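For completeness, here is a sketch of loading a single .npy file, which returns the array itself instead of an NpzFile. It writes to a temporary directory (an illustrative choice) so it does not depend on the files saved earlier, and it also shows that an NpzFile can be used in a with block so the underlying file is closed afterwards.

```python
import os
import tempfile

import numpy as np

x = np.linspace(0, 1, 1001)
tmp_dir = tempfile.mkdtemp()

# np.save appends ".npy", so the file on disk is x_values.npy.
np.save(os.path.join(tmp_dir, "x_values"), x)
x_from_npy = np.load(os.path.join(tmp_dir, "x_values.npy"))

# np.savez appends ".npz"; the with block closes the archive when done.
np.savez(os.path.join(tmp_dir, "xy_values"), x_vals=x)
with np.load(os.path.join(tmp_dir, "xy_values.npz")) as archive:
    x_from_npz = archive["x_vals"]

print(np.array_equal(x_from_npy, x_from_npz))
```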


### If you have more complex data, e.g. with non-numerical types or header information, you really need to switch over to pandas, a fast and powerful data analysis library for Python.
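As a minimal sketch of what pandas handles that np.loadtxt does not, the example below reads a small inline CSV with a header row, quoted strings containing commas, and numeric columns. The three rows are taken from the elections table in this section, but the column names are simplified (an assumption: the real file's names contain line breaks), and index_col=0 is used so the leading index column does not show up as "Unnamed: 0".

```python
from io import StringIO

import pandas as pd

# Inline stand-in for a CSV file with mixed text and numeric columns.
csv_text = """,CD,Incumbent,Party,Clinton 2016,Trump 2016
0,AK-AL,"Young, Don",(R),37.6,52.8
1,AL-01,"Byrne, Bradley",(R),34.1,63.5
2,AL-02,"Roby, Martha",(R),33.0,64.9
"""

# index_col=0 treats the first (unnamed) column as the row index.
sample = pd.read_csv(StringIO(csv_text), index_col=0)

print(sample.dtypes)                      # text and float columns are inferred
print(sample[["CD", "Trump 2016"]])       # select columns by name

# Look up the district with the largest value in one numeric column.
top_district = sample.loc[sample["Trump 2016"].idxmax(), "CD"]
print(top_district)
```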

In [1]:
import pandas as pd
elections = pd.read_csv('Daily_Kos_Elections_08_12_16_congress_districts.csv')
elections

Unnamed: 0,CD,Incumbent,Party,Clinton\n2016,Trump\n2016,Obama\n2012,Romney\n2012,Obama\n2008,McCain\n2008
0,AK-AL,"Young, Don",(R),37.6,52.8,41.2,55.3,38.1,59.7
1,AL-01,"Byrne, Bradley",(R),34.1,63.5,37.4,61.8,38.5,60.9
2,AL-02,"Roby, Martha",(R),33.0,64.9,36.4,62.9,35.0,64.5
3,AL-03,"Rogers, Mike",(R),32.3,65.3,36.8,62.3,36.6,62.6
4,AL-04,"Aderholt, Rob",(R),17.4,80.4,24.0,74.8,25.5,73.3
...,...,...,...,...,...,...,...,...,...
430,WI-08,"Gallagher, Mike",(R),38.6,56.2,47.6,51.3,53.7,45.0
431,WV-01,"McKinley, David",(R),26.4,68.0,35.5,62.2,41.5,56.7
432,WV-02,"Mooney, Alex",(R),29.4,65.8,38.0,60.0,43.9,54.7
433,WV-03,"Jenkins, Evan",(R),23.3,72.5,32.8,65.0,42.3,55.7
