# Storing Large Datasets in Python

This notebook explores three options for storing a large NumPy array and compares their performance:
 - [cPickle (dump)](https://docs.python.org/2/library/pickle.html)
 - [numpy (save)](http://docs.scipy.org/doc/numpy-1.10.0/reference/generated/numpy.save.html)
 - [h5py](http://www.h5py.org/)

We generate a dummy dataset and record:
 - the time taken to write
 - the space taken up
 - the time taken to read back
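
The three measurements above can also be collected outside IPython. The following is a minimal sketch, not the method used in this notebook: it assumes Python 3, times a single run with `time.perf_counter` instead of `%%timeit`, and uses a hypothetical `measure` helper and a plain list of floats as a stand-in payload.

```python
import os
import pickle
import tempfile
import time


def measure(write, read, path):
    """Run one write/read cycle and return (write_s, size_mb, read_s, loaded)."""
    start = time.perf_counter()
    write(path)
    write_s = time.perf_counter() - start

    size_mb = os.path.getsize(path) / 1.e6

    start = time.perf_counter()
    loaded = read(path)
    read_s = time.perf_counter() - start
    return write_s, size_mb, read_s, loaded


# Hypothetical stand-in payload: a list of floats, not the notebook's array.
payload = [i / 1000.0 for i in range(100000)]


def write_pickle(path):
    with open(path, 'wb') as f:
        pickle.dump(payload, f)


def read_pickle(path):
    with open(path, 'rb') as f:
        return pickle.load(f)


tmpdir = tempfile.mkdtemp()
path = os.path.join(tmpdir, 'payload.pkl')
write_s, size_mb, read_s, loaded = measure(write_pickle, read_pickle, path)
print('write %.4fs, size %.2fMB, read %.4fs' % (write_s, size_mb, read_s))

# clean up the temporary file and directory
os.remove(path)
os.rmdir(tmpdir)
```

A single run is noisier than the best-of-three figure reported by `%%timeit`, so this helper is better suited to a quick check than to a careful comparison.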

In [2]:
import os

import cPickle
import numpy as np
import h5py

# generate data
data = np.random.uniform(low=0,high=1,size=(10000,100,5))

### Write Times

In [3]:
%%timeit -n 5
# open in binary mode; no protocol is passed, so cPickle uses its default
with open('data.pkl','wb') as f:
    cPickle.dump(data,f)

5 loops, best of 3: 4.68 s per loop
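
The cell above calls `dump` without a protocol argument, so the default protocol is used. A binary protocol can be requested explicitly. The sketch below is not part of the benchmark: it assumes Python 3's `pickle` module in place of `cPickle`, and uses a smaller array (`sample`, a name introduced here) so it runs quickly. Its numbers are not comparable to the timings in this notebook.

```python
import os
import pickle
import tempfile

import numpy as np

# Smaller array than the notebook's `data`, so the sketch runs quickly.
sample = np.random.uniform(low=0, high=1, size=(1000, 100))

tmpdir = tempfile.mkdtemp()
path = os.path.join(tmpdir, 'sample_binary.pkl')

# Binary file mode plus an explicitly requested binary protocol.
with open(path, 'wb') as f:
    pickle.dump(sample, f, protocol=pickle.HIGHEST_PROTOCOL)

with open(path, 'rb') as f:
    restored = pickle.load(f)

size_mb = os.path.getsize(path) / 1.e6
raw_mb = sample.nbytes / 1.e6
print('binary pickle: %.2fMB on disk for %.2fMB of array data' % (size_mb, raw_mb))

# clean up the temporary file and directory
os.remove(path)
os.rmdir(tmpdir)
```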


In [4]:
%%timeit -n 5
np.save('data.npy',data)

5 loops, best of 3: 401 ms per loop


In [6]:
%%timeit -n 5
# the context manager closes the file even if the write fails
with h5py.File('data.hdf5','w') as f:
    f.create_dataset('data',data=data)

5 loops, best of 3: 347 ms per loop


### File Sizes

In [7]:
s1 = os.path.getsize('data.pkl')/1.e6
s2 = os.path.getsize('data.npy')/1.e6
s3 = os.path.getsize('data.hdf5')/1.e6
print('Pickling used %.2fMB\nNumpy\'s save used %.2fMB\nHDF5 used %.2fMB' % (s1,s2,s3))

Pickling used 111.08MB
Numpy's save used 40.00MB
HDF5 used 40.00MB
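
The 40.00MB figure matches the raw size of the array. A quick check of the arithmetic, assuming the array holds 8-byte floats (`float64`, which is what `np.random.uniform` returns):

```python
import struct

n_elements = 10000 * 100 * 5              # shape used when generating `data`
bytes_per_element = struct.calcsize('d')  # size of a C double (float64)
raw_bytes = n_elements * bytes_per_element
raw_mb = raw_bytes / 1.e6                 # same MB convention as the cell above

print('%d elements x %d bytes = %.2fMB' % (n_elements, bytes_per_element, raw_mb))
# prints: 5000000 elements x 8 bytes = 40.00MB
```

Both `np.save` and HDF5 therefore add very little on top of the raw data, while the pickle file is almost three times larger.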


### Read Times

In [8]:
%%timeit -n 5
# open in binary mode to read the pickle back
with open('data.pkl','rb') as f:
    data = cPickle.load(f)

5 loops, best of 3: 10.7 s per loop


In [9]:
%%timeit -n 5
data = np.load('data.npy')

The slowest run took 5.58 times longer than the fastest. This could mean that an intermediate result is being cached 
5 loops, best of 3: 17.2 ms per loop


In [10]:
%%timeit -n 5
with h5py.File('data.hdf5','r') as f:
    # [()] reads the whole dataset into memory; it replaces the deprecated .value
    data = f['data'][()]

5 loops, best of 3: 16.4 ms per loop


In [13]:
# remove saved data files
os.remove('data.pkl') 
os.remove('data.npy')
os.remove('data.hdf5')

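## Conclusion

NumPy's `save` and HDF5 were the best storage options for both file size and read time: each produced a 40.00MB file (versus 111.08MB for pickle) and read it back in about 17 ms (versus 10.7 s for pickle). HDF5 was slightly faster than `np.save` when writing to disk (347 ms versus 401 ms). Pickle was run with its default protocol only, and `timeit` flagged possible caching during the `np.load` run, so reads of files that are not already cached may be slower.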