# Issues of Scale 

## HDF5 - The Why and How

Since we're on the topic of limits, let's look at a case study that often comes in handy in data analysis, machine learning, and related areas.

The issue we consider here is that of handling very large datasets that do not fit in memory. 

For instance, you may have an 85 GB dataset but only 32 GB of main memory in your system. You can let your OS handle this by swapping to disk, but that is usually far too inefficient because the OS has no "understanding" of how the data is laid out or which parts you need next. Instead, we use a library that loads only the parts of the data that are actually requested. Here we look at HDF5, a file format designed for this kind of access. A common Python interface to HDF5 is *h5py*; pandas, which we use in the cells that follow, reads and writes HDF5 through the PyTables package.
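To make "loading in an intelligent manner" concrete, here is a minimal sketch of working with *h5py* directly, assuming h5py and NumPy are installed. The file name, the dataset name `values`, and the sizes are made up for illustration. The dataset is filled block by block, and slicing it afterwards reads only the requested rows from disk.

```python
import os
import tempfile

import h5py
import numpy as np

# Hypothetical scratch file in a temporary directory (illustration only)
path = os.path.join(tempfile.mkdtemp(), "h5py_demo.h5")

# Create an on-disk dataset and fill it 100 rows at a time,
# so only one block is held in memory at any moment
with h5py.File(path, "w") as f:
    dset = f.create_dataset("values", shape=(1000, 4),
                            dtype="float64", chunks=(100, 4))
    for start in range(0, 1000, 100):
        block = np.full((100, 4), start, dtype="float64")
        dset[start:start + 100] = block

# Reopen read-only; slicing the dataset reads just those rows
with h5py.File(path, "r") as f:
    dset = f["values"]
    shape = dset.shape        # metadata only, no data is loaded
    window = dset[250:260]    # a NumPy array holding 10 rows

print(shape, window.shape, window[0, 0])
```

The same pattern scales to files much larger than RAM, as long as each slice you request fits in memory.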

Let's simulate the creation of a large dataset by writing a small table to disk and then appending more rows to it.

In [None]:
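# pandas' HDF5 functions (to_hdf, read_hdf) rely on PyTables, which is installed as "tables"
!pip install h5py tables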

In [None]:
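import os
import numpy as np
import pandas as pd

# This will be our HDF5 file; create its directory if it does not exist yet
filename = 'data/test.h5'
os.makedirs(os.path.dirname(filename), exist_ok=True)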

In [None]:
df = pd.DataFrame(np.arange(10).reshape((5, 2)),
                  columns=['A', 'B'])
print(df)

Output: 

<pre>
   A  B
0  0  1
1  2  3
2  4  5
3  6  7
4  8  9
</pre>

In [None]:
# Save to HDF5 in 'table' format, which (unlike the default 'fixed' format) supports appending
df.to_hdf(filename, key='data', mode='w', format='table')

del df    # allow df to be garbage collected

In [None]:

# Append more data
df2 = pd.DataFrame(np.arange(10).reshape((5,2)) * 10, 
                   columns=['A', 'B'])


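# This is the important bit! With `append=True` the new rows are added to the existing table.
# Remove `append` to see the difference: the table stored under 'data' is replaced instead.
df2.to_hdf(filename, key='data', append=True)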

del df2 

In [None]:
# Read the whole table back (fine here, since it is tiny)
print(pd.read_hdf(filename, key='data'))

Output: 

<pre> 
    A   B
0   0   1
1   2   3
2   4   5
3   6   7
4   8   9
0   0  10
1  20  30
2  40  50
3  60  70
4  80  90
</pre> 
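Calling `pd.read_hdf` without further arguments, as above, loads the entire table into memory, which defeats the purpose for a dataset that is truly too large. A minimal sketch of two ways around this, assuming the data was written in `table` format: iterate over the file with `chunksize`, or select only matching rows with `where`. The scratch file, the column values, and the chunk size are made up for illustration, and `data_columns=["A"]` is what makes column `A` queryable on disk.

```python
import os
import tempfile

import numpy as np
import pandas as pd

# Hypothetical scratch file, so this sketch does not touch data/test.h5
path = os.path.join(tempfile.mkdtemp(), "chunked_demo.h5")

# Write 1,000 rows in table format; data_columns makes "A" usable in queries
big = pd.DataFrame({"A": np.arange(1000), "B": np.arange(1000) * 2})
big.to_hdf(path, key="data", mode="w", format="table", data_columns=["A"])
del big

# Iterate over the table 250 rows at a time, keeping only a running total
total_b = 0
n_chunks = 0
for chunk in pd.read_hdf(path, key="data", chunksize=250):
    total_b += int(chunk["B"].sum())
    n_chunks += 1

# Load only the rows that satisfy a condition; the filter runs against the file
subset = pd.read_hdf(path, key="data", where="A >= 990")

print(n_chunks, total_b, len(subset))
```

With either approach, peak memory is bounded by the size of one chunk or of the selected subset rather than by the size of the file.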

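And that's all there is to it! Because rows can be appended batch by batch, the full dataset never has to sit in memory while the file is being built. Libraries such as TensorFlow and Keras have HDF5 support as well (Keras, for instance, can save models to HDF5 files), so the format will stay useful later on.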