## File formats

### NumPy and MATLAB binary formats: **.npy, .npz or .mat**

In [1]:
import numpy as np

In [2]:
data = {}
data['text'] = 'Something'
data['array'] = np.zeros((10,10))

# Pass the dict entries as keyword arguments; without them an empty archive is written
np.savez('data.npz', **data)
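To check what actually ends up in an `.npz` archive, here is a minimal round-trip sketch. It assumes the dictionary entries are passed to `np.savez` as keyword arguments. The temporary directory is only an illustration choice so the example cleans up after itself.

```python
import os
import tempfile

import numpy as np

data = {'text': 'Something', 'array': np.zeros((10, 10))}

# Assumption for illustration: write into a throwaway temporary directory
with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, 'data.npz')

    # Each keyword argument becomes one named array inside the archive
    np.savez(path, **data)

    # np.load returns a lazy, dict-like NpzFile; close it when done
    with np.load(path) as archive:
        keys = sorted(archive.files)
        # The string was stored as a 0-d array, so unwrap it with .item()
        text = archive['text'].item()
        array = archive['array']

print(keys, text, array.shape)
```

Note that the string comes back as a zero-dimensional NumPy array rather than a plain `str`, which is why the sketch unwraps it.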

### Hierarchical Data Format (**HDF5**) - https://www.hdfgroup.org/solutions/hdf5/

Because it uses B-trees to index table objects, HDF5 works well for time series data such as stock price series, network monitoring data, and 3D meteorological data. The bulk of the data goes into straightforward arrays (the table objects) that can be accessed much more quickly than the rows of an SQL database, but B-tree access is available for non-array data. The HDF5 data storage mechanism can be simpler and faster than an SQL star schema.


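* Simulations, numerical calculations and measurements produce a lot of data
* Scanning through all of it linearly takes time
* Disks are read in blocks and sectors, so the on-disk layout matters
* Large data has to be stored efficiently
* B-tree search finds the relevant block without scanning the whole file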

Good reads: 

* NumPy npz versus HDF5: https://stackoverflow.com/questions/27710245/is-there-an-analysis-speed-or-memory-usage-advantage-to-using-hdf5-for-large-arr
* Why not to use HDF5: https://cyrille.rossant.net/moving-away-hdf5/

About B-trees:
https://www.youtube.com/watch?v=aZjYr87r1b8

* Disadvantages:
  * The file is binary, so text tools such as grep or awk can't be used on it
  * Deleting slices does not make the file smaller
  * Corrupt data can corrupt the whole file

* Example at scale: a [trillion particle plasma physics simulation](https://www.hdfgroup.org/trillion-particle/) needed to store HDF5 files of 40 terabytes or more and to sustain write rates exceeding 50 gigabytes per second.

In [3]:
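import numpy as np
import h5py

# Example data: a 100 x 100 x 100 array of random floats
data = np.random.random((100, 100, 100))

with h5py.File('test.hdf', 'w') as outfile:
    # chunks=True lets h5py pick a chunk shape automatically
    dset = outfile.create_dataset('a_descriptive_name', data=data, chunks=True)
    # Attributes hold small pieces of metadata attached to the dataset
    dset.attrs['some key'] = 'Did you want some metadata?'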
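Reading the file back is where the chunked layout matters: slicing an h5py dataset reads only the requested part from disk. The following is a minimal sketch, assuming the same dataset name and attribute key as in the cell above. The temporary directory and the particular slice are illustration choices only.

```python
import os
import tempfile

import h5py
import numpy as np

data = np.random.random((100, 100, 100))

# Assumption for illustration: write into a throwaway temporary directory
with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, 'test.hdf')

    with h5py.File(path, 'w') as outfile:
        dset = outfile.create_dataset('a_descriptive_name', data=data, chunks=True)
        dset.attrs['some key'] = 'Did you want some metadata?'

    with h5py.File(path, 'r') as infile:
        dset = infile['a_descriptive_name']   # a handle; no array data is read yet
        shape = dset.shape
        chunk_shape = dset.chunks             # chunk shape chosen by h5py
        metadata = dset.attrs['some key']
        block = dset[10:20, :, 0]             # only this slice is read into memory

print(shape, chunk_shape, metadata, block.shape)
```

The slice comes back as an ordinary NumPy array, so it stays usable after the file is closed.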

### The Network Common Data Form (**NetCDF**) - https://www.unidata.ucar.edu/software/netcdf/docs/netcdf_introduction.html