## File formats

### NumPy and MATLAB binary formats: **.npy, .npz or .mat**

In [28]:
import numpy as np

In [29]:
data = {}
data['text'] = 'Something'
data['array'] = np.zeros((10,10))

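# Pass the dict entries as keyword arguments; each key becomes a named array in the archive.
np.savez('data.npz', **data)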
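
A minimal sketch of the full round trip, assuming the dictionary entries are passed to `np.savez` as keyword arguments so that each key becomes a named array in the archive. The temporary directory and the variable names are illustrative only.

```python
import os
import tempfile

import numpy as np

data = {'text': 'Something', 'array': np.zeros((10, 10))}

# Write into a throwaway directory so the example leaves no files behind.
with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, 'data.npz')
    # Each keyword argument becomes one named array inside the archive.
    np.savez(path, **data)

    # np.load returns a dict-like NpzFile; the context manager closes it.
    with np.load(path) as archive:
        names = sorted(archive.files)
        text = archive['text'].item()   # 0-d string array -> Python str
        array = archive['array']        # read into memory on access

print(names)
print(text, array.shape)
```

Note that the string comes back as a zero-dimensional NumPy array, which is why `.item()` is used to recover a plain Python `str`.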

### Hierarchical Data Format (**HDF5**) - https://www.hdfgroup.org/solutions/hdf5/

Because it uses B-trees to index table objects, HDF5 works well for time series data such as stock price series, network monitoring data, or 3D meteorological data. The bulk of the data goes into straightforward arrays (the table objects) that can be accessed much more quickly than the rows of an SQL database, but B-tree access is available for non-array data too. The HDF5 data storage mechanism can be simpler and faster than an SQL star schema.


Documentation: https://portal.hdfgroup.org/display/HDF5/Introduction+to+HDF5

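* Simulations, numerical calculations and measurements produce a lot of data
* Scanning through all of it sequentially takes time
* On a hard disk, data is laid out in blocks and sectors, so the access pattern matters
* HDF5 is designed to store large data efficiently
* A B-tree search finds the requested part without reading the whole file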


About B-trees:
https://www.youtube.com/watch?v=aZjYr87r1b8

* Disadvantages:
  * It is a binary format, so text tools such as grep or awk cannot be used on it
  * Deleting slices or datasets does not make the file smaller
  * Corruption in one part can make the whole file unreadable

* Example at scale: a [trillion particle plasma physics simulation](https://www.hdfgroup.org/trillion-particle/) needed to store HDF5 files of 40 terabytes or more and had to sustain write rates exceeding 50 gigabytes per second.
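
The disadvantage listed above about deleted data can be checked with a small experiment. This is only a sketch: the file name `shrink.h5` and the dataset names are made up for illustration, and the exact byte counts depend on the HDF5 version.

```python
import os
import tempfile

import h5py
import numpy as np

with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, 'shrink.h5')  # hypothetical file name

    with h5py.File(path, 'w') as f:
        f.create_dataset('big', data=np.zeros((256, 256)))
        f.create_dataset('small', data=np.arange(10))
    size_before = os.path.getsize(path)

    # Unlink the large dataset; the freed space is generally not
    # returned to the file system.
    with h5py.File(path, 'r+') as f:
        del f['big']
        remaining = list(f.keys())
    size_after = os.path.getsize(path)

print(remaining)
print(size_before, size_after)
```

The two sizes are typically the same or nearly so. Copying the remaining objects into a fresh file, for example with the `h5repack` command-line tool that ships with HDF5, is the usual way to reclaim the space.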

In [4]:
import numpy as np
import h5py

In [9]:
data = np.random.random((400, 400, 400))

In [10]:
data.shape

(400, 400, 400)

In [11]:
with h5py.File('test.hdf', 'w') as outfile:
    dset = outfile.create_dataset('random_dataset', data=data, chunks=True)
    dset.attrs['some key'] = 'Did you want some metadata?'
    dset.attrs['rand_type'] = 'random'
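
Reading the file back can be sketched as follows. The example uses a much smaller array than the cell above and a temporary file (both are assumptions made to keep it quick); the dataset and attribute names mirror the ones used above.

```python
import os
import tempfile

import h5py
import numpy as np

small = np.random.random((20, 20, 20))  # small stand-in for the 400^3 array

with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, 'test.hdf')

    with h5py.File(path, 'w') as outfile:
        dset = outfile.create_dataset('random_dataset', data=small, chunks=True)
        dset.attrs['rand_type'] = 'random'

    with h5py.File(path, 'r') as infile:
        dset = infile['random_dataset']
        chunk_shape = dset.chunks            # chunk layout picked by h5py
        plane = dset[0, :, :]                # only chunks overlapping this slice are read
        rand_type = dset.attrs['rand_type']  # metadata stored next to the data

print(chunk_shape, plane.shape, rand_type)
```

Indexing the dataset object returns a NumPy array, so `plane` remains usable after the file is closed, while `dset` itself does not.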

In [None]:
#
# This example creates an HDF5 file dset.h5 and an empty dataset /dset in it.
#
import h5py
#
# Create a new file using default properties.
#
file = h5py.File('dset.h5','w')
#
# Create a dataset under the Root group.
#
dataset = file.create_dataset("dset",(4, 6), h5py.h5t.STD_I32BE)
print("Dataset dataspace is", dataset.shape)
print("Dataset Numpy datatype is", dataset.dtype)
print("Dataset name is", dataset.name)
print("Dataset is a member of the group", dataset.parent)
print("Dataset was created in the file", dataset.file)
#
# Close the file before exiting
#
file.close()

### The Network Common Data Form (**netCDF**) - https://www.unidata.ucar.edu/software/netcdf/docs/netcdf_introduction.html

In [13]:
#
# This example creates an HDF5 file dset.h5 and an empty dataset /dset in it.
#
import h5py
#

In [16]:
# Create a new file using default properties.
#
file = h5py.File('dset.h5','w')
#
# Create a dataset under the Root group.
#
dataset = file.create_dataset("dset",(4, 6), h5py.h5t.STD_I32BE)
print("Dataset dataspace is", dataset.shape)
print("Dataset Numpy datatype is", dataset.dtype)
print("Dataset name is", dataset.name)
print("Dataset is a member of the group", dataset.parent)
print("Dataset was created in the file", dataset.file)
#
# Close the file before exiting
#
file.close()

Dataset dataspace is (4, 6)
Dataset Numpy datatype is >i4
Dataset name is /dset
Dataset is a member of the group <HDF5 group "/" (1 members)>
Dataset was created in the file <HDF5 file "dset.h5" (mode r+)>


In [None]:
#
# This example writes data to the existing empty dataset created by h5_crtdat.py and then reads it back.
#
import h5py
import numpy as np
#

In [26]:
# Open an existing file using default properties.
#
file = h5py.File('dset.h5','r+')
#
# Open "dset" dataset under the root group.
#
dataset = file['/dset']
print(dataset[:])
#
# Initialize data object with 0.
#
data = np.zeros((4,6))
#
# Assign new values
#
for i in range(4):
    for j in range(6):
        data[i, j] = i * 6 + j + 1
#
# Write data
#
print("Writing data...")
dataset[...] = data
#
# Read data back and print it.
#
print("Reading data back...")
data_read = dataset[...]
print("Printing data...")
print(data_read)
#
# Close the file before exiting
#
file.close()

[[ 1  2  3  4  5  6]
 [ 7  8  9 10 11 12]
 [13 14 15 16 17 18]
 [19 20 21 22 23 24]]
Writing data...
Reading data back...
Printing data...
[[ 1  2  3  4  5  6]
 [ 7  8  9 10 11 12]
 [13 14 15 16 17 18]
 [19 20 21 22 23 24]]


In [None]:
# Assumes `f` is an open, writable h5py.File that contains a 3-D dataset named 'data'.
f['data'].dims[0].label = 'z'
f['data'].dims[2].label = 'x'
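
The cell above relies on an already open file `f`. A self-contained sketch of the same idea, assuming a small 3-D dataset named `data` in a temporary file (the file name `labels.h5` and the array shape are chosen only for illustration):

```python
import os
import tempfile

import h5py
import numpy as np

with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, 'labels.h5')  # hypothetical file name

    with h5py.File(path, 'w') as f:
        f.create_dataset('data', data=np.zeros((4, 3, 2)))
        # Label the first and last axes; the middle one is left unlabeled.
        f['data'].dims[0].label = 'z'
        f['data'].dims[2].label = 'x'

    # Labels are stored in the file, so they survive reopening it.
    with h5py.File(path, 'r') as f:
        labels = [dim.label for dim in f['data'].dims]

print(labels)
```

An axis without a label reports an empty string, so the unlabeled middle axis shows up as `''`.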

In [30]:
# Absolute and relative paths are used to create groups in MyGroup.
#
import h5py

#
# Use 'w' to remove existing file and create a new one; use 'w-' if
# create operation should fail when the file already exists.
#
print("Creating HDF5 file group.h5...")
file = h5py.File('group.h5','w')
#
# Create a group with the name "MyGroup"
#
print("Creating group MyGroup in the file...")
group = file.create_group("MyGroup")
#
# Create group "Group_A" in group MyGroup
#
print("Creating group Group_A in MyGroup using absolute path...")
group_a = file.create_group("/MyGroup/Group_A")
#
# Create group "Group_B" in group MyGroup
#
print("Creating group Group_B in MyGroup using relative path...")
group_b = group.create_group("Group_B")
#
# Print the contents of MyGroup group
#
print("Printing members of MyGroup group:", group.keys())
#
# Close the file before exiting; h5py will close the groups we created.
#
#
file.close()

Creating HDF5 file group.h5...
Creating group MyGroup in the file...
Creating group Group_A in MyGroup using absolute path...
Creating group Group_B in MyGroup using relative path...
Printing members of MyGroup group: <KeysViewHDF5 ['Group_A', 'Group_B']>
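
Once a file holds nested groups, it helps to list the whole hierarchy rather than one group at a time. The sketch below rebuilds the same group layout in a temporary file and walks it with `visititems`; the callback name `record` and the extra dataset `values` are hypothetical additions for illustration.

```python
import os
import tempfile

import h5py

found = []

def record(name, obj):
    """Collect (path, kind) for every object below the root group."""
    kind = 'group' if isinstance(obj, h5py.Group) else 'dataset'
    found.append((name, kind))
    # Returning None tells h5py to keep visiting.

with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, 'group.h5')
    with h5py.File(path, 'w') as f:
        group = f.create_group('MyGroup')
        f.create_group('/MyGroup/Group_A')              # absolute path
        group.create_group('Group_B')                   # relative path
        group.create_dataset('values', data=[1, 2, 3])  # hypothetical dataset
        f.visititems(record)

for name, kind in sorted(found):
    print(kind, name)
```

The names passed to the callback are paths relative to the group on which `visititems` was called, so they carry no leading slash here.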
