# HDF5

![hdf_group](img/HDF_logo.png "HDF group")

## What is HDF5?

[HDF5](https://en.wikipedia.org/wiki/Hierarchical_Data_Format) (for Hierarchical Data Format) is a file format designed to structure and store large volumes of complex data.

## Why HDF5?

* Hierarchical collection of data (like directories and files, addressed with UNIX-like paths)
* High performance (binary format)
* Portable file format (standard exchange format for heterogeneous data)
* Self-describing, extensible types and rich metadata
* Supports data compression
* Free and open source
* Adopted by a large number of institutions (NASA, LIGO, ...)
* Adopted by most synchrotrons (ESRF, SOLEIL, DESY, ...)
* Ensures [forward and backward compatibility](https://support.hdfgroup.org/HDF5/doc/ADGuide/CompatFormat180.html)

Data can be almost anything: images, tables, graphs, documents.

## HDF5 description

The container is mostly structured with:

* **File**: the root of the container
* **Group**: a grouping structure containing groups or datasets
* **Dataset**: a multidimensional array of data elements
* And other features (links, attributes, datatypes)
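
The building blocks listed above can be sketched with h5py, the Python binding presented later in this document. This is a minimal illustration, not a reference layout: the file name, the group and dataset names, and the `units` attribute are all hypothetical.

```python
import os
import tempfile

import h5py
import numpy

# Hypothetical file name, written to a temporary directory
path = os.path.join(tempfile.mkdtemp(), "model_demo.h5")

with h5py.File(path, "w") as h5file:             # File: the root of the container
    group = h5file.create_group("measurement")   # Group: contains groups or datasets
    dataset = group.create_dataset("image", data=numpy.zeros((4, 5)))  # Dataset
    dataset.attrs["units"] = "counts"            # Attribute: metadata attached to an object
    h5file["latest"] = h5py.SoftLink("/measurement/image")  # Link to an existing object

with h5py.File(path, "r") as h5file:
    image = h5file["/measurement/image"]         # objects are reached with UNIX-like paths
    shape = image.shape
    units = image.attrs["units"]
    linked_shape = h5file["latest"].shape        # the link resolves to the same dataset
    print(shape, units, linked_shape)
```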

![hdf5_class_diag](img/hdf5_model.png "hdf5 class diagram")


## HDF5 example

Here is an example of a file generated by pyFAI:

![hdf5_example](img/hdf5_example.png "hdf5 example")

## Useful tools for HDF5

* h5ls, h5dump, hdfview

```bash
$ h5ls -r my_first_one.h5
/                        Group
/data1                   Dataset {100, 100}
/group1                  Group
/group1/data2            Dataset {100, 100}
```

* silx view

The HDF Group provides a web page listing more tools: https://support.hdfgroup.org/HDF5/doc/RM/Tools.html

## h5py

![h5py book](img/h5py.gif "h5py book")

[h5py](https://www.h5py.org/) is the Python binding for accessing HDF5. It is developed and maintained by a community of enthusiasts and was originally written by [Andrew Collette](http://shop.oreilly.com/product/0636920030249.do).

Over time, the project has worked more and more closely with the HDF Group.

HDF5 and Python fit together easily: files and groups are accessed like dictionaries.
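
As a rough sketch of this dictionary-like behaviour, the example below builds a small file and then explores it with the usual `dict` idioms. The file name and the `scan`, `image` and `exposure` names are made up for the illustration.

```python
import os
import tempfile

import h5py
import numpy

path = os.path.join(tempfile.mkdtemp(), "dict_demo.h5")  # hypothetical file

with h5py.File(path, "w") as h5file:
    h5file["/scan/image"] = numpy.ones((3, 3))   # intermediate groups are created
    h5file["/scan/exposure"] = 0.5               # a scalar dataset

with h5py.File(path, "r") as h5file:
    top_level = list(h5file.keys())              # like dict.keys()
    has_scan = "scan" in h5file                  # membership test
    children = sorted(name for name in h5file["scan"])  # iterating yields names
    exposure = h5file["scan/exposure"][()]       # [()] reads a scalar dataset

    # Walk the whole tree, similar in spirit to `h5ls -r`
    names = []
    h5file.visit(names.append)

print(top_level, has_scan, children, exposure, sorted(names))
```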

### How to read an HDF5 file with h5py

First, open a file using a [File object](http://docs.h5py.org/en/stable/high/file.html):
```
h5py.File('myfile.hdf5', opening_mode)
```

The [opening modes](http://docs.h5py.org/en/stable/high/file.html#opening-creating-files) are:

| Mode    | Description                                                        |
|---------|--------------------------------------------------------------------|
| r       | Read-only, file must exist (default since h5py 3.0)                |
| r+      | Read/write, file must exist                                        |
| w       | Create file, truncate if exists                                    |
| w- or x | Create file, fail if exists                                        |
| a       | Read/write if exists, create otherwise (default before h5py 3.0)   |
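
The behaviour of these modes can be checked with a short experiment. This is only a sketch: the file name is hypothetical, and the failures are caught as `OSError` on the assumption that h5py raises it or one of its subclasses in these cases.

```python
import os
import tempfile

import h5py

path = os.path.join(tempfile.mkdtemp(), "modes_demo.h5")  # hypothetical file

# 'r' requires an existing file
try:
    h5py.File(path, "r")
    read_missing_failed = False
except OSError:
    read_missing_failed = True

# 'w' creates the file (and would truncate an existing one)
with h5py.File(path, "w") as h5file:
    h5file["value"] = 1

# 'x' (or 'w-') refuses to overwrite an existing file
try:
    h5py.File(path, "x")
    exclusive_failed = False
except OSError:
    exclusive_failed = True

# 'r+' allows modifying an existing file
with h5py.File(path, "r+") as h5file:
    h5file["other"] = 2

with h5py.File(path, "r") as h5file:
    names = sorted(h5file.keys())

print(read_missing_failed, exclusive_failed, names)
```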
   

In [3]:
import h5py

h5file = h5py.File('data/test.h5', 'r')

# print available names at the first level
print("First children:", list(h5file['/'].keys()))

First children: ['diff_map_0000', 'diff_map_0001', 'diff_map_0002', 'diff_map_0003', 'diff_map_0004']


In [None]:
# reaching a dataset from a sub group
dataset = h5file['/diff_map_0004/data/map']

# inspect shape, size and dtype without reading the full stored data
print("Dataset:", dataset.shape, dataset.size, dataset.dtype)

Datasets mimic NumPy arrays.

In [None]:
# read a selection from the file, then apply an operation to it
print(dataset[5, 5, 0:5])
print(2 * dataset[0, 5, 0:5])

In [None]:
# copy the data into memory as a NumPy array;
# modifying the copy does not change the stored dataset
b = dataset[...]
b[0, 0, 0:5] = 0
print(dataset[0, 0, 0:5])
print(b[0, 0, 0:5])

![warning](img/warning.jpg)

### Multiple indexing

Indexing a dataset once loads a numpy array into memory.
If you try to index it twice to write data, you may be surprised that nothing
seems to have happened:

In [8]:
f = h5py.File('data/my_hdf5_file.h5', 'w')
dset = f.create_dataset("test", (2, 2))
dset[0][1] = 3.0  # No effect!
print('original value:', dset[0][1])

# The assignment above only modifies the loaded array. It is equivalent to this:
new_array = dset[0]
new_array[1] = 3.0
print('value modified (in the copy):', new_array[1])
print('original value:', dset[0][1])

# To write to the dataset, combine the indexes in a single step:
dset[0, 1] = 3.0
print(dset[0, 1])

f.close()

original value: 0.0
value modified (in the copy): 3.0
original value: 0.0
3.0


### How to write an HDF5 file with h5py

In [None]:
import numpy
import h5py

data = numpy.arange(10000.0)
data.shape = 100, 100

# write
h5file = h5py.File('my_first_one.h5', mode='w')

# write data into a dataset from the root
h5file['/data1'] = data

# write data into a dataset from group1
h5file['/group1/data2'] = data

h5file.close()
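
A variant of the same write, sketched with a context manager, an explicit `create_dataset` call, gzip compression and an attribute. The output file name and the `description` attribute are assumptions made for this illustration.

```python
import os
import tempfile

import h5py
import numpy

data = numpy.arange(10000.0).reshape(100, 100)
path = os.path.join(tempfile.mkdtemp(), "my_compressed.h5")  # hypothetical file

# The context manager closes the file even if an error occurs
with h5py.File(path, mode="w") as h5file:
    group = h5file.require_group("group1")
    dataset = group.create_dataset(
        "data2",
        data=data,
        compression="gzip",  # lossless compression, applied chunk by chunk
    )
    dataset.attrs["description"] = "example data"  # hypothetical metadata

with h5py.File(path, mode="r") as h5file:
    stored = h5file["/group1/data2"]
    compression = stored.compression
    round_trip_ok = numpy.array_equal(stored[...], data)

print(compression, round_trip_ok)
```

Compression is transparent on reading: the dataset is indexed exactly as an uncompressed one.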

In [2]:
## display the contents of an hdf5 file

In [3]:
## writing a .h5 file

In [4]:
## reading a .h5 file

## Hands-on: normalization

### step 1: load data, dark and flat from file/data

In [None]:
import h5py
with h5py.File('myfile.h5', 'r') as h5:
    data = ...
    dark = ... 
    flat = ...

### step 2: apply normalization for each slice of the data

For this you can use the `normalize` function below.

Warning: `dark` and `flat` are 2D arrays, while `data` is 3D.

In [1]:
def normalize(data, dark, flat):
    assert dark.shape == flat.shape
    assert data.shape == dark.shape
    return (data - dark) / (flat - dark)
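
One possible way to apply `normalize` slice by slice is sketched below on synthetic arrays. The array values and shapes are invented for the illustration; in the exercise they come from the file loaded in step 1.

```python
import numpy


def normalize(data, dark, flat):
    assert dark.shape == flat.shape
    assert data.shape == dark.shape
    return (data - dark) / (flat - dark)


# Synthetic values, chosen only for illustration
dark = numpy.full((4, 5), 1.0)
flat = numpy.full((4, 5), 3.0)
data = numpy.full((3, 4, 5), 2.0)  # a stack of three 2D frames

normalized = numpy.empty(data.shape, dtype=numpy.float64)
for index, frame in enumerate(data):  # iterating a 3D array yields 2D slices
    normalized[index] = normalize(frame, dark, flat)

print(normalized.shape)
```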

### step 3: store the normalized data into a new dataset
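
A minimal sketch of this step, assuming the result of step 2 is held in a NumPy array. The output file name and the dataset name `normalized` are hypothetical, and a constant array stands in for the real result.

```python
import os
import tempfile

import h5py
import numpy

normalized = numpy.full((3, 4, 5), 0.5)  # stand-in for the result of step 2
path = os.path.join(tempfile.mkdtemp(), "normalized.h5")  # hypothetical output file

with h5py.File(path, "w") as h5:
    h5["normalized"] = normalized  # hypothetical dataset name

# Read it back to check what was stored
with h5py.File(path, "r") as h5:
    stored_shape = h5["normalized"].shape
    first_value = float(h5["normalized"][0, 0, 0])

print(stored_shape, first_value)
```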

## NeXus

[NeXus](https://www.nexusformat.org/) is a data format for neutron, X-ray, and muon science.

It defines a common way to represent datasets.
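
As a rough illustration, NeXus conventions are expressed in an HDF5 file through attributes such as `NX_class` on groups and `signal` on an `NXdata` group. The sketch below writes a deliberately small layout with h5py; the file, group and dataset names are hypothetical, and the result is not claimed to be a complete or validated NeXus file.

```python
import os
import tempfile

import h5py
import numpy

path = os.path.join(tempfile.mkdtemp(), "nexus_demo.h5")  # hypothetical file

with h5py.File(path, "w") as h5:
    entry = h5.create_group("entry")
    entry.attrs["NX_class"] = "NXentry"   # marks the group as a NeXus entry
    nxdata = entry.create_group("data")
    nxdata.attrs["NX_class"] = "NXdata"   # marks the group as plottable data
    nxdata.attrs["signal"] = "image"      # name of the dataset to plot by default
    nxdata["image"] = numpy.zeros((4, 5))

with h5py.File(path, "r") as h5:
    group_class = h5["entry/data"].attrs["NX_class"]
    signal_name = h5["entry/data"].attrs["signal"]
    signal_shape = h5["entry/data"][signal_name].shape

print(group_class, signal_name, signal_shape)
```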