# HDF5 & python

![hdf_group](img/HDF_logo.png "HDF group")

## What is HDF5?

[HDF5](https://en.wikipedia.org/wiki/Hierarchical_Data_Format) (Hierarchical Data Format, version 5) is a file format for structuring and storing large and complex datasets.

## Why HDF5?

* Hierarchical collection of data (directory and file, UNIX-like path)
* Portable file format (Standard exchange format for heterogeneous data)
* Rich metadata, self-describing extensible types
* Supports data compression
* Free (and open source)
* Adopted by a large number of institutes (NASA, LIGO, ...)
* Adopted by most synchrotrons (ESRF, SOLEIL, DESY...)
* Ensures [forward and backward compatibility](https://support.hdfgroup.org/HDF5/doc/ADGuide/CompatFormat180.html)

**Data can be mostly anything: image, table, graphs, documents**

## HDF5 description

The container is mostly structured with:

* **File**: the root of the container
* **Group**: a grouping structure containing groups or datasets
* **Dataset**: a multidimensional array of data elements
* And other features (links, attributes, datatypes)
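
As a rough illustration of these building blocks, the sketch below creates a group, a dataset, an attribute and a soft link with h5py, then checks which class each object maps to. The names `measurement`, `image`, `units` and `shortcut` are made up for the example, and the file is kept in memory so nothing is written to disk.

```python
import h5py
import numpy

# In-memory file: the file name is only a placeholder, nothing is saved to disk
with h5py.File("sketch_model.h5", "w", driver="core", backing_store=False) as h5file:
    group = h5file.create_group("measurement")                           # Group
    dataset = group.create_dataset("image", data=numpy.zeros((4, 3)))    # Dataset
    dataset.attrs["units"] = "counts"                                    # Attribute (metadata)
    h5file["shortcut"] = h5py.SoftLink("/measurement/image")             # Link

    # Each node of the hierarchy is exposed as a different h5py class
    kinds = {
        "root": type(h5file).__name__,
        "measurement": type(h5file["measurement"]).__name__,
        "image": type(h5file["measurement/image"]).__name__,
    }
    # The soft link resolves to the dataset it points to
    linked_shape = h5file["shortcut"].shape
    units = h5file["measurement/image"].attrs["units"]

print(kinds, linked_shape, units)
```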

![hdf5_class_diag](img/hdf5_model.png "hdf5 class diagram")


## HDF5 example

Here is an example of the file generated by [pyFAI](https://github.com/silx-kit/pyFAI)

![hdf5_example](img/hdf5_example.png "hdf5 example")

## Useful tools for HDF5

* h5ls, h5dump, hdfview (applications)

```bash
$ h5ls -r my_first_one.h5
/                        Group
/data1                   Dataset {100, 100}
/group1                  Group
/group1/data2            Dataset {100, 100}
```


* silx view (application)

```bash
$ pip install silx
$ silx view my_file.h5
```

* h5glance (HDF5 files in the terminal or an HTML interface)

```python
from h5glance import H5Glance
H5Glance("data/ID16B_diatomee.h5")
```

The HDF Group provides a web page listing more tools: https://support.hdfgroup.org/HDF5/doc/RM/Tools.html

## h5py

![h5py book](img/h5py.gif "h5py book")

[h5py](https://www.h5py.org/) is the Python binding for HDF5, originally written by [Andrew Collette](http://shop.oreilly.com/product/0636920030249.do).

HDF5 and Python fit together easily: files and groups are accessed like dictionaries.
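
A minimal sketch of this dictionary-like interface is given below. It uses an in-memory file and made-up names (`data1`, `group1/data2`), so it does not depend on any file from this training.

```python
import h5py
import numpy

# In-memory file: the file name is only a placeholder
with h5py.File("sketch_dict.h5", "w", driver="core", backing_store=False) as h5file:
    # Assigning to a path creates the dataset (and the intermediate group)
    h5file["group1/data2"] = numpy.arange(6).reshape(2, 3)
    h5file["data1"] = numpy.arange(3)

    names = sorted(h5file.keys())              # names at the first level
    has_data2 = "group1/data2" in h5file       # membership test with a path
    missing = h5file.get("does_not_exist")     # like dict.get: None if absent

    for name, item in h5file.items():          # iterate like a dictionary
        print(name, "->", item)

    # visit() walks the whole hierarchy and calls the function with each path
    paths = []
    h5file.visit(paths.append)

print(names, has_data2, missing, paths)
```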

### How to read an hdf5 file with h5py

First, open the file using a [File object](http://docs.h5py.org/en/stable/high/file.html):

```python
h5py.File('myfile.hdf5', opening_mode)
```

The available [opening modes](http://docs.h5py.org/en/stable/high/file.html#opening-creating-files) are:

| Mode    | Description                                                                   |
|---------|-------------------------------------------------------------------------------|
| r       | Read-only, file must exist                                                    |
| r+      | Read/write, file must exist                                                   |
| w       | Create file, truncate if it exists                                            |
| w- or x | Create file, fail if it exists                                                |
| a       | Read/write if it exists, create otherwise (default in h5py versions before 3.0) |
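
The behaviour of these modes can be checked with a small sketch such as the one below. It works in a temporary directory, and the file name `modes_sketch.h5` and dataset names are made up for the example.

```python
import os
import tempfile

import h5py

with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "modes_sketch.h5")  # hypothetical file name

    # 'w': create the file (it would be truncated if it already existed)
    with h5py.File(path, "w") as h5file:
        h5file["value"] = 1

    # 'x' (or 'w-'): refuses to overwrite an existing file
    try:
        h5py.File(path, "x")
        exclusive_create_failed = False
    except OSError:
        exclusive_create_failed = True

    # 'r': read-only access
    with h5py.File(path, "r") as h5file:
        read_mode = h5file.mode

    # 'a': read/write, existing content is kept
    with h5py.File(path, "a") as h5file:
        h5file["other"] = 2
        append_names = sorted(h5file.keys())

    # 'w' again: the previous content is discarded
    with h5py.File(path, "w") as h5file:
        names_after_truncate = list(h5file.keys())

print(exclusive_create_failed, read_mode, append_names, names_after_truncate)
```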
   

In [None]:
import h5py

h5file = h5py.File('data/test.h5', 'r')

# print available names at the first level
print("First children:", list(h5file.keys()))

In [None]:
# accessing a dataset inside a subgroup
dataset = h5file['diff_map_0004/data/map']
dataset

Datasets mimic NumPy arrays.

In [None]:
# shape, size and dtype can be inspected without reading the full stored data
print("Dataset: shape:", dataset.shape, 'size:', dataset.size, 'data type', dataset.dtype)

In [None]:
# slicing reads only the selected region and returns a NumPy array
print(type(dataset))
print(dataset[5, 5, 0:5])
print(2 * dataset[0, 5, 0:5])

In [None]:
# copy the data and store it as a numpy-array
b = dataset[()]   # ellipsis ('[...]') also works, but always returns an array (even for a scalar dataset)
print(type(b))
b[0, 0, 0:5] = 0
print(dataset[0, 0, 0:5])
print(b[0, 0, 0:5])

### How to write an HDF5 file with h5py

In [None]:
import numpy
import h5py

# 100 x 100 array of random values
data = numpy.random.random((100, 100))

# write
h5file = h5py.File('my_first_one.h5', mode='w')

# write data into a dataset from the root
h5file['/data1'] = data

# write data into a dataset from group1
h5file['/group1/data2'] = data

h5file.close()

The same operation with a context manager (the `with` statement opens the file and automatically closes it when the block ends):

In [None]:
with h5py.File('my_first_one.h5', mode='w') as h5file:
    # write data into a dataset from the root
    h5file['/data1'] = data
    # write data into a dataset from group1
    h5file['/group1/data2'] = data

More links to:

* see [group functions](http://docs.h5py.org/en/stable/high/group.html) for more information like:
    * [creating a group](http://docs.h5py.org/en/stable/high/group.html#creating-groups)
    * [require a group](http://docs.h5py.org/en/stable/high/group.html#Group.require_group)

* see [dataset functions and features](http://docs.h5py.org/en/stable/high/dataset.html) for more information like
    * [creating dataset](http://docs.h5py.org/en/stable/high/dataset.html#creating-datasets)

* see [h5py documentation regarding attributes](http://docs.h5py.org/en/stable/high/attr.html)

* see [hdf5 official documentation](https://portal.hdfgroup.org/display/HDF5) for:
    * [chunks](https://portal.hdfgroup.org/display/HDF5/Chunking+in+HDF5)
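
The features linked above can be combined as in the sketch below: explicit group creation, `require_group`, a chunked and gzip-compressed dataset, and an attribute. The chunk shape `(10, 100)`, the attribute name `description` and the in-memory file are arbitrary choices made for the example, not recommendations.

```python
import h5py
import numpy

data = numpy.random.random((100, 100))

# In-memory file: the file name is only a placeholder
with h5py.File("sketch_features.h5", "w", driver="core", backing_store=False) as h5file:
    group = h5file.create_group("group1")
    # require_group returns the existing group instead of failing
    same_group = h5file.require_group("group1")

    # Chunked storage is required for compression; here one chunk = 10 rows
    dataset = group.create_dataset(
        "data2", data=data, chunks=(10, 100), compression="gzip"
    )
    # Attributes store small metadata next to the data
    dataset.attrs["description"] = "random values"

    chunks = dataset.chunks
    compression = dataset.compression
    description = dataset.attrs["description"]
    group_names = (group.name, same_group.name)
    stored = dataset[()]  # decompression is transparent when reading

print(chunks, compression, description, group_names)
```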

## Exercises

The structure of `data/ID16B_diatomee.h5` looks like this:

![ID16B diatomee](img/ID16B_diatomee_screenshot.png)

### Exercise 1

1. Browse the file ``data/ID16B_diatomee.h5`` (using one or several of the HDF5 utilities, silx view, h5glance or h5py)
2. Retrieve a single raw image, a flat field image and a dark image (background) from this file
3. Apply the flat field correction ${\frac{rawdata - dark}{flat - dark}}$
4. Save the result into a new HDF5 file

If you are stuck, the solution is provided in the file [solutions/h5py/exercise1.py](solutions/h5py/exercise1.py)

In [None]:
from h5glance import H5Glance
H5Glance("data/ID16B_diatomee.h5")

In [None]:
import h5py

# Read the data

raw =
flat = 
dark = 
...

# Compute the result
normalized = 

# Save the result

...


Note: to plot an image you can use the `imshow` function.
The `%pylab` magic must be called once before calling `imshow`.

In [None]:
%pylab inline

In [None]:
import numpy
imshow(numpy.random.random((20, 60)))

### Exercise 2

1. Apply the flat field correction to all raw data available (use the same flat and dark for all the images)
2. Save each result into different datasets of the same HDF5 file

If you are stuck, the solution is provided in the file [solutions/h5py/exercise2.py](solutions/h5py/exercise2.py)


For this you can use the `normalize` function.
Warning: `dark` and `flat` are 2D arrays while the data is 3D, so `normalize` must be applied to one 2D image at a time.

```python
data = normalize(raw_data, dark, flat)
```

In [None]:
def normalize(data, dark, flat):
    assert dark.shape == flat.shape
    assert data.shape == dark.shape
    return (data - dark) / (flat - dark)
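
As a hint, the sketch below applies `normalize` image by image to a synthetic 3D stack. It assumes the images are stacked along the first axis, which may not match the layout of the exercise file, and all array values are made up.

```python
import numpy


def normalize(data, dark, flat):
    assert dark.shape == flat.shape
    assert data.shape == dark.shape
    return (data - dark) / (flat - dark)


# Synthetic stand-ins for the real images (values chosen arbitrarily)
dark = numpy.full((4, 5), 1.0)
flat = numpy.full((4, 5), 3.0)
stack = numpy.full((3, 4, 5), 2.0)  # 3 images of shape (4, 5)

# Iterating over a 3D array yields one 2D image at a time
results = numpy.array([normalize(image, dark, flat) for image in stack])

print(results.shape)
```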

### Exercise 3

From the previous exercise, we can see that the flat field correction was not very good for the last images.

Another flat field was acquired at the end of the acquisition.

We could use this information to compute a flat field closer to the image we want to normalize. It can be done with a linear interpolation of the flat images by using the name of the image as the interpolation factor (which varies between 0 and 500 in this case).
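
One possible way to write this interpolation, assuming the two flat fields were acquired at image indices 0 and 500 and writing $i$ for the index of the image to normalize, is:

$$flat_i = flat_{0} + \frac{i}{500} \left( flat_{500} - flat_{0} \right)$$

With this convention $flat_i$ equals $flat_{0}$ for $i = 0$ and $flat_{500}$ for $i = 500$.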

1. For each raw image, compute the corresponding flat field using linear interpolation (between `flatfield/0000` and `flatfield/0500`)
2. Save each result into different datasets in a single HDF5 file

If you are stuck, the solution is provided in the file [solutions/h5py/exercise3.py](solutions/h5py/exercise3.py)

## Nexus

[NeXus](https://www.nexusformat.org/) is a data format for neutron, x-ray, and muon science.

It aims to be a common data format for scientists for greater collaboration.

If you intend to store some data to be shared it can give you a 'standard way' for storing it.

The main advantage is to ensure compatibility between your data files and existing software (provided it respects the NeXus format), and between your software and other datasets.
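
In practice, NeXus conventions are expressed on top of HDF5 mostly through attributes such as `NX_class`. The sketch below writes a minimal `NXentry`/`NXdata` layout with h5py. The group and dataset names (`entry`, `data`, `image`) are arbitrary choices for the example, and a real file should follow the relevant NeXus application definition.

```python
import h5py
import numpy

image = numpy.zeros((8, 8))  # placeholder data

# In-memory file: the file name is only a placeholder
with h5py.File("sketch_nexus.h5", "w", driver="core", backing_store=False) as h5file:
    h5file.attrs["default"] = "entry"      # entry to open by default

    entry = h5file.create_group("entry")
    entry.attrs["NX_class"] = "NXentry"    # NeXus class of this group
    entry.attrs["default"] = "data"        # NXdata group to plot by default

    nxdata = entry.create_group("data")
    nxdata.attrs["NX_class"] = "NXdata"
    nxdata.attrs["signal"] = "image"       # name of the dataset holding the signal
    nxdata["image"] = image

    classes = {
        "entry": h5file["entry"].attrs["NX_class"],
        "data": h5file["entry/data"].attrs["NX_class"],
    }
    signal_name = h5file["entry/data"].attrs["signal"]
    signal_shape = h5file["entry/data"][signal_name].shape

print(classes, signal_name, signal_shape)
```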

* an example of [how to store tomography raw data](http://download.nexusformat.org/doc/html/classes/applications/NXtomo.html?highlight=tomography)
* an example of how to store a [tomography application (3D reconstruction)](http://download.nexusformat.org/doc/html/classes/applications/NXtomoproc.html?highlight=tomography)