# Data IO (input/output)


# Introduction

ESRF data (used to) come in (too many) different formats:

* Specfile, EDF, HDF5
* And specific detector formats: MarCCD, Pilatus CBF, Dectris Eiger, …


HDF5 is now the standard ESRF data format so we will only focus on it today.

Methods for accessing other file formats are described in the [io_spec_edf.ipynb](io_spec_edf.ipynb) notebook.

# HDF5

## What is HDF5?

[HDF5](https://en.wikipedia.org/wiki/Hierarchical_Data_Format) (for Hierarchical Data Format version 5) is a file format to structure and store complex and large volumes of data.

## Structure

HDF5 organizes data in a hierarchical structure, similar to a file system.

It contains datasets (arrays) and groups (directories) that can hold both datasets and other groups.

**Data can be almost anything: images, tables, graphs, documents**

## Why HDF5?

* High-performance (binary)
* Portable file format (Standard exchange format for heterogeneous data)
* Self-describing extensible types, rich metadata
* Supports data compression
* Free (and open source)
* Widely used in scientific computing
* Adopted by a large number of institutes (NASA, LIGO, ...)
* Adopted by most of the synchrotrons (ESRF, SOLEIL, DESY...)

## HDF5 description

The container is mostly structured with:

* **File**: the root of the container
* **Group**: a grouping structure containing groups or datasets
* **Dataset**: a multidimensional array of data elements
* And other features (links, attributes, datatypes, virtual datasets)

![hdf5_class_diag](images/hdf5_model.png "hdf5 class diagram")


## HDF5 example

Here is an example of a file generated by [pyFAI](https://github.com/silx-kit/pyFAI):

![hdf5_example](images/hdf5_example.png "hdf5 example")

## Useful tools for HDF5

### [HDFGroup tools](https://portal.hdfgroup.org/display/HDF5/HDF5+Tools+by+Category)

Command-line and desktop applications: `h5ls`, `h5dump`, `hdfview`

```bash
>>> h5ls -r my_first_one.h5
    /                        Group
    /data1                   Dataset {100, 100}
    /group1                  Group
    /group1/data2            Dataset {100, 100}
```

### [`h5glance`](https://github.com/European-XFEL/h5glance)

Jupyter notebook and command line tool for browsing HDF5 files

In [None]:
# From jupyter notebook
from h5glance import H5Glance
H5Glance("data/water.h5")

In [None]:
%%bash
# From the command line
h5glance data/water.h5

### [`jupyterlab-h5web`](https://github.com/silx-kit/jupyterlab-h5web/)

JupyterLab HDF5 file browser and viewer

[Go to JupyterLab](/lab)

In [None]:
from jupyterlab_h5web import H5Web

H5Web("data/water.h5")

### [`silx view`](http://www.silx.org/doc/silx/latest/applications/view.html)

Desktop application file browser/viewer

```bash
>>> pip install silx
>>> silx view my_file.h5
```

In [None]:
%%bash
# With silx view GUI
silx view data/water.h5

# h5py

![h5py book](images/h5py.png "h5py book")

## What is h5py?

[h5py](https://www.h5py.org/) is the Python binding for accessing HDF5 files. It was originally written by [Andrew Collette](http://shop.oreilly.com/product/0636920030249.do).

It allows reading and writing HDF5 files through a simple Pythonic API.

## How to install h5py?

```bash
pip install h5py
```

In [None]:
import h5py

print("h5py:", h5py.version.version)
print("hdf5:", h5py.version.hdf5_version)

## Opening and creating HDF5 files

### `h5py.File`

* First open the file with [h5py.File](http://docs.h5py.org/en/stable/high/file.html):
  ```
  h5py.File('myfile.hdf5', mode)
  ```
  [opening modes](http://docs.h5py.org/en/stable/high/file.html#opening-creating-files):

| Mode    | Meaning                                                            |
|---------|--------------------------------------------------------------------|
| r       | Readonly, file must exist; *Default with h5py* **v3**              |
| r+      | Read/write, file must exist                                        |
| w       | Create file, truncate if exists                                    |
| w- or x | Create file, fail if exists                                        |
| a       | Read/write if exists, create otherwise; *Default with h5py* **v2** |
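
The effect of each mode can be checked on a throwaway file. The following is a minimal sketch, assuming a temporary directory and a hypothetical file name (`modes.h5`); it only uses the modes listed in the table above.

```python
import os
import tempfile

import h5py

with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "modes.h5")  # hypothetical file name

    # "r": the file must exist, so opening a missing file fails
    try:
        h5py.File(path, mode="r")
    except OSError:
        read_missing_failed = True
    else:
        read_missing_failed = False

    # "w": create the file (an existing file would be truncated)
    with h5py.File(path, mode="w") as h5file:
        h5file["first"] = 1

    # "x" (same as "w-"): refuse to overwrite an existing file
    try:
        h5py.File(path, mode="x")
    except OSError:
        exclusive_create_failed = True
    else:
        exclusive_create_failed = False

    # "a": open read/write and keep what is already stored
    with h5py.File(path, mode="a") as h5file:
        h5file["second"] = 2
        names_after_append = sorted(h5file.keys())

    # "w" again: the previous content is gone
    with h5py.File(path, mode="w") as h5file:
        names_after_truncate = list(h5file.keys())

print(names_after_append, names_after_truncate)
```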

### Opening an existing file

In [None]:
import h5py

h5file = h5py.File("data/water.h5", mode="r")
h5file.close()

### Creating a new file

In [None]:
import h5py

h5file = h5py.File("data/new_data.h5", mode="w")
h5file.close()

### Using a context manager

* Context managers guarantee that resources are released. Here, they ensure that the HDF5 file is closed, even if an error occurs.
* They are usually used through the `with` statement.

To safely access an HDF5 file, do:

In [None]:
with h5py.File("data/water.h5", mode="r") as h5file:
    pass
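
To illustrate the guarantee, the sketch below raises an error on purpose while a file is open and then checks whether the file object is still valid. The file name and the simulated error are assumptions made for the example.

```python
import os
import tempfile

import h5py

with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "demo.h5")  # hypothetical file name
    try:
        with h5py.File(path, mode="w") as h5file:
            open_inside = bool(h5file)  # The file object is valid while open
            raise RuntimeError("simulated failure while the file is open")
    except RuntimeError:
        pass
    # The `with` block closed the file even though an exception was raised
    open_after = bool(h5file)

print("open inside the block:", open_inside)
print("open after the block:", open_after)
```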

### File structure

#### `h5py.Group`

Documentation: [Group](http://docs.h5py.org/en/stable/high/group.html)

##### browsing groups

* Then access the file content with a dictionary-like API, [h5py.Group](http://docs.h5py.org/en/stable/high/group.html):

  - [`Group.keys()`](https://docs.h5py.org/en/stable/high/group.html#h5py.Group.keys)
  - [`Group.items()`](https://docs.h5py.org/en/stable/high/group.html#h5py.Group.items)
  - [`Group.values()`](https://docs.h5py.org/en/stable/high/group.html#h5py.Group.values)

In [None]:
# Available names at the first level
with h5py.File("data/water.h5", mode="r") as h5file:
    print(list(h5file.keys()))

In [None]:
from pprint import pprint

# List 'entry_0000' group children
with h5py.File("data/water.h5", mode="r") as h5file:
    group = h5file["entry_0000"]
    pprint(dict(group.items()))

In [None]:
# List 'entry_0000/4_azimuthal_integration' group children
with h5py.File("data/water.h5", mode="r") as h5file:
    group = h5file["entry_0000"]
    print(list(group["4_azimuthal_integration"].values()))

In [None]:
# List 'entry_0000/4_azimuthal_integration/results' group children
with h5py.File("data/water.h5", mode="r") as h5file:
    print(list(h5file["/entry_0000/4_azimuthal_integration/results"].values()))
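
Instead of listing one level at a time, the whole hierarchy can be walked with `Group.visititems`, which calls a function for every group and dataset below the starting group. This is a small sketch on a temporary file whose layout (`entry/results/...`) is made up for the example.

```python
import os
import tempfile

import h5py
import numpy

found = []


def describe(name, obj):
    """Record the path and the kind of each visited object."""
    kind = "Dataset" if isinstance(obj, h5py.Dataset) else "Group"
    found.append((name, kind))


with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "tree.h5")  # hypothetical file name
    with h5py.File(path, mode="w") as h5file:
        h5file["entry/results/I"] = numpy.arange(5)
        h5file["entry/results/q"] = numpy.linspace(0, 1, 5)

    with h5py.File(path, mode="r") as h5file:
        # Names passed to the callback are relative to the visited group
        h5file.visititems(describe)

for name, kind in found:
    print(f"{kind:8s} {name}")
```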

##### creating a group

With [Group.create_group](https://docs.h5py.org/en/stable/high/group.html#h5py.Group.create_group):

In [None]:
# Create a group and a nested sub-group
with h5py.File("data/new_data.h5", mode="w") as h5file:
    my_group = h5file.create_group('my_group')
    my_group.create_group("sub_group")
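
Two related behaviours are worth knowing: `create_group` accepts a path and creates the intermediate groups, and `require_group` returns an existing group instead of failing. A minimal sketch, assuming a temporary file:

```python
import os
import tempfile

import h5py

with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "groups.h5")  # hypothetical file name
    with h5py.File(path, mode="w") as h5file:
        # Intermediate groups ("my_group") are created along the way
        sub_group = h5file.create_group("my_group/sub_group")
        sub_group_name = sub_group.name

        # Creating a group that already exists is an error...
        try:
            h5file.create_group("my_group")
        except ValueError:
            duplicate_failed = True
        else:
            duplicate_failed = False

        # ...whereas require_group returns the existing one
        same_group_name = h5file.require_group("my_group").name
        children = list(h5file["my_group"].keys())

print(sub_group_name, same_group_name, children)
```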

#### `h5py.Dataset`

Documentation: [Dataset](http://docs.h5py.org/en/stable/high/dataset.html)

##### reading a dataset

In [None]:
with h5py.File("data/water.h5", mode="r") as h5file:
    h5dataset = h5file["/entry_0000/4_azimuthal_integration/results/I"]
    print(h5dataset)

It mimics `numpy.ndarray`.
The data is read from the file only when it is needed.

In [None]:
with h5py.File("data/water.h5", mode="r") as h5file:
    h5dataset = h5file["/entry_0000/4_azimuthal_integration/results/I"]
    print("Dataset:", h5dataset.shape, h5dataset.dtype, h5dataset.size)

Read data from the file to a numpy.ndarray

In [None]:
with h5py.File("data/water.h5", mode="r") as h5file:
    h5dataset = h5file["/entry_0000/4_azimuthal_integration/results/I"]
    subset = h5dataset[:5]  # Copy the selection to a numpy.ndarray
    print("subset:", subset, "=> sum:", subset.sum())
    
    data = h5dataset[()]  # Copy the whole dataset to a numpy.ndarray
    print("data type:", type(data), "; shape", data.shape, "; min.:", data.min())
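
Slicing works on multidimensional datasets as well, and only the selected region is read from the file. The sketch below uses a small made-up 2D dataset; `read_direct` is shown as one way to fill an existing array without an intermediate copy.

```python
import os
import tempfile

import h5py
import numpy

with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "image.h5")  # hypothetical file name
    with h5py.File(path, mode="w") as h5file:
        h5file["image"] = numpy.arange(12).reshape(3, 4)

    with h5py.File(path, mode="r") as h5file:
        h5dataset = h5file["image"]
        row = h5dataset[1]                     # A single row
        corner = h5dataset[:2, :2]             # A rectangular region
        every_other_column = h5dataset[:, ::2]  # A strided selection

        # Read the whole dataset into a pre-allocated array
        buffer = numpy.empty(h5dataset.shape, dtype=h5dataset.dtype)
        h5dataset.read_direct(buffer)

print(row)
print(corner)
print(every_other_column)
```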

Once the file is closed, the Dataset no longer gives access to data

In [None]:
with h5py.File("data/water.h5", mode="r") as h5file:
    h5dataset = h5file["/entry_0000/4_azimuthal_integration/results/I"]
    subset = h5dataset[:5]  # Copy the selection to a numpy.ndarray
    data = h5dataset[()]  # Copy the whole dataset to a numpy.ndarray
print(h5dataset)
print(subset)
print(data)

In [None]:
with h5py.File("data/water.h5", "r") as h5file:
    dataset = h5file["/entry_0000/4_azimuthal_integration/results/I"]
    data = dataset[()]
print(dataset)
print(data[:5])

Printing objects this way is not very convenient for interactive browsing, which is why tools such as silx view, h5web and h5glance exist.

##### writing a dataset

In [None]:
import numpy

with h5py.File("data/new_data.h5", mode="w") as h5file:
    h5file["mydataset"] = numpy.random.rand(100, 100)

h5py creates all missing intermediate groups needed to resolve the dataset location, so you usually do not have to call `create_group`.

In [None]:
with h5py.File("data/new_data.h5", mode="a") as h5file:
    h5file["group/to/mydataset"] = numpy.random.rand(100, 100)

Alternative: use [Group.create_dataset](https://docs.h5py.org/en/stable/high/group.html#h5py.Group.create_dataset)

In [None]:
with h5py.File("data/new_data.h5", mode="w") as h5file:
    h5file.create_dataset("data1", data=numpy.arange(100))
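
`create_dataset` can also allocate an empty dataset from a `shape` and a `dtype`, which is then filled piece by piece. This is a sketch with made-up names and sizes, for example for storing a stack of images one at a time.

```python
import os
import tempfile

import h5py
import numpy

with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "stack.h5")  # hypothetical file name
    with h5py.File(path, mode="w") as h5file:
        # Allocate a stack of 3 images of 4x4 pixels without providing data
        stack = h5file.create_dataset("stack", shape=(3, 4, 4), dtype=numpy.float32)
        for index in range(3):
            # Write one image at a time
            stack[index] = numpy.full((4, 4), index)

    with h5py.File(path, mode="r") as h5file:
        stored_dtype = h5file["stack"].dtype
        means = [float(h5file["stack"][index].mean()) for index in range(3)]

print(stored_dtype, means)
```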

#### attributes

[attributes](https://docs.h5py.org/en/stable/high/attr.html) are the way to store metadata to a [group](https://docs.h5py.org/en/stable/high/group.html) or a [dataset](https://docs.h5py.org/en/stable/high/dataset.html).

Groups and datasets each expose a small dictionary-like `attrs` property (`<obj>.attrs`).

**Warning**: attributes are meant for small metadata and are limited in size (generally under 64 kB).

Writing an attribute

In [None]:
with h5py.File("data/new_data.h5", "w") as h5file:
    dataset = h5file.create_dataset('my_dataset', data=numpy.random.rand(10, 10))
    dataset.attrs["description"] = 'This is a random dataset'

Reading an attribute

In [None]:
with h5py.File("data/new_data.h5", "r") as h5file:
    print(h5file["my_dataset"].attrs['description'])
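
Attributes are not limited to strings and can be attached to the file root as well. The sketch below stores a few made-up attributes (`units`, `exposure_time`, `roi`, `creator`) and reads them back using the dictionary-like API.

```python
import os
import tempfile

import h5py
import numpy

with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "attributes.h5")  # hypothetical file name
    with h5py.File(path, mode="w") as h5file:
        dataset = h5file.create_dataset("my_dataset", data=numpy.zeros((2, 2)))
        dataset.attrs["units"] = "counts"               # A string
        dataset.attrs["exposure_time"] = 0.5            # A number
        dataset.attrs["roi"] = numpy.array([0, 0, 2, 2])  # A small array
        h5file.attrs["creator"] = "example script"      # Attribute of the root group

    with h5py.File(path, mode="r") as h5file:
        attributes = dict(h5file["my_dataset"].attrs.items())
        # `get` returns a default value when the attribute is missing
        missing = h5file["my_dataset"].attrs.get("missing", "n/a")
        creator = h5file.attrs["creator"]

for name, value in attributes.items():
    print(name, "=", value)
```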

## Exercise: Flat field correction

Flat-field correction is a technique used to improve quality in digital imaging.

The goal is to normalize images and remove artifacts caused by variations in the pixel-to-pixel sensitivity of the detector and/or by distortions in the optical path. (see https://en.wikipedia.org/wiki/Flat-field_correction)

$$ normalized = \frac{raw - dark}{flat - dark} $$

* `normalized`: Image after flat field correction
* `raw`: Raw image. It is acquired with the sample.
* `flat`: Flat field image. It is the response given out by the detector for a uniform input signal. This image is acquired without the sample.
* `dark`: Also named `background` or `dark current`. It is the response given out by the detector when there is no signal. This image is acquired without the beam.

Here is a function implementing the flat field correction:

*Note: make sure you execute the cell for defining this function*

In [None]:
import numpy

def flatfield_correction(raw, flat, dark):
    """
    Apply a flat-field correction to raw data using a flat and a dark image.
    """
    # Make sure that the computation is done using float
    # to avoid type overflow or loss of precision
    raw = raw.astype(numpy.float32)
    flat = flat.astype(numpy.float32)
    dark = dark.astype(numpy.float32)
    # Do the computation
    return (raw - dark) / (flat - dark)
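
The function can be tried on synthetic data before touching real files. In this sketch the detector is simulated with a made-up per-pixel gain and a constant dark level, so the expected result is known in advance; the function is repeated so that the example runs on its own.

```python
import numpy


def flatfield_correction(raw, flat, dark):
    """Same function as above, repeated to keep the example self-contained."""
    raw = raw.astype(numpy.float32)
    flat = flat.astype(numpy.float32)
    dark = dark.astype(numpy.float32)
    return (raw - dark) / (flat - dark)


# Simulated detector: each pixel has its own gain and a dark level of 100
signal = numpy.array([[0.5, 0.25], [1.0, 0.1]])   # What we want to recover
gain = numpy.array([[200, 400], [800, 1000]])
dark = numpy.full((2, 2), 100, dtype=numpy.uint16)

flat = (dark + gain).astype(numpy.uint16)          # Uniform input, no sample
raw = (dark + gain * signal).astype(numpy.uint16)  # Same detector, with sample

normalized = flatfield_correction(raw, flat, dark)
print(normalized)
```

The raw values differ widely between pixels, while the normalized image recovers the simulated `signal`.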

**Note**: If you would like to plot an image, you can use `matplotlib`'s `imshow` function.

The `%matplotlib` "magic" command should be called once first.

In [None]:
%matplotlib inline

from matplotlib import pyplot as plt

In [None]:
import numpy
plt.imshow(numpy.random.random((20, 60)))

### Exercise 1

1. Browse the file ``data/ID16B_diatomee.h5``
2. Get **a single** raw dataset, a flat field dataset and a dark image dataset from this file
3. Apply the flat field correction
4. Save the result into a new HDF5 file

If you are stuck, the solution is provided in the file [solutions/exercise1.py](./solutions/exercise1.py)

In [None]:
from jupyterlab_h5web import H5Web
H5Web("data/ID16B_diatomee.h5")

In [None]:
# or
from h5glance import H5Glance
H5Glance("data/ID16B_diatomee.h5")

In [None]:
import h5py

with h5py.File("data/ID16B_diatomee.h5", mode="r") as h5s:
    pass
    # this is a comment

    # step1: Read the data

    # raw_data_path = ...
    # raw_data = ...

    # flat_path = ...
    # flat = ...

    # dark_path = ...
    # dark = ...

# step2: Compute the result

# normalized = flatfield_correction(raw_data, flat, dark)

# step3: Save the result

# ...


### Exercise 2

1. Apply the flat field correction **to all** raw data available (use the same flat and dark for all the images)
2. Save each result into different datasets of the same HDF5 file

If you are stuck, the solution is provided in the file [solutions/exercise2.py](./solutions/exercise2.py)


### Exercise 3

From the previous exercise, we can see that the flat field correction was not very good for the last images.

Another flat field was acquired at the end of the acquisition.

We could use this information to compute a flat field closer to the image we want to normalize. This can be done with a linear interpolation between the two flat images, using the image name (which varies between 0 and 500 in this case) as the interpolation factor.
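
The interpolation itself can be sketched on synthetic arrays. The helper `interpolate_flat` and its `last_index` parameter are hypothetical names introduced for this illustration; reading the images from the HDF5 file is left to the exercise.

```python
import numpy


def interpolate_flat(flat_start, flat_end, index, last_index=500):
    """Linearly interpolate between two flat fields (hypothetical helper).

    `index` is the image number, e.g. int("0250") for the image named "0250".
    """
    weight = index / last_index
    return (1 - weight) * flat_start + weight * flat_end


# Synthetic flat fields standing in for the first and last acquired flats
flat_start = numpy.full((2, 2), 100.0)
flat_end = numpy.full((2, 2), 200.0)

flat_0 = interpolate_flat(flat_start, flat_end, int("0000"))
flat_250 = interpolate_flat(flat_start, flat_end, int("0250"))
flat_500 = interpolate_flat(flat_start, flat_end, int("0500"))
print(flat_250)
```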

1. For each raw image, compute the corresponding flat field using linear interpolation (between `flatfield/0000` and `flatfield/0500`)
2. Save each result into different datasets in a single HDF5 file

If you are stuck, the solution is provided in the file [solutions/exercise3.py](./solutions/exercise3.py)

## Advanced usage

### Dataset compression

Install [hdf5plugin](https://github.com/silx-kit/hdf5plugin) and `import hdf5plugin`.

HDF5 provides dataset compression support.
With `h5py` GZIP and LZF compression are available (see [compression-filters](https://docs.h5py.org/en/stable/high/dataset.html#lossless-compression-filters)).
Yet, there are many [third-party compression filters for HDF5](https://portal.hdfgroup.org/display/support/Registered+Filter+Plugins) available.

[hdf5plugin](https://github.com/silx-kit/hdf5plugin) allows usage of some of those compression filters with `h5py` (Blosc, Blosc2, BitShuffle, BZip2, FciDecomp, LZ4, SZ, SZ3, Zfp, ZStd).


In [None]:
import h5py
import hdf5plugin  # Enables reading datasets stored with the supported compression filters

To write compressed datasets, see:

- [Group.create_dataset](https://docs.h5py.org/en/stable/high/group.html#h5py.Group.create_dataset) `chunks`, `compression` and `compression_opts` parameters.
- ["Chunked Storage" documentation](https://docs.h5py.org/en/stable/high/dataset.html#chunked-storage)
- [hdf5plugin documentation](https://github.com/silx-kit/hdf5plugin#documentation)
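
As an illustration of these parameters, the sketch below writes the same made-up image twice, once uncompressed and once with the GZIP filter built into `h5py`, and compares the space used in the file. The chunk shape and compression level are arbitrary choices; `hdf5plugin` filters are not used here.

```python
import os
import tempfile

import h5py
import numpy

# A mostly empty image compresses well
data = numpy.zeros((256, 256), dtype=numpy.uint16)
data[100:150, 100:150] = 1000

with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "compressed.h5")  # hypothetical file name
    with h5py.File(path, mode="w") as h5file:
        h5file.create_dataset("raw", data=data)
        h5file.create_dataset(
            "compressed",
            data=data,
            chunks=(64, 64),        # Compression is applied chunk by chunk
            compression="gzip",
            compression_opts=4,     # GZIP level, from 0 to 9
        )

    with h5py.File(path, mode="r") as h5file:
        dataset = h5file["compressed"]
        filter_name = dataset.compression
        level = dataset.compression_opts
        raw_bytes = h5file["raw"].id.get_storage_size()
        compressed_bytes = dataset.id.get_storage_size()
        restored = dataset[()]  # Decompression is transparent when reading

print(filter_name, level, raw_bytes, compressed_bytes)
```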

### Soft and external links

An HDF5 file can contain links to groups or datasets:
- within the same file: see [h5py.SoftLink](https://docs.h5py.org/en/stable/high/group.html#soft-links)
- in another file: see [h5py.ExternalLink](https://docs.h5py.org/en/stable/high/group.html#external-links)

Links can be dangling if the destination does not exist.
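
A minimal sketch of the three cases, using two temporary files with made-up names and paths: a soft link, an external link and a dangling link.

```python
import os
import tempfile

import h5py
import numpy

with tempfile.TemporaryDirectory() as tmpdir:
    main_path = os.path.join(tmpdir, "main.h5")    # hypothetical file names
    other_path = os.path.join(tmpdir, "other.h5")

    with h5py.File(other_path, mode="w") as other:
        other["measurement/data"] = numpy.arange(3)

    with h5py.File(main_path, mode="w") as main:
        main["group/data"] = numpy.arange(5)
        # Link to an object of the same file
        main["alias"] = h5py.SoftLink("/group/data")
        # Link to an object stored in another file
        main["remote"] = h5py.ExternalLink(other_path, "/measurement/data")
        # Creating a link does not check that its destination exists
        main["dangling"] = h5py.SoftLink("/does/not/exist")

    with h5py.File(main_path, mode="r") as main:
        alias_data = main["alias"][()]
        remote_data = main["remote"][()]

        # The link itself can be inspected without following it
        link_target = main.get("dangling", getlink=True).path

        # Following a dangling link fails
        try:
            main["dangling"]
        except KeyError:
            dangling_failed = True
        else:
            dangling_failed = False

print(alias_data, remote_data, link_target, dangling_failed)
```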

### External dataset

An HDF5 file can contain datasets that are stored in external binary files: See [Group.create_dataset](https://docs.h5py.org/en/stable/high/group.html#h5py.Group.create_dataset) `external` parameter.

### Virtual Dataset (aka. VDS)

A virtual dataset maps multiple datasets into a single one.
Once created, it behaves like any other dataset.

See https://docs.h5py.org/en/stable/vds.html
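
The following sketch follows the pattern of the h5py documentation: three small source files (made up for the example) are stacked into one virtual dataset with `h5py.VirtualLayout` and `h5py.VirtualSource`.

```python
import os
import tempfile

import h5py
import numpy

with tempfile.TemporaryDirectory() as tmpdir:
    # Create three source files, each holding a 1D dataset named "data"
    source_paths = []
    for index in range(3):
        source_path = os.path.join(tmpdir, f"source_{index}.h5")
        with h5py.File(source_path, mode="w") as h5file:
            h5file["data"] = numpy.full((4,), index, dtype=numpy.int32)
        source_paths.append(source_path)

    # Describe how the sources map into a (3, 4) virtual dataset
    layout = h5py.VirtualLayout(shape=(3, 4), dtype=numpy.int32)
    for index, source_path in enumerate(source_paths):
        layout[index] = h5py.VirtualSource(source_path, "data", shape=(4,))

    vds_path = os.path.join(tmpdir, "vds.h5")
    with h5py.File(vds_path, mode="w", libver="latest") as h5file:
        # fillvalue is used where a source is missing
        h5file.create_virtual_dataset("all_data", layout, fillvalue=-1)

    # Reading goes through the virtual dataset like any other dataset
    with h5py.File(vds_path, mode="r") as h5file:
        is_virtual = h5file["all_data"].is_virtual
        merged = h5file["all_data"][()]

print(merged)
```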


### HDF5 file locking: A word of caution

Do **NOT** open an HDF5 file that is being written elsewhere without taking precautions.

By default, HDF5 locks the file even for reading, and other processes cannot open it for writing.
This can be an issue, e.g., during acquisition.

**WARNING**: Without file locking, do not open the same file twice for writing, or the file will be corrupted.

Workarounds:

- Helper to handle HDF5 file locking: [`silx.io.h5py_utils.File`](http://www.silx.org/doc/silx/latest/modules/io/h5py_utils.html#silx.io.h5py_utils.File)
- HDF5 file locking can be disabled by setting the `HDF5_USE_FILE_LOCKING` environment variable to `FALSE`.
- With recent versions of `h5py` (>= v3.5.0): [`h5py.File`'s `locking` argument](https://github.com/h5py/h5py/blob/f155036478ca458924d2c46edfd6bfb9e6e32fb5/h5py/_hl/files.py#L443-L451)
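
The environment variable workaround can be applied from Python. This is a sketch under one assumption: the variable is set before `h5py` is imported, which is the safe choice since the HDF5 library may read it only once. The file used here is a temporary one created for the example, and it is opened read-only.

```python
import os

# Assumption: set the variable before importing h5py so that HDF5 sees it
os.environ["HDF5_USE_FILE_LOCKING"] = "FALSE"

import tempfile

import h5py
import numpy

with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "acquisition.h5")  # hypothetical file name
    with h5py.File(path, mode="w") as h5file:
        h5file["data"] = numpy.arange(4)

    # With locking disabled, only open files read-only
    # unless you are sure nobody else is writing to them
    with h5py.File(path, mode="r") as h5file:
        values = h5file["data"][()].tolist()

print(values)
```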

### Chunked storage

By default, HDF5 datasets use contiguous storage: the data is laid out in the file as one block, in C (row-major) order. For specific access patterns, or to enable compression or resizing, you may want to store the dataset in chunks instead. See [h5py chunked storage](https://docs.h5py.org/en/stable/high/dataset.html#chunked-storage) (for advanced usage)
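
As a sketch, a stack of images can be chunked with one image per chunk, so that reading a single image touches a single chunk. The dataset names and sizes are made up; the last dataset shows that chunking also allows a dataset to grow with `resize`.

```python
import os
import tempfile

import h5py
import numpy

with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "chunks.h5")  # hypothetical file name
    with h5py.File(path, mode="w") as h5file:
        contiguous = h5file.create_dataset(
            "contiguous", shape=(10, 64, 64), dtype=numpy.uint16
        )
        contiguous_chunks = contiguous.chunks  # None: no chunking

        # One chunk per image
        chunked = h5file.create_dataset(
            "chunked", shape=(10, 64, 64), dtype=numpy.uint16, chunks=(1, 64, 64)
        )
        image_chunks = chunked.chunks

        # A chunked dataset can be resizable along axes set to None in maxshape
        growing = h5file.create_dataset(
            "growing",
            shape=(0, 64, 64),
            maxshape=(None, 64, 64),
            chunks=(1, 64, 64),
            dtype=numpy.uint16,
        )
        growing.resize(2, axis=0)  # Make room for two images
        growing[1] = numpy.ones((64, 64), dtype=numpy.uint16)
        grown_shape = growing.shape

print(contiguous_chunks, image_chunks, grown_shape)
```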

### Practical tools

- conversion:
    - [`silx convert`](http://www.silx.org/doc/silx/latest/applications/convert.html): To convert EDF, or spec files to HDF5

- reading/writing HDF5 helpers:
    - [`silx.io.dictdump`](http://www.silx.org/doc/silx/latest/modules/io/dictdump.html): `h5todict`, `dicttoh5`
    - [`silx.io.utils.h5py_read_dataset`](http://www.silx.org/pub/doc/silx/latest/modules/io/utils.html#silx.io.utils.h5py_read_dataset)

### A word about NeXus

[NeXus](https://www.nexusformat.org/) is a data format for neutron, x-ray, and muon science.

It aims to be a common data format that makes collaboration between scientists easier.

If you intend to store data that will be shared, it gives you a 'standard way' of storing it.

The main advantage is compatibility: your data files can be read by existing software that respects the NeXus format, and your own software can read NeXus datasets from other sources.

* an example on [how to store tomography raw data](http://download.nexusformat.org/doc/html/classes/applications/NXtomo.html?highlight=tomography)
* an example to store [tomography application (3D reconstruction)](http://download.nexusformat.org/doc/html/classes/applications/NXtomoproc.html?highlight=tomography)


## Conclusion

[h5py](https://www.h5py.org/) provides access to HDF5 file content from Python through:

- [`h5py.File`](https://docs.h5py.org/en/stable/high/file.html) opens an HDF5 file:
  - Do not forget the mode: `'r'`, `'a'`, `'w'`.
  - Use a `with` statement or do not forget to `close` the file.
- [`h5py.Group`](https://docs.h5py.org/en/stable/high/group.html) provides a key-value mapping `dict`-like access to the HDF5 structure.
- [`h5py.Dataset`](https://docs.h5py.org/en/stable/high/dataset.html) gives access to data as `numpy.ndarray`.