# Data IO (input/output)


# Introduction

ESRF data come in (too many) different formats:

* Specfile
* EDF
* HDF5

And specific detector formats:

* MarCCD
* Pilatus CBF
* Dectris Eiger
* …


HDF5 is expected to become the standard ESRF data format. Some beamlines have already switched.

# Accessing ESRF data

## Libraries


* h5py
    * Access to HDF5 files
* FabIO
    * Provides access to several image data formats
    * Managed by the DAU
* silx
    * Provides a uniform way to access any data
    * Helper to simplify the transition to HDF5
    * `silx view` to show the file structure
    * Also provides data processing functions
    * Managed by the DAU

These libraries are already installed on most ESRF computers.

They are cross-platform (available for Windows, Linux and macOS).

They are also available as source code under permissive open-source licenses (MIT for silx and FabIO, BSD for h5py):

* https://github.com/silx-kit/silx
* https://github.com/silx-kit/fabio
* https://github.com/h5py/h5py

## Spec files

* Text format written by the Spec sequencer
* Contains the evolution of measurements and instruments during a scan
* We no longer recommend using this format
* `silx` provides an HDF5-like read access to Spec files

### Spec compatibility

* PyMca was previously often used as a Python library to read Spec files
* Using silx is now preferred

In [None]:
# instead of
from PyMca5.PyMca import specfilewrapper
# prefer using
from silx.io import specfilewrapper

### How to read a spec file

An example is given later in [spec files using silx](#Read-Spec-file-as-an-HDF5)

## EDF files


* ESRF data format
* It contains
    * A header containing various information
    * 1D/2D/3D array of float/integer
    * Multi-frames (more than one image in a single file)
    * Often used as file series
* Library
    * Use `fabio`
    * `silx` provides an HDF5-like read access

## Read a single EDF image

In [None]:
import fabio

image = fabio.open("data/medipix.edf")

In [None]:
# Here is the data as a numpy array
print(image.data)
# Here is the header as a key-value dictionary
print(image.header.keys())

In [None]:
# Better to use a context manager
with fabio.open("data/medipix.edf") as image:
    print(image.header["dir"])

## Read a multi-frame EDF image

A file containing many frames.

In [None]:
import fabio

with fabio.open("data/ID16B_diatomee.edf") as image:

    print("Nb frames: %d" % image.nframes)

    for frame in image.frames():

        average = frame.data.mean()
        
        message = "Frame ID: %d    Data average: %0.2f"
        print(message % (frame.index, average))

## Read a file-series of EDF images

A file-series is composed of several files that have to be iterated over, and each file may contain several frames. `fabio.open_series` can be used to read it.

- http://www.silx.org/doc/fabio/latest/getting_started.html#fabio-file-series

In [None]:
import fabio

with fabio.open_series(first_filename="data/ID19_D2H2T2_0000.edf") as series:

    print("Nb frames: %d" % series.nframes)

    for frame in series.frames():

        average = frame.data.mean()

        message = "Filename: %s    Frame ID: %d    Data average: %0.2f"
        print(message % (frame.file_container.filename, frame.index, average))

## Write an EDF image

In [None]:
import numpy
import fabio

image = numpy.random.rand(10, 10)
metadata = {'pixel_size': '0.2'}

image = fabio.edfimage.EdfImage(data=image, header=metadata)
image.write('edf_writing_example.edf')

## Other formats using FabIO

### Reading other formats

FabIO supports image formats from most manufacturers: 
Mar, Rayonix, Bruker, Dectris, ADSC, Rigaku, Oxford, General Electric…

In [None]:
import fabio

pilatus_image    = fabio.open('filename.cbf')
marccd_image     = fabio.open('filename.mccd')

tiff_image       = fabio.open('filename.tif')
fit2d_mask_image = fabio.open('filename.msk')
jpeg_image       = fabio.open('filename.jpg')

# HDF5

## HDF5 introduction

HDF5 (Hierarchical Data Format, version 5) is a file format designed to structure and store large and complex data.

* Hierarchical collection of data (directory and file, UNIX-like path)
* High-performance (binary)
* Standard exchange format for heterogeneous data
* Self-describing extensible types, rich metadata
* Supports data compression

Data can be almost anything: images, tables, graphs, documents
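As a minimal sketch of the compression support listed above, the following writes the same array with and without gzip compression using `h5py` and compares the file sizes. The file names are placeholders created in a temporary directory, and the gain obtained in practice depends entirely on the data.

```python
import os
import tempfile

import h5py
import numpy

# Highly redundant data, so that compression has a visible effect
data = numpy.zeros((512, 512), dtype=numpy.float32)

tmpdir = tempfile.mkdtemp()
raw_path = os.path.join(tmpdir, "raw.h5")          # placeholder file name
gzip_path = os.path.join(tmpdir, "compressed.h5")  # placeholder file name

with h5py.File(raw_path, mode="w") as h5file:
    h5file.create_dataset("data", data=data)

with h5py.File(gzip_path, mode="w") as h5file:
    # Compression implies chunked storage; h5py chooses the chunk shape here
    h5file.create_dataset("data", data=data, compression="gzip")

raw_size = os.path.getsize(raw_path)
gzip_size = os.path.getsize(gzip_path)
print("Uncompressed: %d bytes, gzip: %d bytes" % (raw_size, gzip_size))

# Reading is transparent: the data are decompressed on access
with h5py.File(gzip_path, mode="r") as h5file:
    restored = h5file["data"][...]
```

The reading code is the same whether or not the dataset is compressed.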



## HDF5 description

The container is mostly structured with:

* **File**: the root of the container
* **Group**: a grouping structure containing groups or datasets
* **Dataset**: a multidimensional array of data elements
* And other features (links, attributes, datatypes)

<img src="images/hdf5_model.png" style="height:50%;margin-left:auto;margin-right:auto;padding:0em;">
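The elements of this model can be sketched with `h5py` as follows. The file name, the group and dataset names and the `units` attribute are only illustrative choices, not a convention used by any ESRF file.

```python
import os
import tempfile

import h5py
import numpy

path = os.path.join(tempfile.mkdtemp(), "structure.h5")  # placeholder file name

with h5py.File(path, mode="w") as h5file:
    # Group: behaves like a directory
    group = h5file.create_group("entry")
    # Dataset: a multidimensional array stored inside the group
    dataset = group.create_dataset("image", data=numpy.ones((4, 5)))
    # Attribute: small metadata attached to a group or a dataset
    dataset.attrs["units"] = "counts"
    # Soft link: another path pointing to the same dataset
    h5file["latest"] = h5py.SoftLink("/entry/image")

with h5py.File(path, mode="r") as h5file:
    # The dataset is reachable through its path or through the link
    shape = h5file["/latest"].shape
    units = h5file["/entry/image"].attrs["units"]
    print("Shape:", shape, "Units:", units)
```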


## HDF5 example

Here is an example of a file generated by pyFAI

<img src="images/hdf5_example.png" style="height:50%;margin-left:auto;margin-right:auto;padding:0em;">

## Read an HDF5

In [None]:
import h5py

h5file = h5py.File('data/test.h5', "r")

# print available names at the first level
print("First children:", h5file['/'].keys())

In [None]:
# Get a dataset from a sub group
dataset = h5file['/diff_map_0004/data/map']

# Here we only read metadata from the dataset
print("Dataset:", dataset.shape, dataset.size, dataset.dtype)

In [None]:
# Remember to close the file
h5file.close()

In [None]:
# Or better, use a context manager
# The file is closed for you
with h5py.File('data/test.h5', "r") as h5file:
    print(h5file['/'].keys())

## HDF5 mimics numpy-array

The data are read from the file only when they are needed.

In [None]:
import h5py
h5file = h5py.File('data/test.h5', "r")
dataset = h5file['/diff_map_0004/data/map']

In [None]:
# Read and apply an operation
print(dataset[5, 5, 0:5])
print(2 * dataset[0, 5, 0:5])

In [None]:
# copy the data and store it as a numpy-array
b = dataset[...]
b[0, 0, 0:5] = 0
print(dataset[0, 0, 0:5])
print(b[0, 0, 0:5])

In [None]:
h5file.close()

## Write an HDF5

* http://docs.h5py.org/en/stable/high/group.html
* http://docs.h5py.org/en/stable/high/dataset.html

In [None]:
import numpy
import h5py

# Create a 2D array
data = numpy.arange(100 * 100)
data.shape = 100, 100

# Notice the mode='w', as 'write'
with h5py.File('my_first_one.h5', mode='w') as h5file:

    # write data into a dataset from the root
    h5file['/data1'] = data

    # write data into a dataset from group1
    h5file['/group1/data2'] = data

    # Or with a functional API
    g = h5file.create_group("/group2")
    g.create_dataset("data3", data=data)

## Useful tools for HDF5

In [None]:
!h5ls -r my_first_one.h5 

In [None]:
from h5glance import H5Glance
H5Glance("my_first_one.h5")

* `h5py`: Connector to HDF5 files
* `silx view`: Qt file browser
* `h5glance`: File browser for jupyter

The HDF Group provides a web page listing more tools: https://support.hdfgroup.org/HDF5/doc/RM/Tools.html

# Module `silx.io`

* Tries to simplify the transition to HDF5
    * h5py-like API
    * Single way to access Spec/EDF/HDF5 files
    * Based on NeXus specifications http://www.nexusformat.org/
* Read-only

## General mapping from Spec file

silx can expose a Spec file with an HDF5-like mapping.

![mapping_spec](images/spech5_arrows.png "hdf5-like mapping for spec files")


## General mapping from EDF image

silx can expose an EDF file (or any format supported by `fabio`) with an HDF5-like mapping.

![mapping_spec](images/fabioh5_arrows.png "hdf5-like mapping for EDF files")


## Display the mapping with tools

* `silx view`: a Qt program launched from the command line
* `silx.io.utils.h5ls`

In [None]:
import silx.io
import silx.io.utils

with silx.io.open('data/oleg.dat') as h5file:
    string = silx.io.utils.h5ls(h5file)
    print(string)

## Read Spec file as an HDF5

In [None]:
import time
import silx.io

with silx.io.open('data/oleg.dat') as h5file:
    # Available scans
    print("First children:", h5file['/'].keys())

    # Available measurements from the scan 94.1
    print("Content of measurement:", h5file['/94.1/measurement'].keys())

    # Get data from measurement
    epoch = h5file['/94.1/measurement/Epoch']
    bpmi = h5file['/94.1/measurement/bpmi']
    # Use a distinct loop variable so the file object is not shadowed
    for t, value in zip(epoch, bpmi):
        t = time.strftime("%X", time.gmtime(t))
        print("%s   BPMi: %0.4e" % (t, value))

For more information and examples you can read the silx IO tutorial: https://github.com/silx-kit/silx-training/blob/master/silx/io/io.pdf

## Read EDF image as an HDF5

In [None]:
import time
import silx.io

with silx.io.open('data/ID16B_diatomee.edf') as h5file:
    # Access to the frames
    frames = h5file['/scan_0/instrument/detector_0/data']
    print("Number of frames:", len(frames))
    print("Size of an image:", frames[0].shape)

    # Access to motors, monitor, timestamp ([...] copies the values in memory)
    srot = h5file['scan_0/instrument/positioners/srot'][...]
    mon = h5file['scan_0/measurement/mon'][...]
    timestamp = h5file['scan_0/instrument/detector_0/others/time_of_day'][...]

for t, s, m in zip(timestamp, srot, mon):
    t = time.strftime("%X", time.gmtime(t))
    message = "%s   Rot:% 5.1fdeg   Monitor: %0.2f"
    print(message % (t, s, m))

## Read HDF5 using silx

For convenience, ``silx`` also provides the h5py API for HDF5 files.

In [None]:
import silx.io
h5file = silx.io.open('data/test.h5')

# print available names at the first level
print("First children:", h5file['/'].keys())

# reaching a dataset from a sub group
dataset = h5file['/diff_map_0004/data/map']

# shape, size and dtype are metadata: the stored data are not read
print("Dataset:", dataset.shape, dataset.size, dataset.dtype)

h5file.close()

# Conversion tools


- `fabio-convert`: converts raster images
- `silx convert`: converts EDF or Spec files to HDF5

# Exercise: Flat field correction

Flat-field correction is a technique used to improve quality in digital imaging.

The goal is to remove from the images the artifacts caused by variations in the pixel-to-pixel sensitivity of the detector and/or by distortions in the optical path.

$normalized = \frac{raw - dark}{flat - dark}$

* `normalized`: Image after flat field correction
* `raw`: Raw image. It is an acquisition of the sample.
* `flat`: Flat field image. It is the response given out by the detector for a uniform input signal. It is acquired without the sample.
* `dark`: Also named `background` or `dark current`. It is the response given out by the detector when there is no signal. The image is acquired without the beam.
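To make the formula concrete, here is a small worked example on a hypothetical two-pixel detector whose second pixel is half as sensitive as the first. All values are invented for illustration; the sample is assumed to transmit 50% of the beam on both pixels.

```python
import numpy

# Hypothetical detector readings (arbitrary counts)
dark = numpy.array([[10.0, 10.0]])    # no beam
flat = numpy.array([[210.0, 110.0]])  # beam, no sample
raw = numpy.array([[110.0, 60.0]])    # beam and sample

# The raw image differs between the two pixels only because of their sensitivity
normalized = (raw - dark) / (flat - dark)
print(normalized)
```

Both pixels end up with the same normalized value, which is the transmission of the sample: the difference in sensitivity has been removed.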

# Exercise: Implementation with EDF files

Here is a helper, already provided, to compute the flat field correction.

In [None]:
# Here we provide some helpers

import fabio
import numpy

def flatfield_correction(raw, flat, dark):
    """
    Apply a flat-field correction to a raw data using a flat and a dark.
    """
    # Make sure that the computation is done using float
    # to avoid type overflow or loss of precision
    raw = raw.astype(numpy.float32)
    flat = flat.astype(numpy.float32)
    dark = dark.astype(numpy.float32)
    # Do the computation
    return (raw - dark) / (flat - dark)

And a `matplotlib`-based function to help display the data.

In [None]:
def imshowmany(*args, **kwargs):
    """
    Display as images all the arrays provided as arguments.

    The title of each image is the name of its keyword argument.
    Positional arrays are titled "arg0", "arg1", and so on.
    """
    from matplotlib import pyplot

    # Collect all the images as name -> array
    images = dict(kwargs)
    for i, arg in enumerate(args):
        if isinstance(arg, dict):
            images.update(arg)
        else:
            images["arg%d" % i] = arg

    fig = pyplot.figure()
    columns = 3
    # Smallest grid with at most 3 columns that fits all the images
    nbcols = max(1, min(columns, len(images)))
    nbrows = max(1, (len(images) + columns - 1) // columns)
    for i, (key, value) in enumerate(images.items()):
        axes = fig.add_subplot(nbrows, nbcols, i + 1)
        axes.imshow(value)
        axes.set_title(key)

Here is an implementation of a flat field correction applied to single EDF files.

The sample is a diatom, a unicellular alga mounted on a needle.

In [None]:
%pylab notebook

In [None]:
# Read the data

with fabio.open("data/ID16_diatomee/dark.edf") as image:
    dark = image.data
with fabio.open("data/ID16_diatomee/flat.edf") as image:
    flat = image.data
with fabio.open("data/ID16_diatomee/data.edf") as image:
    raw = image.data

# Compute the result

normalized = flatfield_correction(raw, flat, dark)

# Save the result

image = fabio.edfimage.EdfImage(data=normalized)
image.save("result.edf")

# Check the saved result

with fabio.open("result.edf") as image:
    saved = image.data
imshowmany(Before=raw, After=saved)

# Exercise 1

1. Browse the file ``data/ID16B_diatomee.h5``
2. Read a single raw image, a flat field and a dark image from this file
3. Apply the flat field correction
4. Save the result into a new HDF5 file

If you are stuck, the solution is provided in the file [solutions/exercise1.py](./solutions/exercise1.py)

In [None]:
from h5glance import H5Glance
H5Glance("data/ID16B_diatomee.h5")

In [None]:
import h5py

# Read the data

...

# Compute the result

normalized = flatfield_correction(raw, flat, dark)

# Save the result

...


# Exercise 2

1. Apply the flat field correction to all the available raw data (use the same flat and dark for all the images)
2. Save each result into a new dataset, all together in a single HDF5 file

If you are stuck, the solution is provided in the file [solutions/exercise2.py](./solutions/exercise2.py)


# Exercise 3

From the previous exercise, we can see that the flat field correction was not very good for the last images.

But another flat field was acquired at the end of the acquisition.

We could use this information to compute a flat field closer to the image we want to normalize. This can be done with a linear interpolation between 0 and 500, using the name of the image as its position in that range.

1. For each raw data to normalize, compute a flat field using a linear interpolation (between `flatfield/0000` and `flatfield/0500`)
2. Save each result in a file

If you are stuck, the solution is provided in the file [solutions/exercise3.py](./solutions/exercise3.py)
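The interpolation step alone, independently of any file access, could be sketched as follows. This is not the provided solution: `interpolate_flat` and its parameters are hypothetical names, and the arrays are synthetic stand-ins for `flatfield/0000` and `flatfield/0500`.

```python
import numpy

def interpolate_flat(flat_start, flat_end, index, index_start=0, index_end=500):
    """Linearly interpolate a flat field for the image at position `index`."""
    # The weight goes from 0 (at index_start) to 1 (at index_end)
    weight = (index - index_start) / float(index_end - index_start)
    # Convert to float to avoid integer overflow or truncation
    flat_start = flat_start.astype(numpy.float32)
    flat_end = flat_end.astype(numpy.float32)
    return (1.0 - weight) * flat_start + weight * flat_end

# Synthetic flat fields standing in for the two acquired ones
flat_0000 = numpy.full((2, 2), 100, dtype=numpy.uint16)
flat_0500 = numpy.full((2, 2), 200, dtype=numpy.uint16)

# Flat field estimated for the image halfway through the acquisition
flat_0250 = interpolate_flat(flat_0000, flat_0500, index=250)
print(flat_0250)
```

The interpolated flat can then be passed to `flatfield_correction` in place of the single flat used in the previous exercises.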

# Conclusion

Recommended libraries according to the use case and the file format.

| Formats              | Read            | Write |
|----------------------|-----------------|-------|
| HDF5                 | silx/h5py       | h5py  |
| Specfile             | silx            |       |
| EDF                  | silx/fabio      | fabio |
| Other raster formats | silx/fabio      | fabio |