# Data IO (input/output)


# Introduction

ESRF data come in (too many) different formats:

* Specfile
* EDF
* HDF5

And specific detector formats:

* MarCCD
* Pilatus CBF
* Dectris Eiger
* …


HDF5 is expected to become the standard ESRF data format. Some beamlines have already switched.

# Accessing ESRF data

## Libraries


* h5py
    * Access to HDF5 files
* FabIO
    * Provides access to several image data formats
    * Managed by the DAU
* silx
    * Provides a uniform way to access any data
    * Helper to simplify the transition to HDF5
    * `silx view` to show the file structure
    * Also provides data processing functions
    * Managed by the DAU

These libraries are already installed on most ESRF computers.

They are cross-platform (available for Windows, Linux, macOS).

The source code is also available (silx and FabIO under the MIT license, h5py under a BSD license):

* https://github.com/silx-kit/silx
* https://github.com/silx-kit/fabio
* https://github.com/h5py/h5py

## Spec files

* Text format written by the Spec sequencer
* Records the evolution of measurements and instrument positions during a scan
* We no longer recommend using this format
* `silx` provides HDF5-like read access to Spec files

### Spec compatibility

* PyMca was previously often used as a Python library to read Spec files
* silx is now the preferred choice

In [None]:
# instead of
from PyMca5.PyMca import specfilewrapper
# prefer using
from silx.io import specfilewrapper

### How to read a spec file

An example is given later in [Read Spec file as an HDF5](#Read-Spec-file-as-an-HDF5).

## EDF files


* ESRF data format
* A file contains
    * A header holding metadata as key-value pairs
    * A 1D/2D/3D array of floats or integers
    * Optionally several frames (more than one image in a single file)
    * Often used as a file series
* Library
    * Use `fabio`
    * `silx` provides HDF5-like read access

## Read a single EDF image

In [36]:
import fabio

image = fabio.open("data/medipix.edf")

In [None]:
# Here is the data as a numpy array
print(image.data)
# Here is the header as key-value dictionary
print(image.header.keys())

In [None]:
# Better to use a context manager
with fabio.open("data/medipix.edf") as image:
    print(image.header["dir"])

## Read a multi-frame EDF image

A single EDF file can contain many frames.

In [None]:
import fabio

with fabio.open("data/ID16B_diatomee.edf") as image:

    print("Nb frames: %d" % image.nframes)

    for frame in image.frames():

        average = frame.data.mean()
        
        message = "Frame ID: %d    Data average: %0.2f"
        print(message % (frame.index, average))

## Read a file series of EDF images

A series is made of many files that are iterated in order; each file may itself contain several frames.

- http://www.silx.org/doc/fabio/latest/getting_started.html#fabio-file-series

In [None]:
import fabio

with fabio.open_series(first_filename="data/ID19_D2H2T2_0000.edf") as series:

    print("Nb frames: %d" % series.nframes)

    for frame in series.frames():

        average = frame.data.mean()

        message = "Filename: %s    Frame ID: %d    Data average: %0.2f"
        print(message % (frame.file_container.filename, frame.index, average))

## Write an EDF image

In [22]:
import numpy
import fabio

image = numpy.random.rand(10, 10)
metadata = {'pixel_size': '0.2'}

image = fabio.edfimage.EdfImage(data=image, header=metadata)
image.write('edf_writing_example.edf')

## Other formats using FabIO

### Reading other formats

FabIO supports image formats from most manufacturers: 
Mar, Rayonix, Bruker, Dectris, ADSC, Rigaku, Oxford, General Electric…

In [None]:
import fabio

pilatus_image    = fabio.open('filename.cbf')
marccd_image     = fabio.open('filename.mccd')

tiff_image       = fabio.open('filename.tif')
fit2d_mask_image = fabio.open('filename.msk')
jpeg_image       = fabio.open('filename.jpg')

## File conversion

Using FabIO you can directly convert data to another format.

You can also use the `fabio-convert` command-line tool.

In [49]:
import fabio
image = fabio.open('data/medipix.edf')
image = image.convert('tif')
image.save('filename.tif')

# HDF5

## HDF5 introduction

HDF5 (Hierarchical Data Format, version 5) is a file format for structuring and storing large volumes of complex data.

* Hierarchical collection of data (like directories and files, with UNIX-like paths)
* High-performance (binary)
* Standard exchange format for heterogeneous data
* Self-describing extensible types, rich metadata
* Supports data compression

Data can be almost anything: images, tables, graphs, documents.
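
As an illustration of the compression support, here is a minimal h5py sketch that stores the same array with and without gzip compression and compares the storage used. The dataset names and the in-memory file (`driver="core"` with `backing_store=False`) are choices made for this example only; real detector data will compress far less than an array of zeros.

```python
import numpy
import h5py

# Highly repetitive data: an easy case for any compressor
data = numpy.zeros((1000, 1000), dtype=numpy.float32)

# The "core" driver with backing_store=False keeps the file in memory,
# so nothing is left on disk by this sketch
with h5py.File("compression_demo.h5", "w", driver="core", backing_store=False) as h5file:
    raw = h5file.create_dataset("raw", data=data)
    packed = h5file.create_dataset(
        "packed", data=data, compression="gzip", compression_opts=4
    )

    # Space actually allocated in the file for each dataset
    raw_bytes = raw.id.get_storage_size()
    packed_bytes = packed.id.get_storage_size()

    # Compression is transparent: reading gives back the original values
    roundtrip_ok = numpy.array_equal(packed[...], data)
    packed_filter = packed.compression

print("Uncompressed:", raw_bytes, "bytes")
print("Compressed:  ", packed_bytes, "bytes")
```

Compression is applied per chunk and is handled by the library on both write and read, so the calling code does not change.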



## HDF5 description

The container is mainly structured with:

* **File**: the root of the container
* **Group**: a grouping structure containing groups or datasets
* **Dataset**: a multidimensional array of data elements
* And other features (links, attributes, datatypes)
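
The elements listed above can be sketched with h5py as follows. All names (`entry`, `image`, `latest`, the `units` attribute) are made up for this example, and the file is kept in memory so nothing is written to disk.

```python
import numpy
import h5py

lines = []

def describe(name, obj):
    """Record the path and the kind of each object found in the file."""
    kind = "Group" if isinstance(obj, h5py.Group) else "Dataset"
    lines.append("%s: %s" % (kind, name))

with h5py.File("structure_demo.h5", "w", driver="core", backing_store=False) as h5file:
    # Group: a container for other groups or datasets
    entry = h5file.create_group("entry")

    # Dataset: a multidimensional array
    image = entry.create_dataset("image", data=numpy.arange(12).reshape(3, 4))

    # Attribute: small metadata attached to a group or a dataset
    image.attrs["units"] = "counts"

    # Soft link: another path pointing to the same dataset
    h5file["latest"] = h5py.SoftLink("/entry/image")

    # Walk the hierarchy starting from the root
    h5file.visititems(describe)

    units = image.attrs["units"]
    linked_shape = h5file["latest"].shape

print("\n".join(lines))
print("units:", units, " shape through the link:", linked_shape)
```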

<img src="images/hdf5_model.png" style="height:50%;margin-left:auto;margin-right:auto;padding:0em;">


## HDF5 example

Here is an example of a file generated by pyFAI:

<img src="images/hdf5_example.png" style="height:50%;margin-left:auto;margin-right:auto;padding:0em;">

## Read an HDF5 file

In [None]:
import h5py

h5file = h5py.File('data/test.h5', "r")

# print available names at the first level
print("First children:", h5file['/'].keys())

In [None]:
# Get a dataset from a sub group
dataset = h5file['/diff_map_0004/data/map']

# Here we only read metadata from the dataset
print("Dataset:", dataset.shape, dataset.size, dataset.dtype)

In [50]:
# Remember to close the file
h5file.close()

In [None]:
# Or better, use a context manager
with h5py.File('data/test.h5', "r") as h5file:
    print(h5file['/'].keys())

## HDF5 datasets mimic NumPy arrays

Data are read from the file only when they are needed.

In [46]:
import h5py
h5file = h5py.File('data/test.h5', "r")
dataset = h5file['/diff_map_0004/data/map']

In [47]:
# Read and apply an operation
print(dataset[5, 5, 0:5])
print(2 * dataset[0, 5, 0:5])

[104.14766  103.352615 103.01642  103.24001  103.27751 ]
[205.95827 206.2795  206.5441  206.48112 206.46625]


In [None]:
# copy the data and store it as a numpy-array
b = dataset[...]
b[0, 0, 0:5] = 0
print(dataset[0, 0, 0:5])
print(b[0, 0, 0:5])

In [44]:
h5file.close()
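
The difference between a dataset handle and a copied array can be sketched without the training data. The file below is created in memory and its content (`map`, filled with ones) is invented for the example.

```python
import numpy
import h5py

with h5py.File("lazy_demo.h5", "w", driver="core", backing_store=False) as h5file:
    dataset = h5file.create_dataset(
        "map", data=numpy.ones((4, 5, 6), dtype=numpy.float32)
    )

    # Slicing reads only the requested values and returns a numpy array
    part = dataset[0, 0, 0:3]

    # dataset[...] reads everything into an independent numpy array
    copy = dataset[...]
    copy[0, 0, 0:3] = 0

    # Modifying the copy does not touch the content of the file
    still_ones = dataset[0, 0, 0:3]

    # A dataset handle is only usable while its file is open
    open_inside = bool(dataset)

# After the file is closed, the handle is invalid but the copy remains usable
open_outside = bool(dataset)

print("Read from file:", still_ones)
print("Modified copy: ", copy[0, 0, 0:3])
print("Handle valid inside / outside the with block:", open_inside, open_outside)
```

When data are needed after the file is closed, copy them to a numpy array first.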

## Write an HDF5 file

* http://docs.h5py.org/en/stable/high/group.html
* http://docs.h5py.org/en/stable/high/dataset.html

In [48]:
import numpy
import h5py

# Create a 2D array
data = numpy.arange(100 * 100)
data.shape = 100, 100

# Notice the mode='w', as 'write'
with h5py.File('my_first_one.h5', mode='w') as h5file:

    # write data into a dataset in the root group
    h5file['/data1'] = data

    # write data into a dataset in group1 (the group is created if missing)
    h5file['/group1/data2'] = data

    # Or with a functional API
    g = h5file.create_group("/group2")
    g.create_dataset("data3", data=data)
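
The file mode matters when a file already exists. The sketch below contrasts `mode='w'`, which starts from an empty file, with `mode='a'`, which keeps the existing content; the temporary directory and file name are illustrative only.

```python
import os
import tempfile

import numpy
import h5py

data = numpy.arange(6).reshape(2, 3)

with tempfile.TemporaryDirectory() as tmpdir:
    # Hypothetical file name, written to a throw-away directory
    filename = os.path.join(tmpdir, "modes_demo.h5")

    # 'w' creates the file (and would truncate an existing one)
    with h5py.File(filename, mode="w") as h5file:
        h5file["/data1"] = data

    # 'a' opens the existing file and keeps what is already there
    with h5py.File(filename, mode="a") as h5file:
        h5file["/group1/data2"] = data * 2

    # 'r' is read-only
    with h5py.File(filename, mode="r") as h5file:
        after_append = sorted(h5file.keys())
        data2 = h5file["/group1/data2"][...]

    # Opening again with 'w' discards the previous content
    with h5py.File(filename, mode="w") as h5file:
        h5file["/data3"] = data

    with h5py.File(filename, mode="r") as h5file:
        after_rewrite = sorted(h5file.keys())

print("After append: ", after_append)
print("After rewrite:", after_rewrite)
```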

## Useful tools for HDF5

* h5py
* silx
* silx view

The HDF Group provides a web page listing more tools: https://support.hdfgroup.org/HDF5/doc/RM/Tools.html

# Module `silx.io`

* Aims to simplify the transition to HDF5
    * h5py-like API
    * Single way to access Spec/EDF/HDF5 files
    * Based on the NeXus specifications http://www.nexusformat.org/
* Read-only

## General mapping from Spec file

silx can expose a Spec file with an HDF5-like mapping.

![mapping_spec](images/spech5_arrows.png "hdf5-like mapping for spec files")


## General mapping from EDF image

silx can expose an EDF file (or any format supported by `fabio`) with an HDF5-like mapping.

![mapping_edf](images/fabioh5_arrows.png "hdf5-like mapping for EDF files")


## Display the mapping with tools

* `silx view`: a command-line Qt program
* `silx.io.utils.h5ls`

In [None]:
import silx.io
import silx.io.utils

with silx.io.open('data/test.h5') as h5file:

    string = silx.io.utils.h5ls(h5file)
    print(string)

## Read HDF5 using silx

For convenience, ``silx`` also exposes HDF5 files through the h5py API.

In [65]:
import silx.io
h5file = silx.io.open('data/test.h5')

# print available names at the first level
print("First children:", h5file['/'].keys())

# get a dataset from a sub group
dataset = h5file['/diff_map_0004/data/map']

# shape, size and dtype are metadata: the stored data are not read here
print("Dataset:", dataset.shape, dataset.size, dataset.dtype)

h5file.close()

First children: <KeysViewHDF5 ['diff_map_0000', 'diff_map_0001', 'diff_map_0002', 'diff_map_0003', 'diff_map_0004']>
Dataset: (29, 78, 100) 226200 float32


## Read Spec file as an HDF5

In [None]:
import time
import silx.io

with silx.io.open('data/oleg.dat') as specfile:

    # Available scans
    print("First children:", specfile['/'].keys())

    # Available measurements from the scan 94.1
    print("Content of measurement:", specfile['/94.1/measurement'].keys())

    # Get data from measurement
    epoch = specfile['/94.1/measurement/Epoch']
    bpmi = specfile['/94.1/measurement/bpmi']
    # The loop variable must not shadow the file object
    for t, value in zip(epoch, bpmi):
        t = time.strftime("%X", time.gmtime(t))
        print("%s   BPMi: %0.4e" % (t, value))

For more information and examples you can read the silx IO tutorial: https://github.com/silx-kit/silx-training/blob/master/silx/io/io.pdf

## Read EDF image as an HDF5

In [None]:
import time
import silx.io

with silx.io.open('data/ID16B_diatomee.edf') as data:

    # Access to the frames
    frames = data['/scan_0/instrument/detector_0/data']
    print("Number of frames:", len(frames))
    print("Size of an image:", frames[0].shape)

    # Access to motors, monitor, timestamp
    srot = data['scan_0/instrument/positioners/srot'][...]
    mon = data['scan_0/measurement/mon'][...]
    timestamp = data['scan_0/instrument/detector_0/others/time_of_day'][...]
    for t, s, m in zip(timestamp, srot, mon):
        t = time.strftime("%X", time.gmtime(t))
        message = "%s   Rot:% 5.1fdeg   Monitor: %0.2f"
        print(message % (t, s, m))

# Convert tools

## silx.io.convert.write_to_h5

Convert a Spec file to HDF5.

In [None]:
from silx.io.convert import write_to_h5

write_to_h5('data/oleg.dat', 'oleg.h5', mode='w')

In [None]:
ls -al oleg.*

# Exercise


1. Read the EDF file ``medipix.edf``.
2. Data processing. The goal is to clamp the pixel values to a new range ([10%, 90%] of the existing one). To do so:

    * Create a mask of the pixels whose value is below the 10% threshold
    * Using this mask, set those pixels to the 10% 'low value'
    * Do the same for the values above the 90% threshold
    * Create the mask of all the modified pixels

3. Store the source, the mask of changed pixels and the result inside ``process.h5``, as below.

   ![Output file structure](images/exercise-result.png)

4. Load ``process.h5`` and list the root content


In [None]:
# Load data/medipix.edf
# ...

# Process the data
# ...

# Save data into a new file (process.h5)
# ...

# Load process.h5 and list the root content
# ...

## Solution

In [None]:
# Load data/medipix.edf
import exercicesolution
import inspect
print(inspect.getsource(exercicesolution.load_data))

In [None]:
# process data
import exercicesolution
import inspect
print(inspect.getsource(exercicesolution.process_data))
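
For readers without the `exercicesolution` module at hand, the processing step can be sketched with numpy alone. This is one possible reading of the exercise, assuming that "[10%, 90%]" refers to fractions of the span between the minimum and maximum pixel values; the function name `clamp_to_range` is hypothetical and the actual solution module may differ.

```python
import numpy

def clamp_to_range(data, low_fraction=0.1, high_fraction=0.9):
    """Clamp data to a sub-range of its own [min, max] interval.

    Returns the clamped copy and the mask of modified pixels.
    Assumption: the fractions are relative to the min-max span.
    """
    vmin, vmax = data.min(), data.max()
    low = vmin + low_fraction * (vmax - vmin)
    high = vmin + high_fraction * (vmax - vmin)

    # Masks of the pixels outside the new range
    low_mask = data < low
    high_mask = data > high

    # Work on a floating point copy so the source is left untouched
    result = data.astype(numpy.float64)
    result[low_mask] = low
    result[high_mask] = high

    # Mask of all modified pixels
    return result, low_mask | high_mask

# Tiny test image with values 0..10: thresholds are 1.0 and 9.0
raw = numpy.arange(11, dtype=numpy.float64).reshape(1, 11)
clamped, mask = clamp_to_range(raw)

print("Clamped range:", clamped.min(), clamped.max())
print("Modified pixels:", int(mask.sum()))
```

`numpy.clip(data, low, high)` gives the same clamped array in one call; the explicit masks are kept here because the exercise also asks for the mask of modified pixels.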

In [None]:
# save data
import exercicesolution
import inspect
print(inspect.getsource(exercicesolution.save_data))

In [None]:
# list root
import exercicesolution
import inspect
print(inspect.getsource(exercicesolution.list_root))

In [None]:
# result
import exercicesolution
raw_data, proc_data, mask = exercicesolution.solution("data/medipix.edf")

In [None]:
%pylab

In [None]:
imshow(mask)

In [None]:
imshow(raw_data)

In [None]:
imshow(proc_data)

# Conclusion

Recommended libraries according to the use case and the file format.

| Formats              | Read            | Write |
|----------------------|-----------------|-------|
| HDF5                 | silx/h5py       | h5py  |
| Specfile             | silx            |       |
| EDF                  | silx/fabio      | fabio |
| Other raster formats | silx/fabio      | fabio |