
<font size = "5"> **AI in STEM - workshop 2020; Day03** </font>

<hr style="height:1px;border-top:4px solid #FF8200" />

# Structure of NSID file format

*Author: Gerd Duscher*

*Date: December 2020*

The pyNSID file format is based on the ``h5py`` package for the ``hdf5`` file format.

The NSID conventions implemented on top of the ``hdf5`` file format are easily accessible through the pyNSID package.

In [None]:
!pip install pyTEMlib


Start with standard imports:

In [None]:
# Ensure python 3 compatibility:
from __future__ import (absolute_import, division, print_function,
                        unicode_literals)

import sys
import warnings

import h5py
import matplotlib.pyplot as plt
import numpy as np

# we also need the sidpy and pyNSID packages
import sidpy
import pyNSID as nsid
from pyTEMlib.nsi_reader import NSIDReader
from pyTEMlib.dm3_reader import DM3Reader

warnings.filterwarnings("ignore", module="numpy.core.fromnumeric")
warnings.filterwarnings("ignore", module="pyNSID.io.nsi_reader")

## Make test file again

Let's make the test file: a small random spectral image written to ``test.hf5`` with pyNSID.

In [None]:
dataset = sidpy.Dataset.from_array(np.random.random([4, 5, 10]), name='new')
dataset.data_type = 'SPECTRAL_IMAGE'
dataset.units = 'nA'
dataset.quantity = 'Current'

dataset.metadata = {'this': 'is just a random dataset'}
dataset.set_dimension(0, sidpy.Dimension(np.arange(dataset.shape[0]), 'x',
                                        units='nm', quantity='Length',
                                        dimension_type='spatial'))
dataset.set_dimension(1, sidpy.Dimension(np.linspace(-2, 2, num=dataset.shape[1], endpoint=True), 'y', 
                                        units='nm', quantity='Length',
                                        dimension_type='spatial'))
dataset.set_dimension(2, sidpy.Dimension(np.sin(np.linspace(0, 2 * np.pi, num=dataset.shape[2])), 'bias',
                                        units='mV', quantity='Voltage',
                                        dimension_type='spectral'))
hf = h5py.File("test.hf5", 'a')
hf.create_group('Measurement_000/Channel_000')
dataset.axes = dataset._axes
dataset.attrs = {}
nsid.hdf_io.write_nsid_dataset(dataset, hf['Measurement_000/Channel_000'], main_data_name="new_spectrum")
hf.close()

<HDF5 group "/Measurement_000/Channel_000" (0 members)> new_spectrum


## And now we load it

In [None]:
hdf5_file = h5py.File("test.hf5", 'r+')
print(hdf5_file["Measurement_000"].keys())

<KeysViewHDF5 ['Channel_000']>


Normally we do not need to care about the underlying structure, because the NSID reader takes care of everything.

The NSID reader returns a sidpy dataset, which we can then plot, analyze, modify, and write back to the HDF5 file in pyNSID format.

We can read all datasets or just a specific `directory` (group) in this hierarchical data file (hdf).

In [None]:
hdf5_file = h5py.File("test.hf5", 'r+')
print(*hdf5_file["Measurement_000"].keys())

nsid_reader = NSIDReader(hdf5_file['Measurement_000/Channel_000'])
sidpy_dataset = nsid_reader.read()[0]
sidpy_dataset

Channel_000


| | Array | Chunk |
|---|---|---|
| Bytes | 1.60 kB | 1.60 kB |
| Shape | (4, 5, 10) | (4, 5, 10) |
| Count | 1 Tasks | 1 Chunks |
| Type | float64 | numpy.ndarray |
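
The difference between reading everything and reading one `directory` can be sketched with plain ``h5py``. This is only an illustration on an in-memory stand-in file, not how ``NSIDReader`` is implemented. The helper ``find_main_datasets`` is hypothetical, and it assumes that main datasets can be recognized by a ``main_data_name`` attribute, as in the attribute listing later in this notebook.

```python
import h5py
import numpy as np


def find_main_datasets(root):
    """Collect paths of datasets below root that carry a 'main_data_name' attribute.

    Hypothetical helper: the attribute test is an assumption made for this sketch.
    """
    paths = []

    def visitor(name, item):
        if isinstance(item, h5py.Dataset) and "main_data_name" in item.attrs:
            paths.append(item.name)

    root.visititems(visitor)
    return paths


# In-memory stand-in file; nothing is written to disk.
with h5py.File("find_demo.h5", "w", driver="core", backing_store=False) as demo:
    for channel_name in ("Channel_000", "Channel_001"):
        channel = demo.create_group("Measurement_000/" + channel_name)
        main = channel.create_dataset("new_spectrum", data=np.zeros((4, 5, 10)))
        main.attrs["main_data_name"] = "new"
        channel.create_dataset("x", data=np.arange(4))  # a dimension, not main data

    whole_file = find_main_datasets(demo)
    one_channel = find_main_datasets(demo["Measurement_000/Channel_001"])

print(whole_file)
print(one_channel)
```

Starting the search at the file root finds the main dataset of every channel, while starting at one channel group restricts the result to that group.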


## Exploring the structure of the pyNSID data format

We will use a sidpy function to print the tree of the hdf5 file.

In [None]:
sidpy.hdf_utils.print_tree(hdf5_file)

/
├ Measurement_000
  ---------------
  ├ Channel_000
    -----------
    ├ bias
    ├ new_spectrum
    ├ original_metadata
      -----------------
    ├ x
    ├ y
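
To see what such a tree listing involves, here is a minimal sketch that walks an HDF5 file recursively with plain ``h5py``. The function ``tree_lines`` is a hypothetical stand-in written for this notebook, not the sidpy implementation, and the file is an in-memory dummy that only mimics part of the layout above.

```python
import h5py
import numpy as np


def tree_lines(group, indent=0):
    """Return one text line per member of an h5py group, depth first."""
    lines = []
    for name in sorted(group.keys()):
        item = group[name]
        kind = "Group" if isinstance(item, h5py.Group) else "Dataset"
        lines.append("{}{} ({})".format("  " * indent, name, kind))
        if isinstance(item, h5py.Group):
            # Recurse into sub-groups with one more level of indentation.
            lines.extend(tree_lines(item, indent + 1))
    return lines


# In-memory stand-in file; nothing is written to disk.
with h5py.File("tree_demo.h5", "w", driver="core", backing_store=False) as demo:
    channel = demo.create_group("Measurement_000/Channel_000")
    channel.create_dataset("new_spectrum", data=np.zeros((4, 5, 10)))
    channel.create_dataset("x", data=np.arange(4))
    channel.create_group("original_metadata")
    demo_tree = tree_lines(demo)

print("\n".join(demo_tree))
```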


As a suggested convention, we use ``Measurement_000`` as the first directory to store different datasets that belong together. So ``Measurement_000`` is an ``h5py.Group``, which contains several other ``h5py.Group``s whose names all start with ``Channel_``.

All directories are numbered, and there is a function to automatically increase this number for convenience.

The different ``Channels`` could be, for example, reference data or simultaneously acquired datasets.

The results would be logged with each individual dataset in its channel.

The names of directories of results should start with `Log_` or `Result_`.
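
The numbering convention can be illustrated with a few lines of plain Python. The helper ``next_indexed_name`` below is hypothetical and is not the function provided by the packages; it assumes three-digit, zero-padded indices as in ``Measurement_000``.

```python
import re


def next_indexed_name(existing_names, prefix):
    """Return prefix plus the next free three-digit index, e.g. 'Channel_002'."""
    pattern = re.compile(r"^" + re.escape(prefix) + r"(\d+)$")
    used = [int(m.group(1)) for m in map(pattern.match, existing_names) if m]
    next_index = max(used) + 1 if used else 0
    return "{}{:03d}".format(prefix, next_index)


names = ["Channel_000", "Channel_001", "Log_000"]
print(next_indexed_name(names, "Channel_"))  # Channel_002
print(next_indexed_name(names, "Result_"))   # Result_000
```

With an open file, the existing names would come from the keys of the parent group, for example ``list(hdf5_file["Measurement_000"].keys())``.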


## The Channel Group

The channel group contains several other ``h5py.Group``s and ``h5py.Dataset``s.

Every attribute of a stored ``sidpy`` dataset becomes a group, and the ``attributes`` of that group hold the dictionary stored in the corresponding attribute of the ``sidpy`` dataset.

For example, ``metadata`` is an attribute of the sidpy dataset.

So there will be an ``h5py.Group`` with the name ``metadata``, and the ``attributes`` of that group contain the dictionary of the original ``metadata`` attribute of the ``sidpy`` dataset.

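The attributes of an ``h5py.Group`` can be accessed with ``attrs``, as shown below. Note that in the file written above, the tree lists an ``original_metadata`` group but no ``metadata`` group, and the ``metadata`` entry appears among the attributes of the main dataset in the next section, which is presumably why the lookup below is commented out.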

In [None]:
# print(dict(hdf5_file['Measurement_000/Channel_000/metadata'].attrs))
# print(sidpy_dataset.metadata)

AttributeError: ignored
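
The idea of storing a dictionary as the attributes of a group can be tried out with plain ``h5py``, independent of the pyNSID version in use. The sketch below works on an in-memory stand-in file, and ``read_group_attrs`` is a hypothetical helper that checks whether the group exists before reading its ``attrs``, so a missing group gives an empty dictionary instead of an error.

```python
import h5py


def read_group_attrs(parent, name):
    """Return the attributes of parent[name] as a dict, or {} if it does not exist."""
    if name not in parent:
        return {}
    return dict(parent[name].attrs)


# In-memory stand-in file; nothing is written to disk.
with h5py.File("attrs_demo.h5", "w", driver="core", backing_store=False) as demo:
    channel = demo.create_group("Measurement_000/Channel_000")

    # Store a dictionary as the attributes of a group named after it.
    metadata_group = channel.create_group("metadata")
    for key, value in {"this": "is just a random dataset"}.items():
        metadata_group.attrs[key] = value

    found = read_group_attrs(channel, "metadata")
    missing = read_group_attrs(channel, "no_such_group")

print(found)
print(missing)
```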

### Dimensions of a dataset

An ``h5py.Dataset`` can have its dimensions ``attached`` to it.
The `attributes` of the dataset store the dimension labels, and the dimensions themselves are datasets in the same ``Directory``.

In the list of attributes of the main dataset, we can see that a few other mandatory items of a sidpy dataset (like ``data_type``) are stored as well.


In [None]:
for k, v in (hdf5_file['Measurement_000/Channel_000/new_spectrum'].attrs).items():
    print("{}: {}".format(k, v))

DIMENSION_LABELS: [b'x' b'y' b'bias']
DIMENSION_LIST: [array([<HDF5 object reference>], dtype=object)
 array([<HDF5 object reference>], dtype=object)
 array([<HDF5 object reference>], dtype=object)]
data_type: SPECTRAL_IMAGE
main_data_name: new
modality: generic
nsid_version: 0.0.1
quantity: Current
source: generic
this: is just a random dataset
units: nA


We see that ``['x' 'y' 'bias']`` are the labels of the dimensions, and those datasets are actually visible in the Channel.

The ``attributes`` of those dimension ``h5py.Dataset``s contain the additional information required by ``pyNSID`` and ``sidpy`` and (in capital letters) the information of the ``hdf5`` dimension scale.

In [None]:
print(dict(hdf5_file['Measurement_000/Channel_000/x'].attrs))
print(sidpy_dataset.dim_0)
print(sidpy_dataset.x)

{'CLASS': b'DIMENSION_SCALE', 'NAME': b'x', 'REFERENCE_LIST': array([(<HDF5 object reference>, 0)],
      dtype={'names':['dataset','dimension'], 'formats':['O','<i4'], 'offsets':[0,8], 'itemsize':16}), 'dimension_type': 'SPATIAL', 'name': 'x', 'nsid_version': '0.0.1', 'quantity': 'Length', 'units': 'nm'}
x:  Length (nm) of size (4,)
x:  Length (nm) of size (4,)
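
The capital-letter attributes above are written by the HDF5 dimension scale mechanism itself. As a minimal sketch with plain ``h5py`` (assuming a recent version that provides ``Dataset.make_scale``), the following attaches two axes to a small in-memory dataset and then follows the attached scales back to their datasets. The file and the two-dimensional array are made up for this illustration; only the ``units`` attribute mimics what NSID adds.

```python
import h5py
import numpy as np

# In-memory stand-in file; nothing is written to disk.
with h5py.File("dims_demo.h5", "w", driver="core", backing_store=False) as demo:
    channel = demo.create_group("Measurement_000/Channel_000")
    main = channel.create_dataset("new_spectrum", data=np.random.random((4, 5)))

    axes = {"x": np.arange(4), "y": np.linspace(-2, 2, num=5)}
    for index, (label, values) in enumerate(axes.items()):
        scale = channel.create_dataset(label, data=values)
        scale.attrs["units"] = "nm"            # extra attribute, in the spirit of NSID
        scale.make_scale(label)                # marks the dataset as a dimension scale
        main.dims[index].label = label         # stored under DIMENSION_LABELS
        main.dims[index].attach_scale(scale)   # stored under DIMENSION_LIST

    main_attr_names = sorted(main.attrs.keys())
    # Follow the first attached scale of each dimension back to its dataset.
    recovered = {dim.label: (dim[0].name, dim[0].attrs["units"], dim[0].shape)
                 for dim in main.dims}

print(main_attr_names)
print(recovered)
```

Because the link from the main dataset to its axes lives in the file, a reader can rebuild the dimensions from the main dataset alone.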


## Summary
The NSID data format is available through the pyNSID package. The format is an extension of the hdf5 format, accessible through the h5py package.
