# Structure of NSID file format

*Author: Gerd Duscher*

*Date: December 2020*
update: 
- *Gerd Duscher 01/2021 (compatibility to pyNSID version 0.0.2)*

The pyNSID file format is built on the ``hdf5`` file format, which is accessed through the ``h5py`` package.

The NSID conventions implemented on top of the ``hdf5`` file format are easily accessible through the pyNSID package.

Start with standard imports:

In [1]:
# Ensure python 3 compatibility:
from __future__ import (absolute_import, division, print_function,
                        unicode_literals)

import sys
import warnings

import h5py
import matplotlib.pyplot as plt
import numpy as np

# we will also need a sidpy package

sys.path.insert(0,'../../../sidpy/')
import sidpy 

print(sidpy.__version__)
sys.path.insert(0,'../../')
import pyNSID as nsid



warnings.filterwarnings("ignore", module="numpy.core.fromnumeric")
warnings.filterwarnings("ignore", module="pyNSID.io.nsi_reader")

0.0.4


## Create the test file

In [2]:
import os

# Remove a leftover test file from a previous run, if there is one
try:
    os.remove('test2.hf5')
    print('removed file: test2.hf5')
except OSError:
    pass
dataset = sidpy.Dataset.from_array(np.random.random([4, 5, 10]), name='new')
dataset.data_type = 'SPECTRAL_IMAGE'
dataset.units = 'nA'
dataset.quantity = 'Current'

dataset.metadata = {'this': 'is just a random dataset'}

dataset.set_dimension(0, sidpy.Dimension(np.arange(dataset.shape[0]), 'x',
                                        units='nm', quantity='Length',
                                        dimension_type='spatial'))
dataset.set_dimension(1, sidpy.Dimension(np.linspace(-2, 2, num=dataset.shape[1], endpoint=True), 'y', 
                                        units='nm', quantity='Length',
                                        dimension_type='spatial'))
dataset.set_dimension(2, sidpy.Dimension(np.sin(np.linspace(0, 2 * np.pi, num=dataset.shape[2])), 'bias',
                                        units='mV', quantity='Voltage',
                                        dimension_type='spectral'))

hf = h5py.File("test2.hf5", 'a')
if 'Measurement_000' in hf:
    del hf['Measurement_000']
hf.create_group('Measurement_000/Channel_000')
nsid.hdf_io.write_nsid_dataset(dataset, hf['Measurement_000/Channel_000'], main_data_name="new_spectrum")
sidpy.hdf_utils.print_tree(hf)
hf.close()

/
├ Measurement_000
  ---------------
  ├ Channel_000
    -----------
    ├ new_spectrum
      ------------
      ├ __dict__
        --------
      ├ _axes
        -----
      ├ _metadata
        ---------
      ├ bias
      ├ metadata
        --------
      ├ new_spectrum
      ├ x
      ├ y




Let's open the test file.

In [3]:
hdf5_file = h5py.File("test2.hf5", 'r+')
print(hdf5_file["Measurement_000"].keys())

<KeysViewHDF5 ['Channel_000']>


Normally we do not need to care about the underlying structure, because the NSID reader takes care of everything.

The NSID reader returns a sidpy dataset, which we can then plot, analyze, modify, and write back to the hdf5 file in pyNSID format.

We can read all datasets or just a specific `directory` in this hierarchical data file (hdf).

In [5]:
nsid_reader = nsid.NSIDReader("test2.hf5")
sidpy_dataset = nsid_reader.read()[0]
sidpy_dataset

| | Array | Chunk |
|---|---|---|
| Bytes | 1.60 kB | 1.60 kB |
| Shape | (4, 5, 10) | (4, 5, 10) |
| Count | 1 Tasks | 1 Chunks |
| Type | float64 | numpy.ndarray |


## Exploring the structure of the pyNSID data format

We will use a sidpy function to print the tree of the hdf5 file.

In [6]:
sidpy.hdf_utils.print_tree(hdf5_file)

/
├ Measurement_000
  ---------------
  ├ Channel_000
    -----------
    ├ new_spectrum
      ------------
      ├ __dict__
        --------
      ├ _axes
        -----
      ├ _metadata
        ---------
      ├ bias
      ├ metadata
        --------
      ├ new_spectrum
      ├ x
      ├ y


As a suggested convention, we use ``Measurement_000`` as the first directory to store different datasets that belong together. ``Measurement_000`` is therefore an ``h5py.Group``, which contains several other ``h5py.Group``s whose names all start with ``Channel_``.

All directories are numbered, and for convenience ``sidpy`` provides a function that automatically increases this number when a new group is created (*sidpy.hdf.prov_utils.create_indexed_group*).

The different ``Channels`` could be, for example, reference data or simultaneously acquired datasets.

The results would be logged with each individual dataset in its channel.

The names of directories of results should start with `Log_` or `Result_`.
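
The numbering convention can be illustrated with a small, pure-Python sketch. This is not the code of ``sidpy``'s ``create_indexed_group``; the helper ``next_indexed_name`` is hypothetical, and it assumes that indices are written with three digits, as in ``Measurement_000`` and ``Channel_000``.

```python
import re


def next_indexed_name(existing_names, prefix):
    """Return ``prefix`` followed by the next free three-digit index."""
    pattern = re.compile("^" + re.escape(prefix) + r"(\d+)$")
    used = []
    for name in existing_names:
        match = pattern.match(name)
        if match:
            used.append(int(match.group(1)))
    # start at 0 when no group with this prefix exists yet
    next_index = max(used) + 1 if used else 0
    return "{}{:03d}".format(prefix, next_index)


channel_names = ["Channel_000", "Channel_001"]
print(next_indexed_name(channel_names, "Channel_"))
print(next_indexed_name([], "Measurement_"))
```

With an open file, the list of existing names could come from the keys of the parent group, for example ``hdf5_file["Measurement_000"].keys()``.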


## The Channel Group

The channel group contains several other ``h5py.Group``s and ``h5py.Dataset``s.

Every attribute of a stored ``sidpy`` dataset is written as a group, and the ``attributes`` of that group hold the dictionary stored in the corresponding attribute of the ``sidpy`` dataset.

For example, ``metadata`` is an attribute of the sidpy dataset.

So there will be an ``h5py.Group`` with the name ``metadata``, and the ``attributes`` of that group contain the dictionary of the original ``metadata`` attribute of the ``sidpy`` dataset.

The attributes of an ``h5py.Group`` can be accessed with ``attrs``, as shown below.

In [8]:
print(hdf5_file['Measurement_000/Channel_000/new_spectrum'].keys())

print(dict(hdf5_file['Measurement_000/Channel_000/new_spectrum/metadata'].attrs))
print(sidpy_dataset.metadata)

<KeysViewHDF5 ['__dict__', '_axes', '_metadata', 'bias', 'metadata', 'new_spectrum', 'x', 'y']>
{'this': 'is just a random dataset'}
{'this': 'is just a random dataset'}
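
The mapping between a dictionary and the attributes of a group can be reproduced with plain ``h5py``. The following is a minimal sketch, not the pyNSID writer: it assumes a flat dictionary with simple values (nested dictionaries would need extra handling that is not shown), and the in-memory file name is made up for this example.

```python
import h5py

metadata = {"this": "is just a random dataset"}

# In-memory file (hypothetical name), so nothing is written to disk
with h5py.File("attrs_sketch.hf5", "w", driver="core", backing_store=False) as h5_file:
    # intermediate groups are created automatically
    group = h5_file.create_group("Measurement_000/Channel_000/new_spectrum/metadata")
    # store each dictionary entry as an attribute of the group
    for key, value in metadata.items():
        group.attrs[key] = value
    # read the attributes back into a plain dictionary
    restored = dict(group.attrs)

print(restored)
```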


### Dimensions of a dataset

An ``h5py.Dataset`` can have its dimensions ``attached`` to it.
The `attributes` of the main dataset store the dimension labels, and the dimensions themselves are datasets in the same ``Directory``.

Besides the dimension labels, the attributes of the main dataset store a few other mandatory items of a sidpy dataset (such as data_type). The cell below lists the attributes stored on the enclosing ``new_spectrum`` group.


In [9]:
for k, v in (hdf5_file['Measurement_000/Channel_000/new_spectrum'].attrs).items():
    print("{}: {}".format(k, v))

machine_id: MSE-Tab01.utk.tennessee.edu
platform: Windows-10-10.0.19041-SP0
pyNSID_version: 0.0.2
sidpy_version: 0.0.4
timestamp: 2021_01_15-17_13_39


The labels of the dimensions are ``['x' 'y' 'bias']``, and the corresponding datasets are visible in the tree next to the main dataset.

The ``attributes`` of those dimensional ``h5py.Dataset``s contain the additional information required by ``pyNSID`` and ``sidpy`` and (in capital letters) the information of the ``hdf5`` dimension.

In [11]:
print(dict(hdf5_file['Measurement_000/Channel_000/new_spectrum/x'].attrs))

{'CLASS': b'DIMENSION_SCALE', 'NAME': b'x', 'REFERENCE_LIST': array([(<HDF5 object reference>, 0)],
      dtype={'names':['dataset','dimension'], 'formats':['O','<i4'], 'offsets':[0,8], 'itemsize':16}), 'dimension_type': 'SPATIAL', 'name': 'x', 'quantity': 'Length', 'units': 'nm'}
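
The capitalized entries (``CLASS``, ``NAME``, ``REFERENCE_LIST``) are written by the ``hdf5`` dimension scale mechanism. As a minimal sketch with plain ``h5py``, and not the pyNSID writer itself, the code below turns a dataset into a dimension scale, attaches it to a main dataset, and then adds lowercase attributes like those shown above. The in-memory file name and the variable names are assumptions for illustration.

```python
import h5py
import numpy as np

# In-memory file (hypothetical name), so nothing is written to disk
with h5py.File("dims_sketch.hf5", "w", driver="core", backing_store=False) as h5_file:
    group = h5_file.create_group("Measurement_000/Channel_000/new_spectrum")
    main = group.create_dataset("new_spectrum", data=np.random.random([4, 5, 10]))
    x_scale = group.create_dataset("x", data=np.arange(4))

    # hdf5 side: mark the dataset as a dimension scale and attach it to axis 0
    x_scale.make_scale("x")
    main.dims[0].attach_scale(x_scale)
    main.dims[0].label = "x"

    # additional lowercase attributes, mirroring the ones printed above
    x_scale.attrs["name"] = "x"
    x_scale.attrs["units"] = "nm"
    x_scale.attrs["quantity"] = "Length"
    x_scale.attrs["dimension_type"] = "SPATIAL"

    # collect results while the file is still open
    scale_attribute_names = sorted(x_scale.attrs.keys())
    first_label = main.dims[0].label
    attached_name = main.dims[0][0].name

print(scale_attribute_names)
print(first_label, attached_name)
```

Only the first dimension is attached here; the ``y`` and ``bias`` dimensions would follow the same pattern for axes 1 and 2.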


## Summary
The NSID data format is available through the pyNSID package. The format is an extension of the hdf5 format, which is accessible through the h5py package.
