02. The USIDataset
================================================================================

**Suhas Somnath**

11/11/2017

**This document illustrates how the pyUSID.USIDataset class substantially simplifies accessing information about,
slicing, and visualizing N-dimensional Universal Spectroscopy and Imaging Data (USID) Main datasets**




USID Main Datasets
------------------
According to the **Universal Spectroscopy and Imaging Data (USID)** model, all spatial dimensions are collapsed to a
single dimension and, similarly, all spectroscopic dimensions
are also collapsed to a single dimension. Thus, the data is stored as a two-dimensional (N x P) matrix with N spatial
locations each with P spectroscopic data points.
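
The collapse into an (N x P) matrix can be sketched with plain ``numpy``. This is only an illustration: the dimension names and sizes below are hypothetical, and the sketch assumes simple row-major (C-order) flattening, which is not necessarily the ordering used in a given h5USID file.

```python
import numpy as np

# Hypothetical measurement: a 2 x 3 grid of positions, with a 4 x 5 grid of
# spectroscopic points recorded at every position
n_rows, n_cols, n_bias, n_freq = 2, 3, 4, 5
data_4d = np.arange(n_rows * n_cols * n_bias * n_freq).reshape(n_rows, n_cols, n_bias, n_freq)

# Collapse all position dimensions into one axis (N) and all spectroscopic
# dimensions into another axis (P)
data_2d = data_4d.reshape(n_rows * n_cols, n_bias * n_freq)
print(data_2d.shape)  # (6, 20)

# With row-major flattening, the spectrum at grid location (row=1, col=2)
# ends up in row 1 * n_cols + 2 = 5 of the 2D matrix
assert np.array_equal(data_2d[1 * n_cols + 2], data_4d[1, 2].ravel())
```

No values are lost in this step; only the bookkeeping of which row and column belongs to which dimension needs to be stored separately.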

This general and intuitive format allows imaging data from any instrument, measurement scheme, size, or dimensionality
to be represented in the same way. Such an instrument independent data format enables a single set of analysis and
processing functions to be reused for multiple image formats or modalities.

``Main datasets`` are greater than the sum of their parts. They are more capable and information-packed than
conventional datasets since they have (or are linked to) all the necessary information to describe a measured dataset.
The additional information contained / linked by ``Main datasets`` includes:

* the recorded physical quantity
* units of the data
* names of the position and spectroscopic dimensions
* dimensionality of the data in its original N dimensional form etc.

USIDatasets = USID Main Datasets
--------------------------------
Regardless, ``Main datasets`` are just concepts or blueprints and not concrete digital objects in a programming language
or a file. ``USIDatasets`` are **tangible representations of Main datasets**. From an implementation perspective, the
USIDataset class extends the ``h5py.Dataset`` object. In other words, USIDatasets have all the capabilities of
standard HDF5 / h5py Dataset objects but are supercharged from a scientific perspective since they:

* are self-describing
* allow quick interactive visualization in Jupyter notebooks
* allow intuitive slicing of the N dimensional dataset
* and much much more.

While it is most certainly possible to access this information and enable these functionalities via the native ``h5py``
functionality, it can become tedious very quickly.  In fact, a lot of the functionality of USIDataset comes from
orchestration of multiple functions in ``pyUSID.hdf_utils`` outlined in other documents. The USIDataset class
makes such necessary information and functionality easily accessible.

Since Main datasets are the hubs of information in a USID HDF5 file (**h5USID**), we expect that the majority of
the data interrogation will happen via USIDatasets.

Recommended pre-requisite reading
---------------------------------
* `USID data model </../../data_format.html>`_
* `Crash course on HDF5 and h5py <./plot_h5py.html>`_
* Utilities for `reading <./plot_hdf_utils_read.html>`_ h5USID files using pyUSID

Example scientific dataset
---------------------------

Before we dive into the functionalities of USIDatasets, we need to understand the dataset that will be used in this
example. We will be working with a Band Excitation Polarization Switching (BEPS) dataset acquired
from advanced atomic force microscopes. In the much simpler Band Excitation (BE) imaging datasets, a single spectrum
is acquired at each location in a two-dimensional grid of spatial locations. Thus, BE imaging datasets have two
position dimensions (X, Y) and one spectroscopic dimension (Frequency, against which the spectrum is recorded). The
BEPS dataset used in this example has a spectrum for each combination of three other parameters (DC offset, Field, and
Cycle). Thus, this dataset has three additional spectroscopic dimensions besides Frequency. Hence, this dataset
is a 2 + 4 = 6 dimensional dataset.
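
As a rough illustration of this bookkeeping, the snippet below counts the dimensions and works out the shape of the corresponding 2D matrix. The sizes are made up for the sake of the arithmetic; the sizes of the real dataset are read from the file later in this document.

```python
from math import prod

# Hypothetical sizes for each dimension (NOT the sizes of the actual dataset)
pos_dims = {'X': 4, 'Y': 3}
spec_dims = {'Frequency': 10, 'DC_Offset': 8, 'Field': 2, 'Cycle': 2}

n_dims = len(pos_dims) + len(spec_dims)    # 2 + 4 = 6 dimensions
n_positions = prod(pos_dims.values())      # N: rows of the 2D matrix
n_spec_points = prod(spec_dims.values())   # P: columns of the 2D matrix

print(n_dims, (n_positions, n_spec_points))  # 6 (12, 320)
```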

.. tip::
    You can download and run this document as a Jupyter notebook using the link at the bottom of this page.

Load all necessary packages
---------------------------

First, we need to load the necessary packages. Here is a list of packages, besides pyUSID, that will be used in
this example:

* ``h5py`` - to open and close the file
* ``numpy`` - for numerical operations on arrays in memory
* ``matplotlib`` - basic visualization of data

In [None]:
from __future__ import print_function, division, unicode_literals
import os
# Warning package in case something goes wrong
from warnings import warn
import subprocess
import sys


def install(package):
    # Install a package with pip using the interpreter running this notebook
    subprocess.call([sys.executable, "-m", "pip", "install", package])


import h5py
import numpy as np
import matplotlib.pyplot as plt
try:
    import pyUSID as usid
except ImportError:
    warn('pyUSID not found.  Will install with pip.')
    install('pyUSID')
    import pyUSID as usid

%matplotlib notebook

Load the dataset
-----------------

Let's open this HDF5 file in read-only mode. Note that opening the file does not cause the contents to be
automatically loaded to memory. Instead, we are presented with objects that refer to specific HDF5 datasets,
attributes, or groups in the file.



In [None]:
h5_path = './data/BEPS_small.h5'
h5_f = h5py.File(h5_path, mode='r')

Here, ``h5_f`` is an active handle to the open file.
Let's quickly look at the contents of this HDF5 file using a handy function in ``pyUSID.hdf_utils``: ``print_tree()``.



In [None]:
print('Contents of the H5 file:')
usid.hdf_utils.print_tree(h5_f)

For this example, we will only focus on the ``Raw_Data`` dataset, which contains the 6D raw measurement data. First, let's
access the HDF5 dataset:



In [None]:
pd_raw = usid.USIDataset(h5_f['/Measurement_000/Channel_000/Raw_Data'])
print(pd_raw)

How USIDataset simplifies data handling
=======================================

## 1. Reference Values for each Dimension
When visualizing the data it is essential to plot the data against appropriate values on the X, Y, Z axes. The
USIDataset object makes it very easy to access the values over which a dimension was varied using the
``get_pos_values()`` and ``get_spec_values()`` functions. This functionality is enabled by the ``get_unit_values()``
function in ``pyUSID.hdf_utils``.

For example, to see how the ``DC_Offset`` dimension was varied, we could:



In [None]:
dim_name = 'DC_Offset'

dc_vec = pd_raw.get_spec_values(dim_name)

fig, axis = plt.subplots(figsize=(5, 5))
axis.plot(dc_vec, '*-')
axis.set_xlabel('Points in dimension', fontsize=18)
axis.set_ylabel('Values over which\nparameter was varied', fontsize=18)
axis.set_title(dim_name, fontsize=18)
fig.tight_layout()
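
The idea behind retrieving such reference values can be sketched with plain ``numpy``. The snippet below is a simplified illustration, not pyUSID's actual implementation: it builds small, hypothetical ancillary index and value matrices (one row per dimension, one column per spectroscopic step) and picks out the values of one dimension at the steps where every other dimension sits at its first index. The helper name ``unit_values`` is made up for this sketch.

```python
import numpy as np

# Hypothetical spectroscopic dimensions: 'DC_Offset' varies fastest, 'Cycle' slowest
dc_steps = np.array([0.0, 1.0, 0.0, -1.0])  # non-monotonic waveform
cycles = np.array([0.0, 1.0])

# Ancillary matrices: one row per dimension, one column per spectroscopic step
spec_inds = np.vstack([np.tile(np.arange(dc_steps.size), cycles.size),
                       np.repeat(np.arange(cycles.size), dc_steps.size)])
spec_vals = np.vstack([np.tile(dc_steps, cycles.size),
                       np.repeat(cycles, dc_steps.size)])


def unit_values(inds, vals, dim_row):
    """Values of one dimension at the steps where all other dimensions are at index 0."""
    other_rows = [row for row in range(inds.shape[0]) if row != dim_row]
    keep = np.all(inds[other_rows] == 0, axis=0)
    return vals[dim_row, keep]


dc_vec = unit_values(spec_inds, spec_vals, 0)
cycle_vec = unit_values(spec_inds, spec_vals, 1)
print(dc_vec)     # [ 0.  1.  0. -1.]
print(cycle_vec)  # [0. 1.]
```

Selecting by index rather than calling ``np.unique`` on the values preserves the order in which a non-monotonic parameter, such as a bipolar waveform, was actually varied.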

## 2. Reshaping to N dimensions
The USID model stores N dimensional datasets in a flattened 2D form of position x spectral values. It can become
challenging to retrieve the data in its original N-dimensional form, especially for multidimensional datasets
such as the one we are working on. Fortunately, all the information regarding the dimensionality of the dataset
is contained in the spectral and position ancillary datasets. USIDataset makes it remarkably easy to obtain the N
dimensional form of a dataset:



In [None]:
print('\nShape of dataset as it is stored in the HDF5 file: {}\n'.format(pd_raw.shape))

ndim_form = pd_raw.get_n_dim_form()

print('Shape of the N dimensional form of the dataset:')
print(ndim_form.shape)
print('And these are the dimensions')
print(pd_raw.n_dim_labels)
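
For intuition, here is a minimal ``numpy`` sketch of what such a reshape amounts to once the size of every dimension is known. The labels and sizes are hypothetical, and the sketch assumes the dimensions are listed from slowest to fastest varying with row-major storage; in practice the sizes and ordering come from the ancillary datasets.

```python
import numpy as np

# Hypothetical dimension labels and sizes, listed from slowest to fastest varying
labels = ['X', 'Y', 'Frequency', 'Cycle']
sizes = [3, 2, 4, 2]
n_pos, n_spec = 3 * 2, 4 * 2

# Stand-in for the flattened 2D Main dataset (positions x spectroscopic steps)
main_2d = np.arange(n_pos * n_spec).reshape(n_pos, n_spec)

# Recover the N dimensional form by splitting each collapsed axis back up
data_nd = main_2d.reshape(sizes)
print(data_nd.shape)  # (3, 2, 4, 2)

# Labels tell us which axis corresponds to which dimension
freq_axis = labels.index('Frequency')
print(freq_axis)  # 2

# The same element addressed in N-D and in 2D: (X=2, Y=1) is row 2 * 2 + 1 = 5,
# and (Frequency=3, Cycle=1) is column 3 * 2 + 1 = 7
assert data_nd[2, 1, 3, 1] == main_2d[5, 7]
```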

## 3. Indexing / Slicing with words
It is often very challenging to grapple with multidimensional datasets such as the one in this example. It may not
even be possible to load the entire dataset in its 2D or N dimensional form to memory if the dataset is several (or
several hundred) gigabytes large. **Slicing the 2D Main dataset can easily become confusing and frustrating**. 

The hard way is to:
* Load the entire dataset from the file to memory. **What if the dataset is too large to load into memory?**
* Reshape it to its N-dimensional form
* Figure out which dimension corresponds to which index in the N-dimensional dataset
* Finally slice it

Let's say we want to get the spatial map for:
* 14th index of DC Offset
* 1st index of cycle
* 0th index of Field (remember Python is 0 based)
* 43rd index of Frequency

In [None]:
# Fortunately, this dataset is tiny and we already have the data in its N-dimensional form!
spat_map_2 = np.squeeze(ndim_form[:, :, 43, 14, 0, 1])

fig, axis = plt.subplots(figsize=(5, 5))
axis.imshow(np.squeeze(np.abs(spat_map_2)))
axis.set_xlabel('X', fontsize=18)
axis.set_ylabel('Y', fontsize=18);

To solve this problem, USIDataset has a ``slice()`` function that efficiently loads only the sliced data into memory
and reshapes the data to an N dimensional form. Best of all, the slicing arguments can be provided in the actual
N dimensional form!

With the USIDataset we:
* Do not need to load the entire dataset to memory
* Can slice by addressing dimensions by their name without having to worry about the actual order of the dimensions

In [None]:
# We slice directly on the USIDataset in ONE line!:
spat_map_1, success = pd_raw.slice({'Frequency': 43, 'DC_Offset': 14, 'Field': 0, 'Cycle': 1})


fig, axis = plt.subplots(figsize=(5, 5))
axis.imshow(np.squeeze(np.abs(spat_map_1)))
axis.set_xlabel('X', fontsize=18)
axis.set_ylabel('Y', fontsize=18);
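
One way to picture how slicing by name can avoid loading the whole dataset is sketched below with plain ``numpy``. This is an illustrative assumption about the general approach, not pyUSID's actual code: the requested indices are looked up in a hypothetical spectroscopic index matrix to find which columns of the 2D matrix are needed, and only those columns are read. The helper name ``columns_for`` is made up for this sketch.

```python
import numpy as np

# Hypothetical sizes: 3 positions; 'Frequency' (4 points) varies fastest, 'Cycle' (2 points) slowest
n_pos, n_freq, n_cycle = 3, 4, 2
spec_labels = ['Frequency', 'Cycle']
spec_inds = np.vstack([np.tile(np.arange(n_freq), n_cycle),
                       np.repeat(np.arange(n_cycle), n_freq)])

# Stand-in for the 2D Main dataset (positions x spectroscopic steps)
main_2d = np.arange(n_pos * n_freq * n_cycle).reshape(n_pos, n_freq * n_cycle)


def columns_for(slice_dict):
    """Columns of the 2D matrix that match the requested index of each named dimension."""
    mask = np.ones(spec_inds.shape[1], dtype=bool)
    for name, index in slice_dict.items():
        mask &= spec_inds[spec_labels.index(name)] == index
    return np.flatnonzero(mask)


cols = columns_for({'Frequency': 2, 'Cycle': 1})
sliced = main_2d[:, cols]  # only these columns would need to be read from the file
print(cols)            # [6]
print(sliced.ravel())  # [ 6 14 22]
```

Because the lookup goes through the dimension names, the caller never needs to know the order in which the dimensions are stored.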

## 4. Interactive Visualization
### Works best on Google Chrome and not so well on Internet Explorer, Safari or Firefox
USIDatasets also enable quick, interactive, and easy visualization of data up to 2 position and 2 spectroscopic
dimensions (4D datasets). Since this particular example has 6 dimensions, we would need to slice two dimensions in
order to visualize the remaining 4 dimensions. 

In [None]:
pd_raw.visualize(slice_dict={'Field': 0, 'Cycle': 1});

Finally, close the HDF5 file:



In [None]:
h5_f.close()