In [None]:
%matplotlib inline

The PycroDataset
============================================
Suhas Somnath

11/11/2017

Introduction
=========================
We highly recommend reading about the pycroscopy data format - available in the docs.

Pycroscopy takes a data-centric approach to data analysis and processing: results from all data analysis and
processing are written to the same h5 file that contains the recorded measurements. The Hierarchical Data Format (HDF5)
allows data to be stored in multiple datasets in a tree-like manner. However, pycroscopy imposes certain rules and
conventions to ensure consistent and easy access to any data. pycroscopy.hdf_utils contains many utility functions
that simplify access to data, and this tutorial provides an overview of many of these functions:

* Other:
    * print_tree
* Searching / Lookup:
    * find_dataset
    * find_results_groups
    * get_all_main
    * get_auxillary_datasets
    * get_group_refs
    * get_h5_obj_refs
    * get_source_dataset
    * check_for_matching_attrs
    * check_for_old
* Main dataset - Reading:
    * check_if_main
    * get_data_descriptor
    * reshape_to_n_dims
    * reshape_to_2d
* Ancillary datasets related:
    * get_formatted_labels
    * get_dimensionality
    * get_sort_order
    * get_unit_values
* Attributes - Reading:
    * get_attr
    * get_attributes

Main Datasets via PycroDataset
==============================

For this example, we will be working with a Band Excitation Polarization Switching (BEPS) dataset acquired from
advanced atomic force microscopes. In the much simpler Band Excitation (BE) imaging datasets, a single spectrum is
acquired at each location in a two-dimensional grid of spatial locations. Thus, BE imaging datasets have two
position dimensions (X, Y) and one spectroscopic dimension (frequency, against which the spectrum is recorded).
The BEPS dataset used in this example has a spectrum for each combination of three other parameters (DC offset,
Field, and Cycle). Thus, this dataset has three additional spectroscopic dimensions besides the frequency dimension
of the spectrum itself. Hence, this dataset becomes a 2 + 4 = 6 dimensional dataset.

In pycroscopy, all spatial dimensions are collapsed to a single dimension and similarly, all spectroscopic
dimensions are also collapsed to a single dimension. Thus, the data is stored as a two-dimensional (N x P)
matrix with N spatial locations each with P spectroscopic datapoints.
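
The flattening described above can be sketched with plain numpy. This is only an illustration: the dimension sizes and the synthetic data are made up, and the row-major ordering used here is an assumption of the sketch rather than a statement about how pycroscopy orders its dimensions.

```python
import numpy as np

# Hypothetical sizes, chosen only for illustration: (X, Y) position dimensions and
# (frequency, DC offset, field, cycle) spectroscopic dimensions
pos_sizes = (3, 4)
spec_sizes = (5, 6, 2, 2)

# A made-up 6D measurement
data_6d = np.arange(np.prod(pos_sizes + spec_sizes)).reshape(pos_sizes + spec_sizes)

# Collapse all position dimensions into N rows and all spectroscopic dimensions into P columns
num_pos = int(np.prod(pos_sizes))    # N = 3 * 4 = 12
num_spec = int(np.prod(spec_sizes))  # P = 5 * 6 * 2 * 2 = 120
data_2d = data_6d.reshape(num_pos, num_spec)

print(data_6d.shape)  # (3, 4, 5, 6, 2, 2)
print(data_2d.shape)  # (12, 120)
```

Every value of the 6D array is still present in the 2D matrix; only the indexing changes.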

This general and intuitive format allows imaging data from any instrument, measurement scheme, size, or
dimensionality to be represented in the same way. Such an instrument independent data format enables a single
set of analysis and processing functions to be reused for multiple image formats or modalities.

Main datasets can be thought of as substantially more capable and information-packed than standard datasets
since they have (or are linked to) all the necessary information to describe a measured dataset. The additional
information contained / linked by Main datasets includes:

* the recorded physical quantity
* units of the data
* names of the position and spectroscopic dimensions
* dimensionality of the data in its original N dimensional form etc.

While it is most certainly possible to access this information via the native h5py functionality, it can become
tedious very quickly.  Pycroscopy's PycroDataset class makes such necessary information and any necessary
functionality easily accessible.

PycroDataset objects are still h5py.Dataset objects underneath, like all datasets accessed above, but add an
additional layer of functionality to simplify data operations. Let's compare the information we can get via the
standard h5py library with that from PycroDataset to see the additional layer of functionality. Among other things,
the PycroDataset makes the spectroscopic and position dimensions and their sizes immediately apparent.



Load all necessary packages
============================

Before we begin demonstrating the numerous functions in pycroscopy.hdf_utils, we need to load the necessary packages. Here is a list of packages besides pycroscopy that will be used in this example:
* h5py - to open and close the file
* wget - to download the example data file
* numpy - for numerical operations on arrays in memory
* matplotlib - basic visualization of data

In [1]:
from __future__ import print_function, division, unicode_literals
import os
import subprocess
import sys
# Warning package in case something goes wrong
from warnings import warn


def install(package):
    # pip.main() is no longer available in recent pip versions;
    # invoke pip through the running interpreter instead
    subprocess.check_call([sys.executable, '-m', 'pip', 'install', package])


# Package for downloading online files:
try:
    # This package is not part of anaconda and may need to be installed.
    import wget
except ImportError:
    warn('wget not found.  Will install with pip.')
    install('wget')
    import wget
import h5py
import numpy as np
import matplotlib.pyplot as plt
# Set to False to use an installed pycroscopy instead of the local source tree
use_local_source = True
if use_local_source:
    # Make the parent directory (the local pycroscopy checkout) importable
    sys.path.append(os.path.split(os.path.abspath('.'))[0])
    import pycroscopy as px
else:
    try:
        import pycroscopy as px
    except ImportError:
        warn('pycroscopy not found.  Will install with pip.')
        install('pycroscopy')
        import pycroscopy as px

  warn('You are using the unity_dev branch, which is aimed at a 1.0 release for pycroscopy. '


Load the dataset
=========================================
In order to demonstrate the many functions in hdf_utils, we will be using a pycroscopy HDF5 data file generated from an atomic force microscope containing real experimental data and some analysis results. First, let us download this file from the pycroscopy GitHub project:

In [None]:
# Downloading the example file from the pycroscopy Github project
url = 'https://raw.githubusercontent.com/pycroscopy/pycroscopy/master/data/BEPS_small.h5'
h5_path = 'temp.h5'
_ = wget.download(url, h5_path)

print('Working on:\n' + h5_path)

Next, let's open this HDF5 file in read-only mode. Note that opening the file does not cause the contents to be automatically loaded to memory. Instead, we are presented with objects that refer to specific HDF5 datasets, attributes, or groups in the file.

In [2]:
# Open the file in read-only mode
h5_path = 'temp.h5'
h5_f = h5py.File(h5_path, mode='r')
# Here, h5_f is an active handle to the open file
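
To illustrate the point about handles versus data in memory, here is a small sketch that uses only h5py and numpy. It builds a throwaway in-memory file (the file and dataset names are made up for this sketch) instead of touching the example file.

```python
import h5py
import numpy as np

# An in-memory HDF5 file: with backing_store=False nothing is written to disk
with h5py.File('lazy_demo.h5', mode='w', driver='core', backing_store=False) as h5_demo:
    h5_dset = h5_demo.create_dataset('Raw_Data', data=np.arange(12).reshape(3, 4))

    # h5_dset is only a handle to the dataset, not a numpy array holding the data
    is_handle = isinstance(h5_dset, h5py.Dataset)
    is_array = isinstance(h5_dset, np.ndarray)

    # Slicing the handle reads just the requested portion into a numpy array
    first_row = h5_dset[0]

print(is_handle, is_array, type(first_row).__name__)
```

Data is read only when the handle is sliced, which is what makes it practical to work with files larger than the available memory.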

Inspect the contents of this h5 data file
=========================================

The file contents are stored in a tree structure, just like files on a contemporary computer. The file contains
datagroups (similar to file folders) and datasets (similar to spreadsheets).
There are several datasets in the file and these store:

* The actual measurement collected from the experiment
* Spatial location on the sample where each measurement was collected
* Information to support and explain the spectral data collected at each location
* Since pycroscopy stores results from processing and analyses performed on the data in the same file, these
  datasets and datagroups are present as well
* Any other relevant ancillary information

print_tree()
------------
Soon after opening a file, it is often of interest to list its contents. While one can use the open-source
software HDFView developed by The HDF Group, pycroscopy.hdf_utils also has a very handy function, print_tree(), to quickly visualize all the datasets and datagroups in the file from within Python.

In [3]:
print('Contents of the H5 file:')
px.hdf_utils.print_tree(h5_f)

Contents of the H5 file:
/
├ Measurement_000
  ---------------
  ├ Channel_000
    -----------
    ├ Bin_FFT
    ├ Bin_Frequencies
    ├ Bin_Indices
    ├ Bin_Step
    ├ Bin_Wfm_Type
    ├ Excitation_Waveform
    ├ Noise_Floor
    ├ Position_Indices
    ├ Position_Values
    ├ Raw_Data
    ├ Raw_Data-SHO_Fit_000
      --------------------
      ├ Fit
      ├ Guess
      ├ Spectroscopic_Indices
      ├ Spectroscopic_Values
    ├ Spatially_Averaged_Plot_Group_000
      ---------------------------------
      ├ Bin_Frequencies
      ├ Mean_Spectrogram
      ├ Spectroscopic_Parameter
      ├ Step_Averaged_Response
    ├ Spatially_Averaged_Plot_Group_001
      ---------------------------------
      ├ Bin_Frequencies
      ├ Mean_Spectrogram
      ├ Spectroscopic_Parameter
      ├ Step_Averaged_Response
    ├ Spectroscopic_Indices
    ├ Spectroscopic_Values
    ├ UDVS
    ├ UDVS_Indices
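
For comparison, a similar listing can be produced with h5py alone by walking the tree with visititems(). The sketch below is not pycroscopy's print_tree() implementation; it works on a small in-memory file whose layout merely imitates part of the tree above.

```python
import h5py
import numpy as np

found = []


def record(name, obj):
    # Called once for every group and dataset below the starting point
    kind = 'group' if isinstance(obj, h5py.Group) else 'dataset'
    found.append((name, kind))


with h5py.File('tree_demo.h5', mode='w', driver='core', backing_store=False) as h5_demo:
    # Intermediate groups in the path are created automatically
    h5_chan = h5_demo.create_group('Measurement_000/Channel_000')
    h5_chan.create_dataset('Raw_Data', data=np.zeros((2, 3)))
    h5_chan.create_dataset('Bin_FFT', data=np.zeros(3))
    h5_demo.visititems(record)

for name, kind in found:
    print(kind, name)
```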


In [None]:
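# Accessing the raw data
# h5_raw and main_dsets are not defined earlier in this notebook. The two lines below
# assume the Raw_Data path shown by print_tree() and that get_all_main() lists Raw_Data first.
h5_raw = h5_f['/Measurement_000/Channel_000/Raw_Data']
main_dsets = px.hdf_utils.get_all_main(h5_f)
pycro_main = main_dsets[0]
print('Dataset as observed via h5py:')
print(h5_raw)
print('\nDataset as seen via a PycroDataset object:')
print(pycro_main)
# Showing that the PycroDataset is still just a h5py.Dataset object underneath:
print()
print(isinstance(pycro_main, h5py.Dataset))
print(pycro_main == h5_raw)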

Main datasets are often linked to supporting datasets in addition to the mandatory ancillary datasets. The main
dataset contains attributes that are references to these datasets.



In [None]:
for att_name in pycro_main.attrs:
    print(att_name, pycro_main.attrs[att_name])

These datasets can be accessed easily via a handy hdf_utils function:



In [None]:
print(px.hdf_utils.getAuxData(pycro_main, auxDataName='Bin_FFT'))
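
The mechanism behind such links can be sketched with plain h5py, which allows an HDF5 object reference to be stored as an attribute and dereferenced later. The example below uses a throwaway in-memory file with made-up contents; it shows the underlying h5py feature and is not pycroscopy's own lookup code.

```python
import h5py
import numpy as np

with h5py.File('ref_demo.h5', mode='w', driver='core', backing_store=False) as h5_demo:
    h5_main = h5_demo.create_dataset('Raw_Data', data=np.zeros((2, 3)))
    h5_aux = h5_demo.create_dataset('Bin_FFT', data=np.arange(3))

    # Store an HDF5 object reference to the supporting dataset as an attribute
    h5_main.attrs['Bin_FFT'] = h5_aux.ref

    # Dereference the attribute through the file to get the dataset back
    h5_linked = h5_demo[h5_main.attrs['Bin_FFT']]
    linked_name = h5_linked.name
    linked_values = h5_linked[()].tolist()

print(linked_name, linked_values)
```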

The additional functionality of PycroDataset is enabled through several functions in hdf_utils. Below, we provide
several such examples, each paired with the simpler equivalent that uses the PycroDataset object:



In [None]:
# A function to describe the nature of the contents within a dataset
print(px.hdf_utils.get_data_descriptor(h5_raw))

# this functionality can be accessed in PycroDatasets via:
print(pycro_main.data_descriptor)

Using Ancillary Datasets
========================

As mentioned earlier, the ancillary datasets contain information about the dimensionality of the original
N-dimensional dataset.  Here we see how we can extract the size and corresponding names of each of the spectral
and position dimensions.



In [None]:
# an alternate way to get the spectroscopic indices is simply via:
print(pycro_main.h5_spec_inds)

# We can get the spectral / position labels and dimensions easily via:
print('Spectroscopic dimensions:')
print(pycro_main.spec_dim_descriptors)
print('Size of each dimension:')
print(pycro_main.spec_dim_sizes)
print('Position dimensions:')
print(pycro_main.pos_dim_descriptors)
print('Size of each dimension:')
print(pycro_main.pos_dim_sizes)

When visualizing the data, it is essential to plot the data against appropriate values on the X, Y, Z axes.
Extracting a simple list or array of values to plot against may be challenging, especially for multidimensional
datasets such as the one under consideration. Fortunately, hdf_utils has a very handy function for this as well:



In [None]:
h5_spec_inds = px.hdf_utils.getAuxData(pycro_main, auxDataName='Spectroscopic_Indices')[0]
h5_spec_vals = px.hdf_utils.getAuxData(pycro_main, auxDataName='Spectroscopic_Values')[0]
dimension_name = 'DC_Offset'
dc_dict = px.hdf_utils.get_unit_values(h5_spec_inds, h5_spec_vals, dim_names=dimension_name)
print(dc_dict)
dc_val = dc_dict[dimension_name]

fig, axis = plt.subplots()
axis.plot(dc_val)
axis.set_title(dimension_name)
axis.set_xlabel('Points in dimension')

Yet again, this process is simpler when using the PycroDataset object:



In [None]:
dc_val = pycro_main.get_spec_values(dim_name=dimension_name)

fig, axis = plt.subplots()
axis.plot(dc_val)
axis.set_title(dimension_name)
axis.set_xlabel('Points in dimension')
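
To give a feel for what extracting the values of one dimension involves, here is a numpy-only sketch. It is not pycroscopy's implementation of get_unit_values(); the helper unit_values(), the two dimensions, and their values are all hypothetical, and the layout (one row per dimension, frequency varying fastest) is an assumption of this sketch.

```python
import numpy as np

# Hypothetical spectroscopic dimensions and their values
dim_names = ['Frequency', 'DC_Offset']
freq_vals = np.array([300.0, 310.0, 320.0])
dc_vals = np.array([-1.0, 0.0, 1.0, 0.0])  # note that 0.0 occurs twice

# Build (num_dims x P) index and value matrices, with frequency varying fastest
spec_inds = np.vstack([np.tile(np.arange(3), 4), np.repeat(np.arange(4), 3)])
spec_vals = np.vstack([freq_vals[spec_inds[0]], dc_vals[spec_inds[1]]])


def unit_values(spec_inds, spec_vals, dim_names, dim_name):
    # Hypothetical helper: pick the value at the first column where each index appears
    row = dim_names.index(dim_name)
    _, first_cols = np.unique(spec_inds[row], return_index=True)
    return spec_vals[row, first_cols]


print(unit_values(spec_inds, spec_vals, dim_names, 'DC_Offset'))  # [-1.  0.  1.  0.]
```

Working from the indices rather than taking the unique values directly preserves repeated values such as the second 0.0 in the DC offset.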

Reshaping Data
==============

Pycroscopy stores N-dimensional datasets in a flattened 2D form of position x spectral values. It can become
challenging to retrieve the data in its original N-dimensional form, especially for multidimensional datasets
such as the one we are working on. Fortunately, all the information regarding the dimensionality of the dataset
is contained in the spectral and position ancillary datasets. hdf_utils has a very useful function that can
help retrieve the N-dimensional form of the data using a simple function call:



In [None]:
ndim_form, success, labels = px.hdf_utils.reshape_to_Ndims(h5_raw, get_labels=True)
if success:
    print('Succeeded in reshaping flattened 2D dataset to N dimensions')
    print('Shape of the data in its original 2D form')
    print(h5_raw.shape)
    print('Shape of the N dimensional form of the dataset:')
    print(ndim_form.shape)
    print('And these are the dimensions')
    print(labels)
else:
    print('Failed in reshaping the dataset')

The whole process is simplified further when using the PycroDataset object:



In [None]:
ndim_form = pycro_main.get_n_dim_form()
print('Shape of the N dimensional form of the dataset:')
print(ndim_form.shape)
print('And these are the dimensions')
print(pycro_main.n_dim_labels)
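
The idea behind recovering the N-dimensional form can be sketched with numpy once the size of every dimension is known. The labels, sizes, and data below are made up, and the sketch assumes the dimensions are listed slowest-varying first; pycroscopy derives this information from the ancillary datasets and may order dimensions differently.

```python
import numpy as np

# Hypothetical dimension labels and sizes, slowest-varying first (an assumption of this sketch)
pos_dims = [('Y', 2), ('X', 3)]
spec_dims = [('Cycle', 2), ('DC_Offset', 4), ('Frequency', 5)]

labels = [name for name, _ in pos_dims + spec_dims]
shape_nd = tuple(size for _, size in pos_dims + spec_dims)

# A made-up flattened dataset of shape (positions x spectroscopic points)
num_pos = int(np.prod([size for _, size in pos_dims]))
num_spec = int(np.prod([size for _, size in spec_dims]))
data_2d = np.arange(num_pos * num_spec).reshape(num_pos, num_spec)

# Unfold both axes using the known size of every dimension
data_nd = data_2d.reshape(shape_nd)

print(labels)
print(data_2d.shape, '->', data_nd.shape)
```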

In [None]:
# Close and delete the h5_file
h5_f.close()
os.remove(h5_path)