# Developing Scientific Workflows in Pycroscopy - Part 1: Translation & Data Format

#### Suhas Somnath
8/8/2017

This set of notebooks will serve as examples of developing end-to-end workflows with pycroscopy. 

In this example, we describe the pycroscopy data format and transform a __Scanning Tunnelling Spectroscopy (STS)__ raw data file, as obtained from an Omicron STM, to the pycroscopy data format. 

My hope is that this notebook will serve as a comprehensive example for:

1. __Translation__
    1. Learning how to read the raw data files obtained from certain microscopes
    2. Shaping and structuring the data in a way that is compatible with pycroscopy
    3. Writing this data to .h5 files that are used in pycroscopy
    
2. __Data Access__
    1. Loading, reading, writing, and manipulating HDF5 / H5 files.
    
3. __Data Analysis__
    1. Using data analysis routines already present in pycroscopy
    
4. __Developing Custom Data Processing__
    1. Performing some custom data analysis not available in pycroscopy
    2. Writing results of this analysis back to the file
    
5. __Visualization__
    1. Visualizing results of analyses and processing using pycroscopy functions
    2. Developing simple interactive visualizers

## Why should you care?

The quest for understanding more about samples has necessitated the development of a multitude of microscopes, each capable of numerous measurement modalities. 

Typically, each commercial microscope generates data files in a proprietary format defined by the instrument manufacturer. The proprietary nature of these data formats impedes scientific progress in the following ways:
1. Making it challenging for researchers to extract data from these files
2. Impeding the correlation of data acquired from different instruments
3. Making it impossible to store results back into the same file
4. Making it difficult to accommodate files ranging from a few kilobytes to several gigabytes of data
5. Requiring different versions of analysis routines for each format

Future concerns:
1. Several fields are moving towards the open science paradigm which will require journals and researchers to support journal papers with data and analysis software 
2. US Federal agencies that support scientific research require curation of datasets in a clear and organized manner

To solve these and many more problems, we have developed an __instrument agnostic data format__ that can be used to represent data from any instrument, size, dimensionality, or complexity.

## Pycroscopy data format

Regardless of origin, modality or complexity, imaging data have one thing in common:
* __The same measurement is performed at multiple spatial locations__

The data format in pycroscopy is based on this one simple ground truth. The data always has some spatial dimensions (X, Y, Z) and some spectroscopic dimensions (time, frequency, intensity, wavelength, temperature, cycle, voltage, etc.). In pycroscopy, the spatial dimensions are collapsed onto a single dimension and the spectroscopic dimensions are flattened onto the other dimension. Thus, all data are stored as two dimensional grids. Here are some examples of how some familiar data can be represented using this paradigm:
* __Grayscale photographs__: A single value (intensity) is recorded at each pixel in a two dimensional grid. Thus, there are two spatial dimensions - X, Y and one spectroscopic dimension - "Intensity". The data can be represented as an N x 1 matrix where N is the product of the number of rows and columns of pixels. The second axis has a size of 1 since we only record one value (intensity) at each location. __The positions will be arranged as row0-col0, row0-col1.... row0-colN, row1-col0....__
    * In the case of a color image, the data would be of shape N x 3, where the red, green, and blue intensity values would be stored separately.
* A __single Raman spectrum__: In this case, the measurement is recorded at a single location. At this position, data is recorded as a function of a single (spectroscopic) variable such as wavelength. Thus this data is represented as a 1 x P matrix, where P is the number of points in the spectrum.
* __Scanning Tunnelling Spectroscopy or IV spectroscopy__: The current (a 1D array of size P) is recorded as a function of voltage at each position in a two dimensional grid of points (two spatial dimensions). Thus the data would be represented as an N x P matrix, where N is the product of the number of rows and columns in the grid and P is the number of spectroscopic points recorded. 
    * If the same voltage sweep were performed twice at each location, the data would be represented as N x 2P. The data is still saved as a long (2*P) 1D array at each location. The number of spectroscopic dimensions would change from just ['Voltage'] to ['Voltage', 'Cycle'] where the second spectroscopic dimension would account for repetitions of this bias sweep.
        * __The spectroscopic data would be stored as it would be recorded: volt_0-cycle_0, volt_1-cycle_0..... volt_P-1-cycle_0, volt_0-cycle_1.....volt_P-1-cycle_1, just like the positions.__
    * Now, if the bias was swept thrice from -1 to +1V and then thrice again from -2 to +2V, the data becomes N x (2 * 3 * P). The data now has two position dimensions (X, Y) and three spectroscopic dimensions ['Voltage', 'Cycle', 'Step']. The data is still saved as a (P * 2 * 3) 1D array at each location. 
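
The flattening described in these examples can be sketched with NumPy. This is a minimal illustration on synthetic data, not pycroscopy's own code; the array names and the small grid size are assumptions chosen for the example.

```python
import numpy as np

# Hypothetical STS-like measurement: 3 rows x 4 columns of positions,
# with a spectrum of 5 points recorded at each position.
num_rows, num_cols, num_spec_pts = 3, 4, 5
data_3d = np.arange(num_rows * num_cols * num_spec_pts, dtype=np.float32)
data_3d = data_3d.reshape(num_rows, num_cols, num_spec_pts)

# Collapse the two spatial dimensions into one: N x P, where N = rows * cols.
# Rows of the 2D matrix are ordered row0-col0, row0-col1, ..., row1-col0, ...
data_2d = data_3d.reshape(num_rows * num_cols, num_spec_pts)

# The spectrum at (row=1, col=2) sits at flattened position 1 * num_cols + 2
flat_index = 1 * num_cols + 2
same_spectrum = np.array_equal(data_2d[flat_index], data_3d[1, 2])

# Knowing the original grid size is enough to undo the flattening
restored = data_2d.reshape(num_rows, num_cols, num_spec_pts)
```

Because nothing in the N x P matrix itself records the grid size, the original shape has to be stored alongside it.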
    
#### Making sense of such flattened datasets:
Each main dataset is always accompanied by four ancillary datasets: 
* the position value and index of each spatial location (row)
* the spectroscopic value and index of any column in the dataset

In addition to serving as a legend or the key, these ancillary datasets are necessary for explaining:
* the original dimensionality of the dataset
* how to reshape the data back to its N dimensional form

From the __IV Spectroscopy__ example with [X, Y] x [Voltage, Cycle, Step]:
* The position datasets would be of shape N x 2 - N total positions, two spatial dimensions. 
    * The position indices dataset may start like: 
        * 0, 0
        * 0, 1
        * ....
        * 0, (number of columns - 1)
        * 1, 0 ....
    * The position values dataset would be structured in exactly the same way, holding the physical value of each position in place of its index.
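
To make the role of the index datasets concrete, here is a small NumPy sketch that builds position and spectroscopic indices for a toy [X, Y] x [Voltage, Cycle] dataset. The names and sizes are assumptions for illustration, and the layout follows the ordering described in this section rather than any particular pycroscopy function.

```python
import numpy as np

num_rows, num_cols = 2, 3      # hypothetical spatial grid
num_volts, num_cycles = 4, 2   # hypothetical spectroscopic dimensions

# Position indices: one (row, column) pair per flattened position,
# with the column index varying fastest.
row_inds, col_inds = np.unravel_index(np.arange(num_rows * num_cols),
                                      (num_rows, num_cols))
pos_inds = np.column_stack((row_inds, col_inds))      # shape: N x 2

# Spectroscopic indices: the voltage index varies fastest, then the cycle
cycle_inds, volt_inds = np.unravel_index(np.arange(num_volts * num_cycles),
                                         (num_cycles, num_volts))
spec_inds = np.column_stack((volt_inds, cycle_inds))  # shape: (P * cycles) x 2

# The size of each original dimension can be recovered from the indices alone
grid_shape = tuple(int(size) for size in pos_inds.max(axis=0) + 1)

print(pos_inds[:4])
```

The last step is what allows a flattened dataset to be reshaped back to its N dimensional form without any outside knowledge of the experiment.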

#### Channels
The pycroscopy data format also allows multiple channels of information to be recorded as separate datasets in the same file. For example, one channel could be a spectrum (1D array) collected at each location on a 2D grid while another could be the temperature (single value) recorded by another sensor at the same spatial positions.

## 0. Setting up the notebook

There are a few things that need to be done before any code is written:
1. If the notebook is intended to work with both python 2 and 3, import from `__future__` before importing any other packages
2. Next, import packages necessary for use later. 
3. Set up the plotting backend for matplotlib

In [1]:
# Ensure python 3 compatibility:
from __future__ import division, print_function, absolute_import, unicode_literals

# The package for accessing files in directories, etc.:
from os import path

# The mathematical computation package:
import numpy as np

# The package used for creating and manipulating HDF5 files:
import h5py

# Packages for plotting:
import matplotlib.pyplot as plt
import matplotlib.patches as patches
from mpl_toolkits.axes_grid1 import make_axes_locatable

# Packages for signal filtering and data analysis:
from scipy.signal import medfilt
from scipy.ndimage import gaussian_filter
from sklearn.utils.extmath import randomized_svd
from sklearn.cluster import KMeans

# Finally import pycroscopy for certain scientific analysis:
import pycroscopy as px
from pycroscopy.io.translators.omicron_asc import AscTranslator

# set up notebook to show plots within the notebook
%matplotlib notebook

## 1. Loading data from raw data files

Before any data analysis, we need to access data stored in the files generated by the microscope. Often, the data and parameters in these files are __not__ straightforward to access. In certain cases, additional packages are necessary to access the data while in many other cases, it is possible to extract the necessary information using __numpy__ or similar python packages included with __anaconda__.

Pycroscopy aims to make data access, storage, curation, etc. simple by storing the data along with all relevant parameters in a single __.hdf5__ or __.h5__ file. Among the numerous benefits of __HDF5__ files are that these files:
* are readily compatible with high-performance computing facilities
* scale very efficiently from a few kilobytes to several terabytes
* can be read and modified using any language including Python, Matlab, C/C++, Java, Fortran, Igor Pro, etc.

The process of copying data from the original format to __pycroscopy compatible hdf5__ files is called __Translation__ and the classes available in pycroscopy that perform these operations are called __Translators__.

__The goal in this section is to translate the .asc file obtained from an Omicron microscope into a pycroscopy compatible .h5 file.__
While there is an __AscTranslator__ available in pycroscopy that can translate these files in just a __single__ line, we will intentionally assume that no such translator is available. Using a handful of useful functions in pycroscopy, we will translate the files from the source __.asc__ format to the pycroscopy compatible __.h5__ in just a few lines. The code developed below is essentially the __AscTranslator__. The same methodology can be used to translate other data formats.

In [2]:
#%% Load file
raw_file_path = px.io.uiGetFile(filter='Omicron STS Files (*.asc)')

### Exploring the instrument generated data file

Inherently, one may not know how to read these __.asc__ files. One option is to try and read the file as a text file one line at a time. 

It turns out that these .asc files are effectively standard __ASCII__ text files. 

Here is how we tested whether the __.asc__ files could be interpreted as text files. Below, we read just the first 10 lines in the file:

In [4]:
with open(raw_file_path, 'r') as file_handle:
    for lin_ind in range(10):
        print(file_handle.readline())

# File Format = ASCII

# Created by SPIP 4.6.5.0 2016-09-22 13:32

# Original file: C:\Users\Administrator\AppData\Roaming\Omicron NanoTechnology\MATRIX\default\Results\16-Sep-2016\I(V) TraceUp Tue Sep 20 09.17.08 2016 [14-1]  STM_Spectroscopy STM

# x-pixels = 100

# y-pixels = 100

# x-length = 29.7595

# y-length = 29.7595

# x-offset = -967.807

# y-offset = -781.441

# z-points = 500



Now that we know that these files are simple text files, we can manually go through the file to find out which lines are important, at what lines the data starts etc. 

Manual investigation of such .asc files revealed that these files are always formatted in the same way. Also, they contain parameters in the first 403 lines and then contain data which is arranged as one pixel per row.

STS experiments result in 3 dimensional datasets (X, Y, current). In other words, a 1D array of current data (as a function of excitation bias) is sampled at every location on a two dimensional grid of points on the sample.

By knowing where the parameters are located and how the data is structured, it is possible to extract the necessary information from these files.
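
The number of header lines does not necessarily have to be counted by hand. The sketch below works on a tiny synthetic file and assumes, based only on the lines printed above, that header lines are either blank or start with `#` while data lines are not; the helper name `count_header_lines` is hypothetical and this has not been checked against the full file format.

```python
def count_header_lines(lines, comment_char='#'):
    """Return the index of the first line that looks like data."""
    for line_ind, line in enumerate(lines):
        stripped = line.strip()
        # Assumption: header lines are blank or begin with the comment character
        if stripped and not stripped.startswith(comment_char):
            return line_ind
    return len(lines)  # no data lines were found


# Tiny synthetic stand-in for an .asc file (not real instrument output)
example_lines = ['# File Format = ASCII\n',
                 '# x-pixels = 2\n',
                 '# y-pixels = 1\n',
                 '\n',
                 '0.1\t0.2\t0.3\t\n',
                 '0.4\t0.5\t0.6\t\n']

num_headers_found = count_header_lines(example_lines)
print(num_headers_found)
```

For the file used in this notebook, the result of such a function could be compared against the 403 header lines found by manual inspection.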

#### Step 1. Read the entire file to memory

In [6]:
# Extracting the raw data into memory
with open(raw_file_path, 'r') as file_handle:
    string_lines = file_handle.readlines()

#### Step 2. Read the parameters

The parameters are present in the first few lines of the file.

In [20]:
# Reading parameters stored in the first few rows of the file
parm_dict = dict()
for line in string_lines[3:17]:
    line = line.replace('# ', '')
    line = line.replace('\n', '')
    temp = line.split('=')
    test = temp[1].strip()
    try:
        test = float(test)
        # convert those values that should be integers:
        if test % 1 == 0:
            test = int(test)
    except ValueError:
        pass
    parm_dict[temp[0].strip()] = test

# Print out the parameters extracted
for key in parm_dict.keys():
    print(key, ':\t', parm_dict[key])

z-section :	 491
x-pixels :	 100
z-unit :	 nV
x-length :	 29.7595
z-offset :	 1116.49
x-offset :	 -967.807
z-points :	 500
voidpixels :	 0
y-offset :	 -781.441
z-range :	 2000000000
value-unit :	 nA
y-length :	 29.7595
y-pixels :	 100
scanspeed :	 59519000000
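
The parsing loop above can also be packaged as a small function so that it can be reused and tested on its own. This is a sketch under the same assumption as the loop, namely lines of the form `# name = value`. `parse_header_line` is a hypothetical helper, not part of pycroscopy, and the example lines are reconstructed from the output printed above.

```python
def parse_header_line(line):
    """Split a '# name = value' line into a (name, value) pair."""
    name, _, value = line.lstrip('# ').partition('=')
    name, value = name.strip(), value.strip()
    try:
        value = float(value)
        # Convert whole numbers such as 100.0 to integers
        if value.is_integer():
            value = int(value)
    except ValueError:
        pass  # not a number (for example a unit such as 'nA'): keep the string
    return name, value


# Example lines reconstructed from the printed parameters
example_header = ['# x-pixels = 100\n',
                  '# x-length = 29.7595\n',
                  '# value-unit = nA\n']
example_parms = dict(parse_header_line(line) for line in example_header)
print(example_parms)
```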


#### Step 3.1 Prepare to read the data

Before we read the data, we need to make an empty array to store all this data. In order to do this, we need to read the dictionary of parameters we made in step 2 and extract necessary quantities

In [15]:
num_rows = int(parm_dict['y-pixels'])
num_cols = int(parm_dict['x-pixels'])
num_pos = num_rows * num_cols
spectra_length = int(parm_dict['z-points'])

#### Step 3.2 Read the data

Data is present after the first 403 lines of parameters. 

In [16]:
# num_headers = len(string_lines) - num_pos
num_headers = 403

# Extract the STS data from subsequent lines
raw_data_2d = np.zeros(shape=(num_pos, spectra_length), dtype=np.float32)
for line_ind in range(num_pos):
    this_line = string_lines[num_headers + line_ind]
    string_spectrum = this_line.split('\t')[:-1]  # omitting the new line
    raw_data_2d[line_ind] = np.array(string_spectrum, dtype=np.float32)
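
The `[:-1]` in the loop above suggests that each data line ends with a tab followed by a newline, so that splitting on tabs leaves the newline as a final, non-numeric token. The sketch below shows this on a made-up line, together with an alternative that assumes the values themselves never contain whitespace.

```python
# Made-up data line: tab-separated values with a trailing tab and newline
line = '0.10\t0.25\t-0.40\t\n'

# Splitting on tabs keeps the newline as the last token
tokens = line.split('\t')
values_a = [float(tok) for tok in tokens[:-1]]

# str.split() with no argument splits on any run of whitespace and never
# returns empty tokens, so no slicing is needed
values_b = [float(tok) for tok in line.split()]
```

The second form is also tolerant of lines that lack the trailing tab, which may be useful if files from different software versions differ in that detail.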

#### Step 4a. Preparing the parameters to pass on to the NumpyTranslator

The NumpyTranslator simplifies the creation of pycroscopy compatible datasets. It handles the file creation, dataset creation and writing, creation of ancillary datasets, datagroup creation, writing parameters, linking ancillary datasets to the main dataset, etc.

In [37]:
max_v = 1 # This is the one parameter we are not sure about

folder_path, file_name = path.split(raw_file_path)
file_name = file_name[:-4] + '_'

# Generate the x / voltage / spectroscopic axis:
volt_vec = np.linspace(-1 * max_v, 1 * max_v, spectra_length)

h5_path = path.join(folder_path, file_name + '.h5')
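
For reference, `np.linspace` includes both end points, so the spacing between consecutive bias values is `2 * max_v / (spectra_length - 1)`. The snippet below checks this using the 500 points reported in the file header and the same assumed value of `max_v = 1`.

```python
import numpy as np

max_v = 1             # assumed sweep amplitude, as in the cell above
spectra_length = 500  # 'z-points' from the file header

volt_vec = np.linspace(-max_v, max_v, spectra_length)

# Spacing between neighboring bias values, measured and expected
volt_step = volt_vec[1] - volt_vec[0]
expected_step = 2 * max_v / (spectra_length - 1)
```

If the true sweep amplitude turns out to differ from the assumed value, only `max_v` needs to change.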

#### Step 4b. Calling the NumpyTranslator to do all the heavy lifting

With a single call to the NumpyTranslator, we complete the translation process. 

In [38]:
tran = px.io.NumpyTranslator()
h5_path = tran.translate(h5_path, raw_data_2d, num_rows, num_cols, 
                         qty_name='Current', data_unit='nA', spec_name='Bias', 
                         spec_unit='V', spec_val=volt_vec, scan_height=100, 
                         scan_width=200, spatial_unit='nm', data_type='STS', 
                         translator_name='ASC', parms_dict=parm_dict)
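
Before calling a translator, it can help to confirm that the arrays agree with each other, since a mismatch between the main dataset and its axes is easy to introduce. The following sketch uses a hypothetical helper, `check_translation_inputs`, with placeholder arrays; it is not part of pycroscopy.

```python
import numpy as np


def check_translation_inputs(data_2d, num_rows, num_cols, spec_vals):
    """Raise a ValueError if the flattened data and its axes disagree."""
    num_pos, num_spec = data_2d.shape
    if num_pos != num_rows * num_cols:
        raise ValueError('Expected {} positions but data has {} rows'
                         .format(num_rows * num_cols, num_pos))
    if num_spec != len(spec_vals):
        raise ValueError('Expected {} spectral points but data has {} columns'
                         .format(len(spec_vals), num_spec))


# Placeholder arrays standing in for raw_data_2d and volt_vec
toy_data = np.zeros((6, 4), dtype=np.float32)
toy_volts = np.linspace(-1, 1, 4)
check_translation_inputs(toy_data, num_rows=2, num_cols=3, spec_vals=toy_volts)
```

In this notebook, the equivalent call would pass `raw_data_2d`, `num_rows`, `num_cols` and `volt_vec`.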

## Notes on pycroscopy translation
* Steps 1-3 would be performed anyway in order to begin data analysis
* The actual pycroscopy translation steps are reduced to just 3-4 lines in step 4.
* While this approach is feasible and encouraged for simple and small data, it may be necessary to use lower level calls to write efficient translators

## Next example - Reading and Accessing Data
* Please see the next notebook in the example series to learn more about reading and accessing data. 
* Below, we briefly show what the file looks like after it is written:

In [39]:
with h5py.File(h5_path, mode='r') as h5_file:
    px.hdf_utils.print_tree(h5_file)

/
Measurement_000
Measurement_000/Channel_000
Measurement_000/Channel_000/Position_Indices
Measurement_000/Channel_000/Position_Values
Measurement_000/Channel_000/Raw_Data
Measurement_000/Channel_000/Spectroscopic_Indices
Measurement_000/Channel_000/Spectroscopic_Values
