# Measuring the correlation between data content and filenames

#### Author: **Thomas Casey**
##### Location: University of California Santa Barbara

## Table of Contents
1. [Business Understanding](#Business-Understanding)
2. [Data Understanding](#Data-Understanding)
3. [Data Preparation](#Data-Preparation)
4. [Modeling](#Modeling)
5. [Evaluation](#Evaluation)
6. [Deployment](#Deployment)

<a id="Business Understanding"> </a>
## Business Understanding

Can undocumented and unstructured data be structured and interpreted with an acceptable level of confidence based only on the raw content of the data and filenames associated with the data?


<a id="Data Understanding"> </a>
## Data Understanding

The data are generated using an experiment called Overhauser Dynamic Nuclear Polarization (ODNP). The data should exist in blocks, meaning each individual file or folder is only useful as a part of a collection. The routine that collects and organizes the data is known, meaning some aspects of the structure of the data are understood:
1. Each "sample" corresponds to a base folder. Each folder contains folders named "1" through "33", "304", "503", "700", "701", and two .mat files OR two .csv files; one named "power" and the other named "t1_powers". The folders can be of type null, contain a one dimensional spectrum, or contain two dimensional spectra. The .mat or .csv files contain lists of continuous microwave power measurements made over the span of approximately two hours.
2. One of the one dimensional spectra is considered an "off" spectrum and several others are compared to it. Most of the spectra are one dimensional and should present changing intensities.
3. A subset are two dimensional spectra and should also present differing intensities. 
4. The measurements in the .mat and .csv files should correspond in time to the collection of each spectrum and should be condensed to a length that matches the total number of spectra by sectioning and averaging each section. 
5. The one dimensional spectra must be processed and condensed to a single amplitude that is the integral of the spectrum. The two dimensional spectra must be condensed to a single number "T1" that is the result of processing, integrating, and fitting the trend in integrals to an exponential function.
6. When properly arranged, the data should yield a curve of spectral amplitudes that increases asymptotically and another curve of "T1" values that increases linearly.
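
The sectioning and averaging described in step 4 can be sketched as follows. This is a minimal illustration, assuming the power log is a plain one dimensional array and that every spectrum covers an equal share of the readings; the real acquisition windows may differ, and DNPLab's own routine may work differently. The function name `section_average` and the readings are hypothetical.

```python
import numpy as np


def section_average(readings, n_sections):
    """Split a 1-D list of readings into n_sections nearly equal, contiguous
    sections and return the mean of each section."""
    readings = np.asarray(readings, dtype=float)
    if n_sections < 1 or n_sections > readings.size:
        raise ValueError("n_sections must be between 1 and the number of readings")
    # array_split tolerates lengths that are not an exact multiple of n_sections
    sections = np.array_split(readings, n_sections)
    return np.array([section.mean() for section in sections])


# hypothetical power log: three plateaus of four readings each
power_log = [0.0, 0.1, 0.0, 0.1, 1.0, 1.1, 0.9, 1.0, 2.0, 2.1, 1.9, 2.0]
condensed = section_average(power_log, 3)   # one averaged power per spectrum
print(condensed)
```

The condensed array has one entry per spectrum, which is the shape needed to pair each amplitude or "T1" with a power.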

<a id="Data Preparation"> </a>
## Data Preparation

To handle these data efficiently I will use DNPLab, a Python package that I helped develop and currently maintain. This package contains functions for loading proprietary data formats, processing data, and modeling the data using analytical functions. First, let's isolate one sample folder to learn how to handle the data,

In [None]:
import dnplab                      # import the DNPLab package
from dnplab.dnpImport import load  # condense the syntax for loading data
import os                          # import os for using path tools
import numpy as np                 # import numpy for useful tools
import matplotlib.pyplot as plt    # use matplotlib to create some plots for visualizing the data

base_path = "../test_set"
paths = os.listdir(base_path)      # create a list of paths to try

flag = []
for path in paths:                 # loop through paths attempting to interpret them as data
    try:
        data = load(os.path.join(base_path, path))
        flag.append("DATA")        # successful loading is marked as DATA
    except Exception:              # any loading error marks the entry as NULL
        flag.append("NULL")

truth_table = np.column_stack(     # construct a 2D array to differentiate DATA from NULL folders
    (paths, flag)
)

Let's visualize some of the data,

In [None]:
data = load(os.path.join(base_path, paths[10]))
plt.plot(data.values)
plt.show()

This looks like one dimensional data. Let's look at a two dimensional set,

In [None]:
data = load(os.path.join(base_path, paths[1]))
plt.plot(data.values)
plt.show()

Now that we have the relevant data identified, let's arrange it for modeling. Start by condensing the collection of good and NULL data to just the usable data,

In [None]:
data_dict = {}
for indx, path in enumerate(paths):
    if truth_table[indx, 1] == "DATA":   # keep only the entries that loaded successfully
        data_dict[path] = load(os.path.join(base_path, path))

Finally, let's isolate the target characteristics of the data,

In [None]:
final_target_1D = []
folders_1D = []
final_target_2D = []
folders_2D = []
data_dimensions = []
for indx, spec in enumerate(data_dict.keys()):
    workspace = dnplab.create_workspace("proc", data_dict[spec])
    dnplab.dnpNMR.remove_offset(workspace)
    dnplab.dnpNMR.window(workspace, linewidth=10)
    dnplab.dnpNMR.fourier_transform(workspace, zero_fill_factor=2)
    dnplab.dnpNMR.autophase(workspace, force_positive=False)
    data_dimensions.append(workspace["proc"].ndim)
    dnplab.dnpTools.integrate(workspace)
    if data_dimensions[indx] == 1:
        final_target_1D.append(workspace["proc"].values)
        folders_1D.append(spec)
    elif data_dimensions[indx] == 2:
        dnplab.dnpFit.exponential_fit(workspace, type="T1")
        final_target_2D.append(workspace["fit"].attrs["T1"])
        folders_2D.append(spec)
        

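folders_1D = list(map(int, folders_1D))
order_1D = np.argsort(folders_1D)                         # numeric folder order
final_target_1D = [final_target_1D[i] for i in order_1D]  # keep amplitudes paired with their folders
folders_1D = [folders_1D[i] for i in order_1D]
folders_2D = list(map(str, folders_2D))
order_2D = np.argsort(folders_2D)                         # same string ordering as before
final_target_2D = [final_target_2D[i] for i in order_2D]  # keep T1 values paired with their folders
folders_2D = [folders_2D[i] for i in order_2D]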

powers_1D = dnplab.dnpIO.cnsi.get_powers(base_path, "power", folders_1D)
powers_2D = dnplab.dnpIO.cnsi.get_powers(base_path, "t1_powers", folders_2D)

The data are now arranged in the target format: a list of amplitudes and a list of "T1" values, each with a corresponding list of powers.
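
To make the "T1" step less of a black box, here is a minimal sketch of fitting a recovery curve to a set of integrals. It assumes the model `M_0 - M_inf * exp(-t / T1)`, which may not match the exact form used by `dnplab.dnpFit.exponential_fit`. It also swaps in a simple technique, a grid search over T1 with linear least squares for the two amplitudes, rather than the package's fitting routine. The function name `fit_t1`, the search range, and the synthetic values are all hypothetical.

```python
import numpy as np


def fit_t1(delays, integrals, t1_grid=None):
    """Fit integrals = M_0 - M_inf * exp(-delays / T1) and return (T1, M_0, M_inf).

    For each trial T1 the model is linear in M_0 and M_inf, so both are solved
    by linear least squares; the trial T1 with the smallest residual wins.
    """
    delays = np.asarray(delays, dtype=float)
    integrals = np.asarray(integrals, dtype=float)
    if t1_grid is None:
        # assumed search range: 1/100 to 100 times the longest delay
        t1_grid = np.geomspace(delays.max() / 100, delays.max() * 100, 4000)

    best = None
    for t1 in t1_grid:
        # the two columns multiply M_0 and M_inf respectively
        design = np.column_stack((np.ones_like(delays), -np.exp(-delays / t1)))
        coeffs, *_ = np.linalg.lstsq(design, integrals, rcond=None)
        residual = np.sum((design @ coeffs - integrals) ** 2)
        if best is None or residual < best[0]:
            best = (residual, t1, coeffs[0], coeffs[1])
    return best[1], best[2], best[3]


# synthetic recovery curve generated with T1 = 2.0 (made-up values)
delays = np.linspace(0.0, 10.0, 16)
integrals = 1.0 - 2.0 * np.exp(-delays / 2.0)
t1, m_0, m_inf = fit_t1(delays, integrals)
print(f"T1 = {t1:.2f}")
```

The resolution of the result is limited by the spacing of the trial grid, so a nonlinear optimizer would be the usual choice once the approach is understood.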

<a id="Modeling"> </a>
## Modeling

Models exist for fitting data of this nature and have been programmed into a module within the DNPLab package. Some known parameters must be passed along with the data and physical constants are calculated using optimization routines. Pass the data to the correct module,

In [None]:
hydration = {
    "E": np.array(final_target_1D),
    "E_power": np.array(powers_1D),
    "T1": np.array(final_target_2D),
    "T1_power": np.array(powers_2D),
}
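# T10, T100, spin_C, field, smax_model, and t1_interp_method are the known
# parameters mentioned above; define them for your sample before running this cell
hydration.update(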
    {
        "T10": T10,
        "T100": T100,
        "spin_C": spin_C,
        "field": field,
        "smax_model": smax_model,
        "t1_interp_method": t1_interp_method,
    }
)
hyd = dnplab.create_workspace()
hyd.add("hydration_inputs", hydration)

results = dnplab.dnpHydration.hydration(hyd)

print(results)

<a id="Evaluation"> </a>
## Evaluation

Previously published data were used to confirm the model produces the correct result.

<a id="Deployment"> </a>
## Deployment

An example application was the batch processing of a large dataset in which each `base_folder` had clues in its name but no description of the dataset was available. I processed the entire batch using the procedure above and tried to correlate the clues in the folder names with the results. First, let's make a function that performs the above procedure so that we can pass an entire batch,

In [None]:
def process(base_path):
 
    # insert above code

    return results["hydration_results"]["k_sigma"]

and loop over the entire set of base folders creating a dictionary of one aspect of the results that should inform on the character of the data,

In [None]:
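base_list = os.listdir(set_path)   # set_path is the folder that holds all of the base folders

descriptive = []
for path in base_list:
    # os.listdir returns bare names, so rebuild the full path before processing
    descriptive.append(process(os.path.join(set_path, path)))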

Next, let's take apart the folder names and make a table of potential clues along with the descriptor for each.

In [None]:
import pandas as pd

folder_list = pd.read_csv("../lst_dirs.csv")

date = []
sample = []
index = []
for name in folder_list[folder_list.columns[0]]:   # folder names are in the first column
    nm = name.split("_")
    date.append(nm[0])
    sample.append(nm[2])
    index.append(nm[3])


descriptive_dict = {"date": date,
                    "sample": sample,
                    "index": index,
                    "descriptor": descriptive,
                   }
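
The name-splitting step can also be sketched on its own, without the dataset. The example below assumes five underscore-separated fields per folder name, with date, sample, time, and index at positions 0, 2, 3, and 4. The helper `parse_folder_name` and the folder names are hypothetical and are not taken from the real dataset.

```python
def parse_folder_name(name):
    """Split a folder name on underscores and label the fields used as clues."""
    parts = name.split("_")
    if len(parts) < 5:
        raise ValueError(f"unexpected folder name: {name!r}")
    return {
        "date": parts[0],
        "sample": parts[2],
        "time": parts[3],
        "index": parts[4],
    }


# hypothetical folder names, not taken from the real dataset
names = [
    "190820_odnp_samp6_t0_1",
    "190820_odnp_samp6_t0_2",
    "190821_odnp_samp7_t30_1",
]
clues = [parse_folder_name(name) for name in names]

# select the rows whose sample field mentions "samp6"
samp6 = [clue for clue in clues if "samp6" in clue["sample"]]
print(samp6)
```

Raising an error on names with too few fields makes folders that break the naming pattern visible instead of silently misaligning the table.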

Now we start to evaluate correlations. For example, let's see how "samp6" correlates with the corresponding descriptors,

In [1]:
date = []
sample = []
time = []
index = []
descriptor = []
for indx, name in enumerate(folder_list[folder_list.columns[0]]):
    nm = name.split("_")
    if "samp6" in nm[2]:
        date.append(nm[0])
        sample.append(nm[2])
        time.append(nm[3])
        index.append(nm[4])
        descriptor.append(descriptive[indx])   # keep only the descriptors of the matching folders


descriptive_dict = {"date": date,
                 "sample": sample,
                 "time": time,
                 "index": index,
                 "descriptor": descriptor,
                 }

and let's make a plot of the descriptors,

In [2]:
plt.plot(descriptive_dict["descriptor"])
plt.show()

We notice that the descriptors are in groups. Let's see if the groups correlate with "time",

In [3]:
plt.plot(descriptive_dict["time"],descriptive_dict["descriptor"])
plt.show()

It looks like they are triplicate measurements of the same samples! The correlation also shows up with the index,

In [None]:
plt.plot(descriptive_dict["index"],descriptive_dict["descriptor"])
plt.show()
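
To put a number on how strongly a filename clue groups the descriptors, one option is the correlation ratio (eta squared): the fraction of the descriptor variance that is explained by the group means. This is a suggested measure and was not part of the original analysis. The function name `correlation_ratio` and the values below are made up for illustration.

```python
import numpy as np


def correlation_ratio(labels, values):
    """Return eta squared: between-group variance over total variance.

    1.0 means the label fully determines the value; 0.0 means every group
    mean equals the overall mean.
    """
    labels = np.asarray(labels)
    values = np.asarray(values, dtype=float)
    total = np.sum((values - values.mean()) ** 2)
    if total == 0:
        return 0.0  # constant values carry no variance to explain
    between = 0.0
    for label in np.unique(labels):
        group = values[labels == label]
        between += group.size * (group.mean() - values.mean()) ** 2
    return between / total


# made-up triplicate descriptors for three time labels
time = ["t0", "t0", "t0", "t30", "t30", "t30", "t60", "t60", "t60"]
descriptor = [10.1, 9.9, 10.0, 20.2, 19.8, 20.0, 30.1, 29.9, 30.0]
eta_squared = correlation_ratio(time, descriptor)
print(round(eta_squared, 3))
```

A value close to 1 for "time" and a much lower value for another clue would support the reading that the groups are replicate measurements at each time point.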