# dataimport.py example

This notebook will show how the functions contained within the `dataimport.py` module are used to generate hdf5 files for storing raw data and peak fitting results. This module is also the primary way in which the functions contained within spectrafit.py are utilized.

In [1]:
# """
# This model allows a user who has all their experimental data saved in a directory folder to mass import the files and have an hdf5 file created that has the data organized and fit. This module interacts closely with the dataprep.py module also
# included with this package. 

# The advantage of this model is that it is an automated (looped) version of the `add_experiment` function in dataprep.py thus allowing a user not to have to manually import data files one at a time.

# The user needs to have had an organized filename structure as the organized hdf5 file relies on it. For this use case the compound name 'FA_' (Formic Acid) is the first part of the files in the directory and then the 

# Developed by the Raman-Noodles team (2019 DIRECT Cohort, University of Washington)
# """


#initial imports
import os
import h5py
import matplotlib.pyplot as plt
from ramandecompy import dataprep


def data_import(hdf5_filename, directory):
#     """
#     This function adds Raman experimental data to an existing hdf5 file. It uses the
#     spectrafit.fit_data function to fit the data before saving the fit result and
#     the raw data to the hdf5 file. The data_filename must be in a standardized format
#     to interact properly with this function. It must take the form anyname_temp_time.xlsx
#     (or .csv) since this function will parse the the temp and time from the filename to
#     label the data and fit result in the hdf5 file.

#     Args:
#         hdf5_filename (str): the filename and location of an existing hdf5 file to add the
#                              experiment data too. Variable must be in a string format.



#         directory (str): the folder location of raw Raman spectroscopy data in a 
#                              string format.

#     Returns:
#         None
#     """
    # open hdf5 file as read/write
    dataprep.new_hdf5(hdf5_filename)
    exp_file = h5py.File(hdf5_filename+'.hdf5', 'r+')
    for filename in os.listdir(directory):
        if filename.startswith('FA_') and filename.endswith('.csv'):
            locationandfile = directory + filename
            dataprep.add_experiment(str(hdf5_filename)+'.hdf5', locationandfile)
            print('Data from {} fit with compound pseudo-Voigt model. Results saved to {}.'.format(filename, hdf5_filename))
            # printing out to user the status of the import (because it can take a long time if importing a lot of data,
            # about minute/data set for test files
            exp_file.close()
            continue
        else:
            print('Data from {} fit with compound pseudo-Voigt model. Results saved to {}.'.format(filename, hdf5_filename))
            exp_file.close()
            continue
    return

def add_experiment(hdf5_filename, exp_filename):
    """
    This function adds Raman experimental data to an existing hdf5 file. It uses the
    spectrafit.fit_data function to fit the data before saving the fit result and
    the raw data to the hdf5 file. The data_filename must be in a standardized format
    to interact properly with this function. It must take the form anyname_temp_time.xlsx
    (or .csv) since this function will parse the the temp and time from the filename to
    label the data and fit result in the hdf5 file.

    Args:
        hdf5_filename (str): the filename and location of an existing hdf5 file to add the
                             experiment data too.
        exp_filename (str): the filename and location of raw Raman spectroscopy data in
                             either the form of an .xlsx or a .csv with the wavenumber data
                             in the 1st column and the counts data in the 2nd column. These
                             files should contain only the wavenumber and counts data
                             (no column labels).

    Returns:
        None
    """
    # handling input errors
    if not isinstance(hdf5_filename, str):
        raise TypeError('Passed value of `hdf5_filename` is not a string! Instead, it is: '
                        + str(type(hdf5_filename)))
    if not hdf5_filename.split('/')[-1].split('.')[-1] == 'hdf5':
        raise TypeError('`hdf5_filename` is not type = .hdf5! Instead, it is: '
                        + hdf5_filename.split('/')[-1].split('.')[-1])
    if not isinstance(exp_filename, str):
        raise TypeError('Passed value of `data_filename` is not a string! Instead, it is: '
                        + str(type(exp_filename)))
    # confirm exp_filename is correct format (can handle additional decimals in exp_filename
    label = '.'.join(exp_filename.split('/')[-1].split('.')[:-1])
    if len(label.split('_')) < 2:
        raise ValueError("""Passed value of `exp_filename` inapproprate. exp_filename must contain
        at least one '_', preferably of the format somename_temp_time.xlsx (or .csv)""")
    # r+ is read/write mode and will fail if the file does not exist
    exp_file = h5py.File(hdf5_filename, 'r+')
    if exp_filename.split('.')[-1] == 'xlsx':
        data = pd.read_excel(exp_filename, header=None, names=('wavenumber', 'counts'))
    elif exp_filename.split('.')[-1] == 'csv':
        data = pd.read_csv(exp_filename, header=None, names=('wavenumber', 'counts'))
    else:
        print('data file type not recognized')
    # ensure that the data is listed from smallest wavenumber first
    if data['wavenumber'][:1].values > data['wavenumber'][-1:].values:
        data = data.iloc[::-1]
        data.reset_index(inplace=True, drop=True)
    else:
        pass
    # peak detection and data fitting
    fit_result, residuals = spectrafit.fit_data(data['wavenumber'].values, data['counts'].values)
    # extract experimental parameters from filename
    specs = exp_filename.split('/')[-1].split('.')[-2]
    if len(specs) > 1:
        spec = ''
        for _, element in enumerate(specs):
            spec = str(spec+element)
        specs = spec
    specs = specs.split('_')
    time = specs[-1]
    temp = specs[-2]
    # write data to .hdf5
    exp_file['{}/{}/wavenumber'.format(temp, time)] = data['wavenumber']
    exp_file['{}/{}/counts'.format(temp, time)] = data['counts']
    exp_file['{}/{}/residuals'.format(temp, time)] = residuals
    for i, result in enumerate(fit_result):
        # create custom datatype
        my_datatype = np.dtype([('fraction', np.float),
                                ('center', np.float),
                                ('sigma', np.float),
                                ('amplitude', np.float),
                                ('fwhm', np.float),
                                ('height', np.float),
                                ('area under the curve', np.float)])
        if i < 9:
            dataset = exp_file.create_dataset('{}/{}/Peak_0{}'.format(temp, time, i+1),
                                              (1,), dtype=my_datatype)
        else:
            dataset = exp_file.create_dataset('{}/{}/Peak_{}'.format(temp, time, i+1),
                                              (1,), dtype=my_datatype)
        # apply data to tuple
        data = tuple(result[:7])
        data_array = np.array(data, dtype=my_datatype)
        # write new values to the blank dataset
        dataset[...] = data_array
    print("""Data from {} fit with compound pseudo-Voigt model.
     Results saved to {}.""".format(exp_filename, hdf5_filename))
    exp_file.close()

In [2]:
import os
import h5py
import pandas as pd
import matplotlib.pyplot as plt
import math
import numpy as np
import lineid_plot
from ramandecompy import spectrafit
from ramandecompy import peakidentify
from ramandecompy import dataprep
from ramandecompy import datavis
from ramandecompy import dataimport

## The old way
### dataprep.add_experiment

First we will see how the function `dataprep.add_experiment` operates and how it stores experimental data under groups that specify the temperature and residence time for each experiment added. First we will make a new `experiment.hdf5` file to store the experimental data. Importing this file will take longer than the earlier examples since this spectra contains a larger number of peaks that need to be fit. 

In [3]:
dataprep.new_hdf5('olddataprep_experiment')
hdf5_filename = 'olddataprep_experiment'
oldexphdf5 = h5py.File(hdf5_filename+'.hdf5', 'r+')
add_experiment('olddataprep_experiment.hdf5', '../ramandecompy/tests/test_files/FA_3.6wt%_300C_25s.csv')
add_experiment('olddataprep_experiment.hdf5', '../ramandecompy/tests/test_files/FA_3.6wt%_400C_12.5s.csv')
# dataprep.add_experiment('olddataprep_experiment.hdf5', '../ramandecompy/tests/test_files/FA_3.6wt%_300C_45s.csv')

Data from ../ramandecompy/tests/test_files/FA_3.6wt%_300C_25s.csv fit with compound pseudo-Voigt model.
     Results saved to olddataprep_experiment.hdf5.


IndexError: list index out of range

In [None]:
dataprep.view_hdf5('olddataprep_experiment.hdf5')


In [None]:
oldexphdf5.close()

In [None]:
os.remove('olddataprep_experiment.hdf5')



### dataimport.data_import

Next we will see how the slighly different function `dataimport.data_import` operates and how it stores experimental data under groups that specify the temperature and residence time for each experiment added. The function `dataimport.data_import` is essentially a wrapper function which uses the `dataprep.add_experiment` to add all the excel data files 

First we will make a new `dataimport_experiment.hdf5` file to store the experimental data. Then we will search through the directory `../ramandecompy/tests/test_files/` to 

Importing this file will take longer than the earlier examples since this spectra contains a larger number of peaks that need to be fit. 

In [None]:
hdf5_filename = 'dataimport_experiment'
directory = '../ramandecompy/tests/test_files/'
# open hdf5 file as read/write
data_import(hdf5_filename,directory)
dataprep.view_hdf5(hdf5_filename+'.hdf5')

In [None]:
# open .hdf5
hdf5_filename = 'dataimport_experiment'
exphdf5 = h5py.File(hdf5_filename+'.hdf5', 'r+')

In [None]:
dataprep.view_hdf5(hdf5_filename+'.hdf5')

## Remove the file so that there are no errors - this is done in the basic .py file

In order to keep the file system clean, and to avoid errors associated with running this notebook multiple times, we lastly will delete the two .hdf5 files generated by this notebook. Comment out the final cell if you wish you explore these files further.


Close the hdf5 file first to stop all processes using the file

In [None]:
exphdf5.close()

Remove the hdf5 file

In [None]:
os.remove('dataimport_experiment.hdf5')