# Polyolefin InfraRed Classification - Piped Preprocessing

In connection with: add DOI INFO/LINK HERE

This code was predominantly produced by Bradley P. Sutliff, with assistance from Tyler B. Martin and Debra Audus.

This notebook is provided in an effort to further open research initiatives and to further the circular economy.

Please direct any questions to Bradley.Sutliff@nist.gov

## Notebook setup

### Do we want to save the results of this notebook as NetCDF files?

This flag is mostly here to prevent overwriting files or generating files unnecessarily while you are "playing" with the code. You can always edit it or save the files manually later. You will eventually need to save the NetCDF file for the following notebooks to work.

In [1]:
save_data = True

### Define color schemes and plotting dictionaries

In [2]:
import seaborn as sns

# set up notebook for plotting nicely
%matplotlib inline
contxt = "notebook"
sns.set(context=contxt, style="ticks", palette="bright")

### Set our color palette to match one of our dictionaries

In [3]:
import Scripts.misc_funcs as misc

cpalette = misc.dict_cBlind

## Load data from our files

It is assumed that the data and code are set up in the following directory structure:

```
Main  
  ├ *.ipynb  
  ├ Data  
  |  ├ SampleInformation.csv  
  |  └ NIR  
  |    ├ N1476LDPE_1.csv  
  |    ├ ...  
  |    └ H0009PP_7.csv  
  └ Scripts  
    ├ *.py  
    ├ *.sh  
    └ *.ps1  
```

First, we load the file that holds our general sample information.

In [4]:
import pandas as pd

sample_info = pd.read_csv('Data/SampleInformation.csv')
sample_info.tail()

| Unnamed: 0 | Source | Sample | Class1 | Class2 | Physical State | Color | Recycled | Alternate names | bigSMILES | CAS number | Material Keywords | reference URL |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 40 | Commercial Supplier 4 | P0051LLDPE | PE | LLDPE | Pellet | Natural | Yes | `` `linear low density polyethylene','LLDPE','pol... `` | `{[][$]CC[$],[$]CC({[$][$]C[$][$]}C)[$][]}` | 9002-88-4 | `polyolefins','semicrystalline','copolymer'` | |
| 41 | Commercial Supplier 3 | E0046PP | PP | PP | Pellet | Gray | Yes | `` `polypropylene','PP' `` | `{[][$]CC(C)[$][]}` | 9003-07-0 | `polyolefins','semicrystalline','linear','homop...` | |
| 42 | Commercial Supplier 3 | E0035PP | PP | PP | Pellet | Black | Yes | `` `polypropylene','PP' `` | `{[][$]CC(C)[$][]}` | 9003-07-0 | `polyolefins','semicrystalline','linear','homop...` | |
| 43 | Commercial Supplier 2 | C0028PE | PE | | Pellet | Natural | | `` `polyethylene', 'PE' `` | `{[][$]CC[$][]}` | 9002-88-4 | `polyolefins','semicrystalline','linear','homop...` | |
| 44 | Commercial Supplier 2 | C0079PE | PE | | Pellet | Natural | | `` `polyethylene', 'PE' `` | `{[][$]CC[$][]}` | 9002-88-4 | `polyolefins','semicrystalline','linear','homop...` | |


Next, we can use this table to generate the list of CSV files that hold the spectral data.

In [5]:
import itertools

samples = sample_info.Sample
replicates = [1, 2, 3, 4, 5, 6, 7]
pre_filelist = itertools.product(samples, replicates)
filelist = [
    f"Data/NIR/{'_'.join([fs[0], str(fs[1])])}.csv" for fs in pre_filelist]
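Because the file list is built from the sample table rather than from the directory contents, a missing replicate only surfaces when `pd.read_csv` fails inside the loading loop. The following is a minimal sketch of an up-front check, assuming the same `Data/NIR/<Sample>_<replicate>.csv` naming. The helper names `build_filelist` and `missing_files` are hypothetical and are not part of `Scripts`.

```python
import itertools
from pathlib import Path


def build_filelist(samples, replicates, root="Data/NIR"):
    """Return one CSV path per (sample, replicate) pair."""
    return [f"{root}/{sample}_{rep}.csv"
            for sample, rep in itertools.product(samples, replicates)]


def missing_files(filelist):
    """Return the paths in filelist that do not exist on disk."""
    return [path for path in filelist if not Path(path).is_file()]


# Two sample names taken from the directory listing above
example_files = build_filelist(["N1476LDPE", "H0009PP"], range(1, 8))
print(len(example_files))   # 14 paths: 2 samples x 7 replicates
print(example_files[0])     # Data/NIR/N1476LDPE_1.csv
```

Calling `missing_files(filelist)` before the loading loop would list any absent files in one pass instead of stopping at the first failure.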

Now we can load each `*.csv` file from our file list and add it to an Xarray Dataset. As we load each file, we'll also attach the metadata from the sample information table and the filename, and we'll add units where necessary.

In [6]:
from tqdm import tqdm
import datetime
import xarray as xr

ds_list = []
for filepath in tqdm(filelist, position=0, leave=True):
    sample = filepath.split('/')[-1].split('.')[0]
    polymer = sample.split('_')[0]
    repeat = sample.split('_')[1].split('.')[0]
    s_info = sample_info.loc[sample_info.Sample == polymer, :]

    # use pandas csv reader to read in file
    dataframe = pd.read_csv(filepath, names=['Wavenumber', 'Intensity'])
    # some files span wider wavenumber ranges than others, so this ensures
    # we are comparing similar ranges of spectra
    dataframe = dataframe.loc[dataframe['Wavenumber'] < 10000, :]

    # convert to Xarray Dataset object
    dataset = dataframe.set_index('Wavenumber').to_xarray()

    # add in extra metadata from filename/filepath
    dataset['polymer'] = str(polymer)
    dataset['sample'] = str(sample)
    dataset['state'] = str(s_info['Physical State'].values[0])
    dataset['repeat'] = int(repeat)
    dataset['Class1'] = str(s_info['Class1'].values[0])
    dataset['Class2'] = str(s_info['Class2'].values[0])
    #dataset['Class2_num'] = misc.dict_zord[str(s_info['Class2'].values[0])]
    dataset['Color'] = str(s_info['Color'].values[0])

    # add units where applicable
    dataset['Wavenumber'].attrs = {'units': '1/cm'}
    dataset['Intensity'].attrs = {'units': '% Reflectance'}

    # define global attributes
    dataset.attrs = {'creation_date': datetime.datetime.now().strftime('%Y%m%d'),
                     'author': 'Bradley Sutliff',
                     'email': 'Bradley.Sutliff@nist.gov',
                     'data_collected_by': 'Shailja Goyal'}
    # aggregate data into a list
    ds_list.append(dataset)

ds = xr.concat(ds_list, dim='sample')
ds = ds.set_coords(['sample', 'polymer', 'state',
                   'Class2', 'Class1', 'repeat',
                    'Color'])#, 'Class2_num'])
  
# also saving a copy of this for later use
ds_nopipdims = ds.copy()
ds['Intensity'] = (ds.Intensity
                     .assign_coords({"pipeline": "none"})
                     .expand_dims("pipeline"))

100%|███████████████████████████████████████████████████████████████████████████████| 315/315 [00:01<00:00, 297.84it/s]


## Use OneHotEncoder on Class1 and Class2 categorical variables

First, separate out the samples that don't have `Class2` labels.

In [7]:
ds_noClass2 = ds.where(ds.Class2=='nan', drop=True)
ds = ds.where(ds.Class2!='nan', drop=True)

if save_data:
    ds.to_netcdf('ds_for_ohe.nc')
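The comparison against the string `'nan'` works because the loading loop stores `Class2` via `str(...)`, and an empty CSV field is read by pandas as a floating-point NaN. A minimal sketch of that conversion in plain Python:

```python
import math

missing = float("nan")        # what pandas stores for an empty CSV field
label = str(missing)          # what the loading loop stores in the Dataset

print(label == "nan")         # True: the string comparison matches
print(missing == missing)     # False: NaN never equals itself
print(math.isnan(missing))    # True: the usual test for a float NaN
```

One consequence of this design is that a sample whose `Class2` label was literally the text "nan" would also be filtered out, which is not a concern for the polymer classes used here.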

Now we can encode so that we can use these labels for our classification models later down the road (if we need to).

In [8]:
from sklearn.preprocessing import OneHotEncoder

ohe_c1 = OneHotEncoder(sparse_output=False, handle_unknown='ignore')
ohe_c1.fit(ds.Class1.values.reshape(-1,1))
ohe_c1_data = ds.Class1.values.reshape(-1,1)
ds['Class1_ohe'] = xr.DataArray(
    data=ohe_c1.transform(ohe_c1_data),
    dims=["sample", "class_1"],
    coords=dict(
        sample=ds.sample.values,
        class_1=ohe_c1.categories_[0])
)

ohe_c2 = OneHotEncoder(sparse_output=False, handle_unknown='ignore')
ohe_c2.fit(ds.Class2.values.reshape(-1,1))

ohe_c2_data = ds.Class2.values.reshape(-1,1)
ds['Class2_ohe'] = xr.DataArray(
    data=ohe_c2.transform(ohe_c2_data),
    dims=["sample", "class_2"],
    coords=dict(
        sample=ds.sample.values,
        class_2=ohe_c2.categories_[0])
)

ds=ds.set_coords(['Class1_ohe', 'Class2_ohe'])
ohe_dict={'Class1': ohe_c1, 'Class2':ohe_c2}
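For intuition about what the encoded arrays contain, the following is a minimal NumPy sketch of one-hot encoding. It is not the scikit-learn implementation, and the label array is made up for illustration.

```python
import numpy as np

# Hypothetical class labels, one per spectrum
labels = np.array(["PE", "PP", "PE", "PP", "PE"])

# Sorted unique categories and the integer code of each label
categories, codes = np.unique(labels, return_inverse=True)

# Row i of the identity matrix is the one-hot vector for category i
one_hot = np.eye(len(categories))[codes]

print(categories)   # the two categories, sorted: PE then PP
print(one_hot[:2])  # first row encodes 'PE', second row encodes 'PP'

# Decoding: the position of the 1 in each row indexes back into categories
decoded = categories[one_hot.argmax(axis=1)]
```

With the fitted encoders stored in `ohe_dict`, `OneHotEncoder.inverse_transform` plays the role of the decoding step.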



## Define our pipelines

Now we can start doing what we really want to do: preprocess the data.

### Start by defining all of our transformers

By making these functions into transformers, we can easily call them from within a pipeline.

In [9]:
from sklearn.preprocessing import (FunctionTransformer, MinMaxScaler,
                                   StandardScaler, Normalizer)
from sklearn_xarray import wrap
import umap

trans_dict = {'None1': FunctionTransformer(misc.id_func, validate=False),
              'None2': FunctionTransformer(misc.id_func, validate=False),
              'None3': FunctionTransformer(misc.id_func, validate=False),
              'None4': FunctionTransformer(misc.id_func, validate=False),
              'None5': FunctionTransformer(misc.id_func, validate=False),
              'None6': FunctionTransformer(misc.id_func, validate=False),
              'MeanCentering':FunctionTransformer(misc.xr_MC, validate=False),
              'SNV':FunctionTransformer(misc.xr_SNV, validate=False),
              'RNV': FunctionTransformer(misc.xr_RNV, validate=False),
              'Detrending': FunctionTransformer(misc.xr_detrend, validate=False),
              'SG': FunctionTransformer(misc.xr_SG,
                                        kw_args={'window_length':21,
                                                 'polyorder':6},
                                        validate=False),
              'SG1': FunctionTransformer(misc.xr_SG,
                                         kw_args={'window_length':21,
                                                  'polyorder':6, 'deriv':1},
                                         validate=False),
              'SG2': FunctionTransformer(misc.xr_SG2,
                                         kw_args={'window_length':21,
                                                  'polyorder':6},
                                         validate=False),
              'L1': wrap(Normalizer(norm='l1'), sample_dim='sample'),
              'L2': wrap(Normalizer(norm='l2'), sample_dim='sample'),
              'MinMaxScaler': wrap(MinMaxScaler(), sample_dim='sample'),
              'StandardScaler': wrap(StandardScaler(), sample_dim='sample'),
              'PCA': misc.my_PCA(n_components=0.99, min_n_components=5),
              'fPCA': misc.my_fPCA(n_components=0.99, min_n_components=5),
              'UMAP': wrap(umap.UMAP(n_components=5), sample_dim='sample',
                           reshapes='Wavenumber')}
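As a point of reference for the scatter-correction steps, standard normal variate (SNV) is conventionally defined per spectrum: subtract that spectrum's mean and divide by its standard deviation. The sketch below uses plain NumPy on synthetic data. `snv` is a hypothetical helper, and `misc.xr_SNV` may differ in details such as the degrees of freedom used for the standard deviation.

```python
import numpy as np


def snv(spectra):
    """Standard normal variate: scale each row to zero mean, unit std."""
    spectra = np.asarray(spectra, dtype=float)
    mean = spectra.mean(axis=1, keepdims=True)
    std = spectra.std(axis=1, keepdims=True)
    return (spectra - mean) / std


# Synthetic "spectra": 3 samples x 50 wavenumbers with different offsets and scales
rng = np.random.default_rng(0)
base = np.sin(np.linspace(0, 3, 50))
spectra = np.stack([a * base + b for a, b in [(1, 0), (2, 5), (0.5, -1)]])
spectra += rng.normal(scale=0.01, size=spectra.shape)

corrected = snv(spectra)
print(corrected.mean(axis=1).round(6))  # approximately 0 for every sample
print(corrected.std(axis=1).round(6))   # approximately 1 for every sample
```

Because the statistics are computed along each row, SNV removes per-sample offset and scale differences without using information from other samples.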

### Now we make the preprocessing pipelines

We can alter these lists if we don't want to run all 1152$^*$ pipelines.

$^*$1152 pipelines per combination of color and state.

In [10]:
# data selection
# l_data_colors = ['Natural', 'NoBlack', 'AllColors'] # Possible color down selection
# l_data_state = ['Pellet', 'AllStates'] # Possible physical state down selection
l_data_colors = ['AllColors']
l_data_state = ['AllStates']

# preprocessing steps
l_pp_1 = ['None1', 'MeanCentering', 'SNV', 'RNV']
l_pp_2 = ['None2','Detrending']
l_pp_3 = ['None3', 'SG', 'SG1', 'SG2']
l_pp_4 = ['None4', 'L1', 'L2',]                      # Sample normalization
l_pp_5 = ['None5', 'MinMaxScaler', 'StandardScaler'] # Feature normalization

# data reduction steps
l_dr = ['None6', 'PCA', 'fPCA', 'UMAP']
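The 1152 figure is the product of the list lengths, since each pipeline draws one entry from every list. A quick check, with the lists copied from the cell above:

```python
import itertools
import math

l_pp_1 = ['None1', 'MeanCentering', 'SNV', 'RNV']
l_pp_2 = ['None2', 'Detrending']
l_pp_3 = ['None3', 'SG', 'SG1', 'SG2']
l_pp_4 = ['None4', 'L1', 'L2']
l_pp_5 = ['None5', 'MinMaxScaler', 'StandardScaler']
l_dr = ['None6', 'PCA', 'fPCA', 'UMAP']

steps = [l_pp_1, l_pp_2, l_pp_3, l_pp_4, l_pp_5, l_dr]
n_pipelines = math.prod(len(step) for step in steps)
print(n_pipelines)  # 4 * 2 * 4 * 3 * 3 * 4 = 1152

# Same count obtained by enumerating the combinations
assert n_pipelines == len(list(itertools.product(*steps)))
```

Removing a single entry from one list therefore cuts the total by that list's share: dropping `'UMAP'` from `l_dr`, for example, leaves 864 pipelines.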

In [11]:
from sklearn.pipeline import Pipeline

preproc_dict = {}
for i in itertools.product(l_pp_1, l_pp_2, l_pp_3,l_pp_4, l_pp_5, l_dr):
    pipename = '_'.join(i).strip('_')
    pipe = [(j, trans_dict[j]) for j in i]
    preproc_dict[pipename] = Pipeline(pipe)
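Each pipeline name is the underscore-joined list of its step names, and none of the step names contains an underscore, so a subset of pipelines can be selected by name before running the expensive loop. A minimal sketch, using shortened step lists for brevity; `has_step` is a hypothetical helper:

```python
import itertools

l_pp_1 = ['None1', 'SNV']
l_pp_3 = ['None3', 'SG1']
l_dr = ['None6', 'PCA']

names = ['_'.join(combo) for combo in itertools.product(l_pp_1, l_pp_3, l_dr)]
print(names[0])  # None1_None3_None6


def has_step(pipename, step):
    """True if step is one of the underscore-separated parts of pipename."""
    return step in pipename.split('_')


# Keep only pipelines that apply SNV and end in PCA
subset = [n for n in names if has_step(n, 'SNV') and n.endswith('_PCA')]
print(subset)  # ['SNV_None3_PCA', 'SNV_SG1_PCA']
```

Matching on whole step names rather than substrings matters here, because `'PCA'` is a substring of `'fPCA'` and `'SG'` is a substring of `'SG1'` and `'SG2'`.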

In [12]:
l_dsel = list(itertools.product(l_data_colors, l_data_state))
l_dsel

[('AllColors', 'AllStates')]

$\color{red}{\text{Warning! This next cell can easily take an hour or more to run!}}$
$\color{red}{\text{If you don't want data from all possible combinations, please reduce the lists l_pp_*}}$
$\color{red}{\text{and l_dr to reduce computation time!}}$

In [13]:
import numpy as np

ds_pp_X = xr.Dataset()
ds_pp_Y = xr.Dataset()

for dsel in tqdm(l_dsel, position=0, leave=True):
     
    X, y = misc.data_select(ds, colors=dsel[0], state=dsel[1])

    for pipe in list(preproc_dict): # you can also slice this list to speed things up 
        # print(pipe)
        pipe_transformer = preproc_dict[pipe]
        pipe_transformer.fit(X)
        X_pp = pipe_transformer.transform(X)

        ds_pp = xr.DataArray(
            data = X_pp,
            dims = ["sample", "feature"],
            coords = {
                "sample": X_pp.sample.values,
                "feature": np.arange(X_pp.shape[1]),
            })

        ds_pp_X[f'{dsel[0]}-{dsel[1]}_{pipe}'] = ds_pp
        ds_pp_Y[f'{dsel[0]}-{dsel[1]}_{pipe}'] = y

100%|██████████████████████████████████████████████████████████████████████████████████| 1/1 [41:18<00:00, 2478.05s/it]


Check the number of data variables in X and Y. The counts should match so that the data can be accessed consistently later. Since the labels depend only on the data selection and not on the pipeline, `ds_pp_Y` could hold as few as `len(l_dsel)` variables if we wanted to use less memory or drive space.

In [14]:
print(f'X: {len(ds_pp_X.data_vars)}')
print(f'Y: {len(ds_pp_Y.data_vars)}')

X: 1152
Y: 1152


## Do we want to save the preprocessed data?

In [15]:
save_data=True

In [16]:
if save_data:
    import os
    # alias avoids shadowing the datetime module imported earlier
    from datetime import datetime as dt

    # make sure the output directory exists
    newpath = 'NetCDFs/'
    os.makedirs(newpath, exist_ok=True)
    date = dt.today().strftime('%Y%m%d%H%M%S')
    print(date)
    ds_pp_X.to_netcdf(f'{newpath}{date}_preprocessed_X_example.nc')
    ds_pp_Y.to_netcdf(f'{newpath}{date}_preprocessed_Y_example.nc')

20240709135516


In [20]:
# `date` only exists if the files were saved in the previous cell
if save_data:
    with open('netCDF_date.txt', 'w') as file:
        file.write(f'{date}')
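The timestamp file lets later notebooks locate the matching NetCDF files without hard-coding a date. The following is a minimal sketch of that lookup, assuming the file naming used above. `preprocessed_paths` is a hypothetical helper, and a temporary directory stands in for the project folder.

```python
import tempfile
from pathlib import Path


def preprocessed_paths(date_file, folder="NetCDFs"):
    """Rebuild the X and Y NetCDF paths from the saved timestamp."""
    date = Path(date_file).read_text().strip()
    return (f"{folder}/{date}_preprocessed_X_example.nc",
            f"{folder}/{date}_preprocessed_Y_example.nc")


with tempfile.TemporaryDirectory() as tmp:
    date_file = Path(tmp) / "netCDF_date.txt"
    date_file.write_text("20240709135516")  # timestamp printed above
    x_path, y_path = preprocessed_paths(date_file)

print(x_path)  # NetCDFs/20240709135516_preprocessed_X_example.nc
```

The returned paths could then be passed to `xr.open_dataset` to reload the preprocessed data.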

In [17]:
print('Finished!')

Finished!
