# Polyolefin InfraRed Classification - Bulk Classification (No HP-Tuning)

In connection with: add DOI INFO/LINK HERE

This code was predominantly produced by Bradley P. Sutliff, with assistance from Tyler B. Martin and Debra Audus.

This notebook is provided in an effort to further open research initiatives and to further the circular economy.

Please direct any questions to Bradley.Sutliff@nist.gov

## Notebook setup

This notebook shows how our bulk classification step works. It is essentially one large for-loop that goes through each set of preprocessed data and trains and tests each classifier on that set.


### Do we want to save the results of this notebook as individual csv files?

The results are saved per pipeline, so each output file includes the scores for every classifier but for only one pipeline.
While this generates an annoying number of CSV files (1152, if running all pipelines), it also lets users pick up where they left off if the code fails or stalls midway through.
The flag below is mostly there to prevent overwriting files or generating files unnecessarily while you are "playing" with the code. You can always edit it or save the results manually later.

In [1]:
save_data = False
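
Because each pipeline gets its own CSV, resuming an interrupted run can be as simple as skipping pipelines whose file already exists. The following is a minimal sketch of that idea, not part of the original notebook: the helper `pipes_still_to_run` and the pipeline names are hypothetical, and a temporary folder stands in for the output directory.

```python
import os
import tempfile


def pipes_still_to_run(pipes, out_dir):
    """Return the pipelines that do not yet have a saved score file in out_dir."""
    return [p for p in pipes
            if not os.path.exists(os.path.join(out_dir, f"{p}.csv"))]


# Hypothetical pipeline names, for illustration only
all_pipes = ["pipe_A", "pipe_B", "pipe_C"]

with tempfile.TemporaryDirectory() as out_dir:
    # Pretend an earlier run saved pipe_A before stalling
    open(os.path.join(out_dir, "pipe_A.csv"), "w").close()
    remaining = pipes_still_to_run(all_pipes, out_dir)

print(remaining)
```

In the notebook, the list returned by such a helper could be used in place of the full pipeline list in the main loop.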

### Load in our data that was generated by 2-pirc_piped-preprocessing.ipynb

It is assumed that the data and code are set up in the following directory structure:

```
Main  
  ├ *.ipynb  
  ├ Data  
  |  ├ SampleInformation.csv  
  |  └ NIR  
  |    ├ N1476LDPE_1.csv  
  |    ├ ...  
  |    └ H0009PP_7.csv  
  ├ Scripts  
  |  ├ *.py  
  |  ├ *.sh  
  |  └ *.ps1
  └ NetCDFs
     └ *.nc
```

In [2]:
import xarray as xr
date = '20240703130237'
ds_pp_X = xr.open_dataset(f'NetCDFs/{date}_preprocessed_X_example.nc')
ds_pp_y = xr.open_dataset(f'NetCDFs/{date}_preprocessed_Y_example.nc')
print('Data Loaded!')

Data Loaded!


### Use a OneHotEncoder (OHE) to convert our labels into "dummy variables"

Not every algorithm can use OHE labels, but we like to use them when possible to avoid potential issues with algorithms like PLS-DA that need numerical categories. OHE avoids an arbitrary "ranking" of categories: 1, 2, 3, 4, 5 have an inherent order, but HDPE, LDPE, LLDPE, MDPE, and PP do not. We don't want the model to "learn" an order just because we chose to label our categories one way or another.
The algorithms that were able to use OHE labels easily were RandomForest, SIMCA, PLS-DA, and MLPC.

In [3]:
from sklearn.preprocessing import OneHotEncoder
ohe_c2 = OneHotEncoder(sparse_output=False, handle_unknown='ignore')
ohe_c2.fit(ds_pp_y.Class2.values.reshape(-1,1))

Make a dictionary to hold the encoder so that, in case we want to encode a different set of labels, it is easy to access.

In [4]:
ohe_dict={'Class2':ohe_c2}
eval_class = 'Class2'
print('ohe_dict made')

ohe_dict made


### Instantiate all of our classifiers, and save them in a dictionary

Saving them in a dictionary simplifies the loop we use later to cycle through all of the classifiers.

In [5]:
import Scripts.misc_funcs as misc
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.cross_decomposition import PLSRegression
from sklearn.neural_network import MLPClassifier
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis, QuadraticDiscriminantAnalysis
from sklearn.naive_bayes import GaussianNB
from sklearn import svm
from sklearn.neighbors import KNeighborsClassifier

clsfr_dict = {'SIMCA':misc.SIMCA_classifier(cat_encoder=ohe_dict[eval_class],
                                           simca_type="SIMCA"),

              'PLS-DA':PLSRegression(),
              'RandomForest': RandomForestClassifier(),
              'MLPC':MLPClassifier(max_iter=2000),
              'LDA':LinearDiscriminantAnalysis(),
              'QDA':QuadraticDiscriminantAnalysis(),
              'LinearSVC':svm.LinearSVC(),
              'RBF_SVC':svm.SVC(),
              'GaussianNB':GaussianNB(),
              'KNN':KNeighborsClassifier(),
              'AdaBoost':AdaBoostClassifier()
             }
print('Dictionary made!')

Dictionary made!


### Split our data into train and test sets now so that each classifier is trained and tested on the same subsets of data

In [12]:
from sklearn.model_selection import StratifiedShuffleSplit

sss = StratifiedShuffleSplit(n_splits=1, test_size=0.33)
for train, test in sss.split(X=ds_pp_X.sample, y=ds_pp_y.Class2):
    ds_pp_X_train = ds_pp_X.isel(sample=train)
    ds_pp_y_train = ds_pp_y.isel(sample=train)
    ds_pp_X_test = ds_pp_X.isel(sample=test)
    ds_pp_y_test = ds_pp_y.isel(sample=test)
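
As a quick check of what the stratified split does, the sketch below uses made-up labels to show that the test set keeps roughly the same class proportions as the full set and shares no samples with the training set. Setting `random_state` is an assumption added here for reproducibility; the cell above does not set it.

```python
import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit

# Made-up, imbalanced labels: 60 HDPE and 30 PP samples
y = np.array(["HDPE"] * 60 + ["PP"] * 30)
X_placeholder = np.zeros((len(y), 1))  # the split only uses y for stratifying

sss = StratifiedShuffleSplit(n_splits=1, test_size=0.33, random_state=0)
train_idx, test_idx = next(sss.split(X_placeholder, y))

# Fraction of PP in the full set and in the test set
full_frac_pp = np.mean(y == "PP")
test_frac_pp = np.mean(y[test_idx] == "PP")

print(len(train_idx), len(test_idx))
print(full_frac_pp, test_frac_pp)
```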


### Loop through each pipeline and train, test, and score each of our classifiers.

$\color{red}{\text{Warning! This will take a very long time to run since it is basically running 1152*11 machine learning models.}}$

You can downselect the pipelines by slicing `list(ds_pp_X_train.data_vars)` or by giving your own list of pipes from that list. Downselecting the classifiers can be done similarly with `list(clsfr_dict.keys())`.
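
A minimal sketch of the downselection described above, using plain Python lists as stand-ins. The pipeline names are hypothetical placeholders for `list(ds_pp_X_train.data_vars)`, and the classifier names are a subset of the keys of `clsfr_dict`.

```python
# Hypothetical stand-in for list(ds_pp_X_train.data_vars)
all_pipes = [f"pipe_{i}" for i in range(6)]
# Stand-in for list(clsfr_dict.keys())
clsfr_names = ["SIMCA", "PLS-DA", "RandomForest", "MLPC", "LDA"]

# Slice to take the first few pipelines, or every other one
first_three = all_pipes[:3]
every_other = all_pipes[::2]

# Or keep only hand-picked names, preserving the original order
hand_picked = [p for p in all_pipes if p in {"pipe_1", "pipe_4"}]
chosen_clsfrs = [c for c in clsfr_names if c in {"LDA", "PLS-DA"}]

print(first_three, every_other, hand_picked, chosen_clsfrs)
```

Either list can then replace the full list in the corresponding `for` statement of the loop.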

Additionally, within the Scripts folder, the `3b-pirc_classification-NoTuningSCRIPT.py` file can be used to run the same analysis in the background from a terminal. There are also some Bash and PowerShell scripts for reference and for tracking progress.

In [13]:
import pandas as pd
import warnings
from tqdm import tqdm
from sklearn.metrics import (accuracy_score, precision_score,
                             recall_score, f1_score)

with warnings.catch_warnings(record=True) as cx_manager:
    for pipe in list(ds_pp_X.data_vars): 
        print(pipe)
        df_pipe_scores = pd.DataFrame()
        for clsfr in list(clsfr_dict.keys()):
            print(clsfr)
            # print(clsfr, pipe)
            df_scores = pd.DataFrame()
            # set X and y
            X_train = ds_pp_X_train[pipe].dropna(dim='feature', how='all').dropna(dim='sample', how='all')
            y_train = ds_pp_y_train[pipe].dropna(dim='sample', how='all')
            X_test = ds_pp_X_test[pipe].dropna(dim='feature', how='all').dropna(dim='sample', how='all')
            y_test = ds_pp_y_test[pipe].dropna(dim='sample', how='all')
                   
            # train
            if clsfr in ['LDA','QDA','LinearSVC', 'RBF_SVC','GaussianNB','AdaBoost']:
                y_train = ohe_dict[eval_class].inverse_transform(y_train).squeeze()
            model = clsfr_dict[clsfr]
            model.fit(X_train, y_train)
            print('model trained')
            # test
            y_pred = model.predict(X_test)
            if clsfr in ['LDA', 'QDA','LinearSVC', 'RBF_SVC','GaussianNB','AdaBoost']:
                y_pred = ohe_dict[eval_class].transform(y_pred.reshape(-1,1))

            y_pred = (y_pred > 0.5).astype('uint8')
            print('model prediction made')
            # score
            acc = accuracy_score(y_test, y_pred)
            prec_micro = precision_score(y_test, y_pred, average='micro', zero_division=0.0)
            rec_micro = recall_score(y_test, y_pred, average='micro')
            f1_micro = f1_score(y_test, y_pred, average='micro', zero_division=0.0)
            prec_macro = precision_score(y_test, y_pred, average='macro', zero_division=0.0)
            rec_macro = recall_score(y_test, y_pred, average='macro')
            f1_macro = f1_score(y_test, y_pred, average='macro', zero_division=0.0)
            prec_weighted = precision_score(y_test, y_pred, average='weighted', zero_division=0.0)
            rec_weighted = recall_score(y_test, y_pred, average='weighted')
            f1_weighted = f1_score(y_test, y_pred, average='weighted', zero_division=0.0)
            print('all model scores evaluated')
            
            # add to dataframe
            df_scores['Data'] = [pipe.split('_')[0]]
            df_scores['Preprocessing'] = '_'.join(pipe.split('_')[1:])
            df_scores['Classifier'] = clsfr
            df_scores['Accuracy'] = acc
            df_scores['Precision_micro'] = prec_micro
            df_scores['Recall_micro'] = rec_micro
            df_scores['F1_micro'] = f1_micro
            df_scores['Precision_macro'] = prec_macro
            df_scores['Recall_macro'] = rec_macro
            df_scores['F1_macro'] = f1_macro
            df_scores['Precision_weighted'] = prec_weighted
            df_scores['Recall_weighted'] = rec_weighted
            df_scores['F1_weighted'] = f1_weighted
            # store any warnings recorded while training/scoring this classifier
            df_scores['Warning'] = str([i.message for i in cx_manager])
            cx_manager.clear()
            df_pipe_scores = pd.concat([df_pipe_scores, df_scores], ignore_index=True)
            print('--------'*10)
            
        if save_data:
            # make sure the output folder exists, then save this pipeline's scores
            import os
            newpath = 'ClassifierScores'
            os.makedirs(newpath, exist_ok=True)
            df_pipe_scores.to_csv(f'{newpath}/{pipe}.csv')

print("FINISHED!")

AllColors-AllStates_None1_None2_None3_None4_None5_None6
SIMCA
model trained
model prediction made
all model scores evaluated
--------------------------------------------------------------------------------
PLS-DA
model trained
model prediction made
all model scores evaluated
--------------------------------------------------------------------------------
RandomForest
model trained
model prediction made
all model scores evaluated
--------------------------------------------------------------------------------
MLPC
model trained
model prediction made
all model scores evaluated
--------------------------------------------------------------------------------
LDA
model trained
model prediction made
all model scores evaluated
--------------------------------------------------------------------------------
QDA
model trained
model prediction made
all model scores evaluated
--------------------------------------------------------------------------------
LinearSVC
model trained
model prediction 

In [14]:
print("Finished!")

Finished!
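
The scoring step in the loop above compares one-hot label matrices, so scikit-learn treats each class column as its own label and `average` controls how the per-class scores are combined. The sketch below uses small made-up matrices (not results from this notebook) to show how the micro, macro, and weighted averages can differ, including a row with no predicted class, which can happen after thresholding at 0.5.

```python
import numpy as np
from sklearn.metrics import accuracy_score, precision_score, recall_score

# Made-up one-hot labels: three samples of class 0, one each of classes 1 and 2
y_true = np.array([[1, 0, 0],
                   [1, 0, 0],
                   [1, 0, 0],
                   [0, 1, 0],
                   [0, 0, 1]])
# Made-up predictions: one class-0 sample mislabeled, last row has no class set
y_pred = np.array([[1, 0, 0],
                   [1, 0, 0],
                   [0, 1, 0],
                   [0, 1, 0],
                   [0, 0, 0]])

# For one-hot (multilabel) input, accuracy counts only rows that match exactly
acc = accuracy_score(y_true, y_pred)

# micro: pool all columns; macro: unweighted mean over classes;
# weighted: mean over classes weighted by the number of true samples
prec = {avg: precision_score(y_true, y_pred, average=avg, zero_division=0.0)
        for avg in ("micro", "macro", "weighted")}
rec = {avg: recall_score(y_true, y_pred, average=avg, zero_division=0.0)
       for avg in ("micro", "macro", "weighted")}

print(acc)
print(prec)
print(rec)
```

The macro average is pulled down most by the class that was never predicted, since every class counts equally regardless of how many samples it has.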
