# Prepare Data

FTIR spectra are collected from both known and unknown samples. Known samples are prepared in the lab and consist of a single material, whereas unknown samples are collected from places of interest and may consist of multiple materials.

FTIR data are provided as groups of spectra in [.SPG](https://www.spectrochempy.fr/latest/userguide/importexport/importIR.html#Import-of-OMNIC-files) files. Spectra are grouped either by material type (for known samples) or by measurement settings (for unknown samples). The data are converted to CSV files using the `spg2csv.ipynb` script. As a sanity check, the conversion can also be done manually with the [Spectragryph](https://www.effemm2.de/spectragryph/index.html) software.

The purpose of this script is to create standard training and test sets for all experiments. The steps are:
* Split the data into training and test sets with a 60:40 ratio.
* Get the reference shift axis (the spectra have different shift values). Use the one with the highest frequency (maximum value).

In [3]:
import os
import re
import glob
import shutil

import numpy as np
import pandas as pd

from sklearn.model_selection import train_test_split

**User Param**

In [4]:
input_dir = 'Microplastics_BCET_csv'
output_dir = 'data'

## List All Data

Get the list of filenames and their labels (targets), i.e., the material type of each spectrum.

In [11]:
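# '[!Membrane]' is a character class (it skips names starting with any of those letters), so exclude the Membrane folder explicitly.
fl = glob.glob(os.path.join(input_dir, '*_SD', '*.csv'))
fl = [f for f in fl if not os.path.basename(os.path.dirname(f)).startswith('Membrane')]
target = [re.search(r'/(.*)_SD', f).group(1) for f in fl]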

Sanity check: show the number of samples for each material type and the total number of materials.

In [63]:
u, c = np.unique(target, return_counts=True)
f'No. of materials : {len(u)}'

'No. of materials : 22'

In [64]:
np.vstack([u, c]).T

array([['Acrylic', '10'],
       ['Cellulose', '10'],
       ['ENR', '10'],
       ['EPDM', '10'],
       ['HDPE', '10'],
       ['LDPE', '10'],
       ['Nylon', '10'],
       ['PBAT', '10'],
       ['PBS', '10'],
       ['PC', '10'],
       ['PEEK', '10'],
       ['PEI', '10'],
       ['PET', '10'],
       ['PLA', '10'],
       ['PMMA', '10'],
       ['POM', '10'],
       ['PP', '10'],
       ['PS', '10'],
       ['PTEE', '10'],
       ['PU', '10'],
       ['PVA', '10'],
       ['PVC', '10']], dtype='<U21')
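The same sanity check can be made to fail loudly when a class is too small for a stratified split. The following is a minimal sketch, not part of the original notebook; the helper name `check_class_counts` and the `min_count` threshold are hypothetical.

```python
from collections import Counter


def check_class_counts(labels, min_count=2):
    """Return per-class counts; raise if any class has fewer than min_count samples."""
    counts = Counter(labels)
    too_small = {k: v for k, v in counts.items() if v < min_count}
    if too_small:
        raise ValueError(f'Classes with too few samples: {too_small}')
    return dict(counts)


# Toy labels standing in for `target` (hypothetical data)
example = ['PET'] * 10 + ['PVC'] * 10
print(check_class_counts(example, min_count=10))
```

In the notebook, `check_class_counts(target, min_count=10)` would confirm that every material has the ten spectra shown above.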

## Split Train/Test Data

Split with a 60:40 ratio, using stratification to ensure that every material type appears in the test data.

In [48]:
X_train, X_test, y_train, y_test = train_test_split(fl, target, test_size=0.4, random_state=42, stratify=target)

In [51]:
np.unique(y_test, return_counts=True)

(array(['Acrylic', 'Cellulose', 'ENR', 'EPDM', 'HDPE', 'LDPE', 'Nylon',
        'PBAT', 'PBS', 'PC', 'PEEK', 'PEI', 'PET', 'PLA', 'PMMA', 'POM',
        'PP', 'PS', 'PTEE', 'PU', 'PVA', 'PVC'], dtype='<U9'),
 array([4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4]))
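To illustrate what the stratified split guarantees, here is a small self-contained sketch on toy data. The file names and the three materials are made up for the example; only the `train_test_split` arguments mirror the cell above.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy stand-ins for `fl` and `target`: 3 materials with 10 files each
files = [f'{m}_{i}.csv' for m in ('PET', 'PP', 'PVC') for i in range(10)]
labels = [f.split('_')[0] for f in files]

X_train, X_test, y_train, y_test = train_test_split(
    files, labels, test_size=0.4, random_state=42, stratify=labels)

# No file may appear in both subsets
assert set(X_train).isdisjoint(X_test)

# Per-class counts in each subset
train_counts = dict(zip(*np.unique(y_train, return_counts=True)))
test_counts = dict(zip(*np.unique(y_test, return_counts=True)))
print(train_counts)
print(test_counts)
```

With ten samples per class and `test_size=0.4`, each class contributes six training and four test samples, which matches the per-class count of 4 shown in the output above.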

Save to output directory.

In [58]:
os.makedirs(os.path.join(output_dir, 'train'), exist_ok=True)
os.makedirs(os.path.join(output_dir, 'test'), exist_ok=True)

In [62]:
for src in X_train:
    dst = os.path.join(output_dir, 'train', os.path.basename(src))
    shutil.copyfile(src, dst)
    
for src in X_test:
    dst = os.path.join(output_dir, 'test', os.path.basename(src))
    shutil.copyfile(src, dst)
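Because the files are copied into a flat folder by basename, two sources with the same file name would silently overwrite each other. The sketch below adds a guard against that; the helper `copy_flat` is hypothetical and is not used by the original notebook.

```python
import os
import shutil
import tempfile


def copy_flat(sources, dst_dir):
    """Copy files into dst_dir, refusing to overwrite on duplicate basenames."""
    os.makedirs(dst_dir, exist_ok=True)
    seen = {}
    for src in sources:
        name = os.path.basename(src)
        if name in seen:
            raise ValueError(f'{src} and {seen[name]} share the name {name}')
        seen[name] = src
        shutil.copyfile(src, os.path.join(dst_dir, name))
    return sorted(seen)


# Demo in a temporary directory with one made-up spectrum file
with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, 'PET_SD', 'SD_PET_1.csv')
    os.makedirs(os.path.dirname(src))
    with open(src, 'w') as fh:
        fh.write('3999.88,0.0125\n')
    print(copy_flat([src], os.path.join(tmp, 'train')))
```

In the notebook this could replace the two loops, e.g. `copy_flat(X_train, os.path.join(output_dir, 'train'))`.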

## Get Reference FTIR Shift

In [93]:
# Load one spectrum; the CSV has no header row, so name the two columns explicitly
df = pd.read_csv(os.path.join(input_dir, 'PET_SD', 'SD_PET_1.csv'), header=None, names=['shift', 'intensity'])
df

| | shift | intensity |
|---|---|---|
| 0 | 3999.881104 | 0.012529 |
| 1 | 3999.640045 | 0.012560 |
| 2 | 3999.398987 | 0.012705 |
| 3 | 3999.157928 | 0.012930 |
| 4 | 3998.916870 | 0.013177 |
| ... | ... | ... |
| 14411 | 525.989319 | 0.068047 |
| 14412 | 525.748260 | 0.064801 |
| 14413 | 525.507202 | 0.062974 |
| 14414 | 525.266144 | 0.062596 |


In [94]:
df['shift'].to_csv(os.path.join(output_dir, 'ref.csv'), index=False)
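The cells above take the shift axis of a single PET spectrum as the reference. If the reference should instead be chosen automatically, one possible reading of "the one with the highest frequency (maximum value)" is the shift axis whose maximum is largest. The sketch below implements that reading only; the helper `pick_reference` is hypothetical, and the two-column headerless CSV layout is assumed from the cell above.

```python
import io

import pandas as pd


def pick_reference(sources):
    """Return the shift axis (a pandas Series) whose maximum value is largest.

    `sources` are paths or file-like objects holding headerless
    two-column CSVs (shift, intensity). Returns None if `sources` is empty.
    """
    best = None
    for src in sources:
        shift = pd.read_csv(src, header=None, names=['shift', 'intensity'])['shift']
        if best is None or shift.max() > best.max():
            best = shift
    return best


# Two made-up spectra with different shift axes
a = io.StringIO('3999.88,0.01\n3999.64,0.02\n')
b = io.StringIO('3998.00,0.01\n3997.76,0.02\n')
ref = pick_reference([a, b])
print(ref.max())
```

In the notebook, `pick_reference(fl)` could be applied to the list of labeled files before writing `ref.csv`.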

## Get Unlabeled Data

In [10]:
os.makedirs(os.path.join(output_dir, 'unlabel'), exist_ok=True)

In [8]:
# Spectra from the '*-SB' folders (the same files used as unknown data below)
fl = glob.glob(os.path.join(input_dir, '*-SB', '*.csv'))

In [11]:
for src in fl:
    dst = os.path.join(output_dir, 'unlabel', os.path.basename(src))
    shutil.copyfile(src, dst)

In [12]:
# The training spectra are also added to the unlabeled pool
fl = glob.glob(os.path.join(output_dir, 'train', '*.csv'))

In [15]:
for src in fl:
    dst = os.path.join(output_dir, 'unlabel', os.path.basename(src))
    shutil.copyfile(src, dst)

## Get Unknown Data

In [5]:
os.makedirs(os.path.join(output_dir, 'unknown'), exist_ok=True)

In [6]:
fl = glob.glob(os.path.join(input_dir, '*-SB','*.csv'))

In [7]:
for src in fl:
    dst = os.path.join(output_dir, 'unknown', os.path.basename(src))
    shutil.copyfile(src, dst)