# Pre-processing & Feature Engineering

## Import Necessary Packages & Data

In [27]:
import numpy as np
import pandas as pd
import os
import re
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

In [3]:
#Get the relative local folder where the data is stored
base_fpath = os.getcwd() #file path to the working directory of the notebook
d_path = base_fpath.replace('notebooks', 'data') #data folder, assumed to sit alongside the notebooks folder

s_data = pd.read_csv(os.path.join(d_path, 'gas_data_cleaned.csv')) #cleaned sensor data

## Prepare the Data

Based on the findings from the last notebook, we will want to create multiple dataframes, each capable of being run through the models we build later. Each dataframe will have its own pros and cons, which will be discussed as it is created. First, however, let's split the data into 3 sets: train, test, and future.
The training dataset will be used to fit the scaler and to build our models.
The test dataset will be used to verify our models and provide metrics on goodness of fit.
The last dataset, arbitrarily named future, will consist of the data captured in later batches (i.e. at later dates). The future dataset will be used to evaluate each model's goodness of fit over time and to help us determine whether any of the sensors lose measurement accuracy over time (often called drift).

### Split the Data

The first thing we need to decide is which batches to use for training and testing, and which to hold back as the future dataset. There is probably a good mathematical guideline for how many data points are needed, but let's first look at how many data points are in each batch.

In [21]:
batch_data = s_data['BatchNumber'].value_counts(sort=False).sort_index().to_frame('# of Runs') #runs per batch, ordered by batch number; naming the column here does not depend on what value_counts calls it
batch_data['% of Data'] = batch_data['# of Runs'] / len(s_data) * 100
batch_data

BatchNumber,# of Runs,% of Data
1,445,3.199137
2,1244,8.943206
3,1586,11.401869
4,161,1.157441
5,197,1.416247
6,2300,16.534867
7,3613,25.974119
8,294,2.113587
9,470,3.378864
10,3600,25.880661


The first five batches represent about 26% of the data, or 3,633 runs. If we kept all of the dimensions we currently have, that many points would leave the feature space sparsely populated. However, since we will be reducing the feature space, I think this is an acceptable number of data points to build and validate a model.
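A quick way to see where a batch cutoff lands is to accumulate the run counts from the table above. This is only a sketch: the counts are typed in by hand rather than read from `s_data`, and the names `runs` and `summary` are made up for the illustration.

```python
import pandas as pd

# Run counts per batch, copied by hand from the table above
runs = pd.Series(
    [445, 1244, 1586, 161, 197, 2300, 3613, 294, 470, 3600],
    index=pd.RangeIndex(1, 11, name='BatchNumber'),
    name='# of Runs',
)

summary = runs.to_frame()
summary['Cumulative Runs'] = runs.cumsum()  # running total of runs up to each batch
summary['Cumulative % of Data'] = summary['Cumulative Runs'] / runs.sum() * 100
print(summary.round(2))
```

Reading down the cumulative columns shows how much data each possible cutoff would place in the train/test pool versus the future dataset.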

In [39]:
tt_data = s_data[s_data['BatchNumber'] <= 5] #train test dataset (must be defined before it is used below)
ind_vars = tt_data.drop(columns=['ChemicalCode', 'Concentration', 'BatchNumber']).columns #column names of the independent variables
X_tt = tt_data[ind_vars] #train test independent variable dataset
y_tt_code = tt_data['ChemicalCode'] #1st dependent variable
y_tt_con = tt_data['Concentration'] #2nd dependent variable

X_train, X_test, y_code_train, y_code_test = train_test_split(X_tt, y_tt_code, test_size=.3, random_state=1991) #split the data 1st time
X_train2, X_test2, y_con_train, y_con_test = train_test_split(X_tt, y_tt_con, test_size=.3, random_state=1991) #split the data for the second dependent variable, should be the same
if X_train.equals(X_train2) and X_test.equals(X_test2): #compare every row and column, including the index, not just DR_1
    print('Independent datasets from concentration split and chemical code split match so can keep just one X_train and X_test')
else:
    print('Splits do not match so need to keep X_train, X_train2, X_test, and X_test2')


f_data = s_data[s_data['BatchNumber'] > 5] #future dataset
X_f = f_data[ind_vars] #future independent variable dataset
y_f_code = f_data['ChemicalCode'] #1st dependent variable
y_f_con = f_data['Concentration'] #2nd dependent variable

Independent datasets from concentration split and chemical code split match so can keep just one X_train and X_test
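As an aside, `train_test_split` accepts any number of arrays and splits them all with the same shuffled indices, so both targets could be split in one call instead of two. The sketch below shows this on a small synthetic frame: the column names mirror the notebook, but the values and the `demo` variables are invented for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for tt_data; the values are random, not real sensor readings
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    'DR_1': rng.normal(size=100),
    'ChemicalCode': rng.integers(1, 7, size=100),
    'Concentration': rng.uniform(10, 1000, size=100),
})

X_demo = demo[['DR_1']]

# One call splits the features and both targets using the same row assignment
X_tr, X_te, code_tr, code_te, con_tr, con_te = train_test_split(
    X_demo, demo['ChemicalCode'], demo['Concentration'],
    test_size=0.3, random_state=1991,
)

# All three training pieces share one row index, so no follow-up comparison is needed
print(X_tr.index.equals(code_tr.index) and X_tr.index.equals(con_tr.index))
```

With a single call there is only one `X_train` and one `X_test` to keep track of, which removes the need for the matching check.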


### Scale the Data

Let's scale the data by standardizing it so that each feature has a mean of 0 and a standard deviation of 1. We fit the scaler on the training data only and then transform the other datasets with that fitted scale, so that no information from the test or future data leaks into the preprocessing.

In [40]:
ss = StandardScaler().fit(X_train)
X_train_s = ss.transform(X_train)
X_test_s = ss.transform(X_test)
X_f_s = ss.transform(X_f)
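To make the effect of fitting on the training data concrete, here is a small sketch on synthetic arrays. The arrays `train` and `later` are invented stand-ins, with `later` given a deliberately shifted mean to mimic drifted readings.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Synthetic data: 'later' has a higher mean than 'train' to imitate sensor drift
rng = np.random.default_rng(0)
train = rng.normal(loc=5.0, scale=2.0, size=(200, 3))
later = rng.normal(loc=8.0, scale=2.0, size=(50, 3))

scaler = StandardScaler().fit(train)  # mean and std are learned from the training data only
train_s = scaler.transform(train)
later_s = scaler.transform(later)

# Training columns come out centered at 0 with unit standard deviation
print(train_s.mean(axis=0).round(3), train_s.std(axis=0).round(3))
# The later data is expressed on the training scale, so its shift stays visible
print(later_s.mean(axis=0).round(3))
```

Only the training set is guaranteed to have mean 0 and standard deviation 1 after the transform. The other sets are measured against the training distribution, which is what lets a shift in later batches show up instead of being scaled away.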

## Feature Engineering & Selection

### Highly Correlated Feature Selection

### PCA Feature Reduction

### LASSO Regression Feature Reduction