# Feature Engineering Notebook

Clean and extract features from raw data

# Steps

1. Split the data into training and test sets
1. Clean the data (transform null values)
1. Scale necessary attributes (normalization, standardization)
1. Save transformed data for model training


# Import packages

In [6]:
# load data
import matplotlib.pyplot

# Add directory above current directory to path
import sys; sys.path.insert(0, '..')
#from submodules.load_data import load_data

# data manipulation
import pandas as pd

# data splitting
from sklearn.model_selection import train_test_split

# data preprocessing
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import MinMaxScaler

# model
from sklearn import svm
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.linear_model import SGDClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.naive_bayes import GaussianNB
from xgboost import XGBClassifier

# hyperparameter tuning
from sklearn.model_selection import GridSearchCV

# k-fold cross validation
from sklearn.model_selection import cross_validate

# serializing, compressing, and loading the models
import joblib

# performance
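# plot_confusion_matrix and plot_precision_recall_curve were removed in
# scikit-learn 1.2; the Display classes below replace them
from sklearn.metrics import ConfusionMatrixDisplay
from sklearn.metrics import PrecisionRecallDisplay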
import matplotlib.pyplot as plt

# displaying plots
from IPython import display
import matplotlib.image as mpimg
import glob
from skimage.util import montage
import numpy as np


# Load the data

Load the semicolon-separated data from disk.

In [10]:
# load the data using a python function
#data = load_data()
# without using a python function
csv_path = "../data/raw/dataSepsis.csv"
data = pd.read_csv(csv_path, sep=";")
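
As a quick sanity check after loading, it can help to look at the shape, the class balance, and the share of missing values per column. The sketch below runs on a tiny in-memory stand-in for the CSV: the rows and the `HR`/`Temp` values are made up for illustration, and only the `;` separator and the `isSepsis` column name come from this notebook.

```python
import io

import pandas as pd

# Hypothetical three-row stand-in for the semicolon-separated file
raw = io.StringIO(
    "HR;Temp;isSepsis\n"
    "87;36.5;0\n"
    "72;;0\n"
    "90;38.1;1\n"
)
sample = pd.read_csv(raw, sep=";")

# share of each class in the label column
class_balance = sample["isSepsis"].value_counts(normalize=True)
# fraction of missing values per column
missing_share = sample.isna().mean()

print(sample.shape)
print(class_balance)
print(missing_share)
```

The same three calls can be pointed at `data` to confirm that the separator was parsed correctly and to see which attributes need imputation.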

# 1. Create Training and Test Dataset
> uses scikit-learn

Splitting early reduces data snooping bias: if you explore the full dataset before setting a test set aside, you may inadvertently tune the system to patterns in the test data and overestimate how well it generalizes.
Simply put, creating a test set means picking ~20% of the instances at random and setting them aside.

Some considerations for sampling methods that generate the test set:
1. you don't want your model to see the entire dataset
1. you want to be able to fetch new data for training
1. you want the training data to stay the same percentage of the entire dataset
1. you want a representative training dataset (~7% septic positive)

https://realpython.com/train-test-split-python-data/

In [39]:
# Set 20% of the data aside for testing; fixing the random seed makes the
# shuffled indices identical on every run.
# X = 2-dimensional array of inputs (features)
# X_train is the training part of X
# X_test is the test part of X
# y = 1-dimensional array of outputs (labels)
# y_train holds the labels for the training part
# y_test holds the labels for the test part
# axis: whether drop() removes labels from the index (0 or 'index') or the columns (1 or 'columns')
# test_size is the share of the total dataset set aside for testing = 20%
# random_state fixes the randomization so you get the same results each time
# shuffle: the data is shuffled before it is split (the default)
# stratify keeps the proportion of y values the same across the train and test sets
X_train, X_test, y_train, y_test = \
    train_test_split(data.drop(["Age", "Unit1", "Unit2", "HospAdmTime", "ICULOS", "Gender", "Bilirubin_direct", "TroponinI", "isSepsis"], axis=1),
    data["isSepsis"], test_size=0.20,
    random_state=42, stratify=data["isSepsis"])
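
One way to confirm that stratification worked is to compare the positive rate in both parts of the split. This is a minimal sketch on synthetic data, not the sepsis dataset: the `demo` frame, its `HR` values, and the `Xd_*`/`yd_*` names are assumptions chosen to mirror the ~7% positive rate mentioned above.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 1000 rows, 70 of them (7%) labeled positive
rng = np.random.default_rng(42)
demo = pd.DataFrame({"HR": rng.normal(80, 10, 1000)})
demo["isSepsis"] = 0
demo.loc[:69, "isSepsis"] = 1  # .loc slicing is inclusive, so this marks 70 rows

Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    demo.drop(columns=["isSepsis"]),
    demo["isSepsis"],
    test_size=0.20,
    random_state=42,
    stratify=demo["isSepsis"],
)

print(len(Xd_train), len(Xd_test))      # 800 200
print(yd_train.mean(), yd_test.mean())  # positive rate in each part
```

Running the same two `print` calls on `y_train` and `y_test` shows whether both parts keep the class proportion of the full dataset.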

# 2. Clean the Data
Instead of preparing data manually, write functions to:
1. reproduce transformations easily on any dataset (e.g., data refresh)
1. build a library of functions to reuse in future projects
1. transform new data in the live system before inference


## Steps
1. handle null values in current and future data
1. impute the median for attributes with missing values (>7k)

## 2.1 Transform missing values in numeric data

In [42]:
# create a SimpleImputer instance
# replace each attribute's missing values with the median of that attribute
imputer = SimpleImputer(strategy="median")

# fit computes the median of ALL numeric attributes, in case new data includes
# null values when the system goes live
# the medians are stored in imputer.statistics_
imputer.fit_transform(X_train)

array([[ 87.  ,  98.5 ,  36.5 , ...,   6.8 , 251.  , 139.  ],
       [ 72.  ,  98.  ,  36.8 , ...,   7.5 , 251.  , 150.  ],
       [ 72.  ,  99.  ,  36.28, ...,   2.6 , 251.  , 183.  ],
       ...,
       [ 72.  ,  95.  ,  36.8 , ...,   6.2 , 251.  , 107.  ],
       [ 78.5 , 100.  ,  36.65, ...,  11.1 , 251.  , 173.  ],
       [ 64.  , 100.  ,  36.22, ...,  10.  , 251.  , 339.  ]])
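
To make the role of `statistics_` concrete, the sketch below fits a separate `SimpleImputer` on a tiny made-up array and then applies it to a new row. The arrays and the `demo_imputer` name are illustrative assumptions; the point is that the medians are learned once from the training data and reused for any later data.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Illustrative training data with one missing value per column
train_demo = np.array([[70.0, 36.5],
                       [np.nan, 37.0],
                       [90.0, np.nan]])
# A new row where both attributes are missing
new_demo = np.array([[np.nan, np.nan]])

demo_imputer = SimpleImputer(strategy="median")
demo_imputer.fit(train_demo)

# per-column medians learned from the training data
print(demo_imputer.statistics_)
# the new row is filled with those same medians
print(demo_imputer.transform(new_demo))
```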

In [43]:
# apply the trained imputer to the training set, replacing the
# missing values with the learned medians
N = imputer.transform(X_train)
# the result above is a plain NumPy array of transformed features
# put it back into a pandas DataFrame
M = pd.DataFrame(N, columns=X_train.columns, index=X_train.index)
M.head()

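| | HR | O2Sat | Temp | SBP | MAP | DBP | Resp | EtCO2 | BaseExcess | HCO3 | ... | Magnesium | Phosphate | Potassium | Bilirubin_total | Hct | Hgb | PTT | WBC | Fibrinogen | Platelets |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 26908 | 87.0 | 98.5 | 36.5 | 108.0 | 73.0 | 63.0 | 16.5 | 33.0 | 0.0 | 24.0 | ... | 1.8 | 3.4 | 4.2 | 0.8 | 27.5 | 8.7 | 30.7 | 6.8 | 251.0 | 139.0 |
| 5174 | 72.0 | 98.0 | 36.8 | 145.0 | 86.33 | 62.0 | 17.0 | 33.0 | 0.0 | 25.0 | ... | 1.6 | 3.4 | 3.9 | 0.5 | 29.2 | 9.6 | 30.7 | 7.5 | 251.0 | 150.0 |
| 15997 | 72.0 | 99.0 | 36.28 | 96.0 | 70.0 | 62.0 | 14.0 | 33.0 | 0.0 | 24.0 | ... | 2.2 | 3.1 | 3.7 | 0.8 | 28.8 | 10.0 | 30.7 | 2.6 | 251.0 | 183.0 |
| 13058 | 76.5 | 100.0 | 35.55 | 102.0 | 64.0 | 49.0 | 14.0 | 33.0 | 0.0 | 24.0 | ... | 2.0 | 3.4 | 5.2 | 0.8 | 26.2 | 8.9 | 66.1 | 36.0 | 251.0 | 79.0 |
| 23132 | 68.0 | 95.0 | 36.6 | 119.0 | 92.0 | 80.0 | 20.0 | 33.0 | 0.0 | 24.0 | ... | 1.8 | 3.4 | 3.4 | 0.8 | 43.6 | 15.3 | 30.7 | 10.7 | 251.0 | 219.0 |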


# 3. Feature Scaling
1. Many ML algorithms don't work well when numeric attributes have very different scales
    (e.g. HR max 184, pH max 7.67)
1. Scaling target values is not necessary
1. Apply
    1. normalization (MinMaxScaler): bounds the values to a specific range (e.g. 0-1)
    1. standardization (StandardScaler): less affected by outliers, but does not bound the values to a range
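
The difference between the two scalers can be seen on a single column that contains one large value. This is a small sketch on made-up numbers (the `values` array is an assumption, loosely modeled on heart rates with a maximum of 184), not output from the sepsis data.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# One illustrative column with a single large value
values = np.array([[60.0], [70.0], [80.0], [90.0], [184.0]])

# normalization: rescales so the minimum maps to 0 and the maximum to 1
minmax = MinMaxScaler().fit_transform(values)
# standardization: subtracts the mean and divides by the standard deviation
standard = StandardScaler().fit_transform(values)

print(minmax.ravel())
print(standard.ravel())
print(standard.mean(), standard.std())
```

With min-max scaling the large value pins the top of the range, so the remaining points are squeezed toward 0. Standardized values are centered on 0 with unit variance and have no fixed upper or lower bound.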

In [50]:
scaler = StandardScaler()

O = scaler.fit_transform(N)
P = pd.DataFrame(O, columns=X_train.columns, index=X_train.index)
P.head()

| | HR | O2Sat | Temp | SBP | MAP | DBP | Resp | EtCO2 | BaseExcess | HCO3 | ... | Magnesium | Phosphate | Potassium | Bilirubin_total | Hct | Hgb | PTT | WBC | Fibrinogen | Platelets |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 26908 | 0.199062 | 0.374434 | -0.527378 | -0.641028 | -0.551583 | -0.028341 | -0.318226 | 0.013681 | 0.03009 | -0.042334 | ... | -0.607165 | -0.107243 | 0.185125 | -0.117318 | -0.855207 | -1.106481 | -0.173873 | -0.645006 | -0.065903 | -0.709709 |
| 5174 | -0.669441 | 0.192386 | -0.019029 | 0.998721 | 0.252568 | -0.109121 | -0.215282 | 0.013681 | 0.03009 | 0.308687 | ... | -1.216605 | -0.107243 | -0.31821 | -0.289078 | -0.540252 | -0.625355 | -0.173873 | -0.537316 | -0.065903 | -0.594803 |
| 15997 | -0.669441 | 0.556483 | -0.900167 | -1.172838 | -0.732562 | -0.109121 | -0.832944 | 0.013681 | 0.03009 | -0.042334 | ... | 0.611715 | -0.403405 | -0.653767 | -0.117318 | -0.614359 | -0.411521 | -0.173873 | -1.291144 | -0.065903 | -0.250085 |
| 13058 | -0.40889 | 0.92058 | -2.137151 | -0.906933 | -1.094521 | -1.15926 | -0.832944 | 0.013681 | 0.03009 | -0.042334 | ... | 0.002275 | -0.107243 | 1.862907 | -0.117318 | -1.096055 | -0.999564 | 2.313841 | 3.847193 | -0.065903 | -1.336469 |
| 23132 | -0.901042 | -0.899906 | -0.357928 | -0.153535 | 0.594619 | 1.344918 | 0.402379 | 0.013681 | 0.03009 | -0.042334 | ... | -0.607165 | -0.107243 | -1.157101 | -0.117318 | 2.127604 | 2.421778 | -0.173873 | -0.045021 | -0.065903 | 0.125971 |


## 3.1 Transformation Pipeline

It is common to apply several transformation steps in a specific order (e.g., fill the nulls before you apply the scaling).

In [51]:
# this pipeline should work for all the estimators/algorithms
pipeline = Pipeline([
                    ('imputer', SimpleImputer(strategy='median')),
                    ('std_scaler', StandardScaler()),
                    ])

In [52]:
# this is the transformed data to train from
X_train_prepared = pipeline.fit_transform(X_train)

In [53]:
# neural networks often expect inputs normalized to a 0-1 scale and tend to perform better with them
pipeline_minmax = Pipeline([
                    ('imputer', SimpleImputer(strategy='median')),
                    ('minMax', MinMaxScaler()),
                    ])

In [56]:
# this is the transformed data to train the MLP from
X_train_prepared_m = pipeline_minmax.fit_transform(X_train)
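
When the held-out data is eventually prepared, the usual pattern is to call `transform` only, so that the medians, means, and standard deviations learned from the training set are reused rather than re-estimated. The sketch below illustrates this on made-up arrays; `train_part`, `test_part`, and `demo_pipeline` are hypothetical names, and the notebook itself does not show this step.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Illustrative data: the training part has one missing value, the test part another
train_part = np.array([[70.0, 36.5], [np.nan, 37.0], [90.0, 37.5]])
test_part = np.array([[80.0, np.nan]])

demo_pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("std_scaler", StandardScaler()),
])

# fit_transform learns the medians, means and standard deviations from the training part
train_prepared = demo_pipeline.fit_transform(train_part)
# transform reuses those learned values on the test part without refitting
test_prepared = demo_pipeline.transform(test_part)

print(test_prepared)
```

In this toy case the test row equals the training medians after imputation, which also happen to be the training means, so it scales to zeros.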


# 4. Save the data for model training


In [57]:
# compress and save the pipeline
joblib.dump(pipeline, "../data/transform/pipeline.pkl")
joblib.dump(pipeline_minmax, "../data/transform/pipeline_minmax.pkl")

# save the transformed data into the data/transform folder

np.savetxt("../data/transform/X_train_prepared_m.csv", X_train_prepared_m, delimiter=",")
np.savetxt("../data/transform/X_train_prepared.csv", X_train_prepared, delimiter=",")
np.savetxt("../data/transform/X_train.csv", X_train, delimiter=",")
np.savetxt("../data/transform/X_test.csv", X_test, delimiter=",")
np.savetxt("../data/transform/y_train.csv", y_train, delimiter=",")
np.savetxt("../data/transform/y_test.csv", y_test, delimiter=",")
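
A saved pipeline can be checked by loading it back and comparing its output with the in-memory object. This is a minimal sketch that writes to a temporary directory rather than `../data/transform/`; the `fitted` pipeline and the input rows are made-up stand-ins.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Fit a small pipeline on illustrative data
fitted = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("std_scaler", StandardScaler()),
]).fit(np.array([[70.0, 36.5], [np.nan, 37.0], [90.0, 37.5]]))

# Round trip: dump to disk, then load into a new object
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "pipeline.pkl")
    joblib.dump(fitted, path)
    restored = joblib.load(path)

# The restored pipeline should transform new rows exactly like the original
new_rows = np.array([[85.0, np.nan]])
same = np.allclose(fitted.transform(new_rows), restored.transform(new_rows))
print(same)
```

Note that `np.savetxt` writes values only, so the column names of `X_train` and `X_test` are not stored in the CSV files above.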
