# Feature Engineering Notebook

Clean and extract features from raw data

# Steps

1. Split the data into training and test sets
1. Clean the data (transform null values)
1. Scale necessary attributes (normalization, standardization)
1. Save transformed data for model training


# Import packages

In [34]:
# data manipulation
import pandas as pd
import numpy as np


# data splitting
from sklearn.model_selection import train_test_split

# data preprocessing
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer

from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import MinMaxScaler

# serializing, compressing, and loading the models
import joblib
import sys
sys.path.append("../lib")

from getConfig import *
config = getConfig("../")
config.cleanup(config.traintest_path)


# Load the data

Load comma separated data from disk

In [35]:
data = pd.read_csv(config.input_data, sep=",")


# 1. Create Training and Test Dataset
> uses scikit-learn

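Splitting early limits the bias you may inadvertently introduce: if you explore the full dataset before setting a test set aside, you may pick transformations or models that fit quirks of the test data, and the estimated generalization error will look too optimistic.
Simply put, creating a test set means picking ~20% of the instances at random and setting them aside.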

Some considerations for sampling methods that generate the test set:
1. you don't want your model to see the entire dataset
1. you want to be able to fetch new data for training
1. you want the training set to remain the same percentage of the entire dataset
1. you want training and test sets that are representative of the class balance (~7% sepsis positive)

https://realpython.com/train-test-split-python-data/
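
The last consideration can be checked on synthetic labels. The sketch below assumes a made-up label vector with 7% positives (not the notebook's data; `X_demo` and `y_demo` are hypothetical names) and compares the positive rate in each split when `stratify` is used.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 1,000 rows with 70 positives (7%), mirroring the class balance noted above.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(1000, 3))
y_demo = np.array([1] * 70 + [0] * 930)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.20, random_state=42, stratify=y_demo
)

# With stratify=y_demo, both splits should keep roughly the same positive rate.
print(f"train positive rate: {y_tr.mean():.3f}")
print(f"test positive rate:  {y_te.mean():.3f}")
```

Without `stratify`, a purely random split of a rare class can drift away from the overall rate, which matters more the smaller the test set is.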

In [36]:
# Set 20% of the data aside for testing; fixing the random number generator seed
# makes it generate the same shuffled indices on every run.
# X = 2-dimensional array of inputs (features)
# X_train is the training part of X
# X_test is the test part of X
# y = 1-dimensional array of outputs (labels)
# y_train holds the labels for the training part
# y_test holds the labels for the test part
# axis: whether drop() removes labels from the index (0 or 'index') or the columns (1 or 'columns')
# test_size is the share of the total dataset set aside for testing (0.20 = 20%)
# random_state fixes the randomization so you get the same results each time
# shuffle: the data is shuffled before it is split (the default)
# stratify keeps the proportion of y values the same across the train and test sets
X_train, X_test, y_train, y_test = \
    train_test_split(data.drop(["Age", "Unit1", "Unit2", "HospAdmTime", "ICULOS", "Gender", "Bilirubin_direct", "TroponinI", "isSepsis"], axis=1),
    data["isSepsis"], test_size=0.20,
    random_state=42, stratify=data["isSepsis"])

# 2. Clean the Data
Instead of preparing data manually, write functions to:
1. reproduce transformations easily on any dataset (e.g., data refresh)
1. build a library of functions to reuse in future projects
1. transform new data in a live stream before inference
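
As a minimal sketch of such a reusable function, the hypothetical helper `transform_to_frame` below applies an already fitted transformer to any DataFrame and returns the result with the original column names and index. The helper name and the toy data are assumptions for illustration, not part of this notebook's library.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer


def transform_to_frame(fitted_transformer, df):
    """Apply a fitted transformer and return a DataFrame with df's columns and index.

    Assumes the transformer keeps the number and order of columns
    (e.g. an imputer or scaler fitted on columns that are not entirely null).
    """
    values = fitted_transformer.transform(df)
    return pd.DataFrame(values, columns=df.columns, index=df.index)


# Hypothetical training data and a later "data refresh" with a missing value.
train_demo = pd.DataFrame({"HR": [80.0, np.nan, 100.0]}, index=[23, 1084, 414])
refresh_demo = pd.DataFrame({"HR": [np.nan, 90.0]}, index=[7, 8])

imputer_demo = SimpleImputer(strategy="median").fit(train_demo)
result_demo = transform_to_frame(imputer_demo, refresh_demo)
print(result_demo)
```

The same helper could be called on the training set, a refreshed dataset, or a batch of live data, as long as the columns match those used for fitting.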


## Steps
1. transform current and future null values
1. impute median for missing attributes (>7k)

## 2.1 Transform missing values from numeric data

In [37]:
# create a SimpleImputer instance that replaces
# each attribute's missing values with the median of that attribute
imputer = SimpleImputer(strategy="median")

# fit_transform learns the median of ALL numeric attributes, in case new data
# includes null values when the system goes live, and returns the imputed training data;
# the learned medians are stored in imputer.statistics_
imputer.fit_transform(X_train)


array([[ 80.  ,  99.  ,  36.89, ...,  15.  , 243.  , 341.  ],
       [ 84.  ,  95.  ,  37.06, ...,  13.5 , 243.  , 292.  ],
       [ 78.  , 100.  ,  36.89, ...,   7.5 , 243.  , 135.  ],
       ...,
       [ 76.  ,  98.  ,  36.89, ...,   6.2 , 243.  ,  77.  ],
       [ 80.  , 100.  ,  36.4 , ...,  25.7 , 243.  , 196.  ],
       [100.5 ,  99.  ,  36.89, ...,   0.6 ,  77.  , 172.  ]])
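
To illustrate what `statistics_` holds and how a fitted imputer treats new data, here is a small sketch on a made-up frame. The values are hypothetical; only the column names are borrowed from the dataset.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical training frame with missing values.
train_demo = pd.DataFrame({
    "HR": [80.0, np.nan, 78.0, 102.0],
    "Temp": [36.9, 37.1, np.nan, 36.4],
})

imputer_demo = SimpleImputer(strategy="median")
imputer_demo.fit(train_demo)

# One learned median per column, in column order.
print(imputer_demo.statistics_)

# New data is filled with the medians learned from the training frame,
# not with statistics of the new data itself.
new_demo = pd.DataFrame({"HR": [np.nan], "Temp": [38.0]})
filled_demo = imputer_demo.transform(new_demo)
print(filled_demo)
```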

In [38]:
# apply the fitted imputer to the training set, replacing the
# missing values with the learned medians
N = imputer.transform(X_train)
# the result is a plain NumPy array of transformed features;
# put it back into a pandas DataFrame
M = pd.DataFrame(N, columns=X_train.columns, index=X_train.index)
M.head()

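| | HR | O2Sat | Temp | SBP | MAP | DBP | Resp | EtCO2 | BaseExcess | HCO3 | ... | Magnesium | Phosphate | Potassium | Bilirubin_total | Hct | Hgb | PTT | WBC | Fibrinogen | Platelets |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 23 | 80.0 | 99.0 | 36.89 | 129.0 | 100.0 | 77.0 | 18.0 | 2.0 | 0.0 | 22.0 | ... | 1.8 | 2.8 | 3.8 | 0.7 | 28.0 | 9.3 | 25.8 | 15.0 | 243.0 | 341.0 |
| 1084 | 84.0 | 95.0 | 37.06 | 134.0 | 84.0 | 57.5 | 17.0 | 2.0 | 0.0 | 22.0 | ... | 2.5 | 5.1 | 4.1 | 0.7 | 31.6 | 10.7 | 121.5 | 13.5 | 243.0 | 292.0 |
| 414 | 78.0 | 100.0 | 36.89 | 129.0 | 80.0 | 57.0 | 10.0 | 2.0 | 0.0 | 24.0 | ... | 1.8 | 2.7 | 4.0 | 0.7 | 23.6 | 8.0 | 30.3 | 7.5 | 243.0 | 135.0 |
| 437 | 79.5 | 99.0 | 37.0 | 114.5 | 87.5 | 71.5 | 31.5 | 2.0 | 1.0 | 25.0 | ... | 1.7 | 3.3 | 4.0 | 0.7 | 32.9 | 11.3 | 31.1 | 18.4 | 243.0 | 268.0 |
| 671 | 102.0 | 98.0 | 36.44 | 122.0 | 70.67 | 57.5 | 16.0 | 2.0 | 0.0 | 24.0 | ... | 1.9 | 2.3 | 3.8 | 3.3 | 28.2 | 9.3 | 38.5 | 11.0 | 243.0 | 160.0 |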


# 3. Feature Scaling
1. Many ML algorithms don't work well when numeric attributes have very different scales
    (e.g. HR max 184, pH max 7.67)
1. Scaling the target values is generally not necessary
1. Apply one of:
    1. normalization (MinMaxScaler): bounds the values to a specific range (e.g. 0-1)
    1. standardization (StandardScaler): less affected by outliers, but does not bound the values to a range
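
The difference between the two scalers can be seen on a toy column. The sketch below uses five made-up values with one outlier (illustrative only, not taken from the dataset).

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Hypothetical single feature with one outlier (300.0).
values_demo = np.array([[60.0], [70.0], [80.0], [90.0], [300.0]])

minmax_demo = MinMaxScaler().fit_transform(values_demo)
standard_demo = StandardScaler().fit_transform(values_demo)

# MinMaxScaler maps the minimum to 0 and the maximum to 1, so the outlier
# squeezes the other four values into the lower eighth of the range.
print(minmax_demo.ravel())

# StandardScaler centers the column to zero mean and unit variance;
# the values are not confined to a fixed interval.
print(standard_demo.ravel())
```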

In [39]:
scaler = StandardScaler()

O = scaler.fit_transform(N)
P = pd.DataFrame(O, columns=X_train.columns, index=X_train.index)
P.head()

| | HR | O2Sat | Temp | SBP | MAP | DBP | Resp | EtCO2 | BaseExcess | HCO3 | ... | Magnesium | Phosphate | Potassium | Bilirubin_total | Hct | Hgb | PTT | WBC | Fibrinogen | Platelets |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 23 | -0.232464 | 0.510396 | 0.024048 | 0.54744 | 1.540106 | 2.031346 | 0.023994 | 0.0 | 0.043109 | -0.579576 | ... | -0.540031 | -0.776203 | -0.610075 | -0.142612 | -0.715562 | -0.858904 | -0.499411 | 0.620465 | -0.064134 | 1.276695 |
| 1084 | 0.008067 | -0.879637 | 0.287529 | 0.800337 | 0.437076 | -0.105018 | -0.166885 | 0.0 | 0.043109 | -0.579576 | ... | 1.448641 | 1.51273 | -0.063418 | -0.142612 | 0.009165 | -0.036622 | 5.309965 | 0.349054 | -0.064134 | 0.786204 |
| 414 | -0.35273 | 0.857904 | 0.024048 | 0.54744 | 0.161319 | -0.159797 | -1.503036 | 0.0 | 0.043109 | -0.077481 | ... | -0.540031 | -0.875722 | -0.245637 | -0.142612 | -1.601341 | -1.622452 | -0.226243 | -0.736588 | -0.064134 | -0.785371 |
| 437 | -0.262531 | 0.510396 | 0.194536 | -0.185962 | 0.678364 | 1.428782 | 2.600857 | 0.0 | 0.387765 | 0.173567 | ... | -0.824127 | -0.278609 | -0.245637 | -0.142612 | 0.270873 | 0.315785 | -0.17768 | 1.235662 | -0.064134 | 0.545963 |
| 671 | 1.090455 | 0.162888 | -0.673403 | 0.193384 | -0.481885 | -0.105018 | -0.357763 | 0.0 | 0.043109 | -0.077481 | ... | -0.255935 | -1.273797 | -0.610075 | 1.59354 | -0.6753 | -0.858904 | 0.27153 | -0.103297 | -0.064134 | -0.53512 |


## 3.1 Transformation Pipeline

It is common to apply several transformation steps in a specific order (e.g. fill the nulls before applying the scaling).

In [40]:
# this pipeline should work for all the estimators/algorithms
pipeline = Pipeline([
                    ('imputer', SimpleImputer(strategy='median')),
                    ('std_scaler', StandardScaler()),
                    ])

In [41]:
# this is the transformed data to train from
X_train_prepared = pipeline.fit_transform(X_train)

In [42]:
# neural networks sometimes expect a 0-1 normalized scale and perform better
pipeline_minmax = Pipeline([
                    ('imputer', SimpleImputer(strategy='median')),
                    ('minMax', MinMaxScaler()),
                    ])

In [43]:
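# this is the transformed data to train the MLP from
X_train_prepared_m = pipeline_minmax.fit_transform(X_train)
# apply the pipeline fitted on the training set to the test set;
# calling fit_transform here would refit the medians and min/max on the test data
X_test_prepared = pipeline_minmax.transform(X_test)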
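
A pipeline is normally fitted on the training set only and then applied to the test set with `transform`, so that no statistics are learned from the test data. The sketch below shows this on hypothetical one-column arrays (all names ending in `_demo` are assumptions for illustration).

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# Hypothetical arrays; the test set contains a value outside the training range.
train_demo = np.array([[0.0], [5.0], [np.nan], [10.0]])
test_demo = np.array([[np.nan], [20.0]])

pipe_demo = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("minMax", MinMaxScaler()),
])

# fit_transform learns the median (5) and the min/max (0 and 10) from the training data.
train_scaled_demo = pipe_demo.fit_transform(train_demo)
# transform reuses those learned values on the test data.
test_scaled_demo = pipe_demo.transform(test_demo)

print(train_scaled_demo.ravel())
print(test_scaled_demo.ravel())
```

Test values can land outside 0-1 when they fall outside the training range; that is expected and reflects what the model would see on genuinely new data.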

# 4. Save the data for model training

Save the fitted pipelines and the raw and transformed datasets to disk so they can be loaded for model training.

In [44]:
# serialize and save the fitted pipelines (joblib.dump does not compress unless compress is set)

joblib.dump(pipeline, config.traintest_path + "pipeline.pkl")
joblib.dump(pipeline_minmax, config.traintest_path + "pipeline_minmax.pkl")

# save the raw and transformed data into the data/transform folder

np.savetxt(config.traintest_path + "X_train_prepared_m.csv", X_train_prepared_m, delimiter=",")
np.savetxt(config.traintest_path + "X_train_prepared.csv", X_train_prepared, delimiter=",")
np.savetxt(config.traintest_path + "X_train.csv", X_train, delimiter=",")
np.savetxt(config.traintest_path + "X_test.csv", X_test, delimiter=",")
np.savetxt(config.traintest_path + "X_test_prepared.csv", X_test_prepared, delimiter=",")
np.savetxt(config.traintest_path + "y_train.csv", y_train, delimiter=",")
np.savetxt(config.traintest_path + "y_test.csv", y_test, delimiter=",")
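
On the consuming side, a saved pipeline can be reloaded with `joblib.load` and applied to new rows. The round trip below is a sketch that uses a temporary directory and a small stand-in pipeline rather than `config.traintest_path` and the notebook's fitted objects; `demo_pipeline` and the toy arrays are hypothetical.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical stand-in for a fitted preprocessing pipeline.
demo_pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("std_scaler", StandardScaler()),
])
train_demo = np.array([[1.0, 10.0], [2.0, np.nan], [3.0, 30.0]])
demo_pipeline.fit(train_demo)

# Save to a temporary folder and load the pipeline back.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "pipeline.pkl")
    joblib.dump(demo_pipeline, path)
    restored_pipeline = joblib.load(path)

# The restored pipeline should transform new rows exactly like the original.
new_rows_demo = np.array([[np.nan, 20.0]])
same_output = bool(np.allclose(
    demo_pipeline.transform(new_rows_demo),
    restored_pipeline.transform(new_rows_demo),
))
print(same_output)
```

Pickled pipelines are generally only safe to load with the same scikit-learn version that produced them, so it may be worth recording that version alongside the files.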
