# Feature Engineering Notebook

Clean and extract features from raw data

# Steps

1. Split the data into training and test sets
1. Clean the data (transform null values)
1. Scale necessary attributes (normalization, standardization)
1. Save transformed data for model training


# Import packages

In [19]:
# Add directory above current directory to path
import sys
sys.path.insert(0, '..')

# data manipulation
import pandas as pd
import numpy as np


# data splitting
from sklearn.model_selection import train_test_split

# data preprocessing
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import MinMaxScaler

# model
from sklearn import svm
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

# k-fold cross validation
from sklearn.model_selection import cross_validate

# serializing, compressing, and loading the models
import joblib

# plotting
import matplotlib.pyplot as plt

# file system
import os
import shutil

# Load the data

Load comma separated data from disk

In [20]:
import configparser

# pass the interpolation through the constructor instead of setting a private attribute
settings = configparser.ConfigParser(interpolation=configparser.ExtendedInterpolation())
settings.read('../config')
settings.sections()

#path to data folder
csv_path = "../" + settings.get('Data', 'raw')
data = pd.read_csv(csv_path, sep=",")
version = settings.get('Run', 'version')

#Prepare data directory for current run
data_path = "../" + settings.get('Path', 'data') + "run_" + version
transform_path = data_path + "/transform/"

# start each run from an empty directory
if os.path.exists(data_path):
    shutil.rmtree(data_path)           # Removes all the subdirectories!
os.makedirs(transform_path)            # also creates data_path


# 1. Create Training and Test Dataset
> uses scikit-learn

Splitting early reduces the risk of data snooping: if you explore the full dataset first, you may inadvertently tune your choices to the test data and overestimate how well the model generalizes.
Simply put, creating a test set means picking ~20% of the instances at random and setting them aside.

Some considerations for sampling methods that generate the test set:
1. you don't want your model to see the entire dataset
1. you want to be able to fetch new data for training
1. you want to maintain the same percentage of training data against the entire dataset
1. you want a representative training dataset (~7% septic positive)

https://realpython.com/train-test-split-python-data/
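
The last consideration (a representative split) can be checked directly. The following is a minimal sketch on synthetic data, not the notebook's dataset: the array names and the exact 7% positive rate are assumptions made for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real data: 1000 rows, exactly 7% positive labels.
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(1000, 3))
y_demo = np.zeros(1000, dtype=int)
y_demo[:70] = 1

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.20, random_state=42, stratify=y_demo
)

# Stratification keeps the positive rate (almost) identical in both parts.
train_rate = y_tr.mean()
test_rate = y_te.mean()
print(f"train positive rate: {train_rate:.3f}, test positive rate: {test_rate:.3f}")
```

Without `stratify`, a rare class can end up over- or under-represented in the test set purely by chance.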

In [21]:
# sets 20% of the data aside for testing and seeds the random number generator so it always generates the same shuffled indices
# X = 2-dimensional array with inputs
# X_train is the training part of the inputs (X)
# X_test is the test part of the inputs (X)
# y = 1-dimensional array with outputs
# y_train is the labeled training part of the outputs (y)
# y_test is the labeled test part of the outputs (y)
# axis: whether to drop labels from the index (0 or ‘index’) or columns (1 or ‘columns’)
# test_size is the fraction of the total dataset to set aside for testing = 20%
# random_state fixes the randomization so you get the same results each time
# shuffle: the data is shuffled before it is split (the default)
# stratified splitting keeps the proportion of y values the same in the train and test sets
# columns excluded from the features; isSepsis is the label
drop_columns = ["Age", "Unit1", "Unit2", "HospAdmTime", "ICULOS", "Gender",
                "Bilirubin_direct", "TroponinI", "isSepsis"]
X_train, X_test, y_train, y_test = train_test_split(
    data.drop(drop_columns, axis=1),
    data["isSepsis"],
    test_size=0.20,
    random_state=42,
    stratify=data["isSepsis"],
)

# 2. Clean the Data
Instead of preparing the data manually, write functions so that you can:
1. reproduce transformations easily on any dataset (e.g., after a data refresh)
1. build a library of functions to reuse in future projects
1. transform new data in a live system before inference
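
As an illustration of the points above, here is a minimal sketch of two reusable helpers. The names `fit_imputer` and `apply_imputer`, the toy columns, and the median strategy are assumptions for this example, not functions defined elsewhere in this project.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer


def fit_imputer(train_df, strategy="median"):
    """Fit a SimpleImputer on the training data only (hypothetical helper)."""
    imputer = SimpleImputer(strategy=strategy)
    imputer.fit(train_df)
    return imputer


def apply_imputer(imputer, df):
    """Fill missing values and return a DataFrame with the original labels."""
    filled = imputer.transform(df)
    return pd.DataFrame(filled, columns=df.columns, index=df.index)


# Toy data: the same fitted imputer is reused on the training set and on new rows.
train_demo = pd.DataFrame({"HR": [80.0, np.nan, 100.0], "Temp": [36.5, 37.5, np.nan]})
new_demo = pd.DataFrame({"HR": [np.nan], "Temp": [38.0]})

demo_imputer = fit_imputer(train_demo)
train_filled = apply_imputer(demo_imputer, train_demo)
new_filled = apply_imputer(demo_imputer, new_demo)
print(new_filled)
```

Separating fitting from applying means new data is filled with values learned from the training set rather than from the new data itself.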


## Steps
1. transform current and future null values
1. impute median for missing attributes (>7k)

## 2.1 Transform missing values from numeric data

In [22]:
# create a SimpleImputer instance
# strategy="constant" replaces each missing value with fill_value
# (0 by default for numeric data); use strategy="median" to impute the
# attribute's median instead
imputer = SimpleImputer(strategy="constant")

# fit the imputer on ALL numeric attributes in case new data includes null
# values when the system goes live
# the fitted fill values are stored in imputer.statistics_
imputer.fit_transform(X_train)


array([[ 80.  ,  99.  ,   0.  , ...,  15.  ,   0.  , 341.  ],
       [ 84.  ,  95.  ,  37.06, ...,  13.5 ,   0.  , 292.  ],
       [ 78.  , 100.  ,   0.  , ...,   7.5 ,   0.  , 135.  ],
       ...,
       [ 76.  ,  98.  ,   0.  , ...,   6.2 ,   0.  ,  77.  ],
       [ 80.  , 100.  ,  36.4 , ...,  25.7 ,   0.  ,   0.  ],
       [100.5 ,  99.  ,   0.  , ...,   0.6 ,  77.  , 172.  ]])

In [23]:
# apply the fitted imputer to the training set, replacing the
# missing values with the learned fill values
N = imputer.transform(X_train)
# the result above is a plain NumPy array with the transformed features
# put it back into a pandas DataFrame
M = pd.DataFrame(N, columns=X_train.columns, index=X_train.index)
M.head()

Unnamed: 0,HR,O2Sat,Temp,SBP,MAP,DBP,Resp,EtCO2,BaseExcess,HCO3,...,Magnesium,Phosphate,Potassium,Bilirubin_total,Hct,Hgb,PTT,WBC,Fibrinogen,Platelets
23,80.0,99.0,0.0,129.0,100.0,77.0,18.0,0.0,0.0,22.0,...,1.8,2.8,3.8,0.0,28.0,9.3,25.8,15.0,0.0,341.0
1084,84.0,95.0,37.06,134.0,84.0,0.0,17.0,0.0,0.0,22.0,...,2.5,5.1,4.1,0.0,31.6,10.7,121.5,13.5,0.0,292.0
414,78.0,100.0,0.0,129.0,80.0,57.0,10.0,0.0,0.0,24.0,...,1.8,2.7,4.0,0.0,23.6,8.0,0.0,7.5,0.0,135.0
437,79.5,99.0,37.0,114.5,87.5,71.5,31.5,0.0,1.0,25.0,...,1.7,3.3,4.0,0.0,32.9,11.3,31.1,18.4,0.0,268.0
671,102.0,98.0,36.44,122.0,70.67,0.0,16.0,0.0,0.0,24.0,...,1.9,2.3,3.8,3.3,28.2,9.3,38.5,11.0,0.0,160.0


# 3. Feature Scaling
1. Many ML algorithms don't work well when numeric attributes have very different scales
    (e.g. HR max 184,  pH max 7.67)
1. Scaling target values is not necessary
1. Apply one of
    1. normalization (MinMaxScaler): bounds the values to a specific range (e.g. 0-1)
    1. standardization (StandardScaler): centers each attribute to zero mean and unit variance; it does not bound values to a range and is less affected by outliers
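
The difference between the two scalers can be seen on a small example. This is a sketch with made-up values (one heart-rate-like column and one pH-like column), not data from this notebook.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Two made-up columns on very different scales.
values = np.array([[60.0, 7.30],
                   [80.0, 7.40],
                   [100.0, 7.50],
                   [180.0, 7.60]])

minmax_scaled = MinMaxScaler().fit_transform(values)
standard_scaled = StandardScaler().fit_transform(values)

# MinMaxScaler: every column now spans exactly 0 to 1.
print(minmax_scaled.min(axis=0), minmax_scaled.max(axis=0))

# StandardScaler: every column has mean 0 and standard deviation 1, with no fixed bounds.
print(standard_scaled.mean(axis=0).round(6), standard_scaled.std(axis=0).round(6))
```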

In [24]:
scaler = StandardScaler()

O = scaler.fit_transform(N)
P = pd.DataFrame(O, columns=X_train.columns, index=X_train.index)
P.head()

Unnamed: 0,HR,O2Sat,Temp,SBP,MAP,DBP,Resp,EtCO2,BaseExcess,HCO3,...,Magnesium,Phosphate,Potassium,Bilirubin_total,Hct,Hgb,PTT,WBC,Fibrinogen,Platelets
23,-0.114565,0.294776,-1.001265,0.514898,1.243452,1.398276,0.122276,0.0,0.043109,-0.026082,...,0.240464,0.136565,0.057328,-0.222119,-0.116594,-0.116098,0.131474,0.696486,-0.230317,1.239055
1084,0.083934,0.106096,1.008413,0.663885,0.44023,-1.118581,-0.037854,0.0,0.043109,-0.026082,...,1.046187,1.325732,0.27777,-0.222119,0.250722,0.269448,4.135162,0.470037,-0.230317,0.835872
414,-0.213814,0.341947,-1.001265,0.514898,0.239425,0.744547,-1.158764,0.0,0.043109,0.225535,...,0.240464,0.084862,0.204289,-0.222119,-0.565536,-0.474105,-0.94789,-0.435756,-0.230317,-0.455959
437,-0.139377,0.294776,1.005159,0.082834,0.615935,1.2185,2.284029,0.0,0.387765,0.351344,...,0.125361,0.395079,0.204289,-0.222119,0.383364,0.434682,0.353204,1.209769,-0.230317,0.638395
671,0.977177,0.247606,0.974792,0.306315,-0.228954,-1.118581,-0.197984,0.0,0.043109,0.225535,...,0.355568,-0.12195,0.057328,1.840268,-0.096187,-0.116098,0.662789,0.092623,-0.230317,-0.250253


## 3.1 Transformation Pipeline

It is common to apply several transformation steps in a specific order (e.g., fill the nulls before you apply the scaling).
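
To make the ordering concrete, the sketch below checks that a `Pipeline` produces the same result as applying the same steps by hand, one after the other. The toy array and the median strategy are assumptions for illustration.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X_demo = np.array([[1.0, np.nan],
                   [2.0, 4.0],
                   [np.nan, 6.0],
                   [4.0, 8.0]])

demo_pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("std_scaler", StandardScaler()),
])
piped = demo_pipeline.fit_transform(X_demo)

# The same two steps applied by hand, in the same order.
imputed = SimpleImputer(strategy="median").fit_transform(X_demo)
manual = StandardScaler().fit_transform(imputed)

print(np.allclose(piped, manual))
```

The pipeline bundles both fitted steps in one object, so the same sequence can later be applied to new data with a single `transform` call.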

In [25]:
# this pipeline should work for all the estimators/algorithms
pipeline = Pipeline([
                    ('imputer', SimpleImputer(strategy='constant')),
                    ('std_scaler', StandardScaler()),
                    ])

In [26]:
# this is the transformed data to train from
X_train_prepared = pipeline.fit_transform(X_train)

In [27]:
# neural networks sometimes expect a 0-1 normalized scale and perform better
pipeline_minmax = Pipeline([
                    ('imputer', SimpleImputer(strategy='constant')),
                    ('minMax', MinMaxScaler()),
                    ])

In [28]:
# this is the transformed data to train the MLP from
X_train_prepared_m = pipeline_minmax.fit_transform(X_train)
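# transform (not fit_transform) the test set so it is scaled with the
# statistics learned from the training set
X_test_prepared = pipeline_minmax.transform(X_test)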
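
A scaler should learn its statistics from the training set only and then be applied to the test set with `transform`. The sketch below, using made-up one-column arrays, shows how refitting on the test set changes the scaled values.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

train_demo = np.array([[0.0], [50.0], [100.0]])
test_demo = np.array([[25.0], [75.0]])

# Learn min/max from the training data only, then reuse them on the test data.
scaler = MinMaxScaler().fit(train_demo)
test_scaled = scaler.transform(test_demo)

# Refitting on the test set maps the same rows to different values.
test_refit = MinMaxScaler().fit_transform(test_demo)

print(test_scaled.ravel())
print(test_refit.ravel())
```

With refitting, the test rows are rescaled to their own minimum and maximum, so they no longer live on the scale the model was trained on.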

# 4. Save the data for model training

Persist the fitted pipelines together with the raw and transformed datasets so that the model training step can load them.

In [29]:
# serialize and save the fitted pipelines (pass compress=... to joblib.dump to compress them)

joblib.dump(pipeline, transform_path + "pipeline.pkl")
joblib.dump(pipeline_minmax, transform_path + "pipeline_minmax.pkl")

# save the raw and transformed data into the run's transform folder

np.savetxt(transform_path + "X_train_prepared_m.csv", X_train_prepared_m, delimiter=",")
np.savetxt(transform_path + "X_train_prepared.csv", X_train_prepared, delimiter=",")
np.savetxt(transform_path + "X_train.csv", X_train, delimiter=",")
np.savetxt(transform_path + "X_test.csv", X_test, delimiter=",")
np.savetxt(transform_path + "X_test_prepared.csv", X_test_prepared, delimiter=",")
np.savetxt(transform_path + "y_train.csv", y_train, delimiter=",")
np.savetxt(transform_path + "y_test.csv", y_test, delimiter=",")
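
For completeness, a sketch of how a later step might load these artifacts again. It writes to a temporary directory and uses a toy pipeline, so the file names and data here are assumptions and not the files produced above.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X_demo = np.array([[1.0, np.nan], [2.0, 4.0], [3.0, 6.0]])
demo_pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("std_scaler", StandardScaler()),
]).fit(X_demo)
expected = demo_pipeline.transform(X_demo)

with tempfile.TemporaryDirectory() as tmp_dir:
    pipeline_file = os.path.join(tmp_dir, "pipeline.pkl")
    csv_file = os.path.join(tmp_dir, "X_demo_prepared.csv")

    # Save the fitted pipeline and the transformed data.
    joblib.dump(demo_pipeline, pipeline_file)
    np.savetxt(csv_file, expected, delimiter=",")

    # Load both back, as a training notebook would.
    restored_pipeline = joblib.load(pipeline_file)
    restored_data = np.loadtxt(csv_file, delimiter=",")

same_transform = np.allclose(restored_pipeline.transform(X_demo), expected)
same_data = np.allclose(restored_data, expected)
print(same_transform, same_data)
```

Note that `np.savetxt` writes values only, so column names and the row index are not stored in the CSV files.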
