**Feature Engineering**  
Date Created: 4/2/2020  
Last Edited: 4/8/2020  
Description: Feature engineering for a dataset. Specifically: feature scaling, identification of noise/outliers, mitigation of class imbalance (possibly through SMOTE), feature selection (wrapper methods), and feature extraction (PCA, SVD, and LDA).

In [1]:
#Load Libraries
import pandas as pd
import numpy as np


In [2]:
#Load cleaned data set
from utils import intake_data
data = intake_data()
data.isnull().sum(axis = 0)
data.head(10)

Unnamed: 0,age,sex,Alb,PLT,WBC,CRP,APACHE II,SOFA,McCabe,PaO2/FiO2,...,CT score,PEEP,PIP,TV,DARDS = 1,days,death = 1,days.1,ventilator weaning = 1,VFD
0,79.0,M,2.3,10.8,4000.0,17.8,24.0,8.0,1.0,108.0,...,191.6,24.0,25.948052,428.05036,0.0,21.0,1.0,28.0,0.0,0.0
1,83.0,M,4.4,13.5,10200.0,8.9,16.0,6.0,1.0,78.0,...,213.3,5.0,10.0,360.0,0.0,21.0,1.0,28.0,0.0,0.0
2,70.0,M,2.7,10.8,5300.0,25.3,22.0,7.0,1.0,70.9,...,221.7,18.0,24.0,525.0,0.0,8.0,1.0,28.0,0.0,0.0
3,61.0,M,3.3,8.8,1800.0,22.2,26.0,7.0,1.0,59.2,...,211.6,10.0,24.0,480.0,0.0,11.0,1.0,28.0,0.0,0.0
4,81.0,M,3.1,26.2,10600.0,17.0,19.0,4.0,1.0,83.6,...,234.9,5.0,10.0,625.0,0.0,6.0,1.0,28.0,0.0,0.0
5,79.0,M,3.4,37.9,13200.0,21.0,18.0,5.0,1.0,71.5,...,236.7,5.0,25.0,460.0,0.0,12.0,1.0,28.0,0.0,0.0
6,83.0,M,2.8,18.0,1500.0,54.9,30.0,11.0,1.0,96.2,...,180.5,22.0,25.948052,428.05036,0.0,1.0,1.0,28.0,0.0,0.0
7,70.0,M,2.8,18.5,14100.0,18.4,15.0,5.0,1.0,78.2,...,311.6,10.0,25.0,428.05036,0.0,13.0,1.0,28.0,0.0,0.0
8,65.0,F,3.2,35.4,7200.0,10.4,18.0,3.0,1.0,194.0,...,231.6,8.0,23.0,440.0,0.0,28.0,0.0,10.0,1.0,18.0
9,72.0,M,2.9,26.8,19800.0,29.4,20.0,5.0,1.0,74.0,...,233.1,5.0,25.0,428.05036,0.0,22.0,1.0,16.0,1.0,12.0


In [78]:
#Load dataset
#data = pd.read_csv("data_test_interpolate.csv")
#Encode sex as a number (M = 0, F = 1), touching only the sex column
data["sex"] = data["sex"].astype(str).map({'M': 0, 'F': 1})
#Rename "death = 1 " to "death" (note the trailing space in the original column name)
data.rename(columns = {'death = 1 ': 'death'}, inplace=True)
#data_X_1 = data.iloc[:, :17] 
#data_X_2 = data.iloc[:, 18:]
#data_X = pd.concat([data_X_1, data_X_2], axis=1)#df with features
#I think the last few columns give away too much information, since patients who die
#will have numbers that are vastly different. Plus, these values will not help the 
#hospital predict the patient's chance of survival, so they are left out of the features.
data_X = data.iloc[:, :16] #df with features
data_Y = data.iloc[:, 17] #df with class labels (death)
print(data_Y.head(10))
data_X.head(10)


0    1.0
1    1.0
2    1.0
3    1.0
4    1.0
5    1.0
6    1.0
7    1.0
8    0.0
9    1.0
Name: death, dtype: float64


Unnamed: 0,age,sex,Alb,PLT,WBC,CRP,APACHE II,SOFA,McCabe,PaO2/FiO2,LDH,CT score,PEEP,PIP,TV,DARDS = 1
0,79.0,0,2.3,10.8,4000.0,17.8,24.0,8.0,1.0,108.0,339.0,191.6,24.0,25.948052,428.05036,0.0
1,83.0,0,4.4,13.5,10200.0,8.9,16.0,6.0,1.0,78.0,385.0,213.3,5.0,10.0,360.0,0.0
2,70.0,0,2.7,10.8,5300.0,25.3,22.0,7.0,1.0,70.9,461.0,221.7,18.0,24.0,525.0,0.0
3,61.0,0,3.3,8.8,1800.0,22.2,26.0,7.0,1.0,59.2,227.0,211.6,10.0,24.0,480.0,0.0
4,81.0,0,3.1,26.2,10600.0,17.0,19.0,4.0,1.0,83.6,680.0,234.9,5.0,10.0,625.0,0.0
5,79.0,0,3.4,37.9,13200.0,21.0,18.0,5.0,1.0,71.5,224.0,236.7,5.0,25.0,460.0,0.0
6,83.0,0,2.8,18.0,1500.0,54.9,30.0,11.0,1.0,96.2,342.0,180.5,22.0,25.948052,428.05036,0.0
7,70.0,0,2.8,18.5,14100.0,18.4,15.0,5.0,1.0,78.2,478.0,311.6,10.0,25.0,428.05036,0.0
8,65.0,1,3.2,35.4,7200.0,10.4,18.0,3.0,1.0,194.0,316.0,231.6,8.0,23.0,440.0,0.0
9,72.0,0,2.9,26.8,19800.0,29.4,20.0,5.0,1.0,74.0,245.0,233.1,5.0,25.0,428.05036,0.0


There is a class imbalance in the data: death = 1 (n=69) versus death = 0 (n=128). To even out the class imbalance, SMOTE will be used to create new death = 1 data points. 

In [79]:
#Class Imbalance
#Dependencies on imblearn and joblib
#conda install -c conda-forge imbalanced-learn
#conda install -c anaconda joblib
#SMOTE
#import imblearn
from imblearn import over_sampling
oversample = over_sampling.SMOTE()
X_SMOTE, Y_SMOTE = oversample.fit_resample(data_X, data_Y)
print(X_SMOTE.shape)
Y_SMOTE.value_counts()
X_SMOTE.head(10)

(256, 16)


Unnamed: 0,age,sex,Alb,PLT,WBC,CRP,APACHE II,SOFA,McCabe,PaO2/FiO2,LDH,CT score,PEEP,PIP,TV,DARDS = 1
0,79.0,0,2.3,10.8,4000.0,17.8,24.0,8.0,1.0,108.0,339.0,191.6,24.0,25.948052,428.05036,0.0
1,83.0,0,4.4,13.5,10200.0,8.9,16.0,6.0,1.0,78.0,385.0,213.3,5.0,10.0,360.0,0.0
2,70.0,0,2.7,10.8,5300.0,25.3,22.0,7.0,1.0,70.9,461.0,221.7,18.0,24.0,525.0,0.0
3,61.0,0,3.3,8.8,1800.0,22.2,26.0,7.0,1.0,59.2,227.0,211.6,10.0,24.0,480.0,0.0
4,81.0,0,3.1,26.2,10600.0,17.0,19.0,4.0,1.0,83.6,680.0,234.9,5.0,10.0,625.0,0.0
5,79.0,0,3.4,37.9,13200.0,21.0,18.0,5.0,1.0,71.5,224.0,236.7,5.0,25.0,460.0,0.0
6,83.0,0,2.8,18.0,1500.0,54.9,30.0,11.0,1.0,96.2,342.0,180.5,22.0,25.948052,428.05036,0.0
7,70.0,0,2.8,18.5,14100.0,18.4,15.0,5.0,1.0,78.2,478.0,311.6,10.0,25.0,428.05036,0.0
8,65.0,1,3.2,35.4,7200.0,10.4,18.0,3.0,1.0,194.0,316.0,231.6,8.0,23.0,440.0,0.0
9,72.0,0,2.9,26.8,19800.0,29.4,20.0,5.0,1.0,74.0,245.0,233.1,5.0,25.0,428.05036,0.0
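
To make the SMOTE step less of a black box, here is a minimal sketch of the interpolation idea behind it, written with NumPy only. It is not the imblearn implementation: `smote_sketch` is a hypothetical helper, and the toy data is made up, with class sizes chosen to mirror the 128/69 split above. Each synthetic row is placed at a random point on the segment between a minority row and one of its k nearest minority neighbours.

```python
import numpy as np

def smote_sketch(X_min, n_new, k=5, seed=0):
    """Create n_new synthetic rows by interpolating between minority-class neighbours."""
    rng = np.random.default_rng(seed)
    X_min = np.asarray(X_min, dtype=float)

    # Pairwise Euclidean distances within the minority class.
    diffs = X_min[:, None, :] - X_min[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=2))
    np.fill_diagonal(dists, np.inf)          # a row is not its own neighbour

    k = min(k, len(X_min) - 1)
    neighbours = np.argsort(dists, axis=1)[:, :k]

    base = rng.integers(0, len(X_min), size=n_new)   # random minority rows
    pick = rng.integers(0, k, size=n_new)            # which neighbour to pair with
    partner = neighbours[base, pick]
    gap = rng.random((n_new, 1))                     # position along the segment

    return X_min[base] + gap * (X_min[partner] - X_min[base])

# Toy data (made up): 128 majority rows and 69 minority rows with 3 features.
rng = np.random.default_rng(42)
X_major = rng.normal(0.0, 1.0, size=(128, 3))
X_minor = rng.normal(2.0, 1.0, size=(69, 3))

n_needed = len(X_major) - len(X_minor)               # 128 - 69 = 59
X_new = smote_sketch(X_minor, n_needed)
X_balanced = np.vstack([X_major, X_minor, X_new])
print(X_new.shape, X_balanced.shape)                 # (59, 3) (256, 3)
```

Because the new rows are interpolated, integer-coded features can take fractional values in the synthetic rows. If that matters for columns such as sex, imblearn's `SMOTENC` is designed for mixed categorical and continuous features and may be worth a look.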


Linear Discriminant Analysis (LDA) is a method of feature extraction. The goal is to identify a linear combination of variables that provides the maximum separation between the two groups. 
Helpful link on combining dimension reduction algorithms: 
https://scikit-learn.org/stable/auto_examples/neighbors/plot_nca_dim_reduction.html
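
As a rough illustration of what "maximum separation" means for two groups, the sketch below computes Fisher's discriminant direction by hand with NumPy. It is a simplified stand-in for what scikit-learn's LDA does, not its actual implementation; `fisher_direction` is a hypothetical helper and the data is made up. The direction is proportional to S_w^{-1}(mu_1 - mu_0), where S_w is the within-class scatter matrix.

```python
import numpy as np

def fisher_direction(X, y):
    """Unit vector w that maximises between-class over within-class spread (two classes)."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    X0, X1 = X[y == 0], X[y == 1]
    mu0, mu1 = X0.mean(axis=0), X1.mean(axis=0)

    # Within-class scatter: sum of the two per-class scatter matrices.
    S_w = (X0 - mu0).T @ (X0 - mu0) + (X1 - mu1).T @ (X1 - mu1)

    w = np.linalg.solve(S_w, mu1 - mu0)   # S_w^{-1} (mu1 - mu0)
    return w / np.linalg.norm(w)

# Toy data (made up): the classes differ along the first feature only.
rng = np.random.default_rng(0)
X0 = rng.normal([0, 0], [1.0, 3.0], size=(100, 2))
X1 = rng.normal([2, 0], [1.0, 3.0], size=(100, 2))
X = np.vstack([X0, X1])
y = np.array([0] * 100 + [1] * 100)

w = fisher_direction(X, y)
z = X @ w                                  # each row projected onto one axis
print(w, z[y == 0].mean(), z[y == 1].mean())
```

With two classes there is only one such direction, so LDA can produce at most n_classes - 1 = 1 component for this data set.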

In [80]:
import warnings
warnings.filterwarnings("ignore")
#LDA 
from sklearn.preprocessing import StandardScaler
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as LDA
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import accuracy_score

scaler = StandardScaler()
lda = LDA()
knn = KNeighborsClassifier()
pipe = Pipeline(steps=[('scaler', scaler), ('lda', lda), ('knn', knn)])

param_grid = {
    #With two classes, LDA can return at most n_classes - 1 = 1 component
    'lda__n_components': [1],
    'knn__n_neighbors': list(range(1, 30)),
}
clf_grid_pipe = GridSearchCV(pipe, param_grid, cv=5)
clf_grid_pipe.fit(X_SMOTE, Y_SMOTE)
print('Best score:', round(clf_grid_pipe.best_score_, 4))
print('Best parameters:\n',
      'KNN n neighbors:', clf_grid_pipe.best_params_['knn__n_neighbors'],
      'LDA n components:', clf_grid_pipe.best_params_['lda__n_components'])
Y_pred = cross_val_predict(clf_grid_pipe, X_SMOTE, Y_SMOTE, cv=5)
#display accuracy (rounded to the nearest percent)
print('Accuracy: {:.1f}%'.format(round(accuracy_score(Y_SMOTE, Y_pred), 2) * 100))

Best score: 0.6486
Best parameters:
 KNN n neighbors: 29 LDA n components: 1
Accuracy: 63.0%


Wrapper methods are a form of feature selection that pick a subset of attributes by repeatedly training a model on candidate subsets and comparing scores. This is a crude first pass; if we choose this mode of feature engineering we'd have to clean it up a bit, but it gives an idea of the approach.
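
For intuition, forward selection can be sketched in a few lines: start with no features and, at each step, add the one feature that most improves a cross-validated score. The code below is a simplified illustration and not the mlxtend implementation. `forward_select` and `centroid_cv_accuracy` are hypothetical helpers, the scorer is a nearest-centroid classifier chosen only to keep the example dependency-free, and the data is made up.

```python
import numpy as np

def forward_select(X, y, score_fn, k_features):
    """Greedy forward selection: add the best-scoring feature until k_features are chosen."""
    selected = []
    remaining = list(range(X.shape[1]))
    while len(selected) < k_features:
        # Score every candidate feature when added to the current subset.
        scores = {j: score_fn(X[:, selected + [j]], y) for j in remaining}
        best = max(scores, key=scores.get)
        selected.append(best)
        remaining.remove(best)
    return selected

def centroid_cv_accuracy(X, y, n_folds=5):
    """Stand-in scorer: k-fold accuracy of a nearest-centroid classifier (labels 0/1)."""
    idx = np.arange(len(y))
    correct = 0
    for test in np.array_split(idx, n_folds):
        train = np.setdiff1d(idx, test)
        c0 = X[train][y[train] == 0].mean(axis=0)
        c1 = X[train][y[train] == 1].mean(axis=0)
        d0 = np.linalg.norm(X[test] - c0, axis=1)
        d1 = np.linalg.norm(X[test] - c1, axis=1)
        correct += np.sum((d1 < d0).astype(int) == y[test])
    return correct / len(y)

# Toy data (made up): 6 features, of which only columns 2 and 4 carry signal.
rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=200)
X = rng.normal(size=(200, 6))
X[:, 2] += 4 * y      # strong signal
X[:, 4] += 2 * y      # weaker signal

chosen = forward_select(X, y, centroid_cv_accuracy, k_features=2)
print(chosen)
```

The strongly informative column should be picked first; which column comes second depends on how much the weaker signal adds over noise in this particular sample.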

In [86]:
#Wrapper methods
warnings.filterwarnings("ignore")
from mlxtend.feature_selection import SequentialFeatureSelector as SFS
knn = KNeighborsClassifier(n_neighbors=29) #temp input from the output of the LDA grid search
#Keep the instance name separate from the class alias so re-running the cell still works
sfs = SFS(knn, k_features=5, forward=True, floating=False,
          verbose=2, scoring='accuracy', cv=5)
#I think we can input the full data set because sfs cross-validates internally (cv=5),
#but this could still lead to data leakage. We'll want to check.
sfs = sfs.fit(X_SMOTE, Y_SMOTE)
feat_cols = list(sfs.k_feature_idx_)
feat_cols

[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.0s remaining:    0.0s
[Parallel(n_jobs=1)]: Done  16 out of  16 | elapsed:    0.7s finished

[2020-04-08 18:47:54] Features: 1/5 -- score: 0.6518853695324284
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.0s remaining:    0.0s
[Parallel(n_jobs=1)]: Done  15 out of  15 | elapsed:    0.5s finished

[2020-04-08 18:47:55] Features: 2/5 -- score: 0.6597285067873303
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.0s remaining:    0.0s
[Parallel(n_jobs=1)]: Done  14 out of  14 | elapsed:    0.5s finished

[2020-04-08 18:47:55] Features: 3/5 -- score: 0.6636500754147813
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 

[1, 2, 3, 7, 11]

Feature selection through the wrapper method found that the following 5 columns should be selected:

In [87]:
X_SMOTE.iloc[:, feat_cols]

Unnamed: 0,sex,Alb,PLT,SOFA,CT score
0,0,2.300000,10.800000,8.000000,191.600000
1,0,4.400000,13.500000,6.000000,213.300000
2,0,2.700000,10.800000,7.000000,221.700000
3,0,3.300000,8.800000,7.000000,211.600000
4,0,3.100000,26.200000,4.000000,234.900000
...,...,...,...,...,...
251,0,2.588814,16.155930,4.027965,385.175895
252,0,2.899080,13.288501,7.497700,246.535237
253,0,2.504610,19.233331,12.000000,210.730486
254,0,2.794847,18.549242,5.017177,311.505524
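
On the data leakage question raised in the wrapper-methods cell: because SMOTE was applied before cross-validation, a synthetic row and the real rows it was interpolated from can end up in different folds, which may inflate the scores. One way to check is to resample inside each fold, on the training part only. The sketch below shows that structure with NumPy alone. It is an illustration under stated assumptions rather than a drop-in replacement: it uses plain random oversampling instead of SMOTE, a nearest-centroid classifier instead of KNN, hypothetical helper names, and made-up data with the same 128/69 class sizes.

```python
import numpy as np

def oversample_minority(X, y, rng):
    """Random oversampling (simpler than SMOTE): duplicate minority rows until balanced."""
    classes, counts = np.unique(y, return_counts=True)
    minority = classes[np.argmin(counts)]
    n_extra = counts.max() - counts.min()
    extra = rng.choice(np.flatnonzero(y == minority), size=n_extra, replace=True)
    return np.vstack([X, X[extra]]), np.concatenate([y, y[extra]])

def cv_accuracy_resample_inside(X, y, n_folds=5, seed=0):
    """k-fold accuracy where only the training fold is rebalanced."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y))
    correct = 0
    for test in np.array_split(idx, n_folds):
        train = np.setdiff1d(idx, test)
        # Balance the training fold only; the held-out fold keeps its real rows.
        X_tr, y_tr = oversample_minority(X[train], y[train], rng)
        c0 = X_tr[y_tr == 0].mean(axis=0)
        c1 = X_tr[y_tr == 1].mean(axis=0)
        d0 = np.linalg.norm(X[test] - c0, axis=1)
        d1 = np.linalg.norm(X[test] - c1, axis=1)
        correct += np.sum((d1 < d0).astype(int) == y[test])
    return correct / len(y)

# Toy data (made up): 128 rows of class 0, 69 rows of class 1, 4 features.
rng = np.random.default_rng(1)
y = np.array([0] * 128 + [1] * 69)
X = rng.normal(size=(197, 4)) + y[:, None] * 1.0

acc = cv_accuracy_resample_inside(X, y)
print(round(acc, 3))
```

In the notebook's own stack, the equivalent would probably be to put the SMOTE step in front of the scaler, LDA and KNN steps using imblearn's `Pipeline`, so that `GridSearchCV` resamples each training fold separately. That variant has not been run here.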
