<img src=https://storage.googleapis.com/kaggle-datasets-images/180/384/3da2510581f9d3b902307ff8d06fe327/dataset-cover.jpg>  
## Breast Cancer Wisconsin (Diagnostic) Data Set
[Kaggle Data set](https://www.kaggle.com/uciml/breast-cancer-wisconsin-data#data.csv): Predict whether the cancer is benign or malignant

In [1]:
import numpy as np
import pandas as pd
import featuretools as ft  # featuretools for automated feature engineering

from sklearn.model_selection import KFold 
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score
from sklearn.metrics import f1_score

from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

from featuretools import selection
from warnings import simplefilter  # import warnings filter

pd.options.mode.chained_assignment = None  #hide any pandas warnings
simplefilter(action='ignore', category=FutureWarning)  # ignore all future warnings
np.random.seed(123) #ensure reproducibility

### Data Preparation

#### 1. Download dataset from [Kaggle](https://www.kaggle.com/uciml/breast-cancer-wisconsin-data/downloads/data.csv/2)

In [2]:
# Read breast cancer data set as Pandas dataframe
df_raw = pd.read_csv("data.csv")
df_raw.head()

Unnamed: 0,id,diagnosis,radius_mean,texture_mean,perimeter_mean,area_mean,smoothness_mean,compactness_mean,concavity_mean,concave points_mean,...,texture_worst,perimeter_worst,area_worst,smoothness_worst,compactness_worst,concavity_worst,concave points_worst,symmetry_worst,fractal_dimension_worst,Unnamed: 32
0,842302,M,17.99,10.38,122.8,1001.0,0.1184,0.2776,0.3001,0.1471,...,17.33,184.6,2019.0,0.1622,0.6656,0.7119,0.2654,0.4601,0.1189,
1,842517,M,20.57,17.77,132.9,1326.0,0.08474,0.07864,0.0869,0.07017,...,23.41,158.8,1956.0,0.1238,0.1866,0.2416,0.186,0.275,0.08902,
2,84300903,M,19.69,21.25,130.0,1203.0,0.1096,0.1599,0.1974,0.1279,...,25.53,152.5,1709.0,0.1444,0.4245,0.4504,0.243,0.3613,0.08758,
3,84348301,M,11.42,20.38,77.58,386.1,0.1425,0.2839,0.2414,0.1052,...,26.5,98.87,567.7,0.2098,0.8663,0.6869,0.2575,0.6638,0.173,
4,84358402,M,20.29,14.34,135.1,1297.0,0.1003,0.1328,0.198,0.1043,...,16.67,152.2,1575.0,0.1374,0.205,0.4,0.1625,0.2364,0.07678,


#### 2. Clean up dataframe

In [3]:
# rearrange columns; remove coluum "Unnamed: 32"; set id as index
fixed_columns = [df_raw.columns[0]]+list(df_raw.columns[2:-1])+[df_raw.columns[1]]
df_data = df_raw[fixed_columns]

In [4]:
# Convert (M=malignant, B=benign) to (1,0)
df_data.loc[df_data.index[df_data['diagnosis']=='B'],'diagnosis'] = 0
df_data.loc[df_data.index[df_data['diagnosis']=='M'],'diagnosis'] = 1

In [5]:
df_data.head()

Unnamed: 0,id,radius_mean,texture_mean,perimeter_mean,area_mean,smoothness_mean,compactness_mean,concavity_mean,concave points_mean,symmetry_mean,...,texture_worst,perimeter_worst,area_worst,smoothness_worst,compactness_worst,concavity_worst,concave points_worst,symmetry_worst,fractal_dimension_worst,diagnosis
0,842302,17.99,10.38,122.8,1001.0,0.1184,0.2776,0.3001,0.1471,0.2419,...,17.33,184.6,2019.0,0.1622,0.6656,0.7119,0.2654,0.4601,0.1189,1
1,842517,20.57,17.77,132.9,1326.0,0.08474,0.07864,0.0869,0.07017,0.1812,...,23.41,158.8,1956.0,0.1238,0.1866,0.2416,0.186,0.275,0.08902,1
2,84300903,19.69,21.25,130.0,1203.0,0.1096,0.1599,0.1974,0.1279,0.2069,...,25.53,152.5,1709.0,0.1444,0.4245,0.4504,0.243,0.3613,0.08758,1
3,84348301,11.42,20.38,77.58,386.1,0.1425,0.2839,0.2414,0.1052,0.2597,...,26.5,98.87,567.7,0.2098,0.8663,0.6869,0.2575,0.6638,0.173,1
4,84358402,20.29,14.34,135.1,1297.0,0.1003,0.1328,0.198,0.1043,0.1809,...,16.67,152.2,1575.0,0.1374,0.205,0.4,0.1625,0.2364,0.07678,1


#### Column Description
diagnosis: The diagnosis of breast tissues (M = malignant, B = benign)  
radius: Distances from center to points on the perimeter  
texture: Standard deviation of gray-scale values  
perimeter: Mean size of the core tumor  
smoothness: Local variation in radius lengths  
compactness: perimeter^2 / area - 1.0  
concavity: Sverity of concave portions of the contour  
concave points: Number of concave portions of the contour   
fractal_dimension: "coastline approximation" - 1  

#### 3. Count the case number of benign and malignant

In [6]:
df_data['diagnosis'].value_counts()

0    357
1    212
Name: diagnosis, dtype: int64

In [7]:
df_data.describe()

Unnamed: 0,id,radius_mean,texture_mean,perimeter_mean,area_mean,smoothness_mean,compactness_mean,concavity_mean,concave points_mean,symmetry_mean,...,texture_worst,perimeter_worst,area_worst,smoothness_worst,compactness_worst,concavity_worst,concave points_worst,symmetry_worst,fractal_dimension_worst,diagnosis
count,569.0,569.0,569.0,569.0,569.0,569.0,569.0,569.0,569.0,569.0,...,569.0,569.0,569.0,569.0,569.0,569.0,569.0,569.0,569.0,569.0
mean,30371830.0,14.127292,19.289649,91.969033,654.889104,0.09636,0.104341,0.088799,0.048919,0.181162,...,25.677223,107.261213,880.583128,0.132369,0.254265,0.272188,0.114606,0.290076,0.083946,0.372583
std,125020600.0,3.524049,4.301036,24.298981,351.914129,0.014064,0.052813,0.07972,0.038803,0.027414,...,6.146258,33.602542,569.356993,0.022832,0.157336,0.208624,0.065732,0.061867,0.018061,0.483918
min,8670.0,6.981,9.71,43.79,143.5,0.05263,0.01938,0.0,0.0,0.106,...,12.02,50.41,185.2,0.07117,0.02729,0.0,0.0,0.1565,0.05504,0.0
25%,869218.0,11.7,16.17,75.17,420.3,0.08637,0.06492,0.02956,0.02031,0.1619,...,21.08,84.11,515.3,0.1166,0.1472,0.1145,0.06493,0.2504,0.07146,0.0
50%,906024.0,13.37,18.84,86.24,551.1,0.09587,0.09263,0.06154,0.0335,0.1792,...,25.41,97.66,686.5,0.1313,0.2119,0.2267,0.09993,0.2822,0.08004,0.0
75%,8813129.0,15.78,21.8,104.1,782.7,0.1053,0.1304,0.1307,0.074,0.1957,...,29.72,125.4,1084.0,0.146,0.3391,0.3829,0.1614,0.3179,0.09208,1.0
max,911320500.0,28.11,39.28,188.5,2501.0,0.1634,0.3454,0.4268,0.2012,0.304,...,49.54,251.2,4254.0,0.2226,1.058,1.252,0.291,0.6638,0.2075,1.0


### Compare the Performance of Machine Leanring Models

#### 1. Setup function for K-fold cross validation

In [8]:
def cross_validatoin(fold, model, X, y):
    """
    Perform K-fold cross validation
    compare the sensitivity, specificity, accuracy, and F1-score of input models
    """
    sensitivity=[]
    specificity=[]
    accuracy=[]
    F1scores=[]
    
    kf = KFold(n_splits=fold,shuffle=True) 
    for train_index, test_index in kf.split(X):
        X_train, X_test = X[train_index], X[test_index] 
        y_train, y_test = y[train_index], y[test_index]
        model.fit(X_train, y_train)
        y_pred = model.predict(X_test)
        
        cm = confusion_matrix(y_test,y_pred)
        sensitivity.append(cm[0,0]/(cm[0,0]+cm[1,0]))
        specificity.append(cm[1,1]/(cm[1,1]+cm[0,1]))
        accuracy.append(accuracy_score(y_test,y_pred))
        F1scores.append(f1_score(y_test, y_pred, pos_label=1))
        
    return [np.mean(sensitivity),np.mean(specificity),np.mean(accuracy),np.mean(F1scores)]

In [9]:
# Set the number of fold for cross validation
fold = 10

#### 2. Apply and compare machine learning models

In [10]:
# Prepare input matrix for machine learning models
feature_names = df_data.columns.tolist()[1:-1]
X = df_data[feature_names].values
y = df_data['diagnosis'].values

In [11]:
# Create dictionary to collect results
d_Model_eva = {} 

# Logistic regression
d_Model_eva['Logistic Regression'] = cross_validatoin(fold, LogisticRegression(), X, y)

# Decision tree
d_Model_eva['Decision Tree'] = cross_validatoin(fold, DecisionTreeClassifier(), X, y)

# Random forest
d_Model_eva['Random Forest'] = cross_validatoin(fold, RandomForestClassifier(), X, y)

#### 3. Determine the best model for each metric

In [12]:
# Create output dataframe
df_eva = pd.DataFrame(d_Model_eva, index=['Sensitivity','Specificity','Accuracy','F1-score'])
df_eva.round(3).T.sort_values('F1-score',ascending=False)

Unnamed: 0,Sensitivity,Specificity,Accuracy,F1-score
Random Forest,0.961,0.957,0.96,0.946
Logistic Regression,0.952,0.949,0.951,0.93
Decision Tree,0.939,0.875,0.914,0.88


### Automated Feature Engineering
Apply Automated Feature Engineering to further improve the performance of models

#### 1. Generate new features with featuretools

In [13]:
# Create new entityset
es = ft.EntitySet(id = 'breastcancer')

df_data_ft = df_data.loc[:,['id']+feature_names]

# Create an entity from the breast cancer dataframe
es = es.entity_from_dataframe(entity_id = 'breastcancer', dataframe = df_data_ft, index='id')

In [14]:
# Generate new features
df_feature_matrix, feature_defs = ft.dfs(entityset = es, target_entity = 'breastcancer',
                                      trans_primitives = ['multiply_numeric','divide_by_feature'])
df_feature_matrix.head()

Unnamed: 0_level_0,radius_mean,texture_mean,perimeter_mean,area_mean,smoothness_mean,compactness_mean,concavity_mean,concave points_mean,symmetry_mean,fractal_dimension_mean,...,1 / concavity_worst * smoothness_worst,1 / area_worst * symmetry_worst,1 / compactness_se * compactness_worst,1 / fractal_dimension_worst * smoothness_worst,1 / radius_se * texture_worst,1 / fractal_dimension_se * smoothness_se,1 / concave points_se * concave points_worst,1 / concavity_mean * radius_worst,1 / concave points_worst * fractal_dimension_worst,1 / radius_se * smoothness_mean
id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
8670,15.46,19.48,101.7,748.9,0.1092,0.1223,0.1466,0.08087,0.1931,0.05796,...,17.062267,0.003049,281.476377,80.662243,0.081091,65118.411319,604.301904,0.354168,82.367126,19.30742
8913,12.89,13.12,81.89,515.9,0.06955,0.03729,0.0226,0.01171,0.1337,0.05581,...,87.684102,0.007506,648.207867,150.388061,0.42004,101572.226334,3155.945027,3.248736,269.498993,93.852123
8915,14.96,19.1,97.03,687.3,0.08992,0.09823,0.0594,0.04819,0.1879,0.05852,...,42.218105,0.004169,156.043973,89.897854,0.132716,66624.116065,565.789109,1.036001,79.271916,38.654836
9047,12.94,16.17,83.18,507.6,0.09879,0.08836,0.03296,0.0239,0.1735,0.062,...,47.14046,0.005221,397.451541,108.915282,0.297946,175649.252331,1631.33457,2.189019,152.180151,69.427174
85715,13.17,18.66,85.98,534.6,0.1158,0.1231,0.1226,0.0734,0.2128,0.06777,...,11.184787,0.003376,102.756167,47.490281,0.124619,42023.735342,394.178768,0.520524,40.621476,30.078644


In [15]:
print("Number of the new features: {}".format(df_feature_matrix.shape[1]-(len(feature_names)+1)))

Number of the new features: 899


#### 2. Remove highly correlated columns
Drop Highly Correlated Features, the code is adapted from [work](https://chrisalbon.com/machine_learning/feature_selection/drop_highly_correlated_features/) by Chris Albon.

In [16]:
# Define the threshold for removing correlated variables
threshold = 0.99

In [17]:
# Get correlation of each variables
corr_matrix = df_feature_matrix.corr().abs()
# Upper triangle of correlations
upper = corr_matrix.where(np.triu(np.ones(corr_matrix.shape), k=1).astype(np.bool))
upper.head()

Unnamed: 0,radius_mean,texture_mean,perimeter_mean,area_mean,smoothness_mean,compactness_mean,concavity_mean,concave points_mean,symmetry_mean,fractal_dimension_mean,...,1 / concavity_worst * smoothness_worst,1 / area_worst * symmetry_worst,1 / compactness_se * compactness_worst,1 / fractal_dimension_worst * smoothness_worst,1 / radius_se * texture_worst,1 / fractal_dimension_se * smoothness_se,1 / concave points_se * concave points_worst,1 / concavity_mean * radius_worst,1 / concave points_worst * fractal_dimension_worst,1 / radius_se * smoothness_mean
radius_mean,,0.323782,0.997855,0.987357,0.170581,0.506124,0.676764,0.822529,0.147741,0.311631,...,0.149721,0.815289,0.223729,0.090137,0.517824,0.047488,0.237168,0.220343,0.360355,0.508623
texture_mean,,,0.329533,0.321086,0.023389,0.236702,0.302418,0.293464,0.071401,0.076437,...,0.081735,0.27974,0.111015,0.081285,0.594917,0.10765,0.100085,0.121929,0.153297,0.26374
perimeter_mean,,,,0.986507,0.207278,0.556936,0.716136,0.850977,0.183027,0.261477,...,0.161655,0.81681,0.245476,0.130014,0.528227,0.013118,0.250672,0.232467,0.380166,0.525132
area_mean,,,,,0.177028,0.498502,0.685983,0.823269,0.151293,0.28311,...,0.143765,0.751123,0.213802,0.092306,0.518312,0.012849,0.221812,0.206473,0.33774,0.515071
smoothness_mean,,,,,,0.659123,0.521984,0.553695,0.557775,0.584792,...,0.209486,0.232173,0.36751,0.712783,0.268493,0.439512,0.290126,0.234099,0.376197,0.538705


In [18]:
# Select columns with correlations above threshold
col_to_drop = [column for column in upper.columns if any(upper[column] > threshold)]
print('{} columns with correlation > {}:\n-{}'.format(len(col_to_drop),threshold,"\n-".join(col_to_drop)))

234 columns with correlation > 0.99:
-perimeter_mean
-perimeter_worst
-area_se * perimeter_mean
-perimeter_worst * radius_mean
-concave points_se * radius_mean
-area_mean * perimeter_mean
-perimeter_worst * texture_worst
-perimeter_mean * radius_se
-perimeter_worst * radius_worst
-fractal_dimension_se * perimeter_mean
-radius_mean * radius_worst
-compactness_se * perimeter_worst
-radius_worst * symmetry_mean
-concavity_mean * perimeter_mean
-concavity_worst * radius_worst
-area_worst * radius_mean
-perimeter_worst * smoothness_se
-area_worst * perimeter_se
-perimeter_mean * symmetry_se
-area_mean * radius_worst
-concave points_mean * radius_mean
-concave points_mean * perimeter_worst
-compactness_mean * perimeter_mean
-concavity_worst * radius_mean
-radius_mean * smoothness_se
-fractal_dimension_worst * perimeter_mean
-fractal_dimension_mean * perimeter_worst
-concave points_worst * radius_mean
-concave points_worst * perimeter_mean
-perimeter_worst * texture_se
-concavity_se * radius_

-1 / concave points_worst * fractal_dimension_worst


In [19]:
df_feature_matrix_dropcorr = df_feature_matrix.drop(columns = col_to_drop)
df_feature_matrix_dropcorr.head()

Unnamed: 0_level_0,radius_mean,texture_mean,area_mean,smoothness_mean,compactness_mean,concavity_mean,concave points_mean,symmetry_mean,fractal_dimension_mean,radius_se,...,1 / smoothness_mean * texture_se,1 / compactness_mean * fractal_dimension_mean,1 / area_se * texture_mean,1 / area_worst * symmetry_worst,1 / compactness_se * compactness_worst,1 / fractal_dimension_worst * smoothness_worst,1 / radius_se * texture_worst,1 / fractal_dimension_se * smoothness_se,1 / concave points_se * concave points_worst,1 / radius_se * smoothness_mean
id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
8670,15.46,19.48,748.9,0.1092,0.1223,0.1466,0.08087,0.1931,0.05796,0.4743,...,11.652257,141.073411,0.001063,0.003049,281.476377,80.662243,0.081091,65118.411319,604.301904,19.30742
8913,12.89,13.12,515.9,0.06955,0.03729,0.0226,0.01171,0.1337,0.05581,0.1532,...,30.657026,480.502436,0.006011,0.007506,648.207867,150.388061,0.42004,101572.226334,3155.945027,93.852123
8915,14.96,19.1,687.3,0.08992,0.09823,0.0594,0.04819,0.1879,0.05852,0.2877,...,11.731009,173.960857,0.002105,0.004169,156.043973,89.897854,0.132716,66624.116065,565.789109,38.654836
9047,12.94,16.17,507.6,0.09879,0.08836,0.03296,0.0239,0.1735,0.062,0.1458,...,11.185063,182.537712,0.005444,0.005221,397.451541,108.915282,0.297946,175649.252331,1631.33457,69.427174
85715,13.17,18.66,534.6,0.1158,0.1231,0.1226,0.0734,0.2128,0.06777,0.2871,...,9.662726,119.868332,0.00221,0.003376,102.756167,47.490281,0.124619,42023.735342,394.178768,30.078644


#### 3. Remove columns with missing values or with too little information

In [20]:
# Remove columns with missing values
df_feature_matrix_dropcorr.replace([np.inf, -np.inf], np.nan, inplace=True)
col_without_nan = df_feature_matrix_dropcorr.columns[~df_feature_matrix_dropcorr.isna().any()]
df_feature_matrix_dropcorr_dropnan = df_feature_matrix_dropcorr[col_without_nan]

# Remove columns with too little information (less than ten distinct values)
selection.remove_low_information_features(df_feature_matrix_dropcorr_dropnan)

df_feature_matrix_dropcorr_dropnan.head()

Unnamed: 0_level_0,radius_mean,texture_mean,area_mean,smoothness_mean,compactness_mean,concavity_mean,concave points_mean,symmetry_mean,fractal_dimension_mean,radius_se,...,1 / smoothness_se * smoothness_worst,1 / smoothness_mean * texture_se,1 / compactness_mean * fractal_dimension_mean,1 / area_se * texture_mean,1 / area_worst * symmetry_worst,1 / compactness_se * compactness_worst,1 / fractal_dimension_worst * smoothness_worst,1 / radius_se * texture_worst,1 / fractal_dimension_se * smoothness_se,1 / radius_se * smoothness_mean
id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
8670,15.46,19.48,748.9,0.1092,0.1223,0.1466,0.08087,0.1931,0.05796,0.4743,...,1036.587388,11.652257,141.073411,0.001063,0.003049,281.476377,80.662243,0.081091,65118.411319,19.30742
8913,12.89,13.12,515.9,0.06955,0.03729,0.0226,0.01171,0.1337,0.05581,0.1532,...,2198.126071,30.657026,480.502436,0.006011,0.007506,648.207867,150.388061,0.42004,101572.226334,93.852123
8915,14.96,19.1,687.3,0.08992,0.09823,0.0594,0.04819,0.1879,0.05852,0.2877,...,1428.384514,11.731009,173.960857,0.002105,0.004169,156.043973,89.897854,0.132716,66624.116065,38.654836
9047,12.94,16.17,507.6,0.09879,0.08836,0.03296,0.0239,0.1735,0.062,0.1458,...,2955.463529,11.185063,182.537712,0.005444,0.005221,397.451541,108.915282,0.297946,175649.252331,69.427174
85715,13.17,18.66,534.6,0.1158,0.1231,0.1226,0.0734,0.2128,0.06777,0.2871,...,857.180671,9.662726,119.868332,0.00221,0.003376,102.756167,47.490281,0.124619,42023.735342,30.078644


In [21]:
print("{} features are removed because they contain missing values or contain too little information.".format(df_feature_matrix_dropcorr.shape[1] - df_feature_matrix_dropcorr_dropnan.shape[1]))

91 features are removed because they contain missing values or contain too little information.


### Compare the Performance of Machine Leanring Models (with Feature Engineering)
Apply K-fold cross validation to compare the performance after feature engineering with featuretools

#### 1. Add diagnosis column back to the feature matrix

In [22]:
df_outcomes = df_data.loc[:,['id','diagnosis']]
df_outcomes.set_index('id',inplace=True)

df_feature_matrix_outcomes = pd.merge(df_feature_matrix_dropcorr_dropnan,df_outcomes,
                                      left_index=True, right_index=True)
df_feature_matrix_outcomes.head()

Unnamed: 0_level_0,radius_mean,texture_mean,area_mean,smoothness_mean,compactness_mean,concavity_mean,concave points_mean,symmetry_mean,fractal_dimension_mean,radius_se,...,1 / smoothness_mean * texture_se,1 / compactness_mean * fractal_dimension_mean,1 / area_se * texture_mean,1 / area_worst * symmetry_worst,1 / compactness_se * compactness_worst,1 / fractal_dimension_worst * smoothness_worst,1 / radius_se * texture_worst,1 / fractal_dimension_se * smoothness_se,1 / radius_se * smoothness_mean,diagnosis
id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
8670,15.46,19.48,748.9,0.1092,0.1223,0.1466,0.08087,0.1931,0.05796,0.4743,...,11.652257,141.073411,0.001063,0.003049,281.476377,80.662243,0.081091,65118.411319,19.30742,1
8913,12.89,13.12,515.9,0.06955,0.03729,0.0226,0.01171,0.1337,0.05581,0.1532,...,30.657026,480.502436,0.006011,0.007506,648.207867,150.388061,0.42004,101572.226334,93.852123,0
8915,14.96,19.1,687.3,0.08992,0.09823,0.0594,0.04819,0.1879,0.05852,0.2877,...,11.731009,173.960857,0.002105,0.004169,156.043973,89.897854,0.132716,66624.116065,38.654836,0
9047,12.94,16.17,507.6,0.09879,0.08836,0.03296,0.0239,0.1735,0.062,0.1458,...,11.185063,182.537712,0.005444,0.005221,397.451541,108.915282,0.297946,175649.252331,69.427174,0
85715,13.17,18.66,534.6,0.1158,0.1231,0.1226,0.0734,0.2128,0.06777,0.2871,...,9.662726,119.868332,0.00221,0.003376,102.756167,47.490281,0.124619,42023.735342,30.078644,1


#### 2. Prepare input matrix (after feature engineering) for machine learning models

In [23]:
# Prepare input matrix for machine learning models
feature_names_ft = df_feature_matrix_outcomes.columns.tolist()[1:-1]
X_ft = df_feature_matrix_outcomes[feature_names_ft].values
y_ft = df_feature_matrix_outcomes['diagnosis'].values

#### 3. Apply and compare machine learning models with k-fold cross validation

In [24]:
# Logistic regression
d_Model_eva['Logistic Regression (After Feature Engineering)'] = cross_validatoin(fold, LogisticRegression(), X_ft, y_ft)

# Decision tree
d_Model_eva['Decision Tree (After Feature Engineering)'] = cross_validatoin(fold, DecisionTreeClassifier(), X_ft, y_ft)

# Random forest
d_Model_eva['Random Forest (After Feature Engineering)'] = cross_validatoin(fold, RandomForestClassifier(), X_ft, y_ft)

In [25]:
# Create output dataframe
df_eva = pd.DataFrame(d_Model_eva, index=['Sensitivity','Specificity','Accuracy','F1-score'])
df_eva.round(3).T.sort_values('F1-score',ascending=False)

Unnamed: 0,Sensitivity,Specificity,Accuracy,F1-score
Random Forest (After Feature Engineering),0.969,0.958,0.963,0.952
Logistic Regression (After Feature Engineering),0.968,0.953,0.961,0.95
Random Forest,0.961,0.957,0.96,0.946
Logistic Regression,0.952,0.949,0.951,0.93
Decision Tree (After Feature Engineering),0.953,0.922,0.942,0.92
Decision Tree,0.939,0.875,0.914,0.88


### Conclusion

In [26]:
# Print out conclusion
print('{}-fold cross validation shows:'.format(fold))
for index, row in df_eva.iterrows():
    print("\t- {} has the best {} score = {:.3f}.".format(df_eva.loc[index,:].idxmax(axis=1), index,
                                         df_eva.loc[index, df_eva.loc[index,:].idxmax(axis=1)]))

10-fold cross validation shows:
	- Random Forest (After Feature Engineering) has the best Sensitivity score = 0.969.
	- Random Forest (After Feature Engineering) has the best Specificity score = 0.958.
	- Random Forest (After Feature Engineering) has the best Accuracy score = 0.963.
	- Random Forest (After Feature Engineering) has the best F1-score score = 0.952.


In [27]:
models = ['Logistic Regression','Decision Tree','Random Forest']

for model in models:
    before_fe = df_eva.loc['F1-score',model]
    after_fe = df_eva.loc['F1-score',model+" (After Feature Engineering)"]
    print('{}: Automated feature engineering improves F1-score by {:.3f}'.format(model,after_fe-before_fe))

Logistic Regression: Automated feature engineering improves F1-score by 0.020
Decision Tree: Automated feature engineering improves F1-score by 0.040
Random Forest: Automated feature engineering improves F1-score by 0.005
