<img src=https://storage.googleapis.com/kaggle-datasets-images/180/384/3da2510581f9d3b902307ff8d06fe327/dataset-cover.jpg>  
## Breast Cancer Wisconsin (Diagnostic) Data Set
[Kaggle Data set](https://www.kaggle.com/uciml/breast-cancer-wisconsin-data#data.csv): Predict whether the cancer is benign or malignant

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns  #for plotting
import featuretools as ft  # featuretools for automated feature engineering

from sklearn.model_selection import KFold 
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score
from sklearn.metrics import f1_score

from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

from featuretools import selection
from warnings import simplefilter  # import warnings filter

pd.options.mode.chained_assignment = None  #hide any pandas warnings
simplefilter(action='ignore', category=FutureWarning)  # ignore all future warnings
np.random.seed(123) #ensure reproducibility

### Download dataset from [Kaggle](https://www.kaggle.com/uciml/breast-cancer-wisconsin-data/downloads/data.csv/2)

In [2]:
# Read breast cancer data set as Pandas dataframe
df_raw = pd.read_csv("data.csv")
df_raw.head()

Unnamed: 0,id,diagnosis,radius_mean,texture_mean,perimeter_mean,area_mean,smoothness_mean,compactness_mean,concavity_mean,concave points_mean,...,texture_worst,perimeter_worst,area_worst,smoothness_worst,compactness_worst,concavity_worst,concave points_worst,symmetry_worst,fractal_dimension_worst,Unnamed: 32
0,842302,M,17.99,10.38,122.8,1001.0,0.1184,0.2776,0.3001,0.1471,...,17.33,184.6,2019.0,0.1622,0.6656,0.7119,0.2654,0.4601,0.1189,
1,842517,M,20.57,17.77,132.9,1326.0,0.08474,0.07864,0.0869,0.07017,...,23.41,158.8,1956.0,0.1238,0.1866,0.2416,0.186,0.275,0.08902,
2,84300903,M,19.69,21.25,130.0,1203.0,0.1096,0.1599,0.1974,0.1279,...,25.53,152.5,1709.0,0.1444,0.4245,0.4504,0.243,0.3613,0.08758,
3,84348301,M,11.42,20.38,77.58,386.1,0.1425,0.2839,0.2414,0.1052,...,26.5,98.87,567.7,0.2098,0.8663,0.6869,0.2575,0.6638,0.173,
4,84358402,M,20.29,14.34,135.1,1297.0,0.1003,0.1328,0.198,0.1043,...,16.67,152.2,1575.0,0.1374,0.205,0.4,0.1625,0.2364,0.07678,


### Column Description
diagnosis: The diagnosis of breast tissues (M = malignant, B = benign)  
radius: Distances from center to points on the perimeter  
texture: Standard deviation of gray-scale values  
perimeter: Mean size of the core tumor  
smoothness: Local variation in radius lengths  
compactness: perimeter^2 / area - 1.0  
concavity: Sverity of concave portions of the contour  
concave points: Number of concave portions of the contour   
fractal_dimension: "coastline approximation" - 1  

### Clean up dataframe

In [3]:
# rearrange columns; remove coluum "Unnamed: 32"; set id as index
fixed_columns = [df_raw.columns[0]]+list(df_raw.columns[2:-1])+[df_raw.columns[1]]
df_data = df_raw[fixed_columns]

In [4]:
# Convert (M=malignant, B=benign) to (1,0)
df_data.loc[df_data.index[df_data['diagnosis']=='B'],'diagnosis'] = 0
df_data.loc[df_data.index[df_data['diagnosis']=='M'],'diagnosis'] = 1

In [5]:
df_data.head()

Unnamed: 0,id,radius_mean,texture_mean,perimeter_mean,area_mean,smoothness_mean,compactness_mean,concavity_mean,concave points_mean,symmetry_mean,...,texture_worst,perimeter_worst,area_worst,smoothness_worst,compactness_worst,concavity_worst,concave points_worst,symmetry_worst,fractal_dimension_worst,diagnosis
0,842302,17.99,10.38,122.8,1001.0,0.1184,0.2776,0.3001,0.1471,0.2419,...,17.33,184.6,2019.0,0.1622,0.6656,0.7119,0.2654,0.4601,0.1189,1
1,842517,20.57,17.77,132.9,1326.0,0.08474,0.07864,0.0869,0.07017,0.1812,...,23.41,158.8,1956.0,0.1238,0.1866,0.2416,0.186,0.275,0.08902,1
2,84300903,19.69,21.25,130.0,1203.0,0.1096,0.1599,0.1974,0.1279,0.2069,...,25.53,152.5,1709.0,0.1444,0.4245,0.4504,0.243,0.3613,0.08758,1
3,84348301,11.42,20.38,77.58,386.1,0.1425,0.2839,0.2414,0.1052,0.2597,...,26.5,98.87,567.7,0.2098,0.8663,0.6869,0.2575,0.6638,0.173,1
4,84358402,20.29,14.34,135.1,1297.0,0.1003,0.1328,0.198,0.1043,0.1809,...,16.67,152.2,1575.0,0.1374,0.205,0.4,0.1625,0.2364,0.07678,1


### Compare the performance of machine leanring models

#### 1. Setup function for K-fold cross validation

In [6]:
def cross_validatoin(fold, model, X, y):
    """
    Perform K-fold cross validation
    compare the sensitivity, specificity, accuracy, and F1-score of input models
    """
    sensitivity=[]
    specificity=[]
    accuracy=[]
    F1scores=[]
    
    kf = KFold(n_splits=fold,shuffle=True) 
    for train_index, test_index in kf.split(X):
        X_train, X_test = X[train_index], X[test_index] 
        y_train, y_test = y[train_index], y[test_index]
        model.fit(X_train, y_train)
        y_pred = model.predict(X_test)
        
        cm = confusion_matrix(y_test,y_pred)
        sensitivity.append(cm[0,0]/(cm[0,0]+cm[1,0]))
        specificity.append(cm[1,1]/(cm[1,1]+cm[0,1]))
        accuracy.append(accuracy_score(y_test,y_pred))
        F1scores.append(f1_score(y_test, y_pred, pos_label=1))
        
    return [np.mean(sensitivity),np.mean(specificity),np.mean(accuracy),np.mean(F1scores)]

In [7]:
# Set the number of fold for cross validation
fold = 10

#### 2. Apply and compare machine learning models

In [8]:
# Prepare input matrix for machine learning models
feature_names = df_data.columns.tolist()[1:-1]
X = df_data[feature_names].values
y = df_data['diagnosis'].values

In [9]:
# Create dictionary to collect results
d_Model_eva = {} 

# Logistic regression
d_Model_eva['Logistic Regression'] = cross_validatoin(fold, LogisticRegression(), X, y)

# Decision tree
d_Model_eva['Decision Tree'] = cross_validatoin(fold, DecisionTreeClassifier(), X, y)

# Random forest
d_Model_eva['Random Forest'] = cross_validatoin(fold, RandomForestClassifier(), X, y)

#### 3. Determine the best model for each metric

In [10]:
# Create output dataframe
df_eva = pd.DataFrame(d_Model_eva, index=['Sensitivity','Specificity','Accuracy','F1-score'])
df_eva.round(3).T.sort_values('F1-score',ascending=False)

Unnamed: 0,Sensitivity,Specificity,Accuracy,F1-score
Random Forest,0.961,0.957,0.96,0.946
Logistic Regression,0.952,0.949,0.951,0.93
Decision Tree,0.939,0.875,0.914,0.88


### Automated Feature Engineering
Apply Automated Feature Engineering to further improve the performance of models

#### 1. Generate new features with featuretools

In [11]:
# Create new entityset
es = ft.EntitySet(id = 'breastcancer')

df_data_ft = df_data.loc[:,['id']+feature_names]

# Create an entity from the breast cancer dataframe
es = es.entity_from_dataframe(entity_id = 'breastcancer', dataframe = df_data_ft, index='id')

In [12]:
# Generate new features
df_feature_matrix, feature_defs = ft.dfs(entityset = es, target_entity = 'breastcancer',
                                      trans_primitives = ['multiply_numeric','divide_by_feature'])
df_feature_matrix.head()

Unnamed: 0_level_0,radius_mean,texture_mean,perimeter_mean,area_mean,smoothness_mean,compactness_mean,concavity_mean,concave points_mean,symmetry_mean,fractal_dimension_mean,...,1 / area_worst * texture_mean,1 / compactness_mean * symmetry_worst,1 / concavity_worst * radius_se,1 / perimeter_se * symmetry_se,1 / compactness_mean * texture_se,1 / concavity_mean * fractal_dimension_worst,1 / concavity_se * texture_mean,1 / compactness_se * concavity_mean,1 / concavity_mean * symmetry_mean,1 / area_se * compactness_mean
id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
8670,15.46,19.48,101.7,748.9,0.1092,0.1223,0.1466,0.08087,0.1931,0.05796,...,4.4e-05,28.821343,5.561515,23.135734,10.404142,85.064003,1.824909,459.655148,35.325129,0.169253
8913,12.89,13.12,81.89,515.9,0.06955,0.03729,0.0226,0.01171,0.1337,0.05581,...,0.000132,116.140498,55.037227,55.395984,57.178765,639.881238,4.613772,3289.798335,330.948299,2.114893
8915,14.96,19.1,97.03,687.3,0.08992,0.09823,0.0594,0.04819,0.1879,0.05852,...,6.5e-05,34.369309,19.267422,30.263944,10.738596,198.713608,3.408595,795.981883,89.595619,0.409336
9047,12.94,16.17,83.18,507.6,0.09879,0.08836,0.03296,0.0239,0.1735,0.062,...,0.000106,34.32617,37.893429,53.609961,12.505346,387.28371,3.834031,2361.074383,174.869198,0.996245
85715,13.17,18.66,85.98,534.6,0.1158,0.1231,0.1226,0.0734,0.2128,0.06777,...,7.1e-05,20.829428,6.957864,30.243725,9.089713,69.182416,1.84477,349.169814,38.329919,0.334989


In [13]:
print("Number of the new features: {}".format(df_feature_matrix.shape[1]-(len(feature_names)+1)))

Number of the new features: 899


#### 2. Remove highly correlated columns
Drop Highly Correlated Features, the code is adapted from [work](https://chrisalbon.com/machine_learning/feature_selection/drop_highly_correlated_features/) by Chris Albon.

In [14]:
# Define the threshold for removing correlated variables
threshold = 0.99

In [15]:
# Get correlation of each variables
corr_matrix = df_feature_matrix.corr().abs()
# Upper triangle of correlations
upper = corr_matrix.where(np.triu(np.ones(corr_matrix.shape), k=1).astype(np.bool))
upper.head()

Unnamed: 0,radius_mean,texture_mean,perimeter_mean,area_mean,smoothness_mean,compactness_mean,concavity_mean,concave points_mean,symmetry_mean,fractal_dimension_mean,...,1 / area_worst * texture_mean,1 / compactness_mean * symmetry_worst,1 / concavity_worst * radius_se,1 / perimeter_se * symmetry_se,1 / compactness_mean * texture_se,1 / concavity_mean * fractal_dimension_worst,1 / concavity_se * texture_mean,1 / compactness_se * concavity_mean,1 / concavity_mean * symmetry_mean,1 / area_se * compactness_mean
radius_mean,,0.323782,0.997855,0.987357,0.170581,0.506124,0.676764,0.822529,0.147741,0.311631,...,0.803844,0.385974,0.219171,0.33737,0.256653,0.172363,0.197945,0.130867,0.18888,0.55673
texture_mean,,,0.329533,0.321086,0.023389,0.236702,0.302418,0.293464,0.071401,0.076437,...,0.584426,0.156154,0.129646,0.271704,0.434473,0.102695,0.228809,0.078071,0.114982,0.26867
perimeter_mean,,,,0.986507,0.207278,0.556936,0.716136,0.850977,0.183027,0.261477,...,0.800758,0.421173,0.232455,0.357721,0.288317,0.185675,0.214093,0.143203,0.202676,0.577108
area_mean,,,,,0.177028,0.498502,0.685983,0.823269,0.151293,0.28311,...,0.740699,0.372185,0.208333,0.361115,0.265703,0.165475,0.192975,0.126583,0.180837,0.529243
smoothness_mean,,,,,,0.659123,0.521984,0.553695,0.557775,0.584792,...,0.069851,0.615745,0.246041,0.334028,0.516224,0.241969,0.237902,0.246795,0.259364,0.498318


In [16]:
# Select columns with correlations above threshold
col_to_drop = [column for column in upper.columns if any(upper[column] > threshold)]
print('{} columns with correlation > {}:\n-{}'.format(len(col_to_drop),threshold,"\n-".join(col_to_drop)))

230 columns with correlation > 0.99:
-perimeter_mean
-perimeter_worst
-area_mean * radius_worst
-concavity_worst * radius_worst
-concave points_se * perimeter_mean
-perimeter_worst * radius_worst
-perimeter_worst * symmetry_se
-concave points_worst * radius_worst
-compactness_se * radius_mean
-concavity_mean * radius_worst
-perimeter_mean * radius_worst
-perimeter_mean * smoothness_se
-area_mean * radius_se
-area_mean * radius_mean
-fractal_dimension_se * perimeter_mean
-area_worst * perimeter_mean
-area_se * radius_se
-radius_worst * smoothness_se
-fractal_dimension_worst * perimeter_worst
-perimeter_mean * perimeter_worst
-radius_worst * symmetry_worst
-concavity_se * perimeter_worst
-compactness_se * perimeter_worst
-concave points_se * perimeter_worst
-area_worst * radius_se
-perimeter_mean * smoothness_mean
-compactness_worst * perimeter_mean
-perimeter_mean * radius_se
-compactness_mean * radius_mean
-concavity_mean * texture_worst
-concave points_se * radius_mean
-perimeter_se *

In [17]:
df_feature_matrix_dropcorr = df_feature_matrix.drop(columns = col_to_drop)
df_feature_matrix_dropcorr.head()

Unnamed: 0_level_0,radius_mean,texture_mean,area_mean,smoothness_mean,compactness_mean,concavity_mean,concave points_mean,symmetry_mean,fractal_dimension_mean,radius_se,...,1 / compactness_mean * radius_se,1 / fractal_dimension_se * smoothness_se,1 / perimeter_se * texture_se,1 / area_se * symmetry_mean,1 / area_worst * texture_mean,1 / compactness_mean * symmetry_worst,1 / perimeter_se * symmetry_se,1 / compactness_mean * texture_se,1 / compactness_se * concavity_mean,1 / area_se * compactness_mean
id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
8670,15.46,19.48,748.9,0.1092,0.1223,0.1466,0.08087,0.1931,0.05796,0.4743,...,17.239331,65118.411319,0.411256,0.107197,4.4e-05,28.821343,23.135734,10.404142,459.655148,0.169253
8913,12.89,13.12,515.9,0.06955,0.03729,0.0226,0.01171,0.1337,0.05581,0.1532,...,175.044654,101572.226334,1.912284,0.589861,0.000132,116.140498,55.395984,57.178765,3289.798335,2.114893
8915,14.96,19.1,687.3,0.08992,0.09823,0.0594,0.04819,0.1879,0.05852,0.2877,...,35.384739,66624.116065,0.485883,0.213992,6.5e-05,34.369309,30.263944,10.738596,795.981883,0.409336
9047,12.94,16.17,507.6,0.09879,0.08836,0.03296,0.0239,0.1735,0.062,0.1458,...,77.622347,175649.252331,1.107742,0.507367,0.000106,34.32617,53.609961,12.505346,2361.074383,0.996245
85715,13.17,18.66,534.6,0.1158,0.1231,0.1226,0.0734,0.2128,0.06777,0.2871,...,28.294939,42023.735342,0.589849,0.193783,7.1e-05,20.829428,30.243725,9.089713,349.169814,0.334989


#### 3. Remove columns with missing values or with too little information

In [18]:
# Remove columns with missing values
df_feature_matrix_dropcorr.replace([np.inf, -np.inf], np.nan, inplace=True)
col_without_nan = df_feature_matrix_dropcorr.columns[~df_feature_matrix_dropcorr.isna().any()]
df_feature_matrix_dropcorr_dropnan = df_feature_matrix_dropcorr[col_without_nan]

# Remove columns with too little information (less than ten distinct values)
selection.remove_low_information_features(df_feature_matrix_dropcorr_dropnan)

df_feature_matrix_dropcorr_dropnan.head()

Unnamed: 0_level_0,radius_mean,texture_mean,area_mean,smoothness_mean,compactness_mean,concavity_mean,concave points_mean,symmetry_mean,fractal_dimension_mean,radius_se,...,1 / compactness_mean * smoothness_se,1 / compactness_mean * radius_se,1 / fractal_dimension_se * smoothness_se,1 / perimeter_se * texture_se,1 / area_se * symmetry_mean,1 / area_worst * texture_mean,1 / compactness_mean * symmetry_worst,1 / perimeter_se * symmetry_se,1 / compactness_mean * texture_se,1 / area_se * compactness_mean
id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
8670,15.46,19.48,748.9,0.1092,0.1223,0.1466,0.08087,0.1931,0.05796,0.4743,...,1310.354949,17.239331,65118.411319,0.411256,0.107197,4.4e-05,28.821343,23.135734,10.404142,0.169253
8913,12.89,13.12,515.9,0.06955,0.03729,0.0226,0.01171,0.1337,0.05581,0.1532,...,5668.324028,175.044654,101572.226334,1.912284,0.589861,0.000132,116.140498,55.395984,57.178765,2.114893
8915,14.96,19.1,687.3,0.08992,0.09823,0.0594,0.04819,0.1879,0.05852,0.2877,...,1909.262819,35.384739,66624.116065,0.485883,0.213992,6.5e-05,34.369309,30.263944,10.738596,0.409336
9047,12.94,16.17,507.6,0.09879,0.08836,0.03296,0.0239,0.1735,0.062,0.1458,...,3920.103277,77.622347,175649.252331,1.107742,0.507367,0.000106,34.32617,53.609961,12.505346,0.996245
85715,13.17,18.66,534.6,0.1158,0.1231,0.1226,0.0734,0.2128,0.06777,0.2871,...,1243.643118,28.294939,42023.735342,0.589849,0.193783,7.1e-05,20.829428,30.243725,9.089713,0.334989


In [19]:
print("{} features are removed because they contain missing values or contain too little information.".format(df_feature_matrix_dropcorr.shape[1] - df_feature_matrix_dropcorr_dropnan.shape[1]))

94 features are removed because they contain missing values or contain too little information.


### Compare the performance of machine leanring models (with feature engineering)
Apply K-fold cross validation to compare the performance after feature engineering with featuretools

#### 1. Add diagnosis column back to the feature matrix

In [20]:
df_outcomes = df_data.loc[:,['id','diagnosis']]
df_outcomes.set_index('id',inplace=True)

df_feature_matrix_outcomes = pd.merge(df_feature_matrix_dropcorr_dropnan,df_outcomes,
                                      left_index=True, right_index=True)
df_feature_matrix_outcomes.head()

Unnamed: 0_level_0,radius_mean,texture_mean,area_mean,smoothness_mean,compactness_mean,concavity_mean,concave points_mean,symmetry_mean,fractal_dimension_mean,radius_se,...,1 / compactness_mean * radius_se,1 / fractal_dimension_se * smoothness_se,1 / perimeter_se * texture_se,1 / area_se * symmetry_mean,1 / area_worst * texture_mean,1 / compactness_mean * symmetry_worst,1 / perimeter_se * symmetry_se,1 / compactness_mean * texture_se,1 / area_se * compactness_mean,diagnosis
id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
8670,15.46,19.48,748.9,0.1092,0.1223,0.1466,0.08087,0.1931,0.05796,0.4743,...,17.239331,65118.411319,0.411256,0.107197,4.4e-05,28.821343,23.135734,10.404142,0.169253,1
8913,12.89,13.12,515.9,0.06955,0.03729,0.0226,0.01171,0.1337,0.05581,0.1532,...,175.044654,101572.226334,1.912284,0.589861,0.000132,116.140498,55.395984,57.178765,2.114893,0
8915,14.96,19.1,687.3,0.08992,0.09823,0.0594,0.04819,0.1879,0.05852,0.2877,...,35.384739,66624.116065,0.485883,0.213992,6.5e-05,34.369309,30.263944,10.738596,0.409336,0
9047,12.94,16.17,507.6,0.09879,0.08836,0.03296,0.0239,0.1735,0.062,0.1458,...,77.622347,175649.252331,1.107742,0.507367,0.000106,34.32617,53.609961,12.505346,0.996245,0
85715,13.17,18.66,534.6,0.1158,0.1231,0.1226,0.0734,0.2128,0.06777,0.2871,...,28.294939,42023.735342,0.589849,0.193783,7.1e-05,20.829428,30.243725,9.089713,0.334989,1


#### 2. Prepare input matrix (after feature engineering) for machine learning models

In [21]:
# Prepare input matrix for machine learning models
feature_names_ft = df_feature_matrix_outcomes.columns.tolist()[1:-1]
X_ft = df_feature_matrix_outcomes[feature_names_ft].values
y_ft = df_feature_matrix_outcomes['diagnosis'].values

#### 3. Apply and compare machine learning models with k-fold cross validation

In [22]:
# Logistic regression
d_Model_eva['Logistic Regression (Feature Engineering)'] = cross_validatoin(fold, LogisticRegression(), X_ft, y_ft)

# Decision tree
d_Model_eva['Decision Tree (Feature Engineering)'] = cross_validatoin(fold, DecisionTreeClassifier(), X_ft, y_ft)

# Random forest
d_Model_eva['Random Forest (Feature Engineering)'] = cross_validatoin(fold, RandomForestClassifier(), X_ft, y_ft)

In [23]:
# Create output dataframe
df_eva = pd.DataFrame(d_Model_eva, index=['Sensitivity','Specificity','Accuracy','F1-score'])
df_eva.round(3).T.sort_values('F1-score',ascending=False)

Unnamed: 0,Sensitivity,Specificity,Accuracy,F1-score
Random Forest (Feature Engineering),0.965,0.978,0.968,0.957
Random Forest,0.961,0.957,0.96,0.946
Logistic Regression (Feature Engineering),0.965,0.948,0.958,0.945
Logistic Regression,0.952,0.949,0.951,0.93
Decision Tree (Feature Engineering),0.948,0.924,0.939,0.915
Decision Tree,0.939,0.875,0.914,0.88


### Conclusion

In [24]:
# Print out conclusion
print('{}-fold cross validation shows:'.format(fold))
for index, row in df_eva.iterrows():
    print("- {} has the best {} score = {:.3f}.".format(df_eva.loc[index,:].idxmax(axis=1), index,
                                         df_eva.loc[index, df_eva.loc[index,:].idxmax(axis=1)]))

10-fold cross validation shows:
- Logistic Regression (Feature Engineering) has the best Sensitivity score = 0.965.
- Random Forest (Feature Engineering) has the best Specificity score = 0.978.
- Random Forest (Feature Engineering) has the best Accuracy score = 0.968.
- Random Forest (Feature Engineering) has the best F1-score score = 0.957.
