Heart disease diagnosis
---

Exercise - Evaluate "most-frequent" baseline
---

> **Exercise**: Load and split the `heart-disease.csv` data into 70-30 train/test sets - make sure to keep the same proportion of classes by setting `stratify`. Evaluate the accuracy of the "most-frequent" baseline.

In [1]:
import pandas as pd
import os
from sklearn.model_selection import train_test_split

df=pd.read_csv(os.path.join('data','heart-disease.csv'))
df.sex=df.sex.map({'male':0, 'female':1})
df.cp=df.cp.map({'typical angina':0, 'asymptomatic':1, 'non-anginal pain':2, 'atypical angina':3})
df.restecg=df.restecg.map({'ventricular hypertrophy':0, 'normal':1, 'ST-T wave':2})
df.fbs=df.fbs.map({'yes':0, 'no':1})
df.exang=df.exang.map({'yes':0, 'no':1})
df.slope=df.slope.map({'downsloping':0, 'flat':1, 'upsloping':2})
df.thal=df.thal.map({'fixed defect':0, 'normal':1, 'reversable defect':2})


df.head()

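| | age | sex | cp | trestbps | chol | fbs | restecg | thalach | exang | oldpeak | slope | ca | thal | disease |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 63 | 0 | 0 | 145 | 233 | 0 | 0 | 150 | 1 | 2.3 | 0 | 0 | 0 | absence |
| 1 | 67 | 0 | 1 | 160 | 286 | 1 | 0 | 108 | 0 | 1.5 | 1 | 3 | 1 | likely |
| 2 | 67 | 0 | 1 | 120 | 229 | 1 | 0 | 129 | 0 | 2.6 | 1 | 2 | 2 | likely |
| 3 | 37 | 0 | 2 | 130 | 250 | 1 | 1 | 187 | 1 | 3.5 | 0 | 0 | 1 | absence |
| 4 | 41 | 1 | 3 | 130 | 204 | 1 | 0 | 172 | 1 | 1.4 | 2 | 0 | 1 | absence |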


In [2]:
X=df.drop('disease',axis=1).values
y=df.disease.values

# Stratified 70-30 split: class proportions are preserved in both sets
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0, stratify=y)

# Evaluate 'most-frequent' baseline
from sklearn.dummy import DummyClassifier

# Create the dummy classifier
dummy = DummyClassifier(strategy='most_frequent')

# Fit it
dummy.fit(X_tr, y_tr)

# Compute test accuracy
accuracy = dummy.score(X_te, y_te)
print('Accuracy: {:.2f}%'.format(accuracy*100))


Accuracy: 53.85%
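
As a sanity check, the accuracy of a "most-frequent" baseline is simply the share of test labels that equal the majority class of the train set. The sketch below illustrates this on made-up labels; the 60/40 class counts and the all-zero feature matrix are assumptions for illustration only, not the heart-disease data.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import train_test_split

# Hypothetical labels (illustrative only): 60 'absence', 40 'likely'
y = np.array(['absence'] * 60 + ['likely'] * 40)
X = np.zeros((len(y), 1))  # the dummy classifier ignores the features

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y)

# Accuracy reported by the baseline
dummy = DummyClassifier(strategy='most_frequent').fit(X_tr, y_tr)
baseline_acc = dummy.score(X_te, y_te)

# Same quantity computed by hand: majority class of the train set,
# then its share among the test labels
values, counts = np.unique(y_tr, return_counts=True)
majority = values[np.argmax(counts)]
majority_share = np.mean(y_te == majority)

print(baseline_acc, majority_share)
```

Any model tuned in the following exercises should beat this number to be of any use.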


Exercise - Evaluate k-NN baseline
---

> **Exercise**: Tune a k-NN classifier using grid search with **stratified 10-fold** cross-validation
> * Number of neighbors k
> * Distance metric - $L_{1}$ or $L_{2}$
> * Weighting strategy - uniform or by distance
>
> Refit the best estimator on the whole train set and report the test accuracy.
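
To make the last two options concrete, here is a small numeric sketch with arbitrary points and hypothetical neighbor labels. In `KNeighborsClassifier`, the default Minkowski metric with `p=1` gives the $L_{1}$ distance and `p=2` the $L_{2}$ distance, while `weights='distance'` weights each neighbor by the inverse of its distance. The `vote` helper is made up for this illustration and is not part of scikit-learn.

```python
import numpy as np

a = np.array([0.0, 0.0])
b = np.array([3.0, 4.0])

l1 = np.sum(np.abs(a - b))          # L1 (Manhattan) distance, p=1
l2 = np.sqrt(np.sum((a - b) ** 2))  # L2 (Euclidean) distance, p=2

# Three hypothetical neighbors with their labels and distances to the query
labels = np.array(['absence', 'likely', 'likely'])
dists = np.array([1.0, 2.0, 4.0])

def vote(labels, weights):
    """Return the class with the largest total weight (illustrative helper)."""
    classes = np.unique(labels)
    totals = [weights[labels == c].sum() for c in classes]
    return classes[int(np.argmax(totals))]

uniform_vote = vote(labels, np.ones_like(dists))  # every neighbor counts equally
distance_vote = vote(labels, 1.0 / dists)         # closer neighbors count more

print(l1, l2, uniform_vote, distance_vote)
```

With uniform weights the two 'likely' neighbors win the vote; with inverse-distance weights the single, much closer 'absence' neighbor outweighs them.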

Data set documentation: http://archive.ics.uci.edu/ml/datasets/heart+Disease

In [3]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.model_selection import StratifiedKFold
from sklearn.model_selection import GridSearchCV

knn=Pipeline([
    ('scaler', StandardScaler()),
    ('knn', KNeighborsClassifier())
])

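# Create k-fold object
# Note: this object is not passed to GridSearchCV below, which uses cv=10.
# For a classifier, an integer cv already yields stratified (unshuffled) folds.
kfold = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)

grid_param={
    'scaler': [None, StandardScaler()],
    'knn__n_neighbors': [1,2,5,10,15,20, 30, 50],
    'knn__weights': ['uniform','distance'],
    'knn__p': [1, 2]
}

# Note: the `iid` parameter was removed in scikit-learn 0.24; drop it on newer versions.
grid_cv=GridSearchCV(knn, grid_param, cv=10, refit=True, return_train_score=True, verbose=True, n_jobs=-1, iid=True)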

grid_cv.fit(X_tr, y_tr)

columns=['param_scaler', 'param_knn__weights', 'param_knn__p', 'param_knn__n_neighbors', 'mean_test_score', 'std_test_score', 'mean_train_score']
pd.DataFrame(grid_cv.cv_results_).sort_values('mean_test_score', ascending=False)[columns].head()



Fitting 10 folds for each of 64 candidates, totalling 640 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done  34 tasks      | elapsed:    1.0s
[Parallel(n_jobs=-1)]: Done 600 tasks      | elapsed:    1.6s
[Parallel(n_jobs=-1)]: Done 640 out of 640 | elapsed:    1.6s finished


| | param_scaler | param_knn__weights | param_knn__p | param_knn__n_neighbors | mean_test_score | std_test_score | mean_train_score |
|---|---|---|---|---|---|---|---|
| 49 | StandardScaler() | uniform | 1 | 30 | 0.674528 | 0.051978 | 0.680813 |
| 33 | StandardScaler() | uniform | 1 | 15 | 0.669811 | 0.04616 | 0.693918 |
| 35 | StandardScaler() | distance | 1 | 15 | 0.665094 | 0.054266 | 1.0 |
| 59 | StandardScaler() | distance | 1 | 50 | 0.665094 | 0.041836 | 1.0 |
| 51 | StandardScaler() | distance | 1 | 30 | 0.665094 | 0.036069 | 1.0 |
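
The exercise also asks for the test accuracy of the refitted estimator. With `refit=True`, `GridSearchCV` refits the best configuration on the whole train set, so calling `score` on the search object evaluates that refitted pipeline. The following is a minimal sketch under stated assumptions: it uses synthetic data from `make_classification` as a stand-in for the heart-disease data and a reduced grid, so its numbers say nothing about the results above.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, StratifiedKFold, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption, not the heart-disease data)
X, y = make_classification(n_samples=300, n_features=10, n_informative=5,
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y)

pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('knn', KNeighborsClassifier())
])

# Reduced grid to keep the example fast
grid = {
    'knn__n_neighbors': [1, 5, 15],
    'knn__weights': ['uniform', 'distance'],
    'knn__p': [1, 2]
}

cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
search = GridSearchCV(pipe, grid, cv=cv, refit=True)
search.fit(X_tr, y_tr)

# The refitted best pipeline is used by score()
print('Best parameters:', search.best_params_)
test_acc = search.score(X_te, y_te)
print('Test accuracy: {:.2f}%'.format(100 * test_acc))
```

On the notebook's own data, the equivalent call would be `grid_cv.score(X_te, y_te)` after the fit in the cell above.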


Exercise - Logistic regression
---

> **Exercise**: Repeat the grid search with a logistic regression classifier
> * Try both one-vs-rest (OvR) and softmax (multinomial)
> * Tune the regularization strength C
>
> Which estimator would you use in practice? k-NN or logistic regression?
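
As a reminder of how the two multi-class schemes turn scores into probabilities, here is a small sketch with arbitrary per-class scores `z` (hypothetical values). OvR applies an independent sigmoid to each class score and then normalizes the results to sum to one, while softmax normalizes the exponentiated scores jointly. In practice the two schemes also learn different coefficients, so the scores themselves would differ; the example only contrasts the final normalization step.

```python
import numpy as np

# Hypothetical linear scores for three classes
z = np.array([2.0, 1.0, -1.0])

# OvR: one sigmoid per class, then normalize so the probabilities sum to 1
sigm = 1.0 / (1.0 + np.exp(-z))
p_ovr = sigm / sigm.sum()

# Softmax: exponentiate (shifted by the max for numerical stability) and normalize
expz = np.exp(z - z.max())
p_softmax = expz / expz.sum()

print('OvR:    ', np.round(p_ovr, 3))
print('Softmax:', np.round(p_softmax, 3))
```

Both schemes rank the classes the same way for these scores, but softmax concentrates more probability mass on the top class.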

In [4]:
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.model_selection import StratifiedKFold
from sklearn.model_selection import GridSearchCV
import numpy as np

# The step keeps the name 'knn' from the previous cell, so the parameter
# names below (and the result columns) still carry the 'knn__' prefix.
logreg=Pipeline([
    ('scaler', StandardScaler()),
    ('knn', LogisticRegression(multi_class='ovr', solver='liblinear', max_iter=10000))
])

# Create k-fold object
# Note: this object is not passed to GridSearchCV below, which uses cv=10.
kfold = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)

# The solver and multi_class values set here override the pipeline defaults above
grid_param={
    'scaler': [None, StandardScaler()],
    'knn__multi_class': ['ovr','multinomial'],
    'knn__C': np.logspace(-4, 4, num=10),
    'knn__solver': ['sag', 'saga', 'lbfgs']
}

# Note: the `iid` parameter was removed in scikit-learn 0.24; drop it on newer versions.
grid_cv=GridSearchCV(logreg, grid_param, cv=10, refit=True, return_train_score=True, verbose=True, n_jobs=-1, iid=True)

grid_cv.fit(X_tr, y_tr)

columns=['param_scaler', 'param_knn__solver', 'param_knn__multi_class', 'param_knn__C', 'mean_test_score', 'std_test_score', 'mean_train_score']
pd.DataFrame(grid_cv.cv_results_).sort_values('mean_test_score', ascending=False)[columns].head()




[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.


Fitting 10 folds for each of 120 candidates, totalling 1200 fits


[Parallel(n_jobs=-1)]: Done  64 tasks      | elapsed:    0.2s
[Parallel(n_jobs=-1)]: Done 880 tasks      | elapsed:   15.0s
[Parallel(n_jobs=-1)]: Done 1200 out of 1200 | elapsed:   24.2s finished


| | param_scaler | param_knn__solver | param_knn__multi_class | param_knn__C | mean_test_score | std_test_score | mean_train_score |
|---|---|---|---|---|---|---|---|
| 49 | StandardScaler() | sag | ovr | 0.359381 | 0.679245 | 0.081538 | 0.735842 |
| 53 | StandardScaler() | lbfgs | ovr | 0.359381 | 0.679245 | 0.081538 | 0.735842 |
| 51 | StandardScaler() | saga | ovr | 0.359381 | 0.679245 | 0.081538 | 0.735842 |
| 119 | StandardScaler() | lbfgs | multinomial | 10000.0 | 0.674528 | 0.072415 | 0.744737 |
| 93 | StandardScaler() | saga | multinomial | 166.81 | 0.674528 | 0.072415 | 0.744737 |
