Heart disease diagnosis
---

Exercise - Evaluate "most-frequent" baseline
---

> **Exercise**: Load and split the `heart-disease.csv` data into 70-30 train/test sets - make sure to keep the same proportion of classes by setting `stratify`. Evaluate the accuracy of the "most-frequent" baseline.

In [78]:
# Load libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.dummy import DummyClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegressionCV
from sklearn.preprocessing import scale
%matplotlib inline
import matplotlib.pyplot as plt

In [6]:
# Load data
data = pd.read_csv("data/heart-disease.csv")
data.head()

| | age | sex | cp | trestbps | chol | fbs | restecg | thalach | exang | oldpeak | slope | ca | thal | disease |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 63 | male | typical angina | 145 | 233 | yes | ventricular hypertrophy | 150 | no | 2.3 | downsloping | 0 | fixed defect | absence |
| 1 | 67 | male | asymptomatic | 160 | 286 | no | ventricular hypertrophy | 108 | yes | 1.5 | flat | 3 | normal | likely |
| 2 | 67 | male | asymptomatic | 120 | 229 | no | ventricular hypertrophy | 129 | yes | 2.6 | flat | 2 | reversable defect | likely |
| 3 | 37 | male | non-anginal pain | 130 | 250 | no | normal | 187 | no | 3.5 | downsloping | 0 | normal | absence |
| 4 | 41 | female | atypical angina | 130 | 204 | no | ventricular hypertrophy | 172 | no | 1.4 | upsloping | 0 | normal | absence |


In [13]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 303 entries, 0 to 302
Data columns (total 14 columns):
age         303 non-null int64
sex         303 non-null object
cp          303 non-null object
trestbps    303 non-null int64
chol        303 non-null int64
fbs         303 non-null object
restecg     303 non-null object
thalach     303 non-null int64
exang       303 non-null object
oldpeak     303 non-null float64
slope       303 non-null object
ca          303 non-null int64
thal        303 non-null object
disease     303 non-null object
dtypes: float64(1), int64(5), object(8)
memory usage: 33.2+ KB


In [106]:
# Create X/y arrays
X = data.drop('disease', axis=1)
y = data.disease

# One-hot encode the categorical (object) columns
X_num = pd.get_dummies(X).values

# Split into train/test sets
X_tr, X_te, y_tr, y_te = train_test_split(X_num, y, test_size=0.3, random_state=0, stratify=y)

# Compute distribution using Pandas
pd.Series(y_tr).value_counts(normalize=True)

absence        0.542453
likely         0.301887
very likely    0.155660
Name: disease, dtype: float64

In [108]:
# Create the dummy classifier
dummy = DummyClassifier(strategy='most_frequent')

# Fit it (the features are ignored; only the class frequencies in y_tr matter)
dummy.fit(X_tr, y_tr)

# Compute test accuracy
accuracy = dummy.score(X_te, y_te)
print('Accuracy: {:.2f}%'.format(accuracy*100))

Accuracy: 53.85%
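
As a sanity check, the accuracy of the "most-frequent" baseline is simply the share of test labels that equal the majority class of the train set. The sketch below illustrates this on made-up labels; `y_train`, `y_test` and the zero-filled feature arrays are hypothetical and are not the heart-disease data.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Hypothetical labels, for illustration only
y_train = np.array(["absence"] * 6 + ["likely"] * 3 + ["very likely"] * 1)
y_test = np.array(["absence"] * 3 + ["likely"] * 2)

# The dummy classifier ignores the features, so placeholder zeros are enough
X_train = np.zeros((len(y_train), 1))
X_test = np.zeros((len(y_test), 1))

dummy = DummyClassifier(strategy="most_frequent").fit(X_train, y_train)
baseline_accuracy = dummy.score(X_test, y_test)

# Same quantity by hand: fraction of test labels equal to the predicted class
majority_class = dummy.predict(X_test)[0]
manual_accuracy = np.mean(y_test == majority_class)

print(majority_class, baseline_accuracy, manual_accuracy)
```

Both numbers agree, so any model worth keeping should beat the majority-class share of the test set.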


Exercise - Evaluate k-NN baseline
---

> **Exercise**: Tune a k-NN classifier using grid search with **stratified 10-fold** cross-validation
> * Number of neighbors k
> * Distance metric - $L_{1}$ or $L_{2}$
> * Weighting strategy - uniform or by distance
>
> Refit the best estimator on the whole train set and report the test accuracy.

Data set documentation: http://archive.ics.uci.edu/ml/datasets/heart+Disease
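
With the default `metric='minkowski'` of `KNeighborsClassifier`, the parameter `p` selects the distance: `p=1` gives the $L_{1}$ (Manhattan) distance and `p=2` the $L_{2}$ (Euclidean) distance. A small sketch of the formula, using a hypothetical helper `minkowski_distance` and two made-up points:

```python
import numpy as np

def minkowski_distance(a, b, p):
    """Minkowski distance of order p between two 1-D arrays (illustrative helper)."""
    return np.sum(np.abs(a - b) ** p) ** (1.0 / p)

# Two made-up points
a = np.array([0.0, 0.0])
b = np.array([3.0, 4.0])

l1 = minkowski_distance(a, b, p=1)  # |3| + |4|
l2 = minkowski_distance(a, b, p=2)  # sqrt(3^2 + 4^2)
print(l1, l2)
```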

In [109]:
# Create pipeline
pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('knn', KNeighborsClassifier(algorithm = 'brute'))
])

# Define the hyperparameter grid
grid = {
    'knn__n_neighbors': np.arange(1, 50, 1),
    'knn__weights': ['uniform', 'distance'],
    'knn__p': [1, 2],
}

# Create cross-validation object (an integer cv uses stratified folds for classifiers)
grid_cv = GridSearchCV(pipe, grid, cv=10, refit=True, return_train_score=True, verbose=1)

# Fit estimator
grid_cv.fit(X_tr, y_tr)

Fitting 10 folds for each of 196 candidates, totalling 1960 fits


[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done 1960 out of 1960 | elapsed:    8.5s finished


GridSearchCV(cv=10, error_score='raise-deprecating',
       estimator=Pipeline(memory=None,
     steps=[('scaler', StandardScaler(copy=True, with_mean=True, with_std=True)), ('knn', KNeighborsClassifier(algorithm='brute', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=None, n_neighbors=5, p=2,
           weights='uniform'))]),
       fit_params=None, iid='warn', n_jobs=None,
       param_grid={'knn__n_neighbors': array([ 1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16, 17,
       18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34,
       35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49]), 'knn__weights': ['uniform', 'distance'], 'knn__p': [1, 2]},
       pre_dispatch='2*n_jobs', refit=True, return_train_score=True,
       scoring=None, verbose=1)
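
When `cv` is an integer and the estimator is a classifier, `GridSearchCV` uses stratified folds, so `cv=10` already satisfies the exercise. To make the choice explicit, a `StratifiedKFold` object can be passed as `cv` instead. The sketch below uses hypothetical labels (`y_demo`, with a 60/30/10 class split) to show that every validation fold keeps the class proportions.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Hypothetical imbalanced labels: 60 / 30 / 10 samples per class
y_demo = np.array([0] * 60 + [1] * 30 + [2] * 10)
X_demo = np.zeros((len(y_demo), 1))  # placeholder features

skf = StratifiedKFold(n_splits=10)

# Count the classes in the validation part of each fold
fold_counts = [
    np.bincount(y_demo[val_idx], minlength=3)
    for _, val_idx in skf.split(X_demo, y_demo)
]
print(fold_counts[0])
```

The same `skf` object could be passed as `cv=skf` to `GridSearchCV`.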

In [110]:
# Get the results with "cv_results_"
grid_cv.cv_results_.keys()

dict_keys(['mean_fit_time', 'std_fit_time', 'mean_score_time', 'std_score_time', 'param_knn__n_neighbors', 'param_knn__p', 'param_knn__weights', 'params', 'split0_test_score', 'split1_test_score', 'split2_test_score', 'split3_test_score', 'split4_test_score', 'split5_test_score', 'split6_test_score', 'split7_test_score', 'split8_test_score', 'split9_test_score', 'mean_test_score', 'std_test_score', 'rank_test_score', 'split0_train_score', 'split1_train_score', 'split2_train_score', 'split3_train_score', 'split4_train_score', 'split5_train_score', 'split6_train_score', 'split7_train_score', 'split8_train_score', 'split9_train_score', 'mean_train_score', 'std_train_score'])

In [111]:
# Collect results in a DataFrame
cv_results = pd.DataFrame(grid_cv.cv_results_)

# Print a few interesting columns
cols = ['mean_test_score', 'std_test_score', 'mean_train_score', 'std_train_score', 
        'param_knn__n_neighbors', 'param_knn__weights', 'param_knn__p']
cv_results[cols].sort_values('mean_test_score', ascending=False).head()

| | mean_test_score | std_test_score | mean_train_score | std_train_score | param_knn__n_neighbors | param_knn__weights | param_knn__p |
|---|---|---|---|---|---|---|---|
| 134 | 0.688679 | 0.068187 | 0.671391 | 0.011821 | 34 | uniform | 2 |
| 165 | 0.683962 | 0.062327 | 1.0 | 0.0 | 42 | distance | 1 |
| 161 | 0.683962 | 0.051419 | 1.0 | 0.0 | 41 | distance | 1 |
| 169 | 0.683962 | 0.062327 | 1.0 | 0.0 | 43 | distance | 1 |
| 156 | 0.683962 | 0.051419 | 0.680854 | 0.008276 | 40 | uniform | 1 |
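
The rows with `distance` weighting show a mean train score of 1.0. This is expected rather than a sign of a better model: when predicting a training point, that point is its own nearest neighbor at distance zero, so its label dominates the vote. A minimal sketch on synthetic data (generated with `make_classification`, not the heart-disease data) reproduces the effect:

```python
from sklearn.datasets import make_classification
from sklearn.neighbors import KNeighborsClassifier

# Synthetic 3-class data, for illustration only
X_demo, y_demo = make_classification(
    n_samples=200, n_features=5, n_informative=3, n_redundant=0,
    n_classes=3, n_clusters_per_class=1, random_state=0,
)

knn_uniform = KNeighborsClassifier(n_neighbors=40, weights="uniform").fit(X_demo, y_demo)
knn_distance = KNeighborsClassifier(n_neighbors=40, weights="distance").fit(X_demo, y_demo)

# Score on the same data the models were fitted on
train_acc_uniform = knn_uniform.score(X_demo, y_demo)
train_acc_distance = knn_distance.score(X_demo, y_demo)
print(train_acc_uniform, train_acc_distance)
```

For this reason, train scores of distance-weighted k-NN say little about overfitting; the validation scores are the ones to compare.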


In [134]:
# Compute test accuracy
accuracy = grid_cv.score(X_te, y_te)
print('Accuracy: {:.3f}'.format(accuracy))

Accuracy: 0.692


Exercise - Logistic regression
---

> **Exercise**: Repeat the tuning with a logistic regression
> * Try both OvR and softmax
> * Tune C
>
> Which estimator would you use in practice? k-NN or logistic regression?

In [136]:
import warnings
from sklearn.exceptions import ConvergenceWarning
# Filter convergence warnings
warnings.simplefilter('ignore', ConvergenceWarning)


# Create estimator
logreg_cv = LogisticRegressionCV(Cs=np.logspace(-4, 4, num=20), cv=10, multi_class='multinomial', solver='saga', refit=True)

# Fit the estimator
logreg_cv.fit(scale(X_tr), y_tr);

In [137]:
accuracy_logreg = logreg_cv.score(scale(X_te), y_te)
print('Accuracy: {:.3f}'.format(accuracy_logreg))

Accuracy: 0.692
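
Note that calling `scale` separately on `X_tr` and `X_te` standardizes each set with its own statistics, and that scaling the whole train set before cross-validation lets the validation folds influence the scaler. One possible alternative, sketched below on synthetic data (not the heart-disease data), is to put a `StandardScaler` in a pipeline and tune `C` with `GridSearchCV`, once for a softmax model and once for OvR. The sketch assumes a recent scikit-learn in which `LogisticRegression` with the default solver fits a multinomial (softmax) model on multiclass targets; OvR is obtained here by wrapping it in `OneVsRestClassifier`. All names ending in `_demo` are hypothetical.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic 3-class data, for illustration only
X_demo, y_demo = make_classification(
    n_samples=300, n_features=8, n_informative=4, n_redundant=0,
    n_classes=3, n_clusters_per_class=1, random_state=0,
)
X_tr_demo, X_te_demo, y_tr_demo, y_te_demo = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=0, stratify=y_demo
)

Cs = np.logspace(-4, 4, num=10)
candidates = {
    # Softmax: assumed default behavior for multiclass targets
    "softmax": (LogisticRegression(max_iter=1000), {"logreg__C": Cs}),
    # OvR: one binary logistic regression per class
    "ovr": (
        OneVsRestClassifier(LogisticRegression(max_iter=1000)),
        {"logreg__estimator__C": Cs},
    ),
}

test_scores = {}
for name, (estimator, param_grid) in candidates.items():
    # The scaler is refitted on the training part of every fold
    pipe_demo = Pipeline([("scaler", StandardScaler()), ("logreg", estimator)])
    search = GridSearchCV(pipe_demo, param_grid, cv=10)
    search.fit(X_tr_demo, y_tr_demo)
    test_scores[name] = search.score(X_te_demo, y_te_demo)

print(test_scores)
```

With this setup the test set is standardized using the mean and standard deviation learned on the train set, in the same way as in the k-NN pipeline.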


Both methods reach the same test accuracy of 0.692 on this split, so the result alone does not favor either one. Further analysis would be needed to choose between them.