Heart disease diagnosis
---

## Exercise - Evaluate "most-frequent" baseline

> **Exercise**: Load and split the `heart-disease.csv` data into 70-30 train/test sets - make sure to keep the same proportion of classes by setting `stratify`. Evaluate the accuracy of the "most-frequent" baseline.

In [22]:
# Load libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.dummy import DummyClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

In [6]:
# Load data
data = pd.read_csv("data/heart-disease.csv")
data.head()

| | age | sex | cp | trestbps | chol | fbs | restecg | thalach | exang | oldpeak | slope | ca | thal | disease |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 63 | male | typical angina | 145 | 233 | yes | ventricular hypertrophy | 150 | no | 2.3 | downsloping | 0 | fixed defect | absence |
| 1 | 67 | male | asymptomatic | 160 | 286 | no | ventricular hypertrophy | 108 | yes | 1.5 | flat | 3 | normal | likely |
| 2 | 67 | male | asymptomatic | 120 | 229 | no | ventricular hypertrophy | 129 | yes | 2.6 | flat | 2 | reversable defect | likely |
| 3 | 37 | male | non-anginal pain | 130 | 250 | no | normal | 187 | no | 3.5 | downsloping | 0 | normal | absence |
| 4 | 41 | female | atypical angina | 130 | 204 | no | ventricular hypertrophy | 172 | no | 1.4 | upsloping | 0 | normal | absence |


In [13]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 303 entries, 0 to 302
Data columns (total 14 columns):
age         303 non-null int64
sex         303 non-null object
cp          303 non-null object
trestbps    303 non-null int64
chol        303 non-null int64
fbs         303 non-null object
restecg     303 non-null object
thalach     303 non-null int64
exang       303 non-null object
oldpeak     303 non-null float64
slope       303 non-null object
ca          303 non-null int64
thal        303 non-null object
disease     303 non-null object
dtypes: float64(1), int64(5), object(8)
memory usage: 33.2+ KB


In [8]:
# Create X/y arrays
X = data.drop('disease', axis=1).values
y = data.disease.values

# Split into train/test sets
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0, stratify=y)

# Compute distribution using Pandas
pd.Series(y_tr).value_counts(normalize=True)

absence        0.542453
likely         0.301887
very likely    0.155660
dtype: float64

In [11]:
# Create the dummy classifier
dummy = DummyClassifier(strategy='most_frequent')

# Fit it
dummy.fit(None, y_tr)

# Compute test accuracy
accuracy = dummy.score(None, y_te)
print('Accuracy: {:.2f}'.format(accuracy))

Accuracy: 0.54
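
The "most-frequent" baseline always predicts the majority class of the train labels, so its test accuracy is the share of that class among the test labels. Here is a minimal sketch of that computation by hand. The labels are made-up toy values (not taken from `heart-disease.csv`) and the variable names are hypothetical.

```python
from collections import Counter

# Toy labels for illustration only (not the heart disease data)
toy_y_tr = ['absence'] * 6 + ['likely'] * 3 + ['very likely'] * 1
toy_y_te = ['absence'] * 3 + ['likely'] * 1 + ['very likely'] * 1

# The baseline memorizes the most frequent train label...
majority_label = Counter(toy_y_tr).most_common(1)[0][0]

# ...and predicts it for every test point
toy_predictions = [majority_label] * len(toy_y_te)

# Accuracy = fraction of test labels equal to the majority label
n_correct = sum(p == t for p, t in zip(toy_predictions, toy_y_te))
baseline_accuracy = n_correct / len(toy_y_te)

print('Majority label: {}'.format(majority_label))
print('Baseline accuracy: {:.2f}'.format(baseline_accuracy))
```

Because the split is stratified, the train and test sets have nearly the same class proportions, which is why the 0.54 accuracy above is close to the 0.542 share of `absence` in the train set.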


Exercise - Evaluate k-NN baseline
---

> **Exercise**: Tune a k-NN classifier using grid search with **stratified 10-fold** cross-validation over the following hyperparameters:
> * Number of neighbors k
> * Distance metric - $L_{1}$ or $L_{2}$
> * Weighting strategy - uniform or by distance
>
> Refit the best estimator on the whole train set and report the test accuracy.

Data set documentation: http://archive.ics.uci.edu/ml/datasets/heart+Disease
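
To make the distance metric and weighting options from the exercise concrete, the sketch below computes the $L_{1}$ and $L_{2}$ distances from a query point to two neighbors, then the inverse-distance weights. The points are made-up values. The assumption is that `p=1` and `p=2` in `KNeighborsClassifier` select these two distances, and that `weights='distance'` weights each neighbor by the inverse of its distance.

```python
import numpy as np

# Hypothetical query point and two neighbors (illustrative values only)
query = np.array([0.0, 0.0])
neighbors = np.array([[3.0, 4.0],
                      [1.0, 0.0]])

diff = neighbors - query

# L1 (Manhattan) distance: sum of absolute coordinate differences
l1 = np.abs(diff).sum(axis=1)

# L2 (Euclidean) distance: square root of the sum of squared differences
l2 = np.sqrt((diff ** 2).sum(axis=1))

# Distance weighting: closer neighbors get a larger vote (weight = 1 / distance)
inv_dist_weights = 1.0 / l2

print('L1 distances:', l1)
print('L2 distances:', l2)
print('Inverse-distance weights:', inv_dist_weights)
```

With uniform weighting both neighbors would count equally; with distance weighting the second neighbor counts five times as much as the first.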

In [27]:
# Create pipeline
pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('knn', KNeighborsClassifier(algorithm='brute'))
])

# Define the hyperparameter grid
grid = {
    'knn__n_neighbors': np.array([5, 10]),
    'knn__weights': ['uniform', 'distance'],
    'knn__p': [1, 2],  # 1 = L1 (Manhattan), 2 = L2 (Euclidean)
}

# Create the grid search object (an integer cv uses stratified folds for classifiers)
grid_cv = GridSearchCV(pipe, grid, cv=10, return_train_score=True, verbose=1)

# Fit estimator
grid_cv.fit(X_tr, y_tr)

Fitting 10 folds for each of 8 candidates, totalling 80 fits


[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.


ValueError: could not convert string to float: 'male'
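
The `ValueError` above is raised because `X` was built from all columns, including string-valued ones such as `sex`, which cannot be converted to floats for scaling. One possible fix, sketched here under assumptions, is to keep `X` as a DataFrame and one-hot encode the categorical columns inside the pipeline with `ColumnTransformer` and `OneHotEncoder`. The data below is a made-up stand-in for `heart-disease.csv` (random values, only four of the feature columns), so the printed scores say nothing about the real data.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Made-up stand-in for heart-disease.csv: random values, mixed column types
rng = np.random.RandomState(0)
n = 120
toy_X = pd.DataFrame({
    'age': rng.randint(30, 80, size=n),
    'chol': rng.randint(150, 350, size=n),
    'sex': rng.choice(['male', 'female'], size=n),
    'cp': rng.choice(['asymptomatic', 'typical angina', 'non-anginal pain'], size=n),
})
toy_y = rng.choice(['absence', 'likely', 'very likely'], size=n, p=[0.5, 0.3, 0.2])

# Keep X as a DataFrame so that columns can be selected by name
X_tr, X_te, y_tr, y_te = train_test_split(
    toy_X, toy_y, test_size=0.3, random_state=0, stratify=toy_y)

# Split columns into numerical and categorical ones
num_cols = toy_X.select_dtypes(include='number').columns.tolist()
cat_cols = toy_X.select_dtypes(exclude='number').columns.tolist()

# Scale numerical columns, one-hot encode categorical ones (dense output)
preprocess = ColumnTransformer([
    ('num', StandardScaler(), num_cols),
    ('cat', OneHotEncoder(handle_unknown='ignore'), cat_cols),
], sparse_threshold=0)

pipe = Pipeline([
    ('prep', preprocess),
    ('knn', KNeighborsClassifier(algorithm='brute'))
])

grid = {
    'knn__n_neighbors': [5, 10],
    'knn__weights': ['uniform', 'distance'],
    'knn__p': [1, 2],
}

# GridSearchCV refits the best candidate on the whole train set by default
grid_cv = GridSearchCV(pipe, grid, cv=10)
grid_cv.fit(X_tr, y_tr)

test_accuracy = grid_cv.score(X_te, y_te)
print('Best parameters:', grid_cv.best_params_)
print('Test accuracy (toy data): {:.2f}'.format(test_accuracy))
```

Encoding inside the pipeline means the encoder and scaler are fitted only on the training part of each cross-validation fold. Calling `pd.get_dummies` on the full DataFrame before splitting would be a simpler alternative.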

Exercise - Logistic regression
---

> **Exercise**: Repeat the previous exercise with a logistic regression
> * Try both one-vs-rest (OvR) and softmax
> * Tune the inverse regularization strength `C`
>
> Which estimator would you use in practice? k-NN or logistic regression?

In [None]:
???
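
One way to approach this exercise is sketched below; it is not a reference solution. It rests on two assumptions. First, a plain `LogisticRegression` with its default solver fits a softmax (multinomial) model on multiclass targets, which holds in recent scikit-learn versions. Second, OvR is obtained by wrapping the estimator in `OneVsRestClassifier` rather than through a constructor argument. The data is again a made-up stand-in for `heart-disease.csv`, so the scores are not meaningful.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Made-up stand-in for heart-disease.csv: random values, mixed column types
rng = np.random.RandomState(0)
n = 120
toy_X = pd.DataFrame({
    'age': rng.randint(30, 80, size=n),
    'chol': rng.randint(150, 350, size=n),
    'sex': rng.choice(['male', 'female'], size=n),
    'cp': rng.choice(['asymptomatic', 'typical angina', 'non-anginal pain'], size=n),
})
toy_y = rng.choice(['absence', 'likely', 'very likely'], size=n, p=[0.5, 0.3, 0.2])

X_tr, X_te, y_tr, y_te = train_test_split(
    toy_X, toy_y, test_size=0.3, random_state=0, stratify=toy_y)

# Scale numerical columns, one-hot encode categorical ones (dense output)
preprocess = ColumnTransformer([
    ('num', StandardScaler(), ['age', 'chol']),
    ('cat', OneHotEncoder(handle_unknown='ignore'), ['sex', 'cp']),
], sparse_threshold=0)

pipe = Pipeline([
    ('prep', preprocess),
    ('clf', LogisticRegression(max_iter=1000))  # placeholder, replaced by the grid
])

# Candidate values for C (inverse regularization strength)
Cs = np.logspace(-2, 2, num=5)

grid = [
    # Softmax: assumes the default solver fits a multinomial model
    {'clf': [LogisticRegression(max_iter=1000)], 'clf__C': Cs},
    # OvR: one binary logistic regression per class
    {'clf': [OneVsRestClassifier(LogisticRegression(max_iter=1000))],
     'clf__estimator__C': Cs},
]

grid_cv = GridSearchCV(pipe, grid, cv=10)
grid_cv.fit(X_tr, y_tr)

test_accuracy = grid_cv.score(X_te, y_te)
print('Best estimator type:', type(grid_cv.best_params_['clf']).__name__)
print('Test accuracy (toy data): {:.2f}'.format(test_accuracy))
```

Running the same search on the real data and comparing its test accuracy with the k-NN result and the "most-frequent" baseline gives the evidence needed to answer the final question of the exercise.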