# SOLUTION NOTEBOOK

---

# Heart disease diagnosis

## Exercise - Load data and apply one-hot encoding

> **Exercise**: Load the data and encode categorical features using **one-hot encoding**. Create X/y arrays and split them into 70-30 train/test sets using `train_test_split(random_state=0)`. Make sure that the train/test sets have the same proportion of data points in each class by setting `stratify=y`.

In [1]:
import pandas as pd
import os

# Load data
data_df = pd.read_csv(os.path.join('data', 'heart-disease.csv'))

# First five rows
data_df.head()

| | age | sex | cp | trestbps | chol | fbs | restecg | thalach | exang | oldpeak | slope | ca | thal | disease |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 63 | male | typical angina | 145 | 233 | yes | ventricular hypertrophy | 150 | no | 2.3 | downsloping | 0 | fixed defect | absent |
| 1 | 58 | female | asymptomatic | 100 | 248 | no | ventricular hypertrophy | 122 | no | 1.0 | flat | 0 | normal | absent |
| 2 | 48 | male | non-anginal pain | 124 | 255 | yes | normal | 175 | no | 0.0 | upsloping | 2 | normal | absent |
| 3 | 57 | male | asymptomatic | 132 | 207 | no | normal | 168 | yes | 0.0 | upsloping | 0 | reversable defect | absent |
| 4 | 52 | male | non-anginal pain | 138 | 223 | no | normal | 169 | no | 0.0 | upsloping | 0 | normal | absent |


In [2]:
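# One-hot encoding of the categorical columns
# (dtype=float keeps X numeric regardless of the pandas version)
encoded_df = pd.get_dummies(
    data_df,
    columns=['sex', 'cp', 'fbs', 'restecg', 'exang', 'slope', 'thal'],
    dtype=float)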

# Create X/y arrays
X = encoded_df.drop('disease', axis=1).values
y = encoded_df.disease.values

print('X:', X.shape, X.dtype)
print('y:', y.shape, y.dtype)

X: (303, 25) float64
y: (303,) object
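
To make the effect of one-hot encoding concrete, here is a minimal sketch on a toy DataFrame. The values and the `toy_df`/`toy_encoded` names are made up for illustration and are not part of the heart disease data. Each category of the encoded column becomes its own indicator column, while numeric columns are left untouched.

```python
import pandas as pd

# Toy frame with one numeric and one categorical column (made-up values)
toy_df = pd.DataFrame({
    'age': [63, 58, 48],
    'sex': ['male', 'female', 'male'],
})

# Each category of 'sex' becomes its own indicator column
toy_encoded = pd.get_dummies(toy_df, columns=['sex'])

print(toy_encoded.columns.tolist())
```

The same mechanism explains the 25 columns above: 6 numeric features plus one indicator column per category of the 7 encoded features.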


In [3]:
from sklearn.model_selection import train_test_split

# Split into train/test sets
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)

print('Train:', X_tr.shape, y_tr.shape)
print('Test:', X_te.shape, y_te.shape)

Train: (212, 25) (212,)
Test: (91, 25) (91,)
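
One way to check what `stratify=y` does is to compare class proportions in the two subsets. The sketch below uses made-up labels (the `X_toy`/`y_toy` names are hypothetical), not the notebook data, and assumes the same `test_size` and `random_state` as above.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up imbalanced labels: 70 of class 'a', 30 of class 'b'
y_toy = np.array(['a'] * 70 + ['b'] * 30)
X_toy = np.arange(100).reshape(-1, 1)

X_toy_tr, X_toy_te, y_toy_tr, y_toy_te = train_test_split(
    X_toy, y_toy, test_size=0.3, stratify=y_toy, random_state=0)

# Share of class 'a' in each subset; both should stay close to 0.7
prop_tr = np.mean(y_toy_tr == 'a')
prop_te = np.mean(y_toy_te == 'a')
print('Train:', prop_tr, 'Test:', prop_te)
```

Applying the same check to `y_tr` and `y_te` with `np.unique(..., return_counts=True)` would confirm the split of the real data.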


## Exercise - Evaluate baseline

> **Exercise**: Evaluate the accuracy of the "most-frequent" baseline.

In [4]:
from sklearn.dummy import DummyClassifier

# Evaluate baseline
dummy = DummyClassifier(strategy='most_frequent')
dummy.fit(X_tr, y_tr)
accuracy = dummy.score(X_te, y_te)
print('Baseline accuracy ("most-frequent"): {:.3f}'.format(accuracy))

Baseline accuracy ("most-frequent"): 0.538
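
The "most-frequent" baseline always predicts the majority class of the training labels, so its test accuracy equals the share of that class among the test labels. The following sketch checks this on made-up labels; all `*_toy` names are hypothetical and the features are placeholders, since the dummy classifier ignores them.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Made-up labels; the features are ignored by the dummy classifier
y_train_toy = np.array(['absent'] * 6 + ['likely'] * 4)
y_test_toy = np.array(['absent'] * 3 + ['likely'] * 2)
X_train_toy = np.zeros((len(y_train_toy), 1))
X_test_toy = np.zeros((len(y_test_toy), 1))

dummy_toy = DummyClassifier(strategy='most_frequent')
dummy_toy.fit(X_train_toy, y_train_toy)
dummy_accuracy = dummy_toy.score(X_test_toy, y_test_toy)

# Manual computation: share of the majority training class in the test labels
classes, counts = np.unique(y_train_toy, return_counts=True)
majority = classes[np.argmax(counts)]
manual_accuracy = np.mean(y_test_toy == majority)

print(dummy_accuracy, manual_accuracy)
```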


## Exercise - Grid search with cross-validation for *k*-NN

> **Exercise**: Fit and evaluate the accuracy of a *k*-NN classifier. Tune the following hyperparameters using grid search with **stratified 10-fold** cross-validation.
> * Number of neighbors *k*
> * Distance metric - $L_{1}$ or $L_{2}$
> * Weighting strategy - uniform or by distance

In [5]:
import numpy as np

from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import StratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier

# Create pipeline
pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('knn', KNeighborsClassifier())
])

# Grid search with cross-validation
grid = {
    'knn__n_neighbors': [1, 5, 10, 15, 20],
    'knn__weights': ['uniform', 'distance'],
    'knn__p': [1, 2]
}
# cv=10 uses stratified folds here because the estimator is a classifier
grid_cv = GridSearchCV(pipe, grid, cv=10)

# Fit it to train data
grid_cv.fit(X_tr, y_tr)

# Collect results in a DataFrame
# (pd.DataFrame.from_items was removed in pandas 1.0; a dict works instead)
df = pd.DataFrame({
    'k': grid_cv.cv_results_['param_knn__n_neighbors'],
    'p': grid_cv.cv_results_['param_knn__p'],
    'weights': grid_cv.cv_results_['param_knn__weights'],
    'mean_te': grid_cv.cv_results_['mean_test_score']
})

# Ten best combinations according to the mean test score
df.sort_values(by='mean_te', ascending=False).head(10)

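| | k | p | weights | mean_te |
|---|---|---|---|---|
| 16 | 20 | 1 | uniform | 0.683962 |
| 14 | 15 | 2 | uniform | 0.669811 |
| 18 | 20 | 2 | uniform | 0.665094 |
| 17 | 20 | 1 | distance | 0.665094 |
| 15 | 15 | 2 | distance | 0.665094 |
| 11 | 10 | 2 | distance | 0.665094 |
| 4 | 5 | 1 | uniform | 0.660377 |
| 5 | 5 | 1 | distance | 0.650943 |
| 12 | 15 | 1 | uniform | 0.650943 |
| 6 | 5 | 2 | uniform | 0.646226 |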


In [6]:
# Evaluate accuracy on test set
accuracy = grid_cv.score(X_te, y_te)
print('k-NN accuracy: {:.3f}'.format(accuracy))

k-NN accuracy: 0.670
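
The stratified 10-fold scheme can also be made explicit by passing a `StratifiedKFold` object to `GridSearchCV`, which additionally allows shuffling with a fixed seed. The sketch below is an illustration on synthetic data from `make_classification`, not the notebook's solution; the `*_syn` names, the smaller grid, and the shuffling are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic two-class data standing in for the real features
X_syn, y_syn = make_classification(
    n_samples=200, n_features=5, n_informative=3, n_redundant=0,
    random_state=0)

pipe_syn = Pipeline([
    ('scaler', StandardScaler()),
    ('knn', KNeighborsClassifier())
])
grid_syn = {
    'knn__n_neighbors': [1, 5, 10],
    'knn__weights': ['uniform', 'distance'],
    'knn__p': [1, 2]
}

# Explicit stratified 10-fold splitter, shuffled with a fixed seed
cv_syn = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
search_syn = GridSearchCV(pipe_syn, grid_syn, cv=cv_syn)
search_syn.fit(X_syn, y_syn)

# Best combination and its mean cross-validated accuracy
print(search_syn.best_params_)
print('{:.3f}'.format(search_syn.best_score_))
```

The `best_params_` and `best_score_` attributes give the top row of the results table directly, without sorting `cv_results_` by hand.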


## Exercise - Logistic regression

> **Exercise**: Fit a logistic regression classifier (try both OvR and softmax versions). Tune the `C` parameter using **stratified 10-fold** cross-validation. Print the optimal `C` value for each class.

In [7]:
from sklearn.linear_model import LogisticRegressionCV

# List of C values
Cs = np.logspace(-4, 4, num=10)

# Logistic regression (OvR) with C selected using cross-validated grid search
logreg_cv = LogisticRegressionCV(Cs=Cs, cv=10, multi_class='ovr')
logreg_cv.fit(X_tr, y_tr)

# Print C values
for category, C in zip(logreg_cv.classes_, logreg_cv.C_):
    print('Category "{}": {:.1e}'.format(category, C))
    
# Evaluate accuracy on test set
accuracy = logreg_cv.score(X_te, y_te)
print('Logistic regression accuracy: {:.3f}'.format(accuracy))

Category "absent": 3.6e-01
Category "likely": 1.0e-04
Category "very likely": 1.0e-04
Logistic regression accuracy: 0.670


In [8]:
# Softmax regression with C selected using cross-validated grid search
logreg_cv = LogisticRegressionCV(Cs=Cs, cv=10, multi_class='multinomial')
logreg_cv.fit(X_tr, y_tr)

# Evaluate accuracy on test set
accuracy = logreg_cv.score(X_te, y_te)
print('Logistic regression accuracy: {:.3f}'.format(accuracy))

Logistic regression accuracy: 0.681
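
For reference, the softmax version differs from OvR in how per-class scores become probabilities: all classes are normalized jointly so that the probabilities sum to one. Below is a minimal NumPy sketch of the softmax function itself, with made-up scores; the `softmax` helper is written for illustration and is not scikit-learn's internal implementation.

```python
import numpy as np

def softmax(scores):
    """Convert rows of class scores into rows of probabilities."""
    # Subtract the row maximum for numerical stability
    shifted = scores - np.max(scores, axis=-1, keepdims=True)
    exp_scores = np.exp(shifted)
    return exp_scores / np.sum(exp_scores, axis=-1, keepdims=True)

# Made-up scores for two data points and three classes
scores = np.array([
    [2.0, 1.0, 0.1],
    [0.0, 0.0, 0.0],
])
probs = softmax(scores)

# Each row sums to one; equal scores give equal probabilities
print(probs.sum(axis=1))
print(probs.argmax(axis=1))
```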
