### 0. Prepare the dataset for the subsequent modelling.
1. Download the heart disease dataset from https://www.statlearning.com/s/Heart.csv
2. Load the dataset and drop all variables except the predictors Age, Sex, ChestPain, RestBP, Chol, and the target variable AHD. Drop all rows containing a NaN value.
3. One-hot encode the variable ChestPain. This means that where you previously had a single column with one of four values ['typical', 'asymptomatic', 'nonanginal', 'nontypical'], you will now have four binary columns (their names don't matter), akin to 'ChestPain_typical', 'ChestPain_asymptomatic', 'ChestPain_nonanginal', 'ChestPain_nontypical'. A row that previously had ChestPain='typical' will now have ChestPain_typical=1 and the other three columns set to 0, a row with ChestPain='asymptomatic' will have ChestPain_asymptomatic=1 and the other three set to 0, etc.
4. Binary encode the target variable AHD such that 'No'=0 and 'Yes'=1.

In [1]:
import pandas as pd


predictors = ['Age', 'Sex', 'ChestPain', 'RestBP', 'Chol']
target = 'AHD'

df = pd.read_csv('https://www.statlearning.com/s/Heart.csv')
df = df[predictors + [target]]
df = df.dropna()
df = pd.get_dummies(df, columns=['ChestPain'])
df['AHD'] = df['AHD'].map({'No': 0, 'Yes': 1})
df.head()

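|   | Age | Sex | RestBP | Chol | AHD | ChestPain_asymptomatic | ChestPain_nonanginal | ChestPain_nontypical | ChestPain_typical |
|---|-----|-----|--------|------|-----|------------------------|----------------------|----------------------|-------------------|
| 0 | 63 | 1 | 145 | 233 | 0 | False | False | False | True |
| 1 | 67 | 1 | 160 | 286 | 1 | True | False | False | False |
| 2 | 67 | 1 | 120 | 229 | 1 | True | False | False | False |
| 3 | 37 | 1 | 130 | 250 | 0 | False | True | False | False |
| 4 | 41 | 0 | 130 | 204 | 0 | False | False | True | False |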
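
The output above shows the dummy columns as booleans, whereas the task describes them as 0/1. A minimal sketch of one way to get integer columns, using the `dtype` argument of `pd.get_dummies`. The `toy` dataframe is hypothetical and does not contain rows from Heart.csv:

```python
import pandas as pd

# Hypothetical toy data, not rows from Heart.csv
toy = pd.DataFrame({
    'Age': [63, 67, 37, 41],
    'ChestPain': ['typical', 'asymptomatic', 'nonanginal', 'nontypical'],
})

# dtype=int yields 0/1 columns instead of True/False
encoded = pd.get_dummies(toy, columns=['ChestPain'], dtype=int)
print(encoded)
```

Logistic regression in scikit-learn accepts boolean columns as well, so this is mostly a matter of matching the task description.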


### 1. Fit a model using a standard train/validation split through multiple steps.
Through the steps you will practice chaining functions, and you will also create the infrastructure necessary for the remaining tasks.

1. Write a function "stratified_split" that takes three arguments: A dataframe, a number of folds, and a list of variables to stratify by. The function should return a list of dataframes, one for each fold, where the dataframes are stratified by the variables in the list. Test that the function works by splitting the dataset into two folds based on 'AHD', 'Age' and 'RestBP' and print the size of each fold, the counts of 0s and 1s in AHD, and the mean of each of 'Age' and 'RestBP' (all these should be printed individually per fold). Ensure that the function does not modify the original dataframe.

In [2]:
import numpy as np

from collections import Counter
from typing import List


def stratified_split(df: pd.DataFrame, num_folds: int, 
                     variables: List[str]) -> List[pd.DataFrame]:
    """Splits a given dataframe into a given number of folds while 
    stratifying on given variables.

    Parameters:
    ----------
    df : pd.DataFrame
        The dataframe that should be split
    num_folds : int
        The number of folds the dataset should be split into
    variables : List[str]
        A list of variables to stratify by

    Returns:
    -------
    List[pd.DataFrame]
        A list of non-overlapping folds from the original dataset

    Raises:
    ------
    KeyError
        If any of the stratification variables are not columns in 
        the original dataset
    """
    df = df.copy()
    df = df.sort_values(variables)
    df['fold'] = np.arange(len(df)) % num_folds

    return [df[df['fold'] == fold].drop(columns=['fold'])
            for fold in range(num_folds)]

folds = stratified_split(df, 2, ['AHD', 'Age', 'RestBP'])

for i, fold in enumerate(folds):
    print(f'Fold {i} (n={len(fold)})')
    print(f'AHD: {Counter(fold["AHD"])}')
    print(f'Age: {np.mean(fold["Age"]):.2f}')
    print(f'RestBP: {np.mean(fold["RestBP"]):.2f}')

Fold 0 (n=152)
AHD: Counter({0: 82, 1: 70})
Age: 54.36
RestBP: 132.20
Fold 1 (n=151)
AHD: Counter({0: 82, 1: 69})
Age: 54.52
RestBP: 131.18
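
The split works by sorting on the stratification variables and then dealing the rows out to the folds in turn, like cards. A small sketch of that idea on hypothetical toy data (the `toy` frame and its values are made up for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: 6 negatives and 4 positives with varying ages
toy = pd.DataFrame({
    'AHD': [0, 1, 0, 0, 1, 0, 1, 0, 1, 0],
    'Age': [61, 45, 39, 70, 58, 52, 66, 48, 41, 55],
})

# Sort by the stratification variables, then assign folds round-robin
ordered = toy.sort_values(['AHD', 'Age'])
fold_ids = np.arange(len(ordered)) % 2

fold_0 = ordered[fold_ids == 0]
fold_1 = ordered[fold_ids == 1]

for name, fold in [('fold_0', fold_0), ('fold_1', fold_1)]:
    print(name, fold['AHD'].value_counts().to_dict(), fold['Age'].mean())
```

Because later sort keys only break ties in earlier ones, the balance is tightest for the first variable in the list and looser for the following ones.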


2. Write a function 'fit_and_predict' that takes 4 arguments: A training set, a validation set, a list of predictors, and a target variable. The function should fit a logistic regression model to the training set using the predictors and target variable, and return the predictions of the model on the validation set.

In [3]:
from sklearn.linear_model import LogisticRegression


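def fit_and_predict(train: pd.DataFrame, validation: pd.DataFrame, 
                    predictors: List[str], target: str) -> np.ndarray:
    """Fits a logistic regression on the training set and returns the
    predicted probability of the positive class (target=1) for each row
    in the validation set."""
    model = LogisticRegression()
    model.fit(train[predictors], train[target])

    # Column 1 of predict_proba holds the probability of class 1
    return model.predict_proba(validation[predictors])[:, 1]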

predictors = [col for col in df.columns if col != target]

fit_and_predict(folds[0], folds[1], predictors, target)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


array([0.45095566, 0.16076033, 0.06375348, 0.22494394, 0.06908415,
       0.51064285, 0.18406991, 0.06692634, 0.0496547 , 0.23196937,
       0.33114035, 0.17073074, 0.25492414, 0.50759808, 0.63681864,
       0.20680979, 0.09940288, 0.19536815, 0.25069693, 0.21283495,
       0.70464829, 0.67968945, 0.06905448, 0.2352198 , 0.38794901,
       0.72557673, 0.2640675 , 0.27261373, 0.23341443, 0.40478037,
       0.4220961 , 0.08283804, 0.31131807, 0.33961668, 0.62189409,
       0.10438391, 0.73702733, 0.64630262, 0.28569464, 0.12334552,
       0.58780986, 0.12275632, 0.4516331 , 0.46998356, 0.25366013,
       0.12732769, 0.32005531, 0.10668871, 0.34379322, 0.28610207,
       0.1009056 , 0.68288448, 0.31577299, 0.80657638, 0.47336213,
       0.8118554 , 0.43464129, 0.37635592, 0.3355453 , 0.39609768,
       0.81019788, 0.34658857, 0.67842862, 0.17274286, 0.41153555,
       0.38367723, 0.47747438, 0.15493128, 0.74650112, 0.5520207 ,
       0.74434391, 0.86390874, 0.17656547, 0.8319983 , 0.46233

3. Write a function 'fit_and_predict_standardized' that takes 5 arguments: A training set, a validation set, a list of predictors, a target variable, and a list of variables to standardize. Using a loop (or a scaler), the function should z-score standardize the given variables in both the training set and the validation set based on the mean and standard deviation in the training set. Then, the function should call the 'fit_and_predict' function and return its result. Ensure that the function does not modify the original dataframes. Test the function using the train and validation set from above (e.g. the two folds from the split), while standardizing the 'Age', 'RestBP' and 'Chol' variables (as mentioned above, the target should be AHD, and you should also include the remaining predictors: 'Sex' and the ChestPain-variables)

In [4]:
from sklearn.preprocessing import StandardScaler


def fit_and_predict_standardized(train: pd.DataFrame, validation: pd.DataFrame,
                                 predictors: List[str], target: str,
                                 variables_to_standardize: List[str]) -> np.ndarray:
    train = train.copy()
    validation = validation.copy()
    
    scaler = StandardScaler()
    train[variables_to_standardize] = scaler.fit_transform(
        train[variables_to_standardize]
    )
    validation[variables_to_standardize] = scaler.transform(
        validation[variables_to_standardize]
    )

    return fit_and_predict(train, validation, predictors, target)

fit_and_predict_standardized(folds[0], folds[1], predictors, target, 
                             ['Age', 'RestBP', 'Chol'])

array([0.24580942, 0.10301595, 0.04318178, 0.19265105, 0.05521626,
       0.35005167, 0.12335629, 0.04871613, 0.03574997, 0.20508148,
       0.26053605, 0.13650888, 0.21607723, 0.38691392, 0.63428011,
       0.21666342, 0.05742623, 0.15114677, 0.21690386, 0.16463146,
       0.64549472, 0.66232289, 0.04653685, 0.147312  , 0.37418151,
       0.68481353, 0.27261254, 0.25993081, 0.1947186 , 0.3937794 ,
       0.36757399, 0.05514858, 0.24402256, 0.27384063, 0.45940217,
       0.0929065 , 0.76984209, 0.46038522, 0.23011678, 0.10416688,
       0.52227398, 0.10213225, 0.44635117, 0.46600833, 0.21257989,
       0.09340648, 0.3266818 , 0.10975396, 0.37868051, 0.25900628,
       0.07960065, 0.51744639, 0.27241878, 0.78507658, 0.49181328,
       0.82553098, 0.42673238, 0.33914456, 0.28546738, 0.41136269,
       0.83246662, 0.32374632, 0.66724015, 0.13552456, 0.46893256,
       0.33775133, 0.57526899, 0.12385073, 0.60020893, 0.58881848,
       0.71326235, 0.86031367, 0.2106519 , 0.86117278, 0.35543
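
The task also allows a loop instead of a scaler. A minimal sketch of that variant on hypothetical toy data (the `train` and `validation` frames here are made up and stand in for the two folds). It assumes the population standard deviation (`ddof=0`), which is what `StandardScaler` uses, whereas pandas defaults to `ddof=1`:

```python
import pandas as pd

# Hypothetical toy data standing in for the two folds
train = pd.DataFrame({'Age': [40.0, 50.0, 60.0], 'Chol': [200.0, 220.0, 270.0]})
validation = pd.DataFrame({'Age': [45.0, 65.0], 'Chol': [210.0, 300.0]})

# Work on copies so the original dataframes stay untouched
train_std = train.copy()
validation_std = validation.copy()

for column in ['Age', 'Chol']:
    # Statistics come from the training set only
    mean = train[column].mean()
    std = train[column].std(ddof=0)  # ddof=0 matches StandardScaler
    train_std[column] = (train[column] - mean) / std
    validation_std[column] = (validation[column] - mean) / std

print(validation_std)
```

Using the training statistics for both sets keeps information from the validation set out of the preprocessing step.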

4. Write a function 'fit_and_compute_auc' that takes 5 arguments: A training set, a validation set, a list of predictors, a target variable, and a list of variables to standardize. The function should call the 'fit_and_predict_standardized' function to retrieve out-of-sample predictions for the validation set. Based on these and the ground truth labels in the validation set, it should compute and return the AUC. Test the function using the train and validation set from above, while standardizing the 'Age', 'RestBP' and 'Chol' variables (and including the remaining predictors). Print the AUC.

In [5]:
from sklearn.metrics import roc_auc_score


def fit_and_compute_auc(train: pd.DataFrame, validation: pd.DataFrame,
                        predictors: List[str], target: str,
                        variables_to_standardize: List[str]) -> float:
    predictions = fit_and_predict_standardized(train, validation, predictors,
                                               target, variables_to_standardize)

    return roc_auc_score(validation[target], predictions)

split_auc = fit_and_compute_auc(folds[0], folds[1], predictors, target, 
                                ['Age', 'RestBP', 'Chol'])
print(f'Train/validation split AUC: {split_auc:.2f}')

Train/validation split AUC: 0.84


### 2. Perform a cross-validation.
Use the 'stratified_split' function to split the dataset into 10 folds, stratified on variables you find reasonable. For each fold, use the 'fit_and_compute_auc' function to compute the AUC of the model on the held-out validation set. Print the mean and standard deviation of the AUCs across the 10 folds.

In [6]:
folds = stratified_split(df, 10, ['AHD', 'Age'])
cv_aucs = []

for i in range(len(folds)):
    train = pd.concat([folds[j] for j in range(len(folds))
                      if j != i])
    validation = folds[i]
    auc = fit_and_compute_auc(train, validation, predictors, target,
                              ['Age', 'RestBP', 'Chol'])
    cv_aucs.append(auc)

print(f'Cross-validation AUC: {np.mean(cv_aucs):.2f}+/-{np.std(cv_aucs):.2f}')

Cross-validation AUC: 0.85+/-0.09
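
Note that `np.std` divides by n by default (`ddof=0`). With only 10 folds, the sample standard deviation (`ddof=1`) is noticeably larger. A small sketch of the difference, using hypothetical per-fold AUCs rather than the values computed above:

```python
import numpy as np

# Hypothetical per-fold AUCs, not the values computed above
fold_aucs = np.array([0.80, 0.90, 0.85, 0.75, 0.95])

population_std = np.std(fold_aucs)      # ddof=0, NumPy's default
sample_std = np.std(fold_aucs, ddof=1)  # divides by n - 1 instead of n

print(f'{np.mean(fold_aucs):.2f} +/- {population_std:.3f} (ddof=0)')
print(f'{np.mean(fold_aucs):.2f} +/- {sample_std:.3f} (ddof=1)')
```

Which of the two to report is a judgment call; the important part is to state which one was used.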


### 3. Use the bootstrap to obtain a distribution of out-of-bag AUCs.
For 100 iterations, create a bootstrap sample by sampling with replacement from the full dataset until you have a training set equal in size to 80% of the original data. Use the observations not included in the bootstrap sample as the validation set for that iteration. Fit models and calculate AUCs for each iteration. Print the mean and standard deviation of the AUCs.

In [7]:
from tqdm import tqdm


bootstrap_aucs = []

for _ in tqdm(range(100)):
    train = df.sample(frac=0.8, replace=True)
    validation = df.loc[~df.index.isin(train.index)]

    auc = fit_and_compute_auc(train, validation, predictors, target,
                              ['Age', 'RestBP', 'Chol'])
    bootstrap_aucs.append(auc)

print(f'Bootstrap AUC: {np.mean(bootstrap_aucs):.2f}+/-'
      f'{np.std(bootstrap_aucs):.2f}')


Bootstrap AUC: 0.84+/-0.03
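
How large is the out-of-bag validation set in each iteration? A given row is missed by all m draws with probability (1 - 1/n)^m, which for m = 0.8n is close to exp(-0.8), roughly 45%. The sketch below checks this by simulation. It assumes n = 303 rows (the sum of the two fold sizes printed earlier) and uses a fixed seed, both chosen here for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the sketch is reproducible
n = 303                         # assumed number of rows (152 + 151)
sample_size = int(0.8 * n)

oob_fractions = []
for _ in range(1000):
    # Draw row positions with replacement, mimicking a bootstrap sample
    drawn = rng.integers(0, n, size=sample_size)
    oob_fractions.append(1 - len(np.unique(drawn)) / n)

expected = (1 - 1 / n) ** sample_size  # chance that a given row is never drawn
print(f'Simulated: {np.mean(oob_fractions):.3f}, expected: {expected:.3f}')
```

Each bootstrap validation set is therefore much larger than a single fold in the 10-fold cross-validation, which may partly explain the smaller spread of the bootstrap AUCs.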





### 4. Theory
1. List some benefits of wrapping code in functions rather than copying and pasting it multiple times.

- _Code can be reused in multiple different scenarios, meaning you have to write less code_
- _If you discover a bug in the code, that bug only has to be fixed in one place_
- _You hide complex implementation details behind a more abstract interface, making it easier to separate concerns_

2. Explain three classification metrics and their benefits and drawbacks.

- _Log loss: Has nice mathematical properties (it is differentiable) that allow it to be used as a loss function when fitting models. Hard to interpret, since its value has no intuitive scale._
- _Accuracy: Very intuitive to understand (the proportion of correct predictions). Does not take into account the cost of different misclassifications, and __can be misleading for imbalanced classes__ (always predicting the majority class can already score high)._
- _Area under the receiver operating characteristic curve (AUC/AUROC): Does not rely on setting a classification threshold and handles class imbalance better than accuracy. Hard to interpret literally (e.g. an AUC of 0.95 is generally good, but what does it mean more concretely?). It is not differentiable, so it cannot be used directly as a loss function when fitting models._
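
One concrete reading of the AUC is the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case (ties counting half). A minimal sketch of that definition follows; `pairwise_auc` is a hypothetical helper written for illustration, and the labels and scores are made up:

```python
import numpy as np


def pairwise_auc(labels, scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count half."""
    labels = np.asarray(labels)
    scores = np.asarray(scores, dtype=float)
    positives = scores[labels == 1]
    negatives = scores[labels == 0]
    # Compare every positive score against every negative score
    diffs = positives[:, None] - negatives[None, :]
    return float(np.mean(diffs > 0) + 0.5 * np.mean(diffs == 0))


# Hypothetical labels and predicted probabilities
labels = [0, 0, 1, 1, 0, 1]
scores = [0.1, 0.4, 0.35, 0.8, 0.2, 0.7]
print(pairwise_auc(labels, scores))  # 8 of the 9 pairs are ordered correctly
```

This pairwise comparison is quadratic in the number of samples, so it is meant as an explanation rather than a replacement for `roc_auc_score`.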

3. Write a couple of sentences comparing the three methods (train/validation, cross-validation, bootstrap) above as approaches to quantify model performance. Which one yielded the best results? Which one would you expect to yield the best results? Can you mention some theoretical benefits and drawbacks with each? Even if you didn't do the optional bootstrap exercise you should reflect on this as an approach.

In [8]:
print(f'Train/validation-split AUC: {split_auc:.2f}')
print(f'Cross-validation AUC: {np.mean(cv_aucs):.2f}+/-{np.std(cv_aucs):.2f}')
print(f'Bootstrap AUC: {np.mean(bootstrap_aucs):.2f}+/-{np.std(bootstrap_aucs):.2f}')

Train/validation-split AUC: 0.84
Cross-validation AUC: 0.85+/-0.09
Bootstrap AUC: 0.84+/-0.03


_In our case, the three methods yield very similar AUCs (0.84, 0.85 and 0.84), with differences well within the spread observed across the folds and the bootstrap iterations. This is what we would expect, since all three estimate the same quantity. However, the estimate from a single train/validation split varies a lot depending on the exact split, so such agreement is not guaranteed. As such, one of the latter two is preferable._

4. Why do we stratify the dataset before splitting?

_To ensure the different folds of the dataset are similar with respect to certain key variables. If we don't do this, we could arrive at models that are very good on whatever portion they are trained on but very poor on everything else, simply due to the training population not being representative of the rest of the data._

5. What other use cases can you think of for the bootstrap method?

_In addition to assessing model performance, the bootstrap can also be used to get an idea of the spread of estimated parameter values (e.g. how important a variable is for a prediction)._
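
As an illustration of this use, the following sketch estimates a 95% percentile interval for a mean by resampling. The values, the seed, and the choice of 2000 resamples are all assumptions made for the example:

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the sketch is reproducible

# Hypothetical measurements, made up for illustration
values = np.array([210, 250, 190, 300, 230, 270, 220, 240], dtype=float)

boot_means = []
for _ in range(2000):
    # Resample with replacement, keeping the original sample size
    resample = rng.choice(values, size=len(values), replace=True)
    boot_means.append(resample.mean())

lower, upper = np.percentile(boot_means, [2.5, 97.5])
print(f'Mean: {values.mean():.1f}, '
      f'95% percentile interval: [{lower:.1f}, {upper:.1f}]')
```

The same pattern applies to other statistics, such as a regression coefficient: refit on each resample and look at the spread of the estimates.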