**0. Prepare the dataset for the subsequent modelling.**

    1. Download the heart disease dataset from https://www.statlearning.com/s/Heart.csv
    2. Load the dataset and drop all variables except the predictors Age, Sex, ChestPain, RestBP, Chol, and the target variable AHD. Drop all rows containing a NaN value.
    3. One-hot encode the variable ChestPain. Where you previously had a single column with one of four values ['typical', 'asymptomatic', 'nonanginal', 'nontypical'], you will now have four binary columns (their names don't matter), akin to 'ChestPain_typical', 'ChestPain_asymptomatic', 'ChestPain_nonanginal', 'ChestPain_nontypical'. A row that previously had ChestPain='typical' will now have ChestPain_typical=1 and the other three columns set to 0, a row with ChestPain='asymptomatic' will have ChestPain_asymptomatic=1 and the other three set to 0, etc.
    4. Binary encode the target variable AHD such that 'No'=0 and 'Yes'=1.

In [7]:
import pandas as pd

# 1: Load the dataset
url = "https://www.statlearning.com/s/Heart.csv"
Heart = pd.read_csv(url)

# 2: Keep relevant columns
Heart = Heart[['Age', 'Sex', 'ChestPain', 'RestBP', 'Chol', 'AHD']]

# 3: Drop rows with any missing values
Heart = Heart.dropna()

# 4: One-hot encode 'ChestPain'
# dtype=int yields 0/1 integer columns directly instead of booleans
Heart = pd.get_dummies(Heart, columns=['ChestPain'], dtype=int)

# 5: Binary encode 'AHD': 'No' → 0, 'Yes' → 1
Heart['AHD'] = Heart['AHD'].map({'No': 0, 'Yes': 1})

Heart.head()



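| | Age | Sex | RestBP | Chol | AHD | ChestPain_asymptomatic | ChestPain_nonanginal | ChestPain_nontypical | ChestPain_typical |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 63 | 1 | 145 | 233 | 0 | 0 | 0 | 0 | 1 |
| 1 | 67 | 1 | 160 | 286 | 1 | 1 | 0 | 0 | 0 |
| 2 | 67 | 1 | 120 | 229 | 1 | 1 | 0 | 0 | 0 |
| 3 | 37 | 1 | 130 | 250 | 0 | 0 | 1 | 0 | 0 |
| 4 | 41 | 0 | 130 | 204 | 0 | 0 | 0 | 1 | 0 |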


**1. Fit a model using a standard train/validation split through multiple steps.**

1.1. Write a function "stratified_split" that takes three arguments: A dataframe, a number of folds, and a list of variables to stratify by. The function should return a list of dataframes, one for each fold, where the dataframes are stratified by the variables in the list. Test that the function works by splitting the dataset into two folds based on 'AHD', 'Age' and 'RestBP' and print the size of each fold, the counts of 0s and 1s in AHD, and the mean of each of 'Age' and 'RestBP' (all these should be printed individually per fold). Ensure that the function does not modify the original dataframe.

In [9]:
# 1: write a function
from sklearn.model_selection import StratifiedKFold
import pandas as pd

def stratified_split(df, n_folds, stratify_vars):
    df_copy = df.copy()
    
    # Create a new column that represents the stratification group
    df_copy['stratify_group'] = df_copy[stratify_vars].astype(str).agg('-'.join, axis=1)
    
    skf = StratifiedKFold(n_splits=n_folds, shuffle=True, random_state=42)
    folds = []
    
    for _, test_idx in skf.split(df_copy, df_copy['stratify_group']):
        fold_df = df_copy.iloc[test_idx].drop(columns=['stratify_group']).copy()
        folds.append(fold_df)
    
    return folds

# 2: test the function
# Uses the preprocessed 'Heart' DataFrame from step 0

# Apply stratified_split on 'AHD', 'Age', and 'RestBP'
folds = stratified_split(Heart, 2, ['AHD', 'Age', 'RestBP'])

# Print fold statistics
for i, fold in enumerate(folds):
    print(f"Fold {i+1}")
    print(f"Size: {len(fold)}")
    print(f"AHD=0 count: {(fold['AHD'] == 0).sum()}")
    print(f"AHD=1 count: {(fold['AHD'] == 1).sum()}")
    print(f"Mean Age: {fold['Age'].mean():.2f}")
    print(f"Mean RestBP: {fold['RestBP'].mean():.2f}")
    print("-" * 30)




Fold 1
Size: 152
AHD=0 count: 84
AHD=1 count: 68
Mean Age: 54.66
Mean RestBP: 131.25
------------------------------
Fold 2
Size: 151
AHD=0 count: 80
AHD=1 count: 71
Mean Age: 54.22
Mean RestBP: 132.13
------------------------------
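
In the output above the AHD counts are 84/68 in one fold and 80/71 in the other, so the outcome is only roughly balanced. One likely reason is that joining the raw 'Age' and 'RestBP' values into a single label can leave many rows in a group of their own, which gives the splitter little to balance on. A possible refinement, shown here only as a sketch, is to bin continuous variables into quantiles before building the label. The function name `stratified_split_binned`, the `n_bins` and `seed` parameters, and the synthetic demo data are assumptions for illustration and are not part of the assignment.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import StratifiedKFold

def stratified_split_binned(df, n_folds, stratify_vars, n_bins=2, seed=42):
    """Hypothetical variant of stratified_split: continuous variables are
    cut into quantile bins before being combined into one stratification label."""
    labels = pd.DataFrame(index=df.index)
    for var in stratify_vars:
        col = df[var]
        if col.nunique() > n_bins:
            # Many distinct values: replace them with quantile-bin codes
            labels[var] = pd.qcut(col, q=n_bins, labels=False, duplicates="drop")
        else:
            # Already categorical/binary: use the values as they are
            labels[var] = col
    group = labels.astype(str).agg("-".join, axis=1)
    skf = StratifiedKFold(n_splits=n_folds, shuffle=True, random_state=seed)
    # df itself is never modified; each fold is an independent copy
    return [df.iloc[test_idx].copy() for _, test_idx in skf.split(df, group)]

# Synthetic stand-in for the Heart data (values are made up for illustration)
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "AHD": rng.integers(0, 2, size=200),
    "Age": rng.integers(30, 78, size=200),
    "RestBP": rng.integers(95, 200, size=200),
})
demo_folds = stratified_split_binned(demo, 2, ["AHD", "Age", "RestBP"])
for i, fold in enumerate(demo_folds, start=1):
    print(f"Fold {i}: size={len(fold)}, AHD=1 count={int(fold['AHD'].sum())}, "
          f"mean Age={fold['Age'].mean():.2f}")
```

With two bins per continuous variable there are at most eight groups, so each group is large enough to be divided across the folds. More bins give finer matching on Age and RestBP at the price of smaller groups.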




1.2. Write a function 'fit_and_predict' that takes 4 arguments: A training set, a validation set, a list of predictors, and a target variable. The function should fit a logistic regression model to the training set using the predictors and target variable, and return the predictions of the model on the validation set.

In [12]:

from sklearn.linear_model import LogisticRegression

def fit_and_predict(train_df, valid_df, predictors, target):
    # Create and fit the logistic regression model
    model = LogisticRegression(max_iter=1000)
    model.fit(train_df[predictors], train_df[target])
    
    # Predict probabilities on validation set (only the probability of class 1)
    predictions = model.predict_proba(valid_df[predictors])[:, 1]
    
    return predictions

1.3. Write a function 'fit_and_predict_standardized' that takes 5 arguments: A training set, a validation set, a list of predictors, a target variable, and a list of variables to standardize. Using a loop (or a scaler), the function should z-score standardize the given variables in both the training set and the validation set based on the mean and standard deviation in the training set. Then, the function should call the 'fit_and_predict' function and return its result. Ensure that the function does not modify the original dataframes. Test the function using the train and validation set from above (e.g. the two folds from the split), while standardizing the 'Age', 'RestBP' and 'Chol' variables (as mentioned above, the target should be AHD, and you should also include the remaining predictors: 'Sex' and the ChestPain-variables)

In [13]:
# 1: write a function 
def fit_and_predict_standardized(train_df, valid_df, predictors, target, standardize_vars):
    # Copy data to avoid modifying the original dataframes
    train_df = train_df.copy()
    valid_df = valid_df.copy()
    
    # Standardize variables using z-score (based on training data)
    for var in standardize_vars:
        mean = train_df[var].mean()
        std = train_df[var].std()
        train_df[var] = (train_df[var] - mean) / std
        valid_df[var] = (valid_df[var] - mean) / std  # use training stats
    
    # Fit and predict using logistic regression
    return fit_and_predict(train_df, valid_df, predictors, target)

# 2: test the function 
# Define columns
standardize_vars = ['Age', 'RestBP', 'Chol']
predictors = ['Age', 'Sex', 'RestBP', 'Chol'] + [col for col in Heart.columns if col.startswith('ChestPain_')]
target = 'AHD'

# Use folds[0] for training, folds[1] for validation
predictions = fit_and_predict_standardized(folds[0], folds[1], predictors, target, standardize_vars)

# Preview predictions
print(predictions[:5])


[0.8924791  0.87586785 0.29629782 0.57033586 0.85776799]
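
The exercise text also allows a scaler instead of a loop. The sketch below shows that alternative with scikit-learn's `StandardScaler`, fitted on the training set only. The helper name `standardize_with_scaler` and the toy data are assumptions for illustration. Note that `StandardScaler` divides by the population standard deviation (ddof=0), whereas pandas `.std()` defaults to the sample standard deviation (ddof=1), so the scaled values differ slightly from the loop version.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

def standardize_with_scaler(train_df, valid_df, standardize_vars):
    """Hypothetical helper: z-score the given columns using training statistics."""
    # Work on copies so the caller's dataframes stay untouched
    train_df = train_df.copy()
    valid_df = valid_df.copy()

    # Learn mean and standard deviation from the training set only
    scaler = StandardScaler().fit(train_df[standardize_vars])
    train_scaled = scaler.transform(train_df[standardize_vars])
    valid_scaled = scaler.transform(valid_df[standardize_vars])

    # Write the scaled values back column by column
    for i, var in enumerate(standardize_vars):
        train_df[var] = train_scaled[:, i]
        valid_df[var] = valid_scaled[:, i]
    return train_df, valid_df

# Toy data (made up) to show that validation rows use the training mean
train = pd.DataFrame({"Age": [40, 50, 60, 70], "Sex": [0, 1, 1, 0]})
valid = pd.DataFrame({"Age": [55, 65], "Sex": [1, 0]})
train_z, valid_z = standardize_with_scaler(train, valid, ["Age"])
print(train_z["Age"].round(3).tolist())
print(valid_z["Age"].round(3).tolist())
```

The returned frames could then be passed to 'fit_and_predict' in place of the loop-standardized ones.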


1.4. Write a function 'fit_and_compute_auc' that takes 5 arguments: A training set, a validation set, a list of predictors, a target variable, and a list of variables to standardize. The function should call the 'fit_and_predict_standardized' function to retrieve out-of-sample predictions for the validation set. Based on these and the ground truth labels in the validation set, it should compute and return the AUC. Test the function using the train and validation set from above, while standardizing the 'Age', 'RestBP' and 'Chol' variables (and including the remaining predictors). Print the AUC.

In [15]:
# 1: write a function
from sklearn.metrics import roc_auc_score

def fit_and_compute_auc(train_df, valid_df, predictors, target, standardize_vars):
    # Get predicted probabilities
    preds = fit_and_predict_standardized(train_df, valid_df, predictors, target, standardize_vars)
    
    # Compute AUC using ground truth vs predicted probs
    auc = roc_auc_score(valid_df[target], preds)
    return auc

# 2: test the function 
# Define columns
standardize_vars = ['Age', 'RestBP', 'Chol']
predictors = ['Age', 'Sex', 'RestBP', 'Chol'] + [col for col in Heart.columns if col.startswith('ChestPain_')]
target = 'AHD'

# Use two folds created earlier
train_set = folds[0]
valid_set = folds[1]

# Calculate AUC
auc = fit_and_compute_auc(train_set, valid_set, predictors, target, standardize_vars)
print(f"AUC: {auc:.4f}")


AUC: 0.8528
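
To make the number above easier to interpret: the AUC equals the probability that a randomly chosen positive case gets a higher predicted score than a randomly chosen negative case, with ties counted as one half. The sketch below checks this on a small made-up example by comparing a pairwise count with `roc_auc_score`. The helper `pairwise_auc` and the toy labels and scores are assumptions for illustration.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def pairwise_auc(y_true, y_score):
    """Hypothetical helper: AUC as the share of (positive, negative) pairs
    in which the positive case has the higher score (ties count one half)."""
    y_true = np.asarray(y_true)
    y_score = np.asarray(y_score, dtype=float)
    pos = y_score[y_true == 1]
    neg = y_score[y_true == 0]
    # Matrix of score differences for every positive/negative pair
    diff = pos[:, None] - neg[None, :]
    return ((diff > 0).sum() + 0.5 * (diff == 0).sum()) / diff.size

# Made-up labels and scores: 7 of the 9 pairs are ranked correctly
y_true = [0, 0, 1, 1, 0, 1]
y_score = [0.10, 0.40, 0.35, 0.80, 0.50, 0.70]

print(f"Pairwise AUC: {pairwise_auc(y_true, y_score):.4f}")        # 0.7778
print(f"roc_auc_score: {roc_auc_score(y_true, y_score):.4f}")      # 0.7778
```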


**2. Perform a cross-validation.**

Use the 'stratified_split' function to split the dataset into 10 folds, stratified on variables you find reasonable. For each fold, use the 'fit_and_compute_auc' function to compute the AUC of the model on the held-out validation set. Print the mean and standard deviation of the AUCs across the 10 folds.

In [18]:
import numpy as np

cv_folds = stratified_split(Heart, 10, stratify_vars=['AHD'])

# Cross-validation loop
auc_scores = []

for i in range(10):
    # Use the i-th fold as validation, the rest as training
    valid_fold = cv_folds[i]
    train_folds = [cv_folds[j] for j in range(10) if j != i]
    train_df = pd.concat(train_folds, ignore_index=True)
    
    # Compute AUC
    auc = fit_and_compute_auc(train_df, valid_fold, predictors, target, standardize_vars)
    auc_scores.append(auc)
    
    print(f"Fold {i+1} AUC: {auc:.4f}")

# Report mean and standard deviation of AUCs
mean_auc = np.mean(auc_scores)
std_auc = np.std(auc_scores)
print(f"\nMean AUC over 10 folds: {mean_auc:.4f}")
print(f"Standard Deviation of AUC: {std_auc:.4f}")

Fold 1 AUC: 0.7941
Fold 2 AUC: 0.8613
Fold 3 AUC: 0.7815
Fold 4 AUC: 0.8824
Fold 5 AUC: 0.8125
Fold 6 AUC: 0.8348
Fold 7 AUC: 0.9509
Fold 8 AUC: 0.8616
Fold 9 AUC: 0.9107
Fold 10 AUC: 0.7723

Mean AUC over 10 folds: 0.8462
Standard Deviation of AUC: 0.0552
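
The manual loop above can be cross-checked against scikit-learn's built-in tools. The sketch below is not the notebook's method: it runs on synthetic data from `make_classification` and standardizes every feature, whereas the notebook standardizes only three, so its numbers are not comparable with the AUCs above. Placing the scaler inside a pipeline means it is refit on each training fold, which mirrors the rule in 'fit_and_predict_standardized' that only training statistics are used.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (not the Heart dataset)
X, y = make_classification(
    n_samples=300, n_features=8, n_informative=4,
    n_clusters_per_class=1, random_state=0,
)

# Scaler and model are fitted together on each training fold
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)
scores = cross_val_score(pipe, X, y, cv=cv, scoring="roc_auc")

print(f"Mean AUC: {scores.mean():.4f}")
# ddof=1 gives the sample standard deviation; np.std defaults to ddof=0
print(f"Std AUC (ddof=1): {scores.std(ddof=1):.4f}")
```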


**3. (Optional) Use the bootstrap to obtain a distribution of out-of-bag AUCs.**

For 100 iterations, create a bootstrap sample by sampling with replacement from the full dataset until you have a training set equal in size to 80% of the original data. Use the observations not included in the bootstrap sample as the validation set for that iteration. Fit models and calculate AUCs for each iteration. Print the mean and standard deviation of the AUCs.

In [19]:
import numpy as np
from sklearn.utils import resample

# Number of bootstrap iterations
n_iterations = 100
n_size = int(0.8 * len(Heart))

# To store AUCs
bootstrap_aucs = []

for i in range(n_iterations):
    # Bootstrap sample
    train_df = resample(Heart, n_samples=n_size, replace=True, random_state=42 + i)
    
    # Get out-of-bag samples (not in training set)
    oob_indices = list(set(Heart.index) - set(train_df.index))
    if not oob_indices:  # Skip if no out-of-bag samples
        continue
    valid_df = Heart.loc[oob_indices]
    
    # Calculate AUC
    auc = fit_and_compute_auc(train_df, valid_df, predictors, target, standardize_vars)
    bootstrap_aucs.append(auc)

    print(f"Iteration {i+1} AUC: {auc:.4f}")

# Report summary stats
mean_bootstrap_auc = np.mean(bootstrap_aucs)
std_bootstrap_auc = np.std(bootstrap_aucs)
print(f"\nBootstrap Mean AUC: {mean_bootstrap_auc:.4f}")
print(f"Bootstrap Std AUC: {std_bootstrap_auc:.4f}")


Iteration 1 AUC: 0.7968
Iteration 2 AUC: 0.8515
Iteration 3 AUC: 0.7997
Iteration 4 AUC: 0.8011
Iteration 5 AUC: 0.8064
Iteration 6 AUC: 0.7675
Iteration 7 AUC: 0.8950
Iteration 8 AUC: 0.8694
Iteration 9 AUC: 0.8909
Iteration 10 AUC: 0.8358
Iteration 11 AUC: 0.8166
Iteration 12 AUC: 0.8314
Iteration 13 AUC: 0.8628
Iteration 14 AUC: 0.8316
Iteration 15 AUC: 0.8100
Iteration 16 AUC: 0.8363
Iteration 17 AUC: 0.8283
Iteration 18 AUC: 0.8234
Iteration 19 AUC: 0.8040
Iteration 20 AUC: 0.8292
Iteration 21 AUC: 0.8148
Iteration 22 AUC: 0.7839
Iteration 23 AUC: 0.8609
Iteration 24 AUC: 0.8250
Iteration 25 AUC: 0.7983
Iteration 26 AUC: 0.8336
Iteration 27 AUC: 0.8389
Iteration 28 AUC: 0.8202
Iteration 29 AUC: 0.8054
Iteration 30 AUC: 0.8496
Iteration 31 AUC: 0.8765
Iteration 32 AUC: 0.8249
Iteration 33 AUC: 0.8704
Iteration 34 AUC: 0.7878
Iteration 35 AUC: 0.7990
Iteration 36 AUC: 0.8284
Iteration 37 AUC: 0.8366
Iteration 38 AUC: 0.8316
Iteration 39 AUC: 0.8401
Iteration 40 AUC: 0.8491
...
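
As a sanity check on the size of the out-of-bag validation sets: when m rows are drawn with replacement from n rows, a given row is never drawn with probability (1 - 1/n)^m. With m equal to 80% of n this is close to exp(-0.8), so roughly 45% of the rows should be out-of-bag in each iteration. The sketch below computes this and compares it with a simulation, assuming n = 303 (the two folds above have 152 and 151 rows). The number of simulation runs is an arbitrary choice.

```python
import numpy as np

n = 303               # rows in the dataset (152 + 151 in the two folds above)
m = int(0.8 * n)      # bootstrap sample size used in the loop

# Probability that a given row is never drawn in m draws with replacement
p_oob = (1 - 1 / n) ** m
print(f"Analytic OOB fraction: {p_oob:.3f}")

# Simulation: draw row positions with replacement, count rows never drawn
rng = np.random.default_rng(0)
fractions = []
for _ in range(2000):
    drawn = rng.integers(0, n, size=m)
    fractions.append(1 - np.unique(drawn).size / n)
sim = float(np.mean(fractions))
print(f"Simulated OOB fraction: {sim:.3f}")
```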

**4. Theory**

**4.1. List some benefits of wrapping code in functions rather than copying and pasting it multiple times.**

Wrapping code in functions lets you reuse it without rewriting it, which reduces redundancy in scripts. When something needs to change, we only change the function once instead of changing every copy, which lowers the risk of inconsistent edits. Functions also make code easier to test in isolation and easier to read, because each step has a descriptive name.


**4.2. Explain three classification metrics and their benefits and drawbacks.**

a) Accuracy: quantifies the proportion of correctly classified cases out of all cases. The benefit is that it is very interpretable; the drawback is that it is misleading when classes are imbalanced, since always predicting the majority class already gives a high accuracy.
b) True positive rate (sensitivity, recall): the proportion of true positives out of all actual positives. It is interpretable as the proportion of positive cases that are detected, and it is useful when the cost of false negatives is high. However, it covers only one side of the classification problem and ignores false positives and how well negative cases are handled.
c) True negative rate (specificity): the proportion of true negatives out of all actual negatives. It is interpretable and useful when the cost of false positives is high. Its drawback is that it is also one-sided: it ignores false negatives.
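
The three metrics can be illustrated on a made-up imbalanced example in which a classifier always predicts the majority class. This is only a sketch with invented labels, not a result from the Heart data: accuracy looks good, while the true positive rate shows that no positive case is detected.

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Made-up imbalanced labels: 95 negatives, 5 positives
y_true = [0] * 95 + [1] * 5
# A trivial classifier that always predicts the majority class
y_pred = [0] * 100

# For binary labels [0, 1], ravel() returns tn, fp, fn, tp in that order
tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()

accuracy = accuracy_score(y_true, y_pred)
tpr = tp / (tp + fn)   # share of actual positives that are detected
tnr = tn / (tn + fp)   # share of actual negatives that are detected

print(f"Accuracy: {accuracy:.2f}")  # 0.95
print(f"TPR: {tpr:.2f}")            # 0.00
print(f"TNR: {tnr:.2f}")            # 1.00
```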


**4.3. Write a couple of sentences comparing the three methods (train/validation, cross-validation, bootstrap) above as approaches to quantify model performance. Which one yielded the best results? Which one would you expect to yield the best results? Can you mention some theoretical benefits and drawbacks with each? Even if you didn't do the optional bootstrap exercise you should reflect on this as an approach.**

a) The train/validation approach is fast and easy to implement. However, it gives only one performance estimate, which can vary a lot depending on the split, and the model is trained on only part of the data.
b) Cross-validation uses every observation for both training and validation, providing a more robust estimate by averaging over multiple splits. The drawbacks are that (i) different choices of k yield different results and (ii) k models have to be fitted.
c) The bootstrap gives a full distribution of the performance metric, which can be used for uncertainty estimation (e.g., confidence intervals). The drawbacks are that each bootstrap sample contains duplicated rows, and therefore fewer distinct observations than the original dataset, and that different choices of the number of iterations B yield different results.

In this run the single split gave an AUC of 0.8528 and 10-fold cross-validation a mean AUC of 0.8462 (standard deviation 0.0552), so the single-split value lies well within the fold-to-fold spread. Cross-validation likely yields the most consistent results, while the bootstrap provides insight into the variability of model performance. Theoretically, cross-validation is preferred for model selection, while the bootstrap is powerful for uncertainty estimation. The train/validation split, though quick, is the least robust of the three.

**4.4. Why do we stratify the dataset before splitting?**

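Because we want all folds to have similar distributions of some given characteristics, such as the outcome variable, age, or sex. Otherwise a fold could by chance contain too few cases of one class, which would make its performance estimate unrepresentative.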

**4.5. What other use cases can you think of for the bootstrap method?**

We can use the bootstrap to estimate confidence intervals for statistics such as regression coefficients. The bootstrap can also be used to simulate the null distribution and calculate p-values when traditional tests aren't valid or their assumptions are violated.
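
As an illustration of the confidence-interval use case, the sketch below computes a percentile bootstrap interval for a sample mean. The data are synthetic (normally distributed values chosen arbitrarily), and the number of resamples and the 95% level are assumptions. The same pattern would apply to other statistics, such as a regression coefficient, by replacing the mean with a refit of the model on each resample.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic sample (made up for illustration)
sample = rng.normal(loc=130, scale=15, size=200)

n_boot = 2000
boot_means = np.empty(n_boot)
for b in range(n_boot):
    # Resample with replacement, same size as the original sample
    resampled = rng.choice(sample, size=sample.size, replace=True)
    boot_means[b] = resampled.mean()

# Percentile method: take the 2.5th and 97.5th percentiles of the bootstrap means
lower, upper = np.percentile(boot_means, [2.5, 97.5])
print(f"Sample mean: {sample.mean():.2f}")
print(f"95% percentile bootstrap CI: [{lower:.2f}, {upper:.2f}]")
```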