![](../img/330-banner.png)

# Tutorial 6

UBC 2024-25

## Outline

During this tutorial, you will work in groups to simulate the behaviour of averaging and stacking classifiers.

All questions can be discussed with your classmates and the TAs - this is not a graded exercise!

In [None]:
import os

%matplotlib inline
import string
import sys
from collections import deque

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

sys.path.append(os.path.join(os.path.abspath(".."), "code"))

from plotting_functions import *
from sklearn import datasets
from sklearn.compose import ColumnTransformer, make_column_transformer
from sklearn.dummy import DummyClassifier, DummyRegressor
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import (
    GridSearchCV,
    RandomizedSearchCV,
    cross_val_score,
    cross_validate,
    train_test_split,
)
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder, StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from lightgbm.sklearn import LGBMClassifier
from sklearn.ensemble import StackingClassifier
from sklearn.ensemble import VotingClassifier

from utils import *
DATA_DIR = os.path.join(os.path.abspath(".."), "data/")

import warnings
warnings.filterwarnings("ignore")

## The dataset

For this exercise, we will work with a new dataset on Heart Failure Prediction. You can download the dataset from [Kaggle](https://www.kaggle.com/datasets/fedesoriano/heart-failure-prediction?resource=download). We also recommend taking a moment to read the Attribute Information section on that page, which explains the features included in the dataset. The goal is to predict whether a patient is at risk of heart failure (class 1) or not (class 0).

Use the cell below to read the dataset and check the first few rows (make sure the path matches the location on your computer).

In [None]:
heart_df = pd.read_csv(DATA_DIR + "heart.csv")
heart_df.head()

Luckily for us, it appears that the dataset is complete - we do not need to worry about imputation.

In [None]:
heart_df.info()

The next few cells take care of the basic preprocessing steps needed before we get to the learning part, like creating a training/test split and creating a suitable `ColumnTransformer`. Run them before moving to the next section.

In [None]:
train_df, test_df = train_test_split(heart_df, test_size=0.2, random_state=42)

In [None]:
numeric_features = ["Age", "RestingBP", "Cholesterol", "MaxHR", "Oldpeak"]

categorical_features = [
    "ChestPainType",
    "RestingECG",
    "ST_Slope",
]

binary_features = ["Sex", "ExerciseAngina"]
passthrough_features = ["FastingBS"]
target_column = "HeartDisease"

In [None]:
numeric_transformer = StandardScaler()

binary_transformer = OneHotEncoder(drop="if_binary", dtype=int)

categorical_transformer = OneHotEncoder(handle_unknown="ignore", sparse_output=False)

preprocessor = make_column_transformer(
    (numeric_transformer, numeric_features),
    (binary_transformer, binary_features),
    (categorical_transformer, categorical_features),
    ("passthrough", passthrough_features),
)

In [None]:
X_train = train_df.drop(columns=[target_column])
y_train = train_df[target_column]

X_test = test_df.drop(columns=[target_column])
y_test = test_df[target_column]

The cell below shows the class proportions in the training set. The dataset is roughly balanced, which is good for our purposes, so we will use accuracy as the evaluation metric.

In [None]:
train_df["HeartDisease"].value_counts(normalize=True)

## Averaging simulation

In this portion of the exercise, you will need to split into 5 groups (groups can be of different sizes). Each group will then train a classifier to predict the target based on the available features. The classifiers to train are:

- Decision Tree
- kNN
- Logistic regression
- Random Forest
- LightGBM 

For this exercise, we will not fine-tune the classifiers and will just use them "off the shelf".

### <font color='red'>Question 1</font>

After creating a pipeline with the preprocessor and your chosen classifier, use `cross_validate` to score it on the training set, and compare the results with the other groups. Which classifier has the best validation performance? Which ones show signs of overfitting? Which one is the slowest to train?
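As a starting point, here is a minimal sketch of the pipeline-plus-`cross_validate` pattern. It runs on synthetic data from `make_classification` and uses a plain `StandardScaler` in place of the tutorial's `preprocessor`; the names `X_demo`, `y_demo`, `pipe_demo`, and `scores_demo` are placeholders introduced only for this illustration. In your own cell, you would use `preprocessor`, your group's classifier, and `X_train`, `y_train` instead.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (an assumption for this sketch, not the heart dataset).
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)

# In the tutorial, StandardScaler() would be replaced by `preprocessor`.
pipe_demo = make_pipeline(StandardScaler(), DecisionTreeClassifier(random_state=0))

# return_train_score=True adds a train_score column next to test_score.
scores_demo = cross_validate(pipe_demo, X_demo, y_demo, cv=5, return_train_score=True)

# Mean fit time, score time, validation score, and training score across folds.
summary_demo = pd.DataFrame(scores_demo).mean()
print(summary_demo)
```

A large gap between `train_score` and `test_score` is the usual sign of overfitting, and `fit_time` gives a rough idea of how expensive each classifier is to train.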


### <font color='red'>Question 2</font>

For this question, we will focus specifically on a small set of samples that were found to be more challenging to classify. You can get the samples by running the cell below.

How many errors does your classifier make when classifying these samples? Compare your result with the other groups: which classifier makes the fewest errors?

In [None]:
uncertain_indices = [122,  77,  49,  54,  12, 129,  35, 102,  39,  56]
test_samples = test_df.iloc[uncertain_indices]
test_samples

### <font color='red'>Question 3</font>

Now, you and the other groups are going to *average* your answers to see if your collective classification is better than the individual ones. Fill the table below with the answer from each classifier, and write down the final classification. Did the averaging classifier do better on these 10 samples than the individual ones?

| Sample   | D.T.     | kNN      | Log.reg. | R.F.     | LightGBM | Final prediction | Correct? |
|----------|----------|----------|----------|----------|----------|----------|----------|
| 122      |          |          |          |          |          |          |          |
| 77       |          |          |          |          |          |          |          |
| 49       |          |          |          |          |          |          |          |
| 54       |          |          |          |          |          |          |          |
| 12       |          |          |          |          |          |          |          |
| 129      |          |          |          |          |          |          |          |
| 35       |          |          |          |          |          |          |          |
| 102      |          |          |          |          |          |          |          |
| 39       |          |          |          |          |          |          |          |
| 56       |          |          |          |          |          |          |          |
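The majority vote for each row of the table can be computed by hand or with a few lines of NumPy. The sketch below uses made-up predictions for three samples (the `votes` array is hypothetical and is not output from the heart dataset), with one column per classifier.

```python
import numpy as np

# Hypothetical 0/1 predictions: one row per sample, one column per classifier
# (D.T., kNN, Log.reg., R.F., LightGBM). These values are made up.
votes = np.array([
    [1, 0, 1, 1, 0],
    [0, 0, 1, 0, 0],
    [1, 1, 1, 0, 1],
])

# With five voters and two classes there are no ties:
# class 1 wins whenever it receives at least 3 of the 5 votes.
final_prediction = (votes.sum(axis=1) >= 3).astype(int)
print(final_prediction)  # [1 0 1]
```

Using an odd number of classifiers is what guarantees that a binary majority vote never ends in a tie.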



Next, you can check whether your answers match those of sklearn's `VotingClassifier` by running the cells below. For this to work, you will need to copy the classifiers from the other teams, and change the pipeline names in the dictionary if yours are different.

In [None]:
classifiers = {
    "logistic regression": pipe_lr,
    "decision tree": pipe_dt,
    "kNN": pipe_kNN,
    "random forest": pipe_rf,
    "LightGBM": pipe_lgbm,
}

averaging_model = VotingClassifier(
    list(classifiers.items()), voting="hard"
) 

averaging_model.fit(X_train, y_train)

In [None]:
averaging_model.predict(X_test.iloc[uncertain_indices])

In [None]:
averaging_model.score(X_test.iloc[uncertain_indices], y_test.iloc[uncertain_indices])

### <font color='red'>Question 4</font>

If everything went according to plan, you should have gotten a better score on these 10 samples - hurray!

But what about the overall classifier performance? Use cross-validation to see if the `VotingClassifier` actually achieves a better validation accuracy than the other classifiers you and the other groups have tried.

In [None]:
scores_averaging = cross_validate(averaging_model, X_train, y_train, return_train_score=True)
pd.DataFrame(scores_averaging)

In [None]:
pd.DataFrame(pd.DataFrame(scores_averaging).mean())

### <font color='red'>Question 5</font>

To answer this question, repeat what you did in Question 3, but this time using **soft voting.** Complete the table with the predicted probability (for class 1) for each sample, and determine the final answer using their average.

How does the averaging classifier with soft voting perform on the 10 uncertain samples?

| Sample   | D.T.     | kNN      | Log.reg. | R.F.     | LightGBM | Average  | Correct? |
|----------|----------|----------|----------|----------|----------|----------|----------|
| 122      |          |          |          |          |          |          |          |
| 77       |          |          |          |          |          |          |          |
| 49       |          |          |          |          |          |          |          |
| 54       |          |          |          |          |          |          |          |
| 12       |          |          |          |          |          |          |          |
| 129      |          |          |          |          |          |          |          |
| 35       |          |          |          |          |          |          |          |
| 102      |          |          |          |          |          |          |          |
| 39       |          |          |          |          |          |          |          |
| 56       |          |          |          |          |          |          |          |
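To illustrate how soft voting can differ from hard voting, the sketch below averages made-up class 1 probabilities for two samples (the `proba_class1` array is hypothetical and is not output from the heart dataset) and compares the result with a majority vote over the same classifiers.

```python
import numpy as np

# Hypothetical predicted probabilities for class 1: one row per sample,
# one column per classifier. These values are made up.
proba_class1 = np.array([
    [0.90, 0.40, 0.45, 0.40, 0.45],
    [0.10, 0.60, 0.55, 0.60, 0.30],
])

# Soft voting: average the probabilities, then threshold at 0.5.
average = proba_class1.mean(axis=1)
soft_prediction = (average > 0.5).astype(int)

# Hard voting on the same classifiers: each one votes with its own
# thresholded prediction, and class 1 needs at least 3 of 5 votes.
hard_prediction = ((proba_class1 > 0.5).sum(axis=1) >= 3).astype(int)

print(average)          # approximately [0.52 0.43]
print(soft_prediction)  # [1 0]
print(hard_prediction)  # [0 1]
```

In the first row a single very confident classifier outweighs four mildly uncertain ones, so the two voting schemes disagree. Soft voting takes confidence into account, while hard voting only counts heads.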


Once again, you can check whether your answers match those of sklearn's `VotingClassifier` with soft voting by running the cells below.

In [None]:
averaging_model = VotingClassifier(
    list(classifiers.items()), voting="soft"
) 

averaging_model.fit(X_train, y_train)

In [None]:
averaging_model.predict(X_test.iloc[uncertain_indices])

In [None]:
averaging_model.score(X_test.iloc[uncertain_indices], y_test.iloc[uncertain_indices])

### <font color='red'>Question 6</font>

Let's now use cross-validation to measure the overall performance of this classifier. How does it compare with the other options seen so far?

In [None]:
scores_averaging = cross_validate(averaging_model, X_train, y_train, return_train_score=True)
pd.DataFrame(scores_averaging)

In [None]:
pd.DataFrame(pd.DataFrame(scores_averaging).mean())

## Stacking

Stacking is another ensemble method that adds one more step to what we have seen the `VotingClassifier` do: instead of taking a majority vote or averaging predicted probabilities, it combines the outputs of the different classifiers into a new feature vector for each sample. A final estimator is then trained on these new feature vectors to make the final prediction.
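The idea can be sketched by hand on synthetic data. The code below is a simplified illustration and not the internals of `StackingClassifier`: it assumes two base models, uses `cross_val_predict` to obtain out-of-fold probabilities, and fits a logistic regression on them. All names (`X_demo`, `y_demo`, `base_models`, `meta_features`) are placeholders introduced for this example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (an assumption for this sketch, not the heart dataset).
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)

base_models = [
    DecisionTreeClassifier(max_depth=3, random_state=0),
    KNeighborsClassifier(),
]

# One new feature per base model: its out-of-fold probability for class 1.
# Out-of-fold predictions keep the final estimator from seeing outputs that
# a base model produced on its own training data.
meta_features = np.column_stack([
    cross_val_predict(model, X_demo, y_demo, cv=5, method="predict_proba")[:, 1]
    for model in base_models
])

# The final estimator learns how to weigh the base models' outputs.
final_estimator = LogisticRegression().fit(meta_features, y_demo)

print(meta_features.shape)    # (300, 2)
print(final_estimator.coef_)  # one coefficient per base model
```

Compared with voting, the weights given to each classifier are learned from data rather than fixed in advance.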

### <font color='red'>Question 7</font>

How the new feature vectors are created depends on the parameters we use when creating the `StackingClassifier`. Review this using the related [sklearn documentation](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.StackingClassifier.html), and answer the questions below.

- What final estimator is used if none is specified as a parameter?
- What would the feature vector look like for sample 122 if `stack_method='predict'`?
- What would the feature vector look like for sample 122 if `stack_method='predict_proba'`?

### <font color='red'>Question 8</font>

It is now time to try out the `StackingClassifier`. Run the cells below to create it and see how it performs on the uncertain samples and in cross-validation on the training set. How does it compare to averaging and to the other classifiers?

In [None]:
stacking_model = StackingClassifier(list(classifiers.items()), stack_method="predict_proba")

In [None]:
stacking_model.fit(X_train, y_train)
stacking_model.score(X_test.iloc[uncertain_indices], y_test.iloc[uncertain_indices])

In [None]:
scores_stacking = cross_validate(stacking_model, X_train, y_train, return_train_score=True)
pd.DataFrame(scores_stacking)

In [None]:
pd.DataFrame(pd.DataFrame(scores_stacking).mean())

### <font color='red'>Question 9</font>

An interesting thing about using logistic regression as the final estimator is that we can inspect the coefficient it assigns to each stacked classifier. Roughly speaking, the larger the coefficient, the more the final estimator relies on that classifier's predicted probability.

Check the coefficients by running the cell below. Which classifier does the final estimator trust the most? Which one does it trust the least?

In [None]:
pd.DataFrame(
    data=stacking_model.final_estimator_.coef_.flatten(),
    index=classifiers.keys(),
    columns=["Coefficient"],
).sort_values("Coefficient", ascending=False)

### <font color='red'>Question 10</font>

As a last step, make the final call on which classifier, among all the ones you have seen today, should be used for this problem, and provide a thorough justification for your answer.

Finally, do not forget to try your pick on the test set!