![](../img/330-banner.png)

# Tutorial 6

UBC 2024-25

## Outline

During this tutorial, you will work in groups to simulate the behaviour of averaging and stacking classifiers.

All questions can be discussed with your classmates and the TAs - this is not a graded exercise!

In [1]:
import os

%matplotlib inline
import string
import sys
from collections import deque

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

sys.path.append(os.path.join(os.path.abspath("."), "code"))

from plotting_functions import *
from sklearn import datasets
from sklearn.compose import ColumnTransformer, make_column_transformer
from sklearn.dummy import DummyClassifier, DummyRegressor
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import (
    GridSearchCV,
    RandomizedSearchCV,
    cross_val_score,
    cross_validate,
    train_test_split,
)
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder, StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from lightgbm.sklearn import LGBMClassifier
from sklearn.ensemble import StackingClassifier
from sklearn.ensemble import VotingClassifier

from utils import *
DATA_DIR = os.path.join(os.path.abspath("."), "data/")

import warnings
warnings.filterwarnings("ignore")

## The dataset

For this exercise, we will work with a new dataset on Heart Failure Prediction. You can download the dataset from [Kaggle](https://www.kaggle.com/datasets/fedesoriano/heart-failure-prediction?resource=download). We also recommend taking a moment to read the Attribute Information on that page, which explains the features included in the dataset. The goal is to predict whether a patient is at risk of heart failure (class 1) or not (class 0).

Use the cell below to read the dataset and check the first few rows (make sure the path matches the location on your computer).

In [2]:
heart_df = pd.read_csv(DATA_DIR + "heart.csv")
heart_df.head()

Unnamed: 0,Age,Sex,ChestPainType,RestingBP,Cholesterol,FastingBS,RestingECG,MaxHR,ExerciseAngina,Oldpeak,ST_Slope,HeartDisease
0,40,M,ATA,140,289,0,Normal,172,N,0.0,Up,0
1,49,F,NAP,160,180,0,Normal,156,N,1.0,Flat,1
2,37,M,ATA,130,283,0,ST,98,N,0.0,Up,0
3,48,F,ASY,138,214,0,Normal,108,Y,1.5,Flat,1
4,54,M,NAP,150,195,0,Normal,122,N,0.0,Up,0


Luckily for us, it appears that the dataset is complete - we do not need to worry about imputation.

In [3]:
heart_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 918 entries, 0 to 917
Data columns (total 12 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   Age             918 non-null    int64  
 1   Sex             918 non-null    object 
 2   ChestPainType   918 non-null    object 
 3   RestingBP       918 non-null    int64  
 4   Cholesterol     918 non-null    int64  
 5   FastingBS       918 non-null    int64  
 6   RestingECG      918 non-null    object 
 7   MaxHR           918 non-null    int64  
 8   ExerciseAngina  918 non-null    object 
 9   Oldpeak         918 non-null    float64
 10  ST_Slope        918 non-null    object 
 11  HeartDisease    918 non-null    int64  
dtypes: float64(1), int64(6), object(5)
memory usage: 86.2+ KB


The next few cells take care of the basic preprocessing steps needed before we get to the learning part, like creating a training/test split and creating a suitable `ColumnTransformer`. Run them before moving to the next section.

In [4]:
train_df, test_df = train_test_split(heart_df, test_size=0.2, random_state=42)

In [5]:
numeric_features = ["Age", "RestingBP", "Cholesterol", "MaxHR", "Oldpeak"]

categorical_features = [
    "ChestPainType",
    "RestingECG",
    "ST_Slope",
]

binary_features = ["Sex", "ExerciseAngina"]
passthrough_features = ["FastingBS"]
target_column = "HeartDisease"

In [6]:
numeric_transformer = StandardScaler()

binary_transformer = OneHotEncoder(drop="if_binary", dtype=int)

categorical_transformer = OneHotEncoder(handle_unknown="ignore", sparse_output=False)

preprocessor = make_column_transformer(
    (numeric_transformer, numeric_features),
    (binary_transformer, binary_features),
    (categorical_transformer, categorical_features),
    ("passthrough", passthrough_features),
)

In [7]:
X_train = train_df.drop(columns=[target_column])
y_train = train_df[target_column]

X_test = test_df.drop(columns=[target_column])
y_test = test_df[target_column]

The cell below shows that the dataset is roughly balanced (about 55% of the training samples are in class 1 and 45% in class 0), which is good for our purposes. We will use accuracy as the evaluation metric.

In [8]:
train_df["HeartDisease"].value_counts(normalize=True)

HeartDisease
1    0.546322
0    0.453678
Name: proportion, dtype: float64

## Averaging simulation

In this portion of the exercise, you will need to split into 5 groups (groups can be of different sizes). Each group will then train a classifier to predict the target based on the available features. The classifiers to train are:

- Decision Tree
- kNN
- Logistic regression
- Random Forest
- LightGBM 

For this exercise, we will not fine-tune the classifiers and will just use them "off the shelf".

### <font color='red'>Question 1</font>

After creating a pipeline with the preprocessor and your chosen classifier, use `cross_validate` to score it on the training set, and compare the results with the other groups. Which classifier has the best performance? Which show signs of overfitting? Which one is the slowest to train?


In [9]:
# Decision tree
pipe_dt = make_pipeline(preprocessor, DecisionTreeClassifier())
dt_scores = cross_validate(pipe_dt, X_train, y_train, cv=10, return_train_score=True)
pd.DataFrame(dt_scores)

Unnamed: 0,fit_time,score_time,test_score,train_score
0,0.010832,0.003102,0.743243,1.0
1,0.005399,0.002043,0.810811,1.0
2,0.005042,0.001976,0.783784,1.0
3,0.004886,0.001886,0.810811,1.0
4,0.005298,0.001991,0.808219,1.0
5,0.004793,0.002008,0.821918,1.0
6,0.004772,0.001967,0.780822,1.0
7,0.005022,0.001936,0.794521,1.0
8,0.004948,0.001919,0.808219,1.0
9,0.004877,0.001977,0.821918,1.0


In [10]:
pd.DataFrame(dt_scores).mean()

fit_time       0.005587
score_time     0.002080
test_score     0.798427
train_score    1.000000
dtype: float64

In [11]:
# KNN
pipe_kNN = make_pipeline(preprocessor, KNeighborsClassifier())
knn_scores = cross_validate(pipe_kNN, X_train, y_train, cv=10, return_train_score=True)
pd.DataFrame(knn_scores)

Unnamed: 0,fit_time,score_time,test_score,train_score
0,0.013315,0.100833,0.905405,0.880303
1,0.004571,0.002824,0.824324,0.889394
2,0.004379,0.003018,0.891892,0.881818
3,0.003892,0.002531,0.864865,0.889394
4,0.004221,0.003313,0.835616,0.897126
5,0.003779,0.002709,0.90411,0.878971
6,0.00365,0.00257,0.739726,0.895613
7,0.005264,0.005603,0.876712,0.897126
8,0.004539,0.00374,0.917808,0.877458
9,0.004082,0.00264,0.808219,0.892587


In [12]:
pd.DataFrame(knn_scores).mean()

fit_time       0.005169
score_time     0.012978
test_score     0.856868
train_score    0.887979
dtype: float64

In [13]:
# LR
lr_classifier = LogisticRegression(random_state=123)
pipe_lr = make_pipeline(preprocessor, lr_classifier)
lr_scores = cross_validate(pipe_lr, X_train, y_train, cv=10, return_train_score=True)
pd.DataFrame(lr_scores)

Unnamed: 0,fit_time,score_time,test_score,train_score
0,0.015249,0.002775,0.918919,0.865152
1,0.006478,0.002043,0.878378,0.868182
2,0.006793,0.002473,0.891892,0.871212
3,0.007321,0.002339,0.864865,0.871212
4,0.006187,0.002055,0.835616,0.877458
5,0.007196,0.002393,0.890411,0.868381
6,0.006597,0.002062,0.780822,0.877458
7,0.00633,0.002357,0.876712,0.874433
8,0.005976,0.002034,0.890411,0.868381
9,0.006283,0.001972,0.835616,0.875946


In [14]:
pd.DataFrame(lr_scores).mean()

fit_time       0.007441
score_time     0.002250
test_score     0.866364
train_score    0.871782
dtype: float64

In [15]:
# random forest
rf_classifier = RandomForestClassifier(random_state=123)
pipe_rf = make_pipeline(preprocessor, rf_classifier)
rf_scores = cross_validate(pipe_rf, X_train, y_train, cv=10, return_train_score=True)
pd.DataFrame(rf_scores)

Unnamed: 0,fit_time,score_time,test_score,train_score
0,0.102498,0.004975,0.878378,1.0
1,0.09095,0.004474,0.878378,1.0
2,0.081354,0.004458,0.864865,1.0
3,0.080723,0.004258,0.878378,1.0
4,0.080238,0.004339,0.835616,1.0
5,0.078268,0.004076,0.863014,1.0
6,0.078318,0.004087,0.849315,1.0
7,0.078855,0.004128,0.849315,1.0
8,0.079489,0.00424,0.890411,1.0
9,0.077717,0.00436,0.849315,1.0


In [16]:
pd.DataFrame(rf_scores).mean()

fit_time       0.082841
score_time     0.004340
test_score     0.863699
train_score    1.000000
dtype: float64

In [17]:
# light gbm
lgbm_classifier = LGBMClassifier(random_state=123, verbosity=-1)
pipe_lgbm = make_pipeline(preprocessor, lgbm_classifier)
lgbm_scores = cross_validate(pipe_lgbm, X_train, y_train, cv=10, return_train_score=True)
pd.DataFrame(lgbm_scores)

Unnamed: 0,fit_time,score_time,test_score,train_score
0,0.520835,0.005511,0.864865,1.0
1,0.381099,0.00554,0.918919,1.0
2,0.262096,0.003485,0.851351,1.0
3,0.279371,0.003626,0.878378,1.0
4,0.22993,0.003681,0.794521,1.0
5,0.220427,0.003319,0.863014,1.0
6,0.234952,0.003415,0.821918,1.0
7,0.307419,0.004575,0.835616,1.0
8,0.291829,0.003311,0.917808,1.0
9,0.3244,0.003779,0.808219,1.0


In [18]:
pd.DataFrame(lgbm_scores).mean()

fit_time       0.305236
score_time     0.004024
test_score     0.855461
train_score    1.000000
dtype: float64
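
To compare the groups' results side by side, the mean scores reported above can be gathered in one place. The sketch below does not re-run any model: it hard-codes the means copied from the outputs above and computes the gap between training and validation accuracy, which is one rough indicator of overfitting. The names `mean_scores` and `gaps` are introduced here for illustration only.

```python
# Mean cross-validation results copied from the outputs above
# (fit_time in seconds, test_score = validation accuracy).
mean_scores = {
    "Decision tree": {"fit_time": 0.005587, "test_score": 0.798427, "train_score": 1.000000},
    "kNN": {"fit_time": 0.005169, "test_score": 0.856868, "train_score": 0.887979},
    "Logistic regression": {"fit_time": 0.007441, "test_score": 0.866364, "train_score": 0.871782},
    "Random forest": {"fit_time": 0.082841, "test_score": 0.863699, "train_score": 1.000000},
    "LightGBM": {"fit_time": 0.305236, "test_score": 0.855461, "train_score": 1.000000},
}

# A large train-validation gap suggests the model fits the training folds
# much better than unseen data.
gaps = {name: s["train_score"] - s["test_score"] for name, s in mean_scores.items()}

best = max(mean_scores, key=lambda name: mean_scores[name]["test_score"])
slowest = max(mean_scores, key=lambda name: mean_scores[name]["fit_time"])
largest_gap = max(gaps, key=gaps.get)

for name, gap in sorted(gaps.items(), key=lambda item: item[1], reverse=True):
    print(f"{name:20s} train-validation gap: {gap:.3f}")
print("Best validation accuracy:", best)
print("Slowest to fit:", slowest)
```

The differences in validation accuracy between the top models are small relative to the fold-to-fold variation shown above, so the ranking should be read with some caution.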

### <font color='red'>Question 2</font>

For this question, we will focus specifically on a small set of samples that were found to be more challenging to classify. You can get the samples by running the cell below.

How many errors does your classifier make when classifying these samples? Compare your result with the other groups: which classifier makes the fewest errors?

In [19]:
uncertain_indices = [122,  77,  49,  54,  12, 129,  35, 102,  39,  56]
test_samples = test_df.iloc[uncertain_indices]
test_samples

Unnamed: 0,Age,Sex,ChestPainType,RestingBP,Cholesterol,FastingBS,RestingECG,MaxHR,ExerciseAngina,Oldpeak,ST_Slope,HeartDisease
294,32,M,TA,95,0,1,Normal,127,N,0.7,Up,1
425,60,M,ATA,160,267,1,ST,157,N,0.5,Flat,1
758,51,M,TA,125,213,0,LVH,125,Y,1.4,Up,0
650,48,M,ASY,130,256,1,LVH,150,Y,0.0,Up,1
768,64,F,ASY,130,303,0,Normal,122,N,2.0,Flat,0
548,66,M,ASY,112,261,0,Normal,140,N,1.5,Up,1
824,37,M,NAP,130,250,0,Normal,187,N,3.5,Down,0
76,32,M,ASY,118,529,0,Normal,130,N,0.0,Flat,1
70,57,M,ATA,140,265,0,ST,145,Y,1.0,Flat,1
110,59,F,ATA,130,188,0,Normal,124,N,1.0,Flat,0


In [20]:
# Fit pipelines
pipe_dt.fit(X_train, y_train)
pipe_kNN.fit(X_train, y_train)
pipe_lr.fit(X_train, y_train)
pipe_rf.fit(X_train, y_train)
pipe_lgbm.fit(X_train, y_train)

samples_X = test_samples.drop(columns=["HeartDisease"])

# Predictions dict
results = {}
results["D.T."] = pipe_dt.predict(samples_X)
results["kNN."] = pipe_kNN.predict(samples_X)
results["Log.reg."] = pipe_lr.predict(samples_X)
results["R.F."] = pipe_rf.predict(samples_X)
results["LightGBM"] = pipe_lgbm.predict(samples_X)

# Predictions df
results = pd.DataFrame(results)

# Majority vote vs. actual comparison
# (mode(axis=1) returns a DataFrame; column 0 holds the most frequent label per row)
results["Final Prediction"] = results.mode(axis=1)[0]
results["Actual"] = test_samples["HeartDisease"].values
results["Correct?"] = (results["Final Prediction"] == results["Actual"])
results["Sample"] = uncertain_indices

results

Unnamed: 0,D.T.,kNN.,Log.reg.,R.F.,LightGBM,Final Prediction,Actual,Correct?,Sample
0,1,1,0,1,1,1,1,True,122
1,0,1,1,1,0,1,1,True,77
2,0,1,0,0,0,0,0,True,49
3,0,1,1,0,1,1,1,True,54
4,1,1,1,1,1,1,0,False,12
5,0,0,0,0,0,0,1,False,129
6,0,0,0,0,1,0,0,True,35
7,1,1,0,1,1,1,1,True,102
8,0,1,1,1,1,1,1,True,39
9,1,0,0,0,0,0,0,True,56


### <font color='red'>Question 3</font>

Now, you and the other groups are going to *average* your answers to see if your collective classification is better than the individual ones. Fill in the table below with the answer from each classifier, and write down the final classification (the majority vote). Did the averaging classifier do better on these 10 samples than the individual ones?

| Sample   | D.T.     | kNN.     | Log.reg. | R.F.     | LightGBM | Final prediction | Correct? |
|----------|----------|----------|----------|----------|----------|----------|----------|
| 122      |   1      |    1     |     0    |    1     |    1     |     1    |    Y     |
| 77       |   0      |    1     |     1    |    1     |    0     |     1    |    Y     |
| 49       |   0      |    1     |     0    |    0     |    0     |     0    |    Y     |
| 54       |   0      |    1     |     1    |    0     |    1     |     1    |    Y     |
| 12       |   1      |    1     |     1    |    1     |    1     |     1    |    N     |
| 129      |   0      |    0     |     0    |    0     |    0     |     0    |    N     |
| 35       |   0      |    0     |     0    |    0     |    1     |     0    |    Y     |
| 102      |   1      |    1     |     0    |    1     |    1     |     1    |    Y     |
| 39       |   0      |    1     |     1    |    1     |    1     |     1    |    Y     |
| 56       |   1      |    0     |     0    |    0     |    0     |     0    |    Y     |
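
The majority vote can also be worked out by hand in a few lines. The following is a minimal sketch in plain Python that uses the hard predictions copied from the output of the cell above (it does not call any fitted model); the helper `majority_vote` is a hypothetical name introduced for illustration.

```python
from collections import Counter

# Hard predictions for the 10 uncertain samples, copied from the cell output
# above. Columns: D.T., kNN, Log.reg., R.F., LightGBM.
predictions = [
    [1, 1, 0, 1, 1],  # sample 122
    [0, 1, 1, 1, 0],  # sample 77
    [0, 1, 0, 0, 0],  # sample 49
    [0, 1, 1, 0, 1],  # sample 54
    [1, 1, 1, 1, 1],  # sample 12
    [0, 0, 0, 0, 0],  # sample 129
    [0, 0, 0, 0, 1],  # sample 35
    [1, 1, 0, 1, 1],  # sample 102
    [0, 1, 1, 1, 1],  # sample 39
    [1, 0, 0, 0, 0],  # sample 56
]
actual = [1, 1, 0, 1, 0, 1, 0, 1, 1, 0]


def majority_vote(votes):
    """Return the most frequent label (no ties with 5 voters and 2 classes)."""
    return Counter(votes).most_common(1)[0][0]


final = [majority_vote(row) for row in predictions]
accuracy = sum(f == a for f, a in zip(final, actual)) / len(actual)

# Number of mistakes each individual classifier makes on the same samples.
errors_per_classifier = [
    sum(row[j] != a for row, a in zip(predictions, actual)) for j in range(5)
]

print(final)                  # [1, 1, 0, 1, 1, 0, 0, 1, 1, 0]
print(accuracy)               # 0.8
print(errors_per_classifier)  # [6, 3, 4, 3, 4]
```

On these 10 samples the vote makes 2 errors, fewer than any single classifier, which make between 3 and 6.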



Next, you can check whether your answers match those of sklearn's `VotingClassifier` by running the cells below (for this to work, you will need to copy the classifiers from the other teams; also, change the names in the list if they are different).

In [21]:
classifiers = {
    "logistic regression": pipe_lr,
    "decision tree": pipe_dt,
    "kNN": pipe_kNN,
    "random forest": pipe_rf,
    "LightGBM": pipe_lgbm,
}

averaging_model = VotingClassifier(
    list(classifiers.items()), voting="hard"
) 

averaging_model.fit(X_train, y_train)

In [22]:
averaging_model.predict(X_test.iloc[uncertain_indices])

array([1, 1, 0, 1, 1, 0, 0, 1, 1, 0])

In [23]:
averaging_model.score(X_test.iloc[uncertain_indices], y_test.iloc[uncertain_indices])

0.8

### <font color='red'>Question 4</font>

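If everything went according to plan, the averaging classifier should have scored better on these 10 samples than any individual classifier - hurray!

But what about the overall classifier performance? Use cross-validation to see if the `VotingClassifier` actually achieves a better validation accuracy than the other classifiers you and the other groups have tried. Note that the cell below uses the default 5-fold cross-validation, whereas the individual classifiers above were scored with `cv=10`, so the comparison is only approximate.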

In [24]:
scores_averaging = cross_validate(averaging_model, X_train, y_train, return_train_score=True)
pd.DataFrame(scores_averaging)

Unnamed: 0,fit_time,score_time,test_score,train_score
0,0.503709,0.015161,0.870748,1.0
1,0.410687,0.01439,0.877551,1.0
2,0.335746,0.016143,0.85034,1.0
3,0.329911,0.013658,0.843537,1.0
4,0.356908,0.014176,0.856164,1.0


In [25]:
pd.DataFrame(pd.DataFrame(scores_averaging).mean())

Unnamed: 0,0
fit_time,0.387392
score_time,0.014706
test_score,0.859668
train_score,1.0


### <font color='red'>Question 5</font>

To answer this question, repeat what you did in Question 3, but this time using **soft voting.** Complete the table with the predicted probability (for class 1) for each sample, and determine the final answer using their average.

In [26]:
# prob results dict
results_proba = {}
results_proba["D.T."] = pipe_dt.predict_proba(samples_X)[:, 1]
results_proba["kNN."] = pipe_kNN.predict_proba(samples_X)[:, 1]
results_proba["Log.reg."] = pipe_lr.predict_proba(samples_X)[:, 1]
results_proba["R.F."] = pipe_rf.predict_proba(samples_X)[:, 1]
results_proba["LightGBM"] = pipe_lgbm.predict_proba(samples_X)[:, 1]

# results dict
results = pd.DataFrame(results_proba)
results["Average"] = results.mean(axis=1)
results["Actual"] = test_samples["HeartDisease"].values
results["Correct?"] = (results["Average"].round(decimals=0) == results["Actual"])
results["Sample"] = uncertain_indices
results.round(decimals=3)

Unnamed: 0,D.T.,kNN.,Log.reg.,R.F.,LightGBM,Average,Actual,Correct?,Sample
0,1.0,0.8,0.405,0.62,0.926,0.75,1,True,122
1,0.0,0.6,0.602,0.64,0.397,0.448,1,False,77
2,0.0,0.6,0.392,0.42,0.251,0.333,0,True,49
3,0.0,0.8,0.647,0.46,0.789,0.539,1,True,54
4,1.0,1.0,0.652,0.69,0.806,0.829,0,False,12
5,0.0,0.4,0.347,0.28,0.266,0.259,1,False,129
6,0.0,0.4,0.338,0.42,0.764,0.384,0,True,35
7,1.0,0.6,0.327,0.8,0.974,0.74,1,True,102
8,0.0,1.0,0.675,0.78,0.94,0.679,1,True,39
9,1.0,0.2,0.31,0.43,0.113,0.411,0,True,56


How does the averaging classifier with soft voting perform on the 10 uncertain samples?

| Sample   | D.T.     | kNN.     | Log.reg. | R.F.     | LightGBM | Average  | Correct? |
|----------|----------|----------|----------|----------|----------|----------|----------|
| 122      |    1     |   0.8    |0.405     |    0.62  |   0.926  |  0.750   |     Y    |
| 77       |    0     |   0.6    |0.602     |    0.64  |   0.397  |  0.448   |     N    |
| 49       |    0     |   0.6    |0.392     |    0.42  |   0.251  |  0.333   |     Y    |
| 54       |    0     |   0.8    |0.647     |    0.46  |   0.789  |  0.539   |     Y    |
| 12       |    1     |   1      |0.652     |    0.69  |   0.806  |  0.829   |     N    |
| 129      |    0     |   0.4    |0.347     |    0.28  |   0.266  |  0.259   |     N    |
| 35       |    0     |   0.4    |0.338     |    0.42  |   0.764  |  0.384   |     Y    |
| 102      |    1     |   0.6    |0.327     |    0.80  |   0.974  |  0.740   |     Y    |
| 39       |    0     |   1      |0.675     |    0.78  |   0.940  |  0.679   |     Y    |
| 56       |    1     |   0.2    |0.310     |    0.43  |   0.113  |  0.411   |     Y    |
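
Soft voting can be reproduced by hand in the same way. The sketch below averages the class-1 probabilities copied from the cell output above (rounded to three decimals, so the averages are approximate) and predicts class 1 when the average exceeds 0.5; no fitted model is involved.

```python
# Predicted probabilities of class 1 for the 10 uncertain samples, copied
# from the cell output above. Columns: D.T., kNN, Log.reg., R.F., LightGBM.
probabilities = [
    [1.0, 0.8, 0.405, 0.62, 0.926],  # sample 122
    [0.0, 0.6, 0.602, 0.64, 0.397],  # sample 77
    [0.0, 0.6, 0.392, 0.42, 0.251],  # sample 49
    [0.0, 0.8, 0.647, 0.46, 0.789],  # sample 54
    [1.0, 1.0, 0.652, 0.69, 0.806],  # sample 12
    [0.0, 0.4, 0.347, 0.28, 0.266],  # sample 129
    [0.0, 0.4, 0.338, 0.42, 0.764],  # sample 35
    [1.0, 0.6, 0.327, 0.80, 0.974],  # sample 102
    [0.0, 1.0, 0.675, 0.78, 0.940],  # sample 39
    [1.0, 0.2, 0.310, 0.43, 0.113],  # sample 56
]
actual = [1, 1, 0, 1, 0, 1, 0, 1, 1, 0]

# Soft voting: average the probabilities, then threshold at 0.5.
averages = [sum(row) / len(row) for row in probabilities]
final = [int(avg > 0.5) for avg in averages]
accuracy = sum(f == a for f, a in zip(final, actual)) / len(actual)

print([round(avg, 3) for avg in averages])
print(final)     # [1, 0, 0, 1, 1, 0, 0, 1, 1, 0]
print(accuracy)  # 0.7
```

Sample 77 illustrates the difference from hard voting: three of the five classifiers predict class 1, but the decision tree's probability of 0.0 pulls the average down to about 0.448, so soft voting predicts class 0.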


Once again, you can check whether your answers match those of sklearn's `VotingClassifier` with soft voting by running the cells below.

In [27]:
averaging_model = VotingClassifier(
    list(classifiers.items()), voting="soft"
) 

averaging_model.fit(X_train, y_train)

In [28]:
averaging_model.predict(X_test.iloc[uncertain_indices])

array([1, 0, 0, 1, 1, 0, 0, 1, 1, 0])

In [29]:
averaging_model.score(X_test.iloc[uncertain_indices], y_test.iloc[uncertain_indices])

0.7

### <font color='red'>Question 6</font>

Let's now use cross-validation to measure the overall performance of this classifier. How does it compare with the other options seen so far?

In [30]:
scores_averaging = cross_validate(averaging_model, X_train, y_train, return_train_score=True)
pd.DataFrame(scores_averaging)

Unnamed: 0,fit_time,score_time,test_score,train_score
0,0.365787,0.013562,0.877551,1.0
1,0.344799,0.013799,0.884354,1.0
2,0.365461,0.014167,0.85034,1.0
3,0.351538,0.015361,0.836735,1.0
4,0.359618,0.013499,0.849315,1.0


In [31]:
pd.DataFrame(pd.DataFrame(scores_averaging).mean())

Unnamed: 0,0
fit_time,0.357441
score_time,0.014078
test_score,0.859659
train_score,1.0


## Stacking

Stacking is another ensemble method that adds one more step to what we have seen the `VotingClassifier` do: instead of taking a majority vote or averaging the predicted probabilities, it combines the outputs of the different classifiers into a new feature vector for each sample. A final estimator is then trained on these feature vectors to produce the prediction.

### <font color='red'>Question 7</font>

How the new feature vectors are created depends on the parameters we use when creating the `StackingClassifier`. Review this using the related [sklearn documentation](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.StackingClassifier.html), and answer the questions below.

- What final estimator is used if none is specified as a parameter?
- What would the feature vector look like for sample 122 if `stack_method="predict"`?
- What would the feature vector look like for sample 122 if `stack_method="predict_proba"`?

**Solution:**

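- If no final estimator is specified, `LogisticRegression` is used.
- For `stack_method="predict"`, the feature vector would be `[1, 1, 0, 1, 1]` (using the column order of the tables above: D.T., kNN, Log.reg., R.F., LightGBM).
- For `stack_method="predict_proba"`, the feature vector would be `[1.0, 0.8, 0.405, 0.62, 0.926]` (same column order; only the probability of class 1 is kept for each classifier).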
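
As a rough illustration, the two feature vectors can be assembled from the values reported in Questions 2 and 5. In the fitted ensemble, the columns follow the order of the estimator list rather than the order of the tables, and for a binary problem with `predict_proba` sklearn keeps a single probability column per estimator. This is only a sketch: the numbers are copied from the individually fitted pipelines, whereas `StackingClassifier` refits its own copies of the estimators and builds its training features from cross-validated predictions, so the values it actually sees may differ.

```python
# Outputs of the individually fitted pipelines for sample 122,
# copied from the tables in Questions 2 and 5.
hard_prediction = {
    "decision tree": 1,
    "kNN": 1,
    "logistic regression": 0,
    "random forest": 1,
    "LightGBM": 1,
}
proba_class_1 = {
    "decision tree": 1.0,
    "kNN": 0.8,
    "logistic regression": 0.405,
    "random forest": 0.62,
    "LightGBM": 0.926,
}

# Order in which the estimators are passed to the ensemble
# (the key order of the `classifiers` dict defined earlier).
estimator_order = [
    "logistic regression",
    "decision tree",
    "kNN",
    "random forest",
    "LightGBM",
]

# One column per estimator, in estimator order.
features_predict = [hard_prediction[name] for name in estimator_order]
features_proba = [proba_class_1[name] for name in estimator_order]

print(features_predict)  # [0, 1, 1, 1, 1]
print(features_proba)    # [0.405, 1.0, 0.8, 0.62, 0.926]
```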

### <font color='red'>Question 8</font>

It is now time to try out the `StackingClassifier`. Run the cells below to create it and see how it performs on the uncertain samples and on the entire dataset. How does it compare to Averaging and the other classifiers?

In [32]:
stacking_model = StackingClassifier(list(classifiers.items()), stack_method = 'predict_proba')

In [33]:
stacking_model.fit(X_train, y_train)
stacking_model.score(X_test.iloc[uncertain_indices], y_test.iloc[uncertain_indices])

0.8

In [34]:
scores_stacking = cross_validate(stacking_model, X_train, y_train, return_train_score=True)
pd.DataFrame(scores_stacking)

Unnamed: 0,fit_time,score_time,test_score,train_score
0,1.915701,0.013335,0.884354,0.954003
1,1.817244,0.01294,0.891156,0.957411
2,1.822822,0.013041,0.870748,0.964225
3,1.835531,0.013248,0.829932,0.948893
4,1.85456,0.013328,0.863014,0.954082


In [35]:
pd.DataFrame(pd.DataFrame(scores_stacking).mean())

Unnamed: 0,0
fit_time,1.849172
score_time,0.013178
test_score,0.867841
train_score,0.955723


### <font color='red'>Question 9</font>

An interesting thing about using logistic regression as the final estimator is that we can inspect the coefficient associated with each stacked classifier. Roughly speaking, these coefficients indicate how much weight the final estimator gives to each classifier's predicted probability.

Check the coefficients by running the cells below. Which classifier is the most trustworthy? Which one is the least?

In [36]:
pd.DataFrame(
    data=stacking_model.final_estimator_.coef_.flatten(),
    index=classifiers.keys(),
    columns=["Coefficient"],
).sort_values("Coefficient", ascending=False)

Unnamed: 0,Coefficient
logistic regression,2.051577
random forest,1.873486
kNN,1.222765
LightGBM,1.054757
decision tree,-0.281677
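
One way to read these numbers: in logistic regression the log-odds of class 1 are the intercept plus the sum of each coefficient times its feature, so raising one classifier's predicted probability by some amount shifts the log-odds by the coefficient times that amount. The sketch below applies this to the coefficients copied from the output above; the increment `delta_p = 0.1` is an arbitrary value chosen for illustration.

```python
# Coefficients of the final estimator, copied from the output above.
coefficients = {
    "logistic regression": 2.051577,
    "random forest": 1.873486,
    "kNN": 1.222765,
    "LightGBM": 1.054757,
    "decision tree": -0.281677,
}

# Hypothetical increase in one classifier's predicted probability of class 1.
delta_p = 0.1

# Resulting change in the final estimator's log-odds, all else being equal.
log_odds_shift = {name: coef * delta_p for name, coef in coefficients.items()}

most_weight = max(coefficients, key=coefficients.get)
least_weight = min(coefficients, key=lambda name: abs(coefficients[name]))

for name, shift in log_odds_shift.items():
    print(f"{name:20s} log-odds shift for +{delta_p}: {shift:+.3f}")
print("Largest coefficient:", most_weight)
print("Smallest coefficient in magnitude:", least_weight)
```

Because the base classifiers' outputs are correlated, individual coefficients should be interpreted with care; a small or negative coefficient does not necessarily mean the classifier is useless on its own.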


### <font color='red'>Question 10</font>

As a final step, decide which classifier, among all the ones you have seen today, should be used for this problem, and provide a thorough justification for your answer.

Finally, do not forget to try your pick on the test set!

In [37]:
classifier_results = {}
classifier_results["DT"] = pd.DataFrame(dt_scores).mean()
classifier_results["Log.reg."] = pd.DataFrame(lr_scores).mean()
classifier_results["kNN"] = pd.DataFrame(knn_scores).mean()
classifier_results["RF"] = pd.DataFrame(rf_scores).mean()
classifier_results["LightGBM"] = pd.DataFrame(lgbm_scores).mean()
# Note: scores_averaging was overwritten in Question 6, so it holds the soft-voting results
classifier_results["Averaging Model"] = pd.DataFrame(scores_averaging).mean()
classifier_results["Stacking Model"] = pd.DataFrame(scores_stacking).mean()

pd.DataFrame(classifier_results).T

Unnamed: 0,fit_time,score_time,test_score,train_score
DT,0.005587,0.00208,0.798427,1.0
Log.reg.,0.007441,0.00225,0.866364,0.871782
kNN,0.005169,0.012978,0.856868,0.887979
RF,0.082841,0.00434,0.863699,1.0
LightGBM,0.305236,0.004024,0.855461,1.0
Averaging Model,0.357441,0.014078,0.859659,1.0
Stacking Model,1.849172,0.013178,0.867841,0.955723


In [38]:
pipe_lr.score(X_test, y_test)

0.8532608695652174

In [39]:
stacking_model.score(X_test, y_test)

0.875