<img src="https://teaching.bowyer.io/SDSAI/0/img/IMPERIAL_logo_RGB_Blue_2024.svg" alt="Imperial Logo" width="500"/><br /><br />

Ensemble Methods and Unsupervised Learning
==============
### SURG70098 - Surgical Data Science and AI
### Stuart Bowyer

## Intended Learning Outcomes
1.  ???


## MIMIC Dataset
The following code loads the dataset used in these lecture notes.

In [71]:
%pip install pandas_gbq

import pandas as pd
import pandas_gbq

project_id = 'mimic-project-439314'  # @param {type:"string"}

df_day1_vitalsign = pandas_gbq.read_gbq("""
  SELECT
    *,
    (dod IS NOT NULL) AND (dod <= dischtime) AS mortality,
    weight / POWER(height/100, 2) > 30 AS obese
  FROM `physionet-data.mimiciv_derived.first_day_vitalsign`
  LEFT JOIN (
    SELECT
      subject_id,
      stay_id,
      gender,
      race,
      dischtime,
      admission_age,
      dod
    FROM
      `physionet-data.mimiciv_derived.icustay_detail`
  )
  USING(subject_id, stay_id)
  LEFT JOIN (
    SELECT
      stay_id,
      AVG(weight) as weight
    FROM
      `physionet-data.mimiciv_derived.weight_durations`
    GROUP BY
      stay_id
  )
  USING(stay_id)
  LEFT JOIN (
    SELECT
      stay_id,
      CAST(AVG(height) AS FLOAT64) AS height
    FROM
      `physionet-data.mimiciv_derived.height`
    GROUP BY
      stay_id
  )
  USING(stay_id)
  WHERE heart_rate_mean IS NOT NULL
""", project_id=project_id)

# Ensemble Methods

## Introduction to Ensemble Methods
*   You have probably noticed that different models have different strengths and weaknesses
*   i.e. each model works well on some problems and poorly on others
*   Ensemble methods combine several models to improve overall performance by ...
    *   Improving accuracy
    *   Improving stability
    *   Reducing error

**How would you combine models together to optimise their group performance?**

## Bootstrap Aggregating (Bagging)
*   Builds multiple models independently (in parallel), each on a random sample of the data drawn with replacement (so samples can overlap), and combines their predictions (e.g. by majority vote or averaging)
*   Aim is to reduce overfitting: each model is trained on slightly different patterns, and combining their outputs reduces variance
*   Very commonly used with decision trees to create **random forests**
    *   `from sklearn.ensemble import RandomForestClassifier`

![](https://upload.wikimedia.org/wikipedia/commons/7/76/Random_Forest_Diagram_Extra_Wide.png)

By <a href="//commons.wikimedia.org/w/index.php?title=User:CollaborativeGeneticist&amp;action=edit&amp;redlink=1" class="new" title="User:CollaborativeGeneticist (page does not exist)">CollaborativeGeneticist</a> - <span class="int-own-work" lang="en">Own work</span>, <a href="https://creativecommons.org/licenses/by-sa/4.0" title="Creative Commons Attribution-Share Alike 4.0">CC BY-SA 4.0</a>, <a href="https://commons.wikimedia.org/w/index.php?curid=113209159">Link</a>
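
The idea above can be illustrated with a minimal sketch. It assumes `scikit-learn` is installed and uses synthetic data from `make_classification` as a stand-in for a real dataset; the names `X_demo`, `y_demo`, and `bagging_scores` are illustrative only. It compares a single decision tree with a bagged ensemble of trees and a random forest on the same held-out split.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (an assumption; not the MIMIC tables)
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=10, n_informative=5, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.25, stratify=y_demo, random_state=0
)

# A single, fully grown decision tree for reference
tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

# Bagging: 100 trees, each fitted to a bootstrap sample of the training rows
bagged = BaggingClassifier(
    DecisionTreeClassifier(random_state=0), n_estimators=100, random_state=0
).fit(X_train, y_train)

# Random forest: bagged trees that also consider only a random subset
# of the features at each split
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)

bagging_scores = {
    "single tree": tree.score(X_test, y_test),
    "bagged trees": bagged.score(X_test, y_test),
    "random forest": forest.score(X_test, y_test),
}
for name, score in bagging_scores.items():
    print(f"{name}: accuracy = {score:.3f}")
```

On most datasets the two ensembles would be expected to match or beat the single tree, but the size of the gain depends on the data, so treat the printed numbers as illustrative rather than a guarantee.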

## Boosting
*   Builds models sequentially, with each new model trying to correct the errors of its predecessors
*   Aim is to reduce underfitting (due to weak models) by focusing each model on the errors of the earlier models
*   Very commonly used with decision trees to create **gradient boosted trees**
    *   `from sklearn.ensemble import GradientBoostingClassifier`
    *   A popular alternative implementation is XGBoost

<img src="https://upload.wikimedia.org/wikipedia/commons/b/b5/Ensemble_Boosting.svg" alt="Diagram of ensemble boosting" style="height: 400px; display: block; margin: 0 auto;">

By <a href="//commons.wikimedia.org/wiki/User:Sirakorn" title="User:Sirakorn">Sirakorn</a> - <span class="int-own-work" lang="en">Own work</span>, <a href="https://creativecommons.org/licenses/by-sa/4.0" title="Creative Commons Attribution-Share Alike 4.0">CC BY-SA 4.0</a>, <a href="https://commons.wikimedia.org/w/index.php?curid=85888769">Link</a>
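
To make the sequential behaviour concrete, here is a minimal sketch, assuming `scikit-learn` is installed and again using synthetic data rather than MIMIC (the variable names are illustrative). It fits shallow, deliberately weak trees and uses `staged_predict` to show test accuracy after each additional tree is added to the ensemble.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (an assumption; not the MIMIC tables)
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=10, n_informative=5, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.25, stratify=y_demo, random_state=0
)

# Shallow trees are weak learners; each new tree is fitted to the
# errors left by the trees built so far
gbt = GradientBoostingClassifier(
    n_estimators=100, learning_rate=0.1, max_depth=2, random_state=0
)
gbt.fit(X_train, y_train)

# staged_predict yields the ensemble's predictions after 1, 2, ..., 100 trees
staged_accuracy = [
    float((pred == y_test).mean()) for pred in gbt.staged_predict(X_test)
]
for n_trees in (1, 10, 50, 100):
    print(f"{n_trees:>3} trees: accuracy = {staged_accuracy[n_trees - 1]:.3f}")
```

The values chosen for `n_estimators`, `learning_rate`, and `max_depth` are arbitrary starting points, not recommendations; in practice they are usually tuned together.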


# Class Imbalance
*   One common issue you will face is class imbalance
*   i.e. when one of the classes you are predicting is much more (or less) common than the others
*   For example, in the MIMIC dataset ...

In [100]:
df_day1_vitalsign.mortality.value_counts(normalize=True)

mortality
False    0.883693
True     0.116307
Name: proportion, dtype: Float64

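*   This imbalance can bias training towards the majority class and lead to poor performance on the minority class
*   e.g. a model that always predicts survival would be about 88% accurate here while identifying no deaths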

### Can you suggest how to address the class imbalance?

### Oversampling
*   Involves creating new observations in the minority class by ...
*   **Random oversampling:** randomly duplicating entries from the minority class
*   **Synthetic Minority Over-sampling Technique (SMOTE):** generating new synthetic samples in the minority class by interpolating between existing observations
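
The interpolation step behind SMOTE can be sketched in a few lines of NumPy. This is a simplified illustration, not the reference algorithm: it always interpolates towards the single nearest minority-class neighbour, whereas SMOTE picks at random among several nearest neighbours. The data values and the helper name `smote_like_sample` are made up for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy minority-class observations with two features (made-up values)
minority = np.array([[1.0, 1.0], [2.0, 1.5], [1.5, 3.0], [3.0, 2.5]])


def smote_like_sample(points, rng):
    """Create one synthetic point between a random point and its nearest neighbour."""
    i = rng.integers(len(points))
    distances = np.linalg.norm(points - points[i], axis=1)
    distances[i] = np.inf       # exclude the point itself
    j = np.argmin(distances)    # nearest minority-class neighbour
    lam = rng.random()          # interpolation fraction in [0, 1)
    return points[i] + lam * (points[j] - points[i])


synthetic = np.array([smote_like_sample(minority, rng) for _ in range(5)])
print(synthetic)
```

Each synthetic point lies on the line segment between two existing minority observations, which is why SMOTE adds variety that simple duplication cannot.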

### Undersampling
*   Involves dropping observations in the majority class by ...
*   **Random undersampling:** randomly removing entries from the majority class
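*   **Tomek links:** removes entries from the majority class whose nearest neighbour is a minority-class entry that also has them as its nearest neighbour (i.e. borderline cases or suspected noise)
*   **NearMiss:** keeps the entries from the majority class that are closest to the minority class and removes those that are far from it (i.e. easy classifications)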
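
Random undersampling, the simplest of these, can be sketched with NumPy alone. The toy arrays `X_toy` and `y_toy` are made up for illustration and are not drawn from MIMIC.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: 90 majority (False) and 10 minority (True) observations
y_toy = np.array([False] * 90 + [True] * 10)
X_toy = rng.normal(size=(100, 3))

minority_idx = np.flatnonzero(y_toy)
majority_idx = np.flatnonzero(~y_toy)

# Keep a random subset of the majority class, the same size as the minority class
kept_majority_idx = rng.choice(majority_idx, size=len(minority_idx), replace=False)
keep_idx = np.sort(np.concatenate([minority_idx, kept_majority_idx]))

X_under, y_under = X_toy[keep_idx], y_toy[keep_idx]

print("Before:", np.bincount(y_toy.astype(int)).tolist())    # [90, 10]
print("After: ", np.bincount(y_under.astype(int)).tolist())  # [10, 10]
```

The cost is visible in the counts: 80 of the 90 majority observations are discarded, so undersampling suits large datasets better than small ones.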

### Using `imbalanced-learn`
*   The Python package `imbalanced-learn` (which mirrors the `scikit-learn` API) provides tools for addressing imbalanced datasets
    *    [Oversampling methods](https://imbalanced-learn.org/stable/references/over_sampling.html)
    *    [Undersampling methods](https://imbalanced-learn.org/stable/references/under_sampling.html)

In [111]:
from imblearn.over_sampling import RandomOverSampler

data = df_day1_vitalsign.dropna(subset=['admission_age', 'heart_rate_mean', 'sbp_mean', 'glucose_mean', 'mortality'])
X = data[['admission_age', 'heart_rate_mean', 'sbp_mean', 'glucose_mean']]
Y = data['mortality']

# Create the random oversampler
ros = RandomOverSampler(random_state=1)

# Apply it to our dataset
X_res, Y_res = ros.fit_resample(X, Y)

print("Previous value counts: ", Y.value_counts().to_list())
print("Resampled value counts:", Y_res.value_counts().to_list())

Previous value counts:  [63076, 8034]
Resampled value counts: [63076, 63076]


## Exercise 6.1 - Supervised Learning Challenge
Use the [Wisconsin Breast Cancer Database](https://archive.ics.uci.edu/dataset/17/breast+cancer+wisconsin+diagnostic) to build the highest-performing classifier you can for predicting tumour malignancy from breast mass features.

98% accuracy is possible

The page linked above has instructions for importing the dataset into Python.

I suggest starting with logistic regression on a subset of features, but you should expect to build up the model complexity and number of features.
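
One possible starting point is sketched below. It assumes `scikit-learn` is installed and loads the copy of this dataset bundled with it via `load_breast_cancer`, rather than following the import instructions on the UCI page. The three features are an arbitrary choice for illustration, and `cross_val_score` is left on its default scoring for classifiers.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# scikit-learn's bundled copy of the Wisconsin diagnostic dataset
cancer = load_breast_cancer(as_frame=True)

# An arbitrary subset of features to start from
X_bc = cancer.data[["mean radius", "mean texture", "mean smoothness"]]
y_bc = cancer.target  # in this copy, 0 = malignant and 1 = benign

# Scale the features, then fit a logistic regression
baseline = make_pipeline(StandardScaler(), LogisticRegression())

# 5-fold cross-validation using the estimator's default score
cv_scores = cross_val_score(baseline, X_bc, y_bc, cv=5)
print(f"Mean cross-validated score: {cv_scores.mean():.3f}")
```

Note the label encoding in this copy: malignant tumours are coded as 0, which matters when interpreting any metric that treats one class as "positive".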

You will probably want to use most of the techniques you have learnt in the past two/three lectures:
*   Some basic EDA of this new dataset
*   Ensuring the data are fully prepared
*   Exploring different classification methods
*   Searching for optimal parameters
*   Addressing class imbalances
*   Testing other performance improvements (ensemble methods)
*   Using effective model validation

**Before you start, what metric/s should we use?**

# Wrap Up
*   ???

## Before Next Session
*   Review random forest and gradient boosting methods

### New Material
*   Random Forest - https://williamkoehrsen.medium.com/random-forest-simple-explanation-377895a60d2d
*   xgboost Introduction - https://xgboost.readthedocs.io/en/stable/tutorials/model.html

### Consolidation Reading
*   ???