### <img src=images/gdd-logo.png width=200px align=right>
# Resampling

In the last notebook, you saw how imbalanced data can cause problems when training a machine learning model. You also chose a metric that is more appropriate than accuracy. In this notebook, you will see how you can improve your predictions using resampling.

### Outline 
- [Can you collect more data?](#more)
- [Resampling](#resample)
    - [Undersampling](#under)
    - [Oversampling](#over)
    - [Combining under- and oversampling](#combine)

Let's load in the data again, prepare the feature matrix and target vector and perform a train-test split.

In [None]:
import numpy as np
import pandas as pd

from data import create_Xy
from sklearn.model_selection import train_test_split

In [None]:
stroke = pd.read_csv('data/full_data.csv').rename(str.lower, axis='columns')

In [None]:
categorical_columns = ['gender', 'ever_married', 'work_type', 'residence_type', 'smoking_status']
target = 'stroke'

X, y = create_Xy(stroke, 
                 target=target, 
                 categorical_columns = categorical_columns)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42, stratify=y)

<a id = 'more'></a>

## Can you collect more data?

There are generally two main causes for an imbalance in your dataset:

1. The imbalance could be due to **collection bias** (e.g., only collecting data that favors one class) or errors made during collection (e.g., misclassifications). In this case it would be worth investigating whether more data can be collected and/or sampling methods can be improved.

2. The second cause of imbalance might be that it is simply a **property of the problem domain**. For example, relatively few people suffer from a stroke when looking at the entire population. In such a case, it can be hard to collect more data from the minority group without introducing more bias.

<mark>**Question:** Why do you think collecting more data only for the minority class could be a problem?</mark>

<details>

  <summary><span style="color:blue">Show answer</span></summary>

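For example, suppose you want more data on stroke patients, so you approach several different hospitals. The stroke patients would then come from many hospitals, while all healthy participants would still come from the same one. The model could pick up on differences between hospitals rather than differences related to stroke, which introduces bias.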

</details>



So what can you do when collecting more data is not an option?

<a id = 'resample'></a>

## Resampling

One solution is to try to transform the dataset in order to balance the class distribution. This can be done by selectively deleting examples from the majority class (**undersampling**) or by duplicating or synthesizing new examples in the minority class (**oversampling**).

Balancing the class distributions helps the model to become more sensitive to the minority class. However, it is important to note that resampling is not a silver bullet. Let's now discover some of these methods and discuss their pros and cons.

<img src=images/resampling.png width=600px>

<a id = 'under'></a>

### Undersampling

When undersampling, you're reducing the number of samples from the majority class. There are several different methods of doing this, but the most naive way is by random undersampling, whereby you select random data points from the majority class that will be deleted.
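
To make the idea concrete, here is a minimal NumPy sketch of random undersampling done by hand. It assumes binary labels and balances the classes fully; the helper name `random_undersample` and the toy arrays are made up for illustration and are not part of imbalanced-learn.

```python
import numpy as np

def random_undersample(X, y, random_state=0):
    """Keep all minority samples and an equally sized random subset of the majority class."""
    rng = np.random.default_rng(random_state)
    classes, counts = np.unique(y, return_counts=True)
    minority, majority = classes[np.argmin(counts)], classes[np.argmax(counts)]

    minority_idx = np.flatnonzero(y == minority)
    majority_idx = np.flatnonzero(y == majority)

    # draw, without replacement, as many majority samples as there are minority samples
    kept_majority_idx = rng.choice(majority_idx, size=len(minority_idx), replace=False)

    keep = np.sort(np.concatenate([minority_idx, kept_majority_idx]))
    return X[keep], y[keep]

# toy data (assumption): 95 majority samples (label 0) and 5 minority samples (label 1)
X_toy = np.arange(200).reshape(100, 2)
y_toy = np.array([0] * 95 + [1] * 5)

X_toy_res, y_toy_res = random_undersample(X_toy, y_toy)
print(np.bincount(y_toy_res))  # [5 5]
```

Note that 90 of the 95 majority rows are thrown away purely by chance, regardless of how informative they are.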

Let's compress our data using Principal Component Analysis (PCA) so that we can visualize it in two dimensions.

In [None]:
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

def plot_majority_minority_class(X, y, title=''):
    for label in np.unique(y):
        plt.scatter(X[y==label, 0], X[y==label, 1], label=label)
    plt.legend()
    plt.title(title)
    plt.show()

# reduce number of features to two with PCA, using the training data only
pca = PCA(n_components=2)
X_train_trans = pca.fit_transform(X_train)

plot_majority_minority_class(X_train_trans, y_train)

You can see that a lot of points from the majority class are very similar. 

Therefore, it may be the case that not all of those points are needed to learn a model that can distinguish between the two classes.

<mark>**Question:** Should the train-test split be done before or after resampling? Why?</mark>

<details>

  <summary><span style="color:blue">Show answer</span></summary>

The train-test split should be done **before** resampling, and resampling should be applied to the train set only.

The purpose of having a test set is so that we can evaluate how the model would perform "in the wild" on unseen data. It should therefore be reflective of real-world data.

</details>



### Imbalanced-learn

<img src=images/imblearn.png width=300px>

Many resamplers can be imported from [imbalanced-learn](https://imbalanced-learn.org/stable/). This library provides tools for classification with imbalanced classes and is built on top of scikit-learn.

In [None]:
# !pip install imbalanced-learn

In [None]:
from imblearn.under_sampling import RandomUnderSampler
from collections import Counter

rand_under = RandomUnderSampler(sampling_strategy=1.0)
X_train_res, y_train_res = rand_under.fit_resample(X_train, y_train)

print(f"Original dataset shape: {Counter(y_train)}\nResampled dataset shape: {Counter(y_train_res)}")

<mark>**Question:** Try out different values for the *sampling_strategy* parameter. What does it do?</mark>

<details>

  <summary><span style="color:blue">Show answer</span></summary>

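When given a float, `sampling_strategy` is the desired ratio of the number of minority-class samples to the number of majority-class samples after resampling. With `1.0` both classes end up the same size; with `0.5` the majority class ends up twice as large as the minority class. Check out the documentation for the [`RandomUnderSampler`](https://imbalanced-learn.org/stable/references/generated/imblearn.under_sampling.RandomUnderSampler.html#imblearn.under_sampling.RandomUnderSampler) for the other options!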

</details>

Let's visualise the results after random undersampling.

In [None]:
pca = PCA(n_components=2)
X_train_res_pca = pca.fit_transform(X_train_res)

plot_majority_minority_class(X_train_res_pca, y_train_res)

<mark>**Question:** How does this PCA visualisation compare to the one produced prior to undersampling?</mark>

Of course you now want to know if undersampling helped to increase the performance of our model (which had a not so dazzling F2 score of 0.0 when we started)!

**Note:** To use the RandomUnderSampler from imbalanced-learn, we have to use the `imblearn.pipeline.Pipeline` class, which extends the `sklearn.pipeline.Pipeline` class with support for sampler steps.

In [None]:
from imblearn.pipeline import Pipeline # different from the sklearn Pipeline!
from sklearn.svm import SVC
from sklearn.metrics import make_scorer, fbeta_score
from sklearn.model_selection import StratifiedKFold, cross_val_score

model = SVC()

under = RandomUnderSampler(sampling_strategy=0.5)

pipeline = Pipeline(steps=[('under', under),
                          ('model', model)])

cv = StratifiedKFold(n_splits=10)
ftwo_scorer = make_scorer(fbeta_score, beta=2)

scores = cross_val_score(pipeline, X_train, y_train, scoring=ftwo_scorer, cv=cv, n_jobs=-1)

print(f"Mean score: {(np.mean(scores)):.3f}")

That's already an increase in performance!
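
As a reminder of what the F2 score measures, the sketch below computes an F-beta score directly from confusion-matrix counts for the positive class. The function name and the counts are hypothetical and only serve as a worked example; in the notebook itself the score comes from `fbeta_score` via `ftwo_scorer`.

```python
def fbeta_from_counts(tp, fp, fn, beta=2.0):
    """F-beta score computed from confusion-matrix counts for the positive class."""
    b2 = beta ** 2
    denominator = (1 + b2) * tp + b2 * fn + fp
    # guard against 0/0 when there are no positives at all
    return 0.0 if denominator == 0 else (1 + b2) * tp / denominator

# hypothetical counts: 30 strokes caught, 60 false alarms, 10 strokes missed
print(fbeta_from_counts(tp=30, fp=60, fn=10))  # 0.6

# a model that never predicts the minority class has no true positives
print(fbeta_from_counts(tp=0, fp=0, fn=40))    # 0.0
```

With `beta=2`, missed strokes (false negatives) are penalised four times as heavily as false alarms. The second call shows why a model that only ever predicts the majority class scores 0.0.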

### Other undersampling methods

Although random undersampling is simple and often effective, there is a limitation. Data points are randomly removed without considering how important or useful they are to determine the decision boundary between classes. This means that you can easily delete useful information. 

Other undersampling algorithms exist that try to identify redundant examples for deletion or useful examples for non-deletion.

<img src=images/tomek-links.png width=600px>
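
As one example of a less naive approach, a Tomek link is a pair of points from different classes that are each other's nearest neighbour. The brute-force NumPy sketch below finds such pairs on made-up toy data and drops only the majority-class member of each pair. It is a simplified illustration under those assumptions, not the imbalanced-learn implementation, and the helper name `tomek_link_mask` is hypothetical.

```python
import numpy as np

def tomek_link_mask(X, y):
    """Boolean mask marking samples that are part of a Tomek link."""
    # pairwise Euclidean distances; a sample is never its own neighbour
    distances = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    np.fill_diagonal(distances, np.inf)
    nearest = distances.argmin(axis=1)

    idx = np.arange(len(X))
    mutual = nearest[nearest] == idx   # i and its nearest neighbour pick each other
    opposite = y != y[nearest]         # and the two have different labels
    return mutual & opposite

# toy data (assumption): three majority points (label 0) and two minority points (label 1)
X_toy = np.array([[0.0, 0.0], [0.1, 0.0], [1.0, 0.0], [1.05, 0.0], [3.0, 0.0]])
y_toy = np.array([0, 0, 0, 1, 1])

in_link = tomek_link_mask(X_toy, y_toy)
majority = np.bincount(y_toy).argmax()

# drop only the majority-class member of each link
keep = ~(in_link & (y_toy == majority))
print(np.flatnonzero(in_link))  # [2 3]
print(np.flatnonzero(keep))     # [0 1 3 4]
```

Only the majority point that sits right next to a minority point is removed, so far fewer samples are deleted than with random undersampling.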

###  <mark>**Exercise: Try other undersampling methods**</mark>

Choose **one** of the following undersamplers to investigate:

1. Near-Miss-3
2. Tomek Links
3. One-Sided Selection

1. Import and instantiate your chosen method using [imbalanced-learn](https://imbalanced-learn.org/stable/references/under_sampling.html).

In [None]:
from imblearn.under_sampling import TomekLinks, NearMiss, OneSidedSelection

# your code here

In [None]:
# %load answers/ex1-1.py

2. Transform the training data.

In [None]:
# %load answers/ex1-2.py

3. Visualize the results after transforming the data using the `plot_majority_minority_class` function.

In [None]:
# %load answers/ex1-3.py

4. Investigate its performance by making a pipeline with the undersampler and an SVC, and then using cross-validation.

In [None]:
# %load answers/ex1-4.py

#### Bonus questions

Investigate what the undersampler does a little bit more. Look at the documentation and/or google around!

1. Does your method select examples to keep, to delete or a combination of both?

2. Describe in one or two sentence(s) how it undersamples.

3. You may have found that some of the undersamplers still gave an F-score of 0. Any idea why this could be the case? And how could you solve it?

*Hint: Check what the model predicts using `.fit()` and `.predict()`. Is it predicting both classes?*

<a id = 'over'></a>

### Oversampling

When oversampling, you're adding copies or synthetic examples of data points from the minority class. This helps to balance the class distribution, thus increasing the weight those data points carry during training.
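
The simplest variant, duplicating minority rows at random, can be sketched by hand in a few lines of NumPy. The toy data and variable names below are assumptions made for illustration; the exercise that follows uses the imbalanced-learn transformer instead.

```python
import numpy as np

rng = np.random.default_rng(0)

# toy data (assumption): 95 majority samples (label 0) and 5 minority samples (label 1)
X_toy = rng.normal(size=(100, 2))
y_toy = np.array([0] * 95 + [1] * 5)

minority_idx = np.flatnonzero(y_toy == 1)
n_extra = np.sum(y_toy == 0) - len(minority_idx)

# sample minority rows with replacement until both classes are the same size
extra_idx = rng.choice(minority_idx, size=n_extra, replace=True)

X_toy_over = np.vstack([X_toy, X_toy[extra_idx]])
y_toy_over = np.concatenate([y_toy, y_toy[extra_idx]])

print(np.bincount(y_toy_over))  # [95 95]

# every added row is an exact copy, so the number of distinct minority rows is unchanged
print(len(np.unique(X_toy_over[y_toy_over == 1], axis=0)))  # 5
```

The classes are now balanced, but there are still only five distinct minority points.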

### <mark>**Exercise: Random Oversampling**</mark>



1. Import and instantiate the [RandomOverSampler](https://imbalanced-learn.org/stable/references/over_sampling.html) transformer from imbalanced-learn. Then transform the data.

In [None]:
# add code

In [None]:
# %load answers/ex2-1.py

2. Visualize the results.
    <br>a) Use PCA and the `plot_majority_minority_class` function.
    <br>b) How does this PCA visualisation compare to the undersampling one?

In [None]:
# add code

In [None]:
# %load answers/ex2-2.py

3. Assess the performance of the oversampling pipeline.
    <br>a) Create a pipeline containing the RandomOverSampler and an SVC. Use cross-validation to see the performance.
    <br>b) How does the performance compare to the undersampler you implemented previously?

In [None]:
# add code

In [None]:
# %load answers/ex2-3.py

### Creating synthetic examples of the minority class

Although random oversampling can balance the class distribution, it does not provide any additional information to the model. An alternative is to *synthesize* new examples from the minority class.

There are lots of different approaches to synthesizing new data, including **SMOTE**, **KMeans SMOTE** and **ADASYN**. You can read more about these methods [here](https://imbalanced-learn.org/stable/over_sampling.html#smote-adasyn). 

<img src=images/smote.png width=600px>
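
The core idea behind SMOTE is interpolation: a new point is placed somewhere on the line segment between a minority sample and one of its nearest minority neighbours. The NumPy sketch below illustrates only that step on made-up data. It is a simplification (brute-force neighbour search, hypothetical helper name `smote_like_samples`) and not the imbalanced-learn implementation.

```python
import numpy as np

def smote_like_samples(X_minority, n_new, k=2, random_state=0):
    """Create points on segments between minority samples and their nearest minority neighbours."""
    rng = np.random.default_rng(random_state)

    # pairwise distances within the minority class; a sample is never its own neighbour
    distances = np.linalg.norm(X_minority[:, None, :] - X_minority[None, :, :], axis=-1)
    np.fill_diagonal(distances, np.inf)
    neighbours = np.argsort(distances, axis=1)[:, :k]  # k nearest minority neighbours per sample

    base = rng.integers(len(X_minority), size=n_new)         # random starting sample
    chosen = neighbours[base, rng.integers(k, size=n_new)]   # one of its k neighbours
    gap = rng.random((n_new, 1))                             # position along the segment, in [0, 1)

    return X_minority[base] + gap * (X_minority[chosen] - X_minority[base])

# toy minority class (assumption): the four corners of the unit square
X_min_toy = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
X_new_toy = smote_like_samples(X_min_toy, n_new=6)
print(X_new_toy.round(2))
```

With `k=2`, each corner's neighbours are the two adjacent corners, so every synthetic point lands on an edge of the square: new points, but always between existing minority samples.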

Let's take a look at what happens to the performance when you use SMOTE:

In [None]:
from imblearn.over_sampling import SMOTE

model = SVC()
over = SMOTE()

pipeline = Pipeline(steps=[('over', over),
                          ('model', model)])

cv = StratifiedKFold(n_splits=10)
scores_smote = cross_val_score(pipeline, X_train, y_train, scoring=ftwo_scorer, cv=cv, n_jobs=-1)

print(f'Mean score (SMOTE): {(np.mean(scores_smote)):.3f}')

<a id = 'combine'></a>

### <mark>Exercise: Combining methods</mark>

You can also combine methods (e.g., oversampling with undersampling) to improve performance.

An example of first oversampling and then undersampling is provided below.

1. Experiment with different values for the `sampling_strategy`.
2. Evaluate how the model generalises on the **test** data. Use the `ftwo_scorer` function to get the test score of the model.
3. **Bonus:** Add Tomek Links to the pipeline. Where would it make most sense and does it improve performance?

In [None]:
from imblearn.over_sampling import RandomOverSampler

over = RandomOverSampler(sampling_strategy=0.3)
under = RandomUnderSampler(sampling_strategy=0.5)

pipeline_comb = Pipeline(steps=[('over', over),
                                ('under', under),
                                ('model', model)])

scores_comb = cross_val_score(pipeline_comb, X_train, y_train, scoring=ftwo_scorer, cv=cv, n_jobs=-1)

print(f'Mean score: {(np.mean(scores_comb)):.3f}')
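
When chaining samplers, it helps to work out by hand what the two `sampling_strategy` values above do to the class counts. The sketch below does that arithmetic, assuming a float strategy means the minority-to-majority ratio after resampling; the function name and the fold sizes are hypothetical, and exact counts in imbalanced-learn may differ by rounding.

```python
def counts_after_over_then_under(n_majority, n_minority, over_ratio, under_ratio):
    """Class counts after oversampling the minority, then undersampling the majority."""
    # oversampling: grow the minority until minority / majority == over_ratio
    n_minority_over = round(n_majority * over_ratio)
    if n_minority_over < n_minority:
        raise ValueError("over_ratio is below the current ratio: oversampling cannot remove samples")

    # undersampling: shrink the majority until minority / majority == under_ratio
    n_majority_under = round(n_minority_over / under_ratio)
    if n_majority_under > n_majority:
        raise ValueError("under_ratio is below the current ratio: undersampling cannot add samples")

    return n_majority_under, n_minority_over

# hypothetical training fold with 1000 majority and 50 minority samples
print(counts_after_over_then_under(1000, 50, over_ratio=0.3, under_ratio=0.5))  # (600, 300)
```

In this example the minority class grows from 50 to 300 samples and the majority class shrinks from 1000 to 600. The undersampling ratio has to be larger than the oversampling ratio, otherwise the second step has nothing to remove.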

In [None]:
# %load answers/ex3.py

## A note on class weights as an alternative to resampling

In scikit-learn, many models accept a `class_weight` parameter. This is a way of compensating for class imbalance without resampling the data: instead of changing the dataset, the model weights the errors on each class differently during training.
With `class_weight='balanced'`, the weight of each class is inversely proportional to its frequency in the training data, which mimics the effect of resampling.
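
For `class_weight='balanced'`, the scikit-learn documentation gives the weights as `n_samples / (n_classes * np.bincount(y))`. The short sketch below applies that formula to made-up labels; the helper name `balanced_class_weights` is hypothetical and only illustrates the arithmetic.

```python
from collections import Counter

def balanced_class_weights(y):
    """Weights following the formula n_samples / (n_classes * class_count)."""
    counts = Counter(y)
    return {label: len(y) / (len(counts) * count) for label, count in counts.items()}

# toy labels (assumption): 90 majority samples (label 0) and 10 minority samples (label 1)
y_toy = [0] * 90 + [1] * 10
weights = balanced_class_weights(y_toy)
print(weights)
```

Here the minority class gets a weight of 5.0 and the majority class a weight of roughly 0.56, so both classes contribute the same total weight (50 each) to the training loss.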

In [None]:
from sklearn.svm import SVC
from sklearn.metrics import make_scorer, fbeta_score
from sklearn.model_selection import StratifiedKFold, cross_val_score

model = SVC(class_weight='balanced')

cv = StratifiedKFold(n_splits=10)
ftwo_scorer = make_scorer(fbeta_score, beta=2)

scores = cross_val_score(model, X_train, y_train, scoring=ftwo_scorer, cv=cv, n_jobs=-1)

print(f"Mean score: {(np.mean(scores)):.3f}")

## Summary

This notebook covered how to improve predictions by using several resamplers from imbalanced-learn. This included undersampling and oversampling, and a combination of both. Besides random over/undersampling, there are several other algorithms available that resample in a less naive way. 