### <img src=images/gdd-logo.png width=200px align=right>

# Introduction to imbalanced data

In this notebook, you will examine why imbalanced data can cause problems for machine learning and discuss how to pick an appropriate metric.

### Outline
- [When is data imbalanced?](#when)
- [Why imbalanced data can cause problems](#why)
- [Choosing a model evaluation metric](#metric)
- [Cross-validation on imbalanced data](#cross-val)

In [None]:
import pandas as pd
import seaborn as sns
import numpy as np

from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.metrics import accuracy_score

## About the data

<img src='images/who.png' width='500px' align='right' style="padding: 15px">

According to the World Health Organization (WHO), strokes are the 2nd leading cause of death globally, responsible for approximately 11% of total deaths.

You will use this dataset to build a model that can **predict whether a patient is likely to have a `stroke`** (based on input parameters like gender, age and whether or not they smoke). 

In [None]:
stroke = pd.read_csv('data/full_data.csv').rename(str.lower, axis='columns')
stroke

<a id = 'when'></a>

## When is data imbalanced?

In machine learning, data is considered imbalanced when the classes are not represented equally. In practice, the term is used when there are significantly more examples of one class than of the other.

For example, if you were trying to train a model to detect fraud in financial transactions, the number of fraudulent transactions would likely be much smaller than the number of non-fraudulent transactions. In this case, the data would be imbalanced, with the "non-fraud" class being much more prevalent than the "fraud" class.
<br><br>

<img src=images/imbalanced_visualisation.svg width=500px align="center">

<mark>**Question:** Can you think of other examples or fields where data will often be imbalanced?</mark>

<details>

  <summary><span style="color:blue">Show examples</span></summary>

E.g.
    
* Identification of rare diseases
* Natural disasters
* Spam emails
* Machinery break-down
...
    
</details>

By examining the number of times each class occurs, you can see that the stroke dataset is also quite imbalanced.

In [None]:
stroke['stroke'].value_counts(normalize=True)

<a id = 'why'></a>

## Why imbalanced data can cause problems

Imbalanced data can pose challenges in machine learning, as models may develop a bias towards the dominant class. This occurs because the model has more instances of that class to learn from, which can lead it to prioritize the features of the more prevalent class. Consequently, the model can perform poorly when predicting the less prevalent class.

Let's demonstrate this by training a Support Vector Classifier (SVC) using scikit-learn.

In [None]:
categorical_columns = ['gender', 'ever_married', 'work_type', 'residence_type', 'smoking_status']
target = 'stroke'

def create_Xy(df, target_col):
    return (
        df.drop(columns=target_col),
        df[target_col]
    )

X, y = create_Xy(stroke, 
                 target_col=target)

X.shape, y.shape

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, 
                                                    random_state=42, 
                                                    stratify=y)

In [None]:
from sklearn.svm import SVC

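# Support vector classifier with default hyperparameters
model = SVC()

# One-hot encode the categorical columns. Note: ColumnTransformer uses
# remainder='drop' by default, so any columns not listed in
# categorical_columns are not passed on to the model.
onehot = ColumnTransformer([
    ('ct', OneHotEncoder(drop='first'), categorical_columns)
])

# Chain preprocessing and model so they are fitted together
pipeline = Pipeline(steps=[
    ('onehot', onehot),
    ('scaler', MinMaxScaler()),
    ('model', model)])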

pipeline.fit(X_train, y_train)
y_pred = pipeline.predict(X_test)

accuracy_score(y_test, y_pred)

At first glance, this looks like a great result, because the resulting accuracy is really high. 

However, the confusion matrix reveals a different picture:

In [None]:
from sklearn.metrics import ConfusionMatrixDisplay

ConfusionMatrixDisplay.from_estimator(
         pipeline, X_test, y_test)

Accuracy is defined as the number of correct predictions divided by the total number of predictions made. Therefore, if a model consistently predicts the majority class, it can attain a high accuracy score, especially when the majority class greatly outnumbers the minority class.

This is precisely the situation we observe here: our model consistently predicts 0. Consequently, it fails to identify any instances of the minority class (1), which is a serious problem when the minority class is the one we care about.
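
To see this effect in isolation, here is a minimal sketch on synthetic labels (not the stroke data). It assumes a made-up dataset with 5% positives and uses scikit-learn's `DummyClassifier` as a stand-in for a model that always predicts the majority class. The variable names are illustrative only.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, recall_score

# Synthetic labels (assumption): 950 negatives and 50 positives
y_demo = np.array([0] * 950 + [1] * 50)
X_demo = np.zeros((len(y_demo), 1))  # features are irrelevant for this baseline

# A baseline that always predicts the most frequent class
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X_demo, y_demo)
y_demo_pred = baseline.predict(X_demo)

demo_accuracy = accuracy_score(y_demo, y_demo_pred)
demo_recall = recall_score(y_demo, y_demo_pred)

print(f"Accuracy: {demo_accuracy:.2f}")         # 950 / 1000 = 0.95
print(f"Recall (class 1): {demo_recall:.2f}")   # 0 / 50 = 0.00
```

The accuracy equals the share of the majority class, while none of the positive cases are found.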

<a id = 'metric'></a>

## Choosing a model evaluation metric

Clearly, accuracy is not an appropriate metric for this problem, but how do you choose the right metric?

This depends on the business problem at hand:
- Do you want to predict **class labels or probabilities**? 

- What is the **cost of false positives** and **false negatives**? (E.g. how much would it cost to deny someone a loan when they are able to pay it back vs. giving someone a loan that they will default on?)
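
One way to make the cost question concrete is to weight the errors in a confusion matrix. The sketch below uses made-up labels and hypothetical costs (a false negative is assumed to be ten times as costly as a false positive). In a real project these numbers would have to come from the business problem.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Made-up labels and predictions, for illustration only
y_true_demo = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])
y_pred_demo = np.array([0, 0, 0, 0, 1, 1, 1, 0, 0, 0])

# For binary labels 0/1, confusion_matrix returns [[tn, fp], [fn, tp]]
tn, fp, fn, tp = (int(v) for v in confusion_matrix(y_true_demo, y_pred_demo).ravel())

# Hypothetical costs per error type (assumptions, not taken from any real case)
COST_FALSE_POSITIVE = 1
COST_FALSE_NEGATIVE = 10

total_cost = fp * COST_FALSE_POSITIVE + fn * COST_FALSE_NEGATIVE

print(f"False positives: {fp}, false negatives: {fn}")  # 2 and 3
print(f"Total cost: {total_cost}")                      # 2 * 1 + 3 * 10 = 32
```

When the two error types have very different costs, a metric that treats them equally is unlikely to reflect what matters.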

<mark>**Question:** What other things may you want to consider when choosing a metric?</mark>

<details>

  <summary><span style="color:blue">Show examples</span></summary>

E.g.
- Are both classes equally important or not?
- Is there an imbalance in the classes?
- Do you have a regression or classification problem?
    
</details>

A cheat sheet like the one below can be helpful, but ultimately it is down to you, and your understanding of the data and the subject matter, to make the final choice.

<img src=images/metrics.png width=700px align=center source=https://machinelearningmastery.com/framework-for-imbalanced-classification-projects/*>

[Source](https://machinelearningmastery.com/framework-for-imbalanced-classification-projects/)

## <mark>**Exercise: picking a metric**</mark>

1. Discuss in groups what could be a good metric to use for our stroke prediction model (using the flow chart above). 

2. Once you've decided on a metric, look it up on the [metrics page of Scikit-Learn](https://scikit-learn.org/stable/modules/classes.html#classification-metrics). Import it and then calculate the score for the predictions `y_pred` made earlier. 

In [None]:
# Add your code here

3. How does the performance with the new metric compare to the result of the *accuracy_score()* function?

<details>

  <summary><span style="color:blue">Show answer</span></summary>

An example of a suitable metric in this case could be the **F2 score**. This is because finding people who will have a stroke is more important (i.e. the positive class is more important). If we miss a stroke case, it will be more costly than if we falsely diagnose someone with a stroke (i.e. false negatives are more costly). 
    

```python
from sklearn.metrics import fbeta_score

fbeta_score(y_test, y_pred, beta=2)
```
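
For reference, the F-beta score combines precision $P$ and recall $R$ as

$$F_\beta = \frac{(1 + \beta^2) \cdot P \cdot R}{\beta^2 \cdot P + R}$$

so with $\beta = 2$, recall weighs more heavily than precision. The sketch below checks this formula against `fbeta_score` on a small set of made-up labels (the arrays are illustrative, not from the stroke data).

```python
import numpy as np
from sklearn.metrics import fbeta_score, precision_score, recall_score

# Made-up labels: 2 true positives, 1 false positive, 2 false negatives
y_true_demo = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
y_pred_demo = np.array([1, 1, 0, 0, 1, 0, 0, 0, 0, 0])

precision = precision_score(y_true_demo, y_pred_demo)  # 2 / 3
recall = recall_score(y_true_demo, y_pred_demo)        # 2 / 4

beta = 2
manual_f2 = (1 + beta**2) * precision * recall / (beta**2 * precision + recall)
sklearn_f2 = fbeta_score(y_true_demo, y_pred_demo, beta=beta)

print(f"Manual F2:  {manual_f2:.3f}")
print(f"sklearn F2: {sklearn_f2:.3f}")
```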

</details>

Scikit-learn also provides a `classification_report` function, which conveniently reports precision, recall, F1 score and support for each class:

In [None]:
from sklearn.metrics import classification_report

print(classification_report(y_test, y_pred,
                            zero_division='warn'
                            ))

<a id = 'cross-val'></a>

## Cross-validation on an imbalanced dataset

You can use cross-validation to give your model the opportunity to train on multiple train-test splits (e.g. K-fold). This gives you a better indication of how well your model will perform on unseen data.

<img src="images/crossvalidation.png" style="display: block;margin-left: auto;margin-right: auto;height: 200px"/>

In order to make sure the folds contain stratified data during cross-validation, you can use `StratifiedKFold`. 

<mark>**Question:** Why is it important to make sure our imbalanced data is stratified when performing K-fold cross-validation?</mark>
    

<details>

  <summary><span style="color:blue">Show answer</span></summary>

If you use plain (non-stratified) 10-fold cross-validation, for example, the data is randomly split into 10 folds without regard to the class labels. With a rare minority class, this makes it likely that one or more folds will have few or no examples from the minority class. As a result, some or perhaps many of the model evaluations will be misleading, as the model need only predict the majority class correctly.
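
As a rough illustration, the sketch below counts the minority-class samples that end up in each test fold, once with plain `KFold` and once with `StratifiedKFold`. It uses synthetic labels with an assumed 5% minority class, and the helper `minority_counts` is a hypothetical name introduced for this example only.

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

# Synthetic labels (assumption): 950 negatives and 50 positives, shuffled
rng = np.random.default_rng(0)
y_demo = np.array([0] * 950 + [1] * 50)
rng.shuffle(y_demo)
X_demo = np.zeros((len(y_demo), 1))  # placeholder features


def minority_counts(splitter):
    """Return the number of minority-class (1) samples in each test fold."""
    return [int(y_demo[test_idx].sum())
            for _, test_idx in splitter.split(X_demo, y_demo)]


plain_counts = minority_counts(KFold(n_splits=10, shuffle=True, random_state=0))
stratified_counts = minority_counts(
    StratifiedKFold(n_splits=10, shuffle=True, random_state=0))

print("KFold:          ", plain_counts)
print("StratifiedKFold:", stratified_counts)
```

With stratification, every fold receives the same share of the 50 positives. Without it, the count per fold depends on the random split, and the variation becomes more of a concern the rarer the minority class is.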
    
Note that a single run of [Stratified K-Fold](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.StratifiedKFold.html) may result in a noisy estimate of the model's performance, as different splits of the data might result in very different results. In such scenarios, [Repeated Stratified K-fold cross validation](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.RepeatedStratifiedKFold.html) can be used: it runs the process *n* times and reports the mean result across all *folds* from all *runs*. It has the benefit of providing more reliable results, at the cost of fitting and evaluating many more models. 
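
A minimal sketch of the repeated variant is shown below. It assumes a synthetic dataset from `make_classification` and a `LogisticRegression` model as stand-ins, so the scores say nothing about the stroke data. `zero_division=0` is passed to the scorer to silence warnings for folds without positive predictions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import fbeta_score, make_scorer
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score

# Synthetic imbalanced data (assumption): roughly 10% positives
X_demo, y_demo = make_classification(
    n_samples=500, weights=[0.9, 0.1], random_state=0)

# 5 folds repeated 3 times gives 15 fitted models and 15 scores
repeated_cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=3, random_state=0)
f2_scorer = make_scorer(fbeta_score, beta=2, zero_division=0)

demo_scores = cross_val_score(
    LogisticRegression(max_iter=1000), X_demo, y_demo,
    scoring=f2_scorer, cv=repeated_cv)

print(f"Number of scores: {len(demo_scores)}")
print(f"Mean F2: {np.mean(demo_scores):.3f} (std: {np.std(demo_scores):.3f})")
```

Reporting the standard deviation next to the mean gives an impression of how much the estimate varies between splits.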

</details>

In the code below, we demonstrate how to implement `StratifiedKFold` cross-validation.

The flow chart above suggests that the F2 score would be a suitable metric for this problem. It can be implemented through the `make_scorer` function.

In [None]:
from sklearn.metrics import make_scorer, fbeta_score
from sklearn.model_selection import StratifiedKFold, cross_val_score

cv = StratifiedKFold(n_splits=10)
ftwo_scorer = make_scorer(fbeta_score, beta=2)

scores = cross_val_score(pipeline, X_train, y_train, scoring=ftwo_scorer, cv=cv)

print(f"Mean score: {np.round(np.mean(scores), 3)}")

The mean F2 score is zero: the model never predicts a stroke, so recall is zero in every fold (and precision is undefined, which scikit-learn reports as zero).

Unlike the accuracy score we achieved previously, the [F2 score](https://machinelearningmastery.com/fbeta-measure-for-machine-learning/) suggests the model is not performing well.

However, in the next notebook you will learn some techniques to help your models perform better on imbalanced data.

## Summary

In this notebook, you saw how imbalanced data can cause problems when training a machine learning model. You also discussed the considerations to make when choosing an evaluation metric.