<h1 style="background-color:yellow;font-family:newtimeroman;font-size:350%;text-align:center;border-radius: 15px 50px;">Table of Contents</h1>


* [1. Introduction](#1)
    * [1.1 Aim of the notebook](#1.1)
    * [1.2 Libraries And Utilities](#1.2)
    * [1.3 Data Loading](#1.3)
* [2. Exploratory Data Analysis (EDA)](#2)
    * [2.1 Continuous Features](#2.1)
    * [2.2 Target Variable](#2.2)
* [3. Problems with highly imbalanced datasets](#3)
    * [3.1 Choice of metric](#3.1)
    * [3.2 Overview of the different methods](#3.2)
* [4. UnderBagging techniques (Ensemble Undersampling)](#4)
    * [4.1 EasyEnsemble from scratch](#4.1) 
    * [4.2 BalanceCascade from scratch](#4.2)
* [5. Conclusion](#5)

<a id="1"></a>
<h1 style="background-color:yellow;font-family:newtimeroman;font-size:350%;text-align:center;border-radius: 15px 50px;">Introduction</h1>

<a id="1.1"></a>
<h3 style="background-color:yellow;font-family:newtimeroman;font-size:200%;text-align:center;border-radius: 15px 50px;">Aim of the notebook</h3>

The goal of this notebook is to explain why and how to use **UnderBagging** techniques to deal with **highly imbalanced, large-scale, and noisy datasets.**

I ran this experiment on the well-known credit card fraud detection dataset, although it is not the ideal choice. It is small (only about 300,000 samples, whereas real-world datasets often contain millions) and it is not as imbalanced as real-world highly imbalanced datasets (where the ratio can reach 10^6:1). I will try to find another dataset to show how **UnderBagging outperforms many simple undersampling techniques.**

<a id="1.2"></a>
<h3 style="background-color:yellow;font-family:newtimeroman;font-size:200%;text-align:center;border-radius: 15px 50px;">Libraries And Utilities</h3>

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from xgboost import XGBClassifier
from sklearn.metrics import accuracy_score,recall_score,confusion_matrix, f1_score
from sklearn.model_selection import StratifiedKFold
from sklearn.linear_model import LogisticRegression
from random import randint

pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)

import warnings
warnings.filterwarnings('ignore')

import optuna
optuna.logging.set_verbosity(optuna.logging.WARNING)

from imblearn.under_sampling import RandomUnderSampler

In [None]:
hr=pd.read_csv('../input/hr-analytics-job-change-of-data-scientists/aug_train.csv')
stroke=pd.read_csv('../input/stroke-prediction-dataset/healthcare-dataset-stroke-data.csv')

<a id="1.3"></a>
<h3 style="background-color:yellow;font-family:newtimeroman;font-size:200%;text-align:center;border-radius: 15px 50px;">Data Loading</h3>

In [None]:
df=pd.read_csv('../input/creditcardfraud/creditcard.csv')
df=df.rename(columns={'Class':'target'})
print(df.shape)

In [None]:
features=[f'V{i}' for i in range(1,29)]
features.append('Amount')

<a id="2"></a>
<h1 style="background-color:yellow;font-family:newtimeroman;font-size:300%;text-align:center;border-radius: 15px 50px;">Exploratory Data Analysis</h1>

<a id="2.1"></a>
<h3 style="background-color:yellow;font-family:newtimeroman;font-size:200%;text-align:center;border-radius: 15px 50px;">Continuous Features</h3>

In [None]:
fig=plt.figure(figsize=(15, 10), facecolor='whitesmoke')

fig.suptitle("Continuous Features Distribution",x=0.5,y=0.95, fontsize="xx-large",fontweight="bold")

# One subplot per feature, stored in a list instead of dynamically named variables
axes=[]
for plot in range(1,13):
    
    ax=fig.add_subplot(4,3,plot)
    ax.set_facecolor("whitesmoke")
    ax.set_yticklabels([])
    ax.tick_params(axis='y', which='both',length=0)
    
    for direction in ["top","right", 'left']:
        ax.spines[direction].set_visible(False)
    
    axes.append(ax)

for ax,feature in zip(axes,features[:12]):
   
    # fill=True replaces the deprecated shade=True
    sns.kdeplot(df[feature], ax=ax, fill=True, color='gold', alpha=0.9, zorder=2)
    ax.grid(which='major', axis='x', zorder=0, color='gray', linestyle=':', dashes=(1,5))
    ax.set_ylabel(feature, fontsize=10, fontweight='bold').set_rotation(0)
    ax.set_xlabel('')
    ax.set_xlim(-5, 5)

<p style="text-align: center;"><span style='font-family: "Times New Roman", Times, serif; font-size: 24px;font-weight: bold;'>Exploring the relationship between continuous features and the target variable</span></p>

In [None]:
fig=plt.figure(figsize=(15, 10), facecolor='whitesmoke')

fig.suptitle("Continuous features distribution w.r.t. target variable",x=0.5,y=0.95, fontsize="xx-large",fontweight="bold")

# One subplot per feature, stored in a list instead of dynamically named variables
axes=[]
for plot in range(1,13):
    
    ax=fig.add_subplot(3,4,plot)
    ax.set_facecolor("whitesmoke")
    ax.set_yticklabels([])
    ax.tick_params(axis='y', which='both',length=0)
    
    for direction in ["top","right", 'left']:
        ax.spines[direction].set_visible(False)
    
    axes.append(ax)

for ax,feature in zip(axes,features[:12]):
   
    # Legitimate transactions in gold, frauds in dark orange (fill=True replaces the deprecated shade=True)
    sns.kdeplot(df[df.target==0][feature], ax=ax, fill=True, color='gold', alpha=0.9, zorder=2)
    sns.kdeplot(df[df.target==1][feature], ax=ax, fill=True, color='darkorange', alpha=0.9, zorder=2)
    ax.grid(which='major', axis='x', zorder=0, color='gray', linestyle=':', dashes=(1,5))
    ax.set_ylabel(feature, fontsize=10, fontweight='bold').set_rotation(0)
    ax.set_xlabel('')
    ax.set_xlim(-10, 10)

<p style="text-align: center;"><span style='font-family: "Times New Roman", Times, serif; font-size: 20px;'>It seems that there is a clear distinction between the distributions of the variables of each class.</span></p>

<a id="2.2"></a>
<h3 style="background-color:yellow;font-family:newtimeroman;font-size:200%;text-align:center;border-radius: 15px 50px;">Target Variable</h3>

In [None]:
ir=int(len(df[df.target==0])/len(df[df.target==1]))

fig=plt.figure(figsize=(15, 10), facecolor='whitesmoke')
plt.suptitle(f'Imbalance ratio {ir}:1',x=0.5,y=0.95, fontsize="xx-large",fontweight="bold")

ax1=fig.add_subplot(1,1,1)
ax1.set_facecolor("whitesmoke")

for direction in ["top","right", 'left']:
    ax1.spines[direction].set_visible(False)

sns.countplot(x=df.target,ax=ax1,color='gold')
plt.show()

<a id="3"></a>
<h1 style="background-color:yellow;font-family:newtimeroman;font-size:300%;text-align:center;border-radius: 15px 50px;">Problems with highly imbalanced datasets</h1>

<a id="3.1"></a>
<h3 style="background-color:yellow;font-family:newtimeroman;font-size:200%;text-align:center;border-radius: 15px 50px;">Choice of metric</h3>

In [None]:
# Let's create a simple model to highlight the problem with imbalanced datasets
X,y=df[features],df.target

kf=StratifiedKFold(n_splits=5)
accuracy=[]
recall=[]

for train_idx,test_idx in kf.split(X,y):
    
    X_train,y_train= X.iloc[train_idx],y.iloc[train_idx]
    X_test,y_test= X.iloc[test_idx],y.iloc[test_idx]

    model=LogisticRegression()
    model.fit(X_train,y_train)

    predictions=model.predict(X_test)
    
    accuracy.append(accuracy_score(y_test,predictions))
    recall.append(recall_score(y_test,predictions))

print(f'Model accuracy over the 5 folds: {np.round(np.mean(accuracy),7)}')

Judging by accuracy alone, our model seems to be doing pretty well. Unfortunately, it is terrible. To see why, let's create a naive baseline: a model that always predicts 0 (i.e. the transaction is legitimate).

In [None]:
# creating simple predictions
X,y=df[features],df.target

kf=StratifiedKFold(n_splits=5)
accuracy_baseline=[]
recall_baseline=[]

for train_idx,test_idx in kf.split(X,y):
    
    X_train,y_train= X.iloc[train_idx],y.iloc[train_idx]
    X_test,y_test= X.iloc[test_idx],y.iloc[test_idx]

    simple_preds=np.zeros(len(X_test))
    
    accuracy_baseline.append(accuracy_score(y_test,simple_preds))
    recall_baseline.append(recall_score(y_test,simple_preds))


print(f'Baseline accuracy over the 5 folds: {np.round(np.mean(accuracy_baseline),7)}')

Always predicting that the transaction is legitimate also gives a very high accuracy. In both cases, the accuracy hides a harsh reality. Let's look at the percentage of frauds detected by our model. For the baseline, of course, 0% of frauds are detected.

In [None]:
print(f'Percentage of frauds detected over the 5 folds: {np.round(np.mean(recall),7)}')

This 99% accuracy hides a harsh reality: our model has a fraud detection rate of only 60%. This fraud detection rate can be much worse when the dataset is very large and noisy. It highlights the fact that, for imbalanced datasets, the choice of evaluation metric is very important. Many factors influence that choice. Since the goal of this notebook is to present ensembles of undersampling techniques, we will not cover how to choose the best evaluation metric for your problem, but there are plenty of resources on that topic.
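To make the point concrete without the dataset, here is a minimal sketch in plain Python. The label counts are made up for illustration (they are not the credit card data), and the helpers `accuracy` and `recall` are hypothetical stand-ins for scikit-learn's `accuracy_score` and `recall_score`.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def recall(y_true, y_pred):
    """Fraction of true positives (frauds) that were predicted as positive."""
    positives = sum(y_true)
    true_positives = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    return true_positives / positives if positives else 0.0


# Illustrative labels only: 998 legitimate transactions and 2 frauds.
y_true = [0] * 998 + [1] * 2

# A "model" that always predicts the majority class.
always_legit = [0] * len(y_true)

print(accuracy(y_true, always_legit))  # 0.998
print(recall(y_true, always_legit))    # 0.0
```

The baseline is right 99.8% of the time and still misses every single fraud, which is why accuracy alone tells us very little here.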

<a id="3.2"></a>
<h3 style="background-color:yellow;font-family:newtimeroman;font-size:200%;text-align:center;border-radius: 15px 50px;">Overview of the different methods</h3>

There are two well-known ways of dealing with imbalanced datasets: oversampling and undersampling. In this part I will give a quick explanation of undersampling techniques. The goal is to give enough context for everyone to understand what ensemble undersampling techniques are and what they are used for.

<p style="text-align: center;"><span style='font-family: "Times New Roman", Times, serif; font-size: 24px;font-weight: bold;'>Undersampling techniques</span></p>

**Undersampling techniques are all the techniques that aim to reduce the number of majority samples**. I will show you a simple example called **Random Under-Sampling**, which reduces the number of majority samples by drawing a random subset of them. Many other techniques select the majority samples differently, such as Near Miss Undersampling, Condensed Nearest Neighbor, and Tomek Links. We will not cover them in this notebook, since its goal is to introduce UnderBagging techniques.

In [None]:
# Creating a RandomUnderSampler from scratch. I could have used the imblearn library,
# but just for fun we will reproduce it from scratch.

def rus(X_train: pd.DataFrame, y_train: pd.Series, sampling_strategy: float):
    
    """ Simple implementation of RandomUnderSampling.
    sampling_strategy is the minority/majority ratio wanted after resampling. """
    
    train=pd.concat([X_train,y_train],axis=1)
    
    train_maj=train[train.target==0]
    train_min=train[train.target==1]
    
    train_maj_rus=train_maj.sample(int(1/sampling_strategy*len(train_min)),random_state=randint(1,100000))
    
    train_rus=pd.concat([train_maj_rus,train_min])
    
    X_train_rus= train_rus.drop('target',axis=1)
    y_train_rus= train_rus.target
    
    return X_train_rus,y_train_rus

In [None]:
# Let's reuse our simple LogisticRegression but this time using random under sampling technique.
X,y=df[features],df.target

kf=StratifiedKFold(n_splits=5)

recall=[]
f1=[]

for train_idx,test_idx in kf.split(X,y):
    
    X_train,y_train= X.iloc[train_idx],y.iloc[train_idx]
    X_test,y_test= X.iloc[test_idx],y.iloc[test_idx]
    
    X_train_rus,y_train_rus= rus(X_train,y_train,sampling_strategy=0.5)

    model=LogisticRegression()
    model.fit(X_train_rus,y_train_rus)

    predictions=model.predict(X_test)
    
    f1.append(f1_score(y_test,predictions))
    recall.append(recall_score(y_test,predictions))


print(f'Percentage of frauds detected over the 5 folds: {np.round(np.mean(recall),7)}')
print(f'F1_score over the 5 folds: {np.round(np.mean(f1),7)}')

Here we go: random under-sampling increased the fraud detection rate. But of course, it also increased the number of legitimate transactions that our model classifies as frauds.
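The trade-off between recall and false alarms is what the F1 score captures. As a rough sketch, the function below derives precision, recall, and F1 from confusion-matrix counts. The counts are hypothetical and chosen only to show the effect; they are not results from this notebook, and `precision_recall_f1` is an illustrative helper, not a scikit-learn function.

```python
def precision_recall_f1(tp, fp, fn):
    """Precision, recall and F1 from true positives, false positives and false negatives."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical counts for 100 frauds, before and after under-sampling.
before = precision_recall_f1(tp=60, fp=10, fn=40)    # few false alarms, many missed frauds
after = precision_recall_f1(tp=90, fp=900, fn=10)    # most frauds caught, many false alarms

for name, (precision, recall, f1) in (("before", before), ("after", after)):
    print(f"{name}: precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

In this made-up scenario recall rises from 0.6 to 0.9 while F1 drops, because the extra false positives drag precision down. This is why the notebook reports both numbers.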

<p style="text-align: center;"><span style='font-family: "Times New Roman", Times, serif; font-size: 24px;font-weight: bold;'> Problem with simple undersampling techniques </span></p>

Although undersampling techniques tackle class imbalance by reducing the level of imbalance of the dataset, they have some drawbacks. They use only a subset of the majority class samples, so we lose the information carried by the ignored samples. Undersampling is also very sensitive to noise. The larger and noisier the training set, the more important it is to address these problems. That is exactly why ensembles of under-sampling techniques were created.
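The loss of information can be quantified with a short calculation. If each under-sample keeps `n_kept` of the `n_majority` majority samples, a given majority sample is missed by one draw with probability `1 - n_kept / n_majority`, so the expected fraction seen at least once over `n_rounds` independent draws is `1 - (1 - n_kept / n_majority) ** n_rounds`. The sketch below evaluates this with illustrative round numbers (not the exact sizes of this dataset); `coverage` is a hypothetical helper name.

```python
def coverage(n_majority, n_kept, n_rounds):
    """Expected fraction of majority samples drawn at least once
    over n_rounds independent random under-samples of size n_kept."""
    p_missed_once = 1 - n_kept / n_majority
    return 1 - p_missed_once ** n_rounds


# Illustrative sizes: 200,000 majority samples, 800 kept per under-sample.
n_majority, n_kept = 200_000, 800

for n_rounds in (1, 10, 100):
    share = coverage(n_majority, n_kept, n_rounds)
    print(f"{n_rounds:>3} under-sample(s): {share:.1%} of the majority class used")
```

A single under-sample uses well under 1% of the majority class in this setting. Repeating the draw and training one model per draw, as the ensemble methods below do, lets the final model see a much larger share of the data.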
        

<a id="4"></a>
<h1 style="background-color:yellow;font-family:newtimeroman;font-size:300%;text-align:center;border-radius: 15px 50px;">UnderBagging techniques</h1>

In this part, I will implement the EasyEnsemble and BalanceCascade algorithms (two UnderBagging techniques) from scratch. As mentioned above, these two algorithms were introduced to tackle the loss of information and the sensitivity to noisy samples that come with a simple undersampling technique.

<a id="4.1"></a>
<h3 style="background-color:yellow;font-family:newtimeroman;font-size:200%;text-align:center;border-radius: 15px 50px;"> EasyEnsemble from scratch </h3>

<p style="text-align: center;"><span style='font-family: "Times New Roman", Times, serif; font-size: 24px;font-weight: bold;'> Simple explanation of the algorithm </span></p>

The algorithm consists of repeating the random under-sampling strategy n times. Each time, we create a new subset of data (the result of the undersampling) and train a new classifier on it. At the end, our model is composed of n classifiers. To make a prediction, we simply average the predictions of the different classifiers. This is why this kind of algorithm is called UnderBagging.

Basically, I will:

* Randomly under-sample the training set.
* Create an xgboost model and tune its hyperparameters with optuna on this under-sampled training set.
* Save the model.
* Randomly under-sample the training set another time.
* Create a new xgboost model and tune its hyperparameters with optuna on this under-sampled training set.
* Save the model.
* And so on for n iterations.
* At the end, each of the xgboost models makes predictions on the test set.
* The final prediction is simply the average of the predictions of the different xgboost models.
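The steps above can be sketched end to end on toy data using only the standard library. This is a simplified illustration, not the implementation used in this notebook: `fit_toy_learner` is a hypothetical stand-in for a tuned xgboost model (a 1-D logistic curve centred between the two class means), and the data is synthetic.

```python
import math
import random


def undersample(X, y, ratio, rng):
    """Keep every minority sample (label 1) and a random subset of the majority
    class so that minority/majority is approximately `ratio`."""
    minority = [i for i, label in enumerate(y) if label == 1]
    majority = [i for i, label in enumerate(y) if label == 0]
    n_keep = min(len(majority), int(len(minority) / ratio))
    kept = minority + rng.sample(majority, n_keep)
    return [X[i] for i in kept], [y[i] for i in kept]


def fit_toy_learner(X, y):
    """Hypothetical weak learner: logistic curve centred between the class means."""
    mean0 = sum(x for x, label in zip(X, y) if label == 0) / y.count(0)
    mean1 = sum(x for x, label in zip(X, y) if label == 1) / y.count(1)
    midpoint = (mean0 + mean1) / 2
    sign = 1.0 if mean1 >= mean0 else -1.0
    return lambda x: 1 / (1 + math.exp(-sign * (x - midpoint)))


def easy_ensemble(X, y, n_estimators, ratio=0.5, seed=0):
    """Train one learner per independent random under-sample."""
    rng = random.Random(seed)
    return [fit_toy_learner(*undersample(X, y, ratio, rng)) for _ in range(n_estimators)]


def predict_proba(models, x):
    """Average the positive-class probabilities of all learners."""
    return sum(model(x) for model in models) / len(models)


# Synthetic 1-D data: 2000 majority points around 0, 20 minority points around 4.
data_rng = random.Random(42)
X = [data_rng.gauss(0, 1) for _ in range(2000)] + [data_rng.gauss(4, 1) for _ in range(20)]
y = [0] * 2000 + [1] * 20

models = easy_ensemble(X, y, n_estimators=5)
print(len(models), predict_proba(models, 5.0) > 0.5, predict_proba(models, -1.0) > 0.5)
```

Each learner sees all minority samples but a different slice of the majority class, and the averaged probability is thresholded at 0.5 exactly as in the cross-validation function further down.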


<p style="text-align: center;"><span style='font-family: "Times New Roman", Times, serif; font-size: 24px;font-weight: bold;'> EasyEnsemble algorithm </span></p>

In [None]:
# optuna tuning
def objective(trial: optuna.trial.Trial, X_train: pd.DataFrame, y_train: pd.Series):
    
    """ Simple function to tune the hyperparameters of each XGBoost model that will make up the final ensemble """
    
    # suggest_float(..., log=True) replaces the deprecated suggest_loguniform
    params={'lambda': trial.suggest_float('lambda', 1e-2, 5.0, log=True),
        'alpha': trial.suggest_float('alpha', 1e-2, 5.0, log=True),
        'colsample_bytree': trial.suggest_categorical('colsample_bytree', [0.6,0.7,0.8,0.9, 1.0]),
        'subsample': trial.suggest_categorical('subsample', [0.6,0.7,0.8,1.0]),
        'learning_rate': trial.suggest_categorical('learning_rate', [0.01,0.012,0.014,0.016,0.018, 0.02,0.05]),
        'n_estimators': trial.suggest_int('n_estimators',50,500),
        'max_depth': trial.suggest_categorical('max_depth', [2,3,5,7,9,11]),
        'random_state': trial.suggest_categorical('random_state', [24, 48,2021]),
        'min_child_weight': trial.suggest_int('min_child_weight', 1, 300),
        }
    
    f1=[]
    recall=[]
    
    kf= StratifiedKFold(n_splits=5)
     
    for train_idx,test_idx in kf.split(X_train,y_train):
        
        X_train_tuning,y_train_tuning= X_train.iloc[train_idx],y_train.iloc[train_idx]

        X_test_tuning,y_test_tuning= X_train.iloc[test_idx],y_train.iloc[test_idx]
        
        model=XGBClassifier(**params,eval_metric='auc',n_jobs=-1)
        
        model.fit(X_train_tuning,y_train_tuning)
        
        predictions=model.predict(X_test_tuning)
        
        f1.append(f1_score(y_test_tuning,predictions))
        
        #recall.append(recall_score(y_test_tuning,predictions))
        
    return np.mean(f1)

In [None]:
# create the xgboost model
def create_xgb(X_train: pd.DataFrame, y_train: pd.DataFrame):
    
    """ Takes as input the training set composed of X_train and y_train. 
    It returns an xgboost model tuned with the specified training set. """
    
    study=optuna.create_study(direction='maximize')
    study.optimize(lambda trial: objective(trial, X_train, y_train), n_trials=50)
    
    params=study.best_params
    
    model=XGBClassifier(**params,eval_metric='auc',n_jobs=-1)
    
    model.fit(X_train, y_train, verbose=0)
    
    return model

In [None]:
# the underbagging algorithm
def easyensemble(X_train: pd.DataFrame, y_train: pd.DataFrame, features: list, n_estimators: int):
    
    """ Simple implementation of easyensemble but with XGBoost as learners instead of AdaBoost.
    Takes as input a training set, the different features and the number of XGBoost model. """
        
    models=[]
    
    # range(n_estimators) trains exactly n_estimators models (range(1,n_estimators) trained one fewer)
    for estimator in range(n_estimators):
        
        undersampler= RandomUnderSampler(sampling_strategy=0.5,random_state=randint(0,100000))
        
        X_train_rus,y_train_rus= undersampler.fit_resample(X_train, y_train)
        
        models.append(create_xgb(X_train_rus,y_train_rus))
        
         
    return models

In [None]:
# cross validation function
def cross_val(df: pd.DataFrame, features: list, n_estimators: int):
    
    """ Cross validation function. """
    X,y=df[features],df.target
    
    kf= StratifiedKFold(n_splits=5)
    recall=[]
    f1=[]
    
    i=1
    
    for train_idx,test_idx in kf.split(X,y):
        
        X_train_tuning,y_train_tuning= X.iloc[train_idx],y.iloc[train_idx]

        X_test,y_test= X.iloc[test_idx],y.iloc[test_idx]
    
        models=easyensemble(X_train_tuning, y_train_tuning, features, n_estimators)
    
    
        y_preds_proba=0
    
        for model in models:
        
            y_preds_proba+=model.predict_proba(X_test)[:,1]
        
        y_preds_proba=y_preds_proba/len(models)
    
        
        predictions=(y_preds_proba>0.5).astype(int)
    
        
        recall.append(recall_score(y_test,predictions))
        f1.append(f1_score(y_test,predictions))
        
        print(f'Cross validation {i} done.')
        i+=1
    
        
    print(f'Percentage of frauds detected over the 5 folds: {np.round(np.mean(recall),7)}')
    print(f'F1_score over the 5 folds: {np.round(np.mean(f1),7)}')
    
    
    return models

In [None]:
models=cross_val(df, features, n_estimators=10)

As you can see, EasyEnsemble improves on my previous logistic regression model: the fraud detection rate is the same, but fewer legitimate transactions are predicted as fraudulent (check the F1 score). In large-scale and very noisy datasets, the performance gap between the EasyEnsemble model and the random under-sampler is expected to be much larger. The [conclusion](#5) summarizes why there is no significant difference between the performance of the EasyEnsemble algorithm and the simple logistic regression.

<a id="4.2"></a>
<h3 style="background-color:yellow;font-family:newtimeroman;font-size:200%;text-align:center;border-radius: 15px 50px;"> BalanceCascade from scratch </h3>

The algorithm consists of repeating the random under-sampling strategy n times. Each time, we create a new subset of data (the result of the undersampling) and train a new classifier on it. The only difference between BalanceCascade and EasyEnsemble is that, at each iteration, we drop a percentage of the correctly classified majority class samples from the training set. This reduces the redundant information in the majority class (i.e. majority class samples that are very easy to classify). At the end, our model is composed of n classifiers. To make a prediction, we simply average the predictions of the different classifiers.

Basically, I will:

* Randomly undersample the training set.
* Create and tune the hyperparameters of an xgboost model using optuna on this under-sampled training set.
* Save my model.
* Make predictions on all the majority class samples.
* Drop from the training set the majority class samples that are very easy to predict.
* Randomly under-sample the new training set another time.
* Create and tune the hyperparameters of a new xgboost model using optuna on this under-sampled training set.
* Save the model.
* Make predictions based on the two xgboost models on all my majority class samples.
* Drop from the training set the majority class samples that are very easy to predict.
* And so on and so forth while the number of majority class samples in the training set is higher than the number of minority class samples in the training set.
* At the end, each of my xgboost models will make predictions on the test set.
* The final prediction is simply the average of the predictions of the different xgboost models.
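How many majority samples survive each round can be worked out in advance. The sketch below assumes the constant keep rate `(n_min / n_maj) ** (1 / (n_estimators - 1))` and the `int(keep_rate * n + 1)` truncation used by the implementation in this section. The class counts are made up, and `cascade_schedule` is a hypothetical helper that only tracks sizes (no model is trained).

```python
def cascade_schedule(n_maj, n_min, n_estimators):
    """Majority-class size seen by each successive model in the cascade."""
    keep_rate = (n_min / n_maj) ** (1 / (n_estimators - 1))
    sizes = []
    while n_maj > n_min:
        sizes.append(n_maj)                 # one model is trained at this size
        n_maj = int(keep_rate * n_maj + 1)  # only the hardest samples are kept
    return sizes


# Illustrative counts: 10,000 majority samples, 100 minority samples.
print(cascade_schedule(n_maj=10_000, n_min=100, n_estimators=5))
# [10000, 3163, 1001, 317, 101]
```

With these numbers the majority class shrinks by a factor of about 0.316 per round and the loop stops after five models, once the majority class is no larger than the minority class. Note that the formula divides by `n_estimators - 1`, so it needs at least two estimators.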


<p style="text-align: center;"><span style='font-family: "Times New Roman", Times, serif; font-size: 24px;font-weight: bold;'> BalanceCascade algorithm </span></p>

In [None]:
# balancecascade algorithm
def balancecascade(train: pd.DataFrame, features: list, n_estimators: int):
    
    """ Simple implementation of the BalanceCascade algorithm. 
    It takes as input a training set, the features used 
    to train the model, and the number of estimators. The only small 
    trick is that instead of discarding all correctly classified samples 
    of the majority class, it discards a fixed fraction of the majority 
    class samples at each iteration. This process is not random: it 
    discards the samples that are easiest to predict.
    Thus, we do not lose all the correctly classified samples.
    """
    
    train_maj= train[train.target==0]
    train_min= train[train.target==1]
    
    n_maj= len(train_maj)
    n_min= len(train_min)
    
    ratio=n_min/n_maj
    
    keep_rate=np.power(ratio, 1/(n_estimators-1))
    
    n_models=0
    model_list=[]
    
    while len(train_maj)>len(train_min):
        
        train=pd.concat([train_maj,train_min],axis=0)
        
        X_train,y_train=train[features],train.target
        
        undersampler=RandomUnderSampler(sampling_strategy=1, random_state=randint(0,10000))
        X_train_rus,y_train_rus=undersampler.fit_resample(X_train,y_train)
        
        model_list.append(create_xgb(X_train_rus, y_train_rus))
                          
        y_probs=0
                          
        for model in model_list:
            
            y_probs+=model.predict_proba(train_maj[features])[:,1]
        
        y_probs=y_probs/len(model_list)
                          
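        # assign returns a new DataFrame, which avoids pandas' SettingWithCopyWarning
        train_maj=train_maj.assign(proba=y_probs)
                          
        # keep the hardest majority samples (highest predicted fraud probability)
        train_maj=train_maj.sort_values('proba',ascending=False)[:int(keep_rate*len(train_maj)+1)]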

        
        
    return model_list

In [None]:
# cross validation function
def cross_val(df: pd.DataFrame, features: list, n_estimators: int):
    
    """ Cross validation function. """
    X,y=df[features],df.target
    
    kf= StratifiedKFold(n_splits=5)
    recall=[]
    f1=[]
    
    i=1
    
    for train_idx,test_idx in kf.split(X,y):
        
        X_train_tuning,y_train_tuning= X.iloc[train_idx],y.iloc[train_idx]

        X_test,y_test= X.iloc[test_idx],y.iloc[test_idx]
        
        train=pd.concat([X_train_tuning,y_train_tuning],axis=1)
    
        models=balancecascade(train, features, n_estimators)
    

    
        y_preds_proba=0
    
        for model in models:
        
            y_preds_proba+=model.predict_proba(X_test)[:,1]
        
        y_preds_proba=y_preds_proba/len(models)
    
        
        predictions=(y_preds_proba>0.5).astype(int)
    
        
        recall.append(recall_score(y_test,predictions))
        f1.append(f1_score(y_test,predictions))
        
        print(f'Cross validation {i} done.')
        i+=1
    
        
    print(f'Percentage of frauds detected over the 5 folds: {np.round(np.mean(recall),7)}')
    print(f'F1_score over the 5 folds: {np.round(np.mean(f1),7)}')
    
    
    return models

In [None]:
models=cross_val(df, features, n_estimators=10)

As you can see, **BalanceCascade improves on my previous logistic regression model** (approximately the same recall but a much better F1 score). It also outperforms EasyEnsemble in terms of F1 score. I say "outperforms" because, in the real world, we cannot afford a model that produces too many false positives (i.e. legitimate transactions classified as frauds). In large-scale and very noisy datasets, the performance gap is expected to be much larger.

<a id="5"></a>
<h1 style="background-color:yellow;font-family:newtimeroman;font-size:300%;text-align:center;border-radius: 15px 50px;"> Conclusion </h1>

Unfortunately, this dataset is not the best one to show the performance of **UnderBagging techniques**. UnderBagging techniques are designed for highly imbalanced, very noisy, and large-scale datasets. They are also very useful when the classes overlap, which is not the case in this dataset. **However, there are many applications of Machine Learning where the dataset has the characteristics listed above.**

If this kind of notebook is useful for the community, I will make others focusing on more advanced single undersampling techniques and also on oversampling techniques.