# Dealing with imbalanced data


![](https://image.freepik.com/vetores-gratis/balanca-da-justica-equilibrio-de-peso_29937-3252.jpg)

## What is Imbalanced Data?
Imbalanced data typically refers to classification problems where the classes are not represented equally.

For example, you may have a 2-class (binary) classification problem with 100 instances (rows). A total of 80 instances are labeled with Class-1 and the remaining 20 instances are labeled with Class-2.

This is an imbalanced dataset and the ratio of Class-1 to Class-2 instances is 80:20 or more concisely 4:1.

You can have a class imbalance problem on two-class classification problems as well as multi-class classification problems. Most techniques can be used on either. [Continue reading...](https://machinelearningmastery.com/tactics-to-combat-imbalanced-classes-in-your-machine-learning-dataset/)
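
As a quick illustration of the 80:20 example above, the class counts and the imbalance ratio can be computed with `Counter`. The label list below is made up to match that example; it is not the ferry data used later.

```python
from collections import Counter

# Hypothetical labels matching the example: 80 rows of Class-1, 20 rows of Class-2
labels = ["Class-1"] * 80 + ["Class-2"] * 20

counts = Counter(labels)
majority_count = max(counts.values())
minority_count = min(counts.values())

# Ratio of majority to minority class size (80 / 20)
imbalance_ratio = majority_count / minority_count
print(counts)
print(f"Imbalance ratio: {imbalance_ratio:.0f}:1")
```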


## How will we deal with it?

- Changing metric
- Oversampling 
- Undersampling 


These are the methods we will use here. There are other methods that can be used as well.


* [Importing Packages & EDA](#1)
* [First Model](#2)
* [ROC AUC SCORE](#3)
* [Oversampling our Data](#4)
* [Undersampling our Data](#5)
* [Mix Oversampling with Undersampling](#6)
* [Comparing our Scores](#7)

## Importing Packages & EDA <a id="1"></a>

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from collections import Counter


plt.style.use('ggplot')
plt.rcParams.update({'font.size': 14})

#ML Models
from sklearn.svm import SVC, LinearSVC
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier, BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
#Metrics 
from sklearn.metrics import accuracy_score
from sklearn.metrics import confusion_matrix
from sklearn.metrics import roc_auc_score

#Plot
import scikitplot as skplt

import warnings
warnings.filterwarnings('ignore')

In [None]:
df = pd.read_csv('../input/passenger-list-for-the-estonia-ferry-disaster/estonia-passenger-list.csv')
df.head()

In [None]:
#Checking nan values
df.isnull().mean() * 100

In [None]:
#Plot How many Survived
ax = df['Survived'].value_counts().plot(kind='bar', figsize=(12,6), color=['skyblue', 'violet'], rot=0,
                                  title='How many Survived?')

As we can see, the number of people who died is much higher than the number who survived.

In [None]:
df['Sex'].value_counts().plot(kind='bar', color=['skyblue', 'violet'], rot=0, title='Number of Passengers by Sex', figsize=(16,6));

In [None]:
df.groupby(['Sex', 'Survived'])['Survived'].count().unstack().plot(kind='bar', color=['skyblue', 'violet'], rot=0, title='Survived by Sex',
                                                                      figsize=(16,6));

In [None]:
df.groupby(['Country', 'Survived'])['Survived'].count().unstack().plot(kind='bar', color=['skyblue', 'violet'], rot=45, title='Survived by Country',
                                                                      figsize=(16,6));
print(f'Number of countries in the data: {df["Country"].nunique()}')

16 countries is a lot for a feature that does not carry much information, so we will keep only the 3 most frequent countries and group the rest as 'Other'.

In [None]:
top_countries = df['Country'].value_counts().index[:3]  # the 3 most frequent countries
df['Country'] = df['Country'].where(df['Country'].isin(top_countries), 'Other')

In [None]:
#Plot again
df.groupby(['Country', 'Survived'])['Survived'].count().unstack().plot(kind='bar', color=['skyblue', 'violet'], rot=45, title='Survived by Country',
                                                                      figsize=(16,6));
print(f'Number of countries in the data: {df["Country"].nunique()}')

In [None]:
age_bins = [0, 10, 18, 30, 55, 100]
group_names = ['child', 'teenager', 'young adult', 'adult', 'elderly']
df['cat_age'] = pd.cut(df['Age'], age_bins, right=False, labels=group_names)

In [None]:
df['cat_age'].value_counts().plot(kind='bar', color='skyblue', rot=45, title='Passengers by category age', figsize=(16,6));

In [None]:
df.groupby(['cat_age', 'Survived'])['Survived'].count().unstack().plot(kind='bar', color=['skyblue', 'violet'], rot=45, figsize=(16, 6),
                                                                  title='Survived by category Age');

In [None]:
df_clean = df.drop(['PassengerId', 'Firstname', 'Lastname'], axis=1)

In [None]:
df_clean.head()

In [None]:
X = df_clean.drop(['Survived', 'cat_age'], axis=1)
X = pd.get_dummies(X, columns=['Country', 'Sex', 'Category'], drop_first=True)
y = df_clean['Survived']
X.head()

## First Model <a id="2"></a>

We will test our data with different models.

In [None]:
Seed = 12
knn = KNeighborsClassifier()
dt = DecisionTreeClassifier(random_state=Seed)
svc = SVC(gamma='auto', random_state=Seed)
ada = AdaBoostClassifier()
rf = RandomForestClassifier()
lr = LogisticRegression(max_iter=1000)
ls = LinearSVC()
bc = BaggingClassifier(estimator=DecisionTreeClassifier(), random_state=Seed)  # 'estimator' replaced 'base_estimator' in scikit-learn 1.2

In [None]:
model_list = [('KNeighborsClassifier', knn),
              ('DecisionTree', dt),
              ('SVC', svc),
              ('AdaBoost', ada),
              ('RandomForest', rf),
              ('LogisticRegression', lr),
              ('LinearSVC', ls),
              ('BaggingClassifier', bc)]

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=Seed)

In [None]:
y_test

In [None]:
%%time 
for name, model in model_list:
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    acc = accuracy_score(y_test, y_pred)
    print(f'Model: {name} test_acc: {acc * 100:.2f}%')

Our best score was 87.54%, which looks good. However, if we simply predict that everyone died, we get the same 87.54%, so our model is no better than a trivial baseline.

In [None]:
predict_all_died = accuracy_score(y_test, np.zeros(len(y_test)))
print(f'Accuracy score if we predict everyone died in your test data: {predict_all_died *100:.2f}%')
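
The same kind of baseline can also be produced with scikit-learn's `DummyClassifier`. The sketch below uses made-up labels (88 negatives, 12 positives) rather than the ferry data, so the numbers are only illustrative. It also shows why accuracy alone is misleading here: the baseline never finds a single positive case.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, recall_score

# Hypothetical imbalanced labels: 88 negatives (died) and 12 positives (survived)
y_demo = np.array([0] * 88 + [1] * 12)
X_demo = np.zeros((len(y_demo), 1))  # features are ignored by this baseline

# Always predicts the most frequent class seen during fit
baseline = DummyClassifier(strategy="most_frequent").fit(X_demo, y_demo)
y_hat = baseline.predict(X_demo)

baseline_acc = accuracy_score(y_demo, y_hat)   # share of correct predictions
baseline_recall = recall_score(y_demo, y_hat)  # share of positives that were found
print(f"accuracy: {baseline_acc:.2f}  recall: {baseline_recall:.2f}")
```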

## ROC AUC SCORE <a id="3"></a>

### What is AUC - ROC Curve?
The AUC-ROC curve is a performance measurement for classification problems at various threshold settings. ROC is a probability curve, and AUC represents the degree or measure of separability: it tells how capable the model is of distinguishing between classes. The higher the AUC, the better the model is at predicting 0s as 0s and 1s as 1s.
[Continue reading...](https://towardsdatascience.com/understanding-auc-roc-curve-68b2303cc9c5)
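
One way to read the AUC is as the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one, with ties counting as half. The helper `pairwise_auc` below is a hypothetical, from-scratch sketch of that definition on toy data, not the scikit-learn implementation.

```python
def pairwise_auc(y_true, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

# Toy data: two negatives and two positives
y_toy = [0, 0, 1, 1]
scores_toy = [0.1, 0.4, 0.35, 0.8]

auc_toy = pairwise_auc(y_toy, scores_toy)           # 3 of 4 pairs ranked correctly
auc_constant = pairwise_auc(y_toy, [0.0] * 4)       # every pair is a tie
print(auc_toy, auc_constant)
```

A constant prediction ties every pair, which is why predicting that everyone died cannot score above 0.5 on this metric.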

In [None]:
predict_all_died = roc_auc_score(y_test, np.zeros(len(y_test)))
print(f'ROC AUC SCORE if we predict everyone died in your test data: {predict_all_died*100:.2f}%')

This is a much better measure for our model: the all-died baseline now scores only 50%. Let's run our models again.

In [None]:
%%time 
for name, model in model_list:
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    acc = accuracy_score(y_test, y_pred)
    roc_auc = roc_auc_score(y_test, y_pred)
    print(f'Model: {name} test_acc: {acc * 100:.2f}% roc_auc_test: {roc_auc * 100:.2f}%')

![](https://pbs.twimg.com/media/D47GZ7tU4AAf4yI.jpg)
Again, we didn't beat the baseline, even after changing the metric.
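
A note on how the score is computed: the loop above passes hard 0/1 predictions to `roc_auc_score`. With hard labels the ROC curve has a single operating point, and the result equals the balanced accuracy. Passing scores or probabilities (for example from `predict_proba`) lets the metric evaluate every threshold. The sketch below illustrates the difference on synthetic data from `make_classification`; the data and the variable names are assumptions for illustration, not the ferry dataset.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import balanced_accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data (about 88% negatives), used only for illustration
X_demo, y_demo = make_classification(n_samples=1000, n_features=6,
                                     weights=[0.88, 0.12], random_state=12)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.3,
                                          stratify=y_demo, random_state=12)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

hard_labels = clf.predict(X_te)            # 0/1 predictions at the default threshold
pos_proba = clf.predict_proba(X_te)[:, 1]  # probability of the positive class

auc_from_labels = roc_auc_score(y_te, hard_labels)
auc_from_proba = roc_auc_score(y_te, pos_proba)
bal_acc = balanced_accuracy_score(y_te, hard_labels)

print(f"AUC from hard labels:   {auc_from_labels:.4f}")
print(f"Balanced accuracy:      {bal_acc:.4f}")
print(f"AUC from probabilities: {auc_from_proba:.4f}")
```

Models without `predict_proba`, such as `LinearSVC`, expose `decision_function`, whose output can be passed to `roc_auc_score` in the same way.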

## Oversampling our Data <a id="4"></a>

### What is Oversampling?
When one class is the underrepresented minority in the data sample, oversampling techniques may be used to duplicate its examples so that training sees a more balanced number of positive cases. Oversampling is used when the amount of data collected is insufficient. A popular oversampling technique is SMOTE (Synthetic Minority Over-sampling Technique), which creates synthetic samples by interpolating between a minority-class example and one of its nearest minority-class neighbours.
[Continue reading...](https://whatis.techtarget.com/definition/over-sampling-and-under-sampling)
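
To make the interpolation idea concrete, here is a minimal NumPy sketch of a SMOTE-style generator. The function `smote_like_samples` and the toy points are hypothetical and simplified for illustration; the real SMOTE implementation handles more cases.

```python
import numpy as np

def smote_like_samples(X_minority, n_new, k=3, seed=0):
    """Create n_new synthetic points between minority points and their k nearest neighbours."""
    rng = np.random.default_rng(seed)
    X_minority = np.asarray(X_minority, dtype=float)

    # Pairwise Euclidean distances between minority points
    diffs = X_minority[:, None, :] - X_minority[None, :, :]
    dist = np.linalg.norm(diffs, axis=2)
    np.fill_diagonal(dist, np.inf)  # a point is not its own neighbour

    # Indices of the k nearest minority neighbours of each point
    neighbours = np.argsort(dist, axis=1)[:, :k]

    # Pick a random base point and one of its neighbours for each new sample
    base_idx = rng.integers(0, len(X_minority), size=n_new)
    neigh_idx = neighbours[base_idx, rng.integers(0, k, size=n_new)]

    # Place the new point at a random position on the segment between the two
    gap = rng.random((n_new, 1))
    return X_minority[base_idx] + gap * (X_minority[neigh_idx] - X_minority[base_idx])

# Toy minority class with five 2-D points
X_minority = np.array([[1.0, 1.0], [1.5, 2.0], [2.0, 1.0], [2.5, 2.5], [3.0, 1.5]])
synthetic = smote_like_samples(X_minority, n_new=10)
print(synthetic.shape)
```

Because every new point lies on a segment between two existing minority points, the synthetic samples stay inside the region already covered by the minority class.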

To work with oversampling we will use imbalanced-learn (imported as `imblearn`), a package built on top of scikit-learn for working with imbalanced datasets.<br>
Read their doc:<br>
https://imbalanced-learn.readthedocs.io/en/stable/

In [None]:
# Oversampling
from imblearn.over_sampling import SMOTE, ADASYN
from imblearn.over_sampling import BorderlineSMOTE
from imblearn.over_sampling import SMOTENC

In [None]:
smote = SMOTE()
adasyn = ADASYN()
bl = BorderlineSMOTE()
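cat_idx = [i for i, col in enumerate(X.columns) if col != 'Age']  # after get_dummies, every column except 'Age' is a 0/1 dummy
smote_nc = SMOTENC(categorical_features=cat_idx, random_state=Seed)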

In [None]:
oversampling_list = [('SMOTE', smote),
                     ('ADASYN', adasyn),
                     ('BorderlineSMOTE', bl),
                     ('SMOTENC', smote_nc)]

In [None]:
# Create the validation set.
X, X_val, y, y_val = train_test_split(X, y, test_size=0.1, random_state=Seed)

In [None]:
#See our target
y.value_counts().plot(kind='bar', color='skyblue', rot=0, title='Data without oversampling');

In [None]:
# Now we will resample our data so both classes have the same number of samples.
X_resampled, y_resampled = smote.fit_resample(X, y)
y_resampled.value_counts().plot(kind='bar', color='skyblue', rot=0, title='Data with oversampling');

In [None]:
%%time
results = []
for sampler_name, sampler in oversampling_list:  # 'sampler_name' avoids reusing the package name 'imblearn'
    X_resampled, y_resampled = sampler.fit_resample(X, y)
    X_train, X_test, y_train, y_test = train_test_split(X_resampled, y_resampled, test_size=0.2, random_state=Seed)
    for name, model in model_list:
        model.fit(X_train, y_train)
        y_pred = model.predict(X_test)
        y_pred_val = model.predict(X_val)
        acc = round(accuracy_score(y_test, y_pred), 4)
        acc_val = round(accuracy_score(y_val, y_pred_val), 4)
        roc_test = round(roc_auc_score(y_test, y_pred), 4)
        roc_val = round(roc_auc_score(y_val, y_pred_val), 4)
        results.append({'Method': sampler_name, 'Model': name, 'test_acc': acc, 'val_acc': acc_val, 'roc_test': roc_test, 'roc_val': roc_val})

In [None]:
results = pd.DataFrame(results)

In [None]:
results.sort_values(by='roc_val', ascending=False).head(10)

Accuracy has decreased, but our ROC AUC score has increased.

## Undersampling our Data <a id="5"></a>

### What is undersampling?

Undersampling is simply the opposite of oversampling: we reduce the number of samples in the majority class. It is not recommended for small datasets like ours, but we will try it anyway.
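
The simplest form, random undersampling, can be sketched in a few lines of NumPy: keep every minority row and draw an equally sized random subset of the majority rows. The arrays below are made up for illustration and are not the ferry data.

```python
import numpy as np

rng = np.random.default_rng(12)

# Hypothetical imbalanced data: 88 majority rows (0) and 12 minority rows (1)
y_demo = np.array([0] * 88 + [1] * 12)
X_demo = rng.normal(size=(len(y_demo), 3))

minority_idx = np.flatnonzero(y_demo == 1)
majority_idx = np.flatnonzero(y_demo == 0)

# Draw as many majority rows as there are minority rows, without replacement
kept_majority = rng.choice(majority_idx, size=len(minority_idx), replace=False)

keep = np.sort(np.concatenate([minority_idx, kept_majority]))
X_under, y_under = X_demo[keep], y_demo[keep]
print(np.bincount(y_under))
```

Here 76 of the 88 majority rows are discarded, which shows how quickly undersampling shrinks a small dataset.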

In [None]:
#Undersampling

from imblearn.under_sampling import ClusterCentroids
from imblearn.under_sampling import RandomUnderSampler
from imblearn.under_sampling import EditedNearestNeighbours 
from imblearn.under_sampling import RepeatedEditedNearestNeighbours 
from imblearn.under_sampling import AllKNN
from imblearn.under_sampling import CondensedNearestNeighbour
from imblearn.under_sampling import OneSidedSelection
from imblearn.under_sampling import NeighbourhoodCleaningRule

In [None]:
cc = ClusterCentroids(random_state=Seed)
rus = RandomUnderSampler(random_state=Seed)
enn = EditedNearestNeighbours()
renn = RepeatedEditedNearestNeighbours()
allknn = AllKNN()
cnn = CondensedNearestNeighbour(random_state=Seed)
oss = OneSidedSelection(random_state=Seed)
ncr = NeighbourhoodCleaningRule()

In [None]:
undersampling_list = [('ClusterCentroids', cc),
                      ('RandomUnderSampler', rus),
                      ('EditedNearestNeighbours', enn),
                      ('RepeatedEditedNearestNeighbours', renn),
                      ('AllKNN', allknn),
                      ('CondensedNearestNeighbour', cnn),
                      ('OneSidedSelection', oss),
                      ('NeighbourhoodCleaningRule', ncr)]

In [None]:
y.value_counts().plot(kind='bar', color='skyblue', rot=0, title='Data without undersampling');

In [None]:
X_resampled, y_resampled = cc.fit_resample(X, y)
y_resampled.value_counts().plot(kind='bar', color='skyblue', rot=0, title='Data with undersampling');

In [None]:
%%time
results_under = []
for sampler_name, sampler in undersampling_list:  # 'sampler_name' avoids reusing the package name 'imblearn'
    X_resampled, y_resampled = sampler.fit_resample(X, y)
    X_train, X_test, y_train, y_test = train_test_split(X_resampled, y_resampled, test_size=0.2, random_state=Seed)
    for name, model in model_list:
        model.fit(X_train, y_train)
        y_pred = model.predict(X_test)
        y_pred_val = model.predict(X_val)
        acc = round(accuracy_score(y_test, y_pred), 4)
        acc_val = round(accuracy_score(y_val, y_pred_val), 4)
        roc_test = round(roc_auc_score(y_test, y_pred), 4)
        roc_val = round(roc_auc_score(y_val, y_pred_val), 4)
        results_under.append({'Method': sampler_name, 'Model': name, 'test_acc': acc, 'val_acc': acc_val, 'roc_test': roc_test, 'roc_val': roc_val})

In [None]:
test_under = pd.DataFrame(results_under)
test_under.head()

In [None]:
test_under.sort_values(by=['roc_val', 'roc_test'], ascending=[False, False])[:10]

We got a better result, but undersampling reduced our data too much, so the result may owe more to luck than to a good model.

## Mix Oversampling with Undersampling <a id="6"></a>
We can also combine oversampling and undersampling methods.

In [None]:
%%time
mixed = []
for over_name, over_sampler in oversampling_list:
    # Oversample first, then clean or reduce the result with each undersampler
    X_resampled, y_resampled = over_sampler.fit_resample(X, y)
    for under_name, under_sampler in undersampling_list:
        X_resampled1, y_resampled1 = under_sampler.fit_resample(X_resampled, y_resampled)
        X_train, X_test, y_train, y_test = train_test_split(X_resampled1, y_resampled1, test_size=0.2, random_state=Seed)
        for name, model in model_list:
            model.fit(X_train, y_train)
            y_pred = model.predict(X_test)
            y_pred_val = model.predict(X_val)
            acc = round(accuracy_score(y_test, y_pred), 4)
            acc_val = round(accuracy_score(y_val, y_pred_val), 4)
            roc_test = round(roc_auc_score(y_test, y_pred), 4)
            roc_val = round(roc_auc_score(y_val, y_pred_val), 4)
            mixed.append({'Method 1': over_name, 'Method 2': under_name, 'Model': name, 'test_acc': acc, 'val_acc': acc_val, 'roc_test': roc_test, 'roc_val': roc_val})

In [None]:
mixed = pd.DataFrame(mixed)
mixed.head()

In [None]:
mixed.sort_values(by=['roc_val', 'roc_test'], ascending=[False, False])[:10]

## Comparing our Scores <a id="7"></a>

In [None]:
fig, (axes1, axes2) = plt.subplots(1, 2, figsize=(16, 6))

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=Seed)
dt.fit(X_train, y_train)
y_probas = dt.predict_proba(X_val)
y_pred = dt.predict(X_val)
acc = accuracy_score(y_val, y_pred )
skplt.metrics.plot_roc(y_val, y_probas, cmap='cool', plot_micro=False, plot_macro=False,ax= axes1)
skplt.metrics.plot_confusion_matrix(y_val, y_pred, ax=axes2)
fig.suptitle(f'Results without sampling Data and DecisionTree  Accuracy Validation Test: {acc *100:.2f}%\n', fontsize=20) 
plt.show()

In [None]:
fig, (axes1, axes2) = plt.subplots(1, 2, figsize=(16, 6))

X_resampled, y_resampled = smote_nc.fit_resample(X, y)
X_train, X_test, y_train, y_test = train_test_split(X_resampled, y_resampled, test_size=0.2, random_state=Seed)
ada.fit(X_train, y_train)
y_probas = ada.predict_proba(X_val)
y_pred = ada.predict(X_val)
acc = accuracy_score(y_val, y_pred )
skplt.metrics.plot_roc(y_val, y_probas, cmap='cool', plot_micro=False, plot_macro=False,ax= axes1)
skplt.metrics.plot_confusion_matrix(y_val, y_pred, ax=axes2)
fig.suptitle(f'Results with SMOTENC and AdaBoost  Accuracy Validation Test: {acc *100:.2f}%\n', fontsize=20)
plt.show()

In [None]:
fig, (axes1, axes2) = plt.subplots(1, 2, figsize=(16, 6))

X_resampled, y_resampled = renn.fit_resample(X, y)
X_train, X_test, y_train, y_test = train_test_split(X_resampled, y_resampled, test_size=0.2, random_state=Seed)
ada.fit(X_train, y_train)
y_probas = ada.predict_proba(X_val)
y_pred = ada.predict(X_val)
acc = accuracy_score(y_val, y_pred )
skplt.metrics.plot_roc(y_val, y_probas, cmap='cool', plot_micro=False, plot_macro=False,ax= axes1)
skplt.metrics.plot_confusion_matrix(y_val, y_pred, ax=axes2)
fig.suptitle(f'Results with RepeatedEditedNearestNeighbours and AdaBoost  Accuracy Validation Test: {acc *100:.2f}%\n', fontsize=20)
plt.show()

In [None]:

fig, (axes1, axes2) = plt.subplots(1, 2, figsize=(16, 6))

X_resampled, y_resampled = smote_nc.fit_resample(X, y)
X_resampled, y_resampled = enn.fit_resample(X_resampled, y_resampled)
X_train, X_test, y_train, y_test = train_test_split(X_resampled, y_resampled, test_size=0.2, random_state=Seed)
lr.fit(X_train, y_train)
y_probas = lr.predict_proba(X_val)
y_pred = lr.predict(X_val)
acc = accuracy_score(y_val, y_pred)
skplt.metrics.plot_roc(y_val, y_probas, cmap='cool', plot_micro=False, plot_macro=False,ax= axes1)
skplt.metrics.plot_confusion_matrix(y_val, y_pred, ax=axes2)
fig.suptitle(f'Results with SMOTENC, EditedNearestNeighbours and LogisticRegression  Accuracy Validation Test: {acc *100:.2f}%\n', fontsize=20)
plt.show()