In [1]:
import numpy as np 
import pandas as pd 
import matplotlib.pyplot as plt

# Welcome to a full guide on Data Imbalance!

#### If you have worked on machine learning challenges or real-world tasks, you have very likely run into datasets that are highly imbalanced rather than perfectly balanced. 
#### This is part of why many experts say that the majority of the work in a machine learning task is data cleaning. 
#### Data imbalance is a crucial problem to spot, as it can lead to highly skewed results if left unattended. 
#### In my previous research, I also faced a skewed binary dataset where the class ratio was about 4:1, and I had to employ several strategies to mitigate the imbalance. 
#### In this guide, I will walk you through some of the options you can take to combat data imbalance.
#### Let's dive in then, shall we? 

<div style="width:100%;text-align: center;"> <img align=middle src="https://images.unsplash.com/photo-1523801999971-dbb0cc4f400e?w=500&auto=format&fit=crop&q=60&ixlib=rb-4.0.3&ixid=M3wxMjA3fDB8MHxzZWFyY2h8OHx8ZGl2ZSUyMGlufGVufDB8fDB8fHwy" style="width:25%;height:25%;margin:auto;"> </div>
<h5><center>Image from <a href="https://unsplash.com/s/photos/dive-in">unsplash</a></center></h5>

# 1. Introduction

#### Data imbalance is a phenomenon in classification problems where the data points are not distributed equally among the classes. 
#### For example, income distribution in society is highly imbalanced: very few people sit in the upper portion of the distribution, while a much larger group sits in the lower portions. 
#### In real life, data imbalance is common. In machine learning tasks, however, it can be a huge headache for engineers and scientists.  
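#### A quick way to see how imbalanced a dataset is, is to count the labels per class. The sketch below assumes a made-up binary label column with an 80/20 split (the names `labels`, `counts` and `imbalance_ratio` are only for illustration, not from any real dataset).

```python
import pandas as pd

# Hypothetical binary labels: 80 samples of class 0 and 20 samples of class 1
labels = pd.Series([0] * 80 + [1] * 20, name="label")

# Absolute count and relative share of each class
counts = labels.value_counts()
proportions = labels.value_counts(normalize=True)

# Ratio of the largest class to the smallest class
imbalance_ratio = counts.max() / counts.min()

print(counts)
print(proportions)
print(f"Imbalance ratio (majority : minority) = {imbalance_ratio:.1f} : 1")
```

#### The further this ratio is from 1, the more the majority class dominates the dataset.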

#### Let me give you an example where data imbalance can lead to harmful results. 
#### Loan approval is a common machine learning task and a very approachable one too, with large datasets available on Kaggle. 
#### Let's say, for instance, that 90% of the people in the dataset are of race A and 10% are of race B. 
#### Assume that, for reasons such as wealth distribution and geographical location, race A people have a higher approval rate. 
#### The underrepresentation of race B can worsen their loan outcomes: they may suffer a higher rate of loan rejection not because of their creditworthiness, but because the data is imbalanced. 
#### This is just one scenario where imbalanced datasets, if not given enough attention, can have a huge ripple effect on society and even affect individual lives. 
#### Thankfully, there are a few options we can deploy to combat this phenomenon. 

# 2. Methods to combat data imbalance

## 2.1 Using the right evaluation metric

#### This is probably the first thing that everyone faced with a machine learning task should think about: 
#### What is the most suitable evaluation metric to choose? I will talk more about evaluation metrics in another guide. 
#### With an imbalanced dataset, there are several evaluation metrics you can choose from that account for the imbalanced nature of the data. 
#### For example, with a 1% scam rate in a scam-or-not-scam dataset, a machine learning algorithm that yields 99% **accuracy** probably does not mean much. 
#### This is because a simple algorithm such as labelling everything as not scam would already yield 99% accuracy, due to the highly imbalanced nature of the dataset. 
#### Thus, in this case, accuracy is a misleading metric to use. 
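#### The scam example can be reproduced in a few lines. This is a minimal sketch on synthetic labels (10,000 made-up samples with exactly 1% scams); the variable names are assumptions for illustration only.

```python
import numpy as np

# Hypothetical labels: 1 = scam, 0 = not scam, with exactly 1% scams
n_samples = 10_000
y_true = np.zeros(n_samples, dtype=int)
y_true[:100] = 1

# "Simple algorithm": label everything as not scam
y_pred = np.zeros(n_samples, dtype=int)

# Accuracy is the share of predictions that match the true labels
accuracy = (y_true == y_pred).mean()

# Number of actual scams that the baseline managed to flag
scams_caught = int(((y_true == 1) & (y_pred == 1)).sum())

print(f"Accuracy: {accuracy:.2%}")
print(f"Scams caught: {scams_caught} out of {int(y_true.sum())}")
```

#### The baseline reaches 99% accuracy while catching none of the scams, which is exactly why accuracy alone tells us so little here.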
#### Now, some of the more useful metrics we can use in these scenarios include:

##### - Precision/Specificity/Recall/Sensitivity
##### - F1 score 
##### - AUC

### 2.1.1 Precision/Specificity/Recall/Sensitivity

$$\LARGE Precision = \frac{True \: Positives}{True \: Positives + False \: Positives}$$
$$\LARGE Specificity = \frac{True\:Negatives}{True\:Negatives + False \: Positives}$$

#### Precision measures how many of the positive predictions are actually correct, while Specificity measures how many of the actual negatives are correctly identified, that is, how well the algorithm avoids false positives, or false alarms. 
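#### To make the two formulas concrete, here is a small sketch that computes both from raw predictions. The helper `precision_and_specificity` and the toy labels are hypothetical and written only for this illustration; they assume binary labels where 1 is the positive class.

```python
import numpy as np

def precision_and_specificity(y_true, y_pred):
    """Compute precision and specificity for binary labels (1 = positive)."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)

    # Confusion-matrix counts needed by the two formulas
    tp = int(((y_true == 1) & (y_pred == 1)).sum())  # true positives
    fp = int(((y_true == 0) & (y_pred == 1)).sum())  # false positives
    tn = int(((y_true == 0) & (y_pred == 0)).sum())  # true negatives

    # Return NaN when a denominator is zero instead of raising an error
    precision = tp / (tp + fp) if (tp + fp) else float("nan")
    specificity = tn / (tn + fp) if (tn + fp) else float("nan")
    return precision, specificity

# Toy example: 2 actual positives and 8 actual negatives
y_true = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
y_pred = [1, 0, 1, 0, 0, 0, 0, 0, 0, 0]

precision, specificity = precision_and_specificity(y_true, y_pred)
print(f"Precision:   {precision:.3f}")
print(f"Specificity: {specificity:.3f}")
```

#### Here there is 1 true positive, 1 false positive and 7 true negatives, so Precision is 1/2 and Specificity is 7/8.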