<a href="https://colab.research.google.com/github/sureshmecad/Google-Colab/blob/master/10_Imbalance_Data.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

https://towardsdatascience.com/having-an-imbalanced-dataset-here-is-how-you-can-solve-it-1640568947eb

https://towardsdatascience.com/handling-imbalanced-datasets-in-machine-learning-7a0e84220f28

https://www.kdnuggets.com/2020/01/5-most-useful-techniques-handle-imbalanced-datasets.html

https://datascience.foundation/sciencewhitepaper/understanding-imbalanced-datasets-and-techniques-for-handling-them

https://www.analyticsvidhya.com/blog/2017/03/imbalanced-data-classification/

-------------

- **Binary Classification Problem:** A classification predictive modeling problem where all examples belong to **one of two classes.**

- **Multiclass Classification Problem:** A classification predictive modeling problem where all examples belong to **one of three or more classes.**

- Many real-world classification problems have an **imbalanced class distribution**, such as 
  
  1. Fraud detection

  2. Spam detection

  3. Churn prediction

  4. Claim Prediction

  5. Default Prediction

  6. Anomaly Detection

  7. Outlier Detection

  8. Intrusion Detection

  9. Conversion Prediction.

- **Ad Serving:** Click prediction datasets are imbalanced too, because clickthrough rates are typically low.

- **Content moderation:** Does a post contain NSFW content?


This problem is predominant in scenarios where anomaly detection is crucial like

- Identification of rare diseases such as cancer and tumours

- Electricity theft & pilferage

- Fraudulent transactions in banks

- Identifying customer churn (that is, which customers stop using a service)

- Natural disasters like earthquakes

- Spam emails, etc.

- So we define an **imbalanced dataset** as a dataset where the majority class is much larger than the minority class. There is no fixed threshold for how much larger the majority class has to be. Even when the **majority class is only twice the size of the minority class**, the dataset can still be considered imbalanced.

- Standard classifier algorithms like **Decision Tree and Logistic Regression** have a **bias** towards classes that have a larger number of instances. They tend to **predict only the majority class.** The features of the **minority class are treated as noise** and are often ignored. Thus, the minority class is much more likely to be misclassified than the majority class.

-----------

### What Is Data Imbalance?

### 1. Finance

- **Fraud detection datasets commonly have a fraud rate of ~1–2%**

- Data imbalance usually reflects an unequal distribution of classes within a dataset. For example, in a **credit card fraud detection** dataset, most of the transactions are **not fraud** and **very few are fraud transactions.** At a 2% fraud rate, this leaves us with roughly a 50:1 ratio between the non-fraud and fraud classes.

### 2. Transportation/Airline

- **Will Airplane failure occur?**

- Suppose that you are working in a company and you are asked to create a model that, based on various measurements at your disposal, predicts whether a product is **defective or not.** You decide to use your favourite classifier, train it on the data, and voilà: you get 96.2% accuracy!
Your boss is astonished and decides to use your model without any further tests. A few weeks later he enters your office and points out that your model is useless. Indeed, the model you created has not found a single defective product since it went into production.
After some investigation, you find out that only around 3.8% of the products made by your company are defective and that your model always answers “not defective”, which yields the 96.2% accuracy. This kind of “naive” result is due to the imbalanced dataset you are working with. The goal of this notebook is to review the different methods that can be used to tackle classification problems with imbalanced classes.

- In such cases, you get a **pretty high accuracy** just by **predicting the majority class, but you fail to capture the minority class,** which is most often the point of creating the model in the first place.
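The defective-product story above can be reproduced in a few lines. This is a minimal sketch with made-up labels (1000 products, 38 of them defective, chosen to match the 3.8% figure); the helper functions `accuracy` and `recall` are illustrative names, not part of any library.

```python
# Hypothetical data: 1000 products, 38 defective (3.8%), as in the story above.
y_true = [1] * 38 + [0] * 962   # 1 = defective, 0 = not defective

# A "model" that always predicts the majority class.
y_pred = [0] * len(y_true)


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def recall(y_true, y_pred, positive=1):
    """Fraction of actual positives that the model found."""
    actual_pos = sum(t == positive for t in y_true)
    true_pos = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    return true_pos / actual_pos if actual_pos else 0.0


print(f"accuracy: {accuracy(y_true, y_pred):.3f}")            # accuracy: 0.962
print(f"recall (defective): {recall(y_true, y_pred):.3f}")    # recall (defective): 0.000
```

Accuracy looks excellent while recall on the defective class is zero, which is why a minority-class metric should be checked alongside accuracy on imbalanced data.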

### 3. Healthcare / Medical

- **Does a patient have cancer?**

- Let us say we have a dataset of patients to be used in predictive modelling, and based on some inputs, the model will predict whether a patient is diagnosed with cancer or is healthy. The predicted value can be called a class, a target, or a dependent variable. As this is a classification problem, we will use 'class'. So, for this example, we have two class values: “Cancer” and “Healthy/No Cancer”.

- Let us suppose we have a dataset of 1000 patients, out of which 80 are cancer patients and the rest (920) are healthy. This is an example of an imbalanced dataset, as the majority class is about 11.5 times bigger than the minority class. Here the majority class is “Healthy” and the minority class is “Cancer”.
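A quick way to check how imbalanced a dataset is: count the labels and compare the largest class with the smallest. The sketch below uses the 920/80 split from the example above; the `labels` list is constructed by hand purely for illustration.

```python
from collections import Counter

# Toy labels matching the example: 920 healthy patients, 80 cancer patients.
labels = ["Healthy"] * 920 + ["Cancer"] * 80

counts = Counter(labels)
ordered = counts.most_common()            # classes sorted from most to least frequent
majority_class, majority_count = ordered[0]
minority_class, minority_count = ordered[-1]

imbalance_ratio = majority_count / minority_count
minority_share = minority_count / len(labels)

print(f"{majority_class}:{minority_class} ratio = {imbalance_ratio:.1f}:1")
print(f"minority share = {minority_share:.1%}")
```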

### 4. Email

- A typical example of imbalanced data is the e-mail classification problem, where emails are classified as ham or spam. The number of spam emails is usually lower than the number of relevant (ham) emails, so using the original distribution of the two classes leads to an imbalanced dataset.

-------------

## 1. Resampling Your Dataset

- You can add copies of instances from the under-represented class, called over-sampling (or, more formally, sampling with replacement).

- You can delete instances from the over-represented class, called under-sampling.
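Both ideas can be sketched in plain Python. The functions `random_oversample` and `random_undersample` below are hypothetical helpers written for this note (they are not a library API), and the toy data is invented; the sketch balances every class to the size of the largest or smallest class respectively.

```python
import random
from collections import Counter


def random_oversample(X, y, seed=0):
    """Duplicate randomly chosen rows (sampling with replacement)
    until every class is as large as the largest class."""
    rng = random.Random(seed)
    counts = Counter(y)
    target = max(counts.values())
    X_out, y_out = list(X), list(y)
    for label, count in counts.items():
        rows = [x for x, lab in zip(X, y) if lab == label]
        extra = rng.choices(rows, k=target - count)   # empty for the largest class
        X_out.extend(extra)
        y_out.extend([label] * len(extra))
    return X_out, y_out


def random_undersample(X, y, seed=0):
    """Keep a random subset (without replacement) of each class,
    sized to the smallest class."""
    rng = random.Random(seed)
    counts = Counter(y)
    target = min(counts.values())
    X_out, y_out = [], []
    for label in counts:
        rows = [x for x, lab in zip(X, y) if lab == label]
        X_out.extend(rng.sample(rows, target))
        y_out.extend([label] * target)
    return X_out, y_out


# Toy data: 90 majority rows (label 0) and 10 minority rows (label 1).
X = [[i] for i in range(100)]
y = [0] * 90 + [1] * 10

X_over, y_over = random_oversample(X, y)
X_under, y_under = random_undersample(X, y)
print("original:     ", Counter(y))
print("over-sampled: ", Counter(y_over))
print("under-sampled:", Counter(y_under))
```

Resampling is normally applied to the training split only, so that the evaluation data keeps the original class distribution.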

### Some Rules of Thumb
- Consider testing under-sampling when you have a lot of data (tens or hundreds of thousands of instances or more)

- Consider testing over-sampling when you don’t have a lot of data (tens of thousands of records or fewer)

- Consider testing random and non-random (e.g. stratified) sampling schemes.

- Consider testing different resampled ratios (e.g. you don’t have to target a 1:1 ratio in a binary classification problem, try other ratios)
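To experiment with ratios other than 1:1, the under-sampling step can take the target ratio as a parameter. The sketch below is one possible way to do it for a binary problem; `undersample_to_ratio`, its `ratio` argument, and the toy data are assumptions made for illustration.

```python
import random


def undersample_to_ratio(X, y, minority_label, ratio=2.0, seed=0):
    """Randomly drop majority rows so that majority:minority is at most
    `ratio`:1. Assumes a binary problem: every label that is not
    `minority_label` is treated as the majority class."""
    rng = random.Random(seed)
    minority = [(x, lab) for x, lab in zip(X, y) if lab == minority_label]
    majority = [(x, lab) for x, lab in zip(X, y) if lab != minority_label]
    n_keep = min(len(majority), int(round(ratio * len(minority))))
    combined = minority + rng.sample(majority, n_keep)
    rng.shuffle(combined)                      # avoid leaving the classes in blocks
    X_out = [x for x, _ in combined]
    y_out = [lab for _, lab in combined]
    return X_out, y_out


# Toy data: 950 majority rows (label 0) and 50 minority rows (label 1).
X = [[i] for i in range(1000)]
y = [1] * 50 + [0] * 950

results = {}
for ratio in (1.0, 2.0, 5.0):
    _, y_res = undersample_to_ratio(X, y, minority_label=1, ratio=ratio)
    results[ratio] = (y_res.count(0), y_res.count(1))
    print(f"ratio {ratio}: majority={y_res.count(0)}, minority={y_res.count(1)}")
```

Each candidate ratio can then be compared by training the same model on the resampled data and scoring it on an untouched validation set.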

## 2. Synthetic Samples

- SMOTE or the Synthetic Minority Over-sampling Technique

- As its name suggests, SMOTE is an oversampling method. It works by creating synthetic samples from the minority class instead of creating copies. The algorithm selects two or more similar instances (using a distance measure) and perturbs an instance one attribute at a time by a random amount within the difference to the neighboring instances.

- Decision trees often perform well on imbalanced datasets. The splitting rules that look at the class variable used in the creation of the trees can force both classes to be addressed.

- If in doubt, try a few popular decision tree algorithms like C4.5, C5.0, CART, and Random Forest.
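Returning to SMOTE, the interpolation step described above can be sketched in plain Python. This is a simplified illustration, not a reference implementation: `smote_sketch`, its parameters `n_new` and `k`, and the toy points are hypothetical, and the sketch assumes numeric features, Euclidean distance, and at least two minority rows. It follows the description above by drawing a separate random amount per attribute; a common variant draws one shared amount so that the new point lies on the line segment between the two instances.

```python
import math
import random


def smote_sketch(minority, n_new, k=3, seed=0):
    """Create `n_new` synthetic rows, each placed between a randomly chosen
    minority row and one of its k nearest minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest neighbours of `base` among the other minority rows.
        others = [p for p in minority if p is not base]
        neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbour = rng.choice(neighbours)
        # Perturb each attribute by a random fraction of its difference
        # to the neighbour (one random amount per attribute).
        synthetic.append(
            [b + rng.random() * (n - b) for b, n in zip(base, neighbour)]
        )
    return synthetic


# Toy minority class with two numeric features.
minority = [[1.0, 1.0], [1.2, 0.9], [0.8, 1.1], [1.1, 1.3], [0.9, 0.7]]
new_points = smote_sketch(minority, n_new=5)
for point in new_points:
    print([round(v, 3) for v in point])
```

Because every synthetic value lies between two existing minority values, the new rows stay inside the region the minority class already occupies rather than being exact duplicates.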