# <center>Diving deep with Imbalanced Data</center>

Consider the following situation - 

You are working on your dataset. You create a classification model and get 90% accuracy immediately. The results seem fantastic to you. You dive a little deeper and discover that almost the entirety of the data belongs to one class. Damn! Imbalanced data can cause you a lot of frustration.

You feel very frustrated when you discover that your data has imbalanced classes and that all of the great results you thought you were getting turn out to be a lie. What is even more frustrating is that many good books don't cover this topic in a holistic manner.

This is an example of an **imbalanced** dataset and the frustrating results it can cause.

In this tutorial, you will discover the techniques that you can use to deliver good results on datasets with imbalanced data. Specifically, you will cover:

- What is imbalanced data?
- Challenges faced with imbalanced datasets
- Approaches for handling imbalanced data
- Further reading on the topic

Let's first see what imbalanced data is.

## What is imbalanced data?

Imbalanced data typically refers to a problem with classification tasks where the classes are not represented equally.

For example, you may have a binary classification problem with 100 instances. A total of 80 instances are labeled with Class-1 and the remaining 20 instances are labeled with Class-2.

This is an imbalanced dataset and the ratio of Class-1 to Class-2 instances is 80:20 or more concisely 4:1.
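As a quick illustration of the arithmetic, the sketch below counts the labels and reduces the counts to their simplest ratio. The label list and the helper name `class_ratio` are made up for this example; with real data you would pass in your own label column.

```python
from collections import Counter
from functools import reduce
from math import gcd

def class_ratio(labels):
    """Return per-class counts and the counts reduced to their simplest ratio."""
    counts = Counter(labels)
    divisor = reduce(gcd, counts.values())
    ratio = {cls: n // divisor for cls, n in counts.items()}
    return counts, ratio

# Hypothetical labels matching the example: 80 of Class-1 and 20 of Class-2.
labels = ["Class-1"] * 80 + ["Class-2"] * 20
counts, ratio = class_ratio(labels)

print(dict(counts))  # {'Class-1': 80, 'Class-2': 20}
print(ratio)         # {'Class-1': 4, 'Class-2': 1}
```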

You can have a class imbalance problem on two-class classification problems as well as multi-class classification problems. Most techniques discussed in this tutorial can be used on either.

The following image also illustrates imbalanced data:

<img src = "https://www.kdnuggets.com/wp-content/uploads/imbalanced-data-1.png"></img>

[Source](https://www.kdnuggets.com/wp-content/uploads/imbalanced-data-1.png)

Be it a Kaggle competition or a real test dataset, class imbalance is one of the most common problems you will encounter.

Most real-world classification problems display some level of class imbalance, which is when each class does not make up an equal portion of your dataset. It is important to choose your metrics and methods to match your goals. If this is not done, you may end up optimizing for a metric that is meaningless in the context of your use case.

There are problems where a class imbalance is not just common, it is expected. For example, datasets that characterize fraudulent transactions are imbalanced. The vast majority of the transactions will be in the “Not-Fraud” class and a very small minority will be in the “Fraud” class.

Another example is customer churn datasets, where the vast majority of customers stay with the service (the “No-Churn” class) and a small minority cancel their subscription (the “Churn” class).

Even a modest class imbalance, like the 4:1 ratio in the example above, can cause problems.

Let's now take a look at the challenges faced with imbalanced datasets.

## Challenges faced with imbalanced datasets:

One of the main challenges faced by the utility industry today is _electricity theft_. Electricity theft is the third largest form of theft worldwide. Utility companies are increasingly turning towards advanced analytics and machine learning algorithms to identify consumption patterns that indicate theft.

However, one of the biggest stumbling blocks is the sheer volume of data and its distribution. Fraudulent transactions are far rarer than normal, healthy transactions, accounting for around 1-2% of the total number of observations. The goal is to improve identification of the rare minority class as opposed to achieving higher overall accuracy.

Some examples of imbalanced datasets:

- Datasets to identify customer churn, where a vast majority of customers will continue using the service. A specific example is telecommunication companies where the churn rate is lower than 2%.
- Datasets to identify rare diseases in medical diagnostics.
- Datasets to predict natural disasters such as earthquakes.

Machine learning algorithms tend to produce unsatisfactory classifiers when faced with imbalanced datasets. For any imbalanced data set, if the event to be predicted belongs to the minority class and the event rate is less than 5%, it is usually referred to as a rare event.

Before studying the approaches, let's first learn about the **Accuracy Paradox**, which is very relevant for this topic.

### Accuracy Paradox:

The accuracy paradox is the name for the exact situation in the introduction to this post.

It is the case where your accuracy measures tell the story that you have excellent accuracy (such as 90%), but the accuracy is only reflecting the underlying class distribution.

It is very common, because classification accuracy is often the first measure you use when evaluating models on your classification problems. 

**What is going on in your models when you train on an imbalanced dataset?**

As you might have guessed, the reason you get 90% accuracy on an imbalanced dataset (with 90% of the instances in Class-1) is that your models look at the data and cleverly decide that the best thing to do is to always predict “Class-1” and achieve high accuracy.

This is best seen when using a simple rule-based algorithm. If you print out the rule in the final model, you will see that it is very likely predicting one class regardless of the data it is asked to predict.

Instead, a properly calibrated method may achieve a lower accuracy, but would have a substantially higher true positive rate (or recall), which is really the metric you should have been optimizing for.
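The paradox is easy to reproduce without any library. The sketch below is a hypothetical baseline (the names `majority_class_baseline`, `accuracy` and `recall` are chosen for this example) that always predicts the most common label. On made-up labels with a 90:10 split, it scores 90% accuracy while never finding a single minority instance.

```python
from collections import Counter

def majority_class_baseline(train_labels):
    """Return a predictor that ignores its input and always predicts the most common label."""
    majority = Counter(train_labels).most_common(1)[0][0]
    return lambda _features=None: majority

def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def recall(y_true, y_pred, positive):
    """Fraction of the truly positive instances that were predicted positive."""
    predictions_for_positives = [p for t, p in zip(y_true, y_pred) if t == positive]
    return sum(p == positive for p in predictions_for_positives) / len(predictions_for_positives)

# Hypothetical labels: 90% Class-1 and 10% Class-2.
y_true = ["Class-1"] * 90 + ["Class-2"] * 10
predict = majority_class_baseline(y_true)
y_pred = [predict() for _ in y_true]

print(accuracy(y_true, y_pred))           # 0.9
print(recall(y_true, y_pred, "Class-2"))  # 0.0
```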

The following section covers some of the approaches for tackling imbalanced datasets.

## Approaches for handling imbalanced data:

### Evaluation metric: 

Accuracy is not the metric to use when working with an imbalanced dataset. We have seen that it is misleading.

There are metrics that have been designed to tell you a more truthful story when working with imbalanced classes. Let's go through them one by one.

**Confusion matrix**: The performance of a classification algorithm is summarized by the confusion matrix, which contains information about the actual and the predicted classes.

<img src = "https://s3-ap-south-1.amazonaws.com/av-blog-media/wp-content/uploads/2017/03/16142827/ICP1.png"></img>

[Source](https://s3-ap-south-1.amazonaws.com/av-blog-media/wp-content/uploads/2017/03/16142827/ICP1.png)
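To make the four cells of the matrix concrete, here is a minimal sketch that tallies them from two lists of labels. The function name `confusion_counts` and the toy labels are made up for illustration, and "fraud" is assumed to be the positive class.

```python
def confusion_counts(y_true, y_pred, positive):
    """Count TP, FP, FN and TN, treating `positive` as the positive class."""
    tp = fp = fn = tn = 0
    for actual, predicted in zip(y_true, y_pred):
        if predicted == positive:
            if actual == positive:
                tp += 1  # predicted positive, actually positive
            else:
                fp += 1  # predicted positive, actually negative
        else:
            if actual == positive:
                fn += 1  # predicted negative, actually positive
            else:
                tn += 1  # predicted negative, actually negative
    return {"TP": tp, "FP": fp, "FN": fn, "TN": tn}

# Hypothetical toy labels with "fraud" as the positive (minority) class.
y_true = ["fraud", "ok", "ok", "fraud", "ok", "ok"]
y_pred = ["fraud", "ok", "fraud", "ok", "ok", "ok"]

print(confusion_counts(y_true, y_pred, positive="fraud"))
# {'TP': 1, 'FP': 1, 'FN': 1, 'TN': 3}
```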

**Precision**: Precision is the number of True Positives divided by the sum of True Positives and False Positives. Put another way, it is the number of correct positive predictions divided by the total number of positive predictions made. It is also called the Positive Predictive Value (PPV).

Precision can be thought of as a measure of a classifier's *exactness*. A low precision indicates a large number of False Positives.

**Recall**: Recall is the number of True Positives divided by the sum of True Positives and False Negatives. Put another way, it is the number of correct positive predictions divided by the number of positive class values in the test data. It is also called Sensitivity or the True Positive Rate.

Recall can be thought of as a measure of a classifier's *completeness*. A low recall indicates many False Negatives.

Now, you will look at these two terms in a bit more detail with an example.

**An example illustrating Precision and Recall**:

The [breast cancer dataset](http://archive.ics.uci.edu/ml/datasets/Breast+Cancer) is a standard machine learning dataset. It contains 9 attributes describing 286 women that have suffered and survived breast cancer and whether or not breast cancer recurred within 5 years.

It is a binary classification problem. Of the 286 women, 201 did not suffer a recurrence of breast cancer, leaving the remaining 85 that did.

False Negatives are probably worse than False Positives for this problem. More detailed screening can clear the False Positives, but False Negatives are sent home and lost to follow-up evaluation. 

A model that only predicted no recurrence of breast cancer would achieve an accuracy of (201/286) * 100 or 70.28%. Call this <b>All No Recurrence</b>. This looks like a reasonably high accuracy, but it is a terrible model. If it were used alone for decision support to inform doctors, it would send home 85 women incorrectly thinking their breast cancer was not going to recur (high False Negatives).

A model that only predicted the recurrence of breast cancer would achieve an accuracy of (85/286) * 100 or 29.72%. Call this <b>All Recurrence</b>. This model has terrible accuracy and would leave 201 women thinking that they had a recurrence of breast cancer when they really didn’t (high False Positives).

In terms of the **confusion matrix**, a perfect classifier would correctly predict 201 no recurrence and 85 recurrence, which would be entered into the top left cell, no recurrence/no recurrence (True Negatives), and the bottom right cell, recurrence/recurrence (True Positives).

But most of the time that's not the case. Let's see the two confusion matrices of All No Recurrence and All Recurrence:

**All No Recurrence**:
<img src = "https://3qeqpr26caki16dnhd19sv6by6v-wpengine.netdna-ssl.com/wp-content/uploads/2014/03/no_recurrence_confusion_matrix.png">
**All Recurrence**:
<img src = "https://3qeqpr26caki16dnhd19sv6by6v-wpengine.netdna-ssl.com/wp-content/uploads/2014/03/recurrence_confusion_matrix.png">

[Source](https://3qeqpr26caki16dnhd19sv6by6v-wpengine.netdna-ssl.com/wp-content/uploads/2014/03/recurrence_confusion_matrix.png)

You can calculate the Precision and Recall easily now: 

**Precision**:
- The precision of the All No Recurrence model is 0/(0+0), which is undefined (not a number) and is conventionally reported as 0.
- The precision of the All Recurrence model is 85/(85+201) or 0.30.

**Recall**:
- The recall of the All No Recurrence model is 0/(0+85) or 0.
- The recall of the All Recurrence model is 85/(85+0) or 1.
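The figures above can be checked with a few lines of Python. This is a sketch that takes the confusion-matrix counts from the two baseline models, with recurrence as the positive class. The helper names are chosen for this example, and returning `None` for an undefined precision is a choice made here rather than a fixed convention.

```python
def precision(tp, fp):
    """TP / (TP + FP), or None when the model makes no positive predictions."""
    return tp / (tp + fp) if (tp + fp) else None

def recall(tp, fn):
    """TP / (TP + FN), or None when there are no positive instances."""
    return tp / (tp + fn) if (tp + fn) else None

def accuracy(tp, fp, fn, tn):
    """Share of all predictions that are correct."""
    return (tp + tn) / (tp + fp + fn + tn)

# Counts for the two baseline models; "recurrence" is the positive class.
all_no_recurrence = {"tp": 0, "fp": 0, "fn": 85, "tn": 201}
all_recurrence = {"tp": 85, "fp": 201, "fn": 0, "tn": 0}

for name, m in [("All No Recurrence", all_no_recurrence),
                ("All Recurrence", all_recurrence)]:
    print(name)
    print("  accuracy :", round(accuracy(**m), 4))
    print("  precision:", precision(m["tp"], m["fp"]))
    print("  recall   :", recall(m["tp"], m["fn"]))
```

The All No Recurrence model wins on accuracy but has a recall of 0, which is exactly the failure that accuracy alone hides.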

Well! You now have enough reasons to see why considering only classification accuracy to evaluate your classification model is not a good choice.

Let's proceed to the next approach. 

### Resampling your dataset:

Dealing with imbalanced datasets entails strategies such as improving classification algorithms or balancing classes in the training data (data preprocessing) before providing the data as input to the machine learning algorithm. The latter technique is preferred as it has wider application.

The main objective of balancing classes is to either increase the frequency of the minority class or decrease the frequency of the majority class. This is done in order to obtain approximately the same number of instances for both classes.

This change is called sampling your dataset and there are two main methods that you can use to even-up the classes:

- You can add copies of instances from the under-represented class, called **over-sampling** (or more formally sampling with replacement), or
- You can delete instances from the over-represented class, called **under-sampling**.

These approaches are often very easy to implement and fast to run. They are an excellent starting point.

In fact, it is advisable to try both approaches on all of your imbalanced datasets, just to see if either gives you a boost in your preferred accuracy measures.

Let's learn about over-sampling and under-sampling in a bit more detail.

#### Random Under-Sampling: 

Random under-sampling aims to balance class distribution by randomly eliminating majority class examples.  This is done until the majority and minority class instances are balanced out.

Total Observations = 1000

Fraudulent Observations = 20

Non-Fraudulent Observations = 980

Event Rate = 2%

In this case, you take a 10% sample without replacement from the Non-Fraudulent instances and combine it with all of the Fraudulent instances.

Non-Fraudulent Observations after random under-sampling = 10% of 980 = 98

Total Observations after combining them with Fraudulent Observations = 20 + 98 = 118

Event Rate for the new dataset after under-sampling = 20/118 = 17%
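The worked example above can be sketched in plain Python. The function name `random_under_sample`, the placeholder rows and the fixed seed are assumptions made for this illustration; real rows would carry actual features.

```python
import random

def random_under_sample(majority, minority, fraction, seed=0):
    """Keep a random `fraction` of the majority rows (without replacement) plus every minority row."""
    rng = random.Random(seed)
    n_keep = round(len(majority) * fraction)
    kept = rng.sample(majority, n_keep)  # sample() draws without replacement
    return kept + list(minority)

# Placeholder rows standing in for real observations.
non_fraud = [("non_fraud", i) for i in range(980)]
fraud = [("fraud", i) for i in range(20)]

resampled = random_under_sample(non_fraud, fraud, fraction=0.10)
event_rate = len(fraud) / len(resampled)

print(len(resampled))        # 118
print(round(event_rate, 2))  # 0.17
```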

 

**Advantages of this approach**:

- It can help improve run time and storage problems by reducing the number of training data samples when the training data set is huge.

**Disadvantages**:
- It can discard potentially useful information which could be important for building rule classifiers.
- The sample chosen by random under-sampling may be biased, so it may not accurately represent the population. This can lead to inaccurate results on the actual test dataset.
 

#### Random Over-Sampling:

Over-Sampling increases the number of instances in the minority class by randomly replicating them in order to present a higher representation of the minority class in the sample.

Total Observations = 1000

Fraudulent Observations = 20

Non-Fraudulent Observations = 980

Event Rate = 2%

In this case, you replicate each of the 20 Fraudulent observations 20 times.

Non-Fraudulent Observations = 980

Fraudulent Observations after replicating the minority class observations = 400

Total Observations in the new dataset after over-sampling = 1380

Event Rate for the new dataset after over-sampling = 400/1380 = 29%
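Here is a comparable sketch for over-sampling. It differs slightly from the worked example: instead of copying every fraud row exactly 20 times, it keeps the original 20 rows and draws the additional 380 at random with replacement, which gives the same totals. The function name `random_over_sample` and the placeholder rows are assumptions for this illustration.

```python
import random

def random_over_sample(majority, minority, target_minority_size, seed=0):
    """Grow the minority class to `target_minority_size` by drawing extra copies with replacement."""
    rng = random.Random(seed)
    n_extra = max(0, target_minority_size - len(minority))
    extra = rng.choices(minority, k=n_extra)  # choices() draws with replacement
    return list(majority) + list(minority) + extra

# Placeholder rows standing in for real observations.
non_fraud = [("non_fraud", i) for i in range(980)]
fraud = [("fraud", i) for i in range(20)]

resampled = random_over_sample(non_fraud, fraud, target_minority_size=400)
n_fraud = sum(1 for label, _ in resampled if label == "fraud")

print(len(resampled))                      # 1380
print(round(n_fraud / len(resampled), 2))  # 0.29
```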

**Advantages of random over-sampling**:
- Unlike under-sampling, this method leads to no information loss.
- It often outperforms under-sampling.

**Disadvantages**:

- It increases the likelihood of overfitting since it replicates the minority class events.

**Some Rules of Thumb**:
- Consider testing under-sampling when you have a lot of data (tens or hundreds of thousands of instances or more)
- Consider testing over-sampling when you don’t have a lot of data (tens of thousands of records or fewer)
- Consider testing random and non-random (e.g. stratified) sampling schemes.
- Consider testing different resampled ratios (e.g. you don’t have to target a 1:1 ratio in a binary classification problem, try other ratios)

Now you will study the next approach for handling imbalanced data.

### Trying out different perspectives:

There are fields of study dedicated to imbalanced datasets. They have their own algorithms, measures and terminology.

Taking a look and thinking about your problem from these points of view can sometimes give you good ideas.

Two you might like to consider are **anomaly detection** and **change detection**.

- Anomaly detection is the detection of rare events. This might be a machine malfunction indicated through its vibrations or a malicious activity by a program indicated by its sequence of system calls. The events are rare when compared to normal operation.

This shift in thinking treats the minority class as the outlier class, which might help you think of new ways to separate and classify samples.

- Change detection is similar to anomaly detection except rather than looking for an anomaly it is looking for a change or difference. This might be a change in behavior of a user as observed by usage patterns or bank transactions.

Both of these shifts take a more real-time stance to the classification problem that might give you some new ways of thinking about your problem and maybe some more techniques to try.
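To give a flavor of the anomaly detection view, the sketch below models only normal behavior and flags anything far from it. The z-score rule, the threshold of 3 standard deviations and the vibration readings are all assumptions chosen for this illustration; they are one simple option among many, not a recommended detector.

```python
from statistics import mean, stdev

def fit_normal_profile(values):
    """Summarize normal behavior by its mean and sample standard deviation."""
    return mean(values), stdev(values)

def is_anomaly(value, profile, threshold=3.0):
    """Flag a value lying more than `threshold` standard deviations from the mean."""
    mu, sigma = profile
    return abs(value - mu) > threshold * sigma

# Hypothetical vibration readings from a machine operating normally.
normal_readings = [0.98, 1.02, 1.00, 0.97, 1.03, 1.01, 0.99, 1.00, 1.02, 0.98]
profile = fit_normal_profile(normal_readings)

print(is_anomaly(1.01, profile))  # False
print(is_anomaly(1.60, profile))  # True
```

Note that no minority examples are needed to fit the profile, which is what makes this framing attractive when the rare class has very few instances.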

The above approach can actually help you in playing between different domains and coming up with something new. It's always recommended to dig deeper in this. Anyway, you will now move onto the next and final approach of handling imbalanced data for this post:

### Try Generating Synthetic Samples:
A simple way to generate synthetic samples is to randomly sample the attributes from instances in the minority class.

You could sample them empirically within your dataset or you could use a method like _Naive Bayes_ that can sample each attribute independently when run in reverse. You will have more and different data, but the non-linear relationships between the attributes may not be preserved.
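A minimal sketch of the empirical variant is shown below: each attribute of a new row is drawn independently from the values observed for that attribute in the minority class. The function name and the example rows are hypothetical. Because attributes are drawn independently, a synthetic row can combine values that never occurred together, which is the loss of relationships mentioned above.

```python
import random

def sample_attributes_independently(minority_rows, n_new, seed=0):
    """Build synthetic rows by drawing each attribute from its own column of the minority class."""
    rng = random.Random(seed)
    columns = list(zip(*minority_rows))  # one tuple of observed values per attribute
    return [tuple(rng.choice(column) for column in columns) for _ in range(n_new)]

# Hypothetical minority rows: (amount, hour_of_day, channel)
minority = [(120.0, 2, "web"), (950.0, 3, "app"), (430.0, 23, "web")]
synthetic = sample_attributes_independently(minority, n_new=5)

for row in synthetic:
    print(row)
```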

There are systematic algorithms that you can use to generate synthetic samples. The most popular of these is SMOTE, the **Synthetic Minority Over-sampling Technique**. It was proposed in 2002 and you can take a look at the [original SMOTE paper](http://www.jair.org/papers/paper953.html). The following infographic will give you a fair idea of what synthetic samples look like:

<img src = "https://cdn-images-1.medium.com/max/1600/1*uAiwqUNhqaSZmsXCrl9kVQ.png"></img>

[Source](https://cdn-images-1.medium.com/max/1600/1*uAiwqUNhqaSZmsXCrl9kVQ.png)

As its name suggests, SMOTE is an oversampling method. It works by creating synthetic samples from the minority class instead of creating copies. The algorithm selects two or more similar instances (using a distance measure) and perturbs an instance one attribute at a time by a random amount within the difference to the neighboring instances.
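The idea can be sketched in a few lines of plain Python. This is a simplified illustration of the description above, not a reference implementation: the function name `smote_sketch`, the Euclidean distance, the default `k=3` and the example points are assumptions, and it needs numeric attributes and at least two minority instances.

```python
import math
import random

def smote_sketch(minority, n_new, k=3, seed=0):
    """Create `n_new` synthetic points between minority points and their nearest minority neighbors."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        # The k nearest minority neighbors of the base point, by Euclidean distance.
        others = [p for j, p in enumerate(minority) if j != i]
        neighbors = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbor = rng.choice(neighbors)
        # Move each attribute a random fraction of the way towards the neighbor.
        synthetic.append(tuple(b + rng.random() * (n - b) for b, n in zip(base, neighbor)))
    return synthetic

# Hypothetical minority points with two numeric attributes.
minority = [(1.0, 1.0), (1.2, 0.9), (0.9, 1.3), (1.1, 1.1)]
synthetic = smote_sketch(minority, n_new=6)

for point in synthetic:
    print(point)
```

A common variant draws a single random fraction per synthetic point instead of one per attribute, which places the new point exactly on the line segment between the two instances.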

SMOTE has both advantages and disadvantages - 

**Advantages** - 
- Mitigates the problem of overfitting caused by random oversampling, because synthetic examples are generated rather than existing instances being replicated.
- No loss of useful information.

**Disadvantages** - 
- While generating synthetic examples, SMOTE does not take into consideration neighboring examples from other classes. This can increase the overlap between classes and introduce additional noise.
- SMOTE is not very effective for high dimensional data.

### Wrap up!

So far, you have been introduced to the concept of imbalanced data and the kind of problems it creates while designing and developing machine learning models. You also saw several reasons why it is important to tackle imbalanced data. After that, you studied four different approaches that can help you handle imbalanced datasets effectively.

A lot of important concepts in one go! Absolutely amazing!

That is all for this tutorial. In the next tutorial, you will actually implement some of the approaches in Python with a real-world dataset. 

Below are some paper links if you are very keen to study even more about the topic of imbalanced data:
- [Learning from Imbalanced Data](http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=5128907)
- [Addressing the Curse of Imbalanced Training Sets: One-Sided Selection](http://sci2s.ugr.es/keel/pdf/algorithm/congreso/kubat97addressing.pdf)
- [A Study of the Behavior of Several Methods for Balancing Machine Learning Training Data](http://dl.acm.org/citation.cfm?id=1007735)

**References**:
- [Towards Data Science article on Imbalanced data](https://towardsdatascience.com/dealing-with-imbalanced-classes-in-machine-learning-d43d6fa19d2)
- [Python Machine Learning](https://g.co/kgs/a35QhF)