# Dealing With Class Imbalance in Classification Problems

# Method 1: Oversampling Underrepresented Class 

# SMOTE Method [Synthetic Minority Oversampling Technique]

Imbalanced classification involves developing predictive models on classification datasets that have a severe class imbalance.

The challenge of working with imbalanced datasets is that most machine learning techniques will ignore, and in turn have poor performance on, the minority class, although typically it is performance on the minority class that is most important.

One approach to addressing imbalanced datasets is to oversample the minority class. The simplest approach involves duplicating examples in the minority class, although these examples don’t add any new information to the model. Instead, new examples can be synthesized from the existing examples. This is a type of data augmentation for the minority class and is referred to as the Synthetic Minority Oversampling Technique, or SMOTE for short.
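
The interpolation idea behind SMOTE can be sketched in a few lines of plain Python. This is only an illustration under simplifying assumptions, not a reference implementation: the function name `smote_sketch`, its parameters, and the toy data are hypothetical, and in practice a maintained implementation such as the `SMOTE` class in the imbalanced-learn package would normally be used instead.

```python
import math
import random

def smote_sketch(minority, n_new, k=5, seed=0):
    """Create n_new synthetic points by interpolating between minority samples.

    minority: list of feature vectors (lists of floats), all from the minority class.
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority-class neighbours of the base point (excluding itself)
        others = [p for p in minority if p is not base]
        neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbour = rng.choice(neighbours)
        # pick a random point on the line segment between base and neighbour
        gap = rng.random()
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic

# toy minority class with two features (made-up values)
minority = [[1.0, 1.0], [1.2, 0.9], [0.8, 1.1], [1.1, 1.3]]
new_points = smote_sketch(minority, n_new=6, k=2)
```

Because every synthetic point lies between two existing minority examples, the new data stays inside the region the minority class already occupies rather than simply repeating it.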

# Method 2: Undersampling Overrepresented Classes

# Random Under-sampling
Random under-sampling randomly selects a number of examples from the majority class and removes them from the dataset. This reduces the number of majority-class observations, which can help balance the dataset.
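
A minimal sketch of random under-sampling, assuming the goal is to shrink every class down to the size of the smallest one. The function name `random_undersample` and the toy data are hypothetical and only serve as illustration.

```python
import random
from collections import Counter

def random_undersample(samples, labels, seed=0):
    """Randomly drop examples so every class has as many as the smallest class."""
    rng = random.Random(seed)
    by_class = {}
    for x, y in zip(samples, labels):
        by_class.setdefault(y, []).append(x)
    target = min(len(xs) for xs in by_class.values())
    out_samples, out_labels = [], []
    for y, xs in by_class.items():
        kept = rng.sample(xs, target)  # sampling without replacement
        out_samples.extend(kept)
        out_labels.extend([y] * target)
    return out_samples, out_labels

# toy dataset: 90 majority examples (label 0) and 10 minority examples (label 1)
samples = list(range(100))
labels = [0] * 90 + [1] * 10
bal_samples, bal_labels = random_undersample(samples, labels)
print(Counter(bal_labels))
```

Note that the minority class is kept in full; only majority examples are discarded.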



# Method 3: Class Weighting the Loss Function

Poor predictions on under-represented classes are penalised more heavily in the loss function. Something along the lines of

    import torch
    import torch.nn as nn
    weights = [w1, w2, w3, ...]  # one weight per class
    class_weights = torch.FloatTensor(weights)
    learn.crit = nn.CrossEntropyLoss(weight=class_weights)

where you set w1, w2, w3 in whatever way you wish. I’ve worked with sample size / (num classes * class frequency) but have to admit I’ve not had much luck getting class weighting to work well.
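
The weighting formula mentioned above can be computed as follows. This sketch assumes that "class frequency" means the number of samples in each class; the function name `balanced_class_weights` and the toy labels are hypothetical.

```python
from collections import Counter

def balanced_class_weights(labels):
    """weight_c = n_samples / (n_classes * count_c), with count_c the size of class c."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {c: n_samples / (n_classes * counts[c]) for c in sorted(counts)}

# toy labels: 90 samples of class 0 and 10 samples of class 1
labels = [0] * 90 + [1] * 10
weights = balanced_class_weights(labels)
# class 0: 100 / (2 * 90), class 1: 100 / (2 * 10)
weight_list = [weights[c] for c in sorted(weights)]  # ordered by class index
```

The resulting `weight_list` gives the rare class a larger weight and could serve as the `weights` list passed to the loss function.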

# Use the right evaluation metrics 

Applying inappropriate evaluation metrics to a model trained on imbalanced data can be dangerous. Imagine our training data is the one illustrated in the graph above. If accuracy is used to measure the quality of a model, a model that classifies all test samples as “0” will have an excellent accuracy (99.8%), but obviously this model won’t provide any valuable information.

In this case, alternative evaluation metrics can be applied, such as:

Precision (positive predictive value): how many selected instances are relevant.

Recall/Sensitivity: how many relevant instances are selected.

F1 score: harmonic mean of precision and recall.

MCC: correlation coefficient between the observed and predicted binary classifications.

AUC: area under the ROC curve, which plots the true-positive rate against the false-positive rate.
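
To see why accuracy misleads, the sketch below scores the all-zeros classifier from the 99.8% example using the threshold-based metrics listed above (AUC is left out because it needs predicted scores rather than hard labels). The 998/2 class split is an assumed toy dataset chosen to match the 99.8% figure, the helper `binary_metrics` is hypothetical, and returning 0.0 for undefined ratios is a convention picked here.

```python
import math

def binary_metrics(y_true, y_pred):
    """Accuracy, precision, recall, F1 and MCC for 0/1 labels."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    accuracy = (tp + tn) / len(y_true)
    # undefined ratios (zero denominator) are reported as 0.0 by convention here
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1, "mcc": mcc}

# assumed toy data: 998 negatives, 2 positives, and a model that always predicts 0
y_true = [0] * 998 + [1] * 2
y_pred = [0] * 1000
scores = binary_metrics(y_true, y_pred)
```

Here accuracy comes out at 0.998 while recall, F1 and MCC are all 0, which exposes that the model never finds a single minority example.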

# Use K-fold Cross-Validation in the right way

Cross-validation must be applied properly when using over-sampling to address class imbalance.

If cross-validation is applied after over-sampling, copies (or synthetic variants) of the same minority examples end up in both the training and validation folds, so we are basically overfitting our model to a specific artificial bootstrapping result and the validation score is overly optimistic. That is why the data should always be split into folds before over-sampling, with over-sampling applied only to the training portion of each split. Only by resampling the data repeatedly, once per fold, can randomness be introduced into the dataset to make sure that there won’t be an overfitting problem.
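
The ordering described above (split first, then over-sample only the training part) might look like the sketch below. It assumes binary 0/1 labels, uses simple duplication as the over-sampling step, and stratifies the folds so that each one contains some minority examples; stratification is an added assumption, and the helper names are hypothetical.

```python
import random

def stratified_kfold(labels, k, seed=0):
    """Split indices into k folds, spreading each class evenly across folds."""
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        for pos, i in enumerate(idx):
            folds[pos % k].append(i)
    return folds

def oversample_minority(indices, labels, rng):
    """Duplicate random minority indices until both classes are the same size."""
    pos = [i for i in indices if labels[i] == 1]
    neg = [i for i in indices if labels[i] == 0]
    minority, majority = (pos, neg) if len(pos) < len(neg) else (neg, pos)
    extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
    return indices + extra

labels = [0] * 90 + [1] * 10  # toy labels
folds = stratified_kfold(labels, k=5)
rng = random.Random(1)
splits = []
for i, val_idx in enumerate(folds):
    train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
    # over-sample the training indices only; the validation fold stays untouched
    train_idx = oversample_minority(train_idx, labels, rng)
    splits.append((train_idx, val_idx))
    # a model would be fit on train_idx and evaluated on val_idx here
```

Since duplicates are drawn only from the training indices, no validation example can leak into the data the model is fit on.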

# Ensemble different resampled datasets
 
The easiest way to successfully generalize a model is by using more data. The problem is that out-of-the-box classifiers like logistic regression or random forest tend to generalize by discarding the rare class. One easy best practice is building n models that use all the samples of the rare class and n different samples of the abundant class. For example, to ensemble 10 models, you would keep the 1,000 cases of the rare class and randomly sample 10,000 cases of the abundant class. Then you split the 10,000 cases into 10 chunks and train 10 different models, each on all the rare cases plus one chunk.
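
The dataset construction for such an ensemble can be sketched as below. Only the data preparation is shown; model training and the way predictions are combined are left out. The function name `build_ensemble_datasets` and the placeholder data are hypothetical, and setting each chunk to the size of the rare class is an assumption that matches the 1,000 / 10,000 example.

```python
import random

def build_ensemble_datasets(rare, abundant, n_models, seed=0):
    """Return n_models datasets: all rare cases plus one disjoint chunk of abundant cases."""
    rng = random.Random(seed)
    chunk_size = len(rare)  # assumption: each chunk is as large as the rare class
    sampled = rng.sample(abundant, n_models * chunk_size)  # without replacement
    return [rare + sampled[i * chunk_size:(i + 1) * chunk_size]
            for i in range(n_models)]

# placeholder data: 1,000 rare cases and 50,000 abundant cases
rare = [("rare", i) for i in range(1000)]
abundant = [("abundant", i) for i in range(50000)]
datasets = build_ensemble_datasets(rare, abundant, n_models=10)
```

Each of the 10 datasets is balanced (1,000 rare plus 1,000 abundant cases), and no abundant case is shared between two datasets.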


