# Imbalanced Classes in Machine Learning
![Strawberries v. Cherries](https://images.unsplash.com/photo-1527323928721-cd2f2bd43fe1?ixlib=rb-0.3.5&ixid=eyJhcHBfaWQiOjEyMDd9&s=4ccd32c39ea133b329e919a8e253e278&auto=format&fit=crop&w=2700&q=80)

The image above evokes a sense of balance: the girl is holding roughly the same amount of strawberries and cherries. If I were to sample from those baskets to teach a machine to tell the difference, a trained algorithm could differentiate accurately after just a few samples.

Now imagine: what if I only had a basket of strawberries and just a few cherries? Do you think a trained machine learning algorithm could accurately differentiate?

Imbalance is common in datasets. Rarely do we see datasets with equal numbers of observations in each category. In many machine learning applications, imbalanced classes are the norm. Domains such as fraud detection, signal detection, and customer churn often have heavily imbalanced classes.

## Accuracy Paradox
Enter the [accuracy paradox](https://en.wikipedia.org/wiki/Accuracy_paradox). 

This is the case in which a model has excellent accuracy (say, above 90%), but that accuracy only reflects the underlying class distribution.
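To make the paradox concrete, here is a minimal sketch using made-up numbers (950 legitimate transactions and 50 fraudulent ones are assumptions for illustration, not real data). A "model" that always predicts the majority class scores well on accuracy while catching no fraud at all.

```python
# Hypothetical data: 950 legitimate transactions (0) and 50 fraudulent ones (1).
y_true = [0] * 950 + [1] * 50

# A "model" that ignores its input and always predicts the majority class.
y_pred = [0] * len(y_true)

# Fraction of predictions that match the true label.
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Number of fraudulent transactions the model actually flagged.
fraud_caught = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))

print(f"accuracy: {accuracy:.2%}")                  # accuracy: 95.00%
print(f"fraud cases caught: {fraud_caught} of 50")  # fraud cases caught: 0 of 50
```

The 95% accuracy is just the share of the majority class in the data.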

Most of you have already encountered this problem, but do you know how to deal with it?

## Strategies to Combat Imbalanced Classes

* Different performance metrics
* Resample your data (up-sample & down-sample)
* Test different algorithms (tree-based methods)
* Try penalized models
* Change perspective

### Different performance metrics
* *Confusion Matrix*: A breakdown of predictions into a table showing correct predictions (the diagonal) and the types of incorrect predictions made (which classes the incorrect predictions were assigned to).
* *Precision*: A measure of a classifier's exactness: of the observations predicted positive, the fraction that really are positive.
* *Recall*: A measure of a classifier's completeness: of the observations that really are positive, the fraction the classifier found.
* *F1 Score* (or F-score): The harmonic mean of precision and recall.
* *Kappa* (or Cohen’s kappa): Agreement between predictions and true labels, corrected for the agreement expected by chance given the class distribution.
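The metrics above can all be derived from the four cells of a binary confusion matrix. The following is a rough sketch in plain Python; the function name `binary_metrics` and the toy labels are assumptions made for illustration, and in practice a library such as scikit-learn provides equivalents.

```python
def binary_metrics(y_true, y_pred):
    """Confusion-matrix counts and derived metrics for labels in {0, 1}."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == 1 and p == 1 for t, p in pairs)  # true positives
    fp = sum(t == 0 and p == 1 for t, p in pairs)  # false positives
    fn = sum(t == 1 and p == 0 for t, p in pairs)  # false negatives
    tn = sum(t == 0 and p == 0 for t, p in pairs)  # true negatives
    n = len(pairs)

    accuracy = (tp + tn) / n
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0

    # Cohen's kappa: observed agreement corrected for chance agreement,
    # where chance agreement comes from the actual and predicted class totals.
    expected = ((tp + fn) * (tp + fp) + (tn + fp) * (tn + fn)) / n ** 2
    kappa = (accuracy - expected) / (1 - expected) if expected != 1 else 0.0

    return {
        "confusion": [[tn, fp], [fn, tp]],  # rows: actual 0/1, columns: predicted 0/1
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "kappa": kappa,
    }


# Toy example (made-up labels): 10 positives, 90 negatives.
y_true = [1] * 10 + [0] * 90
y_pred = [1] * 6 + [0] * 4 + [1] * 5 + [0] * 85

metrics = binary_metrics(y_true, y_pred)
for name, value in metrics.items():
    print(name, value)
```

On this toy data the accuracy is 0.91, yet precision, recall, F1, and kappa all land well below that, which is the kind of gap these metrics are meant to expose.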

### Resampling your data
![Downsampling](https://chrisalbon.com/images/machine_learning_flashcards/Downsampling_print.png)

Check out the technique on Chris Albon's [blog](https://chrisalbon.com/machine_learning/preprocessing_structured_data/handling_imbalanced_classes_with_downsampling/).

P.S. Chris Albon is a godsend. Read his stuff. 

![Upsampling](https://chrisalbon.com/images/machine_learning_flashcards/Upsampling_print.png)

[How to upsample](https://chrisalbon.com/machine_learning/preprocessing_structured_data/handling_imbalanced_classes_with_upsampling/)
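As a rough illustration of both ideas, here is a sketch using only Python's standard library. It is not the code from the linked posts; the helper name `resample` and the toy data are assumptions. Down-sampling draws from the majority class without replacement until it matches the minority class, and up-sampling draws from the minority class with replacement until it matches the majority class.

```python
import random


def resample(X, y, minority_label, mode, seed=0):
    """Balance a two-class dataset by down- or up-sampling (hypothetical helper)."""
    rng = random.Random(seed)
    minority = [(x, lbl) for x, lbl in zip(X, y) if lbl == minority_label]
    majority = [(x, lbl) for x, lbl in zip(X, y) if lbl != minority_label]

    if mode == "down":
        # Shrink the majority class: sample WITHOUT replacement.
        majority = rng.sample(majority, k=len(minority))
    elif mode == "up":
        # Grow the minority class: sample WITH replacement (duplicates expected).
        minority = rng.choices(minority, k=len(majority))
    else:
        raise ValueError("mode must be 'down' or 'up'")

    combined = majority + minority
    rng.shuffle(combined)
    X_new, y_new = zip(*combined)
    return list(X_new), list(y_new)


# Toy data: 10 minority observations (label 1) and 90 majority observations (label 0).
X = list(range(100))
y = [1 if i < 10 else 0 for i in X]

X_down, y_down = resample(X, y, minority_label=1, mode="down")
X_up, y_up = resample(X, y, minority_label=1, mode="up")

print(y_down.count(0), y_down.count(1))  # 10 10
print(y_up.count(0), y_up.count(1))      # 90 90
```

Resample only the training split. If duplicated minority observations leak into the test set, the evaluation will look better than it really is.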

### Test different algorithms
If your chosen algorithm isn't performing well with imbalanced data, try something else! Tree-based ensembles built with boosting or bagging are often well suited to imbalanced data. Try models such as XGBoost or AdaBoost on your imbalanced data. They tend to cope with imbalance better than many other models, though you should still check them with the metrics above rather than accuracy alone.

### Try penalized models

Make mistakes costly! There are ways to penalize misclassification during training, typically by making errors on the minority class cost more than errors on the majority class.

You can modify the parameters of sklearn's SVM to create a "Penalized-SVM" model. See this blog [post](https://elitedatascience.com/imbalanced-classes).
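One common way to set the penalty is to weight each class inversely to its frequency. The sketch below computes such weights in plain Python using the formula `n_samples / (n_classes * class_count)` and shows how they change the cost of a mistake. The helper names and the 95/5 split are assumptions for illustration; this is not the code from the linked post.

```python
from collections import Counter


def balanced_class_weights(y):
    """Weight each class inversely to its frequency (hypothetical helper)."""
    counts = Counter(y)
    n_samples, n_classes = len(y), len(counts)
    return {label: n_samples / (n_classes * count) for label, count in counts.items()}


def weighted_error(y_true, y_pred, weights):
    """Total penalty: each mistake costs the weight of the observation's true class."""
    return sum(weights[t] for t, p in zip(y_true, y_pred) if t != p)


# Made-up labels: 95 majority observations (0) and 5 minority observations (1).
y_true = [0] * 95 + [1] * 5
weights = balanced_class_weights(y_true)
print(weights)  # the minority class gets the much larger weight

# Two predictors that each make exactly 5 mistakes:
miss_minority = [0] * 100           # misses all 5 minority observations
miss_majority = [1] * 5 + [0] * 90 + [1] * 5  # misclassifies 5 majority observations

cost_minority = weighted_error(y_true, miss_minority, weights)
cost_majority = weighted_error(y_true, miss_majority, weights)
print(cost_minority, cost_majority)
```

Both predictors have the same accuracy, but the one that ignores the minority class pays a far higher penalty. In scikit-learn, passing `class_weight="balanced"` to `sklearn.svm.SVC` applies weights computed with this same formula, scaling the `C` penalty per class.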