# Modeling Unbalanced Classes

Some classification models are better suited than others to outliers, low occurrence of a class, or rare events. The most common methods to add robustness to a classifier are related to stratified sampling to re-balance the training data. This module will walk you through both stratified sampling methods and more novel approaches to model data sets with unbalanced classes. 

## Learning Objectives

Identify class weights and sampling as methods to deal with unbalanced classes in a data set.

Recognize the syntax for sampling, blagging, and nearest neighbor methods for modeling unbalanced classes.


## Model Interpretability

In general, we need explanation methods that can make the behaviors and predictions of machine learning models understandable to humans. We need to use those methods to understand the model structures, what important features should be included in the model, and how those models map features to prediction outcomes.


In addition, knowing exactly how a model works can sometimes give us more insight than merely predicting the outcomes. For example, understanding how an AI system diagnoses cancer may help human health experts identify evidence-based risk factors. Interpretability is especially important for decision makers in very sensitive or high-risk domains such as finance or health. We need to be confident and be able to trust that the model is working correctly. Black-box machine learning systems cannot be trusted unless they can be monitored and interpreted. As such, building trustworthy models is sometimes even more important than building high-performing models.

Understanding machine learning models supports:
- Model explanation
- Model trust
- Model debugging

Recap
- We can only trust and effectively debug machine learning models if they are understandable
- Self-interpretable models have simple and intuitive structures
- Non-self-interpretable models have complex structures and can be described as black-box models

### Examples of Self-Interpretable and Non-Self-Interpretable Models

#### Self-Interpretable

Linear models are probably the most widely used predictive models due to their simplicity and effectiveness, especially in the financial industry. Their structure is simple with just a linear combination of features that predict values. As such, linear model prediction outcomes often require minimal effort to understand.

Tree models, such as decision trees, are another popular type of self-interpretable model. Their main characteristic is that they mimic the human reasoning process by creating a set of IF-THEN-ELSE rules.

The K-nearest neighbor model, or KNN, can also be considered a self-interpretable model if the feature space is comprehensible and kept small.
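
As a rough illustration of why linear and tree models are called self-interpretable, the sketch below fits both on a small synthetic dataset and reads the explanation straight from the fitted objects. The data, the feature names `f0` and `f1`, and the model settings are assumptions made for this example only.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier, export_text

# Hypothetical toy data: the label depends only on the first feature.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (X[:, 0] > 0).astype(int)
feature_names = ["f0", "f1"]

# Linear model: each coefficient is the weight given to one feature.
linear = LogisticRegression().fit(X, y)
for name, coef in zip(feature_names, linear.coef_[0]):
    print(f"{name}: {coef:+.2f}")

# Tree model: the fitted tree can be printed as nested IF-THEN-ELSE rules.
tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)
rules = export_text(tree, feature_names=feature_names)
print(rules)
```

In this setup the coefficient of `f0` should dominate and the printed rules should split on `f0`, which matches how the labels were generated.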

#### Non-Self-Interpretable Models

Ensemble Models

![](./images/70_ModelInterpretationMethods.png)



### Model-Agnostic Explanations

![](./images/71_ModelAgnosticExplainations.png)

Feature importance

Measure the importance of features

1. Simplify your model by only including important features
2. Interpret how predictions were made

Permutation feature importance
- The basic idea of permutation feature importance is very simple. For each feature, we shuffle its feature values and use the model to make predictions based on the shuffled values. In most cases, the prediction error will increase. Permuting important or impactful features will tend to generate large prediction errors and less important features will tend to generate small error increases. As such, feature importance can be measured by calculating the difference between the prediction errors before and after permutation. 

![](./images/72_PermutationFeatureImportanceExample.png)
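
The shuffle-and-compare procedure described above can be sketched by hand as follows. This is a minimal illustration, assuming a synthetic dataset in which only feature 0 carries signal and using classification error (1 - accuracy) as the prediction error; the model choice is arbitrary.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Hypothetical data: only feature 0 influences the label.
rng = np.random.default_rng(0)
X = rng.normal(size=(600, 3))
y = (X[:, 0] + 0.1 * rng.normal(size=600) > 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_train, y_train)

# Prediction error before any permutation.
baseline_error = 1 - accuracy_score(y_test, model.predict(X_test))

importances = []
for j in range(X_test.shape[1]):
    X_perm = X_test.copy()
    X_perm[:, j] = rng.permutation(X_perm[:, j])  # shuffle one column only
    perm_error = 1 - accuracy_score(y_test, model.predict(X_perm))
    importances.append(perm_error - baseline_error)  # error increase

print(importances)
```

scikit-learn also ships `sklearn.inspection.permutation_importance`, which repeats the shuffling several times and averages the result; the loop above only shows the core idea with a single shuffle per feature.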

- A partial dependence plot (PDP) is an effective way to illustrate the relationship between a feature and the model outcome. It visualizes the marginal effect of a feature, that is, how the model outcome changes as that feature varies over its range. Note that we keep the rest of the features unchanged while changing the feature of interest.
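
The values behind a partial dependence plot can be computed by hand, as in the sketch below: force the feature of interest to each grid value for every row, keep the other features unchanged, and average the predictions. The data, the model, and the helper name `partial_dependence_curve` are assumptions for illustration.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

# Hypothetical data: the outcome rises with feature 0 and ignores feature 1.
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(500, 2))
y = 3 * X[:, 0] + rng.normal(scale=0.1, size=500)

model = GradientBoostingRegressor(random_state=0).fit(X, y)

def partial_dependence_curve(model, X, feature, grid):
    """Average prediction when `feature` is forced to each grid value."""
    curve = []
    for value in grid:
        X_mod = X.copy()
        X_mod[:, feature] = value  # change only the feature of interest
        curve.append(model.predict(X_mod).mean())
    return np.array(curve)

grid = np.linspace(-1, 1, 5)
pd_curve = partial_dependence_curve(model, X, feature=0, grid=grid)
print(pd_curve)
```

Plotting `grid` against `pd_curve` gives the partial dependence plot for feature 0; here the curve should increase from left to right.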

Impurity-based feature importance

Shapley Additive exPlanations (SHAP) values

### Surrogate Models

![](./images/73_SurogateModels.png)

![](./images/74_GlobalSurrogateModels.png)

Local surrogate
- Global surrogate models may not always work
  - Large inconsistency between surrogate models and black-box models
  - Multiple data instance groups or clusters in the dataset
- Explain specific interested data instances locally
- A local surrogate model is built on one or a few instances

Local Interpretable Model-Agnostic Explanations (LIME)
![](./images/75_LocalInterpretableModel-AgnosticExplanations.png)
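
A local surrogate in the spirit of LIME can be sketched with scikit-learn alone. This is not the `lime` library and not necessarily the exact procedure in the figure; the perturbation scale, the proximity kernel, and the Ridge surrogate are all assumptions chosen for the example.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import Ridge

# Hypothetical black-box model trained on data where only feature 0 matters.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 2))
y = (X[:, 0] > 0).astype(int)
black_box = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)

instance = np.array([0.1, 0.0])  # the single instance to explain

# 1. Sample perturbed points around the instance.
neighbors = instance + rng.normal(scale=0.5, size=(1000, 2))
# 2. Ask the black-box model for its predicted probabilities.
target = black_box.predict_proba(neighbors)[:, 1]
# 3. Weight the perturbed points by their proximity to the instance.
distances = np.linalg.norm(neighbors - instance, axis=1)
weights = np.exp(-(distances ** 2) / 0.5)
# 4. Fit a simple, interpretable model to the local behavior.
local_model = Ridge(alpha=1.0).fit(neighbors, target, sample_weight=weights)
print(local_model.coef_)
```

The coefficients of `local_model` serve as the local explanation: near this instance, feature 0 should drive the black-box prediction far more than feature 1.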


### Practice Lab: Model Interpretability

### Practice: Model interpretability

## Introduction to Unbalanced Classes

Most classifiers learn their parameters and decision rules by optimizing overall accuracy: they are built to get as many predictions correct as possible, regardless of class. As a result, they often perform poorly on underrepresented classes.

Ways to deal with unbalanced classes:

- Undersampling (downsampling) means keeping only as many observations from the majority class as there are in the minority class. For example, if the minority class has only six observations, we randomly select six observations from the majority class so that we work with a balanced dataset.

- Oversampling (upsampling) means creating copies of rows from the minority class until we have a balanced sample.

- A mix of downsampling and upsampling


### Upsampling and Downsampling

Steps for unbalanced datasets:
- Do a stratified test-train split
- Up- or downsample the training set only, leaving the test set untouched
- Build models
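
The workflow can be sketched as follows, assuming a synthetic dataset with a 5% minority class. Resampling is applied to the training part only so that the test set keeps the real class ratio; `sklearn.utils.resample` is used here as one convenient way to draw the samples.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

# Hypothetical data: 50 minority rows (label 1) and 950 majority rows (label 0).
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))
y = np.array([1] * 50 + [0] * 950)

# 1. Stratified split keeps the 5% minority share in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# 2. Resample the training part only.
X_min, X_maj = X_train[y_train == 1], X_train[y_train == 0]

# Upsampling: draw minority rows with replacement up to the majority count.
X_min_up = resample(X_min, replace=True, n_samples=len(X_maj), random_state=0)
X_up = np.vstack([X_maj, X_min_up])
y_up = np.array([0] * len(X_maj) + [1] * len(X_min_up))

# Downsampling: draw majority rows without replacement down to the minority count.
X_maj_down = resample(X_maj, replace=False, n_samples=len(X_min), random_state=0)
X_down = np.vstack([X_maj_down, X_min])
y_down = np.array([0] * len(X_maj_down) + [1] * len(X_min))

# 3. Build models on (X_up, y_up) or (X_down, y_down); evaluate on (X_test, y_test).
```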

Downsampling

- Downsampling adds tremendous importance to the minority class, typically shooting up recall and bringing down precision.

- Values like 0.8 recall and 0.15 precision aren't uncommon.

Upsampling

- Upsampling mitigates some of the excessive weight on the minority class. Recall is still typically higher than precision, but the gap is smaller.

- Values like 0.7 recall and 0.4 precision aren't uncommon, and are often considered good results for an unbalanced dataset.

![](./images/76_CrossValidation.png)

Every classifier used produces a different model.

Every dataset we use (produced by various sampling, say) produces a different model.

We can choose the best model using any criteria including AUC (area under the curve).
Remember each model produces a different ROC curve.

Once a model is chosen, you can walk along its ROC curve and pick any point on it.
Each point corresponds to a different decision threshold and has different precision/recall values.
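
Walking along the ROC curve amounts to changing the decision threshold applied to the predicted scores. The sketch below assumes a synthetic two-class dataset and a logistic regression model; the thresholds 0.1, 0.3, and 0.5 are arbitrary picks for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score, roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

# Hypothetical data: 100 minority rows shifted away from 900 majority rows.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(1.5, 1.0, size=(100, 2)), rng.normal(0.0, 1.0, size=(900, 2))])
y = np.array([1] * 100 + [0] * 900)

X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)
model = LogisticRegression().fit(X_train, y_train)
scores = model.predict_proba(X_test)[:, 1]

# AUC summarizes the whole curve and can be used to compare models.
auc = roc_auc_score(y_test, scores)
fpr, tpr, thresholds = roc_curve(y_test, scores)

# Each threshold is one point on the curve with its own precision and recall.
results = {}
for threshold in (0.1, 0.3, 0.5):
    y_pred = (scores >= threshold).astype(int)
    results[threshold] = (
        precision_score(y_test, y_pred, zero_division=0),
        recall_score(y_test, y_pred),
    )
    print(threshold, results[threshold])
```

Lowering the threshold can only keep recall the same or raise it, usually at the cost of precision, which is the trade-off you choose when picking a point on the curve.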

### Modeling Approaches: Weighting and Stratified Sampling

In this section, we will cover:
- Additional approaches to dealing with unbalanced outcomes
- Random and synthetic oversampling
- Techniques for undersampling
- Using balanced bagging (blagging) to address unbalanced class data

Lots of Approaches
- General sklearn approaches
- Oversampling
- Undersampling
- Combination
- Ensembles
- Check out http://contrib.scikit-learn.org/imbalanced-learn

Weighting
- Many models allow weighted observations
- Adjust these so total weights are equal across classes
- Easy to do, when it's available
- No need to sacrifice data 
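
In scikit-learn, many estimators expose weighting through a `class_weight` parameter. The sketch below assumes a synthetic 5% minority dataset and compares an unweighted logistic regression with a `class_weight="balanced"` one; recall is measured on the training data purely to keep the example short.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.utils.class_weight import compute_class_weight

# Hypothetical data: 50 minority rows (label 1) and 950 majority rows (label 0).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(1.0, 1.0, size=(50, 2)), rng.normal(0.0, 1.0, size=(950, 2))])
y = np.array([1] * 50 + [0] * 950)

# "balanced" weights are n_samples / (n_classes * class_count),
# so the total weight is equal across classes.
weights = compute_class_weight(class_weight="balanced", classes=np.array([0, 1]), y=y)
print(weights)

unweighted = LogisticRegression().fit(X, y)
weighted = LogisticRegression(class_weight="balanced").fit(X, y)

recall_unweighted = recall_score(y, unweighted.predict(X))
recall_weighted = recall_score(y, weighted.predict(X))
print(recall_unweighted, recall_weighted)
```

No rows are dropped or duplicated here; the minority class simply counts for more in the loss, which should raise its recall.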

Stratified Sampling
- Train-test split, "stratify" option
- ShuffleSplit -> StratifiedShuffleSplit
- KFold -> StratifiedKFold -> RepeatedStratifiedKFold
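
The difference between a plain and a stratified splitter is easiest to see by counting minority rows per test fold. The sketch below assumes 100 rows sorted by class with 10 minority rows at the end.

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

X = np.zeros((100, 1))
y = np.array([0] * 90 + [1] * 10)  # rows sorted by class

# Number of minority rows landing in each test fold.
plain_counts = [int(y[test].sum()) for _, test in KFold(n_splits=5).split(X, y)]
strat_counts = [int(y[test].sum()) for _, test in StratifiedKFold(n_splits=5).split(X, y)]

print(plain_counts)  # [0, 0, 0, 0, 10]
print(strat_counts)  # [2, 2, 2, 2, 2]
```

With the plain splitter, four of the five folds contain no minority rows at all, so metrics such as recall cannot be computed meaningfully on them. For a single split, the same idea applies through `train_test_split(..., stratify=y)`.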

### Modeling Approaches: Random and Synthetic Oversampling

Random Oversampling
- Simplest oversampling approach
- Resample with replacement from minority class
- No concerns about geometry of feature space
- Good for categorical data
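
Because random oversampling only repeats existing rows, it never produces an invalid category value. A minimal NumPy sketch, assuming a single hypothetical categorical feature `color` and a 10/90 class split:

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical categorical feature and an unbalanced label.
categories = np.array(["red", "green", "blue"])
color = categories[rng.integers(0, 3, size=100)]
y = np.array([1] * 10 + [0] * 90)

minority_idx = np.flatnonzero(y == 1)
n_extra = int((y == 0).sum()) - len(minority_idx)  # rows needed to reach balance

# Resample with replacement from the minority class.
extra_idx = rng.choice(minority_idx, size=n_extra, replace=True)

color_over = np.concatenate([color, color[extra_idx]])
y_over = np.concatenate([y, y[extra_idx]])
```

imbalanced-learn provides `RandomOverSampler` for the same purpose; the version above only exposes the mechanics.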

![](./images/77_SyntheticOverSampling.png)

#### Synthetic Oversampling
- Start with a point in the minority class
- Choose one of K nearest neighbors
- Add a new point between them
- Two main approaches:
  - SMOTE : Synthetic Minority Oversampling Technique
    - Regular: Connect minority class points to their nearest minority class neighbors, with no restriction on which minority points are used
    - Borderline: Classify points as outlier, safe, or in-danger
      - 1: Connect minority in-danger points only to minority points
      - 2: Connect minority in-danger points to whatever is nearby
    - SVM: Use minority support vectors to generate new points
  - ADASYN : ADAptive SYNthetic sampling
    - For each minority point:
      - Look at classes in neighborhood
      - Generate new samples proportional to competing classes
    - Motivated by KNN, but helps other classifiers as well
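
The first three steps (start with a minority point, choose one of its K nearest neighbors, add a new point between them) can be sketched in NumPy as below. This is a simplified illustration, not the imbalanced-learn implementation: the helper name `smote_sketch` is hypothetical, and neighbors are searched within the minority class only, as in the original SMOTE formulation.

```python
import numpy as np

def smote_sketch(X_min, n_new, k=5, seed=0):
    """Create n_new synthetic points by interpolating between minority neighbors."""
    rng = np.random.default_rng(seed)
    # Pairwise distances within the minority class.
    dists = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)               # a point is not its own neighbor
    neighbors = np.argsort(dists, axis=1)[:, :k]  # k nearest minority neighbors

    synthetic = np.empty((n_new, X_min.shape[1]))
    for i in range(n_new):
        base = rng.integers(len(X_min))           # start with a minority point
        nb = rng.choice(neighbors[base])          # choose one of its k neighbors
        gap = rng.random()                        # position along the segment
        synthetic[i] = X_min[base] + gap * (X_min[nb] - X_min[base])
    return synthetic

# Hypothetical minority class with 20 points in two dimensions.
rng = np.random.default_rng(1)
X_min = rng.normal(size=(20, 2))
X_new = smote_sketch(X_min, n_new=30)
print(X_new.shape)
```

Every synthetic point lies on a line segment between two existing minority points, which is why this approach depends on the geometry of the feature space and is less suited to categorical features. In imbalanced-learn the corresponding classes are `SMOTE`, `BorderlineSMOTE`, `SVMSMOTE`, and `ADASYN`.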

### Modeling Approaches: Nearest Neighbor Methods

This section focuses on undersampling: decreasing the size of the majority class so that it is similar in size to the minority class.

![](./images/78_NearMiss1.png)

![](./images/78_NearMiss2.png)

![](./images/78_NearMiss3.png)

![](./images/81_TomekLinks.png)

![](./images/82_EditedNearestNeighbors.png)
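
As one concrete example of nearest neighbor undersampling, a Tomek link is a pair of points from different classes that are each other's nearest neighbor; removing the majority member of each pair cleans up the class boundary. The sketch below is a brute-force NumPy version on a tiny hand-made dataset. The helper name and the data are assumptions, and it is not the imbalanced-learn `TomekLinks` implementation.

```python
import numpy as np

def tomek_link_majority_indices(X, y, majority_label=0):
    """Return indices of majority points that take part in a Tomek link."""
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)
    nearest = dists.argmin(axis=1)  # nearest neighbor of every point

    to_remove = []
    for i, j in enumerate(nearest):
        mutual = nearest[j] == i    # i and j are each other's nearest neighbor
        if mutual and y[i] != y[j] and y[i] == majority_label:
            to_remove.append(i)
    return np.array(to_remove, dtype=int)

# Points 0 and 1 form a cross-class mutual pair; points 2 and 3 share a class.
X = np.array([[0.0, 0.0], [0.1, 0.0], [2.0, 2.0], [2.2, 2.0], [5.0, 5.0]])
y = np.array([0, 1, 0, 0, 1])

remove = tomek_link_majority_indices(X, y)
X_clean = np.delete(X, remove, axis=0)
y_clean = np.delete(y, remove)
print(remove, y_clean)
```

Only the majority point at index 0 is dropped here. Unlike random downsampling, this removes few points and does not by itself balance the classes.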

### Modeling Approaches: Blagging

Combination Over/Under
- SMOTE + Tomek's link
- SMOTE + Edited Nearest Neighbors

![](./images/83_BalancedBagging.png)
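
Balanced bagging can be sketched by hand: each tree is trained on a bag that contains a bootstrap sample of the minority class plus an equally sized sample of the majority class, and the trees vote. The data, the number of trees, and the voting rule below are assumptions for illustration, not the exact procedure in the figure.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Hypothetical data: 30 minority rows (label 1) and 570 majority rows (label 0).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(2.0, 1.0, size=(30, 2)), rng.normal(0.0, 1.0, size=(570, 2))])
y = np.array([1] * 30 + [0] * 570)

minority_idx = np.flatnonzero(y == 1)
majority_idx = np.flatnonzero(y == 0)

trees = []
for seed in range(25):
    # Each bag: bootstrap the minority class, then draw an equal-sized majority sample.
    bag_min = rng.choice(minority_idx, size=len(minority_idx), replace=True)
    bag_maj = rng.choice(majority_idx, size=len(minority_idx), replace=True)
    bag = np.concatenate([bag_min, bag_maj])
    trees.append(DecisionTreeClassifier(random_state=seed).fit(X[bag], y[bag]))

# Majority vote across the trees.
votes = np.mean([tree.predict(X) for tree in trees], axis=0)
y_pred = (votes >= 0.5).astype(int)
minority_recall = y_pred[y == 1].mean()
print(minority_recall)
```

Each individual tree sees a balanced sample, while the ensemble as a whole still uses many different majority rows. imbalanced-learn packages this idea as `BalancedBaggingClassifier`.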

### Unbalanced Classes: Summary

All of this resampling happens after the test set has been split off, so the test set keeps the original class balance.

Use sensible metrics
- AUC
- F1
- Cohen's Kappa
- Not accuracy - too easy to fool in this case
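
To see why accuracy is easy to fool, consider a classifier that always predicts the majority class. The sketch assumes 95 negatives and 5 positives and a constant score for every row.

```python
import numpy as np
from sklearn.metrics import accuracy_score, cohen_kappa_score, f1_score, roc_auc_score

y_true = np.array([0] * 95 + [1] * 5)
y_pred = np.zeros(100, dtype=int)  # always predicts the majority class
y_score = np.zeros(100)            # constant score for every row

acc = accuracy_score(y_true, y_pred)
f1 = f1_score(y_true, y_pred, zero_division=0)
kappa = cohen_kappa_score(y_true, y_pred)
auc = roc_auc_score(y_true, y_score)
print(acc, f1, kappa, auc)
```

Accuracy comes out at 0.95 even though the model never finds a positive case, while F1 and Cohen's kappa are 0 and AUC is 0.5, which correctly flags the model as no better than chance.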



# Summary/Review

## Modeling Unbalanced Classes

Classification algorithms are built to optimize accuracy, which makes it challenging to create a model when there is not a balance across the number of observations of different classes. Common methods to approach balancing the classes are:

- Downsampling or removing observations from the most common class

- Upsampling or duplicating observations from the rarest class or classes

- A mix of downsampling and upsampling

## Modeling Approaches for Unbalanced Classes

Specific approaches for modeling unbalanced classes are:

- Stratified sampling

- Random oversampling

- Synthetic oversampling, the main two approaches being Synthetic Minority Oversampling Technique (SMOTE) and Adaptive Synthetic sampling (ADASYN)

- Nearest neighbor undersampling methods like NearMiss, Tomek Links, and Edited Nearest Neighbors

