# Random Undersampling


Random undersampling removes observations from the majority class at random until the classes reach a chosen proportion, typically 50:50.

- **Criteria for data exclusion**: Random
- **Final dataset size**: 2 x the number of minority class observations (for a 50:50 balance)
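To make the idea concrete, here is a minimal NumPy sketch of balanced random undersampling. The helper `random_undersample` and the toy arrays are hypothetical names used only for illustration; they are not part of imbalanced-learn, whose `RandomUnderSampler` is used in the rest of the notebook.

```python
import numpy as np


def random_undersample(X, y, random_state=0):
    """Hypothetical helper: keep every minority row and an equally sized
    random subset of each larger class."""
    rng = np.random.default_rng(random_state)
    classes, counts = np.unique(y, return_counts=True)
    n_min = counts.min()

    keep = []
    for c in classes:
        idx = np.flatnonzero(y == c)
        if len(idx) > n_min:
            # criterion for exclusion: pure chance
            idx = rng.choice(idx, size=n_min, replace=False)
        keep.append(idx)

    keep = np.sort(np.concatenate(keep))
    return X[keep], y[keep]


# toy data: 17 majority rows and 3 minority rows
X_demo = np.arange(40).reshape(20, 2)
y_demo = np.array([0] * 17 + [1] * 3)

X_small, y_small = random_undersample(X_demo, y_demo)
print(X_small.shape, np.bincount(y_small).tolist())  # (6, 2) [3, 3]
```

The result has 2 x 3 = 6 rows, which matches the "2 x minority class" rule stated above.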

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

from imblearn.under_sampling import RandomUnderSampler

## Learning Objectives

By the end of this notebook, you will understand:
1. What random undersampling is and when to use it
2. How to implement random undersampling using imbalanced-learn
3. The advantages and disadvantages of this technique
4. How undersampling affects machine learning model performance

## Why Random Undersampling?

When dealing with imbalanced datasets, machine learning algorithms often:
- **Bias towards the majority class** (predicting the most frequent class)
- **Poor performance on minority class** (missing important cases)
- **High accuracy but low practical value** (accuracy paradox)
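The accuracy paradox from the list above can be shown with a few lines of plain Python. This is a toy sketch: the labels are made up, and the "model" is simply a rule that always predicts the majority class.

```python
# Toy labels: 990 majority (0) and 10 minority (1) observations
y_true = [0] * 990 + [1] * 10

# A "model" that always predicts the majority class
y_pred = [0] * len(y_true)

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
true_positives = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
recall = true_positives / sum(y_true)

print(f"accuracy = {accuracy:.2f}, minority recall = {recall:.2f}")
# accuracy = 0.99, minority recall = 0.00
```

The rule scores 99% accuracy while finding none of the minority cases.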

Random undersampling is one of the simplest techniques to address class imbalance by reducing the majority class size.

## Import Libraries

The first code cell of this notebook imports the libraries we need for data manipulation, visualization, and machine learning.

## Create data

We will create data where the classes have different degrees of separation, using [make_classification](https://scikit-learn.org/stable/modules/generated/sklearn.datasets.make_classification.html).

In [None]:
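def make_data(sep):
    """Return an imbalanced two-feature dataset with class separation `sep`."""

    # make_classification returns NumPy arrays
    X, y = make_classification(
        n_samples=1000,
        n_features=2,
        n_redundant=0,
        n_clusters_per_class=1,
        weights=[0.99],  # about 99% of the observations go to class 0
        class_sep=sep,  # how separate the classes are
        random_state=1,
    )

    # transform the arrays into a pandas DataFrame and Series
    X = pd.DataFrame(X, columns=['varA', 'varB'])
    y = pd.Series(y)

    return X, y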

## Understanding Class Separability

Before we dive into undersampling, let's understand how **class separability** affects our data. We'll create synthetic datasets with different degrees of overlap between classes to see how this impacts the effectiveness of undersampling techniques.

The `make_classification` function allows us to control:
- **class_sep**: How well-separated the classes are (higher = easier to distinguish)
- **weights**: The proportion of each class (creating imbalance)
- **n_samples**: Total number of samples

In [None]:
# make datasets with different class separateness
# and plot

for sep in [0.1, 1., 2.]:
    
    X, y = make_data(sep)
    
    print(y.value_counts())
    
    sns.scatterplot(
        data=X, x="varA", y="varB", hue=y
    )
    
    plt.title('Separation: {}'.format(sep))
    plt.show()

### Visualizing Different Levels of Class Separability

Let's create and visualize datasets with different separation levels to understand how class overlap affects our data:

As we increase the parameter **sep**, the minority and majority classes overlap less.

### Key Observations from the Plots

**What you should notice:**
- **Separation = 0.1**: Classes heavily overlap (harder to distinguish)
- **Separation = 1.0**: Moderate overlap (some difficulty in classification)  
- **Separation = 2.0**: Well-separated classes (easier to classify)

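**Class Distribution**: Notice that we have approximately 990 samples of class 0 (majority) and only about 10 samples of class 1 (minority), roughly a 99:1 imbalance. The counts are not exact because `make_classification` randomly reassigns the labels of a small fraction of samples by default (`flip_y=0.01`).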

## Random Undersampling

[RandomUnderSampler](https://imbalanced-learn.org/stable/references/generated/imblearn.under_sampling.RandomUnderSampler.html)

It selects observations from the majority class at random until the majority class has as many observations as the minority class.

In [None]:
# create data

X, y = make_data(sep=2)

# set up the random undersampling class

rus = RandomUnderSampler(
    sampling_strategy='auto',  # samples only the majority class
    random_state=0,  # for reproducibility
    replacement=True # if it should resample with replacement
)  

X_resampled, y_resampled = rus.fit_resample(X, y)

### Implementing Random Undersampling

Now let's apply random undersampling to our well-separated dataset (sep=2). We'll use the `RandomUnderSampler` from imbalanced-learn:

**Key Parameters:**
- **sampling_strategy='auto'**: Automatically balances by sampling majority class to match minority class size
- **random_state**: Ensures reproducible results
- **replacement**: Whether to sample with replacement (allows duplicate selections)
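The effect of sampling with or without replacement can be illustrated with NumPy alone. This is a sketch on toy row positions, not the internals of `RandomUnderSampler`; the names `majority_idx`, `n_keep`, `without_repl` and `with_repl` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
majority_idx = np.arange(100)  # row positions of a toy majority class
n_keep = 30                    # how many majority rows we want to keep

# without replacement: each row can be picked at most once
without_repl = rng.choice(majority_idx, size=n_keep, replace=False)

# with replacement: the same row may be picked several times
with_repl = rng.choice(majority_idx, size=n_keep, replace=True)

print(len(np.unique(without_repl)))  # always equal to n_keep
print(len(np.unique(with_repl)))     # at most n_keep, fewer if rows repeat
```

Both draws return 30 rows, but only the draw without replacement is guaranteed to contain 30 distinct observations.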

In [None]:
# size of original data

X.shape, y.shape

### Examining the Dataset Size Changes

Let's examine how the dataset size changes after undersampling:

In [None]:
# size of undersampled data

X_resampled.shape, y_resampled.shape

In [None]:
# number of minority class observations

y.value_counts()

In [None]:
# plot of original data

sns.scatterplot(
    data=X, x="varA", y="varB", hue=y
)

plt.title('Original dataset')
plt.show()

### Visual Comparison: Before and After Undersampling

Let's visualize the original imbalanced dataset vs. the balanced dataset after undersampling:

In [None]:
# plot undersampled data

sns.scatterplot(
    data=X_resampled, x="varA", y="varB", hue=y_resampled
)

plt.title('Undersampled dataset')
plt.show()

The remaining samples show a distribution similar to that of the original dataset. This is a consequence of removing data at random.

**HOMEWORK**: 

- Change the degree of separation when creating the data, and re-run the cells above. Then compare the plots.

## Changing the balancing ratio

In [None]:
# now I will resample the data so that I obtain
# twice as many observations from the majority class
# as from the minority class

rus = RandomUnderSampler(
    sampling_strategy=0.5,  # balancing ratio = n minority / n majority
    random_state=0,  # for reproducibility
    replacement=False,  # sample without replacement
)

X_resampled, y_resampled = rus.fit_resample(X, y)

### Controlling the Balance Ratio

Random undersampling doesn't always need to create a 50:50 balance. We can specify different ratios using the `sampling_strategy` parameter:

**sampling_strategy = 0.5** means: minority_samples / majority_samples = 0.5
- If we have 10 minority samples, we'll keep 20 majority samples (10/20 = 0.5)
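The arithmetic behind the ratio can be sketched in plain Python. The helper `majority_target` is hypothetical, and truncating to an integer is an assumption made for this illustration; it also assumes the majority class has at least that many rows available.

```python
def majority_target(n_minority, ratio):
    """Hypothetical helper: number of majority rows to keep so that
    n_minority / n_majority is (approximately) equal to ratio."""
    if not 0 < ratio <= 1:
        raise ValueError("ratio must be in (0, 1]")
    return int(n_minority / ratio)


for ratio in (1.0, 0.5, 0.25):
    n_maj = majority_target(10, ratio)
    print(f"ratio={ratio}: keep {n_maj} majority rows, {10 + n_maj} rows in total")
# ratio=1.0: keep 10 majority rows, 20 rows in total
# ratio=0.5: keep 20 majority rows, 30 rows in total
# ratio=0.25: keep 40 majority rows, 50 rows in total
```

Smaller ratios keep more of the majority class, so less data is discarded at the cost of a less balanced training set.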

In [None]:
# size of undersampled data

X_resampled.shape, y_resampled.shape

In [None]:
# see that we have twice as many of the majority now

y_resampled.value_counts()

In [None]:
# we can also specify how many observations we want
# from each class

rus = RandomUnderSampler(
    sampling_strategy={0: 100, 1: 15},  # number of observations to keep per class
    random_state=0,  # for reproducibility
    replacement=False,  # sample without replacement
)

X_resampled, y_resampled = rus.fit_resample(X, y)

### Specifying Exact Sample Counts

For even more precise control, we can specify exactly how many samples we want from each class using a dictionary:

In [None]:
# size of undersampled data

X_resampled.shape, y_resampled.shape

In [None]:
# we have what we asked for :)

y_resampled.value_counts()

**Perfect Control:**
- **Class 0 (majority)**: Exactly 100 samples as requested
- **Class 1 (minority)**: Exactly 15 samples as requested  
- **Total samples**: 115 samples
- **Custom ratio**: 100:15 ≈ 6.7:1 ratio

This approach is useful when you have domain knowledge about the optimal class distribution for your specific problem.
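As a rough illustration of what a per-class dictionary asks for, the sketch below draws a fixed number of rows from each class with NumPy. The helper `undersample_to_counts` and the toy label vector are hypothetical and only approximate the idea; they are not the imbalanced-learn implementation.

```python
import numpy as np


def undersample_to_counts(y, counts, random_state=0):
    """Hypothetical helper: return sorted row positions that keep
    counts[label] randomly chosen rows of each class."""
    rng = np.random.default_rng(random_state)
    keep = []
    for label, n_keep in counts.items():
        idx = np.flatnonzero(y == label)
        if n_keep > len(idx):
            # undersampling can only remove rows, it cannot create them
            raise ValueError(f"class {label} has only {len(idx)} rows")
        keep.append(rng.choice(idx, size=n_keep, replace=False))
    return np.sort(np.concatenate(keep))


# toy labels: 980 majority rows and 20 minority rows
y_toy = np.array([0] * 980 + [1] * 20)

kept = undersample_to_counts(y_toy, {0: 100, 1: 15})
print(np.bincount(y_toy[kept]).tolist())  # [100, 15]
```

Note that the requested count for a class cannot exceed the number of rows that class actually has, which is why the sketch raises an error in that case.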

## Load data

In [None]:
# load data
data = pd.read_csv('../kdd2004.csv')

data.head()

## Real-World Application: KDD Cup 2004 Dataset

Now let's apply random undersampling to a real-world dataset: the KDD Cup 2004 data loaded above from `kdd2004.csv`. Its name and its 74 features match the protein homology prediction task of that competition, in which the target flags whether a protein sequence is homologous to a reference protein.

**Why this dataset?**
- **Real imbalanced data**: The positive class is rare
- **High dimensionality**: 74 features to work with
- **Practical relevance**: Detecting rare positives is a common use case for imbalanced learning

In [None]:
# size of data
data.shape

## Imbalanced target

In [None]:
# imbalanced target
data.target.value_counts() / len(data)

### Analyzing Class Imbalance

Let's examine the severity of class imbalance in this real-world dataset:

**Severe imbalance:**
- **Class 0 (majority)**: The overwhelming share of observations
- **Class 1 (minority)**: A small fraction of observations; the cell above prints the exact proportions
- **Problem**: A model trained on this data will tend to predict class 0 for most cases and miss many minority class observations

## Separate train and test

In [None]:
# separate dataset into train and test

X_train, X_test, y_train, y_test = train_test_split(
    data.drop(labels=['target'], axis=1),  # drop the target
    data['target'],  # just the target
    test_size=0.3,
    random_state=0)

X_train.shape, X_test.shape

### Creating Train/Test Split

**Important principle:** Always split your data BEFORE applying any resampling technique to avoid data leakage.

**Why this order matters:**
1. **Split first**: Ensures test set represents real-world distribution
2. **Resample training only**: Test set remains untouched for unbiased evaluation
3. **Prevent data leakage**: No information from test set influences training
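The order of operations above can be sketched with NumPy on toy labels. Everything here (the label vector, the 30% hold-out, the variable names) is illustrative; the notebook itself uses `train_test_split` and `RandomUnderSampler` for the same two steps.

```python
import numpy as np

rng = np.random.default_rng(0)
y_all = np.array([0] * 950 + [1] * 50)  # toy labels with 5% positives

# 1. split first: shuffle the row positions and hold out 30% as the test set
order = rng.permutation(len(y_all))
n_test = int(0.3 * len(y_all))
test_idx, train_idx = order[:n_test], order[n_test:]

# 2. resample the training rows only: keep all positives and
#    an equally sized random subset of the negatives
train_pos = train_idx[y_all[train_idx] == 1]
train_neg = train_idx[y_all[train_idx] == 0]
train_neg = rng.choice(train_neg, size=len(train_pos), replace=False)
balanced_train_idx = np.concatenate([train_pos, train_neg])

print("share of positives in balanced train:", y_all[balanced_train_idx].mean())
print("share of positives in untouched test:", round(y_all[test_idx].mean(), 3))
```

The training set ends up balanced, while the test set keeps the imbalance of the original data and shares no rows with the training set.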

## Random Undersampling

[RandomUnderSampler](https://imbalanced-learn.org/stable/references/generated/imblearn.under_sampling.RandomUnderSampler.html)

In [None]:
rus = RandomUnderSampler(
    sampling_strategy='auto',  # samples only from majority class
    random_state=0,  # for reproducibility
    replacement=True # if it should resample with replacement
)  

X_resampled, y_resampled = rus.fit_resample(X_train, y_train)

### Applying Random Undersampling to Real Data

Now let's apply random undersampling to the training data only. With `replacement=True`, majority class observations are drawn with replacement, so the same observation can be selected more than once.

In [None]:
# size of undersampled data

X_resampled.shape, y_resampled.shape

In [None]:
# number of positive class in original dataset
y_train.value_counts()

In [None]:
# final data size is 2 times the number of observations
# with positive class:

y_train.value_counts()[1] * 2

## Plot data

Let's compare how the data looks before and after the undersampling.

In [None]:
sns.scatterplot(data=data.sample(1784, random_state=0),
                x="0",
                y="1",
                hue="target")

### Visualizing High-Dimensional Data

Since we're dealing with 74 features, we can only visualize 2 dimensions at a time. Let's examine features 0 and 1 to see how the class distributions change:

**Note**: We're sampling 1,784 points from the original data to match the undersampled dataset size for fair comparison.

In [None]:
col_names = [str(i) for i in range(74)] + ['target']

data_resampled = pd.concat([X_resampled, y_resampled], axis=1)
data_resampled.columns = col_names

sns.scatterplot(data=data_resampled, x="0", y="1", hue="target")

The distributions are similar to those of the original data. You see more purple dots because they are no longer covered by the pink ones.

In [None]:
sns.scatterplot(data=data.sample(1784, random_state=0),
                x="4",
                y="5",
                hue="target")

In [None]:
sns.scatterplot(data=data_resampled, x="4", y="5", hue="target")

## Machine learning performance comparison

Let's compare model performance with and without undersampling.

In [None]:
# function to train random forests and evaluate the performance

def run_randomForests(X_train, X_test, y_train, y_test):
    
    rf = RandomForestClassifier(n_estimators=200, random_state=39, max_depth=4)
    rf.fit(X_train, y_train)

    print('Train set')
    pred = rf.predict_proba(X_train)
    print('Random Forests roc-auc: {}'.format(roc_auc_score(y_train, pred[:,1])))
    
    print('Test set')
    pred = rf.predict_proba(X_test)
    print('Random Forests roc-auc: {}'.format(roc_auc_score(y_test, pred[:,1])))

### The Ultimate Test: Machine Learning Performance

Now comes the crucial question: **Does random undersampling actually improve model performance?**

We'll compare a Random Forest classifier trained on:
1. **Original imbalanced dataset** 
2. **Undersampled balanced dataset**

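**Evaluation metric**: We'll use ROC-AUC because it does not depend on a classification threshold and, unlike accuracy, is not inflated by always predicting the majority class. It measures how well the model ranks positive observations above negative ones.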
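One way to build intuition for ROC-AUC is to compute it as the share of (positive, negative) pairs in which the positive observation receives the higher score. The function `roc_auc_pairwise` below is a hypothetical, brute-force sketch for small inputs; the notebook itself uses `roc_auc_score` from scikit-learn.

```python
def roc_auc_pairwise(labels, scores):
    """ROC-AUC as the share of (positive, negative) pairs that are
    ranked correctly; ties count as half a correct pair."""
    pos = [s for t, s in zip(labels, scores) if t == 1]
    neg = [s for t, s in zip(labels, scores) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
print(roc_auc_pairwise(labels, scores))  # 0.75

# ten times more negatives with the same scores: the value does not change
many_labels = [0, 0] * 10 + [1, 1]
many_scores = [0.1, 0.4] * 10 + [0.35, 0.8]
print(roc_auc_pairwise(many_labels, many_scores))  # 0.75
```

The second call shows why the metric suits imbalanced problems: multiplying the number of negatives, without changing how they are scored, leaves the value unchanged.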

Let's define a function to train and evaluate Random Forest models:

In [None]:
# evaluate performance of algorithm built
# using imbalanced dataset

run_randomForests(X_train,
                  X_test,
                  y_train,
                  y_test)

#### Performance on Original Imbalanced Data

First, let's see how well our model performs when trained on the original, highly imbalanced dataset:

In [None]:
# evaluate performance of algorithm built
# using undersampled dataset

run_randomForests(X_resampled,
                  X_test,
                  y_resampled,
                  y_test)

#### Performance on Undersampled Balanced Data

Now let's see the performance improvement when training on the balanced dataset created through random undersampling:

There is a big jump in model performance.

**HOMEWORK**

- Try random undersampling with and without replacement, and with different machine learning models.

### 🎉 Remarkable Performance Improvement!

**Key Results:**
- **Training set improvement**: Significant increase in ROC-AUC score
- **Test set improvement**: Substantial boost in generalization performance  
- **Success**: Random undersampling successfully addressed the class imbalance problem!

**Why does this work?**
1. **Balanced learning**: The model sees an equal number of examples from both classes during training
2. **Better decision boundaries**: The algorithm can learn to distinguish minority class patterns
3. **Reduced bias**: Less tendency to always predict the majority class
4. **Improved sensitivity**: Better at identifying positive cases (the minority class)

**Trade-offs to consider:**
- ✅ **Pros**: Simple, fast, often effective, and, because rows are removed at random, the remaining majority class keeps roughly the same feature distribution
- ❌ **Cons**: Potential information loss, smaller training set, may remove important samples
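The information loss listed under the cons can be quantified with simple arithmetic. The helper `discarded_share` is hypothetical, and the class sizes are the approximate ones from the synthetic example earlier in the notebook.

```python
def discarded_share(n_majority, n_minority, ratio=1.0):
    """Hypothetical helper: share of majority rows dropped when
    undersampling until n_minority / n_majority equals ratio."""
    kept = min(n_majority, int(n_minority / ratio))
    return 1 - kept / n_majority


print(f"{discarded_share(990, 10):.1%}")       # 99.0%
print(f"{discarded_share(990, 10, 0.5):.1%}")  # 98.0%
```

With a 99:1 imbalance, a 50:50 balance throws away about 99% of the majority class, which is why it is worth checking whether a milder ratio performs just as well.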

## Summary and Next Steps

**What we learned:**
- Random undersampling can be effective for addressing class imbalance
- It can significantly improve model performance on minority class detection
- The technique is simple to implement and computationally efficient
- Proper train/test splitting is crucial to avoid data leakage