# Understanding Imbalanced Data: Condensed Nearest Neighbours (CNN) Undersampling

## What is Imbalanced Data?

Before diving into CNN, let's understand the problem we're solving:

**Imbalanced data** occurs when one class (majority class) has significantly more observations than another class (minority class). This is extremely common in real-world scenarios:
- Credit card fraud detection (99.9% legitimate, 0.1% fraudulent transactions)
- Medical diagnosis (95% healthy, 5% diseased patients)
- Email spam detection (90% legitimate, 10% spam emails)

## Why is Imbalanced Data Problematic?

1. **Bias towards majority class**: Machine learning algorithms tend to predict the majority class more often
2. **Poor minority class detection**: Models struggle to identify the rare but often important minority class
3. **Misleading accuracy**: A model that always predicts "not fraud" achieves 99.9% accuracy but catches 0% of actual fraud
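
The accuracy paradox in point 3 can be reproduced in a few lines. This is a minimal sketch with hypothetical labels (999 legitimate transactions, 1 fraudulent) and a dummy "model" that always predicts the majority class; the numbers are illustrative and do not come from a real dataset.

```python
# Hypothetical toy labels: 999 legitimate (0) and 1 fraudulent (1) transaction.
y_true = [0] * 999 + [1]

# A dummy "model" that always predicts the majority class ("not fraud").
y_pred = [0] * len(y_true)

# Accuracy: share of predictions that match the true label.
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Recall on the minority class: share of actual frauds that were caught.
true_positives = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
recall = true_positives / sum(y_true)

print(f"accuracy = {accuracy:.3f}")
print(f"recall   = {recall:.3f}")
```

The accuracy looks excellent while the recall on the fraud class is zero, which is why accuracy alone is not used for evaluation in this notebook.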

## Approaches to Handle Imbalanced Data

There are several strategies:
1. **Oversampling**: Increase minority class samples (SMOTE, ADASYN)
2. **Undersampling**: Reduce majority class samples (Random, Tomek Links, **CNN**)
3. **Hybrid methods**: Combine both approaches
4. **Algorithm-level**: Use cost-sensitive learning or ensemble methods

Today we'll focus on **Condensed Nearest Neighbours (CNN)**, a smart undersampling technique.

# Condensed Nearest Neighbours (CNN)


The algorithm works as follows:

1) Put all minority class observations in a group, typically group O

2) Add 1 sample (at random) from the majority class to group O

3) Train a KNN with group O

4) Take a sample of the majority class that is not in group O yet

5) Predict its class with the KNN from point 3

6) If the prediction was correct, go to 4 and repeat

7) If the prediction was incorrect, add that sample to group O, go to 3 and repeat

8) Continue until all samples of the majority class were either assigned to O or left out

9) Final version of Group O is our undersampled dataset
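
The steps above can be sketched in plain NumPy. This is an illustrative single-pass version that assumes a 1-NN rule with Euclidean distance; the function name `condense_majority` and the toy data are hypothetical, and the sketch is not the imbalanced-learn implementation used later in this notebook.

```python
import numpy as np

def condense_majority(X, y, minority_label=1, seed=0):
    """Single-pass sketch of the condensation steps, using a 1-NN rule."""
    rng = np.random.default_rng(seed)
    minority_idx = np.flatnonzero(y == minority_label)
    # Visit the majority samples in random order.
    majority_idx = rng.permutation(np.flatnonzero(y != minority_label))

    # Steps 1-2: group O = all minority samples + one random majority sample.
    group_o = list(minority_idx) + [majority_idx[0]]

    # Steps 4-8: every remaining majority sample is examined once.
    for i in majority_idx[1:]:
        # Steps 3 and 5: a 1-NN "model" is just the stored points, so the
        # prediction is the label of the closest point currently in group O.
        dists = np.linalg.norm(X[group_o] - X[i], axis=1)
        predicted = y[group_o[int(np.argmin(dists))]]
        # Step 7: a misclassified sample is added to group O.
        if predicted != y[i]:
            group_o.append(i)

    # Step 9: the indices in group O define the undersampled dataset.
    return np.array(sorted(group_o))

# Hypothetical toy data: 200 majority and 10 minority points in 2D.
rng = np.random.default_rng(42)
X_toy = np.vstack([
    rng.normal(loc=0.0, scale=1.0, size=(200, 2)),  # majority class (0)
    rng.normal(loc=2.0, scale=1.0, size=(10, 2)),   # minority class (1)
])
y_toy = np.array([0] * 200 + [1] * 10)

kept = condense_majority(X_toy, y_toy)
print("original size:", len(y_toy), "| condensed size:", len(kept))
print("minority samples kept:", int((y_toy[kept] == 1).sum()))
```

Because the 1-NN "model" consists only of the points already in group O, "retraining" in this sketch simply means appending a point to the list.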


====

- **Criteria for data exclusion**: Majority class samples far from the boundary between the classes
- **Final Dataset size**: varies

====

This algorithm tends to select the majority class observations that lie near the fuzzy boundary between the classes, because those are the ones the KNN misclassifies, and moves them into group O.

If the classes are similar (they overlap), group O will contain a fair number of observations from both classes. If the classes are very different (well separated), group O will contain mostly observations from one class, the minority class.

**Caution:**

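- CNN tends to retain noise in the undersampled dataset: noisy or outlying majority class observations are likely to be misclassified by the KNN, so they are added to group O
- Computationally expensive, because it trains 1 KNN every time an observation is added to group O.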

In this notebook, we will first understand what Condensed Nearest Neighbours is doing with simulated data, and then compare its effect on model performance with real data.

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

from imblearn.under_sampling import CondensedNearestNeighbour

## Understanding the Libraries We'll Use

Let's import the essential libraries for our CNN exploration:

- **pandas & matplotlib/seaborn**: For data manipulation and visualization
- **sklearn.datasets.make_classification**: Creates synthetic imbalanced datasets for experimentation  
- **sklearn.ensemble.RandomForestClassifier**: Our machine learning model for performance comparison
- **sklearn.metrics.roc_auc_score**: Evaluation metric that works well with imbalanced data
- **imblearn.under_sampling.CondensedNearestNeighbour**: The CNN implementation from imbalanced-learn library

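**Why ROC-AUC?** Unlike accuracy, ROC-AUC is computed from the true positive rate (sensitivity) and the false positive rate across all decision thresholds, so a model cannot score well simply by always predicting the majority class. This makes it a good fit for imbalanced datasets where we care about minority class detection.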
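
To make the contrast with accuracy concrete, here is a small sketch on hypothetical labels (95 negatives, 5 positives). The helper `pairwise_auc` is an illustrative name: it uses the pairwise-ranking interpretation of ROC-AUC (the probability that a random positive receives a higher score than a random negative, with ties counted as one half) rather than calling `roc_auc_score`.

```python
import numpy as np

def pairwise_auc(y_true, scores):
    """ROC-AUC as P(score of a positive > score of a negative); ties count 0.5."""
    y_true = np.asarray(y_true)
    scores = np.asarray(scores, dtype=float)
    pos = scores[y_true == 1]
    neg = scores[y_true == 0]
    # Compare every positive score with every negative score.
    diff = pos[:, None] - neg[None, :]
    return float((diff > 0).mean() + 0.5 * (diff == 0).mean())

# Hypothetical labels: 95% majority (0), 5% minority (1).
y_true = np.array([0] * 95 + [1] * 5)

# A "model" that gives every observation the same score (always majority).
constant_scores = np.zeros(len(y_true))

accuracy = float(((constant_scores >= 0.5).astype(int) == y_true).mean())
auc_constant = pairwise_auc(y_true, constant_scores)

# A model that ranks every positive above every negative.
auc_perfect = pairwise_auc(y_true, y_true)

print(f"accuracy of constant model: {accuracy:.2f}")
print(f"ROC-AUC of constant model:  {auc_constant:.2f}")
print(f"ROC-AUC of perfect ranking: {auc_perfect:.2f}")
```

The constant model reaches 95% accuracy but only the chance-level ROC-AUC of 0.5, because it cannot rank any positive above any negative.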

## Create data

We will create data where the classes have different degrees of separateness.

https://scikit-learn.org/stable/modules/generated/sklearn.datasets.make_classification.html

In [None]:
def make_data(sep):
    
    # returns arrays
    X, y = make_classification(
        n_samples=1000,
        n_features=2,
        n_redundant=0,
        n_clusters_per_class=1,
        weights=[0.99],  # proportion of the majority class
        class_sep=sep,  # how separate the classes are
        random_state=1,
    )

    # transform arrays into a pandas DataFrame and Series
    X = pd.DataFrame(X, columns=['varA', 'varB'])
    y = pd.Series(y)
    
    return X, y

## Creating Synthetic Datasets for Learning

Before working with real data, we'll create controlled synthetic datasets to understand how CNN behaves under different conditions.

**Key Parameters in our data generation function:**
- **n_samples=1000**: Total number of data points
- **n_features=2**: Two features (varA, varB) so we can visualize easily
- **weights=[0.99]**: Creates severe imbalance - roughly 99% majority class, 1% minority class
- **class_sep**: Controls how separable the classes are (this is what we'll experiment with)

**Why start with synthetic data?** 
1. We can control the exact conditions and see how CNN responds
2. We can visualize the results in 2D space
3. We understand the "ground truth" of class separation
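
One detail worth knowing when reading the class counts: `make_classification` assigns a random class to a small fraction of the samples by default (`flip_y=0.01`), so the minority count need not be exactly 1% of 1000. The sketch below compares the default with `flip_y=0.0`; the helper name `class_counts` is hypothetical, and the other arguments mirror the ones used in `make_data` with `class_sep=1` picked arbitrarily.

```python
from collections import Counter

from sklearn.datasets import make_classification

def class_counts(flip_y):
    """Class counts of a synthetic dataset for a given label-noise level."""
    _, y = make_classification(
        n_samples=1000,
        n_features=2,
        n_redundant=0,
        n_clusters_per_class=1,
        weights=[0.99],
        flip_y=flip_y,  # fraction of samples whose class is assigned randomly
        class_sep=1,
        random_state=1,
    )
    return Counter(y.tolist())

counts_default = class_counts(flip_y=0.01)  # scikit-learn's default
counts_clean = class_counts(flip_y=0.0)     # no random label assignment

print("flip_y=0.01:", dict(counts_default))
print("flip_y=0.0: ", dict(counts_clean))
```

The randomly labelled points also act as label noise, which is relevant later because CNN tends to keep majority observations that sit among minority ones.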

In [None]:
# make datasets with different class separateness
# and plot

for sep in [0.1, 1, 2]:
    
    X, y = make_data(sep)
    
    print(y.value_counts())
    
    sns.scatterplot(
        data=X, x="varA", y="varB", hue=y
    )
    
    plt.title('Separation: {}'.format(sep))
    plt.show()

## Visualizing Different Degrees of Class Separation

Let's see how our classes look with different separation values:

- **sep=0.1**: Classes heavily overlap (very difficult to separate)
- **sep=1.0**: Classes partially overlap (moderate difficulty)  
- **sep=2.0**: Classes well separated (easier to distinguish)

**What to observe:**
1. Class imbalance: Notice the tiny minority class (orange/class 1) vs. large majority class (blue/class 0)
2. Separation effects: How overlap changes with different separation values
3. Decision boundary complexity: Where would you draw a line to separate these classes?

This visualization helps us understand why CNN's approach of focusing on boundary regions makes sense!

## Undersample with Condensed Nearest Neighbours


[CondensedNearestNeighbour](https://imbalanced-learn.org/stable/references/generated/imblearn.under_sampling.CondensedNearestNeighbour.html)


### Well separated classes

In [None]:
# create data

X, y = make_data(sep=2)

# set up condensed nearest neighbour transformer

cnn = CondensedNearestNeighbour(
    sampling_strategy='auto',  # undersamples only the majority class
    random_state=0,  # for reproducibility
    n_neighbors=1,  # 1-NN, the default behaviour
    n_jobs=4)  # I have 4 cores in my laptop

X_resampled, y_resampled = cnn.fit_resample(X, y)

## Applying CNN to Well-Separated Classes

Now let's see CNN in action! We'll start with well-separated classes (sep=2) to see the clearest example of how CNN works.

**CNN Parameters explained:**
- **sampling_strategy='auto'**: Only undersample the majority class (keeps all minority samples)
- **random_state=0**: Ensures reproducible results for teaching
- **n_neighbors=1**: Uses 1-NN for boundary detection (can be adjusted)
- **n_jobs=4**: Uses multiple CPU cores for faster processing

**What CNN will do:**
1. Start with all minority class samples (these are precious, don't remove any!)
2. Add one random majority class sample
3. Train a KNN classifier
4. Test remaining majority samples - keep those that are misclassified (near boundary)
5. Repeat until all majority samples are processed

In [None]:
# size of original data

X.shape, y.shape

In [None]:
# size of undersampled data

X_resampled.shape, y_resampled.shape

In [None]:
# number of minority class observations

y.value_counts()

## Analyzing the Results: Data Size Changes

Let's examine what happened to our dataset size after CNN undersampling:

**Key observations to make:**
1. **Original dataset**: 1000 samples with severe imbalance
2. **Undersampled dataset**: Significantly reduced size, but better balance
3. **Minority class preservation**: CNN keeps ALL minority class samples (this is crucial!)
4. **Majority class reduction**: Only keeps the "informative" majority class samples

**Why this matters**: CNN doesn't randomly throw away majority class samples. It intelligently selects only those that are close to the decision boundary, preserving the most informative samples for learning.
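
The cells above print shapes and the original class counts, but not the class counts after resampling. A small helper such as the one sketched below makes the before/after comparison explicit. The name `summarize_resampling` and the label vectors are hypothetical stand-ins; in the notebook the same call could be made with `y` and `y_resampled`.

```python
from collections import Counter

def summarize_resampling(y_before, y_after):
    """Return {label: (count_before, count_after)} for every original class."""
    before, after = Counter(y_before), Counter(y_after)
    return {label: (before[label], after[label]) for label in sorted(before)}

# Hypothetical label vectors standing in for y and y_resampled.
y_before = [0] * 990 + [1] * 10
y_after = [0] * 25 + [1] * 10

summary = summarize_resampling(y_before, y_after)
print(summary)  # {0: (990, 25), 1: (10, 10)}

# The minority class (label 1) should be untouched by the undersampling.
minority_preserved = summary[1][0] == summary[1][1]
print("minority preserved:", minority_preserved)
```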

In [None]:
sns.scatterplot(
    data=X, x="varA", y="varB", hue=y
)

plt.title('Original dataset')
plt.show()

In [None]:
# plot undersampled data

sns.scatterplot(
    data=X_resampled, x="varA", y="varB", hue=y_resampled
)

plt.title('Undersampled dataset')
plt.show()

## Visual Comparison: Before and After CNN

Now comes the most important part - visualizing what CNN actually selected!

**What to look for in the comparison:**

**Original dataset:**
- Massive majority class (blue) scattered throughout the space
- Tiny minority class (orange) clustered in one region
- Clear separation between classes

**After CNN undersampling:**
- Much smaller majority class representation
- Same minority class (CNN never touches minority samples)
- **Key insight**: Notice which majority class points were kept - they're the ones closest to the minority class region!

**The CNN "intelligence"**: CNN kept only the majority class samples that are near the decision boundary. The isolated majority class samples (far from minorities) were removed as "redundant" - they don't help distinguish between classes.

Condensed Nearest Neighbours retains the observations from the majority class that are most similar to those in the minority class.

**Note how majority class observations with varA > 3 and varB > 3 have not been included in the undersampled dataset.**

### Partially separated classes

In [None]:
# create data
X, y = make_data(sep=0.5)

# set up condensed nearest neighbour transformer

cnn = CondensedNearestNeighbour(
    sampling_strategy='auto',  # undersamples only the majority class
    random_state=0,  # for reproducibility
    n_neighbors=1,
    n_jobs=4)  # I have 4 cores in my laptop

X_resampled, y_resampled = cnn.fit_resample(X, y)

## Testing CNN with Partially Separated Classes

Now let's see how CNN behaves when classes are harder to separate (sep=0.5). This is more realistic - real-world classes often overlap significantly.

**Hypothesis**: When classes overlap more, CNN should:
1. Keep MORE majority class samples (because more are near the boundary)
2. Result in a LARGER undersampled dataset
3. Preserve more complex decision boundary information

**Why this experiment matters**: Well-separated classes are easy for any classifier. The real test is how methods perform when classes overlap - this is where CNN's boundary-focused approach should shine.

In [None]:
# original data

X.shape, y.shape

In [None]:
# undersampled data

X_resampled.shape, y_resampled.shape

Note that more samples were retained in the undersampled dataset than in the previous case, where the classes were more separated.

In [None]:
sns.scatterplot(
    data=X, x="varA", y="varB", hue=y
)

plt.title('Original dataset')
plt.show()

In [None]:
# plot undersampled data

sns.scatterplot(
    data=X_resampled, x="varA", y="varB", hue=y_resampled
)

plt.title('Undersampled dataset')
plt.show()

## Comparing Results: Well-Separated vs. Overlapping Classes

**Key Learning Moment**: Compare the dataset sizes and visualizations between well-separated (sep=2) and overlapping (sep=0.5) classes.

**What you should observe:**
1. **Dataset size difference**: The overlapping classes resulted in a larger undersampled dataset
2. **More retained samples**: More majority class points were kept when classes overlapped
3. **Boundary complexity**: The decision boundary is more complex when classes overlap

**CNN's adaptive behavior**: CNN automatically adjusts to data complexity:
- Simple, well-separated data → Aggressive undersampling (small final dataset)
- Complex, overlapping data → Conservative undersampling (larger final dataset)

This is why CNN is considered "intelligent" undersampling - it adapts to your data's characteristics!
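
The adaptive behaviour can also be checked outside imbalanced-learn. The sketch below is a compact, single-pass 1-NN condensation written only for this experiment (the names `condensed_size` and `toy_data` are hypothetical), applied to two hypothetical Gaussian datasets: one where the minority cluster is very far from the majority cluster and one where the two overlap.

```python
import numpy as np

def condensed_size(X, y, seed=0):
    """Size of the retained group after one 1-NN condensation pass (minority = 1)."""
    rng = np.random.default_rng(seed)
    majority = rng.permutation(np.flatnonzero(y == 0))
    # Start with all minority samples plus one random majority sample.
    store = np.flatnonzero(y == 1).tolist() + [int(majority[0])]
    for i in majority[1:]:
        dists = np.linalg.norm(X[store] - X[i], axis=1)
        nearest = store[int(np.argmin(dists))]
        if y[nearest] != y[i]:  # misclassified by 1-NN, so it is kept
            store.append(int(i))
    return len(store)

def toy_data(shift, seed=1):
    """200 majority points around 0 and 10 minority points around `shift`."""
    rng = np.random.default_rng(seed)
    X = np.vstack([
        rng.normal(0.0, 1.0, size=(200, 2)),
        rng.normal(shift, 1.0, size=(10, 2)),
    ])
    y = np.array([0] * 200 + [1] * 10)
    return X, y

size_separated = condensed_size(*toy_data(shift=20.0))
size_overlapping = condensed_size(*toy_data(shift=0.5))

print("retained with far-apart classes:  ", size_separated)
print("retained with overlapping classes:", size_overlapping)
```

With the clusters this far apart, every majority point is closer to the seed majority sample than to any minority point, so essentially only the seed is kept; with overlapping clusters many more majority points are misclassified and therefore retained.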

Note again how CNN preserves the observations from the majority class that look most similar to those in the minority class.

**HOMEWORK**

- Although CNN was originally developed with a 1-nearest-neighbour (1-NN) classifier, try changing the number of neighbours and compare the sizes of the undersampled datasets and the distribution of the observations in the plots.

===


## Condensed Nearest Neighbours

### Real data - Performance comparison

Does it work well with real datasets? 

Well, it will depend on the dataset, so we need to try it and compare the model built on the whole dataset with the one built on the undersampled dataset.

In [None]:
# load data
# only a few observations to speed up the computation

data = pd.read_csv('../kdd2004.csv').sample(10000)

data.head()

## Transitioning to Real-World Data

Now that we understand CNN's behavior with controlled synthetic data, let's test it on real data where:
1. **We don't control the class separation**
2. **We have many more features** (not just 2D)
3. **The patterns are more complex** and realistic
4. **We care about actual predictive performance**

**Important considerations for real data:**
- CNN can be computationally expensive (trains many KNN models)
- Real data may have noise that CNN could amplify
- Performance gains aren't guaranteed - it depends on your specific dataset

**The critical question**: Does CNN's intelligent boundary-focused undersampling translate to better machine learning performance on real problems?

In [None]:
# imbalanced target
data.target.value_counts() / len(data)

In [None]:
# separate dataset into train and test

X_train, X_test, y_train, y_test = train_test_split(
    data.drop(labels=['target'], axis=1),  # drop the target
    data['target'],  # just the target
    test_size=0.3,
    random_state=0)

X_train.shape, X_test.shape
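
The split above is purely random, so with a rare positive class the train and test sets can end up with slightly different class proportions. One common option, not used in this notebook, is the `stratify` argument of `train_test_split`. The sketch below illustrates it on hypothetical toy arrays (1000 rows, 5% positives).

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 1000 rows, 50 of them positive.
X_toy = np.arange(2000).reshape(1000, 2)
y_toy = np.array([0] * 950 + [1] * 50)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy,
    y_toy,
    test_size=0.3,
    random_state=0,
    stratify=y_toy,  # keep the class proportions equal in both sets
)

print("positives in train:", int(y_tr.sum()), "of", len(y_tr))
print("positives in test: ", int(y_te.sum()), "of", len(y_te))
```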

In [None]:
# this is going to take a while

cnn = CondensedNearestNeighbour(
    sampling_strategy='auto',  # undersamples only the majority class
    random_state=0,  # for reproducibility
    n_neighbors=1,
    n_jobs=4) 

X_resampled, y_resampled = cnn.fit_resample(X_train, y_train)

## Applying CNN to Real Data: Patience Required!

**⚠️ Performance Warning**: CNN is computationally intensive because:
1. **Multiple KNN trainings**: A KNN is refit every time a majority sample is added to the selected group
2. **Iterative process**: Every remaining majority sample has to be classified by the current KNN, one after another
3. **Distance calculations**: KNN requires computing distances to all training points

**Why it takes time**: For each majority class sample, CNN must:
- Train a KNN on current selected samples
- Predict the class of the candidate sample
- Decide whether to include it based on prediction accuracy

**In production**: Consider using CNN on a representative sample first, or use faster alternatives like Random Undersampling for initial experiments.

**The trade-off**: Computational cost vs. intelligent sample selection - whether CNN's selection justifies the wait depends on the dataset, which is what the performance comparison later in this notebook checks.
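
One way to obtain a smaller, representative sample before running CNN is to subsample each class by the same fraction. The helper below is a minimal NumPy sketch (the name `stratified_subsample` and the toy labels are hypothetical); it returns row positions rather than data so that it can be used with arrays or DataFrames.

```python
import numpy as np

def stratified_subsample(y, frac, seed=0):
    """Row positions of a random subsample that keeps each class's share."""
    y = np.asarray(y)
    rng = np.random.default_rng(seed)
    picked = []
    for label in np.unique(y):
        idx = np.flatnonzero(y == label)
        # Keep at least one row per class so that no class disappears.
        n_keep = max(1, int(round(frac * len(idx))))
        picked.append(rng.choice(idx, size=n_keep, replace=False))
    return np.sort(np.concatenate(picked))

# Hypothetical labels: 9900 majority and 100 minority observations.
y_toy = np.array([0] * 9900 + [1] * 100)

rows = stratified_subsample(y_toy, frac=0.1)
print("rows kept:", len(rows))
print("minority rows kept:", int((y_toy[rows] == 1).sum()))
```

With pandas objects, the returned positions could be applied through `.iloc`, for example `X_train.iloc[rows]` and `y_train.iloc[rows]`.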

In [None]:
# size of undersampled data

X_resampled.shape, y_resampled.shape

In [None]:
# number of positive class in original dataset
y_train.value_counts()

## Plot data

Let's compare how the data looks before and after the undersampling.

In [None]:
# original data

sns.scatterplot(data=X_train,
                x="0",
                y="1",
                hue=y_train)

plt.title('Original data')

In [None]:
# undersampled data

sns.scatterplot(data=X_resampled,
                x="0",
                y="1",
                hue=y_resampled)

plt.title('Undersampled data')

## Visualizing Real Data: Understanding CNN's Selections

**Challenge with high-dimensional data**: We can only visualize 2 features at a time, but our dataset has many more features. CNN makes decisions based on ALL features, not just the two we're plotting.

**What to look for in these plots:**
1. **Distribution changes**: How the class balance changes after undersampling
2. **Pattern preservation**: Whether important data patterns are maintained
3. **Boundary regions**: CNN should keep samples near class boundaries (where classes mix)

**Important caveat**: These 2D projections don't show the full picture. CNN operates in the full feature space, so some selections might look odd in 2D but make perfect sense in higher dimensions.
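
A tiny numeric illustration of this caveat, using two hypothetical points with three features: they look almost identical when only the first two features are plotted, yet they are far apart in the full feature space that a KNN would use.

```python
import numpy as np

# Two hypothetical observations with three features each.
a = np.array([0.0, 0.0, 0.0])
b = np.array([0.1, 0.0, 10.0])

# Distance as it appears in a 2D plot of the first two features.
dist_2d = np.linalg.norm(a[:2] - b[:2])

# Distance in the full feature space.
dist_full = np.linalg.norm(a - b)

print(f"distance in the 2D projection: {dist_2d:.3f}")
print(f"distance using all features:   {dist_full:.3f}")
```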

**Learning objective**: Focus on understanding the overall pattern of how CNN transforms the dataset rather than individual point selections.

In [None]:
# original data

sns.scatterplot(data=X_train,
                x="4",
                y="5",
                hue=y_train)

plt.title('Original data')

In [None]:
sns.scatterplot(data=X_resampled,
                x="4",
                y="5",
                hue=y_resampled)

plt.title('Undersampled data')

## Machine learning performance comparison

Let's compare model performance with and without undersampling.

In [None]:
# function to train random forests and evaluate the performance

def run_randomForests(X_train, X_test, y_train, y_test):
    
    rf = RandomForestClassifier(n_estimators=200, random_state=39, max_depth=4)
    rf.fit(X_train, y_train)

    print('Train set')
    pred = rf.predict_proba(X_train)
    print('Random Forests roc-auc: {}'.format(roc_auc_score(y_train, pred[:,1])))
    
    print('Test set')
    pred = rf.predict_proba(X_test)
    print('Random Forests roc-auc: {}'.format(roc_auc_score(y_test, pred[:,1])))

## The Ultimate Test: Machine Learning Performance

This is where theory meets practice! We'll train identical Random Forest models on:
1. **Original imbalanced data** (baseline performance)
2. **CNN undersampled data** (testing CNN's effectiveness)

**Our evaluation setup**:
- **Random Forest**: Robust, popular algorithm that works well with tabular data
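- **ROC-AUC score**: Well suited to imbalanced classification because it measures how well the model ranks minority class observations above majority class ones, regardless of the class proportions
- **Train/Test split**: Both models are evaluated on the same untouched test set, which keeps the comparison fair and lets us detect overfitting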

**What we're testing**:
- Does CNN's intelligent sample selection improve model performance?
- Is the computational cost of CNN justified by better results?
- How much performance gain (if any) do we get from CNN vs. original data?

**Success criteria**: CNN should improve the model's ability to distinguish between classes, reflected in higher ROC-AUC scores.

In [None]:
# evaluate performance of algorithm built
# using imbalanced dataset

run_randomForests(X_train,
                  X_test,
                  y_train,
                  y_test)

In [None]:
# evaluate performance of algorithm built
# using undersampled dataset

run_randomForests(X_resampled,
                  X_test,
                  y_resampled,
                  y_test)

## Performance Analysis: Interpreting the Results

**How to interpret ROC-AUC scores** (rough rules of thumb; what counts as good depends on the problem):
- **0.5**: Random guessing (no better than coin flip)
- **0.5-0.7**: Poor performance
- **0.7-0.8**: Fair performance
- **0.8-0.9**: Good performance
- **0.9-1.0**: Excellent performance

**Key comparisons to make**:
1. **Training vs. Testing**: Check for overfitting (big gap indicates overfitting)
2. **Original vs. CNN**: Did CNN improve performance on the test set?
3. **Practical significance**: Is the improvement worth the computational cost?

**What good results look like**:
- CNN test score > Original test score
- Similar training and testing scores (no overfitting)
- Meaningful improvement (not just 0.01 difference)

**If CNN doesn't help**: That's also valuable learning! Not all datasets benefit from CNN - it depends on the nature of your class boundaries and data distribution.

## Key Takeaways and Next Steps

**What we learned about CNN**:
1. **Intelligent selection**: CNN doesn't randomly remove samples - it focuses on class boundaries
2. **Adaptive behavior**: Keeps more samples when classes overlap, fewer when well-separated  
3. **Computational trade-off**: More expensive than random undersampling but potentially more effective
4. **Performance varies**: Effectiveness depends on your specific dataset and problem

**When to consider CNN**:
✅ **Good for**: Datasets with clear class boundaries, sufficient computational resources  
✅ **Good for**: When you need to preserve minority class information completely  
✅ **Good for**: Complex datasets where boundary regions are crucial  

⚠️ **Be cautious with**: Very noisy datasets (CNN might preserve noise)  
⚠️ **Be cautious with**: Time-constrained projects (can be slow)  
⚠️ **Be cautious with**: Very high-dimensional data (curse of dimensionality affects KNN)

**Alternative undersampling methods to explore**:
- **Random Undersampling**: Fast baseline
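- **Tomek Links**: Removes majority class samples that form a Tomek link, i.e. a pair of samples from opposite classes that are each other's nearest neighbours
- **Edited Nearest Neighbours**: Removes samples whose class disagrees with the class of their nearest neighbours
- **OneSidedSelection**: Combines Condensed Nearest Neighbours with Tomek Links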
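
As a pointer for exploring the second alternative, the sketch below finds Tomek links by brute force: pairs of observations from opposite classes that are each other's nearest neighbour. The function name `tomek_links` and the one-dimensional toy data are hypothetical; this is an illustration of the definition, not the imbalanced-learn implementation.

```python
import numpy as np

def tomek_links(X, y):
    """Return index pairs (i, j) of opposite-class mutual nearest neighbours."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    # Pairwise Euclidean distances; a point is never its own neighbour.
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)
    nearest = dists.argmin(axis=1)

    links = []
    for i, j in enumerate(nearest):
        # i < j avoids reporting the same pair twice.
        if i < j and nearest[j] == i and y[i] != y[j]:
            links.append((i, int(j)))
    return links

# Hypothetical 1D data: points 1 and 2 are mutual neighbours of opposite class.
X_toy = np.array([[0.0], [1.0], [1.2], [3.0]])
y_toy = np.array([0, 0, 1, 1])

links = tomek_links(X_toy, y_toy)
print(links)  # [(1, 2)]
```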

**HOMEWORK**

- Try changing the number of neighbours used to select the observations from the majority class, and try different machine learning models. Compare the final dataset size, the model performance and the distributions of the observations before and after the undersampling.
- **Extension**: Compare CNN with other undersampling techniques on the same dataset
- **Challenge**: Test CNN on different types of imbalanced problems (text classification, image classification, etc.)