If a dataset has 90 samples in class 1 and 10 in class 2, a classifier that always predicts class 1 would reach 90% accuracy despite having learned nothing useful. To deal with this, we can balance the way we score predictions.
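One way to "balance the score" is to average the recall of each class instead of pooling all samples together. The sketch below uses hypothetical toy labels that match the 90/10 example; scikit-learn offers the same idea as `balanced_accuracy_score` in `sklearn.metrics`, but only NumPy is used here so the arithmetic stays visible.

```python
import numpy as np

# Hypothetical toy labels matching the example above: 90 samples of class 1, 10 of class 2
y_true = np.array([1] * 90 + [2] * 10)
# A "classifier" that always predicts class 1
y_pred = np.full_like(y_true, 1)

# Plain accuracy pools all samples, so the majority class dominates
accuracy = float(np.mean(y_true == y_pred))

# Recall per class: the fraction of each class's samples that were predicted correctly
recalls = [float(np.mean(y_pred[y_true == c] == c)) for c in np.unique(y_true)]

# Balanced accuracy is the unweighted mean of the per-class recalls
balanced_accuracy = float(np.mean(recalls))

print(f"Accuracy: {accuracy:.2f}")
print(f"Recall per class: {recalls}")
print(f"Balanced accuracy: {balanced_accuracy:.2f}")
```

The always-class-1 predictor gets full recall on class 1 and zero recall on class 2, so its balanced accuracy is 0.50, the same as random guessing.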

In [1]:
import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.model_selection import train_test_split

from sklearn.svm import SVC

from sklearn.model_selection import GridSearchCV, cross_val_score

In [2]:
df = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/wdbc.data', header=None)
# Breast cancer data, with diagnosis as target variable
print(df.loc[:, 1].value_counts())

y = df.loc[:, 1].values
X = df.loc[:, 2:].values

le = LabelEncoder()
y = le.fit_transform(y)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, stratify=y, random_state=42)

1
B    357
M    212
Name: count, dtype: int64


In [3]:
# Let's create an imbalanced dataset:
X_imbalanced = np.vstack((X[y == 0], X[y == 1][:40]))
y_imbalanced = np.hstack((y[y == 0], y[y == 1][:40]))

# Let's try a uniform predictor that has learned only the majority class
prediction = np.zeros(y_imbalanced.shape[0])
print(f"Prediction accuracy: {np.mean(y_imbalanced == prediction) * 100:.3f}%. Clearly, this isn't a good metric.")

Prediction accuracy: 89.924%. Clearly, this isn't a good metric.


There are a few ways to deal with this.

1) Choose metrics appropriate for the purpose. If you care about detecting everyone with a malignant tumor so that they can be recommended for further screening (even at the cost of a few false positives), then you want to maximize the true positive rate (recall).

2) Assign larger penalties to wrong predictions on the minority class. With scikit-learn, this can be as simple as setting class_weight='balanced' in classifiers that support it.

3) Upsample the minority class or downsample the majority class so that the training set has an even proportion of samples. Scikit-learn makes this simple with the resample function from sklearn.utils, which can create a new training set by drawing samples with replacement from the minority class.

4) Generate synthetic data for the minority class.

There is no single best way - try a few things, and see how they fare.
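To make option 2 more concrete: scikit-learn documents the 'balanced' mode as weighting each class by n_samples / (n_classes * np.bincount(y)). The sketch below reproduces that formula with NumPy on hypothetical labels that have the same class counts as the imbalanced dataset (357 vs. 40); `y_demo` is an illustrative name, not part of the notebook.

```python
import numpy as np

# Hypothetical labels with the same class counts as the imbalanced dataset: 357 vs. 40
y_demo = np.array([0] * 357 + [1] * 40)

counts = np.bincount(y_demo)
n_samples, n_classes = y_demo.shape[0], counts.shape[0]

# The 'balanced' heuristic: weight each class inversely to its frequency
class_weights = n_samples / (n_classes * counts)

for label, weight in enumerate(class_weights):
    print(f"Class {label}: weight {weight:.3f}")

# With these weights, each class contributes the same total weight to the loss
total_weight_per_class = counts * class_weights
print(total_weight_per_class)
```

A mistake on a minority sample is therefore penalized roughly nine times as heavily as a mistake on a majority sample. In practice you would not compute this by hand: passing class_weight='balanced' to a classifier such as SVC applies the same weighting.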

In [4]:
# Upsample the minority class of the imbalanced dataset

from sklearn.utils import resample

print(f"Initial samples: {np.bincount(y_imbalanced)}")

X_upsampled, y_upsampled = resample(X_imbalanced[y_imbalanced == 1], y_imbalanced[y_imbalanced == 1], # specify the part to resample from
                                    replace=True,# Draw with replacement
                                    n_samples=X_imbalanced[y_imbalanced == 0].shape[0],# Make as many samples as the other class
                                    random_state=42)

print(f"Post resampling: {np.bincount(y_upsampled)}")

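# Combine the untouched majority class with the upsampled minority class
X_balanced = np.vstack((X_imbalanced[y_imbalanced == 0], X_upsampled))
y_balanced = np.hstack((y_imbalanced[y_imbalanced == 0], y_upsampled))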

print(f"Post balancing: {np.bincount(y_balanced)}")


Initial samples: [357  40]
Post resampling: [  0 357]
Post balancing: [357 357]


In [None]:
# Downsampling the majority works similarly

X_downsampled, y_downsampled = resample(X_imbalanced[y_imbalanced == 0], y_imbalanced[y_imbalanced == 0],
                                        replace=False,# We don't need to draw with replacement since we have more than enough
                                        n_samples=X_imbalanced[y_imbalanced==1].shape[0],
                                        random_state=42)

print(f"Post resampling: {np.bincount(y_downsampled)}")

X_balanced = np.vstack((X_imbalanced[y_imbalanced == 1], X_downsampled))
y_balanced = np.hstack((y_imbalanced[y_imbalanced == 1], y_downsampled))

print(f"Post balancing: {np.bincount(y_balanced)}")

Post resampling: [40]
Post balancing: [40 40]


On both the upsampled and the downsampled datasets, the uniform predictor would have an accuracy of only 50%.

Finally, synthetic training data for the minority class can be generated with other libraries. The most popular algorithm for this is SMOTE (Synthetic Minority Oversampling TEchnique).
SMOTE works by selecting examples that are close in the feature space, drawing a line between the examples in the feature space and drawing a new sample at a point along that line.

Specifically, a random example from the minority class is chosen first. Then its k nearest neighbors within the minority class are found (typically k=5). One of those neighbors is selected at random, and a synthetic example is created at a randomly selected point between the two examples in feature space.
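The steps above can be sketched in a few lines of NumPy. This is a simplified illustration of the interpolation idea only, not the reference implementation; the function name `smote_sketch`, its parameters, and the random 2-D data are all hypothetical, and it assumes the minority class has more than k samples.

```python
import numpy as np

def smote_sketch(X_minority, n_new, k=5, seed=0):
    """Create n_new synthetic points by interpolating between minority samples and their neighbors."""
    rng = np.random.default_rng(seed)
    n = X_minority.shape[0]

    # Pairwise Euclidean distances within the minority class
    dists = np.linalg.norm(X_minority[:, None, :] - X_minority[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)  # a point is not its own neighbor

    # Indices of the k nearest neighbors of every minority sample
    neighbors = np.argsort(dists, axis=1)[:, :k]

    # Pick random minority examples, then one random neighbor for each of them
    base = rng.integers(0, n, size=n_new)
    chosen = neighbors[base, rng.integers(0, k, size=n_new)]

    # Place each synthetic point at a random position on the segment between the pair
    gap = rng.random((n_new, 1))
    return X_minority[base] + gap * (X_minority[chosen] - X_minority[base])

# Hypothetical 2-D minority class with 40 points
rng = np.random.default_rng(42)
X_min = rng.normal(size=(40, 2))

# 317 new points would bring 40 minority samples up to 357
X_new = smote_sketch(X_min, n_new=317)
print(X_new.shape)
```

Because every synthetic point lies on a segment between two real minority samples, the new data stays inside the region already covered by the minority class rather than duplicating existing rows, as resampling with replacement does.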

There is a scikit-learn-compatible package that implements it, imbalanced-learn: https://github.com/scikit-learn-contrib/imbalanced-learn.