https://elitedatascience.com/imbalanced-classes

Data processing
* Up-sample the minority class
* Down-sample the majority class
* SMOTE

Algorithm-specific
* Penalize mistakes on the minority class more heavily (cost-sensitive training) -- see `class_weight` on SVM in sklearn
* Use tree-based algorithms -- see Random Forests
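A minimal sketch of cost-sensitive training, assuming sklearn's `class_weight='balanced'` option on `SVC` and `RandomForestClassifier`. The toy arrays `X_demo` and `y_demo` are hypothetical and unrelated to the data created later in this notebook.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.utils.class_weight import compute_class_weight

# Hypothetical toy data: roughly 90% class 1, 10% class 0
rng = np.random.RandomState(0)
X_demo = rng.randn(500, 4)
y_demo = (rng.rand(500) < 0.9).astype(int)

# 'balanced' weights each class by n_samples / (n_classes * class_count),
# so errors on the rarer class cost more during training
weights = compute_class_weight(class_weight='balanced', classes=np.array([0, 1]), y=y_demo)
print('class weights:', dict(zip([0, 1], weights)))

# Both estimators accept the same option
svm_clf = SVC(class_weight='balanced').fit(X_demo, y_demo)
rf_clf = RandomForestClassifier(n_estimators=50, class_weight='balanced',
                                random_state=0).fit(X_demo, y_demo)
```

Unlike resampling, this leaves the training data unchanged and only changes how much each mistake counts.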

Change evaluation metric
* Use metrics beyond accuracy, such as recall, precision, and AUROC.
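To illustrate why accuracy alone misleads on imbalanced data, here is a small sketch with hypothetical labels: a classifier that always predicts the majority class scores 90% accuracy but finds none of the minority class.

```python
import numpy as np
from sklearn.metrics import accuracy_score, precision_score, recall_score, roc_auc_score

# Hypothetical labels: 90 majority (1) and 10 minority (0) observations
y_true = np.array([1] * 90 + [0] * 10)

# A useless classifier that always predicts the majority class
y_pred = np.ones_like(y_true)
y_score = np.full(len(y_true), 0.5)  # constant score carries no ranking information

acc = accuracy_score(y_true, y_pred)
majority_precision = precision_score(y_true, y_pred)             # class 1 is the positive label
minority_recall = recall_score(y_true, y_pred, pos_label=0)      # treat class 0 as positive
auroc = roc_auc_score(y_true, y_score)

print('accuracy:', acc)                       # 0.9
print('majority precision:', majority_precision)  # 0.9
print('minority recall:', minority_recall)    # 0.0
print('AUROC:', auroc)                        # 0.5
```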

not relevant?
* https://machinelearningmastery.com/k-fold-cross-validation/

In [1]:
!pip install imbalanced-learn



In [2]:
import pandas
print('pandas',pandas.__version__)
import numpy as np
print('numpy',np.__version__)
import sklearn.utils
import sklearn.model_selection  # needed for train_test_split below
import imblearn.over_sampling

pandas 0.23.4
numpy 1.13.3


# Create fake data

In [3]:
num_rows=10000
ratio_of_classes=0.9

In [4]:
df1 = pandas.DataFrame(np.abs(np.random.randn(num_rows, 4)), columns=list('ABCD'))
df2 = pandas.DataFrame(np.random.randint(10,size=(num_rows, 4)), columns=list('EFGH'))
# https://stackoverflow.com/questions/10803135/weighted-choice-short-and-simple
elements=[0, 1]
weights=[1-ratio_of_classes, ratio_of_classes]
# draw all labels in one vectorized call
df3 = pandas.DataFrame(np.random.choice(elements, size=num_rows, p=weights), columns=['J'])
# the three frames share the same default index, so join_axes (removed in pandas 1.0) is not needed
cleaned_df = pandas.concat([df1, df2, df3], axis=1)
cleaned_df.head()

Unnamed: 0,A,B,C,D,E,F,G,H,J
0,1.034239,1.607263,0.098639,1.586527,9,4,6,3,1
1,0.027902,0.99401,0.979589,0.196169,8,9,0,4,1
2,0.310414,0.128078,0.254571,1.373261,6,0,2,2,1
3,1.481125,0.865779,0.439763,0.386991,7,8,2,0,1
4,0.721356,0.85671,0.841289,0.418589,4,6,5,4,1


In [5]:
X = cleaned_df.drop('J', axis=1)
y=cleaned_df['J']
X_train, X_test, y_train, y_test = sklearn.model_selection.train_test_split(X, y, test_size=0.33, random_state=42)

# Up-sample the minority class

Resample the minority class with replacement:

1. Separate observations from each class into different DataFrames.
1. Resample the minority class with replacement, setting the number of samples to match that of the majority class.
1. Combine the up-sampled minority class DataFrame with the original majority class DataFrame.

In [6]:
# Separate majority and minority classes
df_majority = cleaned_df[cleaned_df['J']==1]
df_minority = cleaned_df[cleaned_df['J']==0]

In [7]:
print('majority:',df_majority.shape)
print('minority:',df_minority.shape)

majority: (8996, 9)
minority: (1004, 9)


In [8]:
# Upsample minority class
df_minority_upsampled = sklearn.utils.resample(df_minority, 
                                               replace=True,     # sample with replacement
                                               n_samples=df_majority.shape[0],    # to match majority class
                                               random_state=42) # reproducible results

In [9]:
# Combine majority class with upsampled minority class
df_upsampled = pandas.concat([df_majority, df_minority_upsampled])
 
# Display new class counts
df_upsampled['J'].value_counts()

1    8996
0    8996
Name: J, dtype: int64

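See also imblearn's `RandomOverSampler`, which over-samples the minority class with replacement in a single call: https://imbalanced-learn.org/en/stable/generated/imblearn.over_sampling.RandomOverSampler.html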

In [10]:
ros = imblearn.over_sampling.RandomOverSampler(random_state=42)
X_res, y_res = ros.fit_resample(X, y)

# Down-sample the majority class

1. Separate observations from each class into different DataFrames.
1. Resample the majority class without replacement, setting the number of samples to match that of the minority class.
1. Combine the down-sampled majority class DataFrame with the original minority class DataFrame.

In [11]:
print('majority:',df_majority.shape)
print('minority:',df_minority.shape)

majority: (8996, 9)
minority: (1004, 9)


In [12]:
# Downsample majority class
df_majority_downsampled = sklearn.utils.resample(df_majority, 
                                                 replace=False,    # sample without replacement
                                                 n_samples=df_minority.shape[0],     # to match minority class
                                                 random_state=42) # reproducible results

In [13]:
# Combine minority class with downsampled majority class
df_downsampled = pandas.concat([df_majority_downsampled, df_minority])

# Display new class counts
df_downsampled['J'].value_counts()

1    1004
0    1004
Name: J, dtype: int64

# SMOTE: Synthetic Minority Over-sampling Technique

https://imbalanced-learn.org/en/stable/generated/imblearn.over_sampling.SMOTE.html

https://stackoverflow.com/questions/15065833/imbalance-in-scikit-learn

SMOTE creates synthetic observations of the minority class by:
1. Finding the k-nearest-neighbors for minority class observations (finding similar observations)
1. Randomly choosing one of the k-nearest-neighbors and using it to create a similar, but randomly tweaked, new observation.
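The two steps above can be sketched in plain NumPy. This is an illustration of the idea only, not imblearn's implementation: the helper `smote_sketch` and the array `X_min_demo` are hypothetical, the neighbour search is brute force, and it assumes `k` is smaller than the number of minority rows.

```python
import numpy as np

def smote_sketch(X_min, n_new, k=5, seed=0):
    """Generate n_new synthetic rows by interpolating between minority neighbours."""
    rng = np.random.RandomState(seed)
    X_min = np.asarray(X_min, dtype=float)

    # Step 1: find the k nearest minority neighbours of every minority observation
    dists = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)                 # a point is not its own neighbour
    neighbours = np.argsort(dists, axis=1)[:, :k]   # indices of the k closest rows

    # Step 2: pick a random observation, then one of its neighbours, and place
    # the new point at a random position on the segment between the two
    base = rng.randint(len(X_min), size=n_new)
    picked = neighbours[base, rng.randint(k, size=n_new)]
    gap = rng.rand(n_new, 1)
    return X_min[base] + gap * (X_min[picked] - X_min[base])

X_min_demo = np.random.RandomState(1).rand(20, 2)   # hypothetical minority observations
X_new = smote_sketch(X_min_demo, n_new=30, k=5)
print(X_new.shape)
```

Because each synthetic row lies between two existing minority rows, the new points stay inside the region the minority class already occupies.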

In [14]:
sm = imblearn.over_sampling.SMOTE(random_state=2)
sm

SMOTE(k_neighbors=5, kind='deprecated', m_neighbors='deprecated', n_jobs=1,
   out_step='deprecated', random_state=2, ratio=None,
   sampling_strategy='auto', svm_estimator='deprecated')


In [15]:
# fit_sample is deprecated; fit_resample is the current name (already used for RandomOverSampler above)
X_train_res, y_train_res = sm.fit_resample(X_train, y_train)

In [16]:
print('X_train',X_train.shape)
print('X_train_res',X_train_res.shape)

X_train (6700, 8)
X_train_res (12024, 8)


In [17]:
print('y_train     ratio:',sum(y_train    ==0),'/',sum(y_train==1),'=',sum(y_train==0)/sum(y_train==1))
print('y_train_res ratio:',sum(y_train_res==0),'/',sum(y_train_res==1),'=',sum(y_train_res==0)/sum(y_train_res==1))

y_train     ratio: 688 / 6012 = 0.11443779108449767
y_train_res ratio: 6012 / 6012 = 1.0


# Combine approaches

https://imbalanced-learn.org/en/stable/combine.html
    
https://imbalanced-learn.org/en/stable/auto_examples/combine/plot_comparison_combine.html#sphx-glr-auto-examples-combine-plot-comparison-combine-py