In [19]:
import pandas as pd
from sklearn.utils import resample
from sklearn.model_selection import train_test_split
from collections import Counter

In [9]:
# load data
df_train = pd.read_csv(r'../data/train.csv')
X = df_train.comment_text
# binarize the continuous toxicity score: values >= 0.5 are labeled 'toxic'
y = df_train.target.apply(lambda x: 'toxic' if x >= 0.5 else 'non-toxic')

# train test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.20, random_state = 17)

train = pd.concat([X_train, y_train], axis = 1) # recombine training features and labels so rows can be filtered by class

# separate toxic and non-toxic instances
non_toxic = train[train.target == 'non-toxic']
toxic = train[train.target == 'toxic']

# 1. Handling imbalanced data
## 1.1 Oversampling minority class (toxic)

"Oversampling can be defined as adding more copies of the minority class. Oversampling can be a good choice when you don’t have a ton of data to work with.
We will use the resampling module from Scikit-Learn to randomly replicate samples from the minority class.
Important Note:
Always split into test and train sets BEFORE trying oversampling techniques! Oversampling before splitting the data can allow the exact same observations to be present in both the test and train sets. This can allow our model to simply memorize specific data points and cause overfitting and poor generalization to the test data." (Boyle, 2019)
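The note above can be illustrated with a minimal sketch on toy data. The frame `toy`, its `row_id` column, and the helper `oversample_minority` are hypothetical names introduced only for this illustration; they are not part of the notebook's pipeline. The sketch compares which original rows end up on both sides of the split under each ordering.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

# Hypothetical toy data: 20 majority rows and 4 minority rows, each with a unique row_id
toy = pd.DataFrame({
    "row_id": range(24),
    "target": ["non-toxic"] * 20 + ["toxic"] * 4,
})

def oversample_minority(frame, seed=0):
    """Replicate 'toxic' rows (with replacement) until both classes have the same size."""
    majority = frame[frame.target == "non-toxic"]
    minority = frame[frame.target == "toxic"]
    minority_up = resample(minority, replace=True,
                           n_samples=len(majority), random_state=seed)
    return pd.concat([majority, minority_up])

# Wrong order: oversample, then split. Copies of one row can land on both sides.
leaky_train, leaky_test = train_test_split(
    oversample_minority(toy), test_size=0.25, random_state=0)
leaky_overlap = set(leaky_train.row_id) & set(leaky_test.row_id)

# Right order: split, then oversample the training part only.
# stratify is an assumption of this toy example; it only ensures that the
# tiny training set is sure to contain some toxic rows.
clean_train, clean_test = train_test_split(
    toy, test_size=0.25, random_state=0, stratify=toy.target)
clean_train = oversample_minority(clean_train)
clean_overlap = set(clean_train.row_id) & set(clean_test.row_id)

print("shared row_ids, oversample then split:", sorted(leaky_overlap))
print("shared row_ids, split then oversample:", sorted(clean_overlap))
```

When the split comes first, the second set is empty by construction, because every test row is set aside before any copies are made. When oversampling comes first, duplicated minority rows will typically appear on both sides.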

In [13]:
# oversample minority class
toxic_oversampled = resample(toxic,
                          replace = True, # sample with replacement
                          n_samples = len(non_toxic), # match the size of non-toxic set
                          random_state = 27)

# training set: all non-toxic instances plus the oversampled toxic instances
train_oversampled = pd.concat([non_toxic, toxic_oversampled])

In [20]:
# check size
Counter(train_oversampled.target)

Counter({'non-toxic': 1328440, 'toxic': 1328440})

## 1.2 Undersampling majority class (non-toxic)

"Undersampling can be defined as removing some observations of the majority class. Undersampling can be a good choice when you have a ton of data -think millions of rows. But a drawback is that we are removing information that may be valuable. This could lead to underfitting and poor generalization to the test set.
We will again use the resampling module from Scikit-Learn to randomly remove samples from the majority class." (Boyle, 2019)

In [21]:
# undersample majority class
non_toxic_undersampled = resample(non_toxic,
                                replace = False, # sample without replacement
                                n_samples = len(toxic), # match the size of toxic set
                                random_state = 17)

# training set combined with toxic and undersampled non-toxic instances
train_undersampled = pd.concat([non_toxic_undersampled, toxic])

In [22]:
# check size
Counter(train_undersampled.target)

Counter({'non-toxic': 115459, 'toxic': 115459})
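As a sanity check on the two strategies, the sketch below uses a small hypothetical frame (not the competition data) to show how the `replace` argument of `resample` changes the result. The names `toy`, `toy_toxic`, `toy_non_toxic`, `over`, and `under` are assumptions made for this illustration.

```python
import pandas as pd
from sklearn.utils import resample

# Hypothetical toy data: 9 non-toxic rows and 3 toxic rows
toy = pd.DataFrame({
    "comment_text": [f"comment {i}" for i in range(12)],
    "target": ["non-toxic"] * 9 + ["toxic"] * 3,
})
toy_non_toxic = toy[toy.target == "non-toxic"]
toy_toxic = toy[toy.target == "toxic"]

# Oversampling: 9 draws with replacement from 3 rows, so some rows must repeat
over = resample(toy_toxic, replace=True,
                n_samples=len(toy_non_toxic), random_state=27)

# Undersampling: 3 draws without replacement from 9 rows, so every row is distinct
under = resample(toy_non_toxic, replace=False,
                 n_samples=len(toy_toxic), random_state=17)

# resample keeps the original index labels, so duplicates show up in the index
print("oversampled set contains repeated rows:", over.index.duplicated().any())
print("undersampled set contains only distinct rows:", under.index.is_unique)
```

The same two index checks can be run on `toxic_oversampled` and `non_toxic_undersampled` to confirm that the full-size resampling behaved as intended.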

References:

Boyle, T. (2019). https://towardsdatascience.com/methods-for-dealing-with-imbalanced-data-5b761be45a18