In [1]:
# globally useful imports of standard libraries needed in this notebook
import numpy as np
import pandas as pd

# import project specific modules used in this notebook
import sys
sys.path.append('../src')
import mindwandering.data

# Mindwandering: Dealing with Class Imbalance

The target label is imbalanced in this dataset.  Not as badly as in some datasets, but noticeably.

We will need the features and labels again to treat class imbalance.  Let's use the standard scaled features
while we develop our replication of the methods for dealing with class imbalance.

In [2]:
# we will use the standard scaled features in this notebook
df_features = mindwandering.data.get_df_features()
df_features_standard_scaled = mindwandering.data.transform_df_features_standard_scaled(df_features)

# We also need the labels, since the imbalanced target labels are what
# we need to balance.
# Only the binary mind_wandered_label is needed for this.
df_label = mindwandering.data.get_df_label()
df_label = df_label.mind_wandered_label.copy()

# we need the participant_id field to perform leave-one-out group cross validation
df_experiment_metadata = mindwandering.data.get_df_experiment_metadata()
participant_id = df_experiment_metadata.participant_id

Let's confirm the imbalance present in our mind_wandered_label target value.

In [3]:
# count up the true and false values in our target label
df_label.value_counts()

False    2963
True     1113
Name: mind_wandered_label, dtype: int64

So approximately 27% of the targets are the positive label, and 73% are the negative label.  The reference paper used two methods:
simple downsampling of the majority class, and Synthetic Minority Oversampling Technique (SMOTE) upsampling, which is a kind of data
augmentation technique that generates new training instances from existing ones.  In this case it generates instances of the minority
class to increase its representation in a training set.  SMOTE creates each new instance by interpolating between a minority-class
instance and one of its nearest minority-class neighbors, so the new features lie somewhere between those of existing instances.
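
The interpolation idea can be illustrated with a few lines of NumPy.  This is only a hand-rolled sketch under stated assumptions, not the implementation used by `imblearn`: the function name `smote_like_sample`, the toy `minority` array, and the choice of `k=2` neighbors are all made up for illustration.

```python
import numpy as np


def smote_like_sample(minority, k=2, rng=None):
    """Create one synthetic point by interpolating between a randomly chosen
    minority instance and one of its k nearest minority neighbors.
    (Illustrative sketch only, not imblearn's SMOTE implementation.)"""
    rng = np.random.default_rng() if rng is None else rng
    base = minority[rng.integers(len(minority))]
    # Euclidean distance from the base point to every minority point
    dists = np.linalg.norm(minority - base, axis=1)
    # position 0 of the sorted order is the base itself (distance 0), so skip it
    neighbor_idxs = np.argsort(dists)[1:k + 1]
    neighbor = minority[rng.choice(neighbor_idxs)]
    # random position along the line segment from base to neighbor
    gap = rng.random()
    return base + gap * (neighbor - base)


# hypothetical toy minority class: the four corners of the unit square
minority = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
rng = np.random.default_rng(0)
synthetic = np.array([smote_like_sample(minority, k=2, rng=rng) for _ in range(5)])
print(synthetic)
```

With this toy data each synthetic point falls on an edge of the square, because the two nearest neighbors of any corner are the adjacent corners.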

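Both of these methods are available in the `imbalanced-learn` package (imported as `imblearn`), a scikit-learn-contrib project that follows the scikit-learn API, at the time of this project work.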

# Majority Class Downsampling to Balance Classes

Let's test out majority class downsampling.  We will be using leave-one-participant-out cross validation, so let's develop some of that
workflow here.  We will remove one participant's trials for testing, then downsample the remaining training data and examine the results.

Scikit-learn has support for group-based cross validation:

[Cross Validation Iterators for Grouped Data](https://scikit-learn.org/stable/modules/cross_validation.html#cross-validation-iterators-for-grouped-data)

Let's see if we can get that to work the way we need for our data, where each participant can have from 4 to 57 trials in the data.

In [4]:
# imports needed for this section of the notebook
from sklearn.model_selection import LeaveOneGroupOut

In [5]:
# the small example of LeaveOneGroupOut from the scikit-learn documentation
X = [1, 5, 10, 50, 60, 70, 80]
y = [0, 1, 1, 2, 2, 2, 2]
groups = [1, 1, 2, 2, 3, 3, 3]
logo = LeaveOneGroupOut()

for train, test in logo.split(X, y, groups=groups):
    print("%s %s" % (train, test))

[2 3 4 5 6] [0 1]
[0 1 4 5 6] [2 3]
[0 1 2 3] [4 5 6]


What is returned here are index arrays that you can use to extract the trials from the X inputs and y outputs (assuming you have a numpy array or
pandas dataframe that can be indexed using an array of indices).
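
As a small illustration, the first split printed above can be applied to the same toy data once it is converted to numpy arrays.  The `train` and `test` arrays below are copied from that printed output rather than recomputed; for a pandas dataframe the equivalent would be positional indexing with `.iloc`.

```python
import numpy as np

# the toy data from the scikit-learn example, as numpy arrays
X = np.array([1, 5, 10, 50, 60, 70, 80])
y = np.array([0, 1, 1, 2, 2, 2, 2])

# index arrays of the first split, copied from the output above
train = np.array([2, 3, 4, 5, 6])
test = np.array([0, 1])

# integer array indexing selects exactly those positions
X_train, X_test = X[train], X[test]
y_train, y_test = y[train], y[test]

print(X_train, y_train)  # [10 50 60 70 80] [1 2 2 2 2]
print(X_test, y_test)    # [1 5] [0 1]
```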

Our groups will be the participant_ids.  Let's try it out with our features and labels.

In [6]:
for train_idxs, test_idxs in logo.split(df_features_standard_scaled, df_label, groups=participant_id):
    # uncomment to inspect the types of the returned index arrays
    #print(type(train_idxs))
    #print(type(test_idxs))
    # uncomment to see the participant that was selected for the test
    #print(participant_id.iloc[test_idxs])
    pass

This does appear to be working as needed.  We are getting numpy arrays.  The arrays will work to index only the rows we want from the features and
label dataframes to split out the train and test data.

Back to downsampling.  Let's create a `LeaveOneGroupOut` splitter and get the first participant's train/test split.

In [7]:
# extract test and train dataframes
logo = LeaveOneGroupOut()
logo_generator = logo.split(df_features_standard_scaled, df_label, groups=participant_id)
train_idxs, test_idxs = next(logo_generator)
print(test_idxs)
print(participant_id.iloc[test_idxs])

df_features_test = df_features_standard_scaled.iloc[test_idxs].copy()
df_features_train = df_features_standard_scaled.iloc[train_idxs].copy()
df_label_test = df_label.iloc[test_idxs].copy()
df_label_train = df_label.iloc[train_idxs].copy()

[ 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47
 48 49 50 51]
1     1002-UM
2     1002-UM
3     1002-UM
4     1002-UM
5     1002-UM
6     1002-UM
7     1002-UM
8     1002-UM
9     1002-UM
10    1002-UM
11    1002-UM
12    1002-UM
13    1002-UM
14    1002-UM
15    1002-UM
16    1002-UM
17    1002-UM
18    1002-UM
19    1002-UM
20    1002-UM
21    1002-UM
22    1002-UM
23    1002-UM
24    1002-UM
25    1002-UM
26    1002-UM
27    1002-UM
28    1002-UM
29    1002-UM
30    1002-UM
31    1002-UM
32    1002-UM
33    1002-UM
34    1002-UM
35    1002-UM
36    1002-UM
37    1002-UM
38    1002-UM
39    1002-UM
40    1002-UM
41    1002-UM
42    1002-UM
43    1002-UM
44    1002-UM
45    1002-UM
46    1002-UM
47    1002-UM
48    1002-UM
49    1002-UM
50    1002-UM
51    1002-UM
52    1002-UM
Name: participant_id, dtype: object


In [8]:
# what is the ratio of the labels in the training data?
df_label_train.value_counts()

False    2921
True     1103
Name: mind_wandered_label, dtype: int64

In case it is not obvious, the labels (and the features as well) have now had participant 1002-UM removed, so there are fewer total trials than before.
The number of trials in the training data and labels is $2921 + 1103 = 4024$, so the 52 trials of participant 1002-UM have been removed.  The data
still has the roughly 27%/73% imbalance.
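
The arithmetic can be checked directly.  The counts below are copied by hand from the two `value_counts()` outputs above; the variable names are only for illustration.

```python
# class counts copied from the value_counts() outputs above
full_counts = {False: 2963, True: 1113}    # all trials
train_counts = {False: 2921, True: 1103}   # participant 1002-UM held out

n_full = sum(full_counts.values())
n_train = sum(train_counts.values())
held_out = n_full - n_train
positive_share = train_counts[True] / n_train

print(n_full, n_train, held_out)  # 4076 4024 52
print(f"{positive_share:.1%}")     # 27.4%
```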

The `imbalanced-learn` library provides a mechanism to downsample the majority class, though this is a relatively simple procedure we could do ourselves.  As
of this project, `imbalanced-learn` is a scikit-learn-contrib community project; I suppose it may get folded into scikit-learn proper at some point.

Downsampling the majority class is usually done by removing randomly chosen items from the features (and the corresponding labels) without replacement, as in
the paper we are replicating.
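
To make the "do it ourselves" remark concrete, here is a minimal pandas sketch of random downsampling without replacement.  It is an illustration under stated assumptions, not what `RandomUnderSampler` does internally: the helper name `downsample_majority` and the toy data are hypothetical, and the helper assumes the features and labels share the same unique index.

```python
import numpy as np
import pandas as pd


def downsample_majority(features, labels, random_state=0):
    """Randomly keep n rows of every class, where n is the size of the
    smallest class, sampling without replacement.
    (Hypothetical helper; assumes features and labels share a unique index.)"""
    n_keep = labels.value_counts().min()
    kept = [
        labels[labels == cls].sample(n=n_keep, replace=False, random_state=random_state).index
        for cls in labels.unique()
    ]
    # merge the kept index labels and restore the original row order
    kept_index = np.sort(np.concatenate(kept))
    return features.loc[kept_index], labels.loc[kept_index]


# toy data: 6 negative trials and 2 positive trials
toy_features = pd.DataFrame({"f1": np.arange(8.0), "f2": np.arange(8.0) * 2})
toy_labels = pd.Series([False] * 6 + [True] * 2, name="mind_wandered_label")

toy_features_ds, toy_labels_ds = downsample_majority(toy_features, toy_labels)
print(toy_labels_ds.value_counts())
```

All minority rows are kept, and an equally sized random subset of the majority rows is kept alongside them.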

In [9]:
# imports needed for this section of the notebook
from imblearn.under_sampling import RandomUnderSampler

In [10]:
undersample = RandomUnderSampler(random_state=0)
df_features_train_resampled, df_label_train_resampled = undersample.fit_resample(df_features_train, df_label_train)

In [11]:
df_label_train_resampled.value_counts()

False    1103
True     1103
Name: mind_wandered_label, dtype: int64

So this is a fairly simple operation: the default behavior is to downsample until the classes are equalized.

This should work in our leave-one-participant-out training loop.  For example, to get the next train/test iteration, we would do this again.

In [12]:
# leave out the next participant
train_idxs, test_idxs = next(logo_generator)

# hold out this participant's trials for testing and train on the rest of the data
df_features_test = df_features_standard_scaled.iloc[test_idxs].copy()
df_features_train = df_features_standard_scaled.iloc[train_idxs].copy()
df_label_test = df_label.iloc[test_idxs].copy()
df_label_train = df_label.iloc[train_idxs].copy()

# perform downsampling on this set of training data
df_features_train_resampled, df_label_train_resampled = undersample.fit_resample(df_features_train, df_label_train)

# see the result
df_label_train_resampled.value_counts()

False    1113
True     1113
Name: mind_wandered_label, dtype: int64

# Minority Class Upsampling using SMOTE

Conceptually, upsampling is much more complicated, since it is a data generation technique.  A good quick overview is
[SMOTE for Imbalanced Classification](https://machinelearningmastery.com/smote-oversampling-for-imbalanced-classification/)

But using SMOTE upsampling from the `imbalanced-learn` library works in much the same way as random downsampling.

In [13]:
# imports needed for this section of the notebook
from imblearn.over_sampling import SMOTE

In [14]:
logo = LeaveOneGroupOut()
logo_generator = logo.split(df_features_standard_scaled, df_label, groups=participant_id)
train_idxs, test_idxs = next(logo_generator)

df_features_test = df_features_standard_scaled.iloc[test_idxs].copy()
df_features_train = df_features_standard_scaled.iloc[train_idxs].copy()
df_label_test = df_label.iloc[test_idxs].copy()
df_label_train = df_label.iloc[train_idxs].copy()

In [15]:
oversample = SMOTE(random_state=0)
df_features_train_resampled, df_label_train_resampled = oversample.fit_resample(df_features_train, df_label_train)

In [16]:
df_label_train_resampled.value_counts()

False    2921
True     2921
Name: mind_wandered_label, dtype: int64

In [17]:
# oversample for the next leave-one-out train/test split
train_idxs, test_idxs = next(logo_generator)

df_features_test = df_features_standard_scaled.iloc[test_idxs].copy()
df_features_train = df_features_standard_scaled.iloc[train_idxs].copy()
df_label_test = df_label.iloc[test_idxs].copy()
df_label_train = df_label.iloc[train_idxs].copy()

In [18]:
oversample = SMOTE(random_state=0)
df_features_train_resampled, df_label_train_resampled = oversample.fit_resample(df_features_train, df_label_train)

In [19]:
df_label_train_resampled.value_counts()

False    2913
True     2913
Name: mind_wandered_label, dtype: int64