# Dealing with Class Imbalance with SMOTE

### In this kernel, I train a simple deep learning model and compare its performance on the original data and on data augmented with SMOTE

> Check https://arxiv.org/pdf/1106.1813.pdf

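I apply SMOTE at the **sentence level**: each embedded sentence is treated as a single feature vector, and synthetic minority sentences are interpolated between existing ones.

#### The model is the following: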
* GloVe Embedding
* Bidirectional GRU
* MaxPool
* Dense 
* Probably some Dropouts


#### Feel free to give any feedback, it is always appreciated.

In [None]:
import numpy as np
import pandas as pd
import keras
import seaborn as sns
import matplotlib.pyplot as plt
from time import time
from collections import Counter

## How does SMOTE work?

> "The minority class is over-sampled by taking each minority class sample and introducing synthetic examples along the line segments joining any/all of the k minority class nearest neighbors."

> "Synthetic samples are generated in the following way: Take the difference between the feature vector (sample) under consideration and its nearest neighbor. Multiply this difference by a random number between 0 and 1, and add it to the feature vector under consideration. This causes the selection of a random point along the line segment between two specific features. This approach effectively forces the decision region of the minority class to become more general."
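The interpolation step described in these quotes can be sketched in a few lines of NumPy. This is only a minimal illustration, not the imblearn implementation: the function name `smote_sketch`, its parameters, and the brute-force neighbor search are my own assumptions, and it expects at least two minority samples.

```python
import numpy as np

def smote_sketch(X_min, n_new, k=5, seed=0):
    """Generate n_new synthetic points from the minority samples X_min (illustrative only)."""
    rng = np.random.default_rng(seed)
    n = X_min.shape[0]
    k = min(k, n - 1)
    # Pairwise squared Euclidean distances between minority samples
    d = ((X_min[:, None, :] - X_min[None, :, :]) ** 2).sum(axis=-1)
    np.fill_diagonal(d, np.inf)               # a point is not its own neighbor
    neighbors = np.argsort(d, axis=1)[:, :k]  # indices of the k nearest neighbors

    base = rng.integers(0, n, size=n_new)                 # sample under consideration
    nn = neighbors[base, rng.integers(0, k, size=n_new)]  # one of its k neighbors
    gap = rng.random((n_new, 1))                          # random number in [0, 1)
    # Random point on the segment joining each sample to its chosen neighbor
    return X_min[base] + gap * (X_min[nn] - X_min[base])

X_min = np.random.default_rng(1).normal(size=(20, 2))  # toy minority class
X_new = smote_sketch(X_min, n_new=60)
print(X_new.shape)  # (60, 2)
```

Because every synthetic point is a convex combination of two real minority points, the new samples never leave the region already covered by the minority class.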

I use the `SMOTE` class from imblearn; see https://imbalanced-learn.org/en/stable/generated/imblearn.over_sampling.SMOTE.html

In [None]:
from imblearn.over_sampling import SMOTE

In [None]:
from sklearn.datasets import make_classification
X, y = make_classification(n_samples=100, n_features=2, n_informative=2, n_redundant=0, n_classes=2, weights=[0.2, 0.8], class_sep=0.95, random_state=0)

In [None]:
plt.figure(figsize=(12, 8))
plt.title('Class distribution before SMOTE')
plt.scatter(X[y==1][:, 0], X[y==1][:, 1], label='class 1')
plt.scatter(X[y==0][:, 0], X[y==0][:, 1], label='class 0')
plt.legend()
plt.grid(False)
plt.show()

In [None]:
smt = SMOTE()
X_smote, y_smote = smt.fit_resample(X, y)

In [None]:
plt.figure(figsize=(12, 8))
plt.title('Class distribution after SMOTE')
plt.scatter(X_smote[y_smote==1][:, 0], X_smote[y_smote==1][:, 1], label='class 1')
plt.scatter(X_smote[y_smote==0][:, 0], X_smote[y_smote==0][:, 1], label='class 0')
plt.legend()
plt.grid(False)
plt.show()

## Loading data

In [None]:
df = pd.read_csv("../input/train.csv")
print("Number of texts: ", df.shape[0])

In [None]:
df = df.sample(30000)

## Class imbalance

In [None]:
plt.figure(figsize = (10, 8))
sns.countplot(x=df['target'])  # keyword argument works across seaborn versions
plt.show()

In [None]:
print(Counter(df['target']))

There are far more 0s than 1s in our dataset. The data is heavily imbalanced, so one should consider oversampling or undersampling.

I don't recommend undersampling in Kaggle competitions, because you want as much training data as possible.

## Making Data for the network
We apply the following steps:
* Splitting
* Tokenizing
* Padding

In [None]:
max_len = 50
len_voc = 40000

### Train/Test split
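It is important to split before oversampling! Otherwise, synthetic samples interpolated from sentences that end up in the test set leak into the training set.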

In [None]:
from sklearn.model_selection import train_test_split

df_train, df_test = train_test_split(df, test_size=0.5)
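To see why the order matters, here is a toy sketch of the leak. It uses plain duplication of minority ids instead of SMOTE, which is a simplification on my part; the ids, class sizes, and the `split_half` helper are all hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
ids = np.arange(100)                     # hypothetical sample ids
minority, majority = ids[:10], ids[10:]  # 10 minority samples, 90 majority

def split_half(indices):
    """Shuffle an index array and cut it into two halves."""
    perm = rng.permutation(indices)
    half = len(perm) // 2
    return perm[:half], perm[half:]

# Wrong order: oversample first (duplicate minority ids), then split
oversampled = np.concatenate([ids, rng.choice(minority, size=80)])
train_bad, test_bad = split_half(oversampled)
leaked = np.intersect1d(train_bad, test_bad)  # ids present on both sides

# Right order: split each class first, then oversample the training minority only
min_train, min_test = split_half(minority)
maj_train, maj_test = split_half(majority)
train_ok = np.concatenate([maj_train, min_train, rng.choice(min_train, size=40)])
test_ok = np.concatenate([maj_test, min_test])
clean = np.intersect1d(train_ok, test_ok)

print(len(leaked), len(clean))
```

With SMOTE the leaked points would be interpolations rather than exact copies, but the effect is the same: the validation score becomes optimistic.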

### Tokenizing

In [None]:
def make_tokenizer(texts, len_voc):
    from keras.preprocessing.text import Tokenizer
    t = Tokenizer(num_words=len_voc)
    t.fit_on_texts(texts)
    return t

In [None]:
tokenizer = make_tokenizer(df['question_text'], len_voc)

In [None]:
X_train = tokenizer.texts_to_sequences(df_train['question_text'])
X_test = tokenizer.texts_to_sequences(df_test['question_text'])

### Padding

In [None]:
from keras.preprocessing.sequence import pad_sequences

X_train = pad_sequences(X_train, maxlen=max_len, padding='post', truncating='post')
X_test = pad_sequences(X_test, maxlen=max_len, padding='post', truncating='post')
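For readers unfamiliar with `padding='post'` and `truncating='post'`, the behaviour on a single sequence can be mimicked in pure Python. The helper `pad_post` is a hypothetical stand-in for illustration, not the Keras function.

```python
def pad_post(seq, maxlen, value=0):
    """Keep the first maxlen tokens, then right-pad with `value` up to maxlen."""
    seq = list(seq)[:maxlen]
    return seq + [value] * (maxlen - len(seq))

short = pad_post([4, 8, 15], maxlen=5)
long = pad_post([4, 8, 15, 16, 23, 42], maxlen=5)
print(short)  # [4, 8, 15, 0, 0]
print(long)   # [4, 8, 15, 16, 23]
```

Index 0 is safe to use as the padding value because the Keras tokenizer numbers words starting from 1.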

### Targets

In [None]:
y_train = df_train['target'].values
y_test = df_test['target'].values

### Embeddings

In [None]:
def get_coefs(word,*arr): 
    return word, np.asarray(arr, dtype='float32')

def load_embedding(file):
    if file == '../input/embeddings/wiki-news-300d-1M/wiki-news-300d-1M.vec':
        embeddings_index = dict(get_coefs(*o.split(" ")) for o in open(file) if len(o)>100)
    else:
        embeddings_index = dict(get_coefs(*o.split(" ")) for o in open(file, encoding='latin'))
    return embeddings_index

In [None]:
def make_embedding_matrix(embedding, tokenizer, len_voc):
    all_embs = np.stack(list(embedding.values()))  # np.stack needs a sequence, not a dict view
    emb_mean,emb_std = all_embs.mean(), all_embs.std()
    embed_size = all_embs.shape[1]
    word_index = tokenizer.word_index
    embedding_matrix = np.random.normal(emb_mean, emb_std, (len_voc, embed_size))
    
    for word, i in word_index.items():
        if i >= len_voc:
            continue
        embedding_vector = embedding.get(word)
        if embedding_vector is not None: 
            embedding_matrix[i] = embedding_vector
    
    return embedding_matrix

In [None]:
glove = load_embedding('../input/embeddings/glove.840B.300d/glove.840B.300d.txt')

In [None]:
embed_mat = make_embedding_matrix(glove, tokenizer, len_voc)

In [None]:
X_train_emb = embed_mat[X_train]
X_test_emb = embed_mat[X_test]

## Oversampling

In [None]:
train_size, max_len, embed_size = X_train_emb.shape
X_train_emb_r = X_train_emb.reshape(train_size, max_len*embed_size)

In [None]:
smt = SMOTE(sampling_strategy=0.2)
X_smote, y_smote = smt.fit_resample(X_train_emb_r, y_train)

In [None]:
X_smote = X_smote.reshape((X_smote.shape[0], max_len, embed_size))
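The shape juggling in this section (embedding lookup, flattening for SMOTE, reshaping back for the GRU) is easy to check on toy arrays. The sizes below are arbitrary and chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
toy_embed = rng.normal(size=(10, 4))        # hypothetical vocabulary: 10 words, 4-d vectors
toy_ids = rng.integers(0, 10, size=(3, 5))  # 3 padded sentences of 5 token ids

emb = toy_embed[toy_ids]               # lookup by fancy indexing: (3, 5) -> (3, 5, 4)
flat = emb.reshape(emb.shape[0], -1)   # one row per sentence for SMOTE: (3, 20)
restored = flat.reshape(-1, 5, 4)      # back to (sentences, max_len, embed_size)

print(emb.shape, flat.shape, restored.shape)  # (3, 5, 4) (3, 20) (3, 5, 4)
print(np.array_equal(emb, restored))          # True
```

The round trip is lossless, so the only thing SMOTE changes is the number of rows.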

In [None]:
plt.figure(figsize = (10, 8))
plt.subplot(1, 2, 1)
sns.countplot(x=y_train)
plt.title('Class distribution before SMOTE')
plt.subplot(1, 2, 2)
sns.countplot(x=y_smote)
plt.title('Class distribution after SMOTE')
plt.show()

## Now let us train a model

### Making model

In [None]:
from keras.models import Model
from keras.layers import Dense, Bidirectional, CuDNNGRU, GlobalMaxPool1D, Input, Dropout
from keras.optimizers import Adam

In [None]:
def make_model(max_len, len_voc=50000, embed_size=300):
    inp = Input(shape=(max_len, embed_size))  # use the argument instead of a hard-coded 300
    x = Bidirectional(CuDNNGRU(64, return_sequences=True))(inp)
    x = GlobalMaxPool1D()(x)
    x = Dense(1, activation="sigmoid")(x)
    model = Model(inputs=inp, outputs=x)
    model.compile(loss='binary_crossentropy', optimizer=Adam(lr=0.001), metrics=['accuracy'])
    return model

In [None]:
model = make_model(max_len)
model_smote = make_model(max_len)

In [None]:
model.summary()

### Callbacks

In [None]:
from keras.callbacks import ModelCheckpoint, ReduceLROnPlateau

reduce_lr = ReduceLROnPlateau(monitor='val_acc', factor=0.1, patience=2, verbose=1, min_lr=0.000001)
checkpoints = ModelCheckpoint('weights.hdf5', monitor="val_acc", mode="max", verbose=True, save_best_only=True)

reduce_lr_smote = ReduceLROnPlateau(monitor='val_acc', factor=0.1, patience=2, verbose=1, min_lr=0.000001)
checkpoints_smote = ModelCheckpoint('smote_weights.hdf5', monitor="val_acc", mode="max", verbose=True, save_best_only=True)

### Fitting

In [None]:
model.fit(X_train_emb, y_train, batch_size=128, epochs=3, validation_data=(X_test_emb, y_test), callbacks=[checkpoints, reduce_lr])

In [None]:
model_smote.fit(X_smote, y_smote, batch_size=128, epochs=3, validation_data=(X_test_emb, y_test), callbacks=[checkpoints_smote, reduce_lr_smote])

In [None]:
model.load_weights('weights.hdf5')
model_smote.load_weights('smote_weights.hdf5')

### Predictions

In [None]:
pred_test = model.predict([X_test_emb], verbose=1)
pred_test_smote = model_smote.predict([X_test_emb], batch_size=256, verbose=1)

### Tweaking threshold

In [None]:
def tweak_threshold(pred, truth):
    from sklearn.metrics import f1_score
    scores = []
    for thresh in np.arange(0.1, 0.501, 0.01):
        thresh = np.round(thresh, 2)
        score = f1_score(truth, (pred>thresh).astype(int))
        scores.append(score)
    return round(np.max(scores), 4)

In [None]:
print(f"Scored {tweak_threshold(pred_test, y_test)} without SMOTE (test data)")

In [None]:
print(f"Scored {tweak_threshold(pred_test_smote, y_test)} with SMOTE (test data)")
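`tweak_threshold` reports only the best F1 score. If you also want the threshold that produced it, a variant along these lines could work. It is a sketch with hypothetical names (`f1_binary`, `best_threshold`) and a hand-rolled F1 so that it runs without scikit-learn; the toy predictions are made up.

```python
import numpy as np

def f1_binary(truth, pred_labels):
    """F1 score for binary 0/1 arrays; 0.0 when there are no true positives."""
    tp = np.sum((truth == 1) & (pred_labels == 1))
    fp = np.sum((truth == 0) & (pred_labels == 1))
    fn = np.sum((truth == 1) & (pred_labels == 0))
    return 0.0 if tp == 0 else float(2 * tp / (2 * tp + fp + fn))

def best_threshold(pred, truth):
    """Return (threshold, F1) for the best F1 over the same grid as tweak_threshold."""
    pred = np.ravel(pred)  # model.predict returns shape (n, 1)
    threshs = np.round(np.arange(0.1, 0.501, 0.01), 2)
    scores = [f1_binary(truth, (pred > t).astype(int)) for t in threshs]
    best = int(np.argmax(scores))  # first threshold reaching the maximum
    return float(threshs[best]), round(scores[best], 4)

toy_truth = np.array([0, 0, 0, 1, 1, 0, 1, 0])
toy_pred = np.array([0.05, 0.20, 0.35, 0.45, 0.60, 0.15, 0.30, 0.02])
thresh, score = best_threshold(toy_pred, toy_truth)
print(thresh, score)
```

Keep in mind that a threshold tuned on the test data gives an optimistic score; a separate validation split would be the safer place to tune it.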

## Conclusion

It appears that SMOTE does not improve the results. However, it makes the network learn faster.

**Moreover, there is one big problem: this method does not scale to larger datasets.**

You have to apply SMOTE to embedded sentences, which takes far too much memory.
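A quick back-of-the-envelope estimate shows the scale of the problem. The function below is a hypothetical helper; it assumes 8 bytes per value because the embedding matrix built with `np.random.normal` is float64, and the one-million-row figure is just an example size.

```python
def dense_size_gb(n_samples, max_len=50, embed_size=300, bytes_per_value=8):
    """Size in GB of a dense (n_samples, max_len, embed_size) array."""
    return n_samples * max_len * embed_size * bytes_per_value / 1e9

print(dense_size_gb(15_000))     # 1.8   (the 15,000 training sentences used here)
print(dense_size_gb(1_000_000))  # 120.0 (a hypothetical one-million-sentence dataset)
```

This counts only the input array; the resampled output of SMOTE is larger still, so the real peak usage is higher.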

A solution is to train with a generator that performs the oversampling batch by batch. I tried it, but my generator was very slow.
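For reference, a batch-level generator could look roughly like the sketch below. It is not the generator I tried, and it is not true SMOTE: it interpolates between random pairs of minority sentences instead of nearest neighbors, and it samples the majority class with replacement. The function name, its parameters, and the toy data are all assumptions.

```python
import numpy as np

def oversampling_batches(X_ids, y, embed_mat, batch_size=128, minority_frac=0.2, seed=0):
    """Yield (X_batch, y_batch) forever, embedding and oversampling one batch at a time."""
    rng = np.random.default_rng(seed)
    idx_min = np.flatnonzero(y == 1)
    idx_maj = np.flatnonzero(y == 0)
    n_min = max(1, int(round(batch_size * minority_frac)))  # synthetic minority rows per batch
    n_maj = batch_size - n_min
    while True:
        maj = rng.choice(idx_maj, size=n_maj, replace=True)
        # Two random minority sentences per synthetic example (no neighbor search)
        a = rng.choice(idx_min, size=n_min, replace=True)
        b = rng.choice(idx_min, size=n_min, replace=True)
        gap = rng.random((n_min, 1, 1))                # interpolation weights in [0, 1)
        emb_a, emb_b = embed_mat[X_ids[a]], embed_mat[X_ids[b]]
        synth = emb_a + gap * (emb_b - emb_a)          # point on the segment a -> b
        X_batch = np.concatenate([embed_mat[X_ids[maj]], synth])
        y_batch = np.concatenate([np.zeros(n_maj), np.ones(n_min)])
        perm = rng.permutation(batch_size)             # shuffle classes within the batch
        yield X_batch[perm], y_batch[perm]

# Toy usage with made-up shapes
rng = np.random.default_rng(1)
toy_embed = rng.normal(size=(100, 8))
toy_ids = rng.integers(0, 100, size=(500, 50))
toy_y = (rng.random(500) < 0.06).astype(int)
toy_y[:5] = 1  # guarantee a few minority samples
Xb, yb = next(oversampling_batches(toy_ids, toy_y, toy_embed, batch_size=32))
print(Xb.shape, yb.sum())
```

Only one batch of embedded sentences lives in memory at a time, which is the whole point. Skipping the nearest-neighbor search keeps each batch cheap, at the cost of interpolating between sentences that may be far apart.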

So I'm going to stick with these results for now and try another data augmentation technique.

If you have any ideas for improvement, feel free to let me know.

#### Thanks for reading!
 