# Imbalanced Target Variable with Text Data

In this notebook I will show four techniques for handling an imbalanced target variable:
1. Oversampling the minority class using imblearn
2. Undersampling the majority class using imblearn
3. Using the `class_weight` parameter in a sklearn model
4. Data Augmentation - by translating the text into another language and then translating it back 

In [None]:
# load in the data 

import pandas as pd 
import numpy as np

df = pd.read_csv('/kaggle/input/spam-text-message-classification/SPAM text message 20170820 - Data.csv')

In [None]:
df['Category'].value_counts(normalize = True)

About 87% of the messages are of class `ham` and 13% are of class `spam`.

This notebook focuses on techniques for handling imbalanced classes, so I keep the rest of the pipeline fixed: every technique uses TF-IDF features and a Random Forest classifier.
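
For readers who have not met TF-IDF before, here is a minimal pure-Python sketch of the weighting that, as far as I understand scikit-learn's defaults, `TfidfVectorizer` applies: raw term counts, a smoothed IDF, and L2 normalisation of each row. The whitespace tokenizer and the toy messages are assumptions made for illustration; the real vectorizer uses a regex tokenizer and returns a sparse matrix.

```python
import math
from collections import Counter

def tfidf_rows(docs):
    """Return one {term: weight} dict per document.

    Smoothed formula: idf(t) = ln((1 + n_docs) / (1 + df(t))) + 1,
    followed by L2 normalisation of each document's weights.
    """
    tokenized = [doc.lower().split() for doc in docs]  # simplified tokenizer (assumption)
    n_docs = len(tokenized)
    # document frequency: number of documents that contain each term
    df = Counter(term for tokens in tokenized for term in set(tokens))
    idf = {term: math.log((1 + n_docs) / (1 + count)) + 1 for term, count in df.items()}

    rows = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights = {term: count * idf[term] for term, count in counts.items()}
        # guard against empty documents, whose norm would be zero
        norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
        rows.append({term: w / norm for term, w in weights.items()})
    return rows

toy_messages = ["free prize call now", "call me later", "see you later"]  # made-up examples
rows = tfidf_rows(toy_messages)
```

In the first toy message, `free` appears in only one document while `call` appears in two, so `free` ends up with the larger weight. That is the property that makes rare, spam-specific words stand out.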

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer

# convert all text to lowercase 
df['Message'] = df['Message'].str.lower()

# perform train test split 
X_train, X_test, y_train, y_test = train_test_split(df['Message'], df['Category'], random_state=11)

# vectorize text using TFIDF
tfidf = TfidfVectorizer()
X_train_tfidf = tfidf.fit_transform(X_train)
X_test_tfidf = tfidf.transform(X_test)

To begin, I fit a random forest on the data as-is, without doing anything about the class imbalance.


### Baseline Classifier

In [None]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report


rf = RandomForestClassifier(random_state = 11)
rf.fit(X_train_tfidf, y_train)
print(classification_report(y_test, rf.predict(X_test_tfidf)))

Recall on the minority class `spam` is relatively low, at 0.85.
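
As a reminder of what these numbers mean, recall is the share of actual spam messages that the model flags as spam. The sketch below computes precision, recall, and F1 from raw counts. The counts are hypothetical, chosen only so that recall comes out at 0.85; they are not taken from the report above.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall and F1 for one class from its confusion counts."""
    precision = tp / (tp + fp)  # of everything predicted spam, how much really is spam
    recall = tp / (tp + fn)     # of all real spam, how much the model caught
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Hypothetical counts: 85 spam caught, 5 ham wrongly flagged, 15 spam missed
tp, fp, fn = 85, 5, 15
precision, recall, f1 = precision_recall_f1(tp, fp, fn)
```

With these made-up counts, 15 of every 100 spam messages slip through, which is the kind of miss rate a recall of 0.85 describes.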

### Random Over Sampling

Next I am going to try random oversampling, which duplicates randomly chosen minority-class rows until the classes are balanced.

In [None]:
# check the training-set distribution before applying oversampling
# (the training split is what gets resampled)
y_train.value_counts()

In [None]:
from imblearn.over_sampling import RandomOverSampler

ros = RandomOverSampler()
X_ros, y_ros = ros.fit_resample(X_train_tfidf, y_train)

# check distribution after applying over sampling 
y_ros.value_counts()

Fit the same model on the oversampled data:

In [None]:
rf = RandomForestClassifier()
rf.fit(X_ros, y_ros)
print(classification_report(y_test, rf.predict(X_test_tfidf)))

The results are very similar to those of the baseline classifier.
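
To make the mechanics concrete, here is a small pure-Python sketch of random oversampling. It is a conceptual illustration, not imblearn's implementation; the function name, the seed, and the 87/13 toy labels are all assumptions for the example.

```python
import random
from collections import Counter

def random_oversample(rows, labels, seed=11):
    """Duplicate randomly chosen rows of smaller classes until every class
    has as many rows as the largest class."""
    rng = random.Random(seed)
    by_class = {}
    for row, label in zip(rows, labels):
        by_class.setdefault(label, []).append(row)

    target = max(len(members) for members in by_class.values())
    out_rows, out_labels = [], []
    for label, members in by_class.items():
        # sample with replacement to fill the gap up to the target size
        extra = [rng.choice(members) for _ in range(target - len(members))]
        out_rows.extend(members + extra)
        out_labels.extend([label] * target)
    return out_rows, out_labels

labels = ["ham"] * 87 + ["spam"] * 13   # toy labels mirroring the 87/13 split
rows = list(range(len(labels)))         # row ids stand in for feature vectors
ros_rows, ros_labels = random_oversample(rows, labels)
```

Every added spam row is an exact copy of an existing one, so the model sees no new wording. That may be one reason the scores barely move compared with the baseline.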

### Random Under Sampling

In [None]:
from imblearn.under_sampling import RandomUnderSampler

rus = RandomUnderSampler()
X_rus, y_rus = rus.fit_resample(X_train_tfidf, y_train)

In [None]:
# check distribution after random under sampling 
y_rus.value_counts()

In [None]:
rf = RandomForestClassifier()
rf.fit(X_rus, y_rus)
print(classification_report(y_test, rf.predict(X_test_tfidf)))

This time precision for the minority class drops slightly, but recall increases. The F1-score rises from 0.90 (baseline) to 0.92.
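
For comparison, random undersampling can be sketched the same way: instead of duplicating minority rows, it throws away majority rows until the classes match. Again this is a conceptual illustration with assumed names and toy labels, not imblearn's code.

```python
import random
from collections import Counter

def random_undersample(rows, labels, seed=11):
    """Keep a random subset of each class so that every class has as many
    rows as the smallest class."""
    rng = random.Random(seed)
    by_class = {}
    for row, label in zip(rows, labels):
        by_class.setdefault(label, []).append(row)

    target = min(len(members) for members in by_class.values())
    out_rows, out_labels = [], []
    for label, members in by_class.items():
        # sample without replacement, so no row is repeated
        out_rows.extend(rng.sample(members, target))
        out_labels.extend([label] * target)
    return out_rows, out_labels

labels = ["ham"] * 87 + ["spam"] * 13   # toy labels mirroring the 87/13 split
rows = list(range(len(labels)))         # row ids stand in for feature vectors
rus_rows, rus_labels = random_undersample(rows, labels)
```

In this toy example 74 of the 87 ham rows are discarded. Training on far fewer ham examples is a plausible explanation for the trade-off above: the model flags spam more readily (higher recall) but also mislabels more ham (lower precision).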

### Class Weight 

In [None]:
rf = RandomForestClassifier(class_weight = 'balanced')
rf.fit(X_train_tfidf, y_train)
print(classification_report(y_test, rf.predict(X_test_tfidf)))

The F1-score for the minority class drops to 0.88.
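
The `'balanced'` setting weights each class inversely to its frequency, using `n_samples / (n_classes * count_of_class)` as described in the scikit-learn documentation. The sketch below reproduces that formula in plain Python on toy labels that mirror the 87/13 split; the function name is my own.

```python
from collections import Counter

def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {label: n_samples / (n_classes * count) for label, count in counts.items()}

labels = ["ham"] * 87 + ["spam"] * 13   # toy labels, not the real training set
weights = balanced_class_weights(labels)
```

Here a spam example counts roughly 6.7 times as much as a ham example (100/26 versus 100/174), so no rows are added or removed; the model is simply penalised more for errors on spam.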

### Data Augmentation 

Now we will try translating the spam messages into another language and then translating them back to English. The idea is that the round trip adds a little noise, producing slightly different versions of each message.

First, install the translation library:

In [None]:
!pip install googletrans==3.1.0a0

Now let's see an example of this for a single message. I am going to take a message, translate it to French, and then translate it back to English.

In [None]:
from googletrans import Translator

translator = Translator()

# translate to French
french = translator.translate(df.loc[2, 'Message'], dest = 'fr')
# translate back to English
translator.translate(french.text, dest = 'en').text

In [None]:
# original message 
df.loc[2, 'Message']

We see that the original message is slightly different from the translated message. This allows me to add new data to the dataset that is slightly different from the original messages.

Now I'm going to do this for all of the Spam messages in the training set 

In [None]:
df_train = pd.concat([X_train, y_train], axis = 1)
df_train.head(2)

I am going to take each spam message, translate it into a randomly chosen language (French, Spanish, or German), and then translate it back to English.

In [None]:
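import time

translated_text = []

# back-translate every spam message in the training set
for message in df_train[df_train['Category'] == 'spam']['Message']:
    # pick a random pivot language for this message
    language = np.random.choice(['fr', 'es', 'de'])
    translated_message = translator.translate(message, dest = language)
    translated_text.append(translator.translate(translated_message.text, dest = 'en').text)
    # pause between requests so the translation service is not flooded
    time.sleep(1)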
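
A loop like this makes hundreds of network calls, and a single failed request would abort the whole run. Below is a hedged sketch of a more defensive variant. It takes the translation step as a parameter, `translate(text, dest)`, which is a hypothetical callable of my own invention and not part of googletrans; that also lets the logic be tried out offline with a stand-in translator.

```python
import random

def back_translate(messages, translate, pivots=("fr", "es", "de"), seed=11):
    """Round-trip each message through a randomly chosen pivot language.

    `translate(text, dest)` is a hypothetical callable that returns the
    translated string. Messages whose translation fails are skipped.
    """
    rng = random.Random(seed)
    augmented = []
    for message in messages:
        pivot = rng.choice(pivots)
        try:
            foreign = translate(message, pivot)
            augmented.append(translate(foreign, "en"))
        except Exception:
            # skip this message rather than losing all progress so far
            continue
    return augmented

# Stand-in translator for demonstration only: it just tags the text
def fake_translate(text, dest):
    return f"{text} [{dest}]"

demo = back_translate(["win a free prize"], fake_translate)
```

To use it with the `translator` object from the cell above, one option would be to pass `lambda text, dest: translator.translate(text, dest=dest).text` as the `translate` argument, and to add the `time.sleep(1)` pause inside that wrapper.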

Combine the translated and original messages into one dataframe.

In [None]:
translations_df = pd.DataFrame({'Message': translated_text,'Category': 'spam'})
df_train_translations = pd.concat([df_train, translations_df])

In [None]:
df_train_translations['Category'].value_counts()

If you remember from earlier, the training set originally had 570 spam messages; after the data augmentation it has 1,140.

Apply the same TF-IDF vectorizer that I fitted earlier. I only call `transform` here, so the features stay aligned with the test set.

In [None]:
df_train_translations['Message'] = df_train_translations['Message'].str.lower()

# perform TFIDF 
X_train_trans_tfidf = tfidf.transform(df_train_translations['Message'])

Fit the random forest classifier on the augmented data.

In [None]:
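# note: assuming the cells are run in order, rf is still the instance from the
# Class Weight section, so class_weight = 'balanced' is in effect here as well
rf.fit(X_train_trans_tfidf, df_train_translations['Category'])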
print(classification_report(y_test, rf.predict(X_test_tfidf)))

The F1-score on the minority class `spam` is 0.90.

## Conclusions 
We have tried four different techniques for handling the imbalanced classes. As a next step, I would try more data augmentation: even after doubling the number of `spam` messages, there were still far fewer `spam` messages than `ham` messages.
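
To get a feel for how much more augmentation that would take, here is a small arithmetic sketch. The spam count of 570 comes from the notebook; the ham count is an assumed, illustrative value, so read the real one from `value_counts()` before relying on the result.

```python
import math

def copies_needed_to_balance(n_majority, n_minority):
    """Augmented copies required per minority message so that the minority
    class is at least as large as the majority class."""
    return max(0, math.ceil(n_majority / n_minority) - 1)

n_spam = 570    # spam messages in the training set, as reported above
n_ham = 3600    # assumed, illustrative ham count (not taken from the notebook)
copies = copies_needed_to_balance(n_ham, n_spam)
```

Under that assumed ham count, each spam message would need several back-translated variants rather than one, which means going beyond a single pivot language per message.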