 <div>
<img src="https://edlitera-images.s3.amazonaws.com/new_edlitera_logo.png" width="500"/>
</div>

# `Dealing with imbalanced data`

* working with imbalanced data usually leads to suboptimal results

* **our models become biased towards classes that are better represented**

* the main cause of the accuracy trap
    * a situation where accuracy is high, but the model is bad
    * in a highly imbalanced dataset the model can end up always predicting the majority class

* even though tracking how good a model is using the F1 score or something similar allows us to avoid the accuracy trap, it still doesn't solve our problems

* if you are working with typical imbalanced datasets or text data, there are particular methods you can use to decrease the imbalance as much as possible
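
To make the accuracy trap concrete, here is a minimal sketch using made-up labels (95 majority examples, 5 minority examples) and a "model" that always predicts the majority class. The label counts are assumptions chosen only for illustration.

```python
from sklearn.metrics import accuracy_score, f1_score

# Hypothetical labels: 95 majority-class examples (0) and 5 minority-class examples (1)
y_true = [0] * 95 + [1] * 5

# A "model" that always predicts the majority class
y_pred = [0] * 100

accuracy = accuracy_score(y_true, y_pred)

# zero_division=0 avoids the warning raised when the positive class is never predicted
minority_f1 = f1_score(y_true, y_pred, pos_label=1, zero_division=0)

print(f"Accuracy: {accuracy:.2f}")        # Accuracy: 0.95
print(f"Minority F1: {minority_f1:.2f}")  # Minority F1: 0.00
```

The accuracy looks excellent, yet the model never finds a single minority example, which is exactly what the F1 score exposes.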

**Standard solutions to the problem:**

* random undersampling 
* random oversampling
* SMOTE

**NLP specific solutions:**

* data augmentation using synonyms
* data augmentation using back-translation

**What you can use in combination with the aforementioned:**

* create multiple random samples by undersampling the majority class
* afterward train multiple models: each model should train on the whole minority class and one of the random samples you created in the previous step
* in the end create an ensemble of models to make predictions
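
The three steps above can be sketched as follows. This is only an illustration under stated assumptions: the data is synthetic (generated with `make_classification`), the number of models (`n_models = 5`) and the choice of `LogisticRegression` are arbitrary, and the ensemble simply averages predicted probabilities.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.utils import resample

# Synthetic imbalanced data (roughly 90% class 0, 10% class 1); purely illustrative
X, y = make_classification(
    n_samples=2000, n_features=10, weights=[0.9, 0.1], random_state=27)

X_majority, X_minority = X[y == 0], X[y == 1]

n_models = 5  # assumed number of ensemble members
models = []

for seed in range(n_models):
    # Step 1: draw a different random sample of the majority class each time
    X_majority_sample = resample(
        X_majority,
        replace=False,
        n_samples=len(X_minority),
        random_state=seed)

    # Step 2: train on the whole minority class plus this majority sample
    X_train = np.vstack([X_majority_sample, X_minority])
    y_train = np.concatenate([
        np.zeros(len(X_majority_sample), dtype=int),
        np.ones(len(X_minority), dtype=int)])

    models.append(LogisticRegression(max_iter=1000).fit(X_train, y_train))

# Step 3: average the predicted probabilities of all models (soft voting)
# Note: we predict on X only to keep the sketch short; use held-out data in practice
mean_proba = np.mean([m.predict_proba(X)[:, 1] for m in models], axis=0)
ensemble_pred = (mean_proba >= 0.5).astype(int)
```

Because every model sees a different slice of the majority class, the ensemble as a whole uses more of the majority data than a single undersampled model would.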

# `Standard solutions`

* these methods were not created specifically for working with text data, so they can be applied to any type of data 

* for random undersampling and oversampling we can use Scikit-Learn, but for SMOTE we will use the `imblearn` library

## `Random undersampling`

* **the idea:** if we have more data in one class than in the other, we remove examples from the majority class until both classes have the same number of examples

* somewhat wasteful because we end up working with less data than we had in the beginning

**Example:**

In [1]:
# Let's first import the libraries we will use

import pandas as pd
from sklearn.utils import resample

In [2]:
# Create a DataFrame

df = pd.read_csv("https://edlitera-datasets.s3.amazonaws.com/wine_data_classification.csv")

In [3]:
# Take a look at the first five rows of our dataframe

df.head()

Unnamed: 0,description,wine_type
0,"Aromas include tropical fruit, broom, brimston...",great_wine
1,"This is ripe and fruity, a wine that is smooth...",great_wine
2,"Tart and snappy, the flavors of lime flesh and...",great_wine
3,"Pineapple rind, lemon pith and orange blossom ...",great_wine
4,"Much like the regular bottling from 2012, this...",great_wine


In [4]:
# Check how many examples we have for each class

df["wine_type"].value_counts()

great_wine       96336
superior_wine    33635
Name: wine_type, dtype: int64

We observe that the data is imbalanced: we have far more "great wines" as opposed to "superior wines".

In [5]:
# Encode the labels as integers (0 = great_wine, 1 = superior_wine)

df["wine_type"] = df["wine_type"].map({"great_wine": 0, "superior_wine": 1})

In [6]:
# Separate the two classes

great_wine = df[df.wine_type == 0]
superior_wine = df[df.wine_type == 1]

In [7]:
# Use random undersampling to create a sample dataframe of the majority class
# that has the same number of examples as the minority class

majority_undersampled = resample(
    great_wine,
    replace=False, # sample without replacement
    n_samples=len(superior_wine), # match minority n
    random_state=27) # reproducible results


In [8]:
# Create balanced dataset by combining undersampled majority with minority

balanced_df = pd.concat([majority_undersampled, superior_wine])

In [9]:
# Check data balance

balanced_df["wine_type"].value_counts()

0    33635
1    33635
Name: wine_type, dtype: int64

## `Random oversampling`

* **the idea:** if we have more data in one class than in the other, we add fake examples to the minority class until both classes have the same number of examples

* also called upsampling

* the fake examples are duplicates of our real examples

* potentially problematic: increases how much data we have, but depending on the imbalance we can have a lot of duplicates in our dataset which can lead to poor model performance 

    * especially problematic if duplicate examples end up in both our testing and training data
    * **always separate data into training and testing data before oversampling to make sure that duplicates do not artificially increase the performance of your model**

**Example:**

In [10]:
# Let's first import the libraries we will use

import pandas as pd
from sklearn.utils import resample

In [11]:
# Create a DataFrame

df = pd.read_csv("https://edlitera-datasets.s3.amazonaws.com/wine_data_classification.csv")

In [12]:
# Take a look at the first five rows of our dataframe

df.head()

Unnamed: 0,description,wine_type
0,"Aromas include tropical fruit, broom, brimston...",great_wine
1,"This is ripe and fruity, a wine that is smooth...",great_wine
2,"Tart and snappy, the flavors of lime flesh and...",great_wine
3,"Pineapple rind, lemon pith and orange blossom ...",great_wine
4,"Much like the regular bottling from 2012, this...",great_wine


In [13]:
# Check how many examples we have for each class

df["wine_type"].value_counts()

great_wine       96336
superior_wine    33635
Name: wine_type, dtype: int64

In [14]:
# Encode the labels as integers (0 = great_wine, 1 = superior_wine)

df["wine_type"] = df["wine_type"].map({"great_wine": 0, "superior_wine": 1})

In [15]:
# Let's shuffle the dataset now

df = df.sample(frac=1).reset_index(drop=True)

In [16]:
len(df)

129971

In [17]:
# Define training, validation, and testing data

train_data = df.iloc[:85_000, :]
valid_data = df.iloc[85_000:100_000, :]
test_data = df.iloc[100_000:, :]

In [18]:
# Check train data for imbalance

train_data["wine_type"].value_counts()

0    62892
1    22108
Name: wine_type, dtype: int64

In [19]:
# Separate the two classes

great_wine = train_data[train_data.wine_type == 0]
superior_wine = train_data[train_data.wine_type == 1]

In [20]:
# Use random oversampling to create a sample dataframe of the minority class
# that has the same number of examples as the majority class


minority_upsampled = resample(
    superior_wine,
    replace=True, # sample with replacement
    n_samples=len(great_wine), # match majority n
    random_state=27) # reproducible results

In [21]:
# Create balanced dataset by combining upsampled minority with majority

balanced_df = pd.concat([minority_upsampled, great_wine])

In [22]:
# Check new data 

balanced_df["wine_type"].value_counts()

1    62892
0    62892
Name: wine_type, dtype: int64

In [23]:
# Define new train data

train_data = balanced_df.copy()

## `SMOTE`

* short for **S**YNTHETIC **M**INORITY **O**VERSAMPLING **TE**CHNIQUE

* type of oversampling where we create fake examples not by creating duplicates of real ones, but by creating synthetic examples 

* part of the **`imbalanced-learn`** library (imported as **`imblearn`**)
    * you can install it by running the command **`pip install imbalanced-learn`**

**How are synthetic examples created:**

* the algorithm picks a real example from the minority class
* it finds that example's k nearest neighbours within the minority class
* it picks one of those neighbours at random and, in our feature space, creates a new example at a random point on the line between the real example and that neighbour

**Because of how it works, it can be relatively slow (depending on how much data you have)!**
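
To illustrate how one synthetic example is created, here is a minimal NumPy sketch of the interpolation step. It is not the `imblearn` implementation: the four 2D points, `k = 2`, and the random seed are all assumptions made for the illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical minority-class points in a 2D feature space
minority = np.array([[1.0, 1.0], [2.0, 1.5], [1.5, 3.0], [3.0, 2.5]])
k = 2  # number of nearest neighbours to consider

# Step 1: pick a real minority example
idx = rng.integers(len(minority))
sample = minority[idx]

# Step 2: find its k nearest neighbours (Euclidean distance)
# Position 0 of the sorted order is the example itself (distance 0), so we skip it
distances = np.linalg.norm(minority - sample, axis=1)
neighbour_indices = np.argsort(distances)[1:k + 1]

# Step 3: pick one neighbour and create a point on the line segment between the two
neighbour = minority[rng.choice(neighbour_indices)]
gap = rng.random()  # random value in [0, 1)
synthetic = sample + gap * (neighbour - sample)

print(synthetic)
```

The neighbour search in step 2 has to be repeated for many minority examples, which is where most of the running time goes on large datasets.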

**Simple example:**

In [24]:
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from imblearn.over_sampling import SMOTE 

df = pd.DataFrame({
    'sentences': ['i like this',
                  'this is terrible',
                  'it will not turn out well',
                  'i hate this',
                  'it has potential'],
    'sentiment': [1, 0, 0, 0, 1]})


vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(df['sentences'])
 
sm = SMOTE(k_neighbors=1, random_state=2) 

X_train_resampled, y_train_resampled = sm.fit_resample(X, df.sentiment.values) 

In [25]:
# Check sentences after SMOTE

vectorizer.inverse_transform(X_train_resampled)

[array(['this', 'like'], dtype='<U9'),
 array(['terrible', 'is', 'this'], dtype='<U9'),
 array(['well', 'out', 'turn', 'not', 'will', 'it'], dtype='<U9'),
 array(['hate', 'this'], dtype='<U9'),
 array(['potential', 'has', 'it'], dtype='<U9'),
 array(['potential', 'has', 'it', 'like', 'this'], dtype='<U9')]

**There is an extra sentence created by SMOTE!**

* let's prove that

In [26]:
# Number of original sentences

len(df["sentences"])

5

In [27]:
# Number of sentences in our resampled data

X_train_resampled.shape[0]

6

**Complex example:**

* SMOTE works much better on preprocessed data
    * because we are just demonstrating how to create extra examples here we won't go through the whole process of cleaning data, but you should do that when you use SMOTE in practice
    * we will do just a little bit of basic data cleaning

In [28]:
# Let's first import the libraries we will use

import pandas as pd
import numpy as np

from imblearn.over_sampling import SMOTE

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split

import nltk
from nltk.corpus import stopwords
from nltk.tokenize import RegexpTokenizer
from nltk.stem.snowball import SnowballStemmer

In [29]:
# Create a DataFrame

df = pd.read_csv("https://edlitera-datasets.s3.amazonaws.com/wine_data_classification.csv")

In [30]:
# Take a look at the first five rows of our dataframe

df.head()

Unnamed: 0,description,wine_type
0,"Aromas include tropical fruit, broom, brimston...",great_wine
1,"This is ripe and fruity, a wine that is smooth...",great_wine
2,"Tart and snappy, the flavors of lime flesh and...",great_wine
3,"Pineapple rind, lemon pith and orange blossom ...",great_wine
4,"Much like the regular bottling from 2012, this...",great_wine


In [31]:
# Check how many examples we have for each class

df["wine_type"].value_counts()

great_wine       96336
superior_wine    33635
Name: wine_type, dtype: int64

In [32]:
# Encode the labels as integers (0 = great_wine, 1 = superior_wine)

df["wine_type"] = df["wine_type"].map({"great_wine": 0, "superior_wine": 1})

In [33]:
# Lowercase text data

df["description"] = df["description"].str.lower()

In [34]:
# Tokenize data

tokenizer = RegexpTokenizer(r"\w+")

words = df["description"].apply(tokenizer.tokenize)

In [35]:
# Define a list of stopwords (requires a one-time nltk.download("stopwords"))

stopword_list = stopwords.words("english")

# Remove stopwords

words_without_stopwords = words.apply(lambda i: [word for word in i if word not in stopword_list])

In [36]:
# Define the stemmer we will use

snowball_stemmer = SnowballStemmer(language='english') # for Snowball Stemmer you need to define the language parameter

# Perform stemming

words_stemmed = words_without_stopwords.apply(lambda i: [snowball_stemmer.stem(word) for word in i])

In [37]:
# Add the cleaned data to our original Dataframe

df["description"] = words_stemmed.apply(lambda elem: " ".join(elem))

In [38]:
# Define the vectorizer

vectorizer = TfidfVectorizer(
    analyzer="word", 
    token_pattern=r"\w+",
    max_features=500)

In [39]:
# Define independent feature (the input text)

X = df["description"]

# Define dependent feature (the label we want to predict)

y = df["wine_type"]

In [40]:
# Separate data into training data and testing data

X_train, X_test, y_train, y_test = train_test_split(
    X, y, 
    test_size=0.3, 
    random_state=41)

# Separate data into training data and validation data

X_train, X_valid, y_train, y_valid = train_test_split(
    X_train, y_train, 
    test_size=0.3, 
    random_state=41)

In [41]:
# Vectorize data 

vectorized_train_X = vectorizer.fit_transform(X_train)

In [42]:
vectorized_train_X.shape[0]

63685

In [43]:
# Define SMOTE algorithm

smote = SMOTE(random_state=777, k_neighbors=3)

In [44]:
# Create resampled data

X_train_resampled, y_train_resampled = smote.fit_resample(vectorized_train_X, y_train.values)

In [45]:
# Check if we have increased the size of our train dataset

X_train_resampled.shape[0]

94154

In [46]:
# Check if we have balanced out our labels

np.unique(y_train_resampled,return_counts=True)

(array([0, 1], dtype=int64), array([47077, 47077], dtype=int64))

# `NLP specific solutions`

* solutions that can be used primarily for text data

* keep in mind: even if something doesn't make sense to us humans, it doesn't mean models can't use it effectively

* very advanced methods that don't always lead to better results: depending on the problem you are trying to solve you might even get worse results than before
    <br>
    
    * main reason: you can easily overfit your data because, even though the examples that are synthetically created are not the same as the original examples, they are still pretty similar

## `Data augmentation using synonyms`

* very simple, yet effective method

* a very good library for data augmentation for NLP problems: **`nlpaug`**
    * installed with **`pip install nlpaug`**
    * requires some other modules such as **`torch`**

**The procedure:**

* take examples of the minority class
* for each example, replace some words with their synonyms (looked up in a large lexical database such as WordNet)
* somewhat error-prone: a synonym that is valid in general may not fit the context of the sentence
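
The procedure can be sketched in plain Python. This is a toy illustration, not how `nlpaug` works internally: the `SYNONYMS` table, the `replace_with_synonyms` helper, and the example sentences are all hypothetical stand-ins for a real synonym source.

```python
import random

# Tiny hand-made synonym table; a stand-in for a real lexical resource
SYNONYMS = {
    "terrible": ["awful", "dreadful"],
    "cake": ["dessert"],
    "tasty": ["delicious", "flavorful"],
}


def replace_with_synonyms(text, max_replacements=2, seed=None):
    """Return a copy of `text` with up to `max_replacements` words swapped for synonyms."""
    rng = random.Random(seed)
    words = text.split()

    # Positions of words that have at least one known synonym
    candidates = [i for i, word in enumerate(words) if word.lower() in SYNONYMS]
    chosen = rng.sample(candidates, k=min(max_replacements, len(candidates)))

    for i in chosen:
        words[i] = rng.choice(SYNONYMS[words[i].lower()])
    return " ".join(words)


# Hypothetical minority-class examples
minority_examples = ["this cake was terrible", "terrible service and a dry cake"]
augmented_examples = [replace_with_synonyms(t, seed=27) for t in minority_examples]

print(augmented_examples)
```

The helper swaps words without looking at context, which is why the results of this kind of augmentation should be spot-checked before training on them.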

**Example:**

In [47]:
import nlpaug
import nlpaug.augmenter.word as naw

In [48]:
# Define synonym augmenter
# Define what corpus you want to use to find synonyms, and the maximum number of words that get replaced

aug = naw.SynonymAug(aug_src='wordnet', aug_max=3)

In [49]:
# Define example text

text = "Worst chocolate cake I ever had."

In [50]:
# Augment text

augmented_text = aug.augment(text)

In [51]:
# Show augmented text

augmented_text

'Worst chocolate cake I always had.'

## `Data augmentation using back-translation`

* a bit more complicated

**The procedure:**

* take examples of the minority class
* for each example, translate it into a foreign language
* after getting a translation, translate that translation back into the original language

* you can use any translator (even Google Translate)
    <br>
    
    * to use the Google Translate API in Python, you need to install the **`googletrans`** package
    * to be more specific, you need to install this version: **`pip install googletrans==3.1.0a0`**
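
Since any translator can be used, the round trip can be written once as a small helper that accepts the translator as a function. The sketch below is an assumption-laden illustration: `back_translate` and the `translate(text, src, dest)` interface are hypothetical, and `toy_translate` is a hard-coded stand-in (so the code runs offline) that only knows the example sentence used in this notebook.

```python
def back_translate(texts, translate, pivot_language="fr", source_language="en"):
    """Round-trip each text through `pivot_language` using the supplied `translate` callable.

    `translate(text, src, dest)` is a hypothetical interface: wrap whichever
    translation service you use so that it returns the translated string.
    """
    augmented = []
    for text in texts:
        pivot_text = translate(text, src=source_language, dest=pivot_language)
        augmented.append(translate(pivot_text, src=pivot_language, dest=source_language))
    return augmented


# Toy stand-in translator so the sketch runs offline; it is NOT a real translation service
TOY_TRANSLATIONS = {
    ("en", "fr", "Worst chocolate cake I ever had."):
        "Le pire gâteau au chocolat que j'ai jamais eu.",
    ("fr", "en", "Le pire gâteau au chocolat que j'ai jamais eu."):
        "The worst chocolate cake I've ever had.",
}


def toy_translate(text, src, dest):
    return TOY_TRANSLATIONS[(src, dest, text)]


minority_examples = ["Worst chocolate cake I ever had."]
augmented_examples = back_translate(minority_examples, toy_translate)

print(augmented_examples)
```

With the `googletrans` setup used in this notebook, the wrapper passed as `translate` could be as small as `lambda text, src, dest: translator.translate(text, src=src, dest=dest).text`.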

**How does the Google Translate API work?**

* to translate, you use the **`translate()`** method of the **`Translator`** class

* the object that this method returns has the following attributes:
    <br>
    
    * ***src*** - the source language
    * ***dest*** - destination language, which is by default set to English (en)
    * ***origin*** - original text
    * ***text*** - translated text
    * ***pronunciation*** - pronunciation of the translated text

**Example: translate into French and back**

In [52]:
# Import the translator from Google translate

import googletrans
from googletrans import Translator

In [53]:
# Display list of languages to find the abbreviation for French

print(googletrans.LANGUAGES)

{'af': 'afrikaans', 'sq': 'albanian', 'am': 'amharic', 'ar': 'arabic', 'hy': 'armenian', 'az': 'azerbaijani', 'eu': 'basque', 'be': 'belarusian', 'bn': 'bengali', 'bs': 'bosnian', 'bg': 'bulgarian', 'ca': 'catalan', 'ceb': 'cebuano', 'ny': 'chichewa', 'zh-cn': 'chinese (simplified)', 'zh-tw': 'chinese (traditional)', 'co': 'corsican', 'hr': 'croatian', 'cs': 'czech', 'da': 'danish', 'nl': 'dutch', 'en': 'english', 'eo': 'esperanto', 'et': 'estonian', 'tl': 'filipino', 'fi': 'finnish', 'fr': 'french', 'fy': 'frisian', 'gl': 'galician', 'ka': 'georgian', 'de': 'german', 'el': 'greek', 'gu': 'gujarati', 'ht': 'haitian creole', 'ha': 'hausa', 'haw': 'hawaiian', 'iw': 'hebrew', 'he': 'hebrew', 'hi': 'hindi', 'hmn': 'hmong', 'hu': 'hungarian', 'is': 'icelandic', 'ig': 'igbo', 'id': 'indonesian', 'ga': 'irish', 'it': 'italian', 'ja': 'japanese', 'jw': 'javanese', 'kn': 'kannada', 'kk': 'kazakh', 'km': 'khmer', 'ko': 'korean', 'ku': 'kurdish (kurmanji)', 'ky': 'kyrgyz', 'lo': 'lao', 'la': 'lat

In [54]:
# Define translator

translator = Translator()

In [55]:
# Define example text

original_text = "Worst chocolate cake I ever had."

In [56]:
# Translate text into French

first_translation = translator.translate(original_text, src="en", dest="fr")

In [57]:
# Display translated text

first_translation.text

"Le pire gâteau au chocolat que j'ai jamais eu."

In [58]:
# Translate text back into English

back_translation = translator.translate(first_translation.text, src='fr', dest="en")

In [59]:
# Display the text translated back into English

back_translation.text

"The worst chocolate cake I've ever had."
