<a href="https://colab.research.google.com/github/royn5618/Medium_Blog_Codes/blob/master/Emotion%20Detection/EmotionClassifier_Model_Improvement.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Setup

I use a publicly available dataset, hosted on both Kaggle and Hugging Face Datasets, that pairs short text documents with emotion labels. This notebook optimizes the model created in EmotionClassifier.ipynb.

**Data Source on Kaggle:** https://www.kaggle.com/praveengovi/emotions-dataset-for-nlp

**Data Source on HuggingFace:** https://huggingface.co/datasets/emotion

**Dataset Citation:** Saravia, E., Liu, H. C. T., Huang, Y. H., Wu, J., & Chen, Y. S. (2018). CARER: Contextualized affect representations for emotion recognition. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing (pp. 3687–3697).

**License on Kaggle:** CC BY-SA 4.0 | **License on HuggingFace:** Unknown

## Imports

In [1]:
# !pip install chart_studio
# !pip install tensorflow==2.4.1

In [2]:
import numpy as np
import pandas as pd

# visualization
import matplotlib.pyplot as plt
import plotly.express as px
from plotly.graph_objs import *
import plotly.figure_factory as ff
import chart_studio
import chart_studio.plotly as py

# to avoid warnings 
import warnings
warnings.filterwarnings("ignore")

import nltk
# nltk.download('stopwords')
# nltk.download('punkt')
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

# Keras imports
import tensorflow as tf
import tensorflow.keras as keras
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Dense, Embedding, Dropout, LSTM

# Scikit Learn imports
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, confusion_matrix

import config

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


## Notebook Configs

In [3]:
username=config.user_name
api_key=config.my_plotly_api_key
chart_studio.tools.set_credentials_file(username=username, api_key=api_key)

STOPWORDS = stopwords.words('english')
PORTER_STEMMER = PorterStemmer()

## Data Imports

The dataset ships as three files. I train on `train.txt`, validate on `val.txt`, and use `test.txt` to evaluate the model.

In [4]:
train_data = pd.read_csv('Data/train.txt', sep=';', names=['text', 'emotion'])
train_data.head()

| | text | emotion |
|---|---|---|
| 0 | i didnt feel humiliated | sadness |
| 1 | i can go from feeling so hopeless to so damned... | sadness |
| 2 | im grabbing a minute to post i feel greedy wrong | anger |
| 3 | i am ever feeling nostalgic about the fireplac... | love |
| 4 | i am feeling grouchy | anger |


In [5]:
test_data = pd.read_csv('Data/test.txt', sep=';', names=['text', 'emotion'])
test_data.head()

| | text | emotion |
|---|---|---|
| 0 | im feeling rather rotten so im not very ambiti... | sadness |
| 1 | im updating my blog because i feel shitty | sadness |
| 2 | i never make her separate from me because i do... | sadness |
| 3 | i left with my bouquet of red and yellow tulip... | joy |
| 4 | i was feeling a little vain when i did this one | sadness |


In [6]:
val_data = pd.read_csv('Data/val.txt', sep=';', names=['text', 'emotion'])
val_data.head()

| | text | emotion |
|---|---|---|
| 0 | im feeling quite sad and sorry for myself but ... | sadness |
| 1 | i feel like i am still looking at a blank canv... | sadness |
| 2 | i feel like a faithful servant | love |
| 3 | i am just feeling cranky and blue | anger |
| 4 | i can have for a treat or if i am feeling festive | joy |


# Improvements

1. Clean and normalize the text data
2. Track recall and precision as metrics when compiling the model
3. Redesign the data strategy for training and validation
4. Redesign the model:
    - Increase the dropout rate from 60% to 80%
    - Use 64 units in the first LSTM layer
    - Use 16 units in the second LSTM layer


## Data Preparation

### Text Pre-processing

In [7]:
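def preprocess_text(text):
    """Tokenize, drop English stop words, and Porter-stem the remaining tokens."""
    filtered_text = []
    for each_word in word_tokenize(text):
        # Case-sensitive check: the entries in STOPWORDS are all lowercase
        if each_word not in STOPWORDS:
            filtered_text.append(PORTER_STEMMER.stem(each_word))
    return " ".join(filtered_text)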

In [8]:
''' TEST '''

preprocess_text("I am walking about not")

'i walk'
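
The `i` in the output above appears to survive because the stop-word check is case-sensitive: the list contains `i` but not `I`, and the stemmer lowercases the token afterwards. The sketch below reproduces that effect without NLTK. `DEMO_STOPWORDS` and `filter_stopwords` are hypothetical names used only for illustration, and no stemming is applied.

```python
# Hypothetical, tiny stop-word set used only for this illustration;
# the notebook uses NLTK's full English list instead.
DEMO_STOPWORDS = {"i", "am", "about", "not"}


def filter_stopwords(tokens, stopwords, lowercase_first=False):
    """Drop tokens found in `stopwords`, optionally lowercasing them first."""
    kept = []
    for token in tokens:
        candidate = token.lower() if lowercase_first else token
        if candidate not in stopwords:
            kept.append(candidate)
    return kept


tokens = ["I", "am", "walking", "about", "not"]
case_sensitive = filter_stopwords(tokens, DEMO_STOPWORDS)
case_insensitive = filter_stopwords(tokens, DEMO_STOPWORDS, lowercase_first=True)

print(case_sensitive)    # "I" is kept because only "i" is in the set
print(case_insensitive)  # "I" is lowercased first and then removed
```

The text in this dataset is already lowercase, so the distinction should not matter for the training data, but it can matter for mixed-case input at inference time.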

In [9]:
train_data['text'] = train_data.text.apply(preprocess_text)
test_data['text'] = test_data.text.apply(preprocess_text)
val_data['text'] = val_data.text.apply(preprocess_text)

### Label Encoding

Convert each label into a corresponding integer.

In [10]:
train_data["emotion"] = train_data["emotion"].astype('category')
train_data["emotion_label"] = train_data["emotion"].cat.codes
train_data.head()

| | text | emotion | emotion_label |
|---|---|---|---|
| 0 | didnt feel humili | sadness | 4 |
| 1 | go feel hopeless damn hope around someon care ... | sadness | 4 |
| 2 | im grab minut post feel greedi wrong | anger | 0 |
| 3 | ever feel nostalg fireplac know still properti | love | 3 |
| 4 | feel grouchi | anger | 0 |


In [11]:
test_data["emotion"] = test_data["emotion"].astype('category')
test_data["emotion_label"] = test_data["emotion"].cat.codes
test_data.head()

| | text | emotion | emotion_label |
|---|---|---|---|
| 0 | im feel rather rotten im ambiti right | sadness | 4 |
| 1 | im updat blog feel shitti | sadness | 4 |
| 2 | never make separ ever want feel like asham | sadness | 4 |
| 3 | left bouquet red yellow tulip arm feel slightl... | joy | 2 |
| 4 | feel littl vain one | sadness | 4 |


In [12]:
val_data["emotion"] = val_data["emotion"].astype('category')
val_data["emotion_label"] = val_data["emotion"].cat.codes
val_data.head()

| | text | emotion | emotion_label |
|---|---|---|---|
| 0 | im feel quit sad sorri ill snap soon | sadness | 4 |
| 1 | feel like still look blank canva blank piec paper | sadness | 4 |
| 2 | feel like faith servant | love | 3 |
| 3 | feel cranki blue | anger | 0 |
| 4 | treat feel festiv | joy | 2 |
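
Calling `cat.codes` on each split separately gives matching integers only when every split contains the same set of labels, which is the case in the outputs above (`sadness` is 4 everywhere). A more defensive option is to build the mapping once from the training labels and reuse it. The sketch below shows the idea in plain Python; `build_label_mapping`, `encode_labels` and the miniature label lists are hypothetical stand-ins for the real DataFrames.

```python
def build_label_mapping(labels):
    """Map each distinct label to an integer, in sorted order."""
    return {label: code for code, label in enumerate(sorted(set(labels)))}


def encode_labels(labels, mapping):
    """Encode labels with a fixed mapping; an unknown label raises KeyError."""
    return [mapping[label] for label in labels]


# Hypothetical miniature splits standing in for the real data.
train_labels_demo = ["sadness", "anger", "love", "joy", "fear", "surprise"]
val_labels_demo = ["sadness", "love", "anger"]  # some classes are absent here

mapping = build_label_mapping(train_labels_demo)
encoded_val = encode_labels(val_labels_demo, mapping)

print(mapping)
print(encoded_val)
```

With a per-split encoding, the three labels in `val_labels_demo` would be numbered 0 to 2 and would no longer line up with the training codes.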


### One Hot Encoding and Train Test Data Prep

Convert the integer labels to one-hot vectors, then pair each split's text column (the features) with its one-hot encoded labels.

In [13]:
train_features, train_labels = train_data['text'], tf.one_hot(
    train_data["emotion_label"], 6)
test_features, test_labels = test_data['text'], tf.one_hot(
    test_data["emotion_label"], 6)
val_features, val_labels = val_data['text'], tf.one_hot(
    val_data["emotion_label"], 6)

### Decoder

I use this helper to decode one-hot encoded predictions back into their text labels.

In [14]:
def get_labels_from_oh_code(oh_code):
    """Take a one-hot encoded matrix and return a list of decoded category names."""
    # Index of the 1 in each row is the integer label
    label_code = np.argmax(oh_code, axis=1)
    label = test_data.emotion.cat.categories[label_code]
    return list(label)

In [15]:
"Test Method"
test= np.array(train_labels[:5])
get_labels_from_oh_code(test)

['sadness', 'sadness', 'anger', 'love', 'anger']
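
The encode and decode steps above form a round trip. Here is a minimal sketch in plain Python, assuming the six categories are ordered alphabetically (which matches the codes shown in the tables above). `CATEGORIES`, `one_hot` and `decode` are illustrative names, not part of the notebook.

```python
# Assumed category order: alphabetical, consistent with anger=0, joy=2,
# love=3 and sadness=4 in the outputs above.
CATEGORIES = ["anger", "fear", "joy", "love", "sadness", "surprise"]


def one_hot(code, depth):
    """Return a list of zeros with a single 1 at position `code`."""
    row = [0] * depth
    row[code] = 1
    return row


def decode(rows, categories):
    """Map each one-hot row back to its category via the index of its maximum."""
    return [categories[max(range(len(row)), key=row.__getitem__)] for row in rows]


codes = [4, 4, 0, 3, 0]  # emotion_label of the first five training rows
encoded = [one_hot(code, len(CATEGORIES)) for code in codes]
decoded = decode(encoded, CATEGORIES)

print(encoded[0])
print(decoded)
```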

## Model Training

### Tokenization and Padding

I use the Keras `Tokenizer` to convert tokens into integers, with `<OOV>` representing out-of-vocabulary (OOV) terms.
This is a flexible way of handling words in the validation and test data that were not seen during training.

With `pad_sequences`, I pad or truncate the sequences so that they all have the same length.

In [16]:
# Vocab Size

total_data = ' '.join(list(train_data.text))
len(set(word_tokenize(total_data))) # vocab size

10375

In [17]:
# Max Seq Length

list_seq_lengths = [len(word_tokenize(each_text)) for each_text in list(train_data.text)]
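
One way to sanity-check a cutoff such as a maximum sequence length of 20 is to measure what share of samples fit within it without truncation. The sketch below is only an illustration: `coverage` is a hypothetical helper and `demo_lengths` holds made-up values, not the lengths computed above.

```python
def coverage(lengths, max_len):
    """Fraction of samples whose length is at most `max_len`."""
    if not lengths:
        raise ValueError("lengths must not be empty")
    return sum(1 for n in lengths if n <= max_len) / len(lengths)


# Hypothetical token counts; in the notebook this would be list_seq_lengths.
demo_lengths = [2, 3, 5, 7, 8, 9, 12, 15, 21, 30]
share_kept_whole = coverage(demo_lengths, 20)

print(share_kept_whole)
```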

In [18]:
fig = px.box(
    x=list_seq_lengths,
    template='plotly_white'
)
fig.update_layout(
    title={
        'text': "Boxplot of Number of Words per Sample | Emotion Prediction",
        'x': 0.4,
        'xanchor': 'center'
    })
fig.update_yaxes(title='Frequency').update_xaxes(
    title='Number of Words')
fig.update_layout(showlegend=False)
fig.update_layout(hovermode='x')
fig.show()

In [19]:
py.plot(fig, filename="Boxplot of Number of Words Per Sample | Emotion Prediction", auto_open = True)

'https://plotly.com/~royn5618/116/'

In [20]:
vocab_size = 10000
vector_size = 300
max_seq_len = 20

tokenizer = Tokenizer(oov_token = "<OOV>", num_words=vocab_size, lower=True)
tokenizer.fit_on_texts(train_data['text'])

sequences_train = tokenizer.texts_to_sequences(train_data['text'])
sequences_test = tokenizer.texts_to_sequences(test_data['text'])
sequences_val = tokenizer.texts_to_sequences(val_data['text'])

padded_train = pad_sequences(sequences_train, padding = 'post', maxlen=max_seq_len)
padded_test = pad_sequences(sequences_test, padding = 'post', maxlen=max_seq_len)
padded_val = pad_sequences(sequences_val, padding = 'post', maxlen=max_seq_len)
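
To make the tokenizer and padding behaviour concrete, here is a plain-Python sketch of the same ideas. It is a simplification and not the Keras implementation: ties between equally frequent words are broken alphabetically here, and all function names are hypothetical. Two details mirror Keras: index 0 is reserved for padding and index 1 for the OOV token, and `pad_sequences` truncates from the front by default (`truncating='pre'`), so with `padding='post'` a sequence longer than `maxlen` keeps its last tokens.

```python
from collections import Counter

OOV_INDEX = 1  # 0 is reserved for padding, 1 for the OOV token


def fit_word_index(texts):
    """Rank words by frequency; the most frequent word gets index 2."""
    counts = Counter(word for text in texts for word in text.split())
    ranked = sorted(counts, key=lambda w: (-counts[w], w))
    return {word: rank for rank, word in enumerate(ranked, start=2)}


def to_sequence(text, word_index, num_words):
    """Replace unseen words, and words ranked at or beyond num_words, with OOV."""
    sequence = []
    for word in text.split():
        index = word_index.get(word)
        keep = index is not None and index < num_words
        sequence.append(index if keep else OOV_INDEX)
    return sequence


def pad_post(sequence, maxlen, value=0):
    """Pad at the end; truncate from the front when the sequence is too long."""
    if maxlen <= 0:
        raise ValueError("maxlen must be positive")
    sequence = sequence[-maxlen:]
    return sequence + [value] * (maxlen - len(sequence))


demo_texts = ["feel happi", "feel sad", "feel happi today"]  # made-up samples
word_index = fit_word_index(demo_texts)
sequence = to_sequence("feel strang today", word_index, num_words=5)
padded = pad_post(sequence, maxlen=5)
truncated = pad_post([2, 3, 4, 5, 6, 7], maxlen=4)

print(word_index)
print(sequence, padded, truncated)
```

If keeping the beginning of long texts matters more than the end, `pad_sequences` accepts `truncating='post'`.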

### Model Training

In [21]:
EPOCHS = 20
PATIENCE = 5
FILEPATH='models/emotion_classifier_improved.h5'

In [22]:
def get_model():
    model = Sequential()
    model.add(
        Embedding(input_dim=vocab_size,
                  output_dim=vector_size,
                  input_length=max_seq_len))
    model.add(Dropout(0.8))
    model.add(LSTM(64,return_sequences=True))
    model.add(LSTM(16))
    model.add(Dense(6,activation='softmax'))
    return model

In [23]:
callbacks = [
    keras.callbacks.EarlyStopping(monitor='val_loss',
                                  mode='min',
                                  patience=PATIENCE,
                                  verbose=1,
                                  restore_best_weights=True),
    keras.callbacks.ModelCheckpoint(filepath=FILEPATH,
                                    verbose=1,
                                    save_best_only=True)
]

In [24]:
model = get_model()
model.summary()

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding (Embedding)        (None, 20, 300)           3000000   
_________________________________________________________________
dropout (Dropout)            (None, 20, 300)           0         
_________________________________________________________________
lstm (LSTM)                  (None, 20, 64)            93440     
_________________________________________________________________
lstm_1 (LSTM)                (None, 16)                5184      
_________________________________________________________________
dense (Dense)                (None, 6)                 102       
Total params: 3,098,726
Trainable params: 3,098,726
Non-trainable params: 0
_________________________________________________________________
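
The parameter counts in the summary can be checked by hand. The sketch below uses the standard formulas for embedding, LSTM and dense layers; the function names are mine, and the dimensions come from `get_model()` above.

```python
def embedding_params(vocab_size, vector_size):
    """One vector of length vector_size per vocabulary entry."""
    return vocab_size * vector_size


def lstm_params(input_dim, units):
    """Four gates, each with input weights, recurrent weights and a bias."""
    return 4 * ((input_dim + units) * units + units)


def dense_params(input_dim, units):
    """A weight matrix plus one bias per output unit."""
    return input_dim * units + units


counts = {
    "embedding": embedding_params(10000, 300),
    "lstm": lstm_params(300, 64),    # input is the 300-dim embedding
    "lstm_1": lstm_params(64, 16),   # input is the 64-dim output of lstm
    "dense": dense_params(16, 6),
}
total = sum(counts.values())

print(counts)
print(total)
```

Almost all of the roughly 3.1 million parameters sit in the embedding layer, which is also the layer the dropout is applied to.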


In [25]:
model.compile(optimizer='adam',
              loss='categorical_crossentropy',
              metrics=[tf.keras.metrics.Recall(), tf.keras.metrics.Precision()])

In [26]:
tf.config.run_functions_eagerly(True)
history = model.fit(padded_train,
                    train_labels,
                    validation_data=(padded_val, val_labels),
                    callbacks=callbacks,
                    epochs=EPOCHS)

Epoch 1/20

Epoch 00001: val_loss improved from inf to 1.04490, saving model to models/emotion_classifier_improved.h5
Epoch 2/20

Epoch 00002: val_loss improved from 1.04490 to 0.46911, saving model to models/emotion_classifier_improved.h5
Epoch 3/20

Epoch 00003: val_loss improved from 0.46911 to 0.33492, saving model to models/emotion_classifier_improved.h5
Epoch 4/20

Epoch 00004: val_loss improved from 0.33492 to 0.31489, saving model to models/emotion_classifier_improved.h5
Epoch 5/20

Epoch 00005: val_loss improved from 0.31489 to 0.28647, saving model to models/emotion_classifier_improved.h5
Epoch 6/20

Epoch 00006: val_loss improved from 0.28647 to 0.26816, saving model to models/emotion_classifier_improved.h5
Epoch 7/20

Epoch 00007: val_loss did not improve from 0.26816
Epoch 8/20

Epoch 00008: val_loss did not improve from 0.26816
Epoch 9/20

Epoch 00009: val_loss did not improve from 0.26816
Epoch 10/20

Epoch 00010: val_loss did not improve from 0.26816
Epoch 11/20
Restori
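
The log is consistent with how patience-based early stopping works: the best `val_loss` is at epoch 6, and training ends after 5 further epochs without improvement. The sketch below simulates that rule. `simulate_early_stopping` is a hypothetical helper, not the Keras callback, and the losses from epoch 7 onwards are made-up placeholders because the log only says they did not improve.

```python
def simulate_early_stopping(val_losses, patience):
    """Return (stop_epoch, best_epoch), both 1-indexed."""
    best_loss = float("inf")
    best_epoch = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch
        elif epoch - best_epoch >= patience:
            # `patience` epochs have passed without improvement
            return epoch, best_epoch
    return len(val_losses), best_epoch


# Epochs 1 to 6 are taken from the log above.
logged = [1.04490, 0.46911, 0.33492, 0.31489, 0.28647, 0.26816]
# Hypothetical values for later epochs, all worse than 0.26816.
assumed = [0.27, 0.28, 0.27, 0.29, 0.28, 0.30]

stop_epoch, best_epoch = simulate_early_stopping(logged + assumed, patience=5)
print(stop_epoch, best_epoch)
```

Because `restore_best_weights=True` is set, the model used afterwards carries the weights from the best epoch rather than the last one.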

In [27]:
history.history.keys()

dict_keys(['loss', 'recall', 'precision', 'val_loss', 'val_recall', 'val_precision'])

### Training - Validation Loss

In [28]:
metric_to_plot = "loss"
epochs = list(range(1, max(history.epoch) + 2))
training_loss = history.history[metric_to_plot]
validation_loss = history.history["val_" + metric_to_plot]

trace1 = {
    "mode": "lines+markers",
    "name": "Training Loss",
    "type": "scatter",
    "x": epochs,
    "y": training_loss
}

trace2 = {
    "mode": "lines+markers",
    "name": "Validation Loss",
    "type": "scatter",
    "x": epochs,
    "y": validation_loss
}

data = Data([trace1, trace2])
layout = {
    "title": "Training - Validation Loss | Emotion Prediction",
    "xaxis": {
        "title": "Number of epochs",
        "titlefont": {
            "size": 18,
            "color": "#7f7f7f"
        }
    },
    "yaxis": {
        "title": "Loss",
        "titlefont": {
            "size": 18,
            "color": "#7f7f7f"
        }
    }
}
fig = Figure(data=data, layout=layout)
fig.update_layout(hovermode="x unified")
fig.show()

In [29]:
py.plot(fig, filename="Train-Val Loss | Emotion Prediction", auto_open = True)

'https://plotly.com/~royn5618/108/'

### Training - Validation Classification Metrics

In [30]:
metric_to_plot = "recall"
epochs = list(range(1, max(history.epoch) + 2))
training_recall = history.history[metric_to_plot]
validation_recall = history.history["val_" + metric_to_plot]

trace1 = {
    "mode": "lines+markers",
    "name": "Training Recall",
    "type": "scatter",
    "x": epochs,
    "y": training_recall
}

trace2 = {
    "mode": "lines+markers",
    "name": "Validation Recall",
    "type": "scatter",
    "x": epochs,
    "y": validation_recall
}

data = Data([trace1, trace2])
layout = {
    "title": "Training - Validation Recall | Emotion Prediction",
    "xaxis": {
        "title": "Number of epochs",
        "titlefont": {
            "size": 18,
            "color": "#7f7f7f"
        }
    },
    "yaxis": {
        "title": "Recall",
        "titlefont": {
            "size": 18,
            "color": "#7f7f7f"
        }
    }
}
fig = Figure(data=data, layout=layout)
fig.update_layout(hovermode="x unified")
fig.show()

In [31]:
py.plot(fig, filename="Train-Val Recall | Emotion Prediction", auto_open = True)

'https://plotly.com/~royn5618/110/'

In [32]:
metric_to_plot = "precision"
epochs = list(range(1, max(history.epoch) + 2))
training_precision = history.history[metric_to_plot]
validation_precision = history.history["val_" + metric_to_plot]

trace1 = {
    "mode": "lines+markers",
    "name": "Training Precision",
    "type": "scatter",
    "x": epochs,
    "y": training_precision
}

trace2 = {
    "mode": "lines+markers",
    "name": "Validation Precision",
    "type": "scatter",
    "x": epochs,
    "y": validation_precision
}

data = Data([trace1, trace2])
layout = {
    "title": "Training - Validation Precision | Emotion Prediction",
    "xaxis": {
        "title": "Number of epochs",
        "titlefont": {
            "size": 18,
            "color": "#7f7f7f"
        }
    },
    "yaxis": {
        "title": "Precision",
        "titlefont": {
            "size": 18,
            "color": "#7f7f7f"
        }
    }
}
fig = Figure(data=data, layout=layout)
fig.update_layout(hovermode="x unified")
fig.show()

In [33]:
py.plot(fig, filename="Train-Val Precision | Emotion Prediction", auto_open = True)

'https://plotly.com/~royn5618/112/'
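
The loss, recall and precision cells above differ only in the metric key and the label, so the trace and layout dictionaries could be produced by one helper. The sketch below builds the same dictionary structure in plain Python. `build_metric_figure_spec` is a hypothetical name, and `demo_history` holds made-up numbers in place of `history.history`.

```python
def build_metric_figure_spec(history_dict, metric, label):
    """Build trace and layout dicts for a training vs validation curve."""
    epochs = list(range(1, len(history_dict[metric]) + 1))

    def trace(name, values):
        return {
            "mode": "lines+markers",
            "name": name,
            "type": "scatter",
            "x": epochs,
            "y": values,
        }

    axis_font = {"size": 18, "color": "#7f7f7f"}
    return {
        "data": [
            trace("Training " + label, history_dict[metric]),
            trace("Validation " + label, history_dict["val_" + metric]),
        ],
        "layout": {
            "title": "Training - Validation " + label + " | Emotion Prediction",
            "xaxis": {"title": "Number of epochs", "titlefont": axis_font},
            "yaxis": {"title": label, "titlefont": axis_font},
        },
    }


# Made-up values for illustration only.
demo_history = {"recall": [0.2, 0.6, 0.8], "val_recall": [0.3, 0.7, 0.75]}
spec = build_metric_figure_spec(demo_history, "recall", "Recall")

print(spec["layout"]["title"])
print([t["name"] for t in spec["data"]])
```

In the notebook, the result could then be passed on as `Figure(data=spec["data"], layout=spec["layout"])`, the same way the cells above construct their figures.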

### Classification Report

In [34]:
# model = keras.models.load_model(FILEPATH)

In [35]:
# Model Evaluation on Train Data

y_pred_one_hot_encoded = (model.predict(padded_train)> 0.5).astype("int32")
y_pred_one_hot_encoded

array([[0, 0, 0, 0, 1, 0],
       [0, 0, 0, 0, 1, 0],
       [1, 0, 0, 0, 0, 0],
       ...,
       [0, 0, 1, 0, 0, 0],
       [1, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 1, 0]], dtype=int32)
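
One caveat with thresholding the softmax output at 0.5 before taking the argmax: when no class probability exceeds 0.5, the row becomes all zeros and the argmax falls back to index 0, which is the code for `anger`. Taking the argmax of the probabilities directly avoids that fallback. The sketch below contrasts the two in plain Python; the probability rows are made up and the function names are hypothetical.

```python
def argmax(row):
    """Index of the first maximum, like np.argmax on a 1-D array."""
    return max(range(len(row)), key=row.__getitem__)


def predict_via_threshold(probs, threshold=0.5):
    """Mirror the cell above: binarize at the threshold, then argmax."""
    binary = [1 if p > threshold else 0 for p in probs]
    return argmax(binary)


def predict_via_argmax(probs):
    """Pick the most probable class directly."""
    return argmax(probs)


# Made-up softmax outputs over the six classes.
confident = [0.02, 0.03, 0.05, 0.05, 0.80, 0.05]
uncertain = [0.05, 0.10, 0.40, 0.35, 0.05, 0.05]

print(predict_via_threshold(confident), predict_via_argmax(confident))
print(predict_via_threshold(uncertain), predict_via_argmax(uncertain))
```

The two approaches agree whenever one class has a probability above 0.5, so the difference only affects low-confidence samples.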

In [36]:
y_pred = np.array(tf.argmax(y_pred_one_hot_encoded, axis=1))
print(classification_report(train_data['emotion_label'], y_pred))

              precision    recall  f1-score   support

           0       0.93      0.96      0.95      2159
           1       0.96      0.94      0.95      1937
           2       0.99      0.95      0.97      5362
           3       0.87      0.96      0.91      1304
           4       0.98      0.98      0.98      4666
           5       0.90      0.92      0.91       572

    accuracy                           0.96     16000
   macro avg       0.94      0.95      0.94     16000
weighted avg       0.96      0.96      0.96     16000



In [37]:
# Model Evaluation on Validation Data

y_pred_one_hot_encoded = (model.predict(padded_val)> 0.5).astype("int32")
y_pred = np.array(tf.argmax(y_pred_one_hot_encoded, axis=1))
print(classification_report(val_data['emotion_label'], y_pred))

              precision    recall  f1-score   support

           0       0.83      0.92      0.87       275
           1       0.84      0.83      0.83       212
           2       0.94      0.90      0.92       704
           3       0.78      0.82      0.80       178
           4       0.94      0.93      0.94       550
           5       0.81      0.83      0.82        81

    accuracy                           0.89      2000
   macro avg       0.86      0.87      0.86      2000
weighted avg       0.90      0.89      0.89      2000



In [38]:
# Model Evaluation on Test Data

y_pred_one_hot_encoded = (model.predict(padded_test)> 0.5).astype("int32")
y_pred = np.array(tf.argmax(y_pred_one_hot_encoded, axis=1))
print(classification_report(test_data['emotion_label'], y_pred))

              precision    recall  f1-score   support

           0       0.82      0.90      0.86       275
           1       0.86      0.88      0.87       224
           2       0.93      0.90      0.91       695
           3       0.73      0.85      0.78       159
           4       0.96      0.92      0.94       581
           5       0.75      0.68      0.71        66

    accuracy                           0.89      2000
   macro avg       0.84      0.85      0.85      2000
weighted avg       0.90      0.89      0.89      2000



### Confusion Matrix

In [39]:
z = confusion_matrix(test_data['emotion_label'], y_pred)
z

array([[248,   9,   8,   0,  10,   0],
       [ 11, 197,   1,   0,   5,  10],
       [ 14,   2, 623,  48,   3,   5],
       [  6,   0,  17, 135,   1,   0],
       [ 20,  10,  14,   3, 534,   0],
       [  5,  10,   4,   0,   2,  45]])
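
The per-class precision and recall in the test classification report can be recovered from this matrix, where rows are true labels and columns are predicted labels. A minimal sketch in plain Python follows; `per_class_metrics` is a hypothetical helper and the matrix is copied from the output above.

```python
CONFUSION = [
    [248, 9, 8, 0, 10, 0],
    [11, 197, 1, 0, 5, 10],
    [14, 2, 623, 48, 3, 5],
    [6, 0, 17, 135, 1, 0],
    [20, 10, 14, 3, 534, 0],
    [5, 10, 4, 0, 2, 45],
]


def per_class_metrics(matrix):
    """Return a list of (precision, recall) pairs, rounded to two decimals."""
    n = len(matrix)
    results = []
    for k in range(n):
        true_positive = matrix[k][k]
        actual = sum(matrix[k])                          # row sum: true class k
        predicted = sum(matrix[r][k] for r in range(n))  # column sum: predicted k
        precision = true_positive / predicted if predicted else 0.0
        recall = true_positive / actual if actual else 0.0
        results.append((round(precision, 2), round(recall, 2)))
    return results


metrics = per_class_metrics(CONFUSION)
correct = sum(CONFUSION[k][k] for k in range(len(CONFUSION)))
accuracy = correct / sum(sum(row) for row in CONFUSION)

print(metrics)
print(accuracy)
```

For class 3 (`love`), for example, 48 of the 186 samples predicted as `love` actually belong to class 2 (`joy`), which accounts for most of the gap between its precision and recall.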

In [40]:
z = z[::-1]
z

array([[  5,  10,   4,   0,   2,  45],
       [ 20,  10,  14,   3, 534,   0],
       [  6,   0,  17, 135,   1,   0],
       [ 14,   2, 623,  48,   3,   5],
       [ 11, 197,   1,   0,   5,  10],
       [248,   9,   8,   0,  10,   0]])

In [41]:
x = list(train_data.emotion.cat.categories)
# Stringify each cell for the heatmap annotations (avoids reusing the name x)
z_text = [[str(count) for count in row] for row in z]

In [42]:
fig = ff.create_annotated_heatmap(z, 
                                  x = x,
                                  y = x[::-1], # same labels, reversed to match the flipped rows of z
                                  annotation_text=z_text, 
                                  colorscale='sunsetdark',
                                  showscale = True
                                 )

fig.update_layout(title_text = "Confusion Matrix (Test Data) | Emotion Prediction")
fig.show()

In [43]:
py.plot(fig, filename="Confusion Matrix (Test Data) | Emotion Prediction", auto_open = True)

'https://plotly.com/~royn5618/121/'