<a href="https://colab.research.google.com/github/royn5618/Medium_Blog_Codes/blob/master/GenAI_4_NLP_Systems/EmotionClassifier_KerasTuner_2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# SetUp

I have used a publicly available dataset on Kaggle and on Hugging Face datasets. This dataset contains a list of documents with corresponding emotion labels. The work in this notebook is to optimize the model created in EmotionClassifier.ipynb using KerasTuner.

**Data Source on Kaggle:** https://www.kaggle.com/praveengovi/emotions-dataset-for-nlp

**Data Source on HuggingFace:** https://huggingface.co/datasets/emotion

**Dataset Citation:** Saravia, E., Liu, H. C. T., Huang, Y. H., Wu, J., & Chen, Y. S. (2018). Carer: Contextualized affect representations for emotion recognition. In Proceedings of the 2018 conference on empirical methods in natural language processing (pp. 3687–3697)

**License on Kaggle:** CC BY-SA 4.0 | **License on HuggingFace:** Unknown

In [1]:
!pip install chart_studio
!pip install -q -U keras-tuner



In [2]:
import numpy as np
import pandas as pd

# visualization
import matplotlib.pyplot as plt
import matplotlib.pyplot as plt
import plotly.express as px
from plotly.graph_objs import *
import plotly.figure_factory as ff
import chart_studio
import chart_studio.plotly as py

# to avoid warnings
import warnings
warnings.filterwarnings("ignore")

import nltk
nltk.download('punkt')
nltk.download('stopwords')
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

# Keras imports
import tensorflow as tf
import tensorflow.keras as keras
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Dense, Embedding, Dropout, LSTM

# Keras Tuner
import keras_tuner as kt

# Scikit Learn imports
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, confusion_matrix

# import config

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


## Notebook Configs

In [3]:
# username=config.user_name
# api_key=config.my_plotly_api_key
# chart_studio.tools.set_credentials_file(username=username, api_key=api_key)

STOPWORDS = stopwords.words('english')
PORTER_STEMMER = PorterStemmer()

## Data Imports

In [5]:
train_data = pd.read_csv('updated_train_v2.csv', header=1, names=['text', 'emotion'])
train_data.head()

Unnamed: 0,text,emotion
0,i can go from feeling so hopeless to so damned...,sadness
1,im grabbing a minute to post i feel greedy wrong,anger
2,i am ever feeling nostalgic about the fireplac...,love
3,i am feeling grouchy,anger
4,ive been feeling a little burdened lately wasn...,sadness


In [6]:
test_data = pd.read_csv('test.txt', sep=';', names=['text', 'emotion'])
test_data.head()

Unnamed: 0,text,emotion
0,im feeling rather rotten so im not very ambiti...,sadness
1,im updating my blog because i feel shitty,sadness
2,i never make her separate from me because i do...,sadness
3,i left with my bouquet of red and yellow tulip...,joy
4,i was feeling a little vain when i did this one,sadness


In [7]:
val_data = pd.read_csv('val.txt', sep=';', names=['text', 'emotion'])
val_data.head()

Unnamed: 0,text,emotion
0,im feeling quite sad and sorry for myself but ...,sadness
1,i feel like i am still looking at a blank canv...,sadness
2,i feel like a faithful servant,love
3,i am just feeling cranky and blue,anger
4,i can have for a treat or if i am feeling festive,joy


## Data Preparation

In [8]:
def preprocess_text(text):
    filtered_text = []
    for each_word in word_tokenize(text):
        if each_word not in STOPWORDS:
            filtered_text.append(PORTER_STEMMER.stem(each_word))
    return " ".join(filtered_text)

In [9]:
train_data['text'] = train_data.text.apply(preprocess_text)
test_data['text'] = test_data.text.apply(preprocess_text)
val_data['text'] = val_data.text.apply(preprocess_text)

In [10]:
train_data["emotion"] = train_data["emotion"].astype('category')
train_data["emotion_label"] = train_data["emotion"].cat.codes
train_data.head()

Unnamed: 0,text,emotion,emotion_label
0,go feel hopeless damn hope around someon care ...,sadness,4
1,im grab minut post feel greedi wrong,anger,0
2,ever feel nostalg fireplac know still properti,love,3
3,feel grouchi,anger,0
4,ive feel littl burden late wasnt sure,sadness,4


In [11]:
test_data["emotion"] = test_data["emotion"].astype('category')
test_data["emotion_label"] = test_data["emotion"].cat.codes
test_data.head()

val_data["emotion"] = val_data["emotion"].astype('category')
val_data["emotion_label"] = val_data["emotion"].cat.codes
val_data.head()

train_features, train_labels = train_data['text'], tf.one_hot(
    train_data["emotion_label"], 6)
test_features, test_labels = test_data['text'], tf.one_hot(
    test_data["emotion_label"], 6)
val_features, val_labels = val_data['text'], tf.one_hot(
    val_data["emotion_label"], 6)

def get_labels_from_oh_code(oh_code):
    """ Takes in one-hot encoded matrix
    Returns a list of decoded categories"""
    label_code = np.argmax(oh_code, axis=1)
#     print(label_code)
    label = test_data.emotion.cat.categories[label_code]
#     print(list(label))
    return list(label)

In [12]:
total_data = ' '.join(list(train_data.text))
len(set(word_tokenize(total_data))) # vocab size

10387

## Model Training

In [13]:
vocab_size = 10000
vector_size = 300
max_seq_len = 20

tokenizer = Tokenizer(oov_token = "<OOV>", num_words=vocab_size, lower=True)
tokenizer.fit_on_texts(train_data['text'])

sequences_train = tokenizer.texts_to_sequences(train_data['text'])
sequences_test = tokenizer.texts_to_sequences(test_data['text'])
sequences_val = tokenizer.texts_to_sequences(val_data['text'])

padded_train = pad_sequences(sequences_train, padding = 'post', maxlen=max_seq_len)
padded_test = pad_sequences(sequences_test, padding = 'post', maxlen=max_seq_len)
padded_val = pad_sequences(sequences_val, padding = 'post', maxlen=max_seq_len)

In [14]:
def model_builder(hp):
    model = Sequential()
    hp_vector_size = hp.Int('vector_size', min_value=100, max_value=500, step=100)
    model.add(
        Embedding(input_dim=vocab_size,
                  output_dim=vector_size,
                  input_length=max_seq_len))
    hp_dropout_rate = hp.Float('dropout_rate', min_value=0.6, max_value=0.9, step=0.1)
    model.add(Dropout(hp_dropout_rate))
    hp_lstm_units1 = hp.Int('lstm_units1', min_value=32, max_value=512, step=32)
    model.add(LSTM(hp_lstm_units1,return_sequences=True))
    hp_lstm_units2 = hp.Int('lstm_units2', min_value=16, max_value=512, step=32)
    model.add(LSTM(hp_lstm_units2))
    model.add(Dense(6,activation='softmax'))
    hp_learning_rate = hp.Choice('learning_rate', values=[1e-2, 1e-3, 1e-4])
    model.compile(optimizer=keras.optimizers.Adam(learning_rate=hp_learning_rate),
              loss='categorical_crossentropy',
              metrics=[tf.keras.metrics.Recall(), tf.keras.metrics.Precision()])
    return model

In [15]:
tuner = kt.Hyperband(model_builder,
                     objective=kt.Objective("val_recall", direction="max"),
                     max_epochs=20,
                     factor=5,
                     directory="model_trials_2",
                     project_name="emotion_detector_2"
                     )

In [16]:
stop_early = tf.keras.callbacks.EarlyStopping(monitor='val_recall', patience=5)

In [17]:
tuner.search(padded_train,
             train_labels,
             epochs=20,
             validation_data=(padded_val, val_labels),
             callbacks=[stop_early]
             )

# Get the optimal hyperparameters
best_hps=tuner.get_best_hyperparameters(num_trials=1)[0]

Trial 13 Complete [00h 01m 32s]
val_recall: 0.5360000133514404

Best val_recall So Far: 0.9020000100135803
Total elapsed time: 00h 15m 11s


In [18]:
best_hps.get('vector_size')

100

In [19]:
best_hps.get('dropout_rate')

0.7

In [20]:
best_hps.get('lstm_units1')

352

In [21]:
best_hps.get('lstm_units2')

272

In [22]:
best_hps.get('learning_rate')

0.001

In [23]:
# Build the model with the optimal hyperparameters and train it on the data for 50 epochs
model_best_hp = tuner.hypermodel.build(best_hps)
callbacks = [
    keras.callbacks.EarlyStopping(monitor='val_loss',
                                  mode='min',
                                  patience=5,
                                  verbose=1,
                                  restore_best_weights=True)
]
history = model_best_hp.fit(padded_train,
                            train_labels,
                            epochs=20,
                            callbacks=callbacks,
                            validation_data=(padded_val, val_labels))

Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 9: early stopping


In [24]:
# Model Evaluation on Train Data
y_pred_one_hot_encoded = (model_best_hp.predict(padded_train)> 0.5).astype("int32")
y_pred = np.array(tf.argmax(y_pred_one_hot_encoded, axis=1))
print(classification_report(train_data['emotion_label'], y_pred))

              precision    recall  f1-score   support

           0       0.91      0.96      0.93      2159
           1       0.90      0.90      0.90      1937
           2       0.97      0.97      0.97      5362
           3       0.94      0.87      0.91      1304
           4       0.98      0.97      0.98      4665
           5       0.91      0.93      0.92      1716

    accuracy                           0.95     17143
   macro avg       0.94      0.93      0.93     17143
weighted avg       0.95      0.95      0.95     17143



In [25]:
# Model Evaaluation on Validation Data
y_pred_one_hot_encoded = (model_best_hp.predict(padded_val)> 0.5).astype("int32")
y_pred = np.array(tf.argmax(y_pred_one_hot_encoded, axis=1))
print(classification_report(val_data['emotion_label'], y_pred))

              precision    recall  f1-score   support

           0       0.84      0.92      0.88       275
           1       0.82      0.80      0.81       212
           2       0.94      0.92      0.93       704
           3       0.86      0.75      0.80       178
           4       0.96      0.94      0.95       550
           5       0.69      0.88      0.77        81

    accuracy                           0.90      2000
   macro avg       0.85      0.87      0.86      2000
weighted avg       0.90      0.90      0.90      2000



In [26]:
# Model Evaaluation on Test Data
y_pred_one_hot_encoded = (model_best_hp.predict(padded_test)> 0.5).astype("int32")
y_pred = np.array(tf.argmax(y_pred_one_hot_encoded, axis=1))
print(classification_report(test_data['emotion_label'], y_pred))

              precision    recall  f1-score   support

           0       0.86      0.90      0.88       275
           1       0.84      0.85      0.85       224
           2       0.92      0.93      0.92       695
           3       0.77      0.70      0.73       159
           4       0.97      0.92      0.95       581
           5       0.60      0.76      0.67        66

    accuracy                           0.89      2000
   macro avg       0.83      0.84      0.83      2000
weighted avg       0.89      0.89      0.89      2000



### Training - Validation Loss

In [27]:
metric_to_plot = "loss"
epochs = list(range(1, max(history.epoch) + 2))
training_loss = history.history[metric_to_plot]
validation_loss = history.history["val_" + metric_to_plot]

trace1 = {
    "mode": "lines+markers",
    "name": "Training Loss",
    "type": "scatter",
    "x": epochs,
    "y": training_loss
}

trace2 = {
    "mode": "lines+markers",
    "name": "Validation Loss",
    "type": "scatter",
    "x": epochs,
    "y": validation_loss
}

data = Data([trace1, trace2])
layout = {
    "title": "Training - Validation Loss | Emotion Prediction",
    "xaxis": {
        "title": "Number of epochs",
        "titlefont": {
            "size": 18,
            "color": "#7f7f7f"
        }
    },
    "yaxis": {
        "title": "Loss",
        "titlefont": {
            "size": 18,
            "color": "#7f7f7f"
        }
    }
}
fig = Figure(data=data, layout=layout)
fig.update_layout(hovermode="x unified")
fig.show()

## Confusion Matrix

In [28]:
z = confusion_matrix(test_data['emotion_label'], y_pred)
z

array([[248,  11,  10,   0,   6,   0],
       [  4, 190,   1,   0,   2,  27],
       [  7,   4, 646,  30,   3,   5],
       [  7,   2,  34, 111,   5,   0],
       [ 21,   6,  13,   3, 537,   1],
       [  1,  12,   1,   0,   2,  50]])

In [29]:
z = z[::-1]
z

array([[  1,  12,   1,   0,   2,  50],
       [ 21,   6,  13,   3, 537,   1],
       [  7,   2,  34, 111,   5,   0],
       [  7,   4, 646,  30,   3,   5],
       [  4, 190,   1,   0,   2,  27],
       [248,  11,  10,   0,   6,   0]])

In [30]:
x = list(train_data.emotion.cat.categories)
z_text = [[str(y) for y in x] for x in z]

In [31]:
fig = ff.create_annotated_heatmap(z,
                                  x = x,
                                  y = x[::-1], # same labels
                                  annotation_text=z_text,
                                  colorscale='sunsetdark',
                                  showscale = True
                                 )

fig.update_layout(title_text = "Confusion Matrix (Test Data) | Emotion Prediction")
fig.show()
