# SetUp

Data Source: https://www.kaggle.com/praveengovi/emotions-dataset-for-nlp

## Imports

In [50]:
import sys
import numpy as np
import pandas as pd

# visualization
import matplotlib.pyplot as plt
import plotly.express as px
from plotly.graph_objs import *
import plotly.figure_factory as ff
import chart_studio
import chart_studio.plotly as py

# to avoid warnings 
import warnings
warnings.filterwarnings("ignore")

# text processing
import nltk
from nltk.tokenize import word_tokenize
from gensim.models import word2vec

# Keras imports
import tensorflow as tf
import tensorflow.keras as keras
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Dense, Embedding, GlobalAveragePooling1D, Dropout, LSTM
from tensorflow.keras.layers.experimental.preprocessing import TextVectorization

# Scikit Learn imports
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.manifold import TSNE

import config

## Notebook Configs

In [52]:
username='royn5618'
api_key=config.my_plotly_api_key
chart_studio.tools.set_credentials_file(username=username, api_key=api_key)

## Data Imports

I split the training data into training and validation sets, and use the test data to evaluate the model.

In [53]:
train_data = pd.read_csv('Data/train.txt', sep=';', names=['text', 'emotion'])
train_data.head()

| | text | emotion |
|---|---|---|
| 0 | i didnt feel humiliated | sadness |
| 1 | i can go from feeling so hopeless to so damned... | sadness |
| 2 | im grabbing a minute to post i feel greedy wrong | anger |
| 3 | i am ever feeling nostalgic about the fireplac... | love |
| 4 | i am feeling grouchy | anger |


In [54]:
test_data = pd.read_csv('Data/test.txt', sep=';', names=['text', 'emotion'])
test_data.head()

| | text | emotion |
|---|---|---|
| 0 | im feeling rather rotten so im not very ambiti... | sadness |
| 1 | im updating my blog because i feel shitty | sadness |
| 2 | i never make her separate from me because i do... | sadness |
| 3 | i left with my bouquet of red and yellow tulip... | joy |
| 4 | i was feeling a little vain when i did this one | sadness |


# Data Preparation

## Label Encoding

Convert each label into a corresponding integer.

In [55]:
train_data["emotion"] = train_data["emotion"].astype('category')
train_data["emotion_label"] = train_data["emotion"].cat.codes
train_data.head()

| | text | emotion | emotion_label |
|---|---|---|---|
| 0 | i didnt feel humiliated | sadness | 4 |
| 1 | i can go from feeling so hopeless to so damned... | sadness | 4 |
| 2 | im grabbing a minute to post i feel greedy wrong | anger | 0 |
| 3 | i am ever feeling nostalgic about the fireplac... | love | 3 |
| 4 | i am feeling grouchy | anger | 0 |


In [56]:
test_data["emotion"] = test_data["emotion"].astype('category')
test_data["emotion_label"] = test_data["emotion"].cat.codes
test_data.head()

| | text | emotion | emotion_label |
|---|---|---|---|
| 0 | im feeling rather rotten so im not very ambiti... | sadness | 4 |
| 1 | im updating my blog because i feel shitty | sadness | 4 |
| 2 | i never make her separate from me because i do... | sadness | 4 |
| 3 | i left with my bouquet of red and yellow tulip... | joy | 2 |
| 4 | i was feeling a little vain when i did this one | sadness | 4 |
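
A point worth keeping in mind about the two cells above: `cat.codes` numbers the categories in sorted order of whatever labels are present in that particular column, so the train and test codes line up only when both files contain the same set of emotions. In this notebook both splits do contain all six emotions (the classification reports further down show support for labels 0 to 5 in each), so the codes agree. The sketch below uses made-up toy labels, not the dataset, to show the mapping and one possible way to pin the categories explicitly; `toy_train`, `toy_test`, and `shared_categories` are illustrative names.

```python
import pandas as pd

# Toy labels for illustration only; not taken from the dataset.
toy_train = pd.Series(["sadness", "anger", "love", "anger"])
toy_test = pd.Series(["sadness", "love"])  # 'anger' never appears here

# Inferred categories are sorted, so the codes depend on which labels are present.
train_cat = toy_train.astype("category")
test_cat = toy_test.astype("category")
code_to_label_train = dict(enumerate(train_cat.cat.categories))
code_to_label_test = dict(enumerate(test_cat.cat.categories))
print(code_to_label_train)
print(code_to_label_test)

# Pinning the category list keeps the codes identical across both splits.
shared_categories = list(train_cat.cat.categories)
test_fixed = pd.Categorical(toy_test, categories=shared_categories)
print(test_fixed.codes.tolist())
```

With the pinned categories, 'sadness' keeps the code it has in the training split even though 'anger' is absent from the toy test labels.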


## One Hot Encoding and Train Test Data Prep

Convert the labels to one-hot encoded form, then use the text column as the feature and the one-hot encoded labels as the targets.

In [57]:
train_features, train_labels = train_data['text'], tf.one_hot(
    train_data["emotion_label"], 6)
test_features, test_labels = test_data['text'], tf.one_hot(
    test_data["emotion_label"], 6)

In [58]:
train_features[:5]

0                              i didnt feel humiliated
1    i can go from feeling so hopeless to so damned...
2     im grabbing a minute to post i feel greedy wrong
3    i am ever feeling nostalgic about the fireplac...
4                                 i am feeling grouchy
Name: text, dtype: object

In [59]:
train_labels[:5]

<tf.Tensor: shape=(5, 6), dtype=float32, numpy=
array([[0., 0., 0., 0., 1., 0.],
       [0., 0., 0., 0., 1., 0.],
       [1., 0., 0., 0., 0., 0.],
       [0., 0., 0., 1., 0., 0.],
       [1., 0., 0., 0., 0., 0.]], dtype=float32)>
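
To make the encoding concrete, here is a minimal NumPy sketch of what one-hot encoding does for the five labels shown above (4, 4, 0, 3, 0). It is an illustration of the idea rather than what `tf.one_hot` does internally; `labels` and `num_classes` are illustrative names.

```python
import numpy as np

labels = np.array([4, 4, 0, 3, 0])  # emotion_label of the first five training rows
num_classes = 6

# Row i of the identity matrix is the one-hot vector for class i.
one_hot = np.eye(num_classes, dtype=np.float32)[labels]
print(one_hot)

# argmax along the class axis recovers the integer labels.
recovered = one_hot.argmax(axis=1)
print(recovered)
```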

## Decoder

I will be using this to decode the one-hot encoded prediction(s) to the text labels.

In [60]:
def get_labels_from_oh_code(oh_code):
    """ Takes in one-hot encoded matrix
    Returns a list of decoded categories"""
    # The position of the largest value in each row is the integer label
    label_code = np.argmax(oh_code, axis=1)
    # Map the integer labels back to emotion names via the categorical index
    label = test_data.emotion.cat.categories[label_code]
    return list(label)

In [61]:
"Test Method"
test= np.array(train_labels[:5])
get_labels_from_oh_code(test)

['sadness', 'sadness', 'anger', 'love', 'anger']

# Generate Word Embeddings

## Select max_sequence_length

To train a Deep Learning model, the input length has to be standardized.
To do this, we need to get an overview of the length of the texts in terms of words in the dataset.

Why words? Because I encode each word as a number using the Keras tokenizer, the input length of the Keras model is determined by the length of the resulting list of numbers.

Example: 
```
I am happy -> [1, 2, 3]
I am sad -> [1, 2, 4]
I am not in the mood right now -> [1, 2, 5, 6, 7, 8, 9, 10]
```

The first two sentences are short and the last one is long. If I select 3 as the number of words, key information in the third sentence is lost. If I instead choose the length of the longest sentence (8 words), each of the first two sentences carries 5 redundant values, like so: ```1, 2, 3, 0, 0, 0, 0, 0```. This is called padding: shorter sentences are padded with zeroes to match the standardized sequence length. In the third sentence, the words 'right' and 'now' add little information, so keeping the length at 6 words is a good compromise here.

```
max_sequence_length = 6

I am happy -> [1, 2, 3, 0, 0, 0]
I am sad -> [1, 2, 4, 0, 0, 0]
I am not in the mood right now -> [1, 2, 5, 6, 7, 8]
```
Hence, the last sentence is truncated at the end.

Thus, it is important to select a suitable sequence length: it cuts unnecessary overhead on the model, speeds up training, and makes better use of the training infrastructure.
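
The pad-and-truncate behaviour described above can be sketched in plain Python. This is a simplified stand-in for what `pad_sequences` does with `padding='post'` and `truncating='post'`, not its actual implementation; `pad_or_truncate` is a hypothetical helper name.

```python
def pad_or_truncate(sequence, max_len, pad_value=0):
    """Cut the sequence at max_len, then right-pad it with pad_value."""
    trimmed = sequence[:max_len]
    return trimmed + [pad_value] * (max_len - len(trimmed))

examples = {
    "I am happy": [1, 2, 3],
    "I am sad": [1, 2, 4],
    "I am not in the mood right now": [1, 2, 5, 6, 7, 8, 9, 10],
}
for text, encoded in examples.items():
    print(text, "->", pad_or_truncate(encoded, max_len=6))
```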

In [62]:
tokenized_train_features = [word_tokenize(each_train_text) for each_train_text in train_features]
tokenized_test_features = [word_tokenize(each_test_text) for each_test_text in test_features]

In [63]:
list_len_text_by_words = [len(each_tokenized_text) for each_tokenized_text in tokenized_train_features]

In [64]:
# fig = px.histogram(
#     x=list_len_text_by_words,
#     template='plotly_white'
# )
# fig.update_layout(
#     title={
#         'text': "Histogram of Number of Words per Sample",
#         'x': 0.4,
#         'xanchor': 'center'
#     })
# fig.update_yaxes(title='Frequency').update_xaxes(
#     title='Number of Words')
# fig.update_layout(showlegend=False)
# fig.update_layout(hovermode='x')
# fig.show()

In [65]:
# py.plot(fig, filename="Histogram of Sentence Length by Number of Words", auto_open = True)

In [66]:
fig = px.box(
    x=list_len_text_by_words,
    template='plotly_white'
)
fig.update_layout(
    title={
        'text': "Boxplot of Number of Words per Sample",
        'x': 0.4,
        'xanchor': 'center'
    })
fig.update_xaxes(
    title='Number of Words')
fig.update_layout(showlegend=False)
fig.update_layout(hovermode='x')
fig.show()
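
Alongside the boxplot, one possible way to pick the sequence length is to look at percentiles of the word counts and check how many samples a candidate length would keep intact. The numbers below are hypothetical (in the notebook the input would be `list_len_text_by_words`), and the candidate length of 20 simply mirrors the `max_seq_len` used later; this is a heuristic sketch, not the procedure used to choose that value.

```python
import numpy as np

# Hypothetical word counts; in the notebook this would be list_len_text_by_words.
toy_lengths = np.array([3, 5, 8, 9, 12, 14, 15, 18, 19, 20, 22, 25, 31, 40, 63])

for q in (50, 75, 90, 95):
    print(f"{q}th percentile: {np.percentile(toy_lengths, q):.1f} words")

# Fraction of samples that would fit without truncation for a candidate length.
candidate_len = 20
coverage = float(np.mean(toy_lengths <= candidate_len))
print(f"max_seq_len={candidate_len} keeps {coverage:.0%} of samples intact")
```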

## Train Word2Vec Model

Word embeddings are dense vector representations of natural language text built on the idea that **similar words occur in similar contexts**. They give text a numeric representation that attempts to capture the contextual similarity among words that occur across multiple documents.

To explain this further: words are represented as vectors (i.e. numbers) such that words occurring in similar contexts lie close together in the vector space. For example, taking the keywords we have so far and assuming they occur across N documents, the vectors of 'happy' and 'joy' will be closer to each other than 'happy' is to 'devastated' or 'joy' is to 'annoying', whereas 'sadness' will be closer to 'devastated' than to 'happy' or 'joy'.

For the dimension of the vectors I have selected 300, which is a standard choice. The dimensionality of word vectors strongly influences the performance of a word embedding. Other traditional approaches tie the vector dimension to the vocabulary size, which demonstrates the 'curse of dimensionality': the vocabulary of a corpus can run to hundreds of thousands of words, which increases model complexity, training time, and infrastructure usage. Hence, it is common to use dimensionality reduction or to choose the dimension deliberately in text feature extraction tasks.
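
To illustrate what "close together in the vector space" means, here is a small sketch using cosine similarity, which is the measure gensim's `most_similar` reports. The three-dimensional vectors are hand-made for illustration and are not learned embeddings; `toy_vectors` and `cosine_similarity` are illustrative names.

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors; 1.0 means same direction."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Hand-made 3-dimensional vectors, purely illustrative (real ones have 300 dims).
toy_vectors = {
    "happy": np.array([0.9, 0.8, 0.1]),
    "joy": np.array([0.8, 0.9, 0.2]),
    "devastated": np.array([-0.7, -0.6, 0.9]),
}

sim_happy_joy = cosine_similarity(toy_vectors["happy"], toy_vectors["joy"])
sim_happy_devastated = cosine_similarity(toy_vectors["happy"], toy_vectors["devastated"])
print(sim_happy_joy, sim_happy_devastated)
```

In this toy setup 'happy' and 'joy' point in nearly the same direction, while 'happy' and 'devastated' point away from each other.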

In [67]:
vector_size = 300

In [68]:
w2v_model = word2vec.Word2Vec(
    tokenized_train_features,
    vector_size=vector_size,  # Dimensionality of the word vectors
    window=20,
    min_count=1,
    sg=1  # 1 for skip-gram; otherwise CBOW
)
w2v_model

<gensim.models.word2vec.Word2Vec at 0x24099df6790>

In [69]:
# Vocabulary
len(w2v_model.wv)

15210

In [70]:
# Similar Terms Check
w2v_model.wv.most_similar('sad', topn=10)

[('lonely', 0.9338263869285583),
 ('worried', 0.9044061303138733),
 ('bitchy', 0.8974629044532776),
 ('horribly', 0.8961544036865234),
 ('angry', 0.8959505558013916),
 ('unhappy', 0.8953084945678711),
 ('normally', 0.8941256403923035),
 ('insecure', 0.890570878982544),
 ('terrible', 0.8836125135421753),
 ('indecisive', 0.8831015229225159)]

In [71]:
# Configs for Embedding Layer
vocab_size = len(w2v_model.wv)
max_seq_len = 20

# Get the embedding matrix
vocab = w2v_model.wv.key_to_index.keys()
embedding_matrix = w2v_model.wv[vocab]

In [72]:
vocab_size, vector_size, max_seq_len

(15210, 300, 20)

In [73]:
vocab_list = list(w2v_model.wv.key_to_index.keys())

## Pre-process Test Data for Word2Vec Embeddings

In [74]:
def remove_OOV_vocab(sample: list, list_vocab):
    """ Takes in tokenized sample in the form of list 
    and the vocabulary list and removes tokens from sample that
    are not in the vocabulary list"""
    # Set membership is O(1), versus a linear scan of the list for every token
    vocab_set = set(list_vocab)
    return [each_token for each_token in sample if each_token in vocab_set]

In [75]:
tokenized_test_features = [remove_OOV_vocab(each_test_sample, vocab_list) for each_test_sample in tokenized_test_features]

## Text to Sequences using Gensim

This converts each tokenized text into the sequence of indices that the word2vec model assigns to its words.

For example, suppose the w2v_model has assigned the following -

```
Assume the following is a dict generated by word2vec model -
the words are keys and 
the numbers are indices then -

i -> 1 
am -> 2
happy -> 3
sad -> 4
not -> 5
in -> 6
the -> 7
mood -> 8

```

then the following code generates -

```

I am happy -> [1, 2, 3]
I am sad -> [1, 2, 4]
I am not in the mood -> [1, 2, 5, 6, 7, 8]

```

Note that the Keras tokenizer does this automatically, whereas in this case I have to do it myself.
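
The mapping in the example above can be reproduced with a plain dictionary lookup. This is a minimal sketch with a hypothetical `toy_key_to_index` dict standing in for `w2v_model.wv.key_to_index`; unknown tokens are dropped, which matches how the function in the next cell skips a `KeyError`.

```python
# Hypothetical key-to-index mapping, mirroring the example above.
toy_key_to_index = {"i": 1, "am": 2, "happy": 3, "sad": 4,
                    "not": 5, "in": 6, "the": 7, "mood": 8}

def tokens_to_indices(tokens, key_to_index):
    """Map each known token to its index and silently drop unknown tokens."""
    return [key_to_index[token] for token in tokens if token in key_to_index]

print(tokens_to_indices("i am happy".split(), toy_key_to_index))
print(tokens_to_indices("i am not in the mood".split(), toy_key_to_index))
print(tokens_to_indices("i am ecstatic".split(), toy_key_to_index))  # 'ecstatic' is unseen
```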

In [76]:
def w2v_indexed_token_sequences(w2v_model, list_features):
    indexed_features = []
    for each_seq in list_features:
        list_token_indices = []
        for each_token in each_seq:
            try:
                list_token_indices.append(w2v_model.wv.key_to_index[each_token])
            except KeyError as e:
                continue
        indexed_features.append(list_token_indices)
    return indexed_features

In [77]:
indexed_train_features = w2v_indexed_token_sequences(w2v_model, tokenized_train_features)

indexed_test_features = w2v_indexed_token_sequences(w2v_model, tokenized_test_features)

len(indexed_train_features), len(indexed_test_features)

(16000, 2000)

## Standardize Sequence Length as Selected Earlier

In [78]:
# pad / truncate and standardize the length as explained earlier when selecting the sequence length

padded_train = pad_sequences(indexed_train_features, padding = 'post', maxlen=max_seq_len, truncating='post')
padded_test = pad_sequences(indexed_test_features, padding = 'post', maxlen=max_seq_len, truncating='post')

In [79]:
# # Check all sequences have same length

# list_len_text_by_words = [len(each) for each in padded_train]

# fig = px.histogram(
#     x=list_len_text_by_words,
#     template='plotly_white'
# )
# fig.update_layout(
#     title={
#         'text': "Histogram of <b>Sentence Length</b> by Number of Words",
#         'x': 0.4,
#         'xanchor': 'center'
#     })
# fig.update_yaxes(title='Frequency').update_xaxes(
#     title='Length of Sentences by Words')
# fig.update_layout(showlegend=False)
# fig.update_layout(hovermode='x')
# fig.show()

# Multiclass Classifier

## With Skipgram Embeddings

### Model Training

In [80]:
def get_model():
    model = Sequential()
    model.add(
        Embedding(input_dim=vocab_size,
                  output_dim=vector_size,
                  weights=[embedding_matrix],
                  input_length=max_seq_len))
    model.add(Dropout(0.6))
    model.add(LSTM(max_seq_len,return_sequences=True))
    model.add(LSTM(6))
    model.add(Dense(6,activation='softmax'))
    return model

In [81]:
callbacks = [
    keras.callbacks.EarlyStopping(monitor="val_recall",
                                  mode='max',
                                  patience=2,
                                  verbose=1,
                                  restore_best_weights=True),
    keras.callbacks.ModelCheckpoint(filepath='models/lstm_with_w2v_val_recall.h5',
                                    verbose=1,
                                    save_best_only=True)
]

In [82]:
model = get_model()
model.summary()

Model: "sequential_4"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_4 (Embedding)      (None, 20, 300)           4563000   
_________________________________________________________________
dropout_4 (Dropout)          (None, 20, 300)           0         
_________________________________________________________________
lstm_8 (LSTM)                (None, 20, 20)            25680     
_________________________________________________________________
lstm_9 (LSTM)                (None, 6)                 648       
_________________________________________________________________
dense_4 (Dense)              (None, 6)                 42        
Total params: 4,589,370
Trainable params: 4,589,370
Non-trainable params: 0
_________________________________________________________________
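
The parameter counts in the summary above can be checked by hand. The sketch below uses the standard formulas for embedding, LSTM, and dense layers (with biases); the helper names are illustrative, and the sizes are the ones used in this notebook (vocabulary 15210, embedding size 300, LSTM units 20 and 6, six output classes).

```python
def embedding_params(vocab_size, embedding_dim):
    # One embedding_dim-sized vector per vocabulary entry.
    return vocab_size * embedding_dim

def lstm_params(input_dim, units):
    # Four gates, each with input weights, recurrent weights, and a bias.
    return 4 * ((input_dim + units + 1) * units)

def dense_params(input_dim, units):
    # Weight matrix plus one bias per unit.
    return input_dim * units + units

counts = [
    embedding_params(15210, 300),  # embedding layer
    lstm_params(300, 20),          # first LSTM: 300-dim input, 20 units
    lstm_params(20, 6),            # second LSTM: 20-dim input, 6 units
    dense_params(6, 6),            # softmax output layer
]
total = sum(counts)
print(counts, total)
```

The embedding layer accounts for almost all of the trainable parameters, which is why the two models in this notebook differ in size only through their vocabulary sizes.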


In [83]:
metrics = [
    'accuracy',  # tracked so history.history["accuracy"] exists for the accuracy plots
    keras.metrics.Precision(name='precision'),
    keras.metrics.Recall(name='recall')
]

model.compile(optimizer='adam',
              loss='categorical_crossentropy',
              metrics=metrics)

In [84]:
tf.config.run_functions_eagerly(True)
history = model.fit(padded_train, 
                    train_labels,
                    validation_split=0.33,
                    callbacks=callbacks,
                    epochs=10)

2022/04/15 13:08:43 INFO mlflow.utils.autologging_utils: Created MLflow autologging run with ID '49dba60b27d64eeabb0a78cd29d7ece9', which will track hyperparameters, performance metrics, model artifacts, and lineage information for the current keras workflow


Epoch 1/10

Epoch 00001: val_loss improved from inf to 1.46118, saving model to models\lstm_with_w2v_val_recall.h5
Epoch 2/10

Epoch 00002: val_loss improved from 1.46118 to 1.24022, saving model to models\lstm_with_w2v_val_recall.h5
Epoch 3/10

Epoch 00003: val_loss improved from 1.24022 to 1.00919, saving model to models\lstm_with_w2v_val_recall.h5
Epoch 4/10

Epoch 00004: val_loss improved from 1.00919 to 0.87277, saving model to models\lstm_with_w2v_val_recall.h5
Epoch 5/10

Epoch 00005: val_loss improved from 0.87277 to 0.73966, saving model to models\lstm_with_w2v_val_recall.h5
Epoch 6/10

Epoch 00006: val_loss improved from 0.73966 to 0.62609, saving model to models\lstm_with_w2v_val_recall.h5
Epoch 7/10

Epoch 00007: val_loss improved from 0.62609 to 0.62317, saving model to models\lstm_with_w2v_val_recall.h5
Epoch 8/10

Epoch 00008: val_loss did not improve from 0.62317
Epoch 9/10
Restoring model weights from the end of the best epoch.

Epoch 00009: val_loss did not improve fr



INFO:tensorflow:Assets written to: C:\Users\nroy0\AppData\Local\Temp\tmpz2ghahf8\model\data\model\assets




### Training - Validation Loss

In [35]:
metric_to_plot = "loss"
epochs = list(range(1, max(history.epoch) + 2))
training_loss = history.history[metric_to_plot]
validation_loss = history.history["val_" + metric_to_plot]

trace1 = {
    "mode": "lines+markers",
    "name": "Training Loss",
    "type": "scatter",
    "x": epochs,
    "y": training_loss
}

trace2 = {
    "mode": "lines+markers",
    "name": "Validation Loss",
    "type": "scatter",
    "x": epochs,
    "y": validation_loss
}

data = Data([trace1, trace2])
layout = {
    "title": "Training - Validation Loss with Word2Vec Embeddings",
    "xaxis": {
        "title": "Number of epochs",
        "titlefont": {
            "size": 18,
            "color": "#7f7f7f"
        }
    },
    "yaxis": {
        "title": "Loss",
        "titlefont": {
            "size": 18,
            "color": "#7f7f7f"
        }
    }
}
fig = Figure(data=data, layout=layout)
fig.update_layout(hovermode="x unified")
fig.show()

In [36]:
py.plot(fig, filename="Train-Val Loss - Using Word2Vec Embeddings | Monitor Val_Recall", auto_open = True)

'https://plotly.com/~royn5618/19/'

### Training - Validation Accuracy

In [37]:
metric_to_plot = "accuracy"
epochs = list(range(1, max(history.epoch) + 2))
training_loss = history.history[metric_to_plot]
validation_loss = history.history["val_" + metric_to_plot]

trace1 = {
    "mode": "lines+markers",
    "name": "Training Accuracy",
    "type": "scatter",
    "x": epochs,
    "y": training_loss
}

trace2 = {
    "mode": "lines+markers",
    "name": "Validation Accuracy",
    "type": "scatter",
    "x": epochs,
    "y": validation_loss
}

data = Data([trace1, trace2])
layout = {
    "title": "Training - Validation Accuracy with Word2Vec Embeddings",
    "xaxis": {
        "title": "Number of epochs",
        "titlefont": {
            "size": 18,
            "color": "#7f7f7f"
        }
    },
    "yaxis": {
        "title": "Accuracy",
        "titlefont": {
            "size": 18,
            "color": "#7f7f7f"
        }
    }
}
fig = Figure(data=data, layout=layout)
fig.update_layout(hovermode="x unified")
fig.show()

In [38]:
py.plot(fig, filename="Train-Val Accuracy - Using Word2Vec Embeddings | Monitor Val_Recall", auto_open = True)

'https://plotly.com/~royn5618/21/'

### Classification Report

In [39]:
model_with_w2v = keras.models.load_model('models/lstm_with_w2v_val_recall.h5')

In [40]:
# Model Evaluation on Training Data

y_pred_one_hot_encoded = (model_with_w2v.predict(padded_train)> 0.5).astype("int32")
y_pred_one_hot_encoded

array([[0, 0, 0, 0, 1, 0],
       [0, 0, 0, 0, 1, 0],
       [1, 0, 0, 0, 0, 0],
       ...,
       [0, 0, 1, 0, 0, 0],
       [1, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 1, 0]])
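
One property of this decoding step is worth illustrating: thresholding at 0.5 before taking `argmax` turns any row in which no class reaches 0.5 into all zeros, and `argmax` of an all-zero row is 0, so such samples are counted as label 0 (`anger`). The probabilities below are made up to show the effect, with `argmax` taken directly on the probabilities for comparison. How often this happens for the models here depends on how frequently the top probability falls below 0.5, which the notebook does not show.

```python
import numpy as np

# Made-up softmax outputs for three samples over six classes.
toy_probs = np.array([
    [0.05, 0.05, 0.70, 0.10, 0.05, 0.05],  # confident: class 2
    [0.10, 0.10, 0.10, 0.10, 0.20, 0.40],  # top class 5, but below 0.5
    [0.30, 0.05, 0.05, 0.05, 0.45, 0.10],  # top class 4, but below 0.5
])

thresholded = (toy_probs > 0.5).astype("int32")
via_threshold = thresholded.argmax(axis=1)  # all-zero rows fall back to index 0
direct = toy_probs.argmax(axis=1)           # picks the most probable class
print(via_threshold, direct)
```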

In [41]:
y_pred = np.array(tf.argmax(y_pred_one_hot_encoded, axis=1))
print(classification_report(train_data['emotion_label'], y_pred))

              precision    recall  f1-score   support

           0       0.59      0.92      0.72      2159
           1       0.95      0.80      0.87      1937
           2       0.95      0.92      0.93      5362
           3       0.89      0.78      0.83      1304
           4       0.94      0.93      0.94      4666
           5       1.00      0.04      0.07       572

    accuracy                           0.87     16000
   macro avg       0.89      0.73      0.73     16000
weighted avg       0.89      0.87      0.86     16000



In [42]:
# Model Evaluation on Test Data

y_pred_one_hot_encoded = (model_with_w2v.predict(padded_test)> 0.5).astype("int32")
y_pred = np.array(tf.argmax(y_pred_one_hot_encoded, axis=1))
print(classification_report(test_data['emotion_label'], y_pred))

              precision    recall  f1-score   support

           0       0.49      0.86      0.62       275
           1       0.91      0.69      0.78       224
           2       0.89      0.85      0.87       695
           3       0.80      0.62      0.70       159
           4       0.89      0.86      0.88       581
           5       1.00      0.02      0.03        66

    accuracy                           0.79      2000
   macro avg       0.83      0.65      0.65      2000
weighted avg       0.83      0.79      0.79      2000



### Confusion Matrix

In [43]:
z = confusion_matrix(test_data['emotion_label'], y_pred)

In [44]:
z

array([[237,   4,  13,   2,  19,   0],
       [ 45, 154,   6,   1,  18,   0],
       [ 71,   1, 588,  21,  14,   0],
       [ 28,   0,  24,  99,   8,   0],
       [ 54,   8,  20,   0, 499,   0],
       [ 52,   2,  10,   0,   1,   1]], dtype=int64)
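
The per-class figures in the classification report can be recovered from this matrix, which is a useful check on how to read it: rows are true labels and columns are predicted labels. The sketch below copies the matrix from the output above; the variable names are illustrative.

```python
import numpy as np

# Confusion matrix copied from the output above (rows: true, columns: predicted).
cm = np.array([
    [237,   4,  13,   2,  19, 0],
    [ 45, 154,   6,   1,  18, 0],
    [ 71,   1, 588,  21,  14, 0],
    [ 28,   0,  24,  99,   8, 0],
    [ 54,   8,  20,   0, 499, 0],
    [ 52,   2,  10,   0,   1, 1],
])

true_positives = np.diag(cm)
recall = true_positives / cm.sum(axis=1)     # divide by row totals (support)
precision = true_positives / cm.sum(axis=0)  # divide by column totals
accuracy = true_positives.sum() / cm.sum()
print(np.round(recall, 2), np.round(precision, 2), round(float(accuracy), 2))
```

The first column shows that many samples from every class are predicted as label 0, which is what drives its low precision, while label 5 is predicted only once.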

In [45]:
z = z[::-1] # invert z idx values
z

array([[ 52,   2,  10,   0,   1,   1],
       [ 54,   8,  20,   0, 499,   0],
       [ 28,   0,  24,  99,   8,   0],
       [ 71,   1, 588,  21,  14,   0],
       [ 45, 154,   6,   1,  18,   0],
       [237,   4,  13,   2,  19,   0]], dtype=int64)

In [46]:
x = list(train_data.emotion.cat.categories)
# Use distinct loop names so the label list x is not shadowed inside the comprehension
z_text = [[str(value) for value in row] for row in z]

In [47]:
fig = ff.create_annotated_heatmap(z, 
                                  x = x,
                                  y = x[::-1], # same labels
                                  annotation_text=z_text, 
                                  colorscale='sunsetdark',
                                  showscale = True
                                 )

fig.update_layout(title_text = "Confusion Matrix (Test Data) - Using Word2vec Embeddings | Monitor Val_Recall")
fig.show()

In [48]:
py.plot(fig, filename="Confusion Matrix (Test Data) - Using Word2vec Embeddings", auto_open = True)

'https://plotly.com/~royn5618/41/'

## Without Skipgram embeddings

### Text Pre-processing

I have used the Keras tokenizer to convert tokens into integers, using ```<OOV>``` to represent Out of Vocabulary (OoV) terms.
This is a significant advantage over the word2vec model, since it is a flexible way of handling unseen vocabulary in the test data.

Using pad_sequences, I have again padded/truncated the sequences to make their lengths uniform.
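
The idea behind the OOV token can be sketched in plain Python. This is a simplified imitation, not the Keras `Tokenizer` implementation: it ignores lowercasing, punctuation filters, and `num_words`, and the function names are hypothetical. It follows the Keras convention that index 1 goes to the OOV token and that index 0 is left free for padding.

```python
from collections import Counter

def fit_word_index(texts, oov_token="<OOV>"):
    """Index 1 is reserved for the OOV token; other words follow by frequency."""
    counts = Counter(word for text in texts for word in text.split())
    word_index = {oov_token: 1}
    for word, _ in counts.most_common():
        word_index[word] = len(word_index) + 1
    return word_index

def texts_to_sequences(texts, word_index, oov_token="<OOV>"):
    """Replace every word by its index, falling back to the OOV index."""
    oov_id = word_index[oov_token]
    return [[word_index.get(word, oov_id) for word in text.split()] for text in texts]

toy_train_texts = ["i am happy", "i am sad"]
toy_word_index = fit_word_index(toy_train_texts)
print(toy_word_index)
print(texts_to_sequences(["i am ecstatic"], toy_word_index))
```

Unlike the word2vec pipeline above, which drops unseen tokens, the unseen word keeps its position in the sequence and is represented by the shared OOV index.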

In [49]:
vocab_size = 15000

tokenizer = Tokenizer(oov_token = "<OOV>", num_words=vocab_size)
tokenizer.fit_on_texts(train_data['text'])

sequences_train = tokenizer.texts_to_sequences(train_data['text'])
sequences_test = tokenizer.texts_to_sequences(test_data['text'])

padded_train = pad_sequences(sequences_train, padding = 'post', maxlen=max_seq_len)
padded_test = pad_sequences(sequences_test, padding = 'post', maxlen=max_seq_len)

### Model Training

In [50]:
def get_no_w2v_model():
    model = Sequential()
    model.add(
        Embedding(input_dim=vocab_size,
                  output_dim=vector_size, # keeping it same as the word2vec embedding size
                  input_length=max_seq_len))
    model.add(Dropout(0.6))
    model.add(LSTM(max_seq_len,return_sequences=True))
    model.add(LSTM(6))
    model.add(Dense(6,activation='softmax'))
    return model

In [51]:
callbacks = [
    keras.callbacks.EarlyStopping(monitor="val_recall",
                                  mode='max',
                                  patience=2,
                                  verbose=1,
                                  restore_best_weights=True),
    keras.callbacks.ModelCheckpoint(filepath='models/lstm_with_no_w2v_val_recall.h5',
                                    verbose=1,
                                    save_best_only=True)
]

In [52]:
no_w2v_model = get_no_w2v_model()
no_w2v_model.summary()

Model: "sequential_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_1 (Embedding)      (None, 20, 300)           4500000   
_________________________________________________________________
dropout_1 (Dropout)          (None, 20, 300)           0         
_________________________________________________________________
lstm_2 (LSTM)                (None, 20, 20)            25680     
_________________________________________________________________
lstm_3 (LSTM)                (None, 6)                 648       
_________________________________________________________________
dense_1 (Dense)              (None, 6)                 42        
Total params: 4,526,370
Trainable params: 4,526,370
Non-trainable params: 0
_________________________________________________________________


In [53]:
no_w2v_model.compile(optimizer='adam',loss='categorical_crossentropy',metrics=metrics) # metrics was defined already

In [54]:
tf.config.run_functions_eagerly(True)
history = no_w2v_model.fit(padded_train,
                           train_labels,
                           validation_split=0.33,
                           callbacks=callbacks,
                           epochs=10)

Epoch 1/10

Epoch 00001: val_loss improved from inf to 1.28690, saving model to models\lstm_with_no_w2v.h5
Epoch 2/10

Epoch 00002: val_loss improved from 1.28690 to 0.92101, saving model to models\lstm_with_no_w2v.h5
Epoch 3/10

Epoch 00003: val_loss improved from 0.92101 to 0.80621, saving model to models\lstm_with_no_w2v.h5
Epoch 4/10

Epoch 00004: val_loss improved from 0.80621 to 0.76537, saving model to models\lstm_with_no_w2v.h5
Epoch 5/10

Epoch 00005: val_loss improved from 0.76537 to 0.75237, saving model to models\lstm_with_no_w2v.h5
Epoch 6/10

Epoch 00006: val_loss improved from 0.75237 to 0.73796, saving model to models\lstm_with_no_w2v.h5
Epoch 7/10

Epoch 00007: val_loss improved from 0.73796 to 0.72449, saving model to models\lstm_with_no_w2v.h5
Epoch 8/10

Epoch 00008: val_loss improved from 0.72449 to 0.71084, saving model to models\lstm_with_no_w2v.h5
Epoch 9/10

Epoch 00009: val_loss did not improve from 0.71084
Epoch 10/10
Restoring model weights from the end of t

### Training - Validation Loss

In [55]:
metric_to_plot = "loss"
epochs = list(range(1, max(history.epoch) + 2))
training_loss = history.history[metric_to_plot]
validation_loss = history.history["val_" + metric_to_plot]

trace1 = {
    "mode": "lines+markers",
    "name": "Training Loss",
    "type": "scatter",
    "x": epochs,
    "y": training_loss
}

trace2 = {
    "mode": "lines+markers",
    "name": "Validation Loss",
    "type": "scatter",
    "x": epochs,
    "y": validation_loss
}

data = Data([trace1, trace2])
layout = {
    "title": "Training - Validation Loss without Word2Vec Embeddings",
    "xaxis": {
        "title": "Number of epochs",
        "titlefont": {
            "size": 18,
            "color": "#7f7f7f"
        }
    },
    "yaxis": {
        "title": "Loss",
        "titlefont": {
            "size": 18,
            "color": "#7f7f7f"
        }
    }
}
fig = Figure(data=data, layout=layout)
fig.update_layout(hovermode="x unified")
fig.show()

In [56]:
py.plot(fig, filename="Train-Val Loss - Without Word2Vec Embeddings", auto_open = True)

'https://plotly.com/~royn5618/23/'

### Training - Validation Accuracy

In [57]:
metric_to_plot = "accuracy"
epochs = list(range(1, max(history.epoch) + 2))
training_loss = history.history[metric_to_plot]
validation_loss = history.history["val_" + metric_to_plot]

trace1 = {
    "mode": "lines+markers",
    "name": "Training Accuracy",
    "type": "scatter",
    "x": epochs,
    "y": training_loss
}

trace2 = {
    "mode": "lines+markers",
    "name": "Validation Accuracy",
    "type": "scatter",
    "x": epochs,
    "y": validation_loss
}

data = Data([trace1, trace2])
layout = {
    "title": "Training - Validation Accuracy without Word2Vec Embeddings",
    "xaxis": {
        "title": "Number of epochs",
        "titlefont": {
            "size": 18,
            "color": "#7f7f7f"
        }
    },
    "yaxis": {
        "title": "Accuracy",
        "titlefont": {
            "size": 18,
            "color": "#7f7f7f"
        }
    }
}
fig = Figure(data=data, layout=layout)
fig.update_layout(hovermode="x unified")
fig.show()

In [58]:
py.plot(fig, filename="Train-Val Accuracy - Without Word2Vec Embeddings", auto_open = True)

'https://plotly.com/~royn5618/25/'

### Classification Report

In [59]:
model_with_no_w2v = keras.models.load_model('models/lstm_with_no_w2v_val_recall.h5')

In [60]:
# Model Evaluation on Train Data

y_pred_one_hot_encoded = (model_with_no_w2v.predict(padded_train)> 0.5).astype("int32")
y_pred_one_hot_encoded

array([[0, 0, 0, 0, 1, 0],
       [0, 0, 0, 0, 1, 0],
       [1, 0, 0, 0, 0, 0],
       ...,
       [0, 0, 1, 0, 0, 0],
       [1, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 1, 0]])

In [61]:
y_pred = np.array(tf.argmax(y_pred_one_hot_encoded, axis=1))
print(classification_report(train_data['emotion_label'], y_pred))

              precision    recall  f1-score   support

           0       0.81      0.90      0.85      2159
           1       0.94      0.79      0.86      1937
           2       0.94      0.95      0.94      5362
           3       0.94      0.85      0.89      1304
           4       0.93      0.96      0.94      4666
           5       0.83      0.85      0.84       572

    accuracy                           0.91     16000
   macro avg       0.90      0.88      0.89     16000
weighted avg       0.91      0.91      0.91     16000



In [62]:
# Model Evaluation on Test Data

y_pred_one_hot_encoded = (model_with_no_w2v.predict(padded_test)> 0.5).astype("int32")
y_pred = np.array(tf.argmax(y_pred_one_hot_encoded, axis=1))
print(classification_report(test_data['emotion_label'], y_pred))

              precision    recall  f1-score   support

           0       0.70      0.80      0.75       275
           1       0.86      0.68      0.76       224
           2       0.84      0.87      0.85       695
           3       0.81      0.61      0.70       159
           4       0.83      0.87      0.85       581
           5       0.68      0.64      0.66        66

    accuracy                           0.81      2000
   macro avg       0.79      0.74      0.76      2000
weighted avg       0.81      0.81      0.81      2000



### Confusion Matrix

In [90]:
z = confusion_matrix(test_data['emotion_label'], y_pred)
z

array([[219,   9,  16,   0,  31,   0],
       [ 28, 152,  10,   3,  16,  15],
       [ 22,   6, 603,  15,  45,   4],
       [  8,   3,  41,  97,  10,   0],
       [ 27,   3,  42,   4, 504,   1],
       [  7,   3,  10,   1,   3,  42]], dtype=int64)

In [91]:
z = z[::-1]
z

array([[  7,   3,  10,   1,   3,  42],
       [ 27,   3,  42,   4, 504,   1],
       [  8,   3,  41,  97,  10,   0],
       [ 22,   6, 603,  15,  45,   4],
       [ 28, 152,  10,   3,  16,  15],
       [219,   9,  16,   0,  31,   0]], dtype=int64)

In [92]:
x = list(train_data.emotion.cat.categories)
# Use distinct loop names so the label list x is not shadowed inside the comprehension
z_text = [[str(value) for value in row] for row in z]

In [93]:
fig = ff.create_annotated_heatmap(z, 
                                  x = x,
                                  y = x[::-1], # same labels
                                  annotation_text=z_text, 
                                  colorscale='sunsetdark',
                                  showscale = True
                                 )

fig.update_layout(title_text = "Confusion Matrix (Test Data) - Without Word2vec Embedding")
fig.show()

In [94]:
py.plot(fig, filename="Confusion Matrix (Test Data) - Without Word2Vec Embeddings", auto_open = True)

'https://plotly.com/~royn5618/51/'

# Embeddings Evaluation

This section compares the learnt embeddings. I saved the Keras embedding layer weights from both approaches and tried to assess the similarity scores.
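
One possible way to assess similarity from a saved weight matrix is to rank words by the cosine similarity of their rows. The sketch below assumes a weight matrix with one row per word and a word-to-row mapping; in the notebook these would plausibly come from `model.layers[0].get_weights()[0]` together with `w2v_model.wv.key_to_index` or `tokenizer.word_index`. The toy matrix and the helper `most_similar_rows` are made up for illustration.

```python
import numpy as np

def most_similar_rows(weights, word_to_row, query, topn=2):
    """Rank words by cosine similarity of their rows to the query word's row."""
    normed = weights / np.linalg.norm(weights, axis=1, keepdims=True)
    scores = normed @ normed[word_to_row[query]]
    ranked = sorted(word_to_row, key=lambda w: scores[word_to_row[w]], reverse=True)
    return [(w, float(scores[word_to_row[w]])) for w in ranked if w != query][:topn]

# Tiny made-up weight matrix: one row per word, four embedding dimensions.
toy_word_to_row = {"happy": 0, "joy": 1, "sad": 2, "lonely": 3}
toy_weights = np.array([
    [0.9, 0.8, 0.1, 0.0],
    [0.8, 0.9, 0.0, 0.1],
    [-0.7, -0.8, 0.2, 0.1],
    [-0.6, -0.9, 0.1, 0.2],
])
neighbours = most_similar_rows(toy_weights, toy_word_to_row, "happy")
print(neighbours)
```

Running the same ranking on both models' weight matrices for a word such as 'happy' would give two neighbour lists that can be compared side by side.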

## Embedding Layer Weights

Access the learnt weights of the model's layers.

### For the model which was initialized with Word2Vec embeddings

In [67]:
model_with_w2v.layers[0].get_weights()

[array([[ 0.03973487,  0.19689943,  0.03704248, ..., -0.01101642,
          0.17275847, -0.2392132 ],
        [ 0.04456045,  0.27536082, -0.136009  , ..., -0.1221706 ,
          0.06944917, -0.38726106],
        [-0.01038808,  0.04619833, -0.07106846, ..., -0.00913581,
          0.16384092, -0.18079036],
        ...,
        [-0.01566491,  0.09922612, -0.00568743, ...,  0.00604014,
          0.02581415, -0.03263379],
        [-0.03503877,  0.03242027,  0.02721957, ...,  0.00440053,
          0.07493401, -0.11078279],
        [ 0.00125698,  0.06520706, -0.02576341, ..., -0.00873958,
          0.03677673, -0.01705454]], dtype=float32)]

In [68]:
model_with_w2v.layers[0].get_weights()[0].shape

(15210, 300)

Recall that the vocab size was 15210 and that we asked the model to learn embeddings of size 300.

### For the model which was not initialized with Word2Vec embeddings

In [69]:
model_with_no_w2v.layers[0].get_weights()

[array([[-0.06949931, -0.00954619,  0.05791519, ..., -0.01286481,
          0.00376441,  0.00770771],
        [ 0.01208823,  0.04595273, -0.03059881, ...,  0.01962453,
         -0.03966961, -0.00019684],
        [-0.00266718,  0.04909729, -0.00192641, ..., -0.01576019,
         -0.02966305, -0.0380706 ],
        ...,
        [ 0.0219018 ,  0.0383893 , -0.01931952, ...,  0.04786373,
          0.03288705,  0.02512505],
        [-0.00446367,  0.02723484,  0.03732138, ..., -0.02051033,
         -0.00519607,  0.02905004],
        [ 0.00051548, -0.03814612, -0.0290032 , ...,  0.04325107,
         -0.00672792,  0.04048965]], dtype=float32)]

In [70]:
model_with_no_w2v.layers[0].get_weights()[0].shape

(15000, 300)

Recall that we set the embedding layer's vocab size (its input size) to 15000 and its output size to 300, which is the embedding size.

In [71]:
tokenizer.word_index['happy']

154

In [72]:
vocab_list_keras = np.array(list(tokenizer.word_index.keys())[:vocab_size])
len(vocab_list_keras)

15000

Note that we simply took the first 15000 entries: the tokenizer orders words by their frequency of occurrence, so selecting the top N words, where N = vocab_size, gives the N most frequent words.

In [73]:
vocab_list_keras

array(['<OOV>', 'i', 'feel', ..., 'weaving', 'shapes', 'muster'],
      dtype='<U74')
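One detail worth double-checking before pairing words with weight rows: the Keras `Tokenizer` numbers words from 1 and leaves index 0 for padding, so row `i` of the embedding matrix belongs to the word whose index is `i`, not to the `i`-th key of `word_index`. The sketch below illustrates an index-aligned word list on toy data. It assumes the model's input sequences came from this same tokenizer; the toy `word_index`, the toy `weights` and the `<PAD>` placeholder name are all hypothetical.

```python
import numpy as np

# Toy stand-in for tokenizer.word_index (hypothetical values); indices start at 1.
word_index = {"<OOV>": 1, "i": 2, "feel": 3, "happy": 4}
vocab_size = 4  # the embedding has rows 0..3, so only indices 1..3 fit

# Toy stand-in for the embedding weights: one 2-dimensional vector per row.
weights = np.arange(vocab_size * 2, dtype=np.float32).reshape(vocab_size, 2)

# Invert the mapping, keeping only indices that have a row in the matrix.
index_to_word = {idx: word for word, idx in word_index.items() if idx < vocab_size}

# Row 0 is the padding slot; "<PAD>" is a made-up label for it.
aligned_words = np.array(["<PAD>"] + [index_to_word[i] for i in range(1, vocab_size)])

# Same np.c_ layout as the notebook: word first, then its vector.
aligned = np.c_[aligned_words, weights]
print(aligned)
```

With this layout, `aligned[word_index[w]]` holds the vector of word `w` for every word that fits in the vocabulary.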

#### Cast to Word2Vec Format to obtain similarities

Organize the array in the following format:

```
word0 vec1_0 vec2_0 ... vec300_0
word1 vec1_1 vec2_1 ... vec300_1
word2 vec1_2 vec2_2 ... vec300_2
...
wordN vec1_N vec2_N ... vec300_N
```

In [74]:
formatted_array = np.c_[vocab_list_keras, model_with_no_w2v.layers[0].get_weights()[0]]
formatted_array

array([['<OOV>', '-0.06949931', '-0.009546187', ..., '-0.012864809',
        '0.0037644117', '0.007707712'],
       ['i', '0.012088228', '0.045952726', ..., '0.019624535',
        '-0.03966961', '-0.00019683689'],
       ['feel', '-0.002667178', '0.049097292', ..., '-0.015760185',
        '-0.029663052', '-0.038070597'],
       ...,
       ['weaving', '0.021901798', '0.0383893', ..., '0.047863733',
        '0.032887045', '0.025125053'],
       ['shapes', '-0.0044636726', '0.027234841', ..., '-0.020510329',
        '-0.005196072', '0.029050041'],
       ['muster', '0.0005154833', '-0.038146116', ..., '0.04325107',
        '-0.0067279227', '0.040489648']], dtype='<U74')

In [75]:
np.savetxt("keras_embeddings.txt", formatted_array, delimiter=" ", fmt='%s')

**GloVe and Word2Vec File Formats**

The two file formats differ slightly: GloVe files do not start with the total array size, whereas Word2Vec files carry it in the first line, as shown below:

```
num_words num_dim
word0 vec1_0 vec2_0 ... vec300_0
word1 vec1_1 vec2_1 ... vec300_1
word2 vec1_2 vec2_2 ... vec300_2
...
wordN vec1_N vec2_N ... vec300_N
```

What we have is the GloVe format. 

*Gensim allows you to change the format as shown in the code below:*


In [76]:
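from gensim.models import KeyedVectors
from gensim.scripts.glove2word2vec import glove2word2vec

# Header-less (GloVe-style) file written above, and the Word2Vec-style file to create
keras_embeddings_glove_format = 'keras_embeddings.txt'
keras_embeddings_w2v_format = "keras_embeddings_word2vec.txt"

# Prepend the "num_words num_dim" header, then load the result as KeyedVectors.
# Note: gensim 4.x marks glove2word2vec as deprecated; as far as I know,
# load_word2vec_format(..., no_header=True) reads the header-less file directly.
glove2word2vec(keras_embeddings_glove_format, keras_embeddings_w2v_format) 
keras_without_w2v = KeyedVectors.load_word2vec_format(keras_embeddings_w2v_format)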
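For intuition, the conversion only has to count the rows and dimensions and write them as a first line. The sketch below shows that idea with the standard library only. It is not gensim's implementation; `add_word2vec_header` and the toy file names are hypothetical.

```python
import os
import tempfile

def add_word2vec_header(glove_path, w2v_path):
    """Copy a header-less embedding file and prepend 'num_words num_dim'."""
    with open(glove_path, encoding="utf-8") as src:
        lines = [line.rstrip("\n") for line in src if line.strip()]
    num_words = len(lines)
    num_dim = len(lines[0].split(" ")) - 1  # first token on each line is the word
    with open(w2v_path, "w", encoding="utf-8") as dst:
        dst.write(f"{num_words} {num_dim}\n")
        dst.write("\n".join(lines) + "\n")
    return num_words, num_dim

# Toy demonstration on a two-word, three-dimensional file.
tmp_dir = tempfile.mkdtemp()
glove_file = os.path.join(tmp_dir, "toy_glove.txt")
w2v_file = os.path.join(tmp_dir, "toy_w2v.txt")
with open(glove_file, "w", encoding="utf-8") as f:
    f.write("happy 0.1 0.2 0.3\nsad -0.1 -0.2 -0.3\n")

header = add_word2vec_header(glove_file, w2v_file)
with open(w2v_file, encoding="utf-8") as f:
    first_line = f.readline().strip()
print(header, first_line)
```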

In [77]:
keras_without_w2v.most_similar("happy")

[('lover', 0.46577760577201843),
 ('flow', 0.45874735713005066),
 ('playing', 0.45785024762153625),
 ('everyday', 0.4549572765827179),
 ('image', 0.454139769077301),
 ('pee', 0.45041099190711975),
 ('core', 0.446612149477005),
 ('pleased', 0.4422570466995239),
 ('instantly', 0.44054874777793884),
 ('un', 0.4401867389678955)]
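The scores returned by `most_similar` are cosine similarities between the query word's vector and every other vector. The sketch below reproduces that ranking on a toy vocabulary with NumPy. It is a simplified illustration, not gensim's code; the words, the vectors and `most_similar_cosine` are made up for the example.

```python
import numpy as np

def most_similar_cosine(word, words, vectors, topn=3):
    """Rank all other words by cosine similarity to `word`."""
    # Normalise each row to unit length so a dot product equals the cosine.
    unit = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    query = unit[words.index(word)]
    scores = unit @ query
    # Sort by descending score and drop the query word itself.
    order = [i for i in np.argsort(-scores) if words[i] != word][:topn]
    return [(words[i], float(scores[i])) for i in order]

# Hypothetical 2-dimensional embeddings.
toy_words = ["happy", "glad", "sad", "table"]
toy_vectors = np.array([[1.0, 0.0],
                        [0.9, 0.1],
                        [-1.0, 0.0],
                        [0.0, 1.0]])

neighbours = most_similar_cosine("happy", toy_words, toy_vectors, topn=2)
print(neighbours)
```

Because only the angle between vectors matters, a score near 0.45, as in the output above, indicates vectors that point in only loosely similar directions.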

#### Cast the Keras embeddings that were initialized with Word2Vec

This gives us the Word2Vec-initialized weights after they were updated while training the classifier.

In [78]:
model_with_w2v.layers[0].get_weights()

[array([[ 0.03973487,  0.19689943,  0.03704248, ..., -0.01101642,
          0.17275847, -0.2392132 ],
        [ 0.04456045,  0.27536082, -0.136009  , ..., -0.1221706 ,
          0.06944917, -0.38726106],
        [-0.01038808,  0.04619833, -0.07106846, ..., -0.00913581,
          0.16384092, -0.18079036],
        ...,
        [-0.01566491,  0.09922612, -0.00568743, ...,  0.00604014,
          0.02581415, -0.03263379],
        [-0.03503877,  0.03242027,  0.02721957, ...,  0.00440053,
          0.07493401, -0.11078279],
        [ 0.00125698,  0.06520706, -0.02576341, ..., -0.00873958,
          0.03677673, -0.01705454]], dtype=float32)]

In [79]:
model_with_w2v.layers[0].get_weights()[0].shape

(15210, 300)

In [80]:
vocab_list_w2v = np.array(list(w2v_model.wv.key_to_index.keys()))

Here the word list comes from the Word2Vec model's vocabulary, so it must contain 15210 entries to line up with the (15210, 300) weight matrix shown above.

In [81]:
formatted_array2 = np.c_[vocab_list_w2v, model_with_w2v.layers[0].get_weights()[0]]
formatted_array2

array([['i', '0.039734874', '0.19689943', ..., '-0.011016421',
        '0.17275847', '-0.2392132'],
       ['feel', '0.044560447', '0.27536082', ..., '-0.1221706',
        '0.06944917', '-0.38726106'],
       ['and', '-0.010388085', '0.046198327', ..., '-0.009135812',
        '0.16384092', '-0.18079036'],
       ...,
       ['beet', '-0.015664913', '0.09922612', ..., '0.0060401387',
        '0.025814146', '-0.03263379'],
       ['consequently', '-0.035038773', '0.032420266', ...,
        '0.004400531', '0.07493401', '-0.11078279'],
       ['subbing', '0.0012569765', '0.065207064', ..., '-0.008739584',
        '0.036776725', '-0.017054543']], dtype='<U74')

In [82]:
np.savetxt("keras_withw2v_embeddings.txt", formatted_array2, delimiter=" ", fmt='%s')

In [83]:
from gensim.models import KeyedVectors
from gensim.scripts.glove2word2vec import glove2word2vec

keras_embeddings_glove_format = 'keras_withw2v_embeddings.txt'
keras_embeddings_w2v_format = "keras_withw2v_embeddings_word2vec.txt"
glove2word2vec(keras_embeddings_glove_format, keras_embeddings_w2v_format) 
keras_with_w2v = KeyedVectors.load_word2vec_format(keras_embeddings_w2v_format)

In [84]:
keras_with_w2v.most_similar("happy")

[('fine', 0.8425421118736267),
 ('birth', 0.8311166763305664),
 ('dance', 0.8257489800453186),
 ('youtube', 0.8233530521392822),
 ('hurts', 0.8193178772926331),
 ('tasks', 0.8170125484466553),
 ('excited', 0.8168256282806396),
 ('wise', 0.8162508606910706),
 ('splendid', 0.8154001832008362),
 ('currently', 0.8136612176895142)]

In [85]:
w2v_model.wv.most_similar("happy")

[('forever', 0.8926950097084045),
 ('birthday', 0.8863252997398376),
 ('yes', 0.8857314586639404),
 ('contented', 0.8852509260177612),
 ('whats', 0.8823444843292236),
 ('focused', 0.8819975852966309),
 ('aggravated', 0.8785808086395264),
 ('bored', 0.8751237988471985),
 ('lol', 0.8740817904472351),
 ('awfully', 0.8725798726081848)]
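One simple way to put a number on how much the three neighbour lists agree is the Jaccard overlap of their top-10 words. The sketch below uses the words exactly as printed in the outputs above. The `jaccard` helper and the dictionary keys are my own naming, and this is just one possible measure.

```python
from itertools import combinations

def jaccard(a, b):
    """Size of the intersection divided by size of the union."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)

# Top-10 neighbours of "happy", copied from the three outputs above.
neighbours = {
    "keras_without_w2v": ["lover", "flow", "playing", "everyday", "image",
                          "pee", "core", "pleased", "instantly", "un"],
    "keras_with_w2v": ["fine", "birth", "dance", "youtube", "hurts",
                       "tasks", "excited", "wise", "splendid", "currently"],
    "w2v_model": ["forever", "birthday", "yes", "contented", "whats",
                  "focused", "aggravated", "bored", "lol", "awfully"],
}

# Pairwise overlap between every two embedding sources.
overlaps = {
    (first, second): jaccard(neighbours[first], neighbours[second])
    for first, second in combinations(neighbours, 2)
}
for pair, score in overlaps.items():
    print(pair, score)
```

None of the three lists share a word, so every pairwise overlap is zero. Even the embeddings that started from Word2Vec end up with a different neighbourhood for 'happy' after training on the classification task.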

It is essential to note that word2vec is trained to capture the contexts in which words occur, whereas the Keras embedding layer is simply a look-up layer whose weights are updated by the error propagated back from the task it is solving. Because the two work on fundamentally different principles, the quality of the similarities is not a fair basis for comparing them. This also explains why the top 10 words most similar to 'happy' from the Keras embedding layer trained without initial weights do not look closely related.

Thanks for visiting!