<a href="https://colab.research.google.com/github/royn5618/Talks_Resources/blob/main/PyConPortugal2022/simple_keras_classifier.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**About:**

This notebook contains a naive implementation of an NLP classifier that predicts emotions from text.

**Data Source on Kaggle:**

https://www.kaggle.com/praveengovi/emotions-dataset-for-nlp

**Data Source on HuggingFace:**

https://huggingface.co/datasets/emotion

# Data import

In [1]:
import pandas as pd
import numpy as np
import warnings
warnings.filterwarnings("ignore")

In [2]:
train_data = pd.read_csv('Data/train.txt', sep=';', names=['text', 'emotion'])
train_data.head()

Unnamed: 0,text,emotion
0,i didnt feel humiliated,sadness
1,i can go from feeling so hopeless to so damned...,sadness
2,im grabbing a minute to post i feel greedy wrong,anger
3,i am ever feeling nostalgic about the fireplac...,love
4,i am feeling grouchy,anger


In [3]:
test_data = pd.read_csv('Data/test.txt', sep=';', names=['text', 'emotion'])
test_data.head()

Unnamed: 0,text,emotion
0,im feeling rather rotten so im not very ambiti...,sadness
1,im updating my blog because i feel shitty,sadness
2,i never make her separate from me because i do...,sadness
3,i left with my bouquet of red and yellow tulip...,joy
4,i was feeling a little vain when i did this one,sadness


# Data preparation
## Label Encoding

Encode target labels with value between 0 and n_classes-1.
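
As a minimal sketch of how pandas assigns these integer codes, the example below uses a handful of hypothetical labels (not rows from the dataset). It assumes the default behaviour of `astype('category')`, where categories are sorted and each code is the position of the label in that sorted list.

```python
import pandas as pd

# Hypothetical toy labels, not rows from the dataset
emotions = pd.Series(["sadness", "anger", "love", "joy", "anger"]).astype("category")

# Categories are sorted alphabetically; each code is an index into that list
code_to_emotion = dict(enumerate(emotions.cat.categories))
codes = emotions.cat.codes.tolist()

print(code_to_emotion)
print(codes)
```

Keeping a `code_to_emotion` style lookup around makes it easy to translate predicted integers back into readable labels later.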

In [4]:
train_data["emotion"] = train_data["emotion"].astype('category')
train_data["emotion_label"] = train_data["emotion"].cat.codes
train_data.head()

Unnamed: 0,text,emotion,emotion_label
0,i didnt feel humiliated,sadness,4
1,i can go from feeling so hopeless to so damned...,sadness,4
2,im grabbing a minute to post i feel greedy wrong,anger,0
3,i am ever feeling nostalgic about the fireplac...,love,3
4,i am feeling grouchy,anger,0


In [5]:
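# Reuse the training categories so both splits share one label-to-code mapping
test_data["emotion"] = pd.Categorical(test_data["emotion"],
                                      categories=train_data["emotion"].cat.categories)
test_data["emotion_label"] = test_data["emotion"].cat.codes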
test_data.head()

Unnamed: 0,text,emotion,emotion_label
0,im feeling rather rotten so im not very ambiti...,sadness,4
1,im updating my blog because i feel shitty,sadness,4
2,i never make her separate from me because i do...,sadness,4
3,i left with my bouquet of red and yellow tulip...,joy,2
4,i was feeling a little vain when i did this one,sadness,4


## One Hot Encoding

Encode each integer label as a one-hot numeric array, with one position per class.
For example, with 6 classes:

```
0 -> [1, 0, 0, 0, 0, 0]
1 -> [0, 1, 0, 0, 0, 0]
...
5 -> [0, 0, 0, 0, 0, 1]

```
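
The mapping above can be sketched without TensorFlow. The helper `one_hot` below is a hypothetical NumPy stand-in for `tf.one_hot`, shown only to make the indexing explicit.

```python
import numpy as np

def one_hot(labels, num_classes):
    """Return a (len(labels), num_classes) float array with a single 1 per row."""
    labels = np.asarray(labels)
    # Row i of the identity matrix is the one-hot vector for class i
    return np.eye(num_classes, dtype=np.float32)[labels]

# Integer labels of the first five training rows shown earlier
encoded = one_hot([4, 4, 0, 3, 0], num_classes=6)
print(encoded)
```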




In [6]:
import tensorflow as tf

In [7]:
train_features, train_labels = train_data['text'], tf.one_hot(train_data["emotion_label"], 6)
test_features, test_labels = test_data['text'], tf.one_hot(test_data["emotion_label"], 6)

In [8]:
train_features[:5]

0                              i didnt feel humiliated
1    i can go from feeling so hopeless to so damned...
2     im grabbing a minute to post i feel greedy wrong
3    i am ever feeling nostalgic about the fireplac...
4                                 i am feeling grouchy
Name: text, dtype: object

In [9]:
train_labels[:5]

<tf.Tensor: shape=(5, 6), dtype=float32, numpy=
array([[0., 0., 0., 0., 1., 0.],
       [0., 0., 0., 0., 1., 0.],
       [1., 0., 0., 0., 0., 0.],
       [0., 0., 0., 1., 0., 0.],
       [1., 0., 0., 0., 0., 0.]], dtype=float32)>

## Decoder

Takes in a one-hot encoded matrix and returns a list of decoded categories. (To be used after predictions.)

In [10]:
def get_labels_from_oh_code(oh_code):
    """Take a one-hot encoded matrix and return a list of decoded category names."""
    # The position of the largest value in each row is the integer label
    label_code = np.argmax(oh_code, axis=1)
    # Map the integer labels back to emotion names
    label = test_data.emotion.cat.categories[label_code]
    return list(label)

In [11]:
# Quick check of the decoder on the first five training labels
sample_labels = np.array(train_labels[:5])
get_labels_from_oh_code(sample_labels)

['sadness', 'sadness', 'anger', 'love', 'anger']

## Text Preprocessing

 - Tokenization breaks a text down into smaller units, commonly words. These units are called tokens.

 - The Keras tokenizer keeps a fixed vocabulary, and any token outside that vocabulary is replaced with an out-of-vocabulary (OOV) token. This is a robust way to handle unexpected words in unseen text data.

 - Neural networks usually expect every text to have the same length. Longer texts are truncated at the front (pre) or the back (post), and shorter texts are padded with zeros at the front (pre) or the back (post).
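
The three steps above can be sketched in plain Python. This is a simplified illustration, not the Keras implementation: the helpers `build_vocab` and `texts_to_padded` are hypothetical, and both truncation and padding are applied at the back here, whereas `pad_sequences` truncates from the front unless `truncating='post'` is passed.

```python
from collections import Counter

PAD_INDEX = 0  # index used for padding in this sketch
OOV_INDEX = 1  # index reserved for out-of-vocabulary tokens in this sketch

def build_vocab(texts, vocab_size):
    """Map the most frequent words to integer indices, starting at 2."""
    counts = Counter(word for text in texts for word in text.lower().split())
    most_common = counts.most_common(vocab_size - 2)  # leave room for PAD and OOV
    return {word: i for i, (word, _) in enumerate(most_common, start=2)}

def texts_to_padded(texts, vocab, max_len):
    """Tokenize, replace unknown words with OOV, then truncate and pad at the back."""
    padded = []
    for text in texts:
        seq = [vocab.get(word, OOV_INDEX) for word in text.lower().split()]
        seq = seq[:max_len]                              # 'post' truncation
        seq = seq + [PAD_INDEX] * (max_len - len(seq))   # 'post' padding
        padded.append(seq)
    return padded

# Hypothetical training texts used to build the vocabulary
train_texts = ["i am feeling grouchy", "i am feeling happy today"]
vocab = build_vocab(train_texts, vocab_size=10)

# "curious" was never seen, so it maps to OOV_INDEX; the rest is zero-padded
short_result = texts_to_padded(["i am feeling curious"], vocab, max_len=6)
long_result = texts_to_padded(["i am feeling happy today i am"], vocab, max_len=6)
print(short_result)
print(long_result)
```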

In [12]:
import tensorflow.keras as keras
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

In [13]:
vocab_size = 15000
max_seq_len = 20

tokenizer = Tokenizer(oov_token = "<OOV>", num_words=vocab_size)
tokenizer.fit_on_texts(train_data['text'])

sequences_train = tokenizer.texts_to_sequences(train_data['text'])
sequences_test = tokenizer.texts_to_sequences(test_data['text'])

padded_train = pad_sequences(sequences_train, padding = 'post', maxlen=max_seq_len)
padded_test = pad_sequences(sequences_test, padding = 'post', maxlen=max_seq_len)

# Model Training

## Build Model

 - Sequential Model: A linear stack of layers.
 - Embedding Layer: Accepts sequences of token indices and outputs a dense vector per token for the given sequence.
 - Dropout: Randomly sets a given fraction of its inputs to zero during training to reduce overfitting.
 - LSTM: Retains selected information from earlier steps while learning from new inputs sequentially. Used for text and time-series modeling.
 - Dense/Fully-connected layer: Every neuron of the layer is connected to every neuron of the preceding layer.
 - Softmax Activation: A mathematical function that converts a vector of numbers into a vector of probabilities that sum to 1, where each probability is proportional to the exponential of the corresponding value in the vector.
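
To make the softmax description concrete, here is a minimal NumPy sketch. The input scores are hypothetical values for six classes, and subtracting the maximum is a common numerical-stability trick that does not change the result.

```python
import numpy as np

def softmax(logits):
    """Convert a vector of raw scores into probabilities that sum to 1."""
    shifted = logits - np.max(logits)  # subtract the max for numerical stability
    exps = np.exp(shifted)
    return exps / exps.sum()

# Hypothetical raw scores for 6 classes
scores = np.array([2.0, 1.0, 0.1, 0.0, -1.0, 0.5])
probs = softmax(scores)
print(probs.round(3), probs.sum())
```

The largest score receives the largest probability, and every class keeps a non-zero share.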

In [14]:
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Dense, Embedding, Dropout, LSTM

In [15]:
vector_size = 300

def get_model():
    model = Sequential()
    model.add(
        Embedding(input_dim=vocab_size,
                  output_dim=vector_size,
                  input_length=max_seq_len))
    model.add(Dropout(0.6))
    model.add(LSTM(max_seq_len,return_sequences=True))
    model.add(LSTM(6))
    model.add(Dense(6,activation='softmax'))
    return model

### Set Callbacks

Callbacks let you monitor training, save the best model, and run other tasks at chosen points during training.
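
The patience logic behind early stopping can be sketched in a few lines of plain Python. The function `early_stopping_epoch` is a hypothetical simplification (no `min_delta`, no weight handling), and the loss values are made up for illustration.

```python
def early_stopping_epoch(val_losses, patience):
    """Return (stop_epoch, best_epoch), both 1-based.

    stop_epoch is None if the patience limit is never reached.
    """
    best_loss = float("inf")
    best_epoch = None
    wait = 0  # epochs since the last improvement
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best_loss:
            best_loss, best_epoch, wait = loss, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                return epoch, best_epoch
    return None, best_epoch

# Hypothetical validation losses, one per epoch
losses = [1.35, 1.05, 0.82, 0.73, 0.75, 0.66, 0.70, 0.68, 0.60]
stop_epoch, best_epoch = early_stopping_epoch(losses, patience=2)
print(stop_epoch, best_epoch)
```

With `patience=2`, a single bad epoch is tolerated, but two consecutive epochs without improvement end training, and the weights from `best_epoch` are the ones worth keeping.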

In [16]:
callbacks = [
    keras.callbacks.EarlyStopping(monitor="val_loss",
                                  patience=2,
                                  verbose=1,
                                  mode="min",
                                  restore_best_weights=True),
    keras.callbacks.ModelCheckpoint(filepath='models/best_model.h5',
                                    verbose=1,
                                    save_best_only=True)
]

### Verify your model

One final view of your model architecture.

In [17]:
model = get_model()
model.summary()

Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding (Embedding)       (None, 20, 300)           4500000   
                                                                 
 dropout (Dropout)           (None, 20, 300)           0         
                                                                 
 lstm (LSTM)                 (None, 20, 20)            25680     
                                                                 
 lstm_1 (LSTM)               (None, 6)                 648       
                                                                 
 dense (Dense)               (None, 6)                 42        
                                                                 
Total params: 4,526,370
Trainable params: 4,526,370
Non-trainable params: 0
_________________________________________________________________
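
The parameter counts in the summary can be checked by hand. The sketch below assumes the standard formulas for each layer type; the helper names are hypothetical, and the sizes come from `vocab_size`, `vector_size`, and `max_seq_len` defined earlier.

```python
def embedding_params(vocab_size, vector_size):
    # One vector of length vector_size per vocabulary entry
    return vocab_size * vector_size

def lstm_params(input_dim, units):
    # 4 gates, each with input weights, recurrent weights, and a bias
    return 4 * ((input_dim + units + 1) * units)

def dense_params(input_dim, units):
    # Weight matrix plus one bias per unit
    return input_dim * units + units

counts = [
    embedding_params(15000, 300),  # embedding
    lstm_params(300, 20),          # lstm
    lstm_params(20, 6),            # lstm_1
    dense_params(6, 6),            # dense
]
total = sum(counts)
print(counts, total)
```

Almost all of the parameters sit in the embedding layer, which is typical for small text models with a large vocabulary.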


## Compile Model

Specify the optimizer, the loss function, and the metrics that define how the model learns from its mistakes.
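
As a rough illustration of what the categorical cross-entropy loss measures, the NumPy sketch below computes it for hypothetical predictions. The function is a simplified stand-in, not the Keras implementation, and the clipping constant `eps` is an assumption added to avoid `log(0)`.

```python
import numpy as np

def categorical_crossentropy(y_true, y_pred, eps=1e-7):
    """Mean over samples of -sum(y_true * log(y_pred))."""
    y_pred = np.clip(y_pred, eps, 1.0)  # avoid log(0)
    return float(np.mean(-np.sum(y_true * np.log(y_pred), axis=1)))

# Two one-hot targets: class 4 and class 0
y_true = np.array([[0, 0, 0, 0, 1, 0],
                   [1, 0, 0, 0, 0, 0]], dtype=float)

# Hypothetical predictions: confident and correct vs. uniform guessing
confident = np.array([[0.02, 0.02, 0.02, 0.02, 0.90, 0.02],
                      [0.90, 0.02, 0.02, 0.02, 0.02, 0.02]])
unsure = np.full((2, 6), 1 / 6)

loss_confident = categorical_crossentropy(y_true, confident)
loss_unsure = categorical_crossentropy(y_true, unsure)
print(loss_confident, loss_unsure)
```

The loss only looks at the probability assigned to the true class, so confident correct predictions score close to zero while uniform guessing over six classes scores `log(6)`.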



In [18]:
model.compile(optimizer='adam',
              loss='categorical_crossentropy',
              metrics=['accuracy'])

## Train Model

In [19]:
history = model.fit(padded_train,
                    train_labels,
                    validation_split=0.33,
                    callbacks=callbacks,
                    epochs=10)

Epoch 1/10
Epoch 1: val_loss improved from inf to 1.34902, saving model to models/best_model.h5
Epoch 2/10
Epoch 2: val_loss improved from 1.34902 to 1.05466, saving model to models/best_model.h5
Epoch 3/10
Epoch 3: val_loss improved from 1.05466 to 0.82240, saving model to models/best_model.h5
Epoch 4/10
Epoch 4: val_loss improved from 0.82240 to 0.72616, saving model to models/best_model.h5
Epoch 5/10
Epoch 5: val_loss did not improve from 0.72616
Epoch 6/10
Epoch 6: val_loss improved from 0.72616 to 0.66059, saving model to models/best_model.h5
Epoch 7/10
Epoch 7: val_loss did not improve from 0.66059
Epoch 8/10

Epoch 8: val_loss did not improve from 0.66059
Epoch 8: early stopping


## Visualize and verify the Loss per epoch

In [20]:
from plotly.graph_objs import *

In [21]:
metric_to_plot = "loss"
epochs = list(range(1, max(history.epoch) + 2))
training_loss = history.history[metric_to_plot]
validation_loss = history.history["val_" + metric_to_plot]

trace1 = {
    "mode": "lines+markers",
    "name": "Training Loss",
    "type": "scatter",
    "x": epochs,
    "y": training_loss
}

trace2 = {
    "mode": "lines+markers",
    "name": "Validation Loss",
    "type": "scatter",
    "x": epochs,
    "y": validation_loss
}

# Pass the traces as a plain list; plotly.graph_objs.Data is deprecated
data = [trace1, trace2]
layout = {
    "title": "Training - Validation Loss",
    "xaxis": {
        "title": "Number of epochs",
        "titlefont": {
            "size": 18,
            "color": "#7f7f7f"
        }
    },
    "yaxis": {
        "title": "Loss",
        "titlefont": {
            "size": 18,
            "color": "#7f7f7f"
        }
    }
}
fig = Figure(data=data, layout=layout)
fig.update_layout(hovermode="x unified")
fig.show()



## Visualize and verify the metric per epoch

In [22]:
metric_to_plot = "accuracy"
epochs = list(range(1, max(history.epoch) + 2))
training_accuracy = history.history[metric_to_plot]
validation_accuracy = history.history["val_" + metric_to_plot]

trace1 = {
    "mode": "lines+markers",
    "name": "Training Accuracy",
    "type": "scatter",
    "x": epochs,
    "y": training_accuracy
}

trace2 = {
    "mode": "lines+markers",
    "name": "Validation Accuracy",
    "type": "scatter",
    "x": epochs,
    "y": validation_accuracy
}

# Pass the traces as a plain list; plotly.graph_objs.Data is deprecated
data = [trace1, trace2]
layout = {
    "title": "Training - Validation Accuracy",
    "xaxis": {
        "title": "Number of epochs",
        "titlefont": {
            "size": 18,
            "color": "#7f7f7f"
        }
    },
    "yaxis": {
        "title": "Accuracy",
        "titlefont": {
            "size": 18,
            "color": "#7f7f7f"
        }
    }
}
fig = Figure(data=data, layout=layout)
fig.update_layout(hovermode="x unified")
fig.show()




# Model Evaluation

 - The data is imbalanced, so accuracy alone is not a reliable measure.
 - Per-class precision and recall are more informative, along with their weighted averages for overall model performance.

 ```
 actual  prediction
 0       1
 0       0
 0       0 
 1       1
 1       1
 0       0
 ```

 Precision: 2/3, we have two correct positive predictions (TP) out of 3 positive predictions in total.

 Recall: 2/2, we have two correct positive predictions (TP) out of 2 actually positive data points.
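
The same numbers can be reproduced by counting true positives, false positives, and false negatives for the positive class. This is a minimal sketch using the six rows from the table above; the variable names are chosen for illustration.

```python
actual = [0, 0, 0, 1, 1, 0]
prediction = [1, 0, 0, 1, 1, 0]

pairs = list(zip(actual, prediction))
tp = sum(1 for a, p in pairs if a == 1 and p == 1)  # predicted 1, actually 1
fp = sum(1 for a, p in pairs if a == 0 and p == 1)  # predicted 1, actually 0
fn = sum(1 for a, p in pairs if a == 1 and p == 0)  # predicted 0, actually 1

precision = tp / (tp + fp)  # share of positive predictions that are correct
recall = tp / (tp + fn)     # share of actual positives that were found

print(tp, fp, fn)
print(round(precision, 2), round(recall, 2))
```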

In [23]:
from sklearn.metrics import classification_report

In [24]:
""" Demo """
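y_true_demo = [0, 0, 0, 1, 1, 0]  # actual labels from the toy example
y_pred_demo = [1, 0, 0, 1, 1, 0]  # predicted labels from the toy example
print(classification_report(y_true_demo, y_pred_demo))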

              precision    recall  f1-score   support

           0       1.00      0.75      0.86         4
           1       0.67      1.00      0.80         2

    accuracy                           0.83         6
   macro avg       0.83      0.88      0.83         6
weighted avg       0.89      0.83      0.84         6
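
The two averaged rows can be derived from the per-class rows. The sketch below recomputes the macro and weighted precision for the demo report, assuming the usual definitions: the macro average treats every class equally, while the weighted average weights each class by its support.

```python
# Per-class precision and support from the demo report above
precision_per_class = {0: 1.0, 1: 2 / 3}
support = {0: 4, 1: 2}

# Macro average: plain mean over classes
macro_precision = sum(precision_per_class.values()) / len(precision_per_class)

# Weighted average: mean over classes, weighted by the number of true samples
total_support = sum(support.values())
weighted_precision = sum(
    precision_per_class[c] * support[c] for c in support
) / total_support

print(round(macro_precision, 2), round(weighted_precision, 2))
```

On imbalanced data the two can diverge noticeably, because the weighted average is dominated by the frequent classes.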



In [25]:
# The model variable already holds the best weights (restore_best_weights=True),
# but a saved model can also be reloaded like this
best_model = keras.models.load_model('models/best_model.h5')

In [26]:
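# Threshold the predicted probabilities at 0.5 to get a one-hot style matrix.
# Caveat: a row where no class exceeds 0.5 becomes all zeros.
y_pred_one_hot_encoded = (best_model.predict(padded_train) > 0.5).astype("int32")
y_pred_one_hot_encoded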

array([[0, 0, 0, 0, 1, 0],
       [0, 0, 0, 0, 1, 0],
       [1, 0, 0, 0, 0, 0],
       ...,
       [0, 0, 1, 0, 0, 0],
       [1, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 1, 0]], dtype=int32)
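
One caveat of thresholding at 0.5 before taking the argmax: when no class exceeds 0.5, the row becomes all zeros and the argmax falls back to class 0. The sketch below illustrates this with hypothetical probabilities and compares it with taking the argmax of the probabilities directly, which is a common alternative for single-label classification.

```python
import numpy as np

# Hypothetical softmax outputs for two texts
probs = np.array([
    [0.05, 0.05, 0.10, 0.05, 0.70, 0.05],  # confident: class 4
    [0.10, 0.10, 0.40, 0.30, 0.05, 0.05],  # unsure: class 2 is most likely, but below 0.5
])

# Threshold first, then argmax: the unsure row turns into all zeros
thresholded = (probs > 0.5).astype("int32")
via_threshold = np.argmax(thresholded, axis=1)

# Argmax of the raw probabilities always picks the most likely class
via_argmax = np.argmax(probs, axis=1)

print(via_threshold, via_argmax)
```

Using the direct argmax would change the reported metrics, so the reports in this notebook should be read with the thresholding behaviour in mind.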

In [27]:
# argmax recovers the integer label; an all-zero row falls back to class 0
y_pred = np.array(tf.argmax(y_pred_one_hot_encoded, axis=1))
print(classification_report(train_data['emotion_label'], y_pred))

              precision    recall  f1-score   support

           0       0.74      0.90      0.81      2159
           1       0.92      0.87      0.89      1937
           2       0.94      0.92      0.93      5362
           3       0.81      0.90      0.85      1304
           4       0.97      0.92      0.94      4666
           5       0.88      0.65      0.75       572

    accuracy                           0.90     16000
   macro avg       0.88      0.86      0.86     16000
weighted avg       0.91      0.90      0.90     16000



In [28]:
# Model evaluation on the test data, using the same thresholding as above
y_pred_one_hot_encoded = (best_model.predict(padded_test) > 0.5).astype("int32")
# argmax recovers the integer label; an all-zero row falls back to class 0
y_pred = np.array(tf.argmax(y_pred_one_hot_encoded, axis=1))
print(classification_report(test_data['emotion_label'], y_pred))

              precision    recall  f1-score   support

           0       0.60      0.81      0.69       275
           1       0.83      0.80      0.82       224
           2       0.85      0.83      0.84       695
           3       0.66      0.75      0.71       159
           4       0.94      0.83      0.88       581
           5       0.75      0.45      0.57        66

    accuracy                           0.80      2000
   macro avg       0.77      0.75      0.75      2000
weighted avg       0.82      0.80      0.81      2000

