<a href="https://colab.research.google.com/github/royn5618/Talks_Resources/blob/main/PyConPortugal2022/simple_keras_classifier.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**About:**

This notebook contains a naive implementation of an NLP classifier that predicts emotions.

**Data Source on Kaggle:**

https://www.kaggle.com/praveengovi/emotions-dataset-for-nlp

**Data Source on HuggingFace:**

https://huggingface.co/datasets/emotion

# Data import

In [None]:
import pandas as pd
import numpy as np
import warnings
warnings.filterwarnings("ignore")

In [None]:
train_data = pd.read_csv('Data/train.txt', sep=';', names=['text', 'emotion'])
train_data.head()

In [None]:
test_data = pd.read_csv('Data/test.txt', sep=';', names=['text', 'emotion'])
test_data.head()

# Data preparation
## Label Encoding

Encode the target labels as integer values between 0 and n_classes-1.
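
As a minimal sketch of what label encoding produces, the snippet below uses NumPy on a handful of hypothetical labels (not taken from the dataset). It assumes the categories are ordered alphabetically, which is also how pandas orders string categories by default.

```python
import numpy as np

# Hypothetical toy labels, for illustration only
emotions = np.array(["joy", "anger", "sadness", "joy", "fear"])

# np.unique returns the sorted unique categories and, with return_inverse=True,
# the integer code of each label (its index in the sorted categories)
categories, codes = np.unique(emotions, return_inverse=True)

print(categories.tolist())  # ['anger', 'fear', 'joy', 'sadness']
print(codes.tolist())       # [2, 0, 3, 2, 1]
```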

In [None]:
train_data["emotion"] = train_data["emotion"].astype('category')
train_data["emotion_label"] = train_data["emotion"].cat.codes
train_data.head()

In [None]:
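# Reuse the training categories so each emotion maps to the same code in both splits
test_data["emotion"] = pd.Categorical(test_data["emotion"], categories=train_data["emotion"].cat.categories)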
test_data["emotion_label"] = test_data["emotion"].cat.codes
test_data.head()

## One Hot Encoding

Encode categorical features as a one-hot numeric array.
For example:

```
0 -> [1, 0, 0, 0, 0, 0]
1 -> [0, 1, 0, 0, 0, 0]
...
5 -> [0, 0, 0, 0, 0, 1]
```
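
The same mapping can be sketched with plain NumPy. This is only an illustration of the idea, with hypothetical integer labels, and is not the `tf.one_hot` call the notebook itself uses.

```python
import numpy as np

num_classes = 6
labels = np.array([0, 1, 5])  # hypothetical integer labels

# Row i of the identity matrix is the one-hot vector for class i
one_hot = np.eye(num_classes, dtype=int)[labels]
print(one_hot.tolist())
# [[1, 0, 0, 0, 0, 0], [0, 1, 0, 0, 0, 0], [0, 0, 0, 0, 0, 1]]

# argmax along each row recovers the original integer labels
decoded = np.argmax(one_hot, axis=1)
print(decoded.tolist())  # [0, 1, 5]
```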




In [None]:
import tensorflow as tf

In [None]:
train_features, train_labels = train_data['text'], tf.one_hot(train_data["emotion_label"], 6)
test_features, test_labels = test_data['text'], tf.one_hot(test_data["emotion_label"], 6)

In [None]:
train_features[:5]

In [None]:
train_labels[:5]

## Decoder

Takes a one-hot encoded matrix and returns a list of decoded categories. (To be used after predictions.)

In [None]:
def get_labels_from_oh_code(oh_code):
    """Take a one-hot encoded matrix and return a list of decoded categories."""
    # The position of the largest value in each row is the integer label code
    label_code = np.argmax(oh_code, axis=1)
    # Map each code back to its category name
    label = test_data.emotion.cat.categories[label_code]
    return list(label)

In [None]:
"Test Method"
test = np.array(train_labels[:5])
get_labels_from_oh_code(test)

## Text Preprocessing

 - Tokenization breaks a text down into smaller units, commonly the words in it. These units are called tokens.

 - The Keras tokenizer builds a vocabulary from the training text, and any token outside that vocabulary is replaced with an out-of-vocabulary (OOV) token. This handles unexpected vocabulary in unseen text better than silently dropping unknown words.

 - For ANNs, a standard length is usually chosen for every text. Any longer text is truncated from either the front (pre) or the back (post), and any shorter text is padded with zeros at either the front (pre) or the back (post).
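
The three steps above can be sketched in plain Python. This is a simplified stand-in, not how the Keras `Tokenizer` works internally: the helper names `build_vocab`, `encode` and `pad_post` are hypothetical, indices are assigned in order of first appearance (Keras ranks words by frequency), and the toy sentences are made up.

```python
def build_vocab(texts, oov_token="<OOV>"):
    """Map each word to an integer index; index 0 is reserved for padding."""
    vocab = {oov_token: 1}
    for text in texts:
        for word in text.lower().split():
            vocab.setdefault(word, len(vocab) + 1)
    return vocab

def encode(text, vocab, oov_token="<OOV>"):
    """Tokenize on whitespace and replace unknown words with the OOV index."""
    return [vocab.get(word, vocab[oov_token]) for word in text.lower().split()]

def pad_post(sequence, maxlen):
    """Keep the last `maxlen` tokens (truncate from the front), then pad with zeros at the back."""
    truncated = sequence[-maxlen:]
    return truncated + [0] * (maxlen - len(truncated))

vocab = build_vocab(["i feel happy", "i feel sad today"])
seq = encode("i feel really happy", vocab)  # "really" is not in the vocabulary

print(seq)               # [2, 3, 1, 4]
print(pad_post(seq, 6))  # [2, 3, 1, 4, 0, 0]
print(pad_post(seq, 3))  # [3, 1, 4]
```

Padding at the back while truncating from the front mirrors the `padding='post'` setting combined with the default truncation side of `pad_sequences`.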

In [None]:
import tensorflow.keras as keras
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

In [None]:
vocab_size = 15000
max_seq_len = 20

tokenizer = Tokenizer(oov_token = "<OOV>", num_words=vocab_size)
tokenizer.fit_on_texts(train_data['text'])

sequences_train = tokenizer.texts_to_sequences(train_data['text'])
sequences_test = tokenizer.texts_to_sequences(test_data['text'])

padded_train = pad_sequences(sequences_train, padding = 'post', maxlen=max_seq_len)
padded_test = pad_sequences(sequences_test, padding = 'post', maxlen=max_seq_len)

# Model Training

## Build Model

 - Sequential Model: A linear stack of layers.
 - Embedding Layer: Accepts sequences of integer token indices and generates a dense vector per token for the given sequence.
 - Dropout: Randomly sets a certain fraction of its inputs to zero during training, which helps reduce overfitting.
 - LSTM: Retains certain information from earlier steps while learning from new inputs sequentially; used for text and time-series modeling.
 - Dense/Fully-connected layer: Every neuron of the layer is connected to every neuron of its preceding layer.
 - Softmax Activation: A mathematical function that converts a vector of numbers into a vector of probabilities that sum to 1, where each probability is proportional to the exponential of the corresponding value in the vector.
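
As a small illustration of the softmax step, here is a NumPy sketch applied to hypothetical raw scores. The `softmax` helper is written for this example; Keras applies its own implementation inside the `Dense` layer.

```python
import numpy as np

def softmax(logits):
    """Convert a vector of raw scores into probabilities that sum to 1."""
    shifted = logits - np.max(logits)  # subtract the max for numerical stability
    exps = np.exp(shifted)
    return exps / exps.sum()

scores = np.array([2.0, 1.0, 0.1])  # hypothetical raw scores
probs = softmax(scores)

print(probs.round(3))  # approximately [0.659 0.242 0.099]
```

The largest score receives the largest probability, and the ordering of the scores is preserved.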

In [None]:
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Dense, Embedding, Dropout, LSTM

In [None]:
vector_size = 300

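def get_model():
    model = Sequential()
    # Map each token index to a trainable dense vector of size vector_size
    model.add(
        Embedding(input_dim=vocab_size,
                  output_dim=vector_size,
                  input_length=max_seq_len))
    # Randomly drop 60% of the embedding outputs during training
    model.add(Dropout(0.6))
    # The first LSTM returns the full sequence so the second LSTM can consume it
    model.add(LSTM(max_seq_len,return_sequences=True))
    model.add(LSTM(6))
    # One output probability per emotion class
    model.add(Dense(6,activation='softmax'))
    return model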

### Set Callbacks

Callbacks let you monitor model training, save the best model, and collect other information or carry out other tasks while the model trains.
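
To make the `patience` setting concrete, the following is a simplified sketch of the early-stopping rule in plain Python. The function name `early_stopping_epoch` and the validation losses are hypothetical, and the real Keras callback handles more options (such as `min_delta`) than this sketch does.

```python
def early_stopping_epoch(val_losses, patience=2):
    """Return (epoch at which training stops, epoch with the best validation loss)."""
    best_loss = float("inf")
    best_epoch = None
    wait = 0  # number of consecutive epochs without improvement
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best_loss:
            best_loss, best_epoch, wait = loss, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                return epoch, best_epoch
    return len(val_losses), best_epoch

# Hypothetical validation losses, one per epoch
val_losses = [0.90, 0.70, 0.65, 0.66, 0.70, 0.60]
stopped_at, best_at = early_stopping_epoch(val_losses, patience=2)
print(stopped_at, best_at)  # 5 3
```

Training stops after two epochs without improvement, so the sixth epoch never runs; with `restore_best_weights=True` the weights from the best epoch (here epoch 3) are the ones kept.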

In [None]:
callbacks = [
    keras.callbacks.EarlyStopping(monitor="val_loss",
                                  patience=2,
                                  verbose=1,
                                  mode="min",
                                  restore_best_weights=True),
    keras.callbacks.ModelCheckpoint(filepath='models/best_model.h5',
                                    verbose=1,
                                    save_best_only=True)
]

### Verify your model

One final view of your model architecture.

In [None]:
model = get_model()
model.summary()

## Compile Model

Provide instructions on how the model will learn from its mistakes.
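
The "mistakes" are measured by the loss function. As a rough sketch of what categorical cross-entropy computes for one-hot targets, here is a NumPy version on hypothetical predictions. The helper below is written for illustration and is not the Keras implementation.

```python
import numpy as np

def categorical_crossentropy(y_true, y_pred, eps=1e-7):
    """Negative log of the probability assigned to the true class, per sample."""
    y_pred = np.clip(y_pred, eps, 1.0)  # avoid log(0)
    return -np.sum(y_true * np.log(y_pred), axis=-1)

y_true = np.array([[0, 1, 0]])            # the true class is index 1
good_pred = np.array([[0.1, 0.8, 0.1]])   # hypothetical confident, correct prediction
bad_pred = np.array([[0.7, 0.2, 0.1]])    # hypothetical wrong prediction

good_loss = categorical_crossentropy(y_true, good_pred)
bad_loss = categorical_crossentropy(y_true, bad_pred)
print(good_loss.round(3))  # approximately [0.223]
print(bad_loss.round(3))   # approximately [1.609]
```

The loss grows as the probability assigned to the true class shrinks, and the optimizer adjusts the weights to reduce it.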



In [None]:
model.compile(optimizer='adam',
              loss='categorical_crossentropy',
              metrics=['accuracy'])

## Train Model

In [None]:
history = model.fit(padded_train,
                    train_labels,
                    validation_split=0.33,
                    callbacks=callbacks,
                    epochs=10)

## Visualize and verify the Loss per epoch

In [None]:
from plotly.graph_objs import *

In [None]:
metric_to_plot = "loss"
epochs = list(range(1, max(history.epoch) + 2))
training_loss = history.history[metric_to_plot]
validation_loss = history.history["val_" + metric_to_plot]

trace1 = {
    "mode": "lines+markers",
    "name": "Training Loss",
    "type": "scatter",
    "x": epochs,
    "y": training_loss
}

trace2 = {
    "mode": "lines+markers",
    "name": "Validation Loss",
    "type": "scatter",
    "x": epochs,
    "y": validation_loss
}

data = [trace1, trace2]  # Figure accepts a plain list of trace dicts
layout = {
    "title": "Training - Validation Loss",
    "xaxis": {
        # Nested title.font replaces the deprecated titlefont attribute
        "title": {
            "text": "Number of epochs",
            "font": {"size": 18, "color": "#7f7f7f"}
        }
    },
    "yaxis": {
        "title": {
            "text": "Loss",
            "font": {"size": 18, "color": "#7f7f7f"}
        }
    }
}
fig = Figure(data=data, layout=layout)
fig.update_layout(hovermode="x unified")
fig.show()

## Visualize and verify the metric per epoch

In [None]:
metric_to_plot = "accuracy"
epochs = list(range(1, max(history.epoch) + 2))
training_accuracy = history.history[metric_to_plot]
validation_accuracy = history.history["val_" + metric_to_plot]

trace1 = {
    "mode": "lines+markers",
    "name": "Training Accuracy",
    "type": "scatter",
    "x": epochs,
    "y": training_accuracy
}

trace2 = {
    "mode": "lines+markers",
    "name": "Validation Accuracy",
    "type": "scatter",
    "x": epochs,
    "y": validation_accuracy
}

data = [trace1, trace2]  # Figure accepts a plain list of trace dicts
layout = {
    "title": "Training - Validation Accuracy",
    "xaxis": {
        # Nested title.font replaces the deprecated titlefont attribute
        "title": {
            "text": "Number of epochs",
            "font": {"size": 18, "color": "#7f7f7f"}
        }
    },
    "yaxis": {
        "title": {
            "text": "Accuracy",
            "font": {"size": 18, "color": "#7f7f7f"}
        }
    }
}
fig = Figure(data=data, layout=layout)
fig.update_layout(hovermode="x unified")
fig.show()

# Model Evaluation

 - The data is unbalanced, so accuracy alone is not a good measure of performance.
 - Precision and recall are better, along with their weighted averages for overall model performance.

 ```
 actual  prediction
 0       1
 0       0
 0       0 
 1       1
 1       1
 0       0
 ```

 Precision: 2/3, we have two correct positive predictions (TP) out of 3 positive predictions in total.

 Recall: 2/2, we have two correct positive predictions (TP) out of 2 actually positive data points.
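
The counts for the toy table can be checked by hand in plain Python. This is only a sketch of the definitions for the positive class (label 1); the variable names are chosen for this example.

```python
actual     = [0, 0, 0, 1, 1, 0]
prediction = [1, 0, 0, 1, 1, 0]

pairs = list(zip(actual, prediction))
tp = sum(1 for a, p in pairs if a == 1 and p == 1)  # true positives
fp = sum(1 for a, p in pairs if a == 0 and p == 1)  # false positives
fn = sum(1 for a, p in pairs if a == 1 and p == 0)  # false negatives

precision = tp / (tp + fp)  # share of positive predictions that are correct
recall = tp / (tp + fn)     # share of actual positives that were found

print(tp, fp, fn)                   # 2 1 0
print(round(precision, 3), recall)  # 0.667 1.0
```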

In [None]:
from sklearn.metrics import classification_report

In [None]:
""" Demo """
list1 = [0, 0, 0, 1, 1, 0]
list2 = [1, 0, 0, 1, 1, 0]
print(classification_report(list1, list2))

In [None]:
# model variable still holds the best model but 
# you can also reload a saved model like this
best_model = keras.models.load_model('models/best_model.h5')

In [None]:
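# Predicted class probabilities, one row per sample
y_pred_probabilities = best_model.predict(padded_train)
y_pred_probabilities

In [None]:
# Take the most probable class per row. Thresholding at 0.5 first would turn rows
# with no probability above 0.5 into all zeros, which argmax then maps to class 0.
y_pred = np.argmax(y_pred_probabilities, axis=1)
print(classification_report(train_data['emotion_label'], y_pred))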

In [None]:
# Model Evaluation on Test Data
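# Use the most probable class directly instead of thresholding at 0.5,
# so rows with no probability above 0.5 are not all assigned to class 0
y_pred_probabilities = best_model.predict(padded_test)
y_pred = np.argmax(y_pred_probabilities, axis=1)
print(classification_report(test_data['emotion_label'], y_pred))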