# NLP Seminar 4: Text classification with neural networks (MLPs, RNNs)

In this seminar, we cover the implementation of "all-in-one" text classification neural networks using the `tensorflow` package. The resulting models include both a neural text embedding and a neural classifier in a single pipeline.

The focus here is on two types of architectures: multi-layer perceptrons (MLPs) and (bidirectional) recurrent neural networks (RNNs).

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import json

In [None]:
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers, losses, optimizers
from tensorflow.keras import Sequential

## Data preparation

We again use the Simpsons script dataset. As a toy example, the classification goal is to determine which character is speaking given a textual dialogue line.

Data source: https://www.kaggle.com/datasets/prashant111/the-simpsons-dataset

In [None]:
simpsons = pd.read_csv("data/simpsons_script_lines.csv",
                       usecols=["raw_character_text", "raw_location_text", "spoken_words", "normalized_text"],
                       dtype={'raw_character_text':'string', 'raw_location_text':'string',
                              'spoken_words':'string', 'normalized_text':'string'})
simpsons.head()

In [None]:
simpsons.info()

In [None]:
simpsons = simpsons.dropna().drop_duplicates().reset_index(drop=True)
simpsons.head()

We restrict the data to the 10 most frequent characters.
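
As a minimal sketch of this restriction step, the snippet below keeps the most frequent classes of a toy label list and maps them to integer indices (the integer form is what a sparse cross-entropy loss expects). The toy labels and the variable names (`toy_labels`, `top_classes`, `encoded`) are illustrative assumptions, not part of the Simpsons data.

```python
from collections import Counter

# Toy labels standing in for the character column (hypothetical data).
toy_labels = ["Homer", "Marge", "Homer", "Bart", "Lisa",
              "Homer", "Bart", "Marge", "Homer", "Marge"]

# Keep only the k most frequent classes, ordered by decreasing frequency.
k = 3
top_classes = [name for name, _ in Counter(toy_labels).most_common(k)]

# Drop the samples of all other classes.
kept = [name for name in toy_labels if name in top_classes]

# Encode each remaining label as its rank in the frequency ordering.
class_to_index = {name: i for i, name in enumerate(top_classes)}
encoded = [class_to_index[name] for name in kept]

print(top_classes)
print(encoded)
```

With this convention, class index 0 always corresponds to the most frequent class.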

In [None]:
n_classes = 10
simpsons['raw_character_text'].value_counts(dropna=False)[:n_classes]

In [None]:
main_characters = simpsons['raw_character_text'].value_counts(dropna=False)[:n_classes].index.to_list()

simpsons_main = simpsons.query("`raw_character_text` in @main_characters")
simpsons_main

In [None]:
X = simpsons_main["normalized_text"].to_numpy()
y = simpsons_main["raw_character_text"].to_numpy()
y

In [None]:
y_int = np.array([np.where(np.array(main_characters)==char)[0].item() for char in y])
y_int

Train-validation data split

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_valid, y_train, y_valid = train_test_split(X, y_int, test_size=0.2, random_state=42, shuffle=True)

## TextVectorization and Embedding layers

In a `Sequential` tensorflow model, the [`TextVectorization`](https://www.tensorflow.org/api_docs/python/tf/keras/layers/TextVectorization) layer is used to transform raw text into sequences of word tokens.
For efficiency reasons, each token is encoded as an integer index into the vocabulary instead of a one-hot vector.
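
To make the integer encoding concrete, here is a rough pure-Python imitation of what such a layer does in `'int'` output mode: standardize, split on whitespace, build a frequency-ranked vocabulary, then truncate or pad to a fixed length. It is a sketch, not the layer's actual implementation. The helper names, the exact punctuation set, and the alphabetical tie-breaking are assumptions, while reserving index 0 for padding and index 1 for unknown words follows the layer's documented convention.

```python
import re
import string
from collections import Counter

def standardize(text):
    """Lowercase and remove punctuation (approximates 'lower_and_strip_punctuation')."""
    return re.sub(f"[{re.escape(string.punctuation)}]", "", text.lower())

def build_vocabulary(corpus, max_tokens):
    """Rank tokens by decreasing frequency; index 0 is padding, index 1 is '[UNK]'."""
    counts = Counter(tok for doc in corpus for tok in standardize(doc).split())
    # Ties are broken alphabetically here (an assumption of this sketch).
    ranked = sorted(counts, key=lambda tok: (-counts[tok], tok))
    return ["", "[UNK]"] + ranked[: max_tokens - 2]

def vectorize(text, vocab, sequence_length):
    """Map a string to a fixed-length list of integer token indices."""
    index = {tok: i for i, tok in enumerate(vocab)}
    ids = [index.get(tok, 1) for tok in standardize(text).split()]  # 1 = unknown word
    ids = ids[:sequence_length]                                     # truncate
    return ids + [0] * (sequence_length - len(ids))                 # pad with 0

toy_corpus = ["Homer eats a donut.", "Homer, come here!", "Lisa plays the saxophone"]
toy_vocab = build_vocabulary(toy_corpus, max_tokens=100)
toy_ids = vectorize("Homer plays golf!", toy_vocab, sequence_length=5)

print(toy_vocab)
print(toy_ids)
```

The word "golf" is not in the toy vocabulary, so it is mapped to the unknown-word index 1, and the two trailing zeros are padding.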

In [None]:
corp = [??]

In [None]:
max_vocab = 100
sequence_length = 10
vectorize_layer = layers.TextVectorization(max_tokens=max_vocab, standardize='lower_and_strip_punctuation',
                                           output_mode='int', output_sequence_length=sequence_length)
vectorize_layer.adapt(corp)

In [None]:
vectorize_layer(??)

In [None]:
vectorize_layer.get_vocabulary()

Then, the [`Embedding`](https://www.tensorflow.org/api_docs/python/tf/keras/layers/Embedding) layer is typically used to map these integer token indices to (trainable) embedding vectors.
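
Conceptually, an embedding layer is a lookup table: row `i` of a weight matrix is the vector of token `i`. The NumPy sketch below illustrates this and checks that the lookup gives the same result as multiplying one-hot vectors by the table. The sizes and the names `embedding_table` and `token_ids` are toy assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, embedding_dim = 12, 8

# One row per vocabulary entry, randomly initialized (trainable in a real model).
embedding_table = rng.uniform(-0.05, 0.05, size=(vocab_size, embedding_dim))

# A batch containing one sequence of 5 integer token indices.
token_ids = np.array([[2, 9, 1, 0, 0]])

# The embedding is a plain row lookup: shape (batch, sequence, embedding_dim).
embedded = embedding_table[token_ids]

# Equivalent, but wasteful: one-hot encode, then multiply by the table.
one_hot = np.eye(vocab_size)[token_ids]
embedded_via_one_hot = one_hot @ embedding_table

print(embedded.shape)
```

This is why integer indices are preferred over one-hot vectors: the lookup avoids building and multiplying large sparse matrices.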

In [None]:
embedder = layers.Embedding(input_dim=len(vectorize_layer.get_vocabulary()), output_dim=8,
                            embeddings_initializer='uniform')

In [None]:
embedder(??)

## MLP network architectures

In [None]:
# Vocabulary size and number of words in a sequence.
max_vocab = 10000
sequence_length = 100

The `TextVectorization` layer needs to be "adapted" (i.e. it needs to construct its vocabulary) before being added to a `Sequential` model.

In [None]:
# Use the text vectorization layer to normalize, split, and map strings to
# integers. The built-in 'lower_and_strip_punctuation' standardization is used
# here (a custom callable could be passed instead).
# Set a maximum sequence length, as not all samples have the same length.

vectorize_layer = layers.TextVectorization(max_tokens=??, standardize='lower_and_strip_punctuation',#or custom
                                           output_mode='int', output_sequence_length=??)
vectorize_layer.adapt(??)

In [None]:
vocab = np.array(vectorize_layer.get_vocabulary())
vocab[:20]

We construct a toy MLP example on top of the embedding layers, using the `Sequential` class.
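
The computation performed on top of the embeddings by such a model (average pooling over the sequence, a ReLU hidden layer, a softmax output) can be sketched in NumPy as follows. All sizes and weights are toy assumptions. Dropout is omitted because it is inactive at prediction time.

```python
import numpy as np

rng = np.random.default_rng(0)
batch, seq_len, emb_dim, hidden, n_out = 4, 6, 5, 3, 10  # toy sizes (hypothetical)

# Stand-in for the output of an embedding layer: (batch, sequence, embedding_dim).
embedded = rng.normal(size=(batch, seq_len, emb_dim))

# Randomly initialized dense-layer weights.
W1, b1 = rng.normal(size=(emb_dim, hidden)), np.zeros(hidden)
W2, b2 = rng.normal(size=(hidden, n_out)), np.zeros(n_out)

def softmax(z):
    """Numerically stable softmax over the last axis."""
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

pooled = embedded.mean(axis=1)                    # global average pooling over time
hidden_act = np.maximum(pooled @ W1 + b1, 0.0)    # dense layer with ReLU
probs = softmax(hidden_act @ W2 + b2)             # class probabilities

print(probs.shape)
```

Note that a plain average treats padded positions like any other position unless a mask is propagated from the embedding layer, and that averaging discards word order entirely.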

In [None]:
embedding_dim=50

MLP_model = Sequential([
    ??,
    layers.Embedding(input_dim=??, output_dim=??,
                     embeddings_initializer='uniform'),
    layers.GlobalAveragePooling1D(),
    layers.Dense(64, activation='relu'),
    layers.Dropout(0.2),
    layers.Dense(n_classes, activation='softmax')
], name="MLP_model")

MLP_model.summary()

We assign the `Adam` optimizer and the `SparseCategoricalCrossentropy` loss to the model for training.
We then train the model on the training data and log the validation loss.
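
For reference, the sparse categorical cross-entropy is the average negative log-probability that the model assigns to the true class, with labels given as integers rather than one-hot vectors. A NumPy sketch follows. The function name and the clipping constant are assumptions chosen for illustration.

```python
import numpy as np

def sparse_categorical_crossentropy(y_true, probs, eps=1e-7):
    """Mean of -log(probability assigned to the true class); labels are integers."""
    probs = np.clip(probs, eps, 1.0)  # avoid log(0); eps is an assumed constant
    true_class_probs = probs[np.arange(len(y_true)), y_true]
    return float(np.mean(-np.log(true_class_probs)))

# Two samples, three classes: predicted probabilities and integer labels.
toy_probs = np.array([[0.7, 0.2, 0.1],
                      [0.1, 0.8, 0.1]])
toy_labels = np.array([0, 1])

toy_loss = sparse_categorical_crossentropy(toy_labels, toy_probs)
print(toy_loss)
```

Since the model's last layer already applies a softmax, the loss receives probabilities rather than logits, which is what `from_logits=False` expresses.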

In [None]:
MLP_model.compile(optimizer=optimizers.Adam(),
                  loss=losses.SparseCategoricalCrossentropy(from_logits=False),
                  metrics=['accuracy'])

epochs = 10
history_MLP = MLP_model.fit(X_train, y_train, validation_data=(X_valid, y_valid),
                            batch_size=128, epochs=epochs)

MLP_model.save_weights("./checkpoints/MLP_10/MLP_10")
#MLP_model.save("./checkpoints/MLP_10_model")

In [None]:
MLP_model.load_weights("./checkpoints/MLP_10/MLP_10")

We evaluate the model on the validation data, and inspect the training `history`.
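
The accuracy reported during evaluation is the fraction of samples whose most probable class matches the true label. A small NumPy sketch with made-up probabilities:

```python
import numpy as np

# Predicted class probabilities for four samples (hypothetical values).
toy_probs = np.array([[0.7, 0.2, 0.1],
                      [0.1, 0.8, 0.1],
                      [0.3, 0.3, 0.4],
                      [0.5, 0.4, 0.1]])
toy_labels = np.array([0, 1, 0, 0])

# Predicted class = index of the largest probability in each row.
predicted = toy_probs.argmax(axis=1)

# Accuracy = share of correct predictions.
toy_accuracy = float(np.mean(predicted == toy_labels))
print(predicted, toy_accuracy)
```

With 10 classes of unequal frequency, a useful baseline to compare against is the accuracy obtained by always predicting the most frequent class.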

In [None]:
?, ? = MLP_model.evaluate(??, ??)

print("Loss: ", loss)
print("Accuracy: ", accuracy)

In [None]:
history_MLP_dict = history_MLP.history
history_MLP_dict.keys()

In [None]:
def plot_training(history_dict):
    acc = history_dict['accuracy']
    val_acc = history_dict['val_accuracy']
    loss = history_dict['loss']
    val_loss = history_dict['val_loss']
    
    epochs = range(1, len(loss) + 1)
    
    plt.figure(figsize=(10, 4))
    plt.subplot(1, 2, 1)
    plt.plot(epochs, loss, 'b--', label='Training')
    plt.plot(epochs, val_loss, 'r-', label='Validation')
    plt.xlabel('Epochs')
    plt.ylabel('Cross-Entropy Loss')
    plt.legend()
    
    plt.subplot(1, 2, 2)
    plt.plot(epochs, acc, 'b--', label='Training')
    plt.plot(epochs, val_acc, 'r-', label='Validation')
    plt.xlabel('Epochs')
    plt.ylabel('Accuracy')
    plt.legend(loc='lower right')
    
    plt.show()

plot_training(history_MLP_dict)

We predict the most likely character for a few new example lines.

In [None]:
examples = ["Homer, come here!",
            "Science is amazing.",
            "Duh"]

MLP_model.predict(examples)

In [None]:
def predict_class(tf_model, data, class_names=main_characters):
    """Returns the string model predictions (class name of the largest network output activation)."""
    return np.array(class_names)[tf_model.predict(data).argmax(axis=1)]

In [None]:
predict_class(??, ??)

A few tutorials for more details:
- https://www.tensorflow.org/text/guide/word_embeddings
- https://www.tensorflow.org/tutorials/keras/text_classification

##  Recurrent network architectures

We now construct a few recurrent architectures on top of the embedding layers. These capture sequential dependencies in the input text through latent (hidden) states, and can also deal with varying text sequence lengths.

The three main types of recurrent layers implemented in `tensorflow` are `SimpleRNN`, `LSTM` and `GRU`.
Here we use `LSTM` layers, but one can in principle substitute any of the others.

As the meaning of a word can depend not only on the context preceding it but also on the context following it, "bidirectional" recurrent architectures are often used in NLP tasks. The `Bidirectional` layer wrapper makes this easy to implement in a `Sequential` model.
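
The idea of a recurrent layer and of its bidirectional variant can be sketched in NumPy. For brevity, the sketch uses the simplest recurrence, h_t = tanh(x_t W + h_{t-1} U + b), rather than an LSTM cell. The sizes, weights and function names are toy assumptions.

```python
import numpy as np

def simple_rnn(x, W, U, b):
    """Apply h_t = tanh(x_t W + h_{t-1} U + b) over time; return the last hidden state."""
    h = np.zeros((x.shape[0], U.shape[0]))
    for t in range(x.shape[1]):
        h = np.tanh(x[:, t] @ W + h @ U + b)
    return h

rng = np.random.default_rng(0)
batch, seq_len, emb_dim, units = 2, 7, 5, 4  # toy sizes (hypothetical)
x = rng.normal(size=(batch, seq_len, emb_dim))  # stand-in for embedded tokens

def init_weights():
    """Input weights, recurrent weights and bias for one direction."""
    return (rng.normal(size=(emb_dim, units)),
            rng.normal(size=(units, units)),
            np.zeros(units))

# Each direction has its own set of weights.
fwd_weights, bwd_weights = init_weights(), init_weights()

h_forward = simple_rnn(x, *fwd_weights)            # reads the sequence left to right
h_backward = simple_rnn(x[:, ::-1], *bwd_weights)  # reads the sequence right to left

# Concatenating both summaries doubles the output dimension.
h_bidirectional = np.concatenate([h_forward, h_backward], axis=-1)
print(h_bidirectional.shape)
```

The doubled output size explains why a bidirectional layer with 64 units feeds 128 features into the next layer when the two directions are concatenated.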

A few tutorials to go further:
- https://www.tensorflow.org/text/tutorials/text_classification_rnn
- https://www.tensorflow.org/guide/keras/rnn

In [None]:
max_vocab = 10000
vectorize_layer = layers.TextVectorization(max_tokens=max_vocab, standardize='lower_and_strip_punctuation',
                                           output_mode='int', output_sequence_length=None)
vectorize_layer.adapt(X_train)

Here are a few examples of LSTM-based architectures:

In [None]:
embedding_dim=50

# Using masking with 'mask_zero=True' to handle the variable sequence lengths in subsequent layers.

LSTM_model = tf.keras.Sequential([
    vectorize_layer,
    layers.Embedding(input_dim=len(vectorize_layer.get_vocabulary()), output_dim=embedding_dim,
                     embeddings_initializer='uniform', mask_zero=True),
    layers.LSTM(64),
    layers.Dense(64, activation='relu'),
    layers.Dense(n_classes, activation='softmax')
], name="LSTM_model")

BLSTM_model = tf.keras.Sequential([
    vectorize_layer,
    layers.Embedding(input_dim=len(vectorize_layer.get_vocabulary()), output_dim=embedding_dim,
                     embeddings_initializer='uniform', mask_zero=True),
    layers.Bidirectional(layers.LSTM(64), merge_mode='concat'),
    layers.Dense(n_classes, activation='softmax')
], name="BLSTM_model")

BLSTM2_model = tf.keras.Sequential([
    vectorize_layer,
    layers.Embedding(input_dim=len(vectorize_layer.get_vocabulary()), output_dim=embedding_dim,
                     embeddings_initializer='uniform', mask_zero=True),
    layers.Bidirectional(layers.LSTM(64,  return_sequences=True), merge_mode='concat'),
    layers.Bidirectional(layers.LSTM(32), merge_mode='concat'),
    layers.Dense(64, activation='relu'),
    layers.Dropout(0.5),
    layers.Dense(n_classes, activation='softmax')
], name="BLSTM2_model")


Masking, enabled with `mask_zero=True` in the embedding layer, is used to handle the variable sequence lengths in subsequent layers; see https://www.tensorflow.org/guide/keras/masking_and_padding.
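
The effect of masking on a recurrent layer can be illustrated with a NumPy sketch: at masked (padded) time steps the hidden state is simply carried over, so a padded sequence yields the same final state as its unpadded version. The sketch again uses a simple tanh recurrence instead of an LSTM, and all names and sizes are assumptions.

```python
import numpy as np

def masked_rnn(x, mask, W, U, b):
    """Simple tanh recurrence that keeps the previous state at masked time steps."""
    h = np.zeros((x.shape[0], U.shape[0]))
    for t in range(x.shape[1]):
        h_new = np.tanh(x[:, t] @ W + h @ U + b)
        # Update only where the mask is True; otherwise carry the state over.
        h = np.where(mask[:, t, None], h_new, h)
    return h

rng = np.random.default_rng(0)
emb_dim, units = 5, 4  # toy sizes (hypothetical)
W = rng.normal(size=(emb_dim, units))
U = rng.normal(size=(units, units))
b = np.zeros(units)

# A sequence of 3 real tokens, and the same sequence followed by 2 padded positions.
short = rng.normal(size=(1, 3, emb_dim))
padded = np.concatenate([short, rng.normal(size=(1, 2, emb_dim))], axis=1)

full_mask = np.ones((1, 3), dtype=bool)
pad_mask = np.array([[True, True, True, False, False]])  # i.e. token_ids != 0

h_short = masked_rnn(short, full_mask, W, U, b)
h_padded = masked_rnn(padded, pad_mask, W, U, b)
print(np.allclose(h_short, h_padded))
```

Without the mask, the two padded steps would keep updating the state, and the final representation would depend on how much padding the batch happened to need.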

In [None]:
BLSTM_model.compile(optimizer=optimizers.Adam(learning_rate=1e-4),
                    loss=losses.SparseCategoricalCrossentropy(from_logits=False),
                    metrics=['accuracy'])

epochs = 10#100
history_lstm = BLSTM_model.fit(X_train, y_train, validation_data=(X_valid, y_valid),
                               batch_size=128, epochs=epochs)

BLSTM_model.save_weights("./checkpoints/BLSTM_10/BLSTM_10")
#BLSTM_model.save("./checkpoints/BLSTM_10_model")

In [None]:
BLSTM_model.load_weights("./checkpoints/BLSTM_10/BLSTM_10")
history_lstm_dict = history_lstm.history

In [None]:
test_loss, test_acc = BLSTM_model.evaluate(X_valid, y_valid)

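# Note: these metrics are computed on the validation split, not on a separate test set.
print('Validation Loss:', test_loss)
print('Validation Accuracy:', test_acc)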

plot_training(history_lstm_dict)

In [None]:
BLSTM_model.predict(examples)
predict_class(BLSTM_model, examples)

## Using pretrained word vectors

Instead of randomly initializing the `Embedding` layer, one can initialize it with pretrained word vectors.
Here we use pretrained GloVe vectors.

In [None]:
from gensim.models import KeyedVectors
import gensim.downloader as gensim_api

In [None]:
embedding_dim = 100
glv_embd = gensim_api.load("glove-wiki-gigaword-100")

In [None]:
max_vocab = 10000
vectorize_layer = layers.TextVectorization(max_tokens=max_vocab, standardize='lower_and_strip_punctuation',
                                           output_mode='int', output_sequence_length=None)
vectorize_layer.adapt(X_train)

In [None]:
vocab = vectorize_layer.get_vocabulary(include_special_tokens=True)
embedding_matrix = np.random.uniform(-0.05, 0.05, (len(vocab), embedding_dim))

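oov_words = list()
for i, w in enumerate(vocab):
    try:
        # Use the unit-normalized GloVe vector when the word is known.
        embedding_matrix[i,] = glv_embd.get_vector(w, norm=True)
    except KeyError:
        # Out-of-vocabulary word: reuse the vector of the '[UNK]' token (index 1).
        embedding_matrix[i,] = embedding_matrix[1,]
        oov_words += [w]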

#print(oov_words)
print(len(oov_words))

In [None]:
embedder = layers.Embedding(input_dim=len(vectorize_layer.get_vocabulary()), output_dim=embedding_dim,
                            embeddings_initializer=keras.initializers.Constant(??),
                            mask_zero=True, trainable=False, name='embedding')

In [None]:
X_train[[7]]

In [None]:
glv_embd.get_vector("finally", norm=True)

In [None]:
embedder(??)

In [None]:
sum(glv_embd.get_vector("finally", norm=True) != embedder(vectorize_layer(X_train[[7]])).numpy()[0,0,])

In [None]:
#Pretrained embeddings:
BLSTM_Glv_model = tf.keras.Sequential([
    vectorize_layer,
    layers.Embedding(input_dim=len(vectorize_layer.get_vocabulary()), output_dim=embedding_dim,
                     embeddings_initializer=keras.initializers.Constant(embedding_matrix),
                     mask_zero=True, trainable=False, name='embedding'),
    layers.Bidirectional(layers.LSTM(64)),
    layers.Dense(n_classes, activation='softmax')
], name="BLSTM_Glv_model")

In [None]:
BLSTM_Glv_model.compile(optimizer=optimizers.Adam(learning_rate=1e-4),
                        loss=losses.SparseCategoricalCrossentropy(from_logits=False),
                        metrics=['accuracy'])

epochs_f = 5
epochs_t = epochs_f + 10

#Probably not ideal, but as an idea of what one can do:

history_glv_f = BLSTM_Glv_model.fit(X_train, y_train, validation_data=(X_valid, y_valid),
                                    batch_size=128, epochs=epochs_f)
BLSTM_Glv_model.save_weights("./checkpoints/Glv_5/Glv_5")

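# Unfreeze the embedding layer, then recompile so that the change of `trainable`
# is taken into account (note that this also resets the optimizer state).
BLSTM_Glv_model.layers[1].trainable = True
BLSTM_Glv_model.compile(optimizer=optimizers.Adam(learning_rate=1e-4),
                        loss=losses.SparseCategoricalCrossentropy(from_logits=False),
                        metrics=['accuracy'])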

history_glv_l = BLSTM_Glv_model.fit(X_train, y_train, validation_data=(X_valid, y_valid),
                                    batch_size=128, epochs=epochs_t, initial_epoch=epochs_f)
BLSTM_Glv_model.save_weights("./checkpoints/Glv_15/Glv_15")

In [None]:
BLSTM_Glv_model.load_weights("./checkpoints/Glv_15/Glv_15")

In [None]:
test_loss, test_acc = BLSTM_Glv_model.evaluate(X_valid, y_valid)

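# Note: these metrics are computed on the validation split, not on a separate test set.
print('Validation Loss:', test_loss)
print('Validation Accuracy:', test_acc)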

## Exercise: Fine-tuning

Try to achieve the best validation accuracy by fine-tuning the hyperparameters of any method above.

The following two `callbacks` examples could be useful during training:

    early_stopping_cb = keras.callbacks.EarlyStopping(patience=5, restore_best_weights=True, verbose=1)
    checkpoint_cb = keras.callbacks.ModelCheckpoint(filepath="checkpoint/network_epoch={epoch:02d}-val_accuracy={val_accuracy:.2f}.h5")
    callbacks=[early_stopping_cb, checkpoint_cb]
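
The logic behind early stopping with a patience can be sketched in pure Python: training stops once the monitored validation loss has not improved for `patience` consecutive epochs, and the weights of the best epoch are the ones to keep. The function name and the loss values below are illustrative assumptions. The sketch only mimics the stopping rule and does not train anything.

```python
def early_stopping_epoch(val_losses, patience):
    """Return (epoch at which training stops, epoch with the best validation loss)."""
    best_loss = float("inf")
    best_epoch = None
    wait = 0  # number of consecutive epochs without improvement
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best_loss:
            best_loss, best_epoch, wait = loss, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                return epoch, best_epoch
    return len(val_losses), best_epoch

# Hypothetical validation losses over 10 epochs.
toy_val_losses = [1.0, 0.8, 0.7, 0.75, 0.72, 0.71, 0.74, 0.73, 0.9, 0.6]
stop_epoch, best_epoch = early_stopping_epoch(toy_val_losses, patience=5)
print(stop_epoch, best_epoch)
```

In this toy run the better loss at epoch 10 is never reached, which shows the trade-off in choosing the patience: a small value saves time but can stop before a late improvement.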

To go further, `keras_tuner` can also be useful: https://www.tensorflow.org/tutorials/keras/keras_tuner.

#### draft...

In [None]:
#json.dump(history_MLP.history, open("./checkpoints/history/MLP_10.txt", 'w'))
history_MLP_dict = json.load(open("./checkpoints/history/MLP_10.txt", 'r'))

In [None]:
#json.dump(history_lstm.history, open("./checkpoints/history/BLSTM_10.txt", 'w'))
history_lstm_dict = json.load(open("./checkpoints/history/BLSTM_10.txt", 'r'))