# Simple Sentiment Classification: LSTM

We continue the sentiment analysis on the IMDb dataset and extend the bag-of-words approach of the previous notebook with the following components:
* we use an encoding that takes the order of the words into account
* we use pre-trained word embeddings
* we use LSTM layers to take the neighborhood of the words into account.

## Set-up
First of all, we load the libraries needed for this task. We use Keras and TensorFlow for this code example, so we load the relevant parts of these frameworks:

In [None]:
import numpy as np
import tensorflow as tf
import tensorflow_datasets as tfds

In [None]:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Activation, Dropout
from tensorflow.keras.layers import Input, TextVectorization, Embedding, Flatten, LSTM, Bidirectional
from tensorflow.keras.losses import BinaryCrossentropy
from tensorflow.keras.callbacks import EarlyStopping
from tensorflow.keras.initializers import Constant
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.utils import to_categorical

In [None]:
# some more general libraries for evaluation purposes:
import matplotlib.pyplot as plt
import datetime
import pickle

In [None]:
# initialize random number generators to ensure reproducibility:
tf.random.set_seed(123)
np.random.seed(123)

In [None]:
# set some model parameters
VOCAB_SIZE = 5000
NUM_EPOCHS = 50 # set lower for fast results - set higher for good results
BUFFER_SIZE = 10000
BATCH_SIZE = 512
EMBED_DIM = 100

Optional: Set up Google Drive

In [None]:
use_gdrive = False
if use_gdrive:
    from google.colab import drive
    drive.mount('/content/gdrive', force_remount = True)

    # define target directories
    targetDir_root = 'gdrive/MyDrive/CAS_AIS_2024_FS/Results/'
    targetDir_models = targetDir_root + 'trainedWeights/'
    targetDir_results = targetDir_root + 'PerformanceMeasures/'

## Loading the Data
The data is loaded in the same way as before:

In [None]:
train_ds, val_ds, test_ds = tfds.load(
    name = "imdb_reviews",
    split = [ 'train[:80%]', 'train[80%:]', 'test' ],
    as_supervised = True)
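
For orientation, the split strings above translate into concrete sizes. The following is a small sketch in plain Python; it assumes the standard IMDb sizes of 25,000 training and 25,000 test reviews, and the helper name `split_sizes` is made up for this illustration (it is not part of `tfds`).

```python
import math

# Assumption: the standard IMDb split sizes (25,000 train / 25,000 test).
N_TRAIN_FULL = 25_000
N_TEST = 25_000
BATCH_SIZE = 512  # same value as in the set-up cell


def split_sizes(n_total, percent):
    """Hypothetical helper: sizes of 'train[:p%]' and 'train[p%:]'."""
    n_first = n_total * percent // 100
    return n_first, n_total - n_first


n_train, n_val = split_sizes(N_TRAIN_FULL, 80)
steps_per_epoch = math.ceil(n_train / BATCH_SIZE)

print(n_train, n_val, N_TEST)  # 20000 5000 25000
print(steps_per_epoch)         # 40
```

With a batch size of 512, one epoch therefore consists of only 40 weight updates, which is one reason why a fairly large number of epochs is allowed.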

In [None]:
for example, label in train_ds.take(1):
  print("Input: ", example)
  print(10*".")
  print('Target labels: ', label)
  print(50*"-")

## Text Representation:
A first change concerns the text representation: while we used `output_mode = "count"` in the previous notebook, we now drop this argument for `TextVectorization`, so the layer falls back to its default output mode `"int"` and returns a sequence of word indices.

In [None]:
encoderSEQ = TextVectorization(max_tokens=VOCAB_SIZE)
# previously, we had 'output_mode = "count", ' as additional arguments for TextVectorization
encoderSEQ.adapt(train_ds.map(lambda text, label: text))

The vocabulary contains the same words as for the `encoderBoW`, except that index 0 is now reserved for padding:

In [None]:
vocab = np.array(encoderSEQ.get_vocabulary())
vocab[:20]

The first entry in the vocabulary is the empty string, which is reserved for padding, followed by `[UNK]`, the token for unknown words (i.e. all words that do not appear in the vocabulary). Afterwards, we have a number of tokens for very common words, the so-called **stop words**, the first one being 'the'. So, in the index sequence that we get after encoding, the value 0 marks padding, 1 marks an unknown word, and 2 corresponds to the word 'the'. Some *domain-specific* words also occur frequently: `movie` and `film` indicate that the vocabulary was built on movie reviews.

We can now get an example encoding:

In [None]:
encoderSEQ("the").numpy()

In [None]:
example

In [None]:
encoderSEQ(example).numpy()

Now, the output is a sequence of the word indices. So we can try to reconstruct the input text:

In [None]:
print("Original: ", example.numpy())
print("Reconstruction: ", " ".join(vocab[encoderSEQ(example).numpy()]))
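
The round trip above can be mimicked without TensorFlow. The sketch below uses a made-up six-entry vocabulary in the layout of the `"int"` output mode (index 0 for padding, index 1 for unknown words); the functions `standardize`, `encode` and `decode` are illustrative stand-ins, and the regular expression only approximates the layer's default lower-casing and punctuation stripping.

```python
import re

# Toy vocabulary (assumption): index 0 = padding, index 1 = unknown words,
# followed by the known words in order of frequency.
toy_vocab = ["", "[UNK]", "the", "movie", "was", "great"]
word_to_index = {word: i for i, word in enumerate(toy_vocab)}


def standardize(text):
    """Lower-case and strip punctuation (a simplification of the default)."""
    return re.sub(r"[^\w\s]", "", text.lower())


def encode(text):
    """Map each word to its index; unknown words map to index 1."""
    return [word_to_index.get(w, 1) for w in standardize(text).split()]


def decode(indices):
    """Map indices back to vocabulary entries."""
    return " ".join(toy_vocab[i] for i in indices)


ids = encode("The movie was GREAT, the soundtrack too!")
print(ids)          # [2, 3, 4, 5, 2, 1, 1]
print(decode(ids))  # the movie was great the [UNK] [UNK]
```

The word order survives the encoding, while casing, punctuation and all out-of-vocabulary words are lost in the reconstruction.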

## Preparation for Model Comparison
We want to move on to more complex models. In order to be prepared, we first define a function that does the training and evaluation for us:

In [None]:
def fitAndEval(myModel, from_logits = True, model_name = ''):
    # compile
    myModel.compile(loss = BinaryCrossentropy(from_logits=from_logits),
                    optimizer = 'adam', metrics = ['accuracy'])

    # set seeds
    tf.random.set_seed(123)

    # Train
    myHistory = myModel.fit(
        train_ds.shuffle(buffer_size=BUFFER_SIZE).batch(BATCH_SIZE),
        validation_data = val_ds.batch(BATCH_SIZE),
        epochs = NUM_EPOCHS, verbose = 1,
        callbacks = [ EarlyStopping(monitor='val_accuracy', patience=5,
                                    verbose=False, restore_best_weights=True)])

    # Collect the training history
    myHistory_dict = myHistory.history

    resDict = {}
    resDict['train_loss'] = myHistory_dict['loss']
    resDict['val_loss'] = myHistory_dict['val_loss']
    resDict['train_accuracy'] = myHistory_dict['accuracy']
    resDict['val_accuracy'] = myHistory_dict['val_accuracy']
    resDict['epochs'] = range(1, len(resDict['train_accuracy']) + 1)
    resDict['model_name'] = model_name

    return resDict
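
The `EarlyStopping` callback used above stops training once `val_accuracy` has not improved for `patience` epochs, and `restore_best_weights=True` then rolls the model back to the best epoch. The following plain-Python sketch mimics this rule on a made-up list of validation accuracies. The function name `early_stopping_epoch` and the numbers are assumptions for illustration, and the real callback offers more options (e.g. `min_delta`).

```python
def early_stopping_epoch(val_scores, patience):
    """Return (last epoch that is run, epoch with the best score).

    Simplified rule: stop as soon as the score has failed to improve
    on the best value seen so far for `patience` consecutive epochs.
    """
    best_score = float("-inf")
    best_epoch = 0
    wait = 0
    for epoch, score in enumerate(val_scores, start=1):
        if score > best_score:
            best_score, best_epoch, wait = score, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                return epoch, best_epoch
    return len(val_scores), best_epoch


# Made-up validation accuracies, one value per epoch.
toy_val_accuracy = [0.70, 0.80, 0.85, 0.84, 0.86, 0.85,
                    0.85, 0.84, 0.86, 0.85, 0.83]

stopped_at, best_at = early_stopping_epoch(toy_val_accuracy, patience=5)
print(stopped_at, best_at)  # 10 5
```

In this toy run, training ends after epoch 10, and the weights of epoch 5 would be the ones kept. Note that reaching the same value again (epoch 9) does not count as an improvement.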

# A First LSTM Model
Now, let's define and train our first LSTM using the helper function `fitAndEval`:

In [None]:
# initialize random number generators to ensure reproducibility:
tf.random.set_seed(123)
np.random.seed(123)

In [None]:
model_embed_1LSTM = Sequential()
model_embed_1LSTM.add(Input(shape=(1,), dtype='string'))
model_embed_1LSTM.add(encoderSEQ)
model_embed_1LSTM.add(Embedding(VOCAB_SIZE, EMBED_DIM))
model_embed_1LSTM.add(Bidirectional(LSTM(64)))
model_embed_1LSTM.add(Dense(1, activation="sigmoid"))

This is the first model that needs a significant training time. Therefore, this notebook can be run in two ways: either train the models from scratch, or use the precomputed weights. To train from scratch, set `train_from_scatch` to `True`. We suggest that you do not change the model and file names: when you train from scratch, the weights and results are saved under these names, and otherwise they are loaded from the same files.

In [None]:
train_from_scatch = False

model_name = 'model_5kW_embed_1LSTM_ADAM'
model_weight_file = model_name + '_weights'
model_result_file = model_name + '_Results.pkl'

if train_from_scatch: 
    resDict_embed_1LSTM = fitAndEval(model_embed_1LSTM, from_logits=False,
                                     model_name = model_name)
    # save weights and results
    model_embed_1LSTM.save_weights(model_weight_file)
    with open(model_result_file, 'wb') as f:
        pickle.dump(resDict_embed_1LSTM, f)
else:
    model_embed_1LSTM.load_weights(model_weight_file)
    with open(model_result_file, 'rb') as input_file:
        resDict_embed_1LSTM = pickle.load(input_file)

In [None]:
resDict_embed_1LSTM['model_name'] = model_name

resDict_embed_1LSTM['model_name']

model_name = 'model_5kW_embed_1LSTM_ADAM'
model_weight_file = model_name + '_weights'
model_result_file = model_name + '_Results.pkl'
model_result_file

In [None]:
with open(model_result_file, 'wb') as f:
    pickle.dump(resDict_embed_1LSTM, f)


In [None]:
model_embed_1LSTM.summary()
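
The parameter counts reported by the summary can be checked by hand. The sketch below assumes the Keras default layer configurations (an LSTM with four gates and bias terms, and a `Bidirectional` wrapper that concatenates both directions); the helper functions are written for this illustration only.

```python
VOCAB_SIZE = 5000
EMBED_DIM = 100
LSTM_UNITS = 64


def embedding_params(vocab_size, embed_dim):
    # one vector of length embed_dim per vocabulary entry
    return vocab_size * embed_dim


def lstm_params(input_dim, units):
    # 4 gates, each with input weights, recurrent weights and a bias
    return 4 * (units * (input_dim + units) + units)


def dense_params(input_dim, units):
    # weight matrix plus bias
    return input_dim * units + units


n_embed = embedding_params(VOCAB_SIZE, EMBED_DIM)
n_bilstm = 2 * lstm_params(EMBED_DIM, LSTM_UNITS)  # forward + backward
n_dense = dense_params(2 * LSTM_UNITS, 1)          # concatenated states
n_total = n_embed + n_bilstm + n_dense

print(n_embed, n_bilstm, n_dense, n_total)  # 500000 84480 129 584609
```

Most of the parameters sit in the embedding layer, so the choice of vocabulary size and embedding dimension dominates the model size.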

In [None]:
plt.plot(resDict_embed_1LSTM['epochs'], resDict_embed_1LSTM['train_accuracy'],
         'r:', label = resDict_embed_1LSTM['model_name'] +', Training acc')
plt.plot(resDict_embed_1LSTM['epochs'], resDict_embed_1LSTM['val_accuracy'],
         'r',  label = resDict_embed_1LSTM['model_name'] +', Validation acc')

plt.title('Training and validation accuracy')
plt.xlabel('Epochs')
plt.ylabel('Accuracy')
plt.legend(loc='lower right')
plt.grid(True)
plt.show()

We see that our model overfits considerably: on the training set, it reaches an accuracy of over 95% after several epochs, but on the validation data, the accuracy does not exceed approx. 86%.
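
The gap between training and validation accuracy can also be read off numerically from a result dictionary with the keys used in `fitAndEval`. This is a minimal sketch: the helper `summarize_run` and the numbers in `toy_results` are made up for illustration and are not results of this notebook.

```python
def summarize_run(res):
    """Best validation epoch and the train/validation gap at that epoch."""
    val = res["val_accuracy"]
    best_idx = max(range(len(val)), key=val.__getitem__)
    return {
        "best_epoch": res["epochs"][best_idx],
        "best_val_accuracy": val[best_idx],
        "gap": res["train_accuracy"][best_idx] - val[best_idx],
    }


# Made-up numbers in the same layout as the dictionaries from fitAndEval.
toy_results = {
    "epochs": range(1, 6),
    "train_accuracy": [0.70, 0.85, 0.92, 0.96, 0.98],
    "val_accuracy": [0.72, 0.83, 0.86, 0.85, 0.84],
}

summary = summarize_run(toy_results)
print(summary["best_epoch"], round(summary["gap"], 2))  # 3 0.06
```

A gap that keeps growing after the best validation epoch, as in the toy numbers, is the typical signature of overfitting.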

# Using Pretrained Word Embeddings


Here, we use the pretrained word embeddings from GloVe:

In [None]:
have_glove = False # set to true when downloaded

if use_gdrive and have_glove:
    glove_file = targetDir_models + 'glove.6B.100d.txt'
    # have_glove = !test -f $glove_file;
    %cp gdrive/MyDrive/CAS_AIS_2024_FS/Results/trainedWeights/glove.6B.100d.txt .

else:
    # The following commands need to be executed the first time this notebook is run:
    !wget http://nlp.stanford.edu/data/glove.6B.zip
    !unzip -q glove.6B.zip

# keep a copy on Google Drive for later runs
if use_gdrive and not have_glove:
    %cp glove.6B.100d.txt $targetDir_models

Now we will use these pretrained word vectors to represent our texts:

In [None]:
path_to_glove_file = "glove.6B.100d.txt"

embeddings_index = {}
with open(path_to_glove_file, encoding="utf8") as f:
    for line in f:
        word, coefs = line.split(maxsplit=1)
        coefs = np.fromstring(coefs, "f", sep=" ")
        embeddings_index[word] = coefs

print("Found %s word vectors." % len(embeddings_index))
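
Each line of the GloVe text file holds a word followed by its vector components, separated by spaces. The sketch below parses two made-up lines in that layout with the same `split(maxsplit=1)` idea, using plain Python floats instead of NumPy; the three-dimensional vectors are invented for illustration (the real file has 100 values per word).

```python
import io

# Two made-up lines in the GloVe text layout: word, then the vector values.
toy_glove_file = io.StringIO(
    "movie 0.1 -0.2 0.3\n"
    "film 0.0 0.5 -1.5\n"
)

toy_index = {}
for line in toy_glove_file:
    # split off the word; everything after the first space is the vector
    word, coefs = line.split(maxsplit=1)
    toy_index[word] = [float(x) for x in coefs.split()]

print(toy_index["movie"])  # [0.1, -0.2, 0.3]
print(len(toy_index))      # 2
```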

In [None]:
hits = 0
misses = 0

# Prepare embedding matrix
embedding_matrix = np.zeros((VOCAB_SIZE, EMBED_DIM))
for i, word in enumerate(encoderSEQ.get_vocabulary()):
    embedding_vector = embeddings_index.get(word)
    if embedding_vector is not None:
        embedding_matrix[i] = embedding_vector
        hits += 1
    else:
        # Words not found in the embedding index stay all-zeros.
        # This includes the representation for "padding" and "OOV".
        misses += 1
        # print(word)
print("Converted %d words (%d misses)" % (hits, misses))
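
Conceptually, an embedding layer initialized with such a matrix performs a row lookup: token index `i` is replaced by row `i` of the matrix. The NumPy sketch below shows this on a made-up five-word vocabulary with invented two-dimensional vectors; the names `toy_vocab`, `toy_index` and `toy_matrix` are assumptions for this illustration.

```python
import numpy as np

# Made-up vocabulary and 2-d "pretrained" vectors (illustration only).
toy_vocab = ["", "[UNK]", "the", "movie", "zzyzx"]
toy_index = {"the": [0.1, 0.2], "movie": [0.3, 0.4]}

# Build the matrix as above: rows of unknown words stay all-zeros.
toy_matrix = np.zeros((len(toy_vocab), 2))
for i, word in enumerate(toy_vocab):
    vector = toy_index.get(word)
    if vector is not None:
        toy_matrix[i] = vector

# Encoded text "the movie [UNK] <padding>" as a sequence of indices.
token_ids = np.array([2, 3, 1, 0])
embedded = toy_matrix[token_ids]  # row lookup, one vector per token

print(embedded.shape)  # (4, 2)
print(embedded[2])     # [0. 0.]
```

Padding, unknown words and vocabulary words without a pretrained vector all map to the zero vector, so the model cannot tell them apart unless the embeddings are trained further.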

## Pretrained Word Embeddings without Adaptation
We use the same LSTM model as above, but initialize the embedding with the pretrained vectors from GloVe and freeze it (`trainable=False`).

In [None]:
model_pe100_1LSTM = Sequential()
model_pe100_1LSTM.add(Input(shape=(1,), dtype='string'))
model_pe100_1LSTM.add(encoderSEQ)
model_pe100_1LSTM.add(Embedding(
    VOCAB_SIZE,
    EMBED_DIM,
    embeddings_initializer=Constant(embedding_matrix),
    trainable=False))
model_pe100_1LSTM.add(Bidirectional(LSTM(64)))
model_pe100_1LSTM.add(Dense(1, activation="sigmoid"))

In [None]:
train_from_scatch = False

model_name = 'model_5kW_pe100_1LSTM_ADAM'
model_weight_file = model_name + '_weights'
model_result_file = model_name + '_Results.pkl'

if train_from_scatch: 
    resDict_pe100_1LSTM = fitAndEval(model_pe100_1LSTM, from_logits=False,
                                     model_name = model_name)
    # save weights and results
    model_pe100_1LSTM.save_weights(model_weight_file)
    with open(model_result_file, 'wb') as f:
        pickle.dump(resDict_pe100_1LSTM, f)
else:
    model_pe100_1LSTM.load_weights(model_weight_file)
    with open(model_result_file, 'rb') as input_file:
        resDict_pe100_1LSTM = pickle.load(input_file)

In [None]:
plt.plot(resDict_embed_1LSTM['epochs'], resDict_embed_1LSTM['train_accuracy'],
         'r:', label = resDict_embed_1LSTM['model_name'] +', Training acc')
plt.plot(resDict_embed_1LSTM['epochs'], resDict_embed_1LSTM['val_accuracy'],
         'r',  label = resDict_embed_1LSTM['model_name'] +', Validation acc')

plt.plot(resDict_pe100_1LSTM['epochs'], resDict_pe100_1LSTM['train_accuracy'],
         'b:', label = resDict_pe100_1LSTM['model_name'] +', Training acc')
plt.plot(resDict_pe100_1LSTM['epochs'], resDict_pe100_1LSTM['val_accuracy'],
         'b',  label = resDict_pe100_1LSTM['model_name'] +', Validation acc')

plt.title('Training and validation accuracy')
plt.xlabel('Epochs')
plt.ylabel('Accuracy')
plt.legend(loc='lower right')
plt.grid(True)
plt.show()

## Pretrained Word Embeddings with Adaptation
Since the performance actually falls below that of the LSTM model with embeddings trained from scratch, we implement a third model in which the pretrained embeddings serve as a starting point, from where the model is allowed to further train and adapt the embeddings as needed.

In [None]:
model_ae100_1LSTM = Sequential()
model_ae100_1LSTM.add(Input(shape=(1,), dtype='string'))
model_ae100_1LSTM.add(encoderSEQ)
model_ae100_1LSTM.add(Embedding(
    VOCAB_SIZE,
    EMBED_DIM,
    embeddings_initializer=Constant(embedding_matrix),
    trainable=True))
model_ae100_1LSTM.add(Bidirectional(LSTM(64)))
model_ae100_1LSTM.add(Dense(1, activation="sigmoid"))

In [None]:
train_from_scatch = False

model_name = 'model_5kW_ae100_1LSTM_ADAM'
model_weight_file = model_name + '_weights'
model_result_file = model_name + '_Results.pkl'

if train_from_scatch: 
    resDict_ae100_1LSTM = fitAndEval(model_ae100_1LSTM, from_logits=False,
                                     model_name = model_name)
    # save weights and results
    model_ae100_1LSTM.save_weights(model_weight_file)
    with open(model_result_file, 'wb') as f:
        pickle.dump(resDict_ae100_1LSTM, f)
else:
    model_ae100_1LSTM.load_weights(model_weight_file)
    with open(model_result_file, 'rb') as input_file:
        resDict_ae100_1LSTM = pickle.load(input_file)

In [None]:
plt.plot(resDict_embed_1LSTM['epochs'], resDict_embed_1LSTM['train_accuracy'],
         'r:', label = resDict_embed_1LSTM['model_name'] +', Training acc')
plt.plot(resDict_embed_1LSTM['epochs'], resDict_embed_1LSTM['val_accuracy'],
         'r',  label = resDict_embed_1LSTM['model_name'] +', Validation acc')

plt.plot(resDict_pe100_1LSTM['epochs'], resDict_pe100_1LSTM['train_accuracy'],
         'b:', label = resDict_pe100_1LSTM['model_name'] +', Training acc')
plt.plot(resDict_pe100_1LSTM['epochs'], resDict_pe100_1LSTM['val_accuracy'],
         'b',  label = resDict_pe100_1LSTM['model_name'] +', Validation acc')

plt.plot(resDict_ae100_1LSTM['epochs'], resDict_ae100_1LSTM['train_accuracy'],
         'g:', label = resDict_ae100_1LSTM['model_name'] +', Training acc')
plt.plot(resDict_ae100_1LSTM['epochs'], resDict_ae100_1LSTM['val_accuracy'],
         'g',  label = resDict_ae100_1LSTM['model_name'] +', Validation acc')

plt.title('Training and validation accuracy')
plt.xlabel('Epochs')
plt.ylabel('Accuracy')
plt.legend(loc='best')
plt.grid(True)
plt.show()

Copy files to Google Drive (if wanted):

In [None]:
# copy files (the names must match the model names used when saving)
if use_gdrive:
    %cp model_5kW_pe100_1LSTM_ADAM_weights* $targetDir_models
    %cp model_5kW_pe100_1LSTM_ADAM_Results* $targetDir_results
    %cp model_5kW_embed_1LSTM_ADAM_weights* $targetDir_models
    %cp model_5kW_embed_1LSTM_ADAM_Results* $targetDir_results
    %cp model_5kW_ae100_1LSTM_ADAM_weights* $targetDir_models
    %cp model_5kW_ae100_1LSTM_ADAM_Results* $targetDir_results