# Simple Sentiment Classification: LSTM

We continue the sentiment analysis on the IMDb dataset and extend the bag-of-words approach of the previous notebook with the following components:
* we use an encoding that takes the order of the words into account
* we use pre-trained word embeddings
* we use LSTM layers to take into account the neighborhood of the words.

## Set-up
First, we load the libraries needed for this task. We use Keras and TensorFlow for this code example, so we import the relevant parts of these frameworks:

In [4]:
# comment this line out when running locally (if the package is already installed)
!pip install tensorflow_datasets


In [21]:
import numpy as np
import tensorflow as tf
import tensorflow_datasets as tfds

In [22]:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Activation, Dropout
from tensorflow.keras.layers import Input, TextVectorization, Embedding, Flatten, LSTM, Bidirectional
from tensorflow.keras.losses import BinaryCrossentropy
from tensorflow.keras.callbacks import EarlyStopping
from tensorflow.keras.initializers import Constant
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.utils import to_categorical

In [23]:
# some more general libraries for evaluation purposes:
import matplotlib.pyplot as plt
import datetime
import pickle

In [24]:
# initialize random number generators to ensure reproducibility:
tf.random.set_seed(123)
np.random.seed(123)

In [25]:
# set some model parameters
VOCAB_SIZE = 5000
NUM_EPOCHS = 5 # set lower for fast results - set higher for good results
BUFFER_SIZE = 10000
BATCH_SIZE = 512
EMBED_DIM = 100

## Loading the Data
We load the data in the same way as before:

In [26]:
train_ds, val_ds, test_ds = tfds.load(
    name = "imdb_reviews",
    split = [ 'train[:80%]', 'train[80%:]', 'test' ],
    as_supervised = True)
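To get a feeling for what the split strings and the batch size imply, the following sketch computes the resulting set sizes and the number of batches per epoch. It assumes the IMDb training split contains 25,000 reviews and reuses `BATCH_SIZE = 512` from above; it is plain arithmetic and does not touch the dataset itself.

```python
import math

TRAIN_SPLIT_SIZE = 25000  # assumed size of the IMDb 'train' split
BATCH_SIZE = 512          # same value as in the parameter cell above

# 'train[:80%]' keeps the first 80% for training, 'train[80%:]' the rest for validation
num_train = TRAIN_SPLIT_SIZE * 80 // 100
num_val = TRAIN_SPLIT_SIZE - num_train

# number of batches per epoch (the last batch may be smaller than BATCH_SIZE)
train_batches = math.ceil(num_train / BATCH_SIZE)
val_batches = math.ceil(num_val / BATCH_SIZE)
last_train_batch = num_train - (train_batches - 1) * BATCH_SIZE

print("training examples:  ", num_train)
print("validation examples:", num_val)
print("training batches per epoch:", train_batches)
print("size of last training batch:", last_train_batch)
```

With these assumptions, one epoch consists of only a few dozen weight updates, which is one reason why a small `NUM_EPOCHS` gives fast but not necessarily good results.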

In [27]:
for example, label in train_ds.take(1):
  print("Input: ", example)
  print(10*".")
  print('Target labels: ', label)
  print(50*"-")

Input:  tf.Tensor(b"This was an absolutely terrible movie. Don't be lured in by Christopher Walken or Michael Ironside. Both are great actors, but this must simply be their worst role in history. Even their great acting could not redeem this movie's ridiculous storyline. This movie is an early nineties US propaganda piece. The most pathetic scenes were those when the Columbian rebels were making their cases for revolutions. Maria Conchita Alonso appeared phony, and her pseudo-love affair with Walken was nothing but a pathetic emotional plug in a movie that was devoid of any real meaning. I am disappointed that there are movies like this, ruining actor's like Christopher Walken's good name. I could barely sit through it.", shape=(), dtype=string)
..........
Target labels:  tf.Tensor(0, shape=(), dtype=int64)
--------------------------------------------------




## Text Representation:
A first change concerns the text representation: while we used `output_mode = "count"` in the previous notebook, we now drop this additional argument for the `TextVectorization` layer, so that it returns a sequence of word indices instead of word counts.

In [28]:
encoderSEQ = TextVectorization(max_tokens=VOCAB_SIZE)
# previously, we had 'output_mode = "count", ' as additional arguments for TextVectorization
encoderSEQ.adapt(train_ds.map(lambda text, label: text))
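The difference between the two output modes can be illustrated without TensorFlow. The sketch below uses a small hypothetical vocabulary (`toy_vocab`) and two helper functions that are not part of the notebook. As a simplification, both encodings share the same vocabulary, with index 0 reserved for padding and index 1 for unknown words.

```python
# Hypothetical toy vocabulary; indices 0 and 1 mimic the padding and [UNK] slots.
toy_vocab = ["", "[UNK]", "the", "movie", "was", "good", "not"]
word_to_index = {word: i for i, word in enumerate(toy_vocab)}

def encode_sequence(text):
    """Return one vocabulary index per token, keeping the word order."""
    return [word_to_index.get(token, 1) for token in text.lower().split()]

def encode_counts(text):
    """Return one count per vocabulary entry, discarding the word order."""
    counts = [0] * len(toy_vocab)
    for index in encode_sequence(text):
        counts[index] += 1
    return counts

text_a = "the movie was good not bad"
text_b = "the movie was bad not good"

seq_a, seq_b = encode_sequence(text_a), encode_sequence(text_b)
counts_a, counts_b = encode_counts(text_a), encode_counts(text_b)

print("sequence a:", seq_a)
print("sequence b:", seq_b)
print("identical counts:   ", counts_a == counts_b)
print("identical sequences:", seq_a == seq_b)
```

The two toy reviews have opposite meanings but identical count vectors. Only the sequence encoding keeps the information that an order-aware model such as an LSTM can exploit.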

Apart from the additional padding entry `''` at the first position, the vocabulary is essentially the same as for the `encoderBoW`:

In [29]:
vocab = np.array(encoderSEQ.get_vocabulary())
vocab[:20]

array(['', '[UNK]', 'the', 'and', 'a', 'of', 'to', 'is', 'in', 'it', 'i',
       'this', 'that', 'br', 'was', 'as', 'for', 'with', 'movie', 'but'],
      dtype='<U16')

The first entry in the vocabulary is the empty string `''`, which is reserved for padding (index 0). It is followed by `[UNK]` (index 1), the token for unknown words, i.e. all words that do not appear in the vocabulary. After that come the tokens for very common words, the so-called **stop words**, the first one being 'the' (index 2). So, in the integer sequence that we get after encoding, every unknown word is mapped to 1 and every occurrence of 'the' is mapped to 2. Some *domain-specific* words also occur frequently: `movie` and `film` indicate that the vocabulary was built on movie reviews. The token `br` is a remnant of the HTML line-break tags contained in the raw reviews.

We can now get an example encoding:

In [14]:
encoderSEQ("the").numpy()

array([2])

In [19]:
example

<tf.Tensor: shape=(), dtype=string, numpy=b"This was an absolutely terrible movie. Don't be lured in by Christopher Walken or Michael Ironside. Both are great actors, but this must simply be their worst role in history. Even their great acting could not redeem this movie's ridiculous storyline. This movie is an early nineties US propaganda piece. The most pathetic scenes were those when the Columbian rebels were making their cases for revolutions. Maria Conchita Alonso appeared phony, and her pseudo-love affair with Walken was nothing but a pathetic emotional plug in a movie that was devoid of any real meaning. I am disappointed that there are movies like this, ruining actor's like Christopher Walken's good name. I could barely sit through it.">

In [16]:
encoderSEQ(example).numpy()

array([  11,   14,   34,  411,  376,   18,   90,   27,    1,    8,   33,
       1322, 4160,   41,  501,    1,  193,   25,   86,  152,   19,   11,
        216,  316,   27,   65,  241,  213,    8,  485,   56,   65,   86,
        115,   95,   22,    1,   11,   93,  635,  739,   11,   18,    7,
         34,  396,    1,  169, 2483,  409,    2,   88, 1205,  137,   67,
        144,   52,    2,    1,    1,   67,  245,   65, 2939,   16,    1,
       2795,    1,    1, 1441,    1,    3,   40,    1, 1659,   17, 4160,
         14,  156,   19,    4, 1205,  853,    1,    8,    4,   18,   12,
         14, 3839,    5,   98,  146, 1222,   10,  231,  683,   12,   48,
         25,   93,   39,   11,    1,  152,   39, 1322,    1,   50,  408,
         10,   95, 1157,  845,  140,    9])

Now, the output is a sequence of the word indices. So we can try to reconstruct the input text:

In [20]:
print("Original: ", example.numpy())
print("Reconstruction: ", " ".join(vocab[encoderSEQ(example)]))

Original:  b"This was an absolutely terrible movie. Don't be lured in by Christopher Walken or Michael Ironside. Both are great actors, but this must simply be their worst role in history. Even their great acting could not redeem this movie's ridiculous storyline. This movie is an early nineties US propaganda piece. The most pathetic scenes were those when the Columbian rebels were making their cases for revolutions. Maria Conchita Alonso appeared phony, and her pseudo-love affair with Walken was nothing but a pathetic emotional plug in a movie that was devoid of any real meaning. I am disappointed that there are movies like this, ruining actor's like Christopher Walken's good name. I could barely sit through it."
Reconstruction:  this was an absolutely terrible movie dont be [UNK] in by christopher walken or michael [UNK] both are great actors but this must simply be their worst role in history even their great acting could not [UNK] this movies ridiculous storyline this movie is an e
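Reviews have different lengths, so when several of them are encoded together as one batch, the shorter sequences are filled up with the padding index 0 until all rows have the same length. The following plain-Python sketch mimics this behaviour; `pad_batch` is a hypothetical helper and the index values are chosen for illustration only.

```python
def pad_batch(sequences, pad_index=0):
    """Pad all sequences with pad_index to the length of the longest one."""
    max_len = max(len(seq) for seq in sequences)
    return [seq + [pad_index] * (max_len - len(seq)) for seq in sequences]

# three encoded toy "reviews" of different lengths
batch = [[11, 14, 34], [2, 18], [10]]
padded = pad_batch(batch)

for row in padded:
    print(row)
```

This is why index 0 is reserved in the vocabulary: it never stands for a real word.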

**Exercise:** Encode any text you like. To apply all functions necessary, we need to transform the text to a tensor format.

In [33]:
my_text = tf.convert_to_tensor("I an excided to learn about NLP", dtype=tf.string) # fill in this line
encoderSEQ(my_text).numpy()

array([ 10,  34,   1,   6, 824,  43,   1])

<details>
  <summary>Click to see the solution</summary>
  
  ```python
  # solution
  my_text = tf.convert_to_tensor("I am excited to learn about NLP", dtype=tf.string)
  encoderSEQ(my_text).numpy()
  ```
</details>


In [34]:
print("Original: ", my_text.numpy()) # fill in this line
print("Reconstruction: ", " ".join(vocab[encoderSEQ(my_text) ])) # fill in this line

Original:  b'I an excided to learn about NLP'
Reconstruction:  i an [UNK] to learn about [UNK]


<details>
  <summary>Click to see the solution</summary>
  
  ```python
  # solution
  print("Original: ", my_text.numpy())
  print("Reconstruction: ", " ".join(vocab[encoderSEQ(my_text)]))
  ```
</details>

## Preparation for Model Comparison
We want to move on to more complex models. In order to be prepared, we first define a function that does the training and evaluation for us:

In [35]:
def fitAndEval(myModel, from_logits = True, model_name = ''):
    # compile
    myModel.compile(loss = BinaryCrossentropy(from_logits=from_logits),
                    optimizer = 'adam', metrics = ['accuracy'])

    # set seeds
    tf.random.set_seed(123)

    # Train
    myHistory = myModel.fit(
        train_ds.shuffle(buffer_size=BUFFER_SIZE).batch(BATCH_SIZE),
        validation_data = val_ds.batch(BATCH_SIZE),
        epochs = NUM_EPOCHS, verbose = 1,
        callbacks = [ EarlyStopping(monitor='val_accuracy', patience=5,
                                    verbose=False, restore_best_weights=True)])

    # Collect the training progress (loss and accuracy per epoch)
    myHistory_dict = myHistory.history

    resDict = {}
    resDict['train_loss'] = myHistory_dict['loss']
    resDict['val_loss'] = myHistory_dict['val_loss']
    resDict['train_accuracy'] = myHistory_dict['accuracy']
    resDict['val_accuracy'] = myHistory_dict['val_accuracy']
    resDict['epochs'] = range(1, len(resDict['train_accuracy']) + 1)
    resDict['model_name'] = model_name

    return resDict
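The `EarlyStopping` callback used in `fitAndEval` monitors the validation accuracy and stops training once it has not improved for `patience` consecutive epochs. The sketch below is a simplified re-implementation of this stopping rule in plain Python, not the Keras code itself; the function name and the accuracy values are made up for illustration.

```python
def early_stopping_epoch(val_accuracy, patience):
    """Return (last trained epoch, epoch with the best validation accuracy)."""
    best_value = float("-inf")
    best_epoch = 0
    epochs_without_improvement = 0
    for epoch, value in enumerate(val_accuracy, start=1):
        if value > best_value:
            # new best value: remember it and reset the counter
            best_value, best_epoch = value, epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch, best_epoch  # training stops early
    return len(val_accuracy), best_epoch  # all epochs were trained

# hypothetical validation accuracies, one value per epoch
toy_curve = [0.70, 0.80, 0.85, 0.84, 0.83, 0.82]

print(early_stopping_epoch(toy_curve, patience=2))      # stops before the last epoch
print(early_stopping_epoch(toy_curve[:5], patience=5))  # trains all 5 epochs
```

With `patience=5` and `NUM_EPOCHS = 5`, as set in this notebook, the stopping criterion cannot trigger. It only becomes relevant when `NUM_EPOCHS` is increased.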

# A first LSTM Model
Now, let's define and train our first LSTM using the helper function `fitAndEval`:

In [36]:
# initialize random number generators to ensure reproducibility:
tf.random.set_seed(123)
np.random.seed(123)

In [37]:
model_embed_1LSTM = Sequential()
model_embed_1LSTM.add(Input(shape=(1,), dtype='string'))
model_embed_1LSTM.add(encoderSEQ)
model_embed_1LSTM.add(Embedding(VOCAB_SIZE, EMBED_DIM))
model_embed_1LSTM.add(Bidirectional(LSTM(64)))
model_embed_1LSTM.add(Dense(1, activation="sigmoid"))

**Exercise:** Your task is to print the summary of the model we have created one cell above.

In [38]:
model_embed_1LSTM.summary() # fill in this line

Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 text_vectorization_1 (Text  (None, None)              0         
 Vectorization)                                                  
                                                                 
 embedding (Embedding)       (None, None, 100)         500000    
                                                                 
 bidirectional (Bidirection  (None, 128)               84480     
 al)                                                             
                                                                 
 dense (Dense)               (None, 1)                 129       
                                                                 
Total params: 584609 (2.23 MB)
Trainable params: 584609 (2.23 MB)
Non-trainable params: 0 (0.00 Byte)
_________________________________________________________________


<details>
  <summary>Click to see the solution</summary>
  
  ```python
  # solution
  model_embed_1LSTM.summary()
  ```
</details>
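The parameter counts in the summary can be checked by hand. The sketch below uses the standard formula for an LSTM layer (four gates, each with input weights, recurrent weights and a bias) and the layer sizes from the model definition; the variable names are chosen for this illustration.

```python
VOCAB_SIZE = 5000
EMBED_DIM = 100
LSTM_UNITS = 64

# one 100-dimensional vector per vocabulary entry
embedding_params = VOCAB_SIZE * EMBED_DIM

# each of the 4 LSTM gates has weights for the input, the previous state, and a bias
lstm_params_one_direction = 4 * (EMBED_DIM + LSTM_UNITS + 1) * LSTM_UNITS
# the bidirectional wrapper holds a forward and a backward LSTM
bidirectional_params = 2 * lstm_params_one_direction

# the dense layer sees the concatenated outputs of both directions, plus one bias
dense_params = 2 * LSTM_UNITS + 1

total_params = embedding_params + bidirectional_params + dense_params

print("embedding:    ", embedding_params)
print("bidirectional:", bidirectional_params)
print("dense:        ", dense_params)
print("total:        ", total_params)
```

The numbers agree with the `Param #` column of the summary. Most of the parameters sit in the embedding layer.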

This is the first model that needs significant training time. Therefore, the notebook can be run in two ways: either training the models from scratch or using precomputed weights. To train from scratch, set `train_from_scatch` to `True`. We suggest you keep the model and file names unchanged: when training from scratch, the notebook saves the model weights and results under these names, and otherwise it loads them from the same files.

In [None]:
train_from_scatch = True

model_name = 'Own Embeddings 100 dim'
model_weight_file = model_name.replace(' ', '_') + '_weights'
model_result_file = model_name.replace(' ', '_') + '_Results.pkl'

if train_from_scatch: 
    resDict_embed_1LSTM = fitAndEval(model_embed_1LSTM, from_logits=False,
                                     model_name = model_name)
    # save weights and results
    model_embed_1LSTM.save_weights(model_weight_file)
    with open(model_result_file, 'wb') as f:
        pickle.dump(resDict_embed_1LSTM, f)
else:
    model_embed_1LSTM.load_weights(model_weight_file)
    with open(model_result_file, 'rb') as input_file:
        resDict_embed_1LSTM = pickle.load(input_file)

Epoch 1/5


In [None]:
plt.plot(resDict_embed_1LSTM['epochs'], resDict_embed_1LSTM['train_accuracy'],
         'r:', label = resDict_embed_1LSTM['model_name'] +', Training acc')
plt.plot(resDict_embed_1LSTM['epochs'], resDict_embed_1LSTM['val_accuracy'],
         'r',  label = resDict_embed_1LSTM['model_name'] +', Validation acc')

plt.title('Training and validation accuracy')
plt.xlabel('Epochs')
plt.ylabel('Accuracy')
plt.legend(loc='lower right')
plt.grid(True)
plt.show()

We see that our model overfits considerably: on the training set, it reaches an accuracy of over 95% within a few epochs, but on the validation data, the accuracy does not go above approx. 86%.
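The gap between training and validation accuracy can also be read off numerically from a results dictionary like the one returned by `fitAndEval`. The values in the sketch below are invented for illustration and are not the results of this notebook.

```python
# Hypothetical accuracy curves, invented for illustration only.
toy_results = {
    "train_accuracy": [0.70, 0.85, 0.92, 0.96, 0.98],
    "val_accuracy":   [0.75, 0.84, 0.86, 0.85, 0.85],
}

train_acc = toy_results["train_accuracy"]
val_acc = toy_results["val_accuracy"]

# a growing positive gap between training and validation accuracy indicates overfitting
gaps = [round(train - val, 2) for train, val in zip(train_acc, val_acc)]

# epoch (counted from 1) with the highest validation accuracy
best_epoch = max(range(len(val_acc)), key=val_acc.__getitem__) + 1

print("gap per epoch:", gaps)
print("best validation epoch:", best_epoch)
```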

# Using Pretrained Word Embeddings


Here, we use the pretrained word embeddings from GloVe:

In [None]:
have_glove = True # set to true when downloaded

if not have_glove:
    # The following commands need to be executed the first time this notebook is run:
    !wget http://nlp.stanford.edu/data/glove.6B.zip
    !unzip -q glove.6B.zip

Now we will use these pretrained word vectors to represent our texts:

In [None]:
path_to_glove_file = "glove.6B.100d.txt"

embeddings_index = {}
with open(path_to_glove_file, encoding="utf-8") as f:
    for line in f:
        word, coefs = line.split(maxsplit=1)
        coefs = np.fromstring(coefs, "f", sep=" ")
        embeddings_index[word] = coefs

print("Found %s word vectors." % len(embeddings_index))
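Each line of the GloVe file contains a word followed by its vector components, separated by spaces. The sketch below parses two made-up lines with three dimensions each from an in-memory string, so it runs without the download. It uses plain Python lists instead of NumPy arrays, and the words and numbers are placeholders.

```python
import io

# Two made-up lines in the GloVe text format: word, then the vector components.
sample_file = io.StringIO(
    "movie 0.1 -0.2 0.3\n"
    "film 0.0 0.5 -0.5\n"
)

toy_index = {}
for line in sample_file:
    # split off the word; the rest of the line holds the coefficients
    word, coefs = line.split(maxsplit=1)
    toy_index[word] = [float(value) for value in coefs.split()]

print("Found %s word vectors." % len(toy_index))
print(toy_index["movie"])
print(toy_index.get("cinema"))  # not in the toy file
```

Using `.get()` instead of square brackets avoids a `KeyError` for words without an embedding and returns `None` instead.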

**Exercise:** You can get the GloVe embedding for any word you like with `embeddings_index.get(your_word)`. If this call returns `None`, the word does not have a GloVe embedding.

<details>
  <summary>Click to see the solution</summary>
  
  ```python
  # solution
  embeddings_index.get("movie") # possible solution
  ```
</details>

In the next step, we apply this lookup to all words in our vocabulary. This way, we can check which words have a GloVe embedding.

In [None]:
hits = 0
misses = 0

# Prepare embedding matrix
embedding_matrix = np.zeros((VOCAB_SIZE, EMBED_DIM))
for i, word in enumerate(encoderSEQ.get_vocabulary()):
    embedding_vector = embeddings_index.get(word)
    if embedding_vector is not None:
        embedding_matrix[i] = embedding_vector
        hits += 1
    else:
        # Words not found in the embedding index keep their all-zero row.
        # This includes the representation for "padding" and "OOV".
        misses += 1
        # print(word)
print("Converted %d words (%d misses)" % (hits, misses))
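The same procedure can be followed on a toy scale. In the sketch below, the vocabulary, the two-dimensional embeddings and all names starting with `toy_` are made up for illustration. Nested lists stand in for the NumPy matrix.

```python
# Hypothetical vocabulary and 2-dimensional embeddings.
toy_vocab = ["", "[UNK]", "the", "movie", "walken"]
toy_embeddings = {"the": [0.1, 0.2], "movie": [0.3, 0.4]}
toy_dim = 2

# start with an all-zero matrix: one row per vocabulary entry
toy_matrix = [[0.0] * toy_dim for _ in toy_vocab]
toy_hits = 0
toy_misses = 0

for i, word in enumerate(toy_vocab):
    vector = toy_embeddings.get(word)
    if vector is not None:
        toy_matrix[i] = list(vector)  # copy the pretrained vector into row i
        toy_hits += 1
    else:
        toy_misses += 1               # row i stays all-zero

print("Converted %d words (%d misses)" % (toy_hits, toy_misses))
for word, row in zip(toy_vocab, toy_matrix):
    print(repr(word), row)
```

The row index of each vector equals the index that the encoder assigns to the word, which is what allows the matrix to be used as the weight of an `Embedding` layer.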

## Pretrained Word Embeddings without Adaptation
We use the same LSTM model as above, but initialize the 100-dimensional embedding with the pretrained vectors from GloVe. With `trainable=False`, these embeddings are kept fixed during training.

In [None]:
model_pe100_1LSTM = Sequential()
model_pe100_1LSTM.add(Input(shape=(1,), dtype='string'))
model_pe100_1LSTM.add(encoderSEQ)
model_pe100_1LSTM.add(Embedding(
    VOCAB_SIZE,
    EMBED_DIM,
    embeddings_initializer=Constant(embedding_matrix),
    trainable=False))
model_pe100_1LSTM.add(Bidirectional(LSTM(64)))
model_pe100_1LSTM.add(Dense(1, activation="sigmoid"))
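Freezing the embedding layer changes how many parameters are updated during training. The following arithmetic sketch reuses the parameter counts from the summary of the first model, which has the same architecture; the variable names are chosen for this illustration.

```python
# Parameter counts taken from the summary of the first LSTM model
embedding_params = 500000
bidirectional_params = 84480
dense_params = 129

# with trainable=False, the embedding weights are excluded from the updates
trainable_params = bidirectional_params + dense_params
non_trainable_params = embedding_params
frozen_share = non_trainable_params / (trainable_params + non_trainable_params)

print("trainable:    ", trainable_params)
print("non-trainable:", non_trainable_params)
print("share of frozen parameters: %.1f%%" % (100 * frozen_share))
```

Calling `model_pe100_1LSTM.summary()` should report the same split between trainable and non-trainable parameters.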

In [None]:
train_from_scatch = True

model_name = 'Pretrained Embeddings 100 dim'
model_weight_file = model_name.replace(' ', '_') + '_weights'
model_result_file = model_name.replace(' ', '_') + '_Results.pkl'

if train_from_scatch: 
    resDict_pe100_1LSTM = fitAndEval(model_pe100_1LSTM, from_logits=False,
                                     model_name = model_name)
    # save weights and results
    model_pe100_1LSTM.save_weights(model_weight_file)
    with open(model_result_file, 'wb') as f:
        pickle.dump(resDict_pe100_1LSTM, f)
else:
    model_pe100_1LSTM.load_weights(model_weight_file)
    with open(model_result_file, 'rb') as input_file:
        resDict_pe100_1LSTM = pickle.load(input_file)

In [None]:
plt.plot(resDict_embed_1LSTM['epochs'], resDict_embed_1LSTM['train_accuracy'],
         'r:', label = resDict_embed_1LSTM['model_name'] +', Training acc')
plt.plot(resDict_embed_1LSTM['epochs'], resDict_embed_1LSTM['val_accuracy'],
         'r',  label = resDict_embed_1LSTM['model_name'] +', Validation acc')

plt.plot(resDict_pe100_1LSTM['epochs'], resDict_pe100_1LSTM['train_accuracy'],
         'b:', label = resDict_pe100_1LSTM['model_name'] +', Training acc')
plt.plot(resDict_pe100_1LSTM['epochs'], resDict_pe100_1LSTM['val_accuracy'],
         'b',  label = resDict_pe100_1LSTM['model_name'] +', Validation acc')

plt.title('Training and validation accuracy')
plt.xlabel('Epochs')
plt.ylabel('Accuracy')
plt.legend(loc='lower right')
plt.grid(True)
plt.show()

## Pretrained Word Embeddings with Adaptation
As we have seen, the performance actually falls below that of the LSTM model with embeddings trained from scratch. We therefore implement a third model, in which the pretrained embeddings serve as a starting point from which the model can further train and adapt the embeddings as needed.

In [None]:
model_ae100_1LSTM = Sequential()
model_ae100_1LSTM.add(Input(shape=(1,), dtype='string'))
model_ae100_1LSTM.add(encoderSEQ)
model_ae100_1LSTM.add(Embedding(
    VOCAB_SIZE,
    EMBED_DIM,
    embeddings_initializer=Constant(embedding_matrix),
    trainable=True))
model_ae100_1LSTM.add(Bidirectional(LSTM(64)))
model_ae100_1LSTM.add(Dense(1, activation="sigmoid"))

In [None]:
train_from_scatch = True

model_name = 'Pretrained Adapted Embeddings 100 dim'
model_weight_file = model_name.replace(' ', '_') + '_weights'
model_result_file = model_name.replace(' ', '_') + '_Results.pkl'

if train_from_scatch: 
    resDict_ae100_1LSTM = fitAndEval(model_ae100_1LSTM, from_logits=False,
                                     model_name = model_name)
    # save weights and results
    model_ae100_1LSTM.save_weights(model_weight_file)
    with open(model_result_file, 'wb') as f:
        pickle.dump(resDict_ae100_1LSTM, f)
else:
    model_ae100_1LSTM.load_weights(model_weight_file)
    with open(model_result_file, 'rb') as input_file:
        resDict_ae100_1LSTM = pickle.load(input_file)

In [None]:
plt.plot(resDict_embed_1LSTM['epochs'], resDict_embed_1LSTM['train_accuracy'],
         'r:', label = resDict_embed_1LSTM['model_name'] +', Training acc')
plt.plot(resDict_embed_1LSTM['epochs'], resDict_embed_1LSTM['val_accuracy'],
         'r',  label = resDict_embed_1LSTM['model_name'] +', Validation acc')

plt.plot(resDict_pe100_1LSTM['epochs'], resDict_pe100_1LSTM['train_accuracy'],
         'b:', label = resDict_pe100_1LSTM['model_name'] +', Training acc')
plt.plot(resDict_pe100_1LSTM['epochs'], resDict_pe100_1LSTM['val_accuracy'],
         'b',  label = resDict_pe100_1LSTM['model_name'] +', Validation acc')

plt.plot(resDict_ae100_1LSTM['epochs'], resDict_ae100_1LSTM['train_accuracy'],
         'g:', label = resDict_ae100_1LSTM['model_name'] +', Training acc')
plt.plot(resDict_ae100_1LSTM['epochs'], resDict_ae100_1LSTM['val_accuracy'],
         'g',  label = resDict_ae100_1LSTM['model_name'] +', Validation acc')

plt.title('Training and validation accuracy')
plt.xlabel('Epochs')
plt.ylabel('Accuracy')
plt.legend(loc='best')
plt.grid(True)
plt.show()

**Exercise:** Now it is your turn. Use one of the trained models to make a prediction for the test set. The test set has to be batched before it is passed to the model, for example with the same batch size as used during training. If you reload the weights from disk, use the same model name that was used when they were saved. What accuracy do we get on the test set?

In [None]:
# Use this if your model is not in memory:
# model_name = ... # fill in this line
# model_weight_file = model_name.replace(' ', '_') + '_weights'
# model_ae100_1LSTM.load_weights(model_weight_file)

<details>
  <summary>Click to see the solution</summary>
  
  ```python
  # solution
  model_name = 'Pretrained Adapted Embeddings 100 dim' # possible solution: the name used when saving the weights
  model_weight_file = model_name.replace(' ', '_') + '_weights'
  model_ae100_1LSTM.load_weights(model_weight_file)
  ```
</details>

In [None]:
... # fill in this line

<details>
  <summary>Click to see the solution</summary>
  
  ```python
  # solution
  results = model_ae100_1LSTM.evaluate(test_ds.batch(BATCH_SIZE), verbose=2)

  for name, value in zip(model_ae100_1LSTM.metrics_names, results):
      print("%s: %.3f" % (name, value))
  ```
</details>
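The accuracy reported by `evaluate` is based on the sigmoid output of the model: a review counts as positive if the predicted probability is above 0.5. The sketch below reproduces this computation in plain Python. `binary_accuracy` is a hypothetical helper written for this example, and the labels and probabilities are invented.

```python
def binary_accuracy(y_true, y_prob, threshold=0.5):
    """Share of examples whose thresholded prediction matches the label."""
    predictions = [1 if p > threshold else 0 for p in y_prob]
    correct = sum(int(pred == label) for pred, label in zip(predictions, y_true))
    return correct / len(y_true)

# invented labels and predicted probabilities for five reviews
toy_labels = [0, 1, 1, 0, 1]
toy_probs = [0.10, 0.80, 0.40, 0.60, 0.95]

toy_accuracy = binary_accuracy(toy_labels, toy_probs)
print("accuracy: %.3f" % toy_accuracy)
```

To inspect individual predictions of the trained model, `model_ae100_1LSTM.predict(test_ds.batch(BATCH_SIZE))` returns the probabilities that are thresholded here.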