#License and Attribution

This notebook was developed by Emilio Serrano, Full Professor at the Department of Artificial Intelligence, Universidad Politécnica de Madrid (UPM), for educational purposes in UPM courses. Personal website: https://emilioserrano.faculty.bio/

📘 License: Creative Commons Attribution-NonCommercial-ShareAlike (CC BY-NC-SA)

You are free to: (1) Share: copy and redistribute the material in any medium or format; (2) Adapt: remix, transform, and build upon the material.

Under the following terms: (1) Attribution: You must give appropriate credit, provide a link to the license, and indicate if changes were made; (2) NonCommercial: You may not use the material for commercial purposes; (3) ShareAlike: If you remix, transform, or build upon the material, you must distribute your contributions under the same license as the original.

🔗 License details: https://creativecommons.org/licenses/by-nc-sa/4.0/

#Building a Text Classifier with LSTMs in Keras
In this notebook, we build a text classification model using Recurrent Neural Networks (RNNs), specifically Bidirectional LSTMs, with the help of Keras, the high-level deep learning API of TensorFlow.

We will use a classic dataset (IMDB movie reviews) to train a model that can distinguish positive and negative sentiments in text.

This notebook also serves as an introduction to:

- Keras: an intuitive, modular, and easy-to-use deep learning API.
- KerasHub: a hub of prebuilt, ready-to-use models and components that help you speed up development and experiment with state-of-the-art architectures in just a few lines of code.

Throughout this notebook, you will learn:

- How to preprocess text data for training.

- How to use an Embedding layer and stack LSTM layers.

- How to evaluate and interpret your model.

- How to extend your work using tools from KerasNLP and KerasHub.

#Keras

Keras is a deep learning API written in Python. It originally ran on top of the machine learning platform TensorFlow and, since Keras 3, also supports JAX and PyTorch as backends. It was developed with a focus on enabling fast experimentation.

Check the [getting started](https://keras.io/getting_started/) and also the list of [examples](https://keras.io/examples/).

Examples of Keras for NLP include:

* Text classification from scratch
* Review Classification using Active Learning
* Text Classification using FNet
* Large-scale multi-label text classification
* Text classification with Transformer
* Text classification with Switch Transformer
* Text classification using Decision Forests and pretrained embeddings
* Using pre-trained word embeddings
* **Bidirectional LSTM on IMDB**

Let us see an example of how to train a 2-layer bidirectional LSTM on the IMDB movie review sentiment classification dataset (complete [here](https://keras.io/examples/nlp/bidirectional_lstm_imdb/)).







##GPUs?


Make sure you have your GPU available. While LSTMs process sequences step-by-step, many internal operations (like the gate computations) can be efficiently parallelized on a GPU.

In [None]:
import tensorflow as tf

# Check for GPU availability
gpus = tf.config.list_physical_devices('GPU')

if gpus:
    print(f"GPU detected: {gpus[0].name}")
else:
    print("No GPU detected.")
    print("If you are using Google Colab, please go to 'Runtime' > 'Change runtime type' and select 'GPU' as the hardware accelerator.")

##Setup and Imports

In [None]:
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, Bidirectional, LSTM, Dense


## Build the model

Note that the model includes: an embedding layer, two bidirectional LSTM layers, and a final dense layer for the classification.


Note that the first LSTM returns the full sequence (all time steps) because the second LSTM needs a sequence input.

The second LSTM returns only the last output (default return_sequences=False), since after the last LSTM you usually want a single vector to feed the Dense layer for prediction.
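To make the difference between the two settings concrete, here is a toy sketch in plain Python. It is not an LSTM: the hypothetical `toy_cell` is just a decaying running sum standing in for the recurrent update, and `toy_bidirectional` is an illustrative helper, not a Keras function. It only shows which outputs are kept and how the forward and backward passes are combined.

```python
def toy_cell(h, x):
    # Stand-in for a recurrent update (NOT a real LSTM cell): decay the state, add the input
    return 0.5 * h + x

def run_direction(xs):
    # Return the state after every time step
    h, outputs = 0.0, []
    for x in xs:
        h = toy_cell(h, x)
        outputs.append(h)
    return outputs

def toy_bidirectional(xs, return_sequences=False):
    forward = run_direction(xs)
    backward = run_direction(xs[::-1])
    if return_sequences:
        # Re-align the backward outputs with the original time order, then pair them per step
        return list(zip(forward, backward[::-1]))
    # Keep only the last state of each direction
    return (forward[-1], backward[-1])

sequence = [1.0, 2.0, 3.0]
full_sequence = toy_bidirectional(sequence, return_sequences=True)
last_only = toy_bidirectional(sequence)
print(full_sequence)  # one (forward, backward) pair per time step
print(last_only)      # a single (forward, backward) pair
```

With `return_sequences=True` there is one pair per time step, which is what a following recurrent layer needs; otherwise a single pair summarizes the whole sequence.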

Since this dense layer uses the sigmoid function, it performs binary classification (positive vs. negative sentiment in this case).
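As a small illustration of that last step, the sigmoid squashes any real-valued score into the range (0, 1), which can then be read as the probability of the positive class. The scores below are made-up values, not outputs of the model, and `to_label` is a hypothetical helper.

```python
import math

def sigmoid(z):
    # Logistic function: maps a real number to the open interval (0, 1)
    return 1.0 / (1.0 + math.exp(-z))

def to_label(z, threshold=0.5):
    # Threshold the probability to obtain a binary decision
    return "Positive" if sigmoid(z) >= threshold else "Negative"

for score in (-2.0, 0.0, 2.0):  # hypothetical pre-activation scores
    print(f"{score:+.1f} -> p={sigmoid(score):.3f} -> {to_label(score)}")
```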


In [None]:
# Parameters
max_features = 20000  # Use only the top 20,000 most frequent words
maxlen = 200          # Each review is truncated or padded to 200 words

# Build the model
model = Sequential([
    # Embedding layer converts word indices into dense vectors
    # (input_length is deprecated in Keras 3; the input shape is set by model.build() below)
    Embedding(input_dim=max_features, output_dim=128),

    # First Bidirectional LSTM layer returns full sequences to feed into next LSTM
    Bidirectional(LSTM(64, return_sequences=True)),

    # Second Bidirectional LSTM layer summarizes the sequence into one vector
    Bidirectional(LSTM(64)),

    # Output layer for binary classification (positive or negative sentiment)
    Dense(1, activation='sigmoid')
])

# Compile the model
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

# Show model architecture
model.build(input_shape=(None, maxlen))  # None is the batch size
model.summary()


##Let us study the output shapes.

**Embedding Layer** (embedding) → Output shape: (None, 200, 128):
This layer is responsible for converting each word in the input text into a dense vector representation of size 128. The input to the model is a sequence of word indices, each representing a word in the vocabulary. Because we set maxlen = 200, each input sequence contains exactly 200 words. Therefore, for each input example, the output of the embedding layer is a matrix of shape (200, 128), where 200 is the number of words and 128 is the dimensionality of each embedding. The first dimension, None, indicates that the batch size can vary during training or inference.

**First Bidirectional LSTM Layer** (bidirectional) → Output shape: (None, 200, 128):
This layer consists of two LSTM networks running in opposite directions: one processes the input sequence from start to end (forward), and the other from end to start (backward). Each LSTM has 64 units, and the outputs from both directions are concatenated, resulting in 128 features per time step. Since return_sequences=True is set, the layer returns the full output sequence for each time step in the input. The shape (None, 200, 128) thus indicates that for each of the 200 time steps, the layer outputs a 128-dimensional vector, preserving the sequential structure of the input.

**Second Bidirectional LSTM Layer** (bidirectional_1) → Output shape: (None, 128):
This layer is similar in structure to the previous bidirectional LSTM, again with 64 units in each direction. However, this time return_sequences=False (which is the default), so the layer does not return the full sequence of outputs. Instead, it returns only the final hidden state after processing the entire input sequence. The outputs from the forward and backward LSTMs are concatenated to form a single 128-dimensional vector. As a result, the shape (None, 128) means that each input sequence is now represented by a single fixed-size vector summarizing the information extracted by both LSTMs.
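The same dimensions also determine the number of trainable parameters. As a sanity check, the counts can be worked out by hand. The sketch below assumes the standard LSTM parameterization (four gates, each with input weights, recurrent weights, and a bias) and uses hypothetical helper names; the total can be compared against the output of `model.summary()`.

```python
def embedding_params(vocab_size, embed_dim):
    # One embed_dim-sized vector per vocabulary entry
    return vocab_size * embed_dim

def lstm_params(input_dim, units):
    # 4 gates, each with input weights, recurrent weights, and a bias vector
    return 4 * (units * (input_dim + units) + units)

def bidirectional_lstm_params(input_dim, units):
    # Two independent LSTMs (forward and backward)
    return 2 * lstm_params(input_dim, units)

def dense_params(input_dim, units):
    # Weight matrix plus one bias per output unit
    return input_dim * units + units

counts = {
    "embedding": embedding_params(20000, 128),
    # Input is the 128-dimensional embedding
    "bidirectional": bidirectional_lstm_params(128, 64),
    # Input is the 128-dimensional concatenation from the first bidirectional layer
    "bidirectional_1": bidirectional_lstm_params(128, 64),
    "dense": dense_params(128, 1),
}
total = sum(counts.values())
for name, n in counts.items():
    print(f"{name:16} {n:>10,}")
print(f"{'total':16} {total:>10,}")
```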

## Load the IMDB movie review sentiment data

Loads the [IMDB dataset](https://keras.io/api/datasets/imdb/).  

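This is a dataset of movie reviews from IMDB, labeled by sentiment (positive/negative); `load_data` returns 25,000 reviews for training and 25,000 for testing (used here as the validation set). Reviews have been preprocessed, and each review is encoded as a list of word indexes (integers).

The IMDb loader reserves indices 0, 1, and 2 for special tokens (padding, start, and unknown or OOV). Word ranks are shifted by 3 by default (`index_from=3`), so a word with frequency rank *r* appears as index *r* + 3 in the sequences.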
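The index convention can be sketched with a toy vocabulary. The three-word `toy_word_index` below is invented for illustration (the real mapping comes from `imdb.get_word_index()`), and `encode` is a hypothetical helper that mimics the default `start_char=1`, `oov_char=2`, and `index_from=3` settings of `load_data`.

```python
PAD, START, OOV, INDEX_FROM = 0, 1, 2, 3

# Invented ranks for illustration; rank 1 is the most frequent word
toy_word_index = {"the": 1, "movie": 2, "great": 3}

def encode(words, word_index, num_words=None):
    encoded = [START]  # every review begins with the start token
    for word in words:
        rank = word_index.get(word)
        index = OOV if rank is None else rank + INDEX_FROM
        # Indices outside the kept vocabulary are replaced by the OOV index
        if num_words is not None and index >= num_words:
            index = OOV
        encoded.append(index)
    return encoded

words = ["the", "movie", "was", "great"]
print(encode(words, toy_word_index))               # "was" is unknown
print(encode(words, toy_word_index, num_words=6))  # "great" now falls outside the vocabulary
```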

While LSTMs can conceptually process variable-length sequences, padding is a practical necessity to efficiently batch data during training and evaluation in frameworks like Keras. Without padding, training on batches would be very complicated and inefficient.
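To illustrate what padding and truncation do, here is a minimal pure-Python sketch of the default behaviour of `pad_sequences` (`padding='pre'`, `truncating='pre'`). The helper `pad_pre` is hypothetical and only mimics those defaults on plain lists; the notebook itself uses the Keras utility.

```python
def pad_pre(sequence, maxlen, value=0):
    # Keep the LAST maxlen items (truncation at the beginning) ...
    truncated = sequence[-maxlen:]
    # ... and fill the missing positions on the LEFT (padding at the beginning)
    return [value] * (maxlen - len(truncated)) + truncated

batch = [[5, 8, 13], [4, 9, 2, 7, 11, 6]]
padded = [pad_pre(seq, maxlen=4) for seq in batch]
print(padded)
```

After this step every sequence in the batch has the same length, so the batch can be stored as a single rectangular array.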


In [None]:
# Load the IMDb dataset from Keras datasets module.
# Only keep the top 'max_features' most frequent words in the dataset.
# This returns pre-tokenized sequences of word indices for training and validation sets.
(x_train, y_train), (x_val, y_val) = keras.datasets.imdb.load_data(num_words=max_features)

# Print the number of training and validation sequences loaded
print(len(x_train), "Training sequences")
print(len(x_val), "Validation sequences")

# Pad (or truncate) each sequence to a fixed length 'maxlen'.
# Sequences longer than 'maxlen' will be truncated (cut off) at the beginning.
#     ...By default, pad_sequences in Keras truncates sequences at the beginning (truncating='pre')
# Sequences shorter than 'maxlen' will be padded with zeros at the start by default.
# This ensures all input sequences have the same length, which is required for batch processing in the model.
x_train = keras.utils.pad_sequences(x_train, maxlen=maxlen)
x_val = keras.utils.pad_sequences(x_val, maxlen=maxlen)

print("Training sample: a vector of 200 dimensions, each one representing a word index (including padding with word indexed as 0):")
print(x_train[0])

## Train and evaluate the model

This code snippet demonstrates the process of compiling, training, and evaluating a neural network model for a binary classification task using Keras. We compile the model with the Adam optimizer and binary cross-entropy loss, train it on the training data while validating performance on a separate validation set, and finally evaluate the model's accuracy and loss on the validation data.

Note that only 3 epochs are used (too few).



In [None]:
# Compile the model with Adam optimizer and binary crossentropy loss
# Metrics to track during training include accuracy
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

# Train the model on training data with a batch size of 32, for 3 epochs
# Validate on (x_val, y_val) at the end of each epoch
history = model.fit(x_train, y_train, batch_size=32, epochs=3, validation_data=(x_val, y_val))

# After training, evaluate the model performance on the validation set
loss, accuracy = model.evaluate(x_val, y_val)
print(f'Val loss: {loss:.4f}, Val accuracy: {accuracy:.4f}')


In some experiments, the validation accuracy was 0.8723, which suggests that the model trained well, achieving good accuracy and low loss on both training and validation data. A validation accuracy close to the training accuracy indicates reasonable generalization. More epochs could be beneficial as long as overfitting does not occur.
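One common way to train for more epochs without overfitting is to stop once the validation loss stops improving; Keras offers this as the `keras.callbacks.EarlyStopping` callback. The sketch below reproduces the basic idea in plain Python on a made-up loss curve (the numbers are hypothetical, not results from this notebook), and `early_stop_epoch` is an illustrative helper rather than a Keras function.

```python
def early_stop_epoch(val_losses, patience=2):
    """Return (best_epoch, stop_epoch), both 0-based.

    Training stops once `patience` consecutive epochs fail to improve
    on the best validation loss seen so far.
    """
    best_epoch, best_loss, waited = 0, float("inf"), 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_epoch, best_loss, waited = epoch, loss, 0
        else:
            waited += 1
            if waited >= patience:
                return best_epoch, epoch
    # Never ran out of patience: training used all epochs
    return best_epoch, len(val_losses) - 1

# Hypothetical validation losses: improvement first, then overfitting
fake_val_loss = [0.45, 0.33, 0.31, 0.36, 0.42, 0.50]
best, stopped = early_stop_epoch(fake_val_loss, patience=2)
print(f"Best epoch: {best}, training stopped at epoch: {stopped}")
```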

Additionally, we can visualize the training and validation loss and accuracy across epochs to help monitor the model's learning progress and detect potential overfitting or underfitting.



In [None]:
# ----------------------------------------------
# Code to plot training and validation loss/accuracy curves
import matplotlib.pyplot as plt

# Extract loss and accuracy history from the training process
train_loss = history.history['loss']
val_loss = history.history['val_loss']
train_acc = history.history.get('accuracy')
val_acc = history.history.get('val_accuracy')

# Plot Loss curves
plt.figure(figsize=(10, 5))
plt.plot(train_loss, label='Training Loss')
plt.plot(val_loss, label='Validation Loss')
plt.title('Training vs Validation Loss')
plt.xlabel('Epoch')
plt.ylabel('Loss')
plt.legend()
plt.grid(True)
plt.show()

# Plot Accuracy curves if available
if train_acc and val_acc:
    plt.figure(figsize=(10, 5))
    plt.plot(train_acc, label='Training Accuracy')
    plt.plot(val_acc, label='Validation Accuracy')
    plt.title('Training vs Validation Accuracy')
    plt.xlabel('Epoch')
    plt.ylabel('Accuracy')
    plt.legend()
    plt.grid(True)
    plt.show()

## Using the model  

The published example does not show how to use the model!

Let us try to get predictions for the phrases:
- "I loved the movie, it was amazing!"
- "The movie was terrible"
- "This movie was not good at all"

Unlike the pipeline function in HuggingFace, Keras requires us to preprocess the input samples and postprocess the network output ourselves.



In [None]:
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.datasets import imdb

# Example raw reviews
reviews = ["I loved the movie, it was amazing!", "The movie was terrible", "This movie was not good at all"]

max_features = 20000
maxlen = 200

# We have not used a tokenizer: the data is already pre-tokenized as sequences of word indices.
# Therefore, text samples need to be translated into these indices.
# load_data reserves indices 0, 1, and 2 for special tokens (padding, start, unknown)
# and shifts every word rank by 3 (index_from=3).

import re

# Load IMDb word index (word -> frequency rank, starting at 1)
word_index = imdb.get_word_index()

def text_to_sequence(text):
    # Lowercase and strip punctuation so that "movie," matches "movie" in the word index
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    sequence = [1]  # start token, as prepended to every review by load_data
    for word in tokens:
        rank = word_index.get(word)
        if rank is not None and rank + 3 < max_features:
            sequence.append(rank + 3)  # offset by 3 because of the reserved indices
        else:
            sequence.append(2)  # unknown or out-of-vocabulary word
    return sequence


# Convert raw text reviews to padded sequences
sequences = [text_to_sequence(review) for review in reviews]
padded_sequences = pad_sequences(sequences, maxlen=maxlen)

# Now you can predict using your model
predictions = model.predict(padded_sequences)
print(predictions)



Let us study this output. The model is trained to predict the sentiment of movie reviews as a binary classification task. The output is a 2D array in which each inner array holds the model's prediction for one review. Each value is a probability between 0 and 1 representing the likelihood that the review is positive (in this dataset, label 1 means positive and 0 means negative).

If the prediction for "I loved the movie, it was amazing!" is greater than 0.5, the model considers the review more likely positive than negative, which makes sense since the review uses positive words like "loved" and "amazing".

We can postprocess these values further to obtain "Positive" and "Negative" labels in the output.

In [None]:

# Print each review with its predicted sentiment
for review, pred in zip(reviews, predictions):
    sentiment = "Positive" if pred[0] >= 0.5 else "Negative"
    print(f'Review: "{review}"\nPredicted sentiment: {sentiment} (probability of being positive: {pred[0]:.4f})\n')

How did it go for "This movie was not good at all"?

##Checking tokenization
It is a good idea to add a small helper function to verify that the text-to-index and index-to-text mappings are consistent. This ensures that the mapping used at prediction time is correct and reversible.

Remember that in this example we have not used a tokenizer but data already pre-tokenized as sequences of word indices.

Deploying Deep Learning systems for NLP is largely about data preprocessing, rather than selecting one architecture or another.
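A small example of why preprocessing matters: a whitespace split leaves punctuation attached to words, so `movie,` and `amazing!` would not be found in a vocabulary that only contains `movie` and `amazing`. The sketch below compares it with a simple regular-expression tokenizer; the pattern is one reasonable assumption, not necessarily the tokenization originally used to build the IMDB indices.

```python
import re

def whitespace_tokens(text):
    # Naive tokenizer: punctuation stays glued to the neighbouring word
    return text.lower().split()

def regex_tokens(text):
    # Keep runs of letters, digits, and apostrophes; drop other punctuation
    return re.findall(r"[a-z0-9']+", text.lower())

text = "I loved the movie, it was amazing!"
print(whitespace_tokens(text))
print(regex_tokens(text))
```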

In [None]:
# Reverse the IMDb word index to get index → word mapping
index_word = {index + 3: word for word, index in imdb.get_word_index().items()}
index_word[0] = "<PAD>"
index_word[1] = "<START>"
index_word[2] = "<UNK>"

def check_imdb_sequence_conversion(text, word_index, max_features=20000):
    """
    Check how a raw review text is converted into a padded sequence,
    and how the sequence maps back to words.
    """
    print("Original text:", text)
    tokens = text.lower().split()

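    # Convert to word indices (unknown or out-of-vocabulary words map to 2, i.e. <UNK>)
    sequence = []
    for word in tokens:
        rank = word_index.get(word)
        if rank is not None and rank + 3 < max_features:
            sequence.append(rank + 3)
        else:
            sequence.append(2)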

    print("\nToken indices:", sequence)

    # Map back to words
    recovered_words = [index_word.get(i, "<UNK>") for i in sequence]
    print("Recovered words:", recovered_words)

    # Check mapping integrity
    print("\nMapping check:")
    for original, recovered in zip(tokens, recovered_words):
        match = "✓" if original == recovered else "✗"
        print(f"{original:15} -> {recovered:15} {match}")

check_imdb_sequence_conversion("This movie was not good at all", word_index)

# KerasHub  

**Update 2025**: KerasNLP has been renamed to KerasHub!

[KerasHub](https://keras.io/keras_hub/getting_started/) is a natural language processing library that works natively with TensorFlow, JAX, or PyTorch.

KerasHub supports users through their entire development cycle. KerasHub **workflows** are built from modular components that have state-of-the-art preset weights and architectures when used **out-of-the-box** and are easily customizable when more control is needed.

Check the [getting started](https://keras.io/keras_hub/getting_started/). Like HuggingFace, KerasHub contains end-to-end implementations of popular model architectures such as BERT and GPT2. Check [KerasHub models](https://keras.io/keras_hub/presets/).

Let us see an example from the getting started (complete [here](https://colab.research.google.com/github/keras-team/keras-io/blob/master/guides/ipynb/keras_nlp/getting_started.ipynb#scrollTo=eLqs_qlzKPoP)).



##Installing and importing KerasHub

In [None]:
!pip install -q --upgrade keras-nlp
!pip install -q --upgrade keras  # Upgrade to Keras 3.

In [None]:
import os

os.environ["KERAS_BACKEND"] = "jax"  # or "tensorflow" or "torch"

import keras_nlp
import keras

# Use mixed precision to speed up all training in this guide.
keras.mixed_precision.set_global_policy("mixed_float16")

##Basic use of KerasHub workflow




In [None]:
classifier = keras_nlp.models.BertClassifier.from_preset("bert_tiny_en_uncased_sst2")
# Note: batched inputs expected so must wrap string in iterable
logits = classifier.predict(["I love modular workflows in keras-nlp!", "I hate modular workflows in keras-nlp!"])
print(logits)

##Logits vs probabilities

In the previous example, the outputs are the logits per class (e.g., `[0, 0]` means a 50% chance of positive). For binary classification, the output order is [negative, positive].

These are not probabilities but **logits**: the raw, unnormalized scores output by the last layer of the model. To be converted into probabilities, they need to go through a softmax function.
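As a framework-independent illustration, softmax can be written in a few lines of plain Python. The logits below are made-up values, and `softmax` is a hypothetical helper; it subtracts the maximum before exponentiating, a standard trick for numerical stability.

```python
import math

def softmax(logits):
    # Subtract the maximum for numerical stability (does not change the result)
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

equal = softmax([0.0, 0.0])     # equal logits give equal probabilities
example = softmax([-1.5, 2.0])  # hypothetical [negative, positive] logits
print(equal)
print(example)
```

With two classes, the positive-class probability equals the sigmoid of the difference between the two logits, which links this output to the single sigmoid unit used in the LSTM model earlier in the notebook.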

**KerasHub (and HuggingFace) classification models output logits instead of probabilities by default** (because it allows for greater flexibility in model training).

Let us get probabilities from logits.

In [None]:
#logits can be converted into probabilities
from torch.nn import functional as F
import torch
# get probabilities using softmax from logit score and convert it to numpy array
probabilities_scores = F.softmax(torch.from_numpy(logits), dim = -1).numpy()
print(f"Negative, Positive probability: {probabilities_scores}")


# Conclusions and Next Steps

In this notebook, we explored how to build and train a Bidirectional LSTM model to classify movie reviews from the IMDB dataset. Bidirectional LSTMs are powerful because they capture context from both past and future words, improving understanding of sequence data such as text.


What do KerasHub "workflows" remind you of? Like the HuggingFace Transformers library and its pipeline function, workflows allow basic out-of-the-box use as well as more advanced features, such as fine-tuning at different levels and pretraining a language model like BERT.

New tools will always emerge to simplify standard machine learning and NLP workflows. However, the deeper your understanding of the underlying mechanisms, the better your ability to:

- Customize models effectively

- Troubleshoot issues

- Achieve improved performance, even when using black-box systems like ChatGPT

Here are some ideas for how you could continue to explore and improve this project:

* You can try to optimize the previous model. There are automatic machine learning tools that make this task much easier, such as [KerasTuner](https://keras.io/keras_tuner/) or [AutoKeras](https://autokeras.com/).
* You can try reproducing another [Keras example for NLP](https://keras.io/examples/nlp/).
* You can try using KerasHub for fine-tuning a pretrained model: [Getting Started with KerasHub](https://keras.io/keras_hub/getting_started/).
