# License and Attribution

This notebook was developed by Emilio Serrano, Full Professor at the Department of Artificial Intelligence, Universidad Politécnica de Madrid (UPM), for educational purposes in UPM courses. Personal website: https://emilioserrano.faculty.bio/

📘 License: Creative Commons Attribution-NonCommercial-ShareAlike (CC BY-NC-SA)

You are free to: (1) Share: copy and redistribute the material in any medium or format; (2) Adapt: remix, transform, and build upon the material.

Under the following terms: (1) Attribution: You must give appropriate credit, provide a link to the license, and indicate if changes were made; (2) NonCommercial: You may not use the material for commercial purposes; (3) ShareAlike: If you remix, transform, or build upon the material, you must distribute your contributions under the same license as the original.

🔗 License details: https://creativecommons.org/licenses/by-nc-sa/4.0/

# Building a Language Model with LSTMs in Keras
In this notebook, we will learn how to build a simple language model for text generation using Recurrent Neural Networks (RNNs) and Long Short-Term Memory (LSTM) units with TensorFlow and Keras. We'll be using the complete works of William Shakespeare as our dataset.

# GPUs?


Make sure you have your GPU available. While LSTMs process sequences step-by-step, many internal operations (like the gate computations) can be efficiently parallelized on a GPU.


In [21]:
%pip install tensorflow-macos
%pip install tensorflow-metal

Note: you may need to restart the kernel to use updated packages.
Note: you may need to restart the kernel to use updated packages.


In [22]:
import tensorflow as tf
devices = tf.config.list_physical_devices()
print("\nDevices: ", devices)

gpus = tf.config.list_physical_devices('GPU')
if gpus:
  details = tf.config.experimental.get_device_details(gpus[0])
  print("GPU details: ", details)


Devices:  [PhysicalDevice(name='/physical_device:CPU:0', device_type='CPU')]


In [24]:
import tensorflow as tf
print("Devices:", tf.config.list_physical_devices())
# TensorFlow has no tf.test.is_mps_available(); with tensorflow-metal installed,
# an Apple GPU is listed as a regular 'GPU' physical device.
print("GPU available:", len(tf.config.list_physical_devices('GPU')) > 0)


Devices: [PhysicalDevice(name='/physical_device:CPU:0', device_type='CPU')]
GPU available: False

# Setup and Imports
First, let's import all the necessary libraries. We'll be using `tensorflow` for building our neural networks, `numpy` for numerical operations, and `requests` to download our dataset.

In [None]:
import tensorflow as tf
from tensorflow.keras.layers import Embedding, LSTM, Dense, SimpleRNN
from tensorflow.keras.models import Sequential
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
import numpy as np
import requests
import os

# Download and Prepare the Dataset
We'll use a Shakespeare text corpus hosted by TensorFlow (the `shakespeare.txt` file at the URL below). The following code will download the text file.



In [None]:
# Download the Shakespeare dataset
url = "https://storage.googleapis.com/download.tensorflow.org/data/shakespeare.txt"
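response = requests.get(url, timeout=30)
response.raise_for_status()  # fail early if the download did not succeed
text = response.text

# Save to a local file (optional)
with open('shakespeare.txt', 'w', encoding='utf-8') as f:
    f.write(text)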

print("Dataset downloaded. Length of text: {} characters".format(len(text)))
print("\nFirst 250 characters of the dataset:")
print(text[:250])

# Preprocessing the Text
Our model can't understand raw text. We need to convert the text into a numerical format. This process involves a few key steps:

- Tokenization: We'll break the text down into individual words (tokens).

- Creating a Vocabulary: We'll build a vocabulary of all the unique words in our text.

- Creating Sequences: We'll create input-output pairs. The model will learn to predict the next word given a sequence of preceding words.
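
As a rough illustration of the three steps above, the sketch below works on a single hand-picked line and builds the vocabulary by hand instead of using the Keras `Tokenizer`. The names `toy_vocab`, `make_prefix_sequences`, and `pre_pad` are hypothetical helpers introduced only for this example; they are not part of the notebook's pipeline.

```python
# Toy illustration of tokenization, vocabulary building, and sequence creation.
# All names here are hypothetical; the notebook itself uses the Keras Tokenizer.

line = "to be or not to be"

# 1. Tokenization: split the line into lowercase words.
words = line.lower().split()

# 2. Vocabulary: assign each unique word an integer index, starting at 1
#    (index 0 is kept free for padding).
toy_vocab = {}
for w in words:
    if w not in toy_vocab:
        toy_vocab[w] = len(toy_vocab) + 1
token_ids = [toy_vocab[w] for w in words]


def make_prefix_sequences(ids):
    """Return every prefix of length >= 2: ids[0:2], ids[0:3], ..."""
    return [ids[: i + 1] for i in range(1, len(ids))]


def pre_pad(seq, length, pad_value=0):
    """Left-pad a sequence with pad_value up to the given length."""
    return [pad_value] * (length - len(seq)) + seq


# 3. Sequences: each padded prefix becomes (input tokens, next-word label).
prefixes = make_prefix_sequences(token_ids)
max_len = max(len(p) for p in prefixes)
padded = [pre_pad(p, max_len) for p in prefixes]
inputs = [p[:-1] for p in padded]
targets = [p[-1] for p in padded]

for x, y in zip(inputs, targets):
    print(x, "->", y)
```

Each printed row is one training example: the left side is the (padded) context and the right side is the word index the model should predict.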

In [None]:
# Tokenize the text
# Tokenizer comes from tensorflow.keras.preprocessing.text.
# oov_token="<OOV>" maps every unknown word to a fixed Out-Of-Vocabulary (OOV) index.
tokenizer = Tokenizer(oov_token="<OOV>")
tokenizer.fit_on_texts([text])
total_words = len(tokenizer.word_index) + 1

print(f"Total unique words: {total_words}")


# The following code generates input sequences for training a next-word prediction model. It works by:
# - taking each line of the text,
# - tokenizing it into integers,
# - creating every prefix (n-gram) sequence of at least two tokens from that line,
# - appending them to a list.
# This prepares the dataset so that, given a sequence of words, the model learns to predict the next one.

# Create input sequences for training a next-word prediction model
input_sequences = []

# Loop over each line in the text to get several training samples per line.
# This step is essential: instead of training the model only on full lines,
# we train it on many partial sequences. This helps the model learn to complete text step by step.
for line in text.split('\n'):
    # Convert the line of text into a sequence of integers (tokens)
    # Example: "I love AI" -> [12, 45, 78]
    token_list = tokenizer.texts_to_sequences([line])[0]

    # Create n-gram sequences from the token list
    # For the example above, it would produce:
    # [12, 45], [12, 45, 78]
    for i in range(1, len(token_list)):
        # Each sequence contains tokens from the start up to position i
        n_gram_sequence = token_list[:i+1]

        # Add this n-gram sequence to the list of input sequences
        input_sequences.append(n_gram_sequence)

# Pad sequences
# IMPORTANT NOTE:
# While LSTM layers in theory can process sequences of varying lengths,
# in practice, Keras (and most deep learning frameworks) require all inputs
# in a batch to have the same shape for computational efficiency.
# Therefore, we must pad shorter sequences with zeros so they all match the same length.
# This is why we calculate the maximum sequence length and apply 'pre' padding (padding at the beginning).
# This way, the most recent tokens (more informative for prediction) are at the end of the sequence.
max_sequence_len = max([len(x) for x in input_sequences])
input_sequences = np.array(pad_sequences(input_sequences, maxlen=max_sequence_len, padding='pre'))

# Create predictors and label

# Split the input sequences into input (xs) and target output (labels)
# xs: all tokens in the sequence except the last one (input for the model)
# labels: the last token in each sequence (this is what the model should learn to predict)
xs, labels = input_sequences[:, :-1], input_sequences[:, -1]

# Convert the labels to one-hot encoded vectors, since the output layer
# will predict a probability distribution over all possible words (vocabulary)
ys = tf.keras.utils.to_categorical(labels, num_classes=total_words)


# Display some training examples
print("\n--- Training Examples ---")
print("Example Input Sequence (numerical):", xs[5])
print("Corresponding Next Word (numerical):", labels[5])

# Let's see the word corresponding to the numerical label
print("Corresponding Next Word (text):", tokenizer.index_word[labels[5]])
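
To make the one-hot step more concrete, here is a small NumPy sketch with a made-up vocabulary of 5 words. It is only an illustration: `toy_labels` and `toy_probs` are invented values, and `np.eye` stands in for `to_categorical`. It also shows why categorical cross-entropy against a one-hot target reduces to the negative log of the probability assigned to the correct word.

```python
import numpy as np

num_classes = 5                      # toy vocabulary size (assumption)
toy_labels = np.array([2, 0, 4])     # invented next-word indices

# One-hot encoding: row i has a 1 in column toy_labels[i], zeros elsewhere.
one_hot = np.eye(num_classes, dtype="float32")[toy_labels]

# Invented model outputs: one probability distribution per example.
toy_probs = np.array([
    [0.10, 0.05, 0.60, 0.15, 0.10],
    [0.70, 0.10, 0.10, 0.05, 0.05],
    [0.20, 0.20, 0.20, 0.20, 0.20],
])

# Categorical cross-entropy: -sum(target * log(prediction)) per example.
loss_one_hot = -np.sum(one_hot * np.log(toy_probs), axis=1)

# Because the target is one-hot, only the correct word's probability matters.
loss_direct = -np.log(toy_probs[np.arange(len(toy_labels)), toy_labels])

print(one_hot)
print(loss_one_hot)
```

With the real vocabulary, every one-hot row has `total_words` entries, so `ys` grows quickly with the number of sequences. Keras also offers the `sparse_categorical_crossentropy` loss, which accepts the integer labels directly and avoids building the one-hot matrix.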

# Building the Models
We'll build a few different models to see how their architectures affect their performance.



## Model 1: A Simple RNN
We'll start with a basic SimpleRNN layer. This will introduce the fundamental concepts.

- Embedding Layer: This layer learns a dense vector representation for each word in our vocabulary. This allows the model to capture semantic relationships between words.

- SimpleRNN Layer: This is our recurrent layer that processes the sequence of word embeddings.

- Dense Layer: This is the output layer that will predict the probability of each word in the vocabulary being the next word.

In [None]:
# Build a simple RNN-based language model using the Keras Sequential API

# We define a Sequential model where each layer feeds directly into the next
model_rnn = Sequential([

    # Embedding layer:
    # This layer turns word indices (integers) into dense vectors of fixed size.
    # - input_dim: total number of unique words in the vocabulary
    # - output_dim: size of each word vector (here, 100)
    # - input_length: length of input sequences (all padded to the same length)
    #
    # Output shape: (batch_size, sequence_length, embedding_dim)
    # For example: (None, 15, 100) if sequence length is 15
    Embedding(input_dim=total_words, output_dim=100, input_length=max_sequence_len - 1),

    # SimpleRNN layer:
    # A basic recurrent neural network layer that processes the sequence.
    # It outputs a fixed-size vector summarizing the input sequence.
    # Output shape: (batch_size, 150)
    SimpleRNN(150),

    # Dense output layer:
    # - total_words units (one per word in the vocabulary)
    # - softmax activation to produce a probability distribution over all words
    #
    # Output shape: (batch_size, total_words)
    Dense(total_words, activation='softmax')
])

# Compile the model:
# - loss: categorical crossentropy for multi-class classification
# - optimizer: Adam is a good default choice
# - metric: we track accuracy to monitor training performance
model_rnn.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])

# Explicitly build the model with a known input shape
# This ensures that model.summary() can display full output shapes instead of '?'
# 'None' here refers to a variable batch size (we don't fix it)
model_rnn.build(input_shape=(None, max_sequence_len - 1))

# Show the model architecture and the shape of each layer's output
model_rnn.summary()

### Understanding the Output Shape of each layer

### Tensors


A 1D tensor is a vector: a simple array of numbers, for example, [3, 5, 7]. It has a shape like (n,), where n is the number of elements.

A 2D tensor is a matrix: an array of vectors (1D tensors). For example, a matrix with shape (m, n) can be seen as m rows, each a vector of length n.

A 3D tensor is an array of matrices (2D tensors). It can be thought of as a stack of 2D tensors with shape (p, m, n), where p is the number of matrices (or "slices" in the stack), m is the number of rows in each matrix, and n is the number of columns in each matrix. In other words, a 3D tensor is a nested array of arrays of arrays.




### Embedding layer




When you see an output shape like: *(batch_size, sequence_length, embedding_dim)* it describes a 3D tensor with three dimensions:

- batch_size: The number of examples processed simultaneously in one batch. This dimension is often None in Keras, meaning it can vary (e.g., 32, 64, or any number of examples).

- sequence_length: The length of each input sequence, i.e., how many tokens (words or subwords) each example contains. For example, if you feed sequences of 15 tokens, this dimension will be 15.

- embedding_dim: The size of the vector that represents each token after embedding. This is the "dense vector" dimension you choose in the embedding layer (e.g., 100).

Imagine you have a batch of 2 sequences, each sequence has 3 tokens, and your embedding dimension is 4. Then your tensor looks like this:


```
[
  [  # Example 1 of batch
    [0.1, 0.2, 0.3, 0.4],  # Token 1 vector of sequence (embedding of 4 dimensions)
    [0.5, 0.6, 0.7, 0.8],  # Token 2 vector of sequence (embedding of 4 dimensions)
    [0.9, 1.0, 1.1, 1.2]   # Token 3 vector of sequence (embedding of 4 dimensions)
  ],
  [  # Example 2 of batch
    [0.11, 0.12, 0.13, 0.14],
    [0.15, 0.16, 0.17, 0.18],
    [0.19, 0.20, 0.21, 0.22]
  ]
]
```

This is an array of two examples, each containing a sequence of 3 tokens, where each token is represented by a 4-dimensional vector.
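
The same shape can be reproduced with a few lines of NumPy. This is a minimal sketch that treats the embedding layer as a plain lookup table with random values; `embedding_table` and `token_batch` are illustrative names with toy sizes, not objects from the notebook.

```python
import numpy as np

rng = np.random.default_rng(0)

vocab_size, embedding_dim = 10, 4     # toy sizes (assumptions)

# An embedding layer is essentially a trainable lookup table:
# one row of length embedding_dim per word index.
embedding_table = rng.normal(size=(vocab_size, embedding_dim))

# A batch of 2 sequences, each with 3 token indices.
token_batch = np.array([[1, 5, 7],
                        [2, 2, 9]])

# Indexing the table with the batch replaces every index by its vector.
embedded = embedding_table[token_batch]

print(embedded.shape)  # (2, 3, 4)
```

Note that the repeated index 2 in the second sequence maps to the same vector both times: the embedding depends only on the word index, not on its position.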

### SimpleRNN and Dense layers
When building a sequential model, you typically specify the input shape only in the first layer (like the Embedding layer). Keras will automatically infer the output shape of each layer and use that as the input shape for the next layer.

The SimpleRNN layer has an output shape of *(None, 150)* because it has 150 units (neurons). The layer outputs a vector of length 150 for each example in the batch. This vector is a learned summary of the input sequence.

Finally, the Dense layer output shape is *(None, 12633)*, where 12,633 is the size of the vocabulary (total_words) in this run. Because the layer uses a softmax activation, it produces a vector of 12,633 probabilities (not raw logits), one for each possible next word in the vocabulary.
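
The shapes above also determine how many trainable parameters each layer has. The arithmetic below is a sketch using the standard formulas for these layer types and assuming the vocabulary size of 12,633 mentioned above; your value of `total_words` may differ, in which case the totals change accordingly.

```python
# Parameter-count arithmetic for Model 1 (a sketch; sizes taken from the text above).
vocab_size = 12633      # total_words reported above; may differ in your run
embedding_dim = 100
rnn_units = 150

# Embedding: one vector of length embedding_dim per word.
embedding_params = vocab_size * embedding_dim

# SimpleRNN: input weights + recurrent weights + one bias per unit.
rnn_params = rnn_units * (embedding_dim + rnn_units + 1)

# Dense: one weight per (rnn unit, word) pair plus one bias per word.
dense_params = rnn_units * vocab_size + vocab_size

total_params = embedding_params + rnn_params + dense_params
print(embedding_params, rnn_params, dense_params, total_params)
```

Almost all parameters sit in the Embedding and Dense layers, because both scale with the vocabulary size, while the recurrent layer itself is comparatively small.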

## Model 2: A Single LSTM Layer
LSTMs are a type of RNN that are better at learning long-term dependencies in data. This is crucial for language, where the meaning of a word can depend on context from much earlier in a sentence or paragraph.
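
To give an idea of what happens inside an LSTM unit, here is a minimal NumPy sketch of a single time step. It follows the commonly used formulation with input, forget, and output gates plus a candidate cell state. The weights are random, the sizes are toy values, and the names (`lstm_step`, `W`, `U`, `b`) are illustrative; this is not how Keras organizes its internals.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def lstm_step(x_t, h_prev, c_prev, W, U, b):
    """One LSTM time step.

    W: (input_dim, 4 * units), U: (units, 4 * units), b: (4 * units,)
    The four blocks hold the input gate, forget gate, candidate, and output gate.
    """
    units = h_prev.shape[0]
    z = x_t @ W + h_prev @ U + b
    i = sigmoid(z[0 * units:1 * units])   # input gate: how much new info to write
    f = sigmoid(z[1 * units:2 * units])   # forget gate: how much old memory to keep
    g = np.tanh(z[2 * units:3 * units])   # candidate values for the cell state
    o = sigmoid(z[3 * units:4 * units])   # output gate: how much memory to expose
    c_t = f * c_prev + i * g              # additive update of the long-term memory
    h_t = o * np.tanh(c_t)                # new hidden state
    return h_t, c_t


rng = np.random.default_rng(0)
input_dim, units = 4, 3                   # toy sizes (assumptions)
W = rng.normal(scale=0.1, size=(input_dim, 4 * units))
U = rng.normal(scale=0.1, size=(units, 4 * units))
b = np.zeros(4 * units)

h, c = np.zeros(units), np.zeros(units)
for x_t in rng.normal(size=(5, input_dim)):   # a sequence of 5 time steps
    h, c = lstm_step(x_t, h, c, W, U, b)

print(h.shape, c.shape)
```

The cell state `c_t` is updated additively rather than being fully rewritten at each step, which is what helps information (and gradients) survive over many time steps. The four weight blocks also explain the parameter count: an LSTM with 150 units on 100-dimensional inputs has 4 × 150 × (100 + 150 + 1) = 150,600 parameters, four times as many as a SimpleRNN of the same size.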

In [None]:
model_lstm = Sequential([
    Embedding(total_words, 100, input_length=max_sequence_len-1),
    LSTM(150),
    Dense(total_words, activation='softmax')
])

model_lstm.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
model_lstm.build(input_shape=(None, max_sequence_len - 1))
model_lstm.summary()

## Model 3: A Multi-Layer LSTM

Stacking multiple LSTM layers can allow the model to learn more complex patterns in the data at different timescales.

Note that the first LSTM returns the full sequence (all time steps) because the second LSTM needs a sequence input.

The second LSTM returns only the last output (default return_sequences=False), since after the last LSTM you usually want a single vector to feed the Dense layer for prediction.
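
The effect of `return_sequences` on shapes can be sketched without Keras. The toy function below is a plain tanh RNN standing in for the LSTM layers (a deliberate simplification), with random weights; `toy_rnn` is a hypothetical helper, and the sequence length of 15 is an arbitrary example value.

```python
import numpy as np


def toy_rnn(x, units, return_sequences=False, seed=0):
    """A toy tanh RNN over x with shape (batch, time_steps, features)."""
    rng = np.random.default_rng(seed)
    batch, time_steps, features = x.shape
    W = rng.normal(scale=0.1, size=(features, units))
    U = rng.normal(scale=0.1, size=(units, units))
    h = np.zeros((batch, units))
    outputs = []
    for t in range(time_steps):
        h = np.tanh(x[:, t, :] @ W + h @ U)
        outputs.append(h)
    if return_sequences:
        return np.stack(outputs, axis=1)   # (batch, time_steps, units)
    return h                               # (batch, units): last step only


x = np.random.default_rng(1).normal(size=(2, 15, 100))  # (batch, time, embedding)

first = toy_rnn(x, units=150, return_sequences=True)    # feeds the next recurrent layer
second = toy_rnn(first, units=100)                      # feeds the Dense layer

print(first.shape, second.shape)  # (2, 15, 150) (2, 100)
```

The first layer keeps the time dimension so the second layer still has a sequence to iterate over; the second layer drops it and hands a single vector per example to the Dense layer.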

In [None]:
model_multi_lstm = Sequential([
    Embedding(total_words, 100, input_length=max_sequence_len-1),
    LSTM(150, return_sequences=True), # return_sequences=True is needed to stack LSTMs
    LSTM(100),
    Dense(total_words, activation='softmax')
])

model_multi_lstm.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
model_multi_lstm.build(input_shape=(None, max_sequence_len - 1))
model_multi_lstm.summary()

# Training the Model
For this tutorial, we'll train our multi-layer LSTM model (`model_multi_lstm`). Training deep learning models on large datasets can take a significant amount of time.

To keep this runnable in a reasonable time on Google Colab, we'll use a small subset of the data and train for only a few epochs. The results won't be perfect, but it will demonstrate the process.




In [None]:
from sklearn.model_selection import train_test_split

# For demonstration purposes, we'll use a smaller portion of the dataset.
# In real scenarios, you should use the full dataset (xs, ys) for better results.
xs_small = xs[:20000]
ys_small = ys[:20000]

print("Splitting data into training and validation sets...")

# Split the small subset into training and validation sets (80% train, 20% val)
x_train, x_val, y_train, y_val = train_test_split(xs_small, ys_small, test_size=0.2, random_state=42)

print("Training on a smaller subset of the data (with validation).")

# Train the model with validation data to monitor generalization
history_lstm = model_multi_lstm.fit(
    x_train,
    y_train,
    epochs=10,
    batch_size=64,
    validation_data=(x_val, y_val),
    verbose=1
)

# Evaluating the Model

As seen, when you train a model in Keras with validation_data=(x_val, y_val), the output per epoch looks like this:
```
250/250 ━━━━━━━━━━━━━━━━━━━━ 4s 15ms/step - accuracy: 0.3734 - loss: 2.9888 - val_accuracy: 0.0747 - val_loss: 7.7836

```
This is just an output example; your results may be different.

Interpretation:
- The training accuracy is around 37%, which means the model is learning patterns from the training data.
- However, the validation accuracy is much lower (about 7.5%), and the validation loss is much higher (7.78 vs 2.99).
- This suggests overfitting: the model performs better on data it has seen and much worse on new data.
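
One way to put these loss values in perspective is to compare them with a uniform guess over the vocabulary and to convert them to perplexity. The sketch below uses the numbers from the example log line and assumes the vocabulary size of 12,633 reported earlier; with different runs, the values will differ.

```python
import math

vocab_size = 12633          # vocabulary size reported earlier; may differ in your run
train_loss = 2.9888         # values copied from the example log line above
val_loss = 7.7836

# Loss of a model that assigns equal probability to every word.
uniform_loss = math.log(vocab_size)

# Perplexity = exp(cross-entropy): roughly how many words the model is "choosing among".
train_perplexity = math.exp(train_loss)
val_perplexity = math.exp(val_loss)

print(f"uniform-guess loss: {uniform_loss:.2f}")
print(f"train perplexity:   {train_perplexity:.1f}")
print(f"val perplexity:     {val_perplexity:.1f}")
```

The validation loss of 7.78 is still below the uniform-guess loss of about 9.44, so the model does better than random on unseen data, but its validation perplexity is far higher than its training perplexity, which is the gap that signals overfitting.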

This is common when the training data is limited relative to the size of the model. You can improve generalization by:

- Using early stopping (halting training once the validation loss stops improving)

- Adding dropout or regularization

- Increasing the dataset size

- Using better architecture (like bidirectional LSTM or attention)
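
As an illustration of the first idea, the sketch below shows the logic behind early stopping on a list of validation losses. The function `early_stopping_epoch` and the loss values are invented for this example. In Keras, the same behavior is available through the `tf.keras.callbacks.EarlyStopping` callback (with parameters such as `monitor`, `patience`, and `restore_best_weights`), passed to `fit` via `callbacks=[...]`.

```python
def early_stopping_epoch(val_losses, patience=2):
    """Return (stop_epoch, best_epoch), both 0-based.

    Training stops once the validation loss has failed to improve
    for `patience` consecutive epochs.
    """
    best_loss = float("inf")
    best_epoch = -1
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch, best_epoch
    return len(val_losses) - 1, best_epoch


# Invented validation losses: improving at first, then getting worse.
toy_val_losses = [6.9, 6.5, 6.4, 6.6, 6.9, 7.3, 7.8]
stop_epoch, best_epoch = early_stopping_epoch(toy_val_losses, patience=2)
print(stop_epoch, best_epoch)  # 4 2
```

Here training would stop after epoch 4, and the weights from epoch 2 (the lowest validation loss) are the ones worth keeping.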


The next section of code is used to visualize how your model performed during training. It helps you diagnose problems like overfitting, underfitting, or whether your model is still improving.

It uses `history_lstm.history`, a dictionary that stores all the metrics recorded during training, epoch by epoch. `history_lstm` is an instance of the `History` class returned by `model.fit()` in Keras, and `.history` is an attribute of that object.

A decreasing curve of *loss* means the model is learning. If the training loss continues to decrease but validation loss increases, your model is overfitting. As with loss, if training *accuracy* increases while validation accuracy remains low or decreases, it's a sign of overfitting.





In [None]:
import matplotlib.pyplot as plt

# Evaluate the model on the validation set after training
val_loss, val_accuracy = model_multi_lstm.evaluate(x_val, y_val, verbose=0)
print(f"\nValidation Loss: {val_loss:.4f}, Validation Accuracy: {val_accuracy:.4f}")

# Plot the training and validation loss over epochs
plt.figure(figsize=(10, 5))
plt.plot(history_lstm.history['loss'], label='Training Loss')
plt.plot(history_lstm.history['val_loss'], label='Validation Loss')
plt.title('Training vs Validation Loss')
plt.xlabel('Epoch')
plt.ylabel('Loss')
plt.legend()
plt.grid(True)
plt.show()

# Optionally: plot accuracy if you're tracking it
if 'accuracy' in history_lstm.history:
    plt.figure(figsize=(10, 5))
    plt.plot(history_lstm.history['accuracy'], label='Training Accuracy')
    plt.plot(history_lstm.history['val_accuracy'], label='Validation Accuracy')
    plt.title('Training vs Validation Accuracy')
    plt.xlabel('Epoch')
    plt.ylabel('Accuracy')
    plt.legend()
    plt.grid(True)
    plt.show()

# Generating Text


## The generate_text function

Now that our model is trained, let's use it to generate some text! We'll create a function that takes a starting text (a "seed") and generates a sequence of words.

We'll also introduce the concept of temperature. Temperature is a parameter that controls the randomness of the predictions.

- A low temperature will make the model more confident and deterministic, leading to less surprising but potentially more repetitive text.

- A high temperature will increase the randomness, leading to more creative and diverse, but also potentially more nonsensical, text.
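
The effect of temperature can be seen on a small, made-up distribution over 5 words. The sketch below is a standalone illustration: `apply_temperature` is a hypothetical helper, and the probabilities are invented for the example.

```python
import numpy as np


def apply_temperature(probs, temperature):
    """Rescale a probability vector: p_i ** (1 / T), then renormalize."""
    logits = np.log(np.asarray(probs, dtype="float64") + 1e-8) / temperature
    scaled = np.exp(logits - np.max(logits))   # subtract max for numerical stability
    return scaled / scaled.sum()


probs = [0.10, 0.05, 0.60, 0.15, 0.10]         # toy distribution over 5 words

for t in (0.5, 1.0, 1.5):
    print(t, np.round(apply_temperature(probs, t), 3))
```

At temperature 0.5 the most likely word's probability rises from 0.60 to about 0.89, at 1.0 the distribution is unchanged, and at 1.5 it falls to about 0.46, leaving more room for the less likely words.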

In [None]:
def generate_text(seed_text, next_words, model, max_sequence_len, tokenizer, temperature=1.0):
    """
    Generates text from a given seed text using a trained model and temperature-controlled sampling.

    Arguments:
    - seed_text: initial string to begin text generation (can be empty "")
    - next_words: number of words to generate.
    - model: trained language model (e.g., LSTM).
    - max_sequence_len: maximum length used during training for padding.
    - tokenizer: tokenizer used to convert words to integers.
    - temperature: controls randomness. Lower = more predictable; Higher = more creative.

    Returns:
    - A string containing the seed and generated text.
    """
    generated_text = seed_text

    # Loop that runs once for each word you want to generate
    for _ in range(next_words):
        # Convert the current text (initially the seed) to a sequence of integers using the same tokenizer as in training
        token_list = tokenizer.texts_to_sequences([seed_text])[0]
        # Pad the sequence to match the input length expected by the model (as in training)
        token_list = pad_sequences([token_list], maxlen=max_sequence_len-1, padding='pre')

        # Predict the probability distribution for the next word
        # This returns a vector of probabilities for each word in the vocabulary
        # For example: [0.10, 0.05, 0.60, 0.15, 0.10] for a vocab of size 5
        predicted_probs = model.predict(token_list, verbose=0)[0]

        # Apply temperature to adjust the randomness of predictions.
        # Temperature affects how confident or diverse the next word selection is.
        # 1. Take the logarithm of the probabilities.
        #    Logarithms help spread out small probabilities and shrink large ones.
        #    We add a tiny value (1e-8) to avoid log(0), which is undefined.
        log_preds = np.log(predicted_probs + 1e-8)

        # 2. Divide by the temperature.
        #    - Low temperature (< 1.0): makes the distribution sharper (less random, more predictable)
        #    - High temperature (> 1.0): makes it softer (more random, more creative)
        # Example:
        # Original probs:       [0.10, 0.05, 0.60, 0.15, 0.10]
        # With temp = 0.5: sharpens → more likely to pick 0.60
        # With temp = 1.5: flattens → more chance for lower-probability words
        # With temp = 1.0: the original probability distribution
        adjusted_log_preds = log_preds / temperature

        # 3. Convert back from log scale using exponentials
        exp_preds = np.exp(adjusted_log_preds)

        # 4. Normalize the distribution to make it a valid probability vector again
        # This is like softmax: it ensures all values sum to 1
        predicted_probs = exp_preds / np.sum(exp_preds)

        # Now predicted_probs is ready for sampling the next word using np.random.choice
        # It has been adjusted by temperature to control creativity


        # Randomly choose the next word based on the probabilities
        predicted_id = np.random.choice(len(predicted_probs), p=predicted_probs)

        # Convert predicted token index back to word
        output_word = tokenizer.index_word.get(predicted_id, "")


        # Add the predicted word to the ongoing text
        seed_text += " " + output_word  # for the next iteration of the loop
        generated_text += " " + output_word  # for the final output

    return generated_text

## Generating Text Starting from an Empty Seed

In [None]:
print("\n--- Generating Text Starting from an Empty Seed ---")

# Start from an empty string
empty_seed = ""

# Generate text with temperature = 1.0 (balanced creativity)
generated = generate_text(seed_text=empty_seed,
                          next_words=50,
                          model=model_multi_lstm,  # the model trained above (model_lstm was never trained)
                          max_sequence_len=max_sequence_len,
                          tokenizer=tokenizer,
                          temperature=1.0)

print("\nGenerated Text:\n", generated)

## Generating Text from Seed

We will use "to be or not to be" as seed.

Though none of these completions form perfect sentences, they show how adjusting temperature lets you control the trade-off between safe, predictable text and bold, novel expressions, a key feature in creative AI text generation.




In [None]:
print("\n--- Generating Text from Seed ---")
seed = "to be or not to be"

# Use model_multi_lstm, the only model trained in this notebook
print("\n... with temperature 0.2 (very conservative and predictable)")
print(generate_text(seed, 10, model_multi_lstm, max_sequence_len, tokenizer, temperature=0.2))

print("\n... with temperature 1.0 (balanced creativity and coherence)")
print(generate_text(seed, 10, model_multi_lstm, max_sequence_len, tokenizer, temperature=1.0))

print("\n... with temperature 1.5 (more creative, possibly less grammatical)")
print(generate_text(seed, 20, model_multi_lstm, max_sequence_len, tokenizer, temperature=1.5))


## Checking tokenization

It's a good idea to add a small helper function to verify that the text-to-index and index-to-text mappings are consistent. This ensures the tokenizer mapping used in generation is correct and reversible.

Deploying Deep Learning systems for NLP is largely about data preprocessing, rather than selecting one architecture or another.

In [None]:
from tensorflow.keras.preprocessing.text import text_to_word_sequence

def check_tokenizer_mapping(tokenizer, sample_text):
    """
    Check that texts_to_sequences and index_word mapping are consistent.

    Args:
        tokenizer: A fitted Keras Tokenizer.
        sample_text: A string whose tokens will be tested.

    Prints the original words, their integer indices, and the words recovered via index_word mapping.
    """
    # Split the text the same way the Tokenizer does (lowercase, punctuation removed),
    # so that "learning," is compared as "learning"
    words = text_to_word_sequence(sample_text)

    # Tokenize the sample text into word indices
    tokens = tokenizer.texts_to_sequences([sample_text])[0]
    print("Original words:  ", words)
    print("Token indices:   ", tokens)

    # Map indices back to words using index_word ("<UNK>" if an index is not found)
    recovered = [tokenizer.index_word.get(i, "<UNK>") for i in tokens]
    print("Recovered words: ", recovered)

    # Compare word by word; words the tokenizer never saw come back as "<OOV>" and are flagged "???"
    print("\nMapping check:", ["OK" if orig == rec else "???"
                               for orig, rec in zip(words, recovered)])


# Example usage:
sample = "I love AI and machine learning, but Shakespeare did not write about them"
check_tokenizer_mapping(tokenizer, sample)

# Conclusion and Next Steps
Congratulations! You've successfully built and trained a language model to generate text in the style of Shakespeare.

Here are some ideas for how you could continue to explore and improve this project:

- Train on the full dataset: For better results, train the model on all of the data (xs and ys).

- Train for more epochs: More training can help the model learn the patterns of the language, as long as you monitor the validation loss to catch overfitting.

- Experiment with model architecture: Try adding more LSTM layers, changing the number of units in each layer, or adding Dropout layers to prevent overfitting.

- Hyperparameter tuning: Experiment with the Embedding dimension, learning rate, and other hyperparameters.

- Character-level model: Instead of tokenizing by words, you could build a model that predicts the next character. This can be more flexible for handling unknown words.

- Use a different dataset: Try training a model on a different author's work, or a different type of text altogether!

This notebook provides a foundational understanding of how to build language models. The field of Natural Language Processing is vast and rapidly evolving, but these core concepts will serve you well as you continue to learn.