# Building Transformer Models for Chatbots

In this lesson we will take what we learned over the previous two weeks and apply it to building a simple chatbot.
Our own GPT!

## Pre-trained Embeddings

## Next-word prediction

In [1]:
import tensorflow as tf
import numpy as np
import re
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.layers import Layer, Embedding, Dense, LayerNormalization, Dropout, MultiHeadAttention

import pickle

First things first, we need some text to use as training data.
For this lesson, we'll use the complete works of Shakespeare.

In [2]:
# Load the text file
path_to_file = tf.keras.utils.get_file("shakespeare.txt", 
                                       "https://www.gutenberg.org/files/100/100-0.txt")

with open(path_to_file, "r", encoding="utf-8") as f:
    text = f.read()

# Preview the first few lines
print(text[:1000])

*** START OF THE PROJECT GUTENBERG EBOOK 100 ***
The Complete Works of William Shakespeare

by William Shakespeare




                    Contents

    THE SONNETS
    ALL’S WELL THAT ENDS WELL
    THE TRAGEDY OF ANTONY AND CLEOPATRA
    AS YOU LIKE IT
    THE COMEDY OF ERRORS
    THE TRAGEDY OF CORIOLANUS
    CYMBELINE
    THE TRAGEDY OF HAMLET, PRINCE OF DENMARK
    THE FIRST PART OF KING HENRY THE FOURTH
    THE SECOND PART OF KING HENRY THE FOURTH
    THE LIFE OF KING HENRY THE FIFTH
    THE FIRST PART OF HENRY THE SIXTH
    THE SECOND PART OF KING HENRY THE SIXTH
    THE THIRD PART OF KING HENRY THE SIXTH
    KING HENRY THE EIGHTH
    THE LIFE AND DEATH OF KING JOHN
    THE TRAGEDY OF JULIUS CAESAR
    THE TRAGEDY OF KING LEAR
    LOVE’S LABOUR’S LOST
    THE TRAGEDY OF MACBETH
    MEASURE FOR MEASURE
    THE MERCHANT OF VENICE
    THE MERRY WIVES OF WINDSOR
    A MIDSUMMER NIGHT’S DREAM
    MUCH ADO ABOUT NOTHING
    THE TRAGEDY OF OTHELLO, THE MOOR OF VENICE
    PERICLES, PRINC

We will now do some preprocessing to prepare our data for training.

### Step 1. Preprocess the Text

* Convert to lowercase (to reduce redundancy)
* Remove special characters, but keep punctuation.
* Tokenize into words (unlike last week where we used characters)

In [3]:
# Remove unnecessary characters, keeping punctuation and words
text = re.sub(r"[^a-zA-Z0-9.,;?!'\" \n]", "", text.lower())

# Split into lines (each line is treated as one training sequence)
sentences = text.split("\n")

# Strip whitespace and drop lines that are empty after stripping
sentences = [s.strip() for s in sentences if s.strip()]

# Print sample
print(sentences[:10])

['start of the project gutenberg ebook 100', 'the complete works of william shakespeare', 'by william shakespeare', 'contents', 'the sonnets', 'alls well that ends well', 'the tragedy of antony and cleopatra', 'as you like it', 'the comedy of errors', 'the tragedy of coriolanus']
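To make the effect of the cleaning pattern concrete, here is a small sketch that applies the same regular expression to a made-up sample line (the sample string is an assumption for illustration, not taken from the corpus).

```python
import re

# Made-up sample line; the pattern is the same one used in the cell above
sample = "ALL’S WELL, that ends well! [Act 1] *** 100%"

# Lowercase first, then drop every character outside the allowed set
cleaned = re.sub(r"[^a-zA-Z0-9.,;?!'\" \n]", "", sample.lower())

print(repr(cleaned))
```

Note that the curly apostrophe is not in the allowed set, which is why "all’s well" shows up as "alls well" in the sample output above.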


### Use Keras Tokenizer

Turn words into integer tokens.

In [4]:
# Initialize and fit tokenizer
tokenizer = Tokenizer(filters="")
tokenizer.fit_on_texts(sentences)

# Convert text to sequences
sequences = tokenizer.texts_to_sequences(sentences)

# Vocabulary size
vocab_size = len(tokenizer.word_index) + 1  # +1 for padding token

print(f"Vocabulary size: {vocab_size}")
print(f"Sample tokenized sentence: {sequences[0]}")

Vocabulary size: 56203
Sample tokenized sentence: [2910, 5, 1, 6634, 20332, 20333, 13473]
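To build intuition for what the tokenizer does, here is a rough pure-Python sketch of the same idea: count words, give the most frequent word index 1, and reserve 0 for padding. The helper names `build_word_index` and `texts_to_ids` are hypothetical, and this is not the Keras implementation.

```python
from collections import Counter


def build_word_index(lines):
    """Map each word to an integer, most frequent word first (index 1)."""
    counts = Counter(word for line in lines for word in line.split())
    # Index 0 is left unused so it can serve as the padding value
    return {word: i for i, (word, _) in enumerate(counts.most_common(), start=1)}


def texts_to_ids(lines, word_index):
    """Replace each known word by its integer index."""
    return [[word_index[w] for w in line.split() if w in word_index] for line in lines]


toy_lines = ["to be or not to be", "that is the question"]
toy_index = build_word_index(toy_lines)
toy_ids = texts_to_ids(toy_lines, toy_index)
toy_vocab_size = len(toy_index) + 1  # +1 for the padding index

print(toy_index)
print(toy_ids)
print(toy_vocab_size)
```

Because the text is split on whitespace only, punctuation stays attached to the word it follows, so "well" and "well," count as different vocabulary entries.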


### Create input sequences and output tokens

We need to create the actual data the model will train on.
* Input is a sequence of words
* Output is the next word

**E.g.**
> ```
> ["to", "be", "or", "not"] -> "to"
> ```

In [5]:
# Create input-output sequences
input_sequences = []
output_words = []

seq_length = 10  # Number of words per training sample

for seq in sequences:
    for i in range(1, len(seq)):
        context = seq[max(0, i - seq_length):i]  # Previous words as input
        target = seq[i]  # Next word as label
        input_sequences.append(context)
        output_words.append(target)

# Pad sequences to have a uniform length
# (Padding the front)
input_sequences = pad_sequences(input_sequences, maxlen=seq_length, padding="pre")

# Convert output to numpy array
output_words = np.array(output_words)

print(f"Sample input: {input_sequences[3]}")
print(f"Target word index: {output_words[3]}")

Sample input: [   0    0    0    0    0    0 2910    5    1 6634]
Target word index: 20332
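The sliding-window logic is easier to follow on a tiny sequence. The sketch below mirrors the loop above in pure Python, with a window of 4 instead of 10 and front padding done by hand; `make_examples` is a hypothetical helper name and the token ids are made up.

```python
def make_examples(token_ids, window):
    """Return (left-padded context, next token) pairs for one sequence."""
    pairs = []
    for i in range(1, len(token_ids)):
        context = token_ids[max(0, i - window):i]  # Up to `window` previous tokens
        padded = [0] * (window - len(context)) + context  # Pad on the left with 0
        pairs.append((padded, token_ids[i]))
    return pairs


# Made-up ids for "to be or not to be"
ids = [1, 2, 3, 4, 1, 2]
examples = make_examples(ids, window=4)

for context, target in examples:
    print(context, "->", target)
```

Every position after the first yields one training example, so a line of n tokens contributes n - 1 examples.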


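### Step 5: Prepare the target labels

We also need to put the target vector into the shape the loss function expects.
We keep the targets as integer word indices rather than one-hot vectors, because the loss used later (`SparseCategoricalCrossentropy`) takes integer labels directly.

In [6]:
# Keep targets as integer class indices (no one-hot encoding needed)
output_words = np.array(output_words, dtype=np.int32)  # Ensure integer type
output_words = output_words.reshape(-1, 1)  # One label per row: shape (num_samples, 1)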


print(f"Shape of input data: {input_sequences.shape}")
print(f"Shape of output data: {output_words.shape}")

Shape of input data: (809773, 10)
Shape of output data: (809773, 1)
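Integer targets are enough because cross-entropy only looks at the log-probability of the correct class. The following pure-Python sketch (all function names are hypothetical, and the logits are made up) shows that the sparse form and the one-hot form give the same loss.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of logits."""
    peak = max(logits)
    log_total = peak + math.log(sum(math.exp(z - peak) for z in logits))
    return [z - log_total for z in logits]


def sparse_cross_entropy(logits, target_index):
    """Loss when the label is an integer class index."""
    return -log_softmax(logits)[target_index]


def one_hot_cross_entropy(logits, one_hot):
    """Loss when the label is a one-hot vector."""
    return -sum(t * lp for t, lp in zip(one_hot, log_softmax(logits)))


logits = [2.0, 0.5, -1.0, 0.0]  # Made-up scores over a 4-word vocabulary
sparse_loss = sparse_cross_entropy(logits, 1)
dense_loss = one_hot_cross_entropy(logits, [0.0, 1.0, 0.0, 0.0])

print(sparse_loss, dense_loss)
```

With a vocabulary of tens of thousands of words, skipping the one-hot vectors also avoids storing a very large, mostly zero label matrix.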


### Step 6: Prepare data for training

Now that our inputs and targets are ready, we can use TensorFlow's `tf.data` API to shuffle, batch, and prefetch them for efficient training.

In [7]:
# Prepare the dataset for batch processing

BATCH_SIZE = 64
BUFFER_SIZE = 10000

dataset = tf.data.Dataset.from_tensor_slices((input_sequences, output_words))
dataset = dataset.shuffle(BUFFER_SIZE).batch(BATCH_SIZE, drop_remainder=True).prefetch(tf.data.AUTOTUNE)
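As a quick arithmetic check on the batching: with `drop_remainder=True`, the number of steps per epoch is the integer quotient of the sample count by the batch size, and the leftover samples are skipped in that epoch. The sketch below uses the sample count printed in the previous step.

```python
num_examples = 809_773  # Rows in input_sequences, from the shape printed above
batch_size = 64

# divmod gives the number of full batches and the samples left over
steps_per_epoch, leftover = divmod(num_examples, batch_size)

print(f"Steps per epoch: {steps_per_epoch}")
print(f"Samples dropped per epoch: {leftover}")
```

Because the dataset is reshuffled each epoch, the dropped samples are generally not the same ones every time.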

## Building the Transformer Model

We now need to construct our MiniGPT model.
We will create two classes, `TransformerDecoderBlock` and `MiniGPT`.

* `TransformerDecoderBlock` consists of a masked multi-head attention layer, as well as normalization layers and dense layers.
* `MiniGPT` adds the positional encoding, combines several decoder blocks in sequence, and generates the model's final output.

Instead of sinusoidal encoding, our model initializes trainable positional embeddings as a tensor of zeros:

```python
    self.pos_embedding = tf.Variable(
        initial_value=tf.zeros(shape=(1, max_len, embed_dim)), trainable=True, name="pos_embedding"
    )
```

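At first, this means every position is represented identically. During training, however, the model learns to adjust these embeddings so that they encode positional information suited to the dataset.
GPT-style models commonly use learned positional embeddings like this in place of fixed sinusoidal ones.
One caveat: under Keras 3, a `tf.Variable` assigned as a model attribute is not automatically tracked as a trainable weight, so if the positional embeddings do not seem to update, registering them with `self.add_weight(...)` is the documented alternative.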
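For contrast, here is a minimal pure-Python sketch of the fixed sinusoidal encoding from the original Transformer design, next to the all-zeros starting point used for learned embeddings. The function name `sinusoidal_encoding` and the tiny dimensions are assumptions chosen for readability.

```python
import math


def sinusoidal_encoding(max_len, embed_dim):
    """Fixed positional table: sine on even dimensions, cosine on odd ones."""
    table = []
    for pos in range(max_len):
        row = []
        for i in range(embed_dim):
            # Each pair of dimensions (2k, 2k+1) shares one frequency
            angle = pos / (10000 ** ((2 * (i // 2)) / embed_dim))
            row.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
        table.append(row)
    return table


fixed = sinusoidal_encoding(max_len=10, embed_dim=4)
learned_init = [[0.0] * 4 for _ in range(10)]  # Learned embeddings start as zeros

print(fixed[1])
print(learned_init[1])
```

The sinusoidal table distinguishes positions from the first step, while the zero-initialized table only does so once training has moved its rows apart.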

In [8]:
import tensorflow as tf
from tensorflow.keras.layers import Layer, Embedding, Dense, LayerNormalization, Dropout, MultiHeadAttention


# Define the transformer decoder block class

class TransformerDecoderBlock(Layer):
    def __init__(self, embed_dim: int, num_heads: int, ff_dim: int, dropout_rate: float = 0.1):
        super().__init__()
        self.attention = MultiHeadAttention(num_heads=num_heads, key_dim=embed_dim)
        self.norm1 = LayerNormalization(epsilon=1e-6)
        self.norm2 = LayerNormalization(epsilon=1e-6)
        self.ffn = tf.keras.Sequential([
            Dense(ff_dim, activation="relu"),  # Feedforward layer
            Dense(embed_dim)  # Project back to embedding size
        ])
        self.dropout1 = Dropout(dropout_rate)
        self.dropout2 = Dropout(dropout_rate)

    def call(self, inputs, training: bool = False):
        seq_length = tf.shape(inputs)[1]  # Get sequence length dynamically
        batch_size = tf.shape(inputs)[0]  # Get batch size dynamically

        # Create a causal mask using TensorFlow operations
        mask = tf.linalg.band_part(tf.ones((seq_length, seq_length)), -1, 0)  # Lower triangular mask
        mask = tf.reshape(mask, (1, 1, seq_length, seq_length))  # Expand dims for broadcasting

        # Apply masked self-attention
        attention_output = self.attention(inputs, inputs, attention_mask=mask)
        attention_output = self.dropout1(attention_output, training=training)
        out1 = self.norm1(inputs + attention_output)  # Residual connection

        # Feedforward network
        ffn_output = self.ffn(out1)
        ffn_output = self.dropout2(ffn_output, training=training)
        return self.norm2(out1 + ffn_output)  # Another residual connection


# Define the MiniGPT class


class MiniGPT(tf.keras.Model):
    def __init__(self, vocab_size: int, embed_dim: int = 128, num_heads: int = 4, ff_dim: int = 512, num_layers: int = 3, max_len: int = 10):
        super().__init__()
        self.embedding = Embedding(vocab_size, embed_dim)
        
        # Use tf.Variable for positional embeddings instead of add_weight
        self.pos_embedding = tf.Variable(
            initial_value=tf.zeros(shape=(1, max_len, embed_dim)), trainable=True, name="pos_embedding"
        )

        self.decoder_blocks = [TransformerDecoderBlock(embed_dim, num_heads, ff_dim) for _ in range(num_layers)]
        self.final_layer = Dense(vocab_size)  # Output layer for token predictions

    def call(self, inputs, training: bool = False):
        x = self.embedding(inputs) + self.pos_embedding[:, :tf.shape(inputs)[1], :]
        for block in self.decoder_blocks:
            x = block(x, training=training)
        return self.final_layer(x[:, -1, :])  # Predict only the last token



# Instantiate the model with explicit hyperparameters (these override the class defaults)
embed_dim = 128
num_heads = 3
ff_dim = 512
num_layers = 4
max_len = seq_length  # 10

gpt_model = MiniGPT(vocab_size, embed_dim, num_heads, ff_dim, num_layers, max_len)
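
The causal mask built inside `TransformerDecoderBlock.call` can be illustrated without TensorFlow. In the pure-Python sketch below, the helper names `causal_mask` and `masked_softmax` are hypothetical and the attention scores are made up; it only shows the idea that a position may attend to itself and earlier positions, never later ones.

```python
import math


def causal_mask(n):
    """Lower-triangular mask: row i allows columns 0..i."""
    return [[1 if j <= i else 0 for j in range(n)] for i in range(n)]


def masked_softmax(scores, mask_row):
    """Softmax over one row of scores, ignoring masked-out positions."""
    masked = [s if m else float("-inf") for s, m in zip(scores, mask_row)]
    peak = max(masked)
    exps = [math.exp(s - peak) for s in masked]  # exp(-inf) evaluates to 0.0
    total = sum(exps)
    return [e / total for e in exps]


mask = causal_mask(3)
weights = masked_softmax([0.0, 0.0, 0.0], mask[1])  # Second position's attention

for row in mask:
    print(row)
print(weights)
```

With equal scores, the second position splits its attention evenly between the first two tokens and gives zero weight to the third.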

### Compile the model

In [9]:
learning_rate = tf.keras.optimizers.schedules.ExponentialDecay(
    initial_learning_rate=0.001, decay_steps=10000, decay_rate=0.9
)

gpt_model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate),
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    metrics=["accuracy"]
)
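
To get a feel for the learning-rate schedule, the sketch below evaluates the exponential decay formula by hand, assuming the default continuous (non-staircase) form `initial_lr * decay_rate ** (step / decay_steps)`. The function name `exponential_decay` is hypothetical, and the steps-per-epoch figure is the sample count divided by the batch size of 64.

```python
def exponential_decay(step, initial_lr=0.001, decay_steps=10000, decay_rate=0.9):
    """Learning rate after `step` optimizer updates (continuous decay)."""
    return initial_lr * decay_rate ** (step / decay_steps)


steps_per_epoch = 809_773 // 64  # Full batches per epoch

for epoch in (0, 1, 5, 10):
    lr = exponential_decay(epoch * steps_per_epoch)
    print(f"After epoch {epoch}: learning rate = {lr:.6f}")
```

The rate shrinks by a factor of 0.9 every 10,000 steps, so it decays a little more than once per epoch with these settings.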

## Training

We're ready to train the model!

In [10]:
%%time

# This should take ~15 min on a powerful system.

EPOCHS = 10

gpt_model.fit(dataset, epochs=EPOCHS)  # Batch size is already set on the dataset (BATCH_SIZE)

Epoch 1/10


I0000 00:00:1739909192.746647    6826 service.cc:146] XLA service 0x7ff228037060 initialized for platform CUDA (this does not guarantee that XLA will be used). Devices:
I0000 00:00:1739909192.746697    6826 service.cc:154]   StreamExecutor device (0): Quadro RTX 5000, Compute Capability 7.5
I0000 00:00:1739909192.746703    6826 service.cc:154]   StreamExecutor device (1): Quadro RTX 5000, Compute Capability 7.5


   21/12652 ━━━━━━━━━━━━━━━━━━━━ 1:46 8ms/step - accuracy: 0.0126 - loss: 10.6521

I0000 00:00:1739909201.934160    6826 device_compiler.h:188] Compiled cluster using XLA!  This line is logged at most once for the lifetime of the process.


12652/12652 ━━━━━━━━━━━━━━━━━━━━ 115s 8ms/step - accuracy: 0.0314 - loss: 7.7398
Epoch 2/10
12652/12652 ━━━━━━━━━━━━━━━━━━━━ 99s 8ms/step - accuracy: 0.0323 - loss: 7.3998
Epoch 3/10
12652/12652 ━━━━━━━━━━━━━━━━━━━━ 99s 8ms/step - accuracy: 0.0323 - loss: 7.3600
Epoch 4/10
12652/12652 ━━━━━━━━━━━━━━━━━━━━ 99s 8ms/step - accuracy: 0.0324 - loss: 7.3481
Epoch 5/10
12652/12652 ━━━━━━━━━━━━━━━━━━━━ 99s 8ms/step - accuracy: 0.0325 - loss: 7.3442
Epoch 6/10
12652/12652 ━━━━━━━━━━━━━━━━━━━━ 99s 8ms/step - accuracy: 0.0325 - loss: 7.3371
Epoch 7/10
12652/12652 ━━━━━━━━━━━━━━━━━━━━ 99s 8ms/step - accuracy: 0.0327 - loss: 7.3358
Epoch 8/10
12652/12652 ━━━━━━━━━━━━━━━━━━━━ 99s 8ms/step - accuracy: 0.0325 - loss: 7.3277
Epoch 9/10

<keras.src.callbacks.history.History at 0x7ff3dc9f3690>
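
The loss values in the log are easier to interpret against a baseline. A model that guesses uniformly over the vocabulary has a cross-entropy of ln(vocab_size), and exponentiating a loss gives its perplexity. The sketch below applies both to the numbers reported above; the variable names are illustrative.

```python
import math

vocab_size = 56203  # Vocabulary size printed by the tokenizer step
last_logged_loss = 7.3277  # Loss reported for epoch 8 in the log above

uniform_loss = math.log(vocab_size)  # Cross-entropy of a uniform guess
perplexity = math.exp(last_logged_loss)  # Effective number of equally likely words

print(f"Uniform-guess loss: {uniform_loss:.2f}")
print(f"Perplexity at the last logged loss: {perplexity:.0f}")
```

The loss near the start of epoch 1 sits close to the uniform baseline, and the later plateau corresponds to a perplexity in the low thousands, which is consistent with the repetitive output seen later in the lesson.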

In [11]:
# Save the model and tokenizer for future use

gpt_model.save("shakespeare_gpt.keras")

with open("tokenizer.pkl", "wb") as f:
    pickle.dump(tokenizer, f)

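Later, we can reload the saved model and tokenizer with the code below. Because `MiniGPT` and `TransformerDecoderBlock` are custom classes, `load_model` will likely need them passed through its `custom_objects` argument, and the classes may also need to implement `get_config`.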


```python

# Load the trained model
loaded_gpt_model = tf.keras.models.load_model("shakespeare_gpt.keras")

# Load tokenizer
with open("tokenizer.pkl", "rb") as f:
    loaded_tokenizer = pickle.load(f)
```

# Chatbot

We now have a model which can take some text and predict the next word.
Let's use that to create our chatbot.

Our model will take in some seed text input by the user.
It will then predict the next word and append it to the seed text.
This process repeats until the maximum number of words has been generated (or the model predicts the padding token).

We will try two methods to get the next word:


* **Greedy:** Select the model's most confident prediction.
* **Top-$k$ with temperature**: Look at the top $k$ predictions and sample from them in proportion to their temperature-scaled probabilities.

In [12]:
def generate_text_greedy(model, tokenizer, seed_text, max_length=20):
    """
    Generate text using greedy decoding.
    
    :param model: Trained MiniGPT model
    :param tokenizer: Keras tokenizer used for training
    :param seed_text: Initial text prompt
    :param max_length: Maximum words to generate
    :return: Generated text
    """
    sequence = tokenizer.texts_to_sequences([seed_text])[0]  # Convert seed to tokens

    for _ in range(max_length):
        padded_sequence = tf.keras.preprocessing.sequence.pad_sequences([sequence], maxlen=10, padding="pre")

        # Get model prediction (logits over vocabulary)
        predictions = model.predict(padded_sequence, verbose=0)
        next_word_id = np.argmax(predictions)  # Greedy: pick most probable word

        if next_word_id == 0:  # Stop if the padding token (index 0) is predicted
            break

        sequence.append(next_word_id)  # Add predicted word to sequence

    return tokenizer.sequences_to_texts([sequence])[0]  # Convert tokens back to text


prompt = "To be or not to"
print(generate_text_greedy(gpt_model, tokenizer, prompt))

to be or not to the the the the the the the the the the the the the the the the the the the the


### Temperature Scaling

When the model predicts the next word, it outputs a score (logit) for each word in the vocabulary, which a softmax turns into a probability. Temperature rescales these scores before the final word is chosen.

Mathematically, temperature modifies the probability distribution as follows:

$$P_T(w_i) = \frac{\exp\left(\frac{\log P(w_i)}{T}\right)}{\sum_{j} \exp\left(\frac{\log P(w_j)}{T}\right)}$$

where:
- $P(w_i)$ is the model's original probability of word $w_i$, and $P_T(w_i)$ is its probability after temperature scaling.
- $T$ is the temperature parameter.
- $\log P(w_i)$ equals the model's logit for $w_i$ up to an additive constant shared by all words. That constant cancels in the ratio, so in code we can divide the logits by $T$ directly.
- The denominator ensures the probabilities sum to 1.
- A higher $T$ makes the probabilities more uniform (increasing randomness).
- A lower $T$ makes the highest probability words dominate (more deterministic).

By modifying $k$ and the temperature parameter, we can create a model which is more or less random in its responses.
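
The effect of temperature is easy to see on a tiny example. The following pure-Python sketch applies a temperature-scaled softmax to three made-up logits; `softmax_with_temperature` is a hypothetical helper name, not part of the lesson's code.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Divide logits by the temperature, then apply a stable softmax."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # Subtract the max so exp() cannot overflow
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.0]  # Made-up scores for three words

sharp = softmax_with_temperature(logits, 0.5)
plain = softmax_with_temperature(logits, 1.0)
flat = softmax_with_temperature(logits, 2.0)

for name, probs in (("T=0.5", sharp), ("T=1.0", plain), ("T=2.0", flat)):
    print(name, [round(p, 3) for p in probs])
```

The top word's share grows as the temperature drops and shrinks as it rises, while each distribution still sums to 1.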


In [13]:
def generate_text_top_k(model, tokenizer, seed_text, max_length=20, k=5, temperature=1.0):
    """
    Generate text using top-k sampling.

    :param model: Trained MiniGPT model
    :param tokenizer: Keras tokenizer used for training
    :param seed_text: Initial text prompt
    :param max_length: Maximum words to generate
    :param k: Number of top-k words to consider
    :param temperature: Controls randomness (higher = more random)
    :return: Generated text
    """
    sequence = tokenizer.texts_to_sequences([seed_text])[0]

    for _ in range(max_length):
        padded_sequence = tf.keras.preprocessing.sequence.pad_sequences([sequence], maxlen=10, padding="pre")

        # Get model prediction
        predictions = model.predict(padded_sequence, verbose=0)
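        predictions = predictions.flatten()  # Flatten to shape (vocab_size,) for processing

        # Apply temperature scaling to the logits, then a numerically stable softmax
        predictions = predictions / temperature
        predictions = np.exp(predictions - np.max(predictions))  # Subtract max to avoid overflow
        predictions = predictions / np.sum(predictions)  # Normalize to probability distribution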

        # Get top-k predictions
        top_k_indices = np.argsort(predictions)[-k:]  # Get k highest probability words
        top_k_probs = predictions[top_k_indices]

        # Sample next word from top-k probabilities
        next_word_id = np.random.choice(top_k_indices, p=top_k_probs / np.sum(top_k_probs))

        if next_word_id == 0:
            break  # Stop if the padding token (index 0) is predicted

        sequence.append(next_word_id)

    return tokenizer.sequences_to_texts([sequence])[0]

prompt = "To be or not to"
print(generate_text_top_k(gpt_model, tokenizer, prompt, k=5, temperature=0.7))

to be or not to in the to the the of and and to of the the and to in of to and the and


In [15]:
def chatbot():
    print("Shakespeare GPT Chatbot - Type 'exit' to quit.")
    
    while True:
        user_input = input("\nYou: ")
        if user_input.lower() == "exit":
            print("Goodbye!")
            break

        response = generate_text_top_k(gpt_model, tokenizer, user_input, k=5, temperature=2.8)
        print(f"Bot: {response}")


# Start Chatbot
chatbot()

Shakespeare GPT Chatbot - Type 'exit' to quit.

You:  Hello
Bot: in and and and of of the in of the in the in of and and to and the of

You:  exit
Goodbye!


## Closing Comments

This model is pretty cool, but suffers from some obvious issues.

* The model is very repetitive
* The model does not use punctuation
* The output is not very natural sounding

There are some advanced techniques we could implement to address them.
* We could penalize words that have already been generated (a repetition penalty, a technique often used when sampling from GPT-style models).
* We could manually split our input sentences where punctuation naturally occurs
* We could try a bigger model, and train for longer
* We could include more training text
* We could increase the training context window (we only used 10)
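
As an illustration of the first idea, here is a minimal sketch of one common repetition-penalty formulation: logits of tokens that were already generated are pushed down before sampling. The function name `apply_repetition_penalty`, the penalty value, and the logits are all assumptions for illustration, and this is not part of the lesson's model.

```python
def apply_repetition_penalty(logits, generated_ids, penalty=1.2):
    """Lower the scores of tokens that already appear in the output."""
    adjusted = list(logits)
    for token_id in set(generated_ids):
        if adjusted[token_id] > 0:
            adjusted[token_id] /= penalty  # Shrink positive scores
        else:
            adjusted[token_id] *= penalty  # Push negative scores further down
    return adjusted


logits = [0.1, 3.0, 2.5, -1.0]  # Made-up scores over a 4-word vocabulary
history = [1, 1, 3]  # Token ids generated so far

penalized = apply_repetition_penalty(logits, history, penalty=1.5)

print(penalized)
print("Greedy pick before:", logits.index(max(logits)))
print("Greedy pick after:", penalized.index(max(penalized)))
```

In a generation loop, the adjusted logits would be used in place of the raw predictions before the greedy or top-$k$ step.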

# Homework

1. Download the novels below:

* https://www.gutenberg.org/cache/epub/345/pg345.txt - [**Dracula**]
* https://www.gutenberg.org/cache/epub/84/pg84.txt - [**Frankenstein**]
* https://www.gutenberg.org/cache/epub/514/pg514.txt - [**Little Women**]
* https://www.gutenberg.org/cache/epub/42671/pg42671.txt - [**Pride and Prejudice**]
* https://www.gutenberg.org/cache/epub/64317/pg64317.txt - [**The Great Gatsby**]
* https://www.gutenberg.org/cache/epub/2701/pg2701.txt - [**Moby Dick**]

2. Perform text preprocessing on these texts.

3. Build a tokenizer using the cleaned text.

4. Train a GPT model using the cleaned text.

5. Build your own chatbot using the trained GPT model.