## Learning Objectives

At the end of the experiment, you will be able to:

* understand the big picture of the Transformer architecture
* explore masking in Transformers
* implement a Transformer decoder and understand its architecture
* apply these concepts to a machine translation problem

### The Big Picture of Transformer

<center>
<img src= "https://cdn.iisc.talentsprint.com/AIandMLOps/Images/M5%20AST%205%20Big%20Picture.png" width=700px/>
</center>


The Transformer architecture follows an encoder-decoder structure:

- the ***encoder***, on the left-hand side, is tasked with mapping an input sequence to a sequence of continuous representations;
- the ***decoder***, on the right-hand side, receives the output of the encoder together with the decoder output at the previous time step to generate an output sequence.

The Transformer decoder generates sequences autoregressively by attending to previously generated positions using masked self-attention, attending to the encoder's output using encoder-decoder attention, applying feed-forward networks, and utilizing positional encodings. This architecture allows the decoder to produce coherent and contextually accurate sequences in various natural language processing tasks.

### Setup Steps:

In [None]:
#@title Run this cell to complete the setup for this Notebook
from IPython import get_ipython
ipython = get_ipython()  # Get a handle to the running IPython shell
ipython.run_line_magic("sx", "unzip -q Demo_spa-eng.zip")



### Importing required packages

In [None]:
# Importing the NumPy library, which provides support for mathematical operations on large arrays and matrices.
import numpy as np

# Importing the re module for working with regular expressions, useful for pattern matching in strings.
import re

# Importing the random module, which provides functionalities for generating random numbers and choices.
import random

# Importing the string module to access string constants (e.g., ascii_letters, digits) and utilities for string manipulation.
import string

# Importing TensorFlow, a popular library for machine learning and deep learning.
import tensorflow as tf

# Importing Keras, the high-level neural network API included in TensorFlow, to build and train models.
from tensorflow import keras

# Importing specific modules from Keras for building deep learning models:
# - `layers` provides various building blocks like Dense, Conv2D, LSTM, etc., for constructing neural network layers.
from tensorflow.keras import layers

# **Part A** : Building Encoder Transformer

The concepts behind the Transformer encoder were discussed in Assignment 4. The same steps are implemented here so that the encoder can serve as a building block of the encoder-decoder network.

### Define TransformerEncoder class to be used in model building

In [None]:
# Defining a custom Transformer Encoder layer by subclassing the `layers.Layer` class in Keras.
class TransformerEncoder(layers.Layer):
    def __init__(self, embed_dim, dense_dim, num_heads, **kwargs):
        # Initialize the parent Layer class with any additional keyword arguments.
        super().__init__(**kwargs)

        # Store the embedding dimension, which defines the size of input embeddings (e.g., 4 in a dummy example).
        self.embed_dim = embed_dim

        # Define the size of the dense (fully connected) layer in the feedforward network within the encoder.
        self.dense_dim = dense_dim

        # Define the number of attention heads in the Multi-Head Attention mechanism.
        self.num_heads = num_heads

        # Create a Multi-Head Attention layer for self-attention, with specified number of heads and embedding dimension.
        self.attention = layers.MultiHeadAttention(num_heads=num_heads, key_dim=embed_dim)

        # Build a feedforward neural network (FFN) with two dense layers.
        # The first layer uses ReLU activation, and the second projects back to the embedding dimension.
        self.dense_proj = keras.Sequential([
            layers.Dense(dense_dim, activation="relu"),  # Expands to `dense_dim`.
            layers.Dense(embed_dim)                      # Projects back to `embed_dim` to match input shape.
        ])

        # Layer normalization for stabilizing training and improving convergence.
        self.layernorm_1 = layers.LayerNormalization()
        self.layernorm_2 = layers.LayerNormalization()

    # Define the forward pass logic for the Transformer Encoder layer.
    def call(self, inputs, mask=None):
        # Apply masking if a mask is provided, adding a new axis to the mask tensor.
        if mask is not None:
            mask = mask[:, tf.newaxis, :]
            print(f"**test: mask is not None. mask = {mask}")

        # Perform self-attention with inputs as query, key, and value.
        # This makes it a "self-attention" mechanism as all arguments come from the same source (inputs).
        attention_output = self.attention(
            query=inputs,             # The query tensor.
            value=inputs,             # The value tensor.
            key=inputs,               # The key tensor.
            attention_mask=mask       # Optional attention mask.
        )

        # Apply residual connection and normalization after the attention step.
        proj_input = self.layernorm_1(inputs + attention_output)

        # Pass the normalized result through the feedforward network.
        proj_output = self.dense_proj(proj_input)

        # Apply another residual connection and normalization after the feedforward network.
        return self.layernorm_2(proj_input + proj_output)

    # Define a method to return the configuration of the layer (for serialization purposes).
    def get_config(self):
        # Retrieve the configuration from the parent class and update it with custom attributes.
        config = super().get_config()
        config.update({
            "embed_dim": self.embed_dim,  # Embedding dimension.
            "num_heads": self.num_heads,  # Number of attention heads.
            "dense_dim": self.dense_dim,  # Size of the dense layer in the FFN.
        })
        return config

### Positional Embedding

*   Learn position-embedding vectors the same way we learn to embed word indices.
*   Proceed to **add** our position embeddings to the corresponding word embeddings, to obtain a position-aware word embedding.
*   This technique is called “positional embedding.”
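
The idea in the list above can be sketched with plain NumPy lookups. This is only an illustration: the toy sizes and the randomly initialized tables are assumptions standing in for trainable `Embedding` weights.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy sizes, chosen for illustration only.
vocab_size, seq_len, embed_dim = 10, 4, 3

# Two lookup tables standing in for trainable embedding weights.
token_table = rng.normal(size=(vocab_size, embed_dim))   # one row per token ID
position_table = rng.normal(size=(seq_len, embed_dim))   # one row per position

token_ids = np.array([[5, 2, 2, 7]])   # shape (batch=1, seq_len); token 2 appears twice
positions = np.arange(seq_len)         # [0, 1, 2, 3]

embedded_tokens = token_table[token_ids]         # shape (1, 4, 3)
embedded_positions = position_table[positions]   # shape (4, 3), broadcast over the batch
position_aware = embedded_tokens + embedded_positions

print(position_aware.shape)
# The repeated token has identical token embeddings but different position-aware vectors.
print(np.allclose(embedded_tokens[0, 1], embedded_tokens[0, 2]))
print(np.allclose(position_aware[0, 1], position_aware[0, 2]))
```

Because the position vectors are added rather than concatenated, the embedding dimension stays the same while identical tokens at different positions become distinguishable.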






<img src= "https://cdn.iisc.talentsprint.com/AIandMLOps/Images/M5%20AST6%20Positional%20Embedding.png" width=700px/>


<center>
<img src= "https://cdn.iisc.talentsprint.com/AIandMLOps/Images/M5%20AST6%20Encoder%20Embedding.png" width=650px/>
</center>


**Q:** In the picture above:


*   What is the embedding dimension for both the layers? - 3
*   How many rows would the token embedding layer have?  - 20000 (vocab size)
*   How many rows would the positional embedding layer have? - 600 (seq length)
*   Where do we get the indices for the token embedding layer? - from TextVectorization
*   Where do we get the indices for the positional embedding layer? - We explicitly define a range



### Define PositionalEmbedding class to be used in model building

In [None]:
# Using positional encoding to re-inject order information

class PositionalEmbedding(layers.Layer):
    def __init__(self, sequence_length, input_dim, output_dim, **kwargs):
        # input_dim = (token) vocabulary size, output_dim = embedding size
        super().__init__(**kwargs)

        # Embedding layer for token embeddings:
        # Converts tokens into dense vector representations of size `output_dim`.
        self.token_embeddings = layers.Embedding(input_dim=input_dim, output_dim=output_dim)
        # Q: what is input_dim and output_dim? A: vocab size, embedding dim

        # Embedding layer for positional embeddings:
        # Assigns a unique embedding to each position in the sequence (0 to sequence_length-1).
        self.position_embeddings = layers.Embedding(input_dim=sequence_length, output_dim=output_dim)
        # Q: Why input_dim = seq_length?  A: There are seq_length possible positions, so the table needs one row per position.
        # Q: What is the vocab for this Embedding layer? A: seq_length

        # Store the sequence length, input dimension, and output dimension.
        self.sequence_length = sequence_length
        self.input_dim = input_dim
        self.output_dim = output_dim

    def call(self, inputs):  # Inputs will be a batch of sequences (batch_size, seq_len)
        # Extract the sequence length dynamically from the input tensor.
        length = tf.shape(inputs)[-1]  # `length` will just be sequence length.

        # Generate position indices (0 to sequence_length-1) for the input sequence.
        positions = tf.range(start=0, limit=length, delta=1)  # Indices for input to positional embedding.

        # Convert token IDs in `inputs` to dense embeddings using `token_embeddings`.
        embedded_tokens = tf.reshape(self.token_embeddings(inputs), (-1, length, self.output_dim))

        # Convert position indices to dense embeddings using `position_embeddings`.
        embedded_positions = tf.reshape(self.position_embeddings(positions), (-1, length, self.output_dim))

        # Add token embeddings and positional embeddings element-wise.
        return layers.Add()([embedded_tokens, embedded_positions])  # ADD the embeddings.

    def compute_mask(self, inputs, mask=None):  # Makes this layer a mask-generating layer.
        if mask is None:
            return None  # If no mask is provided, return None.
        # Generate a boolean mask where tokens equal to 0 (padding tokens) are marked as False.
        return tf.math.not_equal(inputs, 0)  # Mask will get propagated to the next layer.

    # When using custom layers, this enables the layer to be reinstantiated from its config dict,
    # which is useful during model saving and loading.
    def get_config(self):
        # Get the configuration of the parent class and update it with custom attributes.
        config = super(PositionalEmbedding, self).get_config()
        config.update({
            "output_dim": self.output_dim,        # Embedding dimension size.
            "sequence_length": self.sequence_length,  # Maximum sequence length.
            "input_dim": self.input_dim,         # Vocabulary size.
        })
        return config


In [None]:
a = tf.constant([1, 0, 2, 0, 3])  # Define a TensorFlow constant tensor with the specified values.
print(a)  # Print the tensor `a`. Output will display the tensor's values.

print(tf.math.not_equal(a, 0))
# Use TensorFlow's `tf.math.not_equal` function to create a boolean tensor.
# This tensor will have `True` where the elements of `a` are not equal to `0`, and `False` where they are `0`.

### TransformerEncoder model definition with Positional Embedding

In [None]:
# Combining the Transformer encoder with positional embedding
# The values below are for the classification problem. We will change them for the translation example.

# Define key parameters for the model.
vocab_size = 15000           # The size of the vocabulary (number of unique tokens).
sequence_length = 20         # The maximum length of input sequences.
embed_dim = 256              # The embedding size (dimensionality of token embeddings).
num_heads = 2                # Number of attention heads in the Transformer encoder.
dense_dim = 32               # Number of neurons in the feedforward network of the encoder.

# Define the model input.
inputs = keras.Input(shape=(None,), dtype="int64")
# Input is expected to be integer-encoded (e.g., token indices from a TextVectorization layer).

# Add a PositionalEmbedding layer to combine token and positional embeddings.
x = PositionalEmbedding(sequence_length, vocab_size, embed_dim)(inputs)

# Add a TransformerEncoder layer for self-attention and feedforward transformations.
x = TransformerEncoder(embed_dim, dense_dim, num_heads)(x)

# Apply global max pooling to reduce the sequence dimension, retaining only the most significant features.
x = layers.GlobalMaxPooling1D()(x)

# Add a dropout layer to prevent overfitting during training.
x = layers.Dropout(0.5)(x)

# Add a dense output layer with a sigmoid activation function for binary classification.
outputs = layers.Dense(1, activation="sigmoid")(x)

# Create the Keras model by specifying the input and output layers.
model = keras.Model(inputs, outputs)

# Compile the model with the following configurations:
# - Optimizer: "rmsprop" (RMSProp optimization algorithm).
# - Loss: "binary_crossentropy" (loss function for binary classification tasks).
# - Metrics: Track "accuracy" during training and evaluation.
model.compile(optimizer="rmsprop",
              loss="binary_crossentropy",
              metrics=["accuracy"])

# Print a summary of the model, including the number of trainable parameters and layer details.
model.summary()

# Compute and print the number of trainable weights for embeddings.
print(f"Token embedding weights: {256*15000}")  # Number of weights in the token embedding layer.
print(f"Position embedding weights: {256*20}")  # Number of weights in the positional embedding layer.
print(f"Total no. of weights: {256*15000 + 256*20}")  # Sum of token and positional embedding weights.

# **Part B** : Building Decoder Transformer

## Encoder - Decoder Overview

Encoder - Encodes the input sequence into an intermediate representation.

Decoder - Uses the encoded representation (and the targets) to decode that representation into the target sequence.

Transformer Architecture:

<center>
<img src= "https://cdn.iisc.talentsprint.com/AIandMLOps/Images/M5%20AST6%20Transformer%20Network.png" width=350px/>
</center>

During training,
* An encoder model turns the source sequence into an intermediate representation.
* **A decoder is trained to predict the next token i** in the target sequence by looking at both
    - previous tokens (0 to i - 1) and
    - the encoded source sequence
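
As a rough illustration of the training setup described above, the target sequence can be expanded into (visible tokens, next token) pairs. The example sentence is an assumption chosen for illustration; in practice the decoder processes all positions in parallel and relies on masking rather than an explicit loop.

```python
# Hypothetical tokenized target sentence for the source "I love it".
target = ["[start]", "me", "encanta", "[end]"]

# For every position i >= 1, the decoder sees tokens 0..i-1 and must predict token i.
pairs = [(target[:i], target[i]) for i in range(1, len(target))]

for visible, expected in pairs:
    print(f"{visible} -> {expected}")
```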
    

During inference, we don’t have access to the target sequence—we’re trying to predict it from scratch. We’ll have to generate it one token at a time:
1. We obtain the encoded source sequence from the encoder.
2. The decoder starts by looking at the encoded source sequence as well as an initial “seed” token (such as the string "[start]"), and uses them to predict the first real token in the sequence.
3. The predicted sequence so far is fed back into the decoder, which generates the next token, and so on, until it generates a stop token (such as the string "[end]").
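
The three inference steps above can be sketched as a greedy decoding loop. Here `toy_next_token` is a hypothetical stand-in for "run the encoder and decoder, then take the most likely token"; it simply reads from a canned answer so that the loop itself is runnable.

```python
def toy_next_token(source, generated):
    """Hypothetical stand-in for the trained model: returns a canned next token."""
    canned = {"i love it": ["me", "encanta", "[end]"]}
    step = len(generated) - 1   # number of real tokens produced so far
    return canned[source][step]


def greedy_decode(source, next_token_fn, max_len=20):
    generated = ["[start]"]                        # seed token
    for _ in range(max_len):
        token = next_token_fn(source, generated)   # predict from the sequence so far
        generated.append(token)
        if token == "[end]":                       # stop token ends generation
            break
    return " ".join(generated)


print(greedy_decode("i love it", toy_next_token))
```

The `max_len` cap is a safeguard in this sketch: generation stops even if the stop token is never produced.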

<center>
<img src= "https://cdn.iisc.talentsprint.com/AIandMLOps/Images/M5%20AST6%20Transformer%20gif.gif" width=750px/>
</center>


### Masking

Masking is needed to prevent the attention mechanism of a Transformer from “cheating” in the decoder during training (on a translation task, for instance). This kind of “cheating-proof masking” is not present on the encoder side.

Consider the sequence: “I love it”, then the expected prediction for the token at position one (“I”) is the token at the next position (“love”). Similarly the expected prediction for the tokens “I love” is “it”.

We do not want the attention mechanism to share any information regarding the token at the next positions, when giving a prediction using all the previous tokens.

To ensure this, we mask future positions by setting their attention scores to -inf before the softmax step in the self-attention calculation, so that they receive zero attention weight.
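
A minimal NumPy sketch of this step is given below, assuming a 3-token sequence such as “I love it” and all-zero raw scores so that the effect of the mask is easy to read. The helper name `masked_softmax` is introduced here for illustration only and is not part of Keras.

```python
import numpy as np


def masked_softmax(scores, mask):
    """Softmax over the last axis; positions where mask == 0 get zero weight."""
    masked = np.where(mask.astype(bool), scores, -np.inf)
    masked = masked - masked.max(axis=-1, keepdims=True)   # numerical stability
    weights = np.exp(masked)                                # exp(-inf) evaluates to 0
    return weights / weights.sum(axis=-1, keepdims=True)


scores = np.zeros((3, 3))            # toy raw attention scores (query rows, key columns)
causal = np.tril(np.ones((3, 3)))    # row i may attend only to columns 0..i
weights = masked_softmax(scores, causal)
print(weights)
```

Each row still sums to 1, but all weight above the diagonal is exactly zero, so a position cannot draw information from later positions.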

### Padding mask

Padding is a special form of masking where the masked steps are at the start or the end of a sequence. Padding comes from the need to encode sequence data into contiguous batches: in order to make all sequences in a batch fit a given standard length, it is necessary to pad or truncate some sequences.
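
As a small pure-Python sketch of the pad-or-truncate step, assuming padding at the end and `0` as the padding ID (the helper `pad_or_truncate` is hypothetical, not a Keras function):

```python
def pad_or_truncate(token_ids, length, pad_id=0):
    """Cut a sequence to `length` tokens, then right-pad it with `pad_id`."""
    clipped = token_ids[:length]
    return clipped + [pad_id] * (length - len(clipped))


batch = [[4, 3, 2, 1], [5, 4, 3, 2, 1, 6, 7, 8], [2, 1]]
padded = [pad_or_truncate(seq, length=5) for seq in batch]

# The padding mask is True for real tokens and False for padding.
mask = [[tok != 0 for tok in seq] for seq in padded]

print(padded)
print(mask)
```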

* The Embedding layer is capable of generating a “mask” that corresponds to its input data.

* By default, this option isn’t active—you can turn it on by passing mask_zero=True to your Embedding layer.

* You can retrieve the mask with the compute_mask() method:

### An example to understand Padding Masking

In [None]:
# Padding mask
# Define an Embedding layer with:
# - `input_dim=10`: The vocabulary size (number of unique tokens).
# - `output_dim=256`: The size of the embedding vectors.
# - `mask_zero=True`: Enables masking for padding tokens (value 0).

embedding_layer_ = layers.Embedding(input_dim=10, output_dim=256, mask_zero=True)

# Define some input sequences with padding tokens (value 0 at the end).
some_input = [
  [4, 3, 2, 1, 0, 0, 0],  # Sequence with padding (0) at the end.
  [5, 4, 3, 2, 1, 0, 0],  # Sequence with padding (0) at the end.
  [2, 1, 0, 0, 0, 0, 0]   # Sequence with more padding (0) at the end.
]

# Compute the mask for the input sequences using the embedding layer.
d_mask = embedding_layer_.compute_mask(some_input)

# Print the mask. It is a boolean tensor where `True` represents valid tokens,
# and `False` represents padding tokens (value 0).
print(d_mask)

# Convert the boolean mask to integers for clearer representation:
# `True` becomes 1 and `False` becomes 0.
print(tf.cast(d_mask, dtype="int32"))

### Causal Masking

*   The TransformerDecoder is order-agnostic: it looks at the entire target sequence at once.
*   If it were allowed to use its entire input, it would simply learn to copy input step N+1 to location N in the output.
*   Solution: mask the upper triangle of the pairwise attention matrix (the entries above the diagonal) to prevent the model from paying any attention to information from the future.
*   We'll see this in the method `get_causal_attention_mask(self, inputs)` inside the decoder class.


  

<center>
<img src= "https://cdn.iisc.talentsprint.com/AIandMLOps/Images/M5%20AST5%20Self%20Attention%20Scores.png" width=600px/>
</center>


<center>
<img src= "https://cdn.iisc.talentsprint.com/AIandMLOps/Images/M5%20AST5%20Multihead%20Attention.png" width=600px/>
</center>


In [None]:
# Assume sequence length is 5
j = normal_range = tf.range(5)
# `tf.range(5)` generates a 1D tensor with values [0, 1, 2, 3, 4].
# `normal_range` and `j` are just aliases for the same tensor.

i = with_new_axis = tf.range(5)[:, tf.newaxis]
# `tf.range(5)` generates [0, 1, 2, 3, 4].
# `[:, tf.newaxis]` adds a new dimension to make the tensor 2D:
# The resulting shape is (5, 1), producing a column vector:
# [[0],
#  [1],
#  [2],
#  [3],
#  [4]]

In [None]:
print(normal_range)
print(with_new_axis)

In [None]:
# `j` is a 1D tensor: [0, 1, 2, 3, 4]
# `i` is a 2D tensor: [[0], [1], [2], [3], [4]]
# Broadcasting will align the dimensions of `j` and `i` for comparison.
# The comparison `i >= j` is performed element-wise.

d_mask = tf.cast(i >= j, dtype="int32")
# `i >= j` produces a boolean tensor, where:
# - `True` indicates that the element in `i` is greater than or equal to the corresponding element in `j`.
# - `False` indicates otherwise.
# `tf.cast(..., dtype="int32")` converts the boolean tensor into an integer tensor:
# - `True` becomes `1`.
# - `False` becomes `0`.

# Print the resulting mask.
print(d_mask)

In [None]:
# Reshape the tensor `d_mask` to have a shape of (1, 5, 5).
d_mask = tf.reshape(d_mask, (1, 5, 5))

# Print the reshaped tensor.
print(d_mask)

In [None]:
# Define tile multiplier for tiling

# Define the batch size value (a scalar integer)
batch_size = 2

# Expand the dimension of batch_size to create a tensor with shape (1,)
# and value [batch_size]. This makes batch_size a 1D tensor for compatibility in operations like concat.
mult = tf.concat(
    [tf.expand_dims(batch_size, -1),  # Expands batch_size to shape (1,)
     tf.constant([1, 1], dtype=tf.int32)],  # Defines a tensor with values [1, 1] and shape (2,)
    axis=0)  # Concatenates both tensors along the 0th axis, resulting in a tensor of shape (3,)

# Print the result of expanding batch_size to shape (1,)
print(tf.expand_dims(batch_size, -1))  # Output: tf.Tensor([2], shape=(1,), dtype=int32)

# Print the tensor created with values [1, 1] and shape (2,)
print(tf.constant([1, 1], dtype=tf.int32))  # Output: tf.Tensor([1 1], shape=(2,), dtype=int32)

# Print the concatenated result of expanding batch_size and the constant tensor [1, 1]
print(mult)  # Output: tf.Tensor([2 1 1], shape=(3,), dtype=int32)

In [None]:
# Tile the mask to replicate across batchsize

# Use the previously created `d_mask` and replicate it across the batch size.
# `mult` is a tensor that indicates how many times to replicate the `d_mask` tensor across each axis.
causal_mask_ = tf.tile(d_mask, mult)

# Print the tiled mask result
print(causal_mask_)

In the example above:

sequence length = 5

batch size = 2

To learn more about masking, refer to the [Keras masking and padding guide](https://www.tensorflow.org/guide/keras/masking_and_padding).
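
As a cross-check of the broadcasting and tiling steps, the same mask can be rebuilt in NumPy and compared with a lower-triangular matrix. This is a sketch using the same sequence length and batch size as the example; NumPy is used here only for verification and is not what the decoder class uses.

```python
import numpy as np

seq_len, batch_size = 5, 2

i = np.arange(seq_len)[:, np.newaxis]   # column vector, shape (5, 1)
j = np.arange(seq_len)                  # row vector, shape (5,)
mask = (i >= j).astype("int32")         # broadcast to shape (5, 5)

# Add a batch axis and repeat the mask once per batch element.
tiled = np.tile(mask[np.newaxis, :, :], (batch_size, 1, 1))

print(np.array_equal(mask, np.tril(np.ones((seq_len, seq_len), dtype="int32"))))
print(tiled.shape)
```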

### Transformer Decoder

In [None]:
class TransformerDecoder(layers.Layer):
    def __init__(self, embed_dim, dense_dim, num_heads, **kwargs):
        # Define the layers. Let's point them out in the diagram
        super().__init__(**kwargs)
        self.embed_dim = embed_dim  # Dimension of embedding (e.g., 256)
        self.dense_dim = dense_dim  # Number of neurons in dense layer (e.g., 32)
        self.num_heads = num_heads  # Number of heads for MultiHead Attention layer

        # Now we have 2 MultiHead Attention layers - one for self-attention and one for cross-attention over the encoder outputs
        self.attention_1 = layers.MultiHeadAttention(num_heads=num_heads, key_dim=embed_dim)  # Self-attention
        self.attention_2 = layers.MultiHeadAttention(num_heads=num_heads, key_dim=embed_dim)  # Cross-attention
        self.dense_proj = keras.Sequential([layers.Dense(dense_dim, activation="relu"),  # Fully connected layers
                                            layers.Dense(embed_dim),]
                                           )
        # Layer normalization for stabilizing training and improving performance
        self.layernorm_1 = layers.LayerNormalization()
        self.layernorm_2 = layers.LayerNormalization()
        self.layernorm_3 = layers.LayerNormalization()

        self.supports_masking = True  # Ensures that the layer will propagate its input mask to its outputs

    def get_config(self):
        # Provides a configuration dictionary for the custom layer, for model serialization
        config = super().get_config()
        config.update({
            "embed_dim": self.embed_dim,
            "num_heads": self.num_heads,
            "dense_dim": self.dense_dim,
        })
        return config

    def get_causal_attention_mask(self, inputs):
        # Generates a causal attention mask to prevent the decoder from attending to future tokens
        input_shape = tf.shape(inputs)  # Get the shape of the input tensor
        batch_size, sequence_length = input_shape[0], input_shape[1]  # Extract batch size and sequence length

        # Generate a causal mask by comparing indices i and j (where i >= j means attention is allowed)
        i = tf.range(sequence_length)[:, tf.newaxis]  # Create a column vector of sequence indices
        j = tf.range(sequence_length)  # Create a row vector of sequence indices
        mask = tf.cast(i >= j, dtype="int32")  # True (1) for valid positions, False (0) for invalid positions
        mask = tf.reshape(mask, (1, input_shape[1], input_shape[1]))  # Reshape to (1, seq_len, seq_len)

        # Concatenate batch size to create a multiplier for tiling
        mult = tf.concat([tf.expand_dims(batch_size, -1),  # Expand batch size to match shape
                          tf.constant([1, 1], dtype=tf.int32)],  # Keep the other dimensions unchanged
                         axis=0
                         )

        # Tile the mask according to the batch size
        return tf.tile(mask, mult)

    def call(self, inputs, encoder_outputs, mask=None):
        # `inputs`: decoder input sequence
        # `encoder_outputs`: output of the encoder (key-value pairs for cross-attention)
        # `mask`: optional mask for padding (e.g., for handling padded tokens)

        # Generate a causal mask to prevent attending to future tokens
        causal_mask = self.get_causal_attention_mask(inputs)

        # Padding mask: if a mask is propagated from the decoder inputs, it marks which target positions are real tokens
        padding_mask = None
        if mask is not None:
            padding_mask = tf.cast(mask[:, tf.newaxis, :], dtype="int32")  # Expand mask for attention compatibility
            padding_mask = tf.minimum(padding_mask, causal_mask)  # Element-wise AND of padding and causal masks (0s prevent attention)

        # First attention layer (self-attention)
        attention_output_1 = self.attention_1(query=inputs,  # Query: decoder inputs
                                              value=inputs,  # Value: decoder inputs
                                              key=inputs,    # Key: decoder inputs
                                              attention_mask=causal_mask  # Causal mask prevents future token attention
                                              )

        # Apply layer normalization after adding the attention output to the input (residual connection)
        attention_output_1 = self.layernorm_1(inputs + attention_output_1)

        # Second attention layer (cross-attention) using encoder outputs as key and value
        attention_output_2 = self.attention_2(query=attention_output_1,  # Query: output from first attention layer
                                              value=encoder_outputs,  # Value: encoder outputs (key-value pairs)
                                              key=encoder_outputs,    # Key: encoder outputs
                                              attention_mask=padding_mask,  # Padding mask if provided
                                              )

        # Apply layer normalization after adding the attention output to the input (residual connection)
        attention_output_2 = self.layernorm_2(attention_output_1 + attention_output_2)

        # Apply a dense projection after the second attention layer
        proj_output = self.dense_proj(attention_output_2)

        # Apply final layer normalization with residual connection
        return self.layernorm_3(attention_output_2 + proj_output)
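
The `tf.minimum` step in `call` takes the element-wise minimum of two 0/1 masks, which keeps a 1 only where both masks allow attention (a logical AND). A small NumPy sketch with a single padded sequence, using token IDs that are assumptions for illustration:

```python
import numpy as np

token_ids = np.array([[7, 3, 9, 0, 0]])          # one target sequence, 0 = padding
padding = (token_ids != 0).astype("int32")       # shape (1, 5)
padding = padding[:, np.newaxis, :]              # shape (1, 1, 5), broadcasts over rows

seq_len = token_ids.shape[1]
causal = np.tril(np.ones((1, seq_len, seq_len), dtype="int32"))

combined = np.minimum(padding, causal)           # 1 only where both masks are 1
print(combined[0])
```

In the printed matrix, the last two columns are all zeros (padding), and the entries above the diagonal are zeros as well (future positions).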

<center>
<img src= "https://cdn.iisc.talentsprint.com/AIandMLOps/Images/M5%20AST6%20Transformer%20Network.png" width=350px/>
</center>

In [None]:
# English to Spanish translation using a Transformer model

# Define model parameters
vocab_size = 15000  # Vocabulary size for both English and Spanish
sequence_length = 20  # Maximum sequence length (length of the input and output sentences)
embed_dim = 256  # Dimension of token embeddings
dense_dim = 2048  # Number of neurons in the dense layers after the attention layers
num_heads = 8  # Number of heads in the multi-head attention mechanism

# Define encoder input for English sentences
encoder_inputs = keras.Input(shape=(None,), dtype="int64", name="english")  # Input sequence of English words
x = PositionalEmbedding(sequence_length, vocab_size, embed_dim)(encoder_inputs)  # Apply positional embedding
# Q: First arg acts like a 'vocabulary' for pos embedding layer. A: The first argument corresponds to sequence length, which determines how many unique positions need to be embedded.
encoder_outputs = TransformerEncoder(embed_dim, dense_dim, num_heads)(x)  # Pass through the Transformer encoder
# Q: What are these arguments? A: embedding dimension, number of neurons in the dense layer, number of heads in the multi-head attention layer.

# Define decoder input for Spanish sentences
decoder_inputs = keras.Input(shape=(None,), dtype="int64", name="spanish")  # Input sequence of Spanish words
x = PositionalEmbedding(sequence_length, vocab_size, embed_dim)(decoder_inputs)  # Apply positional embedding
x = TransformerDecoder(embed_dim, dense_dim, num_heads)(x, encoder_outputs)  # Pass through the Transformer decoder, taking encoder output as context
# Q: What are the call arguments in the picture? A: The decoder receives the embedded inputs and the encoder's output to perform cross-attention.

x = layers.Dropout(0.5)(x)  # Apply dropout to prevent overfitting
decoder_outputs = layers.Dense(vocab_size, activation="softmax")(x)  # Output layer with softmax activation for predicting the next word in Spanish

# Create the complete Transformer model with encoder and decoder inputs and decoder outputs
transformer = keras.Model([encoder_inputs, decoder_inputs], decoder_outputs)  # Note that there are two input layers: one for the encoder and one for the decoder

# Display model summary
transformer.summary()

# A Machine Translation Example

English to Spanish translation

### Preparing the data

In [None]:
# Rows of the dataset
!tail spa-eng/spa.txt

In [None]:
# Pre-processing: Separating input and output sequences
text_file = "spa-eng/spa.txt"  # Define the file path of the dataset

with open(text_file, encoding="utf-8") as f:  # Read as UTF-8 so accented Spanish characters decode correctly
    lines = f.read().split("\n")[:-1]  # Open the file and read the contents, splitting by newlines and removing the last empty line

text_pairs = []  # Initialize an empty list to store English-Spanish pairs

for line in lines:  # Iterate through each line in the dataset
    english, spanish = line.split("\t")  # Split each line into English and Spanish sentences based on the tab character
    spanish = "[start] " + spanish + " [end]"  # Add start and end tokens to the Spanish sentence for the translation model
    text_pairs.append((english, spanish))  # Append the pair (English, Spanish) to the list

print(random.choice(text_pairs))  # Print a random English-Spanish pair from the list to check the format
print(f"no. of pairs: {len(text_pairs)}")  # Print the total number of sentence pairs processed

In [None]:
# Splitting data

random.shuffle(text_pairs)  # Shuffle the list of sentence pairs to ensure randomness before splitting

# Calculate the number of validation samples (15% of total data)
num_val_samples = int(0.15 * len(text_pairs))

# Calculate the number of training samples (remaining after allocating for validation and test sets)
num_train_samples = len(text_pairs) - 2 * num_val_samples

# Split the data into training, validation, and test sets
train_pairs = text_pairs[:num_train_samples]  # The first `num_train_samples` pairs for training
val_pairs = text_pairs[num_train_samples:num_train_samples + num_val_samples]  # The next `num_val_samples` pairs for validation
test_pairs = text_pairs[num_train_samples + num_val_samples:]  # The remaining pairs for testing
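
The slicing logic can be sanity-checked on stand-in data. In the sketch below, the 100 synthetic pairs and the fixed seed are assumptions made only so that the sizes are easy to verify.

```python
import random

# Hypothetical stand-in for the real sentence pairs.
pairs = [(f"en {k}", f"[start] es {k} [end]") for k in range(100)]
random.Random(0).shuffle(pairs)   # seeded shuffle, for a reproducible check

num_val = int(0.15 * len(pairs))
num_train = len(pairs) - 2 * num_val

train = pairs[:num_train]
val = pairs[num_train:num_train + num_val]
test = pairs[num_train + num_val:]

print(len(train), len(val), len(test))
```

The three slices are disjoint and together cover every pair, giving roughly a 70/15/15 split.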

In [None]:
print(string.punctuation)

In [None]:
# Vectorizing the English and Spanish text pairs

# Define which characters to strip out of the Spanish data: standard punctuation plus "¿", but keep "[" and "]" so that [start] and [end] survive
strip_chars = string.punctuation + "¿"  # Combine standard punctuation and the Spanish specific character (¿) to strip
strip_chars = strip_chars.replace("[", "")  # Remove the "[" character from strip_chars
strip_chars = strip_chars.replace("]", "")  # Remove the "]" character from strip_chars

# Custom standardization function for Spanish
def custom_standardization(input_string):
    lowercase = tf.strings.lower(input_string)  # Convert input string to lowercase
    return tf.strings.regex_replace(  # Replace elements of input matching regex pattern with an empty string
        lowercase, f"[{re.escape(strip_chars)}]", "")  # Remove characters defined in `strip_chars`

vocab_size = 15000  # Define the vocabulary size
sequence_length = 20  # Set the sequence length for padding/truncation

# Create a TextVectorization layer for the source (English) text
source_vectorization = layers.TextVectorization(
    max_tokens=vocab_size,  # Maximum number of tokens in the vocabulary
    output_mode="int",  # Convert text into integer sequences
    output_sequence_length=sequence_length,  # Pad/truncate sequences to a fixed length
)

# Create a TextVectorization layer for the target (Spanish) text with custom standardization
target_vectorization = layers.TextVectorization(
    max_tokens=vocab_size,  # Maximum number of tokens in the vocabulary
    output_mode="int",  # Convert text into integer sequences
    output_sequence_length=sequence_length + 1,  # One extra token, because the target is shifted by one step to form the decoder input and output
    standardize=custom_standardization,  # Apply the custom standardization function
)

# Prepare the training data: separate English and Spanish text pairs
train_english_texts = [pair[0] for pair in train_pairs]  # Extract English sentences
train_spanish_texts = [pair[1] for pair in train_pairs]  # Extract Spanish sentences

# Adapt the vectorization layers to the training data (build the vocabulary)
source_vectorization.adapt(train_english_texts)
target_vectorization.adapt(train_spanish_texts)
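
To see what the custom standardization does to a single sentence, here is a minimal plain-Python sketch. It assumes Python's `re` module behaves like `tf.strings.regex_replace` for this simple character class; `standardize_py` is a hypothetical helper used only for illustration and is not part of the notebook.

```python
import re
import string

# Same character set as in the cell above: punctuation plus "¿", keeping "[" and "]"
strip_chars = (string.punctuation + "¿").replace("[", "").replace("]", "")

def standardize_py(text):
    """Plain-Python stand-in for custom_standardization (hypothetical helper)."""
    # Lowercase first, then delete every character listed in strip_chars
    return re.sub(f"[{re.escape(strip_chars)}]", "", text.lower())

example = "[start] ¿Dónde está la biblioteca? [end]"
print(standardize_py(example))  # prints: [start] dónde está la biblioteca [end]
```

The brackets are kept, so `[start]` and `[end]` remain single, recognizable tokens after vectorization.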

In [None]:
seq = tf.range(10)  # Create a tensor with values from 0 to 9 (10 elements)
dec_in = seq[:-1]  # Slice the sequence to get all elements except the last one (input for decoder)
dec_out = seq[1:]  # Slice the sequence to get all elements except the first one (output for decoder)

# Print the original sequence, the decoder input, and the decoder output
print(f"original seq:  {seq}")  # Prints the original sequence: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
print(f"dec_in:   {dec_in}")    # Prints the decoder input sequence: [0, 1, 2, 3, 4, 5, 6, 7, 8]
print(f"dec_out:  {dec_out}")   # Prints the decoder output sequence: [1, 2, 3, 4, 5, 6, 7, 8, 9]

In [None]:
# Preparing datasets for the translation task

batch_size = 64  # Set the batch size for training and validation datasets

# IMPORTANT: returns a tuple (inputs, targets), where inputs is a dict
# {"english": eng_encod_input, "spanish": spa_decod_input} and targets is spa_decod_output
def format_dataset(eng, spa):
    # Q: What are eng and spa pre and post re-assignment? A: raw text (eng, spa) and indices after vectorization.
    eng = source_vectorization(eng)  # Convert English sentences into integer sequences using source vectorizer
    spa = target_vectorization(spa)  # Convert Spanish sentences into integer sequences using target vectorizer
    return ({
        "english": eng,            # Encoder input: English sequence
        "spanish": spa[:, :-1],    # Decoder input: Spanish sequence without the last token (used for prediction)
    }, spa[:, 1:])                 # Decoder output: Spanish sequence without the first token (target for prediction)


def make_dataset(pairs):
    # Unzip the pairs (english, spanish) into separate lists
    eng_texts, spa_texts = zip(*pairs)
    eng_texts = list(eng_texts)  # Convert English texts into a list
    spa_texts = list(spa_texts)  # Convert Spanish texts into a list
    # Create a tf.data.Dataset object from the text pairs
    dataset = tf.data.Dataset.from_tensor_slices((eng_texts, spa_texts))
    dataset = dataset.batch(batch_size)  # Batch the dataset to the specified batch size
    dataset = dataset.map(format_dataset, num_parallel_calls=4)  # Apply the formatting function in parallel
    return dataset.cache().shuffle(2048).prefetch(16)  # Cache the vectorized batches in memory, reshuffle them every epoch, and prefetch for performance


train_ds = make_dataset(train_pairs)  # Create the training dataset
val_ds = make_dataset(val_pairs)     # Create the validation dataset

In [None]:
# Iterate over the dataset `train_ds` and take one batch
for inputs, targets in train_ds.take(1):
    # Print the shape of the 'english' input sequence in the batch (encoder input)
    print(f"inputs['english'].shape: {inputs['english'].shape}")

    # Print the shape of the 'spanish' input sequence in the batch (decoder input)
    print(f"inputs['spanish'].shape: {inputs['spanish'].shape}")

    # Print the shape of the 'targets' (decoder output) in the batch
    print(f"targets.shape: {targets.shape}")

#### Train and evaluate the model *(Switch to GPU runtime if needed)*

In [None]:
# Compiling the Transformer model with the following configurations:
transformer.compile(optimizer="rmsprop",                     # Use RMSprop optimizer for training
                    loss="sparse_categorical_crossentropy",  # Use sparse categorical crossentropy loss function for multi-class classification
                    metrics=["accuracy"]                   # Track accuracy as a metric during training and evaluation
                    )

In [None]:
# Training the Transformer model with the provided datasets
transformer.fit(train_ds,              # Train the model using the training dataset 'train_ds'
                validation_data=val_ds, # Validate the model during training using the validation dataset 'val_ds'
                epochs=20)              # Set the number of epochs to 20, indicating how many times the entire dataset will be passed through the model

### Save model

In [None]:
# Save the trained Transformer model to a file for later use
transformer.save("trained-transformer-model.keras")  # Save the entire model (architecture, weights, optimizer, etc.) to a file named "trained-transformer-model.keras"

### Inference

In [None]:
# Inference

# Retrieve the vocabulary for the Spanish language from the target vectorizer
spa_vocab = target_vectorization.get_vocabulary()

# Create a dictionary that maps token indices to words in the Spanish vocabulary
spa_index_lookup = dict(zip(range(len(spa_vocab)), spa_vocab))

# Define a maximum length for the decoded Spanish sentence
max_decoded_sentence_length = 20

# Function to decode a sequence (translate English to Spanish)
def decode_sequence(input_sentence):
    # Tokenize the input sentence using the English vectorizer
    tokenized_input_sentence = source_vectorization([input_sentence])

    # Initialize the decoded sentence with the start token
    decoded_sentence = "[start]"

    # Loop to generate each word in the translated sentence, up to the max length
    for i in range(max_decoded_sentence_length):
        # Tokenize the partial decoded sentence (without the last token) using the Spanish vectorizer
        tokenized_target_sentence = target_vectorization([decoded_sentence])[:, :-1]

        # Get the model's predictions for the next token in the sequence
        predictions = transformer([tokenized_input_sentence, tokenized_target_sentence])

        # Select the token with the highest probability from the prediction
        sampled_token_index = np.argmax(predictions[0, i, :])

        # Look up the word corresponding to the sampled token index
        sampled_token = spa_index_lookup[sampled_token_index]

        # Add the word to the decoded sentence
        decoded_sentence += " " + sampled_token

        # If the end token is predicted, stop decoding
        if sampled_token == "[end]":
            break

    return decoded_sentence

# Sample a few test sentences and generate translations
test_eng_texts = [pair[0] for pair in test_pairs]
for _ in range(4):
    input_sentence = random.choice(test_eng_texts)  # Select a random English sentence
    print("-"*50)
    print(f"Input (eng):  {input_sentence}")  # Print the original English sentence
    print(f"Output(spa): {decode_sequence(input_sentence)}")  # Print the translated Spanish sentence
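
The greedy decoding loop above can be tried without a trained model. The following is a minimal sketch in which a hypothetical lookup table (`NEXT_TOKEN`) stands in for the transformer; unlike the real model, it ignores the English input and only looks at the last generated token. All names here are illustrative assumptions.

```python
# Hypothetical next-token table standing in for the trained transformer (assumption)
NEXT_TOKEN = {
    "[start]": "me",
    "me": "gusta",
    "gusta": "aprender",
    "aprender": "[end]",
}

def toy_model(decoded_tokens):
    """Return the 'most likely' next token given the tokens generated so far."""
    return NEXT_TOKEN.get(decoded_tokens[-1], "[end]")

def greedy_decode(max_len=20):
    decoded = ["[start]"]              # Seed the output with the start token
    for _ in range(max_len):
        next_token = toy_model(decoded)  # One prediction per step
        decoded.append(next_token)       # Feed it back in on the next step
        if next_token == "[end]":        # Stop once the end token is produced
            break
    return " ".join(decoded)

print(greedy_decode())  # prints: [start] me gusta aprender [end]
```

As in `decode_sequence`, each step appends exactly one token, and the loop ends either at `[end]` or when the maximum length is reached.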

Note that both the TransformerEncoder and the TransformerDecoder are shape-invariant: their output has the same shape as their input. You can therefore stack several of them to build a more powerful encoder or decoder.

<center>
<img src= "https://cdn.iisc.talentsprint.com/AIandMLOps/Images/M5%20AST%206%20last%20image.png" width=600px/>
</center>


### Technical Ungraded Questions:

1. How are the encoder outputs connected to the decoder inputs when several encoder and decoder blocks are stacked?

    **Answer:** The output of the last encoder block is fed to every decoder block, where it is used in the encoder-decoder attention.

\\

2. During training, are the decoder inputs obtained from decoder predictions or are they obtained directly from the target data?

    **Answer:** During training, the decoder input is obtained directly from the target data (teacher forcing), not from the decoder's own predictions. The only difference between the decoder input and the decoder target is an offset of one position. For example, consider a Hindi-to-English translation problem with the English sample "[start] I like to learn [end]". The input to the decoder for this sample is sample[:-1], i.e. "[start] I like to learn", and the target is sample[1:], i.e. "I like to learn [end]". The prediction during training is a probability distribution over the vocabulary for each position in the sequence. So if the decoder input has 5 tokens, as it does here, and the vocabulary size is 100, the prediction for this sample has shape (5, 100). The predicted sequence is obtained by taking the argmax, i.e. the token with the maximum probability, at each position. An example prediction could be "I love to study [end]". The loss is the cross-entropy between the predicted distribution and the target token at each position, aggregated over the sequence. Here the mismatched positions, 'like'/'love' and 'learn'/'study', contribute the most to the loss.

  Notes:
  1. The sample actually contains integer token indices. It is written as text here for the sake of clarity.
  2. The above explanation is for one sample. With a batch size of 64, i.e. 64 samples in a mini-batch, the decoder output shape is (64, 5, 100). In general, it is (batch_size, seq_length, vocab_size).
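
The argmax and per-token loss described in this answer can be worked through by hand. The sketch below uses a toy 8-word vocabulary and made-up probabilities (both are assumptions for illustration, not values from a trained model). The helper `peaked` is hypothetical.

```python
import math

# Toy vocabulary and the target sequence from the example (sample[1:])
vocab = ["[start]", "i", "like", "love", "to", "learn", "study", "[end]"]
target = ["i", "like", "to", "learn", "[end]"]

# Words the hypothetical model is most confident about at each position
predicted_words = ["i", "love", "to", "study", "[end]"]

def peaked(word, p=0.9):
    """Made-up distribution: probability p on `word`, the rest spread evenly."""
    rest = (1.0 - p) / (len(vocab) - 1)
    return [p if w == word else rest for w in vocab]

# One distribution per position: shape (seq_length, vocab_size) = (5, 8)
probs = [peaked(w) for w in predicted_words]

# Argmax at each position recovers the predicted sequence
decoded = [vocab[max(range(len(vocab)), key=row.__getitem__)] for row in probs]
print(decoded)  # prints: ['i', 'love', 'to', 'study', '[end]']

# Cross-entropy per position: -log(probability assigned to the target token)
token_losses = [-math.log(row[vocab.index(t)]) for row, t in zip(probs, target)]
mean_loss = sum(token_losses) / len(token_losses)
for t, loss in zip(target, token_losses):
    print(f"{t:>6}: {loss:.3f}")
print(f"mean loss: {mean_loss:.3f}")
```

Every position contributes something to the loss, but the two mismatched positions dominate it because the target token receives very little probability there.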


Other important points:
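- An advantage of Transformers is that their computations are parallelizable. During training, the decoder receives the whole target sequence at once, so the prediction at a given position does not have to wait for the prediction at the previous position. The causal mask keeps each position from attending to later tokens, and all positions are computed in parallel.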

- Note what kind of data structure the function "format_dataset(eng, spa)" returns. It is a two-element tuple, ({"english": eng_encod_input, "spanish": spa_decod_input}, spa_decod_output), where the dictionary forms the input of the Transformer model and 'spa_decod_output' is its target output.
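
The parallelism point above depends on causal masking: every position is processed at the same time, but each one may only look at itself and earlier positions. A minimal plain-Python sketch of such a mask follows (the function name `causal_mask` is an assumption for illustration, not the notebook's implementation).

```python
def causal_mask(seq_len):
    """Lower-triangular mask: row i has 1 at column j when position i may attend to j."""
    return [[1 if j <= i else 0 for j in range(seq_len)] for i in range(seq_len)]

tokens = ["[start]", "I", "like", "to", "learn"]  # decoder input from the example above
mask = causal_mask(len(tokens))

# Show which tokens each position is allowed to see
for token, row in zip(tokens, mask):
    visible = [t for t, keep in zip(tokens, row) if keep]
    print(f"{token:>8} sees {visible}")
```

Because the mask already hides the future tokens, all rows can be evaluated in a single pass during training. At inference time the tokens are still generated one at a time.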