 # Transformers Part 2: Language Models and Multimodality

 ## Table of Contents



 - [3. Transformer Language Models](#3-transformer-language-models)

   - [3.1 Decoder Transformers (e.g., GPT)](#31-decoder-transformers-eg-gpt)

     - [Exercise 1: Causal Masking](#exercise-1-causal-masking)

     - [Solution 1](#solution-1)

   - [3.2 Sampling Strategies](#32-sampling-strategies)

     - [Exercise 2: Sampling Methods](#exercise-2-sampling-methods)

     - [Solution 2](#solution-2)

   - [3.3 Encoder Transformers (e.g., BERT)](#33-encoder-transformers-eg-bert)

     - [Exercise 3: Masked Language Modeling](#exercise-3-masked-language-modeling)

     - [Solution 3](#solution-3)

   - [3.4 Sequence-to-Sequence Transformers](#34-sequence-to-sequence-transformers)

   - [3.5 Large Language Models (LLMs)](#35-large-language-models-llms)

 - [4. Multimodal Transformers](#4-multimodal-transformers)

   - [4.1 Vision Transformers (ViT)](#41-vision-transformers-vit)

     - [Exercise 4: ViT Patch Calculation](#exercise-4-vit-patch-calculation)

     - [Solution 4](#solution-4)

   - [4.2 Generative Image Transformers](#42-generative-image-transformers)

   - [4.3 Audio Data](#43-audio-data)

   - [4.4 Text-to-Speech](#44-text-to-speech)

   - [4.5 Vision and Language Transformers](#45-vision-and-language-transformers)

 - [Reference](#reference)

 ## 3. Transformer Language Models



 The transformer architecture is flexible enough to be adapted to many different language tasks. The resulting models can be grouped into three main categories.



 1.  **Decoder-Only Models:** These are generative models that take a sequence and produce a new sequence, often by predicting one token at a time. GPT, which is used for text generation, is a well-known example.

 2.  **Encoder-Only Models:** These models are designed to understand a sequence and output a fixed-size representation or a label for each token. They are excellent for tasks like text classification (e.g., sentiment analysis) or named entity recognition. BERT is a famous example.

 3.  **Encoder-Decoder Models:** These combine both architectures for sequence-to-sequence tasks, where an input sequence is transformed into a different output sequence, like translating from one language to another. This was the architecture of the original Transformer paper.



 We'll explore each of these classes.

 ### <a id="31-decoder-transformers-eg-gpt"></a>3.1 Decoder Transformers (e.g., GPT)



 Decoder-only models are **generative** and **autoregressive**, meaning they generate output one step at a time, with each step depending on the previous ones.



 * **Goal:** To model the probability of the next token in a sequence, given all the preceding tokens: $$p(x_n \mid x_1, \ldots, x_{n-1})$$

 * **Architecture:** It's essentially a stack of standard transformer layers. The input is a sequence of tokens, and the final output is passed through a linear layer followed by a softmax function to get a probability distribution over the entire vocabulary for the *next* token.
<br><br>

<center><img src="image/Figure_15.png" width="650px"/><br>Figure 15: Architecture of a GPT-style decoder transformer. Note the "masked" transformer layers.</center>

<br>

 * **Training and Masking:** Decoders are trained using self-supervised learning on large text corpora. To process an entire sequence at once efficiently, a technique called **masked attention** (or causal attention) is used. This mask prevents any token from "attending to" (seeing) subsequent tokens in the sequence. This is crucial because, at generation time, the model won't know the future tokens, so it must be trained the same way.



<center><img src="image/Figure_16.png" width="450px"/><br>Figure 16: An illustration of a causal attention mask. In predicting "across", the model can only attend to "(start)", "I", and "swam".</center>
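
 Applying the product rule of probability to the next-token goal stated above gives the usual autoregressive factorization of the joint distribution over a whole sequence. This is what lets a next-token model score or generate complete sequences. The sequence length $N$ is a symbol introduced here for illustration.

 $$p(x_1, \ldots, x_N) = \prod_{n=1}^{N} p(x_n \mid x_1, \ldots, x_{n-1})$$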



 ---

 #### PyTorch Code: Causal Attention Mask



 Here's how you can create a causal (look-ahead) mask in PyTorch, which is essential for decoder models.

In [None]:
import torch

# Let's assume a sequence length of 5
seq_len = 5

# Create a square matrix of ones
mask = torch.ones(seq_len, seq_len)

# `torch.triu` keeps the upper-triangular part of a matrix and zeros the rest.
# `diagonal=1` starts one step above the main diagonal, so the diagonal itself
# stays 0 (unmasked): a token can attend to itself but not to future tokens.
causal_mask = torch.triu(mask, diagonal=1).bool()

print("Causal (Look-Ahead) Mask:")
print(causal_mask)
print("\nIn PyTorch's MultiheadAttention, a value of `True` indicates a position that will be masked (ignored).")
print("So, this mask prevents attention to future positions (upper triangle).")

# How it's used in nn.MultiheadAttention:
# attn_output, _ = multihead_attn(query, key, value, attn_mask=causal_mask)
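
To see what the mask does inside attention, here is a minimal single-head sketch written by hand rather than through `nn.MultiheadAttention`. The tensors `q`, `k`, `v` and the size `d_k` are made-up placeholders, not values from a real model.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
seq_len, d_k = 5, 8  # toy sizes (assumptions)

# Hypothetical query/key/value tensors for a single attention head.
q = torch.randn(seq_len, d_k)
k = torch.randn(seq_len, d_k)
v = torch.randn(seq_len, d_k)

# True above the diagonal marks the future positions to block.
causal_mask = torch.triu(torch.ones(seq_len, seq_len), diagonal=1).bool()

# Scaled dot-product scores. Blocked positions are set to -inf so that
# softmax assigns them exactly zero weight.
scores = (q @ k.T) / d_k ** 0.5
scores = scores.masked_fill(causal_mask, float("-inf"))
weights = F.softmax(scores, dim=-1)   # each row sums to 1
output = weights @ v                  # (seq_len, d_k)

print(weights)
```

Each row of `weights` is lower-triangular: row `i` only mixes values from positions `0..i`.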


 #### <a id="exercise-1-causal-masking"></a>Exercise 1: Causal Masking



 **Question:** Why is a causal (look-ahead) mask used in decoder-only transformers like GPT, but not typically in encoder-only transformers like BERT?

 #### <a id="solution-1"></a>Solution 1

`Response:`

 ---

 ### <a id="32-sampling-strategies"></a>3.2 Sampling Strategies



 Once a decoder model produces a probability distribution over the vocabulary for the next token, we need a strategy to select one.



 * **Greedy Search:** Always pick the token with the highest probability. This is fast and deterministic but often leads to repetitive and boring text. It finds the most likely *next word*, not necessarily the most likely *sequence*.

 * **Beam Search:** Keeps track of the `B` (the "beam width") most probable sequences at each step, expanding all of them and then pruning the list back down to the top `B`. This explores the search space better than greedy search but is more computationally expensive and can still produce text that feels unnatural compared to human writing.

<br>
    <center>
        <img src="image/Figure_17.jpg" width="550px"/>
        <p>Figure 17: Human-written text often has lower token probabilities (is more surprising) than text generated by beam search.</p>
    </center>

<br>


 * **Stochastic Sampling:** To increase diversity and creativity, we can sample from the probability distribution instead of always picking the best.

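     * **Temperature Sampling:** A parameter `T` is used to rescale the logits before the softmax function (each logit is divided by `T`). `T > 1` makes the distribution flatter (more random), while `T < 1` makes it peakier (closer to greedy). In the limit `T → 0`, sampling becomes greedy search; `T = 0` itself would divide by zero, so implementations typically treat it as a special case.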

     * **Top-K Sampling:** Only consider the top `K` most probable tokens and redistribute the probability among them, then sample from this smaller set. This avoids picking highly improbable, nonsensical words from the long tail of the distribution.

     * **Nucleus (Top-p) Sampling:** A more dynamic approach than Top-K. Instead of a fixed `K`, it samples from the smallest possible set of tokens whose cumulative probability is greater than or equal to a threshold `p`.
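
 The Top-K and nucleus strategies above can both be written as filters that set the logits of excluded tokens to `-inf` before sampling. The following is a minimal sketch under that framing; the helper names `top_k_filter` and `top_p_filter` are illustrative and not taken from any library.

```python
import torch
import torch.nn.functional as F

def top_k_filter(logits, k):
    """Keep the k largest logits and set the rest to -inf (ties may keep more)."""
    kth_value = torch.topk(logits, k).values[..., -1, None]
    return logits.masked_fill(logits < kth_value, float("-inf"))

def top_p_filter(logits, p):
    """Keep the smallest set of tokens whose cumulative probability is >= p."""
    sorted_logits, sorted_idx = torch.sort(logits, descending=True)
    probs = F.softmax(sorted_logits, dim=-1)
    # Probability mass of all tokens ranked strictly above each token.
    mass_before = probs.cumsum(dim=-1) - probs
    # Drop a token once the tokens ranked above it already reach p.
    sorted_logits = sorted_logits.masked_fill(mass_before >= p, float("-inf"))
    # Undo the sort so each logit returns to its vocabulary position.
    out = torch.full_like(logits, float("-inf"))
    return out.scatter(-1, sorted_idx, sorted_logits)

logits = torch.tensor([2.0, 1.5, 0.5, -1.0, -2.0])

for name, filtered in [("top-k (K=2)", top_k_filter(logits, 2)),
                       ("top-p (p=0.9)", top_p_filter(logits, 0.9))]:
    probs = F.softmax(filtered, dim=-1)          # renormalized over kept tokens
    token = torch.multinomial(probs, num_samples=1).item()
    print(name, probs.numpy().round(4), "sampled id:", token)
```

 With these logits, Top-K with `K=2` keeps two tokens, while nucleus sampling with `p=0.9` keeps three, because the top two only cover about 85% of the probability mass.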



 ---

 #### PyTorch Code: Temperature Sampling



 Here is a simple demonstration of how temperature affects the softmax output.

In [None]:
import torch
import torch.nn.functional as F

# Example logits from a model for a vocabulary of 5 words
logits = torch.tensor([2.0, 1.5, 0.5, -1.0, -2.0])

# Temperature = 1.0 (Standard Softmax)
temp_1 = 1.0
probs_t1 = F.softmax(logits / temp_1, dim=-1)

# Temperature = 0.5 (Less random, more confident)
temp_0_5 = 0.5
probs_t0_5 = F.softmax(logits / temp_0_5, dim=-1)

# Temperature = 2.0 (More random, less confident)
temp_2 = 2.0
probs_t2 = F.softmax(logits / temp_2, dim=-1)

print("Original Logits:       ", logits.numpy())
print("-" * 50)
print(f"Probabilities (T={temp_1}):    ", probs_t1.numpy().round(4))
print(f"Probabilities (T={temp_0_5}):  ", probs_t0_5.numpy().round(4), "<- Peakier")
print(f"Probabilities (T={temp_2}):    ", probs_t2.numpy().round(4), "<- Flatter")


 #### <a id="exercise-2-sampling-methods"></a>Exercise 2: Sampling Methods



 **Question:** You are building a chatbot for creative writing. Would you prefer Greedy Search, Beam Search, or a stochastic method like Nucleus Sampling? Why?

 #### <a id="solution-2"></a>Solution 2

`Response:`

 ---

 ### <a id="33-encoder-transformers-eg-bert"></a>3.3 Encoder Transformers (e.g., BERT)



 Encoder-only models, like BERT (Bidirectional Encoder Representations from Transformers), are designed to generate rich, contextualized representations of input text.



 * **Goal:** To pre-train a deep bidirectional model on a vast amount of unlabeled text, which can then be quickly **fine-tuned** for various downstream tasks (like classification or question answering) with smaller, task-specific labeled datasets.

 * **Architecture:** A stack of standard transformer encoder layers. A special `[CLS]` (or `(class)`) token is often prepended to the input sequence, and its final hidden state is used as the aggregate representation for classification tasks.

<br>
<div align="center">
    <img src="image/Figure_18.png" width="650px"/>
    <br>
    <em>Figure 18: Architecture of an encoder transformer model. Note the absence of look-ahead masking.</em>
</div>
<br>

 * **Training: Masked Language Modeling (MLM):** Unlike GPT's autoregressive objective, BERT is trained using a **masked language model** objective. During pre-training, about 15% of the tokens in the input sequence are randomly selected for prediction, and most of the selected tokens are replaced with a special `[MASK]` token. The model's job is to predict the original identity of the selected tokens. Because the model isn't constrained by a causal mask, it can use the entire sequence (both left and right context) to make its prediction, making its representations deeply **bidirectional**.
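
 The `[CLS]` pooling mentioned in the architecture bullet above can be sketched with PyTorch's built-in encoder layers. This is a toy illustration, not BERT itself: all sizes are made up, the weights are random, and positional embeddings are omitted for brevity.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
vocab_size, d_model, num_classes = 100, 32, 2  # toy sizes (assumptions)

embed = nn.Embedding(vocab_size, d_model)
layer = nn.TransformerEncoderLayer(
    d_model=d_model, nhead=4, dim_feedforward=64, batch_first=True
)
encoder = nn.TransformerEncoder(layer, num_layers=2).eval()
classifier = nn.Linear(d_model, num_classes)

# Batch of 3 sequences of length 6. By convention, position 0 holds [CLS].
token_ids = torch.randint(0, vocab_size, (3, 6))

# No attention mask is passed, so every token can attend to every other token.
hidden = encoder(embed(token_ids))   # (3, 6, d_model)
cls_state = hidden[:, 0]             # final hidden state of [CLS]: (3, d_model)
class_logits = classifier(cls_state) # (3, num_classes)

print(class_logits.shape)
```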



 ---

 #### Conceptual Code: Masked Language Modeling



 This code conceptualizes how you would prepare data for BERT's Masked Language Modeling (MLM) pre-training task.

In [None]:
import random

# Example sentence tokenized into words
sentence = ["The", "cat", "sat", "on", "the", "mat", "."]

# In a real tokenizer, these would be sub-word IDs. We'll use words for clarity.
vocab = {"[PAD]": 0, "[MASK]": 1, "The": 2, "cat": 3, "sat": 4, "on": 5, "the": 6, "mat": 7, ".": 8}
# Invert vocab for lookup
inv_vocab = {v: k for k, v in vocab.items()}

input_ids = [vocab[word] for word in sentence]
labels = list(input_ids) # The original IDs are the labels

mask_prob = 0.15
for i in range(len(input_ids)):
    # Decide whether to mask this token
    if random.random() < mask_prob:
        # Per BERT paper: 80% of the time, replace with [MASK]
        # 10% of the time, replace with a random word
        # 10% of the time, keep the original word
        rand_choice = random.random()
        if rand_choice < 0.8:
            input_ids[i] = vocab["[MASK]"]
        elif rand_choice < 0.9:
            # Replace with a random token (excluding special tokens)
            random_token_id = random.choice(list(range(2, len(vocab))))
            input_ids[i] = random_token_id
        # else: keep the original token
        
    else:
        # If we don't mask this token, we don't need to predict it.
        # Set its label to -100 (a standard value ignored by PyTorch's CrossEntropyLoss)
        labels[i] = -100

print("Original Sentence: ", " ".join(sentence))
print("-" * 50)
print("Original IDs:     ", [vocab[word] for word in sentence])
print("Masked Input IDs: ", input_ids)
print("Input Tokens:     ", " ".join(inv_vocab[token_id] for token_id in input_ids))
print("Labels for Loss:  ", labels, " (negative values are ignored)")

# The model would take `input_ids` and try to predict the non-negative `labels`.
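
As a follow-up, this short sketch shows how labels of `-100` drop out of the loss. The `logits` tensor is a random stand-in for model output and the `labels` are hand-picked for illustration.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
vocab_size = 9

# Only positions 1 and 4 were selected for prediction in this made-up example.
labels = torch.tensor([-100, 3, -100, -100, 6, -100, -100])

# Stand-in for model output: one row of logits per input position.
logits = torch.randn(len(labels), vocab_size)

# ignore_index defaults to -100; it is spelled out here for clarity.
loss = F.cross_entropy(logits, labels, ignore_index=-100)

# The same value computed by hand over the two selected positions only.
selected = labels != -100
manual = F.cross_entropy(logits[selected], labels[selected])

print(loss.item(), manual.item())
```

The two printed values agree, because the loss is averaged over the selected positions only.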


 #### <a id="exercise-3-masked-language-modeling"></a>Exercise 3: Masked Language Modeling



 **Question:** In the example sentence "The cat sat on the mat", if we mask the word "sat", how does BERT's training objective allow it to use the word "on" to help predict "sat"? Why can't a standard left-to-right autoregressive model (like GPT) do this?

 #### <a id="solution-3"></a>Solution 3

`Response:`


 ---

 ### <a id="34-sequence-to-sequence-transformers"></a>3.4 Sequence-to-Sequence Transformers



 For tasks that map an input sequence to a different output sequence (e.g., machine translation), we use the full encoder-decoder architecture.



 * **How it works:**

     1.  The **Encoder** reads the entire source sequence (e.g., an English sentence) and creates a rich, contextualized representation of it, let's call it `Z`.

     2.  The **Decoder** then generates the target sequence (e.g., a Dutch sentence) token by token, in an autoregressive manner.

 * **The Link: Cross-Attention:** The key is how the decoder uses the information from the encoder. In addition to its standard (masked) self-attention layer, the decoder has a second attention layer called **cross-attention**.

     * In this layer, the **Queries (Q)** come from the decoder's own sequence (the tokens generated so far).

     * However, the **Keys (K) and Values (V)** come from the encoder's output representation `Z`.

     * This allows each token being generated by the decoder to "look back" and attend to all parts of the original input sequence to gather the most relevant information for its prediction.
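
 The Q/K/V wiring described above maps directly onto `nn.MultiheadAttention`. In this minimal sketch, `Z` and `dec` are random placeholders for the encoder output and the decoder states, and the sizes are arbitrary.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
d_model, num_heads = 32, 4  # toy sizes (assumptions)
cross_attn = nn.MultiheadAttention(d_model, num_heads, batch_first=True)

Z = torch.randn(2, 7, d_model)    # encoder output: 7 source tokens per example
dec = torch.randn(2, 4, d_model)  # decoder states: 4 target tokens so far

# Queries come from the decoder; keys and values come from the encoder.
# No causal mask is needed here: the whole source sequence is already known.
out, attn = cross_attn(query=dec, key=Z, value=Z)

print(out.shape)   # one output vector per target token
print(attn.shape)  # one weight per (target token, source token) pair
```

 The output has one row per decoder token, while each row of `attn` is a distribution over the 7 source tokens.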


<br>
<div align="center">
    <img src="image/Figure_19.png" width="200px"/>
    <p><em>Figure 19: One layer of a decoder with cross-attention. Q comes from the decoder's self-attention output; K and V come from the encoder's output Z.</em></p>
</div>
<br>


<br>
<div align="center">
    <img src="image/Figure_20.png" width="650px"/>
    <p><em>Figure 20: The full encoder-decoder architecture for sequence-to-sequence tasks.</em></p>
</div>
<br>



 ---

 ### <a id="35-large-language-models-llms"></a>3.5 Large Language Models (LLMs)



 Recent years have seen the rise of massive transformer-based models, known as LLMs, with parameter counts reaching into the trillions.



 * **Scaling Laws:** A key driver is the "scaling hypothesis", which holds that performance improves markedly simply by increasing model size, dataset size, and compute, often outpacing gains from architectural changes.

 * **Training Paradigm:** LLMs are first **pre-trained** in a self-supervised manner (e.g., autoregressively or with MLM) on enormous, unlabeled text corpora. This pre-trained model, often called a **foundation model**, can then be adapted to many specific tasks.

 * **Fine-Tuning:**

     * **Standard Fine-Tuning:** The pre-trained model is further trained on a smaller, labeled dataset specific to a downstream task.

     * **Parameter-Efficient Fine-Tuning (PEFT):** Methods like **LoRA (Low-Rank Adaptation)** have become popular. LoRA freezes the original LLM weights and injects small, trainable low-rank "adapter" matrices into the layers. Since only these small matrices are trained, fine-tuning needs far less memory and compute. After training, the adapter matrices can be merged back into the original weights, so inference incurs no extra cost.



<br>
<div align="center">
    <img src="image/Figure_21.png" width="550px"/>
    <p><em>Figure 21: Schematic of LoRA. The original weight matrix W_0 is frozen. A new, low-rank update is learned via matrices A and B and added to the output.</em></p>
</div>
<br>



 * **Prompting and In-Context Learning:** For very large models, explicit fine-tuning is often no longer necessary. They can perform new tasks simply by being prompted with instructions in natural language. Providing a few examples of the task within the prompt is called **few-shot learning** and can dramatically improve performance without any weight updates.
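
 The LoRA idea from the fine-tuning bullet above can be sketched as a wrapper around a frozen `nn.Linear`. The class name `LoRALinear`, the `rank` and `alpha` hyperparameters, and the initialization are illustrative assumptions; note also that which factor is called `A` and which `B` varies between references.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen linear layer plus a trainable low-rank update (a sketch)."""

    def __init__(self, base: nn.Linear, rank: int = 4, alpha: float = 8.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False  # freeze W_0 (and the bias)
        # Low-rank factors: B @ A has the same shape as base.weight.
        self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))
        self.scale = alpha / rank

    def forward(self, x):
        # Frozen path plus the low-rank update path.
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

    @torch.no_grad()
    def merge(self) -> nn.Linear:
        """Fold the update into one ordinary linear layer for inference."""
        has_bias = self.base.bias is not None
        merged = nn.Linear(self.base.in_features, self.base.out_features, bias=has_bias)
        merged.weight.copy_(self.base.weight + self.scale * (self.B @ self.A))
        if has_bias:
            merged.bias.copy_(self.base.bias)
        return merged

torch.manual_seed(0)
layer = LoRALinear(nn.Linear(64, 64), rank=4)
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
total = sum(p.numel() for p in layer.parameters())
print(f"trainable parameters: {trainable} of {total}")
```

 Initializing `B` to zeros means the wrapped layer starts out identical to the frozen one, and `merge()` shows why the merged model needs no extra computation at inference time.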



 ---

 ## 4. Multimodal Transformers



 The general-purpose nature of the transformer architecture has allowed it to become state-of-the-art in domains far beyond text, including vision and audio. The core idea is always the same: if you can convert your data into a sequence of **tokens**, you can likely apply a transformer.



 ---

 ### <a id="41-vision-transformers-vit"></a>4.1 Vision Transformers (ViT)



 The Vision Transformer (ViT) applies a standard transformer encoder architecture to image classification tasks.



 * **Tokenization:** An image isn't an obvious sequence. The ViT approach is to split the image into a grid of non-overlapping patches (e.g., 16x16 pixels). Each patch is then flattened into a long vector, which becomes one **token** in the sequence. Using patches is critical because the quadratic complexity of self-attention makes processing every single pixel as a token computationally infeasible for standard-sized images.

 * **Architecture:**

     1.  The sequence of patch tokens is fed into a linear projection layer.

     2.  **Positional embeddings** are added to the patch tokens to retain spatial information. Unlike the fixed sinusoidal encodings of the original Transformer, these are often learned parameters, which works well because image patch grids are usually a fixed size.

     3.  A special `[class]` token is prepended to the sequence.

     4.  This whole sequence is processed by a standard Transformer Encoder.

     5.  The final hidden state corresponding to the `[class]` token is used as the image representation for a classification head.


<br>
<div align="center">
    <img src="image/Figure_22.png" width="600px"/>
    <br>
    <em>Figure 22: The Vision Transformer (ViT) architecture. An image is split into patches, which are treated as tokens.</em>
</div>
<br>


 * **Inductive Bias:** Compared to Convolutional Neural Networks (CNNs), which have strong built-in inductive biases for vision (like translation equivariance and locality), ViTs have very few. This means they typically require much larger datasets to learn the basic properties of images from scratch, but with sufficient data, they often achieve higher performance.
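
 To make the tokenization argument above concrete, here is a quick back-of-the-envelope comparison of attention-matrix sizes for pixel tokens versus patch tokens. It counts only the entries of one attention matrix and ignores heads, layers, and batch size.

```python
img_size, patch_size = 224, 16

pixel_tokens = img_size * img_size            # one token per pixel
patch_tokens = (img_size // patch_size) ** 2  # one token per 16x16 patch

# Self-attention compares every token with every other token,
# so the attention matrix has (number of tokens) ** 2 entries.
print("pixel tokens:", pixel_tokens, "-> attention entries:", pixel_tokens ** 2)
print("patch tokens:", patch_tokens, "-> attention entries:", patch_tokens ** 2)
print("reduction factor:", pixel_tokens ** 2 // patch_tokens ** 2)
```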



 ---

 #### PyTorch Code: Image Patching



 This shows how to implement the first step of a ViT: converting an image into a sequence of flattened patches.

In [None]:
import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    """
    Converts an image into a sequence of flattened patch embeddings.
    """
    def __init__(self, img_size=224, patch_size=16, in_channels=3, embed_dim=768):
        super().__init__()
        self.img_size = img_size
        self.patch_size = patch_size
        self.num_patches = (img_size // patch_size) ** 2
        
        # A convolution layer is a very efficient way to do this.
        # A kernel size and stride equal to the patch size will create
        # non-overlapping patches and project them to the embed_dim.
        self.projection = nn.Conv2d(
            in_channels, 
            embed_dim, 
            kernel_size=patch_size, 
            stride=patch_size
        )

    def forward(self, x):
        # x: (batch_size, in_channels, height, width) -> e.g., (B, 3, 224, 224)
        x = self.projection(x)  # (B, embed_dim, num_patches_h, num_patches_w) -> (B, 768, 14, 14)
        
        # Flatten the spatial dimensions into a single sequence dimension
        # (B, embed_dim, num_patches) -> (B, 768, 196)
        x = x.flatten(2)
        
        # Transpose to get the desired shape for the transformer encoder
        # (B, num_patches, embed_dim) -> (B, 196, 768)
        x = x.transpose(1, 2)
        
        return x

# Example Usage
img_size = 224
patch_size = 16
embed_dim = 768
batch_size = 4

# Dummy batch of images
dummy_images = torch.randn(batch_size, 3, img_size, img_size)

# Create patch embedding layer
patch_embed_layer = PatchEmbedding(img_size, patch_size, 3, embed_dim)
patch_embeddings = patch_embed_layer(dummy_images)

print(f"Number of patches: {patch_embed_layer.num_patches}")
print("Shape of input images: ", dummy_images.shape)
print("Shape of output patch embeddings: ", patch_embeddings.shape)
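
Going one step further, the remaining ViT steps (class token, positional embeddings, encoder, classification head) can be assembled around the same patch projection. `TinyViT` and all its sizes are hypothetical and chosen to run quickly; this is a sketch, not a faithful reproduction of any published model. It adds the positional embeddings after prepending the class token, as in the original ViT paper, so the class token receives a position as well.

```python
import torch
import torch.nn as nn

class TinyViT(nn.Module):
    def __init__(self, img_size=32, patch_size=8, in_channels=3,
                 embed_dim=64, depth=2, num_heads=4, num_classes=10):
        super().__init__()
        num_patches = (img_size // patch_size) ** 2
        # Patch split and linear projection in one strided convolution.
        self.patchify = nn.Conv2d(in_channels, embed_dim,
                                  kernel_size=patch_size, stride=patch_size)
        # Learned class token and learned positional embeddings.
        self.cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
        self.pos_embed = nn.Parameter(torch.randn(1, num_patches + 1, embed_dim) * 0.02)
        layer = nn.TransformerEncoderLayer(
            embed_dim, num_heads, dim_feedforward=4 * embed_dim, batch_first=True
        )
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)
        self.head = nn.Linear(embed_dim, num_classes)

    def forward(self, x):
        x = self.patchify(x).flatten(2).transpose(1, 2)   # (B, N, D)
        cls = self.cls_token.expand(x.shape[0], -1, -1)   # (B, 1, D)
        x = torch.cat([cls, x], dim=1) + self.pos_embed   # (B, N + 1, D)
        x = self.encoder(x)                               # (B, N + 1, D)
        return self.head(x[:, 0])                         # classify from class token

model = TinyViT().eval()
class_logits = model(torch.randn(2, 3, 32, 32))
print(class_logits.shape)
```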


 #### <a id="exercise-4-vit-patch-calculation"></a>Exercise 4: ViT Patch Calculation



 **Question:** For an input image of size 256x256 pixels and a patch size of 32x32 pixels, how many patch tokens would be fed into the ViT encoder (excluding the special `[class]` token)?

 #### <a id="solution-4"></a>Solution 4

`Response:`

 ---

 ### <a id="42-generative-image-transformers"></a>4.2 Generative Image Transformers



 Transformers can also be used to *generate* images autoregressively, much like they generate text.



 * **Image Ordering:** Since images have no natural one-dimensional order, we must impose one. A common choice is a **raster scan** (left-to-right, top-to-bottom), which turns the grid of pixels or patches into a sequence.


<br>
<div align="center">
    <img src="image/Figure_23.png" width="300px"/>
    <br>
    <em>Figure 23: A raster scan defines a linear ordering of pixels.</em>
</div>
<br>

<br>
<div align="center">
    <img src="image/Figure_24.png" width="700px"/>
    <br>
    <em>Figure 24: Generating an image autoregressively, pixel by pixel, following a raster scan order.</em>
</div>
<br>



 * **Discrete Tokens and Vector Quantization (VQ):** Directly predicting continuous RGB values can lead to blurry, averaged results. A better approach is to work with a discrete vocabulary of "visual words". This is achieved using **Vector Quantization**. An encoder (often a CNN) maps image patches to the closest vector in a learned "codebook". The transformer is then trained to predict the *index* of the next codebook vector in the sequence. A decoder then converts the generated sequence of indices back into an image.
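
 The raster-scan ordering and the codebook lookup described above can be sketched together. Here `codebook` and `patch_features` are random stand-ins for a learned codebook and for encoder outputs, and all sizes are arbitrary.

```python
import torch

torch.manual_seed(0)
codebook_size, dim = 16, 4  # toy sizes (assumptions)
grid_h, grid_w = 3, 3

codebook = torch.randn(codebook_size, dim)         # stand-in for a learned codebook
patch_features = torch.randn(grid_h, grid_w, dim)  # stand-in for encoder outputs

# Raster scan: row-major flattening (left-to-right, then top-to-bottom).
flat = patch_features.reshape(-1, dim)             # (9, dim)

# Vector quantization: index of the nearest codebook vector (Euclidean distance).
distances = torch.cdist(flat, codebook)            # (9, codebook_size)
token_ids = distances.argmin(dim=-1)               # (9,) discrete "visual words"

# Decoder side: look up the codebook vectors and restore the grid layout.
quantized = codebook[token_ids].reshape(grid_h, grid_w, dim)

print(token_ids.tolist())
```

 The transformer would be trained on sequences like `token_ids`, exactly as it is trained on text token IDs.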



 ---

 ### <a id="43-audio-data"></a>4.3 Audio Data



 Transformers have also displaced older models such as CNNs in many audio tasks.



 * **Representation:** Raw audio waveforms are typically converted into a **mel spectrogram**. This is a 2D representation where one axis is time and the other is frequency (on the perceptually-motivated mel scale).


<br>
<div align="center">
    <img src="image/Figure_25.png" width="450px"/>
    <br>
    <em>Figure 25: A mel spectrogram of a humpback whale song.</em>
</div>
<br>


 * **Tokenization and Modeling:** The spectrogram is treated like an image. It can be split into patches, which are then tokenized and fed into a standard transformer encoder for tasks like audio classification (e.g., identifying sounds like "car" or "laughter").
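
 Treating a spectrogram like an image means it can be patched in the same way as in a ViT. In this sketch `spec` is a random tensor standing in for a real mel spectrogram, and the sizes (128 mel bins, 400 time frames, 16x16 patches) are assumptions chosen for illustration.

```python
import torch
import torch.nn.functional as F

n_mels, n_frames, patch = 128, 400, 16  # assumed sizes

# Dummy batch of 2 single-channel "spectrograms": (batch, channel, mel, time).
spec = torch.randn(2, 1, n_mels, n_frames)

# Extract non-overlapping patches: kernel size and stride equal the patch size.
patches = F.unfold(spec, kernel_size=patch, stride=patch)  # (2, 16*16, num_patches)

# One flattened patch per token, ready for a linear projection and an encoder.
tokens = patches.transpose(1, 2)                           # (2, num_patches, 256)

print(tokens.shape)
```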



 ---

 ### <a id="44-text-to-speech"></a>4.4 Text-to-Speech



 Transformers can generate highly realistic speech from text by framing it as a conditional language modeling task.



 * **Approach (e.g., VALL-E):**

     1.  **Acoustic Tokenization:** Just like with images, speech is tokenized. An audio codec model based on vector quantization learns a codebook of discrete acoustic tokens from a huge amount of speech data.

     2.  **Conditional Generation:** A decoder transformer is trained to predict the next acoustic token. It is conditioned on two things:

         * **Text Tokens:** A prompt of the text to be spoken.

         * **Acoustic Prompt Tokens:** A short (e.g., 3-second) sample of a speaker's voice, converted into its acoustic tokens. This tells the model *whose* voice to use.

     3.  **Synthesis:** The generated sequence of acoustic tokens is then passed through a decoder to synthesize the final audio waveform. This allows for zero-shot text-to-speech synthesis in a new speaker's voice from just a brief sample.



<br>
<div align="center">
    <img src="image/Figure_26.png" width="600px"/>
    <br>
    <em>Figure 26: High-level architecture of VALL-E. It uses text and acoustic prompts to generate output speech tokens.</em>
</div>
<br>


 ---

 ### <a id="45-vision-and-language-transformers"></a>4.5 Vision and Language Transformers



 The ultimate goal is to have a single model that can seamlessly process and generate interleaved data from multiple modalities, like text and images.



 * **Text-to-Image Generation:** This can be treated as a sequence-to-sequence task, where the input is a sequence of text tokens and the output is a sequence of discrete visual tokens (from a VQ codebook). A full encoder-decoder transformer is well-suited for this.

 * **Unified Multimodal Models:** More recent models (e.g., CM3Leon) take this a step further. They use a single, unified vocabulary that contains both text tokens and visual tokens. A single large autoregressive transformer is then trained on web-scale datasets of interleaved text and images (such as HTML documents). By learning from this mixed data, the model can perform a wide range of tasks depending on how the prompt is structured, including text-to-image generation, image-to-text generation (captioning), visual question answering, and image editing, all within a single framework.
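
 One simple way to picture a unified vocabulary is to place the visual codebook indices after the text token IDs in a single ID space. The sketch below is a generic illustration of that idea; the vocabulary sizes and helper names are hypothetical, and it is not a description of CM3Leon's actual token layout.

```python
text_vocab_size = 1000     # hypothetical number of text tokens
image_codebook_size = 256  # hypothetical number of visual codebook entries
unified_vocab_size = text_vocab_size + image_codebook_size

def image_code_to_token(code):
    """Shift a codebook index into the range after the text tokens."""
    return text_vocab_size + code

def describe(token_id):
    """Report which modality a unified token ID belongs to."""
    if token_id < text_vocab_size:
        return ("text", token_id)
    return ("image", token_id - text_vocab_size)

text_ids = [12, 87, 430]      # made-up caption tokens
image_codes = [5, 200, 31, 7] # made-up codebook indices for an image

# A single interleaved sequence that one autoregressive model can be trained on.
sequence = text_ids + [image_code_to_token(c) for c in image_codes]

print(sequence)
print([describe(t)[0] for t in sequence])
```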



<br>
<div align="center">
    <img src="image/Figure_27.png" width="700px"/>
    <br>
    <em>Figure 27: Examples of a single multimodal model (CM3Leon) performing a variety of different vision-and-language tasks.</em>
</div>
<br>


 ---


 ### <a id="reference"></a>Reference



 Bishop, C. M. (2024). *Deep Learning: Foundations and Concepts*. Springer. (Chapter 12: Transformers).