(Hyper_parameters)=
# Chapter 19 -- Hyper-Parameters

## Introduction

The Vision Transformer (ViT) is a breakthrough architecture that applies the principles of the Transformer model, originally designed for natural language processing (NLP), to the domain of computer vision. Introduced by Dosovitskiy et al. in their 2020 paper titled "An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale," the ViT has revolutionized the way deep learning models handle image data, offering a powerful alternative to traditional convolutional neural networks (CNNs).

Applying Transformers to images was always going to be a challenge, for the following reasons:

- Unlike words, sentences, or paragraphs, images carry far more raw information, in the form of pixels.
- Even with current hardware, it would be very costly to let every pixel attend to every other pixel in the image, because the cost of self-attention grows quadratically with the sequence length.

A popular alternative was therefore to use localized attention.

CNNs in fact do something very similar through convolutions: the receptive field grows as we go deeper into the model's layers. Transformers, however, were always going to be computationally more expensive than CNNs, because global self-attention compares every token with every other token.

# Vision Transformer (ViT): A Comprehensive Review

## Introduction

As introduced by Dosovitskiy et al. (2020), ViT applies the Transformer architecture, initially developed for natural language processing (NLP), to image-based tasks. The model has shown that Transformers can match or outperform traditional convolutional neural networks (CNNs) on various computer vision tasks, particularly when pre-trained on large datasets. The sections below review its building blocks.

## Key Concepts and Theoretical Foundations

### 1. Image as a Sequence of Patches

Unlike CNNs, which process images as grids of pixels with localized filters, the Vision Transformer treats an image as a sequence of patches. The idea is to break down the image into smaller, fixed-size patches, which can be processed similarly to words in NLP tasks.

Given an input image \( \mathbf{x} \) of size \( H \times W \times C \) (height, width, and number of channels), the image is divided into patches of size \( P \times P \). The total number of patches \( N \) is given by:

$$
N = \frac{H \times W}{P^2}
$$

Each patch \( \mathbf{x}_p \) is then flattened into a vector of length \( P^2 \cdot C \). This vector is linearly projected into a \( D \)-dimensional embedding space \( \mathbb{R}^D \) using a learnable matrix \( \mathbf{E} \in \mathbb{R}^{D \times (P^2 \cdot C)} \):

$$
\mathbf{z}_p = \mathbf{E} \cdot \text{flatten}(\mathbf{x}_p)
$$

where \( \mathbf{z}_p \in \mathbb{R}^D \) is the embedded patch.
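The patch split and the linear projection can be sketched in a few lines of NumPy. This is only an illustration under assumed sizes (a 32x32 RGB image, \( P = 8 \), \( D = 64 \)); the helper name `patchify` is hypothetical, and a random matrix stands in for the learnable \( \mathbf{E} \).

```python
import numpy as np


def patchify(image, patch_size):
    """Split an (H, W, C) image into N flattened patches of length P*P*C."""
    H, W, C = image.shape
    P = patch_size
    assert H % P == 0 and W % P == 0, "H and W must be divisible by P"
    # (H/P, P, W/P, P, C) -> (H/P, W/P, P, P, C) -> (N, P*P*C)
    blocks = image.reshape(H // P, P, W // P, P, C).transpose(0, 2, 1, 3, 4)
    return blocks.reshape(-1, P * P * C)


rng = np.random.default_rng(0)
H, W, C, P, D = 32, 32, 3, 8, 64           # illustrative sizes only (assumed)
image = rng.standard_normal((H, W, C))

patches = patchify(image, P)               # (N, P^2 * C) with N = H * W / P^2
E = rng.standard_normal((D, P * P * C))    # random stand-in for the learnable E
Z = patches @ E.T                          # row p is z_p = E . flatten(x_p)

print(patches.shape, Z.shape)              # (16, 192) (16, 64)
```

Each row of `patches` is one contiguous \( P \times P \) block of the image, read in row-major order, so the first row is the top-left block.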

### 2. Positional Embedding

Self-attention is permutation-equivariant: on its own it has no built-in mechanism to capture the order of the input tokens, so shuffling the patches would simply shuffle the outputs. To retain the spatial structure of the image, a learnable positional embedding is added to each patch embedding, with one vector \( \mathbf{E}_{pos}^{(p)} \in \mathbb{R}^D \) per position \( p \):

$$
\mathbf{z}_p' = \mathbf{z}_p + \mathbf{E}_{pos}^{(p)}
$$

This positional embedding tells the model where each patch sits within the image, enabling it to learn spatial relationships between patches.
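As a minimal sketch, the positional embedding can be viewed as a table with one row per patch position that is added element-wise to the patch embeddings. The sizes and the random initialization below are assumptions for illustration; in a real model the table is learned during training.

```python
import numpy as np

rng = np.random.default_rng(0)
N, D = 16, 64                                # illustrative sizes only (assumed)
Z = rng.standard_normal((N, D))              # patch embeddings z_1 .. z_N
E_pos = 0.02 * rng.standard_normal((N, D))   # random stand-in for the learnable table

Z_prime = Z + E_pos                          # row p: z_p' = z_p + (row p of E_pos)

# The same patch content placed at two different positions yields different inputs,
# which is what lets the encoder tell positions apart.
same_patch = Z[0]
at_position_0 = same_patch + E_pos[0]
at_position_5 = same_patch + E_pos[5]
print(np.allclose(at_position_0, at_position_5))   # False
```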

### 3. Transformer Encoder

The sequence of embedded patches, now augmented with positional information, is fed into a Transformer encoder. The encoder is composed of multiple layers, each containing a multi-head self-attention mechanism and a feed-forward network (FFN).

#### Multi-Head Self-Attention

The self-attention mechanism allows the model to compute the relationships between different patches in the sequence, capturing both local and global dependencies. For a given input sequence of patches \( \mathbf{Z} = [\mathbf{z}_1', \mathbf{z}_2', \dots, \mathbf{z}_N'] \), the attention mechanism computes the output as follows:

$$
\text{Attention}(\mathbf{Q}, \mathbf{K}, \mathbf{V}) = \text{softmax}\left(\frac{\mathbf{Q} \mathbf{K}^\top}{\sqrt{D_k}}\right) \mathbf{V}
$$

Where:
- \( \mathbf{Q} = \mathbf{Z} \mathbf{W}_Q \), \( \mathbf{K} = \mathbf{Z} \mathbf{W}_K \), and \( \mathbf{V} = \mathbf{Z} \mathbf{W}_V \) are the query, key, and value matrices, respectively.
- \( \mathbf{W}_Q, \mathbf{W}_K, \mathbf{W}_V \) are learnable weight matrices.
- \( D_k \) is the dimensionality of the key/query vectors.
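The attention formula can be written as a short NumPy sketch for a single head and a single image. The sizes and the random weight matrices are assumptions for illustration, and the function names are hypothetical. The last lines also check that, without positional information, permuting the input rows only permutes the output rows.

```python
import numpy as np


def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # subtract the max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def attention(Z, W_Q, W_K, W_V):
    """Scaled dot-product self-attention for one sequence Z of shape (N, D)."""
    Q, K, V = Z @ W_Q, Z @ W_K, Z @ W_V
    D_k = Q.shape[-1]
    weights = softmax(Q @ K.T / np.sqrt(D_k))  # (N, N), each row sums to 1
    return weights @ V, weights


rng = np.random.default_rng(0)
N, D, D_k = 16, 64, 32                         # illustrative sizes only (assumed)
Z = rng.standard_normal((N, D))
W_Q, W_K, W_V = (rng.standard_normal((D, D_k)) / np.sqrt(D) for _ in range(3))

out, weights = attention(Z, W_Q, W_K, W_V)
print(out.shape, weights.shape)                # (16, 32) (16, 16)

# Shuffling the input tokens shuffles the outputs in the same way.
perm = rng.permutation(N)
out_perm, _ = attention(Z[perm], W_Q, W_K, W_V)
print(np.allclose(out_perm, out[perm]))        # True
```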

In the multi-head attention mechanism, this process is repeated \( h \) times with different sets of projection matrices, and the results are concatenated and linearly transformed:

$$
\text{MultiHead}(\mathbf{Q}, \mathbf{K}, \mathbf{V}) = \text{Concat}(\text{head}_1, \dots, \text{head}_h) \mathbf{W}_O
$$

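Where \( \text{head}_i = \text{Attention}(\mathbf{Z} \mathbf{W}_Q^{(i)}, \mathbf{Z} \mathbf{W}_K^{(i)}, \mathbf{Z} \mathbf{W}_V^{(i)}) \) uses the projection matrices of the \( i \)-th head, and \( \mathbf{W}_O \) is another learnable weight matrix.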
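A minimal NumPy sketch of multi-head attention follows. It loops over heads for readability (real implementations batch the heads into one tensor operation), and it assumes \( D_k = D / h \) with randomly initialized weights; all names and sizes are illustrative.

```python
import numpy as np


def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def multi_head_attention(Z, W_Q, W_K, W_V, W_O):
    """W_Q, W_K, W_V: (h, D, D_k) stacks of per-head projections; W_O: (h * D_k, D)."""
    heads = []
    for W_q, W_k, W_v in zip(W_Q, W_K, W_V):        # one iteration per head
        Q, K, V = Z @ W_q, Z @ W_k, Z @ W_v
        A = softmax(Q @ K.T / np.sqrt(Q.shape[-1]))
        heads.append(A @ V)                          # (N, D_k)
    return np.concatenate(heads, axis=-1) @ W_O      # (N, h * D_k) @ (h * D_k, D)


rng = np.random.default_rng(0)
N, D, h = 16, 64, 4                                  # illustrative sizes only (assumed)
D_k = D // h
Z = rng.standard_normal((N, D))
W_Q, W_K, W_V = (rng.standard_normal((h, D, D_k)) / np.sqrt(D) for _ in range(3))
W_O = rng.standard_normal((h * D_k, D)) / np.sqrt(h * D_k)

out = multi_head_attention(Z, W_Q, W_K, W_V, W_O)
print(out.shape)                                     # (16, 64)
```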

#### Position-Wise Feed-Forward Network

After the self-attention mechanism, the output is passed through a position-wise feed-forward network (FFN), which is applied independently to each position in the sequence. ViT uses a GELU non-linearity here, in place of the ReLU of the original Transformer:

$$
\text{FFN}(\mathbf{x}) = \text{GELU}(\mathbf{x} \mathbf{W}_1 + \mathbf{b}_1) \mathbf{W}_2 + \mathbf{b}_2
$$

Where \( \mathbf{W}_1 \), \( \mathbf{W}_2 \), \( \mathbf{b}_1 \), and \( \mathbf{b}_2 \) are learnable parameters.
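The feed-forward block is small enough to sketch directly. In the NumPy version below the non-linearity is passed in as a parameter, so the same function works with ReLU or with GELU (here in its common tanh approximation). The sizes, including the hidden width `D_ff`, and the random weights are assumptions for illustration.

```python
import numpy as np


def relu(x):
    return np.maximum(x, 0.0)


def gelu(x):
    # tanh approximation of GELU
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))


def ffn(x, W1, b1, W2, b2, activation):
    """Two linear layers with a non-linearity in between, applied along the last axis."""
    return activation(x @ W1 + b1) @ W2 + b2


rng = np.random.default_rng(0)
N, D, D_ff = 16, 64, 256                     # illustrative sizes only (assumed)
X = rng.standard_normal((N, D))
W1, b1 = rng.standard_normal((D, D_ff)) / np.sqrt(D), np.zeros(D_ff)
W2, b2 = rng.standard_normal((D_ff, D)) / np.sqrt(D_ff), np.zeros(D)

out = ffn(X, W1, b1, W2, b2, gelu)

# "Position-wise": running one token on its own matches its row in the full result.
row = ffn(X[3], W1, b1, W2, b2, gelu)
print(out.shape, np.allclose(out[3], row))   # (16, 64) True
```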

### 4. Classification Token (CLS Token)

To generate a global representation of the image, ViT prepends a special learnable classification token (`[CLS]`) to the sequence of patch embeddings. This token is processed through the Transformer encoder like any other patch embedding. After the encoder, the output corresponding to the `[CLS]` token is used as the global representation of the image and passed to a linear classification head:

$$
\mathbf{y} = \text{softmax}(\mathbf{W}_{\text{cls}} \mathbf{z}_{\text{CLS}})
$$

Where \( \mathbf{W}_{\text{cls}} \) is a learnable matrix, and \( \mathbf{z}_{\text{CLS}} \) is the output corresponding to the `[CLS]` token.
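The bookkeeping around the `[CLS]` token can be sketched as follows. The encoder itself is replaced by a placeholder (the tokens are passed through unchanged), and the sizes and random parameters are assumptions for illustration only.

```python
import numpy as np


def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


rng = np.random.default_rng(0)
N, D, num_classes = 16, 64, 10                    # illustrative sizes only (assumed)
patch_tokens = rng.standard_normal((N, D))        # embedded patches
cls_token = rng.standard_normal((1, D))           # random stand-in for the learnable [CLS]

# Prepend [CLS] so that it sits at index 0 of the sequence.
tokens = np.concatenate([cls_token, patch_tokens], axis=0)   # (N + 1, D)

encoded = tokens   # placeholder: a real model applies the Transformer encoder here

z_cls = encoded[0]                                # output at the [CLS] position
W_cls = rng.standard_normal((num_classes, D)) / np.sqrt(D)
y = softmax(W_cls @ z_cls)                        # class probabilities

print(tokens.shape, y.shape)                      # (17, 64) (10,)
```

Note that prepending the token makes the sequence length \( N + 1 \), so the positional embedding table needs one extra row.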



In summary, ViT treats an image as a sequence of patches rather than a grid of pixels, much as words are treated in NLP tasks. The image is divided into fixed-size patches (e.g., 16x16 or 32x32 pixels), and each patch is flattened into a vector that plays the role of a "word". The patch size controls the sequence length: larger patches give fewer tokens and cheaper self-attention, while smaller patches give longer sequences and finer spatial detail at a higher computational cost. The phrase "An Image is Worth 16x16 Words" captures this idea, highlighting that each image can be broken down into a sequence of patches for further processing.
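The effect of the patch size on the sequence length can be checked with a few lines of arithmetic. The 224x224 input resolution below is an assumption chosen for illustration, and `num_patches` is a hypothetical helper; \( P = 1 \) corresponds to attending over individual pixels.

```python
def num_patches(height, width, patch_size):
    """Number of non-overlapping P x P patches, N = H * W / P^2."""
    assert height % patch_size == 0 and width % patch_size == 0
    return (height // patch_size) * (width // patch_size)


H = W = 224   # assumed input resolution, not taken from the text
for P in (32, 16, 8, 1):
    N = num_patches(H, W, P)
    # The attention matrix has N x N entries per head and per layer.
    print(f"P={P:>2}  tokens N={N:>6}  attention entries N^2={N * N:>13,}")
```

Halving the patch size quadruples the number of tokens and multiplies the size of each attention matrix by sixteen, which is why pixel-level attention is impractical at this resolution.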

## Implementation of ViT

```python
import torch
import torch.nn as nn
from einops import rearrange

from self_attention_cv import TransformerEncoder


class ViT(nn.Module):
    def __init__(self, *,
                 img_dim,
                 in_channels=3,
                 patch_dim=16,
                 num_classes=10,
                 dim=512,
                 blocks=6,
                 heads=4,
                 dim_linear_block=1024,
                 dim_head=None,
                 dropout=0, transformer=None, classification=True):
        """
        Args:
            img_dim: the spatial image size (square images are assumed)
            in_channels: number of img channels
            patch_dim: desired patch dim
            num_classes: classification task classes
            dim: the linear layer's dim to project the patches for MHSA
            blocks: number of transformer blocks
            heads: number of heads
            dim_linear_block: inner dim of the transformer linear block
            dim_head: dim head in case you want to define it. defaults to dim/heads
            dropout: for pos emb and transformer
            transformer: in case you want to provide another transformer implementation
            classification: creates an extra CLS token
        """
        super().__init__()
        assert img_dim % patch_dim == 0, \
            f'image size {img_dim} is not divisible by patch size {patch_dim}'
        self.p = patch_dim
        self.classification = classification
        # number of patches N = (H / P) * (W / P)
        tokens = (img_dim // patch_dim) ** 2
        # length of one flattened patch: P^2 * C
        self.token_dim = in_channels * (patch_dim ** 2)
        self.dim = dim
        self.dim_head = (dim // heads) if dim_head is None else dim_head
        # learnable patch projection (the matrix E in the text)
        self.project_patches = nn.Linear(self.token_dim, dim)

        self.emb_dropout = nn.Dropout(dropout)
        if self.classification:
            # learnable [CLS] token and one extra positional row for it
            self.cls_token = nn.Parameter(torch.randn(1, 1, dim))
            self.pos_emb1D = nn.Parameter(torch.randn(tokens + 1, dim))
            self.mlp_head = nn.Linear(dim, num_classes)
        else:
            self.pos_emb1D = nn.Parameter(torch.randn(tokens, dim))

        if transformer is None:
            self.transformer = TransformerEncoder(dim, blocks=blocks, heads=heads,
                                                  dim_head=self.dim_head,
                                                  dim_linear_block=dim_linear_block,
                                                  dropout=dropout)
        else:
            self.transformer = transformer

    def expand_cls_to_batch(self, batch):
        """
        Args:
            batch: batch size
        Returns: cls token expanded to the batch size
        """
        return self.cls_token.expand([batch, -1, -1])

    def forward(self, img, mask=None):
        # img: [batch, channels, height, width]
        batch_size = img.shape[0]
        # Split into contiguous P x P patches and flatten each one.
        # In einops, the last name inside a group is the fastest-varying index,
        # so the patch axes go on the inside: (x patch_x) and (y patch_y).
        img_patches = rearrange(
            img, 'b c (x patch_x) (y patch_y) -> b (x y) (patch_x patch_y c)',
            patch_x=self.p, patch_y=self.p)
        # project patches with linear layer
        img_patches = self.project_patches(img_patches)

        if self.classification:
            # prepend the [CLS] token at index 0 of every sequence
            img_patches = torch.cat(
                (self.expand_cls_to_batch(batch_size), img_patches), dim=1)

        # add pos emb (broadcast over the batch), then dropout
        patch_embeddings = self.emb_dropout(img_patches + self.pos_emb1D)

        # feed patch_embeddings to the transformer. output shape: [batch, tokens, dim]
        y = self.transformer(patch_embeddings, mask)

        if self.classification:
            # we index only the cls token for classification. nlp tricks :P
            return self.mlp_head(y[:, 0, :])
        else:
            return y
```