<a href="https://colab.research.google.com/github/yaelbab66/Deep/blob/main/Copy_of_Assignment02.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Assignment 2
#  Transformers for Vision

**Student 1 ID and name,**

**Student 2 ID and name**

### Objective


In this assignment, the goal is to implement a Transformer architecture for image classification on the OxfordIIITPet dataset.

### Instructions



The assignment has two parts. The first is filling in the missing code in the provided code cells. The second deals with testing and comparing different implementations.

### Submission Guidelines:
*   Assignments are done in pairs. Include both IDs in the filename when submitting (e.g. *HW02_123456789_123456789.ipynb*).
*   Submit a Jupyter notebook containing your code modifications, comments, and analysis.
*   Include visualizations, graphs, or plots to support your analysis where needed.
*   Provide a conclusion summarizing your findings, challenges faced, and potential future improvements.


### Important Notes:

*  Ensure clarity in code comments and explanations for better understanding.
*  Experiment, analyze, and document your observations throughout the assignment.
*  Feel free to train on Colab GPU (see example in practice 4 notebook).
*  If answering open-ended questions in Markdown is difficult, you can attach a doc/pdf file to your submission that holds any/all explanations. Just make sure each explanation clearly refers to the code it belongs to.
*  Feel free to seek clarification on any aspect of the assignment via forum or email.

## Transformers for Image Classification

Transformers were originally proposed to process sets, since the architecture is permutation-equivariant, i.e., it produces the same output, permuted, if the input is permuted. To apply Transformers to sequences, we simply added a positional encoding to the input feature vectors, and the model learned by itself what to do with it. So, why not do the same thing on images? This is exactly what [Alexey Dosovitskiy et al.](https://openreview.net/pdf?id=YicbFdNTTy) proposed in their paper “An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale”. Specifically, the Vision Transformer is a model for image classification that views images as sequences of smaller patches. As a preprocessing step, we split an image of, for example, 48x48 pixels into nine 16x16 patches. Each of those patches is considered to be a “word”/“token” and is projected to a feature space. After adding positional encodings and a classification token, we can apply a Transformer as usual to this sequence and start training it for our task. A nice GIF visualization of the architecture is shown below.

<center width="100%"><img src="https://github.com/lucidrains/vit-pytorch/blob/main/images/vit.gif?raw=true" width="800px"></center>
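
The permutation-equivariance property mentioned above can be checked numerically. The following is a minimal sketch, assuming PyTorch's `nn.MultiheadAttention` as the self-attention layer and arbitrary toy sizes (5 tokens of dimension 8). Without positional encodings, shuffling the input tokens should shuffle the outputs in the same way.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy self-attention layer; the sizes are arbitrary and chosen only for illustration.
attn = nn.MultiheadAttention(embed_dim=8, num_heads=2, batch_first=True)
attn.eval()

x = torch.randn(1, 5, 8)   # [B, T, embed_dim], no positional encoding added
perm = torch.randperm(5)   # a random reordering of the 5 tokens

with torch.no_grad():
    out, _ = attn(x, x, x)                        # self-attention on the original order
    x_perm = x[:, perm]
    out_perm, _ = attn(x_perm, x_perm, x_perm)    # self-attention on the permuted order

# Permuting the input permutes the output in the same way (up to float rounding).
is_equivariant = torch.allclose(out[:, perm], out_perm, atol=1e-5)
print("Permutation-equivariant:", is_equivariant)
```

This is why positional encodings are needed: without them, the model cannot tell where in the image a patch came from.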

### Imports & Device setup

In [None]:
# arrange any/all imports here

#for plotting
import matplotlib.pyplot as plt

## PyTorch
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.utils.data as data
import torch.optim as optim
from torch.utils.data import DataLoader, random_split

## Torchvision
import torchvision
from torchvision.datasets import OxfordIIITPet
from torchvision import transforms

device = torch.device("cuda:0") if torch.cuda.is_available() else torch.device("cpu")
print("Device:", device)

DATASET_PATH = "data"

## Section 1: The Data



In this section we will load, explore and prepare the dataset for training.

You are given a dataset of color images, **in different sizes**, of cats and dogs, classified into 37 different breeds (of both cats and dogs). The dataset is called *OxfordIIITPet*; you can read more about it [here](https://www.robots.ox.ac.uk/~vgg/data/pets/). The dataset will be downloaded to the local environment using the `torchvision` library and split into train, validation, and test sets. Your tasks in this section are:

> 1. Describe how you would preprocess the data for a vision transformer and explain your choice of transformations. Implement the preprocessing of your choice in the `train_transform` and `test_transform` code section below. Are they identical? Why or why not?

> 2. Create a data loader for each of the train, validation, and test sets. Explain your choice of `batch_size`.
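
As a reminder of how `DataLoader` behaves, here is a minimal sketch on stand-in data. The random tensors, the `toy_` names, and the batch size of 4 are assumptions made only for this example and are not a recommendation for the assignment. It shows the common pattern of shuffling only the training data and how the batch size determines the number of batches per epoch.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Stand-in data: 10 random "images" with labels in [0, 37). Not the real dataset.
images = torch.randn(10, 3, 32, 32)
labels = torch.randint(0, 37, (10,))
toy_set = TensorDataset(images, labels)

TOY_BATCH_SIZE = 4  # hypothetical value, chosen only for this example

# Shuffle the training data each epoch; keep the evaluation order fixed.
toy_train_loader = DataLoader(toy_set, batch_size=TOY_BATCH_SIZE, shuffle=True)
toy_eval_loader = DataLoader(toy_set, batch_size=TOY_BATCH_SIZE, shuffle=False)

batch_imgs, batch_labels = next(iter(toy_eval_loader))
print(batch_imgs.shape, batch_labels.shape)
print("Batches per epoch:", len(toy_eval_loader))
```

Larger batches mean fewer optimizer steps per epoch and more memory per step, which is part of the trade-off to discuss.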

In [None]:
# Define transformations for the images
# First we need to resize all images
train_transform = transforms.Compose([
    transforms.Resize((224, 224)),  # Resize to ViT input size
    transforms.RandomHorizontalFlip(p=0.5),  # Randomly flip images for augmentation
    transforms.ColorJitter(brightness=0.2, contrast=0.2, saturation=0.2, hue=0.1),  # Random color jitter
    transforms.ToTensor(),  # Convert to tensor
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])  # Normalize
])

test_transform = transforms.Compose([
                                      #TODO
                                      ])

In [None]:
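# Load the train-eval split twice so the train and validation subsets can use different transforms
dataset = OxfordIIITPet(root=DATASET_PATH, split="trainval", transform=train_transform, target_types="category", download=True)
eval_dataset = OxfordIIITPet(root=DATASET_PATH, split="trainval", transform=test_transform, target_types="category", download=True)
# Load the test set
test_set = OxfordIIITPet(root=DATASET_PATH, split="test", transform=test_transform, target_types="category", download=True)

# Split into training and validation randomly
train_size = int(0.9 * len(dataset))  # 90% for training
val_size = len(dataset) - train_size  # 10% for validation
train_set, val_split = random_split(dataset, [train_size, val_size], generator=torch.Generator().manual_seed(42))

# Use the same validation indices, but with the test transform.
# Setting val_split.dataset.transform instead would also change train_set,
# because both subsets share one underlying dataset object.
val_set = data.Subset(eval_dataset, val_split.indices)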

In [None]:
BATCH_SIZE = #TODO

# Data loaders
train_loader =
val_loader =
test_loader =

print(f"Number of training samples: {len(train_set)}")
print(f"Number of validation samples: {len(val_set)}")
print(f"Number of test samples: {len(test_set)}")

Display 4 images to see what the dataset contains.

In [None]:
# Visualize some examples
NUM_IMAGES = 4
example_images = torch.stack([val_set[idx][0] for idx in range(NUM_IMAGES)], dim=0)
img_grid = torchvision.utils.make_grid(example_images, nrow=4, normalize=True, pad_value=0.9)
img_grid = img_grid.permute(1, 2, 0)

plt.figure(figsize=(8,8))
plt.title("Image examples of the OxfordIIITPet dataset")
plt.imshow(img_grid)
plt.axis('off')
plt.show()
plt.close()

## Section 2: Prepare Patches



Vision Transformers (ViTs) begin by splitting input images into smaller patches. This approach enables ViTs to process images as sequences of fixed-size patches rather than whole images. Just as words become tokens in natural language processing (NLP), each image patch becomes a token for processing.

The image patching process involves two steps:

- **Image Partitioning**: Divide the image into equal, non-overlapping patches (for example 8×8 pixels).
- **Flattening Patches**: Convert each patch into a 1D vector to create individual tokens.

The code sections below demonstrate how to create these patches from an input image and patch size. An image of size $N\times N$ is split into $(N/M)^2$ patches of size $M\times M$. These patches serve as the input "words" to the Transformer.

Review the code carefully and experiment with the example sections below to understand how the patches are created. Then answer the following question:

> What patch size would you choose for your pipeline? Your answer should consider the chosen image size. How do you think patch size (smaller or larger) affects the pipeline's performance?
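
To make the trade-off concrete, the sketch below tabulates how the patch size changes the sequence length for a 224x224 RGB image (the size used in `train_transform`). The helper `patch_stats` is hypothetical and not part of the assignment code. It counts one extra token for the classification token and reports the number of entries in a single attention matrix, which grows quadratically with the sequence length.

```python
def patch_stats(image_size, patch_size, num_channels=3):
    """Return (num_patches, seq_len, token_dim, attn_entries) for a square image."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be divisible by patch_size")
    num_patches = (image_size // patch_size) ** 2   # (N / M)^2 patches
    seq_len = num_patches + 1                       # plus one classification token
    token_dim = num_channels * patch_size ** 2      # length of one flattened patch
    attn_entries = seq_len ** 2                     # size of one attention matrix
    return num_patches, seq_len, token_dim, attn_entries

for p in (8, 16, 32):
    n, t, d, a = patch_stats(224, p)
    print(f"patch {p:>2}: {n:>3} patches, seq_len {t:>3}, token_dim {d:>4}, attention entries {a}")
```

Smaller patches give the model a finer view of the image but make every attention layer considerably more expensive.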

In [None]:
def img_to_patch(x, patch_size, flatten_channels=True):
  """
  Inputs:
    x - torch.Tensor representing the image of shape [B, C, H, W]
    patch_size - Number of pixels per dimension of the patches (integer)
    flatten_channels - If True, the patches will be returned in a flattened format
                        as a feature vector instead of an image grid.
  """
  B, C, H, W = x.shape
  x = x.reshape(B, C, H//patch_size, patch_size, W//patch_size, patch_size)
  x = x.permute(0, 2, 4, 1, 3, 5) # [B, H', W', C, p_H, p_W]
  x = x.flatten(1,2)              # [B, H'*W', C, p_H, p_W]
  if flatten_channels:
    x = x.flatten(2,4)          # [B, H'*W', C*p_H*p_W]
  return x

In [None]:
#@title Visualizing Patches of example_images
#@markdown Change the patch size to view the resulting image patches.
#@markdown Make sure the image can be fully split into the desired size of patches.

# Define image dimensions and patch size
image_dim = example_images.shape[2]
patch_size = 4  #@param {type: "number"}

# Convert example_images to patches
img_patches = img_to_patch(example_images, patch_size=patch_size, flatten_channels=False)

# Calculate the number of patches per row dynamically
num_patches_per_row = image_dim // patch_size

# Adjust the visualization dynamically
fig, ax = plt.subplots(img_patches.shape[0], 1, figsize=(14, 3 * img_patches.shape[0]))
fig.suptitle("Images as input sequences of patches", fontsize=16)
for i in range(img_patches.shape[0]):
    # img_patches[i] has shape [H'*W', C, patch_size, patch_size]
    img_grid = torchvision.utils.make_grid(
        img_patches[i].reshape(-1, *img_patches.shape[2:]),  # Reshape to [num_patches, C, patch_size, patch_size]
        nrow=num_patches_per_row,  # Calculate patches per row dynamically
        normalize=True,
        pad_value=0.9
    )
    img_grid = img_grid.permute(1, 2, 0)  # Convert to HWC format for visualization
    ax[i].imshow(img_grid)
    ax[i].axis("off")
plt.show()
plt.close()


In [None]:
#@title Visualizing Patches of example_images in a sequence
#@markdown Change the patch size to view the resulting image patches.
#@markdown Make sure the image can be fully split into the desired size of patches.

# Define image dimensions and patch size
image_dim = example_images.shape[2]
patch_size = 4  #@param {type: "number"}

# Convert images to patches with the specified patch size
img_patches = img_to_patch(example_images, patch_size=patch_size, flatten_channels=False)

# Visualize the patches in a single row
fig, ax = plt.subplots(example_images.shape[0], 1, figsize=(14, 3 * example_images.shape[0]))

for i in range(example_images.shape[0]):
    # Calculate the number of patches in the row for the current image
    num_patches = img_patches[i].shape[0]

    # Row visualization (all patches in a single row)
    img_row = torchvision.utils.make_grid(
        img_patches[i].reshape(-1, *img_patches.shape[2:]),  # Reshape to [num_patches, C, patch_size, patch_size]
        nrow=num_patches,  # Show all patches in a single row
        normalize=True,
        pad_value=0.9
    )

    # Convert to HWC format for visualization
    img_row = img_row.permute(1, 2, 0)

    # Plot the image row
    ax[i].imshow(img_row)
    ax[i].axis("off")
    ax[i].set_title(f"Image {i + 1}: Patches in Row", fontsize=12)

# Adjust layout and show the plot
plt.tight_layout(rect=[0, 0, 1, 0.95])
plt.show()
plt.close()


## Section 3: Vision Transformer


Now we can start building the Transformer model. The model consists of (in this order):

* A **linear projection** layer

  that maps the input patches to a feature vector of larger size. It is implemented by a simple linear layer that takes each $ M\times M $ patch independently as input.

* A **classification token**

  that is added to the input sequence. The CLS (classification) token is a special learnable embedding added at the start of the sequence of patches.
  It acts as a summary representation for the entire image. After processing by the Transformer layers, the CLS token is expected to contain the most relevant information for the classification task.

* Learnable **positional encodings**

  that are added to the tokens before being processed by the Transformer. Those are needed to learn position-dependent information, and convert the set to a sequence. Since we usually work with a fixed resolution, we can learn the positional encodings instead of having the pattern of sine and cosine functions.

* A **Transformer Encoder/Block**

  discussed in detail in the next subsection. The block is repeated multiple times.

* An **MLP head**

  that takes the output feature vector of the CLS token, and maps it to a classification prediction. This is usually implemented by a small feed-forward network or even a single linear layer.


The figure below contains all the explained pieces.

<center width="100%"><img src="https://d2l.ai/_images/vit.svg" width="650px"></center>
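
The shape bookkeeping for the classification token and the positional encodings can be sketched with plain tensors. The sizes below (`B`, `T`, `D`) are toy values chosen for illustration, and the tensors stand in for what would be learnable `nn.Parameter`s in a real model.

```python
import torch

B, T, D = 2, 9, 16  # toy batch size, number of patches, embedding size

patch_tokens = torch.randn(B, T, D)        # stand-in for the projected patches
cls_token = torch.zeros(1, 1, D)           # one vector shared by all images
pos_embedding = torch.randn(1, T + 1, D)   # one vector per position, including CLS

# Prepend the CLS token to every sequence in the batch.
tokens = torch.cat([cls_token.expand(B, -1, -1), patch_tokens], dim=1)  # [B, T+1, D]

# Broadcasting adds the same positional encoding to every image in the batch.
tokens = tokens + pos_embedding

print(tokens.shape)
```

The leading dimension of size 1 lets a single CLS vector and a single set of positional encodings be shared across the whole batch.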

### 3.1. The Attention Block

The attention block includes the following components :

1. Layer Normalization: There are two layer normalizations
    - The first one normalizes the input before the attention block.
    - The second one normalizes the input before the feed-forward network.

2. Multi-Head Attention: There is one multihead attention for self-attention, with the following parameters:
    - `embed_dim`: Dimensionality of input vectors.
    - `num_heads`: Number of attention heads.
    - Apply dropout as part of the attention mechanism.

3. Feed-Forward Network (FFN): Design a feed-forward network that includes (in this order):
    - First linear layer projects from `embed_dim` to `hidden_dim`.
    - GELU Activation function.
    - Dropout.
    - Second linear layer projects from `hidden_dim` back to `embed_dim`.
    - Another dropout.

4. Residual Connections: The model has two connections
    - Combine the input with the output of the multihead attention.
    - Combine the result of the previous combination with the output of the FFN.

The inputs to the implemented model are:
- `embed_dim`: Dimensionality of input and attention feature vectors.
- `hidden_dim`: Dimensionality of the hidden layer in the feed-forward network (usually 2-4x larger than `embed_dim`).
- `num_heads`: Number of heads for the Multi-Head Attention block.
- `dropout`: Amount of dropout to apply, both in the dropout layers of the feed-forward network and in the attention.

The Forward Pass:
1. Normalize the input and pass it through the attention mechanism.
2. Add a residual connection from the input to the attention output.
3. Normalize the attention output and pass it through the feed-forward network.
4. Add a residual connection from the attention output to the feed-forward output.

Check out the illustration below to further enhance your understanding of the attention block. Your task in this subsection is:

> Fill in the `__init__` and `forward` methods in the `AttentionBlock` class.



<center width="100%"><img src="https://discuss.d2l.ai/uploads/default/optimized/2X/e/e635a8fb7898d1c260a5a0d5e1fde010801d6ee8_2_690x418.png" width="450px"></center>
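
For orientation, here is a minimal sketch of the first two forward-pass steps (normalize, self-attend, add the residual) using standalone layers rather than the `AttentionBlock` class. The toy sizes are assumptions. Note that `nn.MultiheadAttention` returns a tuple of (output, attention weights) and, by default, expects input of shape `[T, B, embed_dim]`.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
embed_dim, num_heads = 16, 4   # toy sizes, chosen only for illustration

norm = nn.LayerNorm(embed_dim)
attn = nn.MultiheadAttention(embed_dim, num_heads, dropout=0.0)  # default layout: [T, B, embed_dim]

x = torch.randn(10, 2, embed_dim)   # 10 tokens, batch of 2

normed = norm(x)                                        # step 1: normalize the input
attn_out, attn_weights = attn(normed, normed, normed)   # self-attention: query = key = value
x = x + attn_out                                        # step 2: residual connection

print(x.shape, attn_weights.shape)
```

The feed-forward half of the block follows the same normalize, transform, add pattern.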

In [None]:
class AttentionBlock(nn.Module):

  def __init__(self, embed_dim, hidden_dim, num_heads, dropout=0.0):
    """
    Inputs:
      embed_dim - Dimensionality of input and attention feature vectors
      hidden_dim - Dimensionality of hidden layer in feed-forward network
                    (usually 2-4x larger than embed_dim)
      num_heads - Number of heads to use in the Multi-Head Attention block
      dropout - Amount of dropout to apply in the feed-forward network
    """
    super().__init__()

  def forward(self, x):

    return

### 3.2. Vision Transformer

Now we will implement the full Vision Transformer, using the `AttentionBlock` created above and adding the other pieces: a linear projection layer, a classification token, positional encodings and an MLP head.

The class below is partially implemented to include all the pieces.

Implement the VisionTransformer class step-by-step by following these guidelines:

1. **Initialization (`__init__`)**:
  - **Patch Embedding**: A linear layer is used to map the flattened patches to the embedding dim.
  - **Transformer Layers**: A stack of `AttentionBlock` layers, the number of layers is defined by `num_layers` parameter.
  - **Classification Head**: A feed-forward network using `nn.LayerNorm` and `nn.Linear`, which maps the final `CLS` token to the output logits.
  - **Positional Embedding**: Define a `nn.Parameter` for positional encoding.
  - **CLS Token**: Define a `nn.Parameter` for the `CLS` token.
  - **Dropout**: for regularization with the given `dropout` rate.

2. **Forward Pass (`forward`)**:
    - **Step 1**: Convert the input images into patches using the provided `img_to_patch` function.
        - The resulting tensor has shape `[B, T, C*patch_size*patch_size]` (batch, patches, flattened patch).
    - **Step 2**: Apply the patch embedding layer.
        - Ensure the resulting tensor is of shape `[B, T, embed_dim]` (batch, patches, embedding).
    - **Step 3**: Add the `CLS` token to the beginning of the sequence for each image in the batch.
    - **Step 4**: Add positional encodings to the sequence.
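    - **Step 5**: Apply a dropout layer (not shown in the figure above).
    - **Step 6**: Pass the input through the Transformer layers (`AttentionBlock` stack).
        - Note: Ensure the tensor layout matches what the attention block expects. The provided `forward` transposes the input to `[T, B, embed_dim]`, the default layout of `nn.MultiheadAttention` (`batch_first=False`).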
    - **Step 7**: Use the `CLS` token output to compute class logits via the classification head.

#### Inputs:
- `x`: Input tensor of shape `[B, C, H, W]` where:
    - `B`: Batch size
    - `C`: Number of input channels (e.g., 3 for RGB).
    - `H` and `W`: Height and width of the images.

#### Outputs:
- `out`: Tensor of shape `[B, num_classes]`, representing the class logits for each image in the batch.

Your task in this section is:
> Fill in the missing layer and parameter sizes in the `__init__`.

In [None]:
class VisionTransformer(nn.Module):

  def __init__(self, embed_dim, hidden_dim, num_channels, num_heads, num_layers, num_classes, patch_size, num_patches, dropout=0.0):
    """
    Inputs:
      embed_dim - Dimensionality of the input feature vectors to the Transformer
      hidden_dim - Dimensionality of the hidden layer in the feed-forward networks
                    within the Transformer
      num_channels - Number of channels of the input
      num_heads - Number of heads to use in the Multi-Head Attention block
      num_layers - Number of AttentionBlock layers to stack in the Transformer
      num_classes - Number of classes to predict
      patch_size - Number of pixels that the patches have per dimension
      num_patches - Maximum number of patches an image can have
      dropout - Amount of dropout to apply in the feed-forward network and
                on the input encoding
    """
    super().__init__()

    self.patch_size = patch_size

    self.input_layer = nn.Linear( , )

    self.cls_token = nn.Parameter(torch.randn(, , ))
    self.pos_embedding = nn.Parameter(torch.randn(, , ))

    self.transformer = nn.Sequential(*[AttentionBlock(, , , ) for _ in range(num_layers)])

    self.mlp_head = nn.Sequential(
      nn.LayerNorm(),
      nn.Linear(, )
    )
    self.dropout = nn.Dropout()

  def forward(self, x):
    # Preprocess input
    x = img_to_patch(x, self.patch_size)
    B, T, _ = x.shape
    x = self.input_layer(x)

    # Add CLS token and positional encoding
    cls_token = self.cls_token.repeat(B, 1, 1)
    x = torch.cat([cls_token, x], dim=1)
    x = x + self.pos_embedding[:,:T+1]

    # Apply Transformer
    x = self.dropout(x)
    x = x.transpose(0, 1)
    x = self.transformer(x)

    # Perform classification prediction
    cls = x[0]
    out = self.mlp_head(cls)
    return out

## Section 4: Train and Evaluate the ViT



This section covers model training and testing. A `train_model` function is partially implemented to train, evaluate, save the best model, and then test it on the test set. Familiarize yourself with the function. Your tasks for this section are:

> 1. Fill in the missing code lines in the `train_model` function (in the model training part).
>
> 2. Fill in the missing model arguments, loss function, optimizer and number of epochs. Then train the model.
>    
>    2.1. Train the model with at least 3 different patch sizes.
>
> 3. Compare the training results across the different patch sizes and reflect on the changes in performance.

*Note:* Try to set parameters that allow quick training, about a minute or two per epoch. Consider testing a set of parameters on a short training run (5-7 epochs) before running the final training.
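
As a refresher on what one optimization step looks like in PyTorch, here is a minimal sketch on a toy linear classifier. The model, the random data, `nn.CrossEntropyLoss`, and `optim.AdamW` with a learning rate of 1e-2 are all assumptions made for this example and are not the required choices for the assignment.

```python
import torch
import torch.nn as nn
import torch.optim as optim

torch.manual_seed(0)

toy_model = nn.Linear(20, 37)                                  # stand-in classifier with 37 outputs
toy_criterion = nn.CrossEntropyLoss()                          # assumed loss for multi-class classification
toy_optimizer = optim.AdamW(toy_model.parameters(), lr=1e-2)   # assumed optimizer and learning rate

inputs = torch.randn(8, 20)
targets = torch.randint(0, 37, (8,))

losses = []
for _ in range(20):
    toy_optimizer.zero_grad()                  # clear gradients from the previous step
    preds = toy_model(inputs)                  # forward pass, logits of shape [8, 37]
    loss = toy_criterion(preds, targets)       # compare logits with integer class labels
    loss.backward()                            # backpropagate
    toy_optimizer.step()                       # update the weights
    losses.append(loss.item())

print(f"first loss: {losses[0]:.4f}, last loss: {losses[-1]:.4f}")
```

Repeating the step on the same batch should drive the loss down, which is also a quick sanity check for a new model before a long training run.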

In [None]:
# Seed everything for reproducibility
def seed_everything(seed=42):
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False

# Training and evaluation loop
def train_model(model, train_loader, val_loader, test_loader, num_epochs=180):
    seed_everything(42)

    # Check device
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    print(f"Using device: {device}")

    # Train the model from scratch
    best_val_acc = 0.0

    for epoch in range(num_epochs):
        # Training phase
        model.train()
        train_loss, train_acc = 0.0, 0.0
        for imgs, labels in train_loader:
            imgs, labels = imgs.to(device), labels.to(device)

            #TODO add missing training steps here.

            train_loss += loss.item()
            train_acc += (preds.argmax(dim=-1) == labels).float().mean().item()

        train_loss /= len(train_loader)
        train_acc /= len(train_loader)

        # Validation phase
        model.eval()
        val_loss, val_acc = 0.0, 0.0
        with torch.no_grad():
            for imgs, labels in val_loader:
                imgs, labels = imgs.to(device), labels.to(device)

                preds = model(imgs)
                loss = criterion(preds, labels)

                val_loss += loss.item()
                val_acc += (preds.argmax(dim=-1) == labels).float().mean().item()

        val_loss /= len(val_loader)
        val_acc /= len(val_loader)

        # Save the best model
        best_model_path = "best_model.pth"
        if val_acc > best_val_acc:
            best_val_acc = val_acc
            torch.save(model.state_dict(), best_model_path)

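        # Read the current learning rate from the optimizer
        # (`optimizer`, like `criterion`, is defined in the notebook's global scope)
        current_lr = optimizer.param_groups[0]["lr"]

        print(f"Epoch {epoch + 1}/{num_epochs}")
        print(f"  Train Loss: {train_loss:.4f}, Train Acc: {train_acc:.4f}")
        print(f"  Val Loss: {val_loss:.4f}, Val Acc: {val_acc:.4f}")
        print(f"  Current LR: {current_lr:.6f}")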

    # Test the model
    print("Testing the best model...")
    model.load_state_dict(torch.load(best_model_path))
    model.eval()
    test_acc = 0.0
    with torch.no_grad():
        for imgs, labels in test_loader:
            imgs, labels = imgs.to(device), labels.to(device)
            preds = model(imgs)
            test_acc += (preds.argmax(dim=-1) == labels).float().mean().item()
    test_acc /= len(test_loader)

    print(f"Test Accuracy: {test_acc:.4f}")
    return model, {"test": test_acc, "val": best_val_acc}

In [None]:
#TODO fill the model arguments, and setup the optimizer and loss function.

# Define your model parameters
model_kwargs = {
    'embed_dim': ,
    'hidden_dim': ,
    'num_heads': ,
    'num_layers': ,
    'patch_size': ,
    'num_channels': ,
    'num_patches': ,
    'num_classes': ,
    'dropout':
}

# Instantiate the model
model = VisionTransformer(**model_kwargs).to(device)

# Optimizer
optimizer =

# Loss function
criterion =

In [None]:
# TODO: set number of epochs and train the model.

num_epochs =

# Train the model
model, results = train_model(model, train_loader, val_loader, test_loader, num_epochs=num_epochs)
print(f"Validation Accuracy: {results['val']:.4f}")
print(f"Test Accuracy: {results['test']:.4f}")

## Section 5: CNN


In this section you will propose a CNN architecture to solve the same problem as above: classifying the cat and dog breeds in the given images. Detailed tasks:


1.   Suggest a CNN model to solve the same problem the transformer above was trained on. Explain and implement your choice.
2.   Do you think different data preparation is needed for this architecture (compared with the transformations defined in Section 1)? Explain your choice.
3.   Train and evaluate the model - you can reuse the `train_model` method defined above. Explain your choice of loss function and optimizer.

In [None]:
# TODO: CNN

## Section 6: Compare & Discuss


1. Compare the two architectures based on the following aspects:

  *  Performance: Which model performs better on the test set? Are there significant differences in accuracy or other metrics?
  *  Training Dynamics: Compare the training and validation curves. Which model converged faster? Did either model overfit?
  *  Computational Efficiency: Compare the training and inference time of the models. Which model was more computationally demanding?

2. Discussion

  Based on your observations, discuss the trade-offs between Vision Transformers and CNNs:

  *  Suitability for vision tasks: Which architecture seems better suited for the dataset and why?
  *  Scalability: How might these results change with a larger dataset or higher-resolution images?
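
For the computational-efficiency comparison, two small helpers may be useful. This is a sketch under stated assumptions: `count_parameters` and `time_inference` are hypothetical names, not part of the assignment code, and the timing is a rough average rather than a rigorous benchmark.

```python
import time
import torch
import torch.nn as nn

def count_parameters(model):
    """Number of trainable parameters in a model."""
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

def time_inference(model, example_input, repeats=10):
    """Average seconds per forward pass on the given input."""
    model.eval()
    with torch.no_grad():
        model(example_input)                 # warm-up pass, not timed
        if example_input.is_cuda:
            torch.cuda.synchronize()         # GPU kernels run asynchronously
        start = time.perf_counter()
        for _ in range(repeats):
            model(example_input)
        if example_input.is_cuda:
            torch.cuda.synchronize()
        elapsed = time.perf_counter() - start
    return elapsed / repeats

toy = nn.Linear(4, 3)  # stand-in model: 4*3 weights + 3 biases
print("Trainable parameters:", count_parameters(toy))
print("Seconds per forward pass:", time_inference(toy, torch.randn(2, 4)))
```

Applying the same helpers to both the ViT and the CNN, with the same input batch, gives a like-for-like comparison.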