[![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1FoOW9LVAyi8GCvXZ3QytHt6Jk6ssjRdA)

Author: 
- **Safouane El Ghazouali**, 
- Ph.D. in AI, 
- Senior data scientist and researcher at TOELT LLC,
- Lecturer at HSLU

# -----  -----  -----  -----  -----  -----  -----  -----

# Introduction to Transformers

Welcome to this introductory notebook on Transformers! Transformers are a neural network architecture built around attention, and they have reshaped fields such as natural language processing (NLP) and computer vision.

In this notebook, we'll cover the basics of what transformers are, why they matter, and a high-level overview of their components.

![Vision-Transformer](https://github.com/safouaneelg/HSLU-Transformers-in-Vision/blob/main/vit-course.png?raw=true)

### Why Transformers?
- **Efficiency in Handling Sequences**: Unlike RNNs, which read tokens one at a time, transformers process all positions of a sequence in parallel.
- **Attention Mechanism**: Lets the model focus on the most relevant parts of the input.
- **Scalability**: The architecture scales to large models such as GPT and BERT.

### What You'll Learn
- The history and motivation behind transformers.
- Key components: Attention, Positional Encoding, Encoder-Decoder.
- Applications in NLP and beyond.

# 🧰 Environment Setup

We'll use PyTorch for this course. Install it if needed.

In [None]:
!pip install -q torch

### Import Libraries

Here we import PyTorch, which will be used throughout the notebook.

In [None]:
import torch
print("PyTorch version:", torch.__version__)

# 📜 History of Transformers

Transformers were introduced in the paper "Attention Is All You Need" by Vaswani et al. in 2017. They addressed the limitations of RNNs and LSTMs in handling long sequences.

### Key Innovation: Self-Attention

Self-attention allows the model to weigh different parts of the input sequence dynamically.

# 🏗️ Transformer Architecture Overview

A transformer consists of:
- Encoder stack
- Decoder stack
- Multi-head attention layers
- Feed-forward networks
- Positional encodings

### Positional Encoding Explanation

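Because transformers process all tokens at once rather than recurrently, the model has no built-in notion of token order. Positional encodings are added to the token embeddings to supply that order information.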

In [None]:
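# Simple sinusoidal positional encoding for a single position
def positional_encoding(position, d_model):
    i = torch.arange(d_model)
    # Dimensions 2k and 2k+1 share the frequency 1 / 10000^(2k / d_model)
    pair_index = torch.div(i, 2, rounding_mode="floor")
    angles = position / (10000 ** (2 * pair_index / d_model))
    # Even dimensions use sine, odd dimensions use cosine
    return torch.where(i % 2 == 0, torch.sin(angles), torch.cos(angles))

# Explanation of the variables
# positional_encoding: A function to compute sine/cosine-based positional encodings.
# position: The position in the sequence.
# d_model: The dimension of the model embeddings.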
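
To make the idea concrete, here is a minimal sketch of the sinusoidal scheme from "Attention Is All You Need" for a whole sequence, written in plain Python so each step is visible. The function name `sinusoidal_table` and the sizes in the usage lines are illustrative assumptions, not part of the course code.

```python
import math

def sinusoidal_table(seq_len, d_model):
    """Return a seq_len x d_model list of sinusoidal positional encodings."""
    table = []
    for pos in range(seq_len):
        row = []
        for i in range(d_model):
            # Dimensions 2k and 2k+1 share the frequency 1 / 10000^(2k / d_model)
            angle = pos / (10000 ** (2 * (i // 2) / d_model))
            # Even dimensions use sine, odd dimensions use cosine
            row.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
        table.append(row)
    return table

pe = sinusoidal_table(seq_len=4, d_model=6)
print(pe[0])  # position 0: every sine is 0.0 and every cosine is 1.0
```

Each row is added element-wise to the token embedding at that position, so the same token at two different positions produces two different inputs to the model.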

# The Attention Mechanism in Transformers

This section dives deeper into the core of transformers: the attention mechanism. We'll explain self-attention and implement it in PyTorch.

### What You'll Learn
- How attention works.
- Scaled dot-product attention.
- Multi-head attention.

# 🧰 Environment Setup

We reuse PyTorch from the previous section and add `torch.nn.functional`, which provides the softmax function.

In [None]:
import torch
import torch.nn.functional as F

# 🔍 Understanding Attention

Attention computes a weighted sum of values based on similarity between queries and keys.

### Queries, Keys, Values

- Queries: What we're looking for.
- Keys: What we compare against.
- Values: What we retrieve.

In [None]:
# Example tensors: a sequence of 2 tokens, each with embedding dim 3
queries = torch.randn(2, 3)
keys = torch.randn(2, 3)
values = torch.randn(2, 3)

print("queries", queries)
print("keys", keys)
print("values", values)

# Explanation of variables
# queries: Represents the query vectors in attention.
# keys: Key vectors used to compute attention scores.
# values: Value vectors that are weighted and summed.
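
The "weighted sum based on similarity" idea can be sketched with a single query and hand-picked numbers. This is a minimal plain-Python illustration; the names `attend`, `toy_query`, `toy_keys`, and `toy_values` are hypothetical and chosen so they do not overwrite the tensors above.

```python
import math

def softmax(xs):
    # Subtract the max for numerical stability before exponentiating
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, keys, values):
    """Weighted sum of values, weighted by query-key dot-product similarity."""
    scores = [sum(q * k for q, k in zip(query, key)) for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    output = [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]
    return weights, output

# Toy data: the query points in the same direction as the first key
toy_query = [4.0, 0.0]
toy_keys = [[4.0, 0.0], [0.0, 4.0]]
toy_values = [[1.0, 0.0], [0.0, 1.0]]

toy_weights, toy_output = attend(toy_query, toy_keys, toy_values)
print(toy_weights)  # the first weight is close to 1
print(toy_output)   # so the output is close to the first value
```

Because the query matches the first key far better than the second, almost all of the weight goes to the first value. With less clear-cut similarities, the output becomes a blend of several values.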

# 📐 Scaled Dot-Product Attention

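The formula: $\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$

Here $d_k$ is the dimension of the key vectors. Dividing by $\sqrt{d_k}$ keeps the dot products from growing with the dimension, which would otherwise push the softmax into regions with very small gradients.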

In [None]:
def scaled_dot_product_attention(q, k, v):
    d_k = q.size(-1)
    # Transpose only the last two dims so batched (3D or higher) inputs also work
    scores = torch.matmul(q, k.transpose(-2, -1)) / d_k ** 0.5
    attn = F.softmax(scores, dim=-1)
    return torch.matmul(attn, v)

# Compute attention
output = scaled_dot_product_attention(queries, keys, values)
print(output)

# Explanation
# scores: Dot product of queries and keys, scaled.
# attn: Softmax probabilities.
# output: Weighted sum of values.
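
Multi-head attention, listed among this section's goals, runs several attention operations in parallel on different slices of the embedding and concatenates the results. The sketch below shows only that split-attend-concatenate structure in plain Python. It is a simplification: real multi-head attention also applies learned linear projections to queries, keys, values, and the output, which are omitted here. All function and variable names are hypothetical.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention(q, k, v):
    """Scaled dot-product attention on lists of row vectors."""
    d_k = len(q[0])
    out = []
    for q_row in q:
        scores = [sum(a * b for a, b in zip(q_row, k_row)) / math.sqrt(d_k)
                  for k_row in k]
        weights = softmax(scores)
        out.append([sum(w * v_row[j] for w, v_row in zip(weights, v))
                    for j in range(len(v[0]))])
    return out

def multi_head_attention(x, num_heads):
    """Self-attention per head on slices of the embedding, then concatenation.

    Simplification: the learned Q/K/V and output projections are omitted.
    """
    d_model = len(x[0])
    assert d_model % num_heads == 0, "d_model must be divisible by num_heads"
    d_head = d_model // num_heads
    heads = []
    for h in range(num_heads):
        # Each head only sees its own slice of the embedding dimensions
        part = [row[h * d_head:(h + 1) * d_head] for row in x]
        heads.append(attention(part, part, part))
    # Concatenate the head outputs back to d_model values per token
    return [sum((head[t] for head in heads), []) for t in range(len(x))]

# Hypothetical input: 3 tokens with d_model = 4, split across 2 heads
tokens = [[1.0, 0.0, 0.5, 0.5], [0.0, 1.0, 0.5, -0.5], [1.0, 1.0, 0.0, 0.0]]
mha_out = multi_head_attention(tokens, num_heads=2)
print(len(mha_out), len(mha_out[0]))  # 3 4
```

In practice, PyTorch provides `torch.nn.MultiheadAttention`, which includes the learned projections left out of this sketch.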

# Hands-on: Pre-trained Transformer Models for Image Classification

In this section, we'll use PyTorch and the timm library to load pre-trained Vision Transformer (ViT) models and perform inference on images.

### What You'll Learn
- Installing and using timm.
- Loading pre-trained ViT models.
- Running inference on sample images.

# 🧰 Environment Setup

Install timm and other dependencies.

In [None]:
!pip install -q timm torch torchvision

### Import Libraries

Import timm for models, torchvision for preprocessing, PIL for image loading, and requests for downloading files.

In [None]:
import timm
import torch
from torchvision import transforms
from PIL import Image
import requests

# 📦 Loading a Pre-trained Model

We'll load a Vision Transformer model pre-trained on ImageNet.

In [None]:
model = timm.create_model('vit_base_patch16_224', pretrained=True)
model.eval()

# Explanation
# model: A pre-trained ViT model.
# eval(): Sets the model to evaluation mode.
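
The model name `vit_base_patch16_224` encodes two numbers: a patch size of 16 and an input resolution of 224. As a rough sketch of what that implies, the arithmetic below counts how many patch tokens one image is split into. The helper name `count_patches` is hypothetical, and the extra class token reflects the standard ViT design, stated here as an assumption about this checkpoint.

```python
def count_patches(image_size, patch_size):
    """Number of non-overlapping square patches in a square image."""
    assert image_size % patch_size == 0, "image size must be divisible by patch size"
    per_side = image_size // patch_size
    return per_side * per_side

num_patches = count_patches(image_size=224, patch_size=16)
print(num_patches)      # 196 patches (14 x 14)
print(num_patches + 1)  # 197 tokens if one class token is prepended
```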

# 🖼️ Image Preprocessing

Prepare an image for input to the model.

In [None]:
# Download a sample image
url = 'http://farm8.staticflickr.com/7012/6597749473_03b2f736ac_z.jpg'
img = Image.open(requests.get(url, stream=True).raw).convert('RGB')  # ensure 3 channels for Normalize

transform = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])
input_tensor = transform(img).unsqueeze(0)
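print(input_tensor.shape)  # expected: torch.Size([1, 3, 224, 224]) for an RGB image

# Explanation
# transform: Preprocessing pipeline that resizes the image, converts it to a tensor, and normalizes it with the ImageNet mean and std.
# input_tensor: The image as a tensor with a leading batch dimension, ready for model input.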
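
`transforms.Normalize` applies `(x - mean) / std` to each channel independently. Here is a minimal plain-Python sketch of that arithmetic for a single pixel; the helper name `normalize_pixel` and the sample pixels are illustrative assumptions.

```python
IMAGENET_MEAN = [0.485, 0.456, 0.406]
IMAGENET_STD = [0.229, 0.224, 0.225]

def normalize_pixel(rgb, mean=IMAGENET_MEAN, std=IMAGENET_STD):
    """Apply (x - mean) / std per channel to one RGB pixel in [0, 1]."""
    return [(x - m) / s for x, m, s in zip(rgb, mean, std)]

# A pixel equal to the channel means maps to zero in every channel
print(normalize_pixel([0.485, 0.456, 0.406]))  # [0.0, 0.0, 0.0]
# A pure white pixel (1.0 after ToTensor) ends up above zero in every channel
print(normalize_pixel([1.0, 1.0, 1.0]))
```

Using the same statistics the model was trained with keeps the input distribution close to what the pre-trained weights expect.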

In [None]:
img

# 🚀 Running Inference

Pass the image through the model.

In [None]:
with torch.no_grad():
    output = model(input_tensor)
probabilities = torch.nn.functional.softmax(output[0], dim=0)
print(probabilities)

# Explanation
# output: Logits from the model.
# probabilities: Softmax probabilities for each class.
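
The softmax step turns raw logits into probabilities that sum to 1, and a top-k selection then picks the most likely classes. Here is a minimal plain-Python sketch on made-up logits; `toy_logits`, `toy_labels`, and the helper names are hypothetical and stand in for the 1000-class model output.

```python
import math

def softmax(logits):
    # Subtracting the max avoids overflow without changing the result
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def top_k(probs, k):
    """Return (index, probability) pairs for the k largest probabilities."""
    ranked = sorted(enumerate(probs), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]

# Made-up logits for four placeholder classes
toy_logits = [2.0, 0.5, -1.0, 3.0]
toy_labels = ["cat", "dog", "car", "tree"]

toy_probs = softmax(toy_logits)
for idx, p in top_k(toy_probs, k=2):
    print(f"{toy_labels[idx]}: {p:.4f}")
```

The largest logit always receives the largest probability, so ranking by probability gives the same order as ranking by logit. Softmax matters when you want to read the scores as confidences.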

In [None]:
url = "https://raw.githubusercontent.com/pytorch/hub/master/imagenet_classes.txt"
imagenet_classes = requests.get(url).text.strip().split('\n')

In [None]:
top10_prob, top10_catid = torch.topk(probabilities, 10)

print("\n--- Top 10 Predictions ---")
for i in range(top10_prob.size(0)):
    print(f"{i+1}. Class: {imagenet_classes[top10_catid[i]]}, Probability: {top10_prob[i].item():.4f}")


# # # # # # # # # # # # # # # # # # # # # # # #
# 💡 Student Task

In this task, you can use image URLs from the Microsoft COCO dataset explorer.

Task:

1. Download a different image, preprocess it, and run inference.

2. What class does the model predict?

3. Load other models from the `timm` library, such as ResNet or VGG.

4. Run inference on the image and print the probabilities.

5. Print the top 10 classes.

6. Compare the performance of the models.

Tip: use `timm.list_models(filter="*{MODEL_NAME}*")` to list the matching models available in `timm`.

In [None]:
model2 = timm.create_model('resnet50', pretrained=True)
model2.eval()

with torch.no_grad():
    output = model2(input_tensor)
    # The output is a tensor of shape [1, 1000]. We apply softmax to get probabilities.
probabilities = torch.nn.functional.softmax(output, dim=-1)[0] # [0] to get rid of the batch dimension
print(probabilities.shape)
top5_prob, top5_catid = torch.topk(probabilities, 5)

print("\n--- Top 5 Predictions ---")
for i in range(top5_prob.size(0)):
    class_name = imagenet_classes[top5_catid[i]]
    probability = top5_prob[i].item()
    print(f"{i+1}. Class: {class_name}, Probability: {probability:.4f}")
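
For the comparison step, one simple option is to check whether two models agree on the top class and which classes appear in both top-k lists. The sketch below assumes each model's predictions have been collected as a list of `(class_name, probability)` pairs. The helper name `compare_predictions` is hypothetical, and the sample values are placeholders, not real model outputs.

```python
def compare_predictions(preds_a, preds_b):
    """Compare two top-k lists of (class_name, probability) pairs."""
    names_a = [name for name, _ in preds_a]
    names_b = [name for name, _ in preds_b]
    return {
        "same_top1": names_a[0] == names_b[0],
        "shared_classes": sorted(set(names_a) & set(names_b)),
    }

# Placeholder values for illustration only, not real model outputs
model_a_top3 = [("class_a", 0.60), ("class_b", 0.20), ("class_c", 0.10)]
model_b_top3 = [("class_a", 0.50), ("class_c", 0.30), ("class_d", 0.05)]

comparison = compare_predictions(model_a_top3, model_b_top3)
print(comparison)
```

Agreement on one image says little about overall accuracy, so repeating the comparison over several images gives a fairer picture.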
