## 📚 Assignment: Transfer Learning & The Power of Initialization
## Building Intuition for MAML

**Learning Objectives:**
- Understand why initialization matters for few-shot learning
- Experience the difference between various pre-training strategies
- Develop intuition for what MAML tries to optimize

**Advice on using LLMs**

---


Avoid them if you can, though unfortunately we cannot stop you from using them. Don't ask an LLM everything: the more you think on your own, the better. Whenever you take code from it, understand how that part fits into the current code and whether it made some optimization on its own, then note it down or comment it in the code.

In [None]:
!pip install -q torch torchvision matplotlib numpy

# Understand what each of these imports does and which functions it provides.
# Whenever you want to implement something, think about which of these you would use and refer to its docs for the syntax.

import torch
import torch.nn as nn
import torch.optim as optim
import torchvision
import torchvision.transforms as transforms
from torch.utils.data import Dataset, DataLoader, Subset
import matplotlib.pyplot as plt
import numpy as np
from collections import defaultdict
import random

print("✅ Setup complete!")
print(f"PyTorch version: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f"Using device: {device}")

✅ Setup complete!
PyTorch version: 2.9.0+cpu
CUDA available: False
Using device: cpu


## 📊 Part A: Dataset Preparation

We'll use **MNIST** for simplicity (or you can use Omniglot if you prefer).

**Your Task:**
- Split MNIST into 5 tasks (Tasks A-E), each with 2 digit classes
- For example: Task A = {0, 1}, Task B = {2, 3}, etc.
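
As a rough illustration of the split described above, the sketch below picks per-class indices from a label tensor instead of looping over every image. It is only one possible approach: `split_indices_by_class` is a hypothetical helper, and the synthetic `targets` tensor stands in for the real labels (assumed here to be available as `train_dataset.targets`).

```python
import torch

def split_indices_by_class(targets, task_classes, n_train=15, n_support=5,
                           n_query=10, seed=0):
    """Return (train, support, query) index lists for the given classes.

    Hypothetical helper: it only looks at labels, so no images are loaded.
    """
    gen = torch.Generator().manual_seed(seed)  # seeded for reproducible splits
    train_idx, support_idx, query_idx = [], [], []
    for cls in task_classes:
        # All positions whose label equals this class, in shuffled order
        idx = torch.nonzero(targets == cls, as_tuple=True)[0]
        idx = idx[torch.randperm(len(idx), generator=gen)]
        # Consecutive, non-overlapping slices per class
        train_idx += idx[:n_train].tolist()
        support_idx += idx[n_train:n_train + n_support].tolist()
        query_idx += idx[n_train + n_support:n_train + n_support + n_query].tolist()
    return train_idx, support_idx, query_idx

# Synthetic labels (100 examples of each digit) standing in for the real ones
targets = torch.arange(10).repeat(100)
train_idx, support_idx, query_idx = split_indices_by_class(targets, [0, 1])
print(len(train_idx), len(support_idx), len(query_idx))  # 30 10 20
```

With the real dataset, the returned index lists could be wrapped with `torch.utils.data.Subset(train_dataset, train_idx)`. Seeding the shuffle keeps the split identical across runs, which makes the methods easier to compare.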

In [None]:
import torchvision
import torchvision.transforms as transforms

# Download MNIST
transform = transforms.Compose([
    # see what different transformations you can apply; one is converting the image into a tensor
    transforms.ToTensor(),
])

train_dataset = torchvision.datasets.MNIST(
    root="./data",
    train=True,
    transform=transform,
    download=True
)

test_dataset = torchvision.datasets.MNIST(
    root="./data",
    train=False,
    transform=transform,
    download=True
)

# we get a special parameter while loading which is 'background'
# refer to document for what it means and how to use it
# NOTE: 'background' is used in Omniglot, not MNIST, so we do not use it here

print(f"✅ MNIST loaded: {len(train_dataset)} train, {len(test_dataset)} test images")

# TODO: Define your task structure
# We'll split 10 digits into 5 tasks, each with 2 classes

task_definitions = {
    'A': [0, 1],
    'B': [2, 3],
    'C': [4, 5],
    'D': [6, 7],
    'E': [8, 9],
}


100%|██████████| 9.91M/9.91M [00:00<00:00, 137MB/s]
100%|██████████| 28.9k/28.9k [00:00<00:00, 37.8MB/s]
100%|██████████| 1.65M/1.65M [00:00<00:00, 29.7MB/s]
100%|██████████| 4.54k/4.54k [00:00<00:00, 6.93MB/s]


✅ MNIST loaded: 60000 train, 10000 test images


In [None]:
# TODO: Define your task structure
# We'll split 10 digits into 5 tasks, each with 2 classes

task_definitions = {
    'A': [0, 1],
    'B': [2, 3],
    'C': [4, 5],
    'D': [6, 7],
    'E': [8, 9],
}

# Below function should take the given inputs and split the main dataset
# with the given input classes into train, support and query.
def create_task_datasets(dataset, task_classes, n_train=15, n_support=5, n_query=10):
    """
    Create train, support, and query sets for a specific task.

    Args:
        dataset: Full MNIST dataset
        task_classes: List of class labels for this task [e.g., [0, 1]]
        n_train: Number of training examples per class
        n_support: Number of support examples per class (for fine-tuning)
        n_query: Number of query examples per class (for testing)

    Returns:
        train_data, support_data, query_data
        (each is list of (image, label) tuples)
    """

    # TODO: Implement this function
    # HINT: Filter dataset to only include examples from task_classes
    # HINT: Split into train/support/query sets

    import random
    from collections import defaultdict

    # Collect samples for each class
    class_to_samples = defaultdict(list)
    for img, label in dataset:
        if label in task_classes:
            class_to_samples[label].append((img, label))

    train_data = []
    support_data = []
    query_data = []

    # Split samples per class
    for cls in task_classes:
        samples = class_to_samples[cls]
        random.shuffle(samples)

        train_data.extend(samples[:n_train])
        support_data.extend(samples[n_train:n_train + n_support])
        query_data.extend(samples[n_train + n_support:
                                  n_train + n_support + n_query])

    return train_data, support_data, query_data


In [None]:
# Test the function

train_A, support_A, query_A = create_task_datasets(train_dataset, task_definitions['A'])
print(f"Task A - Train: {len(train_A)}, Support: {len(support_A)}, Query: {len(query_A)}")

Task A - Train: 30, Support: 10, Query: 20


Part A (continued): **Build Your Model**

**TODO:** Design a simple CNN for digit classification

In [None]:
import torch
import torch.nn as nn
import torch.nn.functional as F

# think on the architecture of the model as discussed in class
# general flow -> convolution -> relu -> maxpooling -> ...
# in the end some fully connected layers then final classification
# Refer to the 'Neural Networks' section of the PyTorch 60 Minute Blitz tutorial

# Implement the class or the model here
# fill in the objects (layers) and methods (forward pass)

class SimpleCNN(nn.Module):
    def __init__(self, num_classes=2):
        super(SimpleCNN, self).__init__()

        # -------- Convolutional layers --------
        self.conv1 = nn.Conv2d(
            in_channels=1,      # MNIST is grayscale
            out_channels=32,
            kernel_size=3,
            padding=1
        )

        self.conv2 = nn.Conv2d(
            in_channels=32,
            out_channels=64,
            kernel_size=3,
            padding=1
        )

        # -------- Pooling layer --------
        self.pool = nn.MaxPool2d(kernel_size=2, stride=2)

        # -------- Fully connected layers --------
        # After two poolings: 28x28 -> 14x14 -> 7x7
        self.fc1 = nn.Linear(64 * 7 * 7, 128)
        self.fc2 = nn.Linear(128, num_classes)

    def forward(self, x):
        # Conv block 1
        x = self.conv1(x)
        x = F.relu(x)
        x = self.pool(x)

        # Conv block 2
        x = self.conv2(x)
        x = F.relu(x)
        x = self.pool(x)

        # Flatten
        x = x.view(x.size(0), -1)

        # Fully connected layers
        x = self.fc1(x)
        x = F.relu(x)
        x = self.fc2(x)

        return x


Now that the model is ready, we decide how we want to train it.

First, do normal classification on the large train dataset of Task A (digits 0 and 1).

Then we will do fine-tuning:

1.   Start from random initialisation and fine-tune using the support dataset, say for Task A (digits 0 and 1). Save this model.
2.   Take the above model weights and fine-tune them on the support dataset of some other task, say Task B (digits 2 and 3).
3.   First train the model on the combined train dataset of all 10 digits (from all Tasks A, B, C, D, E) and save it. Then fine-tune it on the support dataset of any one task, say A, to make a binary classifier: class 0 -> digit 0, class 1 -> digit 1.

While moving from one model to another, think about which layers you need to keep and which you need to remove.
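
One way to think about "which layers to keep" is sketched below. This is a minimal illustration, not the expected solution: `TinyCNN` is a hypothetical, smaller stand-in that only shares the layer names (`conv1`, `conv2`, `fc1`, `fc2`) with the CNN above, `make_binary_from` is a hypothetical helper, and freezing the convolutional layers is an optional assumption rather than something the assignment requires.

```python
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyCNN(nn.Module):
    """Hypothetical stand-in with the same layer names as the assignment CNN."""
    def __init__(self, num_classes=10):
        super().__init__()
        self.conv1 = nn.Conv2d(1, 8, kernel_size=3, padding=1)
        self.conv2 = nn.Conv2d(8, 16, kernel_size=3, padding=1)
        self.pool = nn.MaxPool2d(2, 2)
        self.fc1 = nn.Linear(16 * 7 * 7, 32)
        self.fc2 = nn.Linear(32, num_classes)

    def forward(self, x):
        x = self.pool(F.relu(self.conv1(x)))
        x = self.pool(F.relu(self.conv2(x)))
        x = torch.flatten(x, 1)
        return self.fc2(F.relu(self.fc1(x)))

def make_binary_from(pretrained, freeze_features=False):
    """Copy a pretrained model and swap its output layer for a 2-class head."""
    model = copy.deepcopy(pretrained)  # leave the pretrained weights untouched
    if freeze_features:
        # Optional: keep the convolutional feature extractor fixed
        for name, param in model.named_parameters():
            if name.startswith("conv"):
                param.requires_grad = False
    # The old head is tied to the old label set, so it is replaced
    model.fc2 = nn.Linear(model.fc2.in_features, 2)
    return model

pretrained = TinyCNN(num_classes=10)
binary_model = make_binary_from(pretrained, freeze_features=True)
trainable = [n for n, p in binary_model.named_parameters() if p.requires_grad]
out = binary_model(torch.zeros(4, 1, 28, 28))
print(trainable)
print(out.shape)
```

If layers are frozen, it is usually cleaner to pass only the trainable parameters to the optimizer, for example `filter(lambda p: p.requires_grad, binary_model.parameters())`.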



In [None]:
# ===============================
# Imports
# ===============================
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import DataLoader, TensorDataset

# ===============================
# Model Definition
# ===============================
class SimpleCNN(nn.Module):
    def __init__(self, num_classes=2):
        super(SimpleCNN, self).__init__()

        # Convolutional layers
        self.conv1 = nn.Conv2d(1, 32, kernel_size=3, padding=1)
        self.conv2 = nn.Conv2d(32, 64, kernel_size=3, padding=1)

        self.pool = nn.MaxPool2d(2, 2)

        # Fully connected layers
        self.fc1 = nn.Linear(64 * 7 * 7, 128)
        self.fc2 = nn.Linear(128, num_classes)

    def forward(self, x):
        x = self.pool(F.relu(self.conv1(x)))
        x = self.pool(F.relu(self.conv2(x)))

        x = x.view(x.size(0), -1)  # flatten
        x = F.relu(self.fc1(x))
        x = self.fc2(x)
        return x

# ===============================
# Prepare Task A Data
# ===============================
train_A, support_A, query_A = create_task_datasets(
    train_dataset,
    task_definitions['A']
)

def to_tensor_dataset(data):
    images = torch.stack([x[0] for x in data])
    labels = torch.tensor([x[1] for x in data])
    return TensorDataset(images, labels)

train_A_dataset = to_tensor_dataset(train_A)
query_A_dataset = to_tensor_dataset(query_A)

train_loader_A = DataLoader(train_A_dataset, batch_size=32, shuffle=True)
query_loader_A = DataLoader(query_A_dataset, batch_size=32, shuffle=False)

# ===============================
# Training Setup
# ===============================
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

model_A = SimpleCNN(num_classes=2).to(device)

criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model_A.parameters(), lr=0.001)

# ===============================
# Training Loop
# ===============================
num_epochs = 10
train_losses = []

for epoch in range(num_epochs):
    model_A.train()
    running_loss = 0.0

    for images, labels in train_loader_A:
        images = images.to(device)
        labels = labels.to(device)

        optimizer.zero_grad()
        outputs = model_A(images)
        loss = criterion(outputs, labels)
        loss.backward()
        optimizer.step()

        running_loss += loss.item()

    avg_loss = running_loss / len(train_loader_A)
    train_losses.append(avg_loss)

    print(f"Epoch [{epoch+1}/{num_epochs}] - Loss: {avg_loss:.4f}")

# ===============================
# Evaluation on Query Set
# ===============================
model_A.eval()
correct = 0
total = 0

with torch.no_grad():
    for images, labels in query_loader_A:
        images = images.to(device)
        labels = labels.to(device)

        outputs = model_A(images)
        _, predicted = torch.max(outputs, 1)

        total += labels.size(0)
        correct += (predicted == labels).sum().item()

accuracy = 100 * correct / total
print(f"\nMethod 1 - Task A Query Accuracy: {accuracy:.2f}%")


Epoch [1/10] - Loss: 0.7042
Epoch [2/10] - Loss: 0.6383
Epoch [3/10] - Loss: 0.5546
Epoch [4/10] - Loss: 0.4613
Epoch [5/10] - Loss: 0.3569
Epoch [6/10] - Loss: 0.2540
Epoch [7/10] - Loss: 0.1660
Epoch [8/10] - Loss: 0.1004
Epoch [9/10] - Loss: 0.0564
Epoch [10/10] - Loss: 0.0299

Method 1 - Task A Query Accuracy: 100.00%


In [None]:
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset
import copy

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
criterion = nn.CrossEntropyLoss()

# -------------------------------------------------
# Helper functions
# -------------------------------------------------
def to_tensor_dataset(data, task_classes):
    """
    Remap labels:
    task_classes[0] -> 0
    task_classes[1] -> 1
    """
    images = torch.stack([x[0] for x in data])
    labels = torch.tensor([task_classes.index(x[1]) for x in data])
    return TensorDataset(images, labels)

def evaluate(model, dataloader):
    model.eval()
    correct, total = 0, 0
    with torch.no_grad():
        for images, labels in dataloader:
            images, labels = images.to(device), labels.to(device)
            outputs = model(images)
            _, preds = torch.max(outputs, 1)
            total += labels.size(0)
            correct += (preds == labels).sum().item()
    return 100 * correct / total

# -------------------------------------------------
# Prepare Task A and Task B data
# -------------------------------------------------
train_A, support_A, query_A = create_task_datasets(
    train_dataset, task_definitions['A']
)
train_B, support_B, query_B = create_task_datasets(
    train_dataset, task_definitions['B']
)

support_A_loader = DataLoader(
    to_tensor_dataset(support_A, task_definitions['A']),
    batch_size=16, shuffle=True
)
query_A_loader = DataLoader(
    to_tensor_dataset(query_A, task_definitions['A']),
    batch_size=32, shuffle=False
)

support_B_loader = DataLoader(
    to_tensor_dataset(support_B, task_definitions['B']),
    batch_size=16, shuffle=True
)
query_B_loader = DataLoader(
    to_tensor_dataset(query_B, task_definitions['B']),
    batch_size=32, shuffle=False
)

# =========================================================
# Method 2A: Random Initialization → Fine-tune on Support A
# =========================================================
model_2A = SimpleCNN(num_classes=2).to(device)
optimizer = torch.optim.Adam(model_2A.parameters(), lr=0.001)

for epoch in range(5):
    model_2A.train()
    for images, labels in support_A_loader:
        images, labels = images.to(device), labels.to(device)
        optimizer.zero_grad()
        loss = criterion(model_2A(images), labels)
        loss.backward()
        optimizer.step()

acc_2A = evaluate(model_2A, query_A_loader)
print(f"Method 2A (Random Init → Support A) Accuracy: {acc_2A:.2f}%")

# =========================================================
# Method 2B: Task A → Fine-tune on Support B
# =========================================================
model_2B = copy.deepcopy(model_A)

# Replace classifier head (binary)
model_2B.fc2 = nn.Linear(128, 2).to(device)

optimizer = torch.optim.Adam(model_2B.parameters(), lr=0.001)

for epoch in range(5):
    model_2B.train()
    for images, labels in support_B_loader:
        images, labels = images.to(device), labels.to(device)
        optimizer.zero_grad()
        loss = criterion(model_2B(images), labels)
        loss.backward()
        optimizer.step()

acc_2B = evaluate(model_2B, query_B_loader)
print(f"Method 2B (Task A → Support B) Accuracy: {acc_2B:.2f}%")

# =========================================================
# Method 2C: Train on ALL 10 digits → Fine-tune on Support A
# =========================================================

# -------- Step 1: Train 10-class model --------
full_loader = DataLoader(train_dataset, batch_size=64, shuffle=True)

model_10 = SimpleCNN(num_classes=10).to(device)
optimizer = torch.optim.Adam(model_10.parameters(), lr=0.001)

for epoch in range(5):
    model_10.train()
    for images, labels in full_loader:
        images, labels = images.to(device), labels.to(device)
        optimizer.zero_grad()
        loss = criterion(model_10(images), labels)
        loss.backward()
        optimizer.step()

# -------- Step 2: Replace head → Fine-tune on Task A --------
model_2C = copy.deepcopy(model_10)
model_2C.fc2 = nn.Linear(128, 2).to(device)

optimizer = torch.optim.Adam(model_2C.parameters(), lr=0.001)

for epoch in range(5):
    model_2C.train()
    for images, labels in support_A_loader:
        images, labels = images.to(device), labels.to(device)
        optimizer.zero_grad()
        loss = criterion(model_2C(images), labels)
        loss.backward()
        optimizer.step()

acc_2C = evaluate(model_2C, query_A_loader)
print(f"Method 2C (10-digit Pretrain → Support A) Accuracy: {acc_2C:.2f}%")


Method 2A (Random Init → Support A) Accuracy: 100.00%
Method 2B (Task A → Support B) Accuracy: 50.00%
Method 2C (10-digit Pretrain → Support A) Accuracy: 100.00%


At the end, compare the performance of all these models and methods using the query set.

Also plot the learning curve vs. epoch for all the methods.

Make a table and fill in the values of the different evaluation metrics you learned in previous lectures.
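
The learning curves asked for above need a per-epoch loss history for every method. The sketch below shows one way to record and plot it. It is an assumption-laden example: `train_and_record` is a hypothetical helper, the data is random noise standing in for the support sets, the two small linear models and their "placeholder" names stand in for the actual methods, and everything runs on the CPU.

```python
import torch
import torch.nn as nn
import matplotlib.pyplot as plt
from torch.utils.data import DataLoader, TensorDataset

def train_and_record(model, loader, epochs=5, lr=1e-3):
    """Train `model` and return the average training loss of each epoch."""
    criterion = nn.CrossEntropyLoss()
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    history = []
    for _ in range(epochs):
        model.train()
        total_loss, n_seen = 0.0, 0
        for x, y in loader:
            optimizer.zero_grad()
            loss = criterion(model(x), y)
            loss.backward()
            optimizer.step()
            # Weight by batch size so the epoch average is per example
            total_loss += loss.item() * x.size(0)
            n_seen += x.size(0)
        history.append(total_loss / n_seen)
    return history

# Random stand-in data with MNIST-like shape (not real digits)
torch.manual_seed(0)
x = torch.randn(20, 1, 28, 28)
y = torch.randint(0, 2, (20,))
loader = DataLoader(TensorDataset(x, y), batch_size=10, shuffle=True)

# One loss history per method; the names here are placeholders
curves = {}
for name in ["Method A (placeholder)", "Method B (placeholder)"]:
    model = nn.Sequential(nn.Flatten(), nn.Linear(28 * 28, 2))
    curves[name] = train_and_record(model, loader, epochs=5)

fig, ax = plt.subplots()
for name, losses in curves.items():
    ax.plot(range(1, len(losses) + 1), losses, marker="o", label=name)
ax.set_xlabel("Epoch")
ax.set_ylabel("Average training loss")
ax.set_title("Learning curves")
ax.legend()
```

In a notebook the figure renders automatically; in a script, `plt.show()` or `fig.savefig(...)` would be needed. Recording query-set accuracy per epoch in the same loop would give a second, often more informative, curve.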

In [None]:
import torch
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# -------------------------------------
# Helper: get predictions and labels
# -------------------------------------
def get_predictions(model, dataloader):
    model.eval()
    y_true, y_pred = [], []

    with torch.no_grad():
        for images, labels in dataloader:
            images = images.to(device)
            outputs = model(images)
            _, preds = torch.max(outputs, 1)

            y_true.extend(labels.cpu().numpy())
            y_pred.extend(preds.cpu().numpy())

    return y_true, y_pred

# -------------------------------------
# Collect predictions for all methods
# -------------------------------------

# Method 1 (Task A)
y_true_m1, y_pred_m1 = get_predictions(model_A, query_A_loader)

# Method 2A (Random Init → Support A)
y_true_2A, y_pred_2A = get_predictions(model_2A, query_A_loader)

# Method 2B (Task A → Support B)
y_true_2B, y_pred_2B = get_predictions(model_2B, query_B_loader)

# Method 2C (10-digit Pretrain → Support A)
y_true_2C, y_pred_2C = get_predictions(model_2C, query_A_loader)

# -------------------------------------
# Build evaluation metrics table
# -------------------------------------
results = {
    "Method": [
        "Method 1: Train from Scratch (A)",
        "Method 2A: Random Init → Support A",
        "Method 2B: Task A → Support B",
        "Method 2C: 10-digit Pretrain → Support A",
    ],
    "Accuracy": [
        accuracy_score(y_true_m1, y_pred_m1),
        accuracy_score(y_true_2A, y_pred_2A),
        accuracy_score(y_true_2B, y_pred_2B),
        accuracy_score(y_true_2C, y_pred_2C),
    ],
    "Precision": [
        precision_score(y_true_m1, y_pred_m1, average="macro"),
        precision_score(y_true_2A, y_pred_2A, average="macro"),
        precision_score(y_true_2B, y_pred_2B, average="macro"),
        precision_score(y_true_2C, y_pred_2C, average="macro"),
    ],
    "Recall": [
        recall_score(y_true_m1, y_pred_m1, average="macro"),
        recall_score(y_true_2A, y_pred_2A, average="macro"),
        recall_score(y_true_2B, y_pred_2B, average="macro"),
        recall_score(y_true_2C, y_pred_2C, average="macro"),
    ],
    "F1-score": [
        f1_score(y_true_m1, y_pred_m1, average="macro"),
        f1_score(y_true_2A, y_pred_2A, average="macro"),
        f1_score(y_true_2B, y_pred_2B, average="macro"),
        f1_score(y_true_2C, y_pred_2C, average="macro"),
    ],
}

df_results = pd.DataFrame(results)
print("\nEvaluation Metrics on Query Sets:\n")
print(df_results)



Evaluation Metrics on Query Sets:

                                     Method  Accuracy  Precision  Recall  F1-score
0          Method 1: Train from Scratch (A)       1.0       1.00     1.0  1.000000
1        Method 2A: Random Init → Support A       1.0       1.00     1.0  1.000000
2             Method 2B: Task A → Support B       0.5       0.25     0.5  0.333333
3  Method 2C: 10-digit Pretrain → Support A       1.0       1.00     1.0  1.000000




Some Theoretical Questions:

1.   Which strategy in Method 2 works best, and why do you think so?
2.   In Part 3 of Method 2, the model was already trained on Task B as well when we made the 10-class classifier. So when we fine-tune it again using the support set, what exactly is happening?
3.   What if we used the 10-digit classifier to build a binary classifier for a two-letter classification task? Will it work, or rather, how would you make it work?
4.   Where exactly have we used meta-learning, and in which approach? Have we even used it?

---


Digit classification and letter classification are two dissimilar tasks. Can we find a starting point, or an initialisation, such that fine-tuning with a few data points gives an optimal result on both tasks? This is what we will try to do in MAML.


---


Think about them sincerely; we would love to read your answers!

---
1. Among the strategies explored in Method 2, the approach that works best is training a model on all ten digits first and then fine-tuning it using the support set for a specific task (Method 2C). This strategy performs well because pretraining on all digits allows the model to learn general and reusable features such as edges, curves, and stroke patterns that are common across digits. When the model is later fine-tuned on a small support set, only the task-specific decision boundaries need to be adjusted, which is much easier than learning features from scratch. In contrast, random initialization relies heavily on overfitting to a small dataset, and task-to-task transfer can fail if the tasks are dissimilar, leading to poor performance.

2. During the 10-class training phase, the model learns to extract general digit-level features without associating them with a specific binary task. Task B contributes to this learning by helping the model understand digit structures, but it does not define the final classification objective. When fine-tuning is performed using the support set, the original classification head is replaced with a binary classifier, and the model is optimized to separate only the two relevant classes. At this stage, the learned features are reused, but the decision boundary is reshaped to match the new binary task. Fine-tuning therefore adapts the model’s interpretation of features rather than relearning them from scratch.

3. A model trained only on digits is unlikely to perform well when directly applied to letter classification because digits and letters have different visual structures and stroke patterns. However, this approach can be made to work if the pretraining task is expanded to include a broader and more diverse set of visual concepts. For example, training on both digits and letters, or on datasets like Omniglot that contain multiple alphabets, would allow the model to learn more universal visual features. Another effective approach would be self-supervised or contrastive pretraining, which encourages the model to learn general representations independent of specific labels. Once such general features are learned, fine-tuning on a small letter dataset can produce good performance.

4. Yes, it is possible to learn an initialization that works well for both digit and letter classification, and this is precisely the motivation behind Model-Agnostic Meta-Learning (MAML). MAML aims to learn a set of initial parameters that can be quickly adapted to a wide range of tasks using only a few training examples. Instead of optimizing the model for a single task, MAML optimizes the model so that a small number of gradient updates leads to good performance on any task drawn from a task distribution. In this sense, MAML focuses on learning how to learn, making it especially suitable for scenarios where tasks are diverse and data is limited.
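
To make the idea in the last answer concrete, here is a minimal sketch of one MAML meta-update with a single inner gradient step. It is an illustration under assumptions, not a reference implementation: `maml_step` and `random_task` are hypothetical helpers, the small fully connected model replaces the CNN for brevity, and the tasks are random noise rather than digit or letter data. It relies on `torch.func.functional_call`, available in PyTorch 2.x.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.func import functional_call

def maml_step(model, tasks, meta_opt, inner_lr=0.1):
    """One meta-update. Each task is (x_support, y_support, x_query, y_query)."""
    params = dict(model.named_parameters())  # the shared initialization
    meta_loss = 0.0
    for xs, ys, xq, yq in tasks:
        # Inner loop: one gradient step on the support set.
        support_loss = F.cross_entropy(functional_call(model, params, (xs,)), ys)
        grads = torch.autograd.grad(
            support_loss, list(params.values()), create_graph=True
        )  # create_graph keeps the step differentiable (second-order MAML)
        adapted = {
            name: p - inner_lr * g
            for (name, p), g in zip(params.items(), grads)
        }
        # Outer objective: how well the adapted weights do on the query set.
        query_logits = functional_call(model, adapted, (xq,))
        meta_loss = meta_loss + F.cross_entropy(query_logits, yq)
    meta_loss = meta_loss / len(tasks)
    meta_opt.zero_grad()
    meta_loss.backward()  # gradient flows back to the initialization
    meta_opt.step()
    return meta_loss.item()

def random_task(n_support=10, n_query=20):
    """Random stand-in for a 2-class task with MNIST-shaped inputs."""
    return (
        torch.randn(n_support, 1, 28, 28), torch.randint(0, 2, (n_support,)),
        torch.randn(n_query, 1, 28, 28), torch.randint(0, 2, (n_query,)),
    )

torch.manual_seed(0)
model = nn.Sequential(
    nn.Flatten(), nn.Linear(28 * 28, 16), nn.ReLU(), nn.Linear(16, 2)
)
meta_opt = torch.optim.Adam(model.parameters(), lr=1e-3)

before = [p.detach().clone() for p in model.parameters()]
loss = maml_step(model, [random_task() for _ in range(4)], meta_opt)
print(f"meta-loss: {loss:.4f}")
```

The key difference from the fine-tuning methods above is what gets optimized: the loss is measured after adaptation on the query set, and its gradient updates the starting weights rather than the adapted ones.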

# ALL THE BEST !