## ðŸ“š Assignment: Transfer Learning & The Power of Initialization
## Building Intuition for MAML

**Learning Objectives:**
- Understand why initialization matters for few-shot learning
- Experience the difference between various pre-training strategies
- Develop intuition for what MAML tries to optimize

**Advice on using LLM's**

---


Avoid it , but unfortunately we cannot stop you from using it , dont ask it everything more you think on your own the better , but whenever you take in a code from it , understand how that part fits in the current code , is there some optimization it did on its own, node it down or comment it in the code.

In [1]:
!pip install -q torch torchvision matplotlib numpy

#Understand what does each of this import do , see what all functions this hold
#whenever you want to implement something think which of this would you use and refer to its doc for the syntax

import torch
import torch.nn as nn
import torch.optim as optim
import torchvision
import torchvision.transforms as transforms
from torch.utils.data import Dataset, DataLoader, Subset
import matplotlib.pyplot as plt
import numpy as np
from collections import defaultdict
import random

print("âœ… Setup complete!")
print(f"PyTorch version: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f"Using device: {device}")

âœ… Setup complete!
PyTorch version: 2.9.0+cpu
CUDA available: False
Using device: cpu


## ðŸ“Š Part A: Dataset Preparation

We'll use **MNIST** for simplicity (or you can use Omniglot if you prefer).

**Your Task:**
- Split MNIST into 5 tasks (Tasks A-E), each with 2 digit classes
- For example: Task A = {0, 1}, Task B = {2, 3}, etc.

In [None]:
import torch
import torchvision
import torchvision.transforms as transforms
from torch.utils.data import Subset
import numpy as np

# Step 1: Define Transformations
transform = transforms.Compose([
    transforms.ToTensor(), # Converts PIL Image to (C x H x W) tensor in range [0, 1]
    transforms.Normalize((0.1307,), (0.3081,)) # MNIST mean and std deviation
])

# Step 2: Load MNIST
# 'download=True' fetches the data, 'train=True/False' selects the split
train_dataset = torchvision.datasets.MNIST(
    root='./data', train=True, download=True, transform=transform
)
test_dataset = torchvision.datasets.MNIST(
    root='./data', train=False, download=True, transform=transform
)

print(f"âœ… MNIST loaded: {len(train_dataset)} train, {len(test_dataset)} test images")

In [7]:
# TODO: Define your task structure
# We'll split 10 digits into 5 tasks, each with 2 classes

# Split 10 digits into 5 tasks (A-E)
import torch
import torchvision
import torchvision.transforms as transforms
from torch.utils.data import Subset
import numpy as np

# Step 1: Define Transformations
transform = transforms.Compose([
    transforms.ToTensor(), # Converts PIL Image to (C x H x W) tensor in range [0, 1]
    transforms.Normalize((0.1307,), (0.3081,)) # MNIST mean and std deviation
])

# Step 2: Load MNIST
# 'download=True' fetches the data, 'train=True/False' selects the split
train_dataset = torchvision.datasets.MNIST(
    root='./data', train=True, download=True, transform=transform
)
test_dataset = torchvision.datasets.MNIST(
    root='./data', train=False, download=True, transform=transform
)

print(f"âœ… MNIST loaded: {len(train_dataset)} train, {len(test_dataset)} test images")
task_definitions = {
    'A': [0, 1],
    'B': [2, 3],
    'C': [4, 5],
    'D': [6, 7],
    'E': [8, 9]
}

#Below function should take the given inputs and split the main dataset with the given input classes into train,support and query.
def create_task_datasets(dataset, task_classes, n_train=15, n_support=5, n_query=10):
    """
    Filters the dataset for task_classes and splits them into Train, Support, and Query sets.
    """
    train_data, support_data, query_data = [], [], []

    # Get all labels from the dataset
    targets = np.array(dataset.targets)

    for cls in task_classes:
        # Find indices where the label matches the current class in the task
        indices = np.where(targets == cls)[0]

        # Shuffle indices to ensure randomness
        np.random.shuffle(indices)

        # Select required number of samples
        # Total needed = n_train + n_support + n_query
        selected_idx = indices[:n_train + n_support + n_query]

        # Split selected indices into sub-groups
        cls_train_idx = selected_idx[:n_train]
        cls_support_idx = selected_idx[n_train : n_train + n_support]
        cls_query_idx = selected_idx[n_train + n_support :]

        # Extract the (image, label) tuples for each group
        # Note: We keep the original labels, but in some tasks you might remap them to [0, 1]
        train_data.extend([dataset[i] for i in cls_train_idx])
        support_data.extend([dataset[i] for i in cls_support_idx])
        query_data.extend([dataset[i] for i in cls_query_idx])

    return train_data, support_data, query_data
train_A, support_A, query_A = create_task_datasets(train_dataset, task_definitions['A'])

# Test the function

# TODO: Implement this function
# HINT: Filter dataset to only include examples from task_classes
# HINT: Split into train/support/query sets

100%|â–ˆâ–ˆâ–ˆâ–ˆâ–ˆâ–ˆâ–ˆâ–ˆâ–ˆâ–ˆ| 9.91M/9.91M [00:00<00:00, 18.1MB/s]
100%|â–ˆâ–ˆâ–ˆâ–ˆâ–ˆâ–ˆâ–ˆâ–ˆâ–ˆâ–ˆ| 28.9k/28.9k [00:00<00:00, 505kB/s]
100%|â–ˆâ–ˆâ–ˆâ–ˆâ–ˆâ–ˆâ–ˆâ–ˆâ–ˆâ–ˆ| 1.65M/1.65M [00:00<00:00, 4.59MB/s]
100%|â–ˆâ–ˆâ–ˆâ–ˆâ–ˆâ–ˆâ–ˆâ–ˆâ–ˆâ–ˆ| 4.54k/4.54k [00:00<00:00, 10.4MB/s]


âœ… MNIST loaded: 60000 train, 10000 test images


In [3]:
# Test the function

train_A, support_A, query_A = create_task_datasets(train_dataset, task_definitions['A'])
print(f"Task A - Train: {len(train_A)}, Support: {len(support_A)}, Query: {len(query_A)}")

NameError: name 'train_dataset' is not defined

Part A (continued): **Build Your Model**

**TODO:** Design a simple CNN for digit classification

In [8]:
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimpleCNN(nn.Module):
    def __init__(self, num_classes=10):
        super(SimpleCNN, self).__init__()
        # Architecture: Convolution -> ReLU -> MaxPool
        self.conv1 = nn.Conv2d(1, 32, kernel_size=3, padding=1)
        self.conv2 = nn.Conv2d(32, 64, kernel_size=3, padding=1)
        self.pool = nn.MaxPool2d(kernel_size=2, stride=2)

        # Fully Connected layers
        # After two 2x2 pooling layers, 28x28 image becomes 7x7
        self.fc1 = nn.Linear(64 * 7 * 7, 128)
        self.fc2 = nn.Linear(128, num_classes) # Final classification layer

    def forward(self, x):
        x = self.pool(F.relu(self.conv1(x)))
        x = self.pool(F.relu(self.conv2(x)))
        x = x.view(-1, 64 * 7 * 7) # Flatten
        x = F.relu(self.fc1(x))
        x = self.fc2(x)
        return x

# Helper function to train
def train_model(model, train_loader, epochs=5, lr=0.001):
    model.train()
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    criterion = nn.CrossEntropyLoss()
    history = []

    for epoch in range(epochs):
        total_loss = 0
        for images, labels in train_loader:
            optimizer.zero_grad()
            outputs = model(images)
            loss = criterion(outputs, labels)
            loss.backward()
            optimizer.step()
            total_loss += loss.item()
        history.append(total_loss / len(train_loader))
        print(f"Epoch {epoch+1}, Loss: {history[-1]:.4f}")
    return history

Now since the model is ready we decide how do we want to train it :

First Do normal classification on large dataset of Task A - 0 & 1.

The we will do fine tuning

1.   Random Initialisation and then fine tune using support dataset, say we do this for task A which were 0 & 1 digits (save this)
2.   Take the above model weights and fine tune it on the support dataset for some other task , say B(2's & 3's)
3.   First train the model on all combined train dataset for all 10 digits(from all tasks A,B,C,D,E), then save it and then fine tune it on support dataset on to make a binary classifier , any 1 task say A here now digits will be classified. 0 class->0 digit , 1->1.

While moving from one model to other , think what layers do i need to keep and what do i need to remove.



In [10]:
from torch.utils.data import DataLoader

# 1. Prepare Large Task A dataset (all 0s and 1s from train_dataset)
# Filtering full dataset for digits 0 and 1
indices_A = [i for i, target in enumerate(train_dataset.targets) if target in [0, 1]]
task_A_full_loader = DataLoader(torch.utils.data.Subset(train_dataset, indices_A), batch_size=32, shuffle=True)

# 2. Initialize Model for 2 classes
model_method1 = SimpleCNN(num_classes=2)

# 3. Train from scratch
print("Training Method 1: Full Task A Training...")
history_m1 = train_model(model_method1, task_A_full_loader, epochs=5)

# Save weights for Method 2 comparison
torch.save(model_method1.state_dict(), 'model_method1.pth')

Training Method 1: Full Task A Training...


NameError: name 'train_model' is not defined

In [12]:
import matplotlib.pyplot as plt

from torch.utils.data import DataLoader
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimpleCNN(nn.Module):
    def __init__(self, num_classes=10):
        super(SimpleCNN, self).__init__()
        # Architecture: Convolution -> ReLU -> MaxPool
        self.conv1 = nn.Conv2d(1, 32, kernel_size=3, padding=1)
        self.conv2 = nn.Conv2d(32, 64, kernel_size=3, padding=1)
        self.pool = nn.MaxPool2d(kernel_size=2, stride=2)

        # Fully Connected layers
        # After two 2x2 pooling layers, 28x28 image becomes 7x7
        self.fc1 = nn.Linear(64 * 7 * 7, 128)
        self.fc2 = nn.Linear(128, num_classes) # Final classification layer

    def forward(self, x):
        x = self.pool(F.relu(self.conv1(x)))
        x = self.pool(F.relu(self.conv2(x)))
        x = x.view(-1, 64 * 7 * 7) # Flatten
        x = F.relu(self.fc1(x))
        x = self.fc2(x)
        return x

# Helper function to train
def train_model(model, train_loader, epochs=5, lr=0.001):
    model.train()
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    criterion = nn.CrossEntropyLoss()
    history = []

    for epoch in range(epochs):
        total_loss = 0
        for images, labels in train_loader:
            optimizer.zero_grad()
            outputs = model(images)
            loss = criterion(outputs, labels)
            loss.backward()
            optimizer.step()
            total_loss += loss.item()
        history.append(total_loss / len(train_loader))
        print(f"Epoch {epoch+1}, Loss: {history[-1]:.4f}")
    return history

# 1. Prepare Large Task A dataset (all 0s and 1s from train_dataset)
# Filtering full dataset for digits 0 and 1
indices_A = [i for i, target in enumerate(train_dataset.targets) if target in [0, 1]]
task_A_full_loader = DataLoader(torch.utils.data.Subset(train_dataset, indices_A), batch_size=32, shuffle=True)

# 2. Initialize Model for 2 classes
model_method1 = SimpleCNN(num_classes=2)

# 3. Train from scratch
print("Training Method 1: Full Task A Training...")
history_m1 = train_model(model_method1, task_A_full_loader, epochs=5)

# Save weights for Method 2 comparison
torch.save(model_method1.state_dict(), 'model_method1.pth')
# Convert support lists to DataLoaders
support_A_loader = DataLoader(support_A, batch_size=4, shuffle=True)
support_B_loader = DataLoader(support_B, batch_size=4, shuffle=True)

# --- Sub-Method A: Random Init + Fine-tune on Task A Support ---
model_ft_random = SimpleCNN(num_classes=2)
print("\nFine-tuning Method 2a: Random Init on Support A")
hist_2a = train_model(model_ft_random, support_A_loader, epochs=10)

# --- Sub-Method B: Task A Weights + Fine-tune on Task B Support ---
model_ft_transfer = SimpleCNN(num_classes=2)
model_ft_transfer.load_state_dict(torch.load('model_method1.pth'))
# Since Task B is also 2 classes (2,3), we don't need to replace fc2, just re-train
print("\nFine-tuning Method 2b: Transfer Task A -> Task B Support")
hist_2b = train_model(model_ft_transfer, support_B_loader, epochs=10)

# --- Sub-Method C: All-Digit Pre-train + Fine-tune on Task A ---
# 1. Pre-train on all 10 digits (Simulated here)
model_full = SimpleCNN(num_classes=10)
full_loader = DataLoader(train_dataset, batch_size=64, shuffle=True)
train_model(model_full, full_loader, epochs=2) # Pre-training phase

# 2. Modify for Binary Task A
model_full.fc2 = nn.Linear(128, 2) # Replacing the head
print("\nFine-tuning Method 2c: Pre-trained -> Task A Support")
hist_2c = train_model(model_full, support_A_loader, epochs=10)

# --- Plotting Learning Curves ---
plt.plot(hist_2a, label='Random Init (Task A)')
plt.plot(hist_2b, label='Transfer (A -> B)')
plt.plot(hist_2c, label='Pre-trained (All -> A)')
plt.xlabel('Epochs')
plt.ylabel('Loss')
plt.legend()
plt.title('Learning Curves Comparison')
plt.show()

Training Method 1: Full Task A Training...
Epoch 1, Loss: 0.0123
Epoch 2, Loss: 0.0024
Epoch 3, Loss: 0.0010
Epoch 4, Loss: 0.0014
Epoch 5, Loss: 0.0016


NameError: name 'support_B' is not defined

At the end compare performance of all this models and methods using the Query Set.

Also plot the learning curve vs epoch for all the methods

Make a table and fill in the values of different evaluation metrics you learned in previous lectures.

Some Theoritical Questions :

1.   Which strategy in Method 2 works best and why do you feel so ?
2.   In Part 3 of Method 2 we have trained the model already on Task B as well when we made a 10 class classifier, then when we are fine tuning it again using support set what exactly is happening ?
3.   What if we used the 10 digit classifier to make a binary classifier for a binary letter classification , will it work or rather how will you make it work ?
4.   Where exactly have we used Meta Learning, in which approach? Have we even used it ?

---


Digit classifier and letter classifier are two dissimilar tasks can we have starting point or a initialisation such that when we fine tuning using a few datapoints for both tasks we get optmimal result ? This is what we will try to do in MAML ?


---


Think on them sincerely , would love to read your answers!


1)The Pre-trained (All $\rightarrow$ A) strategy (Sub-method C) typically works best.Breadth of Features: When the model is trained on all 10 digits, it learns a very diverse set of visual features (loops in '8', straight lines in '1', curves in '6').Feature Reuse: When you fine-tune this model for a binary task (like 0 vs 1), the "backbone" (convolutional layers) already knows how to detect the necessary shapes. It only needs to learn how to map those shapes to two specific categories.Data Efficiency: It requires much less data to reach high accuracy compared to the Random Initialization approach, which has to learn what an "edge" or "curve" is from scratch using only a handful of images.

2)Even though the 10-class classifier already "knows" what a '2' and a '3' (Task B) look like, the fine-tuning process is doing two specific things:

Head Replacement: We removed the original 10-way output layer and replaced it with a 2-way layer. This new layer starts with random weights. Fine-tuning "teaches" this new layer how to interpret the features from the backbone.

Domain Adaptation: The "support set" provides a specific context. Fine-tuning adjusts the weights of the model (often with a smaller learning rate) to minimize error specifically for the binary distinction, effectively "narrowing the focus" of the model from the general digit space to a specific pair.

3)Yes, it will work, but with some caveats. This is a classic example of Cross-Domain Transfer Learning.

Will it work? Not perfectly out of the box. A model trained on digits knows "roundness" (from '0'), which helps with the letter 'O', but it hasn't seen the complex intersections of a letter like 'K' or 'W'.

How to make it work: * Freeze the Backbone: Keep the convolutional layers from the digit model (they already detect basic edges).

New Head: Replace the final layer with a new one for your letters.

Fine-tune: Train on the letter dataset. Because digits and letters share "low-level features" (lines and curves), the model will learn much faster than a model starting from zero.

4)Technically, we have not implemented a true Meta-Learning algorithm (like MAML or Prototypical Networks) in this specific code. Instead, we have used Transfer Learning.

The Difference: * Transfer Learning (What we did): We took knowledge from Task A and applied it to Task B.

Meta-Learning: This would involve training the model specifically to learn how to learn. In a Meta-Learning approach, the model would be optimized across thousands of different tasks so that when it sees a new task (like Task E), it can adapt in just 1 or 2 gradient steps.

The Connection: Our setup of "Support" and "Query" sets is the structure used in Meta-Learning. By organizing data this way, we have prepared the environment for Meta-Learning, even though our training method (Standard SGD/Adam) was traditional fine-tuning.


Final Answer

While standard Transfer Learning (what we did in the previous blocks) assumes that a model trained on one task (digits) will provide a good starting point for a similar task, it often struggles when tasks are dissimilar (like digits vs. letters). MAML changes the goal: instead of finding a starting point that is good for one task, it finds a starting point that is easy to change for any task.1. The Search for the "Optimal Initialisation"In your example of digits vs. letters, a standard pre-trained model might be "too specialized" in digits. MAML seeks a parameter configuration $\theta$ that is sensitive to the gradients of both tasks.The Logic: You want to find a set of weights where a small "nudge" (gradient update) in the direction of letters makes it a great letter classifier, and a small nudge toward digits makes it a great digit classifier.The Analogy: Imagine standing on a hill between two valleys (Digits and Letters). Transfer Learning puts you at the bottom of the Digit valley; to get to Letters, you have to climb all the way out. MAML puts you on the peak of the hill, so you can run down into either valley with very little effort.2. Is this what we do in MAML?Exactly. MAML is designed to optimize for fast adaptation. Here is the formal breakdown of how it achieves that:The Two-Step Optimization (Inner and Outer Loops)MAML uses a "nested" logic that mimics your experiment:The Inner Loop (Task-Specific): For a specific task (e.g., Letter classification), the model takes a few steps of gradient descent using the Support Set.The Outer Loop (Meta-Optimization): The model looks at how well it performed on the Query Set after those steps. It then updates the initial weights ($\theta$) to ensure that next time, the inner loop update is even more effective.The Meta-Learning FormulaThe goal is to minimize the loss across a variety of tasks $T_i$:$$\min_{\theta} \sum_{T_i \sim P(T)} \mathcal{L}_{T_i}(f_{\theta_i'})$$Where $\theta_i'$ are the weights after adapting to task $i$:$$\theta_i' = \theta - \alpha \nabla_{\theta} \mathcal{L}_{T_i}(f_{\theta})$$Application: In your exam, if asked what $\theta$ represents in MAML, it is the Meta-Initialisation.3. Digit vs. Letter: Why MAML works better hereIf you use the 10-digit classifier as an initialization for letters, you might encounter Negative Transferâ€”where the specific features of digits (like the hole in an '8') actually confuse the model when it tries to learn an 'A'.MAML avoids this because:It doesn't learn "This is a 0"; it learns "These are the types of features (lines/curves) that are useful for distinguishing shapes."It preserves High-Level Plasticity, meaning the model remains flexible enough to be "molded" into a letter classifier using just 5 or 10 examples (Few-shot learning).



# ALL THE BEST !