<table style="background-color:#FFFFFF">   
  <tr>     
  <td><img src="https://upload.wikimedia.org/wikipedia/commons/9/95/Logo_EPFL_2019.svg" width="150"/>
  </td>     
  <td>
  <h1> <b>CS-461: Foundation Models and Generative AI</b> </h1>
  Prof. Charlotte Bunne  
  </td>   
  </tr>
</table>

# 📚 Graded Assignment 1  
### CS-461: Foundation Models and Generative AI - Fall 2025  - Due: October 8, 23:59 CET

Welcome to the first graded assignment!
In this assignment, you will **implement and explore self-supervised learning** on a downsampled subset of the [ImageNet-1k dataset](https://www.image-net.org/), and evaluate how well your model generalizes **both in-distribution and out-of-distribution (OOD)**.  

---

## 🎯 Learning Objectives
By completing this assignment, you will learn to:
- Implement a custom **encoder** and **projection head** for images  
- Experiment with **data augmentations** for self-supervised learning  
- Train a model using a **self-supervised loss**  
- Evaluate learned representations with **k-NN** and **linear probes**  
- Assess **out-of-distribution (OOD) generalization** to unseen classes  
- Save, visualize, and submit results in a reproducible way  

---

## ⚡ Practical Notes
- **Dataset:**  
  - Training: 200 ImageNet classes, 500 images each (100k total)  
  - Validation: 200 ImageNet classes, 50 images each (10k total)  
  - **OOD dataset:** 200 unseen classes, 50 images each (10k total)  
- Use OOD only for **evaluation**, never for training.  
- Checkpoints and evaluation intervals are already set up — your main tasks are to fill in missing functions and customize the model.  
- Some helper utilities (e.g., dataset loaders, probes) are provided in `utils.py`.  

---

👉 **Deliverables:** You will submit:
- Your modified **`models.py`**  
- Trained weights in **`final_model.safetensors`**  
- A short **report.md** (max 500 words) — including **discussion of OOD results**  
- This completed notebook **CS461_Assignment1.ipynb**  

---

⚠️ **Important:** Don’t forget to fill in your **SCIPER number** and **full name** in Section 0, otherwise you will receive **0 points**.  

First, we import packages and set up the device. \
Feel free to add any additional packages you may need.

In [1]:
# Automatically reloads modules when you make changes (useful during development)
%load_ext autoreload
%autoreload 2

In [2]:
import os
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True,max_split_size_mb:64"

from pathlib import Path
import shutil
import copy, math, json, csv

import numpy as np

import matplotlib.pyplot as plt
from mpl_toolkits.axes_grid1 import ImageGrid
from tqdm import tqdm

import torch
import torch.nn as nn
import torch.nn.functional as F
import torchvision.transforms as T
from torch.utils.data import DataLoader
from safetensors.torch import save_model

from torch.amp import autocast, GradScaler
from torchvision.utils import make_grid, save_image
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from torch.optim import AdamW
from torch.optim.lr_scheduler import SequentialLR, LinearLR, CosineAnnealingLR
import gc
gc.collect()
torch.cuda.empty_cache()


device = 'cuda' if torch.cuda.is_available() else 'cpu'

# 🆔 0. SCIPER Number and Name  

⚠️ **IMPORTANT!** ⚠️  
You **must** fill in your **SCIPER number** and **full name** below.  

This is **required for automatic grading**.  
If you do **not** provide this information, you will receive **0️⃣ (zero)** for this assignment. 

In [3]:
SCIPER = "395715"
LAST_NAME = "Kroknes-Gomez"
FIRST_NAME = "Yasmine"

## 1. Datasets & Utilities

- In the following, we will work with a subset of the ImageNet-1k dataset: color images downsampled to 64×64, covering 200 classes.
- The training set contains 500 images per class (100,000 images in total), and the validation set contains 50 images per class (10,000 images in total).
- The Out-Of-Distribution (OOD) dataset contains images from classes not present in the training set: 50 images from each of 200 different classes (10,000 images in total).
- The purpose of this OOD dataset is to evaluate the generalization capabilities of the learned representations. You should not use it for training.
- During evaluation, we will measure your model's performance on another OOD dataset (different from the one provided here), so make sure not to overfit on the provided OOD dataset.

<!-- Let's download/load it and define a default transformation turning a PIL Image into a `torch.tensor` -->
Make sure that you have access to the `/shared/CS461/cs461_assignment1_data/` folder. The folder structure should look like this:
```
cs461_assignment1_data/
└── train.npz
└── val.npz
└── ood.npz
```
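
Before going further, it can help to confirm that the three files listed above are actually visible from your session. The snippet below is a minimal sketch; `missing_data_files` is a hypothetical helper written for this check and is not part of `utils.py`.

```python
from pathlib import Path

# Hypothetical helper (not part of utils.py): report which expected files are absent.
def missing_data_files(root, names=("train.npz", "val.npz", "ood.npz")):
    root = Path(root)
    return [name for name in names if not (root / name).is_file()]

missing = missing_data_files("/shared/CS461/cs461_assignment1_data/")
print("All files found" if not missing else f"Missing: {missing}")
```

An empty list means the folder matches the structure shown above.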


Import the dataset class and the other utilities you developed in previous homework assignments:

In [4]:
from utils import ImageDatasetNPZ, default_transform, seed_all
from utils import run_knn_probe, run_linear_probe, extract_features_and_labels

hwloc/linux: failed to find sysfs cpu topology directory, aborting linux discovery.
Extension for Scikit-learn* enabled (https://github.com/uxlfoundation/scikit-learn-intelex)


For reproducibility, you can use the provided `seed_all` function to set the random seed for all relevant libraries (Python, NumPy, PyTorch).

In [5]:
seed_all(42)  # For reproducibility, you can use any integer here
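
For intuition, a seeding helper usually seeds each random number generator in turn. The sketch below shows one plausible way to do this; `seed_everything_sketch` is a hypothetical stand-in, and the actual `seed_all` in `utils.py` may do more or less than this.

```python
import random

import numpy as np
import torch

def seed_everything_sketch(seed: int) -> None:
    """Hypothetical stand-in for `seed_all`; the real helper in utils.py may differ."""
    random.seed(seed)        # Python's built-in RNG
    np.random.seed(seed)     # NumPy's legacy global RNG
    torch.manual_seed(seed)  # PyTorch's default generator
    if torch.cuda.is_available():
        torch.cuda.manual_seed_all(seed)  # every visible GPU

# Draw one number from each library, reseed, and draw again.
seed_everything_sketch(42)
a = (random.random(), np.random.rand(), torch.rand(1).item())
seed_everything_sketch(42)
b = (random.random(), np.random.rand(), torch.rand(1).item())
print(a == b)
```

Reseeding with the same integer reproduces the same draws, which is what makes runs comparable.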

You probably want to implement custom data augmentations for the self-supervised learning method you choose. \
Feel free to replace the `default_transform` imported above with your own transform, and to create multiple dataset instances with different transforms.

In [6]:
data_dir = Path('/shared/CS461/cs461_assignment1_data/')

In [7]:
IMAGENET_MEAN = [0.485, 0.456, 0.406]
IMAGENET_STD  = [0.229, 0.224, 0.225]

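class BYOLTransform:
    """Return two independently augmented views of the same image (BYOL-style)."""

    def __init__(self, size=64, s=0.5, blur_p=0.5):
        # s scales the colour-jitter strength; blur_p is the probability of applying Gaussian blur
        color_jitter = T.ColorJitter(0.8*s, 0.8*s, 0.8*s, 0.2*s)
        k = 9  # Gaussian blur kernel size (must be odd)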
        base = [
            T.ToPILImage(), 
            T.RandomResizedCrop(size=size, scale=(0.3, 1.0)),
            T.RandomHorizontalFlip(),
            T.RandomApply([color_jitter], p=0.8),
            T.RandomGrayscale(p=0.2),
            T.RandomApply([T.GaussianBlur(kernel_size=k, sigma=(0.1, 2.0))], p=blur_p),
            T.ToTensor(),
            T.Normalize(mean=IMAGENET_MEAN, std=IMAGENET_STD)
        ]
        self.train_transform = T.Compose(base)

    def __call__(self, x):
        return self.train_transform(x), self.train_transform(x)

byol_transform = BYOLTransform()

train_dataset = ImageDatasetNPZ(data_dir / 'train.npz', transform=byol_transform)
val_dataset = ImageDatasetNPZ(data_dir / 'val.npz', transform=byol_transform)

# train "eval" dataset that is single view to build the kNN/linear feature bank
# train_eval_dataset = ImageDatasetNPZ(data_dir/'train.npz', transform=simclr_eval_transform)

You can split the provided OOD dataset into a training and validation set using the code below. \
You should not use the training split for actually training your models, but only for evaluation (e.g. kNN or linear probing).

In [8]:
rng = np.random.RandomState(42)
ds_ood = ImageDatasetNPZ(data_dir / 'ood.npz', transform=default_transform)
ood_val_ratio = 0.2
train_mask = rng.permutation(len(ds_ood)) >= int(len(ds_ood) * ood_val_ratio)
ds_oods_train = torch.utils.data.Subset(ds_ood, np.where(train_mask)[0])
ds_oods_val = torch.utils.data.Subset(ds_ood, np.where(~train_mask)[0])
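
To see what the masking logic above produces, the same split can be replayed on plain indices. This is a small sanity-check sketch, assuming the OOD set holds 10,000 items (200 classes × 50 images); it touches no image data.

```python
import numpy as np

n_items = 10_000   # assumed size of the provided OOD set (200 classes x 50 images)
val_ratio = 0.2

rng = np.random.RandomState(42)
# A permutation of 0..n-1 contains each value once, so exactly 80% of entries are >= the threshold.
train_mask = rng.permutation(n_items) >= int(n_items * val_ratio)
train_idx = np.where(train_mask)[0]
val_idx = np.where(~train_mask)[0]

print(len(train_idx), len(val_idx))
print(len(np.intersect1d(train_idx, val_idx)))  # number of shared indices
```

The two index sets are disjoint and together cover the whole dataset, so the probe is never tested on samples it was fitted on.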

In [9]:
batch_size = 128
num_workers = 4
pin_memory = True

def byol_collate_fn(batch):
    xs1, xs2, ys = [], [], []
    for (x1, x2), y in batch:
        xs1.append(x1)
        xs2.append(x2)
        ys.append(y)
    return torch.stack(xs1), torch.stack(xs2), torch.tensor(ys)
    
collate_fn = None

In [10]:
train_loader = DataLoader(train_dataset, batch_size=batch_size, num_workers=num_workers, pin_memory=pin_memory, shuffle=True, collate_fn=byol_collate_fn)
val_loader  = DataLoader(val_dataset,  batch_size=batch_size, num_workers=num_workers, pin_memory=pin_memory, shuffle=False, collate_fn=byol_collate_fn)

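# Single-view (no augmentation) loaders for the kNN / linear-probe feature bank; assumes ImageDatasetNPZ accepts this transform like BYOLTransform
eval_transform = T.Compose([T.ToPILImage(), T.ToTensor(), T.Normalize(mean=IMAGENET_MEAN, std=IMAGENET_STD)])
train_loader_noaugment = DataLoader(ImageDatasetNPZ(data_dir / 'train.npz', transform=eval_transform), batch_size=batch_size, shuffle=False, num_workers=num_workers, pin_memory=pin_memory)
val_loader_noaugment = DataLoader(ImageDatasetNPZ(data_dir / 'val.npz', transform=eval_transform), batch_size=batch_size, shuffle=False, num_workers=num_workers, pin_memory=pin_memory)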

# 2. Load Your Model

- Load your model from `models.py`.
- You will need to modify the `encoder` and `projection` modules, as the provided template implementation is only a placeholder.
- You SHOULD NOT change the `input_dim`, `input_channels`, and `feature_dim` parameters of the `ImageEncoder` class.
- You can use an existing architecture (e.g., ResNet, ViT) but you SHOULD NOT use any pre-trained weights.

In [11]:
from models import ImageEncoder

model = ImageEncoder().to(device)
model

ImageEncoder(
  (encoder): Sequential(
    (0): Conv2d(3, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
    (1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (2): ReLU(inplace=True)
    (3): Identity()
    (4): Sequential(
      (0): Bottleneck(
        (conv1): Conv2d(64, 64, kernel_size=(1, 1), stride=(1, 1), bias=False)
        (bn1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
        (conv2): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
        (bn2): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
        (conv3): Conv2d(64, 256, kernel_size=(1, 1), stride=(1, 1), bias=False)
        (bn3): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
        (relu): ReLU(inplace=True)
        (downsample): Sequential(
          (0): Conv2d(64, 256, kernel_size=(1, 1), stride=(1, 1), bias=False)
       

## 3. Helpers for Training & Evaluation

We suggest implementing the following helper functions to keep your training and evaluation loops clean and organized.
- `training_step`: Performs a single training step (forward pass, loss computation, backward pass, optimizer step) and returns the loss value.
- `evaluation_step`: Evaluates the model on the validation dataset and returns the accuracy.

Depending on your specific requirements, you may also want to implement additional utility functions for tasks such as data loading, metric computation, and logging.

As you have seen from previous assignments, loss functions for self-supervised learning objectives can be quite complex. \
Feel free to implement any helper functions you may need to compute the loss.


In [12]:
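from contextlib import nullcontext  # no-op context manager used when AMP is disabled

def training_step(online_network, target_network, batch, optimizer, ema, device, scaler=None):
    """Run one BYOL update (forward, loss, backward, optimizer step, EMA) and return the loss value."""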
    x1, x2, _ = batch
    x1 = x1.to(device, non_blocking=True)
    x2 = x2.to(device, non_blocking=True)

    online_network.train()
    target_network.eval()  # EMA target is inference-only

    # AMP context (optional but convenient)
    autocast_ctx = torch.amp.autocast('cuda') if scaler is not None else nullcontext()
    with autocast_ctx:
        # Online predictions on each view
        _, _, p1 = online_network(x1)
        _, _, p2 = online_network(x2)

        # Target projections (stop-grad)
        with torch.no_grad():
            _, z1_t, _ = target_network(x1)  # predictor unused
            _, z2_t, _ = target_network(x2)

        # Symmetric BYOL loss
        loss = byol_loss(p1, z2_t) + byol_loss(p2, z1_t)

    optimizer.zero_grad(set_to_none=True)
    if scaler is not None:
        scaler.scale(loss).backward()
        scaler.step(optimizer)
        scaler.update()
    else:
        loss.backward()
        optimizer.step()

    # EMA update of the target network (in-place)
    with torch.no_grad():
        ema.update(online_network, target_network)

    return float(loss.item())


def compute_byol_loss(online_network, target_network, batch, device, scaler=None):
    x1, x2, _ = batch
    x1 = x1.to(device, non_blocking=True)
    x2 = x2.to(device, non_blocking=True)

    online_network.train()
    target_network.eval()

    autocast_ctx = torch.amp.autocast('cuda') if (scaler is not None) else nullcontext()
    with autocast_ctx:
        # online predictions
        _, _, p1 = online_network(x1)
        _, _, p2 = online_network(x2)
        # target projections (stop-grad)
        with torch.no_grad():
            _, z1_t, _ = target_network(x1)
            _, z2_t, _ = target_network(x2)
        # symmetric BYOL loss
        loss = byol_loss(p1, z2_t) + byol_loss(p2, z1_t)
    return loss

In [13]:
def evaluation_step(online_network, 
                    target_network, 
                    train_loader_noaugment, 
                    test_loader_noaugment,
                    device,
                    use_target: bool = True,     
                    normalize: bool = True,      
                    do_linear: bool = True 
    ):
    # 1) choose which network to evaluate (the EMA target by default)
    eval_model = target_network if (use_target and target_network is not None) else online_network
    eval_model.eval()

    # 2) extract features (pre-projector; your get_features handles this)
    train_feats_t, train_labels_t = extract_features_and_labels(eval_model, train_loader_noaugment, normalize=normalize)
    test_feats_t,  test_labels_t  = extract_features_and_labels(eval_model, test_loader_noaugment,  normalize=normalize)

    # sklearn expects numpy arrays
    train_feats = train_feats_t.cpu().numpy()
    test_feats  = test_feats_t.cpu().numpy()
    train_labels = train_labels_t.cpu().numpy()
    test_labels  = test_labels_t.cpu().numpy()

    # 3) kNN probe (fast sanity metric)
    knn_acc = run_knn_probe(train_feats, train_labels, test_feats, test_labels)

    metrics = {"knn_acc": float(knn_acc)}

    # 4) linear probe (stronger metric; optional to keep eval fast)
    if do_linear:
        linear_acc = run_linear_probe(train_feats, train_labels, test_feats, test_labels)
        metrics["linear_acc"] = float(linear_acc)

    return metrics
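
To make the k-NN probe less of a black box, here is a small NumPy sketch of a cosine-similarity k-NN classifier with majority voting. `knn_accuracy_sketch` is a hypothetical illustration; the provided `run_knn_probe` in `utils.py` may use a different distance, `k`, or voting rule.

```python
import numpy as np

def knn_accuracy_sketch(train_x, train_y, test_x, test_y, k=5):
    """Hypothetical cosine-similarity k-NN probe (not the implementation in utils.py)."""
    train_x = train_x / np.linalg.norm(train_x, axis=1, keepdims=True)
    test_x = test_x / np.linalg.norm(test_x, axis=1, keepdims=True)
    sims = test_x @ train_x.T                   # (n_test, n_train) cosine similarities
    nn_idx = np.argsort(-sims, axis=1)[:, :k]   # indices of the k most similar training points
    nn_labels = train_y[nn_idx]                 # (n_test, k) labels of those neighbours
    preds = np.array([np.bincount(row).argmax() for row in nn_labels])  # majority vote
    return float((preds == test_y).mean())

# Toy data: two well-separated clusters standing in for learned features.
rng = np.random.default_rng(0)
centers = np.array([[5.0, 0.0], [0.0, 5.0]])
train_y = np.repeat([0, 1], 50)
test_y = np.repeat([0, 1], 10)
train_x = centers[train_y] + 0.1 * rng.standard_normal((100, 2))
test_x = centers[test_y] + 0.1 * rng.standard_normal((20, 2))

print(knn_accuracy_sketch(train_x, train_y, test_x, test_y))
```

No parameters are trained here, which is why k-NN accuracy is a quick check of representation quality during training.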

In [14]:
def byol_loss(p, z, eps: float = 1e-8):
    """2 - 2 * cosine similarity (batch mean)."""
    p = F.normalize(p, dim=-1, eps=eps)
    z = F.normalize(z, dim=-1, eps=eps)
    return 2 - 2 * (p * z).sum(dim=-1).mean()

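class EMA:
    """Exponential moving average update of the target network from the online network."""
    def __init__(self, beta=0.99):
        self.beta = beta

    @torch.no_grad()
    def update(self, online, target):
        for p_o, p_t in zip(online.parameters(), target.parameters()):
            p_t.data.mul_(self.beta).add_(p_o.data, alpha=(1.0 - self.beta))
        # Buffers (e.g. BatchNorm running statistics) are not parameters; copy them so the
        # target network, which always runs in eval mode, does not keep its initial statistics.
        for b_o, b_t in zip(online.buffers(), target.buffers()):
            b_t.data.copy_(b_o.data)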
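
The `2 - 2 * cosine` form of the loss above is the squared Euclidean distance between the L2-normalised prediction and target, since $\|\hat p - \hat z\|^2 = 2 - 2\,\hat p \cdot \hat z$ for unit vectors. The following sketch checks this numerically on random tensors; `byol_loss` is copied from the cell above so the snippet runs on its own.

```python
import torch
import torch.nn.functional as F

def byol_loss(p, z, eps: float = 1e-8):
    # Copied from the cell above so this check is self-contained.
    p = F.normalize(p, dim=-1, eps=eps)
    z = F.normalize(z, dim=-1, eps=eps)
    return 2 - 2 * (p * z).sum(dim=-1).mean()

torch.manual_seed(0)
p, z = torch.randn(8, 16), torch.randn(8, 16)

# The same quantity written as a squared distance between unit vectors.
sq_dist = (F.normalize(p, dim=-1) - F.normalize(z, dim=-1)).pow(2).sum(dim=-1).mean()

print(torch.allclose(byol_loss(p, z), sq_dist, atol=1e-5))
print(abs(byol_loss(p, p).item()) < 1e-5)  # identical inputs give a loss of (approximately) zero
```

The loss therefore lies between 0 (perfectly aligned) and 4 (opposite directions) for each pair.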

In [15]:
class TrainLogger:
    def __init__(self):
        self.history = []

    def log(self, **kwargs):
        self.history.append(dict(**kwargs))

    def to_jsonl(self, path: Path):
        path.parent.mkdir(parents=True, exist_ok=True)
        with open(path, "w") as f:
            for row in self.history:
                f.write(json.dumps(row) + "\n")

    def to_csv(self, path: Path):
        path.parent.mkdir(parents=True, exist_ok=True)
        if not self.history:
            return
        keys = sorted(self.history[0].keys())
        with open(path, "w", newline="") as f:
            w = csv.DictWriter(f, fieldnames=keys)
            w.writeheader()
            for row in self.history:
                w.writerow(row)

# for visualisations
def _denorm(x, mean=IMAGENET_MEAN, std=IMAGENET_STD):
    m = torch.tensor(mean, device=x.device)[None, :, None, None]
    s = torch.tensor(std,  device=x.device)[None, :, None, None]
    return x * s + m

@torch.no_grad()
def sample_val_predictions(
    eval_model,
    train_loader_noaugment,
    val_loader_noaugment,
    out_path: Path,
    n_samples: int = 16,
    use_linear: bool = True
):
    eval_model.eval()

    # 1) features
    train_feats_t, train_labels_t = extract_features_and_labels(eval_model, train_loader_noaugment, normalize=True)
    val_feats_t,   val_labels_t   = extract_features_and_labels(eval_model, val_loader_noaugment,   normalize=True)

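    # .cpu() is a no-op for CPU tensors and avoids an error if features live on the GPU
    train_feats = train_feats_t.cpu().numpy()
    val_feats   = val_feats_t.cpu().numpy()
    train_lbls  = train_labels_t.cpu().numpy()
    val_lbls    = val_labels_t.cpu().numpy()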

    # 2) probe fit + accuracy
    if use_linear:
        clf = LogisticRegression(max_iter=1000, n_jobs=-1).fit(train_feats, train_lbls)
    else:
        clf = KNeighborsClassifier(n_neighbors=5, n_jobs=-1).fit(train_feats, train_lbls)

    val_preds = clf.predict(val_feats)
    acc = float((val_preds == val_lbls).mean())

    # 3) gather n_samples images from val loader
    imgs, labels = [], []
    for x, y in val_loader_noaugment:
        imgs.append(x)
        labels.append(y)
        if sum(b.size(0) for b in imgs) >= n_samples:
            break
    imgs = torch.cat(imgs, 0)[:n_samples]
    labels = torch.cat(labels, 0)[:n_samples]

    # 4) predict on sampled images
    dev = next(eval_model.parameters()).device
    feats_small = eval_model.get_features(imgs.to(dev))
    feats_small = F.normalize(feats_small, dim=1).cpu().numpy()
    preds_small = clf.predict(feats_small)

    # optional probabilities if available
    probs_small = None
    if hasattr(clf, "predict_proba"):
        probs_small = clf.predict_proba(feats_small).max(axis=1).tolist()

    # 5) save grid
    grid = make_grid(_denorm(imgs.to(dev)).cpu(), nrow=int(n_samples**0.5), padding=2)
    out_path.parent.mkdir(parents=True, exist_ok=True)
    save_image(grid, out_path)

    # 6) tabular sample info
    samples = []
    for i in range(n_samples):
        row = {"index": int(i), "true": int(labels[i].item()), "pred": int(preds_small[i])}
        if probs_small is not None:
            row["conf"] = float(probs_small[i])
        samples.append(row)

    return {
        "probe_acc": float(acc),
        "samples": samples,
        "grid_path": str(out_path)
    }

# 4. Optimizer Configuration

In [16]:
# Feel free to adapt and add more arguments
def param_groups_no_wd(model, weight_decay=1e-6):
    decay, no_decay = [], []
    for p in model.parameters():
        if not p.requires_grad:
            continue
        if p.ndim >= 2:
            decay.append(p)
        else:
            no_decay.append(p)
    return [
        {"params": decay,    "weight_decay": weight_decay},
        {"params": no_decay, "weight_decay": 0.0},
    ]

def make_optimizer(model, global_batch):
    lr = 1e-3 * (global_batch / 256.0)
    return AdamW(param_groups_no_wd(model, weight_decay=1e-6), lr=lr, betas=(0.9, 0.999))

def make_scheduler(optimizer, total_epochs, steps_per_epoch, warmup_pct=0.1):
    total_steps = total_epochs * steps_per_epoch
    warmup_steps = max(1, int(warmup_pct * total_steps))
    cosine_steps = max(1, total_steps - warmup_steps)

    warmup = LinearLR(optimizer, start_factor=1e-3, end_factor=1.0, total_iters=warmup_steps)
    cosine = CosineAnnealingLR(optimizer, T_max=cosine_steps, eta_min=0.0)
    return SequentialLR(optimizer, schedulers=[warmup, cosine], milestones=[warmup_steps])

def ema_momentum(epoch, total_epochs, m_start=0.99, m_end=0.996):
    if total_epochs <= 1:
        return m_end
    t = epoch / (total_epochs - 1)
    return m_end - (m_end - m_start) * 0.5 * (1.0 + math.cos(math.pi * t))


# lr = 1e-3
# weight_decay = 5e-2
# lr_step_size = 10
# lr_gamma = 0.1

In [17]:
# optimizer = torch.optim.Adam(model.parameters(), lr=lr)
# lr_scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=lr_step_size, gamma=lr_gamma)


# 5. Training Loop

Adapt your training configuration and implement the training loop. \
You probably want to save model checkpoints and evaluate the model on the validation set at regular intervals.

In [18]:
n_epochs = 50  # Adjust the number of epochs as needed
eval_interval = 5  # Evaluate the model every 'eval_interval' epochs
save_interval = 10  # Save the model every 'save_interval' epochs

checkpoints_dir = Path('checkpoints')
if not checkpoints_dir.exists():
    checkpoints_dir.mkdir(parents=True, exist_ok=False)

In [19]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = model.to(device)
scaler = GradScaler(enabled=(device.type == "cuda"))

online_network = model
target_network = copy.deepcopy(online_network).to(device)
for p in target_network.parameters(): 
    p.requires_grad = False

global_batch = train_loader.batch_size  # adjust if using DDP: per_gpu * world_size
base_lr = 1e-3 * (global_batch / 256.0)
optimizer = AdamW(param_groups_no_wd(online_network, weight_decay=1e-6),
                  lr=base_lr, betas=(0.9, 0.999))

accum_steps = 4  # effective_batch = per_iter_batch * accum_steps

# The scheduler is stepped once per optimizer step (see the loop below), so the schedule
# length must be counted in optimizer steps rather than in mini-batches.
steps_per_epoch = math.ceil(len(train_loader) / accum_steps)
total_steps = max(1, n_epochs * steps_per_epoch)
warmup_steps = max(1, int(0.1 * total_steps))
cosine_steps = max(1, total_steps - warmup_steps)

warmup = LinearLR(optimizer, start_factor=1e-3, end_factor=1.0, total_iters=warmup_steps)
cosine = CosineAnnealingLR(optimizer, T_max=cosine_steps, eta_min=0.0)
lr_scheduler = SequentialLR(optimizer, schedulers=[warmup, cosine], milestones=[warmup_steps])


history = []
def log_row(**kwargs):
    history.append(dict(**kwargs))

for epoch in tqdm(range(n_epochs)):
    m = ema_momentum(epoch, n_epochs)
    avg_loss = 0.0
    online_network.train()
    target_network.eval()

    optimizer.zero_grad(set_to_none=True)

    for step, batch in enumerate(train_loader, start=1):
        # forward-only (no step)
        loss = compute_byol_loss(online_network, target_network, batch, device, scaler)
        avg_loss += float(loss.item())

        # scale for accumulation
        loss = loss / accum_steps

        if scaler is not None:
            scaler.scale(loss).backward()
        else:
            loss.backward()

        do_step = (step % accum_steps == 0) or (step == len(train_loader))
        if do_step:
            if scaler is not None:
                scaler.step(optimizer)
                scaler.update()
            else:
                optimizer.step()
            optimizer.zero_grad(set_to_none=True)

            # EMA + LR scheduler once per optimizer step
            with torch.no_grad():
                EMA(beta=m).update(online_network, target_network)
            lr_scheduler.step()


    avg_loss /= max(1, len(train_loader))

    row = {
        "epoch": epoch + 1,
        "train_loss": float(avg_loss),
        "lr": float(optimizer.param_groups[0]["lr"]),
        "ema_m": float(m),
    }

    if (epoch + 1) % eval_interval == 0:
        metrics = evaluation_step(
            online_network, target_network,
            train_loader_noaugment, val_loader_noaugment,
            device, use_target=True, normalize=True, do_linear=True
        )
        row["knn_acc"] = float(metrics["knn_acc"])
        row["linear_acc"] = float(metrics.get("linear_acc", float("nan")))

        grid_path = checkpoints_dir / f"val_samples_epoch_{epoch+1:03d}.png"
        sample_info = sample_val_predictions(
            target_network, train_loader_noaugment, val_loader_noaugment,
            grid_path, n_samples=16, use_linear=True
        )
        row["sample_probe_acc"] = float(sample_info["probe_acc"])
        row["sample_grid_path"] = sample_info["grid_path"]

        print(f"Epoch {epoch+1:03d} | loss {row['train_loss']:.4f} | "
              f"kNN {row['knn_acc']:.2%} | linear {row['linear_acc']:.2%}")
        torch.cuda.empty_cache()

    log_row(**row)

    if (epoch + 1) % save_interval == 0:
        torch.save(online_network.state_dict(), checkpoints_dir / f"online_epoch_{epoch+1:03d}.pth")
        torch.save(target_network.state_dict(), checkpoints_dir / f"target_epoch_{epoch+1:03d}.pth")

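# ---- final save + logs ----
torch.save(online_network.state_dict(), checkpoints_dir / "online_final.pth")
torch.save(target_network.state_dict(), checkpoints_dir / "target_final.pth")

# Export the submission weights; the submission cell in Section 7 expects `final_model_path`.
# Assumption: the online network is the one to submit (swap in target_network if you prefer).
final_model_path = checkpoints_dir / "final_model.safetensors"
save_model(online_network, str(final_model_path))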

with open(checkpoints_dir / "training_log.jsonl", "w") as f:
    for r in history:
        f.write(json.dumps(r) + "\n")

with open(checkpoints_dir / "training_log.csv", "w", newline="") as f:
    if history:
        keys = sorted(history[0].keys())
        w = csv.DictWriter(f, fieldnames=keys)
        w.writeheader()
        for r in history:
            w.writerow(r)

  0%|          | 0/50 [00:02<?, ?it/s]


OutOfMemoryError: CUDA out of memory. Tried to allocate 256.00 MiB. GPU 0 has a total capacity of 9.75 GiB of which 168.81 MiB is free. Process 1544998 has 2.15 GiB memory in use. Process 1845532 has 2.92 GiB memory in use. Process 2709594 has 7.44 GiB memory in use. Process 2747369 has 6.01 GiB memory in use. Process 2639663 has 1.13 GiB memory in use. Process 2787762 has 7.41 GiB memory in use. Of the allocated memory 7.29 GiB is allocated by PyTorch, and 16.43 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
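
The loop above accumulates gradients over `accum_steps` mini-batches before each optimizer step, which trades extra iterations for a lower peak memory footprint. The arithmetic below is a small sketch of what that means for this configuration, using the dataset size and batch size stated earlier in the notebook.

```python
import math

n_images = 100_000   # training set size stated in Section 1
batch_size = 128     # per-iteration batch size from the DataLoader cell
accum_steps = 4      # value used in the training loop

# DataLoader keeps the last partial batch by default, hence the ceiling.
batches_per_epoch = math.ceil(n_images / batch_size)
# The loop also steps on the final (possibly shorter) accumulation group.
optimizer_steps_per_epoch = math.ceil(batches_per_epoch / accum_steps)
effective_batch = batch_size * accum_steps

print(batches_per_epoch, optimizer_steps_per_epoch, effective_batch)
```

With these numbers there are 782 mini-batches and 196 optimizer steps per epoch, with an effective batch size of 512. Because the scheduler and the EMA update run once per optimizer step, the optimizer-step count is the relevant unit when sizing a schedule.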

# 6. Visualize Results

To better understand the performance of your trained model, visualize some results. \
You can visualize:
- Sample images from the validation set along with their predicted labels.
- Training and validation loss curves over epochs.

In [None]:
# TODO: Visualize some results from your trained model.
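
One possible starting point is to plot the curves from the JSONL log written by the training cell. The sketch below assumes the log rows carry the keys `epoch`, `train_loss`, `knn_acc`, and `linear_acc` as produced above; `load_history` and `plot_history` are hypothetical helper names, and the output file name is arbitrary.

```python
import json
from pathlib import Path

import matplotlib.pyplot as plt

def load_history(path):
    """Read one JSON object per line, as written to training_log.jsonl."""
    with open(path) as f:
        return [json.loads(line) for line in f if line.strip()]

def plot_history(history, out_path):
    """Plot training loss and probe accuracies, then save the figure to out_path."""
    fig, (ax_loss, ax_acc) = plt.subplots(1, 2, figsize=(10, 4))
    ax_loss.plot([r["epoch"] for r in history], [r["train_loss"] for r in history])
    ax_loss.set_xlabel("epoch")
    ax_loss.set_ylabel("train loss")

    # Accuracy keys exist only for epochs where evaluation ran.
    evaluated = [r for r in history if "knn_acc" in r]
    eval_epochs = [r["epoch"] for r in evaluated]
    ax_acc.plot(eval_epochs, [r["knn_acc"] for r in evaluated], marker="o", label="kNN")
    ax_acc.plot(eval_epochs, [r.get("linear_acc", float("nan")) for r in evaluated],
                marker="s", label="linear")
    ax_acc.set_xlabel("epoch")
    ax_acc.set_ylabel("validation accuracy")
    ax_acc.legend()

    fig.tight_layout()
    fig.savefig(out_path)
    plt.close(fig)
    return out_path

log_path = Path("checkpoints") / "training_log.jsonl"
if log_path.exists():  # only plot once a training run has produced a log
    plot_history(load_history(log_path), Path("checkpoints") / "training_curves.png")
```

The sample grids saved during evaluation (`val_samples_epoch_*.png`) can be displayed alongside these curves.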

# 7. Submission Instructions

You must submit the following files:
- `models.py`: Contains the implementation of your model architecture.
- `final_model.safetensors`: The trained model weights saved in the safetensors format.
- `report.md`: A brief report summarizing your approach, design choices, and results.
- `CS461_Assignment1.ipynb`: The Jupyter notebook containing your code and explanations. Make sure to save your progress before running the cell below.

You will submit your assignment under a single folder named `cs461_assignment1_submission` in your home directory, containing the above files. \
Make sure the `SCIPER`, `LAST_NAME`, and `FIRST_NAME` variables in Section 0 hold your actual SCIPER number, last name, and first name respectively. \
The following cell will help you move the files into the submission folder.

In [None]:
work_dir = Path('.')
output_dir = Path.home() / 'cs461_assignment1_submission'

if not output_dir.exists():
    output_dir.mkdir(parents=True, exist_ok=False)
    
shutil.copy(final_model_path, output_dir / 'final_model.safetensors')
shutil.copy(work_dir / 'models.py', output_dir / 'models.py')
shutil.copy(work_dir / 'CS461_Assignment1.ipynb', output_dir / 'CS461_Assignment1.ipynb')
shutil.copy(work_dir / 'report.md', output_dir / 'report.md')

Check that all required files are present in the submission folder before running the cell below.

In [None]:
assert SCIPER is not None and LAST_NAME is not None and FIRST_NAME is not None, "Please set your SCIPER, LAST_NAME, and FIRST_NAME variables."

list_of_files = ['final_model.safetensors', 'models.py', 'CS461_Assignment1.ipynb', 'report.md']
files_found = all((output_dir / f).exists() for f in list_of_files)
assert files_found, f"One or more required files are missing in the submission folder: {list_of_files}"


You can test whether your submission folder is appropriately structured by running `eval.py`:
```bash
python eval.py
```

In [2]:
### Uncomment the line below to run the evaluation script and check your model's performance

# !python eval.py

---
🎉 **Congratulations!**  
You’ve completed Assignment 1. Good luck, and don’t forget to double-check your submission!