# Experiment Tracking with Weights & Biases (W&B)
*Created by **Laia Albors**, **Jorge Pueyo** and **Àlex Solé** (2025)*

This lab introduces you to [**Weights & Biases (W&B)**](https://docs.wandb.ai/), a powerful tool for tracking machine learning experiments. You will learn how to set up projects, log data such as metrics, images, and configurations, and explore results through the W&B interface. By the end, you will be able to organize experiments effectively and gain valuable insights into your models.



## PART 1: Getting Started

### 0) Setup
First we need to install W&B:


In [None]:
!pip install -q wandb

### 1) Create an account and log in

Once installed, we need an account and an API key:

1. Go to [**wandb.ai**](https://app.wandb.ai/login?signup=true) and create an account (you can sign in with GitHub/Google).
2. Now, you can log in from Python using `wandb.login()`.
3. Use the interactive prompt or environment variables to set your API key.


In [None]:
import os

import wandb

# The call below will open a prompt in the notebook/terminal to paste your API key (once).
# Alternatively, you can set the env var WANDB_API_KEY before running this cell.
wandb.login()  # <- follow the instructions in the output


### 2) Configure a project & initialize a run

In W&B, **each experiment is a _run_**, and runs are **grouped into _projects_**. In this first part, we’ll just create a small example project (`aa2-wandb-lab`) and a few demo runs to understand the workflow. Later, you’ll reuse the same project while training a real network.

We’ll also attach some **config** (hyperparameters/metadata), **tags** (to filter/search), and **notes** (short description).

In [None]:
PROJECT = "aa2-wandb-lab"            # keep it short & consistent across the course
RUN_NAME = "intro-logging"           # optional but useful
TAGS = ["intro", "logging", "part1"] # you can add/remove as you like
NOTES = "First steps with W&B: login, init, log scalars/images/matrices."

# Anything you place under config will be versioned & shown in the UI.
config = dict(
    seed=42,
    batch_size=32,
    lr=1e-3,
    comment="config is just metadata for this run"
)

# Initialize a run. If you're inside a Jupyter notebook,
# W&B will automatically attach the notebook as an artifact (unless disabled).
init_kwargs = dict(
    project=PROJECT,
    name=RUN_NAME,
    tags=TAGS,
    notes=NOTES,
    config=config
)

run = wandb.init(**init_kwargs)

If you click the link shown after **“View project at”**, you’ll be taken to the W&B project page (`aa2-wandb-lab`).  At this point, you should see only a single run listed there, named **`intro-logging`**.

From the project page, you can either click on this run or use the link printed after **“View run at”**.  This will open the detailed dashboard for that specific run, where all metrics and artifacts are tracked.  Right now, you’ll only see **system-level metrics** (CPU/GPU usage, memory, etc.), since we haven’t logged any custom values yet.

If you return to the project page showing all runs, notice that you are in the **"Workspace"** view by default.  If you switch to the **"Runs"** tab on the left, W&B will display a table of all runs in the project.  Here you can easily inspect the **init arguments** we defined earlier—such as tags, notes, and configuration values (`seed`, `batch_size`, `lr`, etc.)—for each run. This view is especially useful when you start comparing multiple experiments side by side.


### 3) Log numbers (scalars)

Now that our run is initialized, let’s start recording some metrics into it.  
The main way to do this is with **`wandb.log({...})`**, which sends a dictionary of key–value pairs to W&B.  

In practice, these calls usually happen inside your training and validation loops (e.g., logging loss and accuracy after each step or epoch).  
Here we’ll simulate this by logging a few dummy values (`train/loss`, `val/loss`, and `val/accuracy`) across 10 steps, so you can immediately see them appear in the run dashboard you opened earlier.

In [None]:
import math
import random
import time

# Simulate logging of a few scalar metrics across steps
for step in range(1, 11):
    train_loss = math.exp(-step/5.0) + random.random() * 0.02
    val_loss = train_loss + 0.05 + random.random() * 0.02
    acc = 1.0 - val_loss

    wandb.log({
        "step": step,
        "train/loss": train_loss,
        "val/loss": val_loss,
        "val/accuracy": acc
    })
    time.sleep(0.1)  # just to make the timeline nicer

print("Logged scalar metrics for 10 steps.")

In the W&B run page, you should now see three new sections:

- **Charts**, which contains the plot for the `step` key we logged.  
- **train**, with a plot showing the `train/loss` values we logged for each step.  
- **val**, with two plots: one for `val/loss` and another for `val/accuracy`.  

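Each point in these plots comes from one call to `wandb.log`. By default the x-axis is W&B's own **Step** counter, which advances by one on every `wandb.log` call, so here it advances together with the `step` value we logged.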

### 4) Log images

So far we’ve only logged numeric values, but W&B can also handle richer data types.  
A common use case during training is to **visualize model predictions as images** (e.g., inputs, reconstructions, or segmentations).  

To log an image, simply wrap it with `wandb.Image`. This works with arrays (NumPy/PyTorch) or with `PIL.Image` objects.  
You can also add a **caption** to give context to what’s being shown.  

Below we create a small synthetic image (a color gradient with a white rectangle) just to see how images are logged.  
In the W&B run page, you’ll find the result in a new **demo** section, under the key `demo/image`.

In [None]:
import numpy as np
from PIL import Image, ImageDraw

# Create a simple synthetic image (RGB gradient with a rectangle) just for demo
H, W = 128, 128
gradient = np.zeros((H, W, 3), dtype=np.uint8)
for y in range(H):
    for x in range(W):
        gradient[y, x, :] = [x * 255 // (W-1), y * 255 // (H-1), 128]

img = Image.fromarray(gradient)
draw = ImageDraw.Draw(img)
draw.rectangle([32, 32, 96, 96], outline=(255, 255, 255), width=3)

wandb.log({
    "step": 11,
    "demo/image": wandb.Image(img, caption="Synthetic demo image")
})

print("Logged one demo image.")

### 5) Log matrices

Besides numbers and images, W&B can also log **matrices**, which are very common in ML workflows.  
There are several ways to represent them, depending on what you want to highlight:

- **As a heatmap image**: render with matplotlib and wrap in `wandb.Image`.  
- **As a histogram**: summarize the distribution of values with `wandb.Histogram(matrix)`.  
- **As a table**: store the raw values in a `wandb.Table`, useful for small matrices you want to inspect cell by cell.  
- **As a confusion matrix**: use `wandb.plot.confusion_matrix` when working on classification tasks to compare predictions against ground truth.

Below we show all of these options:

1. A random 32×32 matrix logged as a heatmap, a histogram, and a small table.  
2. A **synthetic confusion matrix example**, where we pretend we have ground-truth labels and model predictions for a 2-class problem. This is exactly the kind of plot you’ll use later with your CNN.


In [None]:
import matplotlib.pyplot as plt
import numpy as np

# Create a random matrix
M = np.random.randn(32, 32)

# 1) Heatmap as image
plt.figure()
plt.imshow(M)
plt.title("Random matrix heatmap")
plt.colorbar()
plt.tight_layout()
wandb.log({
    "step": 12,
    "demo/matrix_heatmap": wandb.Image(plt)
})
plt.close()

# 2) Histogram of values
wandb.log({
    "step": 13,
    "demo/matrix_hist": wandb.Histogram(M)
})

# 3) Small table (be careful with large arrays!)
small = M[:5, :5]
table = wandb.Table(data=small.tolist(), columns=[f"c{i}" for i in range(small.shape[1])])
wandb.log({
    "step": 14,
    "demo/matrix_table": table
})

# 4) Example confusion matrix (binary case)
true_labels = [0, 1, 0, 1, 0, 1, 0, 0, 1, 1]
pred_labels = [0, 1, 0, 0, 0, 1, 1, 0, 1, 1]

wandb.log({
    "step": 15,
    "demo/conf_matrix": wandb.plot.confusion_matrix(
        probs=None,
        y_true=true_labels,
        preds=pred_labels,
        class_names=["negative", "positive"]
    )
})

print("Logged matrix as heatmap, histogram, table, and confusion matrix.")

If you refresh the W&B run page, you should now see several new panels in the **demo** section:

- **Confusion Matrix Curve** → visualization of predictions vs. ground truth.  
  - Rows = actual labels, columns = predicted labels.  
  - Each cell shows how many examples fell into that category (e.g. true positives, false negatives).

- **demo/matrix_heatmap** → a heatmap of the random 32×32 matrix we logged with matplotlib.  
  - Bright vs. dark areas correspond to higher vs. lower values.  
  - This is similar to how you might visualize filters or attention maps.

- **demo/matrix_hist** → a histogram of the matrix values.  
  - Y-axis = bins of values (from negative to positive).  
  - Color intensity = how many values fell into each bin.  
  - This is useful to check the distribution of weights, activations, or gradients during training.

- **demo/matrix_table** → a table view of a 5×5 slice of the matrix with the raw numbers.  
  - Helpful when you want to inspect specific values rather than just the visualization.

- **demo/conf_matrix_table** → the backing data for the confusion matrix plot.  
  - For each (Actual, Predicted) pair, it shows how many samples were counted.

Together, these examples show the different ways W&B can log and visualize matrices: as plots, distributions, tables of numbers, or task-specific charts like confusion matrices.
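
To make the row/column convention of the confusion matrix concrete, the counts behind the panel can be reproduced by hand. The following is a minimal plain-Python sketch, not W&B's implementation; the helper name `confusion_counts` is made up for illustration, and the labels are the ones logged in the cell above.

```python
from typing import List, Sequence


def confusion_counts(
    y_true: Sequence[int], y_pred: Sequence[int], num_classes: int
) -> List[List[int]]:
    """Return a matrix where entry [actual][predicted] counts the samples."""
    matrix = [[0] * num_classes for _ in range(num_classes)]
    for actual, predicted in zip(y_true, y_pred):
        matrix[actual][predicted] += 1
    return matrix


# Same labels as in the demo cell above
true_labels = [0, 1, 0, 1, 0, 1, 0, 0, 1, 1]
pred_labels = [0, 1, 0, 0, 0, 1, 1, 0, 1, 1]

counts = confusion_counts(true_labels, pred_labels, num_classes=2)
print(counts)  # [[4, 1], [1, 4]]
```

The diagonal holds the correct predictions (4 true negatives and 4 true positives). The off-diagonal entries are the single false positive (row 0, column 1) and the single false negative (row 1, column 0).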

### 6) Attach files and artifacts

So far, we’ve logged **metrics** (scalars), **images**, and **matrices** directly into W&B.  
Another powerful feature is the ability to attach **files** to a run, so they are versioned and stored alongside your metrics.  

This is especially useful when you want to keep track of external resources such as:  
- Training logs or notes saved to a `.txt` file.  
- Example predictions saved to disk.  
- **Model checkpoints** (in Part 3 of this notebook, we’ll upload the best checkpoint of each run in the same way).  

In this simple example, we’ll create a tiny text file and upload it with `wandb.save`.  
Once logged, the file will appear in the run page under the **Files** tab, so you can always recover it later.  


In [None]:
# Write a small text file and attach it to the run with wandb.save
with open("hello_wandb.txt", "w") as f:
    f.write("Hello W&B from AA2!\n")

wandb.save("hello_wandb.txt")  # uploads this file to the run's Files tab

print("Attached a small text file to the run.")

### 7) Wrapping up Part 1

In this first part, you’ve learned how to set up W&B, create a project and runs, and log different types of data: **scalars, images, matrices, and files**.  This already gives you a complete toolkit to track experiments, but so far we’ve only worked with toy examples.  

In **Parts 2** and **3**, you’ll put this into practice by training a real **CNN** and using W&B to log training and test metrics, visualize predictions with confusion matrices, tune hyperparameters with Sweeps, and upload model checkpoints.

In [None]:
# Always finish your run to ensure everything is uploaded
wandb.finish()

## PART 2: Logging your Experiment



### 8) Initialize a new experiment
We will start by creating a new run and defining the hyperparameters that will be used for this experiment.



In [None]:
PROJECT = "aa2-wandb-lab"
RUN_NAME = "cnn_mnist"
hparams = {
    # model
    'kernel_size': 5,
    'num_inp_channels': 1,
    'num_out_fmaps_1': 6,
    'num_out_fmaps_2': 16,
    'num_classes': 10,

    # training
    'batch_size': 64,
    'num_epochs': 5,
    'test_batch_size': 64,
    'learning_rate': 1e-3,
    'log_interval': 100,
}

init_kwargs = dict(
    project=PROJECT,
    name=RUN_NAME,
    config=hparams
)
run = wandb.init(**init_kwargs)

### 9) Defining our Network
In this exercise we will define the same `PseudoLeNet` network used in the previous lab. However, this time we will define it so it accepts a `dict` containing the value of the parameters of the network, so we can define the model in a dynamic way.


In [None]:
import torch
import torch.nn as nn
from typing import Tuple, Dict, Any, List

class ConvBlock(nn.Module):

    def __init__(
            self,
            num_inp_channels: int,
            num_out_fmaps: int,
            kernel_size: int,
            pool_size: int=2) -> None:

        super().__init__()

        self.conv = nn.Conv2d(
            in_channels=num_inp_channels,
            out_channels=num_out_fmaps,
            kernel_size=(kernel_size, kernel_size))
        self.relu = nn.ReLU(inplace=True)
        self.maxpool = nn.MaxPool2d(kernel_size=(pool_size, pool_size))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.maxpool(self.relu(self.conv(x)))


class PseudoLeNet(nn.Module):

  def __init__(self, hparams: dict) -> None:
      super().__init__()
      # TODO: Define the zero-padding
      self.pad = ...

      # TODO: Define the convolutional layers according to the hyperparameters
      self.conv1 = ...
      self.conv2 = ...

      self.mlp = nn.Sequential(
          nn.Linear(in_features=hparams["num_out_fmaps_2"] * hparams["kernel_size"] * hparams["kernel_size"], out_features=120),
          nn.ReLU(inplace=True),
          nn.Linear(in_features=120, out_features=84),
          nn.ReLU(inplace=True),
          nn.Linear(in_features=84, out_features=hparams["num_classes"]),
          nn.LogSoftmax(dim=-1)
      )

  def forward(self, x: torch.Tensor) -> torch.Tensor:
      x = self.pad(x)
      x = self.conv1(x)
      x = self.conv2(x)

      bsz, nch, height, width = x.shape

      # TODO: Flatten the feature map with the reshape() operator
      # within each batch sample
      x = ...

      y = self.mlp(x)
      return y

In [None]:
x = torch.randn(1, 1, 28, 28)
network = PseudoLeNet(hparams)
y = network(x)
print(f"Output shape: {y.shape}")
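
The `in_features` expression in `self.mlp` relies on the feature map that reaches the MLP being 5×5 for 28×28 MNIST inputs, which happens to equal `kernel_size`. The sketch below traces the spatial size through the network in plain Python. It assumes a zero-padding of 2 pixels per side (the padding is left as a TODO above, so this value is an assumption), and unpadded stride-1 convolutions followed by 2×2 max-pooling as in `ConvBlock`; `conv_block_out` is a hypothetical helper.

```python
def conv_block_out(size: int, kernel_size: int, pool_size: int = 2) -> int:
    """Spatial size after an unpadded stride-1 convolution and max-pooling."""
    after_conv = size - kernel_size + 1
    return after_conv // pool_size


kernel_size = 5
num_out_fmaps_2 = 16
padding = 2  # assumption: 2 pixels of zero-padding on every side

size = 28 + 2 * padding                   # 32
size = conv_block_out(size, kernel_size)  # 32 -> 28 -> 14
size = conv_block_out(size, kernel_size)  # 14 -> 10 -> 5

in_features = num_out_fmaps_2 * size * size
print(size, in_features)  # 5 400
```

With a different kernel size the final map is generally not `kernel_size` wide (a kernel size of 3 gives 6×6 under the same assumptions), so the expression would need to be adapted before sweeping over `kernel_size`.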

### 10) Logging Training and Testing
In this part, we will train our network and learn how to log the training and testing loops to W&B for more convenient monitoring.



Start by creating the MNIST train/test dataloaders.

In [None]:
from torchvision import datasets, transforms

# Use a separate name for the pipeline so the `transforms` module is not
# shadowed (re-running the cell would otherwise fail)
transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize((0.1307,), (0.3081,))
])


# Dataset initializations

mnist_trainset = datasets.MNIST(
  root='data',
  train=True,
  download=True,
  transform=transform
)

# deterministically sample 1k train examples for demo purposes
seed = int(hparams.get("seed", 42))
g = torch.Generator()
g.manual_seed(seed)
num_samples = 1000
total = len(mnist_trainset)
selected_indices = torch.randperm(total, generator=g)[:num_samples].tolist()
subset_train = torch.utils.data.Subset(mnist_trainset, selected_indices)

mnist_testset = datasets.MNIST(
  root='data',
  train=False,
  download=True,
  transform=transform
)

# Dataloaders initialization

train_loader = torch.utils.data.DataLoader(
  dataset=subset_train,
  batch_size=hparams['batch_size'],
  shuffle=True,
  drop_last=True,
)

test_loader = torch.utils.data.DataLoader(
  dataset=mnist_testset,
  batch_size=hparams['test_batch_size'],
  shuffle=False,
  drop_last=False,  # keep every test sample
)

Define the accuracy metric used to evaluate the performance of your model.

In [None]:
def compute_accuracy(predicted_batch: torch.Tensor, label_batch: torch.Tensor) -> int:
    """
    Count the correct predictions in a batch:
      (1) take the argmax over the class dimension to get the predicted class
        index of each sample (see torch.argmax in the PyTorch documentation)
      (2) compare each predicted index with the corresponding entry of
        label_batch
      (3) sum the matches and return the total (a count, not a percentage)

    Parameters:
    -----------
    predicted_batch: torch.Tensor shape: [BATCH_SIZE, N_CLASSES]
        Batch of predictions
    label_batch: torch.Tensor shape: [BATCH_SIZE] or [BATCH_SIZE, 1]
        Batch of labels / ground truths.
    """
    pred = predicted_batch.argmax(dim=1, keepdim=True) # get the index of the max log-probability
    acum = pred.eq(label_batch.view_as(pred)).sum().item()
    return acum
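
The counting logic of `compute_accuracy` can be checked on a tiny hand-made batch. The sketch below mirrors the three steps of the docstring in plain Python (no PyTorch needed); `count_correct` and the toy scores are made up for illustration.

```python
from typing import List, Sequence


def count_correct(scores: Sequence[Sequence[float]], labels: Sequence[int]) -> int:
    """Count samples whose highest-scoring class matches the label."""
    predictions: List[int] = [
        max(range(len(row)), key=lambda c: row[c])  # argmax over classes
        for row in scores
    ]
    return sum(int(p == t) for p, t in zip(predictions, labels))


# Toy batch: 3 samples, 2 classes (values chosen by hand for illustration)
scores = [[0.1, 0.9], [0.8, 0.2], [0.3, 0.7]]
labels = [1, 0, 0]

correct = count_correct(scores, labels)
print(correct)  # 2
```

The function returns a count, so a percentage is obtained only later by dividing by the number of samples, as done at the end of the train and test loops.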

Complete the function to train the network for one epoch. Remember the five key steps:


1. Set all parameters' gradients to **zero**.
2. Perform the **forward** pass.
3. Compute the **loss** function.
4. Perform the **backward** pass.
5. **Update** the network's parameters.
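
The five steps can be followed on a model small enough to differentiate by hand. The sketch below is a plain-Python illustration with a single parameter `w` and one made-up training sample; in PyTorch, the gradient computation and the update are handled by autograd and the optimizer instead.

```python
w = 0.0          # single trainable parameter
lr = 0.1         # learning rate
x, y = 2.0, 6.0  # one made-up training sample; the loss is minimal at w = 3

losses = []
for step in range(20):
    grad = 0.0                    # 1) reset the gradient
    pred = w * x                  # 2) forward pass
    loss = (pred - y) ** 2        # 3) compute the loss
    grad += 2.0 * (pred - y) * x  # 4) backward pass: d(loss)/dw by hand
    w -= lr * grad                # 5) update the parameter
    losses.append(loss)

print(w)  # close to 3
```

Step 1 matters because gradients are accumulated (`+=`) in step 4: without the reset, the gradient from the previous iteration would leak into the current update.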


In [None]:
import numpy as np

def train_epoch(
    train_loader: torch.utils.data.DataLoader,
    network: torch.nn.Module,
    optimizer: torch.optim.Optimizer,
    criterion: torch.nn.Module,
    log_interval: int,
  ) -> Tuple[float, float]:

  # Switch the model to training mode
  network.train()

  train_loss = []
  acc = 0.
  avg_weight = 0.1
  for batch_idx, (data, target) in enumerate(train_loader):

      #TODO: Move input data and labels to the device
      data, target = ...

      #TODO: Set network gradients to 0.
      optimizer. ...

      #TODO: Forward batch of images through the network
      output = ...

      #TODO: Compute loss
      loss = ...

      #TODO: Compute backpropagation
      loss. ...

      #TODO: Update parameters of the network
      optimizer. ...

      #TODO:  Compute metrics
      acc += ...

      train_loss.append(loss.item())

      if batch_idx % log_interval == 0:
          print('Train Epoch: {} [{}/{} ({:.0f}%)]\tLoss: {:.6f}'.format(
              epoch, batch_idx * len(data), len(train_loader.dataset),
              100. * batch_idx / len(train_loader), loss.item()))

  avg_acc = 100. * acc / len(train_loader.dataset)

  return np.mean(train_loss), avg_acc
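
One detail worth noticing: `train_loader` is created with `drop_last=True`, so the last incomplete batch is skipped, while `avg_acc` divides by the full dataset size. The arithmetic below is a plain-Python sketch using the values from this lab (1000 training samples, batch size 64) to show the effect.

```python
num_samples = 1000  # size of subset_train in Part 2
batch_size = 64     # hparams['batch_size']

full_batches = num_samples // batch_size  # batches kept with drop_last=True
samples_seen = full_batches * batch_size  # samples actually processed per epoch

# Highest value avg_acc can reach if every processed sample is classified correctly
max_reported_acc = 100.0 * samples_seen / num_samples

print(full_batches, samples_seen, max_reported_acc)  # 15 960 96.0
```

The reported training accuracy is therefore slightly pessimistic; dividing by the number of processed samples instead would remove this offset.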

Complete the function to evaluate the network on the test set after a given epoch.

In [None]:
@torch.no_grad() # decorator: avoid computing gradients
def test_epoch(
    test_loader: torch.utils.data.DataLoader,
    network: torch.nn.Module,
  ) -> Tuple[float, float]:

  # Switch the model to evaluation mode
  network.eval()

  test_loss = []
  acc = 0
  for data, target in test_loader:

      data, target = data.to(device), target.to(device)
      output = network(data)

      #TODO: Apply the loss criterion and accumulate the loss
      loss = ...
      test_loss. ...

      #TODO: Compute number of correct predictions in the batch
      acc += ...

  # TODO: Average the accuracy and the loss over the whole test set
  test_acc = ...
  test_loss = ...

  print('\nTest set: Average loss: {:.4f}, Accuracy: {}/{} ({:.0f}%)\n'.format(
      test_loss, acc, len(test_loader.dataset), test_acc,
      ))

  return test_loss, test_acc

Finally, run the train/test loops and complete the code so it logs their respective metrics to W&B each epoch.

In [None]:
device = torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu')

# TODO: Create the network
network = ...

optimizer = torch.optim.RMSprop(network.parameters(), lr=hparams['learning_rate'])
criterion = nn.NLLLoss(reduction='mean')

for epoch in range(hparams['num_epochs']):

    # TODO: Compute & log the average training loss and accuracy for the current epoch
    train_loss, train_acc = ...

    # TODO: Compute & log the average test loss and accuracy for the current epoch
    test_loss, test_acc = ...

    wandb.log({
        "epoch": epoch,
        "train/loss": train_loss,
        "train/accuracy": train_acc,
        "test/loss": test_loss,
        "test/accuracy": test_acc,
    })

### 11) Logging Predictions

Now that we have trained our network, let's visualize its performance by logging a confusion matrix to W&B.

In [None]:
@torch.no_grad()
def get_all_predictions(
    test_loader: torch.utils.data.DataLoader,
    network: torch.nn.Module,
    device: torch.device,
) -> Tuple[List[int], List[int]]:
    """
    Collect all predictions and ground truth labels from the test set.

    Returns:
    --------
    all_preds: List[int]
        List of predicted class indices
    all_labels: List[int]
        List of ground truth class indices
    """
    network.eval()

    all_preds = []
    all_labels = []

    for data, label in test_loader:
        data = data.to(device)
        output = network(data)

        # TODO: Get predicted class indices (hint: use argmax)
        preds = ...

        all_preds.append(preds)
        all_labels.append(label)

    # Move to the CPU before converting: GPU tensors cannot be converted directly
    all_preds = torch.cat(all_preds).cpu()
    all_labels = torch.cat(all_labels).cpu()

    return all_preds.tolist(), all_labels.tolist()

# Collect predictions and labels
all_preds, all_labels = get_all_predictions(test_loader, network, device)

# Define class names for MNIST (digits 0-9)
class_names = [str(i) for i in range(10)]

# TODO: Log the confusion matrix to W&B as 'test/conf_matrix'
# Hint: use wandb.plot.confusion_matrix with the collected predictions and labels
run.log({ ... })

print("Confusion matrix logged to W&B!")

## PART 3: Hyperparameter Tuning with Sweeps

A Sweep in Weights & Biases is an automated hyperparameter search system that runs many training trials with different hyperparameter choices, collects their metrics, and helps you find the best configuration.

Check the Sweeps documentation for further details:

https://docs.wandb.ai/guides/sweeps

First we need to define the sweep configuration. This allows us to specify which metric to optimize, which hyperparameters to search over, and which search method to use.
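
As a rough guide to the shape of such a configuration, here is a sketch with a deliberately different, hypothetical search space (`dropout`, `weight_decay` and `val/loss` are not used in this lab). It only illustrates the three top-level entries: the search method, the metric, and the parameters.

```python
# Hypothetical search space, only to illustrate the structure of the dictionary
example_sweep_config = {
    "method": "grid",            # search strategy: "grid", "random" or "bayes"
    "metric": {
        "name": "val/loss",      # must match a key passed to wandb.log
        "goal": "minimize",      # or "maximize"
    },
    "parameters": {
        "dropout": {"values": [0.0, 0.25, 0.5]},  # discrete choices
        "weight_decay": {"values": [0.0, 1e-4]},
    },
}

for name, spec in example_sweep_config["parameters"].items():
    print(name, spec["values"])
```

Each key under `"parameters"` becomes an attribute of `wandb.config` inside a sweep run, so the names chosen here must match the ones the training function reads.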

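Try the following settings (the parameter names must match the ones read by the training function below):

- search method: `random`
- goal: maximize `test/acc`
- `learning_rate`: 1e-4, 1e-3, 1e-2
- `batch_size`: 32, 64, 128
- `num_out_fmaps_1`: 6, 8, 16
- `num_out_fmaps_2`: 16, 32
- `optimizer`: `adam`, `sgd`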
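
To get a feel for why a random search is suggested, it helps to count how many combinations the values above allow. The sketch below uses a plain dictionary of lists (this is not the sweep configuration format) and names the first feature-map parameter `num_out_fmaps_1`, the key read by the training function. The random draw only imitates the idea of random search; W&B performs its own sampling.

```python
import math
import random

# Candidate values as listed above
search_space = {
    "learning_rate": [1e-4, 1e-3, 1e-2],
    "batch_size": [32, 64, 128],
    "num_out_fmaps_1": [6, 8, 16],
    "num_out_fmaps_2": [16, 32],
    "optimizer": ["adam", "sgd"],
}

num_combinations = math.prod(len(values) for values in search_space.values())
print(num_combinations)  # 108

# Random search: draw each hyperparameter independently
rng = random.Random(0)  # fixed seed only to make the sketch repeatable
trial = {name: rng.choice(values) for name, values in search_space.items()}
print(trial)
```

A full grid would need 108 training runs, whereas the 10 runs used later in this lab sample only a small part of the space.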

In [None]:
# TODO: Define the sweep configuration dictionary below.
sweep_config = { ... }


With that, we can now define the training function that each sweep run will execute, based on our configuration.

In [None]:
def sweep_train():
    # Each sweep run gets its own wandb.init context; the sweep agent fills
    # the run's config with the hyperparameters chosen for this trial
    with wandb.init(project=PROJECT) as run_local:
        config = run_local.config

        # Build hyperparams dict for the model/training (keep other defaults from hparams)
        local_hparams = dict(
            kernel_size=hparams.get("kernel_size",5),
            num_inp_channels=hparams.get("num_inp_channels", 1),
            num_out_fmaps_1=int(config.num_out_fmaps_1),
            num_out_fmaps_2=int(config.num_out_fmaps_2),
            num_classes=hparams.get("num_classes", 10),
            batch_size=int(config.batch_size),
            num_epochs=hparams.get("num_epochs",5),
            test_batch_size=hparams.get("test_batch_size", config.batch_size),
            learning_rate=float(config.learning_rate),
            log_interval=hparams.get("log_interval", 100),
        )


        # deterministically sample 10k train examples
        seed = int(hparams.get("seed", 42))
        g = torch.Generator()
        g.manual_seed(seed)
        num_samples = 10000
        total = len(mnist_trainset)
        selected_indices = torch.randperm(total, generator=g)[:num_samples].tolist()
        subset_train = torch.utils.data.Subset(mnist_trainset, selected_indices)

        train_loader_local = torch.utils.data.DataLoader(
            subset_train,
            batch_size=local_hparams["batch_size"],
            shuffle=True,  # shuffle training batches, as in Part 2
            drop_last=True,
        )
        test_loader_local = torch.utils.data.DataLoader(
            mnist_testset,
            batch_size=local_hparams["test_batch_size"],
            shuffle=False,
            drop_last=False,
        )

        # Create model and move to device
        network_local = PseudoLeNet(local_hparams)
        network_local.to(device)

        # Optimizer selection
        if config.optimizer == "adam":
            optimizer_local = torch.optim.Adam(network_local.parameters(), lr=local_hparams["learning_rate"])
        else:
            optimizer_local = torch.optim.SGD(network_local.parameters(), lr=local_hparams["learning_rate"])

        criterion_local = nn.NLLLoss(reduction="mean")

        best_test_acc = 0.0
        # train / eval loop (train_epoch/test_epoch use the global name `epoch` when printing;
        # ensure it exists in globals so the existing function works as-is)
        for ep in range(local_hparams["num_epochs"]):
            # expose epoch as a global so train_epoch's print works
            globals()["epoch"] = ep

            train_loss, train_acc = train_epoch(
                train_loader_local,
                network_local,
                optimizer_local,
                criterion_local,
                local_hparams["log_interval"],
            )

            test_loss, test_acc = test_epoch(test_loader_local, network_local)

            all_preds, all_labels = get_all_predictions(test_loader_local, network_local, device)

            # Define class names for MNIST (digits 0-9)
            class_names = [str(i) for i in range(10)]

            # TODO: Log the confusion matrix to W&B as 'test/conf_matrix'
            # Hint: use wandb.plot.confusion_matrix with the collected predictions and labels

            # Log metrics to W&B
            wandb.log(
                {
                    "epoch": ep,
                    "train/loss": float(train_loss),
                    "train/acc": float(train_acc),
                    "test/loss": float(test_loss),
                    "test/acc": float(test_acc),
                }
            )

            # Optionally keep a checkpoint of the best model so far
            if test_acc > best_test_acc:
                best_test_acc = test_acc
                # Save the checkpoint locally and upload it to the run's files
                ckpt_path = f"best_model_ep{ep:02d}_acc{test_acc:.2f}.pt"
                torch.save(network_local.state_dict(), ckpt_path)
                wandb.save(ckpt_path)

        # context manager will finish the run

Now we have to initialize our sweep. This returns a sweep ID, which is used to create the agents responsible for running the training jobs.

In [None]:
sweep_id = wandb.sweep(sweep_config, project=PROJECT)
sweep_id

You can now open the link printed above to check the sweep's status.

Finally, we need to start the agent that launches the training runs. Then you can go back to the sweep URL and see the magic!

In [None]:

#TODO: Start the agent to run the sweep. Select only 10 runs to limit the time.
wandb.agent( ..., function=..., count=10)

A nice property of Sweeps is that agents can run on multiple machines at the same time. You only need the sweep ID: start an agent with it on another machine, and W&B hands out hyperparameter configurations to every agent.

If you open the generated sweep page, you will see something like this:

![W&B Sweeps – example](https://raw.githubusercontent.com/telecombcn-dl/labs-all/main/labs/wandb/img/wandb_sweeps.png)






## Main Panels in the *W&B Sweep Workspace*

### Sidebar (Runs)

* List of runs with colors; toggle visibility to focus comparisons.

### 1) **test/acc vs. created** (scatter)

* X: creation time, Y: test accuracy. Reveals time trends and outliers.

### 2) **Parameter importance (w.r.t. test/acc)**

* Bars show influence of each hyperparameter; correlation indicates direction (positive/negative effect on accuracy).
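
The meaning of a positive or negative correlation can be illustrated with a hand-computed Pearson coefficient. The snippet below is a plain-Python sketch on made-up numbers (not results from this lab), and `pearson` is a hypothetical helper; the W&B panel computes its own statistics, and its importance score is a separate quantity from the correlation.

```python
import math
from typing import Sequence


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equally long sequences."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Made-up toy values, for illustration only
batch_sizes = [32, 64, 128, 32, 64, 128]
accuracies = [95.0, 93.0, 90.0, 96.0, 92.0, 89.0]

r = pearson(batch_sizes, accuracies)
print(r < 0)  # True: larger batch sizes go with lower accuracy in this toy data
```

A negative value means the metric tends to decrease as the hyperparameter increases, and a positive value means the opposite; neither says anything about causation.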

### 3) **Parallel coordinates**

* Each line is a run; axes are hyperparameters/metrics; color encodes performance. Highlights combinations associated with better results.

### 4) **Confusion Matrix Curve**

* Compact per-class confusion across runs. Rows = actual, columns = predicted. Identifies classes with frequent misclassification.

### 5) **test/loss** (curve)

* Loss over steps/epochs per run. Lower and stable indicates better convergence.

### 6) **test/acc** (curve)

* Accuracy over steps/epochs per run. Shows learning progress and final performance ceiling.


## Conclusions

- Summary of W&B usage
    - Demonstrated W&B core workflow: install, login, init runs, and finish runs.
    - Showed how to log diverse data types: scalars, images, matrices (heatmaps/histograms/tables), confusion matrices, and files/artifacts.
    - Illustrated project/run organization and how metadata (config, tags, notes) is captured for reproducibility and searchability.

- W&B capabilities highlighted
    - Real‑time charts and dashboards (Charts, Panels) for interactive inspection and comparison of metrics across runs.
    - Artifacts and files for versioned storage of checkpoints, datasets, and auxiliary outputs.
    - Confusion matrix and specialized plots (wandb.plot) to inspect model predictions and per‑class behavior.
    - Sweeps + agents to automate hyperparameter search at scale (random/Bayes/grid strategies, distributed agents).

- Reproducibility & collaboration
    - Configs, tags, and notes recorded with runs provide a canonical experiment record.
    - Artifacts enable exact restoration of models and datasets and preserve lineage between versions.
    - Project and Runs views plus reports make it easy to share results and collaborate across teams.

- Recommended W&B‑centric next steps
    - Convert best checkpoints into W&B artifacts and attach descriptive metadata (hyperparams, dataset hash, training script commit).
    - Build reusable dashboards/reports that surface key comparisons (per‑class metrics, training curves, top artifacts).
    - Expand Sweep strategy (e.g., Bayesian sweeps), and use agent orchestration across machines for parallel search.
    - Enable alerts, monitor system metrics, and integrate W&B with CI/GitHub to link experiments to code commits and PRs.
