# Hyperparameter Tuning
*(Note: This notebook runs significantly faster if you have access to a GPU. Use either the GPUHub, Google Colab, or your own GPU.)*

In this project, you will optimize the hyperparameters of a model in 3 stages.

## Paraphrase Detection
We fine-tune [distilbert-base-uncased](https://huggingface.co/distilbert-base-uncased) on [MRPC](https://huggingface.co/datasets/glue/viewer/mrpc/train) (the Microsoft Research Paraphrase Corpus), a paraphrase detection dataset from the GLUE benchmark. This notebook is adapted from a [PyTorch Lightning example](https://lightning.ai/docs/pytorch/1.9.5/notebooks/lightning_examples/text-transformers.html).

In [None]:
%pip install -q torch transformers lightning datasets wandb evaluate ipywidgets


The next 5 cells are:
* Imports.
* A Weights & Biases login via `wandb.login`.
* The `GLUEDataModule`, which loads the task's dataset and creates dataloaders for the training and validation sets.
* The `GLUETransformer`, which implements the model forward pass and the training/validation steps. Check the `self.log` calls there to see what is logged.
* The last cell, which runs training with the given parameters.
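
For MRPC, the GLUE metric loaded in `GLUETransformer` reports `accuracy` and `f1`, which are logged next to `val_loss`. The following is a minimal pure-Python sketch of how those two numbers relate to predictions and labels. The helper `accuracy_and_f1` and the toy lists are hypothetical and only illustrate the definitions; the notebook itself relies on `evaluate.load("glue", "mrpc")`.

```python
def accuracy_and_f1(predictions, references):
    """Binary accuracy and F1 (positive class = 1) for two equal-length lists."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")

    pairs = list(zip(predictions, references))
    correct = sum(p == r for p, r in pairs)
    tp = sum(p == 1 and r == 1 for p, r in pairs)  # true positives
    fp = sum(p == 1 and r == 0 for p, r in pairs)  # false positives
    fn = sum(p == 0 and r == 1 for p, r in pairs)  # false negatives

    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return {"accuracy": correct / len(pairs), "f1": f1}


# Toy data (made up): 1 = paraphrase, 0 = not a paraphrase
preds = [1, 1, 0, 1, 0, 0]
labels = [1, 0, 0, 1, 1, 0]
metrics = accuracy_and_f1(preds, labels)
print(metrics)
```

Because MRPC contains more paraphrase pairs than non-paraphrase pairs, it can be worth watching both values when comparing runs rather than accuracy alone.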

In [None]:
from datetime import datetime
from typing import Optional
from lightning.pytorch.loggers import WandbLogger

import wandb
import datasets
import evaluate
import lightning as L
import torch
from torch.utils.data import DataLoader
from transformers import (
    AutoConfig,
    AutoModelForSequenceClassification,
    AutoTokenizer,
    get_linear_schedule_with_warmup,
)

In [None]:
wandb.login(relogin=True)  # uses the WANDB_API_KEY environment variable if set, otherwise prompts; avoid hard-coding the key

wandb: No netrc file found, creating one.
wandb: Appending key for api.wandb.ai to your netrc file: /root/.netrc
wandb: Currently logged in as: benjamin-amhof (benjamin-amhof-hochschule-luzern) to https://api.wandb.ai. Use `wandb login --relogin` to force relogin


True

In [None]:
class GLUEDataModule(L.LightningDataModule):
    task_text_field_map = {
        "cola": ["sentence"],
        "sst2": ["sentence"],
        "mrpc": ["sentence1", "sentence2"],
        "qqp": ["question1", "question2"],
        "stsb": ["sentence1", "sentence2"],
        "mnli": ["premise", "hypothesis"],
        "qnli": ["question", "sentence"],
        "rte": ["sentence1", "sentence2"],
        "wnli": ["sentence1", "sentence2"],
        "ax": ["premise", "hypothesis"],
    }

    glue_task_num_labels = {
        "cola": 2,
        "sst2": 2,
        "mrpc": 2,
        "qqp": 2,
        "stsb": 1,
        "mnli": 3,
        "qnli": 2,
        "rte": 2,
        "wnli": 2,
        "ax": 3,
    }

    loader_columns = [
        "datasets_idx",
        "input_ids",
        "token_type_ids",
        "attention_mask",
        "start_positions",
        "end_positions",
        "labels",
    ]

    def __init__(
        self,
        model_name_or_path: str,
        task_name: str = "mrpc",
        max_seq_length: int = 128,  # Hyperparameter 6
        train_batch_size: int = 32,
        eval_batch_size: int = 32,
        **kwargs,
    ):
        super().__init__()
        self.model_name_or_path = model_name_or_path
        self.task_name = task_name
        self.max_seq_length = max_seq_length
        self.train_batch_size = train_batch_size
        self.eval_batch_size = eval_batch_size

        self.text_fields = self.task_text_field_map[task_name]
        self.num_labels = self.glue_task_num_labels[task_name]
        self.tokenizer = AutoTokenizer.from_pretrained(self.model_name_or_path, use_fast=True)

    def setup(self, stage: str):
        self.dataset = datasets.load_dataset("glue", self.task_name)

        for split in self.dataset.keys():
            self.dataset[split] = self.dataset[split].map(
                self.convert_to_features,
                batched=True,
                remove_columns=["label"],
            )
            self.columns = [c for c in self.dataset[split].column_names if c in self.loader_columns]
            self.dataset[split].set_format(type="torch", columns=self.columns)

        self.eval_splits = [x for x in self.dataset.keys() if "validation" in x]

    def prepare_data(self):
        datasets.load_dataset("glue", self.task_name)
        AutoTokenizer.from_pretrained(self.model_name_or_path, use_fast=True)

    def train_dataloader(self):
        return DataLoader(self.dataset["train"], batch_size=self.train_batch_size, shuffle=True)

    def val_dataloader(self):
        if len(self.eval_splits) == 1:
            return DataLoader(self.dataset["validation"], batch_size=self.eval_batch_size)
        elif len(self.eval_splits) > 1:
            return [DataLoader(self.dataset[x], batch_size=self.eval_batch_size) for x in self.eval_splits]

    def test_dataloader(self):
        if len(self.eval_splits) == 1:
            return DataLoader(self.dataset["test"], batch_size=self.eval_batch_size)
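        elif len(self.eval_splits) > 1:
            # Tasks with several validation splits (e.g. MNLI) also have several test splits
            test_splits = [x for x in self.dataset.keys() if "test" in x]
            return [DataLoader(self.dataset[x], batch_size=self.eval_batch_size) for x in test_splits]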

    def convert_to_features(self, example_batch, indices=None):
        # Either encode single sentence or sentence pairs
        if len(self.text_fields) > 1:
            texts_or_text_pairs = list(zip(example_batch[self.text_fields[0]], example_batch[self.text_fields[1]]))
        else:
            texts_or_text_pairs = example_batch[self.text_fields[0]]

        # Tokenize the text/text pairs
        features = self.tokenizer.batch_encode_plus(
            texts_or_text_pairs, max_length=self.max_seq_length, padding="max_length", truncation=True
        )

        # Rename label to labels to make it easier to pass to model forward
        features["labels"] = example_batch["label"]

        return features
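
The pairing logic in `convert_to_features` can be tried out without a tokenizer. The sketch below uses a made-up batch in the shape that `datasets.map(batched=True)` passes in (a dict of equal-length lists); `build_texts` is a hypothetical helper that mirrors only the single-sentence versus sentence-pair branch.

```python
# Stand-in batch: column name -> list of values. The sentences are invented.
example_batch = {
    "sentence1": ["The cat sat on the mat.", "Shares rose 3% on Monday."],
    "sentence2": ["A cat was sitting on the mat.", "The weather was mild."],
    "label": [1, 0],
}
text_fields = ["sentence1", "sentence2"]  # the "mrpc" entry of task_text_field_map


def build_texts(batch, fields):
    """Return (text_a, text_b) tuples for two-field tasks, plain strings otherwise."""
    if len(fields) > 1:
        return list(zip(batch[fields[0]], batch[fields[1]]))
    return batch[fields[0]]


pairs = build_texts(example_batch, text_fields)
singles = build_texts(example_batch, ["sentence1"])

print(pairs[0])    # one tuple per example, passed to the tokenizer as a text pair
print(singles[0])  # single-sentence tasks such as "cola" or "sst2" stay as strings
```

Each tuple is encoded as one sequence, so `max_seq_length` has to cover both sentences plus the special tokens.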

In [None]:
class GLUETransformer(L.LightningModule):
    def __init__(
        self,
        model_name_or_path: str,
        num_labels: int,
        task_name: str,
        learning_rate: float = 2e-5, # Hyperparameter 1
        warmup_steps: int = 0, # Hyperparameter 2
        weight_decay: float = 0.0, # Hyperparameter 3
        train_batch_size: int = 32, # Hyperparameter 4
        eval_batch_size: int = 32,
        lr_scheduler_type: str = "linear",  # Hyperparameter 5: "linear", "cosine", or "constant"
        optimizer_type: str = "adamw",       # Hyperparameter 7: "adamw", "adam", or "sgd"
        adam_beta1: float = 0.9,             # Hyperparameter 8
        adam_beta2: float = 0.999,           # Hyperparameter 9
        adam_epsilon: float = 1e-8,          # Hyperparameter 10
        eval_splits: Optional[list] = None,
        **kwargs,
    ):
        super().__init__()

        self.save_hyperparameters()

        self.config = AutoConfig.from_pretrained(model_name_or_path, num_labels=num_labels)
        self.model = AutoModelForSequenceClassification.from_pretrained(model_name_or_path, config=self.config)
        self.metric = evaluate.load(
            "glue", self.hparams.task_name, experiment_id=datetime.now().strftime("%d-%m-%Y_%H-%M-%S")
        )

        self.validation_step_outputs = []

    def forward(self, **inputs):
        return self.model(**inputs)

    def training_step(self, batch, batch_idx):
        outputs = self(**batch)
        loss = outputs[0]
        return loss

    def validation_step(self, batch, batch_idx, dataloader_idx=0):
        outputs = self(**batch)
        val_loss, logits = outputs[:2]

        if self.hparams.num_labels > 1:
            preds = torch.argmax(logits, axis=1)
        elif self.hparams.num_labels == 1:
            preds = logits.squeeze()

        labels = batch["labels"]
        self.validation_step_outputs.append({"loss": val_loss, "preds": preds, "labels": labels})
        return val_loss

    def on_validation_epoch_end(self):
        if self.hparams.task_name == "mnli":
            for i, output in enumerate(self.validation_step_outputs):
                # matched or mismatched
                split = self.hparams.eval_splits[i].split("_")[-1]
                preds = torch.cat([x["preds"] for x in output]).detach().cpu().numpy()
                labels = torch.cat([x["labels"] for x in output]).detach().cpu().numpy()
                loss = torch.stack([x["loss"] for x in output]).mean()
                self.log(f"val_loss_{split}", loss, prog_bar=True)
                split_metrics = {
                    f"{k}_{split}": v for k, v in self.metric.compute(predictions=preds, references=labels).items()
                }
                self.log_dict(split_metrics, prog_bar=True)
            self.validation_step_outputs.clear()
            return loss

        preds = torch.cat([x["preds"] for x in self.validation_step_outputs]).detach().cpu().numpy()
        labels = torch.cat([x["labels"] for x in self.validation_step_outputs]).detach().cpu().numpy()
        loss = torch.stack([x["loss"] for x in self.validation_step_outputs]).mean()
        self.log("val_loss", loss, prog_bar=True)
        self.log_dict(self.metric.compute(predictions=preds, references=labels), prog_bar=True)
        self.validation_step_outputs.clear()

    def configure_optimizers(self):
        """Prepare the optimizer and the learning rate schedule selected in the hyperparameters."""
        model = self.model
        no_decay = ["bias", "LayerNorm.weight"]
        optimizer_grouped_parameters = [
            {
                "params": [p for n, p in model.named_parameters() if not any(nd in n for nd in no_decay)],
                "weight_decay": self.hparams.weight_decay,
            },
            {
                "params": [p for n, p in model.named_parameters() if any(nd in n for nd in no_decay)],
                "weight_decay": 0.0,
            },
        ]
        # Select the optimizer
        if self.hparams.optimizer_type == "adamw":
            optimizer = torch.optim.AdamW(
                optimizer_grouped_parameters,
                lr=self.hparams.learning_rate,
                betas=(self.hparams.adam_beta1, self.hparams.adam_beta2),
                eps=self.hparams.adam_epsilon,
            )
        elif self.hparams.optimizer_type == "adam":
            optimizer = torch.optim.Adam(
                optimizer_grouped_parameters,
                lr=self.hparams.learning_rate,
                betas=(self.hparams.adam_beta1, self.hparams.adam_beta2),
                eps=self.hparams.adam_epsilon,
            )
        elif self.hparams.optimizer_type == "sgd":
            optimizer = torch.optim.SGD(
                optimizer_grouped_parameters,
                lr=self.hparams.learning_rate,
                momentum=0.9,
                nesterov=True,
            )
        else:
            raise ValueError(f"Unknown optimizer type: {self.hparams.optimizer_type}")

        # Select the learning rate scheduler
        if self.hparams.lr_scheduler_type == "linear":
            from transformers import get_linear_schedule_with_warmup
            scheduler = get_linear_schedule_with_warmup(
                optimizer,
                num_warmup_steps=self.hparams.warmup_steps,
                num_training_steps=self.trainer.estimated_stepping_batches,
            )
        elif self.hparams.lr_scheduler_type == "cosine":
            from transformers import get_cosine_schedule_with_warmup
            scheduler = get_cosine_schedule_with_warmup(
                optimizer,
                num_warmup_steps=self.hparams.warmup_steps,
                num_training_steps=self.trainer.estimated_stepping_batches,
            )
        elif self.hparams.lr_scheduler_type == "constant":
            from transformers import get_constant_schedule_with_warmup
            scheduler = get_constant_schedule_with_warmup(
                optimizer,
                num_warmup_steps=self.hparams.warmup_steps,
            )
        else:
            raise ValueError(f"Unknown scheduler type: {self.hparams.lr_scheduler_type}")
        scheduler = {"scheduler": scheduler, "interval": "step", "frequency": 1}
        return [optimizer], [scheduler]
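
To get a feel for the three `lr_scheduler_type` options, the sketch below computes the factor by which the base learning rate is multiplied at a given optimizer step. The formulas are intended to follow the default behaviour of the `transformers` helpers used above (cosine with its default half cycle), but treat them as an illustration rather than a reference. The function names and the value `total_steps = 300` are assumptions made for this example.

```python
import math


def warmup_factor(step, warmup_steps):
    """Linear ramp from 0 to 1 over the warmup phase."""
    return step / max(1, warmup_steps)


def linear_multiplier(step, warmup_steps, total_steps):
    if step < warmup_steps:
        return warmup_factor(step, warmup_steps)
    # Linear decay from 1 at the end of warmup to 0 at total_steps
    return max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))


def cosine_multiplier(step, warmup_steps, total_steps):
    if step < warmup_steps:
        return warmup_factor(step, warmup_steps)
    # Half a cosine wave from 1 down to 0 over the remaining steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return max(0.0, 0.5 * (1.0 + math.cos(math.pi * progress)))


def constant_multiplier(step, warmup_steps):
    if step < warmup_steps:
        return warmup_factor(step, warmup_steps)
    return 1.0  # no decay after warmup


base_lr = 2e-5
warmup_steps = 100
total_steps = 300  # hypothetical; the notebook uses trainer.estimated_stepping_batches

for step in (0, 50, 100, 200, 300):
    print(
        step,
        base_lr * linear_multiplier(step, warmup_steps, total_steps),
        base_lr * cosine_multiplier(step, warmup_steps, total_steps),
        base_lr * constant_multiplier(step, warmup_steps),
    )
```

Note that `warmup_steps` is an absolute number of optimizer steps. If it is close to or larger than the total number of training steps, the run spends most of its time below the nominal learning rate, which is worth keeping in mind when comparing experiments with different warmup settings and batch sizes.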


In [None]:
DATASET = "mrpc"
# PROJECT = "week1_exploration"
PROJECT = "week2_finetuning"
experiments = [
    # # ============================================
    # # BASELINE
    # # ============================================
    # {
    #     "name": f"{DATASET}___{PROJECT}___baseline",
    #     "tags": ["week1", "baseline"],
    #     "description": "Original notebook settings - our reference point",

    #     "learning_rate": 2e-5,
    #     "warmup_steps": 0,
    #     "weight_decay": 0.0,
    #     "train_batch_size": 32,
    #     "max_seq_length": 128,
    #     "lr_scheduler_type": "linear",
    #     "optimizer_type": "adamw",
    # },

    # # ============================================
    # # EXPERIMENTS - Just trying things!
    # # ============================================

    # # Exp 1: Let's go fast - high LR!
    # {
    #     "name": f"{DATASET}___{PROJECT}___exp01_fast_learning",
    #     "tags": ["week1", "high_lr"],
    #     "description": "Maybe we can learn faster with higher LR",

    #     "learning_rate": 5e-5,
    #     "warmup_steps": 0,
    #     "weight_decay": 0.0,
    #     "train_batch_size": 32,
    #     "max_seq_length": 128,
    #     "lr_scheduler_type": "linear",
    #     "optimizer_type": "adamw",
    # },

    # # Exp 2: Super aggressive - all high!
    # {
    #     "name": f"{DATASET}___{PROJECT}___exp02_super_aggressive",
    #     "tags": ["week1", "aggressive"],
    #     "description": "What if we just crank everything up?",

    #     "learning_rate": 1e-4,
    #     "warmup_steps": 200,
    #     "weight_decay": 0.01,
    #     "train_batch_size": 32,
    #     "max_seq_length": 128,
    #     "lr_scheduler_type": "linear",
    #     "optimizer_type": "adamw",
    # },

    # # Exp 3: Tiny steps - be careful
    # {
    #     "name": f"{DATASET}___{PROJECT}___exp03_tiny_steps",
    #     "tags": ["week1", "conservative"],
    #     "description": "Maybe we need to be more careful",

    #     "learning_rate": 1e-5,
    #     "warmup_steps": 100,
    #     "weight_decay": 0.0,
    #     "train_batch_size": 32,
    #     "max_seq_length": 128,
    #     "lr_scheduler_type": "linear",
    #     "optimizer_type": "adamw",
    # },

    # # Exp 4: Small batches - maybe more updates help?
    # {
    #     "name": f"{DATASET}___{PROJECT}___exp04_small_batches",
    #     "tags": ["week1", "batch_size"],
    #     "description": "Smaller batches = more gradient updates",

    #     "learning_rate": 3e-5,
    #     "warmup_steps": 0,
    #     "weight_decay": 0.0,
    #     "train_batch_size": 16,
    #     "max_seq_length": 128,
    #     "lr_scheduler_type": "linear",
    #     "optimizer_type": "adamw",
    # },

    # # Exp 5: Regularize heavily - prevent overfitting
    # {
    #     "name": f"{DATASET}___{PROJECT}___exp05_heavy_regularization",
    #     "tags": ["week1", "regularization"],
    #     "description": "Strong weight decay to fight overfitting",

    #     "learning_rate": 2e-5,
    #     "warmup_steps": 0,
    #     "weight_decay": 0.1,
    #     "train_batch_size": 32,
    #     "max_seq_length": 128,
    #     "lr_scheduler_type": "linear",
    #     "optimizer_type": "adamw",
    # },

    # # Exp 6: Warmup party - long warmup
    # {
    #     "name": f"{DATASET}___{PROJECT}___exp06_long_warmup",
    #     "tags": ["week1", "warmup"],
    #     "description": "Give the model time to warm up properly",

    #     "learning_rate": 2e-5,
    #     "warmup_steps": 200,
    #     "weight_decay": 0.0,
    #     "train_batch_size": 32,
    #     "max_seq_length": 128,
    #     "lr_scheduler_type": "linear",
    #     "optimizer_type": "adamw",
    # },

    # # Exp 7: Cosine vibes - smooth schedule
    # {
    #     "name": f"{DATASET}___{PROJECT}___exp07_cosine_smooth",
    #     "tags": ["week1", "scheduler"],
    #     "description": "Cosine schedule for smoother learning",

    #     "learning_rate": 3e-5,
    #     "warmup_steps": 100,
    #     "weight_decay": 0.01,
    #     "train_batch_size": 32,
    #     "max_seq_length": 128,
    #     "lr_scheduler_type": "cosine",
    #     "optimizer_type": "adamw",
    # },

    # # Exp 8: Old school SGD - classic approach
    # {
    #     "name": f"{DATASET}___{PROJECT}___exp08_old_school_sgd",
    #     "tags": ["week1", "optimizer"],
    #     "description": "Maybe SGD is better than fancy Adam",

    #     "learning_rate": 1e-4,  # SGD likes higher LR
    #     "warmup_steps": 50,
    #     "weight_decay": 0.01,
    #     "train_batch_size": 32,
    #     "max_seq_length": 128,
    #     "lr_scheduler_type": "linear",
    #     "optimizer_type": "sgd",
    # },

    # # Exp 9: Short sequences - faster training
    # {
    #     "name": f"{DATASET}___{PROJECT}___exp09_short_sequences",
    #     "tags": ["week1", "sequence_length"],
    #     "description": "Shorter sequences might be enough",

    #     "learning_rate": 2e-5,
    #     "warmup_steps": 0,
    #     "weight_decay": 0.0,
    #     "train_batch_size": 32,
    #     "max_seq_length": 96,
    #     "lr_scheduler_type": "linear",
    #     "optimizer_type": "adamw",
    # },

    # # Exp 10: Long sequences - capture more context
    # {
    #     "name": f"{DATASET}___{PROJECT}___exp10_long_sequences",
    #     "tags": ["week1", "sequence_length"],
    #     "description": "Maybe we need longer context",

    #     "learning_rate": 2e-5,
    #     "warmup_steps": 0,
    #     "weight_decay": 0.0,
    #     "train_batch_size": 32,
    #     "max_seq_length": 192,
    #     "lr_scheduler_type": "linear",
    #     "optimizer_type": "adamw",
    # },

    # # Exp 11: Balanced combo - middle ground
    # {
    #     "name": f"{DATASET}___{PROJECT}___exp11_balanced_middle",
    #     "tags": ["week1", "balanced"],
    #     "description": "Not too aggressive, not too conservative",

    #     "learning_rate": 3e-5,
    #     "warmup_steps": 100,
    #     "weight_decay": 0.01,
    #     "train_batch_size": 32,
    #     "max_seq_length": 128,
    #     "lr_scheduler_type": "linear",
    #     "optimizer_type": "adamw",
    # },

    # # Exp 12: High LR + Warmup combo
    # {
    #     "name": f"{DATASET}___{PROJECT}___exp12_high_lr_safe_warmup",
    #     "tags": ["week1", "combo"],
    #     "description": "High LR but with warmup safety net",

    #     "learning_rate": 5e-5,
    #     "warmup_steps": 150,
    #     "weight_decay": 0.0,
    #     "train_batch_size": 32,
    #     "max_seq_length": 128,
    #     "lr_scheduler_type": "linear",
    #     "optimizer_type": "adamw",
    # },

    # # Exp 13: Small batch + regularization
    # {
    #     "name": f"{DATASET}___{PROJECT}___exp13_small_batch_regularized",
    #     "tags": ["week1", "combo"],
    #     "description": "Small batches with regularization",

    #     "learning_rate": 2e-5,
    #     "warmup_steps": 50,
    #     "weight_decay": 0.05,
    #     "train_batch_size": 16,
    #     "max_seq_length": 128,
    #     "lr_scheduler_type": "linear",
    #     "optimizer_type": "adamw",
    # },

    # # Exp 14: Constant LR - no decay
    # {
    #     "name": f"{DATASET}___{PROJECT}___exp14_constant_lr",
    #     "tags": ["week1", "scheduler"],
    #     "description": "What if we don't decay the LR at all?",

    #     "learning_rate": 2e-5,
    #     "warmup_steps": 100,
    #     "weight_decay": 0.0,
    #     "train_batch_size": 32,
    #     "max_seq_length": 128,
    #     "lr_scheduler_type": "constant",
    #     "optimizer_type": "adamw",
    # },

    # # Exp 15: Kitchen sink - everything together
    # {
    #     "name": f"{DATASET}___{PROJECT}___exp15_kitchen_sink",
    #     "tags": ["week1", "combo"],
    #     "description": "Throw everything at it and see what sticks",

    #     "learning_rate": 4e-5,
    #     "warmup_steps": 150,
    #     "weight_decay": 0.02,
    #     "train_batch_size": 16,
    #     "max_seq_length": 128,
    #     "lr_scheduler_type": "cosine",
    #     "optimizer_type": "adamw",
    # },
    # ============================================
    # ADDITIONAL EXPERIMENTS (16-20)
    # ============================================

    # Exp 16: Medium-high LR with cosine - best of both worlds?
    # {
    #     "name": f"{DATASET}___{PROJECT}___exp16_mediumhigh_lr_cosine",
    #     "tags": ["week1", "combo", "promising"],
    #     "description": "Medium-high LR with smooth cosine decay",

    #     "learning_rate": 4e-5,
    #     "warmup_steps": 100,
    #     "weight_decay": 0.0,
    #     "train_batch_size": 32,
    #     "max_seq_length": 128,
    #     "lr_scheduler_type": "cosine",
    #     "optimizer_type": "adamw",
    # },

    # # Exp 17: Moderate everything - safe optimization
    # {
    #     "name": f"{DATASET}___{PROJECT}___exp17_moderate_safe",
    #     "tags": ["week1", "balanced"],
    #     "description": "Moderate settings across the board",

    #     "learning_rate": 2.5e-5,
    #     "warmup_steps": 50,
    #     "weight_decay": 0.01,
    #     "train_batch_size": 32,
    #     "max_seq_length": 128,
    #     "lr_scheduler_type": "linear",
    #     "optimizer_type": "adamw",
    # },

    # # Exp 18: Large batch - fewer but bigger updates
    # {
    #     "name": f"{DATASET}___{PROJECT}___exp18_large_batch",
    #     "tags": ["week1", "batch_size"],
    #     "description": "Larger batches for more stable gradients",

    #     "learning_rate": 3e-5,  # Scale LR with batch size
    #     "warmup_steps": 100,
    #     "weight_decay": 0.01,
    #     "train_batch_size": 64,
    #     "max_seq_length": 128,
    #     "lr_scheduler_type": "linear",
    #     "optimizer_type": "adamw",
    # },

    # # Exp 19: Light regularization + warmup
    # {
    #     "name": f"{DATASET}___{PROJECT}___exp19_light_reg_warmup",
    #     "tags": ["week1", "regularization", "combo"],
    #     "description": "Light weight decay with gentle warmup",

    #     "learning_rate": 3e-5,
    #     "warmup_steps": 100,
    #     "weight_decay": 0.005,
    #     "train_batch_size": 32,
    #     "max_seq_length": 128,
    #     "lr_scheduler_type": "linear",
    #     "optimizer_type": "adamw",
    # },

    # # Exp 20: Cosine no warmup - direct smooth decay
    # {
    #     "name": f"{DATASET}___{PROJECT}___exp20_cosine_no_warmup",
    #     "tags": ["week1", "scheduler"],
    #     "description": "Cosine schedule starting immediately",

    #     "learning_rate": 3e-5,
    #     "warmup_steps": 0,
    #     "weight_decay": 0.01,
    #     "train_batch_size": 32,
    #     "max_seq_length": 128,
    #     "lr_scheduler_type": "cosine",
    #     "optimizer_type": "adamw",
    # },
    # ============================================
    # NEW EXPERIMENTS (21-33) - More Diversity!
    # ============================================

    # COSINE SCHEDULER EXPERIMENTS (3 new)
    # ============================================

    # Exp 21: Cosine with SGD - smooth decay meets classic optimizer
    # {
    #     "name": f"{DATASET}___{PROJECT}___exp21_cosine_sgd",
    #     "tags": ["week1", "scheduler", "optimizer", "combo"],
    #     "description": "Cosine schedule with SGD optimizer",

    #     "learning_rate": 8e-5,  # SGD needs higher LR
    #     "warmup_steps": 100,
    #     "weight_decay": 0.01,
    #     "train_batch_size": 32,
    #     "max_seq_length": 128,
    #     "lr_scheduler_type": "cosine",
    #     "optimizer_type": "sgd",
    # },

    # # Exp 22: Cosine with large batch - stable smooth training
    # {
    #     "name": f"{DATASET}___{PROJECT}___exp22_cosine_large_batch",
    #     "tags": ["week1", "scheduler", "batch_size"],
    #     "description": "Cosine decay with large batches",

    #     "learning_rate": 4e-5,
    #     "warmup_steps": 150,
    #     "weight_decay": 0.005,
    #     "train_batch_size": 64,
    #     "max_seq_length": 128,
    #     "lr_scheduler_type": "cosine",
    #     "optimizer_type": "adamw",
    # },

    # # Exp 23: Cosine with long sequences - context + smooth decay
    # {
    #     "name": f"{DATASET}___{PROJECT}___exp23_cosine_long_seq",
    #     "tags": ["week1", "scheduler", "sequence_length"],
    #     "description": "Cosine schedule with longer context",

    #     "learning_rate": 3e-5,
    #     "warmup_steps": 100,
    #     "weight_decay": 0.01,
    #     "train_batch_size": 32,
    #     "max_seq_length": 192,
    #     "lr_scheduler_type": "cosine",
    #     "optimizer_type": "adamw",
    # },

    # # CONSTANT SCHEDULER EXPERIMENTS (3 new)
    # # ============================================

    # # Exp 24: Constant with SGD - no decay classic
    # {
    #     "name": f"{DATASET}___{PROJECT}___exp24_constant_sgd",
    #     "tags": ["week1", "scheduler", "optimizer"],
    #     "description": "Constant LR with SGD - old school style",

    #     "learning_rate": 5e-5,
    #     "warmup_steps": 50,
    #     "weight_decay": 0.01,
    #     "train_batch_size": 32,
    #     "max_seq_length": 128,
    #     "lr_scheduler_type": "constant",
    #     "optimizer_type": "sgd",
    # },

    # # Exp 25: Constant with small batches - steady noisy updates
    # {
    #     "name": f"{DATASET}___{PROJECT}___exp25_constant_small_batch",
    #     "tags": ["week1", "scheduler", "batch_size"],
    #     "description": "Constant LR with small batches",

    #     "learning_rate": 3e-5,
    #     "warmup_steps": 100,
    #     "weight_decay": 0.02,
    #     "train_batch_size": 16,
    #     "max_seq_length": 128,
    #     "lr_scheduler_type": "constant",
    #     "optimizer_type": "adamw",
    # },

    # # Exp 26: Constant with short sequences - fast and steady
    # {
    #     "name": f"{DATASET}___{PROJECT}___exp26_constant_short_seq",
    #     "tags": ["week1", "scheduler", "sequence_length"],
    #     "description": "Constant LR with shorter sequences",

    #     "learning_rate": 2.5e-5,
    #     "warmup_steps": 0,
    #     "weight_decay": 0.0,
    #     "train_batch_size": 32,
    #     "max_seq_length": 96,
    #     "lr_scheduler_type": "constant",
    #     "optimizer_type": "adamw",
    # },

    # # SGD OPTIMIZER EXPERIMENTS (7 new)
    # # ============================================

    # # Exp 27: SGD with moderate LR - balanced classic
    # {
    #     "name": f"{DATASET}___{PROJECT}___exp27_sgd_moderate",
    #     "tags": ["week1", "optimizer"],
    #     "description": "SGD with moderate learning rate",

    #     "learning_rate": 5e-5,
    #     "warmup_steps": 100,
    #     "weight_decay": 0.01,
    #     "train_batch_size": 32,
    #     "max_seq_length": 128,
    #     "lr_scheduler_type": "linear",
    #     "optimizer_type": "sgd",
    # },

    # # Exp 28: SGD with small batches - noisy gradients
    # {
    #     "name": f"{DATASET}___{PROJECT}___exp28_sgd_small_batch",
    #     "tags": ["week1", "optimizer", "batch_size"],
    #     "description": "SGD with small batches for more updates",

    #     "learning_rate": 8e-5,
    #     "warmup_steps": 50,
    #     "weight_decay": 0.02,
    #     "train_batch_size": 16,
    #     "max_seq_length": 128,
    #     "lr_scheduler_type": "linear",
    #     "optimizer_type": "sgd",
    # },

    # # Exp 29: SGD with large batches - stable gradients
    # {
    #     "name": f"{DATASET}___{PROJECT}___exp29_sgd_large_batch",
    #     "tags": ["week1", "optimizer", "batch_size"],
    #     "description": "SGD with large batches for stability",

    #     "learning_rate": 1e-4,
    #     "warmup_steps": 150,
    #     "weight_decay": 0.01,
    #     "train_batch_size": 64,
    #     "max_seq_length": 128,
    #     "lr_scheduler_type": "linear",
    #     "optimizer_type": "sgd",
    # },

    # # Exp 30: SGD with heavy regularization - controlled momentum
    # {
    #     "name": f"{DATASET}___{PROJECT}___exp30_sgd_heavy_reg",
    #     "tags": ["week1", "optimizer", "regularization"],
    #     "description": "SGD with strong weight decay",

    #     "learning_rate": 8e-5,
    #     "warmup_steps": 100,
    #     "weight_decay": 0.05,
    #     "train_batch_size": 32,
    #     "max_seq_length": 128,
    #     "lr_scheduler_type": "linear",
    #     "optimizer_type": "sgd",
    # },

    # # Exp 31: SGD with long sequences - capture context
    # {
    #     "name": f"{DATASET}___{PROJECT}___exp31_sgd_long_seq",
    #     "tags": ["week1", "optimizer", "sequence_length"],
    #     "description": "SGD with longer sequences",

    #     "learning_rate": 6e-5,
    #     "warmup_steps": 100,
    #     "weight_decay": 0.01,
    #     "train_batch_size": 32,
    #     "max_seq_length": 192,
    #     "lr_scheduler_type": "constant",
    #     "optimizer_type": "sgd",
    # },

    # # Exp 32: SGD with short sequences - fast training
    # {
    #     "name": f"{DATASET}___{PROJECT}___exp32_sgd_short_seq",
    #     "tags": ["week1", "optimizer", "sequence_length"],
    #     "description": "SGD with shorter sequences for speed",

    #     "learning_rate": 7e-5,
    #     "warmup_steps": 50,
    #     "weight_decay": 0.01,
    #     "train_batch_size": 32,
    #     "max_seq_length": 96,
    #     "lr_scheduler_type": "linear",
    #     "optimizer_type": "sgd",
    # },

    # # Exp 33: SGD aggressive - high everything
    # {
    #     "name": f"{DATASET}___{PROJECT}___exp33_sgd_aggressive",
    #     "tags": ["week1", "optimizer", "aggressive"],
    #     "description": "Aggressive SGD configuration",

    #     "learning_rate": 1.2e-4,
    #     "warmup_steps": 200,
    #     "weight_decay": 0.02,
    #     "train_batch_size": 32,
    #     "max_seq_length": 128,
    #     "lr_scheduler_type": "constant",
    #     "optimizer_type": "sgd",
    # },

    # Exp 01-05: learning_rate = 2e-5
    # {
    #     "name": f"{DATASET}___{PROJECT}___exp01_lr2e5_linear_128",
    #     "tags": ["week2", "lr_2e-5", "linear", "seq_128"],
    #     "description": "Conservative LR, linear scheduler, standard length",

    #     "learning_rate": 2e-5,
    #     "warmup_steps": 0,
    #     "weight_decay": 0.005,
    #     "train_batch_size": 32,
    #     "max_seq_length": 128,
    #     "lr_scheduler_type": "linear",
    #     "optimizer_type": "adamw",
    # },
    # {
    #     "name": f"{DATASET}___{PROJECT}___exp02_lr2e5_cosine_160",
    #     "tags": ["week2", "lr_2e-5", "cosine", "seq_160"],
    #     "description": "Conservative LR, cosine scheduler, medium-long sequences",

    #     "learning_rate": 2e-5,
    #     "warmup_steps": 0,
    #     "weight_decay": 0.005,
    #     "train_batch_size": 32,
    #     "max_seq_length": 160,
    #     "lr_scheduler_type": "cosine",
    #     "optimizer_type": "adamw",
    # },
    # {
    #     "name": f"{DATASET}___{PROJECT}___exp03_lr2e5_linear_192",
    #     "tags": ["week2", "lr_2e-5", "linear", "seq_192"],
    #     "description": "Conservative LR, linear scheduler, long sequences",

    #     "learning_rate": 2e-5,
    #     "warmup_steps": 0,
    #     "weight_decay": 0.005,
    #     "train_batch_size": 32,
    #     "max_seq_length": 192,
    #     "lr_scheduler_type": "linear",
    #     "optimizer_type": "adamw",
    # },
    # {
    #     "name": f"{DATASET}___{PROJECT}___exp04_lr2e5_cosine_224",
    #     "tags": ["week2", "lr_2e-5", "cosine", "seq_224"],
    #     "description": "Conservative LR, cosine scheduler, very long sequences",

    #     "learning_rate": 2e-5,
    #     "warmup_steps": 0,
    #     "weight_decay": 0.005,
    #     "train_batch_size": 32,
    #     "max_seq_length": 224,
    #     "lr_scheduler_type": "cosine",
    #     "optimizer_type": "adamw",
    # },
    # {
    #     "name": f"{DATASET}___{PROJECT}___exp05_lr2e5_linear_128_rep",
    #     "tags": ["week2", "lr_2e-5", "linear", "seq_128", "replicate"],
    #     "description": "Replicate of exp01 for statistical confidence",

    #     "learning_rate": 2e-5,
    #     "warmup_steps": 0,
    #     "weight_decay": 0.005,
    #     "train_batch_size": 32,
    #     "max_seq_length": 128,
    #     "lr_scheduler_type": "linear",
    #     "optimizer_type": "adamw",
    # },

    # # Exp 06-10: learning_rate = 3e-5
    # {
    #     "name": f"{DATASET}___{PROJECT}___exp06_lr3e5_cosine_128",
    #     "tags": ["week2", "lr_3e-5", "cosine", "seq_128"],
    #     "description": "Medium LR, cosine scheduler, standard length",

    #     "learning_rate": 3e-5,
    #     "warmup_steps": 0,
    #     "weight_decay": 0.005,
    #     "train_batch_size": 32,
    #     "max_seq_length": 128,
    #     "lr_scheduler_type": "cosine",
    #     "optimizer_type": "adamw",
    # },
    # {
    #     "name": f"{DATASET}___{PROJECT}___exp07_lr3e5_linear_160",
    #     "tags": ["week2", "lr_3e-5", "linear", "seq_160"],
    #     "description": "Medium LR, linear scheduler, medium-long sequences",

    #     "learning_rate": 3e-5,
    #     "warmup_steps": 0,
    #     "weight_decay": 0.005,
    #     "train_batch_size": 32,
    #     "max_seq_length": 160,
    #     "lr_scheduler_type": "linear",
    #     "optimizer_type": "adamw",
    # },
    # {
    #     "name": f"{DATASET}___{PROJECT}___exp08_lr3e5_cosine_192",
    #     "tags": ["week2", "lr_3e-5", "cosine", "seq_192"],
    #     "description": "Medium LR, cosine scheduler, long sequences",

    #     "learning_rate": 3e-5,
    #     "warmup_steps": 0,
    #     "weight_decay": 0.005,
    #     "train_batch_size": 32,
    #     "max_seq_length": 192,
    #     "lr_scheduler_type": "cosine",
    #     "optimizer_type": "adamw",
    # },
    # {
    #     "name": f"{DATASET}___{PROJECT}___exp09_lr3e5_linear_224",
    #     "tags": ["week2", "lr_3e-5", "linear", "seq_224"],
    #     "description": "Medium LR, linear scheduler, very long sequences",

    #     "learning_rate": 3e-5,
    #     "warmup_steps": 0,
    #     "weight_decay": 0.005,
    #     "train_batch_size": 32,
    #     "max_seq_length": 224,
    #     "lr_scheduler_type": "linear",
    #     "optimizer_type": "adamw",
    # },
    # {
    #     "name": f"{DATASET}___{PROJECT}___exp10_lr3e5_linear_128_rep",
    #     "tags": ["week2", "lr_3e-5", "linear", "seq_128", "replicate"],
    #     "description": "Replicate baseline for statistical confidence",

    #     "learning_rate": 3e-5,
    #     "warmup_steps": 0,
    #     "weight_decay": 0.005,
    #     "train_batch_size": 32,
    #     "max_seq_length": 128,
    #     "lr_scheduler_type": "linear",
    #     "optimizer_type": "adamw",
    # },

    # # Exp 11-15: learning_rate = 4e-5
    # {
    #     "name": f"{DATASET}___{PROJECT}___exp11_lr4e5_linear_128",
    #     "tags": ["week2", "lr_4e-5", "linear", "seq_128"],
    #     "description": "Medium-high LR, linear scheduler, standard length",

    #     "learning_rate": 4e-5,
    #     "warmup_steps": 0,
    #     "weight_decay": 0.005,
    #     "train_batch_size": 32,
    #     "max_seq_length": 128,
    #     "lr_scheduler_type": "linear",
    #     "optimizer_type": "adamw",
    # },
    # {
    #     "name": f"{DATASET}___{PROJECT}___exp12_lr4e5_cosine_160",
    #     "tags": ["week2", "lr_4e-5", "cosine", "seq_160"],
    #     "description": "Medium-high LR, cosine scheduler, medium-long sequences",

    #     "learning_rate": 4e-5,
    #     "warmup_steps": 0,
    #     "weight_decay": 0.005,
    #     "train_batch_size": 32,
    #     "max_seq_length": 160,
    #     "lr_scheduler_type": "cosine",
    #     "optimizer_type": "adamw",
    # },
    # {
    #     "name": f"{DATASET}___{PROJECT}___exp13_lr4e5_linear_192",
    #     "tags": ["week2", "lr_4e-5", "linear", "seq_192"],
    #     "description": "Medium-high LR, linear scheduler, long sequences",

    #     "learning_rate": 4e-5,
    #     "warmup_steps": 0,
    #     "weight_decay": 0.005,
    #     "train_batch_size": 32,
    #     "max_seq_length": 192,
    #     "lr_scheduler_type": "linear",
    #     "optimizer_type": "adamw",
    # },
    # {
    #     "name": f"{DATASET}___{PROJECT}___exp14_lr4e5_cosine_224",
    #     "tags": ["week2", "lr_4e-5", "cosine", "seq_224"],
    #     "description": "Medium-high LR, cosine scheduler, very long sequences",

    #     "learning_rate": 4e-5,
    #     "warmup_steps": 0,
    #     "weight_decay": 0.005,
    #     "train_batch_size": 32,
    #     "max_seq_length": 224,
    #     "lr_scheduler_type": "cosine",
    #     "optimizer_type": "adamw",
    # },
    # {
    #     "name": f"{DATASET}___{PROJECT}___exp15_lr4e5_linear_128_rep",
    #     "tags": ["week2", "lr_4e-5", "linear", "seq_128", "replicate"],
    #     "description": "Replicate baseline for statistical confidence",

    #     "learning_rate": 4e-5,
    #     "warmup_steps": 0,
    #     "weight_decay": 0.005,
    #     "train_batch_size": 32,
    #     "max_seq_length": 128,
    #     "lr_scheduler_type": "linear",
    #     "optimizer_type": "adamw",
    # },

    # # Exp 16-20: learning_rate = 5e-5
    # {
    #     "name": f"{DATASET}___{PROJECT}___exp16_lr5e5_cosine_128",
    #     "tags": ["week2", "lr_5e-5", "cosine", "seq_128"],
    #     "description": "High LR, cosine scheduler, standard length",

    #     "learning_rate": 5e-5,
    #     "warmup_steps": 0,
    #     "weight_decay": 0.005,
    #     "train_batch_size": 32,
    #     "max_seq_length": 128,
    #     "lr_scheduler_type": "cosine",
    #     "optimizer_type": "adamw",
    # },
    # {
    #     "name": f"{DATASET}___{PROJECT}___exp17_lr5e5_linear_160",
    #     "tags": ["week2", "lr_5e-5", "linear", "seq_160"],
    #     "description": "High LR, linear scheduler, medium-long sequences",

    #     "learning_rate": 5e-5,
    #     "warmup_steps": 0,
    #     "weight_decay": 0.005,
    #     "train_batch_size": 32,
    #     "max_seq_length": 160,
    #     "lr_scheduler_type": "linear",
    #     "optimizer_type": "adamw",
    # },
    # {
    #     "name": f"{DATASET}___{PROJECT}___exp18_lr5e5_cosine_192",
    #     "tags": ["week2", "lr_5e-5", "cosine", "seq_192"],
    #     "description": "High LR, cosine scheduler, long sequences",

    #     "learning_rate": 5e-5,
    #     "warmup_steps": 0,
    #     "weight_decay": 0.005,
    #     "train_batch_size": 32,
    #     "max_seq_length": 192,
    #     "lr_scheduler_type": "cosine",
    #     "optimizer_type": "adamw",
    # },
    # {
    #     "name": f"{DATASET}___{PROJECT}___exp19_lr5e5_linear_224",
    #     "tags": ["week2", "lr_5e-5", "linear", "seq_224"],
    #     "description": "High LR, linear scheduler, very long sequences",

    #     "learning_rate": 5e-5,
    #     "warmup_steps": 0,
    #     "weight_decay": 0.005,
    #     "train_batch_size": 32,
    #     "max_seq_length": 224,
    #     "lr_scheduler_type": "linear",
    #     "optimizer_type": "adamw",
    # },
    # {
    #     "name": f"{DATASET}___{PROJECT}___exp20_lr5e5_linear_128_rep",
    #     "tags": ["week2", "lr_5e-5", "linear", "seq_128", "replicate"],
    #     "description": "Replicate Week1 best config (with weight_decay) for comparison",

    #     "learning_rate": 5e-5,
    #     "warmup_steps": 0,
    #     "weight_decay": 0.005,
    #     "train_batch_size": 32,
    #     "max_seq_length": 128,
    #     "lr_scheduler_type": "linear",
    #     "optimizer_type": "adamw",
    # },
]
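
The Week 2 entries above differ only in learning rate, scheduler, and sequence length, so the list could also be generated from a compact specification. The following is a minimal sketch, not the code used for the logged runs: `make_experiment` and `generated_experiments` are hypothetical names, and the `DATASET` and `PROJECT` values are placeholder assumptions standing in for the variables defined earlier in the notebook.

```python
# Hypothetical placeholders; the notebook defines its own DATASET and PROJECT.
DATASET = "mrpc"
PROJECT = "project01"

# Hyperparameters that stay fixed across the Week 2 grid (values from the list above).
FIXED = {
    "warmup_steps": 0,
    "weight_decay": 0.005,
    "train_batch_size": 32,
    "optimizer_type": "adamw",
}


def make_experiment(idx, lr, scheduler, seq_len, description, replicate=False):
    """Build one experiment dict in the same shape as the hand-written entries."""
    mantissa, exponent = f"{lr:.0e}".split("e")  # 2e-5 -> ("2", "-05")
    exp = int(exponent)                          # "-05" -> -5
    suffix = "_rep" if replicate else ""
    tags = ["week2", f"lr_{mantissa}e{exp}", scheduler, f"seq_{seq_len}"]
    if replicate:
        tags.append("replicate")
    return {
        "name": (f"{DATASET}___{PROJECT}___exp{idx:02d}_"
                 f"lr{mantissa}e{abs(exp)}_{scheduler}_{seq_len}{suffix}"),
        "tags": tags,
        "description": description,
        "learning_rate": lr,
        "max_seq_length": seq_len,
        "lr_scheduler_type": scheduler,
        **FIXED,
    }


# Example: two entries and the replicate from the 2e-5 block.
generated_experiments = [
    make_experiment(1, 2e-5, "linear", 128,
                    "Conservative LR, linear scheduler, standard length"),
    make_experiment(2, 2e-5, "cosine", 160,
                    "Conservative LR, cosine scheduler, medium-long sequences"),
    make_experiment(5, 2e-5, "linear", 128,
                    "Replicate of exp01 for statistical confidence", replicate=True),
]
print(generated_experiments[0]["name"])
```

Keeping the fixed values in one dict makes it harder for a single entry to drift from the others by accident, which matters when the runs are later compared against each other.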

In [None]:
# import wandb

# epochs = 3  # Fixed for all experiments

# for i, config in enumerate(experiments):
#     print(f"\n{'='*60}")
#     print(f"Experiment {i+1}/{len(experiments)}: {config['name']}")
#     print(f"{'='*60}\n")

#     # Config for this run
#     config["epochs"] = epochs
#     config["model"] = "distilbert-base-uncased"
#     config["task"] = "mrpc"

#     # WandB Logger
#     logger = WandbLogger(
#         project="MLOPS___Sem_5___Project_01",
#         name=config["name"],
#         config=config,
#         tags=["week1", f"lr_{config['learning_rate']}", f"opt_{config['optimizer_type']}"]
#     )

#     # Seed for reproducibility
#     L.seed_everything(42)

#     # DataModule with configurable parameters
#     dm = GLUEDataModule(
#         model_name_or_path="distilbert-base-uncased",
#         task_name="mrpc",
#         max_seq_length=config["max_seq_length"],
#         train_batch_size=config["train_batch_size"],
#         eval_batch_size=32,
#     )
#     dm.setup("fit")

#     # Model with all hyperparameters
#     model = GLUETransformer(
#         model_name_or_path="distilbert-base-uncased",
#         num_labels=dm.num_labels,
#         eval_splits=dm.eval_splits,
#         task_name=dm.task_name,
#         learning_rate=config["learning_rate"],
#         warmup_steps=config["warmup_steps"],
#         weight_decay=config["weight_decay"],
#         train_batch_size=config["train_batch_size"],
#         lr_scheduler_type=config["lr_scheduler_type"],
#         optimizer_type=config["optimizer_type"],
#     )

#     # Accumulate gradient batches for a larger effective batch size
#     accumulate_grad_batches = 2 if config["name"] == "large_batch" else 1

#     # Trainer
#     trainer = L.Trainer(
#         max_epochs=epochs,
#         accelerator="auto",
#         devices=1,
#         logger=logger,
#         benchmark=True,
#         accumulate_grad_batches=accumulate_grad_batches,
#     )

#     # Training
#     try:
#         trainer.fit(model, datamodule=dm)
#         print(f"✅ Experiment {config['name']} completed successfully!")
#     except Exception as e:
#         print(f"❌ Experiment {config['name']} failed: {str(e)}")

#     # Finish the run
#     wandb.finish()

#     # Free memory
#     del model, dm, trainer
#     torch.cuda.empty_cache()

# print("\n" + "="*60)
# print("All experiments completed!")
# print("="*60)

## Week 03

In [None]:
# ============================================================================
# WEEK 3: AUTOMATED HYPERPARAMETER OPTIMIZATION WITH W&B SWEEPS
# ============================================================================

import wandb

# ----------------------------------------------------------------------------
# SWEEP CONFIGURATION
# ----------------------------------------------------------------------------
sweep_config = {
    'method': 'bayes',  # Bayesian optimization
    'metric': {
        'name': 'accuracy',
        'goal': 'maximize'
    },
    'parameters': {
        # === TUNE THESE 3 HYPERPARAMETERS (same as Week 2) ===
        'learning_rate': {
            'distribution': 'log_uniform_values',
            'min': 2e-5,
            'max': 5e-5
        },
        'lr_scheduler_type': {
            'values': ['linear', 'cosine']
        },
        'max_seq_length': {
            'values': [128, 160, 192, 224]
        },

        # === FIXED HYPERPARAMETERS (exactly like Week 2) ===
        'warmup_steps': {'value': 0},
        'weight_decay': {'value': 0.005},
        'train_batch_size': {'value': 32},
        'optimizer_type': {'value': 'adamw'},
    }
}

# ----------------------------------------------------------------------------
# TRAINING FUNCTION - EVERYTHING DEFINED INSIDE
# ----------------------------------------------------------------------------
sweep_run_counter = {'count': 0}

def train_sweep():
    """Self-contained training function for W&B sweep"""

    # ALL IMPORTS INSIDE THE FUNCTION
    import torch
    import pytorch_lightning as pl
    from transformers import AutoTokenizer, AutoModelForSequenceClassification
    from datasets import load_dataset
    from torch.utils.data import DataLoader
    from pytorch_lightning.loggers import WandbLogger
    from pytorch_lightning.callbacks import ModelCheckpoint

    # Increment counter
    sweep_run_counter['count'] += 1
    run_number = sweep_run_counter['count']

    # Initialize wandb
    run = wandb.init()
    config = wandb.config

    # Create run name (DATASET is expected to be defined in an earlier cell)
    run_name = (f"{DATASET}___week3_sweep___"
                f"run{run_number:02d}___"
                f"lr{config.learning_rate:.0e}_"
                f"{config.lr_scheduler_type}_"
                f"seq{config.max_seq_length}")

    wandb.run.name = run_name
    wandb.run.tags = ["week3", "sweep", "automated"]

    print(f"\n{'='*60}")
    print(f"Sweep Run {run_number}/20: {run_name}")
    print(f"{'='*60}")

    pl.seed_everything(42)

    # ========================================================================
    # DEFINE PYTORCH LIGHTNING MODULE INSIDE
    # ========================================================================
    class DistilBERTClassifier(pl.LightningModule):
        def __init__(self, config):
            super().__init__()
            self.save_hyperparameters(config)
            self.model = AutoModelForSequenceClassification.from_pretrained(
                "distilbert-base-uncased",
                num_labels=2
            )

        def forward(self, input_ids, attention_mask):
            return self.model(input_ids=input_ids, attention_mask=attention_mask)

        def training_step(self, batch, batch_idx):
            outputs = self(batch['input_ids'], batch['attention_mask'])
            loss = torch.nn.functional.cross_entropy(outputs.logits, batch['labels'])
            self.log('train_loss', loss, prog_bar=True)
            return loss

        def validation_step(self, batch, batch_idx):
            outputs = self(batch['input_ids'], batch['attention_mask'])
            loss = torch.nn.functional.cross_entropy(outputs.logits, batch['labels'])
            preds = torch.argmax(outputs.logits, dim=1)
            acc = (preds == batch['labels']).float().mean()

            # Calculate F1 for this batch. Lightning averages the logged
            # per-batch values over the epoch, which only approximates the
            # F1 over the full validation set.
            tp = ((preds == 1) & (batch['labels'] == 1)).float().sum()
            fp = ((preds == 1) & (batch['labels'] == 0)).float().sum()
            fn = ((preds == 0) & (batch['labels'] == 1)).float().sum()

            precision = tp / (tp + fp + 1e-8)
            recall = tp / (tp + fn + 1e-8)
            f1 = 2 * (precision * recall) / (precision + recall + 1e-8)

            self.log('val_loss', loss, prog_bar=True)
            self.log('accuracy', acc, prog_bar=True)
            self.log('f1', f1, prog_bar=True)
            return loss

        def configure_optimizers(self):
            if self.hparams.optimizer_type == 'adamw':
                optimizer = torch.optim.AdamW(
                    self.parameters(),
                    lr=self.hparams.learning_rate,
                    weight_decay=self.hparams.weight_decay
                )
            else:
                optimizer = torch.optim.SGD(
                    self.parameters(),
                    lr=self.hparams.learning_rate,
                    weight_decay=self.hparams.weight_decay,
                    momentum=0.9
                )

            num_training_steps = self.trainer.estimated_stepping_batches
            num_warmup_steps = self.hparams.warmup_steps

            if self.hparams.lr_scheduler_type == 'linear':
                from transformers import get_linear_schedule_with_warmup
                scheduler = get_linear_schedule_with_warmup(
                    optimizer,
                    num_warmup_steps=num_warmup_steps,
                    num_training_steps=num_training_steps
                )
            elif self.hparams.lr_scheduler_type == 'cosine':
                from transformers import get_cosine_schedule_with_warmup
                scheduler = get_cosine_schedule_with_warmup(
                    optimizer,
                    num_warmup_steps=num_warmup_steps,
                    num_training_steps=num_training_steps
                )
            else:  # constant
                from transformers import get_constant_schedule_with_warmup
                scheduler = get_constant_schedule_with_warmup(
                    optimizer,
                    num_warmup_steps=num_warmup_steps
                )

            return {
                'optimizer': optimizer,
                'lr_scheduler': {
                    'scheduler': scheduler,
                    'interval': 'step',
                    'frequency': 1
                }
            }

    # ========================================================================
    # PREPARE DATA INSIDE
    # ========================================================================
    tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
    dataset = load_dataset("glue", "mrpc")

    def tokenize_function(examples):
        return tokenizer(
            examples['sentence1'],
            examples['sentence2'],
            padding='max_length',
            truncation=True,
            max_length=config.max_seq_length
        )

    tokenized_train = dataset['train'].map(
        tokenize_function,
        batched=True,
        remove_columns=['sentence1', 'sentence2', 'idx']
    )
    tokenized_val = dataset['validation'].map(
        tokenize_function,
        batched=True,
        remove_columns=['sentence1', 'sentence2', 'idx']
    )

    tokenized_train.set_format('torch', columns=['input_ids', 'attention_mask', 'label'])
    tokenized_val.set_format('torch', columns=['input_ids', 'attention_mask', 'label'])
    tokenized_train = tokenized_train.rename_column('label', 'labels')
    tokenized_val = tokenized_val.rename_column('label', 'labels')

    train_loader = DataLoader(tokenized_train, batch_size=config.train_batch_size, shuffle=True)
    val_loader = DataLoader(tokenized_val, batch_size=config.train_batch_size)

    # ========================================================================
    # TRAIN MODEL
    # ========================================================================
    model = DistilBERTClassifier(dict(config))

    wandb_logger = WandbLogger(
        project="MLOPS___Sem_5___Project_01",
        name=run_name,
        tags=["week3", "sweep", "automated"]
    )
    checkpoint_callback = ModelCheckpoint(
        monitor='accuracy',
        mode='max',
        save_top_k=1
    )

    trainer = pl.Trainer(
        max_epochs=3,
        logger=wandb_logger,
        callbacks=[checkpoint_callback],
        accelerator='auto',
        devices=1,
        enable_progress_bar=True
    )

    trainer.fit(model, train_loader, val_loader)

    print(f"✅ Completed Run {run_number}/20: {run_name}")
    wandb.finish()

# ----------------------------------------------------------------------------
# RUN THE SWEEP
# ----------------------------------------------------------------------------
print("\n" + "="*70)
print("WEEK 3: STARTING AUTOMATED HYPERPARAMETER SWEEP")
print("="*70)
print(f"Method: {sweep_config['method']}")
print(f"Metric: {sweep_config['metric']['name']} (maximize)")
print("Number of runs: 20 (same as Week 2)")
print("="*70 + "\n")

sweep_id = wandb.sweep(sweep=sweep_config, project="MLOPS___Sem_5___Project_01")

print("✅ Sweep initialized!")
print(f"📊 Sweep ID: {sweep_id}")
print(f"🔗 View live at: https://wandb.ai/benjamin-amhof-hochschule-luzern/MLOPS___Sem_5___Project_01/sweeps/{sweep_id}")
print("\n⏳ Starting sweep agent (this will run 20 trials)...\n")

wandb.agent(sweep_id, function=train_sweep, count=20)

print("\n" + "="*70)
print("🎉 WEEK 3 SWEEP COMPLETED!")
print("="*70)
print(f"Results: https://wandb.ai/benjamin-amhof-hochschule-luzern/MLOPS___Sem_5___Project_01/sweeps/{sweep_id}")
print("="*70)


WEEK 3: STARTING AUTOMATED HYPERPARAMETER SWEEP
Method: bayes
Metric: accuracy (maximize)
Number of runs: 20 (same as Week 2)

Create sweep with ID: b0dfmbpr
Sweep URL: https://wandb.ai/benjamin-amhof-hochschule-luzern/MLOPS___Sem_5___Project_01/sweeps/b0dfmbpr
✅ Sweep initialized!
📊 Sweep ID: b0dfmbpr
🔗 View live at: https://wandb.ai/benjamin-amhof-hochschule-luzern/MLOPS___Sem_5___Project_01/sweeps/b0dfmbpr

⏳ Starting sweep agent (this will run 20 trials)...



[34m[1mwandb[0m: Agent Starting Run: fq1ekwis with config:
[34m[1mwandb[0m: 	learning_rate: 3.158930197470382e-05
[34m[1mwandb[0m: 	lr_scheduler_type: cosine
[34m[1mwandb[0m: 	max_seq_length: 160
[34m[1mwandb[0m: 	optimizer_type: adamw
[34m[1mwandb[0m: 	train_batch_size: 32
[34m[1mwandb[0m: 	warmup_steps: 0
[34m[1mwandb[0m: 	weight_decay: 0.005


INFO:lightning_fabric.utilities.seed:Seed set to 42



Sweep Run 1/20: mrpc___week3_sweep___run01___lr3e-05_cosine_seq160


The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
INFO:pytorch_lightning.utilities.rank_zero:GPU available: True (cuda), used: True
INFO:pytorch_lightning.utilities.rank_zero:TPU available: False, using: 0 TPU cores
INFO:pytorch_lightning.utilities.rank_zero:HPU available: False, using: 0 HPUs
/usr/local/lib/python3.12/dist-packages/pytorch_lightning/loggers/wandb.py:397: There is a wandb run already in progress and newly created instances of `WandbLogger` will reuse this run. If this is not desired, call `wandb.finish()` before instantiating `WandbLogger`.
INFO:pytorch_lightning.accelerators.cuda:LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
INFO:pytorch_lightning.utilities.rank_zero:Loading

INFO:pytorch_lightning.utilities.rank_zero:`Trainer.fit` stopped: `max_epochs=3` reached.


✅ Completed Run 1/20: mrpc___week3_sweep___run01___lr3e-05_cosine_seq160


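Run history:

| Metric | History |
|---|---|
| accuracy | ▅▁█ |
| epoch | ▁▁▁▅▅▅███ |
| f1 | ▁▂█ |
| train_loss | █▄▁▁▂▁ |
| trainer/global_step | ▁▂▃▃▅▅▆▇█ |
| val_loss | ▆▁█ |

Run summary:

| Metric | Value |
|---|---|
| accuracy | 0.85294 |
| epoch | 2.0 |
| f1 | 0.89735 |
| train_loss | 0.14851 |
| trainer/global_step | 344.0 |
| val_loss | 0.37091 |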

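The final `trainer/global_step` of 344 in the summary above is consistent with the dataset size and batch size reported in the logs: 3668 training examples at batch size 32 give 115 optimizer steps per epoch, so 3 epochs give 345 steps. The sketch below only reproduces that arithmetic. Reading 344 as the zero-based index of the last step is an assumption about how the logger numbers steps, not something the output states.

```python
import math

train_examples = 3668   # size of the MRPC train split, from the log output
batch_size = 32         # train_batch_size fixed in the sweep config
epochs = 3              # max_epochs passed to the Trainer

# The last batch of each epoch is partial, so round up.
steps_per_epoch = math.ceil(train_examples / batch_size)
total_steps = steps_per_epoch * epochs

print(steps_per_epoch, total_steps)
```

The same total is what a linear or cosine schedule needs as its number of training steps, so a mismatch here would show up as a learning rate that does not reach zero at the end of training.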

[34m[1mwandb[0m: Agent Starting Run: q8pd39w6 with config:
[34m[1mwandb[0m: 	learning_rate: 3.493463338439319e-05
[34m[1mwandb[0m: 	lr_scheduler_type: linear
[34m[1mwandb[0m: 	max_seq_length: 160
[34m[1mwandb[0m: 	optimizer_type: adamw
[34m[1mwandb[0m: 	train_batch_size: 32
[34m[1mwandb[0m: 	warmup_steps: 0
[34m[1mwandb[0m: 	weight_decay: 0.005


INFO:lightning_fabric.utilities.seed:Seed set to 42



Sweep Run 2/20: mrpc___week3_sweep___run02___lr3e-05_linear_seq160


INFO:pytorch_lightning.utilities.rank_zero:Loading `train_dataloader` to estimate number of stepping batches.
INFO:pytorch_lightning.callbacks.model_summary:
  | Name  | Type                                | Params | Mode
---------------------------------------------------------------------
0 | model | DistilBertForSe

INFO:pytorch_lightning.utilities.rank_zero:`Trainer.fit` stopped: `max_epochs=3` reached.


✅ Completed Run 2/20: mrpc___week3_sweep___run02___lr3e-05_linear_seq160


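Run history:

| Metric | History |
|---|---|
| accuracy | █▁▃ |
| epoch | ▁▁▁▅▅▅███ |
| f1 | █▁▇ |
| train_loss | █▄▁▁▃▁ |
| trainer/global_step | ▁▂▃▃▅▅▆▇█ |
| val_loss | ▃▁█ |

Run summary:

| Metric | Value |
|---|---|
| accuracy | 0.85049 |
| epoch | 2.0 |
| f1 | 0.89577 |
| train_loss | 0.15444 |
| trainer/global_step | 344.0 |
| val_loss | 0.41532 |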

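The `f1` values reported for each run come from `validation_step`, which computes F1 per batch and lets Lightning average the logged values over the epoch. An average of per-batch F1 scores is generally not the same as F1 computed once over all validation predictions. The toy counts below are made up purely to illustrate the difference and are not taken from the runs above; `f1_from_counts` is a hypothetical helper.

```python
def f1_from_counts(tp, fp, fn, eps=1e-8):
    """F1 from confusion counts, with the same epsilon smoothing as validation_step."""
    precision = tp / (tp + fp + eps)
    recall = tp / (tp + fn + eps)
    return 2 * precision * recall / (precision + recall + eps)


# Made-up (tp, fp, fn) counts for two validation batches.
batches = [(1, 0, 0), (1, 1, 1)]

# What per-batch logging followed by epoch-level averaging would report.
mean_of_batch_f1 = sum(f1_from_counts(*b) for b in batches) / len(batches)

# F1 computed once from the pooled counts of the whole validation set.
tp, fp, fn = (sum(col) for col in zip(*batches))
pooled_f1 = f1_from_counts(tp, fp, fn)

print(f"mean of per-batch F1: {mean_of_batch_f1:.3f}")
print(f"pooled F1:            {pooled_f1:.3f}")
```

Collecting predictions and labels across batches and computing the metric once per epoch, for example with the `evaluate` package imported at the top of the notebook, would avoid the discrepancy.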

[34m[1mwandb[0m: Agent Starting Run: kly08k9k with config:
[34m[1mwandb[0m: 	learning_rate: 2.226511115964475e-05
[34m[1mwandb[0m: 	lr_scheduler_type: linear
[34m[1mwandb[0m: 	max_seq_length: 224
[34m[1mwandb[0m: 	optimizer_type: adamw
[34m[1mwandb[0m: 	train_batch_size: 32
[34m[1mwandb[0m: 	warmup_steps: 0
[34m[1mwandb[0m: 	weight_decay: 0.005


INFO:lightning_fabric.utilities.seed:Seed set to 42



Sweep Run 3/20: mrpc___week3_sweep___run03___lr2e-05_linear_seq224


INFO:pytorch_lightning.utilities.rank_zero:`Trainer.fit` stopped: `max_epochs=3` reached.


✅ Completed Run 3/20: mrpc___week3_sweep___run03___lr2e-05_linear_seq224


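Run history:

| Metric | History |
|---|---|
| accuracy | ▁▆█ |
| epoch | ▁▁▁▅▅▅███ |
| f1 | ▁▆█ |
| train_loss | █▄▁▁▃▁ |
| trainer/global_step | ▁▂▃▃▅▅▆▇█ |
| val_loss | █▁▆ |

Run summary:

| Metric | Value |
|---|---|
| accuracy | 0.84804 |
| epoch | 2.0 |
| f1 | 0.89204 |
| train_loss | 0.20666 |
| trainer/global_step | 344.0 |
| val_loss | 0.38553 |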

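The learning rates in the run configs above are drawn under the `log_uniform_values` distribution declared in `sweep_config`, for which the logarithm of the value is uniformly distributed between the logarithms of `min` and `max`. The sketch below samples from that prior using only the standard library. It illustrates the distribution, not W&B's Bayesian search, which also takes earlier results into account when proposing new values; `sample_log_uniform` is a hypothetical helper.

```python
import math
import random

LR_MIN, LR_MAX = 2e-5, 5e-5  # bounds from sweep_config


def sample_log_uniform(low, high, rng):
    """Draw x between low and high such that log(x) is uniformly distributed."""
    return math.exp(rng.uniform(math.log(low), math.log(high)))


rng = random.Random(0)  # arbitrary seed, only to make the sketch repeatable
samples = [sample_log_uniform(LR_MIN, LR_MAX, rng) for _ in range(1000)]

# The median of a log-uniform prior is the geometric mean of its bounds,
# which lies below the arithmetic midpoint of 3.5e-5.
geometric_mean = math.sqrt(LR_MIN * LR_MAX)
sample_median = sorted(samples)[len(samples) // 2]

print(f"geometric mean of bounds: {geometric_mean:.3e}")
print(f"sample median:            {sample_median:.3e}")
```

Compared with a plain uniform prior, this places slightly more probability mass on the lower half of the range, which is a common choice for learning rates because their effect tends to scale multiplicatively.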

[34m[1mwandb[0m: Sweep Agent: Waiting for job.
[34m[1mwandb[0m: Job received.
[34m[1mwandb[0m: Agent Starting Run: ozfzuyy1 with config:
[34m[1mwandb[0m: 	learning_rate: 3.175318303116283e-05
[34m[1mwandb[0m: 	lr_scheduler_type: cosine
[34m[1mwandb[0m: 	max_seq_length: 160
[34m[1mwandb[0m: 	optimizer_type: adamw
[34m[1mwandb[0m: 	train_batch_size: 32
[34m[1mwandb[0m: 	warmup_steps: 0
[34m[1mwandb[0m: 	weight_decay: 0.005


INFO:lightning_fabric.utilities.seed:Seed set to 42



Sweep Run 4/20: mrpc___week3_sweep___run04___lr3e-05_cosine_seq160


INFO:pytorch_lightning.utilities.rank_zero:`Trainer.fit` stopped: `max_epochs=3` reached.


✅ Completed Run 4/20: mrpc___week3_sweep___run04___lr3e-05_cosine_seq160


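Run history:

| Metric | History |
|---|---|
| accuracy | █▁▃ |
| epoch | ▁▁▁▅▅▅███ |
| f1 | ▁▄█ |
| train_loss | █▄▁▁▂▁ |
| trainer/global_step | ▁▂▃▃▅▅▆▇█ |
| val_loss | ▇▁█ |

Run summary:

| Metric | Value |
|---|---|
| accuracy | 0.85049 |
| epoch | 2.0 |
| f1 | 0.89561 |
| train_loss | 0.14898 |
| trainer/global_step | 344.0 |
| val_loss | 0.36921 |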


[34m[1mwandb[0m: Agent Starting Run: 84c0dj24 with config:
[34m[1mwandb[0m: 	learning_rate: 2.62011583808396e-05
[34m[1mwandb[0m: 	lr_scheduler_type: cosine
[34m[1mwandb[0m: 	max_seq_length: 192
[34m[1mwandb[0m: 	optimizer_type: adamw
[34m[1mwandb[0m: 	train_batch_size: 32
[34m[1mwandb[0m: 	warmup_steps: 0
[34m[1mwandb[0m: 	weight_decay: 0.005


INFO:lightning_fabric.utilities.seed:Seed set to 42



Sweep Run 5/20: mrpc___week3_sweep___run05___lr3e-05_cosine_seq192


INFO:pytorch_lightning.utilities.rank_zero:`Trainer.fit` stopped: `max_epochs=3` reached.


✅ Completed Run 5/20: mrpc___week3_sweep___run05___lr3e-05_cosine_seq192


metric,history
accuracy,▁█▇
epoch,▁▁▁▅▅▅███
f1,▁█▇
train_loss,█▅▁▂▂▁
trainer/global_step,▁▂▃▃▅▅▆▇█
val_loss,█▁▇

metric,summary
accuracy,0.85049
epoch,2.0
f1,0.89506
train_loss,0.16542
trainer/global_step,344.0
val_loss,0.36993


[34m[1mwandb[0m: Agent Starting Run: zqtddndb with config:
[34m[1mwandb[0m: 	learning_rate: 2.0293987465839464e-05
[34m[1mwandb[0m: 	lr_scheduler_type: cosine
[34m[1mwandb[0m: 	max_seq_length: 128
[34m[1mwandb[0m: 	optimizer_type: adamw
[34m[1mwandb[0m: 	train_batch_size: 32
[34m[1mwandb[0m: 	warmup_steps: 0
[34m[1mwandb[0m: 	weight_decay: 0.005


INFO:lightning_fabric.utilities.seed:Seed set to 42



Sweep Run 6/20: mrpc___week3_sweep___run06___lr2e-05_cosine_seq128


Map:   0%|          | 0/3668 [00:00<?, ? examples/s]

Map:   0%|          | 0/408 [00:00<?, ? examples/s]

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
INFO:pytorch_lightning.utilities.rank_zero:GPU available: True (cuda), used: True
INFO:pytorch_lightning.utilities.rank_zero:TPU available: False, using: 0 TPU cores
INFO:pytorch_lightning.utilities.rank_zero:HPU available: False, using: 0 HPUs
INFO:pytorch_lightning.accelerators.cuda:LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
INFO:pytorch_lightning.utilities.rank_zero:Loading `train_dataloader` to estimate number of stepping batches.
INFO:pytorch_lightning.callbacks.model_summary:
  | Name  | Type                                | Params | Mode
---------------------------------------------------------------------
0 | model | DistilBertForSe

Sanity Checking: |          | 0/? [00:00<?, ?it/s]

Training: |          | 0/? [00:00<?, ?it/s]

Validation: |          | 0/? [00:00<?, ?it/s]

Validation: |          | 0/? [00:00<?, ?it/s]

Validation: |          | 0/? [00:00<?, ?it/s]

INFO:pytorch_lightning.utilities.rank_zero:`Trainer.fit` stopped: `max_epochs=3` reached.


✅ Completed Run 6/20: mrpc___week3_sweep___run06___lr2e-05_cosine_seq128


metric,history
accuracy,▁▇█
epoch,▁▁▁▅▅▅███
f1,▁██
train_loss,█▄▁▁▂▁
trainer/global_step,▁▂▃▃▅▅▆▇█
val_loss,█▁▄

metric,summary
accuracy,0.84314
epoch,2.0
f1,0.88897
train_loss,0.20893
trainer/global_step,344.0
val_loss,0.37295


[34m[1mwandb[0m: Agent Starting Run: jm5el87j with config:
[34m[1mwandb[0m: 	learning_rate: 3.0973179978198824e-05
[34m[1mwandb[0m: 	lr_scheduler_type: cosine
[34m[1mwandb[0m: 	max_seq_length: 192
[34m[1mwandb[0m: 	optimizer_type: adamw
[34m[1mwandb[0m: 	train_batch_size: 32
[34m[1mwandb[0m: 	warmup_steps: 0
[34m[1mwandb[0m: 	weight_decay: 0.005


INFO:lightning_fabric.utilities.seed:Seed set to 42



Sweep Run 7/20: mrpc___week3_sweep___run07___lr3e-05_cosine_seq192


Map:   0%|          | 0/3668 [00:00<?, ? examples/s]

Map:   0%|          | 0/408 [00:00<?, ? examples/s]

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
INFO:pytorch_lightning.utilities.rank_zero:GPU available: True (cuda), used: True
INFO:pytorch_lightning.utilities.rank_zero:TPU available: False, using: 0 TPU cores
INFO:pytorch_lightning.utilities.rank_zero:HPU available: False, using: 0 HPUs
INFO:pytorch_lightning.accelerators.cuda:LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
INFO:pytorch_lightning.utilities.rank_zero:Loading `train_dataloader` to estimate number of stepping batches.
INFO:pytorch_lightning.callbacks.model_summary:
  | Name  | Type                                | Params | Mode
---------------------------------------------------------------------
0 | model | DistilBertForSe

Sanity Checking: |          | 0/? [00:00<?, ?it/s]

Training: |          | 0/? [00:00<?, ?it/s]

Validation: |          | 0/? [00:00<?, ?it/s]

Validation: |          | 0/? [00:00<?, ?it/s]

Validation: |          | 0/? [00:00<?, ?it/s]

INFO:pytorch_lightning.utilities.rank_zero:`Trainer.fit` stopped: `max_epochs=3` reached.


✅ Completed Run 7/20: mrpc___week3_sweep___run07___lr3e-05_cosine_seq192


metric,history
accuracy,█▁█
epoch,▁▁▁▅▅▅███
f1,▁▅█
train_loss,█▄▁▁▂▁
trainer/global_step,▁▂▃▃▅▅▆▇█
val_loss,▇▁█

metric,summary
accuracy,0.85049
epoch,2.0
f1,0.89574
train_loss,0.15002
trainer/global_step,344.0
val_loss,0.3668


[34m[1mwandb[0m: Agent Starting Run: fkqg5760 with config:
[34m[1mwandb[0m: 	learning_rate: 3.3864450261330505e-05
[34m[1mwandb[0m: 	lr_scheduler_type: cosine
[34m[1mwandb[0m: 	max_seq_length: 224
[34m[1mwandb[0m: 	optimizer_type: adamw
[34m[1mwandb[0m: 	train_batch_size: 32
[34m[1mwandb[0m: 	warmup_steps: 0
[34m[1mwandb[0m: 	weight_decay: 0.005


INFO:lightning_fabric.utilities.seed:Seed set to 42



Sweep Run 8/20: mrpc___week3_sweep___run08___lr3e-05_cosine_seq224


Map:   0%|          | 0/3668 [00:00<?, ? examples/s]

Map:   0%|          | 0/408 [00:00<?, ? examples/s]

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
INFO:pytorch_lightning.utilities.rank_zero:GPU available: True (cuda), used: True
INFO:pytorch_lightning.utilities.rank_zero:TPU available: False, using: 0 TPU cores
INFO:pytorch_lightning.utilities.rank_zero:HPU available: False, using: 0 HPUs
INFO:pytorch_lightning.accelerators.cuda:LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
INFO:pytorch_lightning.utilities.rank_zero:Loading `train_dataloader` to estimate number of stepping batches.
INFO:pytorch_lightning.callbacks.model_summary:
  | Name  | Type                                | Params | Mode
---------------------------------------------------------------------
0 | model | DistilBertForSe

Sanity Checking: |          | 0/? [00:00<?, ?it/s]

Training: |          | 0/? [00:00<?, ?it/s]

Validation: |          | 0/? [00:00<?, ?it/s]

Validation: |          | 0/? [00:00<?, ?it/s]

Validation: |          | 0/? [00:00<?, ?it/s]

INFO:pytorch_lightning.utilities.rank_zero:`Trainer.fit` stopped: `max_epochs=3` reached.


✅ Completed Run 8/20: mrpc___week3_sweep___run08___lr3e-05_cosine_seq224


metric,history
accuracy,█▁▄
epoch,▁▁▁▅▅▅███
f1,█▁▇
train_loss,█▄▁▁▃▁
trainer/global_step,▁▂▃▃▅▅▆▇█
val_loss,▄▁█

metric,summary
accuracy,0.85049
epoch,2.0
f1,0.89558
train_loss,0.14848
trainer/global_step,344.0
val_loss,0.38433


[34m[1mwandb[0m: Agent Starting Run: qpmors94 with config:
[34m[1mwandb[0m: 	learning_rate: 2.1941346188014093e-05
[34m[1mwandb[0m: 	lr_scheduler_type: cosine
[34m[1mwandb[0m: 	max_seq_length: 160
[34m[1mwandb[0m: 	optimizer_type: adamw
[34m[1mwandb[0m: 	train_batch_size: 32
[34m[1mwandb[0m: 	warmup_steps: 0
[34m[1mwandb[0m: 	weight_decay: 0.005


INFO:lightning_fabric.utilities.seed:Seed set to 42



Sweep Run 9/20: mrpc___week3_sweep___run09___lr2e-05_cosine_seq160


Map:   0%|          | 0/3668 [00:00<?, ? examples/s]

Map:   0%|          | 0/408 [00:00<?, ? examples/s]

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
INFO:pytorch_lightning.utilities.rank_zero:GPU available: True (cuda), used: True
INFO:pytorch_lightning.utilities.rank_zero:TPU available: False, using: 0 TPU cores
INFO:pytorch_lightning.utilities.rank_zero:HPU available: False, using: 0 HPUs
INFO:pytorch_lightning.accelerators.cuda:LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
INFO:pytorch_lightning.utilities.rank_zero:Loading `train_dataloader` to estimate number of stepping batches.
INFO:pytorch_lightning.callbacks.model_summary:
  | Name  | Type                                | Params | Mode
---------------------------------------------------------------------
0 | model | DistilBertForSe

Sanity Checking: |          | 0/? [00:00<?, ?it/s]

Training: |          | 0/? [00:00<?, ?it/s]

Validation: |          | 0/? [00:00<?, ?it/s]

Validation: |          | 0/? [00:00<?, ?it/s]

Validation: |          | 0/? [00:00<?, ?it/s]

INFO:pytorch_lightning.utilities.rank_zero:`Trainer.fit` stopped: `max_epochs=3` reached.


✅ Completed Run 9/20: mrpc___week3_sweep___run09___lr2e-05_cosine_seq160


metric,history
accuracy,▁██
epoch,▁▁▁▅▅▅███
f1,▁██
train_loss,█▄▁▁▂▁
trainer/global_step,▁▂▃▃▅▅▆▇█
val_loss,█▁▄

metric,summary
accuracy,0.84559
epoch,2.0
f1,0.89074
train_loss,0.19168
trainer/global_step,344.0
val_loss,0.36954


[34m[1mwandb[0m: Sweep Agent: Waiting for job.
[34m[1mwandb[0m: Job received.
[34m[1mwandb[0m: Agent Starting Run: 3irjwg4c with config:
[34m[1mwandb[0m: 	learning_rate: 2.4359742814664232e-05
[34m[1mwandb[0m: 	lr_scheduler_type: cosine
[34m[1mwandb[0m: 	max_seq_length: 192
[34m[1mwandb[0m: 	optimizer_type: adamw
[34m[1mwandb[0m: 	train_batch_size: 32
[34m[1mwandb[0m: 	warmup_steps: 0
[34m[1mwandb[0m: 	weight_decay: 0.005


INFO:lightning_fabric.utilities.seed:Seed set to 42



Sweep Run 10/20: mrpc___week3_sweep___run10___lr2e-05_cosine_seq192


Map:   0%|          | 0/3668 [00:00<?, ? examples/s]

Map:   0%|          | 0/408 [00:00<?, ? examples/s]

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
INFO:pytorch_lightning.utilities.rank_zero:GPU available: True (cuda), used: True
INFO:pytorch_lightning.utilities.rank_zero:TPU available: False, using: 0 TPU cores
INFO:pytorch_lightning.utilities.rank_zero:HPU available: False, using: 0 HPUs
INFO:pytorch_lightning.accelerators.cuda:LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
INFO:pytorch_lightning.utilities.rank_zero:Loading `train_dataloader` to estimate number of stepping batches.
INFO:pytorch_lightning.callbacks.model_summary:
  | Name  | Type                                | Params | Mode
---------------------------------------------------------------------
0 | model | DistilBertForSe

Sanity Checking: |          | 0/? [00:00<?, ?it/s]

Training: |          | 0/? [00:00<?, ?it/s]

Validation: |          | 0/? [00:00<?, ?it/s]

Validation: |          | 0/? [00:00<?, ?it/s]

Validation: |          | 0/? [00:00<?, ?it/s]

INFO:pytorch_lightning.utilities.rank_zero:`Trainer.fit` stopped: `max_epochs=3` reached.


✅ Completed Run 10/20: mrpc___week3_sweep___run10___lr2e-05_cosine_seq192


metric,history
accuracy,▁█▆
epoch,▁▁▁▅▅▅███
f1,▁█▆
train_loss,█▅▁▂▂▁
trainer/global_step,▁▂▃▃▅▅▆▇█
val_loss,█▁▆

metric,summary
accuracy,0.84559
epoch,2.0
f1,0.89067
train_loss,0.17227
trainer/global_step,344.0
val_loss,0.36733


[34m[1mwandb[0m: Agent Starting Run: hqtzrqza with config:
[34m[1mwandb[0m: 	learning_rate: 4.1102410178162005e-05
[34m[1mwandb[0m: 	lr_scheduler_type: linear
[34m[1mwandb[0m: 	max_seq_length: 192
[34m[1mwandb[0m: 	optimizer_type: adamw
[34m[1mwandb[0m: 	train_batch_size: 32
[34m[1mwandb[0m: 	warmup_steps: 0
[34m[1mwandb[0m: 	weight_decay: 0.005


INFO:lightning_fabric.utilities.seed:Seed set to 42



Sweep Run 11/20: mrpc___week3_sweep___run11___lr4e-05_linear_seq192


Map:   0%|          | 0/3668 [00:00<?, ? examples/s]

Map:   0%|          | 0/408 [00:00<?, ? examples/s]

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
INFO:pytorch_lightning.utilities.rank_zero:GPU available: True (cuda), used: True
INFO:pytorch_lightning.utilities.rank_zero:TPU available: False, using: 0 TPU cores
INFO:pytorch_lightning.utilities.rank_zero:HPU available: False, using: 0 HPUs
INFO:pytorch_lightning.accelerators.cuda:LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
INFO:pytorch_lightning.utilities.rank_zero:Loading `train_dataloader` to estimate number of stepping batches.
INFO:pytorch_lightning.callbacks.model_summary:
  | Name  | Type                                | Params | Mode
---------------------------------------------------------------------
0 | model | DistilBertForSe

Sanity Checking: |          | 0/? [00:00<?, ?it/s]

Training: |          | 0/? [00:00<?, ?it/s]

Validation: |          | 0/? [00:00<?, ?it/s]

Validation: |          | 0/? [00:00<?, ?it/s]

Validation: |          | 0/? [00:00<?, ?it/s]

INFO:pytorch_lightning.utilities.rank_zero:`Trainer.fit` stopped: `max_epochs=3` reached.


✅ Completed Run 11/20: mrpc___week3_sweep___run11___lr4e-05_linear_seq192


metric,history
accuracy,▁█▆
epoch,▁▁▁▅▅▅███
f1,▁█▇
train_loss,█▅▁▂▂▁
trainer/global_step,▁▂▃▃▅▅▆▇█
val_loss,▅▁█

metric,summary
accuracy,0.84559
epoch,2.0
f1,0.89111
train_loss,0.15949
trainer/global_step,344.0
val_loss,0.40801


[34m[1mwandb[0m: Agent Starting Run: ftypdvxc with config:
[34m[1mwandb[0m: 	learning_rate: 2.5672597962455305e-05
[34m[1mwandb[0m: 	lr_scheduler_type: linear
[34m[1mwandb[0m: 	max_seq_length: 192
[34m[1mwandb[0m: 	optimizer_type: adamw
[34m[1mwandb[0m: 	train_batch_size: 32
[34m[1mwandb[0m: 	warmup_steps: 0
[34m[1mwandb[0m: 	weight_decay: 0.005


INFO:lightning_fabric.utilities.seed:Seed set to 42



Sweep Run 12/20: mrpc___week3_sweep___run12___lr3e-05_linear_seq192


Map:   0%|          | 0/3668 [00:00<?, ? examples/s]

Map:   0%|          | 0/408 [00:00<?, ? examples/s]

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
INFO:pytorch_lightning.utilities.rank_zero:GPU available: True (cuda), used: True
INFO:pytorch_lightning.utilities.rank_zero:TPU available: False, using: 0 TPU cores
INFO:pytorch_lightning.utilities.rank_zero:HPU available: False, using: 0 HPUs
INFO:pytorch_lightning.accelerators.cuda:LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
INFO:pytorch_lightning.utilities.rank_zero:Loading `train_dataloader` to estimate number of stepping batches.
INFO:pytorch_lightning.callbacks.model_summary:
  | Name  | Type                                | Params | Mode
---------------------------------------------------------------------
0 | model | DistilBertForSe

Sanity Checking: |          | 0/? [00:00<?, ?it/s]

Training: |          | 0/? [00:00<?, ?it/s]

Validation: |          | 0/? [00:00<?, ?it/s]

Validation: |          | 0/? [00:00<?, ?it/s]

Validation: |          | 0/? [00:00<?, ?it/s]

INFO:pytorch_lightning.utilities.rank_zero:`Trainer.fit` stopped: `max_epochs=3` reached.


✅ Completed Run 12/20: mrpc___week3_sweep___run12___lr3e-05_linear_seq192


metric,history
accuracy,▁██
epoch,▁▁▁▅▅▅███
f1,▁▇█
train_loss,█▄▁▁▃▁
trainer/global_step,▁▂▃▃▅▅▆▇█
val_loss,▇▁█

metric,summary
accuracy,0.84314
epoch,2.0
f1,0.88986
train_loss,0.17521
trainer/global_step,344.0
val_loss,0.39244


[34m[1mwandb[0m: Agent Starting Run: d0essiks with config:
[34m[1mwandb[0m: 	learning_rate: 4.4585668053547554e-05
[34m[1mwandb[0m: 	lr_scheduler_type: cosine
[34m[1mwandb[0m: 	max_seq_length: 224
[34m[1mwandb[0m: 	optimizer_type: adamw
[34m[1mwandb[0m: 	train_batch_size: 32
[34m[1mwandb[0m: 	warmup_steps: 0
[34m[1mwandb[0m: 	weight_decay: 0.005


INFO:lightning_fabric.utilities.seed:Seed set to 42



Sweep Run 13/20: mrpc___week3_sweep___run13___lr4e-05_cosine_seq224


Map:   0%|          | 0/3668 [00:00<?, ? examples/s]

Map:   0%|          | 0/408 [00:00<?, ? examples/s]

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
INFO:pytorch_lightning.utilities.rank_zero:GPU available: True (cuda), used: True
INFO:pytorch_lightning.utilities.rank_zero:TPU available: False, using: 0 TPU cores
INFO:pytorch_lightning.utilities.rank_zero:HPU available: False, using: 0 HPUs
INFO:pytorch_lightning.accelerators.cuda:LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
INFO:pytorch_lightning.utilities.rank_zero:Loading `train_dataloader` to estimate number of stepping batches.
INFO:pytorch_lightning.callbacks.model_summary:
  | Name  | Type                                | Params | Mode
---------------------------------------------------------------------
0 | model | DistilBertForSe

Sanity Checking: |          | 0/? [00:00<?, ?it/s]

Training: |          | 0/? [00:00<?, ?it/s]

Validation: |          | 0/? [00:00<?, ?it/s]

Validation: |          | 0/? [00:00<?, ?it/s]

Validation: |          | 0/? [00:00<?, ?it/s]

INFO:pytorch_lightning.utilities.rank_zero:`Trainer.fit` stopped: `max_epochs=3` reached.


✅ Completed Run 13/20: mrpc___week3_sweep___run13___lr4e-05_cosine_seq224


metric,history
accuracy,▁██
epoch,▁▁▁▅▅▅███
f1,▁▇█
train_loss,█▆▁▂▂▂
trainer/global_step,▁▂▃▃▅▅▆▇█
val_loss,▃▁█

metric,summary
accuracy,0.84559
epoch,2.0
f1,0.89246
train_loss,0.14327
trainer/global_step,344.0
val_loss,0.41509


[34m[1mwandb[0m: Agent Starting Run: 5q2scyf6 with config:
[34m[1mwandb[0m: 	learning_rate: 2.7676727391567817e-05
[34m[1mwandb[0m: 	lr_scheduler_type: linear
[34m[1mwandb[0m: 	max_seq_length: 160
[34m[1mwandb[0m: 	optimizer_type: adamw
[34m[1mwandb[0m: 	train_batch_size: 32
[34m[1mwandb[0m: 	warmup_steps: 0
[34m[1mwandb[0m: 	weight_decay: 0.005


INFO:lightning_fabric.utilities.seed:Seed set to 42



Sweep Run 14/20: mrpc___week3_sweep___run14___lr3e-05_linear_seq160


Map:   0%|          | 0/3668 [00:00<?, ? examples/s]

Map:   0%|          | 0/408 [00:00<?, ? examples/s]

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
INFO:pytorch_lightning.utilities.rank_zero:GPU available: True (cuda), used: True
INFO:pytorch_lightning.utilities.rank_zero:TPU available: False, using: 0 TPU cores
INFO:pytorch_lightning.utilities.rank_zero:HPU available: False, using: 0 HPUs
INFO:pytorch_lightning.accelerators.cuda:LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
INFO:pytorch_lightning.utilities.rank_zero:Loading `train_dataloader` to estimate number of stepping batches.
INFO:pytorch_lightning.callbacks.model_summary:
  | Name  | Type                                | Params | Mode
---------------------------------------------------------------------
0 | model | DistilBertForSe

Sanity Checking: |          | 0/? [00:00<?, ?it/s]

Training: |          | 0/? [00:00<?, ?it/s]

Validation: |          | 0/? [00:00<?, ?it/s]

Validation: |          | 0/? [00:00<?, ?it/s]

Validation: |          | 0/? [00:00<?, ?it/s]

INFO:pytorch_lightning.utilities.rank_zero:`Trainer.fit` stopped: `max_epochs=3` reached.


✅ Completed Run 14/20: mrpc___week3_sweep___run14___lr3e-05_linear_seq160


metric,history
accuracy,▁▅█
epoch,▁▁▁▅▅▅███
f1,▁▅█
train_loss,█▄▁▁▃▁
trainer/global_step,▁▂▃▃▅▅▆▇█
val_loss,▆▁█

metric,summary
accuracy,0.85049
epoch,2.0
f1,0.89551
train_loss,0.16718
trainer/global_step,344.0
val_loss,0.39131


[34m[1mwandb[0m: Agent Starting Run: 29yzk8ad with config:
[34m[1mwandb[0m: 	learning_rate: 3.894229615301737e-05
[34m[1mwandb[0m: 	lr_scheduler_type: cosine
[34m[1mwandb[0m: 	max_seq_length: 192
[34m[1mwandb[0m: 	optimizer_type: adamw
[34m[1mwandb[0m: 	train_batch_size: 32
[34m[1mwandb[0m: 	warmup_steps: 0
[34m[1mwandb[0m: 	weight_decay: 0.005


INFO:lightning_fabric.utilities.seed:Seed set to 42



Sweep Run 15/20: mrpc___week3_sweep___run15___lr4e-05_cosine_seq192


Map:   0%|          | 0/3668 [00:00<?, ? examples/s]

Map:   0%|          | 0/408 [00:00<?, ? examples/s]

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
INFO:pytorch_lightning.utilities.rank_zero:GPU available: True (cuda), used: True
INFO:pytorch_lightning.utilities.rank_zero:TPU available: False, using: 0 TPU cores
INFO:pytorch_lightning.utilities.rank_zero:HPU available: False, using: 0 HPUs
INFO:pytorch_lightning.accelerators.cuda:LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
INFO:pytorch_lightning.utilities.rank_zero:Loading `train_dataloader` to estimate number of stepping batches.
INFO:pytorch_lightning.callbacks.model_summary:
  | Name  | Type                                | Params | Mode
---------------------------------------------------------------------
0 | model | DistilBertForSe

Sanity Checking: |          | 0/? [00:00<?, ?it/s]

Training: |          | 0/? [00:00<?, ?it/s]

Validation: |          | 0/? [00:00<?, ?it/s]

Validation: |          | 0/? [00:00<?, ?it/s]

Validation: |          | 0/? [00:00<?, ?it/s]

INFO:pytorch_lightning.utilities.rank_zero:`Trainer.fit` stopped: `max_epochs=3` reached.


✅ Completed Run 15/20: mrpc___week3_sweep___run15___lr4e-05_cosine_seq192


metric,history
accuracy,▁▁█
epoch,▁▁▁▅▅▅███
f1,▁▇█
train_loss,█▅▁▁▃▁
trainer/global_step,▁▂▃▃▅▅▆▇█
val_loss,▇▁█

metric,summary
accuracy,0.84069
epoch,2.0
f1,0.88876
train_loss,0.14953
trainer/global_step,344.0
val_loss,0.38275


[34m[1mwandb[0m: Agent Starting Run: mpe7ebzz with config:
[34m[1mwandb[0m: 	learning_rate: 4.0772155289130054e-05
[34m[1mwandb[0m: 	lr_scheduler_type: cosine
[34m[1mwandb[0m: 	max_seq_length: 192
[34m[1mwandb[0m: 	optimizer_type: adamw
[34m[1mwandb[0m: 	train_batch_size: 32
[34m[1mwandb[0m: 	warmup_steps: 0
[34m[1mwandb[0m: 	weight_decay: 0.005


INFO:lightning_fabric.utilities.seed:Seed set to 42



Sweep Run 16/20: mrpc___week3_sweep___run16___lr4e-05_cosine_seq192


Map:   0%|          | 0/3668 [00:00<?, ? examples/s]

Map:   0%|          | 0/408 [00:00<?, ? examples/s]

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
INFO:pytorch_lightning.utilities.rank_zero:GPU available: True (cuda), used: True
INFO:pytorch_lightning.utilities.rank_zero:TPU available: False, using: 0 TPU cores
INFO:pytorch_lightning.utilities.rank_zero:HPU available: False, using: 0 HPUs
INFO:pytorch_lightning.accelerators.cuda:LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
INFO:pytorch_lightning.utilities.rank_zero:Loading `train_dataloader` to estimate number of stepping batches.
INFO:pytorch_lightning.callbacks.model_summary:
  | Name  | Type                                | Params | Mode
---------------------------------------------------------------------
0 | model | DistilBertForSe

Sanity Checking: |          | 0/? [00:00<?, ?it/s]

Training: |          | 0/? [00:00<?, ?it/s]

Validation: |          | 0/? [00:00<?, ?it/s]

Validation: |          | 0/? [00:00<?, ?it/s]

Validation: |          | 0/? [00:00<?, ?it/s]

INFO:pytorch_lightning.utilities.rank_zero:`Trainer.fit` stopped: `max_epochs=3` reached.


✅ Completed Run 16/20: mrpc___week3_sweep___run16___lr4e-05_cosine_seq192


metric,history
accuracy,█▃▁
epoch,▁▁▁▅▅▅███
f1,▁█▄
train_loss,█▅▁▂▃▁
trainer/global_step,▁▂▃▃▅▅▆▇█
val_loss,▃▁█

metric,summary
accuracy,0.84069
epoch,2.0
f1,0.88882
train_loss,0.14201
trainer/global_step,344.0
val_loss,0.40944


[34m[1mwandb[0m: Sweep Agent: Waiting for job.
[34m[1mwandb[0m: Job received.
[34m[1mwandb[0m: Agent Starting Run: bzelinnb with config:
[34m[1mwandb[0m: 	learning_rate: 3.808925684914765e-05
[34m[1mwandb[0m: 	lr_scheduler_type: cosine
[34m[1mwandb[0m: 	max_seq_length: 128
[34m[1mwandb[0m: 	optimizer_type: adamw
[34m[1mwandb[0m: 	train_batch_size: 32
[34m[1mwandb[0m: 	warmup_steps: 0
[34m[1mwandb[0m: 	weight_decay: 0.005


INFO:lightning_fabric.utilities.seed:Seed set to 42



Sweep Run 17/20: mrpc___week3_sweep___run17___lr4e-05_cosine_seq128


Map:   0%|          | 0/3668 [00:00<?, ? examples/s]

Map:   0%|          | 0/408 [00:00<?, ? examples/s]

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
INFO:pytorch_lightning.utilities.rank_zero:GPU available: True (cuda), used: True
INFO:pytorch_lightning.utilities.rank_zero:TPU available: False, using: 0 TPU cores
INFO:pytorch_lightning.utilities.rank_zero:HPU available: False, using: 0 HPUs
INFO:pytorch_lightning.accelerators.cuda:LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
INFO:pytorch_lightning.utilities.rank_zero:Loading `train_dataloader` to estimate number of stepping batches.
INFO:pytorch_lightning.callbacks.model_summary:
  | Name  | Type                                | Params | Mode
---------------------------------------------------------------------
0 | model | DistilBertForSe

Sanity Checking: |          | 0/? [00:00<?, ?it/s]

Training: |          | 0/? [00:00<?, ?it/s]

Validation: |          | 0/? [00:00<?, ?it/s]

Validation: |          | 0/? [00:00<?, ?it/s]

Validation: |          | 0/? [00:00<?, ?it/s]

INFO:pytorch_lightning.utilities.rank_zero:`Trainer.fit` stopped: `max_epochs=3` reached.


✅ Completed Run 17/20: mrpc___week3_sweep___run17___lr4e-05_cosine_seq128


metric,history
accuracy,█▁▁
epoch,▁▁▁▅▅▅███
f1,▁█▆
train_loss,█▅▁▂▃▁
trainer/global_step,▁▂▃▃▅▅▆▇█
val_loss,▇▁█

metric,summary
accuracy,0.84314
epoch,2.0
f1,0.88972
train_loss,0.13642
trainer/global_step,344.0
val_loss,0.37689


[34m[1mwandb[0m: Agent Starting Run: f086e1s0 with config:
[34m[1mwandb[0m: 	learning_rate: 4.2629227622584424e-05
[34m[1mwandb[0m: 	lr_scheduler_type: cosine
[34m[1mwandb[0m: 	max_seq_length: 160
[34m[1mwandb[0m: 	optimizer_type: adamw
[34m[1mwandb[0m: 	train_batch_size: 32
[34m[1mwandb[0m: 	warmup_steps: 0
[34m[1mwandb[0m: 	weight_decay: 0.005


INFO:lightning_fabric.utilities.seed:Seed set to 42



Sweep Run 18/20: mrpc___week3_sweep___run18___lr4e-05_cosine_seq160


Map:   0%|          | 0/3668 [00:00<?, ? examples/s]

Map:   0%|          | 0/408 [00:00<?, ? examples/s]

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
INFO:pytorch_lightning.utilities.rank_zero:GPU available: True (cuda), used: True
INFO:pytorch_lightning.utilities.rank_zero:TPU available: False, using: 0 TPU cores
INFO:pytorch_lightning.utilities.rank_zero:HPU available: False, using: 0 HPUs
INFO:pytorch_lightning.accelerators.cuda:LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
INFO:pytorch_lightning.utilities.rank_zero:Loading `train_dataloader` to estimate number of stepping batches.
INFO:pytorch_lightning.callbacks.model_summary:
  | Name  | Type                                | Params | Mode
---------------------------------------------------------------------
0 | model | DistilBertForSe

Sanity Checking: |          | 0/? [00:00<?, ?it/s]

Training: |          | 0/? [00:00<?, ?it/s]

Validation: |          | 0/? [00:00<?, ?it/s]

Validation: |          | 0/? [00:00<?, ?it/s]

Validation: |          | 0/? [00:00<?, ?it/s]

INFO:pytorch_lightning.utilities.rank_zero:`Trainer.fit` stopped: `max_epochs=3` reached.


✅ Completed Run 18/20: mrpc___week3_sweep___run18___lr4e-05_cosine_seq160


metric,history
accuracy,█▄▁
epoch,▁▁▁▅▅▅███
f1,▁█▆
train_loss,█▆▁▂▂▂
trainer/global_step,▁▂▃▃▅▅▆▇█
val_loss,▄▁█

metric,summary
accuracy,0.84314
epoch,2.0
f1,0.8904
train_loss,0.14548
trainer/global_step,344.0
val_loss,0.40348


[34m[1mwandb[0m: Agent Starting Run: 9qsvfyz7 with config:
[34m[1mwandb[0m: 	learning_rate: 2.1425203499242853e-05
[34m[1mwandb[0m: 	lr_scheduler_type: linear
[34m[1mwandb[0m: 	max_seq_length: 192
[34m[1mwandb[0m: 	optimizer_type: adamw
[34m[1mwandb[0m: 	train_batch_size: 32
[34m[1mwandb[0m: 	warmup_steps: 0
[34m[1mwandb[0m: 	weight_decay: 0.005


INFO:lightning_fabric.utilities.seed:Seed set to 42



Sweep Run 19/20: mrpc___week3_sweep___run19___lr2e-05_linear_seq192


Map:   0%|          | 0/3668 [00:00<?, ? examples/s]

Map:   0%|          | 0/408 [00:00<?, ? examples/s]

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
INFO:pytorch_lightning.utilities.rank_zero:GPU available: True (cuda), used: True
INFO:pytorch_lightning.utilities.rank_zero:TPU available: False, using: 0 TPU cores
INFO:pytorch_lightning.utilities.rank_zero:HPU available: False, using: 0 HPUs
INFO:pytorch_lightning.accelerators.cuda:LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
INFO:pytorch_lightning.utilities.rank_zero:Loading `train_dataloader` to estimate number of stepping batches.
INFO:pytorch_lightning.callbacks.model_summary:
  | Name  | Type                                | Params | Mode
---------------------------------------------------------------------
0 | model | DistilBertForSe

INFO:pytorch_lightning.utilities.rank_zero:`Trainer.fit` stopped: `max_epochs=3` reached.


✅ Completed Run 19/20: mrpc___week3_sweep___run19___lr2e-05_linear_seq192


metric,history
accuracy,▁██
epoch,▁▁▁▅▅▅███
f1,▁██
train_loss,█▅▁▁▃▁
trainer/global_step,▁▂▃▃▅▅▆▇█
val_loss,█▁▆

metric,final_value
accuracy,0.84559
epoch,2.0
f1,0.89058
train_loss,0.21026
trainer/global_step,344.0
val_loss,0.38602


[34m[1mwandb[0m: Agent Starting Run: 8zjftgd5 with config:
[34m[1mwandb[0m: 	learning_rate: 4.4333307761161706e-05
[34m[1mwandb[0m: 	lr_scheduler_type: cosine
[34m[1mwandb[0m: 	max_seq_length: 160
[34m[1mwandb[0m: 	optimizer_type: adamw
[34m[1mwandb[0m: 	train_batch_size: 32
[34m[1mwandb[0m: 	warmup_steps: 0
[34m[1mwandb[0m: 	weight_decay: 0.005


INFO:lightning_fabric.utilities.seed:Seed set to 42



Sweep Run 20/20: mrpc___week3_sweep___run20___lr4e-05_cosine_seq160


Map:   0%|          | 0/3668 [00:00<?, ? examples/s]

Map:   0%|          | 0/408 [00:00<?, ? examples/s]

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
INFO:pytorch_lightning.utilities.rank_zero:GPU available: True (cuda), used: True
INFO:pytorch_lightning.utilities.rank_zero:TPU available: False, using: 0 TPU cores
INFO:pytorch_lightning.utilities.rank_zero:HPU available: False, using: 0 HPUs
INFO:pytorch_lightning.accelerators.cuda:LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
INFO:pytorch_lightning.utilities.rank_zero:Loading `train_dataloader` to estimate number of stepping batches.
INFO:pytorch_lightning.callbacks.model_summary:
  | Name  | Type                                | Params | Mode
---------------------------------------------------------------------
0 | model | DistilBertForSe

INFO:pytorch_lightning.utilities.rank_zero:`Trainer.fit` stopped: `max_epochs=3` reached.


✅ Completed Run 20/20: mrpc___week3_sweep___run20___lr4e-05_cosine_seq160


metric,history
accuracy,▁▅█
epoch,▁▁▁▅▅▅███
f1,▁▆█
train_loss,█▆▁▂▂▂
trainer/global_step,▁▂▃▃▅▅▆▇█
val_loss,▄▁█

metric,final_value
accuracy,0.84559
epoch,2.0
f1,0.89211
train_loss,0.14113
trainer/global_step,344.0
val_loss,0.41003



🎉 WEEK 3 SWEEP COMPLETED!
Results: https://wandb.ai/benjamin-amhof-hochschule-luzern/MLOPS___Sem_5___Project_01/sweeps/b0dfmbpr
