# Hyperparameter Tuning
*(Note: This notebook runs significantly faster if you have access to a GPU. Use either the GPUHub, Google Colab, or your own GPU.)*

In this project, you will optimize the hyperparameters of a model in 3 stages.

## Paraphrase Detection
We finetune [distilbert-base-uncased](https://huggingface.co/distilbert-base-uncased) on [MRPC](https://huggingface.co/datasets/glue/viewer/mrpc/train), a paraphrase detection dataset. This notebook is adapted from a [PyTorch Lightning example](https://lightning.ai/docs/pytorch/1.9.5/notebooks/lightning_examples/text-transformers.html).

In [1]:
%pip install -q torch transformers lightning datasets wandb evaluate ipywidgets wandb optuna

Note: you may need to restart the kernel to use updated packages.


In [2]:
%pip install transformers[torch]

Note: you may need to restart the kernel to use updated packages.


The next 4 cells are:
* Imports
* The `GLUEDataModule` loads the task's dataset and creates dataloaders for the train and valid sets.
* The `GLUETransformer` implements the model forward pass and the training/validation steps. You can check here what is logged with the `self.log` calls.
* The last cell runs training with the given parameters.

In [3]:
from datetime import datetime
from typing import Optional

import datasets
import evaluate
import lightning as L
import torch
from torch.utils.data import DataLoader
from transformers import (
    AutoConfig,
    AutoModelForSequenceClassification,
    AutoTokenizer,
    get_linear_schedule_with_warmup,
)

In [4]:
class GLUEDataModule(L.LightningDataModule):
    task_text_field_map = {
        "cola": ["sentence"],
        "sst2": ["sentence"],
        "mrpc": ["sentence1", "sentence2"],
        "qqp": ["question1", "question2"],
        "stsb": ["sentence1", "sentence2"],
        "mnli": ["premise", "hypothesis"],
        "qnli": ["question", "sentence"],
        "rte": ["sentence1", "sentence2"],
        "wnli": ["sentence1", "sentence2"],
        "ax": ["premise", "hypothesis"],
    }

    glue_task_num_labels = {
        "cola": 2,
        "sst2": 2,
        "mrpc": 2,
        "qqp": 2,
        "stsb": 1,
        "mnli": 3,
        "qnli": 2,
        "rte": 2,
        "wnli": 2,
        "ax": 3,
    }

    loader_columns = [
        "datasets_idx",
        "input_ids",
        "token_type_ids",
        "attention_mask",
        "start_positions",
        "end_positions",
        "labels",
    ]

    def __init__(
        self,
        model_name_or_path: str,
        task_name: str = "mrpc",
        max_seq_length: int = 128,
        train_batch_size: int = 32,
        eval_batch_size: int = 32,
        **kwargs,
    ):
        super().__init__()
        self.model_name_or_path = model_name_or_path
        self.task_name = task_name
        self.max_seq_length = max_seq_length
        self.train_batch_size = train_batch_size
        self.eval_batch_size = eval_batch_size

        self.text_fields = self.task_text_field_map[task_name]
        self.num_labels = self.glue_task_num_labels[task_name]
        self.tokenizer = AutoTokenizer.from_pretrained(self.model_name_or_path, use_fast=True)

    def setup(self, stage: str):
        self.dataset = datasets.load_dataset("glue", self.task_name)

        for split in self.dataset.keys():
            self.dataset[split] = self.dataset[split].map(
                self.convert_to_features,
                batched=True,
                remove_columns=["label"],
            )
            self.columns = [c for c in self.dataset[split].column_names if c in self.loader_columns]
            self.dataset[split].set_format(type="torch", columns=self.columns)

        self.eval_splits = [x for x in self.dataset.keys() if "validation" in x]

    def prepare_data(self):
        datasets.load_dataset("glue", self.task_name)
        AutoTokenizer.from_pretrained(self.model_name_or_path, use_fast=True)

    def train_dataloader(self):
        return DataLoader(self.dataset["train"], batch_size=self.train_batch_size, shuffle=True)

    def val_dataloader(self):
        if len(self.eval_splits) == 1:
            return DataLoader(self.dataset["validation"], batch_size=self.eval_batch_size)
        elif len(self.eval_splits) > 1:
            return [DataLoader(self.dataset[x], batch_size=self.eval_batch_size) for x in self.eval_splits]

    def test_dataloader(self):
        if len(self.eval_splits) == 1:
            return DataLoader(self.dataset["test"], batch_size=self.eval_batch_size)
        elif len(self.eval_splits) > 1:
            return [DataLoader(self.dataset[x], batch_size=self.eval_batch_size) for x in self.eval_splits]

    def convert_to_features(self, example_batch, indices=None):
        # Either encode single sentence or sentence pairs
        if len(self.text_fields) > 1:
            texts_or_text_pairs = list(zip(example_batch[self.text_fields[0]], example_batch[self.text_fields[1]]))
        else:
            texts_or_text_pairs = example_batch[self.text_fields[0]]

        # Tokenize the text/text pairs
        features = self.tokenizer.batch_encode_plus(
            texts_or_text_pairs, max_length=self.max_seq_length, pad_to_max_length=True, truncation=True
        )

        # Rename label to labels to make it easier to pass to model forward
        features["labels"] = example_batch["label"]

        return features

In [5]:
from torch.optim import AdamW, Adam, SGD
from transformers import get_cosine_schedule_with_warmup
class GLUETransformer(L.LightningModule):
    def __init__(
        self,
        model_name_or_path: str,
        num_labels: int,
        task_name: str,
        learning_rate: float = 2e-5,
        warmup_steps: int = 0,
        weight_decay: float = 0.0,
        train_batch_size: int = 32,
        eval_batch_size: int = 32,
        eval_splits: Optional[list] = None,
        scheduler: str = 'linear',
        **kwargs,
    ):
        super().__init__()

        self.save_hyperparameters()

        self.config = AutoConfig.from_pretrained(model_name_or_path, num_labels=num_labels)
        self.model = AutoModelForSequenceClassification.from_pretrained(model_name_or_path, config=self.config)
        self.metric = evaluate.load(
            "glue", self.hparams.task_name, experiment_id=datetime.now().strftime("%d-%m-%Y_%H-%M-%S")
        )

        self.validation_step_outputs = []

    def forward(self, **inputs):
        return self.model(**inputs)

    def training_step(self, batch, batch_idx):
        outputs = self(**batch)
        loss = outputs[0]
        self.log("loss", loss, on_step=False, on_epoch=True)
        return loss

    def validation_step(self, batch, batch_idx, dataloader_idx=0):
        outputs = self(**batch)
        val_loss, logits = outputs[:2]

        if self.hparams.num_labels > 1:
            preds = torch.argmax(logits, axis=1)
        elif self.hparams.num_labels == 1:
            preds = logits.squeeze()

        labels = batch["labels"]
        self.validation_step_outputs.append({"loss": val_loss, "preds": preds, "labels": labels})
        return val_loss

    def on_validation_epoch_end(self):
        if self.hparams.task_name == "mnli":
            for i, output in enumerate(self.validation_step_outputs):
                # matched or mismatched
                split = self.hparams.eval_splits[i].split("_")[-1]
                preds = torch.cat([x["preds"] for x in output]).detach().cpu().numpy()
                labels = torch.cat([x["labels"] for x in output]).detach().cpu().numpy()
                loss = torch.stack([x["loss"] for x in output]).mean()
                self.log(f"val_loss_{split}", loss, prog_bar=True)
                split_metrics = {
                    f"{k}_{split}": v for k, v in self.metric.compute(predictions=preds, references=labels).items()
                }
                self.log_dict(split_metrics, prog_bar=True)
            self.validation_step_outputs.clear()
            return loss

        preds = torch.cat([x["preds"] for x in self.validation_step_outputs]).detach().cpu().numpy()
        labels = torch.cat([x["labels"] for x in self.validation_step_outputs]).detach().cpu().numpy()
        loss = torch.stack([x["loss"] for x in self.validation_step_outputs]).mean()
        self.log("val_loss", loss, prog_bar=True)
        self.log_dict(self.metric.compute(predictions=preds, references=labels), prog_bar=True)
        self.validation_step_outputs.clear()

    def configure_optimizers(self):
        """Prepare optimizer and schedule (linear warmup and decay)"""
        model = self.model
        no_decay = ["bias", "LayerNorm.weight"]
        optimizer_grouped_parameters = [
            {
                "params": [p for n, p in model.named_parameters() if not any(nd in n for nd in no_decay)],
                "weight_decay": self.hparams.weight_decay,
            },
            {
                "params": [p for n, p in model.named_parameters() if any(nd in n for nd in no_decay)],
                "weight_decay": 0.0,
            },
        ]
    
        # Initialize optimizer based on the provided hyperparameter
        if self.hparams.optimizer == "adamw":
            optimizer = AdamW(optimizer_grouped_parameters, lr=self.hparams.learning_rate)
        elif self.hparams.optimizer == "adam":
            optimizer = Adam(optimizer_grouped_parameters, lr=self.hparams.learning_rate)
        elif self.hparams.optimizer == "sgd":
            optimizer = SGD(optimizer_grouped_parameters, lr=self.hparams.learning_rate)
        else:
            raise ValueError(f"Unknown optimizer type: {self.hparams.optimizer}")
    
        # Initialize learning rate scheduler
        scheduler = None  # Initialize scheduler to None
    
        # Check for valid scheduler types
        if self.hparams.scheduler == 'linear':
            scheduler = get_linear_schedule_with_warmup(
                optimizer,
                num_warmup_steps=self.hparams.warmup_steps,
                num_training_steps=self.trainer.estimated_stepping_batches,
            )
        elif self.hparams.scheduler == 'cosine':
            scheduler = get_cosine_schedule_with_warmup(
                optimizer,
                num_warmup_steps=self.hparams.warmup_steps,
                num_training_steps=self.trainer.estimated_stepping_batches,
            )
        elif self.hparams.scheduler == 'linear_warmup':
            # Add handling for 'linear_warmup' if that's your intention
            scheduler = get_linear_schedule_with_warmup(
                optimizer,
                num_warmup_steps=self.hparams.warmup_steps,
                num_training_steps=self.trainer.estimated_stepping_batches,
            )
        else:
            raise ValueError(f"Unknown scheduler type: {self.hparams.scheduler}")
    
        # Create the scheduler dictionary
        scheduler_dict = {"scheduler": scheduler, "interval": "step", "frequency": 1}
        return [optimizer], [scheduler_dict]


In [6]:
import os
import torch
import wandb
import lightning as L
from lightning.pytorch.loggers import WandbLogger
from lightning.pytorch.callbacks import ModelCheckpoint

def run_experiment(optimizer_name, warmup_steps, projectname, batch_size, weight_decay=0.0, scheduler='linear_warmup', learning_rate=2e-5):
    wandb_logger = WandbLogger(
        project=projectname,
        log_model=True
    )

    run_name = f"lr={learning_rate}_opt={optimizer_name}_wd={weight_decay}_bs={batch_size}_sched={scheduler}_warmup={warmup_steps}"
    wandb_logger.experiment.name = run_name

    wandb_logger.log_hyperparams({
        'learning_rate': learning_rate,
        'optimizer': optimizer_name,
        'weight_decay': weight_decay,
        'batch_size': batch_size,
        'scheduler': scheduler,
        'warmup_steps': warmup_steps
    })

    dm = GLUEDataModule(
        model_name_or_path="distilbert-base-uncased",
        task_name="mrpc",
    )
    dm.setup("fit")

    model = GLUETransformer(
        model_name_or_path="distilbert-base-uncased",
        num_labels=dm.num_labels,
        task_name=dm.task_name,
        learning_rate=learning_rate,
        warmup_steps=warmup_steps,
        weight_decay=weight_decay,
        train_batch_size=batch_size,
        eval_batch_size=batch_size,
        optimizer=optimizer_name, 
        scheduler=scheduler
    )

    trainer = L.Trainer(
        max_epochs=3,
        accelerator="auto",
        devices=1,
        logger=wandb_logger
    )

    trainer.fit(model, datamodule=dm)

    val_loss = trainer.callback_metrics["val_loss"].item()
    print(f"Validation Loss: {val_loss}")
    
    wandb.log({"val_loss": val_loss})
    wandb.finish()
    return val_loss 

In [None]:
run_experiment(
    learning_rate=2e-5,
    optimizer_name="adamw",
    weight_decay=0.0,
    batch_size=32,
    scheduler="linear_warmup",
    warmup_steps=0,
    projectname="GLUE"
)

run_experiment(
    learning_rate=2e-5,
    optimizer_name="adam",
    weight_decay=0.0,
    batch_size=32,
    scheduler="linear_warmup",
    warmup_steps=0,
    projectname="GLUE"
)

run_experiment(
    learning_rate=2e-5,
    optimizer_name="sgd",
    weight_decay=0.0,
    batch_size=32,
    scheduler="linear_warmup",
    warmup_steps=0,
    projectname="GLUE"
)

run_experiment(
    learning_rate=1e-5,
    optimizer_name="adamw",
    weight_decay=0.0,
    batch_size=32,
    scheduler="linear_warmup",
    warmup_steps=0,
    projectname="GLUE"
)

run_experiment(
    learning_rate=1e-6,
    optimizer_name="adamw",
    weight_decay=0.0,
    batch_size=32,
    scheduler="linear_warmup",
    warmup_steps=0,
    projectname="GLUE"
)

run_experiment(
    learning_rate=2e-5,
    optimizer_name="adamw",
    weight_decay=0.01,
    batch_size=32,
    scheduler="linear_warmup",
    warmup_steps=0,
    projectname="GLUE"
)

run_experiment(
    learning_rate=1e-7,
    optimizer_name="adamw",
    weight_decay=0.01,
    batch_size=32,
    scheduler="linear_warmup",
    warmup_steps=0,
    projectname="GLUE"
)

run_experiment(
    learning_rate=2e-5,
    optimizer_name="adamw",
    weight_decay=0.0,
    batch_size=16,
    scheduler="linear_warmup",
    warmup_steps=0,
    projectname="GLUE"
)
run_experiment(
    learning_rate=2e-5,
    optimizer_name="adamw",
    weight_decay=0.0,
    batch_size=16,
    scheduler="cosine",
    warmup_steps=0,
    projectname="GLUE"
)
run_experiment(
    learning_rate=2e-5,
    optimizer_name="adamw",
    weight_decay=0.0,
    batch_size=16,
    scheduler="linear_warmup",
    warmup_steps=100,
    projectname="GLUE"
)
run_experiment(
    learning_rate=2e-5,
    optimizer_name="adamw",
    weight_decay=0.0,
    batch_size=16,
    scheduler="linear_warmup",
    warmup_steps=200,
    projectname="GLUE"
)
run_experiment(
    learning_rate=1e-4,
    optimizer_name="adamw",
    weight_decay=0.0,
    batch_size=32,
    scheduler="linear_warmup",
    warmup_steps=0,
    projectname="GLUE"
)

# Week 2

In [None]:
import os
import torch
import wandb
import lightning as L
from lightning.pytorch.loggers import WandbLogger
from lightning.pytorch.callbacks import ModelCheckpoint

#wandb.init(project="hypertuning", name="hyperparam_search", mode="disabled")

PATH = 'content/trainings'
os.makedirs(PATH, exist_ok=True)

#logger = WandbLogger(project="hypertuning", log_model=True)

# hyperparameter search space (grid)
hyperparam_space = {
    "optimizer": ['adam', 'adamw'],
    "warmup_steps": [0, 100, 200, 300],
    "batch_size": [16, 32],
}

# Set random seed for reproducibility
L.seed_everything(42)

epochs = 3  # do not change this
best_val_loss = 0
best_hparams = None

for op in hyperparam_space["optimizer"]:
    for ws in hyperparam_space["warmup_steps"]:
        for batch_sz in hyperparam_space["batch_size"]:
            run_name = f"op-{op}_ws-{ws}_batch-{batch_sz}"
            print(f"Training with {run_name}")

            val_loss = run_experiment(
                optimizer_name=op,
                batch_size=batch_sz,
                warmup_steps=ws,
                projectname="hypertuning"
            )

            if val_loss < best_val_loss:
                best_val_loss = val_loss
                best_hparams = {
                    "optimizer": op,
                    "warmup_steps": ws,
                    "batch_size": batch_sz
                }
                
                model_path = os.path.join(PATH, f"{run_name}_best_model.pth")
                torch.save(model.state_dict(), model_path)

print(f"Best Hyperparameters: {best_hparams} with Validation Loss: {best_val_loss}")

# Week 3

# try sweeps

In [7]:
import os
import torch
import wandb
import lightning as L
from lightning.pytorch.loggers import WandbLogger
from lightning.pytorch.callbacks import ModelCheckpoint
sweep_config = {
    'method': 'random', 
    'metric': {
        'name': 'val_loss', 
        'goal': 'minimize'  
    },
    'parameters': {
        'optimizer': {
            'values': ['adam', 'adamw'] 
        },
        'warmup_steps': {
            'min': 0, 'max': 500  
        },
        'batch_size': {
            'values': [16, 32]
        }
    }
}

# Initialize the sweep
sweep_id = wandb.sweep(sweep_config, project="Sweeps")

Failed to detect the name of this notebook, you can set it manually with the WANDB_NOTEBOOK_NAME environment variable to enable code saving.
[34m[1mwandb[0m: Using wandb-core as the SDK backend. Please refer to https://wandb.me/wandb-core for more information.


Create sweep with ID: 9518et7t
Sweep URL: https://wandb.ai/sofia-horlacher-hochschule-luzern/Sweeps/sweeps/9518et7t


In [8]:
def tune_model(
    learning_rate: float=2e-5,
    warmup_steps: int=0,
    weight_decay: float=0,
    batch_size: int=32,
    optimizer: str="AdamW", 
    scheduler: str="linear_warmup",
    momentum: float=0.0,
    epsilon: float=1e-8,
    max_epochs = 3
):
    L.seed_everything(42)

    dm = GLUEDataModule(
        model_name_or_path="distilbert-base-uncased",
        task_name="mrpc",
    )
    dm.setup("fit")

    model = GLUETransformer(
        model_name_or_path=dm.model_name_or_path,
        num_labels=dm.num_labels,
        eval_splits=dm.eval_splits,
        task_name=dm.task_name,
        learning_rate=learning_rate,
        warmup_steps=warmup_steps,
        weight_decay=weight_decay,
        optimizer=optimizer, 
        scheduler=scheduler,
        momentum=momentum,
        epsilon=epsilon,
        train_batch_size=batch_size,
        eval_batch_size=batch_size,
    )


    run_name = f"lr={learning_rate}_opt={optimizer}_wd={weight_decay}_bs={batch_size}_sched={scheduler}_warmup={warmup_steps}"
    
    wandb_logger = WandbLogger(project="Sweeps", log_model=True)
    wandb_logger.experiment.name = run_name
    wandb_logger.watch(model)

    trainer = L.Trainer(
        max_epochs=max_epochs,
        accelerator="auto",
        devices=1,
        logger=wandb_logger,
        accumulate_grad_batches=batch_size / 32
    )
    trainer.fit(model, datamodule=dm)    

In [9]:
def train(config=None):
    with wandb.init(config):
        config = wandb.config
        tune_model(
            optimizer=config.optimizer,
            warmup_steps=config.warmup_steps,
            batch_size=config.batch_size,
        )

In [10]:
sweep_id = wandb.sweep(sweep_config, project="Sweeps")
wandb.agent(sweep_id, function=train, count=16)

Create sweep with ID: w5rh1fpe
Sweep URL: https://wandb.ai/sofia-horlacher-hochschule-luzern/Sweeps/sweeps/w5rh1fpe


[34m[1mwandb[0m: Agent Starting Run: gzq0h2ip with config:
[34m[1mwandb[0m: 	batch_size: 32
[34m[1mwandb[0m: 	optimizer: adam
[34m[1mwandb[0m: 	warmup_steps: 313
Failed to detect the name of this notebook, you can set it manually with the WANDB_NOTEBOOK_NAME environment variable to enable code saving.
[34m[1mwandb[0m: Currently logged in as: [33msofia-horlacher[0m ([33msofia-horlacher-hochschule-luzern[0m). Use [1m`wandb login --relogin`[0m to force relogin


Seed set to 42
Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
/opt/conda/lib/python3.11/site-packages/lightning/pytorch/loggers/wandb.py:396: There is a wandb run already in progress and newly created instances of `WandbLogger` will reuse this run. If this is not desired, call `wandb.finish()` before instantiating `WandbLogger`.
[34m[1mwandb[0m: logging graph, to disable use `wandb.watch(log_graph=False)`
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
HPU available: False, using: 0 HPUs
You are using a CUDA device ('NVIDIA A16') that has Tensor Cores. To properly utilize them, you should set `torch.set_float32_matmul_precision('medium' | 'high')` whi

Map:   0%|          | 0/3668 [00:00<?, ? examples/s]



Map:   0%|          | 0/408 [00:00<?, ? examples/s]

Map:   0%|          | 0/1725 [00:00<?, ? examples/s]

LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
Loading `train_dataloader` to estimate number of stepping batches.
/opt/conda/lib/python3.11/site-packages/lightning/pytorch/trainer/connectors/data_connector.py:424: The 'train_dataloader' does not have many workers which may be a bottleneck. Consider increasing the value of the `num_workers` argument` to `num_workers=127` in the `DataLoader` to improve performance.

  | Name  | Type                                | Params | Mode
---------------------------------------------------------------------
0 | model | DistilBertForSequenceClassification | 67.0 M | eval
---------------------------------------------------------------------
67.0 M    Trainable params
0         Non-trainable params
67.0 M    Total params
267.820   Total estimated model params size (MB)
0         Modules in train mode
96        Modules in eval mode


Sanity Checking: |          | 0/? [00:00<?, ?it/s]

/opt/conda/lib/python3.11/site-packages/lightning/pytorch/trainer/connectors/data_connector.py:424: The 'val_dataloader' does not have many workers which may be a bottleneck. Consider increasing the value of the `num_workers` argument` to `num_workers=127` in the `DataLoader` to improve performance.


Training: |          | 0/? [00:00<?, ?it/s]

Validation: |          | 0/? [00:00<?, ?it/s]

Validation: |          | 0/? [00:00<?, ?it/s]

Validation: |          | 0/? [00:00<?, ?it/s]

`Trainer.fit` stopped: `max_epochs=3` reached.


VBox(children=(Label(value='1.217 MB of 766.552 MB uploaded\r'), FloatProgress(value=0.0015877563149299361, ma…

0,1
accuracy,▁▆█
epoch,▁▁▅▅██
f1,▁▅█
loss,█▅▁
trainer/global_step,▁▁▅▅██
val_loss,█▃▁

0,1
accuracy,0.85294
epoch,2.0
f1,0.89726
loss,0.32643
trainer/global_step,344.0
val_loss,0.3683


[34m[1mwandb[0m: Agent Starting Run: 7gbdi2jf with config:
[34m[1mwandb[0m: 	batch_size: 16
[34m[1mwandb[0m: 	optimizer: adam
[34m[1mwandb[0m: 	warmup_steps: 288
Failed to detect the name of this notebook, you can set it manually with the WANDB_NOTEBOOK_NAME environment variable to enable code saving.


Seed set to 42
Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
/opt/conda/lib/python3.11/site-packages/lightning/pytorch/loggers/wandb.py:396: There is a wandb run already in progress and newly created instances of `WandbLogger` will reuse this run. If this is not desired, call `wandb.finish()` before instantiating `WandbLogger`.
[34m[1mwandb[0m: logging graph, to disable use `wandb.watch(log_graph=False)`
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
HPU available: False, using: 0 HPUs


Map:   0%|          | 0/3668 [00:00<?, ? examples/s]



Map:   0%|          | 0/408 [00:00<?, ? examples/s]

Map:   0%|          | 0/1725 [00:00<?, ? examples/s]

LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
Loading `train_dataloader` to estimate number of stepping batches.
/opt/conda/lib/python3.11/site-packages/lightning/pytorch/trainer/connectors/data_connector.py:424: The 'train_dataloader' does not have many workers which may be a bottleneck. Consider increasing the value of the `num_workers` argument` to `num_workers=127` in the `DataLoader` to improve performance.

  | Name  | Type                                | Params | Mode
---------------------------------------------------------------------
0 | model | DistilBertForSequenceClassification | 67.0 M | eval
---------------------------------------------------------------------
67.0 M    Trainable params
0         Non-trainable params
67.0 M    Total params
267.820   Total estimated model params size (MB)
0         Modules in train mode
96        Modules in eval mode


Sanity Checking: |          | 0/? [00:00<?, ?it/s]

/opt/conda/lib/python3.11/site-packages/lightning/pytorch/trainer/connectors/data_connector.py:424: The 'val_dataloader' does not have many workers which may be a bottleneck. Consider increasing the value of the `num_workers` argument` to `num_workers=127` in the `DataLoader` to improve performance.


Training: |          | 0/? [00:00<?, ?it/s]

Validation: |          | 0/? [00:00<?, ?it/s]

Validation: |          | 0/? [00:00<?, ?it/s]

Validation: |          | 0/? [00:00<?, ?it/s]

`Trainer.fit` stopped: `max_epochs=3` reached.


VBox(children=(Label(value='1.294 MB of 766.551 MB uploaded\r'), FloatProgress(value=0.0016877659682282215, ma…

0,1
accuracy,▁▆█
epoch,▁▁▅▅██
f1,▁▄█
loss,█▅▁
trainer/global_step,▁▁▅▅██
val_loss,█▃▁

0,1
accuracy,0.83578
epoch,2.0
f1,0.88468
loss,0.31561
trainer/global_step,344.0
val_loss,0.37997


[34m[1mwandb[0m: Agent Starting Run: a1konmff with config:
[34m[1mwandb[0m: 	batch_size: 16
[34m[1mwandb[0m: 	optimizer: adamw
[34m[1mwandb[0m: 	warmup_steps: 235
Failed to detect the name of this notebook, you can set it manually with the WANDB_NOTEBOOK_NAME environment variable to enable code saving.


Seed set to 42
Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
/opt/conda/lib/python3.11/site-packages/lightning/pytorch/loggers/wandb.py:396: There is a wandb run already in progress and newly created instances of `WandbLogger` will reuse this run. If this is not desired, call `wandb.finish()` before instantiating `WandbLogger`.
[34m[1mwandb[0m: logging graph, to disable use `wandb.watch(log_graph=False)`
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
HPU available: False, using: 0 HPUs


Map:   0%|          | 0/3668 [00:00<?, ? examples/s]



Map:   0%|          | 0/408 [00:00<?, ? examples/s]

Map:   0%|          | 0/1725 [00:00<?, ? examples/s]

LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
Loading `train_dataloader` to estimate number of stepping batches.
/opt/conda/lib/python3.11/site-packages/lightning/pytorch/trainer/connectors/data_connector.py:424: The 'train_dataloader' does not have many workers which may be a bottleneck. Consider increasing the value of the `num_workers` argument` to `num_workers=127` in the `DataLoader` to improve performance.

  | Name  | Type                                | Params | Mode
---------------------------------------------------------------------
0 | model | DistilBertForSequenceClassification | 67.0 M | eval
---------------------------------------------------------------------
67.0 M    Trainable params
0         Non-trainable params
67.0 M    Total params
267.820   Total estimated model params size (MB)
0         Modules in train mode
96        Modules in eval mode


Sanity Checking: |          | 0/? [00:00<?, ?it/s]

/opt/conda/lib/python3.11/site-packages/lightning/pytorch/trainer/connectors/data_connector.py:424: The 'val_dataloader' does not have many workers which may be a bottleneck. Consider increasing the value of the `num_workers` argument` to `num_workers=127` in the `DataLoader` to improve performance.


Training: |          | 0/? [00:00<?, ?it/s]

Validation: |          | 0/? [00:00<?, ?it/s]

Validation: |          | 0/? [00:00<?, ?it/s]

Validation: |          | 0/? [00:00<?, ?it/s]

`Trainer.fit` stopped: `max_epochs=3` reached.


VBox(children=(Label(value='1.264 MB of 766.552 MB uploaded\r'), FloatProgress(value=0.0016483912882706347, ma…

0,1
accuracy,▁▆█
epoch,▁▁▅▅██
f1,▁▃█
loss,█▅▁
trainer/global_step,▁▁▅▅██
val_loss,█▁▂

0,1
accuracy,0.84069
epoch,2.0
f1,0.89221
loss,0.28516
trainer/global_step,344.0
val_loss,0.43819


[34m[1mwandb[0m: Agent Starting Run: 74lai30n with config:
[34m[1mwandb[0m: 	batch_size: 32
[34m[1mwandb[0m: 	optimizer: adamw
[34m[1mwandb[0m: 	warmup_steps: 11
Failed to detect the name of this notebook, you can set it manually with the WANDB_NOTEBOOK_NAME environment variable to enable code saving.


Seed set to 42
Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
/opt/conda/lib/python3.11/site-packages/lightning/pytorch/loggers/wandb.py:396: There is a wandb run already in progress and newly created instances of `WandbLogger` will reuse this run. If this is not desired, call `wandb.finish()` before instantiating `WandbLogger`.
[34m[1mwandb[0m: logging graph, to disable use `wandb.watch(log_graph=False)`
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
HPU available: False, using: 0 HPUs


Map:   0%|          | 0/3668 [00:00<?, ? examples/s]



Map:   0%|          | 0/408 [00:00<?, ? examples/s]

Map:   0%|          | 0/1725 [00:00<?, ? examples/s]

LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
Loading `train_dataloader` to estimate number of stepping batches.
/opt/conda/lib/python3.11/site-packages/lightning/pytorch/trainer/connectors/data_connector.py:424: The 'train_dataloader' does not have many workers which may be a bottleneck. Consider increasing the value of the `num_workers` argument` to `num_workers=127` in the `DataLoader` to improve performance.

  | Name  | Type                                | Params | Mode
---------------------------------------------------------------------
0 | model | DistilBertForSequenceClassification | 67.0 M | eval
---------------------------------------------------------------------
67.0 M    Trainable params
0         Non-trainable params
67.0 M    Total params
267.820   Total estimated model params size (MB)
0         Modules in train mode
96        Modules in eval mode


Sanity Checking: |          | 0/? [00:00<?, ?it/s]

/opt/conda/lib/python3.11/site-packages/lightning/pytorch/trainer/connectors/data_connector.py:424: The 'val_dataloader' does not have many workers which may be a bottleneck. Consider increasing the value of the `num_workers` argument` to `num_workers=127` in the `DataLoader` to improve performance.


Training: |          | 0/? [00:00<?, ?it/s]

Validation: |          | 0/? [00:00<?, ?it/s]

Validation: |          | 0/? [00:00<?, ?it/s]

Validation: |          | 0/? [00:00<?, ?it/s]

`Trainer.fit` stopped: `max_epochs=3` reached.


VBox(children=(Label(value='1.218 MB of 766.553 MB uploaded\r'), FloatProgress(value=0.0015888245484146793, ma…

0,1
accuracy,▁▆█
epoch,▁▁▅▅██
f1,▁▆█
loss,█▄▁
trainer/global_step,▁▁▅▅██
val_loss,█▁▄

0,1
accuracy,0.85539
epoch,2.0
f1,0.89949
loss,0.20597
trainer/global_step,344.0
val_loss,0.37896


[34m[1mwandb[0m: Agent Starting Run: tumz97mk with config:
[34m[1mwandb[0m: 	batch_size: 16
[34m[1mwandb[0m: 	optimizer: adam
[34m[1mwandb[0m: 	warmup_steps: 461
Failed to detect the name of this notebook, you can set it manually with the WANDB_NOTEBOOK_NAME environment variable to enable code saving.


Seed set to 42
Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
/opt/conda/lib/python3.11/site-packages/lightning/pytorch/loggers/wandb.py:396: There is a wandb run already in progress and newly created instances of `WandbLogger` will reuse this run. If this is not desired, call `wandb.finish()` before instantiating `WandbLogger`.
[34m[1mwandb[0m: logging graph, to disable use `wandb.watch(log_graph=False)`
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
HPU available: False, using: 0 HPUs


Map:   0%|          | 0/3668 [00:00<?, ? examples/s]



Map:   0%|          | 0/408 [00:00<?, ? examples/s]

Map:   0%|          | 0/1725 [00:00<?, ? examples/s]

LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
Loading `train_dataloader` to estimate number of stepping batches.
/opt/conda/lib/python3.11/site-packages/lightning/pytorch/trainer/connectors/data_connector.py:424: The 'train_dataloader' does not have many workers which may be a bottleneck. Consider increasing the value of the `num_workers` argument` to `num_workers=127` in the `DataLoader` to improve performance.

  | Name  | Type                                | Params | Mode
---------------------------------------------------------------------
0 | model | DistilBertForSequenceClassification | 67.0 M | eval
---------------------------------------------------------------------
67.0 M    Trainable params
0         Non-trainable params
67.0 M    Total params
267.820   Total estimated model params size (MB)
0         Modules in train mode
96        Modules in eval mode


Sanity Checking: |          | 0/? [00:00<?, ?it/s]

/opt/conda/lib/python3.11/site-packages/lightning/pytorch/trainer/connectors/data_connector.py:424: The 'val_dataloader' does not have many workers which may be a bottleneck. Consider increasing the value of the `num_workers` argument` to `num_workers=127` in the `DataLoader` to improve performance.


Training: |          | 0/? [00:00<?, ?it/s]

Validation: |          | 0/? [00:00<?, ?it/s]

Validation: |          | 0/? [00:00<?, ?it/s]

Validation: |          | 0/? [00:00<?, ?it/s]

`Trainer.fit` stopped: `max_epochs=3` reached.


VBox(children=(Label(value='1.279 MB of 766.551 MB uploaded\r'), FloatProgress(value=0.0016679152869094182, ma…

0,1
accuracy,▁▅█
epoch,▁▁▅▅██
f1,▁▄█
loss,█▅▁
trainer/global_step,▁▁▅▅██
val_loss,█▅▁

0,1
accuracy,0.85294
epoch,2.0
f1,0.89761
loss,0.39357
trainer/global_step,344.0
val_loss,0.35836


[34m[1mwandb[0m: Agent Starting Run: 9qbp289o with config:
[34m[1mwandb[0m: 	batch_size: 32
[34m[1mwandb[0m: 	optimizer: adam
[34m[1mwandb[0m: 	warmup_steps: 258
Failed to detect the name of this notebook, you can set it manually with the WANDB_NOTEBOOK_NAME environment variable to enable code saving.


Seed set to 42
Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
/opt/conda/lib/python3.11/site-packages/lightning/pytorch/loggers/wandb.py:396: There is a wandb run already in progress and newly created instances of `WandbLogger` will reuse this run. If this is not desired, call `wandb.finish()` before instantiating `WandbLogger`.
[34m[1mwandb[0m: logging graph, to disable use `wandb.watch(log_graph=False)`
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
HPU available: False, using: 0 HPUs


Map:   0%|          | 0/3668 [00:00<?, ? examples/s]



Map:   0%|          | 0/408 [00:00<?, ? examples/s]

Map:   0%|          | 0/1725 [00:00<?, ? examples/s]

LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
Loading `train_dataloader` to estimate number of stepping batches.
/opt/conda/lib/python3.11/site-packages/lightning/pytorch/trainer/connectors/data_connector.py:424: The 'train_dataloader' does not have many workers which may be a bottleneck. Consider increasing the value of the `num_workers` argument` to `num_workers=127` in the `DataLoader` to improve performance.

  | Name  | Type                                | Params | Mode
---------------------------------------------------------------------
0 | model | DistilBertForSequenceClassification | 67.0 M | eval
---------------------------------------------------------------------
67.0 M    Trainable params
0         Non-trainable params
67.0 M    Total params
267.820   Total estimated model params size (MB)
0         Modules in train mode
96        Modules in eval mode


Sanity Checking: |          | 0/? [00:00<?, ?it/s]

/opt/conda/lib/python3.11/site-packages/lightning/pytorch/trainer/connectors/data_connector.py:424: The 'val_dataloader' does not have many workers which may be a bottleneck. Consider increasing the value of the `num_workers` argument` to `num_workers=127` in the `DataLoader` to improve performance.


Training: |          | 0/? [00:00<?, ?it/s]

Validation: |          | 0/? [00:00<?, ?it/s]

Validation: |          | 0/? [00:00<?, ?it/s]

Validation: |          | 0/? [00:00<?, ?it/s]

`Trainer.fit` stopped: `max_epochs=3` reached.


VBox(children=(Label(value='1.202 MB of 766.553 MB uploaded\r'), FloatProgress(value=0.001568202600433696, max…

0,1
accuracy,▁▆█
epoch,▁▁▅▅██
f1,▁▄█
loss,█▅▁
trainer/global_step,▁▁▅▅██
val_loss,█▃▁

0,1
accuracy,0.85049
epoch,2.0
f1,0.89573
loss,0.29255
trainer/global_step,344.0
val_loss,0.36341


[34m[1mwandb[0m: Agent Starting Run: pynbu6yv with config:
[34m[1mwandb[0m: 	batch_size: 32
[34m[1mwandb[0m: 	optimizer: adam
[34m[1mwandb[0m: 	warmup_steps: 493
Failed to detect the name of this notebook, you can set it manually with the WANDB_NOTEBOOK_NAME environment variable to enable code saving.


Seed set to 42
Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
/opt/conda/lib/python3.11/site-packages/lightning/pytorch/loggers/wandb.py:396: There is a wandb run already in progress and newly created instances of `WandbLogger` will reuse this run. If this is not desired, call `wandb.finish()` before instantiating `WandbLogger`.
[34m[1mwandb[0m: logging graph, to disable use `wandb.watch(log_graph=False)`
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
HPU available: False, using: 0 HPUs


Map:   0%|          | 0/3668 [00:00<?, ? examples/s]



Map:   0%|          | 0/408 [00:00<?, ? examples/s]

Map:   0%|          | 0/1725 [00:00<?, ? examples/s]

LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
Loading `train_dataloader` to estimate number of stepping batches.
/opt/conda/lib/python3.11/site-packages/lightning/pytorch/trainer/connectors/data_connector.py:424: The 'train_dataloader' does not have many workers which may be a bottleneck. Consider increasing the value of the `num_workers` argument` to `num_workers=127` in the `DataLoader` to improve performance.

  | Name  | Type                                | Params | Mode
---------------------------------------------------------------------
0 | model | DistilBertForSequenceClassification | 67.0 M | eval
---------------------------------------------------------------------
67.0 M    Trainable params
0         Non-trainable params
67.0 M    Total params
267.820   Total estimated model params size (MB)
0         Modules in train mode
96        Modules in eval mode


Sanity Checking: |          | 0/? [00:00<?, ?it/s]

/opt/conda/lib/python3.11/site-packages/lightning/pytorch/trainer/connectors/data_connector.py:424: The 'val_dataloader' does not have many workers which may be a bottleneck. Consider increasing the value of the `num_workers` argument` to `num_workers=127` in the `DataLoader` to improve performance.


Training: |          | 0/? [00:00<?, ?it/s]

Validation: |          | 0/? [00:00<?, ?it/s]

Validation: |          | 0/? [00:00<?, ?it/s]

Validation: |          | 0/? [00:00<?, ?it/s]

`Trainer.fit` stopped: `max_epochs=3` reached.


VBox(children=(Label(value='1.218 MB of 766.553 MB uploaded\r'), FloatProgress(value=0.001588526436974466, max…

0,1
accuracy,▁▄█
epoch,▁▁▅▅██
f1,▁▃█
loss,█▅▁
trainer/global_step,▁▁▅▅██
val_loss,█▅▁

0,1
accuracy,0.84314
epoch,2.0
f1,0.89333
loss,0.40552
trainer/global_step,344.0
val_loss,0.37654


[34m[1mwandb[0m: Agent Starting Run: x6q1ox85 with config:
[34m[1mwandb[0m: 	batch_size: 16
[34m[1mwandb[0m: 	optimizer: adamw
[34m[1mwandb[0m: 	warmup_steps: 361
Failed to detect the name of this notebook, you can set it manually with the WANDB_NOTEBOOK_NAME environment variable to enable code saving.


Seed set to 42
Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
/opt/conda/lib/python3.11/site-packages/lightning/pytorch/loggers/wandb.py:396: There is a wandb run already in progress and newly created instances of `WandbLogger` will reuse this run. If this is not desired, call `wandb.finish()` before instantiating `WandbLogger`.
[34m[1mwandb[0m: logging graph, to disable use `wandb.watch(log_graph=False)`
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
HPU available: False, using: 0 HPUs


Map:   0%|          | 0/3668 [00:00<?, ? examples/s]



Map:   0%|          | 0/408 [00:00<?, ? examples/s]

Map:   0%|          | 0/1725 [00:00<?, ? examples/s]

LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
Loading `train_dataloader` to estimate number of stepping batches.
/opt/conda/lib/python3.11/site-packages/lightning/pytorch/trainer/connectors/data_connector.py:424: The 'train_dataloader' does not have many workers which may be a bottleneck. Consider increasing the value of the `num_workers` argument` to `num_workers=127` in the `DataLoader` to improve performance.

  | Name  | Type                                | Params | Mode
---------------------------------------------------------------------
0 | model | DistilBertForSequenceClassification | 67.0 M | eval
---------------------------------------------------------------------
67.0 M    Trainable params
0         Non-trainable params
67.0 M    Total params
267.820   Total estimated model params size (MB)
0         Modules in train mode
96        Modules in eval mode


Sanity Checking: |          | 0/? [00:00<?, ?it/s]

/opt/conda/lib/python3.11/site-packages/lightning/pytorch/trainer/connectors/data_connector.py:424: The 'val_dataloader' does not have many workers which may be a bottleneck. Consider increasing the value of the `num_workers` argument` to `num_workers=127` in the `DataLoader` to improve performance.


Training: |          | 0/? [00:00<?, ?it/s]

Validation: |          | 0/? [00:00<?, ?it/s]

Validation: |          | 0/? [00:00<?, ?it/s]

Validation: |          | 0/? [00:00<?, ?it/s]

`Trainer.fit` stopped: `max_epochs=3` reached.


VBox(children=(Label(value='1.278 MB of 766.551 MB uploaded\r'), FloatProgress(value=0.0016677226386159099, ma…

0,1
accuracy,▁▆█
epoch,▁▁▅▅██
f1,▁▄█
loss,█▅▁
trainer/global_step,▁▁▅▅██
val_loss,█▃▁

0,1
accuracy,0.85294
epoch,2.0
f1,0.89286
loss,0.34678
trainer/global_step,344.0
val_loss,0.35726


[34m[1mwandb[0m: Agent Starting Run: 9y4xsyyd with config:
[34m[1mwandb[0m: 	batch_size: 32
[34m[1mwandb[0m: 	optimizer: adam
[34m[1mwandb[0m: 	warmup_steps: 135
Failed to detect the name of this notebook, you can set it manually with the WANDB_NOTEBOOK_NAME environment variable to enable code saving.


Seed set to 42
Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
/opt/conda/lib/python3.11/site-packages/lightning/pytorch/loggers/wandb.py:396: There is a wandb run already in progress and newly created instances of `WandbLogger` will reuse this run. If this is not desired, call `wandb.finish()` before instantiating `WandbLogger`.
[34m[1mwandb[0m: logging graph, to disable use `wandb.watch(log_graph=False)`
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
HPU available: False, using: 0 HPUs


Map:   0%|          | 0/3668 [00:00<?, ? examples/s]



Map:   0%|          | 0/408 [00:00<?, ? examples/s]

Map:   0%|          | 0/1725 [00:00<?, ? examples/s]

LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
Loading `train_dataloader` to estimate number of stepping batches.
/opt/conda/lib/python3.11/site-packages/lightning/pytorch/trainer/connectors/data_connector.py:424: The 'train_dataloader' does not have many workers which may be a bottleneck. Consider increasing the value of the `num_workers` argument` to `num_workers=127` in the `DataLoader` to improve performance.

  | Name  | Type                                | Params | Mode
---------------------------------------------------------------------
0 | model | DistilBertForSequenceClassification | 67.0 M | eval
---------------------------------------------------------------------
67.0 M    Trainable params
0         Non-trainable params
67.0 M    Total params
267.820   Total estimated model params size (MB)
0         Modules in train mode
96        Modules in eval mode


Sanity Checking: |          | 0/? [00:00<?, ?it/s]

/opt/conda/lib/python3.11/site-packages/lightning/pytorch/trainer/connectors/data_connector.py:424: The 'val_dataloader' does not have many workers which may be a bottleneck. Consider increasing the value of the `num_workers` argument` to `num_workers=127` in the `DataLoader` to improve performance.


Training: |          | 0/? [00:00<?, ?it/s]

Validation: |          | 0/? [00:00<?, ?it/s]

Validation: |          | 0/? [00:00<?, ?it/s]

Validation: |          | 0/? [00:00<?, ?it/s]

`Trainer.fit` stopped: `max_epochs=3` reached.


VBox(children=(Label(value='1.297 MB of 766.554 MB uploaded\r'), FloatProgress(value=0.001691721758348217, max…

0,1
accuracy,▁██
epoch,▁▁▅▅██
f1,▁▇█
loss,█▅▁
trainer/global_step,▁▁▅▅██
val_loss,█▁▂

0,1
accuracy,0.85294
epoch,2.0
f1,0.89761
loss,0.21239
trainer/global_step,344.0
val_loss,0.39292


[34m[1mwandb[0m: Agent Starting Run: jwhdag29 with config:
[34m[1mwandb[0m: 	batch_size: 16
[34m[1mwandb[0m: 	optimizer: adamw
[34m[1mwandb[0m: 	warmup_steps: 2
Failed to detect the name of this notebook, you can set it manually with the WANDB_NOTEBOOK_NAME environment variable to enable code saving.


Seed set to 42
Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
/opt/conda/lib/python3.11/site-packages/lightning/pytorch/loggers/wandb.py:396: There is a wandb run already in progress and newly created instances of `WandbLogger` will reuse this run. If this is not desired, call `wandb.finish()` before instantiating `WandbLogger`.
[34m[1mwandb[0m: logging graph, to disable use `wandb.watch(log_graph=False)`
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
HPU available: False, using: 0 HPUs


Map:   0%|          | 0/3668 [00:00<?, ? examples/s]



Map:   0%|          | 0/408 [00:00<?, ? examples/s]

Map:   0%|          | 0/1725 [00:00<?, ? examples/s]

LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
Loading `train_dataloader` to estimate number of stepping batches.
/opt/conda/lib/python3.11/site-packages/lightning/pytorch/trainer/connectors/data_connector.py:424: The 'train_dataloader' does not have many workers which may be a bottleneck. Consider increasing the value of the `num_workers` argument` to `num_workers=127` in the `DataLoader` to improve performance.

  | Name  | Type                                | Params | Mode
---------------------------------------------------------------------
0 | model | DistilBertForSequenceClassification | 67.0 M | eval
---------------------------------------------------------------------
67.0 M    Trainable params
0         Non-trainable params
67.0 M    Total params
267.820   Total estimated model params size (MB)
0         Modules in train mode
96        Modules in eval mode


Sanity Checking: |          | 0/? [00:00<?, ?it/s]

/opt/conda/lib/python3.11/site-packages/lightning/pytorch/trainer/connectors/data_connector.py:424: The 'val_dataloader' does not have many workers which may be a bottleneck. Consider increasing the value of the `num_workers` argument` to `num_workers=127` in the `DataLoader` to improve performance.


Training: |          | 0/? [00:00<?, ?it/s]

Validation: |          | 0/? [00:00<?, ?it/s]

Validation: |          | 0/? [00:00<?, ?it/s]

Validation: |          | 0/? [00:00<?, ?it/s]

`Trainer.fit` stopped: `max_epochs=3` reached.


VBox(children=(Label(value='1.293 MB of 766.550 MB uploaded\r'), FloatProgress(value=0.0016870754095019093, ma…

0,1
accuracy,▁▄█
epoch,▁▁▅▅██
f1,▁▃█
loss,█▄▁
trainer/global_step,▁▁▅▅██
val_loss,▅▁█

0,1
accuracy,0.84559
epoch,2.0
f1,0.89376
loss,0.16606
trainer/global_step,344.0
val_loss,0.46202


[34m[1mwandb[0m: Agent Starting Run: rttjmit5 with config:
[34m[1mwandb[0m: 	batch_size: 16
[34m[1mwandb[0m: 	optimizer: adamw
[34m[1mwandb[0m: 	warmup_steps: 51
Failed to detect the name of this notebook, you can set it manually with the WANDB_NOTEBOOK_NAME environment variable to enable code saving.


Seed set to 42
Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
/opt/conda/lib/python3.11/site-packages/lightning/pytorch/loggers/wandb.py:396: There is a wandb run already in progress and newly created instances of `WandbLogger` will reuse this run. If this is not desired, call `wandb.finish()` before instantiating `WandbLogger`.
[34m[1mwandb[0m: logging graph, to disable use `wandb.watch(log_graph=False)`
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
HPU available: False, using: 0 HPUs


Map:   0%|          | 0/3668 [00:00<?, ? examples/s]



Map:   0%|          | 0/408 [00:00<?, ? examples/s]

Map:   0%|          | 0/1725 [00:00<?, ? examples/s]

LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
Loading `train_dataloader` to estimate number of stepping batches.
/opt/conda/lib/python3.11/site-packages/lightning/pytorch/trainer/connectors/data_connector.py:424: The 'train_dataloader' does not have many workers which may be a bottleneck. Consider increasing the value of the `num_workers` argument` to `num_workers=127` in the `DataLoader` to improve performance.

  | Name  | Type                                | Params | Mode
---------------------------------------------------------------------
0 | model | DistilBertForSequenceClassification | 67.0 M | eval
---------------------------------------------------------------------
67.0 M    Trainable params
0         Non-trainable params
67.0 M    Total params
267.820   Total estimated model params size (MB)
0         Modules in train mode
96        Modules in eval mode


Sanity Checking: |          | 0/? [00:00<?, ?it/s]

/opt/conda/lib/python3.11/site-packages/lightning/pytorch/trainer/connectors/data_connector.py:424: The 'val_dataloader' does not have many workers which may be a bottleneck. Consider increasing the value of the `num_workers` argument` to `num_workers=127` in the `DataLoader` to improve performance.


Training: |          | 0/? [00:00<?, ?it/s]

Validation: |          | 0/? [00:00<?, ?it/s]

Validation: |          | 0/? [00:00<?, ?it/s]

Validation: |          | 0/? [00:00<?, ?it/s]

`Trainer.fit` stopped: `max_epochs=3` reached.


VBox(children=(Label(value='1.292 MB of 766.549 MB uploaded\r'), FloatProgress(value=0.0016859054319254294, ma…

0,1
accuracy,▁▃█
epoch,▁▁▅▅██
f1,▁▃█
loss,█▄▁
trainer/global_step,▁▁▅▅██
val_loss,▆▁█

0,1
accuracy,0.83578
epoch,2.0
f1,0.88739
loss,0.16003
trainer/global_step,344.0
val_loss,0.46457


[34m[1mwandb[0m: Agent Starting Run: yij4ap8o with config:
[34m[1mwandb[0m: 	batch_size: 16
[34m[1mwandb[0m: 	optimizer: adam
[34m[1mwandb[0m: 	warmup_steps: 188
Failed to detect the name of this notebook, you can set it manually with the WANDB_NOTEBOOK_NAME environment variable to enable code saving.


VBox(children=(Label(value='Waiting for wandb.init()...\r'), FloatProgress(value=0.011112527454841054, max=1.0…

Seed set to 42
Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
/opt/conda/lib/python3.11/site-packages/lightning/pytorch/loggers/wandb.py:396: There is a wandb run already in progress and newly created instances of `WandbLogger` will reuse this run. If this is not desired, call `wandb.finish()` before instantiating `WandbLogger`.
[34m[1mwandb[0m: logging graph, to disable use `wandb.watch(log_graph=False)`
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
HPU available: False, using: 0 HPUs


Map:   0%|          | 0/3668 [00:00<?, ? examples/s]



Map:   0%|          | 0/408 [00:00<?, ? examples/s]

Map:   0%|          | 0/1725 [00:00<?, ? examples/s]

LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
Loading `train_dataloader` to estimate number of stepping batches.
/opt/conda/lib/python3.11/site-packages/lightning/pytorch/trainer/connectors/data_connector.py:424: The 'train_dataloader' does not have many workers which may be a bottleneck. Consider increasing the value of the `num_workers` argument` to `num_workers=127` in the `DataLoader` to improve performance.

  | Name  | Type                                | Params | Mode
---------------------------------------------------------------------
0 | model | DistilBertForSequenceClassification | 67.0 M | eval
---------------------------------------------------------------------
67.0 M    Trainable params
0         Non-trainable params
67.0 M    Total params
267.820   Total estimated model params size (MB)
0         Modules in train mode
96        Modules in eval mode


Sanity Checking: |          | 0/? [00:00<?, ?it/s]

/opt/conda/lib/python3.11/site-packages/lightning/pytorch/trainer/connectors/data_connector.py:424: The 'val_dataloader' does not have many workers which may be a bottleneck. Consider increasing the value of the `num_workers` argument` to `num_workers=127` in the `DataLoader` to improve performance.


Training: |          | 0/? [00:00<?, ?it/s]

Validation: |          | 0/? [00:00<?, ?it/s]

Validation: |          | 0/? [00:00<?, ?it/s]

Validation: |          | 0/? [00:00<?, ?it/s]

`Trainer.fit` stopped: `max_epochs=3` reached.


VBox(children=(Label(value='1.201 MB of 766.551 MB uploaded\r'), FloatProgress(value=0.0015662797383405782, ma…

0,1
accuracy,▁██
epoch,▁▁▅▅██
f1,▁▇█
loss,█▅▁
trainer/global_step,▁▁▅▅██
val_loss,█▁▃

0,1
accuracy,0.83824
epoch,2.0
f1,0.88851
loss,0.24579
trainer/global_step,344.0
val_loss,0.44072


[34m[1mwandb[0m: Agent Starting Run: 02sa0zkr with config:
[34m[1mwandb[0m: 	batch_size: 32
[34m[1mwandb[0m: 	optimizer: adam
[34m[1mwandb[0m: 	warmup_steps: 393
Failed to detect the name of this notebook, you can set it manually with the WANDB_NOTEBOOK_NAME environment variable to enable code saving.


Seed set to 42
Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
/opt/conda/lib/python3.11/site-packages/lightning/pytorch/loggers/wandb.py:396: There is a wandb run already in progress and newly created instances of `WandbLogger` will reuse this run. If this is not desired, call `wandb.finish()` before instantiating `WandbLogger`.
[34m[1mwandb[0m: logging graph, to disable use `wandb.watch(log_graph=False)`
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
HPU available: False, using: 0 HPUs


Map:   0%|          | 0/3668 [00:00<?, ? examples/s]



Map:   0%|          | 0/408 [00:00<?, ? examples/s]

Map:   0%|          | 0/1725 [00:00<?, ? examples/s]

LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
Loading `train_dataloader` to estimate number of stepping batches.
/opt/conda/lib/python3.11/site-packages/lightning/pytorch/trainer/connectors/data_connector.py:424: The 'train_dataloader' does not have many workers which may be a bottleneck. Consider increasing the value of the `num_workers` argument` to `num_workers=127` in the `DataLoader` to improve performance.

  | Name  | Type                                | Params | Mode
---------------------------------------------------------------------
0 | model | DistilBertForSequenceClassification | 67.0 M | eval
---------------------------------------------------------------------
67.0 M    Trainable params
0         Non-trainable params
67.0 M    Total params
267.820   Total estimated model params size (MB)
0         Modules in train mode
96        Modules in eval mode


Sanity Checking: |          | 0/? [00:00<?, ?it/s]

/opt/conda/lib/python3.11/site-packages/lightning/pytorch/trainer/connectors/data_connector.py:424: The 'val_dataloader' does not have many workers which may be a bottleneck. Consider increasing the value of the `num_workers` argument` to `num_workers=127` in the `DataLoader` to improve performance.


Training: |          | 0/? [00:00<?, ?it/s]

Validation: |          | 0/? [00:00<?, ?it/s]

Validation: |          | 0/? [00:00<?, ?it/s]

Validation: |          | 0/? [00:00<?, ?it/s]

`Trainer.fit` stopped: `max_epochs=3` reached.


VBox(children=(Label(value='1.265 MB of 766.553 MB uploaded\r'), FloatProgress(value=0.0016501661867013599, ma…

0,1
accuracy,▁▆█
epoch,▁▁▅▅██
f1,▁▅█
loss,█▅▁
trainer/global_step,▁▁▅▅██
val_loss,█▄▁

0,1
accuracy,0.84559
epoch,2.0
f1,0.88649
loss,0.36261
trainer/global_step,344.0
val_loss,0.35337


[34m[1mwandb[0m: Agent Starting Run: ra46udou with config:
[34m[1mwandb[0m: 	batch_size: 32
[34m[1mwandb[0m: 	optimizer: adamw
[34m[1mwandb[0m: 	warmup_steps: 286
Failed to detect the name of this notebook, you can set it manually with the WANDB_NOTEBOOK_NAME environment variable to enable code saving.


Seed set to 42
Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
/opt/conda/lib/python3.11/site-packages/lightning/pytorch/loggers/wandb.py:396: There is a wandb run already in progress and newly created instances of `WandbLogger` will reuse this run. If this is not desired, call `wandb.finish()` before instantiating `WandbLogger`.
[34m[1mwandb[0m: logging graph, to disable use `wandb.watch(log_graph=False)`
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
HPU available: False, using: 0 HPUs


Map:   0%|          | 0/3668 [00:00<?, ? examples/s]



Map:   0%|          | 0/408 [00:00<?, ? examples/s]

Map:   0%|          | 0/1725 [00:00<?, ? examples/s]

LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
Loading `train_dataloader` to estimate number of stepping batches.
/opt/conda/lib/python3.11/site-packages/lightning/pytorch/trainer/connectors/data_connector.py:424: The 'train_dataloader' does not have many workers which may be a bottleneck. Consider increasing the value of the `num_workers` argument` to `num_workers=127` in the `DataLoader` to improve performance.

  | Name  | Type                                | Params | Mode
---------------------------------------------------------------------
0 | model | DistilBertForSequenceClassification | 67.0 M | eval
---------------------------------------------------------------------
67.0 M    Trainable params
0         Non-trainable params
67.0 M    Total params
267.820   Total estimated model params size (MB)
0         Modules in train mode
96        Modules in eval mode


Sanity Checking: |          | 0/? [00:00<?, ?it/s]

/opt/conda/lib/python3.11/site-packages/lightning/pytorch/trainer/connectors/data_connector.py:424: The 'val_dataloader' does not have many workers which may be a bottleneck. Consider increasing the value of the `num_workers` argument` to `num_workers=127` in the `DataLoader` to improve performance.


Training: |          | 0/? [00:00<?, ?it/s]

Validation: |          | 0/? [00:00<?, ?it/s]

Validation: |          | 0/? [00:00<?, ?it/s]

Validation: |          | 0/? [00:00<?, ?it/s]

`Trainer.fit` stopped: `max_epochs=3` reached.


VBox(children=(Label(value='1.312 MB of 766.553 MB uploaded\r'), FloatProgress(value=0.0017109698886933569, ma…

0,1
accuracy,▁▆█
epoch,▁▁▅▅██
f1,▁▄█
loss,█▅▁
trainer/global_step,▁▁▅▅██
val_loss,█▄▁

0,1
accuracy,0.84069
epoch,2.0
f1,0.88576
loss,0.3128
trainer/global_step,344.0
val_loss,0.34995


[34m[1mwandb[0m: Agent Starting Run: krabrz8i with config:
[34m[1mwandb[0m: 	batch_size: 16
[34m[1mwandb[0m: 	optimizer: adamw
[34m[1mwandb[0m: 	warmup_steps: 472
Failed to detect the name of this notebook, you can set it manually with the WANDB_NOTEBOOK_NAME environment variable to enable code saving.


Seed set to 42
Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
/opt/conda/lib/python3.11/site-packages/lightning/pytorch/loggers/wandb.py:396: There is a wandb run already in progress and newly created instances of `WandbLogger` will reuse this run. If this is not desired, call `wandb.finish()` before instantiating `WandbLogger`.
[34m[1mwandb[0m: logging graph, to disable use `wandb.watch(log_graph=False)`
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
HPU available: False, using: 0 HPUs


Map:   0%|          | 0/3668 [00:00<?, ? examples/s]



Map:   0%|          | 0/408 [00:00<?, ? examples/s]

Map:   0%|          | 0/1725 [00:00<?, ? examples/s]

LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
Loading `train_dataloader` to estimate number of stepping batches.
/opt/conda/lib/python3.11/site-packages/lightning/pytorch/trainer/connectors/data_connector.py:424: The 'train_dataloader' does not have many workers which may be a bottleneck. Consider increasing the value of the `num_workers` argument` to `num_workers=127` in the `DataLoader` to improve performance.

  | Name  | Type                                | Params | Mode
---------------------------------------------------------------------
0 | model | DistilBertForSequenceClassification | 67.0 M | eval
---------------------------------------------------------------------
67.0 M    Trainable params
0         Non-trainable params
67.0 M    Total params
267.820   Total estimated model params size (MB)
0         Modules in train mode
96        Modules in eval mode


Sanity Checking: |          | 0/? [00:00<?, ?it/s]

/opt/conda/lib/python3.11/site-packages/lightning/pytorch/trainer/connectors/data_connector.py:424: The 'val_dataloader' does not have many workers which may be a bottleneck. Consider increasing the value of the `num_workers` argument` to `num_workers=127` in the `DataLoader` to improve performance.


Training: |          | 0/? [00:00<?, ?it/s]

Validation: |          | 0/? [00:00<?, ?it/s]

Validation: |          | 0/? [00:00<?, ?it/s]

Validation: |          | 0/? [00:00<?, ?it/s]

`Trainer.fit` stopped: `max_epochs=3` reached.


VBox(children=(Label(value='1.294 MB of 766.551 MB uploaded\r'), FloatProgress(value=0.001688402985151149, max…

0,1
accuracy,▁▄█
epoch,▁▁▅▅██
f1,▁▃█
loss,█▅▁
trainer/global_step,▁▁▅▅██
val_loss,█▅▁

0,1
accuracy,0.85049
epoch,2.0
f1,0.89679
loss,0.39782
trainer/global_step,344.0
val_loss,0.36284


[34m[1mwandb[0m: Agent Starting Run: 58jvyyvs with config:
[34m[1mwandb[0m: 	batch_size: 32
[34m[1mwandb[0m: 	optimizer: adamw
[34m[1mwandb[0m: 	warmup_steps: 80
Failed to detect the name of this notebook, you can set it manually with the WANDB_NOTEBOOK_NAME environment variable to enable code saving.


Seed set to 42
Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
/opt/conda/lib/python3.11/site-packages/lightning/pytorch/loggers/wandb.py:396: There is a wandb run already in progress and newly created instances of `WandbLogger` will reuse this run. If this is not desired, call `wandb.finish()` before instantiating `WandbLogger`.
[34m[1mwandb[0m: logging graph, to disable use `wandb.watch(log_graph=False)`
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
HPU available: False, using: 0 HPUs


Map:   0%|          | 0/3668 [00:00<?, ? examples/s]



Map:   0%|          | 0/408 [00:00<?, ? examples/s]

Map:   0%|          | 0/1725 [00:00<?, ? examples/s]

LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
Loading `train_dataloader` to estimate number of stepping batches.
/opt/conda/lib/python3.11/site-packages/lightning/pytorch/trainer/connectors/data_connector.py:424: The 'train_dataloader' does not have many workers which may be a bottleneck. Consider increasing the value of the `num_workers` argument` to `num_workers=127` in the `DataLoader` to improve performance.

  | Name  | Type                                | Params | Mode
---------------------------------------------------------------------
0 | model | DistilBertForSequenceClassification | 67.0 M | eval
---------------------------------------------------------------------
67.0 M    Trainable params
0         Non-trainable params
67.0 M    Total params
267.820   Total estimated model params size (MB)
0         Modules in train mode
96        Modules in eval mode


Sanity Checking: |          | 0/? [00:00<?, ?it/s]

/opt/conda/lib/python3.11/site-packages/lightning/pytorch/trainer/connectors/data_connector.py:424: The 'val_dataloader' does not have many workers which may be a bottleneck. Consider increasing the value of the `num_workers` argument` to `num_workers=127` in the `DataLoader` to improve performance.


Training: |          | 0/? [00:00<?, ?it/s]

Validation: |          | 0/? [00:00<?, ?it/s]

Validation: |          | 0/? [00:00<?, ?it/s]

Validation: |          | 0/? [00:00<?, ?it/s]

`Trainer.fit` stopped: `max_epochs=3` reached.


VBox(children=(Label(value='1.282 MB of 766.554 MB uploaded\r'), FloatProgress(value=0.0016720586993805217, ma…

0,1
accuracy,▁██
epoch,▁▁▅▅██
f1,▁▇█
loss,█▄▁
trainer/global_step,▁▁▅▅██
val_loss,█▁▄

0,1
accuracy,0.84559
epoch,2.0
f1,0.89231
loss,0.19649
trainer/global_step,344.0
val_loss,0.3981


In [2]:
import nbformat

# Load the notebook
notebook_path = "week1.ipynb"
with open(notebook_path, "r") as file:
    notebook = nbformat.read(file, as_version=4)

# Display the structure of the notebook for assessment
notebook.cells[:5] 

[{'cell_type': 'markdown',
  'metadata': {'id': 'TRH0teWHl4Uy'},
  'source': '# Hyperparameter Tuning\n*(Note: This notebook runs significantly faster if you have access to a GPU. Use either the GPUHub, Google Colab, or your own GPU.)*\n\nIn this project, you will optimize the hyperparameters of a model in 3 stages.'},
 {'cell_type': 'markdown',
  'metadata': {'id': '6BWNKzfGl4U2'},
  'source': '## Paraphrase Detection\nWe finetune [distilbert-base-uncased](https://huggingface.co/distilbert-base-uncased) on [MRPC](https://huggingface.co/datasets/glue/viewer/mrpc/train), a paraphrase detection dataset. This notebook is adapted from a [PyTorch Lightning example](https://lightning.ai/docs/pytorch/1.9.5/notebooks/lightning_examples/text-transformers.html).'},
 {'cell_type': 'code',
  'execution_count': 1,
  'metadata': {'colab': {'base_uri': 'https://localhost:8080/'},
   'id': 'B3PUFAq9l4U2',
   'outputId': 'f13897e1-5b24-470f-dae3-10a916367455'},
  'outputs': [{'name': 'stdout',
    'out