<a href="https://colab.research.google.com/github/dgromann/cl_intro_ws2024/blob/main/exercises/HomeExercise3.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Home Exercise 3: Hyperparameters and Evaluation
In this third home exercise, you will use the knowledge from Tutorial 4 to experiment with hyperparameters, create a test set, and evaluate your final model on the created test set.

In this notebook, please complete all instructions starting with 👋 ⚒ in the code cell after the sign or provide your analysis in the text cell after the sign.

## **DistilBERT: Hyperparameters and Evaluation**

Use the code of Tutorial 4 to load and fine-tune the `distilbert-base-cased` model on the small subset of the `imdb` Movie Review Dataset. For convenience, the code of Tutorial 4 required for this exercise is already provided in the code cells below.

👋 ⚒ When creating the dataset splits in the code cell below, additionally create a test set to be used after training. Make sure that your test set does not contain any of the sentences contained in the training or validation set and is approximately the same size as the validation set.
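One way to get non-overlapping splits is to shuffle the example indices once with a fixed seed and then cut consecutive slices. The sketch below illustrates this with plain Python indices only; the helper name `disjoint_splits` and the split sizes are assumptions for illustration, not part of the tutorial code. With the `datasets` library, the equivalent is to shuffle with a single fixed seed and `select` non-overlapping ranges.

```python
import random

def disjoint_splits(n_items, sizes, seed=24):
    """Shuffle indices once and cut them into consecutive, non-overlapping slices."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    splits, start = {}, 0
    for name, size in sizes.items():
        splits[name] = indices[start:start + size]
        start += size
    return splits

# Hypothetical sizes: a test set as large as the validation set
splits = disjoint_splits(25000, {"train": 128, "val": 32, "test": 32})

# No index may appear in more than one split
assert not set(splits["train"]) & set(splits["val"])
assert not set(splits["train"]) & set(splits["test"])
assert not set(splits["val"]) & set(splits["test"])
print({name: len(idx) for name, idx in splits.items()})
```

Disjoint indices only rule out reusing the same row. If the underlying data contained duplicate texts, an additional check on the text itself would be needed.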

In [None]:
!pip install transformers
!pip install datasets
!pip install evaluate
!pip install accelerate --upgrade
!pip install pandas


In [None]:
from datasets import load_dataset, DatasetDict
from transformers import AutoTokenizer, AutoModelForSequenceClassification, DataCollatorWithPadding, TrainingArguments, Trainer
import numpy as np
import evaluate
import torch
import pandas as pd

# Load the IMDb dataset (comment this line out if it is already loaded in your session)
imdb_dataset = load_dataset("imdb")
# Make sure you have the right tokenizer
tokenizer = AutoTokenizer.from_pretrained("distilbert/distilbert-base-cased")


# Just take the first 50 whitespace-separated words for speed on CPU
def truncate(example):
    return {
        'text': " ".join(example['text'].split()[:50]),
        'label': example['label']
    }

# Shuffle with the same fixed seed each time, then take non-overlapping ranges:
# 128 examples for training, 32 for validation, and 40 for testing
small_imdb_dataset = DatasetDict(
    train=imdb_dataset['train'].shuffle(seed=24).select(range(128)).map(truncate),
    val=imdb_dataset['train'].shuffle(seed=24).select(range(128, 160)).map(truncate),
    test=imdb_dataset['train'].shuffle(seed=24).select(range(160, 200)).map(truncate)
)

def tokenize_function(examples):
    return tokenizer(examples["text"], padding=True, truncation=True)

small_tokenized_dataset = small_imdb_dataset.map(tokenize_function, batched=True, batch_size=16)
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)


👋 ⚒ For this exercise, we will use the Hugging Face Trainer class to play with hyperparameters. Try to find a set of hyperparameter settings that achieves the highest possible accuracy on the **validation set** with the small dataset and model in this setup.

**Optional:** If you want to follow a more systematic route, feel free to use available frameworks for hyperparameter optimization, such as [Optuna](https://optuna.org/).
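As a rough illustration of what a systematic search does, the sketch below runs a plain random search in pure Python. The learning rate is sampled log-uniformly, so that each order of magnitude is equally likely, and the batch size is picked from a fixed set. The function `toy_validation_accuracy` is a made-up stand-in for "fine-tune the model and return validation accuracy"; its shape is an assumption for illustration only and says nothing about the real model.

```python
import math
import random

def sample_log_uniform(rng, low, high):
    """Draw a value whose logarithm is uniform between log(low) and log(high)."""
    return math.exp(rng.uniform(math.log(low), math.log(high)))

def toy_validation_accuracy(learning_rate, batch_size):
    """Hypothetical stand-in for 'train the model, return validation accuracy'."""
    # Peaks near learning_rate = 5e-5; the shape is invented for illustration
    closeness = abs(math.log10(learning_rate) - math.log10(5e-5))
    return max(0.0, 0.9 - 0.2 * closeness - 0.001 * batch_size)

rng = random.Random(0)
best_score, best_params = -1.0, None
for _ in range(20):
    params = {
        "learning_rate": sample_log_uniform(rng, 1e-6, 1e-4),
        "batch_size": rng.choice([8, 16, 32]),
    }
    score = toy_validation_accuracy(**params)
    # Keep the best configuration seen so far
    if score > best_score:
        best_score, best_params = score, params

print(best_params, round(best_score, 3))
```

Frameworks such as Optuna follow the same loop but choose the next configuration more cleverly than uniform random sampling.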

In [None]:
model = AutoModelForSequenceClassification.from_pretrained('distilbert/distilbert-base-cased', num_labels=2)
accuracy = evaluate.load("accuracy")

arguments = TrainingArguments(
    output_dir="sample_cl_trainer",
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    logging_steps=8,
    num_train_epochs=5,
    eval_strategy="epoch",
    save_strategy="epoch",
    learning_rate=2e-5,
    weight_decay=0.01,
    load_best_model_at_end=True,
    report_to='none',
    seed=224
)

def compute_metrics(eval_pred):
    """Called at the end of validation. Gives accuracy"""
    # unpack predictions
    logits, labels = eval_pred
    # convert logits to predicted class indices: take the index of the max logit for each example
    predictions = np.argmax(logits, axis=-1)
    # calculate the accuracy
    return accuracy.compute(predictions=predictions, references=labels)


trainer = Trainer(
    model=model,  # model to be trained
    args=arguments,  # training arguments
    train_dataset=small_tokenized_dataset['train'],  # training dataset
    eval_dataset=small_tokenized_dataset['val'],  # validation dataset
    processing_class=tokenizer,  # tokenizer used to preprocess the dataset
    data_collator=data_collator,
    compute_metrics=compute_metrics
)

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert/distilbert-base-cased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.

In [None]:
trainer.train()

| Epoch | Training Loss | Validation Loss | Accuracy |
|---|---|---|---|
| 1 | 0.7012 | 0.698196 | 0.46875 |
| 2 | 0.6808 | 0.693926 | 0.46875 |
| 3 | 0.666 | 0.686764 | 0.46875 |
| 4 | 0.6592 | 0.678429 | 0.65625 |
| 5 | 0.6458 | 0.677457 | 0.65625 |


TrainOutput(global_step=40, training_loss=0.6705888628959655, metrics={'train_runtime': 498.2497, 'train_samples_per_second': 1.284, 'train_steps_per_second': 0.08, 'total_flos': 18636507148416.0, 'train_loss': 0.6705888628959655, 'epoch': 5.0})

In [None]:
trainer.evaluate()

{'eval_loss': 0.6774572134017944,
 'eval_accuracy': 0.65625,
 'eval_runtime': 3.7579,
 'eval_samples_per_second': 8.515,
 'eval_steps_per_second': 0.532,
 'epoch': 5.0}
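With only 32 validation examples, accuracy can only take values that are multiples of 1/32, which is why the same numbers repeat across epochs. The short calculation below recovers the number of correctly classified examples behind the reported accuracies; the helper name `correct_count` is just an illustration.

```python
from fractions import Fraction

def correct_count(accuracy, n_examples):
    """Recover how many examples were classified correctly from an accuracy value."""
    return round(accuracy * n_examples)

n_val = 32  # size of the validation split used above
for acc in (0.46875, 0.65625):
    k = correct_count(acc, n_val)
    print(f"accuracy {acc} = {k}/{n_val}")

# One example more or less moves the accuracy by 1/32
step = Fraction(1, n_val)
print(float(step))
```

The jump from 0.46875 to 0.65625 therefore corresponds to six additional correct predictions (15 to 21 out of 32), so small differences between hyperparameter settings on this validation set should be interpreted with care.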

### **Hyperparameter Optimization with Optuna**

In [None]:
!pip install optuna


In [None]:
import optuna

def objective(trial):
  # Continuous hyperparameters are sampled on a log scale
  learning_rate = trial.suggest_float("learning_rate", 1e-6, 1e-4, log=True)
  weight_decay = trial.suggest_float("weight_decay", 1e-6, 1e-4, log=True)
  # suggest_int takes a name plus lower and upper bounds (both inclusive)
  num_train_epochs = trial.suggest_int("num_train_epochs", 1, 5)
  # suggest_categorical takes a name plus the list of allowed choices
  batch_size = trial.suggest_categorical("batch_size", [8, 16, 32])

  trial_training_args = TrainingArguments(
    # separate directory per trial so the checkpoints of the run above are not overwritten
    output_dir=f"sample_cl_trainer_optuna/trial_{trial.number}",
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    logging_steps=8,
    num_train_epochs=num_train_epochs,
    eval_strategy="epoch",
    save_strategy="epoch",
    learning_rate=learning_rate,
    weight_decay=weight_decay,
    load_best_model_at_end=True,
    report_to='none',
    seed=224
  )

  trial_trainer = Trainer(
      # start every trial from a freshly loaded pre-trained model
      model=AutoModelForSequenceClassification.from_pretrained('distilbert/distilbert-base-cased', num_labels=2),
      args=trial_training_args,
      train_dataset=small_tokenized_dataset['train'],
      eval_dataset=small_tokenized_dataset['val'],
      processing_class=tokenizer,
      compute_metrics=compute_metrics
  )

  trial_trainer.train()
  eval_results = trial_trainer.evaluate()

  # Optuna maximizes this value (validation accuracy)
  return eval_results["eval_accuracy"]

In [None]:
study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=100)
print(study.best_params)

👋 ⚒ Change the following code cell so that the evaluation is performed not on a single sentence but on the entire newly created test set. Make sure to use the correct checkpoint of your trained model!

This might also be a good occasion to get familiar with the [Hugging Face documentation and tutorials](https://huggingface.co/docs/transformers/index).
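The Trainer stores checkpoints in folders named `checkpoint-<global_step>`, so picking the correct one requires knowing how many optimizer steps the training took. The calculation below reproduces the `global_step=40` reported by the training run above from the settings used there (128 training examples, batch size 16, 5 epochs). It assumes a single device and no gradient accumulation; the helper name `total_training_steps` is only for illustration.

```python
import math

def total_training_steps(n_train, batch_size, epochs):
    """Optimizer steps for one device with no gradient accumulation (assumed)."""
    steps_per_epoch = math.ceil(n_train / batch_size)
    return steps_per_epoch * epochs

steps = total_training_steps(n_train=128, batch_size=16, epochs=5)
print(f"checkpoint-{steps}")
```

With `save_strategy="epoch"`, a checkpoint is written at the end of every epoch, so in this setup they fall on steps 8, 16, 24, 32, and 40. Changing the batch size or the number of epochs changes these numbers, and the path passed to `from_pretrained` has to change with them.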

In [None]:
# Checkpoint 40 is the last training step; it also has the lowest validation loss in the run above
fine_tuned_model = AutoModelForSequenceClassification.from_pretrained("sample_cl_trainer/checkpoint-40")
fine_tuned_model.eval()  # set model to evaluation mode

def evaluation_on_model(eval_dataset):
  """Compute the accuracy of the fine-tuned model on every example in eval_dataset."""
  all_predictions = []
  all_labels = []

  # loop through each item, tokenize the text, and predict its class
  for item in eval_dataset:
    model_inputs = tokenizer(item['text'], truncation=True, return_tensors="pt")

    with torch.no_grad():
      logits = fine_tuned_model(**model_inputs).logits
    # index of the largest logit is the predicted class
    all_predictions.append(torch.argmax(logits, dim=-1).item())
    all_labels.append(item['label'])

  return accuracy.compute(predictions=all_predictions, references=all_labels)

def prediction(dataset, model, tokenizer):
  """Return a DataFrame with each text and its predicted sentiment."""
  label_names = ["NEGATIVE", "POSITIVE"]
  results = []

  for item in dataset:
    text = item['text']
    model_inputs = tokenizer(text, truncation=True, return_tensors="pt")

    with torch.no_grad():
      logits = model(**model_inputs).logits
    predicted_class = torch.argmax(logits, dim=-1).item()
    # a dict (not a set) keeps text and label in fixed, named columns
    results.append({"Text": text, "Predicted Sentiment": label_names[predicted_class]})

  return pd.DataFrame(results)

# Evaluate on the entire test set, not on a single sentence
print(evaluation_on_model(small_tokenized_dataset['test']))
predicted_results = prediction(small_tokenized_dataset['test'], fine_tuned_model, tokenizer)
print(predicted_results)

Run on Test set

In [None]:
test_results = trainer.evaluate(eval_dataset=small_tokenized_dataset['test'])
print("Test set evaluation results:", test_results)

predictions = trainer.predict(test_dataset=small_tokenized_dataset['test'])

predicted_labels = np.argmax(predictions.predictions, axis=1)
test_texts = small_tokenized_dataset['test']['text']

label_map = {0: "NEGATIVE", 1: "POSITIVE"}
results_df = pd.DataFrame({
    "Text": test_texts,
    "Predicted Sentiment": [label_map[label] for label in predicted_labels]
})

print(results_df)

Test set evaluation results: {'eval_loss': 0.6866800785064697, 'eval_accuracy': 0.525, 'eval_runtime': 10.0157, 'eval_samples_per_second': 3.994, 'eval_steps_per_second': 0.3, 'epoch': 5.0}
                                                 Text Predicted Sentiment
0   If you made a genre flick in the late 80s, you...            POSITIVE
1   I really enjoyed this movie as a young kid. At...            NEGATIVE
2   As a rule, a Full Moon production logo is a wa...            POSITIVE
3   Terry Gilliam's fantastic, twisted story of a ...            POSITIVE
4   This is one of my all time favorites.<br /><br...            POSITIVE
5   Billed as a kind of sequel to The Full Monty, ...            POSITIVE
6   This was playing at our theater in Amsterdam a...            POSITIVE
7   The story and the show were good, but it was r...            NEGATIVE
8   Very interesting and moving documentary about ...            POSITIVE
9   Cuban Blood is one of those sleeper films that...            POSIT