# My Summary
This homework was personal for me, as my patience and stubbornness were tested.
If you're interested and have time, you can read about my journey here:

1. In our group chat, a few people complained about Optuna, saying it took too long to work with the data and even reduced their accuracy. I hadn’t personally used or tested Optuna for my homework, so...
2. I decided to examine all possible combinations of hyperparameters on my own. By the third combination, I was exhausted, which led me to my third idea:
3. Why not use an automated grid search? I wrote some code to automate the process (thinking it would be faster). The result? I spent almost 3–4 hours waiting for my code to run. Out of 10 iterations (I think), I achieved the highest accuracy, 0.781250 on the training data, only during the first run. I ran and tweaked my combinations many, many times.
4. At this point, I thought, "Wow, this model really wants to fight me." So, I increased the training, validation, and test dataset sizes. I had a fixed goal in my head: I wanted to achieve more than 80% accuracy.
5. And here I am! You can see all the results below (even my automated grid search that I ran many, many times).


This was a fun experience for me because when I delve into something like this (maniacally), I grasp the concepts very, very well.

# Home Exercise 3: Hyperparameters and Evaluation
In this third home exercise, you will use the knowledge from Tutorial 4 to experiment with hyperparameters, create a test set, and evaluate your final model on the created test set.

In this notebook, please complete all instructions starting with 👋 ⚒ in the code cell after the sign or provide your analysis in the text cell after the sign.

## **Distilbert: Hyperparameters and Evaluation**

Use the code of Tutorial 4 to load and fine-tune the `distilbert-base-cased` model on the small subset of the `imdb` Movie Review Dataset. For convenience, the code of Tutorial 4 required for this exercise is already provided in the code cells below.

👋 ⚒ When creating the dataset splits in the code cell below, additionally create a test set to be used after the training. Make sure that your test set does not contain any of the sentences contained in the training or validation set and is approximately the same size as the validation set.

In [1]:
!pip install transformers
!pip install datasets
!pip install evaluate
!pip install accelerate --upgrade

Collecting datasets
  Downloading datasets-3.1.0-py3-none-any.whl.metadata (20 kB)
Collecting dill<0.3.9,>=0.3.0 (from datasets)
  Downloading dill-0.3.8-py3-none-any.whl.metadata (10 kB)
Collecting xxhash (from datasets)
  Downloading xxhash-3.5.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (12 kB)
Collecting multiprocess<0.70.17 (from datasets)
  Downloading multiprocess-0.70.16-py310-none-any.whl.metadata (7.2 kB)
Collecting fsspec<=2024.9.0,>=2023.1.0 (from fsspec[http]<=2024.9.0,>=2023.1.0->datasets)
  Downloading fsspec-2024.9.0-py3-none-any.whl.metadata (11 kB)
Downloading datasets-3.1.0-py3-none-any.whl (480 kB)
Downloading dill-0.3.8-py3-none-any.whl (116 kB)
Downloading fsspec-2024.9.0-py3-none-any.whl (

In [65]:
from datasets import load_dataset, DatasetDict
from transformers import AutoTokenizer, DataCollatorWithPadding

imdb_dataset = load_dataset("imdb")

tokenizer = AutoTokenizer.from_pretrained("distilbert/distilbert-base-cased")


def truncate(example):
    """Keep only the first 200 whitespace-separated words of a review."""
    return {
        'text': " ".join(example['text'].split()[:200]),
        'label': example['label']
    }


# Shuffle once with a fixed seed, then take three non-overlapping index ranges
# so that the train, validation, and test splits share no rows.
shuffled_train = imdb_dataset['train'].shuffle(seed=24)

small_imdb_dataset = DatasetDict(
    train=shuffled_train.select(range(256)).map(truncate),
    val=shuffled_train.select(range(256, 320)).map(truncate),
    test_data=shuffled_train.select(range(320, 384)).map(truncate),
)


def tokenize_function(examples):
    return tokenizer(examples["text"], padding=True, truncation=True)


small_tokenized_dataset = small_imdb_dataset.map(
    tokenize_function,
    batched=True,
    batch_size=16
).with_format("torch")

data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

Map:   0%|          | 0/64 [00:00<?, ? examples/s]
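
The exercise asks for a test set that shares no sentences with the training or validation set. Selecting non-overlapping index ranges guarantees distinct rows, but not necessarily distinct texts if the source data happens to contain duplicate reviews. A minimal sketch of a text-level check follows; the helper name `find_overlaps` and the toy lists are assumptions made for illustration, not part of the tutorial code.

```python
from itertools import combinations


def find_overlaps(splits):
    """Return {(name_a, name_b): shared_texts} for every pair of splits that overlaps."""
    overlaps = {}
    for (name_a, texts_a), (name_b, texts_b) in combinations(splits.items(), 2):
        shared = set(texts_a) & set(texts_b)
        if shared:
            overlaps[(name_a, name_b)] = shared
    return overlaps


# Toy stand-ins for the 'text' columns of the real splits.
toy_splits = {
    "train": ["great movie", "boring plot", "loved it"],
    "val": ["terrible acting", "fine, I guess"],
    "test_data": ["boring plot", "a masterpiece"],
}

print(find_overlaps(toy_splits))
# {('train', 'test_data'): {'boring plot'}}
```

With the real data, passing the `text` column of each split in `small_imdb_dataset` to the same helper should return an empty dictionary if the splits are clean.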


👋 ⚒ For this exercise, we will use the Hugging Face Trainer class to play with hyperparameters. Try to find a set of hyperparameter settings that achieves the highest possible accuracy on the **validation set** with the small dataset and model in this setup.

**Optional:** If you want to follow a more systematic route, feel free to use available frameworks for hyperparameter optimization, such as [Optuna](https://optuna.org/).
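
The core loop that such frameworks automate can be sketched in plain Python. Note that this is simple random search, not Optuna's API or its default sampler. The search ranges are illustrative assumptions, and `validation_accuracy` is a made-up stand-in for a full fine-tuning run so that the example finishes instantly.

```python
import math
import random


def sample_config(rng):
    """Draw one hyperparameter configuration; the ranges are illustrative assumptions."""
    return {
        # Sample the learning rate log-uniformly between 1e-6 and 1e-4.
        "learning_rate": 10 ** rng.uniform(-6, -4),
        "batch_size": rng.choice([8, 16]),
        "weight_decay": rng.choice([0.0, 0.01, 0.1]),
    }


def validation_accuracy(config):
    """Hypothetical stand-in for training and evaluating on the validation set.

    The score simply peaks at a learning rate of 2e-5.
    """
    distance = abs(math.log10(config["learning_rate"]) - math.log10(2e-5))
    return max(0.0, 0.8 - 0.2 * distance)


def random_search(n_trials, seed=24):
    """Try n_trials random configurations and keep the best one."""
    rng = random.Random(seed)
    best_config, best_score = None, float("-inf")
    for _ in range(n_trials):
        config = sample_config(rng)
        score = validation_accuracy(config)
        if score > best_score:
            best_config, best_score = config, score
    return best_config, best_score


best_config, best_score = random_search(n_trials=20)
print(best_config, round(best_score, 3))
```

In a real run, `validation_accuracy` would build a fresh model, train it with the sampled settings, and return the accuracy on the validation set.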

In [77]:
import numpy as np
import evaluate
from transformers import TrainingArguments, Trainer
from transformers import AutoModelForSequenceClassification
from transformers import set_seed

set_seed(24)

model = AutoModelForSequenceClassification.from_pretrained('distilbert/distilbert-base-cased', num_labels=2)
compute_accuracy = evaluate.load("accuracy")

arguments = TrainingArguments(
    output_dir="sample_cl_trainer",
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    logging_steps=8,
    num_train_epochs=7,
    eval_strategy="epoch", # run validation at the end of each epoch
    save_strategy="epoch",
    learning_rate=1e-5,
    weight_decay=0.00,
    load_best_model_at_end=True,
    report_to='none',
    seed=24
)

def compute_metrics(eval_pred):
    """Called at the end of validation. Gives accuracy"""
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    # calculates the accuracy
    return compute_accuracy.compute(predictions=predictions, references=labels)


trainer = Trainer(
    model=model,
    args=arguments,
    train_dataset=small_tokenized_dataset['train'],
    eval_dataset=small_tokenized_dataset['val'], # change to test when you do your final evaluation!
    processing_class=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics
)



Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert/distilbert-base-cased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [78]:
trainer.train()

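| Epoch | Training Loss | Validation Loss | Accuracy |
|-------|---------------|-----------------|----------|
| 1 | 0.6764 | 0.705665 | 0.40625 |
| 2 | 0.6757 | 0.702631 | 0.40625 |
| 3 | 0.6293 | 0.664286 | 0.609375 |
| 4 | 0.4795 | 0.507213 | 0.78125 |
| 5 | 0.2501 | 0.475514 | 0.8125 |
| 6 | 0.205 | 0.442381 | 0.796875 |
| 7 | 0.2084 | 0.485982 | 0.78125 |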
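
The lowest validation loss and the highest accuracy fall on different epochs in this log. As far as I know, `load_best_model_at_end=True` compares checkpoints by evaluation loss unless `metric_for_best_model` is set, so the model reloaded at the end of training may not be the one with the best accuracy. A small sketch using the numbers copied from the log above:

```python
# (epoch, validation_loss, accuracy) copied from the training log above.
history = [
    (1, 0.705665, 0.40625),
    (2, 0.702631, 0.40625),
    (3, 0.664286, 0.609375),
    (4, 0.507213, 0.78125),
    (5, 0.475514, 0.8125),
    (6, 0.442381, 0.796875),
    (7, 0.485982, 0.78125),
]

best_by_loss = min(history, key=lambda row: row[1])
best_by_accuracy = max(history, key=lambda row: row[2])

print(f"Lowest validation loss: epoch {best_by_loss[0]}")      # epoch 6
print(f"Highest accuracy:       epoch {best_by_accuracy[0]}")  # epoch 5
```

Passing `metric_for_best_model="accuracy"` to `TrainingArguments` should make the Trainer select the checkpoint by accuracy instead; treat that as something to verify against the installed transformers version.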


TrainOutput(global_step=224, training_loss=0.4684728485132967, metrics={'train_runtime': 3544.9379, 'train_samples_per_second': 0.506, 'train_steps_per_second': 0.063, 'total_flos': 146825622788448.0, 'train_loss': 0.4684728485132967, 'epoch': 7.0})
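
The `global_step=224` in this output, and the step number in the `checkpoint-<step>` folder names that the Trainer writes, follow from the dataset size and batch size. A short worked calculation, assuming a single device and no gradient accumulation (both are assumptions about the run, although they are consistent with the reported numbers):

```python
import math

train_examples = 256
batch_size = 8
num_epochs = 7

steps_per_epoch = math.ceil(train_examples / batch_size)
total_steps = steps_per_epoch * num_epochs


def checkpoint_step(epoch):
    """Optimizer step at which the end-of-epoch checkpoint is saved."""
    return steps_per_epoch * epoch


print(steps_per_epoch)     # 32
print(total_steps)         # 224, matches global_step in the output
print(checkpoint_step(5))  # 160, i.e. the folder checkpoint-160

# Throughput check against the reported train_runtime of about 3545 seconds.
samples_per_second = train_examples * num_epochs / 3544.9379
print(round(samples_per_second, 3))  # 0.506
```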

In [80]:
import torch

test_set = small_tokenized_dataset['test_data']
# checkpoint-160 is the end of epoch 5, which gave me the highest validation accuracy
fine_tuned_model = AutoModelForSequenceClassification.from_pretrained("sample_cl_trainer/checkpoint-160")
fine_tuned_model.eval()  # make sure dropout is disabled

correct_predictions = 0
total_predictions = 0

with torch.no_grad():  # gradients are not needed for evaluation
    for example in test_set:
        model_inputs = tokenizer(example['text'], return_tensors="pt", truncation=True)
        prediction = torch.argmax(fine_tuned_model(**model_inputs).logits, dim=-1).item()
        if prediction == int(example['label']):
            correct_predictions += 1
        total_predictions += 1

accuracy = correct_predictions / total_predictions
print(f"Accuracy: {accuracy}")

Accuracy: 0.84375
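
With only 64 test examples, each prediction moves the accuracy by about 1.6 percentage points, so small differences between runs should be read with care. A rough sketch of the uncertainty, using the normal approximation to the binomial (an approximation chosen for this sketch, not something from the tutorial):

```python
import math

n_examples = 64
accuracy = 0.84375

correct = round(accuracy * n_examples)  # number of correct predictions
step = 1 / n_examples                   # change in accuracy per example

# Standard error of a proportion under the normal approximation.
standard_error = math.sqrt(accuracy * (1 - accuracy) / n_examples)
low = accuracy - 1.96 * standard_error
high = accuracy + 1.96 * standard_error

print(correct, step)  # 54 0.015625
print(f"Approximate 95% interval: {low:.3f} to {high:.3f}")
```

The interval spans well over ten percentage points, which is a reminder that the gap between, say, 0.78 and 0.81 on the validation set corresponds to just two reviews.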


In [69]:
results = trainer.predict(small_tokenized_dataset['val'])
print(results)

PredictionOutput(predictions=array([[ 0.15956673, -0.17060235],
       [-0.51694745,  0.318438  ],
       [ 0.23124236, -0.20232707],
       [-0.46319348,  0.24490991],
       [-0.031952  ,  0.02173012],
       [-0.57307476,  0.37920517],
       [-0.47913852,  0.39930236],
       [-0.4349009 ,  0.32846504],
       [ 0.21087773, -0.08863482],
       [-0.61962074,  0.40776348],
       [-0.19877662,  0.085187  ],
       [-0.5099886 ,  0.29655027],
       [-0.03091278,  0.07970127],
       [ 0.33791673, -0.21151479],
       [ 0.458614  , -0.1919109 ],
       [-0.20526166,  0.05679601],
       [-0.12594014,  0.07895851],
       [-0.4226833 ,  0.29728526],
       [-0.4639297 ,  0.34127992],
       [-0.3713472 ,  0.17571756],
       [ 0.29998946, -0.2178136 ],
       [ 0.15657294, -0.09589853],
       [ 0.32110962, -0.24459232],
       [-0.23833604,  0.26911283],
       [-0.16589464,  0.11386663],
       [ 0.03663639, -0.02480044],
       [-0.27984196,  0.1262895 ],
       [-0.14622149,  0.10
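
The `predictions` array holds raw logits, one row per review and one column per class. Taking the argmax of each row gives the predicted label, and a softmax turns the row into probabilities. A minimal sketch on the first three rows printed above, written in plain Python rather than NumPy so it runs without extra packages:

```python
import math

# First three rows of the logits printed above.
logits = [
    [0.15956673, -0.17060235],
    [-0.51694745, 0.318438],
    [0.23124236, -0.20232707],
]


def softmax(row):
    """Convert one row of logits into probabilities that sum to 1."""
    largest = max(row)
    # Subtracting the largest logit keeps the exponentials numerically stable.
    exps = [math.exp(value - largest) for value in row]
    total = sum(exps)
    return [value / total for value in exps]


predicted_labels = [row.index(max(row)) for row in logits]
probabilities = [softmax(row) for row in logits]

print(predicted_labels)  # [0, 1, 0]
for row in probabilities:
    print([round(p, 3) for p in row])
```

Logits this close to zero give probabilities not far from 0.5, which suggests the model behind this particular output was still quite unsure about its predictions.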

# My automated grid search

In [73]:
# If a run does not get above 80% accuracy, run it one more time, starting from 2e-05, and so on.
# weight_decay=0.1 was not successful for me, which is why I left it at 0.0 in the end.
# I tested 5 epochs and wanted to test 7 as my backup.
from itertools import product

import numpy as np
import evaluate
from transformers import (
    AutoModelForSequenceClassification,
    Trainer,
    TrainingArguments,
    set_seed,
)

compute_accuracy = evaluate.load("accuracy")

# defining hyperparameter grid
learning_rates = [2e-05]
batch_sizes = [8, 16]
weight_decays = [0.0]
epochs = [7]

# creating all combinations of hyperparameters
grid = list(product(learning_rates, batch_sizes, weight_decays, epochs))

best_accuracy = 0
best_params = None

for lr, batch_size, wd, epoch in grid:
    print(f"Testing configuration: lr={lr}, batch_size={batch_size}, weight_decay={wd}, epochs={epoch}")

    # Start every configuration from a fresh pretrained model with the same seed;
    # otherwise each run would continue training the weights left by the previous one.
    set_seed(24)
    model = AutoModelForSequenceClassification.from_pretrained('distilbert/distilbert-base-cased', num_labels=2)

    arguments = TrainingArguments(
        output_dir="sample_cl_trainer",
        per_device_train_batch_size=batch_size,
        per_device_eval_batch_size=batch_size,
        logging_steps=8,
        num_train_epochs=epoch,
        eval_strategy="epoch",
        save_strategy="epoch",
        learning_rate=lr,
        weight_decay=wd,
        load_best_model_at_end=True,
        report_to='none',
        seed=24
    )

    trainer = Trainer(
        model=model,
        args=arguments,
        train_dataset=small_tokenized_dataset['train'],
        eval_dataset=small_tokenized_dataset['val'],
        processing_class=tokenizer,  # same argument name as in the Trainer cell above
        data_collator=data_collator,
        compute_metrics=compute_metrics
    )

    trainer.train()
    eval_result = trainer.evaluate()

    accuracy = eval_result['eval_accuracy']
    if accuracy > best_accuracy:
        best_accuracy = accuracy
        best_params = (lr, batch_size, wd, epoch)

print(f"Best configuration: lr={best_params[0]}, batch_size={best_params[1]}, weight_decay={best_params[2]}, epochs={best_params[3]}")
print(f"Best accuracy: {best_accuracy}")

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert/distilbert-base-cased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Testing configuration: lr=1e-06, batch_size=8, weight_decay=0.0, epochs=5


  trainer = Trainer(


Epoch,Training Loss,Validation Loss,Accuracy
1,0.676,0.707539,0.40625


KeyboardInterrupt: 
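
For the results of a grid search to be comparable, each configuration should start from a freshly loaded model. If one model object is shared across the loop, every run after the first continues from already fine-tuned weights. A toy sketch of the pattern follows; `ToyModel`, `train_and_evaluate`, and the scoring formula are hypothetical stand-ins so the loop runs instantly, not Trainer code and not real results.

```python
from itertools import product


class ToyModel:
    """Hypothetical stand-in for a pretrained model; tracks how many steps it was trained."""

    def __init__(self):
        self.steps_trained = 0


def train_and_evaluate(model, lr, batch_size, epochs):
    """Pretend to fine-tune on 256 examples and return a made-up validation score."""
    model.steps_trained += epochs * (256 // batch_size)
    return round(0.5 + 1000 * lr * epochs / batch_size, 4)


learning_rates = [1e-5, 2e-5]
batch_sizes = [8, 16]
epochs = [5, 7]

results = []
for lr, batch_size, n_epochs in product(learning_rates, batch_sizes, epochs):
    model = ToyModel()  # fresh model for every configuration
    score = train_and_evaluate(model, lr, batch_size, n_epochs)
    results.append(((lr, batch_size, n_epochs), score, model.steps_trained))

best_params, best_score, _ = max(results, key=lambda item: item[1])
print(len(results))  # 8 configurations
print(best_params)   # (2e-05, 8, 7)
```

Because a new `ToyModel` is created inside the loop, `steps_trained` for each entry reflects only that configuration's own training, which is the property the real grid search needs as well.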