# Assignment: Fine-Tuning Theory and Practice

### Part 1: Theory of Fine-Tuning

**Application Task**:

1. Transfer learning is like learning a new musical instrument after mastering another. Suppose you are an expert pianist who wants to learn the violin. Although the instruments differ, your ability to read sheet music, keep rhythm, and coordinate your hands gives you a head start. You don't have to start from scratch; you build on your existing musical skills and adapt them to the violin. Similarly, in machine learning, a model trained on a large dataset for one task (such as recognizing objects in photos) can be fine-tuned for a new but related task (such as classifying medical images) with much less data. For example, a model trained on millions of general images (e.g., cats, cars, trees) can be adapted to detect tumors in X-rays by retraining only a small portion of it. This saves time, often improves accuracy when task-specific data is scarce, and reduces the need for massive datasets in specialized applications.

2. Example fine-tuning dataset: https://huggingface.co/datasets/HumanLLMs/Human-Like-DPO-Dataset

This is an example of a preprocessed, specialized dataset that is useful for fine-tuning LLMs toward a specific task or tone. Human-Like-DPO pairs natural, human-like responses with stiff, generic ones, so it can be used to fine-tune an LLM to sound more natural and conversational (DPO stands for Direct Preference Optimization). The dataset is provided already cleaned and labeled.
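
As a toy illustration of the point in item 1 that only a small portion of a pretrained model needs retraining, the sketch below keeps a "backbone" frozen and trains only a small logistic-regression head. It uses only the Python standard library. The backbone weights, the synthetic dataset, the learning rate, and the epoch count are all made-up values for illustration; nothing here comes from a real pretrained model.

```python
import math
import random

random.seed(0)

# Stand-in for a "pretrained" backbone: fixed weights (hypothetical values)
# mapping 2 inputs to 3 features. These are never updated below.
BACKBONE_W = [[0.8, -0.3], [0.2, 0.9], [-0.5, 0.4]]


def backbone(x):
    """Frozen feature extractor."""
    return [math.tanh(sum(w * xi for w, xi in zip(row, x))) for row in BACKBONE_W]


def predict(head_w, head_b, feats):
    """Probability of class 1 from the small trainable head."""
    z = sum(w * f for w, f in zip(head_w, feats)) + head_b
    return 1.0 / (1.0 + math.exp(-z))


def mean_loss(head_w, head_b, data):
    """Average binary cross-entropy over the dataset."""
    total = 0.0
    for x, y in data:
        p = predict(head_w, head_b, backbone(x))
        p_correct = p if y == 1 else 1.0 - p
        total += -math.log(max(p_correct, 1e-12))  # clamp to avoid log(0)
    return total / len(data)


# Small synthetic target-task dataset: label is 1 when x0 + x1 > 0.
data = []
for _ in range(40):
    x = [random.uniform(-1, 1), random.uniform(-1, 1)]
    data.append((x, 1 if x[0] + x[1] > 0 else 0))

head_w, head_b = [0.0, 0.0, 0.0], 0.0
backbone_before = [row[:] for row in BACKBONE_W]
loss_before = mean_loss(head_w, head_b, data)

lr = 0.5
for _ in range(50):
    for x, y in data:
        feats = backbone(x)
        err = predict(head_w, head_b, feats) - y  # gradient of the loss w.r.t. the logit
        # Only the head parameters are updated; the backbone stays untouched.
        head_w = [w - lr * err * f for w, f in zip(head_w, feats)]
        head_b -= lr * err

loss_after = mean_loss(head_w, head_b, data)
print(f"loss before: {loss_before:.3f}, loss after: {loss_after:.3f}")
print("backbone unchanged:", BACKBONE_W == backbone_before)
```

Only four numbers (three head weights and a bias) are learned here, which is the toy counterpart of training a new classification layer on top of a large frozen network.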

### Part 2: Practical Fine-Tuning Session

**Hands-On Coding Task**

In [1]:
!pip install torch tensorflow transformers datasets



In [2]:
import torch
torch.cuda.is_available()

True

In [3]:
from transformers import AutoModelForSequenceClassification, AutoTokenizer
model_name = "distilbert-base-uncased"
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)
tokenizer = AutoTokenizer.from_pretrained(model_name)


Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.



In [4]:
from datasets import load_dataset
dataset = load_dataset("imdb")

def preprocess_function(examples):
  return tokenizer(examples['text'], truncation=True, padding=True)

tokenized_dataset = dataset.map(preprocess_function, batched=True)
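
To make the effect of `truncation=True` and `padding=True` concrete, here is a simplified, library-free sketch of what happens to one batch of token IDs. The function `pad_and_truncate`, the token IDs, and `max_length=5` are hypothetical. A real tokenizer also keeps the final special token when truncating and uses the model's own maximum length (512 tokens for DistilBERT).

```python
def pad_and_truncate(batch, max_length, pad_id=0):
    """Cut each sequence to max_length, then pad all to the longest one left."""
    truncated = [ids[:max_length] for ids in batch]
    longest = max(len(ids) for ids in truncated)
    input_ids = [ids + [pad_id] * (longest - len(ids)) for ids in truncated]
    # 1 marks a real token, 0 marks padding the model should ignore
    attention_mask = [[1] * len(ids) + [0] * (longest - len(ids)) for ids in truncated]
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Made-up token IDs for three "reviews" of different lengths
batch = [[101, 7, 8, 102], [101, 9, 102], [101, 1, 2, 3, 4, 5, 102]]
encoded = pad_and_truncate(batch, max_length=5)
for ids, mask in zip(encoded["input_ids"], encoded["attention_mask"]):
    print(ids, mask)
```

Because `dataset.map(..., batched=True)` processes examples in chunks (1,000 by default), `padding=True` pads each chunk to the longest sequence within that chunk rather than across the whole dataset.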


In [5]:
from transformers import Trainer, TrainingArguments

training_args = TrainingArguments(
  output_dir="./results", evaluation_strategy="epoch",
  learning_rate=2e-5, per_device_train_batch_size=16,
  num_train_epochs=3, weight_decay=0.01,
)

trainer = Trainer(
  model=model, args=training_args,
  train_dataset=tokenized_dataset["train"],
  eval_dataset=tokenized_dataset["test"],
)



In [6]:
# Disable Weights & Biases (W&B) experiment logging so training runs without a W&B login

import os, wandb
wandb.init(mode="disabled")
os.environ["WANDB_DISABLED"] = "true"

In [7]:
trainer.train()




| Epoch | Training Loss | Validation Loss |
|-------|---------------|-----------------|
| 1     | 0.2278        | 0.219885        |
| 2     | 0.1485        | 0.239593        |
| 3     | 0.0885        | 0.284228        |


TrainOutput(global_step=4689, training_loss=0.16726805462616545, metrics={'train_runtime': 4668.5298, 'train_samples_per_second': 16.065, 'train_steps_per_second': 1.004, 'total_flos': 9935054899200000.0, 'train_loss': 0.16726805462616545, 'epoch': 3.0})
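
The numbers in this output can be sanity-checked with a little arithmetic. The sketch below assumes a single device and no gradient accumulation, and uses the IMDB training split size of 25,000 reviews. It recomputes the total step count and the gap between validation and training loss for each epoch from the logged results above.

```python
import math

train_examples = 25_000  # IMDB training split size
batch_size = 16          # per_device_train_batch_size
epochs = 3               # num_train_epochs

# The last, partially filled batch of each epoch still counts as a step
steps_per_epoch = math.ceil(train_examples / batch_size)
total_steps = steps_per_epoch * epochs

# (epoch, training loss, validation loss) copied from the logged results
history = [(1, 0.2278, 0.219885), (2, 0.1485, 0.239593), (3, 0.0885, 0.284228)]
gaps = [val - train for _, train, val in history]

print(steps_per_epoch, total_steps)  # 1563 4689
for (epoch, _, _), gap in zip(history, gaps):
    print(f"epoch {epoch}: validation - training loss = {gap:+.4f}")
```

The step count matches `global_step=4689`. The gap grows with every epoch, which is the usual sign that the model is fitting the training set more closely than it generalizes.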

In [8]:
model.save_pretrained("./fine_tuned_model")
tokenizer.save_pretrained("./fine_tuned_model")

('./fine_tuned_model/tokenizer_config.json',
 './fine_tuned_model/special_tokens_map.json',
 './fine_tuned_model/vocab.txt',
 './fine_tuned_model/added_tokens.json',
 './fine_tuned_model/tokenizer.json')

In [9]:
results = trainer.evaluate()
print(results)

{'eval_loss': 0.2842276692390442, 'eval_runtime': 353.9991, 'eval_samples_per_second': 70.622, 'eval_steps_per_second': 8.828, 'epoch': 3.0}
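
The evaluation output above reports only the loss because no metric function was given to the `Trainer`. One option, sketched here on the assumption that accuracy is the metric of interest, is a `compute_metrics` callback. `Trainer` accepts a `compute_metrics` argument that receives the predictions and labels; the function body below is an illustrative implementation in plain Python so that it can be checked without the model, and the demo logits are made up.

```python
def compute_metrics(eval_pred):
    """Return accuracy from (logits, labels); works with lists or NumPy arrays."""
    logits, labels = eval_pred
    # argmax over the class scores of each example
    preds = [max(range(len(row)), key=lambda i: row[i]) for row in logits]
    correct = sum(int(p == y) for p, y in zip(preds, labels))
    return {"accuracy": correct / len(preds)}


# Stand-alone check with made-up logits for four examples and two classes
demo_logits = [[2.0, -1.0], [0.1, 0.3], [-0.5, 1.5], [1.2, 0.4]]
demo_labels = [0, 1, 0, 0]
demo_result = compute_metrics((demo_logits, demo_labels))
print(demo_result)  # {'accuracy': 0.75}
```

It would be passed as `Trainer(..., compute_metrics=compute_metrics)`, after which `trainer.evaluate()` should report an `eval_accuracy` entry alongside `eval_loss`.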


In [10]:
from sklearn.metrics import classification_report
predictions = trainer.predict(tokenized_dataset["test"])
y_pred = predictions.predictions.argmax(axis=1)
y_true = tokenized_dataset["test"]["label"]
print(classification_report(y_true, y_pred))

              precision    recall  f1-score   support

           0       0.94      0.92      0.93     12500
           1       0.92      0.94      0.93     12500

    accuracy                           0.93     25000
   macro avg       0.93      0.93      0.93     25000
weighted avg       0.93      0.93      0.93     25000
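
For readers unfamiliar with the columns of this report, the sketch below shows how precision, recall, and F1 follow from a binary confusion matrix. The four counts are hypothetical: they were chosen only so that the rounded values match the report, and the run's actual confusion matrix is not shown in this notebook.

```python
def per_class_report(tn, fp, fn, tp):
    """Precision, recall and F1 for classes 0 and 1, plus overall accuracy."""

    def prf(true_pos, false_pos, false_neg):
        precision = true_pos / (true_pos + false_pos)
        recall = true_pos / (true_pos + false_neg)
        f1 = 2 * precision * recall / (precision + recall)
        return round(precision, 2), round(recall, 2), round(f1, 2)

    return {
        # Seen from class 0, the roles of the counts are mirrored
        0: prf(true_pos=tn, false_pos=fn, false_neg=fp),
        1: prf(true_pos=tp, false_pos=fp, false_neg=fn),
        "accuracy": round((tn + tp) / (tn + fp + fn + tp), 2),
    }


# Hypothetical counts for 12,500 negative and 12,500 positive reviews
report = per_class_report(tn=11_500, fp=1_000, fn=750, tp=11_750)
print(report)
```

With these illustrative counts the function reproduces the pattern in the report: class 0 has higher precision than recall, and class 1 the reverse.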



**Reflection**:

- Key Challenges: A major challenge during training was tuning the hyperparameters to balance training speed against final model accuracy. Fine-tuning is computationally intensive and time-consuming (this run took about 78 minutes for three epochs), so trade-offs are unavoidable.

- Suggestions: Decreasing the learning rate might improve the model's performance further. Adding training epochs is less likely to help on its own: validation loss rose from 0.220 after epoch 1 to 0.284 after epoch 3 while training loss kept falling, which suggests the model was already starting to overfit. For this assignment, hyperparameters were selected as a reasonable balance between training speed and final performance.
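
If more epochs are tried, a stopping rule based on validation loss can guard against overfitting. The `transformers` library ships an `EarlyStoppingCallback` for this purpose; the sketch below is not that callback, only the underlying rule in plain Python, applied to the validation losses logged in this run. The function name and the `patience` values are illustrative choices.

```python
def early_stopping_epoch(val_losses, patience=1):
    """Return (stop_epoch, best_epoch), both 1-indexed.

    Training stops once validation loss has failed to improve
    for `patience` consecutive epochs.
    """
    best_loss = float("inf")
    best_epoch = 0
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch, best_epoch
    # Ran out of epochs without triggering the rule
    return len(val_losses), best_epoch


val_losses = [0.219885, 0.239593, 0.284228]  # validation loss per epoch from this run
print(early_stopping_epoch(val_losses, patience=1))  # (2, 1)
print(early_stopping_epoch(val_losses, patience=2))  # (3, 1)
```

With a patience of one epoch, this rule would have stopped the run after epoch 2 and kept the epoch 1 checkpoint as the best one.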