# Lightweight Fine-Tuning Project

TODO: In this cell, describe your choices for each of the following

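- **PEFT technique**: LoRA, which fine-tunes DistilBERT by training a small number of added low-rank parameters while the pre-trained weights stay frozen.

- **Model**: DistilBERT (`distilbert-base-uncased`), a smaller distilled version of BERT, chosen for memory efficiency.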

- **Evaluation approach**: Accuracy is the primary evaluation metric, with loss also monitored to track model fitting. Evaluation speed (samples/steps per second) is tracked to assess efficiency.

- **Fine-tuning dataset**: IMDb dataset chosen for simplicity to train and evaluate the model.

## Loading and Evaluating a Foundation Model

TODO: In the cells below, load your chosen pre-trained Hugging Face model and evaluate its performance prior to fine-tuning. This step includes loading an appropriate tokenizer and dataset.

In [1]:
from transformers import DistilBertForSequenceClassification, DistilBertTokenizer
import numpy as np
import torch
from datasets import load_dataset
from transformers import Trainer, TrainingArguments

### Loading Dataset 

In [2]:
# Set random seeds for reproducibility
np.random.seed(42)
torch.manual_seed(42)

# Load dataset
dataset = load_dataset("imdb")

### Creating Balanced Subsets

In [3]:
# Create a properly balanced subset of the full dataset
def create_balanced_subset(dataset_split, n_samples):
    pos_indices = [i for i, label in enumerate(dataset_split['label']) if label == 1]
    neg_indices = [i for i, label in enumerate(dataset_split['label']) if label == 0]
    
    np.random.shuffle(pos_indices)
    np.random.shuffle(neg_indices)
    
    n_per_class = n_samples // 2
    selected_indices = pos_indices[:n_per_class] + neg_indices[:n_per_class]
    
    np.random.shuffle(selected_indices)
    
    return dataset_split.select(selected_indices)

# Create balanced subsets
train_subset = create_balanced_subset(dataset['train'], 5000)
test_subset = create_balanced_subset(dataset['test'], 1000)

# Verify balanced distribution
print(f"Train subset label distribution: {np.bincount(train_subset['label'])}")
print(f"Test subset label distribution: {np.bincount(test_subset['label'])}")


Train subset label distribution: [2500 2500]
Test subset label distribution: [500 500]


### Tokenizing Data

In [4]:
# Load DistilBERT model and tokenizer
model_name = "distilbert-base-uncased"
tokenizer = DistilBertTokenizer.from_pretrained(model_name)

# Tokenization function
def tokenize_function(examples):
    tokenized = tokenizer(
        examples["text"], 
        padding="max_length", 
        truncation=True, 
        max_length=512
    )
    return tokenized

In [5]:
# Tokenize data
tokenized_train = train_subset.map(tokenize_function, batched=True)
tokenized_test = test_subset.map(tokenize_function, batched=True)

# Set format for PyTorch
tokenized_train.set_format(type="torch", columns=["input_ids", "attention_mask", "label"])
tokenized_test.set_format(type="torch", columns=["input_ids", "attention_mask", "label"])

In [6]:
# Metrics function with detailed logging
def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    
    # Calculate metrics
    accuracy = np.mean(predictions == labels)
    pos_pred = np.sum(predictions == 1)
    neg_pred = np.sum(predictions == 0)
    
    # Print detailed stats
    print(f"Predictions distribution: Positive={pos_pred}, Negative={neg_pred}")
    print(f"First 10 predictions: {predictions[:10]}")
    print(f"First 10 labels: {labels[:10]}")
    
    return {
        "accuracy": accuracy,
        "pos_ratio": pos_pred / len(predictions),
        "neg_ratio": neg_pred / len(predictions)
    }


### Evaluate Base Model

In [7]:
# Load pre-trained DistilBERT weights; the classification head is newly initialized
base_model = DistilBertForSequenceClassification.from_pretrained(
    model_name, 
    num_labels=2
)

# Set up evaluation args
eval_args = TrainingArguments(
    output_dir="/tmp/base_model_eval",
    per_device_eval_batch_size=32,
    do_train=False,
    do_eval=True,
    remove_unused_columns=False,
)

# Set up evaluator for base model
base_evaluator = Trainer(
    model=base_model,
    args=eval_args,
    compute_metrics=compute_metrics,
    eval_dataset=tokenized_test,
)

# Evaluate base model
print("Evaluating base model...")
base_results = base_evaluator.evaluate()
print("Base Model Evaluation Results:", base_results)

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'pre_classifier.bias', 'pre_classifier.weight', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Evaluating base model...


Predictions distribution: Positive=562, Negative=438
First 10 predictions: [1 0 0 1 1 1 1 1 0 0]
First 10 labels: [0 0 0 0 0 0 1 0 0 1]
Base Model Evaluation Results: {'eval_loss': 0.6965502500534058, 'eval_accuracy': 0.414, 'eval_pos_ratio': 0.562, 'eval_neg_ratio': 0.438, 'eval_runtime': 15.6483, 'eval_samples_per_second': 63.905, 'eval_steps_per_second': 2.045}
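
The throughput figures in the results dictionary can be cross-checked by hand. The sketch below assumes a single device, so that the number of evaluation steps is the sample count divided by the batch size, rounded up; `eval_throughput` is a hypothetical helper written for this illustration, not part of `transformers`.

```python
import math

def eval_throughput(num_samples: int, batch_size: int, runtime_s: float):
    """Return (samples_per_second, steps_per_second) for one evaluation pass."""
    # Assumption: one device, so the final batch may be partial and still counts as a step.
    steps = math.ceil(num_samples / batch_size)
    return num_samples / runtime_s, steps / runtime_s

# Values taken from the base-model run above: 1000 samples, batch size 32, 15.6483 s.
samples_per_s, steps_per_s = eval_throughput(1000, 32, 15.6483)
print(f"{samples_per_s:.3f} samples/s, {steps_per_s:.3f} steps/s")
```

Under that assumption the computed values agree with the reported `eval_samples_per_second` and `eval_steps_per_second` to three decimal places.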


### Base model evaluation results:
- **Accuracy (41.4%)**: The base model scores below the 50% chance level on this balanced test set. As the warning above notes, the classification head (`pre_classifier` and `classifier`) is newly initialized, so this number reflects an untrained head rather than anything the model knows about sentiment.

- **Class Distribution Bias**: The model predicts positive for 56.2% of reviews and negative for 43.8%, even though the test set is perfectly balanced (500 positive, 500 negative). This is a slight bias toward the positive class.

- **First 10 Predictions vs Labels**: Only 4 of the first 10 predictions match their labels (for example, the model predicts 1 where the label is 0), which is consistent with the low accuracy.

- **Loss Value (0.697)**: This is close to ln 2 ≈ 0.693, the cross-entropy of a binary classifier that assigns probability 0.5 to each class. The model's output probabilities are therefore near-uniform: it is uncertain rather than confidently wrong.

In short, pre-trained DistilBERT with an untrained classification head performs at or below chance on IMDb reviews, which gives us the baseline to improve on with fine-tuning.
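
To put the baseline numbers in context, here is a minimal sketch of what pure chance looks like on a balanced binary task with 1,000 test samples. It assumes independent coin-flip predictions, which is only an approximation of how an untrained head behaves.

```python
import math

# Cross-entropy of a binary classifier that always outputs probability 0.5.
chance_loss = -math.log(0.5)

# Standard error of accuracy for independent coin-flip predictions on n samples.
n = 1000
chance_se = math.sqrt(0.5 * 0.5 / n)

# How far the observed base accuracy (0.414) sits from 0.5, in standard errors.
observed_accuracy = 0.414
gap_in_se = (0.5 - observed_accuracy) / chance_se

print(f"chance-level loss: {chance_loss:.4f}")
print(f"chance accuracy:   0.500 +/- {chance_se:.4f} (one standard error)")
print(f"observed gap:      {gap_in_se:.1f} standard errors below chance")
```

The observed loss sits close to the chance-level value, while the accuracy gap is several standard errors wide. One plausible reading is that this particular random head is mildly anti-correlated with the labels, not that it behaves like independent coin flips.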


## Performing Parameter-Efficient Fine-Tuning

TODO: In the cells below, create a PEFT model from your loaded model, run a training loop, and save the PEFT model weights.

In [8]:
from peft import LoraConfig, get_peft_model, TaskType
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Create PEFT configuration
peft_config = LoraConfig(
    r=16,
    target_modules=["q_lin", "v_lin"],  
    lora_alpha=64,
    lora_dropout=0.1,
    task_type=TaskType.SEQ_CLS,
)
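
As a rough sanity check on how small the adapter is, the following sketch counts the LoRA parameters this configuration adds. It assumes the standard `distilbert-base-uncased` shape (6 transformer blocks, hidden size 768) and counts only the low-rank matrices; any classification-head parameters that PEFT keeps trainable are not included. `lora_param_count` is a hypothetical helper for illustration.

```python
def lora_param_count(d_in: int, d_out: int, r: int) -> int:
    """Parameters in one LoRA pair: A has shape (r, d_in), B has shape (d_out, r)."""
    return r * d_in + d_out * r

hidden_size = 768        # assumed hidden dimension of distilbert-base-uncased
num_layers = 6           # assumed number of transformer blocks
targets_per_layer = 2    # q_lin and v_lin, as in target_modules
rank = 16                # r in the LoraConfig
lora_alpha = 64          # lora_alpha in the LoraConfig

per_module = lora_param_count(hidden_size, hidden_size, rank)
total = per_module * targets_per_layer * num_layers
scaling = lora_alpha / rank  # factor applied to the low-rank update

print(f"{per_module:,} per adapted module, {total:,} LoRA parameters in total")
print(f"update scaling (lora_alpha / r): {scaling}")
```

Calling `peft_model.print_trainable_parameters()` after `get_peft_model` reports the exact trainable count for the real model, including any modules PEFT trains in addition to the LoRA matrices.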

In [9]:
# Apply PEFT to a fresh model
fresh_model = DistilBertForSequenceClassification.from_pretrained(
    model_name, 
    num_labels=2
)
peft_model = get_peft_model(fresh_model, peft_config)

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'pre_classifier.bias', 'pre_classifier.weight', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


### Setting up training

In [10]:
# Training arguments
training_args = TrainingArguments(
    output_dir="/tmp/distilbert_lora",
    per_device_train_batch_size=16,
    per_device_eval_batch_size=32,
    num_train_epochs=3,  
    evaluation_strategy="epoch",
    logging_steps=100,
    save_strategy="epoch",
    load_best_model_at_end=True,
    metric_for_best_model="accuracy",
    greater_is_better=True,
    learning_rate=1e-4,  
    warmup_ratio=0.1,
    weight_decay=0.01,
    remove_unused_columns=False,
)

# Set up trainer
trainer = Trainer(
    model=peft_model,
    args=training_args,
    train_dataset=tokenized_train,
    eval_dataset=tokenized_test,
    compute_metrics=compute_metrics,
)


In [11]:
# Train the model
print("Training the model...")
trainer.train()

ft_results_after_training = trainer.evaluate()
print("Fine-tuned Model Results (right after training):", ft_results_after_training)

Training the model...


| Epoch | Training Loss | Validation Loss | Accuracy | Pos Ratio | Neg Ratio |
|---|---|---|---|---|---|
| 1 | 0.2991 | 0.256404 | 0.889 | 0.507 | 0.493 |
| 2 | 0.2476 | 0.240817 | 0.9 | 0.49 | 0.51 |
| 3 | 0.2239 | 0.242268 | 0.903 | 0.493 | 0.507 |


Predictions distribution: Positive=507, Negative=493
First 10 predictions: [0 1 0 0 0 0 1 0 0 1]
First 10 labels: [0 0 0 0 0 0 1 0 0 1]
Predictions distribution: Positive=490, Negative=510
First 10 predictions: [0 0 0 0 0 0 1 0 0 1]
First 10 labels: [0 0 0 0 0 0 1 0 0 1]
Predictions distribution: Positive=493, Negative=507
First 10 predictions: [0 0 0 0 0 0 1 0 0 1]
First 10 labels: [0 0 0 0 0 0 1 0 0 1]


Predictions distribution: Positive=493, Negative=507
First 10 predictions: [0 0 0 0 0 0 1 0 0 1]
First 10 labels: [0 0 0 0 0 0 1 0 0 1]
Fine-tuned Model Results (right after training): {'eval_loss': 0.24226754903793335, 'eval_accuracy': 0.903, 'eval_pos_ratio': 0.493, 'eval_neg_ratio': 0.507, 'eval_runtime': 16.5517, 'eval_samples_per_second': 60.417, 'eval_steps_per_second': 1.933, 'epoch': 3.0}


### PEFT model evaluation results:
Based on the training and evaluation results for DistilBERT fine-tuned with LoRA on the IMDb subset, we can observe these improvements:

- **Significant Performance Improvement**: Accuracy rose from 41.4% for the base model to 90.3% after 3 epochs of fine-tuning, an improvement of 48.9 percentage points.
- **Consistent Learning Progress**: The accuracy improved across epochs:
    - Epoch 1: 88.9%
    - Epoch 2: 90.0%
    - Epoch 3: 90.3%
    This steady improvement suggests the training was effective and stable, although validation loss flattened between epochs 2 and 3 (0.2408 to 0.2423), so most of the gain came in the first two epochs.
- **Balanced Predictions**: The final predictions are well balanced between the positive (493) and negative (507) classes, which tells us the model is not biased toward predicting one class over the other.
- **Low Loss Values**: The validation loss (0.24) is far lower than the base model's (0.70).
- **First 10 Predictions Analysis**: The first 10 predictions match 9 of the 10 labels after epoch 1 and all 10 from epoch 2 onward, which is another positive sign of the model's performance.
- **Effective LoRA Adaptation**: LoRA works well for this task: only a relatively small number of parameters were trained, and the model still reached strong accuracy.

In short, LoRA fine-tuning was highly successful for this sentiment analysis task. The model reached over 90% accuracy while training only a small fraction of its parameters.
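
For intuition about what the adapter does during a forward pass, here is a minimal pure-Python sketch of a LoRA-adapted linear layer: the frozen weight `W` is left untouched and a scaled low-rank product `B @ A` is added on top. The tiny matrices and the helper names are made up for illustration, and this is not how the PEFT library implements it internally.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]

def lora_forward(W, A, B, x, alpha, r):
    """Compute W x + (alpha / r) * B (A x) for one input vector x."""
    base = matvec(W, x)                 # frozen pre-trained path
    update = matvec(B, matvec(A, x))    # trainable low-rank path
    scale = alpha / r
    return [b + scale * u for b, u in zip(base, update)]

W = [[1.0, 0.0, 2.0], [0.0, 1.0, -1.0]]   # frozen weight, shape (2, 3)
A = [[0.5, -0.5, 1.0]]                    # LoRA A, shape (r=1, 3)
x = [1.0, 2.0, 3.0]

# With B initialised to zeros the adapter changes nothing at the start of training.
B_zero = [[0.0], [0.0]]
out_init = lora_forward(W, A, B_zero, x, alpha=4, r=1)

# After training, B is non-zero and shifts the output; W itself never changes.
B_trained = [[0.25], [-0.5]]
out_trained = lora_forward(W, A, B_trained, x, alpha=4, r=1)

print("before training:", out_init)
print("after training: ", out_trained)
```

The toy values use `alpha / r = 4`, the same ratio as `lora_alpha=64` and `r=16` in the configuration used here.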


###  ⚠️ IMPORTANT ⚠️

Due to workspace storage constraints, you should not store the model weights in the same directory but rather use `/tmp` to avoid workspace crashes which are irrecoverable.
Ensure you save it in /tmp always.
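
One way to guard against saving to the wrong place is to check the target path before writing. The helper below is a hypothetical sketch using only the standard library; `is_under` and the `/workspace/...` path are illustrative names, not part of any library or of this project.

```python
import os

def is_under(path: str, root: str = "/tmp") -> bool:
    """Return True if `path` resolves to a location inside `root`."""
    # realpath resolves symlinks and relative segments on both sides.
    path_real = os.path.realpath(path)
    root_real = os.path.realpath(root)
    return os.path.commonpath([path_real, root_real]) == root_real

for candidate in ("/tmp/peft_distilbert", "/workspace/peft_distilbert"):
    print(candidate, "->", "ok" if is_under(candidate) else "outside /tmp")
```

Note that a relative path such as `./tmp/model` resolves against the current working directory, so it usually lands inside the workspace and not in `/tmp`.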

In [12]:
# Saving the model and tokenizer
peft_model.save_pretrained("/tmp/peft_distilbert")

tokenizer.save_pretrained("/tmp/peft_distilbert")


('/tmp/peft_distilbert/tokenizer_config.json',
 '/tmp/peft_distilbert/special_tokens_map.json',
 '/tmp/peft_distilbert/vocab.txt',
 '/tmp/peft_distilbert/added_tokens.json')

In [21]:
import os

print(os.listdir("/tmp/peft_distilbert"))

['vocab.txt', 'adapter_model.bin', 'adapter_config.json', 'README.md', 'tokenizer_config.json', 'special_tokens_map.json']


## Performing Inference with a PEFT Model

TODO: In the cells below, load the saved PEFT model weights and evaluate the performance of the trained PEFT model. Be sure to compare the results to the results from prior to fine-tuning.

In [15]:
# Load a fresh base model for comparison
base_model_for_inference = DistilBertForSequenceClassification.from_pretrained(
    model_name, 
    num_labels=2
)

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'pre_classifier.bias', 'pre_classifier.weight', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [16]:
# Set up base model trainer for inference
base_trainer_inference = Trainer(
    model=base_model_for_inference,
    compute_metrics=compute_metrics,
    eval_dataset=tokenized_test,
)

In [17]:
# Load a fresh base model to attach the saved adapter to
base_model_for_peft = DistilBertForSequenceClassification.from_pretrained(
    model_name, 
    num_labels=2
)

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'pre_classifier.bias', 'pre_classifier.weight', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [22]:
from peft import PeftModel

# load the adapter 
peft_model_for_inference = PeftModel.from_pretrained(
    base_model_for_peft,
    "/tmp/peft_distilbert",
    is_trainable=False
)

# Set up PEFT model trainer for inference
peft_trainer_inference = Trainer(
    model=peft_model_for_inference,
    compute_metrics=compute_metrics,
    eval_dataset=tokenized_test,
)

In [23]:
# Evaluate both models
print("\nEvaluating base model during inference...")
base_inference_results = base_trainer_inference.evaluate()

print("\nEvaluating fine-tuned PEFT model during inference...")
peft_inference_results = peft_trainer_inference.evaluate()

# Compare results
print("\n===== Final Evaluation Results =====")
print(f"Base Model Accuracy: {base_inference_results['eval_accuracy']:.4f}")
print(f"Fine-tuned PEFT Model Accuracy: {peft_inference_results['eval_accuracy']:.4f}")
print(f"Improvement: {peft_inference_results['eval_accuracy'] - base_inference_results['eval_accuracy']:.4f}")



Evaluating base model during inference...


Predictions distribution: Positive=2, Negative=998
First 10 predictions: [0 0 0 0 0 0 0 0 0 0]
First 10 labels: [0 0 0 0 0 0 1 0 0 1]

Evaluating fine-tuned PEFT model during inference...


Predictions distribution: Positive=493, Negative=507
First 10 predictions: [0 0 0 0 0 0 1 0 0 1]
First 10 labels: [0 0 0 0 0 0 1 0 0 1]

===== Final Evaluation Results =====
Base Model Accuracy: 0.5000
Fine-tuned PEFT Model Accuracy: 0.9030
Improvement: 0.4030


### Trained PEFT model performance:
These results validate the effectiveness of this LoRA fine-tuning approach. Here's a breakdown of the improvements:

- **Base Model Performance**: In this run the base model's predictions are highly skewed (998 negative, only 2 positive). It is effectively defaulting to the negative class, and on a balanced test set that yields 50% accuracy, which is no better than chance. The figure differs from the 41.4% measured earlier most likely because each fresh load re-initializes the classification head with different random weights.
- **Fine-tuned Model Success**: The reloaded LoRA model reaches 90.3% accuracy, exactly matching the result right after training, which confirms that the adapter weights were saved and loaded correctly.
- **Balanced Predictions from Fine-tuned Model**: The fine-tuned model produces a balanced distribution of predictions (493 positive, 507 negative), so it is discriminating between reviews rather than defaulting to one class.
- **Substantial Improvement**: Accuracy improved by 40.3 percentage points over the base model in this run (50.0% to 90.3%).
- **First 10 Predictions**: The fine-tuned model's first 10 predictions match the labels exactly, while the base model predicted negative for all 10 and so missed both positive examples.
- **Effective Parameter Efficiency**: The full DistilBERT model was not fine-tuned; training a small set of injected adapter weights was enough to get the job done.

These results represent a substantial improvement over the base model's performance.
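
The first-10 comparison can be tallied directly from the printed outputs of the inference run. This is only a spot check on ten examples, not a substitute for the full-test-set metrics; `count_correct` is a hypothetical helper.

```python
# Copied from the inference outputs above.
labels     = [0, 0, 0, 0, 0, 0, 1, 0, 0, 1]
base_preds = [0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
peft_preds = [0, 0, 0, 0, 0, 0, 1, 0, 0, 1]

def count_correct(preds, targets):
    """Number of positions where the prediction equals the label."""
    return sum(p == t for p, t in zip(preds, targets))

def positives_found(preds, targets):
    """Number of positive labels (1) that were also predicted as positive."""
    return sum(p == 1 and t == 1 for p, t in zip(preds, targets))

base_correct = count_correct(base_preds, labels)
peft_correct = count_correct(peft_preds, labels)
base_pos = positives_found(base_preds, labels)
peft_pos = positives_found(peft_preds, labels)

print(f"base model: {base_correct}/10 correct, {base_pos}/2 positives found")
print(f"PEFT model: {peft_correct}/10 correct, {peft_pos}/2 positives found")
```

Because most of these ten labels are negative, an always-negative model still gets most of them right, which is why counting the recovered positives is the more telling comparison here.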

