# Lightweight Fine-Tuning Project

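The choices for this project are:

* PEFT technique: LoRA (low-rank adaptation), configured through `LoraConfig`
* Model: `distilbert-base-uncased` with a two-label sequence classification head
* Evaluation approach: accuracy on the test subset, computed with the Hugging Face `Trainer`
* Fine-tuning dataset: IMDB movie reviews, using 500 shuffled examples from each of the train and test splits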

## 1. Loading and Evaluating a Foundation Model

TODO: In the cells below, load your chosen pre-trained Hugging Face model and evaluate its performance prior to fine-tuning. This step includes loading an appropriate tokenizer and dataset.

### 1.1 Load a pretrained HF model

The code includes the relevant imports and loads a pretrained Hugging Face model designed for sequence classification tasks.

In [1]:
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased",
    num_labels=2,
    id2label={0: "negative", 1: "positive"},
    label2id={"negative": 0, "positive": 1},
)

model.classifier

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.weight', 'pre_classifier.weight', 'classifier.bias', 'pre_classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Linear(in_features=768, out_features=2, bias=True)

### 1.2 Load and preprocess a dataset

The code loads a Hugging Face dataset suitable for sequence classification, then loads the matching Hugging Face tokenizer and uses it to prepare the dataset for the model. To limit the compute required, only a 500-example subset of each split is used.

In [2]:
# Install the required version of datasets in case you have an older version
# You will need to choose "Kernel > Restart Kernel" from the menu after executing this cell
# ! pip install -q "datasets==2.15.0"

In [3]:
from datasets import load_dataset

# Initialize a new dictionary to hold the modified dataset
dataset = {}

# Define the splits
splits = ["train", "test"]

# Load, shuffle, and select a subset for each split
for split in splits:
    # Load the dataset split
    ds = load_dataset("imdb", split=split)

    # Shuffle and select the first 500 samples
    dataset[split] = ds.shuffle(seed=23).select(range(500))

# Display the modified datasets
dataset

{'train': Dataset({
     features: ['text', 'label'],
     num_rows: 500
 }),
 'test': Dataset({
     features: ['text', 'label'],
     num_rows: 500
 })}
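
The subset selection above relies on a seeded shuffle followed by taking the first 500 rows. The sketch below is a toy analogue in plain Python, not the `datasets` implementation; the helper `shuffled_subset` and the stand-in rows are hypothetical. It only illustrates why a fixed seed makes the subset reproducible.

```python
import random

def shuffled_subset(rows, seed, k):
    """Return the first k rows after a seeded shuffle of the row indices."""
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)  # same seed gives the same permutation
    return [rows[i] for i in indices[:k]]

# Hypothetical stand-in for one IMDB split (25,000 reviews per split)
rows = [f"review {i}" for i in range(25000)]

subset_a = shuffled_subset(rows, seed=23, k=500)
subset_b = shuffled_subset(rows, seed=23, k=500)
print(len(subset_a), subset_a == subset_b)  # 500 True
```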

In [4]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

def preprocess_function(examples):
    """Tokenize a batch of IMDB reviews, truncating and padding each to 512 tokens."""
    # `map` stores plain lists, so no tensor conversion is requested here
    return tokenizer(examples["text"], truncation=True, padding="max_length", max_length=512)


tokenized_dataset = {}
for split in splits:
    tokenized_dataset[split] = dataset[split].map(preprocess_function, batched=True)
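
The tokenizer call above truncates long reviews and pads short ones so that every example has the same length. The sketch below mimics that behavior for a single sequence in plain Python. The helper `pad_or_truncate`, the token ids, and `pad_id=0` are illustrative assumptions, and unlike the real tokenizer it does not preserve special tokens when truncating.

```python
def pad_or_truncate(token_ids, max_length, pad_id=0):
    """Toy analogue of truncation=True with padding='max_length' for one sequence."""
    ids = token_ids[:max_length]            # drop tokens beyond max_length
    n_pad = max_length - len(ids)           # number of padding positions needed
    return {
        "input_ids": ids + [pad_id] * n_pad,
        "attention_mask": [1] * len(ids) + [0] * n_pad,  # 1 = real token, 0 = padding
    }

short = pad_or_truncate([101, 2023, 3185, 102], max_length=8)
long = pad_or_truncate(list(range(1, 13)), max_length=8)
print(short)
print(long)
```

The attention mask is what lets the model ignore the padded positions.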

### 1.3 Evaluate the pretrained model

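Accuracy is used as the classification metric. Because the classification head of the pretrained model is newly initialized, the cells below first train only that head with the base model frozen, then train all parameters, and report test accuracy after each run.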
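
As a minimal sketch of the metric used in this section: accuracy is the share of examples whose highest-scoring class matches the label. The function `accuracy_from_logits` and the logits below are hypothetical, and the sketch uses plain Python rather than NumPy.

```python
def accuracy_from_logits(logits, labels):
    """Pick the highest-scoring class per row and compare it with the label."""
    predictions = [max(range(len(row)), key=row.__getitem__) for row in logits]
    correct = sum(int(p == y) for p, y in zip(predictions, labels))
    return correct / len(labels)

# Hypothetical model outputs for four reviews (column 0 = negative, 1 = positive)
logits = [[2.0, -1.0], [0.1, 0.3], [-0.5, 0.2], [1.2, 0.9]]
labels = [0, 1, 0, 0]

print(accuracy_from_logits(logits, labels))  # 0.75
```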

In [5]:
# Freeze all the parameters of the base model
for param in model.base_model.parameters():
    param.requires_grad = False


In [6]:
import numpy as np
from transformers import DataCollatorWithPadding, Trainer, TrainingArguments

def compute_metrics(eval_pred):
    """Return accuracy: the fraction of predictions that match the labels."""
    predictions, labels = eval_pred
    predictions = np.argmax(predictions, axis=1)
    return {"accuracy": (predictions == labels).mean()}

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="./data/positive_negative",
        learning_rate=5e-5,  # Set the learning rate
        per_device_train_batch_size=8,  # Set the per device train batch size
        per_device_eval_batch_size=16,  # Set the per device eval batch size
        evaluation_strategy="epoch",  # Evaluate after each epoch
        save_strategy="epoch",  # Save the model after each epoch
        num_train_epochs=2,
        weight_decay=0.01,
        load_best_model_at_end=True,
    ),
    train_dataset=tokenized_dataset["train"],
    eval_dataset=tokenized_dataset["test"],
    tokenizer=tokenizer,
    data_collator=DataCollatorWithPadding(tokenizer=tokenizer),
    compute_metrics=compute_metrics,
)

trainer.train()

You're using a DistilBertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


Epoch,Training Loss,Validation Loss,Accuracy
1,No log,0.66701,0.722
2,No log,0.659756,0.75


TrainOutput(global_step=126, training_loss=0.6694205147879464, metrics={'train_runtime': 461.9193, 'train_samples_per_second': 2.165, 'train_steps_per_second': 0.273, 'total_flos': 132467398656000.0, 'train_loss': 0.6694205147879464, 'epoch': 2.0})
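
The `global_step` and throughput figures in this output can be checked by hand. The sketch below assumes a single device and no gradient accumulation, which are the defaults but are not stated in the notebook.

```python
import math

train_samples = 500
batch_size = 8
epochs = 2

steps_per_epoch = math.ceil(train_samples / batch_size)  # the last batch is partial
total_steps = steps_per_epoch * epochs
print(steps_per_epoch, total_steps)  # 63 126

train_runtime = 461.9193  # seconds, from the output above
print(round(train_samples * epochs / train_runtime, 3))  # 2.165
```

With only 126 steps, the run most likely never reaches the `Trainer`'s default logging interval of 500 steps, which would explain the "No log" entries in the Training Loss column.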

In [7]:
# Show the performance of the model on the test set
trainer.evaluate()

{'eval_loss': 0.6597560048103333,
 'eval_accuracy': 0.75,
 'eval_runtime': 112.076,
 'eval_samples_per_second': 4.461,
 'eval_steps_per_second': 0.286,
 'epoch': 2.0}

In [8]:
# Unfreeze the base model so that all parameters are trained
for param in model.base_model.parameters():
    param.requires_grad = True


In [9]:
# Continue training the same model, now with all parameters trainable
trainer.train()

Epoch,Training Loss,Validation Loss,Accuracy
1,No log,0.642288,0.75
2,No log,0.637578,0.74


TrainOutput(global_step=126, training_loss=0.6418577527242993, metrics={'train_runtime': 763.153, 'train_samples_per_second': 1.31, 'train_steps_per_second': 0.165, 'total_flos': 132467398656000.0, 'train_loss': 0.6418577527242993, 'epoch': 2.0})

In [10]:
# Show the performance of the model on the test set
trainer.evaluate()

{'eval_loss': 0.6375781893730164,
 'eval_accuracy': 0.74,
 'eval_runtime': 107.4887,
 'eval_samples_per_second': 4.652,
 'eval_steps_per_second': 0.298,
 'epoch': 2.0}

## 2. Performing Parameter-Efficient Fine-Tuning

TODO: In the cells below, create a PEFT model from your loaded model, run a training loop, and save the PEFT model weights.

%reset

### 2.1 Create a PEFT model

The code includes the relevant imports, initializes a Hugging Face Parameter-Efficient Fine-Tuning (PEFT) config, and creates a PEFT model using that config.

#### Creating a PEFT Config

The PEFT config specifies the adapter configuration for your parameter-efficient fine-tuning process. The base class for this is a PeftConfig, but this example will use a LoraConfig, the subclass used for low rank adaptation (LoRA).
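
To make the idea of low-rank adaptation concrete, the sketch below computes a LoRA-style forward pass for one linear layer in plain Python: the frozen weight `W` is left untouched and a scaled low-rank product `B A` is added to it. The function names and all numbers are illustrative assumptions, not PEFT's implementation.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]

def lora_forward(W, A, B, x, alpha, r):
    """y = W x + (alpha / r) * B (A x); W stays frozen, only A and B are trained."""
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    scale = alpha / r
    return [b + scale * u for b, u in zip(base, update)]

W = [[1.0, 0.0, 2.0], [0.0, 1.0, -1.0]]  # frozen weight: 2 outputs, 3 inputs
A = [[0.5, -0.5, 0.25]]                   # shape (r, in_features) with r = 1
B_init = [[0.0], [0.0]]                   # shape (out_features, r), zero at the start
B_trained = [[1.0], [-1.0]]               # hypothetical values after training
x = [1.0, 2.0, 3.0]

print(lora_forward(W, A, B_init, x, alpha=8, r=1))     # [7.0, -1.0], same as W x
print(lora_forward(W, A, B_trained, x, alpha=8, r=1))  # [9.0, -3.0]
```

Because `B` starts at zero, the adapted layer initially reproduces the base layer exactly, and training only has to learn the small matrices `A` and `B`.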

A LoRA config can be instantiated like this:

In [12]:
from peft import LoraConfig
config = LoraConfig(target_modules=["classifier"])

#### Converting a Transformers Model into a PEFT Model

Once you have a PEFT config object, you can load a Hugging Face transformers model as a PEFT model by first loading the pre-trained model as usual (here we load `distilbert-base-uncased` again with a sequence classification head):

In [13]:
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased",
    num_labels=2,
    id2label={0: "negative", 1: "positive"},
    label2id={"negative": 0, "positive": 1}
)

model.classifier

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.weight', 'pre_classifier.weight', 'classifier.bias', 'pre_classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Linear(in_features=768, out_features=2, bias=True)

Then use `get_peft_model()` to get a trainable PEFT model (using the LoRA config instantiated previously):

In [14]:
from peft import get_peft_model
peft_model = get_peft_model(model, config)
peft_model.print_trainable_parameters()

trainable params: 6,160 || all params: 66,961,170 || trainable%: 0.009199361361218747
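
The 6,160 trainable parameters can be reproduced by counting the two LoRA matrices attached to the classifier layer, `Linear(in_features=768, out_features=2)`. The sketch assumes rank `r = 8`, which is PEFT's default when `LoraConfig` does not set it; the helper `lora_param_count` is hypothetical.

```python
def lora_param_count(in_features, out_features, r):
    """A has shape (r, in_features) and B has shape (out_features, r)."""
    return r * in_features + out_features * r

r = 8  # assumed default rank, since the config above does not specify one
trainable = lora_param_count(in_features=768, out_features=2, r=r)
all_params = 66_961_170  # taken from the output above

print(trainable)                     # 6160
print(100 * trainable / all_params)  # about 0.0092 percent
```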


In [15]:
from peft import AutoPeftModelForSequenceClassification

peft_model.save_pretrained("peft_pretrained")
peft_model = AutoPeftModelForSequenceClassification.from_pretrained("peft_pretrained")
peft_model.print_trainable_parameters()

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.weight', 'pre_classifier.weight', 'classifier.bias', 'pre_classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


trainable params: 598,290 || all params: 67,559,460 || trainable%: 0.8855754619708328
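
The jump from 6,160 to 598,290 trainable parameters after reloading is not explained in the notebook. One accounting that reproduces the numbers, offered as an assumption about what the sequence-classification wrapper does rather than a confirmed description, is that the `pre_classifier` and `classifier` layers are treated as fully trainable copies alongside the LoRA matrices. The helper `linear_param_count` is hypothetical.

```python
def linear_param_count(in_features, out_features):
    """Weights plus biases of a dense layer."""
    return in_features * out_features + out_features

pre_classifier = linear_param_count(768, 768)  # 590,592
classifier = linear_param_count(768, 2)        # 1,538
lora = 8 * 768 + 2 * 8                         # LoRA A and B, assuming rank 8

trainable = pre_classifier + classifier + lora
print(trainable)               # 598290
print(66_961_170 + trainable)  # 67559460, the new "all params" figure
```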


#### Load the `IMDB` dataset

In [16]:
from datasets import load_dataset

# Initialize a new dictionary to hold the modified dataset
dataset = {}

# Define the splits
splits = ["train", "test"]

# Load, shuffle, and select a subset for each split
for split in splits:
    # Load the dataset split
    ds = load_dataset("imdb", split=split)

    # Shuffle and select the first 500 samples
    dataset[split] = ds.shuffle(seed=23).select(range(500))

# Display the modified datasets
dataset

{'train': Dataset({
     features: ['text', 'label'],
     num_rows: 500
 }),
 'test': Dataset({
     features: ['text', 'label'],
     num_rows: 500
 })}

#### Preprocess (tokenize) the `IMDB` dataset

In [17]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

def preprocess_function(examples):
    """Tokenize a batch of IMDB reviews, truncating and padding each to 512 tokens."""
    # `map` stores plain lists, so no tensor conversion is requested here
    return tokenizer(examples["text"], truncation=True, padding="max_length", max_length=512)


tokenized_dataset = {}
for split in splits:
    tokenized_dataset[split] = dataset[split].map(preprocess_function, batched=True)


### 2.2 Train the PEFT model

The model undergoes training for at least one epoch, utilizing the Parameter-Efficient Fine-Tuning (PEFT) model and the specified dataset.

After calling `get_peft_model()`, you can then use the resulting `peft_model` in a training process of your choice (PyTorch training loop or Hugging Face Trainer).

#### Checking Trainable Parameters of a PEFT Model
A helpful way to check the number of trainable parameters with the current config is the print_trainable_parameters() method:

In [18]:
peft_model.print_trainable_parameters()

trainable params: 598,290 || all params: 67,559,460 || trainable%: 0.8855754619708328


Reference: https://huggingface.co/docs/peft/quicktour

In [19]:
from transformers import DataCollatorWithPadding, Trainer, TrainingArguments

training_args = TrainingArguments(
    output_dir="./lora",
    learning_rate=1e-3,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=16,
    num_train_epochs=2,
    weight_decay=0.01,
    evaluation_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
)

import numpy as np
def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    predictions = np.argmax(predictions, axis=1)
    return {"accuracy": (predictions == labels).mean()}

trainer = Trainer(
    model=peft_model,
    args=training_args,
    train_dataset=tokenized_dataset["train"],
    eval_dataset=tokenized_dataset["test"],
    tokenizer=tokenizer,
    data_collator=DataCollatorWithPadding(tokenizer=tokenizer),
    compute_metrics=compute_metrics,
)

trainer.train()

You're using a DistilBertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


Epoch,Training Loss,Validation Loss,Accuracy
1,No log,0.531912,0.732
2,No log,0.459229,0.784


TrainOutput(global_step=126, training_loss=0.5362108018663194, metrics={'train_runtime': 491.1082, 'train_samples_per_second': 2.036, 'train_steps_per_second': 0.257, 'total_flos': 134324269056000.0, 'train_loss': 0.5362108018663194, 'epoch': 2.0})

### 2.3 Save the PEFT model

The fine-tuned parameters of the model are saved to a separate directory, located in the same directory as the notebook file.

#### Saving a Trained PEFT Model
Once a PEFT model has been trained, the standard Hugging Face save_pretrained() method can be used to save the weights locally. For example:

In [20]:
peft_model.save_pretrained("peft_lora")

## 3. Performing Inference with a PEFT Model

TODO: In the cells below, load the saved PEFT model weights and evaluate the performance of the trained PEFT model. Be sure to compare the results to the results from prior to fine-tuning.

### 3.1 Load the saved PEFT model

This section includes the relevant imports and then loads the saved PEFT model.

Because only the adapter weights were saved, not the full model weights, `from_pretrained()` cannot be used with the regular Transformers class (here, `AutoModelForSequenceClassification`). Instead, the PEFT version is needed (here, `AutoPeftModelForSequenceClassification`). For example:

In [21]:
from peft import AutoPeftModelForSequenceClassification
peft_model = AutoPeftModelForSequenceClassification.from_pretrained("peft_lora")

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.weight', 'pre_classifier.weight', 'classifier.bias', 'pre_classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [22]:
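# Note: `trainer` still wraps the in-memory model trained in section 2.2.
# To score the reloaded weights, build a new Trainer with the `peft_model` loaded above.
trainer.evaluate()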

{'eval_loss': 0.45922911167144775,
 'eval_accuracy': 0.784,
 'eval_runtime': 109.1066,
 'eval_samples_per_second': 4.583,
 'eval_steps_per_second': 0.293,
 'epoch': 2.0}

#### A Comparison

||Original model (frozen base)|Original model (all parameters trained)|PEFT model (LoRA)|
|---|---|---|---|
|Test accuracy|0.75|0.74|0.784|
|Training time for 2 epochs (s)|461.9193|763.153|491.1082|

This table compares three runs: the original model with the base frozen (only the classification head trained), the original model with all parameters trained, and the PEFT (Parameter-Efficient Fine-Tuning) model using LoRA.

* Accuracy is the fraction of the 500 test examples classified correctly; higher is better. The frozen-base model reaches 0.75, the fully trained model 0.74, and the PEFT model 0.784.

* Training time is the `train_runtime` reported by the `Trainer` for 2 epochs, in seconds: 461.9193 for the frozen-base model, 763.153 for the fully trained model, and 491.1082 for the PEFT model.

The PEFT model has the highest accuracy of the three and trains in about 64% of the time needed to train all parameters, while taking slightly longer than the frozen-base run. Two caveats apply: the full-parameter run continued from the frozen-base run rather than starting from the pretrained checkpoint, and the PEFT run used a learning rate of 1e-3 instead of 5e-5, so the comparison is not strictly like-for-like.
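
To put the reported numbers in perspective, the sketch below converts each accuracy into a count of correct test examples, estimates its sampling uncertainty, and compares training times. The standard error uses a simple binomial approximation, which is an assumption added here and not part of the original analysis.

```python
import math

test_size = 500
runs = {
    "frozen base": {"accuracy": 0.75, "train_seconds": 461.9193},
    "all parameters": {"accuracy": 0.74, "train_seconds": 763.153},
    "PEFT (LoRA)": {"accuracy": 0.784, "train_seconds": 491.1082},
}

for name, run in runs.items():
    p = run["accuracy"]
    correct = round(p * test_size)                # number of correct predictions
    std_err = math.sqrt(p * (1 - p) / test_size)  # binomial approximation
    print(f"{name}: {correct}/{test_size} correct, std. error ~{std_err:.3f}")

ratio = runs["PEFT (LoRA)"]["train_seconds"] / runs["all parameters"]["train_seconds"]
print(f"PEFT training time is {ratio:.0%} of training all parameters")
```

With a standard error near 0.02 on 500 examples, gaps of one to three accuracy points are within a couple of standard errors, so the ranking is suggestive and not conclusive.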