# Lightweight Fine-Tuning Project

TODO: In this cell, describe your choices for each of the following

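* PEFT technique: LoRA (low-rank adaptation), configured with `LoraConfig`
* Model: `distilbert-base-uncased`, loaded with `AutoModelForSequenceClassification`
* Evaluation approach: accuracy on a held-out test subset, computed with the Hugging Face `Trainer`
* Fine-tuning dataset: `imdb` (shuffled subsets of the train and test splits)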

In [1]:
## Install required packages:
# pip install 'transformers[torch]'
## or
# pip install transformers

# pip install peft

## 1. Loading and Evaluating a Foundation Model

TODO: In the cells below, load your chosen pre-trained Hugging Face model and evaluate its performance prior to fine-tuning. This step includes loading an appropriate tokenizer and dataset.

### 1.1 Load a pretrained HF model

The code includes the relevant imports and loads a pretrained Hugging Face model designed for sequence classification tasks.

In [2]:
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased",
    num_labels=2,
    id2label={0: "negative", 1: "positive"},
    label2id={"negative": 0, "positive": 1},
)

model.classifier

config.json: 100%|██████████| 483/483 [00:00<00:00, 1.02MB/s]
model.safetensors: 100%|██████████| 268M/268M [00:01<00:00, 205MB/s]  
Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'pre_classifier.bias', 'classifier.weight', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Linear(in_features=768, out_features=2, bias=True)

### 1.2 Load and preprocess a dataset

The code imports and loads a Hugging Face dataset suitable for sequence classification. It then imports and loads a Hugging Face tokenizer, which converts the raw text into model inputs. To reduce the compute required, only a shuffled subset of each split is used.
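As a rough illustration of what `truncation=True` and `padding='max_length'` do to each example, the sketch below clips a list of token ids to a fixed length and right-pads the remainder, producing `input_ids` and `attention_mask` lists. The helper `pad_and_truncate`, the example ids, and `pad_id=0` are hypothetical stand-ins, not the tokenizer's actual implementation (a real tokenizer also keeps its special end token when truncating).

```python
def pad_and_truncate(token_ids, max_length, pad_id=0):
    """Clip a token-id list to max_length, then right-pad it with pad_id."""
    clipped = token_ids[:max_length]
    n_pad = max_length - len(clipped)
    return {
        # Real tokens first, padding ids afterwards.
        "input_ids": clipped + [pad_id] * n_pad,
        # 1 marks a real token, 0 marks padding the model should ignore.
        "attention_mask": [1] * len(clipped) + [0] * n_pad,
    }


# A short sequence is padded up to max_length.
short = pad_and_truncate([101, 2023, 3185, 102], max_length=6)

# A long sequence is clipped down to max_length.
long = pad_and_truncate(list(range(1, 11)), max_length=6)

print(short)
print(long)
```

Padding every example to the same length keeps batches rectangular, at the cost of extra computation on padding positions for short reviews.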

In [3]:
# Install the required version of datasets in case you have an older version
# You will need to choose "Kernel > Restart Kernel" from the menu after executing this cell
# ! pip install -q "datasets==2.15.0"

In [4]:
from datasets import load_dataset

# Initialize a new dictionary to hold the modified dataset
dataset = {}

# Define the splits
splits = ["train", "test"]

# Load, shuffle, and select a subset for each split
for split in splits:
    # Load the dataset split
    ds = load_dataset("imdb", split=split)

    # Shuffle and select the first 500 samples
    dataset[split] = ds.shuffle(seed=23).select(range(500))

# Display the modified datasets
dataset


{'train': Dataset({
     features: ['text', 'label'],
     num_rows: 500
 }),
 'test': Dataset({
     features: ['text', 'label'],
     num_rows: 500
 })}

In [5]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

def preprocess_function(examples):
    """Preprocess the imdb dataset by returning tokenized examples."""
    outputs  = tokenizer(examples['text'], truncation=True, padding='max_length', max_length=512, return_tensors="pt")
    
    return outputs 


tokenized_dataset = {}
for split in splits:
    tokenized_dataset[split] = dataset[split].map(preprocess_function, batched=True)

tokenizer_config.json: 100%|██████████| 28.0/28.0 [00:00<00:00, 169kB/s]
vocab.txt: 100%|██████████| 232k/232k [00:00<00:00, 6.61MB/s]
tokenizer.json: 100%|██████████| 466k/466k [00:00<00:00, 8.95MB/s]
Map: 100%|██████████| 500/500 [00:00<00:00, 747.07 examples/s]
Map: 100%|██████████| 500/500 [00:00<00:00, 768.92 examples/s]


### 1.3 Evaluate the pretrained model

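At least one classification metric is calculated by applying the pretrained model to the dataset. Because the classification head is newly initialized, the cells below first freeze the DistilBERT base and train only the head for two epochs, then report accuracy on the test subset.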
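The accuracy metric used in this notebook takes the arg-max over the two logits of each example and compares it with the label. A minimal sketch on made-up logits (the arrays and the helper name `accuracy_from_logits` are illustrative only):

```python
import numpy as np


def accuracy_from_logits(logits, labels):
    """Fraction of rows whose highest-scoring class equals the label."""
    predictions = np.argmax(logits, axis=1)
    return float((predictions == labels).mean())


# Four toy examples with two class scores each (negative, positive).
logits = np.array([
    [2.0, -1.0],   # predicts 0
    [0.1, 0.3],    # predicts 1
    [-0.5, 1.5],   # predicts 1
    [1.2, 0.9],    # predicts 0
])
labels = np.array([0, 1, 0, 0])

toy_accuracy = accuracy_from_logits(logits, labels)
print(toy_accuracy)  # 3 of 4 predictions match, so 0.75
```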

In [6]:
# Freeze all the parameters of the base model
for param in model.base_model.parameters():
    param.requires_grad = False
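With the base model frozen, only the classification head remains trainable. The sketch below estimates the size of that head from layer shapes. The 768-to-2 `classifier` shape comes from the output above, while the 768-to-768 shape of `pre_classifier` is an assumption about the DistilBERT head, and `linear_param_count` is a hypothetical helper.

```python
def linear_param_count(in_features, out_features):
    """Weights plus biases of a fully connected layer."""
    return in_features * out_features + out_features


# Assumed shape: pre_classifier maps the 768-dim hidden state to 768 dims.
pre_classifier = linear_param_count(768, 768)

# Shape printed by `model.classifier`: Linear(in_features=768, out_features=2).
classifier = linear_param_count(768, 2)

head_total = pre_classifier + classifier
print(pre_classifier, classifier, head_total)
```

Under these assumptions the head accounts for well under 1% of the roughly 67 million parameters reported later in the notebook, which is why the frozen-base run trains quickly.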

In [7]:
import numpy as np
from transformers import DataCollatorWithPadding, Trainer, TrainingArguments

def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    predictions = np.argmax(predictions, axis=1)
    return {"accuracy": (predictions == labels).mean()}
# My compute_metrics function calculates the "accuracy" as a classification metric. 
# Accuracy is a common metric used to measure the correctness of a model's predictions. 

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="./data/positive_negative",
        learning_rate=5e-5,  # Set the learning rate
        per_device_train_batch_size=8,  # Set the per device train batch size
        per_device_eval_batch_size=16,  # Set the per device eval batch size
        evaluation_strategy="epoch",  # Evaluate after each epoch
        save_strategy="epoch",  # Save the model after each epoch
        num_train_epochs=2,
        weight_decay=0.01,
        load_best_model_at_end=True,
    ),
    train_dataset=tokenized_dataset["train"],
    eval_dataset=tokenized_dataset["test"],
    tokenizer=tokenizer,
    data_collator=DataCollatorWithPadding(tokenizer=tokenizer),
    compute_metrics=compute_metrics,
)

trainer.train()

You're using a DistilBertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


Epoch,Training Loss,Validation Loss,Accuracy
1,No log,0.668613,0.73
2,No log,0.659988,0.766


TrainOutput(global_step=126, training_loss=0.6712383088611421, metrics={'train_runtime': 35.3615, 'train_samples_per_second': 28.279, 'train_steps_per_second': 3.563, 'total_flos': 132467398656000.0, 'train_loss': 0.6712383088611421, 'epoch': 2.0})
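The `global_step=126` value is consistent with 500 training samples, a batch size of 8, and 2 epochs. A small arithmetic check, assuming a single device and no gradient accumulation (`total_steps` is a hypothetical helper):

```python
import math


def total_steps(num_samples, batch_size, epochs):
    """Optimizer steps when the last partial batch of each epoch is kept."""
    return math.ceil(num_samples / batch_size) * epochs


steps_500 = total_steps(500, batch_size=8, epochs=2)
steps_1000 = total_steps(1000, batch_size=8, epochs=2)
print(steps_500, steps_1000)
```

The second value matches the `global_step=250` reported by the 1,000-sample LoRA run later in the notebook.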

In [8]:
# Show the performance of the model on the test set
trainer.evaluate()

{'eval_loss': 0.65998774766922,
 'eval_accuracy': 0.766,
 'eval_runtime': 7.9415,
 'eval_samples_per_second': 62.96,
 'eval_steps_per_second': 4.029,
 'epoch': 2.0}

In [9]:
# Unfreeze all base-model parameters so the whole model is trained (for comparison).
# Training continues from the weights left by the previous run.
for param in model.base_model.parameters():
    param.requires_grad = True

In [10]:
trainer.train()

Epoch,Training Loss,Validation Loss,Accuracy
1,No log,0.641921,0.764
2,No log,0.636097,0.764


TrainOutput(global_step=126, training_loss=0.6371427869039868, metrics={'train_runtime': 65.3948, 'train_samples_per_second': 15.292, 'train_steps_per_second': 1.927, 'total_flos': 132467398656000.0, 'train_loss': 0.6371427869039868, 'epoch': 2.0})

In [11]:
# Show the performance of the model on the test set
trainer.evaluate()

{'eval_loss': 0.6360965371131897,
 'eval_accuracy': 0.764,
 'eval_runtime': 8.8896,
 'eval_samples_per_second': 56.246,
 'eval_steps_per_second': 3.6,
 'epoch': 2.0}

## 2. Performing Parameter-Efficient Fine-Tuning

TODO: In the cells below, create a PEFT model from your loaded model, run a training loop, and save the PEFT model weights.

### 2.1 Create a PEFT model

The code includes the relevant imports, initializes a Hugging Face Parameter-Efficient Fine-Tuning (PEFT) config, and creates a PEFT model using that config.

#### Creating a PEFT Config

The PEFT config specifies the adapter configuration for your parameter-efficient fine-tuning process. The base class for this is a PeftConfig, but this example will use a LoraConfig, the subclass used for low rank adaptation (LoRA).

A LoRA config can be instantiated like this:

In [12]:
from peft import LoraConfig
config = LoraConfig(target_modules=["classifier"])
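To make the idea behind LoRA concrete, here is a NumPy sketch of a low-rank update applied to a 768-to-2 linear layer: the frozen weight `W` is left untouched and a trainable product `B @ A` of rank `r` is added on top. The rank `r = 8`, scaling `alpha = 8`, and the zero initialization of `B` are assumptions chosen for illustration, and this is not the PEFT library's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
in_features, out_features, r, alpha = 768, 2, 8, 8

W = rng.normal(size=(out_features, in_features))  # frozen pretrained weight
A = rng.normal(size=(r, in_features)) * 0.01      # trainable, small random init
B = np.zeros((out_features, r))                   # trainable, starts at zero


def lora_forward(x):
    """Base projection plus the scaled low-rank correction."""
    return x @ W.T + (alpha / r) * (x @ A.T @ B.T)


x = rng.normal(size=(4, in_features))
base_out = x @ W.T
lora_out = lora_forward(x)

# With B at zero the adapter contributes nothing, so training starts
# from exactly the behaviour of the base layer.
print(np.allclose(base_out, lora_out))
print(A.size + B.size)  # number of adapter parameters in this sketch
```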

#### Converting a Transformers Model into a PEFT Model

Once you have a PEFT config object, you can load a Hugging Face transformers model as a PEFT model by first loading the pre-trained model as usual (here we load DistilBERT):

In [13]:
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased",
    num_labels=2,
    id2label={0: "negative", 1: "positive"},
    label2id={"negative": 0, "positive": 1}
)

model.classifier

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'pre_classifier.bias', 'classifier.weight', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Linear(in_features=768, out_features=2, bias=True)

Then use `get_peft_model()` to get a trainable PEFT model (using the LoRA config instantiated previously):

In [14]:
from peft import get_peft_model
peft_model = get_peft_model(model, config)
peft_model.print_trainable_parameters()

trainable params: 6,160 || all params: 66,961,170 || trainable%: 0.009199361361218747
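The 6,160 trainable parameters reported above are consistent with a rank-8 LoRA adapter on the 768-to-2 `classifier` layer. The rank of 8 is an assumption here (it is not set explicitly in the config above), and `lora_param_count` is a hypothetical helper:

```python
def lora_param_count(in_features, out_features, r):
    """Parameters in the two low-rank factors: A (r x in) and B (out x r)."""
    return r * in_features + out_features * r


added = lora_param_count(in_features=768, out_features=2, r=8)

# Total parameter count as printed by print_trainable_parameters().
total = 66_961_170
trainable_pct = 100 * added / total

print(added)
print(trainable_pct)
```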


In [15]:
from peft import AutoPeftModelForSequenceClassification

peft_model.save_pretrained("peft_pretrained")
peft_model = AutoPeftModelForSequenceClassification.from_pretrained("peft_pretrained")
peft_model.print_trainable_parameters()

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'pre_classifier.bias', 'classifier.weight', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


trainable params: 1,196,580 || all params: 67,559,460 || trainable%: 1.7711509239416656


#### Load the `IMDB` dataset

In [16]:
from datasets import load_dataset

# Initialize a new dictionary to hold the modified dataset
dataset = {}

# Define the splits
splits = ["train", "test"]

# Load, shuffle, and select a subset for each split
for split in splits:
    # Load the dataset split
    ds = load_dataset("imdb", split=split)

    # Shuffle and select the first 1000 samples
    dataset[split] = ds.shuffle(seed=23).select(range(1000))

# Display the modified datasets
dataset

{'train': Dataset({
     features: ['text', 'label'],
     num_rows: 1000
 }),
 'test': Dataset({
     features: ['text', 'label'],
     num_rows: 1000
 })}

#### Preprocess (tokenize) the `IMDB` dataset

In [17]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

def preprocess_function(examples):
    """Preprocess the imdb dataset by returning tokenized examples."""
    outputs  = tokenizer(examples['text'], truncation=True, padding='max_length', max_length=512, return_tensors="pt")
    
    return outputs 


tokenized_dataset = {}
for split in splits:
    tokenized_dataset[split] = dataset[split].map(preprocess_function, batched=True)

tokenized_dataset

Map: 100%|██████████| 1000/1000 [00:01<00:00, 715.35 examples/s]
Map: 100%|██████████| 1000/1000 [00:01<00:00, 750.70 examples/s]


{'train': Dataset({
     features: ['text', 'label', 'input_ids', 'attention_mask'],
     num_rows: 1000
 }),
 'test': Dataset({
     features: ['text', 'label', 'input_ids', 'attention_mask'],
     num_rows: 1000
 })}

### 2.2 Train the PEFT model

The model undergoes training for at least one epoch, utilizing the Parameter-Efficient Fine-Tuning (PEFT) model and the specified dataset.

After calling `get_peft_model()`, you can use the resulting `peft_model` in a training process of your choice (a PyTorch training loop or the Hugging Face `Trainer`).

#### Checking Trainable Parameters of a PEFT Model
A helpful way to check the number of trainable parameters with the current config is the print_trainable_parameters() method:

In [18]:
peft_model.print_trainable_parameters()

trainable params: 1,196,580 || all params: 67,559,460 || trainable%: 1.7711509239416656


Reference: https://huggingface.co/docs/peft/quicktour

In [19]:
from transformers import DataCollatorWithPadding, Trainer, TrainingArguments

training_args = TrainingArguments(
    output_dir="./lora",
    learning_rate=1e-3,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    num_train_epochs=2,
    weight_decay=0.01,
    evaluation_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
)

import numpy as np
def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    predictions = np.argmax(predictions, axis=1)
    return {"accuracy": (predictions == labels).mean()}

trainer = Trainer(
    model=peft_model,
    args=training_args,
    train_dataset=tokenized_dataset["train"],
    eval_dataset=tokenized_dataset["test"],
    tokenizer=tokenizer,
    data_collator=DataCollatorWithPadding(tokenizer=tokenizer),
    compute_metrics=compute_metrics,
)

trainer.train()

You're using a DistilBertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


Epoch,Training Loss,Validation Loss,Accuracy
1,No log,0.552454,0.72
2,No log,0.395143,0.816


TrainOutput(global_step=250, training_loss=0.48890200805664064, metrics={'train_runtime': 69.041, 'train_samples_per_second': 28.968, 'train_steps_per_second': 3.621, 'total_flos': 268648538112000.0, 'train_loss': 0.48890200805664064, 'epoch': 2.0})

### 2.3 Save the PEFT model

The fine-tuned parameters of the model are saved to a separate directory (`peft_lora`), created relative to the notebook's working directory.

#### Saving a Trained PEFT Model
Once a PEFT model has been trained, the standard Hugging Face save_pretrained() method can be used to save the weights locally. For example:

In [20]:
peft_model.save_pretrained("peft_lora")

## 3. Performing Inference with a PEFT Model

TODO: In the cells below, load the saved PEFT model weights and evaluate the performance of the trained PEFT model. Be sure to compare the results to the results from prior to fine-tuning.

### 3.1 Load the saved PEFT model

This section includes the relevant imports and then loads the saved PEFT model.

Because only the adapter weights were saved, not the full model weights, `from_pretrained()` on the regular Transformers class (e.g., `AutoModelForSequenceClassification`) cannot load them. The PEFT counterpart (here, `AutoPeftModelForSequenceClassification`) is needed instead. For example:

In [21]:
from peft import AutoPeftModelForSequenceClassification
peft_model = AutoPeftModelForSequenceClassification.from_pretrained("peft_lora")

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'pre_classifier.bias', 'classifier.weight', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [22]:
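# Note: `trainer` still wraps the in-memory model fine-tuned in In [19],
# not the `peft_model` reloaded from "peft_lora" above.
trainer.evaluate()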

{'eval_loss': 0.3951427638530731,
 'eval_accuracy': 0.816,
 'eval_runtime': 16.6003,
 'eval_samples_per_second': 60.24,
 'eval_steps_per_second': 7.53,
 'epoch': 2.0}

## 4. A Comparison

||Original model (frozen base)|Original model (all parameters trainable)|PEFT model (LoRA)|
|---|--|--|--|
|Accuracy|0.766|0.764|0.816|
|Training time for 2 epochs (s)|35.36|65.39|69.04|
|Training / test samples|500 / 500|500 / 500|1000 / 1000|

This table compares the accuracy and training time of three setups: the original model with a frozen base, the same model with all parameters trainable, and a PEFT (Parameter-Efficient Fine-Tuning) model that uses LoRA. Each setup was trained for two epochs and then evaluated on the test subset.

Accuracy:

* Original model (frozen base): This setup achieved an accuracy of 0.766. The parameters of the DistilBERT base were frozen (`requires_grad = False`), so only the classification head was updated during training.

* Original model (all parameters trainable): This setup achieved a slightly lower accuracy of 0.764. All parameters were allowed to update, and training continued from the weights produced by the frozen-base run rather than from a fresh checkpoint.

* PEFT model: The LoRA model reached the highest accuracy, 0.816. This run was trained and evaluated on 1,000-sample subsets and used a higher learning rate (1e-3 instead of 5e-5), so the gain cannot be attributed to LoRA alone.

Training Time (epoch=2):

* Original model (frozen base): Training took 35.36 seconds (`train_runtime`) for 2 epochs. Freezing the base shortens training because no gradients are computed or applied for the frozen weights.

* Original model (all parameters trainable): Training time rose to 65.39 seconds, since gradients must be computed and applied for every weight in the model.

* PEFT model: This run took the longest, 69.04 seconds, but it processed twice as many training samples (1,000 instead of 500, or 250 optimizer steps instead of 126). Its throughput of 28.968 samples per second is close to the frozen-base run (28.279) and roughly double that of full fine-tuning (15.292).

In summary, the PEFT model shows the highest accuracy, and its total training time is similar to that of the fully trainable original model even though it trained on twice as many samples. This makes the PEFT model an efficient choice for balancing performance against computational resources. The runs differ in dataset size and learning rate, however, so a strictly controlled comparison would reuse the same subsets and hyperparameters across all three setups.
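Because the runs used different numbers of training samples, throughput is a fairer basis for comparing training cost than raw runtime. The sketch below recomputes samples per second from the sample counts and `train_runtime` values reported in the outputs above (the dictionary layout and run labels are illustrative only):

```python
# Sample counts and runtimes (seconds) taken from the TrainOutput lines above.
runs = {
    "frozen base": {"samples": 500, "epochs": 2, "seconds": 35.3615},
    "full fine-tuning": {"samples": 500, "epochs": 2, "seconds": 65.3948},
    "LoRA": {"samples": 1000, "epochs": 2, "seconds": 69.041},
}

# Samples processed per second over the whole training run.
throughput = {
    name: run["samples"] * run["epochs"] / run["seconds"]
    for name, run in runs.items()
}

for name, value in throughput.items():
    print(f"{name}: {value:.3f} samples/s")
```

The recomputed values agree with the `train_samples_per_second` figures logged by the `Trainer`, which suggests the LoRA run's longer wall-clock time comes from the larger subset rather than from a slower training step.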