# 📚 Fine-Tuning Phi-2 for Text Summarization using PEFT


- Large Language Models (LLMs) like Phi-2 have shown impressive capabilities in understanding and generating human-like text. However, full fine-tuning, which updates every parameter of such a model, can be expensive and computationally intensive.


- In this notebook, we'll explore how to fine-tune Phi-2 for text summarization using a lightweight and efficient technique called **PEFT** (**Parameter-Efficient Fine-Tuning**). Instead of updating all of a model’s parameters, PEFT allows us to adapt pre-trained models using a small number of additional parameters, significantly reducing training cost while achieving competitive performance.


- We’ll walk through the full process—loading the model, preparing the dataset, applying PEFT via LoRA (Low-Rank Adaptation), training, and evaluating results. Whether you're new to LLM fine-tuning or looking for a practical example of PEFT in action, this notebook is designed to help you understand the how and why behind efficient fine-tuning.


Let’s get started and make Phi-2 a summarizer!

In [1]:
## Installing the necessary dependencies;

!pip install -qq -U bitsandbytes transformers peft accelerate datasets scipy einops evaluate trl

In [2]:
### Importing the necessary libraries;

from datasets import load_dataset, Dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
    HfArgumentParser,
    TrainingArguments,
    Trainer,
    GenerationConfig
)
from tqdm import tqdm
from trl import SFTTrainer
import torch
import time
from huggingface_hub import login

import pandas as pd
import numpy as np


## Reading the Hugging Face token from Kaggle Secrets so that it is not exposed in the public notebook.
## This feature is available under the "Add-ons" button on the menu bar of the editor.
from kaggle_secrets import UserSecretsClient
user_secrets = UserSecretsClient()
secret_value_0 = user_secrets.get_secret("hf_token")


### This plays the same role as the .env file we would create in a local project: the token stays out of the code.
login(token=secret_value_0)

2025-05-07 16:50:41.805674: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:477] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
E0000 00:00:1746636641.992996      31 cuda_dnn.cc:8310] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
E0000 00:00:1746636642.043959      31 cuda_blas.cc:1418] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered


In [3]:
### Utility function to get the GPU memory used at any point of time
from pynvml import nvmlInit, nvmlDeviceGetHandleByIndex, nvmlDeviceGetMemoryInfo

def print_gpu_utilization():
    nvmlInit()
    handle = nvmlDeviceGetHandleByIndex(0)
    info = nvmlDeviceGetMemoryInfo(handle)
    print(f"GPU memory occupied: {info.used//1024**2} MB.")


print_gpu_utilization()

GPU memory occupied: 116 MB.


# 📑 Dataset: CNN/DailyMail for Abstractive Summarization


- To fine-tune our Phi-2 model for summarization, we're using the CNN/DailyMail dataset—one of the most popular benchmarks for training and evaluating summarization models.


- This dataset consists of news articles from CNN and the Daily Mail, along with human-written highlights that serve as target summaries. It's well-suited for abstractive summarization, where the model learns to generate concise, paraphrased summaries instead of merely copying parts of the original text.

We'll use the **datasets** library to load the pre-processed version:

In [4]:
## Loading the dataset;
from datasets import load_dataset

dataset = load_dataset("abisee/cnn_dailymail", "1.0.0")

README.md:   0%|          | 0.00/15.6k [00:00<?, ?B/s]

train-00000-of-00003.parquet:   0%|          | 0.00/256M [00:00<?, ?B/s]

train-00001-of-00003.parquet:   0%|          | 0.00/257M [00:00<?, ?B/s]

train-00002-of-00003.parquet:   0%|          | 0.00/259M [00:00<?, ?B/s]

validation-00000-of-00001.parquet:   0%|          | 0.00/34.7M [00:00<?, ?B/s]

test-00000-of-00001.parquet:   0%|          | 0.00/30.0M [00:00<?, ?B/s]

Generating train split:   0%|          | 0/287113 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/13368 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/11490 [00:00<?, ? examples/s]

In [5]:
## The dataset already comes with train, validation, and test splits, so we don't have to split it ourselves

dataset

DatasetDict({
    train: Dataset({
        features: ['article', 'highlights', 'id'],
        num_rows: 287113
    })
    validation: Dataset({
        features: ['article', 'highlights', 'id'],
        num_rows: 13368
    })
    test: Dataset({
        features: ['article', 'highlights', 'id'],
        num_rows: 11490
    })
})

In [6]:
## Viewing a sample observation;

dataset['train']['article'][0], dataset['train']['highlights'][0]

('LONDON, England (Reuters) -- Harry Potter star Daniel Radcliffe gains access to a reported £20 million ($41.1 million) fortune as he turns 18 on Monday, but he insists the money won\'t cast a spell on him. Daniel Radcliffe as Harry Potter in "Harry Potter and the Order of the Phoenix" To the disappointment of gossip columnists around the world, the young actor says he has no plans to fritter his cash away on fast cars, drink and celebrity parties. "I don\'t plan to be one of those people who, as soon as they turn 18, suddenly buy themselves a massive sports car collection or something similar," he told an Australian interviewer earlier this month. "I don\'t think I\'ll be particularly extravagant. "The things I like buying are things that cost about 10 pounds -- books and CDs and DVDs." At 18, Radcliffe will be able to gamble in a casino, buy a drink in a pub or see the horror film "Hostel: Part II," currently six places below his number one movie on the UK box office chart. Details 

# ⚙️ Loading Phi-2 with 4-bit Quantization (BitsAndBytes)

- To **reduce memory usage** and make training feasible on limited hardware, we’re loading the Phi-2 model with **4-bit quantization** using the **bitsandbytes** library. This dramatically reduces the model's memory footprint while preserving much of its performance.

In [7]:
## Defining the bnb config, which holds the quantization settings used when loading the model;
compute_dtype = torch.float16
bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type='nf4',
        bnb_4bit_compute_dtype=compute_dtype,
        bnb_4bit_use_double_quant=False,
    )


## To place the model on the first available GPU
device_map = {"": 0}


## Loading the model;
model_name='microsoft/phi-2'
original_model = AutoModelForCausalLM.from_pretrained(model_name, 
                                                      device_map=device_map,
                                                      quantization_config=bnb_config,  ## the bnb config is passed as a parameter
                                                      trust_remote_code=True,
                                                      use_auth_token=True)


## Printing the GPU Util;
print_gpu_utilization()



config.json:   0%|          | 0.00/735 [00:00<?, ?B/s]

model.safetensors.index.json:   0%|          | 0.00/35.7k [00:00<?, ?B/s]

Fetching 2 files:   0%|          | 0/2 [00:00<?, ?it/s]

model-00002-of-00002.safetensors:   0%|          | 0.00/564M [00:00<?, ?B/s]

model-00001-of-00002.safetensors:   0%|          | 0.00/5.00G [00:00<?, ?B/s]

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

generation_config.json:   0%|          | 0.00/124 [00:00<?, ?B/s]

GPU memory occupied: 3104 MB.


# 🔤 Tokenizer Setup


- Before we can feed text into the Phi-2 model, we need to load its corresponding tokenizer. The tokenizer is responsible for converting raw text into input tokens that the model understands, and vice versa.

We’ll demonstrate two ways to load the tokenizer, depending on whether you’re training or evaluating.

In [8]:
### Tokeniser

model_name='microsoft/phi-2'


## Loading the tokenizer for trainer;

tokenizer = AutoTokenizer.from_pretrained(model_name,
                                          trust_remote_code=True,
                                          padding_side="left",
                                          add_eos_token=True,
                                          add_bos_token=True,
                                          use_fast=False)
tokenizer.pad_token = tokenizer.eos_token


## Printing the GPU Util;
print_gpu_utilization()


## Loading the tokenizer for inferencing a sample record;
eval_tokenizer = AutoTokenizer.from_pretrained(model_name, 
                                               add_bos_token=True, 
                                               trust_remote_code=True, 
                                               use_fast=False)

eval_tokenizer.pad_token = eval_tokenizer.eos_token


tokenizer_config.json:   0%|          | 0.00/7.34k [00:00<?, ?B/s]

vocab.json:   0%|          | 0.00/798k [00:00<?, ?B/s]

merges.txt:   0%|          | 0.00/456k [00:00<?, ?B/s]

added_tokens.json:   0%|          | 0.00/1.08k [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/99.0 [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/2.11M [00:00<?, ?B/s]

GPU memory occupied: 3104 MB.


Looking at the code block above, you might wonder why I have defined the tokenizer object twice. They may look the same at first glance, but they are in fact slightly different.

- **tokenizer**: This tokenizer prepares the **training** data. It is configured with left padding and **add_eos_token**, and it reuses the EOS token as the pad token, which avoids errors from an undefined pad token.

- **eval_tokenizer**: This is used only for inference in gen(). We skip **padding_side** because a single prompt needs no padding, and we skip **add_eos_token** because an EOS token appended to the prompt would signal that the text has already ended, right where we want the model to start writing.


- To easily test the vanilla model, we define a simple function gen() that takes in a prompt, feeds it to the model, and returns the generated text.

In [10]:
def gen(model, p, maxlen=100, sample=True):
    
    toks = eval_tokenizer(p, return_tensors="pt")  # Tokenize the prompt into input IDs
    
    res = model.generate(
        **toks.to("cuda"),                     # Move tokens to GPU
        max_new_tokens=maxlen,                # Limit generation length
        do_sample=sample,                     # Enable sampling if True (for diversity)
        num_return_sequences=1,               # Generate a single sequence
        temperature=0.1,                      # Low temperature = more deterministic output
        num_beams=1,                          # No beam search (greedy or sampling instead)
        top_p=0.95                            # Nucleus sampling: sample from top 95% of token probability mass
    ).to('cpu')                               # Move result back to CPU
    
    return eval_tokenizer.batch_decode(res, skip_special_tokens=True)  # Decode tokens into readable text
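
To build intuition for the `temperature` and `top_p` settings in `gen()`, here is a minimal pure-Python sketch of how the two reshape a next-token distribution. The logits are made-up values for four hypothetical tokens, and the real `generate()` implementation works on tensors and differs in detail, so treat this as an illustration rather than a description of the library's internals.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Divide logits by the temperature, then apply a numerically stable softmax."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def top_p_filter(probs, top_p):
    """Keep the most likely tokens until their cumulative mass reaches top_p, then renormalize."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    mass = sum(probs[i] for i in kept)
    return {i: probs[i] / mass for i in kept}

toy_logits = [2.0, 1.5, 0.5, -1.0]  # hypothetical scores for four candidate tokens

probs_t1 = softmax_with_temperature(toy_logits, 1.0)    # unscaled distribution
probs_t01 = softmax_with_temperature(toy_logits, 0.1)   # temperature used in gen()

nucleus_t1 = top_p_filter(probs_t1, 0.95)
nucleus_t01 = top_p_filter(probs_t01, 0.95)

print("T=1.0 :", [round(p, 3) for p in probs_t1], "-> tokens kept:", sorted(nucleus_t1))
print("T=0.1 :", [round(p, 3) for p in probs_t01], "-> tokens kept:", sorted(nucleus_t01))
```

With a temperature of 0.1 nearly all of the probability mass lands on the top token, so sampling behaves almost like greedy decoding even though `do_sample=True`.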


Let's pick a sample and check what the vanilla model predicts.

In [11]:
from transformers import set_seed
seed = 42
set_seed(seed)


## defining some random index;
index = 12

article = dataset['train']['article'][index]
summary = dataset['train']['highlights'][index]

formatted_prompt = f"Instruct: You are a text summarizer that reads a passage and generates a concise summary capturing the main idea. Summarize the following text:\n{article}\nOutput:\n"


res = gen(original_model,formatted_prompt,100,)

#print(res[0])
output = res[0].split('Output:\n')[1]

dash_line = '-' * 99
print(dash_line)
print(f'INPUT PROMPT:\n{formatted_prompt}')
print(dash_line)
print(f'BASELINE HUMAN SUMMARY:\n{summary}\n')
print(dash_line)
print(f'MODEL GENERATION - ZERO SHOT:\n{output}')

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


---------------------------------------------------------------------------------------------------
INPUT PROMPT:
Instruct: You are a text summarizer that reads a passage and generates a concise summary capturing the main idea. Summarize the following text:
BREMEN, Germany -- Carlos Alberto, who scored in FC Porto's Champions League final victory against Monaco in 2004, has joined Bundesliga club Werder Bremen for a club record fee of  7.8 million euros ($10.7 million). Carlos Alberto enjoyed success at FC Porto under Jose Mourinho. "I'm here to win titles with Werder," the 22-year-old said after his first training session with his new club. "I like Bremen and would only have wanted to come here." Carlos Alberto started his career with Fluminense, and helped them to lift the Campeonato Carioca in 2002. In January 2004 he moved on to FC Porto, who were coached by José Mourinho, and the club won the Portuguese title as well as the Champions League. Early in 2005, he moved to Corinthians,

In [None]:
## Preprocessing the dataset;

# 🧾 Converting Samples into Prompt Format for Instruction Tuning


- To fine-tune our model effectively, we need to structure the training data in a way that mimics instruction-based prompting. This helps the model learn not just from the raw article/summary pairs, but also from a consistent pattern of "prompt → response."

- The function below takes a single sample (an article and its human-written summary) and formats it into a structured prompt using natural language cues.

- The prompt is assembled from the following parts, in this order:

1. System blurb

2. Instruction header

3. Article text (input)

4. Output header + summary

5. End token

In [28]:
## Converting each sample into a suitable prompt;

def create_prompt_formats(sample):
    """
    Build the training prompt from the sample's 'article' and 'highlights' fields,
    joining the parts with two newline characters.
    :param sample: Sample dictionary
    """

    ## system prompt;
    INTRO_BLURB = "You are a text summarizer that reads a newspaper article and generates a concise summary capturing the main idea."

    ## user prompt;
    INSTRUCTION_KEY = "### Instruct: Summarize the following text: "
    
    RESPONSE_KEY = "### Output:"
    END_KEY = "### End"
    
    blurb = f"\n{INTRO_BLURB}"
    instruction = f"{INSTRUCTION_KEY}"
    input_context = f"{sample['article']}" if sample["article"] else None
    response = f"{RESPONSE_KEY}\n{sample['highlights']}"
    end = f"{END_KEY}"
    
    parts = [part for part in [blurb, instruction, input_context, response, end] if part]

    formatted_prompt = "\n\n".join(parts)
    
    sample["text"] = formatted_prompt

    return sample
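
To see what the model is actually trained on, the sketch below rebuilds the same template in a standalone helper and prints it for a made-up sample. `build_prompt` and the toy article are illustrative stand-ins and not part of the notebook; the layout is meant to mirror `create_prompt_formats` above.

```python
INTRO_BLURB = "You are a text summarizer that reads a newspaper article and generates a concise summary capturing the main idea."
INSTRUCTION_KEY = "### Instruct: Summarize the following text: "
RESPONSE_KEY = "### Output:"
END_KEY = "### End"

def build_prompt(article, highlights):
    """Hypothetical standalone helper that assembles the five prompt parts in order."""
    parts = [
        f"\n{INTRO_BLURB}",                # 1. system blurb
        INSTRUCTION_KEY,                   # 2. instruction header
        article or None,                   # 3. article text (skipped if empty)
        f"{RESPONSE_KEY}\n{highlights}",   # 4. output header + summary
        END_KEY,                           # 5. end marker
    ]
    return "\n\n".join(part for part in parts if part)

# Toy sample, invented for illustration
toy_prompt = build_prompt(
    "The city council approved a new bike lane on Main Street.",
    "Council approves Main Street bike lane.",
)
print(toy_prompt)
```

Because every training example ends with the `### End` marker, the same marker can be used at inference time to cut the generated text where the summary stops.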

**📏 Determining the Model's Maximum Sequence Length**


- When tokenizing text for training or inference, it's essential to know the maximum number of tokens your model can handle in a single input. Feeding in longer sequences than the model supports will result in errors or truncated outputs.

- The function below attempts to dynamically fetch this limit from the model's configuration:

In [15]:
# SOURCE - https://github.com/databrickslabs/dolly/blob/master/training/trainer.py

def get_max_length(model):
    conf = model.config
    max_length = None
    for length_setting in ["n_positions", "max_position_embeddings", "seq_length"]:
        max_length = getattr(conf, length_setting, None)
        if max_length:
            print(f"Found max length: {max_length}")
            break
            
    if not max_length:
        max_length = 1024
        print(f"Using default max length: {max_length}")
    return max_length


In [None]:
### Tokenising the input as batches;

def preprocess_batch(batch, tokenizer, max_length):
    """
    Tokenizing a batch
    """
    return tokenizer(
        batch["text"],
        max_length=max_length,
        truncation=True,
    )

#### **🧹 Dataset Preprocessing for Fine-Tuning**


Before we can train our model, we need to **format**, **tokenize**, **filter**, and shuffle the dataset. This function prepares the dataset so it's ready to be fed into the model.

```
def preprocess_dataset(tokenizer: AutoTokenizer, max_length: int, seed, dataset):
```

#### **Step-by-step Explanation:**

**1. Format the data as prompts:**

- We use the previously defined create_prompt_formats() function to convert each article-summary pair into a well-structured instruction-style prompt.

```
dataset = dataset.map(create_prompt_formats)
```

**2. Tokenize the formatted text:**


We use functools.partial to create a preprocessing function that will tokenize each sample with a fixed max_length and tokenizer:

```
_preprocessing_function = partial(preprocess_batch, max_length=max_length, tokenizer=tokenizer)
```

Then, apply this to the dataset with batching enabled. We also remove the raw columns we no longer need:

```
dataset = dataset.map(
    _preprocessing_function,
    batched=True,
    remove_columns=['article', 'highlights', 'id']
)
```

**3. Filter out overly long samples:**

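Because truncation caps every tokenized sample at the model's max length, any sample that reaches the cap has most likely been cut off, possibly losing part of its summary. We drop these samples instead of training on incomplete examples: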

```
dataset = dataset.filter(lambda sample: len(sample["input_ids"]) < max_length)
```


✅ This function ensures our model gets clean, consistently formatted, and length-safe data, crucial for stable and effective training.

In [16]:
from functools import partial

# SOURCE https://github.com/databrickslabs/dolly/blob/master/training/trainer.py
def preprocess_dataset(tokenizer: AutoTokenizer, max_length: int,seed, dataset):
    """Format & tokenize it so it is ready for training
    :param tokenizer (AutoTokenizer): Model Tokenizer
    :param max_length (int): Maximum number of tokens to emit from tokenizer
    """
    
    # Add prompt to each sample
    print("Preprocessing dataset...")
    dataset = dataset.map(create_prompt_formats)#, batched=True)  ### creates the prompt text for the model
    
    _preprocessing_function = partial(preprocess_batch, max_length=max_length, tokenizer=tokenizer)
    dataset = dataset.map(
        _preprocessing_function,
        batched=True,
        remove_columns=['article', 'highlights', 'id'], 
    )

    # Filter out samples that have input_ids exceeding max_length
    dataset = dataset.filter(lambda sample: len(sample["input_ids"]) < max_length)
    
    # Shuffle dataset
    dataset = dataset.shuffle(seed=seed)

    return dataset
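
A toy version of the truncate-then-filter logic, using plain Python lists in place of tokenizer output, shows which samples survive. The sample names and the limit of 8 are made up for illustration; the notebook itself uses the model's real limit.

```python
def truncate(token_ids, max_length):
    """Mimic truncation=True: never return more than max_length tokens."""
    return token_ids[:max_length]

max_length = 8  # toy limit for illustration

toy_samples = {
    "short": list(range(5)),   # fits comfortably
    "exact": list(range(8)),   # lands exactly on the limit
    "long": list(range(12)),   # gets cut down to the limit
}

truncated = {name: truncate(ids, max_length) for name, ids in toy_samples.items()}

# Same condition as the filter in preprocess_dataset: keep only strictly shorter samples
kept = [name for name, ids in truncated.items() if len(ids) < max_length]
print(kept)
```

Only the sample that stays strictly below the limit is kept, so anything that was truncated (or that lands exactly on the limit) is removed from the training set.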

In [17]:
## Printing the GPU Util;
print_gpu_utilization()

GPU memory occupied: 3120 MB.


In [18]:
## Getting the max sequence length;
max_length = get_max_length(original_model)
print(max_length)

Found max length: 2048
2048


In [29]:
%%time
## Shuffling and picking a random sample of 50,000 records for training and 5,000 for testing

train_record_cnt = 50000
test_record_cnt = 5000

train_dataset = preprocess_dataset(tokenizer, max_length,seed, dataset['train'].shuffle(seed=42).select(range(train_record_cnt)))
test_dataset = preprocess_dataset(tokenizer, max_length,seed, dataset['test'].shuffle(seed=42).select(range(test_record_cnt)))

Preprocessing dataset...


Map:   0%|          | 0/50000 [00:00<?, ? examples/s]

Map:   0%|          | 0/50000 [00:00<?, ? examples/s]

Filter:   0%|          | 0/50000 [00:00<?, ? examples/s]

Preprocessing dataset...


Map:   0%|          | 0/5000 [00:00<?, ? examples/s]

Map:   0%|          | 0/5000 [00:00<?, ? examples/s]

Filter:   0%|          | 0/5000 [00:00<?, ? examples/s]

CPU times: user 6min 53s, sys: 2.2 s, total: 6min 55s
Wall time: 6min 55s


In [30]:
print(f"Shapes of the datasets:")
print(f"Training: {train_dataset.shape}")
print(f"Validation: {test_dataset.shape}")
print(train_dataset)

Shapes of the datasets:
Training: (48679, 3)
Validation: (4859, 3)
Dataset({
    features: ['text', 'input_ids', 'attention_mask'],
    num_rows: 48679
})


# Setup the PEFT/LoRA model for Fine-Tuning

- When using Parameter-Efficient Fine-Tuning (PEFT) methods like LoRA, we freeze most of the base model's weights and only train a small subset of parameters. This function helps us verify that setup by calculating how many model parameters are actually being trained.

In [31]:
# This function helps us verify that setup by calculating how many model parameters are actually being trained.
def print_number_of_trainable_model_parameters(model):
    trainable_model_params = 0
    all_model_params = 0
    for _, param in model.named_parameters():
        all_model_params += param.numel()
        if param.requires_grad:
            trainable_model_params += param.numel()
            
    return f"trainable model parameters: {trainable_model_params}\nall model parameters: {all_model_params}\npercentage of trainable model parameters: {100 * trainable_model_params / all_model_params:.2f}%"


In [34]:
## Printing the number of trainable parameters of the original vanilla model
print(print_number_of_trainable_model_parameters(original_model))

print('\n\nModel Configuration: \n')
print(original_model)

trainable model parameters: 262364160
all model parameters: 1521392640
percentage of trainable model parameters: 17.24%


Model Configuration: 

PhiForCausalLM(
  (model): PhiModel(
    (embed_tokens): Embedding(51200, 2560)
    (layers): ModuleList(
      (0-31): 32 x PhiDecoderLayer(
        (self_attn): PhiAttention(
          (q_proj): Linear4bit(in_features=2560, out_features=2560, bias=True)
          (k_proj): Linear4bit(in_features=2560, out_features=2560, bias=True)
          (v_proj): Linear4bit(in_features=2560, out_features=2560, bias=True)
          (dense): Linear4bit(in_features=2560, out_features=2560, bias=True)
        )
        (mlp): PhiMLP(
          (activation_fn): NewGELUActivation()
          (fc1): Linear4bit(in_features=2560, out_features=10240, bias=True)
          (fc2): Linear4bit(in_features=10240, out_features=2560, bias=True)
        )
        (input_layernorm): LayerNorm((2560,), eps=1e-05, elementwise_affine=True)
        (resid_dropout): Dropout(p=0.
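
The total reported above (about 1.52 billion) is lower than the roughly 2.78 billion parameters you get by adding up the layer shapes in the printout. A likely explanation, stated here as an assumption about how bitsandbytes stores 4-bit weights, is that two 4-bit values are packed into each stored element, so `numel()` reports half the logical size of every `Linear4bit` weight. The arithmetic below uses the shapes printed above, plus a second assumption: that the truncated printout ends with a final LayerNorm and an `lm_head` with bias.

```python
hidden, ffn, vocab, n_layers = 2560, 10240, 51200, 32

# Logical sizes of the six Linear4bit layers in one decoder block
attn_weights = 4 * hidden * hidden           # q_proj, k_proj, v_proj, dense
mlp_weights = 2 * hidden * ffn               # fc1, fc2
linear_weights = attn_weights + mlp_weights
linear_biases = 4 * hidden + ffn + hidden    # biases are not quantized
layernorm = 2 * hidden                       # weight + bias of input_layernorm

embed = vocab * hidden
# Assumption: these two modules follow the truncated part of the printout
final_layernorm = 2 * hidden
lm_head = hidden * vocab + vocab

unquantized_rest = embed + final_layernorm + lm_head

logical_total = n_layers * (linear_weights + linear_biases + layernorm) + unquantized_rest

# Assumption: packed 4-bit storage reports half as many elements per quantized weight
reported_total = n_layers * (linear_weights // 2 + linear_biases + layernorm) + unquantized_rest

print(f"logical parameters : {logical_total:,}")
print(f"reported by numel(): {reported_total:,}")
```

Under these assumptions the second figure reproduces the count printed by the notebook, which suggests the 17.24% "trainable" share of the base model is not a like-for-like fraction of the full model.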

**🛠️ Applying PEFT with LoRA for Efficient Fine-Tuning**

- r=32: The rank of the low-rank matrices that LoRA learns as an update to each targeted weight matrix. Higher = more capacity (more weights to train).

- lora_alpha=32: A scaling factor applied to the LoRA updates; in the standard formulation the update is scaled by lora_alpha / r, which is 1 here. Think of it as a learning rate multiplier for the new trainable layers.

- target_modules: The names of the layers to inject LoRA adapters into. For Phi-2, we target **q_proj**, **k_proj**, **v_proj**, and **dense** inside the attention mechanism.

- bias="none": Keep all bias terms frozen; only the LoRA matrices are trained.

- lora_dropout=0.05: Apply dropout within the LoRA layers for **regularization**.

- task_type="CAUSAL_LM": Specifies the task type, needed for adapter setup.
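
As a rough illustration of what r and lora_alpha do, the sketch below applies a LoRA update to a tiny linear layer in plain Python. It assumes the standard LoRA formulation, in which the adapter output is scaled by `lora_alpha / r`; the matrices are toy values, and real adapters operate on tensors, apply dropout, and initialize `A` randomly.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]

def lora_forward(W, A, B, x, r, lora_alpha):
    """y = W x + (lora_alpha / r) * B (A x); W stays frozen, only A and B are trained."""
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    scale = lora_alpha / r
    return [b + scale * u for b, u in zip(base, update)]

r, lora_alpha = 1, 1             # toy rank and scaling, so scale = 1.0

W = [[1.0, 0.0, 0.0],
     [0.0, 1.0, 0.0]]            # frozen base weight, shape (out=2, in=3)
A = [[0.5, 0.5, 0.5]]            # trainable, shape (r, in)
B_init = [[0.0], [0.0]]          # trainable, shape (out, r); zeros at initialization
B_trained = [[1.0], [-1.0]]      # hypothetical values after some training

x = [1.0, 2.0, 3.0]
y_init = lora_forward(W, A, B_init, x, r, lora_alpha)
y_trained = lora_forward(W, A, B_trained, x, r, lora_alpha)

print("before training:", y_init)
print("after training :", y_trained)
```

With `B` at zero the adapted layer behaves exactly like the frozen base layer, which is why adding adapters does not change the model's outputs until training starts. With r=32 and lora_alpha=32 the scale is also 1.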

In [36]:
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training


## LoRA Configuration
config = LoraConfig(
    r=32, #Rank
    lora_alpha=32,
    target_modules=[
        'q_proj',
        'k_proj',
        'v_proj',
        'dense'
    ],
    bias="none",
    lora_dropout=0.05,  # Conventional
    task_type="CAUSAL_LM",
)


# 1 - Enabling gradient checkpointing to reduce memory usage during fine-tuning:
# Instead of storing every intermediate activation for the backward pass, the model keeps
# only a subset and recomputes the rest when needed, trading extra compute for memory.
# It's essential when training large models on limited GPU memory.
original_model.gradient_checkpointing_enable()



# 2 - Using the prepare_model_for_kbit_training method from PEFT:
# This function:
# - Ensures the quantized model is ready for training.
# - Casts certain layers to appropriate data types (e.g., normalization layers to float32).
# - Ensures gradients behave properly in a 4-bit quantized environment.

original_model = prepare_model_for_kbit_training(original_model)


## Inject LoRA Adapters - making only a small subset of parameters trainable
peft_model = get_peft_model(original_model, config)



In [38]:
## Printing the number of trainable parameters after loading the model using the LoRA adapter.
print(print_number_of_trainable_model_parameters(peft_model))


# See how the model looks different now, with the LoRA adapters added:
print('\n\nModel Configuration: \n')
print(peft_model)

trainable model parameters: 20971520
all model parameters: 1542364160
percentage of trainable model parameters: 1.36%


Model Configuration: 

PeftModelForCausalLM(
  (base_model): LoraModel(
    (model): PhiForCausalLM(
      (model): PhiModel(
        (embed_tokens): Embedding(51200, 2560)
        (layers): ModuleList(
          (0-31): 32 x PhiDecoderLayer(
            (self_attn): PhiAttention(
              (q_proj): lora.Linear4bit(
                (base_layer): Linear4bit(in_features=2560, out_features=2560, bias=True)
                (lora_dropout): ModuleDict(
                  (default): Dropout(p=0.05, inplace=False)
                )
                (lora_A): ModuleDict(
                  (default): Linear(in_features=2560, out_features=32, bias=False)
                )
                (lora_B): ModuleDict(
                  (default): Linear(in_features=32, out_features=2560, bias=False)
                )
                (lora_embedding_A): ParameterDict()
                

**Now you can see that the share of trainable parameters dropped from 17.24% to 1.36%. This is the power of LoRA.**
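
The 20,971,520 trainable parameters can be reproduced from the configuration: each adapted layer gains an `A` matrix of shape r x in_features and a `B` matrix of shape out_features x r. A quick check, using the shapes and counts from the printouts above:

```python
r = 32
in_features = out_features = 2560
modules_per_layer = 4      # q_proj, k_proj, v_proj, dense
n_layers = 32

params_per_adapter = r * in_features + out_features * r   # A plus B
lora_params = params_per_adapter * modules_per_layer * n_layers

base_reported = 1_521_392_640            # "all model parameters" printed for the base model
total_after_lora = base_reported + lora_params
trainable_share = 100 * lora_params / total_after_lora

print(f"LoRA parameters : {lora_params:,}")
print(f"all parameters  : {total_after_lora:,}")
print(f"trainable share : {trainable_share:.2f}%")
```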

# 🚀 Training the Model

**Define training arguments and create Trainer instance.**


This block defines the training configuration:

- warmup_steps=5: Gradually increases the learning rate during the first 5 steps to stabilize training.

- per_device_train_batch_size=1: We use a small batch size to reduce memory usage.

- gradient_accumulation_steps=4: Gradients are accumulated over 4 forward/backward passes before each optimizer step, effectively simulating a larger batch size.

- max_steps=1000: Total number of training steps.

- learning_rate=2e-4: The learning rate used by the optimizer.

- optim="paged_adamw_8bit": Uses memory-efficient optimizer suited for large models.

- logging_steps=25, save_steps=25, eval_steps=25: Logs, saves, and evaluates the model every 25 steps.

- do_eval=True: Enables evaluation during training.

- gradient_checkpointing=True: Reduces memory usage at the cost of extra computation by checkpointing intermediate activations.

- overwrite_output_dir=True: Ensures that the output directory can be overwritten.


**I would suggest looking up each parameter to understand it better.**
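
To put these settings in perspective, here is a small back-of-the-envelope calculation of how much data the run actually sees. It assumes a single GPU and uses the training-set size printed earlier (48,679 rows after filtering).

```python
per_device_train_batch_size = 1
gradient_accumulation_steps = 4
max_steps = 1000
eval_steps = 25
train_rows = 48_679   # rows left after filtering, from the shapes printed earlier
num_devices = 1       # assumption: training on a single GPU

effective_batch = per_device_train_batch_size * gradient_accumulation_steps * num_devices
samples_seen = effective_batch * max_steps
epoch_fraction = samples_seen / train_rows
num_evaluations = max_steps // eval_steps

print(f"effective batch size : {effective_batch}")
print(f"samples seen         : {samples_seen:,}")
print(f"fraction of one epoch: {epoch_fraction:.1%}")
print(f"evaluation runs      : {num_evaluations}")
```

Under these assumptions the model sees only a small slice of the prepared training data, while the evaluation set is processed in full many times. If runtime is a concern, a larger eval_steps value or a smaller evaluation subset may be worth considering.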


In [39]:
# 🔧 Define Training Arguments for PEFT Model Fine-Tuning
output_dir = './peft-newspaper-summary-training/final-checkpoint'
import transformers

# ⚙️ Configure TrainingArguments
peft_training_args = TrainingArguments(
    output_dir = output_dir,
    warmup_steps=5,
    per_device_train_batch_size=1,
    gradient_accumulation_steps=4,
    max_steps=1000,
    learning_rate=2e-4,
    optim="paged_adamw_8bit",
    logging_steps=25,
    logging_dir="./logs",
    save_strategy="steps",
    save_steps=25,
    eval_strategy="steps",
    eval_steps=25,
    do_eval=True,
    gradient_checkpointing=True,
    report_to="none",
    overwrite_output_dir=True,
    group_by_length=True,
)

# 🛠 Disable Cache for Training:
# Disabling caching is necessary for models that use gradient_checkpointing, 
# as cached outputs interfere with memory optimization.
peft_model.config.use_cache = False


# 🧑‍🏫 Initialize the PEFT Trainer
peft_trainer = transformers.Trainer(
    model=peft_model,
    train_dataset=train_dataset,
    eval_dataset=test_dataset,
    args=peft_training_args,
    data_collator=transformers.DataCollatorForLanguageModeling(tokenizer, mlm=False),  ### does dynamic padding
)

No label_names provided for model class `PeftModelForCausalLM`. Since `PeftModel` hides base models input arguments, if label_names is not given, label_names can't be set automatically within `Trainer`. Note that empty label_names list will be used instead.


In [40]:
## Ensures that our model is loaded up on the GPU
peft_training_args.device

device(type='cuda', index=0)

**The Training begins!**

In [43]:
%%time
peft_trainer.train()

Step,Training Loss,Validation Loss


Traceback (most recent call last):
  File "/usr/local/lib/python3.11/dist-packages/IPython/core/magics/execution.py", line 1327, in time
    out = eval(code, glob, local_ns)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "<timed eval>", line 1, in <module>
  File "/usr/local/lib/python3.11/dist-packages/transformers/trainer.py", line 2245, in train
    return inner_training_loop(
           ^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/transformers/trainer.py", line 2627, in _inner_training_loop
    self._maybe_log_save_evaluate(
  File "/usr/local/lib/python3.11/dist-packages/transformers/trainer.py", line 3096, in _maybe_log_save_evaluate
    metrics = self._evaluate(trial, ignore_keys_for_eval)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/transformers/trainer.py", line 3045, in _evaluate
    metrics = self.evaluate(ignore_keys=ignore_keys_for_eval)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

TypeError: object of type 'NoneType' has no len()

In [None]:
print_gpu_utilization()

In [None]:
# Free memory for merging weights
del original_model
del peft_trainer
torch.cuda.empty_cache()

In [None]:
## Printing the GPU Util;
print_gpu_utilization()

Once the training process is done, which took almost 4 hours for me, we can examine the output of the model.

# 🧠 Qualitative Evaluation of Model Outputs

In [None]:
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

## Loading the vanilla model and tokeniser to compare its results with the fine tuned model;
base_model_id = "microsoft/phi-2"
base_model = AutoModelForCausalLM.from_pretrained(base_model_id, 
                                                      device_map='auto',
                                                      quantization_config=bnb_config,
                                                      trust_remote_code=True,
                                                      use_auth_token=True)

eval_tokenizer = AutoTokenizer.from_pretrained(base_model_id, add_bos_token=True, trust_remote_code=True, use_fast=False)
eval_tokenizer.pad_token = eval_tokenizer.eos_token

Remember, when you fine-tune a model using PEFT (e.g. LoRA), **you are not training all the original model’s parameters**. Instead, you train a **small set of additional, trainable parameters** (called adapters or LoRA modules) that are layered on top of a frozen base model.


Hence when you load the fine-tuned model using PeftModel.from_pretrained, you need to pass the base model as an argument, along with other arguments.

**Essentially, the PeftModel.from_pretrained method:**

- Loads the adapter configuration.

- Injects the fine-tuned adapter weights into the correct layers of base_model.

- Returns a composite model (original + PEFT modules) ready for inference or further fine-tuning.
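
To make the "frozen base plus adapter" idea concrete, here is a minimal pure-Python sketch of the LoRA arithmetic for a single linear layer. It is not how `peft` is implemented internally. The matrices, the rank `r`, and the scaling factor `alpha` are made-up illustrative values, and `matmul` and `lora_forward` are hypothetical helper names.

```python
# Illustrative only: tiny hand-written matrices, not real Phi-2 weights.

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def lora_forward(x, W, A, B, alpha, r):
    """Compute x @ W + (alpha / r) * (x @ A @ B).

    W is the frozen base weight; A and B are the small trainable adapter matrices.
    """
    base = matmul(x, W)
    update = matmul(matmul(x, A), B)
    scale = alpha / r
    return [[b + scale * u for b, u in zip(brow, urow)]
            for brow, urow in zip(base, update)]


# Frozen 3x3 base weight (identity, for readability).
W = [[1.0, 0.0, 0.0],
     [0.0, 1.0, 0.0],
     [0.0, 0.0, 1.0]]

# Rank-1 adapter: A is 3x1 and B is 1x3, so 6 numbers are trained instead of 9.
r, alpha = 1, 2
A = [[1.0], [0.0], [0.0]]
B_zero = [[0.0, 0.0, 0.0]]     # with B at zero the adapter contributes nothing
B_trained = [[0.0, 0.5, 0.0]]  # pretend values after training

x = [[1.0, 2.0, 3.0]]
print(lora_forward(x, W, A, B_zero, alpha, r))     # identical to the base output
print(lora_forward(x, W, A, B_trained, alpha, r))  # base output plus adapter update
```

The base weight `W` is never modified; only the adapter term changes the output, which is why the adapter checkpoint can be stored separately and attached to a freshly loaded base model.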




In [None]:
# 💾 Load the Final Model Checkpoint
# is_trainable=False keeps the adapter weights frozen, since we only run inference here.

from peft import PeftModel

ft_model = PeftModel.from_pretrained(
    base_model,
    "/kaggle/working/peft-newspaper-summary-training/final-checkpoint/checkpoint-1000",
    torch_dtype=torch.float16,
    is_trainable=False,
)

**Picking a sample index and comparing the fine-tuned model's output with the human summary;**

In [None]:
%%time
from transformers import set_seed
set_seed(seed)


index = 10

article = dataset['validation'][index]['article']
summary = dataset['validation'][index]['highlights']

prompt = f"Instruct: You are a text summarizer that reads a passage and generates a concise summary capturing the main idea. Summarize the following text:\n{article}\nOutput:\n"

peft_model_res = gen(ft_model, prompt, 100)
peft_model_output = peft_model_res[0].split('Output:\n')[1]
# Keep only the text before the '#End' marker.
prefix, success, result = peft_model_output.partition('#End')

dash_line = '-' * 99
print(dash_line)
print(f'INPUT PROMPT:\n{prompt}')
print(dash_line)
print(f'BASELINE HUMAN SUMMARY:\n{summary}\n')
print(dash_line)
print(f'PEFT MODEL:\n{prefix}')
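
The post-processing in the cell above assumes that the generated text always contains the `Output:\n` marker; if it does not, `split(...)[1]` raises an `IndexError`. A small helper can make the extraction a little more defensive. This is only a sketch: `extract_summary` is a hypothetical name, and the marker strings are the ones used in the prompt and in the cell above. Unlike the original code, it also strips surrounding whitespace.

```python
def extract_summary(generated_text, output_marker="Output:\n", end_marker="#End"):
    """Return the text between the first output marker and the end marker.

    Falls back to the full text when the output marker is missing, and to
    everything after the output marker when the end marker is missing.
    """
    _, found, after = generated_text.partition(output_marker)
    body = after if found else generated_text
    summary, _, _ = body.partition(end_marker)
    return summary.strip()


sample = (
    "Instruct: Summarize the following text:\nSome article.\n"
    "Output:\nA short summary.#End extra tokens"
)
print(extract_summary(sample))
```

With such a helper, the extraction step would become `extract_summary(peft_model_res[0])`, and the same function could be reused for the base model's output.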

### Evaluate the Model Quantitatively (with **ROUGE** Metric)

Run inference on a small sample of the validation split (only 10 articles and their summaries, to save time).

In [None]:
# base_model was wrapped by PeftModel above, so load a separate copy to serve as the baseline.
original_model = AutoModelForCausalLM.from_pretrained(
    base_model_id,
    device_map='auto',
    quantization_config=bnb_config,
    trust_remote_code=True,
    use_auth_token=True,
)

In [None]:
import pandas as pd

articles = dataset['validation'][0:10]['article']
human_baseline_summaries = dataset['validation'][0:10]['highlights']

original_model_summaries = []
peft_model_summaries = []

for article in articles:
    prompt = f"Instruct: You are a text summarizer that reads a passage and generates a concise summary capturing the main idea. Summarize the following text:\n{article}\nOutput:\n"

    # Baseline: the untouched Phi-2 model.
    original_model_res = gen(original_model, prompt, 100)
    original_model_text_output = original_model_res[0].split('Output:\n')[1]

    # Fine-tuned model: keep only the text before the '#End' marker.
    peft_model_res = gen(ft_model, prompt, 100)
    peft_model_output = peft_model_res[0].split('Output:\n')[1]
    peft_model_text_output, success, result = peft_model_output.partition('#End')

    original_model_summaries.append(original_model_text_output)
    peft_model_summaries.append(peft_model_text_output)

zipped_summaries = list(zip(human_baseline_summaries, original_model_summaries, peft_model_summaries))



In [None]:
## To compare the summaries;
df = pd.DataFrame(zipped_summaries, columns = ['human_baseline_summaries', 'original_model_summaries', 'peft_model_summaries'])
df.head()

#### **The ROUGE (Recall-Oriented Understudy for Gisting Evaluation) SCORE;**

- ROUGE is typically used to compare generated text against reference text; here it evaluates summarization quality by comparing model outputs to human-written reference summaries.

- ROUGE evaluates the **overlap of n-grams** and **longest common subsequences** between predicted and reference summaries.


#### **📏 How ROUGE is Calculated (Concise Explanation):**

ROUGE measures the quality of a generated summary by comparing it to a reference summary, focusing on text overlap. Three variants are commonly reported:

1. ROUGE-1: Measures the overlap of individual words (unigrams).

2. ROUGE-2: Measures the overlap of word pairs (bigrams).

3. ROUGE-L: Measures the longest common subsequence (LCS) of words in the same order.


For each summary pair (prediction vs. reference), ROUGE computes:

1. Precision: The fraction of units (unigrams, bigrams, or LCS words) in the prediction that also appear in the reference.

2. Recall: The fraction of units in the reference that are captured by the prediction.

3. F1-Score: Harmonic mean of precision and recall.

The final ROUGE score is the average F1 score across all prediction–reference pairs, providing an overall measure of how well the model’s summaries align with human-written ones.


#### **Example Explanation;**

- Reference: "The cat sat on the mat"
- Prediction: "A cat sat on mat"

| **Metric**  | **Matching Units**                  | **Precision** | **Recall**   | **F1 Score** |
| ----------- | ----------------------------------- | ------------- | ------------ | ------------ |
| **ROUGE-1** | `["cat", "sat", "on", "mat"]`       | 4 / 5 = 0.80  | 4 / 6 ≈ 0.67 | **0.73**     |
| **ROUGE-2** | `["cat sat", "sat on"]`             | 2 / 4 = 0.50  | 2 / 5 = 0.40 | **0.44**     |
| **ROUGE-L** | `["cat", "sat", "on", "mat"]` (LCS) | 4 / 5 = 0.80  | 4 / 6 ≈ 0.67 | **0.73**     |
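
The numbers in the table can be reproduced with a few lines of plain Python. This is a simplified sketch for intuition only: it lowercases and splits on whitespace, whereas the `rouge_score` package applies its own tokenization and optional stemming, so real scores can differ slightly. The helper names (`ngrams`, `prf`, `rouge_n`, `lcs_length`, `rouge_l`) are illustrative and not part of any library.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return a Counter of all n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def prf(matches, pred_total, ref_total):
    """Precision, recall and F1 from a match count and the two totals."""
    precision = matches / pred_total if pred_total else 0.0
    recall = matches / ref_total if ref_total else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


def rouge_n(prediction, reference, n):
    pred = ngrams(prediction.lower().split(), n)
    ref = ngrams(reference.lower().split(), n)
    overlap = sum((pred & ref).values())  # clipped n-gram matches
    return prf(overlap, sum(pred.values()), sum(ref.values()))


def lcs_length(a, b):
    """Length of the longest common subsequence via dynamic programming."""
    table = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            if x == y:
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1])
    return table[len(a)][len(b)]


def rouge_l(prediction, reference):
    pred, ref = prediction.lower().split(), reference.lower().split()
    return prf(lcs_length(pred, ref), len(pred), len(ref))


reference = "The cat sat on the mat"
prediction = "A cat sat on mat"

for name, scores in [("ROUGE-1", rouge_n(prediction, reference, 1)),
                     ("ROUGE-2", rouge_n(prediction, reference, 2)),
                     ("ROUGE-L", rouge_l(prediction, reference))]:
    p, r, f = scores
    print(f"{name}: precision={p:.2f} recall={r:.2f} f1={f:.2f}")
```

Running it prints 0.80 / 0.67 / 0.73 for ROUGE-1 and ROUGE-L, and 0.50 / 0.40 / 0.44 for ROUGE-2, matching the table.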



#### In general, ROUGE-1, ROUGE-2, and ROUGE-L are the variants most commonly reported; higher-order n-gram variants are rarely used, as it's much harder for a model to match exact sequences of 3+ words.


In [None]:
!pip install rouge_score
import evaluate

rouge = evaluate.load('rouge')

original_model_results = rouge.compute(
    predictions=original_model_summaries,
    references=human_baseline_summaries[0:len(original_model_summaries)],
    use_aggregator=True,
    use_stemmer=True,
)

peft_model_results = rouge.compute(
    predictions=peft_model_summaries,
    references=human_baseline_summaries[0:len(peft_model_summaries)],
    use_aggregator=True,
    use_stemmer=True,
)

print('ORIGINAL MODEL:')
print(original_model_results)
print('PEFT MODEL:')
print(peft_model_results)

In [None]:
import numpy as np

print("Absolute improvement (in percentage points) of PEFT MODEL over ORIGINAL MODEL")

# Difference between the two sets of ROUGE scores, metric by metric.
improvement = (np.array(list(peft_model_results.values())) - np.array(list(original_model_results.values())))
for key, value in zip(peft_model_results.keys(), improvement):
    print(f'{key}: {value*100:.2f}%')
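
Note that subtracting two ROUGE scores gives an absolute difference (in percentage points), which is not the same as a relative improvement over the baseline. Here is a minimal sketch of both. The scores are made-up placeholders rather than results from this notebook, and `compare_scores` is a hypothetical helper name.

```python
def compare_scores(baseline, candidate):
    """Return {metric: (absolute_points, relative_percent)} for two score dicts."""
    report = {}
    for key, base in baseline.items():
        diff = candidate[key] - base
        absolute = diff * 100                                  # percentage points
        relative = diff / base * 100 if base else float("nan")  # percent of baseline
        report[key] = (absolute, relative)
    return report


# Placeholder numbers for illustration only, NOT measured results.
baseline_scores = {"rouge1": 0.20, "rougeL": 0.10}
candidate_scores = {"rouge1": 0.25, "rougeL": 0.15}

for metric, (absolute, relative) in compare_scores(baseline_scores, candidate_scores).items():
    print(f"{metric}: {absolute:+.2f} points absolute, {relative:+.2f}% relative")
```

In this toy case both metrics gain 5 points in absolute terms, but the relative gain is 25% for one and 50% for the other, so it helps to state which of the two is being reported.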

# 📌 End Notes

- In this notebook, we demonstrated how to fine-tune a large language model using Parameter-Efficient Fine-Tuning (PEFT) on a news summarization task. We also evaluated the model's performance using ROUGE metrics, and compared it against the base model both quantitatively and qualitatively.

### **Key takeaways:**

1. PEFT allows for efficient training with fewer parameters and lower compute costs.

2. ROUGE provides a structured way to evaluate summarization quality through n-gram and sequence overlap.

3. Human evaluation remains essential to capture fluency, relevance, and coherence.


## Have questions or suggestions? Feel free to post your doubts in the comments section below — I’ll be happy to help!

## Found this notebook helpful? Please consider **upvoting** — it helps others discover this content and motivates me to create more educational notebooks!