<a href="https://colab.research.google.com/github/veydantkatyal/Llama-LoRA-FineTuning/blob/main/Fine_Tune_Llama_3_2_1B.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Fine-tuning Llama-3.2 1B for Dialogue Summarization

This notebook demonstrates how to fine-tune Meta's Llama-3.2 1B language model for a specific task: summarizing dialogues. We'll use modern techniques like LoRA (Low-Rank Adaptation) and quantization to make this process efficient and accessible even with limited computational resources.

## What we'll cover:
1. Setting up the required libraries
2. Loading and preparing the model
3. Processing our dataset
4. Configuring the fine-tuning process
5. Training the model

Note: This notebook assumes you have access to a GPU. We'll be using techniques to minimize memory usage while maintaining performance.

## Setup

First, we'll install the necessary libraries:
- `bitsandbytes`: For model quantization (reducing model size)
- `transformers`: Hugging Face's library for working with language models
- `peft`: For efficient fine-tuning using LoRA
- `accelerate`: For optimized model training
- `datasets`: For handling our training data
- `trl`: For supervised fine-tuning

In [None]:
!pip install -q -U bitsandbytes transformers peft accelerate datasets trl

## Importing Required Libraries

We import the necessary modules for our task. Each has a specific purpose:
- `datasets`: To load and process our training data
- `AutoModelForCausalLM`: To load our pre-trained language model
- `BitsAndBytesConfig`: For model quantization
- `TrainingArguments`: To configure training parameters
- `SFTTrainer`: For supervised fine-tuning

In [None]:
import torch
import time
from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
    TrainingArguments,
    DataCollatorForLanguageModeling,
    set_seed
)
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
from functools import partial
from trl import SFTTrainer, DataCollatorForCompletionOnlyLM


We select CUDA (the GPU) when it is available and fall back to the CPU otherwise. Training on a GPU is much faster than on a CPU. The cell also prints the GPU's name and memory usage.

In [None]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print("Using device:", device)

if device.type == "cuda":
    device_index = torch.cuda.current_device()
    device_name = torch.cuda.get_device_name(device_index)
    total_mem = torch.cuda.get_device_properties(device_index).total_memory / 1e9  # bytes to GB
    allocated_mem = torch.cuda.memory_allocated(device_index) / 1e9
    reserved_mem = torch.cuda.memory_reserved(device_index) / 1e9

    print(f"CUDA device name: {device_name}")
    print(f"Total memory: {total_mem:.2f} GB")
    print(f"Memory allocated: {allocated_mem:.2f} GB")
    print(f"Memory reserved: {reserved_mem:.2f} GB")

Using device: cuda
CUDA device name: Tesla T4
Total memory: 15.83 GB
Memory allocated: 0.00 GB
Memory reserved: 0.00 GB


## Loading the Dataset

We're using the DialogSum dataset (`knkarthick/dialogsum` on the Hugging Face Hub), which contains conversations, their topics, and human-written summaries. This dataset will help us train our model to generate concise summaries of dialogues.

In [None]:
huggingface_dataset_name = "knkarthick/dialogsum"
dataset = load_dataset(huggingface_dataset_name)

## Model Quantization Configuration

Here we set up quantization parameters to reduce the model's memory footprint. We're using 4-bit quantization, which significantly reduces memory usage while maintaining most of the model's performance.

Key concepts:
- Quantization: Converting model weights to lower precision (4-bit instead of 16/32-bit)
- `compute_dtype`: The data type used for computations
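
To get a feel for the savings, the weight memory can be estimated from the parameter count. The sketch below is illustrative only: the parameter count of roughly 1.24 billion is an assumption (it is not stated in this notebook), and the estimate ignores quantization constants, activations, and the layers that stay in higher precision, so real usage will be higher.

```python
# Rough weight-memory estimate at different precisions.
# ASSUMPTION: ~1.24e9 parameters; this number is illustrative and not taken
# from the notebook. Overheads (quantization constants, activations) are ignored.

def weight_memory_gb(num_params: float, bits_per_param: int) -> float:
    """Return the approximate weight memory in GB (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9

NUM_PARAMS = 1.24e9  # hypothetical parameter count

for bits in (32, 16, 4):
    print(f"{bits:>2}-bit: {weight_memory_gb(NUM_PARAMS, bits):.2f} GB")
```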

In [None]:
compute_dtype = torch.float16  # data type used for computations on the dequantized weights
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type='nf4',
    bnb_4bit_compute_dtype=compute_dtype,
    bnb_4bit_use_double_quant=False,
)

## Loading the Base Model

We're loading Meta's Llama-3.2 1B model, a relatively compact but powerful language model. We're applying our quantization configuration to make it memory-efficient.

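Note: The model is loaded with `trust_remote_code=True`, which allows custom code from the model repository to run. Llama is supported natively by `transformers`, so this flag should not be required here. The `meta-llama` repository is gated, so you need to be logged in to Hugging Face with an account that has been granted access.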

In [None]:
model_name='meta-llama/Llama-3.2-1B-Instruct'
device_map = {"": 0}
original_model = AutoModelForCausalLM.from_pretrained(model_name,
                                                      device_map=device_map,
                                                      quantization_config=bnb_config,
                                                      trust_remote_code=True,
                                                      token=True)  # `use_auth_token` is deprecated in favor of `token`
MAX_LENGTH = original_model.config.max_position_embeddings



## Setting Up the Tokenizer

The tokenizer converts text into numbers that the model can process. We configure it with specific settings:
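- `padding_side="left"`: Adds padding tokens at the start of sequences, so generation continues directly from the real tokens
- `add_eos_token=False`: The tokenizer does not append an end-of-sequence token automatically; we append it ourselves when building the training texts
- `pad_token`: Set to `<|finetune_right_pad_id|>`, the special token used here for padding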
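
Left padding matters for generation with decoder-only models: the model continues from the last position, so the real tokens should sit at the end of each row. A toy sketch of the difference, using made-up token ids and a hypothetical `pad` helper rather than the real tokenizer:

```python
# Toy illustration of left vs. right padding (token ids are made up).
PAD_ID = 0  # hypothetical pad id for this sketch

def pad(sequences, side="left", pad_id=PAD_ID):
    """Pad lists of token ids to a common length; return ids and attention masks."""
    width = max(len(seq) for seq in sequences)
    ids, masks = [], []
    for seq in sequences:
        filler = [pad_id] * (width - len(seq))
        real = [1] * len(seq)
        zeros = [0] * len(filler)
        if side == "left":
            ids.append(filler + seq)
            masks.append(zeros + real)
        else:
            ids.append(seq + filler)
            masks.append(real + zeros)
    return ids, masks

batch = [[11, 12, 13], [21]]
left_ids, left_mask = pad(batch, side="left")
print(left_ids)   # [[11, 12, 13], [0, 0, 21]]
print(left_mask)  # [[1, 1, 1], [0, 0, 1]]
```

The attention mask marks the padding positions with 0 so the model can ignore them.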

In [None]:
tokenizer = AutoTokenizer.from_pretrained(
    model_name, trust_remote_code=True, padding=True, padding_side="left",
    add_eos_token=False
)
tokenizer.pad_token = '<|finetune_right_pad_id|>'

## Testing Initial Model Performance

Let's test our base model before fine-tuning to see how it handles dialogue summarization. This will give us a baseline to compare against after training.
The prompt template also includes the topic, which helps guide the tone and type of summarization.

In [None]:
PROMPT_TEMPLATE = """
<|begin_of_text|><|start_header_id|>system<|end_header_id|>
You are an expert on summarizing conversations considering a particular topic.
The user request will contain the topic and the conversation
Answer with the summary only. Do not explain your answer
<|eot_id|>

<|start_header_id|>user<|end_header_id|>
Topic: {0}
Conversation: {1}

<|eot_id|><|start_header_id|>assistant<|end_header_id|>
{2}
"""

def generate_response(
    model, topic, conversation, summary='',
    max_length=MAX_LENGTH, prompt_template=PROMPT_TEMPLATE,
    seed=42, tokenizer=tokenizer
):
    set_seed(seed)
    prompt = prompt_template.format(topic, conversation, summary)
    inputs = tokenizer(
        prompt,
        return_tensors="pt",
        return_attention_mask=True,
        padding=True
    ).to(device)

    outputs = model.generate(
        **inputs,
        max_length=max_length,
        pad_token_id=tokenizer.eos_token_id
    )

    # Decode full output and prompt
    full_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
    prompt_text = tokenizer.decode(inputs['input_ids'][0], skip_special_tokens=True)

    # Get only the response part
    response_only = full_text[len(prompt_text):].strip()

    return response_only
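
The function above removes the prompt by comparing decoded strings. An alternative, sketched here with plain lists standing in for tensors, is to slice off the prompt by token count before decoding, which avoids relying on the decoded prompt matching character for character. This is not what the notebook does; `strip_prompt_tokens` is a hypothetical helper and the token ids are made up.

```python
def strip_prompt_tokens(output_ids, prompt_ids):
    """Return only the newly generated token ids.

    Decoder-only generation returns the prompt ids followed by the new ids,
    so the continuation starts at index len(prompt_ids).
    """
    return output_ids[len(prompt_ids):]

prompt_ids = [101, 7, 8, 9]            # made-up ids for the prompt
output_ids = [101, 7, 8, 9, 42, 43]    # prompt followed by two generated ids
print(strip_prompt_tokens(output_ids, prompt_ids))  # [42, 43]
```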

In [None]:
idx = 0
dialogue = dataset['train'][idx]['dialogue']
topic = dataset['train'][idx]['topic']
summary = dataset['train'][idx]['summary']
response = generate_response(original_model, topic, dialogue)
print('MODEL RESPONSE: ')
print(response)
print('-'*100)
print('HUMAN GENERATED SUMMARY: ')
print(summary)

MODEL RESPONSE: 
Person1: Hi, Mr. Smith. I'm Doctor Hawkins. Why are you here today?
Person2: I found it would be a good idea to get a check-up.
Person1: Yes, well, you haven't had one for 5 years. You should have one every year.
Person2: I know. I figure as long as there is nothing wrong, why go see the doctor?
Person1: Well, the best way to avoid serious illnesses is to find out about them early. So try to come at least once a year for your own good.
Person2: Ok.
Person1: Let me see here. Your eyes and ears look fine. Take a deep breath, please. Do you smoke, Mr. Smith?
Person2: Yes.
Person1: Smoking is the leading cause of lung cancer and heart disease, you know. You really should quit.
Person2: I've tried hundreds of times, but I just can't seem to kick the habit.
Person1: Well, we have classes and some medications that might help. I'll give you more information before you leave.
----------------------------------------------------------------------------------------------------
HU

The base model mostly repeats the dialogue instead of summarizing it, so there is room to improve the response by fine-tuning.

## Preparing Data Processing Functions

We define several helper functions to prepare our data:
- `apply_prompt`: Formats our input in a consistent way
- `process_batch`: Handles tokenization of multiple examples at once
- `process_dataset`: Combines all processing steps and prepares the final dataset

These functions ensure our data is in the right format for training.
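
The three steps can be sketched without any library. In this toy version the whitespace "tokenizer", the template, and the samples are all stand-ins for illustration, not the notebook's actual tokenizer or data:

```python
# Toy version of the pipeline: format -> tokenize -> filter by length.
# The whitespace "tokenizer", template, and samples are stand-ins for illustration.
TOY_TEMPLATE = "Topic: {0}\nConversation: {1}\nSummary: {2}"

def toy_apply_prompt(sample):
    """Add a formatted 'text' field built from topic, dialogue, and summary."""
    text = TOY_TEMPLATE.format(sample["topic"], sample["dialogue"], sample["summary"])
    return {**sample, "text": text}

def toy_tokenize(sample):
    """Split on whitespace as a stand-in for real tokenization."""
    return {**sample, "input_ids": sample["text"].split()}

def toy_process(samples, max_length):
    """Format and tokenize every sample, then drop the ones that are too long."""
    formatted = [toy_apply_prompt(s) for s in samples]
    tokenized = [toy_tokenize(s) for s in formatted]
    return [s for s in tokenized if len(s["input_ids"]) < max_length]

samples = [
    {"topic": "greeting", "dialogue": "A: Hi. B: Hello.", "summary": "They greet."},
    {"topic": "long", "dialogue": "word " * 50, "summary": "Too long."},
]
kept = toy_process(samples, max_length=20)
print([s["topic"] for s in kept])  # ['greeting']
```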

In [None]:
def apply_prompt(sample):
    dialogue = sample['dialogue']
    summary = sample['summary']
    topic = sample['topic']

    sample['text'] = PROMPT_TEMPLATE.format(topic, dialogue, summary)
    sample['text'] += tokenizer.eos_token
    return sample

def process_batch(batch, tokenizer, max_length):
    return tokenizer(batch['text'])

def process_dataset(dataset, tokenizer, max_length=MAX_LENGTH, seed=42):
    dataset = dataset.map(apply_prompt)
    proc_fn = partial(process_batch, max_length=max_length, tokenizer=tokenizer)

    # Generated input ids
    dataset = dataset.map(
        proc_fn,
        batched=True,
        remove_columns=['id', 'topic', 'dialogue', 'summary'],
    )

    # filter samples larger than max_length
    dataset = dataset.filter(lambda sample: len(sample["input_ids"]) < max_length)

    # Shuffle
    dataset = dataset.shuffle(seed=seed)

    return dataset

In [None]:
train_dataset = process_dataset(dataset['train'], tokenizer)
eval_dataset = process_dataset(dataset['validation'], tokenizer)

In [None]:
print(train_dataset[0]['text'])


<|begin_of_text|><|start_header_id|>system<|end_header_id|>
You are an expert on summarizing conversations considering a particular topic.
The user request will contain the topic and the conversation
Answer with the summary only. Do not explain your answer
<|eot_id|>

<|start_header_id|>user<|end_header_id|>
Topic: get a promotion
Conversation: #Person1#: Hello, Anna speaking!
#Person2#: Hey, Anna, this is Jason.
#Person1#: Jason, where have you been hiding lately? You know it's been a long time since your last call. Have you been good?
#Person2#: Yes. How are you, Anna?
#Person1#: I am fine. What have you been doing?
#Person2#: Working. I've been really busy these days. I got a promotion.
#Person1#: That's great, congratulations!
#Person2#: Thanks. I am feeling pretty good about myself too. You know, bigger office, a raise and even an assistant.
#Person1#: That's good. So I guess I'll have to make an appointment to see you.
#Person2#: You are kidding.
#Person1#: How long have you bee

## Setting Up LoRA Configuration

LoRA (Low-Rank Adaptation) is a technique that makes fine-tuning more efficient by only training a small number of additional parameters, called adapters, instead of the entire model.

Key parameters:
- `r`: The rank of the LoRA update matrices
- `lora_alpha`: Scaling factor for LoRA updates
- `target_modules`: Which model layers to apply LoRA to
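
As a rough illustration of why this is cheap: for a linear layer with `in_features` inputs and `out_features` outputs, LoRA adds two small matrices, `A` (`r × in_features`) and `B` (`out_features × r`), and the low-rank update is scaled by `lora_alpha / r`. The sketch below counts parameters for a 2048 × 2048 projection (the shape of `q_proj` in this model) with the `r=32` and `lora_alpha=32` used in this notebook. The helper names are hypothetical.

```python
def lora_param_count(in_features: int, out_features: int, r: int) -> int:
    """Parameters added by LoRA: A is (r x in_features), B is (out_features x r)."""
    return r * in_features + out_features * r

def lora_scaling(lora_alpha: int, r: int) -> float:
    """Factor applied to the low-rank update B @ A (standard LoRA scaling)."""
    return lora_alpha / r

in_f, out_f, r, alpha = 2048, 2048, 32, 32
base = in_f * out_f
added = lora_param_count(in_f, out_f, r)
print(f"base weights: {base:,}")
print(f"LoRA weights: {added:,} ({added / base:.2%} of the base layer)")
print(f"scaling: {lora_scaling(alpha, r)}")
```

With `lora_alpha` equal to `r`, the scaling factor is 1, so the adapter output is added to the frozen layer's output without rescaling.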

In [None]:
original_model = prepare_model_for_kbit_training(original_model)

lora_config = LoraConfig(
    r=32,
    lora_alpha=32,
    target_modules=[
        'q_proj',
        'k_proj',
        'v_proj',
        'dense'  # note: Llama has no module named 'dense', so this entry matches nothing
    ],
    bias="none",
    lora_dropout=0.05,  # Conventional
    task_type="CAUSAL_LM",
)

# Enable gradient checkpointing to reduce memory usage during fine-tuning
original_model.gradient_checkpointing_enable()

peft_model = get_peft_model(original_model, lora_config)
peft_model

PeftModelForCausalLM(
  (base_model): LoraModel(
    (model): LlamaForCausalLM(
      (model): LlamaModel(
        (embed_tokens): Embedding(128256, 2048)
        (layers): ModuleList(
          (0-15): 16 x LlamaDecoderLayer(
            (self_attn): LlamaAttention(
              (q_proj): lora.Linear4bit(
                (base_layer): Linear4bit(in_features=2048, out_features=2048, bias=False)
                (lora_dropout): ModuleDict(
                  (default): Dropout(p=0.05, inplace=False)
                )
                (lora_A): ModuleDict(
                  (default): Linear(in_features=2048, out_features=32, bias=False)
                )
                (lora_B): ModuleDict(
                  (default): Linear(in_features=32, out_features=2048, bias=False)
                )
                (lora_embedding_A): ParameterDict()
                (lora_embedding_B): ParameterDict()
                (lora_magnitude_vector): ModuleDict()
              )
              (k_proj): lor

## Configuring Training Parameters

We set up the training process with specific parameters:
- Small batch size to manage memory usage
- Gradient accumulation to simulate larger batches
- Learning rate and optimization settings
- Evaluation and saving checkpoints during training

These settings help balance training efficiency with resource constraints.
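
Gradient accumulation can be read as simple arithmetic: gradients from several small batches are summed before each optimizer step. A small sketch using the values configured in this section (`per_device_train_batch_size=1`, `gradient_accumulation_steps=4`) and the 12,460 training examples; a single GPU is assumed, and `optimizer_steps_per_epoch` is a hypothetical helper.

```python
import math

def optimizer_steps_per_epoch(num_examples, per_device_batch_size,
                              grad_accum_steps, num_devices=1):
    """Return the effective batch size and the optimizer updates per epoch."""
    effective_batch = per_device_batch_size * grad_accum_steps * num_devices
    return effective_batch, math.ceil(num_examples / effective_batch)

effective, steps = optimizer_steps_per_epoch(12460, 1, 4)
print(effective, steps)  # 4 3115
```

So each update sees 4 examples, and evaluating every 25 steps means evaluating after every 100 training examples.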

We'll train only on the responses (the summaries), so the loss ignores the prompt tokens. Using `DataCollatorForCompletionOnlyLM` with the Llama tokenizer needs a workaround: the response template is passed as token ids instead of a string. You can find the reference [here](https://github.com/huggingface/trl/blob/main/docs/source/sft_trainer.md#using-token_ids-directly-for-response_template).
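
Conceptually, training only on responses means the loss ignores every token up to and including the response template. The sketch below shows the idea with made-up token ids: labels before the response are set to `-100`, the index that PyTorch's cross-entropy loss ignores by default. It is a simplified illustration, not the collator's actual implementation, and `mask_prompt_labels` is a hypothetical helper.

```python
IGNORE_INDEX = -100  # default ignore_index of torch.nn.CrossEntropyLoss

def find_subsequence(sequence, pattern):
    """Return the start index of the first occurrence of pattern, or -1."""
    n, m = len(sequence), len(pattern)
    for start in range(n - m + 1):
        if sequence[start:start + m] == pattern:
            return start
    return -1

def mask_prompt_labels(input_ids, template_ids):
    """Copy input_ids to labels, masking everything through the template."""
    labels = list(input_ids)
    start = find_subsequence(input_ids, template_ids)
    if start == -1:
        # Template not found: nothing to learn from, so mask the whole sample.
        return [IGNORE_INDEX] * len(labels)
    end = start + len(template_ids)
    labels[:end] = [IGNORE_INDEX] * end
    return labels

input_ids = [5, 6, 7, 90, 91, 92, 30, 31, 2]   # prompt, template, response, EOS
template_ids = [90, 91, 92]                     # made-up response template ids
print(mask_prompt_labels(input_ids, template_ids))
# [-100, -100, -100, -100, -100, -100, 30, 31, 2]
```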

In [None]:
output_dir = f'./peft-dialogue-summary-training-{str(int(time.time()))}'
response_template = "<|eot_id|><|start_header_id|>assistant<|end_header_id|>"
response_template_with_context = f"\n{response_template}"  # Prepend "\n" as context so the template is tokenized the same way as inside the dataset texts
response_template_ids = tokenizer.encode(response_template_with_context, add_special_tokens=False)[2:]  # Drop the first two token ids so only the part that matches the dataset texts remains

peft_training_args = TrainingArguments(
    output_dir = output_dir,
    warmup_steps=1,
    per_device_train_batch_size=1,
    gradient_accumulation_steps=4,
    num_train_epochs=1,
    learning_rate=2e-4,
    optim="paged_adamw_8bit",
    logging_steps=25,
    logging_dir="./logs",
    save_strategy="steps",
    save_steps=25,
    eval_strategy="steps",
    eval_steps=25,
    do_eval=True,
    gradient_checkpointing=True,
    report_to="none",
    overwrite_output_dir=True,
    group_by_length=True,
    dataloader_pin_memory=False,
    load_best_model_at_end=True,
    save_total_limit=3,
    metric_for_best_model="eval_loss",
)

peft_model.config.use_cache = False

peft_trainer = SFTTrainer(
    model=peft_model,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    args=peft_training_args,
    processing_class=tokenizer,
    data_collator=DataCollatorForCompletionOnlyLM(response_template_ids, tokenizer=tokenizer),
)

Truncating train dataset:   0%|          | 0/12460 [00:00<?, ? examples/s]

Truncating eval dataset:   0%|          | 0/500 [00:00<?, ? examples/s]

No label_names provided for model class `PeftModelForCausalLM`. Since `PeftModel` hides base models input arguments, if label_names is not given, label_names can't be set automatically within `Trainer`. Note that empty label_names list will be used instead.


## Training the adapters

Note that some samples are larger than the context size; the trainer ignores them.

In [None]:
training_history = peft_trainer.train()


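| Step | Training Loss | Validation Loss |
|------|---------------|-----------------|
| 25   | 1.4229        | 1.186716        |
| 50   | 1.0454        | 1.13748         |
| 75   | 1.2082        | 1.078773        |
| 100  | 1.0758        | 1.077345        |
| 125  | 1.2394        | 1.051057        |
| 150  | 0.9417        | 1.045796        |
| 175  | 1.1743        | 1.035365        |
| 200  | 0.885         | 1.053288        |
| 225  | 1.1717        | 1.023783        |
| 250  | 0.995         | 1.012308        |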

<|begin_of_text|><|start_header_id|>system<|end_header_id|>
You are an expert on summarizing conversations considering a particular topic. 
The user request will contain the topic and the conversation
Answer with the summary only. Do not explain your answer
<|eot_id|>

<|start_header_id|>user<|end_header_id|>
Topic: depression
Conversation: #Person

## Save the adapter

In [None]:
peft_model.save_pretrained(f"{output_dir}/best_model")

The model adapter is saved inside the `best_model` folder

In [None]:
!ls -lah "{output_dir}/best_model"

total 19M
drwxr-xr-x 2 root root 4.0K Mar 28 09:19 .
drwxr-xr-x 4 root root 4.0K Mar 28 09:19 ..
-rw-r--r-- 1 root root  815 Mar 28 09:19 adapter_config.json
-rw-r--r-- 1 root root  19M Mar 28 09:19 adapter_model.safetensors
-rw-r--r-- 1 root root 5.0K Mar 28 09:19 README.md


Let's now jump to the [evaluation notebook](https://drive.google.com/file/d/1q3rR9-JsKaeWi9kCL0OAlERt76dlsWW5/view?usp=sharing) to load and test the trained model