# QLoRA Fine-Tuning: Llama 3.2 on SAMSum

This notebook demonstrates how to fine-tune a Llama 3.2 model for dialogue summarization using **QLoRA (Quantized Low-Rank Adaptation)**.

**Key Techniques:**
- 4-bit quantization to reduce memory usage
- LoRA adapters to make fine-tuning efficient
- Custom preprocessing with assistant-only masking
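
As a rough illustration of why 4-bit quantization matters, the sketch below estimates weight memory at 16-bit and 4-bit precision. The parameter counts are taken from the `trainable params` line logged later in this notebook. The estimate deliberately ignores quantization constants, layers that stay in higher precision, activations, and optimizer state, so treat it as a back-of-the-envelope figure rather than a measurement.

```python
# Back-of-the-envelope weight-memory estimate (ignores quantization constants,
# unquantized layers, activations, and optimizer state).
ALL_PARAMS = 1_237_518_336     # "all params" from the logged output
TRAINABLE_PARAMS = 1_703_936   # LoRA adapter parameters from the same line
base_params = ALL_PARAMS - TRAINABLE_PARAMS


def weight_gib(n_params: int, bits_per_param: float) -> float:
    """Return the storage needed for n_params weights, in GiB."""
    return n_params * bits_per_param / 8 / 1024**3


mem_16bit = weight_gib(base_params, 16)
mem_4bit = weight_gib(base_params, 4)
reduction = 1 - mem_4bit / mem_16bit

print(f"16-bit weights: {mem_16bit:.2f} GiB")
print(f" 4-bit weights: {mem_4bit:.2f} GiB")
print(f"Reduction:      {reduction:.0%}")
```

The reduction applies to the frozen base weights only; the LoRA adapters and their optimizer state are kept in higher precision but are small by comparison.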


## 1. Setup: Install Dependencies

Install required packages for QLoRA training including transformers, PEFT (for LoRA), and bitsandbytes (for 4-bit quantization).


In [98]:
! pip install -q evaluate torch tqdm datasets peft transformers rouge_score pyyaml accelerate
! pip install -q -U bitsandbytes

## 2. Import Libraries

Import necessary libraries for:
- Model loading and training (transformers)
- QLoRA implementation (PEFT, bitsandbytes)
- Dataset handling (datasets)


In [None]:
import os

import torch
import yaml
from datasets import load_dataset, load_from_disk
from peft import LoraConfig, PeftModel, get_peft_model, prepare_model_for_kbit_training
from torch.utils.data import DataLoader
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
    Trainer,
    TrainingArguments,
)


## 3. Paths and Directories

Define where datasets, training outputs, and the YAML configuration file live, and create the dataset and output directories if they do not already exist.


In [100]:
DATASETS_DIR = "./datasets"
OUTPUTS_DIR = "./outputs"
CONFIG_FILE_PATH = "./config.yaml"

os.makedirs(DATASETS_DIR, exist_ok=True)
os.makedirs(OUTPUTS_DIR, exist_ok=True)

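### Configuration, Model, and Dataset Helpers

These helpers load the YAML configuration, set up the model and tokenizer, and load the dataset. `setup_model_and_tokenizer` can load the base model in **4-bit quantization** (cutting weight memory by roughly 75% relative to 16-bit weights) and attach **LoRA adapters** so that only about 0.1% of the parameters are trained. `load_and_prepare_dataset` caches the full dataset locally and selects a subset of each split per run.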


In [101]:
def load_config(config_path: str = CONFIG_FILE_PATH):
    """
    Load and parse a YAML configuration file.

    Args:
        config_path (str): Path to the config file.

    Returns:
        dict: Parsed configuration dictionary.
    """
    with open(config_path, "r", encoding="utf-8") as f:
        cfg = yaml.safe_load(f)
    return cfg


def setup_model_and_tokenizer(cfg, use_4bit: bool = None, use_lora: bool = None):
    """
    Load model, tokenizer, and apply quantization + LoRA config if specified.

    Args:
        cfg (dict): Configuration dictionary containing:
            - base_model
            - quantization parameters
            - lora parameters (optional)
            - bf16 or fp16 precision
        use_4bit (bool, optional): Override whether to load in 4-bit mode.
        use_lora (bool, optional): Override whether to apply LoRA adapters.

    Returns:
        tuple: (model, tokenizer)
    """
    model_name = cfg["base_model"]
    print(f"\nLoading model: {model_name}")

    # ------------------------------
    # Tokenizer setup
    # ------------------------------
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    tokenizer.pad_token = tokenizer.eos_token
    tokenizer.padding_side = "right"

    # Determine quantization + LoRA usage
    load_in_4bit = use_4bit if use_4bit is not None else cfg.get("load_in_4bit", False)
    apply_lora = use_lora if use_lora is not None else ("lora_r" in cfg)

    # ------------------------------
    # Quantization setup (optional)
    # ------------------------------
    quant_cfg = None
    if load_in_4bit:
        print("⚙️  Enabling 4-bit quantization...")
        quant_cfg = BitsAndBytesConfig(
            load_in_4bit=True,
            bnb_4bit_quant_type=cfg.get("bnb_4bit_quant_type", "nf4"),
            bnb_4bit_use_double_quant=cfg.get("bnb_4bit_use_double_quant", True),
            bnb_4bit_compute_dtype=getattr(
                torch, cfg.get("bnb_4bit_compute_dtype", "bfloat16")
            ),
        )
    else:
        print("⚙️  Loading model in full precision (no quantization).")

    # ------------------------------
    # Model loading
    # ------------------------------
    model = AutoModelForCausalLM.from_pretrained(
        model_name,
        quantization_config=quant_cfg,
        device_map="auto",
        dtype=(
            torch.bfloat16
            if cfg.get("bf16", True) and torch.cuda.is_available()
            else torch.float32
        ),
    )

    # ------------------------------
    # LoRA setup (optional)
    # ------------------------------
    if apply_lora:
        print("🔧 Applying LoRA configuration...")
        model = prepare_model_for_kbit_training(model)
        lora_cfg = LoraConfig(
            r=cfg.get("lora_r", 8),
            lora_alpha=cfg.get("lora_alpha", 16),
            target_modules=cfg.get("target_modules", ["q_proj", "v_proj"]),
            lora_dropout=cfg.get("lora_dropout", 0.05),
            bias="none",
            task_type="CAUSAL_LM",
        )
        model = get_peft_model(model, lora_cfg)
        model.print_trainable_parameters()
    else:
        print("🔹 Skipping LoRA setup; using base model only.")

    return model, tokenizer


def select_subset(dataset, n_samples, seed=42):
    """
    Select a subset of the dataset.
    If n_samples is "all" or None, return the entire dataset.
    Otherwise, sample n_samples examples.
    """
    if n_samples == "all" or n_samples is None:
        return dataset

    if n_samples > len(dataset):
        print(f"⚠️  Requested {n_samples} samples but only {len(dataset)} available. Using all samples.")
        return dataset

    return dataset.shuffle(seed=seed).select(range(n_samples))


def load_and_prepare_dataset(cfg):
    """
    Load dataset splits according to configuration.
    Ensures the FULL dataset is cached, and subsets are selected per run.
    Supports both new-style ("dataset": {"splits": {...}}) and old-style (top-level keys) configs.
    """
    # -----------------------------------------------------------------------
    # Extract dataset configuration
    # -----------------------------------------------------------------------
    if "dataset" in cfg:
        cfg_dataset = cfg["dataset"]
        dataset_name = cfg_dataset["name"]
        splits_cfg = cfg_dataset.get("splits", {})
        n_train = splits_cfg.get("train", "all")
        n_val = splits_cfg.get("validation", "all")
        n_test = splits_cfg.get("test", "all")
        seed = cfg_dataset.get("seed", 42)
    elif "datasets" in cfg and isinstance(cfg["datasets"], list):
        cfg_dataset = cfg["datasets"][0]
        dataset_name = cfg_dataset["path"]
        n_train = cfg.get("train_samples", "all")
        n_val = cfg.get("val_samples", "all")
        n_test = cfg.get("test_samples", "all")
        seed = cfg.get("seed", 42)
    else:
        raise KeyError("Dataset configuration not found. Expected 'dataset' or 'datasets' key.")

    # -----------------------------------------------------------------------
    # Load or download full dataset
    # -----------------------------------------------------------------------
    os.makedirs(DATASETS_DIR, exist_ok=True)
    local_path = os.path.join(DATASETS_DIR, dataset_name.replace("/", "_"))

    if os.path.exists(local_path):
        print(f"📂 Loading dataset from local cache: {local_path}")
        dataset = load_from_disk(local_path)
    else:
        print(f"⬇️  Downloading dataset from Hugging Face: {dataset_name}")
        dataset = load_dataset(dataset_name)
        dataset.save_to_disk(local_path)
        print(f"✅ Full dataset saved locally to: {local_path}")

    # -----------------------------------------------------------------------
    # Handle variations in split keys and select subsets dynamically
    # -----------------------------------------------------------------------
    val_key = "validation" if "validation" in dataset else "val"

    train = select_subset(dataset["train"], n_train, seed=seed)
    val = select_subset(dataset[val_key], n_val, seed=seed)
    test = select_subset(dataset["test"], n_test, seed=seed)

    print(f"📊 Loaded {len(train)} train / {len(val)} val / {len(test)} test samples (from full cache).")
    return train, val, test


## 4. Prompt Construction

These functions format dialogue-summary pairs into chat messages compatible with the Llama 3.2 chat template. The user turn carries the task instruction and the dialogue; the assistant turn, included only when building training examples, carries the reference summary.


In [102]:
def build_user_prompt(dialogue: str, task_instruction: str) -> str:
    """Construct a summarization-style prompt given a dialogue and instruction."""
    return f"{task_instruction}\n\n## Dialogue:\n{dialogue}\n## Summary:"


def build_messages_for_sample(sample, task_instruction, include_assistant=False):
    """
    Build a chat-style message list for a given sample, compatible with
    models that use chat templates (like Llama 3).
    """
    messages = [
        {
            "role": "user",
            "content": build_user_prompt(sample["dialogue"], task_instruction),
        }
    ]
    if include_assistant:
        messages.append({"role": "assistant", "content": sample["summary"]})
    return messages
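
To make the message layout concrete, the sketch below repeats the two helpers so it runs on its own and applies them to a made-up sample. The dialogue, summary, and instruction strings are hypothetical and are not taken from SAMSum or from `config.yaml`.

```python
def build_user_prompt(dialogue: str, task_instruction: str) -> str:
    """Same prompt layout as the helper defined in this notebook."""
    return f"{task_instruction}\n\n## Dialogue:\n{dialogue}\n## Summary:"


def build_messages_for_sample(sample, task_instruction, include_assistant=False):
    """Return a chat-style message list; the assistant turn is optional."""
    messages = [
        {
            "role": "user",
            "content": build_user_prompt(sample["dialogue"], task_instruction),
        }
    ]
    if include_assistant:
        messages.append({"role": "assistant", "content": sample["summary"]})
    return messages


# Hypothetical sample and instruction, for illustration only
sample = {
    "dialogue": "Ana: Lunch at noon?\nBen: Sure, see you then.",
    "summary": "Ana and Ben will have lunch at noon.",
}
instruction = "Summarize the following conversation."

train_msgs = build_messages_for_sample(sample, instruction, include_assistant=True)
infer_msgs = build_messages_for_sample(sample, instruction)

print(train_msgs[0]["content"])
print([m["role"] for m in train_msgs])   # user turn followed by assistant turn
print([m["role"] for m in infer_msgs])   # user turn only
```

The training variant ends with the reference summary, while the inference variant stops after the user turn so the model generates the summary itself.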


## 5. Tokenization with Assistant-Only Masking

This function tokenizes each dialogue-summary pair and applies **label masking**: prompt tokens receive the label `-100`, so the loss is computed only on the assistant response (the summary). This prevents the model from being trained to reproduce the prompt itself.


In [103]:
def preprocess_samples(examples, tokenizer, task_instruction, max_length):
    """Tokenize dialogues and apply assistant-only masking for causal LM."""
    input_ids_list, labels_list, attn_masks = [], [], []

    for d, s in zip(examples["dialogue"], examples["summary"]):
        sample = {"dialogue": d, "summary": s}

        # Build chat-style text

        msgs_full = build_messages_for_sample(
            sample, task_instruction, include_assistant=True
        )
        msgs_prompt = build_messages_for_sample(
            sample, task_instruction, include_assistant=False
        )

        text_full = tokenizer.apply_chat_template(
            msgs_full, tokenize=False, add_generation_prompt=False
        )
        text_prompt = tokenizer.apply_chat_template(
            msgs_prompt, tokenize=False, add_generation_prompt=True
        )
        prompt_len = len(text_prompt)

        tokens = tokenizer(
            text_full,
            max_length=max_length,
            truncation=True,
            padding=False,
            add_special_tokens=False,
            return_offsets_mapping=True,
        )

        # Mask non-assistant tokens
        start_idx = len(tokens["input_ids"])
        for i, (start, _) in enumerate(tokens["offset_mapping"]):
            if start >= prompt_len:
                start_idx = i
                break

        labels = [-100] * start_idx + tokens["input_ids"][start_idx:]
        input_ids_list.append(tokens["input_ids"])
        labels_list.append(labels)
        attn_masks.append(tokens["attention_mask"])

    return {
        "input_ids": input_ids_list,
        "labels": labels_list,
        "attention_mask": attn_masks,
    }
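
The offset-based masking in `preprocess_samples` can be illustrated without a real tokenizer. The sketch below uses a hypothetical whitespace tokenizer (`toy_tokenize`) that returns character offsets, then masks every token that starts before the end of the prompt text. It keeps tokens as strings for readability, whereas the notebook works with token IDs.

```python
import re

IGNORE_INDEX = -100  # label value that the loss function skips


def toy_tokenize(text):
    """Hypothetical tokenizer: one token per whitespace-separated word, with offsets."""
    tokens, offsets = [], []
    for match in re.finditer(r"\S+", text):
        tokens.append(match.group())
        offsets.append((match.start(), match.end()))
    return tokens, offsets


def mask_prompt(tokens, offsets, prompt_len):
    """Replace every token that starts inside the prompt with IGNORE_INDEX."""
    start_idx = len(tokens)  # if no token starts after the prompt, mask everything
    for i, (start, _) in enumerate(offsets):
        if start >= prompt_len:
            start_idx = i
            break
    return [IGNORE_INDEX] * start_idx + tokens[start_idx:]


prompt = "Summarize: A: hi B: hello ## Summary: "
full_text = prompt + "A greets B."

tokens, offsets = toy_tokenize(full_text)
labels = mask_prompt(tokens, offsets, len(prompt))
print(labels)
```

This approach relies on the full text beginning with exactly the prompt text, which is why the notebook renders the prompt with `add_generation_prompt=True` before measuring its length.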

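## 6. Collation, Tokenization, and Training Setup

This cell defines:
1. `PaddingCollator`, which pads each batch to its longest sequence
2. `tokenize_dataset`, which applies the preprocessing to the train and validation splits
3. `push_to_hub`, which uploads the LoRA adapters and the merged model
4. `train_model`, which configures the `Trainer` for QLoRA with cosine learning rate scheduling, gradient accumulation, bf16 mixed precision, and an 8-bit paged optimizer for memory efficiency

**QLoRA = Quantization + LoRA** for extremely efficient fine-tuning!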


In [104]:

from torch.nn.utils.rnn import pad_sequence

class PaddingCollator:
    """Pad input IDs, attention masks, and labels to the longest sequence in each batch."""

    def __init__(self, tokenizer, label_pad_token_id=-100):
        self.tokenizer = tokenizer
        self.label_pad_token_id = label_pad_token_id

    def __call__(self, batch):
        # Convert lists to tensors
        input_ids = [torch.tensor(f["input_ids"], dtype=torch.long) for f in batch]
        attn_masks = [torch.tensor(f["attention_mask"], dtype=torch.long) for f in batch]
        labels = [torch.tensor(f["labels"], dtype=torch.long) for f in batch]

        # Pad to the max length in this batch
        input_ids = pad_sequence(input_ids, batch_first=True, padding_value=self.tokenizer.pad_token_id)
        attn_masks = pad_sequence(attn_masks, batch_first=True, padding_value=0)
        labels = pad_sequence(labels, batch_first=True, padding_value=self.label_pad_token_id)

        return {
            "input_ids": input_ids,
            "attention_mask": attn_masks,
            "labels": labels,
        }


def tokenize_dataset(cfg, tokenizer, train_data, val_data):
    """Tokenize the train and validation splits with assistant-only label masking."""
    task_instruction = cfg["task_instruction"]

    print("\nTokenizing datasets...")
    tokenized_train = train_data.map(
        lambda e: preprocess_samples(
            e, tokenizer, task_instruction, cfg["sequence_len"]
        ),
        batched=True,
        remove_columns=train_data.column_names,
    )
    tokenized_val = val_data.map(
        lambda e: preprocess_samples(
            e, tokenizer, task_instruction, cfg["sequence_len"]
        ),
        batched=True,
        remove_columns=val_data.column_names,
    )

    return tokenized_train, tokenized_val

def push_to_hub(
    model: PeftModel, tokenizer: AutoTokenizer, model_name: str, hf_username: str
):
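    """
    Push the LoRA adapters, the merged model, and the tokenizer to the Hugging Face Hub.

    The adapters are pushed to `{hf_username}/{model_name}-adapters`; the merged
    model and the tokenizer are pushed to `{hf_username}/{model_name}`.
    `merge_and_unload()` returns the base model with the adapter weights merged in.
    """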
    model_id = f"{hf_username}/{model_name}"
    try:
        model.push_to_hub(f"{model_id}-adapters", private=False)

        merged_model = model.merge_and_unload()
        merged_model.push_to_hub(model_id, private=False)

        tokenizer.push_to_hub(model_id)
        print(f"Adapters successfully pushed to: https://huggingface.co/{model_id}")
    except Exception as e:
        print(f"Error pushing to Hugging Face: {e}")
        print("Make sure you're logged in with: huggingface-cli login")


def train_model(cfg, model, tokenizer, tokenized_train, tokenized_val):
    """Configure the Trainer, run LoRA fine-tuning, and save the adapters and tokenizer."""
    collator = PaddingCollator(tokenizer=tokenizer)

    output_dir = os.path.join(OUTPUTS_DIR, "lora_samsum")
    os.makedirs(output_dir, exist_ok=True)

    args = TrainingArguments(
        output_dir=output_dir,
        num_train_epochs=cfg["num_epochs"],
        # A positive max_steps takes precedence over num_train_epochs
        max_steps=cfg.get("max_steps", 500),
        per_device_train_batch_size=cfg["batch_size"],
        per_device_eval_batch_size=cfg["batch_size"],
        gradient_accumulation_steps=cfg["gradient_accumulation_steps"],
        learning_rate=float(cfg["learning_rate"]),
        lr_scheduler_type=cfg.get("lr_scheduler", "cosine"),
        warmup_steps=cfg.get("warmup_steps", 100),
        bf16=cfg.get("bf16", True),
        optim=cfg.get("optim", "paged_adamw_8bit"),
        # eval_steps is not set, so evaluation runs every logging_steps steps
        eval_strategy="steps",
        save_strategy="steps",
        logging_steps=cfg.get("logging_steps", 25),
        save_total_limit=cfg.get("save_total_limit", 2),
        max_grad_norm=cfg.get("max_grad_norm", 1.0),
        report_to="none",
    )

    trainer = Trainer(
        model=model,
        args=args,
        train_dataset=tokenized_train,
        eval_dataset=tokenized_val,
        data_collator=collator,
    )

    print("\nStarting LoRA fine-tuning...")
    trainer.train()
    print("\nTraining complete!")

    save_dir = os.path.join(output_dir, "lora_adapters")
    model.save_pretrained(save_dir)
    tokenizer.save_pretrained(save_dir)
    print(f"Saved LoRA adapters to {save_dir}")
    return model
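
For intuition about the `cosine` scheduler and gradient accumulation settings passed to `TrainingArguments`, here is a minimal sketch of a linear-warmup, cosine-decay schedule and of the effective batch size. The function `cosine_lr_with_warmup` is a hypothetical re-implementation written for this example, and all numeric values are illustrative rather than read from `config.yaml`.

```python
import math


def cosine_lr_with_warmup(step, peak_lr, warmup_steps, total_steps):
    """Linear warmup to peak_lr, then half-cosine decay to zero."""
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


# Hypothetical hyperparameters, for illustration only
peak_lr, warmup_steps, total_steps = 2e-4, 100, 500
batch_size, grad_accum, n_devices = 4, 4, 1

# Number of samples that contribute to each optimizer update
effective_batch = batch_size * grad_accum * n_devices
print(f"Effective batch size: {effective_batch}")

for step in (0, 50, 100, 300, 500):
    lr = cosine_lr_with_warmup(step, peak_lr, warmup_steps, total_steps)
    print(f"step {step:>3}: lr = {lr:.2e}")
```

Gradient accumulation trades wall-clock time for memory: the per-device batch stays small enough to fit on the GPU while the optimizer still sees a larger effective batch.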


## 7. Load Configuration and Initialize

Load hyperparameters from `config.yaml`, initialize the model with QLoRA, load the dataset, and tokenize the training and validation samples.
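
The contents of `config.yaml` are not shown in this notebook. As a hedged sketch, the dictionary below mirrors what `yaml.safe_load` might return for a config that satisfies the helper functions above. The model and dataset names match the logged output; every other value is a hypothetical placeholder, not the author's actual setting.

```python
# Hypothetical stand-in for the dict parsed from config.yaml.
example_cfg = {
    "base_model": "meta-llama/Llama-3.2-1B-Instruct",  # from the logged output
    "dataset": {
        "name": "knkarthick/samsum",                   # implied by the cache path
        "splits": {"train": "all", "validation": 200, "test": 200},
        "seed": 42,
    },
    # Everything below is an illustrative placeholder
    "task_instruction": "Summarize the following conversation.",
    "sequence_len": 512,
    "load_in_4bit": True,
    "bnb_4bit_quant_type": "nf4",
    "bnb_4bit_use_double_quant": True,
    "bnb_4bit_compute_dtype": "bfloat16",
    "bf16": True,
    "lora_r": 16,
    "lora_alpha": 32,
    "lora_dropout": 0.05,
    "target_modules": ["q_proj", "v_proj"],
    "num_epochs": 1,
    "max_steps": 250,
    "batch_size": 4,
    "gradient_accumulation_steps": 4,
    "learning_rate": 2e-4,
    "lr_scheduler": "cosine",
    "warmup_steps": 100,
    "optim": "paged_adamw_8bit",
    "logging_steps": 25,
}

# Keys that the notebook reads with cfg["..."] and that therefore must exist
REQUIRED_KEYS = {
    "base_model",
    "task_instruction",
    "sequence_len",
    "num_epochs",
    "batch_size",
    "gradient_accumulation_steps",
    "learning_rate",
}

missing = REQUIRED_KEYS - example_cfg.keys()
print("Missing required keys:", sorted(missing))
```

Keys read with `cfg.get(...)` fall back to the defaults written in the code, so only the required keys and a `dataset` (or `datasets`) entry need to be present.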


In [105]:
cfg = load_config()
model, tokenizer = setup_model_and_tokenizer(cfg, use_4bit=True, use_lora=True)
train_data, val_data, _ = load_and_prepare_dataset(cfg)

tokenized_train, tokenized_val = tokenize_dataset(cfg, tokenizer, train_data, val_data)



Loading model: meta-llama/Llama-3.2-1B-Instruct
⚙️  Enabling 4-bit quantization...
🔧 Applying LoRA configuration...
trainable params: 1,703,936 || all params: 1,237,518,336 || trainable%: 0.1377
📂 Loading dataset from local cache: ./datasets/knkarthick_samsum
📊 Loaded 14731 train / 200 val / 200 test samples (from full cache).

Tokenizing datasets...
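
The logged `trainable params: 1,703,936` can be sanity-checked by hand. A LoRA adapter on a linear layer adds two matrices, `A` (r × in_features) and `B` (out_features × r). The sketch below assumes the Llama 3.2 1B attention dimensions listed in the comments, the default `q_proj` and `v_proj` targets, and a rank of 16; none of these values is stated in this notebook, and the rank is chosen because it reproduces the logged count.

```python
# Assumed Llama 3.2 1B dimensions (not stated in this notebook)
HIDDEN_SIZE = 2048   # q_proj maps hidden -> hidden
KV_DIM = 512         # v_proj output: 8 key/value heads * head dim of 64
NUM_LAYERS = 16
LORA_R = 16          # assumed rank; r=8 would give half the logged count


def lora_params(in_features: int, out_features: int, r: int) -> int:
    """Parameters added by one LoRA adapter: A is (r x in), B is (out x r)."""
    return r * (in_features + out_features)


per_layer = (
    lora_params(HIDDEN_SIZE, HIDDEN_SIZE, LORA_R)   # q_proj
    + lora_params(HIDDEN_SIZE, KV_DIM, LORA_R)      # v_proj
)
total = per_layer * NUM_LAYERS

print(f"LoRA parameters: {total:,}")
print(f"Share of all params: {total / 1_237_518_336:.4%}")
```

Under these assumptions the count matches the logged figure, which suggests the run used rank 16 on the two default projection layers.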



## 8. Inspect Tokenized Data (Optional)

Verify that:
- Input IDs are correctly tokenized
- Labels are masked with `-100` for the prompt (only assistant tokens are trained)
- Attention masks are properly set


In [106]:
input_ids = tokenized_train[0]['input_ids']
labels = tokenized_train[0]['labels']
mask = tokenized_train[0]['attention_mask']

print(f"Input IDs: {input_ids}")
print(f"Labels: {labels}")
print(f"Attention mask: {mask}")


Input IDs: [128000, 128006, 9125, 128007, 271, 38766, 1303, 33025, 2696, 25, 6790, 220, 2366, 18, 198, 15724, 2696, 25, 220, 605, 4723, 220, 2366, 20, 271, 128009, 128006, 882, 128007, 271, 2675, 527, 264, 11190, 18328, 889, 14238, 64694, 11, 61001, 70022, 315, 21633, 13, 8279, 5730, 553, 279, 2768, 10652, 1139, 264, 3254, 11914, 4286, 567, 70589, 512, 32, 36645, 25, 358, 41778, 220, 8443, 13, 3234, 499, 1390, 1063, 5380, 90757, 25, 23371, 4999, 32, 36645, 25, 358, 3358, 4546, 499, 16986, 21629, 340, 567, 22241, 25, 128009, 128006, 78191, 128007, 271, 32, 36645, 41778, 8443, 323, 690, 4546, 29808, 1063, 16986, 13, 128009]
Labels: [-100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -

In [107]:
collator = PaddingCollator(tokenizer)
loader = DataLoader(tokenized_train, collate_fn=collator, batch_size=4)

batch = next(iter(loader))
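
What the collator does to a batch can be shown with plain lists. The sketch below is a simplified analogue of `PaddingCollator` that skips tensors entirely; `pad_batch`, the pad token ID of 0, and the toy sequences are all hypothetical.

```python
IGNORE_INDEX = -100
PAD_TOKEN_ID = 0  # hypothetical; the notebook uses tokenizer.pad_token_id


def pad_batch(batch, pad_token_id=PAD_TOKEN_ID, label_pad_id=IGNORE_INDEX):
    """Right-pad every field to the longest sequence in the batch."""
    max_len = max(len(f["input_ids"]) for f in batch)

    def pad(seq, value):
        return seq + [value] * (max_len - len(seq))

    return {
        "input_ids": [pad(f["input_ids"], pad_token_id) for f in batch],
        "attention_mask": [pad(f["attention_mask"], 0) for f in batch],
        "labels": [pad(f["labels"], label_pad_id) for f in batch],
    }


toy_batch = [
    {"input_ids": [5, 6, 7, 8], "attention_mask": [1, 1, 1, 1], "labels": [-100, -100, 7, 8]},
    {"input_ids": [5, 9], "attention_mask": [1, 1], "labels": [-100, 9]},
]

padded = pad_batch(toy_batch)
for key, rows in padded.items():
    print(key, rows)
```

Padding labels with `-100` rather than the pad token ID matters here because the pad token is set to the EOS token, and the model should still learn to emit EOS at the end of a summary.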

In [108]:
first_example = tokenized_train[0]
input_ids = first_example["input_ids"]
labels = first_example["labels"]

masked_tokens = sum(1 for label in labels if label == -100)
total_tokens = len(labels)

print(f"Total tokens: {total_tokens}")
print(f"Masked tokens: {masked_tokens}")
print(f"Training tokens: {total_tokens - masked_tokens}")
print(f"Mask ratio: {masked_tokens/total_tokens:.2%}")

print(f"Input IDs: {input_ids}")
print(f"Labels: {labels}")
print(f"Attention mask: {first_example['attention_mask']}")

Total tokens: 105
Masked tokens: 93
Training tokens: 12
Mask ratio: 88.57%
Input IDs: [128000, 128006, 9125, 128007, 271, 38766, 1303, 33025, 2696, 25, 6790, 220, 2366, 18, 198, 15724, 2696, 25, 220, 605, 4723, 220, 2366, 20, 271, 128009, 128006, 882, 128007, 271, 2675, 527, 264, 11190, 18328, 889, 14238, 64694, 11, 61001, 70022, 315, 21633, 13, 8279, 5730, 553, 279, 2768, 10652, 1139, 264, 3254, 11914, 4286, 567, 70589, 512, 32, 36645, 25, 358, 41778, 220, 8443, 13, 3234, 499, 1390, 1063, 5380, 90757, 25, 23371, 4999, 32, 36645, 25, 358, 3358, 4546, 499, 16986, 21629, 340, 567, 22241, 25, 128009, 128006, 78191, 128007, 271, 32, 36645, 41778, 8443, 323, 690, 4546, 29808, 1063, 16986, 13, 128009]
Labels: [-100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100

## 9. Start Training! 🚀

Begin QLoRA fine-tuning. This will:
- Train only the LoRA adapter weights (~1.7M parameters)
- Save checkpoints periodically
- Log training metrics (if logging is enabled)

**Note:** Training time depends on your hardware and dataset size.


In [109]:
model = train_model(cfg, model, tokenizer, tokenized_train, tokenized_val)



Starting LoRA fine-tuning...




| Step | Training Loss | Validation Loss |
| --- | --- | --- |
| 25 | 2.0095 | 1.572561 |
| 50 | 1.4729 | 1.421943 |
| 75 | 1.4062 | 1.382779 |
| 100 | 1.4165 | 1.364123 |
| 125 | 1.4087 | 1.352062 |
| 150 | 1.3335 | 1.341762 |
| 175 | 1.3465 | 1.333588 |
| 200 | 1.3666 | 1.328513 |
| 225 | 1.3394 | 1.323102 |
| 250 | 1.3816 | 1.322351 |



Training complete!
Saved LoRA adapters to ./outputs/lora_samsum/lora_adapters


## 10. Deploy to Hugging Face Hub (Optional)

Push your fine-tuned model to Hugging Face Hub to:
- Share with the community
- Use it in inference pipelines
- Version control your models

This uploads both the LoRA adapters and the merged model.


In [110]:
from google.colab import userdata
HF_USERNAME = userdata.get('HF_USERNAME')
push_to_hub(model, tokenizer, "Llama-3.2-1B-QLoRA-Summarizer", HF_USERNAME)

Adapters successfully pushed to: https://huggingface.co/moo3030/Llama-3.2-1B-QLoRA-Summarizer


## ✅ Training Complete!

**Next Steps:**
1. Evaluate your model on the test set
2. Compare performance with the baseline model
3. Experiment with different hyperparameters (learning rate, LoRA rank, etc.)
4. Try the model on new dialogues
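
As a starting point for the evaluation step, the sketch below computes a simplified ROUGE-1 F1 score from unigram overlap. It is a hypothetical, dependency-free approximation (lowercasing and whitespace splitting only, no stemming), not the implementation in the `rouge_score` package installed earlier, so its numbers may differ from that package's. The reference and candidate strings are made up.

```python
from collections import Counter


def rouge1_f1(reference: str, candidate: str) -> float:
    """Simplified ROUGE-1 F1: clipped unigram overlap on whitespace tokens."""
    ref_counts = Counter(reference.lower().split())
    cand_counts = Counter(candidate.lower().split())
    # Counter intersection keeps the minimum count of each shared token
    overlap = sum((ref_counts & cand_counts).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)


# Hypothetical reference and model output
reference = "amanda baked cookies and will bring jerry some tomorrow"
candidate = "amanda will bring jerry cookies tomorrow"

score = rouge1_f1(reference, candidate)
print(f"ROUGE-1 F1: {score:.3f}")
```

Running the same metric on summaries from the base model and from the fine-tuned model, over the held-out test split, gives a simple baseline comparison.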

**Key Takeaways:**
- QLoRA enables fine-tuning large models on consumer GPUs
- Assistant-only masking restricts the loss to the summary tokens, so the model learns to write summaries rather than to reproduce prompts
- LoRA adapters are lightweight and easy to share/deploy
