# Evaluating QLoRA Fine-Tuned Models

This notebook demonstrates how to evaluate a QLoRA fine-tuned model on dialogue summarization using the SAMSum dataset.

**What we'll do:**
- Load a fine-tuned model with LoRA adapters from Hugging Face Hub
- Generate predictions on the validation set
- Compute ROUGE metrics to measure performance
- Save results and predictions for analysis


## 1. Setup: Install Dependencies

Install the required packages for loading and evaluating QLoRA models, including PEFT for LoRA adapters and evaluation metrics.


In [8]:
! pip install -q torch transformers peft evaluate rouge_score
! pip install bitsandbytes -U



## 2. Import Libraries

Import necessary libraries for:
- Model loading (transformers, PEFT)
- Evaluation metrics (evaluate, ROUGE)
- Data handling and utilities


In [9]:
import os
import yaml
import json
import torch
import evaluate
from datasets import load_dataset, load_from_disk
from tqdm import tqdm
from peft import PeftModel, get_peft_model, LoraConfig, prepare_model_for_kbit_training
from transformers import pipeline, AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig, DataCollatorWithPadding



## 3. Configure Paths

Set up directories for:
- Configuration files
- Output storage (predictions and metrics)
- Dataset caching


In [10]:
CONFIG_FILE_PATH = './config.yaml'
OUTPUTS_DIR = './outputs'
DATASETS_DIR = './datasets'
os.makedirs(OUTPUTS_DIR, exist_ok=True)
os.makedirs(DATASETS_DIR, exist_ok=True)
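
The functions later in this notebook read a handful of keys from `config.yaml`. As a rough guide, the parsed configuration might look like the dictionary below. The key names are the ones the code looks up, and the model name, dataset name, and split sizes match the run output in Section 7. The remaining values are illustrative assumptions, and `check_config` is a hypothetical helper that is not part of the notebook.

```python
# Parsed form of a hypothetical config.yaml (roughly what load_config() would return).
example_cfg = {
    "base_model": "meta-llama/Llama-3.2-1B-Instruct",
    "load_in_4bit": True,
    "bnb_4bit_quant_type": "nf4",
    "bnb_4bit_use_double_quant": True,
    "bnb_4bit_compute_dtype": "bfloat16",
    "bf16": True,
    "dataset": {
        "name": "knkarthick/samsum",
        "splits": {"train": "all", "validation": 200, "test": 200},
        "seed": 42,
    },
    # Assumed wording: use the same instruction that was used for training.
    "task_instruction": "Summarize the following dialogue.",
    "eval_batch_size": 4,
}

# Keys the notebook indexes directly (cfg["..."]) rather than via cfg.get(...).
REQUIRED_KEYS = ("base_model", "task_instruction")


def check_config(cfg: dict) -> list[str]:
    """Return a list of problems found in the config (empty if none)."""
    problems = [f"missing key: {key}" for key in REQUIRED_KEYS if key not in cfg]
    if "dataset" not in cfg and "datasets" not in cfg:
        problems.append("missing key: dataset (or datasets)")
    return problems


print(check_config(example_cfg))  # prints []
```

Running a check like this before loading the model catches a missing key in seconds instead of after the weights have been downloaded.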



## 4. Utility Functions

Core functions for:
- **Loading configuration** from YAML
- **Model setup** with optional 4-bit quantization (a fresh LoRA configuration is applied only on request; for evaluation we attach already-trained adapters instead)
- **Dataset loading** with subset selection
- **Data preparation** for evaluation


In [11]:
def load_config(config_path: str = CONFIG_FILE_PATH):
    """
    Load and parse a YAML configuration file.

    Args:
        config_path (str): Path to the config file.

    Returns:
        dict: Parsed configuration dictionary.
    """
    with open(config_path, "r", encoding="utf-8") as f:
        cfg = yaml.safe_load(f)
    return cfg


def setup_model_and_tokenizer(cfg, use_4bit: bool = None, use_lora: bool = None):
    """
    Load model, tokenizer, and apply quantization + LoRA config if specified.

    Args:
        cfg (dict): Configuration dictionary containing:
            - base_model
            - quantization parameters
            - lora parameters (optional)
            - bf16 or fp16 precision
        use_4bit (bool, optional): Override whether to load in 4-bit mode.
        use_lora (bool, optional): Override whether to apply LoRA adapters.

    Returns:
        tuple: (model, tokenizer)
    """
    model_name = cfg["base_model"]
    print(f"\nLoading model: {model_name}")

    # ------------------------------
    # Tokenizer setup
    # ------------------------------
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    tokenizer.pad_token = tokenizer.eos_token
    tokenizer.padding_side = "right"

    # Determine quantization + LoRA usage
    load_in_4bit = use_4bit if use_4bit is not None else cfg.get("load_in_4bit", False)
    apply_lora = use_lora if use_lora is not None else ("lora_r" in cfg)

    # ------------------------------
    # Quantization setup (optional)
    # ------------------------------
    quant_cfg = None
    if load_in_4bit:
        print("⚙️  Enabling 4-bit quantization...")
        quant_cfg = BitsAndBytesConfig(
            load_in_4bit=True,
            bnb_4bit_quant_type=cfg.get("bnb_4bit_quant_type", "nf4"),
            bnb_4bit_use_double_quant=cfg.get("bnb_4bit_use_double_quant", True),
            bnb_4bit_compute_dtype=getattr(
                torch, cfg.get("bnb_4bit_compute_dtype", "bfloat16")
            ),
        )
    else:
        print("⚙️  Loading model in full precision (no quantization).")

    # ------------------------------
    # Model loading
    # ------------------------------
    model = AutoModelForCausalLM.from_pretrained(
        model_name,
        quantization_config=quant_cfg,
        device_map="auto",
        dtype=(
            torch.bfloat16
            if cfg.get("bf16", True) and torch.cuda.is_available()
            else torch.float32
        ),
    )

    # ------------------------------
    # LoRA setup (optional)
    # ------------------------------
    if apply_lora:
        print("🔧 Applying LoRA configuration...")
        model = prepare_model_for_kbit_training(model)
        lora_cfg = LoraConfig(
            r=cfg.get("lora_r", 8),
            lora_alpha=cfg.get("lora_alpha", 16),
            target_modules=cfg.get("target_modules", ["q_proj", "v_proj"]),
            lora_dropout=cfg.get("lora_dropout", 0.05),
            bias="none",
            task_type="CAUSAL_LM",
        )
        model = get_peft_model(model, lora_cfg)
        model.print_trainable_parameters()
    else:
        print("🔹 Skipping LoRA setup - using base model only.")

    return model, tokenizer


def select_subset(dataset, n_samples, seed=42):
    """
    Select a subset of the dataset.
    If n_samples is "all" or None, return the entire dataset.
    Otherwise, sample n_samples examples.
    """
    if n_samples == "all" or n_samples is None:
        return dataset

    if n_samples > len(dataset):
        print(f"⚠️  Requested {n_samples} samples but only {len(dataset)} available. Using all samples.")
        return dataset

    return dataset.shuffle(seed=seed).select(range(n_samples))


def load_and_prepare_dataset(cfg):
    """
    Load dataset splits according to configuration.
    Ensures the FULL dataset is cached, and subsets are selected per run.
    Supports both new-style ("dataset": {"splits": {...}}) and old-style (top-level keys) configs.
    """
    # -----------------------------------------------------------------------
    # Extract dataset configuration
    # -----------------------------------------------------------------------
    if "dataset" in cfg:
        cfg_dataset = cfg["dataset"]
        dataset_name = cfg_dataset["name"]
        splits_cfg = cfg_dataset.get("splits", {})
        n_train = splits_cfg.get("train", "all")
        n_val = splits_cfg.get("validation", "all")
        n_test = splits_cfg.get("test", "all")
        seed = cfg_dataset.get("seed", 42)
    elif "datasets" in cfg and isinstance(cfg["datasets"], list):
        cfg_dataset = cfg["datasets"][0]
        dataset_name = cfg_dataset["path"]
        n_train = cfg.get("train_samples", "all")
        n_val = cfg.get("val_samples", "all")
        n_test = cfg.get("test_samples", "all")
        seed = cfg.get("seed", 42)
    else:
        raise KeyError("Dataset configuration not found. Expected 'dataset' or 'datasets' key.")

    # -----------------------------------------------------------------------
    # Load or download full dataset
    # -----------------------------------------------------------------------
    os.makedirs(DATASETS_DIR, exist_ok=True)
    local_path = os.path.join(DATASETS_DIR, dataset_name.replace("/", "_"))

    if os.path.exists(local_path):
        print(f"📂 Loading dataset from local cache: {local_path}")
        dataset = load_from_disk(local_path)
    else:
        print(f"⬇️  Downloading dataset from Hugging Face: {dataset_name}")
        dataset = load_dataset(dataset_name)
        dataset.save_to_disk(local_path)
        print(f"✅ Full dataset saved locally to: {local_path}")

    # -----------------------------------------------------------------------
    # Handle variations in split keys and select subsets dynamically
    # -----------------------------------------------------------------------
    val_key = "validation" if "validation" in dataset else "val"

    train = select_subset(dataset["train"], n_train, seed=seed)
    val = select_subset(dataset[val_key], n_val, seed=seed)
    test = select_subset(dataset["test"], n_test, seed=seed)

    print(f"📊 Loaded {len(train)} train / {len(val)} val / {len(test)} test samples (from full cache).")
    return train, val, test
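
To make the caching and sampling behaviour concrete, here is a minimal pure-Python sketch. `cache_path_for` mirrors the path construction in `load_and_prepare_dataset`, and `pick_subset` imitates the contract of `select_subset` on a plain list. Both names are hypothetical, and `random.Random` is only a stand-in for `Dataset.shuffle`, so the specific examples chosen will differ from what the real function selects.

```python
import os
import random


def cache_path_for(dataset_name: str, datasets_dir: str = "./datasets") -> str:
    """Mirror the local cache path used for a Hub dataset name."""
    return os.path.join(datasets_dir, dataset_name.replace("/", "_"))


def pick_subset(items: list, n_samples, seed: int = 42) -> list:
    """Return everything for "all"/None or oversized requests, else a seeded sample."""
    if n_samples == "all" or n_samples is None or n_samples > len(items):
        return list(items)
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # same seed -> same order on every run
    return shuffled[:n_samples]


print(cache_path_for("knkarthick/samsum"))

data = list(range(10))
print(pick_subset(data, 3) == pick_subset(data, 3))  # True: the selection is reproducible
print(len(pick_subset(data, 50)))                    # 10: falls back to the full list
```

Because the full dataset is cached once and the subset is drawn with a fixed seed, changing the requested sample counts between runs does not require another download.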


## 5. Inference & Evaluation Functions

### Generation Function
Generate summaries using the model in batches with the same prompt template used during training.

### ROUGE Computation
Calculate ROUGE scores (ROUGE-1, ROUGE-2, ROUGE-L) to measure:
- **ROUGE-1**: Unigram overlap (word-level similarity)
- **ROUGE-2**: Bigram overlap (phrase-level similarity)  
- **ROUGE-L**: Longest common subsequence (sentence structure similarity)
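
To make these definitions concrete, here is a simplified sketch that computes F1-style ROUGE-1, ROUGE-2, and ROUGE-L for a single prediction/reference pair. It assumes lowercased whitespace tokenization and skips the normalization that the `rouge_score` package applies, so its numbers will not exactly match `evaluate.load("rouge")`. The function names and the example strings are illustrative.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return all contiguous n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def f1(overlap, pred_total, ref_total):
    """Harmonic mean of precision (overlap/pred) and recall (overlap/ref)."""
    if overlap == 0 or pred_total == 0 or ref_total == 0:
        return 0.0
    precision = overlap / pred_total
    recall = overlap / ref_total
    return 2 * precision * recall / (precision + recall)


def rouge_n(pred, ref, n):
    p = ngrams(pred.lower().split(), n)
    r = ngrams(ref.lower().split(), n)
    overlap = sum((Counter(p) & Counter(r)).values())  # clipped n-gram matches
    return f1(overlap, len(p), len(r))


def lcs_length(a, b):
    """Length of the longest common subsequence (row-by-row dynamic programming)."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l(pred, ref):
    p, r = pred.lower().split(), ref.lower().split()
    return f1(lcs_length(p, r), len(p), len(r))


prediction = "amanda baked cookies for jerry"
reference = "amanda baked cookies and will bring jerry some"

print(round(rouge_n(prediction, reference, 1), 3))  # 0.615 (4 shared unigrams)
print(round(rouge_n(prediction, reference, 2), 3))  # 0.364 (2 shared bigrams)
print(round(rouge_l(prediction, reference), 3))     # 0.615 (LCS of length 4)
```

ROUGE-2 drops sharply here because only two adjacent word pairs survive, even though most individual words match. This is why the three scores are reported together.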


In [12]:
def generate_predictions(
    model,
    tokenizer,
    dataset,
    task_instruction,
    num_samples=None,
    batch_size=8,
    max_new_tokens=256,
):
    """
    Generate model predictions for a dataset (e.g., summaries).

    Args:
        model: The loaded model (base or fine-tuned).
        tokenizer: Corresponding tokenizer.
        dataset: Hugging Face dataset split containing 'dialogue' and 'summary'.
        task_instruction (str): Instruction prefix for generation.
        num_samples (int, optional): Number of samples to evaluate.
        batch_size (int): Number of examples per inference batch.
        max_new_tokens (int): Max tokens to generate per sample.

    Returns:
        list[str]: Generated summaries.
    """
    if num_samples is not None and num_samples < len(dataset):
        dataset = dataset.select(range(num_samples))

    # Prepare prompts
    prompts = []
    for sample in dataset:
        user_prompt = (
            f"{task_instruction}\n\n"
            f"## Dialogue:\n{sample['dialogue']}\n"
            "## Summary:"
        )
        messages = [{"role": "user", "content": user_prompt}]
        prompt = tokenizer.apply_chat_template(
            messages, tokenize=False, add_generation_prompt=True
        )
        prompts.append(prompt)

    # Initialize pipeline
    pipe = pipeline(
        "text-generation",
        model=model,
        tokenizer=tokenizer,
        dtype="auto",
        do_sample=False,
    )

    # Generate predictions
    preds = []
    for i in tqdm(range(0, len(prompts), batch_size), desc="Generating summaries"):
        batch = prompts[i : i + batch_size]
        outputs = pipe(batch, max_new_tokens=max_new_tokens, return_full_text=False)
        preds.extend([o[0]["generated_text"].strip() for o in outputs])

    return preds


def compute_rouge(predictions, samples):
    """
    Compute ROUGE scores between predictions and reference summaries.

    Args:
        predictions (list[str]): Model-generated outputs.
        samples (datasets.Dataset): Dataset containing reference 'summary' field.

    Returns:
        dict: ROUGE-1, ROUGE-2, and ROUGE-L scores.
    """
    rouge = evaluate.load("rouge")
    references = [s["summary"] for s in samples]
    return rouge.compute(predictions=predictions, references=references)
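
The sketch below isolates two steps of `generate_predictions` that do not need a model: building the user prompt and slicing the prompts into batches. It uses a made-up dialogue and instruction, and it stops before `tokenizer.apply_chat_template`, which wraps the text in model-specific chat markup. `build_user_prompt` and `batched` are illustrative helper names.

```python
def build_user_prompt(task_instruction: str, dialogue: str) -> str:
    """Reproduce the user message text used before the chat template is applied."""
    return (
        f"{task_instruction}\n\n"
        f"## Dialogue:\n{dialogue}\n"
        "## Summary:"
    )


def batched(items: list, batch_size: int):
    """Yield consecutive slices of at most batch_size items."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


# Made-up sample in the same shape as a SAMSum record.
sample = {"dialogue": "Anna: Lunch at noon?\nBen: Sure, see you then."}
prompt = build_user_prompt("Summarize the following dialogue.", sample["dialogue"])
print(prompt)

# 200 prompts with a batch size of 4 give 50 batches,
# which matches the 50 progress-bar steps in the run output in Section 7.
print(len(list(batched(list(range(200)), 4))))  # 50
```

Keeping the prompt text identical to the training template matters: a fine-tuned model has learned to continue after `## Summary:`, so a different layout at evaluation time can lower the scores for reasons unrelated to model quality.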


## 6. Main Evaluation Pipeline

This function orchestrates the entire evaluation process:

1. **Load base model** in 4-bit quantization
2. **Attach LoRA adapters** from Hugging Face Hub
3. **Load validation dataset** (the same validation split used during training)
4. **Generate predictions** for all validation samples
5. **Compute ROUGE metrics** comparing predictions to references
6. **Save results** (metrics + predictions) for analysis

**Note:** We load the base model and attach pre-trained adapters rather than loading a fully merged model.


In [13]:
def evaluate_peft_model(cfg, adapters_name):
    """Load base model, attach LoRA adapters, and evaluate."""

    # ----------------------------
    # Model & Tokenizer
    # ----------------------------
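    print("\nLoading base model...")
    model, tokenizer = setup_model_and_tokenizer(cfg, use_4bit=True, use_lora=False)
    # device_map="auto" has already placed the quantized base model, and the adapter
    # weights are loaded onto the same device, so no explicit .to(device) is needed
    # (calling .to() on bitsandbytes-quantized models can fail on some library versions).
    model = PeftModel.from_pretrained(model, adapters_name)
    model.eval()
    # Decoder-only models need left padding for batched generation.
    tokenizer.padding_side = "left"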

    # ----------------------------
    # Dataset
    # ----------------------------
    print("\nLoading dataset...")
    _, val_data, _ = load_and_prepare_dataset(cfg)
    print(f"Validation set size: {len(val_data)} samples")

    # ----------------------------
    # Inference
    # ----------------------------
    print("\nGenerating summaries...")
    preds = generate_predictions(
        model=model,
        tokenizer=tokenizer,
        dataset=val_data,
        task_instruction=cfg["task_instruction"],
        batch_size=cfg.get("eval_batch_size", 4),
    )

    # ----------------------------
    # Evaluation
    # ----------------------------
    print("\nComputing ROUGE metrics...")
    scores = compute_rouge(preds, val_data)

    print("\nEvaluation Results:")
    print(f"  ROUGE-1: {scores['rouge1']:.2%}")
    print(f"  ROUGE-2: {scores['rouge2']:.2%}")
    print(f"  ROUGE-L: {scores['rougeL']:.2%}")

    # ----------------------------
    # Save Outputs
    # ----------------------------
    output_dir = os.path.join(OUTPUTS_DIR, "lora_samsum")
    os.makedirs(output_dir, exist_ok=True)

    results_path = os.path.join(output_dir, "eval_results.json")
    preds_path = os.path.join(output_dir, "predictions.jsonl")

    results = {
        "rouge1": scores["rouge1"],
        "rouge2": scores["rouge2"],
        "rougeL": scores["rougeL"],
        "num_samples": len(val_data),
        "base_model": cfg["base_model"],
    }

    with open(results_path, "w", encoding="utf-8") as f:
        json.dump(results, f, indent=2)

    with open(preds_path, "w", encoding="utf-8") as f:
        for i, pred in enumerate(preds):
            json.dump(
                {
                    "dialogue": val_data[i]["dialogue"],
                    "reference": val_data[i]["summary"],
                    "prediction": pred,
                },
                f,
            )
            f.write("\n")

    print(f"\nSaved results to {results_path}")
    print(f"Saved predictions to {preds_path}")

    return scores, preds
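
`evaluate_peft_model` switches the tokenizer to left padding before batched generation. The toy sketch below, which uses made-up token IDs and a hypothetical `pad_batch` helper rather than a real tokenizer, shows the reason: a decoder-only model continues from the last position of each row, so right padding would place pad tokens between the prompt and the generated text.

```python
PAD = 0  # hypothetical pad token ID


def pad_batch(sequences: list[list[int]], side: str) -> list[list[int]]:
    """Pad every sequence to the longest length, on the given side."""
    width = max(len(seq) for seq in sequences)
    padded = []
    for seq in sequences:
        fill = [PAD] * (width - len(seq))
        padded.append(fill + seq if side == "left" else seq + fill)
    return padded


prompts = [[11, 12, 13, 14], [21, 22]]  # made-up token IDs for two prompts

left = pad_batch(prompts, "left")
right = pad_batch(prompts, "right")
print(left)   # [[11, 12, 13, 14], [0, 0, 21, 22]]
print(right)  # [[11, 12, 13, 14], [21, 22, 0, 0]]

# With left padding, every row ends in a real prompt token.
print(all(row[-1] != PAD for row in left))   # True
print(all(row[-1] != PAD for row in right))  # False
```

Right padding remains the usual choice during training, which is why `setup_model_and_tokenizer` sets it by default and the evaluation function overrides it.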


## 7. Run Evaluation 🚀

Load your fine-tuned model adapters from Hugging Face Hub and evaluate on the validation set.

**Expected Outputs:**
- ROUGE scores printed to console
- `eval_results.json` - Metrics summary
- `predictions.jsonl` - All predictions with dialogues and references

**Tip:** Compare these scores against the baseline model to measure improvement from fine-tuning!


In [15]:
from google.colab import userdata
HF_USERNAME = userdata.get('HF_USERNAME')

cfg = load_config()

scores, preds = evaluate_peft_model(cfg, f'{HF_USERNAME}/Llama-3.2-1B-QLoRA-Summarizer-adapters')


Loading base model...

Loading model: meta-llama/Llama-3.2-1B-Instruct
⚙️  Enabling 4-bit quantization...
🔹 Skipping LoRA setup - using base model only.


Device set to use cuda:0



Loading dataset...
📂 Loading dataset from local cache: ./datasets/knkarthick_samsum
📊 Loaded 14731 train / 200 val / 200 test samples (from full cache).
Validation set size: 200 samples

Generating summaries...


Generating summaries: 100%|██████████| 50/50 [03:03<00:00,  3.68s/it]



Computing ROUGE metrics...

Evaluation Results:
  ROUGE-1: 47.33%
  ROUGE-2: 22.77%
  ROUGE-L: 39.13%

Saved results to ./outputs/lora_samsum/eval_results.json
Saved predictions to ./outputs/lora_samsum/predictions.jsonl


## ✅ Evaluation Complete!

**Next Steps:**
1. **Analyze predictions** - Review `predictions.jsonl` to see where the model succeeds/fails
2. **Compare with baseline** - Look at ROUGE scores from the baseline model evaluation
3. **Error analysis** - Identify patterns in low-quality summaries
4. **Iterate** - Adjust hyperparameters (LoRA rank, learning rate) and retrain if needed
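
As a starting point for steps 1 and 3, the sketch below ranks saved predictions by a crude unigram-overlap F1 so that the weakest summaries surface first. It assumes the record layout written by `evaluate_peft_model` (`dialogue`, `reference`, `prediction`), writes two made-up records to a temporary file so that it runs on its own, and uses a simplified score that is not identical to ROUGE-1 from the `evaluate` library. The helper names are illustrative.

```python
import json
import os
import tempfile
from collections import Counter


def unigram_f1(prediction: str, reference: str) -> float:
    """Simplified word-overlap F1 on lowercased whitespace tokens."""
    pred, ref = prediction.lower().split(), reference.lower().split()
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    return 2 * overlap / (len(pred) + len(ref))


def worst_predictions(jsonl_path: str, k: int = 5) -> list[dict]:
    """Load a predictions file and return the k lowest-scoring records."""
    with open(jsonl_path, encoding="utf-8") as f:
        records = [json.loads(line) for line in f if line.strip()]
    for rec in records:
        rec["score"] = unigram_f1(rec["prediction"], rec["reference"])
    return sorted(records, key=lambda rec: rec["score"])[:k]


# Made-up records, only to make the sketch self-contained.
demo = [
    {"dialogue": "Anna: Lunch at noon?\nBen: Sure.",
     "reference": "Anna and Ben will have lunch at noon.",
     "prediction": "Anna and Ben will have lunch at noon."},
    {"dialogue": "Tom: I can't find my keys!",
     "reference": "Tom lost his keys.",
     "prediction": "They talk about the weather."},
]

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "predictions.jsonl")
    with open(path, "w", encoding="utf-8") as f:
        for rec in demo:
            f.write(json.dumps(rec) + "\n")
    worst = worst_predictions(path, k=1)

print(worst[0]["prediction"], worst[0]["score"])  # They talk about the weather. 0.0
```

To run it on real output, point `worst_predictions` at `./outputs/lora_samsum/predictions.jsonl` and read the lowest-scoring dialogues by hand. Low overlap does not always mean a bad summary, so treat the ranking as a way to decide what to read first.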

**Key Metrics Interpretation:**
- **ROUGE-1 > 40%** → Good word-level coverage
- **ROUGE-2 > 20%** → Captures key phrases well
- **ROUGE-L > 35%** → Maintains sentence structure

Typical improvements from QLoRA fine-tuning: **+5-15%** on ROUGE scores compared to a zero-shot baseline!
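
When making that comparison, it helps to report the change both in percentage points and relative to the baseline, since a single percentage figure can be read either way. The sketch below does this for two metric dictionaries shaped like `eval_results.json`. The fine-tuned scores are the ones printed in Section 7. The baseline values are placeholders invented purely for illustration, not measured results, and `compare_runs` is a hypothetical helper.

```python
def compare_runs(baseline: dict, finetuned: dict,
                 keys=("rouge1", "rouge2", "rougeL")) -> dict:
    """Return the absolute (percentage-point) and relative (%) change per metric."""
    report = {}
    for key in keys:
        base, tuned = baseline[key], finetuned[key]
        report[key] = {
            "points": (tuned - base) * 100,
            "relative_pct": (tuned - base) / base * 100 if base else float("nan"),
        }
    return report


# Scores from the evaluation run shown in Section 7.
finetuned = {"rouge1": 0.4733, "rouge2": 0.2277, "rougeL": 0.3913}
# PLACEHOLDER baseline values, NOT real measurements: replace with your own baseline run.
baseline = {"rouge1": 0.40, "rouge2": 0.18, "rougeL": 0.32}

for metric, change in compare_runs(baseline, finetuned).items():
    print(f"{metric}: {change['points']:+.2f} points "
          f"({change['relative_pct']:+.1f}% relative)")
```

In practice, load both dictionaries from the `eval_results.json` files of the two runs, and make sure both were computed on the same validation subset.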
