# Qwen 2.5-7B QLoRA Fine-tuning for Grid Compliance QA

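This notebook fine-tunes Qwen2.5-7B-Instruct with QLoRA on the Grid Compliance QA dataset (416 question-answer pairs drawn from three UK grid-compliance documents) and compares the result against the untuned base model.

**Features:**
- Source-stratified test split (20 samples: 10 G99, 6 SPEN, 4 UKPN)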
- QLoRA 4-bit quantization
- WandB tracking
- ROUGE/BLEU evaluation
- Inference latency & throughput comparison

## 1. Setup & Installation

In [1]:
!pip install -q transformers datasets accelerate peft bitsandbytes trl
!pip install -q wandb evaluate rouge_score nltk
!pip install -q huggingface_hub

In [39]:
import os
import time
import torch
import numpy as np
import pandas as pd
from datetime import datetime

import wandb
import evaluate
import nltk
nltk.download('punkt_tab')


from datasets import load_dataset, Dataset, DatasetDict
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
    TrainingArguments,
    pipeline,
)
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training, PeftModel
from trl import SFTTrainer
from huggingface_hub import login

print(f"PyTorch version: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"GPU: {torch.cuda.get_device_name(0)}")

[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.


PyTorch version: 2.9.0+cu126
CUDA available: True
GPU: Tesla T4


## 2. Configuration

In [None]:
from google.colab import userdata

# ============= CONFIGURATION =============
# HuggingFace
HF_DATASET_ID = "sayedsalem/grid-compliance-qa"
HF_MODEL_OUTPUT_ID = "sayedsalem/qwen2.5-7b-grid-compliance"
HF_TOKEN = 'your_huggingface_token_here'

# WandB
WANDB_PROJECT = "grid-compliance-qwen-qlora"
WANDB_API_KEY = "your_wandb_api_key_here"

# Model
BASE_MODEL = "Qwen/Qwen2.5-7B-Instruct"

# Training
MAX_SEQ_LENGTH = 512
BATCH_SIZE = 2
GRADIENT_ACCUMULATION = 4
NUM_EPOCHS = 3
LEARNING_RATE = 2e-4
WARMUP_RATIO = 0.03

# LoRA
LORA_R = 16
LORA_ALPHA = 32
LORA_DROPOUT = 0.05

# Test split
TEST_SIZE = 20  # Total test samples
# =========================================

In [4]:
# Login to HuggingFace and WandB
login(token=HF_TOKEN)

os.environ["WANDB_API_KEY"] = WANDB_API_KEY
wandb.login()
wandb.init(project=WANDB_PROJECT, name=f"qwen-qlora-{datetime.now().strftime('%Y%m%d-%H%M')}")

[34m[1mwandb[0m: Currently logged in as: [33msayedsalem767[0m ([33msayedsalem767-ml-eng-[0m) to [32mhttps://api.wandb.ai[0m. Use [1m`wandb login --relogin`[0m to force relogin


## 3. Load Dataset & Balanced Split

In [5]:
# Load dataset from HuggingFace
dataset = load_dataset(HF_DATASET_ID, split="train")
print(f"Total samples: {len(dataset)}")
print(f"Columns: {dataset.column_names}")

# Show distribution
df = dataset.to_pandas()
print("\nDistribution by source:")
print(df['source'].value_counts())

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


Total samples: 416
Columns: ['input', 'output', 'source', 'page']

Distribution by source:
source
G99_Issue_2.pdf            294
SPEN_EV_Fleet_Guide.pdf     77
UKPN_EDS_08_5050.pdf        45
Name: count, dtype: int64


In [6]:
# Create source-stratified test split (20 samples: 10 G99, 6 SPEN, 4 UKPN)
def create_balanced_split(df, test_size=20):
    """
    Create a test split stratified across the 3 sources.
    Distribution: G99=10, SPEN=6, UKPN=4 (hand-adjusted from proportional).
    Note: test_size is currently unused; the per-source counts below are fixed.
    """
    # A strictly proportional allocation of 20 samples would be:
    # G99: 294/416 ≈ 71% -> 14 samples
    # SPEN: 77/416 ≈ 19% -> 4 samples
    # UKPN: 45/416 ≈ 11% -> 2 samples
    # Hand-adjusted so the smaller sources are over-represented: G99=10, SPEN=6, UKPN=4

    split_config = {
        'G99_Issue_2.pdf': 10,
        'SPEN_EV_Fleet_Guide.pdf': 6,
        'UKPN_EDS_08_5050.pdf': 4
    }

    test_indices = []

    for source, count in split_config.items():
        source_df = df[df['source'] == source]
        sampled = source_df.sample(n=min(count, len(source_df)), random_state=42)
        test_indices.extend(sampled.index.tolist())
        print(f"{source}: {len(sampled)} test samples")

    # Create train/test split
    test_df = df.loc[test_indices]
    train_df = df.drop(test_indices)

    return train_df, test_df

train_df, test_df = create_balanced_split(df, TEST_SIZE)
print(f"\nTrain samples: {len(train_df)}")
print(f"Test samples: {len(test_df)}")

G99_Issue_2.pdf: 10 test samples
SPEN_EV_Fleet_Guide.pdf: 6 test samples
UKPN_EDS_08_5050.pdf: 4 test samples

Train samples: 396
Test samples: 20
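
For comparison with the hand-adjusted 10/6/4 split above, the strictly proportional allocation mentioned in the code comments can be sketched with the largest-remainder method. This is only an illustration: `proportional_allocation` is a hypothetical helper, not part of the notebook, and the source counts are copied from the distribution printed earlier.

```python
from math import floor

def proportional_allocation(counts, total):
    """Allocate `total` test samples across sources by the largest-remainder method."""
    n = sum(counts.values())
    # Exact (fractional) share of the test set each source would get.
    quotas = {src: total * c / n for src, c in counts.items()}
    alloc = {src: floor(q) for src, q in quotas.items()}
    leftover = total - sum(alloc.values())
    # Hand the remaining slots to the sources with the largest fractional parts.
    by_remainder = sorted(quotas, key=lambda s: quotas[s] - alloc[s], reverse=True)
    for src in by_remainder[:leftover]:
        alloc[src] += 1
    return alloc

source_counts = {
    "G99_Issue_2.pdf": 294,
    "SPEN_EV_Fleet_Guide.pdf": 77,
    "UKPN_EDS_08_5050.pdf": 45,
}
allocation = proportional_allocation(source_counts, 20)
print(allocation)
```

Under these counts the proportional split works out to 14/4/2, so the 10/6/4 split used in the notebook gives SPEN and UKPN more test coverage than their share of the dataset.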


In [7]:
# Convert back to HuggingFace Dataset
train_dataset = Dataset.from_pandas(train_df.reset_index(drop=True))
test_dataset = Dataset.from_pandas(test_df.reset_index(drop=True))

dataset_dict = DatasetDict({
    'train': train_dataset,
    'test': test_dataset
})

print(dataset_dict)

DatasetDict({
    train: Dataset({
        features: ['input', 'output', 'source', 'page'],
        num_rows: 396
    })
    test: Dataset({
        features: ['input', 'output', 'source', 'page'],
        num_rows: 20
    })
})


## 4. Qwen Prompt Template & Formatting

In [8]:
# Qwen 2.5 Chat Template
SYSTEM_PROMPT = """You are a UK Grid Compliance Expert specializing in G99, UKPN EDS, and SPEN EV Fleet regulations. Provide accurate, technical answers based on official documentation."""

def format_qwen_prompt(example):
    """
    Format example for Qwen 2.5 Instruct using ChatML template.
    Only uses 'input' (question) and 'output' (answer).
    """
    # Qwen ChatML format: each turn is wrapped in <|im_start|>{role} ... <|im_end|>
    formatted = "<|im_start|>system\n" + SYSTEM_PROMPT + "<|im_end|>\n"
    formatted += "<|im_start|>user\n" + example['input'] + "<|im_end|>\n"
    formatted += "<|im_start|>assistant\n" + example['output'] + "<|im_end|>"

    return {"text": formatted}

# Apply formatting
train_formatted = train_dataset.map(format_qwen_prompt)
test_formatted = test_dataset.map(format_qwen_prompt)

# Preview
print("Sample formatted prompt:")
print("="*50)
print(train_formatted[0]['text'])

Map:   0%|          | 0/396 [00:00<?, ? examples/s]

Map:   0%|          | 0/20 [00:00<?, ? examples/s]

Sample formatted prompt:
<|im_start|>system
You are a UK Grid Compliance Expert specializing in G99, UKPN EDS, and SPEN EV Fleet regulations. Provide accurate, technical answers based on official documentation.<|im_end|>
<|im_start|>user
What is the minimum Registered Capacity for a Power Generating Facility to be considered an Embedded Medium Power Station in England and Wales under G99 regulations?<|im_end|>
<|im_start|>assistant
A Power Generating Facility in England and Wales is classified as an Embedded Medium Power Station if its Registered Capacity is 50MW or greater.<|im_end|>
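
The ChatML layout printed above can also be produced from a generic list of role/content messages. The sketch below is a minimal stand-in, assuming the same turn delimiters as the notebook; `to_chatml` is a hypothetical helper and the message contents are shortened for illustration. In practice `tokenizer.apply_chat_template` is the library route, though its output may differ in details.

```python
def to_chatml(messages, add_generation_prompt=False):
    """Render a list of {"role", "content"} dicts as a ChatML string."""
    parts = [f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>" for m in messages]
    text = "\n".join(parts)
    if add_generation_prompt:
        # Open an assistant turn so the model continues from here at inference time.
        text += "\n<|im_start|>assistant\n"
    return text

# Illustrative messages (shortened; not the notebook's full system prompt).
messages = [
    {"role": "system", "content": "You are a UK Grid Compliance Expert."},
    {"role": "user", "content": "What voltage counts as Low Voltage?"},
    {"role": "assistant", "content": "A voltage exceeding 50 V."},
]

training_text = to_chatml(messages)                                     # full example for SFT
inference_prompt = to_chatml(messages[:2], add_generation_prompt=True)  # prompt only
print(training_text)
```

The training string ends with the closed assistant turn, while the inference prompt stops right after the opening assistant tag, which matches the prompt built in the evaluation section.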


## 5. Load Model with QLoRA (4-bit Quantization)

In [9]:
# 4-bit Quantization Config
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL, trust_remote_code=True)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"

print(f"Tokenizer loaded. Vocab size: {tokenizer.vocab_size}")

Tokenizer loaded. Vocab size: 151643


In [10]:
# Load model with 4-bit quantization
model = AutoModelForCausalLM.from_pretrained(
    BASE_MODEL,
    quantization_config=bnb_config,
    device_map="auto",
    trust_remote_code=True,
    dtype=torch.bfloat16,
)

model.config.use_cache = False
model.config.pretraining_tp = 1

# Prepare for k-bit training
model = prepare_model_for_kbit_training(model)

print(f"Model loaded: {BASE_MODEL}")
print(f"Model memory: {model.get_memory_footprint() / 1e9:.2f} GB")

Loading checkpoint shards:   0%|          | 0/4 [00:00<?, ?it/s]

Model loaded: Qwen/Qwen2.5-7B-Instruct
Model memory: 7.62 GB


In [11]:
# LoRA Configuration
lora_config = LoraConfig(
    r=LORA_R,
    lora_alpha=LORA_ALPHA,
    lora_dropout=LORA_DROPOUT,
    bias="none",
    task_type="CAUSAL_LM",
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"]
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()

trainable params: 40,370,176 || all params: 7,655,986,688 || trainable%: 0.5273
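
The trainable-parameter count reported above can be reproduced by hand. The sketch below assumes the published Qwen2.5-7B architecture dimensions (hidden size 3584, MLP size 18944, 28 layers, 4 key/value heads of dimension 128); these values are not shown anywhere in this notebook and are stated here as assumptions.

```python
# Assumed Qwen2.5-7B dimensions (from the published model config, not from this notebook).
HIDDEN = 3584
INTERMEDIATE = 18944
NUM_LAYERS = 28
KV_DIM = 4 * 128  # 4 key/value heads x 128 head dim (grouped-query attention)

LORA_R = 16
ALL_PARAMS = 7_655_986_688  # "all params" as printed by print_trainable_parameters()

# (in_features, out_features) of each targeted projection in one decoder layer.
target_shapes = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}

def lora_params(in_features, out_features, r):
    """LoRA adds A (r x in) and B (out x r), i.e. r * (in + out) weights per layer."""
    return r * (in_features + out_features)

per_layer = sum(lora_params(i, o, LORA_R) for i, o in target_shapes.values())
total = per_layer * NUM_LAYERS
trainable_pct = 100 * total / ALL_PARAMS

print(f"{total:,}")            # 40,370,176
print(f"{trainable_pct:.4f}")  # 0.5273
```

Because the count scales linearly with `r`, halving `LORA_R` to 8 would halve the number of trainable weights under the same assumptions.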


## 6. Training with SFTTrainer

In [13]:
# Training Arguments
training_args = TrainingArguments(
    output_dir="./qwen-qlora-checkpoints",
    num_train_epochs=NUM_EPOCHS,
    per_device_train_batch_size=BATCH_SIZE,
    per_device_eval_batch_size=BATCH_SIZE,
    gradient_accumulation_steps=GRADIENT_ACCUMULATION,
    optim="paged_adamw_32bit",
    learning_rate=LEARNING_RATE,
    lr_scheduler_type="cosine",
    warmup_ratio=WARMUP_RATIO,
    weight_decay=0.01,
    max_grad_norm=0.3,
    logging_steps=10,
    save_strategy="epoch",
    eval_strategy="epoch",
    load_best_model_at_end=True,
    metric_for_best_model="eval_loss",
    greater_is_better=False,
    report_to="wandb",
    fp16=False,
    bf16=True,
    gradient_checkpointing=True,
    save_total_limit=2,
)
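
The schedule implied by these arguments can be worked out by hand. This is a rough sketch assuming a single GPU and the 396 training samples from the split above; the exact rounding of a partial accumulation group at the end of an epoch may differ between `transformers` versions.

```python
import math

TRAIN_SAMPLES = 396
BATCH_SIZE = 2
GRADIENT_ACCUMULATION = 4
NUM_EPOCHS = 3
WARMUP_RATIO = 0.03

# Samples that contribute to each optimizer update.
effective_batch = BATCH_SIZE * GRADIENT_ACCUMULATION

# Forward/backward passes per epoch, then optimizer updates per epoch.
batches_per_epoch = math.ceil(TRAIN_SAMPLES / BATCH_SIZE)
updates_per_epoch = math.ceil(batches_per_epoch / GRADIENT_ACCUMULATION)

total_updates = updates_per_epoch * NUM_EPOCHS
warmup_steps = math.ceil(total_updates * WARMUP_RATIO)

print(effective_batch, batches_per_epoch, updates_per_epoch, total_updates, warmup_steps)
```

With roughly 150 optimizer updates in total, `logging_steps=10` yields about 15 logged loss values, and the warmup covers only the first handful of updates.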

In [18]:
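# SFT Trainer
# Note: neither `tokenizer` nor MAX_SEQ_LENGTH is passed here, so SFTTrainer
# falls back to its own defaults for tokenization and truncation length.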
trainer = SFTTrainer(
    model=model,
    args=training_args,
    train_dataset=train_formatted,
    eval_dataset=test_formatted,
)



Adding EOS to train dataset:   0%|          | 0/396 [00:00<?, ? examples/s]

Tokenizing train dataset:   0%|          | 0/396 [00:00<?, ? examples/s]

Truncating train dataset:   0%|          | 0/396 [00:00<?, ? examples/s]

Adding EOS to eval dataset:   0%|          | 0/20 [00:00<?, ? examples/s]

Tokenizing eval dataset:   0%|          | 0/20 [00:00<?, ? examples/s]

Truncating eval dataset:   0%|          | 0/20 [00:00<?, ? examples/s]

In [19]:
# Train!
train_result = trainer.train()

# Log final metrics
print("\nTraining Complete!")
print(f"Training Loss: {train_result.training_loss:.4f}")

The tokenizer has new PAD/BOS/EOS tokens that differ from the model config and generation config. The model config and generation config were aligned accordingly, being updated with the tokenizer's values. Updated tokens: {'bos_token_id': None, 'pad_token_id': 151643}.


Epoch,Training Loss,Validation Loss
1,0.9683,0.933124
2,0.6078,0.864924
3,0.4149,0.908318



Training Complete!
Training Loss: 0.8314
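
With `load_best_model_at_end=True`, `metric_for_best_model="eval_loss"` and `greater_is_better=False`, the checkpoint kept at the end is the one with the lowest validation loss. The selection rule can be sketched over the logged values above; this mirrors the rule only, not the Trainer's internal implementation.

```python
# Per-epoch losses copied from the training log above.
history = [
    {"epoch": 1, "train_loss": 0.9683, "eval_loss": 0.933124},
    {"epoch": 2, "train_loss": 0.6078, "eval_loss": 0.864924},
    {"epoch": 3, "train_loss": 0.4149, "eval_loss": 0.908318},
]

# Lower eval_loss is better, so the best checkpoint is the minimum.
best = min(history, key=lambda row: row["eval_loss"])
print(best["epoch"])  # 2

# Training loss keeps falling after the best epoch while validation loss rises.
last = history[-1]
overfitting_signal = last["train_loss"] < best["train_loss"] and last["eval_loss"] > best["eval_loss"]
print(overfitting_signal)  # True
```

Note that the validation set here is the same 20-sample test split used for the final evaluation, so the checkpoint choice is not independent of the reported test scores.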


## 7. Save Best Model to HuggingFace

In [35]:
from huggingface_hub import HfApi, create_repo

# 1. Manually ensure the repo exists
api = HfApi()
create_repo(repo_id=HF_MODEL_OUTPUT_ID, exist_ok=True, token=HF_TOKEN)

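# 2. Upload the local adapter folder. This assumes ./qwen-qlora-best was written
#    in an earlier cell (not shown here), e.g. with trainer.save_model("./qwen-qlora-best")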
api.upload_folder(
    folder_path="./qwen-qlora-best",
    repo_id=HF_MODEL_OUTPUT_ID,
    repo_type="model",
    token=HF_TOKEN
)

print("Upload complete! Check your profile now.")

Processing Files (0 / 0)      : |          |  0.00B /  0.00B            

New Data Upload               : |          |  0.00B /  0.00B            

  ...adapter_model.safetensors:   0%|          | 49.4kB /  162MB            

  ...qlora-best/tokenizer.json: 100%|##########| 11.4MB / 11.4MB            

  ...ra-best/training_args.bin:   4%|4         |   279B / 6.35kB            

Upload complete! Check your profile now.


## 8. Evaluation: ROUGE & BLEU

In [36]:
# Load metrics
rouge = evaluate.load("rouge")
bleu = evaluate.load("bleu")

def generate_response(model, tokenizer, question, max_new_tokens=256):
    """Generate response using Qwen format."""
    prompt = f"<|im_start|>system\n{SYSTEM_PROMPT}<|im_end|>\n<|im_start|>user\n{question}<|im_end|>\n<|im_start|>assistant\n"

    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            max_new_tokens=max_new_tokens,
            do_sample=True,  # sampling: outputs (and scores) vary between runs unless a seed is fixed
            temperature=0.7,
            top_p=0.9,
            pad_token_id=tokenizer.eos_token_id,
        )

    response = tokenizer.decode(outputs[0], skip_special_tokens=False)
    # Extract assistant response
    if "<|im_start|>assistant" in response:
        response = response.split("<|im_start|>assistant")[-1]
        response = response.replace("<|im_end|>", "").strip()

    return response

Downloading builder script: 0.00B [00:00, ?B/s]

Downloading builder script: 0.00B [00:00, ?B/s]

Downloading extra modules:   0%|          | 0.00/1.55k [00:00<?, ?B/s]

Downloading extra modules: 0.00B [00:00, ?B/s]
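
The response-extraction step in `generate_response` can be tried in isolation on a plain string. The sketch below is a slightly stricter variant, assuming the decoded text may carry extra special tokens after the end-of-turn marker; `extract_assistant_reply` is a hypothetical helper and the decoded string is made up for illustration.

```python
def extract_assistant_reply(decoded):
    """Return the text of the last assistant turn in a decoded ChatML string."""
    marker = "<|im_start|>assistant"
    if marker not in decoded:
        return decoded.strip()
    reply = decoded.split(marker)[-1]
    # Keep only the text before the first end-of-turn token, so anything the
    # tokenizer appended afterwards (e.g. padding tokens) is dropped as well.
    reply = reply.split("<|im_end|>")[0]
    return reply.strip()

# Illustrative decoded output, including a trailing special token.
decoded = (
    "<|im_start|>system\nYou are a UK Grid Compliance Expert.<|im_end|>\n"
    "<|im_start|>user\nWhat is the aggregate power rating?<|im_end|>\n"
    "<|im_start|>assistant\nThe aggregate power rating is 1.2 MW.<|im_end|><|endoftext|>"
)
reply = extract_assistant_reply(decoded)
print(reply)  # The aggregate power rating is 1.2 MW.
```

Cutting at the first `<|im_end|>` rather than deleting the token keeps leftover special tokens out of the text that is scored by ROUGE and BLEU.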

In [41]:
def evaluate_model(model, tokenizer, test_data, model_name="Model"):
    """Evaluate model on test set with ROUGE and BLEU."""
    predictions = []
    references = []
    latencies = []

    print(f"\nEvaluating {model_name} on {len(test_data)} samples...")

    for i, example in enumerate(test_data):
        question = example['input']
        reference = example['output']

        # Measure latency
        start_time = time.time()
        prediction = generate_response(model, tokenizer, question)
        latency = time.time() - start_time

        predictions.append(prediction)
        references.append(reference)
        latencies.append(latency)

        if (i + 1) % 5 == 0:
            print(f"  Processed {i + 1}/{len(test_data)}")

    # 1. Calculate ROUGE
    # Expects: predictions=[str, str], references=[str, str]
    rouge_scores = rouge.compute(predictions=predictions, references=references)

    # 2. Calculate BLEU
    # Expects: predictions=[str, str], references=[[str], [str]]
    # We wrap each reference string in a list because BLEU supports multiple refs per prediction
    bleu_references = [[r] for r in references]
    bleu_score = bleu.compute(predictions=predictions, references=bleu_references)

    # Latency stats
    avg_latency = np.mean(latencies)
    throughput = len(test_data) / sum(latencies)

    results = {
        "model": model_name,
        "rouge1": rouge_scores['rouge1'],
        "rouge2": rouge_scores['rouge2'],
        "rougeL": rouge_scores['rougeL'],
        "bleu": bleu_score['bleu'],
        "avg_latency_sec": avg_latency,
        "throughput_samples_per_sec": throughput,
    }

    return results, predictions
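
To give a feel for what the ROUGE-1 number measures, here is a simplified unigram-overlap F1. It is a teaching sketch, not the implementation behind `evaluate.load("rouge")`, which tokenizes differently and also reports ROUGE-2 and ROUGE-L; the example strings are invented for illustration.

```python
from collections import Counter

def rouge1_f1(prediction, reference):
    """Unigram-overlap F1 between two whitespace-tokenized, lowercased strings."""
    pred_counts = Counter(prediction.lower().split())
    ref_counts = Counter(reference.lower().split())
    # Clipped overlap: a token is matched at most as often as it occurs in both texts.
    overlap = sum((pred_counts & ref_counts).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(pred_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)

reference = "the aggregate power rating is 1.2 MW"
concise = "the aggregate power rating is 1.2 MW"
verbose = ("the aggregate power rating is the sum of the individual ratings "
           "which gives 1.2 MW in total")

concise_score = rouge1_f1(concise, reference)
verbose_score = rouge1_f1(verbose, reference)
print(f"{concise_score:.4f} {verbose_score:.4f}")  # 1.0000 0.5833
```

The verbose answer contains every reference token (recall 1.0) but its extra words pull precision down to 7/17, which is one way a correct but long-winded answer can score well below a terse one.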

In [42]:
# Evaluate Fine-tuned Model
finetuned_results, finetuned_preds = evaluate_model(
    model, tokenizer, test_dataset, "Qwen-2.5-7B-QLoRA-Finetuned"
)

print("\n" + "="*60)
print("FINE-TUNED MODEL RESULTS")
print("="*60)
for k, v in finetuned_results.items():
    if isinstance(v, float):
        print(f"{k}: {v:.4f}")
    else:
        print(f"{k}: {v}")


Evaluating Qwen-2.5-7B-QLoRA-Finetuned on 20 samples...
  Processed 5/20
  Processed 10/20
  Processed 15/20
  Processed 20/20

FINE-TUNED MODEL RESULTS
model: Qwen-2.5-7B-QLoRA-Finetuned
rouge1: 0.3827
rouge2: 0.1646
rougeL: 0.2993
bleu: 0.1005
avg_latency_sec: 10.8052
throughput_samples_per_sec: 0.0925


## 9. Baseline Comparison

In [43]:
# Load baseline model (without LoRA adapters)
print("Loading baseline model for comparison...")

baseline_model = AutoModelForCausalLM.from_pretrained(
    BASE_MODEL,
    quantization_config=bnb_config,
    device_map="auto",
    trust_remote_code=True,
    dtype=torch.bfloat16,
)

baseline_tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL, trust_remote_code=True)
baseline_tokenizer.pad_token = baseline_tokenizer.eos_token

print("Baseline model loaded.")

`torch_dtype` is deprecated! Use `dtype` instead!


Loading baseline model for comparison...


Loading checkpoint shards:   0%|          | 0/4 [00:00<?, ?it/s]

Baseline model loaded.


In [44]:
# Evaluate Baseline
baseline_results, baseline_preds = evaluate_model(
    baseline_model, baseline_tokenizer, test_dataset, "Qwen-2.5-7B-Baseline"
)

print("\n" + "="*60)
print("BASELINE MODEL RESULTS")
print("="*60)
for k, v in baseline_results.items():
    if isinstance(v, float):
        print(f"{k}: {v:.4f}")
    else:
        print(f"{k}: {v}")


Evaluating Qwen-2.5-7B-Baseline on 20 samples...
  Processed 5/20
  Processed 10/20
  Processed 15/20
  Processed 20/20

BASELINE MODEL RESULTS
model: Qwen-2.5-7B-Baseline
rouge1: 0.1764
rouge2: 0.0517
rougeL: 0.1141
bleu: 0.0182
avg_latency_sec: 19.9129
throughput_samples_per_sec: 0.0502


## 10. Results Comparison

In [45]:
# Create comparison table
comparison_df = pd.DataFrame([baseline_results, finetuned_results])
comparison_df = comparison_df.set_index('model')

print("\n" + "="*80)
print("COMPARISON: BASELINE vs FINE-TUNED")
print("="*80)
print(comparison_df.to_string())

# Calculate improvements
print("\n" + "="*80)
print("IMPROVEMENTS")
print("="*80)
for col in ['rouge1', 'rouge2', 'rougeL', 'bleu']:
    baseline_val = baseline_results[col]
    finetuned_val = finetuned_results[col]
    improvement = ((finetuned_val - baseline_val) / baseline_val) * 100 if baseline_val > 0 else 0
    print(f"{col}: {improvement:+.2f}%")

# Log to WandB
wandb.log({
    "baseline_rouge1": baseline_results['rouge1'],
    "baseline_rouge2": baseline_results['rouge2'],
    "baseline_rougeL": baseline_results['rougeL'],
    "baseline_bleu": baseline_results['bleu'],
    "finetuned_rouge1": finetuned_results['rouge1'],
    "finetuned_rouge2": finetuned_results['rouge2'],
    "finetuned_rougeL": finetuned_results['rougeL'],
    "finetuned_bleu": finetuned_results['bleu'],
    "baseline_latency": baseline_results['avg_latency_sec'],
    "finetuned_latency": finetuned_results['avg_latency_sec'],
    "baseline_throughput": baseline_results['throughput_samples_per_sec'],
    "finetuned_throughput": finetuned_results['throughput_samples_per_sec'],
})


COMPARISON: BASELINE vs FINE-TUNED
                               rouge1    rouge2    rougeL      bleu  avg_latency_sec  throughput_samples_per_sec
model                                                                                                           
Qwen-2.5-7B-Baseline         0.176379  0.051670  0.114090  0.018196        19.912871                    0.050219
Qwen-2.5-7B-QLoRA-Finetuned  0.382666  0.164579  0.299332  0.100456        10.805220                    0.092548

IMPROVEMENTS
rouge1: +116.96%
rouge2: +218.52%
rougeL: +162.36%
bleu: +452.08%
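
The percentages above follow from the relative-change formula applied to the table values. The sketch below recomputes one of them and also derives the latency speedup, which the notebook does not print; the numbers are copied from the comparison table, and `relative_change_pct` is a hypothetical helper.

```python
# Values copied from the comparison table above.
baseline = {"rouge1": 0.176379, "bleu": 0.018196, "avg_latency_sec": 19.912871}
finetuned = {"rouge1": 0.382666, "bleu": 0.100456, "avg_latency_sec": 10.805220}

def relative_change_pct(new, old):
    """Percentage change of `new` relative to `old`."""
    return (new - old) / old * 100

rouge1_gain = relative_change_pct(finetuned["rouge1"], baseline["rouge1"])
print(f"rouge1: {rouge1_gain:+.2f}%")

# End-to-end latency ratio: how many times faster the fine-tuned run was per sample.
speedup = baseline["avg_latency_sec"] / finetuned["avg_latency_sec"]
print(f"latency speedup: {speedup:.2f}x")

# Throughput is computed as n / sum(latencies), i.e. the reciprocal of the mean latency.
finetuned_throughput = 1 / finetuned["avg_latency_sec"]
print(f"throughput: {finetuned_throughput:.4f} samples/sec")
```

Latency here is measured per generated answer with a 256-token cap, so the speedup reflects total generation time per sample rather than time per token.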


In [46]:
# Sample predictions comparison
print("\n" + "="*80)
print("SAMPLE PREDICTIONS COMPARISON")
print("="*80)

for i in range(min(3, len(test_dataset))):
    print(f"\n--- Example {i+1} ---")
    print(f"QUESTION: {test_dataset[i]['input'][:200]}...")
    print(f"\nGROUND TRUTH: {test_dataset[i]['output'][:200]}...")
    print(f"\nBASELINE: {baseline_preds[i][:200]}...")
    print(f"\nFINE-TUNED: {finetuned_preds[i][:200]}...")


SAMPLE PREDICTIONS COMPARISON

--- Example 1 ---
QUESTION: What is the aggregate power rating for a Power Generating Facility comprised of three 400 kW Type A Synchronous Power Generating Modules?...

GROUND TRUTH: The total power of a Power Generating Facility composed of three 400 kW Type A Synchronous Power Generating Modules is 1.2 MW....

BASELINE: The aggregate power rating for a Power Generating Facility comprised of three 400 kW Type A Synchronous Power Generating Modules would be the sum of the individual power ratings.

Given:
- Each module...

FINE-TUNED: A Power Generating Facility made up of three 400 kW Type A Synchronous Power Generating Modules has an aggregate power rating of 1.2 MW (3 x 400 kW)....

--- Example 2 ---
QUESTION: What is the minimum voltage threshold for a system to be classified as Low Voltage (LV), exceeding extra-low voltage?...

GROUND TRUTH: A voltage exceeding 50 V is considered to be within the Low Voltage (LV) range, above extra-low voltage....


In [47]:
# Finish WandB run
wandb.finish()

print("\n" + "="*80)
print("TRAINING AND EVALUATION COMPLETE!")
print("="*80)
print(f"Fine-tuned model saved to: {HF_MODEL_OUTPUT_ID}")
print(f"WandB project: {WANDB_PROJECT}")

metric,history
baseline_bleu,▁
baseline_latency,▁
baseline_rouge1,▁
baseline_rouge2,▁
baseline_rougeL,▁
baseline_throughput,▁
eval/entropy,█▃▁
eval/loss,█▁▅
eval/mean_token_accuracy,▁█▇
eval/num_tokens,▁▅█

metric,final value
baseline_bleu,0.0182
baseline_latency,19.91287
baseline_rouge1,0.17638
baseline_rouge2,0.05167
baseline_rougeL,0.11409
baseline_throughput,0.05022
eval/entropy,0.57231
eval/loss,0.90832
eval/mean_token_accuracy,0.78721
eval/num_tokens,147777



TRAINING AND EVALUATION COMPLETE!
Fine-tuned model saved to: sayedsalem/qwen2.5-7b-grid-compliance
WandB project: grid-compliance-qwen-qlora
