# QLoRA Fine-Tuning for Policy Compliance LLM

This notebook fine-tunes Llama 3.1 8B Instruct with QLoRA (LoRA adapters trained on top of a 4-bit quantized base model) on policy compliance training data.

**Requirements:**
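- Google Colab with GPU (a free-tier T4 works for training; the fp16 merge in section 8 needs roughly 16 GB for the weights alone and may need a larger runtime)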
- HuggingFace account with Llama 3.1 access
- Training data (upload `training_data_augmented.jsonl`)

**Runtime:** ~2-3 hours on T4 GPU

## 1. Setup Environment

In [None]:
# Install required packages
!pip install -q torch transformers accelerate peft bitsandbytes trl datasets sentencepiece

In [None]:
# Check GPU
import torch
print(f"CUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"GPU: {torch.cuda.get_device_name(0)}")
    print(f"Memory: {torch.cuda.get_device_properties(0).total_memory / 1e9:.1f} GB")

In [None]:
# Login to HuggingFace (required for Llama 3.1)
from huggingface_hub import login
login()  # Enter your HuggingFace token

## 2. Upload Training Data

Upload your `training_data_augmented.jsonl` file (546 examples).

In [None]:
from google.colab import files
import json

# Upload training data
uploaded = files.upload()

# Check uploaded file
for filename in uploaded.keys():
    print(f"Uploaded: {filename}")
    with open(filename, 'r') as f:
        lines = f.readlines()
        print(f"Total examples: {len(lines)}")
        print(f"Sample: {json.loads(lines[0])}")
    DATA_FILE = filename
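
Before training, it can help to confirm that every record carries the fields the formatting step in section 5 reads. The sketch below uses a hypothetical helper, `validate_jsonl`, that is not part of the original notebook. It assumes each line is a JSON object with either `question`/`answer` or `instruction`/`output` keys, which are the names section 5 looks up.

```python
import json

def validate_jsonl(lines):
    """Return (valid_records, problems) for an iterable of JSONL lines.

    Hypothetical helper: a record counts as valid when it has a non-empty
    prompt ('question' or 'instruction') and a non-empty reply
    ('answer' or 'output').
    """
    valid, problems = [], []
    for number, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # blank lines are skipped, as in the loading cell
        try:
            record = json.loads(line)
        except json.JSONDecodeError as err:
            problems.append((number, f"invalid JSON: {err.msg}"))
            continue
        if not isinstance(record, dict):
            problems.append((number, "record is not a JSON object"))
            continue
        prompt = record.get("question") or record.get("instruction")
        reply = record.get("answer") or record.get("output")
        if prompt and reply:
            valid.append(record)
        else:
            problems.append((number, "missing prompt or reply field"))
    return valid, problems

# Small in-memory sample; the last two lines are deliberately broken.
sample = [
    '{"question": "Q1", "answer": "A1"}',
    '{"instruction": "Q2", "output": "A2"}',
    '{"question": "Q3"}',
    'not json',
]
valid, problems = validate_jsonl(sample)
print(f"{len(valid)} valid, {len(problems)} problems")
for number, reason in problems:
    print(f"  line {number}: {reason}")
```

In the notebook, the same helper could be applied to the uploaded file with `with open(DATA_FILE) as f: valid, problems = validate_jsonl(f)`.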

## 3. Load Model with 4-bit Quantization

In [None]:
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
import torch

# Model configuration
MODEL_NAME = "meta-llama/Meta-Llama-3.1-8B-Instruct"

# 4-bit quantization config (QLoRA)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_use_double_quant=True,
)

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
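# The tokenizer needs a pad token for batching. Reusing EOS is a common shortcut, but the
# collator can then mask EOS in the labels; "<|finetune_right_pad_id|>" is a dedicated alternative.
tokenizer.pad_token = tokenizer.eos_token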
tokenizer.padding_side = "right"

print("Loading model with 4-bit quantization...")
model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME,
    quantization_config=bnb_config,
    device_map="auto",
    torch_dtype=torch.float16,
)

# Prepare for training
model = prepare_model_for_kbit_training(model)
model.gradient_checkpointing_enable()

print(f"Model loaded! Parameters: {model.num_parameters():,}")
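
A rough back-of-the-envelope calculation shows why 4-bit loading matters on a 16 GB GPU. The sketch below is an estimate under stated assumptions, not a measurement: the parameter count is approximate, and the block sizes (64 weights per quantization block, 256 blocks per second-level block) are the values described for QLoRA with double quantization. It also ignores layers that stay in 16-bit (such as embeddings), activations, and optimizer state, so real usage is higher.

```python
def weight_memory_gb(n_params, bits_per_param):
    """Memory for n_params weights stored at bits_per_param bits, in GB (1e9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9

N_PARAMS = 8.03e9  # approximate parameter count of Llama 3.1 8B (assumption)

# Without double quantization: one 32-bit constant per 64-weight block.
overhead_plain = 32 / 64
# With double quantization: 8-bit constants, plus one 32-bit constant per 256 blocks.
overhead_double = 8 / 64 + 32 / (64 * 256)

fp16_gb = weight_memory_gb(N_PARAMS, 16)
nf4_gb = weight_memory_gb(N_PARAMS, 4 + overhead_double)

print(f"Constant overhead: {overhead_plain:.3f} -> {overhead_double:.3f} bits per weight")
print(f"fp16 weights: {fp16_gb:.1f} GB")
print(f"NF4 weights (double quant): {nf4_gb:.1f} GB")
```

This is the arithmetic behind `bnb_4bit_use_double_quant=True`: the quantization constants themselves are compressed, which saves a few hundred megabytes on a model of this size.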

## 4. Configure LoRA

In [None]:
# LoRA configuration
lora_config = LoraConfig(
    r=64,                    # LoRA rank
    lora_alpha=128,          # Scaling factor
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj"
    ],
)

# Apply LoRA
model = get_peft_model(model, lora_config)

# Print trainable parameters
trainable_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
all_params = sum(p.numel() for p in model.parameters())
print(f"Trainable: {trainable_params:,} ({100 * trainable_params / all_params:.2f}%)")
print(f"Total: {all_params:,}")
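
The trainable parameter count can be cross-checked by hand. Each adapted linear layer with input size `d_in` and output size `d_out` gains two matrices, A (`r x d_in`) and B (`d_out x r`), so it adds `r * (d_in + d_out)` parameters. The sketch below assumes the published Llama 3.1 8B dimensions (hidden size 4096, MLP size 14336, 32 layers, 8 key/value heads of dimension 128); the constant and function names are illustrative.

```python
def lora_params_per_module(d_in, d_out, r):
    """Parameters added by one LoRA adapter: A is (r x d_in), B is (d_out x r)."""
    return r * (d_in + d_out)

# Assumed Llama 3.1 8B dimensions.
HIDDEN = 4096
INTERMEDIATE = 14336
KV_DIM = 8 * 128  # 8 key/value heads of dimension 128 (grouped-query attention)
NUM_LAYERS = 32

# (d_in, d_out) for every module listed in target_modules.
MODULE_SHAPES = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}

def total_lora_params(r):
    per_layer = sum(
        lora_params_per_module(d_in, d_out, r)
        for d_in, d_out in MODULE_SHAPES.values()
    )
    return NUM_LAYERS * per_layer

print(f"Expected trainable parameters at r=64: {total_lora_params(64):,}")
```

If the count printed by the cell above differs, the target modules or rank probably do not match this configuration. The printed percentage may look higher than expected because 4-bit weights are stored packed, which can shrink the `numel()` total used as the denominator.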

## 5. Prepare Training Data

In [None]:
from datasets import Dataset
import json

# Load training data
data = []
with open(DATA_FILE, 'r') as f:
    for line in f:
        if line.strip():
            data.append(json.loads(line))

print(f"Loaded {len(data)} examples")

# Prompt template for Llama 3.1. The tokenizer adds <|begin_of_text|> itself,
# so the template omits it to avoid a duplicated BOS token.
PROMPT_TEMPLATE = """<|start_header_id|>system<|end_header_id|>

You are a compliance assistant specializing in corporate policies. You provide accurate, helpful answers about policy requirements, procedures, and best practices. Always cite specific policy requirements when applicable.<|eot_id|><|start_header_id|>user<|end_header_id|>

{instruction}<|eot_id|><|start_header_id|>assistant<|end_header_id|>

{response}<|eot_id|>"""

def format_example(example):
    instruction = example.get('question', example.get('instruction', ''))
    response = example.get('answer', example.get('output', ''))
    return {"text": PROMPT_TEMPLATE.format(instruction=instruction, response=response)}

# Create dataset
formatted_data = [format_example(d) for d in data]
dataset = Dataset.from_list(formatted_data)

# Split into train/eval
split = dataset.train_test_split(test_size=0.1, seed=42)
train_dataset = split['train']
eval_dataset = split['test']

print(f"Train: {len(train_dataset)}, Eval: {len(eval_dataset)}")
print(f"\nSample:\n{train_dataset[0]['text'][:500]}...")

## 6. Train with QLoRA

In [None]:
from transformers import TrainingArguments
from trl import SFTTrainer

# Training arguments
training_args = TrainingArguments(
    output_dir="./policy-llama-qlora",
    
    # Batch settings (adjust for your GPU memory)
    per_device_train_batch_size=2,
    per_device_eval_batch_size=2,
    gradient_accumulation_steps=8,  # Effective batch = 16
    
    # Training duration
    num_train_epochs=3,
    
    # Learning rate
    learning_rate=2e-4,
    weight_decay=0.01,
    warmup_ratio=0.03,
    lr_scheduler_type="cosine",
    
    # Optimization
    optim="paged_adamw_32bit",
    max_grad_norm=0.3,
    
    # Logging
    logging_steps=10,
    report_to="none",
    
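    # Evaluation (546 examples give fewer than 100 optimizer steps in total,
    # so the intervals must be small enough to trigger during the run)
    eval_strategy="steps",  # named `evaluation_strategy` in older transformers releases
    eval_steps=20,
    
    # Checkpointing (save_steps must be a multiple of eval_steps
    # when load_best_model_at_end=True)
    save_strategy="steps",
    save_steps=20,
    save_total_limit=2,
    load_best_model_at_end=True,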
    
    # Memory optimization
    fp16=True,
    gradient_checkpointing=True,
    
    # Misc
    seed=42,
    remove_unused_columns=False,
)

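# Create trainer. These keyword arguments follow older TRL releases; newer
# releases expect dataset_text_field, max_seq_length and packing on SFTConfig
# and take the tokenizer as processing_class.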
trainer = SFTTrainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    tokenizer=tokenizer,
    dataset_text_field="text",
    max_seq_length=2048,
    packing=False,
)

print("Trainer ready! Starting training...")
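
With only a few hundred examples the run is short, so it is worth checking how many optimizer steps it contains before choosing `eval_steps` and `save_steps`: an interval longer than the run never triggers. The sketch below is approximate arithmetic for a single GPU, assuming about 491 training examples (90% of the 546 mentioned in section 2). The Trainer's exact rounding may differ by a few steps, and the function names are illustrative.

```python
import math

def training_steps(n_train, per_device_batch, grad_accum, epochs):
    """Approximate number of optimizer steps on a single GPU."""
    batches_per_epoch = math.ceil(n_train / per_device_batch)
    steps_per_epoch = math.ceil(batches_per_epoch / grad_accum)
    return steps_per_epoch * epochs

def steps_that_fire(interval, total_steps):
    """Global steps at which an every-`interval`-steps action runs."""
    return list(range(interval, total_steps + 1, interval))

total = training_steps(n_train=491, per_device_batch=2, grad_accum=8, epochs=3)
print(f"Approximate optimizer steps: {total}")
for interval in (20, 50, 100):
    print(f"  interval {interval:>3}: fires at {steps_that_fire(interval, total)}")
```

An interval of 100 produces an empty list here, which means no intermediate checkpoint would exist for `load_best_model_at_end` to restore.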

In [None]:
# Run training
trainer.train()

In [None]:
# Save the trained adapter
trainer.save_model("./policy-llama-qlora/final")
tokenizer.save_pretrained("./policy-llama-qlora/final")
print("Model saved!")

## 7. Test the Fine-Tuned Model

In [None]:
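# Test the fine-tuned model
# Use the same system prompt as in training so the test matches the fine-tuning format.
SYSTEM_PROMPT = (
    "You are a compliance assistant specializing in corporate policies. "
    "You provide accurate, helpful answers about policy requirements, procedures, "
    "and best practices. Always cite specific policy requirements when applicable."
)

def generate_response(prompt, max_new_tokens=256):
    text = (
        f"<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\n{SYSTEM_PROMPT}<|eot_id|>"
        f"<|start_header_id|>user<|end_header_id|>\n\n{prompt}<|eot_id|>"
        "<|start_header_id|>assistant<|end_header_id|>\n\n"
    )
    # The prompt already contains <|begin_of_text|>, so the tokenizer must not add a second one.
    inputs = tokenizer(text, return_tensors="pt", add_special_tokens=False).to(model.device)

    model.eval()
    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            max_new_tokens=max_new_tokens,
            temperature=0.7,
            do_sample=True,
            pad_token_id=tokenizer.eos_token_id,
        )

    # Decode only the newly generated tokens, not the prompt. Splitting the decoded
    # text on the word "assistant" would break whenever the answer contains that word.
    new_tokens = outputs[0][inputs["input_ids"].shape[1]:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True).strip()

# Test questions
test_questions = [
    "What are the key requirements for handling personal data?",
    "What should I do if I suspect a security breach?",
    "Can I use personal cloud storage for work files?",
    "How long should confidential information be protected after leaving the company?",
]

for q in test_questions:
    print(f"\n{'='*60}")
    print(f"Q: {q}")
    print(f"A: {generate_response(q)}")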

## 8. Merge Adapter & Export

In [None]:
from peft import PeftModel
from transformers import AutoModelForCausalLM

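# Load the unquantized base model in fp16 for merging (roughly 16 GB of weights for 8B
# parameters; restart the runtime or free the training model first if memory runs out)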
print("Loading base model for merging...")
base_model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME,
    torch_dtype=torch.float16,
    device_map="auto",
)

# Load adapter and merge
print("Merging adapter...")
merged_model = PeftModel.from_pretrained(base_model, "./policy-llama-qlora/final")
merged_model = merged_model.merge_and_unload()

# Save merged model
print("Saving merged model...")
merged_model.save_pretrained("./policy-llama-merged", safe_serialization=True)
tokenizer.save_pretrained("./policy-llama-merged")

print("Merged model saved to ./policy-llama-merged")
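
Merging folds each adapter into its base weight: the merged matrix is `W + (lora_alpha / r) * B @ A`, so the merged layer produces the same output as the base layer plus the adapter path, with no adapter code needed at inference time. The toy example below illustrates that identity in plain Python with tiny matrices. It sketches the arithmetic only, not PEFT's implementation, and `matmul` and `merge_lora` are illustrative names.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def merge_lora(w, lora_a, lora_b, lora_alpha, r):
    """Return W + (lora_alpha / r) * (B @ A) for W (out x in), A (r x in), B (out x r)."""
    scaling = lora_alpha / r
    delta = matmul(lora_b, lora_a)
    return [
        [w_ij + scaling * d_ij for w_ij, d_ij in zip(w_row, d_row)]
        for w_row, d_row in zip(w, delta)
    ]

# Toy layer: 2 inputs, 2 outputs, rank 1, scaling 2.0 (same ratio as alpha=128, r=64).
w = [[1.0, 0.0], [0.0, 1.0]]
lora_a = [[0.5, -0.5]]
lora_b = [[1.0], [2.0]]
merged = merge_lora(w, lora_a, lora_b, lora_alpha=2, r=1)

# Check that the merged layer matches base output plus scaled adapter output.
x = [[3.0], [4.0]]
base_out = matmul(w, x)
adapter_out = matmul(lora_b, matmul(lora_a, x))
unmerged_out = [[b[0] + 2.0 * a[0]] for b, a in zip(base_out, adapter_out)]
merged_out = matmul(merged, x)

print("merged weights:", merged)
print("outputs match:", merged_out == unmerged_out)
```

Because the sum needs real-valued base weights, the merge step loads the base model in fp16 rather than reusing the 4-bit copy from training.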

In [None]:
# Download the merged model
!zip -r policy-llama-merged.zip ./policy-llama-merged
files.download('policy-llama-merged.zip')

## 9. Convert to GGUF (for Ollama)

Run this section to convert the merged model to GGUF format for use with Ollama.

In [None]:
# Install llama.cpp for GGUF conversion
!git clone https://github.com/ggerganov/llama.cpp
!cd llama.cpp && pip install -r requirements.txt

In [None]:
# Convert to GGUF
!python llama.cpp/convert_hf_to_gguf.py ./policy-llama-merged --outfile policy-compliance-llm-f16.gguf --outtype f16

print("GGUF file created!")

In [None]:
# Download GGUF file
files.download('policy-compliance-llm-f16.gguf')

## 10. Create Ollama Model (Local)

After downloading the GGUF file, run these commands locally from the directory that contains it:

```bash
# Create Modelfile
cat > Modelfile << 'EOF'
FROM ./policy-compliance-llm-f16.gguf

PARAMETER temperature 0.7
PARAMETER top_p 0.9
PARAMETER num_ctx 4096

SYSTEM """You are a compliance assistant specializing in corporate policies. You provide accurate, helpful answers about policy requirements, procedures, and best practices. Always cite specific policy requirements when applicable."""
EOF

# Create Ollama model
ollama create policy-compliance-llm -f Modelfile

# Test it
ollama run policy-compliance-llm "What are the key data privacy requirements?"
```
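
Once the model exists in Ollama, it can also be queried from Python through Ollama's local HTTP API. The sketch below uses only the standard library and assumes the server runs at its default address, `http://localhost:11434`, with the model name created above. `build_request` and `ask` are illustrative helper names, not part of any library.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # default local address (assumption)

def build_request(prompt, model="policy-compliance-llm"):
    """Build a non-streaming generate request for the local Ollama server."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def ask(prompt, model="policy-compliance-llm", timeout=120):
    """Send the prompt and return the generated text."""
    with urllib.request.urlopen(build_request(prompt, model), timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))["response"]

# Example (requires a running Ollama server with the model created above):
# print(ask("What are the key data privacy requirements?"))
```

With `"stream": False` the server returns a single JSON object, which keeps the client short; streaming responses would need line-by-line parsing instead.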