# Customer Support LLM Fine-Tuning (QLoRA)

This notebook walks through the full pipeline from **main.py**: loading the Bitext Customer Support dataset, preparing it for chat-style training, fine-tuning **Llama 3.2 3B** with **QLoRA**, evaluating the model, and running a Gradio demo.

## Overview

| Item | Value |
|------|-------|
| **Base model** | meta-llama/Llama-3.2-3B |
| **Technique** | QLoRA (4-bit + LoRA) or float16/bf16 + LoRA on Mac |
| **Dataset** | Bitext Customer Support (Hugging Face) |
| **Output** | Adapter weights + tokenizer in `output_dir` |
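
As a rough sense of why 4-bit loading matters, the memory taken by the weights alone can be estimated from the parameter count. The sketch below assumes about 3.2 billion parameters for the 3B model (an approximation, not a value read from **main.py**) and ignores activations, optimizer state, LoRA weights, and quantization overhead.

```python
# Back-of-the-envelope weight memory for the base model.
# ASSUMPTION: ~3.2e9 parameters; replace with the real count for your checkpoint.
APPROX_PARAMS = 3.2e9

BYTES_PER_PARAM = {
    "float32": 4.0,
    "float16/bf16": 2.0,
    "4-bit (QLoRA)": 0.5,
}


def weight_memory_gib(num_params: float, bytes_per_param: float) -> float:
    """Return the memory needed for the weights alone, in GiB."""
    return num_params * bytes_per_param / 1024**3


for name, nbytes in BYTES_PER_PARAM.items():
    print(f"{name:>14}: {weight_memory_gib(APPROX_PARAMS, nbytes):.1f} GiB")
```

Under that assumption the 4-bit weights need roughly a quarter of the float16 footprint, which is what makes training feasible on a single consumer GPU.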

## Notebook structure

1. **Setup** – Install/import dependencies and set paths.
2. **Load & explore data** – Load the dataset and inspect schema and samples.
3. **Prepare dataset** – Format as chat (system / user / assistant) and split train/eval.
4. **Train** – Run QLoRA fine-tuning (optionally on a subset for quick runs).
5. **Evaluate** – Compare base vs fine-tuned on custom prompts and save results.
6. **Demo** – Launch Gradio chat UI with the fine-tuned adapter or **Ollama** (localhost).

To run evaluation and the demo against a local Ollama server instead, set `USE_OLLAMA = True`, start `ollama serve`, and pull the model named in `OLLAMA_MODEL` (for example `ollama pull gemma3:12b`).

---
## 1. Setup

Ensure dependencies are installed (`pip install -r requirements.txt`). We import from **main.py** so the notebook stays in sync with the script. Run this cell first.

In [None]:
# Optional: install dependencies if running in a fresh environment (e.g. Colab)
# !pip install -q torch transformers datasets accelerate bitsandbytes peft trl huggingface-hub gradio

import sys
from pathlib import Path

# Add project root so we can import main
ROOT = Path.cwd()
if str(ROOT) not in sys.path:
    sys.path.insert(0, str(ROOT))

import torch
from main import (
    DATASET_NAME,
    MODEL_NAME,
    TRAINING_CONFIG,
    TEST_PROMPTS,
    OLLAMA_MODEL_DEFAULT,
    get_ollama_models,
    ensure_ollama_model,
    prepare_dataset,
    train,
    evaluate,
    run_demo,
    load_model_for_inference,
    generate_response,
    generate_response_ollama,
    generate_evaluation_report,
)

print(f"PyTorch: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")
try:
    print(f"MPS available: {torch.backends.mps.is_available()}")
except AttributeError:
    print("MPS: N/A")

### Configuration (edit these for your run)

- **SUBSET_SIZE**: `None` = full dataset; `500` = quick test on 500 train + 50 eval.
- **OUTPUT_DIR**: Where the adapter and tokenizer are saved after training.
- **ADAPTER_PATH**: Path to load for evaluation and demo (when not using Ollama).
- **USE_OLLAMA**: `True` = use Ollama on localhost for evaluate & demo (no Hugging Face model load).
- **OLLAMA_MODEL**: Model name in Ollama (default `gemma3:12b`). Pull with: `ollama pull gemma3:12b`.

In [None]:
SUBSET_SIZE = 500          # None for full dataset; 500 for quick test
OUTPUT_DIR = "output/customer-support-llm"
ADAPTER_PATH = "output/customer-support-llm"   # Use after training, or your own path
EVAL_OUTPUT_JSON = "evaluation_results.json"

# Use Ollama (localhost) for inference instead of Hugging Face model
USE_OLLAMA = True
OLLAMA_MODEL = "gemma3:12b"   # Local model; pull with: ollama pull gemma3:12b

In [None]:
# When using Ollama: list local models and verify the selected one is available
if USE_OLLAMA:
    models = get_ollama_models()
    if models:
        print("Locally available Ollama models:", models)
        ensure_ollama_model(OLLAMA_MODEL)
        print(f"Using: {OLLAMA_MODEL}")
    else:
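        # `ollama serve` keeps running in the foreground, so pull from a second terminal
        print(f"Ollama is not running or has no models. Run 'ollama serve', then in another terminal: ollama pull {OLLAMA_MODEL}")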
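
The implementation of `get_ollama_models` lives in **main.py** and is not shown here. As a minimal sketch of how such a check could work, the code below queries Ollama's `/api/tags` endpoint on the default local port. The names `parse_model_names` and `list_local_models` are hypothetical and may not match the helpers in **main.py**.

```python
import json
import urllib.error
import urllib.request

OLLAMA_URL = "http://localhost:11434"  # Ollama's default local address


def parse_model_names(payload: dict) -> list[str]:
    """Extract model names from a /api/tags response body."""
    return [m["name"] for m in payload.get("models", []) if "name" in m]


def list_local_models(base_url: str = OLLAMA_URL, timeout: float = 3.0) -> list[str]:
    """Return locally pulled model names, or [] if the server is unreachable."""
    try:
        with urllib.request.urlopen(f"{base_url}/api/tags", timeout=timeout) as resp:
            return parse_model_names(json.load(resp))
    except (urllib.error.URLError, OSError, ValueError):
        # Connection refused, timeout, or a body that is not valid JSON
        return []
```

Calling `list_local_models()` with no server running returns an empty list rather than raising, which mirrors the "not running or no models" branch in the cell above.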

---
## 2. Load and explore the dataset

The **Bitext Customer Support** dataset has one split (`train`) with columns: `instruction` (customer query), `response` (agent reply), `category`, `intent`, `flags`. We load it and show a few rows.

In [None]:
from datasets import load_dataset

raw = load_dataset(DATASET_NAME)
ds_train = raw["train"]

print(f"Dataset: {DATASET_NAME}")
print(f"Total rows: {len(ds_train)}")
print(f"Columns: {ds_train.column_names}")
print(f"Features: {ds_train.features}")

In [None]:
# Sample rows (long fields are truncated for display)
def preview(text, limit):
    """Return text cut to `limit` characters, adding an ellipsis only if it was cut."""
    return text if len(text) <= limit else text[:limit] + "..."

for i in range(3):
    row = ds_train[i]
    print(f"--- Example {i} ---")
    print(f"Instruction: {preview(row['instruction'], 120)}")
    print(f"Response: {preview(row['response'], 150)}")
    print(f"Category: {row['category']}, Intent: {row['intent']}")
    print()
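
Beyond individual rows, it can help to see how examples are distributed across `category` or `intent`. The sketch below uses a hypothetical helper, `count_by`, on a few stand-in rows with illustrative values; with the real data you would pass `ds_train` instead.

```python
from collections import Counter


def count_by(rows, key):
    """Count how many rows share each value of `key`."""
    return Counter(row[key] for row in rows)


# Stand-in rows with the same column names as the dataset (illustrative values only).
sample_rows = [
    {"instruction": "cancel my order", "category": "ORDER", "intent": "cancel_order"},
    {"instruction": "where is my refund", "category": "REFUND", "intent": "track_refund"},
    {"instruction": "change my order", "category": "ORDER", "intent": "change_order"},
]

print(count_by(sample_rows, "category").most_common())
# → [('ORDER', 2), ('REFUND', 1)]
```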

---
## 3. Prepare dataset for training

We convert each example into a **chat** format with three roles:
- **system**: "You are a helpful customer support assistant."
- **user**: the `instruction` (customer query)
- **assistant**: the `response` (target reply)

Then we split into train/eval (default 95% / 5%). Optionally we use a **subset** for fast iteration.
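
The formatting and split described above can be sketched in plain Python. This is not the code of `prepare_dataset`. The helper names, the seed of 42, and the toy rows are assumptions made for illustration; only the system prompt, the column names, and the 5% eval share come from this notebook.

```python
import random

SYSTEM_PROMPT = "You are a helpful customer support assistant."


def to_chat(row: dict) -> dict:
    """Turn one instruction/response row into a three-message chat example."""
    return {
        "messages": [
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": row["instruction"]},
            {"role": "assistant", "content": row["response"]},
        ]
    }


def split_train_eval(examples: list, test_size: float = 0.05, seed: int = 42):
    """Shuffle deterministically and hold out `test_size` of the examples."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    n_eval = max(1, round(len(shuffled) * test_size))
    return shuffled[n_eval:], shuffled[:n_eval]


# Toy rows standing in for the real dataset
rows = [{"instruction": f"question {i}", "response": f"answer {i}"} for i in range(100)]
train_examples, eval_examples = split_train_eval([to_chat(r) for r in rows])
print(len(train_examples), len(eval_examples))
# → 95 5
```

Fixing the seed keeps the held-out examples the same from run to run, so evaluation numbers stay comparable.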

In [None]:
train_ds, eval_ds = prepare_dataset(
    subset_size=SUBSET_SIZE,
    seed=TRAINING_CONFIG["seed"],
    test_size=TRAINING_CONFIG["test_size"],
)

print(f"Train size: {len(train_ds)}, Eval size: {len(eval_ds)}")

In [None]:
# Inspect one formatted example (what SFTTrainer will see)
ex = train_ds[0]
print("Keys:", ex.keys())
for msg in ex["messages"]:
    print(f"  {msg['role']}: {msg['content'][:80]}...")

---
## 4. Train the model (QLoRA)

This cell loads the base model (4-bit on CUDA, or float16/bf16 on Mac), attaches LoRA adapters, and runs **SFTTrainer** with the config from **main.py** (max length 1024, 1 epoch, etc.).

- **With GPU (CUDA)**: Uses QLoRA (4-bit quantization + LoRA).
- **On Mac (no CUDA)**: Skips 4-bit quantization and uses float16/bf16 + LoRA; batch size is reduced to avoid out-of-memory errors.
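
The hardware branching above could look roughly like the sketch below. The function names and every hyperparameter value (rank, alpha, dropout, target modules) are illustrative assumptions, not the settings in **main.py**. The `BitsAndBytesConfig` and `LoraConfig` calls are the standard ones from `transformers` and `peft`, and they are imported lazily so the selection logic can be read and run without a GPU stack.

```python
def choose_load_settings(cuda: bool, mps: bool, bf16_ok: bool) -> dict:
    """Pick quantization and dtype for the available hardware."""
    if cuda:
        # QLoRA path: 4-bit weights, compute in bf16 when the GPU supports it
        return {"load_in_4bit": True, "dtype": "bfloat16" if bf16_ok else "float16"}
    if mps:
        # Mac path: no 4-bit quantization, half precision + LoRA (assumed dtype)
        return {"load_in_4bit": False, "dtype": "float16"}
    return {"load_in_4bit": False, "dtype": "float32"}  # CPU fallback


# Keyword arguments for peft.LoraConfig; the values are illustrative assumptions.
LORA_KWARGS = {
    "r": 16,
    "lora_alpha": 32,
    "lora_dropout": 0.05,
    "bias": "none",
    "task_type": "CAUSAL_LM",
    "target_modules": ["q_proj", "k_proj", "v_proj", "o_proj"],
}


def build_configs(settings: dict):
    """Build the config objects; imports are deferred until they are needed."""
    import torch
    from peft import LoraConfig
    from transformers import BitsAndBytesConfig

    quant = None
    if settings["load_in_4bit"]:
        quant = BitsAndBytesConfig(
            load_in_4bit=True,
            bnb_4bit_quant_type="nf4",
            bnb_4bit_use_double_quant=True,
            bnb_4bit_compute_dtype=getattr(torch, settings["dtype"]),
        )
    return quant, LoraConfig(**LORA_KWARGS)
```

In a notebook, the inputs would typically come from `torch.cuda.is_available()`, `torch.backends.mps.is_available()`, and `torch.cuda.is_bf16_supported()`.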

Training can take a while; using `SUBSET_SIZE = 500` keeps it short for testing.

In [None]:
train(
    output_dir=OUTPUT_DIR,
    subset_size=SUBSET_SIZE,
)

After training, the adapter and tokenizer are saved under `OUTPUT_DIR`. You can use this path for evaluation and the demo.

### Evaluation report (HTML)

This cell reads the JSON file written by the **Evaluate** cell (section 5), so run that cell first. It generates a self-contained HTML report with training stats (LoRA/QLoRA), evaluation metrics (relevance, length), scalability and maintainability notes, and a before/after comparison table. Open **evaluation_report.html** in a browser to view it.

In [None]:
# Generate HTML report (optional: pass training_stats for LoRA/QLoRA stats and loss curve)
generate_evaluation_report(
    evaluation_path=EVAL_OUTPUT_JSON,
    output_html_path="evaluation_report.html",
    training_stats_path=f"{OUTPUT_DIR}/training_stats.json" if not USE_OLLAMA else None,
)
# Open evaluation_report.html in your browser to view the report.

---
## 5. Evaluation

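We compare the **base** and **fine-tuned** models on 15 custom test prompts (not from the training set). With `USE_OLLAMA = True`, the prompts are answered by the local Ollama model instead, so only its output is recorded. Results are written to a JSON file and a short summary is printed.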

In [None]:
evaluate(
    adapter_path=ADAPTER_PATH,
    output_path=EVAL_OUTPUT_JSON,
    use_ollama=USE_OLLAMA,
    ollama_model=OLLAMA_MODEL,
)
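
The report mentions relevance and length metrics, but their exact definitions are in **main.py** and not shown here. Purely as an illustration of what such metrics might compute, the sketch below scores relevance as the share of prompt words that reappear in the response and measures length in words. Both definitions and all function names are assumptions.

```python
import re


def tokenize(text: str) -> set[str]:
    """Lowercase the text and return its set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def relevance(prompt: str, response: str) -> float:
    """Fraction of prompt words that also appear in the response (0.0 to 1.0)."""
    prompt_words = tokenize(prompt)
    if not prompt_words:
        return 0.0
    return len(prompt_words & tokenize(response)) / len(prompt_words)


def summarize(prompt: str, response: str) -> dict:
    """Bundle both metrics for one prompt/response pair."""
    return {
        "relevance": round(relevance(prompt, response), 2),
        "length_words": len(response.split()),
    }


print(summarize(
    "How do I cancel my order?",
    "To cancel your order, open My Orders and select Cancel.",
))
# → {'relevance': 0.5, 'length_words': 10}
```

Word overlap is a crude proxy: it rewards responses that echo the question and says nothing about correctness, so it is best read alongside the side-by-side outputs.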

In [None]:
# Optional: load and display a few results (JSON has "results" + "metrics")
import json
with open(EVAL_OUTPUT_JSON) as f:
    data = json.load(f)
results = data["results"] if isinstance(data, dict) and "results" in data else data
if isinstance(data, dict) and data.get("metrics"):
    print("Metrics:", data["metrics"])
for r in results[:3]:
    print("Prompt:", r["prompt"])
    if "ollama_output" in r:
        print("Ollama (first 200 chars):", r["ollama_output"][:200])
    else:
        print("Base (first 200 chars):", r["base_output"][:200])
        print("Fine-tuned (first 200 chars):", r["fine_tuned_output"][:200])
    print("-" * 60)

---
## 6. Interactive demo (Gradio)

Launches a web UI. With **Ollama** (`USE_OLLAMA = True`) it uses your local Ollama model; otherwise it loads the fine-tuned adapter from `ADAPTER_PATH`. The **full chat history** is sent on every turn so the model keeps the context of earlier messages. Use **Max response tokens** to trade response length for speed.

In [None]:
run_demo(
    adapter_path=ADAPTER_PATH,
    use_ollama=USE_OLLAMA,
    ollama_model=OLLAMA_MODEL,
)

### Optional: multi-turn inference with chat history

Same as the demo: pass **chat_history** so the model sees the full conversation. Example with Ollama:

In [None]:
# Example: two turns with Ollama (full history sent each time)
if USE_OLLAMA:
    history = []
    for user_msg in ["I want to cancel my order #12345.", "What do I do next?"]:
        reply = generate_response_ollama(
            user_msg,
            model_name=OLLAMA_MODEL,
            chat_history=history,
            max_tokens=128,
        )
        history.append({"role": "user", "content": user_msg})
        history.append({"role": "assistant", "content": reply})
        print(f"User: {user_msg}")
        print(f"Assistant: {reply[:200]}...")
        print()
# With Hugging Face: use generate_response(..., chat_history=history) the same way.
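
To make the history-passing pattern above concrete, here is a sketch of the request body that a helper like `generate_response_ollama` might send to Ollama's `/api/chat` endpoint. The function `build_chat_payload` is hypothetical, and whether **main.py** prepends a system message in the Ollama path is an assumption. The `messages`, `stream`, and `options.num_predict` fields are part of Ollama's chat API.

```python
def build_chat_payload(user_msg, chat_history, model, max_tokens=128,
                       system_prompt="You are a helpful customer support assistant."):
    """Assemble an Ollama /api/chat request body containing the full conversation."""
    messages = [{"role": "system", "content": system_prompt}]
    messages.extend(chat_history)  # earlier user/assistant turns, oldest first
    messages.append({"role": "user", "content": user_msg})
    return {
        "model": model,
        "messages": messages,
        "stream": False,  # ask for a single JSON response instead of a stream
        "options": {"num_predict": max_tokens},  # cap on generated tokens
    }


history = [
    {"role": "user", "content": "I want to cancel my order #12345."},
    {"role": "assistant", "content": "Sure, I can help with that."},
]
payload = build_chat_payload("What do I do next?", history, model="gemma3:12b")
print([m["role"] for m in payload["messages"]])
# → ['system', 'user', 'assistant', 'user']
```

Because the server keeps no conversation state between requests, leaving out `chat_history` would make the second question ("What do I do next?") unanswerable in context.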

---
## Summary

| Step | What it does |
|------|----------------|
| **main.py** | CLI: `train`, `evaluate`, `demo` with the same logic as this notebook. |
| **Data** | Bitext → chat format (system/user/assistant) → train/eval split. |
| **Train** | QLoRA (or LoRA on Mac) with SFTTrainer; adapter saved to `OUTPUT_DIR`. |
| **Evaluate** | Base vs fine-tuned on 15 prompts → `evaluation_results.json` (with metrics). |
| **Report** | `report --evaluation ... --training_stats ...` → HTML evaluation report. |
| **Demo** | Gradio UI; full chat history sent each turn (Ollama or HF). |

For more detail, see **PLAN.md**, **Instructions.md**, and **cutomer-support-plan.txt**.