# Low-VRAM PEFT (LoRA) fine-tuning with Pi-Scorer on a small Unsloth model

Welcome! This notebook shows how to fine-tune a small open model with **LoRA/PEFT** on a tiny, realistic dataset,
evaluate with **Pi-Scorer** (fast, deterministic scorers), and optionally upload an evaluation set to **Azure AI Foundry**.

You will:
1. Install compatible libraries.
2. Load a 4-bit Unsloth model and attach LoRA adapters (low VRAM).
3. Create a small **customer-support** dataset.
4. (Bootstrap) Generate teacher replies once to create supervision labels for SFT.
5. Run **SFT training** with LoRA.
6. Save the LoRA adapter.
7. Generate answers and score them with **Pi-Scorer**.
8. (Optional) Upload a JSONL to **Azure AI Foundry** and launch a cloud evaluation.

### Hardware notes
- A single 16–24 GB GPU (A10, L4) is enough.
- On **ND H100 v5** (H100 Tensor Core) in Azure, `bf16` is used automatically where possible.

### Why SFT + LoRA first?
Reinforcement learning (GRPO) can be sensitive to tensor shapes and trainer internals. A **LoRA SFT baseline** is simpler
and more stable, which makes it a good starting point. You can reuse Pi-Scorer as a GRPO reward later.


## 0) Install required packages

This cell pins the versions of vLLM, TRL, and Transformers that work together for LoRA SFT with Unsloth; the remaining packages are left unpinned. It may take a few minutes on a fresh runtime.

In [None]:
# Core stack (Unsloth + TRL + Transformers + vLLM)
!pip -q install unsloth vllm==0.8.5.post1
!pip -q install bitsandbytes accelerate xformers peft "trl==0.15.2" triton cut_cross_entropy unsloth_zoo
!pip -q install sentencepiece protobuf "datasets>=3.4.1" huggingface_hub hf_transfer
!pip -q install transformers==4.51.3 ipywidgets requests

# Azure AI Foundry (Projects SDK) to upload dataset & run cloud evaluations
!pip -q install azure-ai-projects azure-identity

## 1) Configure keys (Pi-Scorer required, Azure optional)

This cell *only* reads environment variables (and prompts if missing). Nothing is sent anywhere yet.

- **WITHPI_API_KEY** is required (https://build.withpi.ai/account/keys)
- For Azure evaluation later, you may set either:
  - `AZURE_AI_PROJECT` (preferred) — e.g. `https://<account>.services.ai.azure.com/api/projects/<project>` and rely on `DefaultAzureCredential()`
  - or an older `AZURE_AI_PROJECT_CONNECTION_STRING` if your SDK environment still uses it.
- If you will use LLM-judged built-ins, set `AZURE_OPENAI_ENDPOINT`, `AZURE_OPENAI_API_KEY`, and `AZURE_OPENAI_DEPLOYMENT`.
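
Before running the later cells, it can help to see which parts of the notebook the current environment supports. The sketch below only inspects the variable names listed above; `config_status` and the two key groupings are hypothetical helpers introduced here for illustration, not part of any SDK.

```python
import os
from typing import Mapping

# Variable names come from the list above; the grouping is illustrative.
AZURE_PROJECT_KEYS = ("AZURE_AI_PROJECT", "AZURE_AI_PROJECT_CONNECTION_STRING")
AZURE_JUDGE_KEYS = ("AZURE_OPENAI_ENDPOINT", "AZURE_OPENAI_API_KEY", "AZURE_OPENAI_DEPLOYMENT")

def config_status(env: Mapping[str, str] = os.environ) -> dict:
    """Summarize which notebook stages can run with the given environment."""
    return {
        "pi_scorer": bool(env.get("WITHPI_API_KEY")),
        # Either project variable is enough for the upload step.
        "azure_project": any(bool(env.get(k)) for k in AZURE_PROJECT_KEYS),
        # LLM-judged evaluators need all three Azure OpenAI values.
        "azure_llm_judge": all(bool(env.get(k)) for k in AZURE_JUDGE_KEYS),
    }

print(config_status())
```

Passing a plain dictionary instead of `os.environ` makes the check easy to try without touching real secrets.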

In [2]:
import os, getpass

def _need(key: str, prompt: str) -> str:
    val = os.environ.get(key)
    if val:
        return val
    try:
        val = getpass.getpass(prompt)
    except Exception:
        val = input(prompt)
    if val:
        os.environ[key] = val.strip()
    return os.environ.get(key, "")

# Required for Pi-Scorer scoring calls
_need("WITHPI_API_KEY", "Enter your WITHPI_API_KEY (input hidden): ")
print("WITHPI_API_KEY set?", bool(os.environ.get("WITHPI_API_KEY")))

# Optional for Azure evaluation later (uncomment and fill if you prefer setting here)
# os.environ.setdefault("AZURE_AI_PROJECT", "https://<account>.services.ai.azure.com/api/projects/<project>")
# os.environ.setdefault("AZURE_OPENAI_ENDPOINT", "https://<your-aoai>.openai.azure.com/")
# os.environ.setdefault("AZURE_OPENAI_API_KEY", "<key>")
# os.environ.setdefault("AZURE_OPENAI_DEPLOYMENT", "<gpt-4o-mini / o4-mini / ...>")

WITHPI_API_KEY set? True


## 2) Load a small Unsloth model and attach LoRA (PEFT)

We use **`unsloth/Qwen2.5-1.5B-Instruct-bnb-4bit`** by default. If you are extremely VRAM-constrained, change `model_name` to **`unsloth/Qwen2.5-0.5B-Instruct-unsloth-bnb-4bit`**.

- `load_in_4bit=True` keeps the base model small.
- `get_peft_model(...)` attaches LoRA adapters so we train only a tiny fraction of weights.
- We also set the tokenizer’s padding side and pad token.
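
To make "a tiny fraction of weights" concrete, the sketch below counts LoRA parameters for the seven target modules used in this notebook. Each adapted linear layer of shape `(in, out)` gains two low-rank matrices with `rank * (in + out)` parameters in total. The layer dimensions are an assumption taken from the published Qwen2.5-1.5B configuration (hidden size 1536, MLP size 8960, 28 layers, key/value width 256), and `lora_param_count` is a hypothetical helper, not an Unsloth or PEFT function.

```python
def lora_param_count(shapes: dict[str, tuple[int, int]], rank: int, num_layers: int) -> int:
    """Each adapted Linear(in, out) gains A (rank x in) and B (out x rank)."""
    per_layer = sum(rank * (d_in + d_out) for d_in, d_out in shapes.values())
    return per_layer * num_layers

# Assumed Qwen2.5-1.5B projection shapes as (in_features, out_features).
QWEN25_1P5B_SHAPES = {
    "q_proj": (1536, 1536),
    "k_proj": (1536, 256),
    "v_proj": (1536, 256),
    "o_proj": (1536, 1536),
    "gate_proj": (1536, 8960),
    "up_proj": (1536, 8960),
    "down_proj": (8960, 1536),
}

trainable = lora_param_count(QWEN25_1P5B_SHAPES, rank=64, num_layers=28)
print(f"{trainable:,}")  # 73,859,072
```

Under these assumptions the count agrees with the `Trainable parameters = 73,859,072` line that Unsloth prints when training starts. Halving `rank` halves the adapter size.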


In [4]:
# --- Unsloth PEFT load with automatic compatibility for trust_remote_code vs fast_inference ---
from unsloth import FastLanguageModel
import os, torch

# Hugging Face cache locations (change /mnt/hf/cache if that path is not writable)
os.environ.setdefault("HF_HOME", "/mnt/hf/cache")
os.environ.setdefault("HF_HUB_CACHE", "/mnt/hf/cache/hub")
os.environ.setdefault("TRANSFORMERS_CACHE", "/mnt/hf/cache/hub")

max_seq_length = 1024
lora_rank = 64

# Pick one of these small, PEFT-friendly models:
model_name = "unsloth/Qwen2.5-1.5B-Instruct-bnb-4bit"       # ~1.5B, great on a single 16–24GB GPU
# model_name = "unsloth/Qwen2.5-0.5B-Instruct-unsloth-bnb-4bit"  # ~0.5B, ultra small

def _load_unsloth(model_name: str):
    # 1) Try fast path (fast_inference=True) with trust_remote_code=False
    try:
        print("[load] Trying: fast_inference=True, trust_remote_code=False")
        m, t = FastLanguageModel.from_pretrained(
            model_name           = model_name,
            max_seq_length       = max_seq_length,
            load_in_4bit         = True,
            fast_inference       = True,          # fast path (vLLM-style kernels)
            max_lora_rank        = lora_rank,
            gpu_memory_utilization = 0.5,
            trust_remote_code    = False,         # <-- important: avoid the NotImplementedError
            cache_dir            = os.environ.get("HF_HUB_CACHE", None),
        )
        return m, t, dict(fast_inference=True, trust_remote_code=False)
    except NotImplementedError as e:
        # 2) Fallback: if the repo actually needs remote code, drop fast_inference
        print("[load] Fast path + trust_remote_code not compatible; falling back "
              "to fast_inference=False, trust_remote_code=True")
        m, t = FastLanguageModel.from_pretrained(
            model_name           = model_name,
            max_seq_length       = max_seq_length,
            load_in_4bit         = True,
            fast_inference       = False,         # <-- compatible with remote code
            max_lora_rank        = lora_rank,
            gpu_memory_utilization = 0.5,
            trust_remote_code    = True,
            cache_dir            = os.environ.get("HF_HUB_CACHE", None),
        )
        return m, t, dict(fast_inference=False, trust_remote_code=True)

model, tokenizer, _loader_flags = _load_unsloth(model_name)

# Attach LoRA/PEFT
model = FastLanguageModel.get_peft_model(
    model,
    r = lora_rank,
    target_modules = ["q_proj","k_proj","v_proj","o_proj","gate_proj","up_proj","down_proj"],
    lora_alpha = lora_rank,
    use_gradient_checkpointing = "unsloth",
    random_state = 3407,
)

# Tokenizer padding policy
tokenizer.padding_side = "left"      # left-padding suits batched generation (and a later GRPO run)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

print("Loaded:", model_name, "| flags:", _loader_flags)
if torch.cuda.is_available():
    print("Device:", torch.cuda.get_device_name(0), "| bfloat16:", torch.cuda.is_bf16_supported())

[load] Trying: fast_inference=True, trust_remote_code=False
Unsloth: Patching vLLM v1 graph capture
Unsloth: Patching vLLM v0 graph capture
==((====))==  Unsloth 2025.8.9: Fast Qwen2 patching. Transformers: 4.51.3. vLLM: 0.8.5.post1.
   \\   /|    NVIDIA H100 NVL. Num GPUs = 1. Max memory: 93.016 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.6.0+cu124. CUDA: 9.0. CUDA Toolkit: 12.4. Triton: 3.2.0
\        /    Bfloat16 = TRUE. FA [Xformers = 0.0.29.post2. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!
Unsloth: vLLM loading unsloth/qwen2.5-1.5b-instruct-bnb-4bit with actual GPU utilization = 43.94%
Unsloth: Your GPU has CUDA compute capability 9.0 with VRAM = 93.02 GB.
Unsloth: Using conservativeness = 1.0. Chunked prefill tokens = 1024. Num Sequences = 320.
Unsloth: vLLM's KV Cache can use up to 39.54 GB. Also swap space = 6 GB.
INFO 08-27 01:32:50 [config.py:717] This mode

INFO 08-27 01:32:57 [core.py:58] Initializing a V1 LLM engine (v0.8.5.post1) with config: model='unsloth/qwen2.5-1.5b-instruct-bnb-4bit', speculative_config=None, tokenizer='unsloth/qwen2.5-1.5b-instruct-bnb-4bit', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=1024, download_dir=None, load_format=LoadFormat.BITSANDBYTES, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=bitsandbytes, enforce_eager=False, kv_cache_dtype=auto,  device_config=cuda:0, decoding_config=DecodingConfig(guided_decoding_backend='auto', reasoning_backend=None), observability_config=ObservabilityConfig(show_hidden_metrics=False, otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=0, served_model_name=unsloth/qwen2.5-1.5b-instruct-bnb-4bit, num_scheduler_steps=1, multi_step_stream_outputs=True,

INFO 08-27 01:33:14 [weight_utils.py:281] Time spent downloading weights for unsloth/qwen2.5-1.5b-instruct-bnb-4bit: 16.076469 seconds
INFO 08-27 01:33:14 [weight_utils.py:315] No model.safetensors.index.json found in remote.


INFO 08-27 01:33:15 [punica_selector.py:18] Using PunicaWrapperGPU.
INFO 08-27 01:33:15 [gpu_model_runner.py:1347] Model loading took 1.2132 GiB and 18.115439 seconds
INFO 08-27 01:33:26 [backends.py:420] Using cache directory: /home/govind/.cache/vllm/torch_compile_cache/7061a7a5b6/rank_0_0 for vLLM's torch.compile
INFO 08-27 01:33:26 [backends.py:430] Dynamo bytecode transform time: 10.32 s


INFO 08-27 01:33:32 [backends.py:136] Cache the graph of shape None for later use
INFO 08-27 01:33:57 [backends.py:148] Compiling a graph for general shape takes 29.47 s
INFO 08-27 01:34:20 [monitor.py:33] torch.compile takes 39.79 s in total
INFO 08-27 01:34:21 [kv_cache_utils.py:634] GPU KV cache size: 990,464 tokens
INFO 08-27 01:34:21 [kv_cache_utils.py:637] Maximum concurrency for 1,024 tokens per request: 967.25x
INFO 08-27 01:34:21 [vllm_utils.py:643] Unsloth: Running patched vLLM v1 `capture_model`.
INFO 08-27 01:34:39 [gpu_model_runner.py:1686] Graph capturing finished in 18 secs, took 6.96 GiB
INFO 08-27 01:34:39 [vllm_utils.py:650] Unsloth: Patched vLLM v1 graph capture finished in 18 secs.
INFO 08-27 01:34:40 [core.py:159] init engine (profile, create kv cache, warmup model) took 84.91 seconds
Unsloth: Just some info: will skip parsing ['k_norm', 'post_feedforward_lay

Unsloth 2025.8.9 patched 28 layers with 28 QKV layers, 28 O layers and 28 MLP layers.


Loaded: unsloth/Qwen2.5-1.5B-Instruct-bnb-4bit | flags: {'fast_inference': True, 'trust_remote_code': False}
Device: NVIDIA H100 NVL | bfloat16: True


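## 3) Build the **customer-support** dataset

The **system** message carries the support instructions plus the policy for the ticket, and the **user** message carries the ticket itself. For SFT, we also precompute a `text` field (a single string rendered with the tokenizer’s chat template) that the trainer learns from. As written, `text` contains only the system and user turns; to supervise on replies, append an assistant turn (for example, a teacher-generated reply) to the messages before applying the template.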
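
To show what the rendered `text` looks like with and without a reply, here is a hand-rolled approximation of a ChatML-style template. It is a sketch for illustration only: `render_chatml` is a hypothetical function that mimics the `<|im_start|>` layout visible in this notebook's output, it is not the tokenizer's own template, and the assistant reply is invented for the example.

```python
def render_chatml(messages: list[dict], add_generation_prompt: bool = False) -> str:
    """Approximate a ChatML-style chat template (not the real tokenizer template)."""
    text = "".join(
        f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n" for m in messages
    )
    if add_generation_prompt:
        # Open an assistant turn so the model continues from here at inference time.
        text += "<|im_start|>assistant\n"
    return text

prompt = [
    {"role": "system", "content": "Policy:\nDamaged on arrival: offer free replacement."},
    {"role": "user", "content": "My mug arrived cracked."},
]
# Hypothetical teacher reply, used only to show the shape of a supervised example.
reply = {"role": "assistant", "content": "I'm sorry about the mug. A replacement is on its way."}

prompt_only = render_chatml(prompt)           # no target for the model to learn
supervised = render_chatml(prompt + [reply])  # prompt followed by the target reply
print(supervised)
```

Only the second string gives the trainer a reply to imitate. With the real tokenizer, the equivalent is passing the three-message list to `tokenizer.apply_chat_template(..., tokenize=False)`.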

In [5]:
from datasets import Dataset

SYSTEM_PROMPT = (
    "You are a customer support agent for Contoso Retail. Write a short, professional email reply that:\n"
    "- acknowledges the customer’s issue empathetically,\n"
    "- provides clear next steps or resolution,\n"
    "- follows the policy provided (do not contradict it),\n"
    "- stays concise (≤150 words)."
)

business_samples = [
    {
        "ticket": "Order #78421 arrived with a cracked mug. Can you replace it? I need it before Friday.",
        "policy": "Damaged on arrival: offer free replacement shipped via 2-day; if out of stock, offer full refund."
    },
    {
        "ticket": "I’m past the 30-day window but this shirt still has tags. Any chance I can return it?",
        "policy": "Returns accepted within 30 days only; exceptions allowed as store credit at manager discretion."
    },
    {
        "ticket": "I canceled my order yesterday but I still see a pending charge on my card.",
        "policy": "Cancellations void the authorization immediately; banks may take 3–5 business days to release funds."
    },
    {
        "ticket": "The promo code SPRING25 didn’t apply at checkout. Can you refund the difference?",
        "policy": "SPRING25: 25% off full-price items only; cannot be combined; adjustments allowed within 7 days of purchase."
    },
    {
        "ticket": "I need to change the shipping address on my order to my office downtown.",
        "policy": "Address changes allowed until fulfillment starts; otherwise reroute via carrier once tracking is issued."
    },
    {
        "ticket": "My gift card shows $0 after one use but I only spent $12 of $25.",
        "policy": "Gift cards decrement in real-time; if balance mismatch occurs, reissue a new card with remaining funds."
    },
    {
        "ticket": "The espresso machine stopped working after two weeks. What can I do?",
        "policy": "Appliances: 1-year warranty. Offer troubleshooting; if unresolved, advance replacement or repair."
    },
    {
        "ticket": "Do you price match Amazon on the headphones I bought yesterday?",
        "policy": "Price match within 14 days against authorized retailers only; Amazon eligible when seller is Amazon.com."
    }
]

dataset = Dataset.from_list(business_samples)

def to_chat(ex):
    msgs = [
        {"role": "system", "content": SYSTEM_PROMPT + "\n\nPolicy:\n" + ex["policy"]},
        {"role": "user", "content": ex["ticket"]},
    ]
    return {
        "prompt": msgs,
        "text": tokenizer.apply_chat_template(msgs, tokenize=False, add_generation_prompt=False)
    }

dataset = dataset.map(to_chat)
print(dataset[0]["prompt"]) 
print("--- sample text ---")
print(dataset[0]["text"][:400] + " ...")

[{'content': 'You are a customer support agent for Contoso Retail. Write a short, professional email reply that:\n- acknowledges the customer’s issue empathetically,\n- provides clear next steps or resolution,\n- follows the policy provided (do not contradict it),\n- stays concise (≤150 words).\n\nPolicy:\nDamaged on arrival: offer free replacement shipped via 2-day; if out of stock, offer full refund.', 'role': 'system'}, {'content': 'Order #78421 arrived with a cracked mug. Can you replace it? I need it before Friday.', 'role': 'user'}]
--- sample text ---
<|im_start|>system
You are a customer support agent for Contoso Retail. Write a short, professional email reply that:
- acknowledges the customer’s issue empathetically,
- provides clear next steps or resolution,
- follows the policy provided (do not contradict it),
- stays concise (≤150 words).

Policy:
Damaged on arrival: offer free replacement shipped via 2-day; if out of stock, offer full refu ...


## 4) Pi-Scorer helpers (tone, resolution, policy)

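Three questions cover tone, issue resolution, and policy adherence. These helpers are used **after** SFT to score model replies on a handful of examples as a quick sanity check. Each helper takes `(prompts, completions, **kwargs)`, so it can also serve as a GRPO reward function later.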

In [6]:
import requests
PI_API_KEY = os.environ["WITHPI_API_KEY"]
PI_API_URL = "https://api.withpi.ai/v1/scoring_system/score"
HEADERS = {"Content-Type": "application/json", "x-api-key": PI_API_KEY}

def get_pi_score(input_text: str, output_text: str, question: str) -> float:
    payload = {
        "llm_input": input_text,
        "llm_output": output_text,
        "scoring_spec": [{"question": question}]
    }
    resp = requests.post(PI_API_URL, headers=HEADERS, json=payload, timeout=30)
    resp.raise_for_status()
    data = resp.json()
    if "total_score" not in data:
        raise KeyError("'total_score' missing in Pi response")
    return float(data["total_score"])

def _compose_business_input_from_prompt(prompt_messages: list[dict]) -> str:
    system_text = prompt_messages[0]["content"]
    user_text = prompt_messages[-1]["content"]
    policy = system_text.split("Policy:", 1)[1].strip() if "Policy:" in system_text else ""
    return f"Ticket:\n{user_text}\n\nPolicy:\n{policy}"

def score_business(prompts, completions, question: str) -> list[float]:
    inputs = [_compose_business_input_from_prompt(p) for p in prompts]
    outputs = [c[0]["content"] for c in completions]
    return [get_pi_score(i, o, question) for i, o in zip(inputs, outputs)]

def pi_professional_tone(prompts, completions, **kwargs) -> list[float]:
    return score_business(prompts, completions, "Is the response polite, empathetic, and professional?")

def pi_issue_resolution(prompts, completions, **kwargs) -> list[float]:
    return score_business(prompts, completions, "Does the response directly address the customer's request with clear next steps?")

def pi_policy_adherence(prompts, completions, **kwargs) -> list[float]:
    return score_business(prompts, completions, "Does the response follow the provided policy without contradicting it?")

## 5) SFT training with TRL’s `SFTTrainer` (LoRA already attached)

Because the base is 4-bit **and** LoRA is attached, this avoids the “purely quantized model can’t be fine-tuned” error. We train on the `text` field created above.

This is a small, quick run. Scale `num_train_epochs`, batch size, etc., for better quality once you’re happy with the pipeline.
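
The number of optimizer steps for this run can be worked out from the batch settings. The sketch below assumes the values used in the cell that follows (8 examples, per-device batch size 2, gradient accumulation 2, `logging_steps=5`). `optimizer_steps` and `logged_steps` are illustrative helpers, not TRL functions, and the rounding of a partial final accumulation group may differ between Transformers versions.

```python
import math

def optimizer_steps(num_examples: int, per_device_batch: int, grad_accum: int,
                    epochs: int = 1, num_devices: int = 1) -> int:
    """Approximate optimizer-step count for a simple single-node run."""
    batches_per_epoch = math.ceil(num_examples / (per_device_batch * num_devices))
    steps_per_epoch = max(batches_per_epoch // grad_accum, 1)
    return steps_per_epoch * epochs

def logged_steps(total_steps: int, logging_steps: int) -> list[int]:
    """Steps at which a loss row is emitted when logging every `logging_steps` steps."""
    return [s for s in range(1, total_steps + 1) if s % logging_steps == 0]

steps = optimizer_steps(num_examples=8, per_device_batch=2, grad_accum=2)
print(steps, logged_steps(steps, logging_steps=5))
```

With two optimizer steps and logging every five, no loss row is emitted, which would explain an empty loss table for a run this short. Setting `logging_steps=1` should show per-step losses.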

In [7]:
from trl import SFTTrainer, SFTConfig

args = SFTConfig(
    output_dir="outputs_sft",
    num_train_epochs=1,
    per_device_train_batch_size=2,
    gradient_accumulation_steps=2,
    learning_rate=2e-4,
    lr_scheduler_type="cosine",
    warmup_ratio=0.03,
    logging_steps=5,
    bf16=(torch.cuda.is_available() and torch.cuda.is_bf16_supported()),
    fp16=(torch.cuda.is_available() and not torch.cuda.is_bf16_supported()),  # fp16 only on GPUs without bf16
    packing=False
)

trainer = SFTTrainer(
    model=model,
    args=args,
    train_dataset=dataset,
    dataset_text_field="text",
    tokenizer=tokenizer
)

trainer.train()
print("SFT training complete.")

==((====))==  Unsloth - 2x faster free finetuning | Num GPUs used = 1
   \\   /|    Num examples = 8 | Num Epochs = 1 | Total steps = 2
O^O/ \_/ \    Batch size per device = 2 | Gradient accumulation steps = 2
\        /    Data Parallel GPUs = 1 | Total batch size (2 x 2 x 1) = 4
 "-____-"     Trainable parameters = 73,859,072 of 1,617,573,376 (4.57% trained)


Step,Training Loss


SFT training complete.


## 6) Save the LoRA adapter

You’ll get a small folder that can be reapplied for inference (HF generate or Unsloth’s vLLM fast path).
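
A quick way to confirm the save worked is to inspect the folder. The sketch below assumes the adapter is written in the usual PEFT layout (`adapter_config.json` plus `adapter_model.safetensors`); those file names are an assumption about your installed versions, and `describe_adapter_dir` is a hypothetical helper.

```python
from pathlib import Path

# File names PEFT conventionally writes; treat them as an assumption for your version.
EXPECTED_FILES = ("adapter_config.json", "adapter_model.safetensors")

def describe_adapter_dir(path: str) -> dict:
    """Report which expected adapter files are missing and the folder's total size."""
    root = Path(path)
    files = {}
    if root.is_dir():
        files = {p.name: p.stat().st_size for p in root.iterdir() if p.is_file()}
    return {
        "missing": [name for name in EXPECTED_FILES if name not in files],
        "total_mb": round(sum(files.values()) / 1e6, 2),
    }

# Safe to run before saving: a missing folder simply reports both files as missing.
print(describe_adapter_dir("sft_saved_lora"))
```

An empty `missing` list suggests the adapter can be reloaded. The size scales with the LoRA rank rather than with the base model.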

In [8]:
import os
lora_dir = os.path.abspath("sft_saved_lora")
try:
    model.save_lora(lora_dir)
    print("Saved LoRA to:", lora_dir)
except Exception as e:
    if os.path.isdir(lora_dir):
        print("LoRA already exists at:", lora_dir)
    else:
        raise e

Saved LoRA to: /home/govind/sft_saved_lora


## 7) Inference helpers + quick Pi scoring

We generate on a few random tickets and score the replies with Pi (tone, resolution, policy). This is just to sanity-check that the pipeline works end-to-end.

In [9]:
import random
from typing import List, Dict

def _chat_to_text(msgs: List[Dict[str, str]]) -> str:
    return tokenizer.apply_chat_template(msgs, tokenize=False, add_generation_prompt=True)

def generate_batch(prompts: List[List[Dict[str,str]]], max_tokens: int = 180) -> List[str]:
    try:
        from vllm import SamplingParams
        params = SamplingParams(temperature=0.4, top_p=0.9, max_tokens=max_tokens)
        texts = [_chat_to_text(p) for p in prompts]
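        # Apply the saved SFT adapter; lora_request=None would sample from the base model only.
        outs = model.fast_generate(texts, sampling_params=params, lora_request=model.load_lora(lora_dir))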
        if outs and hasattr(outs[0], "outputs"):
            return [o.outputs[0].text for o in outs]
        if outs and hasattr(outs[0], "text"):
            return [o.text for o in outs]
        return [str(o) for o in outs]
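    except Exception as exc:
        # HF generate fallback (e.g. vLLM unavailable); report why the fast path was skipped
        print(f"[generate] fast path unavailable ({type(exc).__name__}); using HF generate")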
        results = []
        for p in prompts:
            txt = _chat_to_text(p)
            toks = tokenizer(txt, return_tensors="pt").to(model.device)
            with torch.no_grad():
                out = model.generate(
                    **toks,
                    max_new_tokens=max_tokens,
                    do_sample=True,
                    temperature=0.4,
                    top_p=0.9,
                    pad_token_id=tokenizer.eos_token_id
                )
            gen = out[0][toks["input_ids"].shape[1]:]
            results.append(tokenizer.decode(gen, skip_special_tokens=True).strip())
        return results

def sample_and_score(n=3, seed=123):
    random.seed(seed)
    idxs = random.sample(range(len(dataset)), k=min(n, len(dataset)))
    prompts = [dataset[i]["prompt"] for i in idxs]
    replies = generate_batch(prompts)
    # Wrap each reply once in the chat-style shape the Pi helpers expect.
    completions = [[{"role": "assistant", "content": r}] for r in replies]
    tone = pi_professional_tone(prompts, completions)
    reso = pi_issue_resolution(prompts, completions)
    polc = pi_policy_adherence(prompts, completions)
    for j, i in enumerate(idxs):
        system_text = prompts[j][0]["content"]
        ticket = prompts[j][1]["content"]
        policy = system_text.split("Policy:", 1)[1].strip() if "Policy:" in system_text else ""
        print("=" * 80)
        print(f"[{i}] Ticket: {ticket}\nPolicy: {policy}\n\nReply:\n{replies[j].strip()}")
        avg = (tone[j] + reso[j] + polc[j]) / 3.0
        print(f"\nPi Scores — tone: {tone[j]:.3f}, resolution: {reso[j]:.3f}, policy: {polc[j]:.3f}, avg: {avg:.3f}")

sample_and_score(n=3)

[0] Ticket: Order #78421 arrived with a cracked mug. Can you replace it? I need it before Friday.
Policy: Damaged on arrival: offer free replacement shipped via 2-day; if out of stock, offer full refund.

Reply:
Dear [Customer],

Thank you for reaching out to us. I'm sorry to hear that your order #78421 arrived with a cracked mug. I understand your concern and will do everything possible to replace it for you.

I will send you a free replacement mug shipped via 2-day service. If this is not available, I will offer you a full refund.

Please let me know if you need any further assistance.

Thank you for choosing Contoso Retail.

Best regards,
[Your Name]  
Customer Support

Pi Scores — tone: 0.906, resolution: 0.664, policy: 0.977, avg: 0.849
[2] Ticket: I canceled my order yesterday but I still see a pending charge on my card.
Policy: Cancellations void the authorization immediately; banks may take 3–5 business days to release funds.

Reply:
Dear [Customer's Name],

I apologize for any

## 8) (Optional) Azure AI Foundry — upload dataset and run an evaluation

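This cell writes a small JSONL file whose rows have the fields `query`, `response`, `policy`, `input`, and `output`, uploads it to your Azure AI project, and launches an evaluation with the built-in **Relevance** evaluator. If you’ve registered a custom evaluator, set `CUSTOM_EVALUATOR_ID` and it is included too. The `azure-ai-projects` SDK is in preview and its class and method signatures have changed between releases, so check the calls below against the version you installed.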

Auth options:
- Preferred: set `AZURE_AI_PROJECT` and rely on `DefaultAzureCredential()` (e.g., `az login` or Managed Identity on your compute).
- Legacy preview: `AZURE_AI_PROJECT_CONNECTION_STRING` also works if your SDK build still supports it.
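
Before uploading, it may be worth checking that every JSONL row carries the five fields the evaluators map from. The sketch below is a standalone validator over lines of text. `validate_jsonl_lines` is a hypothetical helper and the sample rows are made up.

```python
import json

REQUIRED_FIELDS = {"query", "response", "policy", "input", "output"}

def validate_jsonl_lines(lines) -> list[str]:
    """Return a list of problems; an empty list means every row has all required fields."""
    problems = []
    for n, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # tolerate blank lines
        try:
            row = json.loads(line)
        except json.JSONDecodeError as exc:
            problems.append(f"line {n}: invalid JSON ({exc.msg})")
            continue
        if not isinstance(row, dict):
            problems.append(f"line {n}: expected a JSON object")
            continue
        missing = sorted(REQUIRED_FIELDS - set(row))
        if missing:
            problems.append(f"line {n}: missing {missing}")
    return problems

good = json.dumps({k: "x" for k in REQUIRED_FIELDS})
bad = json.dumps({"query": "q", "response": "r", "policy": "p", "input": "i"})
print(validate_jsonl_lines([good, bad]))
```

To check the file on disk, pass an open file handle, for example `validate_jsonl_lines(open("azure_eval_dataset.jsonl", encoding="utf-8"))`.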

In [None]:
import os, json, pathlib
from azure.identity import DefaultAzureCredential
from azure.ai.projects import AIProjectClient
from azure.ai.projects.models import Evaluation, InputDataset, DatasetInputType, EvaluatorConfiguration, EvaluatorIds

def _gen_reply(messages):
    # Use HF generate for portability across environments
    txt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    toks = tokenizer(txt, return_tensors="pt").to(model.device)
    with torch.no_grad():
        out = model.generate(**toks, max_new_tokens=200, do_sample=False, pad_token_id=tokenizer.eos_token_id)
    gen = out[0][toks["input_ids"].shape[1]:]
    return tokenizer.decode(gen, skip_special_tokens=True).strip()

sample = dataset.select(range(min(32, len(dataset))))
rows = []
for ex in sample:
    reply = _gen_reply(ex["prompt"])
    system_text = ex["prompt"][0]["content"]
    user_text = ex["prompt"][-1]["content"]
    policy = system_text.split("Policy:", 1)[1].strip() if "Policy:" in system_text else ""
    rows.append({
        "query": user_text,
        "response": reply,
        "policy": policy,
        "input": f"Ticket:\n{user_text}\n\nPolicy:\n{policy}",
        "output": reply
    })

jsonl_path = pathlib.Path("azure_eval_dataset.jsonl")
with jsonl_path.open("w", encoding="utf-8") as f:
    for r in rows:
        f.write(json.dumps(r, ensure_ascii=False) + "\n")
print(f"Wrote {len(rows)} rows -> {jsonl_path.resolve()}")

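# Prefer the project endpoint; fall back to the legacy connection string only if that is all that is set.
endpoint = os.environ.get("AZURE_AI_PROJECT")
conn_str = os.environ.get("AZURE_AI_PROJECT_CONNECTION_STRING")
if endpoint:
    project_client = AIProjectClient(endpoint=endpoint, credential=DefaultAzureCredential())
elif conn_str:
    # Only older preview builds of the SDK expose from_connection_string.
    project_client = AIProjectClient.from_connection_string(conn_str=conn_str, credential=DefaultAzureCredential())
else:
    raise RuntimeError("Set AZURE_AI_PROJECT or AZURE_AI_PROJECT_CONNECTION_STRING to run this cell.")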

with project_client:
    dataset_id = project_client.datasets.upload_file(path=str(jsonl_path))
    print("Uploaded dataset id:", dataset_id)

    evaluators = [
        EvaluatorConfiguration(
            id=EvaluatorIds.RELEVANCE.value,
            data_mapping={"query": "${data.query}", "response": "${data.response}"},
            settings={"definition": "Is the agent’s reply relevant and does it address the customer’s request?"}
        )
    ]

    custom_eval_id = os.environ.get("CUSTOM_EVALUATOR_ID")
    if custom_eval_id:
        evaluators.append(
            EvaluatorConfiguration(
                id=custom_eval_id,
                data_mapping={"input": "${data.input}", "output": "${data.output}", "policy": "${data.policy}"}
            )
        )

    headers = {}
    if all(k in os.environ for k in ("AZURE_OPENAI_ENDPOINT", "AZURE_OPENAI_API_KEY", "AZURE_OPENAI_DEPLOYMENT")):
        headers = {
            "model-endpoint": os.environ["AZURE_OPENAI_ENDPOINT"],
            "api-key": os.environ["AZURE_OPENAI_API_KEY"],
            "azureml-model-deployment": os.environ["AZURE_OPENAI_DEPLOYMENT"]
        }

    eval_job = project_client.evaluations.create(
        evaluation=Evaluation(
            name="contoso-cs-email-eval",
            data=InputDataset(type=DatasetInputType.URI_FILE_DATASET, path=dataset_id),
            evaluators=evaluators
        ),
        headers=headers
    )
    print("Submitted evaluation job id:", eval_job)