# GRPO fine-tuning with Pi‑Scorer rewards on **Phi‑4 Mini Instruct** (LoRA/PEFT)

This end‑to‑end notebook shows how to:

1) Load **Phi‑4 Mini Instruct** quantized to 4‑bit and enable **LoRA/PEFT** with [Unsloth](https://unsloth.ai/).
2) Train with **GRPO** (Group Relative Policy Optimization) using **Pi‑Scorer** functions as rewards.
3) Save and reuse the LoRA adapter.
4) Evaluate locally and in **Azure AI Foundry** (Projects SDK) using built‑in and custom evaluators.

> Why these choices?
- **Unsloth** provides an optimized GRPO trainer and tight integration with PEFT/LoRA, vLLM fast generation, and quantized loading.
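- **GRPO** stabilizes RL for LLMs by scoring each rollout relative to the other rollouts sampled for the same prompt, which removes the need for a separate value (critic) model.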
- **Pi‑Scorer** lets you replace expensive LLM‑as‑a‑judge with fast, deterministic scorers.
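
To make the "comparing rollouts within a group" idea concrete, here is a minimal sketch of a group-relative advantage, assuming the common formulation in which each reward is standardized against the mean and standard deviation of its own group. The function name, the `eps` value, and the reward numbers are illustrative, not taken from Unsloth or TRL.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-4):
    """Standardize rewards within one group of completions for the same prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an implementation may use the sample std instead
    return [(r - mu) / (sigma + eps) for r in rewards]

# Four completions sampled for one ticket, scored in [0, 1] (made-up numbers).
rewards = [0.2, 0.5, 0.9, 0.4]
advantages = group_relative_advantages(rewards)
for r, a in zip(rewards, advantages):
    print(f"reward={r:.2f}  advantage={a:+.2f}")
```

Completions that score above the group mean get a positive advantage and are reinforced; those below the mean are pushed down.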

References:
- Unsloth GRPO tutorial (methods/config & saving LoRA).  
  - See: *GRPO & vLLM* and *save_lora()* usage in Unsloth docs.  
- Azure AI Foundry Evaluations (built‑in evaluators, Projects SDK).  
  - See the Evaluate how‑to & Projects SDK API for Evaluations in Azure docs.


## 0) Installs

We install Unsloth, TRL (for the `GRPOConfig`/`GRPOTrainer` API), vLLM (fast generation path), and supporting libraries.

**Tip:** In a fresh environment, this first cell can take a few minutes.

In [None]:
# Install pinned/compatible packages for this workflow.
!pip -q install unsloth vllm==0.8.5.post1
!pip -q install bitsandbytes accelerate xformers peft "trl==0.15.2" triton cut_cross_entropy unsloth_zoo
!pip -q install sentencepiece protobuf "datasets>=3.4.1" huggingface_hub hf_transfer
!pip -q install transformers==4.51.3 ipywidgets requests
!pip -q install azure-ai-projects azure-identity  # Optional: Azure AI Foundry Evaluations

## 1) Configure API keys (Pi‑Scorer & optional Azure)

We read **WITHPI_API_KEY** for Pi‑Scorer (from https://build.withpi.ai) and (optionally) Azure variables for evaluations.

Nothing is sent anywhere in this cell; it only reads environment variables and prompts for any that are missing.

In [None]:
import os, getpass

def _need(key: str, prompt: str) -> str:
    val = os.environ.get(key)
    if val:
        return val
    try:
        val = getpass.getpass(prompt)
    except Exception:
        val = input(prompt)
    if val:
        os.environ[key] = val.strip()
    return os.environ.get(key, "")

# Required for Pi-Scorer reward calls
_need("WITHPI_API_KEY", "Enter your WITHPI_API_KEY (input hidden): ")

# Optional: If you plan to use Azure AI Foundry Evaluations via Projects SDK
# os.environ.setdefault("AZURE_AI_PROJECT_CONNECTION_STRING", "Endpoint=...;Project=...;SubscriptionId=...;ResourceGroup=...")
# os.environ.setdefault("AZURE_OPENAI_ENDPOINT", "https://<your-aoai>.openai.azure.com/")
# os.environ.setdefault("AZURE_OPENAI_API_KEY", "<key>")
# os.environ.setdefault("AZURE_OPENAI_DEPLOYMENT", "<gpt-4o-mini / o4-mini / etc>")

print("WITHPI_API_KEY set?", bool(os.environ.get("WITHPI_API_KEY")))

## 2) Load **Phi‑4 Mini Instruct** (4‑bit) and enable **LoRA/PEFT**

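We use Unsloth's `FastLanguageModel` loader to pull a 4‑bit model and then attach LoRA adapters. Because only the low‑rank adapter weights are trained, gradients and optimizer state are kept for a small fraction of the parameters, which cuts GPU memory use well below that of full‑parameter RL fine‑tuning.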
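
As a rough illustration of where the savings come from, the sketch below counts the trainable parameters LoRA adds to a single linear layer. The layer size is a placeholder chosen for the arithmetic and is not read from the Phi‑4 Mini config; only `r = 64` matches this notebook's `lora_rank`.

```python
def lora_param_count(d_in, d_out, r):
    """Trainable parameters LoRA adds to one d_out x d_in linear layer.

    LoRA learns two low-rank factors: A with shape (r, d_in) and B with shape (d_out, r).
    """
    return r * (d_in + d_out)

d_in = d_out = 3072   # illustrative layer size (assumption, not from the model config)
r = 64                # same rank as lora_rank in this notebook
full = d_in * d_out
lora = lora_param_count(d_in, d_out, r)
print(f"full: {full:,}  lora: {lora:,}  ratio: {lora / full:.2%}")
# prints "full: 9,437,184  lora: 393,216  ratio: 4.17%"
```

The ratio shrinks further for larger layers, since the full weight grows with `d_in * d_out` while the adapter grows with `d_in + d_out`.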

In [None]:
from unsloth import FastLanguageModel
import os

# Cache locations (helps on repeated runs)
os.environ.setdefault("HF_HOME", "/mnt/hf/cache")
os.environ.setdefault("HF_HUB_CACHE", "/mnt/hf/cache/hub")
os.environ.setdefault("TRANSFORMERS_CACHE", "/mnt/hf/cache/hub")

max_seq_length = 1024
lora_rank = 64
model_name = "unsloth/Phi-4-mini-instruct-bnb-4bit"  # 4-bit, fast to load

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = model_name,
    max_seq_length = max_seq_length,
    load_in_4bit = True,
    fast_inference = True,
    max_lora_rank = lora_rank,
    gpu_memory_utilization = 0.5,
    trust_remote_code = True,
    cache_dir = os.environ.get("HF_HUB_CACHE", None),
)

model = FastLanguageModel.get_peft_model(
    model,
    r = lora_rank,
    target_modules = ["q_proj","k_proj","v_proj","o_proj","gate_proj","up_proj","down_proj"],
    lora_alpha = lora_rank,
    use_gradient_checkpointing = "unsloth",
    random_state = 3407,
)

tokenizer.padding_side = "left"
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

print("Loaded:", model_name)

## 3) Create a **business support** scenario dataset

We'll train the assistant to write short, policy‑compliant customer‑support emails. Each item has a *ticket* and a *policy*. The **system** message includes the policy; the **user** message contains the ticket. GRPO will sample responses and Pi‑Scorer will grade them.

In [None]:
from datasets import Dataset

SYSTEM_PROMPT = """
You are a customer support agent for Contoso Retail. Write a short, professional email reply that:
- acknowledges the customer’s issue empathetically,
- provides clear next steps or resolution,
- follows the policy provided (do not contradict it),
- stays concise (≤150 words).
"""

business_samples = [
    {
        "ticket": "Order #78421 arrived with a cracked mug. Can you replace it? I need it before Friday.",
        "policy": "Damaged on arrival: offer free replacement shipped via 2-day; if out of stock, offer full refund."
    },
    {
        "ticket": "I’m past the 30-day window but this shirt still has tags. Any chance I can return it?",
        "policy": "Returns accepted within 30 days only; exceptions allowed as store credit at manager discretion."
    },
    {
        "ticket": "I canceled my order yesterday but I still see a pending charge on my card.",
        "policy": "Cancellations void the authorization immediately; banks may take 3–5 business days to release funds."
    },
    {
        "ticket": "The promo code SPRING25 didn’t apply at checkout. Can you refund the difference?",
        "policy": "SPRING25: 25% off full-price items only; cannot be combined; adjustments allowed within 7 days of purchase."
    },
    {
        "ticket": "I need to change the shipping address on my order to my office downtown.",
        "policy": "Address changes allowed until fulfillment starts; otherwise reroute via carrier once tracking is issued."
    },
    {
        "ticket": "My gift card shows $0 after one use but I only spent $12 of $25.",
        "policy": "Gift cards decrement in real-time; if balance mismatch occurs, reissue a new card with remaining funds."
    },
    {
        "ticket": "The espresso machine stopped working after two weeks. What can I do?",
        "policy": "Appliances: 1-year warranty. Offer troubleshooting; if unresolved, advance replacement or repair."
    },
    {
        "ticket": "Do you price match Amazon on the headphones I bought yesterday?",
        "policy": "Price match within 14 days against authorized retailers only; Amazon eligible when seller is Amazon.com."
    },
]

dataset = Dataset.from_list(business_samples)
dataset = dataset.map(
    lambda x: {
        "prompt": [
            {"role": "system", "content": SYSTEM_PROMPT + "\n\nPolicy:\n" + x["policy"]},
            {"role": "user", "content": x["ticket"]},
        ]
    }
)
print(dataset[0])

## 4) Define **Pi‑Scorer** reward functions

Each reward returns a scalar in \[0,1] for a (prompt, completion) pair. We'll call the Pi API for three questions: **professional tone**, **issue resolution**, and **policy adherence**.
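
Not every reward has to come from an API call. As a sketch, here is a hypothetical local reward with the same `(prompts, completions, **kwargs)` shape that checks the 150‑word limit from the system prompt. The name `concise_reward` and the linear decay are our own choices for illustration and are not part of Pi‑Scorer or TRL.

```python
def concise_reward(prompts, completions, max_words=150, **kwargs):
    """Return 1.0 for replies within the word budget, decaying linearly to 0.0 at twice the budget."""
    scores = []
    for completion in completions:
        n_words = len(completion[0]["content"].split())
        if n_words <= max_words:
            scores.append(1.0)
        else:
            overshoot = (n_words - max_words) / max_words
            scores.append(max(0.0, 1.0 - overshoot))
    return scores

# Two made-up completions in the conversational format: one short, one 225 words long.
demo = [
    [{"role": "assistant", "content": "Sorry about the mug. A replacement ships today."}],
    [{"role": "assistant", "content": "word " * 225}],
]
print(concise_reward(prompts=[None, None], completions=demo))  # prints [1.0, 0.5]
```

A function like this could be appended to the `reward_funcs` list alongside the Pi rewards, at no network cost.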

In GRPO, the *relative* ranking of samples in a group matters most — perfect calibration isn't required.
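
One way to see why calibration matters less: if advantages are obtained by standardizing rewards within a group (an assumption about the trainer's formulation), then shifting or positively rescaling a scorer leaves them essentially unchanged. The sketch below compares a raw scorer with a differently calibrated copy of it; all numbers are made up.

```python
from statistics import mean, pstdev

def standardize(rewards, eps=1e-8):
    """Subtract the group mean and divide by the group std (eps avoids division by zero)."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

raw = [0.2, 0.5, 0.9, 0.4]
rescaled = [0.3 + 0.5 * r for r in raw]   # same ordering, different offset and scale

a_raw = standardize(raw)
a_rescaled = standardize(rescaled)
max_gap = max(abs(x - y) for x, y in zip(a_raw, a_rescaled))
print(f"largest difference between the two advantage vectors: {max_gap:.2e}")
```

The remaining gap comes only from `eps`. Note that this holds for shifts and positive rescalings; an arbitrary monotone transformation keeps the ranking but can change the advantage magnitudes.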

In [None]:
import os, requests

PI_API_KEY = os.environ.get("WITHPI_API_KEY")
PI_API_URL = "https://api.withpi.ai/v1/scoring_system/score"
HEADERS = {"Content-Type": "application/json", "x-api-key": PI_API_KEY}

def get_pi_score(input_text: str, output_text: str, question: str) -> float:
    """Call Pi‑Scorer for a single (input, output, question). Returns float score."""
    payload = {
        "llm_input": input_text,
        "llm_output": output_text,
        "scoring_spec": [{"question": question}],
    }
    resp = requests.post(PI_API_URL, headers=HEADERS, json=payload, timeout=30)
    resp.raise_for_status()
    data = resp.json()
    if "total_score" not in data:
        raise KeyError("'total_score' missing in Pi response")
    return float(data["total_score"])

def _compose_business_input_from_prompt(prompt_messages: list[dict]) -> str:
    # Extract the ticket & policy we inserted in the system/user turns
    system_text = prompt_messages[0]["content"]
    user_text = prompt_messages[-1]["content"]
    policy = system_text.split("Policy:", 1)[1].strip() if "Policy:" in system_text else ""
    return f"Ticket:\n{user_text}\n\nPolicy:\n{policy}"

def score_business(prompts, completions, question: str) -> list[float]:
    # prompts: list[list[{role, content}]]
    # completions: list[list[{role: 'assistant', content: str}]]
    inputs = [_compose_business_input_from_prompt(p) for p in prompts]
    outputs = [c[0]["content"] for c in completions]
    return [get_pi_score(i, o, question) for i, o in zip(inputs, outputs)]

def pi_professional_tone(prompts, completions, **kwargs) -> list[float]:
    return score_business(prompts, completions, "Is the response polite, empathetic, and professional?")

def pi_issue_resolution(prompts, completions, **kwargs) -> list[float]:
    return score_business(prompts, completions, "Does the response directly address the customer's request with clear next steps?")

def pi_policy_adherence(prompts, completions, **kwargs) -> list[float]:
    return score_business(prompts, completions, "Does the response follow the provided policy without contradicting it?")


## 5) GRPO config **and** a minimal wrapper to satisfy Unsloth's trainer

Unsloth's GRPO trainer can expect the policy model to expose a few extra attributes and methods. The wrapper below covers two sticking points:

- The trainer writes a flag to `model.warnings_issued[...]`, so we provide a dict.
- The trainer calls `model.add_model_tags(tags)` during init, so the wrapper forwards that call (or no‑ops).

The wrapper also returns the **last hidden states** via `.logits`, matching what the trainer expects without altering the base model. This avoids attribute errors such as `HiddenAsLogitsWrapper has no attribute 'add_model_tags'` and recursion errors when accessing `base_model`. Note that the training cell in section 6 applies the same hidden‑states trick as an in‑place patch on `model`, so `grpo_model` is not passed to the trainer there.
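
The delegation pattern behind the wrapper can be sketched without PyTorch. In this toy version, `Inner` and `DelegatingWrapper` are hypothetical stand-ins: the wrapper defines the hooks it needs, forwards every other attribute to the wrapped object, and guards the one lookup that could recurse.

```python
class Inner:
    """Stand-in for the wrapped model (hypothetical, for illustration only)."""
    name = "inner-model"

    def generate(self, prompt):
        return f"reply to {prompt!r}"

class DelegatingWrapper:
    def __init__(self, inner):
        # Write straight into __dict__ so the attribute exists before any lookup can miss.
        self.__dict__["inner"] = inner
        self.warnings_issued = {}

    def add_model_tags(self, tags):
        # Record the tags locally, since Inner offers no such method.
        self.tags = list(tags)

    def __getattr__(self, attr):
        # Python calls __getattr__ only after normal lookup fails,
        # so names defined on the wrapper itself never reach this point.
        if attr == "inner":
            raise AttributeError(attr)  # 'inner' not set yet: fail fast instead of recursing
        return getattr(self.inner, attr)

w = DelegatingWrapper(Inner())
w.add_model_tags(["grpo"])
print(w.generate("hi"), w.name, w.tags)
```

With `torch.nn.Module` the same idea applies, except that the fallback should first go through `nn.Module.__getattr__`, which is where registered parameters, buffers, and submodules are resolved.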

In [None]:
from trl import GRPOConfig
import torch, torch.nn as nn
import types

# Keep GRPO on HF path (disable in-trainer vLLM); we still use vLLM for inference.
def _bf16_supported() -> bool:
    return torch.cuda.is_available() and torch.cuda.is_bf16_supported()

training_args = GRPOConfig(
    use_vllm = False,                 # critical: avoid vLLM engine inside trainer
    learning_rate = 5e-6,
    adam_beta1 = 0.9,
    adam_beta2 = 0.99,
    weight_decay = 0.1,
    warmup_ratio = 0.1,
    lr_scheduler_type = "cosine",
    optim = "adamw_8bit",
    logging_steps = 1,
    bf16 = _bf16_supported(),
    fp16 = not _bf16_supported(),
    per_device_train_batch_size = 1,
    gradient_accumulation_steps = 1,
    num_generations = 8,
    max_prompt_length = 1024,
    max_completion_length = 200,
    max_steps = 40,                   # toy run; increase for real training
    save_steps = 10,
    max_grad_norm = 0.1,
    report_to = "none",
    output_dir = "outputs",
)

class HiddenAsLogitsWrapper(nn.Module):
    """Wrap a (PEFT) CausalLM so forward(...).logits returns LAST HIDDEN STATES (B,S,H).
    Also: expose trainer-expected attrs/methods and delegate unknowns to the base model.
    This fixes AttributeErrors like `add_model_tags` and avoids recursion when accessing `base_model`.
    """
    def __init__(self, base_model: nn.Module):
        super().__init__()
        # IMPORTANT: assign base_model before any code that might touch __getattr__
        object.__setattr__(self, "base_model", base_model)
        # Commonly-inspected attributes
        self.config = getattr(base_model, "config", None)
        self.lm_head = getattr(base_model, "lm_head", None)
        # Trainer writes to this; make sure it's present & mutable
        self.warnings_issued = getattr(base_model, "warnings_issued", {}) or {}

    # ============ Core forward: return last hidden states as `.logits` ============
    def forward(self, *args, **kwargs):
        kwargs = dict(kwargs)
        kwargs["output_hidden_states"] = True
        kwargs["return_dict"] = True
        out = self.base_model(*args, **kwargs)
        last_hidden = out.hidden_states[-1]  # (B, S, H)
        return types.SimpleNamespace(logits=last_hidden)

    # ============ Trainer convenience hooks ============
    def add_model_tags(self, tags):
        # Forward if implemented by the base model, else record & no-op
        if hasattr(self.base_model, "add_model_tags"):
            return self.base_model.add_model_tags(tags)
        self._wrapped_model_tags = list(tags) if isinstance(tags, (list, tuple, set)) else [tags]
        return None

    # ============ Delegate helpers to preserve normal behavior ============
    def get_output_embeddings(self):
        return self.base_model.get_output_embeddings()

    def get_input_embeddings(self):
        return self.base_model.get_input_embeddings()

    def generate(self, *args, **kwargs):
        return self.base_model.generate(*args, **kwargs)

    def train(self, mode: bool = True):
        self.base_model.train(mode)
        return super().train(mode)

    def eval(self):
        self.base_model.eval()
        return super().eval()

    def to(self, *args, **kwargs):
        self.base_model.to(*args, **kwargs)
        return self

    @property
    def device(self):
        try:
            return next(self.base_model.parameters()).device
        except StopIteration:
            return torch.device("cpu")

    # ============ Robust attribute forwarding without recursion ============
    def __getattr__(self, name):
        # __getattr__ runs only after normal lookup fails. nn.Module.__getattr__ resolves
        # registered parameters, buffers, and submodules; anything else goes to the base model.
        try:
            return super().__getattr__(name)
        except AttributeError:
            if name == "base_model":
                raise  # base_model not set yet: re-raise instead of recursing
            return getattr(self.base_model, name)

grpo_model = HiddenAsLogitsWrapper(model)
print("Wrapper ready. add_model_tags?", hasattr(grpo_model, "add_model_tags"))

## 6) Train with GRPO

We point the GRPO trainer at the patched PEFT model, the dataset, the tokenizer (as `processing_class`), and the three Pi reward functions.

> **Note:** This is a short run (`max_steps=20`, `num_generations=4`) so it fits modest GPUs. The `GRPOConfig` in the training cell below replaces the one defined in section 5. Increase the values for real training.
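
Before scaling the run, it helps to estimate how many prompts each step covers and how many scorer requests the run will issue. The helper below is a back-of-the-envelope sketch with a hypothetical name. It assumes `gradient_accumulation_steps=1`, one scorer request per completion per reward function (as in `get_pi_score`), and that the global batch must be divisible by `num_generations`, which is our understanding of the check TRL performs.

```python
def grpo_step_budget(per_device_batch, num_processes, num_generations, num_reward_funcs, max_steps):
    """Estimate prompts per step and scorer requests for a GRPO run."""
    global_batch = per_device_batch * num_processes   # completions per step
    if global_batch % num_generations != 0:
        raise ValueError("global batch must be divisible by num_generations")
    calls_per_step = global_batch * num_reward_funcs
    return {
        "prompts_per_step": global_batch // num_generations,
        "scorer_calls_per_step": calls_per_step,
        "scorer_calls_total": calls_per_step * max_steps,
    }

# Values from the training cell in this section: batch 4, 4 generations, 3 rewards, 20 steps.
budget = grpo_step_budget(per_device_batch=4, num_processes=1, num_generations=4,
                          num_reward_funcs=3, max_steps=20)
print(budget)
# prints {'prompts_per_step': 1, 'scorer_calls_per_step': 12, 'scorer_calls_total': 240}
```

Since every request is a blocking HTTP call, the scorer budget is often what limits step time as `num_generations` grows.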

In [None]:
# --- Patch: return LAST HIDDEN STATES in .logits so Unsloth GRPO can matmul with lm_head ---
import types
import torch
from transformers.modeling_outputs import CausalLMOutputWithPast

def patch_logits_to_hidden_states(model):
    if not hasattr(model, "warnings_issued"):
        model.warnings_issued = {}
    # keep original forward
    _orig_forward = model.forward

    def _forward(*args, **kwargs):
        kwargs = dict(kwargs)
        # force hidden states + dict outputs
        kwargs["output_hidden_states"] = True
        kwargs["return_dict"] = True
        out = _orig_forward(*args, **kwargs)

        # pull last hidden states robustly
        hidden = None
        if hasattr(out, "hidden_states") and out.hidden_states is not None:
            hidden = out.hidden_states[-1]              # (B, S, H)
        elif isinstance(out, (tuple, list)) and len(out) >= 3 and out[2] is not None:
            hidden = out[2][-1]                         # (B, S, H) for tuple returns
        if hidden is None:
            raise RuntimeError("Hidden states not returned by model; cannot train GRPO.")

        # If the output object lets us overwrite .logits, do it in place.
        try:
            out.logits = hidden                         # <- make logits be (B, S, H)
            return out
        except Exception:
            # Fall back to constructing a fresh, standard output
            return CausalLMOutputWithPast(
                logits=hidden,
                past_key_values=getattr(out, "past_key_values", None),
                hidden_states=getattr(out, "hidden_states", None),
                attentions=getattr(out, "attentions", None),
            )

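    # Monkey-patch with a plain function: _orig_forward is already bound to the model,
    # so re-binding _forward with types.MethodType would pass the model as an extra argument.
    model.forward = _forward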
    return model

# apply the patch
model = patch_logits_to_hidden_states(model)

# (Optional) quick sanity check on shapes
with torch.no_grad():
    tok = tokenizer("hi", return_tensors="pt").to(next(model.parameters()).device)
    dbg = model(**tok)
    print("Patched logits shape (should be B,S,H):", tuple(dbg.logits.shape))
    emb = model.get_output_embeddings().weight
    print("Output embedding (V,H):", tuple(emb.shape))

In [None]:
from trl import GRPOConfig, GRPOTrainer

training_args = GRPOConfig(
    use_vllm=False,                 # <-- important: disable vLLM during GRPO
    learning_rate=5e-6,
    per_device_train_batch_size=4,  # multiple of num_generations
    num_generations=4,
    gradient_accumulation_steps=1,
    max_prompt_length=1024,
    max_completion_length=200,
    max_steps=20,                   # toy
    save_steps=10,
    bf16=True,                      # assumes a bf16-capable GPU (e.g. H100); otherwise set fp16=True instead
    report_to="none",
    output_dir="outputs",
)

# left pad for GRPO sampling
tokenizer.padding_side = "left"
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

trainer = GRPOTrainer(
    model=model,                    # <- the PEFT model, **after** patch_logits_to_hidden_states()
    processing_class=tokenizer,
    reward_funcs=[pi_professional_tone, pi_issue_resolution, pi_policy_adherence],
    args=training_args,
    train_dataset=dataset,
)

trainer.train()
print("Training complete.")

## 7) Save the LoRA adapter

This gives you a small adapter folder you can mount later for inference (HF generate **or** vLLM fast path).
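
After saving, a quick listing of the adapter folder confirms that something was actually written. The helper below is a small sketch with a hypothetical name. The demo runs on a throwaway folder with placeholder files named the way PEFT usually names adapter files; whether `save_lora` produces exactly these names is an assumption to verify on your own output.

```python
import tempfile
from pathlib import Path

def describe_adapter_dir(path):
    """List (relative file name, size in bytes) for every file under an adapter folder."""
    root = Path(path)
    if not root.is_dir():
        raise FileNotFoundError(f"no adapter folder at {root}")
    return sorted(
        (p.relative_to(root).as_posix(), p.stat().st_size)
        for p in root.rglob("*") if p.is_file()
    )

# Demo on a temporary folder with placeholder files (not a real adapter).
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "adapter_config.json").write_text("{}")
    (Path(tmp) / "adapter_model.safetensors").write_bytes(b"\x00" * 16)
    listing = describe_adapter_dir(tmp)
    for name, size in listing:
        print(f"{name}: {size} bytes")
```

In the notebook you would call `describe_adapter_dir(lora_dir)` once the save cell has run.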

In [None]:
import os
lora_dir = os.path.abspath("grpo_saved_lora")
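try:
    model.save_lora(lora_dir)  # Unsloth helper
    print("Saved LoRA to:", lora_dir)
except Exception as e:
    if os.path.isdir(lora_dir):
        # Reuse the existing folder, but surface the error: the adapter on disk may be stale.
        print(f"save_lora failed ({e!r}); reusing existing adapter at: {lora_dir}")
    else:
        raise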

## 8) Quick local inference helpers (vLLM fast path when available)

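Below we (a) warm up, (b) define a batch inference helper, and (c) score a few random examples with Pi‑Scorer to see progress. The warm‑up falls back to HF `generate` when the vLLM path fails; the batch helper itself calls `model.fast_generate` and therefore requires vLLM.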
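
To compare runs before and after training, the per-example rows returned by `run_business_eval` (defined in the cell that follows) can be averaged per metric. This is a minimal sketch: `summarize_scores` is a hypothetical helper and the rows are made up to match the shape of that function's output.

```python
from statistics import mean

def summarize_scores(rows, keys=("pi_tone", "pi_resolution", "pi_policy", "pi_avg")):
    """Average each score column over the rows that contain it; absent columns are skipped."""
    summary = {}
    for key in keys:
        values = [row[key] for row in rows if key in row]
        if values:
            summary[key] = mean(values)
    return summary

rows = [  # made-up rows in the shape run_business_eval returns
    {"index": 0, "pi_tone": 0.8, "pi_resolution": 0.6, "pi_policy": 0.7, "pi_avg": 0.7},
    {"index": 3, "pi_tone": 0.6, "pi_resolution": 0.4, "pi_policy": 0.5, "pi_avg": 0.5},
]
summary = summarize_scores(rows)
print({k: round(v, 3) for k, v in summary.items()})
```

With only a handful of examples the averages are noisy, so treat small differences between runs with caution.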

In [None]:
import sys, time
from typing import List, Dict, Tuple, Any

print(f"Python: {sys.version.split()[0]}")
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("GPU:", torch.cuda.get_device_name(0))
    print("CUDA capability:", torch.cuda.get_device_capability(0))
    print("bfloat16 supported:", torch.cuda.is_bf16_supported())
    try:
        total, free = torch.cuda.get_device_properties(0).total_memory, torch.cuda.mem_get_info()[0]
        print("VRAM (total / free GB):", round(total/1e9,2), "/", round(free/1e9,2))
    except Exception as e:
        print("VRAM info unavailable:", e)

def _warm_up(model, tokenizer):
    """One tiny generation to compile kernels/caches."""
    prompt = tokenizer.apply_chat_template(
        [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": "Say hello in 5 words."},
        ],
        tokenize=False,
        add_generation_prompt=True,
    )
    t0 = time.time()
    try:
        from vllm import SamplingParams
        model.fast_generate([prompt], sampling_params=SamplingParams(max_tokens=16), lora_request=None)
        print("Warm-up (vLLM)", f"{time.time() - t0:.2f}s")
    except Exception as e:
        # vLLM path unavailable or failed: fall back to plain HF generate
        inputs = tokenizer([prompt], return_tensors="pt").to(model.device)
        with torch.no_grad():
            _ = model.generate(**inputs, max_new_tokens=16, pad_token_id=tokenizer.eos_token_id)
        print("Warm-up (HF generate)", f"{time.time() - t0:.2f}s", "| vLLM error:", repr(e))

_warm_up(model, tokenizer)

from vllm import SamplingParams

def _chat_to_prompt_text(messages: List[Dict[str, str]]) -> str:
    return tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

def _normalize_outputs(outs: List[Any]) -> List[str]:
    if not outs:
        return []
    if isinstance(outs[0], str):
        return outs
    if hasattr(outs[0], "outputs"):
        return [(o.outputs[0].text if getattr(o, "outputs", None) else "") for o in outs]
    if hasattr(outs[0], "text"):
        return [getattr(o, "text", str(o)) for o in outs]
    return [str(o) for o in outs]

def _infer_batch(prompts: List[List[Dict[str, str]]],
                 temperature: float = 0.4,
                 top_p: float = 0.9,
                 max_tokens: int = 180,
                 lora_request=None) -> Tuple[List[str], dict]:
    texts = [_chat_to_prompt_text(p) for p in prompts]
    params = SamplingParams(temperature=temperature, top_p=top_p, max_tokens=max_tokens)
    raw_outs = model.fast_generate(texts, sampling_params=params, lora_request=lora_request)
    replies = _normalize_outputs(raw_outs)
    return replies, {"backend": "vllm", "num": len(replies)}

def run_business_eval(n: int = 3, score_with_pi: bool = True, seed: int = 123, **gen_kwargs) -> list:
    import random
    random.seed(seed)
    n = min(n, len(dataset))
    idxs = random.sample(range(len(dataset)), k=n)
    prompts = [dataset[i]["prompt"] for i in idxs]
    outs, stats = _infer_batch(prompts, **gen_kwargs)
    rows = []
    if score_with_pi:
        # Reuse the reward functions from section 4 so eval and training ask the same questions
        completions = [[{"role": "assistant", "content": o}] for o in outs]
        tone_scores = pi_professional_tone(prompts, completions)
        res_scores = pi_issue_resolution(prompts, completions)
        policy_scores = pi_policy_adherence(prompts, completions)
    for j, i in enumerate(idxs):
        p = prompts[j]
        system_text = p[0]["content"]  # named to avoid shadowing the imported `sys` module
        ticket = p[1]["content"]
        policy = system_text.split("Policy:", 1)[1].strip() if "Policy:" in system_text else ""
        row = {"index": int(i), "ticket": ticket, "policy": policy, "reply": outs[j].strip()}
        if score_with_pi:
            row.update({
                "pi_tone": float(tone_scores[j]),
                "pi_resolution": float(res_scores[j]),
                "pi_policy": float(policy_scores[j]),
                "pi_avg": float((tone_scores[j] + res_scores[j] + policy_scores[j]) / 3.0),
            })
        rows.append(row)
        print("=" * 80)
        print(f"[{row['index']}] Ticket: {row['ticket']}\nPolicy: {row['policy']}\n\nReply:\n{row['reply']}")
        if score_with_pi:
            print(f"\nPi Scores — tone: {row['pi_tone']:.3f}, resolution: {row['pi_resolution']:.3f}, policy: {row['pi_policy']:.3f}, avg: {row['pi_avg']:.3f}")
    return rows

_ = run_business_eval(n=3, score_with_pi=True)

## 9) (Optional) Azure AI Foundry: run a cloud evaluation

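We generate a small JSONL file from the dataset and model replies, upload it to your Azure AI Project, and launch an evaluation using the built‑in **Relevance** evaluator (and optionally a custom evaluator you registered). The Projects SDK has changed between releases, so check the class and method names in this cell against the `azure-ai-projects` version you installed.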
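
Because a cloud evaluation only fails after the upload, it can be worth checking the JSONL rows locally first. The sketch below uses a hypothetical helper, `validate_jsonl_lines`, to verify that each line parses and carries non-empty `query` and `response` fields, the two keys the Relevance data mapping in this notebook refers to.

```python
import json

def validate_jsonl_lines(lines, required=("query", "response")):
    """Return (line_number, problem) tuples; an empty list means every row is usable."""
    problems = []
    for n, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # ignore blank lines
        try:
            row = json.loads(line)
        except json.JSONDecodeError as exc:
            problems.append((n, f"invalid JSON: {exc.msg}"))
            continue
        if not isinstance(row, dict):
            problems.append((n, "row is not a JSON object"))
            continue
        missing = [k for k in required
                   if not isinstance(row.get(k), str) or not row[k].strip()]
        if missing:
            problems.append((n, f"missing or empty: {', '.join(missing)}"))
    return problems

sample_lines = [  # made-up rows: one valid, one with an empty response, one malformed
    '{"query": "Where is my order?", "response": "It ships tomorrow."}',
    '{"query": "Refund?", "response": ""}',
    'not json',
]
problems = validate_jsonl_lines(sample_lines)
for line_no, problem in problems:
    print(f"line {line_no}: {problem}")
```

To check the file written by the cell below, pass the open file handle, for example `validate_jsonl_lines(open("azure_eval_dataset.jsonl", encoding="utf-8"))`.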

In [None]:
import os, json, pathlib
from azure.identity import DefaultAzureCredential
from azure.ai.projects import AIProjectClient
from azure.ai.projects.models import Evaluation, InputDataset, DatasetInputType, EvaluatorConfiguration, EvaluatorIds

def _gen_reply(messages):
    # Simple HF-generate for portability (no vLLM requirement on the compute that runs this cell)
    text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    toks = tokenizer(text, return_tensors="pt").to(model.device)
    with torch.no_grad():
        out = model.generate(**toks, max_new_tokens=200, do_sample=False, pad_token_id=tokenizer.eos_token_id)
    gen = out[0][toks["input_ids"].shape[1]:]
    return tokenizer.decode(gen, skip_special_tokens=True).strip()

sample = dataset.select(range(min(32, len(dataset))))
rows = []
for ex in sample:
    reply = _gen_reply(ex["prompt"])
    system_text = ex["prompt"][0]["content"]
    user_text = ex["prompt"][-1]["content"]
    policy = system_text.split("Policy:", 1)[1].strip() if "Policy:" in system_text else ""
    rows.append({
        "query": user_text,
        "response": reply,
        "policy": policy,
        "input": f"Ticket:\n{user_text}\n\nPolicy:\n{policy}",
        "output": reply,
    })

jsonl_path = pathlib.Path("azure_eval_dataset.jsonl")
with jsonl_path.open("w", encoding="utf-8") as f:
    for r in rows:
        f.write(json.dumps(r, ensure_ascii=False) + "\n")
print(f"Wrote {len(rows)} rows -> {jsonl_path.resolve()}")

conn_str = os.environ.get("AZURE_AI_PROJECT_CONNECTION_STRING")
if not conn_str and not os.environ.get("AZURE_AI_PROJECT"):
    print("Skip: set AZURE_AI_PROJECT_CONNECTION_STRING or AZURE_AI_PROJECT to run this cell.")
else:
    if conn_str:
        project_client = AIProjectClient.from_connection_string(conn_str)
    else:
        project_client = AIProjectClient(endpoint=os.environ["AZURE_AI_PROJECT"], credential=DefaultAzureCredential())

    with project_client:
        dataset_id = project_client.datasets.upload_file(path=str(jsonl_path))
        print("Uploaded dataset id:", dataset_id)

        evaluators = [
            EvaluatorConfiguration(
                id=EvaluatorIds.RELEVANCE.value,
                data_mapping={"query": "${data.query}", "response": "${data.response}"},
                settings={"definition": "Is the agent’s reply relevant and does it address the customer’s request?"}
            )
        ]
        custom_eval_id = os.environ.get("CUSTOM_EVALUATOR_ID")
        if custom_eval_id:
            evaluators.append(
                EvaluatorConfiguration(
                    id=custom_eval_id,
                    data_mapping={"input": "${data.input}", "output": "${data.output}", "policy": "${data.policy}"},
                )
            )

        headers = {}
        if all(k in os.environ for k in ("AZURE_OPENAI_ENDPOINT", "AZURE_OPENAI_API_KEY", "AZURE_OPENAI_DEPLOYMENT")):
            headers = {
                "model-endpoint": os.environ["AZURE_OPENAI_ENDPOINT"],
                "api-key": os.environ["AZURE_OPENAI_API_KEY"],
                "azureml-model-deployment": os.environ["AZURE_OPENAI_DEPLOYMENT"],
            }

        eval_job = project_client.evaluations.create(
            evaluation=Evaluation(
                name="contoso-cs-email-eval",
                data=InputDataset(type=DatasetInputType.URI_FILE_DATASET, path=dataset_id),
                evaluators=evaluators,
            ),
            headers=headers,
        )
        print("Submitted evaluation job:", eval_job)

---
**Notes**
- If you want to serve the base model + LoRA with vLLM elsewhere, construct a `LoRARequest` (signature varies across vLLM versions) and pass it to `model.fast_generate(..., lora_request=...)`.
- For a larger training run, scale `num_generations`, `max_steps`, and batch size gradually, and monitor GPU VRAM.
- Always verify your policies in the system message and ensure your reward questions match business goals.
