In [1]:
"""
Cell A: Set up a stable Unsloth + TRL GRPO environment for Colab T4 (CUDA 12.4).

What this cell does:
1) Upgrades pip.
2) Installs the *paired* CUDA 12.4 wheels (torch==2.6.0, torchvision==0.21.0, torchaudio==2.6.0).
3) Installs Unsloth via the *correct extras* for (cu124, torch 2.6) so that Unsloth Zoo matches Torch/CUDA.
4) Pins training stack versions that are known-good for GRPO with Unsloth:
   transformers==4.56.1, trl==0.23.0, accelerate>=1.0.1, peft>=0.13.2, bitsandbytes, datasets==4.3.0, etc.
5) Prints a reminder to manually restart the runtime so the new wheels are loaded.

Notes:
- We intentionally avoid importing torch/transformers in this cell to prevent in-memory version confusion.
- If you previously ran other notebooks in the same runtime, choose Runtime → Factory reset runtime… before running this cell.
"""

import sys, subprocess

PIP = [sys.executable, "-m", "pip"]

# 1) Fresh pip
subprocess.check_call(PIP + ["install", "-U", "pip"])

# 2) PyTorch + CUDA 12.4 wheels (paired versions avoid torchvision::nms errors)
subprocess.check_call(PIP + [
    "install", "-U", "--quiet", "--no-cache-dir",
    "torch==2.6.0+cu124", "torchvision==0.21.0+cu124", "torchaudio==2.6.0+cu124",
    "--index-url", "https://download.pytorch.org/whl/cu124"
])

# 3) Unsloth for this exact (CUDA, Torch) pair via extras (pulls matching zoo).
#    Use the same interpreter's pip as the other steps so everything lands in one environment.
subprocess.check_call(PIP + [
    "install", "-U",
    "unsloth[cu124-torch260] @ git+https://github.com/unslothai/unsloth.git",
])

# 4) Core training libs (versions aligned for TRL + Unsloth)
subprocess.check_call(PIP + [
    "install", "-U", "--quiet", "--no-cache-dir",
    "transformers==4.56.1",
    "trl==0.23.0",
    "accelerate>=1.0.1",
    "peft>=0.13.2",
    "bitsandbytes",
    "datasets==4.3.0",
    "sentencepiece",
    "protobuf>=5.28.3",
    "huggingface_hub>=0.24.6",
    "hf_transfer",
])

print("\n✅ Install complete. Now do: Runtime → Restart runtime…  then run the next cell.")



✅ Install complete. Now do: Runtime → Restart runtime…  then run the next cell.
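
Since Cell A deliberately avoids importing torch/transformers, one way to confirm the pins after the restart is to read the installed package metadata instead of importing the packages. The following is a minimal sketch under that assumption; the helper name `check_pins` and the `EXPECTED` table are hypothetical and only mirror the pins listed in Cell A.

```python
from importlib import metadata

# Pins copied from Cell A. `check_pins` is a hypothetical helper, not part of any library.
EXPECTED = {
    "torch": "2.6.0",
    "transformers": "4.56.1",
    "trl": "0.23.0",
    "datasets": "4.3.0",
}

def check_pins(expected, get_version=metadata.version):
    """Return {package: (installed_version_or_None, matches_pin)} without importing the packages."""
    report = {}
    for name, want in expected.items():
        try:
            have = get_version(name)
        except metadata.PackageNotFoundError:
            have = None
        # Local version tags such as "+cu124" are ignored when comparing.
        ok = have is not None and have.split("+", 1)[0] == want
        report[name] = (have, ok)
    return report

for name, (have, ok) in check_pins(EXPECTED).items():
    print(f"{name:13s} installed={have} expected={EXPECTED[name]} match={ok}")
```

Reading metadata keeps the "no heavy imports before restart" rule of Cell A intact, at the cost of not verifying that the CUDA build actually loads; Cell B covers that part.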


In [1]:
"""
Cell B: Safe imports & environment sanity for GRPO

What this cell does:
1) Sets Unsloth's safe flags for a stable first import.
2) Imports `unsloth` FIRST (required for its optimizations), then Hugging Face libs.
3) Prints concise environment & GPU info.
4) Asserts that TRL's `GRPOTrainer` is available and that Transformers exposes chat templating,
   so we can format reasoning prompts safely.

If this cell prints the final ✅ line, we're ready to load data and define rewards next.
"""

import os, platform

# 1) Conservative flags for a clean first import (can be relaxed later)
os.environ.setdefault("UNSLOTH_COMPILE_DISABLE", "1")
os.environ.setdefault("UNSLOTH_DISABLE_FAST_GENERATION", "1")
os.environ.setdefault("HF_HUB_DISABLE_PROGRESS_BARS", "1")
os.environ.setdefault("TOKENIZERS_PARALLELISM", "false")

# 2) Import order: Unsloth FIRST
import unsloth
from unsloth import FastLanguageModel, is_bfloat16_supported

# 3) Hugging Face stack
import torch, transformers, trl, datasets

# 4) Minimal sanity checks
# - TRL GRPO available?
try:
    from trl import GRPOTrainer  # noqa: F401
    grpo_ok = True
except Exception as e:
    grpo_ok = False
    print("GRPO import error:", repr(e))

# - Chat templating available?
templating_ok = hasattr(transformers.PreTrainedTokenizerBase, "apply_chat_template")

# - GPU info
cuda_ok = torch.cuda.is_available()
gpu_name = torch.cuda.get_device_name(0) if cuda_ok else "CPU"
cc = torch.cuda.get_device_capability(0) if cuda_ok else ("-", "-")

print("Python       :", platform.python_version())
print("Torch        :", torch.__version__, "| CUDA:", torch.version.cuda, "| CUDA available:", cuda_ok)
print("GPU          :", gpu_name, "| CC:", cc)
print("Transformers :", transformers.__version__)
print("TRL          :", trl.__version__, "| GRPO available:", grpo_ok)
print("Datasets     :", datasets.__version__)
print("Unsloth      :", getattr(unsloth, "__version__", "git"))
print("Chat templating available:", templating_ok)

assert grpo_ok, "TRL's GRPOTrainer not found - please re-check TRL install/pin."
assert templating_ok, "Transformers chat templating missing - please re-check Transformers version."

print("✅ Imports & sanity checks passed. Ready for dataset + rewards.")


🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
🦥 Unsloth Zoo will now patch everything to make training faster!
Python       : 3.12.12
Torch        : 2.6.0+cu124 | CUDA: 12.4 | CUDA available: True
GPU          : Tesla T4 | CC: (7, 5)
Transformers : 4.56.1
TRL          : 0.23.0 | GRPO available: True
Datasets     : 4.3.0
Unsloth      : 2025.11.3
Chat templating available: True
✅ Imports & sanity checks passed. Ready for dataset + rewards.
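
The compute capability printed above, (7, 5) for the Tesla T4, is what decides the training precision later on: native bfloat16 arrives with compute capability 8.0 (Ampere), so a T4 falls back to float16. The sketch below encodes only that rule of thumb; `pick_dtype_name` is a hypothetical helper, and Unsloth's `is_bfloat16_supported()` remains the check the notebook actually relies on.

```python
def pick_dtype_name(compute_capability):
    """Return "bfloat16" for GPUs with compute capability >= 8.0, else "float16".

    Rule-of-thumb sketch only; it does not query the driver or the PyTorch build.
    """
    major, _minor = compute_capability
    return "bfloat16" if major >= 8 else "float16"

print(pick_dtype_name((7, 5)))  # Tesla T4, as reported by Cell B
print(pick_dtype_name((8, 0)))  # an Ampere-class card, for comparison
```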


In [3]:
"""
Cell C: Prepare the GRPO dataset (GSM8K → prompts + ground_truth)

What this cell does:
1) Loads a true `datasets.Dataset` slice using the split slicer (NOT `dataset[:N]`, which returns a dict).
2) Extracts the numeric ground-truth answer from GSM8K's 'answer' (after the '####' marker).
3) Builds a conversational `prompt` with a system format instruction enforcing:
      <reasoning>...</reasoning><answer>...</answer>
4) Filters rows missing ground-truth and prints 2 examples.

Why the split slicer is used:
- `dataset[:N]` returns a dict of lists, so calling `.map(...)` on it fails. Using
  `split="train[:N]"` (or `.select(range(N))`) preserves the `Dataset` API (supports `.map`, `.filter`).
"""

import re
from datasets import load_dataset

SYSTEM_PROMPT = """
Respond in the following format:
<reasoning>
...
</reasoning>
<answer>
...
</answer>
""".strip()

def extract_hash_answer(text: str):
    # GSM8K final answer appears after '####'
    if "####" not in text:
        return None
    return text.split("####", 1)[1].strip()

_num_pat = re.compile(r"^[\s]*([-+]?\d[\d,]*([.]\d+)?)\s*$")

def normalize_number(s: str):
    if s is None:
        return None
    m = _num_pat.match(s.strip())
    s = (m.group(1) if m else s).replace(",", "").strip()
    return s

# 1) Load a Dataset slice (keeps Dataset API intact)
N_ROWS = 1000  # bump later
train = load_dataset("openai/gsm8k", "main", split=f"train[:{N_ROWS}]")

# 2) Map → add prompt + ground_truth
def to_prompt_row(row):
    gt = normalize_number(extract_hash_answer(row["answer"]))
    return {
        "prompt": [
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user",   "content": row["question"]},
        ],
        "ground_truth": gt,
        "question": row["question"],
    }

train = train.map(to_prompt_row)

# 3) Keep only the columns we need
keep_cols = ["prompt", "ground_truth", "question"]
drop_cols = [c for c in train.column_names if c not in keep_cols]
if drop_cols:
    train = train.remove_columns(drop_cols)

# 4) Filter rows without ground_truth (rare)
train = train.filter(lambda r: r["ground_truth"] is not None)

# 5) Peek
def compact(msgs, limit=140):
    parts = []
    for m in msgs:
        s = f"{m['role'].upper()}: {m['content']}"
        parts.append(s if len(s) <= limit else s[:limit] + " …")
    return " | ".join(parts)

print("Rows:", len(train))
print("Columns:", train.column_names)
print("\nExample 1:")
print(compact(train[0]["prompt"]))
print("ground_truth:", train[0]["ground_truth"])
if len(train) > 1:
    print("\nExample 2:")
    print(compact(train[1]["prompt"]))
    print("ground_truth:", train[1]["ground_truth"])

print("\n✅ GSM8K prepared with conversational prompts and normalized numeric ground_truth.")


Map:   0%|          | 0/1000 [00:00<?, ? examples/s]

Filter:   0%|          | 0/1000 [00:00<?, ? examples/s]

Rows: 1000
Columns: ['question', 'prompt', 'ground_truth']

Example 1:
SYSTEM: Respond in the following format:
<reasoning>
...
</reasoning>
<answer>
...
</answer> | USER: Natalia sold clips to 48 of her friends in April, and then she sold half as many clips in May. How many clips did Natalia sell altoget …
ground_truth: 72

Example 2:
SYSTEM: Respond in the following format:
<reasoning>
...
</reasoning>
<answer>
...
</answer> | USER: Weng earns $12 an hour for babysitting. Yesterday, she just did 50 minutes of babysitting. How much did she earn?
ground_truth: 10

✅ GSM8K prepared with conversational prompts and normalized numeric ground_truth.
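
To make the ground-truth extraction concrete, here is a small standalone sketch that re-declares the two helpers from Cell C and runs them on a few hand-written strings. The sample strings are illustrative, not rows from the dataset. Note the last case: a value such as `$10` does not match the numeric pattern and is passed through unchanged.

```python
import re

# Same pattern and helpers as Cell C, repeated so this block runs on its own.
_NUM = re.compile(r"^[\s]*([-+]?\d[\d,]*([.]\d+)?)\s*$")

def extract_hash_answer(text):
    # The final answer follows the '####' marker.
    return text.split("####", 1)[1].strip() if "####" in text else None

def normalize_number(s):
    if s is None:
        return None
    m = _NUM.match(s.strip())
    return (m.group(1) if m else s).replace(",", "").strip()

# Illustrative inputs (hand-written, not taken from GSM8K).
samples = [
    "48 / 2 = 24 clips in May.\n48 + 24 = 72\n#### 72",
    "The total is large.\n#### 1,234",
    "No marker here",
    "#### $10",
]
results = [normalize_number(extract_hash_answer(s)) for s in samples]
print(results)  # ['72', '1234', None, '$10']
```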


In [4]:
"""
Cell D: Reward functions for GRPO (format + exact answer)

What this cell does:
1) Implements two TRL-compatible reward functions:
   - format_reward(completions, **kwargs) → 1.0 if completion matches
     `<reasoning>…</reasoning><answer>…</answer>`, else 0.0.
   - exact_answer_reward(completions, ground_truth, **kwargs) → 1.0 if the
     extracted <answer> equals the normalized numeric ground_truth, else 0.0.

2) Includes a tiny, fast unit test to prove both functions return lists of floats
   and have the right shape for GRPOTrainer.

Why this matches references:
- TRL reward functions receive a list of completions (each completion is a single
  assistant message with a "content" field) and must return list[float]. (See TRL docs)
- GSM8K's answer is the number after the `####` marker; we already prepared the
  dataset to expose a normalized `ground_truth` column. (See GSM8K dataset card)
- Unsloth's GRPO tutorial enforces reasoning+answer tags and uses accuracy-like rewards.
"""

import re
from typing import List, Dict, Any

# Precompile a strict "reasoning+answer" pattern, allowing whitespace/newlines between tags
_REASONING_ANSWER_RE = re.compile(
    r"^\s*<reasoning>[\s\S]+?</reasoning>\s*<answer>\s*([\s\S]+?)\s*</answer>\s*$",
    re.IGNORECASE,
)

_NUM_NORM_RE = re.compile(r"^[\s]*([-+]?\d[\d,]*([.]\d+)?)\s*$")

def _get_text_from_completion(c: List[Dict[str, str]]) -> str:
    """
    Each completion is a list with one dict {"content": "..."} (TRL convention).
    Be lenient if a model returns role+content.
    """
    if not c:
        return ""
    msg = c[0]
    return msg.get("content", "") if isinstance(msg, dict) else str(msg)

def _normalize_number(s: str | None) -> str | None:
    if s is None:
        return None
    s = s.strip()
    m = _NUM_NORM_RE.match(s)
    if m:
        s = m.group(1)
    return s.replace(",", "").strip()

def format_reward(completions: List[List[Dict[str, str]]], **kwargs: Any) -> List[float]:
    """
    Reward = 1.0 if completion matches:
      <reasoning>...</reasoning><answer>...</answer>
    else 0.0
    """
    rewards: List[float] = []
    for c in completions:
        text = _get_text_from_completion(c)
        rewards.append(1.0 if _REASONING_ANSWER_RE.match(text) else 0.0)
    return rewards

def exact_answer_reward(
    completions: List[List[Dict[str, str]]],
    ground_truth: List[str] | None = None,
    **kwargs: Any,
) -> List[float]:
    """
    Reward = 1.0 if extracted <answer> equals ground_truth (normalized), else 0.0.
    `ground_truth` is expected to come from the batch (our dataset column).
    """
    gt_list = ground_truth or kwargs.get("ground_truth") or []
    out: List[float] = []
    for i, c in enumerate(completions):
        text = _get_text_from_completion(c)
        m = _REASONING_ANSWER_RE.match(text)
        if not m:
            out.append(0.0)
            continue
        pred = _normalize_number(m.group(1))
        gold = _normalize_number(gt_list[i] if i < len(gt_list) else None)
        out.append(1.0 if (pred is not None and gold is not None and pred == gold) else 0.0)
    return out

# --------- Tiny unit tests (run fast) ---------
_test_completions = [
    [{"content": "<reasoning>\nsteps\n</reasoning>\n<answer>\n72\n</answer>"}],
    [{"content": "<reasoning>\noops missing answer tag\n</reasoning>"}],
    [{"content": "<reasoning>ok</reasoning><answer>10</answer>"}],
]
_test_gt = ["72", "10", "10"]

fr = format_reward(_test_completions)
ar = exact_answer_reward(_test_completions, ground_truth=_test_gt)

print("format_reward →", fr)  # expect [1.0, 0.0, 1.0]
print("exact_answer_reward →", ar)  # expect [1.0, 0.0, 1.0]
assert all(isinstance(x, float) for x in fr) and len(fr) == len(_test_completions)
assert all(isinstance(x, float) for x in ar) and len(ar) == len(_test_completions)

print("✅ Rewards ready: TRL-compatible signatures and shapes.")


format_reward → [1.0, 0.0, 1.0]
exact_answer_reward → [1.0, 0.0, 1.0]
✅ Rewards ready: TRL-compatible signatures and shapes.
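
The unit test above passes `ground_truth` by hand. During training, the trainer supplies it: as far as we understand TRL's convention, extra dataset columns reach the reward function as keyword arguments, with one entry per completion, so a prompt sampled G times has its `ground_truth` repeated G times. The sketch below imitates such a call with a hand-built batch. The `batch` dict and the compact `answer_reward` are illustrative stand-ins, not TRL internals.

```python
import re

TAGGED = re.compile(
    r"^\s*<reasoning>[\s\S]+?</reasoning>\s*<answer>\s*([\s\S]+?)\s*</answer>\s*$",
    re.IGNORECASE,
)

def answer_reward(completions, ground_truth, **kwargs):
    """Compact stand-in for exact_answer_reward; unused columns land in **kwargs."""
    out = []
    for completion, gold in zip(completions, ground_truth):
        m = TAGGED.match(completion[0]["content"])
        out.append(1.0 if m and m.group(1).replace(",", "") == gold else 0.0)
    return out

# One prompt sampled G = 3 times; every column has one entry per completion.
G = 3
batch = {
    "prompts": [[{"role": "user", "content": "What is 6 * 7?"}]] * G,
    "completions": [
        [{"role": "assistant", "content": "<reasoning>6*7</reasoning><answer>42</answer>"}],
        [{"role": "assistant", "content": "<reasoning>guess</reasoning><answer>41</answer>"}],
        [{"role": "assistant", "content": "The answer is 42."}],
    ],
    "ground_truth": ["42"] * G,
}
rewards = answer_reward(**batch)
print(rewards)  # [1.0, 0.0, 0.0]
```

The third completion contains the right number but scores 0.0 because the tags are missing, which is why the format reward is trained alongside the answer reward.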


In [5]:
"""
Cell E: Model setup for GRPO (4-bit Qwen2.5-1.5B-Instruct + LoRA + chat template) + smoke generation

What this cell does:
1) Loads a small, T4-friendly 4-bit instruct model: unsloth/Qwen2.5-1.5B-Instruct-bnb-4bit.
2) Applies LoRA adapters (memory-efficient training).
3) Ensures a proper chat template is attached (Qwen2.5 Instruct models usually ship one;
   we attach Unsloth's 'qwen25' template if missing).
4) Runs a quick smoke generation that enforces the <reasoning>...</reasoning><answer>...</answer> format.

Why this setup:
- Qwen2.5-1.5B-Instruct-bnb-4bit is lightweight enough to sample multiple completions per prompt on a T4.
- Instruct variants include chat templates; if not present, we attach Unsloth's 'qwen25' template.
- LoRA keeps only ~1% of parameters trainable, ideal for GRPO on a single T4.
"""

import torch, math
from unsloth import FastLanguageModel, is_bfloat16_supported
from transformers import TextStreamer

MODEL_NAME   = "unsloth/Qwen2.5-1.5B-Instruct-bnb-4bit"
MAX_SEQ_LEN  = 2048
use_bf16     = is_bfloat16_supported()
dtype        = torch.bfloat16 if use_bf16 else torch.float16

# 1) Load 4-bit base + tokenizer
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name      = MODEL_NAME,
    max_seq_length  = MAX_SEQ_LEN,
    load_in_4bit    = True,
    dtype           = dtype,
)

# 2) Tokenizer padding sanity
if tokenizer.pad_token_id is None:
    tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"

# 3) Ensure chat template exists (Qwen2.5 Instruct usually has one; attach if missing)
if not getattr(tokenizer, "chat_template", None):
    from unsloth.chat_templates import get_chat_template
    tokenizer = get_chat_template(tokenizer, chat_template="qwen25")
print("Chat template attached:", bool(getattr(tokenizer, "chat_template", None)))

# 4) Attach LoRA adapters (Qwen/Llama-style projection names)
model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    lora_alpha=32,
    lora_dropout=0,
    target_modules=["q_proj","k_proj","v_proj","o_proj","gate_proj","up_proj","down_proj"],
    use_gradient_checkpointing="unsloth",
    random_state=42,
    max_seq_length=MAX_SEQ_LEN,
)

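# ---- Diagnostics ----
# Note: bitsandbytes stores 4-bit weights packed, so `numel()` undercounts the base model and the
# percentage printed here comes out higher than the one Unsloth reports when training starts.
trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total     = sum(p.numel() for p in model.parameters())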
print(f"Loaded: {MODEL_NAME}")
print(f"Max seq len: {MAX_SEQ_LEN} | Dtype: {dtype} | BF16 supported: {use_bf16}")
print(f"Params: {trainable:,} trainable / {total:,} total (~{100*trainable/total:.2f}% trainable)")
print("Tokenizer pad_token_id:", tokenizer.pad_token_id, "| padding_side:", tokenizer.padding_side)

# 5) Quick smoke generation to verify formatting & template
def _to_device(batch, device):
    # Accept a BatchEncoding (dict-like) or a bare Tensor; always return a mapping
    from collections.abc import Mapping
    if isinstance(batch, Mapping):
        return {k: (v.to(device) if hasattr(v, "to") else v) for k, v in batch.items()}
    if torch.is_tensor(batch):
        return {"input_ids": batch.to(device), "attention_mask": torch.ones_like(batch, device=device)}
    raise TypeError(f"Unexpected inputs type: {type(batch)}")

def chat(messages, max_new_tokens=160, temperature=0.8, top_p=0.95):
    inputs = tokenizer.apply_chat_template(
        messages,
        add_generation_prompt=True,
        return_tensors="pt",
        return_dict=True,
        tokenize=True,
    )
    inputs = _to_device(inputs, model.device)
    prompt_len = inputs["input_ids"].shape[1]
    from torch import no_grad
    with no_grad(), torch.amp.autocast("cuda", dtype=torch.float16):
        outputs = model.generate(
            **inputs,
            max_new_tokens=max_new_tokens,
            do_sample=True,
            temperature=temperature,
            top_p=top_p,
            eos_token_id=tokenizer.eos_token_id,
            pad_token_id=tokenizer.pad_token_id,
        )
    new_tokens = outputs[0, prompt_len:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True)

SYSTEM_PROMPT = (
    "Respond in the following format:\n"
    "<reasoning>\n...\n</reasoning>\n<answer>\n...\n</answer>"
)
demo_q = "If a book costs $8 and a pen costs $2, how much for 3 books and 4 pens?"

print("\n=== Smoke generation ===")
print(chat([
    {"role":"system","content": SYSTEM_PROMPT},
    {"role":"user","content": demo_q},
])[:600])

print("\n✅ Model, LoRA and chat template are ready.")


==((====))==  Unsloth 2025.11.3: Fast Qwen2 patching. Transformers: 4.56.1.
   \\   /|    Tesla T4. Num GPUs = 1. Max memory: 14.741 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.6.0+cu124. CUDA: 7.5. CUDA Toolkit: 12.4. Triton: 3.2.0
\        /    Bfloat16 = FALSE. FA [Xformers = 0.0.29.post3. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!
Chat template attached: True


Unsloth 2025.11.3 patched 28 layers with 28 QKV layers, 28 O layers and 28 MLP layers.


Loaded: unsloth/Qwen2.5-1.5B-Instruct-bnb-4bit
Max seq len: 2048 | Dtype: torch.float16 | BF16 supported: False
Params: 18,464,768 trainable / 907,081,216 total (~2.04% trainable)
Tokenizer pad_token_id: 151654 | padding_side: right

=== Smoke generation ===
To calculate the total cost of 3 books and 4 pens, we need to multiply the cost of each item by the quantity and then add them together. 

The calculation is as follows:

\( (3 \times \$8) + (4 \times \$2) = \$24 + \$8 = \$32 \).

So, the total cost would be \( \$32 \).

✅ Model, LoRA and chat template are ready.
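
The trainable-parameter count printed above can be reproduced by hand: a rank-r LoRA adapter on a linear layer of shape (in, out) adds r * (in + out) weights. The sketch below assumes the layer shapes of the public Qwen2.5-1.5B configuration (hidden size 1536, key/value projection width 256, MLP width 8960, 28 layers). Those dimensions are not printed anywhere in this notebook, so treat them as an assumption that the matching total makes plausible.

```python
# Assumed Qwen2.5-1.5B shapes (from the public model config, not from this notebook).
HIDDEN, KV_DIM, MLP_DIM, LAYERS = 1536, 256, 8960, 28
R = 16  # LoRA rank used in Cell E

# (in_features, out_features) for each targeted projection in one decoder layer.
targets = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, MLP_DIM),
    "up_proj": (HIDDEN, MLP_DIM),
    "down_proj": (MLP_DIM, HIDDEN),
}

# LoRA adds A (r x in) and B (out x r) per targeted layer.
per_layer = sum(R * (fan_in + fan_out) for fan_in, fan_out in targets.values())
lora_params = per_layer * LAYERS
print(f"{lora_params:,}")  # 18,464,768

# Same numerator, two denominators: the packed 4-bit count seen by numel(),
# and the unpacked count that Unsloth reports at the start of training.
print(f"{100 * lora_params / 907_081_216:.2f}%")    # 2.04%
print(f"{100 * lora_params / 1_562_179_072:.2f}%")  # 1.18%
```

The 1.18% figure is the one to quote when describing the fraction of the model being trained; 2.04% is an artifact of counting packed 4-bit storage.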


In [6]:
"""
Cell F: Configure GRPO (T4-friendly) and build the GRPOTrainer

What this cell does:
1) Creates a GRPOConfig tuned for Colab T4:
   - num_generations=3  (sample 3 completions per prompt)
   - modest lengths (max_prompt_length=512, max_completion_length=256)
   - small batch with gradient accumulation (fits T4)
   - loss_type="dapo" and beta=0.0 (common, length-bias aware; KL off by default)
   - temperature/top_p for diversity during GRPO sampling
   - 8-bit AdamW optimizer

2) Instantiates GRPOTrainer with:
   - our LoRA model + tokenizer as processing_class (applies chat template)
   - BOTH reward functions: format_reward + exact_answer_reward
   - the GSM8K prompt dataset from earlier (columns: prompt, ground_truth)

3) Prints a short summary so we can sanity-check shapes & knobs before training.

Why these choices (backed by docs):
- TRL's GRPO docs show using `GRPOConfig` (num_generations, lengths, loss_type, beta).
- DAPO with β=0.0 is a recommended starting point; `scale_rewards` can be changed later.
- Passing `processing_class=tokenizer` is the documented way to enable chat templating in GRPO.
"""

from trl import GRPOConfig, GRPOTrainer

# --- T4-friendly GRPO defaults (you can relax later) ---
USE_BF16 = False  # our runtime reports BF16 False; keep fp16
training_args = GRPOConfig(
    output_dir="grpo_qwen15b_gsm8k_runs",
    learning_rate=5e-6,
    per_device_train_batch_size=1,     # Unsloth raises this to num_generations (3); see the log below
    gradient_accumulation_steps=6,
    max_steps=100,                     # quick loop check; scale later
    logging_steps=5,
    save_steps=1000,                   # effectively off for this short run
    report_to="none",

    # Generation (GRPO sampling) knobs
    num_generations=3,                 # completions per prompt (G)
    max_prompt_length=512,
    max_completion_length=256,
    temperature=1.0,
    top_p=0.95,
    top_k=0,

    # GRPO objective & shaping
    loss_type="dapo",                  # length-bias mitigation
    beta=0.0,                          # KL off (per TRL defaults/notes)
    scale_rewards="batch",             # robust shaping across batch (optional)

    # Precision / optimizer
    fp16=not USE_BF16,
    bf16=USE_BF16,
    optim="paged_adamw_8bit",
)

# Build the trainer. TRL will:
#  * sample G completions per prompt
#  * call our reward functions
#  * apply DAPO objective over generated tokens
trainer = GRPOTrainer(
    model=model,                                   # LoRA model from Cell E
    processing_class=tokenizer,                    # ensures chat templating
    args=training_args,
    train_dataset=train,                           # Dataset with 'prompt' + 'ground_truth'
    reward_funcs=[format_reward, exact_answer_reward],
)

# --- Summarize critical bits ---
print("Trainer ready.")
print("num_generations       :", trainer.num_generations)
print("max_prompt_length     :", trainer.max_prompt_length)
print("max_completion_length :", trainer.max_completion_length)
print("loss_type / beta      :", training_args.loss_type, training_args.beta)
print("batch / grad_accum    :", training_args.per_device_train_batch_size, training_args.gradient_accumulation_steps)
print("dtype(fp16/bf16)      :", training_args.fp16, training_args.bf16)
print("✅ GRPOTrainer constructed. Next cell will run a short train (100 steps).")


Unsloth: The DAPO paper recommends `mask_truncated_completions = True` - we will set it.
Unsloth: The DAPO paper recommends `epsilon_high = 0.28` - we will set it.
Unsloth: We now expect `per_device_train_batch_size` to be a multiple of `num_generations`.
We will change the batch size of 1 to the `num_generations` of 3
Trainer ready.
num_generations       : 3
max_prompt_length     : 512
max_completion_length : 256
loss_type / beta      : dapo 0.0
batch / grad_accum    : 3 6
dtype(fp16/bf16)      : True False
✅ GRPOTrainer constructed. Next cell will run a short train (100 steps).
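
For intuition about what `num_generations` and `scale_rewards="batch"` do, the following is a simplified sketch of group-relative advantages: each completion's reward is compared with the mean reward of the completions sampled for the same prompt, then divided by the standard deviation over the whole batch. This is our reading of the idea, not TRL's implementation; the function name, the `eps` value, and the use of the sample standard deviation are assumptions.

```python
from statistics import mean, stdev

def group_relative_advantages(rewards, num_generations, eps=1e-4):
    """Sketch: subtract the per-prompt mean, scale by the batch-wide std (assumed behavior)."""
    assert len(rewards) % num_generations == 0, "rewards must come in whole groups"
    batch_std = stdev(rewards) if len(rewards) > 1 else 0.0
    advantages = []
    for start in range(0, len(rewards), num_generations):
        group = rewards[start:start + num_generations]
        baseline = mean(group)  # group-relative baseline
        advantages.extend((r - baseline) / (batch_std + eps) for r in group)
    return advantages

# Two prompts, G = 3 completions each; rewards are format + exact-answer sums.
rewards = [2.0, 1.0, 0.0,   0.0, 0.0, 0.0]
adv = group_relative_advantages(rewards, num_generations=3)
print([round(a, 3) for a in adv])
```

The second prompt, where all three completions scored the same, yields zero advantage for every completion and therefore contributes no learning signal. This is one reason early steps with near-zero reward make little visible progress.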


In [7]:
"""
Cell G: Run a short GRPO training loop (100 steps)

What this cell does:
1) Sets a seed for reproducibility of sampling and rewards.
2) Calls `trainer.train()` to:
   - sample `num_generations` completions per prompt,
   - compute both rewards (format + exact-answer),
   - optimize with the DAPO loss (reference-free, beta=0.0),
   - log progress every few steps.

Notes:
- Settings and behavior follow Unsloth's GRPO tutorial and TRL's GRPOTrainer API.
- You should see a log row every few steps (logging_steps=5); on completion, a TrainOutput summary prints.
"""
from transformers import set_seed
import time

set_seed(42)
t0 = time.time()
train_output = trainer.train()
t1 = time.time()

print("\nTraining complete.")
print(train_output)
print(f"Wall clock (s): {t1 - t0:.1f}")


==((====))==  Unsloth - 2x faster free finetuning | Num GPUs used = 1
   \\   /|    Num examples = 1,000 | Num Epochs = 1 | Total steps = 100
O^O/ \_/ \    Batch size per device = 3 | Gradient accumulation steps = 6
\        /    Data Parallel GPUs = 1 | Total batch size (3 x 6 x 1) = 18
 "-____-"     Trainable parameters = 18,464,768 of 1,562,179,072 (1.18% trained)


Unsloth: Will smartly offload gradients to save VRAM!


Step,Training Loss,reward,reward_std,completions / mean_length,completions / min_length,completions / max_length,completions / clipped_ratio,completions / mean_terminated_length,completions / min_terminated_length,completions / max_terminated_length,sampling / sampling_logp_difference / mean,sampling / sampling_logp_difference / max,sampling / importance_sampling_ratio / min,sampling / importance_sampling_ratio / mean,sampling / importance_sampling_ratio / max,kl,rewards / format_reward / mean,rewards / format_reward / std,rewards / exact_answer_reward / mean,rewards / exact_answer_reward / std
5,0.0,0.044444,0.188562,188.255557,86.0,256.0,0.277778,161.88425,86.0,237.2,0,0,0,0,0,0.0,0.044444,0.188562,0.0,0.0
10,0.0,0.0,0.0,185.233334,92.0,250.2,0.155556,172.005865,92.0,242.4,No Log,No Log,No Log,No Log,No Log,0.0,0.0,0.0,0.0,0.0
15,0.0,0.022222,0.064676,198.055554,106.4,256.0,0.266667,177.34155,106.4,242.8,No Log,No Log,No Log,No Log,No Log,0.0,0.022222,0.064676,0.0,0.0
20,0.0,0.044444,0.188562,186.044446,103.0,256.0,0.2,168.65358,103.0,247.0,No Log,No Log,No Log,No Log,No Log,0.0,0.033333,0.141421,0.011111,0.04714
25,0.0,0.144444,0.383523,186.888889,88.6,256.0,0.211111,168.556992,88.6,237.6,No Log,No Log,No Log,No Log,No Log,0.0,0.111111,0.274072,0.033333,0.141421
30,0.0,0.122222,0.420813,201.655557,91.4,256.0,0.3,177.401416,91.4,241.6,No Log,No Log,No Log,No Log,No Log,0.0,0.066667,0.223633,0.055556,0.206098
35,0.0,0.3,0.572613,185.611115,90.4,256.0,0.244444,162.873083,90.4,245.8,No Log,No Log,No Log,No Log,No Log,0.0,0.233333,0.421678,0.066667,0.188513
40,0.0,0.444444,0.673521,162.444443,69.4,256.0,0.2,139.33613,69.4,240.4,No Log,No Log,No Log,No Log,No Log,0.0,0.344444,0.479759,0.1,0.30033
45,0.0,0.9,0.852407,171.0,78.2,256.0,0.177778,154.922461,78.2,236.8,No Log,No Log,No Log,No Log,No Log,0.0,0.577778,0.469853,0.322222,0.47871
50,0.0,0.966667,0.769931,176.266669,82.8,256.0,0.166667,159.794537,82.8,249.0,No Log,No Log,No Log,No Log,No Log,0.0,0.688889,0.463208,0.277778,0.451773



Training complete.
TrainOutput(global_step=100, training_loss=1.536249492062325e-08, metrics={'train_runtime': 3629.7756, 'train_samples_per_second': 0.496, 'train_steps_per_second': 0.028, 'total_flos': 0.0, 'train_loss': 1.536249492062325e-08})
Wall clock (s): 3633.4
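
A few lines of arithmetic tie the numbers in this log together. The values below are copied from the Cell F and Cell G outputs. The interpretation (18 completions per optimizer step, hence 6 distinct prompts per step, and a total `reward` that is the equal-weight sum of the two reward means) is our reading of the log rather than something the log states.

```python
# Values copied from the Cell F / Cell G outputs above.
batch_per_device = 3
grad_accum = 6
num_generations = 3
max_steps = 100
train_runtime_s = 3629.7756

completions_per_step = batch_per_device * grad_accum        # 18
prompts_per_step = completions_per_step // num_generations  # 6
total_completions = completions_per_step * max_steps        # 1800
total_prompts = prompts_per_step * max_steps                # 600

print("completions/step:", completions_per_step, "| prompts/step:", prompts_per_step)
print("total completions:", total_completions, "| total prompts:", total_prompts)
print("samples/s:", round(total_completions / train_runtime_s, 3))  # 0.496, as logged
print("seconds/step:", round(train_runtime_s / max_steps, 1))       # 36.3

# Logged rows (format mean, exact-answer mean, total reward) for steps 20, 45 and 50.
rows = [(0.033333, 0.011111, 0.044444), (0.577778, 0.322222, 0.9), (0.688889, 0.277778, 0.966667)]
for fmt, exact, total in rows:
    assert abs((fmt + exact) - total) < 1e-6
```

So the 100-step run saw roughly 600 of the 1,000 prepared prompts, and the rising total reward is driven by both components, with format compliance improving first.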


In [8]:
"""
Cell H: Evaluate the GRPO model on GSM8K prompts (exact match on the <answer> tag)

What this cell does:
1) Switches the model to fast inference and eval mode.
2) Runs greedy generation (do_sample=False) on the first N_EVAL prompts of `train`.
   These are training prompts, so the score is a sanity check, not held-out accuracy.
3) Extracts the numeric string inside <answer>...</answer>.
4) Computes exact-match accuracy against our `ground_truth`.
5) Prints a few qualitative examples (Q, predicted, truth).

Why this is correct:
- GSM8K evaluation usually checks the final numeric answer only.
- Our GRPO format enforces:
      <reasoning>...</reasoning>
      <answer>...</answer>
  so we parse the answer tag directly.
- We use Transformers' chat templating for consistent formatting.
"""

import re, time, torch
from unsloth import FastLanguageModel

# 1) Enable fast inference & eval mode
FastLanguageModel.for_inference(model)
model.eval()

# 2) Config
N_EVAL = min(64, len(train))
MAX_NEW_TOKENS = 256
ANSWER_RE = re.compile(
    r"<answer>\s*([\s\S]+?)\s*</answer>", re.IGNORECASE
)
NUM_NORM = re.compile(r"^\s*([-+]?\d[\d,]*([.]\d+)?)\s*$")

def _normalize_num(s: str | None):
    if not s: return None
    s = s.strip()
    m = NUM_NORM.match(s)
    if m:
        s = m.group(1)
    return s.replace(",", "").strip()

def generate_one(messages):
    # Return decoded new text only (post prompt)
    enc = tokenizer.apply_chat_template(
        messages, add_generation_prompt=True,
        return_tensors="pt", return_dict=True, tokenize=True
    )
    # Move to device
    enc = {k: (v.to(model.device) if hasattr(v,"to") else v) for k,v in enc.items()}
    prompt_len = enc["input_ids"].shape[1]
    with torch.no_grad(), torch.amp.autocast("cuda", dtype=torch.float16):
        out = model.generate(
            **enc,
            max_new_tokens=MAX_NEW_TOKENS,
            do_sample=False,                # greedy for deterministic eval
            eos_token_id=tokenizer.eos_token_id,
            pad_token_id=tokenizer.pad_token_id,
        )
    new_tokens = out[0, prompt_len:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True)

correct = 0
examples = []
t0 = time.time()

# 3) Evaluate the first N_EVAL training rows (use the GSM8K test split for a held-out score)
for i in range(N_EVAL):
    row = train[i]
    pred_text = generate_one(row["prompt"])
    # Parse <answer>...</answer>
    m = ANSWER_RE.search(pred_text)
    pred = _normalize_num(m.group(1) if m else None)
    gold = _normalize_num(row["ground_truth"])
    ok = (pred is not None and gold is not None and pred == gold)
    correct += int(ok)
    if len(examples) < 5:
        examples.append({
            "q": row["question"][:140] + ("…" if len(row["question"])>140 else ""),
            "pred": (pred if pred is not None else "∅"),
            "gold": gold,
            "ok": ok,
        })

acc = correct / N_EVAL if N_EVAL else 0.0
t1 = time.time()

print(f"Evaluated {N_EVAL} prompts.")
print(f"Exact-match accuracy: {acc:.3f}")
print("Examples:")
for ex in examples:
    print(f"• ok={ex['ok']} | pred={ex['pred']} | gold={ex['gold']} | Q: {ex['q']}")
print(f"Wall clock: {t1 - t0:.1f}s")
print("✅ Eval complete.")


Evaluated 64 prompts.
Exact-match accuracy: 0.281
Examples:
• ok=False | pred=36 | gold=72 | Q: Natalia sold clips to 48 of her friends in April, and then she sold half as many clips in May. How many clips did Natalia sell altogether in…
• ok=False | pred=\$10 | gold=10 | Q: Weng earns $12 an hour for babysitting. Yesterday, she just did 50 minutes of babysitting. How much did she earn?
• ok=False | pred=$5 | gold=5 | Q: Betty is saving money for a new wallet which costs $100. Betty has only half of the money she needs. Her parents decided to give her $15 for…
• ok=True | pred=42 | gold=42 | Q: Julie is reading a 120-page book. Yesterday, she was able to read 12 pages and today, she read twice as many pages as yesterday. If she want…
• ok=False | pred=416 | gold=624 | Q: James writes a 3-page letter to 2 different friends twice a week.  How many pages does he write a year?
Wall clock: 690.2s
✅ Eval complete.
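
Two of the five examples above (`\$10` vs `10`, `$5` vs `5`) fail only because of a currency prefix, which the strict normalizer passes through unchanged. One possible remedy is a more lenient extractor that pulls the single number out of the answer text. The sketch below is hypothetical (`lenient_number` is not used anywhere in this notebook). Adopting it would change the reported accuracy, and it would need to be applied to the reward function and the evaluation alike to keep them consistent.

```python
import re

# One signed number with optional thousands separators and decimals.
_NUMBER = re.compile(r"[-+]?\d[\d,]*(?:\.\d+)?")

def lenient_number(text):
    """Return the only number found in `text` (commas removed), or None if absent or ambiguous."""
    if not text:
        return None
    found = _NUMBER.findall(text)
    if len(found) != 1:
        return None  # zero or several numbers: refuse to guess
    return found[0].replace(",", "")

for raw in ["\\$10", "$5", "1,234 clips", "3 or 4", "forty-two"]:
    print(repr(raw), "->", lenient_number(raw))
```

Returning None when several numbers appear is a deliberately conservative choice: it avoids rewarding answers that hedge between values. It still treats `10.00` and `10` as different strings, so a numeric comparison would be a further step.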


In [9]:
"""
Cell I: Deploy: fast inference helper + sample generations + save LoRA adapters

What this cell does:
1) Switches the model to Unsloth's fast inference path.
2) Defines a robust `chat()` that uses the tokenizer's chat template and works with PEFT models.
3) Runs a couple of sample generations (one from GSM8K, one custom).
4) Saves your LoRA adapters + tokenizer files for reuse or serving.

Why this is correct:
- Unsloth's docs recommend `FastLanguageModel.for_inference(model)` for faster inference and show saving
  adapters with `save_pretrained`.  (We keep the chat template path from training.)
- Transformers' chat templating (`apply_chat_template`) is the supported way to format prompts for chat models.
References: Unsloth Inference & Running/Saving docs; HF chat templating docs; TRL GRPO reward/inference conventions.
"""

import torch
from collections.abc import Mapping
from unsloth import FastLanguageModel

# 1) Fast inference path
FastLanguageModel.for_inference(model)
model.eval()

def _to_device(batch, device):
    if isinstance(batch, Mapping):
        return {k: (v.to(device) if hasattr(v, "to") else v) for k, v in batch.items()}
    if torch.is_tensor(batch):
        return {"input_ids": batch.to(device),
                "attention_mask": torch.ones_like(batch, device=device)}
    raise TypeError(f"Unexpected inputs type: {type(batch)}")

def chat(messages, max_new_tokens=256, temperature=None, top_p=None):
    """
    Format messages with the tokenizer's chat template and generate a reply.
    If `temperature` is None, we do greedy decoding (deterministic); otherwise we sample.
    Returns only the newly generated text after the prompt.
    """
    enc = tokenizer.apply_chat_template(
        messages,
        add_generation_prompt=True,
        return_tensors="pt",
        return_dict=True,
        tokenize=True,
    )
    enc = _to_device(enc, model.device)
    prompt_len = enc["input_ids"].shape[1]

    gen_kwargs = dict(
        max_new_tokens=max_new_tokens,
        eos_token_id=tokenizer.eos_token_id,
        pad_token_id=tokenizer.pad_token_id,
    )
    if temperature is None:
        gen_kwargs.update(do_sample=False)
    else:
        gen_kwargs.update(do_sample=True, temperature=float(temperature), top_p=float(top_p or 0.95))

    with torch.no_grad(), torch.amp.autocast("cuda", dtype=torch.float16):
        out = model.generate(**enc, **gen_kwargs)

    new_tokens = out[0, prompt_len:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True)

# 2) Sample generations
SYSTEM_PROMPT = (
    "Respond in the following format:\n"
    "<reasoning>\n...\n</reasoning>\n<answer>\n...\n</answer>"
)

# (A) Reuse one GSM8K question to verify tags/format
ex_user = train[0]["question"]
print("=== Example A (GSM8K) - greedy ===")
print(chat([{"role":"system","content":SYSTEM_PROMPT},
            {"role":"user","content":ex_user}], max_new_tokens=256))

# (B) Custom math prompt (sampled)
print("\n=== Example B (custom) - sampled ===")
print(chat([{"role":"system","content":SYSTEM_PROMPT},
            {"role":"user","content":"A bag has 6 red and 4 blue marbles. If I draw 3 without replacement, what is the probability all are red?"}],
            temperature=0.8, top_p=0.95, max_new_tokens=256))

# 3) Save LoRA adapters + tokenizer
SAVE_DIR = "grpo_qwen15b_lora"
model.save_pretrained(SAVE_DIR)
tokenizer.save_pretrained(SAVE_DIR)
print(f"\n✅ Saved LoRA adapters & tokenizer to: {SAVE_DIR}")


=== Example A (GSM8K) - greedy ===
<reasoning>
To find out how many clips Natalia sold altogether in April and May, we need to follow these steps:

1. Determine the number of clips sold in May.
2. Add the number of clips sold in April to the number of clips sold in May.

Given that Natalia sold half as many clips in May as she did in April, we can calculate the number of clips sold in May by dividing the number of clips sold in April by 2.

Let's perform this calculation.
</reasoning>

<answer>
36
</answer>

=== Example B (custom) - sampled ===
<reasoning>
To find the probability that all three drawn marbles are red, we need to consider the total number of ways to choose 3 marbles from the bag and divide it by the number of ways to choose 3 red marbles.
The total number of ways to choose 3 marbles from a bag with 10 marbles (6 red and 4 blue) can be calculated using combinations: $C(10,3)$.

Similarly, the number of ways to choose 3 red marbles from 6 red marbles can also be calcul