<a href="https://colab.research.google.com/github/withpi/cookbook-withpi/blob/main/colabs/SFT_Model_Checkpoint_Observability_withPi.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<a href="https://withpi.ai"><img src="https://play.withpi.ai/logo/logoFullBlack.svg" width="240"></a>

<a href="https://code.withpi.ai"><font size="4">Documentation</font></a>

<a href="https://build.withpi.ai"><font size="4">Copilot</font></a>

Are you constantly relying on LLM-as-a-judge to evaluate your model’s performance?

Have you ever wanted to assess your model at every training checkpoint but hesitated because LLM-as-a-judge is too slow and expensive?

**Now you can — with [Pi-Scorer](https://build.withpi.ai).**

[Pi-Scorer](https://build.withpi.ai) offers an alternative to LLM-as-a-judge with several advantages:

* Significantly faster

* Highly consistent — always returns the same score for the same inputs

* Eliminates the need for prompt tuning or adjustments

In this Colab, we integrate [Pi-Scorer](https://build.withpi.ai) as the evaluator within the [Unsloth](https://unsloth.ai/) SFT training loop. It is based on the Unsloth [Llama3.2_(1B_and_3B)-Conversational.ipynb](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3.2_(1B_and_3B)-Conversational.ipynb) notebook.

### Installation


In [1]:
%%capture
import os
if "COLAB_" not in "".join(os.environ.keys()):
    !pip install unsloth
else:
    # Do this only in Colab notebooks! Otherwise use pip install unsloth
    !pip install --no-deps bitsandbytes accelerate xformers==0.0.29.post3 peft trl==0.15.2 triton cut_cross_entropy unsloth_zoo
    !pip install sentencepiece protobuf datasets huggingface_hub hf_transfer
    !pip install --no-deps unsloth

In [46]:
from google.colab import userdata
import os

# Get a Pi API key at https://build.withpi.ai/account/keys
os.environ["WITHPI_API_KEY"] = userdata.get('WITHPI_API_KEY')

### Pi-Scorer Setup

In [54]:
import requests

# Pi constants
PI_API_URL = "https://api.withpi.ai/v1/scoring_system/score"
HEADERS = {
    "Content-Type": "application/json",
    "x-api-key": os.environ.get("WITHPI_API_KEY"),
}

# Pi util functions
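def get_pi_score(input: str, output: str, question: str) -> float:
    """Score `output` against a single scoring question for the given `input`."""
    payload = {
        "llm_input": input,
        "llm_output": output,
        "scoring_spec": [{"question": question}]
    }
    # No retry here; add one if requests fail intermittently.
    response = requests.post(PI_API_URL, headers=HEADERS, json=payload, timeout=60)
    response.raise_for_status()  # Surface HTTP errors instead of a KeyError below.
    return response.json()["total_score"]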
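
The comment in `get_pi_score` mentions adding a retry. A minimal sketch of one way to do that is below; `call_with_retries` and its parameters are hypothetical names introduced here, not part of the Pi API or of this notebook. It retries on any exception raised by the wrapped call and doubles the wait after each failure.

```python
import time


def call_with_retries(fn, max_attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call fn() and retry with exponential backoff if it raises.

    Hypothetical helper: the names and defaults are illustrative assumptions.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except Exception:
            # Out of attempts: re-raise the last error to the caller.
            if attempt == max_attempts:
                raise
            # Wait base_delay, then 2x, 4x, ... before the next attempt.
            sleep(base_delay * 2 ** (attempt - 1))
```

With this helper, a scoring call could be wrapped as `call_with_retries(lambda: get_pi_score(input=post, output=tldr, question=question))`, where `post`, `tldr`, and `question` stand for your own values.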

### Unsloth

In [2]:
from unsloth import FastLanguageModel
import torch
max_seq_length = 4096 # Choose any! We auto support RoPE Scaling internally!
dtype = None # None for auto detection. Float16 for Tesla T4, V100, Bfloat16 for Ampere+
load_in_4bit = True # Use 4bit quantization to reduce memory usage. Can be False.

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/Llama-3.2-1B-Instruct", # or choose "unsloth/Llama-3.2-3B-Instruct"
    max_seq_length = max_seq_length,
    dtype = dtype,
    load_in_4bit = load_in_4bit,
)

🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
Unsloth: Failed to patch SmolVLMForConditionalGeneration forward function.
🦥 Unsloth Zoo will now patch everything to make training faster!
==((====))==  Unsloth 2025.4.1: Fast Llama patching. Transformers: 4.51.3.
   \\   /|    Tesla T4. Num GPUs = 1. Max memory: 14.741 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.6.0+cu124. CUDA: 7.5. CUDA Toolkit: 12.4. Triton: 3.2.0
\        /    Bfloat16 = FALSE. FA [Xformers = 0.0.29.post3. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!



We now add LoRA adapters so we only need to update 1 to 10% of all parameters!

In [3]:
model = FastLanguageModel.get_peft_model(
    model,
    r = 16, # Choose any number > 0 ! Suggested 8, 16, 32, 64, 128
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj",],
    lora_alpha = 16,
    lora_dropout = 0, # Supports any, but = 0 is optimized
    bias = "none",    # Supports any, but = "none" is optimized
    # [NEW] "unsloth" uses 30% less VRAM, fits 2x larger batch sizes!
    use_gradient_checkpointing = "unsloth", # True or "unsloth" for very long context
    random_state = 3407,
    use_rslora = False,  # We support rank stabilized LoRA
    loftq_config = None, # And LoftQ
)

Unsloth 2025.4.1 patched 16 layers with 16 QKV layers, 16 O layers and 16 MLP layers.
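
As a rough check on how small the LoRA update is, the sketch below counts the adapter parameters for the configuration above. Each adapted linear layer with `in_features` inputs and `out_features` outputs adds `r * (in_features + out_features)` parameters. The layer shapes are assumptions about Llama-3.2-1B (hidden size 2048, MLP size 8192, key/value projection size 512) and are not stated in this notebook; the layer count of 16 comes from the Unsloth log line above.

```python
# Assumed Llama-3.2-1B shapes (not stated in this notebook).
HIDDEN = 2048        # model hidden size
INTERMEDIATE = 8192  # MLP intermediate size
KV_DIM = 512         # key/value projection size (grouped-query attention)
NUM_LAYERS = 16      # from the "patched 16 layers" log line
RANK = 16            # r passed to get_peft_model

# (in_features, out_features) of each targeted projection.
TARGET_SHAPES = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}


def lora_param_count(shapes, rank, num_layers):
    """Sum rank * (fan_in + fan_out) over all adapted layers."""
    per_layer = sum(rank * (fan_in + fan_out) for fan_in, fan_out in shapes.values())
    return per_layer * num_layers


trainable = lora_param_count(TARGET_SHAPES, RANK, NUM_LAYERS)
print(f"{trainable:,}")  # 11,272,192
```

Under these assumed shapes the result equals the `Trainable parameters = 11,272,192` figure that Unsloth prints when training starts.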


### Data Prep

In [93]:
from unsloth.chat_templates import get_chat_template
from datasets import load_dataset

tokenizer = get_chat_template(
    tokenizer,
    chat_template = "llama-3.1",
)

SYSTEM_PROMPT = """Generate a short TLDR of a subreddit post without any surrounding text. Here are some requirement of the TLDR:
1. Make sure that the TLDR is short and concise.
2. Make sure that the TLDR state the important points of the post
3. Make sure that the TLDR should make sense on its own.
"""

def formatting_prompts_func(examples):
    convos = [[{"role": "system", "content": SYSTEM_PROMPT}, {"role": "user", "content": post}, {"role": "assistant", "content": tldr}] for post, tldr in zip(examples["post"], examples["tldr"])]
    texts = [tokenizer.apply_chat_template(convo, tokenize = False, add_generation_prompt = False) for convo in convos]
    return { "text" : texts, }

ds_train = load_dataset("withpi/tldr-sft", split="train")
ds_train = ds_train.map(formatting_prompts_func, batched = True)

ds_test = load_dataset("withpi/tldr-sft", split="test").select(range(20))

ds_train[5]["text"]

Map:   0%|          | 0/801 [00:00<?, ? examples/s]

'<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\nCutting Knowledge Date: December 2023\nToday Date: 26 July 2024\n\nGenerate a short TLDR of a subreddit post without any surrounding text. Here are some requirement of the TLDR:\n1. Make sure that the TLDR is short and concise.\n2. Make sure that the TLDR state the important points of the post\n3. Make sure that the TLDR should make sense on its own.\n<|eot_id|><|start_header_id|>user<|end_header_id|>\n\nSUBREDDIT: r/offmychest\n\nTITLE: To the customer who walked in only to ask for directions\n\nPOST: Oh you don\'t know your way around town? How about use GPS on your fucking smartphone? Even better, get your directions before driving. You interrupt a busy service from afar, speaking over the ambient noise of the store and make misleading hand gestures pointing to our menu, causing me to construe your inquiry as one about mixing a drink "three ways". I tell you that it\'s possible, and ask which three. You fucking ignore m

<a name="Train"></a>
### Train the model
Now let's use Hugging Face TRL's `SFTTrainer`! More docs here: [TRL SFT docs](https://huggingface.co/docs/trl/sft_trainer). We do 60 steps to speed things up; for a full run, set `num_train_epochs=1` and remove the `max_steps` limit. The `PiScorerCallback` below generates TLDRs for 20 test posts every 5 steps and scores them with Pi-Scorer. We also support TRL's `DPOTrainer`!

In [94]:
from trl import SFTTrainer
from transformers import TrainingArguments, DataCollatorForSeq2Seq, TrainerCallback
from unsloth import is_bfloat16_supported
from datasets import Dataset
from unsloth.chat_templates import get_chat_template
from tqdm import tqdm
from unsloth import FastLanguageModel
import statistics


class PiScorerCallback(TrainerCallback):
    def __init__(self, test_ds: Dataset):
        self.test_ds = test_ds


    def on_log(self, args, state, control, model, processing_class, logs=None, **kwargs):
        if state.global_step % 5 == 0:
          print(f"\nEvaluation withPi at {state.global_step=}...")

          concise_score = []
          redundancy_score = []
          understandable_score = []

          FastLanguageModel.for_inference(model)
          for ex in tqdm(self.test_ds):
            messages = [{"role": "system", "content": SYSTEM_PROMPT}, {"role": "user", "content": ex["post"]}]
            inputs = tokenizer.apply_chat_template(
              messages,
              tokenize = True,
              add_generation_prompt = True, # Must add for generation
              return_tensors = "pt",
            ).to("cuda")
            outputs = model.generate(
                input_ids = inputs,
                max_new_tokens = 4096,
                use_cache = True,
                temperature = 1.5,
                min_p = 0.1
            )
            decoded_outputs = tokenizer.batch_decode(outputs)
            output = decoded_outputs[0].split("<|start_header_id|>assistant<|end_header_id|>")[-1].removesuffix("<|eot_id|>").strip()

            # Compute Pi-Scores
            concise_score.append(get_pi_score(input=ex["post"], output=output, question="Is the TLDR concise and to the point?"))
            redundancy_score.append(get_pi_score(input=ex["post"], output=output, question="Does the TLDR avoid redundant information?"))
            understandable_score.append(get_pi_score(input=ex["post"], output=output, question="Is the TLDR understandable without additional context?"))

          FastLanguageModel.for_training(model)

          print(f" * Concise score={statistics.mean(concise_score)}")
          print(f" * Redundancy score={statistics.mean(redundancy_score)}")
          print(f" * Understandable score={statistics.mean(understandable_score)}")

          overall_score = [(s1 + s2 + s3)/3.0 for s1, s2, s3 in zip(concise_score, redundancy_score, understandable_score)]
          print(f" * Overall score={statistics.mean(overall_score)}")


trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = ds_train,
    dataset_text_field = "text",
    max_seq_length = max_seq_length,
    data_collator = DataCollatorForSeq2Seq(tokenizer = tokenizer),
    dataset_num_proc = 2,
    packing = False, # Can make training 5x faster for short sequences.
    callbacks=[PiScorerCallback(ds_test)],
    args = TrainingArguments(
        per_device_train_batch_size = 2,
        gradient_accumulation_steps = 4,
        warmup_steps = 5,
        # num_train_epochs = 1, # Set this for 1 full training run.
        max_steps = 60,
        learning_rate = 2e-4,
        fp16 = not is_bfloat16_supported(),
        bf16 = is_bfloat16_supported(),
        logging_steps = 1,
        optim = "adamw_8bit",
        weight_decay = 0.01,
        lr_scheduler_type = "linear",
        seed = 3407,
        output_dir = "outputs",
        report_to = "none", # Use this for WandB etc
    ),
)

Unsloth: Tokenizing ["text"] (num_proc=2):   0%|          | 0/801 [00:00<?, ? examples/s]
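
The callback above recovers the model's reply by splitting the decoded text on the assistant header and dropping the end-of-turn token. The same step is shown below as a standalone sketch; `extract_assistant_reply` is a hypothetical helper name, and the sample string is made up for illustration.

```python
ASSISTANT_HEADER = "<|start_header_id|>assistant<|end_header_id|>"
END_OF_TURN = "<|eot_id|>"


def extract_assistant_reply(decoded: str) -> str:
    """Return the text after the last assistant header, without the end-of-turn token."""
    reply = decoded.split(ASSISTANT_HEADER)[-1]
    # removesuffix is a no-op if generation stopped before emitting <|eot_id|>.
    return reply.removesuffix(END_OF_TURN).strip()


decoded = (
    "<|begin_of_text|><|start_header_id|>user<|end_header_id|>\n\n"
    "A long post<|eot_id|>"
    "<|start_header_id|>assistant<|end_header_id|>\n\n"
    "Short summary.<|eot_id|>"
)
print(extract_assistant_reply(decoded))  # Short summary.
```

Note that `str.removesuffix` requires Python 3.9 or later.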

We also use Unsloth's `train_on_responses_only` function to train only on the assistant outputs and ignore the loss on the system and user inputs.

In [95]:
from unsloth.chat_templates import train_on_responses_only
trainer = train_on_responses_only(
    trainer,
    instruction_part = "<|start_header_id|>user<|end_header_id|>\n\n",
    response_part = "<|start_header_id|>assistant<|end_header_id|>\n\n",
)

Map (num_proc=2):   0%|          | 0/801 [00:00<?, ? examples/s]

We verify masking is actually done:

In [96]:
tokenizer.decode(trainer.train_dataset[5]["input_ids"])

'<|begin_of_text|><|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\nCutting Knowledge Date: December 2023\nToday Date: 26 July 2024\n\nGenerate a short TLDR of a subreddit post without any surrounding text. Here are some requirement of the TLDR:\n1. Make sure that the TLDR is short and concise.\n2. Make sure that the TLDR state the important points of the post\n3. Make sure that the TLDR should make sense on its own.\n<|eot_id|><|start_header_id|>user<|end_header_id|>\n\nSUBREDDIT: r/offmychest\n\nTITLE: To the customer who walked in only to ask for directions\n\nPOST: Oh you don\'t know your way around town? How about use GPS on your fucking smartphone? Even better, get your directions before driving. You interrupt a busy service from afar, speaking over the ambient noise of the store and make misleading hand gestures pointing to our menu, causing me to construe your inquiry as one about mixing a drink "three ways". I tell you that it\'s possible, and ask which three. You

In [97]:
space = tokenizer(" ", add_special_tokens = False).input_ids[0]
tokenizer.decode([space if x == -100 else x for x in trainer.train_dataset[5]["labels"]])

'                                                                                                                                                                                                                                                                                               Service was busy, the context made me misinterpret customer\'s question about freeways as "three ways", to which I assumed he was asking if he could mix three flavors. He rudely blows me off.<|eot_id|>'

We can see the System and Instruction prompts are successfully masked!
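
To make the masking concrete, here is a simplified sketch of the idea, not Unsloth's actual implementation. Labels are a copy of the input token IDs, and every position up to and including the response marker is set to `-100`, the label value that PyTorch's cross-entropy loss ignores by default. The helper name and the toy token IDs are hypothetical, and the sketch handles a single assistant turn only.

```python
IGNORE_INDEX = -100  # label value skipped by the loss


def mask_before_response(input_ids, response_marker):
    """Copy input_ids into labels, masking everything through the response marker."""
    n = len(response_marker)
    for start in range(len(input_ids) - n + 1):
        if input_ids[start:start + n] == response_marker:
            cut = start + n
            return [IGNORE_INDEX] * cut + list(input_ids[cut:])
    # Marker not found: mask the whole sequence so nothing is trained on.
    return [IGNORE_INDEX] * len(input_ids)


# Toy IDs: [90, 91] stands in for the tokenized assistant header.
ids = [1, 10, 11, 12, 90, 91, 20, 21, 2]
labels = mask_before_response(ids, response_marker=[90, 91])
print(labels)  # [-100, -100, -100, -100, -100, -100, 20, 21, 2]
```

Decoding such labels with `-100` replaced by a space token, as in the cell above, leaves only the assistant's TLDR visible.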

In [98]:
# @title Show current memory stats
gpu_stats = torch.cuda.get_device_properties(0)
start_gpu_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
max_memory = round(gpu_stats.total_memory / 1024 / 1024 / 1024, 3)
print(f"GPU = {gpu_stats.name}. Max memory = {max_memory} GB.")
print(f"{start_gpu_memory} GB of memory reserved.")

GPU = Tesla T4. Max memory = 14.741 GB.
2.336 GB of memory reserved.


In [99]:
trainer_stats = trainer.train()

==((====))==  Unsloth - 2x faster free finetuning | Num GPUs used = 1
   \\   /|    Num examples = 801 | Num Epochs = 1 | Total steps = 60
O^O/ \_/ \    Batch size per device = 2 | Gradient accumulation steps = 4
\        /    Data Parallel GPUs = 1 | Total batch size (2 x 4 x 1) = 8
 "-____-"     Trainable parameters = 11,272,192/1,000,000,000 (1.13% trained)


Step,Training Loss
1,0.0175
2,0.022
3,0.0242
4,0.0139
5,0.0286
6,0.0123
7,0.0524
8,0.0405
9,0.028
10,0.0294



Evaluation withPi at state.global_step=5...


100%|██████████| 20/20 [00:32<00:00,  1.62s/it]


 * Concise score=0.921875
 * Redundancy score=0.742681884765625
 * Understandable score=0.812255859375
 * Overall score=0.825604248046875

Evaluation withPi at state.global_step=10...


100%|██████████| 20/20 [00:28<00:00,  1.44s/it]


 * Concise score=0.94921875
 * Redundancy score=0.771142578125
 * Understandable score=0.87177734375
 * Overall score=0.8640462239583333

Evaluation withPi at state.global_step=15...


100%|██████████| 20/20 [00:28<00:00,  1.45s/it]


 * Concise score=0.89873046875
 * Redundancy score=0.76875
 * Understandable score=0.82958984375
 * Overall score=0.8323567708333334

Evaluation withPi at state.global_step=20...


100%|██████████| 20/20 [00:30<00:00,  1.55s/it]


 * Concise score=0.84228515625
 * Redundancy score=0.5892333984375
 * Understandable score=0.6940185546875001
 * Overall score=0.7085123697916667

Evaluation withPi at state.global_step=25...


100%|██████████| 20/20 [00:34<00:00,  1.71s/it]


 * Concise score=0.826318359375
 * Redundancy score=0.6078369140625
 * Understandable score=0.71572265625
 * Overall score=0.7166259765625

Evaluation withPi at state.global_step=30...


100%|██████████| 20/20 [00:33<00:00,  1.67s/it]


 * Concise score=0.878125
 * Redundancy score=0.6717041015625
 * Understandable score=0.7794921875
 * Overall score=0.7764404296875

Evaluation withPi at state.global_step=35...


100%|██████████| 20/20 [00:30<00:00,  1.53s/it]


 * Concise score=0.85361328125
 * Redundancy score=0.6726806640625
 * Understandable score=0.7537109375
 * Overall score=0.7600016276041667

Evaluation withPi at state.global_step=40...


100%|██████████| 20/20 [00:33<00:00,  1.70s/it]


 * Concise score=0.9205078125
 * Redundancy score=0.79921875
 * Understandable score=0.85263671875
 * Overall score=0.8574544270833333

Evaluation withPi at state.global_step=45...


100%|██████████| 20/20 [00:32<00:00,  1.63s/it]


 * Concise score=0.9552734375
 * Redundancy score=0.82705078125
 * Understandable score=0.88671875
 * Overall score=0.8896809895833333

Evaluation withPi at state.global_step=50...


100%|██████████| 20/20 [00:33<00:00,  1.66s/it]


 * Concise score=0.93125
 * Redundancy score=0.7226806640625
 * Understandable score=0.8386718750000001
 * Overall score=0.8308675130208333

Evaluation withPi at state.global_step=55...


100%|██████████| 20/20 [00:33<00:00,  1.67s/it]


 * Concise score=0.873876953125
 * Redundancy score=0.6524169921875
 * Understandable score=0.78564453125
 * Overall score=0.7706461588541667

Evaluation withPi at state.global_step=60...


100%|██████████| 20/20 [00:35<00:00,  1.76s/it]


 * Concise score=0.9048828125
 * Redundancy score=0.7373779296875
 * Understandable score=0.81728515625
 * Overall score=0.8198486328125

Evaluation withPi at state.global_step=60...


100%|██████████| 20/20 [00:32<00:00,  1.62s/it]

 * Concise score=0.848779296875
 * Redundancy score=0.6371826171875
 * Understandable score=0.724462890625
 * Overall score=0.7368082682291667
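
Step 60 appears twice in the log above, with different scores. A likely explanation, which is an assumption rather than something this notebook confirms, is that `on_log` fires once for the regular step-60 log and once more for the final training summary; the two runs then differ because generation samples at `temperature = 1.5`. The sketch below shows one way to run the evaluation at most once per step. `StepGate` is a hypothetical helper, not part of TRL or Unsloth.

```python
class StepGate:
    """Allow an action every `every` steps, at most once per step."""

    def __init__(self, every: int):
        self.every = every
        self.last_step = None  # last step at which the action ran

    def should_run(self, global_step: int) -> bool:
        if global_step % self.every != 0 or global_step == self.last_step:
            return False
        self.last_step = global_step
        return True


gate = StepGate(every=5)
steps_seen = [4, 5, 6, 10, 60, 60]  # step 60 is logged twice
ran = [step for step in steps_seen if gate.should_run(step)]
print(ran)  # [5, 10, 60]
```

In the callback, the `if state.global_step % 5 == 0:` check could be replaced by a call such as `self.gate.should_run(state.global_step)`, with the gate created in `__init__`.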





In [None]:
# @title Show final memory and time stats
used_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
used_memory_for_lora = round(used_memory - start_gpu_memory, 3)
used_percentage = round(used_memory / max_memory * 100, 3)
lora_percentage = round(used_memory_for_lora / max_memory * 100, 3)
print(f"{trainer_stats.metrics['train_runtime']} seconds used for training.")
print(
    f"{round(trainer_stats.metrics['train_runtime']/60, 2)} minutes used for training."
)
print(f"Peak reserved memory = {used_memory} GB.")
print(f"Peak reserved memory for training = {used_memory_for_lora} GB.")
print(f"Peak reserved memory % of max memory = {used_percentage} %.")
print(f"Peak reserved memory for training % of max memory = {lora_percentage} %.")

1850.7701 seconds used for training.
30.85 minutes used for training.
Peak reserved memory = 5.65 GB.
Peak reserved memory for training = 0.0 GB.
Peak reserved memory % of max memory = 38.328 %.
Peak reserved memory for training % of max memory = 0.0 %.


<a name="Inference"></a>
### Inference
Let's run the model! You can change the system prompt and the post in the user message. End the user message with `TL;DR:` and let the model write the summary.

In [100]:
from unsloth.chat_templates import get_chat_template

tokenizer = get_chat_template(
    tokenizer,
    chat_template = "llama-3.1",
)
FastLanguageModel.for_inference(model) # Enable native 2x faster inference

messages = [
    {"role": "system", "content": SYSTEM_PROMPT},
     {"role": "user", "content": """SUBREDDIT: r/relationships

TITLE: Me [26F] with my LDR Boyfriend [26M] might not really like me because I'm not Asian?

POST: I'm from Eastern Europe and we've been together now for about 6 years, I met my Asian boyfriend while I was studying abroad in his country (North America).

After dating in person for about 2 years I had to return back to my country and we started dating long distance while I make the transition to permanently move to his country. However I've noticed some strange habits that are making me think that perhaps he's not really physically attracted to me.

My country (at least some of the older people) can be a tad Xenophobic. When he came to visit me I would instantly shoot down any remark someone would say, regardless of whether he understood it or not.

However I've recently returned from a trip abroad and it's making me feel unnerved. His family would constantly go on about how fat I am and say things like "That is what you get when you date a white girl!". His friend once remarked how he doesn't get to see me much anyway and that he should start dating a "cute Asian girl close by". I mean, I know people can be mean but mostly my BF would nod and agree to these things- WITH ME PRESENT!!

I've asked him if he is attracted to me and he just kind of shrugged. I don't know what that means. Now I've returned and he is sending me all these "I miss you", "I miss holding you in my arms." all these texts but when we are together in person it is like he is ashamed of me??

I know I could stand to lose some weight. I'm about 167 cm (5 ft 6 in) and weight 66kg.

Is this normal? Are customs in my country just different? Is there something I can do to be more Asian for him?

TL;DR:"""},
]
inputs = tokenizer.apply_chat_template(
    messages,
    tokenize = True,
    add_generation_prompt = True, # Must add for generation
    return_tensors = "pt",
).to("cuda")

outputs = model.generate(input_ids = inputs, max_new_tokens = 4096, use_cache = True,
                         temperature = 1.5, min_p = 0.1)
tokenizer.batch_decode(outputs)

['<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\nCutting Knowledge Date: December 2023\nToday Date: 26 July 2024\n\nGenerate a short TLDR of a subreddit post without any surrounding text. Here are some requirement of the TLDR:\n1. Make sure that the TLDR is short and concise.\n2. Make sure that the TLDR state the important points of the post\n3. Make sure that the TLDR should make sense on its own.\n<|eot_id|><|start_header_id|>user<|end_header_id|>\n\nSUBREDDIT: r/relationships\n\nTITLE: Me [26F] with my LDR Boyfriend [26M] might not really like me because I\'m not Asian?\n\nPOST: I\'m from Eastern Europe and we\'ve been together now for about 6 years, I met my Asian boyfriend while I was studying abroad in his country (North America).\n\nAfter dating in person for about 2 years I had to return back to my country and we started dating long distance while I make the transition to permanently move to his country. However I\'ve noticed some strange habits that are making 

You can also use a `TextStreamer` to stream the output, so you can watch the generation token by token instead of waiting for the full response!

In [101]:
FastLanguageModel.for_inference(model) # Enable native 2x faster inference

messages = [
    {"role": "system", "content": SYSTEM_PROMPT},
     {"role": "user", "content": """SUBREDDIT: r/relationships

TITLE: Me [26F] with my LDR Boyfriend [26M] might not really like me because I'm not Asian?

POST: I'm from Eastern Europe and we've been together now for about 6 years, I met my Asian boyfriend while I was studying abroad in his country (North America).

After dating in person for about 2 years I had to return back to my country and we started dating long distance while I make the transition to permanently move to his country. However I've noticed some strange habits that are making me think that perhaps he's not really physically attracted to me.

My country (at least some of the older people) can be a tad Xenophobic. When he came to visit me I would instantly shoot down any remark someone would say, regardless of whether he understood it or not.

However I've recently returned from a trip abroad and it's making me feel unnerved. His family would constantly go on about how fat I am and say things like "That is what you get when you date a white girl!". His friend once remarked how he doesn't get to see me much anyway and that he should start dating a "cute Asian girl close by". I mean, I know people can be mean but mostly my BF would nod and agree to these things- WITH ME PRESENT!!

I've asked him if he is attracted to me and he just kind of shrugged. I don't know what that means. Now I've returned and he is sending me all these "I miss you", "I miss holding you in my arms." all these texts but when we are together in person it is like he is ashamed of me??

I know I could stand to lose some weight. I'm about 167 cm (5 ft 6 in) and weight 66kg.

Is this normal? Are customs in my country just different? Is there something I can do to be more Asian for him?

TL;DR:"""},
]
inputs = tokenizer.apply_chat_template(
    messages,
    tokenize = True,
    add_generation_prompt = True, # Must add for generation
    return_tensors = "pt",
).to("cuda")

from transformers import TextStreamer
text_streamer = TextStreamer(tokenizer, skip_prompt = True)
_ = model.generate(input_ids = inputs, streamer = text_streamer, max_new_tokens = 4096,
                   use_cache = True, temperature = 1.5, min_p = 0.1)

 BF doesn't like me that much when we're together in person. But when he's around other people he instantly mocks me and says I'm fat. Is this normal?<|eot_id|>
