# Post-train an LLM with Supervised Fine Tuning

## The tasks I want the LLM (assistant) to perform

* Prepare demonstrations (examples), each with an instruction, an input, and the desired output.
* The same examples serve several purposes: evaluating an off-the-shelf LLM, fine-tuning it, and then testing and scoring the fine-tuned version.
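
To make the idea of a demonstration concrete, here is a minimal sketch of one example in chat format and its round trip through a JSONL line. The field names mirror the examples generated later in this notebook; the variable names `line` and `restored` are only illustrative.

```python
import json

# One demonstration: a user turn (instruction + input) and an assistant turn
# holding the desired output, plus metadata that is useful for later analysis.
example = {
    "messages": [
        {"role": "user",
         "content": "Classify this verse:\nIn the beginning God created the heaven and the earth."},
        {"role": "assistant", "content": "2"},
    ],
    "metadata": {
        "task": "recognize_version",
        "version": "en.new.jps1917",
        "book": "genesis",
        "chapter_num": 1,
        "verse_num": 1,
    },
}

# JSONL stores one JSON object per line. ensure_ascii=False keeps Hebrew and
# French text human-readable instead of \uXXXX escapes.
line = json.dumps(example, ensure_ascii=False)
restored = json.loads(line)
print(restored["messages"][1])
```

The newline inside the user message is escaped by `json.dumps`, so each example still occupies exactly one line of the file.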

In [58]:
import os
import json
import numpy as np
import pandas as pd
import sklearn.metrics  # a bare "import sklearn" does not guarantee that sklearn.metrics is loaded

from datasets import load_dataset
from transformers import AutoTokenizer
from transformers import AutoModelForCausalLM
import huggingface_hub
import torch
from peft import LoraConfig, get_peft_model, PeftModel
from transformers import TrainingArguments, Trainer
from transformers import DataCollatorForLanguageModeling
from transformers import DataCollatorForSeq2Seq

import sefaria_code as sef

In [59]:
def do_generate(use_tokenizer, use_model, prompt, max_new_tokens=100, temperature=0.7):
    inputs = use_tokenizer(prompt, return_tensors="pt")
    input_len = inputs["input_ids"].shape[1]
    if temperature > 0:
        outputs = use_model.generate(**inputs, max_new_tokens=max_new_tokens, do_sample=True, temperature=temperature)
    else:
        outputs = use_model.generate(**inputs, max_new_tokens=max_new_tokens, do_sample=False)
    output = outputs[0][input_len:] # Remove the repeated input token sequence. Keep only the newly generated tokens
    outtext = use_tokenizer.decode(output, skip_special_tokens=True)
    return outtext

def do_generate_response(use_tokenizer, use_model, user_message, max_new_tokens=100, temperature=0.7):
    prompt = use_tokenizer.apply_chat_template([
        {"role":"user", "content":user_message}],
    tokenize=False,
    add_generation_prompt=True)
    return do_generate(use_tokenizer, use_model, prompt, max_new_tokens=max_new_tokens, temperature=temperature)

In [91]:
versions = [sef.VersionCode.HE_TEXT_ONLY, sef.VersionCode.HE_MASORAH, sef.VersionCode.EN_JSP_1917, sef.VersionCode.EN_KOREN, sef.VersionCode.FR_RABBINAT_1899, sef.VersionCode.FR_SAMUEL_CAHEN_1831]
torah = None
for book in [sef.BookCode.GENESIS, sef.BookCode.EXODUS, sef.BookCode.LEVITICUS, sef.BookCode.NUMBERS, sef.BookCode.DEUTERONOMY]:
    verses = sef.sefaria_read_multiversions_of_book(book, versions, col_per_version=True)
    if torah is None:
        torah = verses
    else:
        torah = pd.concat([torah, verses], ignore_index=True)

display(torah)

Unnamed: 0,book,chapter_num,verse_num,text.en.koren,text.en.new.jps1917,text.fr.rabbinat1899,text.fr.samuel_cahen1831,text.he.masorah,text.he.text_only
0,genesis,1,1,IN THE BEGINNING God created the heaven and th...,In the beginning God created the heaven and th...,"Au commencement, Dieu créa le ciel et la terre.",Au commencement Dieu créa le ciel et la terre;,בְּרֵאשִׁ֖ית בָּרָ֣א אֱלֹהִ֑ים אֵ֥ת הַשָּׁמַ֖י...,בראשית ברא אלהים את השמים ואת הארץ
1,genesis,1,2,And the earth was without form and void; and d...,"Now the earth was unformed and void, and darkn...",Or la terre n’était que solitude et chaos; des...,"La terre était informe et en désordre, les tén...",וְהָאָ֗רֶץ הָיְתָ֥ה תֹ֙הוּ֙ וָבֹ֔הוּ וְחֹ֖שֶׁך...,והארץ היתה תהו ובהו וחשך על פני תהום ורוח אלהי...
2,genesis,1,3,"And God said, Let there be light: and there wa...",And God said: ‘Let there be light.’ And there ...,"Dieu dit: ""Que la lumière soit!"" Et la lumière...","Dieu dit: que la lumière soit, et la lumière fut;",וַיֹּ֥אמֶר אֱלֹהִ֖ים יְהִ֣י א֑וֹר וַֽיְהִי־אֽוֹר׃,ויאמר אלהים יהי אור ויהי אור
3,genesis,1,4,"And God saw the light, that it was good: and G...","And God saw the light, that it was good; and G...","Dieu considéra que la lumière était bonne, et ...","Dieu voyant que la lumière était bonne, la sép...",וַיַּ֧רְא אֱלֹהִ֛ים אֶת־הָא֖וֹר כִּי־ט֑וֹב וַי...,וירא אלהים את האור כי טוב ויבדל אלהים בין האור...
4,genesis,1,5,"And God called the light Day, and the darkness...","And God called the light Day, and the darkness...","Dieu appela la lumière jour, et les ténèbres, ...",Dieu nomma la lumière jour et les ténèbres nui...,וַיִּקְרָ֨א אֱלֹהִ֤ים׀לָאוֹר֙ י֔וֹם וְלַחֹ֖שֶׁ...,ויקרא אלהים לאור יום ולחשך קרא לילה ויהי ערב ו...
...,...,...,...,...,...,...,...,...,...
5841,deuteronomy,34,8,And the children of Yisra᾽el wept for Moshe in...,And the children of Israel wept for Moses in t...,"Les enfants d’Israël pleurèrent Moïse, dans le...",Les enfants d’Israel pleurèrent Mosché dans le...,וַיִּבְכּוּ֩ בְנֵ֨י יִשְׂרָאֵ֧ל אֶת־מֹשֶׁ֛ה בּ...,ויבכו בני ישראל את משה בערבת מואב שלשים יום וי...
5842,deuteronomy,34,9,And Yehoshua the son of Nun was full of the sp...,And Joshua the son of Nun was full of the spir...,"Or, Josué, fils de Noun, était plein de l’espr...","Iehoschoua (Josué), fils de Noun, était rempli...",וִיהוֹשֻׁ֣עַ בִּן־נ֗וּן מָלֵא֙ ר֣וּחַ חׇכְמָ֔ה...,ויהושע בן נון מלא רוח חכמה כי סמך משה את ידיו ...
5843,deuteronomy,34,10,And there arose not a prophet since in Yisra᾽e...,And there hath not arisen a prophet since in I...,"Mais il n’a plus paru, en Israël, un prophète ...",Il ne s’est pas encore élevé un prophète en Is...,וְלֹא־קָ֨ם נָבִ֥יא ע֛וֹד בְּיִשְׂרָאֵ֖ל כְּמֹש...,ולא קם נביא עוד בישראל כמשה אשר ידעו יהוה פנים...
5844,deuteronomy,34,11,"in all the signs and the wonders, which the Lo...","in all the signs and the wonders, which the LO...",eu égard à tant de signes et de prodiges que l...,Selon tous les signes et les prodiges que l’Ét...,לְכׇל־הָ֨אֹתֹ֜ת וְהַמּוֹפְתִ֗ים אֲשֶׁ֤ר שְׁלָח...,לכל האתות והמופתים אשר שלחו יהוה לעשות בארץ מצ...


In [92]:
# def gen_translate_instruction(from_version, to_version):
#     from_name = sef.version_code2name[from_version]
#     to_name = sef.version_code2name[to_version]
#     variations = [
#         f"Translate the following verse from its '{from_name}' version to its '{to_name}' version:",
#         f"Please translate a biblical verse to '{to_name}'. Here's the verse:",
#         f"I have a verse from the bible, taken from the '{from_name}' version of text. I want you to give me the '{to_name}' version of this verse. Here is the verse:",
#         f"Give me the '{to_name}' version of the following bible verse:",
#         f"I want the '{to_name}' text version of the biblical verse (taken from the '{from_name}' version of the bible):",
#         f"I have a biblical verse (from '{from_name}' edition) but I need the equivalent verse in another version ('{to_name}'):"
#     ]
#     flavor = int(np.random.choice(len(variations), 1)[0])
#     instruction = variations[flavor]
#     return (flavor, instruction)

# def gen_translate_examples_for_a_verse(verse_dict, versions):
#     examples = []
#     for vi, version_i in enumerate(versions):
#         for vj, version_j in enumerate(versions):
#             if (vi == vj):
#                 continue
#             example = {"metadata":{
#                 "task": "translate",
#                 "from_version": version_i,
#                 "to_version": version_j,
#                 "book": verse_dict["book"],
#                 "chapter_num": verse_dict["chapter_num"],
#                 "verse_num": verse_dict["verse_num"]
#             }}
#             input_text = verse_dict["text." + version_i]
#             output_text = verse_dict["text." + version_j]
#             (flavor, instruction) = gen_translate_instruction(version_i, version_j)
#             example["metadata"]["instruction_flavor"] = flavor
#             user_msg = f"""{instruction}
# {input_text}
# """
#             example["messages"] = [
#                 {"role": "user", "content": user_msg},
#                 {"role": "assistant", "content": output_text}
#             ]
#             examples.append(example)
#     return examples

def instructions_for_recognize_version(versions):
    part1 = f"""You are a Biblical research assistant. Please help me recognize the translation/edition/version of some biblical text.
There are {len(versions)} possible versions:
    """
    for vi, version in enumerate(versions):
        (name, desc, exam) = sef.version_code2metadata(version)
        version_lines = f"""
{vi+1}. {name}
Description: {desc}
Example: {exam}
"""
        part1 += version_lines
    
    part1 += f"""

Task: Identify which version the following verse belongs to.
Output format: respond with a single number from 1 to {len(versions)}. Do not explain your answer.

Verse:"""
    return part1

def gen_recognize_verse_version(verse_dict, versions):

    part1 = instructions_for_recognize_version(versions)
    template = f"""{part1}
{{verse_text}}"""
    
    examples = []
    for vi, version_i in enumerate(versions):
        instruction = template.format(verse_text=verse_dict["text."+version_i])
        example = {
            "messages":[
                {"role": "user", "content": instruction},
                {"role": "assistant", "content": str(vi+1)}
            ],
            "metadata":{
                "task": "recognize_version",
                "version": version_i,
                "book": verse_dict["book"],
                "chapter_num": verse_dict["chapter_num"],
                "verse_num": verse_dict["verse_num"]
                }}
        examples.append(example)
    return examples

def gen_ex_recognize_short(verse_dict, versions):
    examples = []
    for vi, version_i in enumerate(versions):
        verse_text=verse_dict["text."+version_i]
        example = {
            "messages":[
                {"role": "user", "content": f"Classify this verse:\n{verse_text}"},
                {"role": "assistant", "content": str(vi+1)}
            ],
            "metadata":{
                "task": "recognize_version",
                "version": version_i,
                "book": verse_dict["book"],
                "chapter_num": verse_dict["chapter_num"],
                "verse_num": verse_dict["verse_num"]
                }}
        examples.append(example)
    return examples

def generate_examples_for_a_verse(verse_row):
    verse_dict = verse_row.to_dict()
    versions = [version.replace('text.','') for version in list(filter(lambda col:col.startswith('text.'), verse_dict.keys()))]
    examples = []
    examples.extend(gen_ex_recognize_short(verse_dict, versions))
#    examples.extend(gen_recognize_verse_version(verse_dict, versions))
#    examples.extend(gen_translate_examples_for_a_verse(verse_dict, versions))
    return examples

def generate_examples(verses):
    examples = []
    for ri, row in verses.iterrows():
        examples.extend(generate_examples_for_a_verse(row))
    return examples

## Train/test data

In [93]:
dev_folder = os.path.abspath("data/dev")
# makedirs also creates the parent "data" folder if it does not exist yet
os.makedirs(dev_folder, exist_ok=True)

In [94]:
examples = generate_examples(torah)
trainset_file = os.path.join(dev_folder, "translate1.train1.jsonl")
testset_file = os.path.join(dev_folder, "translate1.test1.jsonl")
cutoff = int(len(examples)*0.75)

with open(trainset_file, "w", encoding="utf-8") as f:
    for ex in examples[:cutoff]:
        f.write(json.dumps(ex, ensure_ascii=False) + "\n")

with open(testset_file, "w", encoding="utf-8") as f:
    for ex in examples[cutoff:]:
        f.write(json.dumps(ex, ensure_ascii=False) + "\n")
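
The cell above cuts the ordered example list at 75%, so the test set is the tail of the Torah, and the verse at the boundary can have some of its versions in train and the rest in test. One possible alternative, sketched here under assumptions and not the method used for the results in this notebook, is to shuffle whole verses with a fixed seed and split by verse. The helper name `split_by_verse` and the toy data are hypothetical.

```python
import random

def split_by_verse(examples, train_fraction=0.75, seed=0):
    """Shuffle verses (not individual examples) and split them into train/test."""
    def verse_key(ex):
        md = ex["metadata"]
        return (md["book"], md["chapter_num"], md["verse_num"])

    # Sort first so the shuffle is reproducible regardless of set ordering
    keys = sorted({verse_key(ex) for ex in examples})
    random.Random(seed).shuffle(keys)
    cutoff = int(len(keys) * train_fraction)
    train_keys = set(keys[:cutoff])
    train = [ex for ex in examples if verse_key(ex) in train_keys]
    test = [ex for ex in examples if verse_key(ex) not in train_keys]
    return train, test

# Toy data: 8 verses, 2 versions each (made up for illustration)
toy = [
    {"messages": [],
     "metadata": {"book": "genesis", "chapter_num": 1, "verse_num": v, "version": ver}}
    for v in range(1, 9)
    for ver in ("en.koren", "he.text_only")
]
train, test = split_by_verse(toy)
print(len(train), len(test))
```

With this kind of split, all versions of a verse land on the same side, so the test set never contains a verse whose other versions were seen in training.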

In [95]:
for i, ex in enumerate(examples):
    msgs = ex["messages"]
    if len(msgs) != 2:
        print("Bad example at index", i)
    if msgs[0]["role"] != "user" or msgs[1]["role"] != "assistant":
        print("Role mismatch at index", i)

In [96]:
len(examples)

35076

In [97]:
model_name = "google/gemma-3-1b-it"
tokenizer = AutoTokenizer.from_pretrained(model_name)
base_model = AutoModelForCausalLM.from_pretrained(model_name)

In [98]:
user_msg = examples[4]['messages'][0]['content']
print(user_msg)
print("-"*40)
resp = do_generate_response(tokenizer, base_model, user_msg)
print(resp)

Classify this verse:
בְּרֵאשִׁ֖ית בָּרָ֣א אֱלֹהִ֑ים אֵ֥ת הַשָּׁמַ֖יִם וְאֵ֥ת הָאָֽרֶץ׃
----------------------------------------
This verse is a classic example of a **Biblical opening verse** and falls under the category of **Genesis**.

Here's a breakdown of why:

*   **Genesis:** It’s the very first verse of Genesis, the first book of the Bible.
*   **Biblical Opening:** It establishes the foundational narrative of creation.
*   **Literal Meaning:** It’s a straightforward statement of God’s creative action.

Let me know if you'd like


In [99]:
examples[:20]

[{'messages': [{'role': 'user',
    'content': 'Classify this verse:\nIN THE BEGINNING God created the heaven and the earth.'},
   {'role': 'assistant', 'content': '1'}],
  'metadata': {'task': 'recognize_version',
   'version': 'en.koren',
   'book': 'genesis',
   'chapter_num': 1,
   'verse_num': 1}},
 {'messages': [{'role': 'user',
    'content': 'Classify this verse:\nIn the beginning God created the heaven and the earth.'},
   {'role': 'assistant', 'content': '2'}],
  'metadata': {'task': 'recognize_version',
   'version': 'en.new.jps1917',
   'book': 'genesis',
   'chapter_num': 1,
   'verse_num': 1}},
 {'messages': [{'role': 'user',
    'content': 'Classify this verse:\nAu commencement, Dieu créa le ciel et la terre.'},
   {'role': 'assistant', 'content': '3'}],
  'metadata': {'task': 'recognize_version',
   'version': 'fr.rabbinat1899',
   'book': 'genesis',
   'chapter_num': 1,
   'verse_num': 1}},
 {'messages': [{'role': 'user',
    'content': 'Classify this verse:\nAu commen

## Load data from files (in case you don't have it in memory)

In [100]:
dev_folder = os.path.abspath("data/dev")
trainset_file = os.path.join(dev_folder, "translate1.train1.jsonl")
testset_file = os.path.join(dev_folder, "translate1.test1.jsonl")

In [101]:
train_ds = load_dataset("json", data_files=trainset_file, split="train")
test_ds = load_dataset("json", data_files=testset_file, split="train")


In [102]:
huggingface_hub.login(os.environ["HUGGING_FACE_TOKEN"])

In [None]:
tokenizer.apply_chat_template([{'role':'assistant'}])

In [130]:
def tokenize(example):
    # Convert the list of messages into a single text sequence
    text = tokenizer.apply_chat_template(
        example["messages"],
        tokenize=False,
        add_generation_prompt=False
    )
    tokens = tokenizer(
        text,
        truncation=True,
        max_length=2048,
        padding=False,
        return_tensors=None
    )
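    labels = tokens["input_ids"].copy()
    # Token ids of the assistant-turn marker. Take "input_ids" explicitly and
    # drop the leading BOS token; slicing the tokenizer output object itself
    # does not return token ids.
    assistant_token_ids = tokenizer("<start_of_turn>model")["input_ids"][1:]
    # Find the index where the assistant turn starts
    idx = None
    for i in range(len(labels) - len(assistant_token_ids) + 1):
        if labels[i:i + len(assistant_token_ids)] == assistant_token_ids:
            idx = i
            break
    if idx is None:
        raise ValueError("Assistant turn marker not found (was the example truncated?)")
    # Mask everything before the assistant turn so the loss ignores the prompt
    for i in range(idx):
        labels[i] = -100
    tokens["labels"] = labels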
    return tokens
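
The masking step can be checked in isolation, without loading a tokenizer. The sketch below works on plain lists of token ids; `find_subsequence` and `mask_prompt_labels` are hypothetical helper names. The ids `2` (BOS) and `[105, 4368]` (`<start_of_turn>model`) are taken from the tokenizer output printed in this notebook, while the remaining ids are made up for illustration.

```python
def find_subsequence(sequence, pattern):
    """Return the index of the first occurrence of pattern in sequence, or -1."""
    n, m = len(sequence), len(pattern)
    for i in range(n - m + 1):
        if sequence[i:i + m] == pattern:
            return i
    return -1

def mask_prompt_labels(input_ids, marker_ids, ignore_index=-100):
    """Copy input_ids and replace everything before the marker with ignore_index."""
    labels = list(input_ids)
    start = find_subsequence(labels, marker_ids)
    if start == -1:
        raise ValueError("assistant marker not found")
    for i in range(start):
        labels[i] = ignore_index
    return labels

# Made-up sequence: BOS, a user turn, then the assistant marker and its answer.
# Note that 105 also opens the user turn, so matching the full marker matters.
input_ids = [2, 105, 2364, 107, 9001, 9002, 106, 107, 105, 4368, 107, 9003, 106]
labels = mask_prompt_labels(input_ids, [105, 4368])
print(labels)
```

Positions set to `-100` are skipped by the cross-entropy loss, so only the assistant turn contributes to training. An empty marker list would match at index 0 and mask nothing, which is why it is worth printing the masked labels for one example before training.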

In [131]:
#train_subds = train_ds.select(np.random.choice(len(train_ds), 2000))
train_subds = train_ds.select(range(2000))
print(train_subds[0])
test_subds = test_ds.select(np.random.choice(len(test_ds), 20, replace=False))  # sample without duplicates
print(test_subds[0])

{'messages': [{'role': 'user', 'content': 'Classify this verse:\nIN THE BEGINNING God created the heaven and the earth.'}, {'role': 'assistant', 'content': '1'}], 'metadata': {'task': 'recognize_version', 'version': 'en.koren', 'book': 'genesis', 'chapter_num': 1, 'verse_num': 1}}
{'messages': [{'role': 'user', 'content': 'Classify this verse:\nבנים ובנות תוליד ולא יהיו לך כי ילכו בשבי'}, {'role': 'assistant', 'content': '6'}], 'metadata': {'task': 'recognize_version', 'version': 'he.text_only', 'book': 'deuteronomy', 'chapter_num': 28, 'verse_num': 41}}


In [132]:
tokenizer("""<start_of_turn>model""")

{'input_ids': [2, 105, 4368], 'attention_mask': [1, 1, 1]}

In [133]:
train_tok = train_subds.map(tokenize, remove_columns=train_ds.column_names)
test_tok = test_subds.map(tokenize, remove_columns=train_ds.column_names)


In [134]:
print(f"Train {len(train_tok)}")
print(f"Test {len(test_tok)}")

Train 2000
Test 20


## Train with Low Rank Adaptation

In [135]:
base_model

Gemma3ForCausalLM(
  (model): Gemma3TextModel(
    (embed_tokens): Gemma3TextScaledWordEmbedding(262144, 1152, padding_idx=0)
    (layers): ModuleList(
      (0-25): 26 x Gemma3DecoderLayer(
        (self_attn): Gemma3Attention(
          (q_proj): Linear(in_features=1152, out_features=1024, bias=False)
          (k_proj): Linear(in_features=1152, out_features=256, bias=False)
          (v_proj): Linear(in_features=1152, out_features=256, bias=False)
          (o_proj): Linear(in_features=1024, out_features=1152, bias=False)
          (q_norm): Gemma3RMSNorm((256,), eps=1e-06)
          (k_norm): Gemma3RMSNorm((256,), eps=1e-06)
        )
        (mlp): Gemma3MLP(
          (gate_proj): Linear(in_features=1152, out_features=6912, bias=False)
          (up_proj): Linear(in_features=1152, out_features=6912, bias=False)
          (down_proj): Linear(in_features=6912, out_features=1152, bias=False)
          (act_fn): GELUTanh()
        )
        (input_layernorm): Gemma3RMSNorm((1152,), e

In [136]:
lora_config = LoraConfig(
    r=16,
    lora_alpha=16,
    target_modules=["q_proj", "v_proj"],
    lora_dropout=0.05,
)

model = get_peft_model(base_model, lora_config)
model

PeftModel(
  (base_model): LoraModel(
    (model): Gemma3ForCausalLM(
      (model): Gemma3TextModel(
        (embed_tokens): Gemma3TextScaledWordEmbedding(262144, 1152, padding_idx=0)
        (layers): ModuleList(
          (0-25): 26 x Gemma3DecoderLayer(
            (self_attn): Gemma3Attention(
              (q_proj): lora.Linear(
                (base_layer): Linear(in_features=1152, out_features=1024, bias=False)
                (lora_dropout): ModuleDict(
                  (default): Dropout(p=0.05, inplace=False)
                )
                (lora_A): ModuleDict(
                  (default): Linear(in_features=1152, out_features=16, bias=False)
                )
                (lora_B): ModuleDict(
                  (default): Linear(in_features=16, out_features=1024, bias=False)
                )
                (lora_embedding_A): ParameterDict()
                (lora_embedding_B): ParameterDict()
                (lora_magnitude_vector): ModuleDict()
              )
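
As a rough check on the size of the adapter, the layer shapes printed above allow a back-of-the-envelope count of the LoRA parameters. This sketch assumes that only `q_proj` and `v_proj` in each of the 26 decoder layers receive adapters (as configured above); `lora_param_count` is a helper name introduced here for illustration.

```python
def lora_param_count(in_features, out_features, r):
    # lora_A maps in_features -> r, lora_B maps r -> out_features (no biases)
    return r * in_features + out_features * r

r = 16
num_layers = 26

q_proj = lora_param_count(1152, 1024, r)  # shapes from the printed model
v_proj = lora_param_count(1152, 256, r)
per_layer = q_proj + v_proj
total = per_layer * num_layers

print(f"q_proj: {q_proj}, v_proj: {v_proj}")
print(f"per layer: {per_layer}, total: {total}")
```

Under these assumptions the adapter has about 1.5 million parameters, a small fraction of a model with roughly one billion parameters.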
   

In [137]:
output_dir = "models/gemma3-lora-test5"

In [138]:
args = TrainingArguments(
    output_dir=output_dir,
    per_device_train_batch_size=1,
    gradient_accumulation_steps=4,
    num_train_epochs=3,
    learning_rate=5e-5,
    logging_steps=50,
    eval_strategy="steps",
    eval_steps=50,
    save_strategy="steps",
    save_steps=50,
    save_only_model=True,
    save_safetensors=False,
    remove_unused_columns=False
)

use_train_set = train_tok
use_eval_set = test_tok

#collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)
collator = DataCollatorForSeq2Seq(
    tokenizer=tokenizer,
    model=model,
    padding=True
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=use_train_set,
    eval_dataset=use_eval_set,
    data_collator=collator
)

trainer.train()


Step,Training Loss,Validation Loss
50,6.418,5.949674
100,4.6021,4.636192
150,3.704,4.00209
200,3.1129,3.622358
250,2.9156,3.523619
300,2.8489,3.484901
350,2.7169,3.439584
400,2.721,3.394263
450,2.6238,3.35888
500,2.5604,3.317852


TrainOutput(global_step=1500, training_loss=2.694953425089518, metrics={'train_runtime': 17754.1747, 'train_samples_per_second': 0.338, 'train_steps_per_second': 0.084, 'total_flos': 1522257786132480.0, 'train_loss': 2.694953425089518, 'epoch': 3.0})
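
The reported `global_step=1500` can be reproduced from the training arguments. This is a small worked calculation, assuming a single device and a training set size that divides evenly by the effective batch size (the exact rounding used by the trainer in other cases is not covered here).

```python
train_examples = 2000
per_device_train_batch_size = 1
gradient_accumulation_steps = 4
num_train_epochs = 3
num_devices = 1  # assumption: single-device training

# One optimizer step consumes this many examples
examples_per_step = per_device_train_batch_size * gradient_accumulation_steps * num_devices
steps_per_epoch = train_examples // examples_per_step
total_steps = steps_per_epoch * num_train_epochs

# Average wall-clock time per optimizer step, from the reported train_runtime
seconds_per_step = 17754.1747 / total_steps

print(steps_per_epoch, total_steps, round(seconds_per_step, 2))
```

With `save_steps=50`, this is also why checkpoints exist at multiples of 50 up to 1500.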

In [165]:
model.save_pretrained(f"{output_dir}/lora-final")

## Load an adapted model and use it

In [208]:
base_model_name = "google/gemma-3-1b-it"
tokenizer_inf = AutoTokenizer.from_pretrained(base_model_name)
base_model_inf = AutoModelForCausalLM.from_pretrained(base_model_name)

In [240]:
def load_lora_checkpoint(base_model_name, training_dir, checkpoint):
    # Use this function to load an adapted version of the model. 
    # Keep in mind you need to get a fresh base every time before attaching to the adaptation (otherwise you may mutate the same model with different adaptations)
    base_tmp = AutoModelForCausalLM.from_pretrained(base_model_name)
    adapted = PeftModel.from_pretrained(base_tmp, f"{training_dir}/checkpoint-{checkpoint}")
    return adapted
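
Before loading checkpoints it can help to see which ones were saved. The following is a minimal sketch that relies only on the `checkpoint-{N}` folder naming used above; `list_checkpoint_steps` is a hypothetical helper, and the demo builds a temporary directory with empty folders instead of reading the real training output.

```python
import os
import re
import tempfile

def list_checkpoint_steps(training_dir):
    """Return the sorted step numbers of all checkpoint-{N} subfolders."""
    pattern = re.compile(r"^checkpoint-(\d+)$")
    steps = []
    for name in os.listdir(training_dir):
        match = pattern.match(name)
        if match and os.path.isdir(os.path.join(training_dir, name)):
            steps.append(int(match.group(1)))
    return sorted(steps)

# Demo on a throwaway directory that imitates the layout of the output folder
with tempfile.TemporaryDirectory() as tmp:
    for name in ["checkpoint-500", "checkpoint-50", "checkpoint-1500", "lora-final"]:
        os.mkdir(os.path.join(tmp, name))
    steps = list_checkpoint_steps(tmp)

print(steps)
```

Sorting numerically matters here, because a plain string sort would place `checkpoint-1500` before `checkpoint-50`.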

In [140]:
output_dir

'models/gemma3-lora-test5'

## Compare different stages of the model
How do they perform on a test set?

In [210]:

compare_models = {
    'base': base_model_inf,
}
for ch in [50, 500, 1500]:
    model_i = load_lora_checkpoint(base_model_name, output_dir, ch)
    compare_models[f'm{ch}'] = model_i


In [211]:
def get_lora_l2_norm(model):
    lora_l2_norm = 0.0
    for name, param in model.named_parameters():
        if "lora" in name:
            lora_l2_norm += torch.norm(param.data, p=2).item() ** 2
    return torch.sqrt(torch.tensor(lora_l2_norm)).item()

def compare_models_responses(example, use_tokenizer, models, temperature=0.7):
    print("="*40)
    print(f"Task {example['metadata']['task']}")
    print(example['messages'][0]['content'])
    print("-"*10)
    print("Ground truth response:")
    print(example['messages'][1]['content'])
    print('-'*40)

    for model_name, use_model in models.items():
        print(f"Model {model_name} ({type(use_model)} l2: {get_lora_l2_norm(use_model)}). response:")
        resp = do_generate_response(use_tokenizer, use_model, example['messages'][0]['content'], temperature=temperature, max_new_tokens=10)
        print(resp)
        print('-'*30)

In [212]:

compare_models_responses(test_ds[0], tokenizer_inf, compare_models, temperature=0)

Task recognize_version
Classify this verse:
Dieu dit à Bileam: Tu n’iras pas avec eux, tu ne maudiras pas ce peuple, car il est béni.
----------
Ground truth response:
4
----------------------------------------
Model base (<class 'transformers.models.gemma3.modeling_gemma3.Gemma3ForCausalLM'> l2: 0.0). response:
This verse is best classified as **religious scripture/
------------------------------
Model m50 (<class 'peft.peft_model.PeftModel'> l2: 16.835968017578125). response:
This verse is a **religious/biblical**
------------------------------
Model m500 (<class 'peft.peft_model.PeftModel'> l2: 18.290239334106445). response:
4
------------------------------
Model m1500 (<class 'peft.peft_model.PeftModel'> l2: 18.99247169494629). response:
3
------------------------------


In [231]:
def add_metric_cols(df, use_model, model_name):
    resp_col = f'{model_name}_resp'
    is_num = f'{model_name}_isnum'
    resp_num = f'{model_name}_prednum'
    is_range = f'{model_name}_isrange'
    is_correct = f'{model_name}_iscorrect'
    df[resp_col] = df['messages'].apply(lambda msgs:do_generate_response(tokenizer_inf, use_model, msgs[0]['content'], temperature=0.0, max_new_tokens=4))
    df[is_num] = df[resp_col].apply(lambda ss:ss.strip().isdigit())
    df[resp_num] = df[resp_col].apply(lambda ss:int(ss) if ss.strip().isdigit() else -1)
    df[is_range] = df[resp_num].apply(lambda num:  (num>0) and (num<=6))
    df[is_correct] = df['gt']==df[resp_num]
    return df
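
The scoring logic can be tried on a few strings without running any model. This sketch reproduces the four derived columns for a single response; `score_response` is a hypothetical helper, and the extra `isascii()` check is an added assumption so that digit-like characters that `int()` cannot parse are treated as non-numeric.

```python
def score_response(response, ground_truth, num_classes=6):
    """Score one model response against the ground-truth class number."""
    text = response.strip()
    is_num = text.isascii() and text.isdigit()
    pred = int(text) if is_num else -1
    return {
        "isnum": is_num,                       # response is a bare number
        "prednum": pred,                       # parsed number, or -1
        "isrange": 1 <= pred <= num_classes,   # number is a valid class
        "iscorrect": pred == ground_truth,
    }

correct = score_response("4", 4)
chatty = score_response("This verse is best", 4)
out_of_range = score_response(" 7\n", 4)
print(correct, chatty, out_of_range, sep="\n")
```

Separating `isnum` and `isrange` from `iscorrect` makes it possible to tell a model that ignores the output format apart from a model that follows the format but picks the wrong class.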

In [242]:
test_set = pd.DataFrame(test_ds.select(range(100)))
test_set['gt'] = test_set['messages'].apply(lambda msgs:int(msgs[-1]['content']))
display(test_set.head(3))

Unnamed: 0,messages,metadata,gt
0,"[{'role': 'user', 'content': 'Classify this ve...","{'task': 'recognize_version', 'version': 'fr.s...",4
1,"[{'role': 'user', 'content': 'Classify this ve...","{'task': 'recognize_version', 'version': 'he.m...",5
2,"[{'role': 'user', 'content': 'Classify this ve...","{'task': 'recognize_version', 'version': 'he.t...",6


In [243]:
compare_models.keys()

dict_keys(['base', 'm50', 'm500', 'm1500'])

In [244]:
for model_name in ['base', 'm50', 'm500', 'm1500']:
    add_metric_cols(test_set, compare_models[model_name], model_name)
display(test_set)

Unnamed: 0,messages,metadata,gt,base_resp,base_isnum,base_prednum,base_isrange,base_iscorrect,m50_resp,m50_isnum,...,m500_resp,m500_isnum,m500_prednum,m500_isrange,m500_iscorrect,m1500_resp,m1500_isnum,m1500_prednum,m1500_isrange,m1500_iscorrect
0,"[{'role': 'user', 'content': 'Classify this ve...","{'task': 'recognize_version', 'version': 'fr.s...",4,This verse is best,False,-1,False,False,This verse is a,False,...,4,True,4,True,True,3,True,3,True,False
1,"[{'role': 'user', 'content': 'Classify this ve...","{'task': 'recognize_version', 'version': 'he.m...",5,This verse is a,False,-1,False,False,This verse is a,False,...,5,True,5,True,True,5,True,5,True,True
2,"[{'role': 'user', 'content': 'Classify this ve...","{'task': 'recognize_version', 'version': 'he.t...",6,This verse is a,False,-1,False,False,This verse is a,False,...,6,True,6,True,True,6,True,6,True,True
3,"[{'role': 'user', 'content': 'Classify this ve...","{'task': 'recognize_version', 'version': 'en.k...",1,This verse is best,False,-1,False,False,This verse is best,False,...,2,True,2,True,False,1,True,1,True,True
4,"[{'role': 'user', 'content': 'Classify this ve...","{'task': 'recognize_version', 'version': 'en.n...",2,This verse is best,False,-1,False,False,This verse is best,False,...,2,True,2,True,True,2,True,2,True,True
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
95,"[{'role': 'user', 'content': 'Classify this ve...","{'task': 'recognize_version', 'version': 'fr.r...",3,This verse is a,False,-1,False,False,This verse is a,False,...,4,True,4,True,False,3,True,3,True,True
96,"[{'role': 'user', 'content': 'Classify this ve...","{'task': 'recognize_version', 'version': 'fr.s...",4,This verse is best,False,-1,False,False,This verse is a,False,...,4,True,4,True,True,4,True,4,True,True
97,"[{'role': 'user', 'content': 'Classify this ve...","{'task': 'recognize_version', 'version': 'he.m...",5,This verse is a,False,-1,False,False,This verse is a,False,...,5,True,5,True,True,5,True,5,True,True
98,"[{'role': 'user', 'content': 'Classify this ve...","{'task': 'recognize_version', 'version': 'he.t...",6,This verse is a,False,-1,False,False,This verse is a,False,...,6,True,6,True,True,6,True,6,True,True


In [245]:
for model_name in ['base', 'm50', 'm500', 'm1500']:
    print(f"{model_name} confusion matrix (rows=ground truth, cols=prediction):")
    gt = test_set['gt']
    pred = test_set[f'{model_name}_prednum']
    print(f"Accuracy {np.mean(gt == pred)}")
    print(sklearn.metrics.confusion_matrix(gt, pred))

base confusion matrix (rows=ground truth, cols=prediction):
Accuracy 0.0
[[ 0  0  0  0  0  0  0]
 [17  0  0  0  0  0  0]
 [16  0  0  0  0  0  0]
 [16  0  0  0  0  0  0]
 [17  0  0  0  0  0  0]
 [17  0  0  0  0  0  0]
 [17  0  0  0  0  0  0]]
m50 confusion matrix (rows=ground truth, cols=prediction):
Accuracy 0.0
[[ 0  0  0  0  0  0  0]
 [17  0  0  0  0  0  0]
 [16  0  0  0  0  0  0]
 [16  0  0  0  0  0  0]
 [17  0  0  0  0  0  0]
 [17  0  0  0  0  0  0]
 [17  0  0  0  0  0  0]]
m500 confusion matrix (rows=ground truth, cols=prediction):
Accuracy 0.71
[[ 7 10  0  0  0  0]
 [ 0 16  0  0  0  0]
 [ 0  0  2 14  0  0]
 [ 0  0  5 12  0  0]
 [ 0  0  0  0 17  0]
 [ 0  0  0  0  0 17]]
m1500 confusion matrix (rows=ground truth, cols=prediction):
Accuracy 0.79
[[13  4  0  0  0  0]
 [ 3 12  1  0  0  0]
 [ 0  0  8  8  0  0]
 [ 0  0  5 12  0  0]
 [ 0  0  0  0 17  0]
 [ 0  0  0  0  0 17]]
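
The m1500 matrix above can be broken down further with plain Python. The sketch below recomputes the accuracy, the per-class recall, and a coarser "same language" score that counts a prediction as right when it lands on any version in the correct language. The label order (1-2 English, 3-4 French, 5-6 Hebrew) follows the version order in the generated examples.

```python
# m1500 confusion matrix copied from the output above (rows = ground truth)
m1500 = [
    [13, 4, 0, 0, 0, 0],
    [3, 12, 1, 0, 0, 0],
    [0, 0, 8, 8, 0, 0],
    [0, 0, 5, 12, 0, 0],
    [0, 0, 0, 0, 17, 0],
    [0, 0, 0, 0, 0, 17],
]
language_of = ["en", "en", "fr", "fr", "he", "he"]  # labels 1..6
n = len(m1500)

total = sum(sum(row) for row in m1500)
accuracy = sum(m1500[i][i] for i in range(n)) / total
recall = [m1500[i][i] / sum(m1500[i]) for i in range(n)]
same_language = sum(
    m1500[i][j]
    for i in range(n)
    for j in range(n)
    if language_of[i] == language_of[j]
) / total

print(f"accuracy: {accuracy:.2f}")
print("recall per class:", [round(r, 2) for r in recall])
print(f"same-language accuracy: {same_language:.2f}")
```

On these 100 test examples the language is right in 99 cases, so almost all remaining errors are confusions between the two English versions or between the two French versions, while the two Hebrew versions are separated perfectly.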


## Notes after gemma3-lora-test2:

I ran a short experiment: Gemma 3 1B fine-tuned with LoRA (rank 8) on 2000 chat examples, each a user request and an assistant response. The examples covered 2 tasks, either translating a verse from one version/language to another or recognizing which version a verse comes from, with varied phrasings of the instructions. 1 epoch.

Brief evaluation: I compared the responses of the base model, checkpoint 250, and checkpoint 500 (the last checkpoint: with a weight update every 4 examples of accumulated gradient, 2000 examples give 500 steps).
I briefly sampled examples *from the train set*.

My impression: experiment failed miserably. :-)
* When asked to "translate", it sometimes generates Hebrew with Nikkud even when I request "Hebrew (text only)".
* The generated text may contain a word related to the input text, but overall the content is made up.
* It sometimes generates nonsense characters.
* When asked to recognize the version, it often tells me "this is from the bible" (duh) or tells me which book and verse, instead of what I wanted (I wanted it to recognize "French rabbinate" vs. "English Koren" vs. "Hebrew (text only)", etc.). It's possible my user instructions were not clear enough, but I was hoping to see the model "understand" from the training examples. I don't think it did.

Possible reasons for failures:
* Tasks too hard. Perhaps I should make the instructions clearer, and perhaps also include an example in the instructions.
* Data not clean: I noticed that the Masoretic Hebrew version has HTML tags in the text. This may cause confusion.
* Train size: not enough? Too much? I'm not sure. I think not enough, because there's no clear trend of learning from base to 250 to 500.
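
For the data-cleanliness point, one possible cleanup step is sketched below using only the standard library. It is an assumption about what the cleanup could look like, not something applied in this notebook: `strip_html` is a hypothetical helper, line breaks are turned into spaces, other tags are removed without inserting a space (so a tag wrapped around part of a word does not split the word), and HTML entities are decoded.

```python
import html
import re

BR_RE = re.compile(r"<br\s*/?>", re.IGNORECASE)
TAG_RE = re.compile(r"<[^>]+>")

def strip_html(text):
    """Remove HTML tags and entities from a verse and normalize whitespace."""
    text = BR_RE.sub(" ", text)      # line breaks separate words
    text = TAG_RE.sub("", text)      # drop all remaining tags
    text = html.unescape(text)       # e.g. &nbsp; -> non-breaking space
    return re.sub(r"\s+", " ", text).strip()

cleaned = strip_html("And God <b>said</b>,&nbsp;Let<br/>there be <i>light</i>")
print(cleaned)
```

It would be worth inspecting a sample of cleaned verses by eye, since tags in the source text may carry content (such as section markers) that should be dropped entirely instead of kept as text.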