# Welcome to the 'Reasoning with LLMs' Workshop!

## Introduction
In this workshop we will learn what large language ‘reasoning’ models are and how to build them. Following the releases of OpenAI’s o-family of models, Gemini Thinking models, and the recent Grok 3 models, so-called ‘thinking’ or ‘reasoning’ LLMs have become the latest buzz in the AI community. The DeepSeek family of models recently attracted a lot of attention, and both the models and their training recipes have been open-sourced. During the workshop we will demystify how these models are trained and use open-source models and frameworks to replicate some of the concepts used in DeepSeek training.

We will start with a model that is not particularly good at math and, using publicly available models, datasets, and frameworks, apply supervised fine-tuning (SFT) and reinforcement learning methods to train a reasoning model capable of solving math word problems.

We will learn what reasoning models are and the theory behind them, explain how and why Chain-of-Thought works, and see how to preprocess data, evaluate a model, leverage techniques like quantization and LoRA to train big models with a very small resource footprint (for the GPU poor), track experiments, and push models to the Hugging Face Hub, so you will ‘take home’ your own reasoning LLM!

During the workshop we will gradually build our own library for learning, exploring and playing with Reasoning (or Thinking) LLMs. We will work through notebooks, but at the same time we will export important pieces of code into separate .py files and thus build a library that is easy to use later and to build upon (and please, feel free to do so after the workshop!). For example, you can try other models, datasets, reward functions etc.

This is the folder structure that we will create:

```
reasoning_workshop/
├── notebooks/
│   ├── reasoning_llms_workshop.ipynb
│   └── setup.ipynb
│
├── src/
│   ├── __init__.py
│   ├── data_preparation.py
│   ├── utils.py
│   ├── evaluation.py
│   └── reward_functions.py
│
├── scripts/
│   ├── run_sft_training.py
│   └── run_grpo_training.py
│
├── outputs/
│   ├── sft_model/
│   └── grpo_model/
│
├── requirements.txt
└── README.md
```

LLM reasoning is a controversial topic, but before we jump into why that is (and discuss whether LLMs actually reason at all), let's get our hands dirty by tinkering with an actual model. There will be time for discussion while our models are training, being evaluated, etc.

IMPORTANT - please save a copy of the notebook to your Google Drive. You can click on 'File' -> 'Save a copy in Drive' in the main menu. This will allow you to keep all files and models in a folder on your Drive so you can easily access them later, convert them into your own GitHub repo, etc.

In [None]:
# --- Initial Project Setup ---
# This cell creates the directory structure for our project.
# We'll be populating these files as we go through the workshop.

import os
import sys
from google.colab import drive
drive.mount('/content/drive')

# Add the path to your project folder in Drive to Python's search path
directory = '/content/drive/MyDrive/Colab_Notebooks/llm_workshop'
sys.path.append(directory)
file_path = f"{directory}/utils_hello_drive.py"
os.makedirs(directory, exist_ok=True)  # No error if the folder already exists


In [None]:
# Subdirectories (-p avoids errors when the cell is re-run)
!mkdir -p {directory}/notebooks
!mkdir -p {directory}/scripts
!mkdir -p {directory}/src
!mkdir -p {directory}/outputs
!mkdir -p {directory}/outputs/sft_model
!mkdir -p {directory}/outputs/grpo_model
!mkdir -p {directory}/models

In [None]:
# Create the __init__.py file to make 'src' a Python package
!touch {directory}/src/__init__.py


In [None]:
%%writefile {directory}/src/utils.py

Now, let's try to put a .py file in our project folder and then use it in our notebook with ```import```

In [None]:
%%writefile  {directory}/hello_from_drive.py

def hello_world():
  print("Hello from our Python script in Drive!")

In [None]:
import hello_from_drive

hello_from_drive.hello_world()

Now let's do all the necessary installations and imports for playing with LLMs

In [None]:
!pip install uv

In [None]:
!uv pip install unsloth vllm
!pip install --upgrade datasets


## Loading a model

In [None]:
import unsloth
from unsloth import FastModel
max_seq_length = 1024

model, tokenizer = FastModel.from_pretrained(
    model_name = "unsloth/gemma-3-1b-it", # "unsloth/gemma-3-1b-it-unsloth-bnb-4bit" for 4-bit version
    max_seq_length = max_seq_length,
    load_in_4bit = False,
    load_in_8bit = False,
    full_finetuning = False,
)

We now add LoRA adapters so we only need to update a small number of parameters!

LoRA is a Parameter-Efficient Fine-Tuning (PEFT) method. You can read more about it here:

*   https://arxiv.org/abs/2106.09685
*   https://magazine.sebastianraschka.com/p/practical-tips-for-finetuning-llms
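
To get a feel for why LoRA is so cheap, here is a rough parameter count for a single weight matrix. LoRA keeps the original matrix frozen and trains two small rank-`r` matrices instead. This is only a sketch: the helper name `lora_param_counts` and the 1024×1024 layer shape are made up for illustration and are not the actual Gemma layer sizes.

```python
def lora_param_counts(d_in, d_out, r):
    """Trainable parameters for one weight matrix: full fine-tuning vs. LoRA."""
    full = d_in * d_out        # every entry of W is trainable
    lora = r * (d_in + d_out)  # A is (r x d_in), B is (d_out x r); W stays frozen
    return full, lora


# Hypothetical layer shape, with the same rank r = 8 that we use in the next cell.
full, lora = lora_param_counts(d_in=1024, d_out=1024, r=8)
print(f"full: {full:,}  LoRA: {lora:,}  ratio: {lora / full:.2%}")
```

For this toy layer, LoRA trains roughly 1.6% of the parameters that full fine-tuning would update.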



In [None]:
# if we are about to train a model, let's add some lora adapters
model = FastModel.get_peft_model(
    model,
    finetune_vision_layers     = False, # Turn off for just text!
    finetune_language_layers   = True,  # Should leave on!
    finetune_attention_modules = True,  # Attention good for GRPO
    finetune_mlp_modules       = True,  # Should leave on always!

    r = 8,           # Larger = higher accuracy, but might overfit
    lora_alpha = 8,  # Recommended alpha == r at least
    lora_dropout = 0,
    bias = "none",
    random_state = 3407,
)

Now let's test our model with some generations. Note that here we are using 'streaming' generation mode.

In [None]:
system_prompt = "solve this math problem."
messages = [
    {"role": "system", "content": system_prompt},
    {"role": "user",   "content": "What is the sqrt of 101?"},
]

text = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt = True, # Must add for generation
    tokenize = False,
)
from transformers import TextStreamer
_ = model.generate(
    **tokenizer(text, return_tensors = "pt").to("cuda"),
    max_new_tokens = 64, # Increase for longer outputs!
    # Recommended Gemma-3 settings!
    temperature = 1.0, top_p = 0.95, top_k = 64,
    streamer = TextStreamer(tokenizer, skip_prompt = False),
)

In [None]:
!nvidia-smi

In [None]:
model

Let's generate some responses from our model (without streaming).

In [None]:
prompt = "Once upon a time in a land far, far away, there lived a"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)


In [None]:
outputs = model.generate(
    **inputs,
    max_length=50,
    do_sample=True,
    top_k=50,
    top_p=0.95,
    temperature=0.7,
    output_scores=True,
    return_dict_in_generate=True,
    num_return_sequences=1
)


In [None]:
import torch  # needed for softmax/topk below

# outputs.scores is a tuple of tensors, one per generation step
# Each tensor shape: (batch_size, vocab_size)
print(f"Number of generation steps: {len(outputs.scores)}")
output_scores = outputs.scores # Logits for each generation step

# Convert logits to probabilities
# Get probabilities for the first generated token
first_step_probs = torch.softmax(output_scores[0], dim=-1)
top_k_probs, top_k_indices = torch.topk(first_step_probs, k=5)

print("\nTop 5 tokens and probabilities for the first generated token:")
for i in range(5):
    token = tokenizer.decode(top_k_indices[0, i])
    prob = top_k_probs[0, i].item()
    print(f"- Token: '{token}', Probability: {prob:.4f}")

In [None]:
outputs[0]

In [None]:
# Decode the generated text
generated_text = tokenizer.decode(outputs[0][0], skip_special_tokens=False)
print(generated_text)

As we can see, models are probabilistic in nature, i.e. they autoregressively generate probabilities for the next token. We can influence this process with generation parameters like top_p, top_k, temperature etc.
Keep this in mind when we discuss 'reasoning'.
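
To make the effect of these parameters concrete, here is a minimal, library-free sketch of how temperature, top-k and top-p reshape a next-token distribution. The toy vocabulary, the logit values and the function name `filter_next_token_probs` are invented for illustration; the real implementation inside `model.generate` works on tensors and may differ in details.

```python
import math


def softmax(scores):
    """Convert a {token: score} dict into probabilities."""
    top = max(scores.values())
    exps = {tok: math.exp(s - top) for tok, s in scores.items()}
    total = sum(exps.values())
    return {tok: e / total for tok, e in exps.items()}


def filter_next_token_probs(logits, temperature=1.0, top_k=0, top_p=1.0):
    """Apply temperature, then top-k, then top-p to a toy {token: logit} dict."""
    # Temperature: divide logits before the softmax (<1 sharpens, >1 flattens).
    probs = softmax({tok: s / temperature for tok, s in logits.items()})
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)

    # Top-k: keep only the k most likely tokens (0 disables the filter).
    if top_k > 0:
        ranked = ranked[:top_k]
    mass = sum(p for _, p in ranked)
    ranked = [(tok, p / mass) for tok, p in ranked]

    # Top-p: keep the smallest set of top tokens whose cumulative mass reaches top_p.
    kept, cumulative = [], 0.0
    for tok, p in ranked:
        kept.append((tok, p))
        cumulative += p
        if cumulative >= top_p:
            break

    # Renormalize the survivors so they sum to 1 again.
    mass = sum(p for _, p in kept)
    return {tok: p / mass for tok, p in kept}


toy_logits = {" king": 2.0, " queen": 1.5, " dragon": 0.5, " farmer": 0.0}  # invented values
sharp = filter_next_token_probs(toy_logits, temperature=0.5)
flat = filter_next_token_probs(toy_logits, temperature=2.0)
top2 = filter_next_token_probs(toy_logits, top_k=2)
print(sharp, flat, top2, sep="\n")
```

A lower temperature concentrates probability on the most likely token, a higher one spreads it out, and top-k / top-p simply remove unlikely tokens before sampling.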

## Loading a dataset

In [None]:
from datasets import load_dataset
import torch

In [None]:
train_dataset = load_dataset("gsm8k", "main", split="train")
test_dataset = load_dataset("gsm8k", "main", split="test")

In [None]:
train_dataset[0]

In [None]:
test_dataset[0]['question'], test_dataset[0]['answer']

In [None]:
# Standard Prompt
prompt = "Natalia sold 48 pastries in the morning and 23 pastries in the afternoon. She baked a total of 100 pastries. How many pastries did she have left?"


In [None]:

# Prepare the input
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
# Generate text
outputs = model.generate(
    **inputs,
    max_new_tokens = 256,
    do_sample=True,
    top_k=50,
    top_p=0.95,
    temperature=0.7,
    output_scores=True,
    return_dict_in_generate=True,
    num_return_sequences=1,
    eos_token_id=tokenizer.eos_token_id # Ensure the model stops at the end of the sequence
)


In [None]:

# Decode the generated text
generated_text = tokenizer.decode(outputs[0][0], skip_special_tokens=False)
print("\nGenerated Text:")
print(generated_text)

Setting the ```num_return_sequences``` parameter to a number greater than one will result in the model generating multiple different outputs for the same initial prompt.

It is significantly faster to generate multiple sequences in a single execution by setting ```num_return_sequences``` than it is to run the same code multiple times in a loop.

When we set ```num_return_sequences``` to a value like 5, we are instructing the model to produce five independent continuations of the initial prompt. Since our code uses do_sample=True, the model employs a sampling strategy (specifically, top-k and top-p sampling) to choose the next word at each step. This inherent randomness in the selection process allows for the generation of diverse sequences from the same starting point.

If we were using a deterministic method (e.g., do_sample=False for greedy search), all the returned sequences would be identical. However, with sampling enabled, each of the returned sequences represents a different path the model has explored.

Setting `num_return_sequences` to a higher value is faster than calling the generation in a loop for two main reasons:

*   **Reduced overhead:** Each call to `model.generate()` involves a certain amount of overhead: preparing the inputs, moving data to the GPU (if applicable), and initializing the generation process. A loop pays this cost on every iteration, whereas a single call with multiple return sequences pays it once.
*   **Batched computation:** This is the most significant speed-up. When you request multiple sequences in one go, the model can process the initial prompt and the subsequent generation steps for all sequences in parallel on hardware like GPUs or TPUs. Modern deep learning models are highly optimized for this kind of parallel, batched computation. In contrast, a for loop that calls the generation function repeatedly processes each sequence sequentially and fails to take full advantage of the hardware's parallelism.

Let's put the generation code inside a function that we can reuse later on. Also, let's save this function in the utils.py file - we are starting to build our small library!

In [None]:
#%%writefile {directory}/utils.py
import torch
def generate_output(model, tokenizer, prompt: str,  **generation_kwargs) -> list[str]:

    model.eval()

    with torch.no_grad():
      inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
      prompt_token_count = inputs.input_ids.shape[1]

      # This makes the function flexible; you can override defaults on-the-fly.
      default_kwargs = {
          "max_new_tokens": 256,
          "do_sample": True,
          "top_k": 64,
          "top_p": 0.95,
          "temperature": 1,
          "num_return_sequences": 1,
          "pad_token_id": tokenizer.eos_token_id,
          "eos_token_id": tokenizer.eos_token_id, # good practice to set this
      }

      # The following line will update the default generation parameters
      default_kwargs.update(generation_kwargs)

      outputs = model.generate(
          **inputs,
          **default_kwargs
      )

    # Decode and slice each sequence to get only the generated text
    generated_texts = []
    for sequence in outputs:
        generated_tokens = sequence[prompt_token_count:]
        generated_text = tokenizer.decode(generated_tokens, skip_special_tokens=False)
        generated_texts.append(generated_text.strip())

    return generated_texts

In [None]:
#import utils
#import importlib
#importlib.reload(utils) # Reload the utils module if you have problems loading it after modification

#utils.generate_output(model, tokenizer, prompt, num_return_sequences=5)
generate_output(model, tokenizer, prompt, num_return_sequences=5)

Try running the cell above several times. **It seems like our model is not reliably following our instruction!**
This is where fine-tuning (or Supervised Fine-Tuning, SFT) can help.
Also, this is a good point to discuss different types of prompting techniques and how they influence the 'reasoning' capabilities.


*   Few-shot prompting
*   Chain of thought (CoT)

But, before we jump into that, there is one more thing to check!
Let's look at the official Gemma documentation. Try to find out if there is anything we missed!

# Spoiler alert
---



Gemma official documentation (https://ai.google.dev/gemma/docs/core/prompt-structure) says this:



> \<start_of_turn>user
knock knock\<end_of_turn>
\<start_of_turn>model
who is there\<end_of_turn>
\<start_of_turn>user
Gemma\<end_of_turn>
\<start_of_turn>model
Gemma who?\<end_of_turn>






Gemma Technical Report: https://arxiv.org/html/2503.19786v1
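
The turn structure quoted above can be sketched as a small helper that renders a list of chat messages into this text layout. This is an illustrative sketch only: `build_gemma_prompt` is a hypothetical name, and the helper deliberately leaves out `<bos>` on the assumption that the tokenizer adds it. In practice, the tokenizer's own chat template is the safer choice.

```python
def build_gemma_prompt(messages, add_generation_prompt=True):
    """Render [{'role': ..., 'content': ...}] into Gemma's turn-based text layout."""
    role_names = {"user": "user", "assistant": "model"}  # Gemma calls the assistant "model"
    parts = []
    for message in messages:
        role = role_names[message["role"]]
        parts.append(f"<start_of_turn>{role}\n{message['content']}<end_of_turn>\n")
    if add_generation_prompt:
        # Open a new model turn so that the model writes the next reply.
        parts.append("<start_of_turn>model\n")
    return "".join(parts)


chat = [
    {"role": "user", "content": "knock knock"},
    {"role": "assistant", "content": "who is there"},
    {"role": "user", "content": "Gemma"},
]
print(build_gemma_prompt(chat))
```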

In [None]:
prompt = train_dataset[0]['question']
instruction = 'Solve this math problem step by step. After step by step solution write out #### followed with the number solution. '
prompt = "<bos><start_of_turn>user " + instruction + prompt + "<end_of_turn><start_of_turn>model "
prompt


In [None]:
# We can make it more flexible with preamble and suffix
preamble = "<bos><start_of_turn>user "
suffix = "<end_of_turn><start_of_turn>model "
prompt = preamble + instruction + train_dataset[0]['question'] + suffix
prompt

In [None]:
%%time
outputs = generate_output(model, tokenizer, prompt, num_return_sequences=5, max_new_tokens=512)
outputs

Clearly, this is much better!
One needs to be careful with these things. It is often best to consult the official documentation, but keep in mind that it can be out of date. For example, the Gemma documentation still says that Gemma models do not use the system instruction as a separate chat block, but their own examples do include one:
https://huggingface.co/google/gemma-3-1b-it



Still, it is not ideal and now we can try to achieve better formatting with SFT. We just need to preprocess our data to be in the above mentioned format. Transformers templates can help:
https://huggingface.co/docs/trl/main/en/dataset_formats#converting-a-conversational-dataset-into-a-standard-dataset

We now have to apply the chat template for `Gemma-3` to the conversations and save the result to `text`. We remove the `<bos>` token using `removeprefix('<bos>')` since we're finetuning. The processor will add this token before training and the model expects only one.

In [None]:
from unsloth.chat_templates import get_chat_template
tokenizer = get_chat_template(
    tokenizer,
    chat_template = "gemma-3",
)

In [None]:
example = train_dataset[0]
example

In [None]:
formatted_example = tokenizer.apply_chat_template(
    [{"role" : "user", "content" : example['question']},
     {"role" : "assistant", "content" : example['answer']}],
    add_generation_prompt=False,
    tokenize=False,
    #return_dict=True,
    return_tensors="pt",
)#.to(model.device).to(torch.bfloat16) #this makes sense if we are returning tokens
formatted_example

Using templates is convenient because we can easily choose if we want strings or tokens returned etc.
But, what is happening behind the scenes can be done, of course, without templates as well:

```
def format_gsm8k_prompt(example):
    """Format GSM8K examples into a chat format for instruction tuning"""
    question = example["question"]
    answer = example["answer"]
    
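    # Create a structured prompt. It is built from explicit pieces so that
    # the function's indentation does not leak into the prompt text.
    prompt = (
        "<bos><start_of_turn>user\n"
        f"Solve this math problem step by step:\n\n{question}<end_of_turn>\n"
        "<start_of_turn>model\n"
        f"{answer}<end_of_turn><eos>"
    )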
    
    return {"text": prompt}
```


# Analyzing the Dataset

In [None]:
import matplotlib.pyplot as plt

def plot_answer_length_distribution(dataset_split, split_name):
    """
    Calculates and plots the distribution of answer lengths for a dataset split.

    Args:
        dataset_split: A Hugging Face dataset split (e.g., train_dataset or test_dataset).
        split_name (str): The name of the dataset split (e.g., "Train" or "Test").
    """
    answer_lengths = []
    for example in dataset_split:
        answer = example['answer']
        answer_lengths.append(len(answer))

    plt.figure(figsize=(10, 6))
    plt.hist(answer_lengths, bins=50, edgecolor='black')
    plt.title(f"Distribution of Answer Lengths ({split_name} Dataset)")
    plt.xlabel("Answer Length (characters)")
    plt.ylabel("Frequency")
    plt.show()

plot_answer_length_distribution(train_dataset, "Train")
plot_answer_length_distribution(test_dataset, "Test")

It seems like we need to allow more tokens in the response if we expect our model to be able to solve these tasks. Note that the plot measures answer length in characters, not tokens.
For good results, we would need longer outputs.

## Homework No.1:
Calculate basic descriptive statistics for the lengths of the responses in our dataset (if you do not know which descriptive statistics are common, Google it!)

## Homework No.2:
Try using the different types of prompts that we discussed (few-shot prompting, Chain-of-Thought) with the non-trained model and observe the results.

# Speeding up generation - Meet vLLM
As we saw in the dataset analysis, we need longer outputs so that our model can realistically solve these math problems. But longer sequences mean longer generation times.
Luckily, we can significantly speed up generation by using the vLLM library for fast inference. vLLM has a bunch of smart optimizations that enable much faster inference.
For more info, visit https://github.com/vllm-project/vllm

Note the `gpu_memory_utilization` parameter: it defines how much of the GPU's VRAM vLLM is allowed to reserve (for the model weights, the paged-attention cache etc.). Reduce it if you run into CUDA out-of-memory issues (e.g. if you also have a model loaded with Transformers).


In [None]:
%%time
outputs = generate_output(model, tokenizer, prompt, num_return_sequences=1, max_new_tokens=256)
outputs


In [None]:
from vllm import LLM, SamplingParams

# Load the model with vLLM
vanilla = "unsloth/gemma-3-1b-it"
#trained_model = "/content/drive/My Drive/..."

llm = LLM(model=vanilla, trust_remote_code=True, gpu_memory_utilization=0.75)


In [None]:
llm # checking if the model was initialized successfully in the vLLM engine

In [None]:
# Let's define the sampling parameters. Remember, we had these before:
"""default_kwargs = {
        "max_new_tokens": 256,
        "do_sample": True,
        "top_k": 64,
        "top_p": 0.95,
        "temperature": 1,
        "num_return_sequences": 1,
        "pad_token_id": tokenizer.eos_token_id,
        "eos_token_id": tokenizer.eos_token_id, # good practice to set this
    }
so let's use the same, so that our results are comparable.
"""
sampling_params = SamplingParams(temperature=1, top_p=0.95, top_k = 64, max_tokens=386)


In [None]:

# Prepare your prompts
prompts = [
    preamble + instruction + test_dataset[0]['question'] + suffix,
    preamble + instruction + test_dataset[1]['question'] + suffix,
]


In [None]:
prompts = [
    instruction + train_dataset[0]['question'],
    instruction + train_dataset[1]['question'],
]

In [None]:
prompts

vLLM reserves space on your GPU, no matter what the model size is (the reserved fraction is controlled by the `gpu_memory_utilization` parameter). Let's check our GPU utilization now:

In [None]:
!nvidia-smi

In [None]:
%%time
# Generate text
outputs = llm.generate(prompts, sampling_params)


Oh, that's much faster!


In [None]:
# Print the outputs
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt}, Generated: {generated_text}")

# Supervised Fine Tuning - SFT

For the official Unsloth example of SFT for Gemma 3, check out this notebook: https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Gemma3_(4B).ipynb


Now let's prepare our entire dataset for SFT. (let's stick with the templates):
https://huggingface.co/docs/trl/main/en/dataset_formats#which-dataset-type-to-use

In [None]:
subset_dataset = train_dataset.select(range(100))

In [None]:
reasoning_start = "<start_working_out>"
reasoning_end   = "<end_working_out>"
solution_start = "<SOLUTION>"
solution_end = "</SOLUTION>"

system_prompt = \
f"""You are given a problem.
Think about the problem and provide your working out.
Place it between {reasoning_start} and {reasoning_end}.
Then, provide your solution between {solution_start}{solution_end}\n\n"""
system_prompt

In [None]:
train_dataset[0]

In [None]:
answer = 'Natalia sold 48/2 = <<48/2=24>>24 clips in May.\nNatalia sold 48+24 = <<48+24=72>>72 clips altogether in April and May.\n#### 72'

In [None]:
import re
from typing import List, Tuple, Optional

def extract_final_answer(text: str, pattern: str) -> Optional[float]:
    """Extract the final numeric answer that follows `pattern` in `text`.

    `pattern` is either an opening tag such as '<SOLUTION>' (the number is
    expected between the opening and the closing tag) or a plain marker such
    as '####'. Returns None when no number is found.
    """
    try:
        if pattern.startswith('<') and pattern.endswith('>'):
            # Extract tag name (e.g., 'answer' from '<answer>')
            tag = re.escape(pattern[1:-1])
            # Look for <tag>number</tag>
            match = re.search(rf'<{tag}>\s*([+-]?\d*\.?\d+)\s*</{tag}>', text)
            if match:
                return float(match.group(1))
        else:
            # Any other pattern: look for the pattern followed by a number
            # (other characters are allowed between the pattern and the number;
            # drop the `.*?` part for a strict match)
            escaped_pattern = re.escape(pattern)
            match = re.search(rf'{escaped_pattern}.*?([+-]?\d*\.?\d+)', text)
            if match:
                return float(match.group(1))

    except (ValueError, AttributeError):
        pass

    return None


In [None]:
extract_final_answer(answer, '####')

In [None]:
pattern = '####'
escaped_pattern = re.escape(pattern)
re.search(rf'{escaped_pattern}.*?([+-]?\d*\.?\d+)', answer).group(0)

In [None]:
escaped_pattern
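
One limitation of the number pattern used above is worth knowing: if a number is written with a thousands separator, such as `1,000`, the match stops at the comma and only `1` is captured. The following is a hedged sketch of a comma-tolerant variant. The name `extract_number_with_commas` and the `NUMBER` pattern are illustrative and not part of the workshop library.

```python
import re
from typing import Optional

# One number, optionally signed, with optional thousands separators: 72, -3.5, 1,000, .5
NUMBER = r"[+-]?(?:\d[\d,]*(?:\.\d+)?|\.\d+)"


def extract_number_with_commas(text: str, marker: str) -> Optional[float]:
    """Return the first number after `marker`, accepting thousands separators."""
    match = re.search(re.escape(marker) + r".*?(" + NUMBER + r")", text)
    if match is None:
        return None
    # Strip the separators before converting, so that "1,000" becomes 1000.0.
    return float(match.group(1).replace(",", ""))


print(extract_number_with_commas("The total is\n#### 1,000", "####"))  # 1000.0
print(extract_number_with_commas("#### 72", "####"))                   # 72.0
print(extract_number_with_commas("no marker here", "####"))            # None
```

Whether this matters depends on how the dataset and the model write large numbers, so it is worth checking a few examples before relying on either version.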

In [None]:
#%%writefile -a {directory}/utils.py

# Matches the GSM8K final-answer marker (e.g. "#### 72") and captures the number.
GSM8K_ANSWER_RE = re.compile(r'####.*?([+-]?\d*\.?\d+)')

def format_gsm8k_sft(examples):
    # examples is a dictionary where each key holds a list of items.
    # We zip the lists for 'question' and 'answer' together to process them in pairs.
    texts = []
    for question, answer in zip(examples['question'], examples['answer']):
        match = GSM8K_ANSWER_RE.search(answer)
        # Everything before the "#### <number>" marker is the worked-out solution.
        reasoning = answer.split(match.group(0))[0]
        final_answer = match.group(1)
        messages = [
            {"role": "user", "content": system_prompt + question},
            {"role": "assistant", "content": reasoning_start + reasoning + reasoning_end
                                             + solution_start + final_answer + solution_end},
        ]
        text = tokenizer.apply_chat_template(
            messages,
            tokenize=False,
            add_generation_prompt=False,
        ).removeprefix('<bos>')
        texts.append(text)
    return {"text": texts}


In [None]:
#sft_dataset = subset_dataset.map(format_gsm8k_sft, batched = True)
sft_dataset_train = train_dataset.map(format_gsm8k_sft, batched = True, remove_columns=train_dataset.column_names)
#sft_dataset_test = test_dataset.map(format_gsm8k_sft, batched = True, remove_columns=test_dataset.column_names)

In [None]:
sft_dataset_train[2]

In [None]:
sft_dataset_train[3]["text"]

In [None]:
import json
from datasets import load_dataset
from transformers import TrainingArguments
from trl import SFTTrainer
import wandb


In [None]:
import torch

In [None]:
from trl import SFTTrainer, SFTConfig
trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = sft_dataset_train,
    eval_dataset = None, # Can set up evaluation!
    args = SFTConfig(
        dataset_text_field = "text",
        per_device_train_batch_size = 4,
        gradient_accumulation_steps = 4, # Use GA to mimic batch size!
        warmup_steps = 5,
        # num_train_epochs = 1, # Set this for 1 full training run.
        max_steps = 600,
        learning_rate = 2e-4, # Reduce to 2e-5 for long training runs
        logging_steps = 1,
        optim = "adamw_8bit",
        weight_decay = 0.01,
        lr_scheduler_type = "linear",
        seed = 3407,
        report_to = "wandb", # Use this for WandB etc
        dataset_num_proc=2,
        output_dir=f"{directory}/outputs/sft_model",
        save_steps=200,
        save_strategy="steps",
        eval_steps=50,
        per_device_eval_batch_size=1,
    ),
)
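
As a quick sanity check on these settings, the arithmetic below works out the effective batch size and roughly how many passes over the data 600 steps correspond to. The single-GPU setup and the training-set size of 7,473 examples are assumptions; in the notebook you would use `len(sft_dataset_train)` instead.

```python
per_device_train_batch_size = 4
gradient_accumulation_steps = 4
max_steps = 600
num_devices = 1            # assumption: a single Colab GPU
num_train_examples = 7473  # assumption: replace with len(sft_dataset_train)

# Gradients from several small batches are accumulated before each optimizer step.
effective_batch_size = per_device_train_batch_size * gradient_accumulation_steps * num_devices
examples_seen = effective_batch_size * max_steps
approx_epochs = examples_seen / num_train_examples

print(f"Effective batch size: {effective_batch_size}")
print(f"Examples seen in {max_steps} steps: {examples_seen}")
print(f"Approximate epochs: {approx_epochs:.2f}")
```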


We also use Unsloth's `train_on_responses_only` function to compute the loss only on the assistant outputs and ignore the user's inputs. This helps increase the accuracy of finetunes!
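
The idea behind this masking can be sketched without any library: the labels are a copy of the input token ids in which every position that belongs to the user's turn is replaced by `-100`, the value that PyTorch's cross-entropy loss ignores by default. The token ids and the helper name `mask_prompt_labels` below are invented for illustration, and the real Unsloth function locates the turn boundaries for you.

```python
IGNORE_INDEX = -100  # label value skipped by PyTorch's cross-entropy loss by default


def mask_prompt_labels(input_ids, response_start):
    """Copy input_ids into labels, masking every position before response_start."""
    return [IGNORE_INDEX if i < response_start else tok
            for i, tok in enumerate(input_ids)]


# Toy token ids (invented): the first 5 tokens stand for the user turn,
# the remaining 5 for the model's answer.
input_ids = [2, 105, 2364, 107, 9, 105, 4368, 107, 72, 106]
labels = mask_prompt_labels(input_ids, response_start=5)
print(labels)
```

Only the unmasked positions contribute to the loss, so the model is trained to produce the answer but not to reproduce the question.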

In [None]:
from unsloth.chat_templates import train_on_responses_only
trainer = train_on_responses_only(
    trainer,
    instruction_part = "<start_of_turn>user\n",
    response_part = "<start_of_turn>model\n",
)

Let's verify that the instruction part is masked. First, let's print row 100 of the training set. Notice how the sample only has a single `<bos>` as expected!

In [None]:
tokenizer.decode(trainer.train_dataset[100]["input_ids"])

Now let's print the masked example; only the answer should remain.


In [None]:
tokenizer.decode([tokenizer.pad_token_id if x == -100 else x for x in trainer.train_dataset[100]["labels"]]).replace(tokenizer.pad_token, " ")

Now, finally, let's train our model!

In [None]:
import wandb
from google.colab import userdata

# Access the stored secret
wandb_api_key = userdata.get('wandb')
wandb.login(key=wandb_api_key)

# Initialize your wandb run and set the experiment name
run = wandb.init(
    project="gemma3-gsm8k-sft",  # Replace with your project name
    #name="experiment name"     # Replace with your desired experiment name
)


In [None]:
trainer_stats = trainer.train()

In [None]:
trainer_stats

<a name="Save"></a>
### Saving, loading finetuned models
To save the final model as LoRA adapters, either use Huggingface's `push_to_hub` for an online save or `save_pretrained` for a local save.

**[NOTE]** This ONLY saves the LoRA adapters, and not the full model. To save to 16bit or GGUF, scroll down!

In [None]:
model.save_pretrained(f"{directory}/outputs/sft_model")  # Saving to Drive
tokenizer.save_pretrained(f"{directory}/outputs/sft_model")
# model.push_to_hub("HF_ACCOUNT/gemma-3", token = "...") # Online saving
# tokenizer.push_to_hub("HF_ACCOUNT/gemma-3", token = "...") # Online saving

Now if you want to load the LoRA adapters we just saved for inference, set `False` to `True`:

In [None]:
#Loading a model
if False:
    from unsloth import FastModel
    model, tokenizer = FastModel.from_pretrained(
        model_name = f"{directory}/outputs/sft_model", # The LoRA adapters we saved above
        max_seq_length = 2048,
        load_in_4bit = True,
    )


In [None]:
#Testing inference (as we had in the beginning)
messages = [{
    "role": "user",
    "content": [{"type" : "text", "text" : "What is the square root of 256?",}]
}]
text = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt = True, # Must add for generation
    tokenize = False, # Return a string; it is tokenized in the next cell
)


In [None]:
from transformers import TextStreamer
_ = model.generate(
    **tokenizer([text], return_tensors = "pt").to("cuda"),
    max_new_tokens = 64, # Increase for longer outputs!
    # Recommended Gemma-3 settings!
    temperature = 1.0, top_p = 0.95, top_k = 64,
    streamer = TextStreamer(tokenizer, skip_prompt = True),
)

### Saving to float16 for vLLM

Unsloth also supports saving to `float16` directly for deployment! The cell below merges the LoRA adapters into the base model and saves the result to `outputs/sft_model_deploy` in our project folder.

In [None]:
model.save_pretrained_merged(f"{directory}/outputs/sft_model_deploy", tokenizer)

If you want to upload / push to your Hugging Face account, set `if False` to `if True` and add your Hugging Face token and upload location!

In [None]:
if False: # Change to True to upload finetune
    model.push_to_hub_merged(
        "HF_ACCOUNT/gemma-3-finetune", tokenizer,
        token = "hf_..."
    )

# Inference and Evaluation

In [None]:
system_prompt

In [None]:
reasoning_start = "<start_working_out>"
reasoning_end   = "<end_working_out>"
solution_start = "<SOLUTION>"
solution_end = "</SOLUTION>"

system_prompt = \
f"""You are given a problem.
Think about the problem and provide your working out.
Place it between {reasoning_start} and {reasoning_end}.
Then, provide your solution between {solution_start}{solution_end}\n\n"""
system_prompt

In [None]:
preamble = "<start_of_turn>user\n"
suffix = "<end_of_turn>\n<start_of_turn>model\n"
prompt = preamble + system_prompt + train_dataset[0]['question'] + suffix
#prompt = train_dataset[0]['question']
prompt

In [None]:
print(prompt)

In [None]:
messages = [
    {"role": "system", "content": system_prompt},
    {"role": "user",   "content": train_dataset[0]['question']},
]

text = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt = True, # Must add for generation
    tokenize = False,
)
print(text)

In [None]:
outputs = generate_output(model, tokenizer, prompt, num_return_sequences=5, max_new_tokens=256)
#outputs = utils.generate_output(model, tokenizer, prompt, num_return_sequences=5, max_new_tokens=256)

outputs

## Evaluation of the trained model


In [None]:
outputs[3]

In [None]:
pattern = "Final Answer:"
match = re.search(rf'{re.escape(pattern)}.*?([+-]?\d*\.?\d+)', outputs[3])
match.group(1)

In [None]:
extract_final_answer(outputs[0],"<SOLUTION>")

Let's write an evaluation function. To make it more versatile, we will pass separate patterns for the dataset and for the model, so that we can later use it to evaluate a reasoning model whose output follows a different pattern.

In [None]:
# for the GRPO trained model
reasoning_start = "<start_working_out>"
reasoning_end   = "<end_working_out>"
solution_start = "<SOLUTION>"
solution_end = "</SOLUTION>"

system_prompt = \
f"""You are given a problem.
Think about the problem and provide your working out.
Place it between {reasoning_start} and {reasoning_end}.
Then, provide your solution between {solution_start}{solution_end}"""
system_prompt

In [None]:
def format_gsm8k_eval_prompt(example, pattern):
    """Format GSM8K examples into a chat format for evaluation"""
    question = example["question"]
    answer = example["answer"]

    ## Create a structured prompt:
    # Using the 'system' prompt
    #prompt = f"""<bos><start_of_turn>system\n{system_prompt}<start_of_turn>user\n{question}\n<end_of_turn><start_of_turn>model"""
    #Without the 'system' prompt, i.e. prepending it to 'user' message (as in Gemma documentation)
    prompt = f"""<bos><start_of_turn>user\n{system_prompt}\n{question}\n<end_of_turn><start_of_turn>model"""
    # Without structuring - just system prompt + plain text
    #prompt = f"""{system_prompt} {question}\n"""


    ground_truth = extract_final_answer(answer, pattern)

    return {"prompt": prompt, 'ground_truth': ground_truth}

In [None]:
print(format_gsm8k_eval_prompt(train_dataset[0], '####'))

In [None]:
formatted_gm8k_test_dataset = test_dataset.map(format_gsm8k_eval_prompt, fn_kwargs={'pattern':'####'})

In [None]:
formatted_gm8k_test_dataset[0]

Our evaluation function will work on batches of data.
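
Two details of batched generation are easy to miss, so here is a small library-free sketch of both: slicing the dataset into batches, and padding the prompts on the left. With a decoder-only model, left padding makes every prompt end in the same column, so the newly generated tokens directly follow the real prompt tokens and can be cut off with a single slice. The helper names `batches` and `left_pad` and the toy token ids are illustrative; in the function below the tokenizer does the padding once `padding_side = 'left'` is set.

```python
def batches(items, batch_size):
    """Yield consecutive slices of `items` with at most `batch_size` elements."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def left_pad(sequences, pad_id):
    """Pad variable-length token lists on the left to a common length."""
    max_len = max(len(seq) for seq in sequences)
    input_ids = [[pad_id] * (max_len - len(seq)) + seq for seq in sequences]
    # The attention mask marks real tokens with 1 and padding with 0.
    attention_mask = [[0] * (max_len - len(seq)) + [1] * len(seq) for seq in sequences]
    return input_ids, attention_mask


toy_prompts = [[11, 12, 13], [21], [31, 32]]  # invented token ids
for batch in batches(toy_prompts, batch_size=2):
    ids, mask = left_pad(batch, pad_id=0)
    print(ids, mask)
```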

In [None]:
from tqdm import tqdm
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
import re
from typing import Dict, Any, Optional

def evaluate_gsm8k_batch(
    model: AutoModelForCausalLM,
    tokenizer: AutoTokenizer,
    test_dataset,
    model_pattern: str,
    max_new_tokens: int,
    temperature: float,
    batch_size: int,
    max_samples: Optional[int] = None,
    top_p: float = 0.95,
    top_k: int = 64,
    verbose: bool = False,
) -> Dict[str, Any]:

    if max_samples:
        test_dataset = test_dataset.select(range(min(max_samples, len(test_dataset))))

    test_dataset = test_dataset.map(format_gsm8k_eval_prompt, fn_kwargs={'pattern':'####'})

    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    model.to(device).eval()

    if tokenizer.pad_token is None:
        tokenizer.pad_token = tokenizer.eos_token
    tokenizer.padding_side = 'left'

    all_results = []

    generation_kwargs = {
        "max_new_tokens": max_new_tokens,
        "temperature": temperature,
        "top_p": top_p,
        "top_k": top_k,
        "do_sample": temperature > 0,
        "pad_token_id": tokenizer.pad_token_id,
    }

    print(f"Evaluating {len(test_dataset)} samples on {device}...")

    with torch.no_grad():
        for i in tqdm(range(0, len(test_dataset), batch_size), desc="Batch Inference"):
            batch_dataset = test_dataset[i:i + batch_size]

            batch_prompts = batch_dataset['prompt']
            batch_gts = batch_dataset['ground_truth']
            # Slicing a Dataset returns a dict of columns, so size the fallback by the number of prompts
            batch_questions = batch_dataset.get('question', ['N/A'] * len(batch_prompts))

            inputs = tokenizer(batch_prompts, padding=True, return_tensors="pt").to(device)
            outputs = model.generate(**inputs, **generation_kwargs)
            solution_texts = tokenizer.batch_decode(outputs[:, inputs['input_ids'].shape[1]:], skip_special_tokens=True)

            for j, solution_str in enumerate(solution_texts):
                pred_answer = extract_final_answer(solution_str, model_pattern)
                # Compare numerically; missing or non-numeric answers count as incorrect
                try:
                    is_correct = float(pred_answer) == float(batch_gts[j])
                except (TypeError, ValueError):
                    is_correct = False

                if verbose:
                    print(f"\n--- Example ---\n"
                          f"Prompt: {batch_prompts[j]}\n"
                          f"Solution: {solution_str}\n"
                          f"Ground Truth: {batch_gts[j]}\n"
                          f"Predicted Answer: {pred_answer}\n"
                          f"Correct: {is_correct}")

                all_results.append({
                    'question': batch_questions[j],
                    'predicted_answer': pred_answer,
                    'ground_truth_answer': batch_gts[j],
                    'solution_str': solution_str,
                    'is_correct': is_correct
                })

    correct_count = sum(r['is_correct'] for r in all_results)
    total_count = len(all_results)
    accuracy = correct_count / total_count if total_count > 0 else 0

    print(f"\nFinal accuracy: {accuracy:.2%}")

    if torch.cuda.is_available():
        torch.cuda.empty_cache()

    return {
        'accuracy': accuracy,
        'correct': correct_count,
        'total': total_count,
        'detailed_results': all_results
    }

In [None]:
print("\nStarting evaluation...")
results = evaluate_gsm8k_batch(
    model=model,
    tokenizer=tokenizer,
    test_dataset=test_dataset,
    #test_pattern='####',  # GSM8K uses #### pattern
    model_pattern='<SOLUTION>',  # Whatever our model uses, e.g. <answer> tags
    max_new_tokens=384,
    temperature=1,
    batch_size=4,
    max_samples=12,
    verbose=True
)

print("\nEvaluation Results:")
print(f"Accuracy: {results['accuracy']:.2%}")
print(f"Correct: {results['correct']}")
print(f"Total: {results['total']}")

# Optionally, print some detailed results
# print("\nFirst 5 detailed results:")
# for i, res in enumerate(results['detailed_results'][:5]):
#     print(f"--- Example {i+1} ---")
#     print(f"Question: {res['question']}")
#     print(f"True Answer: {res['ground_truth_answer']}")
#     print(f"Model Raw Output: {res['solution_str']}")
#     print(f"Model Extracted Answer: {res['predicted_answer']}")
#     print(f"Correct: {res['is_correct']}")


This is the so-called pass@1 evaluation: the model is sampled once per problem. Researchers often also report pass@k accuracy, where the model is queried k times per problem and the problem counts as solved if at least one of the k samples is correct.
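
As a rough sketch of how pass@k is commonly estimated: sample n completions per problem, count the c correct ones, and compute the probability that at least one of k randomly chosen samples is correct, i.e. 1 - C(n-c, k) / C(n, k). The function name and the example numbers below are illustrative and not part of the workshop library.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k from n samples of which c are correct."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    # If fewer than k samples are wrong, every size-k subset contains a correct one
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Hypothetical example: 8 samples for one problem, 2 of them correct
print(pass_at_k(n=8, c=2, k=1))  # 0.25
print(pass_at_k(n=8, c=2, k=4))
```

Averaging this value over all problems gives the benchmark-level pass@k.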

## vLLM Evaluation

This is slow (batching helps), so let's use our friend vLLM to make evaluation blazingly fast.

Note: to switch models in vLLM on Google Colab, restarting the runtime is often the best bet.
While it might seem intuitive to simply release one model from memory and load another, vLLM's current architecture does not offer a straightforward or reliable mechanism for "hot-swapping" models within the same session.

At the heart of this limitation is vLLM's highly optimized memory management. To achieve its impressive inference speeds, vLLM pre-allocates a significant portion of the available GPU memory for the loaded model and its associated key-value (KV) cache. The library does not, as of now, provide a simple function to completely clear a loaded model and its memory footprint to make way for a new one.

In [None]:
import torch
from tqdm import tqdm
from vllm import SamplingParams
def evaluate_gsm8k_vllm(
    model,
    test_dataset,
    test_pattern: str,
    model_pattern: str,
    max_new_tokens,
    temperature,
    max_samples=None) -> dict:
    if max_samples:
        test_dataset = test_dataset.select(range(min(max_samples, len(test_dataset))))

    print(f"Evaluating {len(test_dataset)} samples...")

    print(f"Preprocessing {len(test_dataset)} samples...")
    test_dataset = test_dataset.map(format_gsm8k_eval_prompt, fn_kwargs={'pattern':test_pattern})

    llm = model
    # Use Gemma's validation sampling parameters, these could also be arguments
    # for the entire function
    sampling_params = SamplingParams(
        temperature=temperature,
        top_p=0.95,
        top_k=64,
        max_tokens=max_new_tokens,
    )

    correct = 0
    total = 0
    all_results = []

    # Prepare all prompts for bulk inference
    prompts = [item['prompt'] for item in test_dataset]
    print(len(prompts))

    # Perform bulk inference once using vLLM
    print("\nRunning inference...")
    outputs = llm.generate(prompts, sampling_params)
    print("Inference complete.")

    try:
        # Iterate through the generated outputs and corresponding dataset items
        for output, item in tqdm(zip(outputs, test_dataset), total=len(test_dataset), desc="Processing results"):
            solution_str = output.outputs[0].text

            # Debugging prints (TODO: consider making these optional with a 'verbose' flag as in the previous function)
            print('\n--- Example ---')
            print('Prompt: ' + item['prompt'])
            print('Solution: ' + solution_str)
            #print('Solution (last 200 chars): ' + solution_str[-200:])
            print('GT Full Answer (last 100 chars): ' + item['answer'][-100:])
            print(f"Final ground truth answer: {item['ground_truth']}")

            try:
                # Attempt to extract the predicted final answer
                pred_final_answer = extract_final_answer(solution_str, model_pattern)
                print(f"Predicted answer: {pred_final_answer}")
            except Exception as e:
                print(f"Failed to extract final answer from generated answer: {e}")
                pred_final_answer = None # Set to None if extraction fails

            # Check if the prediction is correct
            if pred_final_answer is not None: # Only compare if an answer was extracted
                try:
                    if float(pred_final_answer) == float(item['ground_truth']):
                        correct += 1
                    else:
                        print(f"Incorrect prediction for question: {item.get('question', 'N/A')}")
                except (TypeError, ValueError): # Handle cases where conversion to float fails
                    print(f"Type conversion error for comparison: pred='{pred_final_answer}', gt='{item['ground_truth']}'")
            else:
                print(f"No valid predicted answer extracted for question: {item.get('question', 'N/A')}")


            all_results.append({
                'question': item['question'],
                'predicted_answer': pred_final_answer,
                'ground_truth_answer': item['ground_truth'],
                'solution_str': solution_str
            })
            total += 1

        accuracy = correct / total if total > 0 else 0
        print(f"\nFinal accuracy: {accuracy:.2%}")

    except Exception as e:
        print(f"Error during evaluation loop: {e}")
        if total > 0:
            print(f"Partial results available for {total} examples")

    finally:
        # Clean up CUDA memory
        if torch.cuda.is_available():
            torch.cuda.empty_cache()
        else:
            print("CUDA not available, skipping empty_cache()")

    accuracy = correct / total if total > 0 else 0
    return {
        'accuracy': accuracy,
        'correct': correct,
        'total': total,
        'detailed_results': all_results
    }

In [None]:
# Evaluate the model
results = evaluate_gsm8k_vllm(
    model=llm,
    test_dataset= test_dataset,
    test_pattern='####',  # GSM8K uses #### pattern
    model_pattern='<SOLUTION>',  # Whatever our model uses, e.g. <answer> tags
    max_new_tokens=768,
    temperature=0.1,
    max_samples=100  # Evaluate on first x samples for testing
)

In [None]:
prompt

In [None]:
utils.generate_output(model, tokenizer, prompt, num_return_sequences=1)

#Reinforcement Learning
Let's take our model's reasoning capabilities to the next level! We will do some RL (Reinforcement Learning), or more specifically - GRPO (Group Relative Policy Optimization).
The following block borrows code from https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/HuggingFace%20Course-Gemma3_(1B)-GRPO.ipynb
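
To make the "group relative" part concrete, here is a minimal sketch of how GRPO turns the rewards of one group of completions (all sampled for the same prompt) into advantages: each reward is centered on the group mean and scaled by the group standard deviation. The helper name, the epsilon value, and the use of the population standard deviation are assumptions for illustration; the actual trainer may differ in such details.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-4):
    """Normalize the rewards of one prompt's completions within their group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # eps avoids division by zero when all rewards in the group are equal
    return [(r - mu) / (sigma + eps) for r in rewards]

# Hypothetical rewards for 4 completions of the same prompt
rewards = [3.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
print([round(a, 3) for a in advantages])
```

Completions that score above their group's average get a positive advantage and are reinforced; when all completions in a group score the same, every advantage is zero and that group provides no learning signal.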

## How to format our dataset?
For different tasks we will need to format our dataset differently.
Luckily, there is this nice TRL documentation page:
https://huggingface.co/docs/trl/main/en/dataset_formats#which-dataset-type-to-use
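
For GRPO, that page points to the prompt-only dataset type, which comes in a standard flavor (plain string) and a conversational flavor (list of messages). As a small sketch with a made-up question, a row in each flavor could look like this. The extra `answer` column is what our reward functions will later receive as a keyword argument; the helper `is_conversational` is hypothetical and only serves to show the difference.

```python
# Standard prompt-only row: the prompt is a plain string
standard_row = {"prompt": "What is 2 + 3?", "answer": "5"}

# Conversational prompt-only row: the prompt is a list of chat messages
conversational_row = {
    "prompt": [{"role": "user", "content": "What is 2 + 3?"}],
    "answer": "5",
}

def is_conversational(row):
    """Return True if the row's prompt is a list of role/content messages."""
    prompt = row["prompt"]
    return isinstance(prompt, list) and all(
        isinstance(m, dict) and {"role", "content"} <= m.keys() for m in prompt
    )

print(is_conversational(standard_row), is_conversational(conversational_row))
```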

In [None]:
def extract_hash_answer(text):
    if "####" not in text: return None
    return text.split("####")[1].strip()
extract_hash_answer(test_dataset[0]["answer"])

We now create a system prompt, which can be customized. We add four extra markers: two delimit the working-out (thinking / reasoning) section and two delimit the final answer:

In [None]:
reasoning_start = "<start_working_out>"
reasoning_end   = "<end_working_out>"
solution_start = "<SOLUTION>"
solution_end = "</SOLUTION>"

system_prompt = \
f"""You are given a problem.
Think about the problem and provide your working out.
Place it between {reasoning_start} and {reasoning_end}.
Then, provide your solution between {solution_start}{solution_end}"""
system_prompt

Let's map the dataset and look at the first row:

In [None]:
rl_dataset = train_dataset.map(lambda x: {  # Train on the training split so the test split stays unseen for evaluation
    "prompt" : [
        {"role": "user", "content": preamble + system_prompt + x["question"] + suffix}, # Note the Gemma's treatment of system prompt
       #{"role": "user",   "content": x["question"] + suffix},
    ],
    "answer": extract_hash_answer(x["answer"]),
})
rl_dataset[0]

We create a regex format to match the reasoning sections and answers:

In [None]:
import re

match_format = re.compile(
    rf"^[\s]{{0,}}"\
    rf"{reasoning_start}.+?{reasoning_end}.*?"\
    rf"{solution_start}(.+?){solution_end}"\
    rf"[\s]{{0,}}$",
    flags = re.MULTILINE | re.DOTALL
)

We verify it works:

In [None]:
match_format.search(
    "<start_working_out>Let me think!<end_working_out>"\
    "<SOLUTION>2</SOLUTION>",
)
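
A slightly broader check, as a self-contained sketch: the pattern is rebuilt here with the same tags so the snippet runs on its own, and the sample strings are made up. It shows that the captured group is the text between the solution tags and that a completion without the reasoning tags does not match.

```python
import re

reasoning_start, reasoning_end = "<start_working_out>", "<end_working_out>"
solution_start, solution_end = "<SOLUTION>", "</SOLUTION>"

pattern = re.compile(
    rf"^[\s]{{0,}}"
    rf"{reasoning_start}.+?{reasoning_end}.*?"
    rf"{solution_start}(.+?){solution_end}"
    rf"[\s]{{0,}}$",
    flags=re.MULTILINE | re.DOTALL,
)

good = "<start_working_out>3 * 4 = 12<end_working_out>\n<SOLUTION>12</SOLUTION>\n"
bad = "The answer is <SOLUTION>12</SOLUTION>"

print(pattern.search(good).group(1))  # 12
print(pattern.search(bad))            # None
```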

We now want to create a reward function to match the format exactly - we reward it with 3 points if it succeeds:

In [None]:
def match_format_exactly(completions, **kwargs):
    scores = []
    for completion in completions:
        score = 0
        response = completion[0]["content"]
        # Match if format is seen exactly!
        if match_format.search(response) is not None: score += 3.0
        scores.append(score)
    return scores

If it fails, we want to reward the model if it at least follows the format partially, by counting each symbol:

In [None]:
def match_format_approximately(completions, **kwargs):
    scores = []
    for completion in completions:
        score = 0
        response = completion[0]["content"]
        # Count how many keywords are seen - we penalize if too many!
        # If we see 1, then plus some points!
        score += 0.5 if response.count(reasoning_start) == 1 else -0.5
        score += 0.5 if response.count(reasoning_end)   == 1 else -0.5
        score += 0.5 if response.count(solution_start)  == 1 else -0.5
        score += 0.5 if response.count(solution_end)    == 1 else -0.5
        scores.append(score)
    return scores
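
To see the arithmetic, here is a stand-alone sketch of the same counting rule applied to two made-up completions. Each of the four tags contributes +0.5 when it appears exactly once and -0.5 otherwise, so scores range from -2.0 to 2.0. The helper name `tag_score` is hypothetical.

```python
TAGS = ["<start_working_out>", "<end_working_out>", "<SOLUTION>", "</SOLUTION>"]

def tag_score(response: str) -> float:
    """+0.5 for each tag that appears exactly once, -0.5 otherwise."""
    return sum(0.5 if response.count(tag) == 1 else -0.5 for tag in TAGS)

complete = "<start_working_out>1 + 1 = 2<end_working_out><SOLUTION>2</SOLUTION>"
partial = "1 + 1 = 2 <SOLUTION>2</SOLUTION>"

print(tag_score(complete))  # 2.0
print(tag_score(partial))   # 0.0
```

The partial completion still earns credit for its solution tags, which gives the model a gradient toward the full format even before it matches it exactly.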

Finally, we want to extract the generated answer, and reward or penalize it! We also reward it based on how close the answer is to the true one via ratios:

In [None]:
def check_answer(prompts, completions, answer, **kwargs):
    question = prompts[0][-1]["content"]
    responses = [completion[0]["content"] for completion in completions]

    extracted_responses = [
        guess.group(1)
        if (guess := match_format.search(r)) is not None else None \
        for r in responses
    ]

    scores = []
    for guess, true_answer in zip(extracted_responses, answer):
        score = 0
        if guess is None:
            scores.append(0)
            continue
        # Correct answer gets 3 points!
        if guess == true_answer:
            score += 3.0
        # Match if spaces are seen
        elif guess.strip() == true_answer.strip():
            score += 1.5
        else:
            # We also reward it if the answer is close via ratios!
            # Ie if the answer is within some range, reward it!
            try:
                ratio = float(guess) / float(true_answer)
                if   ratio >= 0.9 and ratio <= 1.1: score += 0.5
                elif ratio >= 0.8 and ratio <= 1.2: score += 0.25
                else: score -= 1.0 # Penalize wrong answers
            except (ValueError, ZeroDivisionError):
                score -= 0.5 # Penalize answers that are not numbers (or a true answer of 0)
        scores.append(score)
    return scores
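
The tiers are easier to see on a few numbers. The sketch below isolates the numeric part of the rule in a hypothetical helper, `ratio_score`, and applies it to made-up guesses; it mirrors the thresholds above but is not the function the trainer calls.

```python
def ratio_score(guess: str, true_answer: str) -> float:
    """Partial credit based on how close guess / true_answer is to 1."""
    try:
        ratio = float(guess) / float(true_answer)
    except (ValueError, ZeroDivisionError):
        return -0.5  # not a number, or the true answer is 0
    if 0.9 <= ratio <= 1.1:
        return 0.5
    if 0.8 <= ratio <= 1.2:
        return 0.25
    return -1.0  # far from the true answer

for guess in ["95", "85", "40", "about 100"]:
    print(guess, ratio_score(guess, "100"))
```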

Sometimes the answer is not a bare number but a sentence, for example "The solution is $20"; in that case we extract 20.

In [None]:
match_numbers = re.compile(
    rf"{solution_start}.*?(-?[\d\.]{{1,}})",  # Optional leading minus sign so negative answers are captured
    flags = re.MULTILINE | re.DOTALL
)
match_numbers.findall("<SOLUTION>  0.34  </SOLUTION>")

Note: the optional leading minus sign in `match_numbers` lets it capture negative numbers in responses.

In [None]:
def check_numbers(prompts, completions, answer, **kwargs):
    question = prompts[0][-1]["content"]
    responses = [completion[0]["content"] for completion in completions]

    extracted_responses = [
        guess.group(1)
        if (guess := match_numbers.search(r)) is not None else None \
        for r in responses
    ]

    scores = []
    # Debug print for the first sample (useful for spotting extraction failures, e.g. with negative numbers)
    print('*'*20, f"Question:\n{question}", f"\nAnswer:\n{answer[0]}", f"\nResponse:\n{responses[0]}", f"\nExtracted:\n{extracted_responses[0]}")
    for guess, true_answer in zip(extracted_responses, answer):
        if guess is None:
            scores.append(0)
            continue
        # Convert to numbers
        try:
            true_answer = float(true_answer.strip())
            guess       = float(guess.strip())
            scores.append(1.5 if guess == true_answer else 0.0)
        except (AttributeError, ValueError):
            # Missing or non-numeric answer: no reward
            scores.append(0)
    return scores

<a name="Train"></a>
### Train the model

Now set up GRPO Trainer and all configurations!

In [None]:
max_prompt_length = 256 # We should do the analysis of question lengths as we did for answers (and account for system instruction and our custom formatting!)

from trl import GRPOConfig, GRPOTrainer
training_args = GRPOConfig(
    learning_rate = 5e-6,
    adam_beta1 = 0.9,
    adam_beta2 = 0.99,
    weight_decay = 0.1,
    warmup_ratio = 0.1,
    lr_scheduler_type = "cosine",
    optim = "adamw_torch_fused",
    logging_steps = 1,
    per_device_train_batch_size = 1,
    gradient_accumulation_steps = 1, # Increase to 4 for smoother training
    num_generations = 4, # Decrease if out of memory
    max_prompt_length = max_prompt_length,
    max_completion_length = max_seq_length - max_prompt_length,
    # num_train_epochs = 1, # Set to 1 for a full training run
    max_steps = 100,
    save_steps = 50,
    max_grad_norm = 0.1,
    report_to = "wandb",
    output_dir = f"{directory}/outputs/grpo_model",
)
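
The comment on `max_prompt_length` suggests measuring prompt lengths before fixing the limit. One way to do that, as a sketch: count tokens per prompt and check what share would exceed the limit. The helper `prompt_length_report` is hypothetical, and a whitespace split stands in for the real tokenizer so the snippet runs on its own; in the notebook you would pass a callable that returns the token ids of the fully formatted prompt instead.

```python
def prompt_length_report(prompts, tokenize, max_prompt_length):
    """Return the longest prompt length and the share of prompts over the limit."""
    lengths = [len(tokenize(p)) for p in prompts]
    too_long = sum(length > max_prompt_length for length in lengths)
    return {
        "max_length": max(lengths),
        "share_truncated": too_long / len(lengths),
    }

# Made-up prompts; str.split is only a stand-in for a real tokenizer
prompts = ["Tom has 3 apples and buys 2 more.", "How many legs do 4 dogs have?"]
print(prompt_length_report(prompts, str.split, max_prompt_length=7))
```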

In [None]:
# Initialize your wandb run and set the experiment name
run = wandb.init(
    project="gemma3-gsm8k-grpo",  # Replace with your project name
    #name="experiment name"     # Replace with your desired experiment name
)

And let's run the trainer! During training you'll see a table of rewards like the one below. The goal is to see the `reward` column increase!

You might have to wait 150 to 200 steps for any action. You'll probably get 0 reward for the first 100 steps. Please be patient!


**[Marko's edit: Unless you use SFT warmup, then your model will start collecting formatting rewards early!]**

| Step | Training Loss | reward    | reward_std | completion_length | kl       |
|------|---------------|-----------|------------|-------------------|----------|
| 1    | 0.000000      | 0.125000  | 0.000000   | 200.000000        | 0.000000 |
| 2    | 0.000000      | 0.072375  | 0.248112   | 200.000000        | 0.000000 |
| 3    | 0.000000      | -0.079000 | 0.163776   | 182.500000        | 0.000005 |


In [None]:
trainer = GRPOTrainer(
    model = model,
    processing_class = tokenizer,
    reward_funcs = [
        match_format_exactly,
        match_format_approximately,
        check_answer,
        check_numbers,
    ],
    args = training_args,
    train_dataset = rl_dataset,
)
trainer.train()

<a name="Inference"></a>
### Inference
Now let's try the model we just trained!

In [None]:
messages = [
    #{"role": "system", "content": system_prompt},
    {"role": "user",   "content": system_prompt + "What is the sqrt of 101?"},
]

text = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt = True, # Must add for generation
    tokenize = False,
)
text

In [None]:
from transformers import TextStreamer
_ = model.generate(
    **tokenizer(text, return_tensors = "pt").to("cuda"),
    max_new_tokens = 256, # Increase for longer outputs!
    # Recommended Gemma-3 settings!
    temperature = 1.0, top_p = 0.95, top_k = 64,
    streamer = TextStreamer(tokenizer, skip_prompt = True),
)

<a name="Save"></a>
### Saving, loading finetuned models
To save the final model as LoRA adapters, use either Hugging Face's `push_to_hub` for an online save or `save_pretrained` for a local save.

**[NOTE]** This ONLY saves the LoRA adapters, and not the full model. To save a merged 16-bit model, see the next section.

In [None]:
model.save_pretrained(f"{directory}/outputs/grpo_model")  # Local saving
tokenizer.save_pretrained(f"{directory}/outputs/grpo_model")
# model.push_to_hub("HF_ACCOUNT/gemma-3", token = "...") # Online saving
# tokenizer.push_to_hub("HF_ACCOUNT/gemma-3", token = "...") # Online saving

### Saving to float16 for vLLM

Unsloth also supports saving the merged model in `float16` directly for deployment. The cell below writes it to `{directory}/outputs/grpo_model_deploy_100_steps`.

In [None]:
model.save_pretrained_merged(f"{directory}/outputs/grpo_model_deploy_100_steps", tokenizer)

If you want to upload / push to your Hugging Face account, set `if False` to `if True` and add your Hugging Face token and upload location!

In [None]:
if False: # Change to True to upload finetune
    model.push_to_hub_merged(
        "HF_ACCOUNT/gemma-3-finetune", tokenizer,
        token = "hf_..."
    )

#Conclusion and Resources

That's it, congratulations! If everything worked as it should, you now have your own Reasoning model!
There may be some compatibility issues or hiccups when following this notebook, but this is normal in the fast-paced, ever-changing open-source world.

For more notebooks with examples of finetuning various models with Unsloth, visit their website: https://unsloth.ai/blog

Here are some interesting resources to learn more in-depth about reasoning language models:

*   https://magazine.sebastianraschka.com/p/understanding-reasoning-llms
*   https://epichka.com/blog/2025/grpo/

