<a href="https://www.kaggle.com/code/umangkaushik/qwen2-5-3b-openmath-grpo?scriptVersionId=224006891" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

# Training a CoT Model using Qwen2.5, GRPO and Unsloth

## Installing the Dependencies

For Kaggle

In [None]:
# Clear the Kaggle working directory (destructive: deletes everything in the current directory)
!rm -rf *

In [None]:
%%capture 
!pip install -q unsloth vllm 
# Temporarily install a specific TRL nightly version 
!pip install -q git+https://github.com/huggingface/trl.git@e95f9fb74a3c3647b86f251b7e230ec51c64b72b 
!pip install -q triton==3.1.0 
!pip install -qU pynvml 
!pip install -q math-verify[antlr4_13_2] 

For Google Colab

In [None]:
# %%capture
# # Skip restarting message in Colab
# import sys; modules = list(sys.modules.keys())
# for x in modules: sys.modules.pop(x) if "PIL" in x or "google" in x else None

# !pip install unsloth vllm
# !pip install --upgrade pillow

In [1]:
from unsloth import FastLanguageModel, PatchFastRL 
PatchFastRL("GRPO", FastLanguageModel) 

Unsloth: Patching Xformers to fix some performance issues.
🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
🦥 Unsloth Zoo will now patch everything to make training faster!


In [None]:
from unsloth import is_bfloat16_supported
import torch

max_seq_length = 1024
lora_rank = 64

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name             = "Qwen/Qwen2.5-3B-Instruct",
    max_seq_length         = max_seq_length,
    load_in_4bit           = True,
    fast_inference         = True,
    max_lora_rank          = lora_rank,
    gpu_memory_utilization = 0.6
)

model = FastLanguageModel.get_peft_model(
    model,
    r                          = lora_rank,
    target_modules             = ["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"],
    lora_alpha                 = lora_rank,
    use_gradient_checkpointing = "unsloth",
    random_state               = 1337
)
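
As a rough sanity check on the LoRA setup above, the number of trainable parameters can be estimated by hand. The sketch below assumes the usual LoRA layout (two low-rank matrices per targeted projection) and assumes Qwen2.5-3B dimensions that are not stated in this notebook (hidden size 2048, MLP size 11008, key/value width 256, 36 layers); check the model's `config.json` before relying on them.

```python
# Assumed Qwen2.5-3B dimensions (not stated in this notebook; verify in config.json)
HIDDEN = 2048
INTERMEDIATE = 11008
KV_DIM = 256      # assumed: 2 key/value heads x 128 head dim
NUM_LAYERS = 36
RANK = 64         # lora_rank from the cell above

# (in_features, out_features) for each projection listed in target_modules
PROJECTIONS = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}

def lora_params(in_features: int, out_features: int, rank: int) -> int:
    # LoRA adds A (rank x in_features) and B (out_features x rank) per frozen weight
    return rank * (in_features + out_features)

per_layer = sum(lora_params(i, o, RANK) for i, o in PROJECTIONS.values())
total = per_layer * NUM_LAYERS
print(f"{per_layer:,} LoRA parameters per layer, {total:,} in total")
```

Under these assumptions the adapter holds roughly 120M parameters, which is in the same range as the adapter file size reported in the download log at the end of the notebook if the weights are stored in 32-bit precision.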

In [None]:
from datasets import load_dataset, Dataset
import re

SYSTEM_PROMPT = """
Respond in the following format:
<think>
...
</think>
<answer>
...
</answer>
"""

def extract_reasoning_generation(text: str) -> str:
    if "<think>" not in text or "</think>" not in text:
        return ""
    think_part = text.split("<think>")[-1]
    think_part = think_part.split("</think>")[0]
    return think_part.strip()

def get_openr1_math_220k(split: str) -> Dataset:
    data = load_dataset("open-r1/OpenR1-Math-220k", split=split)

    def transform_record(x):
        problem_text = x.get("problem", "")
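        # Use the last generation; fall back to an empty string if none are present
        generations = x.get("generations") or [""]
        reasoning_text = extract_reasoning_generation(generations[-1])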
        final_answer = x.get("solution", "")

        xml_output = f"<think>{reasoning_text}</think>\n<answer>\n{final_answer}\n</answer>"

        return {
            "prompt": [
                {"role": "system", "content": SYSTEM_PROMPT},
                {"role": "user", "content": problem_text},
            ],
            "solution": xml_output
        }

    data = data.map(transform_record).remove_columns("messages")
    return data

dataset = get_openr1_math_220k("train")
print(dataset[0]["prompt"])
print(dataset[0]["solution"])

In [None]:
dataset

In [None]:
def extract_xml_answer(text: str) -> str:
    """
    Extracts the content of the last <answer>...</answer> block, ignoring any <think> block.
    """
    if "<answer>" not in text or "</answer>" not in text:
        return ""
    answer_part = text.split("<answer>")[-1]
    answer_part = answer_part.split("</answer>")[0]
    return answer_part.strip() 

In [None]:
extract_xml_answer(dataset[0]["solution"])
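
To see how the extraction behaves on edge cases without loading the dataset, here is a small self-contained check. The function body is repeated from the cell above so the snippet runs on its own; the sample strings are made up for illustration.

```python
def extract_xml_answer(text: str) -> str:
    # Same logic as the cell above, repeated so this snippet is standalone
    if "<answer>" not in text or "</answer>" not in text:
        return ""
    answer_part = text.split("<answer>")[-1]
    answer_part = answer_part.split("</answer>")[0]
    return answer_part.strip()

# Made-up completion in the format requested by SYSTEM_PROMPT
sample = "<think>2 + 2 = 4</think>\n<answer>\n\\boxed{4}\n</answer>"
print(extract_xml_answer(sample))

# Missing tags yield an empty string rather than an exception
print(repr(extract_xml_answer("no tags here")))

# With several answer blocks, the last one wins
print(extract_xml_answer("<answer>1</answer><answer>2</answer>"))
```

Note that an empty string is also what a malformed completion produces, so downstream code cannot tell "no answer" apart from "empty answer".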

In [None]:
import json
import math
import re
from typing import Dict

from latex2sympy2_extended import NormalizationConfig
from math_verify import LatexExtractionConfig, parse, verify

def accuracy_reward(completions, solution, **kwargs):
    """Reward function that checks if the completion is the same as the ground truth."""
    contents = [completion[-1]["content"] for completion in completions]
    rewards = []
    for content, sol in zip(contents, solution):
        gold_parsed = parse(
            extract_xml_answer(sol),
            extraction_mode="first_match",
            extraction_config=[LatexExtractionConfig(
                normalization_config=NormalizationConfig(
                            nits=False,
                            malformed_operators=False,
                            basic_latex=True,
                            equations=True,
                            boxed="all",
                            units=True,
                        ),
                        # Ensures that boxed is tried first
                        boxed_match_priority=0,
                        try_extract_without_anchor=False,
            )],
        )
        if len(gold_parsed) != 0:
            # We require the answer to be provided in correct latex (no malformed operators)
            answer_parsed = parse(content, extraction_config=[LatexExtractionConfig(
                normalization_config=NormalizationConfig(
                            nits=False,
                            malformed_operators=False,
                            basic_latex=True,
                            equations=True,
                            boxed="all",
                            units=True,
                        ),
                        # Ensures that boxed is tried first
                        boxed_match_priority=0,
                        try_extract_without_anchor=False,
            )], extraction_mode="first_match")
            # Reward 1 if the content is the same as the ground truth, 0 otherwise 
            reward = float(verify(answer_parsed, gold_parsed))
        else:
            # If the gold solution is not parseable, we reward 1 to skip this example
            reward = 1.0
            # print("accuracy_reward: Failed to parse gold solution: ", sol)
        rewards.append(reward)

    return rewards

def format_reward(completions, **kwargs):
    """Reward function that checks if the reasoning process is enclosed within <think> and </think> tags, while the final answer is enclosed within <answer> and </answer> tags."""
    pattern = r"^<think>.*?</think>\s*<answer>.*?</answer>$"
    completion_contents = [completion[-1]["content"] for completion in completions]
    matches = [re.match(pattern, content, re.DOTALL | re.MULTILINE) for content in completion_contents]
    return [1.0 if match else 0.0 for match in matches]

def reasoning_steps_reward(completions, **kwargs):
    r"""Reward function that checks for clear step-by-step reasoning.
    Regex pattern:
        Step \d+: - matches "Step 1:", "Step 2:", etc.
        ^\d+\. - matches numbered lists like "1.", "2.", etc. at start of line
        \n- - matches bullet points with hyphens
        \n\* - matches bullet points with asterisks
        First,|Second,|Next,|Finally, - matches transition words
    """
    pattern = r"(Step \d+:|^\d+\.|\n-|\n\*|First,|Second,|Next,|Finally,)"
    completion_contents = [completion[-1]["content"] for completion in completions]
    # re.MULTILINE lets ^ match at the start of every line, as the docstring describes
    matches = [len(re.findall(pattern, content, re.MULTILINE)) for content in completion_contents]

    # Magic number 3: three or more steps earn the full reward, fewer earn partial credit
    return [min(1.0, count / 3) for count in matches]

def len_reward(completions: list[Dict[str, str]], solution: list[str], **kwargs) -> list[float]:
    """Compute length-based rewards to discourage overthinking and promote token efficiency.

    Taken from the Kimi 1.5 tech report: https://arxiv.org/abs/2501.12599

    Args:
        completions: List of model completions
        solution: List of ground truth solutions (named after the dataset's "solution" column)

    Returns:
        List of rewards where:
        - For correct answers: reward = 0.5 - (len - min_len)/(max_len - min_len)
        - For incorrect answers: reward = min(0, 0.5 - (len - min_len)/(max_len - min_len))
    """
    contents = [completion[-1]["content"] for completion in completions]
    # First check correctness of answers
    correctness = []
    for content, sol in zip(contents, solution):
        gold_parsed = parse(extract_xml_answer(sol), extraction_mode="first_match", extraction_config=[LatexExtractionConfig(
            normalization_config=NormalizationConfig(
                            nits=False,
                            malformed_operators=False,
                            basic_latex=True,
                            equations=True,
                            boxed="all",
                            units=True,
                        ),
                        # Ensures that boxed is tried first
                        boxed_match_priority=0,
                        try_extract_without_anchor=False,
        )])
        if len(gold_parsed) == 0:
            # Skip unparseable examples
            correctness.append(True)  # Treat as correct to avoid penalizing
            # print("len_reward: Failed to parse gold solution: ", sol)
            continue

        answer_parsed = parse(content, extraction_config=[LatexExtractionConfig()], extraction_mode="first_match",)
        correctness.append(verify(answer_parsed, gold_parsed))

    # Calculate lengths
    lengths = [len(content) for content in contents]
    min_len = min(lengths)
    max_len = max(lengths)

    # If all responses have the same length, return zero rewards
    if max_len == min_len:
        return [0.0] * len(completions)

    rewards = []
    for length, is_correct in zip(lengths, correctness):
        lambda_val = 0.5 - (length - min_len) / (max_len - min_len)

        if is_correct:
            reward = lambda_val
        else:
            reward = min(0, lambda_val)

        rewards.append(float(reward))

    return rewards

def get_cosine_scaled_reward(
    min_value_wrong: float = -1.0,
    max_value_wrong: float = -0.5,
    min_value_correct: float = 0.5,
    max_value_correct: float = 1.0,
    max_len: int = 1000,
):
    def cosine_scaled_reward(completions, solution, **kwargs):
        """Reward function that scales based on completion length using a cosine schedule.

        Shorter correct solutions are rewarded more than longer ones.
        Longer incorrect solutions are penalized less than shorter ones.

        Args:
            completions: List of model completions
            solution: List of ground truth solutions

        This function is parameterized by the following arguments:
            min_value_wrong: Minimum reward for wrong answers
            max_value_wrong: Maximum reward for wrong answers
            min_value_correct: Minimum reward for correct answers
            max_value_correct: Maximum reward for correct answers
            max_len: Maximum length for scaling
        """
        contents = [completion[-1]["content"] for completion in completions]
        rewards = []

        for content, sol in zip(contents, solution):
            gold_parsed = parse(extract_xml_answer(sol), extraction_mode="first_match", extraction_config=[LatexExtractionConfig(
                normalization_config=NormalizationConfig(
                            nits=False,
                            malformed_operators=False,
                            basic_latex=True,
                            equations=True,
                            boxed="all",
                            units=True,
                        ),
                        # Ensures that boxed is tried first
                        boxed_match_priority=0,
                        try_extract_without_anchor=False,
            )])
            if len(gold_parsed) == 0:
                rewards.append(1.0)  # Skip unparseable examples
                # print("cosine_scaled_reward: Failed to parse gold solution: ", sol)
                continue

            answer_parsed = parse(content, extraction_config=[LatexExtractionConfig(
                normalization_config=NormalizationConfig(
                            nits=False,
                            malformed_operators=False,
                            basic_latex=True,
                            equations=True,
                            boxed="all",
                            units=True,
                        ),
                        # Ensures that boxed is tried first
                        boxed_match_priority=0,
                        try_extract_without_anchor=False,
            )], extraction_mode="first_match",)

            is_correct = verify(answer_parsed, gold_parsed)
            gen_len = len(content)

            # Apply cosine scaling based on length
            progress = gen_len / max_len
            cosine = math.cos(progress * math.pi)

            if is_correct:
                min_value = min_value_correct
                max_value = max_value_correct
            else:
                # Swap min/max for incorrect answers
                min_value = max_value_wrong
                max_value = min_value_wrong

            reward = min_value + 0.5 * (max_value - min_value) * (1.0 + cosine)
            rewards.append(float(reward))

        return rewards

    return cosine_scaled_reward

def get_repetition_penalty_reward(ngram_size: int, max_penalty: float):
    """
    Computes N-gram repetition penalty as described in Appendix C.2 of https://arxiv.org/abs/2502.03373.
    Reference implementation from: https://github.com/eddycmu/demystify-long-cot/blob/release/openrlhf/openrlhf/reward/repetition.py

    Args:
    ngram_size: size of the n-grams
    max_penalty: Maximum (negative) penalty for wrong answers
    """
    if max_penalty > 0:
        raise ValueError(f"max_penalty {max_penalty} should not be positive")

    def zipngram(text: str, ngram_size: int):
        words = text.lower().split()
        return zip(*[words[i:] for i in range(ngram_size)])

    def repetition_penalty_reward(completions, **kwargs) -> list[float]:
        """
        Reward function that penalizes repetitions.
        Reference implementation: https://github.com/eddycmu/demystify-long-cot/blob/release/openrlhf/openrlhf/reward/repetition.py

        Args:
            completions: List of model completions
        """

        contents = [completion[-1]["content"] for completion in completions]
        rewards = []
        for completion in contents:
            if completion == "":
                rewards.append(0.0)
                continue
            if len(completion.split()) < ngram_size:
                rewards.append(0.0)
                continue

            ngrams = set()
            total = 0
            for ng in zipngram(completion, ngram_size):
                ngrams.add(ng)
                total += 1

            scaling = 1 - len(ngrams) / total
            reward = scaling * max_penalty
            rewards.append(reward)
        return rewards

    return repetition_penalty_reward
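
The cosine-scaled reward is easier to reason about with a few concrete numbers. The sketch below isolates only the length-scaling formula from `cosine_scaled_reward`, using the same default bounds; the helper name `cosine_reward` and the sample lengths are made up for illustration, and correctness is passed in as a flag instead of being verified.

```python
import math

def cosine_reward(length: int, is_correct: bool, max_len: int = 1000,
                  min_value_wrong: float = -1.0, max_value_wrong: float = -0.5,
                  min_value_correct: float = 0.5, max_value_correct: float = 1.0) -> float:
    # Same formula as in cosine_scaled_reward above, without the answer parsing
    progress = length / max_len
    cosine = math.cos(progress * math.pi)
    if is_correct:
        low, high = min_value_correct, max_value_correct
    else:
        # Bounds are swapped for wrong answers, so short wrong answers score lowest
        low, high = max_value_wrong, min_value_wrong
    return low + 0.5 * (high - low) * (1.0 + cosine)

for length in (0, 250, 500, 750, 1000):
    print(length,
          round(cosine_reward(length, True), 3),
          round(cosine_reward(length, False), 3))
```

With these defaults a correct answer moves from 1.0 down to 0.5 as its length approaches `max_len`, while a wrong answer moves from -1.0 up to -0.5. Lengths are counted in characters, since the function uses `len(content)`.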

In [None]:
from kaggle_secrets import UserSecretsClient 
user_secrets = UserSecretsClient() 
hf_token = user_secrets.get_secret("HF_TOKEN") 

import os
os.environ["WANDB_API_KEY"] = user_secrets.get_secret("wandb_key")

from huggingface_hub import login
login(token=hf_token)

In [None]:
from trl import GRPOConfig, GRPOTrainer

training_args = GRPOConfig(
    use_vllm = True,
    learning_rate = 5e-6,
    adam_beta1 = 0.9,
    adam_beta2 = 0.99,
    weight_decay = 0.1,
    warmup_ratio = 0.1,
    lr_scheduler_type = "cosine",
    optim = "adamw_8bit",
    logging_steps = 1,
    bf16 = is_bfloat16_supported(),
    fp16 = not is_bfloat16_supported(),
    per_device_train_batch_size = 1,
    gradient_accumulation_steps = 1,
    num_generations = 5,
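    # Note: max_seq_length was set to 1024 at model load, which may cap the
    # effective prompt + completion length below the values requested here.
    max_prompt_length = 2048,
    max_completion_length = 2048,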
    num_train_epochs = 1,
    max_steps = 250,
    save_steps = 250,
    max_grad_norm = 0.1,
    report_to = "wandb",
    output_dir = "qwen2.5-3B-openr1-math",
    remove_unused_columns=False
) 
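
The combination of `warmup_ratio = 0.1`, `lr_scheduler_type = "cosine"`, and `max_steps = 250` implies a particular learning-rate curve. The following is a minimal sketch of that curve, assuming a linear warmup followed by a half-cosine decay to zero; the helper `lr_at` is hypothetical and not part of TRL or Transformers.

```python
import math

def lr_at(step: int, base_lr: float = 5e-6, max_steps: int = 250,
          warmup_ratio: float = 0.1) -> float:
    # Assumed: warmup length is the step budget times warmup_ratio, rounded up
    warmup_steps = math.ceil(max_steps * warmup_ratio)
    if step < warmup_steps:
        # Linear ramp from 0 to base_lr
        return base_lr * step / warmup_steps
    # Half-cosine decay from base_lr to 0 over the remaining steps
    progress = (step - warmup_steps) / (max_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))

for step in (0, 10, 25, 100, 250):
    print(step, f"{lr_at(step):.2e}")
```

Under these assumptions the learning rate peaks at 5e-6 after 25 steps, so a tenth of this short run is spent warming up.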

In [None]:
trainer = GRPOTrainer(
    model = model,
    processing_class = tokenizer,
    reward_funcs = [
        accuracy_reward,
        format_reward,
        reasoning_steps_reward,
        get_cosine_scaled_reward(
            min_value_wrong=-1.0,
            max_value_wrong=-0.5,
            min_value_correct=0.5,
            max_value_correct=1.0,
            max_len=1000,
        ),
        get_repetition_penalty_reward(ngram_size=3, max_penalty=-0.5),
    ],
    args = training_args,
    train_dataset = dataset
) 
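
With several reward functions and `num_generations = 5`, it helps to see how the per-function scores might turn into a training signal. The sketch below assumes the rewards are summed with equal weight and then normalized within the group of completions for one prompt (subtract the group mean, divide by the group's sample standard deviation plus a small epsilon). This is an illustration of the group-relative idea behind GRPO, not a copy of the TRL implementation; the helper `group_advantages` and the reward values are made up.

```python
import statistics

def group_advantages(reward_rows: list[list[float]], eps: float = 1e-4) -> list[float]:
    """reward_rows[i] holds the per-function rewards for completion i of one prompt."""
    # Assumed: equal-weight sum across reward functions
    totals = [sum(row) for row in reward_rows]
    mean = statistics.mean(totals)
    std = statistics.stdev(totals)  # sample standard deviation of the group
    return [(t - mean) / (std + eps) for t in totals]

# Made-up rewards for 5 completions of a single prompt.
# Columns: accuracy, format, reasoning steps, cosine-scaled, repetition penalty.
rewards = [
    [1.0, 1.0, 1.00, 0.9, 0.0],
    [1.0, 0.0, 0.67, 0.6, -0.1],
    [0.0, 1.0, 0.33, -0.8, 0.0],
    [0.0, 0.0, 0.00, -0.6, -0.3],
    [1.0, 1.0, 1.00, 0.7, 0.0],
]
for adv in group_advantages(rewards):
    print(f"{adv:+.2f}")
```

Because the advantages are relative to the group, a prompt whose completions all score the same contributes almost no signal, which is one reason to sample several generations per prompt.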

In [None]:
trainer.train() 

model.push_to_hub("ubermenchh/Qwen2.5-3B-openr1-math")
tokenizer.push_to_hub("ubermenchh/Qwen2.5-3B-openr1-math")

In [None]:
# Raw string so LaTeX commands such as \frac are not read as escape sequences (\f is a form feed)
problem = r"""Given that $a > b > 1$ and $θ \in (0, \frac{π}{2})$, determine the correct option among the following: A: $a^{\sin θ} < b^{\sin θ}$ B: $ab^{\sin θ} < ba^{\sin θ}$ C: $a\log _{b}\sin θ < b\log _{a}\sin θ$ D: $\log _{a}\sin θ < \log _{b}\sin θ$"""
text = tokenizer.apply_chat_template([
    {"role": "system", "content": SYSTEM_PROMPT},
    {"role" : "user", "content" : problem},
], tokenize = False, add_generation_prompt = True)

from vllm import SamplingParams
sampling_params = SamplingParams(
    temperature = 0.8,
    top_p = 0.95,
    max_tokens = 1024,
)
output = model.fast_generate(
    [text],
    sampling_params = sampling_params,
    lora_request = None,
)[0].outputs[0].text

print(output)

In [None]:
model.save_lora("qwen2.5-3B-openr1-math-lora")

In [None]:
# Raw string so LaTeX commands such as \frac are not read as escape sequences (\f is a form feed)
problem = r"""Given that $a > b > 1$ and $θ \in (0, \frac{π}{2})$, determine the correct option among the following: A: $a^{\sin θ} < b^{\sin θ}$ B: $ab^{\sin θ} < ba^{\sin θ}$ C: $a\log _{b}\sin θ < b\log _{a}\sin θ$ D: $\log _{a}\sin θ < \log _{b}\sin θ$"""
text = tokenizer.apply_chat_template([
    {"role": "system", "content": SYSTEM_PROMPT},
    {"role" : "user", "content" : problem},
], tokenize = False, add_generation_prompt = True)

from vllm import SamplingParams
sampling_params = SamplingParams(
    temperature = 0.8,
    top_p = 0.95,
    max_tokens = 1024,
)
output = model.fast_generate(
    [text],
    sampling_params = sampling_params,
    lora_request = model.load_lora("qwen2.5-3B-openr1-math-lora"),
)[0].outputs[0].text

print(output)

In [None]:
model.push_to_hub_merged("ubermenchh/Qwen2.5-3B-open-r1-math", tokenizer)

In [None]:
model.push_to_hub_merged("ubermenchh/Qwen2.5-3B-open-r1-math-lora", tokenizer, save_method="lora")

In [3]:
import torch
from unsloth import FastLanguageModel
from transformers import TextStreamer

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "ubermenchh/Qwen2.5-3B-open-r1-math-lora",
    max_seq_length = 1024,
    dtype = torch.bfloat16,
    load_in_4bit = True,
)
FastLanguageModel.for_inference(model) 

SYSTEM_PROMPT = """
Respond in the following format:
<think>
...
</think>
<answer>
...
</answer>
"""

# Raw string keeps the LaTeX backslashes literal and avoids invalid-escape warnings
test_question = r"""
Let $z \in \mathbf{C}$, satisfying the condition $a z^{n}+b \mathrm{i} z^{n-1}+b \mathrm{i} z-a=0, a, b \in \mathbf{R}, m \in$ $\mathbf{N}$, find $|z|$.
"""

messages = [
    {"role": "system", "content": SYSTEM_PROMPT},
    {"role": "user", "content": test_question},
]
input_ids = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt = True,
    return_tensors = "pt",
).to("cuda")

text_streamer = TextStreamer(tokenizer, skip_prompt = True)
_ = model.generate(input_ids, streamer = text_streamer, max_new_tokens = 2048, pad_token_id = tokenizer.eos_token_id)

==((====))==  Unsloth 2025.2.15: Fast Qwen2 patching. Transformers: 4.49.0.
   \\   /|    GPU: Tesla T4. Max memory: 14.741 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.5.1+cu121. CUDA: 7.5. CUDA Toolkit: 12.1. Triton: 3.1.0
\        /    Bfloat16 = FALSE. FA [Xformers = 0.0.28.post3. FA2 = False]
 "-____-"     Free Apache license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


Device does not support bfloat16. Will change to float16.


Unsloth 2025.2.15 patched 36 layers with 36 QKV layers, 36 O layers and 36 MLP layers.


<think>
To solve for \( |z| \) given the equation \( a z^n + b i z^{n-1} + b i z - a = 0 \), we start by factoring out \( z \) from the terms involving \( z \):

\[ a z (z^{n-1} + i z^{n-2} + i z - \frac{a}{z}) = 0. \]

This equation gives us two potential solutions: either \( z = 0 \) or the polynomial inside the parentheses must be zero. However, if \( z = 0 \), then substituting into the original equation would give \( a \cdot 0 + b i \cdot 0 + b i \cdot 0 - a = -a \neq 0 \) unless \( a = 0 \). Since \( a \) and \( b \) are real numbers and the problem specifies that \( z \in \mathbb{C} \), we can assume \( a \neq 0 \). Therefore, we focus on solving:

\[ z^{n-1} + i z^{n-2} + i z - \frac{a}{z} = 0. \]

Multiplying through by \( z \) to clear the fraction, we get:

\[ z^n + i z^2 + i z^2 - a = 0 \]
\[ z^n + i z^2 - a = 0. \]

We need to find the magnitude of \( z \). Let \( z = re^{i\theta} \), where \( r = |z| \) and \( \theta \) is the argument of \( z \). Then, \( z^n = r^n e^{in