## Build your own model

In [2]:
!pip install unsloth vllm
!pip install --upgrade pillow
!pip install git+https://github.com/huggingface/trl.git@e95f9fb74a3c3647b86f251b7e230ec51c64b72b

Collecting git+https://github.com/huggingface/trl.git@e95f9fb74a3c3647b86f251b7e230ec51c64b72b
  Cloning https://github.com/huggingface/trl.git (to revision e95f9fb74a3c3647b86f251b7e230ec51c64b72b) to /var/tmp/pip-req-build-skumbhi9
  Running command git clone --filter=blob:none --quiet https://github.com/huggingface/trl.git /var/tmp/pip-req-build-skumbhi9
  Running command git rev-parse -q --verify 'sha^e95f9fb74a3c3647b86f251b7e230ec51c64b72b'
  Running command git fetch -q https://github.com/huggingface/trl.git e95f9fb74a3c3647b86f251b7e230ec51c64b72b
  Running command git checkout -q e95f9fb74a3c3647b86f251b7e230ec51c64b72b
  Resolved https://github.com/huggingface/trl.git to commit e95f9fb74a3c3647b86f251b7e230ec51c64b72b
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done


In [3]:
from unsloth import FastLanguageModel, PatchFastRL
PatchFastRL("GRPO", FastLanguageModel)

🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
🦥 Unsloth Zoo will now patch everything to make training faster!
INFO 02-17 21:52:13 __init__.py:190] Automatically detected platform cuda.


In [4]:
from unsloth import is_bfloat16_supported
import torch
max_seq_length = 512
lora_rank = 8

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "meta-llama/meta-Llama-3.1-8B-Instruct",
    max_seq_length = max_seq_length,
    load_in_4bit = True,
    fast_inference = True,
    max_lora_rank = lora_rank,
    gpu_memory_utilization = 0.6,
)

model = FastLanguageModel.get_peft_model(
    model,
    r = lora_rank,
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",],
    lora_alpha = lora_rank,
    use_gradient_checkpointing = "unsloth",
)

==((====))==  Unsloth 2025.2.12: Fast Llama patching. Transformers: 4.48.3.
   \\   /|    GPU: NVIDIA A100-SXM4-40GB. Max memory: 39.381 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.5.1+cu124. CUDA: 8.0. CUDA Toolkit: 12.4. Triton: 3.1.0
\        /    Bfloat16 = TRUE. FA [Xformers = 0.0.28.post3. FA2 = False]
 "-____-"     Free Apache license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!
Unsloth: vLLM loading unsloth/meta-llama-3.1-8b-instruct-unsloth-bnb-4bit with actual GPU utilization = 59.37%
Unsloth: Your GPU has CUDA compute capability 8.0 with VRAM = 39.38 GB.
Unsloth: Using conservativeness = 1.0. Chunked prefill tokens = 512. Num Sequences = 288.
Unsloth: vLLM's KV Cache can use up to 17.32 GB. Also swap space = 6 GB.
INFO 02-17 21:52:32 config.py:542] This model supports multiple tasks: {'generate', 'embed', 'reward', 'score', 'classify'}. Defaulting to 'generate'.
Unsloth: vLLM Bitsandbytes config us



INFO 02-17 21:52:34 loader.py:1102] Loading weights with BitsAndBytes quantization.  May take a while ...
INFO 02-17 21:52:34 weight_utils.py:252] Using model weights format ['*.safetensors']


Loading safetensors checkpoint shards:   0% Completed | 0/1 [00:00<?, ?it/s]


INFO 02-17 21:52:50 model_runner.py:1115] Loading model weights took 5.5976 GB
INFO 02-17 21:52:50 punica_selector.py:18] Using PunicaWrapperGPU.
INFO 02-17 21:52:53 worker.py:267] Memory profiling takes 2.59 seconds
INFO 02-17 21:52:53 worker.py:267] the current vLLM instance can use total_gpu_memory (39.38GiB) x gpu_memory_utilization (0.59) = 23.38GiB
INFO 02-17 21:52:53 worker.py:267] model weights take 5.60GiB; non_torch_memory takes 0.09GiB; PyTorch activation peak memory takes 1.33GiB; the rest of the memory reserved for KV Cache is 16.36GiB.
INFO 02-17 21:52:53 executor_base.py:110] # CUDA blocks: 8374, # CPU blocks: 3072
INFO 02-17 21:52:53 executor_base.py:115] Maximum concurrency for 512 tokens per request: 261.69x
INFO 02-17 21:52:57 model_runner.py:1434] Capturing cudagraphs for decoding. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI. If out-of-memory error o

Capturing CUDA graph shapes: 100%|██████████| 39/39 [00:36<00:00,  1.07it/s]

INFO 02-17 21:53:34 model_runner.py:1562] Graph capturing finished in 36 secs, took 0.89 GiB
INFO 02-17 21:53:34 llm_engine.py:431] init engine (profile, create kv cache, warmup model) took 43.34 seconds



Not an error, but Unsloth cannot patch MLP layers with our manual autograd engine since either LoRA adapters
are not enabled or a bias term (like in Qwen) is used.
Unsloth 2025.2.12 patched 32 layers with 32 QKV layers, 32 O layers and 0 MLP layers.
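
The LoRA setup above attaches rank-8 adapters to the four attention projections only. As a rough sanity check, the adapter size can be tallied by hand. The layer shapes below are assumptions taken from the published Llama 3.1 8B architecture (they are not stated in this notebook); under those assumptions the total matches the 6,815,744 trainable parameters the trainer banner reports further down.

```python
# Assumed Llama 3.1 8B shapes (not stated in this notebook)
HIDDEN = 4096      # model width
KV_DIM = 1024      # 8 key/value heads x 128 dims (grouped-query attention)
NUM_LAYERS = 32
RANK = 8           # lora_rank from the cell above

# (in_features, out_features) for each targeted projection
shapes = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
}

# Each adapted module adds two matrices: A (rank x in) and B (out x rank)
per_layer = sum(RANK * (fan_in + fan_out) for fan_in, fan_out in shapes.values())
total = per_layer * NUM_LAYERS

print(f"{per_layer:,} per layer, {total:,} in total")
# 212,992 per layer, 6,815,744 in total
```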


In [5]:
from datasets import load_dataset
import re

# Load and prep dataset
SYSTEM_PROMPT = """
Respond in the following format:
<reasoning>
...
</reasoning>
<answer>
...
</answer>
"""


def extract_hash_answer(text: str) -> str | None:
    """Return the final answer that follows GSM8K's '####' marker, or None if absent."""
    if "####" not in text:
        return None
    return text.split("####")[1].strip()

# Build chat-style prompts (system + user) and attach the gold answer
def create_dataset(split = "train"):
    data = load_dataset('openai/gsm8k', 'main')[split] # type: ignore
    data = data.map(lambda x: { # type: ignore
        'prompt': [
            {'role': 'system', 'content': SYSTEM_PROMPT},
            {'role': 'user', 'content': x['question']}
        ],
        'answer': extract_hash_answer(x['answer'])
    }) # type: ignore
    return data # type: ignore

dataset = create_dataset()
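
To make the answer extraction concrete, here is a small standalone sketch. It repeats the logic of `extract_hash_answer` so it runs on its own, and the sample string is a hypothetical answer written in the GSM8K style (worked steps followed by `#### <result>`), not a row taken from the dataset.

```python
from typing import Optional

def extract_hash_answer(text: str) -> Optional[str]:
    # Same logic as the helper above: keep whatever follows the '####' marker
    if "####" not in text:
        return None
    return text.split("####")[1].strip()

# Hypothetical GSM8K-style answer string
sample_answer = "10 * 40 = 400\n2 * 38 = 76\n400 + 76 = 476\n#### 476"

print(extract_hash_answer(sample_answer))     # 476
print(extract_hash_answer("no marker here"))  # None
```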

In [6]:
def extract_xml_answer(text: str) -> str:
    answer = text.split("<answer>")[-1]
    answer = answer.split("</answer>")[0]
    return answer.strip()

# Reward functions
def correctness_reward_func(prompts, completions, answer, **kwargs) -> list[float]:
    responses = [completion[0]['content'] for completion in completions]
    q = prompts[0][-1]['content']
    extracted_responses = [extract_xml_answer(r) for r in responses]
    print('-'*20, f"Question:\n{q}", f"\nAnswer:\n{answer[0]}", f"\nResponse:\n{responses[0]}", f"\nExtracted:\n{extracted_responses[0]}")
    return [2.0 if r == a else 0.0 for r, a in zip(extracted_responses, answer)]

def int_reward_func(completions, **kwargs) -> list[float]:
    responses = [completion[0]['content'] for completion in completions]
    extracted_responses = [extract_xml_answer(r) for r in responses]
    return [0.5 if r.isdigit() else 0.0 for r in extracted_responses]

def strict_format_reward_func(completions, **kwargs) -> list[float]:
    """Reward completions that follow the exact line-by-line XML layout."""
    pattern = r"^<reasoning>\n.*?\n</reasoning>\n<answer>\n.*?\n</answer>\n$"
    responses = [completion[0]["content"] for completion in completions]
    # re.DOTALL lets "." span the newlines inside multi-line reasoning
    matches = [re.match(pattern, r, flags=re.DOTALL) for r in responses]
    return [0.5 if match else 0.0 for match in matches]

def soft_format_reward_func(completions, **kwargs) -> list[float]:
    """Reward completions that contain both tag pairs in order, without requiring the exact newline layout."""
    pattern = r"<reasoning>.*?</reasoning>\s*<answer>.*?</answer>"
    responses = [completion[0]["content"] for completion in completions]
    # Without re.DOTALL, "." stops at a newline, so tagged blocks that span lines would not match
    matches = [re.match(pattern, r, flags=re.DOTALL) for r in responses]
    return [0.5 if match else 0.0 for match in matches]

def count_xml(text) -> float:
    count = 0.0
    if text.count("<reasoning>\n") == 1:
        count += 0.125
    if text.count("\n</reasoning>\n") == 1:
        count += 0.125
    if text.count("\n<answer>\n") == 1:
        count += 0.125
        count -= len(text.split("\n</answer>\n")[-1])*0.001
    if text.count("\n</answer>") == 1:
        count += 0.125
        count -= (len(text.split("\n</answer>")[-1]) - 1)*0.001
    return count

def xmlcount_reward_func(completions, **kwargs) -> list[float]:
    contents = [completion[0]["content"] for completion in completions]
    return [count_xml(c) for c in contents]
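
The following standalone sketch shows how the two helpers behave on a well-formed completion and on an untagged one. The helper bodies are repeated from the cell above so the snippet runs on its own; both sample strings are hypothetical.

```python
def extract_xml_answer(text: str) -> str:
    answer = text.split("<answer>")[-1]
    answer = answer.split("</answer>")[0]
    return answer.strip()

def count_xml(text) -> float:
    count = 0.0
    if text.count("<reasoning>\n") == 1:
        count += 0.125
    if text.count("\n</reasoning>\n") == 1:
        count += 0.125
    if text.count("\n<answer>\n") == 1:
        count += 0.125
        count -= len(text.split("\n</answer>\n")[-1])*0.001
    if text.count("\n</answer>") == 1:
        count += 0.125
        count -= (len(text.split("\n</answer>")[-1]) - 1)*0.001
    return count

# Hypothetical completions: one in the requested layout, one without any tags
well_formed = "<reasoning>\n10 * 40 + 2 * 38 = 476\n</reasoning>\n<answer>\n476\n</answer>\n"
untagged = "The total is 476."

print(extract_xml_answer(well_formed))  # 476
print(extract_xml_answer(untagged))     # The total is 476.
print(count_xml(well_formed))           # 0.5
print(count_xml(untagged))              # 0.0
```

When the tags are missing, `extract_xml_answer` falls back to returning the whole response, so the correctness check compares the full text against the gold answer and fails. Summing the maxima of the five reward functions gives 0.5 + 0.5 + 0.5 + 0.5 + 2.0 = 4.0 for a correct, perfectly formatted integer answer.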

In [7]:
from trl import GRPOConfig

training_args = GRPOConfig(
    use_vllm = True,
    learning_rate = 5e-6,
    adam_beta1 = 0.9,
    adam_beta2 = 0.99,
    weight_decay = 0.1,
    warmup_ratio = 0.1,
    lr_scheduler_type = "cosine",
    optim = "paged_adamw_8bit",
    logging_steps = 1,
    bf16 = is_bfloat16_supported(),
    fp16 = not is_bfloat16_supported(),
    per_device_train_batch_size = 1,
    gradient_accumulation_steps = 1,
    num_generations = 6,
    max_prompt_length = 256,
    max_completion_length = 200,
    max_steps = 250,
    save_steps = 250,
    max_grad_norm = 0.1,
    report_to = "none",
    output_dir = "outputs",
)
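
With `num_generations = 6`, each prompt is sampled six times and GRPO scores every completion relative to the others in its group. The sketch below illustrates that group normalisation idea only; it is not TRL's code, and the function name, the choice of population standard deviation, and the epsilon value are assumptions made for illustration.

```python
from statistics import mean, pstdev

def group_advantages(rewards, eps=1e-4):
    """Illustrative group-relative advantage: (reward - group mean) / (group std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # assumption: population std; a library may use another estimator
    return [(r - mu) / (sigma + eps) for r in rewards]

# Hypothetical reward groups for one prompt (six completions each)
all_wrong = [0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
mixed = [2.5, 0.0, 0.0, 0.5, 0.0, 0.0]

print(group_advantages(all_wrong))  # all zeros: no learning signal from this group
print([round(a, 3) for a in group_advantages(mixed)])
```

Under this view, a step where every completion receives the same reward contributes no gradient signal, which is why varied, partially graded rewards (format, integer check, correctness) can help early in training.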

In [8]:
from vllm import SamplingParams
from trl import GRPOTrainer

trainer = GRPOTrainer(
    model = model,
    processing_class = tokenizer,
    reward_funcs = [
        xmlcount_reward_func,
        soft_format_reward_func,
        strict_format_reward_func,
        int_reward_func,
        correctness_reward_func,
    ],
    args = training_args,
    train_dataset = dataset,
)
trainer.train()

==((====))==  Unsloth - 2x faster free finetuning | Num GPUs = 1
   \\   /|    Num examples = 7,473 | Num Epochs = 1
O^O/ \_/ \    Batch size per device = 1 | Gradient Accumulation steps = 1
\        /    Total batch size = 1 | Total steps = 250
 "-____-"     Number of trainable parameters = 6,815,744


-------------------- Question:
A concert ticket costs $40. Mr. Benson bought 12 tickets and received a 5% discount for every ticket bought that exceeds 10. How much did Mr. Benson pay in all? 
Answer:
476 
Response:
Let's break this down step by step:

1. Mr. Benson bought 12 tickets, but the discount only applies to the tickets beyond 10. So, the first 10 tickets are full price, and the last 2 tickets get a 5% discount.

2. The cost of the first 10 tickets:
10 tickets x $40 per ticket = $400

3. To find the cost of the two tickets with a 5% discount, first, we need to find the discounted price of one ticket:
   Original price of one ticket = $40
   Discount = 5% of $40 = 0.05 x $40 = $2
   Discounted price of one ticket = Original price - discount = $40 - $2 = $38

4. Now, we can calculate the cost of the two discounted tickets:
   Cost of 2 discounted tickets = 2 x $38 = $76

5. To find the total cost, 
Extracted:
Let's break this down step by step:

1. Mr. Benson bought 12 tickets, 

| Step | Training Loss | reward | reward_std | completion_length | kl | rewards / xmlcount_reward_func | rewards / soft_format_reward_func | rewards / strict_format_reward_func | rewards / int_reward_func | rewards / correctness_reward_func |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 0.0 | 0.0 | 0.0 | 171.333344 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 2 | 0.0 | 0.0 | 0.0 | 200.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 3 | 0.0 | 0.020833 | 0.051031 | 198.166672 | 0.0 | 0.020833 | 0.0 | 0.0 | 0.0 | 0.0 |
| 4 | 0.0 | 1.211667 | 1.345531 | 191.166672 | 0.000425 | -0.038333 | 0.0 | 0.0 | 0.25 | 1.0 |
| 5 | 0.0 | 0.080833 | 0.145295 | 100.333336 | 0.000612 | -0.0025 | 0.0 | 0.0 | 0.083333 | 0.0 |
| 6 | 0.0 | 0.833333 | 1.290994 | 190.166672 | 0.000259 | 0.0 | 0.0 | 0.0 | 0.166667 | 0.666667 |
| 7 | 0.0 | 0.0 | 0.0 | 198.833344 | 0.000364 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 8 | 0.0 | 0.326833 | 1.014773 | 108.5 | 0.000655 | -0.089833 | 0.0 | 0.0 | 0.083333 | 0.333333 |
| 9 | 0.0 | 0.854167 | 1.275776 | 161.833344 | 0.000382 | 0.020833 | 0.0 | 0.0 | 0.166667 | 0.666667 |
| 10 | 0.0 | 0.0 | 0.0 | 193.0 | 0.000265 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |


-------------------- Question:
Jane is trying to decide whether to buy a house or a trailer. A house costs $480,000 and a trailer costs $120,000. Each loan will be paid in monthly installments over 20 years. How much more is the monthly payment on the house compared to the trailer? 
Answer:
1500 
Response:
Let's calculate the monthly payment for each option.

The formula to calculate the monthly payment is:

Monthly Payment = (Loan Amount x Interest Rate x Number of Payments) / (1 - (1 + Interest Rate)^(-Number of Payments))

For simplicity, let's assume the interest rate is 5% for both options. 

The number of payments is 20 years * 12 months/year = 240 months.

For the house:

Monthly Payment = ($480,000 x 0.05 x 240) / (1 - (1 + 0.05)^(-240))
Monthly Payment = ($115,200) / (1 - 0.000000169)
Monthly Payment = ($115,200) / 0.999999831
Monthly Payment = $115.44

For the trailer:

Monthly Payment = ($120,000 x 0.05 x 240) / (1 - (1 + 0.05)^(-240))
Monthly Payment = 
Extracted:
Let's cal

TrainOutput(global_step=250, training_loss=7.336921385043383e-05, metrics={'train_runtime': 2429.5951, 'train_samples_per_second': 0.103, 'train_steps_per_second': 0.103, 'total_flos': 0.0, 'train_loss': 7.336921385043383e-05})
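
One thing worth checking when a format reward stays at 0.0, as both format columns do in the ten logged steps above: in Python's `re` module, `.` does not match a newline unless `re.DOTALL` is set. A minimal sketch of the difference, using the soft-format pattern and a hypothetical multi-line completion:

```python
import re

soft = r"<reasoning>.*?</reasoning>\s*<answer>.*?</answer>"

# Hypothetical completion in the layout requested by SYSTEM_PROMPT
sample = "<reasoning>\nstep 1\nstep 2\n</reasoning>\n<answer>\n476\n</answer>\n"

print(bool(re.match(soft, sample)))                   # False: "." cannot cross "\n"
print(bool(re.match(soft, sample, flags=re.DOTALL)))  # True
```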

In [9]:
# Base model: no system prompt, no LoRA adapter applied (lora_request = None)
query = "which is bigger 9.11 or 9.9?"

text = tokenizer.apply_chat_template([
    {"role" : "user", "content" : query},
], tokenize = False, add_generation_prompt = True)


output = model.fast_generate([text],
                             sampling_params = SamplingParams(
                                 temperature = 0.8,
                                 top_p = 0.95,
                                 max_tokens = 1024),
                             lora_request = None
                             )[0].outputs[0].text

print(output)

Processed prompts: 100%|██████████| 1/1 [00:00<00:00,  3.32it/s, est. speed input: 163.22 toks/s, output: 39.97 toks/s]

9.11 is bigger than 9.9.





In [10]:
model.save_lora("grpo_saved_lora")

In [13]:
# Model with the saved GRPO LoRA adapter and the reasoning system prompt

query = "which is bigger 9.11 or 9.9?"

text = tokenizer.apply_chat_template([
    {"role" : "system", "content" : SYSTEM_PROMPT},
    {"role" : "user", "content" : query},
], tokenize = False, add_generation_prompt = True)


output = model.fast_generate(text,
                             sampling_params = SamplingParams(
                                 temperature = 0.8,
                                 top_p = 0.95,
                                 max_tokens = 1024),
                             lora_request = model.load_lora("grpo_saved_lora"),
                             )[0].outputs[0].text

print(output)

Processed prompts: 100%|██████████| 1/1 [00:00<00:00,  1.59it/s, est. speed input: 112.85 toks/s, output: 54.04 toks/s]

9.11 is greater than 9.9 because the first decimal place of both numbers is the same (9), and 11 is greater than 9.
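
For reference, an exact decimal comparison gives the expected answer to this query; both sampled outputs above state the opposite. Because sampling uses `temperature = 0.8`, outputs can differ between runs.

```python
from decimal import Decimal

a, b = Decimal("9.11"), Decimal("9.9")

print(max(a, b))  # 9.9
print(a < b)      # True
```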





### References: 
- https://unsloth.ai/blog/r1-reasoning
- https://github.com/patchy631/ai-engineering-hub/tree/main/Build-reasoning-model
- https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3.1_(8B)-GRPO.ipynb