<a href="https://colab.research.google.com/github/scalixte-mdsol/llm_inferences/blob/main/deepseek_r1_0528_qwen3__8b__grpo.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Source: https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/DeepSeek_R1_0528_Qwen3_(8B)_GRPO.ipynb


### Installation

In [None]:
%%capture
import os
if "COLAB_" not in "".join(os.environ.keys()):
    # If you're not in Colab, just use pip install or uv pip install
    !pip install unsloth vllm
else:
    pass # For Colab / Kaggle, we need extra instructions hidden below \/

In [None]:
!ls

sample_data


In [None]:
#@title Colab Extra Install { display-mode: "form" }
%%capture
import os
!pip install --upgrade -qqq uv
if "COLAB_" not in "".join(os.environ.keys()):
    # If you're not in Colab, just use pip install!
    !pip install unsloth vllm
else:
    try: import numpy; get_numpy = f"numpy=={numpy.__version__}"
    except: get_numpy = "numpy"
    try: import subprocess; is_t4 = "Tesla T4" in str(subprocess.check_output(["nvidia-smi"]))
    except: is_t4 = False
    get_vllm, get_triton = ("vllm==0.10.1", "triton==3.2.0") if is_t4 else ("vllm", "triton")
    !uv pip install -qqq --upgrade \
        unsloth {get_vllm} {get_numpy} torchvision bitsandbytes xformers
    !uv pip install -qqq {get_triton}
!uv pip install transformers==4.55.4

### Unsloth

Goal: To train `DeepSeek-R1-0528-Qwen3-8B` via GRPO by using OpenR1's Math dataset.

Note: DeepSeek-R1-Zero & DeepSeek-R1 are trained based on DeepSeek-V3-Base (https://huggingface.co/deepseek-ai/DeepSeek-R1)

We also use `langid` for language detection. Our main goal is to force the model to generate reasoning traces in English, and we create a reward function using `langid` to check this.

In [None]:
!pip install langid -qq



### Unsloth (model load + LoRA)

*   `langid` (used by the language-detection reward) was installed in the previous cell.
*   Loads `DeepSeek-R1-0528-Qwen3-8B` via Unsloth with:
     * 4-bit quantized weights (`load_in_4bit=True`) to save VRAM,
     * `fast_inference=True` to generate through vLLM,
     * `max_seq_length=1024` as the context length shared by prompt and completion,
     * `gpu_memory_utilization=0.7` to leave headroom and reduce the risk of OOM.
*   Wraps the base model with LoRA adapters (low-rank trainable layers).
*   Uses gradient checkpointing to reduce memory during training.
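
To make "low-rank trainable layers" concrete, here is a rough parameter count for the LoRA adapters. It is a sketch that assumes the usual Qwen3-8B shapes (hidden size 4096, MLP size 12288, 32 query heads, 8 key/value heads, head dimension 128, 36 layers); none of these are stated in this notebook, so check the model config before relying on them.

```python
# Assumed Qwen3-8B shapes (not stated in this notebook; verify against the model config).
hidden = 4096
intermediate = 12288
num_layers = 36
q_out = 32 * 128   # 32 query heads x head_dim 128
kv_out = 8 * 128   # 8 key/value heads x head_dim 128
r = 32             # lora_rank used in this notebook

# (in_features, out_features) of each projection listed in target_modules
shapes = {
    "q_proj": (hidden, q_out),
    "k_proj": (hidden, kv_out),
    "v_proj": (hidden, kv_out),
    "o_proj": (q_out, hidden),
    "gate_proj": (hidden, intermediate),
    "up_proj": (hidden, intermediate),
    "down_proj": (intermediate, hidden),
}

# LoRA adds A (r x in) and B (out x r) per projection: r * (in + out) weights.
per_layer = sum(r * (n_in + n_out) for n_in, n_out in shapes.values())
lora_params = per_layer * num_layers
print(f"{lora_params:,}")  # 87,293,952
```

Under these assumptions the total agrees with the "Trainable parameters" figure that the trainer banner prints later in this notebook.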

In [None]:
from unsloth import FastLanguageModel
import torch
max_seq_length = 1024 # Can increase for longer reasoning traces
lora_rank = 32 # Larger rank = smarter, but slower

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/DeepSeek-R1-0528-Qwen3-8B",
    max_seq_length = max_seq_length,
    load_in_4bit = True, # False for LoRA 16bit
    fast_inference = True, # Enable vLLM fast inference
    max_lora_rank = lora_rank,
    gpu_memory_utilization = 0.7, # Reduce if out of memory
)

model = FastLanguageModel.get_peft_model(
    model,
    r = lora_rank, # Choose any number > 0 ! Suggested 8, 16, 32, 64, 128
    target_modules = [
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
    lora_alpha = lora_rank*2, # *2 speeds up training
    use_gradient_checkpointing = "unsloth", # Reduces memory usage
    random_state = 3407,
)

🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
INFO 09-08 08:41:29 [__init__.py:241] Automatically detected platform cuda.
ERROR 09-08 08:41:30 [fa_utils.py:57] Cannot use FA version 2 is not supported due to FA2 is only supported on devices with compute capability >= 8
🦥 Unsloth Zoo will now patch everything to make training faster!


Unrecognized keys in `rope_scaling` for 'rope_type'='yarn': {'attn_factor'}


Unsloth: Patching vLLM v1 graph capture
Unsloth: Patching vLLM v0 graph capture
==((====))==  Unsloth 2025.9.1: Fast Qwen3 patching. Transformers: 4.55.4. vLLM: 0.10.1.
   \\   /|    Tesla T4. Num GPUs = 1. Max memory: 14.741 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.7.1+cu126. CUDA: 7.5. CUDA Toolkit: 12.6. Triton: 3.2.0
\        /    Bfloat16 = FALSE. FA [Xformers = 0.0.31. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


Unrecognized keys in `rope_scaling` for 'rope_type'='yarn': {'attn_factor'}


Unsloth: vLLM loading unsloth/deepseek-r1-0528-qwen3-8b-unsloth-bnb-4bit with actual GPU utilization = 69.34%
Unsloth: Your GPU has CUDA compute capability 7.5 with VRAM = 14.74 GB.
Unsloth: Using conservativeness = 1.0. Chunked prefill tokens = 1024. Num Sequences = 160.
Unsloth: vLLM's KV Cache can use up to 3.7 GB. Also swap space = 0 GB.
Unsloth: Not an error, but `device` is not supported in vLLM. Skipping.
INFO 09-08 08:41:43 [utils.py:326] non-default args: {'model': 'unsloth/deepseek-r1-0528-qwen3-8b-unsloth-bnb-4bit', 'load_format': 'bitsandbytes', 'dtype': torch.float16, 'seed': 0, 'max_model_len': 1024, 'enable_prefix_caching': True, 'swap_space': 0, 'gpu_memory_utilization': 0.6933715908761556, 'max_num_batched_tokens': 1024, 'max_num_seqs': 160, 'max_logprobs': 0, 'disable_log_stats': True, 'quantization': 'bitsandbytes', 'enable_lora': True, 'max_lora_rank': 32, 'compilation_config': {"level":3,"debug_dump_path":"","cache_dir":"","backend":"inductor","custom_ops":[],"spli

Unrecognized keys in `rope_scaling` for 'rope_type'='yarn': {'attn_factor'}


INFO 09-08 08:41:59 [__init__.py:711] Resolved architecture: Qwen3ForCausalLM
INFO 09-08 08:41:59 [__init__.py:1750] Using max model len 1024


Unrecognized keys in `rope_scaling` for 'rope_type'='yarn': {'attn_factor'}


Unsloth: vLLM Bitsandbytes config using kwargs = {'load_in_8bit': False, 'load_in_4bit': True, 'bnb_4bit_compute_dtype': 'float16', 'bnb_4bit_quant_storage': 'uint8', 'bnb_4bit_quant_type': 'nf4', 'bnb_4bit_use_double_quant': True, 'llm_int8_enable_fp32_cpu_offload': False, 'llm_int8_has_fp16_weight': False, 'llm_int8_skip_modules': ['lm_head', 'multi_modal_projector', 'merger', 'modality_projection', 'model.layers.33.self_attn', 'model.layers.34.self_attn', 'model.layers.1.self_attn', 'model.layers.6.self_attn', 'model.layers.34.mlp', 'model.layers.4.mlp', 'model.layers.2.mlp', 'model.layers.5.mlp', 'model.layers.6.mlp'], 'llm_int8_threshold': 6.0}
INFO 09-08 08:42:02 [llm_engine.py:222] Initializing a V0 LLM engine (v0.10.1) with config: model='unsloth/deepseek-r1-0528-qwen3-8b-unsloth-bnb-4bit', speculative_config=None, tokenizer='unsloth/deepseek-r1-0528-qwen3-8b-unsloth-bnb-4bit', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config={}, tokenizer_r

tokenizer_config.json: 0.00B [00:00, ?B/s]

tokenizer.json:   0%|          | 0.00/11.4M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/472 [00:00<?, ?B/s]

chat_template.jinja: 0.00B [00:00, ?B/s]

generation_config.json:   0%|          | 0.00/171 [00:00<?, ?B/s]

INFO 09-08 08:42:05 [cuda.py:384] Cannot use FlashAttention-2 backend for Volta and Turing GPUs.
INFO 09-08 08:42:05 [cuda.py:433] Using XFormers backend.
INFO 09-08 08:42:06 [parallel_state.py:1134] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, TP rank 0, EP rank 0
INFO 09-08 08:42:06 [model_runner.py:1080] Starting to load model unsloth/deepseek-r1-0528-qwen3-8b-unsloth-bnb-4bit...
INFO 09-08 08:42:07 [bitsandbytes_loader.py:742] Loading weights with BitsAndBytes quantization. May take a while ...
INFO 09-08 08:42:07 [weight_utils.py:296] Using model weights format ['*.safetensors']


model-00001-of-00002.safetensors:   0%|          | 0.00/4.98G [00:00<?, ?B/s]

model-00002-of-00002.safetensors:   0%|          | 0.00/2.47G [00:00<?, ?B/s]

INFO 09-08 08:44:22 [weight_utils.py:312] Time spent downloading weights for unsloth/deepseek-r1-0528-qwen3-8b-unsloth-bnb-4bit: 134.742717 seconds


model.safetensors.index.json: 0.00B [00:00, ?B/s]

Loading safetensors checkpoint shards:   0% Completed | 0/2 [00:00<?, ?it/s]


Loading safetensors checkpoint shards:   0% Completed | 0/2 [00:00<?, ?it/s]


INFO 09-08 08:44:53 [punica_selector.py:19] Using PunicaWrapperGPU.
INFO 09-08 08:44:54 [model_runner.py:1112] Model loading took 7.1825 GiB and 166.799810 seconds
INFO 09-08 08:45:05 [worker.py:295] Memory profiling takes 10.51 seconds
INFO 09-08 08:45:05 [worker.py:295] the current vLLM instance can use total_gpu_memory (14.74GiB) x gpu_memory_utilization (0.69) = 10.22GiB
INFO 09-08 08:45:05 [worker.py:295] model weights take 7.18GiB; non_torch_memory takes 0.03GiB; PyTorch activation peak memory takes 0.88GiB; the rest of the memory reserved for KV Cache is 2.13GiB.
INFO 09-08 08:45:06 [executor_base.py:114] # cuda blocks: 970, # CPU blocks: 0
INFO 09-08 08:45:06 [executor_base.py:119] Maximum concurrency for 1024 tokens per request: 15.16x
INFO 09-08 08:45:06 [vllm_utils.py:695] Unsloth: Running patched vLLM v0 `capture_model`.
INFO 09-08 08:45:06 [model_runner.py:1383] Capturing cudagraphs for decoding. This may lead to unexpected consequences if the model is not static. To run t

Capturing CUDA graph shapes:   0%|          | 0/23 [00:00<?, ?it/s]

INFO 09-08 08:45:31 [model_runner.py:1535] Graph capturing finished in 26 secs, took 0.57 GiB
INFO 09-08 08:45:31 [vllm_utils.py:702] Unsloth: Patched vLLM v0 graph capture finished in 26 secs.
INFO 09-08 08:45:33 [llm_engine.py:417] init engine (profile, create kv cache, warmup model) took 38.56 seconds
INFO 09-08 08:45:33 [llm.py:298] Supported_tasks: ['generate']
Unsloth: Just some info: will skip parsing ['post_feedforward_layernorm', 'pre_feedforward_layernorm']
Unsloth: Just some info: will skip parsing ['post_feedforward_layernorm', 'pre_feedforward_layernorm']


tokenizer_config.json: 0.00B [00:00, ?B/s]

tokenizer.json:   0%|          | 0.00/11.4M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/472 [00:00<?, ?B/s]

chat_template.jinja: 0.00B [00:00, ?B/s]

Unsloth 2025.9.1 patched 36 layers with 36 QKV layers, 36 O layers and 36 MLP layers.


#### Exercise 1: Try with lora_rank=[8, 16], lower max_seq_length, or set gpu_memory_utilization=0.6

### GRPO Chat Template

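The DeepSeek-R1 distill of Qwen3 ships with a chat template that formats the model's input and output as a conversation, including the reasoning step. We have to use that template because the model was trained with it.


The tokenizer has special tokens such as `<think>` and `</think>`, plus role tags.
The loop below looks the actual strings up in the tokenizer's added vocabulary so we don't hardcode them. Note that the `"user"` and `"assistant"` checks are case-sensitive, so they find nothing here (both print `None` below); the rendered template further down shows the role tags as `<｜User｜>` and `<｜Assistant｜>`.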

In [None]:
reasoning_start = None
reasoning_end = None
user_token = None
assistant_token = None

for token in tokenizer.get_added_vocab().keys():
    if "think" in token and "/" in token:
        reasoning_end = token
    elif "think" in token:
        reasoning_start = token
    elif "user" in token:
        user_token = token
    elif "assistant" in token:
        assistant_token = token

system_prompt = \
f"""You are given a problem.
Think about the problem and provide your working out.
You must think in English."""
system_prompt

'You are given a problem.\nThink about the problem and provide your working out.\nYou must think in English.'

In [None]:
print("reasoning_start:", reasoning_start)
print("reasoning_end  :", reasoning_end)
print("user_token     :", user_token)
print("assistant_token:", assistant_token)


reasoning_start: <think>
reasoning_end  : </think>
user_token     : None
assistant_token: None
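
If you do want the role tags, a case-insensitive variant of the loop is one option. The sketch below is self-contained and runs on a hypothetical list of token strings; whether the real `tokenizer.get_added_vocab()` contains exactly these role tags is an assumption based on the rendered template output, not something this notebook verifies.

```python
def discover_tokens(added_tokens):
    """Pick out reasoning and role tags from an iterable of token strings."""
    found = {
        "reasoning_start": None,
        "reasoning_end": None,
        "user_token": None,
        "assistant_token": None,
    }
    for token in added_tokens:
        lowered = token.lower()  # compare case-insensitively
        if "think" in lowered and "/" in lowered:
            found["reasoning_end"] = token
        elif "think" in lowered:
            found["reasoning_start"] = token
        elif "user" in lowered:
            found["user_token"] = token
        elif "assistant" in lowered:
            found["assistant_token"] = token
    return found

# Hypothetical sample; in the notebook you would pass tokenizer.get_added_vocab().keys()
sample = ["<think>", "</think>", "<｜User｜>", "<｜Assistant｜>"]
print(discover_tokens(sample))
```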


In [None]:
{"role":"assistant","content": f"{reasoning_start}I think it's 2.2{reasoning_end}2"}


{'role': 'assistant', 'content': "<think>I think it's 2.2</think>2"}

In [None]:
test = raw = (
    f"{user_token}\nWhat is 1+1?\n"
    f"{assistant_token}\n{reasoning_start}Let me think...{reasoning_end}2"
)


In [None]:
test

'None\nWhat is 1+1?\nNone\n<think>Let me think...</think>2'

In [None]:
print(tokenizer.apply_chat_template([
    {"role" : "user", "content" : "What is 1+1?"},
    {"role" : "assistant", "content" : f"<think>I think it's 2.2</think>2"},
    {"role" : "user", "content" : "What is 1+1?"},
    {"role" : "assistant", "content" : f"<think>I think it's 2.2</think>2"},
], tokenize = False, add_generation_prompt = True))

<｜begin▁of▁sentence｜><｜User｜>What is 1+1?<｜Assistant｜>2<｜end▁of▁sentence｜><｜User｜>What is 1+1?<｜Assistant｜>2<｜end▁of▁sentence｜><｜Assistant｜>


Where these variables matter in the notebook:

`reasoning_end` (`</think>`) is used in the regex that extracts the final answer (everything after `</think>`) and in the format rewards.

`reasoning_start` (`<think>`) and `reasoning_end` are both used in the approximate format reward, which checks that each tag appears exactly once.

`user_token` / `assistant_token` are not used separately, because `apply_chat_template` handles the roles.

Note that in the rendered output above, the template drops the `<think>...</think>` part of the earlier assistant turns.

### Exercise: How to use `<think>` with the chat template

If you want generation to start inside a reasoning block, you have to add the tag yourself:

#### Option 1: Append `<think>` after the serialized template
text = tokenizer.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)
text = text + reasoning_start  # e.g., "<think>"

# Feed `text` to your generator; the model continues right after <think>.



#### Option 2: Seed the assistant turn with `<think>` (no generation prompt)
seeded = [
    {"role": "system", "content": system_prompt},
    {"role": "user",   "content": "Solve (x + 2)^2 = 0"},
    {"role": "assistant", "content": reasoning_start},  # start with "<think>"
]

text = tokenizer.apply_chat_template(seeded, add_generation_prompt=False, tokenize=False)

The intent is for the model to continue right after `<think>`. Print `text` before relying on this: the rendered example above shows this template closing completed assistant turns with `<｜end▁of▁sentence｜>`, in which case Option 1 is the safer choice.



### Data Prep
<a name="Data"></a>

We're using Hugging Face's [Open R1 Math dataset](https://huggingface.co/datasets/open-r1/DAPO-Math-17k-Processed). You can also use OpenAI's well-known [GSM8K dataset](https://huggingface.co/datasets/openai/gsm8k).

In [None]:
from datasets import load_dataset
dataset = load_dataset("open-r1/DAPO-Math-17k-Processed", "en", split = "train")
dataset

README.md: 0.00B [00:00, ?B/s]

en/train-00000-of-00001.parquet:   0%|          | 0.00/5.23M [00:00<?, ?B/s]

Generating train split:   0%|          | 0/14116 [00:00<?, ? examples/s]

Dataset({
    features: ['prompt', 'solution', 'data_source', 'source_prompt', 'ability', 'reward_model', 'extra_info'],
    num_rows: 14116
})

In [None]:
small_dataset_ = dataset.select(range(10))

Let's look at the first row:

In [None]:
dataset[0]["prompt"]

'In triangle $ABC$, $\\sin \\angle A = \\frac{4}{5}$ and $\\angle A < 90^\\circ$. Let $D$ be a point outside triangle $ABC$ such that $\\angle BAD = \\angle DAC$ and $\\angle BDC = 90^\\circ$. Suppose that $AD = 1$ and that $\\frac{BD}{CD} = \\frac{3}{2}$. If $AB + AC$ can be expressed in the form $\\frac{a\\sqrt{b}}{c}$ where $a, b, c$ are pairwise relatively prime integers, find $a + b + c$.'

In [None]:
dataset[0]["solution"]

'34'

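In GSM8K, all answers contain a `####` marker, so we would extract the text that follows it. For the Open R1 dataset the solution is already a bare answer (like the one above), so the function below returns the text unchanged.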

In [None]:
def extract_hash_answer(text):
    # if "####" not in text: return None
    # return text.split("####")[1].strip()
    return text
extract_hash_answer(dataset[0]["solution"])

'34'


The mapping below converts raw rows into the chat format the model expects
and keeps the gold answer in `answer` for the reward checks.


Let's map the dataset and look at the first row:

In [None]:
small_dataset = small_dataset_.map(lambda x: {
    "prompt" : [
        {"role": "system", "content": system_prompt},
        {"role": "user",   "content": x["prompt"]},
    ],
    "answer": extract_hash_answer(x["solution"]),
})
# dataset[0]

Map:   0%|          | 0/10 [00:00<?, ? examples/s]

In [None]:
small_dataset[0]

{'prompt': [{'content': 'You are given a problem.\nThink about the problem and provide your working out.\nYou must think in English.',
   'role': 'system'},
  {'content': 'In triangle $ABC$, $\\sin \\angle A = \\frac{4}{5}$ and $\\angle A < 90^\\circ$. Let $D$ be a point outside triangle $ABC$ such that $\\angle BAD = \\angle DAC$ and $\\angle BDC = 90^\\circ$. Suppose that $AD = 1$ and that $\\frac{BD}{CD} = \\frac{3}{2}$. If $AB + AC$ can be expressed in the form $\\frac{a\\sqrt{b}}{c}$ where $a, b, c$ are pairwise relatively prime integers, find $a + b + c$.',
   'role': 'user'}],
 'solution': '34',
 'data_source': 'math_dapo',
 'source_prompt': [{'content': 'Solve the following math problem step by step. The last line of your response should be of the form Answer: $Answer (without quotes) where $Answer is the answer to the problem.\n\nIn triangle $ABC$, $\\sin \\angle A = \\frac{4}{5}$ and $\\angle A < 90^\\circ$. Let $D$ be a point outside triangle $ABC$ such that $\\angle BAD = \

### Exercise: Change your system prompt and apply on the small dataset to see the difference

We create a regex format to match the reasoning sections and answers:

In [None]:
import re
"""
Build a regex that captures what comes after </think> (the model’s “final answer” area).
"""

# Capture everything after the closing reasoning tag (no EOS token handling here)
solution_end_regex = rf"{reasoning_end}(.*)"

match_format = re.compile(solution_end_regex, re.DOTALL)
match_format

re.compile(r'</think>(.*)', re.DOTALL|re.UNICODE)

We verify it works:

In [None]:
match_format.findall(
    "Let me think!</think>"\
    f"Hence, the solution is 2.",
)

['Hence, the solution is 2.']

In [None]:
match_format.findall(
    "<think>Let me think!</think>"\
    f"\n\nHence, the solution is 2",
)

['\n\nHence, the solution is 2']
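
As a small illustration of how this pattern behaves on complete and truncated responses, here is a self-contained sketch. The helper name `extract_final_answer` is hypothetical, and `re.escape` is an addition of this sketch (it leaves `</think>` unchanged but guards against tags that contain regex metacharacters).

```python
import re

reasoning_end = "</think>"  # value printed earlier in this notebook
match_format = re.compile(rf"{re.escape(reasoning_end)}(.*)", re.DOTALL)

def extract_final_answer(response):
    """Return the stripped text after the first </think>, or None if the tag is absent."""
    found = match_format.search(response)
    return found.group(1).strip() if found else None

print(extract_final_answer("<think>1 + 1 = 2</think>\n\n2"))  # 2
print(extract_final_answer("<think>still thinking..."))        # None
```

A response that is cut off before `</think>` yields `None`, which is the case the reward functions below penalize.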

### Reward functions (format + answer)

We now want to create a reward function to match the format exactly - we reward it with 3 points if it succeeds:

In [None]:
def match_format_exactly(completions, **kwargs):
    # +3.0 if the response contains </think>, i.e. the reasoning block was closed
    # and whatever follows can be treated as the final answer.
    scores = []
    for completion in completions:
        score = 0
        response = completion[0]["content"]
        # Match if format is seen exactly!
        if match_format.search(response) is not None: score += 3.0
        scores.append(score)
    return scores

If it fails, we want to reward the model if it at least follows the format partially, by counting each symbol:

In [None]:
def match_format_approximately(completions, **kwargs):
    # Softer reward for partial format adherence: count <think> and </think>.
    # Each tag scores +0.5 if it appears exactly once, otherwise -1.0.
    scores = []
    for completion in completions:
        score = 0
        response = completion[0]["content"]
        # Count how many keywords are seen - we penalize if too many!
        # If we see 1, then plus some points!

        # Both tags are scored here, so the possible totals are +1.0, -0.5, or -2.0.
        score += 0.5 if response.count(reasoning_start) == 1 else -1.0
        score += 0.5 if response.count(reasoning_end)   == 1 else -1.0
        scores.append(score)
    return scores

We want to extract the generated answer, and reward or penalize it! We also reward it based on how close the answer is to the true one via ratios:

Main correctness reward:

- exact/close matches → positive,
- wrong/missing → penalties.

In [None]:
def check_answer(prompts, completions, answer, **kwargs):
    # Extract text after </think> and compare against gold "answer":
    # +5 exact match, +3.5 if strip-equal, +1.5~+2.0 if numeric ratio is close, negative if wrong
    question = prompts[0][-1]["content"]
    responses = [completion[0]["content"] for completion in completions]

    extracted_responses = [
        guess.group(1)
        if (guess := match_format.search(r)) is not None else None \
        for r in responses
    ]

    scores = []
    for guess, true_answer in zip(extracted_responses, answer):
        score = 0
        if guess is None:
            scores.append(-2.0)
            continue
        # Correct answer gets 5 points!
        if guess == true_answer:
            score += 5.0
        # Match if spaces are seen, but less reward
        elif guess.strip() == true_answer.strip():
            score += 3.5
        else:
            # We also reward it if the answer is close via ratios!
            # Ie if the answer is within some range, reward it!
            try:
                ratio = float(guess) / float(true_answer)
                if   ratio >= 0.9 and ratio <= 1.1: score += 2.0
                elif ratio >= 0.8 and ratio <= 1.2: score += 1.5
                else: score -= 2.5 # Penalize wrong answers
            except (ValueError, ZeroDivisionError):
                score -= 4.5 # Penalize text that is not a number (or a zero gold answer)
        scores.append(score)
    return scores
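
To see the scoring tiers in isolation, the sketch below mirrors the branch logic of `check_answer` for a single guess. `score_guess` is a hypothetical helper written for illustration; it is not part of the notebook's training code.

```python
def score_guess(guess, true_answer):
    """Mirror the scoring tiers of check_answer for one extracted guess."""
    if guess is None:
        return -2.0                      # no </think> found, nothing to compare
    if guess == true_answer:
        return 5.0                       # exact string match
    if guess.strip() == true_answer.strip():
        return 3.5                       # match up to surrounding whitespace
    try:
        ratio = float(guess) / float(true_answer)
    except (ValueError, ZeroDivisionError):
        return -4.5                      # not a number at all
    if 0.9 <= ratio <= 1.1:
        return 2.0                       # within 10% of the gold answer
    if 0.8 <= ratio <= 1.2:
        return 1.5                       # within 20% of the gold answer
    return -2.5                          # numeric but far off

for guess in ["34", " 34\n", "35", "40", "100", "thirty-four", None]:
    print(repr(guess), score_guess(guess, "34"))
```

With gold answer `"34"` this prints 5.0, 3.5, 2.0, 1.5, -2.5, -4.5 and -2.0 in that order.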

### Exercise: Change the reward scoring mechanism above and run the cells again

Sometimes the answer is not a bare number but a sentence, for example "The solution is $20"; in that case we extract 20.

We also allow commas inside the number, as in 123,456; they are removed before the value is converted to a float.

In [None]:
# Number extractor (helper)
match_numbers = re.compile(
    r".*?[\s]{0,}([-]?[\d\.\,]{1,})",
    flags = re.MULTILINE | re.DOTALL
)
print(match_numbers.findall("  0.34  "))
print(match_numbers.findall("  123,456  "))
print(match_numbers.findall("  -0.234  "))
print(match_numbers.findall("17"))

['0.34']
['123,456']
['-0.234']
['17']
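
The pattern captures the first run of digits, dots, and commas, which has a side effect worth knowing about: a comma or full stop that appears before the number is captured on its own. The sketch below demonstrates this with a hypothetical helper, `first_number`, that applies the same comma stripping and float conversion used later in `check_numbers`.

```python
import re

match_numbers = re.compile(
    r".*?[\s]{0,}([-]?[\d\.\,]{1,})",
    flags=re.MULTILINE | re.DOTALL,
)

def first_number(text):
    """Return the first captured number as a float, or None if it cannot be parsed."""
    found = match_numbers.search(text)
    if found is None:
        return None
    try:
        return float(found.group(1).replace(",", ""))
    except ValueError:
        return None

print(first_number("The solution is $20"))     # 20.0
print(first_number("Total: 123,456"))          # 123456.0
print(first_number("Hence, the answer is 5"))  # None: the capture is just ","
```

In the last case the conversion fails, which corresponds to the 0 score that `check_numbers` assigns when `float` raises.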


Finally, we will try to enforce that the thinking process is in English. This is a simple version of the `language consistency reward` used in the DeepSeek R1 paper.

In [None]:
import langid

def get_lang(text: str) -> str:
    if not text:
        return "und"
    lang, _ = langid.classify(text)
    return lang


print(get_lang("Hello, How are you")) # This should return en
# print(get_lang("Aku berpikir kalau aku adalah kamu")) # This should return id
print(get_lang("我在这里")) # This should return zh

en
zh


### Exercise: Change to "zh" in system prompt and language detector

In [None]:
import re
"""
Rewards completions that langid classifies as English.
The whole completion (reasoning and final answer) is checked; the format itself is not.
"""

def format_and_language_reward_func(completions, **kwargs):
    scores = []

    for completion_item in completions:
        if not completion_item or not isinstance(completion_item[0], dict) or "content" not in completion_item[0]:
            scores.append(-5.0)
            print(f"Warning: Malformed completion item, assigning default low score: {completion_item}")
            continue

        content = completion_item[0]["content"]

        lang = get_lang(content)

        if lang == 'en':
            score = 5.0
        else:
            score = -3.0

        scores.append(score)

    return scores

#### Exercise:
- Flip targets by changing the language code
- Change the reward values
- Add two language codes

In [None]:
prompts = [
    [{"role": "user", "content": "What is the result of (1 + 2) * 4?"}],
    [{"role": "user", "content": "What is the result of (3 + 1) * 2?"}],
]
completions = [
    [{"role": "assistant", "content": "<think>The sum of 1 and 2 is 3, which we multiply by 4 to get 12.</think><answer>(1 + 2) * 4 = 12</answer>"}],
    [{"role": "assistant", "content": "The sum of 3 and 1 is 4, which we multiply by 2 to get 8. So (3 + 1) * 2 = 8."}],
]
format_and_language_reward_func(prompts=prompts, completions=completions)

[5.0, 5.0]

### Exercise: Run all the reward and language code changes mentioned above and print reward values here

We now prepare our main function which will print out the generated responses and the true answer, along with another reward function which converts text to float via `float` and sees if it's the same.

In [None]:
# Module-level counters that throttle debug printing inside check_numbers
PRINTED_TIMES = 0
PRINT_EVERY_STEPS = 5

"""
Every PRINT_EVERY_STEPS calls, print the question, gold answer, model response, and extracted number.
Reward: +3.5 for a numeric match (after stripping commas), -1.5 for a mismatch,
-2.5 if no number is found, and 0 if the text cannot be converted to float.
"""

def check_numbers(prompts, completions, answer, **kwargs):
    question = prompts[0][-1]["content"]
    responses = [completion[0]["content"] for completion in completions]

    extracted_responses = [
        guess.group(1)
        if (guess := match_numbers.search(r)) is not None else None \
        for r in responses
    ]

    scores = []
    # Print only every few steps
    global PRINTED_TIMES
    global PRINT_EVERY_STEPS
    if PRINTED_TIMES % PRINT_EVERY_STEPS == 0:
        print(
            '*'*20 + f"Question:\n{question}", f"\nAnswer:\n{answer[0]}", f"\nResponse:\n{responses[0]}", f"\nExtracted:\n{extracted_responses[0]}"
        )
    PRINTED_TIMES += 1

    for guess, true_answer in zip(extracted_responses, answer):
        if guess is None:
            scores.append(-2.5)
            continue
        # Convert to numbers
        try:
            true_answer = float(true_answer.strip())
            # Remove commas like in 123,456
            guess       = float(guess.strip().replace(",", ""))
            scores.append(3.5 if guess == true_answer else -1.5)
        except:
            scores.append(0)
            continue
    return scores

Compute the 90th-percentile prompt length so that prompts are not accidentally truncated.

In other words, we drop the longest 10% of prompts.

In [None]:
"""
Tokenize and compute lengths L.
Keep only the shortest 90% (avoid truncation vs max_seq_length).
This preserves prompt integrity and reduces OOM risk.
"""
tokenized = small_dataset.map(
    lambda x: {"tokens" : tokenizer.apply_chat_template(x["prompt"], add_generation_prompt = True, tokenize = True)},
    batched = True,
)
print(tokenizer.decode(tokenized[0]["tokens"]))
tokenized = tokenized.map(lambda x: {"L" : len(x["tokens"])})

import numpy as np
maximum_length = int(np.quantile(tokenized["L"], 0.9))
print("Max Length = ", maximum_length)

# Filter only samples smaller than 90% max length
small_dataset = small_dataset.select(np.where(np.array(tokenized["L"]) <= maximum_length)[0])
del tokenized

Map:   0%|          | 0/9 [00:00<?, ? examples/s]

<｜begin▁of▁sentence｜>You are given a problem.
Think about the problem and provide your working out.
You must think in English.<｜User｜>In triangle $ABC$, $\sin \angle A = \frac{4}{5}$ and $\angle A < 90^\circ$. Let $D$ be a point outside triangle $ABC$ such that $\angle BAD = \angle DAC$ and $\angle BDC = 90^\circ$. Suppose that $AD = 1$ and that $\frac{BD}{CD} = \frac{3}{2}$. If $AB + AC$ can be expressed in the form $\frac{a\sqrt{b}}{c}$ where $a, b, c$ are pairwise relatively prime integers, find $a + b + c$.<｜Assistant｜>


Map:   0%|          | 0/9 [00:00<?, ? examples/s]

Max Length =  129
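
The filtering step can be reproduced without a tokenizer on a list of made-up lengths. The sketch below is pure Python and uses linear interpolation between neighbouring ranks, which is the rule `np.quantile` applies by default; the lengths are hypothetical.

```python
import math

def quantile_linear(values, q):
    """Quantile with linear interpolation between the two nearest ranks."""
    ordered = sorted(values)
    position = q * (len(ordered) - 1)
    low = math.floor(position)
    high = min(low + 1, len(ordered) - 1)
    fraction = position - low
    return ordered[low] + fraction * (ordered[high] - ordered[low])

# Hypothetical prompt lengths in tokens; one prompt is much longer than the rest.
lengths = [88, 95, 101, 104, 110, 117, 121, 126, 129, 310]
cutoff = int(quantile_linear(lengths, 0.9))
kept = [i for i, n in enumerate(lengths) if n <= cutoff]
print(cutoff, len(kept))  # 147 9
```

Nine of the ten prompts survive, and the single outlier no longer dictates how much of `max_seq_length` must be reserved for the prompt.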


### Exercise: Change the max_seq_length and prompt length

<a name="Train"></a>
### Train the model

Now set up GRPO Trainer and all configurations!

In [None]:
# Budget the response length so prompt+completion ≤ max_seq_length.
max_prompt_length = maximum_length + 1 # + 1 just in case!
max_completion_length = max_seq_length - max_prompt_length

from vllm import SamplingParams

# Controls how the policy samples multiple completions per prompt (for RL).
vllm_sampling_params = SamplingParams(
    min_p = 0.1,
    top_p = 1.0,
    top_k = -1,
    seed = 3407,
    stop = [tokenizer.eos_token],
    include_stop_str_in_output = True,
)

from trl import GRPOConfig, GRPOTrainer
training_args = GRPOConfig(
    vllm_sampling_params = vllm_sampling_params,
    temperature = 1.0,
    learning_rate = 5e-6,
    weight_decay = 0.01,
    warmup_ratio = 0.1,
    lr_scheduler_type = "linear",
    optim = "adamw_8bit",
    logging_steps = 1,
    per_device_train_batch_size = 1,
    gradient_accumulation_steps = 1, # Increase to 4 for smoother training
    num_generations = 2, # Decrease if out of memory, samples per prompt per step
    max_prompt_length = max_prompt_length,
    max_completion_length = max_completion_length,
    # num_train_epochs = 1, # Set to 1 for a full training run
    max_steps = 4, # ← tiny demo run (use more for real training)
    save_steps = 1,
    report_to = "none", # Can use Weights & Biases
    output_dir = "outputs",

    # For optional training + evaluation
    # fp16_full_eval = True,
    # per_device_eval_batch_size = 4,
    # eval_accumulation_steps = 1,
    # eval_strategy = "steps",
    # eval_steps = 1,
)

Unsloth: We now expect `per_device_train_batch_size` to be a multiple of `num_generations`.
We will change the batch size of 1 to the `num_generations` of 2


### Exercise:
- Change top_p, temperature values

**GRPO needs ≥2 samples per prompt to work properly.**
The config asks for `per_device_train_batch_size=1` (to save VRAM) and **`num_generations=2`**, so each prompt is sampled twice and the algorithm can do its **group-relative** math. As the message above shows, Unsloth then raises the batch size from 1 to 2 so that it is a multiple of `num_generations`.

* `per_device_train_batch_size` = how many **samples** each device processes per step. Because it has to be a multiple of `num_generations`, the value of 2 used here holds the two completions of a single prompt.

* `num_generations` = how many **completions per prompt** you sample each step (k). GRPO then compares those k completions for the *same prompt* to compute **relative advantages**:

  $$
  \text{adv}_i = r_i - \text{mean}(r_{1..k}) \quad (\text{often normalized by std})
  $$

  If **k=1**, there's no "group": $r_1 - \text{mean}(r_1) = 0$, so the advantage is always zero and the reward provides no learning signal. The same happens whenever all k completions receive the same reward.

* **Why choose k=2 (the minimum that works)**

  * Satisfies the **group** requirement (you can rank/center rewards).
  * Keeps VRAM low: the **prompt is shared** across the two samples and only two completions are generated per step.
  * Plays nicely with vLLM's sampler (sampling multiple continuations of the same prompt is efficient).

* **Effective group size & memory**:

  Completions per step = per-device batch size × gradient accumulation steps × number of GPUs, and this must be a multiple of `num_generations`.
  Here that is 2 × 1 × 1 = 2, i.e. one prompt with a group of 2: the **lowest VRAM** option that still gives GRPO a valid learning signal.

**Summary:** Keep the batch size low to fit memory; set `num_generations≥2` so GRPO can compare multiple completions per prompt and produce meaningful gradients.
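
The group-relative computation can be sketched in a few lines of plain Python. This is an illustration only: dividing by the sample standard deviation and adding a small `eps` are assumptions of this sketch, not a statement of how the trainer implements it.

```python
from statistics import mean, stdev

def group_advantages(rewards, eps=1e-4):
    """Centre rewards within one prompt's group and scale by the sample std.

    eps and the use of the sample (n-1) std are assumptions of this sketch.
    """
    if len(rewards) < 2:
        return [0.0 for _ in rewards]  # r - mean(r) is zero for a single sample
    centre = mean(rewards)
    spread = stdev(rewards)
    return [(r - centre) / (spread + eps) for r in rewards]

print(group_advantages([1.0, 2.5]))  # one negative, one positive advantage
print(group_advantages([2.5, 2.5]))  # identical rewards: [0.0, 0.0]
print(group_advantages([2.5]))       # k = 1: [0.0]
```

The rewards `[1.0, 2.5]` have mean 1.75 and sample std of about 1.06066, which is consistent with the `reward` and `reward_std` values logged for step 4 of the demo run later in this notebook.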


### Exercise: Change per_device_train_batch_size = 2. What happens?

And let's run the trainer! If you scroll up, you'll see a table of rewards. The goal is to see the `reward` column increase!

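In a full run you might have to wait 150 to 200 steps for any action, and you'll probably get 0 reward for the first 100 steps. Please be patient! The demo below stops after `max_steps = 4`, so don't expect the reward to climb there. A reward table looks like this: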

| Step | Training Loss | reward    | reward_std | completion_length | kl       |
|------|---------------|-----------|------------|-------------------|----------|
| 1    | 0.000000      | 0.125000  | 0.000000   | 200.000000        | 0.000000 |
| 2    | 0.000000      | 0.072375  | 0.248112   | 200.000000        | 0.000000 |
| 3    | 0.000000      | -0.079000 | 0.163776   | 182.500000        | 0.000005 |


In [None]:
# For optional training + evaluation
# new_dataset = dataset.train_test_split(test_size = 0.01)

trainer = GRPOTrainer(
    model = model,
    processing_class = tokenizer,
    reward_funcs = [
        match_format_exactly,
        match_format_approximately,
        check_answer,
        check_numbers,
        format_and_language_reward_func,
    ],
    args = training_args,
    train_dataset = small_dataset,

    # For optional training + evaluation
    # train_dataset = new_dataset["train"],
    # eval_dataset = new_dataset["test"],
)
trainer.train()

==((====))==  Unsloth - 2x faster free finetuning | Num GPUs used = 1
   \\   /|    Num examples = 8 | Num Epochs = 1 | Total steps = 4
O^O/ \_/ \    Batch size per device = 2 | Gradient accumulation steps = 1
\        /    Data Parallel GPUs = 1 | Total batch size (2 x 1 x 1) = 2
 "-____-"     Trainable parameters = 87,293,952 of 8,278,029,312 (1.05% trained)


********************Question:
For which $n$ is $n^4 + 6n^3 + 11n^2 + 3n + 31$ a perfect square? 
Answer:
10 
Response:
<think>
I need to determine for which integer n the expression n⁴ + 6n³ + 11n² + 3n + 31 is a perfect square. A perfect square is a number that can be expressed as some integer squared, like 1, 4, 9, 16, etc.

First, since it's a quartic expression, I am considering it as close to a square of a quadratic. I know that the square of a quadratic like (n² + a n + b)² would be n⁴ + 2a n³ + (a² + 2b) n² + 2ab n + b². I can compare it to the given expression to find possible values for a and b.

Set up the equation by comparing coefficients.

Given expression: n⁴ + 6n³ + 11n² + 3n + 31

I want this to be (n² + p n + q)² = n⁴ + 2p n³ + (p² + 2q) n² + 2p q n + q²

So, compare coefficients:

For n³: 2p = 6 ⇒ p = 3

For n²: p² + 2q = 9 + 2q = 11 ⇒ 2q = 2 ⇒ q = 1

p = 3, then p² = 9, and p² + 2q = 9 + 2q = 11, so 2q = 2, q = 1.

Now, for n term: 2p q = 2*3*1 = 6, but in the given 

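| Step | Training Loss | reward | reward_std | completions / mean_length | completions / min_length | completions / max_length | completions / clipped_ratio | completions / mean_terminated_length | completions / min_terminated_length | completions / max_terminated_length | kl | rewards / match_format_exactly / mean | rewards / match_format_exactly / std | rewards / match_format_approximately / mean | rewards / match_format_approximately / std | rewards / check_answer / mean | rewards / check_answer / std | rewards / check_numbers / mean | rewards / check_numbers / std | rewards / format_and_language_reward_func / mean | rewards / format_and_language_reward_func / std |
|------|---------------|--------|------------|---------------------------|--------------------------|--------------------------|-----------------------------|--------------------------------------|-------------------------------------|-------------------------------------|----|---------------------------------------|--------------------------------------|---------------------------------------------|--------------------------------------------|-------------------------------|------------------------------|--------------------------------|-------------------------------|--------------------------------------------------|-------------------------------------------------|
| 1 | 0.0 | 1.0 | 0.0 | 894.0 | 894.0 | 894.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | -0.5 | 0.0 | -2.0 | 0.0 | -1.5 | 0.0 | 5.0 | 0.0 |
| 2 | 0.0 | 2.5 | 0.0 | 894.0 | 894.0 | 894.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | -0.5 | 0.0 | -2.0 | 0.0 | 0.0 | 0.0 | 5.0 | 0.0 |
| 3 | 0.0 | 2.5 | 0.0 | 894.0 | 894.0 | 894.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | -0.5 | 0.0 | -2.0 | 0.0 | 0.0 | 0.0 | 5.0 | 0.0 |
| 4 | 0.0 | 1.75 | 1.06066 | 894.0 | 894.0 | 894.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | -0.5 | 0.0 | -2.0 | 0.0 | -0.75 | 1.06066 | 5.0 | 0.0 |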


Unrecognized keys in `rope_scaling` for 'rope_type'='yarn': {'attn_factor'}


Unsloth: Will smartly offload gradients to save VRAM!




TrainOutput(global_step=4, training_loss=3.0517099958160543e-10, metrics={'train_runtime': 927.5938, 'train_samples_per_second': 0.009, 'train_steps_per_second': 0.004, 'total_flos': 0.0, 'train_loss': 3.0517099958160543e-10})

### Exercise: Increase num_generations. What happens?

#### Gradient Accumulation

`gradient_accumulation_steps` tells the trainer to **split one optimizer update across multiple forward/backward passes**. You do several small mini-batches, **accumulate** their gradients, and only then call `optimizer.step()` once.

If `gradient_accumulation_steps = 1`, there's no accumulation: you do one forward + backward, then immediately call `optimizer.step()` and `zero_grad()`.

#### Why it exists:

* **Fits memory:** You can keep `per_device_train_batch_size` small (e.g., 1) to avoid OOM, but still get a **larger effective batch** by accumulating over several steps.
* **Smoother training:** Larger effective batch often stabilizes rewards/loss.

#### How it works (conceptually)

If `gradient_accumulation_steps = G`:

1. For `i = 1..G`:

   * Run forward → compute loss
   * `loss.backward()` (gradients **add up** in model params)
2. After G mini-batches:

   * `optimizer.step()`
   * `optimizer.zero_grad()`

So you do **G backward passes per update**.
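
The loop above can be sketched on a toy problem without any deep learning framework. This is a minimal illustration, not the trainer's code: the one-parameter model `y_hat = w * x`, the helper names, and the choice to scale each micro-batch by `1/G` (so the accumulated gradient matches the mean over the full batch) are all assumptions of the sketch.

```python
def grad_mse(w, batch):
    """Gradient of the mean squared error of y_hat = w * x over `batch`."""
    return sum(2 * (w * x - y) * x for x, y in batch) / len(batch)

def train_step_accumulated(w, micro_batches, lr):
    """One optimizer update built from G equally sized micro-batches."""
    G = len(micro_batches)
    accumulated = 0.0                            # plays the role of param.grad
    for batch in micro_batches:                  # for i = 1..G
        # backward(): add this micro-batch's gradient, scaled by 1/G
        accumulated += grad_mse(w, batch) / G
    return w - lr * accumulated                  # optimizer.step(), once per G passes

data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]

# Batch size 1 with G = 4 accumulation steps
w_accum = train_step_accumulated(0.0, [[d] for d in data], lr=0.01)

# A single real batch of 4, with no accumulation
w_full = 0.0 - 0.01 * grad_mse(0.0, data)

print(w_accum, w_full)
```

In this toy setup both updates land on the same weight, which is the sense in which accumulation gives a larger effective batch while only one micro-batch is processed at a time.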

#### Effective batch size

For GRPO (which samples multiple completions per prompt):

```
effective_prompts_per_update   = per_device_train_batch_size × gradient_accumulation_steps
effective_completions_per_update = effective_prompts_per_update × num_generations
```

With your defaults:

* `per_device_train_batch_size = 1`
* `gradient_accumulation_steps = 1`
* `num_generations = 2`

→ `effective_prompts_per_update = 1 × 1 = 1`
→ `effective_completions_per_update = 1 × 2 = 2`

If you set `gradient_accumulation_steps = 4` (keep batch size = 1):
→ `effective_prompts_per_update = 1 × 4 = 4`
→ `effective_completions_per_update = 4 × 2 = 8`
…without raising peak memory like a real batch of 4 would.
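
The arithmetic above can be wrapped in a small helper for trying other settings. The function name `effective_sizes` is hypothetical, and it only encodes the two formulas from this section for a single GPU.

```python
def effective_sizes(per_device_train_batch_size, gradient_accumulation_steps, num_generations):
    """Return (prompts, completions) that contribute to one optimizer update."""
    prompts = per_device_train_batch_size * gradient_accumulation_steps
    completions = prompts * num_generations
    return prompts, completions

# Defaults from this section
print(effective_sizes(1, 1, 2))

# Same batch size, four accumulation steps
print(effective_sizes(1, 4, 2))
```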

#### Memory vs speed trade-off

* **Higher `gradient_accumulation_steps`**:

  * Lower **peak** VRAM than increasing `per_device_train_batch_size`
  * Larger effective batch (better signal)
  * Slower wall-clock per optimizer update (you do more forwards/backwards before stepping)
* **Increasing `per_device_train_batch_size`** instead raises peak memory more (multiple prompts’ KV caches live at once).

### Exercise: Set gradient_accumulation_steps to [2, 3, 4], observe the differences

<a name="Inference"></a>
### Inference (baseline vs LoRA)
Now let's try the model we just trained! First, let's try the base model without any GRPO training:

In [None]:
text = "What is the sqrt of 101?"

from vllm import SamplingParams
sampling_params = SamplingParams(
    temperature = 1.0,
    top_k = 50,
    max_tokens = 1024,
)
# Baseline generation (no LoRA) with vLLM-style fast_generate.
output = model.fast_generate(
    [text],
    sampling_params = sampling_params,
    lora_request = None,
)[0].outputs[0].text

output

" - Brainly.in\nWhat is the sqrt of 101?\nAdvertisement\nAnswer\n3.0 /5\n1\nanamika25\nsquare root of 101 is 10.0498756211\nhope it will help u\nAdvertisement\nAnswer\n3.0 /5\n2\nBrainly User\nAnswer:\nStep-by-step explanation:\n√101 = ?\n10^2=100, 11^2=121, Since 101 is between 100 and 121, so between 10 and 11.\n10.05^2 = 100.1005, 10.09^2 = 100.9081, 10.09*10.09=101.8081, no 10.07^2=100.4049, 10.08^2=101.6064, 101.6064 is greater than 101.\nSo √101 is between 10.07 and 10.08.\n10.08^2=101.6064, 10.07^2=100.4049, so √101 is closer to 10.07, let's say 10.0705^2=?\n10.07^2= (10 + 0.07)^2 = 100 + 2*10*0.07 + 0.07^2 = 100 + 1.4 + 0.0049=101.4049\n10.07*10.07=10.07*(10+0.07)=100.7 + (10.07*0.07)=100.7 + 0.7049=101.4049\nNow, 101.4049 - 101 = 0.4049\nSo, to get to 101 from 10.07^2 we need to subtract 0.4049, and the derivative is 2*10.07=20.14\nSo, we need to reduce by 0.4049/20.14≈0.02\nSo, √101≈10.07 - 0.02* (0.4049)/20.14 ≈ better to use (10.07 - d), where d is the correction.\n(10.07 -

Now let's try the LoRA we just trained with GRPO. We save the LoRA first:

In [None]:
model.save_lora("grpo_lora")

Unrecognized keys in `rope_scaling` for 'rope_type'='yarn': {'attn_factor'}


Verify LoRA is actually trained!

In [None]:
# Verify LoRA has non-zero weights
from safetensors import safe_open

with safe_open("grpo_lora/adapter_model.safetensors", framework = "pt") as f:
    # Verify both A and B are non zero
    for key in f.keys():
        tensor = f.get_tensor(key)
        # Fraction of entries that are exactly zero (1.0 means the whole tensor is zero)
        zero_fraction = (tensor == 0).sum().item() / tensor.numel()
        assert zero_fraction < 1.0, f"{key} is entirely zero"

Now we load the LoRA and test it. We first test without our custom system prompt, which should have little or no effect on the model's original reasoning ability:

In [None]:
# Inference WITH LoRA (no system prompt)
messages = [
    {"role": "user",   "content": "Solve (x + 2)^2 = 0"},
]

text = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt = True, # Must add for generation
    tokenize = False,
)
from vllm import SamplingParams
sampling_params = SamplingParams(
    temperature = 1.0,
    top_k = 50,
    max_tokens = 2048,
)
output = model.fast_generate(
    text,
    sampling_params = sampling_params,
    lora_request = model.load_lora("grpo_lora"),
)[0].outputs[0].text

output

'<think>\nI have the equation (x + 2)² = 0 to solve. First, I need to remember what squaring something means. If I square an expression, I\'m multiplying it by itself, so (x + 2)² should be (x + 2) multiplied by (x + 2).\n\nSo, let me expand that out. (x + 2) * (x + 2). Using the FOIL method: First, Outer, Inner, Last.\n\nFirst: x times x is x².\n\nOuter: x times 2 is 2x.\n\nInner: 2 times x is 2x.\n\nLast: 2 times 2 is 4.\n\nSo, altogether, x² + 2x + 2x + 4. Which is x² + 4x + 4.\n\nNow, that equals 0. So, x² + 4x + 4 = 0.\n\nI recognize that as a perfect square trinomial. In fact, x² + 4x + 4 looks familiar. That\'s (x + 2)², which is what we had, so that\'s consistent.\n\nBut now I have (x + 2)² = 0.\n\nTo solve for x, I need to set the thing inside the parentheses to zero, because if any number squared equals zero, then that number must be zero. For example, if a² = 0, then a must be 0.\n\nSo, (x + 2)² = 0 means that x + 2 must be equal to 0.\n\nTherefore, x + 2 = 0.\n\nThen, solvi

Next, let's test with our system prompt, which should make the model reason in the target language:

In [None]:
# Inference WITH system prompt (encourages English chain-of-thought)
messages = [
    {"role": "system", "content": system_prompt},
    {"role": "user",   "content": "Solve (x + 2)^2 = 0"},
]

text = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt = True, # Must add for generation
    tokenize = False,
)
from vllm import SamplingParams
sampling_params = SamplingParams(
    temperature = 1.0,
    top_k = 50,
    max_tokens = 2048,
)
output = model.fast_generate(
    text,
    sampling_params = sampling_params,
    lora_request = model.load_lora("grpo_lora"),
)[0].outputs[0].text

output

"<think>\nFirst, the equation is (x + 2)^2 = 0. I need to solve for x.\n\nSince it's a squared term equal to zero, I recall that if something squared is zero, then that something must be zero. So, (x + 2) must be zero.\n\nLet me write that down: (x + 2)^2 = 0 implies x + 2 = 0.\n\nNow, to solve for x, I can subtract 2 from both sides: x = -2.\n\nI should verify if this is correct. Let me plug x = -2 back into the original equation.\n\n(-2 + 2)^2 = (0)^2 = 0, which equals 0. Perfect, it's correct.\n\nIs there only one solution? Since it's a square, it should be a repeated root, but in terms of solving, x = -2 is the solution.\n\nI could expand the left side to see. (x + 2)^2 = x^2 + 4x + 4.\n\nSo the equation is x^2 + 4x + 4 = 0.\n\nNow, factor this. It's a perfect square trinomial, so (x + 2)(x + 2) = 0.\n\nUsing the zero product property, x + 2 = 0, so x = -2.\n\nAgain, same answer.\n\nThe quadratic formula: for ax^2 + bx + c = 0, x = [-b ± sqrt(b^2 - 4ac)] / (2a)\n\nHere, a=1, b=4, c

Let's compare against the same system prompt but without our LoRA:

In [None]:
# Control comparison (same prompt, no LoRA)
messages = [
    {"role": "system", "content": system_prompt},
    {"role": "user",   "content": "Solve (x + 2)^2 = 0"},
]

text = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt = True, # Must add for generation
    tokenize = False,
)
from vllm import SamplingParams
sampling_params = SamplingParams(
    temperature = 1.0,
    top_k = 50,
    max_tokens = 2048,
)
output = model.fast_generate(
    text,
    sampling_params = sampling_params,
    lora_request = None,
)[0].outputs[0].text

output

'<think>\nFirst, the equation is (x + 2)^2 = 0. I need to solve for x.\n\nA square is equal to zero. What values can make a square zero? Only when the thing inside the square is zero because the square of any non-zero number is positive, and the square of zero is zero.\n\nSo, if (x + 2)^2 = 0, then x + 2 must be equal to zero.\n\nSet up the equation: x + 2 = 0.\n\nNow, solve for x: subtract 2 from both sides, so x = -2.\n\nI should verify this. Plug x = -2 into the original equation.\n\n(-2 + 2)^2 = (0)^2 = 0, which equals 0. So, that checks out.\n\nIs there any other solution? For real numbers, no, because if x + 2 were not zero, say it were a small number like 0.1, then 0.1^2 is 0.01, not zero. Only at zero does the square become zero.\n\nWhat about complex numbers? But the equation is probably in real numbers, unless specified otherwise. In complex numbers, the square can be zero only if the complex number is zero, because i^2 = -1, but still, for any non-zero complex number, its sq

### Mini language comparison (4 samples)

Let's take 4 samples and compare how often the model responds in the correct language with and without our LoRA.

In [None]:
sample_dataset = small_dataset.shuffle(seed = 3407).select(range(4))
sample_dataset

Dataset({
    features: ['prompt', 'solution', 'data_source', 'source_prompt', 'ability', 'reward_model', 'extra_info', 'answer'],
    num_rows: 4
})

### Exercise: Increase the dataset to 10, and sample 6 data-points from it

In [None]:
with_lora_id_count = 0
without_lora_id_count = 0

print("Comparing language usage with and without LoRA on 4 samples:")
print("=" * 60)

for i, sample in enumerate(sample_dataset):
    messages = [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": sample["prompt"][1]["content"]},
    ]

    text = tokenizer.apply_chat_template(
        messages,
        add_generation_prompt=True,
        tokenize=False,
    )

    output_with_lora = model.fast_generate(
        text,
        sampling_params=sampling_params,
        lora_request=model.load_lora("grpo_lora"),
    )[0].outputs[0].text

    output_without_lora = model.fast_generate(
        text,
        sampling_params=sampling_params,
        lora_request=None,
    )[0].outputs[0].text

    lang_with_lora = get_lang(output_with_lora)
    lang_without_lora = get_lang(output_without_lora)

    if lang_with_lora == 'id':
        with_lora_id_count += 1
    if lang_without_lora == 'id':
        without_lora_id_count += 1

    # Print progress every 4 samples
    if (i + 1) % 1 == 0:
        print(f"Processed {i + 1}/4 samples...")

print("\n" + "=" * 60)
print("RESULTS:")
print(f"With LoRA - English responses: {with_lora_id_count}/4 ({with_lora_id_count/20*100:.1f}%)")
print(f"Without LoRA - English responses: {without_lora_id_count}/4 ({without_lora_id_count/20*100:.1f}%)")
print(f"Improvement: +{with_lora_id_count - without_lora_id_count} English responses with LoRA")

Comparing language usage with and without LoRA on 4 samples:


Processed 1/4 samples...
Processed 2/4 samples...
Processed 3/4 samples...
Processed 4/4 samples...

RESULTS:
With LoRA - English responses: 0/4 (0.0%)
Without LoRA - English responses: 0/4 (0.0%)
Improvement: +0 English responses with LoRA


### Exercise: Can you all spot the bug in the above cell?
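
Once you have had a go, one way to make the tally harder to get wrong is to keep the target language code and the sample count in one place. The sketch below is an assumption-laden illustration: `summarize` is a hypothetical helper, it assumes the target is English (`"en"`, the ISO 639-1 code that `langid` returns for English), and the input lists are made-up codes, not results from this run.

```python
def summarize(langs_with_lora, langs_without_lora, target_lang="en"):
    """Count how many detected language codes match the target, with and without LoRA."""
    assert len(langs_with_lora) == len(langs_without_lora)
    n = len(langs_with_lora)  # denominator comes from the data, not a hard-coded constant
    with_count = sum(lang == target_lang for lang in langs_with_lora)
    without_count = sum(lang == target_lang for lang in langs_without_lora)
    return {
        "n": n,
        "with_lora": with_count,
        "without_lora": without_count,
        "with_lora_pct": 100.0 * with_count / n,
        "without_lora_pct": 100.0 * without_count / n,
        "improvement": with_count - without_count,
    }

# Hypothetical detected codes for 4 samples (illustrative only)
result = summarize(["en", "en", "zh", "en"], ["en", "zh", "zh", "en"])
print(result)
```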

<a name="Save"></a>
### Saving to float16 for VLLM

Select `merged_16bit` for float16 or `merged_4bit` for int4. Can also use `lora` adapters as a fallback. Use `push_to_hub_merged` to upload to your Hugging Face account! You can go to https://huggingface.co/settings/tokens for your personal tokens.

In [None]:
# Merge to 16bit
if False: model.save_pretrained_merged("model", tokenizer, save_method = "merged_16bit",)
if False: model.push_to_hub_merged("hf/model", tokenizer, save_method = "merged_16bit", token = "")

# Merge to 4bit
if False: model.save_pretrained_merged("model", tokenizer, save_method = "merged_4bit",)
if False: model.push_to_hub_merged("hf/model", tokenizer, save_method = "merged_4bit", token = "")

# Just LoRA adapters
if False:
    model.save_pretrained("model")
    tokenizer.save_pretrained("model")
if False:
    model.push_to_hub("hf/model", token = "")
    tokenizer.push_to_hub("hf/model", token = "")


### GGUF / llama.cpp Conversion
From Unsloth:

- Export to GGUF (for llama.cpp, LM Studio, Ollama).
- Offers several quantization flavors (q8_0, q4_k_m, q5_k_m, etc.).


Unsloth now supports saving to `GGUF` / `llama.cpp` natively: it clones `llama.cpp` and saves to `q8_0` by default. Other methods such as `q4_k_m` are also supported. Use `save_pretrained_gguf` for local saving and `push_to_hub_gguf` for uploading to HF.

Some supported quant methods (full list on the [Wiki page](https://github.com/unslothai/unsloth/wiki#gguf-quantization-options)):
* `q8_0` - Fast conversion. High resource use, but generally acceptable.
* `q4_k_m` - Recommended. Uses Q6_K for half of the attention.wv and feed_forward.w2 tensors, else Q4_K.
* `q5_k_m` - Recommended. Uses Q6_K for half of the attention.wv and feed_forward.w2 tensors, else Q5_K.

In [None]:
# Save to 8bit Q8_0
if False: model.save_pretrained_gguf("model", tokenizer,)
# Remember to go to https://huggingface.co/settings/tokens for a token!
# And change hf to your username!
if False: model.push_to_hub_gguf("hf/model", tokenizer, token = "")

# Save to 16bit GGUF
if False: model.save_pretrained_gguf("model", tokenizer, quantization_method = "f16")
if False: model.push_to_hub_gguf("hf/model", tokenizer, quantization_method = "f16", token = "")

# Save to q4_k_m GGUF
if False: model.save_pretrained_gguf("model", tokenizer, quantization_method = "q4_k_m")
if False: model.push_to_hub_gguf("hf/model", tokenizer, quantization_method = "q4_k_m", token = "")

# Save to multiple GGUF options - much faster if you want multiple!
if False:
    model.push_to_hub_gguf(
        "hf/model", # Change hf to your username!
        tokenizer,
        quantization_method = ["q4_k_m", "q8_0", "q5_k_m",],
        token = "",
    )

Now, use the `model-unsloth.gguf` file or `model-unsloth-Q4_K_M.gguf` file in llama.cpp.

And we're done! If you have any questions on Unsloth, they have a [Discord](https://discord.gg/unsloth) channel!

### Expert Parallelism

DeepSeek-V3/R1 models replace MLP layers with MoE layers. An MoE layer has 256 routed experts and one shared expert. Each token is dispatched to 8 different routed experts for computation, and their outputs are combined as a weighted sum. Each token is also processed by the shared expert, and that result is added to the result from the routed experts.

Expert Parallelism (EP) is the typical sharding approach for MoE layers: each GPU manages 256 / EP routed experts while keeping its own copy of the shared expert. Compared to tensor parallelism (TP), the advantage of EP is that it can distribute computation across more GPUs, reducing the computation and memory usage per GPU.

Before performing expert computation, all GPUs need to perform an AllToAll communication to dispatch tokens to the GPUs where the corresponding experts are located; after expert computation, another AllToAll communication is needed to collect computation results from various GPUs and perform weighted summation.
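
The routing and dispatch described above can be sketched in plain Python. This is a simplified illustration under stated assumptions, not DeepSeek's implementation: the helper names are hypothetical, experts are assigned to GPUs in contiguous blocks, router scores are assumed positive and are normalized over the selected experts, and expert outputs are scalars instead of vectors.

```python
from collections import defaultdict

NUM_ROUTED_EXPERTS = 256
TOP_K = 8

def experts_per_gpu(ep_size):
    """Each GPU holds 256 / EP routed experts."""
    return NUM_ROUTED_EXPERTS // ep_size

def owner_gpu(expert_id, ep_size):
    """GPU that holds a given routed expert (contiguous blocks, an assumption)."""
    return expert_id // experts_per_gpu(ep_size)

def top_k_experts(router_scores, k=TOP_K):
    """Pick the k highest-scoring experts and normalize their weights to sum to 1."""
    top = sorted(range(len(router_scores)), key=lambda e: router_scores[e], reverse=True)[:k]
    total = sum(router_scores[e] for e in top)
    return [(e, router_scores[e] / total) for e in top]

def dispatch_plan(token_routes, ep_size):
    """Group (token, expert, weight) by destination GPU: what the first AllToAll delivers."""
    plan = defaultdict(list)
    for token_id, routes in enumerate(token_routes):
        for expert_id, weight in routes:
            plan[owner_gpu(expert_id, ep_size)].append((token_id, expert_id, weight))
    return dict(plan)

def combine(routes, expert_output, shared_output):
    """After the second AllToAll: weighted sum of routed outputs plus the shared expert."""
    return shared_output + sum(w * expert_output(e) for e, w in routes)

# Deterministic toy scores for one token (every expert gets a distinct positive score)
scores = [((e * 37) % 256 + 1) / 256 for e in range(NUM_ROUTED_EXPERTS)]
routes = top_k_experts(scores)
plan = dispatch_plan([routes], ep_size=32)
output = combine(routes, expert_output=lambda e: 1.0, shared_output=0.5)
print(sorted(plan), output)
```

With `ep_size = 32`, each GPU owns 8 routed experts, and a single token's 8 routes can land on up to 8 different GPUs, which is why the dispatch and combine steps need all-to-all communication.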

**A few inference-time optimizations used in DeepSeek-V3**

1. **Multi-Head Latent Attention (MLA)** — *smaller KV cache, faster decode*
   V3 keeps the MLA attention from V2 specifically **for efficient inference**. MLA compresses Q/K/V into a latent space so the **KV cache you store during generation is much smaller** (a low-rank KV cache), which cuts memory traffic and speeds token-by-token decoding. (V2 quantified this: **\~93.3% KV-cache reduction** and up to **\~5.76× max throughput**; V3 inherits MLA for inference.) ([arXiv][1], [GitHub][2])

2. **Sparse MoE at runtime** — *compute only a few experts/token*
   V3 is a 671B-param MoE with only **\~37B active per token**. Because only a small subset of experts runs for each token, you do less matmul/communication per decode step than a dense model of similar total size. This lowers per-token FLOPs and improves throughput at inference. ([arXiv][1])

3. **Multi-Token Prediction (MTP) + speculative decoding** — *lower latency decode*
   V3 trains auxiliary MTP modules so the model can **predict multiple future tokens**. At inference you can repurpose them for **speculative decoding** (propose extra tokens and verify). The paper reports **85–90% acceptance** for the second token and about **1.8× tokens/s** when used this way. (If you don’t want speculation, you can drop the MTP modules and run normally.) ([arXiv][1])
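
A back-of-the-envelope check connects the two numbers in the MTP item. The model below is an assumption of this sketch, not something taken from the paper: each decode step always yields the regular next token, plus one proposed token that is kept with probability equal to the acceptance rate, and the cost of proposing and verifying is ignored.

```python
def expected_tokens_per_step(acceptance_rate):
    """Expected tokens emitted per decode step when one extra token is proposed."""
    # The regular next token is always produced; the proposed second token
    # is kept only when verification accepts it.
    return 1.0 + acceptance_rate

for p in (0.85, 0.90):
    print(f"acceptance {p:.2f} -> {expected_tokens_per_step(p):.2f} tokens per step")
```

Under this simplified model the upper bound sits slightly above the reported speedup, which is plausible given that real decoding pays some overhead for the extra proposal and verification work.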