To run this, press "*Runtime*" and press "*Run all*" on a **free** Tesla T4 Google Colab instance!
<div class="align-center">
<a href="https://unsloth.ai/"><img src="https://github.com/unslothai/unsloth/raw/main/images/unsloth%20new%20logo.png" width="115"></a>
<a href="https://discord.gg/unsloth"><img src="https://github.com/unslothai/unsloth/raw/main/images/Discord button.png" width="145"></a>
<a href="https://unsloth.ai/docs/"><img src="https://github.com/unslothai/unsloth/blob/main/images/documentation%20green%20button.png?raw=true" width="125"></a> Join Discord if you need help + ⭐ <i>Star us on <a href="https://github.com/unslothai/unsloth">Github</a> </i> ⭐
</div>

To install Unsloth on your local device, follow [our guide](https://unsloth.ai/docs/get-started/install). This notebook is licensed [LGPL-3.0](https://github.com/unslothai/notebooks?tab=LGPL-3.0-1-ov-file#readme).

You will learn how to do [data prep](#Data), how to [train](#Train), how to [run the model](#Inference), and how to save it.

### News


Train MoEs - DeepSeek, GLM, Qwen and gpt-oss faster with 32% less VRAM. [Blog](https://unsloth.ai/docs/new/faster-moe)

You can now train embedding models 1.8-3.3x faster with 20% less VRAM. [Blog](https://unsloth.ai/docs/new/embedding-finetuning)

Ultra Long-Context Reinforcement Learning is here with 7x more context windows! [Blog](https://unsloth.ai/docs/new/grpo-long-context)

3x faster LLM training with 30% less VRAM and 500K context. [3x faster](https://unsloth.ai/docs/new/3x-faster-training-packing) • [500K Context](https://unsloth.ai/docs/new/500k-context-length-fine-tuning)

New in Reinforcement Learning: [FP8 RL](https://unsloth.ai/docs/new/fp8-reinforcement-learning) • [Vision RL](https://unsloth.ai/docs/new/vision-reinforcement-learning-vlm-rl) • [Standby](https://unsloth.ai/docs/basics/memory-efficient-rl) • [gpt-oss RL](https://unsloth.ai/docs/new/gpt-oss-reinforcement-learning)

Visit our docs for all our [model uploads](https://unsloth.ai/docs/get-started/unsloth-model-catalog) and [notebooks](https://unsloth.ai/docs/get-started/unsloth-notebooks).

### Installation

In [None]:
%%capture
import os
os.environ["UNSLOTH_VLLM_STANDBY"] = "1" # [NEW] Extra 30% context lengths!
if "COLAB_" not in "".join(os.environ.keys()):
    # If you're not in Colab, just use pip install or uv pip install
    !pip install unsloth vllm
else:
    pass # For Colab / Kaggle, we need extra instructions hidden below \/

In [None]:
#@title Colab Extra Install { display-mode: "form" }
%%capture
import os
!pip install --upgrade -qqq uv
if "COLAB_" not in "".join(os.environ.keys()):
    # If you're not in Colab, just use pip install!
    !pip install unsloth vllm
else:
    try: import numpy, PIL; _numpy = f'numpy=={numpy.__version__}'; _pil = f'pillow=={PIL.__version__}'
    except: _numpy = "numpy"; _pil = "pillow"
    try: import subprocess; is_t4 = "Tesla T4" in str(subprocess.check_output(["nvidia-smi"]))
    except: is_t4 = False
    _vllm, _triton = ('vllm==0.9.2', 'triton==3.2.0') if is_t4 else ('vllm==0.10.2', 'triton')
    !uv pip install -qqq --upgrade {_vllm} {_numpy} {_pil} torchvision bitsandbytes xformers unsloth
    !uv pip install -qqq {_triton}
!uv pip install transformers==4.56.2
!uv pip install --no-deps trl==0.22.2

### Unsloth

Goal: To convert `Llama-3.2-1B-Instruct` into a reasoning model via GRPO by using OpenR1's Math dataset.

We first pre fine-tune the model on the target output format so that GRPO does not have to spend steps learning the formatting - this speeds GRPO up.

In [4]:
import os
os.environ['UNSLOTH_VLLM_STANDBY'] = "1" # Unsloth Standby reduces VRAM by 30%+
from unsloth import FastLanguageModel
import torch
max_seq_length = 2048 # Can increase for longer reasoning traces
lora_rank = 32 # Larger rank = smarter, but slower

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/Llama-3.2-1B-Instruct",
    max_seq_length = max_seq_length,
    load_in_4bit = False, # False for LoRA 16bit
    fast_inference = True, # Enable vllm fast inference
    max_lora_rank = lora_rank,
    load_in_fp8 = True, # Float8 RL / GRPO!
)

model = FastLanguageModel.get_peft_model(
    model,
    r = lora_rank, # Choose any number > 0 ! Suggested 8, 16, 32, 64, 128
    target_modules = [
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
    lora_alpha = lora_rank*2, # *2 speeds up training
    use_gradient_checkpointing = "unsloth", # Reduces memory usage
    random_state = 3407,
)

🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
🦥 Unsloth Zoo will now patch everything to make training faster!
INFO 11-25 15:07:49 [vllm_utils.py:702] Unsloth: Patching vLLM v1 graph capture
==((====))==  Unsloth 2025.11.3: Fast Llama patching. Transformers: 4.57.1. vLLM: 0.11.2.
   \\   /|    NVIDIA L4. Num GPUs = 1. Max memory: 22.161 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.9.0+cu126. CUDA: 8.9. CUDA Toolkit: 12.6. Triton: 3.5.0
\        /    Bfloat16 = TRUE. FA [Xformers = 0.0.33.post1. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!
Unsloth: vLLM loading unsloth/Llama-3.2-1B-Instruct-FP8-Block with actual GPU utilization = 86.62%
Unsloth: Your GPU has CUDA compute capability 8.9 with VRAM = 22.16 GB.
Unsloth: Using conservativeness = 1.0. Chunked prefill tokens = 2048. Num Sequences = 288.
Unsloth: vLLM's KV Cache can use up to 17.53 GB. 

  PydanticSerializationUnexpectedValue(Expected `enum` - serialized value may not be as expected [field_name='mode', input_value=3, input_type=int])
  return self.serializer.to_python(

INFO 11-25 15:07:59 [model.py:631] Resolved architecture: LlamaForCausalLM
INFO 11-25 15:07:59 [model.py:1745] Using max model len 2048
INFO 11-25 15:08:01 [scheduler.py:216] Chunked prefill is enabled with max_num_batched_tokens=2048.
INFO 11-25 15:08:02 [core.py:93] Initializing a V1 LLM engine (v0.11.2) with config: model='unsloth/Llama-3.2-1B-Instruct-FP8-Block', speculative_config=None, tokenizer='unsloth/Llama-3.2-1B-Instruct-FP8-Block', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=2048, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, data_parallel_size=1, disable_custom_all_reduce=False, quantization=compressed-tensors, enforce_eager=False, kv_cache_dtype=auto, device_config=cuda, structured_outputs_config=StructuredOutputsConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reaso

  PydanticSerializationUnexpectedValue(Expected `enum` - serialized value may not be as expected [field_name='mode', input_value=3, input_type=int])
  return self.serializer.to_python(

INFO 11-25 15:08:03 [topk_topp_sampler.py:36] Using FlashInfer for top-p & top-k sampling.
INFO 11-25 15:08:03 [gpu_model_runner.py:3259] Starting to load model unsloth/Llama-3.2-1B-Instruct-FP8-Block...
INFO 11-25 15:08:04 [cuda.py:377] Using AttentionBackendEnum.FLASHINFER backend.
INFO 11-25 15:08:05 [weight_utils.py:481] No model.safetensors.index.json found in remote.

Loading safetensors checkpoint shards:   0% Completed | 0/1 [00:00<?, ?it/s]

INFO 11-25 15:08:05 [default_loader.py:314] Loading weights took 0.48 seconds
INFO 11-25 15:08:06 [punica_selector.py:20] Using PunicaWrapperGPU.
INFO 11-25 15:08:07 [gpu_model_runner.py:3338] Model loading took 1.4721 GiB memory and 1.841706 seconds
INFO 11-25 15:08:17 [backends.py:631] Using cache directory: /root/.cache/vllm/torch_compile_cache/fbb9cbcbfe/rank_0_0/backbone for vLLM's torch.compile
INFO 11-25 15:08:17 [backends.py:647] Dynamo bytecode transform time: 9.86 s
INFO 11-25 15:08:20 [backends.py:210] Directly load the compiled graph(s) for dynamic shape from the cache, took 2.227 s
INFO 11-25 15:08:22 [monitor.py:34] torch.compile takes 12.09 s in total
INFO 11-25 15:08:23 [gpu_worker.py:359] Available KV cache memory: 17.20 GiB
INFO 11-25 15:08:24 [kv_cache_utils.py:1229] GPU KV cache size: 563,744 tokens
INFO 11-25 15:08:24 [kv_cache_utils.py:1234] Maximum concurrency for 2,048 tokens per request: 275.27x
INFO 11-25 15:08:24 [kernel_warmup.py:65] Warming up FlashInfer at

Capturing CUDA graphs (mixed prefill-decode, PIECEWISE):   0%|          | 0/102 [00:00<?, ?it/s]



Capturing CUDA graphs (mixed prefill-decode, PIECEWISE): 100%|██████████| 102/102 [00:07<00:00, 13.08it/s]
Capturing CUDA graphs (decode, FULL): 100%|██████████| 74/74 [00:05<00:00, 12.83it/s]

INFO 11-25 15:08:38 [gpu_model_runner.py:4244] Graph capturing finished in 14 secs, took 1.15 GiB
INFO 11-25 15:08:38 [vllm_utils.py:714] Unsloth: Patched vLLM v1 graph capture finished in 14 secs.



INFO 11-25 15:08:39 [core.py:250] init engine (profile, create kv cache, warmup model) took 32.46 seconds
INFO 11-25 15:08:41 [llm.py:352] Supported tasks: ('generate',)
Unsloth: Just some info: will skip parsing ['k_norm', 'post_feedforward_layernorm', 'layer_norm2', 'post_layernorm', 'norm2', 'norm', 'q_norm', 'layer_norm1', 'input_layernorm', 'attention_norm', 'norm1', 'pre_feedforward_layernorm', 'ffn_norm', 'post_attention_layernorm']
Performing substitution for additional_keys=set()
Unsloth: Just some info: will skip parsing ['cross_attn_post_attention_layernorm', 'k_norm', 'post_feedforward_layernorm', 'cross_attn_input_layernorm', 'layer_norm2', 'post_layernorm', 'norm2', 'norm', 'q_norm', 'layer_norm1', 'input_layernorm', 'attention_norm', 'norm1', 'pre_feedforward_layernorm', 'ffn_norm', 'post_attention_layernorm']

Unsloth 2025.11.3 patched 16 layers with 16 QKV layers, 16 O layers and 16 MLP layers.
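
As a sanity check on the LoRA setup above, the sketch below counts the adapter weights by hand. The layer dimensions (hidden size 2048, MLP size 8192, 16 layers, 8 key/value heads of dimension 64) are assumptions taken from the published Llama-3.2-1B config, not from this notebook. Each adapted linear layer of shape `in -> out` adds `r * (in + out)` weights.

```python
# Back-of-envelope LoRA parameter count for the configuration above.
# Dimensions are assumed from the public Llama-3.2-1B config.
lora_rank = 32
hidden, mlp, kv_dim, num_layers = 2048, 8192, 512, 16  # kv_dim = 8 KV heads * 64

# (in_features, out_features) for every target module in one decoder layer
shapes = {
    "q_proj": (hidden, hidden),
    "k_proj": (hidden, kv_dim),
    "v_proj": (hidden, kv_dim),
    "o_proj": (hidden, hidden),
    "gate_proj": (hidden, mlp),
    "up_proj": (hidden, mlp),
    "down_proj": (mlp, hidden),
}

# LoRA adds A (r x in) and B (out x r) for each adapted module
per_layer = sum(lora_rank * (i + o) for i, o in shapes.values())
total = per_layer * num_layers
print(f"{total:,}")
```

Under these assumptions the total comes to 22,544,384, which agrees with the trainable-parameter count Unsloth reports when training starts.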

In [5]:
messages = [
    {"role": "user", "content": "Solve x^5 + 3x^4 - 10 = 3."},
]
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt = True,
    return_tensors = "pt",
    return_dict = True,
    reasoning_effort = "low",
).to(model.device)
from transformers import TextStreamer
_ = model.generate(**inputs, max_new_tokens = 512, use_cache = True, streamer = TextStreamer(tokenizer))

<|begin_of_text|><|start_header_id|>system<|end_header_id|>

Cutting Knowledge Date: December 2023
Today Date: 25 Nov 2025

<|eot_id|><|start_header_id|>user<|end_header_id|>

Solve x^5 + 3x^4 - 10 = 3.<|eot_id|><|start_header_id|>assistant<|end_header_id|>

To solve the equation x^5 + 3x^4 - 10 = 3, we need to isolate the variable x.

First, let's add 10 to both sides of the equation:

x^5 + 3x^4 = 3 + 10
x^5 + 3x^4 = 13

Next, let's subtract 13 from both sides of the equation:

x^5 + 3x^4 - 13 = 0

Now, we have a fifth-degree polynomial equation. Unfortunately, it's not possible to solve this equation analytically (i.e., we can't find an exact solution using standard algebraic techniques). However, we can try to find an approximate solution using numerical methods.

One way to do this is to use the Newton-Raphson method, which is a powerful tool for finding roots of real-valued functions. The Newton-Raphson method is based on the idea of iteratively improving an initial guess for the

Let's also call the model as is through vLLM, passing the raw prompt without a chat template:

In [6]:
text = "What is the sqrt of 101?"

from vllm import SamplingParams
sampling_params = SamplingParams(
    temperature = 1.0,
    top_k = 50,
    max_tokens = 512,
)
output = model.fast_generate(
    [text],
    sampling_params = sampling_params,
    lora_request = None,
)[0].outputs[0].text

output



Adding requests:   0%|          | 0/1 [00:00<?, ?it/s]

Processed prompts:   0%|          | 0/1 [00:00<?, ?it/s, est. speed input: 0.00 toks/s, output: 0.00 toks/s]

' in the form a^2\nA) 10.1\nB) 10.1\nC) 10.1\nD) 10.1\nE) 10.1\nF) 10.1\nG) 10.1\nH) 10.1\nI) 10.1\nJ) 10.1\nI) 10.1\nJ) 10.1\nH) 10.1\nI) 10.1\nI) 10.1\nI) 10.1\nI) 10.1\nI) 10.1\nI) 10.1\nI) 10.1\nI) 10.1\nI) 10.1\nI) 10.1\nI) 10.1\nI) 10.1\nI) 10.1\nI) 10.1\nI) 10.1\nI) 10.1\nI) 10.1\nI) 10.1\nI) 10.1\nI) 10.1\nI) 10.1\nI) 10.1\nI) 10.1\nI) 10.1\nI) 10.1\nI) 10.1\nI) 10.1\nI) 10.1\nI) 10.1\nI) 10.1\nI) 10.1\nI) 10.1\nI) 10.1\nI) 10.1\nI) 10.1\nI) 10.1\nI) 10.1\nI) 10.1\nI) 10.1\nI) 10.1\nI) 10.1\nI) 10.1\nI) 10.1\nI) 10.1\nI) 10.1\nI) 10.1\nI) 10.1\nI) 10.1\nI) 10.1\nI) 10.1\nI) 10.1\nI) 10.1\nI) 10.1\nI) 10.1\nI) 10.1\nI) 10.1\nI) 10.1\nI) 10.1\nI) 10.1\nI'

### GRPO chat template
`Llama-3.2-1B-Instruct` already ships with a chat template, but for GRPO we replace it with a custom one. You can make your own chat template as well!
1. DeepSeek uses `<think>` and `</think>`, but this is **not** necessary - you can customize it however you like!
2. A `system_prompt` is recommended to at least guide the model's responses.

In [7]:
reasoning_start = "<start_working_out>" # Acts as <think>
reasoning_end   = "<end_working_out>"   # Acts as </think>
solution_start  = "<SOLUTION>"
solution_end    = "</SOLUTION>"

system_prompt = \
f"""You are given a problem.
Think about the problem and provide your working out.
Place it between {reasoning_start} and {reasoning_end}.
Then, provide your solution between {solution_start}{solution_end}"""
system_prompt

'You are given a problem.\nThink about the problem and provide your working out.\nPlace it between <start_working_out> and <end_working_out>.\nThen, provide your solution between <SOLUTION></SOLUTION>'

We create a simple chat template below. Notice that when `add_generation_prompt` is set, the template appends `<start_working_out>` to the prompt, so the model's output begins inside its reasoning section.

In [8]:
chat_template = \
    "{% if messages[0]['role'] == 'system' %}"\
        "{{ messages[0]['content'] + eos_token }}"\
        "{% set loop_messages = messages[1:] %}"\
    "{% else %}"\
        "{{ '{system_prompt}' + eos_token }}"\
        "{% set loop_messages = messages %}"\
    "{% endif %}"\
    "{% for message in loop_messages %}"\
        "{% if message['role'] == 'user' %}"\
            "{{ message['content'] }}"\
        "{% elif message['role'] == 'assistant' %}"\
            "{{ message['content'] + eos_token }}"\
        "{% endif %}"\
    "{% endfor %}"\
    "{% if add_generation_prompt %}{{ '{reasoning_start}' }}"\
    "{% endif %}"

# Replace without specific template:
chat_template = chat_template\
    .replace("'{system_prompt}'",   f"'{system_prompt}'")\
    .replace("'{reasoning_start}'", f"'{reasoning_start}'")
tokenizer.chat_template = chat_template
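
To make the template logic easier to follow, here is a plain-Python sketch of the same branching. It is an illustration only, not how the tokenizer renders templates. The `render` helper, the shortened `DEFAULT_SYSTEM` string, and the `<|eot_id|>` EOS token are assumptions made for this example.

```python
# Plain-Python mirror of the Jinja chat template (illustration only).
EOS = "<|eot_id|>"  # assumed EOS token
REASONING_START = "<start_working_out>"
DEFAULT_SYSTEM = "You are given a problem."  # shortened stand-in for system_prompt

def render(messages, add_generation_prompt=False):
    out = []
    if messages and messages[0]["role"] == "system":
        out.append(messages[0]["content"] + EOS)   # explicit system message
        rest = messages[1:]
    else:
        out.append(DEFAULT_SYSTEM + EOS)           # fall back to the default
        rest = messages
    for m in rest:
        if m["role"] == "user":
            out.append(m["content"])               # user turns get no EOS
        elif m["role"] == "assistant":
            out.append(m["content"] + EOS)         # assistant turns end with EOS
    if add_generation_prompt:
        out.append(REASONING_START)                # model continues from here
    return "".join(out)

rendered = render(
    [{"role": "user", "content": "What is 2+2?"}],
    add_generation_prompt=True,
)
print(rendered)
```

The key design choice is that the generation prompt ends with the reasoning tag, so the model never has to emit `<start_working_out>` itself.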

Let's see how our chat template behaves on an example:

In [9]:
tokenizer.apply_chat_template([
    {"role" : "user", "content" : "What is 1+1?"},
    {"role" : "assistant", "content" : f"{reasoning_start}I think it's 2.{reasoning_end}{solution_start}2{solution_end}"},
    {"role" : "user", "content" : "What is 2+2?"},
], tokenize = False, add_generation_prompt = True)

"You are given a problem.\nThink about the problem and provide your working out.\nPlace it between <start_working_out> and <end_working_out>.\nThen, provide your solution between <SOLUTION></SOLUTION><|eot_id|>What is 1+1?<start_working_out>I think it's 2.<end_working_out><SOLUTION>2</SOLUTION><|eot_id|>What is 2+2?<start_working_out>"

### Pre fine-tuning for formatting
We now use a subset of NVIDIA's [Open Math Reasoning dataset](https://huggingface.co/datasets/nvidia/OpenMathReasoning) which was filtered to only include high quality DeepSeek R1 traces.

We'll keep only a small number of examples (78 in the run shown below) to first "prime" / pre fine-tune the model to understand our custom GRPO formatting.

In [10]:
from datasets import load_dataset
import pandas as pd
import numpy as np

dataset = load_dataset("unsloth/OpenMathReasoning-mini", split = "cot")
dataset = dataset.to_pandas()[
    ["expected_answer", "problem", "generated_solution"]
]

# Try converting to number - if not, replace with NaN
is_number = pd.to_numeric(pd.Series(dataset["expected_answer"]), errors = "coerce").notnull()
# Select only numbers
dataset = dataset.iloc[np.where(is_number)[0]]

dataset

Unnamed: 0,expected_answer,problem,generated_solution
0,14,Given $\sqrt{x^2+165}-\sqrt{x^2-52}=7$ and $x$...,"<think>\nOkay, let's see. I need to solve the ..."
6,-2,Find the value of the parameter $a$ for which ...,"<think>\nOkay, so I need to find the value of ..."
9,18,What is the sum of all real numbers $x$ for wh...,"<think>\nOkay, so I need to solve the equation..."
13,2,Evaluate the sum \(\sum_{n=1}^\infty \frac{\ph...,"<think>\nOkay, so I need to evaluate the infin..."
17,30,What is the largest positive integer that divi...,"<think>\nAlright, so I need to find the larges..."
...,...,...,...
19243,244,"Let \( p \), \( q \), and \( r \) be the disti...","<think>\nOkay, so I need to find the value of ..."
19245,1,A bug is on the $0$ of a number line. At any p...,"<think>\nOkay, so I have this problem where a ..."
19247,4,A bus left point X for point Y. Two hours late...,"<think>\nOkay, let's tackle this problem step ..."
19248,18,Each interior angle of a regular n-gon measure...,"<think>\nOkay, let's see. I need to find the n..."


We have to format the dataset to follow our GRPO style formatting:

In [11]:
def format_dataset(x):
    expected_answer = x["expected_answer"]
    problem = x["problem"]

    # Remove generated <think> and </think>
    thoughts = x["generated_solution"]
    thoughts = thoughts.replace("<think>", "").replace("</think>", "")

    # Strip newlines on left and right
    thoughts = thoughts.strip()
    # Add our custom formatting
    final_prompt = \
        reasoning_start + thoughts + reasoning_end + \
        solution_start + expected_answer + solution_end
    return [
        {"role" : "system",    "content" : system_prompt},
        {"role" : "user",      "content" : problem},
        {"role" : "assistant", "content" : final_prompt},
    ]

dataset["Messages"] = dataset.apply(format_dataset, axis = 1)

Check to see if it worked:

In [12]:
tokenizer.apply_chat_template(dataset["Messages"][0], tokenize = False)

"You are given a problem.\nThink about the problem and provide your working out.\nPlace it between <start_working_out> and <end_working_out>.\nThen, provide your solution between <SOLUTION></SOLUTION><|eot_id|>Given $\\sqrt{x^2+165}-\\sqrt{x^2-52}=7$ and $x$ is positive, find all possible values of $x$.<start_working_out>Okay, let's see. I need to solve the equation √(x² + 165) - √(x² - 52) = 7, and find all positive values of x. Hmm, radicals can be tricky, but maybe if I can eliminate the square roots by squaring both sides. Let me try that.\n\nFirst, let me write down the equation again to make sure I have it right:\n\n√(x² + 165) - √(x² - 52) = 7.\n\nOkay, so the idea is to isolate one of the radicals and then square both sides. Let me try moving the second radical to the other side:\n\n√(x² + 165) = 7 + √(x² - 52).\n\nNow, if I square both sides, maybe I can get rid of the square roots. Let's do that:\n\n(√(x² + 165))² = (7 + √(x² - 52))².\n\nSimplifying the left side:\n\nx² + 165

Let's keep only the examples that are at most `max_seq_length/2` tokens long, since we don't want overly long reasoning traces.

Note this might take 2 minutes!

In [13]:
dataset["N"] = dataset["Messages"].apply(lambda x: len(tokenizer.apply_chat_template(x)))

dataset = dataset.loc[dataset["N"] <= max_seq_length/2].copy()
dataset.shape

(78, 5)

We then render the messages to text with the chat template (`tokenize = False`) and convert the result to a Hugging Face compatible dataset format:

In [14]:
from datasets import Dataset

dataset["text"] = tokenizer.apply_chat_template(dataset["Messages"].values.tolist(), tokenize = False)
dataset = Dataset.from_pandas(dataset)
dataset

Dataset({
    features: ['expected_answer', 'problem', 'generated_solution', 'Messages', 'N', 'text', '__index_level_0__'],
    num_rows: 78
})

Let's now pre fine-tune the model so it follows our custom GRPO formatting!

In [15]:
from trl import SFTTrainer, SFTConfig
trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = dataset,
    args = SFTConfig(
        dataset_text_field = "text",
        per_device_train_batch_size = 2,
        gradient_accumulation_steps = 2, # Use GA to mimic batch size!
        warmup_steps = 10,
        # num_train_epochs = 1, # Set this for 1 full training run.
        max_steps = 100,
        learning_rate = 2e-4, # Reduce to 2e-5 for long training runs
        logging_steps = 5,
        optim = "adamw_8bit",
        weight_decay = 0.01,
        lr_scheduler_type = "linear",
        seed = 3407,
        report_to = "none", # Use this for WandB etc
    ),
)

Unsloth: Tokenizing ["text"] (num_proc=16):   0%|          | 0/78 [00:00<?, ? examples/s]

In [18]:
trainer.train()

==((====))==  Unsloth - 2x faster free finetuning | Num GPUs used = 1
   \\   /|    Num examples = 78 | Num Epochs = 5 | Total steps = 100
O^O/ \_/ \    Batch size per device = 2 | Gradient accumulation steps = 2
\        /    Data Parallel GPUs = 1 | Total batch size (2 x 2 x 1) = 4
 "-____-"     Trainable parameters = 22,544,384 of 1,258,418,176 (1.79% trained)

Unsloth: Will smartly offload gradients to save VRAM!

Step,Training Loss
5,1.24
10,1.0507
15,0.8373
20,0.8241
25,0.7531
30,0.6299
35,0.6464
40,0.6212
45,0.5587
50,0.4951


TrainOutput(global_step=100, training_loss=0.6145384001731873, metrics={'train_runtime': 100.5113, 'train_samples_per_second': 3.98, 'train_steps_per_second': 0.995, 'total_flos': 2211030762307584.0, 'train_loss': 0.6145384001731873, 'epoch': 5.0})
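
The epoch count in the log follows from the batch settings. The sketch below is rough arithmetic under the values shown in this run (78 examples, batch size 2, gradient accumulation 2, `max_steps = 100`); the trainer's exact rounding may differ in other versions.

```python
import math

num_examples = 78
per_device_batch = 2
grad_accum = 2
max_steps = 100

effective_batch = per_device_batch * grad_accum              # samples per optimizer step
steps_per_epoch = math.ceil(num_examples / effective_batch)  # optimizer steps per pass
epochs = math.ceil(max_steps / steps_per_epoch)              # passes needed for max_steps
print(effective_batch, steps_per_epoch, epochs)
```

With an effective batch of 4, one pass over the data takes about 20 steps, so 100 steps correspond to the 5 epochs reported above.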

Before checking whether the model has learnt to follow the custom format, we free up cached GPU memory:

In [19]:
import gc
for _ in range(5):
    torch.cuda.empty_cache()
    gc.collect()

Let's use Unsloth normal inference (not vLLM):

In [21]:
text = tokenizer.apply_chat_template(
    dataset[0]["Messages"][:2],
    tokenize = False,
    add_generation_prompt = True, # Must add for generation
)

from transformers import TextStreamer
_ = model.generate(
    **tokenizer(text, return_tensors = "pt").to("cuda"),
    temperature = 1,
    max_new_tokens = 256,
    streamer = TextStreamer(tokenizer, skip_prompt = False),
)

<|begin_of_text|>You are given a problem.
Think about the problem and provide your working out.
Place it between <start_working_out> and <end_working_out>.
Then, provide your solution between <SOLUTION></SOLUTION><|eot_id|>Jenifer has 82 cents in pennies and nickels. Her younger brother mistook all her nickels for dimes and counted the total as $1.47. How many pennies does Jenifer have?<start_working_out>Okay, let's see. So Jenifer has 82 cents in pennies and nickels. Her younger brother thought all the nickels were dimes and counted the total as $1.47. I need to find out how many pennies she has.

First, let me recall how much a nickel and a dime are. A nickel is 5 cents, and a dime is 10 cents. So if her brother thought all nickels were dimes, then the total amount he counted would be 82 cents divided by 0.1 dollars per nickel, right? Because each nickel is 5 cents, so 82 divided by 0.1 would be 820, which is 82 dollars. But that's not correct. Let me check again. If there are 27 nic

Let's verify via vLLM fast inference:

In [22]:
from vllm import SamplingParams
sampling_params = SamplingParams(
    temperature = 1.0,
    top_k = 50,
    max_tokens = 256,
)
output = model.fast_generate(
    [text],
    sampling_params = sampling_params,
    lora_request = None,
)[0].outputs[0].text

output

INFO 11-25 15:14:02 [abstract.py:324] It took 0.094236 seconds to wake up tags {'weights', 'kv_cache'}.

Adding requests:   0%|          | 0/1 [00:00<?, ?it/s]

Processed prompts:   0%|          | 0/1 [00:00<?, ?it/s, est. speed input: 0.00 toks/s, output: 0.00 toks/s]

' <SOLUTION> <end_working_out>\n## Step 1:  To find the number of pennies, we first need to understand the given information. Jenifer has 82 cents, and when her brother mistook all her nickels for dimes, the total value of the "dimes" (which is the value of the nickels) was $1.47. This means the value of 47 nickels (100% of 47 nickels) is $1.47.\n\n## Step 2:  We can set up the equation to find the value of 1 nickel. If 47 nickels are worth $1.47, we can find the value of 1 nickel as $1.47 / 47. This will give us the value in cents for 1 nickel.\n\n## Step 3:  Converting the value to cents, $1.47 is equal to 147 cents. So, 1 nickel is worth 147 cents.\n\n## Step 4:  As 1 nickel is worth 147 cents, 1 dime (or 10 nickels) is worth 10 * 147 = 1470 cents. This is the value in cents of 1 "dime" (or '

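The Unsloth inference output above picks up the step-by-step reasoning style of the pre fine-tuning data. The vLLM call passes `lora_request = None`, so its output may not reflect the LoRA weights we just trained. Let's remove some items before the GRPO step: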

In [23]:
del dataset
torch.cuda.empty_cache()
import gc
gc.collect()

27

### Data Prep
<a name="Data"></a>

We're using Hugging Face's [Open R1 Math dataset](https://huggingface.co/datasets/open-r1/DAPO-Math-17k-Processed). You can also use OpenAI's famous [GSM8K dataset](https://huggingface.co/datasets/openai/gsm8k).

In [24]:
from datasets import load_dataset
dataset = load_dataset("open-r1/DAPO-Math-17k-Processed", "en", split = "train")
dataset

README.md: 0.00B [00:00, ?B/s]

en/train-00000-of-00001.parquet:   0%|          | 0.00/5.23M [00:00<?, ?B/s]

Generating train split:   0%|          | 0/14116 [00:00<?, ? examples/s]

Dataset({
    features: ['prompt', 'solution', 'data_source', 'source_prompt', 'ability', 'reward_model', 'extra_info'],
    num_rows: 14116
})

Let's look at the first row:

In [25]:
dataset[0]["prompt"]

'In triangle $ABC$, $\\sin \\angle A = \\frac{4}{5}$ and $\\angle A < 90^\\circ$. Let $D$ be a point outside triangle $ABC$ such that $\\angle BAD = \\angle DAC$ and $\\angle BDC = 90^\\circ$. Suppose that $AD = 1$ and that $\\frac{BD}{CD} = \\frac{3}{2}$. If $AB + AC$ can be expressed in the form $\\frac{a\\sqrt{b}}{c}$ where $a, b, c$ are pairwise relatively prime integers, find $a + b + c$.'

In [26]:
dataset[0]["solution"]

'34'

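In GSM8K, every answer ends with a `####` marker followed by the final result, so we would extract the text after it. For the Open R1 dataset the `solution` column already holds only the answer, so the function below simply returns its input.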

In [27]:
def extract_hash_answer(text):
    # if "####" not in text: return None
    # return text.split("####")[1].strip()
    return text
extract_hash_answer(dataset[0]["solution"])

'34'
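
For reference, this is a minimal sketch of what the commented-out GSM8K branch would do. The function name `extract_hash_answer_gsm8k` and the sample string are made up for illustration.

```python
def extract_hash_answer_gsm8k(text):
    # GSM8K-style answers put the final result after a "####" marker
    if "####" not in text:
        return None
    return text.split("####")[1].strip()

sample = "She sells 3 + 4 = 7 eggs.\n#### 7"  # made-up GSM8K-style string
print(extract_hash_answer_gsm8k(sample))
print(extract_hash_answer_gsm8k("34"))  # no marker, so nothing is extracted
```

Because Open R1 solutions such as `'34'` contain no marker, this variant would return `None` for them, which is why the notebook keeps the pass-through version.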

Let's map the dataset and look at the first row:

In [28]:
dataset = dataset.map(lambda x: {
    "prompt" : [
        {"role": "system", "content": system_prompt},
        {"role": "user",   "content": x["prompt"]},
    ],
    "answer": extract_hash_answer(x["solution"]),
})
dataset[0]

Map:   0%|          | 0/14116 [00:00<?, ? examples/s]

{'prompt': [{'content': 'You are given a problem.\nThink about the problem and provide your working out.\nPlace it between <start_working_out> and <end_working_out>.\nThen, provide your solution between <SOLUTION></SOLUTION>',
   'role': 'system'},
  {'content': 'In triangle $ABC$, $\\sin \\angle A = \\frac{4}{5}$ and $\\angle A < 90^\\circ$. Let $D$ be a point outside triangle $ABC$ such that $\\angle BAD = \\angle DAC$ and $\\angle BDC = 90^\\circ$. Suppose that $AD = 1$ and that $\\frac{BD}{CD} = \\frac{3}{2}$. If $AB + AC$ can be expressed in the form $\\frac{a\\sqrt{b}}{c}$ where $a, b, c$ are pairwise relatively prime integers, find $a + b + c$.',
   'role': 'user'}],
 'solution': '34',
 'data_source': 'math_dapo',
 'source_prompt': [{'content': 'Solve the following math problem step by step. The last line of your response should be of the form Answer: $Answer (without quotes) where $Answer is the answer to the problem.\n\nIn triangle $ABC$, $\\sin \\angle A = \\frac{4}{5}$ and $

We create a regex format to match the reasoning sections and answers:

In [29]:
import re

# Add optional EOS token matching
solution_end_regex = r"</SOLUTION>[\s]{0,}" + \
    "(?:" + re.escape(tokenizer.eos_token) + ")?"

match_format = re.compile(
    rf"{reasoning_end}.*?"\
    rf"{solution_start}(.+?){solution_end_regex}"\
    rf"[\s]{{0,}}$",
    flags = re.MULTILINE | re.DOTALL
)
match_format

re.compile(r'<end_working_out>.*?<SOLUTION>(.+?)</SOLUTION>[\s]{0,}(?:<\|eot_id\|>)?[\s]{0,}$',
re.MULTILINE|re.DOTALL|re.UNICODE)

We verify it works:

In [30]:
match_format.findall(
    "Let me think!<end_working_out>"\
    f"<SOLUTION>\n2\n</SOLUTION>",
)

['\n2\n']

In [31]:
match_format.findall(
    "<start_working_out>Let me think!<end_working_out>"\
    f"<SOLUTION>  2  </SOLUTION>\n\n",
)

['  2  ']
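
It is also worth checking inputs that should not match. The sketch below rebuilds the same pattern with the EOS token hard-coded as `<|eot_id|>` (an assumption based on the compiled regex shown above) and tries a response without the reasoning end tag and one with trailing text on the same line.

```python
import re

# Same pattern as above, with the EOS token assumed to be "<|eot_id|>"
pattern = re.compile(
    r"<end_working_out>.*?<SOLUTION>(.+?)</SOLUTION>[\s]{0,}"
    r"(?:<\|eot_id\|>)?[\s]{0,}$",
    flags=re.MULTILINE | re.DOTALL,
)

ok       = "thinking<end_working_out><SOLUTION>42</SOLUTION><|eot_id|>"
no_end   = "thinking<SOLUTION>42</SOLUTION>"                        # tag missing
trailing = "thinking<end_working_out><SOLUTION>42</SOLUTION> and more text"

print(pattern.findall(ok))
print(pattern.findall(no_end))
print(pattern.findall(trailing))
```

Only the first string yields a capture. Note that with `re.MULTILINE` the final `$` matches at the end of any line, so text on a later line after `</SOLUTION>` would still be accepted.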

We now want to create a reward function to match the format exactly - we reward it with 3 points if it succeeds:

In [32]:
def match_format_exactly(completions, **kwargs):
    scores = []
    for completion in completions:
        score = 0
        response = completion[0]["content"]
        # Match if format is seen exactly!
        if match_format.search(response) is not None: score += 3.0
        scores.append(score)
    return scores

If it fails, we still want to reward the model for following the format partially, by counting how often each tag appears:

In [33]:
def match_format_approximately(completions, **kwargs):
    scores = []
    for completion in completions:
        score = 0
        response = completion[0]["content"]
        # Count how many keywords are seen - we penalize if too many!
        # If we see 1, then plus some points!

        # No need to reward <start_working_out> since we always prepend it!
        # score += 0.5 if response.count(reasoning_start) == 1 else -1.0
        score += 0.5 if response.count(reasoning_end)   == 1 else -1.0
        score += 0.5 if response.count(solution_start)  == 1 else -1.0
        score += 0.5 if response.count(solution_end)    == 1 else -1.0
        scores.append(score)
    return scores
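
To see the scoring in action, here is a compact single-response sketch of the same rule. The helper name `partial_format_score` and the three sample strings are made up for illustration.

```python
TAGS = ["<end_working_out>", "<SOLUTION>", "</SOLUTION>"]

def partial_format_score(response):
    # +0.5 for each tag that appears exactly once, -1.0 otherwise
    return sum(0.5 if response.count(tag) == 1 else -1.0 for tag in TAGS)

good   = "work<end_working_out><SOLUTION>2</SOLUTION>"
empty  = "just an answer: 2"
double = "work<end_working_out><SOLUTION>2</SOLUTION></SOLUTION>"

for r in (good, empty, double):
    print(partial_format_score(r))
```

A fully tagged response earns 1.5, a response with no tags loses 3.0, and a repeated closing tag cancels out the credit from the other two tags.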

Finally, we want to extract the generated answer, and reward or penalize it! We also reward it based on how close the answer is to the true one via ratios:

In [34]:
def check_answer(prompts, completions, answer, **kwargs):
    question = prompts[0][-1]["content"]
    responses = [completion[0]["content"] for completion in completions]

    extracted_responses = [
        guess.group(1)
        if (guess := match_format.search(r)) is not None else None \
        for r in responses
    ]

    scores = []
    for guess, true_answer in zip(extracted_responses, answer):
        score = 0
        if guess is None:
            scores.append(-2.0)
            continue
        # Correct answer gets 5 points!
        if guess == true_answer:
            score += 5.0
        # Match if spaces are seen, but less reward
        elif guess.strip() == true_answer.strip():
            score += 3.5
        else:
            # We also reward it if the answer is close via ratios!
            # Ie if the answer is within some range, reward it!
            try:
                ratio = float(guess) / float(true_answer)
                if   ratio >= 0.9 and ratio <= 1.1: score += 2.0
                elif ratio >= 0.8 and ratio <= 1.2: score += 1.5
                else: score -= 2.5 # Penalize wrong answers
            except:
                score -= 4.5 # Penalize
        scores.append(score)
    return scores
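
The reward tiers are easier to compare side by side. The sketch below applies the same thresholds to a single extracted guess; `answer_score` is a hypothetical helper, and the case where no answer could be extracted (scored -2.0 above) is left out.

```python
def answer_score(guess, true_answer):
    # Mirrors the reward tiers of check_answer for one extracted guess
    if guess == true_answer:
        return 5.0                      # exact string match
    if guess.strip() == true_answer.strip():
        return 3.5                      # match up to surrounding whitespace
    try:
        ratio = float(guess) / float(true_answer)
    except (ValueError, ZeroDivisionError):
        return -4.5                     # not a usable number
    if 0.9 <= ratio <= 1.1:
        return 2.0                      # within 10% of the true answer
    if 0.8 <= ratio <= 1.2:
        return 1.5                      # within 20% of the true answer
    return -2.5                         # numeric but far off

for guess in ["34", " 34 ", "36", "40", "100", "abc"]:
    print(repr(guess), answer_score(guess, "34"))
```

The graded penalties give GRPO a smoother signal than a pure right-or-wrong reward: near misses are still distinguished from answers that are far off or not numeric at all.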

Sometimes the answer is not a single number but a sentence, for example "The solution is $20". In that case we extract the number 20.

We also handle thousands separators such as the comma in 123,456, which is removed before converting to a number.

In [35]:
match_numbers = re.compile(
    solution_start + r".*?[\s]{0,}([-]?[\d\.\,]{1,})",
    flags = re.MULTILINE | re.DOTALL
)
print(match_numbers.findall("<SOLUTION>  0.34  </SOLUTION>"))
print(match_numbers.findall("<SOLUTION>  123,456  </SOLUTION>"))
print(match_numbers.findall("<SOLUTION>  -0.234  </SOLUTION>"))
print(match_numbers.findall("<SOLUTION>17</SOLUTION>"))

['0.34']
['123,456']
['-0.234']
['17']
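
The sketch below combines the extraction regex with the float conversion used later, to show both the sentence case and one pitfall. The helper `to_float` is hypothetical. Because the character class also accepts `.` and `,`, a period earlier in the sentence can be captured instead of the number.

```python
import re

match_numbers = re.compile(
    r"<SOLUTION>" + r".*?[\s]{0,}([-]?[\d\.\,]{1,})",
    flags=re.MULTILINE | re.DOTALL,
)

def to_float(response):
    # Returns the first number-like token after <SOLUTION> as a float, else None
    m = match_numbers.search(response)
    if m is None:
        return None
    try:
        return float(m.group(1).strip().replace(",", ""))
    except ValueError:
        return None

print(to_float("<SOLUTION>The solution is $20</SOLUTION>"))
print(to_float("<SOLUTION>123,456</SOLUTION>"))
print(to_float("<SOLUTION>Approx. 20</SOLUTION>"))  # captures "." and fails to convert
```

In the last case the regex captures the lone period, the conversion fails, and no number is recovered.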

We now prepare our main function which will print out the generated responses and the true answer, along with another reward function which converts text to float via `float` and sees if it's the same.

In [36]:
PRINTED_TIMES = 0      # Number of times check_numbers has been called
PRINT_EVERY_STEPS = 5  # Print one sample every 5 calls

def check_numbers(prompts, completions, answer, **kwargs):
    question = prompts[0][-1]["content"]
    responses = [completion[0]["content"] for completion in completions]

    extracted_responses = [
        guess.group(1)
        if (guess := match_numbers.search(r)) is not None else None \
        for r in responses
    ]

    scores = []
    # Print only every few steps
    global PRINTED_TIMES
    global PRINT_EVERY_STEPS
    if PRINTED_TIMES % PRINT_EVERY_STEPS == 0:
        print(
            '*'*20 + f"Question:\n{question}", f"\nAnswer:\n{answer[0]}", f"\nResponse:\n{responses[0]}", f"\nExtracted:\n{extracted_responses[0]}"
        )
    PRINTED_TIMES += 1

    for guess, true_answer in zip(extracted_responses, answer):
        if guess is None:
            scores.append(-2.5)
            continue
        # Convert to numbers
        try:
            true_answer = float(true_answer.strip())
            # Remove commas like in 123,456
            guess       = float(guess.strip().replace(",", ""))
            scores.append(3.5 if guess == true_answer else -1.5)
        except:
            scores.append(0)
            continue
    return scores
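The shapes of the arguments are easy to miss: `completions` is a list with one entry per sample, and each entry is a list of message dicts. The sketch below uses a simplified stand-in, `score_numbers` (a hypothetical name), that drops the printing and keeps only the scoring. It again assumes the `<SOLUTION>` tag.

```python
import re

# Same pattern as match_numbers, with the <SOLUTION> tag assumed
match_numbers = re.compile(
    r"<SOLUTION>.*?[\s]{0,}([-]?[\d\.\,]{1,})",
    flags = re.MULTILINE | re.DOTALL,
)

def score_numbers(completions, answer):
    """Simplified stand-in for check_numbers: same scores, no printing."""
    scores = []
    for completion, true_answer in zip(completions, answer):
        found = match_numbers.search(completion[0]["content"])
        if found is None:
            scores.append(-2.5)  # no number could be extracted
            continue
        try:
            guess = float(found.group(1).strip().replace(",", ""))
            scores.append(3.5 if guess == float(true_answer.strip()) else -1.5)
        except ValueError:
            scores.append(0)     # extracted text is not a valid float
    return scores

completions = [
    [{"role": "assistant", "content": "<SOLUTION>18</SOLUTION>"}],     # correct
    [{"role": "assistant", "content": "<SOLUTION>1,800</SOLUTION>"}],  # wrong number
    [{"role": "assistant", "content": "<SOLUTION>1.2.3</SOLUTION>"}],  # unparseable
    [{"role": "assistant", "content": "I am not sure."}],              # no tag
]
print(score_numbers(completions, ["18"] * 4))
```

The four samples exercise the four possible outcomes: a correct number, a wrong number, text that cannot be parsed, and a missing solution tag.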

Compute the 90th percentile of the tokenized prompt lengths so we don't accidentally truncate prompts!

In other words, we'll drop the longest 10% of prompts.

In [37]:
tokenized = dataset.map(
    lambda x: {"tokens" : tokenizer.apply_chat_template(x["prompt"], add_generation_prompt = True, tokenize = True)},
    batched = True,
)
print(tokenizer.decode(tokenized[0]["tokens"]))
tokenized = tokenized.map(lambda x: {"L" : len(x["tokens"])})

import numpy as np
maximum_length = int(np.quantile(tokenized["L"], 0.9))
print("Max Length = ", maximum_length)

# Keep only samples whose prompt length is at or below the 90th percentile
dataset = dataset.select(np.where(np.array(tokenized["L"]) <= maximum_length)[0])
del tokenized

You are given a problem.
Think about the problem and provide your working out.
Place it between <start_working_out> and <end_working_out>.
Then, provide your solution between <SOLUTION></SOLUTION><|eot_id|>In triangle $ABC$, $\sin \angle A = \frac{4}{5}$ and $\angle A < 90^\circ$. Let $D$ be a point outside triangle $ABC$ such that $\angle BAD = \angle DAC$ and $\angle BDC = 90^\circ$. Suppose that $AD = 1$ and that $\frac{BD}{CD} = \frac{3}{2}$. If $AB + AC$ can be expressed in the form $\frac{a\sqrt{b}}{c}$ where $a, b, c$ are pairwise relatively prime integers, find $a + b + c$.<start_working_out>

Max Length =  198
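To make the cutoff concrete, here is a small pure-Python sketch on made-up token counts. The helper `quantile_linear` is hypothetical. It interpolates linearly between order statistics, which is assumed to match the default behaviour of `np.quantile`.

```python
def quantile_linear(values, q):
    """q-th quantile with linear interpolation between order statistics."""
    ordered = sorted(values)
    position = q * (len(ordered) - 1)
    lower = int(position)
    upper = min(lower + 1, len(ordered) - 1)
    fraction = position - lower
    return ordered[lower] + fraction * (ordered[upper] - ordered[lower])

# Made-up prompt lengths with a single very long outlier
prompt_lengths = [50, 60, 70, 80, 90, 100, 110, 120, 130, 140, 400]

cutoff = int(quantile_linear(prompt_lengths, 0.9))
kept = [n for n in prompt_lengths if n <= cutoff]
print(cutoff, len(kept), "of", len(prompt_lengths))
```

In the actual run, this filter reduces the dataset from 14,116 prompts to the 12,715 examples reported by the trainer, which is roughly 90% of the data.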

<a name="Train"></a>
### Train the model

Now set up the GRPO trainer and all of its configuration!

In [38]:
max_prompt_length = maximum_length + 1 # + 1 just in case!
max_completion_length = max_seq_length - max_prompt_length

from vllm import SamplingParams
vllm_sampling_params = SamplingParams(
    min_p = 0.1,
    top_p = 1.0,
    top_k = -1,
    seed = 3407,
    stop = [tokenizer.eos_token],
    include_stop_str_in_output = True,
)

from trl import GRPOConfig, GRPOTrainer
training_args = GRPOConfig(
    vllm_sampling_params = vllm_sampling_params,
    temperature = 1.0,
    learning_rate = 5e-6,
    weight_decay = 0.01,
    warmup_ratio = 0.1,
    lr_scheduler_type = "linear",
    optim = "adamw_8bit",
    logging_steps = 1,
    per_device_train_batch_size = 4,
    gradient_accumulation_steps = 1, # Increase to 4 for smoother training
    num_generations = 4, # Decrease if out of memory
    max_prompt_length = max_prompt_length,
    max_completion_length = max_completion_length,
    # num_train_epochs = 1, # Set to 1 for a full training run
    max_steps = 100,
    save_steps = 100,
    report_to = "none", # Can use Weights & Biases
    output_dir = "outputs",

    # For optional training + evaluation
    # fp16_full_eval = True,
    # per_device_eval_batch_size = 4,
    # eval_accumulation_steps = 1,
    # eval_strategy = "steps",
    # eval_steps = 1,
)
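Two quantities follow from this configuration: the token budget left for completions, and the shape of the learning-rate curve. The sketch below assumes `max_seq_length = 2048`, which is set earlier in the notebook and is consistent with the 1849-token maximum completion length reported in the training log. The `linear_lr` function is a hypothetical illustration of how a linear schedule with warmup is commonly defined; the exact step indexing inside the trainer may differ.

```python
import math

max_seq_length = 2048  # assumption: defined earlier in the notebook
maximum_length = 198   # 90th-percentile prompt length printed above

max_prompt_length = maximum_length + 1
max_completion_length = max_seq_length - max_prompt_length
print(max_prompt_length, max_completion_length)

learning_rate, warmup_ratio, max_steps = 5e-6, 0.1, 100
warmup_steps = math.ceil(max_steps * warmup_ratio)

def linear_lr(step):
    """Linear warmup to the peak rate, then linear decay to zero."""
    if step < warmup_steps:
        return learning_rate * step / warmup_steps
    return learning_rate * max(0, max_steps - step) / (max_steps - warmup_steps)

for step in (0, 5, 10, 55, 100):
    print(step, linear_lr(step))
```

With these settings, the first 10 steps are warmup, so the model barely moves at the very start of the run.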

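And let's run the trainer! The training log includes a table of rewards like the example below. The goal is to see the `reward` column increase!

You might have to wait 150 to 200 steps before the reward moves noticeably, and you'll probably see little or no reward for the first 100 steps. Since `max_steps = 100` above is only a short demo run, increase it (or set `num_train_epochs`) to see this. Please be patient!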

| Step | Training Loss | reward    | reward_std | completion_length | kl       |
|------|---------------|-----------|------------|-------------------|----------|
| 1    | 0.000000      | 0.125000  | 0.000000   | 200.000000        | 0.000000 |
| 2    | 0.000000      | 0.072375  | 0.248112   | 200.000000        | 0.000000 |
| 3    | 0.000000      | -0.079000 | 0.163776   | 182.500000        | 0.000005 |

In [39]:
# For optional training + evaluation
# new_dataset = dataset.train_test_split(test_size = 0.01)

trainer = GRPOTrainer(
    model = model,
    processing_class = tokenizer,
    reward_funcs = [
        match_format_exactly,
        match_format_approximately,
        check_answer,
        check_numbers,
    ],
    args = training_args,
    train_dataset = dataset,

    # For optional training + evaluation
    # train_dataset = new_dataset["train"],
    # eval_dataset = new_dataset["test"],
)
trainer.train()

==((====))==  Unsloth - 2x faster free finetuning | Num GPUs used = 1
   \\   /|    Num examples = 12,715 | Num Epochs = 1 | Total steps = 100
O^O/ \_/ \    Batch size per device = 4 | Gradient accumulation steps = 1
\        /    Data Parallel GPUs = 1 | Total batch size (4 x 1 x 1) = 4
 "-____-"     Trainable parameters = 22,544,384 of 1,258,418,176 (1.79% trained)

INFO 11-25 15:14:39 [abstract.py:306] It took 0.115951 seconds to fall asleep.
********************Question:
A conical glass is in the form of a right circular cone. The slant height is $21$ and the radius of the top rim of the glass is $14$. An ant at the midpoint of a slant line on the outside wall of the glass sees a honey drop diametrically opposite to it on the inside wall of the glass. If $d$ is the shortest distance it should crawl to reach the honey drop, what is the integer part of $d$? 
Answer:
18 
Response:
Okay, so I need to find the shortest distance d that an ant crawls to get to a honey drop from its starting point at the midpoint of a slant on the outside wall of a conical glass. The glass is in the shape of a right circular cone where the slant height is 21, and the radius of the top is 14. There's a honey drop at the midpoint of the slant, which I can call point P. The question is asking for the integer part of the distance d that's the shortest path it can take from 

Step,Training Loss,reward,reward_std,completions / mean_length,completions / min_length,completions / max_length,completions / clipped_ratio,completions / mean_terminated_length,completions / min_terminated_length,completions / max_terminated_length,kl,rewards / match_format_exactly / mean,rewards / match_format_exactly / std,rewards / match_format_approximately / mean,rewards / match_format_approximately / std,rewards / check_answer / mean,rewards / check_answer / std,rewards / check_numbers / mean,rewards / check_numbers / std
1,0.0006,-5.5,4.0,1692.5,1223.0,1849.0,0.75,1223.0,1223.0,1223.0,0.60195,0.75,1.5,-1.875,2.25,-2.125,0.25,-2.25,0.5
2,0.0006,-7.5,0.0,1358.5,479.0,1849.0,0.5,868.0,479.0,1257.0,0.599809,0.0,0.0,-3.0,0.0,-2.0,0.0,-2.5,0.0
3,0.0005,-2.75,3.752777,1685.5,1452.0,1849.0,0.25,1631.0,1452.0,1723.0,0.516783,1.5,1.732051,0.0,1.732051,-2.25,0.288675,-2.0,0.57735
4,0.0005,-6.125,2.75,1849.0,1849.0,1849.0,1.0,0.0,0.0,0.0,0.495228,0.0,0.0,-1.875,2.25,-2.0,0.0,-2.25,0.5
5,0.0006,-6.0,3.0,1643.0,1025.0,1849.0,0.75,1025.0,1025.0,1025.0,0.612548,0.75,1.5,-1.875,2.25,-2.625,1.25,-2.25,0.5
6,0.0006,-5.5,4.0,1705.0,1273.0,1849.0,0.75,1273.0,1273.0,1273.0,0.59074,0.75,1.5,-1.875,2.25,-2.125,0.25,-2.25,0.5
7,0.0009,-5.5,4.0,1717.5,1323.0,1849.0,0.75,1323.0,1323.0,1323.0,0.873159,0.75,1.5,-1.875,2.25,-2.125,0.25,-2.25,0.5
8,0.0005,-5.5,4.0,1559.5,691.0,1849.0,0.75,691.0,691.0,691.0,0.528342,0.75,1.5,-1.875,2.25,-2.125,0.25,-2.25,0.5
9,0.0007,-7.5,0.0,1849.0,1849.0,1849.0,1.0,0.0,0.0,0.0,0.673987,0.0,0.0,-3.0,0.0,-2.0,0.0,-2.5,0.0
10,0.0003,-1.5,4.0,1682.75,1457.0,1849.0,0.25,1627.333374,1457.0,1816.0,0.341591,2.25,1.5,0.375,2.25,-2.375,0.25,-1.75,0.5


INFO 11-25 15:16:40 [abstract.py:324] It took 0.186898 seconds to wake up tags {'weights', 'kv_cache'}.
INFO 11-25 15:17:00 [abstract.py:306] It took 0.170730 seconds to fall asleep.
INFO 11-25 15:17:13 [abstract.py:324] It took 0.207093 seconds to wake up tags {'weights', 'kv_cache'}.
INFO 11-25 15:17:33 [abstract.py:306] It took 0.193572 seconds to fall asleep.
INFO 11-25 15:17:46 [abstract.py:324] It took 0.234742 seconds to wake up tags {'weights', 'kv_cache'}.
INFO 11-25 15:18:07 [abstract.py:306] It took 0.188962 seconds to fall asleep.
INFO 11-25 15:18:20 [abstract.py:324] It took 0.227191 seconds to wake up tags {'weights', 'kv_cache'}.
INFO 11-25 15:18:40 [abstract.py:306] It took 0.218776 seconds to fall asleep.
INFO 11-25 15:18:53 [abstract.py:324] It took 0.261901 seconds to wake up tags {'weights', 'kv_cache'}.
INFO 11-25 15:19:13 [abstract.py:306] It took 0.218926 seconds to fall asleep.
********************Question:
Find the number of prime numbers $p$ between $100$ and 

TrainOutput(global_step=100, training_loss=0.0007419887094874866, metrics={'train_runtime': 3177.0376, 'train_samples_per_second': 0.126, 'train_steps_per_second': 0.031, 'total_flos': 0.0, 'train_loss': 0.0007419887094874866})
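A quick way to read the metrics table in the log above: the `reward` column is the sum of the per-function mean rewards. The sketch below checks this for three rows copied from the log, and also converts the reported runtime into minutes and seconds per step.

```python
# (step, reward, match_format_exactly, match_format_approximately,
#  check_answer, check_numbers), means copied from the log above
rows = [
    (1,  -5.5,  0.75, -1.875, -2.125, -2.25),
    (3,  -2.75, 1.5,   0.0,   -2.25,  -2.0),
    (10, -1.5,  2.25,  0.375, -2.375, -1.75),
]

for step, reward, *parts in rows:
    total = sum(parts)
    print(step, reward, total)

train_runtime = 3177.0376  # seconds, from TrainOutput
minutes = train_runtime / 60
seconds_per_step = train_runtime / 100
print(round(minutes, 1), round(seconds_per_step, 1))
```

Watching the individual columns shows which behaviour is improving. In this log, the format rewards start to move before the answer rewards do.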

<a name="Inference"></a>
### Inference
Now let's try the model we just trained! First, let's try the model without the GRPO-trained LoRA:

In [46]:
text = "What is the sqrt of 101?"

from vllm import SamplingParams
sampling_params = SamplingParams(
    temperature = 0.3,
    top_k = 50,
    max_tokens = 1024,
)
output = model.fast_generate(
    [text],
    sampling_params = sampling_params,
    lora_request = None,
)[0].outputs[0].text

output



' \nI can do the math, but I need a hint to make it easier.\n\n## Step 1: Understand the problem\nThe problem asks for the square root of 101, which is a mathematical operation that finds a number that, when multiplied by itself, gives the original number.\n\n## Step 2: Recall the definition of square root\nThe square root of a number is a value that, when multiplied by itself, gives the original number. In mathematical terms, it is a number that satisfies the equation: a^2 = b, where a is the square root of b.\n\n## Step 3: Find the square root of 101\nTo find the square root of 101, we can use a calculator or a mathematical method such as factoring or using a square root table. One way to do this is to use the method of "guess and check" or "trial and error" to find a number that, when squared, gives 101.\n\n## Step 4: Use a calculator or a square root table\nUsing a calculator or a square root table, we can find that the square root of 101 is approximately 10.049875.\n\n## Step 5: R

And now with the LoRA we just trained with GRPO. We first save the LoRA:

In [41]:
model.save_lora("grpo_saved_lora")

Verify LoRA is actually trained!

In [42]:
from safetensors import safe_open

with safe_open("grpo_saved_lora/adapter_model.safetensors", framework = "pt") as f:
    # Verify that no LoRA tensor (A or B) is entirely zero
    for key in f.keys():
        tensor = f.get_tensor(key)
        # Count the zeros so the comparison with numel() is count vs count
        n_zeros = (tensor == 0).sum().item()
        assert n_zeros != tensor.numel(), f"{key} is all zeros"

Now we load the LoRA and test:

In [47]:
messages = [
    {"role": "system", "content": system_prompt},
    {"role": "user",   "content": "What is the sqrt of 101?"},
]

text = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt = True, # Must add for generation
    tokenize = False,
)
from vllm import SamplingParams
sampling_params = SamplingParams(
    temperature = 0.3,
    top_k = 50,
    max_tokens = 2048,
)
output = model.fast_generate(
    text,
    sampling_params = sampling_params,
    lora_request = model.load_lora("grpo_saved_lora"),
)[0].outputs[0].text

output



"Okay, so I need to find the square root of 101. Hmm, let's see. I remember that the square root of 100 is 10, right? Because 10 times 10 is 100. So maybe the square root of 101 is something that, when multiplied by itself, gives 101. Let me think. If I take the number 10, its square is 100. But 10 times 10.5 is 105, which is a little too high. Wait, maybe 10 times 9.9 is 99, which is also a little high. But 10 times 10 is 100, which is right. So the square root of 101 is probably 10. But is that right? Let me check again. The square root of 100 is 10, so the square root of 101 is a number that, when multiplied by itself, equals 101. So maybe 10 times 10.1 is 101, which is correct. So the answer should be 10.1. But let me make sure I didn't make a mistake. If the number is 10, then 10 times 10 is 100, which is 1 less. So 10 times 10.1 is 101, which is right. So 10.1 is the correct answer. I think that's it. So the square root of 101 is 10.1. But let me just make sure I didn't write 10.

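Our GRPO-trained model now reasons through the problem before answering. It's not always correct (in this sample it settles on 10.1, while the true value is about 10.05), since we only trained it for an hour or so. It'll be better if we extend the sequence length and train for longer!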
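As a sanity check on the sample above, the sketch below compares the value the response settles on with the true square root. It also applies the same ratio test used by the close-answer tier of `check_answer`.

```python
import math

true_root = math.sqrt(101)
model_answer = 10.1  # the value the sampled response settles on

print(round(true_root, 4))          # 10.0499
print(round(model_answer ** 2, 2))  # 102.01, so 10.1 overshoots slightly

# Ratio test from the close-answer tier of check_answer
ratio = model_answer / true_root
print(0.9 <= ratio <= 1.1)          # True
```

An answer like this would earn partial credit under the ratio tier but not the exact-match reward, which is the kind of gap that longer training is meant to close.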

<a name="Save"></a>
### Saving to float16 for vLLM

We also support saving to `float16` directly. Select `merged_16bit` for float16 or `merged_4bit` for int4. We also allow `lora` adapters as a fallback. Use `push_to_hub_merged` to upload to your Hugging Face account! You can go to https://huggingface.co/settings/tokens for your personal tokens. See [our docs](https://unsloth.ai/docs/basics/inference-and-deployment) for more deployment options.

In [44]:
# Merge to 16bit
if False: model.save_pretrained_merged("llama_finetune_16bit", tokenizer, save_method = "merged_16bit",)
if False: model.push_to_hub_merged("HF_USERNAME/llama_finetune_16bit", tokenizer, save_method = "merged_16bit", token = "YOUR_HF_TOKEN")

# Merge to 4bit
if False: model.save_pretrained_merged("llama_finetune_4bit", tokenizer, save_method = "merged_4bit",)
if False: model.push_to_hub_merged("HF_USERNAME/llama_finetune_4bit", tokenizer, save_method = "merged_4bit", token = "YOUR_HF_TOKEN")

# Just LoRA adapters
if False:
    model.save_pretrained("llama_lora")
    tokenizer.save_pretrained("llama_lora")
if False:
    model.push_to_hub("HF_USERNAME/llama_lora", token = "YOUR_HF_TOKEN")
    tokenizer.push_to_hub("HF_USERNAME/llama_lora", token = "YOUR_HF_TOKEN")

### GGUF / llama.cpp Conversion
Saving to `GGUF` / `llama.cpp` is now supported natively! We clone `llama.cpp` and save to `q8_0` by default. Other methods such as `q4_k_m` are also available. Use `save_pretrained_gguf` for local saving and `push_to_hub_gguf` for uploading to Hugging Face.

Some supported quant methods (full list on our [docs page](https://unsloth.ai/docs/basics/inference-and-deployment/saving-to-gguf)):
* `q8_0` - Fast conversion. High resource use, but generally acceptable.
* `q4_k_m` - Recommended. Uses Q6_K for half of the attention.wv and feed_forward.w2 tensors, else Q4_K.
* `q5_k_m` - Recommended. Uses Q6_K for half of the attention.wv and feed_forward.w2 tensors, else Q5_K.

[**NEW**] To finetune and auto export to Ollama, try our [Ollama notebook](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3_(8B)-Ollama.ipynb)

In [45]:
# Save to 8bit Q8_0
if False: model.save_pretrained_gguf("llama_finetune", tokenizer,)
# Remember to go to https://huggingface.co/settings/tokens for a token!
# And change HF_USERNAME to your username!
if False: model.push_to_hub_gguf("HF_USERNAME/llama_finetune", tokenizer, token = "YOUR_HF_TOKEN")

# Save to 16bit GGUF
if False: model.save_pretrained_gguf("llama_finetune", tokenizer, quantization_method = "f16")
if False: model.push_to_hub_gguf("HF_USERNAME/llama_finetune", tokenizer, quantization_method = "f16", token = "YOUR_HF_TOKEN")

# Save to q4_k_m GGUF
if False: model.save_pretrained_gguf("llama_finetune", tokenizer, quantization_method = "q4_k_m")
if False: model.push_to_hub_gguf("HF_USERNAME/llama_finetune", tokenizer, quantization_method = "q4_k_m", token = "YOUR_HF_TOKEN")

# Save to multiple GGUF options - much faster if you want multiple!
if False:
    model.push_to_hub_gguf(
        "HF_USERNAME/llama_finetune", # Change HF_USERNAME to your username!
        tokenizer,
        quantization_method = ["q4_k_m", "q8_0", "q5_k_m",],
        token = "YOUR_HF_TOKEN",
    )

And we're done! If you have any questions about Unsloth, find a bug, want to keep up with the latest LLM news, or would like to join projects, feel free to join our [Discord](https://discord.gg/unsloth)!

Some other resources:
1. Looking to use Unsloth locally? Read our [Installation Guide](https://unsloth.ai/docs/get-started/install) for details on installing Unsloth on Windows, Docker, AMD, Intel GPUs.
2. Learn how to do Reinforcement Learning with our [RL Guide and notebooks](https://unsloth.ai/docs/get-started/reinforcement-learning-rl-guide).
3. Read our guides and notebooks for [Text-to-speech (TTS)](https://unsloth.ai/docs/basics/text-to-speech-tts-fine-tuning) and [vision](https://unsloth.ai/docs/basics/vision-fine-tuning) model support.
4. Explore our [LLM Tutorials Directory](https://unsloth.ai/docs/models/tutorials-how-to-fine-tune-and-run-llms) to find dedicated guides for each model.
5. Need help with Inference? Read our [Inference & Deployment page](https://unsloth.ai/docs/basics/inference-and-deployment) for details on using vLLM, llama.cpp, Ollama etc.

<div class="align-center">
  <a href="https://unsloth.ai"><img src="https://github.com/unslothai/unsloth/raw/main/images/unsloth%20new%20logo.png" width="115"></a>
  <a href="https://discord.gg/unsloth"><img src="https://github.com/unslothai/unsloth/raw/main/images/Discord.png" width="145"></a>
  <a href="https://unsloth.ai/docs/"><img src="https://github.com/unslothai/unsloth/blob/main/images/documentation%20green%20button.png?raw=true" width="125"></a>

  Join Discord if you need help + ⭐️ <i>Star us on <a href="https://github.com/unslothai/unsloth">Github</a> </i> ⭐️

  <b>This notebook and all Unsloth notebooks are licensed [LGPL-3.0](https://github.com/unslothai/notebooks?tab=LGPL-3.0-1-ov-file#readme)</b>
</div>