To run this, press "*Runtime*" and press "*Run all*" on a **free** Tesla T4 Google Colab instance!
<div class="align-center">
<a href="https://unsloth.ai/"><img src="https://github.com/unslothai/unsloth/raw/main/images/unsloth%20new%20logo.png" width="115"></a>
<a href="https://discord.gg/unsloth"><img src="https://github.com/unslothai/unsloth/raw/main/images/Discord button.png" width="145"></a>
<a href="https://docs.unsloth.ai/"><img src="https://github.com/unslothai/unsloth/blob/main/images/documentation%20green%20button.png?raw=true" width="125"></a> Join Discord if you need help + ⭐ <i>Star us on <a href="https://github.com/unslothai/unsloth">Github</a> </i> ⭐
</div>

To install Unsloth on your own computer, follow the installation instructions in our docs [here](https://docs.unsloth.ai/get-started/installing-+-updating).

You will learn how to do [data prep](#Data), how to [train](#Train), how to [run the model](#Inference), & [how to save it](#Save)


### News

**Read our [Gemma 3 blog](https://unsloth.ai/blog/gemma3) for what's new in Unsloth and our [Reasoning blog](https://unsloth.ai/blog/r1-reasoning) on how to train reasoning models.**

Visit our docs for all our [model uploads](https://docs.unsloth.ai/get-started/all-our-models) and [notebooks](https://docs.unsloth.ai/get-started/unsloth-notebooks).


### Installation

In [1]:
%%capture
import os
if "COLAB_" not in "".join(os.environ.keys()):
    !pip install unsloth vllm
else:
    # [NOTE] Do the below ONLY in Colab! Use [[pip install unsloth vllm]]
    !pip install --no-deps unsloth vllm

In [6]:
#@title Colab Extra Install { display-mode: "form" }
%%capture
import os
if "COLAB_" not in "".join(os.environ.keys()):
    !pip install unsloth vllm
else:
    !pip install --no-deps unsloth vllm
    # [NOTE] Do the below ONLY in Colab! Use [[pip install unsloth vllm]]
    # Skip restarting message in Colab
    import sys, re, requests; modules = list(sys.modules.keys())
    for x in modules: sys.modules.pop(x) if "PIL" in x or "google" in x else None
    !pip install --no-deps bitsandbytes accelerate xformers==0.0.29.post3 peft trl triton cut_cross_entropy unsloth_zoo
    !pip install sentencepiece protobuf datasets huggingface_hub hf_transfer

    # vLLM requirements - vLLM breaks Colab due to reinstalling numpy
    f = requests.get("https://raw.githubusercontent.com/vllm-project/vllm/refs/heads/main/requirements/common.txt").content
    with open("vllm_requirements.txt", "wb") as file:
        file.write(re.sub(rb"(transformers|numpy|xformers)[^\n]{1,}\n", b"", f))
    !pip install -r vllm_requirements.txt
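
To see what the `re.sub` filter in the cell above does before running it against the live file, here is a minimal sketch on a made-up requirements snippet. The sample bytes and version pins are assumptions for illustration only; the pattern is the one used in the install cell.

```python
import re

# Hypothetical stand-in for vLLM's requirements/common.txt (contents are made up)
sample = (
    b"requests>=2.26.0\n"
    b"numpy<2.0.0\n"
    b"transformers>=4.48.2\n"
    b"xformers==0.0.29.post2\n"
    b"tqdm\n"
)

# Same pattern as the install cell: starting at one of these package names,
# delete everything up to and including the end of the line
filtered = re.sub(rb"(transformers|numpy|xformers)[^\n]{1,}\n", b"", sample)
print(filtered.decode())
```

Because `[^\n]{1,}` needs at least one character after the package name, a bare `numpy` line with no version pin would be left in place, and the pattern is not anchored to the start of a line, so it also matches these names in the middle of a line.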

### Unsloth

Load up `DeepSeek-R1-Distill-Llama-8B`, and set parameters

In [7]:
from unsloth import FastLanguageModel
import torch
max_seq_length = 1024 # Can increase for longer reasoning traces
lora_rank = 32 # Larger rank = smarter, but slower

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/DeepSeek-R1-Distill-Llama-8B",
    max_seq_length = max_seq_length,
    load_in_4bit = True, # False for LoRA 16bit
    fast_inference = True, # Enable vLLM fast inference
    max_lora_rank = lora_rank,
    gpu_memory_utilization = 0.6, # Reduce if out of memory
)

model = FastLanguageModel.get_peft_model(
    model,
    r = lora_rank, # Choose any number > 0 ! Suggested 8, 16, 32, 64, 128
    target_modules = [
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ], # Remove QKVO if out of memory
    lora_alpha = lora_rank,
    use_gradient_checkpointing = "unsloth", # Enable long context finetuning
    random_state = 3407,
)

🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
🦥 Unsloth Zoo will now patch everything to make training faster!
INFO 03-20 15:41:06 [__init__.py:256] Automatically detected platform cuda.
==((====))==  Unsloth 2025.3.17: Fast Llama patching. Transformers: 4.48.3. vLLM: 0.8.1.
   \\   /|    Tesla T4. Num GPUs = 1. Max memory: 14.741 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.6.0+cu124. CUDA: 7.5. CUDA Toolkit: 12.4. Triton: 3.2.0
\        /    Bfloat16 = FALSE. FA [Xformers = 0.0.29.post3. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!
Unsloth: vLLM loading unsloth/deepseek-r1-distill-llama-8b-unsloth-bnb-4bit with actual GPU utilization = 59.43%
Unsloth: Your GPU has CUDA compute capability 7.5 with VRAM = 14.74 GB.
Unsloth: Using conservativeness = 1.0. Chunked prefill tokens = 1024. Num Sequences = 160.
Unsloth: vLLM's KV Cache can use up to 2

tokenizer_config.json:   0%|          | 0.00/53.0k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/17.2M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/483 [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/236 [00:00<?, ?B/s]

INFO 03-20 15:41:40 [cuda.py:234] Cannot use FlashAttention-2 backend for Volta and Turing GPUs.
INFO 03-20 15:41:40 [cuda.py:282] Using XFormers backend.
INFO 03-20 15:41:41 [parallel_state.py:967] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, TP rank 0
INFO 03-20 15:41:41 [model_runner.py:1110] Starting to load model unsloth/deepseek-r1-distill-llama-8b-unsloth-bnb-4bit...
INFO 03-20 15:41:41 [loader.py:1137] Loading weights with BitsAndBytes quantization. May take a while ...
INFO 03-20 15:41:43 [weight_utils.py:257] Using model weights format ['*.safetensors']


model.safetensors:   0%|          | 0.00/5.96G [00:00<?, ?B/s]

INFO 03-20 15:42:16 [weight_utils.py:273] Time spent downloading weights for unsloth/deepseek-r1-distill-llama-8b-unsloth-bnb-4bit: 33.480166 seconds
INFO 03-20 15:42:17 [weight_utils.py:307] No model.safetensors.index.json found in remote.


Loading safetensors checkpoint shards:   0% Completed | 0/1 [00:00<?, ?it/s]


INFO 03-20 15:43:00 [punica_selector.py:18] Using PunicaWrapperGPU.
INFO 03-20 15:43:01 [model_runner.py:1146] Model loading took 5.7736 GB and 79.337496 seconds
INFO 03-20 15:43:12 [worker.py:267] Memory profiling takes 10.36 seconds
INFO 03-20 15:43:12 [worker.py:267] the current vLLM instance can use total_gpu_memory (14.74GiB) x gpu_memory_utilization (0.59) = 8.76GiB
INFO 03-20 15:43:12 [worker.py:267] model weights take 5.77GiB; non_torch_memory takes 0.03GiB; PyTorch activation peak memory takes 0.74GiB; the rest of the memory reserved for KV Cache is 2.22GiB.
INFO 03-20 15:43:12 [executor_base.py:111] # cuda blocks: 1134, # CPU blocks: 1024
INFO 03-20 15:43:12 [executor_base.py:116] Maximum concurrency for 1024 tokens per request: 17.72x
INFO 03-20 15:43:14 [model_runner.py:1442] Capturing cudagraphs for decoding. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI. If 

Capturing CUDA graph shapes: 100%|██████████| 23/23 [00:56<00:00,  2.48s/it]

INFO 03-20 15:44:11 [model_runner.py:1570] Graph capturing finished in 57 secs, took 0.53 GiB
INFO 03-20 15:44:11 [llm_engine.py:447] init engine (profile, create kv cache, warmup model) took 70.12 seconds





Unsloth 2025.3.17 patched 32 layers with 32 QKV layers, 32 O layers and 32 MLP layers.
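
As a rough check on what `r = lora_rank` costs, the sketch below counts the LoRA parameters added to the seven target modules. The layer shapes are assumptions taken from the standard Llama 3.1 8B architecture (they are not printed anywhere in this notebook); the layer count and rank come from the cell above.

```python
# Assumed Llama 3.1 8B shapes (not stated in this notebook)
HIDDEN = 4096          # model hidden size
INTERMEDIATE = 14336   # MLP intermediate size
KV_DIM = 1024          # 8 KV heads x 128 head dim (grouped-query attention)
NUM_LAYERS = 32        # matches "patched 32 layers" in the output above
LORA_RANK = 32         # lora_rank in the cell above

# (in_features, out_features) for each LoRA target module
shapes = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}

def lora_params(in_features: int, out_features: int, rank: int) -> int:
    # LoRA adds two matrices per module: A (rank x in) and B (out x rank)
    return rank * (in_features + out_features)

per_layer = sum(lora_params(i, o, LORA_RANK) for i, o in shapes.values())
total = per_layer * NUM_LAYERS
print(f"{total:,}")
```

Under these assumed shapes the total is 83,886,080, which agrees with the `Trainable parameters` figure the trainer prints later in this notebook. Halving the rank halves this number.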


### Data Prep
<a name="Data"></a>

We directly leverage [@willccbb](https://gist.github.com/willccbb/4676755236bb08cab5f4e54a0475d6fb) for data prep and all reward functions. You are free to create your own!
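
The data prep cell below reads a local `data3.csv` with `Problem` and `Solution` columns. The file itself is not included here, so the following is only a sketch of the expected layout: the two rows are borrowed from problems that appear in this notebook's training log, the system prompt is a placeholder, and the standard-library `csv` module stands in for pandas to keep the example dependency-free.

```python
import csv
import io

# Hypothetical two-row stand-in for data3.csv
csv_text = (
    "Problem,Solution\n"
    '"🐻 × 🐻 = 25. Find the value of 🐻.",🐻 = 5.\n'
    '"🎃 + 🌽 = 13, and 🎃 - 🌽 = 3. Find the values of 🎃 and 🌽.","🎃 = 8, 🌽 = 5."\n'
)
SYSTEM_PROMPT = "Respond in the following format: ..."  # placeholder, not the real prompt

rows = list(csv.DictReader(io.StringIO(csv_text)))

# Same record layout the notebook builds: a chat-style prompt plus a reference answer
records = [
    {
        "prompt": [
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": row["Problem"]},
        ],
        "answer": row["Solution"],
    }
    for row in rows
]
print(records[0]["answer"])
```

Since `correctness_reward_func` compares the extracted answer to this string with `==`, the `Solution` text has to match exactly what you expect the model to write inside `<answer>`.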

In [9]:
import re
from datasets import load_dataset, Dataset
import pandas as pd

# Load and prep dataset
SYSTEM_PROMPT = """
Respond in the following format:
<reasoning>
...
</reasoning>
<answer>
...
</answer>
"""

XML_COT_FORMAT = """\
<reasoning>
{reasoning}
</reasoning>
<answer>
{answer}
</answer>
"""

def extract_xml_answer(text: str) -> str:
    answer = text.split("<answer>")[-1]
    answer = answer.split("</answer>")[0]
    return answer.strip()

def get_emoji_math_dataset() -> Dataset:
    # Read the CSV file
    df = pd.read_csv('data3.csv')

    # Convert to the format expected by the model
    dataset_dict = {
        'prompt': [
            [
                {'role': 'system', 'content': SYSTEM_PROMPT},
                {'role': 'user', 'content': problem}
            ] for problem in df['Problem']
        ],
        'answer': df['Solution'].tolist()
    }

    return Dataset.from_dict(dataset_dict)

dataset = get_emoji_math_dataset()

# Reward functions
def correctness_reward_func(prompts, completions, answer, **kwargs) -> list[float]:
    responses = [completion[0]['content'] for completion in completions]
    q = prompts[0][-1]['content']
    extracted_responses = [extract_xml_answer(r) for r in responses]
    print('-'*20, f"Question:\n{q}", f"\nAnswer:\n{answer[0]}", f"\nResponse:\n{responses[0]}", f"\nExtracted:\n{extracted_responses[0]}")
    return [2.0 if r == a else 0.0 for r, a in zip(extracted_responses, answer)]

def int_reward_func(completions, **kwargs) -> list[float]:
    responses = [completion[0]['content'] for completion in completions]
    extracted_responses = [extract_xml_answer(r) for r in responses]
    return [0.5 if r.isdigit() else 0.0 for r in extracted_responses]

def strict_format_reward_func(completions, **kwargs) -> list[float]:
    """Reward 0.5 if the whole completion matches the format exactly, including newlines."""
    pattern = r"^<reasoning>\n.*?\n</reasoning>\n<answer>\n.*?\n</answer>\n$"
    responses = [completion[0]["content"] for completion in completions]
    matches = [re.match(pattern, r) for r in responses]
    return [0.5 if match else 0.0 for match in matches]

def soft_format_reward_func(completions, **kwargs) -> list[float]:
    """Reward 0.5 if the completion starts with the tags in order, ignoring exact whitespace."""
    pattern = r"<reasoning>.*?</reasoning>\s*<answer>.*?</answer>"
    responses = [completion[0]["content"] for completion in completions]
    matches = [re.match(pattern, r) for r in responses]
    return [0.5 if match else 0.0 for match in matches]

def count_xml(text) -> float:
    count = 0.0
    if text.count("<reasoning>\n") == 1:
        count += 0.125
    if text.count("\n</reasoning>\n") == 1:
        count += 0.125
    if text.count("\n<answer>\n") == 1:
        count += 0.125
        count -= len(text.split("\n</answer>\n")[-1])*0.001
    if text.count("\n</answer>") == 1:
        count += 0.125
        count -= (len(text.split("\n</answer>")[-1]) - 1)*0.001
    return count

def xmlcount_reward_func(completions, **kwargs) -> list[float]:
    contents = [completion[0]["content"] for completion in completions]
    return [count_xml(c) for c in contents]
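
To make the scoring concrete, here is a small worked example that runs `extract_xml_answer` and `count_xml` (copied from the cell above) on two made-up completions. The sample strings are assumptions chosen to illustrate the two extremes: a perfectly formatted answer, and a completion with a stray closing tag followed by a long tail.

```python
def extract_xml_answer(text: str) -> str:
    answer = text.split("<answer>")[-1]
    answer = answer.split("</answer>")[0]
    return answer.strip()

def count_xml(text) -> float:
    count = 0.0
    if text.count("<reasoning>\n") == 1:
        count += 0.125
    if text.count("\n</reasoning>\n") == 1:
        count += 0.125
    if text.count("\n<answer>\n") == 1:
        count += 0.125
        count -= len(text.split("\n</answer>\n")[-1])*0.001
    if text.count("\n</answer>") == 1:
        count += 0.125
        count -= (len(text.split("\n</answer>")[-1]) - 1)*0.001
    return count

# Made-up completions for illustration
well_formed = "<reasoning>\n3x = 27, so x = 9\n</reasoning>\n<answer>\n9\n</answer>\n"
stray_tag = "3x = 27, so x = 9\n</answer>\n" + "x" * 300  # 300 trailing characters

print(extract_xml_answer(well_formed), count_xml(well_formed))
print(extract_xml_answer(stray_tag), round(count_xml(stray_tag), 3))
```

The well-formed completion earns the full 0.5 and extracts `9`. The second one has no `<answer>` tag, so the extractor returns everything before `</answer>`, and the 300 trailing characters cost 0.001 each, giving 0.125 - 0.3 = -0.175. This is how the `xmlcount_reward_func` column can go negative during training.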

<a name="Train"></a>
### Train the model

Now set up GRPO Trainer and all configurations!

In [12]:
max_prompt_length = 256

from trl import GRPOConfig, GRPOTrainer
training_args = GRPOConfig(
    learning_rate = 5e-6,
    adam_beta1 = 0.9,
    adam_beta2 = 0.99,
    weight_decay = 0.1,
    warmup_ratio = 0.1,
    lr_scheduler_type = "cosine",
    optim = "paged_adamw_8bit",
    logging_steps = 1,
    per_device_train_batch_size = 1,
    gradient_accumulation_steps = 1, # Increase to 4 for smoother training
    num_generations = 6, # Decrease if out of memory
    max_prompt_length = max_prompt_length,
    max_completion_length = max_seq_length - max_prompt_length,
    # num_train_epochs = 1, # Set to 1 for a full training run
    max_steps = 50,
    save_steps = 50,
    max_grad_norm = 0.1,
    report_to = "none", # Can use Weights & Biases
    output_dir = "outputs",
)

Unsloth: We now expect `per_device_train_batch_size` to be a multiple of `num_generations`.
We will change the batch size of 1 to the `num_generations` of 6


And let's run the trainer! If you scroll up, you'll see a table of rewards. The goal is to see the `reward` column increase!

You might have to wait 150 to 200 steps for any action, and you'll probably get 0 reward for the first 100 steps. Please be patient! Note that this notebook sets `max_steps = 50`, so raise it if you want to train that long.

| Step | Training Loss | reward    | reward_std | completion_length | kl       |
|------|---------------|-----------|------------|-------------------|----------|
| 1    | 0.000000      | 0.125000  | 0.000000   | 200.000000        | 0.000000 |
| 2    | 0.000000      | 0.072375  | 0.248112   | 200.000000        | 0.000000 |
| 3    | 0.000000      | -0.079000 | 0.163776   | 182.500000        | 0.000005 |
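
To read the `reward` and `reward_std` columns, it helps to recall that GRPO scores each completion relative to the other completions sampled for the same prompt. The sketch below assumes the commonly described normalization, (reward - group mean) / (group std + eps); the `eps` value, the sample standard deviation, and the reward lists are illustrative assumptions, and the trainer's exact implementation may differ.

```python
import statistics

def group_advantages(rewards, eps=1e-4):
    """Sketch of group-relative advantages for one prompt's completions."""
    mean = statistics.fmean(rewards)
    std = statistics.stdev(rewards) if len(rewards) > 1 else 0.0
    return [(r - mean) / (std + eps) for r in rewards]

# All six completions scored the same (like step 1 in the table: reward_std = 0)
same = group_advantages([0.125] * 6)

# One completion clearly better than the rest (made-up rewards)
mixed = group_advantages([2.5, 0.0, 0.125, 0.0, 0.5, 0.0])

print(same)
print([round(a, 3) for a in mixed])
```

When every completion in the group gets the same reward, all advantages are zero and that step carries no learning signal. The model only gets pushed toward a behaviour once some completions score higher than others, which is one reason early steps can look flat.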


In [13]:
trainer = GRPOTrainer(
    model = model,
    processing_class = tokenizer,
    reward_funcs = [
        xmlcount_reward_func,
        soft_format_reward_func,
        strict_format_reward_func,
        int_reward_func,
        correctness_reward_func,
    ],
    args = training_args,
    train_dataset = dataset,
)
trainer.train()

==((====))==  Unsloth - 2x faster free finetuning | Num GPUs used = 1
   \\   /|    Num examples = 30 | Num Epochs = 2 | Total steps = 50
O^O/ \_/ \    Batch size per device = 6 | Gradient accumulation steps = 1
\        /    Data Parallel GPUs = 1 | Total batch size (6 x 1 x 1) = 6
 "-____-"     Trainable parameters = 83,886,080/8,000,000,000 (1.05% trained)


-------------------- Question:
🎃 + 🌽 = 13, and 🎃 - 🌽 = 3. Find the values of 🎃 and 🌽. 
Answer:
🎃 = 8, 🌽 = 5. 
Response:
I need to find the values of 🌞 and 🌛 given the equations 🌞 + 🌛 = 18 and 🌞 - 🌛 = 6.

First, I'll label the variables for clarity: let 🌞 = x and 🌛 = y. This gives me the system of equations:
1. x + y = 18
2. x - y = 6

Next, I'll add the two equations together to eliminate y:
( x + y ) + ( x - y ) = 18 + 6
This simplifies to 2x = 24, so x = 12.

Now that I have x = 12, I'll substitute this back into one of the original equations to find y:
12 + y = 18
Subtracting 12 from both sides gives y = 6.

Therefore, the values are 🌞 = 12 and 🌛 = 6.

</answer>
The values are: 🌞 = 12 and 🌛 = 6.
</think>

Let's solve the system of equations step by step.

We are given:
\[
\begin{cases}
\+ \, \text{🌞} + \text{🌛} = 18 \quad \text{(1)} \\
- \, \text{🌞} - \text{🌛} = 6 \quad \text{(2)}
\end{cases}
\]

**Step 1: Assign Variables**

Let’s define:
\[
x = \text{🌞}, \quad y = \text{🌛}
\]

**S

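| Step | Training Loss | reward | reward_std | completion_length | kl | rewards / xmlcount_reward_func | rewards / soft_format_reward_func | rewards / strict_format_reward_func | rewards / int_reward_func | rewards / correctness_reward_func |
|------|---------------|--------|------------|-------------------|----|--------------------------------|-----------------------------------|-------------------------------------|---------------------------|-----------------------------------|
| 1 | 0.0 | -0.103333 | 0.388673 | 539.833374 | 4e-06 | -0.103333 | 0.0 | 0.0 | 0.0 | 0.0 |
| 2 | 0.0 | -0.162167 | 0.475377 | 268.833344 | 6e-06 | -0.2455 | 0.0 | 0.0 | 0.083333 | 0.0 |
| 3 | 0.0 | -0.1755 | 0.1998 | 272.333344 | 4e-06 | -0.1755 | 0.0 | 0.0 | 0.0 | 0.0 |
| 4 | 0.0 | -0.1325 | 0.156946 | 269.0 | 7e-06 | -0.1325 | 0.0 | 0.0 | 0.0 | 0.0 |
| 5 | 0.0 | 0.020833 | 0.051031 | 472.0 | 3e-06 | 0.020833 | 0.0 | 0.0 | 0.0 | 0.0 |
| 6 | 0.0 | -0.201167 | 0.344057 | 375.666687 | 4e-06 | -0.201167 | 0.0 | 0.0 | 0.0 | 0.0 |
| 7 | 0.0 | -0.1625 | 0.252531 | 270.833344 | 1.5e-05 | -0.1625 | 0.0 | 0.0 | 0.0 | 0.0 |
| 8 | 0.0 | -0.136833 | 0.221125 | 274.833344 | 1.8e-05 | -0.136833 | 0.0 | 0.0 | 0.0 | 0.0 |
| 9 | 0.0 | -0.100167 | 0.166981 | 305.5 | 2.7e-05 | -0.100167 | 0.0 | 0.0 | 0.0 | 0.0 |
| 10 | 0.0 | -0.292167 | 0.379318 | 293.833344 | 8.7e-05 | -0.292167 | 0.0 | 0.0 | 0.0 | 0.0 |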
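
One way to read the table above: in every logged row, the `reward` column equals the sum of the five per-function columns. The short check below uses the first three rows copied from the table; the interpretation of the `int_reward_func` value as "one of six completions" is an inference from `num_generations = 6`, not something the log states.

```python
# (reward, xmlcount, soft_format, strict_format, int, correctness) for steps 1-3
rows = [
    (-0.103333, -0.103333, 0.0, 0.0, 0.0, 0.0),
    (-0.162167, -0.2455, 0.0, 0.0, 0.083333, 0.0),
    (-0.1755, -0.1755, 0.0, 0.0, 0.0, 0.0),
]

# Does the total match the sum of its parts (up to the table's rounding)?
consistent = [abs(reward - sum(parts)) < 1e-5 for reward, *parts in rows]
print(consistent)

# Step 2: a mean of 0.083333 over 6 completions at 0.5 per hit
digit_hits = round(0.083333 * 6 / 0.5)
print(digit_hits)
```

So in step 2 a single completion out of six produced a purely numeric answer, while the format and correctness rewards stayed at zero for all ten steps shown.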


-------------------- Question:
🐻 × 🐻 = 25. Find the value of 🐻. 
Answer:
🐻 = 5. 
Response:
To determine the value of 🐻 when 🐻 × 🐻 equals 25, we can set up the equation \( x \times x = 25 \).

By solving the equation, we find that \( x \) equals 5.

Therefore, the value of 🐻 is 5.
</think>

Let's solve the equation step by step.

**Given:**
\[
\text{🐻} \times \text{🐻} = 25
\]

**Let** \( x = \text{🐻} \).

**Step 1:** Write the equation:
\[
x \times x = 25
\]

**Step 2:** Recognize that \( x \times x \) is the same as \( x^2 \):
\[
x^2 = 25
\]

**Step 3:** Solve for \( x \) by taking the square root of both sides:
\[
x = \sqrt{25}
\]

**Step 4:** Simplify the square root:
\[
x = 5
\]

**Answer:** The value of 🐻 is \(\boxed{5}\). 
Extracted:
To determine the value of 🐻 when 🐻 × 🐻 equals 25, we can set up the equation \( x \times x = 25 \).

By solving the equation, we find that \( x \) equals 5.

Therefore, the value of 🐻 is 5.
</think>

Let's solve the equation step by step.

**Given:**


TrainOutput(global_step=50, training_loss=2.115623431421909e-05, metrics={'train_runtime': 5246.5083, 'train_samples_per_second': 0.057, 'train_steps_per_second': 0.01, 'total_flos': 0.0, 'train_loss': 2.115623431421909e-05})
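
The format rewards stayed at 0.0 throughout the logged steps. Part of that is the model answering in its own `</think>` style, but the patterns themselves are also stricter than they look: `re.match` anchors at the start of the string, and without `re.DOTALL` the `.` does not match newlines. The sketch below shows this on two made-up completions, using the pattern from `soft_format_reward_func`. Whether to add `re.DOTALL` is a design choice for your own reward functions, not something this notebook does.

```python
import re

# Pattern copied from soft_format_reward_func
soft_pattern = r"<reasoning>.*?</reasoning>\s*<answer>.*?</answer>"

# Made-up completions for illustration
one_line = "<reasoning>3x = 27, so x = 9</reasoning>\n<answer>9</answer>"
multi_line = "<reasoning>\n3x = 27\nx = 9\n</reasoning>\n<answer>\n9\n</answer>\n"

as_written_one_line = bool(re.match(soft_pattern, one_line))
as_written_multi_line = bool(re.match(soft_pattern, multi_line))
with_dotall = bool(re.match(soft_pattern, multi_line, flags=re.DOTALL))

print(as_written_one_line, as_written_multi_line, with_dotall)
```

As written, the pattern accepts the single-line completion but rejects the multi-line one, even though the multi-line layout is what `SYSTEM_PROMPT` asks for. With `flags=re.DOTALL` the multi-line completion matches as well.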

<a name="Inference"></a>
### Inference
Now let's try the model we just trained! First, let's try the model without any GRPO training:

In [18]:
text = tokenizer.apply_chat_template([
    {"role": "system", "content": SYSTEM_PROMPT},
    {"role": "user", "content": "🎨 + 🎨 + 🎨 = 27. Find the value of 🎨."},
], tokenize=False, add_generation_prompt=True)

from vllm import SamplingParams
sampling_params = SamplingParams(
    temperature=0.8,
    top_p=0.95,
    max_tokens=1024,
)
output = model.fast_generate(
    [text],
    sampling_params=sampling_params,
    lora_request=None,
)[0].outputs[0].text
print(output)


Processed prompts: 100%|██████████| 1/1 [00:19<00:00, 19.84s/it, est. speed input: 2.62 toks/s, output: 13.31 toks/s]

To determine the value of 🎨, I'll start by analyzing the equation: 🎨 + 🎨 + 🎨 = 27.

First, adding three of the same numbers together means multiplying one number by 3. So, I can represent this as 3 × 🎨 = 27.

Next, to find the value of 🎨, I'll divide both sides of the equation by 3. This gives me 🎨 = 27 ÷ 3.

Performing the division, I calculate that 🎨 equals 9.

Therefore, the value of 🎨 is 9.
</think>

<reasoning>
We have the equation:
\[ \text{🎨} + \text{🎨} + \text{🎨} = 27 \]

Let's denote \( x = \text{🎨} \). So the equation becomes:
\[ x + x + x = 27 \]

Combining like terms:
\[ 3x = 27 \]

Solving for \( x \):
\[ x = \frac{27}{3} \]
\[ x = 9 \]
</reasoning>

<answer>
The value of 🎨 is \boxed{9}.
</answer>
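
Note that the answer above is wrapped as `\boxed{9}` inside the `<answer>` tags, so `int_reward_func` (which uses `str.isdigit`) would score it 0 even though the number is right. If you write your own reward functions, one option is to normalize such answers first. The helper below is hypothetical and not part of this notebook's reward functions; it handles only simple, non-nested `\boxed{...}` content.

```python
import re

def extract_boxed(text: str):
    """Hypothetical helper: return the contents of the last \\boxed{...}, or None."""
    matches = re.findall(r"\\boxed\{([^{}]*)\}", text)
    return matches[-1].strip() if matches else None

# Text taken from the <answer> block in the output above
answer_block = "The value of 🎨 is \\boxed{9}."

boxed = extract_boxed(answer_block)
print(boxed)
print(answer_block.isdigit(), boxed.isdigit())
```

The raw answer text fails the digit check, while the extracted value passes it.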





And now with the LoRA we just trained with GRPO. We save the LoRA first!

In [16]:
model.save_lora("grpo_saved_lora")

Now we load the LoRA and test:

In [17]:
# Test with GRPO training
test_problems = [
    "🎨 + 🎨 + 🎨 = 27. Find the value of 🎨.",
    "🐉 + 🐉 + 🐉 + 🐉 = 28. Find the value of 🐉.",
    "🎁 + 🎀 = 14, and 🎁 - 🎀 = 2. Find the values of 🎁 and 🎀."
]

print("\nWith GRPO training:")
# Load the saved LoRA adapter once and reuse it for every prompt
lora_request = model.load_lora("grpo_saved_lora")
for problem in test_problems:
    text = tokenizer.apply_chat_template([
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": problem},
    ], tokenize=False, add_generation_prompt=True)

    output = model.fast_generate(
        [text],
        sampling_params=sampling_params,
        lora_request=lora_request,
    )[0].outputs[0].text
    print(f"\nProblem: {problem}")
    print(f"Response:\n{output}")
    print("-" * 50)


With GRPO training:


Processed prompts: 100%|██████████| 1/1 [00:15<00:00, 15.91s/it, est. speed input: 3.27 toks/s, output: 16.60 toks/s]



Problem: 🎨 + 🎨 + 🎨 = 27. Find the value of 🎨.
Response:
I need to determine the value of one black circle emoji (🎨) when three of them add up to 27.

First, I'll let the value of one black circle emoji be represented by x.

So, the equation is 3x = 27.

To find the value of x, I'll divide both sides of the equation by 3.

x = 27 divided by 3, which equals 9.

Therefore, one black circle emoji is worth 9.
</think>

Let's solve the equation step by step.

**Given:**
\[
\boxed{🎨} + \boxed{🎨} + \boxed{🎨} = 27
\]

Let \( x \) represent the value of one black circle emoji (\(\boxed{🎨}\)).

**Step 1:** Express the equation with the variable.
\[
3x = 27
\]

**Step 2:** Solve for \( x \) by dividing both sides by 3.
\[
x = \frac{27}{3}
\]
\[
x = 9
\]

**Conclusion:**
One black circle emoji (\(\boxed{🎨}\)) is worth **9**.

\[
\boxed{9}
\]
--------------------------------------------------


Processed prompts: 100%|██████████| 1/1 [00:16<00:00, 16.99s/it, est. speed input: 3.30 toks/s, output: 15.02 toks/s]



Problem: 🐉 + 🐉 + 🐉 + 🐉 = 28. Find the value of 🐉.
Response:
I need to find the value of the unknown dragon 🐉 when three known dragons add up to 28.

I'll start by assigning a variable to the unknown dragon. Let’s call it x.

So, the equation becomes:
x + 3 = 28.

To solve for x, I'll subtract 3 from both sides of the equation:
x = 28 - 3.

This simplifies to:
x = 25.

Therefore, the value of the unknown dragon is 25.
</think>

Let's solve the equation step by step.

**Given:**
\[
🐉 + 🐉 + 🐉 + 🐉 = 28
\]

**Let's denote the unknown dragon as \( x \):**
\[
x + \cancel{🐉} + \cancel{🐉} + \cancel{🐉} = 28
\]
\[
x + 3\, \text{ dragons} = 28
\]

**Subtract 3 from both sides to isolate \( x \):**
\[
x = 28 - 3
\]
\[
x = 25
\]

**Answer:**
\[
\boxed{25}
\]
--------------------------------------------------


Processed prompts: 100%|██████████| 1/1 [00:25<00:00, 25.97s/it, est. speed input: 2.46 toks/s, output: 16.18 toks/s]


Problem: 🎁 + 🎀 = 14, and 🎁 - 🎀 = 2. Find the values of 🎁 and 🎀.
Response:
First, I'll define the values of the variables. Let me denote 🎁 as \( x \) and 🎀 as \( y \).

Given the equations:
1. \( x + y = 14 \)
2. \( x - y = 2 \)

To find the values of \( x \) and \( y \), I'll add the two equations together. Adding the left sides and the right sides:
\( (x + y) + (x - y) = 14 + 2 \)

This simplifies to:
\( 2x = 16 \)

Solving for \( x \), I divide both sides by 2:
\( x = 8 \)

Now that I know \( x \) is 8, I'll substitute this value back into the first equation to find \( y \):
\( 8 + y = 14 \)

Subtracting 8 from both sides gives:
\( y = 6 \)

Therefore, the values are 🎁 = 8 and 🎀 = 6.
</think>

Let's solve the system of equations step by step.

We are given:
\[
\begin{cases}
x + y = 14 \quad \text{(1)} \\
x - y = 2 \quad \text{(2)}
\end{cases}
\]

**Step 1:** Add equations (1) and (2) to eliminate \( y \).
\[
(x + y) + (x - y) = 14 + 2 \\
2x = 16
\]
\[
x = \frac{16}{2} = 8
\]

**Step




In [None]:
model.save_pretrained("lora_model")  # Local saving
tokenizer.save_pretrained("lora_model")
model.push_to_hub("saadsohail/DeepSeekR1-GRPO-CreativeMathProblemSolver", token = "your_token") # Online saving
tokenizer.push_to_hub("saadsohail/DeepSeekR1-GRPO-CreativeMathProblemSolver", token = "your_token") # Online saving

README.md:   0%|          | 0.00/629 [00:00<?, ?B/s]

  0%|          | 0/1 [00:00<?, ?it/s]

adapter_model.safetensors:   0%|          | 0.00/336M [00:00<?, ?B/s]

Saved model to https://huggingface.co/saadsohail/DeepSeekR1-GRPO-CreativeMathProblemSolver


  0%|          | 0/1 [00:00<?, ?it/s]

tokenizer.json:   0%|          | 0.00/17.2M [00:00<?, ?B/s]
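
Rather than pasting a token into the notebook, you may prefer to read it from the environment. The helper below is a hypothetical sketch; `HF_TOKEN` is the environment variable name the Hugging Face tooling conventionally reads, and the repository name in the usage comment is a placeholder.

```python
import os

def get_hf_token(env_var: str = "HF_TOKEN") -> str:
    """Hypothetical helper: read a Hugging Face token from the environment."""
    token = os.environ.get(env_var, "").strip()
    if not token:
        raise RuntimeError(
            f"Set the {env_var} environment variable to a token with write access."
        )
    return token

# Usage sketch (not executed here):
# model.push_to_hub("your_name/your_model", token = get_hf_token())
```

This keeps the token out of saved notebook cells and outputs, which matters if you share the notebook afterwards.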

Our reasoning model is much better, but it's not always correct, since we only trained it for an hour or so. It'll be better if we extend the sequence length and train for longer!

<a name="Save"></a>
### Saving to float16 for vLLM

We also support saving to `float16` directly. Select `merged_16bit` for float16 or `merged_4bit` for int4. We also allow `lora` adapters as a fallback. Use `push_to_hub_merged` to upload to your Hugging Face account! You can go to https://huggingface.co/settings/tokens for your personal tokens.

In [None]:
# Merge to 16bit
if False: model.save_pretrained_merged("model", tokenizer, save_method = "merged_16bit",)
if False: model.push_to_hub_merged("hf/model", tokenizer, save_method = "merged_16bit", token = "")

# Merge to 4bit
# Note: "merged_4bit" raises the RuntimeError shown below; use "merged_4bit_forced" if you are certain
if False: model.save_pretrained_merged("saadsohail/DeepSeekR1-GRPO-CreativeMathProblemSolver", tokenizer, save_method = "merged_4bit",)
if False: model.push_to_hub_merged("saadsohail/DeepSeekR1-GRPO-CreativeMathProblemSolver", tokenizer, save_method = "merged_4bit", token = "")

# Just LoRA adapters
if False: model.save_pretrained_merged("model", tokenizer, save_method = "lora",)
if False: model.push_to_hub_merged("hf/model", tokenizer, save_method = "lora", token = "")

RuntimeError: Unsloth: Merging into 4bit will cause your model to lose accuracy if you plan
to merge to GGUF or others later on. I suggest you to do this as a final step
if you're planning to do multiple saves.
If you are certain, change `save_method` to `merged_4bit_forced`.
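
A separate pitfall when pushing: Hub repository names cannot contain spaces. The checker below is a simplified, hypothetical approximation of the Hub naming rules (letters, digits, `-`, `_`, `.`, no `--` or `..`, at most 96 characters per part); the real validation in `huggingface_hub` may accept or reject edge cases differently.

```python
import re

# Simplified approximation: each part starts and ends with an alphanumeric
# character and is at most 96 characters long
_NAME = r"[A-Za-z0-9](?:[A-Za-z0-9._-]{0,94}[A-Za-z0-9])?"
_REPO_ID = re.compile(rf"^(?:{_NAME}/)?{_NAME}$")

def looks_like_valid_repo_id(repo_id: str) -> bool:
    """Hypothetical pre-flight check before calling a push_to_hub method."""
    if "--" in repo_id or ".." in repo_id:
        return False
    return bool(_REPO_ID.match(repo_id))

ok = looks_like_valid_repo_id("saadsohail/DeepSeekR1-GRPO-CreativeMathProblemSolver")
bad = looks_like_valid_repo_id("saadsohail/DeepSeek R1-GRPO-Creative Math Problem Solver")
print(ok, bad)
```

Running a check like this before a long merge can save you from discovering an invalid name only at upload time.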

### GGUF / llama.cpp Conversion
We now support saving to `GGUF` / `llama.cpp` natively! We clone `llama.cpp` and save to `q8_0` by default. All methods like `q4_k_m` are allowed. Use `save_pretrained_gguf` for local saving and `push_to_hub_gguf` for uploading to HF.

Some supported quant methods (full list on our [Wiki page](https://github.com/unslothai/unsloth/wiki#gguf-quantization-options)):
* `q8_0` - Fast conversion. High resource use, but generally acceptable.
* `q4_k_m` - Recommended. Uses Q6_K for half of the attention.wv and feed_forward.w2 tensors, else Q4_K.
* `q5_k_m` - Recommended. Uses Q6_K for half of the attention.wv and feed_forward.w2 tensors, else Q5_K.

[**NEW**] To finetune and auto export to Ollama, try our [Ollama notebook](https://colab.research.google.com/drive/1WZDi7APtQ9VsvOrQSSC5DDtxq159j8iZ?usp=sharing)

In [None]:
# Save to 8bit Q8_0
if False: model.save_pretrained_gguf("model", tokenizer,)
# Remember to go to https://huggingface.co/settings/tokens for a token!
# And change hf to your username!
if False: model.push_to_hub_gguf("hf/model", tokenizer, token = "")

# Save to 16bit GGUF
if False: model.save_pretrained_gguf("model", tokenizer, quantization_method = "f16")
if False: model.push_to_hub_gguf("hf/model", tokenizer, quantization_method = "f16", token = "")

# Save to q4_k_m GGUF
if False: model.save_pretrained_gguf("model", tokenizer, quantization_method = "q4_k_m")
if False: model.push_to_hub_gguf("hf/model", tokenizer, quantization_method = "q4_k_m", token = "")

# Save to multiple GGUF options - much faster if you want multiple!
if False:
    model.push_to_hub_gguf(
        "hf/model", # Change hf to your username!
        tokenizer,
        quantization_method = ["q4_k_m", "q8_0", "q5_k_m",],
        token = "",
    )

Now, use the `model-unsloth.gguf` file or `model-unsloth-Q4_K_M.gguf` file in llama.cpp or a UI-based system like Jan or Open WebUI. You can install Jan [here](https://github.com/janhq/jan) and Open WebUI [here](https://github.com/open-webui/open-webui).

And we're done! If you have any questions about Unsloth, find any bugs, need help, or want to keep up with the latest LLM news and join projects, feel free to join our [Discord](https://discord.gg/unsloth)!

Some other links:
1. Train your own reasoning model - Llama GRPO notebook [Free Colab](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3.1_(8B)-GRPO.ipynb)
2. Saving finetunes to Ollama. [Free notebook](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3_(8B)-Ollama.ipynb)
3. Llama 3.2 Vision finetuning - Radiography use case. [Free Colab](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3.2_(11B)-Vision.ipynb)
4. See notebooks for DPO, ORPO, Continued pretraining, conversational finetuning and more on our [documentation](https://docs.unsloth.ai/get-started/unsloth-notebooks)!

<div class="align-center">
  <a href="https://unsloth.ai"><img src="https://github.com/unslothai/unsloth/raw/main/images/unsloth%20new%20logo.png" width="115"></a>
  <a href="https://discord.gg/unsloth"><img src="https://github.com/unslothai/unsloth/raw/main/images/Discord.png" width="145"></a>
  <a href="https://docs.unsloth.ai/"><img src="https://github.com/unslothai/unsloth/blob/main/images/documentation%20green%20button.png?raw=true" width="125"></a>

  Join Discord if you need help + ⭐️ <i>Star us on <a href="https://github.com/unslothai/unsloth">Github</a> </i> ⭐️
</div>
