<a href="https://colab.research.google.com/github/rogerwzeng/BigDataSystems/blob/main/GRPO_with_s1_TPO.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### RL with GRPO using the Stanford s1 dataset


Visit [model uploads](https://docs.unsloth.ai/get-started/all-our-models) and [notebooks](https://docs.unsloth.ai/get-started/unsloth-notebooks).


### Installation

In [None]:
%%capture
# Skip restarting message in Colab
import sys; modules = list(sys.modules.keys())
for x in modules: sys.modules.pop(x) if "PIL" in x or "google" in x else None

!pip install wandb onnx protobuf --upgrade
!pip install unsloth vllm
!pip install --upgrade pillow
!pip install git+https://github.com/huggingface/trl.git@e95f9fb74a3c3647b86f251b7e230ec51c64b72b

### Unsloth

Use `PatchFastRL` before all functions to patch GRPO and other RL algorithms!

In [None]:
from unsloth import FastLanguageModel, PatchFastRL
PatchFastRL("GRPO", FastLanguageModel)

Unsloth: Patching Xformers to fix some performance issues.
🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
🦥 Unsloth Zoo will now patch everything to make training faster!
INFO 02-11 03:50:09 __init__.py:190] Automatically detected platform cuda.


In [None]:
import wandb
from google.colab import userdata

wb_api_key = userdata.get('WANDB_API_KEY')
wandb.login(key = wb_api_key)

[34m[1mwandb[0m: Appending key for api.wandb.ai to your netrc file: /root/.netrc
[34m[1mwandb[0m: Currently logged in as: [33mrogerwzeng[0m ([33mrogerwzeng-harvard-university[0m) to [32mhttps://api.wandb.ai[0m. Use [1m`wandb login --relogin`[0m to force relogin


True

Load the base model and set the parameters:

In [None]:
from unsloth import is_bfloat16_supported
import torch
max_seq_length = 1024 # Can increase for longer reasoning traces
lora_rank = 32 # Larger rank = smarter, but slower

model, tokenizer = FastLanguageModel.from_pretrained(
    #model_name = "meta-llama/meta-Llama-3.1-8B-Instruct",
    model_name = "Qwen/Qwen2.5-7B-Instruct",
    max_seq_length = max_seq_length,
    load_in_4bit = True, # False for LoRA 16bit
    fast_inference = True, # Enable vLLM fast inference
    max_lora_rank = lora_rank,
    gpu_memory_utilization = 0.75, # Reduce if out of memory
    chat_template = "\n<|im_start|>system\n{system_prompt}\n<|im_end|>\n<|im_start|>user\n{prompt}\n<|im_end|>\n<|im_start|>assistant\n",
)

model = FastLanguageModel.get_peft_model(
    model,
    r = lora_rank, # Choose any number > 0 ! Suggested 8, 16, 32, 64, 128
    target_modules = [
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ], # Remove QKVO if out of memory
    lora_alpha = lora_rank,
    use_gradient_checkpointing = "unsloth", # Enable long context finetuning
    random_state = 3407,
)

==((====))==  Unsloth 2025.2.5: Fast Qwen2 patching. Transformers: 4.48.2.
   \\   /|    GPU: NVIDIA A100-SXM4-40GB. Max memory: 39.557 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.5.1+cu124. CUDA: 8.0. CUDA Toolkit: 12.4. Triton: 3.1.0
\        /    Bfloat16 = TRUE. FA [Xformers = 0.0.28.post3. FA2 = False]
 "-____-"     Free Apache license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!
Unsloth: vLLM loading unsloth/qwen2.5-7b-instruct-unsloth-bnb-4bit with actual GPU utilization = 74.22%
Unsloth: Your GPU has CUDA compute capability 8.0 with VRAM = 39.56 GB.
Unsloth: Using conservativeness = 1.0. Chunked prefill tokens = 1024. Num Sequences = 288.
Unsloth: vLLM's KV Cache can use up to 23.38 GB. Also swap space = 6 GB.
INFO 02-11 03:52:23 config.py:542] This model supports multiple tasks: {'reward', 'embed', 'classify', 'generate', 'score'}. Defaulting to 'generate'.
Unsloth: vLLM Bitsandbytes config using kwa

tokenizer_config.json:   0%|          | 0.00/7.36k [00:00<?, ?B/s]

vocab.json:   0%|          | 0.00/2.78M [00:00<?, ?B/s]

merges.txt:   0%|          | 0.00/1.67M [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/11.4M [00:00<?, ?B/s]

added_tokens.json:   0%|          | 0.00/605 [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/614 [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/271 [00:00<?, ?B/s]

INFO 02-11 03:52:27 cuda.py:230] Using Flash Attention backend.
INFO 02-11 03:52:28 model_runner.py:1110] Starting to load model unsloth/qwen2.5-7b-instruct-unsloth-bnb-4bit...
INFO 02-11 03:52:28 loader.py:1102] Loading weights with BitsAndBytes quantization.  May take a while ...
INFO 02-11 03:52:28 weight_utils.py:252] Using model weights format ['*.safetensors']


model-00001-of-00002.safetensors:   0%|          | 0.00/4.99G [00:00<?, ?B/s]

model-00002-of-00002.safetensors:   0%|          | 0.00/2.16G [00:00<?, ?B/s]

Loading safetensors checkpoint shards:   0% Completed | 0/2 [00:00<?, ?it/s]




INFO 02-11 03:52:51 model_runner.py:1115] Loading model weights took 6.7252 GB
INFO 02-11 03:52:51 punica_selector.py:18] Using PunicaWrapperGPU.
INFO 02-11 03:52:59 worker.py:267] Memory profiling takes 7.37 seconds
INFO 02-11 03:52:59 worker.py:267] the current vLLM instance can use total_gpu_memory (39.56GiB) x gpu_memory_utilization (0.74) = 29.36GiB
INFO 02-11 03:52:59 worker.py:267] model weights take 6.73GiB; non_torch_memory takes 0.09GiB; PyTorch activation peak memory takes 1.57GiB; the rest of the memory reserved for KV Cache is 20.97GiB.
INFO 02-11 03:53:00 executor_base.py:110] # CUDA blocks: 24539, # CPU blocks: 7021
INFO 02-11 03:53:00 executor_base.py:115] Maximum concurrency for 1024 tokens per request: 383.42x
INFO 02-11 03:53:04 model_runner.py:1434] Capturing cudagraphs for decoding. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI. If out-of-memory error

Capturing CUDA graph shapes: 100%|██████████| 39/39 [00:50<00:00,  1.29s/it]

INFO 02-11 03:53:54 model_runner.py:1562] Graph capturing finished in 50 secs, took 0.80 GiB
INFO 02-11 03:53:54 llm_engine.py:431] init engine (profile, create kv cache, warmup model) took 62.57 seconds






Unsloth 2025.2.5 patched 28 layers with 28 QKV layers, 28 O layers and 28 MLP layers.


### Data Prep
<a name="Data"></a>

Use Stanford s1 data set.

@misc{muennighoff2025s1simpletesttimescaling,
      title={s1: Simple test-time scaling},
      author={Niklas Muennighoff and Zitong Yang and Weijia Shi and Xiang Lisa Li and Li Fei-Fei and Hannaneh Hajishirzi and Luke Zettlemoyer and Percy Liang and Emmanuel Candès and Tatsunori Hashimoto},
      year={2025},
      eprint={2501.19393},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2501.19393},
}

In [None]:
import re
import numpy as np
from datasets import load_dataset, Dataset
from sentence_transformers import SentenceTransformer, util

# Load and prep dataset
SYSTEM_PROMPT = """
You are a helpful and thoughtful assistant that generates answers in a specific XML format.
All your responses must follow this format:

<answer>
... final answer ...
</answer>

<reasoning>
... your reasoning ...
</reasoning>

Think for up to 1024 tokens.
Make sure to include both the <reasoning> </reasoning> and <answer> </answer> sections.
"""

XML_COT_FORMAT = """\
<answer>
{answer}
</answer>
<reasoning>
{reasoning}
</reasoning>
"""

def extract_xml_answer(text: str) -> str:
  answer = text.split("<answer>")[-1]
  answer = answer.split("</answer>")[0]
  return answer.strip()

def extract_hash_answer(text: str) -> str | None:
    if "####" not in text:
        return None
    return text.split("####")[1].strip()

# GSM8K variant (unused below): gold answers follow a "####" marker
def get_gsm8k_questions(split = "train") -> Dataset:
    data = load_dataset('openai/gsm8k', 'main')[split]
    data = data.map(lambda x: {
        'prompt': [
            {'role': 'system', 'content': SYSTEM_PROMPT},
            {'role': 'user', 'content': x['question']}
        ],
        'answer': extract_hash_answer(x['answer'])
    })
    return data

#dataset = get_gsm8k_questions()

# use the Stanford s1 dataset instead
def get_s1_questions(split = "train") -> Dataset:
    data = load_dataset("simplescaling/s1K")[split]
    data = data.map(lambda x: {
        'prompt': [
            {'role': 'system', 'content': SYSTEM_PROMPT},
            {'role': 'user', 'content': x['question']}],
        'answer': extract_xml_answer(x['solution'])})
    return data

dataset = get_s1_questions()

# initialize Sentence-BERT model
sbert_model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')

# calculate semantic answer similarity
def is_numeric(string) -> bool:
    try:
        float(string)
        return True
    except ValueError:
        return False

def encode_long_answer(answer, model, max_length=512):
    # break long answers into chunks of at most 512 characters
    answer_chunks = [answer[i:i+max_length] for i in range(0, len(answer), max_length)]

    # guard against empty answer: return a zero vector on the model's device
    if not answer_chunks:
        return torch.zeros(model.get_sentence_embedding_dimension(), device=model.device)

    # embed each chunk, then average the chunk embeddings into a single vector
    answer_embeddings = [model.encode(chunk, convert_to_tensor=True) for chunk in answer_chunks]

    return torch.mean(torch.stack(answer_embeddings), dim=0)

def semantic_similarity(answer, correct_answer):
    answer_embedding = encode_long_answer(answer, sbert_model)
    correct_answer_embedding = encode_long_answer(correct_answer, sbert_model)
    return util.pytorch_cos_sim(answer_embedding, correct_answer_embedding).item()

# Reward functions
def correctness_reward_func(prompts, completions, solution, **kwargs) -> list[float]:
    rewards = []
    correctness_threshold = 0.7 # tunable hyperparameter
    # extract solutions
    responses = [completion[0]['content'] for completion in completions]
    q = prompts[0][-1]['content']
    extracted_responses = [extract_xml_answer(r) for r in responses]
    #print('-'*20, f"Question:\n{q}", f"\nsolution:\n{solution[0]}", f"\nResponse:\n{responses[0]}", f"\nExtracted:\n{extracted_responses[0]}")
    #print('\n', '-'*30, '\n', q, '\n', solution[0], '\n', responses[0], '\n', extracted_responses[0], '\n', '-'*30, '\n')

    print("\n -------------- CONTENT BEGIN -------------- \n")
    print(f"\n CORRECT ANSWER: {solution}")
    print(f"\n ANSWERS:\n {extracted_responses}\n")

    # see how close we are to the correct solution
    for r, a in zip(extracted_responses, solution):
      if is_numeric(r) and is_numeric(a):
        # compare numerically so that e.g. "550" and "550.0" count as equal
        sem_sim = 1.0 if float(r) == float(a) else 0.0
      else:
        sem_sim = semantic_similarity(r, a)

      print(f"\n SEMANTIC SIMILARITY = {sem_sim}")

      if sem_sim > correctness_threshold:
        print("\n CORRECTLY ANSWERED! \n")
        rewards.append(2.0)
      else:
        print("\n INCORRECTLY ANSWERED! \n")
        rewards.append(-0.5)

    print("\n -------------- CONTENT END ---------------- \n")

    return rewards

    #return [2.0 if semantic_similarity(r, a) > correctness_threshold else 0 for r, a in zip(extracted_responses, solution)]

#def int_reward_func(completions, **kwargs) -> list[float]:
#    responses = [completion[0]['content'] for completion in completions]
#    extracted_responses = [extract_xml_answer(r) for r in responses]
#    return [0.5 if r.isdigit() else 0.0 for r in extracted_responses]

def strict_format_reward_func(completions, **kwargs) -> list[float]:
    """Reward completions that match the <answer>/<reasoning> layout of XML_COT_FORMAT exactly."""
    # same tag order as SYSTEM_PROMPT; re.DOTALL lets .*? span multi-line content
    pattern = r"^<answer>\n.*?\n</answer>\n<reasoning>\n.*?\n</reasoning>\n$"
    responses = [completion[0]["content"] for completion in completions]
    matches = [re.match(pattern, r, flags=re.DOTALL) for r in responses]
    return [0.5 if match else -0.1 for match in matches]

def soft_format_reward_func(completions, **kwargs) -> list[float]:
    """Reward completions that open with an <answer> block followed by a <reasoning> block."""
    # looser check: any whitespace between the two blocks, content may span lines
    pattern = r"<answer>.*?</answer>\s*<reasoning>.*?</reasoning>"
    responses = [completion[0]["content"] for completion in completions]
    matches = [re.match(pattern, r, flags=re.DOTALL) for r in responses]
    return [0.5 if match else 0.0 for match in matches]

def count_xml(text) -> float:
    count = 0.0

    # marker counts
    r_open = text.count("<reasoning>")
    r_close = text.count("</reasoning>")
    a_open = text.count("<answer>")
    a_close = text.count("</answer>")

    # reward presence of markers, penalize no markers
    count += r_open * 0.125 + (r_open - 1)*0.01
    count += r_close * 0.125 + (r_close - 1)*0.01
    count += a_open * 0.125 + (a_open - 1)*0.01
    count += a_close * 0.125 + (a_close - 1)*0.01

    return count

def xmlcount_reward_func(completions, **kwargs) -> list[float]:
    contents = [completion[0]["content"] for completion in completions]
    return [count_xml(c) for c in contents]
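
To make the tag-counting reward concrete, here is a small standalone sketch that repeats the arithmetic of `count_xml` from the cell above and scores three hand-written completions. The helper name `tag_score` and the sample strings are made up for illustration; only the weights (0.125 per tag occurrence, 0.01 per occurrence beyond the first) come from the notebook.

```python
TAGS = ("<reasoning>", "</reasoning>", "<answer>", "</answer>")

def tag_score(text: str) -> float:
    """Same arithmetic as count_xml: n * 0.125 + (n - 1) * 0.01 for each tag."""
    score = 0.0
    for tag in TAGS:
        n = text.count(tag)
        score += n * 0.125 + (n - 1) * 0.01
    return score

# hypothetical completions, for illustration only
well_formed = "<answer>\n42\n</answer>\n<reasoning>\n6 * 7 = 42\n</reasoning>\n"
no_tags = "The answer is 42."
repeated = well_formed + "<answer>\n42\n</answer>\n"

for name, sample in [("well_formed", well_formed), ("no_tags", no_tags), ("repeated", repeated)]:
    print(f"{name}: {tag_score(sample):.2f}")
# well_formed: 0.50
# no_tags: -0.04
# repeated: 0.77
```

Under this formula a completion that repeats a tag scores higher than a well-formed one (0.77 versus 0.50 here), so this reward on its own does not enforce exactly one pair of each tag; the format rewards have to do that.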

<a name="Train"></a>
### Train the model

Now set up GRPO Trainer and all configurations!
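
Before reading the configuration, it can help to work out how much data the run will actually see. The sketch below is plain arithmetic using the values from the configuration cell and the trainer banner in this notebook (1,000 examples, 1 GPU). It relies on the usual Hugging Face trainer behaviour that a positive `max_steps` takes precedence over `num_train_epochs`.

```python
num_examples = 1000                # s1K training split, as reported by the trainer banner
per_device_train_batch_size = 1
gradient_accumulation_steps = 1
num_gpus = 1                       # single-GPU Colab runtime
max_steps = 250

# prompts consumed per optimizer step
effective_batch_size = per_device_train_batch_size * gradient_accumulation_steps * num_gpus
steps_per_epoch = num_examples // effective_batch_size
epochs_covered = max_steps / steps_per_epoch

print(effective_batch_size, steps_per_epoch, epochs_covered)
# 1 1000 0.25
```

So with `max_steps = 250` the run stops after a quarter of one epoch, which agrees with the `train/epoch` value of 0.25 in the run summary.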

In [None]:
from trl import GRPOConfig, GRPOTrainer

training_args = GRPOConfig(
    use_vllm = True, # use vLLM for fast inference!
    learning_rate = 5e-6,
    adam_beta1 = 0.9,
    adam_beta2 = 0.99,
    weight_decay = 0.1,
    warmup_ratio = 0.1,
    lr_scheduler_type = "cosine",
    optim = "paged_adamw_8bit",
    logging_steps = 1,
    bf16 = is_bfloat16_supported(),
    fp16 = not is_bfloat16_supported(),
    per_device_train_batch_size = 1,
    gradient_accumulation_steps = 1, # Increase to 4 for smoother training
    num_generations = 5, # Completions sampled per prompt (the GRPO group size). Decrease if out of memory
    max_prompt_length = 512,
    max_completion_length = 1024,
    num_train_epochs = 3, # Not reached in this run: max_steps below takes precedence
    max_steps = 250, # Remove to train by num_train_epochs instead
    save_steps = 250,
    max_grad_norm = 0.1, # gradient clipping
    report_to = "wandb", # Can use Weights & Biases
    output_dir = "outputs",
    run_name = "GRPO-s1-TPO-Qwen2.5-7B",
)


torch.distributed process group is initialized, but parallel_mode != ParallelMode.DISTRIBUTED. In order to use Torch DDP, launch your script with `python -m torch.distributed.launch


Now let's run the trainer! The training output includes a table of rewards; the goal is to see the `reward` column increase.

You might have to wait 150 to 200 steps before the reward starts to move, and you'll probably get 0 reward for the first 100 steps. Please be patient!
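
For intuition about why the reward can stay flat for a while: GRPO scores each completion relative to the other completions sampled for the same prompt. The sketch below is a simplified, standalone illustration of that group-relative normalization, not TRL's actual code. The function name, the use of the sample standard deviation, and the `eps` value are assumptions made for this example.

```python
from statistics import mean, stdev

def group_relative_advantages(rewards, eps=1e-4):
    """Normalize rewards within one group of completions for the same prompt."""
    mu = mean(rewards)
    sigma = stdev(rewards) if len(rewards) > 1 else 0.0
    # eps keeps the division finite when all rewards in the group are equal
    return [(r - mu) / (sigma + eps) for r in rewards]

# hypothetical groups of 5 completions (num_generations = 5)
mixed = group_relative_advantages([2.4, -0.1, -0.1, -0.1, -0.1])  # one completion stands out
flat = group_relative_advantages([-0.1] * 5)                       # all completions score the same

print([round(a, 3) for a in mixed])
print([round(a, 3) for a in flat])
```

In this simplified view, a group where every completion earns the same total reward (a step with `reward_std` of 0.0) yields all-zero advantages, so only steps with some spread between completions push the policy in any direction.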


In [None]:
with wandb.init(project="GRPO_s1", config=training_args.to_dict()):
  trainer = GRPOTrainer(
      model = model,
      processing_class = tokenizer,
      reward_funcs = [
          xmlcount_reward_func,
          soft_format_reward_func,
          strict_format_reward_func,
          #int_reward_func,
          correctness_reward_func,
      ],
      args = training_args,
      train_dataset = dataset,
  )
  trainer.train()


==((====))==  Unsloth - 2x faster free finetuning | Num GPUs = 1
   \\   /|    Num examples = 1,000 | Num Epochs = 1
O^O/ \_/ \    Batch size per device = 1 | Gradient Accumulation steps = 1
\        /    Total batch size = 1 | Total steps = 250
 "-____-"     Number of trainable parameters = 80,740,352



 -------------- CONTENT BEGIN -------------- 


 CORRECT ANSWER: ['1. Let \\( R \\) be a ring with at least one non-zero zero divisor and a finite number of zero divisors. Denote the number of zero divisors in \\( R \\) by \\( m \\).\n\n2. Since \\( R \\) contains at least one non-zero zero divisor, there exist non-zero elements \\( u, v \\in R \\) such that \\( uv = 0 \\).\n\n3. Consider an arbitrary element \\( x \\in R \\). The element \\( xu \\) is either zero or a zero divisor because:\n   \\[\n   (xu)v = x(uv) = x \\cdot 0 = 0\n   \\]\n   Hence, \\( xu \\) is a zero divisor.\n\n4. Suppose \\( xu = yu \\) for some distinct elements \\( x, y \\in R \\). Then:\n   \\[\n   (x - y)u = xu - yu = 0\n   \\]\n   This implies that \\( x - y \\) is a zero divisor.\n\n5. Since \\( R \\) has \\( m \\) zero divisors, each zero divisor can be written in the form \\( xu \\) for at most \\( m \\) different elements \\( x \\in R \\). This is because if \\( xu = yu \\) for distinct \\( x \\) and \

| Step | Training Loss | reward | reward_std | completion_length | kl |
|---|---|---|---|---|---|
| 1 | 0.0 | 0.873 | 1.395043 | 661.400024 | 0.0009 |
| 2 | 0.0 | -0.181 | 0.073943 | 703.0 | 0.000982 |
| 3 | 0.0 | 2.049 | 0.120748 | 833.0 | 0.000339 |
| 4 | 0.0 | -0.262 | 0.147885 | 610.400024 | 0.000366 |
| 5 | 0.0001 | -0.1 | 0.0 | 506.0 | 0.00169 |
| 6 | 0.0 | 0.711 | 1.298876 | 827.0 | 0.000589 |
| 7 | 0.0 | 1.711 | 1.027457 | 794.600037 | 0.00035 |
| 8 | 0.0001 | 0.238 | 0.997726 | 804.200012 | 0.001463 |
| 9 | 0.0 | -0.235 | 0.165341 | 859.600037 | 0.000697 |
| 10 | 0.0 | -0.073 | 0.147885 | 718.600037 | 0.001043 |


Streaming output truncated to the last 5000 lines.
 INCORRECTLY ANSWERED! 


 -------------- CONTENT END ---------------- 


 -------------- CONTENT BEGIN -------------- 


 CORRECT ANSWER: ['550', '550', '550', '550', '550']

 ANSWERS:
 ['132', '220', '132', '44', '24']


 SEMANTIC SIMILARITY = 0.0

 INCORRECTLY ANSWERED! 


 SEMANTIC SIMILARITY = 0.0

 INCORRECTLY ANSWERED! 


 SEMANTIC SIMILARITY = 0.0

 INCORRECTLY ANSWERED! 


 SEMANTIC SIMILARITY = 0.0

 INCORRECTLY ANSWERED! 


 SEMANTIC SIMILARITY = 0.0

 INCORRECTLY ANSWERED! 


 -------------- CONTENT END ---------------- 


 -------------- CONTENT BEGIN -------------- 


 CORRECT ANSWER: ['BC', 'BC', 'BC', 'BC', 'BC']

 ANSWERS:
 ['(A) True\n(B) True\n(C) True\n(D) True', '(A) True\n(B) True\n(C) True\n(D) True', '(A) The resistance of the Voltmeter will be \\(100 \\text{k}\\Omega\\)\n(B) The resistance of the Ammeter will be \\(0.02 \\Omega\\) (round off to 2nd decimal place)\n(C) The measured value of \\(R\\) will be \\(978 \\Omega 


Run history:
train/completion_length,▇▂█▅▆▆▂▁▆▃▇▇▃▂▅▇▇▇▅▄██▅▇▃▃▇▆▂▄▃▆▅▂▁▆▅▅▇▅
train/epoch,▁▁▁▁▁▂▂▂▂▂▂▃▃▃▃▄▄▄▄▄▄▅▅▅▅▅▅▆▆▆▇▇▇▇▇▇████
train/global_step,▁▁▁▂▂▂▂▂▂▂▃▃▃▃▃▃▃▃▃▃▄▄▄▄▄▅▅▅▅▆▆▆▆▆▆▇▇▇▇█
train/grad_norm,▆▁▅▁▆▅▁▄▁▅▁▆▇▅▆▅▅▅▅▄▁▁▅▄▂▆▇▁▆▅▅▁▁▁▅█▁▅█▅
train/kl,▁▁▁▁▂▁▁▂▁▃▁▁▂▂▂▁▁▂▂▃▄▁▂▄▁▄▁▂▁▂█▄▄▂▁▂▁▂▂▂
train/learning_rate,▂▄███████▇▇▇▇▇▇▇▆▆▆▆▆▅▅▄▄▄▃▃▃▃▂▂▂▂▂▁▁▁▁▁
train/loss,▁▁▁▂▁▁▁▁▂█▂▁▁▁▁▁▁▁▁▁▁▂▁▁▁▁▁▂▁▂▁▁▃▂▁▁▁▂▁▂
train/reward,▂▆▂▂▂▂▂▂▂▂▄▂▂▇▅▂▂▂▂▂▂▂▂▂▄▆▃▂▂▂▁▇▁▂▄▂▂█▂█
train/reward_std,█▆█▁▁▆▁▁▁▁▂▇▁▁▆▁▁▁▁▁▁▁▁▁▁▁▇▇▁▁▁▁▇▁▁▁▂▇█▆
train/rewards/correctness_reward_func,▁█▁▁▁▂▁▁▄█▇▁▁▁▄▁▁▁▁▂▁▁▁▁▁▂▁▁▄▁▁▁▁▁▁▁▁▁█▁

Run summary:
total_flos,0.0
train/completion_length,677.0
train/epoch,0.25
train/global_step,250.0
train/grad_norm,0.07585
train/kl,0.00032
train/learning_rate,0.0
train/loss,0.0
train/reward,1.549
train/reward_std,0.99729


<a name="Inference"></a>
### Inference
Now let's try the model we just trained! First, let's see how the model responds without any GRPO training:

In [None]:
#print(f"\n Total correct answers: {total_correct_answer} \n")

text = tokenizer.apply_chat_template([
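    {"role" : "user", "content" : "Compute the 101st number in the Fibonacci sequence, counting from 0, and give the final result."},
], tokenize = False, add_generation_prompt = True)

from vllm import SamplingParams
sampling_params = SamplingParams(
    temperature = 0.8,
    top_p = 0.95,
    max_tokens = 1024,
)
output = model.fast_generate(
    [text],
    sampling_params = sampling_params,
    lora_request = None,
)[0].outputs[0].text

output

Processed prompts: 100%|██████████| 1/1 [00:04<00:00,  4.12s/it, est. speed input: 12.64 toks/s, output: 72.43 toks/s]


'To compute the 101st number in the Fibonacci sequence, we first need to understand the definition of the Fibonacci sequence. The Fibonacci sequence is a very famous sequence whose rule is that, starting from the third term, each term is the sum of the previous two. Specifically, the sequence is defined as follows:\n\n- F(0) = 0\n- F(1) = 1\n- For n ≥ 2, F(n) = F(n-1) + F(n-2)\n\nWe can use a recursive or an iterative method to compute the 101st term of the Fibonacci sequence. For efficiency, we use the iterative method here, which avoids the repeated computation that recursion can cause.\n\nThe Python code to iteratively compute the 101st term of the Fibonacci sequence is as follows:\n\n```python\ndef fibonacci(n):\n    if n == 0:\n        return 0\n    elif n == 1:\n        return 1\n    else:\n        a, b = 0, 1\n        for _ in range(2, n+1):\n            a, b = b, a + b\n        return b\n\n# Compute the 101st term\nresult = fibonacci(100)\nprint(result)\n```\n\nRunning the code above gives the 101st number of the Fibonacci sequence: 57314784401381708410.'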

And now with the LoRA we just trained with GRPO. First, we save the LoRA:

In [None]:
model.save_lora("grpo_saved_lora")

Now we load the LoRA and test:

In [None]:
text = tokenizer.apply_chat_template([
    {"role" : "system", "content" : SYSTEM_PROMPT},
    {"role" : "user", "content" : "What is the 101 number in the Fibonacci sequence with 0 based indexing."},
], tokenize = False, add_generation_prompt = True)

from vllm import SamplingParams
sampling_params = SamplingParams(
    temperature = 0.8,
    top_p = 0.95,
    max_tokens = 1024,
)
output = model.fast_generate(
    text,
    sampling_params = sampling_params,
    lora_request = model.load_lora("grpo_saved_lora"),
)[0].outputs[0].text

output

Processed prompts: 100%|██████████| 1/1 [00:03<00:00,  3.89s/it, est. speed input: 28.52 toks/s, output: 66.02 toks/s]


"<answer>\n54\n</answer>\n\n<reasoning>\nThe Fibonacci sequence is a series of numbers where each number is the sum of the two preceding ones, usually starting with 0 and 1. Given the 0-based indexing, the sequence starts as follows: 0, 1, 1, 2, 3, 5, 8, 13, ...\n\nTo find the 101st number in this sequence, we can use a known property of the Fibonacci sequence. The nth Fibonacci number can be approximated using Binet's formula, but for direct calculation, we can use a simple iterative approach or a built-in function in most programming languages.\n\nUsing a Python-like pseudocode, we can generate the Fibonacci sequence up to the 101st number:\n\n```\nfib = [0, 1]\nfor i in range(2, 101):\n    fib.append(fib[i-1] + fib[i-2])\n```\n\nAfter running this code, the 101st number in the sequence (index 100) is 54.\n\nTherefore, the 101 number in the Fibonacci sequence with 0 based indexing is 54.\n</reasoning>"

The GRPO-trained model now returns its output in the requested `<answer>` / `<reasoning>` format. It is not always correct (the answer above, 54, is wrong), since we only trained it for an hour or so; it should improve if we extend the sequence length and train for longer!
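
As a sanity check on both generations above, the exact value is cheap to compute with Python's arbitrary-precision integers. This is a standalone reference sketch, not part of the training pipeline. Note that "the 101st number with 0-based indexing" can be read as index 100 or index 101, so both are printed.

```python
def fibonacci(n: int) -> int:
    """Return F(n) with F(0) = 0 and F(1) = 1, computed iteratively."""
    a, b = 0, 1
    for _ in range(n):
        a, b = b, a + b
    return a

print(fibonacci(100))  # 354224848179261915075
print(fibonacci(101))  # 573147844013817084101
```

Neither generation matches either value, which makes this prompt a handy regression check when training for longer.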

<a name="Save"></a>
### Saving to float16 for vLLM

We also support saving to `float16` directly. Select `merged_16bit` for float16 or `merged_4bit` for int4 (pass `merged_4bit_forced` to confirm the 4-bit merge, as the error message at the end of this section explains). We also allow `lora` adapters as a fallback. Use `push_to_hub_merged` to upload to your Hugging Face account! You can go to https://huggingface.co/settings/tokens for your personal tokens.

In [None]:
from google.colab import userdata
hf_token = userdata.get('HF_GRPO')

In [None]:
# Merge to 16bit
model.save_pretrained_merged("model", tokenizer, save_method = "merged_16bit",)
model.push_to_hub_merged("rogerzeng/model", tokenizer, save_method = "merged_16bit", token = hf_token)

# Merge to 4bit
model.save_pretrained_merged("model", tokenizer, save_method = "merged_4bit_forced",)
model.push_to_hub_merged("rogerzeng/model", tokenizer, save_method = "merged_4bit_forced", token = hf_token)

# Just LoRA adapters
model.save_pretrained_merged("model", tokenizer, save_method = "lora",)
model.push_to_hub_merged("rogerzeng/model", tokenizer, save_method = "lora", token = hf_token)

Unsloth: Kaggle/Colab has limited disk space. We need to delete the downloaded
model which will save 4-16GB of disk space, allowing you to save on Kaggle/Colab.
Unsloth: Will remove a cached repo with size 7.2G


Unsloth: Merging 4bit and LoRA weights to 16bit...
Unsloth: Will use up to 45.54 out of 83.48 RAM for saving.
Unsloth: Saving model... This might take 5 minutes ...


  7%|▋         | 2/28 [00:00<00:01, 18.54it/s]
We will save to Disk and not RAM now.
100%|██████████| 28/28 [00:30<00:00,  1.11s/it]


Unsloth: Saving tokenizer... Done.
Done.


Unsloth: You are pushing to hub, but you passed your HF username = rogerzeng.
We shall truncate rogerzeng/model to model


Unsloth: Merging 4bit and LoRA weights to 16bit...
Unsloth: Will use up to 45.31 out of 83.48 RAM for saving.
Unsloth: Saving model... This might take 5 minutes ...


100%|██████████| 28/28 [00:29<00:00,  1.05s/it]


Unsloth: Saving tokenizer...

  0%|          | 0/1 [00:00<?, ?it/s]

tokenizer.json:   0%|          | 0.00/11.4M [00:00<?, ?B/s]

 Done.


README.md:   0%|          | 0.00/610 [00:00<?, ?B/s]

  0%|          | 0/4 [00:00<?, ?it/s]

model-00001-of-00004.safetensors:   0%|          | 0.00/4.88G [00:00<?, ?B/s]

model-00004-of-00004.safetensors:   0%|          | 0.00/1.09G [00:00<?, ?B/s]

model-00003-of-00004.safetensors:   0%|          | 0.00/4.33G [00:00<?, ?B/s]

model-00002-of-00004.safetensors:   0%|          | 0.00/4.93G [00:00<?, ?B/s]

Done.
Saved merged model to https://huggingface.co/rogerzeng/model


RuntimeError: Unsloth: Merging into 4bit will cause your model to lose accuracy if you plan
to merge to GGUF or others later on. I suggest you to do this as a final step
if you're planning to do multiple saves.
If you are certain, change `save_method` to `merged_4bit_forced`.

### GGUF / llama.cpp Conversion
Saving to `GGUF` / `llama.cpp` is now supported natively! We clone `llama.cpp` and save to `q8_0` by default. Other quantization methods, such as `q4_k_m`, are also allowed. Use `save_pretrained_gguf` for local saving and `push_to_hub_gguf` for uploading to HF.

Some supported quant methods (full list on our [Wiki page](https://github.com/unslothai/unsloth/wiki#gguf-quantization-options)):
* `q8_0` - Fast conversion. High resource use, but generally acceptable.
* `q4_k_m` - Recommended. Uses Q6_K for half of the attention.wv and feed_forward.w2 tensors, else Q4_K.
* `q5_k_m` - Recommended. Uses Q6_K for half of the attention.wv and feed_forward.w2 tensors, else Q5_K.

[**NEW**] To finetune and auto export to Ollama, try our [Ollama notebook](https://colab.research.google.com/drive/1WZDi7APtQ9VsvOrQSSC5DDtxq159j8iZ?usp=sharing)

In [None]:
# Save to 8bit Q8_0
if False: model.save_pretrained_gguf("model", tokenizer,)
if False: model.push_to_hub_gguf("rogerzeng/model", tokenizer, token = hf_token)

# Save to 16bit GGUF
if False: model.save_pretrained_gguf("model", tokenizer, quantization_method = "f16")
if False: model.push_to_hub_gguf("rogerzeng/model", tokenizer, quantization_method = "f16", token = hf_token)

# Save to q4_k_m GGUF
if False: model.save_pretrained_gguf("model", tokenizer, quantization_method = "q4_k_m")
if False: model.push_to_hub_gguf("rogerzeng/model", tokenizer, quantization_method = "q4_k_m", token = hf_token)

# Save to multiple GGUF options - much faster if you want multiple!
if False:
    model.push_to_hub_gguf(
        "rogerzeng/model",
        tokenizer,
        quantization_method = ["q4_k_m", "q8_0", "q5_k_m",],
        token = hf_token,
    )

Now, use the `model-unsloth.gguf` file or `model-unsloth-Q4_K_M.gguf` file in llama.cpp or a UI based system like Jan or Open WebUI. You can install Jan [here](https://github.com/janhq/jan) and Open WebUI [here](https://github.com/open-webui/open-webui)

And we're done! If you have any questions about Unsloth, we have a [Discord](https://discord.gg/unsloth) channel! If you find any bugs, want to keep up with the latest LLM news, need help, or want to join projects, feel free to join our Discord!

Some other links:
1. Llama 3.2 Conversational notebook. [Free Colab](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3.2_(1B_and_3B)-Conversational.ipynb)
2. Saving finetunes to Ollama. [Free notebook](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3_(8B)-Ollama.ipynb)
3. Llama 3.2 Vision finetuning - Radiography use case. [Free Colab](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3.2_(11B)-Vision.ipynb)
4. See notebooks for DPO, ORPO, Continued pretraining, conversational finetuning and more on our [documentation](https://docs.unsloth.ai/get-started/unsloth-notebooks)!

<div class="align-center">
  <a href="https://unsloth.ai"><img src="https://github.com/unslothai/unsloth/raw/main/images/unsloth%20new%20logo.png" width="115"></a>
  <a href="https://discord.gg/unsloth"><img src="https://github.com/unslothai/unsloth/raw/main/images/Discord.png" width="145"></a>
  <a href="https://docs.unsloth.ai/"><img src="https://github.com/unslothai/unsloth/blob/main/images/documentation%20green%20button.png?raw=true" width="125"></a>

  Join Discord if you need help + ⭐️ <i>Star us on <a href="https://github.com/unslothai/unsloth">Github</a> </i> ⭐️
</div>
