# OpenR1 Qwen2-0.5B-gsm8k-cppo

https://github.com/lzhxmu/CPPO.git


- gpu: T4*2
- model: Qwen/Qwen2-0.5B
- data: stpete2/openai-gsm8k-part
- method: cppo
- output: Qwen2-0.5B-gsm8k-cppo
  

### CPPO-specific settings in custom_config2.yaml
- metric: smallest
- pruning: 0.5 
- allocation: true
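
As a rough illustration of what these options control, the sketch below assumes (from my reading of the CPPO description, not from its actual code) that `metric: smallest` ranks the completions in each group by absolute advantage and drops the smallest ones, and that `pruning: 0.5` is the fraction dropped. The function names `group_advantages` and `prune_smallest` are hypothetical, the normalization is the usual GRPO-style mean/std one, and `allocation` (refilling the freed slots with completions from other prompts) is not modeled.

```python
from statistics import mean, pstdev


def group_advantages(rewards, eps=1e-4):
    """GRPO-style advantages: rewards normalized within one prompt's group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def prune_smallest(advantages, pruning):
    """Return indices of the completions kept after pruning.

    Assumption: completions with the smallest |advantage| are dropped,
    and `pruning` is the fraction of the group that is dropped.
    """
    keep = max(1, int(len(advantages) * (1 - pruning)))
    ranked = sorted(
        range(len(advantages)),
        key=lambda i: abs(advantages[i]),
        reverse=True,
    )
    return sorted(ranked[:keep])


# One prompt with num_generations = 4 and made-up summed rewards.
rewards = [3.0, 0.0, 1.0, 1.0]
advantages = group_advantages(rewards)
kept = prune_smallest(advantages, pruning=0.5)
print(kept)  # [0, 1]
```

With `num_generations: 4` and `pruning: 0.5`, only two completions per prompt would contribute to the policy update under these assumptions, which is where the speed-up is expected to come from.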

## Open-R1

Open-R1 is an open initiative to replicate and extend the techniques behind DeepSeek-R1, a state-of-the-art reasoning model, in a fully transparent and collaborative way:

https://github.com/huggingface/open-r1



After selecting the model, dataset, and method, I ran the training command from the command line and training completed successfully in the OpenR1 environment.

Considering the limitations of the notebook environment, I kept the model and dataset as small as possible and used the following techniques:

1. LoRA (Low-Rank Adaptation)
2. Gradient checkpointing
3. Batching optimizations
4. BF16 mixed precision
5. Sequence length limit
6. Data packing

This configuration is far from sufficient for effective training, but it makes it possible to check in a short time that the method runs end to end, even with limited resources. It is a useful starting point before scaling up to larger experiments.
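
To make the scale of this run concrete, here is a small arithmetic sketch using the values that appear in `custom_config.yaml` and `custom_config2.yaml` further down. It assumes the TRL GRPO convention that `per_device_train_batch_size` counts completions rather than prompts; whether the CPPO trainer follows exactly the same convention is an assumption, and the helper name `batch_summary` is hypothetical.

```python
def batch_summary(num_processes, per_device_train_batch_size,
                  gradient_accumulation_steps, num_generations, pruning):
    """Derive per-step counts from the training config values."""
    # Completions processed across all GPUs in one forward/backward pass.
    completions_per_step = num_processes * per_device_train_batch_size
    if completions_per_step % num_generations != 0:
        raise ValueError("global batch must be divisible by num_generations")
    return {
        "completions_per_step": completions_per_step,
        # Each prompt is sampled num_generations times.
        "prompts_per_step": completions_per_step // num_generations,
        # Gradients are accumulated before each optimizer update.
        "completions_per_update": completions_per_step * gradient_accumulation_steps,
        # Assumption: pruning is the fraction of each group that is dropped.
        "kept_per_group": int(num_generations * (1 - pruning)),
    }


summary = batch_summary(
    num_processes=2,                 # T4 x 2
    per_device_train_batch_size=8,
    gradient_accumulation_steps=2,
    num_generations=4,
    pruning=0.5,
)
print(summary)
```

Under these assumptions each optimizer update sees only a handful of distinct prompts, which is consistent with treating this run as a pipeline check rather than a serious training run.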

In [1]:
from kaggle_secrets import UserSecretsClient
import wandb
user_secrets = UserSecretsClient()
secret_value = user_secrets.get_secret("wandb_api_key")
wandb.login(key=secret_value)

# save metrics into wandb folder
import os
os.environ["WANDB_DIR"] = "./wandb"
wandb.init(project="250419cp", mode="online")

wandb: Using wandb-core as the SDK backend.  Please refer to https://wandb.me/wandb-core for more information.
wandb: Currently logged in as: stpeteishii. Use `wandb login --relogin` to force relogin
wandb: Appending key for api.wandb.ai to your netrc file: /root/.netrc
wandb: Tracking run with wandb version 0.19.1
wandb: Run data is saved locally in /kaggle/working/wandb/run-20250419_084821-shqlur13
wandb: Run `wandb offline` to turn off syncing.
wandb: Syncing run morning-violet-3
wandb: View project at https://wandb.ai/stpeteishii/250419cp
wandb: View run at https://wandb.ai/stpeteishii/250419cp/runs/shqlur13


In [2]:
!git clone https://github.com/lzhxmu/CPPO.git
!pip install -e ./CPPO
!pip show CPPO

Cloning into 'CPPO'...
remote: Enumerating objects: 203, done.
remote: Counting objects: 100% (203/203), done.
remote: Compressing objects: 100% (151/151), done.
remote: Total 203 (delta 84), reused 153 (delta 44), pack-reused 0 (from 0)
Receiving objects: 100% (203/203), 3.60 MiB | 20.49 MiB/s, done.
Resolving deltas: 100% (84/84), done.
Obtaining file:///kaggle/working/CPPO
  Preparing metadata (setup.py) ... done
Collecting trl@ git+https://github.com/huggingface/trl.git@69ad852e5654a77f1695eb4c608906fe0c7e8624 (from open-r1==0.1.0.dev0)
  Cloning https://github.com/huggingface/trl.git (to revision 69ad852e5654a77f1695eb4c608906fe0c7e8624) to /tmp/pip-install-z35fxqbh/trl_d6a93a8dc0ec43669cf3b0dfbced4dba
  Running command git clone --filter=blob:none --quiet https://github.com/huggingface/trl.git /tmp/pip-install-z35fxqbh/trl_d6a93a8dc0ec43669cf3b0dfbced4dba
  Running command git rev-parse -q --verify 'sha^69ad852e5654a77f1695eb4c608906fe0c7e8624'

In [3]:
import os
os.chdir('./CPPO')

In [4]:
!ls

asset  LICENSE	Makefile  README.md  recipes  scripts  setup.cfg  setup.py  src


In [5]:
!pip install flash-attn --no-build-isolation
#!pip install vllm

Collecting flash-attn
  Downloading flash_attn-2.7.4.post1.tar.gz (6.0 MB)
     6.0/6.0 MB 53.1 MB/s eta 0:00:00
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: flash-attn
  Building wheel for flash-attn (setup.py) ... done
  Created wheel for flash-attn: filename=flash_attn-2.7.4.post1-cp310-cp310-linux_x86_64.whl size=187797312 sha256=b267f80a08e516292cdd748056a2178a45b8abedf7fca123292eb17c21c8c87c
  Stored in directory: /root/.cache/pip/wheels/59/ce/d5/08ea07bfc16ba218dc65a3a7ef9b6a270530bcbd2cea2ee1ca
Successfully built flash-attn
Installing collected packages: flash-attn
Successfully installed flash-attn-2.7.4.post1


In [6]:
from pathlib import Path


config_content = """
compute_environment: LOCAL_MACHINE
debug: false
deepspeed_config:
  gradient_clipping: 1.0
  zero3_init_flag: true
  zero_stage: 1
distributed_type: DEEPSPEED
downcast_bf16: 'no'
machine_rank: 0
main_training_function: main
mixed_precision: bf16
num_machines: 1
num_processes: 2
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false
"""

config_path = "custom_config.yaml"
Path(config_path).write_text(config_content)


#################################


config_content2 = """
# Model arguments
model_name_or_path: Qwen/Qwen2-0.5B
model_revision: main
torch_dtype: bfloat16
attn_implementation: eager

# Data training arguments
dataset_name: stpete2/openai-gsm8k-part
system_prompt: |
  You are a helpful AI Assistant that provides well-reasoned and detailed responses. You first think about the reasoning process as an internal monologue and then provide the user with the answer. Respond in the following format: <think>
  ...
  </think>
  <answer>
  ...
  </answer>
# GRPO trainer config
bf16: true
use_vllm: false
do_eval: false
gradient_accumulation_steps: 2
gradient_checkpointing: true
gradient_checkpointing_kwargs:
  use_reentrant: false
hub_model_id: Qwen2-0.5B-gsm8k-cppo
hub_strategy: every_save
learning_rate: 2.0e-05
log_completions: true
log_level: info
logging_first_step: true
logging_steps: 1
logging_strategy: steps
lr_scheduler_type: cosine
max_prompt_length: 256
max_completion_length: 512
max_steps: -1
num_generations: 4

metric: smallest
pruning: 0.5 
allocation: true

num_train_epochs: 1
output_dir: data/Qwen2-0.5B-gsm8k-cppo
overwrite_output_dir: true
per_device_eval_batch_size: 16
per_device_train_batch_size: 8
push_to_hub: false
report_to:
- wandb
reward_funcs:
- accuracy
- format
- tag_count
reward_weights:
- 1.0
- 1.0
- 1.0
save_strategy: "epoch"
save_total_limit: 1
seed: 42
warmup_ratio: 0.1
"""

config_path2 = "custom_config2.yaml"
Path(config_path2).write_text(config_content2)


##########################################################


!accelerate launch --config_file custom_config.yaml src/open_r1/grpo_gsm.py \
--config custom_config2.yaml \
--disable_tqdm=False



[2025-04-19 08:49:49,882] [INFO] [real_accelerator.py:219:get_accelerator] Setting ds_accelerator to cuda (auto detect)
df: /root/.triton/autotune: No such file or directory
2025-04-19 08:49:54.804698: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:485] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2025-04-19 08:49:55.081775: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:8454] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2025-04-19 08:49:55.155674: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1452] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
W0419 08:50:05.535000 285 torch/distributed/run.py:793] 
W0419 08:50:05.535000 285 torch/distributed/run.py:793] *****************************************
W0419 08:50:05.535000 285 tor