# OpenR1 Qwen2-0.5B-math-drgrpo2

- gpu: T4*2
- model: Qwen/Qwen2-0.5B
- data: stpete2/openr1-math-part
- method: drgrpo
- output: Qwen2-0.5B-math-drgrpo
  

The previous approach, linked below, is correct but complicated.

https://www.kaggle.com/code/stpeteishii/openr1-qwen2-0-5b-math-drgrpo-0414

## Open-R1
Open-R1 is an open initiative to replicate and extend the techniques behind DeepSeek-R1, a state-of-the-art reasoning model, in a fully transparent and collaborative way:

https://github.com/huggingface/open-r1



By selecting the model, dataset, and method, and then launching the training command from the command line, I was able to run training successfully in the Open-R1 environment.

Considering the limitations of the notebook environment, I kept the model and data to a minimum and used the following techniques:

1. LoRA (Low-Rank Adaptation)
2. Gradient checkpointing
3. Batching optimizations
4. BF16 mixed precision
5. Sequence length limit
6. Data packing
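
To give a sense of why LoRA helps on small GPUs, the following sketch compares the number of trainable parameters in a dense weight matrix with its low-rank replacement. The dimension 896 is assumed to match the hidden size of Qwen2-0.5B, and the rank of 16 is purely illustrative; neither value is taken from this notebook's configuration, and the helper names are hypothetical.

```python
def full_param_count(d_in: int, d_out: int) -> int:
    """Parameters in a dense weight matrix of shape (d_out, d_in)."""
    return d_in * d_out


def lora_param_count(d_in: int, d_out: int, rank: int) -> int:
    """Parameters in the two LoRA factors: A (rank x d_in) and B (d_out x rank)."""
    return rank * d_in + d_out * rank


d_model = 896  # assumption: hidden size of Qwen2-0.5B
rank = 16      # assumption: illustrative LoRA rank, not set in this notebook

full = full_param_count(d_model, d_model)
lora = lora_param_count(d_model, d_model, rank)
print(f"full: {full:,}  lora: {lora:,}  ratio: {lora / full:.3%}")
```

Only the two small factors are trained while the dense matrix stays frozen, which is where the memory savings for gradients and optimizer state come from.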

This configuration is far from sufficient for effective training, but it lets us check that the method runs end to end in a short time. That makes it a useful starting point for validating the training pipeline on limited resources before scaling up to larger experiments.
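
To make the scale of this setup concrete, the sketch below works out how many completions and distinct prompts feed one optimizer step, using the batch-related values from the training configuration in this notebook. It assumes, as I understand the GRPO trainer in TRL to work, that `per_device_train_batch_size` counts completions and that every prompt is expanded into `num_generations` completions.

```python
# Values copied from the accelerate and trainer configs in this notebook.
num_processes = 2                 # two T4 GPUs
per_device_train_batch_size = 8   # completions per GPU per micro-step
gradient_accumulation_steps = 2
num_generations = 4               # completions sampled per prompt
max_prompt_length = 256
max_completion_length = 512

# Completions processed across both GPUs in one forward/backward pass.
completions_per_micro_step = num_processes * per_device_train_batch_size

# Assumption: the global batch must split evenly into groups of num_generations.
assert completions_per_micro_step % num_generations == 0

completions_per_optimizer_step = completions_per_micro_step * gradient_accumulation_steps
prompts_per_optimizer_step = completions_per_optimizer_step // num_generations

# Upper bound on tokens per sequence (prompt plus completion).
max_sequence_length = max_prompt_length + max_completion_length

print("completions per optimizer step:", completions_per_optimizer_step)
print("distinct prompts per optimizer step:", prompts_per_optimizer_step)
print("max tokens per sequence:", max_sequence_length)
```

With so few distinct prompts per update the reward signal is noisy, which is consistent with treating this run as a pipeline check rather than a serious training run.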

In [None]:
import os

import wandb
from kaggle_secrets import UserSecretsClient

# Read the W&B API key from Kaggle secrets and log in
user_secrets = UserSecretsClient()
secret_value = user_secrets.get_secret("wandb_api_key")
wandb.login(key=secret_value)

# Save metrics into the local wandb folder
os.environ["WANDB_DIR"] = "./wandb"
wandb.init(project="250414dr2", mode="online")

In [None]:
!git clone https://github.com/huggingface/open-r1.git
!pip install -e ./open-r1
!pip show open-r1
!pip install trl --upgrade

In [None]:
import os
from pathlib import Path

os.chdir('./open-r1')

In [None]:
!ls

In [None]:
!pip install flash-attn --no-build-isolation
#!pip install vllm

In [None]:
config_content = """
compute_environment: LOCAL_MACHINE
debug: false
deepspeed_config:
  gradient_clipping: 1.0
  zero3_init_flag: false
  zero_stage: 1
distributed_type: DEEPSPEED
downcast_bf16: 'no'
machine_rank: 0
main_training_function: main
mixed_precision: bf16
num_machines: 1
num_processes: 2
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false
"""

config_path = "custom_config.yaml"
Path(config_path).write_text(config_content)


#################################


config_content2 = """
# Model arguments
model_name_or_path: Qwen/Qwen2-0.5B
model_revision: main
torch_dtype: bfloat16
attn_implementation: eager

# Data training arguments
dataset_name: stpete2/openr1-math-part
dataset_prompt_column: problem
system_prompt: |
  You are a helpful AI Assistant that provides well-reasoned and detailed responses. You first think about the reasoning process as an internal monologue and then provide the user with the answer. Respond in the following format: <think>
  ...
  </think>
  <answer>
  ...
  </answer>
# GRPO trainer config
bf16: true
use_vllm: false
do_eval: false
gradient_accumulation_steps: 2
gradient_checkpointing: true
gradient_checkpointing_kwargs:
  use_reentrant: false
hub_model_id: Qwen2-0.5B-math-drgrpo
hub_strategy: every_save
learning_rate: 2.0e-05
log_completions: true
log_level: info
logging_first_step: true
logging_steps: 1
logging_strategy: steps
lr_scheduler_type: cosine
max_prompt_length: 256
max_completion_length: 512
max_steps: -1
num_generations: 4

scale_rewards: false

num_train_epochs: 1
output_dir: data/Qwen2-0.5B-math-drgrpo
overwrite_output_dir: true
per_device_eval_batch_size: 16
per_device_train_batch_size: 8
push_to_hub: false
report_to:
- wandb
reward_funcs:
- accuracy
- format
- tag_count
reward_weights:
- 1.0
- 1.0
- 1.0
save_strategy: "epoch"
save_total_limit: 1
seed: 42
warmup_ratio: 0.1
"""

config_path2 = "custom_config2.yaml"
Path(config_path2).write_text(config_content2)


##########################################################
#!accelerate launch --config_file custom_config.yaml grpo2.py \


!accelerate launch --config_file custom_config.yaml src/open_r1/grpo.py \
--config custom_config2.yaml \
--disable_tqdm=False
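
The `scale_rewards: false` entry in the trainer configuration above follows the Dr. GRPO recommendation of not dividing group-centered rewards by their standard deviation. The sketch below illustrates the difference on a single group of four completions. It is a simplified pure-Python illustration, not TRL's actual implementation: `group_advantages` is a hypothetical helper, the rewards are made up, and the population standard deviation and `eps` value are assumptions.

```python
from statistics import mean, pstdev


def group_advantages(rewards, scale_rewards, eps=1e-4):
    """Advantages for one prompt's group of completions.

    With scale_rewards=False (Dr. GRPO style) rewards are only mean-centered.
    With scale_rewards=True (standard GRPO style) they are also divided by
    the group's standard deviation.
    """
    mu = mean(rewards)
    centered = [r - mu for r in rewards]
    if not scale_rewards:
        return centered
    sigma = pstdev(rewards)  # assumption: population std for simplicity
    return [c / (sigma + eps) for c in centered]


# Made-up summed rewards (accuracy + format + tag_count) for num_generations = 4.
rewards = [3.0, 1.0, 1.0, 1.0]

unscaled = group_advantages(rewards, scale_rewards=False)
scaled = group_advantages(rewards, scale_rewards=True)

print("Dr. GRPO (centered only):", unscaled)
print("GRPO (centered and scaled):", [round(a, 3) for a in scaled])
```

Without scaling, a group whose rewards barely differ yields small advantages instead of having them amplified to unit variance, which is the question-level difficulty bias that Dr. GRPO aims to avoid.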

