# verl Demo

This demo was verified on the image `hiyouga/verl:ngc-th2.6.0-cu126-vllm0.8.3-flashinfer0.2.2-cxx11abi0`.

We adapt the setup from [SimpleRL Zoo](https://github.com/hkust-nlp/simpleRL-reason?tab=readme-ov-file#training). Kudos to their awesome work on verifying RL with LLMs of various scales!

In [1]:
import os

In [2]:
os.chdir("/root/verl")

## Install `verl`

In [3]:
! pip install -e ".[vllm]"

Looking in indexes: https://pypi.org/simple, https://pypi.ngc.nvidia.com
Obtaining file:///root/verl
  Installing build dependencies ... done
  Checking if build backend supports build_editable ... done
  Getting requirements to build editable ... done
  Preparing editable metadata (pyproject.toml) ... done
Building wheels for collected packages: verl
  Building editable for verl (pyproject.toml) ... done
  Created wheel for verl: filename=verl-0.2.0.dev0-0.editable-py3-none-any.whl size=16801 sha256=7d1ad94ef6d22a86858d0fb7d77a8698d42013b75515779eaa6b121cb97fce9f
  Stored in directory: /tmp/pip-ephem-wheel-cache-qxf0f01a/wheels/d4/f5/29/7c5bb62e9344bc78534719365f2fb772bb330dbd23de4b25d2
Successfully built verl
Installing collected packages: verl
  Attempting uninstall: verl
    Found existing installation: verl 0.2.0.dev0
    Uninstalling verl-0.2.0.dev0:
      Successfully uninstalled verl-0.2.0.dev0
Successfully installed verl-0.

## Prepare the Data

In [4]:
os.environ.update({
    "TRAIN_FILE": "/root/data/gsm8k/train.parquet",
    "TEST_FILE": "/root/data/gsm8k/test.parquet",
})

! python examples/data_preprocess/gsm8k.py

Creating parquet from Arrow format: 100%|█████████| 8/8 [00:00<00:00, 82.83ba/s]
Creating parquet from Arrow format: 100%|████████| 2/2 [00:00<00:00, 366.63ba/s]
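
Before moving on, it can help to confirm that the preprocessing step actually produced the two parquet files that the training command will read. The following is a minimal sketch using only the standard library; the helper name `missing_files` is hypothetical, and it assumes the script wrote to the locations stored in `TRAIN_FILE` and `TEST_FILE`.

```python
import os


def missing_files(paths):
    """Return the subset of paths that do not exist as regular files."""
    return [p for p in paths if not os.path.isfile(p)]


# Read the locations set earlier in this notebook; an unset variable
# becomes an empty string and is reported as missing.
expected = [os.environ.get("TRAIN_FILE", ""), os.environ.get("TEST_FILE", "")]
missing = missing_files(expected)

if missing:
    print("Missing data files:", missing)
else:
    print("Both data files are present.")
```

If a file is reported missing, check which output directory the preprocessing script used and adjust the two environment variables to match.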


## Download the Base Model

In [5]:
os.environ.update({
    "MODEL_ID": "Qwen/Qwen2.5-1.5B-Instruct",
})

! huggingface-cli download "${MODEL_ID}"

/root/.cache/huggingface/hub/models--Qwen--Qwen2.5-1.5B-Instruct/snapshots/989aa7980e4cf806f80c7fef2b1adb7bc71aa306
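
The path printed above follows the Hugging Face hub cache layout, in which a model ID is turned into a folder name by prefixing `models--` and replacing `/` with `--`. The sketch below reproduces that mapping with the standard library only; the helper name `hf_cache_repo_dir` is hypothetical, and it assumes the default cache root (no `HF_HOME` or `HF_HUB_CACHE` override).

```python
import os


def hf_cache_repo_dir(model_id, cache_root="~/.cache/huggingface/hub"):
    """Build the cache folder for a model ID under the assumed default root."""
    folder = "models--" + model_id.replace("/", "--")
    return os.path.join(os.path.expanduser(cache_root), folder)


# The snapshot hash under "snapshots/" is chosen by the hub, so this
# sketch only reconstructs the repository-level folder.
print(hf_cache_repo_dir("Qwen/Qwen2.5-1.5B-Instruct"))
```

This is only useful for locating the files by hand; the training command itself takes the model ID directly.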


## Train!

Search the output for "val-core/..." to find the core validation metrics.
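
If the output has been saved to a file or captured as a string, the search can be automated. The sketch below is one possible approach, not part of verl: it assumes the console logger prints lines of the form `step:<n> - <key>:<value> - ...`, which should be checked against the actual output. The function name `extract_val_core` and the sample line are hypothetical.

```python
import re

# Assumed line format: "step:<n> - <key>:<value> - <key>:<value> ...".
# Verify this against your own log before relying on it.
METRIC_RE = re.compile(r"(val-core/[^\s:]+):(-?\d+(?:\.\d+)?(?:[eE][-+]?\d+)?)")
STEP_RE = re.compile(r"\bstep:(\d+)")


def extract_val_core(log_text):
    """Return (step, metric_name, value) tuples for every val-core metric."""
    rows = []
    for line in log_text.splitlines():
        step_match = STEP_RE.search(line)
        step = int(step_match.group(1)) if step_match else None
        for name, value in METRIC_RE.findall(line):
            rows.append((step, name, float(value)))
    return rows


# Synthetic placeholder line for illustration, not real training output.
sample = "step:0 - val-core/example/reward/mean@1:0.500 - other/metric:1.000"
for row in extract_val_core(sample):
    print(row)
```

Lines without a `step:` field still yield their metrics, with the step recorded as `None`.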

In [6]:
os.environ.update({
    "VLLM_USE_V1": "1",
    "VERL_PPO_LOGGING_LEVEL": "INFO",
})

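The training command below combines several batch and token-budget settings, and a little arithmetic shows what they imply. The sketch copies the values from that command; the helper `worst_case_seqs` is hypothetical and rests on the assumption that dynamic batching packs whole sequences into a micro-batch until the per-GPU token budget is reached.

```python
# Values copied from the training command in this notebook.
train_batch_size = 128        # data.train_batch_size (prompts per step)
rollout_n = 8                 # actor_rollout_ref.rollout.n (responses per prompt)
max_prompt_length = 512       # data.max_prompt_length
max_response_length = 1024    # data.max_response_length
train_max_token_num_per_gpu = 1024 * 8
infer_max_token_num_per_gpu = 1024 * 32

# Each prompt is sampled rollout_n times, so this many responses are scored per step.
responses_per_step = train_batch_size * rollout_n

# Longest possible sequence: a full-length prompt plus a full-length response.
max_seq_len = max_prompt_length + max_response_length


def worst_case_seqs(token_budget, seq_len):
    """Whole sequences of length seq_len that fit in token_budget (assumed packing rule)."""
    return token_budget // seq_len


print("responses per step:", responses_per_step)
print("max sequence length:", max_seq_len)
print("full-length sequences per training micro-batch:",
      worst_case_seqs(train_max_token_num_per_gpu, max_seq_len))
print("full-length sequences per log-prob micro-batch:",
      worst_case_seqs(infer_max_token_num_per_gpu, max_seq_len))
```

These are worst-case figures; shorter responses let more sequences share one micro-batch under the same token budget.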

In [None]:
os.environ.update({
    k: str(v) for k, v in {
        "train_max_token_num_per_gpu": int(1024 * 8),
        "infer_max_token_num_per_gpu": int(1024 * 32),
    }.items()
})
! python3 -m verl.trainer.main_ppo \
    algorithm.adv_estimator=grpo \
    data.train_files=[${TRAIN_FILE}] \
    data.val_files=[${TEST_FILE}] \
    data.max_prompt_length=512 \
    data.max_response_length=1024 \
    data.train_batch_size=128 \
    algorithm.use_kl_in_reward=True \
    algorithm.kl_ctrl.kl_coef=0.0001 \
    actor_rollout_ref.model.path=${MODEL_ID} \
    actor_rollout_ref.model.use_remove_padding=True \
    actor_rollout_ref.model.enable_gradient_checkpointing=True \
    actor_rollout_ref.actor.entropy_coeff=0.001 \
    actor_rollout_ref.actor.optim.lr=5e-7 \
    actor_rollout_ref.actor.optim.lr_warmup_steps=10 \
    actor_rollout_ref.actor.clip_ratio_low=0.2 \
    actor_rollout_ref.actor.clip_ratio_high=0.2 \
    actor_rollout_ref.actor.clip_ratio_c=10.0 \
    actor_rollout_ref.actor.ppo_mini_batch_size=128 \
    actor_rollout_ref.actor.use_dynamic_bsz=True \
    actor_rollout_ref.actor.ppo_max_token_len_per_gpu="${train_max_token_num_per_gpu}" \
    actor_rollout_ref.rollout.n=8 \
    actor_rollout_ref.rollout.temperature=1.0 \
    actor_rollout_ref.rollout.top_p=1.0 \
    actor_rollout_ref.rollout.val_kwargs.temperature=1.0 \
    actor_rollout_ref.rollout.val_kwargs.top_p=0.95 \
    actor_rollout_ref.rollout.val_kwargs.do_sample=True \
    actor_rollout_ref.rollout.val_kwargs.n=1 \
    actor_rollout_ref.rollout.gpu_memory_utilization=0.8 \
    actor_rollout_ref.rollout.tensor_model_parallel_size=1 \
    actor_rollout_ref.rollout.enable_chunked_prefill=True \
    actor_rollout_ref.rollout.max_num_batched_tokens=10240 \
    actor_rollout_ref.rollout.log_prob_use_dynamic_bsz=True \
    actor_rollout_ref.rollout.log_prob_max_token_len_per_gpu=${infer_max_token_num_per_gpu} \
    actor_rollout_ref.ref.log_prob_use_dynamic_bsz=True \
    actor_rollout_ref.ref.log_prob_max_token_len_per_gpu=${infer_max_token_num_per_gpu} \
    trainer.total_epochs=20 \
    trainer.val_before_train=True \
    trainer.test_freq=5 \
    trainer.save_freq=-1 \
    trainer.resume_mode=disable \
    trainer.nnodes=1 \
    trainer.n_gpus_per_node=1 \
    trainer.logger=["console"] \
    trainer.project_name="verl-demo" \
    trainer.experiment_name="grpo-gsm8k-$(basename ${MODEL_ID,,})"

2025-04-16 11:23:41,891	INFO worker.py:1832 -- Started a local Ray instance. View the dashboard at 127.0.0.1:8265
(TaskRunner pid=1226029) {'actor_rollout_ref': {'actor': {'checkpoint': {'contents': ['model',
(TaskRunner pid=1226029)                                                              'optimizer',
(TaskRunner pid=1226029)                                                              'extra']},
(TaskRunner pid=1226029)                                  'clip_ratio': 0.2,
(TaskRunner pid=1226029)                                  'clip_ratio_c': 10.0,
(TaskRunner pid=1226029)                                  'clip_ratio_high': 0.2,
(TaskRunner pid=1226029)                                  'clip_ratio_low': 0.2,
(TaskRunner pid=1226029)                                  'entropy_coeff': 0.001,
(TaskRunner pid=1226029)                                  'fsdp_config': {'fsdp_size': -1,
