<a href="https://colab.research.google.com/github/saymrwulf/timeseries/blob/main/verl_getting_started.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Run Qwen PPO with [verl](https://github.com/volcengine/verl)

This tutorial provides a step-by-step guide to running your RLHF pipeline with veRL. See our [GitHub repo](https://github.com/volcengine/verl/) and [documentation](https://verl.readthedocs.io/en/latest/index.html) for more details.

This notebook is also published on the [Lightning Studio](https://lightning.ai/hlin-verl/studios/verl-getting-started) platform, which provides a free GPU quota every month. Check out the published notebook with pre-installed dependencies on a free L4 GPU [here](https://lightning.ai/hlin-verl/studios/verl-getting-started) (no credit card required).

### You will learn:

- How to install veRL from scratch.
- How to use existing scripts to run an RLHF pipeline with your own models and data.

# Dependency Installation

If you are running the published notebook on Lightning Studio, the dependencies are **already installed** and you can proceed to the step "**Load Pretrained Language Model**".

In [1]:
!pip3 install --upgrade pip setuptools wheel
!pip3 install torch==2.4.0 torchvision==0.19.0
!pip3 list | grep torch
!pip3 install flash-attn --no-build-isolation

Collecting torch==2.4.0
  Using cached torch-2.4.0-cp311-cp311-manylinux1_x86_64.whl.metadata (26 kB)
Collecting torchvision==0.19.0
  Using cached torchvision-0.19.0-cp311-cp311-manylinux1_x86_64.whl.metadata (6.0 kB)
Collecting nvidia-cuda-nvrtc-cu12==12.1.105 (from torch==2.4.0)
  Using cached nvidia_cuda_nvrtc_cu12-12.1.105-py3-none-manylinux1_x86_64.whl.metadata (1.5 kB)
Collecting nvidia-cuda-runtime-cu12==12.1.105 (from torch==2.4.0)
  Using cached nvidia_cuda_runtime_cu12-12.1.105-py3-none-manylinux1_x86_64.whl.metadata (1.5 kB)
Collecting nvidia-cuda-cupti-cu12==12.1.105 (from torch==2.4.0)
  Using cached nvidia_cuda_cupti_cu12-12.1.105-py3-none-manylinux1_x86_64.whl.metadata (1.6 kB)
Collecting nvidia-cudnn-cu12==9.1.0.70 (from torch==2.4.0)
  Using cached nvidia_cudnn_cu12-9.1.0.70-py3-none-manylinux2014_x86_64.whl.metadata (1.6 kB)
Collecting nvidia-cublas-cu12==12.1.3.1 (from torch==2.4.0)
  Using cached nvidia_cublas_cu12-12.1.3.1-py3-none-manylinux1_x86_64.whl.metadata (

In [1]:
!git clone https://github.com/Jiayi-Pan/TinyZero.git
%cd TinyZero

!pip install verl

fatal: destination path 'TinyZero' already exists and is not an empty directory.
/content/TinyZero


## Install and verify verl
Now we're ready to install verl!

In [None]:
# In case you run this notebook and have not cloned verl yet:
# !git clone https://github.com/volcengine/verl $HOME/verl_repo

!cd $HOME/verl_repo && pip3 install -e . -U

Obtaining file:///teamspace/studios/this_studio/verl_repo
  Installing build dependencies ... done
  Checking if build backend supports build_editable ... done
  Getting requirements to build editable ... done
  Preparing editable metadata (pyproject.toml) ... done
Building wheels for collected packages: verl
  Building editable for verl (pyproject.toml) ... done
  Created wheel for verl: filename=verl-0.1-0.editable-py3-none-any.whl size=13000 sha256=8fd1f1241dfe89d7f8384fe884f50ec4e070d18029c37472e5584300f5a326de
  Stored in directory: /tmp/pip-ephem-wheel-cache-pz36kou4/wheels/f4/30/ea/7a2d2086bd780aba22048a0b415dc5e5a9e50b2c87e39e9717
Successfully built verl
Installing collected packages: verl
Successfully installed verl-0.1


## Restart the python kernel

In [None]:
import IPython

# Restart the kernel to pickup the latest python packages
IPython.get_ipython().kernel.do_shutdown(restart=True)

{'status': 'ok', 'restart': True}

In [2]:
!pip install ray

Collecting ray
  Downloading ray-2.42.0-cp311-cp311-manylinux2014_x86_64.whl.metadata (18 kB)
Downloading ray-2.42.0-cp311-cp311-manylinux2014_x86_64.whl (67.4 MB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 67.4/67.4 MB 112.9 MB/s eta 0:00:00
Installing collected packages: ray
Successfully installed ray-2.42.0


In [3]:
import torch
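try:
  assert torch.cuda.is_available() is True
  # Smoke test: place a bfloat16 tensor on the GPU
  torch.ones(1, dtype=torch.bfloat16).cuda()
except (AssertionError, RuntimeError):
  # A CUDA failure surfaces as RuntimeError, so catch it alongside the assertion
  print("Please switch to an env with GPUs supporting bfloat16 (L4, RTX 5000, A5000, A100, H100, A10, etc)")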

try:
  import verl
except Exception as e:
  print("Please install verl via pip and restart the kernel")
  raise e

import flash_attn
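
Before importing the heavy packages, it can help to check which ones are missing without triggering an `ImportError`. The sketch below uses only the standard library. `find_missing` is a hypothetical helper name and not part of verl, and the package list is an assumption based on what this notebook installs.

```python
import importlib.util

def find_missing(packages):
    """Return the names in `packages` that cannot be located for import."""
    return [name for name in packages if importlib.util.find_spec(name) is None]

# Top-level import names of the packages this notebook installs (assumed list).
required = ["torch", "verl", "flash_attn", "ray", "vllm"]
missing = find_missing(required)
print("Missing packages:", missing if missing else "none")
```

If the list is not empty, install the missing packages and restart the kernel before continuing.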

# Load Pretrained Language Model

verl supports models available in Hugging Face Transformers (as well as custom Megatron models).

Let's download the model first.

In [4]:
!huggingface-cli download Qwen/Qwen2.5-0.5B-Instruct --local-dir $HOME/models/Qwen2.5-0.5B-Instruct

# If huggingface-cli is not stable, use the method below
# import transformers
# transformers.pipeline('text-generation', model='Qwen/Qwen2.5-0.5B-Instruct')

Fetching 10 files:   0% 0/10 [00:00<?, ?it/s]
Downloading 'merges.txt' to '/root/models/Qwen2.5-0.5B-Instruct/.cache/huggingface/download/PtHk0z_I45atnj23IIRhTExwT3w=.20024bfe7c83998e9aeaf98a0cd6a2ce6306c2f0.incomplete'
Downloading 'config.json' to '/root/models/Qwen2.5-0.5B-Instruct/.cache/huggingface/download/8_PA_wEVGiVa2goH2H4KQOQpvVY=.0dbb161213629a23f0fc00ef286e6b1e366d180f.incomplete'
Downloading 'generation_config.json' to '/root/models/Qwen2.5-0.5B-Instruct/.cache/huggingface/download/3EVKVggOldJcKSsGjSdoUCN1AyQ=.dfc11073787daf1b0f9c0f1499487ab5f4c93738.incomplete'
Downloading 'model.safetensors' to '/root/models/Qwen2.5-0.5B-Instruct/.cache/huggingface/download/xGOKKLRSlIhH692hSVvI1-gpoa8=.fdf756fa7fcbe7404d5c60e26bff1a0c8b8aa1f72ced49e7dd0210fe288fb7fe.incomplete'
config.json: 100% 659/659 [00:00<00:00, 5.84MB/s]
merges.txt:   0% 0.00/1.67M [00:00<?, ?B/s]
Download complete. Moving file to /root/models/Qwen2.5-0.5B-Instruct

# Dataset preparation

We train on the Grade School Math 8K (GSM8K) task in this demo. The dataset is downloaded from Hugging Face ([gsm8k](https://huggingface.co/datasets/openai/gsm8k)); a sample is shown below:


**Prompt**

Katy makes coffee using teaspoons of sugar and cups of water in the ratio of 7:13. If she used a total of 120 teaspoons of sugar and cups of water, calculate the number of teaspoonfuls of sugar she used.

**Solution**

The total ratio representing the ingredients she used to make the coffee is 7+13 = <<7+13=20>>20 Since the fraction representing the number of teaspoons she used is 7/20, she used 7/20\*120 = <<7/20\*120=42>>42 #### 42
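
The arithmetic in this sample can be checked directly. GSM8K solutions embed calculator annotations of the form `<<expression=result>>` and end with `####` followed by the final answer. The sketch below parses both from the sample string and recomputes the result. The regular expressions are illustrative assumptions, not the parsing code used by verl.

```python
import re
from fractions import Fraction

solution = (
    "The total ratio representing the ingredients she used to make the coffee is "
    "7+13 = <<7+13=20>>20 Since the fraction representing the number of teaspoons "
    "she used is 7/20, she used 7/20*120 = <<7/20*120=42>>42 #### 42"
)

# Calculator annotations look like <<expression=result>>.
annotations = re.findall(r"<<([^=<>]+)=([^<>]+)>>", solution)

# The final answer follows the four '#' characters.
final_answer = re.search(r"####\s*(-?[0-9.,]+)", solution).group(1)

# Recompute the sugar share: 7 parts out of 7 + 13, applied to 120 units.
sugar = Fraction(7, 7 + 13) * 120

print(annotations)
print(final_answer, sugar)
```

The recomputed value matches the final answer after `####`, which is the string the reward function later compares against.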

In [1]:
!pip install datasets
!mkdir -p $HOME/data/gsm8k
!python3 /content/TinyZero/examples/data_preprocess/gsm8k.py --local_dir $HOME/data/gsm8k

README.md: 100% 7.94k/7.94k [00:00<00:00, 44.3MB/s]
train-00000-of-00001.parquet: 100% 2.31M/2.31M [00:00<00:00, 29.1MB/s]
test-00000-of-00001.parquet: 100% 419k/419k [00:00<00:00, 187MB/s]
Generating train split: 100% 7473/7473 [00:00<00:00, 173916.13 examples/s]
Generating test split: 100% 1319/1319 [00:00<00:00, 330088.72 examples/s]
Map: 100% 7473/7473 [00:00<00:00, 19404.02 examples/s]
Map: 100% 1319/1319 [00:00<00:00, 17962.32 examples/s]
Creating parquet from Arrow format: 100% 8/8 [00:00<00:00, 190.79ba/s]
Creating parquet from Arrow format: 100% 2/2 [00:00<00:00, 282.67ba/s]


# The reward

We use a rule-based reward model. We force the model to produce a final answer following four `#` characters, as shown in the solution. We extract the final answer from both the solution and the model's output using regular expression matching, then compare them. A correct answer receives a reward of 1, an incorrect answer receives `format_score` (described here as 0.1, although the `compute_score` source printed below defaults to 0), and no answer receives 0.

In [2]:
import inspect
from verl.utils.reward_score.gsm8k import compute_score as gsm8k_reward
print(inspect.getsource(gsm8k_reward))

def compute_score(solution_str, ground_truth, method='strict', format_score=0., score=1.):
    answer = extract_solution(solution_str=solution_str, method=method)
    if answer is None:
        return 0
    else:
        if answer == ground_truth:
            return score
        else:
            return format_score
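
The printed source calls `extract_solution`, which is not shown. To make the scoring rule concrete, here is a self-contained sketch that pairs a simplified extractor with the same scoring logic. The regular expression and the names `extract_final_answer` and `rule_based_score` are assumptions for illustration, and the actual verl implementation may differ.

```python
import re

def extract_final_answer(text):
    """Return the number following '#### ', or None if the marker is absent."""
    match = re.search(r"#### (-?[0-9.,]+)", text)
    if match is None:
        return None
    # Drop thousands separators so '1,000' compares equal to '1000'.
    return match.group(1).replace(",", "")

def rule_based_score(solution_str, ground_truth, format_score=0.0, score=1.0):
    """Mirror the printed compute_score logic using the simplified extractor."""
    answer = extract_final_answer(solution_str)
    if answer is None:
        return 0
    return score if answer == ground_truth else format_score

print(rule_based_score("7/20*120 = 42 #### 42", "42"))                        # correct answer
print(rule_based_score("so the answer is #### 40", "42", format_score=0.1))  # wrong answer, right format
print(rule_based_score("the answer is 42", "42"))                            # no '####' marker
```

Note that the comparison is a string match, so the ground truth must be normalized the same way as the extracted answer.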



# Run the RL Pipeline
Let's start with the Proximal Policy Optimization (PPO) algorithm, one of the most widely used methods for post-training large language models.

The main entry point of the PPO algorithm example is: `main_ppo.py`. A detailed guide to understanding the code architecture of `main_ppo.py` is available [here](https://verl.readthedocs.io/en/latest/examples/ppo_code_architecture.html).

In this tutorial, we will demonstrate how to run the PPO algorithm with **Qwen 2.5-0.5B** by setting:
- `trainer.n_gpus_per_node`: Number of GPUs per node.

- `actor_rollout_ref.rollout.tensor_model_parallel_size`: TP size for rollout. Only effective for vllm.

- `actor_rollout_ref/critic.model.path`: Huggingface model path. This can be either local path or HDFS path. For HDFS path, we provide utils to download it to DRAM and convert the HDFS path to local path.

- `data.train_batch_size`: Batch size sampled for one training iteration of different RL algorithms.

- `data.max_prompt_length`: Maximum prompt length. All prompts will be left-padded to this length. An error will be reported if the length is too long.

- `data.max_response_length`: Maximum response length. Rollout in RL algorithms (e.g. PPO) generates up to this length.

- `actor_rollout_ref.actor.ppo_mini_batch_size`: The sampled training batch is split into multiple mini-batches of this size for PPO updates.

- `actor_rollout_ref.actor.ppo_micro_batch_size` and `critic.ppo_micro_batch_size`: The micro-batch size for one forward pass. As with gradient accumulation, smaller values trade speed for lower GPU memory use.

The full configuration explanation is available [here](https://verl.readthedocs.io/en/latest/examples/config.html).
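
Each of these settings is passed on the command line as a dotted `key=value` override, which verl's trainer reads through Hydra. As a rough mental model, and not Hydra's actual implementation, the sketch below folds such overrides into a nested dictionary. The helper `apply_overrides` is hypothetical, and values are kept as strings rather than parsed into numbers.

```python
def apply_overrides(config, overrides):
    """Fold 'a.b.c=value' strings into a nested dict (values stay strings)."""
    for item in overrides:
        dotted_key, value = item.split("=", 1)
        # A leading '+' marks a key that is added rather than overridden.
        *parents, leaf = dotted_key.lstrip("+").split(".")
        node = config
        for name in parents:
            node = node.setdefault(name, {})
        node[leaf] = value
    return config

config = apply_overrides({}, [
    "data.train_batch_size=256",
    "actor_rollout_ref.actor.ppo_mini_batch_size=64",
    "actor_rollout_ref.rollout.tensor_model_parallel_size=1",
    "trainer.n_gpus_per_node=1",
])
print(config["actor_rollout_ref"])
```

Reading the keys this way makes it easier to locate a setting in the configuration dump that the trainer prints at startup.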

The training may take a few hours to finish but you can observe how the model performance increases. It will progressively output:

- generated sentences.

- step information with RL metrics, such as entropy loss, kl, and ``val/test_score/openai/gsm8k`` (validated every ``trainer.test_freq`` steps)

If you run into GPU out-of-memory errors, set smaller values for the micro-batch sizes used for gradient accumulation:

- `actor_rollout_ref.actor.ppo_micro_batch_size=1`
- `critic.ppo_micro_batch_size=1`
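
To see why a smaller micro-batch lowers memory use without changing the size of each update, consider how the batch sizes nest. The sketch below assumes a single GPU, that each training batch is split evenly into mini-batches, and that each mini-batch is processed as accumulated micro-batches. `batch_plan` is a hypothetical helper, and the numbers come from the training command in this notebook.

```python
def batch_plan(train_batch_size, mini_batch_size, micro_batch_size):
    """Return (optimizer updates per batch, accumulation steps per update)."""
    if train_batch_size % mini_batch_size or mini_batch_size % micro_batch_size:
        raise ValueError("batch sizes must divide evenly")
    updates_per_batch = train_batch_size // mini_batch_size
    accumulation_steps = mini_batch_size // micro_batch_size
    return updates_per_batch, accumulation_steps

# Settings from this notebook: 256 / 64 / 1.
print(batch_plan(256, 64, 1))
# A larger micro-batch needs fewer accumulation steps but more memory per pass.
print(batch_plan(256, 64, 4))
```

Under these assumptions, lowering the micro-batch size only increases the number of forward and backward passes per update, which is why it is the first knob to turn when memory is tight.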

In [3]:
!pip install vllm==0.6.3

Collecting vllm==0.6.3
  Downloading vllm-0.6.3-cp38-abi3-manylinux1_x86_64.whl.metadata (10 kB)
Collecting uvicorn[standard] (from vllm==0.6.3)
  Downloading uvicorn-0.34.0-py3-none-any.whl.metadata (6.5 kB)
Collecting prometheus-fastapi-instrumentator>=7.0.0 (from vllm==0.6.3)
  Downloading prometheus_fastapi_instrumentator-7.0.2-py3-none-any.whl.metadata (13 kB)
Collecting tiktoken>=0.6.0 (from vllm==0.6.3)
  Downloading tiktoken-0.8.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (6.6 kB)
Collecting lm-format-enforcer==0.10.6 (from vllm==0.6.3)
  Downloading lm_format_enforcer-0.10.6-py3-none-any.whl.metadata (16 kB)
Collecting outlines<0.1,>=0.0.43 (from vllm==0.6.3)
  Downloading outlines-0.0.46-py3-none-any.whl.metadata (15 kB)
Collecting partial-json-parser (from vllm==0.6.3)
  Downloading partial_json_parser-0.2.1.1.post5-py3-none-any.whl.metadata (6.1 kB)
Collecting msgspec (from vllm==0.6.3)
  Downloading msgspec-0.19.0-cp311-cp311-manylinux_2_17_x86_64

In [None]:
!PYTHONUNBUFFERED=1 python3 -m verl.trainer.main_ppo \
 data.train_files=$HOME/data/gsm8k/train.parquet \
 data.val_files=$HOME/data/gsm8k/test.parquet \
 data.train_batch_size=256 \
 data.val_batch_size=1312 \
 data.max_prompt_length=512 \
 data.max_response_length=256 \
 actor_rollout_ref.model.path=$HOME/models/Qwen2.5-0.5B-Instruct \
 actor_rollout_ref.actor.optim.lr=1e-6 \
 actor_rollout_ref.actor.ppo_mini_batch_size=64 \
 actor_rollout_ref.actor.ppo_micro_batch_size=1 \
 actor_rollout_ref.rollout.log_prob_micro_batch_size=1 \
 actor_rollout_ref.rollout.tensor_model_parallel_size=1 \
 actor_rollout_ref.rollout.gpu_memory_utilization=0.4 \
 actor_rollout_ref.ref.log_prob_micro_batch_size=4 \
 critic.optim.lr=1e-5 \
 critic.model.path=$HOME/models/Qwen2.5-0.5B-Instruct \
 critic.ppo_micro_batch_size=1 \
 algorithm.kl_ctrl.kl_coef=0.001 \
 +trainer.val_before_train=False \
 trainer.default_hdfs_dir='' \
 trainer.n_gpus_per_node=1 \
 trainer.nnodes=1 \
 trainer.save_freq=10 \
 trainer.test_freq=10 \
 trainer.total_epochs=15 \
 trainer.logger=\[console\]

2025-02-11 17:59:40,119	INFO worker.py:1841 -- Started a local Ray instance.
(main_task pid=24124) {'actor_rollout_ref': {'actor': {'clip_ratio': 0.2,
(main_task pid=24124)                                  'entropy_coeff': 0.001,
(main_task pid=24124)                                  'fsdp_config': {'grad_offload': False,
(main_task pid=24124)                                                  'optimizer_offload': False,
(main_task pid=24124)                                                  'param_offload': False,
(main_task pid=24124)                                                  'wrap_policy': {'min_num_params': 0}},
(main_task pid=24124)                                  'grad_clip': 1.0,
(main_task pid=24124)                                  'optim': {'lr': 1e-06,
(main_task pid=24124)                                            'lr_warmup_steps_ratio': 0.0,
(main_task pid=24124)

# Stop and clean up resources

In [1]:
!ray stop

/bin/bash: line 1: ray: command not found
