# Run Qwen PPO with [verl](https://github.com/volcengine/verl)

This tutorial provides a step-by-step guide to using veRL to run your RLHF pipeline. You can find our [GitHub repo](https://github.com/volcengine/verl/) and [documentation](https://verl.readthedocs.io/en/latest/index.html) for more details.

Please connect to a T4 GPU to run the notebook. It's **free**! However, be aware that the environment is not persistent and may be lost if the session stays idle for some time.

### You will learn:

- How to install veRL from scratch.
- How to use existing scripts to run an RLHF pipeline with your own models and data.

# Dependency Installation

In [1]:
!pip3 uninstall torch torchaudio torchvision -y
!pip3 install torch==2.4.0 torchvision==0.19.0 --index-url https://download.pytorch.org/whl/cu121
!pip3 list | grep torch

Found existing installation: torch 2.5.1+cu121
Uninstalling torch-2.5.1+cu121:
  Successfully uninstalled torch-2.5.1+cu121
Found existing installation: torchaudio 2.5.1+cu121
Uninstalling torchaudio-2.5.1+cu121:
  Successfully uninstalled torchaudio-2.5.1+cu121
Found existing installation: torchvision 0.20.1+cu121
Uninstalling torchvision-0.20.1+cu121:
  Successfully uninstalled torchvision-0.20.1+cu121
Looking in indexes: https://download.pytorch.org/whl/cu121
Collecting torch==2.4.0
  Downloading https://download.pytorch.org/whl/cu121/torch-2.4.0%2Bcu121-cp310-cp310-linux_x86_64.whl (799.1 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 799.1/799.1 MB 2.3 MB/s eta 0:00:00
Collecting torchvision==0.19.0
  Downloading https://download.pytorch.org/whl/cu121/torchvision-0.19.0%2Bcu121-cp310-cp310-linux_x86_64.whl (7.1 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 7.1/7.1 MB 98.8 MB/s eta 0:00:00


## Flash attention

Install flash-attention from a pre-built wheel for T4. A pre-built wheel is used because flash-attn 2 does not support T4, and compiling flash-attn 1 from source is slow.

In [2]:
!wget https://github.com/eric-haibin-lin/fa-wheels/raw/refs/heads/wheel/sm75-v109-th241-cu121/flash_attn-1.0.9-cp310-cp310-linux_x86_64.whl
!pip3 install flash_attn-1.0.9-cp310-cp310-linux_x86_64.whl

--2025-01-10 06:19:59--  https://github.com/eric-haibin-lin/fa-wheels/raw/refs/heads/wheel/sm75-v109-th241-cu121/flash_attn-1.0.9-cp310-cp310-linux_x86_64.whl
Resolving github.com (github.com)... 20.205.243.166
Connecting to github.com (github.com)|20.205.243.166|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://media.githubusercontent.com/media/eric-haibin-lin/fa-wheels/refs/heads/wheel/sm75-v109-th241-cu121/flash_attn-1.0.9-cp310-cp310-linux_x86_64.whl [following]
--2025-01-10 06:19:59--  https://media.githubusercontent.com/media/eric-haibin-lin/fa-wheels/refs/heads/wheel/sm75-v109-th241-cu121/flash_attn-1.0.9-cp310-cp310-linux_x86_64.whl
Resolving media.githubusercontent.com (media.githubusercontent.com)... 185.199.108.133, 185.199.110.133, 185.199.109.133, ...
Connecting to media.githubusercontent.com (media.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 72963892 (70M) [applicatio
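
The `sm75` in the wheel name refers to CUDA compute capability 7.5, which is what a T4 (Turing) reports. As a rough sketch of the version choice described above, the hypothetical helper below maps a compute capability to a flash-attn major version, assuming flash-attn 2 needs capability 8.0 (Ampere) or newer and flash-attn 1 covers 7.5. The function name and thresholds are illustrative and are not part of verl or flash-attn.

```python
def flash_attn_major_version_for(capability):
    """Pick a flash-attn major version for a (major, minor) CUDA capability.

    Hypothetical helper for illustration only.
    Returns 2 for Ampere or newer, 1 for Turing, and None otherwise.
    """
    if tuple(capability) >= (8, 0):
        return 2
    if tuple(capability) >= (7, 5):
        return 1
    return None


# In the notebook, the tuple would come from torch.cuda.get_device_capability(0).
print(flash_attn_major_version_for((7, 5)))  # T4 -> prints 1
```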

## Install and verify verl
Now we're ready to install verl!

In [5]:
!git clone https://github.com/volcengine/verl verl_repo
!cd verl_repo && pip3 install -e .

fatal: destination path 'verl_repo' already exists and is not an empty directory.
Obtaining file:///content/verl_repo
  Installing build dependencies ... done
  Checking if build backend supports build_editable ... done
  Getting requirements to build editable ... done
  Preparing editable metadata (pyproject.toml) ... done
Collecting codetiming (from verl==0.1)
  Downloading codetiming-1.4.0-py3-none-any.whl.metadata (7.7 kB)
Collecting datasets (from verl==0.1)
  Downloading datasets-3.2.0-py3-none-any.whl.metadata (20 kB)
Collecting dill (from verl==0.1)
  Downloading dill-0.3.9-py3-none-any.whl.metadata (10 kB)
Collecting hydra-core (from verl==0.1)
  Downloading hydra_core-1.3.2-py3-none-any.whl.metadata (5.5 kB)
Collecting pybind11 (from verl==0.1)
  Downloading pybind11-2.13.6-py3-none-any.whl.metadata (9.5 kB)
Collecting ray (from verl==0.1)
  Downloading ray-2.40.0-cp310-cp310-manylinux2014_x86_64.whl.metadata (17 kB)
Collecting 

In [1]:
import torch
# Training needs a CUDA device; warn early if the runtime has no GPU attached.
if not torch.cuda.is_available():
  print("Please connect to a T4 GPU first using the top right button")

import verl

# Load Pretrained Language Model

verl supports models available in Hugging Face Transformers (as well as custom Megatron models).

Let's download the model first.

In [2]:
import transformers
transformers.pipeline('text-generation', model='Qwen/Qwen2.5-0.5B-Instruct')

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


config.json:   0%|          | 0.00/659 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/988M [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/242 [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/7.30k [00:00<?, ?B/s]

vocab.json:   0%|          | 0.00/2.78M [00:00<?, ?B/s]

merges.txt:   0%|          | 0.00/1.67M [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/7.03M [00:00<?, ?B/s]

Device set to use cuda:0


<transformers.pipelines.text_generation.TextGenerationPipeline at 0x7be6f79959f0>

# Dataset preparation

We train on the Grade School Math 8K (GSM8K) task in this demo. The dataset is downloaded from Hugging Face ([gsm8k](https://huggingface.co/datasets/openai/gsm8k)), and a sample is shown below:


**Prompt**

Katy makes coffee using teaspoons of sugar and cups of water in the ratio of 7:13. If she used a total of 120 teaspoons of sugar and cups of water, calculate the number of teaspoonfuls of sugar she used.

**Solution**

The total ratio representing the ingredients she used to make the coffee is 7+13 = <<7+13=20>>20. Since the fraction representing the number of teaspoons she used is 7/20, she used 7/20\*120 = <<7/20\*120=42>>42 #### 42

In [3]:
!python3 verl_repo/examples/data_preprocess/gsm8k.py --local_dir ~/data/gsm8k

README.md: 100% 7.94k/7.94k [00:00<00:00, 30.1MB/s]
train-00000-of-00001.parquet: 100% 2.31M/2.31M [00:00<00:00, 31.2MB/s]
test-00000-of-00001.parquet: 100% 419k/419k [00:00<00:00, 82.5MB/s]
Generating train split: 100% 7473/7473 [00:00<00:00, 63243.85 examples/s]
Generating test split: 100% 1319/1319 [00:00<00:00, 281899.97 examples/s]
Map: 100% 7473/7473 [00:00<00:00, 13537.93 examples/s]
Map: 100% 1319/1319 [00:00<00:00, 17447.66 examples/s]
Creating parquet from Arrow format: 100% 8/8 [00:00<00:00, 213.58ba/s]
Creating parquet from Arrow format: 100% 2/2 [00:00<00:00, 287.46ba/s]
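
To make the sample format concrete, here is a minimal sketch of how a raw GSM8K solution can be split into its reasoning text and its final answer. It assumes only what the sample above shows: calculator annotations wrapped in `<<...>>` and a final answer after `####`. The helper `split_gsm8k_solution` is hypothetical and is not the code in `gsm8k.py`.

```python
import re


def split_gsm8k_solution(solution):
    """Split a raw GSM8K solution into (reasoning, final_answer).

    Hypothetical helper: strips <<...>> calculator annotations from the
    reasoning and returns the text after '####' as the final answer.
    """
    reasoning, marker, final = solution.partition("####")
    if not marker:
        raise ValueError("solution has no '####' marker")
    reasoning = re.sub(r"<<.*?>>", "", reasoning).strip()
    return reasoning, final.strip().replace(",", "")


sample = (
    "The total ratio representing the ingredients she used to make the coffee is "
    "7+13 = <<7+13=20>>20 Since the fraction representing the number of teaspoons "
    "she used is 7/20, she used 7/20*120 = <<7/20*120=42>>42 #### 42"
)
reasoning, answer = split_gsm8k_solution(sample)
print(answer)  # prints 42
```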


# The reward

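We use a rule-based reward function. The model is required to produce its final answer after four `#` characters (`####`), as shown in the solution. We extract the final answer from both the solution and the model's output using regular-expression matching, then compare them: a correct answer receives `score` (1 by default), an incorrect answer in the right format receives `format_score`, and a response with no extractable answer receives 0. Note that `format_score` defaults to 0 in the source printed below, so an incorrect answer earns 0.1 only if `format_score=0.1` is passed explicitly.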

In [4]:
import inspect
from verl.utils.reward_score.gsm8k import compute_score as gsm8k_reward
print(inspect.getsource(gsm8k_reward))

def compute_score(solution_str, ground_truth, method='strict', format_score=0., score=1.):
    """The scoring function for GSM8k.

    Reference: Trung, Luong, et al. "Reft: Reasoning with reinforced fine-tuning." Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2024.

    Args:
        solution_str: the solution text
        ground_truth: the ground truth
        method: the method to extract the solution, choices are 'strict' and 'flexible'
        format_score: the score for the format
        score: the score for the correct answer
    """
    answer = extract_solution(solution_str=solution_str, method=method)
    if answer is None:
        return 0
    else:
        if answer == ground_truth:
            return score
        else:
            return format_score
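
The three outcomes can be tried out without installing verl using the stand-in below. It mirrors the branching of `compute_score` shown above, but the extraction helper `extract_final_answer` and its regular expression are assumptions made for illustration; they are not copied from verl's `extract_solution`.

```python
import re


def extract_final_answer(text):
    """Return the number following '#### ' as a string, or None if absent.

    Hypothetical stand-in for verl's extract_solution.
    """
    match = re.search(r"#### (-?[0-9][0-9.,]*)", text)
    if match is None:
        return None
    return match.group(1).replace(",", "")


def toy_score(solution_str, ground_truth, format_score=0.0, score=1.0):
    """Same branching as compute_score: no answer, correct, or wrong."""
    answer = extract_final_answer(solution_str)
    if answer is None:
        return 0.0
    return score if answer == ground_truth else format_score


print(toy_score("7/20*120 = 42 #### 42", "42"))        # correct answer
print(toy_score("#### 41", "42", format_score=0.1))    # wrong answer, right format
print(toy_score("The answer is 42", "42"))             # no '####' marker
```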



# Run the RL Pipeline
Let's start with the Proximal Policy Optimization (PPO) algorithm, one of the most widely used methods for post-training large language models.

The main entry point of the PPO algorithm example is: `main_ppo.py`. A detailed guide to understanding the code architecture of `main_ppo.py` is available [here](https://verl.readthedocs.io/en/latest/examples/ppo_code_architecture.html).
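
For readers new to PPO, the sketch below shows the standard clipped surrogate loss for a single token in plain Python. It follows the textbook formulation and is not taken from verl's implementation; the function name is hypothetical, and the default `clip_ratio=0.2` is chosen to match the `clip_ratio` value that appears in the configuration printed by the training run.

```python
def ppo_clipped_loss(ratio, advantage, clip_ratio=0.2):
    """Textbook PPO clipped surrogate loss for one sample.

    ratio:     new_policy_prob / old_policy_prob for the sampled token
    advantage: estimated advantage of that token
    Returns the loss to minimize (the negated clipped objective).
    """
    clipped_ratio = max(1.0 - clip_ratio, min(ratio, 1.0 + clip_ratio))
    return -min(ratio * advantage, clipped_ratio * advantage)


# A large ratio with a positive advantage is capped at 1 + clip_ratio.
print(ppo_clipped_loss(1.5, 1.0))
# A small ratio with a negative advantage is capped at 1 - clip_ratio.
print(ppo_clipped_loss(0.5, -1.0))
```

Clipping removes the incentive to move the policy far from the one that generated the rollout within a single update.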

In this tutorial, we will demonstrate how to run the PPO algorithm with **Qwen 2.5-0.5B** by setting:
- `trainer.n_gpus_per_node`: Number of GPUs per node.

- `actor_rollout_ref.rollout.tensor_model_parallel_size`: TP size for rollout. Only effective for vllm.

- `actor_rollout_ref.model.path` / `critic.model.path`: Hugging Face model path. This can be either a local path or an HDFS path. For an HDFS path, we provide utils to download it to DRAM and convert the HDFS path to a local path.

- `data.train_batch_size`: Batch size sampled for one training iteration of different RL algorithms.

- `data.max_prompt_length`: Maximum prompt length. All prompts will be left-padded to this length. An error will be reported if the length is too long.

- `data.max_response_length`: Maximum response length. Rollout in RL algorithms (e.g. PPO) generates up to this length.

- `actor_rollout_ref.actor.ppo_mini_batch_size`: Each sampled training batch is split into multiple sub-batches with batch_size=ppo_mini_batch_size for PPO updates.

- `actor_rollout_ref.actor.ppo_micro_batch_size` / `critic.ppo_micro_batch_size`: Similar to gradient accumulation, this is the micro batch size for one forward pass, trading speed for GPU memory.

The full configuration explanation is available [here](https://verl.readthedocs.io/en/latest/examples/config.html).
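
The relationship between the three batch sizes is easiest to see with numbers. The sketch below works through the values used in this tutorial's command (256, 64 and 1 on a single GPU). It assumes the mini-batch size is global and divided evenly across GPUs, and that the micro batch size is per GPU; the helper `ppo_batch_plan` is hypothetical, and the exact semantics may differ between verl versions.

```python
def ppo_batch_plan(train_batch_size, mini_batch_size, micro_batch_size, n_gpus=1):
    """Work out update counts from the batch-size settings.

    Hypothetical helper. Assumes the mini batch is a global size split
    evenly across GPUs and the micro batch is a per-GPU size.
    """
    if train_batch_size % mini_batch_size != 0:
        raise ValueError("train_batch_size must be divisible by mini_batch_size")
    per_gpu_mini = mini_batch_size // n_gpus
    if per_gpu_mini == 0 or per_gpu_mini % micro_batch_size != 0:
        raise ValueError("per-GPU mini batch must be divisible by micro_batch_size")
    return {
        # Optimizer steps per pass over one sampled training batch.
        "updates_per_ppo_epoch": train_batch_size // mini_batch_size,
        # Forward/backward passes accumulated before each optimizer step.
        "grad_accum_steps": per_gpu_mini // micro_batch_size,
    }


plan = ppo_batch_plan(256, 64, 1, n_gpus=1)
print(plan)  # {'updates_per_ppo_epoch': 4, 'grad_accum_steps': 64}
```

A micro batch size of 1 keeps peak memory low enough for a T4 at the cost of many small forward passes per update.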

Training may take a long time to finish. It will output:

- generated sentences.

- step information with RL metrics.

In [1]:
!python3 -m verl.trainer.main_ppo \
 data.train_files=$HOME/data/gsm8k/train.parquet \
 data.val_files=$HOME/data/gsm8k/test.parquet \
 data.train_batch_size=256 \
 data.val_batch_size=1312 \
 data.max_prompt_length=512 \
 data.max_response_length=256 \
 actor_rollout_ref.model.path=Qwen/Qwen2.5-0.5B-Instruct \
 actor_rollout_ref.actor.optim.lr=1e-6 \
 actor_rollout_ref.actor.ppo_mini_batch_size=64 \
 actor_rollout_ref.actor.ppo_micro_batch_size=1 \
 actor_rollout_ref.rollout.log_prob_micro_batch_size=1 \
 actor_rollout_ref.rollout.tensor_model_parallel_size=1 \
 actor_rollout_ref.rollout.gpu_memory_utilization=0.4 \
 actor_rollout_ref.ref.log_prob_micro_batch_size=1 \
 critic.optim.lr=1e-5 \
 critic.model.path=Qwen/Qwen2.5-0.5B-Instruct \
 critic.ppo_micro_batch_size=1 \
 algorithm.kl_ctrl.kl_coef=0.001 \
 trainer.logger=['console'] \
 +trainer.val_before_train=False \
 trainer.default_hdfs_dir=null \
 trainer.n_gpus_per_node=1 \
 trainer.nnodes=1 \
 trainer.save_freq=10 \
 trainer.test_freq=10 \
 trainer.total_epochs=15


2025-01-10 06:34:43,005	INFO worker.py:1821 -- Started a local Ray instance.
(main_task pid=6881) {'actor_rollout_ref': {'actor': {'clip_ratio': 0.2,
(main_task pid=6881)                                  'entropy_coeff': 0.001,
(main_task pid=6881)                                  'fsdp_config': {'grad_offload': False,
(main_task pid=6881)                                                  'optimizer_offload': False,
(main_task pid=6881)                                                  'param_offload': False,
(main_task pid=6881)                                                  'wrap_policy': {'min_num_params': 0}},
(main_task pid=6881)                                  'grad_clip': 1.0,
(main_task pid=6881)                                  'optim': {'lr': 1e-06,
(main_task pid=6881)                                            'lr_warmup_steps_ratio': 0.0,
(main_task pid=6881)

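The dotted `key=value` arguments in the command are Hydra overrides, and the nested dictionary printed at the start of the run is the resulting configuration. The toy function below illustrates only the naming convention, that is, how a dotted key addresses a nested entry. It is not Hydra's parser: values stay as strings, and the leading `+` (Hydra's syntax for adding a key that is not in the default config) is simply stripped. The name `nest_overrides` is hypothetical.

```python
def nest_overrides(overrides):
    """Turn ['a.b=1', '+c.d=x'] into {'a': {'b': '1'}, 'c': {'d': 'x'}}.

    Illustrative only: no type conversion, no validation against a schema.
    """
    config = {}
    for item in overrides:
        key, _, value = item.lstrip("+").partition("=")
        *parents, leaf = key.split(".")
        node = config
        for name in parents:
            node = node.setdefault(name, {})
        node[leaf] = value
    return config


example = nest_overrides([
    "data.train_batch_size=256",
    "actor_rollout_ref.actor.optim.lr=1e-6",
    "+trainer.val_before_train=False",
])
print(example["actor_rollout_ref"]["actor"]["optim"]["lr"])  # prints 1e-6
```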
# Stop and clean up resources

In [8]:
!ray stop

2025-01-10 06:30:42,799 - INFO - NumExpr defaulting to 2 threads.
Did not find any active Ray processes.