In [1]:
from IPython.display import Image

- https://github.com/zhaochenyang20/Awesome-ML-SYS-Tutorial/blob/main/rlhf/verl/readme.md
- https://www.bilibili.com/video/BV1KCqbYoE2i/

In [2]:
Image(url='./imgs/verl_ppo.png', width=600)

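- The actor and the rollout engine share model parameters
    - Note that when the actor is updated during training, its parameters must be synchronized to the rollout engine (weight sync over NCCL);
    - weight sync is expensive
        - A100 RDMA bandwidth: 50 GB/s
        - Llama 405B: 60 s;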
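
To get a feel for why weight sync is expensive, the transfer time can be bounded from below by dividing the model size in bytes by the link bandwidth. The sketch below assumes bf16 weights (2 bytes per parameter) and a single 50 GB/s link; the precision and the function name `estimate_sync_seconds` are assumptions, not figures or APIs from the notes. It gives an idealized lower bound only: a real sync also pays for gathering and resharding, so measured times can be noticeably higher.

```python
def estimate_sync_seconds(num_params: float,
                          bytes_per_param: int,
                          bandwidth_bytes_per_s: float) -> float:
    """Idealized lower bound: total bytes moved divided by link bandwidth."""
    return num_params * bytes_per_param / bandwidth_bytes_per_s


# Assumed inputs: 405B parameters, bf16 (2 bytes each), 50 GB/s link.
seconds = estimate_sync_seconds(405e9, 2, 50e9)
print(f"lower bound: {seconds:.1f} s")
```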

### Core source code

```python
trainer = RayPPOTrainer(config=config,
                        tokenizer=tokenizer,
                        role_worker_mapping=role_worker_mapping,
                        resource_pool_manager=resource_pool_manager,
                        ray_worker_group_cls=ray_worker_group_cls,
                        reward_fn=reward_fn,
                        val_reward_fn=val_reward_fn)
trainer.init_workers()
trainer.fit()
```

- **RayPPOTrainer**
    - **init_workers**: sets up the worker groups (`WorkerGroup`)
        - Multiple workers share the same resources and actually run in a single process.
        - `actor_rollout_wg` (`ActorRolloutRefWorker`)
            - generate_sequences
            - compute_log_prob
        - `ref_policy_wg` (`ActorRolloutRefWorker`)
            - compute_ref_log_prob
        - `critic_wg` (`CriticWorker`)
        - `rm_wg` (`RewardModelWorker`)
    - **fit**

```python
def fit():
    # Training loop
    for epoch in range(total_epochs):
        for batch in dataloader:  
            # batch: input_ids, attention_mask, position_ids

            sequences = actor_rollout_wg.generate_sequences(batch)   # no_grad
            # batch: input_ids, attention_mask, position_ids, prompts, responses
            # (input_ids, attention_mask, and position_ids now cover the full prompt + response)
            batch = batch.repeat().union(sequences)  # repeat each prompt n times to line up with its n responses

            log_probs = actor_rollout_wg.compute_log_prob(batch)   # no_grad
            # batch: +log_probs(per sequence)

            ref_policy_wg.compute_ref_log_prob(batch)  # no_grad
            # batch: +ref_log_prob(per sequence)

            values = critic_wg.compute_values(batch)  # no_grad
            # batch: +values(per token)

            rewards = rm_wg.compute_rm_score(batch)  # no_grad
            reward_tensor = self.reward_fn(batch)  # rule based
            # batch: +rm_scores(per token), token_level_scores(per token)

            apply_kl_penalty(...)  # no_grad
            # batch: +token_level_rewards

            advantages = compute_advantage(batch)  # runs locally on the driver
            # batch: +advantage(per token), returns(per token)

            critic_wg.update_critic(batch)
            actor_rollout_wg.update_actor(batch)
```
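
The `apply_kl_penalty` and `compute_advantage` steps in the loop above can be illustrated with a small pure-Python sketch for a single response. This is not verl's implementation: the function names `kl_penalized_rewards` and `gae`, the coefficient `beta`, and the choice of the simple log-ratio KL estimator and GAE are assumptions made for illustration.

```python
from typing import List, Tuple


def kl_penalized_rewards(scores: List[float],
                         log_probs: List[float],
                         ref_log_probs: List[float],
                         beta: float) -> List[float]:
    """token_level_rewards = token_level_scores - beta * (log_prob - ref_log_prob)."""
    return [s - beta * (lp - ref)
            for s, lp, ref in zip(scores, log_probs, ref_log_probs)]


def gae(rewards: List[float],
        values: List[float],
        gamma: float = 1.0,
        lam: float = 1.0) -> Tuple[List[float], List[float]]:
    """Generalized Advantage Estimation over one response (per token)."""
    num_tokens = len(rewards)
    advantages = [0.0] * num_tokens
    running = 0.0
    for t in reversed(range(num_tokens)):
        # The value after the last token is taken to be 0.
        next_value = values[t + 1] if t + 1 < num_tokens else 0.0
        delta = rewards[t] + gamma * next_value - values[t]
        running = delta + gamma * lam * running
        advantages[t] = running
    returns = [a + v for a, v in zip(advantages, values)]
    return advantages, returns


# Toy example: the score arrives only on the last token.
token_rewards = kl_penalized_rewards(
    scores=[0.0, 0.0, 1.0],
    log_probs=[-1.0, -2.0, -0.5],
    ref_log_probs=[-1.5, -2.0, -1.0],
    beta=0.1,
)
advantages, returns = gae(token_rewards, values=[0.5, 0.5, 0.5])
```

The per-token `advantages` feed the actor update and the per-token `returns` serve as regression targets for the critic, matching the `advantage` and `returns` fields noted in the loop.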


## parameters

### actor_rollout_ref.actor

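- ppo_mini_batch_size, ppo_micro_batch_size_per_gpu
    - Both apply during the policy update;
    - ppo_mini_batch_size: the number of samples consumed per optimizer step
    - ppo_mini_batch_size // ppo_micro_batch_size_per_gpu: the number of gradient accumulation steps
- Batch size here means how many samples go through forward, loss, and backward before the model is updated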
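
The relationship between the two batch-size parameters can be sketched with a toy one-parameter model: the mini-batch is split into micro-batches, gradients are accumulated across them, and the weight is updated once. The model, the loss, and the function names here are hypothetical and only illustrate gradient accumulation; they are not verl code.

```python
from typing import List, Tuple


def grad_sum(w: float, xs: List[float], ys: List[float]) -> float:
    """Sum of d/dw (w*x - y)^2 over the given samples."""
    return sum(2.0 * (w * x - y) * x for x, y in zip(xs, ys))


def update_with_accumulation(w: float,
                             xs: List[float],
                             ys: List[float],
                             micro_batch_size: int,
                             lr: float) -> Tuple[float, int]:
    """One optimizer step over a mini-batch, processed micro-batch by micro-batch."""
    mini_batch_size = len(xs)
    assert mini_batch_size % micro_batch_size == 0
    accumulation_steps = mini_batch_size // micro_batch_size
    grad = 0.0
    for i in range(accumulation_steps):
        lo, hi = i * micro_batch_size, (i + 1) * micro_batch_size
        # Scale by the mini-batch size so the total equals the mean gradient.
        grad += grad_sum(w, xs[lo:hi], ys[lo:hi]) / mini_batch_size
    return w - lr * grad, accumulation_steps


xs, ys = [1.0, 2.0, 3.0, 4.0], [2.0, 4.0, 6.0, 8.0]
new_w, steps = update_with_accumulation(0.0, xs, ys, micro_batch_size=2, lr=0.01)
```

With a mini-batch of 4 and a micro-batch of 2, the loop runs two forward/backward passes but performs a single weight update, which gives the same result as processing all 4 samples at once.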

### actor_rollout_ref.rollout

> Are these vLLM-related parameters?

- name
- gpu_memory_utilization
- tensor_model_parallel_size
- n
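
One plausible reading of the list above is that `gpu_memory_utilization` and `tensor_model_parallel_size` configure the vLLM engine, while `n` is a sampling setting (responses per prompt). The sketch below makes that split explicit. All values are illustrative, `split_vllm_kwargs` is a hypothetical helper, and the exact way verl forwards these settings is an assumption.

```python
from typing import Any, Dict, Tuple

# Illustrative values only; not taken from any real config.
rollout_cfg: Dict[str, Any] = {
    "name": "vllm",                     # which rollout backend to use
    "gpu_memory_utilization": 0.5,      # fraction of GPU memory given to the engine
    "tensor_model_parallel_size": 2,    # tensor-parallel degree for inference
    "n": 4,                             # responses sampled per prompt
}


def split_vllm_kwargs(cfg: Dict[str, Any]) -> Tuple[Dict[str, Any], Dict[str, Any]]:
    """Hypothetical helper: split rollout settings into engine and sampling kwargs."""
    engine_kwargs = {
        "gpu_memory_utilization": cfg["gpu_memory_utilization"],
        # vLLM's own name for this setting is `tensor_parallel_size`.
        "tensor_parallel_size": cfg["tensor_model_parallel_size"],
    }
    sampling_kwargs = {"n": cfg["n"]}
    return engine_kwargs, sampling_kwargs


engine_kwargs, sampling_kwargs = split_vllm_kwargs(rollout_cfg)
```

With vLLM installed, `engine_kwargs` would be passed to `vllm.LLM(model=..., **engine_kwargs)` and `sampling_kwargs` to `vllm.SamplingParams(**sampling_kwargs)`.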