# Reinforcement Learning from Human Feedback (RLHF)

This tutorial provides an overview of the RLHF-based techniques implemented in `phyagi-sdk`. The explanations and examples are designed to help you use these techniques effectively, and they are based on the following scripts:

- [ray_grpo_tune.py](https://github.com/microsoft/phyagi-sdk/blob/main/scripts/tune/ray_grpo_tune.py)
- [ray_isft_tune.py](https://github.com/microsoft/phyagi-sdk/blob/main/scripts/tune/ray_isft_tune.py)

**Note**: The code snippets provided in this guide highlight essential sections of the scripts for better understanding. For the complete implementation, refer to the linked scripts above.

## Resources

All examples use the **Phi‑3‑mini‑4k‑instruct** checkpoint and the **GSM8K** dataset. You can download them from either storage account:

| Resource | aifshared (mlfoundations) | aifrontiers (MSR-LIT) |
|:---------|:----------------|:--------------|
| **Phi-3-mini-4k (MixFormer)** | [Link](https://aifshared.blob.core.windows.net/data/piero_checkpoints/phi-3-mini-4k-instruct/mixformer-sequential) | [Link](https://aifrontiers.blob.core.windows.net/phickpts/phi-3-mini-4k/mixformer-sequential) |
| **GSM8k data files (train.parquet and test.parquet)** | [Link](https://aifshared.blob.core.windows.net/data/piero_data/gsm8k_formatted/gsm8k) | [Link](https://aifrontiers.blob.core.windows.net/phickpts/gsm8k) |

## Group Relative Policy Optimization (GRPO)

**GRPO** is an RLHF algorithm that trains a policy from *group-relative advantages*: each completion is scored and then compared against the other completions sampled for the same prompt, so no learned value model (critic) is needed.

### How it works

1. **Generate *k* completions** for the same prompt.
2. A reward function (or a human) **scores each completion**. In the example below, `GSM8kReward` checks the final answer.
3. Convert the scores into **group-relative advantages** by subtracting the group mean (commonly also dividing by the group standard deviation).
4. Update the policy with a **PPO-style clipped objective** that raises the log-probability of completions scoring above the group average and lowers it for those scoring below, with a KL penalty that keeps the policy close to the reference model.

Because the baseline comes from the group itself, GRPO avoids training a separate critic, which lowers memory use and removes one source of instability compared with PPO.

| | PPO | **GRPO** |
| --- | --- | --- |
| Feedback | scalar reward per completion | scalar reward per completion, compared within a group of *k* |
| Baseline | learned value model (critic) | mean reward of the group |
| Extra model to train | critic | none |
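The relative-advantage computation in step 3 can be sketched in a few lines. This is a minimal illustration, assuming each of the *k* completions has already been given a numeric score (as a reward function such as `GSM8kReward` would provide). The function name and the `eps` term are illustrative and are not taken from `phyagi-sdk`, whose implementation may differ (for example, in how it normalizes).

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize the rewards of one prompt's completions within their group.

    Hypothetical helper for illustration only, not phyagi-sdk's implementation.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population standard deviation of the group
    # Completions above the group mean get a positive advantage,
    # completions below it get a negative one.
    return [(r - mu) / (sigma + eps) for r in rewards]


if __name__ == "__main__":
    # Four completions for one prompt: two correct (1.0), two incorrect (0.0).
    print(group_relative_advantages([1.0, 0.0, 0.0, 1.0]))
```

When every completion in a group receives the same score, all advantages are zero, so that prompt contributes no gradient signal for the step.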

```python
import os

import datasets
from transformers import AutoTokenizer

from phyagi.datasets.rl.chat.chat_dataset import ChatDataset
from phyagi.datasets.rl.rl_data_collator import RewardDataCollator
from phyagi.rl.tuners.grpo.grpo_config import RayGRPOConfig
from phyagi.rl.tuners.grpo.grpo_tuner import RayGRPOTuner
from phyagi.rl.models.actor_config import ActorConfig
from phyagi.rl.rollout.vllm_worker_config import VLLMWorkerConfig
from phyagi.rl.rewards.gsm8k import GSM8kReward

# Disable Weights & Biases logging for this walkthrough
os.environ["WANDB_MODE"] = "disabled"

# Actor (policy) model and optimizer settings; replace the path with your local checkpoint
actor_config = ActorConfig(
    model={"pretrained_model_name_or_path": "/home/gderosa/models/Phi-3-mini-4k-instruct"},
    optimizer={
        "betas": [0.9, 0.999],
        "weight_decay": 0.01,
    },
    scheduler={
        "warmup_num_steps": 1,
        "warmup_max_lr": 5.0e-6,
    }
)

# vLLM rollout settings used to generate completions
rollout_config = VLLMWorkerConfig(
    prompt_length=256,
    response_length=512,
    dtype="bfloat16",
    gpu_memory_utilization=0.5,
    enforce_eager=False,
    enable_prefix_caching=True,
    sampling_params={"temperature": 1.0},
)

tuning_args = RayGRPOConfig(
    output_dir="/tmp/grpo_gsm8k",
    n_nodes=1,
    n_gpus_per_node=4,
    max_steps=1,
    train_batch_size=16,
    group_size=8,
    train_max_micro_batch_size_per_gpu=1,
    actor=actor_config,
    rollout=rollout_config,
)

tokenizer = AutoTokenizer.from_pretrained("microsoft/Phi-3-mini-128k-instruct")

def _extract_answer(row):
    # Copy the nested ground truth into a top-level "answer" column
    row["answer"] = row["reward_model"]["ground_truth"]
    return row

# Replace the paths with the location of your downloaded GSM8K parquet files
dataset = datasets.load_dataset(
    "parquet",
    data_files={
        "train": "/home/gderosa/datasets/gsm8k/train.parquet",
        "test": "/home/gderosa/datasets/gsm8k/test.parquet"
    }
)
dataset = dataset.map(_extract_answer)

rewards = {
    "math_verifier": GSM8kReward(format_score=0.0, correct_score=1.0)
}

train_dataset = ChatDataset(
    dataset["train"],
    tokenizer=tokenizer,
    messages_column_name="prompt",
    ground_truth_column_name="answer",
    max_length=tuning_args.rollout.prompt_length,
    filter_max_length=True,
)
eval_dataset = ChatDataset(
    dataset["test"],
    tokenizer=tokenizer,
    messages_column_name="prompt",
    ground_truth_column_name="answer",
    max_length=tuning_args.rollout.prompt_length,
)
data_collator = RewardDataCollator(reward_names=list(rewards.keys()))

tuner = RayGRPOTuner(
    args=tuning_args,
    tokenizer=tokenizer,
    data_collator=data_collator,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    rewards=rewards,
)
tuner.train()
```

Abridged output:

```text
2025-05-26 13:43:21,662 INFO worker.py:1888 -- Started a local Ray instance.
(RayGRPOWorker pid=2125360) [phyagi] [2025-05-26 13:43:44,034] [INFO] [model.py:93:get_model] Loading pre-trained model: /home/gderosa/models/Phi-3-mini-4k-instruct
(RayGRPOWorker pid=2125360) [phyagi] [2025-05-26 13:44:01,839] [INFO] [parallel_mixformer_sequential.py:129:apply_fsdp_mixformer_sequential] Fully Sharded Data Parallelism (FSDP) has been applied to model.
[phyagi] [2025-05-26 13:45:12,103] [INFO] [grpo_tuner.py:498:train] [step=1] Evaluating model... [GPU memory allocated: 5.74 GB (13.0% of device)]
(RayGRPOWorker pid=2125361) [phyagi] [2025-05-26 13:45:13,061] [INFO] [ray_worker.py:191:generate_completions] Generating completions using 21 batches of 16 prompts... [GPU memory allocated: 26.20 GB (59.0% of device)]
Generating completions: 100%|██████████| 21/21 [04:39<00:00, 13.32s/it]
(RayGRPOWorker pid=2125361) [phyagi] [2025-05-26 13:49:40,946] [INFO] [ray_worker.py:206:generate_completions] vLLM is now asleep. [GPU memory allocated: 4.83 GB (11.0% of device)]
100%|██████████| 1/1 [06:07<00:00, 367.55s/it]
[phyagi] [2025-05-26 13:51:19,642] [INFO] [grpo_tuner.py:554:train] Training done. [GPU memory allocated: 16.93 GB (38.0% of device)]
```


### Command-line script

Instead of writing a script by hand, you can run the predefined tuning script with a YAML configuration file, e.g., [ray_grpo.yaml](https://github.com/microsoft/phyagi-sdk/blob/rl/scripts/tune/configs/ray_grpo.yaml), to configure the GRPO tuner:

```bash
python scripts/tune/ray_grpo_tune.py <path_to_yaml_file> --tuning_args.max_steps 200 --output_dir /tmp/output
```

Extra arguments passed to the script override the values in the YAML file. The command above, for example, sets `tuning_args.max_steps` and `output_dir` directly on the command line.

When the script starts, it parses the YAML file, spins up Ray, and begins training. **Weights & Biases** dashboards update in real time:
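To make the override behavior concrete, here is a small sketch of how dotted command-line keys could be merged onto a configuration dictionary loaded from YAML. It only illustrates the idea: `apply_overrides` and `_parse_scalar` are hypothetical helpers, and the actual parsing in `ray_grpo_tune.py` may work differently.

```python
import copy


def _parse_scalar(text):
    """Convert a command-line string to int or float when possible."""
    for cast in (int, float):
        try:
            return cast(text)
        except ValueError:
            continue
    return text


def apply_overrides(config, argv):
    """Return a copy of `config` with `--dotted.key value` pairs applied.

    Hypothetical helper for illustration; not part of phyagi-sdk.
    """
    merged = copy.deepcopy(config)
    # argv alternates between a flag and its value
    for flag, value in zip(argv[0::2], argv[1::2]):
        keys = flag.lstrip("-").split(".")
        node = merged
        for key in keys[:-1]:
            # Walk (or create) the nested dictionaries along the dotted path
            node = node.setdefault(key, {})
        node[keys[-1]] = _parse_scalar(value)
    return merged


if __name__ == "__main__":
    yaml_config = {"tuning_args": {"max_steps": 1, "group_size": 8}}
    cli_args = ["--tuning_args.max_steps", "200", "--output_dir", "/tmp/output"]
    print(apply_overrides(yaml_config, cli_args))
```

Values given on the command line win over the file, while keys that are not overridden (here `group_size`) keep their YAML values.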

![WandB GRPO metrics](./img/grpo_metrics.png)

Model completions are also logged according to `log_n_eval_completions`:

![WandB GRPO completions](./img/grpo_completions.png)


## Interactive Supervised Fine‑Tuning (ISFT)

**Interactive Supervised Fine‑Tuning (ISFT)** bridges the gap between ordinary supervised fine‑tuning and full RLHF.

**Feedback loop**

1. **Generate new model completions** at a fixed interval.
2. Collect **human labels** (or automatic reward scores, as in the example below) indicating *the single best completion* (or whether any is correct).
3. **Fine-tune the model** on the accepted completions with cross-entropy loss.

Because updates remain purely supervised, ISFT is **stable and fast**, yet it continually adapts to *model-dependent* errors.
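The selection in step 2 can be sketched as a simple filter over scored completions. This is a minimal illustration under stated assumptions: each prompt comes with a list of `(completion, reward)` pairs, and a completion is accepted when it is the best in its group and reaches a threshold. The function name, data layout, and threshold are hypothetical and do not describe how `RayISFTTuner` selects examples internally.

```python
def select_accepted(groups, threshold=1.0):
    """Keep the best completion per prompt if its reward reaches `threshold`.

    `groups` is a list of (prompt, [(completion, reward), ...]) tuples.
    Hypothetical helper for illustration; not part of phyagi-sdk.
    """
    examples = []
    for prompt, scored in groups:
        if not scored:
            continue
        best_completion, best_reward = max(scored, key=lambda pair: pair[1])
        # Prompts where no completion is good enough yield no training example
        if best_reward >= threshold:
            examples.append({"prompt": prompt, "completion": best_completion})
    return examples


if __name__ == "__main__":
    groups = [
        ("2 + 2 = ?", [("5", 0.0), ("4", 1.0)]),
        ("3 * 3 = ?", [("6", 0.0), ("8", 0.0)]),
    ]
    print(select_accepted(groups))
```

The accepted pairs are then used as ordinary supervised fine-tuning examples, which is why the update itself needs no reward model or advantage estimate.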

### When does ISFT shine?

* Tasks with an unambiguous “right answer” (math, code, factual QA).  
* Teams that need quick iteration and cannot afford the complexity of RL.  
* Early‑stage projects where exploration is less critical than correctness.

```python
import os

import datasets
from transformers import AutoTokenizer

from phyagi.datasets.rl.chat.chat_dataset import ChatDataset
from phyagi.datasets.rl.rl_data_collator import RewardDataCollator
from phyagi.rl.tuners.isft.isft_config import RayISFTConfig
from phyagi.rl.tuners.isft.isft_tuner import RayISFTTuner
from phyagi.rl.models.actor_config import ActorConfig
from phyagi.rl.rollout.vllm_worker_config import VLLMWorkerConfig
from phyagi.rl.rewards.gsm8k import GSM8kReward

# Disable Weights & Biases logging for this walkthrough
os.environ["WANDB_MODE"] = "disabled"

# Actor (policy) model and optimizer settings; replace the path with your local checkpoint
actor_config = ActorConfig(
    model={"pretrained_model_name_or_path": "/home/gderosa/models/Phi-3-mini-4k-instruct"},
    optimizer={
        "betas": [0.9, 0.999],
        "weight_decay": 0.01,
    },
    scheduler={
        "warmup_num_steps": 1,
        "warmup_max_lr": 5.0e-6,
    }
)

# vLLM rollout settings used to generate completions
rollout_config = VLLMWorkerConfig(
    prompt_length=256,
    response_length=512,
    dtype="bfloat16",
    gpu_memory_utilization=0.5,
    enforce_eager=False,
    enable_prefix_caching=True,
    sampling_params={"temperature": 1.0},
)

tuning_args = RayISFTConfig(
    output_dir="/tmp/isft_gsm8k",
    n_nodes=1,
    n_gpus_per_node=4,
    max_steps=1,
    train_batch_size=16,
    group_size=8,
    train_max_micro_batch_size_per_gpu=1,
    actor=actor_config,
    rollout=rollout_config,
)

tokenizer = AutoTokenizer.from_pretrained("microsoft/Phi-3-mini-128k-instruct")

def _extract_answer(row):
    # Copy the nested ground truth into a top-level "answer" column
    row["answer"] = row["reward_model"]["ground_truth"]
    return row

# Replace the paths with the location of your downloaded GSM8K parquet files
dataset = datasets.load_dataset(
    "parquet",
    data_files={
        "train": "/home/gderosa/datasets/gsm8k/train.parquet",
        "test": "/home/gderosa/datasets/gsm8k/test.parquet"
    }
)
dataset = dataset.map(_extract_answer)

rewards = {
    "math_verifier": GSM8kReward(format_score=0.0, correct_score=1.0)
}

train_dataset = ChatDataset(
    dataset["train"],
    tokenizer=tokenizer,
    messages_column_name="prompt",
    ground_truth_column_name="answer",
    max_length=tuning_args.rollout.prompt_length,
    filter_max_length=True,
)
eval_dataset = ChatDataset(
    dataset["test"],
    tokenizer=tokenizer,
    messages_column_name="prompt",
    ground_truth_column_name="answer",
    max_length=tuning_args.rollout.prompt_length,
)
data_collator = RewardDataCollator(reward_names=list(rewards.keys()))

tuner = RayISFTTuner(
    args=tuning_args,
    tokenizer=tokenizer,
    data_collator=data_collator,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    rewards=rewards,
)
tuner.train()
```

Abridged output:

```text
2025-05-26 13:53:56,442 INFO worker.py:1888 -- Started a local Ray instance.
[phyagi] [2025-05-26 13:57:14,270] [INFO] [isft_tuner.py:482:train] [step=1] Evaluating model... [GPU memory allocated: 5.73 GB (13.0% of device)]
[phyagi] [2025-05-26 13:57:14,275] [INFO] [isft_tuner.py:362:evaluate] Generating completions for validation set... [GPU memory allocated: 5.73 GB (13.0% of device)]
(RayISFTWorker pid=2135234) [phyagi] [2025-05-26 13:57:15,208] [INFO] [ray_worker.py:191:generate_completions] Generating completions using 11 batches of 32 prompts... [GPU memory allocated: 26.26 GB (59.0% of device)]
Generating completions: 100%|██████████| 11/11 [03:38<00:00, 19.88s/it]
(RayISFTWorker pid=2135236) [phyagi] [2025-05-26 14:00:54,700] [INFO] [ray_worker.py:206:generate_completions] vLLM is now asleep. [GPU memory allocated: 4.83 GB (11.0% of device)]
[phyagi] [2025-05-26 14:00:59,700] [INFO] [isft_tuner.py:371:evaluate] Completions generated. [GPU memory allocated: 5.09 GB (11.0% of device)]
(RayISFTWorker pid=2135236) [phyagi] [2025-05-26 14:01:34,827] [INFO] [isft_worker.py:100:update_actor_policy] Updating actor policy with 13 batches... [GPU memory allocated: 4.83 GB (11.0% of device)]
100%|██████████| 1/1 [04:41<00:00, 281.60s/it]
[phyagi] [2025-05-26 14:01:55,865] [INFO] [isft_tuner.py:538:train] Training done. [GPU memory allocated: 16.11 GB (36.0% of device)]
```


### Command-line script

As with GRPO, instead of writing a script by hand, you can run the predefined tuning script with a YAML configuration file, e.g., [ray_isft.yaml](https://github.com/microsoft/phyagi-sdk/blob/rl/scripts/tune/configs/ray_isft.yaml), to configure the ISFT tuner:

```bash
python scripts/tune/ray_isft_tune.py <path_to_yaml_file>
```

## Customization

In addition to the built-in datasets and reward functions, you can customize the RLHF process by using your own datasets and defining custom reward functions.

### Custom datasets

To use a custom dataset, you can create a `ChatDataset` instance by providing the dataset, tokenizer, and necessary column names. Here's an example using the `HuggingFaceH4/ultrachat_200k` dataset:

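```python
import datasets
from transformers import AutoTokenizer

from phyagi.datasets.rl.chat.chat_dataset import ChatDataset

# ultrachat_200k ships the splits train_sft, test_sft, train_gen and test_gen
dataset = datasets.load_dataset("HuggingFaceH4/ultrachat_200k", split="train_sft")
tokenizer = AutoTokenizer.from_pretrained("microsoft/Phi-3-mini-128k-instruct")


def _split_last_turn(row):
    # Illustrative choice: the conversation minus its final turn becomes the
    # prompt, and the final (assistant) turn becomes the ground truth
    row["chat"] = row["messages"][:-1]
    row["answer"] = row["messages"][-1]["content"]
    return row


dataset = dataset.map(_split_last_turn)

train_dataset = ChatDataset(
    dataset,
    tokenizer=tokenizer,
    # Column names must match columns that exist in your dataset
    messages_column_name="chat",
    ground_truth_column_name="answer",
    max_length=256,
)
```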

### Custom reward functions

Define a custom reward function by subclassing `phyagi.rl.rewards.reward.Reward` and implementing the `__call__` method. The method takes the solution and the ground truth as inputs and returns a float reward:

```python
from phyagi.rl.rewards.reward import Reward


class UnitTestReward(Reward):
    def __call__(self, solution: str, ground_truth: str) -> float:
        # `run_pytests` stands in for your own test harness (it is not defined
        # in this snippet); it should return a pass/fail flag or a score
        return float(run_pytests(solution, ground_truth))
```

You can use any combination of rewards, since `RayGRPOTuner` and `RayISFTTuner` aggregate them under the hood.
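As a further illustration, the scoring logic that might go inside a `__call__` method can be written as a plain function first and tested in isolation. The sketch below compares the last number in a solution with the ground truth. The function name, regular expression, and default scores are assumptions made for this example; they are not the logic used by `GSM8kReward`.

```python
import re

# Matches integers and decimals, with an optional leading minus sign
_NUMBER = re.compile(r"-?\d+(?:\.\d+)?")


def last_number_match(solution, ground_truth, correct_score=1.0, wrong_score=0.0):
    """Score a solution by comparing its last number with the ground truth.

    Hypothetical scoring function for illustration; not part of phyagi-sdk.
    """
    # Drop thousands separators so "1,234" is read as 1234
    numbers = _NUMBER.findall(solution.replace(",", ""))
    if not numbers:
        return wrong_score
    try:
        target = float(ground_truth.replace(",", ""))
    except ValueError:
        return wrong_score
    return correct_score if float(numbers[-1]) == target else wrong_score


if __name__ == "__main__":
    print(last_number_match("She has 3 + 15 = 18 eggs. The answer is 18.", "18"))
```

Keeping the scoring logic free of framework imports makes it easy to unit test before wrapping it in a `Reward` subclass.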

## Monitoring & Evaluation

Track these during training:

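* **KL divergence**: Should stay small and stable. A steady climb means the policy is drifting away from the reference model; `kl_coeff` sets how strongly that drift is penalized.  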
* **Reward moving average**: Should rise then plateau.  
* **Exact match / functional correctness**: Evaluated on a *frozen* validation set.

All metrics stream to **Weights & Biases** out of the box (the notebook examples above turn this off with `WANDB_MODE=disabled`). Open the W&B run in your browser to monitor progress and catch regressions early.
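If you want to check the reward trend offline, for example from exported per-step rewards, the moving average mentioned above can be computed with a short helper. This is a generic sketch; the function name and window size are illustrative and unrelated to how `phyagi-sdk` or W&B smooth their curves.

```python
from collections import deque


def moving_average(values, window=3):
    """Return the trailing moving average of `values` over `window` steps."""
    recent = deque(maxlen=window)  # keeps only the last `window` values
    averages = []
    for value in values:
        recent.append(value)
        averages.append(sum(recent) / len(recent))
    return averages


if __name__ == "__main__":
    step_rewards = [0.0, 1.0, 1.0, 1.0]
    print(moving_average(step_rewards, window=2))
```

A curve that rises and then flattens matches the expected pattern, while a sustained drop after a plateau is worth investigating.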