
model selection of PPO in Table 2 #60

Closed
langhaobeijing opened this issue Jul 28, 2023 · 1 comment
@langhaobeijing

Hi, thank you for your great work here!

After running the PPO script (examples/scripts/rlhf_ppo.sh) from your code, I get multiple checkpoints of the fine-tuned PPO model from different training steps.

I wonder how the checkpoint is selected for the PPO results in Table 2:

  1. based on the validation split (2k) or the evaluation data (805)?
  2. based on scores of the trained reward model or simulated preferences from p_sim^eval?

Thank you!

@lxuechen
Collaborator

lxuechen commented Aug 1, 2023

Thanks for your interest!

Our final Table 2 models were primarily selected based on p_sim^eval with the self-instruct eval data. For the runs on human preferences, we also performed human eval on some PPO model checkpoints and on different values of k for rerank, to ensure the final results weren't in the over-optimization regime (see Section 4 of our paper).
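
If it helps, here is a minimal sketch of this kind of selection loop. It assumes checkpoints are saved as `checkpoint-*` directories and that you already have a function computing simulated win-rate under p_sim^eval on the self-instruct eval prompts; the names below are illustrative, not the AlpacaFarm API:

```python
from pathlib import Path
from typing import Callable


def select_best_checkpoint(
    run_dir: Path,
    win_rate_fn: Callable[[Path], float],
) -> Path:
    """Pick the PPO checkpoint with the highest score under `win_rate_fn`,
    e.g. simulated win-rate from p_sim^eval on the self-instruct eval prompts.

    Illustrative sketch only; the directory layout and scoring function are
    assumptions, not the actual AlpacaFarm code.
    """
    checkpoints = sorted(run_dir.glob("checkpoint-*"))  # assumed naming scheme
    if not checkpoints:
        raise FileNotFoundError(f"no checkpoints found under {run_dir}")
    scores = {ckpt: win_rate_fn(ckpt) for ckpt in checkpoints}
    return max(scores, key=scores.get)
```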

lxuechen closed this as completed Aug 1, 2023