
model selection of PPO in Table 2 #60

Closed
langhaobeijing opened this issue Jul 28, 2023 · 1 comment
@langhaobeijing

Hi, thank you for your great work here!

After running the PPO script (examples/scripts/rlhf_ppo.sh) from your code, I get multiple checkpoints of the fine-tuned PPO model from different training steps.

I wonder how the checkpoint is selected for the PPO results in Table 2:

  1. based on the validation split (2k) or the evaluation data (805)?
  2. based on scores of the trained reward model or simulated preferences from p_sim^eval?

Thank you!

@lxuechen
Collaborator

lxuechen commented Aug 1, 2023

Thanks for your interest!

Our final Table 2 models were primarily selected based on p_sim^eval with the self-instruct eval data. For the runs on human preferences, we also performed human eval on some PPO model checkpoints and on different values of k for rerank, to ensure the final results weren't in the over-optimization regime (see Section 4 of our paper).
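
If it helps, here is a minimal sketch of this kind of selection loop. It assumes checkpoints are saved as `checkpoint-*` directories and that you already have a function computing simulated win-rate under p_sim^eval on the self-instruct eval prompts; the names below are illustrative, not the AlpacaFarm API:

```python
from pathlib import Path
from typing import Callable


def select_best_checkpoint(
    run_dir: Path,
    win_rate_fn: Callable[[Path], float],
) -> Path:
    """Pick the PPO checkpoint with the highest score under `win_rate_fn`,
    e.g. simulated win-rate from p_sim^eval on the self-instruct eval prompts.

    Illustrative sketch only; the directory layout and scoring function are
    assumptions, not the actual AlpacaFarm code.
    """
    checkpoints = sorted(run_dir.glob("checkpoint-*"))  # assumed naming scheme
    if not checkpoints:
        raise FileNotFoundError(f"no checkpoints found under {run_dir}")
    scores = {ckpt: win_rate_fn(ckpt) for ckpt in checkpoints}
    return max(scores, key=scores.get)
```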

lxuechen closed this as completed Aug 1, 2023