Problems in reproducing the RL fine-tuned results #30

Open
abhik1505040 opened this issue Mar 8, 2023 · 8 comments


abhik1505040 commented Mar 8, 2023

Hi, thanks for open-sourcing your amazing work!

I have been trying to reproduce the RL fine-tuned results reported in the paper, but unfortunately, I am encountering some issues. Here is a brief overview of the steps I followed:

  • Fine-tuned the actor model with CE loss for 10 epochs using train_actor.sh and the CodeT5-NTP model. This fine-tuned model gives results similar to the paper's (2.86 pass@5 vs. 2.90 reported).

  • With some modifications to generate.py, generated 20 candidate samples per problem (following the sample files given in the repo) and greedy baseline codes for the training set with the CE fine-tuned model. The result key required for the corresponding gen_solutions.json and baseline_solutions.json was generated with this snippet (see the sketch after this list).

  • Generated the token level hidden states/critic scores with the released critic model through generate_critic_scores.sh.

  • Ran RL fine-tuning with the default hyperparameters in train_actor_rl.sh; the resulting RL fine-tuned model gives severely degraded results (0.84 pass@5).
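
For reference, here is a minimal sketch of how the result key could be attached to the generated solutions. The `run_unit_tests` helper and the `"code"`/`"result"` field names are assumptions made for illustration, not the repo's actual code:

```python
import json

def add_result_key(solutions_path, problems, run_unit_tests):
    """Attach per-sample unit-test outcomes to a solutions file.

    `run_unit_tests(code, problem)` is a hypothetical stand-in for an
    APPS-style test runner; the "code"/"result" field names mirror the
    sample files in the repo and should be treated as assumptions.
    """
    with open(solutions_path) as f:
        solutions = json.load(f)
    for problem_id, entry in solutions.items():
        # One list of per-test outcomes for each generated candidate program.
        entry["result"] = [run_unit_tests(code, problems[problem_id])
                           for code in entry["code"]]
    with open(solutions_path, "w") as f:
        json.dump(solutions, f)
```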

I would greatly appreciate any suggestions you may have on hyperparameter choices or other settings that could help me reproduce the RL-finetuned results accurately.

Many thanks!

@henryhungle (Collaborator)

@abhik1505040 Thanks for reporting the observations. The RL fine-tuning stage can be quite sensitive to hyperparameters. In my experience, you should experiment with a larger batch size (e.g., 256 samples per training step) and with lower learning rates.

Another trick is to use a new LM head for the RL training iterations, initialized as a clone of the original LM head from the fine-tuned checkpoint, following this. This strategy can help stabilize RL fine-tuning for T5 models, but in some cases, e.g. in GPT-J experiments, I found the benefit not too significant.
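
A minimal sketch of this head-cloning idea, assuming a Hugging Face T5ForConditionalGeneration actor; the checkpoint path and the `rl_head` attribute name are placeholders for illustration, not the repo's exact implementation:

```python
import copy
from transformers import T5ForConditionalGeneration

# Load the CE-finetuned actor (placeholder checkpoint path).
model = T5ForConditionalGeneration.from_pretrained("finetuned_ce_checkpoint")

# Clone the CE-finetuned output projection so that early, noisy
# policy-gradient updates train the copy rather than the original head.
model.rl_head = copy.deepcopy(model.lm_head)

# During RL fine-tuning, compute logits through the cloned head instead:
#   logits = model.rl_head(decoder_hidden_states)
```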

@doviettung96

Yeah, I'm running into the same failure cases. I don't have the numbers for the model fine-tuned on generated code yet, but they should be similar to yours, @abhik1505040. In particular, in many files the model just generates repetitive text like MockRecorder over and over instead of a proper function. For me, the result from the model fine-tuned on ground-truth examples (train_actor.sh) is quite similar to yours.

@abhik1505040 (Author)

@henryhungle Thank you very much for the pointers. I'll give them a try!

@parshinsh

I'm also facing the same issue!


sssszh commented Mar 23, 2023

@abhik1505040 Hi, I want to know the pass@1 result of your model fine-tuned with CE loss for 10 epochs. My pass@1 is much lower than the one in the paper, but my pass@5 is similar to both yours and the paper's.

@abhik1505040 (Author)

Hi @sssszh, apologies for the late response; I observed similarly poor performance for pass@1 as well. The exact score was 0.67.

@doviettung96

Hi @abhik1505040,
Did you get any better results? And what were your changes to the default hyperparameters? I found that using --clone_rl_head did improve the result a little bit (well, strict accuracy > 0 ^^).

@ZishunYu

Hi folks, I also get pass@1 of approximately 1% but pass@5 of 2.4% with the CE-loss fine-tuned model. After trying a range of temperatures, 0.2 seems to give me the best pass@1, at 1.1%. Does anyone have any updates on reproducing the CE fine-tuned model? Thanks a lot!! @doviettung96 @abhik1505040 @sssszh
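
For anyone rechecking their numbers, this is a small sketch of the standard unbiased pass@k estimator (Chen et al., 2021) that these figures usually assume; it is written for illustration rather than taken from this repo's evaluation code:

```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: n samples per problem, c of which pass all tests."""
    if n - c < k:
        return 1.0
    return 1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1))

# Example: 20 samples per problem, 1 correct -> pass@1 = 0.05, pass@5 = 0.25
print(pass_at_k(20, 1, 1), pass_at_k(20, 1, 5))
```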
