Potential reward hacking? #143

@yesiam-png

Hi authors, thanks for your excellent work! While trying to reproduce your results on small models, I found that at a certain training step the training return increases rapidly while the eval results drop significantly. Could you let me know what might be going wrong here? I also experimented with Qwen models as well as SFT'd models, and the results are similar to what I observed on Llama-3.2-3B:

[Images: training return and eval curves for the runs described above]

Originally I suspected a training–eval gap, but I don't think that is what is actually happening here, since the training and eval data come from roughly the same distribution.
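For reference, here is a minimal sketch (a hypothetical helper, not from this repo) of how I'd flag the step where the two curves diverge from logged metrics — i.e. the point where the smoothed training return keeps rising while the smoothed eval score falls, which is the pattern typically attributed to reward hacking:

```python
# Hypothetical divergence check, not part of this repo's codebase.
# Flags the first logged step where the training return is trending up
# while the eval score is trending down over a short window.

def detect_divergence(train_returns, eval_scores, window=5, tol=0.0):
    """Return the first index where the windowed trend of the training
    return is positive while the windowed trend of the eval score is
    negative, or None if the curves never diverge."""
    def trend(xs, i):
        lo = max(0, i - window)
        return xs[i] - xs[lo]

    for i in range(1, min(len(train_returns), len(eval_scores))):
        if trend(train_returns, i) > tol and trend(eval_scores, i) < -tol:
            return i
    return None

# Synthetic curves shaped like the behavior reported above:
train = [0.20, 0.25, 0.30, 0.50, 0.80, 0.95]   # return keeps climbing
evals = [0.40, 0.42, 0.43, 0.35, 0.20, 0.10]   # eval collapses
print(detect_divergence(train, evals, window=2))  # -> 3
```

In my runs the flagged step lines up with the point where the return curve spikes, which is what makes me suspect the reward rather than a data-distribution issue.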
