Pass@k or Pass@1? #1
@geekan, I also checked the code and I share your question. The code used to determine the number of iterations is here: https://github.com/trotsky1997/MathBlackBox/blob/main/run_with_earlystopping.py#L845-L849 and there is no …
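The concern about the linked loop can be illustrated with a minimal sketch. This is a hypothetical simplification, not the repo's actual code: if each iteration's answer is checked against the ground truth and the run stops early on a match, the run succeeds whenever *any* of up to `max_iters` answers is correct, which is pass@k with k = max_iters rather than pass@1.

```python
# Hypothetical simplification of an early-stopping evaluation loop
# (illustrative names; not the actual MathBlackBox code). Checking each
# candidate against the ground truth and stopping on a hit means success
# if ANY of up to max_iters answers is correct -- i.e. pass@k, k = max_iters.

def solve_with_early_stopping(problem, ground_truth, generate, check, max_iters=16):
    answers = []
    for _ in range(max_iters):
        answer = generate(problem, answers)  # refine using previous attempts
        answers.append(answer)
        if check(answer, ground_truth):      # early stop on a correct answer
            return answer, True
    return answers[-1], False
```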
Yes, it looks like pass@k with k = max_iters (16 by default).
@trotsky1997, could you provide results for pass@1 (i.e. max_iters=1 in your code)?
pass@1 is max_iters=1; MCTSr then degenerates to zero-shot CoT, essentially the original model, since only one node is created in the tree and there are no further expansions. Alert: this project is at a very early stage, without peer review, and no promises or commitments are given.
@trotsky1997, I see, your algorithm builds the decision tree based on …
Upon reviewing our project's performance metrics, it has become evident that our definition of the performance index may not be as robust as required. I apologize for any oversight in this regard. I have also noticed that several social-media influencers have incorrectly linked our work with the so-called "Q*" project. I want to clarify that such associations were never intended or even implied, and our project is still in its nascent phase.

The original purpose of MCTSr was to improve sampling efficiency for self-training applications. Through the gen_dpo_data scripts, trajectories from MCTSr's tree structure can be exported as DPO pair data. However, the performance gains during the DPO stage on the Gemma-7B model have been modest, approximately 10 percentage points, which is admittedly disheartening. By contrast, MCTSr's efficacy in the sampling phase has exceeded expectations, prompting me to share it separately in a forthcoming technical report.

Currently, the main limitation of this project is the design of the termination condition for open-domain tasks. The model's stability in self-evaluation within open domains is insufficient, often leading to suboptimal but overly confident responses. Please temper your expectations: at this stage, this is a preliminary sharing of our technical progress rather than a definitive breakthrough. The framework is better suited to non-self-evaluated black-box optimization tasks that sample real rewards, where it is more mature than in open-domain applications.

To clarify, the assertion that this algorithm utilizes ground truth to guide the model's exploration process is a misconception rooted in the interpretation of the …
Thanks for all the replies. First, I think the author's answer is very sincere, and I ask everyone to view this work positively and without bias. Second, I would still like to know what value of k corresponds to the 8 rollouts mentioned in the paper. Even if it is 8 or 16, the result is still significantly better than deepseek-math-rl 7B. In particular, I would guess the effective value is significantly less than 16, since the check function stops early when it encounters the correct answer. But does 16 max_iters correspond to the paper's 8 rollouts?
8 rollouts means a maximum of 8 iterations.
@trotsky1997 Your work is very interesting! I would like to clarify which model you used: is it Meta-Llama-3-8B-Instruct or Meta-Llama-3-8B?
The Instruct version, indeed; instruction-following capability is important for the self-refine process and also for self-evaluation.
I still do not understand how many complete solutions are generated in 1 iteration. In the MCTS implementation of this work, 1 node seems to contain 1 complete solution. According to the standard MCTS definition, a rollout should continuously generate nodes until a terminal node or the maximum depth is reached. Does this mean that 8 iterations generate approximately 8*depth complete solutions? Thanks! |
One expansion generates one complete refined answer; this method builds a tree of full answers rather than a step-by-step one.
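The distinction above can be sketched in a few lines. This is an illustrative model of an answer-level tree, not the repo's actual classes or API: each node holds one complete solution, and a single expansion produces one complete refined solution as a child, so the number of full solutions grows by one per expansion rather than by one per reasoning step.

```python
# Hypothetical sketch of an answer-level search tree (illustrative names,
# not the MathBlackBox API): each node holds one COMPLETE solution, and
# one expansion yields one refined complete solution as a child node.

class AnswerNode:
    def __init__(self, answer, parent=None):
        self.answer = answer      # a full solution, not a single step
        self.parent = parent
        self.children = []

    def expand(self, refine):
        """One expansion = one complete refined answer."""
        child = AnswerNode(refine(self.answer), parent=self)
        self.children.append(child)
        return child
```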
After seeing this work, I read the paper and found the results to be very good. While reading the code, however, I noticed that the following line seems to cause the metric to degenerate from pass@1 to pass@k. Is my understanding correct?
MathBlackBox/run_with_earlystopping.py, line 769 at commit 390a894
I am not saying that pass@k is a bad metric. The default evaluation metric for GSM8K is usually equivalent to pass@1, and https://arxiv.org/pdf/2205.14318 also uses pass@k while falling far short of this score. But if the relationship between the value of k and the corresponding score were clearly stated, the paper would be easier to interpret.
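As background on how pass@k is usually reported (this estimator is not part of the thread or the repo; it is the standard unbiased formula popularized by the Codex evaluation, Chen et al. 2021), a minimal implementation is:

```python
# Standard unbiased pass@k estimator (Chen et al., 2021): given n samples
# of which c are correct, the probability that a budget of k samples
# contains at least one correct answer is 1 - C(n-c, k) / C(n, k).
from math import comb

def pass_at_k(n, c, k):
    if n - c < k:          # too few wrong samples to fill k draws: always a hit
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)
```

For example, with 2 samples of which 1 is correct, pass@1 = 0.5 while pass@2 = 1.0, which is why the value of k matters so much when comparing scores.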
Furthermore, ground truth may be difficult to obtain in practice, so pass@1 is actually more realistic. Do you know of a better way to evaluate pass@1?
If I understand it incorrectly, please kindly correct me.