Pass@k or Pass@1? #1

Open · geekan opened this issue Jun 16, 2024 · 16 comments

geekan commented Jun 16, 2024

After seeing this work, I read the paper and found the results very strong. While reading the code, however, I noticed that the following line seems to cause the metric to degenerate from pass@1 to pass@k. Is my understanding correct?

```python
if check(ground_truth,answer) and 'testtime' in DATA_NAME:
```

I am not saying that pass@k is a bad metric. The default evaluation metric for GSM8K is usually equivalent to pass@1; https://arxiv.org/pdf/2205.14318 also uses pass@k, and they come nowhere near this score. But if the relationship between the value of k and the corresponding score were clearly marked, the paper would be much easier to interpret.

Furthermore, ground truth may be hard to obtain in practice, so pass@1 is actually closer to real-world use. Do you know of a better way to evaluate pass@1?
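
For reference, when people quote pass@k they usually mean the unbiased estimator from the HumanEval paper (Chen et al., 2021). A minimal sketch in Python (standard background, not code from this repository):

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator (Chen et al., 2021): probability that
    at least one of k samples drawn from n generations is correct,
    given c correct generations among the n."""
    if n - c < k:
        return 1.0  # fewer wrong samples than k: every draw of k hits a correct one
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 16 samples per problem, 5 of them correct
print(pass_at_k(16, 5, 1))   # 0.3125 -> pass@1
print(pass_at_k(16, 5, 16))  # 1.0    -> pass@16
```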

If I understand it incorrectly, please kindly correct me.

Oktai15 commented Jun 16, 2024

@geekan, I also checked the code, and I share your question. The code that determines the number of iterations is here: https://github.com/trotsky1997/MathBlackBox/blob/main/run_with_earlystopping.py#L845-L849, and there is no meta-math substring in https://github.com/trotsky1997/MathBlackBox/blob/main/run_olympics.py, i.e., max_iters=16 applies to every test dataset. Does this mean the result in the paper should be considered pass@16?
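
For readers following along, the evaluation loop under discussion has roughly this shape (a hypothetical paraphrase, not the repository's exact code; zero_shot_answer and mctsr_refine are invented names):

```python
def solved_with_early_stop(question, ground_truth, max_iters=16):
    # Hypothetical paraphrase of the linked evaluation loop.
    answer = zero_shot_answer(question)          # invented helper
    for _ in range(max_iters):
        if check(ground_truth, answer):          # ground-truth check inside the loop
            return True                          # early stop on first correct answer
        answer = mctsr_refine(question, answer)  # invented helper
    return check(ground_truth, answer)
# A problem counts as solved if ANY of up to max_iters answers is
# correct, which is exactly the pass@max_iters (here pass@16) criterion.
```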

@cassanof

Yes, this looks like pass@k, with k = max_iters (16 by default).

@liyongsea

I did not find a description of the metric in the paper. For reference, here is the pass@16 result for deepseek-math-rl 7B:
[image omitted]

Oktai15 commented Jun 17, 2024

@trotsky1997, could you provide results for pass@1 (i.e. max_iters=1 in your code)?

@trotsky1997 (Owner)

> @trotsky1997, could you provide results for pass@1 (i.e. max_iters=1 in your code)?

pass@1 corresponds to max_iters=1; MCTSr then degenerates to zero-shot CoT, essentially the original model.

In that case, only one node is ever created in the tree, with no further expansions.

Note: this project is at a very early stage, has not been peer reviewed, and no promise or commitment is given.

Oktai15 commented Jun 17, 2024

@trotsky1997, I see: your algorithm builds the search tree over max_iters iterations of a for-loop. But you could run the first check(ground_truth,answer) only after the for-loop, with max_iters=4/8/16; that would still be pass@1, as sketched below.
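
A minimal sketch of the contrast (hypothetical code; select_best_node stands in for whatever final-answer criterion the algorithm itself would use):

```python
# Current style per this thread: check inside the loop, so any correct
# answer among up to max_iters attempts counts -> pass@max_iters.
for _ in range(max_iters):
    answer = refine(answer)
    if check(ground_truth, answer):
        break

# Suggested style: run the search blind, commit to ONE final answer by
# the algorithm's own criterion, then check once -> pass@1, regardless
# of max_iters.
for _ in range(max_iters):
    answer = refine(answer)
final_answer = select_best_node(tree)   # hypothetical selection step
solved = check(ground_truth, final_answer)
```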


trotsky1997 (Owner) commented Jun 17, 2024

> I did not find a description of the metric in the paper. For reference, here is the pass@16 result for deepseek-math-rl 7B: [image omitted]

Upon reviewing our project's performance metrics, it has become evident that our definition of the performance index may not be as robust as required. I apologize for any oversight in this regard. I have also noticed that several social-media influencers have incorrectly linked our work to the so-called "Q*" project. I want to clarify that no such association was ever intended or implied; our project is still in its nascent phase.

The original purpose of MCTSr was to make sampling more efficient for self-training applications. Through the gen_dpo_data scripts, MCTSr's tree of trajectories can be exported as DPO pair data. However, the performance gains during the DPO stage on the Gemma-7B model have been modest, approximately 10 percentage points, which is admittedly disheartening.
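
For readers unfamiliar with the idea, exporting a refinement tree as DPO pairs could look roughly like this (an illustrative sketch under assumed node fields, not the actual gen_dpo_data code):

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Node:
    answer: str                      # one complete solution
    q: float                         # estimated quality/reward of this answer
    parent: Optional["Node"] = None  # the answer this one was refined from

def tree_to_dpo_pairs(question: str, nodes: List[Node]) -> List[dict]:
    """Pair each refined answer with the answer it was refined from:
    the higher-Q one becomes 'chosen', the other 'rejected'."""
    pairs = []
    for node in nodes:
        if node.parent is None:
            continue  # the root has nothing to pair with
        chosen, rejected = ((node, node.parent) if node.q >= node.parent.q
                            else (node.parent, node))
        pairs.append({"prompt": question,
                      "chosen": chosen.answer,
                      "rejected": rejected.answer})
    return pairs
```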

By contrast, MCTSr's efficacy in the sampling phase has exceeded expectations, prompting me to share it separately in a forthcoming technical report. Currently, the main limitation of this project is the design of the termination condition for open-domain tasks: the model's self-evaluation is not stable enough in open domains, often leading to suboptimal but overly confident responses.

Please temper your expectations regarding this project. It is, at this stage, a preliminary sharing of technical progress rather than a definitive breakthrough. The framework is better suited to, and more mature for, black-box optimization tasks that sample real rewards rather than relying on self-evaluation; open-domain applications are less mature.

To clarify: the assertion that this algorithm uses ground truth to guide the model's exploration is a misconception rooted in a misreading of the check function in the code. In the current repository, ground truth is used solely for the early-stopping strategy, and that strategy can be replaced by other early-stopping methods and optimal-node-selection strategies. During the self-evaluation and self-refine phases, the model never accesses any information from the ground-truth text; the improvement in the model's responses comes entirely from self-experience accumulated during exploration.
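
As one illustration of a substitute stopping rule that never touches ground truth (a hypothetical sketch, not code from this repository): stop once the best self-evaluated score clears a confidence threshold or stops improving.

```python
def should_stop(score_history, patience=3, threshold=9.5):
    """Hypothetical ground-truth-free early stop for MCTSr-style search.

    score_history: best self-evaluated reward observed after each iteration.
    Stops when the score clears `threshold`, or when it has not improved
    over the last `patience` iterations.
    """
    if not score_history:
        return False
    if max(score_history) >= threshold:
        return True
    if len(score_history) > patience:
        return max(score_history[-patience:]) <= max(score_history[:-patience])
    return False
```

The reliability of such a rule then hinges on the stability of self-evaluation, which is exactly the open-domain limitation noted above.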


geekan (Author) commented Jun 17, 2024

Thanks for all the replies.

First of all, I find the author's answer very sincere. I also ask everyone to view this work positively and without bias.

Secondly, I'd still like to know what value of k the 8 rollouts mentioned in the paper correspond to. If it is 8 or 16, the method is still significantly better than deepseek-math-rl 7B. In particular, I'd guess the effective k is significantly less than 16, since the check function stops early once it encounters a correct answer. But does 16 max_iters correspond to the paper's 8 rollouts?

@trotsky1997 (Owner)

> Secondly, I'd still like to know what value of k the 8 rollouts mentioned in the paper correspond to. [...] But does 16 max_iters correspond to the paper's 8 rollouts?

8 rollouts means a maximum of 8 iterations.

@Zui-C

Zui-C commented Jun 18, 2024

> > @trotsky1997, could you provide results for pass@1 (i.e. max_iters=1 in your code)?
>
> pass@1 corresponds to max_iters=1; MCTSr then degenerates to zero-shot CoT, essentially the original model. In that case, only one node is ever created in the tree, with no further expansions.

@trotsky1997 Your work is very interesting! I would like to clarify the model you used: is it Meta-Llama-3-8B-Instruct or Meta-Llama-3-8B?
My question arises because I tested the MATH dataset using zero-shot CoT on Meta-Llama-3-8B, and the accuracy did not reach 20%. An accuracy of 24.36% seems more like the performance of Meta-Llama-3-8B-Instruct. However, the article does not seem to mention Instruct.

@trotsky1997 (Owner)

> @trotsky1997 Your work is very interesting! I would like to clarify the model you used: is it Meta-Llama-3-8B-Instruct or Meta-Llama-3-8B? My question arises because I tested the MATH dataset using zero-shot CoT on Meta-Llama-3-8B, and the accuracy did not reach 20%. An accuracy of 24.36% seems more like the performance of Meta-Llama-3-8B-Instruct. However, the article does not seem to mention Instruct.

It is indeed the Instruct version; instruction-following capability is important for the self-refine process as well as for self-evaluation.

@mayiran1999

> > Secondly, I'd still like to know what value of k the 8 rollouts mentioned in the paper correspond to. [...] But does 16 max_iters correspond to the paper's 8 rollouts?
>
> 8 rollouts means a maximum of 8 iterations.

I still do not understand how many complete solutions are generated in one iteration. In this work's MCTS implementation, one node seems to contain one complete solution. Under the standard MCTS definition, a rollout keeps generating nodes until a terminal node or the maximum depth is reached. Does this mean that 8 iterations generate approximately 8 × depth complete solutions? Thanks!


trotsky1997 (Owner) commented Jun 20, 2024

> I still do not understand how many complete solutions are generated in one iteration. In this work's MCTS implementation, one node seems to contain one complete solution. Under the standard MCTS definition, a rollout keeps generating nodes until a terminal node or the maximum depth is reached. Does this mean that 8 iterations generate approximately 8 × depth complete solutions?

One expansion generates one complete refined answer; this method builds a tree of full answers, unlike a step-by-step tree.
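
A sketch of what one expansion amounts to under this description (hypothetical API: llm.critique, llm.refine, and llm.score are invented names):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Node:
    answer: str                       # a COMPLETE solution, not a single step
    q: float                          # self-evaluated quality score
    parent: Optional["Node"] = None

def expand(parent: Node, llm) -> Node:
    """One MCTSr expansion as described here: critique the parent's full
    answer, rewrite it, and attach the result as a new child node. So k
    iterations yield on the order of k candidate solutions, not
    k * depth as in step-level MCTS."""
    feedback = llm.critique(parent.answer)          # invented API
    refined = llm.refine(parent.answer, feedback)   # invented API
    return Node(answer=refined, q=llm.score(refined), parent=parent)
```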


zhenniB commented Jun 27, 2024

Does the configuration of "One-turn Self-refine" in the image refer to setting max_iters to 1?
[image omitted]

@trotsky1997 (Owner)

> Does the configuration of "One-turn Self-refine" in the image refer to setting max_iters to 1? [image omitted]

Basically correct: if max_iters is set to 1, MCTSr degenerates to the self-refine method.
