Pass@k or Pass@1? #1

Open · geekan opened this issue Jun 16, 2024 · 16 comments

geekan commented Jun 16, 2024

After seeing this work, I read the paper and found the results very strong. While reading the code, however, I noticed that the following line seems to cause the metric to degenerate from pass@1 to pass@k. Is my understanding correct?

```python
if check(ground_truth,answer) and 'testtime' in DATA_NAME:
```

I am not saying that pass@k is a bad metric. The default evaluation metric for GSM8K is usually equivalent to pass@1; https://arxiv.org/pdf/2205.14318 also uses pass@k, and they come nowhere near this score. But if the relationship between the value of k and the corresponding score were clearly marked, the paper would be much easier to interpret.

Furthermore, ground truth may be hard to obtain in practice, so pass@1 is actually closer to real-world use. Do you know of a better way to evaluate pass@1?
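
For reference, when people quote pass@k they usually mean the unbiased estimator from the HumanEval paper (Chen et al., 2021). A minimal sketch in Python (standard background, not code from this repository):

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator (Chen et al., 2021): probability that
    at least one of k samples drawn from n generations is correct,
    given c correct generations among the n."""
    if n - c < k:
        return 1.0  # fewer wrong samples than k: every draw of k hits a correct one
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 16 samples per problem, 5 of them correct
print(pass_at_k(16, 5, 1))   # 0.3125 -> pass@1
print(pass_at_k(16, 5, 16))  # 1.0    -> pass@16
```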

If I understand it incorrectly, please kindly correct me.

Oktai15 commented Jun 16, 2024

@geekan, I also checked the code, and I share your question. The code that determines the number of iterations is here: https://github.com/trotsky1997/MathBlackBox/blob/main/run_with_earlystopping.py#L845-L849, and there is no meta-math substring in https://github.com/trotsky1997/MathBlackBox/blob/main/run_olympics.py, i.e., max_iters=16 applies to every test dataset. Does this mean the result in the paper should be considered pass@16?
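
For readers following along, the evaluation loop under discussion has roughly this shape (a hypothetical paraphrase, not the repository's exact code; zero_shot_answer and mctsr_refine are invented names):

```python
def solved_with_early_stop(question, ground_truth, max_iters=16):
    # Hypothetical paraphrase of the linked evaluation loop.
    answer = zero_shot_answer(question)          # invented helper
    for _ in range(max_iters):
        if check(ground_truth, answer):          # ground-truth check inside the loop
            return True                          # early stop on first correct answer
        answer = mctsr_refine(question, answer)  # invented helper
    return check(ground_truth, answer)
# A problem counts as solved if ANY of up to max_iters answers is
# correct, which is exactly the pass@max_iters (here pass@16) criterion.
```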

@cassanof

Yes, this looks like pass@k, with k = max_iters (16 by default).

@liyongsea

I did not find a description of the metric in the paper. For reference, here is the pass@16 result for deepseek-math-rl 7B:
[image omitted]

Oktai15 commented Jun 17, 2024

@trotsky1997, could you provide results for pass@1 (i.e. max_iters=1 in your code)?

@trotsky1997 (Owner)

> @trotsky1997, could you provide results for pass@1 (i.e. max_iters=1 in your code)?

pass@1 corresponds to max_iters=1; MCTSr then degenerates to zero-shot CoT, essentially the original model.

In that case, only one node is ever created in the tree, with no further expansions.

Note: this project is at a very early stage, has not been peer reviewed, and no promise or commitment is given.

Oktai15 commented Jun 17, 2024

@trotsky1997, I see: your algorithm builds the search tree over max_iters iterations of a for-loop. But you could run the first check(ground_truth,answer) only after the for-loop, with max_iters=4/8/16; that would still be pass@1, as sketched below.
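
A minimal sketch of the contrast (hypothetical code; select_best_node stands in for whatever final-answer criterion the algorithm itself would use):

```python
# Current style per this thread: check inside the loop, so any correct
# answer among up to max_iters attempts counts -> pass@max_iters.
for _ in range(max_iters):
    answer = refine(answer)
    if check(ground_truth, answer):
        break

# Suggested style: run the search blind, commit to ONE final answer by
# the algorithm's own criterion, then check once -> pass@1, regardless
# of max_iters.
for _ in range(max_iters):
    answer = refine(answer)
final_answer = select_best_node(tree)   # hypothetical selection step
solved = check(ground_truth, final_answer)
```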


trotsky1997 (Owner) commented Jun 17, 2024

> I did not find a description of the metric in the paper. For reference, here is the pass@16 result for deepseek-math-rl 7B: [image omitted]

Upon reviewing our project's performance metrics, it has become evident that our definition of the performance index may not be as robust as required. I apologize for any oversight in this regard. I have also noticed that several social-media influencers have incorrectly linked our work to the so-called "Q*" project. I want to clarify that no such association was ever intended or implied; our project is still in its nascent phase.

The original purpose of MCTSr was to make sampling more efficient for self-training applications. Through the gen_dpo_data scripts, MCTSr's tree of trajectories can be exported as DPO pair data. However, the performance gains during the DPO stage on the Gemma-7B model have been modest, approximately 10 percentage points, which is admittedly disheartening.
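
For readers unfamiliar with the idea, exporting a refinement tree as DPO pairs could look roughly like this (an illustrative sketch under assumed node fields, not the actual gen_dpo_data code):

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Node:
    answer: str                      # one complete solution
    q: float                         # estimated quality/reward of this answer
    parent: Optional["Node"] = None  # the answer this one was refined from

def tree_to_dpo_pairs(question: str, nodes: List[Node]) -> List[dict]:
    """Pair each refined answer with the answer it was refined from:
    the higher-Q one becomes 'chosen', the other 'rejected'."""
    pairs = []
    for node in nodes:
        if node.parent is None:
            continue  # the root has nothing to pair with
        chosen, rejected = ((node, node.parent) if node.q >= node.parent.q
                            else (node.parent, node))
        pairs.append({"prompt": question,
                      "chosen": chosen.answer,
                      "rejected": rejected.answer})
    return pairs
```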

By contrast, MCTSr's efficacy in the sampling phase has exceeded expectations, prompting me to share it separately in a forthcoming technical report. Currently, the main limitation of this project is the design of the termination condition for open-domain tasks: the model's self-evaluation is not stable enough in open domains, often leading to suboptimal but overly confident responses.

Please temper your expectations regarding this project. It is, at this stage, a preliminary sharing of technical progress rather than a definitive breakthrough. The framework is better suited to, and more mature for, black-box optimization tasks that sample real rewards rather than relying on self-evaluation; open-domain applications are less mature.

To clarify: the assertion that this algorithm uses ground truth to guide the model's exploration is a misconception rooted in a misreading of the check function in the code. In the current repository, ground truth is used solely for the early-stopping strategy, and that strategy can be replaced by other early-stopping methods and optimal-node-selection strategies. During the self-evaluation and self-refine phases, the model never accesses any information from the ground-truth text; the improvement in the model's responses comes entirely from self-experience accumulated during exploration.
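
As one illustration of a substitute stopping rule that never touches ground truth (a hypothetical sketch, not code from this repository): stop once the best self-evaluated score clears a confidence threshold or stops improving.

```python
def should_stop(score_history, patience=3, threshold=9.5):
    """Hypothetical ground-truth-free early stop for MCTSr-style search.

    score_history: best self-evaluated reward observed after each iteration.
    Stops when the score clears `threshold`, or when it has not improved
    over the last `patience` iterations.
    """
    if not score_history:
        return False
    if max(score_history) >= threshold:
        return True
    if len(score_history) > patience:
        return max(score_history[-patience:]) <= max(score_history[:-patience])
    return False
```

The reliability of such a rule then hinges on the stability of self-evaluation, which is exactly the open-domain limitation noted above.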


geekan (Author) commented Jun 17, 2024

Thanks for all the replies.

First of all, I find the author's answer very sincere. I also ask everyone to view this work positively and without bias.

Secondly, I'd still like to know what value of k the 8 rollouts mentioned in the paper correspond to. If it is 8 or 16, the method is still significantly better than deepseek-math-rl 7B. In particular, I'd guess the effective k is significantly less than 16, since the check function stops early once it encounters a correct answer. But does 16 max_iters correspond to the paper's 8 rollouts?

@trotsky1997 (Owner)

> Secondly, I'd still like to know what value of k the 8 rollouts mentioned in the paper correspond to. [...] But does 16 max_iters correspond to the paper's 8 rollouts?

8 rollouts means a maximum of 8 iterations.

@Zui-C

Zui-C commented Jun 18, 2024

> > @trotsky1997, could you provide results for pass@1 (i.e. max_iters=1 in your code)?
>
> pass@1 corresponds to max_iters=1; MCTSr then degenerates to zero-shot CoT, essentially the original model. In that case, only one node is ever created in the tree, with no further expansions.

@trotsky1997 Your work is very interesting! I would like to clarify the model you used: is it Meta-Llama-3-8B-Instruct or Meta-Llama-3-8B?
My question arises because I tested the MATH dataset using zero-shot CoT on Meta-Llama-3-8B, and the accuracy did not reach 20%. An accuracy of 24.36% seems more like the performance of Meta-Llama-3-8B-Instruct. However, the article does not seem to mention Instruct.

@trotsky1997 (Owner)

> @trotsky1997 Your work is very interesting! I would like to clarify the model you used: is it Meta-Llama-3-8B-Instruct or Meta-Llama-3-8B? My question arises because I tested the MATH dataset using zero-shot CoT on Meta-Llama-3-8B, and the accuracy did not reach 20%. An accuracy of 24.36% seems more like the performance of Meta-Llama-3-8B-Instruct. However, the article does not seem to mention Instruct.

It is indeed the Instruct version; instruction-following capability is important for the self-refine process as well as for self-evaluation.

@mayiran1999

> > Secondly, I'd still like to know what value of k the 8 rollouts mentioned in the paper correspond to. [...] But does 16 max_iters correspond to the paper's 8 rollouts?
>
> 8 rollouts means a maximum of 8 iterations.

I still do not understand how many complete solutions are generated in one iteration. In this work's MCTS implementation, one node seems to contain one complete solution. Under the standard MCTS definition, a rollout keeps generating nodes until a terminal node or the maximum depth is reached. Does this mean that 8 iterations generate approximately 8 × depth complete solutions? Thanks!


trotsky1997 (Owner) commented Jun 20, 2024

> I still do not understand how many complete solutions are generated in one iteration. In this work's MCTS implementation, one node seems to contain one complete solution. Under the standard MCTS definition, a rollout keeps generating nodes until a terminal node or the maximum depth is reached. Does this mean that 8 iterations generate approximately 8 × depth complete solutions?

One expansion generates one complete refined answer; this method builds a tree of full answers, unlike a step-by-step tree.
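
A sketch of what one expansion amounts to under this description (hypothetical API: llm.critique, llm.refine, and llm.score are invented names):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Node:
    answer: str                       # a COMPLETE solution, not a single step
    q: float                          # self-evaluated quality score
    parent: Optional["Node"] = None

def expand(parent: Node, llm) -> Node:
    """One MCTSr expansion as described here: critique the parent's full
    answer, rewrite it, and attach the result as a new child node. So k
    iterations yield on the order of k candidate solutions, not
    k * depth as in step-level MCTS."""
    feedback = llm.critique(parent.answer)          # invented API
    refined = llm.refine(parent.answer, feedback)   # invented API
    return Node(answer=refined, q=llm.score(refined), parent=parent)
```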


zhenniB commented Jun 27, 2024

Does the configuration of "One-turn Self-refine" in the image refer to setting max_iters to 1?
[image omitted]

@trotsky1997 (Owner)

> Does the configuration of "One-turn Self-refine" in the image refer to setting max_iters to 1? [image omitted]

Basically correct: if max_iters is set to 1, MCTSr degenerates to the self-refine method.
