
Help wanted! I can't reproduce your results on Spider dev as shown in the paper #7

Closed
bdiva opened this issue Feb 6, 2024 · 7 comments

Comments


bdiva commented Feb 6, 2024

I have followed the code in your script, but the results I get don't look right. Are there other parameters I should be setting? Is there any way to improve this? Could you help me solve this problem? Thank you.

                     easy                 medium               hard                 extra                all
=====================   EXECUTION ACCURACY     =====================
execution            0.911                0.814                0.759                0.651                0.802               

====================== EXACT MATCHING ACCURACY =====================
exact match          0.069                0.007                0.000                0.000                0.019               

---------------------PARTIAL MATCHING ACCURACY----------------------
select               1.000                1.000                0.000                0.000                0.952               
select(no AGG)       1.000                1.000                0.000                0.000                0.952               
where                0.000                1.000                0.000                1.000                1.000               
where(no OP)         0.000                1.000                0.000                1.000                1.000               
group(no Having)     0.000                0.000                0.000                0.000                0.000               
group                0.000                0.000                0.000                0.000                0.000               
order                0.000                0.000                0.000                0.000                0.000               
and/or               1.000                0.899                0.897                0.880                0.920               
IUEN                 0.000                0.000                0.000                0.000                0.000               
keywords             0.000                1.000                0.000                1.000                1.000               
---------------------- PARTIAL MATCHING RECALL ----------------------
select               0.069                0.007                0.000                0.000                0.019               
select(no AGG)       0.069                0.007                0.000                0.000                0.019               
where                0.000                0.016                0.000                0.011                0.008               
where(no OP)         0.000                0.016                0.000                0.011                0.008               
group(no Having)     0.000                0.000                0.000                0.000                0.000               
group                0.000                0.000                0.000                0.000                0.000               
order                0.000                0.000                0.000                0.000                0.000               
and/or               1.000                1.000                1.000                1.000                1.000               
IUEN                 0.000                0.000                0.000                0.000                0.000               
keywords             0.000                0.008                0.000                0.006                0.005               
---------------------- PARTIAL MATCHING F1 --------------------------
select               0.128                0.013                1.000                1.000                0.038               
select(no AGG)       0.128                0.013                1.000                1.000                0.038               
where                1.000                0.032                1.000                0.021                0.017               
where(no OP)         1.000                0.032                1.000                0.021                0.017               
group(no Having)     1.000                1.000                1.000                1.000                1.000               
group                1.000                1.000                1.000                1.000                1.000               
order                1.000                1.000                1.000                1.000                1.000               
and/or               1.000                0.947                0.945                0.936                0.958               
IUEN                 1.000                1.000                1.000                1.000                1.000               
keywords             1.000                0.016                1.000                0.012                0.009
@zhihui-shao

@bdiva Does it take a long time to evaluate the Spider dataset? I've been running it for two days and still have no results.


zhihui-shao commented Feb 25, 2024

[screenshot: evaluation stuck at this point]
It has been stuck here for two days. What hardware are you running this on?

wbbeyourself (Owner) commented Feb 25, 2024

I just fixed the problem in evaluate_spider. The hang comes from this line in exec_eval.py:

_, preds = get_all_preds_for_execution(g_str, p_str)

This line can generate a very large number of possible pred variants; if there are too many, the evaluation gets stuck here, so you just need to cap the count and break out in time. Usually it is an incorrect SQL query that makes execution time out, and when there are many such queries the run stalls for a long while. I added max_try = 50 (exec_eval.py line 213), which means at most 50 variants are tried; you can tweak this for your own case.
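For readers hitting the same hang, here is a minimal sketch of the capping idea, assuming preds is an iterable of candidate SQL strings; capped is a hypothetical helper, not the repo's actual code, and the limit of 50 simply mirrors the max_try value mentioned above:

```python
from itertools import islice

MAX_TRY = 50  # mirrors the max_try cap described above; adjust as needed

def capped(variants, limit=MAX_TRY):
    """Yield at most `limit` candidate predictions so that a single
    pathological SQL string with a huge number of variants cannot
    stall the evaluation loop."""
    return islice(variants, limit)

# Hypothetical usage around the line quoted above:
# _, preds = get_all_preds_for_execution(g_str, p_str)
# for pred in capped(preds):
#     ... execute `pred` and compare its result with the gold SQL ...
```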

@zhihui-shao


Thank you very much for the fix; I managed to get results, but they seem to differ somewhat from the numbers in the paper. Could you help me figure out why?
[screenshot of evaluation results]

@wbbeyourself (Owner)

Which of these models did you use for this run?
['gpt-4-1106-preview', 'gpt-4-32k', 'gpt-4', 'gpt-35-turbo-16k']
There is a noticeable gap between GPT-4 and GPT-3.5-turbo results. Also, please check the model's specific snapshot date; my experiments used the 0613 version.


zhihui-shao commented Feb 26, 2024


I'm using a model provisioned through my company; it should be gpt-4-turbo, which I'd expect to perform better than gpt-4.
Which model were the experimental results in the paper based on?

@wbbeyourself (Owner)

The best results reported in the paper were obtained with GPT-4-32k. Because GPT-4-32k is too expensive, the default model in the code was later changed to gpt-4-1106-preview (i.e., GPT-4-turbo). Overall, GPT-4-32k performs somewhat better than GPT-4-Turbo.
