
Help wanted! I can't reproduce your results on Spider dev as shown in the paper #7

Closed
bdiva opened this issue Feb 6, 2024 · 7 comments

Comments


bdiva commented Feb 6, 2024

I have followed the code in your script, but the results I get don't look right. Are there other parameters I should be setting? Is there any way to improve this? Could you help me solve this problem? Thank you.

                     easy                 medium               hard                 extra                all
=====================   EXECUTION ACCURACY     =====================
execution            0.911                0.814                0.759                0.651                0.802               

====================== EXACT MATCHING ACCURACY =====================
exact match          0.069                0.007                0.000                0.000                0.019               

---------------------PARTIAL MATCHING ACCURACY----------------------
select               1.000                1.000                0.000                0.000                0.952               
select(no AGG)       1.000                1.000                0.000                0.000                0.952               
where                0.000                1.000                0.000                1.000                1.000               
where(no OP)         0.000                1.000                0.000                1.000                1.000               
group(no Having)     0.000                0.000                0.000                0.000                0.000               
group                0.000                0.000                0.000                0.000                0.000               
order                0.000                0.000                0.000                0.000                0.000               
and/or               1.000                0.899                0.897                0.880                0.920               
IUEN                 0.000                0.000                0.000                0.000                0.000               
keywords             0.000                1.000                0.000                1.000                1.000               
---------------------- PARTIAL MATCHING RECALL ----------------------
select               0.069                0.007                0.000                0.000                0.019               
select(no AGG)       0.069                0.007                0.000                0.000                0.019               
where                0.000                0.016                0.000                0.011                0.008               
where(no OP)         0.000                0.016                0.000                0.011                0.008               
group(no Having)     0.000                0.000                0.000                0.000                0.000               
group                0.000                0.000                0.000                0.000                0.000               
order                0.000                0.000                0.000                0.000                0.000               
and/or               1.000                1.000                1.000                1.000                1.000               
IUEN                 0.000                0.000                0.000                0.000                0.000               
keywords             0.000                0.008                0.000                0.006                0.005               
---------------------- PARTIAL MATCHING F1 --------------------------
select               0.128                0.013                1.000                1.000                0.038               
select(no AGG)       0.128                0.013                1.000                1.000                0.038               
where                1.000                0.032                1.000                0.021                0.017               
where(no OP)         1.000                0.032                1.000                0.021                0.017               
group(no Having)     1.000                1.000                1.000                1.000                1.000               
group                1.000                1.000                1.000                1.000                1.000               
order                1.000                1.000                1.000                1.000                1.000               
and/or               1.000                0.947                0.945                0.936                0.958               
IUEN                 1.000                1.000                1.000                1.000                1.000               
keywords             1.000                0.016                1.000                0.012                0.009
@zhihui-shao

@bdiva Does it take a long time to evaluate the Spider dataset? I've been running it for two days and still have no results.


zhihui-shao commented Feb 25, 2024

[screenshot: evaluation stuck at this point]
It has been stuck here for two days. What hardware are you running this on?

wbbeyourself (Owner) commented Feb 25, 2024

I just fixed the problem in evaluate_spider. The hang comes from this line in exec_eval.py:

_, preds = get_all_preds_for_execution(g_str, p_str)

This line can generate a very large number of possible pred variants; if there are too many, the evaluation gets stuck here, so you just need to cap the count and break out in time. Usually it is an incorrect SQL query that makes execution time out, and when there are many such queries the run stalls for a long while. I added max_try = 50 (exec_eval.py line 213), which means at most 50 variants are tried; you can tweak this for your own case.
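For readers hitting the same hang, here is a minimal sketch of the capping idea, assuming preds is an iterable of candidate SQL strings; capped is a hypothetical helper, not the repo's actual code, and the limit of 50 simply mirrors the max_try value mentioned above:

```python
from itertools import islice

MAX_TRY = 50  # mirrors the max_try cap described above; adjust as needed

def capped(variants, limit=MAX_TRY):
    """Yield at most `limit` candidate predictions so that a single
    pathological SQL string with a huge number of variants cannot
    stall the evaluation loop."""
    return islice(variants, limit)

# Hypothetical usage around the line quoted above:
# _, preds = get_all_preds_for_execution(g_str, p_str)
# for pred in capped(preds):
#     ... execute `pred` and compare its result with the gold SQL ...
```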

@zhihui-shao


Thank you very much for the fix; I managed to get results, but they seem to differ somewhat from the numbers in the paper. Could you help me figure out why?
[screenshot of evaluation results]

@wbbeyourself (Owner)

Which of these models did you use for this run?
['gpt-4-1106-preview', 'gpt-4-32k', 'gpt-4', 'gpt-35-turbo-16k']
There is a noticeable gap between GPT-4 and GPT-3.5-turbo results. Also, please check the model's specific snapshot date; my experiments used the 0613 version.


zhihui-shao commented Feb 26, 2024


I'm using a model provisioned through my company; it should be gpt-4-turbo, which I'd expect to perform better than gpt-4.
Which model were the experimental results in the paper based on?

@wbbeyourself (Owner)

The best results reported in the paper were obtained with GPT-4-32k. Because GPT-4-32k is too expensive, the default model in the code was later changed to gpt-4-1106-preview (i.e., GPT-4-turbo). Overall, GPT-4-32k performs somewhat better than GPT-4-Turbo.
