About the etype parameter in evaluate #21

Everyth1ng-kyh · 2024-08-18T16:07:07Z

Hello, why is the EX metric about two points higher when the evaluate parameter is set to all than when it is set to exec?

wbbeyourself · 2024-08-18T16:14:41Z

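Because with the parameter set to all, a prediction counts as correct if it matches at the SQL level, even when it cannot be executed. With exec, the SQL must be executable and its execution result must be exactly the same as Gold. That condition is stricter than all, so the score is about 2 points lower.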
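
The rule described above can be sketched as follows. This is a minimal illustration based only on the description in this reply, not the repository's evaluation script; the function name `counts_as_correct`, the three boolean flags, and the sample predictions are all hypothetical.

```python
def counts_as_correct(etype, executes, result_matches_gold, sql_matches_gold):
    """Hypothetical helper: decide whether one prediction is scored as correct."""
    # Execution-based correctness: the SQL must run AND return the Gold result.
    execution_ok = executes and result_matches_gold
    if etype == "exec":
        return execution_ok
    if etype == "all":
        # Per the reply above, a SQL-level match also counts, even if the
        # query cannot be executed.
        return execution_ok or sql_matches_gold
    raise ValueError(f"unknown etype: {etype}")


# Made-up predictions: (executes, result_matches_gold, sql_matches_gold)
predictions = [
    (True, True, True),     # correct under both settings
    (True, True, False),    # different wording, same result: correct under both
    (False, False, True),   # SQL-level match but not executable: only "all"
    (True, False, False),   # wrong under both settings
]


def accuracy(etype):
    """Fraction of the made-up predictions scored as correct for this etype."""
    hits = sum(counts_as_correct(etype, *p) for p in predictions)
    return hits / len(predictions)


print("exec:", accuracy("exec"))
print("all: ", accuracy("all"))
```

Under this reading, every prediction accepted by exec is also accepted by all, so the all score can never be lower than the exec score.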

wbbeyourself · 2024-08-18T16:14:47Z

Everyth1ng-kyh · 2024-08-18T16:19:20Z

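Thank you for your reply; this is very important to me. Do the experimental results in MAC use all or exec? Which one is commonly used in NL2SQL?
Also, I reproduced MAC-SQL by calling my company's chat GPT-4o (it carries conversation context, which may affect the results) and found that the EM score on Spider is low (around 20). Is this consistent with the results you got at the time? If so, do you know the reason?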

wbbeyourself · 2024-08-18T16:24:23Z

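MAC-SQL uses exec; you can see this in the code scripts. NL2SQL work generally uses exec, because since large models appeared the EM metric has become of little use and does not reflect a model's real ability. There are many ways to write a SQL query, and what matters is that the answer is correct, whereas EM requires the query to be written exactly like Gold, which is clearly unreasonable. It is normal for MAC-SQL to score low on EM. Current LLM-based approaches all score low on EM, so NL2SQL papers have now abandoned EM and use EX instead.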
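
A small sketch of why the two metrics can disagree, using Python's built-in `sqlite3`. The table, the two queries, and both helper functions are made up for illustration. In particular, `naive_exact_match` is only a whitespace- and case-insensitive string comparison, a stand-in assumption rather than the component-level matching that the Spider evaluation performs.

```python
import sqlite3


def execution_match(conn, pred_sql, gold_sql):
    """Hypothetical EX check: same rows returned, ignoring row order."""
    try:
        pred_rows = conn.execute(pred_sql).fetchall()
    except sqlite3.Error:
        # A prediction that cannot be executed is counted as wrong.
        return False
    gold_rows = conn.execute(gold_sql).fetchall()
    return sorted(pred_rows) == sorted(gold_rows)


def naive_exact_match(pred_sql, gold_sql):
    """Stand-in for EM: compare the query text after trivial normalization."""
    def normalize(sql):
        return " ".join(sql.lower().split())
    return normalize(pred_sql) == normalize(gold_sql)


# Toy in-memory database with made-up rows.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE singer (name TEXT, age INTEGER)")
conn.executemany(
    "INSERT INTO singer VALUES (?, ?)",
    [("Ann", 25), ("Bob", 41), ("Cid", 33)],
)

gold = "SELECT name FROM singer WHERE age > 30"
# Same answer, written differently (alias plus a negated condition).
pred = "SELECT T1.name FROM singer AS T1 WHERE NOT (T1.age <= 30)"

em = naive_exact_match(pred, gold)
ex = execution_match(conn, pred, gold)
print("exact match:", em)
print("execution match:", ex)
```

Here the prediction returns the same rows as Gold, so it passes the execution check while failing the text-level comparison.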

Everyth1ng-kyh · 2024-08-19T02:40:51Z

FlyingFeather/DEA-SQL#1 (comment)
Do you know the DEA-SQL paper? That issue mentions that the code you submitted earlier seems to differ from the current version.

wbbeyourself · 2024-08-19T02:44:45Z

There were some adjustments for the Spider dataset; you can use the current version.

Everyth1ng-kyh · 2024-08-28T01:53:16Z

Have you reproduced your paper with GPT-4o? I cannot get the desired results on either BIRD or Spider.
Spider:
                     easy                 medium               hard                 extra                all
count                248                  446                  174                  166                  1034
=====================   EXECUTION ACCURACY     =====================
execution            0.935                0.841                0.730                0.584                0.804

====================== EXACT MATCHING ACCURACY =====================
exact match          0.302                0.123                0.138                0.072                0.161




BIRD:
"Evaluate BIRD EX begin!"
save json file to ./outputs/bird\eval_result_dev.json
start calculate
                     simple               moderate             challenging          total
count                925                  465                  144                  1534
======================================    ACCURACY    =====================================
accuracy             62.16                48.17                37.50                55.61
===========================================================================================
Finished evaluation
"Evaluate EX done!"
"Evaluate BIRD VES begin!"
0
500
1000
1500
start calculate
                     simple               moderate             challenging          total
count                925                  465                  144                  1534
=========================================    VES   ========================================
ves                  28.23                25.32                17.45                26.34
=======================================================================================
Also, the gap in VES on BIRD is fairly large.

wbbeyourself · 2024-08-28T02:00:30Z

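I have not run it with GPT-4o. That score on Spider dev is normal; there will be some fluctuation. The BIRD score difference may be because the BIRD dev dataset was updated; the version I used is the earlier data. The VES score is not much of a reference, because it depends on the CPU load of the machine at the time of the run: a powerful, mostly idle server gives a high VES, while a server running many programs gives a lower VES. I think the VES metric is only worth a glance.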
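
A rough sketch of why machine load moves VES even when correctness is unchanged. It assumes the VES definition from the BIRD benchmark as I understand it: each correct prediction earns the square root of gold time over predicted time, each incorrect one earns zero, and the mean is scaled by 100. The function name `ves` and all timings below are made up, and the real script measures times over repeated runs rather than taking them as inputs.

```python
import math


def ves(samples):
    """samples: list of (is_correct, gold_seconds, pred_seconds) tuples."""
    rewards = [
        # Correct predictions are rewarded by relative speed; wrong ones get 0.
        math.sqrt(gold_t / pred_t) if ok else 0.0
        for ok, gold_t, pred_t in samples
    ]
    return 100.0 * sum(rewards) / len(rewards)


# Same predictions in both runs: 3 of 4 are correct.
# Idle machine: predicted queries take as long as the gold queries.
idle = [
    (True, 0.010, 0.010),
    (True, 0.020, 0.020),
    (False, 0.010, 0.010),
    (True, 0.040, 0.040),
]
# Busy machine: contention happens to slow the predicted runs by 4x.
busy = [
    (True, 0.010, 0.040),
    (True, 0.020, 0.080),
    (False, 0.010, 0.010),
    (True, 0.040, 0.160),
]

print("VES on idle machine:", ves(idle))
print("VES on busy machine:", ves(busy))
```

In this toy setup the accuracy is identical in both runs, yet the busy run scores roughly half the VES, which is the kind of fluctuation described above.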

Everyth1ng-kyh · 2024-09-03T06:42:44Z

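Hello, how did you obtain dev_gold_schema.json in data? I want to compute recall in my paper, but your BIRD data version differs from mine, so I would like to know how this file was produced. Thank you very much!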
