
Get low accuracy with GPT-3.5. #21

Closed
Luoyang144 opened this issue Nov 9, 2023 · 12 comments

@Luoyang144

Luoyang144 commented Nov 9, 2023

Hi, I'm trying to run ReAct with GPT-3.5-Turbo on the HotpotQA dataset with the provided Jupyter notebook, but I only get 0.182 accuracy. Is that a reasonable result? It seems much lower than the result reported in the paper.
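(For context, the notebook scores a run as the average exact-match (EM) over the sampled questions. Below is a minimal sketch of that evaluation loop with the episode runner stubbed out; names such as `webthink` and `info["em"]` follow the notebook's structure, but treat the details as illustrative rather than the exact code:)

```python
import random

def webthink(idx, to_print=False):
    """Stub for the notebook's episode runner (illustrative only).

    The real webthink runs one ReAct think/act/observe loop against the
    Wikipedia search API and returns (reward, info), where info["em"]
    says whether the predicted answer exactly matches the gold answer.
    """
    return 0.0, {"em": random.random() < 0.18}  # placeholder outcome

idxs = list(range(7405))          # HotpotQA dev-set size used by the repo
random.Random(233).shuffle(idxs)  # fixed seed, as in the notebook

ems = []
for i in idxs[:500]:
    _, info = webthink(i)
    ems.append(info["em"])
print(f"EM accuracy: {sum(ems) / len(ems):.3f}")  # ~0.182 in the report above
```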

@ysymyth
Owner

ysymyth commented Nov 17, 2023

Hi, can you show your code and an example trajectory?

@Luoyang144
Author

I'm using this notebook with the API from Azure, so I changed the llm function (to call GPT-3.5).
[screenshot: the modified llm function]
The final result:
[screenshot: final accuracy output]
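(For anyone making the same swap, here is a minimal sketch of what an Azure-backed replacement for the notebook's llm function might look like, assuming the openai>=1.0 client; the endpoint, deployment name, and API version below are placeholders, not values from this thread:)

```python
import os
from openai import AzureOpenAI  # requires openai>=1.0

# Placeholder Azure resource settings; substitute your own.
client = AzureOpenAI(
    azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"],
    api_key=os.environ["AZURE_OPENAI_API_KEY"],
    api_version="2024-02-01",
)

def llm(prompt, stop=["\n"]):
    # Drop-in replacement for the notebook's llm(): greedy decoding,
    # stopping at a newline so each call yields one Thought/Action line.
    response = client.chat.completions.create(
        model="gpt-35-turbo",  # the Azure *deployment* name, not the OpenAI model id
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
        max_tokens=100,
        stop=stop,
    )
    return response.choices[0].message.content
```

One thing worth checking when debugging low scores: the original notebook used the completions API with text-davinci-002, so moving to the chat API changes how the few-shot prompt is consumed.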

@ysymyth
Owner

ysymyth commented Dec 4, 2023

Can you show some trajectories?

@zhiyuanc2001

> Hi, I'm trying to run ReAct with GPT-3.5-Turbo on the HotpotQA dataset with the provided Jupyter notebook, but I only get 0.182 accuracy. Is that a reasonable result? It seems much lower than the result reported in the paper.

Hi, I got similar results. I think the size of GPT-3.5-Turbo and the alignment tax are responsible for the low score. :-)

@Luoyang144
Author

In fact, ReAct now scores lower than simply letting GPT-3.5 reason directly. Why would this happen?

@ysymyth
Owner

ysymyth commented Jan 4, 2024

Can you show some trajectories? Also, try the original text-davinci-002 and see if its scores are lower too.

@Jiayi-Pan

Jiayi-Pan commented Jan 29, 2024

It looks like we observed the same phenomenon on at least a subset of tasks on the WebShop benchmark.

We ran ReAct/Act using the official code on WebShop tasks 2000-2100 with gpt-3.5-turbo-instruct. The results:

  • ReAct: 0.5345 avg reward, 0.28 success rate
  • Act: 0.674 avg reward, 0.38 success rate

You can find the raw trajectories here
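(If it helps anyone reproduce these numbers, a minimal sketch of the aggregation, assuming one JSON file per episode with a top-level "reward" field; the file layout is hypothetical, not the official code's output format:)

```python
import json
from pathlib import Path

def summarize(traj_dir):
    """Return (avg reward, success rate) for a directory of WebShop episodes.

    Assumes one JSON file per episode with a "reward" in [0, 1]; WebShop
    counts an episode as a success when its reward is exactly 1.0.
    """
    rewards = [
        json.loads(p.read_text())["reward"]
        for p in sorted(Path(traj_dir).glob("*.json"))
    ]
    avg_reward = sum(rewards) / len(rewards)
    success_rate = sum(r == 1.0 for r in rewards) / len(rewards)
    return avg_reward, success_rate

# e.g. summarize("trajs/react_2000_2100")  # hypothetical path
```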

@Jiayi-Pan

Same trend on tasks 2000-3000 (avg reward / success rate):

| Method | 2000-2100     | 2000-3000      |
|--------|---------------|----------------|
| ReAct  | 0.5345 / 0.28 | 0.5735 / 0.328 |
| Act    | 0.674 / 0.38  | 0.67 / 0.352   |

@Luoyang144
Author

Luoyang144 commented Jan 30, 2024

Here is the running log of GPT-4 ReAct; it still gets a lower result (GPT-4 gets 0.33).
https://github.com/Luoyang144/share/blob/main/gpt4_hotpot_react.log

@ysymyth
Owner

ysymyth commented Jan 30, 2024

Interesting. Is it only on HotpotQA, or on more tasks as well? Also, maybe check whether the text-davinci-002 result is reproducible?

https://github.com/Luoyang144/share/blob/main/gpt4_hotpot_react.log cannot be opened.

@Luoyang144
Author

text-davinci-002 is no longer available.
This link should be accessible now: https://github.com/Luoyang144/share/blob/main/gpt4_hotpot_react.log

@ysymyth
Owner

ysymyth commented Apr 4, 2024

My hypothesis is that models later than text-davinci-002 might be tuned on trajectories similar to Act; in addition, domains like QA have intuitive tools, and tasks like HotpotQA have intuitive reasoning patterns. On more out-of-distribution domains and tasks (e.g., WebShop or ALFWorld), reasoning should still improve decision-making generalization and transparency.

Closing this for now, but let me know if there are more findings or analysis on this.

ysymyth closed this as completed Apr 4, 2024