
Get low accuracy with GPT-3.5. #21

Closed
Luoyang144 opened this issue Nov 9, 2023 · 12 comments

@Luoyang144

Luoyang144 commented Nov 9, 2023

Hi, I'm trying to run ReAct with GPT-3.5-Turbo on the HotpotQA dataset with the provided Jupyter notebook, but I only get 0.182 accuracy. Is that a reasonable result? It seems much lower than the result reported in the paper.
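(For context, the notebook scores a run as the average exact-match (EM) over the sampled questions. Below is a minimal sketch of that evaluation loop with the episode runner stubbed out; names such as `webthink` and `info["em"]` follow the notebook's structure, but treat the details as illustrative rather than the exact code:)

```python
import random

def webthink(idx, to_print=False):
    """Stub for the notebook's episode runner (illustrative only).

    The real webthink runs one ReAct think/act/observe loop against the
    Wikipedia search API and returns (reward, info), where info["em"]
    says whether the predicted answer exactly matches the gold answer.
    """
    return 0.0, {"em": random.random() < 0.18}  # placeholder outcome

idxs = list(range(7405))          # HotpotQA dev-set size used by the repo
random.Random(233).shuffle(idxs)  # fixed seed, as in the notebook

ems = []
for i in idxs[:500]:
    _, info = webthink(i)
    ems.append(info["em"])
print(f"EM accuracy: {sum(ems) / len(ems):.3f}")  # ~0.182 in the report above
```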

@ysymyth
Owner

ysymyth commented Nov 17, 2023

Hi, can you show your code and an example trajectory?

@Luoyang144
Author

I'm using this notebook with the API from Azure, so I changed the llm function (to call GPT-3.5).
[screenshot: the modified llm function]
The final result:
[screenshot: final accuracy output]
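(For anyone making the same swap, here is a minimal sketch of what an Azure-backed replacement for the notebook's llm function might look like, assuming the openai>=1.0 client; the endpoint, deployment name, and API version below are placeholders, not values from this thread:)

```python
import os
from openai import AzureOpenAI  # requires openai>=1.0

# Placeholder Azure resource settings; substitute your own.
client = AzureOpenAI(
    azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"],
    api_key=os.environ["AZURE_OPENAI_API_KEY"],
    api_version="2024-02-01",
)

def llm(prompt, stop=["\n"]):
    # Drop-in replacement for the notebook's llm(): greedy decoding,
    # stopping at a newline so each call yields one Thought/Action line.
    response = client.chat.completions.create(
        model="gpt-35-turbo",  # the Azure *deployment* name, not the OpenAI model id
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
        max_tokens=100,
        stop=stop,
    )
    return response.choices[0].message.content
```

One thing worth checking when debugging low scores: the original notebook used the completions API with text-davinci-002, so moving to the chat API changes how the few-shot prompt is consumed.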

@ysymyth
Owner

ysymyth commented Dec 4, 2023

Can you show some trajectories?

@zhiyuanc2001

> Hi, I'm trying to run ReAct with GPT-3.5-Turbo on the HotpotQA dataset with the provided Jupyter notebook, but I only get 0.182 accuracy. Is that a reasonable result? It seems much lower than the result reported in the paper.

Hi, I got similar results. I think the size of GPT-3.5-Turbo and the alignment tax are responsible for the low score. :-)

@Luoyang144
Author

In fact, ReAct now scores lower than simply letting GPT-3.5 reason directly. Why would this happen?

@ysymyth
Owner

ysymyth commented Jan 4, 2024

Can you show some trajectories? Also, try the original text-davinci-002 and see if its scores are lower too.

@Jiayi-Pan

Jiayi-Pan commented Jan 29, 2024

It looks like we observed the same phenomenon on at least a subset of tasks on the WebShop benchmark.

We ran ReAct/Act using the official code on WebShop tasks 2000-2100 with gpt-3.5-turbo-instruct. The results:

  • ReAct: 0.5345 avg reward, 0.28 success rate
  • Act: 0.674 avg reward, 0.38 success rate

You can find the raw trajectories here
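(If it helps anyone reproduce these numbers, a minimal sketch of the aggregation, assuming one JSON file per episode with a top-level "reward" field; the file layout is hypothetical, not the official code's output format:)

```python
import json
from pathlib import Path

def summarize(traj_dir):
    """Return (avg reward, success rate) for a directory of WebShop episodes.

    Assumes one JSON file per episode with a "reward" in [0, 1]; WebShop
    counts an episode as a success when its reward is exactly 1.0.
    """
    rewards = [
        json.loads(p.read_text())["reward"]
        for p in sorted(Path(traj_dir).glob("*.json"))
    ]
    avg_reward = sum(rewards) / len(rewards)
    success_rate = sum(r == 1.0 for r in rewards) / len(rewards)
    return avg_reward, success_rate

# e.g. summarize("trajs/react_2000_2100")  # hypothetical path
```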

@Jiayi-Pan

Same trend on tasks 2000-3000 (avg reward / success rate):

| Method | 2000-2100     | 2000-3000      |
|--------|---------------|----------------|
| ReAct  | 0.5345 / 0.28 | 0.5735 / 0.328 |
| Act    | 0.674 / 0.38  | 0.67 / 0.352   |

@Luoyang144
Author

Luoyang144 commented Jan 30, 2024

Here is the running log of GPT-4 ReAct; it still gets a lower result (GPT-4 gets 0.33).
https://github.com/Luoyang144/share/blob/main/gpt4_hotpot_react.log

@ysymyth
Owner

ysymyth commented Jan 30, 2024

Interesting. Is it only on HotpotQA, or on more tasks as well? Also, maybe check whether the text-davinci-002 result is reproducible?

https://github.com/Luoyang144/share/blob/main/gpt4_hotpot_react.log cannot be opened.

@Luoyang144
Author

text-davinci-002 is no longer available.
This link should be accessible now: https://github.com/Luoyang144/share/blob/main/gpt4_hotpot_react.log

@ysymyth
Owner

ysymyth commented Apr 4, 2024

My hypothesis is that models later than text-davinci-002 might be tuned on trajectories similar to Act; in addition, domains like QA have intuitive tools, and tasks like HotpotQA have intuitive reasoning patterns. On more out-of-distribution domains and tasks (e.g., WebShop or ALFWorld), reasoning should still improve decision-making generalization and transparency.

Closing this for now, but let me know if there are more findings or analysis on this.

ysymyth closed this as completed Apr 4, 2024