Getting low accuracy with GPT-3.5 #21
Comments
Hi, can you show your code and an example trajectory?
I'm using this notebook with the Azure API, so I changed the llm function to call GPT-3.5.
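For context, a minimal sketch of the kind of change described above: swapping the notebook's completion-style `llm` helper for an Azure chat-completions call. The deployment name, endpoint, and the single-user-message prompt wrapping are all assumptions, not taken from the poster's code; this targets the older `openai` v0.x Python SDK.

```python
def build_chat_messages(prompt: str):
    """Wrap the notebook's flat ReAct prompt as one user message (assumption:
    the original prompt is sent verbatim, without a system message)."""
    return [{"role": "user", "content": prompt}]


def llm(prompt: str, stop=("\n",)):
    """Drop-in replacement for the notebook's llm() using Azure GPT-3.5-Turbo."""
    # Lazy import so this sketch loads even without the SDK installed.
    import openai

    openai.api_type = "azure"
    openai.api_base = "https://<your-resource>.openai.azure.com/"  # placeholder
    openai.api_version = "2023-05-15"
    openai.api_key = "<your-key>"  # placeholder

    resp = openai.ChatCompletion.create(
        engine="gpt-35-turbo",  # Azure *deployment* name (assumption)
        messages=build_chat_messages(prompt),
        temperature=0,
        max_tokens=100,
        stop=list(stop) if stop else None,
    )
    return resp["choices"][0]["message"]["content"]
```

One subtlety worth checking when reproducing results: the chat API does not accept the `logprobs`/suffix-style options the original completion call may have used, and the stop-sequence handling can differ, which alone can shift scores.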
Can you show some trajectories?
Hi, I got similar results. I think the smaller size of GPT-3.5-Turbo and the alignment tax cause the low score. :-)
In fact, ReAct's results are no longer as good as simply letting GPT-3.5 reason directly. Why does this happen?
Can you show some trajectories? Also, try the original text-davinci-002 and see if its scores are lower too.
It looks like we observed the same phenomenon on at least a subset of tasks on the WebShop benchmark. We ran ReAct/Act using the official code on WebShop tasks 2000–2100 with
You can find the raw trajectories here.
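To make the comparison above concrete, here is one way the average reward over a range of WebShop trajectories could be computed. The score-line format is an assumption modeled on WebShop's usual "Your score (min 0.0, max 1.0): X" output, not extracted from the linked logs.

```python
import re


def parse_reward(trajectory_text: str) -> float:
    """Pull the final reward from one trajectory's text; missing scores count as 0."""
    m = re.search(r"score.*?:\s*([0-9.]+)", trajectory_text, re.IGNORECASE)
    return float(m.group(1)) if m else 0.0


def mean_reward(trajectories) -> float:
    """Average reward over a batch of trajectory texts (e.g. tasks 2000-2100)."""
    rewards = [parse_reward(t) for t in trajectories]
    return sum(rewards) / len(rewards)
```

Averaging raw rewards (rather than success rate) matters on WebShop, since partial credit is common; comparing ReAct and Act on both metrics can change which one looks better.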
Same trend on tasks 2000–3000.
Here is the running log of GPT-4 ReAct; it still gets a lower result (GPT-4 gets 0.33).
Interesting. Is it only on HotpotQA or on more tasks? Also, maybe check whether the text-davinci-002 result is reproducible? https://github.com/Luoyang144/share/blob/main/gpt4_hotpot_react.log cannot be opened.
text-davinci-002 is no longer available.
My hypothesis is that models after text-davinci-002 might be tuned on trajectories similar to Act; in addition, domains like QA have intuitive tools, and tasks like HotpotQA have intuitive reasoning patterns. On more out-of-distribution domains and tasks (e.g., WebShop or ALFWorld), reasoning should still improve decision-making generalization and transparency. Closing for now, but let me know if there are more findings or analysis on this.
Hi, I'm trying to run ReAct with GPT-3.5-Turbo on the HotpotQA dataset using the provided Jupyter notebook, but I only get 0.182 accuracy. Is that a reasonable result? It is much lower than the result reported in the paper.
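When comparing a number like 0.182 against the paper, it helps to confirm the metric is computed the same way. A sketch of the standard HotpotQA-style exact-match accuracy (lowercase, strip punctuation and articles) is below; the function names are mine, not the notebook's, and the paper may also report F1, which is more forgiving.

```python
import re
import string


def normalize(ans: str) -> str:
    """Standard QA answer normalization: lowercase, drop punctuation and articles."""
    ans = ans.lower()
    ans = "".join(ch for ch in ans if ch not in string.punctuation)
    ans = re.sub(r"\b(a|an|the)\b", " ", ans)
    return " ".join(ans.split())


def em_accuracy(predictions, golds) -> float:
    """Fraction of predictions that exactly match the gold answer after normalization."""
    hits = sum(normalize(p) == normalize(g) for p, g in zip(predictions, golds))
    return hits / len(golds)
```

If the notebook's scorer requires a verbatim match instead, a model that answers in full sentences (as chat-tuned GPT-3.5 tends to) will be penalized heavily even when the answer is substantively correct.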