[Reproducing Results] on Alfworld #28
Comments
@ysymyth - any update on the above? Thank you.
Have you tried running on more than 10 cases? 10 seems too noisy to tell anything.
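The point about 10 cases being too noisy can be made concrete (an illustrative aside, not from this thread): a normal-approximation binomial confidence interval on an observed success rate of 0.3 spans roughly 0.02–0.58 at n = 10, but narrows to roughly 0.22–0.37 at n = 135, the full Alfworld test set.

```python
import math

def binomial_ci(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Normal-approximation 95% confidence interval for a success rate."""
    p = successes / n
    se = math.sqrt(p * (1 - p) / n)
    return (max(0.0, p - z * se), min(1.0, p + z * se))

# 3/10 successes: the interval is far too wide to separate 0.3 from 0.7.
print(binomial_ci(3, 10))
# 40/135 successes: much tighter, so a gap from 0.7 is meaningful.
print(binomial_ci(40, 135))
```

So a 0.3 vs. 0.7 gap on only 10 environments is suggestive but not conclusive, while the same gap on 135 environments is.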
@ysymyth Thanks a lot for coming back to me on this one. Yes, we have run the results several times on the entire test set (i.e. 135 environments), using both your code and our own implementation. Did you make modifications beyond the ones present in the code in this repo?
Yeah, I can confirm that newer GPT-3.5 models (both the latest turbo and the instruct variant) do appear to perform really poorly, around 30%, when running a close clone of the alfworld notebook. I would have just said that GPT-3.5 got much worse on this task or needs new prompts, but I see a recent paper that uses gpt-3.5-instruct with decent ReAct results (54%) here: https://arxiv.org/pdf/2405.17402 So @ai-nikolai and I might be missing something.
I think reproducibility will get harder and harder... closing for now, but feel free to reopen.
Dear Authors,
Thank you for the great work on introducing ReAct.
Since the original model you used, text-davinci-002, is deprecated on OpenAI, the two closest alternatives are gpt-3.5-turbo and davinci-002. The best performance we get on e.g. the first 10 environments is 0.3, while the reported results on the first 10 envs of Alfworld are 0.7. Could you share your traces, or advise what your latest scores on this environment are, or how to reproduce the score of 0.7? @ysymyth @john-b-yang @descrip
Thanks.
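One practical pitfall with the model swap (my own hedged sketch, not code from this repo): the two suggested replacements use different OpenAI endpoints. davinci-002 and gpt-3.5-turbo-instruct are served by the legacy Completions API, while gpt-3.5-turbo is chat-only, so a notebook written against the Completions call site needs routing along these lines:

```python
# Minimal routing sketch; the model-to-endpoint pairing reflects OpenAI's
# public documentation, and the set names here are illustrative assumptions.
CHAT_MODELS = {"gpt-3.5-turbo"}
LEGACY_COMPLETION_MODELS = {"davinci-002", "gpt-3.5-turbo-instruct"}

def resolve_endpoint(model: str) -> str:
    """Return which OpenAI endpoint a given model name requires."""
    if model in CHAT_MODELS:
        return "chat.completions"
    if model in LEGACY_COMPLETION_MODELS:
        return "completions"
    raise ValueError(f"unknown model: {model!r}")
```

Calling gpt-3.5-turbo through the legacy Completions endpoint simply fails, and the chat endpoint also changes how the few-shot ReAct prompt is framed (messages vs. one flat string), which by itself can shift scores.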