
Confusion in Model Evaluation Results Due to GPT Updates #282

Closed
yifan123 opened this issue Apr 17, 2024 · 2 comments · Fixed by #283

Comments

@yifan123
Contributor

The actual models that the gpt-turbo and gpt-preview aliases point to are updated frequently, which makes evaluation results confusing to compare.

When using the latest gpt-turbo (defaulting to gpt-4-turbo-2024-04-09) and claude-3-opus-20240229, the result is 47.38, significantly higher than the 40.39 on the leaderboard.

When using the latest gpt-preview (defaulting to gpt-4-0125-preview) and claude-3-opus-20240229, the result is 41.09, which is similar to the 40.39 on the leaderboard.

Based on this, I speculate that alpaca_eval is using gpt-preview for evaluation.

However, there are two different settings displayed on the webpage.
[two screenshots of the leaderboard settings]
The first screenshot was taken after switching to alpaca-eval and then back to alpaca-eval 2.0; the settings it displays differ from those in the second screenshot.

Given the above situation, I suggest:

Avoid using alias names like gpt-turbo (which behave like symbolic links), and instead pin specific version numbers such as gpt-4-turbo-2024-04-09, as illustrated in the sketch below.
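For illustration, the difference between an alias and a pinned snapshot is visible directly in an OpenAI API call. This is only a minimal sketch, not the actual alpaca_eval annotator code (the prompt shown is a placeholder):

```python
# Minimal sketch: alias vs. pinned model version with the OpenAI API.
# Requires the `openai` package and an OPENAI_API_KEY in the environment.
from openai import OpenAI

client = OpenAI()

# Alias: the underlying model can change whenever OpenAI updates it,
# so annotations produced at different times may not be comparable.
aliased = client.chat.completions.create(
    model="gpt-4-turbo",
    messages=[{"role": "user", "content": "Which response is better? ..."}],
)

# Pinned snapshot: the same model is used no matter when the annotation
# runs, keeping leaderboard numbers reproducible.
pinned = client.chat.completions.create(
    model="gpt-4-turbo-2024-04-09",
    messages=[{"role": "user", "content": "Which response is better? ..."}],
)
```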

Also, what are the current settings for the leaderboard?

@YannDubs
Collaborator

Hi @yifan123, thanks a lot for flagging this.

What happened is that gpt-4-turbo came out a few days ago, which means the preview model will soon be discontinued. So I tested the turbo annotator on 3 different models to see how different it is from preview (I thought it would mostly be the same model). I tested it on three models from Mistral, OpenAI, and Contextual, saw little difference, and switched the annotator to gpt-4-turbo. But given your post I tested the differences more thoroughly on 15 models, and it turns out that for a few models this makes a large difference.

#283 reverted this change and re-annotated the models that were added in the last few days with the preview annotator.
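For reference, the kind of per-model comparison described above can be done by diffing two leaderboard runs. A minimal sketch, assuming hypothetical leaderboard_turbo.csv and leaderboard_preview.csv files with a win_rate column (the actual alpaca_eval output format may differ):

```python
# Minimal sketch: compare win rates from two hypothetical annotator runs.
# File names and the "win_rate" column are assumptions, not the exact
# alpaca_eval output format.
import pandas as pd

turbo = pd.read_csv("leaderboard_turbo.csv", index_col=0)
preview = pd.read_csv("leaderboard_preview.csv", index_col=0)

# Align the two runs on model name and compute the per-model difference.
diff = (turbo["win_rate"] - preview["win_rate"]).dropna().sort_values()

# Models whose scores shift the most when the annotator changes.
print(diff.tail(10))
```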

@yifan123
Contributor Author

Thanks for your response. alpaca_eval is a very nice leaderboard; I hope you keep maintaining and improving it.
