The actual models pointed to by the `gpt-turbo` and `gpt-preview` aliases are updated frequently, which makes the results confusing.
When using the latest `gpt-turbo` (currently pointing to gpt-4-turbo-2024-04-09) and claude-3-opus-20240229, the result is 47.38, significantly higher than the 40.39 on the leaderboard.
When using the latest `gpt-preview` (currently pointing to gpt-4-0125-preview) and claude-3-opus-20240229, the result is 41.09, which is close to the 40.39 on the leaderboard.
Based on this, I speculate that alpaca_eval is using `gpt-preview` for evaluation.
However, the webpage displays two different settings: the first screenshot was taken after switching to alpaca-eval and then back to alpaca-eval 2.0, and the settings shown at that point differ.
Given the above situation, I suggest:
- Avoid using `gpt-turbo` (an alias that behaves like a symbolic link), and instead use specific version numbers like gpt-4-turbo-2024-04-09.
Also, what are the current settings for the leaderboard?
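For reference, a quick way to see what an alias actually resolves to is to compare the requested name with the snapshot reported in the API response. This is a minimal sketch, assuming the `openai` v1 Python SDK, an `OPENAI_API_KEY` in the environment, and that `gpt-turbo` above refers to OpenAI's `gpt-4-turbo` alias:

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Request both the alias and a pinned snapshot; the `model` field of the
# response reports which snapshot actually served the request.
for name in ["gpt-4-turbo", "gpt-4-turbo-2024-04-09"]:
    resp = client.chat.completions.create(
        model=name,
        messages=[{"role": "user", "content": "ping"}],
        max_tokens=1,
    )
    print(f"requested={name!r} served={resp.model!r}")
```

If the `served` value for the alias changes over time, any evaluation run against the alias is not reproducible, which is exactly the drift described above.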
What happened is that since gpt-4-turbo came out a few days ago, it means that preview would soon get discontinued. So I tested the gpt-4-turbo annotations on 3 different models to see how different the annotator is (I thought it would mostly be the same model). I tested it on three models from Mistral, OpenAI, and Contextual and saw little difference, so I switched the annotator to gpt-4-turbo. But given your post I tested the differences more thoroughly on 15 models and see that for a few models this actually makes a large difference.
#283 reverted those changes and reannotated, with the preview annotator, the models that were added in the last few days.
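For anyone reproducing leaderboard numbers, pinning the annotator explicitly avoids this class of drift. A sketch, assuming the `evaluate` entry point in `alpaca_eval.main` and a built-in annotator config name; the exact config names live in the repo's `src/alpaca_eval/evaluators_configs/` directory:

```python
# Sketch only: the config name below is an assumption; check the README and
# src/alpaca_eval/evaluators_configs/ for the exact spelling and verify which
# dated model snapshot the chosen config pins.
from alpaca_eval.main import evaluate

evaluate(
    model_outputs="example/outputs.json",           # generations to be judged
    annotators_config="alpaca_eval_gpt4_turbo_fn",  # explicit annotator config
)
```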