
Confusion in Model Evaluation Results Due to GPT Updates #282

Closed
yifan123 opened this issue Apr 17, 2024 · 2 comments · Fixed by #283

Comments

@yifan123
Contributor

The actual models that the gpt-turbo and gpt-preview aliases point to are updated frequently, which makes evaluation results confusing to compare.

When using the latest gpt-turbo (defaulting to gpt-4-turbo-2024-04-09) and claude-3-opus-20240229, the result is 47.38, significantly higher than the 40.39 on the leaderboard.

When using the latest gpt-preview (defaulting to gpt-4-0125-preview) and claude-3-opus-20240229, the result is 41.09, which is similar to the 40.39 on the leaderboard.

Based on this, I speculate that alpaca_eval is using gpt-preview for evaluation.

However, there are two different settings displayed on the webpage.
[two screenshots of the leaderboard settings]
The first screenshot was taken after switching to alpaca-eval and then back to alpaca-eval 2.0; the settings it displays differ from those in the second screenshot.

Given the above situation, I suggest:

Avoid using alias names like gpt-turbo (which behave like symbolic links), and instead pin specific version numbers such as gpt-4-turbo-2024-04-09, as illustrated in the sketch below.
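For illustration, the difference between an alias and a pinned snapshot is visible directly in an OpenAI API call. This is only a minimal sketch, not the actual alpaca_eval annotator code (the prompt shown is a placeholder):

```python
# Minimal sketch: alias vs. pinned model version with the OpenAI API.
# Requires the `openai` package and an OPENAI_API_KEY in the environment.
from openai import OpenAI

client = OpenAI()

# Alias: the underlying model can change whenever OpenAI updates it,
# so annotations produced at different times may not be comparable.
aliased = client.chat.completions.create(
    model="gpt-4-turbo",
    messages=[{"role": "user", "content": "Which response is better? ..."}],
)

# Pinned snapshot: the same model is used no matter when the annotation
# runs, keeping leaderboard numbers reproducible.
pinned = client.chat.completions.create(
    model="gpt-4-turbo-2024-04-09",
    messages=[{"role": "user", "content": "Which response is better? ..."}],
)
```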

Also, what are the current settings for the leaderboard?

@YannDubs
Collaborator

Hi @yifan123, thanks a lot for flagging this.

What happened is that gpt-4-turbo came out a few days ago, which means the preview model will soon be discontinued. So I tested the turbo annotator on 3 different models to see how different it is from preview (I thought it would mostly be the same model). I tested it on three models from Mistral, OpenAI, and Contextual, saw little difference, and switched the annotator to gpt-4-turbo. But given your post I tested the differences more thoroughly on 15 models, and it turns out that for a few models this makes a large difference.

#283 reverted this change and re-annotated the models that were added in the last few days with the preview annotator.
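For reference, the kind of per-model comparison described above can be done by diffing two leaderboard runs. A minimal sketch, assuming hypothetical leaderboard_turbo.csv and leaderboard_preview.csv files with a win_rate column (the actual alpaca_eval output format may differ):

```python
# Minimal sketch: compare win rates from two hypothetical annotator runs.
# File names and the "win_rate" column are assumptions, not the exact
# alpaca_eval output format.
import pandas as pd

turbo = pd.read_csv("leaderboard_turbo.csv", index_col=0)
preview = pd.read_csv("leaderboard_preview.csv", index_col=0)

# Align the two runs on model name and compute the per-model difference.
diff = (turbo["win_rate"] - preview["win_rate"]).dropna().sort_values()

# Models whose scores shift the most when the annotator changes.
print(diff.tail(10))
```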

@yifan123
Contributor Author

Thanks for your response. alpaca_eval is a very nice leaderboard; I hope you keep maintaining and improving it.
