diff --git a/docs/leaderboard.md b/docs/leaderboard.md
index 63a543b..001cb26 100644
--- a/docs/leaderboard.md
+++ b/docs/leaderboard.md
@@ -2,30 +2,28 @@
-# SciCode Leaderboard
-
-| Models                   | Main Problem Resolve Rate           | Subproblem                          |
-|--------------------------|-------------------------------------|-------------------------------------|
-| 🥇 OpenAI o3-mini-low    | **10.8**                            | 33.3                                |
-| 🥈 OpenAI o3-mini-high   | **9.2**                             | 34.4                                |
-| 🥉 OpenAI o3-mini-medium | **9.2**                             | 33.0                                |
-| OpenAI o1-preview        | **7.7**                             | 28.5                                |
-| Deepseek-R1              | **4.6**                             | 28.5                                |
-| Claude3.5-Sonnet         | **4.6**                             | 26.0                                |
-| Claude3.5-Sonnet (new)   | **4.6**                             | 25.3                                |
-| Deepseek-v3              | **3.1**                             | 23.7                                |
-| Deepseek-Coder-v2        | **3.1**                             | 21.2                                |
-| GPT-4o                   | **1.5**                             | 25.0                                |
-| GPT-4-Turbo              | **1.5**                             | 22.9                                |
-| OpenAI o1-mini           | **1.5**                             | 22.2                                |
-| Gemini 1.5 Pro           | **1.5**                             | 21.9                                |
-| Claude3-Opus             | **1.5**                             | 21.5                                |
-| Llama-3.1-405B-Chat      | **1.5**                             | 19.8                                |
-| Claude3-Sonnet           | **1.5**                             | 17.0                                |
-| Qwen2-72B-Instruct       | **1.5**                             | 17.0                                |
-| Llama-3.1-70B-Chat       | **0.0**                             | 17.0                                |
-| Mixtral-8x22B-Instruct   | **0.0**                             | 16.3                                |
-| Llama-3-70B-Chat         | **0.0**                             | 14.6                                |
+| Models                   | Main Problem Resolve Rate | Subproblem                                   |
+|--------------------------|:-------------------------:|:--------------------------------------------:|
+| 🥇 OpenAI o3-mini-low    | **10.8**                  | 33.3                                         |
+| 🥈 OpenAI o3-mini-high   | **9.2**                   | 34.4                                         |
+| 🥉 OpenAI o3-mini-medium | **9.2**                   | 33.0                                         |
+| OpenAI o1-preview        | **7.7**                   | 28.5                                         |
+| Deepseek-R1              | **4.6**                   | 28.5                                         |
+| Claude3.5-Sonnet         | **4.6**                   | 26.0                                         |
+| Claude3.5-Sonnet (new)   | **4.6**                   | 25.3                                         |
+| Deepseek-v3              | **3.1**                   | 23.7                                         |
+| Deepseek-Coder-v2        | **3.1**                   | 21.2                                         |
+| GPT-4o                   | **1.5**                   | 25.0                                         |
+| GPT-4-Turbo              | **1.5**                   | 22.9                                         |
+| OpenAI o1-mini           | **1.5**                   | 22.2                                         |
+| Gemini 1.5 Pro           | **1.5**                   | 21.9                                         |
+| Claude3-Opus             | **1.5**                   | 21.5                                         |
+| Llama-3.1-405B-Chat      | **1.5**                   | 19.8                                         |
+| Claude3-Sonnet           | **1.5**                   | 17.0                                         |
+| Qwen2-72B-Instruct       | **1.5**                   | 17.0                                         |
+| Llama-3.1-70B-Chat       | **0.0**                   | 17.0                                         |
+| Mixtral-8x22B-Instruct   | **0.0**                   | 16.3                                         |
+| Llama-3-70B-Chat         | **0.0**                   | 14.6                                         |
 **Note: If the models tie in the Main Problem resolve rate, we will then compare the Subproblems.**