
AGI-Elo

[Project Page] [HuggingFace] [Preprint] [Code] [Raw Data]

AGI-Elo: How Far Are We From Mastering A Task?

Shuo Sun1,3, Yimin Zhao1, Christina Dao Wen Lee1, Jiawei Sun1, Chengran Yuan1,
Zefan Huang1,3, Dongen Li1,3, Justin KW Yeoh1, Alok Prakash3,
Thomas W. Malone2,3, Marcelo H. Ang Jr.1,3

1National University of Singapore 2Massachusetts Institute of Technology
3Singapore MIT Alliance for Research and Technology


Evaluated benchmarks: ImageNet, MMLU, Waymo, COCO, LiveCodeBench, NAVSIM

Main Results

Rating distribution of datasets

You can go to our Project Page for a more detailed rating distribution analysis.

Test case difficulties and visualized samples

You can view the visualized test case samples and their associated ratings for all six datasets on our HuggingFace Collection.

Abstract

As the field progresses toward Artificial General Intelligence (AGI), there is a pressing need for more comprehensive and insightful evaluation frameworks that go beyond aggregate performance metrics. This paper introduces a unified rating system that jointly models the difficulty of individual test cases and the competency of AI models (or humans) across vision, language, and action domains. Unlike existing metrics that focus solely on models, our approach allows for fine-grained, difficulty-aware evaluations through competitive interactions between models and tasks, capturing both the long-tail distribution of real-world challenges and the competency gap between current models and full task mastery. We validate the generalizability and robustness of our system through extensive experiments on multiple established datasets and models across distinct AGI domains. The resulting rating distributions offer novel perspectives and interpretable insights into task difficulty, model progression, and the outstanding challenges that remain on the path to achieving full AGI task mastery.
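The "competitive interactions between models and tasks" described above follow the spirit of an Elo-style rating update, where a model "plays a match" against each test case and ratings move toward the observed outcome. A minimal sketch of such an update (the function names and K-factor below are illustrative assumptions, not the repository's actual API):

```python
def expected_score(r_a: float, r_b: float) -> float:
    """Elo-model probability that a player rated r_a beats one rated r_b."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))

def elo_update(model_rating: float, case_rating: float,
               model_won: bool, k: float = 32.0) -> tuple[float, float]:
    """One 'match': the model wins if it solves the test case.

    Returns the updated (model_rating, case_rating); the ratings move
    in opposite directions by the same amount.
    """
    expected = expected_score(model_rating, case_rating)
    score = 1.0 if model_won else 0.0
    delta = k * (score - expected)
    return model_rating + delta, case_rating - delta
```

For example, when two equal ratings meet, the expected score is 0.5, so a win shifts both ratings by `k * 0.5`. The actual estimation pipeline in this repository may use a different update rule or fit ratings jointly over all match outcomes.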

Install

  1. Clone this repository
git clone https://github.com/SS47816/AGI-Elo.git
cd AGI-Elo
  2. Install all dependencies
# Auto install conda env AGI_Elo
direnv allow
make install
conda activate AGI_Elo

# Auto install all pip dependencies from requirements.txt
make pip-install

Usage

1. Prepare model predictions

Each .pkl file should contain the prediction results of one model evaluated across all test cases.

You can download our precomputed prediction files from Google Drive: Raw Data.

After downloading, organize the ./data folder with the following structure:

   ./data
   ├── imagenet_class_index.json
   │
   ├── classification/
   │   ├── ImageNet/
   │       ├── val/
   │           ├── predictions/
   │           ├── ...
   │
   ├── detection/
   │   ├── COCO/
   │       ├── val/
   │           ├── predictions/
   │           ├── ...
   │
   ├── question_answering/
   │   ├── MMLU/
   │       ├── test/
   │           ├── predictions/
   │           ├── ...
   │
   ├── coding/
   │   ├── LiveCodeBench/
   │       ├── test/
   │           ├── predictions/
   │           ├── ...
   │
   ├── motion_prediction/
   │   ├── Waymo/
   │       ├── val/
   │           ├── predictions/
   │           ├── ...
   │
   ├── motion_planning/
   │   ├── NAVSIM/
   │       ├── val/
   │           ├── predictions/
   │           ├── ...
   │

2. Run rating estimation

To run rating estimation across all tasks and datasets, use:

python3 AGI_Elo/scripts/run_all_experiments.py

Alternatively, you can run a specific task independently (e.g., classification):

python3 AGI_Elo/pipeline/classification.py

The results will be saved to their respective ratings/ folders.

BibTeX

If you find our work interesting, please consider citing our paper:

@misc{sun2025agielofarmasteringtask,
  title={AGI-Elo: How Far Are We From Mastering A Task?}, 
  author={Shuo Sun and Yimin Zhao and Christina Dao Wen Lee and Jiawei Sun and Chengran Yuan and Zefan Huang and Dongen Li and Justin KW Yeoh and Alok Prakash and Thomas W. Malone and Marcelo H. Ang Jr},
  year={2025},
  eprint={2505.12844},
  archivePrefix={arXiv},
  primaryClass={cs.AI},
  url={https://arxiv.org/abs/2505.12844}, 
}

License

This repository is licensed under the Apache License 2.0.

Project based on Nesta's data science project template