[Project Page] [HuggingFace] [Preprint] [Code] [Raw Data]
Shuo Sun1,3, Yimin Zhao1, Christina Dao Wen Lee1, Jiawei Sun1, Chengran Yuan1,
Zefan Huang1,3, Dongen Li1,3, Justin KW Yeoh1, Alok Prakash3,
Thomas W. Malone2,3, Marcelo H. Ang Jr.1,3
1National University of Singapore 2Massachusetts Institute of Technology
3Singapore-MIT Alliance for Research and Technology
Visit our Project Page for a more detailed analysis of the rating distributions.
Visualized test case samples and their associated ratings for all six datasets are available in our HuggingFace Collection.
As the field progresses toward Artificial General Intelligence (AGI), there is a pressing need for more comprehensive and insightful evaluation frameworks that go beyond aggregate performance metrics. This paper introduces a unified rating system that jointly models the difficulty of individual test cases and the competency of AI models (or humans) across vision, language, and action domains. Unlike existing metrics that focus solely on models, our approach allows for fine-grained, difficulty-aware evaluations through competitive interactions between models and tasks, capturing both the long-tail distribution of real-world challenges and the competency gap between current models and full task mastery. We validate the generalizability and robustness of our system through extensive experiments on multiple established datasets and models across distinct AGI domains. The resulting rating distributions offer novel perspectives and interpretable insights into task difficulty, model progression, and the outstanding challenges that remain on the path to achieving full AGI task mastery.
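For intuition, the rating system builds on the idea of competitive rating updates between models and test cases: each carries a rating, and every evaluation outcome nudges both. The sketch below shows a classic Elo-style update as a simplified analogy; the function names, K-factor, and starting ratings are illustrative assumptions, not the paper's exact formulation or this repository's API:

```python
# Illustrative Elo-style update between a model and a test case.
# NOTE: function names, K-factor, and initial ratings are illustrative
# assumptions, not the repository's actual API.

def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that A 'beats' B under the logistic Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))

def update(model_rating: float, case_rating: float,
           model_won: bool, k: float = 32.0) -> tuple[float, float]:
    """One rating update: the model 'wins' if it solves the test case."""
    exp = expected_score(model_rating, case_rating)
    score = 1.0 if model_won else 0.0
    delta = k * (score - exp)
    return model_rating + delta, case_rating - delta

# Example: a 1500-rated model fails a 1600-rated (harder) test case.
m, c = update(1500.0, 1600.0, model_won=False)
print(round(m, 1), round(c, 1))  # model drops slightly, case rises
```

In the full system, ratings are estimated jointly over all model–test-case interactions rather than one match at a time.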
- Clone this repository
git clone https://github.com/SS47816/AGI-Elo.git
cd AGI-Elo
- Install all dependencies
# Auto install conda env AGI_Elo
direnv allow
make install
conda activate AGI_Elo
# Auto install all pip dependencies from requirements.txt
make pip-install
Each .pkl file should contain the prediction results of one model evaluated across all test cases.
You can download our precomputed prediction files from Google Drive: Raw Data.
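To sanity-check a downloaded file, you can inspect it with the standard pickle module. The path below and the assumption that the file deserializes to a dict-like mapping from test-case IDs to predictions are ours; check the actual files for their exact schema:

```python
# Quick inspection of one prediction file.
# NOTE: the example path and the assumed dict-like structure are
# illustrative; the real schema may differ per task.
import pickle
from pathlib import Path

pred_file = Path("./data/classification/ImageNet/predictions/model_x.pkl")

with pred_file.open("rb") as f:
    predictions = pickle.load(f)

print(type(predictions))
# If it is a mapping from test-case IDs to results, peek at a few entries:
if isinstance(predictions, dict):
    for case_id in list(predictions)[:3]:
        print(case_id, predictions[case_id])
```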
After downloading, organize the ./data folder with the following structure:
./data
├── imagenet_class_index.json
│
├── classification/
│   └── ImageNet/
│       ├── val/
│       ├── predictions/
│       └── ...
│
├── detection/
│   └── COCO/
│       ├── val/
│       ├── predictions/
│       └── ...
│
├── question_answering/
│   └── MMLU/
│       ├── test/
│       ├── predictions/
│       └── ...
│
├── coding/
│   └── LiveCodeBench/
│       ├── test/
│       ├── predictions/
│       └── ...
│
├── motion_prediction/
│   └── Waymo/
│       ├── val/
│       ├── predictions/
│       └── ...
│
└── motion_planning/
    └── NAVSIM/
        ├── val/
        ├── predictions/
        └── ...
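If you want to verify the layout before running the pipelines, a small check like the following can help. This checker is our own convenience sketch, not part of the repository; the dataset-to-task mapping is taken from the tree above:

```python
# Verify the expected ./data layout before running experiments.
# NOTE: this checker is an illustrative sketch, not part of the repo.
from pathlib import Path

EXPECTED = {
    "classification/ImageNet": ["val", "predictions"],
    "detection/COCO": ["val", "predictions"],
    "question_answering/MMLU": ["test", "predictions"],
    "coding/LiveCodeBench": ["test", "predictions"],
    "motion_prediction/Waymo": ["val", "predictions"],
    "motion_planning/NAVSIM": ["val", "predictions"],
}

data_root = Path("./data")
for dataset, subdirs in EXPECTED.items():
    for sub in subdirs:
        path = data_root / dataset / sub
        status = "ok" if path.is_dir() else "MISSING"
        print(f"{status:7s} {path}")
```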
To run rating estimation across all tasks and datasets, use:
python3 AGI_Elo/scripts/run_all_experiments.py
Alternatively, you can run a specific task independently (e.g., classification):
python3 AGI_Elo/pipeline/classification.py
The results will be saved to their respective ratings/ folders.
If you find our work interesting, please consider citing our paper:
@misc{sun2025agielofarmasteringtask,
title={AGI-Elo: How Far Are We From Mastering A Task?},
author={Shuo Sun and Yimin Zhao and Christina Dao Wen Lee and Jiawei Sun and Chengran Yuan and Zefan Huang and Dongen Li and Justin KW Yeoh and Alok Prakash and Thomas W. Malone and Marcelo H. Ang Jr},
year={2025},
eprint={2505.12844},
archivePrefix={arXiv},
primaryClass={cs.AI},
url={https://arxiv.org/abs/2505.12844},
}
This repository is licensed under the Apache License 2.0.
Project based on Nesta's data science project template