While we are starting with SWE-bench Verified, we should explore other Python benchmarks. In increasing order of difficulty:
- HumanEval: 164 hand-crafted Python programming problems. Difficulty: Easy
- ODEX: open-domain, execution-based Python problems centered on library API use. Difficulty: Easy-Medium
- BigCodeBench: ~1,140 tasks that compose standard-library and third-party-library calls. Difficulty: Medium
- LiveCodeBench: competitive-programming problems, continuously collected over time to resist contamination. Difficulty: Medium-Very Hard
- SWE-bench Verified: 500 human-validated GitHub issues from real Python repos. Difficulty: Hard
Warning: See paper for issues with memorization.
Agentic coverage:
- SWT-Bench: Test generation for a given codebase. Includes categories like Test Generation, Test Repair, Coverage Improvement.
- Terminal-Bench 2.0: Tasks in a terminal/shell environment.
- FeatureBench: implementing feature requests rather than just fixing bugs.
Python data-science code:
- DS-1000: ~1,000 data-science problems spanning NumPy, pandas, Matplotlib, scikit-learn, PyTorch, and SciPy.
- MLE-bench: ML-engineering benchmark built from Kaggle competitions.
Contamination resistance:
- EvoEval: Semantic perturbations of HumanEval problems.
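To make the perturbation idea concrete: EvoEval uses rich LLM-driven transformations, but even a purely mechanical identifier-renaming pass illustrates how a surface change can break memorized solutions without changing problem semantics. This sketch (the `perturb` helper and rename mapping are hypothetical, not EvoEval's actual pipeline) uses the standard-library `ast` module:

```python
import ast

class RenameIdentifiers(ast.NodeTransformer):
    """Mechanically rename arguments/variables to break surface-level memorization."""
    def __init__(self, mapping):
        self.mapping = mapping

    def visit_arg(self, node):
        node.arg = self.mapping.get(node.arg, node.arg)
        return node

    def visit_Name(self, node):
        node.id = self.mapping.get(node.id, node.id)
        return node

def perturb(source, mapping):
    """Return a semantically equivalent variant of `source` with names swapped."""
    tree = ast.parse(source)
    tree = RenameIdentifiers(mapping).visit(tree)
    return ast.unparse(tree)

original = "def add(a, b):\n    return a + b"
print(perturb(original, {"a": "x", "b": "y"}))
```

Real perturbations would also rewrite docstrings, reorder arguments, or change problem framing; this is only the smallest moving part.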
Possible curriculum for RL:
- Easy: HumanEval -> EvoEval -> ODEX
- Medium: BigCodeBench -> DS-1000 -> LiveCodeBench (easy/medium)
- Hard: SWE-bench Verified -> SWT-Bench -> Terminal-Bench 2.0
- Very Hard: MLE-bench -> LiveCodeBench (hard) -> FeatureBench
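One way to operationalize the tiers above is a sampling schedule that shifts probability mass from easy to very-hard benchmarks as training progresses. Everything here is a hypothetical sketch (the tier names, triangular schedule, and peak positions are assumptions, not a prescribed recipe):

```python
import random

# Tiers mirror the curriculum above; benchmark names are labels only.
TIERS = {
    "easy":      ["HumanEval", "EvoEval", "ODEX"],
    "medium":    ["BigCodeBench", "DS-1000", "LiveCodeBench-easy/medium"],
    "hard":      ["SWE-bench-Verified", "SWT-Bench", "Terminal-Bench-2.0"],
    "very_hard": ["MLE-bench", "LiveCodeBench-hard", "FeatureBench"],
}
ORDER = ["easy", "medium", "hard", "very_hard"]

def tier_weights(progress):
    """Shift sampling mass from easy to very-hard as progress goes 0 -> 1.

    Each tier peaks at a different point of training (triangular schedule);
    the peak positions are arbitrary hyperparameters.
    """
    peaks = [0.0, 0.33, 0.66, 1.0]
    raw = [max(0.0, 1.0 - 3.0 * abs(progress - p)) for p in peaks]
    total = sum(raw) or 1.0
    return [w / total for w in raw]

def sample_benchmark(progress, rng=random):
    """Pick a tier by the current weights, then a benchmark within it."""
    tier = rng.choices(ORDER, weights=tier_weights(progress))[0]
    return rng.choice(TIERS[tier])

print(tier_weights(0.0))  # easy tier dominates early
print(tier_weights(1.0))  # very_hard tier dominates late
```

A real curriculum would likely gate progression on measured pass rates rather than raw training progress, but the sampler shape stays the same.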
Important point about the environment:
Most of these benchmarks have examples that fit in the context window. For practical use, once codebases get large enough, retrieval plays an important role. We might need a custom harness to reproduce this, e.g. by capping the number of files that can be read for any instance. Don't worry about this now!
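If we ever build that harness, the cap could be as simple as a budgeted wrapper around file reads. A minimal sketch, assuming a per-instance budget (the class name, budget size, and error choice are all hypothetical):

```python
class FileReadBudget:
    """Hypothetical harness hook: cap how many distinct files an agent may
    read per instance, forcing retrieval-like selectivity on large codebases."""

    def __init__(self, max_files=20):
        self.max_files = max_files
        self.opened = set()

    def read(self, path):
        # Re-reading an already-opened file does not consume budget.
        if path not in self.opened and len(self.opened) >= self.max_files:
            raise PermissionError(
                f"file-read budget of {self.max_files} exhausted; "
                "choose files more selectively"
            )
        self.opened.add(path)
        with open(path) as f:
            return f.read()
```

The harness would hand the agent this `read` method instead of raw filesystem access, so exceeding the budget surfaces as an explicit error the agent can react to.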