While we are starting with SWE-bench Verified, we should explore other Python benchmarks. In increasing order of difficulty:
- HumanEval: 164 hand-crafted Python programming problems. Difficulty: Easy
- ODEX: open-domain, execution-based Python problems centered on library API use. Difficulty: Easy-Medium
- BigCodeBench: ~1,140 tasks that compose standard-library and third-party-library calls. Difficulty: Medium
- LiveCodeBench: competitive-programming problems, continuously collected over time to resist contamination. Difficulty: Medium-Very Hard
- SWE-bench Verified: 500 human-validated GitHub issues from real Python repos. Difficulty: Hard
Warning: See paper for issues with memorization.
Agentic coverage:
- SWT-Bench: Test generation for a given codebase. Includes categories like Test Generation, Test Repair, Coverage Improvement.
- Terminal-Bench 2.0: Tasks in a terminal/shell environment.
- FeatureBench: implementing feature requests rather than just fixing bugs.
Python data-science code:
- DS-1000: ~1,000 data-science problems spanning NumPy, pandas, Matplotlib, scikit-learn, PyTorch, and SciPy.
- MLE-bench: ML-engineering benchmark built from Kaggle competitions.
Contamination resistance:
- EvoEval: Semantic perturbations of HumanEval problems.
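To make the perturbation idea concrete: EvoEval uses rich LLM-driven transformations, but even a purely mechanical identifier-renaming pass illustrates how a surface change can break memorized solutions without changing problem semantics. This sketch (the `perturb` helper and rename mapping are hypothetical, not EvoEval's actual pipeline) uses the standard-library `ast` module:

```python
import ast

class RenameIdentifiers(ast.NodeTransformer):
    """Mechanically rename arguments/variables to break surface-level memorization."""
    def __init__(self, mapping):
        self.mapping = mapping

    def visit_arg(self, node):
        node.arg = self.mapping.get(node.arg, node.arg)
        return node

    def visit_Name(self, node):
        node.id = self.mapping.get(node.id, node.id)
        return node

def perturb(source, mapping):
    """Return a semantically equivalent variant of `source` with names swapped."""
    tree = ast.parse(source)
    tree = RenameIdentifiers(mapping).visit(tree)
    return ast.unparse(tree)

original = "def add(a, b):\n    return a + b"
print(perturb(original, {"a": "x", "b": "y"}))
```

Real perturbations would also rewrite docstrings, reorder arguments, or change problem framing; this is only the smallest moving part.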
Possible curriculum for RL:
- Easy: HumanEval -> EvoEval -> ODEX
- Medium: BigCodeBench -> DS-1000 -> LiveCodeBench (easy/medium)
- Hard: SWE-bench Verified -> SWT-Bench -> Terminal-Bench 2.0
- Very Hard: MLE-bench -> LiveCodeBench (hard) -> FeatureBench
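One way to operationalize the tiers above is a sampling schedule that shifts probability mass from easy to very-hard benchmarks as training progresses. Everything here is a hypothetical sketch (the tier names, triangular schedule, and peak positions are assumptions, not a prescribed recipe):

```python
import random

# Tiers mirror the curriculum above; benchmark names are labels only.
TIERS = {
    "easy":      ["HumanEval", "EvoEval", "ODEX"],
    "medium":    ["BigCodeBench", "DS-1000", "LiveCodeBench-easy/medium"],
    "hard":      ["SWE-bench-Verified", "SWT-Bench", "Terminal-Bench-2.0"],
    "very_hard": ["MLE-bench", "LiveCodeBench-hard", "FeatureBench"],
}
ORDER = ["easy", "medium", "hard", "very_hard"]

def tier_weights(progress):
    """Shift sampling mass from easy to very-hard as progress goes 0 -> 1.

    Each tier peaks at a different point of training (triangular schedule);
    the peak positions are arbitrary hyperparameters.
    """
    peaks = [0.0, 0.33, 0.66, 1.0]
    raw = [max(0.0, 1.0 - 3.0 * abs(progress - p)) for p in peaks]
    total = sum(raw) or 1.0
    return [w / total for w in raw]

def sample_benchmark(progress, rng=random):
    """Pick a tier by the current weights, then a benchmark within it."""
    tier = rng.choices(ORDER, weights=tier_weights(progress))[0]
    return rng.choice(TIERS[tier])

print(tier_weights(0.0))  # easy tier dominates early
print(tier_weights(1.0))  # very_hard tier dominates late
```

A real curriculum would likely gate progression on measured pass rates rather than raw training progress, but the sampler shape stays the same.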
Important point about the environment:
Most of these benchmarks have examples that fit in the context window. For practical use, once codebases get large enough, retrieval plays an important role. We might need a custom harness to reproduce this, e.g. by capping the number of files that can be read for any instance. Don't worry about this now!
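If we ever build that harness, the cap could be as simple as a budgeted wrapper around file reads. A minimal sketch, assuming a per-instance budget (the class name, budget size, and error choice are all hypothetical):

```python
class FileReadBudget:
    """Hypothetical harness hook: cap how many distinct files an agent may
    read per instance, forcing retrieval-like selectivity on large codebases."""

    def __init__(self, max_files=20):
        self.max_files = max_files
        self.opened = set()

    def read(self, path):
        # Re-reading an already-opened file does not consume budget.
        if path not in self.opened and len(self.opened) >= self.max_files:
            raise PermissionError(
                f"file-read budget of {self.max_files} exhausted; "
                "choose files more selectively"
            )
        self.opened.add(path)
        with open(path) as f:
            return f.read()
```

The harness would hand the agent this `read` method instead of raw filesystem access, so exceeding the budget surfaces as an explicit error the agent can react to.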