Code benchmarks #5

@TreeinRandomForest

While we are starting with SWE-bench Verified, we should explore other Python benchmarks. In roughly increasing order of difficulty:

  • HumanEval: 164 hand-crafted Python functions. Difficulty: Easy
  • ODEX: Open-domain, execution-based Python tasks focused on library API use. Difficulty: Easy-Medium
  • BigCodeBench: ~1,140 tasks that compose the standard library with third-party libraries. Difficulty: Medium
  • LiveCodeBench: Competitive programming. Contamination-resistant. Difficulty: Medium-Very Hard
  • SWE-bench Verified: 500 human-validated GitHub issues from real Python repos. Difficulty: Hard

Warning: See paper for issues with memorization.

Agentic coverage:

  • SWT-Bench: Test generation for a given codebase. Includes categories like Test Generation, Test Repair, Coverage Improvement.
  • Terminal-Bench 2.0: Tasks in a terminal/shell environment.
  • FeatureBench: Implementing feature requests, not just fixing bugs.

Python DS code:

  • DS-1000: ~1000 data science problems in NumPy, pandas, Matplotlib, scikit-learn, PyTorch, SciPy.
  • MLE-bench: ML engineering tasks drawn from Kaggle competitions.

Contamination resistance:

  • EvoEval: Semantic perturbations of HumanEval problems.

Possible curriculum for RL:

  1. Easy: HumanEval -> EvoEval -> ODEX
  2. Medium: BigCodeBench -> DS-1000 -> LiveCodeBench (easy/medium)
  3. Hard: SWE-bench Verified -> SWT-Bench -> Terminal-Bench 2.0
  4. Very Hard: MLE-bench -> LiveCodeBench (hard) -> FeatureBench
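
To make the staging concrete, here is a rough sketch of the curriculum as a data structure with a simple promotion rule. The `Stage` class, the `stage_for` helper, and the 0.7 pass-rate threshold are all hypothetical and only for illustration; nothing here is decided yet.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    """One difficulty tier of the curriculum (hypothetical structure)."""
    name: str
    benchmarks: tuple[str, ...]


# Stages mirror the list above, in training order.
CURRICULUM = (
    Stage("easy", ("HumanEval", "EvoEval", "ODEX")),
    Stage("medium", ("BigCodeBench", "DS-1000", "LiveCodeBench (easy/medium)")),
    Stage("hard", ("SWE-bench Verified", "SWT-Bench", "Terminal-Bench 2.0")),
    Stage("very hard", ("MLE-bench", "LiveCodeBench (hard)", "FeatureBench")),
)


def stage_for(pass_rates: dict[str, float], threshold: float = 0.7) -> Stage:
    """Return the first stage that still has a benchmark below `threshold`.

    Benchmarks with no recorded pass rate count as 0.0. The threshold is an
    assumed promotion rule, not a tuned value. If every stage is cleared,
    the last stage is returned.
    """
    for stage in CURRICULUM:
        if any(pass_rates.get(b, 0.0) < threshold for b in stage.benchmarks):
            return stage
    return CURRICULUM[-1]
```

With no results recorded, `stage_for({})` returns the easy stage; once all three easy benchmarks clear the threshold, it moves on to medium.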

Important point about environment

Most of these benchmarks have instances that fit in the context window. In practical use, once codebases get large enough, retrieval plays an important role. We might need a custom harness to reproduce this, e.g. one that caps the number of files that can be read for any instance. Don't worry about this now!
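
For whenever we do get to it, one possible shape for the file cap is a small reader wrapper that the harness hands to the agent instead of raw filesystem access. This is only a sketch: `CappedFileReader`, `ReadBudgetExceeded`, and the `max_files` parameter are made-up names, and it assumes Python 3.9+ for `Path.is_relative_to`.

```python
from pathlib import Path


class ReadBudgetExceeded(RuntimeError):
    """Raised when an instance tries to read more distinct files than allowed."""


class CappedFileReader:
    """Hypothetical harness helper: limits distinct files read per instance."""

    def __init__(self, root: str, max_files: int) -> None:
        self.root = Path(root).resolve()
        self.max_files = max_files
        self._seen: set[Path] = set()

    def read(self, relative_path: str) -> str:
        path = (self.root / relative_path).resolve()
        # Refuse paths that escape the repository root.
        if not path.is_relative_to(self.root):
            raise PermissionError(f"{relative_path} is outside {self.root}")
        # Re-reading an already seen file does not consume budget.
        if path not in self._seen and len(self._seen) >= self.max_files:
            raise ReadBudgetExceeded(
                f"read budget of {self.max_files} files exhausted"
            )
        text = path.read_text(encoding="utf-8")
        # Only count the file once the read has succeeded.
        self._seen.add(path)
        return text

    @property
    def files_read(self) -> int:
        return len(self._seen)
```

Creating one reader per benchmark instance keeps the budget per instance, which matches the "for any instance" framing above.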
