Scripts for working with SWE-bench, the AI coding agent benchmark.
If you are trying to beat Devin, see also the SWE-bench fork from OpenAgentsInc to run your agent.
See usage info with `--help` and `<subcommand> --help`:

```bash
python -m swe_bench_util --help
```
Subcommands:

- `get rows`: download SWE-bench examples from HuggingFace to JSON files
- `get oracle`: get "oracle" patch file lists parsed from diffs (context)
- `checkout`: clone the repo for an example and check out its `base_commit`; optionally run a command with `--exec`
- `index astra-assistants`: check out an example, then upload it to DataStax's Astra Assistants using phact's streaming-assistants library
Install poetry if you don't have it:

```bash
python3 -m pip install poetry
```
If using a feature that requires a vendor API, copy `.env.example` to `.env` and fill in the values.
Install dependencies and initialize an editable command:

```bash
poetry install
swe_bench_util --help
```

This assumes the poetry-installed script is on your `PATH`; otherwise, use `python -m swe_bench_util`.
Save the first example case. This will download the full dataset on first run, caching it with the `datasets` library.

```bash
swe_bench_util get rows --split 'dev[0:1]'
```
Output:

```
File 'examples/sqlfluff__sqlfluff-4764.json' was saved
File 'examples/sqlfluff__sqlfluff-4764.md' was saved
```
Use `jq` to show a subset of the JSON:

```bash
jq '. | {repo, instance_id, base_commit, problem_statement}' examples/sqlfluff__sqlfluff-4764.json
```
Save the oracle (patched file list) for the dev subset:

```bash
swe_bench_util get oracle
```

Output:

```
File 'examples/oracle.json' was saved
```
List the unique repos and base commits in the oracle file:

```bash
jq '.[] | .repo' examples/oracle.json | jq -s 'unique'
jq '.[] | {repo, base_commit}' examples/oracle.json | jq -s 'unique'
```
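If you want to reproduce an oracle file list yourself, here is a minimal sketch. It assumes the `patch` field holds a standard unified diff with `diff --git a/<path> b/<path>` headers; this is not necessarily how the `get oracle` command does it internally.

```python
import json
import re

# Minimal sketch: pull the changed-file list out of a row's `patch` field,
# assuming standard `diff --git a/<path> b/<path>` headers.
def patched_files(patch: str) -> list[str]:
    return re.findall(r"^diff --git a/(\S+) b/", patch, flags=re.MULTILINE)

with open("examples/sqlfluff__sqlfluff-4764.json") as f:
    row = json.load(f)

print(patched_files(row["patch"]))
```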
Git checkout the repo / `base_commit` of an example:

```bash
swe_bench_util checkout --id pydicom__pydicom-793
```
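Under the hood this amounts to cloning the example's repo and checking out the recorded commit. A rough sketch of the equivalent steps, assuming `repo` is a GitHub `owner/name` slug (a sketch, not the tool's actual implementation):

```python
import json
import subprocess

# Rough sketch: clone an example's repo and check out its base_commit.
with open("examples/sqlfluff__sqlfluff-4764.json") as f:
    row = json.load(f)

dest = row["repo"].split("/")[-1]
subprocess.run(
    ["git", "clone", f"https://github.com/{row['repo']}.git", dest],
    check=True,
)
subprocess.run(["git", "checkout", row["base_commit"]], cwd=dest, check=True)
```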
Index and run inference with astra-assistants. Make sure you have your keys set up in `.env`:

```bash
cp .env.example .env
```

Set your keys, then run the index command:

```bash
swe_bench_util index astra-assistants
```
Output:

```
...
Files used in retrieval: ["test_wcs.py", "wcs.py", "test_utils.py", "test_transform_coord_meta.py", "CHANGES.rst", "test_images.py", "test_misc.py"]
...
```
By default, most commands operate on the `dev` split, using the Hugging Face datasets API. You can specify a split with `--split`, for instance:

- `--split dev`: the entire dev split
- `--split 'dev[0:10]'`: the first 10 rows
- `--split 'dev[:10%]'`: a 10% sample

These strings use the Hugging Face slicing syntax, as sketched below.
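A minimal sketch of what a split string resolves to, assuming the official `princeton-nlp/SWE-bench` dataset id (substitute whichever SWE-bench dataset this tool actually loads):

```python
from datasets import load_dataset

# '--split dev[0:10]' maps onto the datasets slicing syntax directly.
# The dataset id below is an assumption, not read from this repo's code.
rows = load_dataset("princeton-nlp/SWE-bench", split="dev[0:10]")
print(len(rows), rows[0]["instance_id"])
```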
You can also filter by repo or id. Filters are applied after the split, so if you combine a row range with a filter you may come up empty (see the sketch below).

- `--repo pydicom/pydicom`
- `--id pydicom__pydicom-1555`
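Because filtering happens after the slice, the empty-result case is easy to reproduce. A sketch of the semantics, under the same dataset-id assumption as above:

```python
from datasets import load_dataset

# Filters run after the split slice, so a narrow slice plus a repo
# filter can match nothing at all.
subset = load_dataset("princeton-nlp/SWE-bench", split="dev[0:10]")
filtered = subset.filter(lambda row: row["repo"] == "pydicom/pydicom")
print(len(filtered))  # 0 if no pydicom rows land in the first 10
```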
Here is the shape of the data:

```
dev: Dataset({
    features: ['repo', 'instance_id', 'base_commit', 'patch', 'test_patch', 'problem_statement', 'hints_text', 'created_at', 'version', 'FAIL_TO_PASS', 'PASS_TO_PASS', 'environment_setup_commit'],
    num_rows: 225
})
test: Dataset({
    features: ['repo', 'instance_id', 'base_commit', 'patch', 'test_patch', 'problem_statement', 'hints_text', 'created_at', 'version', 'FAIL_TO_PASS', 'PASS_TO_PASS', 'environment_setup_commit'],
    num_rows: 2294
})
train: Dataset({
    features: ['repo', 'instance_id', 'base_commit', 'patch', 'test_patch', 'problem_statement', 'hints_text', 'created_at', 'version', 'FAIL_TO_PASS', 'PASS_TO_PASS', 'environment_setup_commit'],
    num_rows: 19008
})
```
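To poke at these fields directly, the sketch below loads one row. It assumes `FAIL_TO_PASS` and `PASS_TO_PASS` are JSON-encoded strings rather than lists; verify against your own download.

```python
import json
from datasets import load_dataset

# Inspect one row of the dev split (same dataset-id assumption as above).
row = load_dataset("princeton-nlp/SWE-bench", split="dev[0:1]")[0]
print(row["repo"], row["instance_id"])
# FAIL_TO_PASS is assumed to be a JSON-encoded list of test ids.
print(json.loads(row["FAIL_TO_PASS"]))
```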
Run checks with:

```bash
make check
```

That is equivalent to:

```bash
python -m pytest
python -m ruff check --fix
python -m ruff format
```