Scripts for working with SWE-bench, the AI coding agent benchmark.
If you are trying to beat Devin, see also the SWE-bench fork from OpenAgentsInc to run your agent.
See usage info with `--help` and `<subcommand> --help`:

```bash
python -m swe_bench_util --help
```
Subcommands:

- `get rows`: download SWE-bench examples from HuggingFace to JSON files
- `get oracle`: get "oracle" patch file lists parsed from diffs (context)
- `checkout`: clone the repo for an example and check out its `base_commit`; optionally run a command with `--exec`
- `index astra-assistants`: check out an example, then upload it to DataStax's Astra Assistants using phact's streaming-assistants library
Install poetry if you don't have it:

```bash
python3 -m pip install poetry
```
If using a feature that requires a vendor API, copy `.env.example` to `.env` and fill in the values.
Install dependencies and initialize an editable command:

```bash
poetry install
swe_bench_util --help
```

This assumes the poetry-installed script is on your `PATH`; otherwise, use `python -m swe_bench_util`.
Save the first example case. This will download the full dataset on first run, caching it with the `datasets` library.

```bash
swe_bench_util get rows --split 'dev[0:1]'
```
Output:

```
File 'examples/sqlfluff__sqlfluff-4764.json' was saved
File 'examples/sqlfluff__sqlfluff-4764.md' was saved
```
Use `jq` to show a subset of the JSON:

```bash
jq '. | {repo, instance_id, base_commit, problem_statement}' examples/sqlfluff__sqlfluff-4764.json
```
Save the oracle (patched file list) for the dev subset:

```bash
swe_bench_util get oracle
```

Output:

```
File 'examples/oracle.json' was saved
```
List the unique repos and base commits in the oracle file:

```bash
jq '.[] | .repo' examples/oracle.json | jq -s 'unique'
jq '.[] | {repo, base_commit}' examples/oracle.json | jq -s 'unique'
```
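If you want to reproduce an oracle file list yourself, here is a minimal sketch. It assumes the `patch` field holds a standard unified diff with `diff --git a/<path> b/<path>` headers; this is not necessarily how the `get oracle` command does it internally.

```python
import json
import re

# Minimal sketch: pull the changed-file list out of a row's `patch` field,
# assuming standard `diff --git a/<path> b/<path>` headers.
def patched_files(patch: str) -> list[str]:
    return re.findall(r"^diff --git a/(\S+) b/", patch, flags=re.MULTILINE)

with open("examples/sqlfluff__sqlfluff-4764.json") as f:
    row = json.load(f)

print(patched_files(row["patch"]))
```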
Git checkout the repo / `base_commit` of an example:

```bash
swe_bench_util checkout --id pydicom__pydicom-793
```
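Under the hood this amounts to cloning the example's repo and checking out the recorded commit. A rough sketch of the equivalent steps, assuming `repo` is a GitHub `owner/name` slug (a sketch, not the tool's actual implementation):

```python
import json
import subprocess

# Rough sketch: clone an example's repo and check out its base_commit.
with open("examples/sqlfluff__sqlfluff-4764.json") as f:
    row = json.load(f)

dest = row["repo"].split("/")[-1]
subprocess.run(
    ["git", "clone", f"https://github.com/{row['repo']}.git", dest],
    check=True,
)
subprocess.run(["git", "checkout", row["base_commit"]], cwd=dest, check=True)
```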
Index and run inference with astra-assistants. Make sure you have your keys set up in `.env`:

```bash
cp .env.example .env
```

Set your keys, then run the index command:

```bash
swe_bench_util index astra-assistants
```
Output:

```
...
Files used in retrieval: ["test_wcs.py", "wcs.py", "test_utils.py", "test_transform_coord_meta.py", "CHANGES.rst", "test_images.py", "test_misc.py"]
...
```
By default, most commands operate on the `dev` split, using the Hugging Face datasets API. You can specify a split with `--split`, for instance:

- `--split dev`: the entire dev split
- `--split 'dev[0:10]'`: the first 10 rows
- `--split 'dev[:10%]'`: a 10% sample

These strings use the Hugging Face slicing syntax, as sketched below.
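A minimal sketch of what a split string resolves to, assuming the official `princeton-nlp/SWE-bench` dataset id (substitute whichever SWE-bench dataset this tool actually loads):

```python
from datasets import load_dataset

# '--split dev[0:10]' maps onto the datasets slicing syntax directly.
# The dataset id below is an assumption, not read from this repo's code.
rows = load_dataset("princeton-nlp/SWE-bench", split="dev[0:10]")
print(len(rows), rows[0]["instance_id"])
```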
You can also filter by repo or id. Filters are applied after the split, so if you combine a row range with a filter you may come up empty (see the sketch below).

- `--repo pydicom/pydicom`
- `--id pydicom__pydicom-1555`
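Because filtering happens after the slice, the empty-result case is easy to reproduce. A sketch of the semantics, under the same dataset-id assumption as above:

```python
from datasets import load_dataset

# Filters run after the split slice, so a narrow slice plus a repo
# filter can match nothing at all.
subset = load_dataset("princeton-nlp/SWE-bench", split="dev[0:10]")
filtered = subset.filter(lambda row: row["repo"] == "pydicom/pydicom")
print(len(filtered))  # 0 if no pydicom rows land in the first 10
```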
Here is the shape of the data:

```
dev: Dataset({
    features: ['repo', 'instance_id', 'base_commit', 'patch', 'test_patch', 'problem_statement', 'hints_text', 'created_at', 'version', 'FAIL_TO_PASS', 'PASS_TO_PASS', 'environment_setup_commit'],
    num_rows: 225
})
test: Dataset({
    features: ['repo', 'instance_id', 'base_commit', 'patch', 'test_patch', 'problem_statement', 'hints_text', 'created_at', 'version', 'FAIL_TO_PASS', 'PASS_TO_PASS', 'environment_setup_commit'],
    num_rows: 2294
})
train: Dataset({
    features: ['repo', 'instance_id', 'base_commit', 'patch', 'test_patch', 'problem_statement', 'hints_text', 'created_at', 'version', 'FAIL_TO_PASS', 'PASS_TO_PASS', 'environment_setup_commit'],
    num_rows: 19008
})
```
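To poke at these fields directly, the sketch below loads one row. It assumes `FAIL_TO_PASS` and `PASS_TO_PASS` are JSON-encoded strings rather than lists; verify against your own download.

```python
import json
from datasets import load_dataset

# Inspect one row of the dev split (same dataset-id assumption as above).
row = load_dataset("princeton-nlp/SWE-bench", split="dev[0:1]")[0]
print(row["repo"], row["instance_id"])
# FAIL_TO_PASS is assumed to be a JSON-encoded list of test ids.
print(json.loads(row["FAIL_TO_PASS"]))
```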
Run checks with:

```bash
make check
```

That is equivalent to:

```bash
python -m pytest
python -m ruff check --fix
python -m ruff format
```