# SWE-bench Verified: SDLC Agent Evaluation

This notebook runs the SDLC agent on SWE-bench Verified instances, writes a predictions JSONL file, and evaluates it with the SWE-bench harness.

Notes:
- You need Docker running to use the harness.
- Generating patches requires API access for the SDLC agent model (see `.env` / env vars).
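
Both prerequisites can be checked before any long-running cell starts. The sketch below is a minimal preflight check, not part of the project: the helper names `docker_available` and `missing_env_vars` are hypothetical, and it assumes the API key is supplied through the `CODESTRAL_API_KEY` environment variable.

```python
import os
import shutil
import subprocess


def docker_available(timeout_sec: float = 15.0) -> bool:
    """Return True if the `docker` CLI exists and `docker info` succeeds."""
    if shutil.which("docker") is None:
        return False
    try:
        proc = subprocess.run(
            ["docker", "info"],
            capture_output=True,
            text=True,
            timeout=timeout_sec,
        )
    except (subprocess.TimeoutExpired, OSError):
        return False
    return proc.returncode == 0


def missing_env_vars(names: list[str]) -> list[str]:
    """Return the names that are unset or empty in the current environment."""
    return [name for name in names if not os.environ.get(name)]


print("docker ok:", docker_available())
print("missing env vars:", missing_env_vars(["CODESTRAL_API_KEY"]))
```

If `docker ok` prints `False` or the list of missing variables is not empty, fix the environment before running the later sections.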


## 1) Setup
Install SWE-bench, the `datasets` package, and this project into your environment.


In [1]:
!git clone https://github.com/princeton-nlp/SWE-bench.git
%pip install -e SWE-bench

Cloning into 'SWE-bench'...


Defaulting to user installation because normal site-packages is not writeable
Obtaining file:///D:/MegaScholaITMO/coding-agent/SWE-bench
  Installing build dependencies: started
  Installing build dependencies: finished with status 'done'
  Checking if build backend supports build_editable: started
  Checking if build backend supports build_editable: finished with status 'done'
  Getting requirements to build editable: started
  Getting requirements to build editable: finished with status 'done'
  Preparing editable metadata (pyproject.toml): started
  Preparing editable metadata (pyproject.toml): finished with status 'done'
INFO: pip is looking at multiple versions of swebench to determine which version is compatible with other requirements. This could take a while.
Note: you may need to restart the kernel to use updated packages.


ERROR: Could not find a version that satisfies the requirement chardet (from swebench) (from versions: none)
ERROR: No matching distribution found for chardet


In [None]:
# Optional: install SWE-bench (recommended by SWE-bench docs)
# !git clone https://github.com/princeton-nlp/SWE-bench.git
# %pip install -e SWE-bench

# Optional: install dependencies for this repo
# %pip install -e .
# %pip install datasets


## 2) Config
Adjust the paths, instance limit, retry settings, and timeouts as needed.


In [2]:
from pathlib import Path

DATASET_NAME = "princeton-nlp/SWE-bench_Verified"
SPLIT = "test"
RUN_ID = "sdlc-agent-verified"
MAX_WORKERS = 8
MAX_INSTANCES = 5  # set to None for full run

PREDICTIONS_PATH = Path("predictions.jsonl")
WORK_DIR = Path(".swebench_work")
WORK_DIR.mkdir(exist_ok=True)

AGENT_NAME = "sdlc-agent"

RETRY_MAX_ATTEMPTS = 8
RETRY_BASE_DELAY_SEC = 10.0
RETRY_MAX_DELAY_SEC = 60.0
SLEEP_BETWEEN_INSTANCES_SEC = 7.0

GIT_CLONE_TIMEOUT_SEC = 600.0
GIT_CHECKOUT_TIMEOUT_SEC = 120.0
DIFF_TIMEOUT_SEC = 60.0


## 3) Load the dataset


In [3]:
from datasets import load_dataset

dataset = load_dataset(DATASET_NAME, split=SPLIT)
if MAX_INSTANCES is not None:
    dataset = dataset.select(range(MAX_INSTANCES))
len(dataset)


Generating test split: 100%|██████████| 500/500 [00:00<00:00, 10177.09 examples/s]


5

## 4) Agent runner (local patch generation)
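For each instance, the runner clones the repository, checks out the base commit, builds the context with the SDLC agent's context builder, asks the model for file changes using the SDLC agent prompt, applies them, and returns the resulting `git diff` as the patch.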


In [4]:
import json
import subprocess
import tempfile
import os
import time
import random
from pathlib import Path
from typing import Any, Dict

from src.sdlc_agent.agent_code import PROMPT_CODE_PATCH, _apply_file_changes, _build_context
from src.sdlc_agent.llm import chat, extract_json


# Read the API key from the environment; never hardcode secrets in the notebook.
CODESTRAL_API_KEY = os.getenv("CODESTRAL_API_KEY", "")
if not CODESTRAL_API_KEY:
    raise ValueError("Missing CODESTRAL_API_KEY in environment")
CODESTRAL_API_BASE = os.getenv("CODESTRAL_API_BASE", "https://api.mistral.ai/v1")
CODESTRAL_MODEL = os.getenv("CODESTRAL_MODEL", "codestral-latest")


def _log(instance_id: str, message: str) -> None:
    print(f"[{instance_id}] {message}")


def run(cmd: str, cwd: Path, timeout_sec: float | None = None) -> subprocess.CompletedProcess:
    start = time.time()
    try:
        proc = subprocess.run(
            cmd,
            cwd=str(cwd),
            shell=True,
            text=True,
            capture_output=True,
            timeout=timeout_sec,
        )
    except subprocess.TimeoutExpired as exc:
        raise RuntimeError(f"Command timed out after {timeout_sec}s: {cmd}") from exc
    duration = time.time() - start
    if proc.returncode != 0:
        raise RuntimeError(
            f"Command failed ({proc.returncode}) in {duration:.1f}s: {cmd}\n"
            f"STDOUT:\n{proc.stdout}\nSTDERR:\n{proc.stderr}"
        )
    return proc


def get_repo_full_name(instance: Dict[str, Any]) -> str:
    if "repo" in instance and instance["repo"]:
        return instance["repo"]
    if "repo_name" in instance and instance["repo_name"]:
        return instance["repo_name"]
    raise KeyError("Missing repo field in instance")


def build_issue_text(instance: Dict[str, Any]) -> str:
    parts = []
    if instance.get("title"):
        parts.append(instance["title"])
    if instance.get("problem_statement"):
        parts.append(instance["problem_statement"])
    hints = instance.get("hints_text") or ""
    if hints.strip():
        parts.append("HINTS:\n" + hints)
    return "\n\n".join(parts).strip()


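def _backoff_sleep(attempt: int) -> None:
    """Sleep with exponential backoff: base * 2**attempt, capped at the max delay, plus up to 1s of jitter."""
    # The fallbacks apply only if the config cell has not been run.
    base = float(globals().get("RETRY_BASE_DELAY_SEC", 2.0))
    max_delay = float(globals().get("RETRY_MAX_DELAY_SEC", 60.0))
    jitter = random.uniform(0.0, 1.0)
    delay = min(max_delay, base * (2 ** attempt)) + jitter
    time.sleep(delay)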


def run_agent_on_instance(instance: Dict[str, Any], work_dir: Path) -> str:
    instance_id = str(instance.get("instance_id", "unknown"))
    repo_full = get_repo_full_name(instance)
    base_commit = instance["base_commit"]
    issue_text = build_issue_text(instance)

    clone_timeout = float(globals().get("GIT_CLONE_TIMEOUT_SEC", 600.0))
    checkout_timeout = float(globals().get("GIT_CHECKOUT_TIMEOUT_SEC", 120.0))
    diff_timeout = float(globals().get("DIFF_TIMEOUT_SEC", 60.0))

    _log(instance_id, f"start repo={repo_full} base_commit={base_commit}")
    with tempfile.TemporaryDirectory(prefix="swebench-", dir=work_dir) as tmp:
        tmp_path = Path(tmp)
        repo_dir = tmp_path / repo_full.replace("/", "__")
        clone_url = f"https://github.com/{repo_full}.git"

        _log(instance_id, "cloning repo")
        run(f"git clone {clone_url} {repo_dir.name}", cwd=tmp_path, timeout_sec=clone_timeout)

        _log(instance_id, "checking out base commit")
        run(f"git checkout {base_commit}", cwd=repo_dir, timeout_sec=checkout_timeout)

        _log(instance_id, "building context (may take a while)")
        t0 = time.time()
        context = _build_context(repo_dir, issue_text)
        _log(instance_id, f"context built in {time.time() - t0:.1f}s")

        messages = [
            {"role": "system", "content": PROMPT_CODE_PATCH},
            {"role": "user", "content": context},
        ]

        max_attempts = int(globals().get("RETRY_MAX_ATTEMPTS", 5))
        last_exc: Exception | None = None
        response = None
        for attempt in range(max_attempts):
            try:
                _log(instance_id, f"LLM attempt {attempt + 1}/{max_attempts}")
                response = chat(
                    CODESTRAL_API_BASE,
                    CODESTRAL_API_KEY,
                    CODESTRAL_MODEL,
                    messages,
                    temperature=0.2,
                    max_tokens=2048,
                    timeout_sec=120,
                )
                last_exc = None
                break
            except Exception as exc:
                last_exc = exc
                _log(instance_id, f"LLM error: {exc}")
                if attempt < max_attempts - 1:
                    _backoff_sleep(attempt)
        if response is None:
            raise last_exc or RuntimeError("LLM call failed")

        data = extract_json(response.content)
        _apply_file_changes(repo_dir, data)
        diff = run("git diff --no-color", cwd=repo_dir, timeout_sec=diff_timeout)
        _log(instance_id, f"diff size={len(diff.stdout)} chars")
        return diff.stdout


## 4b) Smoke test (single instance)
Run this to verify the agent completes one instance end-to-end.


In [5]:
instance = dataset[0]
patch = run_agent_on_instance(instance, WORK_DIR)
print(f"Patch chars: {len(patch)}")
print(patch[:2000])


[astropy__astropy-12907] start repo=astropy/astropy base_commit=d16bfe05a744909de4b27f5875fe0d4ed41ce607
[astropy__astropy-12907] cloning repo
[astropy__astropy-12907] checking out base commit
[astropy__astropy-12907] building context (may take a while)
[astropy__astropy-12907] context built in 12.6s
[astropy__astropy-12907] LLM attempt 1/8
[astropy__astropy-12907] LLM error: LLM HTTP 429: {"object":"error","message":"Rate limit exceeded","type":"rate_limited","param":null,"code":"1300"}
[astropy__astropy-12907] LLM attempt 2/8
[astropy__astropy-12907] LLM error: LLM HTTP 429: {"object":"error","message":"Rate limit exceeded","type":"rate_limited","param":null,"code":"1300"}
[astropy__astropy-12907] LLM attempt 3/8
[astropy__astropy-12907] LLM error: LLM HTTP 429: {"object":"error","message":"Rate limit exceeded","type":"rate_limited","param":null,"code":"1300"}
[astropy__astropy-12907] LLM attempt 4/8
[astropy__astropy-12907] LLM error: LLM HTTP 429: {"object":"error","message":"Rate 

LlmError: LLM HTTP 429: {"object":"error","message":"Rate limit exceeded","type":"rate_limited","param":null,"code":"1300"}
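
The run above ended in an `LlmError` after repeated HTTP 429 responses, so it helps to know how long the retry loop waits in total. The sketch below reproduces the delay formula from `_backoff_sleep` without the random jitter, using the values from the config cell. The helper name `backoff_schedule` is hypothetical. The loop sleeps after every failed attempt except the last, so 8 attempts give 7 delays.

```python
def backoff_schedule(max_attempts: int, base: float, max_delay: float) -> list[float]:
    """Delays in seconds (jitter excluded) slept after each failed attempt except the last."""
    return [min(max_delay, base * (2 ** attempt)) for attempt in range(max_attempts - 1)]


# Values taken from the config cell: RETRY_MAX_ATTEMPTS, RETRY_BASE_DELAY_SEC, RETRY_MAX_DELAY_SEC.
delays = backoff_schedule(max_attempts=8, base=10.0, max_delay=60.0)
print(delays)       # [10.0, 20.0, 40.0, 60.0, 60.0, 60.0, 60.0]
print(sum(delays))  # 310.0
```

With these settings the loop waits a little over five minutes in total (310 s plus up to 1 s of jitter per sleep) before giving up. If the provider's rate-limit window is longer than that, raising `RETRY_MAX_DELAY_SEC` or `RETRY_MAX_ATTEMPTS` may help; the actual window is not known from this notebook.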

## 5) Generate predictions
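
The cell in this section collects all rows in memory and writes the file only after the loop finishes, so an interrupted run leaves no predictions on disk. One possible alternative, sketched here with a hypothetical helper named `append_prediction`, is to append each row as soon as it is produced. The row keys match the ones used in this section.

```python
import json
from pathlib import Path


def append_prediction(path: Path, instance_id: str, model_name: str, patch: str) -> None:
    """Append one prediction row to a JSONL file, creating the file if needed."""
    row = {
        "instance_id": instance_id,
        "model_name_or_path": model_name,
        "model_patch": patch,
    }
    # Append mode keeps rows from earlier iterations if a later one crashes.
    with Path(path).open("a", encoding="utf-8") as f:
        f.write(json.dumps(row) + "\n")
```

Because the file is opened in append mode, delete any existing predictions file before a fresh run so that rows from earlier runs are not duplicated.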


In [None]:
import json
import time

rows = []
sleep_between = float(globals().get("SLEEP_BETWEEN_INSTANCES_SEC", 1.0))
for instance in dataset:
    instance_id = instance["instance_id"]
    print(f"[INFO] {instance_id}: start")
    try:
        patch = run_agent_on_instance(instance, WORK_DIR)
    except Exception as exc:
        patch = ""
        print(f"[WARN] {instance_id}: {exc}")
    rows.append({
        "instance_id": instance_id,
        "model_name_or_path": AGENT_NAME,
        "model_patch": patch,
    })
    print(f"[INFO] {instance_id}: done patch_chars={len(patch)}")
    if sleep_between > 0:
        time.sleep(sleep_between)

with PREDICTIONS_PATH.open("w", encoding="utf-8") as f:
    for row in rows:
        f.write(json.dumps(row) + "\n")

PREDICTIONS_PATH


[INFO] Starting instance astropy__astropy-12907


## 6) Evaluate with the SWE-bench harness
If your SWE-bench version uses a different worker flag, run `python -m swebench.harness.run_evaluation --help` and adjust.
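
Before launching the harness, it can be worth checking that the predictions file has the shape written in step 5 and seeing how many patches are empty, since failed instances are recorded with an empty `model_patch`. This is a minimal sketch; the helper name `summarize_predictions` is hypothetical, and the required keys are the ones this notebook writes.

```python
import json
from pathlib import Path

# Keys written per row by the prediction loop in this notebook.
REQUIRED_KEYS = {"instance_id", "model_name_or_path", "model_patch"}


def summarize_predictions(path: Path) -> dict:
    """Check each JSONL row for the required keys and count empty patches."""
    total = 0
    empty = 0
    with Path(path).open(encoding="utf-8") as f:
        for line_no, line in enumerate(f, start=1):
            if not line.strip():
                continue  # tolerate blank lines
            row = json.loads(line)
            missing = REQUIRED_KEYS - row.keys()
            if missing:
                raise ValueError(f"line {line_no}: missing keys {sorted(missing)}")
            total += 1
            if not row["model_patch"].strip():
                empty += 1
    return {"total": total, "empty_patches": empty}
```

Calling `summarize_predictions(PREDICTIONS_PATH)` returns a small dictionary of counts. A high `empty_patches` value suggests that patch generation failed for most instances (for example because of rate limiting), in which case the evaluation will have little to measure.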


In [None]:
!python -m swebench.harness.run_evaluation \
    --dataset_name {DATASET_NAME} \
    --predictions_path {PREDICTIONS_PATH} \
    --max_workers {MAX_WORKERS} \
    --run_id {RUN_ID}
