Priority: Critical — the single biggest maturity gap
Currently PawBench scores quality via keyword matching (expect.output_mentions: ["flask", "api"]). The model could generate syntactically broken code and score 100%.
Proposal
For each PawStyle scenario, spin up a Docker container post-generation:
- Write the generated files (from tool call arguments) to the container
- For backend: start the server, hit each endpoint, verify HTTP status + JSON schema
- For frontend: run through a headless browser or just validate HTML syntax
- Score: endpoints_working / endpoints_expected
Implementation
class SandboxEvaluator:
async def evaluate(self, agent_result: AgentResult, scenario: dict) -> float:
# Extract files from tool call arguments
# Write to temp dir
# docker run -v tempdir:/app python:3.12 python /app/server.py &
# curl http://localhost:8080/api/products → check 200 + valid JSON
# Return pass_rate 0.0-1.0
Acceptance Criteria
This is the #1 gap between PawBench and SWE-bench level maturity.
Priority: Critical — the single biggest maturity gap
Currently PawBench scores quality via keyword matching (
expect.output_mentions: ["flask", "api"]). The model could generate syntactically broken code and score 100%.Proposal
For each PawStyle scenario, spin up a Docker container post-generation:
Implementation
Acceptance Criteria
--no-sandboxflag to skip (for quick runs)This is the #1 gap between PawBench and SWE-bench level maturity.