feat: execution-based correctness scoring (sandbox eval) #1

@zenprocess

Priority: Critical — the single biggest maturity gap

Currently PawBench scores quality via keyword matching (expect.output_mentions: ["flask", "api"]). Because only the presence of keywords is checked, a model can generate syntactically broken code and still score 100%.
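To make the gap concrete, here is a minimal sketch. `keyword_score` is a hypothetical stand-in for `expect.output_mentions` (the actual PawBench matcher is not shown in this issue), and `ast.parse` is used only as the cheapest possible correctness signal, not as the proposed sandbox.

```python
import ast

def keyword_score(output: str, mentions: list[str]) -> float:
    """Hypothetical stand-in for expect.output_mentions: fraction of keywords found."""
    text = output.lower()
    return sum(kw.lower() in text for kw in mentions) / len(mentions)

def is_valid_python(source: str) -> bool:
    """True if the source at least parses; says nothing about runtime behavior."""
    try:
        ast.parse(source)
        return True
    except SyntaxError:
        return False

# Broken on purpose: unclosed parenthesis and a missing colon.
broken = "from flask import Flask\napp = Flask(__name__\ndef api()\n    return 1\n"

print(keyword_score(broken, ["flask", "api"]))  # 1.0
print(is_valid_python(broken))                  # False
```

The broken snippet mentions both keywords and therefore gets a full keyword score, even though it cannot be parsed, let alone served.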

Proposal

For each PawStyle scenario, spin up a Docker container post-generation:

  1. Write the generated files (from tool call arguments) to the container
  2. For backend: start the server, hit each endpoint, verify HTTP status + JSON schema
  3. For frontend: run through a headless browser or just validate HTML syntax
  4. Score: endpoints_working / endpoints_expected
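Steps 2 and 4 could be scored roughly as follows. This is a sketch under assumptions: the schema format (field name mapped to a Python type) and the function names are illustrative and not part of PawBench.

```python
import json

def matches_schema(value, schema: dict) -> bool:
    """Check that every field in `schema` is present with the expected type.

    `schema` maps field names to Python types, e.g. {"id": int, "name": str}.
    """
    if not isinstance(value, dict):
        return False
    return all(
        field in value and isinstance(value[field], expected)
        for field, expected in schema.items()
    )

def endpoint_passes(status: int, body: str, expected_status: int, schema: dict) -> bool:
    """Binary pass/fail for one endpoint: status code, valid JSON, schema."""
    if status != expected_status:
        return False
    try:
        payload = json.loads(body)
    except json.JSONDecodeError:
        return False
    # For list responses, require every item to match the schema.
    items = payload if isinstance(payload, list) else [payload]
    return all(matches_schema(item, schema) for item in items)

def pass_rate(results: list[bool]) -> float:
    """endpoints_working / endpoints_expected; 0.0 when nothing is expected."""
    return sum(results) / len(results) if results else 0.0

results = [
    endpoint_passes(200, '[{"id": 1, "name": "Widget"}]', 200, {"id": int, "name": str}),
    endpoint_passes(500, "Internal Server Error", 200, {"id": int}),
]
print(pass_rate(results))  # 0.5
```

Note that an empty JSON list passes this check, and `bool` values satisfy `int` in Python; a stricter evaluator may want to reject both.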

Implementation

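class SandboxEvaluator:
    async def evaluate(self, agent_result: AgentResult, scenario: dict) -> float:
        """Return the fraction of expected endpoints that pass (0.0-1.0)."""
        # 1. Extract files from tool call arguments
        # 2. Write them to a temp dir (absolute path; a bare name would be treated as a named volume)
        # 3. docker run --rm -d -p 8080:8080 -v /abs/tempdir:/app python:3.12 python /app/server.py
        #    (-p publishes the port so the host can reach the server; --rm removes the container on exit)
        # 4. curl http://localhost:8080/api/products → check 200 + valid JSON
        # 5. Return pass_rate 0.0-1.0
        raise NotImplementedError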
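The file-writing, container start, and cleanup steps might look like the sketch below. Everything here is an assumption for illustration: the helper names, the `pawbench-` container name prefix, the `server.py` entry point, and port 8080 (the latter two taken from the pseudocode above). The `check` callable stands in for whatever endpoint probing the evaluator performs.

```python
import subprocess
import tempfile
import uuid
from pathlib import Path

def write_files(files: dict[str, str], root: Path) -> None:
    """Write generated files (relative path -> content) under root."""
    for rel_path, content in files.items():
        target = (root / rel_path).resolve()
        # Refuse paths that escape the temp dir (e.g. "../../etc/passwd").
        if not target.is_relative_to(root.resolve()):
            raise ValueError(f"unsafe path: {rel_path}")
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_text(content)

def docker_run_cmd(app_dir: Path, name: str, port: int = 8080) -> list[str]:
    """Build the docker run invocation; entry point and port are assumptions."""
    return [
        "docker", "run", "--rm", "-d",
        "--name", name,
        "-p", f"127.0.0.1:{port}:{port}",   # publish so the host can reach the server
        "-v", f"{app_dir.resolve()}:/app",  # bind mounts need an absolute path
        "python:3.12", "python", "/app/server.py",
    ]

def run_sandboxed(files: dict[str, str], check) -> float:
    """Start the container, run check(), and always remove the container."""
    name = f"pawbench-{uuid.uuid4().hex[:8]}"
    with tempfile.TemporaryDirectory() as tmp:
        write_files(files, Path(tmp))
        subprocess.run(docker_run_cmd(Path(tmp), name), check=True)
        try:
            return check()
        finally:
            # Force-remove even if check() raised, so no container leaks.
            subprocess.run(["docker", "rm", "-f", name], check=False)
```

Two caveats this sketch does not solve: the stock `python:3.12` image ships without third-party packages such as Flask, so generated servers would need their dependencies installed first, and the server inside the container has to listen on `0.0.0.0` for the published port to be reachable from the host.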

Acceptance Criteria

  • Backend endpoints return correct HTTP status codes
  • JSON responses match expected schema (fields present, correct types)
  • Frontend HTML is syntactically valid
  • Scoring is binary per endpoint (pass/fail), aggregated as a pass rate (0.0-1.0, reportable as a percentage)
  • Docker cleanup on completion (no leaked containers)
  • A --no-sandbox flag skips sandbox evaluation (for quick runs)
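For the frontend criterion, a lightweight option that avoids a headless browser is a tag-balance check built on the standard library's `html.parser`. This is a sketch, not a full HTML5 validator: it is stricter than the spec, which allows some end tags (such as `</p>` or `</li>`) to be omitted. The class and function names are illustrative.

```python
from html.parser import HTMLParser

# Elements that never take a closing tag.
VOID = {"area", "base", "br", "col", "embed", "hr", "img", "input",
        "link", "meta", "source", "track", "wbr"}

class TagBalanceChecker(HTMLParser):
    """Collects mismatched or unclosed tags; not a full HTML5 validator."""

    def __init__(self) -> None:
        super().__init__()
        self.stack: list[str] = []
        self.errors: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag not in VOID:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        if tag in VOID:
            return
        if not self.stack or self.stack[-1] != tag:
            self.errors.append(f"unexpected </{tag}>")
        else:
            self.stack.pop()

def html_is_balanced(source: str) -> bool:
    """True if every non-void tag is closed in the order it was opened."""
    checker = TagBalanceChecker()
    checker.feed(source)
    checker.close()
    return not checker.errors and not checker.stack

print(html_is_balanced("<html><body><p>hi<br></p></body></html>"))  # True
print(html_is_balanced("<div><p>hi</div>"))                         # False
```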

This is the #1 gap between PawBench and SWE-bench-level maturity.

Labels: enhancement (New feature or request)
