feat: execution-based correctness scoring (sandbox eval)

## Priority: Critical — the single biggest maturity gap

Currently PawBench scores quality via keyword matching (`expect.output_mentions: ["flask", "api"]`). A model can generate syntactically broken code and still score 100%, as long as the expected keywords appear in its output.

### Proposal

For each PawStyle scenario, spin up a Docker container post-generation:
1. Write the generated files (from tool call arguments) to the container
2. For backend: start the server, hit each endpoint, verify HTTP status + JSON schema
3. For frontend: run through a headless browser or just validate HTML syntax
4. Score: endpoints_working / endpoints_expected
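The scoring rule in step 4 can be sketched as below. This is a minimal illustration, not the final design: the function name `pass_rate`, the shape of `results` (endpoint path mapped to a pass/fail boolean), and the choice to return 0.0 when no endpoints are expected are all assumptions.

```python
def pass_rate(results: dict[str, bool], expected: list[str]) -> float:
    """Return endpoints_working / endpoints_expected as a float in [0.0, 1.0].

    `results` maps an endpoint path to True (passed) or False (failed).
    An expected endpoint missing from `results` counts as a failure.
    """
    if not expected:
        # Assumption: a scenario with no expected endpoints scores 0.0
        # rather than raising ZeroDivisionError.
        return 0.0
    working = sum(1 for endpoint in expected if results.get(endpoint, False))
    return working / len(expected)
```

Counting only endpoints listed in `expected` means extra routes the model happens to add neither help nor hurt the score.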

### Implementation

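```python
from __future__ import annotations  # AgentResult is defined elsewhere in PawBench


class SandboxEvaluator:
    async def evaluate(self, agent_result: AgentResult, scenario: dict) -> float:
        """Return the fraction of expected endpoints that pass (0.0-1.0)."""
        # 1. Extract files from tool call arguments
        # 2. Write them to a temp dir
        # 3. docker run -d --rm -p 8080:8080 -v tempdir:/app python:3.12 python /app/server.py
        #    (-p publishes the port; the server must bind to 0.0.0.0, not 127.0.0.1)
        # 4. curl http://localhost:8080/api/products → check 200 + valid JSON
        # 5. Return pass_rate 0.0-1.0
        raise NotImplementedError  # skeleton only
```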
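The "write to temp dir" step could look roughly like this. It is a sketch under assumptions: the helper name `write_generated_files` is hypothetical, and it assumes the files have already been extracted from the tool call arguments into a mapping of relative path to file content (the actual tool call format is not specified here). Because the paths come from model output, the sketch rejects any path that would escape the target directory.

```python
from pathlib import Path


def write_generated_files(files: dict[str, str], root: Path) -> list[Path]:
    """Write model-generated files under `root`, refusing paths that escape it."""
    root = root.resolve()
    written: list[Path] = []
    for rel_path, content in files.items():
        target = (root / rel_path).resolve()
        # Reject "../" traversal and absolute paths: both resolve outside root.
        if not target.is_relative_to(root):
            raise ValueError(f"path escapes sandbox dir: {rel_path!r}")
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_text(content, encoding="utf-8")
        written.append(target)
    return written
```

Used with `tempfile.TemporaryDirectory()`, the directory (and everything written into it) is removed when the context manager exits, which keeps host-side cleanup separate from container cleanup.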

### Acceptance Criteria
- [ ] Backend endpoints return correct HTTP status codes
- [ ] JSON responses match expected schema (fields present, correct types)
- [ ] Frontend HTML is syntactically valid
- [ ] Scoring is binary per endpoint (pass/fail), aggregated as a pass rate (0.0-1.0, reportable as a percentage)
- [ ] Docker cleanup on completion (no leaked containers)
- [ ] `--no-sandbox` flag to skip (for quick runs)
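The "fields present, correct types" criterion could be checked with something like the following. This is a minimal sketch, not a full JSON Schema validator: the function name `matches_schema` and the schema shape (field name mapped to a Python type) are assumptions made for illustration.

```python
from typing import Any


def matches_schema(payload: Any, schema: dict[str, type]) -> bool:
    """Return True if every field in `schema` is present in `payload` with the given type."""
    if not isinstance(payload, dict):
        return False
    for field, expected_type in schema.items():
        if field not in payload:
            return False
        value = payload[field]
        # bool is a subclass of int in Python, so reject True/False
        # unless the schema explicitly asks for bool.
        if isinstance(value, bool) and expected_type is not bool:
            return False
        if not isinstance(value, expected_type):
            return False
    return True
```

Extra fields in the response are ignored here; whether they should fail the check is a design decision this issue leaves open.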

This is the #1 gap between PawBench and SWE-bench level maturity.
