Implement Verifier trait and standard implementations (SelfVerifying loop strategy) #44

@sbeardsley

Context

Verifier is referenced in LoopStrategy::SelfVerifying in the Harness (#3):

SelfVerifying {
  verifier: Box<dyn Verifier>,
  evaluator_harness: Arc<dyn Harness>,
}

The Verifier is distinct from CompletionCheck (#43). CompletionCheck answers "is the task done?" Verifier answers "is what was produced correct?" — it is the oracle that the SelfVerifying loop uses to decide whether the evaluator's verdict should halt the build loop or continue it.

Verifier is currently a stub in the harness module, tagged // SPEC: full trait lives in this issue. The SelfVerifying strategy returns HaltReason::StrategyNotYetImplemented until both this trait and #43 are implemented.

The SelfVerifying Loop Pattern

SelfVerifying loop:

  // Build phase — standard ReAct loop until agent claims done
  run_standard_loop(context) → build_result

  // Evaluate phase — separate evaluator harness
  // Read-only sandbox, fresh session, explicit evaluator role chunk
  // Default-FAIL contract: evaluator cannot be biased by watching the build
  eval_result = evaluator_harness.run(eval_task)

  // Verifier decides what to do with the evaluator's verdict
  match verifier.verify(VerifierInput { build_result, eval_result, workspace, iteration }):
    Passed             → HaltSuccess
    Failed { reason }  → inject reason into build context, continue build loop
                         (at most max_iterations build-evaluate cycles)

The Verifier sits between the evaluator harness output and the build loop decision. It translates the evaluator's RunResult into an actionable verdict.
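
The control flow above can be sketched in Rust as follows. This is a minimal synchronous sketch under stated assumptions, not the harness implementation: the build phase, the evaluator run, and the verifier are passed in as closures over plain strings, and LoopOutcome with its IterationsExhausted variant is a hypothetical name (the issue does not say which HaltReason applies when max_iterations is reached).

```rust
/// Verdict shape taken from the issue.
#[derive(Debug, Clone, PartialEq)]
pub enum Verdict {
    Passed,
    Failed { reason: String },
}

/// Hypothetical outcome type; the real loop would return a HaltReason.
#[derive(Debug, PartialEq)]
pub enum LoopOutcome {
    Success { cycles: u32 },
    IterationsExhausted { last_reason: String },
}

/// Runs build -> evaluate -> verify cycles until the verifier passes
/// or `max_iterations` cycles have run.
pub fn self_verifying_loop(
    max_iterations: u32,
    mut build: impl FnMut(&[String]) -> String,
    mut evaluate: impl FnMut(&str) -> String,
    verify: impl Fn(&str, &str, u32) -> Verdict,
) -> LoopOutcome {
    // Failure reasons accumulated so far; handed to every later build phase.
    let mut feedback: Vec<String> = Vec::new();
    let mut last_reason = String::new();

    for iteration in 1..=max_iterations {
        let build_output = build(&feedback);
        let eval_output = evaluate(&build_output);
        match verify(&build_output, &eval_output, iteration) {
            Verdict::Passed => return LoopOutcome::Success { cycles: iteration },
            Verdict::Failed { reason } => {
                feedback.push(reason.clone());
                last_reason = reason;
            }
        }
    }
    LoopOutcome::IterationsExhausted { last_reason }
}
```

Passing the accumulated reasons to the build closure stands in for "inject why into build context"; how the real harness injects them is not specified here.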

Trait Definition

VerifierVerdict {
  Passed,
  Failed { reason: String },   // injected into build context next turn
}

// Input to the verifier — what the build produced and what the evaluator said
VerifierInput {
  build_result: RunResult,
  eval_result: RunResult,
  workspace: PathBuf,
  iteration: u32,              // which build-evaluate cycle this is
}

trait Verifier {
  async fn verify(&self, input: &VerifierInput) -> VerifierVerdict

  // Maximum number of build-evaluate cycles before giving up.
  // Prevents infinite build loops when the evaluator always finds problems.
  fn max_iterations(&self) -> u32   // default: 3
}
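
One Rust-specific wrinkle: a trait that declares `async fn` cannot be used as `Box<dyn Verifier>`, which is how the SelfVerifying strategy and CompositeVerifier hold it. The sketch below shows one possible shape that stays usable behind `dyn`, by returning a boxed future instead. The RunResult stand-in, the VerdictFuture alias, AlwaysPass, and poll_once are all assumptions made for illustration; only the names Verifier, VerifierInput, and VerifierVerdict come from this issue.

```rust
use std::future::Future;
use std::path::PathBuf;
use std::pin::Pin;
use std::sync::Arc;
use std::task::{Context, Poll, Wake, Waker};

/// Stand-in for the harness RunResult; only Success { output } is mentioned in this issue.
#[derive(Debug, Clone, PartialEq)]
pub enum RunResult {
    Success { output: String },
}

#[derive(Debug, Clone, PartialEq)]
pub enum VerifierVerdict {
    Passed,
    Failed { reason: String },
}

pub struct VerifierInput {
    pub build_result: RunResult,
    pub eval_result: RunResult,
    pub workspace: PathBuf,
    pub iteration: u32,
}

/// Hypothetical alias: a boxed future keeps the trait usable as `dyn Verifier`.
pub type VerdictFuture<'a> = Pin<Box<dyn Future<Output = VerifierVerdict> + Send + 'a>>;

pub trait Verifier: Send + Sync {
    fn verify<'a>(&'a self, input: &'a VerifierInput) -> VerdictFuture<'a>;

    /// Maximum number of build-evaluate cycles; default taken from the issue.
    fn max_iterations(&self) -> u32 {
        3
    }
}

/// Hypothetical verifier used only to exercise the trait.
pub struct AlwaysPass;

impl Verifier for AlwaysPass {
    fn verify<'a>(&'a self, _input: &'a VerifierInput) -> VerdictFuture<'a> {
        // A real implementation would typically box an async block here.
        Box::pin(std::future::ready(VerifierVerdict::Passed))
    }
}

struct NoopWake;

impl Wake for NoopWake {
    fn wake(self: Arc<Self>) {}
}

/// Polls a verdict future once with a no-op waker. This is enough for futures
/// that never suspend; a real caller would await it on an async runtime.
pub fn poll_once<'a>(mut fut: VerdictFuture<'a>) -> Option<VerifierVerdict> {
    let waker = Waker::from(Arc::new(NoopWake));
    let mut cx = Context::from_waker(&waker);
    match fut.as_mut().poll(&mut cx) {
        Poll::Ready(verdict) => Some(verdict),
        Poll::Pending => None,
    }
}
```

The TypeScript, Python, and Go versions do not have this constraint, so the boxed-future shape is a Rust-only design choice.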

Standard Implementations

EvaluatorResponseVerifier

Parses the evaluator harness's RunResult::Success { output } for pass/fail signals. The simplest verifier — trusts the evaluator's final text response.

EvaluatorResponseVerifier {
  pass_pattern: String,    // regex: if output matches this, Passed
  fail_pattern: String,    // regex: if output matches this, extract reason
  max_iterations: u32,
}
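
A rough sketch of the parsing logic, assuming plain substring markers in place of the regex patterns so that it needs only the standard library (a real implementation would compile pass_pattern and fail_pattern as regexes). The field names pass_marker and fail_marker and the method verdict_for are hypothetical. Checking the fail marker first, and failing when neither marker appears, is one reading of the Default-FAIL contract mentioned earlier.

```rust
#[derive(Debug, Clone, PartialEq)]
pub enum VerifierVerdict {
    Passed,
    Failed { reason: String },
}

/// Simplified variant: substring markers instead of regex patterns.
pub struct EvaluatorResponseVerifier {
    pub pass_marker: String,
    pub fail_marker: String,
    pub max_iterations: u32,
}

impl EvaluatorResponseVerifier {
    /// Maps the evaluator's final text output to a verdict.
    pub fn verdict_for(&self, evaluator_output: &str) -> VerifierVerdict {
        // Fail marker wins if both markers are present.
        if let Some(pos) = evaluator_output.find(self.fail_marker.as_str()) {
            let rest = &evaluator_output[pos + self.fail_marker.len()..];
            // The reason is whatever follows the marker on the same line.
            let reason = rest.lines().next().unwrap_or("").trim();
            let reason = if reason.is_empty() {
                "evaluator reported failure without a reason"
            } else {
                reason
            };
            return VerifierVerdict::Failed { reason: reason.to_string() };
        }
        if evaluator_output.contains(self.pass_marker.as_str()) {
            return VerifierVerdict::Passed;
        }
        // Default-FAIL: no recognizable signal is treated as a failure.
        VerifierVerdict::Failed {
            reason: "evaluator output matched neither marker".to_string(),
        }
    }
}
```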

TestSuiteVerifier

Runs the test suite after the evaluator completes and uses the result as the verdict. Ignores the evaluator's text output — ground truth is the tests.

TestSuiteVerifier {
  command: String,
  working_dir: PathBuf,
  timeout: Duration,
  sandbox: Arc<dyn SandboxProvider>,
  max_iterations: u32,
}
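
The core of this verifier is "run a command, map its exit status to a verdict, give up after a timeout". The sketch below illustrates that with std::process only. It makes several assumptions: the command runs through `sh -c` directly on the host, the sandbox field is omitted because the SandboxProvider API is not defined in this issue, test output is discarded rather than captured into the failure reason, and the method name run_suite is hypothetical.

```rust
use std::path::PathBuf;
use std::process::{Command, Stdio};
use std::thread;
use std::time::{Duration, Instant};

#[derive(Debug, Clone, PartialEq)]
pub enum VerifierVerdict {
    Passed,
    Failed { reason: String },
}

/// Simplified variant without the sandbox field.
pub struct TestSuiteVerifier {
    pub command: String,
    pub working_dir: PathBuf,
    pub timeout: Duration,
    pub max_iterations: u32,
}

impl TestSuiteVerifier {
    /// Runs the test command and maps its exit status to a verdict.
    pub fn run_suite(&self) -> VerifierVerdict {
        let spawned = Command::new("sh")
            .arg("-c")
            .arg(&self.command)
            .current_dir(&self.working_dir)
            .stdout(Stdio::null())
            .stderr(Stdio::null())
            .spawn();
        let mut child = match spawned {
            Ok(child) => child,
            Err(err) => {
                return VerifierVerdict::Failed {
                    reason: format!("could not start test command: {}", err),
                }
            }
        };

        let deadline = Instant::now() + self.timeout;
        loop {
            match child.try_wait() {
                Ok(Some(status)) if status.success() => return VerifierVerdict::Passed,
                Ok(Some(status)) => {
                    return VerifierVerdict::Failed {
                        reason: format!("test command failed: {}", status),
                    }
                }
                Ok(None) if Instant::now() >= deadline => {
                    // Kill the shell and reap it so no zombie is left behind.
                    let _ = child.kill();
                    let _ = child.wait();
                    return VerifierVerdict::Failed {
                        reason: "test command timed out".to_string(),
                    };
                }
                // Still running: poll again shortly.
                Ok(None) => thread::sleep(Duration::from_millis(20)),
                Err(err) => {
                    return VerifierVerdict::Failed {
                        reason: format!("could not wait on test command: {}", err),
                    }
                }
            }
        }
    }
}
```

A failure reason that only carries the exit status gives the build loop little to act on, so a fuller version would probably capture the tail of the test output as well.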

CompositeVerifier

Passes only when all child verifiers pass.

CompositeVerifier {
  verifiers: Vec<Box<dyn Verifier>>,
  max_iterations: u32,
}
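
The "all children must pass" rule can be sketched as below. To stay short, this uses a simplified synchronous Verifier trait over the evaluator's text output rather than the full VerifierInput; Contains is a hypothetical child verifier. The sketch returns the first failure and stops, which is only one option: the issue does not say whether reasons from several failing children should be combined.

```rust
#[derive(Debug, Clone, PartialEq)]
pub enum VerifierVerdict {
    Passed,
    Failed { reason: String },
}

/// Simplified synchronous trait, for illustration only.
pub trait Verifier {
    fn verify(&self, evaluator_output: &str) -> VerifierVerdict;
}

pub struct CompositeVerifier {
    pub verifiers: Vec<Box<dyn Verifier>>,
    pub max_iterations: u32,
}

impl Verifier for CompositeVerifier {
    fn verify(&self, evaluator_output: &str) -> VerifierVerdict {
        // Children run in order; the first failure short-circuits the rest.
        for child in &self.verifiers {
            if let VerifierVerdict::Failed { reason } = child.verify(evaluator_output) {
                return VerifierVerdict::Failed { reason };
            }
        }
        // Note: an empty list of children passes vacuously.
        VerifierVerdict::Passed
    }
}

/// Hypothetical child: passes when the output contains the given text.
pub struct Contains(pub &'static str);

impl Verifier for Contains {
    fn verify(&self, evaluator_output: &str) -> VerifierVerdict {
        if evaluator_output.contains(self.0) {
            VerifierVerdict::Passed
        } else {
            VerifierVerdict::Failed {
                reason: format!("output does not contain {:?}", self.0),
            }
        }
    }
}
```

Short-circuiting avoids running an expensive child (such as a test suite) after a cheap one has already failed, so ordering children from cheapest to most expensive is a natural fit.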

Evaluator Harness Constraints

The evaluator_harness in SelfVerifying must be constructed with:

  • Read-only sandbox: SandboxProvider::read_only(workspace). No write or edit tools.
  • Fresh session: always a new SessionId, never shared with the build harness.
  • Evaluator role chunk: "role-evaluator" from PromptChunkRegistry. This chunk must be registered in the standard chunk library before SelfVerifying is usable.
  • Mode::AlwaysAsk: the evaluator never acts, only reports.

SubagentTool::new() already enforces no nested subagents. The evaluator harness is a peer harness, not a subagent — it is constructed directly by the caller and injected, not spawned by the build harness.
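
Because the evaluator harness is injected by the caller, these four constraints could be checked once at construction time. The sketch below shows one way to do that. EvaluatorSetup, check_evaluator_setup, and the Mode::Other placeholder are hypothetical names for illustration; the real Harness, SessionId, and Mode types live in #3 and #24.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum Mode {
    AlwaysAsk,
    /// Placeholder for any other mode; the real variants are defined in #24.
    Other,
}

/// Hypothetical summary of how an evaluator harness was configured.
pub struct EvaluatorSetup {
    pub sandbox_read_only: bool,
    pub session_id: String,
    pub role_chunk: String,
    pub mode: Mode,
}

/// Returns every violated constraint, or Ok(()) when the setup is acceptable.
pub fn check_evaluator_setup(
    build_session_id: &str,
    eval: &EvaluatorSetup,
) -> Result<(), Vec<String>> {
    let mut violations = Vec::new();
    if !eval.sandbox_read_only {
        violations.push("sandbox must be read-only".to_string());
    }
    if eval.session_id == build_session_id {
        violations.push("evaluator must not share the build session".to_string());
    }
    if eval.role_chunk != "role-evaluator" {
        violations.push("role chunk must be \"role-evaluator\"".to_string());
    }
    if eval.mode != Mode::AlwaysAsk {
        violations.push("mode must be AlwaysAsk".to_string());
    }
    if violations.is_empty() {
        Ok(())
    } else {
        Err(violations)
    }
}
```

Reporting all violations at once, rather than stopping at the first, makes a misconfigured evaluator easier to fix in a single pass.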

Checklist

  • Rust: Verifier trait + VerifierVerdict + VerifierInput + all standard implementations
  • TypeScript: same
  • Python: same
  • Go: same
  • Unit tests: each implementation returns correct verdict for pass and fail cases
  • max_iterations enforcement tested — loop halts after N cycles even without Passed verdict
  • Harness (#3, Implement Harness runtime loop): SelfVerifying stub replaced with real execution using Verifier
  • "role-evaluator" chunk registered in standard chunk library (#24, Implement PromptChunkRegistry and Mode system)
  • Fixture: fixtures/verifier/evaluator_pass.jsonl, fixtures/verifier/evaluator_fail.jsonl

Related Issues
