Implement CompletionCheck trait and standard implementations #43

@sbeardsley

Context

CompletionCheck is referenced in two loop strategies in the Harness (#3):

  • LoopStrategy::Ralph { completion_check: Box<dyn CompletionCheck> } — the external check that determines when a Ralph continuation loop is actually done
  • LoopStrategy::SelfVerifying { verifier, evaluator_harness } — uses the termination policy's CompletionCheck to determine when the build phase should stop and hand off to the evaluator

CompletionCheck is currently a stub in the harness module, tagged // SPEC: full trait lives in this issue. Until the trait is implemented, the Ralph and SelfVerifying strategies both return HaltReason::StrategyNotYetImplemented.

Trait Definition

// Returns None if the task is complete, Some(reason) if it is not done yet.
// The reason is injected into the next turn's context and tells the agent
// what it still needs to do. This is what prevents premature victory.
//
// Note: an async fn in a trait is not dyn-compatible. Because the harness
// stores Box<dyn CompletionCheck>, the real definition needs a boxed-future
// return type or a helper such as the async-trait crate.
trait CompletionCheck {
  async fn check(&self, state: &SessionStateSnapshot) -> Option<String>;

  // Human-readable description of what this check evaluates.
  // Injected into agent context at session start so it understands
  // what "done" means for this task.
  fn description(&self) -> String;
}
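Since LoopStrategy::Ralph holds the check as Box<dyn CompletionCheck>, one way to keep the async method usable behind dyn is to return a boxed future. The following is a minimal sketch using only the standard library, not the final definition. The last_output field on SessionStateSnapshot, the MarkerCheck type, and the block_on helper are all hypothetical and exist only to make the example self-contained.

```rust
use std::future::Future;
use std::pin::Pin;
use std::sync::Arc;
use std::task::{Context, Poll, Wake, Waker};

// Hypothetical stand-in; the real snapshot type lives in the harness.
pub struct SessionStateSnapshot {
    pub last_output: String,
}

// Boxed future so the trait stays usable as Box<dyn CompletionCheck>.
pub type CheckFuture<'a> = Pin<Box<dyn Future<Output = Option<String>> + Send + 'a>>;

pub trait CompletionCheck: Send + Sync {
    // None means complete; Some(reason) is injected into the next turn.
    fn check<'a>(&'a self, state: &'a SessionStateSnapshot) -> CheckFuture<'a>;
    // Human-readable description of what "done" means.
    fn description(&self) -> String;
}

// Hypothetical example check: done once the output contains a marker string.
pub struct MarkerCheck {
    pub marker: String,
}

impl CompletionCheck for MarkerCheck {
    fn check<'a>(&'a self, state: &'a SessionStateSnapshot) -> CheckFuture<'a> {
        Box::pin(async move {
            if state.last_output.contains(&self.marker) {
                None
            } else {
                Some(format!("output does not yet contain {:?}", self.marker))
            }
        })
    }

    fn description(&self) -> String {
        format!("output must contain {:?}", self.marker)
    }
}

// Waker that does nothing; enough to drive futures that never truly suspend.
struct NoopWake;

impl Wake for NoopWake {
    fn wake(self: Arc<Self>) {}
}

// Tiny busy-polling executor for the example only. A real harness would
// use its async runtime instead.
pub fn block_on<F: Future>(fut: F) -> F::Output {
    let waker = Waker::from(Arc::new(NoopWake));
    let mut cx = Context::from_waker(&waker);
    let mut fut = Box::pin(fut);
    loop {
        if let Poll::Ready(out) = fut.as_mut().poll(&mut cx) {
            return out;
        }
    }
}
```

With this shape, a Box<dyn CompletionCheck> can be stored in the strategy and awaited by the harness. The cost is one heap allocation per check call, which is likely negligible next to a model turn.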

Standard Implementations

FeatureListCheck

Reads feature_list.json from the workspace. Returns Some with the list of incomplete features if any have passes: false. Returns None when all features pass.

struct FeatureListCheck {
  path: PathBuf,   // default: "feature_list.json"
}
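The core decision in FeatureListCheck could look roughly like the sketch below. It assumes the file has already been read and parsed into memory (JSON parsing is omitted to keep the example dependency-free). Only the passes flag comes from this issue. The name field, the Feature struct, and the feature_list_reason function are hypothetical.

```rust
// Hypothetical in-memory form of one feature_list.json entry. Only the
// `passes` flag is named in the issue; `name` is assumed for illustration.
#[derive(Debug, Clone)]
pub struct Feature {
    pub name: String,
    pub passes: bool,
}

// Returns None when every feature passes, otherwise a reason that lists
// the features still marked passes: false.
pub fn feature_list_reason(features: &[Feature]) -> Option<String> {
    let incomplete: Vec<&str> = features
        .iter()
        .filter(|f| !f.passes)
        .map(|f| f.name.as_str())
        .collect();

    if incomplete.is_empty() {
        None
    } else {
        Some(format!(
            "{} feature(s) still failing: {}",
            incomplete.len(),
            incomplete.join(", ")
        ))
    }
}
```

Note that this sketch treats an empty feature list as complete. Whether a missing or empty file should instead count as "not done" is a decision this issue leaves open.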

TestSuiteCheck

Runs the test suite and returns Some(failure_summary) if any tests fail. Returns None when the full suite passes.

struct TestSuiteCheck {
  command: String,          // e.g. "npm test", "cargo test", "pytest"
  working_dir: PathBuf,
  timeout: Duration,
  sandbox: Arc<dyn SandboxProvider>,
}
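As a rough illustration of the command, working_dir, and timeout fields, the sketch below runs the command through sh -c on the host and enforces the timeout by polling. It deliberately ignores the sandbox field, because the SandboxProvider interface is not defined in this issue, and it discards test output rather than building a real failure summary. The function name run_test_command is hypothetical, and the code assumes a Unix-like system with sh available.

```rust
use std::path::Path;
use std::process::{Command, Stdio};
use std::thread;
use std::time::{Duration, Instant};

// Returns None when the command exits successfully within the timeout,
// otherwise Some(reason). Output is discarded in this sketch so that an
// unread pipe can never block the child process.
pub fn run_test_command(command: &str, working_dir: &Path, timeout: Duration) -> Option<String> {
    let mut child = match Command::new("sh")
        .arg("-c")
        .arg(command)
        .current_dir(working_dir)
        .stdout(Stdio::null())
        .stderr(Stdio::null())
        .spawn()
    {
        Ok(child) => child,
        Err(err) => return Some(format!("could not start `{command}`: {err}")),
    };

    let deadline = Instant::now() + timeout;
    loop {
        match child.try_wait() {
            Ok(Some(status)) if status.success() => return None,
            Ok(Some(status)) => return Some(format!("`{command}` failed with {status}")),
            Ok(None) if Instant::now() >= deadline => {
                // Timed out: stop the process and reap it before reporting.
                let _ = child.kill();
                let _ = child.wait();
                return Some(format!("`{command}` timed out after {timeout:?}"));
            }
            Ok(None) => thread::sleep(Duration::from_millis(20)),
            Err(err) => return Some(format!("could not wait for `{command}`: {err}")),
        }
    }
}
```

A real implementation would presumably route execution through the sandbox and capture enough output to tell the agent which tests failed.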

QuestionAnsweredCheck

LLM-as-judge: evaluates whether the agent's final response actually answered the user's question. Used for RAG and conversational agents.

struct QuestionAnsweredCheck {
  judge_model: ModelConfig,
  original_question: String,
  rubric: Option<String>,    // custom evaluation criteria
}
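The two deterministic pieces of an LLM-as-judge check are building the judge prompt and mapping the judge's reply back to Option<String>. The sketch below shows one possible shape for both. The ANSWERED / NOT_ANSWERED: reply format, the default criteria text, and both function names are assumptions made for illustration. The call to the judge model itself is omitted.

```rust
// Builds the prompt sent to the judge model. The reply format requested
// here is an assumption of this sketch, not something the issue specifies.
pub fn build_judge_prompt(question: &str, response: &str, rubric: Option<&str>) -> String {
    let criteria = rubric.unwrap_or("The response directly and completely answers the question.");
    format!(
        "Question:\n{question}\n\nResponse:\n{response}\n\nCriteria:\n{criteria}\n\n\
         Reply with ANSWERED, or NOT_ANSWERED: <what is missing>."
    )
}

// Maps the judge's reply to the CompletionCheck result: None when the
// question was answered, Some(reason) otherwise.
pub fn parse_verdict(judge_reply: &str) -> Option<String> {
    let reply = judge_reply.trim();
    if reply == "ANSWERED" {
        return None;
    }
    match reply.strip_prefix("NOT_ANSWERED:") {
        Some(reason) if !reason.trim().is_empty() => Some(reason.trim().to_string()),
        Some(_) => Some("judge gave no reason".to_string()),
        // Fail closed: an unparseable reply counts as "not done".
        None => Some(format!("unparseable judge reply: {reply}")),
    }
}
```

Failing closed on an unparseable reply keeps a confused judge from ending the session early. The budget limits in TerminationPolicy still bound how long that can continue.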

SqlResultCheck

Validates that the SQL result set is non-empty and structurally correct (column names match expectation). Used for NL-to-SQL agents.

struct SqlResultCheck {
  expected_columns: Option<Vec<String>>,
  min_rows: Option<usize>,
}
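The validation described above can be sketched as a pure function over the column names and row count of a result set. How the result set reaches the check is not specified in this issue, so the evaluate method and its parameters are hypothetical. The sketch also assumes that column order matters and that min_rows defaults to 1, to match "non-empty".

```rust
pub struct SqlResultCheck {
    pub expected_columns: Option<Vec<String>>,
    pub min_rows: Option<usize>,
}

impl SqlResultCheck {
    // Hypothetical helper: returns None when the result set is acceptable,
    // otherwise Some(reason) describing the first mismatch found.
    pub fn evaluate(&self, columns: &[String], row_count: usize) -> Option<String> {
        // Column comparison is order-sensitive in this sketch.
        if let Some(expected) = &self.expected_columns {
            if expected.as_slice() != columns {
                return Some(format!("expected columns {:?}, got {:?}", expected, columns));
            }
        }

        // Assumed default: at least one row, i.e. "non-empty".
        let min_rows = self.min_rows.unwrap_or(1);
        if row_count < min_rows {
            return Some(format!("expected at least {min_rows} row(s), got {row_count}"));
        }

        None
    }
}
```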

AlwaysComplete

Returns None immediately — task is always considered done when the model claims it is. Used for simple single-turn tasks where the model's self-assessment is sufficient.

struct AlwaysComplete;

Relationship to TerminationPolicy

CompletionCheck is injected into TerminationPolicy and called only when agent_claims_done: true. The TerminationPolicy evaluates budget limits first (unconditionally), then sensor results, then calls CompletionCheck. This is the mechanism that prevents premature victory — the agent claims done, the check says "not yet, here's what's missing", and the harness injects that reason into the next turn.
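The evaluation order described above can be sketched as a single function. This is an illustration under stated assumptions, not the TerminationPolicy API. The Decision enum, the evaluate function, and its parameters are hypothetical, and the sketch assumes a sensor result can request a halt. The completion check is passed as a closure so that it only runs when the agent claims to be done.

```rust
#[derive(Debug, PartialEq)]
pub enum Decision {
    // Stop the session for an external reason (budget, sensor).
    Halt(String),
    // Run another turn, optionally injecting the check's reason.
    Continue { inject: Option<String> },
    // The agent claimed done and the check agreed.
    Complete,
}

pub fn evaluate(
    budget_exhausted: bool,
    sensor_halt: Option<String>,
    agent_claims_done: bool,
    completion_check: impl FnOnce() -> Option<String>,
) -> Decision {
    // 1. Budget limits are evaluated first, unconditionally.
    if budget_exhausted {
        return Decision::Halt("budget exhausted".to_string());
    }
    // 2. Then sensor results.
    if let Some(reason) = sensor_halt {
        return Decision::Halt(reason);
    }
    // 3. The completion check runs only when the agent claims to be done.
    if !agent_claims_done {
        return Decision::Continue { inject: None };
    }
    match completion_check() {
        None => Decision::Complete,
        Some(reason) => Decision::Continue { inject: Some(reason) },
    }
}
```

For example, a claim of completion with two features still failing yields Continue with the reason attached, and the harness would inject that reason into the next turn.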

Checklist

  • Rust: CompletionCheck trait + all standard implementations
  • TypeScript: CompletionCheck trait + all standard implementations
  • Python: CompletionCheck trait + all standard implementations
  • Go: CompletionCheck trait + all standard implementations
  • Unit tests: each implementation returns correct None / Some(reason) for its domain
  • Harness (Implement Harness runtime loop #3) stubs replaced with real CompletionCheck in Ralph and TerminationPolicy
  • TerminationPolicy (Implement TerminationPolicy #13) updated to call CompletionCheck correctly
  • Fixture: fixtures/completion_checks/feature_list_complete.jsonl
