Skip to content
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
4 changes: 2 additions & 2 deletions automation/judge/kg/judge-verdicts.md
Original file line number Diff line number Diff line change
@@ -1,5 +1,5 @@
# verdict

Judge verdict classification for task output quality.
Judge verdict classification for task output quality (v2 5-level ladder).

synonyms:: accept, improve, reject, escalate, accepted, rejected, pass, fail, needs improvement, quality gate, quality check
synonyms:: BLOCK, FIX_FIRST, SAFE_TO_COMMIT, SAFE_TO_DEPLOY_AFTER_RUNTIME_CHECK, INSUFFICIENT_EVIDENCE, accept, improve, reject, escalate, accepted, rejected, pass, fail, needs improvement, quality gate, quality check, blocked, fix first, safe to commit, safe to deploy, insufficient evidence
2 changes: 1 addition & 1 deletion automation/judge/pre-push-judge.sh
Original file line number Diff line number Diff line change
Expand Up @@ -66,7 +66,7 @@ case $EXIT_CODE in
echo "[pre-push-judge] PASSED -- push allowed"
;;
1)
echo "[pre-push-judge] FAILED -- push blocked"
echo "[pre-push-judge] BLOCKED or FIX FIRST -- push blocked"
echo "Fix the issues and try again, or override with: git push --no-verify"
;;
2)
Expand Down
84 changes: 55 additions & 29 deletions automation/judge/run-judge.sh
Original file line number Diff line number Diff line change
Expand Up @@ -133,17 +133,19 @@ Score on three dimensions (1-5 each):
- pragmatic: Does it enable intended actions? Actionable, useful, addresses the task goal.
- syntactic: Is it internally consistent and well-structured? Format compliance, completeness, no broken references.

Verdict rules:
- "accept" if all scores >= 3 AND average >= 3.5
- "improve" if any score < 3 but all >= 2
- "reject" if any score < 2
Verdict rules (5-level ladder):
- "SAFE_TO_COMMIT" if all scores >= 3 AND average >= 3.5 AND no runtime verification needed
- "SAFE_TO_DEPLOY_AFTER_RUNTIME_CHECK" if all scores >= 3 but deployment needs restart, smoke test, or migration check
- "FIX_FIRST" if any score < 3 but all >= 2
- "BLOCK" if any score < 2 or serious correctness/data/security/deploy risk
- "INSUFFICIENT_EVIDENCE" if you lack task context, diff, tests, or runtime evidence to make a reliable call

Respond with ONLY this JSON (no other text):
{
"task_id": "${TASK_ID}",
"model": "${model}",
"mode": "quick",
"verdict": "<accept|improve|reject>",
"verdict": "<BLOCK|FIX_FIRST|SAFE_TO_COMMIT|SAFE_TO_DEPLOY_AFTER_RUNTIME_CHECK|INSUFFICIENT_EVIDENCE>",
"scores": {
"semantic": "<1-5>",
"pragmatic": "<1-5>",
Expand Down Expand Up @@ -177,10 +179,12 @@ Score each dimension 1-5:
2. pragmatic: Does it enable intended actions?
3. syntactic: Is it internally consistent and well-structured?

Verdict rules:
- "accept" if all scores >= 3 AND average >= 3.5
- "improve" if any score < 3 but all >= 2
- "reject" if any score < 2
Verdict rules (5-level ladder):
- "SAFE_TO_COMMIT" if all scores >= 3 AND average >= 3.5 AND no runtime verification needed
- "SAFE_TO_DEPLOY_AFTER_RUNTIME_CHECK" if all scores >= 3 but deployment needs restart, smoke test, or migration check
- "FIX_FIRST" if any score < 3 but all >= 2
- "BLOCK" if any score < 2 or serious correctness/data/security/deploy risk
- "INSUFFICIENT_EVIDENCE" if you lack task context, diff, tests, or runtime evidence to make a reliable call

For each improvement, specify: what to fix, where it is, and why it matters.

Expand All @@ -189,7 +193,7 @@ Respond with ONLY this JSON (no other text):
"task_id": "${TASK_ID}",
"model": "${model}",
"mode": "${mode}",
"verdict": "<accept|improve|reject>",
"verdict": "<BLOCK|FIX_FIRST|SAFE_TO_COMMIT|SAFE_TO_DEPLOY_AFTER_RUNTIME_CHECK|INSUFFICIENT_EVIDENCE>",
"scores": {
"semantic": "<1-5>",
"pragmatic": "<1-5>",
Expand Down Expand Up @@ -320,7 +324,7 @@ try:
if 'verdict' not in v:
print('Missing: verdict', file=sys.stderr)
sys.exit(1)
if v['verdict'] not in ('accept', 'improve', 'reject', 'escalate'):
if v['verdict'] not in ('BLOCK', 'FIX_FIRST', 'SAFE_TO_COMMIT', 'SAFE_TO_DEPLOY_AFTER_RUNTIME_CHECK', 'INSUFFICIENT_EVIDENCE'):
print(f'Invalid verdict: {v[\"verdict\"]}', file=sys.stderr)
sys.exit(1)
# Normalize scores: handle flat format
Expand Down Expand Up @@ -490,19 +494,28 @@ QUICK_AVG="$ROUND_RESULT_AVG"

log_verdict "$QUICK_VERDICT_JSON" "$ROUND" "quick" "$PREVIOUS_ROUNDS" "null"

if [[ "$QUICK_VERDICT" == "accept" ]]; then
echo "RESULT: ACCEPTED (quick judge, round ${ROUND})"
if [[ "$QUICK_VERDICT" == "SAFE_TO_COMMIT" ]]; then
echo "RESULT: SAFE TO COMMIT (quick judge, round ${ROUND})"
exit 0
fi

if [[ "$QUICK_VERDICT" == "reject" ]]; then
echo "RESULT: REJECTED (quick judge, round ${ROUND})"
if [[ "$QUICK_VERDICT" == "BLOCK" ]]; then
echo "RESULT: BLOCKED (quick judge, round ${ROUND})"
"${SCRIPT_DIR}/handle-disagreement.sh" -t "$TASK_ID" -T "$TASK_DESCRIPTION" -f "${FILES[*]}" -r "persistent-reject" || true
exit 1
fi

if [[ "$QUICK_VERDICT" == "INSUFFICIENT_EVIDENCE" ]]; then
echo "RESULT: INSUFFICIENT EVIDENCE (quick judge, round ${ROUND}) -- proceeding to deep judge"
fi

if [[ "$QUICK_VERDICT" == "SAFE_TO_DEPLOY_AFTER_RUNTIME_CHECK" ]]; then
echo "RESULT: SAFE TO COMMIT but deploy needs runtime check (quick judge, round ${ROUND})"
exit 0
fi

# ============================
# Round 2: Deep judge (quick returned "improve")
# Round 2: Deep judge (quick returned FIX_FIRST or INSUFFICIENT_EVIDENCE)
# ============================
PREVIOUS_ROUNDS=$(python3 -c "
import json
Expand All @@ -526,22 +539,29 @@ log_verdict "$DEEP_VERDICT_JSON" "$ROUND" "deep" "$PREVIOUS_ROUNDS" "null"

# Check for disagreement requiring tiebreaker
NEEDS_TIEBREAKER=false
if [[ "$QUICK_VERDICT" == "accept" && "$DEEP_VERDICT" == "reject" ]] || \
[[ "$QUICK_VERDICT" == "reject" && "$DEEP_VERDICT" == "accept" ]]; then
if [[ "$QUICK_VERDICT" == "SAFE_TO_COMMIT" && "$DEEP_VERDICT" == "BLOCK" ]] || \
[[ "$QUICK_VERDICT" == "BLOCK" && "$DEEP_VERDICT" == "SAFE_TO_COMMIT" ]]; then
NEEDS_TIEBREAKER=true
fi

if [[ "$NEEDS_TIEBREAKER" == "false" ]]; then
if [[ "$DEEP_VERDICT" == "accept" ]]; then
echo "RESULT: ACCEPTED (deep judge, round ${ROUND})"
if [[ "$DEEP_VERDICT" == "SAFE_TO_COMMIT" ]]; then
echo "RESULT: SAFE TO COMMIT (deep judge, round ${ROUND})"
exit 0
elif [[ "$DEEP_VERDICT" == "reject" ]]; then
echo "RESULT: REJECTED (deep judge, round ${ROUND})"
elif [[ "$DEEP_VERDICT" == "SAFE_TO_DEPLOY_AFTER_RUNTIME_CHECK" ]]; then
echo "RESULT: SAFE TO COMMIT but deploy needs runtime check (deep judge, round ${ROUND})"
exit 0
elif [[ "$DEEP_VERDICT" == "BLOCK" ]]; then
echo "RESULT: BLOCKED (deep judge, round ${ROUND})"
"${SCRIPT_DIR}/handle-disagreement.sh" -t "$TASK_ID" -T "$TASK_DESCRIPTION" -f "${FILES[*]}" -r "persistent-reject" || true
exit 1
elif [[ "$DEEP_VERDICT" == "INSUFFICIENT_EVIDENCE" ]]; then
echo "RESULT: INSUFFICIENT EVIDENCE -- human fallback needed"
"${SCRIPT_DIR}/handle-disagreement.sh" -t "$TASK_ID" -T "$TASK_DESCRIPTION" -f "${FILES[*]}" -r "disagreement" || true
exit 2
else
# Both returned "improve" -- human fallback
echo "RESULT: Human fallback needed (both judges returned 'improve')"
# FIX_FIRST -- human fallback
echo "RESULT: FIX FIRST -- human fallback needed (both judges returned improvement needed)"
"${SCRIPT_DIR}/handle-disagreement.sh" -t "$TASK_ID" -T "$TASK_DESCRIPTION" -f "${FILES[*]}" -r "disagreement" || true
exit 2
fi
Expand Down Expand Up @@ -582,15 +602,21 @@ fi

log_verdict "$ROUND_RESULT_JSON" "$ROUND" "tiebreaker" "$PREVIOUS_ROUNDS" "$CONSENSUS"

if [[ "$TIEBREAKER_VERDICT" == "accept" ]]; then
echo "RESULT: ACCEPTED (tiebreaker, round ${ROUND}, consensus: ${CONSENSUS})"
if [[ "$TIEBREAKER_VERDICT" == "SAFE_TO_COMMIT" ]]; then
echo "RESULT: SAFE TO COMMIT (tiebreaker, round ${ROUND}, consensus: ${CONSENSUS})"
exit 0
elif [[ "$TIEBREAKER_VERDICT" == "reject" ]]; then
echo "RESULT: REJECTED (tiebreaker, round ${ROUND}, consensus: ${CONSENSUS})"
elif [[ "$TIEBREAKER_VERDICT" == "SAFE_TO_DEPLOY_AFTER_RUNTIME_CHECK" ]]; then
echo "RESULT: SAFE TO COMMIT but deploy needs runtime check (tiebreaker, round ${ROUND}, consensus: ${CONSENSUS})"
exit 0
elif [[ "$TIEBREAKER_VERDICT" == "BLOCK" ]]; then
echo "RESULT: BLOCKED (tiebreaker, round ${ROUND}, consensus: ${CONSENSUS})"
"${SCRIPT_DIR}/handle-disagreement.sh" -t "$TASK_ID" -T "$TASK_DESCRIPTION" -f "${FILES[*]}" -r "persistent-reject" || true
exit 1
elif [[ "$TIEBREAKER_VERDICT" == "INSUFFICIENT_EVIDENCE" ]]; then
echo "RESULT: INSUFFICIENT EVIDENCE (tiebreaker, round ${ROUND}, consensus: ${CONSENSUS})"
exit 0
else
echo "RESULT: Human fallback needed (tiebreaker returned '${TIEBREAKER_VERDICT}')"
echo "RESULT: FIX FIRST -- human fallback needed (tiebreaker returned '${TIEBREAKER_VERDICT}')"
"${SCRIPT_DIR}/handle-disagreement.sh" -t "$TASK_ID" -T "$TASK_DESCRIPTION" -f "${FILES[*]}" -r "disagreement" || true
exit 2
fi
10 changes: 5 additions & 5 deletions automation/judge/verdict-schema.json
Original file line number Diff line number Diff line change
@@ -1,8 +1,8 @@
{
"$schema": "https://json-schema.org/draft/2020-12/schema",
"$id": "https://terraphim.ai/schemas/verdict-v1.json",
"title": "Judge Verdict",
"description": "Schema for judge verdict records produced by the multi-model judge skill",
"$id": "https://terraphim.ai/schemas/verdict-v2.json",
"title": "Judge Verdict (v2)",
"description": "Schema for judge verdict records with 5-level verdict ladder (BLOCK, FIX FIRST, SAFE TO COMMIT, SAFE TO DEPLOY AFTER RUNTIME CHECK, INSUFFICIENT EVIDENCE)",
"type": "object",
"required": [
"task_id",
Expand Down Expand Up @@ -36,8 +36,8 @@
},
"verdict": {
"type": "string",
"description": "Final verdict for the task output",
"enum": ["accept", "improve", "reject", "escalate"]
"description": "Final verdict for the task output (5-level ladder)",
"enum": ["BLOCK", "FIX_FIRST", "SAFE_TO_COMMIT", "SAFE_TO_DEPLOY_AFTER_RUNTIME_CHECK", "INSUFFICIENT_EVIDENCE"]
},
"scores": {
"type": "object",
Expand Down
11 changes: 11 additions & 0 deletions skills/disciplined-design/SKILL.md
Original file line number Diff line number Diff line change
Expand Up @@ -17,6 +17,17 @@ You are a design specialist executing Phase 2 of disciplined development. Your r
4. **Human Approval**: No implementation without sign-off
5. **Eliminate Before Adding**: Design is primarily about removal

## Skip This Skill When...

Do NOT invoke this skill when the change is trivial or well-understood:
- Fewer than 10 lines changed, with no architectural impact
- Adding a single field to a struct or API response
- Replicating an existing pattern (same structure, different data)
- Config flag toggle with an obvious test
- A bug fix where the design is already clear from the research phase

In these cases, go straight to implementation with a minimal plan. Full design phase for trivial changes produces 3,000+ lines of artefacts with no quality gain (see Eberhardt/Scott Logic experiment, Nov 2025).

## Essentialism: ELIMINATE Phase

This phase embodies McKeown's ELIMINATE principle. Design is about choosing what NOT to do.
Expand Down
11 changes: 11 additions & 0 deletions skills/disciplined-implementation/SKILL.md
Original file line number Diff line number Diff line change
Expand Up @@ -9,6 +9,17 @@ license: Apache-2.0

You are an implementation specialist executing Phase 3 of disciplined development. Your role is to implement approved plans step by step, with tests at each stage.

## Skip This Skill When...

Do NOT invoke this skill when the change is trivial or well-understood:
- Fewer than 10 lines changed, following an existing pattern exactly
- Typo, rename, import-fix, or formatting-only change
- Dependency version bump with no API changes
- Config flag toggle with an obvious test
- A one-line fix where the test is equally obvious

In these cases, implement directly with a single commit. The step-by-step plan, per-step tests, and review protocol for trivial changes add overhead with no quality gain.

## Core Principles

1. **Follow the Plan**: Execute the approved design exactly
Expand Down
10 changes: 10 additions & 0 deletions skills/disciplined-quality-evaluation/SKILL.md
Original file line number Diff line number Diff line change
Expand Up @@ -19,6 +19,16 @@ You evaluate Research Documents (Phase 1) and Implementation Plans (Phase 2) usi
3. **Actionable feedback**: Every low score includes specific fix
4. **Essentialism check**: Vital few focus enforced

## Skip This Skill When...

Do NOT invoke this skill when the document under review is trivial:
- Research doc for a change under 10 lines with no architectural impact
- Design doc that replicates an existing pattern
- Spec doc with no ambiguous requirements or edge cases
- Implementation plan with a single obvious step

In these cases, a quick read-through is sufficient. Full KLS scoring for trivial documents adds overhead without improving outcomes.

## When to Use This Skill

- After Phase 1 (Research) before Phase 2 (Design)
Expand Down
11 changes: 11 additions & 0 deletions skills/disciplined-research/SKILL.md
Original file line number Diff line number Diff line change
Expand Up @@ -17,6 +17,17 @@ You are a research specialist executing Phase 1 of disciplined development. Your
4. **Document Everything**: Enable informed decisions
5. **Focus on Vital Few**: Identify what's essential, eliminate the rest

## Skip This Skill When...

Do NOT invoke this skill when the change is trivial or well-understood:
- Fewer than 10 lines changed, with no behavioural impact
- Typo, rename, import-fix, or formatting-only change
- Dependency version bump with no API changes
- Config flag toggle with an obvious test
- A bug fix where the root cause is already identified and the fix is obvious

In these cases, go straight to implementation. Research phase bloat for trivial changes wastes time with no quality gain (see Eberhardt/Scott Logic experiment, Nov 2025).

## Essentialism: EXPLORE Phase

This phase embodies McKeown's EXPLORE principle. Before diving into research, validate that this work is essential.
Expand Down
11 changes: 11 additions & 0 deletions skills/disciplined-specification/SKILL.md
Original file line number Diff line number Diff line change
Expand Up @@ -17,6 +17,17 @@ You are a specification interviewer executing Phase 2.5 of disciplined developme
3. **Think Adversarially**: What could go wrong? What could be exploited?
4. **Consider Evolution**: How will this need to change?

## Skip This Skill When...

Do NOT invoke this skill when the change is trivial or well-understood:
- Fewer than 10 lines changed, with no edge case implications
- Typo, rename, import-fix, or formatting-only change
- Adding a field that mirrors an existing pattern
- Config flag toggle with no security or migration concerns
- A bug fix with no ambiguous requirements

In these cases, proceed directly to implementation. Specification interviews for trivial changes surface no new information and add cycle time with no quality gain.

## Prerequisites

Phase 2.5 requires:
Expand Down
11 changes: 11 additions & 0 deletions skills/disciplined-validation/SKILL.md
Original file line number Diff line number Diff line change
Expand Up @@ -18,6 +18,17 @@ You are a validation specialist executing Phase 5 of disciplined development. Yo
4. **Defects Loop Back**: Failures return to research or design phase
5. **Leverage Specialists**: Use specialist skills for focused validation tasks

## Skip This Skill When...

Do NOT invoke this skill when the change is trivial or well-understood:
- Fewer than 10 lines changed, with no user-visible impact
- Typo, rename, import-fix, or formatting-only change
- Internal refactor with no API or behaviour change
- Config flag toggle with no security or migration concerns
- A bug fix already validated by the verification phase tests

In these cases, consider the verification phase tests sufficient. Full UAT, stakeholder interviews, and system testing for trivial changes add cycle time with no quality gain.

## Integration with Specialist Skills

This skill orchestrates validation by leveraging specialist skills:
Expand Down
11 changes: 11 additions & 0 deletions skills/disciplined-verification/SKILL.md
Original file line number Diff line number Diff line change
Expand Up @@ -9,6 +9,17 @@ license: Apache-2.0

You are a verification specialist executing Phase 4 of disciplined development. Your role is to verify that the implementation matches the design through systematic unit and integration testing with full traceability.

## Skip This Skill When...

Do NOT invoke this skill when the change is trivial or well-understood:
- Fewer than 10 lines changed, with no behavioural impact
- Typo, rename, import-fix, or formatting-only change
- Adding a test that mirrors an existing test pattern
- Config flag toggle with an obvious smoke test
- A bug fix where the test is a straightforward regression check

In these cases, write the regression test directly. Full traceability matrix, coverage analysis, and UBS scan for trivial changes add cycle time with no quality gain.

## Core Principles

1. **Trace to Design**: Every test maps to a design element or spec finding
Expand Down
27 changes: 20 additions & 7 deletions skills/judge/SKILL.md
Original file line number Diff line number Diff line change
Expand Up @@ -41,13 +41,26 @@ framework defined in `disciplined-quality-evaluation`:
| 4 | Good -- clear, useful, few issues |
| 5 | Excellent -- exemplary, no issues |

### Verdict Thresholds

| Condition | Verdict |
|-----------|---------|
| All dimensions >= 3 AND average >= 3.5 | **accept** |
| Any dimension < 3 OR average < 3.5, but all >= 2 | **improve** |
| Any dimension < 2 | **reject** |
### Verdict Thresholds (v2 -- 5-Level Ladder)

| Condition | Verdict | Pre-push exit code |
|-----------|---------|--------------------|
| Any dimension < 2, OR security/data/deploy risk | **BLOCK** | 1 (push blocked) |
| Any dimension < 3 but all >= 2 | **FIX_FIRST** | 1 (push blocked) |
| All dimensions >= 3 AND average >= 3.5 | **SAFE_TO_COMMIT** | 0 (push allowed) |
| All dimensions >= 3 but deploy needs runtime verification | **SAFE_TO_DEPLOY_AFTER_RUNTIME_CHECK** | 0 (push allowed, log advisory) |
| Reviewer lacked evidence to make reliable call | **INSUFFICIENT_EVIDENCE** | 0 (push allowed, log loudly) |

#### Migration from v1 (binary GO/NO-GO)

| v1 Verdict | v2 Mapping |
|------------|------------|
| `reject` | `BLOCK` |
| `improve` | `FIX_FIRST` |
| `accept` | `SAFE_TO_COMMIT` |
| `escalate` | `INSUFFICIENT_EVIDENCE` |

Historical calibration data tagged `binary-verdict-v1` remains valid for longitudinal analysis.
| Models disagree on accept/reject | **escalate** |

## Verdict Format
Expand Down