feat: Add 4 eval improvements — pre-gates, graduated scoring, failure feedback, Pareto (#18)
Conversation
📝 Walkthrough

Adds deterministic pre-gates to short-circuit LLM grading, graduated per-expectation numeric scores and failure_feedback, structured failure analysis in proposal prompts, parallel multi-proposal generation with biased prompts, Pareto frontier selection/merge logic, CLI flags for pareto/candidates, new types, tests, and docs updates.
Sequence Diagram(s)

sequenceDiagram
actor User
participant CLI as "CLI"
participant PreGates as "PreGates"
participant Grader as "Grader"
participant Extractor as "Extractor"
participant Proposer as "Proposer"
participant Validator as "Validator"
participant Pareto as "Pareto"
participant Deployer as "Deployer"
User->>CLI: grade --skill X
CLI->>PreGates: runPreGates(expectations, ctx)
PreGates-->>CLI: {resolved, remaining}
alt remaining non-empty
CLI->>Grader: grade(remaining)
Grader-->>CLI: GradingResult (may include failure_feedback)
end
CLI-->>User: grading summary (expectations with scores, mean_score, score_std_dev)
User->>CLI: evolve --skill X --pareto --candidates 3
CLI->>Extractor: extractFailurePatterns(evalEntries, usage, skillName, gradingResults)
Extractor-->>CLI: FailurePattern[] (with optional feedback)
CLI->>Proposer: generateMultipleProposals(currentDesc, patterns) x3
Proposer->>Grader: parallel prompts (biased variations)
Grader-->>Proposer: 3× EvolutionProposal
Proposer-->>CLI: proposals
CLI->>Validator: validate each proposal
Validator-->>CLI: ValidationResult[] (per-entry + by_invocation_type)
CLI->>Pareto: computeParetoFrontier(candidates)
Pareto-->>CLI: frontier + dominates_on
CLI->>Pareto: selectFromFrontier(frontier)
Pareto-->>CLI: {best, shouldMerge, mergePrompt}
alt shouldMerge
CLI->>Grader: merge using mergePrompt
Grader-->>CLI: merged proposal
end
CLI->>Deployer: deploy(selected_proposal)
Deployer-->>User: deployment summary
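The `computeParetoFrontier` step in the diagram above rests on a standard dominance rule: a candidate stays on the frontier unless some other candidate is at least as good on every invocation-type score and strictly better on at least one. A minimal sketch of that rule, assuming each candidate carries per-invocation-type pass rates (the `Candidate` and `Scores` shapes here are illustrative stand-ins, not the real `ParetoCandidate` type):

```typescript
type Scores = Record<"explicit" | "implicit" | "contextual" | "negative", number>;

interface Candidate {
  id: string;
  scores: Scores;
}

const DIMS = ["explicit", "implicit", "contextual", "negative"] as const;

// a dominates b if a is >= b on every dimension and strictly > on at least one
function dominates(a: Scores, b: Scores): boolean {
  let strictlyBetter = false;
  for (const d of DIMS) {
    if (a[d] < b[d]) return false;
    if (a[d] > b[d]) strictlyBetter = true;
  }
  return strictlyBetter;
}

// The frontier is every candidate not dominated by any other candidate.
function paretoFrontier(candidates: Candidate[]): Candidate[] {
  return candidates.filter(
    (c) => !candidates.some((o) => o !== c && dominates(o.scores, c.scores)),
  );
}
```

With two specialists (one best on explicit, one best on implicit) and one all-around weaker candidate, the weaker one is dominated by both and dropped, which is what makes the later `selectFromFrontier`/merge step meaningful.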
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~60 minutes
🚥 Pre-merge checks: ✅ 3 passed
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: ffc5e97eb9
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
claims: [],
eval_feedback: { suggestions: [], overall: "" },
};
Preserve LLM feedback in graded session output
When pre-gates only resolve part of the rubric, cliMain still calls the LLM but then hardcodes claims/eval_feedback to empty values and omits failure_feedback from the final GradingResult. This drops the grader’s structured diagnostics for normal runs, so users lose eval guidance and the new failure-feedback pipeline (consumed by extractFailurePatterns) cannot work even when the model returned it.
pattern: /error[s]?\s*(count|encountered)/i,
check: (ctx) => (ctx.telemetry.errors_encountered ?? 0) <= 2,
Enforce one-error threshold in error pre-gate
The error_count pre-gate marks expectations as passed when errors_encountered <= 2, but this gate pattern matches expectations phrased like “No more than 1 error encountered” (the documented default rubric). In sessions with exactly 2 errors, the pre-gate will incorrectly record a PASS and can even skip LLM grading entirely if all expectations are pre-gated, producing a wrong grade.
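One way to close this gap is to honor the rubric's own number instead of a hardcoded `<= 2`: parse the threshold out of the expectation text and fall back to the old default only when no number is present. A sketch under those assumptions (the expectation phrasing and the default of 2 come from this comment; the helper names are hypothetical):

```typescript
// Extract "No more than N error(s)" from an expectation string.
// Falls back to 2 when no explicit number is present, matching prior behavior.
function errorThreshold(expectation: string): number {
  const m = expectation.match(/no more than\s+(\d+)\s+errors?/i);
  const n = m?.[1];
  return n ? Number.parseInt(n, 10) : 2;
}

function errorGatePasses(expectation: string, errorsEncountered: number): boolean {
  return errorsEncountered <= errorThreshold(expectation);
}
```

With this, a session with exactly 2 errors fails the documented "No more than 1 error encountered" expectation at the gate instead of being incorrectly pre-passed.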
Actionable comments posted: 17
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@cli/selftune/evolution/evolve.ts`:
- Around line 454-455: The CLI option "candidates" is read as a string and not
validated, which can produce NaN or out-of-range values and thus silently
disable Pareto mode or cause excessive LLM fan-out; update the parsing logic
where the "candidates" option is read (the "candidates" CLI option in evolve.ts
and the code that controls LLM fan-out / Pareto mode) to parse the value as an
integer (e.g., parseInt), fall back to the default (3) on NaN, and clamp the
result to a safe range (1..5) before using it to control parallel calls; ensure
any comparisons or mode toggles that check for Pareto/parallel behavior use this
validated numeric variable.
- Line 26: Remove the unused import symbol InvocationTypeScores from the import
statement in evolve.ts: locate the import line that currently reads "import type
{ InvocationTypeScores, ParetoCandidate } from \"../types.js\";" and delete only
InvocationTypeScores so the line imports just ParetoCandidate, fixing the
noUnusedImports lint error.
- Line 233: The "created" audit entries currently only log a generic message;
update the two recordAudit calls (the one using
recordAudit(proposal.proposal_id, "created", `Pareto candidate for
${skillName}`) and the other call around lines 295-297) to include the original
SKILL.md content in the details payload prefixed with "original_description:".
Retrieve the existing SKILL.md/original description variable (e.g.,
originalDescription or skillMd where the skill content is read) and pass a
details string like `original_description:${originalDescription}` concatenated
with the current message so the audit entry contains the original_description
payload for rollback.
- Around line 200-201: The pareto flag default is incorrectly set to false;
change all assignments where paretoEnabled is derived from options (e.g., const
paretoEnabled = options.paretoEnabled ?? false;) to use true as the fallback
(const paretoEnabled = options.paretoEnabled ?? true;) so Pareto evolution is
enabled by default; update every occurrence of that pattern in this file
(including the other paretoEnabled assignments noted) to use ?? true and leave
candidateCount logic unchanged.
In `@cli/selftune/evolution/extract-patterns.ts`:
- Around line 142-163: The current enrichment uses a Map<string,
FailureFeedback> that overwrites multiple feedbacks for the same query and may
attach cross-skill feedback; change feedbackMap to Map<string,
FailureFeedback[]> (or similar multimap) when building from gradingResults (push
each fb into an array keyed by fb.query), and when iterating patterns
(allPatterns) collect and flatten all arrays for each pattern.missed_queries
into matchingFeedback; additionally, if patterns carry a skill field (e.g.,
pattern.skill) filter feedback entries by fb.skill === pattern.skill before
attaching to pattern.feedback to avoid cross-skill contamination.
In `@cli/selftune/evolution/propose-description.ts`:
- Line 10: Remove the unused type import FailureFeedback from the import
statement that currently reads import type { EvolutionProposal, FailureFeedback,
FailurePattern } from "../types.js"; — leave only the types that are actually
used (e.g., EvolutionProposal and FailurePattern) so the import becomes import
type { EvolutionProposal, FailurePattern } from "../types.js"; to satisfy the
linter's noUnusedImports rule.
- Around line 148-170: The current Promise.all over variations causes one failed
callLlm to reject the entire batch; change to a resilient pattern (use
Promise.allSettled or wrap the async mapper in try/catch) so failures for
individual prompts are captured but do not abort all candidates—catch errors
around callLlm/parseProposalResponse inside the variations map (or inspect
allSettled results) and return either a null/error marker for failed entries,
then filter out failed/null proposals before proceeding; reference the proposals
variable, the variations.map async callback, callLlm and parseProposalResponse
to locate where to add the try/catch or switch to Promise.allSettled and handle
only the fulfilled results.
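The resilient fan-out described above can be sketched with `Promise.allSettled`, so one rejected prompt no longer aborts the whole candidate batch (the `callLlm` stub and `Proposal` shape below are stand-ins for the real symbols in propose-description.ts):

```typescript
interface Proposal {
  text: string;
}

// Stand-in for the real LLM call; any individual prompt may reject.
async function callLlm(prompt: string): Promise<Proposal> {
  if (prompt.includes("bad")) throw new Error(`LLM failed for: ${prompt}`);
  return { text: `proposal for ${prompt}` };
}

// Collect only the fulfilled results; rejections are dropped instead of
// propagating and cancelling the other candidates.
async function generateProposals(variations: string[]): Promise<Proposal[]> {
  const settled = await Promise.allSettled(variations.map((v) => callLlm(v)));
  return settled
    .filter((r): r is PromiseFulfilledResult<Proposal> => r.status === "fulfilled")
    .map((r) => r.value);
}
```

The type-guard predicate in `filter` is what lets TypeScript narrow the settled results to fulfilled ones without a cast.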
In `@cli/selftune/evolution/validate-proposal.ts`:
- Line 9: The dynamic scores record in validate-proposal.ts is missing required
InvocationType keys and uses an unsafe double-cast; replace the current creation
with an explicitly initialized Record<InvocationType, {passed:number;
total:number}> containing all four keys
("explicit","implicit","contextual","negative"), then map each key to a
fully-typed InvocationTypeScores (an array/object with all four dimensions)
instead of using "as unknown as" to silence types; update the logic in the
function(s) that build scores (referencing computeInvocationScores /
InvocationTypeScores / InvocationType / validate-proposal.ts) to mirror the
proven pattern from cli/selftune/monitoring/watch.ts (and apply the same fix in
pareto.ts's computeInvocationScores where the double-cast appears).
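The explicitly initialized record suggested above might look like the following sketch (the `Tally` shape follows the `{passed; total}` wording in the comment; the real types live in cli/selftune/types.ts and may differ):

```typescript
type InvocationType = "explicit" | "implicit" | "contextual" | "negative";

interface Tally {
  passed: number;
  total: number;
}

// Every key is present from the start, so no "as unknown as" cast is needed
// and TypeScript can verify the Record is complete.
function emptyTallies(): Record<InvocationType, Tally> {
  return {
    explicit: { passed: 0, total: 0 },
    implicit: { passed: 0, total: 0 },
    contextual: { passed: 0, total: 0 },
    negative: { passed: 0, total: 0 },
  };
}

function rate({ passed, total }: Tally): number {
  return total === 0 ? 0 : passed / total;
}

// Map tallies to per-type pass rates with all four dimensions populated.
function toScores(tallies: Record<InvocationType, Tally>): Record<InvocationType, number> {
  return {
    explicit: rate(tallies.explicit),
    implicit: rate(tallies.implicit),
    contextual: rate(tallies.contextual),
    negative: rate(tallies.negative),
  };
}
```

Because both records are fully keyed, a missing invocation type becomes a compile error rather than a silent `undefined` at runtime.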
In `@cli/selftune/grading/grade-session.ts`:
- Around line 16-23: Remove the unused FailureFeedback import from the import
list in the top of the file: update the import that currently lists
ExecutionMetrics, FailureFeedback, GraderOutput, GradingExpectation,
GradingResult, SessionTelemetryRecord so that FailureFeedback is omitted (leave
the other types unchanged) to satisfy the noUnusedImports lint rule; verify no
references to FailureFeedback remain in functions like any grading helpers in
this module.
- Around line 193-202: The code currently uses raw model-provided values when
building the scores array (const scores = expectations.map(...)) which can be
out-of-range or non-numeric; sanitize each value before statistics: for each
expectation in expectations (the same place building scores) coerce score to a
finite number, fallback to (e.passed ? 1.0 : 0.0) if missing/NaN, then clamp to
the 0.0–1.0 range; after sanitizing ensure you handle empty expectations (avoid
divide-by-zero) and then compute mean_score and score_std_dev from the sanitized
scores array so returned metrics stay within documented bounds.
- Around line 466-507: The grading result currently overwrites LLM-provided
feedback by hardcoding empty claims and eval_feedback; update the result
construction to copy these fields from graderOutput (e.g., use result.claims =
graderOutput.claims ?? [] and result.eval_feedback = graderOutput.eval_feedback
?? { suggestions: [], overall: "" }) and also include any
graderOutput.failure_feedback (or similar field provided by GraderOutput) into
the result (e.g., result.failure_feedback = graderOutput.failure_feedback ?? []
or merge it into eval_feedback if that’s the expected shape) so LLM-supplied
claims/eval feedback are preserved instead of dropped.
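The score sanitization prescribed above for grade-session.ts can be sketched as follows (the `passed`/`score` field names follow the prompt; the exact `GradingExpectation` shape is an assumption):

```typescript
interface ExpectationLike {
  passed: boolean;
  score?: unknown; // raw model output: may be missing, non-numeric, or out of range
}

// Coerce to a finite number, fall back to passed ? 1 : 0, then clamp to [0, 1].
function sanitizeScore(e: ExpectationLike): number {
  const n = typeof e.score === "number" ? e.score : Number(e.score);
  if (!Number.isFinite(n)) return e.passed ? 1.0 : 0.0;
  return Math.min(1.0, Math.max(0.0, n));
}

// Compute mean/std-dev from sanitized scores, guarding the empty case.
function summarize(expectations: ExpectationLike[]): { mean: number; stdDev: number } {
  if (expectations.length === 0) return { mean: 0, stdDev: 0 }; // avoid divide-by-zero
  const scores = expectations.map(sanitizeScore);
  const mean = scores.reduce((s, x) => s + x, 0) / scores.length;
  const variance = scores.reduce((s, x) => s + (x - mean) ** 2, 0) / scores.length;
  return { mean, stdDev: Math.sqrt(variance) };
}
```

This keeps `mean_score` and `score_std_dev` within the documented 0.0–1.0 bounds even when the grader returns `1.5`, `"high"`, or omits the field entirely.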
In `@docs/exec-plans/tech-debt-tracker.md`:
- Line 20: Update the TD-012 entry in the tech-debt tracker so the description
reads "fails to send initial data event intermittently" instead of the current
awkward "sends initial data event fails intermittently"; locate the TD-012 table
row (unique id "TD-012") in the document and replace the phrase in the
description cell accordingly to improve grammar and clarity.
In `@tests/evolution/extract-patterns.test.ts`:
- Around line 242-243: Replace the non-null assertions on the test subject with
optional chaining: change occurrences of patterns[0].feedback! to
patterns[0].feedback?., specifically in the assertions that check length and
failure_reason (the two lines using patterns[0].feedback!.length and
patterns[0].feedback![0].failure_reason) so they use
patterns[0].feedback?.length and patterns[0].feedback?[0].failure_reason; the
earlier assertion already confirms feedback is defined so this refactor is safe
and conforms to the tests/**/*.ts guideline.
In `@tests/evolution/pareto.test.ts`:
- Around line 153-154: Replace the non-null assertions on memberA and memberB by
checking the find results are defined and using optional chaining when accessing
nested fields: call frontier.find(...) and assign to memberA/memberB, then add
explicit guards like if (!memberA) throw new Error("memberA not found for
proposal_id 'a'") (and similarly for memberB) or use
expect(memberA).toBeDefined() in the test before referencing
memberA.proposal.proposal_id; when reading proposal_id use
memberA?.proposal?.proposal_id to satisfy the tests/**/*.ts rule. Ensure you
reference the same symbols (frontier, memberA, memberB, proposal.proposal_id)
when adding the checks.
In `@tests/grading/grade-session.test.ts`:
- Around line 683-684: The test is using non-null assertions on
result.failure_feedback (expect(result.failure_feedback!.length) and
expect(result.failure_feedback![0].query)); replace those non-null assertions
with optional chaining so the expectations read against
result.failure_feedback?.length and result.failure_feedback?.[0]?.query (or
equivalent optional-chained checks) to satisfy the typing guideline; update both
occurrences referencing the result.failure_feedback symbol and adjust assertions
to handle undefined safely.
In `@tests/grading/pre-gates.test.ts`:
- Around line 41-132: Replace all non-null assertions used when looking up gates
in DEFAULT_GATES (e.g., DEFAULT_GATES.find((g) => g.name === "skill_md_read")!)
with a safe lookup that uses optional result checks; call
DEFAULT_GATES.find(...) without "!" and add expect(gate).toBeDefined() before
accessing gate.pattern or gate.check, or create a small helper like
getGate(name) that asserts the result is defined and then returns it for use in
tests; update every occurrence in the tests between the pattern checks and
pre-gate logic blocks to use this pattern so no test uses "!".
In `@tests/types/new-types.test.ts`:
- Around line 2-13: Remove the two unused type imports GradingResult and
ValidationResultBase from the import statement in tests/types/new-types.test.ts
to satisfy lint/correctness/noUnusedImports; locate the import block that
currently lists FailureFeedback, FailurePattern, GraderOutput,
GradingExpectation, GradingResult, GradingSummary, InvocationTypeScores,
ParetoCandidate, ParetoSelectionResult, ValidationResultBase and delete the
GradingResult and ValidationResultBase entries so only used types remain.
ℹ️ Review info
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
Disabled knowledge base sources:
- Linear integration is disabled
📒 Files selected for processing (22)
- ARCHITECTURE.md
- PRD.md
- README.md
- cli/selftune/evolution/evolve.ts
- cli/selftune/evolution/extract-patterns.ts
- cli/selftune/evolution/pareto.ts
- cli/selftune/evolution/propose-description.ts
- cli/selftune/evolution/validate-proposal.ts
- cli/selftune/grading/grade-session.ts
- cli/selftune/grading/pre-gates.ts
- cli/selftune/types.ts
- docs/design-docs/evolution-pipeline.md
- docs/design-docs/index.md
- docs/escalation-policy.md
- docs/exec-plans/tech-debt-tracker.md
- docs/golden-principles.md
- tests/evolution/extract-patterns.test.ts
- tests/evolution/pareto.test.ts
- tests/evolution/propose-description.test.ts
- tests/grading/grade-session.test.ts
- tests/grading/pre-gates.test.ts
- tests/types/new-types.test.ts
const paretoEnabled = options.paretoEnabled ?? false;
const candidateCount = options.candidateCount ?? 3;
--pareto default is wired to false, not the intended default behavior.
This currently disables Pareto flow unless explicitly set, which conflicts with the documented objective of default-enabled Pareto evolution.
Proposed fix
- const paretoEnabled = options.paretoEnabled ?? false;
+ const paretoEnabled = options.paretoEnabled ?? true;
@@
- pareto: { type: "boolean", default: false },
+ pareto: { type: "boolean", default: true },
@@
- paretoEnabled: values.pareto ?? false,
+ paretoEnabled: values.pareto ?? true,Also applies to: 453-454, 519-520
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@cli/selftune/evolution/evolve.ts` around lines 200 - 201, The pareto flag
default is incorrectly set to false; change all assignments where paretoEnabled
is derived from options (e.g., const paretoEnabled = options.paretoEnabled ??
false;) to use true as the fallback (const paretoEnabled = options.paretoEnabled
?? true;) so Pareto evolution is enabled by default; update every occurrence of
that pattern in this file (including the other paretoEnabled assignments noted)
to use ?? true and leave candidateCount logic unchanged.
// Validate each candidate
const paretoCandidates: ParetoCandidate[] = [];
for (const proposal of viableCandidates) {
  recordAudit(proposal.proposal_id, "created", `Pareto candidate for ${skillName}`);
Created audit entries still omit the `original_description:` details required for rollback.
Rollback fallback restoration depends on this payload format; current entries only store a generic message.
Proposed fix
- recordAudit(proposal.proposal_id, "created", `Pareto candidate for ${skillName}`);
+ recordAudit(
+ proposal.proposal_id,
+ "created",
+ `original_description:${currentDescription}\ncontext:Pareto candidate for ${skillName}`,
+ );
@@
- recordAudit(
- proposal.proposal_id,
- "created",
- `Proposal created for ${skillName} (iteration ${iteration + 1})`,
- );
+ recordAudit(
+ proposal.proposal_id,
+ "created",
+ `original_description:${currentDescription}\ncontext:Proposal created for ${skillName} (iteration ${iteration + 1})`,
+ );

Based on learnings: In cli/selftune/evolution/evolve.ts, the "created" audit entry should store the original SKILL.md content with the prefix "original_description:" in the details field to support rollback fallback restoration.
Also applies to: 295-297
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@cli/selftune/evolution/evolve.ts` at line 233, The "created" audit entries
currently only log a generic message; update the two recordAudit calls (the one
using recordAudit(proposal.proposal_id, "created", `Pareto candidate for
${skillName}`) and the other call around lines 295-297) to include the original
SKILL.md content in the details payload prefixed with "original_description:".
Retrieve the existing SKILL.md/original description variable (e.g.,
originalDescription or skillMd where the skill content is read) and pass a
details string like `original_description:${originalDescription}` concatenated
with the current message so the audit entry contains the original_description
payload for rollback.
candidates: { type: "string", default: "3" },
help: { type: "boolean", default: false },
--candidates is not validated before controlling LLM fan-out.
Invalid or oversized values can silently disable Pareto mode (NaN) or trigger excessive parallel calls despite help text claiming max 5.
Proposed fix
+ const parsedCandidates = Number.parseInt(values.candidates ?? "3", 10);
+ const candidateCount = Number.isFinite(parsedCandidates)
+ ? Math.min(5, Math.max(1, parsedCandidates))
+ : 3;
+
const result = await evolve({
@@
- candidateCount: Number.parseInt(values.candidates ?? "3", 10),
+ candidateCount,
});

Also applies to: 520
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@cli/selftune/evolution/evolve.ts` around lines 454 - 455, The CLI option
"candidates" is read as a string and not validated, which can produce NaN or
out-of-range values and thus silently disable Pareto mode or cause excessive LLM
fan-out; update the parsing logic where the "candidates" option is read (the
"candidates" CLI option in evolve.ts and the code that controls LLM fan-out /
Pareto mode) to parse the value as an integer (e.g., parseInt), fall back to the
default (3) on NaN, and clamp the result to a safe range (1..5) before using it
to control parallel calls; ensure any comparisons or mode toggles that check for
Pareto/parallel behavior use this validated numeric variable.
// 3.5. Attach failure feedback from grading results if available
if (gradingResults && gradingResults.length > 0) {
  const feedbackMap = new Map<string, FailureFeedback>();
  for (const gr of gradingResults) {
    if (gr.failure_feedback) {
      for (const fb of gr.failure_feedback) {
        feedbackMap.set(fb.query, fb);
      }
    }
  }

  for (const pattern of allPatterns) {
    const matchingFeedback: FailureFeedback[] = [];
    for (const query of pattern.missed_queries) {
      const fb = feedbackMap.get(query);
      if (fb) matchingFeedback.push(fb);
    }
    if (matchingFeedback.length > 0) {
      pattern.feedback = matchingFeedback;
    }
  }
}
Scope and preserve feedback entries when enriching patterns.
Current enrichment can attach cross-skill feedback and drops all but the last feedback per query due to Map<string, FailureFeedback> overwrite behavior.
Proposed fix
- if (gradingResults && gradingResults.length > 0) {
- const feedbackMap = new Map<string, FailureFeedback>();
+ if (gradingResults && gradingResults.length > 0) {
+ const feedbackMap = new Map<string, FailureFeedback[]>();
for (const gr of gradingResults) {
+ if (gr.skill_name !== skillName) continue;
if (gr.failure_feedback) {
for (const fb of gr.failure_feedback) {
- feedbackMap.set(fb.query, fb);
+ const list = feedbackMap.get(fb.query) ?? [];
+ list.push(fb);
+ feedbackMap.set(fb.query, list);
}
}
}
for (const pattern of allPatterns) {
const matchingFeedback: FailureFeedback[] = [];
for (const query of pattern.missed_queries) {
- const fb = feedbackMap.get(query);
- if (fb) matchingFeedback.push(fb);
+ const fbList = feedbackMap.get(query);
+ if (fbList) matchingFeedback.push(...fbList);
}
if (matchingFeedback.length > 0) {
pattern.feedback = matchingFeedback;
}
}
}

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@cli/selftune/evolution/extract-patterns.ts` around lines 142 - 163, The
current enrichment uses a Map<string, FailureFeedback> that overwrites multiple
feedbacks for the same query and may attach cross-skill feedback; change
feedbackMap to Map<string, FailureFeedback[]> (or similar multimap) when
building from gradingResults (push each fb into an array keyed by fb.query), and
when iterating patterns (allPatterns) collect and flatten all arrays for each
pattern.missed_queries into matchingFeedback; additionally, if patterns carry a
skill field (e.g., pattern.skill) filter feedback entries by fb.skill ===
pattern.skill before attaching to pattern.feedback to avoid cross-skill
contamination.
tests/evolution/pareto.test.ts (Outdated)
const memberA = frontier.find(c => c.proposal.proposal_id === "a")!;
const memberB = frontier.find(c => c.proposal.proposal_id === "b")!;
Replace non-null assertions with defined checks and optional chaining.
Lines 153-154 use non-null assertions (!), which violate the tests/**/*.ts guideline requiring optional chaining (result?.). Add explicit defined checks and switch to optional chaining:
Proposed fix
- const memberA = frontier.find(c => c.proposal.proposal_id === "a")!;
- const memberB = frontier.find(c => c.proposal.proposal_id === "b")!;
- expect(memberA.dominates_on).toContain("explicit");
- expect(memberB.dominates_on).toContain("implicit");
+ const memberA = frontier.find((c) => c.proposal.proposal_id === "a");
+ const memberB = frontier.find((c) => c.proposal.proposal_id === "b");
+ expect(memberA).toBeDefined();
+ expect(memberB).toBeDefined();
+ expect(memberA?.dominates_on).toContain("explicit");
+ expect(memberB?.dominates_on).toContain("implicit");
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@tests/evolution/pareto.test.ts` around lines 153 - 154, Replace the non-null
assertions on memberA and memberB by checking the find results are defined and
using optional chaining when accessing nested fields: call frontier.find(...)
and assign to memberA/memberB, then add explicit guards like if (!memberA) throw
new Error("memberA not found for proposal_id 'a'") (and similarly for memberB)
or use expect(memberA).toBeDefined() in the test before referencing
memberA.proposal.proposal_id; when reading proposal_id use
memberA?.proposal?.proposal_id to satisfy the tests/**/*.ts rule. Ensure you
reference the same symbols (frontier, memberA, memberB, proposal.proposal_id)
when adding the checks.
    const gate = DEFAULT_GATES.find((g) => g.name === "skill_md_read")!;
    expect(gate.pattern.test("SKILL.md was read")).toBe(true);
    expect(gate.pattern.test("The skill.md file was read")).toBe(true);
  });

  test("skill_md_read gate does not match unrelated text", () => {
    const gate = DEFAULT_GATES.find((g) => g.name === "skill_md_read")!;
    expect(gate.pattern.test("Output is a .pptx")).toBe(false);
  });

  test("expected_tools_called gate matches tool-related text", () => {
    const gate = DEFAULT_GATES.find((g) => g.name === "expected_tools_called")!;
    expect(gate.pattern.test("tools were called")).toBe(true);
    expect(gate.pattern.test("tool called")).toBe(true);
  });

  test("error_count gate matches error-related text", () => {
    const gate = DEFAULT_GATES.find((g) => g.name === "error_count")!;
    expect(gate.pattern.test("errors encountered")).toBe(true);
    expect(gate.pattern.test("error count")).toBe(true);
  });

  test("session_completed gate matches session text", () => {
    const gate = DEFAULT_GATES.find((g) => g.name === "session_completed")!;
    expect(gate.pattern.test("session completed")).toBe(true);
    expect(gate.pattern.test("session finished")).toBe(true);
  });
});

// ---------------------------------------------------------------------------
// Individual gate logic
// ---------------------------------------------------------------------------

describe("pre-gate check logic", () => {
  test("skill_md_read passes when skill is in skills_triggered", () => {
    const gate = DEFAULT_GATES.find((g) => g.name === "skill_md_read")!;
    const ctx = makeCtx();
    expect(gate.check(ctx)).toBe(true);
  });

  test("skill_md_read fails when skill not triggered and no transcript match", () => {
    const gate = DEFAULT_GATES.find((g) => g.name === "skill_md_read")!;
    const ctx = makeCtx({
      telemetry: { ...makeCtx().telemetry, skills_triggered: [] },
      skillName: "myskill",
    });
    expect(gate.check(ctx)).toBe(false);
  });

  test("skill_md_read passes via transcript match", () => {
    const gate = DEFAULT_GATES.find((g) => g.name === "skill_md_read")!;
    const ctx = makeCtx({
      telemetry: { ...makeCtx().telemetry, skills_triggered: [] },
      transcriptExcerpt: "[TOOL] Read file SKILL.md",
    });
    expect(gate.check(ctx)).toBe(true);
  });

  test("expected_tools_called passes when tool_calls > 0", () => {
    const gate = DEFAULT_GATES.find((g) => g.name === "expected_tools_called")!;
    expect(gate.check(makeCtx())).toBe(true);
  });

  test("expected_tools_called fails when total_tool_calls is 0", () => {
    const gate = DEFAULT_GATES.find((g) => g.name === "expected_tools_called")!;
    const ctx = makeCtx({
      telemetry: { ...makeCtx().telemetry, total_tool_calls: 0 },
    });
    expect(gate.check(ctx)).toBe(false);
  });

  test("error_count passes when errors <= 2", () => {
    const gate = DEFAULT_GATES.find((g) => g.name === "error_count")!;
    expect(gate.check(makeCtx())).toBe(true); // errors_encountered: 1
  });

  test("error_count fails when errors > 2", () => {
    const gate = DEFAULT_GATES.find((g) => g.name === "error_count")!;
    const ctx = makeCtx({
      telemetry: { ...makeCtx().telemetry, errors_encountered: 5 },
    });
    expect(gate.check(ctx)).toBe(false);
  });

  test("session_completed passes when assistant_turns > 0", () => {
    const gate = DEFAULT_GATES.find((g) => g.name === "session_completed")!;
    expect(gate.check(makeCtx())).toBe(true);
  });

  test("session_completed fails when assistant_turns is 0", () => {
    const gate = DEFAULT_GATES.find((g) => g.name === "session_completed")!;
    const ctx = makeCtx({
Replace non-null assertions with proper null checks in gate lookups.
Lines 41-132 contain 14 instances of .find(...)! on DEFAULT_GATES. This violates the test guideline requiring optional chaining (result?.) instead of non-null assertions (result!.).
Use optional chaining with explicit expect().toBeDefined() checks before accessing properties:
const gate = DEFAULT_GATES.find((g) => g.name === "skill_md_read");
expect(gate).toBeDefined();
expect(gate?.pattern.test("SKILL.md was read")).toBe(true);

Alternatively, extract a test helper to reduce repetition:
function getGate(name: string): PreGate {
const gate = DEFAULT_GATES.find((g) => g.name === name);
expect(gate).toBeDefined();
return gate!; // safe after toBeDefined check
}🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@tests/grading/pre-gates.test.ts` around lines 41 - 132, Replace all non-null
assertions used when looking up gates in DEFAULT_GATES (e.g.,
DEFAULT_GATES.find((g) => g.name === "skill_md_read")!) with a safe lookup that
uses optional result checks; call DEFAULT_GATES.find(...) without "!" and add
expect(gate).toBeDefined() before accessing gate.pattern or gate.check, or
create a small helper like getGate(name) that asserts the result is defined and
then returns it for use in tests; update every occurrence in the tests between
the pattern checks and pre-gate logic blocks to use this pattern so no test uses
"!".
Actionable comments posted: 3
♻️ Duplicate comments (9)
cli/selftune/grading/grade-session.ts (1)
499-516: ⚠️ Potential issue | 🔴 Critical

Preserve LLM feedback fields in CLI grading output.

Line 514 and Line 515 hardcode empty `claims`/`eval_feedback`, and `failure_feedback` is omitted entirely. In the partial pre-gate path, this drops model output that downstream evolution steps need.

🔧 Proposed fix

```diff
- let allExpectations: GradingExpectation[];
+ let allExpectations: GradingExpectation[];
+ let llmOutput: GraderOutput | null = null;
  if (preGateResult.remaining.length === 0) {
@@
  let graderOutput: GraderOutput;
  try {
    graderOutput = await gradeViaAgent(prompt, agent);
+   llmOutput = graderOutput;
  } catch (e) {
    console.error(`[ERROR] Grading failed: ${e}`);
    process.exit(1);
  }
@@
  const result: GradingResult = {
@@
-   claims: [],
-   eval_feedback: { suggestions: [], overall: "" },
+   claims: llmOutput?.claims ?? [],
+   eval_feedback: llmOutput?.eval_feedback ?? { suggestions: [], overall: "" },
+   failure_feedback: llmOutput?.failure_feedback,
  };
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@cli/selftune/grading/grade-session.ts` around lines 499 - 516, The result object construction currently hardcodes empty claims and eval_feedback and omits failure_feedback, losing LLM feedback; update the GradingResult creation to copy the LLM-produced fields instead of hardcoding empties by assigning the actual variables used earlier in the grading flow (e.g., the claims variable, eval_feedback/evalFeedback variable, and failure_feedback/failureFeedback variable) into result.claims, result.eval_feedback, and result.failure_feedback so downstream evolution steps receive the model outputs; ensure you preserve the same shapes/types as expected by buildExecutionMetrics/consumers.

tests/grading/pre-gates.test.ts (1)
40-131: ⚠️ Potential issue | 🟠 Major

Remove repeated non-null assertions in gate lookups.

Lines 40-131 use `DEFAULT_GATES.find(...)!` repeatedly. This violates the test guideline and hides missing-gate failures behind assertion-based typing.

🔧 Proposed fix

```diff
+function getGate(name: string): PreGate {
+  const gate = DEFAULT_GATES.find((g) => g.name === name);
+  if (!gate) throw new Error(`Missing test gate: ${name}`);
+  return gate;
+}
+
 describe("pre-gate pattern matching", () => {
   test("skill_md_read gate matches 'SKILL.md was read'", () => {
-    const gate = DEFAULT_GATES.find((g) => g.name === "skill_md_read")!;
+    const gate = getGate("skill_md_read");
     expect(gate.pattern.test("SKILL.md was read")).toBe(true);
     expect(gate.pattern.test("The skill.md file was read")).toBe(true);
   });
```

As per coding guidelines: `tests/**/*.ts`: Optional chaining (`result?.`) not non-null assertions (`result!`).

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@tests/grading/pre-gates.test.ts` around lines 40 - 131, Replace the repeated non-null assertions when looking up gates with safe optional access: stop using DEFAULT_GATES.find(...)! and instead assign without the '!' and call methods/properties with optional chaining (e.g., gate?.pattern.test(...) and gate?.check(...)). This keeps the test from hiding a missing gate (the expectation will fail if gate is undefined) and follows the tests/**/*.ts guideline to use optional chaining rather than non-null assertions on DEFAULT_GATES lookups.

cli/selftune/evolution/validate-proposal.ts (1)
133-159: ⚠️ Potential issue | 🔴 Critical

Return a fully-shaped `InvocationTypeScores` object; remove unsafe cast.

Lines 134-159 build sparse dynamic keys and then force-cast to `InvocationTypeScores`. Missing dimensions can propagate `undefined` into Pareto scoring code that expects all four invocation dimensions.

🔧 Proposed fix

```diff
-import type { EvalEntry, EvolutionProposal, InvocationTypeScores } from "../types.js";
+import type {
+  EvalEntry,
+  EvolutionProposal,
+  InvocationType,
+  InvocationTypeScores,
+} from "../types.js";
@@
-  const byInvocationType: Record<string, { passed: number; total: number }> = {};
+  const byInvocationType: Record<InvocationType, { passed: number; total: number }> = {
+    explicit: { passed: 0, total: 0 },
+    implicit: { passed: 0, total: 0 },
+    contextual: { passed: 0, total: 0 },
+    negative: { passed: 0, total: 0 },
+  };
   for (const r of perEntryResults) {
-    const type = r.entry.invocation_type ?? "implicit";
-    if (!byInvocationType[type]) byInvocationType[type] = { passed: 0, total: 0 };
+    const type: InvocationType = r.entry.invocation_type ?? "implicit";
     byInvocationType[type].total++;
     if (r.after_pass) byInvocationType[type].passed++;
   }
-  const invocationScores: Record<string, { passed: number; total: number; pass_rate: number }> = {};
-  for (const [type, counts] of Object.entries(byInvocationType)) {
-    invocationScores[type] = {
-      ...counts,
-      pass_rate: counts.total > 0 ? counts.passed / counts.total : 0,
-    };
-  }
+  const toScore = (counts: { passed: number; total: number }) => ({
+    ...counts,
+    pass_rate: counts.total > 0 ? counts.passed / counts.total : 0,
+  });
+  const invocationScores: InvocationTypeScores = {
+    explicit: toScore(byInvocationType.explicit),
+    implicit: toScore(byInvocationType.implicit),
+    contextual: toScore(byInvocationType.contextual),
+    negative: toScore(byInvocationType.negative),
+  };
@@
-  by_invocation_type: invocationScores as unknown as InvocationTypeScores,
+  by_invocation_type: invocationScores,
```

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@cli/selftune/evolution/validate-proposal.ts` around lines 133 - 159, The code builds dynamic invocationScores from byInvocationType and then unsafely casts it to InvocationTypeScores; instead, explicitly populate all required invocation dimensions (e.g., the four expected keys) with default counts/pass_rate if missing and construct a properly typed InvocationTypeScores object without using "as unknown as InvocationTypeScores". Locate the aggregation logic around byInvocationType and invocationScores, create a list/array of the expected invocation dimension names, iterate that list to ensure each key exists (with passed:0,total:0,pass_rate:0 when absent), compute pass_rate safely, and return that fully-shaped object directly as InvocationTypeScores rather than casting.

cli/selftune/evolution/extract-patterns.ts (1)
149-167: ⚠️ Potential issue | 🟠 Major

Preserve all feedback entries per query and scope enrichment to the target skill.

Line 151 stores feedback in `Map<string, FailureFeedback>`, so repeated query feedback gets overwritten. Lines 152-158 also ingest all grading results without filtering `gr.skill_name`, which can attach another skill's feedback when queries match.

🔧 Proposed fix

```diff
-  const feedbackMap = new Map<string, FailureFeedback>();
+  const feedbackMap = new Map<string, FailureFeedback[]>();
   for (const gr of gradingResults) {
+    if (gr.skill_name !== skillName) continue;
     if (gr.failure_feedback) {
       for (const fb of gr.failure_feedback) {
-        feedbackMap.set(fb.query, fb);
+        const list = feedbackMap.get(fb.query) ?? [];
+        list.push(fb);
+        feedbackMap.set(fb.query, list);
       }
     }
   }
   for (const pattern of allPatterns) {
     const matchingFeedback: FailureFeedback[] = [];
     for (const query of pattern.missed_queries) {
-      const fb = feedbackMap.get(query);
-      if (fb) matchingFeedback.push(fb);
+      const fbList = feedbackMap.get(query);
+      if (fbList) matchingFeedback.push(...fbList);
     }
     if (matchingFeedback.length > 0) {
       pattern.feedback = matchingFeedback;
     }
   }
```

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@cli/selftune/evolution/extract-patterns.ts` around lines 149 - 167, The code currently uses Map<string, FailureFeedback> which overwrites repeated feedback per query and does not scope feedback by skill; change feedbackMap to Map<string, FailureFeedback[]> and key entries by a composite key including skill and query (e.g. `${gr.skill_name}::${fb.query}`) when populating in the gradingResults loop (push into the array instead of set), then when iterating allPatterns look up feedback using the composite key `${pattern.skill_name}::${query}` so all feedback entries for that query+skill are preserved and feedback is only attached to the matching skill; update references to feedbackMap, FailureFeedback entries, and pattern.feedback accordingly.

tests/evolution/pareto.test.ts (1)
205-208: ⚠️ Potential issue | 🟡 Minor

Remove non-null assertions in test lookups.

Replace `!` with defined checks and optional chaining to align with test guidelines.

Proposed fix

```diff
-    const memberA = frontier.find((c) => c.proposal.proposal_id === "a")!;
-    const memberB = frontier.find((c) => c.proposal.proposal_id === "b")!;
-    expect(memberA.dominates_on).toContain("explicit");
-    expect(memberB.dominates_on).toContain("implicit");
+    const memberA = frontier.find((c) => c.proposal.proposal_id === "a");
+    const memberB = frontier.find((c) => c.proposal.proposal_id === "b");
+    expect(memberA).toBeDefined();
+    expect(memberB).toBeDefined();
+    expect(memberA?.dominates_on).toContain("explicit");
+    expect(memberB?.dominates_on).toContain("implicit");
```

As per coding guidelines: `tests/**/*.ts` requires optional chaining (`result?.`) and disallows non-null assertions (`result!`).

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@tests/evolution/pareto.test.ts` around lines 205 - 208, Replace the non-null assertions used when finding test members: avoid `!` on the results of `frontier.find` and instead assert definedness and use optional chaining; specifically, when locating the entries found via `frontier.find((c) => c.proposal.proposal_id === "a")` and the one for `"b"`, check that the find result is defined (e.g., with an expect or guard) before accessing properties like `proposal.proposal_id` and `dominates_on`, and use optional chaining (e.g., `memberA?.dominates_on`) in the subsequent assertions (`expect(...).toContain(...)`) so the test follows the tests/**/*.ts guideline disallowing `!`.

cli/selftune/evolution/propose-description.ts (1)
153-175: ⚠️ Potential issue | 🟠 Major

Make multi-candidate generation failure-tolerant.

A single `callLlm`/parse failure currently rejects the entire batch via `Promise.all`, which can discard valid candidates and fail Pareto evolution unnecessarily.

Proposed fix

```diff
-  const proposals = await Promise.all(
+  const settled = await Promise.allSettled(
     variations.map(async (prompt, i) => {
       const rawResponse = await callLlm(PROPOSER_SYSTEM, prompt, agent);
       const { proposed_description, rationale, confidence } = parseProposalResponse(rawResponse);
@@
-  );
-
-  return proposals;
+  );
+
+  const proposals = settled
+    .filter((r): r is PromiseFulfilledResult<EvolutionProposal> => r.status === "fulfilled")
+    .map((r) => r.value);
+
+  if (proposals.length === 0) {
+    throw new Error("Failed to generate any proposals");
+  }
+
+  return proposals;
```

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@cli/selftune/evolution/propose-description.ts` around lines 153 - 175, The batch generation uses Promise.all over variations so a single failure in callLlm or parseProposalResponse aborts the whole batch; make generation failure-tolerant by changing the logic in the proposals creation (the variations.map + Promise.all block) to handle per-candidate errors—wrap each async map body in try/catch (or use Promise.allSettled) so failed calls return null or an error marker, then filter out those null/failed entries before returning proposals; apply this to the code referencing callLlm, parseProposalResponse, proposal_id generation, and ensure status/confidence handling remains correct for the surviving candidates.

cli/selftune/evolution/evolve.ts (3)
205-206: ⚠️ Potential issue | 🟠 Major

Pareto is still default-disabled in code paths that should default to enabled.

Current fallbacks/defaults disable Pareto unless explicitly set, which conflicts with the intended default behavior.

Proposed fix

```diff
-  const paretoEnabled = options.paretoEnabled ?? false;
+  const paretoEnabled = options.paretoEnabled ?? true;
@@
-    pareto: { type: "boolean", default: false },
+    pareto: { type: "boolean", default: true },
@@
-    paretoEnabled: values.pareto ?? false,
+    paretoEnabled: values.pareto ?? true,
```

Also applies to: 456-457, 522-523
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@cli/selftune/evolution/evolve.ts` around lines 205 - 206, The code currently uses nullish coalescing to default options.paretoEnabled to false (const paretoEnabled = options.paretoEnabled ?? false), which should default to true; change that expression to default to true (const paretoEnabled = options.paretoEnabled ?? true) and similarly update the other occurrences that use options.paretoEnabled ?? false (e.g., the other paretoEnabled assignments near the candidateCount usage) so all code paths default to enabled; leave candidateCount fallback as-is unless intended to change, and ensure the parameter name options.paretoEnabled is consistently used across functions (e.g., wherever paretoEnabled is derived) so the behavior is uniformly enabled by default.
206-207: ⚠️ Potential issue | 🟠 Major

Validate and clamp `--candidates` before using it to control fan-out.

`Number.parseInt` is used directly; invalid or oversized values can produce `NaN`/unbounded behavior despite help text claiming max 5.

Proposed fix

```diff
-  const candidateCount = options.candidateCount ?? 3;
+  const rawCandidateCount = options.candidateCount ?? 3;
+  const candidateCount = Number.isFinite(rawCandidateCount)
+    ? Math.min(5, Math.max(1, rawCandidateCount))
+    : 3;
@@
-  const result = await evolve({
+  const parsedCandidates = Number.parseInt(values.candidates ?? "3", 10);
+  const candidateCount = Number.isFinite(parsedCandidates)
+    ? Math.min(5, Math.max(1, parsedCandidates))
+    : 3;
+
+  const result = await evolve({
@@
-    candidateCount: Number.parseInt(values.candidates ?? "3", 10),
+    candidateCount,
```

Also applies to: 457-457, 478-478, 523-523
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@cli/selftune/evolution/evolve.ts` around lines 206 - 207, The code reads options.candidateCount (and other CLI numeric flags) using Number.parseInt without validation, which can yield NaN or out-of-range values; update the parsing and use of candidateCount so you parse with Number.parseInt, check for NaN, and then clamp to an allowed range (e.g., min 1, max 5), falling back to the default 3 when invalid; apply the same pattern for the other occurrences referenced (the parse sites at the symbols/options that read candidateCount around lines 457, 478, 523) so fan-out is always bounded and invalid input is handled gracefully.
236-236:⚠️ Potential issue | 🟠 MajorCreated audit entries still omit rollback-required
original_description:payload.Rollback fallback restoration depends on this details format, but both
createdentries currently write generic strings.Proposed fix
- recordAudit(proposal.proposal_id, "created", `Pareto candidate for ${skillName}`); + recordAudit( + proposal.proposal_id, + "created", + `original_description:${currentDescription}\ncontext:Pareto candidate for ${skillName}`, + ); @@ recordAudit( proposal.proposal_id, "created", - `Proposal created for ${skillName} (iteration ${iteration + 1})`, + `original_description:${currentDescription}\ncontext:Proposal created for ${skillName} (iteration ${iteration + 1})`, );Based on learnings: In
cli/selftune/evolution/evolve.ts, the"created"audit entry should store original SKILL.md content prefixed with"original_description:"to support rollback fallback restoration.Also applies to: 296-300
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@cli/selftune/evolution/evolve.ts` at line 236, The "created" audit entries currently log a generic string; modify calls to recordAudit (e.g., where recordAudit(proposal.proposal_id, "created", `Pareto candidate for ${skillName}`) is used) to include the original SKILL.md content prefixed with "original_description:" so rollback can restore it (e.g., build the payload as `original_description:${originalSkillContent}` rather than the generic message). Locate the other similar "created" call(s) (the block around the other recordAudit use mentioned in the review) and apply the same change so both created entries store the original description payload.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@cli/selftune/evolution/pareto.ts`:
- Around line 27-30: The loop over perEntryResults indexes counts with
r.entry.invocation_type which can be an unexpected value; normalize and guard
before using counts. Add an explicit allow-list (e.g., an array or Set like
ALLOWED_INVOCATION_TYPES) and resolve const type =
ALLOWED_INVOCATION_TYPES.has(r.entry.invocation_type) ? r.entry.invocation_type
: "implicit" (or "unknown") before doing counts[type].total++ and
counts[type].passed++; ensure the allow-list and fallback match the keys present
on counts so indexing cannot throw.
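The guard described above can be sketched as a standalone helper; `InvocationType` and `normalizeInvocationType` are illustrative names for this sketch, not taken from the repository:

```typescript
// Illustrative sketch of the allow-list guard; names are hypothetical.
type InvocationType = "explicit" | "implicit" | "contextual" | "negative";

const ALLOWED_INVOCATION_TYPES: ReadonlySet<string> = new Set([
  "explicit",
  "implicit",
  "contextual",
  "negative",
]);

// Resolve any raw value to a safe bucket key before indexing `counts`.
function normalizeInvocationType(raw: string | undefined): InvocationType {
  return raw !== undefined && ALLOWED_INVOCATION_TYPES.has(raw)
    ? (raw as InvocationType)
    : "implicit";
}
```

The loop would then index with `counts[normalizeInvocationType(r.entry.invocation_type)]`, which cannot throw on an unexpected value.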
In `@tests/evolution/extract-patterns.test.ts`:
- Around line 222-333: Add two regression tests for extractFailurePatterns: (1)
verify that when gradingResults contains multiple failure_feedback entries with
the same query string, extractFailurePatterns (used with EvalEntry makeEval)
preserves all feedback entries for that pattern (i.e., patterns[].feedback
length equals number of matching failure_feedback entries); (2) verify
cross-skill isolation by including a gradingResults entry whose skill_name
differs from the target skill passed to extractFailurePatterns and assert that
its failure_feedback is ignored (no attached feedback). Reference
extractFailurePatterns, failure_feedback, skill_name, and makeEval to locate
where to add these tests in the existing
tests/evolution/extract-patterns.test.ts file.
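The two invariants above can be pinned down with a self-contained stand-in for the enrichment step; `groupFeedback` below is a hypothetical sketch of the grouping logic, not the real `extractFailurePatterns` internals:

```typescript
// Hypothetical stand-in for the feedback-enrichment step inside
// extractFailurePatterns; demonstrates the two invariants the regression
// tests should lock in.
interface FailureFeedback {
  query: string;
  failure_reason: string;
  improvement_hint: string;
}

interface GradingResultLike {
  skill_name: string;
  failure_feedback?: FailureFeedback[];
}

function groupFeedback(
  results: GradingResultLike[],
  targetSkill: string,
): Map<string, FailureFeedback[]> {
  const map = new Map<string, FailureFeedback[]>();
  for (const gr of results) {
    if (gr.skill_name !== targetSkill) continue; // cross-skill isolation
    for (const fb of gr.failure_feedback ?? []) {
      const list = map.get(fb.query) ?? [];
      list.push(fb); // duplicates for the same query are preserved
      map.set(fb.query, list);
    }
  }
  return map;
}
```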
In `@tests/grading/grade-session.test.ts`:
- Around line 666-709: Add a new unit test that exercises the CLI merge path in
cliMain (not just assembleResult) to ensure pre-gate and LLM outputs are merged
and that claims, eval_feedback, and failure_feedback propagate; in the test call
cliMain (or the exported function that runs the CLI flow) with mocked pre-gate
and LLM outputs (or stub the functions it calls) producing distinct claims,
eval_feedback, and failure_feedback, then assert the final merged output
contains all three fields correctly (e.g., claims include both sources,
eval_feedback reflects merged content, and failure_feedback is preserved);
reference the cliMain entrypoint and the assembleResult behavior to wire the
mocks so the pre-gate+LLM merge branch is executed.
---
Duplicate comments:
In `@cli/selftune/evolution/evolve.ts`:
- Around line 205-206: The code currently uses nullish coalescing to default
options.paretoEnabled to false (const paretoEnabled = options.paretoEnabled ??
false), which should default to true; change that expression to default to true
(const paretoEnabled = options.paretoEnabled ?? true) and similarly update the
other occurrences that use options.paretoEnabled ?? false (e.g., the other
paretoEnabled assignments near the candidateCount usage) so all code paths
default to enabled; leave candidateCount fallback as-is unless intended to
change, and ensure the parameter name options.paretoEnabled is consistently used
across functions (e.g., wherever paretoEnabled is derived) so the behavior is
uniformly enabled by default.
- Around line 206-207: The code reads options.candidateCount (and other CLI
numeric flags) using Number.parseInt without validation, which can yield NaN or
out-of-range values; update the parsing and use of candidateCount so you parse
with Number.parseInt, check for NaN, and then clamp to an allowed range (e.g.,
min 1, max 5), falling back to the default 3 when invalid; apply the same
pattern for the other occurrences referenced (the parse sites at the
symbols/options that read candidateCount around lines 457, 478, 523) so fan-out
is always bounded and invalid input is handled gracefully.
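The parse-check-clamp pattern described above, as a minimal sketch (the helper name `clampCandidates` is hypothetical):

```typescript
// Hypothetical helper: parse, reject NaN, and clamp to the documented 1-5 range.
function clampCandidates(raw: string | undefined, fallback = 3): number {
  const parsed = Number.parseInt(raw ?? String(fallback), 10);
  if (!Number.isFinite(parsed)) return fallback; // NaN from bad input
  return Math.min(5, Math.max(1, parsed)); // help text claims max 5
}
```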
- Line 236: The "created" audit entries currently log a generic string; modify
calls to recordAudit (e.g., where recordAudit(proposal.proposal_id, "created",
`Pareto candidate for ${skillName}`) is used) to include the original SKILL.md
content prefixed with "original_description:" so rollback can restore it (e.g.,
build the payload as `original_description:${originalSkillContent}` rather than
the generic message). Locate the other similar "created" call(s) (the block
around the other recordAudit use mentioned in the review) and apply the same
change so both created entries store the original description payload.
In `@cli/selftune/evolution/extract-patterns.ts`:
- Around line 149-167: The code currently uses Map<string, FailureFeedback>
which overwrites repeated feedback per query and does not scope feedback by
skill; change feedbackMap to Map<string, FailureFeedback[]> and key entries by a
composite key including skill and query (e.g. `${gr.skill_name}::${fb.query}`)
when populating in the gradingResults loop (push into the array instead of set),
then when iterating allPatterns look up feedback using the composite key
`${pattern.skill_name}::${query}` so all feedback entries for that query+skill
are preserved and feedback is only attached to the matching skill; update
references to feedbackMap, FailureFeedback entries, and pattern.feedback
accordingly.
In `@cli/selftune/evolution/propose-description.ts`:
- Around line 153-175: The batch generation uses Promise.all over variations so
a single failure in callLlm or parseProposalResponse aborts the whole batch;
make generation failure-tolerant by changing the logic in the proposals creation
(the variations.map + Promise.all block) to handle per-candidate errors—wrap
each async map body in try/catch (or use Promise.allSettled) so failed calls
return null or an error marker, then filter out those null/failed entries before
returning proposals; apply this to the code referencing callLlm,
parseProposalResponse, proposal_id generation, and ensure status/confidence
handling remains correct for the surviving candidates.
In `@cli/selftune/evolution/validate-proposal.ts`:
- Around line 133-159: The code builds dynamic invocationScores from
byInvocationType and then unsafely casts it to InvocationTypeScores; instead,
explicitly populate all required invocation dimensions (e.g., the four expected
keys) with default counts/pass_rate if missing and construct a properly typed
InvocationTypeScores object without using "as unknown as InvocationTypeScores".
Locate the aggregation logic around byInvocationType and invocationScores,
create a list/array of the expected invocation dimension names, iterate that
list to ensure each key exists (with passed:0,total:0,pass_rate:0 when absent),
compute pass_rate safely, and return that fully-shaped object directly as
InvocationTypeScores rather than casting.
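Iterating a fixed list of dimension names, as the prompt suggests, yields a fully-shaped object with no cast; the types below mirror the review's description and may differ from the repo's actual definitions:

```typescript
// Sketch under the assumption of four invocation dimensions; types are
// illustrative, not the repository's.
type InvocationType = "explicit" | "implicit" | "contextual" | "negative";
const INVOCATION_TYPES: InvocationType[] = ["explicit", "implicit", "contextual", "negative"];

interface DimScore {
  passed: number;
  total: number;
  pass_rate: number;
}
type InvocationTypeScores = Record<InvocationType, DimScore>;

function toScores(
  counts: Partial<Record<InvocationType, { passed: number; total: number }>>,
): InvocationTypeScores {
  const out = {} as InvocationTypeScores;
  for (const t of INVOCATION_TYPES) {
    const c = counts[t] ?? { passed: 0, total: 0 };
    out[t] = { ...c, pass_rate: c.total > 0 ? c.passed / c.total : 0 };
  }
  return out; // every dimension present, no `as unknown as` cast needed
}
```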
In `@cli/selftune/grading/grade-session.ts`:
- Around line 499-516: The result object construction currently hardcodes empty
claims and eval_feedback and omits failure_feedback, losing LLM feedback; update
the GradingResult creation to copy the LLM-produced fields instead of hardcoding
empties by assigning the actual variables used earlier in the grading flow
(e.g., the claims variable, eval_feedback/ evalFeedback variable, and
failure_feedback/failureFeedback variable) into result.claims,
result.eval_feedback, and result.failure_feedback so downstream evolution steps
receive the model outputs; ensure you preserve the same shapes/types as expected
by buildExecutionMetrics/consumers.
In `@tests/evolution/pareto.test.ts`:
- Around line 205-208: Replace the non-null assertions used when finding test
members: avoid `!` on the results of `frontier.find` and instead assert
definedness and use optional chaining; specifically, when locating the entries
found via `frontier.find((c) => c.proposal.proposal_id === "a")` and the one for
`"b"`, check that the find result is defined (e.g., with an expect or guard)
before accessing properties like `proposal.proposal_id` and `dominates_on`, and
use optional chaining (e.g., `memberA?.dominates_on`) in the subsequent
assertions (`expect(...).toContain(...)`) so the test follows the tests/**/*.ts
guideline disallowing `!`.
In `@tests/grading/pre-gates.test.ts`:
- Around line 40-131: Replace the repeated non-null assertions when looking up
gates with safe optional access: stop using DEFAULT_GATES.find(... )! and
instead assign without the '!' and call methods/properties with optional
chaining (e.g., gate?.pattern.test(...) and gate?.check(...)). This keeps the
test from hiding a missing gate (the expectation will fail if gate is undefined)
and follows the tests/**/*.ts guideline to use optional chaining rather than
non-null assertions on DEFAULT_GATES lookups.
ℹ️ Review info
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
Disabled knowledge base sources:
- Linear integration is disabled
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (15)
- cli/selftune/evolution/evolve.ts
- cli/selftune/evolution/extract-patterns.ts
- cli/selftune/evolution/pareto.ts
- cli/selftune/evolution/propose-description.ts
- cli/selftune/evolution/validate-proposal.ts
- cli/selftune/grading/grade-session.ts
- cli/selftune/types.ts
- tests/evolution/extract-patterns.test.ts
- tests/evolution/pareto.test.ts
- tests/evolution/propose-description.test.ts
- tests/grading/grade-session.test.ts
- tests/grading/pre-gates.test.ts
- tests/sandbox/docker/run-with-llm.ts
- tests/sandbox/run-sandbox.ts
- tests/types/new-types.test.ts
```ts
for (const r of perEntryResults) {
  const type = r.entry.invocation_type ?? "implicit";
  counts[type].total++;
  if (r.after_pass) counts[type].passed++;
```
Guard unknown `invocation_type` before indexing score buckets.

`counts[type]` can throw at runtime if `invocation_type` is an unexpected value; normalize with an explicit allow-list before incrementing.

Proposed fix

```diff
- for (const r of perEntryResults) {
-   const type = r.entry.invocation_type ?? "implicit";
+ for (const r of perEntryResults) {
+   const rawType = r.entry.invocation_type;
+   const type: InvocationType =
+     rawType === "explicit" ||
+     rawType === "implicit" ||
+     rawType === "contextual" ||
+     rawType === "negative"
+       ? rawType
+       : "implicit";
    counts[type].total++;
    if (r.after_pass) counts[type].passed++;
  }
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@cli/selftune/evolution/pareto.ts` around lines 27 - 30, The loop over
perEntryResults indexes counts with r.entry.invocation_type which can be an
unexpected value; normalize and guard before using counts. Add an explicit
allow-list (e.g., an array or Set like ALLOWED_INVOCATION_TYPES) and resolve
const type = ALLOWED_INVOCATION_TYPES.has(r.entry.invocation_type) ?
r.entry.invocation_type : "implicit" (or "unknown") before doing
counts[type].total++ and counts[type].passed++; ensure the allow-list and
fallback match the keys present on counts so indexing cannot throw.
```ts
test("attaches feedback from grading results to patterns", () => {
  const evals: EvalEntry[] = [makeEval("make slides", true, "implicit")];
  const usage: SkillUsageRecord[] = [];
  const gradingResults = [
    {
      session_id: "s1",
      skill_name: "presenter",
      transcript_path: "",
      graded_at: "",
      expectations: [],
      summary: { passed: 0, failed: 0, total: 0, pass_rate: 0 },
      execution_metrics: {
        tool_calls: {},
        total_tool_calls: 0,
        total_steps: 0,
        bash_commands_run: 0,
        errors_encountered: 0,
        skills_triggered: [],
        transcript_chars: 0,
      },
      claims: [],
      eval_feedback: { suggestions: [], overall: "" },
      failure_feedback: [
        {
          query: "make slides",
          failure_reason: "Description lacks slide keywords",
          improvement_hint: "Add presentation triggers",
        },
      ],
    },
  ];

  const patterns = extractFailurePatterns(evals, usage, "presenter", gradingResults);
  expect(patterns.length).toBe(1);
  expect(patterns[0].feedback).toBeDefined();
  expect(patterns[0].feedback?.length).toBe(1);
  expect(patterns[0].feedback?.[0].failure_reason).toBe("Description lacks slide keywords");
});

test("no feedback when grading results have no failure_feedback", () => {
  const evals: EvalEntry[] = [makeEval("make slides", true, "implicit")];
  const gradingResults = [
    {
      session_id: "s1",
      skill_name: "presenter",
      transcript_path: "",
      graded_at: "",
      expectations: [],
      summary: { passed: 0, failed: 0, total: 0, pass_rate: 0 },
      execution_metrics: {
        tool_calls: {},
        total_tool_calls: 0,
        total_steps: 0,
        bash_commands_run: 0,
        errors_encountered: 0,
        skills_triggered: [],
        transcript_chars: 0,
      },
      claims: [],
      eval_feedback: { suggestions: [], overall: "" },
    },
  ];

  const patterns = extractFailurePatterns(evals, [], "presenter", gradingResults);
  expect(patterns[0].feedback).toBeUndefined();
});

test("feedback not attached when no gradingResults provided", () => {
  const evals: EvalEntry[] = [makeEval("make slides", true, "implicit")];
  const patterns = extractFailurePatterns(evals, [], "presenter");
  expect(patterns[0].feedback).toBeUndefined();
});

test("feedback matches by query string", () => {
  const evals: EvalEntry[] = [
    makeEval("make slides", true, "implicit"),
    makeEval("debug python", true, "implicit"),
  ];
  const gradingResults = [
    {
      session_id: "s1",
      skill_name: "presenter",
      transcript_path: "",
      graded_at: "",
      expectations: [],
      summary: { passed: 0, failed: 0, total: 0, pass_rate: 0 },
      execution_metrics: {
        tool_calls: {},
        total_tool_calls: 0,
        total_steps: 0,
        bash_commands_run: 0,
        errors_encountered: 0,
        skills_triggered: [],
        transcript_chars: 0,
      },
      claims: [],
      eval_feedback: { suggestions: [], overall: "" },
      failure_feedback: [
        {
          query: "make slides",
          failure_reason: "Missing slide keywords",
          improvement_hint: "Add slide triggers",
        },
      ],
    },
  ];

  const patterns = extractFailurePatterns(evals, [], "presenter", gradingResults);
  // Only the pattern containing "make slides" should have feedback
  const withFeedback = patterns.filter((p) => p.feedback && p.feedback.length > 0);
  expect(withFeedback.length).toBe(1);
});
```
🧹 Nitpick | 🔵 Trivial
Add regression tests for duplicate-query feedback and cross-skill isolation.
The new tests cover single-entry attachment, but not: (1) preserving multiple feedback entries for one query and (2) ignoring feedback from gradingResults where skill_name !== target skill. Add both to lock enrichment behavior.
As per coding guidelines tests/**/*.ts: Coverage of error paths, not just happy paths.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@tests/evolution/extract-patterns.test.ts` around lines 222 - 333, Add two
regression tests for extractFailurePatterns: (1) verify that when gradingResults
contains multiple failure_feedback entries with the same query string,
extractFailurePatterns (used with EvalEntry makeEval) preserves all feedback
entries for that pattern (i.e., patterns[].feedback length equals number of
matching failure_feedback entries); (2) verify cross-skill isolation by
including a gradingResults entry whose skill_name differs from the target skill
passed to extractFailurePatterns and assert that its failure_feedback is ignored
(no attached feedback). Reference extractFailurePatterns, failure_feedback,
skill_name, and makeEval to locate where to add these tests in the existing
tests/evolution/extract-patterns.test.ts file.
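The two requested cases can be prototyped standalone before wiring them into the suite. This is a hedged sketch: `collectFeedback` is a hypothetical, simplified stand-in for the enrichment step inside `extractFailurePatterns` (the real signature and types live in the repo), showing the duplicate-preserving and skill-isolating behavior the tests would lock in.

```typescript
// Hypothetical stand-in for the feedback-enrichment step: entries are
// grouped per query so duplicates survive, and results from other
// skills are skipped entirely.
interface FailureFeedback {
  query: string;
  failure_reason: string;
  improvement_hint: string;
}

interface GradingResultLike {
  skill_name: string;
  failure_feedback?: FailureFeedback[];
}

function collectFeedback(
  skillName: string,
  gradingResults: GradingResultLike[],
): Map<string, FailureFeedback[]> {
  const byQuery = new Map<string, FailureFeedback[]>();
  for (const gr of gradingResults) {
    if (gr.skill_name !== skillName) continue; // cross-skill isolation
    for (const fb of gr.failure_feedback ?? []) {
      const bucket = byQuery.get(fb.query) ?? [];
      bucket.push(fb); // duplicate entries for the same query are preserved
      byQuery.set(fb.query, bucket);
    }
  }
  return byQuery;
}

// Two feedback entries for the same query, plus one from another skill.
const results: GradingResultLike[] = [
  {
    skill_name: "presenter",
    failure_feedback: [
      { query: "make slides", failure_reason: "Missing slide keywords", improvement_hint: "Add slide triggers" },
      { query: "make slides", failure_reason: "Skill not triggered", improvement_hint: "Add deck keywords" },
    ],
  },
  {
    skill_name: "other-skill",
    failure_feedback: [
      { query: "make slides", failure_reason: "Unrelated", improvement_hint: "Ignore me" },
    ],
  },
];

const feedback = collectFeedback("presenter", results);
console.log(feedback.get("make slides")?.length); // 2 — both entries kept, other skill ignored
```

The two asserted facts (length preserved, foreign skill ignored) map one-to-one onto the regression tests the comment asks for.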
```ts
describe("assembleResult with failure_feedback", () => {
  it("passes through failure_feedback from grader output", () => {
    const graderOutput = {
      expectations: [{ text: "test", passed: false, evidence: "not found" }],
      summary: { passed: 0, failed: 1, total: 1, pass_rate: 0 },
      claims: [],
      eval_feedback: { suggestions: [], overall: "" },
      failure_feedback: [
        {
          query: "make slides",
          failure_reason: "Skill not triggered",
          improvement_hint: "Add slide keywords",
        },
      ],
    };
    const result = assembleResult(
      graderOutput,
      makeTelemetryRecord(),
      "sess-1",
      "pptx",
      "/tmp/t.jsonl",
    );
    expect(result.failure_feedback).toBeDefined();
    expect(result.failure_feedback?.length).toBe(1);
    expect(result.failure_feedback?.[0].query).toBe("make slides");
  });

  it("failure_feedback is undefined when not in grader output", () => {
    const graderOutput = {
      expectations: [],
      summary: { passed: 0, failed: 0, total: 0, pass_rate: 0 },
      claims: [],
      eval_feedback: { suggestions: [], overall: "" },
    };
    const result = assembleResult(
      graderOutput,
      makeTelemetryRecord(),
      "sess-1",
      "pptx",
      "/tmp/t.jsonl",
    );
    expect(result.failure_feedback).toBeUndefined();
  });
});
```
🧹 Nitpick | 🔵 Trivial
Add a regression test for the pre-gate + LLM merge path in cliMain.
These tests verify assembleResult, but they do not exercise the CLI path that merges pre-gate and LLM outputs. Add one focused test for that flow to lock claims, eval_feedback, and failure_feedback propagation.
As per coding guidelines tests/**/*.ts: Coverage of error paths, not just happy paths.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@tests/grading/grade-session.test.ts` around lines 666 - 709, Add a new unit
test that exercises the CLI merge path in cliMain (not just assembleResult) to
ensure pre-gate and LLM outputs are merged and that claims, eval_feedback, and
failure_feedback propagate; in the test call cliMain (or the exported function
that runs the CLI flow) with mocked pre-gate and LLM outputs (or stub the
functions it calls) producing distinct claims, eval_feedback, and
failure_feedback, then assert the final merged output contains all three fields
correctly (e.g., claims include both sources, eval_feedback reflects merged
content, and failure_feedback is preserved); reference the cliMain entrypoint
and the assembleResult behavior to wire the mocks so the pre-gate+LLM merge
branch is executed.
… feedback, Pareto evolution

Implement four high-value eval improvements in parallel:

1. Deterministic Pre-Gates (grading/pre-gates.ts): 4 fast checks that resolve grading expectations without LLM calls (<20ms). Skips LLM entirely when all expectations resolve via pre-gates.
2. Graduated Scoring: 0-1 float scores on all expectations replacing binary pass/fail. GradingSummary includes mean_score and score_std_dev.
3. Rich Failure Feedback: Structured FailureFeedback flows from grader through extract-patterns to propose-description, giving the evolution LLM specific context about what failed and why.
4. Pareto Evolution (evolution/pareto.ts): Multi-candidate proposals with Pareto frontier selection across invocation type dimensions. Complementary candidates can be merged. CLI: --pareto (default true), --candidates N.

All new type fields are optional — zero breaking changes. 239 new tests added.

Docs updated: evolution-pipeline.md, ARCHITECTURE.md, PRD.md, README.md, golden-principles.md, escalation-policy.md, tech-debt-tracker.md.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
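The Pareto frontier selection named in the commit message can be sketched as a small standalone function. This is a hedged illustration, not the PR's actual `evolution/pareto.ts` API: `CandidateScores`, `dominates`, and `paretoFrontier` are assumed names; only the four invocation-type dimensions come from the PR.

```typescript
// Hypothetical sketch of Pareto dominance over per-invocation-type pass rates.
type InvocationType = "explicit" | "implicit" | "contextual" | "negative";

type CandidateScores = Record<InvocationType, number>; // pass rate per dimension

const DIMS: InvocationType[] = ["explicit", "implicit", "contextual", "negative"];

// a dominates b if a is >= on every dimension and strictly > on at least one.
function dominates(a: CandidateScores, b: CandidateScores): boolean {
  let strictlyBetter = false;
  for (const d of DIMS) {
    if (a[d] < b[d]) return false;
    if (a[d] > b[d]) strictlyBetter = true;
  }
  return strictlyBetter;
}

// The Pareto frontier keeps every candidate not dominated by another.
function paretoFrontier(candidates: CandidateScores[]): CandidateScores[] {
  return candidates.filter(
    (c, i) => !candidates.some((other, j) => j !== i && dominates(other, c)),
  );
}

const cands: CandidateScores[] = [
  { explicit: 1.0, implicit: 0.5, contextual: 0.5, negative: 1.0 }, // strong on explicit
  { explicit: 0.5, implicit: 1.0, contextual: 0.5, negative: 1.0 }, // strong on implicit
  { explicit: 0.5, implicit: 0.5, contextual: 0.5, negative: 1.0 }, // dominated by both
];
console.log(paretoFrontier(cands).length); // 2 — complementary candidates survive
```

The two surviving candidates each win on a different dimension, which is exactly the "complementary candidates can be merged" case the commit message describes.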
c39cecb to 0271867
Remove unused imports, fix import sort order, apply formatting rules, and prefix unused variables in sandbox tests. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
0271867 to 97cbc4f
Actionable comments posted: 4
♻️ Duplicate comments (12)
cli/selftune/evolution/pareto.ts (1)
27-30: ⚠️ Potential issue | 🟠 Major

Guard `invocation_type` before indexing score buckets.

`counts[type]` can be undefined at runtime if an unexpected value slips through; normalize to the supported dimensions before incrementing.

Proposed fix

```diff
 for (const r of perEntryResults) {
-  const type = r.entry.invocation_type ?? "implicit";
+  const rawType = r.entry.invocation_type;
+  const type: InvocationType =
+    rawType === "explicit" ||
+    rawType === "implicit" ||
+    rawType === "contextual" ||
+    rawType === "negative"
+      ? rawType
+      : "implicit";
   counts[type].total++;
   if (r.after_pass) counts[type].passed++;
 }
```

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@cli/selftune/evolution/pareto.ts` around lines 27 - 30, The loop over perEntryResults uses r.entry.invocation_type to index into counts (type = r.entry.invocation_type ?? "implicit") but counts[type] may be undefined; before incrementing, normalize and/or validate the key and ensure counts has the supported buckets (e.g., "implicit", "explicit", or others expected) or initialize counts[type] to a zeroed bucket; update the loop around perEntryResults so it maps unknown invocation_type values to a fallback key (or creates counts[type] = { total: 0, passed: 0 } when missing) before doing counts[type].total++ and counts[type].passed++ (referencing perEntryResults, r.entry.invocation_type, and counts).

tests/evolution/extract-patterns.test.ts (1)
222-333: ⚠️ Potential issue | 🟡 Minor

Add regression tests for duplicate-query feedback and cross-skill isolation.

Please add two cases: (1) the same `query` appearing multiple times in `failure_feedback` should preserve all entries, and (2) `gradingResults` where `skill_name !== target skill` should not attach feedback.

As per coding guidelines `tests/**/*.ts`: Coverage of error paths, not just happy paths.

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@tests/evolution/extract-patterns.test.ts` around lines 222 - 333, Add two tests for extractFailurePatterns: (1) create gradingResults where a single result for skill "presenter" has failure_feedback containing the same query string twice with different failure_reason/improvement_hint and assert extractFailurePatterns(evals, [], "presenter", gradingResults) attaches both feedback entries to the matching pattern (feedback.length equals 2 and contains both failure_reason values); (2) create gradingResults where skill_name is not the target (e.g., "other_skill") with failure_feedback for the same query and assert that calling extractFailurePatterns(evals, [], "presenter", gradingResults) does not attach that feedback (pattern.feedback is undefined or empty). Use the existing makeEval helper and reference extractFailurePatterns to locate where to add these tests.

tests/evolution/pareto.test.ts (1)
205-208: ⚠️ Potential issue | 🟡 Minor

Replace non-null assertions in test lookups.

Use explicit defined checks and optional chaining instead of `!` on `find(...)` results.

Proposed fix

```diff
-  const memberA = frontier.find((c) => c.proposal.proposal_id === "a")!;
-  const memberB = frontier.find((c) => c.proposal.proposal_id === "b")!;
-  expect(memberA.dominates_on).toContain("explicit");
-  expect(memberB.dominates_on).toContain("implicit");
+  const memberA = frontier.find((c) => c.proposal.proposal_id === "a");
+  const memberB = frontier.find((c) => c.proposal.proposal_id === "b");
+  expect(memberA).toBeDefined();
+  expect(memberB).toBeDefined();
+  expect(memberA?.dominates_on).toContain("explicit");
+  expect(memberB?.dominates_on).toContain("implicit");
```

As per coding guidelines `tests/**/*.ts`: Optional chaining (`result?.`) not non-null assertions (`result!`).

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@tests/evolution/pareto.test.ts` around lines 205 - 208, Replace the non-null assertions on the find results in tests/evolution/pareto.test.ts: remove the trailing "!" from the memberA/memberB assignments (from frontier.find(...)) and instead add explicit existence checks and optional chaining when accessing properties — e.g., save const memberA = frontier.find(c => c.proposal.proposal_id === "a"); do expect(memberA).toBeDefined(); then use expect(memberA?.dominates_on).toContain("explicit") (and same pattern for memberB) so the test uses optional chaining and explicit defined assertions rather than non-null assertions.

docs/exec-plans/tech-debt-tracker.md (1)
20-20: ⚠️ Potential issue | 🟡 Minor

Fix TD-012 description grammar for clarity.

The phrase is still awkward; use “fails to send initial data event intermittently” for readability.

Proposed fix

```diff
-| TD-012 | Dashboard server test (`tests/dashboard/dashboard-server.test.ts`) is flaky — `GET /api/events` sends initial data event fails intermittently with `null` response | Testing | Medium | — | Open | 2026-03-03 | 2026-03-03 |
+| TD-012 | Dashboard server test (`tests/dashboard/dashboard-server.test.ts`) is flaky — `GET /api/events` fails to send initial data event intermittently with `null` response | Testing | Medium | — | Open | 2026-03-03 | 2026-03-03 |
```

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@docs/exec-plans/tech-debt-tracker.md` at line 20, Update the TD-012 row in the tech-debt tracker so the description reads “Dashboard server test (tests/dashboard/dashboard-server.test.ts) is flaky — GET /api/events fails to send initial data event intermittently” instead of the current awkward phrasing; edit the string for TD-012 in docs/exec-plans/tech-debt-tracker.md and ensure the test path `tests/dashboard/dashboard-server.test.ts` and endpoint `GET /api/events` remain present for clarity.

cli/selftune/evolution/extract-patterns.ts (1)
149-167: ⚠️ Potential issue | 🟠 Major

Preserve all feedback entries per query and isolate by `skill_name`.

Current enrichment overwrites duplicate feedback for the same query and can attach feedback from unrelated skills.

Proposed fix

```diff
-  if (gradingResults && gradingResults.length > 0) {
-    const feedbackMap = new Map<string, FailureFeedback>();
+  if (gradingResults && gradingResults.length > 0) {
+    const feedbackMap = new Map<string, FailureFeedback[]>();
     for (const gr of gradingResults) {
+      if (gr.skill_name !== skillName) continue;
       if (gr.failure_feedback) {
         for (const fb of gr.failure_feedback) {
-          feedbackMap.set(fb.query, fb);
+          const existing = feedbackMap.get(fb.query) ?? [];
+          existing.push(fb);
+          feedbackMap.set(fb.query, existing);
         }
       }
     }
     for (const pattern of allPatterns) {
       const matchingFeedback: FailureFeedback[] = [];
       for (const query of pattern.missed_queries) {
-        const fb = feedbackMap.get(query);
-        if (fb) matchingFeedback.push(fb);
+        const fbList = feedbackMap.get(query);
+        if (fbList) matchingFeedback.push(...fbList);
       }
       if (matchingFeedback.length > 0) {
         pattern.feedback = matchingFeedback;
       }
     }
   }
```

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@cli/selftune/evolution/extract-patterns.ts` around lines 149 - 167, The code currently uses feedbackMap: Map<string, FailureFeedback> which overwrites duplicate feedback for the same query and can attach feedback from other skills; change the approach to preserve all entries and isolate by skill: build feedbackMap as Map<string, FailureFeedback[]> (or Map<string, Map<string, FailureFeedback[]>>) and push each fb into an array keyed by fb.query, then when enriching each pattern (use pattern.missed_queries and pattern.skill_name) collect matchingFeedback by retrieving the array for the query and filtering entries where fb.skill_name === pattern.skill_name (or use the nested-skill key) and append all matching fb items to pattern.feedback (ensuring pattern.feedback becomes an array of FailureFeedback).

cli/selftune/evolution/evolve.ts (3)
236-236: ⚠️ Potential issue | 🟠 Major

Include rollback payload in `created` audit details.

Line 236 and Line 299 log message-only `created` details. Rollback fallback expects `original_description:` in the details payload.

Proposed fix

```diff
-  recordAudit(proposal.proposal_id, "created", `Pareto candidate for ${skillName}`);
+  recordAudit(
+    proposal.proposal_id,
+    "created",
+    `original_description:${currentDescription}\ncontext:Pareto candidate for ${skillName}`,
+  );
@@
-  recordAudit(
-    proposal.proposal_id,
-    "created",
-    `Proposal created for ${skillName} (iteration ${iteration + 1})`,
-  );
+  recordAudit(
+    proposal.proposal_id,
+    "created",
+    `original_description:${currentDescription}\ncontext:Proposal created for ${skillName} (iteration ${iteration + 1})`,
+  );
```

Based on learnings: In `cli/selftune/evolution/evolve.ts`, the "created" audit entry should store the original SKILL.md content with the prefix `"original_description:"` in the details field to support rollback fallback restoration.

Also applies to: 296-300
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@cli/selftune/evolution/evolve.ts` at line 236, The "created" audit entries currently call recordAudit(proposal.proposal_id, "created", `Pareto candidate for ${skillName}`) (and a second similar call later) with only a message, but the rollback fallback expects the details payload to contain the original SKILL.md content prefixed with "original_description:". Update both recordAudit calls that use event "created" (the ones around recordAudit(proposal.proposal_id, "created", ... ) and the later duplicate call) to include the original SKILL.md content in the details string, e.g. `"original_description:" + originalSkillMd + " | Pareto candidate for " + skillName`, so the details field contains the required rollback payload while preserving the human-readable message.
205-206: ⚠️ Potential issue | 🟠 Major

Enable Pareto by default in runtime and CLI wiring.

Line 205, Line 456, and Line 522 still default `pareto` to `false`, which disables the new Pareto path unless explicitly enabled.

Proposed fix

```diff
-  const paretoEnabled = options.paretoEnabled ?? false;
+  const paretoEnabled = options.paretoEnabled ?? true;
@@
-  pareto: { type: "boolean", default: false },
+  pareto: { type: "boolean", default: true },
@@
-  paretoEnabled: values.pareto ?? false,
+  paretoEnabled: values.pareto ?? true,
```

Also applies to: 456-457, 522-523
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@cli/selftune/evolution/evolve.ts` around lines 205 - 206, The pareto flag is being defaulted to false in multiple places, so update each occurrence to enable Pareto by default: change the initialization expressions that set paretoEnabled (e.g., const paretoEnabled = options.paretoEnabled ?? false;) and any analogous variables at the other sites (the same pattern around the CLI/runtime wiring and candidateCount locations) to default to true (use ?? true or a default parameter) so Pareto is enabled unless explicitly set to false; ensure you update all three occurrences referenced (the paretoEnabled initializers and any matching option destructures) so behavior is consistent.
206-206: ⚠️ Potential issue | 🟠 Major

Sanitize `candidateCount` before controlling parallel proposal fan-out.

Line 523 parses raw input without NaN/range checks, and Line 206 trusts caller input. Invalid values can silently bypass Pareto mode or trigger excessive parallel LLM calls despite documented max=5.

Proposed fix

```diff
-  const candidateCount = options.candidateCount ?? 3;
+  const rawCandidateCount = options.candidateCount ?? 3;
+  const candidateCount = Number.isFinite(rawCandidateCount)
+    ? Math.min(5, Math.max(1, Math.trunc(rawCandidateCount)))
+    : 3;
@@
-  const result = await evolve({
+  const parsedCandidates = Number.parseInt(values.candidates ?? "3", 10);
+  const candidateCount = Number.isFinite(parsedCandidates)
+    ? Math.min(5, Math.max(1, parsedCandidates))
+    : 3;
+
+  const result = await evolve({
@@
-  candidateCount: Number.parseInt(values.candidates ?? "3", 10),
+  candidateCount,
```

Also applies to: 457-457, 523-523
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@cli/selftune/evolution/evolve.ts` at line 206, Sanitize and clamp candidateCount before using it to control parallel proposal fan-out: validate options.candidateCount (the candidateCount variable) by coercing to a safe integer (e.g., Math.floor(Number(...)) or parseInt), default to 3 when NaN/invalid, and clamp to the allowed range [1, 5] (use a MAX_CANDIDATES constant if present) so invalid/too-large values cannot bypass Pareto mode or spawn excessive LLM calls; apply the same sanitization wherever raw input is parsed (the parse logic referenced around the parse step) and replace direct trusts of options.candidateCount with the sanitized value.

cli/selftune/evolution/validate-proposal.ts (1)
9-9: ⚠️ Potential issue | 🟠 Major

Build a fully typed 4-dimension `InvocationTypeScores` object (remove double-cast).

Line 134 and Line 158 use dynamic string-keyed maps plus `as unknown as InvocationTypeScores`. This can omit required dimensions (explicit|implicit|contextual|negative) and mask type errors.

Proposed fix

```diff
-import type { EvalEntry, EvolutionProposal, InvocationTypeScores } from "../types.js";
+import type {
+  EvalEntry,
+  EvolutionProposal,
+  InvocationType,
+  InvocationTypeScores,
+} from "../types.js";
@@
-  const byInvocationType: Record<string, { passed: number; total: number }> = {};
+  const byInvocationType: Record<InvocationType, { passed: number; total: number }> = {
+    explicit: { passed: 0, total: 0 },
+    implicit: { passed: 0, total: 0 },
+    contextual: { passed: 0, total: 0 },
+    negative: { passed: 0, total: 0 },
+  };
   for (const r of perEntryResults) {
     const type = r.entry.invocation_type ?? "implicit";
-    if (!byInvocationType[type]) byInvocationType[type] = { passed: 0, total: 0 };
     byInvocationType[type].total++;
     if (r.after_pass) byInvocationType[type].passed++;
   }
-  const invocationScores: Record<string, { passed: number; total: number; pass_rate: number }> = {};
-  for (const [type, counts] of Object.entries(byInvocationType)) {
-    invocationScores[type] = {
-      ...counts,
-      pass_rate: counts.total > 0 ? counts.passed / counts.total : 0,
-    };
-  }
+  const invocationScores: InvocationTypeScores = {
+    explicit: {
+      ...byInvocationType.explicit,
+      pass_rate:
+        byInvocationType.explicit.total > 0
+          ? byInvocationType.explicit.passed / byInvocationType.explicit.total
+          : 0,
+    },
+    implicit: {
+      ...byInvocationType.implicit,
+      pass_rate:
+        byInvocationType.implicit.total > 0
+          ? byInvocationType.implicit.passed / byInvocationType.implicit.total
+          : 0,
+    },
+    contextual: {
+      ...byInvocationType.contextual,
+      pass_rate:
+        byInvocationType.contextual.total > 0
+          ? byInvocationType.contextual.passed / byInvocationType.contextual.total
+          : 0,
+    },
+    negative: {
+      ...byInvocationType.negative,
+      pass_rate:
+        byInvocationType.negative.total > 0
+          ? byInvocationType.negative.passed / byInvocationType.negative.total
+          : 0,
+    },
+  };
@@
-  by_invocation_type: invocationScores as unknown as InvocationTypeScores,
+  by_invocation_type: invocationScores,
```

```bash
#!/bin/bash
# Confirm unsafe cast/dynamic-key pattern still exists and where fixed-key consumers are used.
rg -nP --type=ts 'as unknown as InvocationTypeScores|Record<string,\s*\{ passed: number; total: number \}>' cli/selftune/evolution/validate-proposal.ts cli/selftune/evolution/pareto.ts
rg -nP --type=ts '\b(invocation_scores|by_invocation_type)\.(explicit|implicit|contextual|negative)\b' cli/selftune
```

Also applies to: 133-159
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@cli/selftune/evolution/validate-proposal.ts` at line 9, The code is constructing invocation score objects using dynamic string keys and unsafe casts (as unknown as InvocationTypeScores) which can omit required dimensions; replace the dynamic maps with a fully-typed 4-field object literal that explicitly includes explicit, implicit, contextual, and negative keys each with shape { passed: number; total: number } and remove the double-cast; update the places building invocation_scores / by_invocation_type in validate-proposal.ts (and similarly in pareto.ts if present) so they construct and return an InvocationTypeScores object directly (e.g., createInvocationScores or inline object) to ensure the compiler enforces all four dimensions and eliminate Record<string,...> / as unknown as InvocationTypeScores casts.

tests/grading/pre-gates.test.ts (1)
40-40: ⚠️ Potential issue | 🟠 Major

Remove non-null assertions from gate lookups.

These `.find(...)!` usages violate the test guideline and make failures less diagnosable than an explicit defined-check path.

Proposed fix

```diff
+function getGate(name: PreGate["name"]): PreGate {
+  const gate = DEFAULT_GATES.find((g) => g.name === name);
+  expect(gate).toBeDefined();
+  if (!gate) throw new Error(`Missing default gate: ${name}`);
+  return gate;
+}
@@
-  const gate = DEFAULT_GATES.find((g) => g.name === "skill_md_read")!;
+  const gate = getGate("skill_md_read");
```

As per coding guidelines `tests/**/*.ts`: "Optional chaining (result?.) not non-null assertions (result!.)."

```bash
#!/bin/bash
# Verify there are no remaining non-null assertions on DEFAULT_GATES lookups.
rg -nP --type=ts 'DEFAULT_GATES\.find\([^)]*\)!' tests/grading/pre-gates.test.ts
```

Also applies to: 46-46, 51-51, 57-57, 63-63, 75-75, 81-81, 90-90, 99-99, 104-104, 112-112, 117-117, 125-125, 130-130
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/grading/pre-gates.test.ts` at line 40, Replace the non-null assertion usage on DEFAULT_GATES.find(...) (e.g., const gate = DEFAULT_GATES.find((g) => g.name === "skill_md_read")!) with an explicit defined-check: call DEFAULT_GATES.find(...) into a const (e.g., gate), assert it is present using the test framework (e.g., expect(gate).toBeDefined()) or throw a clear Error if missing, then continue using the now-checked variable; reference the DEFAULT_GATES lookup and the local variable name (gate) so the change is applied for every occurrence of .find(...)! in this test file.

cli/selftune/evolution/propose-description.ts (1)
153-178: ⚠️ Potential issue | 🟠 Major

One failed LLM call aborts all candidate generation.

`Promise.all` rejects the entire batch on the first failure. In Pareto mode this drops otherwise valid candidates and can fail evolution unnecessarily.

Proposed fix using Promise.allSettled

```diff
-  const proposals = await Promise.all(
+  const settled = await Promise.allSettled(
     variations.map(async (prompt, i) => {
       const rawResponse = await callLlm(PROPOSER_SYSTEM, prompt, agent);
       const { proposed_description, rationale, confidence } = parseProposalResponse(rawResponse);
       return {
         proposal_id: `evo-${skillName}-${Date.now()}-${i}`,
         skill_name: skillName,
         skill_path: skillPath,
         original_description: currentDescription,
         proposed_description,
         rationale,
         failure_patterns: failurePatterns.map((p) => p.pattern_id),
         eval_results: {
           before: { total: 0, passed: 0, failed: 0, pass_rate: 0 },
           after: { total: 0, passed: 0, failed: 0, pass_rate: 0 },
         },
         confidence,
         created_at: new Date().toISOString(),
         status: "pending" as const,
       };
     }),
   );
-  return proposals;
+  const proposals = settled
+    .filter((r): r is PromiseFulfilledResult<EvolutionProposal> => r.status === "fulfilled")
+    .map((r) => r.value);
+
+  if (proposals.length === 0) {
+    throw new Error("All proposal generation attempts failed");
+  }
+
+  return proposals;
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@cli/selftune/evolution/propose-description.ts` around lines 153 - 178, The current batch LLM calls use Promise.all which aborts all candidate generation if any single call fails; replace the Promise.all wrapping the variations.map calls with Promise.allSettled, then filter the settled results for status === "fulfilled" and extract the value for successful proposals while logging/skipping rejected results (including error info from the rejection) so that propose-description's generation loop (the variations.map async block that calls callLlm and parseProposalResponse) returns only the successful proposal objects; ensure the returned array preserves indices or assigns unique proposal_id as before and does not include undefined entries from failures.

cli/selftune/grading/grade-session.ts (1)
499-516: ⚠️ Potential issue | 🟠 Major

LLM `claims`, `eval_feedback`, and `failure_feedback` are dropped in CLI output.

Lines 514-515 hardcode empty values for `claims` and `eval_feedback`, and `failure_feedback` is not propagated from `graderOutput`. This was flagged in a previous review.

Proposed fix to propagate LLM feedback fields

```diff
+  // Capture graderOutput for feedback fields when LLM path is taken
+  let llmOutput: GraderOutput | undefined;
+
   if (preGateResult.remaining.length === 0) {
     // ...existing code...
   } else {
     // ...existing code...
     let graderOutput: GraderOutput;
     try {
       graderOutput = await gradeViaAgent(prompt, agent);
+      llmOutput = graderOutput;
     } catch (e) {
       // ...existing code...
     }
     // ...existing code...
   }

   const result: GradingResult = {
     // ...existing fields...
-    claims: [],
-    eval_feedback: { suggestions: [], overall: "" },
+    claims: llmOutput?.claims ?? [],
+    eval_feedback: llmOutput?.eval_feedback ?? { suggestions: [], overall: "" },
+    failure_feedback: llmOutput?.failure_feedback,
   };
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@cli/selftune/grading/grade-session.ts` around lines 499 - 516, The GradingResult construction currently hardcodes empty claims and eval_feedback and omits failure_feedback; update the result object (the const result: GradingResult block) to populate claims, eval_feedback, and failure_feedback from the grader output (e.g., use graderOutput.claims || graderOutput?.llm_claims as available, graderOutput.eval_feedback || graderOutput.feedback, and graderOutput.failure_feedback or graderOutput.failureFeedback), falling back to empty defaults only if those fields are undefined; ensure the fields names match the GradingResult interface so downstream CLI output receives the LLM-provided values.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@cli/selftune/grading/pre-gates.ts`:
- Line 35: The gate for skill_md_read uses a single-order regex (pattern:
/skill\.md.*read/i) that misses phrases like "read SKILL.md"; update the pattern
used by the skill_md_read gate (the pattern property in pre-gates.ts) to be
order-agnostic by matching either "skill.md ... read" or "read ... skill.md"
(e.g., use an alternation-based regex such as one matching
(skill\.md.*read|read.*skill\.md) with case-insensitive flag) so both ordering
variants are accepted; update any related tests to cover both orders.
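The order-agnostic alternation the prompt describes can be sanity-checked in isolation. This is a hedged sketch: the constant name `skillMdRead` is illustrative, not the actual gate definition in `pre-gates.ts`.

```typescript
// Hypothetical order-agnostic pattern: matches "skill.md ... read" or
// "read ... skill.md", case-insensitively.
const skillMdRead = /(skill\.md.*read|read.*skill\.md)/i;

console.log(skillMdRead.test("I will read SKILL.md first"));  // true
console.log(skillMdRead.test("SKILL.md was read at startup")); // true
console.log(skillMdRead.test("read the README"));              // false
```

Both word orders now pass, while unrelated "read" mentions without `skill.md` still fail the gate.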
In `@docs/golden-principles.md`:
- Around line 82-90: The ordered lists in the "Pre-gates before LLM grading",
"Graduated scores over binary pass/fail", and "Pareto frontier for
multi-candidate selection" sections use absolute numbering (13,14,15) which
triggers MD029; change these list prefixes to local sequential numbering (1.,
2., 3.) or add a markdownlint exception for this file/section if absolute
numbering is required, making the edits in the corresponding list blocks so that
the blocks at lines referenced (the three headings and the later block at
93-103) no longer violate MD029.
In `@PRD.md`:
- Around line 139-140: The Markdown heading "### Skill Evolution" lacks the
required blank line below it (MD022); edit the PRD.md content at the "### Skill
Evolution" header and insert a single blank line immediately after that heading
so the paragraph that begins "Runs the description improvement loop..." is
separated by one empty line from the heading.
In `@tests/grading/grade-session.test.ts`:
- Around line 651-660: Duplicate assertion on
GRADER_SYSTEM.contains("failure_feedback") appears in the two tests inside the
"failure feedback in GRADER_SYSTEM" block; remove the redundant check from the
second test (the one that also asserts "improvement_hint") so that the first
test verifies presence of "failure_feedback" and the second only asserts
"improvement_hint", referencing the GRADER_SYSTEM symbol and the two it() blocks
in the describe.
---
Duplicate comments:
In `@cli/selftune/evolution/evolve.ts`:
- Line 236: The "created" audit entries currently call
recordAudit(proposal.proposal_id, "created", `Pareto candidate for
${skillName}`) (and a second similar call later) with only a message, but the
rollback fallback expects the details payload to contain the original SKILL.md
content prefixed with "original_description:". Update both recordAudit calls
that use event "created" (the ones around recordAudit(proposal.proposal_id,
"created", ... ) and the later duplicate call) to include the original SKILL.md
content in the details string, e.g. `"original_description:" + originalSkillMd +
" | Pareto candidate for " + skillName`, so the details field contains the
required rollback payload while preserving the human-readable message.
- Around line 205-206: The pareto flag is being defaulted to false in multiple
places, so update each occurrence to enable Pareto by default: change the
initialization expressions that set paretoEnabled (e.g., const paretoEnabled =
options.paretoEnabled ?? false;) and any analogous variables at the other sites
(the same pattern around the CLI/runtime wiring and candidateCount locations) to
default to true (use ?? true or a default parameter) so Pareto is enabled unless
explicitly set to false; ensure you update all three occurrences referenced (the
paretoEnabled initializers and any matching option destructures) so behavior is
consistent.
- Line 206: Sanitize and clamp candidateCount before using it to control
parallel proposal fan-out: validate options.candidateCount (the candidateCount
variable) by coercing to a safe integer (e.g., Math.floor(Number(...)) or
parseInt), default to 3 when NaN/invalid, and clamp to the allowed range [1, 5]
(use a MAX_CANDIDATES constant if present) so invalid/too-large values cannot
bypass Pareto mode or spawn excessive LLM calls; apply the same sanitization
wherever raw input is parsed (the parse logic referenced around the parse step)
and replace direct trusts of options.candidateCount with the sanitized value.
In `@cli/selftune/evolution/extract-patterns.ts`:
- Around line 149-167: The code currently uses feedbackMap: Map<string,
FailureFeedback> which overwrites duplicate feedback for the same query and can
attach feedback from other skills; change the approach to preserve all entries
and isolate by skill: build feedbackMap as Map<string, FailureFeedback[]> (or
Map<string, Map<string, FailureFeedback[]>>) and push each fb into an array
keyed by fb.query, then when enriching each pattern (use pattern.missed_queries
and pattern.skill_name) collect matchingFeedback by retrieving the array for the
query and filtering entries where fb.skill_name === pattern.skill_name (or use
the nested-skill key) and append all matching fb items to pattern.feedback
(ensuring pattern.feedback becomes an array of FailureFeedback).
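The grouped-by-query map described above can be sketched like this, assuming a simplified `FailureFeedback` shape; the helper names (`groupFeedback`, `feedbackForPattern`) are hypothetical, not functions from `extract-patterns.ts`:

```typescript
interface FailureFeedback {
  query: string;
  skill_name: string;
  failure_reason: string;
}

// Push every entry into an array keyed by query, so duplicates survive.
function groupFeedback(all: FailureFeedback[]): Map<string, FailureFeedback[]> {
  const map = new Map<string, FailureFeedback[]>();
  for (const fb of all) {
    const list = map.get(fb.query) ?? [];
    list.push(fb);
    map.set(fb.query, list);
  }
  return map;
}

// When enriching a pattern, keep only feedback for the matching skill.
function feedbackForPattern(
  map: Map<string, FailureFeedback[]>,
  missedQueries: string[],
  skillName: string,
): FailureFeedback[] {
  return missedQueries.flatMap((q) =>
    (map.get(q) ?? []).filter((fb) => fb.skill_name === skillName),
  );
}
```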
In `@cli/selftune/evolution/pareto.ts`:
- Around line 27-30: The loop over perEntryResults uses r.entry.invocation_type
to index into counts (type = r.entry.invocation_type ?? "implicit") but
counts[type] may be undefined; before incrementing, normalize and/or validate
the key and ensure counts has the supported buckets (e.g., "implicit",
"explicit", or others expected) or initialize counts[type] to a zeroed bucket;
update the loop around perEntryResults so it maps unknown invocation_type values
to a fallback key (or creates counts[type] = { total: 0, passed: 0 } when
missing) before doing counts[type].total++ and counts[type].passed++
(referencing perEntryResults, r.entry.invocation_type, and counts).
In `@cli/selftune/evolution/propose-description.ts`:
- Around line 153-178: The current batch LLM calls use Promise.all which aborts
all candidate generation if any single call fails; replace the Promise.all
wrapping the variations.map calls with Promise.allSettled, then filter the
settled results for status === "fulfilled" and extract the value for successful
proposals while logging/skipping rejected results (including error info from the
rejection) so that propose-description's generation loop (the variations.map
async block that calls callLlm and parseProposalResponse) returns only the
successful proposal objects; ensure the returned array preserves indices or
assigns unique proposal_id as before and does not include undefined entries from
failures.
In `@cli/selftune/evolution/validate-proposal.ts`:
- Line 9: The code is constructing invocation score objects using dynamic string
keys and unsafe casts (as unknown as InvocationTypeScores) which can omit
required dimensions; replace the dynamic maps with a fully-typed 4-field object
literal that explicitly includes explicit, implicit, contextual, and negative
keys each with shape { passed: number; total: number } and remove the
double-cast; update the places building invocation_scores / by_invocation_type
in validate-proposal.ts (and similarly in pareto.ts if present) so they
construct and return an InvocationTypeScores object directly (e.g.,
createInvocationScores or inline object) to ensure the compiler enforces all
four dimensions and eliminate Record<string,...> / as unknown as
InvocationTypeScores casts.
In `@cli/selftune/grading/grade-session.ts`:
- Around line 499-516: The GradingResult construction currently hardcodes empty
claims and eval_feedback and omits failure_feedback; update the result object
(the const result: GradingResult block) to populate claims, eval_feedback, and
failure_feedback from the grader output (e.g., use graderOutput.claims ||
graderOutput?.llm_claims as available, graderOutput.eval_feedback ||
graderOutput.feedback, and graderOutput.failure_feedback or
graderOutput.failureFeedback), falling back to empty defaults only if those
fields are undefined; ensure the fields names match the GradingResult interface
so downstream CLI output receives the LLM-provided values.
In `@docs/exec-plans/tech-debt-tracker.md`:
- Line 20: Update the TD-012 row in the tech-debt tracker so the description
reads “Dashboard server test (tests/dashboard/dashboard-server.test.ts) is flaky
— GET /api/events fails to send initial data event intermittently” instead of
the current awkward phrasing; edit the string for TD-012 in
docs/exec-plans/tech-debt-tracker.md and ensure the test path
`tests/dashboard/dashboard-server.test.ts` and endpoint `GET /api/events` remain
present for clarity.
In `@tests/evolution/extract-patterns.test.ts`:
- Around line 222-333: Add two tests for extractFailurePatterns: (1) create
gradingResults where a single result for skill "presenter" has failure_feedback
containing the same query string twice with different
failure_reason/improvement_hint and assert extractFailurePatterns(evals, [],
"presenter", gradingResults) attaches both feedback entries to the matching
pattern (feedback.length equals 2 and contains both failure_reason values); (2)
create gradingResults where skill_name is not the target (e.g., "other_skill")
with failure_feedback for the same query and assert that calling
extractFailurePatterns(evals, [], "presenter", gradingResults) does not attach
that feedback (pattern.feedback is undefined or empty). Use the existing
makeEval helper and reference extractFailurePatterns to locate where to add
these tests.
In `@tests/evolution/pareto.test.ts`:
- Around line 205-208: Replace the non-null assertions on the find results in
tests/evolution/pareto.test.ts: remove the trailing "!" from the memberA/memberB
assignments (from frontier.find(...)) and instead add explicit existence checks
and optional chaining when accessing properties — e.g., save const memberA =
frontier.find(c => c.proposal.proposal_id === "a"); do
expect(memberA).toBeDefined(); then use
expect(memberA?.dominates_on).toContain("explicit") (and same pattern for
memberB) so the test uses optional chaining and explicit defined assertions
rather than non-null assertions.
In `@tests/grading/pre-gates.test.ts`:
- Line 40: Replace the non-null assertion usage on DEFAULT_GATES.find(...)
(e.g., const gate = DEFAULT_GATES.find((g) => g.name === "skill_md_read")!) with
an explicit defined-check: call DEFAULT_GATES.find(...) into a const (e.g.,
gate), assert it is present using the test framework (e.g.,
expect(gate).toBeDefined()) or throw a clear Error if missing, then continue
using the now-checked variable; reference the DEFAULT_GATES lookup and the local
variable name (gate) so the change is applied for every occurrence of
.find(...)! in this test file.
ℹ️ Review info
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
Disabled knowledge base sources:
- Linear integration is disabled
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (24)

- ARCHITECTURE.md
- PRD.md
- README.md
- cli/selftune/evolution/evolve.ts
- cli/selftune/evolution/extract-patterns.ts
- cli/selftune/evolution/pareto.ts
- cli/selftune/evolution/propose-description.ts
- cli/selftune/evolution/validate-proposal.ts
- cli/selftune/grading/grade-session.ts
- cli/selftune/grading/pre-gates.ts
- cli/selftune/types.ts
- docs/design-docs/evolution-pipeline.md
- docs/design-docs/index.md
- docs/escalation-policy.md
- docs/exec-plans/tech-debt-tracker.md
- docs/golden-principles.md
- tests/evolution/extract-patterns.test.ts
- tests/evolution/pareto.test.ts
- tests/evolution/propose-description.test.ts
- tests/grading/grade-session.test.ts
- tests/grading/pre-gates.test.ts
- tests/sandbox/docker/run-with-llm.ts
- tests/sandbox/run-sandbox.ts
- tests/types/new-types.test.ts
♻️ Duplicate comments (12)
cli/selftune/evolution/validate-proposal.ts (1)

133-159: ⚠️ Potential issue | 🔴 Critical

Build a fully-typed 4-dimension score object; remove unsafe double-cast.

Line 158 forces `InvocationTypeScores` even though lines 142-148 only materialize observed keys. Missing dimensions (`explicit` | `implicit` | `contextual` | `negative`) can propagate `undefined` into downstream Pareto logic.

Proposed fix:

```diff
-import type { EvalEntry, EvolutionProposal, InvocationTypeScores } from "../types.js";
+import type {
+  EvalEntry,
+  EvolutionProposal,
+  InvocationType,
+  InvocationTypeScores,
+} from "../types.js";

-  const byInvocationType: Record<string, { passed: number; total: number }> = {};
+  const byInvocationType: Record<InvocationType, { passed: number; total: number }> = {
+    explicit: { passed: 0, total: 0 },
+    implicit: { passed: 0, total: 0 },
+    contextual: { passed: 0, total: 0 },
+    negative: { passed: 0, total: 0 },
+  };
   for (const r of perEntryResults) {
-    const type = r.entry.invocation_type ?? "implicit";
-    if (!byInvocationType[type]) byInvocationType[type] = { passed: 0, total: 0 };
+    const rawType = r.entry.invocation_type;
+    const type: InvocationType =
+      rawType === "explicit" ||
+      rawType === "implicit" ||
+      rawType === "contextual" ||
+      rawType === "negative"
+        ? rawType
+        : "implicit";
     byInvocationType[type].total++;
     if (r.after_pass) byInvocationType[type].passed++;
   }

-  const invocationScores: Record<string, { passed: number; total: number; pass_rate: number }> = {};
-  for (const [type, counts] of Object.entries(byInvocationType)) {
-    invocationScores[type] = {
-      ...counts,
-      pass_rate: counts.total > 0 ? counts.passed / counts.total : 0,
-    };
-  }
+  const invocationScores: InvocationTypeScores = {
+    explicit: {
+      ...byInvocationType.explicit,
+      pass_rate:
+        byInvocationType.explicit.total > 0
+          ? byInvocationType.explicit.passed / byInvocationType.explicit.total
+          : 0,
+    },
+    implicit: {
+      ...byInvocationType.implicit,
+      pass_rate:
+        byInvocationType.implicit.total > 0
+          ? byInvocationType.implicit.passed / byInvocationType.implicit.total
+          : 0,
+    },
+    contextual: {
+      ...byInvocationType.contextual,
+      pass_rate:
+        byInvocationType.contextual.total > 0
+          ? byInvocationType.contextual.passed / byInvocationType.contextual.total
+          : 0,
+    },
+    negative: {
+      ...byInvocationType.negative,
+      pass_rate:
+        byInvocationType.negative.total > 0
+          ? byInvocationType.negative.passed / byInvocationType.negative.total
+          : 0,
+    },
+  };

-    by_invocation_type: invocationScores as unknown as InvocationTypeScores,
+    by_invocation_type: invocationScores,
```

As per coding guidelines: `cli/**/*.ts` requires "Strict typing (no `any`, prefer `unknown` with type guards)".

```shell
#!/bin/bash
# Verify unsafe casting/stringly score maps are removed.
rg -nP --type=ts 'as unknown as InvocationTypeScores|Record<string,\s*\{\s*passed:\s*number;\s*total:\s*number' \
  cli/selftune/evolution/validate-proposal.ts cli/selftune/evolution/pareto.ts
```

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@cli/selftune/evolution/validate-proposal.ts` around lines 133-159: the current logic builds `invocationScores` from the observed keys in `byInvocationType` and then force-casts it to `InvocationTypeScores`, which can leave missing dimensions (explicit, implicit, contextual, negative) undefined. Update the code that computes `byInvocationType`/`invocationScores` (references: `byInvocationType`, `invocationScores`, `perEntryResults`, `InvocationTypeScores`) to first initialize a fully-typed object containing all four keys with `{ passed: 0, total: 0 }`, then iterate `perEntryResults` to increment the correct key (use `r.entry.invocation_type ?? "implicit"`), compute `pass_rate` for each explicit key, and return that fully-populated `InvocationTypeScores` object without any `as unknown as` double-cast so downstream Pareto logic always receives all four dimensions as numbers.

cli/selftune/evolution/pareto.ts (1)
21-40: ⚠️ Potential issue | 🟠 Major

Harden `computeInvocationScores` against invalid keys and remove cast-based typing.

Line 29 can throw if the runtime `invocation_type` is unexpected, and line 39 masks structural issues with a double-cast.

Proposed fix:

```diff
 export function computeInvocationScores(
   perEntryResults: Array<{ entry: { invocation_type?: InvocationType }; after_pass: boolean }>,
 ): InvocationTypeScores {
   const dims: InvocationType[] = ["explicit", "implicit", "contextual", "negative"];
-  const counts: Record<string, { passed: number; total: number }> = {};
+  const counts: Record<InvocationType, { passed: number; total: number }> = {
+    explicit: { passed: 0, total: 0 },
+    implicit: { passed: 0, total: 0 },
+    contextual: { passed: 0, total: 0 },
+    negative: { passed: 0, total: 0 },
+  };

-  for (const dim of dims) {
-    counts[dim] = { passed: 0, total: 0 };
-  }
-
   for (const r of perEntryResults) {
-    const type = r.entry.invocation_type ?? "implicit";
+    const rawType = r.entry.invocation_type;
+    const type: InvocationType =
+      rawType === "explicit" ||
+      rawType === "implicit" ||
+      rawType === "contextual" ||
+      rawType === "negative"
+        ? rawType
+        : "implicit";
     counts[type].total++;
     if (r.after_pass) counts[type].passed++;
   }

-  const result: Record<string, { passed: number; total: number; pass_rate: number }> = {};
-  for (const dim of dims) {
-    const { passed, total } = counts[dim];
-    result[dim] = { passed, total, pass_rate: total > 0 ? passed / total : 0 };
-  }
-
-  return result as unknown as InvocationTypeScores;
+  const result: InvocationTypeScores = {
+    explicit: { ...counts.explicit, pass_rate: counts.explicit.total ? counts.explicit.passed / counts.explicit.total : 0 },
+    implicit: { ...counts.implicit, pass_rate: counts.implicit.total ? counts.implicit.passed / counts.implicit.total : 0 },
+    contextual: { ...counts.contextual, pass_rate: counts.contextual.total ? counts.contextual.passed / counts.contextual.total : 0 },
+    negative: { ...counts.negative, pass_rate: counts.negative.total ? counts.negative.passed / counts.negative.total : 0 },
+  };
+
+  return result;
 }
```

As per coding guidelines: `cli/**/*.ts` requires "Strict typing (no `any`, prefer `unknown` with type guards)".

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@cli/selftune/evolution/pareto.ts` around lines 21-40: the loop in `computeInvocationScores` assumes `r.entry.invocation_type` is always one of `dims` and uses a double-cast to force the return type. To fix, validate the runtime key before indexing `counts` (e.g., check `typeof type === "string"` and that `dims.includes(type)`, or use `Object.prototype.hasOwnProperty.call(counts, type)`) and skip or categorize unknown types, incrementing `counts` only for known keys (`dims`), and construct the result as a properly typed `InvocationTypeScores` object rather than using a cast; reference symbols: `computeInvocationScores`, `dims`, `counts`, `perEntryResults`, `invocation_type`, and `InvocationTypeScores`.

tests/grading/pre-gates.test.ts (1)
40-131: ⚠️ Potential issue | 🟠 Major

Replace `.find(...)!` with an asserted safe lookup helper.

Lines 40-131 repeatedly use non-null assertions on gate lookups. If a gate key changes, tests crash with a `TypeError` instead of producing a clear assertion failure.

Proposed fix:

```diff
+function getGate(name: string): PreGate {
+  const gate = DEFAULT_GATES.find((g) => g.name === name);
+  expect(gate).toBeDefined();
+  if (!gate) throw new Error(`Missing gate: ${name}`);
+  return gate;
+}
+
 describe("pre-gate pattern matching", () => {
   test("skill_md_read gate matches 'SKILL.md was read'", () => {
-    const gate = DEFAULT_GATES.find((g) => g.name === "skill_md_read")!;
+    const gate = getGate("skill_md_read");
     expect(gate.pattern.test("SKILL.md was read")).toBe(true);
     expect(gate.pattern.test("The skill.md file was read")).toBe(true);
   });
```

As per coding guidelines: `tests/**/*.ts` requires "Optional chaining (result?.) not non-null assertions (result!.)".

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@tests/grading/pre-gates.test.ts` around lines 40-131: replace unsafe non-null assertions on gate lookups by adding and using an asserted lookup helper (e.g., `assertGate` or `getRequiredGate`) that searches `DEFAULT_GATES` by name and throws/causes a test failure with a clear message if not found; replace occurrences like `DEFAULT_GATES.find((g) => g.name === "skill_md_read")!` with `assertGate("skill_md_read")` (or call the helper before using the returned gate), and update tests referencing `DEFAULT_GATES.find(...)!` for all gate names (skill_md_read, expected_tools_called, error_count, session_completed, etc.) so tests fail with a clear assertion rather than a TypeError. Ensure the helper returns the Gate object (so existing `gate.check(makeCtx())` and `gate.pattern.test(...)` calls continue to work).

cli/selftune/evolution/propose-description.ts (1)
153-177: ⚠️ Potential issue | 🟠 Major

One failed LLM call currently aborts the full candidate batch.

`Promise.all` is fail-fast here; a single candidate error drops all other valid candidates. Use `Promise.allSettled` and keep fulfilled proposals.

Suggested fix:

```diff
-  const proposals = await Promise.all(
+  const settled = await Promise.allSettled(
     variations.map(async (prompt, i) => {
       const rawResponse = await callLlm(PROPOSER_SYSTEM, prompt, agent);
       const { proposed_description, rationale, confidence } = parseProposalResponse(rawResponse);

       return {
         proposal_id: `evo-${skillName}-${Date.now()}-${i}`,
@@
         status: "pending" as const,
       };
     }),
   );
-
-  return proposals;
+  const proposals = settled
+    .filter((r): r is PromiseFulfilledResult<EvolutionProposal> => r.status === "fulfilled")
+    .map((r) => r.value);
+
+  if (proposals.length === 0) {
+    throw new Error("Failed to generate any proposals");
+  }
+
+  return proposals;
```

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@cli/selftune/evolution/propose-description.ts` around lines 153-177: the batch creation currently uses `Promise.all`, which aborts all candidates if any call to `callLlm` fails. Change the proposals creation to use `Promise.allSettled` over `variations.map(async ...)` (where `callLlm` and `parseProposalResponse` are used), then filter the settled results to keep only those with status "fulfilled" and extract the proposal values, optionally logging or recording errors from rejected entries; ensure the returned array contains only the fulfilled proposal objects (with the same shape including `proposal_id`, `skill_name`, `proposed_description`, etc.) so a single LLM failure no longer drops the whole batch.

cli/selftune/grading/grade-session.ts (1)
459-516: ⚠️ Potential issue | 🔴 Critical

CLI result assembly drops LLM metadata (`claims`, `eval_feedback`, `failure_feedback`).

In the mixed pre-gate + LLM path, `graderOutput` is used only for expectations. The final `result` overwrites metadata with empty defaults, so structured failure feedback is lost.

Suggested fix:

```diff
-  let allExpectations: GradingExpectation[];
+  let allExpectations: GradingExpectation[];
+  let llmOutput: GraderOutput | null = null;
@@
     let graderOutput: GraderOutput;
     try {
       graderOutput = await gradeViaAgent(prompt, agent);
+      llmOutput = graderOutput;
     } catch (e) {
       console.error(`[ERROR] Grading failed: ${e}`);
       process.exit(1);
     }
@@
   const result: GradingResult = {
@@
-    claims: [],
-    eval_feedback: { suggestions: [], overall: "" },
+    claims: llmOutput?.claims ?? [],
+    eval_feedback: llmOutput?.eval_feedback ?? { suggestions: [], overall: "" },
+    failure_feedback: llmOutput?.failure_feedback,
   };
```

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@cli/selftune/grading/grade-session.ts` around lines 459-516: the final GradingResult currently always overwrites metadata with empty defaults, losing LLM-provided metadata when `graderOutput` exists. Update the code that builds `result` to pull `claims`, `eval_feedback`, and `failure_feedback` from `graderOutput` when present (e.g., `const claims = graderOutput?.claims ?? []; const eval_feedback = graderOutput?.eval_feedback ?? { suggestions: [], overall: "" }; const failure_feedback = graderOutput?.failure_feedback ?? null`) and use those variables in the result object so mixed pre-gate + LLM runs preserve LLM metadata; ensure this uses the existing `graderOutput` variable and still falls back to safe defaults when `graderOutput` is undefined (pre-gate-only path).

cli/selftune/evolution/extract-patterns.ts (1)
149-170: ⚠️ Potential issue | 🟠 Major

Feedback enrichment drops duplicate entries and can mix unrelated skills.

Using `Map<string, FailureFeedback>` overwrites prior feedback for the same query, and missing `skill_name` filtering allows cross-skill feedback to be attached.

Suggested fix:

```diff
-  if (gradingResults && gradingResults.length > 0) {
-    const feedbackMap = new Map<string, FailureFeedback>();
+  if (gradingResults && gradingResults.length > 0) {
+    const feedbackMap = new Map<string, FailureFeedback[]>();
     for (const gr of gradingResults) {
+      if (gr.skill_name !== skillName) continue;
       if (gr.failure_feedback) {
         for (const fb of gr.failure_feedback) {
-          feedbackMap.set(fb.query, fb);
+          const list = feedbackMap.get(fb.query) ?? [];
+          list.push(fb);
+          feedbackMap.set(fb.query, list);
         }
       }
     }

     for (const pattern of allPatterns) {
       const matchingFeedback: FailureFeedback[] = [];
       for (const query of pattern.missed_queries) {
-        const fb = feedbackMap.get(query);
-        if (fb) matchingFeedback.push(fb);
+        const fbList = feedbackMap.get(query);
+        if (fbList) matchingFeedback.push(...fbList);
       }
       if (matchingFeedback.length > 0) {
         pattern.feedback = matchingFeedback;
       }
     }
   }
```

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@cli/selftune/evolution/extract-patterns.ts` around lines 149-170: the current loop builds `feedbackMap` as `Map<string, FailureFeedback>`, which overwrites duplicates and doesn't respect skill boundaries. Change to `Map<string, FailureFeedback[]>` (or similar) when populating from `gradingResults` so you push all FailureFeedback entries for a query instead of replacing, and when attaching feedback to each pattern (use `pattern.missed_queries` and `pattern.skill_name`) only include feedback entries whose `skill_name` matches `pattern.skill_name` (handle absent/undefined `skill_name` appropriately); finally set `pattern.feedback` to the aggregated array (possibly empty) rather than a single item.

tests/grading/grade-session.test.ts (1)
666-709: ⚠️ Potential issue | 🟡 Minor

Add one regression test for the CLI pre-gate + LLM merge path.

These tests validate `assembleResult`, but they do not execute `cliMain`'s merge branch where pre-gate results and LLM output are combined. A focused test here would catch dropped `claims`, `eval_feedback`, and `failure_feedback`.

As per coding guidelines `tests/**/*.ts`: coverage of error paths, not just happy paths.

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@tests/grading/grade-session.test.ts` around lines 666-709: add a regression test that exercises the CLI pre-gate + LLM merge path by invoking `cliMain` (or the function that runs the merge branch) with mocked pre-gate results and mocked LLM output so the merge logic is executed; ensure the test uses the same shapes used by `assembleResult` and asserts that `claims`, `eval_feedback`, and `failure_feedback` from both pre-gate and LLM outputs are preserved/merged (e.g., check presence and contents of `result.claims`, `result.eval_feedback`, and `result.failure_feedback`) to catch regressions where those fields were dropped.

tests/evolution/extract-patterns.test.ts (1)
222-333: ⚠️ Potential issue | 🟡 Minor

Add regression tests for duplicate-query and cross-skill feedback attachment.

Current cases validate single feedback matching only. Please add tests that verify:

- multiple `failure_feedback` entries for the same query are all preserved, and
- feedback from `gradingResults` where `skill_name` is not the target skill is ignored.

As per coding guidelines `tests/**/*.ts`: coverage of error paths, not just happy paths.

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@tests/evolution/extract-patterns.test.ts` around lines 222-333: add two regression tests for `extractFailurePatterns`: one that supplies a gradingResults entry whose `failure_feedback` contains multiple objects with the same query and asserts that the resulting pattern's feedback array preserves all entries (check length and contents), and another that supplies gradingResults with `failure_feedback` for a different `skill_name` than the target skill and asserts that no feedback is attached to patterns (i.e., `patterns[].feedback` is undefined or empty). Ensure tests construct EvalEntry items (e.g., via `makeEval`) matching the queries and pass gradingResults including `failure_feedback` arrays and explicit `skill_name` values to exercise duplicate-query preservation and cross-skill filtering logic in `extractFailurePatterns`.

tests/evolution/pareto.test.ts (1)
205-208: ⚠️ Potential issue | 🟡 Minor

Replace non-null assertions with defined checks and optional chaining.

The `!` assertions on `frontier.find(...)` should be removed to match test typing rules and avoid hidden null-path failures.

Suggested fix:

```diff
-    const memberA = frontier.find((c) => c.proposal.proposal_id === "a")!;
-    const memberB = frontier.find((c) => c.proposal.proposal_id === "b")!;
-    expect(memberA.dominates_on).toContain("explicit");
-    expect(memberB.dominates_on).toContain("implicit");
+    const memberA = frontier.find((c) => c.proposal.proposal_id === "a");
+    const memberB = frontier.find((c) => c.proposal.proposal_id === "b");
+    expect(memberA).toBeDefined();
+    expect(memberB).toBeDefined();
+    expect(memberA?.dominates_on).toContain("explicit");
+    expect(memberB?.dominates_on).toContain("implicit");
```

As per coding guidelines `tests/**/*.ts`: optional chaining (result?.) not non-null assertions (result!).

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@tests/evolution/pareto.test.ts` around lines 205-208: replace the non-null assertions on the results of `frontier.find` by removing the trailing `!` from the declarations (the variables `memberA` and `memberB`) and add defined checks plus optional chaining when accessing properties: call `expect(memberA).toBeDefined()` and `expect(memberB).toBeDefined()` after their assignments, then use `memberA?.dominates_on` and `memberB?.dominates_on` in the subsequent expects (e.g., `expect(memberA?.dominates_on).toContain("explicit")` and `expect(memberB?.dominates_on).toContain("implicit")`) so the test uses defined checks and optional chaining instead of non-null assertions.

cli/selftune/evolution/evolve.ts (3)
205-206: ⚠️ Potential issue | 🟠 Major

Pareto is still default-disabled in runtime and CLI wiring.

Current fallbacks/defaults use `false`, which conflicts with the intended default-enabled Pareto flow.

Suggested fix:

```diff
-  const paretoEnabled = options.paretoEnabled ?? false;
+  const paretoEnabled = options.paretoEnabled ?? true;
@@
-      pareto: { type: "boolean", default: false },
+      pareto: { type: "boolean", default: true },
@@
-    paretoEnabled: values.pareto ?? false,
+    paretoEnabled: values.pareto ?? true,
```

Also applies to: 456-457, 522-523

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@cli/selftune/evolution/evolve.ts` around lines 205-206: the `paretoEnabled` flag is being defaulted to false, causing the Pareto flow to be disabled. Update the fallback for `options.paretoEnabled` to true (e.g., `const paretoEnabled = options.paretoEnabled ?? true`) in evolve.ts and make the same change for the other occurrences where `paretoEnabled` is computed so the runtime/CLI wiring defaults to enabled; ensure any CLI flag parsing that sets `options.paretoEnabled` treats undefined as enabling Pareto rather than disabling it.
236: ⚠️ Potential issue | 🟠 Major

Created audit entries still omit rollback-required `original_description:` details.

Both `"created"` audit calls use generic messages. The rollback fallback expects the details field to include the original description with the `original_description:` prefix.

Suggested fix:

```diff
-    recordAudit(proposal.proposal_id, "created", `Pareto candidate for ${skillName}`);
+    recordAudit(
+      proposal.proposal_id,
+      "created",
+      `original_description:${currentDescription}\ncontext:Pareto candidate for ${skillName}`,
+    );
@@
     recordAudit(
       proposal.proposal_id,
       "created",
-      `Proposal created for ${skillName} (iteration ${iteration + 1})`,
+      `original_description:${currentDescription}\ncontext:Proposal created for ${skillName} (iteration ${iteration + 1})`,
     );
```

Based on learnings: in `cli/selftune/evolution/evolve.ts`, the `"created"` audit entry should store the original SKILL.md content with the prefix `"original_description:"` in the details field to support rollback fallback restoration.

Also applies to: 295-300

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@cli/selftune/evolution/evolve.ts` at line 236: the "created" audit entries recorded via `recordAudit(proposal.proposal_id, "created", ...)` currently use generic messages and must include the original SKILL.md content prefixed with `"original_description:"` so rollback can restore it. Update both places where `recordAudit` is called for "created" (the call at or near `recordAudit(proposal.proposal_id, "created", \`Pareto candidate for ${skillName}\`)` and the similar call around lines ~295-300) to fetch the current SKILL.md/original description and pass a details string like `"original_description:<originalContent>"` instead of the generic message, leaving the audit type and proposal id unchanged.
206-207: ⚠️ Potential issue | 🟠 Major

Validate and clamp `--candidates` before using it for fan-out.

A raw parse can yield `NaN` (silently disabling the Pareto path) or oversized parallelism. Parse + fallback + clamp to `[1, 5]`.

Suggested fix:

```diff
+  const parsedCandidates = Number.parseInt(values.candidates ?? "3", 10);
+  const candidateCount = Number.isFinite(parsedCandidates)
+    ? Math.min(5, Math.max(1, parsedCandidates))
+    : 3;
+
   const result = await evolve({
@@
-    candidateCount: Number.parseInt(values.candidates ?? "3", 10),
+    candidateCount,
   });
```

Also applies to: 457, 523

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@cli/selftune/evolution/evolve.ts` around lines 206-207: validate and clamp `options.candidateCount` before using it: parse it as an integer (e.g., from `options.candidateCount`), treat non-numeric results as undefined, default to 3, and then clamp the resulting `candidateCount` into the inclusive range [1, 5] to avoid NaN or excessive parallelism; apply this same parse+fallback+clamp logic wherever `candidateCount` is read (references: `candidateCount`, `options.candidateCount`, and the other uses of the same pattern around lines ~457 and ~523).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Duplicate comments:
In `@cli/selftune/evolution/evolve.ts`:
- Around line 205-206: The paretoEnabled flag is being defaulted to false
causing Pareto flow to be disabled; update the fallback for
options.paretoEnabled to true (e.g., const paretoEnabled = options.paretoEnabled
?? true) in evolve.ts and make the same change for the other occurrences where
paretoEnabled is computed so the runtime/CLI wiring defaults to enabled; ensure
any CLI flag parsing that sets options.paretoEnabled treats undefined as
enabling Pareto rather than disabling it.
- Line 236: The "created" audit entries recorded via
recordAudit(proposal.proposal_id, "created", ...) currently use generic messages
and must include the original SKILL.md content prefixed with
"original_description:" so rollback can restore it; update both places where
recordAudit is called for "created" (the call at or near
recordAudit(proposal.proposal_id, "created", `Pareto candidate for
${skillName}`) and the similar call around lines ~295-300) to fetch the current
SKILL.md/original description and pass a details string like
"original_description:<originalContent>" instead of the generic message, leaving
the audit type and proposal id unchanged.
- Around line 206-207: Validate and clamp options.candidateCount before using
it: parse it as an integer (e.g., from options.candidateCount), treat
non-numeric results as undefined, default to 3, and then clamp the resulting
candidateCount into the inclusive range [1,5] to avoid NaN or excessive
parallelism; apply this same parse+fallback+clamp logic wherever candidateCount
is read (references: candidateCount, options.candidateCount, and the places
around the other uses noted at the same pattern lines ~457 and ~523).
In `@cli/selftune/evolution/extract-patterns.ts`:
- Around line 149-170: The current loop builds feedbackMap as Map<string,
FailureFeedback> which overwrites duplicates and doesn't respect skill
boundaries; change to Map<string, FailureFeedback[]> (or similar) when
populating from gradingResults so you push all FailureFeedback entries for a
query instead of replacing, and when attaching feedback to each pattern (use
pattern.missed_queries and pattern.skill_name) only include feedback entries
whose skill_name matches pattern.skill_name (handle absent/undefined skill_name
appropriately); finally set pattern.feedback to the aggregated array (possibly
empty) rather than a single item.
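A minimal sketch of the duplicate-preserving, skill-scoped aggregation described above; the `FailureFeedback` shape here is an assumption based on the fields the review names:

```typescript
// Assumed shape, simplified from the fields mentioned in the review.
interface FailureFeedback {
  query: string;
  skill_name?: string;
  improvement_hint: string;
}

// Push every entry for a query instead of overwriting the previous one.
function buildFeedbackMap(all: FailureFeedback[]): Map<string, FailureFeedback[]> {
  const map = new Map<string, FailureFeedback[]>();
  for (const fb of all) {
    const list = map.get(fb.query) ?? [];
    list.push(fb);
    map.set(fb.query, list);
  }
  return map;
}

// Attach only feedback whose skill matches the pattern's skill
// (entries with no skill_name are kept).
function feedbackForPattern(
  map: Map<string, FailureFeedback[]>,
  missedQueries: string[],
  skillName: string,
): FailureFeedback[] {
  return missedQueries.flatMap((q) =>
    (map.get(q) ?? []).filter(
      (fb) => fb.skill_name === undefined || fb.skill_name === skillName,
    ),
  );
}
```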
In `@cli/selftune/evolution/pareto.ts`:
- Around line 21-40: The loop in computeInvocationScores assumes
r.entry.invocation_type is always one of dims and uses a double-cast to force
the return type; to fix, validate the runtime key before indexing counts (e.g.,
check typeof type === "string" and that dims.includes(type) or use
Object.prototype.hasOwnProperty.call(counts, type)) and skip or categorize
unknown types, incrementing counts only for known keys (dims), and construct the
result as a properly typed InvocationTypeScores object rather than using a cast;
reference symbols: computeInvocationScores, dims, counts, perEntryResults,
invocation_type, and InvocationTypeScores.
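The runtime-key guard described above can be sketched as a user-defined type guard; the dimension names follow the review, everything else is illustrative:

```typescript
// Validate invocation_type against the known dimensions before indexing,
// instead of force-casting; unknown types can then be skipped.
const dims = ["explicit", "implicit", "contextual", "negative"] as const;
type Dim = (typeof dims)[number];

function isDim(value: unknown): value is Dim {
  return typeof value === "string" && (dims as readonly string[]).includes(value);
}
```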
In `@cli/selftune/evolution/propose-description.ts`:
- Around line 153-177: The batch creation currently uses Promise.all which
aborts all candidates if any call to callLlm fails; change the proposals
creation to use Promise.allSettled over variations.map(async ...) (where callLlm
and parseProposalResponse are used) then filter the settled results to keep only
those with status "fulfilled" and extract the proposal values, optionally
logging or recording errors from rejected entries; ensure the returned array
contains only the fulfilled proposal objects (with the same shape including
proposal_id, skill_name, proposed_description, etc.) so a single LLM failure no
longer drops the whole batch.
In `@cli/selftune/evolution/validate-proposal.ts`:
- Around line 133-159: The current logic builds invocationScores from the
observed keys in byInvocationType and then force-casts it to
InvocationTypeScores, which can leave missing dimensions (explicit, implicit,
contextual, negative) undefined; update the code that computes
byInvocationType/invocationScores (references: byInvocationType,
invocationScores, perEntryResults, InvocationTypeScores) to first initialize a
fully-typed object containing all four keys with {passed:0,total:0}, then
iterate perEntryResults to increment the correct key (use
r.entry.invocation_type ?? "implicit"), compute pass_rate for each explicit key,
and return that fully-populated InvocationTypeScores object without any "as
unknown as" double-cast so downstream Pareto logic always receives all four
dimensions as numbers.
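The fully-initialized tally described above can be sketched like this; the score shape is simplified from the review's `InvocationTypeScores` description:

```typescript
type InvocationType = "explicit" | "implicit" | "contextual" | "negative";
interface DimScore { passed: number; total: number; }

function tallyInvocations(
  entries: Array<{ invocation_type?: InvocationType; passed: boolean }>,
): Record<InvocationType, DimScore> {
  // Start with all four dimensions present so none is ever undefined downstream.
  const counts: Record<InvocationType, DimScore> = {
    explicit: { passed: 0, total: 0 },
    implicit: { passed: 0, total: 0 },
    contextual: { passed: 0, total: 0 },
    negative: { passed: 0, total: 0 },
  };
  for (const e of entries) {
    const key = e.invocation_type ?? "implicit"; // missing type falls back to implicit
    counts[key].total += 1;
    if (e.passed) counts[key].passed += 1;
  }
  return counts; // properly typed; no "as unknown as" cast needed
}
```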
In `@cli/selftune/grading/grade-session.ts`:
- Around line 459-516: The final GradingResult currently always overwrites
metadata with empty defaults, losing LLM-provided metadata when graderOutput
exists; update the code that builds result to pull claims, eval_feedback, and
failure_feedback from graderOutput when present (e.g., const claims =
graderOutput?.claims ?? []; const eval_feedback = graderOutput?.eval_feedback ??
{ suggestions: [], overall: "" }; const failure_feedback =
graderOutput?.failure_feedback ?? null) and use those variables in the result
object so mixed pre-gate + LLM runs preserve LLM metadata; ensure this uses the
existing graderOutput variable and still falls back to safe defaults when
graderOutput is undefined (pre-gate-only path).
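A sketch of the fallback merge described above; the `GraderOutput` shape is an assumption based on the fields named in the review:

```typescript
// Assumed shape, simplified from the fields mentioned in the review.
interface GraderOutput {
  claims?: string[];
  eval_feedback?: { suggestions: string[]; overall: string };
  failure_feedback?: Array<{ query: string }> | null;
}

function mergeGraderMetadata(graderOutput?: GraderOutput) {
  return {
    // Prefer LLM-provided metadata; fall back to safe defaults on the
    // pre-gate-only path where graderOutput is undefined.
    claims: graderOutput?.claims ?? [],
    eval_feedback: graderOutput?.eval_feedback ?? { suggestions: [], overall: "" },
    failure_feedback: graderOutput?.failure_feedback ?? null,
  };
}
```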
In `@tests/evolution/extract-patterns.test.ts`:
- Around line 222-333: Add two regression tests for extractFailurePatterns: one
that supplies a gradingResults entry whose failure_feedback contains multiple
objects with the same query and asserts that the resulting pattern's feedback
array preserves all entries (check length and contents), and another that
supplies gradingResults with failure_feedback for a different skill_name than
the target skill and asserts that no feedback is attached to patterns (i.e.,
patterns[].feedback is undefined or empty). Ensure tests construct EvalEntry
items (e.g., via makeEval) matching the queries and pass gradingResults
including failure_feedback arrays and explicit skill_name values to exercise
duplicate-query preservation and cross-skill filtering logic in
extractFailurePatterns.
In `@tests/evolution/pareto.test.ts`:
- Around line 205-208: Replace the non-null assertions on the results of
frontier.find by removing the trailing "!" from the declarations (the variables
memberA and memberB) and add defined checks plus optional chaining when
accessing properties: call expect(memberA).toBeDefined() and
expect(memberB).toBeDefined() after their assignments, then use
memberA?.dominates_on and memberB?.dominates_on in the subsequent expects (e.g.,
expect(memberA?.dominates_on).toContain("explicit") and
expect(memberB?.dominates_on).toContain("implicit")) so the test uses defined
checks and optional chaining instead of non-null assertions.
In `@tests/grading/grade-session.test.ts`:
- Around line 666-709: Add a regression test that exercises the CLI pre-gate +
LLM merge path by invoking cliMain (or the function that runs the merge branch)
with mocked pre-gate results and mocked LLM output so the merge logic is
executed; ensure the test uses the same shapes used by assembleResult and
asserts that claims, eval_feedback, and failure_feedback from both pre-gate and
LLM outputs are preserved/merged (e.g., check presence and contents of
result.claims, result.eval_feedback, and result.failure_feedback) to catch
regressions where those fields were dropped.
In `@tests/grading/pre-gates.test.ts`:
- Around line 40-131: Replace unsafe non-null assertions on gate lookups by
adding and using an asserted lookup helper (e.g., assertGate or getRequiredGate)
that searches DEFAULT_GATES by name and throws/causes a test failure with a
clear message if not found; replace occurrences like DEFAULT_GATES.find((g) =>
g.name === "skill_md_read")! with assertGate("skill_md_read") (or call the
helper before using the returned gate), and update tests referencing
DEFAULT_GATES.find(...)! for all gate names (skill_md_read,
expected_tools_called, error_count, session_completed, etc.) so tests fail with
a clear assertion rather than a TypeError. Ensure the helper returns the Gate
object (so existing gate.check(makeCtx()) and gate.pattern.test(...) calls
continue to work).
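The guarded lookup helper described above can be sketched as follows; `Gate` and the one-entry `DEFAULT_GATES` here are minimal stand-ins for the real test fixtures:

```typescript
interface Gate {
  name: string;
  pattern: RegExp;
}

// Stand-in fixture; the real DEFAULT_GATES lives in the grading module.
const DEFAULT_GATES: Gate[] = [
  { name: "skill_md_read", pattern: /skill\.md/i },
];

function getGate(name: string): Gate {
  const gate = DEFAULT_GATES.find((g) => g.name === name);
  if (!gate) throw new Error(`missing gate: ${name}`); // clear failure, no `!`
  return gate;
}
```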
ℹ️ Review info
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
Disabled knowledge base sources:
- Linear integration is disabled
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (16)
- README.md
- cli/selftune/evolution/evolve.ts
- cli/selftune/evolution/extract-patterns.ts
- cli/selftune/evolution/pareto.ts
- cli/selftune/evolution/propose-description.ts
- cli/selftune/evolution/validate-proposal.ts
- cli/selftune/grading/grade-session.ts
- cli/selftune/types.ts
- tests/evolution/extract-patterns.test.ts
- tests/evolution/pareto.test.ts
- tests/evolution/propose-description.test.ts
- tests/grading/grade-session.test.ts
- tests/grading/pre-gates.test.ts
- tests/sandbox/docker/run-with-llm.ts
- tests/sandbox/run-sandbox.ts
- tests/types/new-types.test.ts
- Make skill_md_read gate pattern order-agnostic (read...skill.md | skill.md...read)
- Fix MD029 ordered list numbering in golden-principles.md (local 1..n)
- Add blank line after ### Skill Evolution heading in PRD.md (MD022)
- Remove redundant failure_feedback assertion in grade-session.test.ts
- Add test coverage for both pattern orderings in pre-gates.test.ts

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Actionable comments posted: 2
♻️ Duplicate comments (2)
tests/grading/pre-gates.test.ts (1)
**40-133:** ⚠️ Potential issue | 🟠 Major

Remove `find(...)!` non-null assertions from gate lookups. Lines 40-133 repeatedly use `DEFAULT_GATES.find(...)!`. If a gate name changes, tests fail with runtime property errors instead of clear assertion failures.

Refactor with a guarded helper:

```diff
+function getGate(name: string): PreGate {
+  const gate = DEFAULT_GATES.find((g) => g.name === name);
+  expect(gate).toBeDefined();
+  return gate as PreGate;
+}
+
 describe("pre-gate pattern matching", () => {
   test("skill_md_read gate matches both orderings", () => {
-    const gate = DEFAULT_GATES.find((g) => g.name === "skill_md_read")!;
+    const gate = getGate("skill_md_read");
     expect(gate.pattern.test("SKILL.md was read")).toBe(true);
     expect(gate.pattern.test("The skill.md file was read")).toBe(true);
```

As per coding guidelines, `tests/**/*.ts`: optional chaining (`result?.`), not non-null assertions (`result!.`).

```bash
#!/bin/bash
# Verify remaining non-null assertions on gate lookups in this file.
rg -nP 'DEFAULT_GATES\.find\([^\n]*\)!' tests/grading/pre-gates.test.ts
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/grading/pre-gates.test.ts` around lines 40 - 133, Tests use non-null assertions like DEFAULT_GATES.find(...)! — remove those and replace with a small guarded accessor to fail fast and avoid `!`. Add a helper function getGate(name: string) that does const g = DEFAULT_GATES.find(g => g.name === name); if (!g) throw new Error(`missing gate: ${name}`); return g; then replace all occurrences of DEFAULT_GATES.find((g) => g.name === "…")! with getGate("…") so callers (e.g., tests referencing gate.pattern or gate.check) no longer use non-null assertions and will get a clear error if the gate is missing.

docs/golden-principles.md (1)
**82-90:** ⚠️ Potential issue | 🟡 Minor

Fix MD029 ordered-list numbering in Evolution Rules. Lines 82, 85, and 88 restart numbering (1/2/3) inside the existing ordered list and fail markdownlint MD029 in this patch.

Minimal lint-safe renumbering:

```diff
-1. **Pre-gates before LLM grading**
+5. **Pre-gates before LLM grading**
@@
-2. **Graduated scores over binary pass/fail**
+6. **Graduated scores over binary pass/fail**
@@
-3. **Pareto frontier for multi-candidate selection**
+7. **Pareto frontier for multi-candidate selection**
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@docs/golden-principles.md` around lines 82 - 90, The ordered list entries under the "Evolution Rules" section ("Pre-gates before LLM grading", "Graduated scores over binary pass/fail", "Pareto frontier for multi-candidate selection") restart numbering (1/2/3) which triggers markdownlint MD029; update those list item numbers to continue the surrounding ordered-list sequence (use the correct incremental numbers instead of restarting at 1) so the numbering is consecutive throughout the list and re-run the linter to confirm MD029 is resolved.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@cli/selftune/grading/pre-gates.ts`:
- Around line 82-83: The runPreGates flow uses gate.pattern.test(text) which can
be stateful if the injected RegExp has g or y flags; before invoking
gate.pattern.test(text) reset the regex state by setting gate.pattern.lastIndex
= 0 (or use a fresh RegExp instance) so repeated calls don't carry over
lastIndex and cause spurious failures; apply this fix immediately before the
test call in the runPreGates logic and keep the subsequent call to
gate.check(ctx) unchanged.
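The statefulness being fixed here is a JavaScript `RegExp` quirk: with the `g` or `y` flag, `.test()` advances `lastIndex` and the next call resumes from there. A minimal sketch of the reset:

```typescript
// A /g- or /y-flagged RegExp carries lastIndex between .test() calls,
// so reset it before each use to avoid spurious misses.
function safeTest(pattern: RegExp, text: string): boolean {
  pattern.lastIndex = 0; // neutralize stale state from a previous match
  return pattern.test(text);
}
```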
In `@tests/grading/grade-session.test.ts`:
- Around line 584-625: Add tests covering the clamping and non-finite fallback
branches of buildGraduatedSummary: create cases where an expectation score is
below 0 (e.g., -0.1) and above 1 (e.g., 1.2) to assert mean_score and individual
scores are clamped to [0,1], and create cases with score: NaN and score:
Infinity (and -Infinity) to assert they are treated as fallback values (e.g.,
defaulting to 1.0 for passed and 0.0 for failed) and that mean_score and
score_std_dev behave accordingly; add these test cases alongside the existing
buildGraduatedSummary tests so regressions on clamping/non-finite sanitization
are covered.
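The sanitization these suggested tests exercise can be sketched as follows; `sanitizeScore` is a hypothetical name standing in for the logic inside `buildGraduatedSummary`:

```typescript
// Clamp a per-expectation score into [0, 1]; non-finite values (NaN,
// +/-Infinity) fall back to a pass/fail default.
function sanitizeScore(score: number, passed: boolean): number {
  if (!Number.isFinite(score)) return passed ? 1.0 : 0.0;
  return Math.min(1, Math.max(0, score));
}
```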
---
Duplicate comments:
In `@docs/golden-principles.md`:
- Around line 82-90: The ordered list entries under the "Evolution Rules"
section ("Pre-gates before LLM grading", "Graduated scores over binary
pass/fail", "Pareto frontier for multi-candidate selection") restart numbering
(1/2/3) which triggers markdownlint MD029; update those list item numbers to
continue the surrounding ordered-list sequence (use the correct incremental
numbers instead of restarting at 1) so the numbering is consecutive throughout
the list and re-run the linter to confirm MD029 is resolved.
In `@tests/grading/pre-gates.test.ts`:
- Around line 40-133: Tests use non-null assertions like
DEFAULT_GATES.find(...)! — remove those and replace with a small guarded
accessor to fail fast and avoid `!`. Add a helper function getGate(name: string)
that does const g = DEFAULT_GATES.find(g => g.name === name); if (!g) throw new
Error(`missing gate: ${name}`); return g; then replace all occurrences of
DEFAULT_GATES.find((g) => g.name === "…")! with getGate("…") so callers (e.g.,
tests referencing gate.pattern or gate.check) no longer use non-null assertions
and will get a clear error if the gate is missing.
📒 Files selected for processing (5)
- PRD.md
- cli/selftune/grading/pre-gates.ts
- docs/golden-principles.md
- tests/grading/grade-session.test.ts
- tests/grading/pre-gates.test.ts
- Reset lastIndex before gate.pattern.test() for global/sticky regexes to prevent stale state across iterations.
- Add edge case tests for buildGraduatedSummary: clamping out-of-range scores, NaN/Infinity fallback to passed-based defaults.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Actionable comments posted: 1
♻️ Duplicate comments (1)
tests/grading/pre-gates.test.ts (1)
**40-133:** ⚠️ Potential issue | 🟠 Major

Replace repeated `.find(...)!` with a safe gate lookup helper. Lines 40–133 use non-null assertions for gate lookups. This makes failures less actionable and violates the test guideline for optional-safe access patterns.

As per coding guidelines, `tests/**/*.ts` requires optional chaining (`result?.`) instead of non-null assertions (`result!.`).

Suggested refactor:

```diff
+function getGate(name: string): PreGate {
+  const gate = DEFAULT_GATES.find((g) => g.name === name);
+  expect(gate).toBeDefined();
+  if (!gate) throw new Error(`Missing gate: ${name}`);
+  return gate;
+}
-
-const gate = DEFAULT_GATES.find((g) => g.name === "skill_md_read")!;
+const gate = getGate("skill_md_read");
```

```bash
#!/bin/bash
# Verify all non-null assertions on DEFAULT_GATES lookups in this test file
rg -nP 'DEFAULT_GATES\.find\([^\n]*\)!' tests/grading/pre-gates.test.ts
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/grading/pre-gates.test.ts` around lines 40 - 133, Replace unsafe non-null assertions on DEFAULT_GATES lookups by adding a small helper (e.g., getGate(name): Gate | undefined) that returns DEFAULT_GATES.find(g => g.name === name) and update every test to use optional chaining instead of `!`; e.g., change `const gate = DEFAULT_GATES.find(... )!` to `const gate = getGate("skill_md_read")` and change uses like `gate.pattern.test(...)` or `gate.check(ctx)` to `gate?.pattern.test(...)` and `gate?.check(ctx)` (or assert the gate exists via expect(getGate(...)).toBeDefined() before calling) so no non-null assertions remain while preserving the same assertions for DEFAULT_GATES, DEFAULT_GATES.find, gate.check, and gate.pattern.test.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@tests/grading/grade-session.test.ts`:
- Around line 585-593: The assertion in the test for buildGraduatedSummary uses
2-decimal precision but the function rounds mean_score to 3 decimals; update the
expectation to assert 3-decimal precision (e.g., change
expect(result.mean_score).toBeCloseTo(0.533, 3)) or compare against the exact
rounded value returned by buildGraduatedSummary to ensure the test matches the
3-decimal contract.
---
Duplicate comments:
In `@tests/grading/pre-gates.test.ts`:
- Around line 40-133: Replace unsafe non-null assertions on DEFAULT_GATES
lookups by adding a small helper (e.g., getGate(name): Gate | undefined) that
returns DEFAULT_GATES.find(g => g.name === name) and update every test to use
optional chaining instead of `!`; e.g., change `const gate =
DEFAULT_GATES.find(... )!` to `const gate = getGate("skill_md_read")` and change
uses like `gate.pattern.test(...)` or `gate.check(ctx)` to
`gate?.pattern.test(...)` and `gate?.check(ctx)` (or assert the gate exists via
expect(getGate(...)).toBeDefined() before calling) so no non-null assertions
remain while preserving the same assertions for DEFAULT_GATES,
DEFAULT_GATES.find, gate.check, and gate.pattern.test.
📒 Files selected for processing (3)
- cli/selftune/grading/pre-gates.ts
- tests/grading/grade-session.test.ts
- tests/grading/pre-gates.test.ts
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Actionable comments posted: 1
♻️ Duplicate comments (1)
tests/grading/grade-session.test.ts (1)
**719-721:** ⚠️ Potential issue | 🟡 Minor

Use full optional chaining on element property access. Line 721 uses `result.failure_feedback?.[0].query`; this can still throw if index 0 is `undefined`. Use `result.failure_feedback?.[0]?.query`.

🔧 Proposed fix:

```diff
-    expect(result.failure_feedback?.[0].query).toBe("make slides");
+    expect(result.failure_feedback?.[0]?.query).toBe("make slides");
```

```bash
#!/bin/bash
# Read-only verification: find partially optional-chained failure_feedback element access
rg -nP 'failure_feedback\?\.\[\d+\]\.[A-Za-z_]\w*' tests/grading/grade-session.test.ts
```

As per coding guidelines, `tests/**/*.ts`: optional chaining (`result?.`), not non-null assertions (`result!`).

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/grading/grade-session.test.ts` around lines 719 - 721, The test accesses an element property without full optional chaining (result.failure_feedback?.[0].query) which can throw if index 0 is undefined; update the assertion to use full optional chaining on the element (result.failure_feedback?.[0]?.query) so the access is safe, and run the test to confirm it still validates the expected value when the entry exists.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@tests/grading/grade-session.test.ts`:
- Around line 704-709: Update the test fixture/assertions in
tests/grading/grade-session.test.ts to include the invocation_type field for
each failure_feedback entry so the assertion locks the full structured payload;
locate the failure_feedback array in the test (the object with query "make
slides" and improvement_hint "Add slide keywords") and add the appropriate
invocation_type value used by your invocation-type logic, and do the same for
the other occurrence around lines 719-722 to ensure both fixtures/assertions
include invocation_type.
---
Duplicate comments:
In `@tests/grading/grade-session.test.ts`:
- Around line 719-721: The test accesses an element property without full
optional chaining (result.failure_feedback?.[0].query) which can throw if index
0 is undefined; update the assertion to use full optional chaining on the
element (result.failure_feedback?.[0]?.query) so the access is safe, and run the
test to confirm it still validates the expected value when the entry exists.
📒 Files selected for processing (1)
tests/grading/grade-session.test.ts
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@tests/grading/grade-session.test.ts`:
- Around line 720-723: Change the nested property accesses on the test variable
result to use full optional chaining on the array index so missing elements
won't throw; specifically replace result.failure_feedback?.[0].query with
result.failure_feedback?.[0]?.query and ensure any similar access (e.g.,
result.failure_feedback?.[0].invocation_type) uses
result.failure_feedback?.[0]?.invocation_type or
result.failure_feedback?.[0]?.invocation_type if it already has a safe
operator—update the two assertions referencing failure_feedback[0] in the test
to use ?. on the [0] index.
📒 Files selected for processing (1)
tests/grading/grade-session.test.ts
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Bug fixes:
- Fix routing retries using wrong refiner in evolve-body.ts (#15)
- Fix baseline gate evaluating wrong skill body in evolve.ts (#19)
- Fix Gate 1 accepting routing tables with zero data rows (#20)
- Fix section heading prefix match allowing false positives (#13)

Security:
- Add ReDoS protection for regex assertions (length cap, try-catch, u flag) (#10)
- Harden YES/NO parsing to first-token extraction with punctuation stripping (#23)

Type safety & validation:
- Add type params to readJsonl calls in baseline.ts (#3)
- Validate LLM JSON shape before treating as SkillUnitTest[] (#4)
- Add JSON.parse validation in unit-test-cli.ts (#9)

Code quality:
- Replace console.warn with structured JSON logging (#5, #11)
- Sort readdirSync for deterministic traversal (#6)
- Tighten fuzzy match to prevent over-selection (#7)
- Rename evalFailures to positiveEvals for clarity (#8)
- Enforce timeout_ms via Promise.race in runUnitTest (#12)
- Enrich deploy audit entry with gates snapshot (#16)
- Add Pareto 5th dimension comment (#18)
- Extract composability CLI into composability-cli.ts (#21)
- Add missing eval files to lint-architecture.ts (#28)
- Reformat AssertionType union (#22), template literal (#14), imports (#17)
- Reformat baseline conditional (#2)

Documentation:
- Renumber scope expansion list (#24)
- Add text language to fenced code blocks (#25, #27, #30)
- Add rollback audit payload note (#26)
- Clarify --dir expects corpus root (#29)
- Make --tests optional in syntax (#31)

Tests:
- Remove unused afterEach import (#32)
- Replace ! assertion with ?. optional chaining (#33)
- Fix temp dir cleanup leak with afterAll (#34)
- Add Gate 2 status assertion (#40)
- Strengthen routing validation assertions (#41)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Summary
Implement four high-value eval improvements: deterministic pre-gates (<20ms), graduated 0-1 scoring, rich failure feedback with structured failure context, and Pareto frontier multi-candidate selection.
What Changed
Core Features (4):
Files: 22 changed, 1678 insertions(+), 124 deletions(-). New: pre-gates.ts, pareto.ts, 5 test files. Updated: types.ts, all evolution modules, grader, docs.
Tests: 239 new tests added. Zero breaking changes (all new fields optional).
🤖 Generated with Claude Code