
Migrate selftune CLI from Python to Bun/TypeScript#1

Merged
WellDunDun merged 10 commits into master from WellDunDun/init-reins
Feb 28, 2026

Conversation

@WellDunDun
Collaborator

@WellDunDun WellDunDun commented Feb 28, 2026

Migrate selftune CLI from Python to Bun/TypeScript

Complete rewrite of the selftune CLI from its original Python implementation to Bun/TypeScript. This PR introduces the full CLI surface, supporting infrastructure, documentation, and comprehensive test coverage.

What's included

Core CLI (cli/selftune/)

  • index.ts — Top-level command router dispatching to subcommand modules
  • init.ts — Agent identity bootstrap: detects coding agent (Claude Code, Codex, OpenCode), resolves CLI paths, determines LLM mode, checks hook installation, writes ~/.selftune/config.json
  • constants.ts — Shared paths, log file locations, config directory constants
  • types.ts — Central type definitions (telemetry records, eval entries, proposals, audit entries, config schema)
  • observability.ts — Health check system (selftune doctor) validating hooks, logs, config, and evolution audit state

Grading (cli/selftune/grading/)

  • grade-session.ts — Grade skill sessions via agent mode (Claude Code subprocess) or API mode (direct Anthropic SDK). Includes structured grading rubric, expectation matching, and JSONL output

Evolution pipeline (cli/selftune/evolution/)

  • evolve.ts — Full evolution orchestrator: extract failure patterns → propose description changes → validate proposals → deploy with PR creation → audit trail
  • extract-patterns.ts — Cross-reference eval entries with usage records to identify missed queries; Jaccard similarity clustering with transitive single-linkage merging
  • propose-description.ts — LLM-driven description proposal generation from failure patterns
  • validate-proposal.ts — Validate proposed descriptions via LLM using eval entries as test cases
  • deploy-proposal.ts — Deploy validated proposals: replace description in SKILL.md, create timestamped backups, git branch + commit + PR via gh; includes subprocess timeout protection
  • rollback.ts — Two-strategy rollback: restore from timestamped .bak backups (latest deploy) or from audit trail original_description entries (specific proposals); description-only restoration preserves SKILL.md structure
  • audit.ts — Append-only JSONL audit trail for evolution actions
  • stopping-criteria.ts — Determine when evolution should stop based on score thresholds and iteration limits
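The pattern-extraction step above clusters missed queries by Jaccard similarity with transitive single-linkage merging. A minimal sketch of that idea (function names are illustrative, not the actual exports of extract-patterns.ts):

```typescript
// Jaccard similarity over word tokens: |A ∩ B| / |A ∪ B|.
function jaccard(a: string, b: string): number {
  const setA = new Set(a.toLowerCase().split(/\s+/).filter(Boolean));
  const setB = new Set(b.toLowerCase().split(/\s+/).filter(Boolean));
  let intersection = 0;
  for (const token of setA) if (setB.has(token)) intersection++;
  const union = setA.size + setB.size - intersection;
  return union === 0 ? 0 : intersection / union;
}

// Single-linkage: a query joins a cluster if ANY member meets the threshold,
// and clusters it bridges are merged, so similarity chains merge transitively.
function clusterQueries(queries: string[], threshold = 0.5): string[][] {
  const clusters: string[][] = [];
  for (const query of queries) {
    const matching = clusters.filter((c) =>
      c.some((member) => jaccard(member, query) >= threshold),
    );
    const merged = [query, ...matching.flat()];
    for (const m of matching) clusters.splice(clusters.indexOf(m), 1);
    clusters.push(merged);
  }
  return clusters;
}
```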

Ingestors (cli/selftune/ingestors/)

  • codex-rollout.ts — Parse Codex rollout directories (year/month/day/session structure), extract telemetry, and detect skill triggers via word-boundary regex matching
  • codex-wrapper.ts — Wrap codex exec with telemetry capture; stream subprocess output while recording tool calls, errors, and skill usage
  • opencode-ingest.ts — Ingest OpenCode SQLite databases; safe identifier validation prevents SQL injection; parameterized queries for temporal filtering
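The word-boundary trigger detection mentioned above can be sketched roughly as follows (a hypothetical helper, not the real codex-rollout.ts implementation):

```typescript
// Illustrative word-boundary skill-trigger check. The real ingestor may
// differ; this just shows why \b matters for substring false positives.
function detectSkillTrigger(text: string, skillName: string): boolean {
  // Escape regex metacharacters so skill names containing dots or
  // parentheses match literally.
  const escaped = skillName.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
  // \b prevents "reins" from matching inside "reinstall".
  const pattern = new RegExp(`\\b${escaped}\\b`, "i");
  return pattern.test(text);
}
```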

Monitoring (cli/selftune/monitoring/)

  • watch.ts — Continuous monitoring: compute score snapshots over configurable session windows, detect regressions against thresholds, auto-rollback on sustained regression with configurable cooldown
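The regression-detection logic can be illustrated with a simplified check over score snapshots (names and the "sustained for N snapshots" rule here are assumptions, not watch.ts's actual API):

```typescript
interface Snapshot {
  timestamp: string;
  score: number; // 0-1 scale assumed for this sketch
}

// A regression counts as "sustained" only if every snapshot in the most
// recent window drops below the baseline by more than the threshold,
// which avoids auto-rollback on a single noisy session.
function isSustainedRegression(
  snapshots: Snapshot[],
  baseline: number,
  threshold: number, // maximum tolerated score drop
  window: number, // consecutive snapshots that must breach
): boolean {
  if (snapshots.length < window) return false;
  const recent = snapshots.slice(-window);
  return recent.every((s) => baseline - s.score > threshold);
}
```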

Hooks (cli/selftune/hooks/)

  • session-stop.ts — Session end hook: extract telemetry from transcript JSONL with runtime type guards
  • skill-eval.ts — Post-tool-use hook: detect SKILL.md reads and log usage records with runtime file_path validation
  • prompt-log.ts — Prompt submission logging

Eval generation (cli/selftune/eval/)

  • hooks-to-evals.ts — Generate eval sets from hook logs: sample queries, create should/should-not trigger pairs, seeded randomization for reproducibility

Utilities (cli/selftune/utils/)

  • jsonl.ts — JSONL read/write with optional schema validation via logType
  • llm-call.ts — Unified LLM interface supporting both agent mode (Claude subprocess) and API mode (Anthropic SDK)
  • transcript.ts — Parse Claude Code transcript JSONL to extract metrics (tool calls, errors, skills, tokens)
  • schema-validator.ts — Runtime schema validation for telemetry records
  • logging.ts — Structured logging with module context
  • seeded-random.ts — Deterministic PRNG for reproducible eval sampling
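A deterministic PRNG of the kind seeded-random.ts might use is mulberry32, paired with a seeded Fisher-Yates shuffle; the actual algorithm in the module is not specified in this PR description, so treat this as an assumed sketch:

```typescript
// mulberry32: a small, fast 32-bit PRNG. Same seed → same sequence,
// which makes eval sampling reproducible across runs.
function mulberry32(seed: number): () => number {
  let state = seed >>> 0;
  return () => {
    state = (state + 0x6d2b79f5) >>> 0;
    let t = state;
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

// Fisher-Yates shuffle driven by the seeded generator.
function seededShuffle<T>(items: T[], seed: number): T[] {
  const rand = mulberry32(seed);
  const out = [...items];
  for (let i = out.length - 1; i > 0; i--) {
    const j = Math.floor(rand() * (i + 1));
    [out[i], out[j]] = [out[j], out[i]];
  }
  return out;
}
```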

Skill definition (skill/)

  • SKILL.md — Selftune skill definition for Claude Code integration
  • 8 workflow files: Doctor, Evals, Evolve, Grade, Ingest, Initialize, Rollback, Watch
  • Reference docs: grading methodology, invocation taxonomy, log formats
  • settings_snippet.json — Claude Code hook configuration template

Documentation (docs/)

  • PRD.md — Product requirements document
  • Design docs: core beliefs, evolution pipeline, monitoring pipeline
  • Execution plans, escalation policy, golden principles, tech debt tracker

Infrastructure

  • biome.json — Linter/formatter configuration
  • tsconfig.json — TypeScript configuration targeting Bun
  • Makefile — Build commands
  • .github/workflows/ci.yml — CI pipeline
  • lint-architecture.ts — Architectural lint rules
  • ARCHITECTURE.md, AGENTS.md, README.md

Security hardening (CodeRabbit reviews)

Three rounds of CodeRabbit review addressed across commits:

  • SQL injection prevention — assertSafeIdentifier() validates all table names before SQL interpolation in opencode-ingest.ts; sinceTs uses parameterized queries (WHERE created > ?)
  • Subprocess safety — runCommand() has configurable timeout with process kill on expiry; stdout/stderr drained concurrently via Promise.all to prevent pipe deadlocks
  • Input validation — CLI numeric inputs validated with strict regex before parsing (/^\d+$/, /^\d+(\.\d+)?$/); --eval-id, --since, --llm-mode all validated before use
  • Type safety — Runtime type guards for hook payloads; as string casts replaced with typeof checks; loadExpectationsFromEvalsJson validates parsed JSON structure at runtime
  • Error handling — JSON.parse calls wrapped with descriptive error context; statSync guarded against broken symlinks; dynamic imports selectively catch MODULE_NOT_FOUND only; debug-gated error logging in hooks
  • Robustness — Timestamped backups prevent overwrites; sanitizeForGitRef guards against empty strings; replaceDescription preserves pre-heading preamble; findOriginalFromAudit tolerates both prefixed and plain description formats
  • Observability — Hook path checks use .git/hooks/ (not bundled dir); config validation checks agent_type/llm_mode against allowed values; shared validateJsonlFile helper eliminates duplication; execSync timeout prevents git hangs
  • Test quality — Deterministic test assertions (no conditional skips); 8 new validation tests for loadExpectationsFromEvalsJson
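The identifier and numeric-input validation described above can be sketched like this (the helper name assertSafeIdentifier mirrors the PR description, but the signatures and parseStrictInt are assumptions):

```typescript
// Allow only plain identifiers, so table names interpolated into SQL
// text cannot carry injection payloads; actual values still flow through
// parameterized `?` placeholders, never string interpolation.
function assertSafeIdentifier(name: string): string {
  if (!/^[A-Za-z_][A-Za-z0-9_]*$/.test(name)) {
    throw new Error(`Unsafe SQL identifier: ${name}`);
  }
  return name;
}

// Strict integer parsing: rejects "12abc", "0x10", "1e3", "" before
// Number() coercion can silently accept them.
function parseStrictInt(input: string): number {
  if (!/^\d+$/.test(input)) {
    throw new Error(`Invalid integer: ${input}`);
  }
  return Number(input);
}
```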

Test coverage

407 tests across 27 files, 1027 assertions, all passing.

| Module | Tests | File(s) |
| --- | --- | --- |
| Evolution (unit) | 129 | 8 files |
| Evolution (integration) | 9 | integration.test.ts |
| Grading | 43 | grade-session.test.ts |
| Hooks | 17 | 3 files |
| Ingestors | 41 | 3 files |
| Init | 19 | init.test.ts |
| Monitoring | 72 | 2 files |
| Observability | 10 | observability.test.ts |
| Utilities | 48 | 4 files |
| Architecture lint | 19 | lint-architecture.test.ts |

Stats

  • 100 files changed, 18,051 insertions
  • 10 commits (initial migration + 3 review rounds)
  • 0 test failures, 0 lint errors

WellDunDun and others added 4 commits February 28, 2026 17:25
Sets up the full project structure including:
- CLI modules for eval, grading, hooks, and ingestors
- Reins agent skills and workflow definitions
- CI pipeline, linting config, and architecture docs
- PRD, golden principles, and risk policy

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Preserves empty directory structure with .gitkeep so skills
can be symlinked into these paths after clone.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Both symlink to .agents/skills/reins so the skill is
discoverable by all agent frameworks.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Rewrites the entire cli/selftune/ toolchain from Python to TypeScript
targeting the Bun runtime, eliminating Python ecosystem friction for
JS/TS developers. Zero runtime dependencies — all functionality uses
Bun built-ins (Bun.spawn, bun:sqlite, fetch, parseArgs).

- Add 16 TypeScript source files replacing 10 Python modules
- Add 11 test files (121 tests, 342 assertions, all passing)
- Delete all Python files, pyproject.toml, ruff.toml, __init__.py
- Update configs: package.json, tsconfig.json, biome.json, Makefile, CI
- Update docs: AGENTS.md, ARCHITECTURE.md, PRD.md, README.md
- Fix all biome lint errors (import sorting, template literals, non-null assertions)
- Architecture lint passes with zero violations

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@coderabbitai

coderabbitai bot commented Feb 28, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

📝 Walkthrough


Adds a Reins/selftune harness: CLI, telemetry hooks, ingestors/wrappers, eval/grading/evolution pipelines (pattern→propose→validate→deploy/rollback), monitoring/doctor checks, schema/validation utilities, CI/config, documentation, and extensive tests.

Changes

Cohort / File(s) Summary
Reins Harness docs & skill metadata
/.agents/skills/reins/HarnessMethodology.md, /.agents/skills/reins/SKILL.md, /.agents/skills/reins/Workflows/*, .claude/skills/reins, skill/SKILL.md, skill/Workflows/*, skill/references/*, skill/settings_snippet.json
Adds comprehensive Reins/selftune documentation, workflow guides (init/audit/doctor/evolve/scaffold/ingest/watch/grade/rollback), skill metadata, and settings snippet for hook installation.
CLI entry, commands & scaffolding
cli/selftune/index.ts, cli/selftune/init.ts, cli/selftune/constants.ts, cli/selftune/types.ts
New top-level CLI, init flow, exported constants, and expanded public types for config, logs, evals, grading, evolution, health, and monitoring.
Telemetry hooks & JSONL utilities
cli/selftune/hooks/*, cli/selftune/utils/jsonl.ts, cli/selftune/utils/transcript.ts, cli/selftune/utils/logging.ts, cli/selftune/utils/schema-validator.ts
Claude hooks (prompt-log, skill-eval, session-stop) plus robust JSONL IO, transcript parsing, structured logging, and record schema validation.
Eval, grading & sampling
cli/selftune/eval/hooks-to-evals.ts, cli/selftune/grading/grade-session.ts, cli/selftune/utils/seeded-random.ts
Eval-set generation from logs, seeded deterministic sampling/shuffle, and a rubric-based grader supporting agent/API modes and transcript/excerpt handling.
LLM interface & utilities
cli/selftune/utils/llm-call.ts
Unified LLM caller (agent or API), agent detection, markdown-fence stripping, and unified callLlm dispatcher.
Ingestors & wrappers
cli/selftune/ingestors/codex-rollout.ts, cli/selftune/ingestors/codex-wrapper.ts, cli/selftune/ingestors/opencode-ingest.ts
Codex and OpenCode ingest pipelines and a Codex wrapper to normalize historical and live session telemetry into shared JSONL logs.
Evolution pipeline & orchestration
cli/selftune/evolution/...
evolve.ts, extract-patterns.ts, propose-description.ts, validate-proposal.ts, deploy-proposal.ts, rollback.ts, stopping-criteria.ts, audit.ts
Full evolution flow: failure-pattern extraction, LLM proposal generation, per-query trigger validation, deploy (backup/branch/PR), rollback strategies, audit trail, and stopping criteria.
Monitoring & observability
cli/selftune/monitoring/watch.ts, cli/selftune/observability.ts, cli/selftune/evolution/audit.ts
Health checks, doctor orchestration, monitoring snapshot computation, regression detection, alerting, and optional auto-rollback wiring.
Project config, tooling & CI
package.json, tsconfig.json, biome.json, Makefile, .github/workflows/ci.yml, .gitignore, lint-architecture.ts
Project metadata, TypeScript config, linter config, Makefile targets, CI workflow, .gitignore, and an architecture linter enforcing import constraints.
Risk / governance / design docs
AGENTS.md, ARCHITECTURE.md, PRD.md, README.md, docs/*, risk-policy.json, skills-lock.json
Adds product, architecture, design, and governance documentation plus a risk-policy config and skill lock.
Tests
tests/**/* (many new suites)
Extensive unit and integration tests covering hooks, utils, evals, grading, ingestors, evolution, monitoring, linter, and end-to-end flows with filesystem fixtures.
Small meta files
/.agents/skills/reins, /.claude/skills/reins, skill/settings_snippet.json
Pointer files and settings snippet for skill discovery and user settings.

Sequence Diagram(s)

sequenceDiagram
  autonumber
  participant CLI as selftune CLI
  participant Logs as JSONL_Logs
  participant Extractor as PatternExtractor
  participant LLM as LLM (agent/API)
  participant Validator as Validator
  participant Git as Git/GitHub
  participant Audit as EvolutionAuditLog

  CLI->>Logs: read telemetry / skill / query logs
  CLI->>Extractor: request failure pattern extraction
  Extractor-->>CLI: failure patterns
  CLI->>LLM: generate proposal (system + prompt)
  LLM-->>CLI: proposal JSON (description, confidence)
  CLI->>Validator: validate proposal vs eval set (per-query LLM checks)
  Validator-->>CLI: ValidationResult (pass rates, regressions)
  CLI->>Git: optionally backup/update `SKILL.md` and create branch/PR
  Git-->>CLI: commit/PR result
  CLI->>Audit: append created/validated/deployed entries (JSONL)
  Audit-->>CLI: confirmation

Estimated code review effort

🎯 5 (Critical) | ⏱️ ~120 minutes


Poem

🐇 I nibble logs and hop through lines of code,
I seed shuffles and cluster missed queries on the road,
I ask the LLM for clearer prose and run the test,
Back up the SKILL, deploy, audit — then rest.
Hooray for Reins, the repo grows best! 🌿

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

  • Docstring Coverage — ⚠️ Warning: Docstring coverage is 75.52%, below the required 80.00% threshold. Resolution: write docstrings for the functions missing them.
✅ Passed checks (2 passed)
  • Title check — ✅ Passed: The title "Migrate selftune CLI from Python to Bun/TypeScript" is concise, specific, and clearly summarizes the main change: a language/runtime migration of the selftune CLI.
  • Description check — ✅ Passed: Check skipped because CodeRabbit's high-level summary is enabled.




@chatgpt-codex-connector chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 7294a0363f


Comment on lines +37 to +41

```ts
case "evals": {
  // Strip "evals" from argv so parseArgs in the module sees the right args
  process.argv = [process.argv[0], process.argv[1], ...process.argv.slice(3)];
  await import("./eval/hooks-to-evals.js");
  break;
```


P1 Badge Execute subcommand entrypoints instead of only importing

This dispatch path only imports the subcommand module, but those modules run their CLIs behind if (import.meta.main) guards, so the imported code never calls main(). In practice, selftune evals, grade, ingest-codex, ingest-opencode, and wrap-codex return successfully without performing any work, which breaks the top-level CLI for all non-doctor commands.


Collaborator Author


Fixed in cdac726. All subcommand modules now export cliMain() and the router calls them explicitly instead of relying on import.meta.main.
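The fixed dispatch shape can be sketched like this (a simplified illustration using an injectable loader map, not the actual index.ts; command names and paths are placeholders):

```typescript
type SubcommandModule = { cliMain: (args: string[]) => Promise<void> | void };

// In index.ts the loaders would be dynamic import() calls like
// () => import("./eval/hooks-to-evals.js"); plain async factories keep
// this sketch self-contained. The key point: the router calls the
// exported cliMain() explicitly instead of relying on import.meta.main
// side effects, so importing the module actually performs the work.
function makeRouter(loaders: Record<string, () => Promise<SubcommandModule>>) {
  return async (argv: string[]): Promise<void> => {
    const [command, ...rest] = argv;
    const load = command ? loaders[command] : undefined;
    if (!load) throw new Error(`Unknown command: ${command}`);
    const mod = await load();
    await mod.cliMain(rest);
  };
}
```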

Comment on lines +30 to +34

```ts
timestamp: new Date().toISOString(),
session_id: sessionId,
cwd,
transcript_path: transcriptPath,
...metrics,
```


P2 Badge Include source in Stop-hook telemetry records

The Stop hook writes session telemetry without a source field, but log health validation treats source as required for session_telemetry, so valid Claude sessions are flagged as schema failures by doctor. This causes false negatives in health checks and makes observability output unreliable once real hook data is present.


Collaborator Author


Fixed in cdac726. Added source: "claude_code" to the session telemetry record.

Comment on lines +5 to +7

```yaml
    branches: [master]
  pull_request:
    branches: [master]
```


P2 Badge Target main branch in CI triggers

This workflow is configured to run only for master, but the repository history/branch layout is on main, so pushes and PRs against main will skip CI entirely. That removes automated lint/test coverage from the normal merge path and can let regressions land unnoticed.


Collaborator Author


Not applicable — this repo uses master as its main branch (confirmed by git remote show origin). The CI config is correct.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 18

🧹 Nitpick comments (19)
ARCHITECTURE.md (2)

26-49: Add language specifier to the module diagram code block.

Same issue as above - the markdown linter requires a language specifier.

♻️ Proposed fix
-```
+```text
 cli/selftune/
 ├── types.ts         Shared interfaces
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@ARCHITECTURE.md` around lines 26 - 49, The fenced code block in
ARCHITECTURE.md that contains the module diagram is missing a language
specifier; update the opening triple-backticks to include a language (e.g.,
change the block beginning "```" to "```text") so the markdown linter accepts
it—edit the diagram block that starts with the tree for "cli/selftune/" and ends
before "skill/" to add the language token.

16-18: Add language specifier to fenced code blocks for linting compliance.

The static analysis flags these code blocks as lacking language specifiers. Since these are text diagrams, use text or plaintext.

♻️ Proposed fix
-```
+```text
 Observe → Detect → Diagnose → Propose → Validate → Deploy → Watch → Repeat
🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In @ARCHITECTURE.md around lines 16 - 18, The fenced code block containing the
diagram "Observe → Detect → Diagnose → Propose → Validate → Deploy → Watch →
Repeat" lacks a language specifier; update the block to use a plaintext
specifier (e.g., change ``` to ```text) so the linter recognizes it as a
text/diagram block and satisfies static analysis.


.gitignore (1)

`1-15`: **Consider adding common ignore patterns for completeness.**

The gitignore covers the basics well. You may want to add patterns for environment files and potential build outputs if they become relevant.


♻️ Optional additions

```diff
 # Test artifacts
 coverage/
+
+# Environment
+.env
+.env.local
+.env*.local
+
+# Build outputs (if needed later)
+dist/
```
🤖 Prompt for AI Agents

```
Verify each finding against the current code and only fix it if needed.

In @.gitignore around lines 1 - 15, Add common ignore patterns for environment
and build artifacts in addition to the existing entries (node_modules, bun.lock,
*.tsbuildinfo, .context, .DS_Store, Thumbs.db, .vscode, .idea, coverage) by
appending patterns such as .env, .env.*, dist/, build/, out/, .next/,
.parcel-cache/, .turbo/, *.log (npm-debug.log, yarn-error.log), and logs/ so
sensitive env files and typical build/output caches are excluded from commits.
```

.github/workflows/ci.yml (1)

`14-15`: **Consider adding Bun dependency caching to speed up CI.**

Both jobs install dependencies independently. Adding cache configuration can significantly reduce CI runtime.


♻️ Proposed cache configuration

```diff
       - uses: oven-sh/setup-bun@v2
+        with:
+          bun-version: latest
+      - uses: actions/cache@v4
+        with:
+          path: ~/.bun/install/cache
+          key: ${{ runner.os }}-bun-${{ hashFiles('**/bun.lock', '**/package.json') }}
+          restore-keys: |
+            ${{ runner.os }}-bun-
       - run: bun install
```

Apply similar changes to the test job as well.
🤖 Prompt for AI Agents

```
Verify each finding against the current code and only fix it if needed.

In @.github/workflows/ci.yml around lines 14 - 15, CI is reinstalling Bun deps
each job; add an actions/cache step before "uses: oven-sh/setup-bun@v2" / "run:
bun install" to restore/save Bun's cache so installs are faster. Insert an
actions/cache@v3 step that uses a key composed of runner OS and the bun.lockb
(or lock file) and restore-keys fallback, caching the Bun cache directories
(e.g., ~/.bun and .bun or the project's bun cache path) and ensure the same
cache step is added to both jobs that run "uses: oven-sh/setup-bun@v2" and "run:
bun install".
```

cli/selftune/utils/logging.ts (1)

`34-37`: **Harden error serialization for non-`Error` throwables and missing stacks.**

Current formatting can emit `\nundefined` and drops context for non-`Error` values. Normalizing this improves debugging reliability.


Suggested change

```diff
     error(message: string, err?: unknown) {
-      const exception =
-        err instanceof Error ? `${err.name}: ${err.message}\n${err.stack}` : undefined;
+      const exception =
+        err instanceof Error
+          ? [`${err.name}: ${err.message}`, err.stack].filter(Boolean).join("\n")
+          : err !== undefined
+            ? String(err)
+            : undefined;
       emit("ERROR", message, exception);
     },
```
🤖 Prompt for AI Agents

```
Verify each finding against the current code and only fix it if needed.

In `@cli/selftune/utils/logging.ts` around lines 34 - 37, In the error(message:
string, err?: unknown) method, improve serialization of non-Error throwables and
guard against missing stacks: normalize err into a string by checking for Error
(use name, message and stack if present), otherwise attempt JSON.stringify and
fall back to util.inspect or String(err), and only append a newline+stack when a
stack exists to avoid producing "\nundefined"; then pass that normalized string
to emit("ERROR", message, exception). Ensure you reference the error(...) method
and the emit(...) call when making the changes.
```

.agents/skills/reins/HarnessMethodology.md (2)

`23-38`: **Add language specifiers to fenced code blocks.**

The code blocks showing directory structures lack language specifiers, which triggers markdownlint warnings. Using a specifier improves rendering in some viewers.


📝 Proposed fix

```diff
-```
+```text
 AGENTS.md              # ~100 lines, map to deeper docs
```

Apply the same change to the code block at line 50.
🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In @.agents/skills/reins/HarnessMethodology.md around lines 23 - 38, The fenced
code block in HarnessMethodology.md that shows the directory tree (starting with
"AGENTS.md", "ARCHITECTURE.md", etc.) lacks a language specifier and causes
markdownlint warnings; update that opening fence to use a language like text (e.g., replace with ```text) and do the same for the other similar fenced
block later in the file (the one that also contains the directory listing) so
both directory-structure blocks include a language specifier.



---

`77-77`: **Minor: Consider hyphenating compound adjective.**

"full service observability" → "full-service observability" when used as a compound adjective modifying a noun. This is a minor style suggestion from static analysis.

🤖 Prompt for AI Agents

```
Verify each finding against the current code and only fix it if needed.

In @.agents/skills/reins/HarnessMethodology.md at line 77, Update the phrase in
the sentence starting "For CLI-first repositories, prioritize diagnosability
surfaces..." to use the hyphenated compound adjective "full-service
observability" instead of "full service observability" so the modifier is
grammatical; locate the sentence in HarnessMethodology.md and replace the
unhyphenated phrase accordingly.
```

lint-architecture.ts (2)

`54-56`: **Empty catch hides potential errors beyond "directory doesn't exist".**

Permission errors or symlink issues would be silently swallowed. Consider logging unexpected errors in verbose mode or at least distinguishing ENOENT from other failures.


♻️ Proposed improvement

```diff
-  } catch {
-    // directory doesn't exist
+  } catch (err: unknown) {
+    // Ignore ENOENT (directory doesn't exist), but could log others if needed
+    if (err instanceof Error && "code" in err && (err as NodeJS.ErrnoException).code !== "ENOENT") {
+      console.warn(`Warning: Could not read directory ${dir}: ${err.message}`);
+    }
   }
```
🤖 Prompt for AI Agents

```
Verify each finding against the current code and only fix it if needed.

In `@lint-architecture.ts` around lines 54 - 56, The empty catch block that
currently reads "catch { // directory doesn't exist }" swallows all errors;
change it to capture the error (e.g., "catch (err)") and explicitly check
err.code === 'ENOENT' to keep the current behavior for missing directories, but
log or rethrow other errors (use the module's verbose/logger if available or
console.error) so permission, symlink, or unexpected failures are not silently
ignored; update the comment to reflect the distinction.
```


---

`27-37`: **Import detection may miss edge cases.**

The current check only catches static `import` statements at line start. It won't catch:
- Dynamic imports: `await import("../grading/...")`
- Re-exports: `export { foo } from "../grading/..."`

If these patterns are unlikely in your codebase, this is fine as-is. Otherwise, consider extending the check.

🤖 Prompt for AI Agents

```
Verify each finding against the current code and only fix it if needed.

In `@lint-architecture.ts` around lines 27 - 37, The import detection loop only
checks lines that start with "import" and misses dynamic imports and re-exports;
update the loop that iterates lines (the block referencing lines[i], line,
FORBIDDEN_IMPORTS and violations) to also test each line for other import
patterns: check for "import(" (dynamic imports), "export" lines that contain "
from " (re-exports like export {..} from "…"), and inline "import ... from"
occurrences using a simple regex (e.g. /\bimport\b.*\bfrom\b/ or
line.includes('import(') / line.includes(' from ')); when any of these patterns
match, still iterate FORBIDDEN_IMPORTS and push the same violation message so
dynamic imports and re-exports are caught the same way as static top-level
imports.
```

cli/selftune/eval/hooks-to-evals.ts (1)

`329-333`: **Redundant null coalescing with default values.**

`parseArgs` already applies the `default` value, so `values["skill-log"] ?? SKILL_LOG` is redundant. The fallback will never be reached.


♻️ Simplify by removing redundant fallbacks

```diff
-  const skillRecords = readJsonl<SkillUsageRecord>(values["skill-log"] ?? SKILL_LOG);
-  const queryRecords = readJsonl<QueryLogRecord>(values["query-log"] ?? QUERY_LOG);
-  const telemetryRecords = readJsonl<SessionTelemetryRecord>(
-    values["telemetry-log"] ?? TELEMETRY_LOG,
-  );
+  const skillRecords = readJsonl<SkillUsageRecord>(values["skill-log"]!);
+  const queryRecords = readJsonl<QueryLogRecord>(values["query-log"]!);
+  const telemetryRecords = readJsonl<SessionTelemetryRecord>(values["telemetry-log"]!);
```
🤖 Prompt for AI Agents

```
Verify each finding against the current code and only fix it if needed.

In `@cli/selftune/eval/hooks-to-evals.ts` around lines 329 - 333, Remove the
redundant null-coalescing fallbacks when calling readJsonl — parseArgs already
supplies defaults — so pass the values map entries directly (e.g., replace
readJsonl<SkillUsageRecord>(values["skill-log"] ?? SKILL_LOG) with
readJsonl<SkillUsageRecord>(values["skill-log"])). Do the same for
readJsonl<QueryLogRecord>(values["query-log"] ?? QUERY_LOG) and
readJsonl<SessionTelemetryRecord>(values["telemetry-log"] ?? TELEMETRY_LOG),
keeping the same variable names (skillRecords, queryRecords, telemetryRecords)
and constants (SKILL_LOG, QUERY_LOG, TELEMETRY_LOG) only if they are still used
elsewhere.
```

cli/selftune/utils/jsonl.ts (1)

`19-21`: **Consider runtime type validation for safer parsing.**

The `as T` cast trusts that parsed JSON matches the expected type. For logs that may be corrupted or have schema drift, this could lead to runtime errors downstream when accessing expected properties.

For a CLI tool where invalid records are already skipped, this is acceptable. If stricter guarantees are needed later, consider adding a schema validation layer (e.g., Zod or a simple runtime check).

🤖 Prompt for AI Agents

```
Verify each finding against the current code and only fix it if needed.

In `@cli/selftune/utils/jsonl.ts` around lines 19 - 21, The code pushes
JSON.parse(trimmed) as T into records which trusts the cast; update the parsing
function in cli/selftune/utils/jsonl.ts to perform runtime validation before
pushing (e.g., accept an optional validator/zod schema or a predicate argument
and run it against the parsed object), so instead of unguarded
records.push(JSON.parse(trimmed) as T) you parse, validate the result and only
push when validation passes (log or skip invalid records in the
catch/validation-failure path); reference the parsed value from
JSON.parse(trimmed) and the records array when implementing the check.
```

</details>

</blockquote></details>
<details>
<summary>cli/selftune/utils/transcript.ts (1)</summary><blockquote>

`47-61`: **Code duplication with `extractUserText` helper.**

The user text extraction logic here (handling string content and array of text blocks) is duplicated in the `extractUserText` helper function at lines 244-259. Consider refactoring to reuse the helper.


<details>
<summary>♻️ Refactor to use extractUserText helper</summary>

```diff
     // Track last user query
     if (role === "user") {
-      if (typeof content === "string" && content.trim()) {
-        lastUserQuery = content.trim();
-      } else if (Array.isArray(content)) {
-        const texts = content
-          .filter(
-            (p): p is Record<string, unknown> =>
-              typeof p === "object" && p !== null && (p as Record<string, unknown>).type === "text",
-          )
-          .map((p) => (p.text as string) ?? "")
-          .filter(Boolean);
-        const text = texts.join(" ").trim();
-        if (text) lastUserQuery = text;
-      }
+      const text = extractUserText(content);
+      if (text) lastUserQuery = text;
     }
```
</details>

<details>
<summary>🤖 Prompt for AI Agents</summary>

```
Verify each finding against the current code and only fix it if needed.

In `@cli/selftune/utils/transcript.ts` around lines 47 - 61, The user-text
extraction block inside the if (role === "user") branch duplicates logic already
implemented in extractUserText; replace the inline handling that reads and joins
string/array content (which assigns lastUserQuery) with a call to
extractUserText(content) and assign its result to lastUserQuery if truthy.
Locate the inline block around variables role, content, lastUserQuery and remove
the duplicated parsing, reusing the extractUserText helper (ensure you import or
reference extractUserText in the same module/scope).
```

</details>

</blockquote></details>
<details>
<summary>skill/SKILL.md (1)</summary><blockquote>

`34-36`: **Add language specifier to fenced code block.**

Per markdownlint, fenced code blocks should have a language specified. This appears to be a file path reference.


<details>
<summary>📝 Add language specifier</summary>

```diff
-```
+```text
 ~/.claude/session_telemetry_log.jsonl
 ```
```
</details>

<details>
<summary>🤖 Prompt for AI Agents</summary>

```
Verify each finding against the current code and only fix it if needed.

In `@skill/SKILL.md` around lines 34 - 36, The fenced code block containing the
path "~/.claude/session_telemetry_log.jsonl" in SKILL.md lacks a language
specifier; update that block to use a language tag (e.g., change the opening
fence from ``` to ```text) so it becomes ```text followed by
~/.claude/session_telemetry_log.jsonl and then the closing ```, satisfying
markdownlint.
```


</details>

</blockquote></details>
<details>
<summary>tests/observability.test.ts (1)</summary><blockquote>

`32-37`: **Test may be environment-dependent.**

This test assumes all hook files exist and pass. In a clean checkout or CI environment where the compiled hooks directory structure differs, this assertion could fail unexpectedly. Consider making this test more resilient or documenting the prerequisite.

<details>
<summary>🤖 Prompt for AI Agents</summary>

```
Verify each finding against the current code and only fix it if needed.

In `@tests/observability.test.ts` around lines 32 - 37, The test "finds installed
hook files" is brittle because it assumes three passing hooks; update the test
to be resilient by either ensuring the hooks are created in test setup or
relaxing the assertion: call checkHookInstallation() (same symbol) in a
beforeAll that creates the expected hook files, or change the expectation from
expect(passing.length).toBe(3) to a non-failing assertion such as
expect(passing.length).toBeGreaterThanOrEqual(0) or
expect(passing.length).toBeGreaterThanOrEqual(1) (depending on minimum
required), so the test does not fail in clean/CI checkouts where compiled hook
files may be absent.
```

</details>

</blockquote></details>
<details>
<summary>tests/eval/hooks-to-evals.test.ts (2)</summary><blockquote>

`21-23`: **Unused helper function `writeJsonl`.**

The `writeJsonl` helper is defined but never invoked within any test in this file. The tests for `buildEvalSet` use in-memory arrays directly rather than reading from JSONL files. Consider removing this dead code or adding tests that exercise file-based input if that's the intended use case.


<details>
<summary>🧹 Remove unused helper</summary>

```diff
-// ---------------------------------------------------------------------------
-// Helper to write JSONL fixture files
-// ---------------------------------------------------------------------------
-function writeJsonl(path: string, records: unknown[]): void {
-  writeFileSync(path, `${records.map((r) => JSON.stringify(r)).join("\n")}\n`);
-}
-
```
</details>

<details>
<summary>🤖 Prompt for AI Agents</summary>

```
Verify each finding against the current code and only fix it if needed.

In `@tests/eval/hooks-to-evals.test.ts` around lines 21 - 23, The helper function
writeJsonl is defined but never used; remove the dead code by deleting the
writeJsonl function definition from tests/eval/hooks-to-evals.test.ts (and any
now-unused imports) or, if file-based input is intended, add a test that calls
writeJsonl to create a JSONL file and then invokes the existing buildEvalSet
test path to read that file; reference: function name writeJsonl and the tests
that exercise buildEvalSet so you update/remove the helper consistently.
```

</details>

---

`8-16`: **Unused temporary directory setup.**

The `tmpDir` variable is created in `beforeEach` and cleaned in `afterEach`, but no tests write files to this directory. This setup appears to be scaffolding for the unused `writeJsonl` helper. Either remove both or add file-based tests that utilize them.

<details>
<summary>🤖 Prompt for AI Agents</summary>

```
Verify each finding against the current code and only fix it if needed.

In `@tests/eval/hooks-to-evals.test.ts` around lines 8 - 16, The test suite sets
up a temporary directory (tmpDir via beforeEach/afterEach) but never uses it;
remove the unused tmpDir variable and the beforeEach/afterEach hooks or instead
add tests that call the existing writeJsonl helper to write/read files using
tmpDir. Locate the tmpDir declaration and the beforeEach/afterEach hooks and
either delete them along with any unused writeJsonl helper, or modify existing
tests to create files with writeJsonl (referencing writeJsonl, tmpDir,
beforeEach, afterEach) and assert file-based behavior so the temp-dir lifecycle
is actually exercised.
```

</details>

</blockquote></details>
<details>
<summary>skill/settings_snippet.json (1)</summary><blockquote>

`11-35`: **Quote hook script paths in command strings.**

Unquoted replacement paths can fail on valid filesystem paths containing spaces.



<details>
<summary>💡 Proposed fix</summary>

```diff
-            "command": "bun run /PATH/TO/cli/selftune/hooks/prompt-log.ts",
+            "command": "bun run \"/PATH/TO/cli/selftune/hooks/prompt-log.ts\"",
...
-            "command": "bun run /PATH/TO/cli/selftune/hooks/skill-eval.ts",
+            "command": "bun run \"/PATH/TO/cli/selftune/hooks/skill-eval.ts\"",
...
-            "command": "bun run /PATH/TO/cli/selftune/hooks/session-stop.ts",
+            "command": "bun run \"/PATH/TO/cli/selftune/hooks/session-stop.ts\"",
```
</details>

<details>
<summary>🤖 Prompt for AI Agents</summary>

```
Verify each finding against the current code and only fix it if needed.

In `@skill/settings_snippet.json` around lines 11 - 35, The unquoted command
strings for the hook entries (the "command" values that run prompt-log.ts,
skill-eval.ts, and session-stop.ts) can break on filesystem paths containing
spaces; update each "command" string in the JSON (the entries that currently
read bun run /PATH/TO/cli/selftune/hooks/prompt-log.ts, bun run
/PATH/TO/cli/selftune/hooks/skill-eval.ts, and bun run
/PATH/TO/cli/selftune/hooks/session-stop.ts) to wrap the script path in quotes
(and escape any internal quotes if present) so the shell treats the full path as
a single argument.
```

</details>

</blockquote></details>
<details>
<summary>tests/ingestors/codex-wrapper.test.ts (1)</summary><blockquote>

`43-183`: **Consider adding a regression test for bad skill directory entries.**

Given the filesystem scan path, a test covering unreadable/broken entries would protect wrapper startup reliability.

<details>
<summary>🤖 Prompt for AI Agents</summary>

```
Verify each finding against the current code and only fix it if needed.

In `@tests/ingestors/codex-wrapper.test.ts` around lines 43 - 183, Add a
regression test in tests/ingestors/codex-wrapper.test.ts that simulates bad
skill directory entries (e.g., unreadable file, broken symlink, corrupted JSON)
in a temp directory, then invokes the wrapper startup/skill-scan routine (call
whichever symbol exists in the codebase such as initializeCodexWrapper,
loadSkills, or scanSkills) and assert startup does not throw and the
returned/loaded skills exclude the bad entries (or that an expected
warning/error was logged); ensure the test cleans up temp files and uses the
same test harness style as existing tests (jest + tmpDir).
```

</details>

</blockquote></details>
<details>
<summary>cli/selftune/ingestors/opencode-ingest.ts (1)</summary><blockquote>

`189-268`: **Consider consolidating shared message-block parsing logic across both ingestion paths.**

Tool counting, Bash extraction, skill detection, and error scanning are duplicated with slight drift between SQLite and JSON handlers. A shared parser helper would reduce divergence risk.




Also applies to: 339-383
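One possible shape for such a helper (a hedged sketch; the type and field names only mirror the stats object described above, not the ingestor's actual code):

```typescript
// Hedged sketch: shared per-block parsing for both ingestion paths.
type BlockStats = {
  toolCalls: number;
  bashCommands: string[];
  errors: number;
};

type Block =
  | { type: "tool_use"; name: string; input?: { command?: string } }
  | { type: "tool_result"; is_error?: boolean }
  | { type: "text"; text: string };

// Both the SQLite and JSON handlers would call this instead of
// duplicating tool counting, Bash extraction, and error scanning inline.
function processMessageBlocks(blocks: Block[], stats: BlockStats): void {
  for (const block of blocks) {
    if (block.type === "tool_use") {
      stats.toolCalls += 1;
      if (block.name === "Bash" && block.input?.command) {
        stats.bashCommands.push(block.input.command);
      }
    } else if (block.type === "tool_result" && block.is_error) {
      stats.errors += 1;
    }
  }
}
```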

<details>
<summary>🤖 Prompt for AI Agents</summary>

```
Verify each finding against the current code and only fix it if needed.

In `@cli/selftune/ingestors/opencode-ingest.ts` around lines 189 - 268, The loop
duplicates message-block parsing across ingestion paths—extract a shared helper
(e.g., processMessageBlocks or parseBlocks) that takes the normalized blocks
plus a mutable stats/context object (containing toolCalls, bashCommands,
skillsTriggered, errors, assistantTurns, firstUserQuery, skillNames) and
implements tool counting, Bash extraction, SKILL.md detection (using
basename/join), OpenAI tool_calls parsing, skill-name text scanning, and
tool_result error counting; replace the duplicated code in both places (the
block-handling inside the for (const msg of msgRows) loop and the similar code
at the other location) with calls to this helper, passing in
normalizeContent(msg.content ?? "[]") and the shared stats object so logic is
centralized and consistent.
```

</details>

</blockquote></details>

</blockquote></details>

<details>
<summary>🤖 Prompt for all review comments with AI agents</summary>

Verify each finding against the current code and only fix it if needed.

Inline comments:
In @.github/workflows/ci.yml:

  • Around line 1-25: The CI currently runs "bun install" in the "lint" and "test"
    jobs but the repository's .gitignore excludes bun.lock, which prevents
    committing the lockfile and causes non-reproducible builds; to fix, remove
    bun.lock from .gitignore and commit the resulting bun.lock to the repo (or
    explicitly document/accept the choice) so the "bun install" runs in the "lint"
    and "test" jobs produce deterministic dependency versions across CI runs.

In @AGENTS.md:

  • Around line 11-42: The fenced code block in AGENTS.md that starts with
    "selftune/" lacks a language tag (the triple-backtick block containing the
    directory tree), causing markdownlint MD040; update that fenced block to include
    a language identifier (for example add "text" after the opening ```), so the
    block beginning with "selftune/" and ending with "└── [root configs] #
    package.json, tsconfig.json, Makefile, CI, etc." is fenced as ```text to
    satisfy the linter.

In @cli/selftune/eval/hooks-to-evals.ts:

  • Around line 210-215: The Math.min/Math.max calls on allTurns and allErrors can
    produce Infinity/-Infinity when those arrays are empty; update the console
    output logic (around the avg(allTurns).toFixed(1) and avg(allErrors).toFixed(1)
    lines) to compute safeMin and safeMax for each array (e.g., if array.length ===
    0 then use 0 or a placeholder) before formatting, and use those safe values in
    the template strings instead of Math.min(...allTurns)/Math.max(...allTurns) and
    Math.min(...allErrors)/Math.max(...allErrors) so the printed min/max are
    well-defined for empty arrays.
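    The guard can be as small as this sketch (`safeMin`/`safeMax` are illustrative names):

```typescript
// Hedged sketch: Math.min()/Math.max() over a spread empty array return
// Infinity/-Infinity; substitute a fallback before formatting output.
function safeMin(values: number[], fallback = 0): number {
  return values.length === 0 ? fallback : Math.min(...values);
}

function safeMax(values: number[], fallback = 0): number {
  return values.length === 0 ? fallback : Math.max(...values);
}
```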

In @cli/selftune/grading/grade-session.ts:

  • Around line 200-214: In gradeViaAgent replace the readFileSync + promptContent
    usage with passing the temp file path (promptFile) to the CLI so the prompt is
    not sent as a large argv or exposed in process listings; remove/readFileSync
    usage (promptContent) and set cmd to use promptFile for each agent branch (e.g.,
    in the "claude", "codex", and "opencode" branches) or alternatively use stdin
    redirection if the specific CLI supports it—ensure promptFile is created with
    writeFileSync beforehand and referenced where cmd is constructed.
  • Around line 270-278: The fetch to API_URL in grade-session.ts has no
    timeout—wrap the request with an AbortController (create controller, set a
    timeout e.g., 10–30s that calls controller.abort()), pass controller.signal into
    the existing fetch call (which currently sends headers including "x-api-key" and
    "anthropic-version" and body JSON.stringify(payload)), and ensure you clear the
    timeout (clearTimeout) after fetch resolves or in finally; also handle an
    abort/timeout error path (detect AbortError) to surface a clear timeout error
    instead of hanging.
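    The timeout wrapper described above can be sketched as follows (hedged; `fetchWithTimeout` and `doFetch` are illustrative stand-ins for the real call to API_URL):

```typescript
// Hedged sketch: wrap a fetch-style call with an AbortController timeout
// and surface a clear error when the abort fires.
async function fetchWithTimeout<T>(
  doFetch: (signal: AbortSignal) => Promise<T>,
  timeoutMs: number,
): Promise<T> {
  const controller = new AbortController();
  const timer = setTimeout(() => controller.abort(), timeoutMs);
  try {
    return await doFetch(controller.signal);
  } catch (err) {
    if ((err as Error).name === "AbortError") {
      throw new Error(`request timed out after ${timeoutMs}ms`);
    }
    throw err;
  } finally {
    clearTimeout(timer); // always clear, whether resolved or aborted
  }
}
```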

In @cli/selftune/hooks/session-stop.ts:

  • Around line 29-35: Add a source field to the SessionTelemetryRecord object
    created in the session-stop hook so telemetry can be distinguished; modify the
    record construction (the const record in session-stop.ts that builds
    SessionTelemetryRecord using timestamp, session_id, cwd, transcript_path and
    ...metrics) to include a source value (use the existing source
    variable if available in the hook context or otherwise set a clear literal like
    'claude' passed into the hook) so downstream logs include the source attribute.

In @cli/selftune/hooks/skill-eval.ts:

  • Around line 49-56: When creating the SkillUsageRecord in the skill-eval hook,
    add an explicit source field to the record object so mixed-ingestor logs can be
    segmented; update the record construction (the const record in
    cli/selftune/hooks/skill-eval.ts) to include source: 'selftune' (or another
    clear identifier like 'cli/selftune' or 'skill-eval') alongside timestamp,
    session_id, skill_name, skill_path, query, and triggered.

In @cli/selftune/index.ts:

  • Around line 37-61: The router currently imports modules
    (./eval/hooks-to-evals.js, ./grading/grade-session.js,
    ./ingestors/codex-rollout.js, ./ingestors/opencode-ingest.js,
    ./ingestors/codex-wrapper.js) which rely on import.meta.main guards and do not
    export main, so their main logic never runs; change the router to follow the
    doctor pattern: import each module as a namespace and call its exported
    main(...) instead of relying on import-time execution, and update each target
    module to export its main function (export async function main(...) { ... }) so
    the router can invoke it after adjusting process.argv.
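    The "doctor pattern" described above can be sketched like this (hedged; the subcommand and function names are illustrative, not the actual module exports):

```typescript
// Hedged sketch: subcommand modules export main() and the router calls it
// explicitly, instead of relying on import.meta.main side effects.
async function gradeSessionMain(args: string[]): Promise<number> {
  // ...real subcommand logic would live here...
  return args.length > 0 ? 0 : 1;
}

const subcommands: Record<string, (args: string[]) => Promise<number>> = {
  "grade-session": gradeSessionMain,
};

async function route(argv: string[]): Promise<number> {
  const [cmd, ...rest] = argv;
  const handler = cmd !== undefined ? subcommands[cmd] : undefined;
  if (!handler) return 2; // unknown subcommand
  return handler(rest);
}
```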

In @cli/selftune/ingestors/codex-rollout.ts:

  • Around line 217-227: The date-parsing code uses path.split("/") which fails on
    Windows; change the split to be OS-agnostic (e.g., normalize the input and split
    by path.sep or use a regex like /[\/]/) so the day/month/year extraction from
    the incoming path succeeds. Update the logic around the variables path,
    dayStr/monthStr/yearStr and the fileDate assignment to split using the OS
    separator instead of "/" so Number.parseInt(...) receives the correct parts and
    fileDate is correctly set.
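    An OS-agnostic split could look like this sketch (the .../YYYY/MM/DD/... layout is assumed from the description above; helper names are hypothetical):

```typescript
// Hedged sketch: split on both separators so Windows paths parse too.
function pathSegments(p: string): string[] {
  return p.split(/[\\/]+/).filter(Boolean);
}

function datePartsFromRolloutPath(
  p: string,
): { year: number; month: number; day: number } | null {
  const parts = pathSegments(p);
  // Find the first 4-digit segment and read YYYY/MM/DD from there.
  const idx = parts.findIndex((s) => /^\d{4}$/.test(s));
  if (idx === -1 || idx + 2 >= parts.length) return null;
  const [year, month, day] = [parts[idx], parts[idx + 1], parts[idx + 2]].map(
    (s) => Number.parseInt(s, 10),
  );
  if ([year, month, day].some(Number.isNaN)) return null;
  return { year, month, day };
}
```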

In @cli/selftune/ingestors/codex-wrapper.ts:

  • Around line 37-40: The directory scan can throw on broken symlinks or
    unreadable entries when calling statSync; update the loop in the codex-wrapper
    (the for (...) using readdirSync, statSync, existsSync, join, names.add and
    checking "SKILL.md") to defensively skip problematic entries by wrapping the
    stat call in a try/catch (or use lstatSync) and only proceed to existsSync/join
    and names.add if the stat succeeds and isDirectory() returns true; on error
    simply continue to the next entry so broken/unreadable filesystem entries don't
    abort the scan.

In @cli/selftune/ingestors/opencode-ingest.ts:

  • Around line 485-487: The code silently accepts an invalid --since date
    (values.since) by doing sinceTs = new
    Date(${values.since}T00:00:00Z).getTime()/1000 which yields NaN and disables
    filtering; validate the parsed date immediately after creating it (e.g., const d
    = new Date(...); if (isNaN(d.getTime())) ...) and if invalid emit a clear error
    and abort (throw or processLogger.error + process.exit(1)) so the caller knows
    the --since value is malformed instead of silently ingesting all history.
  • Around line 336-399: The variable errors is declared as a const and never
    updated, so errors_encountered is always 0; change "const errors = 0" to "let
    errors = 0" and increment it when a tool/message indicates an error during
    ingestion (e.g., inside the assistant loop where you iterate blocks in
    opencode-ingest.ts: detect common error indicators such as block.error,
    block.output?.error, block.result?.error, or block.status === 'error' when
    processing tool_use blocks or message-level error fields) so that
    errors_encountered uses the updated errors counter in the sessions object.
  • Around line 169-171: The code assumes sessionRow.created (createdMs) is
    milliseconds; detect and normalize seconds vs milliseconds before creating the
    ISO timestamp: inspect createdMs (from sessionRow.created) and if it appears to
    be in seconds (e.g., a small epoch value) multiply by 1000, then construct
    timestamp = new Date(normalizedMs).toISOString(); update the logic around
    createdMs/timestamp in opencode-ingest.ts so ordering/filtering uses the
    normalized millisecond epoch.
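    A minimal normalization sketch (the 1e12 threshold is a heuristic assumption, not the ingestor's current logic; millisecond epochs for recent dates are 13 digits while second epochs are 10):

```typescript
// Hedged sketch: treat epoch values below ~1e12 as seconds and scale up.
function toEpochMs(created: number): number {
  return created < 1e12 ? created * 1000 : created;
}
```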

In @cli/selftune/types.ts:

  • Around line 127-130: The EvalFeedback type currently declares suggestions as
    Array<{ assertion: string; reason: string }> but the code path in
    grade-session.ts and the SKILL.md example only use the reason field; fix the
    schema by removing the unused assertion property from the
    EvalFeedback.suggestions type (update the interface name EvalFeedback and the
    suggestions element shape to only include reason) OR alternatively add assertion
    usage everywhere (update grade-session.ts to populate/use suggestion.assertion
    and update the SKILL.md example/comment to include assertion) — choose the
    former for minimal change: modify the EvalFeedback interface to suggestions:
    Array<{ reason: string }> and update any related type references/comments to
    match.

In @docs/exec-plans/tech-debt-tracker.md:

  • Around line 5-8: The intro promises "ownership" but the table lacks an Owner
    column—add an Owner column to the markdown table by inserting "Owner" into the
    header row and adding the corresponding separator cell in the divider row (i.e.,
    update the header line "| ID | Description | Domain | Priority | Status |
    Created | Updated |" to include "| Owner |" and the separator line
    "|----|-------------|--------|----------|--------|---------|---------|" to
    include an extra "----" for the Owner column) so ownership is tracked, or
    alternatively remove the "ownership" mention from the intro if you prefer not to
    track owners.

In @docs/product-specs/index.md:

  • Around line 7-10: Update the table row so the Spec cell uses a markdown link
    instead of plain text: replace the plain "PRD.md" text in the table with a
    resolvable link (e.g., [PRD.md](PRD.md) or a relative path like
    [PRD.md](./PRD.md)) so clicking the table entry opens the spec; modify the table
    row containing the header symbols "| Spec | Status | Priority | Owner |" and the
    row "| PRD.md | Current | P0 | WellDunDun |" accordingly.

In @PRD.md:

  • Around line 57-59: Add a language tag to the unlabeled fenced code block that
    contains "Observe → Detect → Diagnose → Propose → Validate → Deploy → Watch →
    Repeat" so it no longer triggers MD040; edit the fenced block in PRD.md to start
    with ```text (i.e., replace the opening ``` with ```text) so the block is
    explicitly labeled as plain text.

In @README.md:

  • Around line 119-160: Three fenced code blocks in README.md lack explicit
    languages (triggering markdownlint MD040); update the three blocks that start
    with "Skill triggers in skill_usage_log...", the one showing the eval CLI output
    "Wrote 50 eval entries to pptx_eval.json...", and the block beginning
    "UserPromptSubmit fires..." to use explicit fence languages (e.g., change ``` to ```text).

---

Nitpick comments:
In @.agents/skills/reins/HarnessMethodology.md:
- Around line 23-38: The fenced code block in HarnessMethodology.md that shows
the directory tree (starting with "AGENTS.md", "ARCHITECTURE.md", etc.) lacks a
language specifier and causes markdownlint warnings; update that opening fence
to use a language like ```text (e.g., replace ``` with ```text) and do the same
for the other similar fenced block later in the file (the one that also contains
the directory listing) so both directory-structure blocks include a language
specifier.
- Line 77: Update the phrase in the sentence starting "For CLI-first
repositories, prioritize diagnosability surfaces..." to use the hyphenated
compound adjective "full-service observability" instead of "full service
observability" so the modifier is grammatical; locate the sentence in
HarnessMethodology.md and replace the unhyphenated phrase accordingly.

In @.github/workflows/ci.yml:
- Around line 14-15: CI is reinstalling Bun deps each job; add an actions/cache
step before "uses: oven-sh/setup-bun@v2" / "run: bun install" to restore/save
Bun's cache so installs are faster. Insert an actions/cache@v3 step that uses a
key composed of runner OS and the bun.lockb (or lock file) and restore-keys
fallback, caching the Bun cache directories (e.g., ~/.bun and .bun or the
project's bun cache path) and ensure the same cache step is added to both jobs
that run "uses: oven-sh/setup-bun@v2" and "run: bun install".

In @.gitignore:
- Around line 1-15: Add common ignore patterns for environment and build
artifacts in addition to the existing entries (node_modules, bun.lock,
*.tsbuildinfo, .context, .DS_Store, Thumbs.db, .vscode, .idea, coverage) by
appending patterns such as .env, .env.*, dist/, build/, out/, .next/,
.parcel-cache/, .turbo/, *.log (npm-debug.log, yarn-error.log), and logs/ so
sensitive env files and typical build/output caches are excluded from commits.

In `@ARCHITECTURE.md`:
- Around line 26-49: The fenced code block in ARCHITECTURE.md that contains the
module diagram is missing a language specifier; update the opening
triple-backticks to include a language (e.g., change the block beginning "```"
to "```text") so the markdown linter accepts it—edit the diagram block that
starts with the tree for "cli/selftune/" and ends before "skill/" to add the
language token.
- Around line 16-18: The fenced code block containing the diagram "Observe →
Detect → Diagnose → Propose → Validate → Deploy → Watch → Repeat" lacks a
language specifier; update the block to use a plaintext specifier (e.g., change
``` to ```text) so the linter recognizes it as a text/diagram block and
satisfies static analysis.

In `@cli/selftune/eval/hooks-to-evals.ts`:
- Around line 329-333: Remove the redundant null-coalescing fallbacks when
calling readJsonl — parseArgs already supplies defaults — so pass the values map
entries directly (e.g., replace readJsonl<SkillUsageRecord>(values["skill-log"]
?? SKILL_LOG) with readJsonl<SkillUsageRecord>(values["skill-log"])). Do the
same for readJsonl<QueryLogRecord>(values["query-log"] ?? QUERY_LOG) and
readJsonl<SessionTelemetryRecord>(values["telemetry-log"] ?? TELEMETRY_LOG),
keeping the same variable names (skillRecords, queryRecords, telemetryRecords)
and constants (SKILL_LOG, QUERY_LOG, TELEMETRY_LOG) only if they are still used
elsewhere.

In `@cli/selftune/ingestors/opencode-ingest.ts`:
- Around line 189-268: The loop duplicates message-block parsing across
ingestion paths—extract a shared helper (e.g., processMessageBlocks or
parseBlocks) that takes the normalized blocks plus a mutable stats/context
object (containing toolCalls, bashCommands, skillsTriggered, errors,
assistantTurns, firstUserQuery, skillNames) and implements tool counting, Bash
extraction, SKILL.md detection (using basename/join), OpenAI tool_calls parsing,
skill-name text scanning, and tool_result error counting; replace the duplicated
code in both places (the block-handling inside the for (const msg of msgRows)
loop and the similar code at the other location) with calls to this helper,
passing in normalizeContent(msg.content ?? "[]") and the shared stats object so
logic is centralized and consistent.

In `@cli/selftune/utils/jsonl.ts`:
- Around line 19-21: The code pushes JSON.parse(trimmed) as T into records which
trusts the cast; update the parsing function in cli/selftune/utils/jsonl.ts to
perform runtime validation before pushing (e.g., accept an optional
validator/zod schema or a predicate argument and run it against the parsed
object), so instead of unguarded records.push(JSON.parse(trimmed) as T) you
parse, validate the result and only push when validation passes (log or skip
invalid records in the catch/validation-failure path); reference the parsed
value from JSON.parse(trimmed) and the records array when implementing the
check.

In `@cli/selftune/utils/logging.ts`:
- Around line 34-37: In the error(message: string, err?: unknown) method,
improve serialization of non-Error throwables and guard against missing stacks:
normalize err into a string by checking for Error (use name, message and stack
if present), otherwise attempt JSON.stringify and fall back to util.inspect or
String(err), and only append a newline+stack when a stack exists to avoid
producing "\nundefined"; then pass that normalized string to emit("ERROR",
message, exception). Ensure you reference the error(...) method and the
emit(...) call when making the changes.
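A possible shape for that normalization (hedged sketch; `normalizeError` is an illustrative name, not the logger's current API):

```typescript
// Hedged sketch: serialize arbitrary throwables without emitting "\nundefined".
function normalizeError(err: unknown): string {
  if (err instanceof Error) {
    const head = `${err.name}: ${err.message}`;
    // Only append a newline + stack when a stack actually exists.
    return err.stack ? `${head}\n${err.stack}` : head;
  }
  try {
    // JSON.stringify(undefined) yields undefined, so fall back to String().
    return JSON.stringify(err) ?? String(err);
  } catch {
    return String(err); // circular objects, exotic throwables, etc.
  }
}
```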

In `@cli/selftune/utils/transcript.ts`:
- Around line 47-61: The user-text extraction block inside the if (role ===
"user") branch duplicates logic already implemented in extractUserText; replace
the inline handling that reads and joins string/array content (which assigns
lastUserQuery) with a call to extractUserText(content) and assign its result to
lastUserQuery if truthy. Locate the inline block around variables role, content,
lastUserQuery and remove the duplicated parsing, reusing the extractUserText
helper (ensure you import or reference extractUserText in the same
module/scope).

In `@lint-architecture.ts`:
- Around line 54-56: The empty catch block that currently reads "catch { //
directory doesn't exist }" swallows all errors; change it to capture the error
(e.g., "catch (err)") and explicitly check err.code === 'ENOENT' to keep the
current behavior for missing directories, but log or rethrow other errors (use
the module's verbose/logger if available or console.error) so permission,
symlink, or unexpected failures are not silently ignored; update the comment to
reflect the distinction.
- Around line 27-37: The import detection loop only checks lines that start with
"import" and misses dynamic imports and re-exports; update the loop that
iterates lines (the block referencing lines[i], line, FORBIDDEN_IMPORTS and
violations) to also test each line for other import patterns: check for
"import(" (dynamic imports), "export" lines that contain " from " (re-exports
like export {..} from "…"), and inline "import ... from" occurrences using a
simple regex (e.g. /\bimport\b.*\bfrom\b/ or line.includes('import(') /
line.includes(' from ')); when any of these patterns match, still iterate
FORBIDDEN_IMPORTS and push the same violation message so dynamic imports and
re-exports are caught the same way as static top-level imports.
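The broadened detection could be sketched as follows (hedged; `referencesForbiddenImport` is illustrative and simplified relative to the real linter):

```typescript
// Hedged sketch: flag static imports, re-exports, and dynamic import() in one pass.
function referencesForbiddenImport(line: string, forbidden: string[]): string | null {
  const looksLikeImport =
    /\bimport\b.*\bfrom\b/.test(line) || // static / inline "import ... from"
    line.includes("import(") || // dynamic import
    (/^\s*export\b/.test(line) && line.includes(" from ")); // re-export
  if (!looksLikeImport) return null;
  return forbidden.find((f) => line.includes(f)) ?? null;
}
```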

In `@skill/settings_snippet.json`:
- Around line 11-35: The unquoted command strings for the hook entries (the
"command" values that run prompt-log.ts, skill-eval.ts, and session-stop.ts) can
break on filesystem paths containing spaces; update each "command" string in the
JSON (the entries that currently read bun run
/PATH/TO/cli/selftune/hooks/prompt-log.ts, bun run
/PATH/TO/cli/selftune/hooks/skill-eval.ts, and bun run
/PATH/TO/cli/selftune/hooks/session-stop.ts) to wrap the script path in quotes
(and escape any internal quotes if present) so the shell treats the full path as
a single argument.

In `@skill/SKILL.md`:
- Around line 34-36: The fenced code block containing the path
"~/.claude/session_telemetry_log.jsonl" in SKILL.md lacks a language specifier;
update that block to use a language tag (e.g., change the opening fence from ```
to ```text) so it becomes ```text followed by
~/.claude/session_telemetry_log.jsonl and then the closing ```, satisfying
markdownlint.

In `@tests/eval/hooks-to-evals.test.ts`:
- Around line 21-23: The helper function writeJsonl is defined but never used;
remove the dead code by deleting the writeJsonl function definition from
tests/eval/hooks-to-evals.test.ts (and any now-unused imports) or, if file-based
input is intended, add a test that calls writeJsonl to create a JSONL file and
then invokes the existing buildEvalSet test path to read that file; reference:
function name writeJsonl and the tests that exercise buildEvalSet so you
update/remove the helper consistently.
- Around line 8-16: The test suite sets up a temporary directory (tmpDir via
beforeEach/afterEach) but never uses it; remove the unused tmpDir variable and
the beforeEach/afterEach hooks or instead add tests that call the existing
writeJsonl helper to write/read files using tmpDir. Locate the tmpDir
declaration and the beforeEach/afterEach hooks and either delete them along with
any unused writeJsonl helper, or modify existing tests to create files with
writeJsonl (referencing writeJsonl, tmpDir, beforeEach, afterEach) and assert
file-based behavior so the temp-dir lifecycle is actually exercised.

In `@tests/ingestors/codex-wrapper.test.ts`:
- Around line 43-183: Add a regression test in
tests/ingestors/codex-wrapper.test.ts that simulates bad skill directory entries
(e.g., unreadable file, broken symlink, corrupted JSON) in a temp directory,
then invokes the wrapper startup/skill-scan routine (call whichever symbol
exists in the codebase such as initializeCodexWrapper, loadSkills, or
scanSkills) and assert startup does not throw and the returned/loaded skills
exclude the bad entries (or that an expected warning/error was logged); ensure
the test cleans up temp files and uses the same test harness style as existing
tests (jest + tmpDir).
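The wrapper's actual entry point is not shown in this thread, so as a sketch only (`scanSkills` and its tolerance policy are assumptions, not the PR's code), the resilient-scan behavior under test might look like:

```typescript
// Hypothetical tolerant skill scan: bad entries are skipped, not fatal.
import {
  mkdtempSync,
  readFileSync,
  readdirSync,
  rmSync,
  symlinkSync,
  writeFileSync,
} from "node:fs";
import { tmpdir } from "node:os";
import { join } from "node:path";

function scanSkills(dir: string): string[] {
  const ok: string[] = [];
  for (const name of readdirSync(dir)) {
    try {
      // Throws on unreadable files, dangling symlinks, and corrupted JSON.
      JSON.parse(readFileSync(join(dir, name), "utf8"));
      ok.push(name);
    } catch {
      // A real implementation would log a warning here instead of rethrowing.
    }
  }
  return ok.sort();
}

const dir = mkdtempSync(join(tmpdir(), "skills-"));
writeFileSync(join(dir, "good.json"), JSON.stringify({ name: "pptx" }));
writeFileSync(join(dir, "bad.json"), "{ not json");
symlinkSync(join(dir, "missing"), join(dir, "dangling.json")); // broken symlink
console.log(scanSkills(dir)); // only "good.json" survives
rmSync(dir, { recursive: true, force: true });
```

The regression test would assert exactly this: startup does not throw, and the bad entries are absent from the result.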

In `@tests/observability.test.ts`:
- Around line 32-37: The test "finds installed hook files" is brittle because it
assumes three passing hooks; update the test to be resilient by either ensuring
the hooks are created in test setup or relaxing the assertion: call
checkHookInstallation() (same symbol) in a beforeAll that creates the expected
hook files, or change the expectation from expect(passing.length).toBe(3) to a
non-failing assertion such as expect(passing.length).toBeGreaterThanOrEqual(0)
or expect(passing.length).toBeGreaterThanOrEqual(1) (depending on minimum
required), so the test does not fail in clean/CI checkouts where compiled hook
files may be absent.

ℹ️ Review info

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 7b65cfb and 7294a03.

📒 Files selected for processing (59)
  • .agent/skills/.gitkeep
  • .agent/skills/reins
  • .agents/skills/reins/HarnessMethodology.md
  • .agents/skills/reins/SKILL.md
  • .agents/skills/reins/Workflows/Audit.md
  • .agents/skills/reins/Workflows/Doctor.md
  • .agents/skills/reins/Workflows/Evolve.md
  • .agents/skills/reins/Workflows/Scaffold.md
  • .claude/skills/.gitkeep
  • .claude/skills/reins
  • .github/workflows/ci.yml
  • .gitignore
  • AGENTS.md
  • ARCHITECTURE.md
  • Makefile
  • PRD.md
  • README.md
  • biome.json
  • cli/selftune/constants.ts
  • cli/selftune/eval/hooks-to-evals.ts
  • cli/selftune/grading/grade-session.ts
  • cli/selftune/hooks/prompt-log.ts
  • cli/selftune/hooks/session-stop.ts
  • cli/selftune/hooks/skill-eval.ts
  • cli/selftune/index.ts
  • cli/selftune/ingestors/codex-rollout.ts
  • cli/selftune/ingestors/codex-wrapper.ts
  • cli/selftune/ingestors/opencode-ingest.ts
  • cli/selftune/observability.ts
  • cli/selftune/types.ts
  • cli/selftune/utils/jsonl.ts
  • cli/selftune/utils/logging.ts
  • cli/selftune/utils/seeded-random.ts
  • cli/selftune/utils/transcript.ts
  • docs/design-docs/core-beliefs.md
  • docs/design-docs/index.md
  • docs/escalation-policy.md
  • docs/exec-plans/tech-debt-tracker.md
  • docs/golden-principles.md
  • docs/product-specs/index.md
  • lint-architecture.ts
  • package.json
  • risk-policy.json
  • skill/SKILL.md
  • skill/references/logs.md
  • skill/settings_snippet.json
  • skills-lock.json
  • tests/eval/hooks-to-evals.test.ts
  • tests/grading/grade-session.test.ts
  • tests/hooks/prompt-log.test.ts
  • tests/hooks/session-stop.test.ts
  • tests/hooks/skill-eval.test.ts
  • tests/ingestors/codex-rollout.test.ts
  • tests/ingestors/codex-wrapper.test.ts
  • tests/ingestors/opencode-ingest.test.ts
  • tests/observability.test.ts
  • tests/utils/jsonl.test.ts
  • tests/utils/transcript.test.ts
  • tsconfig.json

Comment on lines +1 to +25
name: CI

on:
push:
branches: [master]
pull_request:
branches: [master]

jobs:
lint:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4
- uses: oven-sh/setup-bun@v2
- run: bun install
- run: bunx biome check .
- run: bun run lint-architecture.ts

test:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4
- uses: oven-sh/setup-bun@v2
- run: bun install
- run: bun test
@coderabbitai coderabbitai bot Feb 28, 2026


⚠️ Potential issue | 🟡 Minor

Potential inconsistency: bun.lock is gitignored but may be needed for reproducible builds.

The .gitignore excludes bun.lock, which means CI will generate fresh lock files on each run. This can lead to non-reproducible builds if dependency versions drift. Consider either:

  1. Removing bun.lock from .gitignore to commit the lock file (recommended for reproducibility)
  2. Accepting the current behavior if you intentionally want latest compatible versions
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.github/workflows/ci.yml around lines 1 - 25, The CI currently runs "bun
install" in the "lint" and "test" jobs but the repository's .gitignore excludes
bun.lock, which prevents committing the lockfile and causes non-reproducible
builds; to fix, remove bun.lock from .gitignore and commit the resulting
bun.lock to the repo (or explicitly document/accept the choice) so the "bun
install" runs in the "lint" and "test" jobs produce deterministic dependency
versions across CI runs.

Collaborator Author

Intentional — bun.lock is gitignored by design for this project. CI uses bun install with the lockfile-free approach.


@WellDunDun, understood! I've noted this design choice for future reference.


✏️ Learnings added
Learnt from: WellDunDun
Repo: WellDunDun/selftune PR: 1
File: .github/workflows/ci.yml:1-25
Timestamp: 2026-02-28T21:36:39.390Z
Learning: In the WellDunDun/selftune repository, `bun.lock` is intentionally excluded from version control (gitignored by design). The project uses a lockfile-free approach with `bun install` in CI.

Note: Learnings are effective only in the context of similar code segments. To apply general coding guidelines, please configure review instructions. You can manage existing learnings in the app.


🧠 Learnings used
Learnt from: CR
Repo: WellDunDun/Ithraa-v1 PR: 0
File: AGENTS.md:0-0
Timestamp: 2026-02-24T17:26:59.361Z
Learning: Run `yarn ci` locally (typecheck, lint, prettier, tests in parallel) before opening a PR — all must pass before merge

Learnt from: CR
Repo: WellDunDun/Ithraa-v1 PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-02-24T17:26:39.790Z
Learning: Before creating a PR, run `yarn ci` and fix any failures. Heavy CI checks (typecheck, lint, prettier, tests) run locally — not on GitHub Actions

Learnt from: CR
Repo: WellDunDun/reins PR: 0
File: AGENTS.md:0-0
Timestamp: 2026-02-23T20:20:15.569Z
Learning: Applies to {bin/reins.cjs,cli/reins/src/index.ts} : Bun is the runtime; Node.js/tsx fallback via `bin/reins.cjs`

Learnt from: CR
Repo: WellDunDun/reins PR: 0
File: AGENTS.md:0-0
Timestamp: 2026-02-23T20:20:15.569Z
Learning: Run tests with `cd cli/reins && bun test`

Learnt from: CR
Repo: WellDunDun/reins PR: 0
File: AGENTS.md:0-0
Timestamp: 2026-02-23T20:20:15.569Z
Learning: Self-audit using `cd cli/reins && bun src/index.ts audit ../..`

Comment on lines +11 to +42
```
selftune/
├── cli/selftune/ # TypeScript package — the CLI
│ ├── types.ts # Shared interfaces
│ ├── constants.ts # Log paths, known tools, skip prefixes
│ ├── utils/ # Shared utilities
│ │ ├── jsonl.ts # JSONL read/write/append
│ │ ├── transcript.ts # Transcript parsing
│ │ ├── logging.ts # Structured JSON logging
│ │ └── seeded-random.ts # Deterministic PRNG
│ ├── hooks/ # Telemetry capture (Claude Code hooks)
│ │ ├── prompt-log.ts # UserPromptSubmit hook
│ │ ├── session-stop.ts # Stop hook
│ │ └── skill-eval.ts # PostToolUse hook
│ ├── ingestors/ # Platform adapters (Codex, OpenCode)
│ │ ├── codex-wrapper.ts # Real-time Codex wrapper
│ │ ├── codex-rollout.ts # Batch Codex ingestor
│ │ └── opencode-ingest.ts # OpenCode SQLite/JSON adapter
│ ├── eval/ # False negative detection, eval set generation
│ │ └── hooks-to-evals.ts
│ ├── grading/ # 3-tier session grading
│ │ └── grade-session.ts
│ ├── observability.ts # Health checks, log integrity
│ └── index.ts # CLI entry point
├── skill/ # Claude Code skill (skill-eval-grader)
│ ├── SKILL.md # Skill definition
│ ├── settings_snippet.json
│ └── references/
├── tests/ # Test suite (bun test)
├── docs/ # Reins harness docs
└── [root configs] # package.json, tsconfig.json, Makefile, CI, etc.
```
@coderabbitai coderabbitai bot Feb 28, 2026

⚠️ Potential issue | 🟡 Minor

Specify a fenced code language for markdownlint compliance.

Line 11 opens a fenced block without a language, which triggers MD040 in strict markdown lint setups.

💡 Suggested fix
-```
+```text
 selftune/
 ├── cli/selftune/            # TypeScript package — the CLI
 ...
 └── [root configs]           # package.json, tsconfig.json, Makefile, CI, etc.

<details>
<summary>🧰 Tools</summary>

<details>
<summary>🪛 markdownlint-cli2 (0.21.0)</summary>

[warning] 11-11: Fenced code blocks should have a language specified

(MD040, fenced-code-language)

</details>

</details>

<details>
<summary>🤖 Prompt for AI Agents</summary>

Verify each finding against the current code and only fix it if needed.

In @AGENTS.md around lines 11 - 42, The fenced code block in AGENTS.md that
starts with "selftune/" lacks a language tag (the triple-backtick block
containing the directory tree), causing markdownlint MD040; update that fenced
block to include a language identifier (for example add "text" after the opening
fence) so the block ending with "└── [root configs]           # package.json,
tsconfig.json, Makefile, CI, etc." is fenced as ```text to satisfy the linter.

</details>

Collaborator Author

Minor markdown lint suggestion — not blocking. Will address in a follow-up if markdown linting is added to CI.


@WellDunDun, sounds good! Feel free to revisit when convenient. 👍


🧠 Learnings used
Learnt from: CR
Repo: WellDunDun/reins PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-02-23T20:20:06.915Z
Learning: Canonical project instructions live in AGENTS.md

Learnt from: CR
Repo: WellDunDun/reins PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-02-23T20:20:06.915Z
Learning: If CLAUDE.md and AGENTS.md ever differ, follow AGENTS.md

Learnt from: CR
Repo: WellDunDun/reins PR: 0
File: AGENTS.md:0-0
Timestamp: 2026-02-23T20:20:15.569Z
Learning: Self-audit using `cd cli/reins && bun src/index.ts audit ../..`

Comment on lines +210 to +215
console.log(
` Assistant turns: avg ${avg(allTurns).toFixed(1)} (min ${Math.min(...allTurns)}, max ${Math.max(...allTurns)})`,
);
console.log(
` Errors: avg ${avg(allErrors).toFixed(1)} (min ${Math.min(...allErrors)}, max ${Math.max(...allErrors)})`,
);
@coderabbitai coderabbitai bot Feb 28, 2026

⚠️ Potential issue | 🟡 Minor

Edge case: Math.min/Math.max on empty arrays.

If sessions is non-empty but all assistant_turns or errors_encountered are undefined/0, allTurns/allErrors could be empty or contain only zeros. However, the real edge case is that Math.min(...[]) returns Infinity and Math.max(...[]) returns -Infinity.

The code already handles sessions.length === 0 early (line 185), and the arrays are populated from valid sessions, so this is unlikely in practice. Still, a defensive check could improve robustness.

🛡️ Optional defensive fix
+  const safeMin = (arr: number[]) => (arr.length > 0 ? Math.min(...arr) : 0);
+  const safeMax = (arr: number[]) => (arr.length > 0 ? Math.max(...arr) : 0);
+
   console.log(
-    `  Assistant turns:   avg ${avg(allTurns).toFixed(1)}  (min ${Math.min(...allTurns)}, max ${Math.max(...allTurns)})`,
+    `  Assistant turns:   avg ${avg(allTurns).toFixed(1)}  (min ${safeMin(allTurns)}, max ${safeMax(allTurns)})`,
   );
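The degenerate behavior is easy to confirm in isolation; spreading an empty array turns these into zero-argument calls:

```typescript
// Spreading an empty array degenerates to Math.min() / Math.max():
const empty: number[] = [];
console.log(Math.min(...empty)); // Infinity
console.log(Math.max(...empty)); // -Infinity

// Guarded variants avoid the degenerate case:
const safeMin = (arr: number[]) => (arr.length > 0 ? Math.min(...arr) : 0);
const safeMax = (arr: number[]) => (arr.length > 0 ? Math.max(...arr) : 0);
console.log(safeMin(empty), safeMax(empty)); // 0 0
console.log(safeMin([3, 1, 2]), safeMax([3, 1, 2])); // 1 3
```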
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cli/selftune/eval/hooks-to-evals.ts` around lines 210 - 215, The
Math.min/Math.max calls on allTurns and allErrors can produce Infinity/-Infinity
when those arrays are empty; update the console output logic (around the
avg(allTurns).toFixed(1) and avg(allErrors).toFixed(1) lines) to compute safeMin
and safeMax for each array (e.g., if array.length === 0 then use 0 or a
placeholder) before formatting, and use those safe values in the template
strings instead of Math.min(...allTurns)/Math.max(...allTurns) and
Math.min(...allErrors)/Math.max(...allErrors) so the printed min/max are
well-defined for empty arrays.

Collaborator Author

Not applicable — the function has an early exit guard (if (sessions.length === 0)) before the Math.min/max calls, so empty arrays cannot reach that code path.


@WellDunDun, you're absolutely right! The early exit guard at line 185 ensures sessions.length > 0 before the loop populates the arrays, so allTurns and allErrors will always contain at least sessions.length elements (with 0 as fallback via the ?? operator). The arrays cannot be empty when Math.min/Math.max are invoked. The code is safe as-is—thanks for catching my oversight!


🧠 Learnings used
Learnt from: CR
Repo: WellDunDun/reins PR: 0
File: AGENTS.md:0-0
Timestamp: 2026-02-23T20:20:15.569Z
Learning: Self-audit using `cd cli/reins && bun src/index.ts audit ../..`

Comment on lines +270 to +278
const resp = await fetch(API_URL, {
method: "POST",
headers: {
"Content-Type": "application/json",
"x-api-key": apiKey,
"anthropic-version": "2023-06-01",
},
body: JSON.stringify(payload),
});
@coderabbitai coderabbitai bot Feb 28, 2026

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

rg -n 'fetch\(API_URL' cli/selftune/grading/grade-session.ts -C 10

Repository: WellDunDun/selftune

Length of output: 552


🏁 Script executed:

# Also check if AbortController or signal is used anywhere in this file
rg -n 'AbortController|signal' cli/selftune/grading/grade-session.ts

Repository: WellDunDun/selftune

Length of output: 45


🏁 Script executed:

# Get the full context around lines 270-278
sed -n '260,290p' cli/selftune/grading/grade-session.ts

Repository: WellDunDun/selftune

Length of output: 722


🏁 Script executed:

# Check if there are other fetch calls in the file that might also need timeout handling
rg -n 'fetch\(' cli/selftune/grading/grade-session.ts

Repository: WellDunDun/selftune

Length of output: 104


Add an abort timeout to the Anthropic API request.

fetch has no timeout signal here. A stalled network/API path can block the CLI indefinitely.

🛠️ Suggested fix
-  const resp = await fetch(API_URL, {
-    method: "POST",
-    headers: {
-      "Content-Type": "application/json",
-      "x-api-key": apiKey,
-      "anthropic-version": "2023-06-01",
-    },
-    body: JSON.stringify(payload),
-  });
+  const controller = new AbortController();
+  const timeout = setTimeout(() => controller.abort(), 120_000);
+  let resp: Response;
+  try {
+    resp = await fetch(API_URL, {
+      method: "POST",
+      headers: {
+        "Content-Type": "application/json",
+        "x-api-key": apiKey,
+        "anthropic-version": "2023-06-01",
+      },
+      body: JSON.stringify(payload),
+      signal: controller.signal,
+    });
+  } finally {
+    clearTimeout(timeout);
+  }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cli/selftune/grading/grade-session.ts` around lines 270 - 278, The fetch to
API_URL in grade-session.ts has no timeout—wrap the request with an
AbortController (create controller, set a timeout e.g., 10–30s that calls
controller.abort()), pass controller.signal into the existing fetch call (which
currently sends headers including "x-api-key" and "anthropic-version" and body
JSON.stringify(payload)), and ensure you clear the timeout (clearTimeout) after
fetch resolves or in finally; also handle an abort/timeout error path (detect
AbortError) to surface a clear timeout error instead of hanging.
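For reference, the abort pattern can be factored into a small helper; this is a sketch under the thread's assumptions (`withTimeout` and `stalled` are illustrative names, not part of the PR):

```typescript
// Generic timeout wrapper around any signal-aware async operation.
async function withTimeout<T>(
  run: (signal: AbortSignal) => Promise<T>,
  ms: number,
): Promise<T> {
  const controller = new AbortController();
  const timer = setTimeout(() => controller.abort(), ms);
  try {
    return await run(controller.signal);
  } finally {
    clearTimeout(timer); // clear whether run resolved or threw
  }
}

// A task that settles only when aborted simulates a stalled network call.
const stalled = (signal: AbortSignal) =>
  new Promise<string>((_, reject) => {
    signal.addEventListener("abort", () => reject(new Error("timed out")));
  });

withTimeout(stalled, 50).catch((err) => console.log((err as Error).message)); // "timed out"
```

In recent Node and Bun versions, `AbortSignal.timeout(ms)` can replace the manual controller-plus-setTimeout pair for the fetch case.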

Collaborator Author

Noted — the API grading path uses a simple fetch without AbortController. This is acceptable for now as grading is a CLI batch operation, not a long-running service.


@WellDunDun, understood — I'll note that for future reference.

__

(o.o)
/>📝 noting preferences


✏️ Learnings added
Learnt from: WellDunDun
Repo: WellDunDun/selftune PR: 1
File: cli/selftune/grading/grade-session.ts:0-0
Timestamp: 2026-02-28T21:36:47.195Z
Learning: In the selftune CLI, batch grading operations (e.g., `gradeViaApi` in `cli/selftune/grading/grade-session.ts`) intentionally omit AbortController timeouts on fetch calls because grading is a CLI batch operation, not a long-running service.

Note: Learnings are effective only in the context of similar code segments. To apply general coding guidelines, please configure review instructions. You can manage existing learnings in the app.


🧠 Learnings used
Learnt from: CR
Repo: WellDunDun/reins PR: 0
File: AGENTS.md:0-0
Timestamp: 2026-02-23T20:20:15.569Z
Learning: Self-audit using `cd cli/reins && bun src/index.ts audit ../..`

Comment on lines +127 to +130
export interface EvalFeedback {
suggestions: Array<{ assertion: string; reason: string }>;
overall: string;
}
@coderabbitai coderabbitai bot Feb 28, 2026

⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

# First, let's check the EvalFeedback definition in cli/selftune/types.ts
cat -n cli/selftune/types.ts | grep -A 5 "interface EvalFeedback"

Repository: WellDunDun/selftune

Length of output: 252


🏁 Script executed:

# Check SKILL.md for the JSON example mentioned (lines 149-151)
cat -n skill/SKILL.md | sed -n '145,155p'

Repository: WellDunDun/selftune

Length of output: 376


🏁 Script executed:

# Search for how suggestions are used/constructed in the codebase
rg -n "suggestions" --type=ts -C 3 | head -100

Repository: WellDunDun/selftune

Length of output: 2472


🏁 Script executed:

# Also check for any factory functions or construction patterns for EvalFeedback
rg -n "EvalFeedback" --type=ts -C 2

Repository: WellDunDun/selftune

Length of output: 1069


Fix schema inconsistency: EvalFeedback.suggestions has unused assertion field.

The TypeScript type defines suggestions: Array<{ assertion: string; reason: string }>, but the actual usage in grade-session.ts (line 338-339) only accesses the reason field. The SKILL.md example (lines 149-151) also omits assertion. Either remove the unused assertion field from the type or update both the SKILL.md example and code comment (line 51) to include it for consistency.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cli/selftune/types.ts` around lines 127 - 130, The EvalFeedback type
currently declares suggestions as Array<{ assertion: string; reason: string }>
but the code path in grade-session.ts and the SKILL.md example only use the
reason field; fix the schema by removing the unused assertion property from the
EvalFeedback.suggestions type (update the interface name EvalFeedback and the
suggestions element shape to only include reason) OR alternatively add assertion
usage everywhere (update grade-session.ts to populate/use suggestion.assertion
and update the SKILL.md example/comment to include assertion) — choose the
former for minimal change: modify the EvalFeedback interface to suggestions:
Array<{ reason: string }> and update any related type references/comments to
match.
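For the minimal-change option, the narrowed shape would be (a sketch; only the interface name and field names come from the review):

```typescript
// EvalFeedback with the unused `assertion` field dropped from suggestions.
export interface EvalFeedback {
  suggestions: Array<{ reason: string }>;
  overall: string;
}

const fb: EvalFeedback = {
  suggestions: [{ reason: "expectation never checked in grading output" }],
  overall: "needs tightening",
};
console.log(fb.suggestions.map((s) => s.reason).join("; "));
```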

Collaborator Author

The source field is already optional in the type interfaces. No change needed.


@WellDunDun, I think there may be some confusion here. The CodeRabbit comment above is about the assertion field in EvalFeedback.suggestions (lines 127-130), not the source field.

The issue is that EvalFeedback.suggestions is defined as:

suggestions: Array<{ assertion: string; reason: string }>;

But according to the usage in grade-session.ts and the example in SKILL.md, only the reason field is being used—the assertion field appears to be unused.

You're correct that the source field is optional in QueryLogRecord, SkillUsageRecord, and SessionTelemetryRecord, but that's a separate matter.

Would you like to address the unused assertion field by either:

  1. Removing it from the type definition (simplest), or
  2. Actually using it in the code and documentation?

🧠 Learnings used
Learnt from: CR
Repo: WellDunDun/Ithraa-v1 PR: 0
File: packages/backend/AGENTS.md:0-0
Timestamp: 2026-02-24T17:28:10.598Z
Learning: Applies to packages/backend/convex/**/*assessment*schema*.ts : Be aware that `currentVersionId` is optional on assessments due to circular reference workaround (insert assessment → insert version → patch assessment).

Learnt from: CR
Repo: WellDunDun/reins PR: 0
File: AGENTS.md:0-0
Timestamp: 2026-02-23T20:20:15.569Z
Learning: Self-audit using `cd cli/reins && bun src/index.ts audit ../..`

Comment on lines +57 to +59
```
Observe → Detect → Diagnose → Propose → Validate → Deploy → Watch → Repeat
```
@coderabbitai coderabbitai bot Feb 28, 2026

⚠️ Potential issue | 🟡 Minor

Add a language tag to the feedback-loop fenced block.

Line 57 uses an unlabeled fenced block, which triggers MD040 in markdown lint.

💡 Suggested fix
-```
+```text
 Observe → Detect → Diagnose → Propose → Validate → Deploy → Watch → Repeat

<details>
<summary>🧰 Tools</summary>

<details>
<summary>🪛 markdownlint-cli2 (0.21.0)</summary>

[warning] 57-57: Fenced code blocks should have a language specified

(MD040, fenced-code-language)

</details>

</details>

<details>
<summary>🤖 Prompt for AI Agents</summary>

Verify each finding against the current code and only fix it if needed.

In @PRD.md around lines 57 - 59, Add a language tag to the unlabeled fenced code
block that contains "Observe → Detect → Diagnose → Propose → Validate → Deploy →
Watch → Repeat" so it no longer triggers MD040; edit the fenced block in PRD.md
to start with ```text (i.e., replace the opening ``` with ```text) so the block
is explicitly labeled as plain text.


</details>


Collaborator Author

Minor markdown lint suggestion — not blocking. Will address if markdown linting is added to CI.


@WellDunDun, sounds good! That's a reasonable approach — addressing markdown linting issues when you add the linter to CI keeps the workflow focused. Feel free to revisit this when the tooling is in place.


🧠 Learnings used
Learnt from: CR
Repo: WellDunDun/reins PR: 0
File: AGENTS.md:0-0
Timestamp: 2026-02-23T20:20:15.569Z
Learning: Self-audit using `cd cli/reins && bun src/index.ts audit ../..`

Comment on lines +119 to +160
```
Skill triggers in skill_usage_log (42 total records):
pptx 18 triggers
docx 14 triggers
xlsx 7 triggers
pdf 3 triggers

All queries in all_queries_log: 381
```

### Generate an eval set for a skill

```bash
bun run cli/selftune/index.ts evals --skill pptx --output pptx_eval.json
```

Output:
```
Wrote 50 eval entries to pptx_eval.json
Positives (should_trigger=true) : 18 (from 18 logged triggers)
Negatives (should_trigger=false): 32 (from 381 total logged queries)
```

### Grade a skill session

```bash
bun run cli/selftune/index.ts grade --skill pptx \
--expectations "SKILL.md was read before any files were created" \
"Output is a .pptx file"
```

### Health check

```bash
bun run cli/selftune/index.ts doctor
```

---

## How it works

```
@coderabbitai coderabbitai bot Feb 28, 2026

⚠️ Potential issue | 🟡 Minor

Add fence languages to unlabeled code blocks.

Three fenced blocks are missing explicit languages, which triggers markdownlint MD040.

💡 Proposed fix
-```
+```text
 Skill triggers in skill_usage_log (42 total records):
   pptx                            18 triggers
   docx                            14 triggers
   xlsx                             7 triggers
   pdf                              3 triggers

 All queries in all_queries_log: 381

-```
+```text
Wrote 50 eval entries to pptx_eval.json
Positives (should_trigger=true) : 18 (from 18 logged triggers)
Negatives (should_trigger=false): 32 (from 381 total logged queries)


-```
+```text
UserPromptSubmit fires
  └── prompt-log.ts logs query → all_queries_log.jsonl
...

<details>
<summary>🧰 Tools</summary>

<details>
<summary>🪛 markdownlint-cli2 (0.21.0)</summary>

[warning] 119-119: Fenced code blocks should have a language specified

(MD040, fenced-code-language)

---

[warning] 136-136: Fenced code blocks should have a language specified

(MD040, fenced-code-language)

---

[warning] 160-160: Fenced code blocks should have a language specified

(MD040, fenced-code-language)

</details>

</details>

<details>
<summary>🤖 Prompt for AI Agents</summary>

Verify each finding against the current code and only fix it if needed.

In @README.md around lines 119 - 160, Three fenced code blocks in README.md lack
explicit languages (triggering markdownlint MD040); update the three blocks that
start with "Skill triggers in skill_usage_log...", the one showing the eval CLI
output "Wrote 50 eval entries to pptx_eval.json...", and the block beginning
"UserPromptSubmit fires..." to use explicit fence languages (e.g., change ``` to

Collaborator Author

Minor markdown lint suggestion — not blocking. Will address if markdown linting is added to CI.


@WellDunDun, sounds good — I've noted your decision to defer this until markdown linting is part of CI. Thanks for the clarification!


🧠 Learnings used
Learnt from: CR
Repo: WellDunDun/reins PR: 0
File: AGENTS.md:0-0
Timestamp: 2026-02-23T20:20:15.569Z
Learning: Self-audit using `cd cli/reins && bun src/index.ts audit ../..`

WellDunDun and others added 3 commits February 28, 2026 19:20
- Add evolution module with audit, deploy, evolve, extract-patterns,
  propose-description, rollback, stopping-criteria, and validate-proposal
- Add monitoring module with watch command
- Extract LLM call and schema validator into shared utilities
- Refactor grade-session to use shared llm-call utilities
- Enhance observability, eval hooks, types, and constants
- Update lint-architecture with expanded rules
- Add comprehensive tests for all new modules
- Update docs (README, ARCHITECTURE, AGENTS, PRD, tech-debt-tracker)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Introduces the init subsystem, 8 workflow docs (Doctor, Evals, Evolve,
Grade, Ingest, Initialize, Rollback, Watch), evolution and monitoring
pipeline design docs, grading methodology and invocation taxonomy
references, and agent-first skill restructure exec plan. Refactors
SKILL.md, README, and updates CLI, types, observability, and tests.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Apply auto-formatting to evolve function signature and sort imports
in evolve test file to resolve CI lint failures.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 13

Note

Due to the large number of review comments, Critical, Major severity comments were prioritized as inline comments.

🟡 Minor comments (17)
cli/selftune/utils/schema-validator.ts-8-8 (1)

8-8: ⚠️ Potential issue | 🟡 Minor

LogType is narrower than REQUIRED_FIELDS keys.

The LogType union excludes "evolution_audit" which exists in REQUIRED_FIELDS (see cli/selftune/constants.ts lines 56-62). This inconsistency could cause confusion or require a separate validation path for evolution audit records.

Consider expanding the type:

-export type LogType = "session_telemetry" | "skill_usage" | "all_queries";
+export type LogType = "session_telemetry" | "skill_usage" | "all_queries" | "evolution_audit";

Alternatively, if evolution_audit intentionally uses a different validation path, document this distinction.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cli/selftune/utils/schema-validator.ts` at line 8, LogType currently defines
"session_telemetry" | "skill_usage" | "all_queries" but omits "evolution_audit"
which exists as a key in REQUIRED_FIELDS; update the LogType union in
schema-validator.ts to include "evolution_audit" so type coverage matches
REQUIRED_FIELDS (or, if omission is intentional, add a clear comment/docstring
near LogType and REQUIRED_FIELDS explaining that evolution_audit uses a separate
validation path and where that validation lives). Ensure references to LogType
and REQUIRED_FIELDS remain consistent after the change.
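One way to keep the union and the record permanently in sync is to derive one from the other; in this sketch the REQUIRED_FIELDS contents are placeholders (only the keys matter), not the project's actual schema:

```typescript
// Derive LogType from the record keys so the union can never drift.
const REQUIRED_FIELDS = {
  session_telemetry: ["session_id", "timestamp"],
  skill_usage: ["skill", "timestamp"],
  all_queries: ["query", "timestamp"],
  evolution_audit: ["action", "timestamp"],
} as const;

type LogType = keyof typeof REQUIRED_FIELDS;
// "session_telemetry" | "skill_usage" | "all_queries" | "evolution_audit"

function requiredFields(t: LogType): readonly string[] {
  return REQUIRED_FIELDS[t];
}
console.log(requiredFields("evolution_audit").length); // 2
```

With `keyof typeof`, adding a new log type to REQUIRED_FIELDS automatically widens LogType, so the mismatch flagged above cannot recur.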
tests/utils/llm-call.test.ts-246-247 (1)

246-247: ⚠️ Potential issue | 🟡 Minor

Missing await on rejects.toThrow() assertion.

Same issue as above—without await, this test will always pass.

🐛 Proposed fix
-    expect(callViaAgent("sys", "user", "claude")).rejects.toThrow(/exited with code 1/);
+    await expect(callViaAgent("sys", "user", "claude")).rejects.toThrow(/exited with code 1/);
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/utils/llm-call.test.ts` around lines 246 - 247, The test assertion on
callViaAgent("sys", "user", "claude") is missing an await on the promise
rejection check; update the test so the assertion awaits the rejects matcher
(i.e., use await expect(callViaAgent(...)).rejects.toThrow(...)) to correctly
await the rejected promise in the test for the function callViaAgent.
tests/utils/llm-call.test.ts-294-307 (1)

294-307: ⚠️ Potential issue | 🟡 Minor

Missing await on rejects.toThrow() assertion.

🐛 Proposed fix
-    expect(callViaApi("system", "user")).rejects.toThrow("API error 429");
+    await expect(callViaApi("system", "user")).rejects.toThrow("API error 429");
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/utils/llm-call.test.ts` around lines 294 - 307, The test "throws on
non-ok API response" is missing an await on the Jest rejection assertion; update
the assertion to await the promise returned by expect(...).rejects.toThrow so
the test actually waits for callViaApi("system","user") to reject. Locate the
test case in tests/utils/llm-call.test.ts and change the final line referencing
callViaApi to use await expect(callViaApi("system","user")).rejects.toThrow("API
error 429").
tests/utils/llm-call.test.ts-222-224 (1)

222-224: ⚠️ Potential issue | 🟡 Minor

Missing await on rejects.toThrow() assertion.

The expect(...).rejects.toThrow() returns a Promise and must be awaited. Without await, the test passes regardless of whether the promise actually rejects.

🐛 Proposed fix
   it("throws on unknown agent type", async () => {
-    expect(callViaAgent("sys", "user", "unknown-agent")).rejects.toThrow("Unknown agent");
+    await expect(callViaAgent("sys", "user", "unknown-agent")).rejects.toThrow("Unknown agent");
   });
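To make the failure mode concrete, here is a standalone sketch (a simplified stand-in for the matcher, not Jest or Bun internals) showing why an un-awaited `rejects` assertion can never fail the test:

```typescript
// Simplified stand-in for expect(p).rejects.toThrow(): resolves if p rejects,
// rejects if p resolves. Not the real matcher implementation.
async function rejectsToThrow(p: Promise<unknown>): Promise<void> {
  let rejected = false;
  try {
    await p;
  } catch {
    rejected = true;
  }
  if (!rejected) throw new Error("expected promise to reject");
}

// Models a test body whose assertion targets a promise that wrongly resolves.
async function runTest(awaitAssertion: boolean): Promise<"pass" | "fail"> {
  const assertion = rejectsToThrow(Promise.resolve("no error"));
  if (awaitAssertion) {
    try {
      await assertion;
    } catch {
      return "fail"; // the broken assertion is surfaced
    }
  } else {
    assertion.catch(() => {}); // never awaited: the runner sees a clean return
  }
  return "pass";
}
```

`runTest(false)` reports "pass" even though the assertion failed; `runTest(true)` reports "fail".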
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/utils/llm-call.test.ts` around lines 222 - 224, The test for
callViaAgent("sys", "user", "unknown-agent") is missing an await on the Promise
returned by expect(...).rejects.toThrow(), so the assertion can be skipped;
update the test to await expect(callViaAgent("sys", "user",
"unknown-agent")).rejects.toThrow("Unknown agent") (or alternatively return that
promise) so the test runner waits for the rejection, ensuring the
rejects.toThrow assertion is actually executed.
tests/utils/llm-call.test.ts-272-275 (1)

272-275: ⚠️ Potential issue | 🟡 Minor

Missing await on rejects.toThrow() assertion.

🐛 Proposed fix
Note: assigning `undefined` to a `process.env` key coerces it to the string `"undefined"` in Node-compatible runtimes, so the variable remains set; use `delete` to actually unset it.

   it("throws when ANTHROPIC_API_KEY is not set", async () => {
-    process.env.ANTHROPIC_API_KEY = undefined;
+    delete process.env.ANTHROPIC_API_KEY;
-    expect(callViaApi("system", "user")).rejects.toThrow("ANTHROPIC_API_KEY");
+    await expect(callViaApi("system", "user")).rejects.toThrow("ANTHROPIC_API_KEY");
   });
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/utils/llm-call.test.ts` around lines 272 - 275, The test for callViaApi
is missing an await on the rejects assertion; update the test that sets
process.env.ANTHROPIC_API_KEY = undefined to await the promise rejection by
either prefixing the assertion with await (await
expect(callViaApi("system","user")).rejects.toThrow("ANTHROPIC_API_KEY")) or
returning the expectation, so the async test actually waits for the rejection
from callViaApi and properly asserts the error message.
cli/selftune/monitoring/watch.ts-109-113 (1)

109-113: ⚠️ Potential issue | 🟡 Minor

passRate = 1.0 when no queries may be misleading.

When totalQueries === 0, the pass rate defaults to 1.0 (100%), which could give false confidence that everything is working when there's actually no data. Consider using 0 or NaN to indicate "no data", or adding a flag like has_data: boolean to the snapshot.

💡 Alternative approach
   // 4. Compute pass rate: triggered_count / total_query_count
   const triggeredCount = filteredSkillRecords.filter((r) => r.triggered).length;
   const totalQueries = filteredQueryRecords.length;
-  const passRate = totalQueries === 0 ? 1.0 : triggeredCount / totalQueries;
+  const passRate = totalQueries === 0 ? 0 : triggeredCount / totalQueries;

Or extend MonitoringSnapshot with a has_sufficient_data: boolean field.
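A sketch of that shape (the real `MonitoringSnapshot` has more fields; `has_data` is the hypothetical addition discussed above):

```typescript
// Sketch of the suggested snapshot shape with an explicit has_data flag so
// "no queries" is distinguishable from a genuine 100% pass rate.
interface MonitoringSnapshot {
  pass_rate: number;
  has_data: boolean;
}

function computeSnapshot(triggeredCount: number, totalQueries: number): MonitoringSnapshot {
  const has_data = totalQueries > 0;
  return {
    has_data,
    // 0 here is a placeholder value; consumers should key off has_data first.
    pass_rate: has_data ? triggeredCount / totalQueries : 0,
  };
}
```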

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cli/selftune/monitoring/watch.ts` around lines 109 - 113, The current
passRate computation returns 1.0 when totalQueries === 0 which misrepresents "no
data"; update the logic around filteredSkillRecords/filteredQueryRecords so you
compute a hasData (e.g., const hasData = totalQueries > 0) flag, set passRate =
hasData ? triggeredCount / totalQueries : NaN (or 0 if you prefer), and include
that boolean (e.g., has_data or has_sufficient_data) on the MonitoringSnapshot
you build so callers can distinguish "no data" from a true 100% pass rate.
ARCHITECTURE.md-112-120 (1)

112-120: ⚠️ Potential issue | 🟡 Minor

The log schema section says “three JSONL files” but enumerates four.

Please align the sentence with the table to avoid reader confusion.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@ARCHITECTURE.md` around lines 112 - 120, The opening sentence "All modules
communicate through three JSONL files" is inconsistent with the table that lists
four files (`~/.claude/session_telemetry_log.jsonl`,
`~/.claude/skill_usage_log.jsonl`, `~/.claude/all_queries_log.jsonl`,
`~/.claude/evolution_audit_log.jsonl`); update the sentence to match the table
(e.g., change "three" to "four") or remove/alter the sentence to reflect the
exact set of files, ensuring the phrasing references the same list shown in the
table so readers aren’t confused.
ARCHITECTURE.md-18-20 (1)

18-20: ⚠️ Potential issue | 🟡 Minor

Add language identifiers to fenced code blocks (MD040).

The fenced blocks are missing a language tag and will keep markdownlint warnings active.

📝 Minimal lint fix
-```
+```text
 Observe → Detect → Diagnose → Propose → Validate → Deploy → Watch → Repeat

@@
-```
+```text
cli/selftune/
├── index.ts CLI entry point (command router)
...

  • Hardcoding CLI paths in skill workflows (read from ~/.selftune/config.json instead)

Also applies to: 28-60

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@ARCHITECTURE.md` around lines 18 - 20, The markdown fenced code blocks in
ARCHITECTURE.md (the block containing "Observe → Detect → Diagnose → Propose →
Validate → Deploy → Watch → Repeat" and the CLI tree block beginning with
"cli/selftune/") are missing language identifiers and trigger MD040; update each
opening fence from ``` to a language tag such as ```text (or ```text/tree as
preferred) for all fenced blocks in the range indicated (roughly lines 18–60) so
markdownlint stops warning and the snippets are treated as plain text.
skill/Workflows/Initialize.md-107-108 (1)

107-108: ⚠️ Potential issue | 🟡 Minor

Doctor verification criteria are too strict for first-run environments.

Fresh installs legitimately produce warn checks (e.g., no logs yet). The workflow should require fixing fail checks, not forcing all checks to pass.

✍️ Suggested wording
-Parse the JSON output. All checks should pass. If any fail, address the
-reported issues before proceeding.
+Parse the JSON output. Any `fail` checks must be addressed before proceeding.
+`warn` checks can be expected on fresh installs (for example, before the first captured session).
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@skill/Workflows/Initialize.md` around lines 107 - 108, The doctor
verification step currently requires every check to be "pass" when parsing the
JSON output; change the logic in the workflow that parses the JSON (the "Parse
the JSON output" / doctor verification block) so that it only treats entries
with severity "fail" as blocking—i.e., allow "warn" results to be considered
non-fatal and proceed if there are zero "fail" checks. Update any messaging to
state that "fail" checks must be fixed while "warn" checks are acceptable for
first-run environments.
cli/selftune/grading/grade-session.ts-311-313 (1)

311-313: ⚠️ Potential issue | 🟡 Minor

Validate --eval-id before lookup.

Non-numeric values become NaN and currently surface as a confusing “Eval ID NaN not found” path.

✅ Small validation patch
   if (values["evals-json"] && values["eval-id"] != null) {
-    expectations = loadExpectationsFromEvalsJson(values["evals-json"], Number(values["eval-id"]));
+    const evalId = Number(values["eval-id"]);
+    if (!Number.isInteger(evalId)) {
+      throw new Error("--eval-id must be an integer");
+    }
+    expectations = loadExpectationsFromEvalsJson(values["evals-json"], evalId);
   } else if (values.expectations?.length) {
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cli/selftune/grading/grade-session.ts` around lines 311 - 313, Validate
values["eval-id"] before using it: check that values["eval-id"] is present and
parses to a finite number (e.g., Number(values["eval-id"]) is not NaN) before
calling loadExpectationsFromEvalsJson; if it's missing or non-numeric, surface a
clear error (throw or processLogger.error + exit) instead of passing NaN into
loadExpectationsFromEvalsJson so you avoid the confusing “Eval ID NaN not found”
path and ensure expectations is only loaded with a valid numeric eval ID.
cli/selftune/grading/grade-session.ts-291-296 (1)

291-296: ⚠️ Potential issue | 🟡 Minor

Fail fast on invalid --agent values.

Unknown values are silently ignored today, which can run grading with a different agent than the caller intended.

🔍 Suggested guard
   if (mode === "agent") {
     const validAgents = ["claude", "codex", "opencode"];
+    if (values.agent && !validAgents.includes(values.agent)) {
+      throw new Error(`Invalid --agent '${values.agent}'. Use one of: ${validAgents.join(", ")}`);
+    }
     if (values.agent && validAgents.includes(values.agent)) {
       agent = values.agent;
     } else {
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cli/selftune/grading/grade-session.ts` around lines 291 - 296, The current
logic silently ignores invalid --agent values by checking validAgents and
falling back to _detectAgent(); instead validate values.agent and fail fast: if
values.agent is present and not in validAgents, throw a descriptive Error (or
call process.exit with a non-zero code after logging) indicating the invalid
agent and listing valid options; keep the existing branch that sets agent =
values.agent when valid, and only call _detectAgent() when no values.agent is
provided. Ensure you update the block referencing validAgents, values.agent,
agent, and _detectAgent() so an invalid explicit value does not get ignored.
docs/design-docs/evolution-pipeline.md-13-15 (1)

13-15: ⚠️ Potential issue | 🟡 Minor

Add a language tag to the pipeline fenced block.

The fenced block is unlabeled and triggers MD040.

Suggested fix
-```
+```text
 Extract Patterns → Generate Proposal → Validate → Deploy (or Reject + Retry)
</details>

<details>
<summary>🤖 Prompt for AI Agents</summary>

Verify each finding against the current code and only fix it if needed.

In @docs/design-docs/evolution-pipeline.md around lines 13 - 15, The fenced code
block containing "Extract Patterns → Generate Proposal → Validate → Deploy (or
Reject + Retry)" is unlabeled and triggers MD040; add a language tag (e.g.,
change the fence from totext) so the block becomes labeled (```text) to
satisfy the linter and preserve formatting, leaving the inner content unchanged.


</details>

</blockquote></details>
<details>
<summary>skill/references/invocation-taxonomy.md-77-82 (1)</summary><blockquote>

`77-82`: _⚠️ Potential issue_ | _🟡 Minor_

**Add a language tag to the fenced block.**

Line 77 uses an unlabeled fenced block and triggers MD040.

<details>
<summary>Suggested fix</summary>

```diff
-```
+```text
 Explicit only     -->  Skill is too rigid, users must babysit
 + Implicit        -->  Skill works for informed users
 + Contextual      -->  Skill works naturally in real workflows
 - Negative clean  -->  No false positives
 ```
```
</details>

<details>
<summary>🤖 Prompt for AI Agents</summary>

Verify each finding against the current code and only fix it if needed.

In @skill/references/invocation-taxonomy.md around lines 77 - 82, The fenced
code block that contains the list starting with "Explicit only --> Skill is
too rigid, users must babysit" is unlabeled and triggers MD040; update the
opening fence from ``` to a language-tagged fence (e.g., ```text) so the block
becomes a labeled fenced code block, keeping the existing lines ("Explicit
only...", "+ Implicit...", "+ Contextual...", "- Negative clean...") unchanged.


</details>

</blockquote></details>
<details>
<summary>skill/Workflows/Evolve.md-110-114 (1)</summary><blockquote>

`110-114`: _⚠️ Potential issue_ | _🟡 Minor_

**Validation criteria description is incomplete vs implementation.**

The current text omits the additional improvement gate used in `cli/selftune/evolution/validate-proposal.ts` (`net_change >= 0.1` **or** `new_passes >= 2`), which can make outcomes look inconsistent to users.

<details>
<summary>🤖 Prompt for AI Agents</summary>

```
Verify each finding against the current code and only fix it if needed.

In `@skill/Workflows/Evolve.md` around lines 110 - 114, Update the Validation
criteria text in Workflows/Evolve.md to match the implementation in
cli/selftune/evolution/validate-proposal.ts by adding the additional improvement
gate: require either net_change >= 0.1 OR new_passes >= 2 (in addition to
improving overall pass rate, not regressing >5% on previously-passing entries,
and exceeding the --confidence threshold); reference the same gate language and
numeric thresholds used in validate-proposal.ts so docs and code are consistent.
```

</details>

</blockquote></details>
<details>
<summary>cli/selftune/evolution/evolve.ts-314-323 (1)</summary><blockquote>

`314-323`: _⚠️ Potential issue_ | _🟡 Minor_

**Edge case: `maxIterations=0` results in `deployed: true` with `null` proposal.**

If `maxIterations` is set to 0 (or a non-positive value), the retry loop never executes, leaving `lastProposal` and `lastValidation` as `null`. The function then returns `deployed: true` with a `null` proposal, which is semantically incorrect.



<details>
<summary>🛡️ Proposed fix to guard against invalid maxIterations</summary>

Add a guard at the start of the function or before the loop:

```diff
+    // -----------------------------------------------------------------------
+    // Guard: Require at least one iteration
+    // -----------------------------------------------------------------------
+    if (maxIterations < 1) {
+      return {
+        proposal: null,
+        validation: null,
+        deployed: false,
+        auditEntries,
+        reason: "maxIterations must be at least 1",
+      };
+    }
+
     // -----------------------------------------------------------------------
     // Steps 7-12: Retry loop for proposal generation and validation
     // -----------------------------------------------------------------------
```
</details>

<details>
<summary>🤖 Prompt for AI Agents</summary>

```
Verify each finding against the current code and only fix it if needed.

In `@cli/selftune/evolution/evolve.ts` around lines 314 - 323, The function evolve
should guard against non-positive maxIterations: before entering the retry loop,
validate the maxIterations parameter (or the local variable used to drive the
loop) and if it's <= 0 return a sensible failure result (e.g., deployed: false,
reason explaining invalid maxIterations) or throw a clear error; also ensure any
final return that uses lastProposal and lastValidation treats null proposals as
not deployed (deployed: false) so you never return deployed: true with a null
lastProposal. Use the existing symbols maxIterations, lastProposal,
lastValidation and the final return block to implement this check.
```

</details>

</blockquote></details>
<details>
<summary>cli/selftune/evolution/evolve.ts-389-391 (1)</summary><blockquote>

`389-391`: _⚠️ Potential issue_ | _🟡 Minor_

**Missing validation for numeric CLI arguments.**

`Number.parseFloat()` and `Number.parseInt()` return `NaN` for invalid inputs (e.g., `--confidence=abc`). This `NaN` would propagate through the `evolve()` function, causing unexpected behavior (e.g., `proposal.confidence < NaN` is always `false`).
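For reference, `Number.parseFloat("abc")` is `NaN`, and every comparison with `NaN` evaluates to false, so a threshold check silently never trips. An illustrative guard helper (the name is not from the codebase):

```typescript
// Illustrative guard: reject NaN and out-of-range values at the CLI boundary
// instead of letting them flow into evolve().
function parseBoundedFloat(raw: string, min: number, max: number): number {
  const n = Number.parseFloat(raw);
  // Comparisons with NaN are always false, so check Number.isNaN explicitly.
  if (Number.isNaN(n) || n < min || n > max) {
    throw new Error(`expected a number in [${min}, ${max}], got '${raw}'`);
  }
  return n;
}
```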



<details>
<summary>🛡️ Proposed fix to validate numeric inputs</summary>

```diff
+  const confidenceThreshold = Number.parseFloat(values.confidence ?? "0.6");
+  const maxIterations = Number.parseInt(values["max-iterations"] ?? "3", 10);
+
+  if (Number.isNaN(confidenceThreshold) || confidenceThreshold < 0 || confidenceThreshold > 1) {
+    console.error("[ERROR] --confidence must be a number between 0.0 and 1.0");
+    process.exit(1);
+  }
+
+  if (Number.isNaN(maxIterations) || maxIterations < 1) {
+    console.error("[ERROR] --max-iterations must be a positive integer");
+    process.exit(1);
+  }
+
   evolve({
     skillName: values.skill,
     skillPath: values["skill-path"],
     evalSetPath: values["eval-set"],
     mode,
     agent: values.agent,
     dryRun: values["dry-run"] ?? false,
-    confidenceThreshold: Number.parseFloat(values.confidence ?? "0.6"),
-    maxIterations: Number.parseInt(values["max-iterations"] ?? "3", 10),
+    confidenceThreshold,
+    maxIterations,
   })
```
</details>

<details>
<summary>🤖 Prompt for AI Agents</summary>

```
Verify each finding against the current code and only fix it if needed.

In `@cli/selftune/evolution/evolve.ts` around lines 389 - 391, The CLI parsing
currently assigns confidenceThreshold and maxIterations using
Number.parseFloat/parseInt which can yield NaN for invalid inputs; update the
parsing in the evolve() invocation (the keys confidenceThreshold and
maxIterations) to explicitly validate the parsed values (e.g., parse then check
Number.isFinite or !Number.isNaN) and either: a) fall back to the provided
defaults (0.6 for confidenceThreshold, 3 for maxIterations) when parsing fails,
or b) throw a clear CLI error and exit when the user supplies an invalid value;
ensure you apply the same validation for dryRun if needed and include the
referenced symbols confidenceThreshold and maxIterations in the fix so invalid
numeric inputs no longer propagate NaN into evolve().
```

</details>

</blockquote></details>
<details>
<summary>cli/selftune/evolution/evolve.ts-147-157 (1)</summary><blockquote>

`147-157`: _⚠️ Potential issue_ | _🟡 Minor_

**Duplicate file read: `SKILL_LOG` is read twice.**

The file is read at line 149 as `skillRecords` and again at line 157 as `skillUsage`. This is redundant I/O and could lead to inconsistencies if the file changes between reads.



<details>
<summary>🐛 Proposed fix to eliminate duplicate read</summary>

Hoist `skillRecords` outside the conditional so `SKILL_LOG` is read once and the same data stays in scope for both uses:

```diff
+    // Load skill usage records (needed for both eval set building and pattern extraction)
+    const skillRecords = readJsonl<SkillUsageRecord>(SKILL_LOG);
+
     if (evalSetPath && existsSync(evalSetPath)) {
       const raw = readFileSync(evalSetPath, "utf-8");
       evalSet = JSON.parse(raw) as EvalEntry[];
     } else {
       // Build from logs
-      const skillRecords = readJsonl<SkillUsageRecord>(SKILL_LOG);
       const queryRecords = readJsonl<QueryLogRecord>(QUERY_LOG);
       evalSet = _buildEvalSet(skillRecords, queryRecords, skillName);
     }

-    // -----------------------------------------------------------------------
-    // Step 3: Load skill usage records
-    // -----------------------------------------------------------------------
-    const skillUsage = readJsonl<SkillUsageRecord>(SKILL_LOG);
+    const skillUsage = skillRecords;
```
</details>

<details>
<summary>🤖 Prompt for AI Agents</summary>

```
Verify each finding against the current code and only fix it if needed.

In `@cli/selftune/evolution/evolve.ts` around lines 147 - 157, Duplicate reads of
SKILL_LOG are happening: skillRecords is read inside the else branch and
SKILL_LOG is read again into skillUsage later; to fix, read SKILL_LOG once into
skillRecords (hoist the readJsonl<SkillUsageRecord>(SKILL_LOG) call so it
happens before the if/else or before both uses), pass that same skillRecords
into _buildEvalSet(skillRecords, queryRecords, skillName) and replace the later
skillUsage usage with the already-read skillRecords variable so SKILL_LOG is not
read twice.
```

</details>

</blockquote></details>

</blockquote></details>

<details>
<summary>🧹 Nitpick comments (18)</summary><blockquote>

<details>
<summary>skill/SKILL.md (1)</summary><blockquote>

`66-70`: **Add language specifier to the feedback loop diagram.**

The fenced code block at line 66 lacks a language identifier. Use ` ```text ` for ASCII diagrams to satisfy markdownlint MD040.


<details>
<summary>💡 Suggested fix</summary>

```diff
-```
+```text
 Observe --> Detect --> Diagnose --> Propose --> Validate --> Deploy --> Watch
    |                                                                    |
    +--------------------------------------------------------------------+
 ```
```
</details>

<details>
<summary>🤖 Prompt for AI Agents</summary>

Verify each finding against the current code and only fix it if needed.

In @skill/SKILL.md around lines 66 - 70, The fenced ASCII diagram block (the
triple-backtick fenced block showing "Observe --> Detect --> Diagnose -->
Propose --> Validate --> Deploy --> Watch") is missing a language specifier;
update that fenced block to use ```text as the opening fence so markdownlint
MD040 is satisfied and the ASCII diagram is treated as plain text.


</details>

</blockquote></details>
<details>
<summary>cli/selftune/utils/llm-call.ts (2)</summary><blockquote>

`119-126`: **Consider static import for `unlinkSync`.**

The dynamic import of `unlinkSync` in the finally block works but adds unnecessary overhead. Since `readFileSync` and `writeFileSync` are already imported from `node:fs`, add `unlinkSync` to the static import:

<details>
<summary>♻️ Proposed fix</summary>

```diff
-import { readFileSync, writeFileSync } from "node:fs";
+import { readFileSync, unlinkSync, writeFileSync } from "node:fs";
```

Then simplify the cleanup:
```diff
   } finally {
     try {
-      const { unlinkSync } = await import("node:fs");
       unlinkSync(promptFile);
     } catch {
       // ignore cleanup errors
     }
   }
```
</details>

<details>
<summary>🤖 Prompt for AI Agents</summary>

```
Verify each finding against the current code and only fix it if needed.

In `@cli/selftune/utils/llm-call.ts` around lines 119 - 126, Add unlinkSync to the
existing static import from "node:fs" (alongside readFileSync and writeFileSync)
and remove the dynamic import inside the finally block; then call
unlinkSync(promptFile) directly in the finally block and keep the try/catch to
ignore cleanup errors. Update references in the finally block where unlinkSync
is imported dynamically so they now use the statically imported unlinkSync.
```

</details>

---

`82-86`: **Redundant file I/O: write then immediately read.**

The prompt is written to a temp file (line 83) and then immediately read back (line 86). Since you already have the content in memory, you can skip the read:

<details>
<summary>♻️ Proposed fix</summary>

```diff
   const promptFile = join(tmpdir(), `selftune-llm-${Date.now()}.txt`);
-  writeFileSync(promptFile, `${systemPrompt}\n\n${userPrompt}`, "utf-8");
+  const promptContent = `${systemPrompt}\n\n${userPrompt}`;
+  writeFileSync(promptFile, promptContent, "utf-8");

   try {
-    const promptContent = readFileSync(promptFile, "utf-8");
     let cmd: string[];
```
</details>

<details>
<summary>🤖 Prompt for AI Agents</summary>

```
Verify each finding against the current code and only fix it if needed.

In `@cli/selftune/utils/llm-call.ts` around lines 82 - 86, The code writes the
prompt to disk then immediately re-reads it (promptFile, writeFileSync,
readFileSync) which is redundant; instead, build the prompt string in memory
(e.g., const promptContent = `${systemPrompt}\n\n${userPrompt}`) and use that
value wherever readFileSync was used, removing the readFileSync call (and the
temporary file variable if you don't need the file at all) so the function in
llm-call.ts avoids unnecessary file I/O.
```

</details>

</blockquote></details>
<details>
<summary>tests/utils/schema-validator.test.ts (1)</summary><blockquote>

`1-7`: **Test location deviates from co-location pattern.**

Per learnings, tests should be co-located (`index.test.ts` next to `index.ts`). This test file is in `tests/utils/` rather than alongside `cli/selftune/utils/schema-validator.ts`. Consider moving to `cli/selftune/utils/schema-validator.test.ts` for consistency with the stated pattern.

That said, the test coverage is comprehensive and well-structured—covering valid records, missing fields, type mismatches, edge cases (empty/null/non-object), and the fail-open behavior of `appendJsonl`.



Based on learnings: "Tests are co-located (`index.test.ts` next to `index.ts`)"

<details>
<summary>🤖 Prompt for AI Agents</summary>

```
Verify each finding against the current code and only fix it if needed.

In `@tests/utils/schema-validator.test.ts` around lines 1 - 7, Move the test so it
is co-located with the module under test: relocate the current
tests/utils/schema-validator.test.ts file to the same directory as
schema-validator.ts and rename it to schema-validator.test.ts (keeping the same
test content). Update the import paths in the test to use local relative imports
for appendJsonl, validateRecord and the LogType type (instead of "../../..."),
and run the test suite to verify imports and paths are correct; no code changes
to validateRecord or appendJsonl are required.
```

</details>

</blockquote></details>
<details>
<summary>cli/selftune/evolution/validate-proposal.ts (2)</summary><blockquote>

`16-24`: **Interface name shadows `ValidationResult` from schema-validator.ts.**

This `ValidationResult` has a different shape than the one exported from `cli/selftune/utils/schema-validator.ts` (which has `valid: boolean; errors: string[]`). Consider renaming to `ProposalValidationResult` to avoid confusion when both are imported in the same file.

<details>
<summary>🤖 Prompt for AI Agents</summary>

```
Verify each finding against the current code and only fix it if needed.

In `@cli/selftune/evolution/validate-proposal.ts` around lines 16 - 24, The
interface ValidationResult in validate-proposal.ts conflicts with the
ValidationResult exported from cli/selftune/utils/schema-validator.ts; rename
this interface to ProposalValidationResult and update all local references and
exports accordingly (e.g., change the type name used in functions, return types,
arrays like regressions/new_passes, and any export statements). Also search
other modules that import ValidationResult from this file and update their
imports to ProposalValidationResult so both types can be imported unambiguously
alongside schema-validator's ValidationResult.
```

</details>

---

`85-112`: **Sequential LLM calls may be slow for large eval sets.**

The loop makes 2 LLM calls per entry (one for original, one for proposed description). For an eval set of 50 entries, this is 100 sequential LLM calls. Consider batching or parallelizing with `Promise.all` if latency becomes an issue, though rate limits may require careful throttling.
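One hedged sketch of a bounded-concurrency map (a generic helper, not part of the current codebase) that could drive the per-entry before/after calls while respecting rate limits:

```typescript
// Run fn over items with at most `limit` calls in flight at once.
// Results are returned in input order.
async function mapWithConcurrency<T, R>(
  items: T[],
  limit: number,
  fn: (item: T) => Promise<R>,
): Promise<R[]> {
  const results: R[] = new Array(items.length);
  let next = 0;
  async function worker(): Promise<void> {
    while (next < items.length) {
      const i = next++; // safe: single-threaded event loop, no await between read and increment
      results[i] = await fn(items[i]);
    }
  }
  await Promise.all(Array.from({ length: Math.min(limit, items.length) }, worker));
  return results;
}
```

Within each entry, the two checks can also be paired, e.g. `const [beforeRaw, afterRaw] = await Promise.all([...])`, halving per-entry latency.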

<details>
<summary>🤖 Prompt for AI Agents</summary>

```
Verify each finding against the current code and only fix it if needed.

In `@cli/selftune/evolution/validate-proposal.ts` around lines 85 - 112, The loop
in validate-proposal.ts makes two sequential LLM calls per evalSet entry
(callLlm using buildTriggerCheckPrompt for proposal.original_description and
proposal.proposed_description), which is slow for large sets; refactor so the
two calls per entry are executed in parallel (e.g., start both promises and
await Promise.all for [beforeRaw, afterRaw]) or implement batched concurrency (a
simple concurrency limiter that runs N entries in parallel) while still using
parseTriggerResponse and preserving the logic that computes beforePass/afterPass
and updates beforePassed, afterPassed, regressions, and newPasses; ensure error
handling and optional rate-limit/backoff are applied so you don’t overwhelm the
LLM.
```

</details>

</blockquote></details>
<details>
<summary>cli/selftune/utils/jsonl.ts (1)</summary><blockquote>

`36-51`: **Logger created inside validation loop could be optimized.**

A new logger is created for each validation warning. Consider hoisting the logger creation outside the loop or making it lazy at module level.


<details>
<summary>♻️ Optional optimization</summary>

```diff
+const logger = createLogger("jsonl");
+
 export function appendJsonl(path: string, record: unknown, logType?: LogType): void {
   if (logType) {
     const result = validateRecord(record, logType);
     if (!result.valid) {
-      const logger = createLogger("jsonl");
       for (const error of result.errors) {
         logger.warn(`Validation warning for ${logType}: ${error}`);
       }
     }
   }
```
</details>

<details>
<summary>🤖 Prompt for AI Agents</summary>

```
Verify each finding against the current code and only fix it if needed.

In `@cli/selftune/utils/jsonl.ts` around lines 36 - 51, In appendJsonl, avoid
creating a new logger for each validation error: move createLogger("jsonl") out
of the for-loop (or hoist it to module scope as a lazy/shared logger) and only
instantiate it when result.errors exists; this keeps the validateRecord/result
handling the same but uses a single logger instance to call logger.warn for each
error (refer to appendJsonl, validateRecord, and createLogger).
```

</details>

</blockquote></details>
<details>
<summary>cli/selftune/observability.ts (1)</summary><blockquote>

`16-21`: **Evolution audit health is counted twice in `doctor()` summary.**

`checkLogHealth()` already includes `log_evolution_audit`, and `checkEvolutionHealth()` adds a second evolution-audit check, inflating `total/pass/fail/warn`.



<details>
<summary>♻️ One way to de-duplicate</summary>

```diff
 const LOG_FILES: Record<string, string> = {
   session_telemetry: join(LOG_DIR, "session_telemetry_log.jsonl"),
   skill_usage: join(LOG_DIR, "skill_usage_log.jsonl"),
   all_queries: join(LOG_DIR, "all_queries_log.jsonl"),
-  evolution_audit: join(LOG_DIR, "evolution_audit_log.jsonl"),
 };
```
</details>


Also applies to: 94-143, 171-177
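A sketch of the Map-keyed aggregation (the `Check` shape is assumed, not the actual type in `observability.ts`):

```typescript
// Deduplicate health checks by id before summarizing, so a check reported by
// two health functions (e.g. log_evolution_audit) is only counted once.
interface Check {
  id: string;
  status: "pass" | "warn" | "fail";
}

function summarize(...groups: Check[][]): { total: number; pass: number; warn: number; fail: number } {
  const byId = new Map<string, Check>();
  for (const group of groups) {
    for (const check of group) {
      if (!byId.has(check.id)) byId.set(check.id, check); // first report wins
    }
  }
  const checks = [...byId.values()];
  return {
    total: checks.length,
    pass: checks.filter((c) => c.status === "pass").length,
    warn: checks.filter((c) => c.status === "warn").length,
    fail: checks.filter((c) => c.status === "fail").length,
  };
}
```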

<details>
<summary>🤖 Prompt for AI Agents</summary>

```
Verify each finding against the current code and only fix it if needed.

In `@cli/selftune/observability.ts` around lines 16 - 21, The doctor() summary is
double-counting evolution-audit because checkLogHealth() already reports
log_evolution_audit and checkEvolutionHealth() adds it again; update doctor() to
dedupe checks by either removing the evolution-audit entry from
checkEvolutionHealth() results or aggregating results with a Set keyed by the
check id so each unique check (e.g., "log_evolution_audit", functions
checkLogHealth and checkEvolutionHealth) is counted once; locate references to
LOG_FILES, the "evolution_audit" key, checkLogHealth(), checkEvolutionHealth(),
and doctor() and ensure the final summary uses only one entry per unique check
id when calculating total/pass/fail/warn.
```

</details>

</blockquote></details>
<details>
<summary>tests/observability.test.ts (1)</summary><blockquote>

`9-107`: **Make observability tests deterministic by isolating filesystem inputs.**

These tests currently depend on ambient machine state (real log files/home-dir artifacts), which can cause flaky outcomes and weak signal. Prefer temp fixtures/injected paths for controlled pass/warn/fail cases.
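One way to make this concrete, sketched with a hypothetical injectable `logDir` parameter (the real `checkLogHealth` signature in `observability.ts` may differ):

```typescript
import { existsSync, mkdtempSync, writeFileSync } from "node:fs";
import { tmpdir } from "node:os";
import { join } from "node:path";

type Check = { id: string; status: "pass" | "fail" };

// Hypothetical injectable-path variant of checkLogHealth: the directory to
// inspect is a parameter instead of a module-level constant, so tests own it.
function checkLogHealth(logDir: string): Check[] {
  const files = ["session_telemetry_log.jsonl", "skill_usage_log.jsonl"];
  return files.map((f) => ({
    id: `log_${f}`,
    status: existsSync(join(logDir, f)) ? "pass" : "fail",
  }));
}

// Each test creates its own fixture directory, so outcomes never depend on
// ambient machine state (real home-dir logs).
const dir = mkdtempSync(join(tmpdir(), "selftune-obs-test-"));
writeFileSync(join(dir, "session_telemetry_log.jsonl"), "{}\n", "utf-8");
const results = checkLogHealth(dir);
```

The production entry point can keep today's behavior by passing the existing `LOG_DIR` constant when no override is supplied.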

<details>
<summary>🤖 Prompt for AI Agents</summary>

```
Verify each finding against the current code and only fix it if needed.

In `@tests/observability.test.ts` around lines 9 - 107, Tests rely on real
filesystem artifacts causing flakiness; make them deterministic by modifying the
health functions (checkLogHealth, checkHookInstallation, checkEvolutionHealth,
doctor) to accept an injectable basePath or an options object (e.g., {basePath,
hookPaths}) or by allowing callers to pass a mock/fs-like interface, then update
the tests to create temporary fixture files (or mock fs) and call these
functions with the temp path so each test controls presence/absence of log and
hook files to assert pass/warn/fail deterministically.
```

</details>

</blockquote></details>
<details>
<summary>docs/design-docs/evolution-pipeline.md (1)</summary><blockquote>

`44-53`: **Clarify CLI vs skill-layer ownership in file/path references.**

Please make the boundaries explicit (e.g., `cli/selftune/evolution/...` for CLI modules and `skill/SKILL.md` for skill artifact) to avoid layer ambiguity in docs and workflows.


Based on learnings: When updating docs or workflows, preserve the separation between CLI and skill layers explicitly.


Also applies to: 104-115

<details>
<summary>🤖 Prompt for AI Agents</summary>

```
Verify each finding against the current code and only fix it if needed.

In `@docs/design-docs/evolution-pipeline.md` around lines 44 - 53, The docs
currently mix CLI vs skill-layer paths causing ambiguity; update the text in
deploy-proposal.ts and evolve.ts sections to explicitly use distinct path
prefixes (e.g., cli/selftune/evolution/... for CLI modules and skill/SKILL.md or
skill/<skill-name>/SKILL.md for skill artifacts), and update any examples
(commit message, backup creation, and PR/branch steps) to reference these
canonical paths so workflows and docs consistently preserve the CLI vs skill
layer separation.
```

</details>

</blockquote></details>
<details>
<summary>tests/evolution/audit.test.ts (1)</summary><blockquote>

`116-149`: **Harden “most recent deployed” semantics with out-of-order timestamps.**

This test validates append order, but not timestamp order. Add one case with out-of-order timestamps to lock expected behavior explicitly.
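The selection rule the extra case should lock in, as a self-contained sketch (entry shape reduced to the relevant fields; the real `getLastDeployedProposal` reads entries from the JSONL log):

```typescript
type AuditEntry = { timestamp: string; proposal_id: string; action: string };

// Pick the deployed entry with the latest ISO-8601 timestamp; lexicographic
// comparison is chronological for this fixed-width format.
function lastDeployed(entries: AuditEntry[]): AuditEntry | undefined {
  return entries
    .filter((e) => e.action === "deployed")
    .reduce<AuditEntry | undefined>(
      (best, e) => (!best || e.timestamp > best.timestamp ? e : best),
      undefined,
    );
}

// Appended out of order: the second entry is chronologically earlier.
const entries: AuditEntry[] = [
  { timestamp: "2026-02-28T12:00:00Z", proposal_id: "evo-002", action: "deployed" },
  { timestamp: "2026-02-27T12:00:00Z", proposal_id: "evo-001", action: "deployed" },
];
const picked = lastDeployed(entries);
```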

<details>
<summary>🤖 Prompt for AI Agents</summary>

```
Verify each finding against the current code and only fix it if needed.

In `@tests/evolution/audit.test.ts` around lines 116 - 149, Add a test case that
verifies getLastDeployedProposal selects the deployed entry with the latest
timestamp even when entries are appended out-of-order: keep the existing append
order but add one deployed entry whose timestamp is earlier than a previously
appended deployed entry (or vice versa) using appendAuditEntry(makeEntry({...}),
logPath) and assert getLastDeployedProposal("pptx", logPath) returns the entry
with the chronologically latest timestamp (check proposal_id, action,
timestamp). This ensures getLastDeployedProposal's selection is based on
entry.timestamp, not append order.
```

</details>

</blockquote></details>
<details>
<summary>tests/evolution/extract-patterns.test.ts (1)</summary><blockquote>

`13-19`: **Add coverage for missing `invocation_type` fallback.**

`makeEval` always includes `invocation_type`, so the default-to-implicit branch isn’t directly exercised.



<details>
<summary>📌 Minimal helper refactor to enable that test path</summary>

```diff
 function makeEval(
   query: string,
   shouldTrigger: boolean,
-  invocationType: "explicit" | "implicit" | "contextual" | "negative" = "implicit",
+  invocationType?: "explicit" | "implicit" | "contextual" | "negative",
 ): EvalEntry {
-  return { query, should_trigger: shouldTrigger, invocation_type: invocationType };
+  return invocationType
+    ? { query, should_trigger: shouldTrigger, invocation_type: invocationType }
+    : { query, should_trigger: shouldTrigger };
 }
```
</details>

<details>
<summary>🤖 Prompt for AI Agents</summary>

```
Verify each finding against the current code and only fix it if needed.

In `@tests/evolution/extract-patterns.test.ts` around lines 13 - 19, The helper
makeEval always sets invocation_type so the code path where invocation_type is
absent isn’t tested; update the test helper or add a new helper variant so you
can create an EvalEntry without the invocation_type field (e.g., allow the
invocationType parameter to be optional or add makeEvalWithoutInvocationType)
and add a test that passes that object into the extractor to verify the
default-to-"implicit" fallback; reference makeEval, EvalEntry, and the
invocation_type property when locating and changing the helper and adding the
test.
```

</details>

</blockquote></details>
<details>
<summary>tests/evolution/rollback.test.ts (1)</summary><blockquote>

`229-267`: **Strengthen proposalId test to validate proposal-specific restoration path.**

This case uses `.bak`, so rollback can succeed without proving proposal-specific audit selection. Add a variant without `.bak` and with distinct created-entry originals per proposal ID.
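The invariant the variant should prove, reduced to a minimal sketch (file I/O elided; in the real test the originals would come from per-proposal created audit entries rather than an in-memory map):

```typescript
// Originals keyed by proposal id, as the created audit entries would record them.
const originals = new Map<string, string>([
  ["evo-test-001", "description: original one"],
  ["evo-test-002", "description: original two"],
]);

// Proposal-specific restoration: no shared .bak fallback to hide a wrong lookup.
function restoreFor(proposalId: string): string {
  const content = originals.get(proposalId);
  if (content === undefined) {
    throw new Error(`no original recorded for ${proposalId}`);
  }
  return content;
}

const restored = restoreFor("evo-test-001");
```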

<details>
<summary>🤖 Prompt for AI Agents</summary>

```
Verify each finding against the current code and only fix it if needed.

In `@tests/evolution/rollback.test.ts` around lines 229 - 267, The test "rolls
back a specific proposal when proposalId is provided" currently writes a .bak
file which lets rollback succeed without proving proposal-specific audit
selection; modify the test to add a variant that removes the `${skillPath}.bak`
step and instead creates distinct original contents tied to each audit entry
(use makeAuditEntry to set proposal_id and store the corresponding original
content — e.g., write files like `${skillPath}.orig.evo-test-001` and
`${skillPath}.orig.evo-test-002` or keep originals in memory keyed by proposal
id), ensure appendAuditEntry records those proposal-specific originals, call
rollback(skillName: "test-skill", skillPath, proposalId: "evo-test-001",
logPath) and assert that rollback restored the file content to the original text
associated with "evo-test-001" (use readFileSync(skillPath) and
readJsonl<EvolutionAuditEntry>(logPath) to verify the rolled_back entry
references proposal_id "evo-test-001").
```

</details>

</blockquote></details>
<details>
<summary>tests/evolution/validate-proposal.test.ts (1)</summary><blockquote>

`219-225`: **Trim conversational inline commentary in assertions.**

This section reads like scratch reasoning and makes the test harder to scan; condense to one concise rationale line.

<details>
<summary>🤖 Prompt for AI Agents</summary>

```
Verify each finding against the current code and only fix it if needed.

In `@tests/evolution/validate-proposal.test.ts` around lines 219 - 225, Replace
the multiline conversational comment block above the assertion with a single
concise rationale sentence describing the outcome; for example, collapse the
lines into one comment such as: "With 4 entries 5% = 0.2 so 1 regression >= 0.2
=> improved = false (conservative check for small eval sets)". Locate the
comment near the test/variable that computes "improved" (the assertion block in
validate-proposal.test.ts) and update it to that single-line explanatory
comment.
```

</details>

</blockquote></details>
<details>
<summary>tests/monitoring/integration.test.ts (1)</summary><blockquote>

`345-355`: **Replace `as unknown as WatchOptions` with `satisfies WatchOptions` to maintain type safety.**

The double-cast pattern suppresses TypeScript's structural checks in strict mode, creating unnecessary drift risk when the `WatchOptions` interface evolves. The test objects are already structurally compatible with the interface, so `satisfies` provides the same runtime behavior with stronger static validation.

Applies to lines 355, 408, and 474.
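For illustration, with a reduced `WatchOptions` (two fields here; the real interface has more):

```typescript
interface WatchOptions {
  telemetryPath: string;
  windowSessions: number;
}

// Double-cast: the typo below compiles silently, and reading
// bad.windowSessions at runtime would yield undefined.
const bad = {
  telemetryPath: "/tmp/t.jsonl",
  windowSessionz: 20,
} as unknown as WatchOptions;

// satisfies: same runtime object, but the same typo would be a compile error.
const good = {
  telemetryPath: "/tmp/t.jsonl",
  windowSessions: 20,
} satisfies WatchOptions;
```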

<details>
<summary>🤖 Prompt for AI Agents</summary>

```
Verify each finding against the current code and only fix it if needed.

In `@tests/monitoring/integration.test.ts` around lines 345 - 355, Replace the
double-cast "as unknown as WatchOptions" used when calling watch(...) with the
TypeScript "satisfies WatchOptions" operator to preserve structural type
checking; locate the watch({ ... } as unknown as WatchOptions) usages (the calls
that pass telemetryPath, skillLogPath, queryLogPath, auditLogPath,
windowSessions, regressionThreshold, autoRollback) and change them to watch({
... } satisfies WatchOptions) so the object remains unaltered at runtime but
gets compile-time validation against the WatchOptions interface; apply the same
replacement to the other two similar test calls in this file.
```

</details>

</blockquote></details>
<details>
<summary>tests/evolution/integration.test.ts (1)</summary><blockquote>

`265-394`: **Add an integration case for rollback’s audit-trail fallback path.**

Current rollback integration tests always restore from `.bak`. A dedicated case for “no backup + created audit entry contains `original_description:`” would protect the second restoration strategy from regressions.


<details>
<summary>✅ Suggested test addition</summary>

```diff
 describe("integration: deploy then rollback restores original SKILL.md", () => {
+  test("rollback restores from audit trail when .bak is missing", async () => {
+    const skillPath = join(tmpDir, "SKILL.md");
+    const logPath = join(tmpDir, "audit_fallback.jsonl");
+    writeFileSync(skillPath, SAMPLE_SKILL_MD, "utf-8");
+
+    const proposal = makeProposal({
+      proposal_id: "evo-audit-fallback-001",
+      skill_path: skillPath,
+    });
+    const validation = makeValidation({ proposal_id: proposal.proposal_id });
+
+    await deployProposal({ proposal, validation, skillPath, createPr: false });
+    rmSync(`${skillPath}.bak`, { force: true });
+
+    appendAuditEntry(
+      {
+        timestamp: new Date().toISOString(),
+        proposal_id: proposal.proposal_id,
+        action: "created",
+        details: `original_description:${proposal.original_description}`,
+      },
+      logPath,
+    );
+    appendAuditEntry(
+      {
+        timestamp: new Date().toISOString(),
+        proposal_id: proposal.proposal_id,
+        action: "deployed",
+        details: "Deployed proposal for test-skill",
+      },
+      logPath,
+    );
+
+    const rollbackResult = await rollback({
+      skillName: "test-skill",
+      skillPath,
+      proposalId: proposal.proposal_id,
+      logPath,
+    });
+
+    expect(rollbackResult.rolledBack).toBe(true);
+  });
```
</details>

<details>
<summary>🤖 Prompt for AI Agents</summary>

```
Verify each finding against the current code and only fix it if needed.

In `@tests/evolution/integration.test.ts` around lines 265 - 394, Tests currently
only exercise rollback restoring from a .bak file; add an integration test that
covers the fallback path where no .bak exists and the audit log contains a
deployed entry with details including "original_description:". Create a new test
(within the same describe block) that writes SKILL.md, deploys a proposal via
deployProposal, deliberately removes any .bak file (or avoid creating it),
appendAuditEntry with an entry whose details include the original_description
field and proposal_id, then call rollback(skillName, skillPath, logPath) and
assert rolledBack true, SKILL.md equals the original SAMPLE_SKILL_MD, .bak does
not exist, and the audit trail (readJsonl<EvolutionAuditEntry>(logPath))
contains a rolled_back entry for that proposal_id; use existing helpers
makeProposal, makeValidation, appendAuditEntry, readJsonl, and rollback to
locate the code paths.
```

</details>

</blockquote></details>
<details>
<summary>cli/selftune/evolution/evolve.ts (2)</summary><blockquote>

`242-247`: **Rounding may produce inconsistent counts in audit snapshots.**

`Math.round()` for calculating `passed` could result in slight inconsistencies. For example, with `total=3` and `pass_rate=0.5`, you'd get `passed=2` (from `Math.round(1.5)`) and `failed=1`, but the actual pass rate of 2/3 ≈ 0.667 doesn't match the original 0.5.

This is a display/audit concern and unlikely to affect behavior, but using `Math.floor` would be more conservative and consistent.



<details>
<summary>♻️ Optional: use Math.floor for consistency</summary>

```diff
       const evalSnapshot: EvalPassRate = {
         total: evalSet.length,
-        passed: Math.round(validation.after_pass_rate * evalSet.length),
-        failed: evalSet.length - Math.round(validation.after_pass_rate * evalSet.length),
+        passed: Math.floor(validation.after_pass_rate * evalSet.length),
+        failed: evalSet.length - Math.floor(validation.after_pass_rate * evalSet.length),
         pass_rate: validation.after_pass_rate,
       };
```
</details>

<details>
<summary>🤖 Prompt for AI Agents</summary>

```
Verify each finding against the current code and only fix it if needed.

In `@cli/selftune/evolution/evolve.ts` around lines 242 - 247, The audit snapshot
calculation for EvalPassRate (evalSnapshot) uses Math.round when computing
passed from validation.after_pass_rate and evalSet.length, which can create
inconsistent pass_rate vs counts; change the calculation for passed to use
Math.floor(validation.after_pass_rate * evalSet.length) and compute failed as
evalSet.length - passed so the recorded counts are conservative and consistent
with the original pass_rate field (update the passed and failed assignments in
the evalSnapshot construction that references validation.after_pass_rate and
evalSet.length).
```

</details>

---

`116-121`: **Consider logging audit write failures for observability.**

The fail-open pattern is appropriate for not breaking the pipeline, but silently swallowing the error loses visibility into persistent audit failures.



<details>
<summary>🔧 Suggested improvement</summary>

```diff
     try {
       _appendAuditEntry(entry);
     } catch {
       // Fail-open: audit write failures should not break the pipeline
+      console.warn(`[WARN] Failed to persist audit entry for proposal ${proposalId}`);
     }
```
</details>

<details>
<summary>🤖 Prompt for AI Agents</summary>

```
Verify each finding against the current code and only fix it if needed.

In `@cli/selftune/evolution/evolve.ts` around lines 116 - 121, The catch block
swallowing errors on _appendAuditEntry(entry) loses observability; change it to
catch the error (catch (err)) and log a concise diagnostic that includes the
error and relevant entry/context using the module's logger (e.g., logger.error
or processLogger.error) or console.error as a fallback, while keeping the
fail-open behavior (do not rethrow).
```

</details>

</blockquote></details>

</blockquote></details>

---

<details>
<summary>ℹ️ Review info</summary>

**Configuration used**: defaults

**Review profile**: CHILL

**Plan**: Pro

<details>
<summary>📥 Commits</summary>

Reviewing files that changed from the base of the PR and between 7294a0363f64e3e8f9221f966592ce12f537b76e and 38e612c3f73016181050642a68ccb2712b3c91a9.

</details>

<details>
<summary>📒 Files selected for processing (63)</summary>

* `AGENTS.md`
* `ARCHITECTURE.md`
* `Makefile`
* `PRD.md`
* `README.md`
* `cli/selftune/constants.ts`
* `cli/selftune/eval/hooks-to-evals.ts`
* `cli/selftune/evolution/audit.ts`
* `cli/selftune/evolution/deploy-proposal.ts`
* `cli/selftune/evolution/evolve.ts`
* `cli/selftune/evolution/extract-patterns.ts`
* `cli/selftune/evolution/propose-description.ts`
* `cli/selftune/evolution/rollback.ts`
* `cli/selftune/evolution/stopping-criteria.ts`
* `cli/selftune/evolution/validate-proposal.ts`
* `cli/selftune/grading/grade-session.ts`
* `cli/selftune/index.ts`
* `cli/selftune/init.ts`
* `cli/selftune/monitoring/watch.ts`
* `cli/selftune/observability.ts`
* `cli/selftune/types.ts`
* `cli/selftune/utils/jsonl.ts`
* `cli/selftune/utils/llm-call.ts`
* `cli/selftune/utils/schema-validator.ts`
* `docs/design-docs/evolution-pipeline.md`
* `docs/design-docs/index.md`
* `docs/design-docs/monitoring-pipeline.md`
* `docs/escalation-policy.md`
* `docs/exec-plans/active/agent-first-skill-restructure.md`
* `docs/exec-plans/tech-debt-tracker.md`
* `docs/golden-principles.md`
* `lint-architecture.ts`
* `skill/SKILL.md`
* `skill/Workflows/Doctor.md`
* `skill/Workflows/Evals.md`
* `skill/Workflows/Evolve.md`
* `skill/Workflows/Grade.md`
* `skill/Workflows/Ingest.md`
* `skill/Workflows/Initialize.md`
* `skill/Workflows/Rollback.md`
* `skill/Workflows/Watch.md`
* `skill/references/grading-methodology.md`
* `skill/references/invocation-taxonomy.md`
* `skill/references/logs.md`
* `tests/eval/hooks-to-evals.test.ts`
* `tests/evolution/audit.test.ts`
* `tests/evolution/deploy-proposal.test.ts`
* `tests/evolution/evolve.test.ts`
* `tests/evolution/extract-patterns.test.ts`
* `tests/evolution/integration.test.ts`
* `tests/evolution/propose-description.test.ts`
* `tests/evolution/rollback.test.ts`
* `tests/evolution/stopping-criteria.test.ts`
* `tests/evolution/types.test.ts`
* `tests/evolution/validate-proposal.test.ts`
* `tests/grading/grade-session.test.ts`
* `tests/init/init.test.ts`
* `tests/lint-architecture.test.ts`
* `tests/monitoring/integration.test.ts`
* `tests/monitoring/watch.test.ts`
* `tests/observability.test.ts`
* `tests/utils/llm-call.test.ts`
* `tests/utils/schema-validator.test.ts`

</details>

<details>
<summary>✅ Files skipped from review due to trivial changes (3)</summary>

* skill/Workflows/Evals.md
* skill/Workflows/Rollback.md
* docs/escalation-policy.md

</details>

<details>
<summary>🚧 Files skipped from review as they are similar to previous changes (3)</summary>

* cli/selftune/index.ts
* lint-architecture.ts
* tests/eval/hooks-to-evals.test.ts

</details>

</details>

<!-- This is an auto-generated comment by CodeRabbit for review status -->

Comment on lines +20 to +22
```ts
 * When skillName is provided, returns only entries whose `details` field
 * contains the skill name (case-insensitive match).
 */
```

@coderabbitai coderabbitai bot Feb 28, 2026


⚠️ Potential issue | 🟠 Major

Skill lookup via free-text details is ambiguous.

Filtering by substring in details can match unrelated skills (e.g., short names colliding), so getLastDeployedProposal() can return the wrong deployment.

Also applies to: 37-43
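An illustration of the collision (skill names hypothetical); a dedicated skill_name field on the audit entry would remove the ambiguity entirely, with a word-boundary match as an interim fallback:

```typescript
const entries = [
  { action: "deployed", details: "Deployed proposal for pptx-advanced" },
  { action: "deployed", details: "Deployed proposal for pptx" },
];

// Substring filter: a lookup for "pptx" matches both entries, so the
// "most recent deployed" result can belong to a different skill.
const bySubstring = entries.filter((e) =>
  e.details.toLowerCase().includes("pptx"),
);

// Word-boundary match (rejecting a trailing hyphen) disambiguates here.
const byBoundary = entries.filter((e) => /\bpptx\b(?!-)/i.test(e.details));
```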

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cli/selftune/evolution/audit.ts` around lines 20 - 22,
getLastDeployedProposal() currently filters proposals by doing a
case-insensitive substring match against the free-text `details` field, which is
ambiguous; change the lookup to match a structured skill identifier instead: if
proposals expose a dedicated property (e.g., `skillName`, `skillId`, or
`metadata.skill`), use that exact/canonicalized value for comparison
(case-normalize and compare equality or use word-boundary regex) rather than
substring-matching `details`; update the filter logic in
getLastDeployedProposal() and the other similar block that checks `details` (the
second occurrence around the same function) to use the structured field and fall
back to a strict word-boundary match only if no structured field exists.

Collaborator Author


Acknowledged — substring matching in details is a known limitation. A dedicated skill_name field in the audit entry schema would be cleaner. Tracked as future improvement. The details guard for undefined was fixed in cdac726.


@WellDunDun, understood — thanks for acknowledging the limitation and for fixing the undefined guard in cdac726. I've noted that a dedicated skill_name field is planned for the future. Looking forward to that improvement when it's prioritized!


🧠 Learnings used
Learnt from: CR
Repo: WellDunDun/reins PR: 0
File: AGENTS.md:0-0
Timestamp: 2026-02-23T20:20:15.569Z
Learning: Self-audit using `cd cli/reins && bun src/index.ts audit ../..`

Learnt from: WellDunDun
Repo: WellDunDun/selftune PR: 1
File: cli/selftune/ingestors/codex-rollout.ts:217-227
Timestamp: 2026-02-28T21:37:01.437Z
Learning: In the WellDunDun/selftune repository, code under cli/selftune and its subdirectories should assume Unix-style path conventions ("/" separators) and Windows-specific handling is not currently a goal. Apply this guidance to all TypeScript files within cli/selftune (e.g., paths, join/normalize logic, and any platform-dependent path handling) to maintain consistency with Unix conventions.

Comment on lines +29 to +70
## Output Format

```json
{
  "skill_name": "pptx",
  "window_size": 20,
  "sessions_evaluated": 18,
  "current_pass_rate": 0.89,
  "baseline_pass_rate": 0.92,
  "threshold": 0.1,
  "regression_detected": false,
  "delta": -0.03,
  "status": "healthy",
  "evaluated_at": "2026-02-28T14:00:00Z"
}
```

### Status Values

| Status | Meaning |
|--------|---------|
| `healthy` | Current pass rate is within threshold of baseline |
| `warning` | Pass rate dropped but within threshold |
| `regression` | Pass rate dropped below baseline minus threshold |
| `insufficient_data` | Not enough sessions in the window to evaluate |

## Parsing Instructions

### Check Regression Status

```bash
# Parse: .regression_detected (boolean)
# Parse: .status (string)
# Parse: .delta (float, negative = regression)
```

### Get Key Metrics

```bash
# Parse: .current_pass_rate vs .baseline_pass_rate
# Parse: .sessions_evaluated (should be close to .window_size)
```

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

```bash
#!/bin/bash
set -euo pipefail

# Find watch-related files in CLI
echo "=== Finding watch command files ==="
find cli -type f -name "*.ts" | grep -E "(watch|monitoring)" | head -20

# Check the main watch implementation
echo -e "\n=== Checking watch.ts implementation ==="
find cli -name "watch.ts" -type f -exec cat -n {} \;

# Look for output serialization in index files
echo -e "\n=== Checking index.ts for watch exports ==="
find cli -path "*/selftune/*" -name "index.ts" -type f -exec cat -n {} \;

# Search for JSON serialization and field names
echo -e "\n=== Searching for field definitions ==="
rg -n "current_pass_rate|window_size|sessions_evaluated|status|delta|evaluated_at" --type ts cli/ || echo "Documented fields not found"

echo -e "\n=== Searching for implementation field names ==="
rg -n "pass_rate|window_sessions|baseline_pass_rate|alert|rolledBack|recommendation" --type ts cli/ || echo "Implementation fields not found"

# Check if there's any remapping or transformation
echo -e "\n=== Searching for JSON.stringify or console output in watch context ==="
rg -n "JSON\.stringify|console\.log|console\.error" --type ts cli/selftune -A 2 -B 2 | head -50
```

Repository: WellDunDun/selftune

Length of output: 26238


Documentation output schema is incompatible with actual CLI implementation.

The documented JSON structure and parsing instructions are incorrect. The actual selftune watch command outputs a nested structure with snapshot, alert, rolledBack, and recommendation at the top level. The documented flat structure uses non-existent field names (window_size, current_pass_rate, status, delta, sessions_evaluated) and misses critical response fields. Parsing instructions referencing .status, .delta, and .sessions_evaluated will fail against actual output.

Correct output structure:

```json
{
  "snapshot": {
    "timestamp": "2026-02-28T14:00:00Z",
    "skill_name": "pptx",
    "window_sessions": 20,
    "pass_rate": 0.89,
    "false_negative_rate": 0.11,
    "by_invocation_type": {...},
    "regression_detected": false,
    "baseline_pass_rate": 0.92
  },
  "alert": null,
  "rolledBack": false,
  "recommendation": "..."
}
```

Update the Output Format section and Parsing Instructions to reflect the actual nested structure and field names from cli/selftune/monitoring/watch.ts.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@skill/Workflows/Watch.md` around lines 29 - 70, The documented Output Format
and Parsing Instructions are using a flat schema that doesn't match the real CLI
output; update the documentation to match the nested structure emitted by the
`selftune watch` implementation in `cli/selftune/monitoring/watch.ts`: replace
the flat keys (`window_size`, `current_pass_rate`, `status`, `delta`,
`sessions_evaluated`) with the actual top-level keys (`snapshot`, `alert`,
`rolledBack`, `recommendation`) and inside `snapshot` use `timestamp`,
`skill_name`, `window_sessions`, `pass_rate`, `false_negative_rate`,
`by_invocation_type`, `regression_detected`, and `baseline_pass_rate`; adjust
the Parsing Instructions to parse `.snapshot.pass_rate`,
`.snapshot.baseline_pass_rate`, `.snapshot.window_sessions`,
`.snapshot.regression_detected`, and top-level `.alert`/`.rolledBack` instead of
`.status`/`.delta` etc., and update examples and JSON sample in the Output
Format section to the corrected nested JSON shown in the review comment.

Comment on lines +152 to +163
```ts
// Use bun's mock.module to mock callLlm
mock.module("../../cli/selftune/utils/llm-call.js", () => ({
  ...mockModule,
  callLlm: async () => mockResponse,
  stripMarkdownFences: mockModule.stripMarkdownFences,
}));

// Re-import the module to pick up mocked dependencies
// Clear module cache and re-import
const { generateProposal: mockedGenerate } = await import(
  "../../cli/selftune/evolution/propose-description.js"
);
```

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

```bash
#!/bin/bash
set -euo pipefail

echo "=== 1. Search for module mocks in TypeScript test files ==="
rg -n 'mock\.module\s*\(' --type=ts tests/

echo
echo "=== 2. Search for mock.restore calls in TypeScript test files ==="
rg -n 'mock\.restore' --type=ts tests/

echo
echo "=== 3. Check the specific test file ==="
if [ -f "tests/evolution/propose-description.test.ts" ]; then
  echo "File exists. Showing full content:"
  cat -n "tests/evolution/propose-description.test.ts"
else
  echo "File not found at tests/evolution/propose-description.test.ts"
fi
```

Repository: WellDunDun/selftune

Length of output: 626


🏁 Script executed:

```bash
#!/bin/bash
set -euo pipefail

echo "=== Full content of tests/evolution/propose-description.test.ts ==="
cat -n tests/evolution/propose-description.test.ts

echo
echo "=== Also check validate-proposal.test.ts which also uses mock.module ==="
cat -n tests/evolution/validate-proposal.test.ts
```

Repository: WellDunDun/selftune

Length of output: 23808


Add mock.restore() cleanup to prevent test-state leakage.

The mock.module() call at line 153 is never restored. Module mocks persist globally and can cause subsequent tests to observe unexpected mocked behavior. Add the suggested afterEach hook to clean up:

Suggested fix
```diff
-import { describe, expect, mock, test } from "bun:test";
+import { afterEach, describe, expect, mock, test } from "bun:test";
+
+afterEach(() => {
+  mock.restore();
+});
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/evolution/propose-description.test.ts` around lines 152 - 163, The test
uses mock.module(...) to stub callLlm but never restores it, causing global mock
leakage; add an afterEach hook that calls mock.restore() to undo module mocks
after each spec. Locate the test file where mock.module is invoked (the test
that imports generateProposal from
"../../cli/selftune/evolution/propose-description.js") and add afterEach(() =>
mock.restore()) so module mocks are cleaned up between tests.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🧹 Nitpick comments (2)
tests/evolution/evolve.test.ts (2)

197-208: Minor inefficiency: temp directory created even when skillPath is overridden.

When overrides.skillPath is provided, createTempSkill() still creates an unused temp directory. Consider checking for override first:

♻️ Suggested improvement
```diff
 function makeOptions(overrides: Partial<EvolveOptions> = {}): EvolveOptions {
-  const { skillPath } = createTempSkill();
+  const skillPath = overrides.skillPath ?? createTempSkill().skillPath;
   return {
     skillName: "test-skill",
     skillPath,
     mode: "api",
     dryRun: false,
     confidenceThreshold: 0.6,
     maxIterations: 3,
     ...overrides,
   };
 }
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/evolution/evolve.test.ts` around lines 197 - 208, The helper
makeOptions currently always calls createTempSkill(), which creates a temp
directory even when overrides.skillPath is provided; change makeOptions to first
check if overrides.skillPath is present and only call createTempSkill() when it
is not, using the provided overrides.skillPath as skillPath otherwise, and
ensure returned EvolveOptions uses that resolved skillPath; refer to the
makeOptions function, the overrides parameter and createTempSkill symbol to
locate and update the logic.

330-342: Consider using tmpdir() for portability.

The hardcoded /tmp/ path at line 333 won't work on Windows. For consistency with the rest of the file:

♻️ Suggested fix for portability
```diff
   test("missing SKILL.md returns error reason", async () => {
     const opts = makeOptions({
-      skillPath: "/tmp/nonexistent-skill-path/SKILL.md",
+      skillPath: join(tmpdir(), "nonexistent-skill-path", "SKILL.md"),
     });
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/evolution/evolve.test.ts` around lines 330 - 342, The test "missing
SKILL.md returns error reason" uses a hardcoded "/tmp/..." path which breaks on
Windows; update the test to build the skillPath using the system temporary
directory (e.g., call tmpdir()/os.tmpdir() and path.join) when constructing opts
via makeOptions, and add any required imports (os/tmpdir and path) at the top of
the test file so the call to evolve(opts, makeDeps()) uses a portable temporary
path instead of "/tmp/...".
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@tests/evolution/evolve.test.ts`:
- Around line 377-404: The test name claims the retry loop feeds the previous
failure reason into the next proposal but never asserts it; update the test
around mockGenerateProposal/capturedArgs to assert that the second (or
subsequent) call to mockGenerateProposal receives the failure context produced
by mockValidateProposal (e.g., inspect capturedArgs[1] for a field or object
that contains the failure reason produced on the first validation), or
alternatively rename the test to "retry loop succeeds on second attempt after
initial validation failure" if you prefer not to assert the propagated context;
reference mockGenerateProposal, capturedArgs, mockValidateProposal, and evolve
when making the change.

---

Nitpick comments:
In `@tests/evolution/evolve.test.ts`:
- Around line 197-208: The helper makeOptions currently always calls
createTempSkill(), which creates a temp directory even when overrides.skillPath
is provided; change makeOptions to first check if overrides.skillPath is present
and only call createTempSkill() when it is not, using the provided
overrides.skillPath as skillPath otherwise, and ensure returned EvolveOptions
uses that resolved skillPath; refer to the makeOptions function, the overrides
parameter and createTempSkill symbol to locate and update the logic.
- Around line 330-342: The test "missing SKILL.md returns error reason" uses a
hardcoded "/tmp/..." path which breaks on Windows; update the test to build the
skillPath using the system temporary directory (e.g., call tmpdir()/os.tmpdir()
and path.join) when constructing opts via makeOptions, and add any required
imports (os/tmpdir and path) at the top of the test file so the call to
evolve(opts, makeDeps()) uses a portable temporary path instead of "/tmp/...".

ℹ️ Review info

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 38e612c and 03ebd3b.

📒 Files selected for processing (2)
  • cli/selftune/evolution/evolve.ts
  • tests/evolution/evolve.test.ts
🚧 Files skipped from review as they are similar to previous changes (1)
  • cli/selftune/evolution/evolve.ts

Comment on lines +377 to +404
// 10. Retry feeds failure reason into subsequent proposal attempts
test("retry loop feeds failure reason back to next iteration", async () => {
let iteration = 0;
const capturedArgs: unknown[][] = [];

mockGenerateProposal.mockImplementation(async (...args: unknown[]) => {
capturedArgs.push(args);
iteration++;
return makeProposal({ confidence: 0.9 });
});

// First attempt fails validation, second succeeds
let validateCallCount = 0;
mockValidateProposal.mockImplementation(async () => {
validateCallCount++;
if (validateCallCount === 1) {
return makeValidationResult({ improved: false, net_change: -0.1 });
}
return makeValidationResult({ improved: true });
});

const opts = makeOptions({ maxIterations: 3 });
const result = await evolve(opts, makeDeps());

// Should have been called twice (first fails, second succeeds)
expect(capturedArgs.length).toBe(2);
expect(result.deployed).toBe(true);
});

⚠️ Potential issue | 🟡 Minor

Test name suggests verification that isn't performed.

The test "retry loop feeds failure reason back to next iteration" captures capturedArgs but never asserts that subsequent calls actually receive failure context from previous iterations. Currently it only verifies:

  • generateProposal was called twice
  • Deployment eventually succeeded

To match the test description, consider adding assertions on capturedArgs contents to verify failure feedback is propagated, or rename the test to reflect what it actually verifies (e.g., "retry loop succeeds on second attempt after initial validation failure").

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/evolution/evolve.test.ts` around lines 377 - 404, The test name claims
the retry loop feeds the previous failure reason into the next proposal but
never asserts it; update the test around mockGenerateProposal/capturedArgs to
assert that the second (or subsequent) call to mockGenerateProposal receives the
failure context produced by mockValidateProposal (e.g., inspect capturedArgs[1]
for a field or object that contains the failure reason produced on the first
validation), or alternatively rename the test to "retry loop succeeds on second
attempt after initial validation failure" if you prefer not to assert the
propagated context; reference mockGenerateProposal, capturedArgs,
mockValidateProposal, and evolve when making the change.
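The missing assertion can be sketched against a synchronous stand-in for the retry loop (the real evolve() is async and its generateProposal/validateProposal signatures may differ — the failure-context argument here is an assumption about how the feedback is threaded through):

```typescript
// Hypothetical stand-in for the evolve() retry loop: the generator receives the
// previous validation failure reason, which is what the test should assert on.
type Validation = { improved: boolean; reason?: string };

function retryLoop(
  generate: (failureContext?: string) => string,
  validate: (proposal: string) => Validation,
  maxIterations: number,
): boolean {
  let failureContext: string | undefined;
  for (let i = 0; i < maxIterations; i++) {
    const proposal = generate(failureContext);
    const result = validate(proposal);
    if (result.improved) return true;
    failureContext = result.reason; // feed the failure back into the next attempt
  }
  return false;
}

// Capture arguments the way the test's mockGenerateProposal does.
const capturedArgs: (string | undefined)[] = [];
let validateCalls = 0;

const deployed = retryLoop(
  (ctx) => {
    capturedArgs.push(ctx);
    return "proposal";
  },
  () => {
    validateCalls++;
    return validateCalls === 1
      ? { improved: false, reason: "net_change -0.1" }
      : { improved: true };
  },
  3,
);
```

The assertion the review asks for is then `capturedArgs[1]` containing the first validation's failure reason, not just a call count.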

- Fix CLI router to use exported cliMain() pattern instead of relying on
  import.meta.main (which is false for dynamically imported modules)
- Add source field to session-stop and skill-eval hook telemetry records
- Guard audit.ts details field against undefined values
- Add record.triggered check in extract-patterns.ts
- Sanitize git branch names in deploy-proposal.ts and surface git errors
- Validate agent type and LLM mode enum casts in init.ts
- Harden grade-session.ts printSummary against missing fields
- Wrap statSync in try/catch for broken symlinks in codex-wrapper.ts
- Fix errors counting (const→let) and add --since date validation in
  opencode-ingest.ts
- Add hook_settings health check in observability.ts
- Add Owner column to tech-debt-tracker.md
- Make PRD.md a relative link in product-specs/index.md
- Update tests to match new behavior (4 hook checks, flexible passing)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@coderabbitai coderabbitai bot left a comment
Actionable comments posted: 11

♻️ Duplicate comments (6)
cli/selftune/evolution/extract-patterns.ts (1)

56-66: ⚠️ Potential issue | 🟠 Major

Cluster merging is still order-dependent and misses transitive single-link merges.

At lines 56-66, the loop stops at the first matching cluster. If a query matches multiple clusters, those clusters are never merged, so the extracted patterns can vary with input order.

🔧 Proposed fix
 export function clusterQueries(queries: string[], threshold = 0.3): string[][] {
   if (queries.length === 0) return [];

   const clusters: string[][] = [];

   for (const query of queries) {
-    let merged = false;
-
-    for (const cluster of clusters) {
-      // Single-linkage: if ANY member has similarity >= threshold, add to cluster
-      for (const member of cluster) {
-        if (computeQuerySimilarity(query, member) >= threshold) {
-          cluster.push(query);
-          merged = true;
-          break;
-        }
-      }
-      if (merged) break;
-    }
-
-    if (!merged) {
+    const matchingClusterIndexes: number[] = [];
+    for (let i = 0; i < clusters.length; i++) {
+      const cluster = clusters[i];
+      if (cluster.some((member) => computeQuerySimilarity(query, member) >= threshold)) {
+        matchingClusterIndexes.push(i);
+      }
+    }
+
+    if (matchingClusterIndexes.length === 0) {
       clusters.push([query]);
+      continue;
+    }
+
+    const [targetIndex, ...toMerge] = matchingClusterIndexes;
+    clusters[targetIndex].push(query);
+
+    for (const mergeIndex of toMerge.reverse()) {
+      clusters[targetIndex].push(...clusters[mergeIndex]);
+      clusters.splice(mergeIndex, 1);
     }
   }

   return clusters;
 }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cli/selftune/evolution/extract-patterns.ts` around lines 56 - 66, The current
loop in extract-patterns.ts stops at the first cluster match which makes
clustering order-dependent and misses transitive merges; change the logic around
clusters/cluster/computeQuerySimilarity/query/threshold so that for each query
you collect all clusters whose any member has similarity >= threshold, then
merge those matching clusters together (e.g., combine their members into a
single cluster, remove the originals) and add the query to that merged cluster;
ensure this runs for every query (or iterate until no further merges) so
transitive single-link merges are resolved rather than stopping after the first
matched cluster.
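The merge-all behavior can be demonstrated end to end. This sketch assumes Jaccard similarity over whitespace tokens (the PR says "Jaccard similarity clustering"; the exact tokenization in extract-patterns.ts may differ):

```typescript
// Order-independent single-linkage clustering: when a query matches several
// existing clusters, those clusters are merged instead of stopping at the first hit.
function computeQuerySimilarity(a: string, b: string): number {
  const ta = new Set(a.toLowerCase().split(/\s+/).filter(Boolean));
  const tb = new Set(b.toLowerCase().split(/\s+/).filter(Boolean));
  const intersection = [...ta].filter((t) => tb.has(t)).length;
  const union = new Set([...ta, ...tb]).size;
  return union === 0 ? 0 : intersection / union;
}

function clusterQueries(queries: string[], threshold = 0.3): string[][] {
  const clusters: string[][] = [];
  for (const query of queries) {
    // Collect ALL clusters with a member above the threshold, not just the first.
    const matching: number[] = [];
    for (let i = 0; i < clusters.length; i++) {
      if (clusters[i].some((m) => computeQuerySimilarity(query, m) >= threshold)) {
        matching.push(i);
      }
    }
    if (matching.length === 0) {
      clusters.push([query]);
      continue;
    }
    // Merge every matching cluster into the first one (transitive single-linkage).
    const [target, ...rest] = matching;
    clusters[target].push(query);
    for (const idx of rest.reverse()) {
      clusters[target].push(...clusters[idx]);
      clusters.splice(idx, 1);
    }
  }
  return clusters;
}
```

With "a b" and "c d" already in separate clusters, the later query "b c" bridges both (similarity 1/3 to each), so all three end up in one cluster regardless of input order.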
cli/selftune/init.ts (1)

104-107: ⚠️ Potential issue | 🟡 Minor

Invalid --llm-mode values are silently ignored.

A typo currently falls back to auto-detection instead of surfacing an input error. Failing fast here avoids confusing init output.

🔧 Proposed fix
   const validModes = ["agent", "api"] as const;
+  if (modeOverride && !validModes.includes(modeOverride as (typeof validModes)[number])) {
+    throw new Error(`Invalid --llm-mode value: "${modeOverride}". Expected one of: agent, api`);
+  }
   const resolvedMode =
-    modeOverride && validModes.includes(modeOverride as (typeof validModes)[number])
-      ? (modeOverride as "agent" | "api")
-      : undefined;
+    modeOverride ? (modeOverride as "agent" | "api") : undefined;
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cli/selftune/init.ts` around lines 104 - 107, The code silently ignores
invalid mode overrides by setting resolvedMode to undefined; instead validate
modeOverride against validModes and fail fast: if modeOverride is provided but
not included in validModes (check the modeOverride and validModes variables and
the resolvedMode assignment), throw or return a clear user-facing error
indicating the invalid --llm-mode value and listing allowed values (e.g.,
"agent" | "api") so typos don't fall back to auto-detection.
cli/selftune/observability.ts (1)

77-89: ⚠️ Potential issue | 🟠 Major

Hook installation check is validating bundled files, not the user’s active hooks directory.

This can report a pass even when the repo or user hooks are not installed, making the doctor health report a false positive.

🔧 Proposed fix
+import { homedir } from "node:os";
+import { spawnSync } from "node:child_process";

+function resolveHooksDir(): string | null {
+  const root = spawnSync("git", ["rev-parse", "--show-toplevel"], { encoding: "utf-8" })
+    .stdout?.trim() ?? "";
+  if (!root) return null;
+  const hooksPath =
+    spawnSync("git", ["config", "--get", "core.hooksPath"], { encoding: "utf-8" })
+      .stdout?.trim() || ".git/hooks";
+  return join(root, hooksPath);
+}

 export function checkHookInstallation(): HealthCheck[] {
   const checks: HealthCheck[] = [];
+  const hooksDir = resolveHooksDir();

   for (const hook of HOOK_FILES) {
     const check: HealthCheck = { name: `hook_${hook}`, path: hook, status: "pass", message: "" };
-    const hookPath = join(import.meta.dir, "hooks", hook);
+    const hookPath = hooksDir ? join(hooksDir, hook) : "";
     if (existsSync(hookPath)) {
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cli/selftune/observability.ts` around lines 77 - 89, The health check is
verifying bundled hook files under import.meta.dir, causing false positives;
change the hook existence check to look at the repository's active hooks
directory (e.g., resolve the repo root from process.cwd() or use `git rev-parse
--show-toplevel` and then join(root, ".git", "hooks", hook)) instead of
join(import.meta.dir, "hooks", hook). Update the symbol `hookPath` to point at
the repo hooks location, keep the same HealthCheck object (`check`,
`HOOK_FILES`, `checks.push(check)`), and preserve the existing pass/fail
messages but ensure they reflect the active hooks directory presence.
cli/selftune/evolution/rollback.ts (2)

127-131: ⚠️ Potential issue | 🔴 Critical

Do not overwrite entire SKILL.md with original_description.

The fallback currently writes the audit description payload directly over the file, which can corrupt the document structure.

🔧 Proposed fix
+function restoreDescriptionSection(skillMd: string, description: string): string {
+  // Lookahead keeps the next heading unconsumed; a bare `$` alternative with
+  // the /m flag would match at the first line end and truncate multi-line
+  // descriptions.
+  const section = /(##\s+Description\s*\n)([\s\S]*?)(?=\n##?\s|$(?![\s\S]))/;
+  if (section.test(skillMd)) {
+    return skillMd.replace(section, `$1${description.trim()}\n`);
+  }
+  }
+  return `${skillMd.trimEnd()}\n\n## Description\n${description.trim()}\n`;
+}
@@
   const originalFromAudit = findOriginalFromAudit(targetProposalId, logPath);
   if (originalFromAudit) {
-    writeFileSync(skillPath, originalFromAudit, "utf-8");
+    const currentSkillMd = readFileSync(skillPath, "utf-8");
+    const restoredSkillMd = restoreDescriptionSection(currentSkillMd, originalFromAudit);
+    writeFileSync(skillPath, restoredSkillMd, "utf-8");
@@
-      restoredDescription: originalFromAudit,
+      restoredDescription: restoredSkillMd,

Also applies to: 141-145

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cli/selftune/evolution/rollback.ts` around lines 127 - 131, The code in
rollback.ts currently calls writeFileSync(skillPath, originalFromAudit, "utf-8")
when originalFromAudit is truthy (via findOriginalFromAudit), which overwrites
the entire SKILL.md and can corrupt its structure; modify the logic to load the
existing SKILL.md content, parse/locate the description section (or the specific
frontmatter/section you intend to restore), replace only that description
payload with originalFromAudit, and then write the updated content back to disk
(update the code around findOriginalFromAudit, originalFromAudit, writeFileSync
and skillPath so only the description block is replaced rather than the whole
file).
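An alternative line-based sketch of the section-scoped restore, easy to unit-test (it assumes SKILL.md uses a "## Description" level-2 heading; the real file layout may differ):

```typescript
// Restore only the body of the "## Description" section, preserving the rest
// of the document instead of overwriting the entire file.
function restoreDescriptionSection(skillMd: string, description: string): string {
  const lines = skillMd.split("\n");
  const start = lines.findIndex((l) => /^##\s+Description\s*$/.test(l));
  if (start === -1) {
    // No section found: append one instead of overwriting the whole file.
    return `${skillMd.trimEnd()}\n\n## Description\n${description.trim()}\n`;
  }
  let end = lines.length;
  for (let i = start + 1; i < lines.length; i++) {
    if (/^#{1,2}\s/.test(lines[i])) { end = i; break; } // next heading ends the section
  }
  return [...lines.slice(0, start + 1), description.trim(), "", ...lines.slice(end)].join("\n");
}
```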

88-102: ⚠️ Potential issue | 🟠 Major

Only use .bak when rolling back the latest deployed proposal.

The backup file reflects the latest pre-deploy state, so applying it for an explicitly older proposalId can restore the wrong revision.

🔧 Proposed fix
 export async function rollback(options: RollbackOptions): Promise<RollbackResult> {
   const { skillName, skillPath, proposalId, logPath } = options;
+  const lastDeployed = getLastDeployedProposal(skillName, logPath);
@@
   } else {
     // Use the most recent deployed proposal
-    const lastDeployed = getLastDeployedProposal(skillName, logPath);
     if (!lastDeployed) {
       return noRollback(`No deployed proposal found for skill "${skillName}"`);
     }
     targetProposalId = lastDeployed.proposal_id;
   }
@@
   const backupPath = `${skillPath}.bak`;
-  if (existsSync(backupPath)) {
+  const canUseBackup = lastDeployed?.proposal_id === targetProposalId;
+  if (canUseBackup && existsSync(backupPath)) {

Also applies to: 104-106

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cli/selftune/evolution/rollback.ts` around lines 88 - 102, The rollback
currently applies the `.bak` backup unconditionally which can restore the wrong
revision for an explicit older proposalId; change the logic in rollback.ts
(around the code using findDeployedEntry and getLastDeployedProposal) so that
`.bak` is used only when targetProposalId was obtained via
getLastDeployedProposal (i.e., when no proposalId was supplied), and when
proposalId is provided use the revision/state recorded in the found entry (from
findDeployedEntry) to restore that specific revision instead of reading the
`.bak` file; apply the same conditional change to the later block referenced
(lines ~104-106) where backups/revisions are chosen.
cli/selftune/evolution/audit.ts (1)

23-31: ⚠️ Potential issue | 🟠 Major

Use deterministic skill matching instead of free-text substring matching.

Filtering by details.includes(skillName) is ambiguous and can match unrelated entries, returning the wrong deployed proposal.

🔧 Proposed fix
+function escapeRegExp(value: string): string {
+  return value.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
+}
+
+function matchesSkill(entry: EvolutionAuditEntry, skillName: string): boolean {
+  const normalized = skillName.trim().toLowerCase();
+  const structured = (entry as EvolutionAuditEntry & { skill_name?: string }).skill_name;
+  if (typeof structured === "string" && structured.trim()) {
+    return structured.trim().toLowerCase() === normalized;
+  }
+  const details = typeof entry.details === "string" ? entry.details : "";
+  return new RegExp(`\\b${escapeRegExp(normalized)}\\b`, "i").test(details);
+}
+
 export function readAuditTrail(
   skillName?: string,
   logPath: string = EVOLUTION_AUDIT_LOG,
 ): EvolutionAuditEntry[] {
   const entries = readJsonl<EvolutionAuditEntry>(logPath);
   if (!skillName) return entries;
-  const needle = skillName.toLowerCase();
-  return entries.filter((e) => (e.details ?? "").toLowerCase().includes(needle));
+  return entries.filter((e) => matchesSkill(e, skillName));
 }

Also applies to: 37-43
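The word-boundary fallback from the fix above can be demonstrated in isolation (a structured skill_name field remains the preferable path; the detail strings here are illustrative):

```typescript
// Word-boundary matching avoids the substring false positives that
// details.includes(skillName) produces.
function escapeRegExp(value: string): string {
  return value.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

function mentionsSkill(details: string, skillName: string): boolean {
  return new RegExp(`\\b${escapeRegExp(skillName)}\\b`, "i").test(details);
}
```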

🧹 Nitpick comments (9)
cli/selftune/hooks/session-stop.ts (1)

19-22: Return type includes null but function never returns null.

The docstring mentions "or null if skipped" but there's no skip condition—the function always returns a record. If skip logic is planned, this is fine; otherwise, consider simplifying the return type.

♻️ Proposed simplification
 export function processSessionStop(
   payload: StopPayload,
   logPath: string = TELEMETRY_LOG,
-): SessionTelemetryRecord | null {
+): SessionTelemetryRecord {

And update the docstring accordingly.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cli/selftune/hooks/session-stop.ts` around lines 19 - 22, The declared return
type of processSessionStop (SessionTelemetryRecord | null) and its docstring
claim it may return null, but the implementation always returns a record; either
implement the intended "skip" condition or simplify the signature and docs:
remove the | null from the return type (use SessionTelemetryRecord) and update
the docstring to remove the "or null if skipped" wording, or alternatively add
the skip logic inside processSessionStop (checking payload, logPath, or other
conditions) and explicitly return null in that branch; reference
processSessionStop, StopPayload, SessionTelemetryRecord and TELEMETRY_LOG when
making the change.
cli/selftune/evolution/deploy-proposal.ts (1)

134-150: Consider adding a timeout for subprocess execution.

Git/GH commands could hang indefinitely (e.g., network issues during git push). Without a timeout, deployProposal may never resolve.

♻️ Proposed enhancement with timeout
-async function runCommand(args: string[], cwd?: string): Promise<string> {
+async function runCommand(args: string[], cwd?: string, timeoutMs = 60_000): Promise<string> {
   const proc = Bun.spawn(args, {
     cwd,
     stdout: "pipe",
     stderr: "pipe",
   });

-  const stdout = await new Response(proc.stdout).text();
-  const stderr = await new Response(proc.stderr).text();
-  const exitCode = await proc.exited;
+  let timer: ReturnType<typeof setTimeout> | undefined;
+  const timeoutPromise = new Promise<never>((_, reject) => {
+    timer = setTimeout(() => {
+      proc.kill();
+      reject(new Error(`Command timed out after ${timeoutMs}ms: ${args.join(" ")}`));
+    }, timeoutMs);
+  });
+
+  // clearTimeout prevents a dangling timer from keeping the event loop alive
+  // after the command completes normally.
+  const [stdout, stderr, exitCode] = await Promise.race([
+    Promise.all([
+      new Response(proc.stdout).text(),
+      new Response(proc.stderr).text(),
+      proc.exited,
+    ]),
+    timeoutPromise,
+  ]).finally(() => clearTimeout(timer));

   if (exitCode !== 0) {
     throw new Error(`Command failed (exit ${exitCode}): ${args.join(" ")}\n${stderr}`);
   }

   return stdout.trim();
 }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cli/selftune/evolution/deploy-proposal.ts` around lines 134 - 150, The
runCommand function can hang; add a configurable timeout to abort the
subprocess: create an AbortController and pass its signal to Bun.spawn (or, if
Bun.spawn doesn't accept it, call proc.kill on timeout), start a timer (e.g.,
30s or passed-in ms) that aborts/kills the process when expired, ensure you
clear the timer on normal exit, catch the abort/timeout case and throw a
descriptive Error including the command, timeout value, and any captured
stderr/stdout, and wire the new timeout parameter into any callers such as
deployProposal so long-running git commands get a reasonable deadline.
cli/selftune/grading/grade-session.ts (2)

311-312: Validate --eval-id is a valid number.

If the user passes a non-numeric string (e.g., --eval-id abc), Number() returns NaN, resulting in a confusing error: "Eval ID NaN not found".

♻️ Suggested validation
   if (values["evals-json"] && values["eval-id"] != null) {
-    expectations = loadExpectationsFromEvalsJson(values["evals-json"], Number(values["eval-id"]));
+    const evalId = Number(values["eval-id"]);
+    if (Number.isNaN(evalId)) {
+      console.error("[ERROR] --eval-id must be a valid number");
+      process.exit(1);
+    }
+    expectations = loadExpectationsFromEvalsJson(values["evals-json"], evalId);
   } else if (values.expectations?.length) {
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cli/selftune/grading/grade-session.ts` around lines 311 - 312, The code
currently converts values["eval-id"] with Number() and may pass NaN into
loadExpectationsFromEvalsJson; validate that values["eval-id"] is a finite
integer before using it: check using Number.isFinite(Number(values["eval-id"]))
(or parseInt and Number.isInteger) and if invalid throw or log a clear error
like "Invalid --eval-id: must be a number" and exit; only call
loadExpectationsFromEvalsJson(values["evals-json"], Number(values["eval-id"]))
when the parsed value is a valid number (reference: values["eval-id"], Number(),
loadExpectationsFromEvalsJson, expectations).

97-103: Add defensive error handling for file read and JSON parse.

If evalsJsonPath doesn't exist or contains invalid JSON, the error messages will be cryptic. Consider wrapping in try/catch for clearer diagnostics.

♻️ Suggested improvement
 export function loadExpectationsFromEvalsJson(evalsJsonPath: string, evalId: number): string[] {
-  const data = JSON.parse(readFileSync(evalsJsonPath, "utf-8"));
+  let data: { evals?: Array<{ id: number; expectations?: string[] }> };
+  try {
+    data = JSON.parse(readFileSync(evalsJsonPath, "utf-8"));
+  } catch (err) {
+    throw new Error(`Failed to read/parse ${evalsJsonPath}: ${err}`);
+  }
   for (const ev of data.evals ?? []) {
     if (ev.id === evalId) return ev.expectations ?? [];
   }
   throw new Error(`Eval ID ${evalId} not found in ${evalsJsonPath}`);
 }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cli/selftune/grading/grade-session.ts` around lines 97 - 103, Wrap the file
read and JSON.parse logic in loadExpectationsFromEvalsJson with a try/catch to
provide clearer diagnostics: catch filesystem (readFileSync) and parse errors,
include evalsJsonPath and the original error message in the thrown/logged error,
and preserve behavior of returning expectations for a matching ev.id
(ev.expectations ?? []) while still throwing a clear "Eval ID not found" error
if no matching id is present; ensure the function references
loadExpectationsFromEvalsJson and the ev/expectations handling remain unchanged.
cli/selftune/ingestors/codex-rollout.ts (2)

282-287: Consider passing logType to enable schema validation.

The appendJsonl utility supports an optional logType parameter for runtime validation (see cli/selftune/utils/jsonl.ts lines 35-50). Currently, records are written without validation.

🛠️ Suggested enhancement
-    appendJsonl(queryLogPath, queryRecord);
+    appendJsonl(queryLogPath, queryRecord, "query");
   }

   // Write telemetry (everything except query)
   const { query: _q, ...telemetry } = parsed;
-  appendJsonl(telemetryLogPath, telemetry);
+  appendJsonl(telemetryLogPath, telemetry, "telemetry");
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cli/selftune/ingestors/codex-rollout.ts` around lines 282 - 287, The
appendJsonl calls write query and telemetry records without schema validation;
update the two calls to pass the appropriate logType to enable runtime
validation (use the same log type names used by cli/selftune/utils/jsonl.ts) —
specifically modify the call that writes queryRecord (where
appendJsonl(queryLogPath, queryRecord) is invoked) to include its logType, and
modify the telemetry write (appendJsonl(telemetryLogPath, telemetry)) to include
the telemetry logType so both writes use appendJsonl(path, record, logType) and
trigger validation.

66-68: Consider handling broken symlinks in directory traversal.

statSync throws on broken symlinks. While rare, this could abort ingestion mid-way through a valid sessions directory.

🛠️ Suggested defensive fix
   for (const yearEntry of readdirSync(sessionsDir).sort()) {
     const yearDir = join(sessionsDir, yearEntry);
-    if (!statSync(yearDir).isDirectory()) continue;
+    try {
+      if (!statSync(yearDir).isDirectory()) continue;
+    } catch {
+      continue;
+    }
     const year = Number.parseInt(yearEntry, 10);

Apply similar wrapping for monthDir and dayDir checks.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cli/selftune/ingestors/codex-rollout.ts` around lines 66 - 68, The directory
traversal uses statSync on yearDir (and likewise monthDir/dayDir) which will
throw on broken symlinks; wrap each statSync call in a try/catch and skip the
entry on error so traversal continues (e.g., when iterating
readdirSync(sessionsDir) for yearEntry, wrap statSync(yearDir) in try/catch and
continue on exception, and do the same for monthDir and dayDir checks in the
same loop), ensuring you still check isDirectory() when stat succeeds.
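The try/catch wrapping can be factored into a small helper so the year/month/day checks stay readable (helper name is illustrative):

```typescript
// Defensive stat that treats broken symlinks and missing paths as
// "not a directory" instead of throwing mid-traversal.
import { statSync } from "node:fs";
import { tmpdir } from "node:os";
import { join } from "node:path";

function safeIsDirectory(path: string): boolean {
  try {
    return statSync(path).isDirectory();
  } catch {
    return false; // ENOENT, broken symlink, permission error: skip the entry
  }
}
```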
cli/selftune/hooks/skill-eval.ts (1)

39-40: Harden file_path parsing with a runtime type guard.

payload.tool_input?.file_path is typed as unknown in PostToolUsePayload and comes from untrusted JSON.parse input (line 66). The type assertion as string bypasses TypeScript safety but does not prevent non-string values at runtime. If file_path is a number, object, or array, it will pass the type assertion and the ?? "" fallback (which only handles null/undefined), then crash in basename() or dirname() with a TypeError.

Add a runtime type guard before calling extractSkillName:

-  const filePath = (payload.tool_input?.file_path as string) ?? "";
-  const skillName = extractSkillName(filePath);
+  const filePath = payload.tool_input?.file_path;
+  if (typeof filePath !== "string" || !filePath) return null;
+  const skillName = extractSkillName(filePath);
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cli/selftune/hooks/skill-eval.ts` around lines 39 - 40,
payload.tool_input?.file_path is typed unknown and currently forced to a string
with "as string", which can crash extractSkillName (and underlying
basename/dirname) if the runtime value is not a string; add a runtime type
guard: check typeof payload.tool_input?.file_path === 'string' and only assign
that to filePath (or fallback to ""), then call extractSkillName(filePath);
update the code around the filePath binding and the call site (referencing
payload.tool_input?.file_path, filePath, and extractSkillName) so non-string
values safely fall back to "" and avoid TypeErrors when parsing paths.
cli/selftune/ingestors/codex-wrapper.ts (1)

321-323: Handle top-level cliMain() rejection explicitly.

cliMain() is an async function that can throw unhandled rejections. Calling it without a .catch() handler at the entry point (line 322) leaves the process without explicit error handling for non-ENOENT failures.

💡 Suggested fix
 if (import.meta.main) {
-  cliMain();
+  cliMain().catch((e) => {
+    process.stderr.write(
+      `[codex-wrapper] Unexpected failure: ${e instanceof Error ? e.message : String(e)}\n`,
+    );
+    process.exit(1);
+  });
 }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cli/selftune/ingestors/codex-wrapper.ts` around lines 321 - 323, The entry
point calls the async cliMain() without handling rejections; wrap the call in a
Promise rejection handler by changing the import.meta.main block to call
cliMain().catch(err => { log the error (e.g., console.error or
processLogger.error) and exit with a non-zero code for unexpected failures);
optionally handle expected ENOENT specially (e.g., print a friendly message and
exit with a specific code) before calling process.exit(1). Ensure the handler
references cliMain and the import.meta.main entry so top-level rejections are
not unhandled.
cli/selftune/observability.ts (1)

202-207: Config health should validate semantic fields, not only JSON syntax.

A config with invalid agent_type/llm_mode currently passes doctor. Consider validating enum values and required keys before setting pass.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cli/selftune/observability.ts` around lines 202 - 207, The current try block
only verifies JSON parse success before setting check.status to "pass"; change
it to also validate the parsed SelftuneConfig's required fields (agent_type and
llm_mode) against allowed enum/value sets before marking pass. After
JSON.parse(…) (the code handling SELFTUNE_CONFIG_PATH and the parsed config
variable), verify config.agent_type and config.llm_mode are present and match
allowed values (create or reference allowedAgentTypes/allowedLlmModes arrays or
enums), and if any validation fails set check.status="fail" with a clear
check.message describing the missing/invalid key(s); only set
check.status="pass" and set check.message=`agent_type=…, llm_mode=…` when both
fields validate. Ensure JSON parse errors remain handled by the catch and
produce a failing check with an explanatory message.
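A sketch of the semantic validation (the allowed agent types are inferred from the PR description — Claude Code, Codex, OpenCode — and may not match the exact strings init.ts writes):

```typescript
// Validate enum fields of the parsed config, not just JSON syntax.
type SelftuneConfig = { agent_type?: string; llm_mode?: string };

const allowedAgentTypes = ["claude-code", "codex", "opencode"];
const allowedLlmModes = ["agent", "api"];

function validateConfig(config: SelftuneConfig): string[] {
  const errors: string[] = [];
  if (!config.agent_type || !allowedAgentTypes.includes(config.agent_type)) {
    errors.push(`invalid agent_type: ${JSON.stringify(config.agent_type)}`);
  }
  if (!config.llm_mode || !allowedLlmModes.includes(config.llm_mode)) {
    errors.push(`invalid llm_mode: ${JSON.stringify(config.llm_mode)}`);
  }
  return errors;
}
```

The doctor check would set status to "fail" with the returned messages whenever the array is non-empty.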
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@cli/selftune/evolution/deploy-proposal.ts`:
- Around line 163-165: The backup code currently uses a static suffix and will
overwrite previous backups; change the logic that builds backupPath (where
copyFileSync(skillPath, backupPath) is called) to append a unique timestamp or
UUID (e.g., `${skillPath}.${new Date().toISOString().replace(/[:.]/g,'-')}.bak`
or use a short uuid) so each deploy creates a distinct backup file; ensure the
new name is used in the call to copyFileSync and optionally add a small
retention step (e.g., keep only N latest backups) if needed.
- Around line 115-128: sanitizeForGitRef can return an empty string for names
made entirely of invalid chars; update either sanitizeForGitRef or
makeBranchName so the branch segment is never empty: call
sanitizeForGitRef(skillName) in makeBranchName, and if it yields an empty string
replace it with a safe fallback like "untitled" (or a sanitized prefix), then
build the branch as `${prefix}/${safeName}-${timestamp}` ensuring there is no
extra leading hyphen or empty segment; reference sanitizeForGitRef and
makeBranchName when applying the change.
- Around line 55-93: The function replaceDescription currently drops any content
that appears before the first "# " heading; preserve that preamble by capturing
lines.slice(0, headingIndex) (call it preHeading) and include it back when
rebuilding the string: if headingIndex === -1 keep existing behavior, otherwise
construct the output as preHeading + headingLine + descriptionBlock (+
afterSubHeading if subHeadingIndex !== -1), making sure to join preHeading with
a newline only when non-empty so spacing is preserved; update the return paths
in replaceDescription to include this preHeading segment.
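The non-empty-segment rule for branch names can be sketched as follows (the sanitization rules and the "selftune/evolve" prefix are illustrative, not the exact implementation in deploy-proposal.ts):

```typescript
// Sanitize a skill name for use as a git ref segment, never producing an
// empty segment.
function sanitizeForGitRef(name: string): string {
  return name
    .toLowerCase()
    .replace(/[^a-z0-9._-]+/g, "-") // collapse invalid runs to a single hyphen
    .replace(/^[-.]+|[-.]+$/g, ""); // avoid leading/trailing '-' or '.'
}

function makeBranchName(skillName: string, prefix = "selftune/evolve"): string {
  const safeName = sanitizeForGitRef(skillName) || "untitled"; // fallback for all-invalid names
  return `${prefix}/${safeName}`;
}
```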

In `@cli/selftune/grading/grade-session.ts`:
- Around line 228-233: The loop printing suggestion reasons assumes s.reason is
a string and can crash; change the console.log inside the for loop (where
feedback.suggestions is iterated) to defensively coerce s.reason to a string
like String(s.reason ?? "") before calling .slice(0, 100) so it matches the
defensive pattern used earlier on lines 221/223 and prevents crashes on
malformed grader output.

In `@cli/selftune/ingestors/codex-rollout.ts`:
- Around line 320-322: Validate the parsed `--since` value before passing it to
`findRolloutFiles`: when constructing `const since = values.since ? new
Date(values.since) : undefined;` check that `values.since` produces a valid Date
(e.g., `isNaN(since.getTime())`) and if invalid log a clear error and exit (or
throw) instead of silently using `undefined`; update the `const since` handling
in codex-rollout.ts to mirror the validation approach used in opencode-ingest.ts
so `findRolloutFiles(codexHome, since)` never receives a NaN date.

In `@cli/selftune/ingestors/codex-wrapper.ts`:
- Around line 139-149: The current detection uses substring matching on
textContent which yields false positives; update the check in the loop over
skillNames (where textContent is built from item.text and item.command and etype
is checked for "item.completed") to perform whole-word, case-insensitive
matching instead of includes; e.g., normalize textContent and skillName to lower
case and either split/tokenize textContent on non-word boundaries and check
membership, or build a word-boundary regex for each skillName and test against
textContent, then only push to skillsTriggered when a true whole-word match is
found and the etype and duplicate checks still pass.
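The whole-word matching suggested above can be sketched with a word-boundary regex; the escape pattern mirrors the one discussed later in this review, while the helper names are illustrative:

```typescript
// Escape regex metacharacters so skill names are matched literally.
function escapeRegExp(s: string): string {
  return s.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Whole-word, case-insensitive match instead of substring includes().
function matchesSkill(textContent: string, skillName: string): boolean {
  return new RegExp(`\\b${escapeRegExp(skillName)}\\b`, "i").test(textContent);
}
```

Unlike `includes`, this rejects embedded matches such as `"commit"` inside `"recommitted"`.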

In `@cli/selftune/ingestors/opencode-ingest.ts`:
- Line 84: The PRAGMA table_info query interpolates an untrusted table name
(variable name) into db.query, allowing SQL injection; update the code in
opencode-ingest.ts where cols is computed to either validate/sanitize the name
(e.g., allow only /^[A-Za-z0-9_]+$/) or safely quote the identifier before
interpolation by wrapping name in double quotes and escaping any internal double
quotes (replace " with ""), then use the sanitized/quoted identifier in
db.query(`PRAGMA table_info(${quotedName})`). Also consider alternatively
rejecting names not present in sqlite_master via a strict whitelist check before
calling db.query.
- Around line 157-158: The code interpolates sessionsTable (and messagesTable)
directly into SQL in the .query(...).all() call, which allows SQL injection
because those names come from sqlite_master; instead validate and constrain
table identifiers before using them: use getDbSchema or a whitelist to confirm
the table name exists, enforce a strict safe identifier pattern (e.g.
/^[A-Za-z_][A-Za-z0-9_]*$/) and/or compare against the list returned by
getDbSchema, then only after validation build the SQL; do not accept arbitrary
strings from sqlite_master, and use safe identifier quoting only after
validation when constructing the query for the session/message retrieval logic
(where sessionsTable/messagesTable are used).
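Both items above reduce to the same pattern: validate the identifier against a strict pattern first, and quote it defensively second. A sketch, using the `assertSafeIdentifier` name that the later commit message mentions (the exact signature is an assumption):

```typescript
// Allow only conventional SQL identifiers; reject everything else loudly.
function assertSafeIdentifier(name: string): string {
  if (!/^[A-Za-z_][A-Za-z0-9_]*$/.test(name)) {
    throw new Error(`Unsafe SQL identifier: ${JSON.stringify(name)}`);
  }
  return name;
}

// Defense in depth: double-quote and escape embedded quotes, per SQLite
// identifier-quoting rules, even after validation.
function quoteIdentifier(name: string): string {
  return `"${name.replace(/"/g, '""')}"`;
}
```

A validated, quoted name can then be interpolated into `PRAGMA table_info(...)` or `SELECT * FROM ...` without the injection risk the review flags.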

In `@cli/selftune/init.ts`:
- Around line 273-275: Replace the brittle entry-point detection that compares
import.meta.url and process.argv[1] (the isMain constant) with a portable
pattern: import fileURLToPath from node:url and set isMain to use (import.meta
as any).main if present, otherwise compare process.argv[1] to
fileURLToPath(import.meta.url); update the isMain initialization (referencing
isMain, import.meta.url, and process.argv[1]) to this fallback sequence so it
works on Bun, newer Node.js, and Node.js LTS versions.
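The fallback sequence above might look like this, assuming an ESM context; `import.meta.main` is set by Bun (and newer Node.js), and the `fileURLToPath` comparison covers runtimes without it:

```typescript
import { fileURLToPath } from "node:url";

// Portable entry-point detection: prefer import.meta.main where available,
// otherwise compare argv[1] against this module's resolved path.
const isMain: boolean =
  (import.meta as any).main ??
  (process.argv[1] !== undefined &&
    process.argv[1] === fileURLToPath(import.meta.url));
```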

In `@cli/selftune/monitoring/watch.ts`:
- Around line 285-290: The parsed numeric CLI inputs passed into watch()
(windowSessions and regressionThreshold) can be NaN or out-of-range and should
be validated before calling watch(); ensure values from values.window and
values.threshold are parsed and checked (use Number.parseInt/parseFloat results)
so windowSessions is a positive integer >= 1 and regressionThreshold is a finite
number within [0,1] (or your allowed range), and if validation fails log a clear
error and exit (or return) instead of invoking watch(); update the call sites
and any upstream handling in watch() / computeMonitoringSnapshot() to rely on
these validated variables.

In `@docs/product-specs/index.md`:
- Around line 11-16: Update the "## Adding a New Spec" instructions to remove
ambiguity about file placement: clarify whether PRD.md is a special-case at
repository root or whether new specs should go into docs/product-specs; adjust
step 1 to either explicitly say "Create a new markdown file in
docs/product-specs/ (PRD.md is a special case kept at repo root)" or "Create a
new markdown file in an appropriate location (e.g., docs/product-specs/; PRD.md
is at repo root)" and add a short note referencing PRD.md so readers know its
intended location.

---

Duplicate comments:
In `@cli/selftune/evolution/extract-patterns.ts`:
- Around line 56-66: The current loop in extract-patterns.ts stops at the first
cluster match which makes clustering order-dependent and misses transitive
merges; change the logic around
clusters/cluster/computeQuerySimilarity/query/threshold so that for each query
you collect all clusters whose any member has similarity >= threshold, then
merge those matching clusters together (e.g., combine their members into a
single cluster, remove the originals) and add the query to that merged cluster;
ensure this runs for every query (or iterate until no further merges) so
transitive single-link merges are resolved rather than stopping after the first
matched cluster.
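The merge-all-matches logic described above can be sketched end to end; `computeQuerySimilarity` here stands in for the repo's Jaccard implementation, so treat the details as assumptions:

```typescript
// Token-set Jaccard similarity (stand-in for the real implementation).
function computeQuerySimilarity(a: string, b: string): number {
  const ta = new Set(a.toLowerCase().split(/\s+/));
  const tb = new Set(b.toLowerCase().split(/\s+/));
  const inter = [...ta].filter((t) => tb.has(t)).length;
  const union = new Set([...ta, ...tb]).size;
  return union === 0 ? 0 : inter / union;
}

// Order-independent single-linkage: collect EVERY matching cluster, merge
// them, then add the query, so transitive links are resolved.
function clusterQueries(queries: string[], threshold: number): string[][] {
  const clusters: string[][] = [];
  for (const query of queries) {
    const matching = clusters.filter((c) =>
      c.some((m) => computeQuerySimilarity(m, query) >= threshold),
    );
    if (matching.length === 0) {
      clusters.push([query]);
      continue;
    }
    const merged = matching.flat().concat(query);
    for (const m of matching) clusters.splice(clusters.indexOf(m), 1);
    clusters.push(merged);
  }
  return clusters;
}
```

A query similar to members of two previously separate clusters now bridges them into one, instead of landing in whichever cluster was scanned first.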

In `@cli/selftune/evolution/rollback.ts`:
- Around line 127-131: The code in rollback.ts currently calls
writeFileSync(skillPath, originalFromAudit, "utf-8") when originalFromAudit is
truthy (via findOriginalFromAudit), which overwrites the entire SKILL.md and can
corrupt its structure; modify the logic to load the existing SKILL.md content,
parse/locate the description section (or the specific frontmatter/section you
intend to restore), replace only that description payload with
originalFromAudit, and then write the updated content back to disk (update the
code around findOriginalFromAudit, originalFromAudit, writeFileSync and
skillPath so only the description block is replaced rather than the whole file).
- Around line 88-102: The rollback currently applies the `.bak` backup
unconditionally which can restore the wrong revision for an explicit older
proposalId; change the logic in rollback.ts (around the code using
findDeployedEntry and getLastDeployedProposal) so that `.bak` is used only when
targetProposalId was obtained via getLastDeployedProposal (i.e., when no
proposalId was supplied), and when proposalId is provided use the revision/state
recorded in the found entry (from findDeployedEntry) to restore that specific
revision instead of reading the `.bak` file; apply the same conditional change
to the later block referenced (lines ~104-106) where backups/revisions are
chosen.

In `@cli/selftune/init.ts`:
- Around line 104-107: The code silently ignores invalid mode overrides by
setting resolvedMode to undefined; instead validate modeOverride against
validModes and fail fast: if modeOverride is provided but not included in
validModes (check the modeOverride and validModes variables and the resolvedMode
assignment), throw or return a clear user-facing error indicating the invalid
--llm-mode value and listing allowed values (e.g., "agent" | "api") so typos
don't fall back to auto-detection.

In `@cli/selftune/observability.ts`:
- Around line 77-89: The health check is verifying bundled hook files under
import.meta.dir, causing false positives; change the hook existence check to
look at the repository's active hooks directory (e.g., resolve the repo root
from process.cwd() or use `git rev-parse --show-toplevel` and then join(root,
".git", "hooks", hook)) instead of join(import.meta.dir, "hooks", hook). Update
the symbol `hookPath` to point at the repo hooks location, keep the same
HealthCheck object (`check`, `HOOK_FILES`, `checks.push(check)`), and preserve
the existing pass/fail messages but ensure they reflect the active hooks
directory presence.

---

Nitpick comments:
In `@cli/selftune/evolution/deploy-proposal.ts`:
- Around line 134-150: The runCommand function can hang; add a configurable
timeout to abort the subprocess: create an AbortController and pass its signal
to Bun.spawn (or, if Bun.spawn doesn't accept it, call proc.kill on timeout),
start a timer (e.g., 30s or passed-in ms) that aborts/kills the process when
expired, ensure you clear the timer on normal exit, catch the abort/timeout case
and throw a descriptive Error including the command, timeout value, and any
captured stderr/stdout, and wire the new timeout parameter into any callers such
as deployProposal so long-running git commands get a reasonable deadline.
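The timeout race above can be sketched in a runtime-agnostic way; in the real `runCommand` the `work` promise would be the subprocess's exit (e.g. `proc.exited` from `Bun.spawn`) and `onTimeout` would call `proc.kill()`, but both are stubbed here:

```typescript
// Race a unit of work against a deadline; invoke a kill callback on timeout
// and always clear the timer so the process can exit promptly.
async function withTimeout<T>(
  work: Promise<T>,
  ms: number,
  onTimeout: () => void,
  label: string,
): Promise<T> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const deadline = new Promise<never>((_, reject) => {
    timer = setTimeout(() => {
      onTimeout(); // e.g. proc.kill() in runCommand
      reject(new Error(`Command timed out after ${ms}ms: ${label}`));
    }, ms);
  });
  try {
    return await Promise.race([work, deadline]);
  } finally {
    clearTimeout(timer);
  }
}
```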

In `@cli/selftune/grading/grade-session.ts`:
- Around line 311-312: The code currently converts values["eval-id"] with
Number() and may pass NaN into loadExpectationsFromEvalsJson; validate that
values["eval-id"] is a finite integer before using it: check using
Number.isFinite(Number(values["eval-id"])) (or parseInt and Number.isInteger)
and if invalid throw or log a clear error like "Invalid --eval-id: must be a
number" and exit; only call loadExpectationsFromEvalsJson(values["evals-json"],
Number(values["eval-id"])) when the parsed value is a valid number (reference:
values["eval-id"], Number(), loadExpectationsFromEvalsJson, expectations).
- Around line 97-103: Wrap the file read and JSON.parse logic in
loadExpectationsFromEvalsJson with a try/catch to provide clearer diagnostics:
catch filesystem (readFileSync) and parse errors, include evalsJsonPath and the
original error message in the thrown/logged error, and preserve behavior of
returning expectations for a matching ev.id (ev.expectations ?? []) while still
throwing a clear "Eval ID not found" error if no matching id is present; ensure
the function references loadExpectationsFromEvalsJson and the ev/expectations
handling remain unchanged.

In `@cli/selftune/hooks/session-stop.ts`:
- Around line 19-22: The declared return type of processSessionStop
(SessionTelemetryRecord | null) and its docstring claim it may return null, but
the implementation always returns a record; either implement the intended "skip"
condition or simplify the signature and docs: remove the | null from the return
type (use SessionTelemetryRecord) and update the docstring to remove the "or
null if skipped" wording, or alternatively add the skip logic inside
processSessionStop (checking payload, logPath, or other conditions) and
explicitly return null in that branch; reference processSessionStop,
StopPayload, SessionTelemetryRecord and TELEMETRY_LOG when making the change.

In `@cli/selftune/hooks/skill-eval.ts`:
- Around line 39-40: payload.tool_input?.file_path is typed unknown and
currently forced to a string with "as string", which can crash extractSkillName
(and underlying basename/dirname) if the runtime value is not a string; add a
runtime type guard: check typeof payload.tool_input?.file_path === 'string' and
only assign that to filePath (or fallback to ""), then call
extractSkillName(filePath); update the code around the filePath binding and the
call site (referencing payload.tool_input?.file_path, filePath, and
extractSkillName) so non-string values safely fall back to "" and avoid
TypeErrors when parsing paths.
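The guard above is a one-liner in practice; a sketch, with `coerceFilePath` as an illustrative name (the real code may inline the check at the `filePath` binding):

```typescript
// Runtime type guard: only a plain string may reach path parsing; anything
// else (numbers, objects, undefined) falls back to "".
function coerceFilePath(value: unknown): string {
  return typeof value === "string" ? value : "";
}
```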

In `@cli/selftune/ingestors/codex-rollout.ts`:
- Around line 282-287: The appendJsonl calls write query and telemetry records
without schema validation; update the two calls to pass the appropriate logType
to enable runtime validation (use the same log type names used by
cli/selftune/utils/jsonl.ts) — specifically modify the call that writes
queryRecord (where appendJsonl(queryLogPath, queryRecord) is invoked) to include
its logType, and modify the telemetry write (appendJsonl(telemetryLogPath,
telemetry)) to include the telemetry logType so both writes use
appendJsonl(path, record, logType) and trigger validation.
- Around line 66-68: The directory traversal uses statSync on yearDir (and
likewise monthDir/dayDir) which will throw on broken symlinks; wrap each
statSync call in a try/catch and skip the entry on error so traversal continues
(e.g., when iterating readdirSync(sessionsDir) for yearEntry, wrap
statSync(yearDir) in try/catch and continue on exception, and do the same for
monthDir and dayDir checks in the same loop), ensuring you still check
isDirectory() when stat succeeds.
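The tolerant stat described above can be factored into a helper so each of the year/month/day checks reads the same way; treating any stat error (broken symlink, permission issue) as "not a directory" keeps the traversal going:

```typescript
import { statSync } from "node:fs";

// statSync throws on broken symlinks and unreadable entries; swallow the
// error and report "not a directory" so directory walks simply skip them.
function isDirectorySafe(path: string): boolean {
  try {
    return statSync(path).isDirectory();
  } catch {
    return false;
  }
}
```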

In `@cli/selftune/ingestors/codex-wrapper.ts`:
- Around line 321-323: The entry point calls the async cliMain() without
handling rejections; wrap the call in a Promise rejection handler by changing
the import.meta.main block to call cliMain().catch(err => { log the error (e.g.,
console.error or processLogger.error) and exit with a non-zero code for
unexpected failures); optionally handle expected ENOENT specially (e.g., print a
friendly message and exit with a specific code) before calling process.exit(1).
Ensure the handler references cliMain and the import.meta.main entry so
top-level rejections are not unhandled.

In `@cli/selftune/observability.ts`:
- Around line 202-207: The current try block only verifies JSON parse success
before setting check.status to "pass"; change it to also validate the parsed
SelftuneConfig's required fields (agent_type and llm_mode) against allowed
enum/value sets before marking pass. After JSON.parse(…) (the code handling
SELFTUNE_CONFIG_PATH and the parsed config variable), verify config.agent_type
and config.llm_mode are present and match allowed values (create or reference
allowedAgentTypes/allowedLlmModes arrays or enums), and if any validation fails
set check.status="fail" with a clear check.message describing the
missing/invalid key(s); only set check.status="pass" and set
check.message=`agent_type=…, llm_mode=…` when both fields validate. Ensure JSON
parse errors remain handled by the catch and produce a failing check with an
explanatory message.
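The field validation above could be sketched as follows; the `llm_mode` values mirror the `"agent" | "api"` pair named elsewhere in this review, while the `agent_type` values are assumptions based on the three agents the PR description says `init` detects:

```typescript
// Assumed allowed value sets; the real schema lives in SelftuneConfig.
const allowedAgentTypes = ["claude-code", "codex", "opencode"] as const;
const allowedLlmModes = ["agent", "api"] as const;

// Return a list of problems; empty means the config validates.
function validateConfig(config: Record<string, unknown>): string[] {
  const problems: string[] = [];
  if (!allowedAgentTypes.includes(config.agent_type as any)) {
    problems.push(`missing/invalid agent_type: ${String(config.agent_type)}`);
  }
  if (!allowedLlmModes.includes(config.llm_mode as any)) {
    problems.push(`missing/invalid llm_mode: ${String(config.llm_mode)}`);
  }
  return problems;
}
```

A caller would mark the check `"fail"` with the joined problem list when non-empty, and `"pass"` with the `agent_type=…, llm_mode=…` message otherwise.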

ℹ️ Review info

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 03ebd3b and cdac726.

📒 Files selected for processing (19)
  • cli/selftune/eval/hooks-to-evals.ts
  • cli/selftune/evolution/audit.ts
  • cli/selftune/evolution/deploy-proposal.ts
  • cli/selftune/evolution/evolve.ts
  • cli/selftune/evolution/extract-patterns.ts
  • cli/selftune/evolution/rollback.ts
  • cli/selftune/grading/grade-session.ts
  • cli/selftune/hooks/session-stop.ts
  • cli/selftune/hooks/skill-eval.ts
  • cli/selftune/index.ts
  • cli/selftune/ingestors/codex-rollout.ts
  • cli/selftune/ingestors/codex-wrapper.ts
  • cli/selftune/ingestors/opencode-ingest.ts
  • cli/selftune/init.ts
  • cli/selftune/monitoring/watch.ts
  • cli/selftune/observability.ts
  • docs/exec-plans/tech-debt-tracker.md
  • docs/product-specs/index.md
  • tests/observability.test.ts
✅ Files skipped from review due to trivial changes (1)
  • docs/exec-plans/tech-debt-tracker.md
🚧 Files skipped from review as they are similar to previous changes (3)
  • tests/observability.test.ts
  • cli/selftune/evolution/evolve.ts
  • cli/selftune/index.ts

…ss fixes

- Fix SQL injection in opencode-ingest via assertSafeIdentifier validation
- Add timestamped backups in deploy-proposal to prevent overwrites
- Guard against empty sanitizeForGitRef results with "untitled" fallback
- Preserve pre-heading preamble in replaceDescription
- Add configurable timeout to runCommand to prevent hung subprocesses
- Fix order-dependent clustering in extract-patterns with transitive merges
- Fix rollback to use .bak only for latest deploy, replaceDescription for audit
- Add findLatestBackup helper for timestamped backup file discovery
- Defensive coercion for grade-session suggestion reasons
- Validate --eval-id, --since, --window, --threshold CLI inputs
- Add try/catch diagnostics in loadExpectationsFromEvalsJson
- Use word-boundary regex for skill detection in codex-wrapper
- Add unhandled rejection handlers at CLI entry points
- Use portable isMain detection via fileURLToPath in init.ts
- Fail fast on invalid --llm-mode with clear error message
- Fix observability hook path to check .git/hooks not bundled dir
- Validate SelftuneConfig agent_type/llm_mode against allowed values
- Add runtime type guard for skill-eval file_path
- Simplify session-stop return type to match implementation
- Clarify PRD.md location in product-specs docs
- Wrap codex-rollout statSync in try/catch for broken symlinks
- Pass logType to appendJsonl for runtime schema validation

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
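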

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 8

♻️ Duplicate comments (2)
cli/selftune/evolution/rollback.ts (1)

67-76: ⚠️ Potential issue | 🟠 Major

Audit fallback still depends on an upstream format that is not emitted.

Line 73 requires created.details to start with original_description:. With current upstream creation behavior, Strategy 2 becomes unreachable when no backup exists.

Based on learnings: In cli/selftune/evolution/evolve.ts, the "created" audit entry currently stores only a message string instead of the required original_description: payload.

Also applies to: 153-155

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cli/selftune/evolution/rollback.ts` around lines 67 - 76,
findOriginalFromAudit currently only recognizes audit entries whose
created.details starts with ORIGINAL_DESC_PREFIX, but upstream emits a plain
message string; update findOriginalFromAudit to accept both formats by first
checking if details startsWith(ORIGINAL_DESC_PREFIX) and slicing as before,
otherwise if details is a non-empty string return details as the original
description (or attempt a safe JSON parse if you expect structured payloads);
also apply the same tolerant logic to the other occurrence around lines 153-155
so the audit fallback works with the existing created entry format (refer to
readAuditTrail, ORIGINAL_DESC_PREFIX, and findOriginalFromAudit to locate the
code).
cli/selftune/monitoring/watch.ts (1)

285-295: ⚠️ Potential issue | 🟡 Minor

Use strict numeric parsing for CLI args (--window, --threshold).

Line 285 and Line 291 currently allow partial numeric strings ("20abc", "0.1foo"). That can silently misconfigure monitoring behavior.

🔧 Proposed fix
-  const windowSessions = Number.parseInt(values.window ?? "20", 10);
-  if (!Number.isFinite(windowSessions) || windowSessions < 1) {
+  const windowSessions = Number((values.window ?? "20").trim());
+  if (!Number.isInteger(windowSessions) || windowSessions < 1) {
     console.error("[ERROR] --window must be a positive integer >= 1");
     process.exit(1);
   }
 
-  const regressionThreshold = Number.parseFloat(values.threshold ?? "0.1");
+  const regressionThreshold = Number((values.threshold ?? "0.1").trim());
   if (!Number.isFinite(regressionThreshold) || regressionThreshold < 0 || regressionThreshold > 1) {
     console.error("[ERROR] --threshold must be a finite number between 0 and 1");
     process.exit(1);
   }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cli/selftune/monitoring/watch.ts` around lines 285 - 295, The CLI numeric
parsing for values.window and values.threshold is too permissive (e.g., "20abc"
or "0.1foo" are accepted); update the parsing around windowSessions and
regressionThreshold to validate the raw string first and only then convert: for
window (values.window) require the whole string match an integer regex (e.g.,
/^\d+$/) and then parseInt to produce windowSessions; for threshold
(values.threshold) require a full-match numeric/decimal regex (e.g.,
/^\d+(\.\d+)?$/ or a stricter 0-1 pattern) and then parseFloat to produce
regressionThreshold; keep existing range checks and error messages but reject
inputs that fail the full-string regex before calling
Number.parseInt/parseFloat.
🧹 Nitpick comments (12)
cli/selftune/observability.ts (3)

109-110: Prefer top-level ESM import over inline require().

The file uses ESM imports throughout but switches to require("node:os") inline. Move homedir to the top-level imports for consistency.

♻️ Suggested refactor

At top of file (after line 13):

import { homedir } from "node:os";

Then at line 109:

-  const { homedir } = require("node:os");
-  const settingsPath = join(homedir(), ".claude", "settings.json");
+  const settingsPath = join(homedir(), ".claude", "settings.json");
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cli/selftune/observability.ts` around lines 109 - 110, Replace the inline
CommonJS require call for homedir with a top-level ESM import: add import {
homedir } from "node:os" alongside the other imports, then remove
require("node:os") and keep the existing settingsPath assignment (const
settingsPath = join(homedir(), ".claude", "settings.json")); update any
references to homedir to use the imported symbol so the file consistently uses
ESM imports.

78-88: The checkHookInstallation fix addresses the previous concern.

The implementation now correctly resolves the repository root using git rev-parse --show-toplevel and checks hooks in .git/hooks/ rather than bundled source files.

However, execSync lacks a timeout, which could cause the doctor command to hang if git becomes unresponsive.

🛡️ Optional: Add timeout to execSync
   try {
-    repoRoot = execSync("git rev-parse --show-toplevel", { encoding: "utf-8" }).trim();
+    repoRoot = execSync("git rev-parse --show-toplevel", { encoding: "utf-8", timeout: 5000 }).trim();
   } catch {
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cli/selftune/observability.ts` around lines 78 - 88, The execSync call inside
checkHookInstallation should include a timeout to avoid hanging; update the
execSync invocation (the one assigning repoRoot) to pass a timeout option (e.g.
timeout: 5000) alongside encoding so the call fails fast if git is unresponsive
and remains caught by the existing try/catch; ensure you add the timeout to the
options object passed to execSync.

156-205: Consider extracting shared JSONL validation logic.

checkEvolutionHealth duplicates the line parsing, JSON validation, and schema checking logic from checkLogHealth. A helper function (e.g., validateJsonlFile(path, requiredFields)) could reduce duplication.

Not blocking, but worth considering for maintainability.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cli/selftune/observability.ts` around lines 156 - 205, checkEvolutionHealth
duplicates JSONL parsing and schema validation present in checkLogHealth;
extract the common logic into a shared helper (e.g., validateJsonlFile(filePath:
string, requiredFields: string[]): {lineCount:number, parseErrors:number,
schemaErrors:number}) and have checkEvolutionHealth call that helper and map its
result to the existing HealthCheck shape. Update checkLogHealth to use the same
helper so both functions reuse parsing, counting, and schema-checking behavior
(refer to functions checkEvolutionHealth and checkLogHealth and introduce
validateJsonlFile).
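The shared helper suggested above might look like this; it takes file contents rather than a path so the parsing core stays pure, and the field names in the example are assumptions:

```typescript
// Shared JSONL validation core for checkLogHealth / checkEvolutionHealth:
// count non-empty lines, JSON parse failures, and missing-field records.
function validateJsonl(
  content: string,
  requiredFields: string[],
): { lineCount: number; parseErrors: number; schemaErrors: number } {
  let lineCount = 0;
  let parseErrors = 0;
  let schemaErrors = 0;
  for (const line of content.split("\n")) {
    if (line.trim() === "") continue;
    lineCount++;
    try {
      const rec = JSON.parse(line) as Record<string, unknown>;
      if (requiredFields.some((f) => !(f in rec))) schemaErrors++;
    } catch {
      parseErrors++;
    }
  }
  return { lineCount, parseErrors, schemaErrors };
}
```

Each caller would then read the file and map the counts onto its own `HealthCheck` pass/warn/fail thresholds.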
tests/observability.test.ts (2)

31-40: Conditional assertion may pass trivially.

The test only asserts status === "warn" when status is not "pass", meaning if the log file exists and is valid, the assertion inside the if block never runs. Consider adding a separate test that explicitly creates a missing-file scenario or testing both branches.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/observability.test.ts` around lines 31 - 40, The test "evolution audit
log check has warn status when missing" conditionally asserts
evolutionCheck.status === "warn" only if status !== "pass", which lets the
assertion skip when the file exists; update tests to deterministically cover
both branches: add one test that ensures a missing evolution log produces status
"warn" by removing or mocking the file and calling checkLogHealth() and
asserting evolutionCheck.status === "warn", and add another test that simulates
a present/valid evolution log (or mocks checkLogHealth() response) asserting
status === "pass"; reference the existing test name, the checkLogHealth() call,
and the evolutionCheck variable to locate and replace/extend the test.

82-90: Same conditional assertion pattern.

Similar to the earlier test, this assertion only runs when status is not "pass". The test passes regardless of which branch is taken.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/observability.test.ts` around lines 82 - 90, The test "evolution audit
check uses warn when log is missing" uses a conditional that only asserts when
auditCheck.status !== "pass", allowing the test to pass in both branches; update
the test to deterministically assert the expected status by controlling the
input or state and removing the conditional: call checkEvolutionHealth() in a
scenario where the evolution audit log is absent (or mock its dependency), then
assert auditCheck is defined and assert auditCheck.status === "warn" directly
(use the checkEvolutionHealth function and the auditCheck variable to locate and
change the assertion).
cli/selftune/ingestors/codex-wrapper.ts (1)

139-151: Skill detection now uses word-boundary matching.

The previous substring-matching concern has been addressed with proper word-boundary regex. The escaping of special characters (replace(/[.*+?^${}()|[\]\\]/g, "\\$&")) mitigates ReDoS risk since skill names come from filesystem directory names, not arbitrary user input.

Minor performance note: Regex is compiled for each skill on each line. For large streams or many skills, precompiling the patterns outside the loop would improve performance.

♻️ Optional: precompile skill regexes
 export function parseJsonlStream(lines: string[], skillNames: Set<string>): ParsedCodexStream {
   // ... existing declarations ...
+  const skillPatterns = new Map<string, RegExp>();
+  for (const skillName of skillNames) {
+    skillPatterns.set(
+      skillName,
+      new RegExp(`\\b${skillName.replace(/[.*+?^${}()|[\]\\]/g, "\\$&")}\\b`, "i"),
+    );
+  }

   for (const rawLine of lines) {
     // ... existing parsing ...

       // Detect skill names in text on completed events (whole-word match)
       const textContent = ((item.text as string) ?? "") + ((item.command as string) ?? "");
       for (const skillName of skillNames) {
+        const pattern = skillPatterns.get(skillName)!;
         if (
           etype === "item.completed" &&
           !skillsTriggered.includes(skillName) &&
-          new RegExp(`\\b${skillName.replace(/[.*+?^${}()|[\]\\]/g, "\\$&")}\\b`, "i").test(
-            textContent,
-          )
+          pattern.test(textContent)
         ) {
           skillsTriggered.push(skillName);
         }
       }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cli/selftune/ingestors/codex-wrapper.ts` around lines 139 - 151, The
word-boundary regex is rebuilt for every skill on every item which is wasteful;
precompile RegExp objects once (e.g., map skillNames to an array of RegExp using
the current escape logic) outside the loop that iterates items, then within the
loop in codex-wrapper.ts use those precompiled patterns to test textContent
(retain the same checks using etype === "item.completed" and
skillsTriggered.includes) to avoid repeated RegExp construction and improve
performance.
cli/selftune/ingestors/opencode-ingest.ts (2)

447-471: Pass logType to appendJsonl for schema validation consistency.

Unlike codex-rollout.ts which passes logType arguments (e.g., "all_queries", "session_telemetry", "skill_usage") to enable schema validation, this function omits them. For consistency and to catch schema mismatches early, consider adding the log type identifiers.

♻️ Suggested fix
   if (prompt && prompt.length >= 4) {
     const queryRecord: QueryLogRecord = {
       timestamp: session.timestamp,
       session_id: sessionId,
       query: prompt,
       source: session.source,
     };
-    appendJsonl(queryLogPath, queryRecord);
+    appendJsonl(queryLogPath, queryRecord, "all_queries");
   }

   const { query: _q, ...telemetry } = session;
-  appendJsonl(telemetryLogPath, telemetry);
+  appendJsonl(telemetryLogPath, telemetry, "session_telemetry");

   for (const skillName of skills) {
     const skillRecord: SkillUsageRecord = {
       ...
     };
-    appendJsonl(skillLogPath, skillRecord);
+    appendJsonl(skillLogPath, skillRecord, "skill_usage");
   }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cli/selftune/ingestors/opencode-ingest.ts` around lines 447 - 471, The code
is missing the logType argument when calling appendJsonl which breaks schema
validation; update the three appendJsonl calls to include the appropriate
logType strings: pass "all_queries" when writing queryRecord (use queryLogPath
and QueryLogRecord), pass "session_telemetry" when writing telemetry (use
telemetryLogPath and the telemetry object), and pass "skill_usage" when writing
skillRecord inside the skills loop (use skillLogPath and SkillUsageRecord) so
schema validation matches other modules that call appendJsonl.

163-166: Consider using parameterized query for the --since filter.

While sinceTs is validated as a number making this safe, using a parameterized query would be more idiomatic and consistent with the message query on Line 190.

♻️ Suggested refactor
-  let whereClause = "";
-  if (sinceTs) {
-    whereClause = `WHERE created > ${Math.floor(sinceTs * 1000)}`;
-  }
+  const whereClause = sinceTs ? "WHERE created > ?" : "";
+  const whereParams = sinceTs ? [Math.floor(sinceTs * 1000)] : [];

   let sessionRows: Array<Record<string, unknown>>;
   try {
     sessionRows = db
-      .query(`SELECT * FROM ${safeSessionsTable} ${whereClause} ORDER BY created ASC`)
-      .all() as Array<Record<string, unknown>>;
+      .query(`SELECT * FROM ${safeSessionsTable} ${whereClause} ORDER BY created ASC`)
+      .all(...whereParams) as Array<Record<string, unknown>>;
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cli/selftune/ingestors/opencode-ingest.ts` around lines 163 - 166, The WHERE
clause currently interpolates sinceTs directly into whereClause; change it to a
parameterized placeholder and add the value to the query parameters used later
(e.g., replace `WHERE created > ${Math.floor(sinceTs * 1000)}` with a
placeholder like `WHERE created > $N` and push Math.floor(sinceTs * 1000) into
the params/values array), ensuring the placeholder index N is the next
positional index after existing parameters used by the message query on line
~190 in opencode-ingest.ts so the combined parameter list remains in the correct
order.
cli/selftune/ingestors/codex-rollout.ts (1)

297-299: Telemetry record may not match SessionTelemetryRecord type.

The spread { query: _q, ...telemetry } from ParsedRollout produces a record that may not align with SessionTelemetryRecord. The ParsedRollout interface includes fields like rollout_path, last_user_query, and transcript_path that might not be part of the expected telemetry schema. Consider explicitly constructing a SessionTelemetryRecord to ensure type safety and schema validation success.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cli/selftune/ingestors/codex-rollout.ts` around lines 297 - 299, The current
telemetry write uses a spread to build the record from parsed (const { query:
_q, ...telemetry } = parsed) which can include fields from ParsedRollout
(rollout_path, last_user_query, transcript_path, etc.) that don't match
SessionTelemetryRecord; instead explicitly construct a SessionTelemetryRecord
object from parsed by selecting and mapping only the expected fields (e.g.
session_id, user_id, timestamp, event, metadata) and any validated/normalized
values, then pass that typed object to appendJsonl(telemetryLogPath,
telemetryRecord, "session_telemetry") to ensure schema/type safety and avoid
leaking rollout-specific fields.
cli/selftune/hooks/session-stop.ts (1)

47-49: Add debug-gated error diagnostics in the hook catch block.

Lines 47-49 currently swallow all failures with no trace. Keeping non-blocking behavior is right, but a debug-only stderr log makes telemetry failures actionable without impacting Claude flow.

💡 Proposed fix
-  } catch {
-    // silent — hooks must never block Claude
+  } catch (error) {
+    // never block Claude; allow opt-in diagnostics
+    if (process.env.SELFTUNE_HOOK_DEBUG === "1") {
+      console.error("[selftune/session-stop] hook failed", error);
+    }
   }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cli/selftune/hooks/session-stop.ts` around lines 47 - 49, The catch block in
session-stop.ts currently swallows errors; modify it to capture the thrown error
(e.g., catch (err) { ... }) and emit a debug-gated stderr diagnostic so failures
are visible when debugging but remain non-blocking for Claude; implement a
conditional like if (process.env.DEBUG || process.env.NODE_ENV ===
'development') console.error('session-stop hook failed:', err) (or use the
existing logger if present) and then return silently to preserve the current
behavior referenced by the "hooks must never block Claude" comment.
cli/selftune/grading/grade-session.ts (1)

97-111: Consider adding runtime validation for the evals JSON structure.

The current implementation relies on type assertions that could produce unexpected behavior if the JSON structure doesn't match expectations (e.g., if expectations is not a string[]). This is low-risk since the grader would likely fail anyway, but explicit validation could provide clearer error messages.

🛡️ Optional defensive validation
 export function loadExpectationsFromEvalsJson(evalsJsonPath: string, evalId: number): string[] {
   let data: Record<string, unknown>;
   try {
     data = JSON.parse(readFileSync(evalsJsonPath, "utf-8"));
   } catch (err) {
     throw new Error(
       `Failed to read or parse evals JSON at ${evalsJsonPath}: ${err instanceof Error ? err.message : String(err)}`,
     );
   }
-  for (const ev of (data as Record<string, unknown[]>).evals ?? []) {
-    if ((ev as Record<string, unknown>).id === evalId)
-      return ((ev as Record<string, unknown>).expectations as string[]) ?? [];
+  const evals = Array.isArray(data.evals) ? data.evals : [];
+  for (const ev of evals) {
+    if (typeof ev === "object" && ev !== null && (ev as Record<string, unknown>).id === evalId) {
+      const expectations = (ev as Record<string, unknown>).expectations;
+      if (!Array.isArray(expectations)) return [];
+      return expectations.filter((e): e is string => typeof e === "string");
+    }
   }
   throw new Error(`Eval ID ${evalId} not found in ${evalsJsonPath}`);
 }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cli/selftune/grading/grade-session.ts` around lines 97 - 111, The function
loadExpectationsFromEvalsJson currently uses unchecked type assertions on the
parsed JSON (e.g., treating data.evals as an array and ev.expectations as
string[]), which can cause confusing runtime errors; add defensive runtime
validation after parsing: verify that data is an object, that data.evals is an
array, iterate ensuring each ev is an object with a numeric/id-compatible id and
that ev.expectations (if present) is an array of strings, and throw descriptive
errors when these checks fail (include evalsJsonPath and evalId in messages) so
callers get clear diagnostics instead of silent type failures.
cli/selftune/init.ts (1)

189-192: Consider defensive parsing for potentially corrupted config files.

If the config file exists but contains malformed JSON or an unexpected structure, the raw SyntaxError or runtime errors from missing fields may confuse users. A try-catch with a user-friendly message would improve the experience.

🛡️ Optional defensive parsing
   // If config exists and no --force, return existing
   if (!force && existsSync(configPath)) {
-    const raw = readFileSync(configPath, "utf-8");
-    return JSON.parse(raw) as SelftuneConfig;
+    try {
+      const raw = readFileSync(configPath, "utf-8");
+      const config = JSON.parse(raw) as SelftuneConfig;
+      // Basic shape check
+      if (!config.agent_type || !config.cli_path) {
+        throw new Error("Incomplete config");
+      }
+      return config;
+    } catch {
+      console.error("[WARN] Existing config is corrupted, reinitializing...");
+      // Fall through to regenerate config
+    }
   }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cli/selftune/init.ts` around lines 189 - 192, Wrap the JSON.parse call that
reads the existing config (when !force and existsSync(configPath)) in a
try-catch to defensively handle malformed or unexpected JSON; in the catch, log
or throw a clear, user-friendly error referencing configPath and SelftuneConfig
(e.g., "Config file at <configPath> is corrupted or contains invalid JSON"), and
either fall back to creating a new default SelftuneConfig or surface the error
so callers can recreate the config. Ensure you update the branch that returns
JSON.parse(raw) to only return a validated SelftuneConfig after successful
parsing/validation.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@cli/selftune/evolution/deploy-proposal.ts`:
- Around line 150-153: The current sequential draining (new
Response(proc.stdout).text() then new Response(proc.stderr).text()) risks
deadlock when both stdout and stderr are piped; instead read both streams
concurrently and await proc.exited afterwards. Update the code around proc,
proc.stdout, proc.stderr, stdout and stderr to use Promise.all to start both
Response(...).text() reads in parallel, then await proc.exited once both have
completed so neither pipe can block the child process.

In `@cli/selftune/evolution/rollback.ts`:
- Around line 52-57: The current lexicographic reverse sort on entries ->
backupFiles can put plain `<base>.bak` before newer `<base>.<timestamp>.bak`
files; fix by replacing the naive .sort().reverse() with a comparator that
extracts a timestamp for each filename (treat `${base}.bak` as timestamp 0, and
for `${base}.<ts>.bak` parse the <ts> portion as a number or Date) and then sort
by that numeric timestamp descending so newest backups come first; update the
code around readdirSync(dir) / entries and the backupFiles variable where
.filter(...) currently chains .sort().reverse() to use .sort((a,b)=>...) that
compares extracted timestamps.

In `@cli/selftune/grading/grade-session.ts`:
- Around line 181-193: Wrap the JSON.parse calls in gradeViaAgent and
gradeViaApi in try-catch blocks: after obtaining raw from
callViaAgent(callViaApi) and before parsing _stripMarkdownFences(raw), try to
parse and on failure throw a new Error that includes which function failed
(gradeViaAgent or gradeViaApi), the GRADER_SYSTEM context, and the raw LLM
output (or the stripped string) to aid debugging; ensure you preserve the
original error as the cause or include its message when rethrowing so callers
can still see the underlying SyntaxError.
- Around line 403-412: Replace the manual entrypoint detection that defines
isMain (the import.meta.url / process.argv checks) with Bun's built-in
import.meta.main; specifically remove the isMain variable and use if
(import.meta.main) to call cliMain().catch(...) so the file-level guard around
cliMain() uses import.meta.main instead of import.meta.url/process.argv logic.

In `@cli/selftune/hooks/session-stop.ts`:
- Around line 23-27: The current normalization uses nullish coalescing which
doesn't guard types; ensure payload values are type-checked and coerced to safe
defaults before calling parseTranscript. Replace the sessionId and cwd
normalization with explicit string checks (e.g., treat non-strings as the
default "unknown" / empty string) and compute transcriptPath with a guard like:
if payload.transcript_path is a string use it, otherwise set transcriptPath to
""; only call parseTranscript(transcriptPath) when transcriptPath is a valid
non-empty string (or pass a validated empty string) so parseTranscript never
receives non-string input.
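The described normalization could look roughly like this (StopPayload field names follow the review text; the real type is defined in the hook module):

```typescript
// Illustrative payload shape; fields arrive untyped from hook input.
interface StopPayload {
  session_id?: unknown;
  transcript_path?: unknown;
  cwd?: unknown;
}

function normalizeStopPayload(payload: StopPayload) {
  // Coerce anything non-string to a safe default instead of passing it through.
  const asString = (v: unknown, fallback: string) =>
    typeof v === "string" ? v : fallback;
  return {
    sessionId: asString(payload.session_id, "unknown"),
    transcriptPath: asString(payload.transcript_path, ""),
    cwd: asString(payload.cwd, ""),
  };
}
```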

In `@cli/selftune/ingestors/codex-rollout.ts`:
- Around line 46-51: The loop that enumerates skill directories uses
statSync(skillDir) without error handling and can throw on broken symlinks or
permission errors; wrap the statSync and existsSync check for SKILL.md in a
try/catch around the body that computes skillDir (the same pattern used in
findRolloutFiles), and on error simply continue to the next entry so that the
names.add(entry) only runs when statSync succeeds and the SKILL.md check passes
(reference variables/idents: readdirSync(dir) loop, skillDir, statSync,
existsSync(join(skillDir, "SKILL.md")), names.add(entry)).

In `@cli/selftune/ingestors/opencode-ingest.ts`:
- Around line 56-61: The directory enumeration loop using readdirSync and
statSync can throw on broken symlinks or permission errors; wrap the
statSync(skillDir) call in a try/catch and skip the entry on any error so the
process doesn't crash—specifically guard the block that calls statSync(skillDir)
and existsSync(join(skillDir, "SKILL.md")) (the loop that adds entry via
names.add(entry)) and on catch simply continue to the next entry, optionally
logging a debug message.

In `@cli/selftune/monitoring/watch.ts`:
- Around line 234-243: The current catch in the dynamic import in watch.ts
swallows all errors and treats them as "rollback module not available"; change
the catch so it only suppresses module-resolution failures (check error.code ===
"ERR_MODULE_NOT_FOUND" or equivalent for Node ESM) and returns the fallback
async rollback only in that case, but rethrow any other errors (syntax/runtime)
so they surface; update the import block around the dynamic
import("../evolution/rollback.js") and the returned mod.rollback to implement
this selective error handling.
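One way to sketch the selective suppression (the module path is taken from the review; the message-based fallback check is an assumption, since error codes vary by runtime):

```typescript
// Only treat module-resolution failures as "rollback not available".
function isModuleNotFound(err: unknown): boolean {
  const e = err as { code?: string; message?: string };
  return (
    e?.code === "ERR_MODULE_NOT_FOUND" ||
    /cannot find (module|package)/i.test(e?.message ?? "")
  );
}

const ROLLBACK_MODULE = "../evolution/rollback.js"; // path from the review

async function loadRollback(): Promise<(skill: string) => Promise<void>> {
  try {
    const mod = await import(ROLLBACK_MODULE);
    return mod.rollback;
  } catch (err) {
    if (isModuleNotFound(err)) {
      return async () => {}; // fallback no-op when the module is absent
    }
    throw err; // surface syntax/runtime errors instead of hiding them
  }
}
```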

---

Duplicate comments:
In `@cli/selftune/evolution/rollback.ts`:
- Around line 67-76: findOriginalFromAudit currently only recognizes audit
entries whose created.details starts with ORIGINAL_DESC_PREFIX, but upstream
emits a plain message string; update findOriginalFromAudit to accept both
formats by first checking if details startsWith(ORIGINAL_DESC_PREFIX) and
slicing as before, otherwise if details is a non-empty string return details as
the original description (or attempt a safe JSON parse if you expect structured
payloads); also apply the same tolerant logic to the other occurrence around
lines 153-155 so the audit fallback works with the existing created entry format
(refer to readAuditTrail, ORIGINAL_DESC_PREFIX, and findOriginalFromAudit to
locate the code).
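A sketch of the tolerant fallback (the ORIGINAL_DESC_PREFIX value shown here is a placeholder, not the real constant):

```typescript
const ORIGINAL_DESC_PREFIX = "original_description:"; // placeholder value

// Accept both the prefixed format and a plain message string.
function extractOriginalDescription(details: unknown): string | null {
  if (typeof details !== "string" || details.length === 0) return null;
  if (details.startsWith(ORIGINAL_DESC_PREFIX)) {
    return details.slice(ORIGINAL_DESC_PREFIX.length).trim();
  }
  return details; // plain message string emitted by the upstream writer
}
```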

In `@cli/selftune/monitoring/watch.ts`:
- Around line 285-295: The CLI numeric parsing for values.window and
values.threshold is too permissive (e.g., "20abc" or "0.1foo" are accepted);
update the parsing around windowSessions and regressionThreshold to validate the
raw string first and only then convert: for window (values.window) require the
whole string match an integer regex (e.g., /^\d+$/) and then parseInt to produce
windowSessions; for threshold (values.threshold) require a full-match
numeric/decimal regex (e.g., /^\d+(\.\d+)?$/ or a stricter 0-1 pattern) and then
parseFloat to produce regressionThreshold; keep existing range checks and error
messages but reject inputs that fail the full-string regex before calling
Number.parseInt/parseFloat.
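The full-string validation could be sketched as follows (flag names follow the review; the existing range checks are elided here):

```typescript
// Reject any input that isn't entirely an integer before converting.
function parseWindow(raw: string): number {
  if (!/^\d+$/.test(raw)) {
    throw new Error(`--window must be a positive integer, got "${raw}"`);
  }
  return Number.parseInt(raw, 10);
}

// Reject any input that isn't entirely a decimal number before converting.
function parseThreshold(raw: string): number {
  if (!/^\d+(\.\d+)?$/.test(raw)) {
    throw new Error(`--threshold must be a decimal number, got "${raw}"`);
  }
  return Number.parseFloat(raw);
}
```

Unlike bare `parseInt`/`parseFloat`, the full-match regex rejects inputs like "20abc" or "0.1foo" instead of silently truncating them.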

---

Nitpick comments:
In `@cli/selftune/grading/grade-session.ts`:
- Around line 97-111: The function loadExpectationsFromEvalsJson currently uses
unchecked type assertions on the parsed JSON (e.g., treating data.evals as an
array and ev.expectations as string[]), which can cause confusing runtime
errors; add defensive runtime validation after parsing: verify that data is an
object, that data.evals is an array, iterate ensuring each ev is an object with
a numeric/id-compatible id and that ev.expectations (if present) is an array of
strings, and throw descriptive errors when these checks fail (include
evalsJsonPath and evalId in messages) so callers get clear diagnostics instead
of silent type failures.

In `@cli/selftune/hooks/session-stop.ts`:
- Around line 47-49: The catch block in session-stop.ts currently swallows
errors; modify it to capture the thrown error (e.g., catch (err) { ... }) and
emit a debug-gated stderr diagnostic so failures are visible when debugging but
remain non-blocking for Claude; implement a conditional like if
(process.env.DEBUG || process.env.NODE_ENV === 'development')
console.error('session-stop hook failed:', err) (or use the existing logger if
present) and then return silently to preserve the current behavior referenced by
the "hooks must never block Claude" comment.

In `@cli/selftune/ingestors/codex-rollout.ts`:
- Around line 297-299: The current telemetry write uses a spread to build the
record from parsed (const { query: _q, ...telemetry } = parsed) which can
include fields from ParsedRollout (rollout_path, last_user_query,
transcript_path, etc.) that don't match SessionTelemetryRecord; instead
explicitly construct a SessionTelemetryRecord object from parsed by selecting
and mapping only the expected fields (e.g. session_id, user_id, timestamp,
event, metadata) and any validated/normalized values, then pass that typed
object to appendJsonl(telemetryLogPath, telemetryRecord, "session_telemetry") to
ensure schema/type safety and avoid leaking rollout-specific fields.

In `@cli/selftune/ingestors/codex-wrapper.ts`:
- Around line 139-151: The word-boundary regex is rebuilt for every skill on
every item which is wasteful; precompile RegExp objects once (e.g., map
skillNames to an array of RegExp using the current escape logic) outside the
loop that iterates items, then within the loop in codex-wrapper.ts use those
precompiled patterns to test textContent (retain the same checks using etype ===
"item.completed" and skillsTriggered.includes) to avoid repeated RegExp
construction and improve performance.
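A sketch of hoisting the compilation out of the loop (the escape helper approximates the "current escape logic" the comment refers to):

```typescript
// Escape regex metacharacters in a skill name.
function escapeRegExp(s: string): string {
  return s.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Compile once per run instead of once per item x skill.
function compileSkillPatterns(skillNames: string[]): Map<string, RegExp> {
  return new Map(
    skillNames.map((name) => [name, new RegExp(`\\b${escapeRegExp(name)}\\b`, "i")] as [string, RegExp]),
  );
}

// Inside the items loop, only test the precompiled patterns.
function matchSkills(text: string, patterns: Map<string, RegExp>): string[] {
  const hits: string[] = [];
  for (const [name, re] of patterns) {
    if (re.test(text)) hits.push(name);
  }
  return hits;
}
```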

In `@cli/selftune/ingestors/opencode-ingest.ts`:
- Around line 447-471: The code is missing the logType argument when calling
appendJsonl which breaks schema validation; update the three appendJsonl calls
to include the appropriate logType strings: pass "all_queries" when writing
queryRecord (use queryLogPath and QueryLogRecord), pass "session_telemetry" when
writing telemetry (use telemetryLogPath and the telemetry object), and pass
"skill_usage" when writing skillRecord inside the skills loop (use skillLogPath
and SkillUsageRecord) so schema validation matches other modules that call
appendJsonl.
- Around line 163-166: The WHERE clause currently interpolates sinceTs directly
into whereClause; change it to a parameterized placeholder and add the value to
the query parameters used later (e.g., replace `WHERE created >
${Math.floor(sinceTs * 1000)}` with a placeholder like `WHERE created > $N` and
push Math.floor(sinceTs * 1000) into the params/values array), ensuring the
placeholder index N is the next positional index after existing parameters used
by the message query on line ~190 in opencode-ingest.ts so the combined
parameter list remains in the correct order.

In `@cli/selftune/init.ts`:
- Around line 189-192: Wrap the JSON.parse call that reads the existing config
(when !force and existsSync(configPath)) in a try-catch to defensively handle
malformed or unexpected JSON; in the catch, log or throw a clear, user-friendly
error referencing configPath and SelftuneConfig (e.g., "Config file at
<configPath> is corrupted or contains invalid JSON"), and either fall back to
creating a new default SelftuneConfig or surface the error so callers can
recreate the config. Ensure you update the branch that returns JSON.parse(raw)
to only return a validated SelftuneConfig after successful parsing/validation.

In `@cli/selftune/observability.ts`:
- Around line 109-110: Replace the inline CommonJS require call for homedir with
a top-level ESM import: add import { homedir } from "node:os" alongside the
other imports, then remove require("node:os") and keep the existing settingsPath
assignment (const settingsPath = join(homedir(), ".claude", "settings.json"));
update any references to homedir to use the imported symbol so the file
consistently uses ESM imports.
- Around line 78-88: The execSync call inside checkHookInstallation should
include a timeout to avoid hanging; update the execSync invocation (the one
assigning repoRoot) to pass a timeout option (e.g. timeout: 5000) alongside
encoding so the call fails fast if git is unresponsive and remains caught by the
existing try/catch; ensure you add the timeout to the options object passed to
execSync.
- Around line 156-205: checkEvolutionHealth duplicates JSONL parsing and schema
validation present in checkLogHealth; extract the common logic into a shared
helper (e.g., validateJsonlFile(filePath: string, requiredFields: string[]):
{lineCount:number, parseErrors:number, schemaErrors:number}) and have
checkEvolutionHealth call that helper and map its result to the existing
HealthCheck shape. Update checkLogHealth to use the same helper so both
functions reuse parsing, counting, and schema-checking behavior (refer to
functions checkEvolutionHealth and checkLogHealth and introduce
validateJsonlFile).
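The proposed helper might be sketched like this; for illustration it validates a content string rather than reading filePath, so the fs plumbing in observability.ts is left out:

```typescript
interface JsonlStats {
  lineCount: number;
  parseErrors: number;
  schemaErrors: number;
}

// Shared JSONL validation: count lines, JSON parse failures, and records
// missing any of the required fields.
function validateJsonlContent(content: string, requiredFields: string[]): JsonlStats {
  const stats: JsonlStats = { lineCount: 0, parseErrors: 0, schemaErrors: 0 };
  const lines = content.split("\n").filter((l) => l.trim() !== "");
  for (const line of lines) {
    stats.lineCount++;
    let record: unknown;
    try {
      record = JSON.parse(line);
    } catch {
      stats.parseErrors++;
      continue;
    }
    if (
      typeof record !== "object" ||
      record === null ||
      requiredFields.some((f) => !(f in (record as Record<string, unknown>)))
    ) {
      stats.schemaErrors++;
    }
  }
  return stats;
}
```

Both checkLogHealth and checkEvolutionHealth could then map this one result shape into their HealthCheck outputs.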

In `@tests/observability.test.ts`:
- Around line 31-40: The test "evolution audit log check has warn status when
missing" conditionally asserts evolutionCheck.status === "warn" only if status
!== "pass", which lets the assertion skip when the file exists; update tests to
deterministically cover both branches: add one test that ensures a missing
evolution log produces status "warn" by removing or mocking the file and calling
checkLogHealth() and asserting evolutionCheck.status === "warn", and add another
test that simulates a present/valid evolution log (or mocks checkLogHealth()
response) asserting status === "pass"; reference the existing test name, the
checkLogHealth() call, and the evolutionCheck variable to locate and
replace/extend the test.
- Around line 82-90: The test "evolution audit check uses warn when log is
missing" uses a conditional that only asserts when auditCheck.status !== "pass",
allowing the test to pass in both branches; update the test to deterministically
assert the expected status by controlling the input or state and removing the
conditional: call checkEvolutionHealth() in a scenario where the evolution audit
log is absent (or mock its dependency), then assert auditCheck is defined and
assert auditCheck.status === "warn" directly (use the checkEvolutionHealth
function and the auditCheck variable to locate and change the assertion).

ℹ️ Review info

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between cdac726 and a073bd3.

📒 Files selected for processing (15)
  • cli/selftune/evolution/deploy-proposal.ts
  • cli/selftune/evolution/extract-patterns.ts
  • cli/selftune/evolution/rollback.ts
  • cli/selftune/grading/grade-session.ts
  • cli/selftune/hooks/session-stop.ts
  • cli/selftune/hooks/skill-eval.ts
  • cli/selftune/ingestors/codex-rollout.ts
  • cli/selftune/ingestors/codex-wrapper.ts
  • cli/selftune/ingestors/opencode-ingest.ts
  • cli/selftune/init.ts
  • cli/selftune/monitoring/watch.ts
  • cli/selftune/observability.ts
  • docs/product-specs/index.md
  • tests/evolution/rollback.test.ts
  • tests/observability.test.ts
🚧 Files skipped from review as they are similar to previous changes (3)
  • cli/selftune/evolution/extract-patterns.ts
  • cli/selftune/hooks/skill-eval.ts
  • tests/evolution/rollback.test.ts

Comment on lines +150 to +153
const stdout = await new Response(proc.stdout).text();
const stderr = await new Response(proc.stderr).text();
const exitCode = await proc.exited;


⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

find . -name "deploy-proposal.ts" -type f

Repository: WellDunDun/selftune

Length of output: 106


🌐 Web query:

Bun.spawn stdout stderr piped streams deadlock backpressure concurrent consumption

💡 Result:

With Bun.spawn(), proc.stdout/proc.stderr are ReadableStreams when you set (or default) them to "pipe" (stdout defaults to "pipe", stderr defaults to "inherit"). If you pipe both and don’t continuously read both streams, the child can block when an OS pipe buffer fills (backpressure), and then “wait-for-exit then read” (or reading only one stream to completion first) can deadlock. [1] [3]

What to do

  • Start consuming stdout and stderr immediately and concurrently (before/while awaiting proc.exited).
  • Or don’t pipe one of them (stderr: "inherit" or "ignore") if you don’t need it. [1]
  • If you pipe stderr to handle it, read it as a stream (or .text()), but still avoid sequential-drain patterns when both are piped. [2] [3]

Safe pattern (concurrent consumption)

const proc = Bun.spawn(cmd, { stdout: "pipe", stderr: "pipe" });

const [stdout, stderr, exitCode] = await Promise.all([
  proc.stdout.text(),
  proc.stderr.text(),
  proc.exited,
]);

Streaming consumption (keeps pipes drained)

const proc = Bun.spawn(cmd, { stdout: "pipe", stderr: "pipe" });

const pump = async (s: ReadableStream, write: (chunk: Uint8Array) => void) => {
  const r = s.getReader();
  for (;;) {
    const { value, done } = await r.read();
    if (done) break;
    write(value);
  }
};

await Promise.all([
  pump(proc.stdout, chunk => Bun.write(Bun.stdout, chunk)),
  pump(proc.stderr, chunk => Bun.write(Bun.stderr, chunk)),
  proc.exited,
]);

If you describe your exact setup (are you piping stdout→another process’s stdin? are you calling proc.exited before draining? using .text()/Response?), I can point out the specific deadlock edge in your code.


🏁 Script executed:

sed -n '140,165p' ./cli/selftune/evolution/deploy-proposal.ts

Repository: WellDunDun/selftune

Length of output: 674


🏁 Script executed:

sed -n '120,150p' ./cli/selftune/evolution/deploy-proposal.ts

Repository: WellDunDun/selftune

Length of output: 941


Drain stdout and stderr concurrently to avoid pipe backpressure deadlock.

With both stdout: "pipe" and stderr: "pipe" set in Bun.spawn, sequential consumption risks deadlock. Line 150 fully drains stdout before line 151 starts consuming stderr. If stderr output exceeds the OS pipe buffer capacity, the child process blocks, unable to proceed or exit, while the parent is still waiting for stdout to finish. This stalls await proc.exited indefinitely.

Use Promise.all() to consume both streams concurrently:

🔧 Proposed fix
-    const stdout = await new Response(proc.stdout).text();
-    const stderr = await new Response(proc.stderr).text();
-    const exitCode = await proc.exited;
+    const [stdout, stderr, exitCode] = await Promise.all([
+      new Response(proc.stdout).text(),
+      new Response(proc.stderr).text(),
+      proc.exited,
+    ]);
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cli/selftune/evolution/deploy-proposal.ts` around lines 150 - 153, The
current sequential draining (new Response(proc.stdout).text() then new
Response(proc.stderr).text()) risks deadlock when both stdout and stderr are
piped; instead read both streams concurrently and await proc.exited afterwards.
Update the code around proc, proc.stdout, proc.stderr, stdout and stderr to use
Promise.all to start both Response(...).text() reads in parallel, then await
proc.exited once both have completed so neither pipe can block the child
process.

Comment on lines +52 to +57
const entries = readdirSync(dir);
// Match <base>.bak or <base>.<anything>.bak
const backupFiles = entries
.filter((f) => f === `${base}.bak` || (f.startsWith(`${base}.`) && f.endsWith(".bak")))
.sort()
.reverse(); // lexicographic descending puts latest timestamp first

⚠️ Potential issue | 🟠 Major

Backup ordering can restore a stale revision.

Lines 56-57 sort filenames lexicographically, which can place SKILL.md.bak ahead of newer SKILL.md.<timestamp>.bak files. That can roll back to the wrong version.

🔧 Proposed fix
-import { existsSync, readFileSync, readdirSync, unlinkSync, writeFileSync } from "node:fs";
+import { existsSync, readFileSync, readdirSync, statSync, unlinkSync, writeFileSync } from "node:fs";
@@
-  const backupFiles = entries
-    .filter((f) => f === `${base}.bak` || (f.startsWith(`${base}.`) && f.endsWith(".bak")))
-    .sort()
-    .reverse(); // lexicographic descending puts latest timestamp first
+  const backupFiles = entries
+    .filter((f) => f === `${base}.bak` || (f.startsWith(`${base}.`) && f.endsWith(".bak")))
+    .map((name) => ({ name, mtimeMs: statSync(join(dir, name)).mtimeMs }))
+    .sort((a, b) => b.mtimeMs - a.mtimeMs)
+    .map((f) => f.name); // newest file first by filesystem timestamp
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cli/selftune/evolution/rollback.ts` around lines 52 - 57, The current
lexicographic reverse sort on entries -> backupFiles can put plain `<base>.bak`
before newer `<base>.<timestamp>.bak` files; fix by replacing the naive
.sort().reverse() with a comparator that extracts a timestamp for each filename
(treat `${base}.bak` as timestamp 0, and for `${base}.<ts>.bak` parse the <ts>
portion as a number or Date) and then sort by that numeric timestamp descending
so newest backups come first; update the code around readdirSync(dir) / entries
and the backupFiles variable where .filter(...) currently chains
.sort().reverse() to use .sort((a,b)=>...) that compares extracted timestamps.

Comment on lines +181 to +193
export async function gradeViaAgent(prompt: string, agent: string): Promise<GraderOutput> {
const raw = await callViaAgent(GRADER_SYSTEM, prompt, agent);
return JSON.parse(_stripMarkdownFences(raw)) as GraderOutput;
}

// ---------------------------------------------------------------------------
// Grading via direct Anthropic API
// ---------------------------------------------------------------------------

export async function gradeViaApi(prompt: string): Promise<GraderOutput> {
const raw = await callViaApi(GRADER_SYSTEM, prompt);
return JSON.parse(_stripMarkdownFences(raw)) as GraderOutput;
}

⚠️ Potential issue | 🟡 Minor

Wrap JSON.parse in try-catch for clearer error handling on malformed LLM output.

LLM responses can sometimes include malformed JSON despite the system prompt instructions. The current implementation will throw a generic SyntaxError that propagates to the CLI's catch block without context about what failed.

🛡️ Proposed fix for clearer error messages
 export async function gradeViaAgent(prompt: string, agent: string): Promise<GraderOutput> {
   const raw = await callViaAgent(GRADER_SYSTEM, prompt, agent);
-  return JSON.parse(_stripMarkdownFences(raw)) as GraderOutput;
+  const cleaned = _stripMarkdownFences(raw);
+  try {
+    return JSON.parse(cleaned) as GraderOutput;
+  } catch {
+    throw new Error(`Failed to parse grader JSON output. Raw (first 500 chars): ${cleaned.slice(0, 500)}`);
+  }
 }

 // ---------------------------------------------------------------------------
 // Grading via direct Anthropic API
 // ---------------------------------------------------------------------------

 export async function gradeViaApi(prompt: string): Promise<GraderOutput> {
   const raw = await callViaApi(GRADER_SYSTEM, prompt);
-  return JSON.parse(_stripMarkdownFences(raw)) as GraderOutput;
+  const cleaned = _stripMarkdownFences(raw);
+  try {
+    return JSON.parse(cleaned) as GraderOutput;
+  } catch {
+    throw new Error(`Failed to parse grader JSON output. Raw (first 500 chars): ${cleaned.slice(0, 500)}`);
+  }
 }
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

Before:

export async function gradeViaAgent(prompt: string, agent: string): Promise<GraderOutput> {
  const raw = await callViaAgent(GRADER_SYSTEM, prompt, agent);
  return JSON.parse(_stripMarkdownFences(raw)) as GraderOutput;
}

// ---------------------------------------------------------------------------
// Grading via direct Anthropic API
// ---------------------------------------------------------------------------

export async function gradeViaApi(prompt: string): Promise<GraderOutput> {
  const raw = await callViaApi(GRADER_SYSTEM, prompt);
  return JSON.parse(_stripMarkdownFences(raw)) as GraderOutput;
}

After:

export async function gradeViaAgent(prompt: string, agent: string): Promise<GraderOutput> {
  const raw = await callViaAgent(GRADER_SYSTEM, prompt, agent);
  const cleaned = _stripMarkdownFences(raw);
  try {
    return JSON.parse(cleaned) as GraderOutput;
  } catch {
    throw new Error(`Failed to parse grader JSON output. Raw (first 500 chars): ${cleaned.slice(0, 500)}`);
  }
}

// ---------------------------------------------------------------------------
// Grading via direct Anthropic API
// ---------------------------------------------------------------------------

export async function gradeViaApi(prompt: string): Promise<GraderOutput> {
  const raw = await callViaApi(GRADER_SYSTEM, prompt);
  const cleaned = _stripMarkdownFences(raw);
  try {
    return JSON.parse(cleaned) as GraderOutput;
  } catch {
    throw new Error(`Failed to parse grader JSON output. Raw (first 500 chars): ${cleaned.slice(0, 500)}`);
  }
}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cli/selftune/grading/grade-session.ts` around lines 181 - 193, Wrap the
JSON.parse calls in gradeViaAgent and gradeViaApi in try-catch blocks: after
obtaining raw from callViaAgent(callViaApi) and before parsing
_stripMarkdownFences(raw), try to parse and on failure throw a new Error that
includes which function failed (gradeViaAgent or gradeViaApi), the GRADER_SYSTEM
context, and the raw LLM output (or the stripped string) to aid debugging;
ensure you preserve the original error as the cause or include its message when
rethrowing so callers can still see the underlying SyntaxError.

Comment on lines +403 to +412
// Guard: only run when invoked directly
const isMain =
import.meta.url === `file://${process.argv[1]}` || process.argv[1]?.endsWith("grade-session.ts");

if (isMain) {
cliMain().catch((err) => {
console.error(`[FATAL] ${err}`);
process.exit(1);
});
}

⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🌐 Web query:

Bun.main module detection how to use

💡 Result:

Use import.meta.main.

// foo.ts
export function foo() {}

if (import.meta.main) {
  // runs only when executed directly (entrypoint), not when imported
  console.log("running as main");
}

In Bun, import.meta.main is true when the file is the entrypoint (directly executed by bun run), and false when it’s imported. [1]

If you’re coming from CommonJS (require.main === module), Bun’s bundler rewrites that pattern to import.meta.main so you can use it with ESM import syntax. [2]

Sources: [1] [2]


🏁 Script executed:

# Check how grade-session.ts is invoked and what runtime it targets
cd cli/selftune/grading && head -20 grade-session.ts

Repository: WellDunDun/selftune

Length of output: 638


🏁 Script executed:

# Check the bin/reins.cjs to understand the Node.js fallback setup
cat bin/reins.cjs | head -50

Repository: WellDunDun/selftune

Length of output: 108


🏁 Script executed:

# Search for other uses of isMain or import.meta.url in the codebase to understand patterns
rg "import.meta.url|Bun.main" --type ts --type js -B2 -A2

Repository: WellDunDun/selftune

Length of output: 675


Use import.meta.main instead of manual argument checking.

The current isMain check is overly complex. Bun natively supports import.meta.main to detect if a file is the entrypoint. Replace the manual checks with a simpler approach:

if (import.meta.main) {
  cliMain().catch((err) => {
    console.error(`[FATAL] ${err}`);
    process.exit(1);
  });
}

This is more reliable than comparing import.meta.url or checking process.argv suffixes, and it's already used consistently in cli/selftune/init.ts.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cli/selftune/grading/grade-session.ts` around lines 403 - 412, Replace the
manual entrypoint detection that defines isMain (the import.meta.url /
process.argv checks) with Bun's built-in import.meta.main; specifically remove
the isMain variable and use if (import.meta.main) to call cliMain().catch(...)
so the file-level guard around cliMain() uses import.meta.main instead of
import.meta.url/process.argv logic.

Comment on lines +23 to +27
const sessionId = payload.session_id ?? "unknown";
const transcriptPath = payload.transcript_path ?? "";
const cwd = payload.cwd ?? "";

const metrics = parseTranscript(transcriptPath);

⚠️ Potential issue | 🟡 Minor

Harden runtime payload normalization before transcript parsing.

On line 24, ?? only handles null/undefined; it does not validate types. If transcript_path arrives as a non-string, it propagates into parseTranscript(...) and can break the record-writing path on malformed hook input.

💡 Proposed fix
 export function processSessionStop(
-  payload: StopPayload,
+  payload: StopPayload | null | undefined,
   logPath: string = TELEMETRY_LOG,
 ): SessionTelemetryRecord {
-  const sessionId = payload.session_id ?? "unknown";
-  const transcriptPath = payload.transcript_path ?? "";
-  const cwd = payload.cwd ?? "";
+  const safePayload = (payload && typeof payload === "object" ? payload : {}) as Partial<StopPayload>;
+  const sessionId =
+    typeof safePayload.session_id === "string" && safePayload.session_id.length > 0
+      ? safePayload.session_id
+      : "unknown";
+  const transcriptPath =
+    typeof safePayload.transcript_path === "string" ? safePayload.transcript_path : "";
+  const cwd = typeof safePayload.cwd === "string" ? safePayload.cwd : "";
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
const sessionId = payload.session_id ?? "unknown";
const transcriptPath = payload.transcript_path ?? "";
const cwd = payload.cwd ?? "";
const metrics = parseTranscript(transcriptPath);
export function processSessionStop(
payload: StopPayload | null | undefined,
logPath: string = TELEMETRY_LOG,
): SessionTelemetryRecord {
const safePayload = (payload && typeof payload === "object" ? payload : {}) as Partial<StopPayload>;
const sessionId =
typeof safePayload.session_id === "string" && safePayload.session_id.length > 0
? safePayload.session_id
: "unknown";
const transcriptPath =
typeof safePayload.transcript_path === "string" ? safePayload.transcript_path : "";
const cwd = typeof safePayload.cwd === "string" ? safePayload.cwd : "";
const metrics = parseTranscript(transcriptPath);
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cli/selftune/hooks/session-stop.ts` around lines 23 - 27, The current
normalization uses nullish coalescing which doesn't guard types; ensure payload
values are type-checked and coerced to safe defaults before calling
parseTranscript. Replace the sessionId and cwd normalization with explicit
string checks (e.g., treat non-strings as the default "unknown" / empty string)
and compute transcriptPath with a guard like: if payload.transcript_path is a
string use it, otherwise set transcriptPath to ""; only call
parseTranscript(transcriptPath) when transcriptPath is a valid non-empty string
(or pass a validated empty string) so parseTranscript never receives non-string
input.
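The type-guard pattern described above can be factored into a small helper. This is a sketch under the review's assumptions; asString and normalizeStopPayload are hypothetical names, and the StopPayload fields are loosened to unknown to model untrusted hook input.

```typescript
// Hypothetical helper: non-string values fall back to a safe default
// instead of propagating into parseTranscript(...).
function asString(value: unknown, fallback = ""): string {
  return typeof value === "string" ? value : fallback;
}

interface StopPayload {
  session_id?: unknown;
  transcript_path?: unknown;
  cwd?: unknown;
}

function normalizeStopPayload(payload: StopPayload | null | undefined) {
  // Treat a non-object payload as empty rather than crashing on member access.
  const p: StopPayload = payload && typeof payload === "object" ? payload : {};
  return {
    // Empty-string session ids also collapse to "unknown".
    sessionId: asString(p.session_id, "unknown") || "unknown",
    transcriptPath: asString(p.transcript_path),
    cwd: asString(p.cwd),
  };
}
```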

Comment on lines +46 to +51
for (const entry of readdirSync(dir)) {
const skillDir = join(dir, entry);
if (statSync(skillDir).isDirectory() && existsSync(join(skillDir, "SKILL.md"))) {
names.add(entry);
}
}

⚠️ Potential issue | 🟡 Minor

Potential crash on broken symlinks in skill directory enumeration.

statSync(skillDir) on Line 48 can throw if the entry is a broken symlink or access is denied. Unlike findRolloutFiles (lines 68-72, 78-82, 88-92) which wraps statSync in try/catch, this code lacks protection.

🛠️ Suggested fix
   for (const entry of readdirSync(dir)) {
     const skillDir = join(dir, entry);
-    if (statSync(skillDir).isDirectory() && existsSync(join(skillDir, "SKILL.md"))) {
-      names.add(entry);
+    try {
+      if (statSync(skillDir).isDirectory() && existsSync(join(skillDir, "SKILL.md"))) {
+        names.add(entry);
+      }
+    } catch {
+      // Skip inaccessible entries (e.g., broken symlinks)
     }
   }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cli/selftune/ingestors/codex-rollout.ts` around lines 46 - 51, The loop that
enumerates skill directories uses statSync(skillDir) without error handling and
can throw on broken symlinks or permission errors; wrap the statSync and
existsSync check for SKILL.md in a try/catch around the body that computes
skillDir (the same pattern used in findRolloutFiles), and on error simply
continue to the next entry so that the names.add(entry) only runs when statSync
succeeds and the SKILL.md check passes (reference variables/idents:
readdirSync(dir) loop, skillDir, statSync, existsSync(join(skillDir,
"SKILL.md")), names.add(entry)).

Comment on lines +56 to +61
for (const entry of readdirSync(dir)) {
const skillDir = join(dir, entry);
if (statSync(skillDir).isDirectory() && existsSync(join(skillDir, "SKILL.md"))) {
names.add(entry);
}
}

⚠️ Potential issue | 🟡 Minor

Potential crash on broken symlinks in skill directory enumeration.

Same issue as in codex-rollout.ts: statSync(skillDir) on Line 58 can throw on broken symlinks or permission errors. Consider wrapping in try/catch for defensive handling.

🛠️ Suggested fix
   for (const entry of readdirSync(dir)) {
     const skillDir = join(dir, entry);
-    if (statSync(skillDir).isDirectory() && existsSync(join(skillDir, "SKILL.md"))) {
-      names.add(entry);
+    try {
+      if (statSync(skillDir).isDirectory() && existsSync(join(skillDir, "SKILL.md"))) {
+        names.add(entry);
+      }
+    } catch {
+      // Skip inaccessible entries
     }
   }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cli/selftune/ingestors/opencode-ingest.ts` around lines 56 - 61, The
directory enumeration loop using readdirSync and statSync can throw on broken
symlinks or permission errors; wrap the statSync(skillDir) call in a try/catch
and skip the entry on any error so the process doesn't crash—specifically guard
the block that calls statSync(skillDir) and existsSync(join(skillDir,
"SKILL.md")) (the loop that adds entry via names.add(entry)) and on catch simply
continue to the next entry, optionally logging a debug message.
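Since the same fix applies to both codex-rollout.ts and opencode-ingest.ts, the enumeration could be extracted into a shared helper. A sketch, with listSkillNames as a hypothetical name:

```typescript
import { existsSync, readdirSync, statSync } from "node:fs";
import { join } from "node:path";

// Hypothetical shared helper for both ingestors: statSync is wrapped so a
// broken symlink or permission error skips the entry instead of crashing
// the whole enumeration.
function listSkillNames(dir: string): Set<string> {
  const names = new Set<string>();
  if (!existsSync(dir)) return names;
  for (const entry of readdirSync(dir)) {
    const skillDir = join(dir, entry);
    try {
      if (statSync(skillDir).isDirectory() && existsSync(join(skillDir, "SKILL.md"))) {
        names.add(entry);
      }
    } catch {
      // Inaccessible entry (e.g. broken symlink): skip it.
    }
  }
  return names;
}
```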

Comment on lines +234 to +243
try {
const mod = await import("../evolution/rollback.js");
return mod.rollback;
} catch {
// Rollback module not yet available (being created by another agent)
return async () => ({
rolledBack: false,
restoredDescription: "",
reason: "Rollback module not available",
});

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

fd -t f "watch.ts" --path "*/selftune/monitoring/*"

Repository: WellDunDun/selftune

Length of output: 296


🏁 Script executed:

cat -n cli/selftune/monitoring/watch.ts | sed -n '230,250p'

Repository: WellDunDun/selftune

Length of output: 846


🌐 Web query:

Node.js ESM dynamic import error codes ERR_MODULE_NOT_FOUND vs syntax errors

💡 Result:

When import() fails in Node.js ESM, the error you get depends on which phase failed:

1) Resolution failure ⇒ ERR_MODULE_NOT_FOUND

If Node can’t resolve the specifier to an actual module file (bad path, missing file, missing mandatory extension for a relative/absolute import, package subpath not exported, etc.), the promise rejects with an Error whose .code is:

  • ERR_MODULE_NOT_FOUND — “A module file could not be resolved by the ECMAScript modules loader…” (nodejs.org)

A very common cause in ESM is that relative/absolute imports must include the file extension (and directory indexes must be fully specified), otherwise resolution can fail and you’ll see ERR_MODULE_NOT_FOUND. (nodejs.org)

2) Load/parse failure ⇒ SyntaxError (no ERR_MODULE_NOT_FOUND)

If Node does find the file, but then can’t parse it as the module format Node decided to use, you’ll typically get a JavaScript parse error like:

  • SyntaxError: Unexpected token 'export'

This often happens when you dynamic-import a file that contains ESM syntax (export ...) but Node is treating that file as CommonJS (e.g., .js without "type": "module" and without a .mjs extension). (sanity.io)

Key point about dynamic import()

Dynamic import() is supported in both CommonJS and ES modules—but the imported file still has to be recognized/loaded in the correct module format, otherwise you’ll hit the SyntaxError case above. (nodejs.org)

Practical check

try {
  await import(specifier);
} catch (e) {
  if (e && e.code === 'ERR_MODULE_NOT_FOUND') {
    // resolution problem (wrong path, missing extension, exports map, etc.)
  } else if (e instanceof SyntaxError) {
    // found the file, but failed to parse (format mismatch or real syntax error)
  }
  throw e;
}

Sources

[1] (nodejs.org) Node.js v20 “Errors” docs: ERR_MODULE_NOT_FOUND definition
[2] (nodejs.org) Node.js “ECMAScript modules” docs: mandatory file extensions
[3] (nodejs.org) Node.js “ECMAScript modules” docs: import() expressions support
[4] (sanity.io) Example of SyntaxError: Unexpected token 'export' from module-format mismatch (dynamic import into CJS context)

Citations:


🌐 Web query:

Bun ESM import() error handling ERR_MODULE_NOT_FOUND missing module

💡 Result:

In Bun, a missing module during runtime import() is handled the standard ESM way: import() returns a Promise that rejects if resolution/loading fails, so you can try/catch (with await) or .catch() the Promise [3].

// optional dependency pattern (runtime)
export async function loadOptional(spec: string) {
  try {
    return await import(spec);
  } catch (err: any) {
    if (err?.code === "ERR_MODULE_NOT_FOUND") return null; // module missing
    throw err; // other failures (syntax error, runtime error while evaluating, etc.)
  }
}

If you want to preflight the resolution (and still keep it optional), Bun supports import.meta.resolve(spec); you can resolve first, then import the resolved URL. If resolve fails, treat it as “not installed” [1].

export async function loadOptional(spec: string) {
  try {
    const url = import.meta.resolve(spec); // throws if it can't resolve [1]
    return await import(url);
  } catch (err: any) {
    if (err?.code === "ERR_MODULE_NOT_FOUND") return null;
    throw err;
  }
}

If you’re seeing ERR_MODULE_NOT_FOUND during bundling (e.g. bun build / Bun.build) rather than at runtime, the bundler tries to resolve imports at build-time; mark truly-optional deps as external so they’re left for runtime resolution (then the runtime import() catch above works) [2].

await Bun.build({
  entrypoints: ["./src/index.ts"],
  outdir: "./dist",
  external: ["some-optional-package"], // don’t resolve/bundle it [2]
});

Also check resolution basics in Bun: Bun supports extensionless relative imports and tries a list of extensions / index.* when resolving local files [1]. For packages/subpaths, ensure you’re importing a path that the package actually exports via package.json "exports" (otherwise it may not resolve even if installed) [1].

Sources: Bun module resolution + import.meta.resolve [1], Bun bundler external [2], import() Promise behavior (errors are catchable) [3].


Distinguish module-not-found from syntax/runtime errors in dynamic import.

The bare catch block at line 237 masks all import failures as "rollback module not available," hiding real defects like syntax errors or runtime failures in the rollback module itself. This can silently disable auto-rollback when it's actually needed.

Restructure to only catch and suppress ERR_MODULE_NOT_FOUND (module resolution failure), and rethrow all other errors:

🔧 Proposed fix
-  } catch {
-    // Rollback module not yet available (being created by another agent)
-    return async () => ({
-      rolledBack: false,
-      restoredDescription: "",
-      reason: "Rollback module not available",
-    });
-  }
+  } catch (error: unknown) {
+    if (error instanceof Error && (error as any).code === "ERR_MODULE_NOT_FOUND") {
+      // Rollback module not yet available (being created by another agent)
+      return async () => ({
+        rolledBack: false,
+        restoredDescription: "",
+        reason: "Rollback module not available",
+      });
+    }
+    throw error;
+  }
📝 Committable suggestion


Suggested change
try {
const mod = await import("../evolution/rollback.js");
return mod.rollback;
} catch {
// Rollback module not yet available (being created by another agent)
return async () => ({
rolledBack: false,
restoredDescription: "",
reason: "Rollback module not available",
});
try {
const mod = await import("../evolution/rollback.js");
return mod.rollback;
} catch (error: unknown) {
if (error instanceof Error && (error as any).code === "ERR_MODULE_NOT_FOUND") {
// Rollback module not yet available (being created by another agent)
return async () => ({
rolledBack: false,
restoredDescription: "",
reason: "Rollback module not available",
});
}
throw error;
}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cli/selftune/monitoring/watch.ts` around lines 234 - 243, The current catch
in the dynamic import in watch.ts swallows all errors and treats them as
"rollback module not available"; change the catch so it only suppresses
module-resolution failures (check error.code === "ERR_MODULE_NOT_FOUND" or
equivalent for Node ESM) and returns the fallback async rollback only in that
case, but rethrow any other errors (syntax/runtime) so they surface; update the
import block around the dynamic import("../evolution/rollback.js") and the
returned mod.rollback to implement this selective error handling.

…eanup

- Fix stdout/stderr deadlock in runCommand via Promise.all concurrent draining
- Fix backup sort order in rollback to handle timestamped vs plain .bak correctly
- Add format tolerance in findOriginalFromAudit for plain description strings
- Wrap JSON.parse in gradeViaAgent/gradeViaApi with descriptive error context
- Replace manual isMain detection with import.meta.main in grade-session
- Add runtime validation for loadExpectationsFromEvalsJson parsed structure
- Add type guards for session-stop payload fields (typeof string checks)
- Add debug-gated error logging in session-stop catch block
- Wrap statSync in try/catch for skill enumeration in codex-rollout and opencode-ingest
- Explicitly construct SessionTelemetryRecord in codex-rollout (no spread)
- Add logType arguments to appendJsonl calls in opencode-ingest
- Parameterize SQL WHERE clause for sinceTs in opencode-ingest
- Selective error handling in watch.ts dynamic import (only suppress MODULE_NOT_FOUND)
- Strict numeric parsing with regex validation for watch.ts CLI inputs
- Precompile word-boundary regex in codex-wrapper skill matching
- Wrap JSON.parse for config in init.ts with try/catch and clear error
- Replace require("node:os") with ESM import in observability
- Add timeout to execSync in checkHookInstallation
- Extract shared validateJsonlFile helper in observability
- Make observability test assertions deterministic (no conditional skips)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>