feat(eval): add agent profile cells#79
Conversation
✅ No Blockers —
| | kimi-code | deepseek | aggregate |
|---|---|---|---|
| Readiness | 93 | 95 | 93 |
| Confidence | 97 | 98 | 97 |
| Correctness | 93 | 97 | 93 |
| Security | 92 | 98 | 92 |
| Testing | 91 | 92 | 91 |
| Architecture | 90 | 95 | 90 |
Read every changed file and callee (pre-registration.ts, errors.ts). All 1282 tests pass and tsc is clean. The PR replaces agent-profile + scorecard with a richer agent-profile-cell module, integrates it into eval-campaign and run-record with validation at both boundaries, and removes dead exports. No runtime defects found. | Comprehensive replacement of AgentProfile + Scorecard with content-addressed AgentProfileCell system. Reads every changed file, runs full test suite (1282/1282 pass), verifies typecheck + build. No bugs, no stale references, no missing error handling. Thorough normalizati
🟡 LOW isAgentProfileCell uses duck-typing rather than branded discriminator — src/agent-profile-cell.ts
The type guard at line 607 checks `'schemaVersion' in input && 'cellId' in input`, which distinguishes AgentProfileCell from AgentProfileCellInput by duck-typed property presence. This works correctly with the current types (AgentProfileCellInput has neither property), but adding a field named `cellId` to AgentProfileCellInput in the future would silently break the type guard. A `kind: 'built' | 'input'` discriminator would be more robust. Low severity: current types are safe.
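A minimal sketch of the discriminator approach suggested in this finding. The type shapes below are simplified and hypothetical (the real types in src/agent-profile-cell.ts carry more fields), and the `kind` field does not exist in the PR; it is the assumption being illustrated.

```typescript
// Hypothetical, simplified shapes for illustration only.
interface AgentProfileCellInput {
  kind: 'input';
  profileId: string;
}

interface AgentProfileCell {
  kind: 'built';
  schemaVersion: number;
  cellId: string;
  profileId: string;
}

type CellOrInput = AgentProfileCell | AgentProfileCellInput;

// The guard reads the explicit tag instead of probing for property presence,
// so adding a `cellId` field to the input type later cannot change its answer.
function isAgentProfileCell(input: CellOrInput): input is AgentProfileCell {
  return input.kind === 'built';
}

// An input that happens to carry a stray `cellId` key at runtime.
const raw = { kind: 'input' as const, profileId: 'p1', cellId: 'stray' };
const stray: CellOrInput = raw;

const built: CellOrInput = {
  kind: 'built',
  schemaVersion: 1,
  cellId: 'abc123',
  profileId: 'p1',
};

console.log(isAgentProfileCell(stray)); // false: classified by `kind`, not by key presence
console.log(isAgentProfileCell(built)); // true
```

The trade-off is that every producer of either shape has to set `kind`, which is a wider change than the current two-property check.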
🟡 LOW isAgentProfileCell type guard can misidentify invalid objects — src/eval-campaign.ts
Lines 607-611: `isAgentProfileCell` checks only `'schemaVersion' in input && 'cellId' in input`. An `AgentProfileCellInput` that happens to carry these keys at runtime would be misidentified, causing `verifyAgentProfileCell` to throw rather than `buildAgentProfileCell` to run. In practice this only affects callers who violate the type contract, so impact is minimal.
🟡 LOW Breaking API surface removal without deprecation — src/index.ts
The PR removes public exports for scorecard, agent-profile, and pr-review-benchmark modules. While the files are gone and internal references are cleaned up, external consumers importing these will break on upgrade. At v0.33.0 this is acceptable, but the CHANGELOG should call out the breaking change explicitly.
🟡 LOW Test coverage gap for edge-case validation inputs — tests/agent-profile-cell.test.ts
The validation test at lines 93-99 only checks empty `profileId`. Test cases are missing for: empty harness id, invalid MCP transport value, malformed model object, and empty prompt hash. The normalization functions handle these correctly (confirmed by code review), but no test exercises the error paths. Low severity: runtime behavior is correct; adding these cases would improve coverage confidence.
tangletools · 2026-05-22T19:40:27Z · trace
tangletools
left a comment
✅ Approved — 4 non-blocking findings — a8ec3e26
Read every changed file and callee (pre-registration.ts, errors.ts). All 1282 tests pass and tsc is clean. The PR replaces agent-profile + scorecard with a richer agent-profile-cell module, integrates it into eval-campaign and run-record with validation at both boundaries, and removes dead exports. No runtime defects found. | Comprehensive replacement of AgentProfile + Scorecard with content-addre
Full findings and scores: review summary
tangletools · 2026-05-22T19:40:27Z · trace
a8ec3e2 to
1d06056
src/index.ts has exported `PrReviewAuditCase`, `scorePrReviewComments`, `summarizePrReviewBenchmark`, et al. from `./pr-review-benchmark` since the run-record refactor landed, but `src/pr-review-benchmark.ts` and its co-located test were authored locally and never committed. A fresh clone fails typecheck; CI on main has been red on #78, #79, and #81. The files were already typecheck-clean, biome-clean, and the 5 co-located tests pass. No content changes — only `git add`.
- Restore agent-profile, scorecard, and pr-review-benchmark as deprecated stubs to prevent breaking API surface changes. Re-add exports to index.ts with `@deprecated` annotations.
- Add optional seed parameter to `confidenceInterval` in statistics.ts to fix non-deterministic bootstrap (was using `Math.random` without a seed option, unlike `pairedBootstrap`, which already had one).
- Fix silently-swallowed git error in auto-pr.ts `ghCliClient`: the `git branch -D` command used `exec()` directly and ignored ALL errors. Now it only ignores the expected 'branch not found' error and surfaces unexpected failures.
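
For the seeded-bootstrap item above, a rough sketch of how an optional seed can make a percentile bootstrap reproducible. Everything here is an assumption for illustration: the options shape, the `Interval` type, and the mulberry32 generator are not taken from statistics.ts, whose actual signature and PRNG are not shown in this PR.

```typescript
// mulberry32: a small seedable PRNG returning values in [0, 1).
// Swapped in for illustration; the PR does not say which generator is used.
function mulberry32(seed: number): () => number {
  let a = seed >>> 0;
  return () => {
    a = (a + 0x6d2b79f5) >>> 0;
    let t = a;
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

interface Interval {
  lower: number;
  upper: number;
}

// Hypothetical signature: percentile bootstrap CI for the mean.
function confidenceInterval(
  samples: number[],
  opts: { level?: number; resamples?: number; seed?: number } = {},
): Interval {
  if (samples.length === 0) throw new Error('samples must be non-empty');
  const { level = 0.95, resamples = 1000, seed } = opts;
  // Without a seed, fall back to Math.random (the old, non-deterministic path).
  const rand = seed === undefined ? Math.random : mulberry32(seed);

  const means: number[] = [];
  for (let i = 0; i < resamples; i++) {
    let sum = 0;
    for (let j = 0; j < samples.length; j++) {
      sum += samples[Math.floor(rand() * samples.length)];
    }
    means.push(sum / samples.length);
  }
  means.sort((x, y) => x - y);

  const alpha = (1 - level) / 2;
  return {
    lower: means[Math.floor(alpha * (resamples - 1))],
    upper: means[Math.ceil((1 - alpha) * (resamples - 1))],
  };
}

const data = [0.71, 0.64, 0.8, 0.69, 0.75, 0.66, 0.78];
const first = confidenceInterval(data, { seed: 42 });
const second = confidenceInterval(data, { seed: 42 });
console.log(first.lower === second.lower && first.upper === second.upper); // true
```

Passing the same seed yields the same resamples and therefore the same interval, which is what lets a test assert on exact bounds.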
head=0e032d1a
Auto-repair succeeded —
|
Summary
Verification