feat(eval): add agent profile cells by drewstone · Pull Request #79 · tangle-network/agent-eval

drewstone · 2026-05-22T18:52:24Z

Summary

- Add compact `AgentProfileCell` builder, validator, and hash helpers that fingerprint the canonical source profile instead of duplicating the runtime profile shape.
- Stamp optional agent profile cells onto `RunRecord` and `runEvalCampaign` outputs, with model/prompt contradiction checks.
- Add run-level profile assertion and grouping helpers for longitudinal persona sweeps.
- Document product adoption, using the sandbox `AgentProfile` as the source profile artifact.
- Harden the existing analyst `runStream` test so latency jitter no longer makes the full suite flaky.
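Fingerprinting the source profile, rather than copying its shape, amounts to content addressing: identical profiles must always hash to the same id. The PR does not show its hashing scheme, so the sketch below assumes one common approach (key-sorted canonical JSON hashed with SHA-256). `fingerprintProfile` and the `Json` type are hypothetical names, not the package's API.

```typescript
import { createHash } from "node:crypto";

// Any JSON-compatible value; assumed to cover the source profile's shape.
type Json = null | boolean | number | string | Json[] | { [key: string]: Json };

// Serialize with object keys sorted at every level, so two profiles that
// differ only in key order produce the same string.
function canonicalJson(value: Json): string {
  if (Array.isArray(value)) {
    return `[${value.map((item) => canonicalJson(item)).join(",")}]`;
  }
  if (value !== null && typeof value === "object") {
    const obj: { [key: string]: Json } = value;
    const entries = Object.keys(obj)
      .sort()
      .map((key) => `${JSON.stringify(key)}:${canonicalJson(obj[key])}`);
    return `{${entries.join(",")}}`;
  }
  return JSON.stringify(value);
}

// Hypothetical helper: the content address is the SHA-256 hex digest of the canonical form.
export function fingerprintProfile(profile: Json): string {
  return createHash("sha256").update(canonicalJson(profile)).digest("hex");
}
```

Because the id is derived from content, any change to the source profile (a new model, a different prompt hash) yields a new cell id, which is what makes grouping runs by profile across a longitudinal sweep safe.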

Verification

- `pnpm exec vitest run tests/agent-profile-cell.test.ts tests/run-record.test.ts tests/eval-campaign.test.ts`
- `pnpm exec vitest run src/analyst/analyst.test.ts -t "run() returns the same envelope"`
- `pnpm typecheck`
- `pnpm test`
- `pnpm build`
- `pnpm lint` (passes with existing warnings)
- `git diff --check`
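The PR does not say how the analyst `runStream` test was hardened against latency jitter. One common approach is to compare two runs only after stripping wall-clock fields. The sketch below assumes hypothetical timing keys (`latencyMs`, `startedAt`, `finishedAt`) and a hypothetical `sameEnvelopeIgnoringTiming` helper; the real envelope shape is not shown.

```typescript
import { isDeepStrictEqual } from "node:util";

// Field names assumed to carry wall-clock timing (hypothetical).
const TIMING_KEYS = new Set(["latencyMs", "startedAt", "finishedAt"]);

// Recursively drop timing fields so two runs are compared on content only.
function stripTiming(value: unknown): unknown {
  if (Array.isArray(value)) return value.map(stripTiming);
  if (value !== null && typeof value === "object") {
    const out: Record<string, unknown> = {};
    for (const [key, inner] of Object.entries(value)) {
      if (!TIMING_KEYS.has(key)) out[key] = stripTiming(inner);
    }
    return out;
  }
  return value;
}

// Deep equality that ignores key order and the timing fields above.
export function sameEnvelopeIgnoringTiming(a: unknown, b: unknown): boolean {
  return isDeepStrictEqual(stripTiming(a), stripTiming(b));
}
```

Comparing content rather than timings keeps the assertion meaningful while removing the source of flakiness; a timing bound, if one is needed, belongs in a separate, generously toleranced check.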

tangletools · 2026-05-22T19:33:55Z

✅ No Blockers — `a8ec3e26`

Readiness 93/100 · Confidence 97/100 · 4 findings (4 low)

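| Metric | kimi-code | deepseek | aggregate |
| --- | --- | --- | --- |
| Readiness | 93 | 95 | 93 |
| Confidence | 97 | 98 | 97 |
| Correctness | 93 | 97 | 93 |
| Security | 92 | 98 | 92 |
| Testing | 91 | 92 | 91 |
| Architecture | 90 | 95 | 90 |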

Read every changed file and callee (`pre-registration.ts`, `errors.ts`). All 1282 tests pass and `tsc` is clean. The PR replaces agent-profile + scorecard with a richer agent-profile-cell module, integrates it into eval-campaign and run-record with validation at both boundaries, and removes dead exports. No runtime defects found.

Comprehensive replacement of AgentProfile + Scorecard with a content-addressed AgentProfileCell system. Reads every changed file, runs the full test suite (1282/1282 pass), and verifies typecheck + build. No bugs, no stale references, no missing error handling. Thorough normalization…

🟡 LOW isAgentProfileCell uses duck-typing rather than branded discriminator — src/agent-profile-cell.ts

The type guard at line 607 checks `'schemaVersion' in input && 'cellId' in input`, which distinguishes `AgentProfileCell` from `AgentProfileCellInput` by the presence of duck-typed properties. This works with the current types (`AgentProfileCellInput` has neither property), but adding a field named `cellId` to `AgentProfileCellInput` in the future would silently break the guard. A `kind: 'built' | 'input'` discriminator would be more robust. Low severity: the current types are safe.
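The discriminator suggested above could look like the following sketch. The two interfaces are simplified, hypothetical stand-ins (the real types carry more fields), and `resolveCell` only mirrors the build-or-verify dispatch the review describes; it is not the package's code.

```typescript
// Hypothetical, simplified shapes; the real AgentProfileCell types carry more fields.
interface AgentProfileCellInput {
  kind: "input";
  profileId: string;
}

interface AgentProfileCell {
  kind: "built";
  schemaVersion: number;
  cellId: string;
  profileId: string;
}

type CellOrInput = AgentProfileCell | AgentProfileCellInput;

// The guard keys off an explicit tag, so adding a field such as
// cellId to the input type later cannot change its result.
function isAgentProfileCell(value: CellOrInput): value is AgentProfileCell {
  return value.kind === "built";
}

// Hypothetical dispatch: built cells pass through (real code would verify
// their hash here); inputs are built into a cell with a placeholder id.
function resolveCell(value: CellOrInput): AgentProfileCell {
  if (isAgentProfileCell(value)) return value;
  return {
    kind: "built",
    schemaVersion: 1,
    cellId: `placeholder:${value.profileId}`,
    profileId: value.profileId,
  };
}
```

With a tagged union the compiler also narrows `value` in each branch, so a caller cannot accidentally pass an input where a built cell is required.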

🟡 LOW isAgentProfileCell type guard can misidentify invalid objects — src/eval-campaign.ts

Lines 607-611: `isAgentProfileCell` checks only `'schemaVersion' in input && 'cellId' in input`. An `AgentProfileCellInput` that happens to carry these keys at runtime would be misidentified, so `verifyAgentProfileCell` would run (and throw) instead of `buildAgentProfileCell`. In practice this only affects callers who violate the type contract, so the impact is minimal.

🟡 LOW Breaking API surface removal without deprecation — src/index.ts

The PR removes public exports for scorecard, agent-profile, and pr-review-benchmark modules. While the files are gone and internal references are cleaned up, external consumers importing these will break on upgrade. At v0.33.0 this is acceptable, but the CHANGELOG should call out the breaking change explicitly.
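Besides a CHANGELOG entry, a conventional way to soften such a removal is to keep the old names for a release as forwarding aliases tagged with JSDoc `@deprecated`, which the TypeScript language service surfaces in editors as struck-through names. The function names below are hypothetical illustrations, not the package's exports.

```typescript
// Hypothetical replacement API (stands in for the new agent-profile-cell helpers).
export function buildCellFingerprint(profileId: string): string {
  return `cell:${profileId}`;
}

/**
 * @deprecated Use buildCellFingerprint instead. This alias only forwards to
 * the replacement so existing imports keep compiling (hypothetical name).
 */
export function legacyProfileFingerprint(profileId: string): string {
  return buildCellFingerprint(profileId);
}

/** @deprecated Kept for existing type imports (hypothetical name). */
export type LegacyProfileId = string;
```

Forwarding to the replacement, rather than keeping a second implementation, guarantees the deprecated path cannot drift from the new one before it is deleted.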

🟡 LOW Test coverage gap for edge-case validation inputs — tests/agent-profile-cell.test.ts

The validation test at lines 93-99 only checks an empty `profileId`. Missing cases: empty harness id, invalid MCP transport value, malformed model object, and empty prompt hash. The normalization functions handle these correctly (confirmed by code review), but no test exercises the error paths. Low severity: runtime behavior is correct, and adding these cases would improve coverage confidence.
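The missing error-path cases lend themselves to a table-driven test. The sketch below is self-contained, so `CellInput`, `validateCellInput`, and the allowed transport set are hypothetical stand-ins for the real normalization code, which is not shown in the PR.

```typescript
// Hypothetical, simplified input shape; field names follow the review's list.
interface CellInput {
  profileId: string;
  harness: { id: string };
  mcp: { transport: string }[];
  model: { provider: string; id: string };
  promptHash: string;
}

const TRANSPORTS = new Set(["stdio", "http", "sse"]); // assumed allowed values

// Throws on the first invalid field, naming it in the message.
function validateCellInput(input: CellInput): void {
  if (!input.profileId.trim()) throw new Error("profileId must be non-empty");
  if (!input.harness.id.trim()) throw new Error("harness.id must be non-empty");
  for (const server of input.mcp) {
    if (!TRANSPORTS.has(server.transport)) {
      throw new Error(`invalid MCP transport: ${server.transport}`);
    }
  }
  if (typeof input.model?.provider !== "string" || typeof input.model?.id !== "string" || !input.model.id) {
    throw new Error("model must have string provider and non-empty id");
  }
  if (!input.promptHash.trim()) throw new Error("promptHash must be non-empty");
}

const valid: CellInput = {
  profileId: "p",
  harness: { id: "h" },
  mcp: [{ transport: "stdio" }],
  model: { provider: "x", id: "m" },
  promptHash: "abc",
};

// One case per error path the review lists as untested.
const badCases: [string, CellInput][] = [
  ["empty harness id", { ...valid, harness: { id: "" } }],
  ["invalid MCP transport", { ...valid, mcp: [{ transport: "carrier-pigeon" }] }],
  ["malformed model", { ...valid, model: { provider: "x" } as unknown as CellInput["model"] }],
  ["empty prompt hash", { ...valid, promptHash: "" }],
];

// Names of bad cases that did NOT throw; an empty array means every path is covered.
export function casesThatDidNotThrow(): string[] {
  return badCases
    .filter(([, input]) => {
      try {
        validateCellInput(input);
        return true;
      } catch {
        return false;
      }
    })
    .map(([name]) => name);
}
```

In the project's Vitest suite the same table would feed `it.each(badCases)("rejects %s", (_name, input) => expect(() => validateCellInput(input)).toThrow())`, with the real builder in place of the stand-in validator.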

_tangletools · 2026-05-22T19:40:27Z · trace_

tangletools

✅ Approved — 4 non-blocking findings — `a8ec3e26`


Full findings and scores: review summary


src/index.ts has exported `PrReviewAuditCase`, `scorePrReviewComments`, `summarizePrReviewBenchmark`, et al. from `./pr-review-benchmark` since the run-record refactor landed, but `src/pr-review-benchmark.ts` and its co-located test were authored locally and never committed. A fresh clone fails typecheck; CI on main has been red on #78, #79, and #81. The files were already typecheck-clean, biome-clean, and the 5 co-located tests pass. No content changes — only `git add`.

@deprecated

- Restore `agent-profile`, `scorecard`, and `pr-review-benchmark` as deprecated stubs to avoid breaking the public API, and re-add their exports to `index.ts` with `@deprecated` annotations.
- Add an optional `seed` parameter to `confidenceInterval` in `statistics.ts` to fix the non-deterministic bootstrap (it used `Math.random` with no seed option, unlike `pairedBootstrap`, which already had one).
- Fix a silently swallowed git error in the `auto-pr.ts` `ghCliClient`: the `git branch -D` command called `exec()` directly and ignored all errors. It now ignores only the expected "branch not found" error and surfaces unexpected failures.
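A seeded bootstrap needs only a deterministic PRNG in place of `Math.random`. The sketch below assumes a percentile bootstrap of the mean and uses the well-known mulberry32 generator; the real `confidenceInterval` signature, defaults, and generator in `statistics.ts` are not shown in the PR, so the options object here is hypothetical.

```typescript
// mulberry32: small deterministic 32-bit PRNG returning floats in [0, 1).
function mulberry32(seed: number): () => number {
  let a = seed >>> 0;
  return () => {
    a = (a + 0x6d2b79f5) >>> 0;
    let t = a;
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

// Percentile bootstrap CI for the mean (hypothetical signature).
// Passing `seed` makes resampling reproducible; omitting it falls back to Math.random.
export function confidenceInterval(
  samples: number[],
  opts: { level?: number; resamples?: number; seed?: number } = {},
): [number, number] {
  const { level = 0.95, resamples = 2000, seed } = opts;
  if (samples.length === 0) throw new Error("samples must be non-empty");
  const rand = seed === undefined ? Math.random : mulberry32(seed);

  // Resample with replacement and record each resample's mean.
  const means: number[] = [];
  for (let r = 0; r < resamples; r++) {
    let sum = 0;
    for (let i = 0; i < samples.length; i++) {
      sum += samples[Math.floor(rand() * samples.length)];
    }
    means.push(sum / samples.length);
  }

  // Read the interval off the sorted bootstrap distribution.
  means.sort((x, y) => x - y);
  const alpha = (1 - level) / 2;
  const lo = means[Math.floor(alpha * (resamples - 1))];
  const hi = means[Math.ceil((1 - alpha) * (resamples - 1))];
  return [lo, hi];
}
```

Keeping the seed optional preserves existing call sites, while tests and recorded eval campaigns can pass a fixed seed and get byte-identical intervals on every run.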

tangletools · 2026-05-22T23:29:45Z

head=0e032d1a

Auto-repair succeeded — `0e032d1a`

rounds: 2/3
implementer: opencode/kimi-for-coding/k2p6
readiness: 88 → 94
final verdict: no-blockers

Agent summary:

The audit findings have already been addressed in commit `0e032d1` on the current branch:

1. **Breaking API changes** — Restored `agent-profile`, `scorecard`, and `pr-review-benchmark` exports to `index.ts` with `@deprecated` annotations
2. **Non-deterministic bootstrap** — Added optional `seed` parameter to `confidenceInterval` in `statistics.ts`
3. **Silently-swallowed git error** — Fixed `auto-pr.ts` to only ignore the expected "branch not found" error and surface unexpected failures
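The third fix can be sketched as a delete that tolerates only the missing-branch case and rethrows everything else. The runner is injected so the policy is testable without a repository; `deleteLocalBranch` and `GitRunner` are hypothetical names, and matching git's `error: branch '<name>' not found` text is an assumption that can vary with git version or locale.

```typescript
import { execFile } from "node:child_process";
import { promisify } from "node:util";

const execFileAsync = promisify(execFile);

// Injectable runner so the error policy can be tested without git.
type GitRunner = (args: string[]) => Promise<void>;

const defaultRunner: GitRunner = async (args) => {
  await execFileAsync("git", args);
};

// git reports a missing branch as "error: branch '<name>' not found."
function isBranchNotFound(err: unknown): boolean {
  const stderr = (err as { stderr?: unknown })?.stderr;
  const text = typeof stderr === "string" ? stderr : String((err as Error)?.message ?? "");
  return /branch '.*' not found/.test(text);
}

// Delete a local branch. Returns false if it did not exist; rethrows any other failure.
export async function deleteLocalBranch(name: string, run: GitRunner = defaultRunner): Promise<boolean> {
  try {
    await run(["branch", "-D", name]);
    return true;
  } catch (err) {
    if (isBranchNotFound(err)) return false;
    throw err;
  }
}
```

If message matching feels brittle, checking `git show-ref --verify --quiet refs/heads/<name>` first and deleting only when it exits 0 avoids parsing stderr altogether.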

All 1307 tests pass, typecheck is clean, and the branch is already pushed to `origin/feat/agent-profile-cell`.

_tangletools auto-repair · #79_

tangletools previously approved these changes May 22, 2026


feat(eval): add agent profile cells · `1d06056`

drewstone dismissed tangletools’s stale review via 1d06056 May 22, 2026 20:29

drewstone force-pushed the feat/agent-profile-cell branch from a8ec3e2 to 1d06056 May 22, 2026 20:29

drewstone merged commit 92408bc into main May 22, 2026
1 check failed

drewstone mentioned this pull request May 22, 2026 in #83 (merged): fix: commit pr-review-benchmark source — restores green CI on main

drewstone mentioned this pull request May 22, 2026 in #84 (merged): chore(0.34.0): release — eval scorecard + agent profile cells

drewstone merged 1 commit into `main` from `feat/agent-profile-cell`
