Merged
33 changes: 33 additions & 0 deletions CHANGELOG.md
@@ -1,5 +1,38 @@
# Changelog

## 0.34.0 — 2026-05-23

### Eval evolution-tracking — first-class `AgentProfile` + per-cell scorecard

The headline shift: a feature PR's eval can now answer the question a single
run cannot — *did this change regress persona P on profile F, even while the
aggregate improved?*

- **`AgentProfile` + `agentProfileHash`** — the harness's unit of variation.
The model lives inside the profile, so "same model, different skills" is two
profiles. The hash ignores skill/tool order and excludes the `id` label from
identity. (#78)
- **Append-only JSONL scorecard** keyed `(scenarioId, profileHash)` —
`recordRuns` / `recordRunsToScorecard` / `loadScorecard`. Appends are
idempotent on `eventId`, so concurrent campaign runs cannot clobber each
other's rows. (#78)
- **`diffScorecard`** — per-cell verdict (`improved` / `regressed` / `flat` /
`new`) using Cohen's d + Welch's t-test; the keystone CI guard is
`diff.cells.filter(c => c.verdict === 'regressed')`. `formatScorecardDiff`
renders the PR-facing report. (#78)
- **Agent profile cells** — `src/agent-profile-cell.ts` extends the profile
contract into `RunRecord` rows and `runEvalCampaign` so every campaign row
is keyed by `(profile, scenario, seed)` end-to-end. (#79)
- **Stats consolidation** — `pairedBootstrap`, power analysis, and the
paired/Welch primitives now all live in `src/statistics.ts`. (#73)
- **LLM retry classifier unified** across `llm-client` and `judge-retry`
via `isTransientLlmError`. (#74)
- **`pr-review-benchmark` source committed** — the module was exported from
`index.ts` since the run-record refactor but the source files were never
committed; CI on `main` has been red on #78/#79/#81 as a result. (#83)
- **Examples**: `scorecard/`, `held-out-gate/`, `user-simulation-driver/`. (#81)
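
The identity rule described for `agentProfileHash` above (model included, skill/tool order ignored, `id` excluded) can be sketched as follows. This is a minimal illustration, not the package's implementation: the `AgentProfileSketch` shape, the function names, and the FNV-1a hash are all assumptions chosen to keep the example dependency-free.

```typescript
// Hypothetical profile shape; the real AgentProfile fields may differ.
interface AgentProfileSketch {
  id: string; // human-readable label, deliberately left out of identity
  model: string;
  skills: string[];
  tools: string[];
}

// Build an order-insensitive canonical string. Sorting copies of the arrays
// means ["x", "y"] and ["y", "x"] serialize identically.
function canonicalProfileKey(p: AgentProfileSketch): string {
  return JSON.stringify({
    model: p.model,
    skills: [...p.skills].sort(),
    tools: [...p.tools].sort(),
  });
}

// FNV-1a (32-bit) over UTF-16 code units. A stand-in for whatever hash the
// library actually uses; a production hash would likely be cryptographic.
function fnv1a32(s: string): string {
  let h = 0x811c9dc5;
  for (let i = 0; i < s.length; i++) {
    h ^= s.charCodeAt(i);
    h = Math.imul(h, 0x01000193);
  }
  return (h >>> 0).toString(16);
}

function profileHashSketch(p: AgentProfileSketch): string {
  return fnv1a32(canonicalProfileKey(p));
}
```

Under this sketch, two profiles that differ only in `id` or in the order of `skills`/`tools` land in the same scorecard cell, while changing the model or the skill set produces a different canonical key.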

No breaking changes — additive across the board.
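
For readers unfamiliar with the statistics behind the `diffScorecard` verdicts, here is a rough sketch of how a per-cell verdict could combine Cohen's d with Welch's t statistic. Everything in it is an assumption for illustration: the function names, the thresholds (`minEffect`, `minAbsT`), and the use of a fixed |t| cutoff in place of a proper p-value from the t distribution are not taken from the library.

```typescript
type Verdict = "improved" | "regressed" | "flat" | "new";

function mean(xs: number[]): number {
  return xs.reduce((a, b) => a + b, 0) / xs.length;
}

// Unbiased sample variance (divides by n - 1); callers must pass n >= 2.
function sampleVariance(xs: number[]): number {
  const m = mean(xs);
  return xs.reduce((a, x) => a + (x - m) * (x - m), 0) / (xs.length - 1);
}

// Hypothetical verdict rule: both the effect size and the test statistic
// must clear their thresholds before a cell counts as changed.
function cellVerdictSketch(
  base: number[],
  head: number[],
  minEffect = 0.5, // assumed Cohen's d threshold
  minAbsT = 2, // assumed |t| cutoff, a crude substitute for a p-value
): Verdict {
  if (base.length === 0) return "new"; // no baseline runs for this cell
  if (base.length < 2 || head.length < 2) return "flat"; // too few runs to test

  const diff = mean(head) - mean(base);
  const vb = sampleVariance(base);
  const vh = sampleVariance(head);

  // Welch's standard error: variances are not pooled.
  const se = Math.sqrt(vb / base.length + vh / head.length);
  if (se === 0) return diff === 0 ? "flat" : diff > 0 ? "improved" : "regressed";

  // Cohen's d uses the pooled standard deviation.
  const pooled = Math.sqrt(
    ((base.length - 1) * vb + (head.length - 1) * vh) /
      (base.length + head.length - 2),
  );
  const d = diff / pooled;
  const t = diff / se;

  if (Math.abs(d) < minEffect || Math.abs(t) < minAbsT) return "flat";
  return d > 0 ? "improved" : "regressed";
}
```

A CI guard in the spirit of the changelog entry would then fail the build whenever any cell's verdict is `"regressed"`, even if the aggregate score went up.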

## 0.33.0 — 2026-05-21

### Release — `decideNextUserTurn` in the published tarball
2 changes: 1 addition & 1 deletion clients/python/pyproject.toml
@@ -4,7 +4,7 @@ build-backend = "hatchling.build"

[project]
name = "agent-eval-rpc"
-version = "0.33.0"
+version = "0.34.0"
description = "Python RPC client for @tangle-network/agent-eval — judge content against rubrics over HTTP or stdio RPC. Eval logic runs in the Node runtime; this package is a thin wire client."
readme = "README.md"
requires-python = ">=3.10"
2 changes: 1 addition & 1 deletion package.json
@@ -1,6 +1,6 @@
{
"name": "@tangle-network/agent-eval",
-"version": "0.33.0",
+"version": "0.34.0",
"description": "Substrate for self-improving agents: traces, verifiable rewards, preferences, GEPA / reflective mutation, auto-research, replay, sequential anytime-valid stats, and release gates.",
"homepage": "https://github.com/tangle-network/agent-eval#readme",
"repository": {