feat(reporting): add researchReport executive summary layer by drewstone · Pull Request #34 · tangle-network/agent-eval

drewstone · 2026-05-08T14:59:58Z

Summary

researchReport composes the lower-level reporting primitives (summaryTable, paretoChart, gainHistogram, held-out GateDecisions, optional failureClusterView) into one structured launch-decision artifact. Verdicts are made on paired evidence — never on marginal means alone — and respect any held-out gate the caller passes through.
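
To make "paired evidence" concrete, the following is a minimal sketch, not the library's implementation, of computing per-task paired deltas and a percentile-bootstrap CI on their median. The names `pairedDeltas`, `bootstrapMedianCi`, and `mulberry32`, the task-id-keyed score maps, and the resample count are all assumptions made for illustration.

```typescript
// Hypothetical shape: task id -> score for one system.
type Scores = Record<string, number>;

// Small seeded PRNG (mulberry32) so the sketch is reproducible.
function mulberry32(seed: number): () => number {
  let a = seed >>> 0;
  return () => {
    a = (a + 0x6d2b79f5) >>> 0;
    let t = a;
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

// Keep only tasks that both systems ran; delta = candidate - comparator.
function pairedDeltas(candidate: Scores, comparator: Scores): number[] {
  const deltas: number[] = [];
  for (const [taskId, score] of Object.entries(candidate)) {
    const base = comparator[taskId];
    if (base !== undefined) deltas.push(score - base);
  }
  return deltas;
}

function median(xs: number[]): number {
  const s = [...xs].sort((a, b) => a - b);
  const mid = Math.floor(s.length / 2);
  return s.length % 2 === 1 ? s[mid] : (s[mid - 1] + s[mid]) / 2;
}

// Percentile bootstrap: resample deltas with replacement, take the median
// of each resample, and read the CI off the sorted medians.
function bootstrapMedianCi(
  deltas: number[],
  resamples = 2000,
  level = 0.95,
  rng: () => number = mulberry32(1),
): [number, number] {
  const n = deltas.length;
  if (n === 0) throw new Error("no paired observations");
  const stats: number[] = [];
  for (let b = 0; b < resamples; b++) {
    const sample = Array.from({ length: n }, () => deltas[Math.floor(rng() * n)]);
    stats.push(median(sample));
  }
  stats.sort((a, b) => a - b);
  const alpha = (1 - level) / 2;
  const lo = stats[Math.floor(alpha * resamples)];
  const hi = stats[Math.min(resamples - 1, Math.ceil((1 - alpha) * resamples) - 1)];
  return [lo, hi];
}
```

Because the deltas are computed per task before any aggregation, between-task variance cancels out, which is why a paired CI can separate two systems whose marginal means overlap.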

Decision rule

In order (first match wins):

1. Comparator itself → hold (baseline).
2. No comparator → hold if on the cost/quality Pareto frontier, else needs_more_data.
3. Held-out gate ≠ promote → reject (gate is necessary, not sufficient).
4. Paired N < 6 (RESEARCH_REPORT_HARD_PAIR_FLOOR) → needs_more_data.
5. ROPE configured AND paired-delta CI ⊂ ROPE → equivalent.
6. Paired-delta CI upper bound < 0 → reject.
7. Paired N < minPairs (default 20) → needs_more_data with MDE attached.
8. BH-adjusted q ≤ fdr AND paired CI excludes zero → promote.
9. Otherwise → hold.
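
The cascade above can be sketched as a single first-match-wins function. This is an illustrative reading of the rule list rather than the shipped code: the `CandidateEvidence` and `DecisionOptions` shapes, the `decide` name, and the treatment of a missing gate (skipped) and of ROPE containment (inclusive bounds) are assumptions.

```typescript
// Hypothetical types for illustration; the real researchReport types may differ.
type Verdict = "promote" | "hold" | "reject" | "equivalent" | "needs_more_data";

interface CandidateEvidence {
  isComparator: boolean;     // this candidate is the baseline itself
  hasComparator: boolean;    // a comparator exists to pair against
  onParetoFrontier: boolean; // cost vs quality
  heldOutGate?: string;      // gate label, when the caller supplies one
  pairedN: number;           // number of paired observations
  ciLower: number;           // paired-delta CI bounds
  ciUpper: number;
  qValue: number;            // BH-adjusted q-value
}

interface DecisionOptions {
  fdr: number;               // FDR threshold (no default assumed here)
  minPairs?: number;         // soft floor, default 20 per the PR text
  rope?: [number, number];   // region of practical equivalence, if configured
}

const HARD_PAIR_FLOOR = 6;   // mirrors RESEARCH_REPORT_HARD_PAIR_FLOOR

function decide(e: CandidateEvidence, opts: DecisionOptions): Verdict {
  const minPairs = opts.minPairs ?? 20;
  if (e.isComparator) return "hold";                                        // rule 1
  if (!e.hasComparator) return e.onParetoFrontier ? "hold" : "needs_more_data"; // rule 2
  // Assumption: an absent gate is skipped rather than treated as a failure.
  if (e.heldOutGate !== undefined && e.heldOutGate !== "promote") return "reject"; // rule 3
  if (e.pairedN < HARD_PAIR_FLOOR) return "needs_more_data";                // rule 4
  if (opts.rope && e.ciLower >= opts.rope[0] && e.ciUpper <= opts.rope[1]) {
    return "equivalent";                                                    // rule 5
  }
  if (e.ciUpper < 0) return "reject";                                       // rule 6
  if (e.pairedN < minPairs) return "needs_more_data";                       // rule 7
  // After rule 6 the CI cannot lie wholly below zero, so "excludes zero"
  // reduces to a strictly positive lower bound.
  if (e.qValue <= opts.fdr && e.ciLower > 0) return "promote";              // rule 8
  return "hold";                                                            // rule 9
}
```

Ordering matters: the gate check sits ahead of every statistical rule, so strong paired statistics cannot rescue a candidate that failed held-out evaluation, and the hard floor sits ahead of the ROPE and reject rules so that no verdict is issued on fewer than 6 pairs.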

What every candidate carries

BH-FDR-adjusted Wilcoxon q-value, Cohen's d, marginal CI (from summaryTable).
Bootstrap CI on the median paired delta (from gainHistogram).
Bayesian-bootstrap-style Pr(Δ > 0) and Pr(Δ ∈ ROPE) on the mean paired delta (Rubin 1981 bootstrap-prior duality).
Minimum detectable paired effect at the configured power / α via the new pairedMde primitive in power-analysis.
On-Pareto-frontier flag (cost vs quality).
Held-out gate label, when supplied.
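
For readers unfamiliar with the BH-FDR adjustment behind the q-value, here is a minimal sketch of the standard Benjamini-Hochberg step-up procedure applied to a list of raw p-values (for example, one Wilcoxon p-value per candidate). The function name `benjaminiHochberg` is hypothetical, and summaryTable's actual implementation may differ in detail.

```typescript
// Benjamini-Hochberg adjusted p-values ("q-values").
// q_(k) = min over j >= k of ( p_(j) * m / j ), where p_(1) <= ... <= p_(m).
function benjaminiHochberg(pValues: number[]): number[] {
  const m = pValues.length;
  // Sort ascending while remembering each p-value's original position.
  const ordered = pValues
    .map((p, index) => ({ p, index }))
    .sort((a, b) => a.p - b.p);
  const q = new Array<number>(m).fill(1);
  // Walk from the largest p-value down, carrying the running minimum so
  // the adjusted values stay monotone in the raw p-values.
  ordered.reduceRight((running, entry, k) => {
    const adjusted = Math.min(running, (entry.p * m) / (k + 1));
    q[entry.index] = adjusted;
    return adjusted;
  }, 1);
  return q;
}

// Example: four candidates tested against the same comparator.
const qValues = benjaminiHochberg([0.01, 0.04, 0.03, 0.005]);
console.log(qValues.map((v) => v.toFixed(3)));
```

A candidate's q-value is then compared against the configured fdr threshold in the promote rule, so adding more candidates to a report raises the bar for each of them.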

What every report carries

runFingerprint — SHA-256 over the canonicalised input run set (stable across input order).
preregistrationHash — optional, links a signed HypothesisManifest so reviewers can verify the analysis was the preregistered one rather than post-hoc.
methodology — structured assumptions / methods / alternatives / when-not-to-apply / citations, also embedded in the rendered markdown.
Standalone methodology companion at docs/research-report-methodology.md.
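
As an illustration of how an order-independent fingerprint can be built, here is a sketch that canonicalises each run to JSON with sorted keys, sorts the serialised runs, and hashes the result with SHA-256 through Web Crypto. It is an assumption about the general approach, not a copy of hashJson; `canonicalJson` and `runFingerprintSketch` are hypothetical names, and the library's canonical form may differ.

```typescript
type Json = null | boolean | number | string | Json[] | { [key: string]: Json };

// Serialise with object keys sorted so key order never affects the hash.
function canonicalJson(value: Json): string {
  if (value === null || typeof value !== "object") return JSON.stringify(value);
  if (Array.isArray(value)) return `[${value.map(canonicalJson).join(",")}]`;
  const keys = Object.keys(value).sort();
  const body = keys
    .map((k) => `${JSON.stringify(k)}:${canonicalJson(value[k] as Json)}`)
    .join(",");
  return `{${body}}`;
}

// SHA-256 over the sorted canonical runs, returned as lowercase hex.
// Uses the global Web Crypto API (browsers, Deno, Bun, recent Node.js).
async function runFingerprintSketch(runs: Json[]): Promise<string> {
  const canonical = runs.map(canonicalJson).sort().join("\n");
  const bytes = new TextEncoder().encode(canonical);
  const digest = await crypto.subtle.digest("SHA-256", bytes);
  return Array.from(new Uint8Array(digest), (b) => b.toString(16).padStart(2, "0")).join("");
}
```

`crypto.subtle.digest` returns a promise, which is why any function that computes the fingerprint this way has to be async.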

Citations: Benjamini & Hochberg 1995; Wilcoxon 1945; Efron 1979; Rubin 1981; Kruschke 2018; Howard et al. 2021 (background reading on always-valid extensions).

API change

researchReport is now async — it uses Web Crypto via hashJson for the run fingerprint. No prior consumers exist (this PR is the first time the function ships).

researchReport is exported from src/index.ts (root barrel) and from the @tangle-network/agent-eval/reporting subpath, along with RESEARCH_REPORT_HARD_PAIR_FLOOR and ResearchReportMethodology.

Adds a CLAUDE.md authorship directive (no AI-attribution trailers in this repo).

Test plan

pnpm typecheck clean
pnpm build produces dist/reporting.{js,d.ts} with the new exports
pnpm test — 837 / 837 (11 dedicated researchReport cases covering: promote, needs-more-data + MDE, hard-floor below 6 pairs, ROPE → equivalent, gate override, fingerprint determinism + preregistration passthrough, failure-cluster surfacing — plus the existing summaryTable / paretoChart / gainHistogram coverage).
Spot-check rendered markdown / HTML in a downstream consumer before tagging a release.

researchReport composes summaryTable, paretoChart, gainHistogram, held-out gate decisions, and optional failureClusterView output into one structured artifact for coding-vertical benchmark runs:

- promote/hold/reject decision with rationale, risks, next actions
- per-candidate stats (Wilcoxon q-value, Cohen's d, paired N, gain CI)
- Pareto frontier flagging
- markdown, HTML, and JSON chart specs

Wired into src/index.ts and the @tangle-network/agent-eval/reporting subpath. Tests in tests/summary-report.test.ts cover decision logic, candidate scoring, and output formats.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Upgrades the researchReport that landed in #34 from a senior-applied-DS deliverable to one a launch reviewer or peer reviewer can sign off on. Decisions are now made on paired evidence consistently; the report carries its own methodology, provenance, and actionable next-step inference.

Decision rule (paired-delta only; never marginal means inside the paired framework):

1. Comparator → hold (baseline)
2. No comparator → hold if on Pareto frontier else needs_more_data
3. Held-out gate ≠ promote → reject (gate is necessary, not sufficient)
4. Paired N < 6 (RESEARCH_REPORT_HARD_PAIR_FLOOR) → needs_more_data
5. ROPE configured AND paired CI ⊂ ROPE → equivalent
6. Paired CI upper < 0 → reject
7. Paired N < minPairs (default 20) → needs_more_data + MDE attached
8. q ≤ fdr AND CI lower > 0 → promote
9. Otherwise → hold

Per-candidate additions:

- Bayesian-bootstrap-style Pr(Δ>0), Pr(Δ∈ROPE) on mean paired delta (Rubin 1981 bootstrap-prior duality)
- Minimum detectable paired effect at the configured power / α via the new pairedMde primitive in power-analysis
- meanGain alongside the existing medianGain

Per-report additions:

- runFingerprint: SHA-256 over the canonicalised input run set
- preregistrationHash: optional, links a signed HypothesisManifest
- methodology: assumptions, methods, alternatives, when-not-to-apply, citations (Benjamini & Hochberg 1995; Wilcoxon 1945; Efron 1979; Rubin 1981; Kruschke 2018), embedded in markdown and as structured data

Standalone methodology companion at docs/research-report-methodology.md.

Behavioural changes:

- Default minPairs raised from 6 to 20 (soft floor); hard floor held at 6.
- Reject rule no longer mixes paired delta with marginal means.
- Held-out gate ≠ promote now reliably overrides paired stats.
- researchReport is async (Web Crypto via hashJson for the fingerprint).

Adds a CLAUDE.md authorship directive (no AI-attribution trailers).

837/837 tests passing, with 11 dedicated researchReport cases including: hard floor below 6 pairs, ROPE → equivalent, gate override, fingerprint determinism + preregistration passthrough, MDE in the needs_more_data reason, plus the existing summaryTable / paretoChart / gainHistogram coverage.

drewstone merged commit 214b0f0 into main May 8, 2026

drewstone mentioned this pull request May 8, 2026

feat(reporting): elevate researchReport to launch-review grade #35

Merged

4 tasks

drewstone mentioned this pull request May 8, 2026

feat: 0.21.0 — capture integrity for launch-grade benchmark runs #36

Merged

5 tasks

drewstone merged 1 commit into main from feat/research-report
