feat(reporting): add researchReport executive summary layer #34

Merged: drewstone merged 1 commit into main from feat/research-report on May 8, 2026

drewstone commented May 8, 2026

Summary

researchReport composes the lower-level reporting primitives (summaryTable, paretoChart, gainHistogram, held-out GateDecisions, optional failureClusterView) into one structured launch-decision artifact. Verdicts are made on paired evidence — never on marginal means alone — and respect any held-out gate the caller passes through.

Decision rule

In order — first match wins:

  1. Comparator itself → hold (baseline).
  2. No comparator → hold if on the cost/quality Pareto frontier, else needs_more_data.
  3. Held-out gate ≠ promote → reject (gate is necessary, not sufficient).
  4. Paired N < 6 (RESEARCH_REPORT_HARD_PAIR_FLOOR) → needs_more_data.
  5. ROPE configured AND paired-delta CI ⊂ ROPE → equivalent.
  6. Paired-delta CI upper bound < 0 → reject.
  7. Paired N < minPairs (default 20) → needs_more_data with MDE attached.
  8. BH-adjusted q ≤ fdr AND paired CI excludes zero → promote.
  9. Otherwise → hold.
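
The nine-step cascade above can be sketched as a single first-match-wins function. This is an illustrative reading of the rule list, not the shipped implementation: the `CandidateEvidence` and `DecideOptions` shapes, the field names, the inclusive ROPE bounds, and the `fdr` value of 0.05 in the usage line are all assumptions made for the example. Only the verdict labels, the hard floor of 6, and the default `minPairs` of 20 come from the description.

```typescript
// Hypothetical shapes for illustration; the real researchReport types may differ.
type Verdict = "promote" | "hold" | "reject" | "equivalent" | "needs_more_data";

interface CandidateEvidence {
  isComparator: boolean;     // this candidate is the baseline itself
  hasComparator: boolean;    // a comparator exists to pair against
  onParetoFrontier: boolean; // cost vs quality frontier flag
  heldOutGate?: string;      // held-out gate label, when supplied
  pairedN: number;           // number of paired observations
  ciLower: number;           // paired-delta CI lower bound
  ciUpper: number;           // paired-delta CI upper bound
  qValue: number;            // BH-adjusted q-value
}

interface DecideOptions {
  minPairs: number;                     // soft floor (the PR's default is 20)
  fdr: number;                          // false-discovery-rate threshold
  rope?: { low: number; high: number }; // region of practical equivalence
}

const HARD_PAIR_FLOOR = 6; // mirrors RESEARCH_REPORT_HARD_PAIR_FLOOR

function decide(c: CandidateEvidence, opts: DecideOptions): Verdict {
  // 1. The comparator is the baseline, so it is held.
  if (c.isComparator) return "hold";
  // 2. Without a comparator only the Pareto flag is informative.
  if (!c.hasComparator) return c.onParetoFrontier ? "hold" : "needs_more_data";
  // 3. A supplied gate that is not "promote" overrides the paired statistics.
  if (c.heldOutGate !== undefined && c.heldOutGate !== "promote") return "reject";
  // 4. Hard floor on paired sample size.
  if (c.pairedN < HARD_PAIR_FLOOR) return "needs_more_data";
  // 5. CI entirely inside the ROPE (inclusive bounds assumed here).
  if (opts.rope && c.ciLower >= opts.rope.low && c.ciUpper <= opts.rope.high) {
    return "equivalent";
  }
  // 6. CI entirely below zero.
  if (c.ciUpper < 0) return "reject";
  // 7. Soft floor; the real report attaches an MDE to this verdict.
  if (c.pairedN < opts.minPairs) return "needs_more_data";
  // 8. Significant after BH adjustment and CI strictly above zero.
  if (c.qValue <= opts.fdr && c.ciLower > 0) return "promote";
  // 9. Everything else.
  return "hold";
}

const example: CandidateEvidence = {
  isComparator: false, hasComparator: true, onParetoFrontier: false,
  pairedN: 30, ciLower: 0.02, ciUpper: 0.08, qValue: 0.01,
};
const verdict = decide(example, { minPairs: 20, fdr: 0.05 }); // "promote"
```

Because the checks run in order, a candidate with 10 pairs and a CI entirely below zero is rejected at step 6 before the soft floor at step 7 is ever consulted.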

What every candidate carries

  • BH-FDR-adjusted Wilcoxon q-value, Cohen's d, marginal CI (from summaryTable).
  • Bootstrap CI on the median paired delta (from gainHistogram).
  • Bayesian-bootstrap-style Pr(Δ > 0) and Pr(Δ ∈ ROPE) on the mean paired delta (Rubin 1981 bootstrap-prior duality).
  • Minimum detectable paired effect at the configured power / α via the new pairedMde primitive in power-analysis.
  • On-Pareto-frontier flag (cost vs quality).
  • Held-out gate label, when supplied.
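
For readers unfamiliar with the q-values in the first bullet, the Benjamini-Hochberg step-up adjustment can be sketched as below. This is the textbook procedure, not necessarily how summaryTable computes it; the function name `benjaminiHochberg` is hypothetical, and the input is assumed to be one raw Wilcoxon p-value per candidate.

```typescript
// Textbook Benjamini-Hochberg adjustment (illustrative; not the library's code).
// Returns one q-value per input p-value, in the original input order.
function benjaminiHochberg(pValues: number[]): number[] {
  const m = pValues.length;
  // Remember each p-value's original position, then sort ascending.
  const order = pValues
    .map((p, index) => ({ p, index }))
    .sort((a, b) => a.p - b.p);
  const q = new Array<number>(m);
  // Walk from the largest p-value down, enforcing monotonicity with a
  // running minimum and capping the result at 1.
  let running = 1;
  for (let rank = m; rank >= 1; rank--) {
    const entry = order[rank - 1]!;
    running = Math.min(running, (entry.p * m) / rank);
    q[entry.index] = running;
  }
  return q;
}

// Four hypothetical candidates' raw p-values.
const qValues = benjaminiHochberg([0.01, 0.04, 0.03, 0.005]);
```

A candidate then clears rule 8 of the decision rule only when its adjusted value is at or below the configured `fdr` and its paired CI also excludes zero.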

What every report carries

  • runFingerprint — SHA-256 over the canonicalised input run set (stable across input order).
  • preregistrationHash — optional, links a signed HypothesisManifest so reviewers can verify the analysis was the preregistered one rather than post-hoc.
  • methodology — structured assumptions / methods / alternatives / when-not-to-apply / citations, also embedded in the rendered markdown.
  • Standalone methodology companion at docs/research-report-methodology.md.
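
The order-stability property of runFingerprint can be illustrated with a small sketch. It assumes a runtime that exposes a global `crypto.subtle` (browsers and recent Node.js versions); `canonicalise` and `fingerprintRuns` are hypothetical names, and the repository's actual `hashJson` helper may canonicalise differently.

```typescript
// Illustrative order-stable SHA-256 fingerprint (not the repo's hashJson).
type Json = null | boolean | number | string | Json[] | { [key: string]: Json };

// Serialise with object keys sorted so key order never affects the hash.
function canonicalise(value: Json): string {
  if (value === null || typeof value !== "object") return JSON.stringify(value);
  if (Array.isArray(value)) return `[${value.map(canonicalise).join(",")}]`;
  const entries = Object.entries(value).sort(([a], [b]) => (a < b ? -1 : a > b ? 1 : 0));
  return `{${entries.map(([k, v]) => `${JSON.stringify(k)}:${canonicalise(v)}`).join(",")}}`;
}

async function fingerprintRuns(runs: Json[]): Promise<string> {
  // Sorting the canonical strings makes the digest independent of run order.
  const canonical = runs.map(canonicalise).sort();
  const bytes = new TextEncoder().encode(`[${canonical.join(",")}]`);
  const digest = await crypto.subtle.digest("SHA-256", bytes);
  // Hex-encode the 32-byte digest.
  return Array.from(new Uint8Array(digest), (b) => b.toString(16).padStart(2, "0")).join("");
}
```

Because `crypto.subtle.digest` returns a promise, any function that computes the fingerprint this way has to be async, which is consistent with the API change noted below.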

Citations: Benjamini & Hochberg 1995; Wilcoxon 1945; Efron 1979; Rubin 1981; Kruschke 2018; Howard et al. 2021 (background reading on always-valid extensions).

API change

researchReport is now async — it uses Web Crypto via hashJson for the run fingerprint. No prior consumers exist (this PR is the first time the function ships).

researchReport is wired into src/index.ts (root barrel) and the @tangle-network/agent-eval/reporting subpath, along with the RESEARCH_REPORT_HARD_PAIR_FLOOR and ResearchReportMethodology exports.

Adds a CLAUDE.md authorship directive (no AI-attribution trailers in this repo).

Test plan

  • pnpm typecheck clean
  • pnpm build produces dist/reporting.{js,d.ts} with the new exports
  • pnpm test: 837 / 837 passing (11 dedicated researchReport cases covering: promote, needs-more-data + MDE, hard-floor below 6 pairs, ROPE → equivalent, gate override, fingerprint determinism + preregistration passthrough, failure-cluster surfacing — plus the existing summaryTable / paretoChart / gainHistogram coverage).
  • Spot-check rendered markdown / HTML in a downstream consumer before tagging a release.

researchReport composes summaryTable, paretoChart, gainHistogram, held-out
gate decisions, and optional failureClusterView output into one structured
artifact for coding-vertical benchmark runs:

- promote/hold/reject decision with rationale, risks, next actions
- per-candidate stats (Wilcoxon q-value, Cohen's d, paired N, gain CI)
- Pareto frontier flagging
- markdown, HTML, and JSON chart specs

Wired into src/index.ts and the @tangle-network/agent-eval/reporting
subpath. Tests in tests/summary-report.test.ts cover decision logic,
candidate scoring, and output formats.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

drewstone merged commit 214b0f0 into main on May 8, 2026

drewstone added a commit that referenced this pull request on May 8, 2026:

Upgrades the researchReport that landed in #34 from a senior-applied-DS
deliverable to one a launch reviewer or peer reviewer can sign off on.
Decisions are now made on paired evidence consistently; the report carries
its own methodology, provenance, and actionable next-step inference.

Decision rule (paired-delta-only — never marginal means inside the paired
framework):

  1. Comparator → hold (baseline)
  2. No comparator → hold if on Pareto frontier else needs_more_data
  3. Held-out gate ≠ promote → reject (gate is necessary, not sufficient)
  4. Paired N < 6 (RESEARCH_REPORT_HARD_PAIR_FLOOR) → needs_more_data
  5. ROPE configured AND paired CI ⊂ ROPE → equivalent
  6. Paired CI upper < 0 → reject
  7. Paired N < minPairs (default 20) → needs_more_data + MDE attached
  8. q ≤ fdr AND CI lower > 0 → promote
  9. Otherwise → hold

Per-candidate additions:
  - Bayesian-bootstrap-style Pr(Δ>0), Pr(Δ∈ROPE) on mean paired delta
    (Rubin 1981 bootstrap-prior duality)
  - Minimum detectable paired effect at the configured power / α via the
    new pairedMde primitive in power-analysis
  - meanGain alongside the existing medianGain

Per-report additions:
  - runFingerprint: SHA-256 over the canonicalised input run set
  - preregistrationHash: optional, links a signed HypothesisManifest
  - methodology: assumptions, methods, alternatives, when-not-to-apply,
    citations (Benjamini & Hochberg 1995; Wilcoxon 1945; Efron 1979;
    Rubin 1981; Kruschke 2018) — embedded in markdown and as structured data

Standalone methodology companion at docs/research-report-methodology.md.

Behavioural changes:
  - Default minPairs raised from 6 to 20 (soft floor); hard floor held at 6.
  - Reject rule no longer mixes paired delta with marginal means.
  - Held-out gate ≠ promote now reliably overrides paired stats.
  - researchReport is async (Web Crypto via hashJson for the fingerprint).

Adds a CLAUDE.md authorship directive (no AI-attribution trailers).

837/837 tests passing — 11 dedicated researchReport cases including: hard
floor below 6 pairs, ROPE → equivalent, gate override, fingerprint
determinism + preregistration passthrough, MDE in the needs_more_data
reason, plus the existing summaryTable / paretoChart / gainHistogram
coverage.