feat(reporting): add researchReport executive summary layer #34
Merged
researchReport composes summaryTable, paretoChart, gainHistogram, held-out gate decisions, and optional failureClusterView output into one structured artifact for coding-vertical benchmark runs:

- promote/hold/reject decision with rationale, risks, next actions
- per-candidate stats (Wilcoxon q-value, Cohen's d, paired N, gain CI)
- Pareto frontier flagging
- markdown, HTML, and JSON chart specs

Wired into src/index.ts and the @tangle-network/agent-eval/reporting subpath. Tests in tests/summary-report.test.ts cover decision logic, candidate scoring, and output formats.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
drewstone added a commit that referenced this pull request on May 8, 2026
Upgrades the researchReport that landed in #34 from a senior-applied-DS deliverable to one a launch reviewer or peer reviewer can sign off on. Decisions are now made on paired evidence consistently; the report carries its own methodology, provenance, and actionable next-step inference.

Decision rule (paired-delta-only, never marginal means inside the paired framework):

1. Comparator → hold (baseline)
2. No comparator → hold if on Pareto frontier else needs_more_data
3. Held-out gate ≠ promote → reject (gate is necessary, not sufficient)
4. Paired N < 6 (RESEARCH_REPORT_HARD_PAIR_FLOOR) → needs_more_data
5. ROPE configured AND paired CI ⊂ ROPE → equivalent
6. Paired CI upper < 0 → reject
7. Paired N < minPairs (default 20) → needs_more_data + MDE attached
8. q ≤ fdr AND CI lower > 0 → promote
9. Otherwise → hold

Per-candidate additions:

- Bayesian-bootstrap-style Pr(Δ>0), Pr(Δ∈ROPE) on mean paired delta (Rubin 1981 bootstrap-prior duality)
- Minimum detectable paired effect at the configured power / α via the new pairedMde primitive in power-analysis
- meanGain alongside the existing medianGain

Per-report additions:

- runFingerprint: SHA-256 over the canonicalised input run set
- preregistrationHash: optional, links a signed HypothesisManifest
- methodology: assumptions, methods, alternatives, when-not-to-apply, citations (Benjamini & Hochberg 1995; Wilcoxon 1945; Efron 1979; Rubin 1981; Kruschke 2018); embedded in markdown and as structured data

Standalone methodology companion at docs/research-report-methodology.md.

Behavioural changes:

- Default minPairs raised from 6 to 20 (soft floor); hard floor held at 6.
- Reject rule no longer mixes paired delta with marginal means.
- Held-out gate ≠ promote now reliably overrides paired stats.
- researchReport is async (Web Crypto via hashJson for the fingerprint).

Adds a CLAUDE.md authorship directive (no AI-attribution trailers).
837/837 tests passing — 11 dedicated researchReport cases including: hard floor below 6 pairs, ROPE → equivalent, gate override, fingerprint determinism + preregistration passthrough, MDE in the needs_more_data reason, plus the existing summaryTable / paretoChart / gainHistogram coverage.
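The nine-step decision rule in the commit message above can be read as a first-match-wins chain. The following is a minimal sketch of that reading, not the code shipped in this PR: the `CandidateEvidence` and `DecisionOptions` shapes, the `sketchDecide` name, and the field names are all hypothetical. It also assumes that a missing held-out gate does not trigger rule 3, and that "CI ⊂ ROPE" means inclusive containment.

```typescript
// Hypothetical sketch of the first-match-wins decision rule.
// Types and names are illustrative assumptions, not the library's API.
type Verdict = "promote" | "hold" | "reject" | "equivalent" | "needs_more_data";

interface CandidateEvidence {
  isComparator: boolean;       // this candidate is the baseline
  hasComparator: boolean;      // the report has a baseline to pair against
  onParetoFrontier: boolean;
  gate?: "promote" | "hold" | "reject"; // held-out gate decision, if supplied
  pairedN: number;             // number of paired observations
  ciLower: number;             // paired-delta confidence interval
  ciUpper: number;
  qValue: number;              // FDR-adjusted p-value
}

interface DecisionOptions {
  minPairs: number;            // soft floor (the PR's default is 20)
  fdr: number;                 // FDR threshold for q
  rope?: [number, number];     // region of practical equivalence, optional
}

const HARD_PAIR_FLOOR = 6;     // mirrors RESEARCH_REPORT_HARD_PAIR_FLOOR

function sketchDecide(c: CandidateEvidence, opts: DecisionOptions): Verdict {
  if (c.isComparator) return "hold";                                   // 1
  if (!c.hasComparator) {
    return c.onParetoFrontier ? "hold" : "needs_more_data";            // 2
  }
  if (c.gate !== undefined && c.gate !== "promote") return "reject";   // 3
  if (c.pairedN < HARD_PAIR_FLOOR) return "needs_more_data";           // 4
  if (opts.rope && c.ciLower >= opts.rope[0] && c.ciUpper <= opts.rope[1]) {
    return "equivalent";                                               // 5
  }
  if (c.ciUpper < 0) return "reject";                                  // 6
  if (c.pairedN < opts.minPairs) return "needs_more_data";             // 7
  if (c.qValue <= opts.fdr && c.ciLower > 0) return "promote";         // 8
  return "hold";                                                       // 9
}
```

Ordering matters in this sketch: because the gate check precedes every statistical check, a candidate with a strong paired delta is still rejected when the gate says anything other than promote, and a candidate with 10 pairs and a clearly positive CI lands on needs_more_data rather than promote.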
Summary
researchReport composes the lower-level reporting primitives (summaryTable, paretoChart, gainHistogram, held-out GateDecisions, optional failureClusterView) into one structured launch-decision artifact. Verdicts are made on paired evidence, never on marginal means alone, and respect any held-out gate the caller passes through.

Decision rule
In order — first match wins:
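1. Comparator → hold (baseline).
2. No comparator → hold if on the cost/quality Pareto frontier, else needs_more_data.
3. Held-out gate ≠ promote → reject (gate is necessary, not sufficient).
4. Paired N < 6 (RESEARCH_REPORT_HARD_PAIR_FLOOR) → needs_more_data.
5. ROPE configured AND paired CI ⊂ ROPE → equivalent.
6. Paired CI upper < 0 → reject.
7. Paired N < minPairs (default 20) → needs_more_data with MDE attached.
8. q ≤ fdr AND paired CI excludes zero → promote.
9. Otherwise → hold.

What every candidate carries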
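- Paired stats: Wilcoxon q-value, Cohen's d, paired N, gain CI (summaryTable).
- meanGain alongside the existing medianGain (gainHistogram).
- Bayesian-bootstrap-style Pr(Δ > 0) and Pr(Δ ∈ ROPE) on the mean paired delta (Rubin 1981 bootstrap-prior duality).
- Minimum detectable paired effect at the configured power / α via the new pairedMde primitive in power-analysis.

What every report carries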
- runFingerprint: SHA-256 over the canonicalised input run set (stable across input order).
- preregistrationHash: optional; links a signed HypothesisManifest so reviewers can verify the analysis was the preregistered one rather than post hoc.
- methodology: structured assumptions / methods / alternatives / when-not-to-apply / citations, also embedded in the rendered markdown.
- Standalone methodology companion at docs/research-report-methodology.md.

Citations: Benjamini & Hochberg 1995; Wilcoxon 1945; Efron 1979; Rubin 1981; Kruschke 2018; Howard et al. 2021 (background reading on always-valid extensions).
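To illustrate what "stable across input order" can mean for runFingerprint, here is a minimal sketch under stated assumptions. The PR does not show how hashJson canonicalises its input, so the `canonicalJson` and `sketchRunFingerprint` helpers below are hypothetical: they serialise each run with sorted object keys, sort the serialised runs, and hash the result with SHA-256 through Node's Web Crypto implementation.

```typescript
// Hypothetical sketch: an order-insensitive SHA-256 fingerprint over a run set.
// Not the library's hashJson; the canonicalisation scheme here is an assumption.
import { webcrypto } from "node:crypto";

type Json = null | boolean | number | string | Json[] | { [key: string]: Json };

/** Serialise JSON with object keys sorted, so key order never changes the output. */
function canonicalJson(value: Json): string {
  if (value === null || typeof value !== "object") return JSON.stringify(value);
  if (Array.isArray(value)) return `[${value.map(canonicalJson).join(",")}]`;
  const entries = Object.keys(value)
    .sort()
    .map((key) => `${JSON.stringify(key)}:${canonicalJson(value[key])}`);
  return `{${entries.join(",")}}`;
}

/** Hash the sorted canonical forms of all runs; returns a lowercase hex digest. */
async function sketchRunFingerprint(runs: Json[]): Promise<string> {
  // Sorting the per-run strings makes the digest independent of input order.
  const canonical = runs.map(canonicalJson).sort().join("\n");
  const bytes = new TextEncoder().encode(canonical);
  const digest = await webcrypto.subtle.digest("SHA-256", bytes);
  return Array.from(new Uint8Array(digest), (b) => b.toString(16).padStart(2, "0")).join("");
}
```

Because `subtle.digest` returns a promise, any function that computes a fingerprint this way has to be async, which is consistent with the API change described below.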
API change
researchReport is now async: it uses Web Crypto via hashJson for the run fingerprint. No prior consumers exist (this PR is the first time the function ships).

Wired into src/index.ts (root barrel) and the @tangle-network/agent-eval/reporting subpath, plus RESEARCH_REPORT_HARD_PAIR_FLOOR and ResearchReportMethodology.

Adds a CLAUDE.md authorship directive (no AI-attribution trailers in this repo).

Test plan
- pnpm typecheck clean
- pnpm build produces dist/reporting.{js,d.ts} with the new exports
- pnpm test: 837 / 837 (11 dedicated researchReport cases covering: promote, needs-more-data + MDE, hard floor below 6 pairs, ROPE → equivalent, gate override, fingerprint determinism + preregistration passthrough, failure-cluster surfacing; plus the existing summaryTable / paretoChart / gainHistogram coverage).
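For readers unfamiliar with the MDE attached to needs_more_data verdicts, the sketch below shows one common way to compute a minimum detectable paired effect: the normal-approximation formula MDE = (z₁₋α/₂ + z_power) · σ_Δ / √n for a two-sided test. This is an assumption about the general technique, not a description of the pairedMde primitive, whose signature and method are not shown in this PR; `approxPairedMde` and its helpers are hypothetical names.

```typescript
// Hypothetical sketch: normal-approximation MDE for a paired design.
// Not the pairedMde primitive from power-analysis.

/** Standard normal CDF using the Abramowitz-Stegun 7.1.26 erf approximation. */
function normalCdf(z: number): number {
  const x = Math.abs(z) / Math.SQRT2;
  const t = 1 / (1 + 0.3275911 * x);
  const poly =
    t * (0.254829592 + t * (-0.284496736 + t * (1.421413741 + t * (-1.453152027 + t * 1.061405429))));
  const erf = 1 - poly * Math.exp(-x * x);
  return z >= 0 ? 0.5 * (1 + erf) : 0.5 * (1 - erf);
}

/** Standard normal quantile found by bisection on normalCdf; p must be in (0, 1). */
function normalQuantile(p: number): number {
  if (!(p > 0 && p < 1)) throw new RangeError("p must be in (0, 1)");
  let lo = -10;
  let hi = 10;
  for (let i = 0; i < 80; i++) {
    const mid = (lo + hi) / 2;
    if (normalCdf(mid) < p) lo = mid;
    else hi = mid;
  }
  return (lo + hi) / 2;
}

/** Smallest mean paired delta detectable at the given two-sided alpha and power. */
function approxPairedMde(sdOfDeltas: number, nPairs: number, alpha = 0.05, power = 0.8): number {
  if (nPairs < 2) throw new RangeError("need at least 2 pairs");
  if (sdOfDeltas < 0) throw new RangeError("standard deviation must be non-negative");
  const zAlpha = normalQuantile(1 - alpha / 2);
  const zPower = normalQuantile(power);
  return ((zAlpha + zPower) * sdOfDeltas) / Math.sqrt(nPairs);
}
```

Under these assumptions, 20 pairs at α = 0.05 and 80% power can only detect a mean paired delta of roughly 0.63 standard deviations of the per-pair differences, which helps explain why a verdict below the soft floor carries the MDE rather than a promote or reject.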