feat(reporting): elevate researchReport to launch-review grade by drewstone · Pull Request #35 · tangle-network/agent-eval

drewstone · 2026-05-08T15:25:26Z

Summary

Follow-up to #34. Elevates researchReport from a senior-applied-DS deliverable to one that a launch reviewer or peer reviewer can sign off on, by closing the methodological gaps called out in the post-merge review.

What changed

Decision rule — now paired-delta-only. The previous reject rule mixed paired evidence with marginal means; that's fixed. Authoritative order:

Comparator → hold (baseline).
No comparator → hold on Pareto frontier else needs_more_data.
Held-out gate ≠ promote → reject (gate is necessary, not sufficient).
Paired N < 6 (RESEARCH_REPORT_HARD_PAIR_FLOOR) → needs_more_data.
ROPE configured AND paired-delta CI ⊂ ROPE → equivalent (new state).
Paired-delta CI upper bound < 0 → reject.
Paired N < minPairs (default raised 6 → 20) → needs_more_data with MDE attached.
BH-adjusted q ≤ fdr AND paired CI excludes zero → promote.
Otherwise → hold.

Per-candidate additions:

prGreaterThanZero, prInRope — Bayesian-bootstrap-style posterior summaries on the mean paired delta (Rubin 1981 bootstrap-prior duality).
mde — minimum detectable paired effect at the configured power / α via the new pairedMde primitive in power-analysis. Reported on every candidate so a needs_more_data verdict is actionable.
meanGain alongside the existing medianGain.

Per-report additions:

runFingerprint — SHA-256 over the canonicalised input run set; deterministic across input order.
preregistrationHash — optional, links a signed HypothesisManifest so a reader can verify the analysis was the preregistered one rather than post-hoc.
methodology — structured assumptions / methods / alternatives / when-not-to-apply / citations, embedded both in the rendered markdown and as machine-readable fields.
Standalone docs/research-report-methodology.md.

API change: researchReport is now async (Web Crypto via hashJson for the fingerprint). PR #34 is the only consumer that's shipped; no external callers exist.

Citations: Benjamini & Hochberg 1995; Wilcoxon 1945; Efron 1979; Rubin 1981; Kruschke 2018; Howard et al. 2021 (background reading on always-valid extensions).

Why these changes specifically

The original researchReport was solid senior-applied-DS work. The bar for "launch reviewer / peer reviewer / SVP-research" requires:

No methodology drift. Embed assumptions and citations in the artifact; don't make readers reconstruct them.
Pairing consistently. Never compare marginal means inside a paired framework — a real defect in the original decision rule.
Actionable null results. A needs_more_data verdict that doesn't tell you the MDE at current N is methodologically lazy.
Practical equivalence. Frequentist tests can't accept the null. ROPE handles that — required for "is the cheap candidate as good as the expensive one?" launch decisions.
Bayesian-style summaries. Pr(Δ > 0) is what stakeholders actually want; the bootstrap-prior duality lets us produce it without committing to a parametric prior.
Provenance. A report that can't be tied to a fingerprinted input set or a preregistered protocol is descriptive, not anchoring.

Test plan

pnpm typecheck clean
pnpm build clean
pnpm test — 837 / 837 (11 dedicated researchReport cases including: ROPE → equivalent, gate override of favourable stats, hard floor below 6 pairs, MDE in the needs_more_data reason, fingerprint determinism + preregistration passthrough)
Spot-check rendered markdown / HTML in a downstream consumer before tagging a release.

Out of scope

Sequential / always-valid inference (e-values, mSPRT, alpha-spending). Single-look report by design; sequential extensions belong in a separate primitive that wraps this one.
Hierarchical Bayesian shrinkage across many candidates. Future work; the current ranking is on raw paired statistics.
Bootstrap-CI calibration via simulation. Future work; we lean on the asymptotic guarantee plus the hard pair floor.

Upgrades the researchReport that landed in #34 from a senior-applied-DS deliverable to one a launch reviewer or peer reviewer can sign off on. Decisions are now made on paired evidence consistently; the report carries its own methodology, provenance, and actionable next-step inference. Decision rule (paired-delta-only — never marginal means inside the paired framework): 1. Comparator → hold (baseline) 2. No comparator → hold if on Pareto frontier else needs_more_data 3. Held-out gate ≠ promote → reject (gate is necessary, not sufficient) 4. Paired N < 6 (RESEARCH_REPORT_HARD_PAIR_FLOOR) → needs_more_data 5. ROPE configured AND paired CI ⊂ ROPE → equivalent 6. Paired CI upper < 0 → reject 7. Paired N < minPairs (default 20) → needs_more_data + MDE attached 8. q ≤ fdr AND CI lower > 0 → promote 9. Otherwise → hold Per-candidate additions: - Bayesian-bootstrap-style Pr(Δ>0), Pr(Δ∈ROPE) on mean paired delta (Rubin 1981 bootstrap-prior duality) - Minimum detectable paired effect at the configured power / α via the new pairedMde primitive in power-analysis - meanGain alongside the existing medianGain Per-report additions: - runFingerprint: SHA-256 over the canonicalised input run set - preregistrationHash: optional, links a signed HypothesisManifest - methodology: assumptions, methods, alternatives, when-not-to-apply, citations (Benjamini & Hochberg 1995; Wilcoxon 1945; Efron 1979; Rubin 1981; Kruschke 2018) — embedded in markdown and as structured data Standalone methodology companion at docs/research-report-methodology.md. Behavioural changes: - Default minPairs raised from 6 to 20 (soft floor); hard floor held at 6. - Reject rule no longer mixes paired delta with marginal means. - Held-out gate ≠ promote now reliably overrides paired stats. - researchReport is async (Web Crypto via hashJson for the fingerprint). Adds a CLAUDE.md authorship directive (no AI-attribution trailers). 837/837 tests passing — 11 dedicated researchReport cases including: hard floor below 6 pairs, ROPE → equivalent, gate override, fingerprint determinism + preregistration passthrough, MDE in the needs_more_data reason, plus the existing summaryTable / paretoChart / gainHistogram coverage.

tangletools · 2026-05-08T15:31:30Z

✅ No Blockers — `a23fccdb`

Readiness 85/100 · Confidence 95/100 · 3 findings (1 medium, 2 low)

kimi-code: Correctness 85 · Security 95 · Testing 90 · Architecture 90

I read every changed file, ran the focused test suite (summary-report and power-analysis tests all pass), and verified the TypeScript of changed files is clean. The statistical methodology is sound, the ROPE/equivalence logic is well-implemented, and the async conversion of researchReport is handled correctly with no broken callers. I found one real edge-case regression in buildRecommendation and two minor doc/comment drifts.

🟠 MEDIUM Edge-case regression: recommendation returns 'reject' when only comparator exists — src/summary-report.ts

In buildRecommendation (lines 969-981), the decision logic was changed from checking candidates.some(...) to nonComparator.some(...). When runs contains only the comparator (no other candidates), nonComparator is empty, so the expression falls through to 'reject'. Previously it returned 'hold' because the comparator itself has decision 'hold'. This is a regression from the old behavior and produces a misleading recommendation. Fix: add || candidates.some((c) => c.decision === 'hold') to the final fallback, or special-case nonComparator.length === 0.

🟡 LOW Misleading JSDoc on mde field — src/summary-report.ts

The ResearchReportCandidate.mde JSDoc (around line 500) says 'Standardised by the observed paired-delta SD and inverted via requiredSampleSize.' The actual implementation in pairedPosterior (line 621) calls pairedMde, not requiredSampleSize. Update the doc comment to reference pairedMde so it doesn't drift from the code.

🟡 LOW seedRng and makeRng diverge for seed=0 despite 'same RNG family' claim — src/summary-report.ts

The bootstrapMeanSamples helper uses a local seedRng while the existing pairedBootstrap in paired-stats.ts uses makeRng. They differ in seed-0 handling (seed >>> 0 vs (seed | 0) || 0x9e3779b9) and in the state-update cast (>>> 0 vs | 0). For seed=0 they produce different sequences, which means pairedBootstrap and bootstrapMeanSamples are not actually using the same RNG family with the same seed. For most positive seeds the output is identical, so this is minor, but the comment in pairedPosterior (line 611) should be corrected.

_{tangletools · 2026-05-08T15:37:42Z · trace}

tangletools

✅ Approved — 3 non-blocking findings

I read every changed file, ran the focused test suite (summary-report and power-analysis tests all pass), and verified the TypeScript of changed files is clean. The statistical methodology is sound, the ROPE/equivalence logic is well-implemented, and the async conversion of researchReport is handled correctly with no broken callers. I found one real edge-case regression in buildRecommendation and two minor doc/comment drifts.

🟠 MEDIUM Edge-case regression: recommendation returns 'reject' when only comparator exists — src/summary-report.ts

In buildRecommendation (lines 969-981), the decision logic was changed from checking candidates.some(...) to nonComparator.some(...). When runs contains only the comparator (no other candidates), nonComparator is empty, so the expression falls through to 'reject'. Previously it returned 'hold' because the comparator itself has decision 'hold'. This is a regression from the old behavior and produces a misleading recommendation. Fix: add || candidates.some((c) => c.decision === 'hold') to the final fallback, or special-case nonComparator.length === 0.

🟡 LOW Misleading JSDoc on mde field — src/summary-report.ts

The ResearchReportCandidate.mde JSDoc (around line 500) says 'Standardised by the observed paired-delta SD and inverted via requiredSampleSize.' The actual implementation in pairedPosterior (line 621) calls pairedMde, not requiredSampleSize. Update the doc comment to reference pairedMde so it doesn't drift from the code.

🟡 LOW seedRng and makeRng diverge for seed=0 despite 'same RNG family' claim — src/summary-report.ts

The bootstrapMeanSamples helper uses a local seedRng while the existing pairedBootstrap in paired-stats.ts uses makeRng. They differ in seed-0 handling (seed >>> 0 vs (seed | 0) || 0x9e3779b9) and in the state-update cast (>>> 0 vs | 0). For seed=0 they produce different sequences, which means pairedBootstrap and bootstrapMeanSamples are not actually using the same RNG family with the same seed. For most positive seeds the output is identical, so this is minor, but the comment in pairedPosterior (line 611) should be corrected.

_{tangletools · 2026-05-08T15:37:42Z · trace}

tangletools approved these changes May 8, 2026

View reviewed changes

drewstone merged commit c8f03bd into main May 8, 2026

drewstone mentioned this pull request May 8, 2026

feat: 0.21.0 — capture integrity for launch-grade benchmark runs #36

Merged

5 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat(reporting): elevate researchReport to launch-review grade#35

feat(reporting): elevate researchReport to launch-review grade#35
drewstone merged 1 commit into
mainfrom
feat/research-report-upgrade

drewstone commented May 8, 2026

Uh oh!

tangletools commented May 8, 2026 •

edited

Loading

Uh oh!

tangletools left a comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

drewstone commented May 8, 2026

Summary

What changed

Why these changes specifically

Test plan

Out of scope

Uh oh!

tangletools commented May 8, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

✅ No Blockers — a23fccdb

Uh oh!

tangletools left a comment

Choose a reason for hiding this comment

✅ Approved — 3 non-blocking findings

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

tangletools commented May 8, 2026 •

edited

Loading

✅ No Blockers — `a23fccdb`