feat(reporting): elevate researchReport to launch-review grade#35
Conversation
Upgrades the researchReport that landed in #34 from a senior-applied-DS deliverable to one a launch reviewer or peer reviewer can sign off on. Decisions are now made on paired evidence consistently; the report carries its own methodology, provenance, and actionable next-step inference. Decision rule (paired-delta-only — never marginal means inside the paired framework): 1. Comparator → hold (baseline) 2. No comparator → hold if on Pareto frontier else needs_more_data 3. Held-out gate ≠ promote → reject (gate is necessary, not sufficient) 4. Paired N < 6 (RESEARCH_REPORT_HARD_PAIR_FLOOR) → needs_more_data 5. ROPE configured AND paired CI ⊂ ROPE → equivalent 6. Paired CI upper < 0 → reject 7. Paired N < minPairs (default 20) → needs_more_data + MDE attached 8. q ≤ fdr AND CI lower > 0 → promote 9. Otherwise → hold Per-candidate additions: - Bayesian-bootstrap-style Pr(Δ>0), Pr(Δ∈ROPE) on mean paired delta (Rubin 1981 bootstrap-prior duality) - Minimum detectable paired effect at the configured power / α via the new pairedMde primitive in power-analysis - meanGain alongside the existing medianGain Per-report additions: - runFingerprint: SHA-256 over the canonicalised input run set - preregistrationHash: optional, links a signed HypothesisManifest - methodology: assumptions, methods, alternatives, when-not-to-apply, citations (Benjamini & Hochberg 1995; Wilcoxon 1945; Efron 1979; Rubin 1981; Kruschke 2018) — embedded in markdown and as structured data Standalone methodology companion at docs/research-report-methodology.md. Behavioural changes: - Default minPairs raised from 6 to 20 (soft floor); hard floor held at 6. - Reject rule no longer mixes paired delta with marginal means. - Held-out gate ≠ promote now reliably overrides paired stats. - researchReport is async (Web Crypto via hashJson for the fingerprint). Adds a CLAUDE.md authorship directive (no AI-attribution trailers). 837/837 tests passing — 11 dedicated researchReport cases including: hard floor below 6 pairs, ROPE → equivalent, gate override, fingerprint determinism + preregistration passthrough, MDE in the needs_more_data reason, plus the existing summaryTable / paretoChart / gainHistogram coverage.
✅ No Blockers —
|
tangletools
left a comment
There was a problem hiding this comment.
✅ Approved — 3 non-blocking findings
I read every changed file, ran the focused test suite (summary-report and power-analysis tests all pass), and verified the TypeScript of changed files is clean. The statistical methodology is sound, the ROPE/equivalence logic is well-implemented, and the async conversion of researchReport is handled correctly with no broken callers. I found one real edge-case regression in buildRecommendation and two minor doc/comment drifts.
🟠 MEDIUM Edge-case regression: recommendation returns 'reject' when only comparator exists — src/summary-report.ts
In buildRecommendation (lines 969-981), the decision logic was changed from checking
candidates.some(...)tononComparator.some(...). Whenrunscontains only the comparator (no other candidates),nonComparatoris empty, so the expression falls through to'reject'. Previously it returned'hold'because the comparator itself has decision'hold'. This is a regression from the old behavior and produces a misleading recommendation. Fix: add|| candidates.some((c) => c.decision === 'hold')to the final fallback, or special-casenonComparator.length === 0.
🟡 LOW Misleading JSDoc on mde field — src/summary-report.ts
The
ResearchReportCandidate.mdeJSDoc (around line 500) says 'Standardised by the observed paired-delta SD and inverted viarequiredSampleSize.' The actual implementation inpairedPosterior(line 621) callspairedMde, notrequiredSampleSize. Update the doc comment to referencepairedMdeso it doesn't drift from the code.
🟡 LOW seedRng and makeRng diverge for seed=0 despite 'same RNG family' claim — src/summary-report.ts
The
bootstrapMeanSampleshelper uses a localseedRngwhile the existingpairedBootstrapinpaired-stats.tsusesmakeRng. They differ in seed-0 handling (seed >>> 0vs(seed | 0) || 0x9e3779b9) and in the state-update cast (>>> 0vs| 0). For seed=0 they produce different sequences, which meanspairedBootstrapandbootstrapMeanSamplesare not actually using the same RNG family with the same seed. For most positive seeds the output is identical, so this is minor, but the comment inpairedPosterior(line 611) should be corrected.
tangletools · 2026-05-08T15:37:42Z · trace
Summary
Follow-up to #34. Elevates
researchReportfrom a senior-applied-DS deliverable to one that a launch reviewer or peer reviewer can sign off on, by closing the methodological gaps called out in the post-merge review.What changed
Decision rule — now paired-delta-only. The previous reject rule mixed paired evidence with marginal means; that's fixed. Authoritative order:
hold(baseline).holdon Pareto frontier elseneeds_more_data.promote→reject(gate is necessary, not sufficient).RESEARCH_REPORT_HARD_PAIR_FLOOR) →needs_more_data.equivalent(new state).reject.minPairs(default raised 6 → 20) →needs_more_datawith MDE attached.fdrAND paired CI excludes zero →promote.hold.Per-candidate additions:
prGreaterThanZero,prInRope— Bayesian-bootstrap-style posterior summaries on the mean paired delta (Rubin 1981 bootstrap-prior duality).mde— minimum detectable paired effect at the configured power / α via the newpairedMdeprimitive inpower-analysis. Reported on every candidate so aneeds_more_dataverdict is actionable.meanGainalongside the existingmedianGain.Per-report additions:
runFingerprint— SHA-256 over the canonicalised input run set; deterministic across input order.preregistrationHash— optional, links a signedHypothesisManifestso a reader can verify the analysis was the preregistered one rather than post-hoc.methodology— structured assumptions / methods / alternatives / when-not-to-apply / citations, embedded both in the rendered markdown and as machine-readable fields.docs/research-report-methodology.md.API change:
researchReportis now async (Web Crypto viahashJsonfor the fingerprint). PR #34 is the only consumer that's shipped; no external callers exist.Citations: Benjamini & Hochberg 1995; Wilcoxon 1945; Efron 1979; Rubin 1981; Kruschke 2018; Howard et al. 2021 (background reading on always-valid extensions).
Why these changes specifically
The original
researchReportwas solid senior-applied-DS work. The bar for "launch reviewer / peer reviewer / SVP-research" requires:needs_more_dataverdict that doesn't tell you the MDE at current N is methodologically lazy.Test plan
pnpm typecheckcleanpnpm buildcleanpnpm test— 837 / 837 (11 dedicatedresearchReportcases including: ROPE → equivalent, gate override of favourable stats, hard floor below 6 pairs, MDE in theneeds_more_datareason, fingerprint determinism + preregistration passthrough)Out of scope