Skip to content

feat(reporting): elevate researchReport to launch-review grade#35

Merged
drewstone merged 1 commit into
mainfrom
feat/research-report-upgrade
May 8, 2026
Merged

feat(reporting): elevate researchReport to launch-review grade#35
drewstone merged 1 commit into
mainfrom
feat/research-report-upgrade

Conversation

@drewstone
Copy link
Copy Markdown
Contributor

Summary

Follow-up to #34. Elevates researchReport from a senior-applied-DS deliverable to one that a launch reviewer or peer reviewer can sign off on, by closing the methodological gaps called out in the post-merge review.

What changed

Decision rule — now paired-delta-only. The previous reject rule mixed paired evidence with marginal means; that's fixed. Authoritative order:

  1. Comparator → hold (baseline).
  2. No comparator → hold on Pareto frontier else needs_more_data.
  3. Held-out gate ≠ promotereject (gate is necessary, not sufficient).
  4. Paired N < 6 (RESEARCH_REPORT_HARD_PAIR_FLOOR) → needs_more_data.
  5. ROPE configured AND paired-delta CI ⊂ ROPE → equivalent (new state).
  6. Paired-delta CI upper bound < 0 → reject.
  7. Paired N < minPairs (default raised 6 → 20) → needs_more_data with MDE attached.
  8. BH-adjusted q ≤ fdr AND paired CI excludes zero → promote.
  9. Otherwise → hold.

Per-candidate additions:

  • prGreaterThanZero, prInRope — Bayesian-bootstrap-style posterior summaries on the mean paired delta (Rubin 1981 bootstrap-prior duality).
  • mde — minimum detectable paired effect at the configured power / α via the new pairedMde primitive in power-analysis. Reported on every candidate so a needs_more_data verdict is actionable.
  • meanGain alongside the existing medianGain.

Per-report additions:

  • runFingerprint — SHA-256 over the canonicalised input run set; deterministic across input order.
  • preregistrationHash — optional, links a signed HypothesisManifest so a reader can verify the analysis was the preregistered one rather than post-hoc.
  • methodology — structured assumptions / methods / alternatives / when-not-to-apply / citations, embedded both in the rendered markdown and as machine-readable fields.
  • Standalone docs/research-report-methodology.md.

API change: researchReport is now async (Web Crypto via hashJson for the fingerprint). PR #34 is the only consumer that's shipped; no external callers exist.

Citations: Benjamini & Hochberg 1995; Wilcoxon 1945; Efron 1979; Rubin 1981; Kruschke 2018; Howard et al. 2021 (background reading on always-valid extensions).

Why these changes specifically

The original researchReport was solid senior-applied-DS work. The bar for "launch reviewer / peer reviewer / SVP-research" requires:

  • No methodology drift. Embed assumptions and citations in the artifact; don't make readers reconstruct them.
  • Pairing consistently. Never compare marginal means inside a paired framework — a real defect in the original decision rule.
  • Actionable null results. A needs_more_data verdict that doesn't tell you the MDE at current N is methodologically lazy.
  • Practical equivalence. Frequentist tests can't accept the null. ROPE handles that — required for "is the cheap candidate as good as the expensive one?" launch decisions.
  • Bayesian-style summaries. Pr(Δ > 0) is what stakeholders actually want; the bootstrap-prior duality lets us produce it without committing to a parametric prior.
  • Provenance. A report that can't be tied to a fingerprinted input set or a preregistered protocol is descriptive, not anchoring.

Test plan

  • pnpm typecheck clean
  • pnpm build clean
  • pnpm test837 / 837 (11 dedicated researchReport cases including: ROPE → equivalent, gate override of favourable stats, hard floor below 6 pairs, MDE in the needs_more_data reason, fingerprint determinism + preregistration passthrough)
  • Spot-check rendered markdown / HTML in a downstream consumer before tagging a release.

Out of scope

  • Sequential / always-valid inference (e-values, mSPRT, alpha-spending). Single-look report by design; sequential extensions belong in a separate primitive that wraps this one.
  • Hierarchical Bayesian shrinkage across many candidates. Future work; the current ranking is on raw paired statistics.
  • Bootstrap-CI calibration via simulation. Future work; we lean on the asymptotic guarantee plus the hard pair floor.

Upgrades the researchReport that landed in #34 from a senior-applied-DS
deliverable to one a launch reviewer or peer reviewer can sign off on.
Decisions are now made on paired evidence consistently; the report carries
its own methodology, provenance, and actionable next-step inference.

Decision rule (paired-delta-only — never marginal means inside the paired
framework):

  1. Comparator → hold (baseline)
  2. No comparator → hold if on Pareto frontier else needs_more_data
  3. Held-out gate ≠ promote → reject (gate is necessary, not sufficient)
  4. Paired N < 6 (RESEARCH_REPORT_HARD_PAIR_FLOOR) → needs_more_data
  5. ROPE configured AND paired CI ⊂ ROPE → equivalent
  6. Paired CI upper < 0 → reject
  7. Paired N < minPairs (default 20) → needs_more_data + MDE attached
  8. q ≤ fdr AND CI lower > 0 → promote
  9. Otherwise → hold

Per-candidate additions:
  - Bayesian-bootstrap-style Pr(Δ>0), Pr(Δ∈ROPE) on mean paired delta
    (Rubin 1981 bootstrap-prior duality)
  - Minimum detectable paired effect at the configured power / α via the
    new pairedMde primitive in power-analysis
  - meanGain alongside the existing medianGain

Per-report additions:
  - runFingerprint: SHA-256 over the canonicalised input run set
  - preregistrationHash: optional, links a signed HypothesisManifest
  - methodology: assumptions, methods, alternatives, when-not-to-apply,
    citations (Benjamini & Hochberg 1995; Wilcoxon 1945; Efron 1979;
    Rubin 1981; Kruschke 2018) — embedded in markdown and as structured data

Standalone methodology companion at docs/research-report-methodology.md.

Behavioural changes:
  - Default minPairs raised from 6 to 20 (soft floor); hard floor held at 6.
  - Reject rule no longer mixes paired delta with marginal means.
  - Held-out gate ≠ promote now reliably overrides paired stats.
  - researchReport is async (Web Crypto via hashJson for the fingerprint).

Adds a CLAUDE.md authorship directive (no AI-attribution trailers).

837/837 tests passing — 11 dedicated researchReport cases including: hard
floor below 6 pairs, ROPE → equivalent, gate override, fingerprint
determinism + preregistration passthrough, MDE in the needs_more_data
reason, plus the existing summaryTable / paretoChart / gainHistogram
coverage.
@tangletools
Copy link
Copy Markdown
Contributor

tangletools commented May 8, 2026

✅ No Blockers — a23fccdb

Readiness 85/100 · Confidence 95/100 · 3 findings (1 medium, 2 low)

kimi-code: Correctness 85 · Security 95 · Testing 90 · Architecture 90

I read every changed file, ran the focused test suite (summary-report and power-analysis tests all pass), and verified the TypeScript of changed files is clean. The statistical methodology is sound, the ROPE/equivalence logic is well-implemented, and the async conversion of researchReport is handled correctly with no broken callers. I found one real edge-case regression in buildRecommendation and two minor doc/comment drifts.

🟠 MEDIUM Edge-case regression: recommendation returns 'reject' when only comparator exists — src/summary-report.ts

In buildRecommendation (lines 969-981), the decision logic was changed from checking candidates.some(...) to nonComparator.some(...). When runs contains only the comparator (no other candidates), nonComparator is empty, so the expression falls through to 'reject'. Previously it returned 'hold' because the comparator itself has decision 'hold'. This is a regression from the old behavior and produces a misleading recommendation. Fix: add || candidates.some((c) => c.decision === 'hold') to the final fallback, or special-case nonComparator.length === 0.

🟡 LOW Misleading JSDoc on mde field — src/summary-report.ts

The ResearchReportCandidate.mde JSDoc (around line 500) says 'Standardised by the observed paired-delta SD and inverted via requiredSampleSize.' The actual implementation in pairedPosterior (line 621) calls pairedMde, not requiredSampleSize. Update the doc comment to reference pairedMde so it doesn't drift from the code.

🟡 LOW seedRng and makeRng diverge for seed=0 despite 'same RNG family' claim — src/summary-report.ts

The bootstrapMeanSamples helper uses a local seedRng while the existing pairedBootstrap in paired-stats.ts uses makeRng. They differ in seed-0 handling (seed >>> 0 vs (seed | 0) || 0x9e3779b9) and in the state-update cast (>>> 0 vs | 0). For seed=0 they produce different sequences, which means pairedBootstrap and bootstrapMeanSamples are not actually using the same RNG family with the same seed. For most positive seeds the output is identical, so this is minor, but the comment in pairedPosterior (line 611) should be corrected.


tangletools · 2026-05-08T15:37:42Z · trace

Copy link
Copy Markdown
Contributor

@tangletools tangletools left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

✅ Approved — 3 non-blocking findings

I read every changed file, ran the focused test suite (summary-report and power-analysis tests all pass), and verified the TypeScript of changed files is clean. The statistical methodology is sound, the ROPE/equivalence logic is well-implemented, and the async conversion of researchReport is handled correctly with no broken callers. I found one real edge-case regression in buildRecommendation and two minor doc/comment drifts.

🟠 MEDIUM Edge-case regression: recommendation returns 'reject' when only comparator exists — src/summary-report.ts

In buildRecommendation (lines 969-981), the decision logic was changed from checking candidates.some(...) to nonComparator.some(...). When runs contains only the comparator (no other candidates), nonComparator is empty, so the expression falls through to 'reject'. Previously it returned 'hold' because the comparator itself has decision 'hold'. This is a regression from the old behavior and produces a misleading recommendation. Fix: add || candidates.some((c) => c.decision === 'hold') to the final fallback, or special-case nonComparator.length === 0.

🟡 LOW Misleading JSDoc on mde field — src/summary-report.ts

The ResearchReportCandidate.mde JSDoc (around line 500) says 'Standardised by the observed paired-delta SD and inverted via requiredSampleSize.' The actual implementation in pairedPosterior (line 621) calls pairedMde, not requiredSampleSize. Update the doc comment to reference pairedMde so it doesn't drift from the code.

🟡 LOW seedRng and makeRng diverge for seed=0 despite 'same RNG family' claim — src/summary-report.ts

The bootstrapMeanSamples helper uses a local seedRng while the existing pairedBootstrap in paired-stats.ts uses makeRng. They differ in seed-0 handling (seed >>> 0 vs (seed | 0) || 0x9e3779b9) and in the state-update cast (>>> 0 vs | 0). For seed=0 they produce different sequences, which means pairedBootstrap and bootstrapMeanSamples are not actually using the same RNG family with the same seed. For most positive seeds the output is identical, so this is minor, but the comment in pairedPosterior (line 611) should be corrected.


tangletools · 2026-05-08T15:37:42Z · trace

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants