Reproducible pipeline comparing U.S. economic metrics under Democratic vs. Republican presidents. 83 metrics across 10 families, permutation-based inference with BH-FDR correction.
Live results: redvsblue.fyi
Standard t-tests assume normally distributed data and large samples. Presidential terms give us neither: depending on data coverage, we have only 23 to 51 four-year terms. Permutation tests make no distributional assumptions. They work by shuffling the D/R party labels across terms thousands of times and asking: how often does the shuffled data produce a gap as large as the real one? If the answer is "rarely," the observed gap is unlikely to be chance.
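As a minimal sketch (toy numbers, not the pipeline's actual implementation), the core loop looks like this:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical per-term metric values and party labels, one per four-year term.
values = np.array([3.1, 2.4, 4.0, 1.8, 2.9, 3.5, 1.2, 2.2])
party = np.array(["D", "R", "D", "R", "D", "D", "R", "R"])

def gap(values, party):
    """D-minus-R difference in mean metric value."""
    return values[party == "D"].mean() - values[party == "R"].mean()

observed = gap(values, party)

# Shuffle the party labels many times and count how often the shuffled
# gap is at least as large (in absolute value) as the observed one.
n_perm = 10_000
null_gaps = np.array([gap(values, rng.permutation(party)) for _ in range(n_perm)])
p_value = (np.abs(null_gaps) >= abs(observed)).mean()
print(f"observed gap = {observed:.2f}, permutation p = {p_value:.3f}")
```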
Testing 83 metrics at once means some will look significant by luck. The Benjamini-Hochberg procedure controls the false discovery rate — the expected fraction of "discoveries" that are false positives. Each metric gets a q-value: the smallest FDR threshold at which it would still be called significant. A q-value of 0.05 means that among all metrics at or below that threshold, no more than 5% are expected to be false discoveries.
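The q-value computation itself is short; here is a standalone sketch (not the pipeline's code):

```python
import numpy as np

def bh_qvalues(pvals):
    """Benjamini-Hochberg q-values: for each p-value, the smallest FDR
    threshold at which that test would still be called significant."""
    p = np.asarray(pvals, dtype=float)
    m = len(p)
    order = np.argsort(p)                        # ascending p-values
    ranked = p[order] * m / np.arange(1, m + 1)  # p_(i) * m / i
    # Enforce monotonicity: q_(i) = min over ranks j >= i of ranked_(j).
    q = np.minimum.accumulate(ranked[::-1])[::-1]
    out = np.empty(m)
    out[order] = np.clip(q, 0, 1)
    return out

print(bh_qvalues([0.001, 0.008, 0.039, 0.041, 0.20]))
```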
The 83 metrics are intentionally broad but highly correlated (e.g., GDP level and GDP growth move together), so the number of truly independent signals is smaller than 83. BH-FDR corrects for the nominal count, not the effective one — this makes it conservative in practice.
Bootstrap confidence intervals (2,000 resamples) give a range for the D-minus-R gap on each metric. These are computed independently of the permutation p-values and provide a complementary measure of uncertainty.
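A sketch of a percentile bootstrap for the gap, with toy per-term values (the pipeline's resampling details may differ):

```python
import numpy as np

rng = np.random.default_rng(0)
d = np.array([3.1, 4.0, 2.9, 3.5])  # hypothetical D-term values
r = np.array([2.4, 1.8, 1.2, 2.2])  # hypothetical R-term values

# Resample each party's terms with replacement and recompute the gap.
n_boot = 2_000
gaps = np.array([
    rng.choice(d, size=len(d)).mean() - rng.choice(r, size=len(r)).mean()
    for _ in range(n_boot)
])
lo, hi = np.percentile(gaps, [2.5, 97.5])  # 95% percentile interval
print(f"D-minus-R gap: {d.mean() - r.mean():.2f}, 95% CI [{lo:.2f}, {hi:.2f}]")
```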
By default, the pipeline uses unrestricted permutation: any D/R label arrangement is equally likely under the null. An alternative is block shuffling, which only permutes labels within time windows (e.g., 20-year blocks) to control for secular trends. Block shuffling with small blocks constrains the null distribution and can inflate significance, and the choice of block size is a researcher degree of freedom with no consensus value in the literature (Blinder & Watson (2014) used 4-year blocks; other reviewers have suggested 4, 8, or 20 years). We default to unrestricted permutation as the most conservative option. Use `--term-block-years N` for sensitivity analysis.
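For illustration, a within-block shuffle might look like the sketch below; the block size and term ordering here are assumptions, not the pipeline's exact logic:

```python
import numpy as np

rng = np.random.default_rng(0)

def block_shuffle(party, block_terms=5):
    """Permute party labels only within consecutive blocks of terms
    (e.g., block_terms=5 four-year terms is roughly a 20-year window)."""
    shuffled = np.array(party)
    for start in range(0, len(shuffled), block_terms):
        block = shuffled[start:start + block_terms]
        shuffled[start:start + block_terms] = rng.permutation(block)
    return shuffled

party = np.array(["D", "R", "D", "R", "D", "D", "R", "R", "D", "R"])
print(block_shuffle(party))
```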
A GitHub Actions workflow re-runs the full pipeline every Sunday at 06:00 UTC and commits `site/data.json` if anything changed. The live site at redvsblue.fyi picks up the update automatically. The workflow can also be triggered manually via `workflow_dispatch`.
Prereqs:
- Python 3.13+ and `uv`
- `FRED_API_KEY` in `.env` (for FRED series)
```sh
uv sync
rb ingest --refresh      # fetch and cache raw data
rb presidents --refresh  # presidential terms + party labels
rb compute               # term-level metrics + party summaries
rb validate              # sanity checks on derived data
rb randomization         # permutation tests with FDR correction
rb scoreboard            # markdown scoreboard from computed CSVs
rb export-json           # JSON export for the static site
```

Outputs:
- `reports/term_metrics_v1.csv` — metric values per presidential term
- `reports/party_summary_v1.csv` — D vs. R party-level means and medians
- `reports/permutation_party_term_v1.csv` — permutation test results with FDR q-values
- `reports/scoreboard.md` — human-readable summary (sorted by q)
- `site/data.json` — JSON bundle consumed by the static site
- `site/index.html` — static scoreboard page
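For example, to filter the permutation results down to the significant rows (the column names below are assumptions for illustration; check the CSV header):

```python
import pandas as pd

results = pd.read_csv("reports/permutation_party_term_v1.csv")
# Hypothetical column names; adjust to the actual CSV header.
significant = results[results["q_value"] <= 0.05].sort_values("q_value")
print(significant[["metric", "gap", "p_value", "q_value"]])
```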
The Benjamini-Hochberg procedure was originally proven under independence (Benjamini & Hochberg, 1995). Benjamini & Yekutieli (2001) later showed it also controls the FDR under positive regression dependency from a subset (PRDS) — roughly, when learning that one test statistic is large makes the others more likely to be large too. Our 83 metrics are drawn from a handful of underlying economic series (GDP, employment, inflation, etc.), so they are positively correlated almost by construction. This satisfies PRDS, meaning BH is formally valid here, not merely a heuristic. Because BH corrects for all 83 nominal tests rather than the smaller number of independent signals, the procedure is conservative: the true FDR is likely well below the reported q-values.
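In symbols: with m tests, m0 of which are true nulls, V false discoveries, and R total discoveries, running BH at level q under independence or PRDS guarantees

```latex
\mathrm{FDR} \;=\; \mathbb{E}\!\left[\frac{V}{\max(R,\,1)}\right]
\;\le\; \frac{m_0}{m}\, q \;\le\; q
```

Since m0/m is at most 1, the reported q-values are upper bounds on the realized false discovery rate.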
Repo layout:
- Pipeline code: `rb/`
- Metric and attribution specs: `spec/`
- Static site: `site/`
- Literature corpus: `literature/`
- AI reviews: `reviews/`