Red vs. Blue

Reproducible pipeline comparing U.S. economic metrics under Democratic vs. Republican presidents. 83 metrics across 10 families, permutation-based inference with BH-FDR correction.

Live results: redvsblue.fyi

Methodology

Why permutation tests?

Standard t-tests assume normally distributed data and large samples. Presidential terms give us neither — depending on data coverage, we have only 23 to 51 four-year terms. Permutation tests make no distributional assumptions. They work by shuffling the D/R party labels across terms thousands of times and asking: how often does the shuffled data produce a gap as large as the real one? If the answer is "rarely" the observed gap is unlikely to be chance.

Controlling for multiple comparisons (BH-FDR)

Testing 83 metrics at once means some will look significant by luck. The Benjamini-Hochberg procedure controls the false discovery rate — the expected fraction of "discoveries" that are false positives. Each metric gets a q-value: the smallest FDR threshold at which it would still be called significant. A q-value of 0.05 means that among all metrics at or below that threshold, no more than 5% are expected to be false discoveries.

The 83 metrics are intentionally broad but highly correlated (e.g., GDP level and GDP growth move together), so the number of truly independent signals is smaller than 83. BH-FDR corrects for the nominal count, not the effective one — this makes it conservative in practice.

Confidence intervals

Bootstrap confidence intervals (2,000 resamples) give a range for the D-minus-R gap on each metric. These are computed independently of the permutation p-values and provide a complementary measure of uncertainty.

Permutation blocking

By default, the pipeline uses unrestricted permutation: any D/R label arrangement is equally likely under the null. An alternative is block shuffling, which only permutes labels within time windows (e.g., 20-year blocks) to control for secular trends. Block shuffling with small blocks constrains the null distribution and can inflate significance — the choice of block size is a researcher degree of freedom with no consensus value in the literature (Blinder-Watson 2014 used 4-year blocks; other reviewers have suggested 4, 8, or 20). We default to unrestricted permutation as the most conservative option. Use --term-block-years N for sensitivity analysis.

Data freshness

A GitHub Actions workflow re-runs the full pipeline every Sunday at 06:00 UTC and commits site/data.json if anything changed. The live site at redvsblue.fyi picks up the update automatically. The workflow can also be triggered manually via workflow_dispatch.

Quick Start

Prereqs:

Python 3.13+ and uv
FRED_API_KEY in .env (for FRED series)

uv sync

Pipeline

rb ingest --refresh              # fetch and cache raw data
rb presidents --refresh          # presidential terms + party labels
rb compute                       # term-level metrics + party summaries
rb validate                      # sanity checks on derived data
rb randomization                 # permutation tests with FDR correction
rb scoreboard                    # markdown scoreboard from computed CSVs
rb export-json                   # JSON export for the static site

Key Outputs

reports/term_metrics_v1.csv — metric values per presidential term
reports/party_summary_v1.csv — D vs R party-level means and medians
reports/permutation_party_term_v1.csv — permutation test results with FDR q-values
reports/scoreboard.md — human-readable summary (sorted by q)
site/data.json — JSON bundle consumed by the static site
site/index.html — static scoreboard page

BH-FDR validity under correlated metrics

The Benjamini-Hochberg procedure was originally proven under independence (Benjamini & Hochberg, 1995). Benjamini & Yekutieli (2001) later showed it also controls the FDR under positive regression dependency from a subset (PRDS) — roughly, when learning that one test statistic is large makes the others more likely to be large too. Our 83 metrics are drawn from a handful of underlying economic series (GDP, employment, inflation, etc.), so they are positively correlated almost by construction. This satisfies PRDS, meaning BH is formally valid here, not merely a heuristic. Because BH corrects for all 83 nominal tests rather than the smaller number of independent signals, the procedure is conservative: the true FDR is likely well below the reported q-values.

Where Things Live

Pipeline code: rb/
Metric and attribution specs: spec/
Static site: site/
Literature corpus: literature/
AI reviews: reviews/
Website: site/

Name		Name	Last commit message	Last commit date
Latest commit History 114 Commits
.github/workflows		.github/workflows
literature		literature
queries		queries
rb		rb
reviews		reviews
site		site
spec		spec
tests		tests
.env.example		.env.example
.gitignore		.gitignore
CLAUDE.md		CLAUDE.md
README.md		README.md
REVIEWS.md		REVIEWS.md
pyproject.toml		pyproject.toml
uv.lock		uv.lock

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Red vs. Blue

Methodology

Why permutation tests?

Controlling for multiple comparisons (BH-FDR)

Confidence intervals

Permutation blocking

Data freshness

Quick Start

Pipeline

Key Outputs

BH-FDR validity under correlated metrics

Where Things Live

About

Uh oh!

Releases

Packages

Contributors 2

Uh oh!

Languages

wakamex/redvsblue

Folders and files

Latest commit

History

Repository files navigation

Red vs. Blue

Methodology

Why permutation tests?

Controlling for multiple comparisons (BH-FDR)

Confidence intervals

Permutation blocking

Data freshness

Quick Start

Pipeline

Key Outputs

BH-FDR validity under correlated metrics

Where Things Live

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Contributors 2

Uh oh!

Languages

Packages