# Swing States Correlations Analysis

> In God We Trust. All others must bring data.
> -W. Edwards Deming

US presidential elections are decided by the [Electoral College](https://en.wikipedia.org/wiki/United_States_Electoral_College). Swing states' influence over elections is disproportionate to their populations because of the Electoral College and the US's ([mostly](https://www.270towin.com/content/split-electoral-votes-maine-and-nebraska/)) winner-take-all electoral system.

In the 2024 election, seven states are widely regarded as swing states:
- Arizona
- Georgia
- Michigan
- Wisconsin
- North Carolina
- Pennsylvania
- Nevada

Polling averages as of October 18, 2024 indicate that the two candidates are **statistically tied** in these seven states. Additionally, Nebraska's second congressional district is closely divided and could swing the election [in some scenarios](https://www.ft.com/content/714f8c07-3b2f-4862-bdf2-6878ac8c42ca).

Media outlets are rife with speculation about the different possible electoral maps. For example, *The Financial Times* writes:


> But in mathematical terms, there are scores of other pathways to winning the necessary votes in the electoral college, with and without Pennsylvania.
> For example, either candidate could shore up support in the southern swing states, and Harris would score a big win if she could flip North Carolina — a long-standing Democratic target that Trump won by a razor-thin margin in 2016 — and its 16 electoral college votes back to the Democrats’ column.
> Here is one way to enumerate the routes: **there are 128 combinations of possible outcomes in the seven swing states (two candidates to the seventh power)** where polls suggest the races are in effect tied.

While the FT's basic math is correct, the $2^7$ possible outcomes **are not equally likely.** 

If they were equally likely, we'd have to believe that:
1. the probability of Donald Trump or Kamala Harris winning each state is 50%
2. the outcomes are independent of each other

Polling supports proposition #1, but does not address claim #2.
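
To make the two conditions concrete, here is a minimal sketch that enumerates the 128 maps. The function names and the "perfectly correlated" scenario are my own illustrative assumptions, not a model of the real election: under independence with 50/50 odds every map has probability 1/128, while in the extreme opposite case only the two sweeps can occur.

```python
from itertools import product

SWING_STATES = ["Arizona", "Georgia", "Michigan", "Wisconsin",
                "North Carolina", "Pennsylvania", "Nevada"]

# Every way to assign one of two winners to each of the seven states.
outcomes = list(product(["Harris", "Trump"], repeat=len(SWING_STATES)))
n_outcomes = len(outcomes)  # 2 ** 7


def independent_probability(outcome, p_harris=0.5):
    """Probability of one full map, ASSUMING the states are independent."""
    prob = 1.0
    for winner in outcome:
        prob *= p_harris if winner == "Harris" else 1 - p_harris
    return prob


def perfectly_correlated_probability(outcome, p_harris=0.5):
    """Hypothetical extreme: all seven states always go the same way."""
    if all(winner == "Harris" for winner in outcome):
        return p_harris
    if all(winner == "Trump" for winner in outcome):
        return 1 - p_harris
    return 0.0  # split maps never happen in this toy scenario


independent_probs = [independent_probability(o) for o in outcomes]
correlated_probs = [perfectly_correlated_probability(o) for o in outcomes]

print(n_outcomes, independent_probs[0], sum(p > 0 for p in correlated_probs))
```

Reality presumably sits somewhere between these two extremes, which is what the rest of this notebook tries to measure.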

Anecdotally, recent history suggests swing state outcomes may not be independent of each other. Prior to 2016, Wisconsin, Michigan, and Pennsylvania formed part of the so-called Blue Wall of reliably Democratic Midwestern states. In 2016, these states moved *together* into Donald Trump's column; in 2020, they again moved together into Joe Biden's. 

I was interested in checking whether this anecdotal evidence holds up more rigorously, so I accessed data on US presidential elections since 1976 to investigate quantitatively **how strongly swing state outcomes move together**. Statisticians and data scientists measure this with covariance or, once it is rescaled to lie between -1 and 1, correlation.
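
For reference, these are the standard sample definitions. The symbols are my own notation: $x_i$ and $y_i$ are two states' vote shares in election $i$, $n$ is the number of elections, and $s_x$, $s_y$ are the sample standard deviations.

$$
\operatorname{cov}(x, y) = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y}),
\qquad
r_{xy} = \frac{\operatorname{cov}(x, y)}{s_x \, s_y}
$$

The correlation coefficient $r_{xy}$ is just the covariance divided by both standard deviations, so it always falls between $-1$ and $1$ and can be compared across pairs of states.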

## Hypothesis Pre-registration

Recently, statisticians and other quantitative researchers have moved towards a norm of pre-registering hypotheses, which is intended to prevent cherry-picking data. To contribute to this norm, I am writing my hypotheses in this notebook BEFORE any analysis. Readers can verify this by referencing the GitHub commit history and examining the "pre-registering hypotheses" commit.

Hypotheses:
1. The presidential election outcome in each swing state is moderately correlated ($R^2 > 0.3$) with at least one other swing state.
2. Swing state correlations have increased over time, as evidenced by an increase in the moving average correlation coefficient.
3. Wisconsin, Michigan, and Pennsylvania are more highly correlated with each other than with other swing states.
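
One possible way to turn the three hypotheses into numbers is sketched below. Everything here is an assumption on my part: the function names are hypothetical, "moving average correlation" is read as the mean pairwise correlation over a trailing window of elections, and the data is synthetic noise around a shared trend rather than real election results.

```python
from itertools import combinations

import numpy as np
import pandas as pd

BLUE_WALL = ["MICHIGAN", "PENNSYLVANIA", "WISCONSIN"]  # grouping named in hypothesis 3


def best_r_squared_with_other_state(shares):
    """Hypothesis 1: each state's highest R^2 with any OTHER state."""
    r2 = shares.corr() ** 2
    # Hide the diagonal, where every state has R^2 = 1 with itself.
    return r2.mask(np.eye(len(r2), dtype=bool)).max()


def mean_pairwise_corr(shares):
    """Average Pearson correlation over all distinct pairs of columns."""
    corr = shares.corr()
    return float(np.mean([corr.loc[a, b] for a, b in combinations(shares.columns, 2)]))


def rolling_mean_pairwise_corr(shares, window):
    """Hypothesis 2: mean pairwise correlation over each trailing window of elections."""
    values = {
        shares.index[end - 1]: mean_pairwise_corr(shares.iloc[end - window:end])
        for end in range(window, len(shares) + 1)
    }
    return pd.Series(values, name="mean_pairwise_corr")


def within_vs_between(shares, group):
    """Hypothesis 3: mean correlation inside `group` vs. between `group` and the rest."""
    corr = shares.corr()
    others = [c for c in shares.columns if c not in group]
    within = np.mean([corr.loc[a, b] for a, b in combinations(group, 2)])
    between = np.mean([corr.loc[a, b] for a in group for b in others])
    return float(within), float(between)


# Synthetic placeholder data (NOT election results), used only to exercise the functions.
rng = np.random.default_rng(0)
years = list(range(1976, 2024, 4))  # 12 elections, 1976 through 2020
national = rng.normal(0, 3, size=len(years))  # shared "national swing"
toy = pd.DataFrame(
    {s: 48 + national + rng.normal(0, 2, size=len(years)) for s in BLUE_WALL + ["ARIZONA", "GEORGIA"]},
    index=years,
)

best_r2 = best_r_squared_with_other_state(toy)
rolling = rolling_mean_pairwise_corr(toy, window=6)
within, between = within_vs_between(toy, BLUE_WALL)
print(best_r2.round(2), rolling.round(2), (within, between), sep="\n")
```

With only a dozen elections, any window short enough to show a trend will be noisy, so the window length is a judgment call rather than something the hypotheses fix.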

## Data Import and Preparation

In [116]:
import pandas as pd
import matplotlib.pyplot as plt
import plotly.graph_objects as go
from plotly.subplots import make_subplots
import seaborn as sns
import matplotlib.colors as mcolors


In [67]:
IN_SCOPE_PARTIES = {"DEMOCRAT", "REPUBLICAN"}
SWING_STATES = {"ARIZONA", "WISCONSIN", "MICHIGAN", "PENNSYLVANIA", "NEVADA", "NORTH CAROLINA", "GEORGIA"}

In [68]:
data_path = "../data/us_elections_1976-2020.csv"

raw_df = pd.read_csv(data_path)

raw_df.head()

| | year | state | state_po | state_fips | state_cen | state_ic | office | candidate | party_detailed | writein | candidatevotes | totalvotes | version | notes | party_simplified |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1976 | ALABAMA | AL | 1 | 63 | 41 | US PRESIDENT | CARTER, JIMMY | DEMOCRAT | False | 659170 | 1182850 | 20210113 | | DEMOCRAT |
| 1 | 1976 | ALABAMA | AL | 1 | 63 | 41 | US PRESIDENT | FORD, GERALD | REPUBLICAN | False | 504070 | 1182850 | 20210113 | | REPUBLICAN |
| 2 | 1976 | ALABAMA | AL | 1 | 63 | 41 | US PRESIDENT | MADDOX, LESTER | AMERICAN INDEPENDENT PARTY | False | 9198 | 1182850 | 20210113 | | OTHER |
| 3 | 1976 | ALABAMA | AL | 1 | 63 | 41 | US PRESIDENT | BUBAR, BENJAMIN ""BEN"" | PROHIBITION | False | 6669 | 1182850 | 20210113 | | OTHER |
| 4 | 1976 | ALABAMA | AL | 1 | 63 | 41 | US PRESIDENT | HALL, GUS | COMMUNIST PARTY USE | False | 1954 | 1182850 | 20210113 | | OTHER |


In [69]:
print("\n".join(list(raw_df.columns)))

year
state
state_po
state_fips
state_cen
state_ic
office
candidate
party_detailed
writein
candidatevotes
totalvotes
version
notes
party_simplified


In [70]:
set(raw_df["office"])

{'US PRESIDENT'}

In [71]:
compact_df = raw_df[["year",
                     "state",
                     "candidate",
                     "candidatevotes",
                     "totalvotes",
                     "party_detailed",
                     "party_simplified"]]

compact_df.head()

| | year | state | candidate | candidatevotes | totalvotes | party_detailed | party_simplified |
|---|---|---|---|---|---|---|---|
| 0 | 1976 | ALABAMA | CARTER, JIMMY | 659170 | 1182850 | DEMOCRAT | DEMOCRAT |
| 1 | 1976 | ALABAMA | FORD, GERALD | 504070 | 1182850 | REPUBLICAN | REPUBLICAN |
| 2 | 1976 | ALABAMA | MADDOX, LESTER | 9198 | 1182850 | AMERICAN INDEPENDENT PARTY | OTHER |
| 3 | 1976 | ALABAMA | BUBAR, BENJAMIN ""BEN"" | 6669 | 1182850 | PROHIBITION | OTHER |
| 4 | 1976 | ALABAMA | HALL, GUS | 1954 | 1182850 | COMMUNIST PARTY USE | OTHER |


In [72]:
major_parties_df = compact_df[compact_df["party_detailed"].isin(IN_SCOPE_PARTIES)]

major_parties_df.head()

| | year | state | candidate | candidatevotes | totalvotes | party_detailed | party_simplified |
|---|---|---|---|---|---|---|---|
| 0 | 1976 | ALABAMA | CARTER, JIMMY | 659170 | 1182850 | DEMOCRAT | DEMOCRAT |
| 1 | 1976 | ALABAMA | FORD, GERALD | 504070 | 1182850 | REPUBLICAN | REPUBLICAN |
| 7 | 1976 | ALASKA | FORD, GERALD | 71555 | 123574 | REPUBLICAN | REPUBLICAN |
| 8 | 1976 | ALASKA | CARTER, JIMMY | 44058 | 123574 | DEMOCRAT | DEMOCRAT |
| 11 | 1976 | ARIZONA | FORD, GERALD | 418642 | 742719 | REPUBLICAN | REPUBLICAN |


In [73]:
swing_state_df = major_parties_df[major_parties_df["state"].isin(SWING_STATES)]
swing_state_df.head()

| | year | state | candidate | candidatevotes | totalvotes | party_detailed | party_simplified |
|---|---|---|---|---|---|---|---|
| 11 | 1976 | ARIZONA | FORD, GERALD | 418642 | 742719 | REPUBLICAN | REPUBLICAN |
| 12 | 1976 | ARIZONA | CARTER, JIMMY | 295602 | 742719 | DEMOCRAT | DEMOCRAT |
| 60 | 1976 | GEORGIA | CARTER, JIMMY | 979409 | 1463152 | DEMOCRAT | DEMOCRAT |
| 61 | 1976 | GEORGIA | FORD, GERALD | 483743 | 1463152 | REPUBLICAN | REPUBLICAN |
| 133 | 1976 | MICHIGAN | FORD, GERALD | 1893742 | 3651590 | REPUBLICAN | REPUBLICAN |


In [74]:
# assign() builds a new DataFrame, so pandas does not raise SettingWithCopyWarning on the filtered slice
swing_state_df = swing_state_df.assign(vote_pct=swing_state_df["candidatevotes"] / swing_state_df["totalvotes"])
swing_state_df.head()


| | year | state | candidate | candidatevotes | totalvotes | party_detailed | party_simplified | vote_pct |
|---|---|---|---|---|---|---|---|---|
| 11 | 1976 | ARIZONA | FORD, GERALD | 418642 | 742719 | REPUBLICAN | REPUBLICAN | 0.563661 |
| 12 | 1976 | ARIZONA | CARTER, JIMMY | 295602 | 742719 | DEMOCRAT | DEMOCRAT | 0.398 |
| 60 | 1976 | GEORGIA | CARTER, JIMMY | 979409 | 1463152 | DEMOCRAT | DEMOCRAT | 0.669383 |
| 61 | 1976 | GEORGIA | FORD, GERALD | 483743 | 1463152 | REPUBLICAN | REPUBLICAN | 0.330617 |
| 133 | 1976 | MICHIGAN | FORD, GERALD | 1893742 | 3651590 | REPUBLICAN | REPUBLICAN | 0.518608 |


In [75]:
pivoted_df = swing_state_df.pivot(index=["year", "state", "candidate"], columns = ["party_detailed"], values="vote_pct")

In [88]:
party_share_df = pivoted_df.groupby(["year", "state"]).agg("sum")

PCT_ADJ = 100  # convert fractional shares to percentages

# Scale to percentages and round to two decimals for readability
party_share_df["DEMOCRAT"] = (party_share_df["DEMOCRAT"] * PCT_ADJ).round(2)
party_share_df["REPUBLICAN"] = (party_share_df["REPUBLICAN"] * PCT_ADJ).round(2)

party_share_df

| year | state | DEMOCRAT | REPUBLICAN |
|---|---|---|---|
| 1976 | ARIZONA | 39.80 | 56.37 |
| 1976 | GEORGIA | 66.94 | 33.06 |
| 1976 | MICHIGAN | 46.47 | 51.86 |
| 1976 | NEVADA | 45.81 | 50.17 |
| 1976 | NORTH CAROLINA | 55.27 | 44.22 |
| ... | ... | ... | ... |
| 2020 | MICHIGAN | 50.62 | 47.84 |
| 2020 | NEVADA | 50.06 | 47.67 |
| 2020 | NORTH CAROLINA | 48.59 | 49.93 |
| 2020 | PENNSYLVANIA | 50.01 | 48.84 |


## Analysis

In [89]:
df = party_share_df.reset_index()

# Create subplots
states = df['state'].unique()
fig = make_subplots(rows=len(states), cols=1, shared_xaxes=True, 
                    vertical_spacing=0.1, subplot_titles=states)

# Add traces for each state
for i, state in enumerate(states):
    state_data = df[df['state'] == state]
    
    # Add Democrat trace; show legend only for the first subplot
    fig.add_trace(go.Scatter(
        x=state_data['year'],
        y=state_data['DEMOCRAT'],
        mode='lines+markers',
        name='Democrat',
        line=dict(color='blue'),
        showlegend=(i == 0)  # Show legend only for the first state
    ), row=i + 1, col=1)
    
    # Add Republican trace; show legend only for the first subplot
    fig.add_trace(go.Scatter(
        x=state_data['year'],
        y=state_data['REPUBLICAN'],
        mode='lines+markers',
        name='Republican',
        line=dict(color='red'),
        showlegend=(i == 0)  # Show legend only for the first state
    ), row=i + 1, col=1)

# Update layout
fig.update_layout(
    title='Vote Share Over Time by State',
    xaxis_title='Year',
    yaxis_title='Vote Share (%)',
    height=300 * len(states),  # Adjust height based on the number of states
    showlegend=True
)

# Show the figure
fig.show()

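Visually, we can see that many states seem to move up and down together. For example, the Democratic share fell in all seven states between 1976 and 1980, rose in all seven between 2004 and 2008, and fell again in all seven in 2012.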

But let's quantify this rigorously.

In [112]:
dem_df = party_share_df.reset_index().loc[:, ["year", "state", "DEMOCRAT"]]

state_dem_df = dem_df.pivot(index = ["year"], columns="state", values="DEMOCRAT")

state_dem_df

| year | ARIZONA | GEORGIA | MICHIGAN | NEVADA | NORTH CAROLINA | PENNSYLVANIA | WISCONSIN |
|---|---|---|---|---|---|---|---|
| 1976 | 39.8 | 66.94 | 46.47 | 45.81 | 55.27 | 50.4 | 49.5 |
| 1980 | 28.24 | 55.8 | 42.5 | 27.36 | 47.18 | 42.48 | 43.18 |
| 1984 | 32.54 | 39.79 | 40.24 | 32.42 | 37.89 | 45.99 | 45.02 |
| 1988 | 38.74 | 39.5 | 45.67 | 38.68 | 41.71 | 48.39 | 51.41 |
| 1992 | 36.52 | 43.47 | 43.77 | 37.36 | 42.65 | 45.15 | 41.13 |
| 1996 | 46.52 | 45.84 | 51.69 | 43.93 | 44.04 | 49.23 | 48.81 |
| 2000 | 44.73 | 43.21 | 51.28 | 45.94 | 43.15 | 50.61 | 47.83 |
| 2004 | 44.4 | 41.37 | 51.23 | 47.88 | 43.58 | 50.92 | 49.7 |
| 2008 | 45.12 | 46.99 | 57.43 | 55.15 | 49.7 | 54.49 | 56.22 |
| 2012 | 44.59 | 45.51 | 54.21 | 52.36 | 48.35 | 52.08 | 52.78 |


In [151]:
import plotly.express as px
import numpy as np

correlation_matrix = state_dem_df.corr()
rounded_corr = correlation_matrix.round(2)

In [158]:
# Boolean mask that is True only on the diagonal (each state's correlation with itself)
diagonal_mask = np.eye(rounded_corr.shape[0], dtype=bool)

# Create a heatmap using Plotly with 'Purples' color scale
fig = px.imshow(
    rounded_corr,
    text_auto=True,
    color_continuous_scale='Purples',  # Using Purples color scale
    title='Correlation Between Share of <b>DEMOCRATIC</b> vote in each state',
    aspect='auto',
    zmin=-1,
    zmax=1
)

# Add the gray diagonal heatmap
fig.add_trace(go.Heatmap(
    z=np.where(diagonal_mask, 0.0, None),  # Set diagonal values to 0
    colorscale=[[0, 'gray'], [1, 'gray']],  # Gray color for the diagonal
    showscale=False,  # Do not show the color scale for the diagonal
    name='Diagonal'
))


fig.update_layout(
    xaxis_title='State',  # Both axes list states, not generic features
    yaxis_title='State',
    coloraxis_colorbar=dict(title='Correlation'),
    plot_bgcolor='white',  # Remove the shading behind the plot
    paper_bgcolor='white',  # Set the background of the entire figure
    xaxis=dict(showgrid=False),  # Remove grid lines from the x-axis
    yaxis=dict(showgrid=False)   # Remove grid lines from the y-axis
)

# Show the figure
fig.show()

This shows convincingly that since 1976, the Democratic vote shares in the swing states have been **highly correlated with each other, not independent**. Because third-party candidates are insignificant in most elections, I would get an almost identical table if I computed the correlations between Republican vote shares across swing states.

Let's dig deeper to see which states are the most correlated with others and which are more independent of other states. Visually, Arizona and the Blue Wall states seem to have a lot of high correlations <span style="color: purple;"><b>(dark purple)</b></span>.
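
One simple way to rank the states, sketched here as an assumption rather than a settled method, is to average each state's correlation with the other six. To keep the example self-contained, it rebuilds `state_dem_df` from only the rows displayed in the output above (1976 to 2012), so its numbers will not exactly match a heatmap computed on the full dataset.

```python
import numpy as np
import pandas as pd

# Democratic vote share (%) copied from the state_dem_df output above
# (only the rows displayed there, 1976-2012).
state_dem_df = pd.DataFrame(
    {
        "ARIZONA": [39.8, 28.24, 32.54, 38.74, 36.52, 46.52, 44.73, 44.4, 45.12, 44.59],
        "GEORGIA": [66.94, 55.8, 39.79, 39.5, 43.47, 45.84, 43.21, 41.37, 46.99, 45.51],
        "MICHIGAN": [46.47, 42.5, 40.24, 45.67, 43.77, 51.69, 51.28, 51.23, 57.43, 54.21],
        "NEVADA": [45.81, 27.36, 32.42, 38.68, 37.36, 43.93, 45.94, 47.88, 55.15, 52.36],
        "NORTH CAROLINA": [55.27, 47.18, 37.89, 41.71, 42.65, 44.04, 43.15, 43.58, 49.7, 48.35],
        "PENNSYLVANIA": [50.4, 42.48, 45.99, 48.39, 45.15, 49.23, 50.61, 50.92, 54.49, 52.08],
        "WISCONSIN": [49.5, 43.18, 45.02, 51.41, 41.13, 48.81, 47.83, 49.7, 56.22, 52.78],
    },
    index=pd.Index(range(1976, 2016, 4), name="year"),
)

corr = state_dem_df.corr()

# Blank out the diagonal so a state's perfect correlation with itself
# does not inflate its average.
off_diagonal = corr.mask(np.eye(len(corr), dtype=bool))

# Average correlation with the other six states, highest first.
mean_corr_with_others = off_diagonal.mean().sort_values(ascending=False)
print(mean_corr_with_others.round(2))
```

States near the top of this ranking move most closely with the rest of the group, while states near the bottom behave more independently.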