Hi! With the elections coming up, I wanted to analyze the past ones and try to find the demographic factors that have driven Americans' votes. I start with a quick EDA, which I'll deepen later, followed by ML predictions.

Share and upvote this notebook if you like it! I plan to add an analysis of the 2020 elections soon.

# I - Libraries and Data 

In [None]:
import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
from matplotlib.patches import Rectangle
import seaborn as sns
import plotly.graph_objects as go
import plotly.express as px
import plotly.offline as py

from urllib.request import urlopen
import json

from sklearn.model_selection import KFold, GroupKFold, StratifiedKFold
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
import lightgbm as lgb

sns.set_style("white")
pd.set_option('display.max_columns', 500)
py.init_notebook_mode(connected=True)

In [None]:
elections = pd.read_csv('../input/2012-2016-presidential-elections/US_County_Level_Presidential_Results_12-16.csv')
demographic = pd.read_csv('../input/us-census-demographic-data/acs2017_county_data.csv')

In [None]:
# FIPS codes are 5-character strings; read as integers they lose their leading zero
elections['combined_fips'] = elections['combined_fips'].astype(str).str.zfill(5)

demographic['combined_fips'] = demographic['CountyId'].astype(str).str.zfill(5)

In [None]:
elections = pd.merge(elections, demographic, on='combined_fips')

elections.head()

In [None]:
# some data is missing for 2012 elections (28 counties to be precise)
# elections.isnull().sum()
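
The merge above is an inner join (the default of `pd.merge`), so a county whose FIPS code appears in only one of the two files is dropped without warning. A minimal sketch of how this could be checked, using the `indicator` option of `pd.merge` on made-up rows (the toy frames and their values are assumptions, not the real data):

```python
import pandas as pd

# Toy stand-ins for the two tables (hypothetical values)
votes = pd.DataFrame({'combined_fips': ['01001', '01003', '02013'],
                      'per_dem_2016': [0.24, 0.19, 0.35]})
census = pd.DataFrame({'combined_fips': ['01001', '01003', '72001'],
                       'TotalPop': [55000, 203000, 18000]})

# An outer join keeps every row; the _merge column says which side each row came from
check = pd.merge(votes, census, on='combined_fips', how='outer', indicator=True)
unmatched = check[check['_merge'] != 'both']

print(unmatched[['combined_fips', '_merge']])
print(check.isnull().sum())
```

Running the same check on the real `elections` and `demographic` frames before merging would show which counties are lost and on which side.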

In [None]:
with urlopen('https://raw.githubusercontent.com/plotly/datasets/master/geojson-counties-fips.json') as response:
    counties = json.load(response)

# II - Visualization of votes

In [None]:
fig = go.Figure()

fig.add_trace(go.Choropleth(
                    geojson = counties,
                    z = elections['per_dem_2012'] * 100,  # shares are stored as fractions; plot them as percentages
                    locations = elections['combined_fips'],
                    text = elections['county_name'],
                    colorscale = 'rdbu',
                    colorbar_ticksuffix = '%',
                    colorbar_title = '% Votes for Democrats',
                ))

fig.add_trace(go.Choropleth(
                    geojson = counties,
                    z = elections['per_dem_2016'] * 100,  # same conversion, so both maps share one unit
                    locations = elections['combined_fips'],
                    text = elections['county_name'],
                    colorscale = 'rdbu',
                    colorbar_ticksuffix = '%',
                    colorbar_title = '% Votes for Democrats',
                ))

fig.update_layout(
    updatemenus=[
        dict(
            type = "buttons",
            direction = "left",
            buttons=list([
                dict(label="2012",
                     method="update",
                     args=[{"visible": [True, False]},
                           {"title": "Votes in 2012"}]),
                dict(label="2016",
                     method="update",
                     args=[{"visible": [False, True]},
                           {"title": "Votes in 2016"}])]),
            pad={"r": 10, "t": 10},
            showactive=True,
            x=0.11,
            xanchor="left",
            y=1.1, 
            yanchor="top"
        ),
    ],
    geo = dict(
        scope='usa',
        projection=go.layout.geo.Projection(type = 'albers usa'),
        lakecolor='rgb(255, 255, 255)'),
    geo2 = dict(
        scope='usa',
        projection=go.layout.geo.Projection(type = 'albers usa'),
        lakecolor='rgb(255, 255, 255)'),
)

### Comments
- As expected, we observe a clear geographical divide between the coasts and the rest of America, particularly in the 2016 results: the interior leans Republican, while the coasts vote more for Democrats.
- We can see the shift in votes between 2012 and 2016, with far more red counties in 2016. That year, many counties voted massively for Trump, whose score is regularly around 90% of the total votes (!)

In [None]:
elect_clean = elections.dropna().copy()
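# Relative change in the Democratic share, in percent (matches the '%' suffix on the colorbar)
elect_clean['evolution%'] = (elect_clean['per_dem_2016'] - elect_clean['per_dem_2012'])/elect_clean['per_dem_2012'] * 100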

In [None]:
fig = go.Figure(data=go.Choropleth(
                    geojson = counties,
                    z = elect_clean['evolution%'],
                    locations = elect_clean['combined_fips'],
                    text = elect_clean['county_name'],
#                     colorscale = 'rdbu',
                    # White is pinned to zero change; its position is
                    # (0 - min) / (max - min) of elect_clean['evolution%'], hard-coded here
                    colorscale= [[0, 'red'],
                           [0.3473, 'white'],
                           [1, 'blue']],
                    colorbar_ticksuffix = '%',
                    colorbar_title = '%',
                ))

fig.update_layout(
    title_text='Evolution between 2012 and 2016 elections',
    geo = dict(
        scope='usa',
        projection=go.layout.geo.Projection(type = 'albers usa'),
        lakecolor='rgb(255, 255, 255)'),
)

fig.show()
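
The `0.3473` in the colorscale above is the position of zero within the range of `evolution%`, so that white means "no change". Rather than hard-coding it, the value could be computed from the data. A small sketch, where the helper name `zero_position` and the toy values are assumptions made for illustration:

```python
def zero_position(values):
    """Return where 0 falls in [min, max], as a fraction between 0 and 1."""
    lo, hi = min(values), max(values)
    if not lo < 0 < hi:
        raise ValueError("zero must lie strictly inside the data range")
    return (0 - lo) / (hi - lo)

# Hypothetical relative changes, not the real county values
toy_evolution = [-0.5, -0.1, 0.2, 1.0]
mid = zero_position(toy_evolution)

# Diverging colorscale with white pinned to zero change
colorscale = [[0, 'red'], [mid, 'white'], [1, 'blue']]
print(colorscale)
```

In the notebook, `zero_position(elect_clean['evolution%'])` should give the same kind of value, since `elect_clean` has no missing entries left after `dropna()`.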

The vast majority of American counties voted more Republican in 2016 than in 2012. There are some exceptions, however, such as Wibaux or Sterling counties, where the Democratic vote rose even though they are among the most pro-Republican counties. I wonder if this is due to some local events.

# III - Analysis of vote factors

In this part, I will focus for now on the 2016 elections, adding some demographic data from 2017. I think the one-year gap is negligible at this point.

Arbitrarily, the target variable will be the percentage of Democratic votes per county. Other targets, such as the Republican percentage or the number of votes for each candidate, lead to the same results.

In [None]:
elections_2016 = elections.drop(['total_votes_2012', 'votes_dem_2012', 'votes_gop_2012', 'county_fips',
       'state_fips', 'per_dem_2012', 'per_gop_2012', 'diff_2012',
       'per_point_diff_2012', 'Unnamed: 0',  'votes_dem_2016', 'votes_gop_2016',
       'total_votes_2016', 'per_gop_2016', 'diff_2016', 'per_point_diff_2016', 'FIPS'], axis = 1)

elections_2016['Men%'] = elections_2016['Men']/elections_2016['TotalPop'] * 100
elections_2016['Women%'] = elections_2016['Women']/elections_2016['TotalPop'] * 100
elections_2016['VotingAge%'] = elections_2016['VotingAgeCitizen']/elections_2016['TotalPop'] * 100

elections_2016.drop(['Men', 'Women', 'VotingAgeCitizen', 'IncomeErr', 'IncomePerCapErr'], axis = 1, inplace = True)

elections_2016.head()

### Target Value Analysis
Let's select only the variables that we will use in the prediction part.

In [None]:
pred_columns = ['per_dem_2016', 'TotalPop', 'Hispanic', 'White', 'Black',
       'Native', 'Asian', 'Pacific', 'IncomePerCap', 'Poverty',
       'ChildPoverty', 'Professional', 'Service', 'Office', 'Construction',
       'Production', 'Drive', 'Carpool', 'Transit', 'Walk', 'OtherTransp',
       'WorkAtHome', 'MeanCommute', 'PrivateWork', 'PublicWork',
       'SelfEmployed', 'FamilyWork', 'Unemployment', 'Men%', 
       'VotingAge%']
# Drop Employed because correlation coefficient of 0.998 with TotalPop
# Will need to test States as categorical variable
# Maybe leverage IncomePerCapErr
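
The comment above drops `Employed` because it is almost perfectly correlated with `TotalPop`. The same screening could be applied to every pair of features at once. A hedged sketch, where the helper `highly_correlated_pairs`, the 0.95 threshold and the toy numbers are all assumptions made for illustration:

```python
import numpy as np
import pandas as pd

def highly_correlated_pairs(df, threshold=0.95):
    """List feature pairs whose absolute Pearson correlation exceeds threshold."""
    corr = df.corr().abs()
    # Keep only the strict upper triangle so each pair appears once
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    pairs = upper.stack()
    return pairs[pairs > threshold].sort_values(ascending=False)

# Hypothetical counties: Employed tracks TotalPop closely, Poverty does not
toy = pd.DataFrame({
    'TotalPop': [100, 200, 300, 400],
    'Employed': [48, 101, 149, 202],
    'Poverty':  [30, 10, 25, 12],
})
redundant = highly_correlated_pairs(toy)
print(redundant)
```

Tree-based models such as LightGBM tolerate redundant features, so this mostly helps keep feature importances readable.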

In [None]:
corr_mat = elections_2016[pred_columns].corr()
corr_mat['per_dem_2016'].sort_values(ascending = False)

These correlation coefficients already highlight some obvious trends. Counties with more minorities, more unemployment and a larger population tend to have higher vote percentages for Democrats, whereas counties with more White people, more men and more people working in the Construction sector (!) are more pro-Republican.

In [None]:
fig, (ax1,ax2) = plt.subplots(1,2, figsize=(16,5))

sns.scatterplot(data = elections_2016, x = 'TotalPop', y = 'per_dem_2016', ax = ax1)
sns.scatterplot(data = elections_2016[elections_2016['TotalPop'] <= elections_2016['TotalPop'].quantile(0.9)],
                x = 'TotalPop', y = 'per_dem_2016', ax = ax2)

ax1.title.set_text('Democrats vote against population per county')
ax2.title.set_text('Zoom on counties below the 90th quantile for population')

It is hard to identify a clear trend, but we can see a slight rise in the Democratic vote as the total population of a county increases.

Let's now look at the relationship between the Democratic vote and its most correlated variables:

In [None]:
fig, (axes1, axes2) = plt.subplots(2,2, figsize=(16,10))
sns.scatterplot(data = elections_2016, x = 'White', y = 'per_dem_2016', ax = axes1[0])
sns.scatterplot(data = elections_2016, x = 'IncomePerCap', y = 'per_dem_2016', ax = axes1[1])
sns.scatterplot(data = elections_2016, x = 'Hispanic', y = 'per_dem_2016', ax = axes2[0])
sns.scatterplot(data = elections_2016, x = 'Men%', y = 'per_dem_2016', ax = axes2[1])

axes1[0].title.set_text('Democrats vote against White% per county')
axes1[1].title.set_text('Democrats vote against Income per person')
axes2[0].title.set_text('Democrats vote against Hispanic% per county')
axes2[1].title.set_text('Democrats vote against Men% per county')

We see a negative linear trend between the White percentage per county and the Democratic vote. The other trends are harder to identify, but counties with a higher Hispanic percentage and/or income per person still tend to be more pro-Democrat. For Men%, there is no clear relationship, but note that Forest County (PA), with 80% men, is a clear outlier in the plot.

### Independent Variables Analysis

In [None]:
#Correlation matrix
fig = plt.figure(figsize = (16,12))
mask = np.tril(corr_mat)

g = sns.heatmap(corr_mat, mask = mask , annot = True, fmt = '0.1g')
g.add_patch(Rectangle((0, 0), 32, 1, fill=False, edgecolor='white', lw=3))

plt.show()

# IV - Predicting Votes

In [None]:
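# Fit a simple model with 5-fold cross-validation
X, y = elections_2016[pred_columns].drop(['per_dem_2016'], axis = 1), elections_2016['per_dem_2016']
lgb_model = lgb.LGBMRegressor(n_estimators = 300)
feat_imp = np.zeros(X.shape[1])
preds = pd.DataFrame(y)
preds['prediction'] = 0.0  # float column, so fold predictions are stored without a dtype change

kf = KFold(n_splits=5, shuffle = True, random_state = 42)

for train_index, test_index in kf.split(X):
    X_train, X_test = X.iloc[train_index], X.iloc[test_index]
    y_train, y_test = y.iloc[train_index], y.iloc[test_index]
    # LightGBM >= 4.0 takes early stopping and logging as callbacks, not as fit() keyword arguments.
    # The held-out fold also drives early stopping, so the reported error is slightly optimistic.
    lgb_model.fit(X_train, y_train, eval_set=[(X_test, y_test)],
            callbacks=[lgb.early_stopping(50), lgb.log_evaluation(10)])
    predictions = lgb_model.predict(X_test)
    # test_index is positional: go through preds.index to get the matching labels
    preds.loc[preds.index[test_index], 'prediction'] = predictions
    feat_imp += lgb_model.feature_importances_ / kf.get_n_splits()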

In [None]:
print('MSE of the 5-Fold LGB Model : {}'.format(mean_squared_error(preds['per_dem_2016'], preds['prediction'])))

These results are satisfactory: we get a Mean Squared Error of 0.004, which is rather low!
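
An MSE is easier to judge once it is converted back to the unit of the target and compared with the spread of the target itself. The sketch below does both. The helper `summarize_error` and the toy vectors are assumptions made for illustration, and only the 0.004 figure comes from the run above:

```python
import math
from statistics import pvariance

def summarize_error(y_true, y_pred):
    """Return (MSE, RMSE, R^2) for two equally long sequences."""
    n = len(y_true)
    mse = sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n
    rmse = math.sqrt(mse)
    # R^2 compares the model with always predicting the mean of y_true
    r2 = 1 - mse / pvariance(y_true)
    return mse, rmse, r2

# The reported MSE, expressed in the unit of the target (a vote share between 0 and 1)
reported_rmse = math.sqrt(0.004)
print(round(reported_rmse, 3))

# Hypothetical shares and predictions, only to exercise the helper
toy_true = [0.2, 0.4, 0.6, 0.8]
toy_pred = [0.25, 0.35, 0.65, 0.75]
mse, rmse, r2 = summarize_error(toy_true, toy_pred)
print(mse, rmse, r2)
```

Applied to `preds['per_dem_2016']` and `preds['prediction']`, the same helper would indicate how much of the county-to-county variation the model explains.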

In [None]:
preds['squared_error'] = (preds['per_dem_2016'] - preds['prediction']) ** 2
elections_2016['pred_dem_2016'] = preds['prediction']

In [None]:
fig = go.Figure()

fig.add_trace(go.Choropleth(
                    geojson = counties,
                    z = elections_2016['per_dem_2016'] * 100,  # shares are stored as fractions; plot them as percentages
                    locations = elections_2016['combined_fips'],
                    text = elections_2016['county_name'],
                    colorscale = 'rdbu',
                    colorbar_ticksuffix = '%',
                    colorbar_title = '% Votes for Democrats',
                ))
fig.add_trace(go.Choropleth(
                    geojson = counties,
                    z = preds['squared_error'],
                    locations = elections_2016['combined_fips'],
                    text = elections_2016['county_name']+'<br>Votes : '+elections_2016['per_dem_2016'].apply(lambda x: str(x)[:5])+'<br>Predictions :'+elections_2016['pred_dem_2016'].apply(lambda x: str(x)[:5]),
                    colorbar_title = 'Squared Error',
                ))

fig.add_trace(go.Choropleth(
                    geojson = counties,
                    z = preds['prediction'] * 100,  # same unit as the actual-values map
                    locations = elections_2016['combined_fips'],
                    colorscale = 'rdbu',
                    colorbar_ticksuffix = '%',
                    colorbar_title = '% Votes for Democrats',
                    zmax = elections_2016['per_dem_2016'].max() * 100
                ))

fig.update_layout(
    updatemenus=[
        dict(
            type = "buttons",
            direction = "left",
            buttons=list([
                dict(label="predictions",
                     method="update",
                     args=[{"visible": [False, False, True]},
                           {"title": "Votes in 2016 - Predictions"}]),
                dict(label="actual",
                     method="update",
                     args=[{"visible": [True, False, False]},
                           {"title": "Votes in 2016 - Actual Values"}]),
                # This trace shows the per-county squared error, not an average
                dict(label="squared error",
                     method="update",
                     args=[{"visible": [False, True, False]},
                           {"title": "Votes in 2016 - Squared Error"}])]),
            pad={"r": 10, "t": 10},
#             showactive=True,
            x=0.11,
            xanchor="left",
            y=1.1, 
            yanchor="top"
        ),
    ],
    title_text='Predictions for 2016 Elections',
    geo = dict(
        scope='usa',
        projection=go.layout.geo.Projection(type = 'albers usa'),
        lakecolor='rgb(255, 255, 255)'),
    geo2 = dict(
        scope='usa',
        projection=go.layout.geo.Projection(type = 'albers usa'),
        lakecolor='rgb(255, 255, 255)'),
    geo3 = dict(
        scope='usa',
        projection=go.layout.geo.Projection(type = 'albers usa'),
        lakecolor='rgb(255, 255, 255)'),
)

The predictions are rather close to the actual scores. Note that the highest error (by far) is obtained in Citrus County (FL), where we predicted 31.1% of the votes for Democrats against an actual score of 29.5%: the difference is still small!

In [None]:
feature_importance = pd.DataFrame({"Value": feat_imp, "Feature": X.columns}) \
                    .sort_values(by="Value", ascending=False)

# Change size of the plot, so we can see all features
fig_dims = (10, 14)
fig, ax = plt.subplots(figsize=fig_dims)

sns.barplot(x="Value", y="Feature", ax=ax, data=feature_importance)
plt.title('LightGBM Features')
plt.tight_layout()
plt.show()

The most important factors are the White population, the income per person and the percentage of the population that has the right to vote. As we don't have any other age-related variable, we can assume that the latter introduces this notion into the model, and that it has a huge impact on county votes.

Let's now visualize our results per state!

In [None]:
elections_2016['pred_votes_dem'] = (elections_2016['TotalPop'] * elections_2016['VotingAge%']/100 * elections_2016['pred_dem_2016']).astype(int)
elections_2016['pred_votes_rep'] = (elections_2016['TotalPop'] * elections_2016['VotingAge%']/100 * (1 - elections_2016['pred_dem_2016'])).astype(int)
elections_2016['actual_votes_dem'] = (elections_2016['TotalPop'] * elections_2016['VotingAge%']/100 * elections_2016['per_dem_2016']).astype(int)
elections_2016['actual_votes_rep'] = (elections_2016['TotalPop'] * elections_2016['VotingAge%']/100 * (1 - elections_2016['per_dem_2016'])).astype(int)

In [None]:
states_2016 = elections_2016[['pred_votes_dem', 'pred_votes_rep', 'actual_votes_dem', 'actual_votes_rep', 'state_abbr']].groupby('state_abbr').sum()
states_2016['pred_dem%'] = states_2016['pred_votes_dem'] / (states_2016['pred_votes_dem'] + states_2016['pred_votes_rep']) * 100
states_2016['actual_dem%'] = states_2016['actual_votes_dem'] / (states_2016['actual_votes_dem'] + states_2016['actual_votes_rep']) * 100
states_2016['state_abbr'] = states_2016.index
states_2016.head()
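
The cells above turn county shares into vote counts before summing per state. This weights each county by its voting-age population, implicitly assuming full turnout and a two-party split. Averaging the county shares directly would give a different answer, as this toy sketch shows (the state `XX`, the column names and the numbers are all made up):

```python
import pandas as pd

# Hypothetical counties: one small and one large, in the same state
toy = pd.DataFrame({
    'state_abbr': ['XX', 'XX'],
    'voters':     [1000, 9000],
    'dem_share':  [0.20, 0.60],
})

toy['votes_dem'] = toy['voters'] * toy['dem_share']

by_state = toy.groupby('state_abbr').agg(votes_dem=('votes_dem', 'sum'),
                                         voters=('voters', 'sum'),
                                         mean_share=('dem_share', 'mean'))
# Vote-weighted share: total Democratic votes over total voters
by_state['weighted_share'] = by_state['votes_dem'] / by_state['voters']

print(by_state[['mean_share', 'weighted_share']])
```

The large county dominates the weighted share, which is the behaviour we want for a state-level result.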

In [None]:
fig = go.Figure(data=go.Choropleth(
    locations=states_2016['state_abbr'], 
    z = states_2016['pred_dem%'],
    locationmode = 'USA-states',
    colorscale= [[0, "red"],   
                [0.4, 'white'],
                [1, "blue"]],
    colorbar_ticksuffix = '%',
    colorbar_title = '% Votes for Democrats',
))

fig.update_layout(
    title_text = 'Predictions for 2016 Elections',
    geo_scope='usa'
)

fig.show()

In [None]:
fig = go.Figure(data=go.Choropleth(
    locations=states_2016['state_abbr'], 
    z = states_2016['actual_dem%'],
    locationmode = 'USA-states',
    colorscale= [[0, "rgb(178, 24, 43)"],   
                [0.4, 'white'],
                [1, "rgb(0,102,172)"]],
    colorbar_ticksuffix = '%',
    colorbar_title = '% Votes for Democrats',
))

fig.update_layout(
    title_text = '2016 Presidential results',
    geo_scope='usa'
)

fig.show()

WIP - I'm looking for 2020 data to make predictions for the upcoming elections.