# Predictive Election NN Model

Using the cleaned registration and election result data, we build a neural network model to predict election outcomes. The input features are:

* Partisan registration in upcoming election
* Partisan registration in prior election
* Outcome of prior election

One of the challenges with predicting election outcomes in Colorado is the variability introduced by about a third of the state being unaffiliated with either of the two major parties. The behavior of those unaffiliated voters will vary by district, so we need a model that can account for that.

We have a pretty limited dataset, built from the following combinations:

| Inputs | Output |
|---------|----------|
|2012 reg., results, 2016 reg. | 2016 results |
|2014 reg., results, 2018 reg. | 2018 results |
|2016 reg., results, 2020 reg. | 2020 results |

The final goal is to use the 2016 registration and results, plus the most current 2020 registration, to predict the 2020 results.

In [1]:
# Useful imports
import pandas as pd

from tensorflow import keras

import numpy as np

res_dir = '../data/results/cleaned'
reg_dir = '../data/registration/cleaned'

## Data Preparation

We need to arrange the cleaned CSV files into inputs and outputs with the appropriate groupings.

Senate elections are every four years, so prior results from 2012 will be paired with registration in 2016 to predict the 2016 outcome. Similarly with 2014 and 2018.
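The pairing rule above can be written down directly. This is a minimal sketch, assuming a four-year term; the helper name `make_year_pairs` and its `term` parameter are illustrative and not part of the original notebook.

```python
def make_year_pairs(target_years, term=4):
    """Pair each target election year with the election one term earlier.

    `make_year_pairs` and `term` are hypothetical names used for illustration.
    Years are returned as strings, in [prior_year, target_year] order.
    """
    return [[str(year - term), str(year)] for year in target_years]


year_pairs = make_year_pairs([2016, 2018])
print(year_pairs)  # [['2012', '2016'], ['2014', '2018']]
```

Generating the pairs from the target years keeps the prior and target years consistent if more election cycles are added later.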

First, define some helper functions that pre-process the data from the cleaned registration and results files.

In [2]:
def get_registration(year):
    """Given an input year, load the registration data for that year and collapse the affiliation of active
    voters into REP, DEM, OTHER.
    
    Returns a dataframe with columns DIST_COUNTY, REP, DEM, OTHER"""
    
    # read in the data
    reg_df = pd.read_csv(reg_dir+'/{}.csv'.format(year))
    
    # filter out where the county value is empty, which is a "total" row
    reg_df = reg_df[reg_df['COUNTY'].notnull()]
    
    # isolate only active voters
    active_cols = [col for col in reg_df.columns if '-ACTIVE' in col]
    
    # find third-party/unaffiliated voter data
    other_cols = [col for col in active_cols if 'DEM' not in col and 'REP' not in col]
    
    # combine all unaffiliateds into one column
    reg_df['OTHER-ACTIVE'] = reg_df[other_cols].sum(axis=1)
    
    # combine the district and county labels into one
    reg_df['DIST_COUNTY'] = reg_df['DISTRICT'].astype(str) +'-'+ reg_df['COUNTY'].astype(str)

    # pick off only the interesting data
    relabel_dict = {
        'REP-ACTIVE' : 'REP',
        'DEM-ACTIVE' : 'DEM',
        'OTHER-ACTIVE' : 'OTHER'
    }
    new_df = reg_df[['DIST_COUNTY', 'REP-ACTIVE', 'DEM-ACTIVE', 'OTHER-ACTIVE']].rename(columns=relabel_dict)
    
    return new_df
    
def get_results(year):
    """Given an input year, load the results data for that year and collapse the party affiliation of
    candidates into REP, DEM, OTHER.
    
    Returns a dataframe with columns DIST_COUNTY, DEM, OTHER, REP (vote totals)"""

    # read in the results file
    df = pd.read_csv(res_dir+'/{}.csv'.format(year))
    
    # create a DIST-COUNTY label
    df['DIST_COUNTY'] = df['DISTRICT'].astype(str) + '-' + df['COUNTY'].astype(str)
    
    # isolate third party candidates
    parties = ['DEMOCRATIC PARTY', 'REPUBLICAN PARTY']
    # use .loc rather than chained indexing so the assignment modifies df itself
    df.loc[~df['PARTY'].isin(parties), 'PARTY'] = 'OTHER'
    df.loc[df['PARTY'] == 'REPUBLICAN PARTY', 'PARTY'] = 'REP'
    df.loc[df['PARTY'] == 'DEMOCRATIC PARTY', 'PARTY'] = 'DEM'

    # sum over precincts, if precincts exist
    agg_cols = {'YES VOTES' : 'sum'}
    df = df.groupby(['DIST_COUNTY', 'PARTY'], as_index=False).agg(agg_cols)
    
    # Make party votes into columns for each DIST_COUNTY
    df = df.pivot(index='DIST_COUNTY', columns='PARTY', values='YES VOTES').fillna(0)
    df.reset_index(level=0, inplace=True)

    return df
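
The relabel, aggregate, and pivot steps in `get_results` can be tried in isolation. The following is a sketch on made-up rows (the district names and vote counts are invented for illustration). It uses `map` plus `fillna` as one possible way to relabel parties, and adds a `reindex` step, which is not in the notebook, so that the DEM, REP, and OTHER columns exist even when a party has no candidates in the file.

```python
import pandas as pd

# Made-up precinct-level rows, only to illustrate the reshaping steps.
toy = pd.DataFrame({
    'DIST_COUNTY': ['SD 1-A', 'SD 1-A', 'SD 1-A', 'SD 2-B', 'SD 2-B'],
    'PARTY': ['DEMOCRATIC PARTY', 'REPUBLICAN PARTY', 'LIBERTARIAN PARTY',
              'DEMOCRATIC PARTY', 'DEMOCRATIC PARTY'],
    'YES VOTES': [100, 80, 5, 40, 60],
})

# Map the two major parties to short labels; everything else becomes OTHER.
short_labels = {'DEMOCRATIC PARTY': 'DEM', 'REPUBLICAN PARTY': 'REP'}
toy['PARTY'] = toy['PARTY'].map(short_labels).fillna('OTHER')

# Sum votes per district-county and party (collapses precinct rows).
totals = toy.groupby(['DIST_COUNTY', 'PARTY'], as_index=False)['YES VOTES'].sum()

# Spread parties into columns; a party with no candidate gets 0 votes.
wide = totals.pivot(index='DIST_COUNTY', columns='PARTY', values='YES VOTES').fillna(0)
wide = wide.reset_index()

# Assumption: guarantee a fixed column set even if a party is absent everywhere.
wide = wide.reindex(columns=['DIST_COUNTY', 'DEM', 'REP', 'OTHER'], fill_value=0)
print(wide)
```

Fixing the column set up front means later code can rely on `wide.DEM`, `wide.REP`, and `wide.OTHER` without checking which parties happened to appear in a given year.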

In [3]:
res_2012 = get_results(2012)
print(res_2012)

PARTY       DIST_COUNTY      DEM    OTHER      REP
0         SD 10-EL PASO      0.0  15976.0  44200.0
1         SD 12-EL PASO      0.0  16365.0  34673.0
2         SD 14-LARIMER  46673.0   4994.0  28874.0
3         SD 17-BOULDER  45426.0   3848.0  23983.0
4         SD 18-BOULDER  66619.0      0.0  18427.0
5       SD 19-JEFFERSON  35664.0   5104.0  35080.0
6           SD 21-ADAMS  30308.0      0.0  16373.0
7       SD 22-JEFFERSON  38845.0      0.0  35008.0
8      SD 23-BROOMFIELD  15898.0      0.0  14755.0
9         SD 23-LARIMER   2996.0      0.0   5407.0
10           SD 23-WELD  15358.0      0.0  23787.0
11          SD 25-ADAMS  27961.0   2461.0  20310.0
12       SD 26-ARAPAHOE  38744.0      0.0  32890.0
13       SD 27-ARAPAHOE  34957.0      0.0  42411.0
14       SD 28-ARAPAHOE  37181.0   2459.0  24475.0
15       SD 29-ARAPAHOE  30149.0   2420.0  18745.0
16       SD 31-ARAPAHOE   1839.0      0.0   1039.0
17         SD 31-DENVER  52551.0      0.0  22386.0
18         SD 32-DENVER  47995.

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy


At this point, it is helpful to combine all the relevant input and output data for each dist-county into a single input and output dataframe.
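
When frames are combined this way, every row must refer to the same district-county in each source. Filtering with `isin` keeps each frame's own row order and does not add rows that are missing from one source, so positional arithmetic on `.values` arrays can silently misalign. One way to make the alignment explicit is to merge on `DIST_COUNTY` before computing fractions. This is sketched here on made-up data; the frame contents and the `_reg` / `_votes` suffixes are assumptions for illustration.

```python
import pandas as pd

# Made-up frames whose rows are in different orders and cover different districts.
current_reg = pd.DataFrame({
    'DIST_COUNTY': ['SD 2-B', 'SD 1-A', 'SD 3-C'],
    'REP': [30, 50, 10], 'DEM': [40, 30, 10], 'OTHER': [30, 20, 20],
})
past_result = pd.DataFrame({
    'DIST_COUNTY': ['SD 1-A', 'SD 2-B'],
    'DEM': [45, 60], 'OTHER': [5, 0], 'REP': [50, 40],
})

# An inner merge keeps only district-counties present in both frames and
# places matching rows side by side; suffixes tell the two sources apart.
merged = current_reg.merge(past_result, on='DIST_COUNTY', suffixes=('_reg', '_votes'))

parties = ['DEM', 'REP', 'OTHER']
reg = merged[[p + '_reg' for p in parties]]
votes = merged[[p + '_votes' for p in parties]]

# Row-wise fractions: divide each row by its own total.
reg_frac = reg.div(reg.sum(axis=1), axis=0).to_numpy()
votes_frac = votes.div(votes.sum(axis=1), axis=0).to_numpy()

print(merged['DIST_COUNTY'].tolist())
print(reg_frac)
print(votes_frac)
```

Because both fraction arrays come from the same merged frame, row `i` of `reg_frac` and row `i` of `votes_frac` always describe the same district-county, and `SD 3-C`, which has no past result, is dropped rather than shifting the rows below it.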

In [55]:
# Will store the district-county data using the standard key format 
# 'SD [NUMBER]-[COUNTY]' for the Year of the Predicted Results
year_pairs = [['2012', '2016'], ['2014', '2018']]

X_set = np.array([], dtype=np.float64).reshape(0,9)
Y_set = np.array([], dtype=np.float64).reshape(0,3)

for pair in year_pairs:
    
    # The output variables
    current_result = get_results(pair[1])
    elections = current_result.DIST_COUNTY.unique()
    
    # The input variables
    current_reg = get_registration(pair[1])
    current_reg = current_reg[current_reg['DIST_COUNTY'].isin(elections)]
    
    past_reg = get_registration(pair[0])
    past_reg = past_reg[past_reg['DIST_COUNTY'].isin(elections)]
    
    past_result = get_results(pair[0])
    past_result = past_result[past_result['DIST_COUNTY'].isin(elections)]
        
    # sum only the numeric party columns (DIST_COUNTY is a string label)
    tot_regs = current_reg[['DEM', 'REP', 'OTHER']].sum(axis=1).values
        
    print('tot_regs - {}'.format(tot_regs))

    dem_regs = current_reg.DEM.values
    rep_regs = current_reg.REP.values
    other_regs = current_reg.OTHER.values

    reg_frac_curr_d = dem_regs/tot_regs
    reg_frac_curr_r = rep_regs/tot_regs
    reg_frac_curr_o = other_regs/tot_regs
    
    print('reg_frac_curr = {}'.format(reg_frac_curr_o))

    tot_regs = past_reg[['DEM', 'REP', 'OTHER']].sum(axis=1).values
    dem_regs = past_reg.DEM.values
    rep_regs = past_reg.REP.values
    other_regs = past_reg.OTHER.values

    reg_frac_past_d = dem_regs/tot_regs
    reg_frac_past_r = rep_regs/tot_regs
    reg_frac_past_o = other_regs/tot_regs

    dem_votes = past_result.DEM.values
    rep_votes = past_result.REP.values
    other_votes = past_result.OTHER.values
    tot_votes = dem_votes + rep_votes + other_votes

    votes_frac_past_d = dem_votes/tot_votes
    votes_frac_past_r = rep_votes/tot_votes
    votes_frac_past_o = other_votes/tot_votes
    
    X = np.array([reg_frac_curr_d, reg_frac_curr_r, reg_frac_curr_o,
               reg_frac_past_d, reg_frac_past_r, reg_frac_past_o,
               votes_frac_past_d, votes_frac_past_r, votes_frac_past_o]).T
    
    # Now for the outputs
    dem_votes = current_result.DEM.values
    rep_votes = current_result.REP.values
    other_votes = current_result.OTHER.values
    tot_votes = dem_votes + rep_votes + other_votes
    
    votes_frac_curr_d = dem_votes/tot_votes
    votes_frac_curr_r = rep_votes/tot_votes
    votes_frac_curr_o = other_votes/tot_votes
    
    Y = np.array([votes_frac_curr_d, votes_frac_curr_r, votes_frac_curr_o]).T
    
    print(np.shape(X))
    X_set = np.vstack([X_set, X])
    print(np.shape(X_set))

    Y_set = np.vstack([Y_set, Y])
    print(np.shape(Y))
    print(np.shape(Y_set))



tot_regs - [109270.  31106.  10354.    982.   7374.   3811.  16698.  19852.  92607.
  82648. 105654.  97144. 105368.  97382.  70892.  42592.  13652.  59092.
  75144.  95261.  98298.  88563.  74861.   4355. 100646.  95069.   8382.
   2525.   2418.   4843.   2434.   1870.   3467.   4422.    936.   8376.
    735.  10492.   6152.  12402.   6682.   3608.]
reg_frac_curr = [0.33337604 0.41583617 0.36073015 0.19246436 0.30987253 0.21044345
 0.41448078 0.45884546 0.36419493 0.36921644 0.41439037 0.38757926
 0.36121023 0.390298   0.3838656  0.39965252 0.35855552 0.37856224
 0.38410252 0.36987854 0.3497121  0.37552928 0.36871001 0.38576349
 0.35486756 0.33986894 0.31663088 0.24792079 0.29197684 0.15610159
 0.20090386 0.2540107  0.23247765 0.26775215 0.24358974 0.26611748
 0.19455782 0.30632863 0.30656697 0.30761168 0.26264591 0.30986696]
(42, 9)
(42, 9)
(42, 3)
(42, 3)
tot_regs - [  1234.  18748.   4353.   2748.  11172.  15009.   2673.   1556.   3037.
  22915.   5448.   6958.  29461.  26940.  127



## Model Training

## 2020 Predictions