# Problem & Use Case

The question is whether a county's demographics can predict which candidate the county will vote for. Political parties, Super PACs, and activists constantly have to decide where to spend their campaign resources. In other words, which counties are we already likely to win, and which counties do we need to spend ad money on?

The aim is to build a model that predicts who will win a county from demographic features such as racial and gender breakdown, income levels, commutes, and type of employment. This will help political parties and PACs decide which counties offer the highest return on their investment.

# Dataset Creation

There are two datasets here, the presidential election dataset and the 2015 US census data, which must be merged.

The presidential election dataset lists each county in the US, the candidates who received votes there and how many, and who won the county. The 2015 county census data shows the demographic breakdown of each county, including income level, race, and other indicators.
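The merge of these two tables can be sketched as follows. This is a minimal illustration, not the notebook's actual merge code: the two small DataFrames and the choice of `county` and `state` as join keys are assumptions, since the raw files and their column names are not shown here.

```python
import pandas as pd

# Hypothetical miniature versions of the two source tables; the real files
# and their exact column names are not shown in this notebook.
election = pd.DataFrame({
    "county": ["Los Angeles", "Maricopa", "Cook"],
    "state": ["CA", "AZ", "IL"],
    "lead": ["Hillary Clinton", "Donald Trump", "Hillary Clinton"],
})
census = pd.DataFrame({
    "county": ["Los Angeles", "Maricopa", "Harris"],
    "state": ["CA", "AZ", "TX"],
    "totalpop": [10038388, 4018143, 4356362],
})

# County names repeat across states, so join on both county and state.
# An inner join keeps only the counties that appear in both tables.
merged = election.merge(census, on=["county", "state"], how="inner")
print(merged)
```

Joining on the county name alone would silently pair up unrelated counties that share a name, which is why the sketch uses both keys.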

The merged dataset contains the demographic breakdown of each county and which party's candidate won it, so that a county's winner can be predicted from its demographics. The final dataset looks as follows.

In [1]:
import warnings

import joblib
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

warnings.filterwarnings("ignore")

In [2]:
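# Drop the leftover index column; the positional axis argument drop('Unnamed: 0', 1) no longer works in pandas 2.x
data = pd.read_csv('cleaned_dataset.csv').drop(columns='Unnamed: 0')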
data

Unnamed: 0,county,lead,state,totalpop,men,women,hispanic,white,black,native,...,walk,othertransp,workathome,meancommute,employed,privatework,publicwork,selfemployed,familywork,unemployment
0,Los Angeles,Hillary Clinton,CA,10038388.0,4945351.0,5093037.0,48.2,26.9,8.0,0.2,...,2.8,2.3,5.1,30.0,4635465.0,79.0,11.5,9.4,0.2,10.0
1,Cook,Hillary Clinton,IL,5236393.0,2537245.0,2699148.0,24.7,43.1,23.7,0.1,...,4.4,2.2,4.2,32.3,2463655.0,83.9,11.5,4.5,0.1,10.7
2,Harris,Hillary Clinton,TX,4356362.0,2166727.0,2189635.0,41.6,31.7,18.5,0.2,...,1.5,2.0,3.3,28.2,2081889.0,83.4,10.1,6.3,0.1,7.5
3,Maricopa,Donald Trump,AZ,4018143.0,1986158.0,2031985.0,30.1,57.3,4.9,1.6,...,1.6,2.6,5.9,25.5,1821038.0,82.5,11.7,5.7,0.2,7.7
4,Miami-Dade,Hillary Clinton,FL,2639042.0,1280221.0,1358821.0,65.6,15.1,16.8,0.1,...,2.2,1.9,4.3,29.9,1204871.0,81.9,10.2,7.7,0.2,10.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3040,McPherson,Donald Trump,NE,433.0,223.0,210.0,0.9,97.5,0.0,0.0,...,15.7,4.6,22.6,32.0,222.0,69.4,5.9,23.9,0.9,0.9
3041,Clark,Donald Trump,ID,901.0,440.0,461.0,41.4,58.2,0.0,0.0,...,13.0,0.0,3.8,17.0,442.0,74.4,21.5,3.4,0.7,3.9
3042,Arthur,Donald Trump,NE,448.0,223.0,225.0,0.0,98.9,0.0,0.0,...,12.4,0.0,19.9,19.5,193.0,54.4,17.1,27.5,1.0,4.0
3043,Kenedy,Hillary Clinton,TX,565.0,295.0,270.0,66.2,33.6,0.0,0.0,...,5.4,0.0,0.0,16.6,185.0,51.9,48.1,0.0,0.0,0.0


# Feature Engineering

Data exploration shows that many of the columns correlate with each other because they are raw counts.

For example, as population increases, the number of people employed, the number of men, and so on also increase. To account for this, many columns are converted to percentages of the total population, and redundant columns are dropped (for example, the number employed is dropped in favor of % unemployed).
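As a rough sketch of this conversion, the snippet below turns one raw count into a percentage and drops the redundant counts. The column name `perc_men` matches the feature list used later, but the exact formula and the set of dropped columns are assumptions.

```python
import pandas as pd

# Values taken from the first two rows of the merged dataset shown above.
df = pd.DataFrame({
    "county": ["Los Angeles", "Cook"],
    "totalpop": [10038388.0, 5236393.0],
    "men": [4945351.0, 2537245.0],
    "women": [5093037.0, 2699148.0],
    "employed": [4635465.0, 2463655.0],
    "unemployment": [10.0, 10.7],
})

# Express a raw count as a percentage of the county's total population
# (assumed formula; the notebook does not show how perc_men was computed).
df["perc_men"] = 100 * df["men"] / df["totalpop"]

# Drop raw counts that scale with population and are now redundant
# (assumed selection of columns).
df = df.drop(columns=["men", "women", "employed"])
print(df)
```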

Dummy variables for each state are added because a county's likelihood of voting Democratic or Republican depends partly on its state: compare a Democratic stronghold such as California with a Republican stronghold such as Texas.
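One common way to build such state indicators is `pd.get_dummies`. The sketch below assumes a `state` column like the one in the merged dataset; whether the notebook used this exact call is not shown.

```python
import pandas as pd

df = pd.DataFrame({
    "county": ["Los Angeles", "Harris", "Kenedy"],
    "state": ["CA", "TX", "TX"],
})

# One 0/1 column per state, named after the state abbreviation.
state_dummies = pd.get_dummies(df["state"], dtype=float)

# Replace the text column with the indicator columns.
df = pd.concat([df.drop(columns="state"), state_dummies], axis=1)
print(df)
```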

The candidate who won each county is converted to a binary value (1 or 0) that represents which party won the county: 1 if the county went to the Republican, and 0 if the county went to the Democrat.
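A minimal sketch of this encoding, assuming the winner is stored in the `lead` column as in the merged dataset and that only the two major-party candidates appear there:

```python
import pandas as pd

df = pd.DataFrame({
    "county": ["Los Angeles", "Maricopa", "Kenedy"],
    "lead": ["Hillary Clinton", "Donald Trump", "Hillary Clinton"],
})

# 1 = Republican won the county, 0 = Democrat won the county.
party_code = {"Donald Trump": 1.0, "Hillary Clinton": 0.0}
df["party"] = df["lead"].map(party_code)
print(df)
```

With `map`, any county led by a candidate missing from the dictionary would become `NaN`, so it is worth checking for missing values after this step.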

The final features are as follows.

In [3]:
features = pd.read_csv('features_v3.csv')
features

Unnamed: 0,party,totalpop,income,service,office,drive,carpool,transit,workathome,meancommute,...,SD,TN,TX,UT,VA,VT,WA,WI,WV,WY
0,0.0,1.000000,0.354074,0.446203,0.654952,0.755932,0.331104,0.110211,0.137097,0.593023,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.521624,0.344999,0.417722,0.648562,0.632768,0.290970,0.298217,0.112903,0.659884,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.433955,0.337373,0.392405,0.619808,0.825989,0.371237,0.047002,0.088710,0.540698,...,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,1.0,0.400262,0.335184,0.424051,0.738019,0.795480,0.367893,0.038898,0.158602,0.462209,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.262875,0.228581,0.506329,0.769968,0.800000,0.307692,0.089141,0.115591,0.590116,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3040,1.0,0.000017,0.335923,0.170886,0.760383,0.467797,0.324415,0.000000,0.607527,0.651163,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3041,1.0,0.000063,0.137758,0.493671,0.252396,0.696045,0.521739,0.000000,0.102151,0.215116,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3042,1.0,0.000018,0.192528,0.186709,0.000000,0.520904,0.521739,0.000000,0.534946,0.287791,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3043,0.0,0.000030,0.164322,0.920886,0.523962,1.000000,0.000000,0.000000,0.000000,0.203488,...,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


# Model Definition

This problem calls for a binary classifier that labels a county as voting for the Democrat or the Republican. A logistic regression model and a simple neural network were both tested, and the logistic regression model performed better on a simple accuracy score (true positives plus true negatives, divided by the total number of predictions).
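To make the accuracy score concrete, the sketch below computes it from a confusion matrix on made-up labels (not the notebook's data) and checks it against scikit-learn's `accuracy_score`. Accuracy is the share of predictions that are correct, so the count of true positives and true negatives is divided by the total number of predictions.

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Made-up labels for illustration: 1 = Republican, 0 = Democrat.
y_true = [1, 1, 1, 0, 0, 1, 0, 1]
y_pred = [1, 1, 0, 0, 1, 1, 0, 1]

# For binary labels, ravel() yields the counts in the order tn, fp, fn, tp.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

accuracy = (tp + tn) / (tp + tn + fp + fn)
print(accuracy)                        # 0.75
print(accuracy_score(y_true, y_pred))  # 0.75
```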

The training dataset is imbalanced: there are many more Republican counties than Democratic counties. Thus, during the model training phase, different downsampling rates are tried to see which yields the most accurate model. In essence, we reduce the number of Republican counties in the training data to see whether that improves model performance.
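Downsampling the majority class could look like the sketch below. The helper `downsample_majority` and its `keep_frac` parameter are hypothetical names introduced for illustration; the notebook's own downsampling code is not shown.

```python
import pandas as pd

def downsample_majority(df, label_col, majority_label, keep_frac, seed=0):
    """Keep a random keep_frac share of the majority-class rows plus all other rows."""
    majority = df[df[label_col] == majority_label]
    others = df[df[label_col] != majority_label]
    kept = majority.sample(frac=keep_frac, random_state=seed)
    # Shuffle so the classes are not grouped together in the result.
    combined = pd.concat([kept, others]).sample(frac=1.0, random_state=seed)
    return combined.reset_index(drop=True)

# Toy training set: 8 Republican (1.0) counties and 2 Democratic (0.0) counties.
train = pd.DataFrame({"party": [1.0] * 8 + [0.0] * 2, "income": range(10)})

balanced = downsample_majority(train, "party", 1.0, keep_frac=0.25)
print(balanced["party"].value_counts())
```

Applying this to the training split only leaves the test set with the real class balance, so the test score still reflects the counties the model would actually face.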

In [4]:
# final model with imported dataset
data = pd.read_csv('features_v3.csv')

X = data[['totalpop', 'income', 'service', 'office', 'drive', 'carpool',
           'transit', 'workathome', 'meancommute', 'unemployment',
           'perc_men', 'perc_white', 'perc_private_work', 'perc_citizen', 'AL',
           'AR', 'AZ', 'CA', 'CO', 'CT', 'DC', 'DE', 'FL', 'GA', 'HI', 'IA', 'ID',
           'IL', 'IN', 'KS', 'KY', 'MA', 'MD', 'ME', 'MI', 'MN', 'MO', 'MS', 'MT',
           'NC', 'ND', 'NE', 'NH', 'NJ', 'NM', 'NV', 'NY', 'OH', 'OK', 'OR', 'PA',
           'RI', 'SC', 'SD', 'TN', 'TX', 'UT', 'VA', 'VT', 'WA', 'WI', 'WV', 'WY']]
y = data['party']

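# Hold out 20% of the counties for testing. No random_state is set,
# so the split differs from run to run.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

# Logistic regression with scikit-learn's default settings
model = LogisticRegression()

model.fit(X_train, y_train)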

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='warn', n_jobs=None, penalty='l2',
                   random_state=None, solver='warn', tol=0.0001, verbose=0,
                   warm_start=False)

# Model Training

The model was trained with multiple downsampling rates, and the results were saved for comparison. The model with the 100% Republican downsampling rate performed best, so it is the final model. The final model and its score are shown below.
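The comparison across downsampling rates can be sketched as a loop that trains one model per rate, scores each on the same test set, and saves the best one with `joblib`. Everything here is illustrative: the data is synthetic, `keep_frac` is a hypothetical name for the share of Republican counties kept, and the file name `example_model.sav` is a placeholder rather than the notebook's saved model.

```python
import os
import tempfile

import joblib
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the feature table, with class 1.0 as the majority.
rng = np.random.default_rng(0)
n = 400
X = pd.DataFrame({"f1": rng.normal(size=n), "f2": rng.normal(size=n)})
y = pd.Series((X["f1"] + 0.5 * rng.normal(size=n) > -0.8).astype(float), name="party")

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

scores, models = {}, {}
for keep_frac in (0.25, 0.5, 0.75, 1.0):
    # Keep a random share of the majority-class rows and all minority-class rows.
    majority_idx = y_train[y_train == 1.0].index
    minority_idx = y_train[y_train == 0.0].index
    kept_idx = majority_idx.to_series().sample(frac=keep_frac, random_state=0).index
    idx = kept_idx.union(minority_idx)

    model = LogisticRegression(max_iter=1000).fit(X_train.loc[idx], y_train.loc[idx])
    scores[keep_frac] = model.score(X_test, y_test)  # accuracy on the untouched test set
    models[keep_frac] = model

# Save the best-scoring model and confirm it reloads with the same score.
best_frac = max(scores, key=scores.get)
path = os.path.join(tempfile.mkdtemp(), "example_model.sav")
joblib.dump(models[best_frac], path)
reloaded = joblib.load(path)
print(scores, best_frac)
```

Scoring every candidate on the same held-out split keeps the comparison fair, because only the training data changes between runs.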

In [5]:
model = joblib.load('finalized_election_model.sav') # load in final model
score = model.score(X_test, y_test)
score

0.9408866995073891

The final model's accuracy on the test set was about 94%, which is good enough to use it as the final model.