# Homework 6-1: "Fundamentals" based election prediction

In this homework you will explore an alternative election prediction model, using various economic and political indicators instead of polling data -- and also deal with the challenges of model building when there is very little training data. Political scientists have long analyzed these types of "fundamentals" models, and they can be reasonably accurate. For example, fundamentals [slightly favored](https://fivethirtyeight.com/features/it-wasnt-clintons-election-to-lose/) the Republicans in 2016.

Data sources which I used to generate `election-fundamentals.csv`:

- Historical presidential approval ratings (highest and lowest for each president) from [Wikipedia](https://en.wikipedia.org/wiki/United_States_presidential_approval_rating) 
- GDP growth in election year from [World Bank](https://data.worldbank.org/indicator/NY.GDP.MKTP.KD.ZG?locations=US)

Note that there are some timing issues here which more careful forecasts would avoid. The presidential approval rating is for the entire presidential term. The GDP growth is for the entire election year. These variables might have higher predictive power if they were (for example) sampled in the last quarters before the election.

For a comprehensive view of election prediction from non-poll data, and how well it might or might not be able to do, try [this article](https://fivethirtyeight.com/features/models-based-on-fundamentals-have-failed-at-predicting-presidential-elections/) from FiveThirtyEight.

In [1]:
import pandas as pd
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

In [2]:
# First, import data/election-fundamentals.csv and take a look at what we have
fund = pd.read_csv('data/election-fundamentals.csv')
fund.head()

Unnamed: 0,year,incumbent_president,incumbent_party,term,highest_approval,lowest_approval,year_gdp_growth,winner
0,1960,Esienhower,R,2,79,47,2.6,D
1,1964,Johnson,D,1,79,34,5.8,D
2,1968,Johnson,D,2,79,34,4.8,R
3,1972,Nixon,R,1,66,24,5.3,R
4,1976,Nixon,R,2,66,24,5.4,D


In [3]:
# How many elections do we have data for?
fund.shape

(15, 8)

In [5]:
# Rather than predicting the winning party, we're going to predict whether the same party stays in power or flips
# This is going to be the target variable
fund['flips'] = fund.winner != fund.incumbent_party
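As a quick sanity check on the target definition, here is a minimal sketch that rebuilds only the five rows shown in the `fund.head()` output above and applies the same comparison. The small `sample` DataFrame is a stand-in for illustration, not the full dataset.

```python
import pandas as pd

# Stand-in for the first five rows printed by fund.head() above
sample = pd.DataFrame({
    'year': [1960, 1964, 1968, 1972, 1976],
    'incumbent_party': ['R', 'D', 'D', 'R', 'R'],
    'winner': ['D', 'D', 'R', 'R', 'D'],
})

# Same rule as in the notebook: a "flip" is any election
# where the winner is not the incumbent party
sample['flips'] = sample['winner'] != sample['incumbent_party']

print(sample[['year', 'incumbent_party', 'winner', 'flips']])
print("flips in this sample:", int(sample['flips'].sum()), "of", len(sample))
```

In these five rows, 1960, 1968 and 1976 are flips, while 1964 and 1972 are not.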

In [9]:
# Pull out all other numeric columns as features. Create features and target numpy arrays
fields = ['term', 'highest_approval', 'lowest_approval', 'year_gdp_growth']
features = fund[fields].values
target = fund['flips'].values

In [10]:
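# Use 3-fold cross validation to see how well we can do with a tree-based classifier.
# (The exercise suggests RandomForestClassifier; the scores below come from a single DecisionTreeClassifier.)
# Print out the scores
my_classifier = DecisionTreeClassifier()
scores = cross_val_score(my_classifier, features, target, cv=3)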
scores

array([0.83333333, 0.4       , 0.75      ])

How predictable are election results just from these variables, as compared to a coin flip?

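Not very. The mean accuracy across the three folds is about 0.66, only somewhat better than a coin flip, and one fold scores 0.4, which is worse than chance. With only 15 elections, each fold is tested on just a handful of cases, so the scores are too noisy to show that the model has learned anything reliable.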
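To put numbers on the comparison with a coin flip, the sketch below summarizes the three fold scores printed above. It treats 0.5 as the baseline, which is an assumption: if flips and non-flips are not evenly balanced in the data, always guessing the majority class would be a stricter baseline.

```python
import numpy as np

# Fold accuracies copied from the cross_val_score output above
scores = np.array([0.83333333, 0.4, 0.75])

mean_score = scores.mean()
spread = scores.std()      # standard deviation across the three folds
coin_flip = 0.5            # assumed baseline accuracy of random guessing

print(f"mean accuracy:         {mean_score:.3f}")
print(f"std across folds:      {spread:.3f}")
print(f"margin over coin flip: {mean_score - coin_flip:.3f}")
```

The spread between folds is larger than the margin over the baseline, which is one way to see that three folds on 15 rows cannot separate a useful model from a lucky one.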

In [13]:
# Now create a logistic regression using all the data
# Normally we'd split into test and training years, but here we're only interested in the coefficients
lr = LogisticRegression()
lr.fit(features, target)
lr.coef_

array([[ 1.21895849,  0.025108  , -0.0436287 , -0.55659437]])

In [14]:
# What is the influence of each feature?
# Remember to use np.exp to turn the lr coefficients into odds ratios
results = pd.DataFrame(np.exp(lr.coef_), columns=fields)
results

Unnamed: 0,term,highest_approval,lowest_approval,year_gdp_growth
0,3.383662,1.025426,0.957309,0.573158


Describe the effect of each one of our features on whether or not the party in power flips. What feature has the biggest effect? How does economic growth relate? Are there any factors that operate backwards from what you would expect, and if so what do you think is happening?

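According to these odds ratios, the term has the largest effect: each additional term the party has held multiplies the odds of a flip by about 3.4. Voters probably expect a change with a new person and a new party, and this expectation grows the longer a party has been in power.

GDP growth makes people less likely to vote a different party into power: each additional percentage point of election-year growth multiplies the odds of a flip by about 0.57.

Approval ratings seem to have very little influence per point (odds ratios of about 1.03 and 0.96). The highest approval rating points slightly in the unexpected direction, since a higher peak goes with marginally higher odds of a flip, but with only 15 elections this may well be noise.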
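One caveat when ranking features by odds ratio is that each ratio is per one unit of its feature, and the features are on very different scales (terms, approval points, percentage points of growth). The sketch below reuses the coefficients printed above and rescales them to a larger change per feature. The `deltas` values are hypothetical choices for illustration, not something the dataset prescribes.

```python
import numpy as np

fields = ['term', 'highest_approval', 'lowest_approval', 'year_gdp_growth']
# Coefficients copied from the lr.coef_ output above
coef = np.array([1.21895849, 0.025108, -0.0436287, -0.55659437])

# Hypothetical "typical" change in each feature (an assumption for illustration)
deltas = np.array([1, 10, 10, 1])

odds_per_unit = np.exp(coef)            # odds ratio for a one-unit change
odds_per_delta = np.exp(coef * deltas)  # odds ratio for the larger change

for name, d, per_unit, per_delta in zip(fields, deltas, odds_per_unit, odds_per_delta):
    print(f"{name:18s} x{per_unit:.3f} per unit, x{per_delta:.3f} per {d}-unit change")
```

Under these assumed step sizes, a 10-point difference in lowest approval changes the odds of a flip by a factor comparable to one percentage point of GDP growth, so a small per-point effect does not by itself mean the feature is unimportant.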

