# Election Forecasting: Predicting the Winner Before any Votes are Cast

<img src="images/donkey-and-elephant.jpg"/>

### Election Prediction

    Goal: Use polling data to predict state winners
    
### The Dataset

Data from RealClearPolitics.com

Instances represent a state in a given election

    State: Name of state

    Year: Election year (2004, 2008, 2012)

Dependent variable

    Republican: 1 if Republican won state, 0 if Democrat won

Independent variables

    Rasmussen, SurveyUSA: Polled R% - Polled D%

    DiffCount: Polls with R winner - Polls with D winner

    PropR: Polls with R winner / # polls

### Simple Approaches to Missing Data

    Delete the missing observations

        We would be throwing away more than 50% of the data

        We want to predict for all states

    Delete variables with missing values

        We want to retain data from Rasmussen/SurveyUSA

    Fill missing data points with average values

        The average value for a poll will be close to 0 (tie between Democrat and Republican)

        If other polls in a state favor one candidate, the missing one probably would have, too

### Multiple Imputation

    Fill in missing values based on non-missing values

        If Rasmussen is very negative, then a missing SurveyUSA value will likely be negative

        As with sample.split, results will differ between runs unless you fix the random seed

    Although the method is complicated, we can use it easily through R’s libraries

    We will use Multiple Imputation by Chained Equations (mice) package
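
To make the contrast between average-value filling and filling from non-missing values concrete, here is a minimal sketch on made-up numbers. The `toy` data frame and its values are hypothetical, and the regression fill is a much simpler stand-in for the idea, not the chained-equations algorithm that mice uses.

```r
# Toy data (hypothetical values, not taken from PollingData.csv)
toy <- data.frame(
  Rasmussen = c(-20, -10, -5, 5, 10, 20, -15, 15),
  SurveyUSA = c(-18, -12, -4, 6, 9, 22, NA, NA)
)

# Average-value fill: every missing entry gets the same number,
# regardless of what Rasmussen says about that state
mean_fill <- mean(toy$SurveyUSA, na.rm = TRUE)

# Regression fill: predict SurveyUSA from Rasmussen using the complete rows
# (lm drops rows with NA by default)
fit <- lm(SurveyUSA ~ Rasmussen, data = toy)
is_missing <- is.na(toy$SurveyUSA)
reg_fill <- predict(fit, newdata = toy[is_missing, ])

mean_fill   # one value near zero for both missing rows
reg_fill    # negative where Rasmussen is negative, positive where it is positive
```

The average-value fill lands near a tie for both rows, while the regression fill follows the sign of the Rasmussen value in the same row.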

### Read in Dataset

In [1]:
polling = read.csv("data/PollingData.csv")
head(polling)

Unnamed: 0_level_0,State,Year,Rasmussen,SurveyUSA,DiffCount,PropR,Republican
Unnamed: 0_level_1,<fct>,<int>,<int>,<int>,<int>,<dbl>,<int>
1,Alabama,2004,11.0,18.0,5,1,1
2,Alabama,2008,21.0,25.0,5,1,1
3,Alaska,2004,,,1,1,1
4,Alaska,2008,16.0,,6,1,1
5,Arizona,2004,5.0,15.0,8,1,1
6,Arizona,2008,5.0,,9,1,1


In [2]:
str(polling)

'data.frame':	145 obs. of  7 variables:
 $ State     : Factor w/ 50 levels "Alabama","Alaska",..: 1 1 2 2 3 3 3 4 4 4 ...
 $ Year      : int  2004 2008 2004 2008 2004 2008 2012 2004 2008 2012 ...
 $ Rasmussen : int  11 21 NA 16 5 5 8 7 10 NA ...
 $ SurveyUSA : int  18 25 NA NA 15 NA NA 5 NA NA ...
 $ DiffCount : int  5 5 1 6 8 9 4 8 5 2 ...
 $ PropR     : num  1 1 1 1 1 ...
 $ Republican: int  1 1 1 1 1 1 1 1 1 1 ...


In [3]:
summary(polling)

         State          Year        Rasmussen          SurveyUSA       
 Arizona    :  3   Min.   :2004   Min.   :-41.0000   Min.   :-33.0000  
 Arkansas   :  3   1st Qu.:2004   1st Qu.: -8.0000   1st Qu.:-11.7500  
 California :  3   Median :2008   Median :  1.0000   Median : -2.0000  
 Colorado   :  3   Mean   :2008   Mean   :  0.0404   Mean   : -0.8243  
 Connecticut:  3   3rd Qu.:2012   3rd Qu.:  8.5000   3rd Qu.:  8.0000  
 Florida    :  3   Max.   :2012   Max.   : 39.0000   Max.   : 30.0000  
 (Other)    :127                  NA's   :46         NA's   :71        
   DiffCount           PropR          Republican    
 Min.   :-19.000   Min.   :0.0000   Min.   :0.0000  
 1st Qu.: -6.000   1st Qu.:0.0000   1st Qu.:0.0000  
 Median :  1.000   Median :0.6250   Median :1.0000  
 Mean   : -1.269   Mean   :0.5259   Mean   :0.5103  
 3rd Qu.:  4.000   3rd Qu.:1.0000   3rd Qu.:1.0000  
 Max.   : 11.000   Max.   :1.0000   Max.   :1.0000  
                                                    

In [4]:
table(polling$Year)


2004 2008 2012 
  50   50   45 

One approach would be to fill the missing data points with average values. For Rasmussen and SurveyUSA, the average poll value across all reported observations is very close to zero, which corresponds to roughly a tie between the Democrat and the Republican.

However, if PropR is very close to one or zero, we would expect the currently missing Rasmussen or SurveyUSA values to be positive or negative, respectively. This leads to a more complicated approach called multiple imputation, in which we fill in the missing values based on the non-missing values for an observation.

### Install and load mice package

In [5]:
# install.packages("mice")
library(mice)


Attaching package: 'mice'


The following objects are masked from 'package:base':

    cbind, rbind




### Multiple imputation

In [6]:
simple = polling[c("Rasmussen", "SurveyUSA", "PropR", "DiffCount")]
head(simple)

Unnamed: 0_level_0,Rasmussen,SurveyUSA,PropR,DiffCount
Unnamed: 0_level_1,<int>,<int>,<dbl>,<int>
1,11.0,18.0,1,5
2,21.0,25.0,1,5
3,,,1,1
4,16.0,,1,6
5,5.0,15.0,1,8
6,5.0,,1,9


In [7]:
summary(simple)

   Rasmussen          SurveyUSA            PropR          DiffCount      
 Min.   :-41.0000   Min.   :-33.0000   Min.   :0.0000   Min.   :-19.000  
 1st Qu.: -8.0000   1st Qu.:-11.7500   1st Qu.:0.0000   1st Qu.: -6.000  
 Median :  1.0000   Median : -2.0000   Median :0.6250   Median :  1.000  
 Mean   :  0.0404   Mean   : -0.8243   Mean   :0.5259   Mean   : -1.269  
 3rd Qu.:  8.5000   3rd Qu.:  8.0000   3rd Qu.:1.0000   3rd Qu.:  4.000  
 Max.   : 39.0000   Max.   : 30.0000   Max.   :1.0000   Max.   : 11.000  
 NA's   :46         NA's   :71                                           

In [8]:
set.seed(144)

imputed = complete(mice(simple))
summary(imputed)


 iter imp variable
  1   1  Rasmussen  SurveyUSA
  1   2  Rasmussen  SurveyUSA
  1   3  Rasmussen  SurveyUSA
  1   4  Rasmussen  SurveyUSA
  1   5  Rasmussen  SurveyUSA
  2   1  Rasmussen  SurveyUSA
  2   2  Rasmussen  SurveyUSA
  2   3  Rasmussen  SurveyUSA
  2   4  Rasmussen  SurveyUSA
  2   5  Rasmussen  SurveyUSA
  3   1  Rasmussen  SurveyUSA
  3   2  Rasmussen  SurveyUSA
  3   3  Rasmussen  SurveyUSA
  3   4  Rasmussen  SurveyUSA
  3   5  Rasmussen  SurveyUSA
  4   1  Rasmussen  SurveyUSA
  4   2  Rasmussen  SurveyUSA
  4   3  Rasmussen  SurveyUSA
  4   4  Rasmussen  SurveyUSA
  4   5  Rasmussen  SurveyUSA
  5   1  Rasmussen  SurveyUSA
  5   2  Rasmussen  SurveyUSA
  5   3  Rasmussen  SurveyUSA
  5   4  Rasmussen  SurveyUSA
  5   5  Rasmussen  SurveyUSA


   Rasmussen         SurveyUSA           PropR          DiffCount      
 Min.   :-41.000   Min.   :-33.000   Min.   :0.0000   Min.   :-19.000  
 1st Qu.: -8.000   1st Qu.:-11.000   1st Qu.:0.0000   1st Qu.: -6.000  
 Median :  3.000   Median :  1.000   Median :0.6250   Median :  1.000  
 Mean   :  2.786   Mean   :  2.014   Mean   :0.5259   Mean   : -1.269  
 3rd Qu.: 13.000   3rd Qu.: 18.000   3rd Qu.:1.0000   3rd Qu.:  4.000  
 Max.   : 39.000   Max.   : 30.000   Max.   :1.0000   Max.   : 11.000  

The output shows that mice ran five iterations for each of five imputations, and the summary confirms that Rasmussen and SurveyUSA no longer have any missing values.

The last step in the imputation process is to copy the imputed Rasmussen and SurveyUSA variables back into the original polling data frame, which has all the variables for the problem.

In [9]:
polling$Rasmussen = imputed$Rasmussen
polling$SurveyUSA = imputed$SurveyUSA
summary(polling)

         State          Year        Rasmussen         SurveyUSA      
 Arizona    :  3   Min.   :2004   Min.   :-41.000   Min.   :-33.000  
 Arkansas   :  3   1st Qu.:2004   1st Qu.: -8.000   1st Qu.:-11.000  
 California :  3   Median :2008   Median :  3.000   Median :  1.000  
 Colorado   :  3   Mean   :2008   Mean   :  2.786   Mean   :  2.014  
 Connecticut:  3   3rd Qu.:2012   3rd Qu.: 13.000   3rd Qu.: 18.000  
 Florida    :  3   Max.   :2012   Max.   : 39.000   Max.   : 30.000  
 (Other)    :127                                                     
   DiffCount           PropR          Republican    
 Min.   :-19.000   Min.   :0.0000   Min.   :0.0000  
 1st Qu.: -6.000   1st Qu.:0.0000   1st Qu.:0.0000  
 Median :  1.000   Median :0.6250   Median :1.0000  
 Mean   : -1.269   Mean   :0.5259   Mean   :0.5103  
 3rd Qu.:  4.000   3rd Qu.:1.0000   3rd Qu.:1.0000  
 Max.   : 11.000   Max.   :1.0000   Max.   :1.0000  
                                                    

### Split the set

In [10]:
# Subset data into training set and test set
Train = subset(polling, Year == 2004 | Year == 2008)
Test = subset(polling, Year == 2012)

In [11]:
head(Train)

Unnamed: 0_level_0,State,Year,Rasmussen,SurveyUSA,DiffCount,PropR,Republican
Unnamed: 0_level_1,<fct>,<int>,<int>,<int>,<int>,<dbl>,<int>
1,Alabama,2004,11,18,5,1,1
2,Alabama,2008,21,25,5,1,1
3,Alaska,2004,39,24,1,1,1
4,Alaska,2008,16,19,6,1,1
5,Arizona,2004,5,15,8,1,1
6,Arizona,2008,5,3,9,1,1


In [12]:
head(Test)

Unnamed: 0_level_0,State,Year,Rasmussen,SurveyUSA,DiffCount,PropR,Republican
Unnamed: 0_level_1,<fct>,<int>,<int>,<int>,<int>,<dbl>,<int>
7,Arizona,2012,8,15,4,0.8333333,1
10,Arkansas,2012,16,21,2,1.0,1
13,California,2012,-8,-14,-6,0.0,0
16,Colorado,2012,3,-2,-5,0.3076923,0
19,Connecticut,2012,-7,-13,-8,0.0,0
24,Florida,2012,2,0,6,0.6666667,0


### Simple Baseline

In [13]:
table(Train$Republican)


 0  1 
47 53 

What we can see here is that in 47 of the 100 training observations, the Democrat won the state, and in 53 of the observations, the Republican won the state. 

Our simple baseline model always predicts the more common outcome, which is that the Republican wins the state. It has an accuracy of 53% on the training set.
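
The baseline accuracy can be reproduced from the counts alone. This is a small sketch that rebuilds the outcome vector from the 47/53 split printed above instead of reading `Train`; the variable names are illustrative.

```r
# Outcomes rebuilt from the counts in table(Train$Republican)
outcome <- c(rep(0, 47), rep(1, 53))

counts <- table(outcome)
baseline_class <- names(counts)[which.max(counts)]  # most common outcome
baseline_accuracy <- max(counts) / sum(counts)      # share of that outcome

baseline_class      # "1", i.e. Republican
baseline_accuracy   # 0.53
```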

### Smart Baseline

To compute a smart baseline, we use the sign function. If it is passed a positive number, it returns 1; if it is passed a negative number, it returns -1; and if it is passed 0, it returns 0.

If we pass the Rasmussen variable to sign, it returns 1 whenever the Republican was leading the poll in that state, meaning Rasmussen is positive.

In [14]:
sign(20)

In [15]:
sign(-10)

In [16]:
sign(0)

So 1 signifies that the Republican is predicted to win, and -1 means the smart baseline predicts that the Democrat will win. A sign of 0 means the Rasmussen poll showed a tie, so the smart baseline is inconclusive about who will win the state.
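
Because `sign` is vectorized, the three cases can be checked in one call. The sketch below uses hypothetical poll margins and maps each sign to a readable label; the label vector is an illustrative helper, not part of the original analysis.

```r
margins <- c(20, -10, 0)   # hypothetical poll margins (R% minus D%)
s <- sign(margins)         # 1, -1, 0

# Shift -1/0/1 to positions 1/2/3 to look up a label
prediction <- c("Democrat", "Inconclusive", "Republican")[s + 2]
prediction
```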

Now we can compute this prediction for the whole training set.

In [17]:
table(sign(Train$Rasmussen))


-1  0  1 
42  2 56 

    In 56 of the 100 training set observations, the smart baseline predicted that the Republican was going to win.

    In 42 instances, it predicted the Democrat.

    And in two instances, it was inconclusive.
    
What we really want is to see how the smart baseline's predictions compare with the actual result, that is, who actually won the state. To do that, we tabulate the training set's outcome against the sign of the polling data.

In [18]:
table(Train$Republican, sign(Train$Rasmussen))

   
    -1  0  1
  0 42  1  4
  1  0  1 52

                        Predicted Democrat   Tie   Predicted Republican
    Actual Democrat            42 obs         1         4 errors
    Actual Republican           0 errors      1        52 obs

We have 42 observations where the Rasmussen smart baseline predicted the Democrat would win, and the Democrat actually did win.

There were 52 observations where the smart baseline predicted the Republican would win, and the Republican actually did win.

There were 2 inconclusive observations, where the Rasmussen value was 0.

Finally, there were 4 mistakes: cases where the smart baseline predicted that the Republican would win, but the Democrat actually won the state.
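
The breakdown above can be turned into rates. The following sketch rebuilds the table by hand from the printed counts (the `cm` matrix is typed in, not computed from `Train`) and splits the 100 training observations into correct, wrong, and inconclusive predictions.

```r
# Counts copied from table(Train$Republican, sign(Train$Rasmussen))
cm <- matrix(c(42, 1,  4,
                0, 1, 52),
             nrow = 2, byrow = TRUE,
             dimnames = list(actual = c("0", "1"),
                             predicted = c("-1", "0", "1")))

correct      <- cm["0", "-1"] + cm["1", "1"]   # 42 + 52
wrong        <- cm["0", "1"]  + cm["1", "-1"]  # 4 + 0
inconclusive <- sum(cm[, "0"])                 # 1 + 1

# Shares of the training set: 0.94 correct, 0.04 wrong, 0.02 inconclusive
c(correct = correct, wrong = wrong, inconclusive = inconclusive) / sum(cm)
```

Counted this way, the smart baseline is right on 94 of the 100 training observations, well above the 53% of the simple baseline.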

### Multicollinearity

In [19]:
# cor(Train) fails because Train$State is a factor, so select the numeric columns.

cor(Train[c("Rasmussen", "SurveyUSA", "PropR", "DiffCount", "Republican")])

Unnamed: 0,Rasmussen,SurveyUSA,PropR,DiffCount,Republican
Rasmussen,1.0,0.9127481,0.8356056,0.4926308,0.7908133
SurveyUSA,0.9127481,1.0,0.8869625,0.5695477,0.8418046
PropR,0.8356056,0.8869625,1.0,0.8273785,0.9484204
DiffCount,0.4926308,0.5695477,0.8273785,1.0,0.8092777
Republican,0.7908133,0.8418046,0.9484204,0.8092777,1.0


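The variable PropR has the highest correlation with the dependent variable, Republican (0.948), so PropR is the best candidate for a one-variable logistic regression model. The independent variables are also highly correlated with one another (for example, 0.913 between Rasmussen and SurveyUSA), which is the multicollinearity in this section's title.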
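
To see which pairs of independent variables are strongly correlated with each other, one option is to scan the upper triangle of the correlation matrix. This is a minimal sketch: the matrix is typed in from the output above, and the 0.8 cutoff is an arbitrary threshold chosen for illustration, not a rule from the original analysis.

```r
vars <- c("Rasmussen", "SurveyUSA", "PropR", "DiffCount")

# Predictor-to-predictor correlations copied from the cor() output
r <- matrix(c(1.0000000, 0.9127481, 0.8356056, 0.4926308,
              0.9127481, 1.0000000, 0.8869625, 0.5695477,
              0.8356056, 0.8869625, 1.0000000, 0.8273785,
              0.4926308, 0.5695477, 0.8273785, 1.0000000),
            nrow = 4, byrow = TRUE, dimnames = list(vars, vars))

threshold <- 0.8   # assumed cutoff, for illustration only

# Each pair appears once in the upper triangle
idx <- which(upper.tri(r) & abs(r) > threshold, arr.ind = TRUE)
high_pairs <- data.frame(var1 = vars[idx[, "row"]],
                         var2 = vars[idx[, "col"]],
                         r    = r[idx])
high_pairs
```

With this cutoff, the pairs involving DiffCount with Rasmussen or SurveyUSA are the only ones not flagged, which suggests they carry less overlapping information than the other combinations.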

### Logistic Regression Model

In [20]:
mod1 = glm(Republican~PropR, data=Train, family="binomial")
summary(mod1)


Call:
glm(formula = Republican ~ PropR, family = "binomial", data = Train)

Deviance Residuals: 
     Min        1Q    Median        3Q       Max  
-2.22880  -0.06541   0.10260   0.10260   1.37392  

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept)   -6.146      1.977  -3.108 0.001882 ** 
PropR         11.390      3.153   3.613 0.000303 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 138.269  on 99  degrees of freedom
Residual deviance:  15.772  on 98  degrees of freedom
AIC: 19.772

Number of Fisher Scoring iterations: 8
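
The coefficients can be read as a rule that maps PropR to a probability through the logistic function. The sketch below plugs in the rounded estimates from the summary, so its results are approximate; `p_republican` and `boundary` are illustrative names.

```r
# Coefficients copied (rounded) from summary(mod1)
b0 <- -6.146   # intercept
b1 <- 11.390   # PropR

# Predicted probability of a Republican win for a given PropR;
# plogis(x) computes 1 / (1 + exp(-x))
p_republican <- function(prop_r) plogis(b0 + b1 * prop_r)

p_republican(c(0, 0.5, 1))

# PropR value at which the predicted probability crosses 0.5
boundary <- -b0 / b1
boundary
```

The boundary sits slightly above 0.5, so under this model a state is predicted Republican only when somewhat more than half of its polls show a Republican lead.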


### Training set predictions

First, we compute the predictions on the training set, that is, the predicted probabilities that the Republican is going to win.

In [21]:
pred1 = predict(mod1, type="response")
table(Train$Republican, pred1 >= 0.5)

   
    FALSE TRUE
  0    45    2
  1     2   51

                      Predict Democrat   Predict Republican

    Real Democrat            45                2

    Real Republican           2                51
    

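The model makes 4 mistakes on the training set: 2 Democratic wins predicted as Republican and 2 Republican wins predicted as Democratic.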

### Two-variable model

In [22]:
mod2 = glm(Republican~SurveyUSA+DiffCount, data=Train, family="binomial")
summary(mod2)


Call:
glm(formula = Republican ~ SurveyUSA + DiffCount, family = "binomial", 
    data = Train)

Deviance Residuals: 
     Min        1Q    Median        3Q       Max  
-1.96335  -0.01207   0.01526   0.06363   1.50373  

Coefficients:
            Estimate Std. Error z value Pr(>|z|)  
(Intercept)  -0.9991     1.2437  -0.803   0.4218  
SurveyUSA     0.2583     0.1454   1.777   0.0756 .
DiffCount     0.7388     0.4464   1.655   0.0979 .
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 138.269  on 99  degrees of freedom
Residual deviance:  10.749  on 97  degrees of freedom
AIC: 16.749

Number of Fisher Scoring iterations: 9


The AIC value measures the quality of a model by trading goodness of fit against the number of parameters. Among models fit to the same data, the preferred model is the one with the minimum AIC.

mod1 -> AIC: 19.772

mod2 -> AIC: 16.749 (Better!)
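
Both AIC values can be reproduced by hand. For a logistic regression on 0/1 outcomes, the residual deviance equals minus twice the log-likelihood, so AIC is the residual deviance plus twice the number of estimated coefficients. A short sketch using the numbers printed in the two summaries:

```r
# Residual deviances and coefficient counts read off the two summaries
dev1 <- 15.772; k1 <- 2   # mod1: intercept + PropR
dev2 <- 10.749; k2 <- 3   # mod2: intercept + SurveyUSA + DiffCount

aic1 <- dev1 + 2 * k1     # 19.772
aic2 <- dev2 + 2 * k2     # 16.749

c(mod1 = aic1, mod2 = aic2)
```

mod2 pays a penalty for its extra coefficient, but its lower residual deviance more than makes up for it.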

In [23]:
pred2 = predict(mod2, type="response")
table(Train$Republican, pred2 >= 0.5)

   
    FALSE TRUE
  0    45    2
  1     1   52

                      Predict Democrat   Predict Republican

    Real Democrat            45                2

    Real Republican           1                52
    

mod2 makes 3 mistakes on the training set, one fewer than mod1.

### Smart baseline accuracy

In [24]:
table(Test$Republican, sign(Test$Rasmussen))

   
    -1  0  1
  0 18  2  4
  1  0  0 21

                        Predicted Democrat   Tie   Predicted Republican
    Actual Democrat            18 obs         2         4 errors
    Actual Republican           0 errors      0        21 obs

### Test set predictions

In [25]:
TestPrediction = predict(mod2, newdata=Test, type="response")
table(Test$Republican, TestPrediction >= 0.5)

   
    FALSE TRUE
  0    23    1
  1     0   21

                      Predict Democrat   Predict Republican

    Real Democrat            23                1

    Real Republican           0               21
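
To compare the two approaches on the 2012 test set, the tables can be reduced to accuracies. This sketch types in both tables from the printed output and, as an assumption, counts the smart baseline's inconclusive predictions as not correct.

```r
# Test-set tables copied from the outputs above (rows: actual 0/1)
baseline_cm <- matrix(c(18, 2,  4,
                         0, 0, 21), nrow = 2, byrow = TRUE,
                      dimnames = list(c("0", "1"), c("-1", "0", "1")))
mod2_cm <- matrix(c(23,  1,
                     0, 21), nrow = 2, byrow = TRUE,
                  dimnames = list(c("0", "1"), c("FALSE", "TRUE")))

# Correct predictions sit where the predicted class matches the actual one
baseline_acc <- (baseline_cm["0", "-1"] + baseline_cm["1", "1"]) / sum(baseline_cm)
mod2_acc     <- (mod2_cm["0", "FALSE"] + mod2_cm["1", "TRUE"]) / sum(mod2_cm)

round(c(smart_baseline = baseline_acc, mod2 = mod2_acc), 3)
```

The smart baseline gets 39 of 45 states right, while mod2 gets 44 of 45.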

### Analyze Mistake

In [26]:
subset(Test, TestPrediction >= 0.5 & Republican == 0)

Unnamed: 0_level_0,State,Year,Rasmussen,SurveyUSA,DiffCount,PropR,Republican
Unnamed: 0_level_1,<fct>,<int>,<int>,<int>,<int>,<dbl>,<int>
24,Florida,2012,2,0,6,0.6666667,0
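
To see why the model missed Florida, its prediction can be recomputed by hand. The sketch below uses the rounded coefficients from summary(mod2), so the result is approximate; the variable names are illustrative.

```r
# Coefficients copied (rounded) from summary(mod2)
b <- c(intercept = -0.9991, SurveyUSA = 0.2583, DiffCount = 0.7388)

# Florida 2012 row from the test set
florida <- c(SurveyUSA = 0, DiffCount = 6)

# Linear predictor (log-odds), then the logistic transform
log_odds <- b["intercept"] + sum(b[names(florida)] * florida)
p_rep <- plogis(log_odds)
unname(p_rep)
```

The predicted probability of a Republican win comes out at roughly 0.97. SurveyUSA showed a tie, so the prediction is driven almost entirely by DiffCount: six more polls favored the Republican than the Democrat, yet the Democrat won the state.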
