# Predict 2016 Colorado voter turnout

Welcome to the Spring 2017, Harvard Statistics 149 prediction contest/course project!

**Prediction contest ends April 30, 2017, at 10pm EDT**

The goal of this project is to use the modeling methods you learned in Statistics 149 (and possibly other related methods) to analyze a data set on whether a Colorado voting-eligible citizen ended up actually voting in the 2016 US election. These data were kindly provided by moveon.org. More details can be found [here](https://inclass.kaggle.com/c/who-voted).

## Exploration of features

This notebook explores which features may be important for predicting whether a voter turns out to vote. It is the second notebook for this competition. See [part 1](who-voted_EDA.ipynb) for more details.

In [41]:
import pandas as pd
import numpy as np

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import log_loss

import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

User-designed functions for this project.

In [52]:
import who_voted_functions as wv
import importlib as imp
imp.reload(wv);

In [24]:
#load voter data; congress_district and state_house coded as strings
data = wv.load_data('train_renamed.csv')
data.head()

Unnamed: 0,voted,gender,congress_district,state_house,age,dist_ballot,dist_poll,party,race,hs_only,married,children,cath,evang,non_chrst,other_chrst,days_reg
0,Y,M,7.0,31.0,36,,,U,Hispanic,25.4,63.4,54.0,16.7,16.5,39.6,27.3,420
1,Y,F,6.0,38.0,55,,,U,Uncoded,7.9,97.8,59.8,16.7,15.5,30.9,36.9,307
2,Y,F,2.0,53.0,24,,,U,Caucasian,50.2,7.6,49.5,14.6,24.0,29.6,31.7,292
3,Y,F,7.0,30.0,25,,,D,Caucasian,38.0,8.5,47.4,13.1,22.3,33.3,31.4,316
4,Y,M,5.0,19.0,22,,,R,Caucasian,30.5,19.1,23.1,16.0,10.5,39.1,34.5,392


In [25]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 118529 entries, 0 to 118528
Data columns (total 17 columns):
voted                118529 non-null object
gender               118529 non-null object
congress_district    118527 non-null object
state_house          118527 non-null object
age                  118529 non-null int64
dist_ballot          5282 non-null float64
dist_poll            5282 non-null float64
party                118529 non-null object
race                 118529 non-null object
hs_only              118529 non-null float64
married              118529 non-null float64
children             118529 non-null float64
cath                 118529 non-null float64
evang                118529 non-null float64
non_chrst            118529 non-null float64
other_chrst          118529 non-null float64
days_reg             118529 non-null int64
dtypes: float64(9), int64(2), object(6)
memory usage: 15.4+ MB


Because dist_ballot and dist_poll are missing for all but 5282 of the 118529 observations, let's first explore features in the 5282 observations where all columns are present.

In [27]:
no_nulls = data.dropna()
no_nulls.info()
no_nulls.to_csv('no_nulls', index = False, compression = 'gzip')

<class 'pandas.core.frame.DataFrame'>
Int64Index: 5282 entries, 30 to 118499
Data columns (total 17 columns):
voted                5282 non-null object
gender               5282 non-null object
congress_district    5282 non-null object
state_house          5282 non-null object
age                  5282 non-null int64
dist_ballot          5282 non-null float64
dist_poll            5282 non-null float64
party                5282 non-null object
race                 5282 non-null object
hs_only              5282 non-null float64
married              5282 non-null float64
children             5282 non-null float64
cath                 5282 non-null float64
evang                5282 non-null float64
non_chrst            5282 non-null float64
other_chrst          5282 non-null float64
days_reg             5282 non-null int64
dtypes: float64(9), int64(2), object(6)
memory usage: 742.8+ KB


In [30]:
no_nulls['voted'].value_counts()

Y    3897
N    1385
Name: voted, dtype: int64

The imbalance between the two classes of the response variable is roughly preserved, although it tilts slightly more toward the positive class than in the full dataset.
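For reference, the class shares implied by the counts printed above can be computed directly. This is a small sketch that hard-codes those counts rather than reloading the data; the corresponding shares for the full dataset are not shown in this notebook.

```python
import pandas as pd

# Counts copied from the value_counts() output above (complete cases only)
counts = pd.Series({"Y": 3897, "N": 1385}, name="voted")

# Share of each class among the 5282 complete cases
proportions = counts / counts.sum()
print(proportions.round(4))
```

Calling `data['voted'].value_counts(normalize=True)` on the full data frame would give the figures to compare against.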

Let's create X and y matrices for model fitting.

In [37]:
#converts 'Y' and 'N' to 1s and 0s
y_vector = wv.get_y_vector(no_nulls, 'voted')
no_nulls.drop('voted', axis = 1, inplace = True)

#generates lists of names of quantitative and categorical columns
quant, categ = wv.get_cols(no_nulls)

#converts dataframe to matrix where categorical variables are one-hot-encoded
X_matrix = wv.design_Xmatrix(no_nulls, quant, categ)

y_vector.shape
X_matrix.shape

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  app.launch_new_instance()


(5282,)

(5282, 98)
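The helpers wv.get_cols and wv.design_Xmatrix live in a separate module and are not shown here. As a rough sketch of what such a step could look like, the code below assumes that non-numeric columns are the categorical ones and that one-hot encoding is done with pandas.get_dummies. The function names `split_columns` and `build_design_matrix` and the toy data are hypothetical, not the project's actual code.

```python
import pandas as pd


def split_columns(df):
    """Hypothetical stand-in for wv.get_cols: numeric vs. everything else."""
    quant = df.select_dtypes(include="number").columns.tolist()
    categ = [col for col in df.columns if col not in quant]
    return quant, categ


def build_design_matrix(df, quant, categ):
    """Hypothetical stand-in for wv.design_Xmatrix: one-hot encode categoricals."""
    dummies = pd.get_dummies(df[categ], columns=categ)
    return pd.concat([df[quant], dummies], axis=1).to_numpy(dtype=float)


# Toy frame with made-up values, only to show how the column count grows
toy = pd.DataFrame({
    "age": [36, 55, 24],
    "party": ["U", "D", "R"],
    "gender": ["M", "F", "F"],
})
quant_cols, categ_cols = split_columns(toy)
X_toy = build_design_matrix(toy, quant_cols, categ_cols)
print(X_toy.shape)  # 1 numeric column + 3 party levels + 2 gender levels
```

With congress_district and state_house coded as strings, each distinct district becomes its own column, which is how 16 predictors can expand to 98 columns.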

### Fitting a random forest to identify potentially important features

The RandomForestClassifier's feature_importances_ attribute scores each predictor by its mean decrease in impurity: the total reduction in node impurity from splits on that predictor, weighted by the share of samples reaching each split and averaged over all trees. Furthermore, random forests are generally good out-of-the-box models that cope reasonably well with imbalanced classes, non-linear relationships and different types of predictor variables. By fitting a default random forest, we can thus get a sense of which covariates could aid in the prediction task while also getting a benchmark log-loss score, which is how our models will be evaluated in the competition.
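To make the importance scores concrete, here is a minimal sketch on synthetic data (not the voter data) in which one column determines the label and the other is pure noise. The variable names and data are made up for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.RandomState(0)

# Synthetic data: the first column drives the label, the second is pure noise
signal = rng.normal(size=500)
noise = rng.normal(size=500)
X_demo = np.column_stack([signal, noise])
y_demo = (signal > 0).astype(int)

forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_demo, y_demo)

# Impurity-based importances are normalized to sum to 1 across features
importances = forest.feature_importances_
print(dict(zip(["signal", "noise"], importances.round(3))))
```

Impurity-based importances can favor continuous or high-cardinality predictors, so they are best read as a rough ranking rather than a precise measure.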

We first fit a model on only the 5282 observations where no missing values are present.

In [45]:
#start with 10 cv splits (unshuffled, so no random_state is needed)
cv = StratifiedKFold(n_splits = 10)

#set 100 trees and random state; all other parameters are default values
clf_default = RandomForestClassifier(n_estimators = 100, random_state = 123)

#returns mean & std of cv log-loss scores
wv.cross_val_LL(clf_default, X_matrix, y_vector, cv)

(-0.55755892528421891, 0.0073372008757506277)
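The helper wv.cross_val_LL is defined outside this notebook. Since it returns negative values, it presumably relies on scikit-learn's negated log-loss scorer, where values closer to zero are better. A minimal sketch under that assumption, using the hypothetical name `cross_val_log_loss` and synthetic data:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score


def cross_val_log_loss(clf, X, y, cv):
    """Hypothetical stand-in for wv.cross_val_LL: mean and std of CV scores."""
    scores = cross_val_score(clf, X, y, cv=cv, scoring="neg_log_loss")
    return scores.mean(), scores.std()


# Synthetic data, for illustration only
rng = np.random.RandomState(0)
X_demo = rng.normal(size=(300, 4))
y_demo = (X_demo[:, 0] + rng.normal(scale=0.5, size=300) > 0).astype(int)

clf_demo = RandomForestClassifier(n_estimators=50, random_state=0)
mean_score, std_score = cross_val_log_loss(
    clf_demo, X_demo, y_demo, StratifiedKFold(n_splits=5)
)
print(mean_score, std_score)
```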

Examine the features used to make decisions (splits) at each node.

In [47]:
wv.find_important_features(no_nulls, clf_default.fit(X_matrix, y_vector),
                           categ, quant)

Unnamed: 0,features,importance
4,dist_ballot,0.075595
10,children,0.072817
2,state_house,0.06987
1,congress_district,0.067527
5,dist_poll,0.066841
3,age,0.06652
4,race,0.065688
3,party,0.065589
9,married,0.064899
8,hs_only,0.064329
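The table reports one score per original column, including the categorical ones, so the importances of the one-hot-encoded columns were presumably combined per source feature. The actual logic of wv.find_important_features is not shown. The sketch below is one way to do this by summing, with a hypothetical helper and made-up numbers:

```python
import pandas as pd


def aggregate_importances(importances, source_columns):
    """Sum per-column importances by the original feature each column came from.

    Hypothetical helper: source_columns[i] names the original feature
    behind encoded column i (one-hot columns share the same name).
    """
    totals = pd.Series(importances).groupby(pd.Series(source_columns)).sum()
    return totals.sort_values(ascending=False)


# Made-up importances for five encoded columns:
# age, party_D, party_R, party_U, gender_M
encoded_importances = [0.40, 0.15, 0.10, 0.05, 0.30]
sources = ["age", "party", "party", "party", "gender"]

ranked = aggregate_importances(encoded_importances, sources)
print(ranked)
```

Summing means a categorical feature with many levels, such as state_house, accumulates importance across all of its dummy columns.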


According to the default classifier, dist_ballot and dist_poll have high importance scores (first and fifth, respectively) and can help distinguish between those who did and did not vote. The scores are reported as negated log-loss, so values closer to zero are better. When a classifier is fit on all 118529 observations without these 2 columns (see below), the cross-validated log-loss is worse (0.568 vs. 0.558) than for the model fit to only the 5282 complete rows, despite the much larger training set. These findings suggest dist_ballot and dist_poll will be important for making predictions.

In [60]:
#get full response vector
y = wv.get_y_vector(data, 'voted')
y.shape

(118529,)

In [61]:
#create X matrix without distance metrics
no_dist = data.drop(['voted', 'dist_ballot', 'dist_poll'], axis = 1)

quant3, categ3 = wv.get_cols(no_dist)
X_nodist = wv.design_Xmatrix(no_dist, quant3, categ3)
X_nodist.shape

(118529, 96)

In [62]:
wv.cross_val_LL(clf_default, X_nodist, y, cv)

(-0.56769026462139638, 0.0039721495481984921)

### Can we eliminate any features?

We want to test whether eliminating features deemed unimportant by the random forest results in a decreased log-loss.

In [75]:
#4 columns with lowest importance scores from default RF
col_to_drop = ['evang', 'non_chrst', 'days_reg', 'other_chrst']

#test log-loss score when columns are sequentially dropped by importance score
for n_dropped in range(1, len(col_to_drop) + 1):
    fewer_features = no_nulls.drop(col_to_drop[:n_dropped], axis = 1)

    #generate new columns and design matrix
    quant2, categ2 = wv.get_cols(fewer_features)
    X_fewer = wv.design_Xmatrix(fewer_features, quant2, categ2)
    wv.cross_val_LL(clf_default, X_fewer, y_vector, cv)

(-0.55740517971166526, 0.0099206122521314118)

(-0.5580721042775999, 0.005600791120642652)

(-0.56211666251687831, 0.0059754751911819037)

(-0.57247490932110379, 0.024143017883888975)

Removing the likelihood of being evangelical (evang), the least important predictor according to the random forest, gives a marginally lower log-loss than the model trained with all predictors (0.5574 vs. 0.5576). Dropping further columns raises the log-loss. However, because this decrease is far smaller than the cross-validation standard deviation (roughly 0.007 to 0.010), and because we are not training on all observations in the dataset, we will have to re-evaluate model performance following imputation.
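To put the size of that change in context, the short calculation below compares it with the fold-to-fold spread. The numbers are copied from the outputs above; the variable names are only for illustration.

```python
# (mean, std) of negated CV log-loss, copied from the outputs above
all_features = (-0.55755892528421891, 0.0073372008757506277)
without_evang = (-0.55740517971166526, 0.0099206122521314118)

# Positive difference means the reduced model scored better
improvement = without_evang[0] - all_features[0]

# Compare the gain to the fold-to-fold spread of the baseline model
ratio = improvement / all_features[1]
print(f"improvement = {improvement:.5f}, ratio to baseline std = {ratio:.3f}")
```

A gain that is a small fraction of one standard deviation is hard to distinguish from noise across folds.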

We will impute the missing values in dist_ballot and dist_poll in [part 3](who-voted_impute.ipynb).