The data can be found at https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi%3A10.7910/DVN/II2DB6

The data comes from the Cooperative Congressional Election Study, conducted by YouGov.
This part of the project involves multiple steps:
1. Getting the data
2. Transforming the data so it fits our needs, by adding/removing certain parameters
3. Vectorizing the non-numeric values in our data source
4. Using the data to create a data source for our network
5. Training the network
6. Testing the network

In [13]:
import pandas as pd
import numpy as np
import tensorflow as tf
from tensorflow.python.framework.constant_op import convert_to_eager_tensor

In [11]:
# Read in the data
df = pd.read_stata('election_data.dta')
# Keep only the rows for year 2018, which hold the 2016 election data; the file also contains earlier years that predate this election
df_2018 = df[df['year'] == 2018]
df = df_2018
print(df.head(10))

        year    case_id    weight  weight_cumulative           state  st  \
392755  2018  410751329  0.808436           0.441945           Texas  TX   
392756  2018  410766300  1.757688           0.960869            Ohio  OH   
392757  2018  410770169  1.024198           0.559895        Kentucky  KY   
392758  2018  410770285  0.461958           0.252537         Arizona  AZ   
392759  2018  410099450  0.275367           0.150534    Pennsylvania  PA   
392760  2018  410642421  0.977051           0.534121  North Carolina  NC   
392761  2018  410932685  0.896677           0.490183    Pennsylvania  PA   
392762  2018  411865891  1.338906           0.731935           Texas  TX   
392763  2018  411864923  0.670503           0.366542         Arizona  AZ   
392764  2018  411881120  0.818311           0.447343        Illinois  IL   

           cd  dist  dist_up cong  ...   voted_rep_chosen    voted_sen_chosen  \
392755  TX-29    29       29  115  ...                                          

#### The data has 73 columns, but we will use only some of them to train our model.
The ones we will use are:

'st', 'dist', 'cong'(geography),

'gender', 'birthyr', 'age', 'educ', 'race', 'faminc', 'marstat'(demographics),

'newsint'(news interest),

'approval_pres'(did they approve of 2016 president),

'ideo5'(ideology), 

'voted_pres_16' - THIS IS THE OUTCOME, IT IS WHAT WE WILL BE TRAINING AND TESTING ON


In [14]:
# Keep only these columns and drop rows with missing values
df = df_2018
df = df.filter(['st', 'dist', 'cong', 'gender', 'birthyr', 'age', 'educ', 'race', 'faminc', 'marstat','newsint','approval_pres','ideo5', 'voted_pres_16'])
df = df.dropna()
df.reset_index(drop=True, inplace=True)
df.shape

(46206, 14)
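Before dropping rows, it can help to see which columns are responsible for the missing values. The following is a minimal sketch on a made-up toy frame (the values and the `candidate_a`/`candidate_b` labels are placeholders, not taken from the survey), showing a per-column count of missing values and the effect of `dropna()`:

```python
import numpy as np
import pandas as pd

# Toy stand-in for the survey frame; all values are invented for illustration.
toy = pd.DataFrame({
    "st": ["TX", "OH", "KY", "AZ"],
    "age": [34, 51, np.nan, 62],
    "voted_pres_16": ["candidate_a", None, "candidate_b", "candidate_a"],
})

# Number of missing values in each column, before any rows are dropped.
missing_per_column = toy.isna().sum()

# dropna() keeps only the rows with no missing value in any column.
complete = toy.dropna().reset_index(drop=True)

print(missing_per_column)
print(complete.shape)
```

On the real data, the same `isna().sum()` call on the filtered frame would show how much each of the 14 columns contributes to the rows lost.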

Now that we have dropped all the columns and rows we don't want, we have some good data to work with. Later, we will vectorize it by turning each non-numeric value into numbers, so that each row can be fed into our model as a fixed-length numeric vector (114 values after the encoding below).
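To illustrate what this vectorization does, here is a small sketch using `pd.get_dummies` on a toy frame (the rows are invented). Each string column is replaced by one indicator column per distinct value, while numeric columns are passed through unchanged:

```python
import pandas as pd

# Toy frame with two string columns and one numeric column (invented values).
toy = pd.DataFrame({
    "st": ["TX", "OH", "TX"],
    "gender": ["Male", "Female", "Female"],
    "age": [34, 51, 62],
})

# String columns become indicator columns named <column>_<value>;
# the numeric 'age' column is kept as it is.
encoded = pd.get_dummies(toy)

print(list(encoded.columns))
print(encoded.shape)
```

This is why the column count grows from 13 features to many more after encoding: every distinct state, race, education level and so on gets its own column.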


In [15]:
# Map each distinct answer to an integer label, in order of first appearance
list_of_choices = list(df['voted_pres_16'].unique())
president_vector = {prez: index for index, prez in enumerate(list_of_choices)}
# Separate the outcome column from the features
result_series = df['voted_pres_16']
df = df.drop(['voted_pres_16'], axis=1)
# Give dummy (indicator) values to the string columns so they fit our model
df = pd.get_dummies(df)
# Translate every answer into its integer label
result_as_list = [president_vector[i] for i in result_series]
print('dataframe shape: ', df.shape)
print('result shape: ', len(result_as_list))

dataframe shape:  (46206, 114)
result shape:  46206
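The label mapping built above (each distinct answer gets an integer in order of first appearance) can also be obtained with `pd.factorize`. This is only an alternative sketch on placeholder labels, not the code used in this notebook:

```python
import pandas as pd

# Placeholder answers standing in for the 'voted_pres_16' column.
votes = pd.Series(["candidate_a", "candidate_b", "candidate_a", "other", "candidate_b"])

# codes: integer label per row; classes: the distinct answers,
# indexed so that classes[code] recovers the original answer.
codes, classes = pd.factorize(votes)

print(list(codes))
print(list(classes))
```

Keeping `classes` around makes it easy to translate the network's predicted integer back into the answer it stands for.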


In [16]:
#splitting our data into training and test data. I'm using rows 1-39,000 as training and the rest as test data
test_df = df[39000:]
df = df[0:39000]
test_result_as_list = result_as_list[39000:]
result_as_list = result_as_list[0:39000]
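The split above takes the rows in file order. If the file order were correlated with anything relevant (this is an assumption, nothing in the data confirms it), a shuffled split would be a common alternative. A minimal sketch on toy data, with a fixed seed so the split is reproducible:

```python
import numpy as np
import pandas as pd

# Toy features and labels; the real notebook would use df and result_as_list.
features = pd.DataFrame({"x": range(10)})
labels = list(range(10))

# Shuffle the row positions once, then cut the shuffled order into two parts.
rng = np.random.default_rng(seed=0)
order = rng.permutation(len(features))
n_train = 8  # hypothetical training size for the toy example
train_idx, test_idx = order[:n_train], order[n_train:]

train_X = features.iloc[train_idx]
test_X = features.iloc[test_idx]
train_y = [labels[i] for i in train_idx]
test_y = [labels[i] for i in test_idx]

print(len(train_X), len(test_X))
```

Using the same index arrays for features and labels keeps each row paired with its own outcome.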

In [18]:
# Build a one-hot vector for each label (to compare with the last layer of the NN).
# Note: element_vector is overwritten on every iteration and never stored,
# so result_series below still holds the integer labels, as the output shows.
for element in result_as_list:
    element_vector = np.zeros((5,1))
    element_vector[element]=1
result_series = pd.Series(result_as_list)
print(type(result_series[0]))

<class 'numpy.int64'>
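If one-hot targets are wanted for the comparison with the network's last layer, the vectors need to be collected rather than discarded. A minimal sketch, assuming five outcome classes (taken from the `np.zeros((5,1))` shape in the cell above) and using made-up labels:

```python
import numpy as np

# Made-up integer labels standing in for result_as_list.
labels = np.array([0, 2, 1, 4, 2])
num_classes = 5  # assumption: matches the (5, 1) vectors built in the cell above

# Row i of the identity matrix is the one-hot vector for class i,
# so indexing with the labels yields one one-hot row per example.
one_hot = np.eye(num_classes)[labels]

print(one_hot.shape)
print(one_hot[1])
```

Whether one-hot targets are needed at all depends on the loss function chosen for the network; some losses accept the integer labels directly.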


Currently getting the model together and training it
