# Election Analysis using Machine Learning

## Outline

1\. Introduction

2\. Description of Data and Setup

3\. Algorithms

4\. Results

5\. Conclusion and References

## 1. Introduction

For my final project in Big Data Analysis using Python, I chose to perform an analysis of the 2016 US presidential election. The goal of this project was to gain experience using machine learning libraries as well as data visualization packages.

The data is from Kaggle: https://www.kaggle.com/joelwilson/2012-2016-presidential-elections

I will be trying to predict the county-level outcome of the 2016 election in Pennsylvania. The attributes used for the prediction are county metrics such as demographic, economic, and societal factors. I will use several algorithms from scikit-learn (Support Vector Machines, K-Nearest Neighbors, and Logistic Regression) to make the predictions. After the predictions are made, I will display them on maps using Plotly.

## 2. Description of Data and Setup

The data from Kaggle contained four tables, but I only needed two of them: the voting results table and the county facts table. Preprocessing required some cleanup. For instance, Alaska was not represented consistently across the tables, so I removed it altogether.

Essentially, I needed to load both tables and join them into one dataset that I could split into training and testing data, with Pennsylvania as the test set.

In addition, before making any predictions, I thought it would be worthwhile to plot the actual outcome. This is done shortly after the preprocessing.

Load in Packages

In [1]:
import plotly
plotly.tools.set_credentials_file(username='YOUR USERNAME', api_key='YOUR API KEY')
import pandas as pd
import numpy as np
import plotly.graph_objs as graph_objs
import json

Read in voting data.

In [4]:
#votes=pd.read_csv("C:\Users\shard\Documents\BigData\Project\\2012-and-2016-presidential-elections\US_County_Level_Presidential_Results_12-16.csv")
### Forward slash avoids "\U" being read as a unicode escape in a Python 3 string
votes=pd.read_csv("2012-and-2016-presidential-elections/US_County_Level_Presidential_Results_12-16.csv")
votes.shape 

(3141, 21)

In [5]:
votes.tail()

Unnamed: 0.1,Unnamed: 0,combined_fips,votes_dem_2016,votes_gop_2016,total_votes_2016,per_dem_2016,per_gop_2016,diff_2016,per_point_diff_2016,state_abbr,...,FIPS,total_votes_2012,votes_dem_2012,votes_gop_2012,county_fips,state_fips,per_dem_2012,per_gop_2012,diff_2012,per_point_diff_2012
3136,3136,56037,3233.0,12153.0,16661.0,0.194046,0.729428,8920,-0.535382,WY,...,56037,16750.0,4773.0,11427.0,37.0,56.0,0.284955,0.682209,6654.0,-0.397254
3137,3137,56039,7313.0,3920.0,12176.0,0.600608,0.321945,3393,0.278663,WY,...,56039,11356.0,6211.0,4858.0,39.0,56.0,0.546936,0.427791,1353.0,0.119144
3138,3138,56041,1202.0,6154.0,8053.0,0.149261,0.764187,4952,-0.614926,WY,...,56041,8453.0,1628.0,6613.0,41.0,56.0,0.192594,0.782326,4985.0,-0.589731
3139,3139,56043,532.0,2911.0,3715.0,0.143203,0.78358,2379,-0.640377,WY,...,56043,3911.0,794.0,3013.0,43.0,56.0,0.203017,0.770391,2219.0,-0.567374
3140,3140,56045,294.0,2898.0,3334.0,0.088182,0.869226,2604,-0.781044,WY,...,56045,3323.0,422.0,2821.0,45.0,56.0,0.126994,0.848932,2399.0,-0.721938


Only a few columns in this dataset matter here: the location columns (state and county) and the votes for each party. Next we create a new column called color that records the winner of each county based on which party received more votes.

Each county in the table was listed with the word "County" appended, for instance "Bucks County", so I removed the word "County". I also added a color based on which party won the county. Lastly, I removed Alaska.

In [5]:
### Get County by itself: drop the trailing " County" (7 characters)
votes["county"]=votes["county_name"].str[:-7]

### Create colors for the counties based on winners:
### "b" where the Democratic vote count is higher, otherwise "r"
votes["color"]=np.where(votes["votes_dem_2016"]>votes["votes_gop_2016"],"b","r")
        
### Keep only location and winning party
votes2=votes.loc[:,["state_abbr","county_name","county","color"]]
votes2=votes2[votes2.loc[:,"state_abbr"]!="AK"].reset_index(drop=True)
votes2.head()

Unnamed: 0,state_abbr,county_name,county,color
0,AL,Autauga County,Autauga,r
1,AL,Baldwin County,Baldwin,r
2,AL,Barbour County,Barbour,r
3,AL,Bibb County,Bibb,r
4,AL,Blount County,Blount,r


Now read in the county facts table. This dataset has many important columns; here is what a few of them represent:

PST045214	Population, 2014 estimate

PST120214	Population, percent change - April 1, 2010 to July 1, 2014

AGE135214	Persons under 5 years, percent, 2014

SEX255214	Female persons, percent, 2014

RHI125214	White alone, percent, 2014

HSD310213	Persons per household, 2009-2013


I removed Alaska as well as null rows. Then, for both datasets, I made a common column that combines the state and county name.

In [6]:
### Read in demographics data
demo=pd.read_csv("2012-and-2016-presidential-elections\\county_facts.csv")
demo["county_name"]=demo["area_name"]

### Remove Alaska and null rows
demo2=demo[demo.loc[:,"state_abbreviation"]!="AK"].reset_index(drop=True)
demo2=demo2[pd.isnull(demo2.loc[:,"state_abbreviation"])==False].reset_index(drop=True)

### Make loc a combination of state and county to join datasets
demo2['loc']=demo2['state_abbreviation']+" "+demo2['area_name']
demo2.head()

Unnamed: 0,fips,area_name,state_abbreviation,PST045214,PST040210,PST120214,POP010210,AGE135214,AGE295214,AGE775214,...,MAN450207,WTN220207,RTN130207,RTN131207,AFN120207,BPS030214,LND110210,POP060210,county_name,loc
0,1001,Autauga County,AL,55395,54571,1.5,54571,6.0,25.2,13.8,...,0,0,598175,12003,88157,131,594.44,91.8,Autauga County,AL Autauga County
1,1003,Baldwin County,AL,200111,182265,9.8,182265,5.6,22.2,18.7,...,1410273,0,2966489,17166,436955,1384,1589.78,114.6,Baldwin County,AL Baldwin County
2,1005,Barbour County,AL,26887,27457,-2.1,27457,5.7,21.2,16.5,...,0,0,188337,6334,0,8,884.88,31.0,Barbour County,AL Barbour County
3,1007,Bibb County,AL,22506,22919,-1.8,22915,5.3,21.0,14.8,...,0,0,124707,5804,10757,19,622.58,36.8,Bibb County,AL Bibb County
4,1009,Blount County,AL,57719,57322,0.7,57322,6.1,23.6,17.0,...,341544,0,319700,5622,20941,3,644.78,88.9,Blount County,AL Blount County


In [7]:
votes2['loc']=votes2['state_abbr']+" "+votes2['county_name']
votes2.head()

Unnamed: 0,state_abbr,county_name,county,color,loc
0,AL,Autauga County,Autauga,r,AL Autauga County
1,AL,Baldwin County,Baldwin,r,AL Baldwin County
2,AL,Barbour County,Barbour,r,AL Barbour County
3,AL,Bibb County,Bibb,r,AL Bibb County
4,AL,Blount County,Blount,r,AL Blount County


Now I join the two datasets on the columns they share.

In [8]:
### Join datasets
elect=pd.merge(demo2,votes2)
elect.shape

(3110, 59)
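
A note on what the merge above does: called with no `on` or `how` arguments, `pd.merge` performs an inner join on every column name the two tables share, which here should be `county_name` and `loc`. The toy tables below are made up for illustration (they are not rows from the Kaggle data). They show that rows without a match on both shared columns are dropped, and how an outer join with `indicator=True` can list them.

```python
import pandas as pd

# Hypothetical miniature versions of demo2 and votes2 (made-up values)
demo_toy = pd.DataFrame({
    "county_name": ["Adams County", "Berks County", "Kent County"],
    "loc": ["PA Adams County", "PA Berks County", "DE Kent County"],
    "PST045214": [100, 200, 300],
})
votes_toy = pd.DataFrame({
    "county_name": ["Adams County", "Berks County", "Essex County"],
    "loc": ["PA Adams County", "PA Berks County", "NJ Essex County"],
    "color": ["r", "r", "b"],
})

# Default merge: inner join on all shared column names (county_name and loc)
joined = pd.merge(demo_toy, votes_toy)

# An outer join with indicator=True marks rows found in only one table
audit = pd.merge(demo_toy, votes_toy, how="outer", indicator=True)
unmatched = audit[audit["_merge"] != "both"]

print(joined.shape)
print(unmatched[["loc", "_merge"]])
```

An audit like this could help explain any gap between the input row counts and the 3110 rows of the merged table.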

In [9]:
elect.head()

Unnamed: 0,fips,area_name,state_abbreviation,PST045214,PST040210,PST120214,POP010210,AGE135214,AGE295214,AGE775214,...,RTN131207,AFN120207,BPS030214,LND110210,POP060210,county_name,loc,state_abbr,county,color
0,1001,Autauga County,AL,55395,54571,1.5,54571,6.0,25.2,13.8,...,12003,88157,131,594.44,91.8,Autauga County,AL Autauga County,AL,Autauga,r
1,1003,Baldwin County,AL,200111,182265,9.8,182265,5.6,22.2,18.7,...,17166,436955,1384,1589.78,114.6,Baldwin County,AL Baldwin County,AL,Baldwin,r
2,1005,Barbour County,AL,26887,27457,-2.1,27457,5.7,21.2,16.5,...,6334,0,8,884.88,31.0,Barbour County,AL Barbour County,AL,Barbour,r
3,1007,Bibb County,AL,22506,22919,-1.8,22915,5.3,21.0,14.8,...,5804,10757,19,622.58,36.8,Bibb County,AL Bibb County,AL,Bibb,r
4,1009,Blount County,AL,57719,57322,0.7,57322,6.1,23.6,17.0,...,5622,20941,3,644.78,88.9,Blount County,AL Blount County,AL,Blount,r


Next I am making a map of PA using Plotly and Mapbox. First let's subset the PA data.

In [10]:
### Subset PA to map expected winners of counties
paelect=elect.loc[elect["state_abbreviation"]=="PA"].reset_index(drop=True)
paelect.head()

Unnamed: 0,fips,area_name,state_abbreviation,PST045214,PST040210,PST120214,POP010210,AGE135214,AGE295214,AGE775214,...,RTN131207,AFN120207,BPS030214,LND110210,POP060210,county_name,loc,state_abbr,county,color
0,42001,Adams County,PA,101714,101413,0.3,101407,5.1,20.6,18.3,...,8222,160108,272,518.67,195.5,Adams County,PA Adams County,PA,Adams,r
1,42003,Allegheny County,PA,1231255,1223348,0.6,1223348,5.3,19.0,17.4,...,16456,2540334,2343,730.08,1675.6,Allegheny County,PA Allegheny County,PA,Allegheny,b
2,42005,Armstrong County,PA,67785,68940,-1.7,68941,5.0,19.7,20.1,...,7734,39542,45,653.2,105.5,Armstrong County,PA Armstrong County,PA,Armstrong,r
3,42007,Beaver County,PA,169392,170539,-0.7,170539,5.1,19.7,19.7,...,8989,156309,141,434.71,392.3,Beaver County,PA Beaver County,PA,Beaver,r
4,42009,Bedford County,PA,48946,49768,-1.7,49762,5.0,20.7,20.5,...,11712,70289,66,1012.3,49.2,Bedford County,PA Bedford County,PA,Bedford,r


In order to do county-level mapping with Plotly, I had to create a Mapbox account. In addition, I had to create a Plotly account to allow plotting in a Jupyter notebook.

To color counties in a Plotly plot, I had to retrieve a GeoJSON file of PA from: http://catalog.civicdashboards.com/dataset/pennsylvania-counties-polygon

I will parse the GeoJSON to retrieve information such as the county names and their latitudes and longitudes.

In [2]:
### Open geojson
with open('Geojson_Data/PA.geojson') as f:
    data = json.load(f)

Get the county names and create pandas dataframe from it.

In [41]:
### Take the county names from the geojson
county_names = []
county_names_dict = {}
county_locs=[]
suffix = ' County'
for county in data['features']:
    full_name = county['properties']['name']
    ### Keep the part of the name before " County", e.g. "Blair County" -> "Blair"
    if full_name.endswith(suffix):
        short_name = full_name[:-len(suffix)]
        county_names.append(short_name)
        ### Store the whole coordinate array; only its first ring is read later
        county_locs.append(county['geometry']['coordinates'])
        county_names_dict[short_name] = full_name
PAcounties=pd.DataFrame(county_names)
PAcounties.head()

Unnamed: 0,0
0,Blair
1,Crawford
2,Snyder
3,Indiana
4,Pike


Retrieve the average lat/lon for each county in PA.

In [42]:
county_lat=[]
county_lon=[]
for i in range(0,len(county_locs)):
    templats=[]
    templons=[]
    for j in range(0,len(county_locs[i][0][0])):
        templons.append(county_locs[i][0][0][j][0])
        templats.append(county_locs[i][0][0][j][1])
    templat=sum(templats)/len(templats)
    templon=sum(templons)/len(templons)
    county_lat.append(templat)
    county_lon.append(templon)
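
The loop above averages the vertices of each county's first polygon ring. That is a quick approximation of the county's center, but it is pulled toward stretches of border that have many closely spaced vertices. As a sketch of an alternative (not what this notebook uses), the area-weighted centroid from the shoelace formula depends only on the shape. The function names and the test ring below are hypothetical, and both functions treat longitude and latitude as flat x/y coordinates.

```python
def vertex_mean(ring):
    """Average of the ring's vertices; ring is a list of [lon, lat] pairs."""
    lons = [p[0] for p in ring]
    lats = [p[1] for p in ring]
    return sum(lons) / len(lons), sum(lats) / len(lats)

def area_centroid(ring):
    """Area-weighted centroid of a simple polygon via the shoelace formula."""
    twice_area = 0.0
    cx = 0.0
    cy = 0.0
    n = len(ring)
    for k in range(n):
        x0, y0 = ring[k][0], ring[k][1]
        x1, y1 = ring[(k + 1) % n][0], ring[(k + 1) % n][1]
        cross = x0 * y1 - x1 * y0
        twice_area += cross
        cx += (x0 + x1) * cross
        cy += (y0 + y1) * cross
    # Degenerate rings (zero area) would divide by zero here
    return cx / (3.0 * twice_area), cy / (3.0 * twice_area)

# A 2x2 square with one extra vertex on its bottom edge
ring = [[0, 0], [1, 0], [2, 0], [2, 2], [0, 2]]

print(vertex_mean(ring))    # pulled toward the edge with the extra vertex
print(area_centroid(ring))  # stays at the center of the square
```

For placing hover markers either approach is probably adequate; the difference matters more for counties with very irregular borders.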


Partition the counties into red and blue groups depending on the winner of the county.

In [43]:
red_counties = []
blue_counties = []
county_colors=[]
for i in range(len(PAcounties)):
    for j in range(len(paelect["county"])):
        b=PAcounties.iloc[i] == paelect.iloc[j,57]
        if b[0]==True:
            if paelect.iloc[j,58]=='r':
                red_counties.append(data['features'][i])
                county_colors.append('red')
            else:
                blue_counties.append(data['features'][i])
                county_colors.append('blue')

red_data = {"type": "FeatureCollection"}
red_data['features'] = red_counties

blue_data = {"type": "FeatureCollection"}
blue_data['features'] = blue_counties

Create the plotly iplot. 

In [44]:
mapbox_access_token = "your access token"

data = graph_objs.Data([
    graph_objs.Scattermapbox(
        lat=county_lat,
        lon=county_lon,
        mode='markers',
        text=county_names,
        hoverinfo= 'text',
        marker=dict(color=county_colors)
    )
])
layout = graph_objs.Layout(
    height=600,
    autosize=True,
    hovermode='closest',
    mapbox=dict(
        layers=[
            dict(
                sourcetype = 'geojson',
                source = red_data,
                type = 'fill',
                color = 'rgba(163,22,19,0.8)'
            ),
            dict(
                sourcetype = 'geojson',
                source = blue_data,
                type = 'fill',
                color = 'rgba(40,0,113,0.8)'
            )
        ],
        accesstoken=mapbox_access_token,
        bearing=0,
        center=dict(
            lat=40.8,
            lon=-76
        ),
        pitch=0,
        zoom=5.2,
        style='light'
    ),
)

from plotly import __version__
from plotly.graph_objs import *
from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot
init_notebook_mode(connected=True)
fig = dict(data=data, layout=layout)
#plotly.plotly.iplot(fig)
plotly.offline.plot(fig, filename='Actual Results')

'file://C:\\Users\\shard\\Documents\\BigData\\Project\\Actual Results.html'

https://cdn.rawgit.com/sharder14/big-data-python-class/master/Project/Actual%20Results.html

Now let's go back to the joined dataset and put the columns in a more convenient order: location first, then the winner of the county, then the attributes.

In [16]:
### Go back to elect
### Want location, winner, then attributes
elect2=pd.concat([elect.iloc[:,[54,55,56,57,58]],elect.iloc[:,3:54]],axis=1)
elect2.head()

Unnamed: 0,county_name,loc,state_abbr,county,color,PST045214,PST040210,PST120214,POP010210,AGE135214,...,SBO415207,SBO015207,MAN450207,WTN220207,RTN130207,RTN131207,AFN120207,BPS030214,LND110210,POP060210
0,Autauga County,AL Autauga County,AL,Autauga,r,55395,54571,1.5,54571,6.0,...,0.7,31.7,0,0,598175,12003,88157,131,594.44,91.8
1,Baldwin County,AL Baldwin County,AL,Baldwin,r,200111,182265,9.8,182265,5.6,...,1.3,27.3,1410273,0,2966489,17166,436955,1384,1589.78,114.6
2,Barbour County,AL Barbour County,AL,Barbour,r,26887,27457,-2.1,27457,5.7,...,0.0,27.0,0,0,188337,6334,0,8,884.88,31.0
3,Bibb County,AL Bibb County,AL,Bibb,r,22506,22919,-1.8,22915,5.3,...,0.0,0.0,0,0,124707,5804,10757,19,622.58,36.8
4,Blount County,AL Blount County,AL,Blount,r,57719,57322,0.7,57322,6.1,...,0.0,23.2,341544,0,319700,5622,20941,3,644.78,88.9


## 3. Algorithms

The algorithms I used were Support Vector Machine, K-Nearest Neighbor, and Logistic Regression. In addition, I used a feature selection method, recursive feature elimination, to reduce the number of predictor variables.

First, I split my data into training and testing sets. The training set was all counties outside PA; the testing set was all counties in PA.

In [45]:
### Split training and test data
train=elect2.loc[elect2['state_abbr']!="PA"]
test=elect2.loc[elect2['state_abbr']=='PA']

### Split attributes and classes
trainatt=train.iloc[:,5:56]
trainatt.head()

Unnamed: 0,PST045214,PST040210,PST120214,POP010210,AGE135214,AGE295214,AGE775214,SEX255214,RHI125214,RHI225214,...,SBO415207,SBO015207,MAN450207,WTN220207,RTN130207,RTN131207,AFN120207,BPS030214,LND110210,POP060210
0,55395,54571,1.5,54571,6.0,25.2,13.8,51.4,77.9,18.7,...,0.7,31.7,0,0,598175,12003,88157,131,594.44,91.8
1,200111,182265,9.8,182265,5.6,22.2,18.7,51.2,87.1,9.6,...,1.3,27.3,1410273,0,2966489,17166,436955,1384,1589.78,114.6
2,26887,27457,-2.1,27457,5.7,21.2,16.5,46.6,50.2,47.6,...,0.0,27.0,0,0,188337,6334,0,8,884.88,31.0
3,22506,22919,-1.8,22915,5.3,21.0,14.8,45.9,76.3,22.1,...,0.0,0.0,0,0,124707,5804,10757,19,622.58,36.8
4,57719,57322,0.7,57322,6.1,23.6,17.0,50.5,96.0,1.8,...,0.0,23.2,341544,0,319700,5622,20941,3,644.78,88.9


In [46]:
traincls=train.iloc[:,4]
traincls.head()

0    r
1    r
2    r
3    r
4    r
Name: color, dtype: object

In [47]:
testatt=test.iloc[:,5:56]
testcls=test.iloc[:,4]

Now that my data is split into training and testing along with attributes and classes, I can implement the algorithms.

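A support vector machine separates the classes with a hyperplane chosen to leave the widest possible margin between them. With a nonlinear kernel (scikit-learn's SVC uses an RBF kernel by default), this is effectively taking data of a low dimension and projecting it onto a higher dimension, where classes that cannot be split by a hyperplane in the original space may become separable.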
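
As a small illustration of the kernel idea (synthetic data, not the election data), the sketch below fits two SVMs to points arranged in two concentric circles. No straight line separates the circles, so a linear kernel should do poorly, while the RBF kernel, which is the `SVC` default, can curve its boundary around the inner circle. The dataset parameters are arbitrary choices for the demonstration.

```python
from sklearn import svm
from sklearn.datasets import make_circles

# Synthetic two-class data: an inner circle surrounded by an outer circle
X, y = make_circles(n_samples=200, noise=0.05, factor=0.4, random_state=0)

# Linear kernel: the decision boundary is a straight line in the original space
linear_acc = svm.SVC(kernel="linear").fit(X, y).score(X, y)

# RBF kernel (the SVC default): the boundary can curve around the inner circle
rbf_acc = svm.SVC(kernel="rbf").fit(X, y).score(X, y)

print("linear:", linear_acc, "rbf:", rbf_acc)
```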

In [48]:
### SVM
from sklearn import svm
from sklearn import metrics
svmp=svm.SVC()
svmp.fit(trainatt,traincls)
pred1=svmp.predict(testatt)
metrics.accuracy_score(testcls,pred1)

0.83582089552238803

Here we have an accuracy of about 84% on the data for PA. Now we can plot the predicted counties to investigate.

In [49]:
### Plot for SVM
### Open geojson
with open('Geojson_Data/PA.geojson') as f:
    data = json.load(f)

red_counties = []
blue_counties = []
green_counties=[]
county_colors=[]
county_text=[]

for i in range(len(PAcounties)):
    for j in range(len(test["county"])):
        b=PAcounties.iloc[i] == test['county'].iloc[j]
        if b[0]==True:
            if ((pred1[j]=='r') and (test['color'].iloc[j]=='r')):
                red_counties.append(data['features'][i])
                county_colors.append('red')
                county_text.append(county_names[i])
            else:
                if((pred1[j]=='b') and (test['color'].iloc[j]=='b')):
                    blue_counties.append(data['features'][i])
                    county_colors.append('blue')
                    county_text.append(county_names[i])
                else:
                    green_counties.append(data['features'][i])
                    county_colors.append('green')
                    pred=''
                    act=''
                    if pred1[j]=='b':
                        pred='Democrat'
                    else: pred='Republican'
                    if test['color'].iloc[j]=='b':
                        act='Democrat'
                    else: act='Republican'
                    county_text.append(county_names[i]+"<br>Predicted: "+pred+"<br>Actual: "+act)

red_data = {"type": "FeatureCollection"}
red_data['features'] = red_counties
blue_data = {"type": "FeatureCollection"}
blue_data['features'] = blue_counties
green_data = {"type": "FeatureCollection"}
green_data['features'] = green_counties


### Get the map
mapbox_access_token = "your access token"

data = graph_objs.Data([
    graph_objs.Scattermapbox(
        lat=county_lat,
        lon=county_lon,
        mode='markers',
        text=county_text,
        hoverinfo= 'text',
        marker=dict(color=county_colors)
    )
])
layout = graph_objs.Layout(
    height=600,
    autosize=True,
    hovermode='closest',
    mapbox=dict(
        layers=[
            dict(
                sourcetype = 'geojson',
                source = red_data,
                type = 'fill',
                color = 'rgba(163,22,19,0.8)'
            ),
            dict(
                sourcetype = 'geojson',
                source = blue_data,
                type = 'fill',
                color = 'rgba(40,0,113,0.8)'
            ),
            dict(
                sourcetype = 'geojson',
                source = green_data,
                type = 'fill',
                color = 'rgba(63,191,76,0.8)'
            )
        ],
        accesstoken=mapbox_access_token,
        bearing=0,
        center=dict(
            lat=40.8,
            lon=-76
        ),
        pitch=0,
        zoom=5.2,
        style='light'
    ),
)

fig = dict(data=data, layout=layout)
#plotly.plotly.iplot(fig)
plotly.offline.plot(fig, filename='SVM All Atts')


'file://C:\\Users\\shard\\Documents\\BigData\\Project\\SVM All Atts.html'

https://cdn.rawgit.com/sharder14/big-data-python-class/master/Project/SVM%20All%20Atts.html

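Inspecting the plot shows that every Democratic county was predicted as Republican, while every Republican county was predicted correctly. In other words, the model predicted Republican for all 67 PA counties, so its accuracy (56/67, about 83.6%) is simply the share of counties that Republicans won.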
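
Accuracy alone hid this behavior. Here is a sketch of two checks that would surface it: a comparison against a majority-class baseline, and a confusion matrix. The label arrays are hypothetical stand-ins rather than the notebook's predictions, sized to be consistent with the accuracy reported above (56 Republican and 11 Democratic counties out of 67).

```python
import numpy as np
from sklearn import metrics
from sklearn.dummy import DummyClassifier

# Hypothetical labels: 56 'r' counties and 11 'b' counties
y_true = np.array(["r"] * 56 + ["b"] * 11)
# A model that predicts 'r' everywhere, as the SVM map above suggests
y_pred = np.array(["r"] * 67)

# Baseline that always predicts the most frequent class.
# For brevity it is fit on the same toy labels; features are ignored.
X_dummy = np.zeros((67, 1))
baseline = DummyClassifier(strategy="most_frequent").fit(X_dummy, y_true)
baseline_acc = baseline.score(X_dummy, y_true)

model_acc = metrics.accuracy_score(y_true, y_pred)

# Rows are actual classes, columns are predicted classes, ordered as in labels
cm = metrics.confusion_matrix(y_true, y_pred, labels=["b", "r"])

print(baseline_acc, model_acc)
print(cm)
```

When a model's accuracy equals the baseline's and one column of the confusion matrix is all zeros, the model has not learned to distinguish the classes.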

Now let's try to do better with our predictions. I am running a recursive feature elimination (RFE) algorithm, which repeatedly fits a model and drops the least important attributes until the requested number remain, ranking the attributes by importance along the way. I chose logistic regression as the model used to rank importance.

Logistic regression models the log-odds (logit) of one class as a linear function of the attributes, which gives an S-shaped probability curve for binary data. It calculates the probability of each outcome, and the prediction is the one with the higher probability.
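
To make the probability step concrete, the sketch below fits scikit-learn's `LogisticRegression` on synthetic data and recomputes its predicted probabilities by hand from the fitted coefficients using the sigmoid function. The data and variable names are illustrative assumptions, not part of the election analysis.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic binary data standing in for the county attributes
X, y = make_classification(n_samples=200, n_features=4, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X, y)

# The log-odds (logit) of class 1 is linear in the attributes: z = X.w + b
z = X @ model.coef_[0] + model.intercept_[0]

# The sigmoid turns log-odds into a probability between 0 and 1
p_manual = 1.0 / (1.0 + np.exp(-z))

# Predict class 1 when its probability exceeds 0.5
pred_manual = (p_manual > 0.5).astype(int)

print(np.allclose(p_manual, model.predict_proba(X)[:, 1]))
```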

In [50]:
### Reduce number of attributes
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
model = LogisticRegression()
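### Keep the 10 top-ranked attributes (newer scikit-learn versions require the keyword form)
rfe = RFE(model, n_features_to_select=10)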
rfe = rfe.fit(trainatt, traincls)

attindex=[]
for i in range(0,len(rfe.ranking_)):
    if rfe.ranking_[i]==1:
        attindex.append(i)
trainatt=trainatt.iloc[:,attindex]
traincls=traincls
testatt=testatt.iloc[:,attindex]
testcls=testcls
trainatt.head()

Unnamed: 0,PST120214,AGE135214,AGE295214,AGE775214,SEX255214,RHI525214,RHI825214,EDU685213,HSD310213,SBO515207
0,1.5,6.0,25.2,13.8,51.4,0.1,75.6,20.9,2.71,0.0
1,9.8,5.6,22.2,18.7,51.2,0.1,83.0,27.7,2.52,0.0
2,-2.1,5.7,21.2,16.5,46.6,0.2,46.6,13.4,2.66,0.0
3,-1.8,5.3,21.0,14.8,45.9,0.1,74.5,12.1,3.03,0.0
4,0.7,6.1,23.6,17.0,50.5,0.1,87.8,12.1,2.7,0.0


The most important attributes for the model are:
    
PST120214	Population, percent change - April 1, 2010 to July 1, 2014

AGE135214	Persons under 5 years, percent, 2014

AGE295214	Persons under 18 years, percent, 2014

AGE775214	Persons 65 years and over, percent, 2014

SEX255214	Female persons, percent, 2014

RHI525214	Native Hawaiian and Other Pacific Islander alone, percent, 2014

RHI825214	White alone, not Hispanic or Latino, percent, 2014

EDU685213	Bachelor's degree or higher, percent of persons age 25+, 2009-2013

HSD310213	Persons per household, 2009-2013

SBO515207	Native Hawaiian- and Other Pacific Islander-owned firms, percent, 2007
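
For readers unfamiliar with recursive feature elimination, here is a minimal sketch on synthetic data showing how `RFE` exposes its result: `support_` is a boolean mask of the kept attributes, and `ranking_` assigns rank 1 to every kept attribute. The dataset and the choice of three features are arbitrary.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# Synthetic data: 8 attributes, of which 3 carry signal
X, y = make_classification(n_samples=300, n_features=8, n_informative=3,
                           n_redundant=0, random_state=0)

# RFE refits the model repeatedly, dropping the attribute with the
# smallest coefficient magnitude each round until 3 remain
rfe = RFE(LogisticRegression(max_iter=1000), n_features_to_select=3)
rfe.fit(X, y)

# Column indices of the kept attributes, equivalent to collecting ranking_ == 1
kept = np.flatnonzero(rfe.support_)

print(kept, rfe.ranking_)
```

Because coefficient size depends on each attribute's scale, standardizing the attributes first may change which ones are kept.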

Now we are going to run the model on the reduced dataset.

In [51]:
svmp=svm.SVC()
svmp.fit(trainatt,traincls)
pred1=svmp.predict(testatt)
pred1
metrics.accuracy_score(testcls,pred1)

0.86567164179104472

We can see that this is a decent increase from before, from about 83.6% to 86.6%. Let us plot the map of the results.

In [52]:
### Open geojson
with open('Geojson_Data/PA.geojson') as f:
    data = json.load(f)

red_counties = []
blue_counties = []
green_counties=[]
county_colors=[]
county_text=[]

for i in range(len(PAcounties)):
    for j in range(len(test["county"])):
        b=PAcounties.iloc[i] == test['county'].iloc[j]
        if b[0]==True:
            if ((pred1[j]=='r') and (test['color'].iloc[j]=='r')):
                red_counties.append(data['features'][i])
                county_colors.append('red')
                county_text.append(county_names[i])
            else:
                if((pred1[j]=='b') and (test['color'].iloc[j]=='b')):
                    blue_counties.append(data['features'][i])
                    county_colors.append('blue')
                    county_text.append(county_names[i])
                else:
                    green_counties.append(data['features'][i])
                    county_colors.append('green')
                    pred=''
                    act=''
                    if pred1[j]=='b':
                        pred='Democrat'
                    else: pred='Republican'
                    if test['color'].iloc[j]=='b':
                        act='Democrat'
                    else: act='Republican'
                    county_text.append(county_names[i]+"<br>Predicted: "+pred+"<br>Actual: "+act)

red_data = {"type": "FeatureCollection"}
red_data['features'] = red_counties
blue_data = {"type": "FeatureCollection"}
blue_data['features'] = blue_counties
green_data = {"type": "FeatureCollection"}
green_data['features'] = green_counties


### Get the map
mapbox_access_token = "your access token"

data = graph_objs.Data([
    graph_objs.Scattermapbox(
        lat=county_lat,
        lon=county_lon,
        mode='markers',
        text=county_text,
        hoverinfo= 'text',
        marker=dict(color=county_colors)
    )
])
layout = graph_objs.Layout(
    height=600,
    autosize=True,
    hovermode='closest',
    mapbox=dict(
        layers=[
            dict(
                sourcetype = 'geojson',
                source = red_data,
                type = 'fill',
                color = 'rgba(163,22,19,0.8)'
            ),
            dict(
                sourcetype = 'geojson',
                source = blue_data,
                type = 'fill',
                color = 'rgba(40,0,113,0.8)'
            ),
            dict(
                sourcetype = 'geojson',
                source = green_data,
                type = 'fill',
                color = 'rgba(63,191,76,0.8)'
            )
        ],
        accesstoken=mapbox_access_token,
        bearing=0,
        center=dict(
            lat=40.8,
            lon=-76
        ),
        pitch=0,
        zoom=5.2,
        style='light'
    ),
)

fig = dict(data=data, layout=layout)
#plotly.plotly.iplot(fig)
plotly.offline.plot(fig, filename='SVM')


'file://C:\\Users\\shard\\Documents\\BigData\\Project\\SVM.html'

https://cdn.rawgit.com/sharder14/big-data-python-class/master/Project/SVM.html

Now we will try to predict using logistic regression.

In [54]:
### Logistic Regression
from sklearn import linear_model
logreg=linear_model.LogisticRegression()
logreg.fit(trainatt,traincls)
pred2=logreg.predict(testatt)
metrics.accuracy_score(testcls,pred2)

0.92537313432835822

Logistic regression shows a significant increase in accuracy over SVM (92.5% versus 86.6%). Let's plot the map of results.

In [55]:
### Open geojson
with open('Geojson_Data/PA.geojson') as f:
    data = json.load(f)

red_counties = []
blue_counties = []
green_counties=[]
county_colors=[]
county_text=[]

for i in range(len(PAcounties)):
    for j in range(len(test["county"])):
        b=PAcounties.iloc[i] == test['county'].iloc[j]
        if b[0]==True:
            if ((pred2[j]=='r') and (test['color'].iloc[j]=='r')):
                red_counties.append(data['features'][i])
                county_colors.append('red')
                county_text.append(county_names[i])
            else:
                if((pred2[j]=='b') and (test['color'].iloc[j]=='b')):
                    blue_counties.append(data['features'][i])
                    county_colors.append('blue')
                    county_text.append(county_names[i])
                else:
                    green_counties.append(data['features'][i])
                    county_colors.append('green')
                    pred=''
                    act=''
                    if pred2[j]=='b':
                        pred='Democrat'
                    else: pred='Republican'
                    if test['color'].iloc[j]=='b':
                        act='Democrat'
                    else: act='Republican'
                    county_text.append(county_names[i]+"<br>Predicted: "+pred+"<br>Actual: "+act)

red_data = {"type": "FeatureCollection"}
red_data['features'] = red_counties
blue_data = {"type": "FeatureCollection"}
blue_data['features'] = blue_counties
green_data = {"type": "FeatureCollection"}
green_data['features'] = green_counties


### Get the map
mapbox_access_token = "your access token"

data = graph_objs.Data([
    graph_objs.Scattermapbox(
        lat=county_lat,
        lon=county_lon,
        mode='markers',
        text=county_text,
        hoverinfo= 'text',
        marker=dict(color=county_colors)
    )
])
layout = graph_objs.Layout(
    height=600,
    autosize=True,
    hovermode='closest',
    mapbox=dict(
        layers=[
            dict(
                sourcetype = 'geojson',
                source = red_data,
                type = 'fill',
                color = 'rgba(163,22,19,0.8)'
            ),
            dict(
                sourcetype = 'geojson',
                source = blue_data,
                type = 'fill',
                color = 'rgba(40,0,113,0.8)'
            ),
            dict(
                sourcetype = 'geojson',
                source = green_data,
                type = 'fill',
                color = 'rgba(63,191,76,0.8)'
            )
        ],
        accesstoken=mapbox_access_token,
        bearing=0,
        center=dict(
            lat=40.8,
            lon=-76
        ),
        pitch=0,
        zoom=5.2,
        style='light'
    ),
)

fig = dict(data=data, layout=layout)
#plotly.plotly.iplot(fig)
plotly.offline.plot(fig, filename='LogReg')


'file://C:\\Users\\shard\\Documents\\BigData\\Project\\LogReg.html'

https://cdn.rawgit.com/sharder14/big-data-python-class/master/Project/LogReg.html

Lastly, let's use the K-nearest neighbor (KNN) algorithm to predict. KNN goes through each point in the test set and calculates the Euclidean distance between it and every training point. The k closest training points then vote, and the test point is assigned the majority class among them (here k = 3).
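
The steps just described can be sketched from scratch in a few lines. This is an illustrative version under simple assumptions (brute-force search, unweighted votes), not scikit-learn's implementation, and the points are made up.

```python
from collections import Counter
import numpy as np

def knn_predict(train_X, train_y, test_X, k=3):
    """Classify each test point by majority vote among its k nearest training points."""
    train_X = np.asarray(train_X, dtype=float)
    test_X = np.asarray(test_X, dtype=float)
    predictions = []
    for point in test_X:
        # Euclidean distance from this test point to every training point
        distances = np.sqrt(((train_X - point) ** 2).sum(axis=1))
        # Indices of the k smallest distances
        nearest = np.argsort(distances)[:k]
        # Majority vote among the labels of those neighbors
        votes = Counter(train_y[i] for i in nearest)
        predictions.append(votes.most_common(1)[0][0])
    return predictions

# Made-up points: three 'r' points near the origin, three 'b' points near (5, 5)
train_X = [[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]]
train_y = ["r", "r", "r", "b", "b", "b"]
test_X = [[0.5, 0.5], [5.5, 5.5]]

print(knn_predict(train_X, train_y, test_X, k=3))  # ['r', 'b']
```

Because the distance sums squared differences over all attributes, attributes with large numeric ranges dominate the result unless the data are rescaled.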

In [56]:
from sklearn import neighbors
knn=neighbors.KNeighborsClassifier(3)
knn.fit(trainatt,traincls)
pred3=knn.predict(testatt)
metrics.accuracy_score(testcls,pred3)

0.92537313432835822

We got exactly the same accuracy as we did for logistic regression. Let's look at the map and see if there are any differences.

In [57]:
### Open geojson
with open('Geojson_Data/PA.geojson') as f:
    data = json.load(f)

red_counties = []
blue_counties = []
green_counties=[]
county_colors=[]
county_text=[]

for i in range(len(PAcounties)):
    for j in range(len(test["county"])):
        b=PAcounties.iloc[i] == test['county'].iloc[j]
        if b[0]==True:
            if ((pred3[j]=='r') and (test['color'].iloc[j]=='r')):
                red_counties.append(data['features'][i])
                county_colors.append('red')
                county_text.append(county_names[i])
            else:
                if((pred3[j]=='b') and (test['color'].iloc[j]=='b')):
                    blue_counties.append(data['features'][i])
                    county_colors.append('blue')
                    county_text.append(county_names[i])
                else:
                    green_counties.append(data['features'][i])
                    county_colors.append('green')
                    pred=''
                    act=''
                    if pred3[j]=='b':
                        pred='Democrat'
                    else: pred='Republican'
                    if test['color'].iloc[j]=='b':
                        act='Democrat'
                    else: act='Republican'
                    county_text.append(county_names[i]+"<br>Predicted: "+pred+"<br>Actual: "+act)

red_data = {"type": "FeatureCollection"}
red_data['features'] = red_counties
blue_data = {"type": "FeatureCollection"}
blue_data['features'] = blue_counties
green_data = {"type": "FeatureCollection"}
green_data['features'] = green_counties


### Get the map
mapbox_access_token = "your access token"

data = graph_objs.Data([
    graph_objs.Scattermapbox(
        lat=county_lat,
        lon=county_lon,
        mode='markers',
        text=county_text,
        hoverinfo= 'text',
        marker=dict(color=county_colors)
    )
])
layout = graph_objs.Layout(
    height=600,
    autosize=True,
    hovermode='closest',
    mapbox=dict(
        layers=[
            dict(
                sourcetype = 'geojson',
                source = red_data,
                type = 'fill',
                color = 'rgba(163,22,19,0.8)'
            ),
            dict(
                sourcetype = 'geojson',
                source = blue_data,
                type = 'fill',
                color = 'rgba(40,0,113,0.8)'
            ),
            dict(
                sourcetype = 'geojson',
                source = green_data,
                type = 'fill',
                color = 'rgba(63,191,76,0.8)'
            )
        ],
        accesstoken=mapbox_access_token,
        bearing=0,
        center=dict(
            lat=40.8,
            lon=-76
        ),
        pitch=0,
        zoom=5.2,
        style='light'
    ),
)

fig = dict(data=data, layout=layout)
#plotly.plotly.iplot(fig)
plotly.offline.plot(fig, filename='KNN')


'file://C:\\Users\\shard\\Documents\\BigData\\Project\\KNN.html'

https://cdn.rawgit.com/sharder14/big-data-python-class/master/Project/KNN.html

## 4. Results

The original dataset that was put together for analysis seemed to have too many variables. We can tell because our initial prediction using SVM was not as accurate as we would have wanted.

To fix this, we used recursive feature elimination to pick the 10 best attributes. After reducing the dataset, we saw an increase in accuracy for SVM and generally good results for the other algorithms we used.

Accuracy: SVM (86.6%), LogReg (92.5%), KNN (92.5%)

For a final analysis, I thought it would be interesting to see how well we could predict the tri-state area (PA, NJ, DE) from the rest of the country. However, I did not plot all three states because it would involve parsing three different GeoJSON files for just one plot.

First, split up the states.

In [30]:
### Train on every state outside the tri-state area, test on PA, NJ, and DE
tri_mask=elect2['state_abbr'].isin(['PA','NJ','DE'])
trainTRI=elect2[~tri_mask]

testTRI=elect2[tri_mask]


### Split attributes and classes
trainTRIatt=trainTRI.iloc[:,5:56]
trainTRIcls=trainTRI.iloc[:,4]
testTRIatt=testTRI.iloc[:,5:56]
testTRIcls=testTRI.iloc[:,4]

In [31]:
trainatt=trainTRIatt.iloc[:,attindex]
traincls=trainTRIcls
testatt=testTRIatt.iloc[:,attindex]
testcls=testTRIcls
trainatt.head()

Unnamed: 0,PST120214,AGE135214,AGE295214,AGE775214,SEX255214,RHI525214,RHI825214,EDU685213,HSD310213,SBO515207
0,1.5,6.0,25.2,13.8,51.4,0.1,75.6,20.9,2.71,0.0
1,9.8,5.6,22.2,18.7,51.2,0.1,83.0,27.7,2.52,0.0
2,-2.1,5.7,21.2,16.5,46.6,0.2,46.6,13.4,2.66,0.0
3,-1.8,5.3,21.0,14.8,45.9,0.1,74.5,12.1,3.03,0.0
4,0.7,6.1,23.6,17.0,50.5,0.1,87.8,12.1,2.7,0.0


In [32]:
from sklearn import svm
svmtri=svm.SVC()
svmtri.fit(trainatt,traincls)
predTRI=svmtri.predict(testatt)
metrics.accuracy_score(testcls,predTRI)

0.79120879120879117

## 5. Conclusion and References



Our algorithms were able to predict the county-level results of PA with up to 92.5% accuracy. The feature selection method, recursive feature elimination, contributed to this: it raised SVM accuracy from 83.6% to 86.6%. Logistic regression and KNN tied as the most accurate algorithms. Lastly, an SVM predicted the tri-state county-level results with about 79% accuracy. It is worth noting that these models are only useful for predicting the 2016 results. Such models are not reliable enough to predict future county-level results without further analysis of additional data.


References:

Stanford Project predicting senate elections:
http://cs229.stanford.edu/proj2014/Rohan%20Sampath,%20Yue%20Teng,%20Classification%20and%20Regression%20Approaches%20to%20Predicting%20US%20Senate%20Elections.pdf 

Scikit-Learn Documentation:
http://scikit-learn.org/stable/documentation.html 

Plotly Documentation:
https://plot.ly/python/ 

Help for plotting the Map in Plotly:
https://plot.ly/python/choropleth-maps/ 

Carnegie Mellon Lecture:
http://www.stat.cmu.edu/~cshalizi/uADA/12/lectures/ch12.pdf 

Support Vector Machine Examples:
https://www.analyticsvidhya.com/blog/2017/09/understaing-support-vector-machine-example-code/ 

Scikit-Learn Examples:
https://machinelearningmastery.com/get-your-hands-dirty-with-scikit-learn-now/

Feature selection:
https://machinelearningmastery.com/feature-selection-in-python-with-scikit-learn/