# Trump Vote Share - Income Analysis
## Table of Contents
1. Data Cleaning
2. Data Visualization and Analysis
3. Machine Learning
4. Conclusion


## 1. Data Cleaning

In [90]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import statsmodels.formula.api as smf
import plotly.express as px
import plotly.graph_objects as go

In [77]:
trump = pd.read_csv('https://raw.githubusercontent.com/mwaugh0328/Data_Bootcamp_Fall_2017/master/data_bootcamp_1127/trump_data.csv', 
                    encoding='latin-1')

In [78]:
trump.head()

Unnamed: 0.1,Unnamed: 0,population,income,NAME,county,state,FIPS,StateCode,StateName,CountyFips,CountyName,CountyTotalVote,Party,Candidate,VoteCount,_merge,trump_share
0,0,55221.0,51281.0,"Autauga County, Alabama",1,1,1001.0,AL,alabama,1001,Autauga,24661,GOP,Trump,18110.0,both,0.734358
1,5,195121.0,50254.0,"Baldwin County, Alabama",3,1,1003.0,AL,alabama,1003,Baldwin,94090,GOP,Trump,72780.0,both,0.773515
2,10,26932.0,32964.0,"Barbour County, Alabama",5,1,1005.0,AL,alabama,1005,Barbour,10390,GOP,Trump,5431.0,both,0.522714
3,15,22604.0,38678.0,"Bibb County, Alabama",7,1,1007.0,AL,alabama,1007,Bibb,8748,GOP,Trump,6733.0,both,0.769662
4,20,57710.0,45813.0,"Blount County, Alabama",9,1,1009.0,AL,alabama,1009,Blount,25384,GOP,Trump,22808.0,both,0.898519


In [79]:
trump.shape

(3112, 17)

In [80]:
trump.describe()

Unnamed: 0.1,Unnamed: 0,population,income,county,state,FIPS,CountyFips,CountyTotalVote,VoteCount,trump_share
count,3112.0,3112.0,3111.0,3112.0,3112.0,3112.0,3112.0,3112.0,3112.0,3112.0
mean,7777.5,101472.2,46662.262295,103.175129,30.548522,30651.696979,30651.696979,40315.75,19139.641067,0.636029
std,4492.506724,324453.2,12124.417511,107.834596,14.965305,14984.651238,14984.651238,105949.0,38486.948628,0.156363
min,0.0,117.0,19328.0,1.0,1.0,1001.0,1001.0,64.0,57.0,0.041221
25%,3888.75,11205.5,38747.5,35.0,19.0,19038.5,19038.5,4807.0,3193.0,0.549716
50%,7777.5,26010.0,45023.0,78.5,29.0,29208.0,29208.0,10910.5,7087.5,0.66711
75%,11666.25,67936.25,52109.5,133.0,46.0,46005.5,46005.5,28433.5,17321.25,0.750448
max,15555.0,10038390.0,123453.0,840.0,56.0,56045.0,56045.0,2240323.0,566019.0,0.952727


In [81]:
trump.corr(numeric_only=True)  # restrict to numeric columns; recent pandas raises on text columns otherwise

Unnamed: 0.1,Unnamed: 0,population,income,county,state,FIPS,CountyFips,CountyTotalVote,VoteCount,trump_share
Unnamed: 0,1.0,-0.061086,0.092661,0.20404,0.995888,0.996071,0.996071,-0.055698,-0.054335,0.050111
population,-0.061086,1.0,0.247073,-0.048813,-0.061003,-0.061276,-0.061276,0.951308,0.853867,-0.346788
income,0.092661,0.247073,1.0,-0.055744,0.095195,0.094672,0.094672,0.308981,0.343856,-0.189405
county,0.20404,-0.048813,-0.055744,1.0,0.175918,0.182888,0.182888,-0.059105,-0.061169,0.030484
state,0.995888,-0.061003,0.095195,0.175918,1.0,0.999975,0.999975,-0.052594,-0.050709,0.050352
FIPS,0.996071,-0.061276,0.094672,0.182888,0.999975,1.0,1.0,-0.052951,-0.051084,0.050506
CountyFips,0.996071,-0.061276,0.094672,0.182888,0.999975,1.0,1.0,-0.052951,-0.051084,0.050506
CountyTotalVote,-0.055698,0.951308,0.308981,-0.059105,-0.052594,-0.052951,-0.052951,1.0,0.930217,-0.392625
VoteCount,-0.054335,0.853867,0.343856,-0.061169,-0.050709,-0.051084,-0.051084,0.930217,1.0,-0.319443
trump_share,0.050111,-0.346788,-0.189405,0.030484,0.050352,0.050506,0.050506,-0.392625,-0.319443,1.0


In [82]:
trump.dtypes

Unnamed: 0           int64
population         float64
income             float64
NAME                object
county               int64
state                int64
FIPS               float64
StateCode           object
StateName           object
CountyFips           int64
CountyName          object
CountyTotalVote      int64
Party               object
Candidate           object
VoteCount          float64
_merge              object
trump_share        float64
dtype: object

**_Drop columns that are not needed for the analysis, such as `Unnamed: 0`._**

In [83]:
trump.drop(columns=['Unnamed: 0', 'NAME', 'Party', 'StateName', 'FIPS', '_merge', 'Candidate',
                    'county', 'state', 'CountyFips'], inplace=True)

In [84]:
trump.rename(columns={'population': 'Population', 'income': 'Income', 'StateCode': 'State',
                      'CountyName': 'County', 'CountyTotalVote': 'County Total Vote',
                      'VoteCount': 'Vote Count', 'trump_share': 'Trump Share'}, inplace=True)

**_Some county names are in all uppercase, so we convert every name to title case for a consistent format._**

In [85]:
trump['County'] = trump['County'].str.title()
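
As a quick illustration of what `str.title()` does, here is a minimal sketch on a small hand-made Series. The sample names are assumptions chosen for illustration and are not read from the dataset.

```python
import pandas as pd

# Hypothetical sample of county names in mixed formats (not taken from the data)
counties = pd.Series(['AUTAUGA', 'Baldwin', 'ST. CLAIR', 'McDowell'])

# str.title() capitalizes the first letter of each word and lowercases the rest
normalized = counties.str.title()
print(normalized.tolist())
```

One caveat of this approach: names with internal capitals, such as `McDowell`, become `Mcdowell`, so it may be worth spot-checking the result.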

## 2. Data Visualization and Analysis

### 1) Trump Share by County and State

In [86]:
top10_trump_support = trump.sort_values(by='Trump Share', ascending=False)[:10]

In [91]:
fig = px.sunburst(top10_trump_support, path=['State','County'], values='Trump Share', color='Trump Share',
                  color_continuous_scale='OrRd', title='Top 10 Trump Support County and State',
                  hover_data=['Trump Share'])
fig.update_layout(coloraxis_colorbar_title='Trump Share')
fig.update_traces(hovertemplate="Trump Share: %{customdata[0]}")
fig.show()

**_Among the ten counties with the highest Trump share, Texas accounts for the largest slice of the chart, and Roberts County has the highest Trump share among the Texas counties._**

### 2) Relationship between Income and Trump Share

In [58]:
fig = px.density_heatmap(trump, x="Income", y="Trump Share",
                        marginal_x="histogram", marginal_y="histogram",
                        title='Income and Trump Share')
fig.show()

### 3) Population and Trump Share

In [59]:
trump['Population_ln'] = np.log(trump['Population'])

In [60]:
fig = px.scatter(trump, x='Population_ln', y='Trump Share',
                labels={'Population_ln': 'Population (ln)'},
                title='Relationship Between Population and Trump Share',
                template='none')

fig.update_xaxes(showgrid=False)
fig.update_yaxes(showgrid=False)
fig.show()

## 3. Machine Learning

### a. Ordinary Least Squares

![image.png](attachment:image.png)

Ordinary Least Squares (OLS) fits a function to the data by minimizing the sum of squared errors between the observed and predicted values. It is used to estimate the unknown parameters of a linear regression model.
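
To make the idea concrete, here is a minimal sketch of OLS on synthetic data using NumPy. The data, the true coefficients (2.0 and -0.5), and the variable names are all assumptions for illustration; this is not the county data or the statsmodels call used in this notebook.

```python
import numpy as np

# Synthetic data (hypothetical): y = 2.0 - 0.5 * x + noise
rng = np.random.default_rng(0)
x = rng.uniform(9.5, 11.5, size=200)
y = 2.0 - 0.5 * x + rng.normal(0.0, 0.1, size=200)

# Design matrix with a column of ones for the intercept
X = np.column_stack([np.ones_like(x), x])

# lstsq returns the beta that minimizes sum((y - X @ beta) ** 2)
beta, _, _, _ = np.linalg.lstsq(X, y, rcond=None)
intercept, slope = beta

# Sum of squared errors at the fitted coefficients
residuals = y - X @ beta
sse = float(np.sum(residuals ** 2))
print(f"intercept={intercept:.3f}, slope={slope:.3f}, SSE={sse:.3f}")
```

Any other choice of intercept and slope would give a larger sum of squared errors on this data, which is what "least squares" refers to.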

In [61]:
trump['Income_ln'] = np.log(trump['Income'])
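# Forward-fill missing values; assigning the result avoids the deprecated fillna(method=..., inplace=True) pattern
trump['Income_ln'] = trump['Income_ln'].ffill()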

In [62]:
import statsmodels.formula.api as smf

ols_reg = smf.ols('Q("Trump Share") ~ Income_ln', trump).fit()
trump['pred_ols_reg'] = ols_reg.predict()

In [63]:
print(ols_reg.summary())

                            OLS Regression Results                            
Dep. Variable:       Q("Trump Share")   R-squared:                       0.019
Model:                            OLS   Adj. R-squared:                  0.019
Method:                 Least Squares   F-statistic:                     61.72
Date:                Sat, 08 Aug 2020   Prob (F-statistic):           5.40e-15
Time:                        17:58:55   Log-Likelihood:                 1389.9
No. Observations:                3112   AIC:                            -2776.
Df Residuals:                    3110   BIC:                            -2764.
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept      1.5917      0.122     13.082      0.0

In [64]:
trump['pred_ols_reg'].head()

0    0.624882
1    0.626685
2    0.664276
3    0.650025
4    0.634933
Name: pred_ols_reg, dtype: float64

**_We could simply plot the OLS using Plotly trendline function._**

In [65]:
fig = px.scatter(trump, x='Income_ln', y='Trump Share', trendline='ols',
                labels={'Income_ln': 'Income (ln)'},
                title='Relationship Between Income and Trump Share',
                template='none', trendline_color_override='red')
fig.update_xaxes(showgrid=False)
fig.update_yaxes(showgrid=False)
fig.show()

### b. Random Forest Regressor
![image.png](attachment:image.png)

A random forest (RF) is an ensemble technique that performs regression and classification tasks using multiple decision trees. For regression, the forest's prediction is the average of the individual trees' predictions.
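
The sketch below compares a random forest's R² on its training data with its R² on held-out data. It uses synthetic data with a deliberately weak signal; the data, seeds, and variable names are assumptions for illustration, not the county data.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Synthetic data (hypothetical): a weak linear signal buried in noise
rng = np.random.default_rng(0)
x = rng.uniform(9.5, 11.5, size=500)
y = 0.6 - 0.1 * (x - 10.5) + rng.normal(0.0, 0.15, size=500)

# Hold out 25% of the rows for evaluation
X_train, X_test, y_train, y_test = train_test_split(
    x.reshape(-1, 1), y, test_size=0.25, random_state=0)

# Fully grown trees (the default) can memorize much of the training noise
forest = RandomForestRegressor(n_estimators=100, random_state=0)
forest.fit(X_train, y_train)

# score() returns R^2; compare seen and unseen data
train_r2 = forest.score(X_train, y_train)
test_r2 = forest.score(X_test, y_test)
print(f"train R^2 = {train_r2:.3f}, test R^2 = {test_r2:.3f}")
```

A large gap between the two scores is a sign of overfitting, so predictions that look close to the actual values on training rows should be read with caution.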

In [66]:
trump['Income_ln'] = np.log(trump['Income'])
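# Forward-fill missing values; assigning the result avoids the deprecated fillna(method=..., inplace=True) pattern
trump['Income_ln'] = trump['Income_ln'].ffill()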

In [67]:
from sklearn.ensemble import RandomForestRegressor as rf
from sklearn.model_selection import train_test_split

x_train, x_test, y_train, y_test = train_test_split(trump[['Income_ln']].values, trump['Trump Share'].values)

skl_rf = rf(n_estimators=100).fit(x_train, y_train)
skl_rf.score(x_test, y_test)

-0.3293751134700227

In [68]:
trump['pred_skl_rf'] = skl_rf.predict(trump[['Income_ln']].values)

In [69]:
trump['pred_skl_rf'].head()

0    0.722969
1    0.742688
2    0.643170
3    0.694378
4    0.853808
Name: pred_skl_rf, dtype: float64

In [70]:
fig = go.Figure()

fig.add_trace(go.Scatter(x=trump['Income_ln'], y=trump['Trump Share'],
                    mode='markers',
                    name='Trump Share'))
fig.add_trace(go.Scatter(x=trump['Income_ln'], y=trump['pred_skl_rf'],
                    mode='markers',
                    name='RF Predicted'))

fig.update_layout(template='none', title='Income and Trump Share Predicted by Random Forest',
                  xaxis_title='Income (ln)', yaxis_title='Trump Share')
fig.update_xaxes(showgrid=False)
fig.update_yaxes(showgrid=False)
fig.show()

### c. K-Nearest Neighbor (KNN)

![image.png](attachment:image.png)

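K-Nearest Neighbor (KNN) is a simple supervised machine learning technique that predicts each case from the k training cases most similar to it. For classification it takes a vote among those neighbors; for regression, as used here, it averages the neighbors' target values.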
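
The sketch below shows KNN regression by hand on a single feature. The helper `knn_predict` and the toy numbers are hypothetical and only for illustration; it mirrors the uniform averaging that scikit-learn's `KNeighborsRegressor` applies by default, but it is not the library's implementation.

```python
import numpy as np

def knn_predict(x_train, y_train, x_query, k):
    """Predict each query value as the mean target of its k nearest training points."""
    x_train = np.asarray(x_train, dtype=float)
    y_train = np.asarray(y_train, dtype=float)
    x_query = np.asarray(x_query, dtype=float)
    # Pairwise absolute distances, shape (n_query, n_train)
    dist = np.abs(x_query[:, None] - x_train[None, :])
    # Indices of the k smallest distances in each row
    nearest = np.argsort(dist, axis=1)[:, :k]
    # Average the targets of those neighbors
    return y_train[nearest].mean(axis=1)

# Toy data (hypothetical): three nearby points and one far away
x_train = [1.0, 2.0, 3.0, 10.0]
y_train = [0.1, 0.2, 0.3, 0.9]
print(knn_predict(x_train, y_train, [2.1], k=3))
```

With `k=3`, the query at 2.1 is predicted from the three points at 1.0, 2.0, and 3.0, and the distant point at 10.0 is ignored. A larger `k` gives smoother predictions.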

In [71]:
from sklearn.neighbors import KNeighborsRegressor as knn
skl_knn = knn(n_neighbors=100).fit(trump[['Income_ln']].values, trump['Trump Share'].values)
trump['pred_skl_knn'] = skl_knn.predict(trump[['Income_ln']].values)

In [72]:
trump['pred_skl_knn'].head()

0    0.648736
1    0.611395
2    0.650770
3    0.694707
4    0.656258
Name: pred_skl_knn, dtype: float64

In [73]:
fig = go.Figure()

fig.add_trace(go.Scatter(x=trump['Income_ln'], y=trump['Trump Share'],
                    mode='markers',
                    name='Trump Share'))
fig.add_trace(go.Scatter(x=trump['Income_ln'], y=trump['pred_skl_knn'],
                    mode='markers',
                    name='KNN Predicted'))

fig.update_layout(template='none', title='Income and Trump Share Predicted by K-Nearest Neighbor',
                  xaxis_title='Income (ln)', yaxis_title='Trump Share')
fig.update_xaxes(showgrid=False)
fig.update_yaxes(showgrid=False)
fig.show()

## 4. Conclusion

These are a few very simple ML models that I learned in the NYU Data Bootcamp class. They are not the most accurate models (and may be quite wrong). The random forest predictions look very close to the actual Trump vote share in the plot, but most of the plotted points were also used for training, and the held-out R² score of about -0.33 suggests the model does not generalize well. Also, locations with a higher Trump vote share tend to have lower income than locations with a lower Trump vote share, although the relationship is weak (the OLS R² is 0.019). Even though I am not a very political person, I found this vote analysis quite interesting!