# Stats and Politics Modelling

### Rachel Kessler
#### 10/27/2019

The purpose of this analysis is to fit linear and logistic regression models to the Republican vote shares in the 2008, 2012, and 2016 US elections.

To begin, the data is imported and joined for further use.

In [171]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor
%matplotlib inline

In [199]:
df_votes = pd.read_csv('projectdata/votes_clean.csv')
df_data = pd.read_csv('projectdata/data_clean.csv')
df = pd.merge(df_votes, df_data, on='Fips', how='inner')
df.head()

Unnamed: 0,Democrats 08 (Votes),Democrats 12 (Votes),Republicans 08 (Votes),Republicans 12 (Votes),votes16_trumpd,votes16_clintonh,Fips,Democrats 08 pct,Democrats 12 pct,Democrats 16 pct,...,Children.in.single.parent.households,Adult.smoking,Adult.obesity,Diabetes,Sexually.transmitted.infections,HIV.prevalence.rate,Uninsured,Unemployment,Violent.crime,Injury.deaths
0,2598.0,2630.0,3860.0,3887.0,3967.0,2364.0,5043,40.229173,40.355992,37.340073,...,0.429,0.181,0.323,0.126,747.3,79.9,0.197,0.108,449.02,82.0
1,2144.0,2099.0,3972.0,4263.0,4917.0,1587.0,5087,35.055592,32.99277,24.400369,...,0.179,0.304,0.328,0.135,247.2,131.3,0.239,0.053,245.83,96.8
2,1935.0,1845.0,3916.0,4136.0,4353.0,1544.0,13159,33.07127,30.847684,26.182805,...,0.381,0.21,0.298,0.118,324.1,402.4,0.239,0.096,205.6,71.6
3,13191.0,12792.0,8181.0,9411.0,8153.0,12652.0,8037,61.720943,57.613836,60.812305,...,0.204,0.095,0.132,0.036,190.9,133.7,0.23,0.081,123.88,42.9
4,2595.0,2442.0,5543.0,5214.0,5021.0,1836.0,13091,31.887442,31.896552,26.775558,...,0.453,0.189,0.358,0.153,497.9,315.8,0.208,0.115,477.48,79.4
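
An inner join on `Fips` silently drops any county that appears in only one file. As a minimal sketch of how the join could be audited, the toy frames below stand in for the two CSV files (their values are made up); `validate` and `indicator` are standard `pd.merge` options.

```python
import pandas as pd

# Toy stand-ins for the votes and demographics tables (values are made up).
votes = pd.DataFrame({"Fips": [5043, 5087, 13159],
                      "Republicans 08 pct": [59.8, 64.9, 66.9]})
data = pd.DataFrame({"Fips": [5043, 5087, 8037],
                     "median_age": [36.1, 41.0, 35.5]})

# validate= raises a MergeError if Fips is duplicated on either side;
# indicator= adds a _merge column recording where each row was found.
audit = pd.merge(votes, data, on="Fips", how="outer",
                 validate="one_to_one", indicator=True)

# Rows that an inner join would have dropped.
dropped = audit[audit["_merge"] != "both"]
# Rows that an inner join would have kept.
merged = audit[audit["_merge"] == "both"].drop(columns="_merge")

print(dropped[["Fips", "_merge"]])
```

Running the same check on the real files would show how many counties are lost before any model is fit.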


In [182]:
df.columns

Index(['Democrats 08 (Votes)', 'Democrats 12 (Votes)',
       'Republicans 08 (Votes)', 'Republicans 12 (Votes)', 'votes16_trumpd',
       'votes16_clintonh', 'Fips', 'Democrats 08 pct', 'Democrats 12 pct',
       'Democrats 16 pct', 'Republicans 08 pct', 'Republicans 12 pct',
       'Republicans 16 pct', 'State', 'ST', 'County', 'Precincts', 'Votes',
       'Less Than High School Diploma', 'At Least High School Diploma',
       'At Least Bachelors's Degree', 'Graduate Degree', 'School Enrollment',
       'Median Earnings 2010', 'White (Not Latino) Population',
       'African American Population', 'Native American Population',
       'Asian American Population', 'Other Race or Races', 'Latino Population',
       'Children Under 6 Living in Poverty',
       'Adults 65 and Older Living in Poverty', 'Total Population',
       'Preschool.Enrollment.Ratio.enrolled.ages.3.and.4',
       'Poverty.Rate.below.federal.poverty.threshold', 'Gini.Coefficient',
       'Child.Poverty.living.in.famil

# 2008 Republican Vote Share - Linear Regression
The first step was to develop a baseline model, using all available demographic features in the dataset, against which other models could be compared.

### Baseline Linear Regression Model

In [185]:
X_all_cols = ['Less Than High School Diploma', 'At Least High School Diploma',
       "At Least Bachelors's Degree", 'Graduate Degree', 'School Enrollment',
       'Median Earnings 2010', 'White (Not Latino) Population',
       'African American Population', 'Native American Population',
       'Asian American Population', 'Other Race or Races', 'Latino Population',
       'Children Under 6 Living in Poverty',
       'Adults 65 and Older Living in Poverty', 'Total Population',
       'Preschool.Enrollment.Ratio.enrolled.ages.3.and.4',
       'Poverty.Rate.below.federal.poverty.threshold', 'Gini.Coefficient',
       'Child.Poverty.living.in.families.below.the.poverty.line',
       'Management.professional.and.related.occupations',
       'Service.occupations', 'Sales.and.office.occupations',
       'Farming.fishing.and.forestry.occupations',
       'Construction.extraction.maintenance.and.repair.occupations',
       'Production.transportation.and.material.moving.occupations',
       'White_Asian', 'SIRE_homogeneity', 'median_age', 'Low.birthweight',
       'Teen.births', 'Children.in.single.parent.households', 'Adult.smoking',
       'Adult.obesity', 'Diabetes', 'Sexually.transmitted.infections',
       'HIV.prevalence.rate', 'Uninsured', 'Unemployment', 'Violent.crime',
       'Injury.deaths']

Outputs and inputs are initialized for the base model:

In [186]:
X = sm.add_constant(df.loc[:,X_all_cols])
y_2008 = df['Republicans 08 pct']

The statsmodels OLS regression is then fit:

In [187]:
lr_2008 = sm.OLS(y_2008,X)
lr_2008_results = lr_2008.fit()
lr_2008_results.summary()

0,1,2,3
Dep. Variable:,Republicans 08 pct,R-squared:,0.677
Model:,OLS,Adj. R-squared:,0.673
Method:,Least Squares,F-statistic:,164.7
Date:,"Thu, 24 Oct 2019",Prob (F-statistic):,0.0
Time:,15:36:54,Log-Likelihood:,-10869.0
No. Observations:,3109,AIC:,21820.0
Df Residuals:,3069,BIC:,22060.0
Df Model:,39,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
const,173.0897,410.551,0.422,0.673,-631.893,978.072
Less Than High School Diploma,-0.0004,0.091,-0.004,0.997,-0.180,0.179
At Least High School Diploma,0.0301,0.085,0.354,0.723,-0.137,0.197
At Least Bachelors's Degree,-0.2331,0.068,-3.445,0.001,-0.366,-0.100
Graduate Degree,-0.7490,0.115,-6.508,0.000,-0.975,-0.523
School Enrollment,0.0953,0.037,2.582,0.010,0.023,0.168
Median Earnings 2010,-4.164e-05,5.71e-05,-0.729,0.466,-0.000,7.03e-05
White (Not Latino) Population,0.2167,0.992,0.218,0.827,-1.729,2.163
African American Population,-0.7250,2.974,-0.244,0.807,-6.556,5.106

0,1,2,3
Omnibus:,18.56,Durbin-Watson:,1.831
Prob(Omnibus):,0.0,Jarque-Bera (JB):,19.648
Skew:,-0.156,Prob(JB):,5.41e-05
Kurtosis:,3.232,Cond. No.,1.08e+16


An initial review of the baseline model shows an adjusted R-squared of 0.673, which is the primary metric used to compare models. About 50% of the variables have p-values above the conventional 0.05 level, and many standard errors are large relative to their coefficients. The variance inflation factors (VIFs) are examined next to identify multicollinearity.

In [188]:
pd.Series([variance_inflation_factor(X.values, i) 
               for i in range(X.shape[1])], 
              index=X.columns)

const                                                         8.120962e+06
Less Than High School Diploma                                 2.163465e+01
At Least High School Diploma                                  1.952314e+01
At Least Bachelors's Degree                                   1.652498e+01
Graduate Degree                                               9.482443e+00
School Enrollment                                             1.656250e+00
Median Earnings 2010                                          3.979842e+00
White (Not Latino) Population                                          inf
African American Population                                   8.876530e+04
Native American Population                                    1.594125e+04
Asian American Population                                              inf
Other Race or Races                                           9.529918e+02
Latino Population                                             7.213810e+04
Children Under 6 Living i

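The multicollinearity check returned infinite VIFs for some variables (for example, White (Not Latino) Population and Asian American Population). An infinite VIF means the variable is an exact linear combination of other inputs, so such variables need to be excluded from the selected model. Overall, the baseline model leaves clear room for improvement.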
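
To make the infinite values concrete, here is a minimal NumPy sketch of the VIF calculation, `1 / (1 - R_j^2)`, on synthetic data. The helper `vif` and the toy columns are illustrative assumptions, not the notebook's code. Unlike the statsmodels call above, which regresses on the other columns as passed (hence the constant in `X`), this helper adds its own intercept.

```python
import numpy as np

def vif(X):
    """VIF_j = 1 / (1 - R_j^2), regressing column j on the other columns plus an intercept."""
    X = np.asarray(X, dtype=float)
    n, k = X.shape
    out = np.empty(k)
    for j in range(k):
        target = X[:, j]
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        beta, *_ = np.linalg.lstsq(others, target, rcond=None)
        resid = target - others @ beta
        r2 = 1.0 - (resid @ resid) / ((target - target.mean()) ** 2).sum()
        # Treat a numerically perfect fit as exact collinearity.
        out[j] = np.inf if (1.0 - r2) < 1e-12 else 1.0 / (1.0 - r2)
    return out

rng = np.random.default_rng(0)
a = rng.uniform(10, 60, 200)
b = rng.uniform(5, 30, 200)
c = 100 - a - b                  # three shares that always sum to 100
noise = rng.normal(size=200)

vif_all = vif(np.column_stack([a, b, c, noise]))   # a, b, c are exactly collinear
vif_drop = vif(np.column_stack([a, b, noise]))     # one share removed
print(vif_all)
print(vif_drop)
```

Dropping one of the shares that sum to a constant is enough to bring the remaining VIFs back to ordinary values, which is the same remedy applied to the population-share columns here.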

## Key Model Selection Parameters (in order of significance):
* An adjusted R-Squared equivalent to or slightly lower than 0.673
* Fewer input variables (ideally 15-20)
* Statistically significant p-values (below 0.05)
* Standard errors which are significantly lower than coefficients
* VIFs lower than the baseline model
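
Since adjusted R-squared is the first criterion, a short sketch of how it penalizes extra inputs may help. The function name is an illustrative assumption; the formula is the standard one for a model with an intercept, and the inputs are taken from the baseline summary above.

```python
def adjusted_r_squared(r_squared, n_obs, n_predictors):
    """Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - p - 1), for a model with an intercept."""
    return 1.0 - (1.0 - r_squared) * (n_obs - 1) / (n_obs - n_predictors - 1)

# Baseline summary: R-squared 0.677, 3109 observations, Df Model 39.
baseline_adj = adjusted_r_squared(0.677, 3109, 39)
print(round(baseline_adj, 3))  # 0.673
```

With 3109 observations the penalty per added input is small, which is why the adjusted value keeps creeping upward as features are added.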

### Alternate Model 1: Backwards (Manual/EDA Based) Selection:

Alternate Model 1 was developed using knowledge of highly correlated columns gained from the prior exploratory data analysis. The model contains 19 input features plus a constant (any fewer would have resulted in large drops in the adjusted R-squared):

In [189]:
X_cols = [
        'Graduate Degree', 
       'White (Not Latino) Population', 'African American Population',
        'Latino Population',
        'Total Population',
       'Poverty.Rate.below.federal.poverty.threshold', 
       'Service.occupations', 'Sales.and.office.occupations',
       'Production.transportation.and.material.moving.occupations',
        'SIRE_homogeneity', 'median_age', 'Low.birthweight',
       'Teen.births', 'Children.in.single.parent.households', 'Adult.smoking',
       'Diabetes', 'Sexually.transmitted.infections',
       'Uninsured', 'Unemployment'
       ]

In [190]:
X = sm.add_constant(df.loc[:,X_cols])
lr_2008 = sm.OLS(y_2008,X)
lr_2008_results = lr_2008.fit()
lr_2008_results.summary()

0,1,2,3
Dep. Variable:,Republicans 08 pct,R-squared:,0.672
Model:,OLS,Adj. R-squared:,0.67
Method:,Least Squares,F-statistic:,332.7
Date:,"Thu, 24 Oct 2019",Prob (F-statistic):,0.0
Time:,15:48:09,Log-Likelihood:,-10892.0
No. Observations:,3109,AIC:,21820.0
Df Residuals:,3089,BIC:,21950.0
Df Model:,19,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
const,50.0087,3.452,14.486,0.000,43.240,56.778
Graduate Degree,-1.0470,0.056,-18.634,0.000,-1.157,-0.937
White (Not Latino) Population,0.7385,0.030,24.768,0.000,0.680,0.797
African American Population,0.1021,0.028,3.691,0.000,0.048,0.156
Latino Population,0.1917,0.028,6.897,0.000,0.137,0.246
Total Population,-2.854e-06,5.21e-07,-5.482,0.000,-3.87e-06,-1.83e-06
Poverty.Rate.below.federal.poverty.threshold,-0.2337,0.041,-5.642,0.000,-0.315,-0.152
Service.occupations,-0.6249,0.051,-12.173,0.000,-0.726,-0.524
Sales.and.office.occupations,-0.2190,0.050,-4.388,0.000,-0.317,-0.121

0,1,2,3
Omnibus:,22.229,Durbin-Watson:,1.83
Prob(Omnibus):,0.0,Jarque-Bera (JB):,23.592
Skew:,-0.175,Prob(JB):,7.53e-06
Kurtosis:,3.246,Cond. No.,27000000.0


In [191]:
pd.Series([variance_inflation_factor(X.values, i) 
               for i in range(X.shape[1])], 
              index=X.columns)

const                                                        569.357651
Graduate Degree                                                2.241269
White (Not Latino) Population                                 15.807848
African American Population                                    7.620395
Latino Population                                              6.246043
Total Population                                               1.264644
Poverty.Rate.below.federal.poverty.threshold                   3.295278
Service.occupations                                            1.536813
Sales.and.office.occupations                                   1.429364
Production.transportation.and.material.moving.occupations      2.354563
SIRE_homogeneity                                               5.210829
median_age                                                     1.977866
Low.birthweight                                                2.757939
Teen.births                                                    3

### Alternate Model 2: Forwards (Automated) Selection
#### This Model was Selected as Best Performing (Explanation Below)
Alternate Model 2 was developed by looping over all candidate columns and adding features one at a time, each time choosing the column that gives the highest adjusted R-squared:

In [192]:
X_all_cols = ['Less Than High School Diploma', 'At Least High School Diploma',
       "At Least Bachelors's Degree", 'Graduate Degree', 'School Enrollment',
       'Median Earnings 2010', 'White (Not Latino) Population',
       'African American Population', 'Native American Population',
       'Asian American Population', 'Other Race or Races', 'Latino Population',
       'Children Under 6 Living in Poverty',
       'Adults 65 and Older Living in Poverty', 'Total Population',
       'Preschool.Enrollment.Ratio.enrolled.ages.3.and.4',
       'Poverty.Rate.below.federal.poverty.threshold', 'Gini.Coefficient',
       'Child.Poverty.living.in.families.below.the.poverty.line',
       'Management.professional.and.related.occupations',
       'Service.occupations', 'Sales.and.office.occupations',
       'Farming.fishing.and.forestry.occupations',
       'Construction.extraction.maintenance.and.repair.occupations',
       'Production.transportation.and.material.moving.occupations',
       'White_Asian', 'SIRE_homogeneity', 'median_age', 'Low.birthweight',
       'Teen.births', 'Children.in.single.parent.households', 'Adult.smoking',
       'Adult.obesity', 'Diabetes', 'Sexually.transmitted.infections',
       'HIV.prevalence.rate', 'Uninsured', 'Unemployment', 'Violent.crime',
       'Injury.deaths']

In [193]:
y_2008 = df['Republicans 08 pct']
X_current_cols = []
r_sq = 0

The for loop tries each candidate feature in turn and records the one that gives the highest adjusted R-squared; that feature is then added to the list of input features. The while loop repeats this until the list reaches the chosen length of 30. At each step it prints the adjusted R-squared (labelled "R-squared" in the printout) so that an optimal number of features can be selected.

In [194]:
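while len(X_current_cols)<30:
    best_col = None
    # Try each unused column and keep the one with the highest adjusted R-squared
    for i in range(len(X_all_cols)):
        if X_all_cols[i] in X_current_cols:
            continue
        X_cols = X_current_cols + [X_all_cols[i]]
        X = sm.add_constant(df.loc[:,X_cols])
        lr_2008 = sm.OLS(y_2008,X)
        lr_2008_results = lr_2008.fit()
        if lr_2008_results.rsquared_adj > r_sq:
            r_sq = lr_2008_results.rsquared_adj
            best_col = [X_all_cols[i]]
    # Stop if no remaining column improves the adjusted R-squared
    if best_col is None:
        break
    X_current_cols += best_col
    # The printed value is the adjusted R-squared of the best model so far
    print(str(len(X_current_cols))+' R-squared: '+str(r_sq))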

1 R-squared: 0.1364488321948507
2 R-squared: 0.3072746189243768
3 R-squared: 0.4056782063373514
4 R-squared: 0.4932887664590744
5 R-squared: 0.5511554226465762
6 R-squared: 0.5788571625191299
7 R-squared: 0.6010600299073643
8 R-squared: 0.6163280613625386
9 R-squared: 0.6276954342456587
10 R-squared: 0.6382279277925292
11 R-squared: 0.647823459499283
12 R-squared: 0.6527056154328533
13 R-squared: 0.6568153212592872
14 R-squared: 0.6597686775051594
15 R-squared: 0.6620235492431579
16 R-squared: 0.664327419062045
17 R-squared: 0.6660168723014877
18 R-squared: 0.6671626937478206
19 R-squared: 0.668153751229476
20 R-squared: 0.6693632271064307
21 R-squared: 0.6704332307105534
22 R-squared: 0.6711142996506869
23 R-squared: 0.6716908645921109
24 R-squared: 0.6721226684244954
25 R-squared: 0.672398233533061
26 R-squared: 0.6725175871709103
27 R-squared: 0.6726511580658392
28 R-squared: 0.6728014754208331
29 R-squared: 0.6729182725947163
30 R-squared: 0.673022860723558


According to the printout above, the model reaches an adjusted R-squared of approximately 0.67 (0.669) at 20 features. This size is selected for comparison with Alternate Model 1. The summary and VIFs are calculated below:

In [195]:
X_current_cols = X_current_cols[0:20]
X_current_cols

['Graduate Degree',
 'Children.in.single.parent.households',
 'Uninsured',
 'White (Not Latino) Population',
 'SIRE_homogeneity',
 'Unemployment',
 'Teen.births',
 'Service.occupations',
 'Diabetes',
 'Production.transportation.and.material.moving.occupations',
 'median_age',
 'Native American Population',
 'Total Population',
 'Sexually.transmitted.infections',
 'Poverty.Rate.below.federal.poverty.threshold',
 'Other Race or Races',
 'Sales.and.office.occupations',
 'School Enrollment',
 'Low.birthweight',
 'Adult.smoking']
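
The selection loop above can also be written as a reusable function. The following is a minimal sketch on synthetic data, using plain NumPy least squares instead of statsmodels; `adj_r2`, `forward_select`, and the demo arrays are hypothetical names introduced only for illustration.

```python
import numpy as np

def adj_r2(X, y):
    """Adjusted R^2 of an OLS fit of y on the columns of X plus an intercept."""
    n, p = X.shape
    A = np.column_stack([np.ones(n), X])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    r2 = 1.0 - (resid @ resid) / ((y - y.mean()) ** 2).sum()
    return 1.0 - (1.0 - r2) * (n - 1) / (n - p - 1)

def forward_select(X, y, names, max_features):
    """Greedy forward selection: each round adds the column that maximizes adjusted R^2."""
    selected, history = [], []
    while len(selected) < max_features:
        remaining = [j for j in range(X.shape[1]) if j not in selected]
        if not remaining:
            break
        scores = {j: adj_r2(X[:, selected + [j]], y) for j in remaining}
        best = max(scores, key=scores.get)
        # Stop once no remaining column improves on the previous round.
        if history and scores[best] <= history[-1]:
            break
        selected.append(best)
        history.append(scores[best])
    return [names[j] for j in selected], history

# Synthetic demo: only x0 and x2 actually drive the response.
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(300, 4))
y_demo = 3.0 * X_demo[:, 0] - 2.0 * X_demo[:, 2] + rng.normal(size=300)
chosen, history = forward_select(X_demo, y_demo, ["x0", "x1", "x2", "x3"], max_features=3)
print(chosen, [round(h, 3) for h in history])
```

Returning the score history alongside the chosen names keeps the "how many features is enough" decision separate from the search itself.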

In [196]:
X = sm.add_constant(df.loc[:,X_current_cols])
lr_2008 = sm.OLS(y_2008,X)
lr_2008_results = lr_2008.fit()
lr_2008_results.summary()

0,1,2,3
Dep. Variable:,Republicans 08 pct,R-squared:,0.671
Model:,OLS,Adj. R-squared:,0.669
Method:,Least Squares,F-statistic:,315.6
Date:,"Thu, 24 Oct 2019",Prob (F-statistic):,0.0
Time:,16:11:33,Log-Likelihood:,-10894.0
No. Observations:,3109,AIC:,21830.0
Df Residuals:,3088,BIC:,21960.0
Df Model:,20,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
const,57.7278,4.107,14.056,0.000,49.675,65.781
Graduate Degree,-1.0811,0.057,-18.897,0.000,-1.193,-0.969
Children.in.single.parent.households,-27.6746,2.564,-10.793,0.000,-32.702,-22.647
Uninsured,92.8037,4.212,22.031,0.000,84.544,101.063
White (Not Latino) Population,0.5797,0.019,31.247,0.000,0.543,0.616
SIRE_homogeneity,-29.7381,1.790,-16.616,0.000,-33.247,-26.229
Unemployment,-77.5380,6.680,-11.608,0.000,-90.636,-64.440
Teen.births,0.1786,0.014,12.964,0.000,0.152,0.206
Service.occupations,-0.5852,0.052,-11.285,0.000,-0.687,-0.484

0,1,2,3
Omnibus:,21.237,Durbin-Watson:,1.821
Prob(Omnibus):,0.0,Jarque-Bera (JB):,22.736
Skew:,-0.166,Prob(JB):,1.16e-05
Kurtosis:,3.255,Cond. No.,27000000.0


In [197]:
pd.Series([variance_inflation_factor(X.values, i) 
               for i in range(X.shape[1])], 
              index=X.columns)

const                                                        804.869998
Graduate Degree                                                2.321124
Children.in.single.parent.households                           3.317721
Uninsured                                                      2.473250
White (Not Latino) Population                                  6.112236
SIRE_homogeneity                                               5.192051
Unemployment                                                   1.610587
Teen.births                                                    3.495551
Service.occupations                                            1.566401
Diabetes                                                       2.463430
Production.transportation.and.material.moving.occupations      2.377564
median_age                                                     1.987302
Native American Population                                     1.464609
Total Population                                               1

### Comparison of Linear Regression Models:
Models were compared on the basis of the key model selection parameters listed above:
* An adjusted R-Squared equivalent to or slightly lower than 0.673:
    * **Alt 1: 0.670**
    * **Alt 2: 0.669**
* Fewer input variables (ideally 15-20)
    * **Alt 1 has 19 inputs; Alt 2 has 20**
* VIFs lower than the baseline model
    * **Alternate 1 has more than double the VIF for White (Not Latino) Population than Alternate 2 (15.8 vs. 6.1)**
* Statistically significant p-values (below 0.05)
    * **Both have p-values below 0.05 (almost all close to 0.000)**
* Standard errors which are significantly lower than coefficients
    * **Both have reasonable standard errors**

VIF was the deciding factor in the model selection (as the other factors were nearly the same), so **ALTERNATE MODEL 2** was selected.

### Model Interpretation

The selected model had the following inputs as independent variables:
 
Graduate Degree, Children.in.single.parent.households, Uninsured, White (Not Latino) Population, SIRE_homogeneity, Unemployment, Teen.births, Service.occupations, Diabetes, Production.transportation.and.material.moving.occupations, median_age, Native American Population, Total Population, Sexually.transmitted.infections, Poverty.Rate.below.federal.poverty.threshold, Other Race or Races, Sales.and.office.occupations, School Enrollment, Low.birthweight, Adult.smoking

Many of these variables were correlated with the Republican vote share in the exploratory data analysis. 

Because the units vary, coefficients cannot be compared directly; however, the sign of each coefficient indicates the direction in which the feature is associated with the dependent variable (e.g., as diabetes increases, Republican vote share increases, and as graduate degrees increase, Republican vote share decreases).

The units of some coefficients can be inferred, such as median age (in years). Such a coefficient can be interpreted directly: with every one-year increase in median age, the Republican vote share is expected to decrease by 0.54 percentage points, holding the other inputs constant.
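
One common way around the unit problem is to z-score each input before fitting, so that every coefficient reads as "change in vote share per one standard deviation". The sketch below shows this on synthetic data; the helper `standardized_coefs`, the column names, and the true coefficients (90 and 0.6) are made up for illustration and are not results from this analysis.

```python
import numpy as np

def standardized_coefs(X, y):
    """OLS slopes after z-scoring each predictor (change in y per one-SD increase)."""
    X = np.asarray(X, dtype=float)
    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
    A = np.column_stack([np.ones(len(Z)), Z])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    return beta[1:]  # drop the intercept

rng = np.random.default_rng(1)
uninsured = rng.uniform(0.05, 0.35, 500)     # a fraction between 0 and 1
white_pct = rng.uniform(40.0, 100.0, 500)    # a percentage between 0 and 100
share = 20 + 90 * uninsured + 0.6 * white_pct + rng.normal(0, 2, 500)

X_demo = np.column_stack([uninsured, white_pct])
coefs = standardized_coefs(X_demo, share)
print(coefs)
```

In this toy setup the raw coefficient on the fraction (90) dwarfs the one on the percentage (0.6), yet after standardizing, the percentage column has the larger effect per standard deviation.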

# 2008 Republican Vote Share - Logistic Regression
The next step was to develop a baseline logistic regression model, using all available demographic features in the dataset, against which other models could be compared.

### Baseline Logistic Regression Model

In [202]:
y_2008 = df['Republicans 08 pct']/100
y_2008_binary = np.where(y_2008>0.5,1,0)
total = len(y_2008)

In [203]:
X_all_cols = ['Less Than High School Diploma', 'At Least High School Diploma',
       "At Least Bachelors's Degree", 'Graduate Degree', 'School Enrollment',
       'Median Earnings 2010', 'White (Not Latino) Population',
       'African American Population', 'Native American Population',
       'Asian American Population', 'Other Race or Races', 'Latino Population',
       'Children Under 6 Living in Poverty',
       'Adults 65 and Older Living in Poverty', 'Total Population',
       'Preschool.Enrollment.Ratio.enrolled.ages.3.and.4',
       'Poverty.Rate.below.federal.poverty.threshold', 'Gini.Coefficient',
       'Child.Poverty.living.in.families.below.the.poverty.line',
       'Management.professional.and.related.occupations',
       'Service.occupations', 'Sales.and.office.occupations',
       'Farming.fishing.and.forestry.occupations',
       'Construction.extraction.maintenance.and.repair.occupations',
       'Production.transportation.and.material.moving.occupations',
       'White_Asian', 'SIRE_homogeneity', 'median_age', 'Low.birthweight',
       'Teen.births', 'Children.in.single.parent.households', 'Adult.smoking',
       'Adult.obesity', 'Diabetes', 'Sexually.transmitted.infections',
       'HIV.prevalence.rate', 'Uninsured', 'Unemployment', 'Violent.crime',
       'Injury.deaths']

In [204]:
X = sm.add_constant(df.loc[:,X_all_cols])
logr_2008 = sm.Logit(y_2008,X)
logr_2008_results = logr_2008.fit()
logr_2008_results.summary()

         Current function value: 0.619129
         Iterations: 35



0,1,2,3
Dep. Variable:,Republicans 08 pct,No. Observations:,3109.0
Model:,Logit,Df Residuals:,3069.0
Method:,MLE,Df Model:,39.0
Date:,"Thu, 24 Oct 2019",Pseudo R-squ.:,-0.1008
Time:,16:40:13,Log-Likelihood:,-1924.9
converged:,False,LL-Null:,-1748.6
Covariance Type:,nonrobust,LLR p-value:,1.0

0,1,2,3,4,5,6
,coef,std err,z,P>|z|,[0.025,0.975]
const,4.5002,106.225,0.042,0.966,-203.697,212.697
Less Than High School Diploma,0.0005,0.027,0.019,0.985,-0.052,0.053
At Least High School Diploma,0.0016,0.025,0.062,0.950,-0.048,0.051
At Least Bachelors's Degree,-0.0108,0.018,-0.613,0.540,-0.046,0.024
Graduate Degree,-0.0315,0.030,-1.045,0.296,-0.091,0.028
School Enrollment,0.0045,0.010,0.461,0.645,-0.015,0.024
Median Earnings 2010,-1.795e-06,1.49e-05,-0.120,0.904,-3.11e-05,2.75e-05
White (Not Latino) Population,0.0133,,,,,
African American Population,-0.0220,0.769,-0.029,0.977,-1.529,1.485


In [205]:
model_predictions_prob = logr_2008_results.predict(X)
model_predictions_binary = np.where(model_predictions_prob>0.5,1,0)
correct = (model_predictions_binary == y_2008_binary).sum()
correct/total

0.8404631714377614

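An initial review of the baseline model shows an accuracy of 84%, which is the primary metric used to compare models. Note that the model is fit to the vote-share fraction `y_2008`, while accuracy is measured against the binarized outcome `y_2008_binary` (1 where the Republican share exceeds 50%), and that the baseline fit did not converge (`converged: False`). About 75% of the variables have p-values above the conventional 0.05 level, and many standard errors are large relative to their coefficients.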
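
Accuracy is easier to judge next to the score obtained by always predicting the more common class. The sketch below uses a small made-up example; `accuracy` and `majority_baseline` are hypothetical helpers, and the class balance of the real data is not reported in this notebook, so no baseline figure is claimed for it.

```python
import numpy as np

def accuracy(prob, y_true, threshold=0.5):
    """Share of observations whose thresholded probability matches the 0/1 label."""
    pred = (np.asarray(prob) > threshold).astype(int)
    return float((pred == np.asarray(y_true)).mean())

def majority_baseline(y_true):
    """Accuracy of always predicting the more common class."""
    share = float(np.mean(y_true))
    return max(share, 1.0 - share)

# Made-up labels and predicted probabilities.
y_toy = np.array([1, 1, 1, 0, 1, 0, 1, 1])
p_toy = np.array([0.9, 0.7, 0.4, 0.2, 0.8, 0.6, 0.55, 0.95])

print(accuracy(p_toy, y_toy))        # 0.75
print(majority_baseline(y_toy))      # 0.75
```

In this toy case the model is no better than the trivial rule, which is the kind of situation the comparison is meant to expose.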

## Key Model Selection Parameters (in order of significance):
* An accuracy equivalent to or slightly lower than 0.84
* Fewer input variables (ideally 5-20)
* Lower VIFs
* Statistically significant p-values (below 0.05)
* Standard errors which are significantly lower than coefficients

### Alternate Model 1: Manual Selection

Alternate Model 1 was developed by utilizing the variables from the previous linear regression model. The model contains 20 input features.

In [207]:
X_cols = ['Graduate Degree',
 'Children.in.single.parent.households',
 'Uninsured',
 'White (Not Latino) Population',
 'SIRE_homogeneity',
 'Unemployment',
 'Teen.births',
 'Service.occupations',
 'Diabetes',
 'Production.transportation.and.material.moving.occupations',
 'median_age',
 'Native American Population',
 'Total Population',
 'Sexually.transmitted.infections',
 'Poverty.Rate.below.federal.poverty.threshold',
 'Other Race or Races',
 'Sales.and.office.occupations',
 'School Enrollment',
 'Low.birthweight',
 'Adult.smoking']

In [208]:
X = sm.add_constant(df.loc[:,X_cols])
logr_2008 = sm.Logit(y_2008,X)
logr_2008_results = logr_2008.fit()
logr_2008_results.summary()

Optimization terminated successfully.
         Current function value: 0.619532
         Iterations 5



0,1,2,3
Dep. Variable:,Republicans 08 pct,No. Observations:,3109.0
Model:,Logit,Df Residuals:,3088.0
Method:,MLE,Df Model:,20.0
Date:,"Thu, 24 Oct 2019",Pseudo R-squ.:,-0.1015
Time:,16:59:32,Log-Likelihood:,-1926.1
converged:,True,LL-Null:,-1748.6
Covariance Type:,nonrobust,LLR p-value:,1.0

0,1,2,3,4,5,6
,coef,std err,z,P>|z|,[0.025,0.975]
const,0.4007,1.084,0.370,0.712,-1.724,2.526
Graduate Degree,-0.0468,0.015,-3.120,0.002,-0.076,-0.017
Children.in.single.parent.households,-1.1952,0.671,-1.782,0.075,-2.510,0.119
Uninsured,4.0663,1.107,3.674,0.000,1.897,6.236
White (Not Latino) Population,0.0255,0.005,5.078,0.000,0.016,0.035
SIRE_homogeneity,-1.3520,0.484,-2.795,0.005,-2.300,-0.404
Unemployment,-3.2904,1.735,-1.897,0.058,-6.690,0.110
Teen.births,0.0078,0.004,2.164,0.030,0.001,0.015
Service.occupations,-0.0266,0.014,-1.944,0.052,-0.054,0.000


In [209]:
model_predictions_prob = logr_2008_results.predict(X)
model_predictions_binary = np.where(model_predictions_prob>0.5,1,0)
correct = (model_predictions_binary == y_2008_binary).sum()
correct/total

0.8462528144097781

### Alternate Model 2: Forwards (Automated) Selection
#### This Model was Selected as Best Performing (Explanation Below)
Alternate Model 2 was developed by looping over all candidate columns and adding features one at a time, each time choosing the column that gives the highest accuracy:

In [232]:
X_all_cols = ['Less Than High School Diploma', 'At Least High School Diploma',
       "At Least Bachelors's Degree", 'Graduate Degree', 'School Enrollment',
       'Median Earnings 2010', 'White (Not Latino) Population',
       'African American Population', 'Native American Population',
       'Asian American Population', 'Other Race or Races', 'Latino Population',
       'Children Under 6 Living in Poverty',
       'Adults 65 and Older Living in Poverty', 'Total Population',
       'Preschool.Enrollment.Ratio.enrolled.ages.3.and.4',
       'Poverty.Rate.below.federal.poverty.threshold', 'Gini.Coefficient',
       'Child.Poverty.living.in.families.below.the.poverty.line',
       'Management.professional.and.related.occupations',
       'Service.occupations', 'Sales.and.office.occupations',
       'Farming.fishing.and.forestry.occupations',
       'Construction.extraction.maintenance.and.repair.occupations',
       'Production.transportation.and.material.moving.occupations',
       'White_Asian', 'SIRE_homogeneity', 'median_age', 'Low.birthweight',
       'Teen.births', 'Children.in.single.parent.households', 'Adult.smoking',
       'Adult.obesity', 'Diabetes', 'Sexually.transmitted.infections',
       'HIV.prevalence.rate', 'Uninsured', 'Unemployment', 'Violent.crime',
       'Injury.deaths']

The for loop tries each unused feature in turn and records the one that gives the highest accuracy; that feature is then added to the list of input features. The while loop repeats this until the list reaches the chosen length of 10. At each step the accuracy is stored in a dictionary called `results` so that an optimal number of features can be selected.

In [211]:
X_current_cols = []
accuracy = 0
results = {}

while len(X_current_cols)<10:
    
    best_col = None
    
    for i in range(len(X_all_cols)):
        
        # Only consider columns that have not been selected yet
        if X_all_cols[i] not in X_current_cols:
            
            X_cols = X_current_cols + [X_all_cols[i]]
            X = sm.add_constant(df.loc[:,X_cols])
            logr_2008 = sm.Logit(y_2008,X)
            logr_2008_results = logr_2008.fit()
            
            # Accuracy of the thresholded predictions against the binary outcome
            model_predictions_prob = logr_2008_results.predict(X)
            model_predictions_binary = np.where(model_predictions_prob>0.5,1,0)
            correct = (model_predictions_binary == y_2008_binary).sum()
            
            if correct/total > accuracy:
                accuracy = correct/total
                best_col = [X_all_cols[i]]
                
    # Stop if no remaining column improves the accuracy
    if best_col is None:
        break
    X_current_cols += best_col
    results[len(X_current_cols)] = accuracy


In [212]:
results

{1: 0.7664844001286587,
 2: 0.7973624959794146,
 3: 0.8105500160823416,
 4: 0.8144097780636861,
 5: 0.8201994210357028,
 6: 0.826632357671277,
 7: 0.8324220006432936,
 8: 0.8378899967835317,
 9: 0.8411064651013187,
 10: 0.8427146992602123}

From the results we can see that the model reaches 83% accuracy with 7 features, so the first 7 features were selected as optimal.
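The selection rule described above can be sketched as picking the smallest number of features whose accuracy reaches a target. The `results` values are copied from the output above; the `target_accuracy` name and its 0.83 value are assumptions taken from the 83% figure in the text, not a cutoff defined elsewhere in the notebook.

```python
# Accuracy by number of features, copied from the `results` output above.
results = {
    1: 0.7664844001286587,
    2: 0.7973624959794146,
    3: 0.8105500160823416,
    4: 0.8144097780636861,
    5: 0.8201994210357028,
    6: 0.826632357671277,
    7: 0.8324220006432936,
    8: 0.8378899967835317,
    9: 0.8411064651013187,
    10: 0.8427146992602123,
}

# Assumed cutoff (hypothetical name): the 83% accuracy mentioned in the text.
target_accuracy = 0.83

# Smallest feature count whose accuracy meets the target.
n_features = min(k for k, acc in results.items() if acc >= target_accuracy)
print(n_features)  # 7
```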

In [216]:
X_current_cols = X_current_cols[0:7]
X_current_cols

['White (Not Latino) Population',
 'Graduate Degree',
 'Uninsured',
 'SIRE_homogeneity',
 'Children.in.single.parent.households',
 'Diabetes',
 'Management.professional.and.related.occupations']

In [217]:
X = sm.add_constant(df.loc[:,X_current_cols])
logr_2008 = sm.Logit(y_2008,X)
logr_2008_results = logr_2008.fit()
logr_2008_results.summary()

Optimization terminated successfully.
         Current function value: 0.626356
         Iterations 5




0,1,2,3
Dep. Variable:,Republicans 08 pct,No. Observations:,3109.0
Model:,Logit,Df Residuals:,3101.0
Method:,MLE,Df Model:,7.0
Date:,"Thu, 24 Oct 2019",Pseudo R-squ.:,-0.1137
Time:,17:05:04,Log-Likelihood:,-1947.3
converged:,True,LL-Null:,-1748.6
Covariance Type:,nonrobust,LLR p-value:,1.0

0,1,2,3,4,5,6
,coef,std err,z,P>|z|,[0.025,0.975]
const,-1.4442,0.577,-2.503,0.012,-2.575,-0.313
White (Not Latino) Population,0.0249,0.004,5.740,0.000,0.016,0.033
Graduate Degree,-0.0573,0.015,-3.876,0.000,-0.086,-0.028
Uninsured,5.1737,0.895,5.782,0.000,3.420,6.927
SIRE_homogeneity,-1.5875,0.427,-3.719,0.000,-2.424,-0.751
Children.in.single.parent.households,-1.9298,0.546,-3.534,0.000,-3.000,-0.860
Diabetes,4.2845,2.174,1.971,0.049,0.024,8.545
Management.professional.and.related.occupations,0.0179,0.009,1.983,0.047,0.000,0.036


In [218]:
model_predictions_prob = logr_2008_results.predict(X)
model_predictions_binary = np.where(model_predictions_prob>0.5,1,0)
correct = (model_predictions_binary == y_2008_binary).sum()

In [219]:
correct/total

0.8324220006432936

In [220]:
pd.Series([variance_inflation_factor(X.values, i) 
               for i in range(X.shape[1])], 
              index=X.columns)

const                                              232.646367
White (Not Latino) Population                        4.437361
Graduate Degree                                      2.240136
Uninsured                                            1.603675
SIRE_homogeneity                                     3.991506
Children.in.single.parent.households                 2.214237
Diabetes                                             1.688353
Management.professional.and.related.occupations      2.199935
dtype: float64

### Comparison of Logistic Regression Models:
Models were compared on the basis of the key model selection parameters listed above:
* An accuracy equivalent to or slightly lower than 0.84:
    * **Alt 1: 0.846**
    * **Alt 2: 0.832**
* Fewer input variables (ideally 15-20)
    * **Alt 1: 20 Inputs**
    * **Alt 2: 7 Inputs**
* VIFs lower than the baseline model
    * **Comparable VIFs**
* Statistically significant p-values (below 0.05)
    * **Alt 1: 50% of p-values higher than 0.05**
    * **Alt 2: p-values below 0.05**
* Standard errors which are significantly lower than coefficients
    * **Alt 1 has some std errors larger than coefficients**

**ALTERNATE MODEL 2** was selected as it was the only model that met the key parameters.

### Model Interpretation

The selected model had the following inputs as independent variables:
 
White (Not Latino) Population, Graduate Degree, Uninsured, SIRE_homogeneity, Children.in.single.parent.households, Diabetes, Management.professional.and.related.occupations

A couple of these variables were moderately correlated with the republican vote share in the exploratory data analysis. 


Because the units vary, coefficients cannot be compared directly; however, the sign of each coefficient indicates the direction in which the feature influences the dependent variable (e.g. as the share of graduate degrees increases, the republican vote share decreases).
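One common way to put coefficients on a comparable scale is to standardize each feature before fitting. This was not done in this analysis; the sketch below only illustrates the idea on a toy matrix whose values are made up (one column on a 0-100 percentage scale, one on a 0-1 proportion scale).

```python
import numpy as np

# Toy feature matrix (illustrative values only, not taken from the dataset):
# column 0 mimics a percentage (0-100), column 1 mimics a proportion (0-1).
X_toy = np.array([
    [85.0, 0.20],
    [60.0, 0.24],
    [92.0, 0.15],
    [45.0, 0.23],
])

# z-score each column: subtract the column mean, divide by the column
# sample standard deviation.
X_std = (X_toy - X_toy.mean(axis=0)) / X_toy.std(axis=0, ddof=1)

print(X_std.mean(axis=0))         # approximately 0 for each column
print(X_std.std(axis=0, ddof=1))  # approximately 1 for each column
```

Coefficients fitted on standardized inputs would then be read per one standard deviation change in a feature rather than per original unit.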

Units of some coefficients are inferred - such as White Population (in Percent). 

This coefficient can then be interpreted as: with every 1 percentage point increase in the white population, the odds of a majority republican vote share increase by a factor of e^0.0249, or about 1.025.
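The same conversion from a logistic regression coefficient to an odds ratio can be sketched for other features. The two coefficients below are copied from the 2008 summary table above; the dictionary names are illustrative only.

```python
import numpy as np

# Coefficients copied from the 2008 logistic regression summary above.
coefs = {
    'White (Not Latino) Population': 0.0249,
    'Graduate Degree': -0.0573,
}

# Exponentiating a logit coefficient gives the multiplicative change in the
# odds for a one-unit increase in that feature.
odds_ratios = {name: float(np.exp(b)) for name, b in coefs.items()}

for name, ratio in odds_ratios.items():
    print(f"{name}: {ratio:.3f}")
```

A ratio above 1 means the odds increase with the feature and a ratio below 1 means they decrease. With the fitted model in memory, `np.exp(logr_2008_results.params)` should give the same ratios for every feature at once.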

# 2012 and 2016 Linear Regression Fitting
Using the features selected for the 2008 linear regression model above, the 2012 and 2016 vote shares were also fitted with linear regression models.

In [222]:
X_cols = ['Graduate Degree',
 'Children.in.single.parent.households',
 'Uninsured',
 'White (Not Latino) Population',
 'SIRE_homogeneity',
 'Unemployment',
 'Teen.births',
 'Service.occupations',
 'Diabetes',
 'Production.transportation.and.material.moving.occupations',
 'median_age',
 'Native American Population',
 'Total Population',
 'Sexually.transmitted.infections',
 'Poverty.Rate.below.federal.poverty.threshold',
 'Other Race or Races',
 'Sales.and.office.occupations',
 'School Enrollment',
 'Low.birthweight',
 'Adult.smoking']

In [225]:
X = sm.add_constant(df.loc[:,X_cols])
y_2012 = df['Republicans 12 pct']
y_2016 = df['Republicans 16 pct']

2012 Linear Regression Model:

In [226]:
lr_2012 = sm.OLS(y_2012,X)
lr_2012_results = lr_2012.fit()
lr_2012_results.summary()

0,1,2,3
Dep. Variable:,Republicans 12 pct,R-squared:,0.712
Model:,OLS,Adj. R-squared:,0.71
Method:,Least Squares,F-statistic:,381.8
Date:,"Thu, 24 Oct 2019",Prob (F-statistic):,0.0
Time:,17:22:21,Log-Likelihood:,-10897.0
No. Observations:,3109,AIC:,21840.0
Df Residuals:,3088,BIC:,21960.0
Df Model:,20,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
const,54.5592,4.111,13.271,0.000,46.498,62.620
Graduate Degree,-1.1186,0.057,-19.531,0.000,-1.231,-1.006
Children.in.single.parent.households,-32.2536,2.567,-12.566,0.000,-37.286,-27.221
Uninsured,101.3141,4.217,24.027,0.000,93.046,109.582
White (Not Latino) Population,0.6521,0.019,35.117,0.000,0.616,0.689
SIRE_homogeneity,-26.8657,1.792,-14.996,0.000,-30.379,-23.353
Unemployment,-77.8181,6.687,-11.638,0.000,-90.929,-64.707
Teen.births,0.1861,0.014,13.496,0.000,0.159,0.213
Service.occupations,-0.5405,0.052,-10.413,0.000,-0.642,-0.439

0,1,2,3
Omnibus:,21.799,Durbin-Watson:,1.806
Prob(Omnibus):,0.0,Jarque-Bera (JB):,22.536
Skew:,-0.184,Prob(JB):,1.28e-05
Kurtosis:,3.196,Cond. No.,27000000.0


2016 Linear Regression Model:

In [227]:
lr_2016 = sm.OLS(y_2016,X)
lr_2016_results = lr_2016.fit()
lr_2016_results.summary()

0,1,2,3
Dep. Variable:,Republicans 16 pct,R-squared:,0.804
Model:,OLS,Adj. R-squared:,0.803
Method:,Least Squares,F-statistic:,633.0
Date:,"Thu, 24 Oct 2019",Prob (F-statistic):,0.0
Time:,17:22:27,Log-Likelihood:,-10516.0
No. Observations:,3109,AIC:,21070.0
Df Residuals:,3088,BIC:,21200.0
Df Model:,20,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
const,48.4522,3.637,13.323,0.000,41.321,55.583
Graduate Degree,-1.6387,0.051,-32.345,0.000,-1.738,-1.539
Children.in.single.parent.households,-27.2497,2.271,-12.002,0.000,-31.702,-22.798
Uninsured,75.1413,3.730,20.144,0.000,67.827,82.455
White (Not Latino) Population,0.6937,0.016,42.226,0.000,0.661,0.726
SIRE_homogeneity,-23.1867,1.585,-14.630,0.000,-26.294,-20.079
Unemployment,-83.8357,5.915,-14.173,0.000,-95.434,-72.238
Teen.births,0.1755,0.012,14.386,0.000,0.152,0.199
Service.occupations,-0.3641,0.046,-7.928,0.000,-0.454,-0.274

0,1,2,3
Omnibus:,118.878,Durbin-Watson:,1.804
Prob(Omnibus):,0.0,Jarque-Bera (JB):,167.258
Skew:,-0.38,Prob(JB):,4.79e-37
Kurtosis:,3.846,Cond. No.,27000000.0


## Performance Comparison:
Based on adjusted R-squared alone, the performance of the fitted models appears to improve from 2008 (0.69) to 2012 (0.71) to 2016 (0.80). However, some features, such as Native American population and Adult Smoking, appear to become less statistically significant in 2016. Overall, this seems to indicate that the selected features become more predictive of republican vote share as the years pass.
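As a check on the metric used for this comparison, adjusted R-squared can be recomputed from the R-squared, number of observations, and number of predictors reported in the summaries above. The function name is illustrative; the formula is the standard adjustment for the number of predictors.

```python
def adjusted_r2(r2, n_obs, n_predictors):
    """Adjusted R-squared: penalizes R-squared for the number of predictors."""
    return 1 - (1 - r2) * (n_obs - 1) / (n_obs - n_predictors - 1)

# Values copied from the 2012 and 2016 OLS summaries above
# (3109 observations, 20 predictors).
adj_2012 = adjusted_r2(0.712, 3109, 20)
adj_2016 = adjusted_r2(0.804, 3109, 20)

print(round(adj_2012, 3))  # 0.71
print(round(adj_2016, 3))  # 0.803
```

With 3109 counties and only 20 predictors the penalty is small, so the adjusted values stay close to the raw R-squared values.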

For future analysis, fitting a new model on the 2016 data may result in better performance than the 2008 model with even fewer features.

# 2012 and 2016 Logistic Regression Fitting
Using the features selected for the 2008 logistic regression model above, the 2012 and 2016 vote shares were also fitted with logistic regression models.

In [228]:
X_cols = ['White (Not Latino) Population',
 'Graduate Degree',
 'Uninsured',
 'SIRE_homogeneity',
 'Children.in.single.parent.households',
 'Diabetes',
 'Management.professional.and.related.occupations']

In [233]:
X = sm.add_constant(df.loc[:,X_cols])
y_2012 = df['Republicans 12 pct']/100
y_2016 = df['Republicans 16 pct']/100
y_2012_binary = np.where(y_2012>0.5,1,0)
y_2016_binary = np.where(y_2016>0.5,1,0)

total = len(df)  # number of counties (rows); the same for every election year



2012 Logistic Regression Model:

In [234]:
logr_2012 = sm.Logit(y_2012,X)
logr_2012_results = logr_2012.fit()
logr_2012_results.summary()

Optimization terminated successfully.
         Current function value: 0.595013
         Iterations 5


0,1,2,3
Dep. Variable:,Republicans 12 pct,No. Observations:,3109.0
Model:,Logit,Df Residuals:,3101.0
Method:,MLE,Df Model:,7.0
Date:,"Thu, 24 Oct 2019",Pseudo R-squ.:,-0.2042
Time:,17:30:57,Log-Likelihood:,-1849.9
converged:,True,LL-Null:,-1536.2
Covariance Type:,nonrobust,LLR p-value:,1.0

0,1,2,3,4,5,6
,coef,std err,z,P>|z|,[0.025,0.975]
const,-1.7186,0.591,-2.908,0.004,-2.877,-0.560
White (Not Latino) Population,0.0278,0.004,6.338,0.000,0.019,0.036
Graduate Degree,-0.0620,0.015,-4.106,0.000,-0.092,-0.032
Uninsured,5.9150,0.918,6.440,0.000,4.115,7.715
SIRE_homogeneity,-1.4387,0.432,-3.329,0.001,-2.286,-0.592
Children.in.single.parent.households,-2.1450,0.558,-3.846,0.000,-3.238,-1.052
Diabetes,4.0641,2.215,1.835,0.066,-0.277,8.405
Management.professional.and.related.occupations,0.0200,0.009,2.136,0.033,0.002,0.038


In [235]:
model_predictions_prob = logr_2012_results.predict(X)
model_predictions_binary = np.where(model_predictions_prob>0.5,1,0)
correct = (model_predictions_binary == y_2012_binary).sum()
correct/total

0.8774525570923126

2016 Logistic Regression Model:

In [236]:
logr_2016 = sm.Logit(y_2016,X)
logr_2016_results = logr_2016.fit()
logr_2016_results.summary()

Optimization terminated successfully.
         Current function value: 0.517328
         Iterations 5


0,1,2,3
Dep. Variable:,Republicans 16 pct,No. Observations:,3109.0
Model:,Logit,Df Residuals:,3101.0
Method:,MLE,Df Model:,7.0
Date:,"Thu, 24 Oct 2019",Pseudo R-squ.:,-0.3795
Time:,17:31:18,Log-Likelihood:,-1608.4
converged:,True,LL-Null:,-1165.9
Covariance Type:,nonrobust,LLR p-value:,1.0

0,1,2,3,4,5,6
,coef,std err,z,P>|z|,[0.025,0.975]
const,-1.7235,0.622,-2.773,0.006,-2.942,-0.505
White (Not Latino) Population,0.0306,0.004,7.007,0.000,0.022,0.039
Graduate Degree,-0.0947,0.016,-5.871,0.000,-0.126,-0.063
Uninsured,5.4123,0.963,5.622,0.000,3.525,7.299
SIRE_homogeneity,-1.0014,0.434,-2.306,0.021,-1.852,-0.150
Children.in.single.parent.households,-1.9363,0.583,-3.320,0.001,-3.079,-0.793
Diabetes,4.3947,2.319,1.895,0.058,-0.151,8.940
Management.professional.and.related.occupations,0.0185,0.010,1.848,0.065,-0.001,0.038


In [238]:
model_predictions_prob = logr_2016_results.predict(X)
model_predictions_binary = np.where(model_predictions_prob>0.5,1,0)
correct = (model_predictions_binary == y_2016_binary).sum()
correct/total

0.9266645223544548

## Performance Comparison:
Based on accuracy alone, the model performance appears to improve from 2008 (0.83) to 2012 (0.88) to 2016 (0.93). However, some features appear to become less statistically significant in later years: Diabetes has a p-value above 0.05 in both 2012 and 2016, and Management occupations has one above 0.05 in 2016. Overall, this seems to indicate that the selected features become more predictive of republican vote share as the years pass.
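The accuracy figures compared above are all computed the same way: predicted probabilities are thresholded at 0.5 and compared with the binary majority label. A minimal sketch of that calculation as a reusable helper follows; the function name and the toy inputs are illustrative and not part of the notebook.

```python
import numpy as np

def classification_accuracy(pred_prob, y_binary, threshold=0.5):
    """Share of observations whose thresholded prediction matches the label."""
    pred_binary = np.where(np.asarray(pred_prob) > threshold, 1, 0)
    return float((pred_binary == np.asarray(y_binary)).mean())

# Toy values for illustration only (not taken from the dataset).
toy_prob = [0.9, 0.4, 0.6, 0.2]
toy_label = [1, 0, 0, 0]

print(classification_accuracy(toy_prob, toy_label))  # 0.75
```

With the notebook's variables, this would be called as `classification_accuracy(logr_2016_results.predict(X), y_2016_binary)`, which avoids relying on a separately defined `total`.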

Again, for future analysis, fitting new models on the 2012 and 2016 data may result in better performance than the 2008 model with even fewer features.