<h1><center>Stats and Politics - Deliverable 3</center></h1>

<center>by Sebastian Langer</center>

---
In this deliverable I will create models to predict U.S. election results from county-level demographic data.


## Import packages and data

In [1]:
# Import all packages
import pandas as pd
import numpy as np
from statsmodels.regression.linear_model import OLS
import statsmodels.api as sm

# Import data
votes = pd.read_csv('clean_data/votes_clean.csv')
data = pd.read_csv('clean_data/data_clean.csv')

# Combine data and votes into one dataframe
df = data.merge(votes,on='Fips')

# Creating a linear model for the 2008 election

I will create a linear model that predicts the 2008 election results. **Each data point is one county's demographics and election results**, recorded as a row in my `df`. In this model the **independent variables are the demographic data**. The full list of independent variables is:

- 'Precincts' 
- 'Less Than High School Diploma'
- 'At Least High School Diploma'
- 'At Least Bachelors's Degree'
- 'Graduate Degree'
- 'School Enrollment'
- 'Median Earnings 2010'
- 'White (Not Latino) Population'
- 'African American Population'
- 'Native American Population'
- 'Asian American Population'
- 'Other Race or Races'
- 'Latino Population'
- 'Children Under 6 Living in Poverty'
- 'Adults 65 and Older Living in Poverty'
- 'Total Population'
- 'Preschool.Enrollment.Ratio.enrolled.ages.3.and.4'
- 'Poverty.Rate.below.federal.poverty.threshold'
- 'Gini.Coefficient'
- 'Child.Poverty.living.in.families.below.the.poverty.line'
- 'Management.professional.and.related.occupations'
- 'Service.occupations'
- 'Sales.and.office.occupations'
- 'Farming.fishing.and.forestry.occupations'
- 'Construction.extraction.maintenance.and.repair.occupations'
- 'Production.transportation.and.material.moving.occupations'
- 'SIRE_homogeneity'
- 'median_age'
- 'Low.birthweight'
- 'Teen.births'
- 'Children.in.single.parent.households'
- 'Adult.smoking'
- 'Adult.obesity'
- 'Diabetes'
- 'Sexually.transmitted.infections'
- 'HIV.prevalence.rate'
- 'Uninsured'
- 'Unemployment'
- 'Violent.crime'
- 'Homicide.rate'
- 'Injury.deaths'
- 'Infant.mortality'


Notice that I removed 'Votes', as it is the number of votes a county cast in one particular election. If I wanted to include votes I would recalculate it for every election from the Republican and Democratic vote counts.

For the 2008 model the **dependent variable is the Republican vote share**, which is the proportion of Republican votes in a county relative to Democratic votes. This means it must be between zero and one (i.e. 0 <= y <= 1).

In [2]:
# Creating independent variables
X = df.loc[:,'Precincts':'Infant.mortality']
X.drop('Votes', axis=1, inplace=True)
X = sm.add_constant(X)

# Selecting dependent variable
y_08 = df['Republicans08_Voteshare']

# Create and fit a linear model
OLS08_model = OLS(y_08,X).fit()
OLS08_model.summary()

0,1,2,3
Dep. Variable:,Republicans08_Voteshare,R-squared:,0.669
Model:,OLS,Adj. R-squared:,0.665
Method:,Least Squares,F-statistic:,149.1
Date:,"Wed, 01 May 2019",Prob (F-statistic):,0.0
Time:,16:49:44,Log-Likelihood:,3447.2
No. Observations:,3141,AIC:,-6808.0
Df Residuals:,3098,BIC:,-6548.0
Df Model:,42,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
const,1.1481,4.143,0.277,0.782,-6.976,9.272
Precincts,-1.317e-05,2.27e-05,-0.581,0.561,-5.76e-05,3.13e-05
Less Than High School Diploma,-0.0006,0.001,-0.635,0.526,-0.002,0.001
At Least High School Diploma,-0.0009,0.001,-1.059,0.289,-0.002,0.001
At Least Bachelors's Degree,-0.0027,0.001,-3.965,0.000,-0.004,-0.001
Graduate Degree,-0.0069,0.001,-6.021,0.000,-0.009,-0.005
School Enrollment,0.0005,0.000,1.303,0.193,-0.000,0.001
Median Earnings 2010,-9.677e-07,5.64e-07,-1.716,0.086,-2.07e-06,1.38e-07
White (Not Latino) Population,-0.0046,0.030,-0.154,0.877,-0.063,0.054

0,1,2,3
Omnibus:,14.758,Durbin-Watson:,1.823
Prob(Omnibus):,0.001,Jarque-Bera (JB):,16.155
Skew:,-0.121,Prob(JB):,0.00031
Kurtosis:,3.255,Cond. No.,931000000.0


This model summary contains R^2 and adjusted R^2. R^2 is the proportion of the variance in the dependent variable that is explained by the independent variables. This model shows that 67% of the variance in the Republican vote share in the 2008 election can be explained by county demographics. The adjusted R^2 also takes into account how many variables are used in the model. We should strive for a model with as few independent variables as possible that still generates accurate predictions, so when comparing models it is important to compare adjusted R^2, which penalizes additional variables.
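As a quick check on how the adjustment works, the sketch below recomputes the adjusted R^2 from the figures in the summary above (R^2 of 0.669, 3141 observations, 42 predictors excluding the constant) using the standard formula. The helper name `adjusted_r2` is my own and is not part of statsmodels.

```python
def adjusted_r2(r2, n_obs, n_predictors):
    """Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - p - 1)."""
    return 1 - (1 - r2) * (n_obs - 1) / (n_obs - n_predictors - 1)


# Values read off the 2008 OLS summary: R-squared, No. Observations, Df Model
adj = adjusted_r2(0.669, 3141, 42)
print(round(adj, 3))
```

With 3141 counties and 42 predictors the penalty is small, which is why the two values in the summary differ only in the third decimal place.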

Later I will compare these linear model predictions to a logistic regression. I will need a different model measurement, since logistic regression only has a pseudo R^2, which does not describe the same thing as a linear regression R^2. There are measures such as AIC and RMSE, but since I am interested in whether Republicans win each county I will manually calculate the model's accuracy.

In [3]:
# Getting predictions from our model
linear_predictions_08 = OLS08_model.predict(X)
linear_predictions_08_binary = np.where(linear_predictions_08 >= 0.5, 1, 0)

# Making the actual county level results into binary (1 if reps win and 0 if dems win)
y_08_binary = np.where(y_08 >= 0.5, 1, 0)

# Calculate the accuracy of the model's predictions
total_predicted_correctly = np.sum(y_08_binary == linear_predictions_08_binary)
linear_accuracy_08 = total_predicted_correctly / len(y_08_binary)


print(f'The 2008 model is {round(linear_accuracy_08*100,2)}% accurate')

The 2008 model is 83.76% accurate


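The 2008 linear regression model is 84% accurate at predicting the winning party for each county. Since there is no train-test split, the model is evaluated on the same data it was fitted to, so this accuracy is likely optimistic and the model may be overfit.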
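A minimal sketch of what a train-test split could look like is given below. It uses synthetic data and a plain NumPy least-squares fit as a stand-in for the statsmodels `OLS` call, so every name in it (`X_demo`, `y_demo`, `win_accuracy`) is hypothetical and the printed accuracies say nothing about the real county data.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for the county data: 1000 rows, a constant plus 5 predictors
n, p = 1000, 5
X_demo = np.column_stack([np.ones(n), rng.normal(size=(n, p))])
beta_true = np.array([0.55, 0.08, -0.05, 0.03, 0.0, 0.0])
y_demo = X_demo @ beta_true + rng.normal(scale=0.1, size=n)

# Shuffle the row indices and hold out 20% of the rows for testing
idx = rng.permutation(n)
n_test = n // 5
test_idx, train_idx = idx[:n_test], idx[n_test:]

# Fit ordinary least squares on the training rows only
beta_hat, *_ = np.linalg.lstsq(X_demo[train_idx], y_demo[train_idx], rcond=None)


def win_accuracy(X_part, y_part):
    """Share of rows where predicted and actual values fall on the same side of 0.5."""
    predicted_win = (X_part @ beta_hat) >= 0.5
    actual_win = y_part >= 0.5
    return np.mean(predicted_win == actual_win)


train_accuracy = win_accuracy(X_demo[train_idx], y_demo[train_idx])
test_accuracy = win_accuracy(X_demo[test_idx], y_demo[test_idx])
print(f'train accuracy: {train_accuracy:.3f}, test accuracy: {test_accuracy:.3f}')
```

The test accuracy is the number to report, since it comes from rows the model never saw during fitting.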

Note that the model does warn about possible multicollinearity so future iterations of the model should examine the relationship between independent variables. For example, the racial variables included in this model are percentages which all add to 100%. To avoid collinearity, one of the races should be dropped.
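To see why percentages that always add to 100% are a problem next to a constant, the sketch below builds a small made-up design matrix and checks its rank. The data are random and the three share columns are hypothetical; only the mechanism carries over to the race columns.

```python
import numpy as np

rng = np.random.default_rng(1)

# Three made-up percentage columns that sum to 100 in every row
shares = rng.dirichlet(alpha=[5, 2, 1], size=50) * 100
const = np.ones((50, 1))

# Constant + all three shares: the shares add up to 100 * constant
X_full = np.hstack([const, shares])
# Constant + two shares: the redundancy is gone
X_dropped = np.hstack([const, shares[:, :2]])

rank_full = np.linalg.matrix_rank(X_full)
rank_dropped = np.linalg.matrix_rank(X_dropped)
print(rank_full, 'independent columns out of', X_full.shape[1])
print(rank_dropped, 'independent columns out of', X_dropped.shape[1])
```

When the rank is lower than the number of columns, the coefficients are not uniquely determined, which is what dropping one category fixes.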

Also note that this linear model could produce predictions outside 0 and 1 even though these values are unrealistic (e.g. you could never get a Republican vote share of 1.1).

# Creating a logistic regression model for the 2008 election

When creating a logistic regression we want to input as much information as possible (e.g. if a county has a 0.51 vote share we want to input that rather than 1, since that 1 would be indistinguishable from the 1 for a county with a 0.99 vote share). This is why we chose the statsmodels logistic regression as opposed to the scikit-learn logistic regression, which only accepts discrete class labels as the dependent variable.

In [4]:
# Creating and fitting a logistic regression on the 2008 election
logreg_model = sm.Logit(y_08, X).fit(disp=False) #disp=False prevents the print out

# Getting predictions from our model
logistic_predictions_08 = logreg_model.predict(X)
logistic_predictions_08_binary = np.where(logistic_predictions_08 >= 0.5, 1, 0)

# Calculating prediction accuracy
total_predicted_correctly = np.sum(y_08_binary == logistic_predictions_08_binary)  
logistic_accuracy_08 = total_predicted_correctly / len(y_08_binary)


print(f'The 2008 logistic regression model is {round(logistic_accuracy_08*100,2)}% accurate')

The 2008 logistic regression model is 83.57% accurate


Notice that I waited until after obtaining my predictions to convert the values into binary results. I only converted them into binary values to calculate the accuracy of my model at predicting the correct election outcome for each county.

The logistic regression model is 83.57% accurate and keeps its predictions between 0 and 1. This is only slightly less than the 83.76% accuracy of the linear regression. Since predictions are never made on unseen data, and the dependent variable fed into the model is well within 0-1, I will use the more accurate linear regression model for the 2012 and 2016 elections.
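The reason a logistic regression cannot leave the 0-1 range is that its linear predictor is passed through the inverse logit function. The small sketch below illustrates this with a few hypothetical linear predictor values; `inverse_logit` is my own helper name.

```python
import math


def inverse_logit(z):
    """Map a linear predictor z to a value strictly between 0 and 1."""
    return 1 / (1 + math.exp(-z))


# Hypothetical linear predictor values, including fairly extreme ones
z_values = [-4.0, -0.5, 0.0, 0.5, 4.0]
mapped = [inverse_logit(z) for z in z_values]

for z, share in zip(z_values, mapped):
    print(f'z = {z:+.1f} -> predicted share {share:.3f}')
```

A linear predictor of exactly 0 maps to 0.5, so the 0.5 cut-off used for the binary win/loss prediction corresponds to the sign of the linear predictor.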


# Creating a linear model for the 2012 and 2016 elections

For the 2012 and 2016 linear models the dependent variable will be the Republican vote share for 2012 and 2016 respectively.

In [5]:
# Selecting the dependent variables
y_12 = df['Republicans12_Voteshare']
y_16 = df['Republicans16_Voteshare']

# Create and fit the linear models
OLS12_model = OLS(y_12,X).fit()
OLS16_model = OLS(y_16,X).fit()

# Getting predictions from our models
linear_predictions_12 = OLS12_model.predict(X)
linear_predictions_16 = OLS16_model.predict(X)
linear_predictions_12_binary = np.where(linear_predictions_12 >= 0.5, 1, 0)
linear_predictions_16_binary = np.where(linear_predictions_16 >= 0.5, 1, 0)

# Making the actual county level results into binary (1 if reps win and 0 if dems win)
y_12_binary = np.where(y_12 >= 0.5, 1, 0)
y_16_binary = np.where(y_16 >= 0.5, 1, 0)

# Calculate the accuracy of the 2012 model's predictions
total_predicted_correctly = np.sum(y_12_binary == linear_predictions_12_binary)
linear_accuracy_12 = total_predicted_correctly / len(y_12_binary)

# Calculate the accuracy of the 2016 model's predictions
total_predicted_correctly = np.sum(y_16_binary == linear_predictions_16_binary)
linear_accuracy_16 = total_predicted_correctly / len(y_16_binary)

print(f'The 2008 model is {round(linear_accuracy_08*100,2)}% accurate')
print(f'The 2012 model is {round(linear_accuracy_12*100,2)}% accurate')
print(f'The 2016 model is {round(linear_accuracy_16*100,2)}% accurate')

The 2008 model is 83.76% accurate
The 2012 model is 87.71% accurate
The 2016 model is 93.28% accurate


The accuracies for the more recent elections are higher than for the older ones (93% > 88% > 84%). The models are likely to be overfit since we did not split our data into train and test sets. However, this increase in accuracy could indicate increasing polarization in the US, which would improve the ability of demographic data to make accurate predictions.

---

# Bonus section
The above section is all that is required for a good assignment - this section is extra work that will improve our model.

Let's start by defining a function that will calculate the accuracies for a given election.


In [6]:
def calc_accuracy_table(accuracies_table,X,name='Accuracy'):
    '''A function that calculates the accuracy of an OLS model for each election year.
    
    Input: 
    accuracies_table - a pandas table of accuracies
    X - the independent variables to create and fit a model with
    name - the name of the new column in the accuracy table
    
    Output:
    The accuracies_table with an additional column of calculated accuracies
    
    '''
    my_accuracies = pd.DataFrame(columns=[name])
    election_year = 2008

    for year in [y_08, y_12, y_16]:
        binary = np.where(year >= 0.5,1,0)
        OLS_model = OLS(year,X).fit()
        predictions = np.where(OLS_model.predict(X) >= 0.5,1,0)
        accuracy = np.sum(predictions == binary) / len(year)
        my_accuracies.loc[election_year] = [accuracy]
        election_year += 4

    my_accuracies.reset_index(inplace=True)
    my_accuracies.columns = ['Year',name]
    
    accuracies_table[name] = my_accuracies[name]
    
    return accuracies_table


# Calculating the accuracies for original models we created
accuracies_table = pd.DataFrame({'Year':[2008,2012,2016]})
accuracies_table = calc_accuracy_table(accuracies_table,X,'Original_Models')
accuracies_table

Unnamed: 0,Year,Original_Models
0,2008,0.837631
1,2012,0.877109
2,2016,0.932824


In [7]:
def calc_rsquared_table(rsquared_table,X,name='RSquared'):
    '''A function that calculates the R squared of an OLS model for each election year.
    
    Input: 
    rsquared_table - a pandas table of R squared values
    X - the independent variables to create and fit a model with
    name - the name of the new column in the R squared table
    
    Output:
    The rsquared_table with an additional column of calculated R squared values
    
    '''
    my_rsquared = pd.DataFrame(columns=[name])
    election_year = 2008

    for year in [y_08, y_12, y_16]:
        OLS_model = OLS(year,X).fit()
        my_rsquared.loc[election_year] = [OLS_model.rsquared]
        election_year += 4

    my_rsquared.reset_index(inplace=True)
    my_rsquared.columns = ['Year',name]
    
    rsquared_table[name] = my_rsquared[name]
    
    return rsquared_table


# Calculating the R squared for the original models we created
rsquared_table = pd.DataFrame({'Year':[2008,2012,2016]})
rsquared_table = calc_rsquared_table(rsquared_table,X,'Original_Models')
rsquared_table

Unnamed: 0,Year,Original_Models
0,2008,0.66903
1,2012,0.711609
2,2016,0.807239


## Removing columns that logically are collinear
Since all the race columns add up to 100% I will remove the column 'Other Race or Races'. The columns 'Less Than High School Diploma' and 'At Least High School Diploma' are also collinear so I will remove 'At Least High School Diploma'.

In [8]:
# Row-wise sum of the race columns, averaged over all counties
print(f'The sum of the percentages of the race columns are {round(X.iloc[:,8:14].sum(axis=1).mean(),4)} %')

The sum of the percentages of the race columns are 99.9999 %


Although this isn't exactly 100%, the values are only recorded to 2 decimal places, so I assume the difference is due to rounding error.

In [9]:
# Sum of 'Less Than High School Diploma' and 'At Least High School Diploma' 
X.iloc[:,2:4].sum(axis=1).describe()

count    3141.000000
mean       99.920408
std         1.993629
min        50.000000
25%       100.000000
50%       100.000000
75%       100.000000
max       100.000000
dtype: float64

In [10]:
# Displaying those columns that do not add up to 100
table = X.iloc[:,2:4][X.iloc[:,2:4].sum(axis=1)<100].copy()  # copy so adding a column does not touch X
table["Sum"] = table.sum(axis=1)
table

Unnamed: 0,Less Than High School Diploma,At Least High School Diploma,Sum
338,3.85,46.15,50.0
1108,4.6,45.4,50.0
1350,6.75,43.25,50.0
2251,20.1,29.9,50.0
2961,10.15,39.85,50.0


As shown in the table above, all but 5 rows of 'Less Than High School Diploma' and 'At Least High School Diploma' add up to 100%; those 5 rows add up to 50% instead. Ideally I'd multiply these values by 2 to get the percentage, but since it's only a few rows I will ignore it.
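For reference, the rescaling I skipped could look like the sketch below. It works on the five rows from the table above, copied into a NumPy array, and assumes the halved values are a data-entry artifact, which I have not verified.

```python
import numpy as np

# The five rows from the table above:
# ['Less Than High School Diploma', 'At Least High School Diploma']
education = np.array([
    [3.85, 46.15],
    [4.60, 45.40],
    [6.75, 43.25],
    [20.10, 29.90],
    [10.15, 39.85],
])

# Rescale each row so that its two percentages sum to 100
row_sums = education.sum(axis=1, keepdims=True)
education_fixed = education * (100 / row_sums)

print(education_fixed)
```

Dividing by the row sum rather than hard-coding a factor of 2 means the same line would also handle rows that are off by some other constant factor.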

In [11]:
# Dropping the logically collinear variables
X.drop(['Other Race or Races','At Least High School Diploma'], axis=1, inplace=True)

# Calculating the accuracies for models without these collinear variables
accuracies_table = calc_accuracy_table(accuracies_table,X,'Manually_Reduced_Model')
print('Accuracies')
display(accuracies_table)

# Calculating the R squared for models without these collinear variables
rsquared_table = calc_rsquared_table(rsquared_table,X,'Manually_Reduced_Models')
print('\n R Squared')
display(rsquared_table)

Accuracies


Unnamed: 0,Year,Original_Models,Manually_Reduced_Model
0,2008,0.837631,0.837313
1,2012,0.877109,0.877428
2,2016,0.932824,0.932824



 R Squared


Unnamed: 0,Year,Original_Models,Manually_Reduced_Models
0,2008,0.66903,0.66889
1,2012,0.711609,0.711285
2,2016,0.807239,0.807238


These models have very similar accuracies and R squared values. Let's look at the 2008 model summary to see if the multicollinearity warning is gone or reduced.

In [12]:
# Create and fit a linear model
OLS08_model = OLS(y_08,X).fit()

# Predictions
predictions_08 = np.where(OLS08_model.predict(X) >= 0.5,1,0)

# Accuracy
accuracy_08 = np.sum(predictions_08 == y_08_binary) / len(y_08)

OLS08_model.summary()

0,1,2,3
Dep. Variable:,Republicans08_Voteshare,R-squared:,0.669
Model:,OLS,Adj. R-squared:,0.665
Method:,Least Squares,F-statistic:,156.6
Date:,"Wed, 01 May 2019",Prob (F-statistic):,0.0
Time:,16:49:45,Log-Likelihood:,3446.5
No. Observations:,3141,AIC:,-6811.0
Df Residuals:,3100,BIC:,-6563.0
Df Model:,40,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
const,-0.1826,2.876,-0.063,0.949,-5.822,5.457
Precincts,-1.352e-05,2.27e-05,-0.597,0.551,-5.79e-05,3.09e-05
Less Than High School Diploma,0.0002,0.000,0.549,0.583,-0.001,0.001
At Least Bachelors's Degree,-0.0028,0.001,-4.198,0.000,-0.004,-0.001
Graduate Degree,-0.0068,0.001,-5.942,0.000,-0.009,-0.005
School Enrollment,0.0004,0.000,1.157,0.247,-0.000,0.001
Median Earnings 2010,-9.419e-07,5.63e-07,-1.672,0.095,-2.05e-06,1.63e-07
White (Not Latino) Population,0.0077,0.001,7.777,0.000,0.006,0.010
African American Population,0.0021,0.001,2.116,0.034,0.000,0.004

0,1,2,3
Omnibus:,14.868,Durbin-Watson:,1.823
Prob(Omnibus):,0.001,Jarque-Bera (JB):,16.374
Skew:,-0.12,Prob(JB):,0.000278
Kurtosis:,3.261,Cond. No.,647000000.0


The models remain similarly accurate to their previous versions but the condition number indicating multicollinearity has gone down slightly (from 9.31e+08 to 6.47e+08). Let's try a more rigorous approach.

## Reducing multicollinearity with variance inflation factor 
Variance Inflation Factor (VIF) is a measure of multicollinearity and can be used to identify which variables to drop. To learn more about VIF, visit this [website](https://www.statisticshowto.datasciencecentral.com/variance-inflation-factor/).

According to the website, a VIF above 5 indicates high correlation. Therefore, I will set the threshold to 5.

Keep in mind that, because of the way VIF is calculated, VIF values can change once a variable is dropped. Therefore, new VIF values need to be calculated in each iteration before determining which variable to drop.
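To make the calculation concrete, here is a minimal sketch of the textbook VIF: regress one column on all the others, take that regression's R^2, and compute 1 / (1 - R^2). The helper `vif_with_intercept` and the synthetic columns are my own. It adds an intercept to each auxiliary regression, whereas, as far as I can tell, the statsmodels function used below regresses on the other columns exactly as given, so the two can give different numbers.

```python
import numpy as np


def vif_with_intercept(X, i):
    """Textbook VIF for column i: regress it on the other columns plus an intercept."""
    target = X[:, i]
    others = np.column_stack([np.ones(len(X)), np.delete(X, i, axis=1)])
    coef, *_ = np.linalg.lstsq(others, target, rcond=None)
    residuals = target - others @ coef
    r_squared = 1 - residuals @ residuals / np.sum((target - target.mean()) ** 2)
    return 1 / (1 - r_squared)


rng = np.random.default_rng(2)
a = rng.normal(size=500)
b = rng.normal(size=500)
c = a + b + rng.normal(scale=0.1, size=500)  # nearly a linear combination of a and b
X_demo = np.column_stack([a, b, c])

vif_all = [round(float(vif_with_intercept(X_demo, i)), 1) for i in range(3)]
vif_without_c = [round(float(vif_with_intercept(X_demo[:, :2], i)), 1) for i in range(2)]
print('all three columns:', vif_all)
print('after dropping c: ', vif_without_c)
```

All three columns start with a high VIF, but once `c` is removed the remaining two fall back to roughly 1. This is why the VIF has to be recomputed after every drop.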

In [13]:
# Code Source: 
# https://stats.stackexchange.com/questions/155028/how-to-systematically-remove-collinear-variables-in-python   

#import package
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Reassign the independent variable 
X = df.loc[:,'Precincts':'Infant.mortality']
X.drop('Votes', axis=1, inplace=True)

def calculate_vif_(X, thresh=5.0):
    variables = list(range(X.shape[1]))
    dropped = True
    while dropped:
        dropped = False
        vif = [variance_inflation_factor(X.iloc[:, variables].values, ix)
               for ix in range(X.iloc[:, variables].shape[1])]
        
        maxloc = vif.index(max(vif))
        if max(vif) > thresh:
            print('dropping \'' + X.iloc[:, variables].columns[maxloc] +
                  '\' at index: ' + str(maxloc))
            del variables[maxloc]
            dropped = True

    print('Remaining variables:')
    print(X.columns[variables])
    return X.iloc[:, variables]


X_reduced = calculate_vif_(X, thresh=5.0)

dropping 'White (Not Latino) Population' at index: 7
dropping 'At Least High School Diploma' at index: 2
dropping 'School Enrollment' at index: 4
dropping 'Gini.Coefficient' at index: 15
dropping 'Adult.obesity' at index: 28
dropping 'median_age' at index: 23
dropping 'Management.professional.and.related.occupations' at index: 16
dropping 'Poverty.Rate.below.federal.poverty.threshold' at index: 14
dropping 'At Least Bachelors's Degree' at index: 2
dropping 'SIRE_homogeneity' at index: 19
dropping 'Diabetes' at index: 23
dropping 'Sales.and.office.occupations' at index: 15
dropping 'Low.birthweight' at index: 18
dropping 'Child.Poverty.living.in.families.below.the.poverty.line' at index: 13
dropping 'Children.in.single.parent.households' at index: 18
dropping 'Infant.mortality' at index: 26
dropping 'Median Earnings 2010' at index: 3
dropping 'Uninsured' at index: 20
dropping 'Less Than High School Diploma' at index: 1
dropping 'Service.occupations' at index: 11
dropping 'Injury.deaths'

It's worth noting that `X` was rebuilt here without `sm.add_constant`, so no constant took part in this collinearity test and the reduced model below is fitted without an intercept.

In [14]:
# Create and fit the model with reduced independent variables
reduced_model = OLS(y_08, X_reduced).fit()
reduced_model.summary()

0,1,2,3
Dep. Variable:,Republicans08_Voteshare,R-squared:,0.919
Model:,OLS,Adj. R-squared:,0.919
Method:,Least Squares,F-statistic:,3224.0
Date:,"Wed, 01 May 2019",Prob (F-statistic):,0.0
Time:,16:49:54,Log-Likelihood:,1128.5
No. Observations:,3141,AIC:,-2235.0
Df Residuals:,3130,BIC:,-2168.0
Df Model:,11,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
Precincts,-5.271e-05,2.03e-05,-2.599,0.009,-9.25e-05,-1.3e-05
Graduate Degree,0.0223,0.001,31.078,0.000,0.021,0.024
African American Population,-0.0035,0.000,-13.023,0.000,-0.004,-0.003
Native American Population,-0.0016,0.000,-3.667,0.000,-0.002,-0.001
Asian American Population,-0.0244,0.002,-14.979,0.000,-0.028,-0.021
Other Race or Races,0.0214,0.002,11.685,0.000,0.018,0.025
Latino Population,0.0007,0.000,2.551,0.011,0.000,0.001
Farming.fishing.and.forestry.occupations,0.0304,0.001,25.477,0.000,0.028,0.033
Production.transportation.and.material.moving.occupations,0.0200,0.000,64.548,0.000,0.019,0.021

0,1,2,3
Omnibus:,24.555,Durbin-Watson:,1.648
Prob(Omnibus):,0.0,Jarque-Bera (JB):,25.065
Skew:,-0.205,Prob(JB):,3.61e-06
Kurtosis:,3.151,Cond. No.,273.0


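The reported R^2 has increased to 0.919, up from 0.669 in the original model, and the warning about collinearity is no longer there. Note that this model has no constant, in which case statsmodels reports an uncentered R^2, so the two values are not directly comparable.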

In [15]:
# Calculating the accuracies for models with VIF drops
accuracies_table = calc_accuracy_table(accuracies_table,X_reduced,'VIF_Reduced_Model')
print('Accuracies')
display(accuracies_table)

# Calculating the R squared for models with VIF drops
rsquared_table = calc_rsquared_table(rsquared_table,X_reduced,'VIF_Reduced_Models')
print('\n R Squared')
display(rsquared_table)

Accuracies


Unnamed: 0,Year,Original_Models,Manually_Reduced_Model,VIF_Reduced_Model
0,2008,0.837631,0.837313,0.603629
1,2012,0.877109,0.877428,0.690226
2,2016,0.932824,0.932824,0.798153



 R Squared


Unnamed: 0,Year,Original_Models,Manually_Reduced_Models,VIF_Reduced_Models
0,2008,0.66903,0.66889,0.918895
1,2012,0.711609,0.711285,0.919673
2,2016,0.807239,0.807238,0.931755


Interestingly, although the accuracy drops substantially (for 2008, from 0.84 to 0.60), the reported R squared increases dramatically.
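One possible reason the R squared and the accuracy move in opposite directions is that `X_reduced` contains no constant column. For a model without a constant, statsmodels reports an uncentered R squared, which compares the residuals with the sum of squared `y` values rather than with the variation of `y` around its mean. The sketch below illustrates the mechanism on made-up data with a plain NumPy fit. I have not re-run the notebook to confirm that this is what drives the jump here.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 1000

# Made-up outcome that hovers around 0.6, and one positive predictor unrelated to it
y_demo = 0.6 + rng.normal(scale=0.1, size=n)
x_demo = rng.uniform(10, 20, size=(n, 1))

# Fit a regression through the origin (no constant column)
coef, *_ = np.linalg.lstsq(x_demo, y_demo, rcond=None)
ssr = np.sum((y_demo - x_demo @ coef) ** 2)

# Uncentered R^2 compares the residuals with sum(y^2)
uncentered_r2 = 1 - ssr / np.sum(y_demo ** 2)
# Centered R^2 compares the residuals with sum((y - mean(y))^2)
centered_r2 = 1 - ssr / np.sum((y_demo - y_demo.mean()) ** 2)

print(f'uncentered R^2: {uncentered_r2:.3f}')
print(f'centered R^2:   {centered_r2:.3f}')
```

Because the outcome sits far from zero, the uncentered value is high even though the predictor carries no information about `y`, so a fairer comparison with the original models would use the same definition of R squared for all of them.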


## States as variables
Since states are often more inclined to a particular political party I will add states as an independent variable. To do this I will create dummy variables but in order to avoid collinearity I need to remove at least one state variable (which is taken care of in the VIF calculation).
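As an aside, pandas can also drop one dummy per categorical column directly through the `drop_first` argument of `pd.get_dummies`. The sketch below shows this on a hypothetical four-row frame with made-up population figures. It is an alternative to letting the VIF loop pick the column to drop, not what the cell below does.

```python
import pandas as pd

# Hypothetical mini-frame with a categorical State column and made-up populations
demo = pd.DataFrame({
    'State': ['Alabama', 'Alaska', 'Alabama', 'Arizona'],
    'Total Population': [50000, 7000, 180000, 70000],
})

# drop_first=True omits the first category (Alabama), which becomes the baseline
dummies = pd.get_dummies(demo, columns=['State'], drop_first=True)
print(dummies.columns.tolist())
```

With a baseline state left out, each remaining state coefficient is read as a difference from that baseline, provided the model also includes a constant.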

In [16]:
# Recreating the independent variable to include State
X = df.loc[:,:'Infant.mortality']
X.drop(['ST', 'Fips', 'County', 'Votes'], axis=1, inplace=True)
X = sm.add_constant(X)

# Adding state dummies
X = pd.get_dummies(X)

# Reducing multicollinearity
X_reduced = calculate_vif_(X, thresh=5.0)

  return 1 - self.ssr/self.centered_tss
  vif = 1. / (1. - r_squared_i)


dropping 'State_Alabama' at index: 43
dropping 'const' at index: 0
dropping 'White (Not Latino) Population' at index: 7
dropping 'At Least High School Diploma' at index: 2
dropping 'School Enrollment' at index: 4
dropping 'Gini.Coefficient' at index: 15
dropping 'Adult.obesity' at index: 28
dropping 'median_age' at index: 23
dropping 'Management.professional.and.related.occupations' at index: 16
dropping 'Poverty.Rate.below.federal.poverty.threshold' at index: 14
dropping 'Diabetes' at index: 25
dropping 'At Least Bachelors's Degree' at index: 2
dropping 'SIRE_homogeneity' at index: 19
dropping 'Uninsured' at index: 25
dropping 'Low.birthweight' at index: 19
dropping 'Sales.and.office.occupations' at index: 15
dropping 'Child.Poverty.living.in.families.below.the.poverty.line' at index: 13
dropping 'Median Earnings 2010' at index: 3
dropping 'Service.occupations' at index: 12
dropping 'Children.in.single.parent.households' at index: 16
dropping 'Infant.mortality' at index: 23
dropping '

In [17]:
# Create and fit the model with reduced independent variables
reduced_model = OLS(y_08, X_reduced).fit()
reduced_model.summary()

0,1,2,3
Dep. Variable:,Republicans08_Voteshare,R-squared:,0.95
Model:,OLS,Adj. R-squared:,0.949
Method:,Least Squares,F-statistic:,1016.0
Date:,"Wed, 01 May 2019",Prob (F-statistic):,0.0
Time:,16:51:23,Log-Likelihood:,1896.9
No. Observations:,3141,AIC:,-3678.0
Df Residuals:,3083,BIC:,-3327.0
Df Model:,58,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
African American Population,-0.0026,0.000,-8.964,0.000,-0.003,-0.002
Native American Population,-0.0032,0.000,-8.478,0.000,-0.004,-0.002
Asian American Population,-0.0057,0.001,-4.115,0.000,-0.008,-0.003
Latino Population,-0.0039,0.000,-13.687,0.000,-0.004,-0.003
Total Population,-1.265e-08,9.35e-09,-1.353,0.176,-3.1e-08,5.68e-09
Farming.fishing.and.forestry.occupations,0.0153,0.001,13.875,0.000,0.013,0.017
HIV.prevalence.rate,-4.819e-05,1.6e-05,-3.008,0.003,-7.96e-05,-1.68e-05
Violent.crime,7.989e-05,1.63e-05,4.903,0.000,4.79e-05,0.000
State_Alaska,0.5486,0.029,18.752,0.000,0.491,0.606

0,1,2,3
Omnibus:,1887.446,Durbin-Watson:,1.602
Prob(Omnibus):,0.0,Jarque-Bera (JB):,27422.236
Skew:,2.593,Prob(JB):,0.0
Kurtosis:,16.515,Cond. No.,19000000.0


In [18]:
# Calculating the accuracies for models with VIF drops
accuracies_table = calc_accuracy_table(accuracies_table,X_reduced,'VIF_Reduced_State_Model')
print('Accuracies')
display(accuracies_table)

# Calculating the R squared for models with VIF drops
rsquared_table = calc_rsquared_table(rsquared_table,X_reduced,'VIF_Reduced_State_Models')
print('\n R Squared')
display(rsquared_table)

Accuracies


Unnamed: 0,Year,Original_Models,Manually_Reduced_Model,VIF_Reduced_Model,VIF_Reduced_State_Model
0,2008,0.837631,0.837313,0.603629,0.814709
1,2012,0.877109,0.877428,0.690226,0.849729
2,2016,0.932824,0.932824,0.798153,0.888252



 R Squared


Unnamed: 0,Year,Original_Models,Manually_Reduced_Models,VIF_Reduced_Models,VIF_Reduced_State_Models
0,2008,0.66903,0.66889,0.918895,0.950275
1,2012,0.711609,0.711285,0.919673,0.953093
2,2016,0.807239,0.807238,0.931755,0.956151


Although accuracy is lower, the R squared is higher. Choosing which metric is more important depends on what you want your model to do. It could be argued that accuracy is the better metric here because we want to know which party won each county.