In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

import seaborn

import statsmodels.api as sm
import statsmodels.formula.api as smf

# Various tests of normality
from scipy.stats import shapiro
from statsmodels.stats.stattools import jarque_bera
from statsmodels.stats.diagnostic import lilliefors

# Prepare the Data

In [None]:
# Import the county election data, and the county health rankings data for 2016.
# Store in a variable called "df"

url = 'https://raw.githubusercontent.com/thomaspingel/geodata/master/election/county_election_data_2000-2016.csv'
election_df = pd.read_csv(url,dtype={'FIPS':str})

url = 'https://raw.githubusercontent.com/thomaspingel/geodata/master/county_health_rankings/chr_2016.csv'
health_df = pd.read_csv(url,dtype={'FIPS':str})

df = health_df.merge(election_df,how='left',on='FIPS')

# Column names often have spaces.  Use this command to replace spaces with underscores
# df.columns = df.columns.str.replace(' ', '_')

df.head()

In [None]:
print(df.columns.values)

# Part 1 - Exploring Transformations of Data

**Transformations are a useful way to take data that is skewed and de-skew it so that it performs better in a regression model (i.e., is more predictive).**

_Log transforms_ are a useful way to transform data when
* data values are positive AND
* data values are skewed right (i.e., clustered heavily to the left)

If the data are clustered to the left BUT include zero or negative values, you can add a constant to shift every value above zero before transforming.

_Arithmetic and power transforms are also useful._
One can also raise the values to a power, either greater than one (e.g., the square or second power) or between zero and 1 (e.g., square root or .5 power)
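
As a rough illustration of the transforms described above, the sketch below applies a log, a square root, and a shift-then-power transform to synthetic data and reports the sample skewness of each. The arrays `x` and `margin` are made-up stand-ins, not columns from the course datasets, and `scipy.stats.skew` is used here only as a convenient numeric summary.

```python
import numpy as np
from scipy.stats import skew

rng = np.random.default_rng(0)

# Synthetic, strictly positive, right-skewed values (a stand-in for a column such as population)
x = rng.lognormal(mean=10, sigma=1.5, size=3000)

# Log transform: valid here only because every value is positive
x_log = np.log(x)

# Square-root (0.5 power) transform: a milder correction for right skew
x_sqrt = x ** 0.5

# Synthetic values that can be negative (a stand-in for a percentage margin)
margin = rng.normal(loc=20, scale=30, size=3000).clip(-99, 99)

# Shift so that every value is positive, then raise to a power
margin_shifted = margin + 100
margin_pow = margin_shifted ** 1.4

print('skew of x:        ', skew(x))
print('skew of sqrt(x):  ', skew(x_sqrt))
print('skew of log(x):   ', skew(x_log))
print('minimum after shift:', margin_shifted.min())
```

A skewness near zero suggests a roughly symmetric distribution; the histogram is still the best check on whether a transform has over-corrected.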

---

<u>Instructions</u>: Using the code block below, explore ten likely contributors to the prediction model to see whether they require transformation before inclusion.  Use the histograms to judge visually how normal each one appears.  Nothing will be completely normal, but does it have a general "bell" shape, or is it bunched up at one end or the other?  If it is bunched up toward the left, that is a "right skew", because it has a longer right "tail".  If you need to review the concept of skew, please see the [statisticshowto.com page and video on skew](https://www.statisticshowto.com/probability-and-statistics/skewed-distribution/).  Describe each variable's skew as weak, moderate, or strong, and as left or right.

Paste your best transform code (everything to the right of the equals sign) into the markdown table.  Please order variables alphabetically (manually).  Two of these have been done for you.

| Variable | Untransformed skew | Transform | 
| ----------- | ---- | ----------- |
| Demographics - Population   | Strong Right | np.log(x) |
| gop_minus_dem_prc_2012 | Weak Left | (x + 100) ** 1.4 | 
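
If you want a numeric cross-check on a visual judgment of skew, one option is to label the sample skewness. The sketch below is only a suggestion: `describe_skew` is a hypothetical helper, and its cutoffs (0.5 and 1.0) are an assumed rule of thumb rather than anything specified in this assignment.

```python
import numpy as np
import pandas as pd

def describe_skew(values, weak=0.5, strong=1.0):
    """Label sample skewness as Weak/Moderate/Strong Left/Right.

    The cutoffs are assumptions (a common rule of thumb), not a standard.
    """
    g = pd.Series(values).dropna().skew()
    direction = 'Right' if g > 0 else 'Left'
    size = abs(g)
    if size < weak:
        strength = 'Weak'
    elif size < strong:
        strength = 'Moderate'
    else:
        strength = 'Strong'
    return f'{strength} {direction}'

# Synthetic example data, not taken from the course datasets
rng = np.random.default_rng(0)
right_skewed = rng.lognormal(mean=0, sigma=1, size=2000)

print(describe_skew(right_skewed))    # long right tail
print(describe_skew(-right_skewed))   # mirrored values, long left tail
```

The histogram should still drive the entry in the table; the label is most useful for borderline cases.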

In [None]:
# Edit this line to choose a variable from the list above
x = df['Demographics - Population'].dropna()

# And modify this line to the right of the equal sign to make a transformation.
# Options include a log transform, and adding a value and/or raising to a positive exponent.
# Example: xt = np.log(x)
# Example: xt = (x + 100) ** 2.5
xt =  np.log(x)

plt.figure(figsize=(15,5))
plt.subplot(121)
_ = plt.hist(x,bins=25)
plt.subplot(122)
_ = plt.hist(xt,bins=25)

# Statistical test of normality (Shapiro-Wilk)
# If the p-value is less than .05, reject the hypothesis that the data are normally distributed
# Graduate students should pay attention to these values and understand their import
print('Untransformed: ',shapiro(x))
print('Transformed:',shapiro(xt))

# Part 2 - Test variables one at a time for R<sup>2</sup> value

<u>Instructions</u> - Using the variables and transformations you collected above, run each variable twice, first untransformed and then transformed as you specified above, and put the results in this table.  Sort the rows (manually) by the higher of each variable's R<sup>2</sup> and transformed R<sup>2</sup> scores.  Include a VERY short statement indicating the logic of including the variable in the model.  Add a markdown paragraph below this table explaining which variables are worth including in the model.

| Variable | R<sup>2</sup> | R<sup>2</sup><sub>trans</sub> | Rationale |
| ----------- | :---- | :-----| ----------:| 
| gop_minus_dem_prc_2012 | 0.891 | 0.870 | Counties are likely to vote like they did the last time |
| Demographics - Population   | 0.123 | 0.255 | Larger counties are more likely to vote Democratic |
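
The two-runs-per-variable comparison can be sketched as a small helper that fits the single-predictor model and returns its R<sup>2</sup>. The data below are synthetic (built so that `y` depends on the log of `population`), and `r_squared` is a hypothetical helper name; with the real data you would pass columns of `df` instead.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)

# Synthetic stand-in data: y depends on log(population) plus noise
demo = pd.DataFrame({'population': rng.lognormal(mean=10, sigma=1.5, size=2000)})
demo['y'] = 80 - 6 * np.log(demo['population']) + rng.normal(scale=10, size=2000)

def r_squared(y, x):
    """Fit y ~ x by ordinary least squares and return R-squared."""
    frame = pd.DataFrame({'y': y, 'x': x}).dropna(how='any')
    return smf.ols(formula='y ~ x', data=frame).fit().rsquared

r2_raw = r_squared(demo['y'], demo['population'])
r2_log = r_squared(demo['y'], np.log(demo['population']))

print('untransformed R2:', round(r2_raw, 3))
print('transformed R2:  ', round(r2_log, 3))
```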

In [None]:
_df = pd.DataFrame()

_df['y'] = df['gop_minus_dem_prc_2016']

# Edit this value, first as just the column, and again using the transformation you picked above
_df['x'] = np.log(df['Demographics - Population'])

_df = _df.dropna(how='any')

# Then run the regression
results = smf.ols(formula='y ~ x',data=_df).fit()

# Print the report
print(results.summary())

# Part 3 - Multiple Regression Round 1 - Investigate interaction effects

<u>Instructions</u> - Choose 3 or 4 of the highest scoring variables (some discretion is OK here), and include them in a model with full interaction effects.  Express the variables either untransformed or transformed, using the transformed version if you think the transformation significantly improves the explanatory power.  Review the output, and use the per-term rows of the report (one row per variable or interaction) to fill in the table for all interactions tested.  If a term's p-value (the `P>|t|` column) is below .05, that term is statistically significant:

| Variables | Result
| --------- | -------
| x0 | Example: Statistically Significant
| x1 | Example: Not Statistically Significant
| x2 | ???
| x0 and x1 | ???
| x0 and x2 | ???
| x1 and x2 | ???
| x0 and x1 and x2 | ???
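
Rather than reading each p-value off the printed summary, the table can also be filled in from `results.pvalues`, which holds one p-value per term. The sketch below uses synthetic data in which only `x0` and the `x0:x1` interaction actually affect `y`; the frame `demo` and the column names in `report` are assumptions made for illustration.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(2)
n = 1500

# Synthetic predictors; y depends on x0 and on the x0:x1 interaction only
demo = pd.DataFrame({
    'x0': rng.normal(size=n),
    'x1': rng.normal(size=n),
    'x2': rng.normal(size=n),
})
demo['y'] = 2.0 * demo['x0'] + 1.0 * demo['x0'] * demo['x1'] + rng.normal(size=n)

# 'x0 * x1 * x2' requests all main effects and all interactions
results = smf.ols(formula='y ~ x0 * x1 * x2', data=demo).fit()

# One row per term, labeled using the p < .05 threshold
report = pd.DataFrame({'p_value': results.pvalues})
report['result'] = np.where(report['p_value'] < 0.05,
                            'Statistically Significant',
                            'Not Statistically Significant')

print(report.drop(index='Intercept'))
```

A term such as `x0:x1` in the output corresponds to the "x0 and x1" row of the table.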

In [None]:
_df = pd.DataFrame()

_df['y'] = df['gop_minus_dem_prc_2016']

_df['x0'] = df['gop_minus_dem_prc_2012']
_df['x1'] = df['Demographics - Population']
_df['x2'] = df['Sexually transmitted infections - Chlamydia Rate']

_df = _df.dropna(how='any')

# Then run the regression
results = smf.ols(formula='y ~ x0 * x1 * x2',data=_df).fit()

# Print the report
print(results.summary())

# Part 4 - Multiple Regression

<u>Instructions</u> - Use all of the above information to construct a final model with individual and interaction terms, as needed and as dictated by an analysis of the resulting p-values.  You'll need to edit your model and formula and rerun several times, adding and removing terms as dictated by the p-values, and then step through the crosstabulation code below.  No editing of any code block other than the one directly beneath these instructions should be required.
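
When adding and removing terms, it helps to know how the formula operators relate: `a * b` is shorthand for `a + b + a:b`, where `a:b` is the interaction term alone. The sketch below checks this on synthetic data (the frame `demo` is made up for illustration) and shows a reduced model with the interaction removed.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(3)
n = 500

# Synthetic data with a genuine x0:x1 interaction
demo = pd.DataFrame({'x0': rng.normal(size=n), 'x1': rng.normal(size=n)})
demo['y'] = demo['x0'] - demo['x1'] + 0.5 * demo['x0'] * demo['x1'] + rng.normal(size=n)

# These two formulas describe the same model
full = smf.ols('y ~ x0 * x1', data=demo).fit()
explicit = smf.ols('y ~ x0 + x1 + x0:x1', data=demo).fit()

# Writing the terms out makes it easy to delete one, here the interaction
reduced = smf.ols('y ~ x0 + x1', data=demo).fit()

print(list(full.params.index))
print(list(reduced.params.index))
print('R2 full:', round(full.rsquared, 3), ' R2 reduced:', round(reduced.rsquared, 3))
```

Writing the terms out explicitly, as in the formula in the cell below, lets you drop a single non-significant term without touching the others.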

In [None]:
_df = pd.DataFrame()

_df['y'] = df['gop_minus_dem_prc_2016']

_df['x0'] = df['variable1']
_df['x1'] = df['variable2']
_df['x2'] = df['variable3']

_df = _df.dropna(how='any')

# Then run the regression
results = smf.ols(formula='y ~ x0 + x1 + x2 + x0:x1 + x1:x2 + x0:x1:x2',data=_df).fit()

# Print the report
print(results.summary())

In [None]:
# Construct a dataframe of the confusion matrix, calculated with pandas.crosstab
# Here, only binary options are considered (R wins or D wins, not percent victory)

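# Compare against _df['y'] so that the true and predicted values cover the same rows
# (results.fittedvalues only exists for rows that survived dropna)
output_df = pd.DataFrame({'true':_df['y']>0,'predicted':results.fittedvalues>0})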
output_df['true'] = output_df['true'].map({True:'R',False:'D'})
output_df['predicted'] = output_df['predicted'].map({True:'R',False:'D'})
cross = pd.crosstab(output_df.true,output_df.predicted)
print(cross)

In [None]:
# Show the normalized values (percents)

cross = pd.crosstab(output_df.true,output_df.predicted,normalize=True)
cross

In [None]:
# How accurate was the total model at predicting R/D victory (percent of counties classified correctly)?

100 * (cross.values[0,0] + cross.values[1,1])

# Part 5 - Write up

<u>Instructions</u> - Write a short paragraph in a markdown cell explaining what you learned about regression and about predicting 2016 election results.  Name 5 more variables not included in these datasets that you think might be helpful in predicting the results.  These variables should be operationally valid, meaning one could easily imagine that such data exist.  When complete, render this notebook as a PDF via notebook-as-pdf by exporting it (File -> Download As -> PDF via HTML).