# Beta distributions on election data 

In this lab you will be looking at [election data taken from Andrew Gelman's book on Bayesian statistics](http://www.stat.columbia.edu/~gelman/arm/examples/election88/) (highly recommended!).

---

## Dataset

The initial parsing code has been completed for you so that you can get to the Bayesian analysis sooner. You will still need to do some feature engineering.

The data contains polling information for George H.W. Bush as well as election information. 

The polling information is a sample, by state, of whether respondents intended to vote for Bush. The voting information is a sample taken after the election of whether respondents actually voted for Bush.

---

## 1. Import and parse the data

This portion is done for you. There are two datasets:

    election88  :  this contains the election voting poll information
    pre_poll    :  this contains the pre-election voting poll information

In [1]:
import pandas as pd
import numpy as np

election88 = pd.read_csv('./datasets/election88.csv')

election88.rename(columns={'stnum':'state_id','samplesize':'vote_total'}, inplace=True)

election88 = election88[~election88.st.isin(['DC','AK','HI'])]

print(election88.head(3))

   state_id  st  electionresult  vote_total    raking  _merge
0         1  AL            0.59         203  0.673067       3
2         3  AZ            0.60         194  0.568980       3
3         4  AR            0.56         121  0.563672       3


In [2]:
# Reading in the poll csv file
pre_poll = pd.read_csv('./datasets/polls.csv')

# remove unnecessary columns:
del pre_poll['org']

pre_poll.rename(columns={'state':'state_id'}, inplace=True)

pre_poll = pre_poll.merge(election88[['state_id','st']], on='state_id')

print(pre_poll.head(3))
print(pre_poll.shape)

   year survey  bush  state_id  edu  age  female  black  weight  st
0     1   9152   1.0         7    2    2       1      0    1403  CT
1     1   9152   1.0         7    4    3       1      0     701  CT
2     1   9152   0.0         7    2    1       0      0    4341  CT
(13525, 10)


In [3]:
# print state category counts
print(pre_poll['st'].value_counts())

CA    1493
NY     894
TX     788
FL     750
PA     616
OH     605
IL     567
MI     530
NJ     428
WA     393
WI     389
MA     373
VA     354
NC     346
TN     329
GA     316
MO     309
IN     291
MN     289
MD     284
SC     223
MS     220
KY     210
AL     203
LA     196
AZ     194
CO     181
CT     171
OR     149
IA     143
KS     141
OK     130
NE     125
AR     121
WV     117
NM     109
RI      91
UT      79
SD      60
ND      60
ME      51
ID      42
MT      40
DE      39
NV      32
NH      27
WY      15
VT      12
Name: st, dtype: int64


--- 

## 2. In the poll data, compute the number of people who did and didn't intend to vote for Bush by state.

In [6]:
pre_poll.groupby('st').de

    year  survey  bush  state_id   edu   age  female  black  weight
st
AL   203     203   159       203   203   203     203    203     203
AR   121     121   101       121   121   121     121    121     121
AZ   194     194   168       194   194   194     194    194     194
CA  1493    1493  1280      1493  1493  1493    1493   1493    1493
CO   181     181   145       181   181   181     181    181     181
CT   171     171   134       171   171   171     171    171     171
DE    39      39    37        39    39    39      39     39      39
FL   750     750   641       750   750   750     750    750     750
GA   316     316   264       316   316   316     316    316     316
IA   143     143   113       143   143   143     143    143     143
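
The per-state counts above show that `bush` has missing values (for example, 159 answers out of 203 rows for AL), so those rows need handling before counting. Here is one possible sketch on a tiny made-up stand-in for `pre_poll`. The names `toy_poll`, `poll_for` and `poll_against` are hypothetical, and dropping the missing answers is an assumption rather than a requirement of the lab.

```python
import numpy as np
import pandas as pd

# Toy stand-in for pre_poll; these rows are made up for illustration.
toy_poll = pd.DataFrame({
    'st':   ['CT', 'CT', 'CT', 'AL', 'AL', 'AL', 'AL', 'AL'],
    'bush': [1.0, 1.0, 0.0, 1.0, np.nan, 0.0, 1.0, 0.0],
})

# Assumption: respondents with no recorded preference are dropped.
answered = toy_poll.dropna(subset=['bush'])

# bush is 0/1, so the sum is the "for" count and count - sum is "against".
poll_counts = answered.groupby('st')['bush'].agg(['sum', 'count'])
poll_counts['poll_for'] = poll_counts['sum'].astype(int)
poll_counts['poll_against'] = (poll_counts['count'] - poll_counts['sum']).astype(int)
poll_counts = poll_counts[['poll_for', 'poll_against']].reset_index()

print(poll_counts)
```

The same steps should carry over to the full `pre_poll` frame by swapping it in for `toy_poll`.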


--- 

## 3. In the vote data, compute the number of people who did and didn't vote for Bush by state.
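
A minimal sketch of one way to do this, using the three `election88` rows printed in section 1. It assumes that `electionresult` is Bush's share of the vote and that `vote_total` is the number of people sampled; the column names `vote_for` and `vote_against` are hypothetical.

```python
import pandas as pd

# First three rows of election88 as printed earlier in this lab.
toy_votes = pd.DataFrame({
    'st': ['AL', 'AZ', 'AR'],
    'electionresult': [0.59, 0.60, 0.56],
    'vote_total': [203, 194, 121],
})

# Assumption: share * sample size, rounded, approximates the number who voted for Bush.
toy_votes['vote_for'] = (
    (toy_votes['electionresult'] * toy_votes['vote_total']).round().astype(int)
)
toy_votes['vote_against'] = toy_votes['vote_total'] - toy_votes['vote_for']

print(toy_votes[['st', 'vote_for', 'vote_against']])
```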

---

## 4. Merge the poll and vote data together by state
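
One way to combine the two sets of counts is an inner merge on the state abbreviation, which keeps only states present in both tables. The sketch below uses small hypothetical frames: the poll counts are made up, and the vote counts assume `electionresult * vote_total` rounded to whole people.

```python
import pandas as pd

# Hypothetical per-state poll counts (made-up numbers).
poll_counts = pd.DataFrame({
    'st': ['AL', 'AZ', 'CT'],
    'poll_for': [2, 3, 2],
    'poll_against': [2, 1, 1],
})

# Hypothetical per-state vote counts (derived under the assumption above).
vote_counts = pd.DataFrame({
    'st': ['AL', 'AZ', 'AR'],
    'vote_for': [120, 116, 68],
    'vote_against': [83, 78, 53],
})

# Inner merge: CT (poll only) and AR (vote only) drop out.
merged = poll_counts.merge(vote_counts, on='st', how='inner')
merged = merged[['st', 'poll_for', 'poll_against', 'vote_for', 'vote_against']]

print(merged)
```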

---

## 5. Construct a function to plot beta probability distributions based on poll and vote counts

The distributions should be on the same chart.
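
As a reminder of how counts map onto a beta distribution: assuming a uniform Beta(1, 1) prior (an assumption, since the lab does not specify a prior), $s$ respondents for Bush and $f$ against give the following distribution over the true proportion $p$:

$$
p \mid s, f \sim \mathrm{Beta}(s + 1,\; f + 1), \qquad
\pi(p \mid s, f) = \frac{p^{s}(1 - p)^{f}}{B(s + 1,\, f + 1)}
$$

Here $B$ is the beta function. The density peaks at $s / (s + f)$ and narrows as the total count $s + f$ grows.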

In [4]:
from scipy.stats import beta
import matplotlib.pyplot as plt
import seaborn as sns

sns.set_style('white')
%matplotlib inline
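
A minimal sketch of such a function follows. The name `plot_beta_pair` and its parameters are hypothetical, and the `+ 1` on each count assumes a uniform Beta(1, 1) prior; adjust it if you choose a different prior.

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import beta

def plot_beta_pair(poll_for, poll_against, vote_for, vote_against,
                   label='', ax=None):
    """Plot the poll and vote beta densities on the same axes."""
    if ax is None:
        fig, ax = plt.subplots(figsize=(8, 4))
    p = np.linspace(0, 1, 500)
    # Assumption: uniform Beta(1, 1) prior, hence the + 1 on each count.
    ax.plot(p, beta.pdf(p, poll_for + 1, poll_against + 1), label='poll')
    ax.plot(p, beta.pdf(p, vote_for + 1, vote_against + 1), label='vote')
    ax.set_xlabel('proportion for Bush')
    ax.set_ylabel('density')
    ax.set_title(label)
    ax.legend()
    return ax

# Example call; the poll counts here are made up for illustration.
ax = plot_beta_pair(100, 59, 120, 83, label='example state')
```

Returning the axes makes it easy to pass in one subplot per state when you compare several states side by side.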

---

## 6. Select 4 states of your choice and plot the beta distributions


---

## 7. [BONUS] Use bootstrapping to estimate the percent of the voting distribution greater than the polling distribution

Selecting random samples from a beta distribution can be done with:

```python
from numpy.random import beta as random_beta
```

HINT: Draw the same number of random samples from each distribution, then calculate the percentage of draws from the voting distribution that are greater than the corresponding draws from the poll distribution.

In [5]:
from numpy.random import beta as random_beta
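
One possible sketch of the simulation is shown below. It compares draws pairwise, which is one interpretation of the hint, and it assumes a uniform Beta(1, 1) prior. The function name `prob_vote_exceeds_poll`, its parameters and the example counts are all hypothetical.

```python
import numpy as np
from numpy.random import beta as random_beta

def prob_vote_exceeds_poll(poll_for, poll_against, vote_for, vote_against,
                           n_draws=100000, seed=0):
    """Estimate the fraction of vote draws that exceed paired poll draws."""
    np.random.seed(seed)  # fixed seed so the estimate is reproducible
    # Assumption: uniform Beta(1, 1) prior, hence the + 1 on each count.
    poll_draws = random_beta(poll_for + 1, poll_against + 1, size=n_draws)
    vote_draws = random_beta(vote_for + 1, vote_against + 1, size=n_draws)
    # Pairwise comparison: draw i from the vote vs. draw i from the poll.
    return np.mean(vote_draws > poll_draws)

# Made-up counts: identical distributions should give roughly one half.
same = prob_vote_exceeds_poll(50, 50, 50, 50)
# Made-up counts: a clearly higher vote share should give a value near one.
higher = prob_vote_exceeds_poll(40, 60, 60, 40)

print(same, higher)
```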