### Simulating election polling under different scenarios
1. People surveyed didn't change their minds, didn't hide who they voted for, and were representative of those who voted on election day.
2. People with more education were more likely to respond, which biased the poll toward Clinton.

**The goal is to understand how often a poll incorrectly calls the election for Hillary Clinton, both when the sample is collected with no bias at all and when there is a small amount of non-response bias.**

#### Setting up the urn model for the first scenario
1. The urn has 6,165,478 marbles in it, one for each voter.
2. We write on each marble the candidate that voter chose, draw 1,500 marbles from the urn (1,500 is a typical size for these polls), and tally the votes for Trump, Clinton, and any other candidate.

*Since we care only about Trump's lead over Clinton, we can lump together all votes for other candidates.*

**This way, each marble has one of three possible votes: Trump, Clinton, or Other.**

In [1]:
import numpy as np

In [2]:
proportions = np.array([0.4818, 0.4746, 1 - (0.4818 + 0.4746)])
n = 1_500
N = 6_165_478
votes = np.trunc(N * proportions).astype(int)
votes

array([2970527, 2926135,  268814])
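
As a side note on the counts above: truncating each product drops the fractional marbles, so the three counts sum to 6,165,476, two short of `N`. This has no practical effect on the simulation. A minimal sketch of one way to make the urn hold exactly `N` marbles follows; assigning the leftover to "Other" is an assumption made here for illustration, and `votes_exact` is a hypothetical name, not part of the original analysis.

```python
import numpy as np

N = 6_165_478
proportions = np.array([0.4818, 0.4746, 1 - (0.4818 + 0.4746)])

# Same truncation as in the cell above.
votes = np.trunc(N * proportions).astype(int)

# Marbles lost to truncation.
shortfall = N - votes.sum()

# Assumption: put the leftover marbles in the "Other" category so that
# Trump's and Clinton's counts stay exactly as computed above.
votes_exact = votes.copy()
votes_exact[2] += shortfall

print(votes.sum(), shortfall, votes_exact.sum())
```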

Because the urn holds three types of marbles, the counts in a sample follow the __multivariate hypergeometric__ distribution.

In Python, the urn model with more than two types of marbles is implemented by ```scipy.stats.multivariate_hypergeom```.

In [3]:
from scipy.stats import multivariate_hypergeom

In [4]:
multivariate_hypergeom.rvs(votes, n)

array([721, 705,  74])

Writing `nT` and `nC` for the number of Trump and Clinton marbles in a sample, `(nT - nC) / n` is Trump's lead in that sample. If the lead is positive, then the sample shows a win for Trump.

We know the actual lead was 0.4818 - 0.4746 = 0.0072. To get a sense of the variation in the poll, we can simulate the chance process of drawing from the urn over and over and examine the values that we get in return. 
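
Before simulating, the size of that variation can be estimated by hand. The sketch below treats the draws as multinomial, which assumes the finite-population correction is negligible because 1,500 is tiny next to 6,165,478. Under that assumption the variance of `(nT - nC) / n` is `(pT + pC - (pT - pC)**2) / n`. The function name `lead_standard_error` is a hypothetical helper introduced for this illustration.

```python
from math import sqrt


def lead_standard_error(p_trump, p_clinton, n):
    """Approximate standard error of (nT - nC) / n under multinomial sampling."""
    lead = p_trump - p_clinton
    # Var(nT - nC) = n * (pT + pC - (pT - pC)^2); dividing by n^2 gives the
    # variance of the lead expressed as a proportion.
    return sqrt((p_trump + p_clinton - lead ** 2) / n)


se = lead_standard_error(0.4818, 0.4746, 1_500)
print(f"standard error of the lead: {se:.4f}")
print(f"actual lead / standard error: {0.0072 / se:.2f}")
```

The standard error comes out near 0.025, about three and a half times the actual lead of 0.0072, so a single poll of 1,500 voters can easily show the wrong candidate ahead.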

In [5]:
# Trump's lead over Clinton in one simulated poll of n voters:
def trump_advantage(votes, n):
    sample_votes = multivariate_hypergeom.rvs(votes, n)
    return (sample_votes[0] - sample_votes[1]) / n

In [6]:
# Simulate 100,000 polls of 1,500 voters each:
simulations = [trump_advantage(votes, n) for _ in range(100_000)]

In [7]:
np.mean(simulations)

np.float64(0.007351666666666669)

In [9]:
# Fraction of the 100,000 simulated polls that show Trump ahead (about 61%):
np.mean(np.array(simulations) > 0)

np.float64(0.61064)
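
The simulation above calls `rvs` once per poll inside a list comprehension. As an alternative sketch, `rvs` also accepts a `size` argument, so many polls can be drawn in a single call and the lead computed with array operations. The seed (`random_state=42`) and the smaller number of polls (20,000) are choices made for this illustration only; the results will differ slightly from the run above.

```python
import numpy as np
from scipy.stats import multivariate_hypergeom

votes = np.array([2_970_527, 2_926_135, 268_814])  # Trump, Clinton, Other
n = 1_500

# Draw 20,000 polls at once; each row holds one poll's three counts.
samples = multivariate_hypergeom.rvs(votes, n, size=20_000, random_state=42)

# Trump's lead in each poll, computed for all rows together.
leads = (samples[:, 0] - samples[:, 1]) / n

print(leads.mean())        # average lead across the simulated polls
print((leads > 0).mean())  # fraction of polls showing Trump ahead
```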

#### An Urn Model with Bias
To model the second scenario, we examine the impact of a 0.5 percentage point bias in favor of Clinton. Instead of 47.46% of the votes for Clinton, the urn has 47.96%, and Trump's share drops to 48.18 - 0.5 = 47.68%.

In [10]:
# We adjust the proportions of marbles in the urn to reflect this bias:
bias = 0.005
proportions_bias = np.array([0.4818 - bias, 0.4746 + bias, 1 - (0.4818 + 0.4746)])
proportions_bias

array([0.4768, 0.4796, 0.0436])

In [11]:
votes_bias = np.trunc(N * proportions_bias).astype(int)
votes_bias

array([2939699, 2956963,  268814])

In [12]:
simulations_bias = [trump_advantage(votes_bias, n) for _ in range(100_000)]

In [14]:
np.mean(np.array(simulations_bias) > 0)

np.float64(0.44956)

#### Conducting Larger Polls 
Would increasing the sample size have helped?

In [16]:
# Let's try a sample size of 12,000:
simulations_big = [trump_advantage(votes, 12_000) for _ in range(100_000)]
simulations_bias_big = [trump_advantage(votes_bias, 12_000) for _ in range(100_000)]

In [17]:
np.mean(np.array(simulations_big) > 0)

np.float64(0.79137)

In [18]:
np.mean(np.array(simulations_bias_big) > 0)

np.float64(0.37765)

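We haven't overcome the bias; we just have a more accurate picture of the biased situation. With 12,000 respondents, the unbiased poll shows Trump ahead about 79% of the time, while the biased poll shows him ahead only about 38% of the time.
__A larger sample size reduces the sampling error, but if there is bias, the predictions simply concentrate around the biased value.__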
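
The four simulated frequencies can be cross-checked with a normal approximation. The sketch below assumes multinomial sampling (ignoring the finite-population correction) and applies a continuity correction, since a positive lead means the count difference `nT - nC` is at least 1. The function name `prob_poll_shows_trump_lead` is a hypothetical helper, not part of the original analysis.

```python
from math import sqrt
from statistics import NormalDist


def prob_poll_shows_trump_lead(p_trump, p_clinton, n):
    """Normal approximation to P(nT - nC > 0) for a poll of size n."""
    lead = p_trump - p_clinton
    mean_diff = n * lead  # expected value of nT - nC
    sd_diff = sqrt(n * (p_trump + p_clinton - lead ** 2))
    # nT - nC is an integer, so "> 0" means ">= 1"; the continuity
    # correction places the cutoff halfway, at 0.5.
    return 1 - NormalDist(mean_diff, sd_diff).cdf(0.5)


for n in (1_500, 12_000):
    unbiased = prob_poll_shows_trump_lead(0.4818, 0.4746, n)
    biased = prob_poll_shows_trump_lead(0.4768, 0.4796, n)
    print(f"n = {n:>6}: unbiased {unbiased:.3f}, biased {biased:.3f}")
```

The approximate probabilities should land close to the simulated ones, and they show the same pattern: increasing `n` pushes the unbiased poll toward the right call and the biased poll further toward the wrong one.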