# CS146 Session 7.1 Pre-Class Work

In the reading on <a href="https://ropercenter.cornell.edu/support/polling-fundamentals-total-survey-error/">Total Survey Error</a> by the Roper Center, there is a table of 95% confidence intervals for the sampling error of percentage values in survey results. This margin of error depends on both the number of people surveyed (the sample size) and the observed outcome for a particular candidate (as a percentage). It turns out there is an error in this table.

1. Using the normal approximation to the binomial distribution, confirm that the 95% confidence interval for the sampling error for sample size 1000 and percentage outcome 10% is 2% (rounded to the nearest integer). Motivate why it is appropriate to use the binomial distribution here.

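The binomial distribution is appropriate here because each respondent either supports the candidate or does not, and we assume respondents answer independently with the same probability $p$, so the number of supporters in a sample of size $n$ is binomially distributed.

For this question, we will need the standardized sample mean from the central limit theorem:

<center>$Z = \frac{\bar{X}-\mu}{\sigma/\sqrt{n}}$</center>
<br>
As well as the formula for the margin of error:
<br>
<center>$z\times\sqrt{\frac{p(1-p)}{n}}$</center>
<br>
Where:

- $p$ = observed outcome as a proportion (for example, 0.1 for 10%)
- $n$ = sample size
- $z$ = z-score for the chosen confidence level (approximately 1.96 for 95%)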
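
As a cross-check on the formulas above, the margin of error can also be evaluated directly, without any simulation. The sketch below is a minimal illustration that assumes the standard library's `statistics.NormalDist` for the critical value; the variable names `z`, `margin`, and `margin_pct` are illustrative and not part of the original notebook.

```python
import math
from statistics import NormalDist

n = 1000  # sample size
p = 0.1   # observed outcome as a proportion

# Two-sided 95% interval: 2.5% in each tail, so use the 97.5th percentile
z = NormalDist().inv_cdf(0.975)  # approximately 1.96

# Margin of error as a proportion, then converted to percentage points
margin = z * math.sqrt(p * (1 - p) / n)
margin_pct = margin * 100

print(f"{margin_pct:.2f}%")  # about 1.86%, which rounds to 2%
```

The result should agree with the simulated interval up to sampling noise, since both rely on the same normal approximation.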

In [41]:
import numpy as np
import pandas as pd

In [42]:
# Define the parameters of the binomial distribution for the sampling error
# for sample size 1000 and percentage outcome 10%

n = 1000 # sample size - the number of people sampled
p = 0.1  # percentage outcome - the proportion of people who gave a positive response


# Use the normal distribution to approximate the binomial distribution.
# First, derive the parameters of the normal distribution from those of the binomial distribution.
mu = n*p # mean (expected number of positive responses): sample size times the proportion
stdev = np.sqrt(n*p*(1-p)) # std. deviation: square root of the binomial variance n*p*(1-p)

# Now, draw 1,000,000 random samples from the normal distribution.
norm_dist = np.random.normal(mu, stdev, 1000000)

# To get the 95% confidence interval for the sampling error, take the 2.5th and 97.5th
# percentiles of the simulated counts, subtract the mean, and divide by n (times 100)
# to convert counts into percentage points.
ci_95 = [(np.percentile(norm_dist, 2.5)-mu)/n*100, (np.percentile(norm_dist, 97.5)-mu)/n*100]

print(ci_95)

[-1.8598530771521182, 1.8617609969851259]


2. Write a Python function for calculating the 95% confidence interval given any sample size and any percentage outcome. Use your function to calculate all the values in the Total Survey Error table rounded to the nearest integer. For which entries does your margin of error differ from the value in the table?

In [43]:
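percentage_list = []  # upper bound of each interval, in the order the function is called

def binom_to_norm(n, p):
    # Normal approximation to Binomial(n, p): same mean and standard deviation
    mu = n*p
    stdev = np.sqrt(n*p*(1-p))
    norm_dist = np.random.normal(mu, stdev, 1000000)
    # 2.5th and 97.5th percentiles, recentred on the mean and converted to percentage points
    ci_95 = [(np.percentile(norm_dist, 2.5)-mu)/n*100, (np.percentile(norm_dist, 97.5)-mu)/n*100]
    percentage_list.append(ci_95[1])  # stored unrounded for the summary table
    return (n, p, ['%.0f' % elem for elem in ci_95])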

sample_size = [100,250,500,750,1000]
percentage_outcome = [0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9]

for n in sample_size:
    for p in percentage_outcome:
        print(binom_to_norm(n,p))

(100, 0.1, ['-6', '6'])
(100, 0.2, ['-8', '8'])
(100, 0.3, ['-9', '9'])
(100, 0.4, ['-10', '10'])
(100, 0.5, ['-10', '10'])
(100, 0.6, ['-10', '10'])
(100, 0.7, ['-9', '9'])
(100, 0.8, ['-8', '8'])
(100, 0.9, ['-6', '6'])
(250, 0.1, ['-4', '4'])
(250, 0.2, ['-5', '5'])
(250, 0.3, ['-6', '6'])
(250, 0.4, ['-6', '6'])
(250, 0.5, ['-6', '6'])
(250, 0.6, ['-6', '6'])
(250, 0.7, ['-6', '6'])
(250, 0.8, ['-5', '5'])
(250, 0.9, ['-4', '4'])
(500, 0.1, ['-3', '3'])
(500, 0.2, ['-3', '4'])
(500, 0.3, ['-4', '4'])
(500, 0.4, ['-4', '4'])
(500, 0.5, ['-4', '4'])
(500, 0.6, ['-4', '4'])
(500, 0.7, ['-4', '4'])
(500, 0.8, ['-4', '4'])
(500, 0.9, ['-3', '3'])
(750, 0.1, ['-2', '2'])
(750, 0.2, ['-3', '3'])
(750, 0.3, ['-3', '3'])
(750, 0.4, ['-4', '4'])
(750, 0.5, ['-4', '4'])
(750, 0.6, ['-3', '4'])
(750, 0.7, ['-3', '3'])
(750, 0.8, ['-3', '3'])
(750, 0.9, ['-2', '2'])
(1000, 0.1, ['-2', '2'])
(1000, 0.2, ['-2', '2'])
(1000, 0.3, ['-3', '3'])
(1000, 0.4, ['-3', '3'])
(1000, 0.5, ['-3', '3'])
(1000
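
A few of the simulated intervals above are asymmetric after rounding, for example `['-3', '4']` at n = 500, p = 0.2 and at n = 750, p = 0.6. For those entries the margin of error is roughly 3.5 percentage points, so Monte Carlo noise can push either bound across the rounding boundary. One way to avoid this, sketched here under the same normal approximation, is to compute the margin in closed form. The function name `margin_of_error` and the `table` dictionary are hypothetical helpers, not part of the original notebook.

```python
from math import sqrt
from statistics import NormalDist

def margin_of_error(n, p, confidence=0.95):
    """Half-width of the normal-approximation interval, in percentage points."""
    # Critical value for a two-sided interval at the given confidence level
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    return 100 * z * sqrt(p * (1 - p) / n)

sample_size = [100, 250, 500, 750, 1000]
percentage_outcome = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]

# Rounded margin of error for every (sample size, outcome) pair
table = {(n, p): round(margin_of_error(n, p))
         for n in sample_size
         for p in percentage_outcome}

for n in sample_size:
    print(n, [table[(n, p)] for p in percentage_outcome])
```

Because no random draws are involved, repeated runs give identical values, which makes the comparison against the published table easier to reproduce.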

In [49]:
df = pd.DataFrame({ 'Percentage Outcome' : np.array([0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9]),
                    'n = 1000' : np.array([round(i) for i in percentage_list[-9:]]),
                    'n = 750' : np.array([round(i) for i in percentage_list[27:36]]),
                    'n = 500' : np.array([round(i) for i in percentage_list[18:27]]),
                    'n = 250' : np.array([round(i) for i in percentage_list[9:18]]),
                    'n = 100' : np.array([round(i) for i in percentage_list[0:9]]), })
df

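| | Percentage Outcome | n = 1000 | n = 750 | n = 500 | n = 250 | n = 100 |
|---|---|---|---|---|---|---|
| 0 | 0.1 | 2.0 | 2.0 | 3.0 | 4.0 | 6.0 |
| 1 | 0.2 | 2.0 | 3.0 | 4.0 | 5.0 | 8.0 |
| 2 | 0.3 | 3.0 | 3.0 | 4.0 | 6.0 | 9.0 |
| 3 | 0.4 | 3.0 | 4.0 | 4.0 | 6.0 | 10.0 |
| 4 | 0.5 | 3.0 | 4.0 | 4.0 | 6.0 | 10.0 |
| 5 | 0.6 | 3.0 | 4.0 | 4.0 | 6.0 | 10.0 |
| 6 | 0.7 | 3.0 | 3.0 | 4.0 | 6.0 | 9.0 |
| 7 | 0.8 | 2.0 | 3.0 | 4.0 | 5.0 | 8.0 |
| 8 | 0.9 | 2.0 | 2.0 | 3.0 | 4.0 | 6.0 |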


3. Can you identify where these errors come from?

Not exactly sure...