In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

**Bernoulli Sampling**

Draws **r** replications of sample size **n** from a $Bernoulli(p)$ distribution, then calculates the sample mean, sample standard deviation, and 95% confidence interval for each replication. It also records whether the true mean $p$ of the distribution lies within each calculated confidence interval.

The equation for the 95% confidence interval is given by: 

$CI = (\bar{Y_{r}} - \dfrac{1.96s_{r}}{n^{1/2}},\bar{Y_{r}} + \dfrac{1.96s_{r}}{n^{1/2}}) $

where $\bar{Y_{r}}$ is the sample mean, $s_{r}$ is the sample standard deviation, and $n$ is the size of the sample.
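
As a small worked instance of the formula above, the sketch below computes the interval for one hypothetical sample of size n = 5. The sample values are made up for illustration, and the standard deviation is assumed to be the plug-in estimate $\sqrt{\bar{Y}(1-\bar{Y})}$ for 0/1 data.

```python
import numpy as np

# hypothetical sample of n = 5 Bernoulli draws (illustrative values only)
example_sample = np.array([0, 1, 0, 0, 1])
n_example = len(example_sample)

# sample mean and plug-in standard deviation for 0/1 data (an assumption)
y_bar = example_sample.mean()
s = np.sqrt(y_bar * (1 - y_bar))

# half-width and endpoints of the 95% interval
half_width = 1.96 * s / np.sqrt(n_example)
ci_lower, ci_upper = y_bar - half_width, y_bar + half_width

print(f"mean = {y_bar:.3f}, s = {s:.4f}, CI = ({ci_lower:.4f}, {ci_upper:.4f})")
```

For this sample the lower endpoint comes out slightly negative even though a probability cannot be, which hints at how rough the normal approximation can be for very small n.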

In [2]:
# define initial parameters
n = 5       # sample size
p = 0.1     # success probability
r = 1000    # replications 

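def bernoulli_samples(n, p, r):
    """Simulate r samples of size n from Bernoulli(p).

    Returns the mean of the sample means, the mean of the sample standard
    deviations, and the fraction of 95% confidence intervals containing p.
    """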
    sample_means = [] 
    sample_stds = [] 
    confidence_intervals = [] 
    inside_interval = [] 
    
    
    # run each replication 
    for trial in range(r):
        # draw the sample
        sample = np.random.binomial(1,p,n)
        
        # calculate sample mean
        mu = np.mean(sample)
        sample_means.append(mu)
        
        # calculate sample std. deviation (for 0/1 data this plug-in form equals np.std(sample))
        sigma = np.sqrt(mu*(1-mu))
        sample_stds.append(sigma)
        
        # calculate the sample 95% confidence interval 
        ci = [mu - 1.96*sigma/np.sqrt(n), mu + 1.96*sigma/np.sqrt(n)]
        confidence_intervals.append(ci) 
        
        # determine if true mean is inside the calculated interval 
        inside = 1 if ci[0] <= p <= ci[1] else 0
        inside_interval.append(inside) 
        

    # determine proportion of confidence intervals with true mean inside interval 
    true_prop = sum(inside_interval)/len(inside_interval)
    
    return np.mean(sample_means), np.mean(sample_stds), true_prop
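
The same computation can also be written without the Python loop. The following is only a sketch: the name `bernoulli_samples_vectorized` and the `seed` argument are assumptions added here so that runs are reproducible, and they are not part of the function above.

```python
import numpy as np

def bernoulli_samples_vectorized(n, p, r, seed=0):
    """Hypothetical vectorized variant; the name and seed argument are assumptions."""
    rng = np.random.default_rng(seed)
    # one row per replication, one column per draw
    samples = rng.binomial(1, p, size=(r, n))
    means = samples.mean(axis=1)
    # plug-in standard deviation for 0/1 data, as in the loop version
    stds = np.sqrt(means * (1 - means))
    half_widths = 1.96 * stds / np.sqrt(n)
    # True where the true mean p falls inside that replication's interval
    inside = (means - half_widths <= p) & (p <= means + half_widths)
    return means.mean(), stds.mean(), inside.mean()

p_hat, sigma_hat, coverage = bernoulli_samples_vectorized(30, 0.25, 1000)
print(p_hat, sigma_hat, coverage)
```

Fixing the seed makes a run repeatable, which helps when comparing results across sample sizes.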

We can use the above function to evaluate the claim that a sample size of 30 is enough to ensure that asymptotic confidence intervals work well. To do this, we compute the mean estimate of $p$, the mean estimate of the standard deviation, and the fraction of confidence intervals that contain $p$, and compare them across sample sizes.

In [4]:
param_n = [10, 30, 60, 10, 30, 60]
param_p = [0.05, 0.05, 0.05, 0.25, 0.25, 0.25]


# calculate estimates for each (n, p) pair using 1000 replications
results = [bernoulli_samples(n_i, p_i, 1000) for n_i, p_i in zip(param_n, param_p)]

# compile results
p_hats = [res[0] for res in results]
sigma_hats = [res[1] for res in results]
inside_interval = [res[2] for res in results]

In [5]:
# display results 
results_df = pd.DataFrame({'n': param_n, 'p': param_p, 'Estimate of p': p_hats, 
                           'Estimate of std. dev.': sigma_hats, 'Fraction inside CI': inside_interval})

results_df

| | n | p | Estimate of p | Estimate of std. dev. | Fraction inside CI |
|---|---|---|---|---|---|
| 0 | 10 | 0.05 | 0.051 | 0.132699 | 0.411 |
| 1 | 30 | 0.05 | 0.048867 | 0.183274 | 0.791 |
| 2 | 60 | 0.05 | 0.050183 | 0.207136 | 0.808 |
| 3 | 10 | 0.25 | 0.2481 | 0.394296 | 0.933 |
| 4 | 30 | 0.25 | 0.2501 | 0.422194 | 0.929 |
| 5 | 60 | 0.25 | 0.25175 | 0.429318 | 0.949 |


Looking at the results above, there is not much difference between n = 30 and n = 60 in the fraction of confidence intervals that contain the true mean (0.791 vs. 0.808 for p = 0.05, and 0.929 vs. 0.949 for p = 0.25). The estimates of p and of the standard deviation are also very similar for n = 30 and n = 60. By comparison, the estimated standard deviation at n = 10 differs more from its value at n = 60, and for p = 0.05 only 0.411 of the intervals at n = 10 contain the true mean. Overall, this suggests that there is little additional benefit from using values of n much larger than 30 when estimating the true mean. On that basis, I would agree with the claim that an n of 30 or larger is enough for asymptotic confidence intervals to work well, with the caveat that for p = 0.05 the fraction inside the interval at both n = 30 and n = 60 remains well below the nominal 0.95.
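
The simulated fractions can be cross-checked without random draws, because the number of successes in a sample follows a Binomial(n, p) distribution. The sketch below sums the binomial probabilities of every success count whose interval contains p. The function name `exact_coverage` is hypothetical, and it assumes the same plug-in standard deviation and 1.96 multiplier used in the simulation.

```python
from math import comb, sqrt

def exact_coverage(n, p, z=1.96):
    """Exact coverage of the interval mean +/- z*s/sqrt(n) for Bernoulli(p) data."""
    total = 0.0
    for k in range(n + 1):
        # interval that would be computed if the sample had k successes
        y_bar = k / n
        s = sqrt(y_bar * (1 - y_bar))
        half_width = z * s / sqrt(n)
        if y_bar - half_width <= p <= y_bar + half_width:
            # add the probability of observing exactly k successes
            total += comb(n, k) * p**k * (1 - p)**(n - k)
    return total

for n_i, p_i in [(10, 0.05), (30, 0.05), (60, 0.05), (10, 0.25), (30, 0.25), (60, 0.25)]:
    print(n_i, p_i, round(exact_coverage(n_i, p_i), 3))
```

For n = 10 and p = 0.05 this gives a coverage of about 0.40, in line with the simulated 0.411. Most of the shortfall comes from samples with zero successes, where the interval collapses to the single point 0.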