# Margin of Error

In sampling experiments (e.g. election polling or census data), the data are often presented as percentages.  We can think of these percentages as measured probabilities of a certain outcome in the experiment.

The question is:  What is the uncertainty in the measured probability?  That is, what is the uncertainty in the mean value of the probability?

The answer to this question depends on three things:

1.  What is the sample size, N?
2.  What is the probability of the outcome, p?
3.  How confident do we need to be in reporting our result (indicated by $\alpha$)?

The margin of error (i.e. the uncertainty in the mean) is defined by:

$MOE = z_\gamma \times \sqrt{\frac{\sigma^2}{N}}$

where $z_\gamma$ is the two-sided z value associated with the confidence level ($1-\alpha$) that we have chosen, and $\sigma^2$ is the variance of the measured probability distribution.
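
As a quick illustration of how $z_\gamma$ depends on the chosen confidence level, the sketch below tabulates it for a few common values of $\alpha$. It uses `statistics.NormalDist().inv_cdf` from the Python standard library, which plays the same role as `scipy.stats.norm.ppf`; the helper name `z_for_alpha` is ours and is not part of the original notebook.

```python
from statistics import NormalDist

def z_for_alpha(alpha):
    """Two-sided z value: leaves alpha/2 in each tail of the standard normal."""
    return NormalDist().inv_cdf(1 - alpha / 2)

for alpha in (0.10, 0.05, 0.01):
    confidence = 100 * (1 - alpha)
    print("alpha = %0.2f  confidence = %0.0f%%  z = %0.3f" % (alpha, confidence, z_for_alpha(alpha)))
```

A smaller $\alpha$ (a higher confidence level) gives a larger z value, and therefore a wider margin of error for the same data.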

For a Bernoulli distribution (a binomial distribution with a single trial), $\sigma^2 = p(1-p)$.
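
To see how the three ingredients (N, p, and $\alpha$) combine, here is a minimal sketch of the formula as a function. It assumes the normal approximation and the variance $p(1-p)$ given above; the name `margin_of_error` and the sample sizes in the loop are our own choices for illustration.

```python
from math import sqrt
from statistics import NormalDist

def margin_of_error(p, n, alpha=0.05):
    """Normal-approximation margin of error for a measured proportion p from n samples."""
    z = NormalDist().inv_cdf(1 - alpha / 2)   # two-sided z value
    variance = p * (1 - p)                    # variance of a single yes/no outcome
    return z * sqrt(variance / n)

# Quadrupling the sample size halves the margin of error (MOE ~ 1/sqrt(N)).
for n in (250, 1000, 4000):
    print("N = %4d  MOE = %0.4f" % (n, margin_of_error(0.5, n)))
```

The choice p = 0.5 in the loop is the worst case, since $p(1-p)$ is largest there.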

In [2]:
import numpy as np
import matplotlib.pyplot as plt
import scipy.stats as stats

In [2]:
# Example:  On August 3rd, 2020, it was reported that of the 982 new COVID-19 cases in Virginia for that day, 43%
#           were in the Hampton Roads region.

p = 0.43
N = 982
alpha = 0.05

confidence_level = 1 - alpha

z_gamma = stats.norm.ppf(confidence_level+alpha/2) #We expect alpha/2 both above and below the confidence interval.

sigma2 = p*(1-p)

MOE = z_gamma * np.sqrt(sigma2/N)

print ("Measured probability = %0.3f +/- %0.3f" % (p,MOE)) # MOE is already the absolute uncertainty in p

Measured probability = 0.430 +/- 0.031


What value do we expect?  To estimate this, we would need to know the population of Virginia, and the population of Hampton Roads.  The former is 8.536 million, as of 2019, and the latter is 1.78 million, as of this year.

Thus, the expected probability, based only on population, is $p_{theory} = 1.78/8.536 \approx 0.209$.
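
To put a number on how far the measured 43% sits from the population-based expectation, one option is to express the difference in units of the standard error. The sketch below does this under the assumption that each of the 982 cases is an independent draw with probability $p_{theory}$; the variable names are ours.

```python
from math import sqrt

p_measured = 0.43          # fraction of new cases in Hampton Roads
N = 982                    # new cases reported that day
p_theory = 1.78 / 8.536    # population share of Hampton Roads

# Standard error of the proportion if p_theory were the true probability
standard_error = sqrt(p_theory * (1 - p_theory) / N)

# Number of standard errors separating the measurement from the expectation
z_score = (p_measured - p_theory) / standard_error

print("p_theory = %0.3f, standard error = %0.3f, z = %0.1f" % (p_theory, standard_error, z_score))
```

A z score far above the value of about 1.96 used for a 95% confidence level would suggest that population share alone does not explain the case share, at least under the independence assumption.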

# Sensitivity and Specificity

Sensitivity = the probability of a positive test, given that a positive test is expected (the condition is present).

Specificity = the probability of a negative test, given that a negative test is expected (the condition is absent).

In an ideal world, both of these probabilities would be 1.  But, the world is not ideal.

Think of medical testing.  If one has the coronavirus, one would expect that a test would come back positive.  If one does not have the coronavirus, one would expect that the test would come back negative.

False positive rate = the probability of a positive test, given that a negative test is expected.

False negative rate = the probability of a negative test, given that a positive test is expected.

False positives are related to $\alpha$ (Type I error): there is not a problem, but we say that there is.

False negatives are related to $\beta$ (Type II error): there is a problem, but we say that there is not.

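In a screening setting, a false alarm can usually be caught by a follow-up test, whereas a missed case goes undetected.  This is why Type I errors ($\alpha$) are generally more tolerable than Type II errors ($\beta$).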
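
The four quantities defined above can be computed directly from a table of test outcomes. The following is a minimal sketch; the function name `diagnostic_rates` and the counts passed to it are made up for illustration and do not come from a real test.

```python
def diagnostic_rates(true_pos, false_neg, true_neg, false_pos):
    """Return sensitivity, specificity, false-positive rate, and false-negative rate."""
    has_condition = true_pos + false_neg   # cases where a positive test is expected
    no_condition = true_neg + false_pos    # cases where a negative test is expected
    return {
        "sensitivity": true_pos / has_condition,
        "specificity": true_neg / no_condition,
        "false_positive_rate": false_pos / no_condition,    # = 1 - specificity
        "false_negative_rate": false_neg / has_condition,   # = 1 - sensitivity
    }

# Made-up counts, for illustration only
rates = diagnostic_rates(true_pos=90, false_neg=10, true_neg=950, false_pos=50)
for name, value in rates.items():
    print("%s = %0.3f" % (name, value))
```

Note that sensitivity and the false negative rate always sum to 1, as do specificity and the false positive rate.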

In [31]:
N_tests = 100000
alpha = 0.10 # false-positive rate (Type I): P(positive test | no disease)
beta = 0.01  # false-negative rate (Type II): P(negative test | disease)

# Expected probabilities (fraction of tested people with / without the disease)
p_positive = 0.30
p_negative = 0.70

N_true_positive = N_tests*p_positive*(1-beta)
N_false_negative = N_tests*p_positive*beta

N_true_negative = N_tests*p_negative*(1-alpha)
N_false_positive = N_tests*p_negative*alpha

print (N_true_positive,N_false_positive,N_true_negative,N_false_negative)

# probability of the person having the disease, given that the test is positive
pBA = N_true_positive/(N_true_positive+N_false_positive)

print (pBA)

29700.0 7000.0 63000.0 300.0
0.8092643051771117


These ideas are well described theoretically by Bayes' Theorem, which is related to conditional probabilities:

$P(A|B) = \frac{P(B|A) \cdot P(A)}{P(B)}$

where $P(A|B)$ is the probability of event A, given condition B, $P(B|A)$ is the probability of event B, given condition A, and $P(A)$ and $P(B)$ are the probabilities of events A and B, respectively.

In [33]:
# event A = a positive test
# event B = the person has the disease

# probability of person having the disease
pB = p_positive

# probability of a positive test, given that the person has the disease
pAB = 1 - beta

# probability of a positive test (true positives plus false positives)
pA = p_positive*(1 - beta) + p_negative*alpha

# probability of the person having the disease, given that the test is positive
pBA = pAB*pB/pA

print (pBA)

0.8092643051771117
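
Bayes' Theorem also shows how strongly this answer depends on how common the disease is. The sketch below wraps the calculation in a function and evaluates it for the sensitivity (0.99) and specificity (0.90) implied by the example above; the name `positive_predictive_value` and the list of prevalences are our choices.

```python
def positive_predictive_value(prevalence, sensitivity, specificity):
    """P(disease | positive test) from Bayes' Theorem."""
    p_pos_given_disease = sensitivity
    p_pos_given_healthy = 1 - specificity
    # Total probability of a positive test (true positives plus false positives)
    p_positive_test = (prevalence * p_pos_given_disease
                       + (1 - prevalence) * p_pos_given_healthy)
    return p_pos_given_disease * prevalence / p_positive_test

for prevalence in (0.01, 0.05, 0.30):
    ppv = positive_predictive_value(prevalence, sensitivity=0.99, specificity=0.90)
    print("prevalence = %0.2f  P(disease | positive) = %0.3f" % (prevalence, ppv))
```

At a prevalence of 0.30 this reproduces the value computed above. At lower prevalence, false positives from the much larger healthy group make up a growing share of the positive results.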


In [15]:
n1 = 45
N = 1200
n2 = int(0.029*N)

alpha = 0.05
confidence_level = 1 - alpha
z_gamma = stats.norm.ppf(confidence_level+alpha/2) #We expect alpha/2 both above and below the confidence interval.
print (z_gamma)

p1 = n1/N
sigma1 = p1*(1-p1) # variance of the first measured probability
MOE = z_gamma * np.sqrt(sigma1/N)

# MOE is already the absolute uncertainty in p1, so it is only scaled to percent
print ("Measured probability = (%0.2f +/- %0.2f) percent" % (p1*100,MOE*100))

p2 = n2/N
sigma2 = p2*(1-p2) # variance of the second measured probability
MOE = z_gamma * np.sqrt(sigma2/N)

print ("Measured probability = (%0.2f +/- %0.2f) percent" % (p2*100,MOE*100))

1.959963984540054
Measured probability = (3.75 +/- 1.07) percent
Measured probability = (2.83 +/- 0.94) percent
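
If the goal is to decide whether two such proportions differ, one rough approach is to compare their difference with its own margin of error. The sketch below assumes that the two counts come from independent samples of the same size N = 1200, which the notebook does not state; the variable names are ours.

```python
from math import sqrt
from statistics import NormalDist

N = 1200
n1, n2 = 45, 34            # counts from the cell above (34 = int(0.029 * N))
p1, p2 = n1 / N, n2 / N

z = NormalDist().inv_cdf(1 - 0.05 / 2)   # 95% confidence level

# The variance of a difference of independent proportions is the sum of the variances
se_diff = sqrt(p1 * (1 - p1) / N + p2 * (1 - p2) / N)
moe_diff = z * se_diff

diff = p1 - p2
print("Difference = (%0.2f +/- %0.2f) percent" % (diff * 100, moe_diff * 100))
print("Consistent with no difference:", abs(diff) < moe_diff)
```

When the difference is smaller than its margin of error, the data at this sample size cannot distinguish the two proportions at the chosen confidence level.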
