# Module 14: Inference for Two Population Proportions

In this module, we look at how to compare two populations, each of which has its own (possibly different) proportion of members with some property. In particular, we want to study the difference between these two population proportions, usually with a null hypothesis that the difference is zero.

## Confidence Intervals

We can construct a confidence interval for the difference between these proportions. Our confidence interval takes the usual form: estimate $\pm$ critical value $\times$ standard error. We estimate the difference between the population proportions using the difference between our sample proportions. Our standard error is the square root of the sum of the variances (the squared standard errors) that we would use for each sample proportion separately. Our critical value comes from a standard normal distribution.
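Written out in symbols, the interval described above looks roughly as follows. The notation is introduced here for illustration and is not from the module itself: $\hat{p}_1$ and $\hat{p}_2$ are the two sample proportions, $n_1$ and $n_2$ are the two sample sizes, and $z^*$ is the standard normal critical value.

$$
(\hat{p}_1 - \hat{p}_2) \;\pm\; z^* \sqrt{\frac{\hat{p}_1(1-\hat{p}_1)}{n_1} + \frac{\hat{p}_2(1-\hat{p}_2)}{n_2}}
$$

Each term under the square root is the estimated variance of one sample proportion, so the variances add and the square root is taken once at the end.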

For example, suppose we ask 50 Canadians and 50 Americans whether they prefer coffee or tea. The results of this survey are given in the following code cell. We create a $95\%$ confidence interval for the difference between the proportions of Canadians and Americans who prefer coffee over tea.

In [None]:
n = 50 #Sample size (same for both countries)
coffee.Can = 26 #Number of Canadians that prefer coffee
coffee.US = 34 #Number of Americans that prefer coffee

#Estimate the difference in population proportions
p.hat.Can = coffee.Can/n #Proportion of Canadians that prefer coffee
p.hat.US = coffee.US/n #Proportion of Americans that prefer coffee
diff = p.hat.Can - p.hat.US

#Get the standard error of the difference in sample proportions
var.Can = p.hat.Can*(1-p.hat.Can)/n #Variance of p.hat.Can
var.US = p.hat.US*(1-p.hat.US)/n #Variance of p.hat.US
SE = sqrt(var.Can + var.US)

#Construct our confidence interval
Z.star = qnorm(p=0.975)
MOE = Z.star * SE
lcl = diff - MOE
ucl = diff + MOE

#Print our results
print("Confidence interval is from")
print(lcl)
print("to")
print(ucl)

*R Tip: Notice that we split up our program into segments. This is a helpful way to stay organized when you're writing longer programs. Remember to add a comment at the start of each segment to remind you what that segment does. It is also often helpful to print some of the intermediate results to make sure they are being calculated the way you want.*
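One way to check the confidence interval from the code cell above is to compare it with R's built-in `prop.test()`. The sketch below assumes the same survey counts; the variable names (`coffee.counts`, `sample.sizes`, `ci.check`) are made up for this illustration. Setting `correct = FALSE` turns off the continuity correction, which should make the built-in interval agree with the hand calculation.

```r
# Survey counts from the Canada/US example (names here are illustrative)
coffee.counts <- c(Can = 26, US = 34)  # Number of people who prefer coffee
sample.sizes  <- c(Can = 50, US = 50)  # Number of people surveyed

# correct = FALSE disables the continuity correction, so the interval is
# estimate +/- critical value * standard error, as in the manual calculation
check <- prop.test(x = coffee.counts, n = sample.sizes,
                   conf.level = 0.95, correct = FALSE)

# Extract the lower and upper confidence limits as plain numbers
ci.check <- as.numeric(check$conf.int)
print(ci.check)
```

If the two intervals disagree, the intermediate values (the sample proportions and the standard error) are a good place to start looking for the discrepancy.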

## Hypothesis Tests

We might also want to test hypotheses about the difference between two population proportions. The usual null hypothesis is that both populations have the same proportion. An equivalent way to state this null hypothesis is that the difference between the population proportions is zero. Our alternative hypothesis can be that the difference between the population proportions is greater than zero, less than zero or not equal to zero.

Our test statistic is calculated in the following way: estimated difference / standard error. We don't need to subtract a null hypothesis value because our null hypothesis says that the difference is zero.

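We calculate our standard error using a 'pooled proportion', $\hat{p}$. Because the null hypothesis says that both populations share the same proportion, we estimate that common proportion by combining the two samples. This $\hat{p}$ is the total number of individuals with our property of interest divided by the total number of individuals surveyed across both populations.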

Our standard error is: $\sqrt{\hat{p}(1-\hat{p})\left(\frac{1}{n_1} + \frac{1}{n_2}\right)}$, where $n_1$ is the size of the sample from population 1 and $n_2$ is the size of the sample from population 2.
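Putting the pieces above together, the pooled proportion and the test statistic can be written as follows. The symbols $x_1$ and $x_2$ (the number of individuals with the property in each sample), $\hat{p}_1$ and $\hat{p}_2$ (the two sample proportions), and $z$ (the test statistic) are notation assumed here for illustration.

$$
\hat{p} = \frac{x_1 + x_2}{n_1 + n_2}, \qquad
z = \frac{\hat{p}_1 - \hat{p}_2}{\sqrt{\hat{p}(1-\hat{p})\left(\frac{1}{n_1} + \frac{1}{n_2}\right)}}
$$

Under the null hypothesis, $z$ is compared with a standard normal distribution to obtain the p-value.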

Suppose that we repeat our study comparing tea drinkers and coffee drinkers, but this time we survey 50 people from China and 60 people from Brazil. Let's test the hypothesis that there is no difference in the proportion of coffee drinkers across these two countries.

In [None]:
n.Ch = 50 #Number of Chinese people surveyed
n.Br = 60 #Number of Brazilian people surveyed
coffee.Ch = 3 #Number of Chinese people that prefer coffee
coffee.Br = 56 #Number of Brazilian people that prefer coffee

#Estimate the difference in population proportions
p.hat.Ch = coffee.Ch/n.Ch #Proportion of Chinese people that prefer coffee
p.hat.Br = coffee.Br/n.Br #Proportion of Brazilian people that prefer coffee
diff = p.hat.Ch - p.hat.Br
print("Difference is:")
print(diff)

#Get the standard error of the difference in sample proportions
p.pooled = (coffee.Ch + coffee.Br)/(n.Ch + n.Br) #Pooled proportion of coffee drinkers
SE = sqrt(p.pooled*(1-p.pooled)*((1/n.Ch) + (1/n.Br)))
print("Standard error is:")
print(SE)

#Calculate our test statistic and p-value
stat = diff/SE
p.value = 2*pnorm(abs(stat), lower.tail=FALSE) #Two-sided p-value; FALSE is safer than F, which can be overwritten
print("P-value is:")
print(p.value)

This p-value is very small, so we reject the null hypothesis and conclude that there is a difference in the proportion of people who prefer coffee over tea between China and Brazil. In fact, according to a 2012 survey, Brazil is one of the most coffee-preferring countries, and China is one of the least.
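The steps in the code cell above can be wrapped into a small function that also handles the one-sided alternatives mentioned earlier. This is only a sketch: `two.prop.z.test` and its arguments are hypothetical names invented for this example, not part of R or of this module. The built-in `prop.test()` (with `correct = FALSE`) is used at the end as a cross-check; its chi-squared statistic should equal the square of our test statistic.

```r
# Hypothetical helper (not a built-in): pooled two-proportion z-test
# x1, x2 = number of individuals with the property in each sample
# n1, n2 = sample sizes
two.prop.z.test <- function(x1, n1, x2, n2,
                            alternative = c("two.sided", "less", "greater")) {
  alternative <- match.arg(alternative)

  # Estimate the difference in population proportions
  diff.hat <- x1/n1 - x2/n2

  # Standard error based on the pooled proportion
  p.pooled <- (x1 + x2)/(n1 + n2)
  SE <- sqrt(p.pooled*(1 - p.pooled)*((1/n1) + (1/n2)))

  # Test statistic and p-value for the chosen alternative
  stat <- diff.hat/SE
  p.value <- switch(alternative,
    two.sided = 2*pnorm(abs(stat), lower.tail = FALSE),
    less      = pnorm(stat),                      # H1: difference < 0
    greater   = pnorm(stat, lower.tail = FALSE))  # H1: difference > 0

  list(statistic = stat, p.value = p.value)
}

# China (3 of 50) versus Brazil (56 of 60), two-sided test
result <- two.prop.z.test(3, 50, 56, 60)
print(result)

# Cross-check against R's built-in test without continuity correction
builtin <- prop.test(x = c(3, 56), n = c(50, 60), correct = FALSE)
print(builtin$p.value)
```

Passing `alternative = "less"` or `alternative = "greater"` gives the corresponding one-sided p-value, with the difference taken as population 1 minus population 2.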