# Chapter 22 - Comparing Two Proportions

## Another Ruler

* The variance of the sum or difference of two independent random variables is the sum of their variances.
  * applies only when the two random variables are _independent_  
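
As a quick check of this rule, the sketch below computes the variance of a difference exactly for two small discrete distributions. The die and coin are hypothetical examples chosen for illustration, and the helper names `variance` and `difference_dist` are not from the text.

```python
from fractions import Fraction
from itertools import product


def variance(dist):
    """Variance of a discrete distribution given as {value: probability}."""
    mean = sum(v * p for v, p in dist.items())
    return sum((v - mean) ** 2 * p for v, p in dist.items())


def difference_dist(dist_x, dist_y):
    """Distribution of X - Y, assuming X and Y are independent."""
    out = {}
    for (x, px), (y, py) in product(dist_x.items(), dist_y.items()):
        # Independence lets us multiply the marginal probabilities.
        out[x - y] = out.get(x - y, 0) + px * py
    return out


# Hypothetical example: a fair six-sided die and a fair coin coded 0/1.
die = {face: Fraction(1, 6) for face in range(1, 7)}
coin = {0: Fraction(1, 2), 1: Fraction(1, 2)}

var_diff = variance(difference_dist(die, coin))
var_sum_of_parts = variance(die) + variance(coin)
print(var_diff, var_sum_of_parts)
```

Both printed values agree because the joint probabilities were built as products; with dependent variables that step would not be valid and the variances would not simply add.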

## The Standard Deviation of the Difference Between Two Proportions

* given two sample proportions, $\hat{p}_1$ and $\hat{p}_2$,

\begin{equation}
SD(\hat{p}_1) = \sqrt{\frac{p_1 q_1}{n_1}}
\end{equation}

\begin{equation}
SD(\hat{p}_2) = \sqrt{\frac{p_2 q_2}{n_2}}
\end{equation}

\begin{equation}
\begin{split}
Var(\hat{p}_1 - \hat{p}_2) &= \Big(\sqrt{\frac{p_1 q_1}{n_1}}\Big)^2 + \Big(\sqrt{\frac{p_2 q_2}{n_2}}\Big)^2 \\
 &= \frac{p_1 q_1}{n_1} + \frac{p_2 q_2}{n_2}
\end{split}
\end{equation}


\begin{equation}
SD(\hat{p}_1 - \hat{p}_2) = \sqrt{\frac{p_1 q_1}{n_1} + \frac{p_2 q_2}{n_2}}
\end{equation}

* since we usually don't know the true values of $p_1$ and $p_2$, we use the sample proportions as estimates, to generate a standard error:

\begin{equation}
SE(\hat{p}_1 - \hat{p}_2) = \sqrt{\frac{\hat{p}_1 \hat{q}_1}{n_1} + \frac{\hat{p}_2 \hat{q}_2}{n_2}}
\end{equation}
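
The standard error formula above can be sketched in a few lines of Python. The function name `se_diff` and the counts used (60 successes out of 200, and 50 out of 250) are hypothetical, chosen only to illustrate the arithmetic.

```python
from math import sqrt


def se_diff(successes_1, n_1, successes_2, n_2):
    """Standard error of p1_hat - p2_hat from the observed proportions."""
    p1_hat = successes_1 / n_1
    p2_hat = successes_2 / n_2
    q1_hat = 1 - p1_hat
    q2_hat = 1 - p2_hat
    # Variances add because the two groups are assumed independent.
    return sqrt(p1_hat * q1_hat / n_1 + p2_hat * q2_hat / n_2)


# Hypothetical counts, for illustration only.
se = se_diff(60, 200, 50, 250)
print(se)
```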

## Assumptions and Conditions

### Independence Assumption

* **independence assumption**: within each group, the data should be based on results for independent individuals
  * **randomization condition**: data in each group should be drawn independently and at random from a homogeneous population; or generated by a randomized comparative experiment
  * **10% condition**: if data is sampled without replacement, the sample should not exceed 10% of the population
* **independent groups assumption**: the two groups we're comparing must be independent _of each other_
  * this is needed to support our ability to add variances
  
### Sample Size Assumption

* **success/failure condition**: need larger groups to estimate proportions that are nearer to 0% or 100%; ensuring that at least 10 successes and 10 failures are observed helps to support this
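
The two conditions that can be checked numerically (the 10% condition and the success/failure condition) might be sketched as below. Independence and randomization depend on how the data were collected and cannot be verified from counts alone. The function name, its parameters, and the example numbers are assumptions made for illustration.

```python
def check_group_conditions(successes, n, population_size=None, min_count=10):
    """Check the numeric conditions for one group.

    Returns a dict of booleans. The 10% condition is only evaluated
    when a population size is supplied.
    """
    failures = n - successes
    result = {
        "success_failure": successes >= min_count and failures >= min_count
    }
    if population_size is not None:
        result["ten_percent"] = n <= 0.10 * population_size
    return result


# Hypothetical group: 60 successes in a sample of 200 from 5000 people.
print(check_group_conditions(60, 200, population_size=5000))
```

Each group would be checked separately, since the conditions apply within each group.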

## The Sampling Distribution

### The Sampling Distribution Model for a Difference Between Two Independent Proportions

Provided that the sampled values are independent, the samples are independent, and the sample sizes are large enough, the sampling distribution of $\hat{p}_1 - \hat{p}_2$ is modeled by a Normal model with mean $\mu = p_1 - p_2$ and standard deviation

\begin{equation}
SD(\hat{p}_1 - \hat{p}_2) = \sqrt{\frac{p_1 q_1}{n_1} + \frac{p_2 q_2}{n_2}}
\end{equation}
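
A small simulation can make this model concrete. The sketch below draws many pairs of independent samples and compares the mean and standard deviation of the simulated differences with the values the Normal model predicts. The true proportions, sample sizes, repetition count, and seed are all hypothetical choices.

```python
import random
from math import sqrt
from statistics import mean, stdev


def simulate_differences(p1, n1, p2, n2, reps, seed=0):
    """Simulate p1_hat - p2_hat for independent samples from two groups."""
    rng = random.Random(seed)
    diffs = []
    for _ in range(reps):
        # Count successes in each group as a sum of Bernoulli trials.
        x1 = sum(rng.random() < p1 for _ in range(n1))
        x2 = sum(rng.random() < p2 for _ in range(n2))
        diffs.append(x1 / n1 - x2 / n2)
    return diffs


# Hypothetical true proportions and sample sizes.
p1, n1, p2, n2 = 0.30, 200, 0.20, 250
diffs = simulate_differences(p1, n1, p2, n2, reps=4000)

model_mean = p1 - p2
model_sd = sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
print("simulated mean:", mean(diffs), "model mean:", model_mean)
print("simulated SD:  ", stdev(diffs), "model SD:  ", model_sd)
```

The simulated values should land close to the model values, though not exactly on them, since the simulation itself is subject to sampling variability.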


### A Two-Proportion z-Interval

When the conditions are met, we are ready to find the confidence interval for the difference of two proportions, $p_1 - p_2$.  The confidence interval is

\begin{equation}
(\hat{p}_1 - \hat{p}_2) \pm z^* \times SE(\hat{p}_1 - \hat{p}_2)
\end{equation}

where we find the standard error of the difference,

\begin{equation}
SE(\hat{p}_1 - \hat{p}_2) = \sqrt{\frac{\hat{p}_1 \hat{q}_1}{n_1} + \frac{\hat{p}_2 \hat{q}_2}{n_2}}
\end{equation}

from the observed proportions.

The critical value $z^*$ depends on the particular confidence level, $C$, that we specify.
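
One way to sketch the interval calculation, assuming the conditions have already been checked, uses `statistics.NormalDist` from the Python standard library to find $z^*$. The function name and the example counts are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def two_prop_z_interval(successes_1, n_1, successes_2, n_2, confidence=0.95):
    """Two-proportion z-interval for p1 - p2 at the given confidence level."""
    p1_hat = successes_1 / n_1
    p2_hat = successes_2 / n_2
    se = sqrt(p1_hat * (1 - p1_hat) / n_1 + p2_hat * (1 - p2_hat) / n_2)
    # z* leaves (1 - C) / 2 in each tail of the standard Normal model.
    z_star = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    diff = p1_hat - p2_hat
    return diff - z_star * se, diff + z_star * se


# Hypothetical counts: 60 of 200 in group 1, 50 of 250 in group 2.
low, high = two_prop_z_interval(60, 200, 50, 250, confidence=0.95)
print(low, high)
```

Raising the confidence level increases $z^*$ and therefore widens the interval.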

## Step-by-Step Example : A Two-Proportion z-Interval

* Plan: State what you want to know.  
  - Discuss the variables and the W's.
  - Identify the parameter you wish to estimate.
  - Choose and state a confidence level.
* Model:
  - Think about the assumptions and check the conditions.
  - State the sampling distribution model for the statistic.
  - Choose your method.
* Mechanics
  - Construct the confidence interval.
  - Often, a key step in finding the confidence interval is estimating the standard deviation of the sampling distribution model of the statistic.
* Conclusion
  - Interpret your confidence interval in the proper context.

## Will I Snore When I'm 64?

* comparing proportion across two groups
* hypothesis test, where parameter of interest is the true _difference_ in proportion between the two groups
* typically express this as

\begin{equation}
\begin{split}
H_0: p_1 - p_2 = 0 \\
H_A: p_1 - p_2 \ne 0 
\end{split}
\end{equation}

## Everyone into the Pool

* we have a standard error for the difference in proportions estimated using the two sample proportions
* because the null hypothesis assumes no difference between the proportions, we can combine the counts to generate a single proportion
* combining the counts like this to get an overall proportion is called **pooling**

\begin{equation}
\hat{p}_{pooled} = \frac{Success_1 + Success_2}{n_1 + n_2}
\end{equation}

* we can then use this pooled proportion to generate a standard error:

\begin{equation}
SE_{pooled}(\hat{p}_1 - \hat{p}_2) = \sqrt{\frac{\hat{p}_{pooled}\hat{q}_{pooled}}{n_1} + \frac{\hat{p}_{pooled}\hat{q}_{pooled}}{n_2}}
\end{equation}
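
As a rough illustration of pooling, the sketch below combines the counts from both groups and then computes the pooled standard error. The function name `pooled_se` and the counts are hypothetical.

```python
from math import sqrt


def pooled_se(successes_1, n_1, successes_2, n_2):
    """Return (p_pooled, SE_pooled) for the difference in proportions."""
    p_pooled = (successes_1 + successes_2) / (n_1 + n_2)
    q_pooled = 1 - p_pooled
    # The same pooled proportion is used for both groups.
    se = sqrt(p_pooled * q_pooled / n_1 + p_pooled * q_pooled / n_2)
    return p_pooled, se


# Hypothetical counts: 60 of 200 in group 1, 50 of 250 in group 2.
p_pooled, se_pooled = pooled_se(60, 200, 50, 250)
print(p_pooled, se_pooled)
```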

## Improving the Success / Failure Condition

* when evaluating the success / failure condition, it is the _expected_ counts of successes and failures (i.e. those expected if the null hypothesis is true) that should be checked, not the _observed_ counts, i.e.

\begin{equation}
\begin{split}
n_1\hat{p}_{pooled}, \quad n_1\hat{q}_{pooled} \\
n_2\hat{p}_{pooled}, \quad n_2\hat{q}_{pooled}
\end{split}
\end{equation}
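
A minimal sketch of this check computes the expected successes and failures in each group from the pooled proportion. The function name, the dictionary keys, the threshold of 10 (taken from the success/failure condition stated earlier), and the counts are assumptions for illustration.

```python
def expected_counts(successes_1, n_1, successes_2, n_2):
    """Expected (successes, failures) per group under the null hypothesis."""
    p_pooled = (successes_1 + successes_2) / (n_1 + n_2)
    q_pooled = 1 - p_pooled
    return {
        "group_1": (n_1 * p_pooled, n_1 * q_pooled),
        "group_2": (n_2 * p_pooled, n_2 * q_pooled),
    }


# Hypothetical counts: 60 of 200 in group 1, 50 of 250 in group 2.
counts = expected_counts(60, 200, 50, 250)
condition_met = all(c >= 10 for pair in counts.values() for c in pair)
print(counts, condition_met)
```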

## Compared to What?

### Two-Proportion z-Test

The conditions for the two-proportion z-test are the same as for the two-proportion z-interval.  We are testing the hypothesis

\begin{equation}
H_0: p_1 - p_2 = 0.
\end{equation}

Because we hypothesize that the proportions are equal, we pool the groups to find

\begin{equation}
\hat{p}_{pooled} = \frac{Success_1 + Success_2}{n_1 + n_2}
\end{equation}

and use that pooled value to estimate the standard error:

\begin{equation}
SE_{pooled}(\hat{p}_1 - \hat{p}_2) = \sqrt{\frac{\hat{p}_{pooled}\hat{q}_{pooled}}{n_1} + \frac{\hat{p}_{pooled}\hat{q}_{pooled}}{n_2}}.
\end{equation}

Now we find the test statistic,

\begin{equation}
z = \frac{(\hat{p}_1 - \hat{p}_2) - 0}{SE_{pooled}(\hat{p}_1 - \hat{p}_2)}.
\end{equation}

When the conditions are met and the null hypothesis is true, this statistic follows the standard Normal model, so we can use the model to obtain a P-value.
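
Putting the pieces together, the test might be sketched as follows. This assumes a two-sided alternative ($H_A: p_1 - p_2 \ne 0$) and that the conditions have already been checked; the function name and counts are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def two_prop_z_test(successes_1, n_1, successes_2, n_2):
    """Two-sided two-proportion z-test of H0: p1 - p2 = 0.

    Returns (z, p_value).
    """
    p1_hat = successes_1 / n_1
    p2_hat = successes_2 / n_2
    # Pool the groups because H0 says the proportions are equal.
    p_pooled = (successes_1 + successes_2) / (n_1 + n_2)
    q_pooled = 1 - p_pooled
    se_pooled = sqrt(p_pooled * q_pooled / n_1 + p_pooled * q_pooled / n_2)
    z = ((p1_hat - p2_hat) - 0) / se_pooled
    # Two-sided P-value from the standard Normal model.
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value


# Hypothetical counts: 60 of 200 in group 1, 50 of 250 in group 2.
z, p_value = two_prop_z_test(60, 200, 50, 250)
print(z, p_value)
```

For a one-sided alternative, the P-value would instead use only the relevant tail of the Normal model.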

## Step-by-Step Example: A Two-Proportion z-Test

* Plan: State what you want to know.  Discuss the variables and the W's.
* Hypotheses: State the hypotheses.
* Model: 
  - Think about the assumptions and check the conditions.
  - State the null model.
  - Choose your method.
* Mechanics
  - Use the pooled SE to estimate $SD(\hat{p}_1 - \hat{p}_2)$
  - Make a picture
  - Find the z-score for the observed difference in proportions
  - Find the P-value
* Conclusion
  - Link the P-value to your decision about the null hypothesis, and state your conclusion in context.

## What Can Go Wrong?

* Don't use two-sample proportion methods when the samples aren't independent.
* Don't apply inference methods where there was no randomization.
* Don't interpret a significant difference in proportions causally.

## What Have We Learned?

* [p.537]