# Chapter 24 - Comparing Means

## Plot the Data

## Comparing Two Means

* we know that, for independent random variables, the variance of their _difference_ is the _sum_ of their individual variances
* $Var(Y-X) = Var(Y) + Var(X)$
* Standard deviation of the difference of two sample means:
    
\begin{equation}
SD(\bar{y}_1 - \bar{y}_2) = \sqrt{\frac{\sigma^2_1}{n_1} +\frac{\sigma^2_2}{n_2}}
\end{equation}
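The variance rule above can be checked with a quick simulation. This is only a sketch: the normal distributions, the standard deviations (2 and 3), the seed, and the sample count are all assumptions chosen for illustration.

```python
import random
import statistics

random.seed(24)  # arbitrary seed so the run is repeatable

n = 50_000
# Hypothetical independent variables: X with SD 2, Y with SD 3
xs = [random.gauss(0, 2) for _ in range(n)]
ys = [random.gauss(0, 3) for _ in range(n)]
diffs = [y - x for x, y in zip(xs, ys)]

var_x = statistics.pvariance(xs)
var_y = statistics.pvariance(ys)
var_diff = statistics.pvariance(diffs)

# Both quantities should land near 2**2 + 3**2 = 13
print(f"Var(X) + Var(Y) = {var_x + var_y:.2f}")
print(f"Var(Y - X)      = {var_diff:.2f}")
```

The two printed values should agree closely, even though subtracting might suggest the variances subtract as well.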

* Because we don't know the true standard deviations of the two groups, we'll use the estimates $s_1$ and $s_2$, and generate a standard error:

\begin{equation}
SE(\bar{y}_1 - \bar{y}_2) = \sqrt{\frac{s^2_1}{n_1} + \frac{s^2_2}{n_2}}
\end{equation}

* the confidence interval we build is called a **two-sample $t$-interval** (for the difference in means).  
* the corresponding hypothesis test is called a **two-sample $t$-test**
* the interval looks just like all the others we've seen -- the statistic plus or minus an estimated margin of error:

\begin{equation}
(\bar{y}_1 - \bar{y}_2) \pm ME
\end{equation}

where

\begin{equation}
ME = t^* \times SE(\bar{y}_1 - \bar{y}_2)
\end{equation}

* in reality, the distribution isn't a Student's $t$ distribution, but we can use a special formula for degrees of freedom to bring it close enough to Student's $t$ to use that as our sampling distribution model:

\begin{equation}
df = \frac{(\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2})^2}{\frac{1}{n_1 - 1} (\frac{s_1^2}{n_1})^2 + \frac{1}{n_2 - 1} (\frac{s_2^2}{n_2})^2}
\end{equation}

* (typically use computer to calculate)
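As a rough sketch of what the computer does, the standard error and the degrees-of-freedom formula above can be written in a few lines of Python. The function name `welch_se_and_df` and the summary statistics are hypothetical, chosen only for illustration.

```python
import math

def welch_se_and_df(s1, n1, s2, n2):
    """Return (SE, df) for the difference of two independent sample means."""
    v1 = s1**2 / n1  # estimated variance of the first sample mean
    v2 = s2**2 / n2  # estimated variance of the second sample mean
    se = math.sqrt(v1 + v2)
    # Approximate degrees of freedom from the special formula
    df = (v1 + v2)**2 / (v1**2 / (n1 - 1) + v2**2 / (n2 - 1))
    return se, df

# Hypothetical summary statistics, for illustration only
se, df = welch_se_and_df(s1=3.0, n1=10, s2=4.0, n2=12)
print(f"SE = {se:.3f}, df = {df:.2f}")
```

The resulting df is usually not a whole number. It always falls between the smaller of $n_1 - 1$ and $n_2 - 1$ and the pooled value $n_1 + n_2 - 2$.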

### A Sampling Distribution for the Difference Between Two Means

When the conditions are met, the sampling distribution of the standardized sample difference between the means of two independent groups,

\begin{equation}
t = \frac{(\bar{y}_1 - \bar{y}_2) - (\mu_1 - \mu_2)}{SE(\bar{y}_1 - \bar{y}_2)}
\end{equation}

can be modeled by a Student's $t$-model with a number of degrees of freedom found with a special formula.  We estimate the standard error with

\begin{equation}
SE(\bar{y}_1 - \bar{y}_2) = \sqrt{\frac{s^2_1}{n_1} + \frac{s^2_2}{n_2}}
\end{equation}

## Assumptions and Conditions

### Independence Assumption

* the data in each group must be drawn independently and at random from a homogeneous population, or generated by a randomized comparative experiment
* **randomization condition**
* **10% condition**: we usually don't check this; it matters only for a very small population or an extremely large sample

### Normal Population Assumption

* **nearly normal condition**: 
  * must check this for both groups; 
  * for samples where $n < 15$ in either group, don't use this method if a histogram or Normal probability plot shows severe skewness
  * for $n$'s closer to 40, mild skewness is OK, but remark on any outliers
  * for $n > 40$, the CLT makes the nearly normal condition matter less, but still watch for outliers, extreme skewness, and multiple modes
  
### Independent Groups Assumption

* the two groups we are comparing must be independent of each other
* must be verified by thinking about the datasets -- there's no statistical test to verify this

### Two-Sample $t$-Interval for the Difference Between Means

When the conditions are met, we are ready to find the confidence interval for the difference between means of two independent groups, $\mu_1 - \mu_2$.  The confidence interval is

\begin{equation}
(\bar{y}_1 - \bar{y}_2) \pm t^*_{df} \times SE(\bar{y}_1 - \bar{y}_2)
\end{equation}

where the standard error of the difference of the means is

\begin{equation}
SE(\bar{y}_1 - \bar{y}_2) = \sqrt{\frac{s^2_1}{n_1} + \frac{s^2_2}{n_2}}
\end{equation}

The critical value $t^*_{df}$ depends on the particular confidence level, $C$, that you specify and on the number of degrees of freedom, which we get from the sample sizes and a special formula.
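A minimal sketch of the interval calculation follows. It assumes the critical value `t_star` has already been looked up (from a table or software) for the chosen confidence level and degrees of freedom. The function name and all numbers are hypothetical.

```python
import math

def two_sample_t_interval(ybar1, s1, n1, ybar2, s2, n2, t_star):
    """Return (lower, upper) for the difference in means, mu_1 - mu_2."""
    se = math.sqrt(s1**2 / n1 + s2**2 / n2)  # SE of the difference
    me = t_star * se                         # margin of error
    diff = ybar1 - ybar2
    return diff - me, diff + me

# Hypothetical summary statistics; t_star = 2.09 is assumed to be roughly
# the 95% critical value for about 20 degrees of freedom
lo, hi = two_sample_t_interval(52.0, 3.0, 10, 48.5, 4.0, 12, t_star=2.09)
print(f"95% CI for mu_1 - mu_2: ({lo:.2f}, {hi:.2f})")
```

Passing `t_star` in as an argument keeps the sketch within the standard library, since Python has no built-in Student's $t$ quantile function.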

## Step-by-Step Example: A Two-Sample $t$-Interval

* Plan:
  * state what you want to know
  * identify the parameter you wish to estimate
  * identify the population(s) about which you wish to make statements
  * identify the variables and review the W's
  * make a picture: boxplots are the display of choice for comparing groups
* Model:
  * think about the appropriate assumptions and check the conditions to be sure that a Student's $t$-model for the sampling distribution is appropriate
  * make a picture: use histograms or Normal probability plots to check the shape of the distribution of each group
  * state the sampling distribution model for the statistic
  * specify your method
* Mechanics
  * construct the confidence interval
  * be sure to include units along with the statistics
  * use meaningful subscripts to identify groups
  * use sample standard deviation to find the standard error of the sampling distribution
  * use a computer to calculate the degrees of freedom
* Conclusion
  * interpret the confidence interval in the proper context

## Another One Just Like the Other Ones?

## Testing the Difference Between Two Means

* **two-sample $t$-test for the difference between means**

## A Test for the Difference Between Two Means

### Two-Sample $t$-Test for the Difference Between Means

The conditions for the two-sample $t$-test for the difference between the means of two independent groups are the same as for the two-sample $t$-interval.  We test the hypothesis

\begin{equation}
H_0: \mu_1 - \mu_2 = \Delta_0
\end{equation}

where the hypothesized difference, $\Delta_0$, is almost always 0, using the statistic

\begin{equation}
t = \frac{(\bar{y}_1 - \bar{y}_2) - \Delta_0}{SE(\bar{y}_1 - \bar{y}_2)}
\end{equation}

The standard error of $\bar{y}_1 - \bar{y}_2$ is

\begin{equation}
SE(\bar{y}_1 - \bar{y}_2) = \sqrt{\frac{s_1^2}{n_1}+\frac{s_2^2}{n_2}}
\end{equation}

When the conditions are met and the null hypothesis is true, this statistic can be closely modeled by a Student's $t$-model with a number of degrees of freedom given by a special formula.  We use that model to obtain a P-value.
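The test can be sketched in plain Python as below. The two-sided P-value is approximated by integrating the Student's $t$ density numerically with Simpson's rule. That is a stand-in chosen to avoid outside libraries, not the method statistical software necessarily uses. All function names and the example numbers are hypothetical.

```python
import math

def t_pdf(x, df):
    """Density of Student's t with df degrees of freedom."""
    log_c = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
             - 0.5 * math.log(df * math.pi))
    return math.exp(log_c - (df + 1) / 2 * math.log1p(x * x / df))

def two_sided_p_value(t, df, steps=2000):
    """Approximate P(|T| >= |t|) with Simpson's rule (steps must be even)."""
    a = abs(t)
    if a == 0:
        return 1.0
    h = a / steps
    total = t_pdf(0, df) + t_pdf(a, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    central = 2 * (h / 3) * total  # probability that |T| < |t|
    return max(0.0, 1 - central)

def two_sample_t_test(ybar1, s1, n1, ybar2, s2, n2, delta0=0.0):
    """Return (t, df, two-sided P-value) for H0: mu_1 - mu_2 = delta0."""
    v1, v2 = s1**2 / n1, s2**2 / n2
    se = math.sqrt(v1 + v2)
    df = (v1 + v2)**2 / (v1**2 / (n1 - 1) + v2**2 / (n2 - 1))
    t = ((ybar1 - ybar2) - delta0) / se
    return t, df, two_sided_p_value(t, df)

# Hypothetical summary statistics, for illustration only
t, df, p = two_sample_t_test(52.0, 3.0, 10, 48.5, 4.0, 12)
print(f"t = {t:.3f}, df = {df:.2f}, P-value = {p:.4f}")
```

For a one-sided alternative, halve the two-sided P-value when the observed difference lies in the direction of the alternative.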

## Step-by-Step Example: A Two-Sample $t$-Test for the Difference Between Two Means

* Plan
  * state what we want to know
  * identify the parameter you wish to estimate
  * identify the variables and check the W's
* Hypotheses
  * state the null and alternative hypotheses
  * make a picture
    * boxplots for comparing groups
    * histograms or Normal probability plots to check the distribution of each group
* Model
  * think about the assumptions and check the conditions
  * State the sampling distribution model
  * specify your method
* Mechanics
  * list the primary statistics
  * use the null model to find the P-value
  * make a picture
  * find the $t$-value
  * find the P-value
* Conclusion
  * link the P-value to your decision about the null hypothesis, and state the conclusion in context
  * be cautious about generalizing outside the range of the values in the study

## Back into the Pool

* if we are willing to assume that the variances of the two groups are equal, we could pool the data from two groups to estimate the common variance
* we would estimate this pooled variance from the data, so we'd still use a Student's $t$-model
* this test is called a **pooled $t$-test (for the difference between means)**

## The Pooled $t$-Test

* **Equal Variance Assumption**: we must assume that the variances of the two populations from which the samples have been drawn are equal
* **Similar Spreads Condition**: consists of looking at the boxplots to check that the spreads are not wildly different

* substitute the pooled-$t$ estimate of the standard error and its degrees of freedom into the steps of the confidence interval or hypothesis test, and you'll be using the pooled-$t$ method
* if you choose to use the pooled-$t$ method, you must defend the assumption that the variances of the two groups are equal

### Pooled $t$-Test and Confidence Interval for Means

The conditions for the pooled $t$-test for the difference between the means of two independent groups (commonly called a "pooled $t$-test") are the same as for the two-sample $t$-test with the additional assumption that the variances of the two groups are the same.  We test the hypothesis

\begin{equation}
H_0: \mu_1 - \mu_2 = \Delta_0
\end{equation}

where the hypothesized difference, $\Delta_0$, is almost always 0, using the statistic

\begin{equation}
t = \frac{(\bar{y}_1 - \bar{y}_2) - \Delta_0}{SE_{pooled}(\bar{y}_1 - \bar{y}_2)}
\end{equation}

The standard error of $\bar{y}_1 - \bar{y}_2$ is

\begin{equation}
SE_{pooled}(\bar{y}_1 - \bar{y}_2) = \sqrt{\frac{s^2_{pooled}}{n_1} +\frac{s^2_{pooled}}{n_2}} = s_{pooled}\sqrt{\frac{1}{n_1}+\frac{1}{n_2}}
\end{equation}

where the pooled variance is

\begin{equation}
s^2_{pooled} = \frac{(n_1 - 1)s^2_1 + (n_2 - 1)s^2_2}{(n_1 - 1) + (n_2 - 1)}
\end{equation}

When the conditions are met and the null hypothesis is true, we can model this statistic's sampling distribution with a Student's $t$-model with $(n_1 - 1) + (n_2 - 1)$ degrees of freedom.  We use that model to obtain a P-value for a test or a margin of error for a confidence interval.

The corresponding confidence interval is

\begin{equation}
(\bar{y}_1 - \bar{y}_2) \pm t^*_{df} \times SE_{pooled}(\bar{y}_1 - \bar{y}_2)
\end{equation}

where the critical value $t^*_{df}$ depends on the confidence level and is found with $(n_1 - 1) + (n_2 - 1)$ degrees of freedom.
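The pooled quantities above can be sketched as follows. The function name `pooled_se_and_df` and the summary statistics are hypothetical, and the sketch is only meaningful when the equal variance assumption can be defended.

```python
import math

def pooled_se_and_df(s1, n1, s2, n2):
    """Return (pooled SE, df) for the difference of two sample means."""
    df = (n1 - 1) + (n2 - 1)
    # Pooled variance: weighted average of the two sample variances
    s2_pooled = ((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / df
    se = math.sqrt(s2_pooled) * math.sqrt(1 / n1 + 1 / n2)
    return se, df

# Hypothetical summary statistics, for illustration only
se_pooled, df_pooled = pooled_se_and_df(s1=3.0, n1=10, s2=4.0, n2=12)
print(f"pooled SE = {se_pooled:.3f}, df = {df_pooled}")
```

Substituting `se_pooled` and `df_pooled` for the unpooled standard error and degrees of freedom gives the pooled-$t$ version of the interval or test.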

## Is the Pool All Wet?

* when should you use the pooled-$t$ method rather than the two-sample $t$ methods?  hardly ever
    * when the variances _are_ equal, the two methods give results that are pretty much the same
    * when they're not equal, the pooled method is invalid

## Why Not Test the Assumption That the Variances Are Equal?

* the hypothesis test that could do this is very sensitive to failures of the assumptions and works poorly for small sample sizes

## Is There Ever a Time When Assuming Equal Variances Makes Sense?

* may make sense in a randomized comparative experiment

## *Tukey's Quick Test

## *A Rank Sum Test

## What Can Go Wrong?

* Watch out for paired data
* Look at the plots

## What Have We Learned?

* [p. 599]