# CS-6570 Lecture 18 - Statistical Tests
**Dylan Zwick**

*Weber State University*

Thus far in our class we've mostly focused on *prediction*. That is, based upon a given set of input data, we want to predict an outcome of interest. For the next couple of lectures we'll switch gears a little bit, and instead discuss *inference*. Namely, how you can attempt to infer conclusions based on what your data is telling you. In particular, we'll focus on how we can attempt to determine the truth or falsity of a particular hypothesis.

In classical statistics, much of the emphasis is on testing a single hypothesis, typically called the *null hypothesis*, such as $H_{0}$*: people in Utah and people in New Jersey are equally likely to prefer Reese's Peanut Butter Cups to Snickers*. Typically, what we're interested in discovering is that there *is* a difference, and so we want to determine if we can *reject* the null hypothesis.

However, in contemporary settings we are often faced with very large amounts of data, and we want to see what we can infer. In other words, we don't have any specific null hypothesis we're interested in rejecting, and instead want to see what we can infer from the data regarding any number of potential null hypotheses. So, instead of simply testing $H_{0}$, we might want to test $m$ null hypotheses $H_{0,1}, H_{0,2}, \ldots, H_{0,m}$, where $H_{0,j}$ represents the $j$th null hypothesis. For example, you could take any number of non-Snickers candies and set $H_{0,j}$*: people in Utah and people in New Jersey are equally likely to prefer candy $j$ to Snickers*.

In class today we'll go through a quick review of hypothesis testing, and then discuss the challenge of multiple testing. We'll discuss multiple testing in more detail in our next lecture.

**Hypothesis Testing**

Hypothesis tests provide a rigorous statistical framework for answering simple yes/no questions about data. Conducting a hypothesis test typically proceeds in four steps:

* Step 1 - Define the Null and Alternative Hypothesis

The null hypothesis, denoted $H_{0}$, is the default state of belief about the world. It is boring by construction: it may well be true, but we probably hope our data tells us otherwise. The alternative hypothesis, denoted $H_{a}$, represents something different and unexpected. Typically, the alternative hypothesis simply posits that the null hypothesis is false. So, in our example above, the alternative hypothesis would be that people in Utah and people in New Jersey are **not** equally likely to prefer Reese's to Snickers. The treatment of $H_{0}$ and $H_{a}$ is asymmetric. $H_{0}$ is treated as the default state of the world, and we focus on the question of rejecting it. In other words, we're not concerned with which hypothesis is more likely, we're concerned with how likely our data is assuming $H_{0}$ is true. **Note that a failure to reject the null hypothesis doesn't necessarily mean it's true - it could mean we just need more data.**

* Step 2 - Construct the Test Statistic

A test statistic, denoted $T$, summarizes the extent to which our data are consistent with $H_{0}$. Test statistics are typically constructed so that they're centered at zero when the null hypothesis holds: if the data look exactly as $H_{0}$ predicts, the test statistic is zero, and the further it is from zero, the less consistent the data are with $H_{0}$. The way we construct $T$ depends on the nature of the null hypothesis we are testing. We won't dive into those details today, but to make things concrete, let $x_{1}^{t},x_{2}^{t},\ldots,x_{n_{t}}^{t}$ denote the blood pressure measurements for the $n_{t}$ mice in a treatment group, and let $x_{1}^{c},x_{2}^{c},\ldots,x_{n_{c}}^{c}$ denote the blood pressure measurements for the $n_{c}$ mice in the control group, and $\mu_{t} = E(X^{t}), \mu_{c} = E(X^{c})$. To test $H_{0}: \mu_{t} = \mu_{c}$, we make use of a *two-sample t-statistic*, defined as

$\displaystyle T = \frac{\hat{\mu_{t}} - \hat{\mu_{c}}}{s\sqrt{\frac{1}{n_{t}} + \frac{1}{n_{c}}}}$

where $\hat{\mu_{t}}$ and $\hat{\mu_{c}}$ are the sample means for the treatment and control groups respectively, and

$\displaystyle s = \sqrt{\frac{(n_{t}-1)s_{t}^{2} + (n_{c}-1)s_{c}^{2}}{n_{t} + n_{c} - 2}}$

is **an estimator of the pooled standard deviation of the two samples**. Here, $s_{t}^{2}$ and $s_{c}^{2}$ are unbiased estimators of the variance of the blood pressure in the treatment and control groups, respectively. A large (absolute) value of $T$ provides evidence against $H_{0}: \mu_{t} = \mu_{c}$, and hence evidence in support of $H_{a}: \mu_{t} \neq \mu_{c}$.

**The important idea is that $T$ is derived from the data. If $|T|$ is large, it provides evidence against the null hypothesis.**
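To make the formula concrete, here is a minimal sketch that computes the pooled two-sample $t$-statistic by hand and checks it against SciPy's equal-variance $t$-test. The function name `pooled_t_statistic` and the blood pressure numbers are made up for illustration; they are not data from the lecture.

```python
import numpy as np
from scipy import stats


def pooled_t_statistic(x_t, x_c):
    """Two-sample t-statistic using the pooled standard deviation s."""
    x_t = np.asarray(x_t, dtype=float)
    x_c = np.asarray(x_c, dtype=float)
    n_t, n_c = len(x_t), len(x_c)
    # ddof=1 gives the unbiased sample variances s_t^2 and s_c^2
    var_t = x_t.var(ddof=1)
    var_c = x_c.var(ddof=1)
    s = np.sqrt(((n_t - 1) * var_t + (n_c - 1) * var_c) / (n_t + n_c - 2))
    return (x_t.mean() - x_c.mean()) / (s * np.sqrt(1 / n_t + 1 / n_c))


# Hypothetical blood pressure readings (assumed values, illustration only)
rng = np.random.default_rng(0)
treatment = rng.normal(loc=118, scale=10, size=12)
control = rng.normal(loc=125, scale=10, size=15)

T = pooled_t_statistic(treatment, control)
# SciPy's equal-variance t-test uses the same pooled estimator
T_scipy = stats.ttest_ind(treatment, control, equal_var=True).statistic
print(T, T_scipy)
```

The two printed values should agree up to floating-point error, which is a handy sanity check on the hand-written formula.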

* Step 3 - Compute the $p$-value

A large value of the statistic $T$ is evidence against $H_{0}$. But, how large is large? The notion of a $p$-value provides a way to formalize as well as answer this question. The $p$-value is defined as the probability of observing a test statistic equal to or more extreme than the observed statistic, *under the assumption that $H_{0}$ is true*. So, a small $p$-value is evidence against $H_{0}$. The distribution of the test statistic under $H_{0}$ (a.k.a. the test statistic's *null distribution*) depends on the details of what type of null hypothesis is being tested, and what type of test statistic is used. Most commonly-used test statistics follow a well-known statistical distribution under the null hypothesis - like a normal distribution, a $t$-distribution, a $\chi^{2}$-distribution, or an $F$-distribution. **Even if $H_{0}$ is true, the averages of the two sample groups will almost never be exactly the same, so we need to measure the probability that random chance alone would produce the value of $T$ we observed, or something more extreme. For example, a $p$-value of 0.18 means that, if $H_{0}$ were true, you would see a $t$-statistic that large or larger 18% of the time. "Large" is usually measured by absolute distance from 0, which makes this a two-sided test.**

The $p$-value is perhaps one of the most used and abused notions in all of statistics. In particular, it is sometimes said that the $p$-value is the probability that $H_{0}$ holds (that the null hypothesis is true). This is wrong! The one and only correct interpretation of the $p$-value is as the fraction of the time we would expect to see such an extreme value of the test statistic provided $H_{0}$ holds.
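The "fraction of the time" interpretation can be checked directly by simulation. The sketch below assumes the pooled two-sample $t$-test from Step 2, normally distributed measurements, and a hypothetical observed statistic `T_obs` (all of these are illustrative choices, not values from the lecture). It compares the analytic two-sided $p$-value from the $t$-distribution with the fraction of simulated null experiments that produce a statistic at least as extreme.

```python
import numpy as np
from scipy import stats

n_t, n_c = 12, 15      # hypothetical group sizes
T_obs = 1.38           # hypothetical observed t-statistic
df = n_t + n_c - 2     # degrees of freedom for the pooled t-test

# Analytic two-sided p-value: P(|T| >= |T_obs|) when H0 is true
p_analytic = 2 * stats.t.sf(abs(T_obs), df)

# Monte Carlo: simulate many experiments in which H0 really is true
# (both groups are drawn from the same normal distribution)
rng = np.random.default_rng(1)
n_sims = 100_000
sim_t = rng.normal(size=(n_sims, n_t))
sim_c = rng.normal(size=(n_sims, n_c))
T_null = stats.ttest_ind(sim_t, sim_c, axis=1, equal_var=True).statistic

# Fraction of null experiments with a statistic at least as extreme
p_simulated = np.mean(np.abs(T_null) >= abs(T_obs))

print(p_analytic, p_simulated)
```

Note that nothing here estimates the probability that $H_{0}$ is true. Every simulated experiment assumes $H_{0}$ from the start.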

* Step 4 - Decide Whether to Reject the Null Hypothesis

Once we have computed a $p$-value corresponding to $H_{0}$, we need to decide whether to reject $H_{0}$. The answer to this question is very much up to the data analyst. In some fields it's typical to reject $H_{0}$ if $p < .05$. In some areas of physics, it's typical to reject $H_{0}$ only if the $p$-value is below $10^{-9}$.

The choice of threshold for the $p$-value depends on our perceived costs of Type I vs. Type II error, which we can summarize in the table below (columns are the "truth", rows are our "decision"):

| Decision (rows) / Truth (columns) | $H_{0}$ is true | $H_{a}$ is true |
| --- | --- | --- |
| Reject $H_{0}$ | Type I Error | Correct |
| Do Not Reject $H_{0}$ | Correct | Type II Error |

The *power* of a hypothesis test is defined as the probability of not making a Type II error given $H_{a}$ holds. Ideally we would like both the Type I and Type II error rates to be small. But in practice, this is hard! There typically is a trade-off: we can make the Type I error small by only rejecting $H_{0}$ if we are quite sure that it doesn't hold; however, this will result in an increase in the Type II error. Alternatively, we can make the Type II error small by rejecting $H_{0}$ in the presence of even modest evidence that it does not hold, but this will cause the Type I error to be large. In practice, we typically view Type I errors as more "serious" than Type II errors, because the former involve declaring a scientific finding that is not correct. 
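The trade-off can be seen in a small simulation. This is only a sketch under assumed settings: normally distributed data with unit variance, 20 observations per group, and a true difference of 0.8 when $H_{a}$ holds. The helper `rejection_rate` is a hypothetical name introduced here.

```python
import numpy as np
from scipy import stats


def rejection_rate(true_diff, alpha, n=20, n_sims=20_000, seed=0):
    """Fraction of simulated experiments in which H0 is rejected."""
    rng = np.random.default_rng(seed)
    treatment = rng.normal(loc=true_diff, size=(n_sims, n))
    control = rng.normal(loc=0.0, size=(n_sims, n))
    p_values = stats.ttest_ind(treatment, control, axis=1).pvalue
    return np.mean(p_values < alpha)


results = {}
for alpha in (0.05, 0.001):
    type_1 = rejection_rate(true_diff=0.0, alpha=alpha)  # H0 is true
    power = rejection_rate(true_diff=0.8, alpha=alpha)   # H_a is true
    results[alpha] = (type_1, power)
    print(f"alpha={alpha}: Type I rate={type_1:.4f}, "
          f"power={power:.3f}, Type II rate={1 - power:.3f}")
```

Tightening the threshold from 0.05 to 0.001 pushes the Type I error rate down, but the power drops with it, so the Type II error rate goes up.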


**The Challenge of Multiple Testing**

Suppose you have a stockbroker who wishes to drum up new clients by convincing them of his predictive power. He tells 1,024 potential new clients that he can correctly predict whether Google's stock will increase or decrease each day for 10 days running. There are $2^{10} = 1{,}024$ possibilities for how Google's stock might change over the course of 10 days. So, he emails each client a different one of the 1,024 possibilities. Then, 10 days later, he emails the one to whom he sent the perfect prediction and says "See! I'm amazing."
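The stockbroker's trick can be spelled out in a few lines. In this sketch the actual stock movements are generated randomly purely for illustration; whatever they turn out to be, exactly one client receives a perfect 10-day prediction.

```python
from itertools import product
import random

n_days = 10
# Every possible up/down sequence over 10 days: 2**10 = 1,024 of them
predictions = list(product(("up", "down"), repeat=n_days))

# What the stock actually does (random here, for illustration only)
random.seed(42)
actual = tuple(random.choice(("up", "down")) for _ in range(n_days))

# Client i receives predictions[i]; find who got a perfect record
perfect_clients = [i for i, p in enumerate(predictions) if p == actual]

print(len(predictions), len(perfect_clients))
```

The broker has no predictive power at all. The "perfect" record is guaranteed by the number of predictions he sent out.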

This is an example of something called [data dredging](https://en.wikipedia.org/wiki/Data_dredging). The basic idea behind data dredging is this: suppose you look for deviations that only have a 1 in a million chance of occurring. If you look at 1 billion observations, you'd expect to see about 1 thousand of these by pure chance. Then, you just show off those 1,000 and make some claim about causality. Some humorous data dredging examples can be found on the website [spurious correlations](https://www.tylervigen.com/spurious-correlations). For example:

<center>
    <div>
        <img src="Spurious Cage.png" width="1000"/>
    </div>
</center>
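Correlations like this are easy to manufacture. As a rough sketch, the code below generates one random "target" series and 1,000 unrelated random series (10 yearly values each, all assumed sizes chosen for illustration), then reports the strongest correlation it can find among them.

```python
import numpy as np

rng = np.random.default_rng(0)
n_years, n_series = 10, 1000   # assumed sizes, for illustration only

# One target series and many candidate series, all pure noise
target = rng.normal(size=n_years)
candidates = rng.normal(size=(n_series, n_years))

# Pearson correlation of each candidate series with the target
target_c = target - target.mean()
cand_c = candidates - candidates.mean(axis=1, keepdims=True)
corr = (cand_c @ target_c) / (
    np.linalg.norm(cand_c, axis=1) * np.linalg.norm(target_c)
)

best = np.argmax(np.abs(corr))
print(f"strongest correlation: series {best}, r = {corr[best]:.2f}")
```

None of these series has anything to do with the target, yet searching through enough of them reliably turns up a strong-looking correlation.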

**The Family-Wise Error Rate**

The family-wise error rate is the probability of making *at least one* Type I error for multiple hypothesis tests. Stated a bit more formally, if $V$ represents the number of Type I errors (a.k.a. false positives), then the family-wise error rate is given by

$FWER = Pr(V \geq 1)$

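Let $V_{j} = 1$ if we make a Type I error on the $j$th test and $V_{j} = 0$ otherwise. If all $m$ null hypotheses are true and the tests are independent (which is a big assumption), and we reject any null hypothesis whose $p$-value falls below the threshold $\alpha$, then the $FWER$ is:

$FWER(\alpha) = 1 - Pr(V = 0) = 1 - Pr\left(\cap_{j = 1}^{m}\{V_{j} = 0\}\right) = 1 - \prod_{j = 1}^{m}(1-\alpha) = 1 - (1-\alpha)^{m}$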

If $m$ is large, then even for a small threshold like $\alpha = .05$ we're likely to get at least 1 Type I error. In fact, if $m = 100$ and $\alpha = .05$, then the $FWER$ is $1 - (.95)^{100} \approx .994$, or 99.4%. So, we're basically guaranteed at least one Type I error!
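A quick numerical check of this formula, as a sketch: the function name `fwer_independent` is introduced here for illustration, and the simulation relies on the assumption that $p$-values of true null hypotheses are uniformly distributed on $[0, 1]$ (which holds for continuous test statistics).

```python
import numpy as np


def fwer_independent(alpha, m):
    """FWER for m independent tests of true nulls at threshold alpha."""
    return 1 - (1 - alpha) ** m


for m in (1, 10, 100):
    print(m, round(fwer_independent(0.05, m), 3))

# Simulation: each row is one "family" of m independent true nulls,
# with p-values assumed uniform on [0, 1]
rng = np.random.default_rng(0)
m, alpha, n_sims = 100, 0.05, 10_000
p_values = rng.uniform(size=(n_sims, m))

# A family has V >= 1 if any of its p-values falls below alpha
fwer_simulated = np.mean((p_values < alpha).any(axis=1))
print(fwer_simulated)
```

The simulated rate should land close to the value the formula gives for $m = 100$.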

The Bonferroni method controls the family-wise error rate by taking our desired threshold for making any Type I error, say $\alpha = .05$, and dividing it by the number of hypotheses $m$. So, if $\alpha = .05$ and $m = 100$, then we'd need a $p$-value below $.05/100 = .0005$ to reject any null hypothesis. This guarantees $FWER \leq \alpha$ whether or not the tests are independent, but it is typically more restrictive than you need, especially when the tests are correlated, which is often the case.
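Here is a minimal sketch of the Bonferroni rule, together with a simulation of its effect on the family-wise error rate. The helper `bonferroni_reject` is a hypothetical name, and the simulated $p$-values are assumed to be independent and uniform on $[0, 1]$, i.e. every null hypothesis is true.

```python
import numpy as np


def bonferroni_reject(p_values, alpha=0.05):
    """Reject H0_j when p_j < alpha / m, with m the number of tests."""
    p_values = np.asarray(p_values, dtype=float)
    m = p_values.shape[-1]
    return p_values < alpha / m


# Small example: m = 3, so the per-test threshold is 0.05 / 3
print(bonferroni_reject([0.0001, 0.03, 0.2]))

# Simulated families of 100 true nulls (assumed uniform p-values)
rng = np.random.default_rng(0)
p_null = rng.uniform(size=(20_000, 100))

fwer_uncorrected = np.mean((p_null < 0.05).any(axis=1))
fwer_bonferroni = np.mean(bonferroni_reject(p_null, alpha=0.05).any(axis=1))
print(fwer_uncorrected, fwer_bonferroni)
```

Without the correction almost every family contains at least one false positive. With it, the simulated family-wise error rate stays near the target of .05.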