# Hypothesis Testing

Notation: $\mathbf{X}$ denotes a random variable or random vector.
$\mathbf{x}$ is its realization.



A *hypothesis* is a statement about the parameter space $\Theta$.
Hypothesis testing checks whether the data support a *null hypothesis*
$\Theta_{0}$, which is a subset of $\Theta$ of interest. Ideally the
null hypothesis should be suggested by scientific theory. The
*alternative hypothesis* $\Theta_{1}=\Theta\backslash\Theta_{0}$ is the
complement of $\Theta_{0}$. Based on the observed evidence, hypothesis
testing decides to accept or reject the null hypothesis. If the null
hypothesis is rejected by the data, it implies that from the statistical
perspective the data is incompatible with the proposed scientific
theory.



In this chapter, we first introduce the idea and practice of
hypothesis testing and the related confidence interval. While we mainly
focus on the frequentist interpretation of hypothesis testing, we also
briefly discuss the Bayesian approach to statistical decisions. As an
application of the testing procedures to the linear regression model, we
elaborate on how to test a linear or nonlinear hypothesis about the slope
coefficients based on the unrestricted or restricted OLS estimators.



## Testing

### Decision Rule and Errors

If $\Theta_{0}$ is a singleton, we call it a *simple hypothesis*;
otherwise we call it a *composite hypothesis*. For example, if the
parameter space $\Theta=\mathbb{R}$, then $\Theta_{0}=\left\{ 0\right\}$
(or equivalently $\theta_{0}=0$) is a simple hypothesis, whereas
$\Theta_{0}=(-\infty,0]$ (or equivalently $\theta_{0}\leq0$) is a
composite hypothesis.

A *test function* is a mapping
$$\phi:\mathcal{X}^{n}\mapsto\left\{ 0,1\right\} ,$$ where $\mathcal{X}$
is the sample space. The null hypothesis is accepted if
$\phi\left(\mathbf{x}\right)=0$, or rejected if
$\phi\left(\mathbf{x}\right)=1$. We call the set
$A_{\phi}=\left\{ \mathbf{x}\in\mathcal{X}^{n}:\phi\left(\mathbf{x}\right)=0\right\}$
the *acceptance region*, and its complement
$R_{\phi}=\left\{ \mathbf{x}\in\mathcal{X}^{n}:\phi\left(\mathbf{x}\right)=1\right\}$
the *rejection region.*



The *power function* of a test $\phi$ is
$$\beta\left(\theta\right)=P_{\theta}\left\{ \phi\left(\mathbf{X}\right)=1\right\} =E_{\theta}\left[\phi\left(\mathbf{X}\right)\right].$$
The power function measures the probability that the test function
rejects the null when the data is generated under the true parameter
$\theta$, reflected in $P_{\theta}$ and $E_{\theta}$.

The *power* of a test for some $\theta\in\Theta_{1}$ is the value of
$\beta\left(\theta\right)$. The *size* of the test is
$\sup_{\theta\in\Theta_{0}}\beta\left(\theta\right).$ Notice that the
definition of power depends on a $\theta$ in the alternative hypothesis
$\Theta_{1}$, whereas that of size is independent of $\theta$ due to the
supremum over the set of null $\Theta_{0}$. The *level* of a test is any
value $\alpha\in\left(0,1\right)$ such that
$\alpha\geq\sup_{\theta\in\Theta_{0}}\beta\left(\theta\right)$, which is
often used when it is difficult to attain the exact supremum. A test of
size $\alpha$ is also a test of level $\alpha$ and of any bigger level,
while a test of level $\alpha$ must have size smaller than or equal to
$\alpha$.
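
To make the power function and the size concrete, the following minimal sketch works under assumptions that are not part of the text above: $X_{1},\ldots,X_{n}$ are i.i.d. $N\left(\theta,1\right)$, the null is $\Theta_{0}=(-\infty,0]$, and the test rejects when $\sqrt{n}\,\overline{X}>z_{1-\alpha}$. The function name `power` and the values $n=25$, $\alpha=0.05$ are illustrative choices. Under these assumptions $\beta\left(\theta\right)=1-\Phi\left(z_{1-\alpha}-\sqrt{n}\theta\right)$ is increasing in $\theta$, so the supremum over the null is attained at $\theta=0$.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal distribution N(0, 1)


def power(theta, n, alpha=0.05):
    """Power function of the one-sided z-test rejecting when sqrt(n) * mean > z_{1-alpha}."""
    critical_value = STD_NORMAL.inv_cdf(1 - alpha)
    # Under theta, sqrt(n) * (sample mean) is distributed N(sqrt(n) * theta, 1).
    return 1 - STD_NORMAL.cdf(critical_value - n ** 0.5 * theta)


n, alpha = 25, 0.05
null_grid = [-k / 100 for k in range(201)]  # grid on [-2, 0], a subset of Theta_0
size = max(power(theta, n, alpha) for theta in null_grid)  # maximum over the grid
power_at_half = power(0.5, n, alpha)  # power at one point of the alternative

print(f"size ~ {size:.4f}, power at theta = 0.5 ~ {power_at_half:.4f}")
```

The grid maximum approximates the supremum only because the power function is monotone here; for a general test the supremum may have to be derived analytically.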



The concept of *level* is useful if we do not have sufficient
information to derive the exact size of a test. Suppose
$\left(X_{1i},X_{2i}\right)_{i=1}^{n}$ are randomly drawn from some
unknown joint distribution, but we know the marginal distributions are
$X_{ji}\sim N\left(\theta_{j},1\right)$, for $j=1,2$. In order to test
the joint hypothesis $\theta_{1}=\theta_{2}=0$, we can construct a test
function
$$\phi_{\theta_{1}=\theta_{2}=0}\left(\mathbf{X}_{1},\mathbf{X}_{2}\right)=1\left\{ \left\{ \sqrt{n}\left|\overline{X}_{1}\right|\geq z_{1-\alpha/4}\right\} \cup\left\{ \sqrt{n}\left|\overline{X}_{2}\right|\geq z_{1-\alpha/4}\right\} \right\} ,$$
where $z_{1-\alpha/4}$ is the $\left(1-\alpha/4\right)$-th quantile of
the standard normal distribution. Under the null
$\theta_{1}=\theta_{2}=0$, the rejection probability of this test satisfies
$$\begin{aligned}P\left(\phi_{\theta_{1}=\theta_{2}=0}\left(\mathbf{X}_{1},\mathbf{X}_{2}\right)=1\right) & \leq P\left(\sqrt{n}\left|\overline{X}_{1}\right|\geq z_{1-\alpha/4}\right)+P\left(\sqrt{n}\left|\overline{X}_{2}\right|\geq z_{1-\alpha/4}\right)\\
 & =\alpha/2+\alpha/2=\alpha,
\end{aligned}$$ where the inequality follows by the *Bonferroni
inequality*
$$P\left(A\cup B\right)\leq P\left(A\right)+P\left(B\right).$$ (The
seemingly trivial Bonferroni inequality is useful in many proofs of
probability results.) Therefore, the level of
$\phi\left(\mathbf{X}_{1},\mathbf{X}_{2}\right)$ is $\alpha$, but the
exact size is unknown without the knowledge of the joint distribution.
(Even if we know the correlation of $X_{1i}$ and $X_{2i}$, putting two
marginally normal distributions together does not make a jointly normal
vector in general.)
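
The gap between level and size can be illustrated by simulation. The sketch below makes an assumption that the text deliberately avoids: the two sample means are jointly normal with a known correlation `rho` (a hypothetical parameter). Under the null, $\sqrt{n}\,\overline{X}_{1}$ and $\sqrt{n}\,\overline{X}_{2}$ are then standard normal with correlation `rho`, so we draw them directly. The function name `simulated_size`, the seed, and the number of draws are illustrative.

```python
import random
from statistics import NormalDist


def simulated_size(rho, alpha=0.05, draws=200_000, seed=0):
    """Monte Carlo rejection frequency of the Bonferroni test under theta_1 = theta_2 = 0."""
    rng = random.Random(seed)
    critical_value = NormalDist().inv_cdf(1 - alpha / 4)
    rejections = 0
    for _ in range(draws):
        # z1 and z2 play the role of sqrt(n) * sample means under the null:
        # each is N(0, 1) and, by assumption, they are jointly normal with correlation rho.
        z1 = rng.gauss(0.0, 1.0)
        z2 = rho * z1 + (1 - rho ** 2) ** 0.5 * rng.gauss(0.0, 1.0)
        if abs(z1) >= critical_value or abs(z2) >= critical_value:
            rejections += 1
    return rejections / draws


size_independent = simulated_size(rho=0.0)  # theory: 1 - (1 - alpha/2)^2 = 0.049375
size_identical = simulated_size(rho=1.0)    # theory: alpha/2 = 0.025
print(size_independent, size_identical)
```

In both cases the rejection frequency stays below the level $\alpha=0.05$, but the exact size depends on the joint distribution, which is the point of the example.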



|                 | accept $H_{0}$   | reject $H_{0}$   |
|:----------------|:----------------:|:----------------:|
| $H_{0}$ true    | correct decision | Type I error     |
| $H_{0}$ false   | Type II error    | correct decision |

Actions, States and Consequences

-   The *probability of committing Type I error* is
    $\beta\left(\theta\right)$ for some $\theta\in\Theta_{0}$.

-   The *probability of committing Type II error* is
    $1-\beta\left(\theta\right)$ for some $\theta\in\Theta_{1}$.



The philosophy of hypothesis testing has been debated for centuries. At
present the prevailing framework in statistics textbooks is the
*frequentist perspective*. A frequentist views the parameter as a fixed
constant. They keep a conservative attitude about the Type I error: Only
if overwhelming evidence is demonstrated shall a researcher reject the
null. Under the principle of protecting the null hypothesis, a desirable
test should have a small level. Conventionally we take $\alpha=0.01,$
0.05 or 0.1. We say a test is *unbiased* if
$\beta\left(\theta\right)>\sup_{\theta'\in\Theta_{0}}\beta\left(\theta'\right)$
for all $\theta\in\Theta_{1}$, that is, if its power at every alternative
exceeds its size. There can be many tests of correct size.

A trivial test function
$\phi(\mathbf{x})=1\left\{ 0\leq U\leq\alpha\right\}$, where $U$ is a
random variable from a uniform distribution on $\left[0,1\right]$ drawn
independently of the data, rejects with probability $\alpha$ for all
$\theta\in\Theta$. It has correct size $\alpha$ but no
non-trivial power at the alternative. At the other extreme, the trivial
test function $\phi\left(\mathbf{x}\right)=1$ for all $\mathbf{x}$
enjoys the biggest power but suffers from an incorrect size of 1.

Usually, we design a test by proposing a test statistic
$T_{n}:\mathcal{X}^{n}\mapsto\mathbb{R}^{+}$ and a critical value
$c_{1-\alpha}$. Given $T_{n}$ and $c_{1-\alpha}$, we write the test
function as
$$\phi\left(\mathbf{X}\right)=1\left\{ T_{n}\left(\mathbf{X}\right)>c_{1-\alpha}\right\} .$$
To ensure such a $\phi\left(\mathbf{x}\right)$ has correct size, we need
to figure out the distribution of $T_{n}$ under the null hypothesis
(called the *null distribution*), and choose a critical value
$c_{1-\alpha}$ according to the null distribution and the desirable size
or level $\alpha$.
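
As a minimal sketch of this recipe, assume (beyond what the text specifies) that $X_{1},\ldots,X_{n}$ are i.i.d. $N\left(\theta,1\right)$ and the null is $\theta=0$. Take $T_{n}=\sqrt{n}\left|\overline{X}\right|$, whose null distribution is that of $\left|Z\right|$ with $Z\sim N\left(0,1\right)$, so the critical value is $c_{1-\alpha}=z_{1-\alpha/2}$. The function names `z_statistic` and `z_test` and the data are made up for illustration.

```python
from statistics import NormalDist, fmean


def z_statistic(sample):
    """T_n = sqrt(n) * |sample mean|, distributed as |N(0, 1)| under theta = 0."""
    return len(sample) ** 0.5 * abs(fmean(sample))


def z_test(sample, alpha=0.05):
    """Test function: return 1 (reject) if T_n exceeds the critical value, else 0 (accept)."""
    # P(|Z| > c) = alpha for Z ~ N(0, 1) when c is the (1 - alpha/2) quantile.
    critical_value = NormalDist().inv_cdf(1 - alpha / 2)
    return int(z_statistic(sample) > critical_value)


sample = [0.3, -0.1, 0.8, 0.5, 0.2, 0.9, -0.4, 0.6, 0.1]  # made-up data
shifted = [x + 1.0 for x in sample]                        # same data moved away from 0
print(z_test(sample), z_test(shifted))
```

The first sample has $T_{n}\approx0.97$, below $z_{0.975}\approx1.96$, so the null is accepted; the shifted sample has $T_{n}\approx3.97$ and the null is rejected.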

Another commonly used indicator in hypothesis testing is the $p$-value:
$$\sup_{\theta\in\Theta_{0}}P_{\theta}\left\{ T_{n}\left(\mathbf{x}\right)\leq T_{n}\left(\mathbf{X}\right)\right\} .$$
In the above expression, $T_{n}\left(\mathbf{x}\right)$ is the realized
value of the test statistic $T_{n}$, while
$T_{n}\left(\mathbf{X}\right)$ is the random variable generated by
$\mathbf{X}$ under the null $\theta\in\Theta_{0}$. The interpretation of
the $p$-value is tricky. The $p$-value is the largest probability, over
the parameters in the null, that we observe $T_{n}(\mathbf{X})$ being at
least as large as the realized $T_{n}(\mathbf{x})$.

The $p$-value is *not* the probability that the null hypothesis is true.
Under the frequentist perspective, the null hypothesis is either true or
false, with certainty. The randomness of a test comes only from
sampling, not from the hypothesis. The $p$-value measures how compatible
the dataset is with the null hypothesis. The $p$-value is closely
related to the corresponding test. When the $p$-value is smaller than the
specified test size $\alpha$, the test rejects the null.
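
The link between the $p$-value and the test can be checked numerically. The sketch below again assumes i.i.d. $N\left(\theta,1\right)$ data, the simple null $\theta=0$, and $T_{n}=\sqrt{n}\left|\overline{X}\right|$; none of these choices come from the text, and the function name `p_value` and the data are illustrative. Because the null is simple, the supremum in the definition is over a single point.

```python
from statistics import NormalDist, fmean


def p_value(sample):
    """P{T_n(X) >= T_n(x)} under theta = 0, where T_n = sqrt(n) * |sample mean|."""
    realized = len(sample) ** 0.5 * abs(fmean(sample))
    # Under the null T_n(X) = |Z| with Z ~ N(0, 1), so P(|Z| >= t) = 2 * (1 - Phi(t)).
    return 2 * (1 - NormalDist().cdf(realized))


alpha = 0.05
critical_value = NormalDist().inv_cdf(1 - alpha / 2)
sample = [0.3, -0.1, 0.8, 0.5, 0.2, 0.9, -0.4, 0.6, 0.1]  # made-up data
shifted = [x + 1.0 for x in sample]

for data in (sample, shifted):
    realized = len(data) ** 0.5 * abs(fmean(data))
    p = p_value(data)
    # The two decision rules agree: p < alpha exactly when T_n > critical value.
    print(f"T_n = {realized:.3f}, p = {p:.4f}, "
          f"reject by p-value: {p < alpha}, reject by T_n: {realized > critical_value}")
```

For a continuous null distribution such as this one, comparing the $p$-value with $\alpha$ and comparing $T_{n}$ with $c_{1-\alpha}$ give the same decision.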




## Summary

Applied econometrics is a field obsessed with hypothesis testing, in the
hope of establishing at least statistical association and ideally
causality. Hypothesis testing is a fundamentally important topic in
statistics. The states and the decisions in the table of actions, states
and consequences
remind us of the intrinsic connections with game theory in economics. I, a
game player, play a sequential game against “nature”.

Step0:  
The parameter space $\Theta$ is partitioned into the null hypothesis
$\Theta_{0}$ and the alternative hypothesis $\Theta_{1}$ according to a
scientific theory.

Step1:  
Before I observe the data, I design a test function $\phi$ according to
$\Theta_{0}$ and $\Theta_{1}$. In game theory terminology, the
contingency plan $\phi$ is my *strategy*.

Step2:  
Once I observe the fixed data $\mathbf{x}$, I act according to the
instruction of $\phi\left(\mathbf{x}\right)$ — either accept
$\Theta_{0}$ or reject $\Theta_{0}$.

Step3:  
Nature reveals the true parameter $\theta^{*}$ behind $\mathbf{x}$. Then
I can evaluate the gain/loss of my decision
$\phi\left(\mathbf{x}\right)$.

When the loss function (negative payoff) is specified as
$$\mathscr{L}\left(\theta,\phi\left(\mathbf{x}\right)\right)=\phi\left(\mathbf{x}\right)\cdot1\left\{ \theta\in\Theta_{0}\right\} +\left(1-\phi\left(\mathbf{x}\right)\right)\cdot1\left\{ \theta\in\Theta_{1}\right\} ,$$
the randomness of the data will incur the risk (expected loss)
$$\mathscr{R}\left(\theta,\phi\right)=E_{\theta}\left[\mathscr{L}\left(\theta,\phi\left(\mathbf{X}\right)\right)\right]=\beta_{\phi}\left(\theta\right)\cdot1\left\{ \theta\in\Theta_{0}\right\} +\left(1-\beta_{\phi}\left(\theta\right)\right)\cdot1\left\{ \theta\in\Theta_{1}\right\} .$$
I am a rational person. I understand the structure of the game and I
want to do a good job in Step 1 in designing my strategy. I want to
minimize my risk.
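
The risk function can be evaluated explicitly in a simple case. The sketch below assumes, for illustration only, i.i.d. $N\left(\theta,1\right)$ data, the null $\Theta_{0}=(-\infty,0]$, and the one-sided test that rejects when $\sqrt{n}\,\overline{X}>z_{1-\alpha}$; the names `power` and `risk` and the values $n=25$, $\alpha=0.05$ are hypothetical choices.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()


def power(theta, n, alpha):
    """beta_phi(theta) for the one-sided z-test rejecting when sqrt(n) * mean > z_{1-alpha}."""
    critical_value = STD_NORMAL.inv_cdf(1 - alpha)
    return 1 - STD_NORMAL.cdf(critical_value - n ** 0.5 * theta)


def risk(theta, n=25, alpha=0.05):
    """Expected 0-1 loss: Type I error probability on Theta_0, Type II on Theta_1."""
    beta = power(theta, n, alpha)
    return beta if theta <= 0 else 1 - beta


for theta in (-0.2, 0.0, 0.1, 0.5):
    print(f"theta = {theta:+.1f}: risk = {risk(theta):.4f}")
```

Under these assumptions the risk is at most $\alpha$ on the null, jumps close to $1-\alpha$ just inside the alternative, and falls as $\theta$ moves away from the boundary.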

If I am a frequentist, the parameter is a fixed constant, so one and only
one of $1\left\{ \theta\in\Theta_{0}\right\}$ and
$1\left\{ \theta\in\Theta_{1}\right\}$ equals 1. A test of level
$\alpha$ makes sure
$\sup_{\theta\in\Theta_{0}}\beta_{\phi}\left(\theta\right)\leq\alpha$.
When many tests satisfy this requirement, ideally I would like to pick
the best one. If it exists, in a class $\Psi_{\alpha}$ of unbiased tests
of size $\alpha$ the uniformly most powerful test $\phi^{*}$ satisfies
$\mathscr{R}\left(\theta,\phi^{*}\right)\leq\inf_{\phi\in\Psi_{\alpha}}\mathscr{R}\left(\theta,\phi\right)$
for every $\theta\in\Theta_{1}$; since the risk on $\Theta_{1}$ is
$1-\beta_{\phi}\left(\theta\right)$, the smallest risk means the largest
power. For simple versus simple tests, the likelihood ratio test (LRT) is
the uniformly most powerful test according to the Neyman-Pearson Lemma.

If I am a Bayesian, I do not mind imposing probability (weight) on the
parameter space, which is my prior belief $\pi\left(\theta\right)$. My
Bayesian risk becomes $$\begin{aligned}
\mathscr{BR}\left(\pi,\phi\right) & =E_{\pi\left(\theta\right)}\left[\mathscr{R}\left(\theta,\phi\right)\right]=\int\left[\beta_{\phi}\left(\theta\right)\cdot1\left\{ \theta\in\Theta_{0}\right\} +\left(1-\beta_{\phi}\left(\theta\right)\right)\cdot1\left\{ \theta\in\Theta_{1}\right\} \right]\pi\left(\theta\right)d\theta\\
 & =\int_{\left\{ \theta\in\Theta_{0}\right\} }\beta_{\phi}\left(\theta\right)\pi\left(\theta\right)d\theta+\int_{\left\{ \theta\in\Theta_{1}\right\} }(1-\beta_{\phi}\left(\theta\right))\pi\left(\theta\right)d\theta.\end{aligned}$$
This is the average (with respect to $\pi\left(\theta\right)$) risk over
the null and the alternative.
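
A numerical sketch of the Bayesian risk follows. All specifics are assumptions for illustration: i.i.d. $N\left(\theta,1\right)$ data with $n=25$, the null $\Theta_{0}=(-\infty,0]$, the one-sided test rejecting when $\sqrt{n}\,\overline{X}>z_{1-\alpha}$, and a standard normal prior $\pi$. The integral is approximated by a midpoint rule on $[-8,8]$; the names `risk` and `bayes_risk` are hypothetical.

```python
from math import atan, pi
from statistics import NormalDist

STD_NORMAL = NormalDist()


def risk(theta, n, alpha):
    """Frequentist risk under 0-1 loss for the one-sided z-test."""
    critical_value = STD_NORMAL.inv_cdf(1 - alpha)
    beta = 1 - STD_NORMAL.cdf(critical_value - n ** 0.5 * theta)
    return beta if theta <= 0 else 1 - beta


def bayes_risk(n=25, alpha=0.05, lower=-8.0, upper=8.0, steps=16_000):
    """Midpoint-rule approximation of the integral of risk(theta) * pi(theta)."""
    width = (upper - lower) / steps
    total = 0.0
    for k in range(steps):
        theta = lower + (k + 0.5) * width  # midpoint of the k-th cell
        total += risk(theta, n, alpha) * STD_NORMAL.pdf(theta) * width
    return total


# For alpha = 0.5 (critical value 0) this setup has the closed form atan(1/sqrt(n)) / pi.
closed_form = atan(1 / 25 ** 0.5) / pi
print(bayes_risk(alpha=0.05), bayes_risk(alpha=0.5), closed_form)
```

Under this symmetric prior and 0-1 loss, the test with $\alpha=0.5$ has a smaller Bayesian risk than the test with $\alpha=0.05$, which reflects that the frequentist protection of the null is a choice about how to weigh the two errors rather than a consequence of the loss function.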

**Historical notes**: Hypothesis testing started to take its modern
shape at the beginning of the 20th century. Karl Pearson (1857–1936)
laid the foundation of hypothesis testing and introduced the $\chi^{2}$
test and the $p$-value, among many other concepts that we keep using today.
The Neyman-Pearson Lemma is named after Jerzy Neyman (1894–1981) and Egon
Pearson (1895–1980), Karl’s son.

