Notation: $\mathbf{X}$ denotes a random variable or random vector.
$\mathbf{x}$ is its realization.

Hypothesis Testing
==================

* A *hypothesis* is a statement about the parameter space $\Theta$. 
* The *null hypothesis* $\Theta_{0}$ is a subset of $\Theta$ of interest,
typically suggested by scientific theory. 
* The *alternative
hypothesis* $\Theta_{1}=\Theta\backslash\Theta_{0}$ is the complement
of $\Theta_{0}$. 
* *Hypothesis testing* is a decision whether to accept
the null hypothesis or to reject it according to the observed evidence.
* If $\Theta_0$ is a singleton, we call it a *simple hypothesis*; otherwise we call it a *composite hypothesis*.

* A *test function* is a mapping
$$\phi:\mathcal{X}^{n}\mapsto\left\{ 0,1\right\},$$ where $\mathcal{X}$
is the sample space. We accept the null hypothesis if
$\phi\left(\mathbf{x}\right)=0$, or reject it if
$\phi\left(\mathbf{x}\right)=1$. 
* The *acceptance region* is defined as
$A_{\phi}=\left\{ \mathbf{x}\in\mathcal{X}^{n}:\phi\left(\mathbf{x}\right)=0\right\} ,$
and the *rejection region* is
$R_{\phi}=\left\{ \mathbf{x}\in\mathcal{X}^{n}:\phi\left(\mathbf{x}\right)=1\right\} .$
* The *power function* of the test $\phi$ is
$$\beta_{\phi}\left(\theta\right)=P_{\theta}\left(\phi\left(\mathbf{X}\right)=1\right)=E_{\theta}\left(\phi\left(\mathbf{X}\right)\right).$$
The power function measures, at a given point $\theta$, the probability that the
test function rejects the null.

* The *power* of $\phi$ at $\theta$ for some $\theta\in\Theta_{1}$ is
defined as the value of $\beta_{\phi}\left(\theta\right)$. 
* The *size* of
the test $\phi$ is defined as
$\alpha=\sup_{\theta\in\Theta_{0}}\beta_{\phi}\left(\theta\right).$
Notice that the definition of power depends on a $\theta$ in the
alternative, whereas that of size is independent of $\theta$ as it takes
the supremum over the null set $\Theta_0$. 
* The *level* of the test $\phi$ is a value
$\alpha\in\left(0,1\right)$ such that
$\alpha\geq\sup_{\theta\in\Theta_{0}}\beta_{\phi}\left(\theta\right)$,
which is often used when it is difficult to attain the exact supremum.



| decision      | accept $H_{0}$ | reject $H_{0}$ |
|---------------|----------------|----------------|
| $H_{0}$ true  | correct        | Type I error   |
| $H_{0}$ false | Type II error  | correct        |

* size = *P*(reject $H_{0}$|$H_{0}$ true)
* power = *P*(reject $H_{0}$|$H_{0}$ false)
* The *probability of committing Type I error* is
$\beta_{\phi}\left(\theta\right)$ for some $\theta\in\Theta_{0}$.
* The *probability of committing Type II error* is
$1-\beta_{\phi}\left(\theta\right)$ for $\theta\in\Theta_{1}$; in other
words, it is one minus the power at $\theta$.

The philosophy of hypothesis testing has been debated for centuries. 
At present the prevailing framework in statistics textbooks is the frequentist perspective. 
A frequentist views the
parameter as a fixed constant and keeps a conservative attitude about the Type
I error: only if overwhelming evidence is demonstrated should a
researcher reject the null. Under the philosophy of protecting the null hypothesis, a desirable test
should have a small level. Conventionally we take $\alpha=0.01,$ 0.05 or
0.1. There can be many tests of the correct size.

**Example** A trivial test function,
$\phi(\mathbf{X})=1\left\{ 0\leq U\leq\alpha\right\}$, where
$U$ is a random variable from a uniform distribution on
$\left[0,1\right]$ independent of the data, has correct size $\alpha$ but no
useful power: it rejects with probability $\alpha$ at every $\theta$. Another trivial test
function $\phi\left(\mathbf{X}\right)=1$ has the biggest power but a
useless size of 1.

Usually, we design a test by proposing a test statistic
$T_{n}:\mathcal{X}^{n}\mapsto\mathbb{R}^{+}$ and a critical value
$c_{1-\alpha}$. 
Given $T_n$ and $c_{1-\alpha}$, we write the test function as
$$\phi\left(\mathbf{X}\right)=1\left\{ T_{n}\left(\mathbf{X}\right)>c_{1-\alpha}\right\}.$$
To ensure such a $\phi\left(\mathbf{x}\right)$ has correct size, we
figure out the distribution of $T_{n}$ under the null hypothesis (called
the *null distribution*), and choose a critical value $c_{1-\alpha}$ according to the null
distribution and the desirable size or level $\alpha$.

The concept of *level* is useful if we do not have information to derive
the exact size of a test.

**Example** Suppose $\left(X_{1i},X_{2i}\right)_{i=1}^{n}$ are
randomly drawn from some unknown joint distribution, but we know the marginal distributions are
$X_{ji}\sim N\left(\theta_{j},1\right)$, for $j=1,2$. In order to test
the joint hypothesis $\theta_{1}=\theta_{2}=0$, we can construct a test
function
$$\phi\left(\mathbf{X}_{1},\mathbf{X}_{2}\right)=1\left\{ \left\{ \sqrt{n}\left|\overline{X}_{1}\right|\geq c_{1-\alpha/4}\right\} \cup\left\{ \sqrt{n}\left|\overline{X}_{2}\right|\geq c_{1-\alpha/4}\right\} \right\} ,$$
where $c_{1-\alpha/4}$ is the $\left(1-\alpha/4\right)$-th quantile of
the standard normal distribution. The level of this test is
$$\begin{aligned}
P_{\theta_{1}=\theta_{2}=0}\left(\phi\left(\mathbf{X}_{1},\mathbf{X}_{2}\right)=1\right) & \leq P_{\theta_{1}=0}\left(\sqrt{n}\left|\overline{X}_{1}\right|\geq c_{1-\alpha/4}\right)+P_{\theta_{2}=0}\left(\sqrt{n}\left|\overline{X}_{2}\right|\geq c_{1-\alpha/4}\right)\\
 & =\alpha/2+\alpha/2=\alpha,\end{aligned}$$ where the inequality
follows by the *Bonferroni inequality*
$P\left(A\cup B\right)\leq P\left(A\right)+P\left(B\right)$. Therefore,
the level of $\phi\left(\mathbf{X}_{1},\mathbf{X}_{2}\right)$ is
$\alpha$, but the exact size is unknown without the knowledge of the
joint distribution. (Even if we know the correlation of $X_{1i}$ and
$X_{2i}$, putting two marginally normal distributions together does not
make a jointly normal vector in general.)
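The gap between level and size in this example can be explored by simulation. The sketch below is only an illustration: it *assumes* the pair is jointly normal with correlation `rho` (the text deliberately leaves the joint distribution unknown), and the function name `rejection_rate`, the sample size, and the number of replications are hypothetical choices.

```python
import numpy as np
from statistics import NormalDist

alpha = 0.05
# c_{1 - alpha/4}: the (1 - alpha/4) quantile of the standard normal
crit = NormalDist().inv_cdf(1 - alpha / 4)

def rejection_rate(rho, n=30, reps=20_000, seed=0):
    """Monte Carlo rejection frequency under theta_1 = theta_2 = 0.

    Assumption for illustration only: (X_1i, X_2i) is bivariate normal
    with unit variances and correlation rho.
    """
    rng = np.random.default_rng(seed)
    cov = np.array([[1.0, rho], [rho, 1.0]])
    x = rng.multivariate_normal(np.zeros(2), cov, size=(reps, n))  # (reps, n, 2)
    t = np.sqrt(n) * np.abs(x.mean(axis=1))                        # (reps, 2)
    # Reject when either coordinate exceeds the critical value (the union event).
    return float(np.mean((t >= crit).any(axis=1)))

rates = {rho: rejection_rate(rho) for rho in (0.0, 0.5, 0.99)}
for rho, rate in rates.items():
    print(f"rho = {rho:4.2f}: rejection rate = {rate:.4f} (level alpha = {alpha})")
```

Under this assumed joint distribution the rejection frequency stays below $\alpha$ for every `rho`, and it moves further below $\alpha$ as the two coordinates become more strongly correlated, which is why $\alpha$ is a level rather than the exact size.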

There can be many tests of a correct level. Denote the class of test
functions of level $\alpha$ as
$\Psi_{\alpha}=\left\{ \phi:\sup_{\theta\in\Theta_{0}}\beta_{\phi}\left(\theta\right)\leq\alpha\right\}$.
A *uniformly most powerful test* $\phi^{*}\in\Psi_{\alpha}$ is a test
function such that, for every $\phi\in\Psi_{\alpha},$
$$\beta_{\phi^{*}}\left(\theta\right)\geq\beta_{\phi}\left(\theta\right)$$ uniformly over $\theta\in\Theta_{1}$.

**Example** Suppose a random sample of size 6 is generated from
$$\left(X_{1},\ldots,X_{6}\right)\sim\text{i.i.d.}N\left(\theta,1\right),$$
where $\theta$ is unknown. We want to infer the population mean of the
normal distribution. The null hypothesis is $H_{0}$: $\theta\leq0$ and
the alternative is $H_{1}$: $\theta>0$. All tests in
$$\Psi=\left\{ 1\left\{ \bar{X}\geq c/\sqrt{6}\right\} :c\geq1.64\right\}$$
have the correct level. Since $\bar{X}\sim N\left(\theta,1/6\right)$,
the power function for those in $\Psi$ is
$$\beta_{\phi}\left(\theta\right)=P_{\theta}\left(\bar{X}\geq\frac{c}{\sqrt{6}}\right)=P_{\theta}\left(\sqrt{6}\left(\bar{X}-\theta\right)\geq c-\sqrt{6}\theta\right)=1-\Phi\left(c-\sqrt{6}\theta\right)$$
where $\Phi$ is the cdf of standard normal.
It is clear that $\beta_{\phi}\left(\theta\right)$ is monotonically decreasing in $c$. 
Thus the test function
$$\phi\left(\mathbf{X}\right)=1\left\{ \bar{X}\geq 1.64/\sqrt{6}\right\}$$
is the most powerful test in $\Psi$, as $c=1.64$ is the lower bound that $\Psi$ allows.
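The power function of this example is easy to evaluate numerically. The following is a minimal sketch using only the Python standard library; the function name `power` and the particular values of $c$ and $\theta$ are illustrative choices, not part of the example.

```python
from statistics import NormalDist

N = 6                       # sample size in the example
std_normal = NormalDist()   # standard normal N(0, 1)

def power(theta: float, c: float) -> float:
    """Rejection probability of 1{ Xbar >= c / sqrt(N) } at theta:
    1 - Phi(c - sqrt(N) * theta)."""
    return 1.0 - std_normal.cdf(c - N ** 0.5 * theta)

for c in (1.64, 2.0, 2.5):
    # The power function is increasing in theta, so the supremum over
    # the null theta <= 0 is attained at theta = 0.
    size = power(0.0, c)
    print(f"c = {c:4.2f}: size = {size:.4f}, power at theta = 1: {power(1.0, c):.4f}")
```

Raising $c$ lowers both the size and the power at every $\theta$, so the smallest admissible critical value gives the most powerful test in $\Psi$. Note that 1.64 is a rounded value of the 95% standard normal quantile (about 1.645), so the computed size at $c=1.64$ is marginally above 0.05.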

Another commonly used indicator in hypothesis testing is the $p$-value:
$$\sup_{\theta\in\Theta_{0}}P_{\theta}\left(T_{n}\left(\mathbf{x}\right)\leq T_{n}\left(\mathbf{X}\right)\right).$$
In the above expression, $T_{n}\left(\mathbf{x}\right)$ is the realized
value of the test statistic $T_{n}$, while
$T_{n}\left(\mathbf{X}\right)$ is the random variable generated by
$\mathbf{X}$ under the null $\theta\in\Theta_{0}$. 
The interpretation of the $p$-value is tricky. 
The $p$-value is the probability that we observe $T_n (\mathbf{X})$ at least as large as the 
realized $T_n (\mathbf{x} )$ if the null hypothesis is true. 
The $p$-value is *not* the probability that the null
hypothesis is true. Under the frequentist perspective, the null
hypothesis is either true or false, with certainty. The randomness of a
test comes only from sampling, not from the hypothesis.

The $p$-value measures whether the evidence from the data is compatible with the
null hypothesis. 
It is closely
related to the corresponding test: 
when the $p$-value is smaller than the
specified test size $\alpha$, the test rejects the null hypothesis. 
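As a concrete illustration, consider the one-sided problem $H_{0}$: $\theta\leq0$ for an i.i.d. $N\left(\theta,1\right)$ sample with test statistic $T_{n}=\sqrt{n}\bar{X}$. The supremum over the null is attained at $\theta=0$, so the $p$-value is $1-\Phi\left(T_{n}\left(\mathbf{x}\right)\right)$. The sketch below computes it; the data values and the function name `p_value_one_sided` are hypothetical.

```python
from statistics import NormalDist, fmean

def p_value_one_sided(sample):
    """p-value for H0: theta <= 0 with T_n = sqrt(n) * sample mean,
    assuming i.i.d. N(theta, 1) observations."""
    n = len(sample)
    t_obs = n ** 0.5 * fmean(sample)        # realized test statistic T_n(x)
    # sup over theta <= 0 of P_theta(T_n(X) >= t_obs) is attained at theta = 0
    return 1.0 - NormalDist().cdf(t_obs)

sample = [0.8, -0.3, 1.2, 0.5, 0.9, 0.1]    # hypothetical data, n = 6
p = p_value_one_sided(sample)
alpha = 0.05
print(f"p-value = {p:.4f}; reject at alpha = {alpha}: {p < alpha}")
```

Comparing `p` with `alpha` gives the same decision as comparing the realized statistic with the critical value $c_{1-\alpha}$.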

Confidence Interval
===================

An *interval estimate* is a function
$C:\mathcal{X}^{n}\mapsto\left\{ \Theta':\Theta'\subseteq\Theta\right\}$
that maps a point in the sample space to a subset of the parameter
space. The *coverage probability* of an *interval estimator*
$C\left(\mathbf{X}\right)$ is defined as
$P_{\theta}\left(\theta\in C\left(\mathbf{X}\right)\right)$. The
coverage probability is the frequency with which the interval estimator
captures the true parameter that generates the sample. (From the
frequentist perspective, the parameter is fixed while the region is
random.) It is *not* the probability that $\theta$ is inside the given
region. (From the Bayesian perspective, the parameter is random while the
region is fixed conditional on $\mathbf{X}$.)

Suppose a random sample of size 6 is generated from
$$\left(X_{1},\ldots,X_{6}\right)\sim\text{i.i.d. }N\left(\theta,1\right).$$
Find the coverage probability of the random interval
$$\left[\bar{X}-1.96/\sqrt{6},\bar{X}+1.96/\sqrt{6}\right].$$
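For this exercise, $\sqrt{6}\left(\bar{X}-\theta\right)\sim N\left(0,1\right)$, so the coverage probability is $P\left(\left|Z\right|\leq1.96\right)=2\Phi\left(1.96\right)-1\approx0.95$ for every $\theta$. The sketch below evaluates this expression and checks it by Monte Carlo; the function name, the value of $\theta$, the seed, and the number of replications are arbitrary illustrative choices.

```python
import random
from statistics import NormalDist, fmean

n = 6
half_width = 1.96 / n ** 0.5

# Exact coverage: P(|Z| <= 1.96) for a standard normal Z
exact = 2 * NormalDist().cdf(1.96) - 1

def simulated_coverage(theta, reps=50_000, seed=0):
    """Fraction of simulated samples whose interval contains the true theta."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(reps):
        xbar = fmean(rng.gauss(theta, 1.0) for _ in range(n))
        hits += (xbar - half_width <= theta <= xbar + half_width)
    return hits / reps

print(f"exact coverage     = {exact:.4f}")
print(f"simulated coverage = {simulated_coverage(theta=2.0):.4f}")
```

The interval is random and $\theta$ is fixed: the simulation redraws the sample, and hence the interval, in every replication while holding $\theta$ constant.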

Hypothesis testing and confidence intervals are closely related.
Sometimes it is difficult to construct a confidence interval directly,
but easier to test a hypothesis. One way to construct a confidence
interval is by *inverting a corresponding test*. Suppose that, for each
$\theta\in\Theta$, $\phi_{\theta}$ is a test of size $\alpha$ for the simple null
hypothesis that the true parameter equals $\theta$. If $C\left(\mathbf{X}\right)$ is constructed as
$$C\left(\mathbf{x}\right)=\left\{ \theta\in\Theta:\phi_{\theta}\left(\mathbf{x}\right)=0\right\},$$
then its coverage probability is
$$P_{\theta}\left(\theta\in C\left(\mathbf{X}\right)\right)=1-P_{\theta}\left(\phi_{\theta}\left(\mathbf{X}\right)=1\right)=1-\alpha.$$
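Test inversion can be sketched numerically by scanning a grid of candidate values and keeping those that the test does not reject. The example assumes an i.i.d. $N\left(\theta,1\right)$ sample and the two-sided test $1\left\{ \sqrt{n}\left|\bar{x}-\theta\right|>c_{1-\alpha/2}\right\}$; the data, the grid, and the function names `phi` and `invert_test` are hypothetical.

```python
from statistics import NormalDist, fmean

def phi(theta0, sample, alpha=0.05):
    """Two-sided test of the simple null theta = theta0 (1 = reject),
    assuming i.i.d. N(theta, 1) data."""
    crit = NormalDist().inv_cdf(1 - alpha / 2)
    n = len(sample)
    return int(n ** 0.5 * abs(fmean(sample) - theta0) > crit)

def invert_test(sample, grid, alpha=0.05):
    """Confidence set on a grid: all theta0 that the test does not reject."""
    accepted = [t for t in grid if phi(t, sample, alpha) == 0]
    return min(accepted), max(accepted)

sample = [0.8, -0.3, 1.2, 0.5, 0.9, 0.1]            # hypothetical data
grid = [i / 1000 for i in range(-3000, 3001)]       # candidates in [-3, 3]
lo, hi = invert_test(sample, grid)

# Closed-form interval for comparison: xbar +/- c_{1 - alpha/2} / sqrt(n)
xbar, n = fmean(sample), len(sample)
half = NormalDist().inv_cdf(0.975) / n ** 0.5
print(f"inverted test: [{lo:.3f}, {hi:.3f}]")
print(f"closed form:   [{xbar - half:.3f}, {xbar + half:.3f}]")
```

In this simple model the inversion reproduces the familiar interval up to the grid resolution. The grid search is mainly useful when no closed form is available.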

Application in OLS
==================

Wald Test
---------

Suppose the OLS estimator $\widehat{\beta}$ is asymptotically normal,
i.e.
$$\sqrt{n}\left(\widehat{\beta}-\beta\right)\stackrel{d}{\to}N\left(0,\Omega\right)$$
where $\Omega$ is a $K\times K$ positive definite covariance matrix. If
$R$ is a $q\times K$ constant matrix, then
$R\sqrt{n}\left(\widehat{\beta}-\beta\right)\stackrel{d}{\to}N\left(0,R\Omega R'\right)$.
Moreover, if $\mbox{rank}\left(R\right)=q$, then
$$n\left(\widehat{\beta}-\beta\right)'R'\left(R\Omega R'\right)^{-1}R\left(\widehat{\beta}-\beta\right)\stackrel{d}{\to}\chi_{q}^{2}.$$
Now we intend to test the null hypothesis $R\beta=r$. Under the null,
the Wald statistic
$$W_{n}=n\left(R\widehat{\beta}-r\right)'\left(R\widehat{\Omega}R'\right)^{-1}\left(R\widehat{\beta}-r\right)\stackrel{d}{\to}\chi_{q}^{2}$$
where $\widehat{\Omega}$ is a consistent estimator of $\Omega$.

**Example** (Single test) In a linear regression 
$$\begin{aligned}
y_{i} & =x_{i}'\beta+e_{i}=\sum_{k=1}^{5}\beta_{k}x_{ik}+e_{i},\\
E\left[e_{i}x_{i}\right] & =\mathbf{0}_{5},\end{aligned}
$$
where $y_{i}$ is wage and
$$x_{i}=\left(\mbox{edu},\mbox{age},\mbox{experience},\mbox{experience}^{2},1\right)'.$$
To test whether *education* affects *wage*, we specify the null
hypothesis $\beta_{1}=0$. Let $R=\left(1,0,0,0,0\right)$. Under the null,
$$\sqrt{n}\widehat{\beta}_{1}=\sqrt{n}\left(\widehat{\beta}_{1}-\beta_{1}\right)=\sqrt{n}R\left(\widehat{\beta}-\beta\right)\stackrel{d}{\to}N\left(0,R\Omega R'\right)=N\left(0,\Omega_{11}\right),$$
where $\Omega_{11}$ is the $\left(1,1\right)$ (scalar) element of
$\Omega$. Therefore,
$$\sqrt{n}\frac{\widehat{\beta}_{1}}{\widehat{\Omega}_{11}^{1/2}}=\sqrt{\frac{\Omega_{11}}{\widehat{\Omega}_{11}}}\sqrt{n}\frac{\widehat{\beta}_{1}}{\Omega_{11}^{1/2}}.$$
If $\widehat{\Omega}\stackrel{p}{\to}\Omega$, then
$\left(\Omega_{11}/\widehat{\Omega}_{11}\right)^{1/2}\stackrel{p}{\to}1$
by the continuous mapping theorem. As
$\sqrt{n}\widehat{\beta}_{1}/\Omega_{11}^{1/2}\stackrel{d}{\to}N\left(0,1\right)$,
we conclude by Slutsky's theorem that
$\sqrt{n}\widehat{\beta}_{1}/\widehat{\Omega}_{11}^{1/2}\stackrel{d}{\to}N\left(0,1\right).$

The above example is a test about a single coefficient, and the
test statistic is essentially a *t*-statistic. The following example
gives a test about a joint hypothesis.

**Example** (Joint test) We want to simultaneously test $\beta_{1}=1$ and
$\beta_{3}+\beta_{4}=2$ in the above example. The null hypothesis can be
expressed in the general form $R\beta=r$, where the restriction matrix
$R$ is $$R=\begin{pmatrix}1 & 0 & 0 & 0 & 0\\
0 & 0 & 1 & 1 & 0
\end{pmatrix}$$ and $r=\left(1,2\right)'$. Once we figure out $R$, it is routine 
to construct the test.
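Once $R$ and $r$ are fixed, the Wald statistic is a few lines of linear algebra. The sketch below uses simulated data, so the regressors are *not* real wage data, and it estimates $\Omega$ with the heteroskedasticity-robust sandwich $\widehat{Q}^{-1}\widehat{V}\widehat{Q}^{-1}$, which is one possible consistent estimator chosen here as an assumption. The $p$-value uses the fact that the $\chi_{2}^{2}$ survival function is $\exp\left(-w/2\right)$, which holds only for $q=2$.

```python
import math
import numpy as np

def wald_statistic(X, y, R, r):
    """W_n = n (R b - r)' (R Omega_hat R')^{-1} (R b - r) for OLS."""
    n = X.shape[0]
    beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
    e_hat = y - X @ beta_hat
    Q_inv = np.linalg.inv(X.T @ X / n)
    # V_hat = (1/n) sum_i x_i x_i' e_i^2 (robust "meat"; an assumed choice)
    V_hat = (X * e_hat[:, None] ** 2).T @ X / n
    Omega_hat = Q_inv @ V_hat @ Q_inv
    d = R @ beta_hat - r
    return float(n * d @ np.linalg.solve(R @ Omega_hat @ R.T, d))

# Simulated design mimicking (edu, age, experience, experience^2, 1)
rng = np.random.default_rng(0)
n = 500
edu, age, exper = rng.normal(size=(3, n))
X = np.column_stack([edu, age, exper, exper ** 2, np.ones(n)])
beta_true = np.array([1.0, 0.5, 1.5, 0.5, 0.2])   # beta_1 = 1, beta_3 + beta_4 = 2
y = X @ beta_true + rng.normal(size=n)

R = np.array([[1.0, 0.0, 0.0, 0.0, 0.0],
              [0.0, 0.0, 1.0, 1.0, 0.0]])
r = np.array([1.0, 2.0])

W = wald_statistic(X, y, R, r)
p_value = math.exp(-W / 2)     # chi-square(2) survival function; q = 2 only
print(f"W_n = {W:.3f}, asymptotic p-value = {p_value:.3f}")
```

The simulated coefficients satisfy the null, so large values of `W` should be rare here. Changing `r` to a false value, for example `np.array([0.0, 2.0])`, makes the statistic grow with the sample size.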

These two examples are linear restrictions. In
order to test a nonlinear restriction, we need the so-called *delta
method*.

**Delta method** If
$\sqrt{n}\left(\widehat{\theta}-\theta_{0}\right)\stackrel{d}{\to}N\left(0,\Omega_{K\times K}\right)$,
and $f:\mathbb{R}^{K}\mapsto\mathbb{R}^{q}$ is a continuously
differentiable function for some $q\leq K$, then
$$\sqrt{n}\left(f\left(\widehat{\theta}\right)-f\left(\theta_{0}\right)\right)\stackrel{d}{\to}N\left(0,\frac{\partial f}{\partial\theta}\left(\theta_{0}\right)\Omega\frac{\partial f}{\partial\theta}\left(\theta_{0}\right)'\right).$$

In the example of linear regression, the optimal experience level can be
found by setting the first-order condition with respect to experience
to zero, $\beta_{3}+2\beta_{4}\mbox{experience}^{*}=0$. We test the
hypothesis that the optimal experience level is 20 years; in other
words, $$\mbox{experience}^{*}=-\frac{\beta_{3}}{2\beta_{4}}=20.$$ This
is a nonlinear hypothesis. If $q\leq K$ where $q$ is the number of
restrictions, we have
$$n\left(f\left(\widehat{\theta}\right)-f\left(\theta_{0}\right)\right)'\left(\frac{\partial f}{\partial\theta}\left(\theta_{0}\right)\Omega\frac{\partial f}{\partial\theta}\left(\theta_{0}\right)'\right)^{-1}\left(f\left(\widehat{\theta}\right)-f\left(\theta_{0}\right)\right)\stackrel{d}{\to}\chi_{q}^{2},$$
where in this example, $\theta=\beta$,
$f\left(\beta\right)=-\beta_{3}/\left(2\beta_{4}\right)$. The gradient is
$$\frac{\partial f}{\partial\beta}\left(\beta\right)=\left(0,0,-\frac{1}{2\beta_{4}},\frac{\beta_{3}}{2\beta_{4}^{2}},0\right).$$
Since $\widehat{\beta}\stackrel{p}{\to}\beta_{0}$, by the continuous
mapping theorem, if $\beta_{0,4}\neq0$, we have
$\frac{\partial}{\partial\beta}f\left(\widehat{\beta}\right)\stackrel{p}{\to}\frac{\partial}{\partial\beta}f\left(\beta_{0}\right)$.
Therefore, the (nonlinear) Wald test is
$$W_{n}=n\left(f\left(\widehat{\beta}\right)-20\right)'\left(\frac{\partial f}{\partial\beta}\left(\widehat{\beta}\right)\widehat{\Omega}\frac{\partial f}{\partial\beta}\left(\widehat{\beta}\right)'\right)^{-1}\left(f\left(\widehat{\beta}\right)-20\right)\stackrel{d}{\to}\chi_{1}^{2}.$$
This is a valid test with correct asymptotic size.

However, we can equivalently state the null hypothesis as
$\beta_{3}+40\beta_{4}=0$ and construct a Wald statistic
accordingly. In general, a linear hypothesis is preferred to a nonlinear
one, due to the approximation error in the delta method under the null
and, more importantly, the invalidity of the Taylor expansion under the
alternative. This also highlights that the Wald test is *not invariant* to re-parametrization.
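The lack of invariance is easy to see numerically. The sketch below evaluates the Wald statistic for the nonlinear form $-\beta_{3}/\left(2\beta_{4}\right)=20$ and for the linear form $\beta_{3}+40\beta_{4}=0$ at the same inputs. The values of `beta_hat`, `Omega_hat`, and `n` are made-up numbers for illustration only, not estimates from any data set.

```python
import numpy as np

def nonlinear_wald(beta_hat, Omega_hat, n, target=20.0):
    """Wald statistic for H0: -beta_3 / (2 beta_4) = target (delta method)."""
    b3, b4 = beta_hat[2], beta_hat[3]
    f_hat = -b3 / (2.0 * b4)
    # Gradient of f with respect to all five coefficients
    grad = np.array([0.0, 0.0, -1.0 / (2.0 * b4), b3 / (2.0 * b4 ** 2), 0.0])
    avar = grad @ Omega_hat @ grad
    return float(n * (f_hat - target) ** 2 / avar)

def linear_wald(beta_hat, Omega_hat, n, target=20.0):
    """Wald statistic for the equivalent H0: beta_3 + 2 * target * beta_4 = 0."""
    R = np.array([0.0, 0.0, 1.0, 2.0 * target, 0.0])
    d = R @ beta_hat
    return float(n * d ** 2 / (R @ Omega_hat @ R))

# Hypothetical inputs (coefficients ordered as edu, age, exper, exper^2, const)
beta_hat = np.array([0.08, 0.01, 0.045, -0.001, 1.0])
Omega_hat = np.diag([0.5, 0.1, 0.02, 0.00002, 4.0])
n = 1000

w_nonlinear = nonlinear_wald(beta_hat, Omega_hat, n)
w_linear = linear_wald(beta_hat, Omega_hat, n)
print(f"nonlinear form: W = {w_nonlinear:.4f}")
print(f"linear form:    W = {w_linear:.4f}")
```

Both statistics test the same hypothesis and share the same $\chi_{1}^{2}$ limit under the null, yet their finite-sample values differ because the nonlinear version linearizes $f$ at $\widehat{\beta}$.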

Lagrangian Multiplier Test 
-----------------------------------------

Restricted least squares solves
$$\min_{\beta}\left(y-X\beta\right)'\left(y-X\beta\right)\mbox{ s.t. }R\beta=r.$$
Turn it into an unrestricted problem with the Lagrangian
$$L\left(\beta,\lambda\right)=\frac{1}{2n}\left(y-X\beta\right)'\left(y-X\beta\right)+\lambda'\left(R\beta-r\right).$$
Let $\beta^{*}$ denote the true coefficient, so that $y=X\beta^{*}+e$ and, under the null, $R\beta^{*}=r$; let $\widehat{Q}=X'X/n$.
The first-order conditions at the solution $\left(\tilde{\beta},\tilde{\lambda}\right)$ are
$$\begin{aligned}
\frac{\partial}{\partial\beta}L & =-\frac{1}{n}X'\left(y-X\tilde{\beta}\right)+R'\tilde{\lambda}=-\frac{1}{n}X'e+\frac{1}{n}X'X\left(\tilde{\beta}-\beta^{*}\right)+R'\tilde{\lambda}=0,\\
\frac{\partial}{\partial\lambda}L & =R\tilde{\beta}-r=R\left(\tilde{\beta}-\beta^{*}\right)=0.
\end{aligned}$$
Combine these two equations into a linear system,
$$\begin{pmatrix}\widehat{Q} & R'\\
R & 0
\end{pmatrix}\begin{pmatrix}\tilde{\beta}-\beta^{*}\\
\tilde{\lambda}
\end{pmatrix}=\begin{pmatrix}\frac{1}{n}X'e\\
0
\end{pmatrix}.$$

$$\begin{aligned}
 &  & \begin{pmatrix}\tilde{\beta}-\beta^{*}\\
\tilde{\lambda}
\end{pmatrix}=\begin{pmatrix}\widehat{Q} & R'\\
R & 0
\end{pmatrix}^{-1}\begin{pmatrix}\frac{1}{n}X'e\\
0
\end{pmatrix}\\
 & = & \begin{pmatrix}\widehat{Q}^{-1}-\widehat{Q}^{-1}R'\left(R\widehat{Q}^{-1}R'\right)^{-1}R\widehat{Q}^{-1} & \widehat{Q}^{-1}R'\left(R\widehat{Q}^{-1}R'\right)^{-1}\\
\left(R\widehat{Q}^{-1}R'\right)^{-1}R\widehat{Q}^{-1} & -\left(R\widehat{Q}^{-1}R'\right)^{-1}
\end{pmatrix}\begin{pmatrix}\frac{1}{n}X'e\\
0
\end{pmatrix}.\end{aligned}$$

We conclude that
$$\sqrt{n}\tilde{\lambda}=\left(R\widehat{Q}^{-1}R'\right)^{-1}R\widehat{Q}^{-1}\frac{1}{\sqrt{n}}X'e.$$
In this subsection, let $Q$ denote the probability limit of $\widehat{Q}$ and let $\Omega$ denote the asymptotic variance of $\frac{1}{\sqrt{n}}X'e$. Then
$$\sqrt{n}\tilde{\lambda}\Rightarrow N\left(0,\left(RQ^{-1}R'\right)^{-1}RQ^{-1}\Omega Q^{-1}R'\left(RQ^{-1}R'\right)^{-1}\right).$$
Let
$W=\left(RQ^{-1}R'\right)^{-1}RQ^{-1}\Omega Q^{-1}R'\left(RQ^{-1}R'\right)^{-1}$;
we have
$$n\tilde{\lambda}'W^{-1}\tilde{\lambda}\Rightarrow\chi_{q}^{2}.$$ If
the errors are homoskedastic, then $\Omega=\sigma^{2}Q$ and
$W=\sigma^{2}\left(RQ^{-1}R'\right)^{-1}RQ^{-1}QQ^{-1}R'\left(RQ^{-1}R'\right)^{-1}=\sigma^{2}\left(RQ^{-1}R'\right)^{-1}.$
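In this case, replacing $Q$ with $\widehat{Q}=X'X/n$ and using the first-order condition $R'\tilde{\lambda}=\frac{1}{n}X'\left(y-X\tilde{\beta}\right)$, the statistic becomes
$$\begin{aligned}
\frac{n\tilde{\lambda}'R\widehat{Q}^{-1}R'\tilde{\lambda}}{\sigma^{2}} & =\frac{1}{n\sigma^{2}}\left(y-X\tilde{\beta}\right)'X\widehat{Q}^{-1}X'\left(y-X\tilde{\beta}\right)\\
 & =\frac{1}{\sigma^{2}}\left(y-X\tilde{\beta}\right)'P_{X}\left(y-X\tilde{\beta}\right),\end{aligned}$$
where $P_{X}=X\left(X'X\right)^{-1}X'$.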
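The derivation can be checked numerically by solving the linear system of first-order conditions directly. The sketch below uses simulated data and assumes $\sigma^{2}$ is estimated by the mean squared restricted residual (one possible choice, not specified in the text). It computes the homoskedastic LM statistic both from the multiplier $\tilde{\lambda}$ and from the projection of the restricted residuals onto $X$.

```python
import numpy as np

def restricted_ls(X, y, R, r):
    """Solve the first-order conditions  Q b + R' lam = X'y / n,  R b = r."""
    n, K = X.shape
    q = R.shape[0]
    Q_hat = X.T @ X / n
    kkt = np.block([[Q_hat, R.T], [R, np.zeros((q, q))]])
    rhs = np.concatenate([X.T @ y / n, r])
    sol = np.linalg.solve(kkt, rhs)
    return sol[:K], sol[K:]          # (beta_tilde, lambda_tilde)

def lm_statistic(X, y, R, r):
    """Homoskedastic LM statistic computed in two equivalent ways."""
    n = X.shape[0]
    beta_t, lam_t = restricted_ls(X, y, R, r)
    e_t = y - X @ beta_t                      # restricted residuals
    sigma2 = e_t @ e_t / n                    # assumed variance estimator
    Q_inv = np.linalg.inv(X.T @ X / n)
    via_multiplier = n * lam_t @ R @ Q_inv @ R.T @ lam_t / sigma2
    proj = X @ np.linalg.solve(X.T @ X, X.T @ e_t)   # P_X e_tilde
    via_projection = e_t @ proj / sigma2
    return float(via_multiplier), float(via_projection)

# Simulated data satisfying the null beta_1 = beta_2
rng = np.random.default_rng(0)
n = 200
X = np.column_stack([rng.normal(size=n), rng.normal(size=n), np.ones(n)])
y = X @ np.array([1.0, 1.0, 0.5]) + rng.normal(size=n)
R = np.array([[1.0, -1.0, 0.0]])
r = np.array([0.0])

lm_a, lm_b = lm_statistic(X, y, R, r)
print(f"LM via multiplier: {lm_a:.6f}")
print(f"LM via projection: {lm_b:.6f}")
```

The two numbers agree up to floating-point error, which reflects the identity $R'\tilde{\lambda}=\frac{1}{n}X'\left(y-X\tilde{\beta}\right)$ from the first-order condition.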

Likelihood-Ratio test 
------------------------------------

For the likelihood ratio test, the starting point can be a criterion
function
$L\left(\beta\right)=\left(y-X\beta\right)'\left(y-X\beta\right)$. It
does not have to be the likelihood function. A second-order Taylor expansion around the unrestricted estimator $\widehat{\beta}$ gives
$$\begin{aligned}
L\left(\tilde{\beta}\right)-L\left(\widehat{\beta}\right) & =\frac{\partial L}{\partial\beta'}\left(\widehat{\beta}\right)\left(\tilde{\beta}-\widehat{\beta}\right)+\frac{1}{2}\left(\tilde{\beta}-\widehat{\beta}\right)'\frac{\partial^{2}L}{\partial\beta\partial\beta'}\left(\dot{\beta}\right)\left(\tilde{\beta}-\widehat{\beta}\right)\\
 & =0+n\left(\tilde{\beta}-\widehat{\beta}\right)'\widehat{Q}\left(\tilde{\beta}-\widehat{\beta}\right),\end{aligned}$$
where the first term vanishes by the first-order condition of OLS and $\frac{\partial^{2}L}{\partial\beta\partial\beta'}=2X'X=2n\widehat{Q}$.
From the derivation of the LM test, we have 
$$\begin{aligned}
\sqrt{n}\left(\tilde{\beta}-\beta^{*}\right) 
 & =  \left(\widehat{Q}^{-1}-\widehat{Q}^{-1}R'\left(R\widehat{Q}^{-1}R'\right)^{-1}R\widehat{Q}^{-1}\right)\frac{1}{\sqrt{n}}X'e\\
 & =  \left(\frac{1}{n}X'X\right)^{-1}\frac{1}{\sqrt{n}}X'e-\widehat{Q}^{-1}R'\left(R\widehat{Q}^{-1}R'\right)^{-1}R\widehat{Q}^{-1}\frac{1}{\sqrt{n}}X'e\\
 & =  \sqrt{n}\left(\widehat{\beta}-\beta^{*}\right)-\widehat{Q}^{-1}R'\left(R\widehat{Q}^{-1}R'\right)^{-1}R\widehat{Q}^{-1}\frac{1}{\sqrt{n}}X'e.
 \end{aligned}$$
Therefore
$$\sqrt{n}\left(\tilde{\beta}-\widehat{\beta}\right)=-\widehat{Q}^{-1}R'\left(R\widehat{Q}^{-1}R'\right)^{-1}R\widehat{Q}^{-1}\frac{1}{\sqrt{n}}X'e$$
and 
$$\begin{aligned}
 &   n\left(\tilde{\beta}-\widehat{\beta}\right)'\widehat{Q}\left(\tilde{\beta}-\widehat{\beta}\right)\\
 & =  \frac{1}{\sqrt{n}}e'X\widehat{Q}^{-1}R'\left(R\widehat{Q}^{-1}R'\right)^{-1}R\widehat{Q}^{-1}\widehat{Q}\widehat{Q}^{-1}R'\left(R\widehat{Q}^{-1}R'\right)^{-1}R\widehat{Q}^{-1}\frac{1}{\sqrt{n}}X'e\\
 & =  \frac{1}{\sqrt{n}}e'X\widehat{Q}^{-1}R'\left(R\widehat{Q}^{-1}R'\right)^{-1}R\widehat{Q}^{-1}\frac{1}{\sqrt{n}}X'e
\end{aligned}$$
In general, it is a quadratic form of an asymptotically normal random vector. If
the errors are homoskedastic, then
$$\left(R\widehat{Q}^{-1}R'\right)^{-1/2}R\widehat{Q}^{-1}\frac{1}{\sqrt{n}}X'e$$
has asymptotic variance
$$\sigma^{2}\left(RQ^{-1}R'\right)^{-1/2}RQ^{-1}QQ^{-1}R'\left(RQ^{-1}R'\right)^{-1/2}=\sigma^{2}I_{q},$$
so the quadratic form divided by $\sigma^{2}$ converges in distribution to $\chi_{q}^{2}$.
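A quick numerical check of the quadratic-form representation: with the sum of squared residuals as the criterion function, the increase in the criterion from imposing the restriction equals $n\left(\tilde{\beta}-\widehat{\beta}\right)'\widehat{Q}\left(\tilde{\beta}-\widehat{\beta}\right)$ exactly. The sketch uses simulated data and the closed form $\tilde{\beta}=\widehat{\beta}-\left(X'X\right)^{-1}R'\left(R\left(X'X\right)^{-1}R'\right)^{-1}\left(R\widehat{\beta}-r\right)$, which follows from the first-order conditions of restricted least squares; the function name is hypothetical.

```python
import numpy as np

def criterion_gap(X, y, R, r):
    """Return (SSR_restricted - SSR_unrestricted, n * d' Q_hat d)
    with d = beta_tilde - beta_hat."""
    n = X.shape[0]
    XtX_inv = np.linalg.inv(X.T @ X)
    beta_hat = XtX_inv @ X.T @ y                    # unrestricted OLS
    gap = R @ beta_hat - r
    # Restricted estimator in closed form
    beta_til = beta_hat - XtX_inv @ R.T @ np.linalg.solve(R @ XtX_inv @ R.T, gap)
    ssr_u = np.sum((y - X @ beta_hat) ** 2)
    ssr_r = np.sum((y - X @ beta_til) ** 2)
    d = beta_til - beta_hat
    quad = n * d @ (X.T @ X / n) @ d
    return float(ssr_r - ssr_u), float(quad)

# Simulated data satisfying the null beta_1 = beta_2
rng = np.random.default_rng(1)
n = 200
X = np.column_stack([rng.normal(size=n), rng.normal(size=n), np.ones(n)])
y = X @ np.array([1.0, 1.0, 0.5]) + rng.normal(size=n)
R = np.array([[1.0, -1.0, 0.0]])
r = np.array([0.0])

ssr_gap, quad_form = criterion_gap(X, y, R, r)
print(f"SSR_R - SSR_U         = {ssr_gap:.6f}")
print(f"n * d' Q_hat d        = {quad_form:.6f}")
```

Dividing either quantity by a consistent estimate of $\sigma^{2}$ gives a statistic with a $\chi_{q}^{2}$ limit under homoskedasticity.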

We can view the optimization of the log-likelihood as a two-step
optimization with the inner step $\sigma=\sigma\left(\beta\right)$. By
the envelope theorem, when we take the derivative with respect to $\beta$, we
can ignore the indirect effect of
$\partial\sigma\left(\beta\right)/\partial\beta$.