Asymptotics
===========

Asymptotic theory is concerned with the behavior of statistics when the
sample size is arbitrarily large. It is a useful approximation technique
that simplifies complicated finite-sample analysis.

Modes of Convergence
--------------------

A deterministic sequence $z_{n}$ converges to $z$ if for any
$\varepsilon>0$, there exists an $N\left(\varepsilon\right)$ such that
for all $n>N\left(\varepsilon\right)$, we have
$\left|z_{n}-z\right|<\varepsilon$. We say $z$ is the limit of $z_{n}$,
and write $z_{n}\to z$.

In contrast to the convergence of a deterministic sequence, we are
interested in the convergence of random variables. Since a random
variable is “random”, we must define clearly what “convergence” means.
Several modes of convergence are often encountered.

-   Convergence almost surely\*

-   Convergence in probability:
    $\lim_{n\to\infty}P\left(\omega:\left|z_{n}\left(\omega\right)-z\right|<\varepsilon\right)=1$
    for any $\varepsilon>0$.

-   Squared-mean convergence:
    $\lim_{n\to\infty}E\left[\left(z_{n}-z\right)^{2}\right]=0.$

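For example, let $z_{n}$ be a binary random variable: $z_{n}=\sqrt{n}$ with
probability $1/n$, and $z_{n}=0$ with probability $1-1/n$. Then
$z_{n}\stackrel{p}{\to}0$ because
$P\left(\left|z_{n}\right|>\varepsilon\right)\leq1/n\to0$, but
$z_{n}\stackrel{m.s.}{\nrightarrow}0$ because
$E\left[z_{n}^{2}\right]=n\cdot\left(1/n\right)=1$ for every $n$.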
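
The two quantities in this example can be computed exactly rather than simulated. The sketch below is only an illustration; the function names `tail_probability` and `second_moment` are hypothetical, and it assumes $0<\varepsilon\leq\sqrt{n}$ so that only the value $\sqrt{n}$ counts as "far from zero".

```python
from fractions import Fraction


def tail_probability(n: int) -> Fraction:
    """P(|z_n| >= eps) for 0 < eps <= sqrt(n): only the value sqrt(n) is that far from 0."""
    return Fraction(1, n)


def second_moment(n: int) -> Fraction:
    """E[z_n^2] = (sqrt(n))^2 * (1/n) + 0^2 * (1 - 1/n), with (sqrt(n))^2 written as n."""
    return n * Fraction(1, n)


# The tail probability shrinks with n while the second moment stays put.
for n in (10, 1_000, 100_000):
    print(n, float(tail_probability(n)), float(second_moment(n)))
```

The tail probability vanishes as $n$ grows, while the second moment does not move, which is the gap between the two modes of convergence.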

Another example (to be formalized): gambling. Each person contributes one
dollar, and only one person wins the whole pot. As the number of
people grows, the average payoff per person is still one dollar, but its
limit in probability is zero.
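
One possible formalization, under the assumption (not stated above) that the winner is drawn uniformly at random among $n$ players, and writing $z_{n}$ for the payoff of a given player:
$$z_{n}=\begin{cases}
n & \mbox{with probability }1/n\\
0 & \mbox{with probability }1-1/n
\end{cases},\qquad E\left[z_{n}\right]=n\cdot\frac{1}{n}=1,\qquad P\left(\left|z_{n}\right|>\varepsilon\right)\leq\frac{1}{n}\to0.$$
Under this reading, $z_{n}\stackrel{p}{\to}0$ even though $E\left[z_{n}\right]=1$ for every $n$.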

Convergence in probability ignores what happens on a subset of
the sample space with small probability. Squared-mean convergence deals
with the average over the entire probability space. If a random variable
can take a wild value, even with small probability, it may destroy
squared-mean convergence. In contrast, such irregularity does not
undermine convergence in probability.

-   Convergence in distribution: $x_{n}\stackrel{d}{\to}x$ if
    $F_{x_{n}}\left(a\right)\to F_{x}\left(a\right)$ for each $a$ at which
    $F_{x}\left(\cdot\right)$ is continuous, where $F_{x_{n}}$ and
    $F_{x}$ are the CDFs of $x_{n}$ and $x$.

Convergence in distribution is about *pointwise* convergence of the CDFs,
not of the random variables themselves.

Let $x\sim N\left(0,1\right)$. If $z_{n}=x+1/n$, then
$z_{n}\stackrel{p}{\to}x$ and of course $z_{n}\stackrel{d}{\to}x$.
However, if $z_{n}=-x+1/n$, or $z_{n}=y+1/n$ where
$y\sim N\left(0,1\right)$ is independent of $x$, then
$z_{n}\stackrel{d}{\to}x$ but $z_{n}\stackrel{p}{\nrightarrow}x$.
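
For the case $z_{n}=-x+1/n$, both claims can be checked with the normal CDF alone. The following is a minimal sketch; the helper names are hypothetical, and it relies on the facts that $-x\sim N\left(0,1\right)$ and $z_{n}-x=-2x+1/n\sim N\left(1/n,4\right)$.

```python
import math


def std_normal_cdf(a: float) -> float:
    """Standard normal CDF expressed through the error function."""
    return 0.5 * (1.0 + math.erf(a / math.sqrt(2.0)))


def cdf_of_z_n(a: float, n: int) -> float:
    """CDF of z_n = -x + 1/n at a; since -x ~ N(0, 1), this equals Phi(a - 1/n)."""
    return std_normal_cdf(a - 1.0 / n)


def prob_far_from_x(eps: float, n: int) -> float:
    """P(|z_n - x| > eps), using z_n - x = -2x + 1/n ~ N(1/n, 4)."""
    mean, sd = 1.0 / n, 2.0
    inside = std_normal_cdf((eps - mean) / sd) - std_normal_cdf((-eps - mean) / sd)
    return 1.0 - inside


for n in (1, 100, 10_000):
    # The CDF gap at a = 0.3 shrinks, but the probability of being far from x does not.
    print(n, abs(cdf_of_z_n(0.3, n) - std_normal_cdf(0.3)), prob_far_from_x(0.5, n))
```

The CDF of $z_{n}$ approaches that of $x$ at every point, yet the probability that $z_{n}$ is more than $0.5$ away from $x$ stays bounded away from zero.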

The *Cramér-Wold device* handles convergence in distribution for random
vectors. A sequence of $K$-dimensional random vectors
$\left(X_{n}\right)$ converges in distribution to $X$ if and only if
$\lambda'X_{n}\stackrel{d}{\to}\lambda'X$ for any
$\lambda\in\mathbb{R}^{K}$ with $\lambda'\lambda=1$.

Law of Large Numbers[^1]
------------------------

The (weak) law of large numbers (LLN) is a collection of statements about
convergence in probability of the sample average to its population
counterpart. The basic form of the LLN is:
$$\frac{1}{n}\sum_{i=1}^{n}z_{i}-E\left[\frac{1}{n}\sum_{i=1}^{n}z_{i}\right]\stackrel{p}{\to}0$$
as $n\to\infty$. Various versions of LLN work under different
assumptions about the distributions and dependence of the random
variables.

-   Chebyshev LLN: if $\left(z_{1},\ldots,z_{n}\right)$ is a sample of
    i.i.d. observations, $E\left[z_{1}\right]=\mu$, and
    $\sigma^{2}=\mathrm{var}\left[z_{1}\right]<\infty$ exists, then
    $\frac{1}{n}\sum_{i=1}^{n}z_{i}-\mu\stackrel{p}{\to}0.$

Chebyshev LLN utilizes

-   *Chebyshev inequality*: for any random variable $x$ , we have
    $P\left(\left|x\right|>\varepsilon\right)\leq E\left[x^{2}\right]/\varepsilon^{2}$
    for any $\varepsilon>0$, if $E\left[x^{2}\right]$ exists.

Chebyshev inequality is a special case of

-   *Markov inequality*:
    $P\left(\left|x\right|>\varepsilon\right)\leq E\left[\left|x\right|^{r}\right]/\varepsilon^{r}$
    for $r\geq1$ and any $\varepsilon>0$, if
    $E\left[\left|x\right|^{r}\right]$ exists.

It is easy to verify Markov inequality. $$\begin{aligned}
E\left[\left|x\right|^{r}\right] & =\int_{\left|x\right|>\varepsilon}\left|x\right|^{r}dF_{X}+\int_{\left|x\right|\leq\varepsilon}\left|x\right|^{r}dF_{X}\\
 & \geq\int_{\left|x\right|>\varepsilon}\left|x\right|^{r}dF_{X}\geq\varepsilon^{r}\int_{\left|x\right|>\varepsilon}dF_{X}=\varepsilon^{r}P\left(\left|x\right|>\varepsilon\right).\end{aligned}$$
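
As a numerical sanity check of the Markov inequality, the sketch below compares the exact tail of an Exponential(1) variable with the bound for several $r$. The choice of distribution is an assumption made for illustration only; it is convenient because $P\left(x>\varepsilon\right)=e^{-\varepsilon}$ and $E\left[x^{r}\right]=\Gamma\left(r+1\right)$ are known in closed form.

```python
import math


def markov_bound(eps: float, r: float) -> float:
    """Markov bound E[|x|^r] / eps^r for x ~ Exponential(1), where E[x^r] = Gamma(r + 1)."""
    return math.gamma(r + 1.0) / eps ** r


def exact_tail(eps: float) -> float:
    """P(x > eps) = exp(-eps) for x ~ Exponential(1)."""
    return math.exp(-eps)


for r in (1, 2, 4):
    # r = 2 corresponds to the Chebyshev inequality as stated in the text.
    print(r, exact_tail(3.0), markov_bound(3.0, r))
```

The bound holds for every $r$ but can be quite loose, and which $r$ gives the tightest bound depends on $\varepsilon$.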

Consider a partial sum $S_{n}=\sum_{i=1}^{n}x_{i}$, where
$\mu_{i}=E\left[x_{i}\right]$ and
$\sigma_{i}^{2}=\mathrm{var}\left[x_{i}\right]$. We apply the Chebyshev
inequality to the sample mean
$\overline{x}-\bar{\mu}=n^{-1}\left(S_{n}-E\left[S_{n}\right]\right)$.
$$\begin{aligned}
P\left(\left|\bar{x}-\bar{\mu}\right|\geq\varepsilon\right) & =P\left(\left|S_{n}-E\left[S_{n}\right]\right|\geq n\varepsilon\right)\\
 & \leq\left(n\varepsilon\right)^{-2}E\left[\left(\sum_{i=1}^{n}\left(x_{i}-\mu_{i}\right)\right)^{2}\right]\\
 & =\left(n\varepsilon\right)^{-2}\mathrm{var}\left(\sum_{i=1}^{n}x_{i}\right)\\
 & =\left(n\varepsilon\right)^{-2}\left[\sum_{i=1}^{n}\mathrm{var}\left(x_{i}\right)+\sum_{i=1}^{n}\sum_{j\neq i}\mathrm{cov}\left(x_{i},x_{j}\right)\right].\end{aligned}$$

From the above derivation, convergence in probability holds as long as
the right-hand side shrinks to 0 as $n\to\infty$. Actually, the
convergence can be maintained under much more general conditions than
just under the i.i.d. assumption. The random variables in the sample do
not have to be identically distributed, and they do not have to be
independent either.
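
The bound derived above can be compared with simulated frequencies. This is a rough sketch under assumptions chosen for illustration: i.i.d. Uniform(0,1) draws, so that $\mu_{i}=1/2$, $\mathrm{var}\left(x_{i}\right)=1/12$ and all covariances are zero; the function names are hypothetical.

```python
import random


def exceed_frequency(n: int, eps: float, reps: int, seed: int = 0) -> float:
    """Share of replications with |sample mean - 0.5| >= eps for n i.i.d. Uniform(0, 1) draws."""
    rng = random.Random(seed)
    count = 0
    for _ in range(reps):
        mean = sum(rng.random() for _ in range(n)) / n
        if abs(mean - 0.5) >= eps:
            count += 1
    return count / reps


def chebyshev_bound(n: int, eps: float) -> float:
    """(n * eps)^(-2) * sum of variances, with var(x_i) = 1/12 and zero covariances."""
    return (n / 12.0) / (n * eps) ** 2


for n in (10, 100, 1000):
    print(n, exceed_frequency(n, 0.05, 2000), chebyshev_bound(n, 0.05))
```

Both columns shrink as $n$ grows; the bound is conservative, but it is enough to deliver convergence in probability.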

Another useful LLN is *Kolmogorov LLN*. Since its derivation requires
advanced knowledge of mathematics, we state the result without proof.

-   Kolmogorov LLN: if $\left(z_{1},\ldots,z_{n}\right)$ is a sample of
    i.i.d. observations and $E\left[z_{1}\right]=\mu$ exists, then
    $\frac{1}{n}\sum_{i=1}^{n}z_{i}-\mu\stackrel{p}{\to}0.$

Compared to Chebyshev LLN, Kolmogorov LLN only requires the existence of
the population mean, but not any higher moment. On the other hand,
i.i.d. is essential for Kolmogorov LLN.

Central Limit Theorem
---------------------

The central limit theorem (CLT) is a collection of probability results
about the convergence in distribution to a normally distributed random
variable. The basic form of the CLT is: for a sample
$\left(z_{1},\ldots,z_{n}\right)$ of *zero-mean* random variables,
$$\frac{1}{\sqrt{n}}\sum_{i=1}^{n}z_{i}\stackrel{d}{\to}N\left(0,\sigma^{2}\right).\label{eq:clt}$$
Various versions of CLT work under different assumptions about the
random variables.

*Lindeberg-Levy CLT* is the simplest CLT.

-   If the sample is i.i.d., $E\left[x_{1}\right]=0$ and
    $\mathrm{var}\left[x_{1}\right]=E\left[x_{1}^{2}\right]=\sigma^{2}<\infty$,
    then $\frac{1}{\sqrt{n}}\sum_{i=1}^{n}x_{i}\stackrel{d}{\to}N\left(0,\sigma^{2}\right)$.

Lindeberg-Levy CLT is easy to verify by the characteristic function. For
any random variable $x$, the function
$\varphi_{x}\left(t\right)=E\left[\exp\left(ixt\right)\right]$ is called
its *characteristic function*. The characteristic function fully
describes a distribution, just like PDF or CDF. For example, the
characteristic function of $N\left(\mu,\sigma^{2}\right)$ is
$\exp\left(it\mu-\frac{1}{2}\sigma^{2}t^{2}\right)$.

If $E\left[\left|x\right|^{k}\right]<\infty$ for a positive integer $k$,
then
$$\varphi_{x}\left(t\right)=1+itE\left[x\right]+\frac{\left(it\right)^{2}}{2}E\left[x^{2}\right]+\cdots+\frac{\left(it\right)^{k}}{k!}E\left[x^{k}\right]+o\left(t^{k}\right).$$
Under the assumption of Lindeberg-Levy CLT,
$$\varphi_{X_{i}/\sqrt{n}}\left(t\right)=1-\frac{t^{2}}{2n}\sigma^{2}+o\left(\frac{t^{2}}{n}\right)$$
for all $i$, and by independence we have $$\begin{aligned}
\varphi_{\frac{1}{\sqrt{n}}\sum_{i=1}^{n}x_{i}}\left(t\right) & =\prod_{i=1}^{n}\varphi_{x_{i}/\sqrt{n}}\left(t\right)=\left(1+i\cdot0-\frac{t^{2}}{2n}\sigma^{2}+o\left(\frac{t^{2}}{n}\right)\right)^{n}\\
 & \to\exp\left(-\frac{\sigma^{2}}{2}t^{2}\right),\end{aligned}$$ where
the limit is exactly the characteristic function of
$N\left(0,\sigma^{2}\right)$.
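
The limit of the characteristic functions can be checked numerically without any simulation. The sketch below assumes $x_{i}\sim\mbox{Uniform}\left(-1,1\right)$, an illustrative choice with mean zero, variance $\sigma^{2}=1/3$ and the real-valued characteristic function $\sin\left(t\right)/t$; the function names are hypothetical.

```python
import math


def cf_uniform(t: float) -> float:
    """Characteristic function of Uniform(-1, 1): sin(t) / t, with value 1 at t = 0."""
    return 1.0 if t == 0.0 else math.sin(t) / t


def cf_scaled_sum(t: float, n: int) -> float:
    """Characteristic function of n^(-1/2) * (sum of n i.i.d. Uniform(-1, 1) draws)."""
    return cf_uniform(t / math.sqrt(n)) ** n


def cf_normal_limit(t: float, sigma2: float = 1.0 / 3.0) -> float:
    """Characteristic function of N(0, sigma2)."""
    return math.exp(-0.5 * sigma2 * t * t)


for n in (1, 10, 1000):
    print(n, cf_scaled_sum(1.0, n), cf_normal_limit(1.0))
```

The product form `cf_uniform(t / sqrt(n)) ** n` is the same step as in the derivation above, where independence turns the characteristic function of the sum into a product.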

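-   Lindeberg-Feller CLT: the sample is independent but not identically
    distributed (i.n.i.d.) with zero means, and the *Lindeberg condition*
    holds: for any fixed $\varepsilon>0$,
    $$\frac{1}{s_{n}^{2}}\sum_{i=1}^{n}\int_{\left|x_{i}\right|>\varepsilon s_{n}}x_{i}^{2}dF_{x_{i}}\to0$$
    where $s_{n}=\left(\sum_{i=1}^{n}\sigma_{i}^{2}\right)^{1/2}$ and
    $\sigma_{i}^{2}=\mathrm{var}\left[x_{i}\right]$. Then
    $s_{n}^{-1}\sum_{i=1}^{n}x_{i}\stackrel{d}{\to}N\left(0,1\right)$.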

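-   Lyapunov CLT: i.n.i.d., with finite
    $E\left[\left|x_{i}\right|^{3}\right]$ for all $i$ and
    $s_{n}^{-3}\sum_{i=1}^{n}E\left[\left|x_{i}\right|^{3}\right]\to0$
    (the Lyapunov condition stated with third moments).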

Tools for Transformations
-------------------------

The original forms of LLN or CLT only deal with sample means. However,
most of the econometric estimators of interest are functions of sample
means. Therefore, we need tools to handle transformations.

-   Small op: $x_{n}=o_{p}\left(r_{n}\right)$ if
    $x_{n}/r_{n}\stackrel{p}{\to}0$.

-   Big Op: $x_{n}=O_{p}\left(r_{n}\right)$ if for any $\varepsilon>0$,
    there exist a $c>0$ and an $N$ such that
    $P\left(\left|x_{n}\right|/r_{n}>c\right)<\varepsilon$ for all $n>N$.

-   Continuous mapping theorem 1: If $x_{n}\stackrel{p}{\to}a$ and
    $f\left(\cdot\right)$ is continuous at $a$, then
    $f\left(x_{n}\right)\stackrel{p}{\to}f\left(a\right)$.

-   Continuous mapping theorem 2: If $x_{n}\stackrel{d}{\to}x$ and
    $f\left(\cdot\right)$ is continuous almost surely on the support of
    $x$, then $f\left(x_{n}\right)\stackrel{d}{\to}f\left(x\right)$.

-   Slutsky’s Theorem: If $x_{n}\stackrel{d}{\to}x$ and
    $y_{n}\stackrel{p}{\to}a$, then

    -   $x_{n}+y_{n}\stackrel{d}{\to}x+a$

    -   $x_{n}y_{n}\stackrel{d}{\to}ax$

    -   $x_{n}/y_{n}\stackrel{d}{\to}x/a$ if $a\neq0$.

-   Delta method: if
    $\sqrt{n}\left(\widehat{\theta}-\theta_{0}\right)\stackrel{d}{\to}N\left(0,\Omega\right)$,
    and $f\left(\cdot\right)$ is continuously differentiable at
    $\theta_{0}$, then
    $$\sqrt{n}\left(f\left(\widehat{\theta}\right)-f\left(\theta_{0}\right)\right)\stackrel{d}{\to}N\left(0,\frac{\partial f}{\partial\theta'}\left(\theta_{0}\right)\Omega\left(\frac{\partial f}{\partial\theta'}\left(\theta_{0}\right)\right)'\right).$$
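
The delta method can be illustrated with a small Monte Carlo experiment. The setup below is an assumption made for illustration: $\widehat{\theta}$ is the sample mean of i.i.d. Uniform(0,1) draws, so $\theta_{0}=1/2$ and $\Omega=1/12$, and $f\left(\theta\right)=\theta^{2}$, so the delta-method variance is $\left(2\theta_{0}\right)^{2}\Omega=1/12$. The function names are hypothetical.

```python
import math
import random


def delta_method_draws(n: int, reps: int, seed: int = 0) -> list:
    """Draws of sqrt(n) * (f(mean) - f(mu)) with f(t) = t^2, x_i ~ Uniform(0, 1), mu = 0.5."""
    rng = random.Random(seed)
    draws = []
    for _ in range(reps):
        mean = sum(rng.random() for _ in range(n)) / n
        draws.append(math.sqrt(n) * (mean ** 2 - 0.25))
    return draws


def sample_variance(values: list) -> float:
    """Unbiased sample variance."""
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values) / (len(values) - 1)


# Delta-method variance: f'(mu)^2 * var(x_i) = (2 * 0.5)^2 * (1 / 12).
asymptotic_variance = (2 * 0.5) ** 2 / 12.0
print(sample_variance(delta_method_draws(n=400, reps=3000)), asymptotic_variance)
```

The simulated variance of the transformed statistic should be close to the delta-method variance, up to Monte Carlo noise and a finite-sample term of order $1/n$.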

Asymptotic Properties of OLS
============================

We apply large sample theory to study the OLS estimator
$\widehat{\beta}=\left(X'X\right)^{-1}X'Y.$

Consistency
-----------

We say $\widehat{\beta}$ is *consistent* if
$\widehat{\beta}\stackrel{p}{\to}\beta$ as $n\to\infty$. To verify
consistency, we write
$$\widehat{\beta}-\beta=\left(X'X\right)^{-1}X'e=\left(\frac{1}{n}\sum_{i=1}^{n}x_{i}x_{i}'\right)^{-1}\frac{1}{n}\sum_{i=1}^{n}x_{i}e_{i}.\label{eq:ols_d}$$
The first term satisfies
$$\widehat{Q}=\frac{1}{n}\sum_{i=1}^{n}x_{i}x_{i}'\stackrel{p}{\to}Q=E\left[x_{i}x_{i}'\right],$$
and the second term satisfies
$$\frac{1}{n}\sum_{i=1}^{n}x_{i}e_{i}\stackrel{p}{\to}0.$$ No matter
whether $\left(y_{i},x_{i}\right)_{i=1}^{n}$ is an i.i.d., i.n.i.d., or
dependent sample, as long as the convergence in probability holds for
the above two expressions and $Q$ is invertible, we have
$\widehat{\beta}-\beta\stackrel{p}{\to}Q^{-1}0=0$ by the continuous
mapping theorem. In other words, $\widehat{\beta}$ is a consistent
estimator of $\beta$.
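
A quick simulation makes the consistency argument concrete. This is a sketch under assumptions not taken from the text: a single regressor without intercept, $x_{i}\sim N\left(0,1\right)$, true $\beta=2$, and a heteroskedastic error $e_{i}=\left|x_{i}\right|u_{i}$ with $u_{i}\sim N\left(0,1\right)$ independent of $x_{i}$, so that $E\left[x_{i}e_{i}\right]=0$. The function name `ols_slope` is hypothetical.

```python
import random


def ols_slope(n: int, beta: float = 2.0, seed: int = 0) -> float:
    """OLS estimate in y_i = beta * x_i + e_i (no intercept): sum(x * y) / sum(x * x)."""
    rng = random.Random(seed)
    sxx = 0.0
    sxy = 0.0
    for _ in range(n):
        x = rng.gauss(0.0, 1.0)
        e = abs(x) * rng.gauss(0.0, 1.0)  # heteroskedastic, but E[x * e] = 0
        y = beta * x + e
        sxx += x * x
        sxy += x * y
    return sxy / sxx


for n in (100, 10_000, 100_000):
    # Absolute estimation error for growing sample sizes.
    print(n, abs(ols_slope(n) - 2.0))
```

Here `sxx / n` plays the role of $\widehat{Q}$ and `sxy / n - beta * sxx / n` the role of $\frac{1}{n}\sum_{i}x_{i}e_{i}$; homoskedasticity is not needed for consistency.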

Asymptotic Normality
--------------------

In a finite sample, $\widehat{\beta}$ is a random variable. We have shown
the distribution of $\widehat{\beta}$ under normality in the previous
lecture. Without the restrictive normality assumption, how can we
characterize the randomness of the OLS estimator? If we multiply both
sides of the decomposition of $\widehat{\beta}-\beta$ used in the
consistency argument by $\sqrt{n}$, we have
$$\sqrt{n}\left(\widehat{\beta}-\beta\right)=\left(\frac{1}{n}\sum_{i=1}^{n}x_{i}x_{i}'\right)^{-1}\frac{1}{\sqrt{n}}\sum_{i=1}^{n}x_{i}e_{i}.$$
Since $E\left[x_{i}e_{i}\right]=0$, we apply a CLT to obtain
$$n^{-1/2}\sum_{i=1}^{n}x_{i}e_{i}\stackrel{d}{\to}N\left(0,\Sigma\right)$$
where $\Sigma=E\left[x_{i}x_{i}'e_{i}^{2}\right]$. Since
$\widehat{Q}\stackrel{p}{\to}Q$, by Slutsky’s theorem and the continuous
mapping theorem,
$$\sqrt{n}\left(\widehat{\beta}-\beta\right)\stackrel{d}{\to}Q^{-1}\times N\left(0,\Sigma\right)\sim N\left(0,\Omega\right)$$
where $\Omega=Q^{-1}\Sigma Q^{-1}$ is called the *asymptotic variance*.
This is the *asymptotic normality* of the OLS estimator.

Up to now we have derived the asymptotic distribution of
$\widehat{\beta}$. However, to make it feasible, we still have to
estimate the asymptotic variance $\Omega$. If $\widehat{\Sigma}$ is a
consistent estimator of $\Sigma$, then
$\widehat{\Omega}=\widehat{Q}^{-1}\widehat{\Sigma}\widehat{Q}^{-1}$ is a
consistent estimator of $\Omega$. (Of course, there are other ways to
estimate the asymptotic variance.) Then a feasible version of the
distributional result for $\widehat{\beta}$ is
$$\widehat{\Omega}^{-1/2}\sqrt{n}\left(\widehat{\beta}-\beta\right)\stackrel{d}{\to}N\left(0,I_{K}\right)$$
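
The feasible result can be used to build confidence intervals. The sketch below is one possible illustration, not the only way to estimate $\Omega$: it assumes a scalar regressor without intercept, $x_{i}\sim N\left(0,1\right)$, $e_{i}=\left|x_{i}\right|u_{i}$ with $u_{i}\sim N\left(0,1\right)$, and the plug-in $\widehat{\Sigma}=\frac{1}{n}\sum_{i}x_{i}^{2}\widehat{e}_{i}^{2}$. The function names are hypothetical.

```python
import math
import random


def robust_interval(n: int, beta: float, rng: random.Random) -> tuple:
    """95% interval for beta in y_i = beta * x_i + e_i using Omega_hat = Sigma_hat / Q_hat^2."""
    xs, ys = [], []
    for _ in range(n):
        x = rng.gauss(0.0, 1.0)
        xs.append(x)
        ys.append(beta * x + abs(x) * rng.gauss(0.0, 1.0))
    q_hat = sum(x * x for x in xs) / n
    beta_hat = sum(x * y for x, y in zip(xs, ys)) / (n * q_hat)
    # Sigma_hat = (1/n) * sum of x_i^2 * ehat_i^2, the sample analogue of E[x_i x_i' e_i^2].
    sigma_hat = sum((x * (y - beta_hat * x)) ** 2 for x, y in zip(xs, ys)) / n
    omega_hat = sigma_hat / q_hat ** 2
    half_width = 1.96 * math.sqrt(omega_hat / n)
    return beta_hat - half_width, beta_hat + half_width


def coverage(n: int = 400, reps: int = 800, beta: float = 2.0, seed: int = 0) -> float:
    """Share of replications in which the interval contains the true beta."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(reps):
        lo, hi = robust_interval(n, beta, rng)
        hits += lo <= beta <= hi
    return hits / reps


print(coverage())
```

If the asymptotic approximation is adequate, the printed coverage frequency should be in the neighborhood of the nominal 95%; in finite samples it is typically somewhat below.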

Estimation of the Variance\*
----------------------------

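We first show that the variance
$\Sigma=E\left[x_{i}x_{i}'e_{i}^{2}\right]$ is finite. Let $z_{i}=x_{i}e_{i}$, so
$\Sigma=E\left[z_{i}z_{i}'\right]$. By the Cauchy-Schwarz
inequality,
$\left|E\left[z_{ik}z_{ik'}\right]\right|\leq\left(E\left[z_{ik}^{2}\right]E\left[z_{ik'}^{2}\right]\right)^{1/2}$,
so the largest entry of $\Sigma$ in absolute value lies on the diagonal:
$$\left\Vert \Sigma\right\Vert _{\infty}=\max_{k=1,\ldots,K}E\left[z_{ik}^{2}\right].$$
For each $k$, again by the Cauchy-Schwarz inequality,
$E\left[z_{ik}^{2}\right]=E\left[x_{ik}^{2}e_{i}^{2}\right]\leq\left(E\left[x_{ik}^{4}\right]E\left[e_{i}^{4}\right]\right)^{1/2}$,
which is finite if $x_{ik}$ and $e_{i}$ have finite fourth moments.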

For the estimation of the variance, if the error is homoskedastic, then
$\Sigma=\sigma_{e}^{2}Q$ and it suffices to estimate $\sigma_{e}^{2}$.
Since
$\widehat{e}_{i}=y_{i}-x_{i}'\widehat{\beta}=e_{i}-x_{i}'\left(\widehat{\beta}-\beta\right)$,
$$\begin{aligned}
\frac{1}{n}\sum_{i=1}^{n}\widehat{e}_{i}^{2} & =\frac{1}{n}\sum_{i=1}^{n}\left(e_{i}-x_{i}'\left(\widehat{\beta}-\beta\right)\right)^{2}\\
 & =\frac{1}{n}\sum_{i=1}^{n}e_{i}^{2}-\left(\frac{2}{n}\sum_{i=1}^{n}e_{i}x_{i}\right)'\left(\widehat{\beta}-\beta\right)+\left(\widehat{\beta}-\beta\right)'\left(\frac{1}{n}\sum_{i=1}^{n}x_{i}x_{i}'\right)\left(\widehat{\beta}-\beta\right).\end{aligned}$$
The second term
$$\left(\frac{2}{n}\sum_{i=1}^{n}e_{i}x_{i}\right)'\left(\widehat{\beta}-\beta\right)=o_{p}\left(1\right)o_{p}\left(1\right)=o_{p}\left(1\right).$$
The third term
$$\left(\widehat{\beta}-\beta\right)'\left(\frac{1}{n}\sum_{i=1}^{n}x_{i}x'_{i}\right)\left(\widehat{\beta}-\beta\right)=o_{p}\left(1\right)O_{p}\left(1\right)o_{p}\left(1\right)=o_{p}\left(1\right).$$
As
$\frac{1}{n}\sum_{i=1}^{n}\widehat{e}_{i}^{2}=\frac{1}{n}\sum_{i=1}^{n}e_{i}^{2}+o_{p}\left(1\right)$
and
$\frac{1}{n}\sum_{i=1}^{n}e_{i}^{2}=\sigma_{e}^{2}+o_{p}\left(1\right)$,
we have
$\frac{1}{n}\sum_{i=1}^{n}\widehat{e}_{i}^{2}=\sigma_{e}^{2}+o_{p}\left(1\right)$.
In other words,
$\frac{1}{n}\sum_{i=1}^{n}\widehat{e}_{i}^{2}\stackrel{p}{\to}\sigma_{e}^{2}$.

For general heteroskedasticity, $$\begin{aligned}
\frac{1}{n}\sum_{i=1}^{n}x_{i}x_{i}'\widehat{e}_{i}^{2} & =\frac{1}{n}\sum_{i=1}^{n}x_{i}x_{i}'\left(e_{i}-x_{i}'\left(\widehat{\beta}-\beta\right)\right)^{2}\\
 & =\frac{1}{n}\sum_{i=1}^{n}x_{i}x_{i}'e_{i}^{2}-\frac{2}{n}\sum_{i=1}^{n}x_{i}x_{i}'e_{i}x_{i}'\left(\widehat{\beta}-\beta\right)+\frac{1}{n}\sum_{i=1}^{n}x_{i}x_{i}'\left(\left(\widehat{\beta}-\beta\right)'x_{i}\right)^{2}.\end{aligned}$$
The third term is bounded by $$\begin{aligned}
 &  & \mbox{trace}\left(\frac{1}{n}\sum_{i=1}^{n}x_{i}x_{i}'\left(\left(\widehat{\beta}-\beta\right)'x_{i}\right)^{2}\right)\\
 & \leq & K\max_{k}\frac{1}{n}\sum_{i=1}^{n}x_{ik}^{2}\left[\left(\widehat{\beta}-\beta\right)'x_{i}\right]^{2}\\
 & \leq & K\left\Vert \widehat{\beta}-\beta\right\Vert _{2}^{2}\max_{k}\frac{1}{n}\sum_{i=1}^{n}x_{ik}^{2}\left\Vert x_{i}\right\Vert _{2}^{2}\\
 & \leq & K\left\Vert \widehat{\beta}-\beta\right\Vert _{2}^{2}\frac{1}{n}\sum_{i=1}^{n}\left\Vert x_{i}\right\Vert _{2}^{2}\left\Vert x_{i}\right\Vert _{2}^{2}\\
 & = & K\left\Vert \widehat{\beta}-\beta\right\Vert _{2}^{2}\frac{1}{n}\sum_{i=1}^{n}\left(\sum_{k=1}^{K}x_{ik}^{2}\right)^{2}\\
 & \leq & K\left\Vert \widehat{\beta}-\beta\right\Vert _{2}^{2}K\sum_{k=1}^{K}\frac{1}{n}\sum_{i=1}^{n}x_{ik}^{4}=o_{p}\left(1\right)O_{p}\left(1\right)=o_{p}\left(1\right).\end{aligned}$$
where the last inequality follows by
$\left(a_{1}+\cdots+a_{K}\right)^{2}\leq K\left(a_{1}^{2}+\cdots+a_{K}^{2}\right)$.
The second term is bounded by $$\begin{aligned}
 &  & \left|\frac{1}{n}\sum_{i=1}^{n}x_{ik}x_{ik'}e_{i}x_{i}'\left(\widehat{\beta}-\beta\right)\right|\\
 & \leq & \max_{k}\left|\widehat{\beta}_{k}-\beta_{k}\right|K\max_{k,k',k''}\left|\frac{1}{n}\sum_{i=1}^{n}e_{i}x_{ik}x_{ik'}x_{ik''}\right|\\
 & \leq & \left\Vert \widehat{\beta}-\beta\right\Vert _{2}\left(\frac{1}{n}\sum_{i=1}^{n}e_{i}^{4}\right)^{1/4}K\max_{k,k',k''}\left(\frac{1}{n}\sum_{i=1}^{n}\left(x_{ik}x_{ik'}x_{ik''}\right)^{4/3}\right)^{3/4}\\
 & \leq & \left\Vert \widehat{\beta}-\beta\right\Vert _{2}\left(\frac{1}{n}\sum_{i=1}^{n}e_{i}^{4}\right)^{1/4}K\max_{k}\left(\frac{1}{n}\sum_{i=1}^{n}x_{ik}^{4}\right)^{3/4}=o_{p}\left(1\right)O_{p}\left(1\right)O_{p}\left(1\right)=o_{p}\left(1\right)\end{aligned}$$
where the second and the third inequalities hold by Hölder’s
inequality.

[^1]: Though the results in this section hold for convergence almost
    surely, for simplicity we state them in terms of convergence in
    probability.