
Suppose the data are generated from a parametric model. Statistical
estimation seeks the unknown parameter from the observed data. A
*principle* is a general philosophy about the proper way to estimate. Over the
history of statistics, only a few principles have been widely accepted. Among
them, maximum likelihood is the most important and fundamental. The
maximum likelihood principle prescribes that the unknown parameter be
estimated by the maximizer of the log-likelihood function.



## Maximum Likelihood

We begin with an introduction to maximum likelihood
estimation. Consider a random sample
$Z=\left(z_{1},z_{2},\ldots,z_{n}\right)$ drawn from a parametric
distribution with density $f_{z}\left(z_{i};\theta\right)$, where
$z_{i}$ is either a scalar random variable or a random vector. A
parametric distribution is completely characterized by a
finite-dimensional parameter $\theta$. We know that $\theta$ belongs to
a parameter space $\Theta$, and we use the data to estimate $\theta$.

The log-likelihood of observing the entire sample $Z$ is
$$L_{n}\left(\theta;Z\right):=\log\left(\prod_{i=1}^{n}f_{z}\left(z_{i};\theta\right)\right)=\sum_{i=1}^{n}\log f_{z}\left(z_{i};\theta\right).$$
In practice the sample $Z$ is given, and for each $\theta\in\Theta$ we can
evaluate $L_{n}\left(\theta;Z\right)$. The maximum likelihood estimator
is
$$\widehat{\theta}_{MLE}:=\arg\max_{\theta\in\Theta}L_{n}\left(\theta;Z\right).$$
Why is maximizing the log-likelihood function desirable? An intuitive
explanation is that $\widehat{\theta}_{MLE}$ makes observing $Z$ the
“most likely” outcome over the entire parameter space.
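The definition can be made concrete with a minimal sketch in Python. Everything named here is an assumption made for illustration: the helper names `log_likelihood` and `grid_mle`, the made-up four-point sample, the $N\left(\theta,1\right)$ density, and the use of a finite grid in place of the parameter space $\Theta$. A grid search is only a crude stand-in for the $\arg\max$; it is not how one would maximize a likelihood in serious work.

```python
import math

def log_likelihood(theta, sample, log_density):
    """L_n(theta; Z): the sum of log f_z(z_i; theta) over the sample."""
    return sum(log_density(z, theta) for z in sample)

def grid_mle(sample, log_density, grid):
    """Return the grid point with the largest log-likelihood."""
    return max(grid, key=lambda theta: log_likelihood(theta, sample, log_density))

def normal_log_density(z, mu):
    """Log density of N(mu, 1), chosen purely for illustration."""
    return -0.5 * math.log(2.0 * math.pi) - 0.5 * (z - mu) ** 2

# Made-up data (not from the text) and a grid standing in for Theta.
sample = [1.2, 2.7, 1.9, 2.4]
grid = [i / 100.0 for i in range(0, 401)]  # 0.00, 0.01, ..., 4.00

theta_hat = grid_mle(sample, normal_log_density, grid)
print("grid MLE:", theta_hat)
print("sample mean:", sum(sample) / len(sample))
```

For this particular density the grid maximizer lands on the grid point closest to the sample mean, which previews the Gaussian location example.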



Consider the Gaussian location model $z_{i}\sim N\left(\mu,1\right)$,
where $\mu$ is the unknown parameter to be estimated. The likelihood of
observing $z_{i}$ is
$f_{z}\left(z_{i};\mu\right)=\frac{1}{\sqrt{2\pi}}\exp\left(-\frac{1}{2}\left(z_{i}-\mu\right)^{2}\right)$.
The likelihood of observing the sample $Z$ is
$$f_{Z}\left(Z;\mu\right)=\prod_{i=1}^{n}\frac{1}{\sqrt{2\pi}}\exp\left(-\frac{1}{2}\left(z_{i}-\mu\right)^{2}\right)$$
and the log-likelihood is
$$L_{n}\left(\mu;Z\right)=-\frac{n}{2}\log\left(2\pi\right)-\frac{1}{2}\sum_{i=1}^{n}\left(z_{i}-\mu\right)^{2}.$$
The (averaged) log-likelihood function for the $n$ observations is
$$\begin{aligned}
\ell_{n}\left(\mu\right) & =-\frac{1}{2}\log\left(2\pi\right)-\frac{1}{2n}\sum_{i=1}^{n}\left(z_{i}-\mu\right)^{2}.\end{aligned}$$
We work with the averaged log-likelihood $\ell_{n}$, instead of the
(raw) log-likelihood $L_{n}$, to make it directly comparable with the
expected log density $$\begin{aligned}
E_{\mu_{0}}\left[\log f_{z}\left(z;\mu\right)\right] & =E_{\mu_{0}}\left[\ell_{n}\left(\mu\right)\right]\\
 & =-\frac{1}{2}\log\left(2\pi\right)-\frac{1}{2}E_{\mu_{0}}\left[\left(z_{i}-\mu\right)^{2}\right]\\
 & =-\frac{1}{2}\log\left(2\pi\right)-\frac{1}{2}E_{\mu_{0}}\left[\left(\left(z_{i}-\mu_{0}\right)+\left(\mu_{0}-\mu\right)\right)^{2}\right]\\
 & =-\frac{1}{2}\log\left(2\pi\right)-\frac{1}{2}E_{\mu_{0}}\left[\left(z_{i}-\mu_{0}\right)^{2}\right]-E_{\mu_{0}}\left[z_{i}-\mu_{0}\right]\left(\mu_{0}-\mu\right)-\frac{1}{2}\left(\mu_{0}-\mu\right)^{2}\\
 & =-\frac{1}{2}\log\left(2\pi\right)-\frac{1}{2}-\frac{1}{2}\left(\mu-\mu_{0}\right)^{2},\end{aligned}$$
where the first equality holds because of random sampling, and the last
uses $E_{\mu_{0}}\left[z_{i}-\mu_{0}\right]=0$ and
$E_{\mu_{0}}\left[\left(z_{i}-\mu_{0}\right)^{2}\right]=1$. Obviously,
$\ell_{n}\left(\mu\right)$ is maximized at
$\bar{z}=\frac{1}{n}\sum_{i=1}^{n}z_{i}$ while
$E_{\mu_{0}}\left[\ell_{n}\left(\mu\right)\right]$ is maximized at
$\mu=\mu_{0}$. 
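The two maximizers can be verified from the first-order conditions. The short derivation below uses only the expressions for $\ell_{n}\left(\mu\right)$ and $E_{\mu_{0}}\left[\ell_{n}\left(\mu\right)\right]$ given above:
$$\begin{aligned}
\frac{\partial}{\partial\mu}\ell_{n}\left(\mu\right) & =\frac{1}{n}\sum_{i=1}^{n}\left(z_{i}-\mu\right)=\bar{z}-\mu=0\quad\Longrightarrow\quad\mu=\bar{z},\\
\frac{\partial}{\partial\mu}E_{\mu_{0}}\left[\ell_{n}\left(\mu\right)\right] & =-\left(\mu-\mu_{0}\right)=0\quad\Longrightarrow\quad\mu=\mu_{0}.\end{aligned}$$
In both cases the second derivative with respect to $\mu$ equals $-1<0$, so each function is strictly concave and the stationary point is the unique maximizer.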



The following code demonstrates the population log-likelihood
$E\left[\ell_{n}\left(\mu\right)\right]$ when $\mu_{0}=2$ (solid line)
and three sample realizations when $n=4$ (dashed lines).

*(knitr code chunk not included here)*
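Since the knitr chunk is not reproduced, here is a minimal Python sketch of the same comparison. It is not the original code. It evaluates the population curve and three simulated sample curves on a grid and reports their maximizers instead of drawing them. The seed, the grid, and all function names are assumptions made for illustration; only $\mu_{0}=2$, $n=4$, and the three realizations come from the text.

```python
import math
import random

MU0 = 2.0   # true parameter, as in the text
N = 4       # sample size, as in the text
REPS = 3    # number of sample realizations, as in the text

def population_loglik(mu, mu0=MU0):
    """E[l_n(mu)] = -0.5*log(2*pi) - 0.5 - 0.5*(mu - mu0)^2."""
    return -0.5 * math.log(2.0 * math.pi) - 0.5 - 0.5 * (mu - mu0) ** 2

def sample_loglik(mu, sample):
    """l_n(mu) = -0.5*log(2*pi) - (1/(2n)) * sum_i (z_i - mu)^2."""
    n = len(sample)
    return -0.5 * math.log(2.0 * math.pi) - sum((z - mu) ** 2 for z in sample) / (2.0 * n)

rng = random.Random(2024)  # arbitrary seed, an assumption
samples = [[rng.gauss(MU0, 1.0) for _ in range(N)] for _ in range(REPS)]

grid = [i / 100.0 for i in range(0, 401)]  # 0.00, 0.01, ..., 4.00

pop_argmax = max(grid, key=population_loglik)
print("population curve is maximized at", pop_argmax)

results = []  # pairs (sample mean, grid maximizer of l_n)
for s in samples:
    zbar = sum(s) / N
    argmax = max(grid, key=lambda mu: sample_loglik(mu, s))
    results.append((zbar, argmax))
    print("sample mean", round(zbar, 3), "-> sample curve is maximized at", argmax)
```

To draw the figure, the grid values of `population_loglik` and `sample_loglik` could be passed to any plotting library. With $n=4$ the sample maximizers scatter noticeably around $\mu_{0}=2$.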



## Summary

The exact distribution under the normality assumption of the error term
is a classical statistical result. The Gauss-Markov theorem holds
under two crucial assumptions: linear CEF and homoskedasticity.

**Historical notes**: MLE was promulgated and popularized by Ronald
Fisher (1890–1962). He was a major contributor to the frequentist
approach, which dominates mathematical statistics today, and he sharply
criticized the Bayesian approach. Fisher analyzed the iris flower
dataset of 150 observations, collected by the botanist Edgar Anderson,
in a study published in 1936; the dataset can be displayed in R by
typing `iris`. Fisher introduced or developed many of the concepts in
classical mathematical statistics, such as sufficient statistic,
ancillary statistic, completeness, and exponential family.

**Further reading**: @phillips1983exact offered a comprehensive
treatment of exact small sample theory in econometrics. After that,
theoretical studies in econometrics swiftly shifted to large sample
theory, which we will introduce in the next chapter.



## Appendix

### Joint Normal Distribution

Arguably, the normal distribution is the most frequently
encountered distribution in statistical inference, as it is the
asymptotic distribution of many popular estimators. Moreover, it boasts
some unique features that facilitate the calculation of objects of
interest. This note summarizes a few of them.

An $n\times1$ random vector $Y$ follows a joint normal distribution
$N\left(\mu,\Sigma\right)$, where $\mu$ is an $n\times1$ vector and
$\Sigma$ is an $n\times n$ symmetric positive definite matrix. The
probability density function is
$$f_{y}\left(y\right)=\left(2\pi\right)^{-n/2}\left(\mathrm{det}\left(\Sigma\right)\right)^{-1/2}\exp\left(-\frac{1}{2}\left(y-\mu\right)'\Sigma^{-1}\left(y-\mu\right)\right)$$
where $\mathrm{det}\left(\cdot\right)$ is the determinant of a matrix.
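As a worked special case, take $n=2$. The scalar notation below is introduced only for this illustration and is not used elsewhere in the text: write the variances as $\sigma_{1}^{2},\sigma_{2}^{2}$ and the correlation as $\rho\in\left(-1,1\right)$, so that
$$\Sigma=\left(\begin{array}{cc}
\sigma_{1}^{2} & \rho\sigma_{1}\sigma_{2}\\
\rho\sigma_{1}\sigma_{2} & \sigma_{2}^{2}
\end{array}\right),\qquad\mathrm{det}\left(\Sigma\right)=\sigma_{1}^{2}\sigma_{2}^{2}\left(1-\rho^{2}\right).$$
Substituting into the general formula gives
$$f_{y}\left(y_{1},y_{2}\right)=\frac{1}{2\pi\sigma_{1}\sigma_{2}\sqrt{1-\rho^{2}}}\exp\left(-\frac{1}{2\left(1-\rho^{2}\right)}\left[\frac{\left(y_{1}-\mu_{1}\right)^{2}}{\sigma_{1}^{2}}-\frac{2\rho\left(y_{1}-\mu_{1}\right)\left(y_{2}-\mu_{2}\right)}{\sigma_{1}\sigma_{2}}+\frac{\left(y_{2}-\mu_{2}\right)^{2}}{\sigma_{2}^{2}}\right]\right).$$
When $\rho=0$ the density factors into the product of two univariate normal densities.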




We will discuss the relationship between two components of a random
vector. To fix notation, $$Y=\left(\begin{array}{c}
Y_{1}\\
Y_{2}
\end{array}\right)\sim N\left(\left(\begin{array}{c}
\mu_{1}\\
\mu_{2}
\end{array}\right),\left(\begin{array}{cc}
\Sigma_{11} & \Sigma_{12}\\
\Sigma_{21} & \Sigma_{22}
\end{array}\right)\right)$$
where $Y_{1}$ is an $m\times1$ vector, and
$Y_{2}$ is an $\left(n-m\right)\times1$ vector. $\mu_{1}$ and $\mu_{2}$
are the corresponding mean vectors, and $\Sigma_{ij}$, $i,j=1,2$, are the
corresponding variance and covariance matrices. From now on, we always
maintain the assumption that $Y=\left(Y_{1}',Y_{2}'\right)'$ is jointly
normal.

The fact on the marginal distribution, stated below, immediately implies
a convenient feature of the normal distribution.
Generally speaking, if we are given the joint pdf of two random variables
and want the marginal distribution of one of them, we
need to integrate the other variable out of the joint pdf. However, if
the variables are jointly normal, information about the other random
variable is irrelevant to the marginal distribution of the random
variable of interest. We only need the parameters pertaining to
the part of interest, namely the mean $\mu_{1}$ and the variance
$\Sigma_{11}$, to determine the marginal distribution of $Y_{1}$.

<span id="fact:marginal"
label="fact:marginal">\[fact:marginal\]</span>The marginal distribution
of $Y_{1}$ is $N\left(\mu_{1},\Sigma_{11}\right)$.
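The fact can be checked numerically in the bivariate case ($n=2$, $m=1$). The sketch below integrates $y_{2}$ out of the joint pdf with a trapezoid rule and compares the result with the $N\left(\mu_{1},\Sigma_{11}\right)$ density. The parameter values, the evaluation point, the integration range, and all function names are assumptions chosen for illustration. This is a numerical sanity check, not a proof.

```python
import math

def bivariate_normal_pdf(y1, y2, mu, sigma):
    """Joint normal density written out for n = 2."""
    (s11, s12), (_, s22) = sigma
    det = s11 * s22 - s12 * s12
    d1, d2 = y1 - mu[0], y2 - mu[1]
    # Quadratic form (y - mu)' Sigma^{-1} (y - mu), using the 2x2 inverse.
    quad = (s22 * d1 * d1 - 2.0 * s12 * d1 * d2 + s11 * d2 * d2) / det
    return math.exp(-0.5 * quad) / (2.0 * math.pi * math.sqrt(det))

def normal_pdf(y, mean, var):
    """Univariate normal density with the given mean and variance."""
    return math.exp(-0.5 * (y - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

def integrate_out_y2(y1, mu, sigma, half_width=10.0, steps=4000):
    """Trapezoid rule over y2 on [mu2 - half_width, mu2 + half_width]."""
    lo = mu[1] - half_width
    h = 2.0 * half_width / steps
    total = 0.0
    for k in range(steps + 1):
        weight = 0.5 if k in (0, steps) else 1.0
        total += weight * bivariate_normal_pdf(y1, lo + k * h, mu, sigma)
    return total * h

# Illustrative parameters (assumptions, not from the text).
mu = (1.0, -1.0)
sigma = ((2.0, 0.8), (0.8, 1.0))
y1 = 0.5

numeric = integrate_out_y2(y1, mu, sigma)
closed_form = normal_pdf(y1, mu[0], sigma[0][0])
print("integrated joint pdf:", numeric)
print("N(mu_1, Sigma_11) pdf:", closed_form)
```

Changing $\Sigma_{12}$ or $\Sigma_{22}$ in `sigma`, while keeping the matrix positive definite, should leave both numbers essentially unchanged, which is the sense in which the other component is irrelevant to the marginal distribution of $Y_{1}$.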