**Notation**: in this note, $y$ is a scalar random variable, and $x$ is a
$K\times1$ random vector.

Conditional Expectation
===========================================================

A regression model can be written as $$y=m\left(x\right)+\epsilon,$$
where $m(x)=E[y|x]$ is called the *conditional mean function*, and
$\epsilon=y-m\left(x\right)$ is called the *regression error*. Such an
equation holds for $\left(y,x\right)$ that follows any joint
distribution, as long as $E\left[y|x\right]$ exists. The error term
$\epsilon$ satisfies these properties:

-   $E\left[\epsilon|x\right]=0$,

-   $E\left[\epsilon\right]=0$,

-   $E\left[h\left(x\right)\epsilon\right]=0$, where $h$ is any function
    of $x$ for which the expectation exists.

The last property implies that $\epsilon$ is uncorrelated with any
function of $x$.
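
These properties follow from the definition of $\epsilon$ and the law of iterated expectations. A short derivation, assuming the relevant expectations exist, is
$$E\left[\epsilon|x\right]=E\left[y-m\left(x\right)|x\right]=E\left[y|x\right]-m\left(x\right)=0,$$
$$E\left[\epsilon\right]=E\left[E\left[\epsilon|x\right]\right]=0,$$
$$E\left[h\left(x\right)\epsilon\right]=E\left[E\left[h\left(x\right)\epsilon|x\right]\right]=E\left[h\left(x\right)E\left[\epsilon|x\right]\right]=0.$$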

If we are interested in predicting $y$ given $x$, then the conditional
mean function $E\left[y|x\right]$ is “optimal” in terms of the *mean
squared error* (MSE).

As $y$ is not a deterministic function of $x$, we cannot predict it with
certainty. In order to evaluate different methods of prediction, we must
have a criterion for comparison. For an arbitrary
prediction method $g\left(x\right)$, we employ a *loss function*
$L\left(y,g\left(x\right)\right)$ to measure how wrong the prediction
is. The expected value of the loss function is called the
*risk*, denoted as $R\left(y,g\left(x\right)\right)$.

There are many choices of loss functions. A particularly convenient one is
the *quadratic loss function*,
defined as
$$L\left(y,g\left(x\right)\right)=\left(y-g\left(x\right)\right)^{2}.$$
The risk corresponding to the quadratic loss is
$$R\left(y,g\left(x\right)\right)=E\left[\left(y-g\left(x\right)\right)^{2}\right],$$
and it is called the MSE.

Due to its operational ease, MSE is one of the most widely used
criteria. Under MSE, the conditional expectation function happens to be
the best prediction method for $y$ given $x$. In other words, the
conditional mean function $m\left(x\right)$ minimizes the MSE.

The claimed optimality can be confirmed by "guess-and-verify." For an
arbitrary $g\left(x\right)$, we add and subtract $m\left(x\right)$ inside
the square to decompose the risk into three terms
$$E\left[\left(y-g\left(x\right)\right)^{2}\right] = 
 E\left[\left(y-m\left(x\right)\right)^{2}\right]+2E\left[\left(y-m\left(x\right)\right)\left(m\left(x\right)-g\left(x\right)\right)\right]+E\left[\left(m\left(x\right)-g\left(x\right)\right)^{2}\right].$$
The first term does not depend on $g\left(x\right)$. The second term
vanishes,
$2E\left[\epsilon\left(m\left(x\right)-g\left(x\right)\right)\right]=0$,
because $m\left(x\right)-g\left(x\right)$ is a function of $x$ and
$E\left[h\left(x\right)\epsilon\right]=0$ for any such function; hence it
does not depend on $g\left(x\right)$ either. The third term is
nonnegative and attains its minimum of zero at
$g\left(x\right)=m\left(x\right)$.
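
The optimality claim can also be checked numerically. The following is a minimal simulation sketch rather than part of the argument itself. The design ($x\sim N(0,1)$ and $y=x^{2}+\epsilon$ with an independent standard normal $\epsilon$), the sample size, and the two competing predictors are assumptions chosen only for illustration.

```python
import random

random.seed(0)
n = 100_000


def m(x):
    """Conditional mean E[y|x] in this assumed design."""
    return x ** 2


# Assumed design: x ~ N(0, 1), y = m(x) + eps, eps ~ N(0, 1) independent of x.
xs = [random.gauss(0.0, 1.0) for _ in range(n)]
ys = [m(x) + random.gauss(0.0, 1.0) for x in xs]


def mse(g):
    """Sample analog of the risk E[(y - g(x))^2] for a predictor g."""
    return sum((y - g(x)) ** 2 for x, y in zip(xs, ys)) / n


mse_m = mse(m)                          # the conditional mean itself
mse_shift = mse(lambda x: m(x) + 0.5)   # conditional mean plus a constant error
mse_const = mse(lambda x: 1.0)          # ignores x; predicts E[y] = E[x^2] = 1

print(f"MSE of m(x):       {mse_m:.3f}")
print(f"MSE of m(x) + 0.5: {mse_shift:.3f}")
print(f"MSE of constant:   {mse_const:.3f}")
```

In this design the gap between the second and the first risk should be close to $E\left[\left(m\left(x\right)-g\left(x\right)\right)^{2}\right]=0.25$, which is the third term of the decomposition.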

Linear Projection
============================================

As discussed in the previous section, we are interested in the
conditional mean function $m(x)$. However, recall that
$$m\left(x\right)=E\left[y|x\right]=\int y f\left(y|x\right)\mathrm{d}y$$
can be a complicated function of $x$, as it depends on the joint
distribution of $\left(y,x\right)$.

A particular form of the conditional mean function is a linear function
$$m\left(x\right)=x'\beta.$$

The linear function is not as restrictive as one might think. It can
be used to generate some nonlinear (in random variables) effects if we
re-define $x$. For example, if
$$y=x_{1}\beta_{1}+x_{2}\beta_{2}+x_{1}x_{2}\beta_{3}+e,$$ then
$\frac{\partial}{\partial x_{1}}m\left(x_{1},x_{2}\right)=\beta_{1}+x_{2}\beta_{3}$,
so the marginal effect of $x_{1}$ varies with $x_{2}$ and $m$ is
nonlinear in $\left(x_{1},x_{2}\right)$, while it is still linear in the
parameter $\beta$ if we define a set of new regressors as
$\left(\tilde{x}_{1},\tilde{x}_{2},\tilde{x}_{3}\right)=\left(x_{1},x_{2},x_{1}x_{2}\right)$.
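
The redefinition of regressors can be sketched in a few lines of code. The coefficient values and the helper names `m` and `marginal_effect_x1` below are hypothetical. The point is only that the conditional mean is a plain inner product in the redefined regressors, while the marginal effect of $x_{1}$ changes with $x_{2}$.

```python
def m(x1, x2, beta):
    """Conditional mean written as a linear form in redefined regressors."""
    x_tilde = (x1, x2, x1 * x2)  # (x1, x2, interaction term)
    return sum(x * b for x, b in zip(x_tilde, beta))


def marginal_effect_x1(x1, x2, beta, h=1e-4):
    """Central finite-difference approximation to dm/dx1."""
    return (m(x1 + h, x2, beta) - m(x1 - h, x2, beta)) / (2 * h)


beta = (1.0, -2.0, 0.5)  # hypothetical values of (beta_1, beta_2, beta_3)

# The marginal effect of x1 is evaluated at several values of x2.
x2_values = (0.0, 1.0, 2.0)
effects = [marginal_effect_x1(0.3, x2, beta) for x2 in x2_values]

for x2, effect in zip(x2_values, effects):
    print(f"x2 = {x2}: numerical effect = {effect:.6f}, "
          f"beta_1 + x2 * beta_3 = {beta[0] + x2 * beta[2]:.6f}")
```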

**Example**
If $\begin{pmatrix}y\\
x
\end{pmatrix}\sim\mathrm{N}\left(\begin{pmatrix}\mu_{y}\\
\mu_{x}
\end{pmatrix},\begin{pmatrix}\sigma_{y}^{2} & \rho\sigma_{y}\sigma_{x}\\
\rho\sigma_{y}\sigma_{x} & \sigma_{x}^{2}
\end{pmatrix}\right)$, then
$$E\left[y|x\right]=\mu_{y}+\rho\frac{\sigma_{y}}{\sigma_{x}}\left(x-\mu_{x}\right)=\left(\mu_{y}-\rho\frac{\sigma_{y}}{\sigma_{x}}\mu_{x}\right)+\rho\frac{\sigma_{y}}{\sigma_{x}}x.$$
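
The formula in the example can be checked by simulation. The sketch below uses hypothetical parameter values and draws from the bivariate normal distribution by combining two independent standard normal variables. It then compares the sample slope (covariance over variance) and intercept with $\rho\sigma_{y}/\sigma_{x}$ and $\mu_{y}-\rho\frac{\sigma_{y}}{\sigma_{x}}\mu_{x}$.

```python
import math
import random

random.seed(1)
n = 100_000

# Hypothetical parameter values for the bivariate normal distribution.
mu_y, mu_x = 1.0, 2.0
sigma_y, sigma_x, rho = 2.0, 1.5, 0.6

xs, ys = [], []
for _ in range(n):
    z1, z2 = random.gauss(0.0, 1.0), random.gauss(0.0, 1.0)
    xs.append(mu_x + sigma_x * z1)
    # This construction gives corr(x, y) = rho and sd(y) = sigma_y.
    ys.append(mu_y + sigma_y * (rho * z1 + math.sqrt(1.0 - rho ** 2) * z2))

x_bar = sum(xs) / n
y_bar = sum(ys) / n
cov_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / n
var_x = sum((x - x_bar) ** 2 for x in xs) / n

slope_hat = cov_xy / var_x
intercept_hat = y_bar - slope_hat * x_bar

slope = rho * sigma_y / sigma_x      # slope in the formula for E[y|x]
intercept = mu_y - slope * mu_x      # intercept in the formula for E[y|x]

print(f"slope:     sample {slope_hat:.3f}, formula {slope:.3f}")
print(f"intercept: sample {intercept_hat:.3f}, formula {intercept:.3f}")
```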

Even though in general $m\left(x\right)\neq x'\beta$, the linear form
$x'\beta$ is still useful as an approximation, as will be clear soon.
Therefore, we may write the linear regression model, or the *linear
projection model*, as 
$$
\begin{aligned}
y & =  x'\beta+e\\
E[x e] & =  0,
\end{aligned}
$$
where $e$ is called the *projection
error*, to be distinguished from the regression error
$\epsilon=y-m\left(x\right)$.

If a constant is included in $x$ as a regressor, the corresponding
component of $E\left[xe\right]=0$ reads $E\left[1\cdot e\right]=0$, so we
have $E\left[e\right]=0$.

The coefficient $\beta$ in the linear projection model has a
straightforward closed form. Multiplying both sides of $y=x'\beta+e$ by
$x$ and taking expectations, we have $E[xy]=E[xx']\beta$ because
$E[xe]=0$. If $E[xx']$ is invertible, we can explicitly solve
$$\beta=\left(E\left[xx'\right]\right)^{-1}E\left[xy\right].$$
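
As an illustration, the closed form can be evaluated with sample moments in place of population moments. The sketch below assumes a hypothetical design with $x=\left(1,x_{1}\right)'$, $x_{1}\sim\mathrm{Uniform}\left(0,2\right)$ and $y=\exp\left(x_{1}\right)+\text{noise}$, so the conditional mean is not linear. It solves the $2\times2$ system by Cramer's rule and checks that the resulting projection error is orthogonal to both regressors in the sample.

```python
import math
import random

random.seed(2)
n = 50_000

# Assumed design: x = (1, x1)', x1 ~ Uniform(0, 2), y = exp(x1) + N(0, 1) noise.
x1 = [random.uniform(0.0, 2.0) for _ in range(n)]
y = [math.exp(v) + random.gauss(0.0, 1.0) for v in x1]


def mean(values):
    return sum(values) / len(values)


# Sample analogs of E[xx'] (symmetric 2x2) and E[xy] (2x1).
s11, s12, s22 = 1.0, mean(x1), mean([v * v for v in x1])
t1, t2 = mean(y), mean([v * w for v, w in zip(x1, y)])

# Solve E[xx'] beta = E[xy] by Cramer's rule.
det = s11 * s22 - s12 * s12
b0 = (t1 * s22 - s12 * t2) / det   # coefficient on the constant
b1 = (s11 * t2 - s12 * t1) / det   # coefficient on x1

# Projection error and the sample analog of E[xe].
e = [w - b0 - b1 * v for v, w in zip(x1, y)]
moment_const = mean(e)
moment_x1 = mean([v * r for v, r in zip(x1, e)])

print(f"beta = ({b0:.3f}, {b1:.3f})")
print(f"sample E[e] = {moment_const:.2e}, sample E[x1 e] = {moment_x1:.2e}")
```

The two sample moments are zero up to floating-point error by construction, even though $m\left(x\right)\neq x'\beta$ in this design.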

Now we justify $x'\beta$ as an approximation to $m\left(x\right)$.
Indeed, $x'\beta$ is the optimal *linear* predictor in terms of MSE; in
other words,
$$\beta=\arg\min_{b\in\mathbb{R}^{K}}E\left[\left(y-x'b\right)^{2}\right].\label{eq:min_MSE}$$
This fact can be verified by taking the first-order condition of the
above minimization problem
$$\frac{\partial}{\partial\beta}E\left[\left(y-x'\beta\right)^{2}\right]=-2E\left[x\left(y-x'\beta\right)\right]=0,$$
which is equivalent to $E[xy]=E[xx']\beta$.
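
As a supplementary check, the first-order condition identifies a minimizer because the objective is convex in $b$. The second derivative is
$$\frac{\partial^{2}}{\partial b\,\partial b'}E\left[\left(y-x'b\right)^{2}\right]=2E\left[xx'\right],$$
which is positive semi-definite, and positive definite whenever $E\left[xx'\right]$ is invertible. In that case the solution of the first-order condition is the unique minimizer.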

Moreover, $x'\beta$ is also the best *linear* approximation to
$m(x)$. If we replace $y$ in the optimization problem by
$m\left(x\right)$, the law of iterated expectations gives the minimizer
$$\left(E\left[xx'\right]\right)^{-1}E\left[xm\left(x\right)\right]=\left(E\left[xx'\right]\right)^{-1}E\left[E\left[xy|x\right]\right]=\left(E\left[xx'\right]\right)^{-1}E\left[xy\right]=\beta.$$
Thus $x'\beta$ is also the best linear approximation to
$m\left(x\right)$ in terms of MSE.
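
This equality can be illustrated numerically. In the sketch below, the design ($x_{1}\sim\mathrm{Uniform}\left(0,2\right)$, $m\left(x\right)=\exp\left(x_{1}\right)$, standard normal noise) and the helper `project` are assumptions made for illustration. With a constant among the regressors, the slope equals covariance over variance, which is what `project` computes. Projecting $y$ and projecting $m\left(x\right)$ should give nearly the same coefficients, with the difference due only to sampling noise.

```python
import math
import random

random.seed(3)
n = 50_000


def m(v):
    """Assumed nonlinear conditional mean."""
    return math.exp(v)


x1 = [random.uniform(0.0, 2.0) for _ in range(n)]
y = [m(v) + random.gauss(0.0, 1.0) for v in x1]


def project(target, regressor):
    """Intercept and slope of the sample linear projection on (1, regressor)."""
    k = len(target)
    r_bar = sum(regressor) / k
    t_bar = sum(target) / k
    cov = sum((r - r_bar) * (t - t_bar) for r, t in zip(regressor, target)) / k
    var = sum((r - r_bar) ** 2 for r in regressor) / k
    slope = cov / var
    return t_bar - slope * r_bar, slope


beta_y = project(y, x1)                    # project y on (1, x1)
beta_m = project([m(v) for v in x1], x1)   # project m(x) on (1, x1)

print(f"projection of y:    intercept {beta_y[0]:.3f}, slope {beta_y[1]:.3f}")
print(f"projection of m(x): intercept {beta_m[0]:.3f}, slope {beta_m[1]:.3f}")
```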

Omitted Variable Bias
----------------------------------------------

We write the *long regression* as
$$y=x_{1}'\beta_{1}+x_{2}'\beta_{2}+\beta_{3}+e,$$ and the *short
regression* as $$y=x_{1}'\gamma_{1}+\gamma_{2}+u.$$ If $\beta_{1}$ in
the long regression is the parameter of interest, omitting $x_{2}$ as in
the short regression will cause *omitted variable bias* (meaning
$\gamma_{1}\neq\beta_{1}$) unless $x_{1}$ and $x_{2}$ are uncorrelated
or $\beta_{2}=0$.

We first demean all the variables in the two regressions, which is
equivalent to projecting out the effect of the constant. The long
regression becomes
$$\tilde{y} = \tilde{x}_{1}'\beta_{1}+\tilde{x}_{2}'\beta_{2}+\tilde{e},$$ and the short regression becomes
$$\tilde{y}=\tilde{x}_{1}'\gamma_{1}+\tilde{u},$$ where *tilde* denotes
the demeaned variable.

After demeaning, the cross-moment equals the covariance. Since
$E\left[\tilde{x}_{1}\tilde{e}\right]=0$ in the long regression, the
short regression coefficient is
$$
\begin{aligned}
\gamma_{1} & =  \left(E\left[\tilde{x}_{1}\tilde{x}_{1}'\right]\right)^{-1}E\left[\tilde{x}_{1}\tilde{y}\right]\\
 & =  \left(E\left[\tilde{x}_{1}\tilde{x}_{1}'\right]\right)^{-1}E\left[\tilde{x}_{1}\left(\tilde{x}_{1}'\beta_{1}+\tilde{x}_{2}'\beta_{2}+\tilde{e}\right)\right]\\
 & =  \beta_{1}+\left(E\left[\tilde{x}_{1}\tilde{x}_{1}'\right]\right)^{-1}E\left[\tilde{x}_{1}\tilde{x}_{2}'\right]\beta_{2}.
\end{aligned}
$$
Therefore, $\gamma_{1}=\beta_{1}$ if and only if
$E\left[\tilde{x}_{1}\tilde{x}_{2}'\right]\beta_{2}=0$, which holds if
either $E\left[\tilde{x}_{1}\tilde{x}_{2}'\right]=0$ or $\beta_{2}=0$.
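
The bias formula can be illustrated with a small simulation. The sketch below assumes scalar $x_{1}$ and $x_{2}$, hypothetical coefficient values, and a design in which $\mathrm{cov}\left(x_{1},x_{2}\right)/\mathrm{var}\left(x_{1}\right)=0.5$ by construction. It compares the short-regression slope with $\beta_{1}$ plus the bias term.

```python
import random

random.seed(4)
n = 100_000
beta1, beta2, beta3 = 1.0, 2.0, 0.3   # hypothetical long-regression coefficients

x1 = [random.gauss(0.0, 1.0) for _ in range(n)]
# x2 is correlated with x1: cov(x1, x2) / var(x1) = 0.5 by construction.
x2 = [0.5 * a + random.gauss(0.0, 1.0) for a in x1]
y = [beta1 * a + beta2 * b + beta3 + random.gauss(0.0, 1.0)
     for a, b in zip(x1, x2)]


def demean(values):
    avg = sum(values) / len(values)
    return [v - avg for v in values]


def cross_moment(a, b):
    """Sample analog of E[a b] for two scalar series."""
    return sum(u * v for u, v in zip(a, b)) / len(a)


x1_t, x2_t, y_t = demean(x1), demean(x2), demean(y)

# Short-regression slope: y on x1 only (after demeaning).
gamma1_hat = cross_moment(x1_t, y_t) / cross_moment(x1_t, x1_t)

# Bias term: (E[x1 x1'])^{-1} E[x1 x2'] beta2, using sample moments.
delta_hat = cross_moment(x1_t, x2_t) / cross_moment(x1_t, x1_t)
predicted = beta1 + delta_hat * beta2

print(f"short-regression slope: {gamma1_hat:.3f}")
print(f"beta1 + bias term:      {predicted:.3f}")
```

Under these assumptions the short-regression slope should be close to $1+0.5\times2=2$ rather than $\beta_{1}=1$.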

Obviously we prefer to run the long regression to attain $\beta_{1}$ if
possible, as it is a more general model than the short regression.
However, sometimes $x_{2}$ is simply unobservable so the long
regression is infeasible. When only the short regression is available,
in some cases we are able to sign the bias, meaning that we know whether
$\gamma_{1}$ is bigger or smaller than $\beta_{1}$.
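
For instance, in the special case where $x_{1}$ and $x_{2}$ are both scalars (an assumption made here only for illustration), the bias term in the short regression coefficient reduces to
$$\gamma_{1}-\beta_{1}=\frac{\mathrm{cov}\left(x_{1},x_{2}\right)}{\mathrm{var}\left(x_{1}\right)}\beta_{2},$$
so the bias has the sign of $\mathrm{cov}\left(x_{1},x_{2}\right)\cdot\beta_{2}$. If we are willing to assume that both $\mathrm{cov}\left(x_{1},x_{2}\right)$ and $\beta_{2}$ are positive, or that both are negative, then $\gamma_{1}>\beta_{1}$. If they have opposite signs, then $\gamma_{1}<\beta_{1}$.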