In [1]:
%%javascript
MathJax.Hub.Config({
    TeX: { equationNumbers: { autoNumber: "AMS" } }
});


In the previous section we derived the cost function for linear regression using Maximum Likelihood Estimation (MLE), and in the process we made a few assumptions. These are also the assumptions of Ordinary Least Squares (OLS) linear regression. Let us list them once again here:

- **Linearity** : The relationship between the target variable and the independent variable(s) is linear in the parameters
- **Normality** : Conditional on the independent variables, the target variable is normally distributed, which follows from the assumption that the error is Gaussian white noise
- **Data is random sample from the population** : The data points are independent and identically distributed (IID)
- **Spherical errors** : The errors are homoscedastic and have no serial correlation. This means that the errors all have the same finite variance and are uncorrelated with each other
- **No perfect multicollinearity** : There is no exact linear dependence among the independent variables, i.e. the design matrix has full column rank
- **Strict exogeneity** : The expectation of the errors conditioned on the design matrix is zero, which implies that the errors are uncorrelated with the independent variables
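The full-rank condition in the list above is easy to check numerically. Below is a minimal sketch, assuming NumPy and a small synthetic design matrix; the names `X_ok` and `X_bad` and the chosen sizes are hypothetical and only serve as illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical design matrix: an intercept column plus two independent features.
n = 50
X_ok = np.column_stack([np.ones(n), rng.normal(size=n), rng.normal(size=n)])

# Same matrix with an extra column that is an exact linear combination of
# two existing columns, which creates perfect multicollinearity.
X_bad = np.column_stack([X_ok, 2.0 * X_ok[:, 1] - X_ok[:, 2]])

rank_ok = np.linalg.matrix_rank(X_ok)    # full column rank
rank_bad = np.linalg.matrix_rank(X_bad)  # rank is lower than the column count

print("columns:", X_ok.shape[1], "rank:", rank_ok)
print("columns:", X_bad.shape[1], "rank:", rank_bad)
```

When the rank is lower than the number of columns, $\text{X}^T\text{X}$ is singular and its inverse does not exist.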

The Gauss–Markov theorem holds when four of the OLS assumptions are satisfied: linearity, no perfect multicollinearity, strict exogeneity, and spherical errors. Under these four assumptions, the OLS estimate of the coefficients ($\hat\beta$) is **BLUE**, the best (minimum-variance) linear unbiased estimator.

Let us verify that the $\hat\beta$ we got from OLS is unbiased and has minimum variance.

$$ \hat\beta = (\text{X}^T\text{X})^{-1}\text{X}^Ty $$
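As a concrete illustration of this formula, here is a minimal sketch that computes $\hat\beta$ on synthetic data, assuming NumPy. The data-generating values (`beta_true`, the noise scale, the sample size) are hypothetical. The sketch solves the normal equations instead of forming the inverse explicitly, which is a numerical choice and not part of the derivation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical synthetic data: intercept plus two features.
n, p = 200, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
beta_true = np.array([1.0, 2.0, -0.5])
y = X @ beta_true + rng.normal(scale=0.3, size=n)

# beta_hat = (X^T X)^{-1} X^T y, obtained by solving (X^T X) beta = X^T y.
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# Cross-check against NumPy's least-squares solver.
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print("normal equations:", beta_hat)
print("lstsq           :", beta_lstsq)
```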

### Calculating bias of $\hat\beta$

For $\hat\beta$ to be unbiased, $\mathbb{E}[\hat\beta | \text{X}] = \beta$

\begin{align}
\mathbb{E}[\hat\beta | \text{X}] &= \mathbb{E}[(\text{X}^T\text{X})^{-1}\text{X}^Ty | \text{X}] \\
                                 &= \mathbb{E}[(\text{X}^T\text{X})^{-1}\text{X}^T(\text{X}\beta+\epsilon) | \text{X}] \\
                                 &= \mathbb{E}[(\text{X}^T\text{X})^{-1}\text{X}^T\text{X}\beta+(\text{X}^T\text{X})^{-1}\text{X}^T\epsilon | \text{X}] \\
                                 &= \mathbb{E}[(\text{X}^T\text{X})^{-1}(\text{X}^T\text{X})\beta+(\text{X}^T\text{X})^{-1}\text{X}^T\epsilon | \text{X}] \\
                                 &= \mathbb{E}[\mathit{I}\beta+(\text{X}^T\text{X})^{-1}\text{X}^T\epsilon | \text{X}] \\
                                 &= \mathbb{E}[\mathit{I}\beta | \text{X}]+\mathbb{E}[(\text{X}^T\text{X})^{-1}\text{X}^T\epsilon | \text{X}] \\
                                 &= \beta+\mathbb{E}[(\text{X}^T\text{X})^{-1}\text{X}^T\epsilon | \text{X}]
\end{align}

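Since we have already assumed that the expectation of the error conditioned on the design matrix (X) is zero, and $(\text{X}^T\text{X})^{-1}\text{X}^T$ is fixed given X, we get
$$\mathbb{E}[(\text{X}^T\text{X})^{-1}\text{X}^T\epsilon | \text{X}]=(\text{X}^T\text{X})^{-1}\text{X}^T\mathbb{E}[\epsilon | \text{X}]=0$$
Therefore,
$$\mathbb{E}[\hat\beta | \text{X}]=\beta$$
and, by the law of iterated expectations, $\mathbb{E}[\hat\beta]=\beta$ as well.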
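The unbiasedness result can also be checked by simulation. The following is a minimal Monte Carlo sketch, assuming NumPy, a fixed design matrix, and Gaussian errors with zero mean. All numeric values (`beta_true`, `sigma`, `n_sims`) are hypothetical choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(42)

# Fixed design matrix X; only the errors are redrawn in each simulation.
n, sigma = 100, 1.0
X = np.column_stack([np.ones(n), rng.normal(size=n)])
beta_true = np.array([0.5, 2.0])

# A = (X^T X)^{-1} X^T, so that beta_hat = A y.
A = np.linalg.solve(X.T @ X, X.T)

# Each row of eps is one simulated error vector with E[eps | X] = 0.
n_sims = 20000
eps = rng.normal(scale=sigma, size=(n_sims, n))
Y = X @ beta_true + eps   # one simulated target vector per row
beta_hats = Y @ A.T       # one OLS estimate per row

mean_beta_hat = beta_hats.mean(axis=0)
print("true beta        :", beta_true)
print("mean of estimates:", mean_beta_hat)
```

The average of the simulated estimates should sit close to `beta_true`, up to Monte Carlo noise.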

### Calculating variance of $\hat\beta$

From the previous bias calculation we know that $\hat\beta = \beta+(\text{X}^T\text{X})^{-1}\text{X}^T\epsilon$

\begin{align}
\mathbb{V}[\hat\beta | \text{X}] &= \mathbb{V}[\beta+(\text{X}^T\text{X})^{-1}\text{X}^T\epsilon | \text{X}] \\
                      &= \mathbb{V}[\beta | \text{X}]+\mathbb{V}[(\text{X}^T\text{X})^{-1}\text{X}^T\epsilon | \text{X}]
\end{align}

Since $\beta$ is non-random, from the properties of variance, $\mathbb{V}[\beta | \text{X}]=0$

$$
\mathbb{V}[\hat\beta | \text{X}] = \mathbb{V}[(\text{X}^T\text{X})^{-1}\text{X}^T\epsilon | \text{X}]
$$

Let $\mathit{A} = (\text{X}^T\text{X})^{-1}\text{X}^T$. Since A is fixed once we condition on X, from the properties of variance, $\mathbb{V}[\mathit{A}\epsilon | \text{X}] = \mathit{A}\mathbb{V}[\epsilon | \text{X}]\mathit{A}^T$. Further, from the assumption of spherical errors, $\mathbb{V}[\epsilon | \text{X}] = \sigma^2\mathit{I}$

Therefore,

\begin{align}
\mathbb{V}[\hat\beta | \text{X}] &= \mathit{A}\sigma^2\mathit{I}\mathit{A}^T \\
                                 &= \sigma^2\mathit{I}\mathit{A}\mathit{A}^T \\
                                 &= \sigma^2\mathit{I}((\text{X}^T\text{X})^{-1}\text{X}^T)((\text{X}^T\text{X})^{-1}\text{X}^T)^T \\
                                 &= \sigma^2\mathit{I}((\text{X}^T\text{X})^{-1}\text{X}^T\text{X}(\text{X}^T\text{X})^{-1}) \\
                                 &= \sigma^2\mathit{I}(\text{X}^T\text{X})^{-1} \\
                                 &= \sigma^2(\text{X}^T\text{X})^{-1}
\end{align}
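To see the formula $\sigma^2(\text{X}^T\text{X})^{-1}$ at work, here is a minimal Monte Carlo sketch, assuming NumPy, a fixed design matrix, and homoscedastic, uncorrelated Gaussian errors. The values of `sigma`, `beta_true`, and `n_sims` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(123)

# Fixed design matrix and spherical errors with variance sigma^2.
n, sigma = 100, 2.0
X = np.column_stack([np.ones(n), rng.normal(size=n)])
beta_true = np.array([0.5, 2.0])
A = np.linalg.solve(X.T @ X, X.T)  # (X^T X)^{-1} X^T

# Simulate many samples and collect the OLS estimate for each one.
n_sims = 20000
eps = rng.normal(scale=sigma, size=(n_sims, n))
beta_hats = (X @ beta_true + eps) @ A.T

# Empirical covariance of the estimates versus the closed-form expression.
emp_cov = np.cov(beta_hats, rowvar=False)
theo_cov = sigma**2 * np.linalg.inv(X.T @ X)

print("empirical:\n", emp_cov)
print("theoretical:\n", theo_cov)
```

The two matrices should agree up to Monte Carlo noise, and the agreement tightens as `n_sims` grows.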

### Proof that $\hat\beta$ is the minimum variance linear unbiased estimator

Now that we have found the variance of our estimator, in order to prove that it has the minimum variance, we have to compare it with the variance of every other linear unbiased estimator. Any linear estimator of $\beta$ can be written as $\tilde\beta = \text{C}y$ for some matrix **C** (which may depend on X), so we need to show that every such $\tilde\beta$ that is unbiased has variance greater than or equal to that of $\hat\beta$.

To have $\tilde\beta$ unbiased

\begin{align}
& \mathbb{E}[\tilde\beta | \text{X}] = \beta & \\
& \mathbb{E}[\text{C}y | \text{X}] = \beta & \\
& \mathbb{E}[\text{C}(\text{X}\beta + \epsilon) | \text{X}] = \beta & \\
& \mathbb{E}[\text{CX}\beta | \text{X}] + \text{C}\,\mathbb{E}[\epsilon | \text{X}] = \beta & \\
& \mathbb{E}[\text{CX}\beta | \text{X}] + 0 = \beta & \because \text{ strict exogeneity} \\
& \mathbb{E}[\text{CX}\beta | \text{X}] = \beta & \\
& \text{CX}\beta = \beta & \\
& (\text{CX} - \mathit{I})\beta = 0 & \\
& \text{CX} = \mathit{I} & \because \text{ this must hold for every } \beta
\end{align}

Therefore, for $\tilde\beta$ to be unbiased, $\text{CX} = \mathit{I}$

We know that $\hat\beta = (\text{X}^T\text{X})^{-1}\text{X}^Ty$, so $\hat\beta = \text{C}_{OLS}y$ with $\text{C}_{OLS} = (\text{X}^T\text{X})^{-1}\text{X}^T$. We can express the **C** of any other linear estimator $\tilde\beta$ as $\text{C}_{OLS}$ plus a perturbation matrix **D**.

\begin{align}
\text{C} &= \text{D} + \text{C}_{OLS} \\
         &= \text{D} + (\text{X}^T\text{X})^{-1}\text{X}^T
\end{align}

For $\tilde\beta$ to be unbiased, **C** has to satisfy $\text{CX} = \mathit{I}$, which forces $\text{DX} = 0$, i.e. every row of **D** must be orthogonal to the columns of ***X*** (the rows of **D** lie in the left null space of ***X***). This can be shown as below

\begin{align}
& \text{CX} = \mathit{I} & \\
& (\text{C}_{OLS} + \text{D})\text{X} = \mathit{I} & \\
& \text{C}_{OLS}\text{X} + \text{D}\text{X} = \mathit{I} & \\
& \mathit{I} + \text{D}\text{X} = \mathit{I} & \because \hat\beta \text{ is unbiased so } \text{C}_{OLS}\text{X} = \mathit{I}\\
& \mathit{I} + \text{D}\text{X} - \mathit{I}= \mathit{I} - \mathit{I} \\
& \text{D}\text{X} = 0
\end{align}
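A matrix **D** with $\text{DX} = 0$ can be built explicitly. One possible construction (an assumption of this sketch, not the only choice) is $\text{D} = \text{M}(\mathit{I} - \text{H})$, where $\text{H} = \text{X}(\text{X}^T\text{X})^{-1}\text{X}^T$ is the projection onto the column space of X and **M** is an arbitrary matrix. The sketch below assumes NumPy; `M` and the sizes are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)

n, p = 30, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])

# C_OLS = (X^T X)^{-1} X^T and the projection (hat) matrix H = X C_OLS.
C_ols = np.linalg.solve(X.T @ X, X.T)
H = X @ C_ols

# Since (I - H) X = 0, any M gives D X = M (I - H) X = 0.
M = rng.normal(size=(p, n))  # arbitrary hypothetical matrix
D = M @ (np.eye(n) - H)
C = C_ols + D

print("max |D X|    :", np.abs(D @ X).max())
print("max |C X - I|:", np.abs(C @ X - np.eye(p)).max())
```

Both printed values should be zero up to floating-point error, so $\tilde\beta = \text{C}y$ built this way is a linear unbiased estimator that differs from OLS.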

We should also represent $\tilde\beta$ in terms of $\hat\beta$

\begin{align}
\tilde\beta &= Cy \\
            &= (D + (\text{X}^T\text{X})^{-1}\text{X}^T)y \\
            &= Dy + (\text{X}^T\text{X})^{-1}\text{X}^Ty \\
            &= Dy + \hat\beta
\end{align}

Therefore, re-arranging it we get $\tilde\beta = \hat\beta + Dy$

#### Now let us calculate variance of $\tilde\beta$

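\begin{align}
\mathbb{V}[\tilde\beta | \text{X}] &= \mathbb{V}[\hat\beta + Dy | \text{X}] \\
                                   &= \mathbb{V}[\hat\beta | \text{X}] + \mathbb{V}[Dy | \text{X}] \\
                                   &= \sigma^2(\text{X}^T\text{X})^{-1} + \mathbb{V}[Dy | \text{X}]
\end{align}

The second line has no cross-covariance terms because $\hat\beta$ and $Dy$ are uncorrelated given X: $\text{Cov}[\hat\beta, Dy | \text{X}] = (\text{X}^T\text{X})^{-1}\text{X}^T(\sigma^2\mathit{I})D^T = \sigma^2(\text{X}^T\text{X})^{-1}(D\text{X})^T = 0$, using $D\text{X} = 0$.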

Let us solve for the second term on the right-hand side

\begin{align}
\mathbb{V}[Dy | \text{X}] &= \mathbb{V}[D(X\beta + \epsilon) | \text{X}] & \\
                          &= \mathbb{V}[DX\beta | \text{X}] + \mathbb{V}[D\epsilon | \text{X}] & \\
                          &= \mathbb{V}[0 | \text{X}] + \mathbb{V}[D\epsilon | \text{X}] & \because DX = 0 \implies DX\beta = 0\\
                          &= 0 + \mathbb{V}[D\epsilon | \text{X}] & \\
                          &= D\mathbb{V}[\epsilon | \text{X}]D^T & \\
                          &= D\sigma^2\mathit{I}D^T & \because \text{ spherical error} \\
                          &= \sigma^2DD^T & \\
\end{align}

Replacing this in variance of $\tilde\beta$

\begin{align}
\mathbb{V}[\tilde\beta | \text{X}] &= \mathbb{V}[\hat\beta | \text{X}] + \sigma^2DD^T
\end{align}

Since $DD^T$ is positive semi-definite and $\sigma^2$ is non-negative, the matrix $\sigma^2DD^T$ is also positive semi-definite; in particular, its diagonal entries are zero or positive. This means that the variance of every component of $\tilde\beta$ (and of any linear combination of them) is equal to or greater than the corresponding variance for $\hat\beta$. With this, we can conclude that no linear unbiased estimator has smaller variance than the OLS estimator. Equality holds only when $D = 0$, that is, when $\tilde\beta$ is the OLS estimator itself, so OLS is **BLUE**.
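The whole argument can be checked numerically. The following minimal sketch, assuming NumPy, builds one alternative linear unbiased estimator and compares its variance with that of OLS. The perturbation `D` is constructed as a random matrix times $(\mathit{I} - \text{H})$, which is one hypothetical way to obtain $D\text{X} = 0$; `sigma` and the sizes are also hypothetical.

```python
import numpy as np

rng = np.random.default_rng(7)

n, p, sigma = 40, 2, 1.5
X = np.column_stack([np.ones(n), rng.normal(size=n)])

# OLS weights A = (X^T X)^{-1} X^T and projection matrix H = X A.
A = np.linalg.solve(X.T @ X, X.T)
H = X @ A

# Perturbation with D X = 0, giving another linear unbiased estimator C y.
D = 0.1 * rng.normal(size=(p, n)) @ (np.eye(n) - H)
C = A + D

# Under spherical errors, V[C y | X] = sigma^2 C C^T.
var_ols = sigma**2 * np.linalg.inv(X.T @ X)
var_tilde = sigma**2 * C @ C.T

# The gap should equal sigma^2 D D^T and be positive semi-definite.
gap = var_tilde - var_ols
eigvals = np.linalg.eigvalsh((gap + gap.T) / 2)

print("gap equals sigma^2 D D^T:", np.allclose(gap, sigma**2 * D @ D.T))
print("eigenvalues of the gap  :", eigvals)
```

All eigenvalues of the gap should be non-negative (up to rounding), and they shrink to zero as `D` is scaled towards zero.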