<a href="https://colab.research.google.com/github/zhong338/MFM-FM5222/blob/main/Week7_HatMatrixandLev.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# FM5222
# The Hat Matrix and Leverage


The context here is multi-variable linear regression which in general can be written in the form:

$$Y = X \mathbf{b}  + \mathbf{\epsilon}$$

where

$X$ is an $N \times p$ (or $N \times (p+1)$ if there is an intercept) data matrix representing $N$ observations of $p$ predictor variables, also called features.  

$b$ is a vector of regression coefficients.

$Y$ is the observed response variable (or target).

$\mathbf{\epsilon}$ is a vector of residual noise whose entries are assumed to be independent $\mathcal{N}(0,\sigma_{\epsilon}^2)$.

Note that previously we used $A$ for the matrix $X$, but we use $X$ now to be more consistent with the text and most other presentations.

Also, we previously wrote this as $X \mathbf{b} = Y + \mathbf{\epsilon}$.  We change this as well for greater notational consistency.


Recall that the least-squares fit (which is also the MLE) for the coefficients is given by

$$\hat{b} = (X^TX)^{-1} X^T Y$$

For each observation $i$, we can then calculate the fitted value of $Y_i$ via 

$$\hat{Y}_i = \sum_k X_{i,k}\hat{b}_k$$

And the estimated residual

$$\hat{\epsilon}_i = Y_i - \hat{Y}_i $$
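As a minimal sketch of these three formulas in NumPy, we can fit a small made-up data set. The names `rng`, `X_demo`, `Y_demo` and `b_true`, the seed, and the sizes are illustrative assumptions, not part of the notes. The sketch solves the normal equations with `np.linalg.solve` instead of forming the inverse explicitly, and cross-checks the result against `np.linalg.lstsq`.

```python
import numpy as np

rng = np.random.default_rng(0)            # illustrative seed
N, p = 20, 3
X_demo = rng.normal(size=(N, p))          # hypothetical data matrix, no intercept
b_true = np.array([1.0, 2.0, 3.0])        # hypothetical true coefficients
Y_demo = X_demo @ b_true + rng.normal(scale=2.0, size=N)

# Least-squares coefficients: solve (X^T X) b = X^T Y
b_hat = np.linalg.solve(X_demo.T @ X_demo, X_demo.T @ Y_demo)

# Fitted values and estimated residuals
Y_hat = X_demo @ b_hat
resid = Y_demo - Y_hat

# Cross-check against NumPy's least-squares solver
b_lstsq, *_ = np.linalg.lstsq(X_demo, Y_demo, rcond=None)
print(np.allclose(b_hat, b_lstsq))

# The residuals are orthogonal to every column of X (up to round-off)
print(np.allclose(X_demo.T @ resid, 0.0, atol=1e-8))
```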


We observe then that in matrix notation, we have

$$\hat{Y} = X  \hat{b} =  X(X^TX)^{-1} X^T Y$$


This means that each estimate $\hat{Y}_i$ is a linear combination of the observed $Y$ values.  This linear combination is given by the matrix

$$H = X(X^TX)^{-1} X^T $$


This matrix is called the "hat matrix".
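A short sketch, under the assumption of a randomly generated full-column-rank `X_demo` (all names here are illustrative), shows that applying $H$ to $Y$ reproduces the fitted values $X\hat{b}$:

```python
import numpy as np

rng = np.random.default_rng(1)            # illustrative seed
N, p = 10, 2
X_demo = rng.normal(size=(N, p))          # hypothetical data matrix
Y_demo = rng.normal(size=N)               # any response vector works here

# Hat matrix formed directly from its definition
XtX_inv = np.linalg.inv(X_demo.T @ X_demo)
H_demo = X_demo @ XtX_inv @ X_demo.T

# Fitted values computed two ways: H Y and X b_hat
b_hat = XtX_inv @ X_demo.T @ Y_demo
fits_match = np.allclose(H_demo @ Y_demo, X_demo @ b_hat)

print(H_demo.shape, fits_match)
```

Forming the explicit inverse is fine for a small illustration like this one; for larger or ill-conditioned problems a solver or factorization is usually preferable.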




## Hat Matrix

Let us take a look at this matrix for a minute. 

* It is $N \times N$ but has rank $p$ (or $p+1$; we will just use $p$ from here on).

   We can see this because, assuming $X$ has full column rank, $X^TX$ is a $p\times p$ invertible matrix, so $H$ has the same rank as $X$.
   
* The trace of $H$ is $p$.   Why is this?

    Recall that the trace of a square matrix is the sum of its diagonal entries. Two key facts about the trace are
    
    1. $\mathrm{tr}(BAB^{-1}) = \mathrm{tr}(A)$.  That is, the trace is unaffected by similarity transformations.
    2. If $A$ is $m \times n$ and $B$ is $n \times m$, so that $AB$ and $BA$ are both square, then $\mathrm{tr}(AB) = \mathrm{tr}(BA)$.
    
Using the second fact with $A = X(X^TX)^{-1}$ and $B = X^T$,

$$\mathrm{tr}(H) = \mathrm{tr}\left(X(X^TX)^{-1} X^T \right)\\
= \mathrm{tr}\left(X^T X(X^TX)^{-1} \right)\\
=\mathrm{tr}(I_p) \\
= p$$

where $I_p$ is the $p\times p$ identity matrix.
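Both properties are easy to check numerically. The following is a sketch on a made-up matrix (`X_demo`, the seed, and the sizes are assumptions for illustration). An explicit tolerance is passed to `np.linalg.matrix_rank` so that singular values of round-off size are treated as zero.

```python
import numpy as np

rng = np.random.default_rng(2)            # illustrative seed
N, p = 15, 4
X_demo = rng.normal(size=(N, p))          # hypothetical full-column-rank data matrix

# Hat matrix; pinv(X) equals (X^T X)^{-1} X^T when X has full column rank
H_demo = X_demo @ np.linalg.pinv(X_demo)

rank_H = np.linalg.matrix_rank(H_demo, tol=1e-8)
trace_H = np.trace(H_demo)

# Cyclic property used in the derivation: tr(AB) = tr(BA), A is N x p, B is p x N
A = X_demo @ np.linalg.inv(X_demo.T @ X_demo)
B = X_demo.T
cyclic_ok = np.isclose(np.trace(A @ B), np.trace(B @ A))

print(rank_H, trace_H, cyclic_ok)
```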



Why do we care?  Because we want to understand how much each observation $Y_i$ impacts the estimate $\hat{Y}_i$.


In particular,

$\hat{Y}_i = \sum_k H_{i,k}Y_k = H_{i,i} Y_i + \sum_{k\neq i} H_{i,k}Y_k$

so we see that the influence that $Y_i$ itself has on its own estimate $\hat{Y}_i$ is determined by the diagonal element of $H$, $H_{i,i}$.

We call the value $H_{i,i}$ the *leverage* of the $i$th observation.
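Since only the diagonal of $H$ is needed, the leverages can be computed without building the full $N \times N$ matrix. The sketch below uses a reduced QR factorization, which is a swapped-in technique and not the method used in these notes: if $X = QR$ with $Q$ having orthonormal columns and $R$ invertible, then $H = QQ^T$ and $H_{i,i}$ is the squared norm of the $i$th row of $Q$. The data and names are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(3)            # illustrative seed
N, p = 50, 3
X_demo = rng.normal(size=(N, p))          # hypothetical data matrix

# Reference: diagonal of the explicitly formed N x N hat matrix
lev_full = np.diag(X_demo @ np.linalg.pinv(X_demo))

# Alternative: reduced QR gives Q of shape (N, p), and H = Q Q^T
Q, _ = np.linalg.qr(X_demo)
lev_qr = np.sum(Q**2, axis=1)             # squared row norms of Q

print(np.allclose(lev_full, lev_qr))
```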

## Leverage

Consider the vector $\hat{Y} = HY$.  Its covariance matrix will be given by



$$\mathrm{Cov}(\hat{Y}) = H\mathrm{Cov}(Y)H^T\\
= H\sigma_{\epsilon}^2I_N H^T\\
= \sigma_{\epsilon}^2HH^T$$

However, we note that $H^T = H$ and therefore

$$HH^T = X(X^TX)^{-1} X^T X(X^TX)^{-1} X^T\\
 =X(X^TX)^{-1}I_p X^T\\
 = H$$
 
Hence,

$$\mathrm{Cov}(\hat{Y}) = \sigma_{\epsilon}^2 H$$

In particular, 

$$\mathrm{Var}(\hat{Y}_i) = \sigma_{\epsilon}^2 H_{i,i}$$

This means that if the leverage is high, the standard error of the estimate is high.  For this reason, high leverage points can be points of concern.
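The symmetry and idempotence of $H$, and the resulting covariance formula, can be verified on a small example. This is a sketch with made-up inputs (`X_demo`, `sigma_demo`, and the seed are assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(4)            # illustrative seed
N, p = 8, 2
sigma_demo = 2.0                          # hypothetical noise standard deviation
X_demo = rng.normal(size=(N, p))
H_demo = X_demo @ np.linalg.pinv(X_demo)

# H is symmetric and idempotent
is_symmetric = np.allclose(H_demo, H_demo.T)
is_idempotent = np.allclose(H_demo @ H_demo, H_demo)

# Cov(Y_hat) = H (sigma^2 I_N) H^T collapses to sigma^2 H
cov_Yhat = H_demo @ (sigma_demo**2 * np.eye(N)) @ H_demo.T
cov_matches = np.allclose(cov_Yhat, sigma_demo**2 * H_demo)

# Standard error of each fitted value: sigma * sqrt(H_ii)
se_Yhat = sigma_demo * np.sqrt(np.diag(H_demo))

print(is_symmetric, is_idempotent, cov_matches)
```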


### How to interpret leverage


For the $i$th observation vector $X_i$ (the $i$th row of $X$, written as a column), $H_{i,i} = X_i^T (X^TX)^{-1} X_i$.  Because the matrix $(X^TX)^{-1}$ is positive definite, this forces $0 < H_{i,i}$ whenever $X_i \neq 0$.

Because $HH^T = H$, we can say that

$$H_{i,i} = \sum_j H_{i,j}^2 \geq H_{i,i}^2$$

and therefore $H_{i,i} \leq 1$

So we know that $0 < H_{i,i} \leq 1$
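Spelling out the inequality step as a short worked derivation, using only the symmetry $H^T = H$ and the identity $HH^T = H$ from the previous section:

$$H_{i,i} = (HH^T)_{i,i} = \sum_j H_{i,j}H_{i,j} = H_{i,i}^2 + \sum_{j\neq i} H_{i,j}^2 \geq H_{i,i}^2$$

Dividing both sides by $H_{i,i} > 0$ gives $1 \geq H_{i,i}$.  Equality holds only when every off-diagonal entry in row $i$ is zero, in which case $\hat{Y}_i = Y_i$ exactly.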


Because $\mathrm{tr}(H) = p$, we know that $\frac{1}{N}\sum_{i=1}^N H_{i,i} =\frac{p}{N}$, so the average leverage is $\frac{p}{N}$.  A high-leverage point will be above average, and a rule of thumb for concern is twice the average.  That is, we consider leverage to be high if

$$H_{i,i} > 2\frac{p}{N}$$
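The rule of thumb can be wrapped in a small helper. This is a sketch only: the function name `high_leverage` and its `factor` parameter are hypothetical, and the demo data (with one row deliberately placed far from the rest) is made up for illustration.

```python
import numpy as np

def high_leverage(X, factor=2.0):
    """Return (leverages, mask), where mask marks H_ii > factor * p / N."""
    N, p = X.shape
    # diag(X @ pinv(X)) computed without forming the full N x N product
    lev = np.einsum("ij,ji->i", X, np.linalg.pinv(X))
    return lev, lev > factor * p / N

rng = np.random.default_rng(5)            # illustrative seed
X_demo = rng.normal(size=(30, 2))         # hypothetical data matrix
X_demo[-1] = [10.0, 10.0]                 # one row placed far from the others

lev, mask = high_leverage(X_demo)
print(np.flatnonzero(mask))               # zero-based indices of flagged rows
```

The leverages returned here also satisfy the bounds and the trace identity derived above: each lies in $(0, 1]$ and they sum to $p$.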


### Example

We construct a fake data example where one of the $X$ observations will be high leverage.  We can then generate several different versions of $Y$ data consistent with the same $X$ observations and show that, for the high-leverage observation, $\hat{Y}_i$ is highly variable.


We will take $p= 3$ and $N = 12$ and omit an intercept.  The "true model" will have $\mathbf{b}^T = (1,2,3)$ and $\sigma_{\epsilon} = 2$.


In [None]:
import numpy as np
import matplotlib.pyplot as plt



In [None]:
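sigma_e = 2                 # true noise standard deviation
b = np.array([1, 2, 3])     # true regression coefficients

# 12 observations of 3 features; the last row is far from the rest by design
X = np.array([[ 0. , -2. , -2. ], [ 1. , -3. , -3. ], [ 1.5, -0.5,  0. ], [-1. ,  1. , -0.5],
              [-2. ,  0.5,  2. ], [ 0.5, -1.5, -2.5], [-2.5, -2.5,  1. ], [ 2. ,  1.5,  0.5],
              [-3. , -1. ,  1.5], [-0.5,  0. , -1. ], [-1.5,  2. , -1.5], [ 5.5,  5.5,  5.5]])

X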

array([[ 0. , -2. , -2. ],
       [ 1. , -3. , -3. ],
       [ 1.5, -0.5,  0. ],
       [-1. ,  1. , -0.5],
       [-2. ,  0.5,  2. ],
       [ 0.5, -1.5, -2.5],
       [-2.5, -2.5,  1. ],
       [ 2. ,  1.5,  0.5],
       [-3. , -1. ,  1.5],
       [-0.5,  0. , -1. ],
       [-1.5,  2. , -1.5],
       [ 5.5,  5.5,  5.5]])

Let's calculate the hat matrix.


In [None]:
# pinv(X) equals (X^T X)^{-1} X^T when X has full column rank
H = X @ np.linalg.pinv(X)

And now the leverages of the observations.  Recall that high leverage means leverage higher than

$$2 \frac{p}{N} = .5$$

In [None]:
np.diag(H)

array([0.10363453, 0.32994599, 0.08514144, 0.12533187, 0.19433764,
       0.13705026, 0.34697427, 0.07587201, 0.28206537, 0.03943514,
       0.51586103, 0.76435045])

We can see that the last observation is a high-leverage point (by design, of course!).  The second-to-last observation, at about $0.516$, also just exceeds the $0.5$ threshold (not by design).
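To pick out the flagged observations programmatically, we can apply the threshold to the leverages directly. In this sketch the array `h` is copied from the `np.diag(H)` output above; the names `h`, `threshold`, and `flagged` are illustrative.

```python
import numpy as np

# Leverages copied from the np.diag(H) output above
h = np.array([0.10363453, 0.32994599, 0.08514144, 0.12533187, 0.19433764,
              0.13705026, 0.34697427, 0.07587201, 0.28206537, 0.03943514,
              0.51586103, 0.76435045])
p, N = 3, 12
threshold = 2 * p / N                     # 0.5

flagged = np.flatnonzero(h > threshold)   # zero-based row indices
print(flagged)                            # [10 11]
print(h.sum())                            # approximately p = 3
```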

Now generate $100$ different samplings of $Y$ under the notion that $Y = Xb + \epsilon$ 

In [None]:
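# 100 independent noise vectors, one per column
epsilons = np.random.normal(scale=sigma_e, size=[12, 100])

# Repeat the noiseless response X b across all 100 columns, then add the noise
Ys = np.reshape(X @ b, [12, 1]) @ np.ones([1, 100]) + epsilons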

We can now create 100 different estimates of $\hat{Y}$.

In [None]:
Yhats = H@Ys

We are interested in the standard deviation  of each element of $\hat{Y}$

In [None]:
np.std(Yhats, axis = 1)

array([0.62379145, 1.13661485, 0.59029915, 0.71708115, 0.89647068,
       0.71346189, 1.24664983, 0.57135274, 1.11164715, 0.38566286,
       1.45538315, 1.63999593])

We compare this with the predicted (theoretical) values

In [None]:
sigma_e* np.sqrt(np.diag(H))

array([0.64384636, 1.1488185 , 0.58358014, 0.70804482, 0.88167485,
       0.740406  , 1.17809045, 0.5508975 , 1.06219653, 0.39716564,
       1.43646932, 1.74854277])
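With only $100$ draws the two sets of numbers agree loosely, since the sampling error of a standard deviation estimate is roughly $1/\sqrt{2 \cdot 100} \approx 7\%$. The following sketch repeats the comparison with many more draws on a made-up design matrix, so the agreement should tighten. `X_demo`, `b_demo`, `sigma_demo`, the seed, and the number of draws are all assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(6)            # illustrative seed
N, p, n_draws = 12, 3, 20000
sigma_demo = 2.0                          # hypothetical noise standard deviation
X_demo = rng.normal(size=(N, p))          # hypothetical data matrix
b_demo = np.array([1.0, 2.0, 3.0])
H_demo = X_demo @ np.linalg.pinv(X_demo)

# Each column of Ys is one simulated response vector Y = X b + eps
Ys = (X_demo @ b_demo)[:, None] + rng.normal(scale=sigma_demo, size=(N, n_draws))
Yhats = H_demo @ Ys

empirical = np.std(Yhats, axis=1)
theoretical = sigma_demo * np.sqrt(np.diag(H_demo))

# Largest relative gap between simulated and theoretical standard errors
max_rel_err = np.max(np.abs(empirical / theoretical - 1))
print(max_rel_err)
```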

### Summary

* Each estimate $\hat{Y}_i$ is a linear combination of the observed values of $Y$. 

* The linear combination is given by the hat matrix $H = X(X^TX)^{-1} X^T$.

* The diagonal elements of $H$ measure how much the estimate $\hat{Y}_i$ depends directly on $Y_i$ and are called the leverages.

* Leverage more than twice the average, that is above $2 \frac{p}{N}$, is considered high leverage and is potentially a problem.

* The standard error of $\hat{Y}_i$ is $\sigma_{\epsilon} \sqrt{H_{i,i}}$

