# Multicollinearity

In section 3, we derived the normal equation for OLS linear regression, and the estimate for $\beta$ was:

$$ \hat\beta = (\text{X}^T\text{X})^{-1}\text{X}^Ty $$

where $y$ is an $n$-dimensional vector of target values and **X** is the $n \times p$ design matrix.

Note that estimating $\hat\beta$ depends on the Gram matrix $\text{X}^T\text{X}$ being invertible. Furthermore, the variance of this estimator

$$ \mathbb{V}[\hat\beta | \text{X}] = \sigma^2(\text{X}^T\text{X})^{-1} $$

also depends on the Gram matrix being invertible.

The Gram matrix is invertible if and only if **X** has full column rank, that is, when none of its *p* columns can be written as a linear combination of the other columns. This means that all *p* independent variables are linearly independent, which also requires $n \ge p$.
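The link between the rank of **X** and the invertibility of the Gram matrix can be checked numerically. The following is a minimal sketch on synthetic data; the variable names and the coefficients used to build the dependent column are illustrative assumptions, not taken from any real dataset.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50

# Two unrelated predictors plus an intercept column.
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
X_full = np.column_stack([np.ones(n), x1, x2])

# Same design, plus a column that is an exact linear combination of x1 and x2.
x3 = 2.0 * x1 - 0.5 * x2
X_deficient = np.column_stack([np.ones(n), x1, x2, x3])

rank_full = np.linalg.matrix_rank(X_full)            # equals the number of columns
rank_deficient = np.linalg.matrix_rank(X_deficient)  # one less than the number of columns

# A (numerically) singular Gram matrix shows up as a huge condition number.
cond_gram_full = np.linalg.cond(X_full.T @ X_full)
cond_gram_deficient = np.linalg.cond(X_deficient.T @ X_deficient)

print(f"full design:      rank {rank_full} of {X_full.shape[1]} columns, "
      f"cond(X^T X) = {cond_gram_full:.3g}")
print(f"deficient design: rank {rank_deficient} of {X_deficient.shape[1]} columns, "
      f"cond(X^T X) = {cond_gram_deficient:.3g}")
```

In floating point arithmetic the singular Gram matrix rarely comes out as exactly singular, which is why the sketch reports a condition number instead of attempting the inversion.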

When independent variables are linearly dependent, we call it perfect multicollinearity.

$$ x_i = \alpha_0 + \sum_{j \neq i} \alpha_j x_j \quad j \in [1 \dots p] $$

In practice, however, we rarely see perfect multicollinearity. When we use the word multicollinearity, we usually mean severe imperfect multicollinearity, where the linear relationship holds only up to an error term $u$:

$$ x_i = \alpha_0 + \sum_{j \neq i} \alpha_j x_j + u \quad j \in [1 \dots p] $$

## Consequence of multicollinearity

**Perfect multicollinearity**

Under perfect multicollinearity, the Gram matrix becomes non-invertible, so $\hat\beta$ cannot be computed uniquely and the OLS estimator is no longer **BLUE**.

**Imperfect multicollinearity**

However, perfect multicollinearity is rare in real data, and imperfect multicollinearity does not violate the assumptions of OLS. Therefore, the Gauss-Markov theorem tells us that the OLS estimator is still **BLUE**.

Although imperfect multicollinearity does not, by and large, affect prediction, it does have consequences.

### Difficult to interpret

Since multicollinear predictors carry overlapping information about the same response, the model becomes difficult to interpret: it is hard to disambiguate the effects of two or more multicollinear predictors.

Take as an example a simple linear model with two independent variables:

$$ y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \epsilon $$

If $x_1 = kx_2$, then

$$
\begin{aligned}
y &= \beta_0 + \beta_1 (kx_2) + \beta_2 x_2 + \epsilon \\
  &= \beta_0 + (\beta_1 k + \beta_2)x_2 + \epsilon
\end{aligned}
$$

Here, only the combined slope $\beta' = \beta_1 k + \beta_2$ is determined by the data: for any choice of $\beta_1$, setting $\beta_2 = \beta' - \beta_1 k$ gives exactly the same fit, so there are infinitely many valid pairs of $\beta_1$ and $\beta_2$. Because of this, $\beta_1$ and $\beta_2$ cannot be interpreted individually. When the dependence is strong but not exact, the same ambiguity shows up as a larger variance of $\hat\beta$, which lowers the t-statistics and makes it harder to reject the null hypothesis.
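The non-uniqueness of the coefficients under $x_1 = kx_2$ can be made concrete with a few lines of NumPy. This is only an illustrative sketch: the value of $k$, the two coefficient vectors, and the variable names are arbitrary choices made for the example.

```python
import numpy as np

rng = np.random.default_rng(1)
n, k = 100, 3.0

x2 = rng.normal(size=n)
x1 = k * x2                              # exact dependence x1 = k * x2
X = np.column_stack([np.ones(n), x1, x2])

# Two different coefficient vectors (beta0, beta1, beta2) that share
# the same combined slope beta1 * k + beta2 = 7.
beta_a = np.array([1.0, 2.0, 1.0])       # 2 * 3 + 1 = 7
beta_b = np.array([1.0, -5.0, 22.0])     # -5 * 3 + 22 = 7

pred_a = X @ beta_a
pred_b = X @ beta_b

# The predictions coincide even though beta1 and beta2 differ wildly.
print(np.allclose(pred_a, pred_b))       # True
```

Since both vectors produce the same predictions, no amount of data generated this way can tell them apart, and even the sign of $\beta_1$ is not identified.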

Let us understand this with a numeric example. Suppose we have data with 4 independent variables ($x_1, x_2, x_3, x_4$), of which $x_1$ and $x_3$ are collinear. We fit three models: one with all independent variables, one with just $x_1$, and one with just $x_3$. We get these results:
- **with all independent variables**
  - $R^2$ = 0.992
  - coefficients for predictors:
    - $\beta_1$ = 1.64
    - $\beta_2$ = 0.79
    - $\beta_3$ = 3.83
    - $\beta_4$ = 0.21

- **with just $x_1$**
  - $R^2$ = 0.926
  - coefficients for predictors:
    - $\beta_1$ = 2.05

- **with just $x_3$**
  - $R^2$ = 0.761
  - coefficients for predictors:
    - $\beta_3$ = 11.96

Now, if we look at the results for the model with all independent variables, we might say that $x_3$ is the most important predictor, since it has the largest coefficient. However, when we look at the models with just $x_1$ and just $x_3$, the situation changes: $x_1$ has more predictive power, as it gives a noticeably higher $R^2$ than $x_3$ (0.926 versus 0.761). So when both of these independent variables are included in the model, multicollinearity complicates the analysis.
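A similar comparison can be reproduced on synthetic data. The sketch below does not use the data behind the numbers above, so its output will differ; the data-generating process, the helper `fit_ols`, and all coefficients are assumptions made for illustration. It fits the same three models and shows how the coefficient of the collinear predictor depends on which other predictors are included.

```python
import numpy as np

def fit_ols(X, y):
    """Fit OLS with an intercept; return (slope coefficients, R^2)."""
    design = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    residuals = y - design @ coef
    r2 = 1.0 - (residuals @ residuals) / np.sum((y - y.mean()) ** 2)
    return coef[1:], r2

rng = np.random.default_rng(42)
n = 200
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
x3 = 0.9 * x1 + 0.4 * rng.normal(size=n)   # x3 is strongly correlated with x1
x4 = rng.normal(size=n)
# In this synthetic setup the response does not depend on x3 directly.
y = 2.0 * x1 + 0.8 * x2 + 0.2 * x4 + 0.3 * rng.normal(size=n)

coef_all, r2_all = fit_ols(np.column_stack([x1, x2, x3, x4]), y)
coef_x1, r2_x1 = fit_ols(x1, y)
coef_x3, r2_x3 = fit_ols(x3, y)

print("all predictors:", np.round(coef_all, 2), "R^2 =", round(r2_all, 3))
print("just x1:       ", np.round(coef_x1, 2), "R^2 =", round(r2_x1, 3))
print("just x3:       ", np.round(coef_x3, 2), "R^2 =", round(r2_x3, 3))
```

In this setup the model with just $x_3$ assigns it a large coefficient because $x_3$ acts as a proxy for $x_1$, while the full model assigns it a very different one. The coefficient of a collinear predictor therefore says little about its importance on its own.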

### Unstable parameter estimates

The second problem with multicollinearity is that the estimation of $\hat\beta$ becomes unstable. But first, let us quantify multicollinearity. For this we will use the **condition index** and the **condition number**. A condition index of a matrix **A** is the ratio of its largest singular value to one of its singular values (or eigenvalues, in absolute value, if **A** is normal), so there is one condition index per singular value. The largest condition index is called the condition number ($\mathcal{K}(A)$), i.e. the ratio of the largest singular value to the smallest.

$$ \mathcal{K}(A) = \frac{\sigma_{max}(A)}{\sigma_{min}(A)} $$

As the smallest singular value $\sigma_{min}(A)$ approaches zero, the condition number of **A** grows very large.
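Both quantities are straightforward to compute from the singular values. The following is a small sketch on a synthetic matrix with two nearly identical columns; the matrix and the variable names are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100
x1 = rng.normal(size=n)
x2 = x1 + 0.01 * rng.normal(size=n)   # nearly identical to x1
x3 = rng.normal(size=n)
A = np.column_stack([x1, x2, x3])

# Singular values are returned in descending order.
singular_values = np.linalg.svd(A, compute_uv=False)

# One condition index per singular value: sigma_max / sigma_i.
condition_indices = singular_values[0] / singular_values

# The condition number is the largest condition index: sigma_max / sigma_min.
condition_number = condition_indices.max()

print("singular values:  ", singular_values)
print("condition indices:", condition_indices)
print("condition number: ", condition_number)
print("np.linalg.cond(A):", np.linalg.cond(A))   # same value, computed by NumPy
```

The near-duplicate column produces one singular value that is much smaller than the others, and that single value is what drives the condition number up.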

The condition number measures how ***well-conditioned*** a problem is. A **well-conditioned** problem is one where a small change in the input *x* results in a small change in the output *f(x)*. An **ill-conditioned** problem is one where a small change in the input *x* results in a large change in the output *f(x)*. In terms of regression, an ill-conditioned problem is one where a small change in the data produces a large change in the answer, that is, in the estimated coefficients. A low condition number indicates a well-conditioned problem, and a high condition number indicates an ill-conditioned one.

But what is the intuition behind this?

Let us consider the singular value decomposition (SVD) of a square matrix **A**

$$ A = USV^T $$

where **U** and **V** are orthogonal matrices and **S** is a diagonal matrix of singular values

$$
S = \begin{bmatrix}
    \sigma_1(A) & 0 & \cdots & 0 \\
    0 & \sigma_2(A) & \cdots & 0 \\
    \vdots & \vdots & \ddots & \vdots \\
    0 & 0 & \cdots & \sigma_p(A) \\
    \end{bmatrix}
$$

Then the inverse of **A** can be written in terms of the SVD as

$$ A^{-1} = VS^{-1}U^{T} $$

and since **S** is a diagonal matrix, its inverse is the diagonal matrix of the reciprocals of its diagonal elements

$$
S^{-1} = \begin{bmatrix}
    \frac{1}{\sigma_1(A)} & 0 & \cdots & 0 \\
    0 & \frac{1}{\sigma_2(A)} & \cdots & 0 \\
    \vdots & \vdots & \ddots & \vdots \\
    0 & 0 & \cdots & \frac{1}{\sigma_p(A)} \\
    \end{bmatrix}
$$

Here, if any singular value of **A** is zero, the inverse does not exist. Even when a singular value is merely close to zero, the factor $\frac{1}{\sigma_p(A)}$ means that a small change in that singular value can cause a big change in the inverse. This makes the computation sensitive to $\sigma_p(A)$. Also, at such small values, the inverse may become numerically unstable because of floating point arithmetic errors.
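The effect of a small singular value on the inverse can be seen on a $2 \times 2$ example. This is a minimal sketch with hypothetical matrix entries chosen so that one singular value is close to zero.

```python
import numpy as np

# A nearly singular symmetric matrix (illustrative values).
A = np.array([[1.0, 0.999],
              [0.999, 1.0]])

# Inverse via the SVD: A^{-1} = V S^{-1} U^T.
U, s, Vt = np.linalg.svd(A)
A_inv_svd = Vt.T @ np.diag(1.0 / s) @ U.T

print("singular values:", s)          # approximately [1.999, 0.001]
print("reciprocals:    ", 1.0 / s)    # approximately [0.5, 1000]
print("matches np.linalg.inv:", np.allclose(A_inv_svd, np.linalg.inv(A)))

# Nudge the off-diagonal entries by 0.0005 and invert again.
A_perturbed = np.array([[1.0, 0.9995],
                        [0.9995, 1.0]])
growth = np.linalg.inv(A_perturbed)[0, 0] / np.linalg.inv(A)[0, 0]
print("growth of the (0, 0) entry of the inverse:", growth)   # roughly 2
```

A change of 0.0005 in the input halves the smallest singular value and roughly doubles the entries of the inverse, which is the sensitivity described above.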

For more details on condition number and stability, refer to Part III (Lectures 12-14) of *Numerical linear algebra* (vol 50) by Trefethen, L. N., & Bau III, D. For how this relates to OLS, see Lectures 18 and 19. If you are interested, take a look at Exercise 18.2.

But how does this relate to OLS?

In OLS, the estimate of $\beta$ is calculated as

$$ \hat\beta = (\text{X}^T\text{X})^{-1}\text{X}^Ty $$

Here, the Gram matrix ($\text{X}^T\text{X}$) plays the role of the matrix **A**. So if the condition number of $\text{X}^T\text{X}$ is high, the problem is ill-conditioned and the estimate of the coefficients ($\hat\beta$) becomes unstable.

This is why Python libraries such as statsmodels warn us about the eigenvalues of $\text{X}^T\text{X}$ (the squares of the singular values of **X**) if we fit OLS to data with multicollinearity.
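The connection between the conditioning of the Gram matrix and the stability of $\hat\beta$ can be checked numerically. The sketch below uses only NumPy (not statsmodels) on synthetic data; the helpers `ols` and `mean_coefficient_shift`, the noise scales, and the data-generating process are all assumptions made for illustration. It compares an independent and a collinear design, and measures how much $\hat\beta$ moves when the response is nudged by small random noise.

```python
import numpy as np

def ols(X, y):
    """OLS estimate from the normal equations (X^T X) beta = X^T y."""
    return np.linalg.solve(X.T @ X, X.T @ y)

def mean_coefficient_shift(X, y, rng, scale=0.01, trials=100):
    """Average change in beta-hat when y is nudged by small random noise."""
    beta = ols(X, y)
    shifts = [
        np.linalg.norm(ols(X, y + scale * rng.normal(size=len(y))) - beta)
        for _ in range(trials)
    ]
    return float(np.mean(shifts))

rng = np.random.default_rng(0)
n = 200
x1 = rng.normal(size=n)
x2_independent = rng.normal(size=n)
x2_collinear = x1 + 0.01 * rng.normal(size=n)   # almost a copy of x1
noise = 0.5 * rng.normal(size=n)

results = {}
for name, x2 in [("independent", x2_independent), ("collinear", x2_collinear)]:
    X = np.column_stack([np.ones(n), x1, x2])
    y = 1.0 + 2.0 * x1 + 1.0 * x2 + noise
    results[name] = {
        "cond_X": np.linalg.cond(X),
        "cond_gram": np.linalg.cond(X.T @ X),   # equals cond_X squared
        "mean_shift": mean_coefficient_shift(X, y, rng),
    }

for name, r in results.items():
    print(name, {key: float(f"{value:.4g}") for key, value in r.items()})
```

Because the eigenvalues of $\text{X}^T\text{X}$ are the squares of the singular values of **X**, the condition number of the Gram matrix is the square of the condition number of **X**. Forming the Gram matrix therefore amplifies any ill-conditioning already present in the design matrix.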

## Detection of multicollinearity

See *Multicollinearity and misleading statistical results* by Jong Hae Kim: https://ekja.org/upload/pdf/kja-19087.pdf