# Multicollinearity

In section 3, we derived the normal equation for OLS linear regression, and the estimate for $\beta$ was:

$$ \hat\beta = (\text{X}^T\text{X})^{-1}\text{X}^Ty $$

where $y$ is an $n$-dimensional vector of target variables and **X** is the $n \times p$ design matrix.

Note that estimating $\hat\beta$ depends on the Gram matrix $\text{X}^T\text{X}$ being invertible. Furthermore, the variance of this estimator

$$ \mathbb{V}[\hat\beta | \text{X}] = \sigma^2(\text{X}^T\text{X})^{-1} $$

also depends on the Gram matrix being invertible.

The Gram matrix is invertible if and only if **X** has full column rank, that is, when none of its *p* columns can be written as a linear combination of the other columns. This means that all *p* independent variables are linearly independent, which also requires $n \ge p$.
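The rank condition can be checked numerically. The following is a minimal sketch on synthetic data (the variables `x1`, `x2`, `x3` and the coefficients 2.0 and -0.5 are made up for illustration): adding a column that is an exact linear combination of the others leaves the rank unchanged and pushes the smallest eigenvalue of the Gram matrix to numerical zero.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
x3 = 2.0 * x1 - 0.5 * x2                # exact linear combination of x1 and x2

X_ok = np.column_stack([x1, x2])        # linearly independent columns
X_bad = np.column_stack([x1, x2, x3])   # third column is redundant

rank_ok = np.linalg.matrix_rank(X_ok)
rank_bad = np.linalg.matrix_rank(X_bad)
print(f"rank {rank_ok} with {X_ok.shape[1]} columns")    # full column rank
print(f"rank {rank_bad} with {X_bad.shape[1]} columns")  # rank deficient

# The Gram matrix of the rank-deficient design has a (numerically) zero eigenvalue,
# so it cannot be inverted reliably.
gram_bad = X_bad.T @ X_bad
smallest_eig = np.linalg.eigvalsh(gram_bad)[0]   # eigvalsh returns ascending order
print(f"smallest eigenvalue of the Gram matrix: {smallest_eig:.2e}")
```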

When independent variables are linearly dependent, we call it perfect multicollinearity.

$$ x_i = \alpha_0 + \sum_{j \neq i} \alpha_j x_j \quad j \in [1 \dots p] $$

In practice, however, we rarely see perfect multicollinearity. When we use the word multicollinearity, we usually mean severe imperfect multicollinearity, where the relationship holds only up to an error term $u$:

$$ x_i = \alpha_0 + \sum_{j \neq i} \alpha_j x_j + u \quad j \in [1 \dots p] $$

## Consequence of multicollinearity

**Perfect multicollinearity**

Under perfect multicollinearity, the Gram matrix is non-invertible, so $\hat\beta$ cannot be computed from the normal equation. There is no unique OLS solution, and the estimator can no longer be **BLUE**.

**Imperfect multicollinearity**

However, perfect multicollinearity is rare in real data, and imperfect multicollinearity does not violate the assumptions of OLS. Therefore, the Gauss-Markov theorem tells us that the OLS estimator is still **BLUE**.

Although imperfect multicollinearity does not, by and large, hurt prediction, it does have consequences.

### Difficult to interpret

Since multicollinear predictors can be predictive of the same response, the model becomes difficult to interpret: it is hard to disentangle the effects of two or more multicollinear predictors.

Take as an example a simple linear model with two independent variables:

$$ y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \epsilon $$

In this, if $x_1 = kx_2$

$$
\begin{aligned}
y &= \beta_0 + \beta_1 (kx_2) + \beta_2 x_2 + \epsilon \\
  &= \beta_0 + (\beta_1 k + \beta_2)x_2 + \epsilon
\end{aligned}
$$

Here, only the combined coefficient $\beta' = \beta_1 k + \beta_2$ is determined by the data: for any choice of $\beta_1$ there is a $\beta_2$ that gives the same $\beta'$, so infinitely many pairs fit equally well. Because of this, $\beta_1$ and $\beta_2$ cannot be interpreted individually. When the relationship is only approximate, the same effect shows up as an increase in the variance of $\hat\beta$, which lowers the t-statistics and makes it harder to reject the null hypothesis.
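The non-identifiability can be seen directly in a small sketch. The values of `k`, `beta0` and `beta_combined` below are arbitrary illustrative choices: two very different pairs ($\beta_1$, $\beta_2$) with the same $\beta_1 k + \beta_2$ produce identical predictions.

```python
import numpy as np

rng = np.random.default_rng(1)
k = 3.0
x2 = rng.normal(size=20)
x1 = k * x2                      # perfect collinearity: x1 = k * x2

beta0 = 1.0
beta_combined = 5.0              # the identified quantity beta' = beta1 * k + beta2

def predict(beta1):
    """Predictions for a given beta1, with beta2 chosen so that beta' stays fixed."""
    beta2 = beta_combined - beta1 * k
    return beta0 + beta1 * x1 + beta2 * x2

pred_a = predict(beta1=0.0)      # (beta1, beta2) = (0, 5)
pred_b = predict(beta1=100.0)    # (beta1, beta2) = (100, -295)

max_gap = np.max(np.abs(pred_a - pred_b))
print(f"largest difference between the two sets of predictions: {max_gap:.2e}")
```

Since the data cannot distinguish between these coefficient pairs, no fitting procedure can either.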

Let us understand this with a numeric example. Suppose we have data with 4 independent variables ($x_1, x_2, x_3, x_4$), of which $x_1$ and $x_3$ are collinear. We fit three models: one with all independent variables, one with just $x_1$, and one with just $x_3$, and we get these results:
- **with all independent variables**
  - $R^2$ = 0.992
  - coefficients for predictors:
    - $\beta_1$ = 1.64
    - $\beta_2$ = 0.79
    - $\beta_3$ = 3.83
    - $\beta_4$ = 0.21

- **with just $x_1$**
  - $R^2$ = 0.926
  - coefficients for predictors:
    - $\beta_1$ = 2.05

- **with just $x_3$**
  - $R^2$ = 0.761
  - coefficients for predictors:
    - $\beta_3$ = 11.96

Now if we look at the results for the model with all independent variables, we might say that $x_3$ is important. However, when we look at the models with just $x_1$ and just $x_3$, the situation changes. We find that $x_1$ has more predictive power, as it gives a significantly higher $R^2$ than $x_3$. So when both of these independent variables are included in the model, multicollinearity complicates the analysis.

### Unstable parameter estimates

The second problem with multicollinearity is that the estimation of $\hat\beta$ becomes unstable. But first, let us quantify multicollinearity. For this we will use the **condition index** and the **condition number**. A condition index of a matrix **A** is the ratio of its largest singular value to one of its singular values (or eigenvalues, if **A** is normal), so there is one condition index per singular value. The largest condition index is called the condition number $\mathcal{K}(A)$, i.e., the ratio of the largest singular value to the smallest.

$$ \mathcal{K}(A) = \frac{\sigma_{max}(A)}{\sigma_{min}(A)} $$

As the smallest singular value $\sigma_{min}(A)$ approaches zero, the condition number of **A** grows very large.
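As a rough illustration, the sketch below computes condition indices and the condition number from the singular values for two synthetic design matrices (the helper `condition_indices` and the noise level `1e-3` are assumptions made for this example, not part of any library).

```python
import numpy as np

rng = np.random.default_rng(2)
n = 200
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)                      # unrelated to x1
x2_near = x1 + 1e-3 * rng.normal(size=n)     # almost a copy of x1

def condition_indices(X):
    """Ratio of the largest singular value to each singular value."""
    s = np.linalg.svd(X, compute_uv=False)   # singular values, descending
    return s[0] / s

X_indep = np.column_stack([x1, x2])
X_coll = np.column_stack([x1, x2_near])

ci_indep = condition_indices(X_indep)
ci_coll = condition_indices(X_coll)
kappa_indep = ci_indep.max()                 # condition number = largest index
kappa_coll = ci_coll.max()

print(f"condition number, independent columns:    {kappa_indep:.2f}")
print(f"condition number, nearly collinear columns: {kappa_coll:.2f}")
```

The largest condition index agrees with `np.linalg.cond(X)`, which uses the same ratio of singular values by default.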

The condition number measures how ***well-conditioned*** a problem is. A **well-conditioned** problem is one where a small change in the input *x* results in a small change in the output *f(x)*. An **ill-conditioned** problem is one where a small change in the input *x* results in a large change in the output *f(x)*. In terms of regression, an ill-conditioned problem is one where a small change in the data produces a large change in the answer, that is, in the estimated coefficients. A well-conditioned problem has a low condition number, and a high condition number indicates an ill-conditioned problem.

But what is the intuition behind this?

Let us consider the singular value decomposition (SVD) of a matrix **A**

$$ A = USV^T $$

where **U** and **V** are orthogonal matrices and **S** is a diagonal matrix of singular values

$$
S = \begin{bmatrix}
    \sigma_1(A) & 0 & \cdots & 0 \\
    0 & \sigma_2(A) & \cdots & 0 \\
    \vdots & \vdots & \ddots & \vdots \\
    0 & 0 & \cdots & \sigma_p(A) \\
    \end{bmatrix}
$$

Then, when **A** is square and invertible, its inverse can be written in terms of the SVD as

$$ A^{-1} = VS^{-1}U^{T} $$

and since **S** is a diagonal matrix, its inverse is obtained by taking the reciprocal of each diagonal element

$$
S^{-1} = \begin{bmatrix}
    \frac{1}{\sigma_1(A)} & 0 & \cdots & 0 \\
    0 & \frac{1}{\sigma_2(A)} & \cdots & 0 \\
    \vdots & \vdots & \ddots & \vdots \\
    0 & 0 & \cdots & \frac{1}{\sigma_p(A)} \\
    \end{bmatrix}
$$

In this form, if any singular value of **A** is zero, the inverse does not exist. Even when the smallest singular value is merely close to zero, the term $\frac{1}{\sigma_p(A)}$ means that a small change in that singular value can cause a big change in the inverse. This makes the computation sensitive to $\sigma_p(A)$. At such small values, the inverse may also become numerically unstable because of floating-point arithmetic errors.
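A small numerical sketch of this sensitivity, using a hand-picked nearly singular $2 \times 2$ matrix (the matrix and the perturbation size are illustrative assumptions): a tiny relative change in **A** produces a change in the inverse that is larger by several orders of magnitude.

```python
import numpy as np

# A nearly singular matrix: the two rows are almost identical.
A = np.array([[1.0, 1.0],
              [1.0, 1.0001]])

# Inverse via the SVD: A^{-1} = V S^{-1} U^T
U, s, Vt = np.linalg.svd(A)
A_inv_svd = Vt.T @ np.diag(1.0 / s) @ U.T
A_inv = np.linalg.inv(A)

# Perturb a single entry by 1e-4 and invert again.
A_pert = A.copy()
A_pert[1, 1] += 1e-4

rel_change_input = np.linalg.norm(A_pert - A) / np.linalg.norm(A)
rel_change_inverse = (np.linalg.norm(np.linalg.inv(A_pert) - A_inv)
                      / np.linalg.norm(A_inv))

print(f"singular values:              {s}")
print(f"relative change in A:         {rel_change_input:.2e}")
print(f"relative change in A inverse: {rel_change_inverse:.2e}")
```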

For more details on condition number and stability, refer to Part III (Lectures 12-14) of *Numerical linear algebra* (vol 50) by Trefethen, L. N., & Bau III, D. For how this relates to OLS, see Lectures 18 and 19. If you are interested, take a look at exercise 18.2.

**But how does this relate to OLS?**

In OLS, the estimate of $\beta$ is calculated as

$$ \hat\beta = (\text{X}^T\text{X})^{-1}\text{X}^Ty $$

The singular value decomposition for our design matrix **X** will be

$$ \text{X} = USV^T $$

The normal equation for $\hat\beta$ depends on $(\text{X}^T\text{X})^{-1}$. Using the SVD of **X**, the Gram matrix $\text{X}^T\text{X}$ becomes

\begin{align*}
\text{X}^T\text{X} &= (USV^T)^T(USV^T) & \\
                   &= (VS^TU^T)(USV^T) & \\
                   &= (VSU^T)(USV^T) & \because S \text{ is diagonal so } S^T = S \\
                   &= VSU^TUSV^T & \\
                   &= VSISV^T & \because U \text{ is orthogonal so } U^T = U^{-1} \\
                   &= VS^2V^T & \because S \text{ is diagonal so } SS = S^2 \\
\end{align*}

On substituting this in the normal equation, we get


\begin{align*}
\hat\beta &= (\text{X}^T\text{X})^{-1}\text{X}^Ty & \\
          &= (VS^2V^T)^{-1}(USV^T)^Ty & \\
          &= ((V^T)^{-1}S^{-2}V^{-1})(USV^T)^Ty & \\
          &= (VS^{-2}V^T)(VSU^T)y & \because V \text{ is orthogonal and } V^TV = I \text{ so } V^T = V^{-1} \text{ and } (V^T)^{-1} = V \\
          &= VS^{-2}V^TVSU^Ty & \\
          &= VS^{-2}SU^Ty & \\
          &= VS^{-1}U^Ty & \\
\end{align*}
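The result $\hat\beta = VS^{-1}U^Ty$ can be sanity-checked numerically. This is a minimal sketch on synthetic data (the sizes, `beta_true` and the noise level are arbitrary, and no intercept column is used): the normal-equation solution, the SVD formula and NumPy's least-squares solver should agree.

```python
import numpy as np

rng = np.random.default_rng(3)
n, p = 100, 3
X = rng.normal(size=(n, p))
beta_true = np.array([1.5, -2.0, 0.5])
y = X @ beta_true + 0.1 * rng.normal(size=n)

# Normal equation: solve (X^T X) beta = X^T y
beta_normal = np.linalg.solve(X.T @ X, X.T @ y)

# SVD form: beta = V S^{-1} U^T y (thin SVD, so S is p x p)
U, s, Vt = np.linalg.svd(X, full_matrices=False)
beta_svd = Vt.T @ ((U.T @ y) / s)

# Reference: NumPy's least-squares solver
beta_lstsq = np.linalg.lstsq(X, y, rcond=None)[0]

print(beta_normal)
print(beta_svd)
print(beta_lstsq)
```

Dividing by `s` is where small singular values do their damage: any component of `U.T @ y` along a direction with a tiny singular value is amplified by its reciprocal.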

Also, as $\text{X}^T\text{X}$ is a real symmetric matrix, its eigen decomposition will be

$$ \text{X}^T\text{X} = Q\Lambda Q^T $$

where **Q** is a matrix of eigenvectors and $\Lambda$ is a diagonal matrix of eigenvalues. On comparing this with what we got from SVD above, we get $V = Q$ and $S^2 = \Lambda$.

Based on this, we can say that the solution of the normal equation depends on the inverse of the square roots of the eigenvalues of $\text{X}^T\text{X}$, which are the inverse singular values of **X**. For an ill-conditioned problem/matrix, the smallest eigenvalues will be very small (close to zero).
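The relation $S^2 = \Lambda$ is easy to verify. The sketch below uses a random synthetic matrix (its size is arbitrary) and also shows a direct consequence: the condition number of $\text{X}^T\text{X}$ is the square of the condition number of **X**.

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.normal(size=(50, 4))

s = np.linalg.svd(X, compute_uv=False)         # singular values of X, descending
eigvals = np.linalg.eigvalsh(X.T @ X)[::-1]    # eigenvalues of X^T X, reordered to descending

cond_X = np.linalg.cond(X)
cond_gram = np.linalg.cond(X.T @ X)

print("squared singular values:", s**2)
print("eigenvalues of X^T X:   ", eigvals)
print(f"cond(X)^2 = {cond_X**2:.4f}, cond(X^T X) = {cond_gram:.4f}")
```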

This is why Python libraries such as statsmodels warn us about small eigenvalues (squares of the singular values) if we fit OLS to data with multicollinearity.

## Detection of multicollinearity

Now that we know the effects of multicollinearity on our regression analyses, how can we tell if multicollinearity exists in our data?

Some common signs during analysis that indicate the presence of multicollinearity:
- The estimates of the coefficients vary excessively from model to model.
- High $R^2$ with low t-statistics.
- The t-tests for the individual slopes are non-significant (P > 0.05), but the overall F-test of the hypothesis that all slopes are simultaneously 0 is significant (P < 0.05).

We can also look at the pairwise correlations of our independent variables. Large correlations among pairs may indicate multicollinearity. Although correlation can be helpful, it is limited. If an independent variable depends on multiple other independent variables ($x_3 = \alpha_1 x_1 + \alpha_2 x_2 + error$), no single pairwise correlation may be high. Therefore, it is better to use other methods to detect multicollinearity.
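To illustrate the limitation, here is a sketch with a hypothetical fifth predictor `x5` built as the sum of four independent predictors plus a little noise (all names and sizes are assumptions for this example). Each pairwise correlation with `x5` stays moderate, yet `x5` is almost perfectly explained by the other four together.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 500
Z = rng.normal(size=(n, 4))                       # four independent predictors
x5 = Z.sum(axis=1) + 0.05 * rng.normal(size=n)    # depends on all four at once
X = np.column_stack([Z, x5])

# Pairwise correlations of x5 with each of the other predictors
corr = np.corrcoef(X, rowvar=False)
max_pairwise = np.max(np.abs(corr[4, :4]))

# R^2 from regressing x5 on the other four (with an intercept)
A = np.column_stack([np.ones(n), Z])
coef, *_ = np.linalg.lstsq(A, x5, rcond=None)
resid = x5 - A @ coef
r2 = 1.0 - resid.var() / x5.var()

print(f"largest pairwise correlation with x5: {max_pairwise:.3f}")
print(f"R^2 of x5 on the other predictors:    {r2:.4f}")
```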

### Variance Inflation Factor (VIF)

If an independent variable is linearly dependent on multiple other independent variables, it will look like

$$ x_i = \alpha_0 + \sum_{j \neq i} \alpha_j x_j + \epsilon \quad j \in [1 \dots p] $$

This is the same as fitting a linear model of $x_i$ on the remaining $x_j$.

What we can do is build a linear model for each independent variable in our data using the remaining independent variables and calculate $R^2$ for each of these models. As we know, the variances of the estimated coefficients are inflated when multicollinearity exists. We can quantify how much the variance is inflated by using the VIF. The VIF for $\hat\beta_i$, denoted $\text{VIF}_i$, is the factor by which the variance of $\hat\beta_i$ is "inflated" due to the existence of correlation among the predictor variables in the model.

$$ \text{VIF}_i = \frac{1}{1 - R^2_i} $$

Considering the range of $R^2$ ($0 \le R^2 \le 1$), $R^2_i$ = 0 (complete absence of multicollinearity) minimizes the variance of $\hat\beta_i$, while $R^2_i$ = 1 (exact multicollinearity) makes this variance infinite.

Although the VIF helps in determining the presence of multicollinearity, it cannot detect the explanatory variables causing the multicollinearity. To put it simply, if for an independent variable $x_1$, VIF is high, this tells you $x_1$ is suffering from multicollinearity. But it does not tell you which other independent variables $x_1$ is related to.
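The VIF computation can be sketched with plain NumPy by running the auxiliary regressions described above. The helper `vif` and the synthetic predictors are assumptions for illustration, not a library API.

```python
import numpy as np

def vif(X):
    """VIF for each column of X via auxiliary regressions on the other columns."""
    n, p = X.shape
    out = np.empty(p)
    for i in range(p):
        target = X[:, i]
        others = np.delete(X, i, axis=1)
        A = np.column_stack([np.ones(n), others])     # include an intercept
        coef, *_ = np.linalg.lstsq(A, target, rcond=None)
        resid = target - A @ coef
        r2 = 1.0 - (resid @ resid) / np.sum((target - target.mean()) ** 2)
        out[i] = 1.0 / (1.0 - r2)
    return out

rng = np.random.default_rng(6)
n = 300
x1 = rng.normal(size=n)
x2 = x1 + 0.3 * rng.normal(size=n)    # strongly related to x1
x3 = rng.normal(size=n)               # unrelated to the others
X = np.column_stack([x1, x2, x3])

vifs = vif(X)
print("VIF per predictor:", np.round(vifs, 2))

# Equivalent shortcut: the VIFs are the diagonal of the inverse correlation matrix.
vifs_from_corr = np.diag(np.linalg.inv(np.corrcoef(X, rowvar=False)))
print("from inverse correlation matrix:", np.round(vifs_from_corr, 2))
```

Both `x1` and `x2` get a high VIF, which flags them individually but does not say that they are related to each other.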

### Condition Number

We have already seen above how we can use condition number to identify if multicollinearity is present or not by checking whether the coefficient estimation is well-conditioned or ill-conditioned.

### Variance Decomposition Proportion (VDP)

The multicollinearity detection methods we have seen tell us that multicollinearity exists and which independent variable is suffering from it. However, it would be nice if we could find which independent variables are collinear with each other. We know that multicollinearity inflates the variance of the regression coefficients, and the Variance Decomposition Proportion (VDP) tells us how much each eigenvalue contributes to the variance of each regression coefficient.

Each explanatory variable has one variance decomposition proportion per condition index. If two or more variance decomposition proportions corresponding to a high condition index exceed 80% to 90%, multicollinearity is deemed to be present between the independent variables with those proportions.

Let us understand this with a step-by-step example.

Suppose we have three explanatory variables: $x_1$, $x_2$ and $x_3$ and they have a correlation matrix **R** (standardized variables).

**Correlation Matrix (R):**
$$
R = \begin{bmatrix}
1.0 & 0.9 & 0.7 \\
0.9 & 1.0 & 0.6 \\
0.7 & 0.6 & 1.0
\end{bmatrix}
$$

**Eigenvalues:**
- $\lambda_1 = 2.4741$
- $\lambda_2 = 0.0888$
- $\lambda_3 = 0.4372$

**Eigenvectors (columns correspond to eigenvalues):**
$$
V = \begin{bmatrix}
0.6108 & 0.7523 & -0.2468 \\
0.5885 & -0.6400 & -0.4941 \\
0.5296 & -0.1565 & 0.8337
\end{bmatrix}
$$

**Condition Indices:**
- Condition Index 1: $1.0000$
- Condition Index 2: $5.2796$
- Condition Index 3: $2.3789$

**Step-by-Step Variance Decomposition Proportions (VDPs):**

**Variable X1:**
- Numerator for $\lambda_1 = \frac{(0.6108)^2}{2.4741} = 0.1508$
- Numerator for $\lambda_2 = \frac{(0.7523)^2}{0.0888} = 6.3765$
- Numerator for $\lambda_3 = \frac{(-0.2468)^2}{0.4372} = 0.1393$
- Denominator: $6.6667$
- VDP at Condition Index 1: $\frac{0.1508}{6.6667} = 0.0226$
- VDP at Condition Index 2: $\frac{6.3765}{6.6667} = 0.9565$
- VDP at Condition Index 3: $\frac{0.1393}{6.6667} = 0.0209$
- Sum of VDPs: $1.0000$

**Variable X2:**
- Numerator for $\lambda_1 = \frac{(0.5885)^2}{2.4741} = 0.1400$
- Numerator for $\lambda_2 = \frac{(-0.6400)^2}{0.0888} = 4.6142$
- Numerator for $\lambda_3 = \frac{(-0.4941)^2}{0.4372} = 0.5583$
- Denominator: $5.3125$
- VDP at Condition Index 1: $\frac{0.1400}{5.3125} = 0.0264$
- VDP at Condition Index 2: $\frac{4.6142}{5.3125} = 0.8685$
- VDP at Condition Index 3: $\frac{0.5583}{5.3125} = 0.1051$
- Sum of VDPs: $1.0000$

**Variable X3:**
- Numerator for $\lambda_1 = \frac{(0.5296)^2}{2.4741} = 0.1134$
- Numerator for $\lambda_2 = \frac{(-0.1565)^2}{0.0888} = 0.2761$
- Numerator for $\lambda_3 = \frac{(0.8337)^2}{0.4372} = 1.5897$
- Denominator: $1.9792$
- VDP at Condition Index 1: $\frac{0.1134}{1.9792} = 0.0573$
- VDP at Condition Index 2: $\frac{0.2761}{1.9792} = 0.1395$
- VDP at Condition Index 3: $\frac{1.5897}{1.9792} = 0.8032$
- Sum of VDPs: $1.0000$

**Final Variance Decomposition Proportions Table:**
$$
\begin{array}{c|ccc}
\text{Condition Index} & x_1 & x_2 & x_3 \\
\hline
1 & 0.0226 & 0.0264 & 0.0573 \\
2 & 0.9565 & 0.8685 & 0.1395 \\
3 & 0.0209 & 0.1051 & 0.8032 \\
\end{array}
$$

Condition Index 2, the largest of the three (5.2796), has high VDPs for $x_1$ and $x_2$ (95.6% and 86.9%). This suggests that $x_1$ and $x_2$ are collinear with each other and account for most of the multicollinearity in this data.
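The steps above can be sketched in a few lines of NumPy, starting from the same correlation matrix **R**. Variable names such as `phi` and `vdp` are chosen here for illustration. Note that `np.linalg.eigh` returns eigenvalues in ascending order and may flip eigenvector signs, so the rows can come out in a different order than in the table; the VDPs themselves should match up to rounding.

```python
import numpy as np

R = np.array([[1.0, 0.9, 0.7],
              [0.9, 1.0, 0.6],
              [0.7, 0.6, 1.0]])

# Eigenvalues (ascending) and eigenvectors (as columns) of the correlation matrix
eigvals, eigvecs = np.linalg.eigh(R)

# One condition index per eigenvalue: sqrt(lambda_max / lambda_k)
cond_idx = np.sqrt(eigvals.max() / eigvals)

# phi[j, k] = v_jk^2 / lambda_k is the "numerator" for variable j and eigenvalue k
phi = eigvecs**2 / eigvals
denominator = phi.sum(axis=1)           # one denominator per variable
vdp = phi / denominator[:, None]        # rows: variables, columns: eigenvalues

for k in np.argsort(cond_idx):          # print from lowest to highest condition index
    print(f"condition index {cond_idx[k]:.4f}: VDPs {np.round(vdp[:, k], 4)}")
```

The per-variable denominators equal the diagonal of $R^{-1}$, which is the VIF of each standardized variable, so the VDPs can be read as a split of each VIF across the eigenvalues.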

For more details on multicollinearity, check out *[Multicollinearity and misleading statistical results](https://ekja.org/upload/pdf/kja-19087.pdf)* by Jong Hae Kim.

## Treatment of Multicollinearity

Now that we understand what multicollinearity is and how to detect it, the next question is: how can we address it? There are a few ways we can address this problem:

### Dropping Predictors

A straightforward idea might be to simply drop one of the predictors if it is highly correlated with another. However, this is not always the best solution. As we have discussed, multicollinearity can exist even when no single pair of predictors shows a very high correlation. In addition, we generally prefer to retain as much data as possible rather than discard valuable information.

### Dimension Reduction Techniques

A more effective strategy is to use dimension reduction methods that transform the predictor space into a lower-dimensional representation:
- **Principal Component Regression (PCR):** This method uses principal components derived from the predictors as new inputs to the regression model. These components are uncorrelated by design, which directly addresses multicollinearity.
- **Partial Least Squares (PLS) Regression:** Similar to PCR, PLS finds components that capture variance in both the predictors and the response, potentially leading to better predictive performance.
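As a rough illustration of PCR, the sketch below builds the principal components with an SVD of the centered predictors and regresses the response on the first `k` of them. The synthetic data and the choice `k = 2` are assumptions for this example; in practice `k` would be tuned, for instance by cross-validation.

```python
import numpy as np

rng = np.random.default_rng(7)
n = 200
x1 = rng.normal(size=n)
x2 = x1 + 0.01 * rng.normal(size=n)     # nearly collinear with x1
x3 = rng.normal(size=n)
X = np.column_stack([x1, x2, x3])
y = 1.0 + 2.0 * x1 + 1.0 * x3 + 0.1 * rng.normal(size=n)

# Center predictors and response so no intercept column is needed
Xc = X - X.mean(axis=0)
y_mean = y.mean()
yc = y - y_mean

# Principal directions are the rows of Vt
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)

k = 2                                   # number of components kept (tuning choice)
Z = Xc @ Vt[:k].T                       # component scores, mutually orthogonal
gamma = (Z.T @ yc) / s[:k] ** 2         # OLS on the scores, since Z^T Z = diag(s^2)
beta_pcr = Vt[:k].T @ gamma             # coefficients in the original predictor space

y_hat = y_mean + Xc @ beta_pcr
r2 = 1.0 - np.sum((y - y_hat) ** 2) / np.sum((y - y_mean) ** 2)
print("PCR coefficients:", np.round(beta_pcr, 3))
print(f"R^2 with {k} components: {r2:.4f}")
```

Dropping the component with the smallest singular value removes exactly the direction in which the coefficients are poorly determined, at the cost of a small bias.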

### Ridge Regression (L2 Regularization)

Another powerful approach is ridge regression, which mitigates multicollinearity by introducing an $\ell_2$-norm penalty (also called Tikhonov regularization) on the regression coefficients. The ridge regression solution is given by:

$$ \hat\beta_{ridge} = (\alpha \mathit{I} + \text{X}^T\text{X})^{-1}\text{X}^Ty $$

where $\alpha \gt 0$ is a regularization hyperparameter.

Adding $\alpha$ to the diagonal elements of the Gram matrix $\text{X}^T\text{X}$ shifts every eigenvalue up by $\alpha$, which ensures that the matrix is full rank and invertible, even in cases where the number of predictors exceeds the number of observations (p > n). This makes ridge regression particularly useful when dealing with multicollinearity or in low-sample-size scenarios.
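A minimal sketch of the closed-form ridge solution, on synthetic data with more predictors than observations (the helper `ridge`, the sizes and `alpha = 1.0` are illustrative assumptions; no intercept or feature scaling is handled here):

```python
import numpy as np

def ridge(X, y, alpha):
    """Closed-form ridge estimate: solve (alpha * I + X^T X) beta = X^T y."""
    p = X.shape[1]
    return np.linalg.solve(alpha * np.eye(p) + X.T @ X, X.T @ y)

rng = np.random.default_rng(8)
n, p = 10, 20                            # more predictors than observations
X = rng.normal(size=(n, p))
y = rng.normal(size=n)

gram = X.T @ X
alpha = 1.0

rank_gram = np.linalg.matrix_rank(gram)                        # at most n, so singular
rank_ridge = np.linalg.matrix_rank(gram + alpha * np.eye(p))   # full rank p
beta_ridge = ridge(X, y, alpha)

print(f"rank of X^T X:           {rank_gram} (p = {p})")
print(f"rank of alpha*I + X^T X: {rank_ridge}")
print("ridge coefficients are finite:", bool(np.all(np.isfinite(beta_ridge))))
```

Using `np.linalg.solve` instead of forming the inverse explicitly is a common numerical choice; the estimate is the same.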