# Singular Value Decomposition in Statistics

## Dependency

In [1]:
import numpy as np

## Least-Squares

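Let $X = U D V^T$ be the thin singular value decomposition of $X$, where $U$ has orthonormal columns ($U^T U = I$), $D$ is diagonal with positive singular values, and $V$ is orthogonal. This assumes $X$ has full column rank. The least-squares estimate is $\hat{\beta} = (X^T X)^{-1} X^T y$, so the fitted values are

$$
X \hat{\beta} = X (X^T X)^{-1} X^T y
$$
$$
= U D V^T ((U D V^T)^T U D V^T)^{-1} (U D V^T)^T y
$$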
$$
= U D V^T (V D U^T U D V^T)^{-1} V D U^T y
$$
$$
= U D V^T (V D I D V^T)^{-1} V D U^T y
$$
$$
= U D V^T (V D^2 V^T)^{-1} V D U^T y
$$
$$
= U D V^T (V^{-T} D^{-2} V^{-1}) V D U^T y
$$
$$
= U D I D^{-2} I D U^T y
$$
$$
= U D D^{-2} D U^T y
$$
$$
= U I U^T y
$$
$$
= U U^T y
$$
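
The result says that the least-squares fitted values are the projection of $y$ onto the span of the columns of $U$. As a rough numerical check, the sketch below compares the fitted values from the normal equations with $U U^T y$. The random data, the seed, and the variable names are assumptions made for illustration only.

```python
import numpy as np

# Hypothetical random data, used only to check the identity numerically.
rng = np.random.default_rng(0)
N, p = 50, 3
X = rng.normal(size=(N, p))
y = rng.normal(size=N)

# Thin SVD: U is N x p, d holds the p singular values, Vt is V^T (p x p).
U, d, Vt = np.linalg.svd(X, full_matrices=False)

# Fitted values from the normal equations (X^T X) beta = X^T y.
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
fitted_normal = X @ beta_hat

# Fitted values from the SVD form U U^T y.
fitted_svd = U @ (U.T @ y)

print(np.allclose(fitted_normal, fitted_svd))
```

Using `full_matrices=False` matters here: with the full SVD, `U` would be $N \times N$ and `U @ U.T` would be the identity rather than a projection.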

## Ridge Regression

Let $X$ be the centered input matrix, without an intercept column. **Centered** means each entry $x_{ij}$ is replaced by $x_{ij} - \bar{x}_j$, where $\bar{x}_j$ is the mean of feature $j$. With $N$ observations and $p$ features, $X$ is $N \times p$; it omits the column of ones that appears in the **design matrix**.

By applying **singular value decomposition** to $X$,

$$
X = U D V^T
$$
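
A minimal sketch of this setup, assuming hypothetical random inputs: center each column, take the thin SVD with `np.linalg.svd`, and confirm that the factors reproduce $X$ and that $U$ has orthonormal columns.

```python
import numpy as np

# Hypothetical raw inputs with non-zero column means.
rng = np.random.default_rng(1)
N, p = 20, 4
X_raw = rng.normal(loc=5.0, size=(N, p))

# Center each feature by subtracting its own column mean.
X = X_raw - X_raw.mean(axis=0)

# Thin SVD: X = U @ diag(d) @ Vt, with singular values sorted in descending order.
U, d, Vt = np.linalg.svd(X, full_matrices=False)

print(U.shape, d.shape, Vt.shape)
print(np.allclose(X, U @ np.diag(d) @ Vt))   # reconstruction check
print(np.allclose(U.T @ U, np.eye(p)))       # orthonormal columns of U
```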

In ridge regression, $\hat {\beta}^{ridge} = (X^T X + \lambda I)^{-1} X^T y$

$$
X \hat {\beta}^{ridge} = X (X^T X + \lambda I)^{-1} X^T y
$$
$$
= U D V^T ((U D V^T)^T U D V^T + \lambda I)^{-1} (U D V^T)^T y
$$
$$
= U D V^T (V D U^T U D V^T + \lambda I)^{-1} V D U^T y
$$
$$
= U D V^T (V D^2 V^T + \lambda I)^{-1} V D U^T y
$$
$$
= U D V^T (V (D^2 + \lambda I) V^T)^{-1} V D U^T y
$$
$$
= U D V^T V (D^2 + \lambda I)^{-1} V^T V D U^T y
$$
$$
= U D (D^2 + \lambda I)^{-1} D U^T y
$$
$$
= U D^2 (D^2 + \lambda I)^{-1} U^T y
$$

Writing this as a sum of outer products over the columns $u_j$ of $U$, with $d_j$ the $j$-th singular value,

$$
X \hat {\beta}^{ridge} = \sum_{j = 1}^{p} u_j \frac {d_{j}^2} {d_{j}^2 + \lambda} u_{j}^T y
$$
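
The derivation can be checked numerically. The sketch below, on hypothetical random data with an arbitrarily chosen penalty `lam`, computes the ridge fitted values three ways: directly from the closed form, from the matrix SVD form, and from the sum of outer products.

```python
import numpy as np

# Hypothetical data and penalty, chosen only for illustration.
rng = np.random.default_rng(2)
N, p = 50, 3
lam = 10.0
X_raw = rng.normal(size=(N, p))
X = X_raw - X_raw.mean(axis=0)   # centered inputs, no intercept column
y = rng.normal(size=N)

U, d, Vt = np.linalg.svd(X, full_matrices=False)

# 1) Direct closed form: X (X^T X + lam I)^{-1} X^T y.
beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)
fitted_direct = X @ beta_ridge

# 2) SVD form: U diag(d^2 / (d^2 + lam)) U^T y.
shrink = d**2 / (d**2 + lam)
fitted_svd = U @ (shrink * (U.T @ y))

# 3) Sum of outer products: sum_j u_j * shrink_j * (u_j^T y).
fitted_sum = sum(shrink[j] * U[:, j] * (U[:, j] @ y) for j in range(p))

print(np.allclose(fitted_direct, fitted_svd))
print(np.allclose(fitted_direct, fitted_sum))
```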

Because $\lambda \ge 0$ in ridge regression,

$$
\frac {d_{j}^2} {d_{j}^2 + \lambda} \le 1
$$

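So the effect of $\lambda$ in ridge regression is to shrink the coordinate of $y$ along each left **singular vector** $u_j$ by the factor $\frac {d_{j}^2} {d_{j}^2 + \lambda}$. The **singular values** indicate the importance of each singular vector: for centered $X$, a larger $d_j$ means the corresponding direction carries more of the variance of the inputs.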


**Ridge regression shrinks the directions with the smallest singular values the most. The less important a principal component is, the more strongly ridge regression shrinks it.**

For example, suppose the ridge penalty is $\lambda = 100$, the first singular value is $d_1 = 100$, and the second singular value is $d_2 = 10$. Substituting these into the shrinkage factor $\frac {d_{j}^2} {d_{j}^2 + \lambda}$ gives

$$
\lambda = 100
$$
$$
d_1 = 100 \quad \frac {d_{1}^2} {d_{1}^2 + \lambda} = \frac {10000} {10100} \approx 0.99
$$
$$
d_2 = 10 \quad \frac {d_{2}^2} {d_{2}^2 + \lambda} = \frac {100} {200} = 0.5
$$

In [8]:
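lambda_ = 100  # ridge penalty
d1 = 100       # first singular value
d2 = 10        # second singular value

def compute_ratio(singular_value, lambda_):
    """Return the ridge shrinkage factor d^2 / (d^2 + lambda)."""
    return singular_value**2 / (singular_value**2 + lambda_)

r1 = compute_ratio(d1, lambda_)
r2 = compute_ratio(d2, lambda_)

print(f'{r1:.3f}')
print(f'{r2:.3f}')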

0.990
0.500
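
To see how the gap between the two directions changes with the penalty, the same factor can be tabulated over several values of $\lambda$. This is only a sketch: the grid `lambdas` is an arbitrary choice for illustration, and the singular values are the ones from the example above.

```python
import numpy as np

d = np.array([100.0, 10.0])                      # singular values d_1, d_2
lambdas = np.array([0.0, 1.0, 100.0, 10000.0])   # arbitrary penalty grid

# Broadcasting gives a 2 x 4 table: one row per singular value, one column per lambda.
factors = d[:, None] ** 2 / (d[:, None] ** 2 + lambdas[None, :])

for lam, column in zip(lambdas, factors.T):
    print(f'lambda={lam:g}: d1 factor={column[0]:.3f}, d2 factor={column[1]:.3f}')
```

With $\lambda = 0$ both factors equal 1, which recovers least squares. For every positive $\lambda$ the factor for $d_2$ is below the factor for $d_1$.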
