Sascha Spors,
Professorship Signal Theory and Digital Signal Processing,
Institute of Communications Engineering (INT),
Faculty of Computer Science and Electrical Engineering (IEF),
University of Rostock,
Germany

# Data Driven Audio Signal Processing - A Tutorial with Computational Examples

Winter Semester 2022/23 (Master Course #24512)

- lecture: https://github.com/spatialaudio/data-driven-audio-signal-processing-lecture
- tutorial: https://github.com/spatialaudio/data-driven-audio-signal-processing-exercise

Feel free to contact lecturer frank.schultz@uni-rostock.de

# Exercise 7: Least Squares Solution / Left Inverse in SVD and QR Domain

In [None]:
import numpy as np
import matplotlib.pyplot as plt
from scipy.linalg import svd, diagsvd, qr, inv, pinv, norm
from numpy.linalg import matrix_rank
np.set_printoptions(precision=2, floatmode='fixed', suppress=True)

matplotlib_widget_flag = False

For the given full-column rank feature matrix $\mathbf{X}$ (tall/thin shape with independent columns with $M \times N$ dimension, $M$ samples, $N$ features) and outcome vector $\mathbf{y}$ the linear set of equations

$$\mathbf{X} {\beta} = \mathbf{y}$$

is to be solved for unknowns ${\beta}$.
We obviously cannot invert the non-square matrix $\mathbf{X}$, so we must find an optimum row space estimate $\hat{\beta}$.

### Least Squares Solution

great material, strong recommendation:
- Gilbert Strang (2020): "Linear Algebra for Everyone", Wellesley-Cambridge Press, Ch. 4.3
- Gilbert Strang (2019): "Linear Algebra and Learning from Data", Wellesley-Cambridge Press, Ch. II.2

We know for sure, that **pure row space** $\hat{\beta}$ maps to **pure column space** $\hat{\mathbf{y}}$ for the linear combination

$$\mathbf{X} \hat{\beta} = \hat{\mathbf{y}}$$

The given (measured) outcome vector $\mathbf{y}$ might not necessarily (in practical problems probably never!) live in pure column space of $\mathbf{X}$.

We therefore need an offset (error) vector to get there

$$\mathbf{X} \hat{\beta} + \mathbf{e} = \mathbf{y}$$

We want to find the $\hat{\beta}$ which yields the **smallest** $||\mathbf{e}||_2^2$ or equivalently $||\mathbf{e}||_2$.
This is our optimization criterion, known as **least squares**, because the sum of squares $||\mathbf{e}||_2^2 = e_1^2 + e_2^2 + ... + e_M^2$ should be minimized.

So, thinking in vector addition

$$\mathbf{y} = \hat{\mathbf{y}} + \mathbf{e} \rightarrow \mathbf{e} = \mathbf{y} - \hat{\mathbf{y}}$$

we can geometrically figure (we should imagine and/or draw this in 2D / 3D) that the smallest $||\mathbf{e}||_2^2$ is achieved when we span a **right-angled triangle** using $\mathbf{y}$ as hypotenuse.

Therefore, $\hat{\mathbf{y}} \perp \mathbf{e}$.
Recall, that $\hat{\mathbf{y}}$ lives in column space of $\mathbf{X}$, that's where we started from.
Column space is orthogonal to left null space, so $\hat{\mathbf{y}} \perp \mathbf{e}$ tells us, that $\mathbf{e}$ must live in left null space of $\mathbf{X}$.
This requirement can be formally written as $\mathbf{X}^\mathrm{H} \mathbf{e} = \mathbf{0}$ and then can be utilized as

$$\mathbf{X} \hat{\beta} + \mathbf{e} = \mathbf{y} \qquad\rightarrow\qquad \mathbf{X}^\mathrm{H}\mathbf{X} \hat{\beta} + \mathbf{X}^\mathrm{H}\mathbf{e} = \mathbf{X}^\mathrm{H}\mathbf{y} \qquad\rightarrow\qquad \mathbf{X}^\mathrm{H}\mathbf{X} \hat{\beta} = \mathbf{X}^\mathrm{H}\mathbf{y}$$

The last equation in the above line is known as normal equation(s).

This can be solved using the inverse of
$$\mathbf{X}^\mathrm{H} \mathbf{X}$$
Since $\mathbf{X}$ has full column rank, this square $N \times N$ matrix is full rank and therefore invertible.
Hence, multiplying both sides of the normal equation(s) from the left by this inverse yields

$$(\mathbf{X}^\mathrm{H} \mathbf{X})^{-1} (\mathbf{X}^\mathrm{H} \mathbf{X}) \hat{\beta} = (\mathbf{X}^\mathrm{H} \mathbf{X})^{-1} \mathbf{X}^\mathrm{H} \mathbf{y}$$

Since $(\mathbf{X}^\mathrm{H} \mathbf{X})^{-1} (\mathbf{X}^\mathrm{H} \mathbf{X}) = \mathbf{I}$ holds, we get the least-squares solution for $\beta$ in the row space of $\mathbf{X}$

$$\hat{\beta} = (\mathbf{X}^\mathrm{H} \mathbf{X})^{-1} \mathbf{X}^\mathrm{H} \mathbf{y}$$

using the **left inverse** of $\mathbf{X}$ as

$$\mathbf{X}^{+L} = (\mathbf{X}^\mathrm{H} \mathbf{X})^{-1} \mathbf{X}^\mathrm{H}$$

So, the optimum estimator for our unknowns (i.e. the model parameters) is

$$\hat{\beta} = \mathbf{X}^{+L} \mathbf{y}$$
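
The derivation above can be checked numerically. The following is a minimal sketch, assuming a small random, real-valued feature matrix; the sizes `M`, `N`, the seed and all variable names are chosen here for illustration only. It solves the normal equations and checks that the residual $\mathbf{e}$ is orthogonal to all columns of $\mathbf{X}$ and to $\hat{\mathbf{y}}$.

```python
import numpy as np

rng = np.random.default_rng(0)  # seed chosen arbitrarily for this sketch
M, N = 20, 3  # illustrative sizes: M samples, N features
X = rng.normal(size=(M, N))  # tall/thin, full column rank with probability 1
y = rng.normal(size=M)  # generic outcome, not in the column space of X

# solve the normal equations X^H X betah = X^H y
# (np.linalg.solve avoids forming the inverse explicitly)
betah = np.linalg.solve(X.conj().T @ X, X.conj().T @ y)

yh = X @ betah  # part of y that lives in the column space of X
e = y - yh  # residual, expected to live in the left null space of X

print('X^H e =', X.conj().T @ e)  # numerically the zero vector
print('yh^H e =', yh.conj() @ e)  # numerically zero, i.e. right angle
# Pythagoras for the right-angled triangle with hypotenuse y
print('||y||^2 =', np.linalg.norm(y)**2)
print('||yh||^2 + ||e||^2 =', np.linalg.norm(yh)**2 + np.linalg.norm(e)**2)
```

The last two printed values should agree up to rounding, which is the right-angled triangle argument in numbers.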

If we let matrix $\mathbf{X}$ act on $\hat{\beta}$,

$$\mathbf{X} \hat{\beta} = \mathbf{X}\mathbf{X}^{+L} \mathbf{y} = \hat{\mathbf{y}}$$

we land in pure column space as initially started.
Thus,

$$\mathbf{P}_\mathrm{col space} = \mathbf{X} \mathbf{X}^{+L}$$

is a **projection matrix**, that projects any $\mathbf{y}$ into the column space of $\mathbf{X}$ creating $\hat{\mathbf{y}}$, thus often referred to as the **hat matrix**.

We can define another projection matrix

$$\mathbf{P}_\mathrm{row space} = \mathbf{X}^{+L} \cdot \mathbf{X},$$
which maps any $\beta$ to row space of $\mathbf{X}$.
From above we already know

$$\mathbf{P}_\mathrm{row space} = (\mathbf{X}^\mathrm{H} \mathbf{X})^{-1} \mathbf{X}^\mathrm{H} \cdot \mathbf{X} = \mathbf{I}$$

which makes sense, as the tall/thin, full column rank matrix does not span a null space, other than the $\mathbf{0}$ vector.
So, all possible $\beta$ already live in the row space and must be projected exactly onto themselves.
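
The properties of both projection matrices can be verified with a few lines. This is a sketch under the assumption of a small random, real-valued $\mathbf{X}$ (sizes, seed and variable names are illustrative and not part of the exercise data). A projection matrix is expected to be idempotent, and for an orthogonal projection also Hermitian; its trace equals the dimension of the space it projects onto.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed for this sketch
M, N = 6, 2  # illustrative sizes
X = rng.normal(size=(M, N))  # full column rank with probability 1

Xli = np.linalg.inv(X.conj().T @ X) @ X.conj().T  # left inverse, N x M
P_col = X @ Xli  # M x M hat matrix, projects onto column space
P_row = Xli @ X  # N x N, projects onto row space

print('P_col idempotent?', np.allclose(P_col @ P_col, P_col))
print('P_col Hermitian?', np.allclose(P_col, P_col.conj().T))
print('trace(P_col) =', np.trace(P_col), '== N =', N)
print('P_row == I?', np.allclose(P_row, np.eye(N)))
```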

### Least Squares Solution in SVD Domain

great material, strong recommendation:
- Gilbert Strang (2019): "Linear Algebra and Learning from Data", Wellesley-Cambridge Press, p.125ff

The left inverse 

$$\mathbf{X}^{+L} = (\mathbf{X}^\mathrm{H} \mathbf{X})^{-1} \mathbf{X}^\mathrm{H}$$

in terms of SVD

$$\mathbf{X}^{+L} = ((\mathbf{U}\mathbf{S}\mathbf{V}^\mathrm{H})^\mathrm{H} \mathbf{U}\mathbf{S}\mathbf{V}^\mathrm{H})^{-1} (\mathbf{U}\mathbf{S}\mathbf{V}^\mathrm{H})^\mathrm{H}$$

$$\mathbf{X}^{+L} = (\mathbf{V}\mathbf{S}^\mathrm{H}\mathbf{S}\mathbf{V}^\mathrm{H})^{-1} (\mathbf{V}\mathbf{S}^\mathrm{H}\mathbf{U}^\mathrm{H})$$

$$\mathbf{X}^{+L} = \mathbf{V} (\mathbf{S}^\mathrm{H}\mathbf{S})^{-1} \mathbf{V}^\mathrm{H} \mathbf{V}\mathbf{S}^\mathrm{H}\mathbf{U}^\mathrm{H}$$

$$\mathbf{X}^{+L} = \mathbf{V} (\mathbf{S}^\mathrm{H}\mathbf{S})^{-1} \mathbf{S}^\mathrm{H}\mathbf{U}^\mathrm{H}$$

$$\mathbf{X}^{+L} = \mathbf{V} \mathbf{S}^\mathrm{+L} \mathbf{U}^\mathrm{H}$$

allows for a convenient discussion, how singular values act when **mapping column space back to row space**.

Considering only one singular value $\sigma_i$ and its corresponding left/right singular vectors, the left inverse $\mathbf{S}^\mathrm{+L} = (\mathbf{S}^\mathrm{H}\mathbf{S})^{-1} \mathbf{S}^\mathrm{H}$ reduces to

$$\frac{\sigma_i}{\sigma_i^2} = \frac{1}{\sigma_i}$$

For **very, very small** $\sigma_i$, the **inversion** thus leads to **huge values**, which might not be meaningful, as this (these) weighted $\mathbf{v}_i$ vector(s) then dominate(s) the row space solution. Small changes in $\sigma_i$ (i.e. small changes in a badly conditioned feature matrix) then lead to comparably large changes in the row space solution $\hat{\beta}$. So-called ridge regression (aka Tikhonov regularization) is a straightforward workaround for ill-conditioned matrices, see below.
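
The amplification by $1/\sigma_i$ can be made tangible with a constructed example. The sketch below assumes an economy-size SVD with hand-picked singular values (one of them very small) and orthonormal bases obtained from QR decompositions of random matrices; all names and numbers are illustrative. A small perturbation of $\mathbf{y}$ along the left singular vector that belongs to the small singular value is magnified by the factor $1/\sigma_i$ in $\hat{\beta}$.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed for this sketch
M, N = 50, 3  # illustrative sizes
# orthonormal bases from economy-size QR of random matrices
U, _ = np.linalg.qr(rng.normal(size=(M, N)))  # M x N, orthonormal columns
V, _ = np.linalg.qr(rng.normal(size=(N, N)))  # N x N, orthogonal
s = np.array([1.0, 0.5, 1e-6])  # last singular value is very small
X = U @ np.diag(s) @ V.T  # badly conditioned feature matrix

Xli = V @ np.diag(1 / s) @ U.T  # left inverse V S^{+L} U^H

y = U @ np.array([1.0, 1.0, 0.0])  # outcome in the column space
dy = 1e-3 * U[:, 2]  # small perturbation along the third left singular vector

betah = Xli @ y
betah_pert = Xli @ (y + dy)

print('sigma_max / sigma_min =', s[0] / s[-1])
print('||dy||_2 =', np.linalg.norm(dy))
print('||betah_pert - betah||_2 =', np.linalg.norm(betah_pert - betah))
```

In this constructed case the change in $\hat{\beta}$ is about $10^{6}$ times larger than the change in $\mathbf{y}$, which is exactly the factor $1/\sigma_3$.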

### Least Squares Solution in QR Domain

great material, strong recommendation:
- Gilbert Strang (2020): "Linear Algebra for Everyone", Wellesley-Cambridge Press, p.170ff
- Gilbert Strang (2019): "Linear Algebra and Learning from Data", Wellesley-Cambridge Press, p.128ff

The normal equation(s)

$$\mathbf{X}^\mathrm{H}\mathbf{X} \hat{\beta} = \mathbf{X}^\mathrm{H}\mathbf{y}$$

can be conveniently given as QR decomposition (recall $\mathbf{Q}^\mathrm{H} \mathbf{Q}=\mathbf{I}$ due to Gram-Schmidt orthonormalization)

$$(\mathbf{Q R})^\mathrm{H}\mathbf{Q R} \hat{\beta} = (\mathbf{Q R})^\mathrm{H}\mathbf{y}$$

$$\mathbf{R}^\mathrm{H} \mathbf{Q}^\mathrm{H} \mathbf{Q R} \hat{\beta} = (\mathbf{Q R})^\mathrm{H}\mathbf{y}$$

$$\mathbf{R}^\mathrm{H} \mathbf{R} \hat{\beta} = \mathbf{R}^\mathrm{H} \mathbf{Q}^\mathrm{H} \mathbf{y}$$

$$\mathbf{R} \hat{\beta} = \mathbf{Q}^\mathrm{H} \mathbf{y}$$

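We should not expect that algorithms solve

$$\hat{\beta} = \mathbf{R}^{+L} \mathbf{Q}^\mathrm{H} \mathbf{y}$$

by explicitly forming the left inverse $\mathbf{R}^{+L}$ of the upper triangular matrix $\mathbf{R}$, and we should not do this for non-toy examples either. Instead, $\mathbf{R} \hat{\beta} = \mathbf{Q}^\mathrm{H} \mathbf{y}$ is typically solved by back substitution.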
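
As a sketch of how $\mathbf{R} \hat{\beta} = \mathbf{Q}^\mathrm{H} \mathbf{y}$ might be solved without any explicit inverse, the following example uses the economy-size QR decomposition (so that $\mathbf{R}$ is square, $N \times N$) and back substitution via `scipy.linalg.solve_triangular`. The random data, sizes and variable names are assumptions made for this illustration; the result is compared against the normal equation solution.

```python
import numpy as np
from scipy.linalg import qr, solve_triangular

rng = np.random.default_rng(0)  # arbitrary seed for this sketch
M, N = 100, 3  # illustrative sizes
X = rng.normal(size=(M, N))
y = rng.normal(size=M)

# economy-size QR: Q is M x N with orthonormal columns,
# R is N x N upper triangular
Q, R = qr(X, mode='economic')

# solve R betah = Q^H y by back substitution, no inverse is formed
betah_qr = solve_triangular(R, Q.conj().T @ y, lower=False)

# reference: left inverse from the normal equations (toy sizes only)
betah_ne = np.linalg.inv(X.conj().T @ X) @ X.conj().T @ y

print('R shape:', R.shape)
print('QR back substitution == normal equations?',
      np.allclose(betah_qr, betah_ne))
```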

In [None]:
rng = np.random.default_rng(1)
mean, stdev = 0, 0.01
M = 100
N = 3
X = rng.normal(mean, stdev, [M, N])
print('rank =', matrix_rank(X), '== number of cols =', N)

In [None]:
if matplotlib_widget_flag:
    %matplotlib widget

In [None]:
[Q, R] = qr(X)
[U, s, Vh] = svd(X)
print('sing vals', s)
V = Vh.conj().T

# scipy function
Xli_pinv = pinv(X)

# manual normal equation solver
Xli_man = inv(X.conj().T @ X) @ X.conj().T

# SVD
Si = diagsvd(1/s, N, M)  # works if array s has only non-zero entries
Xli_svd = V @ Si @ U.conj().T

# QR
Xli_qr = pinv(R) @ Q.conj().T

print('pinv == inverse via normal eq?', np.allclose(Xli_pinv, Xli_man))
print('pinv == inverse via SVD?', np.allclose(Xli_pinv, Xli_svd))
print('pinv == inverse via QR?', np.allclose(Xli_pinv, Xli_qr))

In [None]:
# create y from one column space entry and one left null space entry
# note that we use unit length vectors for convenience: ||e||_2^2 = 1
yh = U[:, 0]  # choose one of col space
e = U[:, N]  # assuming rank N -> we choose the first vector of left null space
y = yh + e

# find betah in the row space
betah = Xli_pinv @ y  # only yh gets mapped back to row space
# this is our LS solution betah

print(Xli_pinv @ e)  # e is mapped to zero vec
print(Xli_pinv @ y)
print(Xli_pinv @ yh)
print(betah)

In [None]:
print('norm(X @ betah - yh, 2) == 0 -> ', norm(X @ betah - yh, 2))  # == 0

print('||e||_2^2:')
print(norm(X @ betah - y, 2)**2)
print(norm(e, 2)**2)

### Ridge Regression / Regularization in SVD Domain

The minimization problem

$$\mathrm{min}_\beta \left(||\mathbf{X} \beta - \mathbf{y}||_2^2 + \alpha^2 ||\beta||_2^2 \right)$$

is known as ridge regression, aka Tikhonov regularization, aka regression with penalty on $||\beta||_2^2$ using the ridge coefficient $\alpha^2$ (this is a hyperparameter in model training, often the variable $\lambda = \alpha^2$ is used).

In the limit $\alpha^2=0$ this minimization problem is identical to the standard least squares problem above.

The analytical solution is well known as

$$\hat{\beta}^\mathrm{Ridge} = \mathbf{X}^{+\mathrm{L,Ridge}}  \mathbf{y}$$

with the ridge left inverse matrix given in SVD fashion 

$$\mathbf{X}^{+\mathrm{L,Ridge}} = \mathbf{V} \left((\mathbf{S}^\mathrm{H}\mathbf{S} + \alpha^2 \mathbf{I})^{-1} \mathbf{S}^\mathrm{H}\right) \mathbf{U}^\mathrm{H}$$

Considering only one singular value $\sigma_i$ and its corresponding left/right singular vectors, the left inverse $\mathbf{S}^\mathrm{+L,Ridge} = \left((\mathbf{S}^\mathrm{H}\mathbf{S} + \alpha^2 \mathbf{I})^{-1} \mathbf{S}^\mathrm{H}\right)$ reduces to

$$\frac{\sigma_i}{\sigma_i^2 + \alpha^2},$$

which can be discussed conveniently with the plot below.

In [None]:
alpha = 1/10
lmb = alpha**2

singval = np.logspace(-4, 4, 2**6)
# ridge regression
inv_singval = singval / (singval**2 + alpha**2)

plt.plot(singval, 1 / singval, label='no penalty')
plt.plot(singval, inv_singval, label='penalty')
plt.xscale('log')
plt.yscale('log')
plt.xticks(10.**np.arange(-4, 5))
plt.yticks(10.**np.arange(-4, 5))
plt.axis('equal')
plt.xlabel(r'$\sigma_i$')
plt.ylabel(r'$\sigma_i \,\,\,/\,\,\, (\sigma_i^2 + \alpha^2)$')
plt.title(r'ridge penalty $\alpha =$'+str(alpha))
plt.legend()
plt.grid()
print('alpha =', alpha, 'alpha^2 = lambda =', lmb)
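
The SVD form of the ridge left inverse can be cross-checked numerically. The sketch below assumes random, real-valued data and an economy-size SVD; the helper `ridge_left_inverse` is a hypothetical name introduced here. It compares the SVD form with the matrix form $(\mathbf{X}^\mathrm{H}\mathbf{X} + \alpha^2 \mathbf{I})^{-1} \mathbf{X}^\mathrm{H} \mathbf{y}$, which is not stated in the text above but follows from inserting the SVD of $\mathbf{X}$, and it checks that $\alpha = 0$ reproduces the plain least squares solution.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed for this sketch
M, N = 100, 3  # illustrative sizes
X = rng.normal(size=(M, N))
y = rng.normal(size=M)
alpha = 1 / 10

# economy-size SVD: U is M x N, s holds N singular values, Vh is N x N
U, s, Vh = np.linalg.svd(X, full_matrices=False)
V = Vh.conj().T


def ridge_left_inverse(a):
    """Hypothetical helper: V diag(sigma_i / (sigma_i^2 + a^2)) U^H."""
    return V @ np.diag(s / (s**2 + a**2)) @ U.conj().T


betah_ridge = ridge_left_inverse(alpha) @ y
# matrix form of the same solution (assumption of this sketch, see lead-in)
betah_ridge_ne = np.linalg.solve(
    X.conj().T @ X + alpha**2 * np.eye(N), X.conj().T @ y)
betah_ls = ridge_left_inverse(0) @ y  # no penalty

print('SVD form == matrix form?', np.allclose(betah_ridge, betah_ridge_ne))
print('alpha = 0 == least squares?',
      np.allclose(betah_ls, np.linalg.pinv(X) @ y))
print('||betah_ridge||_2 =', np.linalg.norm(betah_ridge))
print('||betah_ls||_2 =', np.linalg.norm(betah_ls))
```

Since every factor $\sigma_i / (\sigma_i^2 + \alpha^2)$ is smaller than $1/\sigma_i$ for $\alpha > 0$, the ridge solution has the smaller norm.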

In [None]:
rng = np.random.default_rng(1)
mean, stdev = 0, 10
M, N = 3, 3
A_tmp = rng.normal(mean, stdev, [M, N])
[U_tmp, s_tmp, Vh_tmp] = svd(A_tmp)
V_tmp = Vh_tmp.conj().T
s_tmp = [10, 8, 0.5]  # create sing vals
S_tmp = diagsvd(s_tmp, M, N)

# create full rank square matrix to work with (no nullspaces except 0-vectors!)
A = U_tmp @ S_tmp @ Vh_tmp
[U, s, Vh] = svd(A)
print('A\n', A)
print('rank of A: ', matrix_rank(A))
print('sigma', s)
S = diagsvd(s, M, N)
V = Vh.conj().T

# b as column space linear combination
b = 1*U[:, 0] + 1*U[:, 1] + 1*U[:, 2]

xh = inv(A) @ b
print('xh =', xh, '\nA xh =', A @ xh, '\nb =', b)
# == (because in b all U weighted with unity gain)
print('inverted sigma no penalty: ', 1 / s)
print('||xh||_2 =', norm(xh, 2))
print('norm of vec: inverted sigma no penalty: ', norm(1 / s, 2))

lmb = 2
Sli_ridge = inv(S.conj().T @ S + lmb*np.eye(3)) @ S.conj().T
Ali_ridge = V @ Sli_ridge @ U.conj().T
xh_ridge = Ali_ridge @ b
print('xh_ridge =', xh_ridge, '\nA xh_ridge =', A @ xh_ridge, '\nb = ', b)
# == (because in b all U weighted with unity gain)
print('inverted sigma with penalty: ', s / (s**2 + lmb))
print('||xh_ridge||_2 =', norm(xh_ridge, 2))
print('norm of vec: inverted sigma with penalty: ', norm(s / (s**2 + lmb), 2))

fig1 = plt.figure(figsize=(5, 5))
ax = plt.axes(projection='3d')
w = Vh @ xh
wr = Vh @ xh_ridge
for n in range(3):
    ax.plot([0, w[n]*V[0, n]], [0, w[n]*V[1, n]], [0, w[n]*V[2, n]],
            color='C'+str(n), lw=1, ls=':', label=r'$\hat{x}$@$v_i$, no penalty')
    ax.plot([0, wr[n]*V[0, n]], [0, wr[n]*V[1, n]], [0, wr[n]*V[2, n]],
            color='C'+str(n), lw=3, ls='-', label=r'$\hat{x}$@$v_i$, penalty')

ax.plot([0, xh[0]], [0, xh[1]], [0, xh[2]],
        'black', label=r'$\hat{x}$, no penalty')
ax.plot([0, xh_ridge[0]], [0, xh_ridge[1]], [
        0, xh_ridge[2]], 'C7', label=r'$\hat{x}$, penalty')
ax.set_xlabel(r'$x$')
ax.set_ylabel(r'$y$')
ax.set_zlabel(r'$z$')
lim = 1
ax.set_xlim(-lim, lim)
ax.set_ylim(-lim, lim)
ax.set_zlim(-lim, lim)
ax.set_title('V / row space')
plt.legend()


fig2 = plt.figure(figsize=(5, 5))
ax = plt.axes(projection='3d')
w = Vh @ xh
wr = Vh @ xh_ridge
for n in range(3):
    ax.plot([0, U[0, n]], [0, U[1, n]], [0, U[2, n]],
            color='C'+str(n), lw=2, ls='-', label=r'$u_i$')
    ax.plot([0, s[n]*U[0, n]], [0, s[n]*U[1, n]], [0, s[n]*U[2, n]],
            color='C'+str(n), lw=1, ls=':', label=r'$\sigma_i \cdot u_i$')

ax.plot([0, b[0]], [0, b[1]], [0, b[2]], 'black', lw=1, label=r'$b$')
ax.set_xlabel(r'$x$')
ax.set_ylabel(r'$y$')
ax.set_zlabel(r'$z$')
lim = 5
ax.set_xlim(-lim, lim)
ax.set_ylim(-lim, lim)
ax.set_zlim(-lim, lim)
ax.set_title('U / column space')
plt.legend()

In [None]:
if matplotlib_widget_flag:
    plt.close(fig1)
    plt.close(fig2)

## Copyright

- the notebooks are provided as [Open Educational Resources](https://en.wikipedia.org/wiki/Open_educational_resources)
- the text is licensed under [Creative Commons Attribution 4.0](https://creativecommons.org/licenses/by/4.0/)
- the code of the IPython examples is licensed under the [MIT license](https://opensource.org/licenses/MIT)
- feel free to use the notebooks for your own purposes
- please attribute the work as follows: *Frank Schultz, Data Driven Audio Signal Processing - A Tutorial Featuring Computational Examples, University of Rostock* ideally with relevant file(s), github URL https://github.com/spatialaudio/data-driven-audio-signal-processing-exercise, commit number and/or version tag, year.