Sascha Spors,
Professorship Signal Theory and Digital Signal Processing,
Institute of Communications Engineering (INT),
Faculty of Computer Science and Electrical Engineering (IEF),
University of Rostock,
Germany

# Data Driven Audio Signal Processing - A Tutorial with Computational Examples

Winter Semester 2022/23 (Master Course #24512)

- lecture: https://github.com/spatialaudio/data-driven-audio-signal-processing-lecture
- tutorial: https://github.com/spatialaudio/data-driven-audio-signal-processing-exercise

Feel free to contact lecturer frank.schultz@uni-rostock.de

# Gradient Descent with Momentum

## Analytical Loss Function in 2D

Suppose that the (made-up) loss function of a model with two parameters $\beta_1$ and $\beta_2$ is given analytically as

$$\mathcal{L}(\beta_1, \beta_2) = (\beta_1 - 2)^2 + (\beta_2 - 1)^4 - (\beta_2 -1)^2$$

In this toy example no data dependency is involved, which is not how things work in practice, but it helps to understand the essence of finding a minimum numerically.

In order to find **potential minima**, and thereby the **optimum model parameters** $\hat{\beta}_1$ and $\hat{\beta}_2$, we set the gradient to zero and solve for $\beta_1$ and $\beta_2$

$$\nabla \mathcal{L} = 
\begin{bmatrix}
\frac{\partial \mathcal{L}}{\partial \beta_1}\\
\frac{\partial \mathcal{L}}{\partial \beta_2}
\end{bmatrix}=
\mathbf{0}$$

The required partial derivatives of first order are 

$$\frac{\partial \mathcal{L}}{\partial \beta_1} = 2 (\beta_1 - 2)^1$$

$$\frac{\partial \mathcal{L}}{\partial \beta_2} = 4 (\beta_2 - 1)^3 - 2(\beta_2 -1)^1$$
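
Setting both partial derivatives to zero can be worked out by hand with a short factorization. The following step is added for illustration and uses only the two derivatives given above:

$$2 (\beta_1 - 2) = 0 \quad\Rightarrow\quad \beta_1 = 2$$

$$4 (\beta_2 - 1)^3 - 2(\beta_2 -1) = 2(\beta_2 - 1)\left[2(\beta_2 - 1)^2 - 1\right] = 0 \quad\Rightarrow\quad \beta_2 = 1 \quad\text{or}\quad \beta_2 = 1 \pm \frac{1}{\sqrt{2}}$$

So there are three stationary points, all with $\beta_1 = 2$.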

A check with the Hessian of $\mathcal{L}(\beta_1, \beta_2)$ yields whether we deal with a minimum, maximum, saddle point or neither of them for each of the zero gradient conditions.

We get the **first minimum** at
$$\beta_{1,min1} = 2\qquad \beta_{2,min1} = 1+\frac{1}{\sqrt{2}}$$

We get the **second minimum** at
$$\beta_{1,min2} = 2\qquad \beta_{2,min2} = 1-\frac{1}{\sqrt{2}}$$

Both minima yield the same function value
$$\mathcal{L}(\beta_{1,min}, \beta_{2,min}) = -\frac{1}{4},$$
so we actually deal with **two equally good optimum models**: there is **no unique global minimum**, since both minima share the lowest function value.

We have **one saddle point** that separates the two minima at
$$\beta_{1,saddle} = 2\qquad \beta_{2,saddle} = 1$$

with function value
$$\mathcal{L}(\beta_{1,saddle}, \beta_{2,saddle}) = 0$$
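
The Hessian check mentioned above can be sketched numerically. The Hessian below is derived by hand from the two partial derivatives (its only non-constant entry is $12(\beta_2-1)^2 - 2$), and the helper names `hessian` and `classify` are made up for this illustration:

```python
import numpy as np


def hessian(beta):
    # Hessian of L = (b1-2)^2 + (b2-1)^4 - (b2-1)^2, derived by hand;
    # the mixed partial derivatives are zero
    return np.array([[2.0, 0.0],
                     [0.0, 12.0 * (beta[1] - 1.0)**2 - 2.0]])


def classify(beta):
    # the sign pattern of the Hessian eigenvalues classifies a stationary point
    eigenvalues = np.linalg.eigvalsh(hessian(beta))
    if np.all(eigenvalues > 0):
        return 'minimum'
    if np.all(eigenvalues < 0):
        return 'maximum'
    if np.any(eigenvalues > 0) and np.any(eigenvalues < 0):
        return 'saddle point'
    return 'inconclusive'


stationary_points = {
    'min1': np.array([2.0, 1.0 + 1.0 / np.sqrt(2.0)]),
    'min2': np.array([2.0, 1.0 - 1.0 / np.sqrt(2.0)]),
    'saddle': np.array([2.0, 1.0]),
}
for name, beta in stationary_points.items():
    print(name, classify(beta))
```

At both minima the eigenvalues are positive, whereas at $\beta_2 = 1$ one eigenvalue is negative and one is positive, which is the signature of a saddle point.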

In [None]:
import numpy as np
import matplotlib as mpl
import matplotlib.pyplot as plt
from mpl_toolkits import mplot3d
from scipy.signal import freqz, tf2zpk

matplotlib_widget_flag = False

In [None]:
# analytical solutions from above
minimum1, fminimum1 = np.array([2, 1+1/np.sqrt(2)]), -1/4
minimum2, fminimum2 = np.array([2, 1-1/np.sqrt(2)]), -1/4
saddle, fsaddle = np.array([2, 1]), 0


def get_gradient(beta):
    beta_gradient = np.array([2*(beta[0]-2),
                              4*(beta[1]-1)**3 - 2*(beta[1]-1)])
    return beta_gradient

## Plot the Loss Function

In [None]:
def f(x, y):
    # our loss function from above
    # only with nicer to read variables x,y
    # instead of beta1, beta2
    return (x-2)**2 + (y-1)**4 - (y-1)**2


x, y = np.linspace(0, 4, 2**7), np.linspace(-1/2, 5/2, 2**7)
X, Y = np.meshgrid(x, y)
Z = f(X, Y)

In [None]:
col_min, col_max, no_col = -1/2, 5, 12
col_tick = np.linspace(col_min, col_max, no_col, endpoint=True)
cmap = plt.cm.magma_r
norm = mpl.colors.BoundaryNorm(col_tick, cmap.N)

if matplotlib_widget_flag:
    %matplotlib widget
fig = plt.figure()
ax = plt.axes(projection='3d')
c = ax.plot_surface(X, Y, Z,
                    rstride=1, cstride=1,
                    cmap=cmap, norm=norm,
                    edgecolor='none')
ax.plot(minimum1[0], minimum1[1], fminimum1, 'C0o')
ax.plot(minimum2[0], minimum2[1], fminimum2, 'C0o')
ax.plot(saddle[0], saddle[1], fsaddle, 'C0o')
cbar = fig.colorbar(c, ax=ax,
                    ticks=col_tick[::no_col//10],
                    label=r'$\mathcal{L}$')
ax.set_xlim(x[0], x[-1])
ax.set_ylim(y[0], y[-1])
ax.set_zlim(-1/4, 5)
ax.set_xlabel(r'$\beta_1$')
ax.set_ylabel(r'$\beta_2$')
ax.set_zlabel(r'$\mathcal{L}$')
ax.view_init(elev=60, azim=-40)

## Gradient Descent Update Rule 
We could (and in ML problems often have to) find the minima numerically.
The most straightforward numerical solver is the so-called **gradient descent** (GD), a first-order method.

It uses the (analytically known) gradient $\nabla\mathcal{L}$, evaluates it at the current parameter vector $\beta_{actual}$ and then takes a step in the direction of the negative gradient, i.e. towards the (or rather a?!?) minimum that we want to find.

This iterative procedure can be written as
$$(1):\quad \beta_{new} = \beta_{actual} - \mathrm{step\,size} \cdot \nabla\mathcal{L}\bigg|_{\beta_{actual}}\quad(2):\quad\beta_{new} \rightarrow \beta_{actual}\quad(3):\quad \mathrm{go\,to}\,(1)$$
repeated until we have hopefully converged to the $\beta$ that represents the minimum.

In practice, plain GD is rarely used, as it is not very robust and many things can go wrong. Let us check this with some illustrative examples below.
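
As a minimal, self-contained sketch of the update rule, the loop below iterates steps (1) to (3) on the toy loss function. The start value, the step size and the number of iterations are arbitrary choices for this illustration, and the names `loss` and `gradient` are not part of the notebook:

```python
import numpy as np


def loss(beta):
    # toy loss function from above
    return (beta[0] - 2)**2 + (beta[1] - 1)**4 - (beta[1] - 1)**2


def gradient(beta):
    # analytically known gradient of the toy loss function
    return np.array([2 * (beta[0] - 2),
                     4 * (beta[1] - 1)**3 - 2 * (beta[1] - 1)])


beta = np.array([3.0, 1.5])  # assumed start value
step_size = 0.1              # assumed step size
for _ in range(200):         # assumed number of iterations
    # steps (1) and (2): evaluate the gradient and update the parameters
    beta = beta - step_size * gradient(beta)

print('beta after GD:', beta)
print('loss after GD:', loss(beta))
```

With these particular values the iteration should settle at the first minimum $(2, 1+\frac{1}{\sqrt{2}})$. Other start values or step sizes can end up in the second minimum, at the saddle point, or diverge.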

In [None]:
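def my_gradient_descent(beta, step_size):
    # one plain GD step: move against the gradient, scaled by step_size
    beta_gradient = get_gradient(beta)
    # not in-place, so the caller's array is left untouched and
    # integer start values do not raise a casting error
    beta = beta - beta_gradient * step_size
    return beta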

## Essence of Gradient Descent with Momentum

The actual implementation might look like a fancy invention, but the idea behind *momentum* is simply that we lowpass filter the calculated gradient, such that abrupt direction changes are smoothed out. A simple one-pole lowpass filter, as known from a basic DSP course, does the job for the upcoming toy examples.

In [None]:
def my_gradient_descent_momentum(beta, z, step_size, momentum_coeff):
    beta = beta - step_size * z
    beta_gradient = get_gradient(beta)
    # this is a one pole filter in DSP
    z = (1-momentum_coeff) * beta_gradient + momentum_coeff * z
    return beta, z

The code line 
`z = (1-momentum_coeff) * beta_gradient + momentum_coeff * z`
is a recursive equation, i.e. a difference equation. From DSP we know how to analyze such an LTI system.
With 
$$b_0 = (1-\mathrm{momentum\_coeff})$$ 
and
$$a_1 = -\mathrm{momentum\_coeff}$$
we rewrite this difference equation in typical DSP fashion, with the gradient as input $x[k]$ and the filtered gradient `z` as output $y[k]$

$$y[k] = b_0 x[k] - a_1 y[k-1]$$

The z-transform

$$Y(z) = b_0 X(z) - a_1 z^{-1} Y(z)$$

$$Y(z) + a_1 z^{-1} Y(z) = b_0 X(z)$$

$$(1 + a_1 z^{-1} ) Y(z) = b_0 X(z)$$

brings us to the transfer function

$$H(z) = \frac{Y(z)}{X(z)} = \frac{b_0}{1 + a_1 z^{-1}} = \frac{b_0 z}{z + a_1}$$

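with a zero at the origin $z_0 = 0$ and one real pole (on the $\Re(z)$ axis) at $z_\infty = -a_1=+\mathrm{momentum\_coeff}$. For $0 \leq \mathrm{momentum\_coeff} < 1$ the pole lies inside the unit circle, so the filter is stable.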
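
To get a feeling for this filter in the time domain, the following sketch feeds it a made-up, constant gradient of 1 and compares the output with the closed-form step response $1 - \mathrm{momentum\_coeff}^{k+1}$. The coefficient value 0.9 and the signal length are assumptions for illustration only:

```python
import numpy as np
from scipy.signal import lfilter

momentum_coeff = 0.9                   # assumed value for illustration
b = [1 - momentum_coeff]               # b0
a = [1, -momentum_coeff]               # a0 = 1, a1 = -momentum_coeff

k = np.arange(50)
constant_gradient = np.ones(k.size)    # made-up input: gradient is 1 in every step
z = lfilter(b, a, constant_gradient)   # filtered gradient, zero initial state

# closed form of the step response of this one-pole filter
z_closed_form = 1 - momentum_coeff**(k + 1)

print('matches closed form:', np.allclose(z, z_closed_form))
print('first filtered values:', z[:5])
```

The filtered gradient starts small and approaches the input value only gradually. The closer `momentum_coeff` is to 1, the longer this takes, which is the "mass" that the momentum adds to GD.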


In [None]:
momentum_coeff = 1/3  # arbitrary number
# the higher this coefficient, the lower the cutoff frequency
# what does this mean in terms of the momentum for the GD?!

b0 = 1-momentum_coeff
a0, a1 = 1, -momentum_coeff

b, a = np.array([b0, 0]), np.array([a0, a1])
w, h = freqz(b, a)
plt.plot(w / np.pi, 20*np.log10(np.abs(h)))
plt.xlim(0, 1)
plt.xlabel(r'$\Omega / \pi$')
plt.ylabel('level in dB')
plt.title('lowpass filter acting on gradient to go GD with more mass')
plt.grid(True)

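# descriptive names avoid clashing with the momentum state variable z used below
zeros, poles, gain = tf2zpk(b, a)
print('zero and pole as analytically derived:', zeros, poles)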
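
How strongly the filter damps fast changes can be sketched by evaluating $H(z)$ at only two frequencies: $\Omega = 0$ (a constant gradient) and $\Omega = \pi$ (a gradient that flips its sign in every step). The list of coefficients and the dictionary name `gains` are assumptions for this illustration:

```python
import numpy as np
from scipy.signal import freqz

gains = {}
for momentum_coeff in (0.1, 1/3, 0.9, 0.99):  # assumed example values
    b = np.array([1 - momentum_coeff, 0])
    a = np.array([1, -momentum_coeff])
    # evaluate the frequency response only at Omega = 0 and Omega = pi
    w, h = freqz(b, a, worN=np.array([0.0, np.pi]))
    gains[momentum_coeff] = np.abs(h)
    print('momentum_coeff = %.2f: gain at DC = %.3f, gain at Nyquist = %.3f'
          % (momentum_coeff, gains[momentum_coeff][0], gains[momentum_coeff][1]))
```

From the transfer function, the gain at $\Omega = 0$ is $\frac{b_0}{1 + a_1} = 1$ and the gain at $\Omega = \pi$ is $\frac{b_0}{1 - a_1} = \frac{1-\mathrm{momentum\_coeff}}{1+\mathrm{momentum\_coeff}}$, so a larger coefficient suppresses an alternating gradient more strongly while a constant gradient passes unchanged.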

In [None]:
def plot_contour_of_my_gradient_descent():
    fig, ax = plt.subplots()
    ax.contour(X, Y, Z, cmap='magma_r')
    ax.plot(beta_path_gd[0], beta_path_gd[1], 'C0d-', lw=0.5, ms=4, label='GD')
    ax.plot(beta_path_gd_momentum[0], beta_path_gd_momentum[1], 'C8o-', lw=0.5, ms=4, label='GD with momentum')
    ax.plot(minimum1[0], minimum1[1], 'kx', ms=10)
    ax.plot(minimum2[0], minimum2[1], 'kx', ms=10)
    ax.plot(saddle[0], saddle[1], 'kx', ms=10)
    ax.axis('equal')
    ax.set_xlim(0, 4)
    ax.set_ylim(-0.25, 2.5)
    ax.set_xlabel(r'$\beta_1$')
    ax.set_ylabel(r'$\beta_2$')
    ax.legend()
    ax.grid(True)

### Gradient Descent Momentum

In [None]:
steps = 2**5
beta_gd = np.array([4, 1.075])
step_size = 1/5

z = np.array([0, 0])
momentum_coeff = 0.33

beta_gd_momentum = np.copy(beta_gd)
beta_path_gd = np.zeros((2, steps+1))
beta_path_gd[:, 0] = beta_gd
beta_path_gd_momentum = np.zeros((2, steps+1))
beta_path_gd_momentum[:, 0] = beta_gd_momentum

for step in range(steps):
    beta_gd = my_gradient_descent(beta_gd, step_size)
    beta_path_gd[:, step+1] = beta_gd
    
    beta_gd_momentum, z = my_gradient_descent_momentum(beta_gd_momentum,
                                                       z,
                                                       step_size,
                                                       momentum_coeff)
    beta_path_gd_momentum[:, step+1] = beta_gd_momentum
    
# and plot
plot_contour_of_my_gradient_descent()

### Gradient Descent Momentum

In [None]:
steps = 2**10
beta_gd = np.array([0.0, -0.5])
step_size = 1 / 50

z = np.array([0, 0])
momentum_coeff = 0.99

beta_gd_momentum = np.copy(beta_gd)
beta_path_gd = np.zeros((2, steps+1))
beta_path_gd[:, 0] = beta_gd
beta_path_gd_momentum = np.zeros((2, steps+1))
beta_path_gd_momentum[:, 0] = beta_gd_momentum

for step in range(steps):
    beta_gd = my_gradient_descent(beta_gd, step_size)
    beta_path_gd[:, step+1] = beta_gd
    
    beta_gd_momentum, z = my_gradient_descent_momentum(beta_gd_momentum,
                                                       z,
                                                       step_size,
                                                       momentum_coeff)
    beta_path_gd_momentum[:, step+1] = beta_gd_momentum
    
# and plot
plot_contour_of_my_gradient_descent()

### Gradient Descent Momentum


In [None]:
steps = 2**6
beta_gd = np.array([4., 1.])
step_size = 1 / 10

z = np.array([0, 0])
momentum_coeff = 0.95

beta_gd_momentum = np.copy(beta_gd)
beta_path_gd = np.zeros((2, steps+1))
beta_path_gd[:, 0] = beta_gd
beta_path_gd_momentum = np.zeros((2, steps+1))
beta_path_gd_momentum[:, 0] = beta_gd_momentum

for step in range(steps):
    beta_gd = my_gradient_descent(beta_gd, step_size)
    beta_path_gd[:, step+1] = beta_gd
    
    beta_gd_momentum, z = my_gradient_descent_momentum(beta_gd_momentum,
                                                       z,
                                                       step_size,
                                                       momentum_coeff)
    beta_path_gd_momentum[:, step+1] = beta_gd_momentum
    
# and plot
plot_contour_of_my_gradient_descent()
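
The start value `[4., 1.]` used in this example lies exactly on the line $\beta_2 = 1$, where $\frac{\partial \mathcal{L}}{\partial \beta_2}$ vanishes. The following self-contained sketch repeats the two updates with the same step size, momentum coefficient and number of steps, and checks what happens to $\beta_2$. The function name `gradient` and the variable `beta_mom` are made up for this illustration:

```python
import numpy as np


def gradient(beta):
    # analytically known gradient of the toy loss function
    return np.array([2 * (beta[0] - 2),
                     4 * (beta[1] - 1)**3 - 2 * (beta[1] - 1)])


# values taken from the example above
step_size, momentum_coeff, steps = 1 / 10, 0.95, 2**6

beta_gd = np.array([4.0, 1.0])
beta_mom, z = beta_gd.copy(), np.zeros(2)
for _ in range(steps):
    # plain GD step
    beta_gd = beta_gd - step_size * gradient(beta_gd)
    # momentum step, same update order as my_gradient_descent_momentum
    beta_mom = beta_mom - step_size * z
    z = (1 - momentum_coeff) * gradient(beta_mom) + momentum_coeff * z

print('plain GD:        ', beta_gd)
print('GD with momentum:', beta_mom)
```

Since the second gradient component is zero in every step, neither variant can move away from $\beta_2 = 1$: both only travel along the line towards the saddle point. Momentum by itself therefore does not help to escape in this case; a start value slightly off the line is needed.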

### Gradient Descent Momentum



In [None]:
steps = 2**4
beta_gd = np.array([4., 2.5])
step_size = 0.29
z = np.array([0, 0])
momentum_coeff = 0.08 # 0.03 vs. 0.09

beta_gd_momentum = np.copy(beta_gd)
beta_path_gd = np.zeros((2, steps+1))
beta_path_gd[:, 0] = beta_gd
beta_path_gd_momentum = np.zeros((2, steps+1))
beta_path_gd_momentum[:, 0] = beta_gd_momentum

for step in range(steps):
    beta_gd = my_gradient_descent(beta_gd, step_size)
    beta_path_gd[:, step+1] = beta_gd
    
    beta_gd_momentum, z = my_gradient_descent_momentum(beta_gd_momentum,
                                                       z,
                                                       step_size,
                                                       momentum_coeff)
    beta_path_gd_momentum[:, step+1] = beta_gd_momentum
    
# and plot
plot_contour_of_my_gradient_descent()

## Copyright

- the notebooks are provided as [Open Educational Resources](https://en.wikipedia.org/wiki/Open_educational_resources)
- feel free to use the notebooks for your own purposes
- the text is licensed under [Creative Commons Attribution 4.0](https://creativecommons.org/licenses/by/4.0/)
- the code of the IPython examples is licensed under the [MIT license](https://opensource.org/licenses/MIT)
- please attribute the work as follows: *Frank Schultz, Data Driven Audio Signal Processing - A Tutorial Featuring Computational Examples, University of Rostock* ideally with relevant file(s), github URL https://github.com/spatialaudio/data-driven-audio-signal-processing-exercise, commit number and/or version tag, year.