Sascha Spors,
Professorship Signal Theory and Digital Signal Processing,
Institute of Communications Engineering (INT),
Faculty of Computer Science and Electrical Engineering (IEF),
University of Rostock,
Germany

# Data Driven Audio Signal Processing - A Tutorial with Computational Examples

Winter Semester 2022/23 (Master Course #24512)

- lecture: https://github.com/spatialaudio/data-driven-audio-signal-processing-lecture
- tutorial: https://github.com/spatialaudio/data-driven-audio-signal-processing-exercise

Feel free to contact lecturer frank.schultz@uni-rostock.de

# Trade-Off Between Bias^2 / Variance; Regularization by Ridge Regression

- we use plain **ridge regression**, i.e. **regularized least squares**, to discuss a fundamental aspect of learning from data, i.e. of creating prediction models
- this aspect is known as the **bias-variance trade-off**
- in general, the expected squared error between the measured data and the model predictions splits into three components
$$\text{model bias}^2 + \text{model variance} + \text{noise variance}$$
- a model never explains all variance (which is not even desirable for a useful, robust model), so a certain noise variance remains
- we can influence model bias and model variance by the choice of the model, see [bias_variance_linear_regression.ipynb](bias_variance_linear_regression.ipynb) for design/feature matrices that set up models of different complexity
- however, we cannot achieve the lowest model bias *and* the lowest model variance at the same time
- to keep the overall prediction error small, we therefore need a good compromise between bias and variance; in particular, we need to avoid two extremes
    - underfit case, with typically too low model complexity yielding high bias and low variance
    - overfit case, with typically too high model complexity yielding low bias and high variance
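
The split into squared bias and variance can be checked numerically. The following minimal sketch uses hypothetical toy predictions (the names `Yh`, `y_true`, `rng_demo` and all numbers are assumptions for illustration, not the data of this notebook). It compares the mean squared error with respect to the noise-free data against the sum of squared bias and variance, all estimated over the same set of models.

```python
import numpy as np

rng_demo = np.random.default_rng(0)  # hypothetical seed, illustration only

n_points, n_models = 50, 200
x = np.linspace(0, 1, n_points)
y_true = np.sin(2 * np.pi * x)  # noise-free toy target

# toy predictions of many models:
# systematic offset (bias) plus random scatter (variance)
Yh = y_true[:, None] + 0.3 + 0.5 * rng_demo.normal(size=(n_points, n_models))

ym = np.mean(Yh, axis=1)  # average prediction over the models
bias_squared = np.mean((ym - y_true) ** 2)
variance = np.mean((Yh - ym[:, None]) ** 2)
mse = np.mean((Yh - y_true[:, None]) ** 2)

print(bias_squared, variance, mse)
```

Since the comparison is made against noise-free data, the noise variance term does not appear in this sketch.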

In this notebook we use a model that is in principle capable of overfitting, i.e. its model complexity is high enough to fit some of the noise. One way to reduce overfitting is to regularize the inverse problem. Here, we use linear regression with the simplest form of regularization

$$\min_{\text{wrt }\mathbf{b}} (||\mathbf{y} - \mathbf{X} \mathbf{b}||_2^2 + \alpha ||\mathbf{b}||^2_2),$$

which is known as **ridge regression** or **Tikhonov regularization** with the **hyperparameter** (regularization value) $\alpha$.

The solution can be given analytically and is well known as

$$\hat{\mathbf{b}} = (\mathbf{X}^\mathrm{H} \mathbf{X} + \alpha \mathbf{I})^{-1} \mathbf{X}^\mathrm{H} \mathbf{y}$$

and reduces to the left-inverse solution for $\alpha=0$, i.e. linear regression with ordinary least squares (OLS), provided that $\mathbf{X}$ has full column rank.
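
The closed-form solution can be written in a few lines of NumPy. This is a minimal sketch on hypothetical toy data; the helper name `ridge_closed_form` and the random design matrix are assumptions for illustration and are not used elsewhere in this notebook.

```python
import numpy as np


def ridge_closed_form(X, y, alpha):
    """Solve (X^H X + alpha I) b = X^H y for b."""
    N = X.shape[1]
    XH = X.conj().T
    return np.linalg.solve(XH @ X + alpha * np.eye(N), XH @ y)


rng_demo = np.random.default_rng(1)  # hypothetical seed, illustration only
X = rng_demo.normal(size=(100, 5))  # toy design matrix, full column rank
y = X @ np.array([3, 2, 1, 1/2, 1/4]) + rng_demo.normal(size=100)

b_ols = np.linalg.lstsq(X, y, rcond=None)[0]  # ordinary least squares
b_0 = ridge_closed_form(X, y, alpha=0)
b_10 = ridge_closed_form(X, y, alpha=10)

print(np.allclose(b_0, b_ols))  # alpha=0 vs. OLS
print(np.linalg.norm(b_0), np.linalg.norm(b_10))  # coefficient norms
```

Using `np.linalg.solve` instead of an explicit matrix inverse is a deliberate choice here, as it is usually the numerically more robust option.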


We will see that
- very **small** $\alpha$ produces **high var**iance, but **low** squared **bias**; hence, we potentially **over**fit the model
- very **large** $\alpha$ produces **low** **var**iance, but **high** squared **bias**; hence, we potentially **under**fit the model

We can regard the specific $\alpha$ for which the sum $\text{model bias}^2 + \text{model variance}$ becomes **minimal** as an **optimum** choice for the amount of regularization.
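
One way to see where this behaviour comes from is the singular value decomposition $\mathbf{X} = \mathbf{U} \mathbf{S} \mathbf{V}^\mathrm{H}$. Compared with OLS, ridge regression scales the solution component that belongs to the singular value $s_i$ by the factor $s_i^2 / (s_i^2 + \alpha)$. The following sketch prints these factors for hypothetical toy data; the helper names `shrinkage_factors` and `ridge_svd` are assumptions for illustration.

```python
import numpy as np


def shrinkage_factors(s, alpha):
    """Ridge shrinkage per singular value: 1 = untouched, 0 = suppressed."""
    return s**2 / (s**2 + alpha)


def ridge_svd(X, y, alpha):
    """Ridge solution via the SVD X = U diag(s) V^H."""
    U, s, Vh = np.linalg.svd(X, full_matrices=False)
    return Vh.conj().T @ (s / (s**2 + alpha) * (U.conj().T @ y))


rng_demo = np.random.default_rng(2)  # hypothetical seed, illustration only
X = rng_demo.normal(size=(200, 6))  # toy design matrix
y = rng_demo.normal(size=200)  # toy data

s = np.linalg.svd(X, compute_uv=False)
for alpha in (0.0, 1.0, 100.0, 1e4):
    print(alpha, np.round(shrinkage_factors(s, alpha), 3))
```

For small $\alpha$ all factors stay close to one, so noise-sensitive components pass almost unchanged. For large $\alpha$ all components are pulled towards zero, which lowers the variance at the cost of a higher bias.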

We should realize that regularization does not solve the general problem of choosing an appropriate design/feature matrix, i.e. an appropriate model. For example, if the true model depends on $x^2$ and the prediction model is set up with $x^3$, it will be hard to train/predict for negative $x$-values, simply because the true model and the prediction model have little in common. Regularization can help only a little here. We could try this as a toy example on our own.
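
One possible setup for such a toy example is sketched below. All choices are assumptions for illustration: noise-free data $y = x^2$ on a grid that is symmetric around zero, a prediction model with intercept and $x^3$ only, and a hypothetical helper `ridge_closed_form` that implements the closed-form ridge solution with NumPy.

```python
import numpy as np


def ridge_closed_form(X, y, alpha):
    """Solve (X^T X + alpha I) b = X^T y for real-valued X."""
    return np.linalg.solve(X.T @ X + alpha * np.eye(X.shape[1]), X.T @ y)


x = np.linspace(-2, 2, 101)
y = x**2  # true model, noise-free for clarity

# mismatched prediction model: intercept and cubic feature only
X = np.column_stack((np.ones_like(x), x**3))

alphas = [0, 0.01, 1, 100]
coeffs = [ridge_closed_form(X, y, alpha) for alpha in alphas]
sse = [np.sum((y - X @ b)**2) for b in coeffs]

for alpha, b, e in zip(alphas, coeffs, sse):
    print(alpha, b, e)
```

On this symmetric grid the odd feature $x^3$ cannot describe the even function $x^2$, so its coefficient stays at zero and only the intercept is fitted. Increasing $\alpha$ does not repair the mismatch; it merely shrinks the intercept.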

Useful chapters in textbooks on the bias-variance trade-off and ridge regression:
- [Bishop 2006] Christopher M. Bishop, *Pattern Recognition and Machine Learning*, Springer, 2006, Chapter 3.2, 3.1.4
- Sergios Theodoridis, *Machine Learning*, Academic Press, 2020, 2nd ed., Chapter 3.9, 3.8
- Kevin P. Murphy, *Machine Learning: A Probabilistic Perspective*, MIT Press, 2012, 1st ed., Chapter 6.4.4, 7.5
- Kevin P. Murphy, *Probabilistic Machine Learning: An Introduction*, MIT Press, 2022, Chapter 4.7.6.3, 11.3
- Trevor Hastie, Robert Tibshirani, Jerome Friedman, *The Elements of Statistical Learning: Data Mining, Inference, and Prediction*, Springer, 2009, 2nd ed., Chapter 2.9, 3.4
- Gareth James, Daniela Witten, Trevor Hastie, Rob Tibshirani, *An Introduction to Statistical Learning with Applications in R*, Springer, 2021, 2nd ed., Chapter 2.2.2, 6.2.1
- Richard O. Duda, Peter E. Hart, David G. Stork, *Pattern Classification*, Wiley, 2000, 2nd ed., Chapter 9.3

In [None]:
import matplotlib.pyplot as plt
import numpy as np
from statsmodels.api import OLS

In [None]:
# for reproducible outputs
rng = np.random.default_rng(12345)  # used for data creation and shuffle

In [None]:
def create_dataset(M, split=0.8, noise_scale=2, shuffled=True):
    x = np.linspace(0, 2*np.pi, M)  # lin increase
    # shuffle data for simple train/test split handling using [:Ns], [Ns:]
    if shuffled:
        rng.shuffle(x)

    # design/feature matrix of the true model
    X = np.column_stack((np.cos(1*x),
                         np.sin(2*x),
                         np.cos(5*x),
                         np.cos(6*x)))
    # add bias/intercept column to the design/feature matrix
    X = np.hstack((np.ones((M, 1)), X))
    # some nice numbers for the true model parameters beta
    beta = np.array([3, 2, 1, 1/2, 1/4])
    # outcome of true model
    y_true = (X @ beta)[:, None]
    # add measurement noise
    noise = rng.normal(loc=0, scale=noise_scale, size=(M, 1))
    y = y_true + noise

    # design/feature matrix of the prediction model
    # we create a model that can overfit the noisy data
    # as the feature/design matrix contains also non-matching Fourier series
    # components, thus:
    # true model Fourier components, same as above
    X = np.column_stack((np.cos(1*x),
                         np.sin(2*x),
                         np.cos(5*x),
                         np.cos(6*x)))
    # additional Fourier components, that do not explain our y_true
    # but will be sensible to the measurement noise contained in y
    if True:
        X = np.column_stack((X,
                             np.cos(2*x),
                             np.cos(3*x),
                             np.cos(4*x),
                             np.cos(7*x),
                             np.cos(8*x),
                             np.cos(9*x),
                             np.cos(10*x),
                             np.sin(1*x),
                             np.sin(3*x),
                             np.sin(4*x),
                             np.sin(5*x),
                             np.sin(6*x),
                             np.sin(7*x),
                             np.sin(8*x),
                             np.sin(9*x),
                             np.sin(10*x)))
    X = np.hstack((np.ones((M, 1)), X))

    # split data set into training set and test set
    Ns = int(split*M)
    X_trn, X_tst = X[:Ns, :], X[Ns:, :]
    y_true_trn, y_true_tst = y_true[:Ns], y_true[Ns:]  # without noise
    y_trn, y_tst = y[:Ns], y[Ns:]  # with measurement noise

    return x, X_trn, X_tst, y_true_trn, y_true_tst, y_trn, y_tst

In [None]:
noise_scale = 5

M = 2**10  # no of rows in X = no of samples

Nmodels = 2**6  # number of models to be trained

alpha_min, alpha_max = -2, +2
Nalpha = 2**6 + 1

alpha_vec = np.logspace(alpha_min, alpha_max, Nalpha-1)
alpha_vec = np.insert(alpha_vec, 0, 0)  # add 0 for no regularization

x, X_trn, X_tst, y_true_trn, y_true_tst, y_trn, y_tst = create_dataset(M)
Mtest = y_tst.shape[0]

# we use capital Y to refer to output data of many models stored into matrices
Yh_tst = np.zeros((Mtest, Nmodels, Nalpha))
Y_true_tst = np.zeros((Mtest, Nmodels, Nalpha))

In [None]:
for model_idx in range(Nmodels):
    x, X_trn, X_tst, y_true_trn, y_true_tst, y_trn, y_tst = create_dataset(
        M, noise_scale=noise_scale)
    model = OLS(y_trn, X_trn, hasconst=True)

    for alpha_idx, alpha in enumerate(alpha_vec):
        results = model.fit_regularized(
            alpha=alpha, L1_wt=0, profile_scale=False)
        Yh_tst[:, model_idx, alpha_idx] = results.predict(X_tst)
        # either: true model data without noise:
        Y_true_tst[:, model_idx, alpha_idx] = np.squeeze(y_true_tst)
        # or: take the noisy data
        # Y_true_tst[:, model_idx, alpha_idx] = np.squeeze(y_tst)
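
A note on the scaling of `alpha`: according to the statsmodels documentation, `fit_regularized` minimizes $0.5\,\mathrm{RSS}/n + \alpha\,((1-\mathrm{L1\_wt})\,||\mathbf{b}||_2^2/2 + \mathrm{L1\_wt}\,||\mathbf{b}||_1)$, where $n$ is the number of training samples. Assuming that this holds for the installed version, `L1_wt=0` corresponds to the ridge problem stated at the beginning with the regularization value $n \alpha$ rather than $\alpha$. The NumPy-only sketch below (hypothetical toy data, no statsmodels call) checks that the gradient of this objective vanishes at the closed-form solution that uses $n \alpha$.

```python
import numpy as np

rng_demo = np.random.default_rng(3)  # hypothetical seed, illustration only
n, N = 80, 4  # number of samples, number of features
X = rng_demo.normal(size=(n, N))  # toy design matrix
y = rng_demo.normal(size=n)  # toy data
alpha = 0.1

# closed-form ridge solution with effective regularization value n*alpha
b = np.linalg.solve(X.T @ X + n * alpha * np.eye(N), X.T @ y)

# gradient of 0.5*RSS/n + 0.5*alpha*||b||_2^2 with respect to b
grad = -X.T @ (y - X @ b) / n + alpha * b
print(np.max(np.abs(grad)))
```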

In [None]:
# get average prediction, i.e. mean over the Nmodels models (axis=1)
# (denoted as L in [Bishop 2006]), which is a numerical eval of the expectation:
ym = np.mean(Yh_tst, axis=1)  # (3.45) in [Bishop 2006]
ym_true = np.mean(Y_true_tst, axis=1)

# get integrated squared bias (numerical eval of the expectation):
# (3.42), (3.46) in [Bishop 2006]
bias_squared = np.mean((ym - ym_true)**2, axis=0)

# get integrated variance (numerical eval of the expectation):
# (3.43), (3.47) in [Bishop 2006]
variance = np.mean(
    np.mean((Yh_tst - np.expand_dims(ym, axis=1))**2, axis=1), axis=0)

# find min for bias_squared+variance
idx = np.argmin(bias_squared+variance)
# get specific alpha for this min
alpha_opt = alpha_vec[idx]

In [None]:
fig, axs = plt.subplots(1, 1, figsize=(8, 4))
axs.plot(alpha_vec, bias_squared, 'C0', label=r'bias$^2$', lw=2)
axs.plot(alpha_vec, variance, 'C1', label=r'var')
axs.plot(alpha_vec, bias_squared+variance, 'C2', label=r'bias$^2$+var')

axs.plot(alpha_opt, bias_squared[idx], 'C0o')
axs.plot(alpha_opt, variance[idx], 'C1o')
axs.plot(alpha_opt, bias_squared[idx] + variance[idx], 'C2o')

axs.set_xscale('log')
axs.set_yscale('log')
axs.set_xlabel(r'regularization value $\alpha$')
axs.set_title(r'$\alpha_\mathrm{opt}$='+'{:4.3f}'.format(alpha_vec[idx]))
axs.legend()
axs.set_xlim(10**alpha_min, 10**alpha_max)
axs.set_ylim(1e-2, 1e1)
axs.grid(True)

In [None]:
# split=1 here, i.e. all data goes into the training set (the test set is
# empty) and we predict on the training data
# we do this to conveniently show y(x) for shuffled=False data
x, X_trn, X_tst, y_true_trn, y_true_tst, y_trn, y_tst = create_dataset(
    M, split=1, noise_scale=noise_scale, shuffled=False)
model = OLS(y_trn, X_trn, hasconst=True)

fig, axs = plt.subplots(2, 2, figsize=(10, 5))
fig2, axs2 = plt.subplots(1, 1, figsize=(10, 3))

# model with alpha=0
results = model.fit_regularized(alpha=0, L1_wt=0, profile_scale=False)
yh = results.predict(X_trn)
axs2.plot(np.arange(X_trn.shape[1]), results.params, 'C0o-',
          label=r'$\alpha$='+'{:4.3f}'.format(alpha_vec[0]))
axs[0, 0].plot(x, y_trn, 'C0')
axs[0, 0].plot(x, y_true_trn, 'C1')
axs[0, 0].plot(x, yh, 'C3')
axs[0, 0].set_title(r'$\alpha$='+'{:4.3f}'.format(alpha_vec[0]))

# model with alpha=0.01 (stored in alpha_vec[1], as alpha_vec[0] holds alpha=0)
results = model.fit_regularized(
    alpha=alpha_vec[1], L1_wt=0, profile_scale=False)
yh = results.predict(X_trn)
axs2.plot(np.arange(X_trn.shape[1]), results.params, 'C1o:',
          label=r'$\alpha$='+'{:4.3f}'.format(alpha_vec[1]))
axs[0, 1].plot(x, y_trn, 'C0')
axs[0, 1].plot(x, y_true_trn, 'C1')
axs[0, 1].plot(x, yh, 'C3')
axs[0, 1].set_title(r'$\alpha$='+'{:4.3f}'.format(alpha_vec[1]))

# model with optimum alpha (stored in alpha_opt)
results = model.fit_regularized(alpha=alpha_opt, L1_wt=0, profile_scale=False)
yh = results.predict(X_trn)
axs2.plot(np.arange(X_trn.shape[1]), results.params, 'C2o-',
          label=r'$\alpha_\mathrm{opt}$='+'{:4.3f}'.format(alpha_opt))
axs[1, 0].plot(x, y_trn, 'C0')
axs[1, 0].plot(x, y_true_trn, 'C1')
axs[1, 0].plot(x, yh, 'C3')
axs[1, 0].set_title(r'$\alpha_\mathrm{opt}$='+'{:4.3f}'.format(alpha_opt))

# model with alpha=100 (stored in alpha_vec[-1])
results = model.fit_regularized(
    alpha=alpha_vec[-1], L1_wt=0, profile_scale=False)
yh = results.predict(X_trn)
axs2.plot(np.arange(X_trn.shape[1]), results.params, 'C3o-',
          label=r'$\alpha$='+'{:4.3f}'.format(alpha_vec[-1]))
axs[1, 1].plot(x, y_trn, 'C0', label='y_train (with noise)')
axs[1, 1].plot(x, y_true_trn, 'C1', label='y_true_train (w/o noise)')
axs[1, 1].plot(x, yh, 'C3', label='predicted y from X_trn')
axs[1, 1].set_title(r'$\alpha$='+'{:4.3f}'.format(alpha_vec[-1]))


for i in range(2):
    for j in range(2):
        axs[i, j].grid(True)
        axs[i, j].set_xlabel('x')
        axs[i, j].set_ylabel('y')
        axs[i, j].set_xlim(x[0], x[-1])
        axs[i, j].set_ylim(-7, 13)
axs[1, 1].legend()
fig.tight_layout()

axs2.legend()
axs2.set_xlabel(
    r'$\beta$ coefficient index, true model features 0...4, features >4 contribute to overfit')
axs2.set_ylabel(r'$\beta$ value')
axs2.set_title('prediction model parameters')
axs2.set_xticks(np.arange(X_trn.shape[1]))
axs2.grid(True)
fig2.tight_layout()

## Check best regularization value for one specific data set

In [None]:
# we use very noisy y data
# we have split=1 and shuffled=False here, as we split and shuffle manually
# below
# we do this because we want to concatenate and re-sort the data after
# training / testing in order to plot the data nicely
x, X_trn, _, y_true_trn, _, y_trn, _ = create_dataset(
    M, split=1, noise_scale=10, shuffled=False)

if False:  # we could use just the features that correspond to the true model
    X_trn = X_trn[:, 0:5]

# for shuffling data
idx = np.arange(M)
rng.shuffle(idx)
# shuffle data
x = x[idx]
X = X_trn[idx, :]
y_true = y_true_trn[idx]
y = y_trn[idx]
# split data
split = 0.8  # 80 % go into training, 20% into test data
Ns = int(split*M)
X_trn, X_tst = X[:Ns, :], X[Ns:, :]
y_true_trn, y_true_tst = y_true[:Ns], y_true[Ns:]  # without noise
y_trn, y_tst = y[:Ns], y[Ns:]  # with measurement noise
# set up OLS model
model = OLS(y_trn, X_trn, hasconst=True)
# we use capital Y to refer to output data of many models stored into matrices
Yh_trn = np.zeros((Ns, Nalpha))
Yh_tst = np.zeros((M-Ns, Nalpha))
# train/predict for different regularization values
for alpha_idx, alpha in enumerate(alpha_vec):
    results = model.fit_regularized(alpha=alpha, L1_wt=0, profile_scale=False)
    Yh_trn[:, alpha_idx] = results.predict(X_trn)
    Yh_tst[:, alpha_idx] = results.predict(X_tst)
# residual check for train / test data compared with the true data
# (which we don't have in practice)
SSE_trn = np.sum((Yh_trn - y_true_trn)**2, axis=0)
SSE_tst = np.sum((Yh_tst - y_true_tst)**2, axis=0)
# get optimum alpha where smallest SSE_trn+SSE_tst is obtained
alpha_opt_idx = np.argmin(SSE_trn+SSE_tst)
alpha_opt = alpha_vec[alpha_opt_idx]
print(alpha_opt)

plt.figure(figsize=(10,3))
plt.plot(alpha_vec, SSE_trn, 'C0', label='sum squared errors train data')
plt.plot(alpha_vec, SSE_tst, 'C2', label='sum squared errors test data')
plt.plot(alpha_vec, SSE_trn+SSE_tst, 'C3', label='SSE train + SSE test')
plt.plot(alpha_vec[alpha_opt_idx], SSE_trn[alpha_opt_idx]+SSE_tst[alpha_opt_idx], 'C3o')
plt.xscale('log')
plt.yscale('log')
plt.xlim(10**alpha_min, 10**alpha_max)
plt.xlabel(r'regularization value $\alpha$')
plt.title(r'$\alpha_\mathrm{opt}$='+'{:4.3f}'.format(alpha_opt))
plt.legend()
plt.grid(True)
plt.tight_layout()
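
The SSE curves above rely on the noise-free `y_true`, which is not available in practice. A common workaround is to select $\alpha$ from the error on held-out noisy data. The sketch below illustrates this under stated assumptions: hypothetical toy data, a hypothetical helper `ridge_closed_form`, and a closed-form NumPy solution instead of `fit_regularized`, so the $\alpha$ values are not directly comparable with those above.

```python
import numpy as np


def ridge_closed_form(X, y, alpha):
    """Solve (X^T X + alpha I) b = X^T y for real-valued X."""
    return np.linalg.solve(X.T @ X + alpha * np.eye(X.shape[1]), X.T @ y)


rng_demo = np.random.default_rng(4)  # hypothetical seed, illustration only

# toy data: 2 relevant features, but 20 features in the prediction model
M = 200
x = rng_demo.uniform(0, 2*np.pi, M)
X = np.column_stack([np.cos(k*x) for k in range(1, 11)] +
                    [np.sin(k*x) for k in range(1, 11)])
y = 2*np.cos(x) + np.cos(5*x) + rng_demo.normal(scale=3, size=M)

# hold out 20 % of the noisy data for the selection of alpha
Ns = int(0.8*M)
X_trn, X_val = X[:Ns], X[Ns:]
y_trn, y_val = y[:Ns], y[Ns:]

alphas = np.insert(np.logspace(-2, 3, 26), 0, 0)  # 0 = no regularization
sse_val = np.array([
    np.sum((y_val - X_val @ ridge_closed_form(X_trn, y_trn, a))**2)
    for a in alphas])
alpha_best = alphas[np.argmin(sse_val)]
print(alpha_best)
```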

In [None]:
# concatenate train and test data
# undo the shuffling, i.e. restore the original order,
# such that x is increasing
y_true = np.concatenate((y_true_trn, y_true_tst))
y_true = y_true[np.argsort(idx)]
y = np.concatenate((y_trn, y_tst))
y = y[np.argsort(idx)]
yh = np.concatenate((Yh_trn[:, alpha_opt_idx], Yh_tst[:, alpha_opt_idx]))
yh = yh[np.argsort(idx)]
# plot y(x)
plt.figure(figsize=(10,3))
plt.plot(x[np.argsort(idx)], y, 'C0', label='measured y (with noise)')
plt.plot(x[np.argsort(idx)], y_true, 'C1', label='true y (w/o noise)')
plt.plot(x[np.argsort(idx)], yh, 'C3', label='predicted y')
plt.xlabel('x')
plt.ylabel('y')
plt.xlim(x[np.argsort(idx)][0], x[np.argsort(idx)][-1])
plt.legend()
plt.grid(True)
plt.tight_layout()
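
The re-sorting above relies on the fact that `np.argsort(idx)` is the inverse permutation of `idx`. A minimal sketch with hypothetical toy data (the names `a` and `rng_demo` are assumptions for illustration):

```python
import numpy as np

rng_demo = np.random.default_rng(5)  # hypothetical seed, illustration only

a = np.linspace(0, 1, 8)  # data in original (increasing) order
idx = np.arange(a.size)
rng_demo.shuffle(idx)  # random permutation of the indices

a_shuffled = a[idx]
a_restored = a_shuffled[np.argsort(idx)]  # apply the inverse permutation
print(np.array_equal(a_restored, a))
```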

## Copyright

- the notebooks are provided as [Open Educational Resources](https://en.wikipedia.org/wiki/Open_educational_resources)
- the text is licensed under [Creative Commons Attribution 4.0](https://creativecommons.org/licenses/by/4.0/)
- the code of the IPython examples is licensed under the [MIT license](https://opensource.org/licenses/MIT)
- feel free to use the notebooks for your own purposes
- please attribute the work as follows: *Frank Schultz, Data Driven Audio Signal Processing - A Tutorial Featuring Computational Examples, University of Rostock* ideally with relevant file(s), github URL https://github.com/spatialaudio/data-driven-audio-signal-processing-exercise, commit number and/or version tag, year.