Sascha Spors,
Professorship Signal Theory and Digital Signal Processing,
Institute of Communications Engineering (INT),
Faculty of Computer Science and Electrical Engineering (IEF),
University of Rostock,
Germany

# Data Driven Audio Signal Processing - A Tutorial with Computational Examples

Winter Semester 2023/24 (Master Course #24512)

- lecture: https://github.com/spatialaudio/data-driven-audio-signal-processing-lecture
- tutorial: https://github.com/spatialaudio/data-driven-audio-signal-processing-exercise

Feel free to contact lecturer frank.schultz@uni-rostock.de

In [None]:
import matplotlib.pyplot as plt
import numpy as np
from statsmodels.api import OLS

# Trade-Off Between Bias^2 / Variance; Regularization by Ridge Regression

- we use plain **regularized least squares**, i.e. **ridge regression**, to discuss a fundamental aspect of learning from data, i.e. of creating prediction models
- this aspect is known as the bias-variance trade-off
- in general, we can split the **sum of (true model data - predicted model data)^2** into three components

$$\text{model bias}^2 + \text{model variance} + \text{noise variance}$$

- a model will never explain all variance (which is actually not wanted for a useful robust model), so a certain noise variance remains
- we can influence the model bias and model variance obviously by the choice of the model, see [bias_variance_linear_regression.ipynb](bias_variance_linear_regression.ipynb) for usage of different design/feature matrices that set up models with different complexity
- however, we cannot at the same time have lowest model bias *and* lowest model variance to reduce the overall error for predictions
- we therefore need to find a good compromise between bias and variance and especially we need to avoid two extremes
    - underfit case, with typically too low model complexity yielding high bias and low variance
    - overfit case, with typically too high model complexity yielding low bias and high variance
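
For reference, the three components can be written out for a single input $x$. This is a sketch under the usual textbook assumptions (cf. [Bishop 2006], Chapter 3.2); the symbols $f$, $\epsilon$, $\sigma_\epsilon^2$, $\hat{y}$ and $\mathcal{D}$ are introduced here for illustration only. We assume that the data follow $y = f(x) + \epsilon$ with zero-mean noise of variance $\sigma_\epsilon^2$, that $\hat{y}(x;\mathcal{D})$ is the prediction of a model trained on a randomly drawn data set $\mathcal{D}$, and that the noise in the evaluated sample is independent of $\mathcal{D}$. Then

$$\mathbb{E}_{\mathcal{D},\epsilon}\left[(y - \hat{y}(x;\mathcal{D}))^2\right] = \underbrace{\left(\mathbb{E}_{\mathcal{D}}[\hat{y}(x;\mathcal{D})] - f(x)\right)^2}_{\text{model bias}^2} + \underbrace{\mathbb{E}_{\mathcal{D}}\left[\left(\hat{y}(x;\mathcal{D}) - \mathbb{E}_{\mathcal{D}}[\hat{y}(x;\mathcal{D})]\right)^2\right]}_{\text{model variance}} + \underbrace{\sigma_\epsilon^2}_{\text{noise variance}}$$

If the predictions are compared against the noise-free $f(x)$ instead of the noisy $y$, the last term drops out and only $\text{model bias}^2 + \text{model variance}$ remains.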

In this notebook we use a model that is in principle capable of overfitting, i.e. it can fit some amount of the noise because its model complexity is comparatively high.

One way to avoid overfitting to a certain degree is to regularize the inverse problem. A very simple form of regularization can be applied to our well-known linear model, i.e. to the ordinary least squares (OLS) problem

$$\min_{\text{wrt }\beta} (||\mathbf{y} - \mathbf{X} \beta||_2^2)$$

by adding a penalty onto $||\beta||^2_2$, which leads to the optimization problem

$$\min_{\text{wrt }\beta} (||\mathbf{y} - \mathbf{X} \beta||_2^2 + \alpha^2 ||\beta||^2_2).$$

This is known as **ridge regression** or **Tikhonov regularization**.
The approach has the model parameters $\beta$, but compared to OLS it has **one additional hyper parameter**, which is the regularization value / penalty weight $\alpha^2 \geq 0$ (a real-valued scalar).
For $\alpha^2=0$ we obtain the ordinary least squares case.

Hyper parameters are parameters that belong to the chosen optimization problem, i.e. to the optimization 'algorithm'/approach, rather than to the model architecture.
Optimum hyper parameters should be learned within the model's training/fitting stage.
We do this later on in the course, cf. [exercise12_HyperParameterTuning.ipynb](exercise12_HyperParameterTuning.ipynb).

Note that in textbooks $\alpha^2$ is very often denoted as variable $\lambda$.
Here we rather use the squared variable $\alpha^2$ to have consistent quantities in the denominator of the singular value inversion

$$\frac{\sigma_i}{\sigma_i^2 + \alpha^2}$$

that holds for ridge regression, cf. [exercise07_left_inverse_SVD_QR.ipynb](exercise07_left_inverse_SVD_QR.ipynb).
As a recap, the plot below, which we already had in that exercise, compares the inverted singular values with and without penalty.

In [None]:
alpha = 1/10
lmb = alpha**2

singval = np.logspace(-4, 4, 2**6)
# ridge regression
inv_singval = singval / (singval**2 + alpha**2)

plt.plot(singval, 1 / singval, label='no penalty')
plt.plot(singval, inv_singval, label='penalty')
plt.xscale('log')
plt.yscale('log')
plt.xticks(10.**np.arange(-4, 5))
plt.yticks(10.**np.arange(-4, 5))
plt.axis('equal')
plt.xlabel(r'singular value $\sigma_i$')
plt.ylabel(
    r'ridge inverted singular value $\sigma_i \,\,\,/\,\,\, (\sigma_i^2 + \alpha^2)$')
plt.title(r'ridge penalty $\alpha =$'+str(alpha) +
          r', $\alpha^2 =$'+str(alpha**2))
plt.legend()
plt.grid()
print('alpha =', alpha, 'alpha^2 = lambda =', lmb)

The solution of the ridge regression optimization problem (in machine learning wording: **we trained the model**) can be given analytically and is well known as

$$\hat{\beta} = (\mathbf{X}^\mathrm{H} \mathbf{X} + \alpha^2 \mathbf{I})^{-1} \mathbf{X}^\mathrm{H} \mathbf{y}$$

and consistently results in the left inverse solution for $\alpha^2=0$, i.e. again linear regression with ordinary least squares (OLS).
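
The closed-form solution can be checked numerically. The following is a minimal sketch with NumPy on a small random toy problem; the data, the seed and the helper name `ridge_closed_form` are assumptions made for illustration and are not part of the exercise code. It compares the $\alpha^2=0$ case with `np.linalg.lstsq` and also shows that ridge regression is an ordinary least squares problem on the augmented system $[\mathbf{X}; \alpha\mathbf{I}]\,\beta \approx [\mathbf{y}; \mathbf{0}]$.

```python
import numpy as np

# toy data (hypothetical, for illustration only)
rng_demo = np.random.default_rng(0)
N, F = 50, 4
X = rng_demo.normal(size=(N, F))
beta_true = np.array([3.0, 2.0, 1.0, 0.5])
y = X @ beta_true + rng_demo.normal(scale=0.5, size=N)


def ridge_closed_form(X, y, alpha2):
    """Solve (X^H X + alpha^2 I) beta = X^H y for beta."""
    XH = X.conj().T
    return np.linalg.solve(XH @ X + alpha2 * np.eye(X.shape[1]), XH @ y)


alpha = 1.0
beta_ridge = ridge_closed_form(X, y, alpha2=alpha**2)
beta_ols = ridge_closed_form(X, y, alpha2=0.0)

# alpha^2 = 0 must reproduce the plain least squares solution
beta_lstsq = np.linalg.lstsq(X, y, rcond=None)[0]

# ridge as OLS on the augmented system [X; alpha*I] beta ~ [y; 0]
X_aug = np.vstack((X, alpha * np.eye(F)))
y_aug = np.concatenate((y, np.zeros(F)))
beta_aug = np.linalg.lstsq(X_aug, y_aug, rcond=None)[0]

print('alpha^2=0 equals lstsq:', np.allclose(beta_ols, beta_lstsq))
print('closed form equals augmented OLS:', np.allclose(beta_ridge, beta_aug))
print('ridge shrinks ||beta||:',
      np.linalg.norm(beta_ridge) < np.linalg.norm(beta_ols))
```

Solving the linear system with `np.linalg.solve` avoids forming the matrix inverse explicitly, which is usually preferable numerically.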

We see that the estimated model parameters $\hat{\beta}$ are influenced by the hyper parameter $\alpha^2$.
We know precisely how, as we have discussed the regularized inverted singular values $\frac{\sigma_i}{\sigma_i^2 + \alpha^2}$ in [exercise07_left_inverse_SVD_QR.ipynb](exercise07_left_inverse_SVD_QR.ipynb).
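
The link between the closed-form solution and the regularized inverted singular values can be sketched as follows. The toy data and variable names are assumptions for illustration; the code builds $\hat{\beta}$ from the economy-size SVD $\mathbf{X} = \mathbf{U} \mathbf{S} \mathbf{V}^\mathrm{H}$ and compares it with the solution of the regularized normal equations.

```python
import numpy as np

# toy data (hypothetical, for illustration only)
rng_demo = np.random.default_rng(1)
X = rng_demo.normal(size=(40, 5))
y = rng_demo.normal(size=40)
alpha2 = 0.1

# economy-size SVD: X = U diag(s) Vh
U, s, Vh = np.linalg.svd(X, full_matrices=False)

# regularized inverted singular values sigma_i / (sigma_i^2 + alpha^2)
s_inv_ridge = s / (s**2 + alpha2)

# beta = V diag(s_inv_ridge) U^H y
beta_svd = Vh.conj().T @ (s_inv_ridge * (U.conj().T @ y))

# reference: regularized normal equations
beta_ne = np.linalg.solve(X.T @ X + alpha2 * np.eye(X.shape[1]), X.T @ y)

print('SVD solution equals closed form:', np.allclose(beta_svd, beta_ne))
print('all inverted singular values reduced:', np.all(s_inv_ridge < 1 / s))
```

For $\alpha^2 > 0$ every inverted singular value is smaller than $1/\sigma_i$, and small singular values are damped most strongly.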


In the examples below we will see that
- very **small** $\alpha^2$ produces **high var**iance, but **low** squared **bias**; hence, we potentially **over**fit the model
- very **large** $\alpha^2$ produces **low** **var**iance, but **high** squared **bias**; hence, we potentially **under**fit the model

We could consider an **optimum** regularization amount $\alpha^2$, where the sum $\text{model bias}^2 + \text{model variance}$ is **minimum**. This means that we found an optimum hyper parameter set, which here consists only of the one value $\alpha^2$.
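
For a fixed design matrix and white noise with known standard deviation, these trends can be computed without random sampling. This is a sketch under assumptions made here for illustration (a toy Fourier design on a fixed grid, hypothetical `beta_true` and `sigma`): the ridge prediction is $\hat{\mathbf{y}} = \mathbf{A}\mathbf{y}$ with $\mathbf{A} = \mathbf{X}(\mathbf{X}^\mathrm{H}\mathbf{X} + \alpha^2\mathbf{I})^{-1}\mathbf{X}^\mathrm{H}$, so the mean prediction is $\mathbf{A}\mathbf{y}_\text{true}$ and the prediction covariance is $\sigma^2 \mathbf{A}\mathbf{A}^\mathrm{H}$.

```python
import numpy as np

# toy problem (hypothetical): Fourier features on a fixed grid
N = 64
x = np.linspace(0, 2 * np.pi, N, endpoint=False)
X = np.column_stack([np.ones(N)]
                    + [np.cos(k * x) for k in range(1, 6)]
                    + [np.sin(k * x) for k in range(1, 6)])
beta_true = np.zeros(X.shape[1])
beta_true[:3] = [3, 2, 1]  # only three features belong to the true model
y_true = X @ beta_true
sigma = 2.0  # noise standard deviation

alpha2_vec = np.concatenate(([0.0], np.logspace(-2, 3, 30)))
bias2 = np.zeros(alpha2_vec.size)
var = np.zeros(alpha2_vec.size)
for i, alpha2 in enumerate(alpha2_vec):
    # A maps data y to ridge predictions y_hat = A y
    A = X @ np.linalg.solve(X.T @ X + alpha2 * np.eye(X.shape[1]), X.T)
    # squared bias of the mean prediction, averaged over the grid
    bias2[i] = np.mean((A @ y_true - y_true)**2)
    # prediction variance sigma^2 * diag(A A^T), averaged over the grid
    var[i] = sigma**2 * np.trace(A @ A.T) / N

idx_opt = np.argmin(bias2 + var)
print('alpha^2 with minimum bias^2 + variance:', alpha2_vec[idx_opt])
```

In this toy setting the squared bias grows and the variance shrinks monotonically with $\alpha^2$, so their sum has a minimum at some $\alpha^2 > 0$.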

We should realize that regularization does not solve the general problem of choosing an appropriate design/feature matrix, i.e. an appropriate model and appropriate features. For example, if the true model has $f(x^2)$ and the prediction model is set up for $f(x^3)$, it will be hard to train/predict for negative $x$-values, simply because the true model and the prediction model do not have much in common for these independent variables. So, regularization can only help a little in this example. We could try this as a toy example on our own...
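
One possible version of such a toy example is sketched below. All specifics are assumptions chosen for illustration: the true model is $y = x^2$ without noise, the prediction model uses the features $1$ and $x^3$, training uses only non-negative $x$ and testing only non-positive $x$.

```python
import numpy as np


def design(x):
    """Feature matrix of the mismatched prediction model: [1, x^3]."""
    return np.column_stack((np.ones_like(x), x**3))


x_trn = np.linspace(0, 2, 50)   # training inputs, x >= 0
x_tst = np.linspace(-2, 0, 50)  # test inputs, x <= 0
y_trn = x_trn**2                # true model y = x^2
y_tst = x_tst**2

X_trn, X_tst = design(x_trn), design(x_tst)
mse = {}
for alpha2 in (0.0, 0.1, 10.0):
    beta = np.linalg.solve(X_trn.T @ X_trn + alpha2 * np.eye(2),
                           X_trn.T @ y_trn)
    mse_trn = np.mean((X_trn @ beta - y_trn)**2)
    mse_tst = np.mean((X_tst @ beta - y_tst)**2)
    mse[alpha2] = (mse_trn, mse_tst)
    print('alpha^2 =', alpha2, 'MSE train =', mse_trn, 'MSE test =', mse_tst)
```

For all three penalty values the error for negative $x$ stays far above the training error, because $x^3$ changes sign while $x^2$ does not. The penalty cannot repair this mismatch of the features.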

Useful chapters in textbooks on bias-variance-tradeoff and ridge regression:
- [Bishop 2006] Christopher M. Bishop, *Pattern Recognition and Machine Learning*, Springer, 2006, Chapter 3.2, 3.1.4
- Sergios Theodoridis, *Machine Learning*, Academic Press, 2020, 2nd ed., Chapter 3.9, 3.8
- Kevin P. Murphy, *Machine Learning-A Probabilistic Perspective*, MIT Press, 2012, 1st ed., Chapter 6.4.4, 7.5
- Kevin P. Murphy, *Probabilistic Machine Learning-An Introduction*, MIT Press, 2022, Chapter 4.7.6.3, 11.3
- Trevor Hastie, Robert Tibshirani, Jerome Friedman, *The Elements of Statistical Learning: Data Mining, Inference, and Prediction*, Springer, 2009, 2nd ed., Chapter 2.9, 3.4
- Gareth James, Daniela Witten, Trevor Hastie, Rob Tibshirani, *An Introduction to Statistical Learning with Applications in R*, Springer, 2021, 2nd ed., Chapter 2.2.2, 6.2.1
- Richard O. Duda, Peter E. Hart, David G. Stork, *Pattern Classification*, Wiley, 2000, 2nd ed., Chapter 9.3

In [None]:
# for reproducible outputs
rng = np.random.default_rng(12345)  # used for data creation and shuffling

In [None]:
def create_dataset(M, split=0.8, noise_scale=2, shuffled=True):
    x = np.linspace(0, 2*np.pi, M)  # lin increase
    # shuffle data for simple train/test split handling using [:Ns], [Ns:]
    if shuffled:
        rng.shuffle(x)

    # design/feature matrix of the true model
    X = np.column_stack((np.cos(1*x),
                         np.sin(2*x),
                         np.cos(5*x),
                         np.cos(6*x)))
    # add bias/intercept column to the design/feature matrix
    X = np.hstack((np.ones((M, 1)), X))
    # some nice numbers for the true model parameters beta
    beta = np.array([3, 2, 1, 1/2, 1/4])
    # outcome of true model, i.e. linear combination
    y_true = (X @ beta)[:, None]
    # add measurement noise
    noise = rng.normal(loc=0, scale=noise_scale, size=(M, 1))
    y = y_true + noise

    # design/feature matrix of the prediction model
    # we create a model that can overfit the noisy data
    # as the feature/design matrix contains also non-matching Fourier series
    # components, thus we firstly go for:
    # true model Fourier components, same as above
    X = np.column_stack((np.cos(1*x),
                         np.sin(2*x),
                         np.cos(5*x),
                         np.cos(6*x)))
    # and then add additional Fourier components, that do not explain our
    # y_true but will be sensitive to the measurement noise contained in y
    if True:  # if False can be used for debug
        X = np.column_stack((X,
                             np.cos(2*x),
                             np.cos(3*x),
                             np.cos(4*x),
                             np.cos(7*x),
                             np.cos(8*x),
                             np.cos(9*x),
                             np.cos(10*x),
                             np.sin(1*x),
                             np.sin(3*x),
                             np.sin(4*x),
                             np.sin(5*x),
                             np.sin(6*x),
                             np.sin(7*x),
                             np.sin(8*x),
                             np.sin(9*x),
                             np.sin(10*x)))
    X = np.hstack((np.ones((M, 1)), X))  # add bias

    # split data set into a training data set and a test data set
    Ns = int(split*M)
    X_trn, X_tst = X[:Ns, :], X[Ns:, :]
    y_true_trn, y_true_tst = y_true[:Ns], y_true[Ns:]  # without noise
    y_trn, y_tst = y[:Ns], y[Ns:]  # with measurement noise

    return x, X_trn, X_tst, y_true_trn, y_true_tst, y_trn, y_tst

## Empirical Estimators for Bias and Variance of Ridge Regression Models

In [None]:
noise_scale = 5  # standard deviation for added measurement noise

M = 2**10  # no of rows in X = no of samples in one full data set

Ndatasampling = 2**6  # number of data sets

alpha2_min, alpha2_max = -2, +2
Nalpha = 2**6 + 1

alpha2_vec = np.logspace(alpha2_min, alpha2_max, Nalpha-1)
# insert 0 at the beginning (index 0) for the case 'no regularization'
alpha2_vec = np.insert(alpha2_vec, 0, 0)

# create only one data set to get value for Mtest
x, X_trn, X_tst, y_true_trn, y_true_tst, y_trn, y_tst = create_dataset(M)
Mtest = y_tst.shape[0]

# we use capital Y variable to refer to output data
# of many models stored into columns of matrices Yxxxx
Yh_tst = np.zeros((Mtest, Ndatasampling, Nalpha))
Y_true_tst = np.zeros((Mtest, Ndatasampling, Nalpha))

# in total we train (Ndatasampling x Nalpha) models in the next for loop

In [None]:
for data_idx in range(Ndatasampling):
    # get new data set, split to train/test
    x, X_trn, X_tst, y_true_trn, y_true_tst, y_trn, y_tst = create_dataset(
        M, noise_scale=noise_scale)
    # set up model with training data
    model = OLS(y_trn, X_trn, hasconst=True)
    # vary the ridge regression hyper parameter alpha^2
    for alpha2_idx, alpha2 in enumerate(alpha2_vec):
        # train a model
        results = model.fit_regularized(
            alpha=alpha2, L1_wt=0, profile_scale=False)
        # predict data using the actual model
        Yh_tst[:, data_idx, alpha2_idx] = results.predict(X_tst)
        # either: true model data without noise:
        Y_true_tst[:, data_idx, alpha2_idx] = np.squeeze(y_true_tst)
        # or: take the noisy data
        # Y_true_tst[:, data_idx, alpha2_idx] = np.squeeze(y_tst)
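
One detail on the `alpha` argument is worth checking against the installed statsmodels version. To our reading of the statsmodels documentation, `fit_regularized` minimizes $\frac{1}{2n}\text{RSS} + \texttt{alpha}\,\big((1-\texttt{L1\_wt})\,\frac{1}{2}||\beta||_2^2 + \texttt{L1\_wt}\,||\beta||_1\big)$ with $n$ being the number of training samples. For `L1_wt=0` this would correspond to the ridge formula above with $\alpha^2 = n \cdot \texttt{alpha}$, so the value passed as `alpha` is a scaled version of $\alpha^2$. The sketch below uses NumPy only and toy data (both assumptions for illustration) to confirm that the gradient of this objective vanishes at the ridge solution with the scaled penalty.

```python
import numpy as np

# toy data (hypothetical, for illustration only)
rng_demo = np.random.default_rng(2)
n, F = 100, 3
X = rng_demo.normal(size=(n, F))
y = X @ np.array([1.0, -2.0, 0.5]) + rng_demo.normal(size=n)

alpha_sm = 0.01             # value that would be passed as `alpha`
alpha2_eff = n * alpha_sm   # assumed equivalent alpha^2 in the ridge formula

# ridge solution with the scaled penalty weight
beta = np.linalg.solve(X.T @ X + alpha2_eff * np.eye(F), X.T @ y)

# gradient of 0.5 * RSS / n + 0.5 * alpha_sm * ||beta||^2 at this beta
grad = -X.T @ (y - X @ beta) / n + alpha_sm * beta
print('gradient vanishes:', np.allclose(grad, 0))
```

If this reading is correct, the qualitative findings of the notebook are unaffected, since the penalty is swept on a logarithmic grid. Only the numerical values on the $\alpha^2$ axis are scaled by the number of training samples.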

Next, we calculate the bias and variance using the data we obtained from many training procedures / many models.

Bias and variance are calculated for models that were trained on different train/test data, but for the same hyper parameter $\alpha^2$.
So, for each $\alpha^2$ we obtain a bias and a variance quantity.
We then can estimate that one **optimum hyper parameter** $\alpha^2_\mathrm{opt}$, where the sum bias$^2$ + variance is **minimum**.

See [bias_variance_linear_regression.ipynb](bias_variance_linear_regression.ipynb) for details on the equations for bias and variance.
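
As a compact reminder, the empirical estimators can be sketched in the notation of [Bishop 2006], equations (3.45)-(3.47). The symbols are assumptions for this summary: $\hat{y}^{(l)}(x_n)$ is the prediction of the model trained on the $l$-th data set at the $n$-th test sample, $L$ corresponds to `Ndatasampling`, $N$ corresponds to `Mtest`, and $f(x_n)$ is the noise-free true model output.

$$\bar{y}(x_n) = \frac{1}{L}\sum_{l=1}^{L} \hat{y}^{(l)}(x_n)$$

$$\text{bias}^2 = \frac{1}{N}\sum_{n=1}^{N} \left(\bar{y}(x_n) - f(x_n)\right)^2$$

$$\text{variance} = \frac{1}{N}\sum_{n=1}^{N} \frac{1}{L}\sum_{l=1}^{L} \left(\hat{y}^{(l)}(x_n) - \bar{y}(x_n)\right)^2$$

These equations assume that all $L$ models are evaluated at the same inputs $x_n$. In the code that follows, the role of $f(x_n)$ is taken by `ym_true`, i.e. the mean of the true model data over the data sets.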

In [None]:
# get average prediction, i.e. mean over the L = Ndatasampling models
# which is a numerical eval of the expectation:
ym = np.mean(Yh_tst, axis=1)  # (3.45) in [Bishop 2006]
ym_true = np.mean(Y_true_tst, axis=1)

# get integrated squared bias (numerical eval of the expectation):
# (3.42), (3.46) in [Bishop 2006]
bias_squared = np.mean((ym - ym_true)**2, axis=0)

# get integrated variance (numerical eval of the expectation):
# (3.43), (3.47) in [Bishop 2006]
variance = np.mean(
    np.mean((Yh_tst - np.expand_dims(ym, axis=1))**2, axis=1), axis=0)

# find min for bias_squared+variance
idx = np.argmin(bias_squared+variance)
# get specific alpha^2 for this min
alpha2_opt = alpha2_vec[idx]

In [None]:
fig, axs = plt.subplots(1, 1, figsize=(8, 4))
axs.plot(alpha2_vec, bias_squared, 'C0', label=r'bias$^2$', lw=2)
axs.plot(alpha2_vec, variance, 'C1', label=r'var')
axs.plot(alpha2_vec, bias_squared+variance, 'C2', label=r'bias$^2$+var')

axs.plot(alpha2_opt, bias_squared[idx], 'C0o')
axs.plot(alpha2_opt, variance[idx], 'C1o')
axs.plot(alpha2_opt, bias_squared[idx] + variance[idx], 'C2o')

axs.set_xscale('log')
axs.set_yscale('log')
axs.set_xlabel(r'regularization value $\alpha^2$')
axs.set_title(r'$\alpha^2_\mathrm{opt}$='+'{:4.3f}'.format(alpha2_vec[idx]))
axs.legend()
axs.set_xlim(10**alpha2_min, 10**alpha2_max)
axs.set_ylim(1e-2, 1e1)
axs.grid(True)

## Compare prediction data to true model data and to noisy data for different regularization

Let us next check and visualize what different hyper parameters $\alpha^2$ do in terms of regularization, i.e. in terms of model complexity, when we predict data and compare this to the true model data and the noisy data.
As we draw data from the same sampling distribution as above, we can assume that the estimated $\alpha^2_\mathrm{opt}$
still holds.
So we present four cases for $\alpha^2$:
- $\alpha^2 = 0$
- $\alpha^2 = 0.01$
- $\alpha^2 = \alpha^2_\mathrm{opt}$
- $\alpha^2 = 100$

Recall that typically
- very **small** $\alpha^2$ produces **high var**iance, but **low** squared **bias**; hence, we potentially **over**fit the model
- very **large** $\alpha^2$ produces **low** **var**iance, but **high** squared **bias**; hence, we potentially **under**fit the model

For convenience (i.e. easier data handling), we use the same data for training/fitting and testing/predicting.
We should never do this in real applications, but here we can go for it, as we are interested in the essence of what's going on with simple regularization.

In [None]:
# split=1 here, i.e. we realize handling: train data == test data
# we do this to conveniently show y(x) for shuffled=False data
x, X_trn, X_tst, y_true_trn, y_true_tst, y_trn, y_tst = create_dataset(
    M, split=1, noise_scale=noise_scale, shuffled=False)
model = OLS(y_trn, X_trn, hasconst=True)

fig, axs = plt.subplots(2, 2, figsize=(10, 5))
fig2, axs2 = plt.subplots(1, 1, figsize=(10, 3))

# model with alpha^2=0 (stored in alpha2_vec[0])
results = model.fit_regularized(alpha=0, L1_wt=0, profile_scale=False)
yh = results.predict(X_trn)
axs2.plot(np.arange(X_trn.shape[1]), results.params, 'C0o-',
          label=r'$\alpha^2$='+'{:4.3f}'.format(alpha2_vec[0]))
axs[0, 0].plot(x, y_trn, 'C0')
axs[0, 0].plot(x, y_true_trn, 'C1')
axs[0, 0].plot(x, yh, 'C3')
axs[0, 0].set_title(r'$\alpha^2$='+'{:4.3f}'.format(alpha2_vec[0]))

# model with alpha^2=0.01 (stored in alpha2_vec[1])
results = model.fit_regularized(
    alpha=alpha2_vec[1], L1_wt=0, profile_scale=False)
yh = results.predict(X_trn)
axs2.plot(np.arange(X_trn.shape[1]), results.params, 'C1o:',
          label=r'$\alpha^2$='+'{:4.3f}'.format(alpha2_vec[1]))
axs[0, 1].plot(x, y_trn, 'C0')
axs[0, 1].plot(x, y_true_trn, 'C1')
axs[0, 1].plot(x, yh, 'C3')
axs[0, 1].set_title(r'$\alpha^2$='+'{:4.3f}'.format(alpha2_vec[1]))

# model with optimum alpha^2 (stored in alpha2_opt)
results = model.fit_regularized(alpha=alpha2_opt, L1_wt=0, profile_scale=False)
yh = results.predict(X_trn)
axs2.plot(np.arange(X_trn.shape[1]), results.params, 'C2o-',
          label=r'$\alpha^2_\mathrm{opt}$='+'{:4.3f}'.format(alpha2_opt))
axs[1, 0].plot(x, y_trn, 'C0')
axs[1, 0].plot(x, y_true_trn, 'C1')
axs[1, 0].plot(x, yh, 'C3')
axs[1, 0].set_title(r'$\alpha^2_\mathrm{opt}$='+'{:4.3f}'.format(alpha2_opt))

# model with alpha^2=100 (stored in alpha2_vec[-1])
results = model.fit_regularized(
    alpha=alpha2_vec[-1], L1_wt=0, profile_scale=False)
yh = results.predict(X_trn)
axs2.plot(np.arange(X_trn.shape[1]), results.params, 'C3o-',
          label=r'$\alpha^2$='+'{:4.3f}'.format(alpha2_vec[-1]))
axs[1, 1].plot(x, y_trn, 'C0', label='y_train (with noise)')
axs[1, 1].plot(x, y_true_trn, 'C1', label='y_true_train (w/o noise)')
axs[1, 1].plot(x, yh, 'C3', label='predicted y from X_trn')
axs[1, 1].set_title(r'$\alpha^2$='+'{:4.3f}'.format(alpha2_vec[-1]))


for i in range(2):
    for j in range(2):
        axs[i, j].grid(True)
        axs[i, j].set_xlabel('x')
        axs[i, j].set_ylabel('y')
        axs[i, j].set_xlim(x[0], x[-1])
        axs[i, j].set_ylim(-7, 13)
axs[1, 1].legend()
fig.tight_layout()

axs2.legend()
axs2.set_xlabel(
    r'$\beta$ coefficient index, true model features 0...4, features >4 contribute to overfit')
axs2.set_ylabel(r'$\beta$ value')
axs2.set_title('prediction model parameters')
axs2.set_xticks(np.arange(X_trn.shape[1]))
axs2.grid(True)
fig2.tight_layout()

## Estimate optimum regularization value for one specific data set

The approach discussed below is known as **hyper parameter tuning**.
The essence is as follows:

We have a certain finite amount of data and want to train a regularized model.
We here go for ridge regression, so the model parameters are $\beta$ and there is one hyper parameter $\alpha^2$.
We split the data into a training set and a test set.
We train and test many models where we vary $\alpha^2$.
The model with the best performance exhibits $\alpha^2 = \alpha^2_\mathrm{opt}$.
Finding this value is known as hyper parameter tuning.
Afterwards, the model is trained for the optimum $\beta$ weights using this $\alpha^2_\mathrm{opt}$.

**Note that for real applications**, the final performance check should use a test data set, which was never used for the hyper parameter tuning.
This implies that the full data set is actually split into at least three sets:

- **train** (train model in hyper parameter tuning stage)
- **dev** (predict data in hyper parameter tuning stage)
- **test** (use only once to predict data for final performance estimation)

For final training (i.e. using the optimum hyper parameter $\alpha^2$ to estimate optimum model weights $\beta$) often either the train data or the train+dev data is used for training.

Below, for convenience, we only use a split into two data sets. Splitting into train, dev, test might be a nice homework...
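
A possible starting point for this homework is sketched below. It uses NumPy only; the toy data, the 60/20/20 split ratios and the helper name `ridge_fit` are assumptions for illustration and differ from the data set used in this notebook. The penalty is tuned on the noisy dev data, the final model is trained on train+dev, and the test data is used once.

```python
import numpy as np

# toy data (hypothetical): random x, Fourier features, noisy y
rng_demo = np.random.default_rng(3)
M = 300
x = rng_demo.uniform(0, 2 * np.pi, M)
X = np.column_stack([np.ones(M)]
                    + [np.cos(k * x) for k in range(1, 9)]
                    + [np.sin(k * x) for k in range(1, 9)])
y = 3 + 2 * np.cos(x) + np.sin(2 * x) + rng_demo.normal(scale=2, size=M)

# split: 60 % train, 20 % dev, 20 % test (x is random, no shuffling needed)
i1, i2 = int(0.6 * M), int(0.8 * M)
X_trn, X_dev, X_tst = X[:i1], X[i1:i2], X[i2:]
y_trn, y_dev, y_tst = y[:i1], y[i1:i2], y[i2:]


def ridge_fit(X, y, alpha2):
    """Closed-form ridge solution for real-valued X."""
    return np.linalg.solve(X.T @ X + alpha2 * np.eye(X.shape[1]), X.T @ y)


# hyper parameter tuning: train on train set, evaluate on dev set
alpha2_vec = np.concatenate(([0.0], np.logspace(-2, 3, 40)))
mse_dev = np.array([np.mean((X_dev @ ridge_fit(X_trn, y_trn, a2) - y_dev)**2)
                    for a2 in alpha2_vec])
alpha2_opt = alpha2_vec[np.argmin(mse_dev)]

# final training on train+dev, single performance check on test set
beta = ridge_fit(np.vstack((X_trn, X_dev)),
                 np.concatenate((y_trn, y_dev)), alpha2_opt)
mse_tst = np.mean((X_tst @ beta - y_tst)**2)
print('alpha2_opt =', alpha2_opt, ', test MSE =', mse_tst)
```

Note that the dev and test errors are computed against the noisy data here, since the true model data is not available in practice.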

In [None]:
# we use very noisy y data
# we have split=1 and shuffled=False here, as we split and shuffle manually
# below
# we do this because we want to concatenate and re-sort the data after
# training / testing in order to plot the data nicely
x, X_trn, _, y_true_trn, _, y_trn, _ = create_dataset(
    M, split=1, noise_scale=10, shuffled=False)

if False:  # we could use just the features that correspond to the true model
    X_trn = X_trn[:, 0:5]

# for shuffling data
idx = np.arange(M)
rng.shuffle(idx)
# shuffle data
x = x[idx]
X = X_trn[idx, :]
y_true = y_true_trn[idx]
y = y_trn[idx]
# split data
split = 0.8  # 80 % go into training data, 20% into test data
Ns = int(split*M)
X_trn, X_tst = X[:Ns, :], X[Ns:, :]
y_true_trn, y_true_tst = y_true[:Ns], y_true[Ns:]  # without noise
y_trn, y_tst = y[:Ns], y[Ns:]  # with measurement noise
# set up OLS model
model = OLS(y_trn, X_trn, hasconst=True)
# we use capital Y to refer to output data of many models stored into matrices
Yh_trn = np.zeros((Ns, Nalpha))
Yh_tst = np.zeros((M-Ns, Nalpha))
# train/predict for different regularization
for alpha2_idx, alpha2 in enumerate(alpha2_vec):
    results = model.fit_regularized(alpha=alpha2, L1_wt=0, profile_scale=False)
    Yh_trn[:, alpha2_idx] = results.predict(X_trn)
    Yh_tst[:, alpha2_idx] = results.predict(X_tst)
# residual check for train / test data compared with
# true! data (which we don't have in practice)
SSE_trn = np.sum((Yh_trn - y_true_trn)**2, axis=0)
SSE_tst = np.sum((Yh_tst - y_true_tst)**2, axis=0)
# get optimum alpha2 where smallest SSE_trn+SSE_tst is obtained
alpha2_opt_idx = np.argmin(SSE_trn+SSE_tst)
alpha2_opt = alpha2_vec[alpha2_opt_idx]
print(alpha2_opt)

plt.figure(figsize=(10, 3))
plt.plot(alpha2_vec, SSE_trn, 'C0', label='sum squared errors train data')
plt.plot(alpha2_vec, SSE_tst, 'C2', label='sum squared errors test data')
plt.plot(alpha2_vec, SSE_trn+SSE_tst, 'C3', label='SSE train + SSE test')
plt.plot(alpha2_vec[alpha2_opt_idx],
         SSE_trn[alpha2_opt_idx]+SSE_tst[alpha2_opt_idx], 'C3o')
plt.xscale('log')
plt.yscale('log')
plt.xlim(10**alpha2_min, 10**alpha2_max)
plt.xlabel(r'regularization value $\alpha^2$')
plt.title(r'$\alpha^2_\mathrm{opt}$='+'{:4.3f}'.format(alpha2_opt))
plt.legend()
plt.grid(True)
plt.tight_layout()

In [None]:
# concatenate train and test data
# undo the shuffling, i.e. restore the original order, such that x is increasing
y_true = np.concatenate((y_true_trn, y_true_tst))
y_true = y_true[np.argsort(idx)]
y = np.concatenate((y_trn, y_tst))
y = y[np.argsort(idx)]
yh = np.concatenate((Yh_trn[:, alpha2_opt_idx], Yh_tst[:, alpha2_opt_idx]))
yh = yh[np.argsort(idx)]
# plot y(x)
plt.figure(figsize=(10,3))
plt.plot(x[np.argsort(idx)], y, 'C0', label='measured y (with noise)')
plt.plot(x[np.argsort(idx)], y_true, 'C1', label='true y (w/o noise)')
plt.plot(x[np.argsort(idx)], yh, 'C3', label='predicted y')
plt.xlabel('x')
plt.ylabel('y')
plt.xlim(x[np.argsort(idx)][0], x[np.argsort(idx)][-1])
plt.ylim(-5,10)
plt.legend()
plt.grid(True)
plt.tight_layout()

## Copyright

- the notebooks are provided as [Open Educational Resources](https://en.wikipedia.org/wiki/Open_educational_resources)
- the text is licensed under [Creative Commons Attribution 4.0](https://creativecommons.org/licenses/by/4.0/)
- the code of the IPython examples is licensed under the [MIT license](https://opensource.org/licenses/MIT)
- feel free to use the notebooks for your own purposes
- please attribute the work as follows: *Frank Schultz, Data Driven Audio Signal Processing - A Tutorial Featuring Computational Examples, University of Rostock* ideally with relevant file(s), github URL https://github.com/spatialaudio/data-driven-audio-signal-processing-exercise, commit number and/or version tag, year.