Sascha Spors,
Professorship Signal Theory and Digital Signal Processing,
Institute of Communications Engineering (INT),
Faculty of Computer Science and Electrical Engineering (IEF),
University of Rostock,
Germany

# Data Driven Audio Signal Processing - A Tutorial with Computational Examples

Winter Semester 2022/23 (Master Course #24512)

- lecture: https://github.com/spatialaudio/data-driven-audio-signal-processing-lecture
- tutorial: https://github.com/spatialaudio/data-driven-audio-signal-processing-exercise

Feel free to contact lecturer frank.schultz@uni-rostock.de

# Trade-Off Between Bias^2 / Variance and Check for Model Complexity

- we use plain **ordinary least squares** (OLS) based **linear regression** to discuss a fundamental aspect of learning from data, i.e. of creating prediction models
- this aspect is known as the bias-variance trade-off
- in general, we can split the expected squared error between (noisy) observed data and predicted model data into three components
$$\text{model bias}^2 + \text{model variance} + \text{noise variance}$$
- a model will never explain all variance (nor should a useful, robust model do so), so some noise variance always remains
- we can influence the model bias and the model variance by the choice of the model
- however, we cannot achieve the lowest model bias *and* the lowest model variance at the same time when reducing the overall prediction error
- we therefore need to find a good compromise between bias and variance and, in particular, we need to avoid two extremes
    - underfit case, with typically too low model complexity yielding high bias and low variance
    - overfit case, with typically too high model complexity yielding low bias and high variance
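
For a squared error loss, the three components above can be written out explicitly. The following is a sketch that loosely follows the notation of [Bishop 2006], Chapter 3.2. The symbols are assumptions of this note and are not used elsewhere in the notebook: $h(x)$ is the true (noise-free) model, $t = h(x) + \epsilon$ is a noisy observation with zero-mean noise $\epsilon$ of variance $\sigma^2$ that is independent of the training data, and $\hat{y}(x;\mathcal{D})$ is the prediction of a model trained on the data set $\mathcal{D}$.

$$\mathbb{E}_{\mathcal{D},t}\left[\left(\hat{y}(x;\mathcal{D}) - t\right)^2\right] = \underbrace{\left(\mathbb{E}_{\mathcal{D}}\left[\hat{y}(x;\mathcal{D})\right] - h(x)\right)^2}_{\text{model bias}^2} + \underbrace{\mathbb{E}_{\mathcal{D}}\left[\left(\hat{y}(x;\mathcal{D}) - \mathbb{E}_{\mathcal{D}}\left[\hat{y}(x;\mathcal{D})\right]\right)^2\right]}_{\text{model variance}} + \underbrace{\sigma^2}_{\text{noise variance}}$$

Only the first two terms depend on the chosen model; averaging them over all $x$ gives the integrated bias$^2$ and the integrated variance.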

In this notebook we therefore check **over**-/**underfitting** via bias$^2$/variance quantities and $R_{\text{adjusted}}^2$ on models that were trained and predicted on noisy data (note here: **train data=test data**).

For this toy example we know the true (noise-free) data, because we know the linear model equation that creates these data; hence, we can be fairly confident in our interpretation of the performance of the different models.
In practice, the model equation is unknown, so we must carefully check our model estimates for over-/underfitting.

A robust prediction model should have a **reasonable trade-off between bias^2/variance** and a reasonably **high** $R_{\text{adjusted}}^2$ **mean** but **low** $R_{\text{adjusted}}^2$ **variance** (see this notebook).
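
As a reminder of what $R_{\text{adjusted}}^2$ measures, here is a minimal numpy-only sketch. It assumes the common definition $R_{\text{adjusted}}^2 = 1 - (1 - R^2)\,\frac{N-1}{N - \mathrm{rank}(\mathbf{X})}$ for a design matrix that contains an intercept column. The helper `r_squared_adjusted` and the variables `n_obs`, `t`, `X_small`, `X_large` are illustrative names and not part of the notebook code.

```python
import numpy as np


def r_squared_adjusted(y, X):
    """Return (R^2, adjusted R^2) of an OLS fit.

    Assumption: X already contains an intercept column.
    """
    n = X.shape[0]
    # least-squares solution of X beta = y
    beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
    residuals = y - X @ beta_hat
    ss_res = np.sum(residuals**2)  # residual sum of squares
    ss_tot = np.sum((y - np.mean(y))**2)  # centered total sum of squares
    r2 = 1 - ss_res / ss_tot
    # residual degrees of freedom: samples minus independent columns
    df_resid = n - np.linalg.matrix_rank(X)
    r2_adj = 1 - (1 - r2) * (n - 1) / df_resid
    return r2, r2_adj


rng = np.random.default_rng(1)
n_obs = 64
t = np.linspace(0, 2*np.pi, n_obs)
y_noisy = 3 + 2*np.cos(t) + rng.normal(scale=1/3, size=n_obs)

# small model: intercept + the one feature that generated the data
X_small = np.column_stack((np.ones(n_obs), np.cos(t)))
# large model: additionally 20 features unrelated to the data
X_large = np.column_stack((X_small, rng.normal(size=(n_obs, 20))))

r2_small, r2_adj_small = r_squared_adjusted(y_noisy, X_small)
r2_large, r2_adj_large = r_squared_adjusted(y_noisy, X_large)
print('small model: R^2 =', r2_small, ', R^2_adj =', r2_adj_small)
print('large model: R^2 =', r2_large, ', R^2_adj =', r2_adj_large)
```

Plain $R^2$ cannot decrease when features are added to a nested model, whereas the adjusted version penalizes every additional parameter. This is why it is the more useful quantity when models of different complexity are compared.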

A robust prediction model should predict **reasonable outcomes for unknown input data**, such that it **generalizes well** to **unseen data**. This is part of another notebook, see [bias_variance_ridge_regression.ipynb](bias_variance_ridge_regression.ipynb).

Useful chapters in textbooks on this fundamental aspect:
- [Bishop 2006] Christopher M. Bishop, *Pattern Recognition and Machine Learning*, Springer, 2006, Chapter 3.2
- Sergios Theodoridis, *Machine Learning*, Academic Press, 2020, 2nd ed., Chapter 3.9
- Kevin P. Murphy, *Machine Learning: A Probabilistic Perspective*, MIT Press, 2012, 1st ed., Chapter 6.4.4
- Kevin P. Murphy, *Probabilistic Machine Learning: An Introduction*, MIT Press, 2022, Chapter 4.7.6.3
- Trevor Hastie, Robert Tibshirani, Jerome Friedman, *The Elements of Statistical Learning: Data Mining, Inference, and Prediction*, Springer, 2009, 2nd ed., Chapter 2.9
- Gareth James, Daniela Witten, Trevor Hastie, Rob Tibshirani, *An Introduction to Statistical Learning with Applications in R*, Springer, 2021, 2nd ed., Chapter 2.2.2
- Richard O. Duda, Peter E. Hart, David G. Stork, *Pattern Classification*, Wiley, 2000, 2nd ed., Chapter 9.3

In [None]:
import matplotlib.pyplot as plt
import numpy as np
from statsmodels.api import OLS

## True Model and Its Data
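
Written out, the true model that is built in the following code cell is a weighted sum of an intercept and four sinusoidal features, with the weights taken from `beta`:

$$y(x) = 3 + 2\cos(x) + \sin(2x) + \frac{1}{2}\cos(5x) + \frac{1}{4}\cos(6x), \quad 0 \leq x \leq 2\pi$$

In matrix notation this is $\mathbf{y} = \mathbf{X}\boldsymbol{\beta}$, where the design matrix $\mathbf{X}$ has $N = 2^8$ rows (one per sample of $x$) and five columns (intercept plus four features).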

In [None]:
# number of observations / samples:
N = 2**8
# true model with x as input variable to create 4 features:
x = np.linspace(0, 2*np.pi, N)
X = np.column_stack((np.cos(x),
                     np.sin(2*x),
                     np.cos(5*x),
                     np.cos(6*x)))
# add a bias/intercept column to the design/feature matrix:
X = np.hstack((np.ones((X.shape[0], 1)), X))
hasconst = True
# some nice numbers for the true model parameters beta:
beta = np.array([3, 2, 1, 1/2, 1/4])

In [None]:
# generate 'true' data with the design matrix of 'true' model
y = np.dot(X, beta)
plt.figure(figsize=(5, 3))
plt.plot(y, 'k-')
plt.xlabel("independent features' input variable x")
plt.ylabel('dependent variable y')
plt.title('true model data as linear model (x -> 4 features + intercept)')
plt.xlim(0, N)
plt.ylim(-2, 8)
plt.grid(True)
print(X.shape, y.shape)

## Function for Train / Predict and Calc Bias^2 / Variance 

In [None]:
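def bias_variance_of_model(X, noise_scale=1/3):
    """Fit L OLS models on noisy data, plot bias^2, variance and R^2_adj.

    Note: uses the globals N, y (true model data) and axs (2x2 subplot axes).
    """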
    # add bias column to the design matrix
    X = np.hstack((np.ones((X.shape[0], 1)), X))
    hasconst = True
    print('\nshape of model/feature matrix X:',
          X.shape,
          '\nrank of matrix X / # of model parameters:',
          np.linalg.matrix_rank(X))
    # init random number generator to reproduce results
    rng = np.random.default_rng(12345)
    # generate L data sets with added noise
    L = 2**7
    noise = rng.normal(loc=0, scale=noise_scale, size=(N, L))
    Yn = y[:, None] + noise
    # alloc memory for all predictions
    Yhat = np.zeros((N, L))
    rsquared_adj = np.zeros(L)
    # train and predict L models on these L data sets
    for i in range(L):
        model = OLS(Yn[:, i], X, hasconst=hasconst)  # OLS model
        results = model.fit()  # train the model
        Yhat[:, i] = results.predict(X)  # predict outcome
        rsquared_adj[i] = results.rsquared_adj

    # get average prediction, i.e. mean over the L models
    # which is a numerical eval of the expectation:
    ym = np.mean(Yhat, axis=1)  # (3.45) in [Bishop 2006]

    # get integrated squared bias (numerical eval of the expectation):
    # note: y is the true model data
    bias_squared = np.mean((ym - y)**2)  # (3.42), (3.46) in [Bishop 2006]

    # get integrated variance (numerical eval of the expectation):
    variance = np.mean(
        np.mean((Yhat - ym[:, None])**2, axis=1),
        axis=0)  # (3.43), (3.47) in [Bishop 2006]

    for i in range(L):
        axs[0, 0].plot(Yn[:, i])
        axs[0, 1].plot(Yhat[:, i])

    axs[0, 1].plot(y, 'k-', label='true model')

    axs[0, 1].plot(np.mean(Yhat, axis=1), ':',
                   color='gold', label=r'$\mu(\hat{Y})$')

    axs[0, 1].plot(np.mean(Yhat, axis=1) + np.std(Yhat, axis=1), '--', lw=0.75,
                   color='gold', label=r'$\mu(\hat{Y}) \pm \sigma(\hat{Y})$')
    axs[0, 1].plot(np.mean(Yhat, axis=1) - np.std(Yhat, axis=1), '-.', lw=0.75,
                   color='gold')

    axs[0, 1].set_title(r'bias$^2$='+'{:4.3f}'.format(
        bias_squared)+', var='+'{:4.3f}'.format(
        variance)+r', bias$^2$+var='+'{:4.3f}'.format(
        bias_squared+variance))
    for i in range(2):
        axs[0, i].set_xlim(0, N)
        axs[0, i].set_ylim(-2, 8)
        axs[0, i].grid(True)
        axs[0, i].set_xlabel("independent features' input variable x")
    axs[0, 0].set_ylabel('dependent variable yn')
    axs[0, 1].set_ylabel('predicted variable yhat')
    axs[0, 1].legend()

    axs[1, 0].plot(rsquared_adj)
    axs[1, 0].set_title(r'$\hat{\mu}(R_{adj}^2)='+'{:4.3f}$'.format(
        np.mean(rsquared_adj))+r', $\hat{\sigma}(R_{adj}^2)='+'{:4.3f}$'.format(
        np.std(rsquared_adj)))
    axs[1, 0].set_xlim(0, L)
    axs[1, 0].set_ylim(0, 1)
    axs[1, 0].set_xlabel('model index')
    axs[1, 0].set_ylabel(r'$R_{adj}^2$')
    axs[1, 0].grid(True)

    axs[1, 1].set_xlabel('intentionally empty')

    plt.tight_layout()

    print('bias^2 + variance  = ', bias_squared+variance)

## Check Models
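
Before looking at the numerical results, it helps to know what to expect. For OLS with train data = test data, the prediction is the orthogonal projection of the noisy data onto the column space of the design matrix. Assuming i.i.d. zero-mean noise with variance $\sigma^2$, the integrated variance is then $\sigma^2 \cdot \mathrm{rank}(\mathbf{X}) / N$, and the integrated bias$^2$ is the mean squared distance between the true data and its projection. The following numpy-only sketch evaluates these closed-form values for three of the models checked below. The names `expected_bias2_variance`, `n_obs`, `t` and `y_true` are illustrative and not part of the notebook code.

```python
import numpy as np

n_obs = 2**8
sigma = 1/3  # std of the added noise, same value as the default noise_scale
t = np.linspace(0, 2*np.pi, n_obs)
# true model data: same features and weights as used in this notebook
y_true = 3 + 2*np.cos(t) + np.sin(2*t) + np.cos(5*t)/2 + np.cos(6*t)/4


def expected_bias2_variance(features):
    """Closed-form integrated bias^2 and variance for OLS (train = test)."""
    A = np.column_stack((np.ones(n_obs), features))  # add intercept column
    # project the true data onto the column space of A
    coef, _, rank, _ = np.linalg.lstsq(A, y_true, rcond=None)
    y_proj = A @ coef
    bias2 = np.mean((y_proj - y_true)**2)
    variance = sigma**2 * rank / n_obs  # sigma^2 * trace(projection) / N
    return bias2, variance


models = {
    'line': t[:, None],
    'two features': np.column_stack((np.cos(t), np.sin(2*t))),
    'true features': np.column_stack((np.cos(t), np.sin(2*t),
                                      np.cos(5*t), np.cos(6*t))),
}

results = {}
for name, features in models.items():
    results[name] = expected_bias2_variance(features)
    print(name, ': bias^2 =', results[name][0],
          ', variance =', results[name][1])
```

The values printed by `bias_variance_of_model()` should be close to these expectations, but not identical, because they are averages over a finite number of noisy data sets. Note that the variance grows with the number of model parameters, so the simplest model has the lowest variance and the highest bias$^2$.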

In [None]:
# we take just a simple line equation model y = beta1 x + beta0 here
# note that the intercept is only added in function bias_variance_of_model(X)
X = np.copy(x)[:, None]
fig, axs = plt.subplots(2, 2, figsize=(10, 5))
bias_variance_of_model(X)
axs[0, 0].set_title('underfit, too low model complexity, high bias, low var');

In [None]:
# we take a Fourier series expansion model here
X = np.column_stack((np.cos(x), np.sin(x)))  # init with first two features
# add more features according to a Fourier series expansion
# m < N//2 makes sure we do not use more model parameters than signal samples
# (2*(N//2-1) features + intercept = N-1 parameters), so that we can solve
# this as a least-squares problem, i.e. using the left-inverse
for m in range(2, N//2):
    X = np.column_stack((X, np.sin(m*x), np.cos(m*x)))
fig, axs = plt.subplots(2, 2, figsize=(10, 5))
bias_variance_of_model(X)
axs[0, 0].set_title('overfit, too high model complexity, low bias, high var');

In [None]:
# we take all features of the true model here
# (we generally do not know this exactly in practice)
X = np.column_stack((np.cos(x),
                     np.sin(2*x),
                     np.cos(5*x),
                     np.cos(6*x)))
fig, axs = plt.subplots(2, 2, figsize=(10, 5))
bias_variance_of_model(X)  # lowest bias^2+variance of the models checked
# here, because we know the true model (which will rarely occur in practice)
# the remaining variance is from the added noise
axs[0, 0].set_title('true model features, lowest bias^2+var');

In [None]:
# we take only the first two features of the true model
# as these oscillations explain much of the dependent variable y
X = np.column_stack((np.cos(x),
                     np.sin(2*x)))
fig, axs = plt.subplots(2, 2, figsize=(10, 5))
bias_variance_of_model(X)
axs[0, 0].set_title('reasonable bias/var trade-off if true model is unknown');

## Copyright

- the notebooks are provided as [Open Educational Resources](https://en.wikipedia.org/wiki/Open_educational_resources)
- the text is licensed under [Creative Commons Attribution 4.0](https://creativecommons.org/licenses/by/4.0/)
- the code of the IPython examples is licensed under the [MIT license](https://opensource.org/licenses/MIT)
- feel free to use the notebooks for your own purposes
- please attribute the work as follows: *Frank Schultz, Data Driven Audio Signal Processing - A Tutorial Featuring Computational Examples, University of Rostock* ideally with relevant file(s), github URL https://github.com/spatialaudio/data-driven-audio-signal-processing-exercise, commit number and/or version tag, year.