Sascha Spors,
Professorship Signal Theory and Digital Signal Processing,
Institute of Communications Engineering (INT),
Faculty of Computer Science and Electrical Engineering (IEF),
University of Rostock,
Germany

# Data Driven Audio Signal Processing - A Tutorial with Computational Examples

Winter Semester 2021/22 (Master Course #24512)

- lecture: https://github.com/spatialaudio/data-driven-audio-signal-processing-lecture
- tutorial: https://github.com/spatialaudio/data-driven-audio-signal-processing-exercise

Feel free to contact lecturer frank.schultz@uni-rostock.de

# Exercise 1: Introduction to DDASP

## Planned Schedule for DDASP

01. 21.10.: Introduction
02. 28.10.: STFT
03. 04.11.: SVD 1
04. 11.11.: SVD 2
05. 18.11.: Regression vs. Clustering I
06. 25.11.: Regression vs. Clustering II
07. 02.12.: Optimization in Machine Learning / Gradient Descent
08. 09.12.: Model Parameters
09. 16.12.: DNN 1
10. 06.01.: DNN 2
11. 13.01.: CNN 1
12. 20.01.: CNN 2
13. 27.01.: Autoencoder 

## General Objective

- for engineers, **understanding the essence** of a concept is more important than a strict mathematical proof
    - as engineers we can leave proofs to mathematicians
    - *example*: understanding the 4 matrix subspaces and the matrix (pseudo)-inverse based on the SVD is essential, whereas proofs on this fundamental topic are nice to have
- understand building blocks of machine learning for audio data processing
- create simple tool chains from these building blocks
- create simple applications from these tool chains
- get an impression about real industrial applications
- get in touch with scientific literature
    - where to find, how to read
    - here we will find the latest tool chain inventions (if published at all; a lot of material is either unavailable due to company secrets, or only patent specifications exist, which usually omit heavy math)
    - interpretation of results
    - reproducibility
    - re-invent tool chain
- get in touch with major software libraries (in Python), see below

## Best Engineering Practice

- engineering is about creating tools (using existing tools)
- models are tools and thus fit the engineering community perfectly, so building models is nothing new for us
- know your tools in great detail
- responsibility, ethics, morals
- check our task
    - critical reflection (higher good vs. earning money)
    - do we really need machine support here
    - if so, how can machines support us here, how do humans solve this task
    - what do machines do better here than humans and vice versa
    - what is our expectation of the model performance
    - handcrafted model vs. machine learned model (problem: model transparency)
- ...

## Established Procedure
for structured development of data-driven methods (cf. the lecture)

1. Definition of problem and performance measures
2. Data preparation and feature extraction
3. Spot check potential model architectures
4. Model selection
5. Evaluation and reporting
6. Application
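
The six steps above can be illustrated with a deliberately tiny toy problem. The following is only a sketch under stated assumptions: the synthetic data, the polynomial candidate models, the split sizes and the helper names `mse()` and `features()` are all hypothetical choices made for illustration, not the procedure used in the lecture.

```python
import numpy as np

rng = np.random.default_rng(1)


# 1. problem: predict y from x (regression), performance measure: MSE
def mse(y_true, y_pred):
    return np.mean((y_true - y_pred)**2)


# 2. data preparation: toy data (assumption), split into train/val/test
N = 300
x = rng.uniform(-1, 1, N)
y = np.sin(np.pi * x) + 0.1 * rng.standard_normal(N)
x_train, x_val, x_test = x[:200], x[200:250], x[250:]
y_train, y_val, y_test = y[:200], y[200:250], y[250:]


def features(x, order):
    # feature extraction: polynomial features 1, x, x^2, ..., x^order
    return np.vander(x, order + 1, increasing=True)


# 3. spot check some candidate models (here: polynomial orders)
val_error = {}
weights = {}
for order in (1, 3, 5, 9):
    w, *_ = np.linalg.lstsq(features(x_train, order), y_train, rcond=None)
    weights[order] = w
    val_error[order] = mse(y_val, features(x_val, order) @ w)

# 4. model selection: smallest error on the validation set
best_order = min(val_error, key=val_error.get)

# 5. evaluation and reporting on the so far untouched test set
test_error = mse(y_test, features(x_test, best_order) @ weights[best_order])
print('validation MSE per order:', val_error)
print('selected order:', best_order, ', test MSE:', test_error)

# 6. application: predict for new, unseen input
x_new = np.array([0.25])
y_new = features(x_new, best_order) @ weights[best_order]
print('prediction for x = 0.25:', y_new)
```

The test set is only touched once in step 5, so the reported error is not biased by the model selection in step 4.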


## Useful Python Packages

- `numpy` for matrix/tensor algebra
- `scipy` for important science math stuff
- `pandas` for data handling
- `scikit-learn` for predictive data analysis, machine learning
- `matplotlib` for plotting
- `tensorflow` for deep learning with DNNs, CNNs...
- `pytorch` for deep learning with DNNs, CNNs..., audio handling
- `librosa`+`ffmpeg` music/audio analysis + en-/decoding/stream support

## Applications for Machine Learning in Audio

Some examples for applications are given below. Nowadays industrial applications use a combination of different ML techniques to provide an intended consumer service. 

- supervised learning (mostly prediction by classification / regression)
    - query by humming
    - music/genre recognition & recommendation
    - speech recognition
    - disease prediction by breathing sound analysis
    - acoustic surveillance of machines (keypad noise to text?!)
    - gun shot / alert sound detection
    - beam forming / direction of arrival (DOA)
    - composing (cf. Beethoven Symphony No. 10)
    - deep audio fakes (human-made vs. machine-made replica)
    - Auto EQ (mix should sound as reference mix?!)
- unsupervised learning (mostly clustering, dimensionality reduction)
    - noise reduction
    - echo cancellation
    - feedback cancellation
    - speech / language recognition
    - compression
    - feature creation (typical spectrum of pop music, classical...)
    - feature calculation (perceived loudness, cf. replay gain adaption) 
    - key recognition
- reinforcement learning
    - human tasks: how to compose a hit single, how to mix a hit single

## Ideas for Student Projects

- song recognition
- key recognition
- chord recognition
- de-noising
- genre classification and recommendation service

## Digital Audio Signal Formats / Parameters

- for computer processing **digital** signals are required
- analog -> analog-to-digital converter -> digital
- typically represented as streams/files
    - uncompressed as e.g. PCM
    - lossless compressed as e.g. FLAC
    - lossy compressed as e.g. AC-3 / MP3 / AAC / Vorbis / G.711 / G.722
- encoder
    - typically large processing load, since rendering is mostly done off-line or some latency is allowed
    - for lossy compression, a psycho-acoustical model is very often employed to reduce data
- decoder
    - typically very low processing load, since decoders must meet real-time demands on devices with low computational capabilities
- most DSP algorithms work on uncompressed audio data, thus inherent parameters are:
    - bit resolution / quantization (typically in the range of 8-24 Bit integer, 32/64 Bit floating point)
    - sampling frequency (typical: 1/2/4 x 32, 44.1, 48 kHz, sometimes also 8/12/16 kHz)
    - number of audio channels (mono, stereo, 5.1, 7.1, 4-128 in microphone arrays, several hundreds in audio productions)
- vector / matrix representation
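
As a rough illustration of these parameters, the following sketch computes the data rate of uncompressed PCM and builds a stereo signal as a matrix. The chosen values for `fs`, `B`, `C`, the signal duration and the sine frequencies are assumptions for this example only. The layout (samples along rows, channels along columns) follows what `scipy.io.wavfile.read` returns for multichannel files.

```python
import numpy as np

fs = 48000  # sampling frequency in Hz
B = 24  # bit resolution
C = 2  # number of audio channels (stereo)
duration = 0.5  # signal length in seconds

# uncompressed PCM data rate in bit/s
data_rate = B * fs * C
print('PCM data rate:', data_rate / 1000, 'kbit/s')

# matrix representation: rows are samples (time), columns are channels
N = int(duration * fs)
k = np.arange(N)
X = np.zeros((N, C))
X[:, 0] = np.sin(2 * np.pi * 1000 / fs * k)  # left channel: 1 kHz sine
X[:, 1] = np.sin(2 * np.pi * 2000 / fs * k)  # right channel: 2 kHz sine
print('signal matrix shape (samples, channels):', X.shape)

# a mono signal is a vector, e.g. the left channel
x_left = X[:, 0]
print('mono vector shape:', x_left.shape)
```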

### Quantization

- uncompressed audio typically uses [Pulse Code Modulation](https://en.wikipedia.org/wiki/Pulse-code_modulation) (PCM) to represent the data digitally
- quantization of amplitude values is required
- typically done with linear quantizer
- number of bits $B$
- then there are $2^B$ possible quantization steps, e.g. for $B=8$ this leads to $256$ quantization steps
- when assigning **integer** numbers to sample values the convention holds (for the **midtread quantizer**)
    - minimum integer is $-(2^{B-1})$, e.g. for $B=8$ this leads to integer -128
    - maximum integer is $+(2^{B-1})-1$, e.g. for $B=8$ this leads to integer +127
    - zero can be exactly represented
    - sample values smaller than $-(2^{B-1})$ / larger than $+(2^{B-1})-1$ will be clipped to the min/max integer
- convention for analog-to-digital converters (ADC), digital-to-analog converters (DAC) and audio files is to interpret the samples for the range -1...+1
- unless a fixed point DSP is used explicitly (still common for embedded hardware due to lower power consumption and smaller chip size), nowadays (PC based) processing is performed in single/double precision floating point
- thus scaling the (integer) data might be required, cf. `scale_wav()` below
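
The integer convention of the midtread quantizer can be sketched in a few lines. The function name `quantize_midtread()` and the test values are hypothetical and only serve this illustration; the input is assumed to be given in the range -1...+1.

```python
import numpy as np


def quantize_midtread(x, B):
    """Linear midtread quantization of x (range -1...+1) to B Bit integers."""
    int_min = -(2**(B - 1))
    int_max = 2**(B - 1) - 1
    xi = np.round(x * 2**(B - 1))  # scale and round to the integer grid
    xi = np.clip(xi, int_min, int_max)  # clip to the min/max integer
    return xi.astype(np.int64)


B = 8
x = np.array([-1.5, -1.0, -0.5, 0.0, 0.5, 1.0, 1.5])
xi = quantize_midtread(x, B)
print('integers:', xi)

# scale back to the -1...+1 range
xq = xi / 2**(B - 1)
print('scaled back:', xq)
```

Note that $-1$ maps exactly to the minimum integer $-128$, whereas $+1$ is clipped to $+127$, so the largest representable value for $B=8$ is $127/128$.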

### Read and Plot PCM Wav File

In [None]:
import matplotlib.pyplot as plt
import numpy as np
from scipy.io import wavfile

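def scale_wav(x):
    """Scale wav file samples to the +-1.0 range."""
    print('dtype for wav file content:', x.dtype)
    if np.issubdtype(x.dtype, np.floating):
        return x  # float coded wav files are already in the +-1.0 range
    if x.dtype == np.uint8:  # 8 Bit PCM is stored unsigned with offset 128
        return (x.astype(np.float64) - 128) / 128
    # otherwise we assume signed integer coded wav files:
    print('quantization bit resolution might be lower than storage bit resolution!')
    # normalize by 2^(B-1) to bring to the +-1.0 double range,
    # B is the storage bit resolution of the integer dtype
    return x / 2**(np.iinfo(x.dtype).bits - 1)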

In [None]:
folder = 'audio_ex01/'

# fs, x = wavfile.read(folder+'sine1k_16Bit.wav')  # integer PCM
fs, x = wavfile.read(folder+'sine1k_24Bit.wav')  # integer PCM
# fs, x = wavfile.read(folder+'sine1k_32Bit.wav')  # float PCM
# fs, x = wavfile.read(folder+'sine1k_64Bit.wav')  # double PCM

x = scale_wav(x)
# for further processing x should have double precision,
# unless a special application requires another format
print('dtype for read in signal x:', x.dtype)

In [None]:
plt.figure(figsize=(7, 6))
plt.subplot(2, 1, 1)
plt.stem(x[:48], basefmt='C0:', linefmt='C0:', markerfmt='C0o')
plt.xlabel('k')
plt.ylabel('x[k]')

# if required the samples can additionally be normalized
# for example to represent analog units, such as sound
# pressure if the audio signal was recorded by a
# calibrated microphone
# furthermore we can plot over the time rather than sample
# index
# example:
k = np.arange(x.size)
pressure_norm = 20  # full scale represents 20 Pascal peak
plt.subplot(2, 1, 2)
plt.stem(k[:48]/fs*1000, x[:48] * pressure_norm,
         basefmt='C0:', linefmt='C0:', markerfmt='C0o')
plt.xlabel('t in ms')
plt.ylabel('x in Pascal');
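
If the audio files from the folder above are not at hand, a comparable test file can be generated. The following is a minimal sketch, assuming a 1 kHz sine at half of full scale, 48 kHz sampling frequency and 16 Bit integer PCM; the file name and the temporary folder are hypothetical choices for this example.

```python
import os
import tempfile

import numpy as np
from scipy.io import wavfile

fs = 48000  # sampling frequency in Hz
f = 1000  # sine frequency in Hz
k = np.arange(fs)  # sample indices for one second
x = 0.5 * np.sin(2 * np.pi * f / fs * k)  # double precision, +-1.0 range

# convert to 16 Bit integer PCM (round, then clip to the integer range)
x_int16 = np.clip(np.round(x * 2**15), -2**15, 2**15 - 1).astype(np.int16)

# hypothetical file name, written to a temporary folder
filename = os.path.join(tempfile.gettempdir(), 'sine1k_16Bit_test.wav')
wavfile.write(filename, fs, x_int16)

# read back and scale to the +-1.0 range again
fs_read, x_read = wavfile.read(filename)
x_read = x_read / 2**15
print('fs:', fs_read, 'Hz, max. deviation:', np.max(np.abs(x_read - x)))
```

The remaining deviation between the original and the re-read signal is the quantization error, which for rounding is at most half a quantization step.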

## Copyright

- the notebooks are provided as [Open Educational Resources](https://en.wikipedia.org/wiki/Open_educational_resources)
- feel free to use the notebooks for your own purposes
- the text is licensed under [Creative Commons Attribution 4.0](https://creativecommons.org/licenses/by/4.0/)
- the code of the IPython examples is licensed under the [MIT license](https://opensource.org/licenses/MIT)
- please attribute the work as follows: *Frank Schultz, Data Driven Audio Signal Processing - A Tutorial Featuring Computational Examples, University of Rostock* ideally with relevant file(s), github URL https://github.com/spatialaudio/data-driven-audio-signal-processing-exercise, commit number and/or version tag, year.
