# Deep Learning for audio Lecture 1 (Introduction)


Useful links:

1. Great web-based plotting tool: [Desmos](https://www.desmos.com/calculator)
2. Video explanation of the [Fourier transform](https://www.youtube.com/watch?v=spUNpyF58BY) (3blue1brown)
3. Introductory tutorials on [FFT/MFCC/VAD](http://practicalcryptography.com/miscellaneous/machine-learning/)

In [None]:
import pathlib

import numpy as np
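import librosa
import librosa.display  # explicit import; older librosa versions do not load this submodule automatically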
import matplotlib.pyplot as plt
from IPython.display import display, Audio

# Physics of sound

- Sound is a wave (an oscillation) of air pressure caused by vibration.
- Sound propagates as the movement of molecules in the air.

Preconditions for a sound audible to a human:
 - source of vibration: guitar string, human voice, anything that vibrates in the audible range of frequencies (20 Hz - 20 kHz)
 - elastic medium: air molecules, water.

[Interactive introduction to waveforms](https://pudding.cool/2018/02/waveforms/)

Waveform properties:
 - frequency - number of full cycles per 1 second
 - amplitude - roughly equivalent to loudness
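
The two properties above can be made concrete with a synthetic tone. This is a minimal sketch, not part of the original lecture code: the helper name `sine_wave` and the chosen values (440 Hz, amplitude 0.5, 16000 Hz sampling rate) are assumptions for illustration.

```python
import numpy as np

def sine_wave(frequency_hz, amplitude, duration_s, sampling_rate):
    """Return a sampled sine wave with the given frequency and amplitude."""
    # Time stamps of the samples, in seconds
    t = np.arange(int(duration_s * sampling_rate)) / sampling_rate
    # frequency_hz full cycles per second, peak value equal to amplitude
    return amplitude * np.sin(2 * np.pi * frequency_hz * t)

tone = sine_wave(frequency_hz=440.0, amplitude=0.5, duration_s=1.0, sampling_rate=16000)
```

In the notebook, `display(Audio(data=tone, rate=16000))` should play the tone; changing `frequency_hz` changes the pitch and changing `amplitude` changes the peak value of the waveform.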

In [None]:
from IPython.display import display, HTML

display(HTML("""
<!DOCTYPE html>
<html>
  <head>
    <script src="https://www.desmos.com/api/v1.6/calculator.js?apiKey=dcb31709b452b1cf9dc26972add0fda6"></script>
  </head>
  <body>
    <div id="calculator" style="width: 100%; height: 600px;"></div>
    <script>
      var elt = document.getElementById('calculator');
      var calculator = Desmos.GraphingCalculator(elt);
      // Plot a sum of sinusoids with different frequencies and amplitudes
      calculator.setExpression({ id: 'graph1', latex: 'y=\\\\sin(203x) + \\\\sin(3x)  + 2\\\\sin(50x) + \\\\cos(2000x)'});
    </script>
  </body>
</html>
"""))

# Digital Audio representation
![image.png](./assets/Signal_Sampling.svg.png)

[Sampling](https://en.wikipedia.org/wiki/Sampling_(signal_processing)#Applications) - take a continuous wave and discretize it by measuring its amplitude at regular time intervals.

**Sampling rate** - density of discretization, in samples per second. Common values: 8000 Hz, 16000 Hz, 22050 Hz, 44100 Hz.
- 44100 Hz - can represent frequencies up to 22.05 kHz, which covers the roughly 20 kHz upper limit of human hearing.
- 22050 Hz - common for speech synthesis models; was used for low-bit-rate MP3 in the past.
- 16000 Hz - widely used for training ASR models. Can represent frequencies up to 8 kHz, which covers the human speech frequency spectrum (200 Hz - 8 kHz).
- 8000 Hz - telephone, encrypted walkie-talkie. Adequate for intelligible speech, but fricative sounds such as /s/ and /f/ lose much of their high-frequency content.


**Bit depth** - resolution of each sample, e.g. 8 bit or 16 bit.

$$ bit\_rate = sampling\_rate \times bit\_depth \times num\_channels $$
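
A small sketch of the bit-rate arithmetic for uncompressed PCM. The function names are hypothetical; the `num_channels` factor accounts for multi-channel audio and, with one channel, reduces to sampling rate times bit depth.

```python
def pcm_bit_rate(sampling_rate, bit_depth, num_channels=1):
    """Bits per second needed to store uncompressed PCM audio."""
    return sampling_rate * bit_depth * num_channels

def quantization_levels(bit_depth):
    """Number of distinct amplitude values a sample can take."""
    return 2 ** bit_depth

mono_22k = pcm_bit_rate(22050, 16)                     # 352800 bits/s
mono_8k = pcm_bit_rate(8000, 16)                       # 128000 bits/s
cd_stereo = pcm_bit_rate(44100, 16, num_channels=2)    # 1411200 bits/s
levels_16_bit = quantization_levels(16)                # 65536
```

The first two values correspond to 16-bit mono audio at 22050 Hz and 8000 Hz, roughly 353 kbit/s and 128 kbit/s.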

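The [Nyquist-Shannon sampling theorem](https://en.wikipedia.org/wiki/Nyquist%E2%80%93Shannon_sampling_theorem) bridges continuous and discrete signals: a signal can be reconstructed from its samples if the sampling rate is more than twice its highest frequency.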
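
What happens when the sampling rate is too low can be shown with a short experiment. This is an illustrative sketch (the helper `dominant_frequency` and the chosen frequencies are assumptions): a 7 kHz tone sampled at 8000 Hz lies above half the sampling rate (4 kHz) and becomes indistinguishable from a 1 kHz tone, an effect known as aliasing.

```python
import numpy as np

def dominant_frequency(signal, sampling_rate):
    """Frequency (Hz) of the strongest bin in the one-sided spectrum."""
    spectrum = np.abs(np.fft.rfft(signal))
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / sampling_rate)
    return freqs[np.argmax(spectrum)]

sampling_rate = 8000
t = np.arange(sampling_rate) / sampling_rate        # one second of sample times

tone_1k = np.sin(2 * np.pi * 1000 * t)              # below 4 kHz: represented correctly
tone_7k = np.sin(2 * np.pi * 7000 * t)              # above 4 kHz: aliases to 8000 - 7000 Hz

freq_1k = dominant_frequency(tone_1k, sampling_rate)
freq_7k = dominant_frequency(tone_7k, sampling_rate)
```

Both `freq_1k` and `freq_7k` come out as 1000 Hz, which is why audio is low-pass filtered before it is downsampled.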


In short, the uncompressed PCM format stores sampled amplitudes at a specified sampling rate and bit depth. Please refer to the [ffmpeg wiki](http://trac.ffmpeg.org/wiki/audio%20types) for more details about audio formats. 

# SoX - Sound eXchange


The Swiss Army knife of audio manipulation, like ffmpeg for video.


- query useful audio properties via __soxi__

```
(venv)$ soxi LJ037-0171.wav

Input File     : 'LJ037-0171.wav'
Channels       : 1
Sample Rate    : 22050
Precision      : 16-bit
Duration       : 00:00:07.58 = 167226 samples ~ 568.796 CDDA sectors
File Size      : 334k
Bit Rate       : 353k
Sample Encoding: 16-bit Signed Integer PCM
```

- resample audio

```
(venv)$ sox LJ037-0171.wav LJ037-0171.8k.wav rate 8000
(venv)$ soxi LJ037-0171.8k.wav

Input File     : 'LJ037-0171.8k.wav'
Channels       : 1
Sample Rate    : 8000
Precision      : 16-bit
Duration       : 00:00:07.58 = 60672 samples ~ 568.8 CDDA sectors
File Size      : 121k
Bit Rate       : 128k
Sample Encoding: 16-bit Signed Integer PCM
```

- play audio directly in terminal!

```
(venv)$ play LJ037-0171.wav

LJ037-0171.wav:

 File Size: 334k      Bit Rate: 353k
  Encoding: Signed PCM
  Channels: 1 @ 16-bit
Samplerate: 22050Hz
Replaygain: off
  Duration: 00:00:07.58

In:58.8% 00:00:04.46 [00:00:03.13] Out:168k  [  -===|===-  ]        Clip:0
```
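
The numbers reported by `soxi` above can be cross-checked with a few lines of arithmetic. This is a sketch that uses only values copied from the output for `LJ037-0171.wav`; the variable names are illustrative, and rounding the resampled length to the nearest integer is an assumption that happens to reproduce the reported count, not a statement about how SoX rounds internally.

```python
# Values copied from the soxi output for LJ037-0171.wav
num_samples = 167226
sampling_rate = 22050
bit_depth = 16
channels = 1

# Duration in seconds: samples divided by samples per second
duration_s = num_samples / sampling_rate                          # about 7.58 s

# Size of the raw audio payload (the WAV header adds a few dozen bytes)
payload_bytes = num_samples * channels * bit_depth // 8           # 334452, i.e. ~334k

# Expected sample count and payload size after resampling to 8000 Hz
resampled_samples = round(duration_s * 8000)                      # 60672
resampled_bytes = resampled_samples * channels * bit_depth // 8   # 121344, i.e. ~121k
```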

# Other useful audio processing tools and libraries

- [Audacity](https://www.audacityteam.org/) open-source GUI for audio recording, editing, and convenient manipulations.
- [Librosa](https://librosa.org/doc/latest/index.html) convenient Python library for audio feature extraction, manipulation, and building ML pipelines. It is gradually losing its importance, perhaps due to the emergence of more end-to-end models.
- [PyTorch Audio](https://github.com/pytorch/audio) seamless integration with PyTorch, under active development; not yet mature, but a promising library.
- [FFmpeg](https://ffmpeg.org/) indispensable tool when you work with audio and video. Lets you quickly resample audio files, trim silence, apply effects, and normalize loudness.

# Audio Spectral Representations

- Spectrogram (STFT): $S_{stft} \in \mathbb{C}^{L \times F}$, where $L$ is the number of time frames and $F$ is the number of frequency bins
- Mel-spectrogram: $S_{mel} \in \mathbb{R}^{L \times M}$, where $M$ is the number of mel filters
- Gammatone: $S_{gamma} \in \mathbb{R}^{L \times G}$, where $G$ is the number of gammatone filters


# Discrete Fourier Transform

$$ {\huge X_k = \sum_{n=0}^{N-1} x_n \cdot e ^ {\frac{-i2\pi}{N}kn} } $$ 

Key idea: represent the time-domain waveform in a new basis of trigonometric functions ($\cos$ and $\sin$) of various frequencies.
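
One way to see the trigonometric basis explicitly is to expand the complex exponential with Euler's formula, $e^{-i\theta} = \cos\theta - i\sin\theta$. Using the same symbols as the DFT formula above:

$$ X_k = \sum_{n=0}^{N-1} x_n \cos\left(\frac{2\pi}{N}kn\right) - i \sum_{n=0}^{N-1} x_n \sin\left(\frac{2\pi}{N}kn\right) $$

For a real-valued signal, the real part of $X_k$ measures how strongly the signal correlates with a cosine that completes $k$ cycles over the $N$ samples, and the imaginary part does the same for a sine (with a minus sign).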

In [None]:
def dft(x):
    """Naive O(N^2) implementation of the Discrete Fourier Transform (DFT).

    https://en.wikipedia.org/wiki/Discrete_Fourier_transform
    """
    assert x.ndim == 1
    N = x.shape[0]  # number of input samples
    K = N           # number of frequency bins (equal to N for the full DFT)
    spectrum = np.zeros(K, dtype=complex)
    for k in range(K):
        for n in range(N):
            # X_k += x_n * exp(-i * 2 * pi * k * n / N)
            spectrum[k] += x[n] * np.exp(-2j * np.pi * k * n / N)

    return spectrum

def get_magnitude(x):
    # |a + bi| = sqrt(a^2 + b^2), equivalent to np.abs(x)
    assert np.iscomplexobj(x)
    return np.sqrt(x.real**2 + x.imag**2)
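
The double loop can also be written as a single matrix-vector product, which makes the "change of basis" view explicit. A minimal sketch, assuming a 1D input; the name `dft_matrix` is hypothetical, and the cost is still quadratic in the signal length (unlike the FFT).

```python
import numpy as np

def dft_matrix(x):
    """DFT as a matrix-vector product: each row of W is one basis function."""
    x = np.asarray(x)
    assert x.ndim == 1
    N = x.shape[0]
    n = np.arange(N)           # sample indices, shape (N,)
    k = n.reshape(-1, 1)       # frequency indices, shape (N, 1)
    # W[k, n] = exp(-i * 2 * pi * k * n / N)
    W = np.exp(-2j * np.pi * k * n / N)
    return W @ x

rng = np.random.default_rng(0)
test_signal = rng.standard_normal(64)
# Largest deviation from NumPy's FFT on a random test signal
max_error = np.max(np.abs(dft_matrix(test_signal) - np.fft.fft(test_signal)))
```

On this random signal the result agrees with `np.fft.fft` up to floating-point error.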

In [None]:
audio_path = 'data/wav_22050/piano.22050.wav'
audio_waveform, sampling_rate = librosa.load(audio_path, sr=None, duration=5.0, offset=5.0)
audio_waveform = audio_waveform[:sampling_rate * 5]

librosa.display.waveshow(audio_waveform, sr=sampling_rate)
display(Audio(data=audio_waveform, rate=sampling_rate))

In [None]:
spectrum = get_magnitude(dft(audio_waveform[-512:]))
spectrum_2 = np.abs((np.fft.fft(audio_waveform[-512:])))

assert np.allclose(spectrum, spectrum_2)

fig, axes = plt.subplots(1, 2, figsize=(16, 4))
axes[0].plot(spectrum)
axes[0].set_title('two-sided DFT (second half is redundant for real input)')
axes[1].plot(spectrum[:256+1])
axes[1].set_title('one-sided DFT (bins 0..N/2)')
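
The second half of the spectrum is redundant because the DFT of a real-valued signal is conjugate-symmetric, so only the first N/2 + 1 bins carry information. The sketch below checks this on random data (an assumption for illustration, not the piano recording) and shows that `np.fft.rfft` returns exactly those bins.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(512)      # any real-valued signal
N = len(x)

full = np.fft.fft(x)              # N complex bins
one_sided = np.fft.rfft(x)        # N // 2 + 1 complex bins

# For real input, X[N - k] == conj(X[k]) for k = 1 .. N - 1
is_conjugate_symmetric = np.allclose(full[1:][::-1], np.conj(full[1:]))

# rfft keeps bins 0 .. N / 2 of the full transform
matches_first_half = np.allclose(full[:N // 2 + 1], one_sided)
```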

# Spectrogram, Short-Time Fourier Transform (STFT)

- Key idea: slide a window over the waveform, compute the DFT of each windowed segment, and stack the DFTs of all windows into a single 2D matrix.
- Has a well-defined inverse transformation (iSTFT).

![stft_viz](./assets/stft_output.png)
image taken from [MathWorks](https://www.mathworks.com/help/dsp/ref/dsp.stft.html)
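
The sliding-window idea can be sketched in a few lines of NumPy. This is a simplified illustration, not how `librosa.stft` is implemented: the function name `naive_stft`, the Hann window from `np.hanning`, and the absence of padding are assumptions, and the output is laid out as frames by frequencies, whereas librosa returns frequencies by frames.

```python
import numpy as np

def naive_stft(signal, n_fft=512, hop_length=128):
    """Windowed DFT of overlapping frames; returns shape (n_frames, n_fft // 2 + 1)."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(signal) - n_fft) // hop_length
    # Cut the signal into overlapping frames and taper each one with the window
    frames = np.stack([
        signal[i * hop_length : i * hop_length + n_fft] * window
        for i in range(n_frames)
    ])
    # One-sided DFT of every frame
    return np.fft.rfft(frames, axis=1)

sr = 8000
t = np.arange(sr) / sr
tone = np.sin(2 * np.pi * 1000 * t)      # one second of a 1 kHz tone

spec = naive_stft(tone)
peak_bins = np.argmax(np.abs(spec), axis=1)
peak_hz = peak_bins * sr / 512           # convert bin index to frequency in Hz
```

For the stationary tone, every frame peaks at the same frequency bin (1000 Hz); for real recordings the peaks move over time, which is what the spectrogram visualizes.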

In [None]:
audio_stft = librosa.stft(audio_waveform)
librosa.display.specshow(librosa.amplitude_to_db(np.abs(audio_stft), ref=np.max), sr=sampling_rate, x_axis='time', y_axis='linear')

# Mel-Spectrogram

- The mel scale is a perceptual scale of pitches that listeners judge to be equally spaced from one another.
- The name mel comes from the word melody, indicating that the scale is based on pitch comparisons.
- A mel-spectrogram is obtained from the STFT magnitude (or power) spectrogram by one more matrix multiplication, with the mel filterbank.
- Intuitively, it reflects that humans distinguish low frequencies better than high frequencies.
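
The points above can be sketched with a hand-written filterbank. This is a simplified illustration under stated assumptions: it uses the common HTK-style formula $m = 2595 \log_{10}(1 + f / 700)$ and unnormalized triangular filters, while `librosa.filters.mel` defaults to a Slaney-style scale with area normalization, so the values will differ. All function names are hypothetical, and the input frames are random data.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + np.asarray(f, dtype=float) / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (np.asarray(m, dtype=float) / 2595.0) - 1.0)

def mel_filterbank(sr, n_fft, n_mels):
    """Triangular filters with centers equally spaced on the mel scale."""
    n_freqs = n_fft // 2 + 1
    fft_freqs = np.linspace(0.0, sr / 2, n_freqs)
    # n_mels + 2 points: left edge, centers, right edge
    hz_points = mel_to_hz(np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2), n_mels + 2))
    fb = np.zeros((n_mels, n_freqs))
    for m in range(n_mels):
        left, center, right = hz_points[m : m + 3]
        rising = (fft_freqs - left) / (center - left)
        falling = (right - fft_freqs) / (right - center)
        fb[m] = np.maximum(0.0, np.minimum(rising, falling))
    return fb

fb = mel_filterbank(sr=8000, n_fft=512, n_mels=20)      # shape (M, F)

# Power spectrogram of 10 random frames, shape (L, F)
rng = np.random.default_rng(0)
power_spec = np.abs(np.fft.rfft(rng.standard_normal((10, 512)), axis=1)) ** 2

# The "one more matrix multiplication": (L, F) @ (F, M) -> (L, M)
mel_spec = power_spec @ fb.T

# Equal steps in mel correspond to growing steps in Hz
hz_steps = np.diff(mel_to_hz(np.linspace(0.0, hz_to_mel(4000.0), 6)))
```

`hz_steps` grows monotonically, which is the numerical counterpart of finer resolution at low frequencies.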

In [None]:
n_mels = 20
mel_basis = librosa.filters.mel(sr=sampling_rate, n_fft=2048, n_mels=n_mels)
for i in range(n_mels):
    plt.plot(mel_basis[i, :])
plt.title('Mel Basis')

# Gammatones. Slaney, 1998

- Spectrogram-like representations based on modelling how the human ear perceives, emphasises, and separates different frequencies of sound.
- A gammatone representation is intended to reflect the human experience of sound better than, say, a Fourier-domain spectrum.
- An approximate version of gammatones can be computed from spectrograms.
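
For intuition, a single gammatone filter can be sketched from its textbook impulse response, a gamma-shaped envelope multiplied by a tone: $g(t) = t^{n-1} e^{-2\pi b t} \cos(2\pi f_c t)$. The code below is an illustration under assumptions, not the implementation used by the `gammatone` package: it takes filter order 4, bandwidth $b = 1.019 \cdot ERB(f_c)$, and the Glasberg and Moore approximation of the equivalent rectangular bandwidth (ERB); the function names are hypothetical.

```python
import numpy as np

def erb(center_hz):
    """Equivalent rectangular bandwidth in Hz (Glasberg and Moore approximation)."""
    return 24.7 * (4.37 * center_hz / 1000.0 + 1.0)

def gammatone_impulse_response(center_hz, sampling_rate, duration_s=0.025, order=4):
    """Peak-normalized impulse response of one gammatone filter."""
    t = np.arange(int(duration_s * sampling_rate)) / sampling_rate
    b = 1.019 * erb(center_hz)
    # Gamma-distribution-shaped envelope that rises quickly and decays exponentially
    envelope = t ** (order - 1) * np.exp(-2 * np.pi * b * t)
    ir = envelope * np.cos(2 * np.pi * center_hz * t)
    return ir / np.max(np.abs(ir))

ir_1k = gammatone_impulse_response(center_hz=1000.0, sampling_rate=22050)
```

Because the ERB grows with the center frequency, filters at higher frequencies are wider, which mirrors the coarser frequency resolution of hearing in that range.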


In [None]:
from gammatone import gtgram
# https://github.com/detly/gammatone
# SciPy also provides an implementation of gammatone filters:
# https://docs.scipy.org/doc/scipy/reference/generated/scipy.signal.gammatone.html
def compute_gammatone(wav):
    window_time = 0.02
    hop_time = window_time / 2
    channels = 120
    f_min = 20
    gtg = gtgram.gtgram(wav, sampling_rate, window_time, hop_time, channels, f_min)
    gtg = np.flipud(20 * np.log10(gtg))
    return gtg

In [None]:
wav_dir = pathlib.Path('./data/wav_22050/')
for wav_path in wav_dir.glob('*.wav'):
    print(f'{wav_path.name}')
    wav, sampling_rate = librosa.load(wav_path, sr=None, duration=5.0)
    stft = librosa.stft(wav)
    gamma = compute_gammatone(wav)
    mel = librosa.feature.melspectrogram(y=wav, sr=sampling_rate)
    
    display(Audio(data=wav, rate=sampling_rate))
    fig, axes = plt.subplots(4, 1, figsize=(12,14))
    fig.tight_layout(pad=3.0)

    librosa.display.waveshow(wav, sr=sampling_rate, ax=axes[0])
    axes[0].set_title('waveform')

    librosa.display.specshow(librosa.amplitude_to_db(np.abs(stft), ref=np.max), ax=axes[1])
    axes[1].set_title('STFT-spectrogram')

    # melspectrogram returns a power spectrogram by default, so convert with power_to_db
    librosa.display.specshow(librosa.power_to_db(mel, ref=np.max), ax=axes[2])
    axes[2].set_title('MEL-spectrogram')

    axes[3].imshow(gamma)
    axes[3].set_title('gammatone')
    
    plt.show()

# Datasets

## Multi-purpose

[Audioset](https://research.google.com/audioset/index.html) 
- audio from YouTube videos. Useful for many purposes: audio classification, urban sound recognition, noise for augmentations.
- 2.1 million annotated videos
- 5.8 thousand hours of audio
- 527 classes of annotated sounds

[LAION-audio-dataset](https://github.com/LAION-AI/audio-dataset)
- an umbrella collection of multiple datasets totalling thousands of hours.
- useful for training unsupervised/self-supervised models.

## English TTS/STT data
[LJSpeech](https://keithito.com/LJ-Speech-Dataset/) 
- very clean and high quality.
- 24 hours, 13,100 short audio clips.
- single speaker reading passages from 7 non-fiction books.

[GigaSpeech](https://github.com/SpeechColab/GigaSpeech)
- large-scale dataset with 10k hours of audio.
- contains podcasts, YouTube videos, and audiobooks.

## Multilingual data suitable for TTS and STT
[Common Voice](https://voice.mozilla.org/en)
- huge dataset with more than 100 languages and thousands of speakers.
- has demographic metadata like age, sex, and accent.

[M-AILABS](https://www.caito.de/2019/01/03/the-m-ailabs-speech-dataset/)
- about 1000 hours in total, in 9 languages.
- Ukrainian data: audiobooks with a total duration of 87 hours.