DSC160 Data Science and the Arts - Twomey - Spring 2020 - [dsc160.roberttwomey.com](http://dsc160.roberttwomey.com)

# Frequency Transforms

This notebook demonstrates a variety of frequency transforms.

It depends on the [numpy](https://numpy.org/), [SciPy](https://scipy.org/), [matplotlib](https://matplotlib.org/), [seaborn](https://seaborn.pydata.org/), and [LibROSA](https://librosa.github.io/librosa/) libraries.

The examples are adapted from the tutorials at [musicinformationretrieval.com](https://musicinformationretrieval.com), developed for the Stanford MIR workshops.

## Setup

Basic imports

In [None]:
%matplotlib inline

# visualization
import matplotlib.pyplot as plt
import seaborn

# sound processing
import librosa
import librosa.display

# to play audio inline in ipython/jupyter notebooks
from IPython.display import Audio

import numpy as np
import scipy
import sklearn

import os, requests

## Fourier Transform

The [Fourier Transform](https://en.wikipedia.org/wiki/Fourier_transform) is one of the most fundamental operations in applied mathematics and signal processing.

It transforms our time-domain signal into the *frequency domain*. Whereas the time domain expresses our signal as a sequence of samples, the frequency domain expresses our signal as a *superposition of sinusoids* of varying magnitudes, frequencies, and phase offsets.
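
To make the "superposition of sinusoids" idea concrete, here is a minimal NumPy sketch of the discrete Fourier transform written directly from its definition, $X(k) = \sum_{n=0}^{N-1} x(n) e^{-j 2 \pi k n / N}$, and checked against `np.fft.fft`. The helper `naive_dft` and the toy signal are our own illustrative choices; the direct form costs $O(N^2)$ operations, which is why real code uses the FFT.

```python
import numpy as np

def naive_dft(x):
    """DFT computed directly from its definition; O(N^2), for illustration only."""
    x = np.asarray(x, dtype=float)
    N = len(x)
    n = np.arange(N)
    k = n.reshape((N, 1))
    # Row k holds the complex sinusoid exp(-2j*pi*k*n/N) that bin k is compared against
    basis = np.exp(-2j * np.pi * k * n / N)
    return basis @ x

# Toy signal: a sine at bin 5 (amplitude 1) plus a cosine at bin 12 (amplitude 0.5)
N_demo = 64
n_demo = np.arange(N_demo)
x_demo = np.sin(2 * np.pi * 5 * n_demo / N_demo) + 0.5 * np.cos(2 * np.pi * 12 * n_demo / N_demo)

X_demo = naive_dft(x_demo)
print(np.allclose(X_demo, np.fft.fft(x_demo)))   # True
print(np.abs(X_demo[[5, 12]]))                    # about [32, 16], i.e. amplitude * N / 2
```

Each sinusoid in the input shows up as energy in exactly one bin (and its mirror image), which is what the spectrum plots below display.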

Before we compute a fast Fourier transform (FFT), let's load an audio file to work with:

In [None]:
x, sr = librosa.load("audio/c_strum.wav")

In [None]:
print(x.shape)
print(sr)

In [None]:
# display
plt.figure(figsize=(13, 5))
librosa.display.waveshow(x, sr=sr)  # librosa >= 0.8.1; older versions use librosa.display.waveplot
plt.show()

In [None]:
Audio(x, rate=sr)

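To compute a Fourier transform in SciPy, we use [`scipy.fft.fft`](https://docs.scipy.org/doc/scipy/reference/generated/scipy.fft.fft.html) (NumPy's `np.fft.fft` is equivalent).

In [None]:
import scipy.fft  # in SciPy >= 1.4, scipy.fft is a module, not a callable function

X = scipy.fft.fft(x)
X_mag = np.absolute(X)
# frequency of FFT bin k is k * sr / N, for k = 0 .. N-1
f = np.arange(len(X_mag)) * sr / len(X_mag)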

In [None]:
x.shape

In [None]:
f.shape

In [None]:
max(f)

Plot the spectrum:

In [None]:
plt.figure(figsize=(13, 5))
plt.plot(f, X_mag) # magnitude spectrum
plt.xlabel('Frequency (Hz)')
plt.show()

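NOTE: the magnitude spectrum is symmetric about sr/2, because the FFT of a real-valued signal is conjugate-symmetric. By the Nyquist-Shannon sampling theorem, the highest frequency that a digital signal with sampling rate sr can represent is sr/2 (the Nyquist frequency), so only the left half of the plot carries information.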
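
Why the mirror? For a real-valued signal, bin $k$ and bin $N-k$ are complex conjugates, so their magnitudes match. A minimal NumPy sketch on a synthetic 440 Hz sine (the 8000 Hz sampling rate is an assumption chosen for round numbers) checks this and shows that `np.fft.rfft` together with `np.fft.rfftfreq` keeps only the non-redundant half, from 0 Hz up to sr/2.

```python
import numpy as np

sr_demo = 8000                                   # assumed rate for this toy example
t_demo = np.arange(sr_demo) / sr_demo            # one second of samples
x_demo = np.sin(2 * np.pi * 440 * t_demo)        # a 440 Hz sine

X_demo = np.fft.fft(x_demo)
N = len(X_demo)

# Mirror symmetry of a real signal's spectrum: |X[k]| == |X[N - k]|
k = np.arange(1, N)
print(np.allclose(np.abs(X_demo[k]), np.abs(X_demo[N - k])))   # True

# rfft keeps bins 0 .. N // 2; rfftfreq gives their frequencies in Hz
X_half = np.fft.rfft(x_demo)
f_half = np.fft.rfftfreq(N, d=1 / sr_demo)
print(len(X_half), f_half[-1])                   # N // 2 + 1 bins, the last one at sr / 2
print(f_half[np.abs(X_half).argmax()])           # the 440 Hz peak
```

Working with the half spectrum avoids plotting (and searching) the redundant mirrored copy.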

Zooming in, let's inspect the frequency bands at the lower end of the spectrum:

In [None]:
plt.figure(figsize=(13, 5))
plt.plot(f[:5000], X_mag[:5000])
plt.xlabel('Frequency (Hz)')
plt.show()

Note: this sample has six large peaks, likely corresponding to the six strings of the guitar sounding the chord.

What is the frequency, in Hz, of the largest peak?

In [None]:
# index of the maximum value in the magnitude spectrum
max_pos = X_mag.argmax()

# frequency at the same index
f[max_pos]

What is this frequency as a musical note?

In [None]:
librosa.hz_to_note(f[max_pos])
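
`librosa.hz_to_note` rests on the equal-temperament relation between frequency and MIDI note number, $p = 69 + 12 \log_2(f / 440)$, where A4 = 440 Hz = MIDI note 69. The sketch below is our own standalone helper (`hz_to_note_name` is hypothetical, not librosa's code); it rounds to the nearest semitone and spells notes with sharps, so its formatting may differ from librosa's output.

```python
import numpy as np

NOTE_NAMES = ['C', 'C#', 'D', 'D#', 'E', 'F', 'F#', 'G', 'G#', 'A', 'A#', 'B']

def hz_to_midi(f):
    """Fractional MIDI note number for frequency f in Hz (A4 = 440 Hz = 69)."""
    return 69 + 12 * np.log2(f / 440.0)

def hz_to_note_name(f):
    """Nearest equal-tempered note name, e.g. 261.63 Hz -> 'C4'."""
    midi = int(round(hz_to_midi(f)))
    octave = midi // 12 - 1          # MIDI 60 is C4
    return NOTE_NAMES[midi % 12] + str(octave)

print(hz_to_note_name(440.0))   # A4
print(hz_to_note_name(261.63))  # C4
print(hz_to_note_name(196.0))   # G3
```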

EXERCISE: Find the indices of the six largest peaks, and compute their corresponding frequencies and musical notes. Do those notes correspond to the expected components of a guitar C chord strum?
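
One way to approach this is peak picking on the magnitude spectrum. The sketch below uses `scipy.signal.find_peaks` on a synthetic chord rather than the guitar recording: three assumed partials for C4, E4 and G4, with an assumed 8000 Hz sampling rate. It keeps the tallest local maxima; with real audio you would likely restrict the search to frequencies below sr/2 and tune `find_peaks` parameters such as `distance` or `prominence`.

```python
import numpy as np
from scipy.signal import find_peaks

sr_demo = 8000
t_demo = np.arange(sr_demo) / sr_demo                # one second of samples
chord_hz = [261.63, 329.63, 392.00]                  # assumed C4, E4, G4 partials
x_demo = sum(np.sin(2 * np.pi * f0 * t_demo) for f0 in chord_hz)

X_mag_demo = np.abs(np.fft.rfft(x_demo))             # non-redundant half only
f_demo = np.fft.rfftfreq(len(x_demo), d=1 / sr_demo)

# All local maxima, then the indices of the 3 tallest ones
peaks, props = find_peaks(X_mag_demo, height=0)
top = peaks[np.argsort(props['peak_heights'])[::-1][:3]]
top_hz = np.sort(f_demo[top])
print(top_hz)                                        # peaks near 262, 330 and 392 Hz (1 Hz bins)
```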

## Short-Time Fourier Transform

Musical signals are highly non-stationary, i.e., their statistics change over time. It would be rather meaningless to compute a single Fourier transform over an entire 10-minute song.

The [short-time Fourier transform (STFT)](https://en.wikipedia.org/wiki/Short-time_Fourier_transform) is obtained by computing the Fourier transform for successive frames in a signal. 

$$ X(m, \omega) = \sum_n x(n) w(n-m) e^{-j \omega n} $$

As we increase $m$, we slide the window function $w$ to the right. For the resulting frame, $x(n) w(n-m)$, we compute the Fourier transform. Therefore, the STFT $X$ is a function of both time, $m$, and frequency, $\omega$. We'll explore it below.
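
The formula can be implemented almost literally: cut the signal into frames spaced `hop_length` samples apart, multiply each frame by a window $w$, and take an FFT of each frame. A minimal NumPy sketch follows. The Hann window, the absence of edge padding and the helper name `stft_sketch` are our assumptions, so the frame count will differ slightly from `librosa.stft`, which pads the signal by default. Each frame's FFT also measures time from the frame start, which changes only the phase, not the magnitude, relative to the formula.

```python
import numpy as np

def stft_sketch(x, n_fft=2048, hop_length=512):
    """Naive STFT: Hann-windowed frames, no edge padding. Returns (1 + n_fft // 2, n_frames)."""
    x = np.asarray(x, dtype=float)
    # Periodic Hann window w(n) of length n_fft
    w = 0.5 - 0.5 * np.cos(2 * np.pi * np.arange(n_fft) / n_fft)
    n_frames = 1 + (len(x) - n_fft) // hop_length
    # Column m is the segment starting at m * hop_length, multiplied by the window
    frames = np.stack([x[m * hop_length:m * hop_length + n_fft] * w
                       for m in range(n_frames)], axis=1)
    # One FFT per column; rfft keeps only the non-negative frequencies
    return np.fft.rfft(frames, axis=0)

sr_demo = 22050
t_demo = np.arange(sr_demo) / sr_demo
# One second of audio: 440 Hz for the first half, 880 Hz for the second half
x_demo = np.where(t_demo < 0.5, np.sin(2 * np.pi * 440 * t_demo), np.sin(2 * np.pi * 880 * t_demo))

X_demo = stft_sketch(x_demo)
bin_hz = np.arange(X_demo.shape[0]) * sr_demo / 2048    # about 10.8 Hz per bin
first_peak = bin_hz[np.abs(X_demo[:, 0]).argmax()]
last_peak = bin_hz[np.abs(X_demo[:, -1]).argmax()]
print(X_demo.shape)                  # (1025, 40)
print(first_peak, last_peak)         # close to 440 Hz, then close to 880 Hz
```

A single FFT of the whole signal would show both tones at once; the STFT shows which one is present in each frame.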

Let's load a file:

In [None]:
# load
x, sr = librosa.load('audio/brahms_hungarian_dance_5.mp3')

# display
plt.figure(figsize=(16, 5))
librosa.display.waveplot(x, sr=sr)
plt.show()

# play
Audio(x, rate=sr)

[`librosa.stft`](https://librosa.github.io/librosa/generated/librosa.core.stft.html#librosa.core.stft) computes an STFT. We provide it a frame size, i.e. the size of the FFT (`n_fft`), and a hop length, i.e. the number of samples between the starts of successive frames (`hop_length`):

In [None]:
hop_length = 512
n_fft = 2048
X = librosa.stft(x, n_fft=n_fft, hop_length=hop_length)

To convert the hop length and frame size to units of seconds:

In [None]:
float(hop_length)/sr # units of seconds

In [None]:
float(n_fft)/sr  # units of seconds

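For real-valued signals, the Fourier transform is conjugate-symmetric about the midpoint, so the upper half of the spectrum is redundant. Therefore, `librosa.stft` retains only the non-redundant half: the output has `1 + n_fft // 2` frequency bins (1025 for `n_fft = 2048`) and one column per frame: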

In [None]:
X.shape
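
The shape can be predicted in advance. Assuming librosa's default `center=True`, which pads `n_fft // 2` samples on each side of the signal, the number of frames works out to `1 + len(x) // hop_length`, and there are `1 + n_fft // 2` frequency bins. A small sketch of that arithmetic (the helper `expected_stft_shape` and the 30-second example length are illustrative):

```python
def expected_stft_shape(n_samples, n_fft=2048, hop_length=512):
    """(bins, frames) of a centered STFT that pads n_fft // 2 samples on each side."""
    n_bins = 1 + n_fft // 2
    padded_len = n_samples + 2 * (n_fft // 2)
    n_frames = 1 + (padded_len - n_fft) // hop_length
    return n_bins, n_frames

sr_demo = 22050
print(expected_stft_shape(30 * sr_demo))   # (1025, 1292) for 30 s of audio
print(512 / sr_demo, 2048 / sr_demo)       # hop and frame length in seconds
```

If the padding assumption holds for your librosa version, `expected_stft_shape(len(x))` should agree with `X.shape` above.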

## Spectrogram

In music processing, we often only care about the spectral magnitude and not the phase content.

The [spectrogram](https://en.wikipedia.org/wiki/Spectrogram) shows the intensity of frequencies over time. A spectrogram is simply the squared magnitude of the STFT:

$$ S(m, \omega) = \left| X(m, \omega) \right|^2 $$

The human perception of sound intensity is logarithmic in nature. Therefore, we are often interested in the log amplitude:

In [None]:
S = librosa.amplitude_to_db(abs(X))
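
`librosa.amplitude_to_db` converts magnitudes to decibels, $20 \log_{10}(|X| / \text{ref})$, which is the same as $10 \log_{10}$ of the power $|X|^2$. A NumPy sketch of the same idea follows; the floor `amin` and the `top_db` clipping mirror what we understand librosa's defaults to be (`ref=1.0`, `amin=1e-5`, `top_db=80`), but treat those values as assumptions and check the documentation for your version.

```python
import numpy as np

def amplitude_to_db_sketch(S, ref=1.0, amin=1e-5, top_db=80.0):
    """Magnitude -> dB: 20*log10(max(S, amin) / ref), clipped to top_db below the peak."""
    S = np.abs(np.asarray(S, dtype=float))
    db = 20.0 * np.log10(np.maximum(S, amin)) - 20.0 * np.log10(max(ref, amin))
    if top_db is not None:
        db = np.maximum(db, db.max() - top_db)
    return db

mag = np.array([1.0, 0.1, 0.01, 1e-6])
db = amplitude_to_db_sketch(mag)
db_power = 10 * np.log10(np.maximum(mag, 1e-5) ** 2)   # same via power, before clipping
print(db)          # about 0, -20, -40 dB; the tiny value is floored, then clipped to -80 dB
print(db_power)    # first three match db; the last is -100 dB without clipping
```

Every factor of 10 in amplitude is 20 dB, which is why quiet partials become visible on a dB-scaled spectrogram.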

To display any type of spectrogram in librosa, use [`librosa.display.specshow`](http://bmcfee.github.io/librosa/generated/librosa.display.specshow.html).

In [None]:
plt.figure(figsize=(15, 5))
librosa.display.specshow(S, sr=sr, hop_length=hop_length, x_axis='time', y_axis='linear')
plt.colorbar(format='%+2.0f dB')
plt.show()

## Mel-spectrogram

The [mel scale](https://en.wikipedia.org/wiki/Mel_scale) is a scale of pitches judged by listeners to be equal in distance one from another. The reference point between this scale and normal frequency measurement is defined by equating a 1000 Hz tone, 40 dB above the listener's threshold, with a pitch of 1000 mels. Below about 500 Hz the mel and hertz scales coincide; above that, larger and larger intervals are judged by listeners to produce equal pitch increments.

The name mel comes from the word melody to indicate that the scale is based on pitch comparisons.

Librosa can compute a mel-scaled spectrogram, using [`librosa.feature.melspectrogram`](https://librosa.github.io/librosa/generated/librosa.feature.melspectrogram.html):

In [None]:
hop_length = 256
S = librosa.feature.melspectrogram(y=x, sr=sr, n_fft=4096, hop_length=hop_length)

The human perception of sound intensity is logarithmic in nature. Therefore, like the STFT-based spectrogram, we are often interested in the log amplitude:

In [None]:
logS = librosa.power_to_db(S)  # melspectrogram returns power (|X|^2) by default, already non-negative

Display it with [`librosa.display.specshow`](http://bmcfee.github.io/librosa/generated/librosa.display.specshow.html), this time with a mel-scaled y-axis:

In [None]:
plt.figure(figsize=(15, 5))
librosa.display.specshow(logS, sr=sr, hop_length=hop_length, x_axis='time', y_axis='mel')
plt.colorbar(format='%+2.0f dB')
plt.show()

Using `y_axis='mel'` in `librosa.display.specshow` plots the y-axis on the [mel scale](https://en.wikipedia.org/wiki/Mel_scale), which behaves like a $\log (1 + f)$ function. A widely used formula (the HTK variant) is:

$$ m = 2595 \log_{10} \left(1 + \frac{f}{700} \right) $$
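
The same formula can be used to space the filters of a mel spectrogram: `librosa.feature.melspectrogram` maps a power spectrogram $|X|^2$ through a bank of triangular filters whose centers are equally spaced in mels. Below is a minimal NumPy sketch of that idea using the formula above. Note that librosa's default filterbank uses the Slaney variant of the mel scale and normalizes each triangle by its area, so its numbers will not match this sketch exactly; the helper names and `n_mels=40` are our own illustrative choices.

```python
import numpy as np

def hz_to_mel(f):
    """HTK-style mel formula: m = 2595 * log10(1 + f / 700)."""
    return 2595.0 * np.log10(1.0 + np.asarray(f, dtype=float) / 700.0)

def mel_to_hz(m):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (np.asarray(m, dtype=float) / 2595.0) - 1.0)

def mel_filterbank_sketch(sr, n_fft, n_mels=40):
    """Triangular filters, shape (n_mels, 1 + n_fft // 2), centers equally spaced in mels."""
    fft_freqs = np.arange(1 + n_fft // 2) * sr / n_fft
    # n_mels + 2 edge points from 0 Hz to sr / 2, equally spaced on the mel axis
    hz_points = mel_to_hz(np.linspace(0.0, hz_to_mel(sr / 2), n_mels + 2))
    fb = np.zeros((n_mels, len(fft_freqs)))
    for i in range(n_mels):
        left, center, right = hz_points[i:i + 3]
        rising = (fft_freqs - left) / (center - left)
        falling = (right - fft_freqs) / (right - center)
        fb[i] = np.maximum(0.0, np.minimum(rising, falling))
    return fb

print(hz_to_mel(1000.0))            # about 1000 mels, by construction of the scale

sr_demo, n_fft_demo = 22050, 4096
fb = mel_filterbank_sketch(sr_demo, n_fft_demo)
# Random stand-in for a power spectrogram |X|^2 with 10 frames
P = np.abs(np.random.default_rng(0).standard_normal((1 + n_fft_demo // 2, 10))) ** 2
mel_S = fb @ P                      # (n_mels, n_frames)
print(fb.shape, mel_S.shape)        # (40, 2049) (40, 10)
```

Because the filter edges are equally spaced in mels, the triangles are narrow at low frequencies and wide at high frequencies, matching the compressive shape of the formula.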

## Constant-Q Transform

Unlike the Fourier transform, but similar to the mel scale, the [constant-Q transform](http://en.wikipedia.org/wiki/Constant_Q_transform) uses a logarithmically spaced frequency axis. In addition, the width of each band is proportional to its center frequency, so the transform keeps a constant ratio between frequency and resolution. With an appropriate choice of minimum frequency $f_0$ and number of bins per octave $b$ (for example $b = 12$), the bands line up with the notes of the 12-tone equal-tempered scale.

The constant in constant-Q is this ratio, $Q = f_k / \Delta f_k$, between each band's center frequency $f_k$ and its bandwidth $\Delta f_k$. Because $Q$ is fixed, bands get wider as frequency increases.
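
The relationship can be written out. With minimum frequency $f_0$ and $b$ bins per octave, bin $k$ is centered at $f_k = f_0 \, 2^{k/b}$; taking the bandwidth as the distance to the next center, $\Delta f_k = f_{k+1} - f_k$, gives $Q = 1 / (2^{1/b} - 1)$ for every bin. A short NumPy sketch using the settings of the `librosa.cqt` call in this section (`fmin` = MIDI note 36, 72 bins) and 12 bins per octave; this is the textbook definition, not librosa's exact filter design.

```python
import numpy as np

b = 12                                          # bins per octave (one per semitone)
n_bins = 72                                     # six octaves
f0 = 440.0 * 2 ** ((36 - 69) / 12)              # MIDI 36 (C2), about 65.4 Hz

k = np.arange(n_bins)
f_k = f0 * 2 ** (k / b)                          # geometrically spaced centers
bandwidth = f_k * (2 ** (1 / b) - 1)             # distance to the next center
Q = f_k / bandwidth

print(f_k[0], f_k[-1])             # about 65.4 Hz up to about 3951 Hz (B7)
print(np.allclose(Q, Q[0]), Q[0])  # True, about 16.8
```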

To plot a constant-Q spectrogram, we will use [`librosa.cqt`](http://bmcfee.github.io/librosa/generated/librosa.core.cqt.html#librosa.core.cqt):

In [None]:
fmin = librosa.midi_to_hz(36)  # C2, about 65.4 Hz
C = librosa.cqt(x, sr=sr, fmin=fmin, n_bins=72)  # 72 bins at the default 12 bins per octave = 6 octaves
logC = librosa.amplitude_to_db(np.abs(C))

In [None]:
plt.figure(figsize=(15, 5))
librosa.display.specshow(logC, sr=sr, x_axis='time', y_axis='cqt_note', fmin=fmin, cmap='coolwarm')
plt.colorbar(format='%+2.0f dB')
plt.show()

## References
- International Society for Music Information Retrieval (ISMIR): [https://ismir.net/](https://ismir.net/)
- Laboratory for the Recognition and Organization of Speech and Audio (LabROSA) at Columbia University: [https://labrosa.ee.columbia.edu/](https://labrosa.ee.columbia.edu/)
  - LibROSA: [https://librosa.github.io/librosa/](https://librosa.github.io/librosa/)
- Brian McFee, SciPy 2015 talk on audio processing and MIR with LibROSA: [https://www.youtube.com/watch?v=MhOdbtPhbLU](https://www.youtube.com/watch?v=MhOdbtPhbLU)
  - [website](https://bmcfee.github.io/), [paper](https://bmcfee.github.io/papers/scipy2015_librosa.pdf)