# Preamble

This tutorial depends on the NumPy, SciPy, matplotlib, and intervaltree packages.

In [None]:
import numpy as np                                       # fast vectors and matrices
import matplotlib.pyplot as plt                          # plotting
from scipy.fft import fft                                # fast Fourier transform (scipy.fft is a module in current SciPy)

from IPython.display import Audio

from intervaltree import Interval,IntervalTree

%matplotlib inline

### Constants

A recording of a musical performance is a real-valued time series. The values of this time series represent sound pressure variations sampled at regular intervals, in this case 44,100Hz. The human ear interprets pressure periodicities as musical notes.

In [None]:
fs = 44100      # samples/second

### Load MusicNet

Download MusicNet from http://homes.cs.washington.edu/~thickstn/musicnet.html. See the Introductory tutorial for more information about the dataset.

In [None]:
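# The labels are stored as pickled objects, so recent NumPy versions need allow_pickle=True;
# encoding='latin1' is assumed to be required when reading the archive under Python 3.
train_data = np.load('../musicnet.npz', allow_pickle=True, encoding='latin1')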

# Spectrograms

In [None]:
X,Y = train_data['2494'] # data X and labels Y for recording id 2494

A spectrogram is the pointwise magnitude of the Fourier transform of a segment of an audio signal. We will compute spectrograms of 2048 samples. The number of samples, i.e. the window size, is a parameter of the spectrogram representation. If the window is too short, the spectrogram cannot resolve nearby frequencies and fails to capture relevant information; if it is too long, it loses temporal resolution.

We compute this feature representation at a stride of 512 samples. Therefore each spectrogram has 75% overlap with the previous spectrogram in the time series. Shorter strides lead to a higher-resolution representation of the signal at the cost of increased computational demands.

In [None]:
window_size = 2048            # 2048-sample Fourier windows
stride = 512                  # 512 samples between windows
wps = fs/float(stride)        # ~86 windows/second
Xs = np.empty([int(10*wps),window_size])  # one row per window, covering the first 10 seconds

for i in range(Xs.shape[0]):
    Xs[i] = np.abs(fft(X[i*stride:i*stride+window_size]))
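The window and stride choices above can be sanity-checked on a signal whose spectrum is known in advance. The sketch below is illustrative only: the helper `compute_spectrogram` and the synthetic 440 Hz tone are assumptions introduced here, not part of MusicNet, and `np.fft.fft` is used so the block runs without SciPy.

```python
import numpy as np

def compute_spectrogram(signal, window_size=2048, stride=512):
    """Magnitude of the FFT of each window; one row per window (hypothetical helper)."""
    n_windows = 1 + (len(signal) - window_size) // stride
    spec = np.empty((n_windows, window_size))
    for i in range(n_windows):
        window = signal[i * stride : i * stride + window_size]
        spec[i] = np.abs(np.fft.fft(window))
    return spec

fs = 44100
window_size, stride = 2048, 512
overlap = 1 - stride / window_size      # fraction of samples shared by consecutive windows
bin_width_hz = fs / window_size         # frequency spacing between adjacent components
window_ms = 1000 * window_size / fs     # duration of one window in milliseconds

# Synthetic test signal (an assumption for illustration): one second of a 440 Hz tone.
t = np.arange(fs) / fs
tone = np.sin(2 * np.pi * 440 * t)

spec = compute_spectrogram(tone, window_size, stride)
peak_bin = int(np.argmax(spec[0, : window_size // 2]))   # strongest low-half component
peak_hz = peak_bin * bin_width_hz                        # its frequency in Hz
```

The strongest component should land within one bin width (about 21.5 Hz) of 440 Hz. Doubling `window_size` halves the bin width but doubles the roughly 46 ms span that each window averages over, which is the trade-off described above.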

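Each spectrogram is a list of window_size amplitudes; the k'th amplitude measures the response of the signal window to sinusoidal weights with frequency k, counted in cycles per window. Specifically, if $\mathbf{x} = (x_1,\dots,x_t)$ denotes a segment of an audio signal of length $t$ then we can define the squared response
$$
\text{Spec}_k(\mathbf{x}) \equiv \left|\sum_{s=1}^t e^{2\pi iks/t}x_s\right|^2  = \left(\sum_{s=1}^t \cos(2\pi ks/t)x_s\right)^2 + \left(\sum_{s=1}^t \sin(2\pi ks/t)x_s\right)^2.
$$
The code above stores the magnitude $\sqrt{\text{Spec}_k(\mathbf{x})}$, in keeping with the definition of a spectrogram as the pointwise magnitude of the Fourier transform.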
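As a rough numerical check of this definition, the sketch below evaluates the cosine and sine sums directly and compares them with the squared magnitude of the FFT. It assumes that "frequency k" means k cycles per window, i.e. an angular frequency of 2πk/t per sample; the helper `spec_direct` and the random test window are illustrative, not part of the tutorial's data.

```python
import numpy as np

def spec_direct(x):
    """Evaluate the cosine/sine sums directly for k = 0, ..., t-1 (hypothetical helper)."""
    t = len(x)
    s = np.arange(1, t + 1)                 # sample index s = 1, ..., t
    k = np.arange(t).reshape(-1, 1)         # one row per frequency k
    angles = 2 * np.pi * k * s / t          # assumption: k cycles per window
    cos_part = (np.cos(angles) * x).sum(axis=1)
    sin_part = (np.sin(angles) * x).sum(axis=1)
    return cos_part**2 + sin_part**2

rng = np.random.default_rng(0)
x = rng.standard_normal(64)                 # short random stand-in for an audio window

direct = spec_direct(x)                     # O(t^2) evaluation of the sums
via_fft = np.abs(np.fft.fft(x)) ** 2        # O(t log t) evaluation via the FFT
agree = np.allclose(direct, via_fft)
```

The two agree because starting the index at 1 rather than 0, or flipping the sign of the exponent, only changes the phase of the complex sum for a real-valued signal, not its magnitude.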

The figure below (left) illustrates the spectrogram of X at time t = 3 seconds. The complete set of spectrogram filters ranges from k=0 to k=window_size-1, but for a real-valued signal the amplitudes are symmetric around k=window_size/2=1024, so we can ignore the second half of the spectrogram (below; middle). We can sometimes be much more aggressive; most of the frequency content of a musical recording is concentrated in the low-frequency spectrogram components, so it is often reasonable to cut off the spectrogram at some smaller value (below; right).

In [None]:
second = 3

fig, ((ax1, ax2, ax3)) = plt.subplots(1, 3,sharey=True)
fig.set_figwidth(20)
ax1.plot(Xs[int(second*wps)],color=(41/255.,104/255.,168/255.))
ax1.set_xlim([0,window_size])
ax1.set_ylabel('amplitude')
ax2.plot(Xs[int(second*wps),0:window_size//2],color=(41/255.,104/255.,168/255.))  # integer division: slice indices must be ints
ax2.set_xlim([0,window_size//2])
ax3.plot(Xs[int(second*wps),0:150],color=(41/255.,104/255.,168/255.))
ax3.set_xlim([0,150])
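The symmetry used for the middle panel, and the frequency at which the right panel is cut off, can be checked numerically. The sketch below uses a random real-valued window as a stand-in for audio (an assumption for illustration) and NumPy's `np.fft.fft` and `np.fft.rfft`.

```python
import numpy as np

fs = 44100
window_size = 2048

rng = np.random.default_rng(0)
window = rng.uniform(-1, 1, window_size)    # random stand-in for a real-valued audio window

full = np.abs(np.fft.fft(window))           # window_size amplitudes, k = 0, ..., window_size-1
half = np.abs(np.fft.rfft(window))          # window_size//2 + 1 amplitudes, k = 0, ..., window_size//2

# For real input, the amplitude at k equals the amplitude at window_size - k.
mirrored = full[1:][::-1]                   # amplitudes at k = window_size-1, ..., 1
is_symmetric = np.allclose(full[1:], mirrored)

# Frequency in Hz represented by each retained component.
bin_hz = np.arange(window_size // 2 + 1) * fs / window_size
cutoff_hz = 150 * fs / window_size          # the k=150 cutoff used in the right panel
```

With these settings each component spans about 21.5 Hz, so cutting off at k=150 keeps frequencies up to roughly 3.2 kHz. Computing only the non-redundant half with `rfft` is one way to avoid storing the mirrored amplitudes.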

Recall that the time series X is a floating point array of pressure samples, normalized to the interval [-1,1].

In [None]:
fig = plt.figure()
fig.set_figwidth(20)
fig.set_figheight(2)
plt.plot(X[0:10*fs],color=(41/255.,104/255.,168/255.))
fig.axes[0].set_xlabel('sample (44,100Hz)')
fig.axes[0].set_ylabel('amplitude')

We can plot spectrograms versus time using a heatmap. Compare the time series above to the two-dimensional spectrogram representation of X. The horizontal axis in both cases is time. Below, the vertical axis is frequency, and the color at each point indicates the amplitude of that frequency component at that point in time.

In [None]:
fig = plt.figure(figsize=(20,7))
plt.imshow(Xs.T[0:150],aspect='auto')
plt.gca().invert_yaxis()
fig.axes[0].set_xlabel('windows (~86Hz)')
fig.axes[0].set_ylabel('frequency')
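The heatmap axes above are in units of windows and component indices. A minimal sketch of converting them to seconds and Hz follows, assuming the same `fs`, `window_size`, and `stride` as above and the 861-by-150 slice that is plotted; the variable names are illustrative.

```python
import numpy as np

fs = 44100
window_size, stride = 2048, 512
n_windows = int(10 * fs / stride)   # 861 windows, matching Xs.shape[0] above
n_bins = 150                        # number of low-frequency components shown

window_start_s = np.arange(n_windows) * stride / fs   # start time of each window, in seconds
bin_hz = np.arange(n_bins) * fs / window_size         # frequency of each component, in Hz

# [left, right, bottom, top] in physical units for the plotted region.
extent = [0.0, n_windows * stride / fs, 0.0, n_bins * fs / window_size]
```

A list of this form could be passed to `plt.imshow` through its `extent` argument, together with `origin='lower'`, so that the tick labels read in seconds and Hz rather than in window and component indices.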