# Understanding loudness

How do loudness, amplitude, power and the spectrogram relate to each other?

In [None]:
import IPython
import librosa
import librosa.display
import matplotlib
import matplotlib.pyplot as plt
import numpy as np
import scipy.signal

import warnings

warnings.filterwarnings('ignore', category=matplotlib.MatplotlibDeprecationWarning)
plt.rcParams['figure.figsize'] = (12, 4)

Create a signal consisting of:

1. A pure sine wave of amplitude 1/2.
2. The same sine, but with its second harmonic (880 Hz, one octave up) added at amplitude 1/2.
3. A square wave.
4. Noise.

In [None]:
sr = 44100
l = sr // 10
t = np.arange(0, 4*l) / sr
y = np.concatenate((
    0.5 * np.sin(t[0:l] * 2 * np.pi * 440),
    0.5 * np.sin(t[l:2*l] * 2 * np.pi * 440) + 0.5 * np.sin(t[l:2*l] * 2 * np.pi * 880),
    np.where(t[2*l:3*l] % (1/440) < 1/880, 1.0, -1.0),
    2.0 * np.random.random_sample((l,)) - 1.0
))

plt.plot(t, y)
plt.show()

IPython.display.display(IPython.display.Audio(y, rate=sr))

Create a spectrogram using the short-time Fourier transform.

In [None]:
n_fft = 512
hop_length = 256
win_length = 512
S = np.abs(librosa.stft(y, n_fft=n_fft, hop_length=hop_length, win_length=win_length, window='hann'))

fig, ax = plt.subplots()
ax.set_title('S')
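# Pass sr and hop_length so the axes are labelled in real seconds and Hz
# (specshow otherwise assumes sr=22050 and hop_length=512).
img = librosa.display.specshow(S, sr=sr, hop_length=hop_length, x_axis='time', y_axis='log', ax=ax)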
colorbar = fig.colorbar(img, ax=ax)
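
To read the axes of the spectrogram, it helps to know what each row and column of `S` stands for. The following is a minimal sketch using only NumPy and the parameter values from the cell above. It assumes librosa's default `center=True` framing, so the frame count is an expectation rather than something read back from `S`.

```python
import numpy as np

sr = 44100                    # sample rate used in this notebook
n_fft = 512                   # FFT size
hop_length = 256              # hop between successive frames
n_samples = 4 * (sr // 10)    # length of the test signal y

# One row per non-negative frequency bin: 0, sr/n_fft, ..., sr/2.
bin_freqs = np.fft.rfftfreq(n_fft, d=1 / sr)
n_bins = bin_freqs.size

# Assumption: with center=True, frames are centred on multiples of hop_length.
n_frames = 1 + n_samples // hop_length
frame_times = np.arange(n_frames) * hop_length / sr

nearest_bin = int(np.argmin(np.abs(bin_freqs - 440)))
print(f"expected shape of S: ({n_bins}, {n_frames})")
print(f"bin spacing: {bin_freqs[1]:.2f} Hz, 440 Hz is nearest to bin {nearest_bin}")
```

The 440 Hz tone does not fall exactly on a bin centre, which is one reason its energy is smeared over a few neighbouring rows in the plot.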

Let's compute the RMS value from the samples. The RMS of the square wave is 1, which matches my expectation. That of the pure sine wave of amplitude $1/2$ should be $\frac{1}{2} \cdot \frac{\sqrt{2}}{2} \approx 0.35$, which also seems right.

In [None]:
rms = librosa.feature.rms(y=y, frame_length=n_fft, hop_length=hop_length)[0, :]
plt.plot(rms)
plt.ylim(0, 1.1);
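
The expected RMS of each segment can also be checked directly, without any framing. This is a small NumPy-only sketch that rebuilds the four segments separately. The helper `rms` and the seeded generator are my own additions for illustration, and the theoretical values are the standard ones for a sine ($A/\sqrt{2}$), a sum of two sines at different frequencies, a square wave and uniform noise on $[-1, 1]$ ($1/\sqrt{3}$).

```python
import numpy as np

sr = 44100
n = sr // 10                       # samples per segment, as in the signal above
t = np.arange(n) / sr              # 0.1 s, i.e. exactly 44 periods of 440 Hz

def rms(x):
    """Root mean square of a 1-D array."""
    return np.sqrt(np.mean(np.square(x)))

sine = 0.5 * np.sin(2 * np.pi * 440 * t)
two_sines = sine + 0.5 * np.sin(2 * np.pi * 880 * t)
square = np.where(t % (1 / 440) < 1 / 880, 1.0, -1.0)
rng = np.random.default_rng(0)     # seeded only to make the sketch repeatable
noise = rng.uniform(-1.0, 1.0, n)

print(f"sine:      {rms(sine):.3f}  (theory {0.5 / np.sqrt(2):.3f})")
print(f"two sines: {rms(two_sines):.3f}  (theory {np.sqrt(0.125 + 0.125):.3f})")
print(f"square:    {rms(square):.3f}  (theory 1.000)")
print(f"noise:     {rms(noise):.3f}  (theory {1 / np.sqrt(3):.3f})")
```

The frame-based curve from `librosa.feature.rms` should sit near these values inside each segment and move between them in the frames that straddle a boundary.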

How do the spectrogram values relate to the terms "amplitude" and "power"?

Answer: the spectrum is _amplitude_, and _power_ is _amplitude squared_. This matches the documentation of `librosa.amplitude_to_db`:

> This is equivalent to `power_to_db(S**2)`, but is provided for convenience.
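
The equivalence in that quote comes down to the factor 20 versus 10 in the decibel definition. Here is a minimal sketch with plain NumPy, ignoring librosa's `ref`, `amin` and `top_db` handling; the example amplitudes are arbitrary.

```python
import numpy as np

amplitude = np.array([1.0, 0.5, 0.1, 0.01])
power = amplitude ** 2

db_from_amplitude = 20 * np.log10(amplitude)   # decibels from an amplitude ratio
db_from_power = 10 * np.log10(power)           # decibels from a power ratio

for a, d in zip(amplitude, db_from_amplitude):
    print(f"amplitude {a:>5}: {d:7.2f} dB")
print("identical:", np.allclose(db_from_amplitude, db_from_power))

# Treating amplitudes as if they were powers halves every dB value,
# which compresses the dynamic range of the resulting plot.
db_wrong = 10 * np.log10(amplitude)
print("halved:   ", np.allclose(db_wrong, db_from_amplitude / 2))
```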

According to [Wikipedia](https://en.wikipedia.org/wiki/Audio_power#Power_and_loudness_in_the_real_world):

> Perceived "loudness" varies approximately logarithmically with acoustical output power.

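From all I can find, (the log of) RMS also tracks perceived loudness roughly. That is consistent with the above: RMS is an amplitude-like quantity and its square is the mean power, so on a decibel scale the two carry the same information ($20 \log_{10} \mathrm{RMS} = 10 \log_{10} \mathrm{RMS}^2$).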

In [None]:
# sr and hop_length are passed so the axes show real seconds and Hz.
fig, ax = plt.subplots()
ax.set_title('amplitude_to_db(S)')
img = librosa.display.specshow(librosa.amplitude_to_db(S, ref=np.max),
                               sr=sr, hop_length=hop_length,
                               x_axis='time', y_axis='log', ax=ax)
colorbar = fig.colorbar(img, ax=ax)

# S holds amplitudes, so feeding it to power_to_db halves every dB value.
fig, ax = plt.subplots()
ax.set_title('power_to_db(S) (this is wrong!)')
img = librosa.display.specshow(librosa.power_to_db(S, ref=np.max),
                               sr=sr, hop_length=hop_length,
                               x_axis='time', y_axis='log', ax=ax)
colorbar = fig.colorbar(img, ax=ax)

# Squaring first gives the same picture as amplitude_to_db(S).
fig, ax = plt.subplots()
ax.set_title('power_to_db(S**2)')
img = librosa.display.specshow(librosa.power_to_db(S**2, ref=np.max),
                               sr=sr, hop_length=hop_length,
                               x_axis='time', y_axis='log', ax=ax)
colorbar = fig.colorbar(img, ax=ax)

The RMS that librosa calculates from the spectrogram is not the same as from the samples! It's about 2.3 times lower. Why? The answer is in the `librosa.feature.rms` docs:

> Use a STFT window of constant ones and no frame centering to get consistent results with the RMS computed from the audio samples `y`.

Of course: the windowing "removes" some amplitude/power/energy from the signal. So the values we are getting here depend on the window shape too!

In [None]:
S_rms = librosa.feature.rms(S=S, frame_length=n_fft, hop_length=hop_length)[0, :]

S_unwindowed = np.abs(librosa.stft(y, n_fft=n_fft, hop_length=hop_length, win_length=n_fft, window=np.ones))
S_rms_unwindowed = librosa.feature.rms(S=S_unwindowed, frame_length=n_fft, hop_length=hop_length)[0, :]

plt.plot(rms, label='RMS from samples')
plt.plot(S_rms, 'x', label='RMS from spectrogram')
plt.plot(S_rms_unwindowed, 'o', label='RMS from spectrogram with square window')
plt.ylim(0, 1.1)
plt.legend();

Can we do this ourselves? The comment thread of [issue #1040](https://github.com/librosa/librosa/issues/1040) is interesting in this respect. I don't understand it fully, but I can look at the code and replicate what it does (except for the handling of the DC component, which I assume to be zero).

_Note_: the source code links in the docs on librosa.org are [broken](https://github.com/librosa/librosa/issues/1222) and link to an old version of the source, despite claiming to show the latest version! So what you see at [this link](https://librosa.org/doc/latest/_modules/librosa/feature/spectral.html#rms) is currently (2021-01-11) something pretty old and not correct!

In [None]:
# S and S_unwindowed already hold magnitudes, so they can be squared directly.
plt.plot(S_rms, label='RMS from spectrogram')
plt.plot(np.sqrt(2 * np.sum(S**2, axis=0) / n_fft**2), 'x', label='My own (Hann window)')
plt.plot(S_rms_unwindowed, label='RMS from spectrogram with square window')
plt.plot(np.sqrt(2 * np.sum(S_unwindowed**2, axis=0) / n_fft**2), 'o', label='My own (square window)')
plt.legend();
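
The factor 2 and the division by `n_fft**2` in that formula follow from Parseval's theorem for a real signal. Below is a minimal sketch on a single frame with a rectangular window, using only NumPy. The random frame is just an illustrative input. The halving of the DC and Nyquist bins is my understanding of what the exact version needs, since those two bins have no mirrored negative-frequency partner.

```python
import numpy as np

n_fft = 512
rng = np.random.default_rng(0)
frame = rng.uniform(-1.0, 1.0, n_fft)      # one frame, rectangular window

rms_time = np.sqrt(np.mean(frame ** 2))    # RMS straight from the samples

X = np.fft.rfft(frame)                     # bins 0 .. n_fft/2
power = np.abs(X) ** 2

# Interior bins stand for a positive and a negative frequency, hence the 2.
# DC (bin 0) and Nyquist (bin n_fft/2) occur only once, so halve them first.
power_exact = power.copy()
power_exact[0] *= 0.5
power_exact[-1] *= 0.5
rms_exact = np.sqrt(2 * np.sum(power_exact) / n_fft ** 2)

# The shortcut: double every bin, including DC and Nyquist.
rms_shortcut = np.sqrt(2 * np.sum(power) / n_fft ** 2)

print(f"time domain: {rms_time:.6f}")
print(f"exact:       {rms_exact:.6f}")
print(f"shortcut:    {rms_shortcut:.6f}")
```

The shortcut can only overestimate, and only by the energy in those two bins, which is negligible for signals with no DC offset and little content at `sr/2`.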

How does this RMS depend on FFT parameters, if at all? Fortunately, it isn't affected by the frame length, which is as we would expect.

In [None]:
for n in [256, 512, 1024]:
    Sn = np.abs(librosa.stft(y=y, n_fft=n, hop_length=256))
    plt.plot(
        librosa.feature.rms(S=Sn, frame_length=n, hop_length=256)[0],
        label=f'n_fft = {n}')
plt.legend();

It is, however, affected by the window shape.

In [None]:
for w in ['hann', 'blackmanharris', ('kaiser', 8*np.pi)]:
    Sn = np.abs(librosa.stft(y=y, n_fft=n_fft, hop_length=hop_length, window=w))
    plt.plot(
        librosa.feature.rms(S=Sn, frame_length=n_fft, hop_length=hop_length)[0],
        label=str(w))
plt.legend();

Can we compensate for this? Yes, we can! Dividing by the RMS of the window shape does the trick.

In [None]:
for w in ['hann', 'blackmanharris', ('kaiser', 8*np.pi)]:
    Sn = np.abs(librosa.stft(y=y, n_fft=n_fft, hop_length=hop_length, window=w))
    window_rms = np.sqrt(np.mean(scipy.signal.get_window(w, n_fft)**2))
    plt.plot(
        librosa.feature.rms(S=Sn, frame_length=n_fft, hop_length=hop_length)[0] / window_rms,
        label=f'{w} (rms: {window_rms:.3f})')
plt.legend();
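
Why dividing by the window RMS works can be seen on a single frame. This is a sketch under simple assumptions: a periodic Hann window written out by hand (so that only NumPy is needed) and a steady sine with a whole number of cycles in the frame.

```python
import numpy as np

n_fft = 512
n = np.arange(n_fft)

# Periodic Hann window, written out so the sketch needs only NumPy.
window = 0.5 - 0.5 * np.cos(2 * np.pi * n / n_fft)
window_rms = np.sqrt(np.mean(window ** 2))          # sqrt(3/8) for a periodic Hann

# A sine of amplitude 1/2 with exactly 5 cycles in the frame.
frame = 0.5 * np.sin(2 * np.pi * 5 * n / n_fft)

rms_raw = np.sqrt(np.mean(frame ** 2))
rms_windowed = np.sqrt(np.mean((window * frame) ** 2))
rms_compensated = rms_windowed / window_rms

print(f"window RMS:      {window_rms:.4f}")
print(f"raw RMS:         {rms_raw:.4f}")
print(f"windowed RMS:    {rms_windowed:.4f}")
print(f"compensated RMS: {rms_compensated:.4f}")
```

For a steady signal like this the correction is essentially exact. When the level changes within a frame, the window weights the middle of the frame more heavily, so the compensated value is only correct on average.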

None of this is directly representative of _perceptual_ volume. For that, we need something like A-weighting.

In [None]:
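# 108 semitone-spaced frequencies (9 octaves at the default 12 bins per octave), starting at C1.
# Taken from the librosa example; keyword arguments also work on newer librosa versions.
freqs = librosa.cqt_frequencies(n_bins=108, fmin=librosa.note_to_hz('C1'))
for w in 'ABCDZ':
    plt.plot(freqs, librosa.frequency_weighting(freqs, kind=w), label=w)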
plt.xscale('log')
plt.legend();

These weights are expressed in decibels and are meant to be added to the _power_ spectrum in dB, not to the _amplitude_ spectrum we have, as per the docs for `librosa.perceptual_weighting`:

> Perceptual weighting of a power spectrogram:
> 
>     S_p[f] = frequency_weighting(f, 'A') + 10*log(S[f] / ref)

And the subsequent example:

>     C = np.abs(librosa.cqt(y, sr=sr, fmin=librosa.note_to_hz('A1')))
>     perceptual_CQT = librosa.perceptual_weighting(C**2,
>                                                   freqs,
>                                                   ref=np.max)

This `cqt` is not the same as `stft`, but it seems the [constant-Q transform](https://en.wikipedia.org/wiki/Constant-Q_transform) is similar to the FFT, only with logarithmically spaced bins rather than linearly spaced ones. So I would expect that code to remain correct for the `stft`, provided `freqs` holds the STFT bin frequencies (for example from `librosa.fft_frequencies(sr=sr, n_fft=n_fft)`) rather than the CQT ones.
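
To make the idea concrete without relying on librosa, here is a NumPy-only sketch that A-weights the power spectrum of a single FFT frame. The helper `a_weighting_db` is hypothetical: I wrote it from the standard analog A-weighting formula, and it is not librosa's implementation. The DC bin is dropped because the weighting tends to minus infinity at 0 Hz, and the `1e-10` floor is an arbitrary choice to avoid taking the log of zero.

```python
import numpy as np

def a_weighting_db(freqs_hz):
    """A-weighting gain in dB (standard analog formula), about 0 dB at 1 kHz."""
    f2 = np.asarray(freqs_hz, dtype=float) ** 2
    numerator = 12194.0 ** 2 * f2 ** 2
    denominator = (
        (f2 + 20.6 ** 2)
        * np.sqrt((f2 + 107.7 ** 2) * (f2 + 737.9 ** 2))
        * (f2 + 12194.0 ** 2)
    )
    return 20 * np.log10(numerator / denominator) + 2.0

sr, n_fft = 44100, 512
freqs = np.fft.rfftfreq(n_fft, d=1 / sr)[1:]       # linearly spaced bins, DC dropped

# Power spectrum of one Hann-windowed frame holding a 440 Hz sine.
n = np.arange(n_fft)
window = 0.5 - 0.5 * np.cos(2 * np.pi * n / n_fft)
frame = 0.5 * np.sin(2 * np.pi * 440 * n / sr)
power = np.abs(np.fft.rfft(window * frame))[1:] ** 2

# Convert power to dB relative to the peak, then add the weights in the dB domain.
power_db = 10 * np.log10(np.maximum(power, 1e-10) / power.max())
weighted_db = power_db + a_weighting_db(freqs)

peak = int(np.argmax(power_db))
print(f"peak bin frequency: {freqs[peak]:.1f} Hz")
print(f"A-weight at peak:   {a_weighting_db(freqs[peak]):.1f} dB")
```

The only thing that changes between the CQT and the STFT case is the frequency array: the weights are looked up per bin, so they have to match the bins of whichever transform produced the spectrogram.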