# Processing audio data

The data for this competition consists of audio files. There are many ways to build machine learning systems that analyze audio data, yet past editions have shown that some approaches are more practical than others. In this notebook, we will briefly explore a few ways to open sound files and extract meaningful samples for training.

Here are a few things to consider before we start:

* all audio files are sampled at 32 kHz, contain only one channel (i.e., mono), and are compressed using the open OGG Vorbis encoding
* test audio files (i.e., soundscapes) all have a uniform length (600 seconds)
* training audio files vary in length, ranging from a few seconds to multiple minutes

The varying duration of the training recordings is one of the biggest challenges. On top of that, we have to deal with weak labels, which means that we don’t know the exact timestamp of each bird call.

But let’s take a look at a random training recording.

In [None]:
# Pick a file
audio_path = '../input/birdclef-2021/train_short_audio/banana/XC112602.ogg'

# Listen to it
import IPython.display as ipd
ipd.Audio(audio_path)

We can hear a Bananaquit as the primary species, but there's also a lot going on in the background.

Ok, now let's open the file with Librosa and look at the first 15 seconds of it.

In [None]:
import numpy as np
import warnings
warnings.filterwarnings(action='ignore')

# Librosa is a widely used, versatile audio library for Python.
# Depending on the installed backend, it decodes audio files
# via soundfile or audioread (which can use FFmpeg).
# For more information visit: https://librosa.org/doc/latest/index.html
import librosa

# Load the first 15 seconds of this file using librosa
# (offset is given in seconds; 0.0 means "start at the beginning")
sig, rate = librosa.load(audio_path, sr=32000, offset=0.0, duration=15)

# The result is a 1D numpy array that contains audio samples.
# Take a look at the shape (seconds * sample rate == 15 * 32000 == 480000)
print('SIGNAL SHAPE:', sig.shape)

This is what the waveform looks like:

In [None]:
import matplotlib.pyplot as plt
import librosa.display

plt.figure(figsize=(15, 5))
# Note: newer librosa releases replace waveplot with librosa.display.waveshow
librosa.display.waveplot(sig, sr=32000)

We could think about using the raw signal for training, but we need to consider the input size of a 5-second segment: 5 seconds * 32,000 Hz == 160,000 samples. That would be the equivalent of a 400 x 400 pixel image, which would be an uncommonly large input for a neural network.
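The arithmetic above can be checked in a couple of lines. This is only a small sketch; the constant names `SAMPLE_RATE` and `CHUNK_SECONDS` are chosen here for illustration and do not come from the competition code.

```python
import math

SAMPLE_RATE = 32000   # Hz, the sampling rate of all competition audio
CHUNK_SECONDS = 5     # segment duration discussed above

# Number of raw samples in one segment
n_samples = SAMPLE_RATE * CHUNK_SECONDS

# Side length of a square image with the same number of values
side = math.isqrt(n_samples)

print('Samples per segment:', n_samples)
print('Equivalent square image:', side, 'x', side)
```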

We’ve seen in the past that convolutional neural networks (CNNs) perform particularly well for sound classification. However, the image-style CNNs commonly used for this expect 2D inputs. Luckily, we can transform an audio signal into a 2D representation: a so-called spectrogram (https://en.wikipedia.org/wiki/Spectrogram). A spectrogram shows how the frequency content of the signal changes over time, and we can use Librosa to compute one from our raw signal.

In [None]:
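# First, compute the magnitude spectrogram using the "short-time Fourier transform" (stft)
# (stft returns complex values; np.abs keeps the amplitudes and discards the phase)
spec = np.abs(librosa.stft(sig))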

# Scale the amplitudes according to the decibel scale
spec_db = librosa.amplitude_to_db(spec, ref=np.max)

# Plot the spectrogram
plt.figure(figsize=(15, 5))
librosa.display.specshow(spec_db, 
                         sr=32000, 
                         x_axis='time', 
                         y_axis='hz', 
                         cmap=plt.get_cmap('viridis'))

Wow, that looks nice. We can clearly see three foreground vocalizations of a Bananaquit, and each 5-second segment contains one of them (which is, of course, not always the case). Since our sampling rate is 32 kHz, the maximum frequency that we can visualize in a spectrogram is 16 kHz (half the sampling rate), which is the upper limit of the y-axis.

However, that's still a very large input for a CNN. Look at the shape:

In [None]:
print('SPEC SHAPE:', spec_db.shape)

We need to consider some parameters that we can specify for the STFT: the “*window length*” (which, via the FFT size "*n_fft*", also determines the number of frequency bins, 1 + n_fft / 2) and the “*hop length*”. Take a look at the librosa documentation to see what these values mean (https://librosa.org/doc/0.8.0/generated/librosa.stft.html).
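To see why the default spectrogram is so large, the expected output shape can be worked out without running the STFT. The sketch below assumes librosa's default settings (`n_fft=2048`, a hop length of a quarter of the window, and centered framing); `stft_shape` is a helper name introduced here for illustration, not a librosa function.

```python
def stft_shape(n_samples, n_fft=2048, hop_length=None):
    """Expected (frequency bins, frames) of an STFT with centered framing."""
    if hop_length is None:
        # Assumed default: a quarter of the window length
        hop_length = n_fft // 4
    n_bins = 1 + n_fft // 2                  # one bin per non-negative frequency
    n_frames = 1 + n_samples // hop_length   # one frame per hop, plus the first
    return n_bins, n_frames

# 15 seconds of audio at 32 kHz with default parameters
print('Expected shape:', stft_shape(15 * 32000))
```

Smaller values of `n_fft` shrink the first dimension, while larger hop lengths shrink the second.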

Let’s try different values. Say, we want to have an overlap between frames of 50%, so our *hop_length* should be half of our *win_length*.

In [None]:
# Try a few window lengths (should be a power of 2)
for win_length in [128, 256, 512, 1024]:
    
    # We want 50% overlap between frames
    hop_length = win_length // 2
    
    # Compute the magnitude spec (win_length defaults to n_fft, so setting n_fft sets both)
    spec = np.abs(librosa.stft(sig,
                               n_fft=win_length,
                               hop_length=hop_length))
    
    # Scale to decibel scale
    spec_db = librosa.amplitude_to_db(spec, ref=np.max)
    
    # Show plot
    plt.figure(figsize=(15, 5))
    plt.title('Window length: ' + str(win_length) + ', Shape: ' + str(spec_db.shape))
    librosa.display.specshow(spec_db, 
                             sr=32000, 
                             hop_length=hop_length, 
                             x_axis='time', 
                             y_axis='hz', 
                             cmap=plt.get_cmap('viridis'))

We can clearly see that there is a trade-off between vertical resolution (i.e. frequency resolution) and horizontal resolution (i.e. number of time steps).
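The trade-off can also be expressed in numbers. The following sketch computes the spacing between frequency bins and between frames for the same window lengths, assuming a hop length of half the window; `stft_resolution` is a hypothetical helper written for this illustration.

```python
SAMPLE_RATE = 32000  # Hz

def stft_resolution(win_length, sample_rate=SAMPLE_RATE):
    """Return (Hz per frequency bin, milliseconds per frame) for 50% overlap."""
    hop_length = win_length // 2
    freq_res_hz = sample_rate / win_length          # spacing between frequency bins
    time_res_ms = 1000 * hop_length / sample_rate   # spacing between frames
    return freq_res_hz, time_res_ms

for win_length in [128, 256, 512, 1024]:
    freq_res, time_res = stft_resolution(win_length)
    print(f'win_length={win_length:4d}: {freq_res:6.2f} Hz per bin, {time_res:5.2f} ms per frame')
```

Doubling the window length halves the bin spacing (finer frequency detail) but doubles the frame spacing (coarser timing).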

Ok, but how do we get inputs that have a reasonable size? Well, we could use the so-called mel scale (https://en.wikipedia.org/wiki/Mel_scale) to scale the frequency axis of our spectrogram. In the past, this approach worked well for bird sound recognition, even though the mel scale was designed around human hearing. Luckily, Librosa supports this transformation. We can set the number of mel bins we want to use, and that number becomes the vertical resolution of the spectrogram. We also know that the hop length determines the width of the spectrogram, so we have to settle on a certain value. On top of that, we should probably process 5-second chunks of audio (since that’s the submission segment duration).

We should probably also consider the vocal and auditory range of birds. We know that most songbirds vocalize between 1 and 4 kHz. Yet, some species vocalize below that, and some significantly above. If we look at our Bananaquit example, we can see that it is indeed a "high frequency" species, vocalizing up to 10 kHz. In general, we can probably limit the frequency range of a spectrogram to between 500 Hz and 12.5 kHz. Not many birds will vocalize outside this range.
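To get a feeling for how the mel scale distributes bins across this frequency range, here is a minimal sketch. It uses the common HTK-style formula `mel = 2595 * log10(1 + f / 700)`, which is an assumption made for illustration; librosa uses a slightly different (Slaney-style) variant by default, so exact band edges will differ. The 64 bands are likewise just an example value.

```python
import math

def hz_to_mel(freq_hz):
    """Convert Hz to mel using the HTK-style formula (an assumption here)."""
    return 2595.0 * math.log10(1.0 + freq_hz / 700.0)

def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)

FMIN, FMAX = 500, 12500   # frequency range suggested above
NUM_BANDS = 64            # example number of mel bands

# Band edges are equally spaced on the mel scale, not in Hz
mel_min, mel_max = hz_to_mel(FMIN), hz_to_mel(FMAX)
step = (mel_max - mel_min) / (NUM_BANDS + 1)
edges_hz = [mel_to_hz(mel_min + i * step) for i in range(NUM_BANDS + 2)]

print('Lowest band spacing (Hz): ', round(edges_hz[1] - edges_hz[0], 1))
print('Highest band spacing (Hz):', round(edges_hz[-1] - edges_hz[-2], 1))
```

The bands are narrow at low frequencies and wide at high frequencies, which is how a small number of mel bins can still cover the full range.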

Here is what it would look like to extract 5-second, mel scale spectrograms with a target resolution of 64 x 256 pixels with Librosa:


In [None]:
# Desired shape of the input spectrogram
SPEC_HEIGHT = 64
SPEC_WIDTH = 256

# Derive num_mels and hop_length from the desired spec shape
# num_mels is easy, that's just spec_height
# hop_length is a bit more complicated: with centered framing,
# a chunk of N samples yields 1 + N // hop_length frames
NUM_MELS = SPEC_HEIGHT
HOP_LENGTH = int(32000 * 5 / (SPEC_WIDTH - 1)) # sample rate * duration / (spec width - 1) == 627

# High- and low-pass frequencies
# For many birds, these are a good choice
FMIN = 500
FMAX = 12500

# Let's get the spectrograms of all three 5-second chunks
for second in [5, 10, 15]:
    
    # Get start and stop sample
    s_start = (second - 5) * 32000
    s_end = second * 32000

    # Compute the spectrogram and apply the mel scale
    mel_spec = librosa.feature.melspectrogram(y=sig[s_start:s_end], 
                                              sr=32000, 
                                              n_fft=1024, 
                                              hop_length=HOP_LENGTH, 
                                              n_mels=NUM_MELS, 
                                              fmin=FMIN, 
                                              fmax=FMAX)
    
    mel_spec_db = librosa.power_to_db(mel_spec, ref=np.max)

    # Show the spec
    plt.figure(figsize=(15, 5))
    plt.title('Second: ' + str(second) + ', Shape: ' + str(mel_spec_db.shape))
    librosa.display.specshow(mel_spec_db, 
                             sr=32000, 
                             hop_length=HOP_LENGTH, 
                             x_axis='time', 
                             y_axis='mel',
                             fmin=FMIN, 
                             fmax=FMAX, 
                             cmap=plt.get_cmap('viridis'))

Neat! We got some nice-looking samples out of that. However, be aware that many 5-second chunks won’t contain the target species. How do you get rid of those? Well, that’s for you to find out. (Hint: signal-to-noise ratio or energy levels might be a good start.)
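As one possible starting point for the energy-based hint, the sketch below splits a signal into 5-second chunks and keeps only those whose RMS energy is close to that of the loudest chunk. Everything here is an assumption made for illustration: the helper names, the zero-padding of the last chunk, and the `threshold_ratio` of 0.5 are not part of the competition code, and a loud chunk is not guaranteed to contain the target species. A synthetic signal stands in for a real recording.

```python
import numpy as np

SAMPLE_RATE = 32000
CHUNK_SECONDS = 5

def split_into_chunks(sig, sample_rate=SAMPLE_RATE, seconds=CHUNK_SECONDS):
    """Split a 1D signal into fixed-length chunks, zero-padding the last one."""
    chunk_len = sample_rate * seconds
    n_chunks = max(1, int(np.ceil(len(sig) / chunk_len)))
    padded = np.zeros(n_chunks * chunk_len, dtype=np.float32)
    padded[:len(sig)] = sig
    return padded.reshape(n_chunks, chunk_len)

def select_loud_chunks(sig, threshold_ratio=0.5):
    """Keep chunks whose RMS is at least threshold_ratio times the loudest RMS."""
    chunks = split_into_chunks(sig)
    rms = np.sqrt(np.mean(chunks ** 2, axis=1))
    keep = rms >= threshold_ratio * rms.max()
    return chunks[keep], rms

# Synthetic 12-second example: faint noise with a 3 kHz tone in seconds 5 to 10
rng = np.random.default_rng(0)
demo = (0.01 * rng.standard_normal(12 * SAMPLE_RATE)).astype(np.float32)
t = np.arange(5 * SAMPLE_RATE) / SAMPLE_RATE
demo[5 * SAMPLE_RATE:10 * SAMPLE_RATE] += (0.5 * np.sin(2 * np.pi * 3000 * t)).astype(np.float32)

selected, rms = select_loud_chunks(demo)
print('RMS per chunk:', rms)
print('Selected chunks shape:', selected.shape)
```

In practice, a fixed ratio like this will be too crude for noisy soundscapes, so treat it as a baseline to compare better signal-to-noise estimates against.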

The methods mentioned in this notebook are not the only way to approach the competition. But they should give you a few starting points. Make sure to check out our other notebooks, let us know if you have any comments and - of course - don’t hesitate to start a forum thread if you have any questions.