Preprocessing audio data seems extremely important, and how to do it well is not obvious.


In this notebook I will generate 3 different sets of spectrograms that we will be able to train on. Finally, we will combine 3 spectrograms into a single image by storing them in its 3 channels.

Before we jump to this, it would be good to pinpoint the information that would be worthwhile to visualize.

In [1]:
from pathlib import Path
import librosa
import numpy as np
import matplotlib.pyplot as plt
from IPython.display import Audio
from fastai2.basics import *
from fastai2.callback.all import *
from fastai2.vision.all import *
from fastprogress import progress_bar

In [2]:
trn_paths = list(Path('data/audio_train/').iterdir())
tst_paths = list(Path('data/audio_test/').iterdir())

In [3]:
%%time

trn_srs = []
trn_lengths = []
max_vals = []

for path in trn_paths:  # same order as trn_paths, so indices below line up
    x, sr = librosa.load(path, sr=None)  # sr=None keeps the native sample rate
    trn_srs.append(sr)
    trn_lengths.append(x.shape[0] / sr)
    max_vals.append(np.max(x))

CPU times: user 6.7 s, sys: 3.1 s, total: 9.8 s
Wall time: 23.5 s


In [4]:
np.quantile(trn_lengths, 0.56), np.quantile(trn_lengths, 0.75)

(5.0, 9.38)

About 56% of the clips last 5 seconds or less, so 5 seconds sounds like a reasonable duration to work with. We can pad the shorter audio files with `0`s. But where in the longer files do we find the interesting bits?
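
Zero-padding to a fixed duration can be sketched as follows. This is a minimal illustration, not the code used later in the notebook: `pad_or_truncate` and `TARGET_SECONDS` are hypothetical names, and the sample rate of 22050 Hz is an assumption.

```python
import numpy as np

SR = 22050          # assumed sample rate in Hz
TARGET_SECONDS = 5  # target clip duration in seconds

def pad_or_truncate(x, n_samples):
    """Return exactly n_samples samples: zero-pad short clips at the end, cut long ones."""
    if len(x) >= n_samples:
        return x[:n_samples]
    return np.pad(x, (0, n_samples - len(x)))

short_clip = np.ones(2 * SR, dtype=np.float32)  # stand-in for a 2-second recording
long_clip = np.ones(9 * SR, dtype=np.float32)   # stand-in for a 9-second recording

fixed_short = pad_or_truncate(short_clip, TARGET_SECONDS * SR)
fixed_long = pad_or_truncate(long_clip, TARGET_SECONDS * SR)
```

Both results have `TARGET_SECONDS * SR` samples. Truncating from the start, as done here for the long clip, is exactly the part that the rest of the notebook tries to improve on.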

One important consideration is to trim the silence from the beginning and end of an audio file. To tune the trimming, I first looked for the recordings with the lowest maximum sample value.

In [5]:
np.argsort(max_vals)[:20]

array([5097, 8590, 5119, 7190, 3011, 7235, 4412, 2757, 4790, 3581,    5,
       4214, 8167,  781, 2356, 5748, 4661, 4210, 4628, 4408])

By manually inspecting each of these sounds, I found one that is not just white noise. It is the one below.

In [6]:
x, sr = librosa.load(trn_paths[3581])
Audio(x, rate=sr)

I can now tune the parameters of `librosa.effects.trim` to remove the silence while retaining the signal. My hope is that if this works for this recording (which has one of the lowest maximum values in the train set), it should work equally well for the other files.

In [7]:
xx, idx = librosa.effects.trim(x, top_db=1)

In [8]:
Audio(xx, rate=sr)

This worked!
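
To make the role of `top_db` more concrete, here is a simplified, sample-wise stand-in for silence trimming. It is not librosa's algorithm (`librosa.effects.trim` works on frame-wise energy rather than single samples), and `simple_trim` is a hypothetical helper. The idea it shares with the real function is that anything more than `top_db` decibels below the peak is treated as silence.

```python
import numpy as np

def simple_trim(x, top_db=1.0):
    """Sample-wise stand-in for silence trimming (NOT librosa's frame-based algorithm)."""
    mag = np.abs(x)
    peak = mag.max() if mag.size else 0.0
    if peak == 0:
        # empty or all-zero input: nothing counts as signal
        return x[:0], (0, 0)
    # a sample counts as signal if it is within top_db decibels of the peak
    threshold = peak * 10 ** (-top_db / 20)
    keep = np.flatnonzero(mag >= threshold)
    start, end = int(keep[0]), int(keep[-1]) + 1
    return x[start:end], (start, end)

# 100 silent samples, a short burst, then 50 silent samples
x = np.concatenate([np.zeros(100), [0.2, 1.0, 0.95, 0.1, 0.9], np.zeros(50)])
trimmed, (start, end) = simple_trim(x, top_db=1.0)
```

With `top_db=1` the threshold is roughly 89% of the peak, so the quiet `0.2` sample at the start of the burst is cut as well. Only the leading and trailing parts are removed; the quiet `0.1` sample in the middle is kept.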

The second question we need to answer is how do we identify the interesting portions of an audio file.

My somewhat naive reasoning is this - the interesting portion will probably contain loud sounds, that is, stretches where the waveform has high magnitude. There is probably a better way to approach this (I suspect there is some interplay between frequency and magnitude) but maybe this naive reasoning will be good enough.

We can take the absolute value of the signal and find its center of mass.

In [9]:
from scipy.ndimage import center_of_mass  # the scipy.ndimage.measurements namespace is deprecated

In [10]:
center_of_mass(xx)

(249.43684074919517,)
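
For a 1-D signal, the center of mass of the absolute value is just the magnitude-weighted mean sample index. The sketch below spells that out in plain NumPy. `magnitude_center` is a hypothetical helper and the midpoint fallback for silent input is my own assumption.

```python
import numpy as np

def magnitude_center(x):
    """Magnitude-weighted mean sample index: sum(i * |x[i]|) / sum(|x[i]|)."""
    weights = np.abs(x)
    total = weights.sum()
    if total == 0:
        return (len(x) - 1) / 2  # silent clip: fall back to the midpoint
    return float(np.dot(np.arange(len(x)), weights) / total)

# 900 silent samples followed by a loud burst that oscillates around zero
x = np.concatenate([np.zeros(900), np.tile([1.0, -1.0], 50)])
center = magnitude_center(x)
```

Here `center` lands in the middle of the burst. The absolute value matters: the raw samples of this signal sum to zero, so a center of mass computed on the signed waveform would divide by zero. Real audio also oscillates around zero, so the signed version tends to be unstable there too.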

Ok, seems like the solution is starting to take shape. Let's start putting everything we want to do in a function.

In [11]:
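SR = 22050

def preprocess(x, length=5):
    # length given in seconds; x is assumed to be sampled at SR
    n_samples = length * SR

    # remove leading and trailing silence
    xx, idx = librosa.effects.trim(x, top_db=1)
    # center of mass of the absolute signal, as a sample index
    center_idx = int(center_of_mass(np.abs(xx))[0])

    half_of_length = n_samples // 2 # measured in samples

    if center_idx < half_of_length:
        xxx = xx[:n_samples]
    else:
        xxx = xx[center_idx-half_of_length:center_idx+half_of_length]

    # zero-pad clips that are still too short, then cut to exactly n_samples
    return np.pad(xxx, (0, n_samples))[:n_samples]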
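
One edge case in `preprocess` is worth noting. When the center of mass lies within half a window of the end of the clip, the slice `xx[center_idx-half_of_length:center_idx+half_of_length]` runs past the end, so the result is partly zero-padding even though more audio is available earlier in the file. A possible alternative, sketched here under my own assumptions (`window_around` is a hypothetical helper, not something the notebook uses), is to shift the window so it always stays inside the clip.

```python
import numpy as np

def window_around(x, center_idx, n_samples):
    """Return n_samples samples around center_idx, shifting the window to stay inside x."""
    if len(x) <= n_samples:
        # clip is too short for a full window: zero-pad at the end
        return np.pad(x, (0, n_samples - len(x)))
    start = center_idx - n_samples // 2
    start = max(0, min(start, len(x) - n_samples))  # clamp to the valid range
    return x[start:start + n_samples]

x = np.arange(100, dtype=np.float32)  # sample values equal their index
near_end = window_around(x, center_idx=95, n_samples=20)
near_start = window_around(x, center_idx=3, n_samples=20)
short = window_around(x[:8], center_idx=4, n_samples=20)
```

Whether this is preferable depends on the data. Padding keeps the loud region centered, while clamping keeps more real audio in the window.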

We seem to be getting somewhere. Let's test this with a really long file.

In [12]:
x, sr = librosa.load(trn_paths[np.argsort(trn_lengths)[-1]])
Audio(x, rate=sr)

In [13]:
x = preprocess(x)

In [14]:
Audio(x, rate=sr)

Seems like this could be doing what we would like it to do.

Let's save the files after this preprocessing. We might want to build another model straight on the waveform, and these clips should be much better than the first 2 seconds of each file that we used earlier.

In [20]:
def preprocess_file(path, output_dir):
    x, sr = librosa.load(path, sr=SR)  # resample to SR on load
    if np.count_nonzero(x) == 0: return # this is to accommodate the 3 corrupted files in test
    x = preprocess(x)
    # note: librosa.output was removed in librosa 0.8; newer versions need e.g. soundfile.write
    librosa.output.write_wav(f'{output_dir}/{path.name}', x, SR, norm=False)
    
def preprocess_train(path): preprocess_file(path, 'data/audio_train_22k_5sec')
def preprocess_test(path): preprocess_file(path, 'data/audio_test_22k_5sec')

In [16]:
!rm -rf data/audio_train_22k_5sec
!rm -rf data/audio_test_22k_5sec

In [17]:
!mkdir data/audio_train_22k_5sec
!mkdir data/audio_test_22k_5sec

Unfortunately, we cannot use `parallel` to process files in parallel. Something in `preprocess` makes the processing hang. There is not that much data - we should be able to process one file at a time.

In [18]:
for path in progress_bar(trn_paths): preprocess_train(path)

In [21]:
for path in progress_bar(tst_paths): preprocess_test(path)