# Audio Segmentation

My approach for this competition was to segment each audio clip into small chunks that contain only one or a few bird calls, and to train on spectrograms of those chunks. The competition is coming to an end and I wanted to share my segmentation method. Hopefully someone finds it useful or can point me to a better implementation :-)

In [None]:
from pathlib import Path

import numpy as np
import matplotlib.pyplot as plt
import librosa
import pandas as pd
import IPython.display as ipd

Load the training dataframe and select a row at random.

In [None]:
BASE_DIR = Path('../input/birdsong-recognition')
train_df = pd.read_csv(BASE_DIR / 'train.csv')
random_row = train_df.sample().squeeze()

Load the audio, then plot the waveform and listen to it.

In [None]:
sample_rate = 32000
fpath = BASE_DIR / 'train_audio' / random_row['ebird_code'] / random_row['filename']
# Resample to a fixed rate and mix down to mono so every clip is comparable.
audio, _ = librosa.load(fpath, sr=sample_rate, mono=True)

plt.plot(audio)
ipd.display(ipd.Audio(audio, rate=sample_rate))

Here is my signal-to-noise-ratio (SNR) based segmenter. Given an audio clip, it first estimates the noise level by taking the maximum absolute amplitude (absmax) of small non-overlapping chunks of audio. The smallest absmax is chosen as the noise level. Looking at the waveform, this can intuitively be thought of as the peak amplitude of a small chunk that contains nothing but noise.

Then we go through the audio signal in longer, overlapping segments. If a segment has an absmax that is significantly larger than the noise level (their ratio exceeds a chosen SNR threshold), we keep that segment.

In [None]:
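class SNRSegmenter:
    """Keep fixed-length segments whose peak amplitude stands out from the noise level."""

    def __init__(self, sample_rate, segment_len_ms, hop_len_ms, noise_len_ms, call_snr):
        # Convert all durations from milliseconds to a number of samples.
        self.segment_len_samples = int(sample_rate * segment_len_ms / 1000)
        self.hop_len_samples = int(sample_rate * hop_len_ms / 1000)
        self.noise_len_samples = int(sample_rate * noise_len_ms / 1000)

        # Minimum peak-to-noise amplitude ratio for a segment to be kept.
        self.call_snr = call_snr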

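    def _get_noise_level(self, sample):
        """Return the smallest peak amplitude over non-overlapping chunks."""
        abs_max = []

        if len(sample) > self.noise_len_samples:
            # Peak amplitude of each non-overlapping chunk.
            idx = 0
            while idx + self.noise_len_samples < len(sample):
                abs_max.append(np.max(np.abs(sample[idx:(idx+self.noise_len_samples)])))
                idx += self.noise_len_samples
        else:
            # Clip is shorter than one chunk: use the whole clip.
            abs_max.append(np.max(np.abs(sample)))

        return min(abs_max)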

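    def __call__(self, sample):
        # A digitally silent chunk gives a noise level of 0, which would make
        # the ratio below divide by zero, so clamp it to a tiny positive value.
        noise_level = max(self._get_noise_level(sample), np.finfo(np.float32).eps)

        call_segments = []
        call_snrs = []

        # Clips that are not longer than one segment yield no segments.
        if len(sample) > self.segment_len_samples:
            idx = 0
            while idx + self.segment_len_samples < len(sample):
                segment = sample[idx:(idx+self.segment_len_samples)]
                snr = np.max(np.abs(segment)) / noise_level
                if snr > self.call_snr:
                    call_segments.append(segment)
                    call_snrs.append(snr)

                # Consecutive segments overlap when the hop is shorter than the segment.
                idx += self.hop_len_samples

        return call_segments, call_snrs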
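
The noise estimate described above can also be written without an explicit loop. The cell below is only a sketch of that idea on a synthetic signal, not a drop-in replacement: `estimate_noise_level` is a hypothetical helper name, the signal and its 1000 Hz sample rate are made up for illustration, and unlike the loop in `_get_noise_level` it uses every full chunk, including a final chunk that ends exactly at the end of the clip.

```python
import numpy as np


def estimate_noise_level(signal, chunk_len):
    """Smallest per-chunk peak amplitude over non-overlapping chunks (hypothetical helper)."""
    magnitude = np.abs(np.asarray(signal, dtype=float))
    n_chunks = len(magnitude) // chunk_len
    if n_chunks == 0:
        # Clip is shorter than one chunk: fall back to the peak of the whole clip.
        return float(magnitude.max())
    # Drop the trailing partial chunk, then view the rest as (n_chunks, chunk_len).
    chunks = magnitude[:n_chunks * chunk_len].reshape(n_chunks, chunk_len)
    return float(chunks.max(axis=1).min())


# Synthetic 5 s clip at a made-up 1000 Hz sample rate: low-level noise
# with a louder 50 Hz "call" between seconds 2 and 3.
rng = np.random.default_rng(0)
sr = 1000
clip = 0.01 * rng.standard_normal(5 * sr)
t = np.arange(sr) / sr
clip[2 * sr:3 * sr] += 0.5 * np.sin(2 * np.pi * 50 * t)

level = estimate_noise_level(clip, chunk_len=sr // 2)
peak = float(np.max(np.abs(clip)))
print(f'noise level = {level:.4f}, clip peak = {peak:.4f}, ratio = {peak / level:.1f}')
```

The chunks that contain only noise set the estimate, so the loud burst does not raise it. The estimate is only as good as the quietest chunk, though: if a call is present in every chunk, the noise level will be overestimated.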

We specify how long we want the found segments to be, how far to hop between consecutive segments (which sets their overlap), and how long we want the chunks to be when estimating the noise level. Then we use the segmenter to get all the relevant segments.

In [None]:
segment_len_ms = 2500
hop_len_ms = 1000
noise_len_ms = 500
call_snr_thresh = 5

segmenter = SNRSegmenter(sample_rate, segment_len_ms, hop_len_ms, noise_len_ms, call_snr_thresh)

calls, call_snrs = segmenter(audio)
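
The segmenter returns the segments themselves but not where they start in the clip. As a rough sketch, the start positions of all candidate windows can be reconstructed from the same parameters. `window_starts` is a hypothetical helper, and the 10 second clip length is an assumption for illustration. It lists every window the loop visits, not only the ones that pass the SNR threshold, so mapping kept segments back to times would require recording `idx` inside the segmenter instead.

```python
import numpy as np


def window_starts(n_samples, sample_rate, segment_len_ms, hop_len_ms):
    """Start offsets (in samples and in seconds) of every candidate window (hypothetical helper)."""
    seg_len = int(sample_rate * segment_len_ms / 1000)
    hop_len = int(sample_rate * hop_len_ms / 1000)
    # Same stopping rule as the segmenter loop: start + seg_len < n_samples.
    starts = np.arange(0, max(n_samples - seg_len, 0), hop_len)
    return starts, starts / sample_rate


# Assumed example: a 10 s clip at 32 kHz with 2.5 s windows and a 1 s hop.
starts, starts_s = window_starts(10 * 32000, 32000, 2500, 1000)
print(len(starts), 'candidate windows starting at', starts_s, 'seconds')
```

With a 2.5 s segment and a 1 s hop, consecutive windows share 1.5 s of audio, so a single call can show up in two or three neighbouring segments.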

And now we can take a look at some of the found call segments:

In [None]:
# The clip is picked at random and may yield no segments, so guard the index.
if calls:
    plt.title(f'SNR = {call_snrs[0]:.1f}')
    plt.plot(calls[0])
    ipd.display(ipd.Audio(calls[0], rate=sample_rate))

In [None]:
# Short clips may produce fewer than six segments.
if len(calls) > 5:
    plt.title(f'SNR = {call_snrs[5]:.1f}')
    plt.plot(calls[5])
    ipd.display(ipd.Audio(calls[5], rate=sample_rate))