I took this competition as an opportunity to experiment with Dask and distributed computing. Everything was made using Kaggle kernels.

In a previous notebook I created a preprocessing scheme using **PyDub, Dask and Zarr** to read each MP3 file, resample it to 32 kHz as recommended by the competition hosts, and extract only one audio channel.

Each recording is then converted to a sequence of samples stored as a NumPy array. The sequence is left-padded with zeroes and split into subsequences using a non-overlapping window of size 46^3 (97336 samples), or approximately 3 seconds of audio at 32 kHz. The final shape of the array is (n,46,46,46). It's also possible to slice and reshape it using the current solution, or to create a different preprocessing scheme.

For example, the first recording has 815616 steps (about 8.4 windows), so it is split into an array of 9 rows.
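The windowing described above can be sketched in plain NumPy. This is only an illustration of the idea, not the preprocessing notebook itself: `to_windows` is a hypothetical helper, the row count is assumed to be a simple ceiling division, and the demo array of ones stands in for the first recording.

```python
import numpy as np

WINDOW_SHAPE = (46, 46, 46)
WINDOW_SIZE = int(np.prod(WINDOW_SHAPE))  # 97336 samples, about 3.04 s at 32 kHz


def to_windows(samples):
    """Left-pad a 1-D sample array with zeroes and reshape it to (n, 46, 46, 46)."""
    n_rows = -(-len(samples) // WINDOW_SIZE)    # ceiling division (assumption)
    pad = n_rows * WINDOW_SIZE - len(samples)   # number of zeroes added on the left
    padded = np.pad(samples, (pad, 0))          # constant mode pads with zeroes
    return padded.reshape(n_rows, *WINDOW_SHAPE)


# Stand-in for the first recording: 815616 samples of ones.
demo = np.ones(815616, dtype=np.int16)
windows = to_windows(demo)
print(windows.shape)  # (9, 46, 46, 46)
```

Because the padding is on the left, the original samples always sit at the end of the flattened array, which is what makes the later comparison against the MP3 file possible.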

The final result is stored as a compressed Zarr array composed of approximately 677 chunks totaling 42 GB and uploaded as a Kaggle dataset.

If you find this useful, I can share the kernel where it’s possible to adapt or alter the preprocessing to your own requirements.


In [None]:
!conda install dask -y
import dask
import dask.array as da
import dask.bag as db
import dask.dataframe as dd
from dask.distributed import Client

!pip install pydub
from pydub import AudioSegment
from pydub.utils import mediainfo

!pip install zarr
from zarr import Zlib, BZ2, LZMA, Blosc

import librosa, os, time, gc, json
import numpy as np
import h5py
import pandas as pd
import matplotlib.pyplot as plt
import IPython.display as ipd
from skimage.util.shape import view_as_windows, view_as_blocks
import tensorflow as tf

pd.set_option("display.max_columns", 60)
pd.set_option("display.max_rows", 120)

# Set up Dask Client

In [None]:
temp_folder = '/home/dask_hdd'
os.makedirs(temp_folder, exist_ok=True)  # does not fail if the cell is re-run

client = Client(memory_limit='4GB', local_directory=temp_folder)
client

# Read Zarr dataset

In [None]:
audio = da.from_zarr('/kaggle/input/bird-train')

In [None]:
audio

**It's possible to retrieve any part of the array relatively fast.**

In [None]:
%%time
recording = audio[20000:20500].compute()

In [None]:
recording

# Create index


The bird-train dataset also contains a copy of the train dataframe with two additional columns, "len" and "path".

The "len" column has the exact length of each audio sequence, and can be used to create an index. The length does not include left padding, so this has to be accounted for in the code. Some MP3 files could not be read; these are filtered out below by keeping only rows with a positive "len".

This way it's possible to create an index that maps a specific line of the train dataframe to the slice of the Zarr array that contains that recording (audio_index). The inverse operation is also possible: mapping any array row to a specific entry in the train dataframe (train_index).

In [None]:
train = pd.read_feather('/kaggle/input/bird-train/train.feather')
train = train[train['len']>0]

window_maxdim = np.prod(audio.shape[1:])
train['shape'] = train['len'].map(lambda x: (x+window_maxdim-x%window_maxdim)/window_maxdim)
train['slice'] = train['shape'].cumsum().astype(int)
train['slice'] = list(zip(train['slice'].shift(1,fill_value=0),train['slice']))

audio_index = train['slice'].map(lambda x: slice(*x))
train_index = train['slice'].apply(lambda x: np.arange(*x)).explode()
train_index = pd.Series(train_index.index, index=train_index.astype(int))
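To make the two indexes easier to follow, here is a toy version with a window of 10 samples instead of 46^3. The dataframe, its labels (100, 101, 102) and the `toy_` names are made up for illustration; the row-count formula is the one used in the cell above, with integer division.

```python
import numpy as np
import pandas as pd

window = 10  # toy window size; the real one is 46**3

# Three hypothetical recordings with 25, 7 and 31 samples.
toy = pd.DataFrame({"len": [25, 7, 31]}, index=[100, 101, 102])

# Number of array rows each recording occupies after left padding.
toy["rows"] = toy["len"].map(lambda x: (x + window - x % window) // window)

# Cumulative row counts give the stop of each slice; the previous stop is the start.
stops = toy["rows"].cumsum()
starts = stops.shift(1, fill_value=0)

# Forward index: dataframe label -> slice of array rows.
toy_audio_index = pd.Series(
    [slice(int(a), int(b)) for a, b in zip(starts, stops)], index=toy.index
)

# Inverse index: array row -> dataframe label.
toy_train_index = pd.Series(
    np.repeat(toy.index.to_numpy(), toy["rows"].to_numpy())
)

print(toy_audio_index.loc[102])  # slice(4, 8, None)
print(toy_train_index.loc[3])    # 101
```

Note that this formula assigns one extra row when a length is an exact multiple of the window (a length of 20 would give 3 rows here). Whether the stored array behaves the same way depends on the preprocessing notebook, so that edge case is worth checking before relying on the index.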

# Retrieve recording from Zarr array

In [None]:
train.loc[5000]

In [None]:
audio_index.loc[5000]

In [None]:
recording = audio[audio_index.loc[5000]].compute()
print(recording.shape)

In [None]:
recording = recording.reshape(1,-1)
ipd.Audio(recording, rate=32000)

**Double check to make sure we got the right one!**

In [None]:
train.loc[5000].path

In [None]:
recording_original = AudioSegment.from_mp3(train.loc[5000].path).set_frame_rate(32000).set_channels(1)
recording_original

In [None]:
recording = recording.ravel()
recording_original = np.array(recording_original.get_array_of_samples(), dtype=np.int32)
np.all(np.equal(recording[-recording_original.shape[0]:], recording_original))

# Retrieve recording information from array row

In [None]:
audio[99553]

In [None]:
train_index.loc[99553]

In [None]:
train.loc[train_index.loc[99553]]

# Get max value of entire array

Go through an entire 158 GB array in under 2 minutes with only 16 GB of RAM!

In [None]:
%%time
audio.max().compute()

**Get exact position of max value**

In [None]:
%%time
audio.argmax().compute()

In [None]:
%%time
np.unravel_index(1280089650, audio.shape)  # 1280089650 is the flat index from the argmax cell above

In [None]:
audio[13151, 11, 13, 40].compute()
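The flat-index-to-position step can be tried on a small NumPy array without hard-coding the index. This is a toy sketch: the array shape and the planted value are arbitrary, and with the Dask array the same pattern would presumably apply after calling `.compute()` on the `argmax()` result.

```python
import numpy as np

# Small 4-D array with a single known maximum.
toy = np.zeros((3, 2, 2, 2), dtype=np.int16)
toy[2, 0, 1, 1] = 7

# argmax without an axis returns a position in the flattened array.
flat_pos = int(toy.argmax())

# unravel_index converts that flat position back to one index per dimension.
position = np.unravel_index(flat_pos, toy.shape)

print(flat_pos)       # 19
print(toy[position])  # 7
```

Keeping the result of `argmax` in a variable avoids copying the number by hand between cells.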

# Bonus: group by recording and apply function

Get min and max value of each recording.

Be careful when calculating statistics like the mean, because the array has been zero-padded and the padding would pull the result toward zero. This can be accounted for during the calculation by using the "len" column to skip the padding. The processing notebook can also be modified to use NumPy masked arrays.

The process takes about 3 minutes, but there may be more efficient ways to do this with Dask.

In [None]:
%%time
groupby_minmax = [dask.delayed(lambda x: (x.min(), x.max()))(audio[sli]) for sli in audio_index]

In [None]:
%%time
groupby_minmax = db.compute(groupby_minmax)

In [None]:
groupby_minmax
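As a small illustration of the padding caveat mentioned in this section, the sketch below computes a mean that ignores the left padding. `unpadded_mean` is a hypothetical helper; with the real data, `windows` would be the computed slice for one recording and `length` its "len" value from the train dataframe.

```python
import numpy as np


def unpadded_mean(windows, length):
    """Mean over the last `length` samples, skipping the left zero padding."""
    flat = np.asarray(windows).ravel()
    return flat[flat.size - length:].mean()


# Toy recording of 5 samples, left-padded to 8 and split into windows of 4.
samples = np.array([2, 4, 6, 8, 10])
padded = np.pad(samples, (3, 0)).reshape(2, 4)

naive = padded.mean()                            # includes the three padding zeroes
corrected = unpadded_mean(padded, len(samples))  # uses only the real samples

print(naive, corrected)  # 3.75 6.0
```

Min and max can also be affected by the padding, for example when every real sample in a recording is strictly positive, so the same slicing may be worth applying there.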