<center><img src="https://i.imgur.com/bIw8deA.jpg" width="1000px"></center>

# Introduction

Welcome to the "Cornell Birdcall Identification" challenge on Kaggle! In this challenge, contestants need to identify the bird species heard calling in audio clips. In this kernel, I generate <code>melspectrograms</code> from all the training audio clips and save them as images, so that training can be sped up!

# Acknowledgements

1. [LibROSA](https://librosa.org/librosa/) ~ by the librosa team
2. [Audio Data Analysis Using librosa 📈](https://www.kaggle.com/hamditarek/audio-data-analysis-using-librosa) ~ by Tarek Hamdi
3. [Understanding the Mel Spectrogram](https://medium.com/analytics-vidhya/understanding-the-mel-spectrogram-fca2afa2ce53) ~ by Leland Roberts
4. [Bidirectional LSTM for audio labeling with Keras](https://www.kaggle.com/carlolepelaars/bidirectional-lstm-for-audio-labeling-with-keras) ~ by Carlo Lepelaars

# Preparing the ground <a id="1"></a>

In this section, we will prepare the ground to train and test the model by installing packages, setting hyperparameters, and loading the data.

## Install additional packages <a id="1.1"></a>

* We will now install <code>pydub</code>
* <code>pydub</code> will help us load audio data from <code>.mp3</code> files much faster than the <code>librosa</code> command: <code>librosa.load</code>

In [None]:
!pip install -q pydub

## Import necessary libraries <a id="1.2"></a>

* Now, we import all the libraries we need.
* <code>tqdm</code> for progress bars and <code>joblib</code> for parallel processing.
* <code>librosa</code>, <code>pydub</code>, and <code>cv2</code> for loading audio, computing features, and saving images.
* <code>numpy</code>, <code>pandas</code>, and <code>keras</code> utilities for data processing and manipulation.

In [None]:
import os
import gc
import cv2
import time
import numpy as np
import pandas as pd
from tqdm.notebook import tqdm
from joblib import Parallel, delayed

import warnings
warnings.filterwarnings('ignore')

import pydub
import librosa
import librosa.display
from pydub import AudioSegment as AS
from librosa.feature import melspectrogram
from librosa.core import power_to_db as ptdb

from keras.utils import to_categorical
from keras.preprocessing.sequence import pad_sequences as pad

## Define key hyperparameters and paths <a id="1.3"></a>

* Here we define the key hyperparameters for data processing: sampling rates, sequence length, chunk size, and the number of mel bands.
* We also specify the paths used to load the metadata and the audio clips.

### Define hyperparameters

The hyperparameters used are documented below ~

1. Data processing related

    * <code>N_MELS</code> is the number of <code>melspectrogram</code> features per time step.
    * <code>AMPLITUDE</code> represents the default signal amplitude applied to unreadable files.
    * <code>SR</code> is the sampling rate at which the audio is loaded (readings per second). It defaults to <code>44100 Hz</code>.
    * <code>TSR</code> is the sampling rate at which the test audio clips are loaded. It defaults to <code>32000 Hz</code>.
    * <code>MAXLEN</code> is the maximum number of readings. Longer clips will be trimmed and shorter ones will be padded.
    * <code>SPLIT</code> represents the fraction of data to be used for training. The rest of the data is used for validation.
    * <code>CHUNKS</code> represents the number of chunks that will be extracted from each signal. It defaults to <code>1</code> per signal.
    * <code>CHUNK_SIZE</code> represents the sequence length of each audio chunk to be fed into the <code>melspectrogram</code> function.
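    * <code>POP_FRAC</code> is the maximum proportion of signal information to be ignored per chunk (defaults to <code>0.25</code>).
    * <code>N</code> is the number of parallel jobs used when saving the spectrograms (defaults to <code>8</code>).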

In [None]:
N = 8
SR = 44100
CHUNKS = 1
TSR = 32000
N_MELS = 128
POP_FRAC = 0.25
MAXLEN = 1000000
AMPLITUDE = 1000
CHUNK_SIZE = 500000
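
As a rough sanity check, these settings can be translated into clip durations and an expected spectrogram width. The sketch below assumes the audio is resampled to <code>TSR</code> and that <code>librosa</code>'s default hop length of 512 samples with centered frames is in effect; <code>HOP_LENGTH</code> is a name introduced here for illustration, and the constants are repeated so the snippet runs on its own.

```python
# Constants repeated from the hyperparameter cell so this snippet is standalone
TSR = 32000
N_MELS = 128
MAXLEN = 1000000
CHUNK_SIZE = 500000
HOP_LENGTH = 512  # assumption: librosa's default hop length

max_seconds = MAXLEN / TSR                # longest stretch of samples kept per clip
chunk_seconds = CHUNK_SIZE / TSR          # duration covered by one chunk
n_frames = 1 + CHUNK_SIZE // HOP_LENGTH   # frame count with centered framing

print(f"{max_seconds:.2f} s kept, {chunk_seconds:.3f} s per chunk")
print(f"expected spectrogram shape: ({N_MELS}, {n_frames})")
```

Under these assumptions each saved image covers roughly 15.6 seconds of mono audio.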

### Check data available

We have two datasets at our disposal: <code>birdsong-recognition</code> and <code>prepare-check-dataset</code>.

In [None]:
os.listdir('../input')

### Define paths

The paths used are documented below ~

1. Metadata related

    * <code>../input/birdsong-recognition/test.csv</code> contains the test metadata used for submission.
    * <code>../input/birdsong-recognition/train.csv</code> contains the train metadata used for training.
    * <code>../input/prepare-check-dataset/birdcall-check/test.csv</code> contains the test metadata used for committing.


2. Audio data related

    * <code>../input/birdsong-recognition/test_audio</code> contains the test audio used for submission.
    * <code>../input/birdsong-recognition/train_audio</code> contains the train audio used for training.
    * <code>../input/prepare-check-dataset/birdcall-check/test_audio</code> contains the test audio used for committing.

In [None]:
TEST_DATA_PATH = '../input/birdsong-recognition/test.csv'
TRAIN_DATA_PATH = '../input/birdsong-recognition/train.csv'
TEST_AUDIO_PATH = '../input/birdsong-recognition/test_audio/'
TRAIN_AUDIO_PATH = '../input/birdsong-recognition/train_audio/'
CHECKING_PATH = '../input/prepare-check-dataset/birdcall-check/'

In [None]:
sub = os.path.exists(TEST_AUDIO_PATH)
TEST_DATA_PATH = TEST_DATA_PATH if sub else CHECKING_PATH + 'test.csv'
TEST_AUDIO_PATH = TEST_AUDIO_PATH if sub else CHECKING_PATH + 'test_audio/'

## Load metadata from .csv files <a id="1.4"></a>

* Now we load the training and testing metadata.

* We can see that the testing dataframe has only <code>3</code> rows in it.

* This is only a dummy test dataframe. The actual testing data will be used during submission.

In [None]:
test_df = pd.read_csv(TEST_DATA_PATH)
train_df = pd.read_csv(TRAIN_DATA_PATH)

In [None]:
test_df.head()

In [None]:
train_df.head()

### Prepare the label dictionary

* Next we prepare a dictionary linking each bird species to a unique integer.
* This dictionary will help us when we need to one-hot encode our targets later.

In [None]:
keys = set(train_df.ebird_code)
values = np.arange(0, len(keys))
code_dict = dict(zip(sorted(keys), values))
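
To illustrate how such a dictionary can drive one-hot encoding, here is a minimal sketch. The species codes and the <code>one_hot</code> helper are hypothetical stand-ins, not part of this kernel; the dictionary is built the same way as above.

```python
import numpy as np

# Hypothetical species codes standing in for train_df.ebird_code
codes = ["amerob", "aldfly", "amerob", "bkcchi"]

keys = set(codes)
code_dict = dict(zip(sorted(keys), np.arange(len(keys))))

def one_hot(code, mapping):
    # Pick the row of the identity matrix that matches the integer label
    return np.eye(len(mapping), dtype=np.float32)[mapping[code]]

encoded = np.stack([one_hot(c, code_dict) for c in codes])
print(code_dict)
print(encoded)
```

Sorting the keys keeps the species-to-integer mapping stable across runs, since the iteration order of a <code>set</code> is not guaranteed.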

## Data processing <a id="3.1"></a>

The first step is to define the functions that process the data and generate features.

### Define utility function to read audio

* Now we define a function using <code>pydub</code> to read audio files into <code>numpy</code> arrays.
* This implementation is significantly faster than <code>librosa.load</code> and <code>torchaudio.load</code>.

In [None]:
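def normalize(x):
    # Scale 16-bit PCM samples to the range [-1, 1)
    return np.float32(x)/2**15

def read(file, norm=False):
    # Decode the .mp3 file and resample it to TSR
    try:
        a = AS.from_mp3(file)
        a = a.set_frame_rate(TSR)
    except Exception:
        # Unreadable file: fall back to a silent signal of MAXLEN samples
        return TSR, np.zeros(MAXLEN)

    y = np.array(a.get_array_of_samples())
    # Stereo samples are interleaved; reshape to (n_frames, 2)
    if a.channels == 2: y = y.reshape((-1, 2))
    if norm: return a.frame_rate, normalize(y)
    return a.frame_rate, np.float32(y)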

def write(file, sr, x, normalized=False):
    birds_audio_bitrate, file_format = '320k', 'mp3'
    ch = 2 if (x.ndim == 2 and x.shape[1] == 2) else 1
    y = np.int16(x * 2 ** 15) if normalized else np.int16(x)
    song = AS(y.tobytes(), frame_rate=sr, sample_width=2, channels=ch)
    song.export(file, format=file_format, bitrate=birds_audio_bitrate)

### Define functions to process audio signals

These are a set of functions which process the audio before the <code>melspectrogram</code> transformation.

The functions used are documented below ~

* <code>get_idx</code> selects the start and end index of a given audio chunk.
* <code>get_chunk</code> takes indices from <code>get_idx</code> and outputs a chunk of data between those indices.
* <code>get_len</code> is a helper function which is used to decide possible chunk indices based on <code>POP_FRAC</code>.
  
  --> <code>If</code> the signal is longer than <code>MAXLEN</code>, it sets the maximum index to <code>MAXLEN</code>.<br>
  --> <code>Else</code> it uses <code>POP_FRAC</code> to ensure chunks are centered around audio signal and not padding.


* <code>get_signal</code> flattens the signal, pads it to <code>MAXLEN</code>, and stacks multiple chunks into one array.

In [None]:
def get_idx(length):
    # Draw a random start index and clamp it so the chunk fits inside MAXLEN
    length = get_len(length)
    max_idx = MAXLEN - CHUNK_SIZE
    idx = np.random.randint(length + 1)
    chunk_idx = max([0, idx])
    chunk_idx = min([chunk_idx, max_idx])
    return (chunk_idx, chunk_idx + CHUNK_SIZE)

def get_len(length):
    # Upper bound for the random start index
    if length > MAXLEN: return MAXLEN
    return int(length*POP_FRAC)
In [None]:
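def get_chunk(data, length):
    # Slice one chunk of CHUNK_SIZE samples out of the padded signal
    index = get_idx(length)
    return data[index[0]:index[1]]

def get_signal(data):
    length = max(data.shape)
    # Flatten (stereo channels are concatenated) into a single row
    data = data.T.flatten().reshape(1, -1)
    # pad_sequences pads and truncates at the front by default
    data = np.float32(pad(data, maxlen=MAXLEN).reshape(-1))
    return [get_chunk(data, length) for _ in range(CHUNKS)]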
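
It helps to see where the audio ends up after padding. Keras' <code>pad_sequences</code> pads and truncates at the front by default, so short clips sit at the end of the padded array. The sketch below mimics that behaviour in plain <code>numpy</code> with tiny sizes; <code>pad_front</code> is a hypothetical helper written for this illustration, not a function from this kernel or from Keras.

```python
import numpy as np

MAXLEN = 10      # tiny stand-ins for the real settings
CHUNK_SIZE = 4

def pad_front(signal, maxlen):
    # Hypothetical helper mimicking the default 'pre' padding and truncation:
    # keep the last `maxlen` samples, then left-pad with zeros.
    signal = np.asarray(signal, dtype=np.float32)[-maxlen:]
    out = np.zeros(maxlen, dtype=np.float32)
    out[maxlen - len(signal):] = signal
    return out

short = pad_front(np.arange(1, 7), MAXLEN)    # 6 samples -> 4 zeros in front
long = pad_front(np.arange(1, 16), MAXLEN)    # 15 samples -> last 10 kept

start = MAXLEN - CHUNK_SIZE                   # latest valid chunk start
chunk = short[start:start + CHUNK_SIZE]
print(short, long, chunk)
```

In this toy setting, a chunk that starts at the latest valid index contains only real samples, whereas a chunk starting at index 0 of the short clip would consist entirely of padding.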

### Define functions to calculate melspectrogram features

Below we define some functions to calculate the <code>melspectrogram</code> features from audio signals.

In [None]:
def to_imagenet(X, mean=None, std=None, norm_max=None, norm_min=None, eps=1e-6):
    mean = mean or X.mean()
    X = X - mean
    std = std or X.std()
    Xstd = X / (std + eps)
    _min, _max = Xstd.min(), Xstd.max()
    norm_max = norm_max or _max
    norm_min = norm_min or _min
    if (_max - _min) > eps:
        # Normalize to [0, 255]
        V = Xstd
        V[V < norm_min] = norm_min
        V[V > norm_max] = norm_max
        V = 255*((V - norm_min) / (norm_max - norm_min))
    else:
        # Just zero
        V = np.zeros_like(Xstd, dtype=np.uint8)
    return np.stack([V]*3, axis=-1)
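
The core of <code>to_imagenet</code> is a standardization followed by min-max scaling to <code>[0, 255]</code> and a stack into three identical channels. The following sketch walks through those steps on a small made-up array (the values are an assumption chosen for illustration) and leaves out the optional arguments.

```python
import numpy as np

# Hypothetical 2x3 "spectrogram" in decibels
spec = np.array([[-80.0, -40.0, -20.0],
                 [-60.0, -30.0,   0.0]], dtype=np.float32)

eps = 1e-6
standardized = (spec - spec.mean()) / (spec.std() + eps)
lo, hi = standardized.min(), standardized.max()
scaled = 255 * (standardized - lo) / (hi - lo)   # min-max scale to [0, 255]
image = np.stack([scaled] * 3, axis=-1)          # repeat as 3 identical channels

print(image.shape, image.min(), image.max())
```

Repeating the single channel three times gives the array the height x width x 3 layout that image writers and ImageNet-style models expect.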

In [None]:
def get_melsp(data):
    # The audio was resampled to TSR in read(); librosa would otherwise assume 22050 Hz
    melsp = melspectrogram(y=data, sr=TSR, n_mels=N_MELS)
    return to_imagenet(librosa.power_to_db(melsp))

def get_melsp_img(data):
    data = get_signal(data)
    return np.stack([get_melsp(point) for point in data])

### Define spectrogram loading function

Now we define a function that generates a <code>spectrogram</code> for each index in a list and saves it as a <code>.jpg</code> image.

In [None]:
def save(indices, path):
    folder = TRAIN_AUDIO_PATH

    for index in tqdm(indices):
        file_name = train_df.filename[index]
        ebird_code = train_df.ebird_code[index]

        # Random-noise fallback, used only if read() does not return a (rate, data) pair
        default_signal = np.random.random(MAXLEN)*AMPLITUDE
        default_values = SR, np.int32(np.round(default_signal))

        values = read(folder + ebird_code + '/' + file_name)
        _, data = values if len(values) == 2 else default_values

        # Compute the spectrogram image and write it as <file_name>.jpg
        image = np.nan_to_num(get_melsp_img(data))[0]
        cv2.imwrite(path + file_name + '.jpg', image)
        del image; gc.collect()

### Load all training spectrograms and save with parallel processing

Next we will use multi-threading to generate all the spectrograms and save them quickly.

In [None]:
train_ids = np.array_split(np.arange(len(train_df)), 5)
train_ids_1, train_ids_2, train_ids_3, train_ids_4, train_ids_5 = train_ids

In [None]:
train_ids_1 = np.array_split(np.array(train_ids_1), N)
train_ids_2 = np.array_split(np.array(train_ids_2), N)
train_ids_3 = np.array_split(np.array(train_ids_3), N)
train_ids_4 = np.array_split(np.array(train_ids_4), N)
train_ids_5 = np.array_split(np.array(train_ids_5), N)
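
The two-level split can be checked on a small example. This sketch assumes a made-up row count of 103 (not the real size of <code>train_df</code>) and shows that <code>np.array_split</code> tolerates uneven divisions while still covering every index exactly once.

```python
import numpy as np

N = 8            # worker count, as in the hyperparameter cell
n_rows = 103     # hypothetical number of training rows

parts = np.array_split(np.arange(n_rows), 5)          # five archive-sized parts
jobs = [np.array_split(part, N) for part in parts]    # N sub-lists per part

print([len(p) for p in parts])
print([len(j) for j in jobs[0]])
```

Each of the five parts is later zipped into its own archive, and each of its <code>N</code> sub-lists is handled by one worker.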

In [None]:
!mkdir train_1
path = "train_1/"
parallel = Parallel(n_jobs=N, backend="threading")
parallel(delayed(save)(ids, path) for ids in train_ids_1)

In [None]:
!zip -r train_1.zip train_1
!rm -rf train_1

In [None]:
!mkdir train_2
path = "train_2/"
parallel = Parallel(n_jobs=N, backend="threading")
parallel(delayed(save)(ids, path) for ids in train_ids_2)

In [None]:
!zip -r train_2.zip train_2
!rm -rf train_2

In [None]:
!mkdir train_3
path = "train_3/"
parallel = Parallel(n_jobs=N, backend="threading")
parallel(delayed(save)(ids, path) for ids in train_ids_3)

In [None]:
!zip -r train_3.zip train_3
!rm -rf train_3

In [None]:
!mkdir train_4
path = "train_4/"
parallel = Parallel(n_jobs=N, backend="threading")
parallel(delayed(save)(ids, path) for ids in train_ids_4)

In [None]:
!zip -r train_4.zip train_4
!rm -rf train_4

In [None]:
!mkdir train_5
path = "train_5/"
parallel = Parallel(n_jobs=N, backend="threading")
parallel(delayed(save)(ids, path) for ids in train_ids_5)

In [None]:
!zip -r train_5.zip train_5
!rm -rf train_5