# Kaggle BirdCLEF 2021

## Data Preparation
Before we get working on concrete algorithms, let's load our data and take a look at it to see what data cleaning is necessary.

In [None]:
import pandas as pd

input_folder = '/kaggle/input/birdclef-2021/'
output_folder = '/kaggle/working/'

# Python f-strings allow us to do variable interpolation
test_soundscapes_folder = f"{input_folder}test_soundscapes/"
train_short_audio_folder = f"{input_folder}train_short_audio/"
train_soundscapes_folder = f"{input_folder}train_soundscapes/"

sample_submission = pd.read_csv(f"{input_folder}sample_submission.csv")
test = pd.read_csv(f"{input_folder}test.csv")
train_metadata = pd.read_csv(f"{input_folder}train_metadata.csv")
train_soundscape_labels = pd.read_csv(f"{input_folder}train_soundscape_labels.csv")

Now, let's take a look at the first few rows of our training metadata set to see what kind of data we are working with.

In [None]:
train_metadata.head()

We can see by inspecting the columns that the "secondary_labels" and "type" attributes are multi-valued. This means that our data set is not currently [normalized](https://en.wikipedia.org/wiki/Database_normalization). To address this and make our data set easier to work with further down the line, let's normalize it into [first normal form](https://en.wikipedia.org/wiki/Database_normalization#Satisfying_1NF). We'll first create one DataFrame, `file_metadata`, containing only the atomic attributes currently in our DataFrame, and split the "secondary_labels" and "type" attributes off into separate Series, making sure to strip the brackets, quotes, and spaces from the strings in those two columns.

In [None]:
import re # to be able to use regular expressions

def remove_brackets_quotes_spaces(string: str) -> str:
    return re.sub(r'[\[\]\' ]', '', string)
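As a quick check of what the regular expression removes, here is a minimal sketch on two made-up values. The sample strings are assumptions about how the raw columns are formatted, not rows copied from `train_metadata`.

```python
import re

def remove_brackets_quotes_spaces(string: str) -> str:
    # Same pattern as in the cell above: delete '[', ']', single quotes, and spaces
    return re.sub(r'[\[\]\' ]', '', string)

# Hypothetical raw values in the style of the 'type' and 'secondary_labels' columns
cleaned = remove_brackets_quotes_spaces("['call', 'flight call']")
empty = remove_brackets_quotes_spaces("[]")

print(cleaned)      # call,flightcall
print(repr(empty))  # ''
```

Note that spaces inside a label are removed as well, so a two-word label such as "flight call" becomes "flightcall".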

In [None]:
filenames = train_metadata['filename']
file_metadata = train_metadata.drop(columns=['secondary_labels', 'type']).rename(columns={'primary_label': 'primary label'})
# Note that indexing into train_metadata with a column name will return a Pandas Series object
secondary_birdcall_labels = train_metadata['secondary_labels'].map(remove_brackets_quotes_spaces)
birdsound_labels = train_metadata['type'].map(remove_brackets_quotes_spaces)

Great! Now, let's [explode](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.explode.html) `secondary_birdcall_labels` and `birdsound_labels` so that every valid tuple of (filename, secondary birdcall label), and likewise every tuple of (filename, birdsound label), gets expanded into a distinct row. First, however, we need to convert each comma-delimited string of labels into an iterable Python object so that we can actually apply the `explode` method.

In [None]:
from typing import Sequence # to specify a parameter as a sequence type

def convert_to_list(string: str) -> Sequence[str]:
    """Turns a comma-delimited string into a list

    Parameters
    ----------
    string: str
        A comma-delimited string

    Returns
    -------
    Sequence[str]
        A list containing each of the words from the original string"""
    return re.split(r'\W+', string)

In [None]:
secondary_birdcall_labels = secondary_birdcall_labels.map(convert_to_list)
birdsound_labels = birdsound_labels.map(convert_to_list)
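To see how the split behaves, here is a small sketch that repeats `convert_to_list` and runs it on two made-up inputs (the input strings are illustrative assumptions).

```python
import re

def convert_to_list(string: str) -> list:
    # Split on runs of non-word characters (here, the commas)
    return re.split(r'\W+', string)

two_labels = convert_to_list("call,flightcall")
no_labels = convert_to_list("")

print(two_labels)  # ['call', 'flightcall']
print(no_labels)   # ['']
```

An empty input yields a one-element list containing the empty string rather than an empty list, so a file without any labels still produces one row with a blank label once the column is exploded.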

Looking good so far. Now, let's convert `secondary_birdcall_labels` and `birdsound_labels` into actual DataFrames and then apply the `explode` method on them.

In [None]:
secondary_birdcall_labels = pd.concat([filenames, secondary_birdcall_labels], axis=1).rename(columns={'secondary_labels': 'secondary label'})
birdsound_labels = pd.concat([filenames, birdsound_labels], axis=1).rename(columns={'type': 'birdsound label'})

# Pass the option ignore_index=True so that the original indices aren't duplicated in the process of expanding the DataFrame
secondary_birdcall_labels = secondary_birdcall_labels.explode('secondary label', ignore_index=True)
birdsound_labels = birdsound_labels.explode('birdsound label', ignore_index=True)
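The effect of `explode` is easiest to see on a toy DataFrame. The sketch below uses two invented filenames and placeholder labels; none of the values come from the competition data.

```python
import pandas as pd

# Toy stand-in for the (filename, list of labels) DataFrame built above
toy = pd.DataFrame({
    'filename': ['a.ogg', 'b.ogg'],
    'secondary label': [['amerob', 'norcar'], ['']],
})

# Each list element becomes its own row; ignore_index=True renumbers the rows 0..n-1
exploded = toy.explode('secondary label', ignore_index=True)
print(exploded)
```

The file with two labels turns into two rows, while the file whose list holds only an empty string keeps a single row with a blank label.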

Finally, let's print out the `secondary_birdcall_labels` DataFrame to make sure it looks reasonable.

In [None]:
secondary_birdcall_labels.head()

It seems like we've forgotten to remove rows that have no secondary label. Let's fix that.

In [None]:
# Use the reset_index() method to renumber the rows after selecting a subset of the original DataFrame
# We pass the option drop=True in order to avoid having the old row indices added as a column to the DataFrame
secondary_birdcall_labels = secondary_birdcall_labels[secondary_birdcall_labels['secondary label'] != ''].reset_index(drop=True)
secondary_birdcall_labels.head()

Fantastic! Let's do the same with our `birdsound_labels` DataFrame.

In [None]:
birdsound_labels = birdsound_labels[birdsound_labels['birdsound label'] != ''].reset_index(drop=True)

Finally, let's clean up the `file_metadata` DataFrame to remove the attributes we don't need. Let's take a quick refresher of the attributes in the DataFrame.

In [None]:
file_metadata.columns

The "license" and "url" fields will certainly not help with our prediction task. The "common_name" and "scientific_name" fields will similarly not be very helpful, since our target labels are given by the "primary_label" attribute. The "author" attribute may indeed exhibit some non-negligible correlation with the birdcall labels (i.e. some authors may be interested in observing certain birds over others), but that will not help us in our goal of creating a machine learning model capable of identifying birdcalls strictly from soundscape recordings.

The "rating" attribute should definitely be preserved as it can help us focus on training on the recordings that have the best audio quality. The "latitude", "longitude" and "date" attributes will also be of importance as those attributes will also be present along with the test data during our notebook submission.

Finally, let's also make sure to split off the "primary label" attribute into its own DataFrame and also make a DataFrame combining the primary and secondary labels into a single data set. This will be convenient for us later.

All that said, let's finish tidying up the data.

In [None]:
file_metadata = file_metadata.drop(columns=['scientific_name', 'common_name', 'author', 'license', 'url'])

# Let's create the following DataFrame for our future convenience
primary_birdcall_labels = file_metadata[['filename', 'primary label']]

file_metadata = file_metadata.drop(columns=['primary label'])

# Stack the primary and secondary labels into a single DataFrame that shares one 'birdcall label' column
birdcall_labels = pd.concat([primary_birdcall_labels.rename(columns={'primary label': 'birdcall label'}), secondary_birdcall_labels.rename(columns={'secondary label': 'birdcall label'})])

del train_metadata

## Exploratory Data Analysis

Our next step towards training a good model for identifying bird calls will be to get a good understanding of the data that we are working with, so let's visualize our data to see if we can gain any insights that will help us in training our models.  

In [None]:
# How many unique bird species are there in this data set?
NUM_CLASSES = len(pd.unique(birdcall_labels['birdcall label']))
NUM_CLASSES

Before moving on, we need to make sure that the `secondary_birdcall_labels` DataFrame does not contain a label that is not found in the `primary_birdcall_labels` DataFrame.

In [None]:
set(pd.unique(secondary_birdcall_labels['secondary label'])) - set(pd.unique(primary_birdcall_labels['primary label']))

That's unexpected. Let's see how many files have "rocpig1" as a label.

In [None]:
birdcall_labels[birdcall_labels['birdcall label'] == 'rocpig1'].shape[0]

How many tuples of (filename, birdcall label) are there total?

In [None]:
birdcall_labels.shape[0]

Since there are only 9 instances of a file being associated with the label "rocpig1" among the many tens of thousands of training examples, we will ignore it for the purposes of making our model.

In [None]:
import numpy as np # for working with arrays and numerical data
import seaborn as sns # for plotting functionality

# What is the distribution of each of the bird species found in this data set?
species_distribution = birdcall_labels.groupby('birdcall label')['birdcall label'].count().sort_values(ascending=False).values
# sns.lineplot returns a matplotlib Axes object, so we name it `ax` rather than shadowing the conventional `plt` alias
ax = sns.lineplot(x=np.arange(NUM_CLASSES), y=species_distribution)
ax.set_xlabel('Species (by index)')
ax.set_ylabel('Count')
ax.set_title('Count by Species'); # suppressing standard output in an IPython notebook: https://stackoverflow.com/questions/42635806/how-to-remove-matplotlib-output-lines-from-showing-in-jupyter-notebook-when-plot

Since our training set is quite large, we'll choose the 50 most frequent classes as the cut-off point for training our models. This will certainly not yield the best performance, but it will give us a reasonable starting point for exploring different ideas. Let's also take this opportunity to trim our training set down to the examples that have only a primary birdcall label and no secondary birdcall label.

In [None]:
# Create a DataFrame with all files that don't have a secondary birdcall label
singular_birdcall_files = pd.DataFrame({'filename': list(set(primary_birdcall_labels['filename']) - set(secondary_birdcall_labels['filename']))})

# Merge the resulting DataFrame with the file metadata DataFrame by joining on the "filename" attribute
singular_birdcall_file_metadata = pd.merge(singular_birdcall_files, file_metadata, on='filename')
# Merge with the primary birdcall labels DataFrame by joining on the "filename" attribute
singular_birdcall_file_metadata = pd.merge(singular_birdcall_file_metadata, primary_birdcall_labels, on='filename')

# Identify the top fifty most frequently occurring birdcall classes
most_common_birdcalls = singular_birdcall_file_metadata.groupby('primary label')['primary label'].count()
most_common_birdcalls = pd.DataFrame({'primary label': most_common_birdcalls.index, 'Count': most_common_birdcalls.values})
most_common_birdcalls = most_common_birdcalls.sort_values(by='Count', ascending=False).head(50).drop(columns=['Count'])

training_files = most_common_birdcalls.merge(primary_birdcall_labels, on='primary label')
# Merge on all key attributes in order to avoid duplicating columns
training_files = singular_birdcall_file_metadata.merge(training_files, on=['filename', 'primary label']).rename(columns={'primary label': 'birdcall label'}) # Reference: https://www.pauldesalvo.com/how-to-remove-or-prevent-duplicate-columns-from-a-pandas-merge/
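For comparison, the "keep only the k most frequent classes" step can also be sketched with `value_counts` and `isin`. This is an alternative formulation on invented data, not the approach used in the cell above; `TOP_K` and the toy labels are assumptions made for illustration.

```python
import pandas as pd

# Toy label table: class 'a' appears three times, 'b' twice, 'c' once
labels = pd.DataFrame({
    'filename': [f'f{i}.ogg' for i in range(6)],
    'primary label': ['a', 'a', 'a', 'b', 'b', 'c'],
})

TOP_K = 2  # the notebook uses 50

# value_counts() sorts classes by descending frequency, so head(TOP_K) keeps the most common ones
top_labels = labels['primary label'].value_counts().head(TOP_K).index

# Keep only the rows whose label is among the most frequent classes
subset = labels[labels['primary label'].isin(top_labels)]
print(subset)
```

This avoids the intermediate count DataFrame and the two merges, at the cost of not carrying the metadata columns along automatically.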

Now, let us identify the highest rating threshold to which we can restrict the audio samples of our training set while still retaining at least one example for every class of birdcall.

In [None]:
minmax_audio_rating = np.min(training_files.groupby('birdcall label')['rating'].agg(np.max))
minmax_audio_rating
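The "minimum of the per-class maximum" idea can be checked by hand on a tiny table. The ratings and labels below are invented for illustration.

```python
import pandas as pd

toy = pd.DataFrame({
    'birdcall label': ['a', 'a', 'b', 'b', 'c'],
    'rating': [5.0, 3.5, 4.0, 2.0, 4.5],
})

# Best rating available for each class: a -> 5.0, b -> 4.0, c -> 4.5
best_per_class = toy.groupby('birdcall label')['rating'].max()

# The smallest of those maxima is the strictest threshold that still keeps every class
threshold = best_per_class.min()
print(threshold)  # 4.0

# Filtering at the threshold leaves at least one recording per class
kept = toy[toy['rating'] >= threshold]
print(kept['birdcall label'].nunique())  # 3
```

Any threshold above this value would drop class "b" entirely, since its best recording is rated 4.0.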

Great! Now, let's assure ourselves that there are still a sufficient number of samples from each of the birdcall classes if we are to restrict ourselves to only using audio samples of rating 5.0.

In [None]:
high_quality_training_files = training_files[np.isclose(training_files['rating'], 5.0)]
min_audio_sample_count = np.min(high_quality_training_files.groupby('birdcall label')['birdcall label'].count())
min_audio_sample_count

Well if that isn't an auspicious number! It seems we'll have enough audio samples from every birdcall class in order to train, at the least, some moderately performant models. Let's go ahead and update our training set to use only the highest quality samples.

In [None]:
training_files = high_quality_training_files.drop(columns=['rating'])

One last thing. Let's replace our "date" and "time" attributes with a single, normalized "timestamp" attribute.

In [None]:
from sklearn.preprocessing import normalize

dates = training_files['date'].values
times = training_files['time'].values

# Keep track of which files have dates and times that cannot be parsed so that we can filter them from the data set.
invalid_date_indices = []

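# Record the positions of any rows whose date and time cannot be parsed
for i, (date, time) in enumerate(zip(dates, times)):
    try:
        pd.Timestamp(f"{date} {time}")
    # Catch Exception rather than using a bare except so that interrupts such as KeyboardInterrupt still propagate
    except Exception:
        invalid_date_indices.append(i)

# Sort so that the surviving rows keep their original order
valid_date_indices = sorted(set(range(training_files.shape[0])) - set(invalid_date_indices))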

training_files = training_files.iloc[valid_date_indices, :]
dates = dates[valid_date_indices]
times = times[valid_date_indices]

timestamps = list(map(lambda pair: [pd.Timestamp(f"{pair[0]} {pair[1]}").timestamp()], zip(dates, times)))
timestamps = normalize(np.array(timestamps).reshape(-1, 1), axis=0).reshape(-1)
timestamps = pd.DataFrame({'timestamp': timestamps}, index=training_files.index)

training_files = pd.concat([training_files.drop(columns=['date', 'time']), timestamps], axis=1)
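The same two steps (drop unparseable dates, then normalize) can be sketched in vectorized form. This is an illustrative alternative on made-up values: the fixed `format` string is an assumption and is stricter than the `pd.Timestamp` parsing used above, and the division by the Euclidean norm stands in for scikit-learn's `normalize(..., axis=0)`, which with its default `norm='l2'` scales a column to unit Euclidean length.

```python
import numpy as np
import pandas as pd

# Invented examples; the last row mimics an unparseable date and time
dates = pd.Series(['2021-03-01', '2019-07-15', '0000-00-00'])
times = pd.Series(['06:30', '18:05', '?'])

# errors='coerce' turns anything that does not match the format into NaT
parsed = pd.to_datetime(dates + ' ' + times, format='%Y-%m-%d %H:%M', errors='coerce')
valid = parsed.notna()

# Seconds since the Unix epoch for the rows that parsed successfully
seconds = ((parsed[valid] - pd.Timestamp('1970-01-01')) / pd.Timedelta(seconds=1)).to_numpy()

# L2 normalization of the column: divide by its Euclidean norm
normalized = seconds / np.linalg.norm(seconds)
print(valid.tolist())
print(normalized)
```

Because L2 normalization only rescales the column, timestamps that are large and close together end up with nearly identical values; a min-max or standard scaling would spread them out more, if that turns out to matter for the models.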

Now, before we move on, let's just do a sanity check to make sure we still have a sufficient number of examples from each of the birdcall classes.

In [None]:
min_audio_sample_count = np.min(training_files.groupby('birdcall label')['birdcall label'].count())
min_audio_sample_count

## Training Setup

Now that our data is relatively tidy, we can focus on preparing it for model training. Our only goal will be to create an appropriate train and cross-validation split using our training data. Ideally, we would create a separate training sample from every possible disjoint five-second segment of each of our files (or perhaps even every overlapping five-second segment using a stride of one second over each file), but, due to our limited time and memory, we will limit ourselves to treating an entire file as a single training example.

In [None]:
# Check the current number of training files
training_files.shape[0]

In [None]:
from sklearn.model_selection import train_test_split

X = training_files
y = training_files['birdcall label']

# Pass the train_size=0.70 argument to get a roughly 70-30 split of train and cross-validation data and pass the stratify=y argument to get a similar distribution of birdcall labels between the train and cross-validation sets
X_train, X_cv = train_test_split(X, train_size=0.70, stratify=y)
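To see what `stratify` buys us, here is a small sketch on a balanced toy data set. The filenames and labels are invented, and the `random_state` argument is an addition made here for reproducibility (the cell above does not set one).

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data set with two equally frequent classes
toy = pd.DataFrame({
    'filename': [f'f{i}.ogg' for i in range(20)],
    'birdcall label': ['a'] * 10 + ['b'] * 10,
})

# Stratifying on the label keeps the class proportions the same in both splits
train, cv = train_test_split(
    toy, train_size=0.70, stratify=toy['birdcall label'], random_state=1234
)

print(train['birdcall label'].value_counts().to_dict())
print(cv['birdcall label'].value_counts().to_dict())
```

Without `stratify`, a random 70-30 split could by chance leave a rare class with very few or no cross-validation examples.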

In [None]:
# See how many examples are in our training set
X_train.shape[0]

### Helper Functions

Let's do some preparatory work by defining a couple of functions that will be useful as we move on to training our models.

In [None]:
import random

RANDOM_SEED = 1234
# Set the seed for the random number generator provided by the random library
random.seed(RANDOM_SEED)

In [None]:
import soundfile as sf # for working with the .ogg audio files

def read_ogg_data(filename: str, birdcall_label: str) -> np.ndarray:
    """Returns the audio data from a specified file as a NumPy array

    Parameters
    ----------
    filename: str
        The name of the file containing the audio signal to be returned
    birdcall_label: str
        The birdcall label of the specified file; used to determine the parent folder of the specified file within the train_short_audio_folder

    Returns
    -------
    numpy.ndarray
        A NumPy array representing the audio signal
    """
    # Use the soundfile library to read in the audio data
    data, _ = sf.read(f"{train_short_audio_folder}{birdcall_label}/{filename}")
    # Reshape `data` into two dimensions since the soundfile library will return a one-dimensional NumPy array by default
    reshaped_data = data.reshape((-1, data.shape[0]))
    return reshaped_data

In [None]:
def extract_time_interval(audio_signal: np.ndarray, sample_rate: int, start_time_in_seconds: int, interval_length_in_seconds: int, pad: bool=True) -> np.ndarray:
    """Returns a NumPy array corresponding to a specified time interval from a given NumPy array audio sample

    Parameters
    ----------
    audio_signal: np.ndarray
        The input NumPy array audio signal
    sample_rate: int
        The sample rate in Hertz of the inputted audio
    start_time_in_seconds: int
        The offset in seconds from the beginning of the audio sample from which the desired interval of audio is extracted
    interval_length_in_seconds: int
        The length of the desired duration in seconds of the audio sample to be extracted
    pad: bool
        A bool indicating whether the output array should be zero-padded to satisfy the desired interval length

    Returns
    -------
    numpy.ndarray
        A NumPy array representing the extracted audio signal
    """
    # Use the start time in seconds and the sample rate to calculate the actual index at which the desired interval of audio begins
    start_index = start_time_in_seconds * sample_rate
    # Use the interval length in seconds and the sample rate to calculate the number of indices of the input NumPy array that must be extracted
    index_interval = interval_length_in_seconds * sample_rate
    sample = audio_signal[:, start_index:start_index + index_interval]
    # If the caller does not wish to pad the data should it be shorter than the desired interval
    if not pad:
        return sample
    # If padding is required
    else:
        # Calculate the number of zeros that must be appended to the sample so that it has the desired interval length
        num_indices_to_pad = index_interval - sample.shape[1]
        # If the length of the sample is of the desired length
        if num_indices_to_pad == 0:
            return sample
        # Else if the length of the sample is shorter than desired
        else:
            return np.concatenate([sample, np.zeros((1, num_indices_to_pad))], axis=1)
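The slicing and zero-padding logic is easier to follow with small numbers. The sketch below is a compact restatement of the same idea using `np.pad`, with a hypothetical helper name (`slice_with_padding`) and a toy sample rate of 4 Hz; it is not a drop-in replacement for the function above.

```python
import numpy as np

def slice_with_padding(signal, sample_rate, start_s, length_s):
    # Convert seconds into array indices
    start = start_s * sample_rate
    width = length_s * sample_rate
    segment = signal[:, start:start + width]
    # Append zeros on the right if the slice ran past the end of the signal
    missing = width - segment.shape[1]
    return np.pad(segment, ((0, 0), (0, missing)))

# A 10-sample signal with shape (1, 10), i.e. 2.5 seconds at 4 Hz
signal = np.arange(1, 11, dtype=float).reshape(1, -1)

full = slice_with_padding(signal, 4, 0, 1)  # the first second fits entirely
tail = slice_with_padding(signal, 4, 2, 1)  # the third second has only two samples left

print(full)  # [[1. 2. 3. 4.]]
print(tail)  # [[ 9. 10.  0.  0.]]
```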

In [None]:
def extract_random_time_interval(audio_signal: np.ndarray, sample_rate: int, interval_length_in_seconds: int, pad: bool) -> np.ndarray:
    """Returns a random interval of the specified length from the inputted audio signal

    Parameters
    ----------
    audio_signal: np.ndarray
        A NumPy array representing an audio signal
    sample_rate: int
        The sample rate in Hertz of the inputted audio
    interval_length_in_seconds: int
        The length of the desired duration in seconds of the audio sample to be extracted
    pad: bool
        A bool indicating whether the output array should be zero-padded to satisfy the desired interval length

    Returns
    -------
    numpy.ndarray
        A NumPy array representing the extracted audio signal"""
    num_elements = audio_signal.shape[1]
    if pad:
        num_segments = int(np.ceil(num_elements / (interval_length_in_seconds * sample_rate)))
    else:
        num_segments = num_elements // (interval_length_in_seconds * sample_rate)

    return extract_time_interval(audio_signal, sample_rate, random.choice(range(num_segments)) * interval_length_in_seconds, interval_length_in_seconds, pad)
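The number of candidate segments depends on whether padding is allowed. The following sketch isolates that calculation in a hypothetical helper (`count_segments`, a name introduced here) so the two cases can be compared directly.

```python
import math

def count_segments(num_samples: int, sample_rate: int, interval_s: int, pad: bool) -> int:
    interval = interval_s * sample_rate
    # With padding, a partial final segment counts; without it, only full segments do
    return math.ceil(num_samples / interval) if pad else num_samples // interval

# A 5.5-second recording at 32 kHz
print(count_segments(176_000, 32_000, 5, pad=True))   # 2
print(count_segments(176_000, 32_000, 5, pad=False))  # 1

# A recording shorter than five seconds has no full segment at all
print(count_segments(100_000, 32_000, 5, pad=False))  # 0
```

In the last case `random.choice(range(0))` has nothing to choose from and raises an `IndexError`, so recordings shorter than the interval need either `pad=True` or to be filtered out beforehand.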

## Model Definitions

Now that we have the preparatory work out of the way, let's move on to defining our neural network models. First, we'll test out the performance of convolutional neural networks (CNNs). To be able to assign a variable number of birdcall labels to a given sound sample, we will train a separate neural network model to recognize each of the 397 distinct classes of birds in our data set. We will use some of the code provided by the [fast.ai tutorial](https://pytorch.org/tutorials/beginner/nn_tutorial.html) at [pytorch.org](https://pytorch.org) to set up our initial boilerplate and then further modify the neural networks per our needs. Note that we will take a slightly unconventional approach: in addition to testing simple one-dimensional convolutional neural networks over the audio samples, we will also try reshaping the one-dimensional audio signals into two-dimensional arrays so that we can relatively cheaply extract temporal information over entire audio samples using two-dimensional convolutions. After testing out the convolutional neural networks, we will test the performance of vanilla recurrent neural networks (RNNs) on the data set.

In [None]:
import torch
from torch import nn # to expose the Module class from which we can subclass to create our own, custom neural network layers
from typing import Callable # to specify a parameter as a function type
from typing import Union # to specify a parameter as a union type

class Lambda(nn.Module):
    """A class for creating activation layers for PyTorch neural networks

    ...

    Attributes
    ----------
    f: function
        The activation function
    """
    def __init__(self, f: Callable[[torch.Tensor], torch.Tensor]):
        """
        Parameters
        ----------
        f: function
            The activation function
        """
        super().__init__()
        self.f = f

    def forward(self, xs: Union[torch.Tensor, Sequence[torch.Tensor]]) -> torch.Tensor:
        """A function that applies the activation function `self.f` to the input `xs`

        Parameters
        ----------
        xs: torch.Tensor or Sequence[torch.Tensor]
            The input to this layer: either a single tensor or a sequence of tensors that is unpacked into the positional arguments of `self.f`

        Returns
        -------
        torch.Tensor
            A PyTorch tensor representing the result of applying `self.f` to the input
        """
        # A single tensor is passed straight through; a sequence is unpacked into separate arguments
        if isinstance(xs, torch.Tensor):
            return self.f(xs)
        else:
            return self.f(*xs)

In [None]:
def reshape_into_image(x: torch.tensor) -> torch.tensor:
    """Reshapes an inputted PyTorch tensor representing five seconds of audio sampled at 32kHz into a square, 400 x 400 pixel image

    Parameters
    ----------
    x: torch.tensor
        A PyTorch tensor representing a five-second audio signal sampled at 32kHz

    Returns
    -------
    torch.tensor
        A PyTorch tensor in the shape of a 400 x 400 image

    Notes
    -----
    This function takes advantage of the fact that a five-second audio clip sampled at 32kHz will have 32,000 * 5 = 160,000 data points, which yields a perfect square.
    """
    return x.view(-1, 1, 400, 400)
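To get a feel for what the reshaped "image" contains, here is a sketch using NumPy's `reshape` as a stand-in for `view` (an assumption made so the example runs without PyTorch; for a contiguous array the element layout is the same).

```python
import numpy as np

SAMPLE_RATE = 32_000

# Use the sample index itself as the signal value so positions are easy to track
signal = np.arange(5 * SAMPLE_RATE)
image = signal.reshape(-1, 1, 400, 400)

print(image.shape)        # (1, 1, 400, 400)
print(image[0, 0, 0, 1])  # 1   -> moving one column to the right advances by one sample
print(image[0, 0, 1, 0])  # 400 -> moving one row down advances by 400 samples

# Each row therefore covers 400 consecutive samples, i.e. 12.5 milliseconds of audio
row_duration_ms = 400 * 1000 / SAMPLE_RATE
print(row_duration_ms)    # 12.5
```

Vertically adjacent pixels are 12.5 ms apart in time, which is what lets a two-dimensional kernel see both fine-grained and coarser temporal structure at once.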

In [None]:
# Set our device to be the GPU if available else the CPU
DEVICE = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")

In [None]:
class OneDimensionalConvolutionalNeuralNetwork(nn.Module):
    """A class for creating a 1D convolutional neural network birdcall classifier for inputs containing a latitude, longitude, timestamp, and a five-second audio signal sampled at 32kHz"""
    def __init__(self):
        super(OneDimensionalConvolutionalNeuralNetwork, self).__init__()
        self.splitters = nn.ModuleDict({
            'non-audio': Lambda(lambda x: x[0, :3].view(1, 3)),
            'audio': Lambda(lambda x: x[0, 3:])
        })
        # Create the convolutional neural network for the audio signal
        self.cnn = nn.Sequential(
            Lambda(lambda x: x.view(-1, 1, 160_000)),
            nn.Conv1d(1, 1, kernel_size=16, stride=16),
            nn.ReLU(),
            nn.Conv1d(1, 1, kernel_size=100, stride=100),
            nn.ReLU(),
            nn.AdaptiveAvgPool1d(10),
            Lambda(lambda x: x.view(-1, 10))
        )
        self.combiner = Lambda(lambda x1, x2: torch.cat([x1, x2], dim=1))
        self.output = nn.Linear(13, 2)

    def forward(self, x):
        non_audio_x = self.splitters['non-audio'](x)
        audio_x = self.splitters['audio'](x)
        cnn_output = self.cnn(audio_x)
        recombined_input = self.combiner([non_audio_x, cnn_output])
        return self.output(recombined_input)

In [None]:
from torch import nn # to expose the nn.Module class

class TwoDimensionalConvolutionalNeuralNetwork(nn.Module):
    """A class for creating a 2D convolutional neural network birdcall classifier for inputs containing a latitude, longitude, timestamp, and a five-second audio signal sampled at 32kHz"""
    def __init__(self):
        super(TwoDimensionalConvolutionalNeuralNetwork, self).__init__()
        # Create a dictionary of `nn.Module`s for splitting the input into its audio and non-audio components
        self.splitters = nn.ModuleDict({
            'non-audio': Lambda(lambda x: x[0, :3].view(1, 3)),
            'audio': Lambda(lambda x: x[0, 3:])
        })
        # Create the convolutional neural network for the audio signal
        self.cnn = nn.Sequential(
            # Reshape the audio signal into a square image
            Lambda(reshape_into_image),
            nn.Conv2d(1, 1, kernel_size=20, stride=20),
            nn.ReLU(),
            nn.Conv2d(1, 1, kernel_size=10, stride=10),
            nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
            Lambda(lambda x: x.view(-1, 1))
        )
        # For combining the CNN output and non-audio input into a single PyTorch tensor
        self.combiner = Lambda(lambda x1, x2: torch.cat([x1, x2], dim=1))
        # For yielding the final output
        self.output = nn.Linear(4, 2)

    def forward(self, x):
        non_audio_x = self.splitters['non-audio'](x)
        audio_x = self.splitters['audio'](x)
        cnn_output = self.cnn(audio_x)
        recombined_input = self.combiner([non_audio_x, cnn_output])
        return self.output(recombined_input)
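The input sizes of the final `nn.Linear` layers (13 and 4) follow from the output-length formula for an unpadded, undilated convolution, floor((L - kernel) / stride) + 1. The short sketch below works through the arithmetic in plain Python; `conv_out` is a helper name introduced here for illustration.

```python
def conv_out(length: int, kernel: int, stride: int) -> int:
    # Output length of a convolution with no padding and no dilation
    return (length - kernel) // stride + 1

NUM_NON_AUDIO_FEATURES = 3  # latitude, longitude, timestamp

# 1D network: 160,000 samples -> Conv1d(16, 16) -> Conv1d(100, 100) -> AdaptiveAvgPool1d(10)
after_conv1 = conv_out(160_000, 16, 16)
after_conv2 = conv_out(after_conv1, 100, 100)
features_1d = 10 + NUM_NON_AUDIO_FEATURES
print(after_conv1, after_conv2, features_1d)  # 10000 100 13

# 2D network: 400 x 400 image -> Conv2d(20, 20) -> Conv2d(10, 10) -> AdaptiveAvgPool2d(1)
side_after_conv1 = conv_out(400, 20, 20)
side_after_conv2 = conv_out(side_after_conv1, 10, 10)
features_2d = 1 + NUM_NON_AUDIO_FEATURES
print(side_after_conv1, side_after_conv2, features_2d)  # 20 2 4
```

The adaptive pooling layers fix the number of audio features (10 and 1, respectively) regardless of the convolution output size, and the three metadata values are concatenated on top.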

In [None]:
class RecurrentNeuralNetwork(nn.Module):
    """A class for creating a recurrent neural network birdcall classifier for inputs containing a latitude, longitude, timestamp, and a five-second audio signal sampled at 32kHz"""
    def __init__(self):
        super(RecurrentNeuralNetwork, self).__init__()
        self.splitters = nn.ModuleDict({
            'non-audio': Lambda(lambda x: x[:, :3]),
            'audio': Lambda(lambda x: x[:, 3:])
        })
        # Create the recurrent neural network for the audio signal
        self.rnn_one = nn.RNNCell(10000, 5)
        self.rnn_two = nn.RNNCell(5, 10)
        self.combiner = Lambda(lambda x1, x2: torch.cat([x1, x2], dim=1))
        self.output = nn.Linear(13, 2)

    def forward(self, x):
        non_audio_x = self.splitters['non-audio'](x)
        audio_x = self.splitters['audio'](x)
        hx_1 = torch.randn(1, 5).to(DEVICE)
        hx_2 = torch.randn(1, 10).to(DEVICE)
        for i in range(16):
            hx_1 = self.rnn_one(audio_x[:, i * 10000:i * 10000 + 10000], hx_1)
            hx_2 = self.rnn_two(hx_1, hx_2)
        recombined_input = self.combiner([non_audio_x, hx_2])
        return self.output(recombined_input)

Now that we've got our models, let's make sure that we can run them on a real training sample.

In [None]:
def get_randomized_model_input_as_tensor(example: pd.DataFrame, sample_rate: int, sample_interval_length_in_seconds: int, device: torch.device, pad: bool) -> torch.Tensor:
    """Produces a tensor corresponding to a random audio segment of the desired duration from the specified file along with the audio example's metadata attributes

    Parameters
    ----------
    example: pd.DataFrame
        A Pandas DataFrame containing the name of the file from which the audio segment should be extracted along with its associated metadata attributes
    sample_rate: int
        The sample rate of the audio file
    sample_interval_length_in_seconds: int
        The desired duration in seconds of the audio segment to be extracted
    device: torch.device
        The device (i.e. CPU or GPU) on which the returned PyTorch tensor should reside
    pad: bool
        Whether or not the function should permit returning an audio segment that is originally shorter than the desired length by means of zero-padding

    Returns
    -------
    torch.Tensor
        The PyTorch tensor representing an input vector to be fed into a neural network"""
    filename = example['filename']
    latitude = example['latitude']
    longitude = example['longitude']
    timestamp = example['timestamp']
    birdcall_label = example['birdcall label']
    audio_signal = read_ogg_data(filename, birdcall_label)
    audio_data = extract_random_time_interval(audio_signal, sample_rate, sample_interval_length_in_seconds, pad=pad)
    # Ensure that the elements of the input are of type torch.float32 instead of torch.float64 so that they are compatible with the convolution layers of our neural network
    # See the following StackOverflow article for reference: https://stackoverflow.com/questions/66074684/runtimeerror-expected-scalar-type-double-but-found-float-in-pytorch-cnn-train
    # Use the `device` argument rather than the global DEVICE so that the caller's choice of device is respected
    non_audio_data = torch.tensor([latitude, longitude, timestamp]).view(1, -1).float().to(device)
    audio_data = torch.tensor(audio_data).view(1, -1).float().to(device)
    return torch.cat([non_audio_data, audio_data], dim=1)

In [None]:
import torch.nn.functional as F

# Use cross-entropy as our classification loss function
LOSS_FUNCTION = F.cross_entropy

SAMPLE_INTERVAL_LENGTH_IN_SECONDS = 5
SAMPLE_RATE = 32_000

LEARNING_RATE = 0.01
MOMENTUM = 0.90
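`F.cross_entropy` takes raw, unnormalized scores (logits) and applies a log-softmax internally before picking out the log-probability of the true class. For a single example this can be written out by hand; the NumPy sketch below is an illustration of the formula, not PyTorch's implementation.

```python
import numpy as np

def cross_entropy(logits: np.ndarray, target: int) -> float:
    # Subtract the maximum for numerical stability; this does not change the result
    shifted = logits - np.max(logits)
    log_probs = shifted - np.log(np.sum(np.exp(shifted)))
    # The loss is the negative log-probability assigned to the true class
    return float(-log_probs[target])

undecided = cross_entropy(np.array([0.0, 0.0]), target=0)
confident_right = cross_entropy(np.array([4.0, 0.0]), target=0)
confident_wrong = cross_entropy(np.array([0.0, 4.0]), target=0)

print(undecided)        # log(2), roughly 0.693
print(confident_right)  # close to zero
print(confident_wrong)  # a little over 4
```

With two output units per model, an untrained network that scores both classes equally starts at a loss of about 0.693.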

In [None]:
from torch import optim # to expose the optim module and the Optimizer class

cnn_1d = OneDimensionalConvolutionalNeuralNetwork().to(DEVICE)
cnn_2d = TwoDimensionalConvolutionalNeuralNetwork().to(DEVICE)
rnn = RecurrentNeuralNetwork().to(DEVICE)

# Get a random training file
training_file = training_files.iloc[random.choice(range(training_files.shape[0])), :]
x = get_randomized_model_input_as_tensor(training_file, SAMPLE_RATE, SAMPLE_INTERVAL_LENGTH_IN_SECONDS, DEVICE, False)

cnn_1d_optim = optim.SGD(cnn_1d.parameters(), lr=LEARNING_RATE, momentum=MOMENTUM)
cnn_2d_optim = optim.SGD(cnn_2d.parameters(), lr=LEARNING_RATE, momentum=MOMENTUM)
rnn_optim = optim.SGD(rnn.parameters(), lr=LEARNING_RATE, momentum=MOMENTUM)

cnn_1d_out = cnn_1d(x)
cnn_2d_out = cnn_2d(x)
rnn_out = rnn(x)

# Assume the true label to be 0 and propagate gradients backward through the networks
cnn_1d_loss = LOSS_FUNCTION(cnn_1d_out, torch.zeros(1).long().to(DEVICE))
cnn_1d_loss.backward()

cnn_2d_loss = LOSS_FUNCTION(cnn_2d_out, torch.zeros(1).long().to(DEVICE))
cnn_2d_loss.backward()

rnn_label_prediction_loss = LOSS_FUNCTION(rnn_out, torch.zeros(1).long().to(DEVICE))
rnn_label_prediction_loss.backward()

# Try making a gradient descent step
cnn_1d_optim.step()
cnn_2d_optim.step()
rnn_optim.step()

# Zero out the error derivatives over the computation graph
cnn_1d_optim.zero_grad()
cnn_2d_optim.zero_grad()
rnn_optim.zero_grad()
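For intuition about what `step()` does with `momentum=0.90`, the update rule of SGD with momentum (with zero dampening and no Nesterov correction, which are the `optim.SGD` defaults) can be written out for a single scalar parameter. The gradient values below are invented for illustration.

```python
def sgd_momentum_step(param: float, grad: float, velocity: float,
                      lr: float = 0.01, momentum: float = 0.90):
    # The velocity accumulates an exponentially decaying sum of past gradients
    velocity = momentum * velocity + grad
    # The parameter moves against the accumulated velocity
    param = param - lr * velocity
    return param, velocity

p, v = 1.0, 0.0
p, v = sgd_momentum_step(p, grad=2.0, velocity=v)  # velocity 2.0, parameter 0.98
p, v = sgd_momentum_step(p, grad=2.0, velocity=v)  # velocity 3.8, parameter about 0.942
print(p, v)
```

With a constant gradient, the second step is almost twice as large as the first, which is the acceleration effect momentum is meant to provide.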

## Training

In [None]:
from typing import Dict # to specify a parameter as a dict

def train_models_on_audio_file(training_file: pd.DataFrame, loss_function: Callable[[torch.Tensor, torch.Tensor], torch.Tensor], class_label_to_model_optimizer_dict: Dict[str, optim.Optimizer], class_label_to_model_dict: Dict[str, nn.Module], sample_rate: int, sample_interval_length_in_seconds: int, device: torch.device) -> Dict[str, float]:
    """Trains a set of binary neural network classifiers on a single training file

    Parameters
    ----------
    training_file: pd.DataFrame
        A single row of the training set containing the filename of a training file along with its associated metadata attributes
    loss_function: Callable[[torch.Tensor, torch.Tensor], torch.Tensor]
        The loss function applied to each of the classifiers
    class_label_to_model_optimizer_dict: Dict[str, optim.Optimizer]
        A dictionary mapping each label from the list of class labels to the optimizer of its corresponding model instance
    class_label_to_model_dict: Dict[str, nn.Module]
        A dictionary that maps each label from the list of class labels to its associated model instance
    sample_rate: int
        The sampling rate of the audio file on which the models are to be trained
    sample_interval_length_in_seconds: int
        The duration in seconds of the audio samples on which the models should be trained
    device: torch.device
        The type of device (CPU or GPU) on which to store the PyTorch tensors

    Returns
    -------
    Dict[str, float]
        A dictionary mapping each birdcall label to the cross-entropy classification loss of the corresponding model
    """
    x_train = get_randomized_model_input_as_tensor(training_file, sample_rate, sample_interval_length_in_seconds, device, False)
    class_label_to_loss_dict = {}
    for class_label, model in class_label_to_model_dict.items():
        # Create a one-element target tensor: 1 if this classifier's label matches the file's birdcall label, else 0
        y_train = torch.ones(1) if (class_label == training_file['birdcall label']) else torch.zeros(1)
        # Ensure that the elements of y_train are integral values of type torch.int64
        y_train = y_train.long()
        # Make sure that the label vector resides on the appropriate device
        y_train = y_train.to(device)
        # Put the model in training mode, since the previous call left it in evaluation mode
        model.train()
        predictions = model(x_train)
        loss = loss_function(predictions, y_train)
        # Backpropagate the gradients over the graph representing the neural network
        loss.backward()
        # Make a gradient descent step using the model's associated optimizer
        class_label_to_model_optimizer_dict[class_label].step()
        # Reset all model parameter gradients to zero
        class_label_to_model_optimizer_dict[class_label].zero_grad()
        # Leave the model in evaluation mode so that later inference is unaffected by training-only layers
        model.eval()
        # Record the loss computed above as a plain Python float; no second evaluation is needed
        class_label_to_loss_dict[class_label] = loss.item()
    return class_label_to_loss_dict
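
The one-vs-rest labelling used above, where each per-label classifier sees a target of 1 only when its label matches the file's label, can be sketched without PyTorch. This is a minimal illustration; the helper `build_one_vs_rest_targets` and the example labels are hypothetical and do not appear elsewhere in the notebook.

```python
from typing import Dict, Sequence


def build_one_vs_rest_targets(class_labels: Sequence[str], true_label: str) -> Dict[str, int]:
    """Return the binary target that each per-label classifier should see for one file."""
    # A classifier gets target 1 only if its own label is the file's label
    return {label: int(label == true_label) for label in class_labels}


# Hypothetical labels, used only for illustration
targets = build_one_vs_rest_targets(["amerob", "bkcchi", "norcar"], "bkcchi")
print(targets)
```

With many labels, almost every classifier receives a 0 target for any given file, so each binary model is trained on heavily imbalanced data.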

In [None]:
# Use the tqdm object from the tqdm.notebook module so that the progress bar does not result in the creation of multiple lines of output in the IPython notebook
# Suggestion taken from the following StackOverflow article: https://stackoverflow.com/questions/42212810/tqdm-in-jupyter-notebook-prints-new-progress-bars-repeatedly
from tqdm.notebook import tqdm # to be able to show a progress bar during training

def train_models_on_examples(training_files: pd.DataFrame, loss_function: Callable[[torch.Tensor, torch.Tensor], torch.Tensor], class_label_to_model_optimizer_dict: Dict[str, optim.Optimizer], class_label_to_model_dict: Dict[str, nn.Module], audio_file_sample_rate: int, sample_interval_length_in_seconds: int, device: torch.device, progress_bar: bool=True) -> Dict[str, Sequence[float]]:
    """Trains a set of classifiers on a sequence of labelled audio files

    Parameters
    ----------
    training_files: pd.DataFrame
        A Pandas DataFrame containing one row per audio file on which the models are to be trained
    loss_function: Callable[[torch.Tensor, torch.Tensor], torch.Tensor]
        The loss function to be applied to each of the classifiers
    class_label_to_model_optimizer_dict: Dict[str, optim.Optimizer]
        A dictionary that maps each label from the list of class labels to the optimizer of its corresponding model instance
    class_label_to_model_dict: Dict[str, nn.Module]
        A dictionary that maps each label from the list of class labels to its associated model instance
    audio_file_sample_rate: int
        The sampling rate of each audio file on which the models are to be trained
    sample_interval_length_in_seconds: int
        The duration in seconds of the audio samples on which the models should be trained
    device: torch.device
        The type of device (CPU or GPU) on which to store the PyTorch tensors
    progress_bar: bool
        A bool indicating whether or not a progress bar should be displayed during training

    Returns
    -------
    Dict[str, Sequence[float]]
        A dictionary mapping each birdcall label to the sequence of cross-entropy classification losses recorded for each training file on which the models were trained
    """
    # Iterate over the row positions of the training files, wrapping the range in a progress bar if the user requested one
    row_positions = range(training_files.shape[0])
    indices = tqdm(row_positions) if progress_bar else row_positions
    class_label_to_losses_dict = {}
    for idx in indices:
        class_label_to_loss_dict = train_models_on_audio_file(training_files.iloc[idx, :], loss_function, class_label_to_model_optimizer_dict, class_label_to_model_dict, audio_file_sample_rate, sample_interval_length_in_seconds, device)
        for class_label, loss in class_label_to_loss_dict.items():
            losses = class_label_to_losses_dict.get(class_label, [])
            losses.append(loss)
            class_label_to_losses_dict[class_label] = losses
    return class_label_to_losses_dict
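
The per-label loss bookkeeping in the loop above can also be written with `collections.defaultdict`, which removes the get/append/reassign sequence. The following is a minimal sketch on plain floats; `collect_losses` and the example labels are hypothetical names introduced only for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, List


def collect_losses(per_file_losses: Iterable[Dict[str, float]]) -> Dict[str, List[float]]:
    """Group the loss reported for each file into one list per class label."""
    losses_by_label = defaultdict(list)
    for file_losses in per_file_losses:
        for label, loss in file_losses.items():
            # A missing label starts with an empty list automatically
            losses_by_label[label].append(loss)
    return dict(losses_by_label)


# Two hypothetical training files, each reporting one loss per label
example = collect_losses([{"a": 0.9, "b": 0.7}, {"a": 0.6, "b": 0.8}])
print(example)
```

Each list keeps the losses in the order the files were visited, so converting the result to a DataFrame gives one row per training file and one column per label.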

In [None]:
birdcalls = most_common_birdcalls.values.reshape(-1)

# Initialize our neural network models
one_dimensional_convolutional_models = dict(zip(birdcalls, [OneDimensionalConvolutionalNeuralNetwork().to(DEVICE) for _ in birdcalls]))
two_dimensional_convolutional_models = dict(zip(birdcalls, [TwoDimensionalConvolutionalNeuralNetwork().to(DEVICE) for _ in birdcalls]))
rnn_models = dict(zip(birdcalls, [RecurrentNeuralNetwork().to(DEVICE) for _ in birdcalls]))

In [None]:
# Create an optimizer to do gradient updates for every set of models that must be trained
one_dimensional_convolutional_network_optimizers = dict(zip(birdcalls, [optim.SGD(model.parameters(), lr=LEARNING_RATE, momentum=MOMENTUM) for model in one_dimensional_convolutional_models.values()]))
two_dimensional_convolutional_network_optimizers = dict(zip(birdcalls, [optim.SGD(model.parameters(), lr=LEARNING_RATE, momentum=MOMENTUM) for model in two_dimensional_convolutional_models.values()]))
rnn_optimizers = dict(zip(birdcalls, [optim.SGD(model.parameters(), lr=LEARNING_RATE, momentum=MOMENTUM) for model in rnn_models.values()]))

### Reproducibility

Although we cannot guarantee that our experiments will be fully reproducible, we can reduce the variation across runs of our notebook by following the advice in the PyTorch [reproducibility article](https://pytorch.org/docs/stable/notes/randomness.html?highlight=reproducible#).

In [None]:
# Set the random seed used by PyTorch to help ensure that we get more reproducible results during training
torch.manual_seed(RANDOM_SEED)
# Similarly set the random seed used by NumPy
np.random.seed(RANDOM_SEED)
# Tell PyTorch to use deterministic algorithms instead of non-deterministic ones to also help ensure that we get more reproducible results
try:
    torch.use_deterministic_algorithms(True)
# In case the currently loaded version of the PyTorch library does not support `use_deterministic_algorithms()`
except AttributeError:
    print("This version of PyTorch does not support the use_deterministic_algorithms() method. Skipping...")
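
If other parts of a pipeline draw from Python's built-in `random` module, that generator needs its own seed as well. Below is a minimal sketch of a single seeding helper; the name `seed_everything` and the seed value 42 are assumptions made for illustration, and the NumPy and PyTorch calls are skipped when those libraries are not installed.

```python
import random


def seed_everything(seed: int) -> None:
    """Seed every random number generator that is available in this environment."""
    random.seed(seed)
    try:
        import numpy as np
        np.random.seed(seed)
    except ImportError:
        pass  # NumPy is not installed, so there is nothing to seed
    try:
        import torch
        torch.manual_seed(seed)
    except ImportError:
        pass  # PyTorch is not installed, so there is nothing to seed


# Seeding twice with the same value reproduces the same draw
seed_everything(42)
first_draw = random.random()
seed_everything(42)
second_draw = random.random()
print(first_draw == second_draw)
```

Calling one helper at the top of the notebook keeps the seeds in one place, which makes it harder to forget a generator when a new library is introduced.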

In [None]:
print("Initiating training...")
print("Training 1D convolutional models...")
one_dimensional_convolutional_losses = train_models_on_examples(X_train, LOSS_FUNCTION, one_dimensional_convolutional_network_optimizers, one_dimensional_convolutional_models, SAMPLE_RATE, SAMPLE_INTERVAL_LENGTH_IN_SECONDS, DEVICE)
print("Training 2D convolutional models...")
two_dimensional_convolutional_losses = train_models_on_examples(X_train, LOSS_FUNCTION, two_dimensional_convolutional_network_optimizers, two_dimensional_convolutional_models, SAMPLE_RATE, SAMPLE_INTERVAL_LENGTH_IN_SECONDS, DEVICE)
print("Training recurrent models...")
recurrent_losses = train_models_on_examples(X_train, LOSS_FUNCTION, rnn_optimizers, rnn_models, SAMPLE_RATE, SAMPLE_INTERVAL_LENGTH_IN_SECONDS, DEVICE)
print("Training completed!")

## Cross-Validation

In [None]:
# Set the confidence threshold fairly high so that we append a birdcall label to a multi-label classification only if we are fairly certain that the birdcall occurs within a particular audio segment
CONFIDENCE_THRESHOLD = 0.95

In [None]:
def compute_f1_score(num_false_positives: int, num_true_positives: int, num_false_negatives: int) -> float:
    # Referenced from: https://en.wikipedia.org/wiki/F-score
    # Without any true positives, precision and recall are zero or undefined, so return 0.0 instead of dividing by zero
    if num_true_positives == 0:
        return 0.0
    precision = num_true_positives / (num_false_positives + num_true_positives)
    recall = num_true_positives / (num_false_negatives + num_true_positives)
    return 2.0 * (precision * recall) / (precision + recall)
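
As a quick sanity check on the formula, the F1 score can equivalently be computed straight from the counts as 2TP / (2TP + FP + FN). The sketch below compares both forms on made-up counts; `f1_from_counts` is a hypothetical helper and the numbers are illustrative, not results from this notebook.

```python
def f1_from_counts(true_positives: int, false_positives: int, false_negatives: int) -> float:
    """Compute F1 as 2TP / (2TP + FP + FN), returning 0.0 when the score is undefined."""
    denominator = 2 * true_positives + false_positives + false_negatives
    return 2 * true_positives / denominator if denominator else 0.0


# Illustrative counts only
tp, fp, fn = 8, 2, 4
precision = tp / (tp + fp)  # 8 / 10
recall = tp / (tp + fn)     # 8 / 12
harmonic_mean = 2.0 * precision * recall / (precision + recall)
direct = f1_from_counts(tp, fp, fn)
print(harmonic_mean, direct)
```

The count-based form needs only one division, so the only input it has to guard against is the case where all three counts are zero.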

In [None]:
model_types = ['1D CNN', '2D CNN', 'RNN']

model_type_to_false_positives_dict = dict(zip(model_types, [0] * 3))
model_type_to_true_positives_dict = dict(zip(model_types, [0] * 3))
model_type_to_false_negatives_dict = dict(zip(model_types, [0] * 3))

model_type_to_models_dict = dict(zip(model_types, [one_dimensional_convolutional_models, two_dimensional_convolutional_models, rnn_models]))

print("Initiating cross-validation of models...")
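# Disable gradient tracking, since no parameters are updated during validation
with torch.no_grad():
    for idx in tqdm(range(X_cv.shape[0])):
        cv_file = X_cv.iloc[idx, :]
        x = get_randomized_model_input_as_tensor(cv_file, SAMPLE_RATE, SAMPLE_INTERVAL_LENGTH_IN_SECONDS, DEVICE, False)
        for model_type, models in model_type_to_models_dict.items():
            for birdcall, model in models.items():
                # Make sure training-only behaviour such as dropout is switched off before predicting
                model.eval()
                prediction = model(x)
                # Count a detection only if class 1 is the most likely class and its score exceeds the threshold
                # (comparing against CONFIDENCE_THRESHOLD assumes the model outputs probabilities rather than raw logits)
                if torch.argmax(prediction, dim=1) == 1 and prediction[:, 1] > CONFIDENCE_THRESHOLD:
                    if birdcall == cv_file['birdcall label']:
                        model_type_to_true_positives_dict[model_type] += 1
                    else:
                        model_type_to_false_positives_dict[model_type] += 1
                elif birdcall == cv_file['birdcall label']:
                    model_type_to_false_negatives_dict[model_type] += 1
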
print("Completed cross-validation!")
print("Calculated F1 scores:")
for model_type in model_types:
    print(f"{model_type} -> {compute_f1_score(model_type_to_false_positives_dict[model_type], model_type_to_true_positives_dict[model_type], model_type_to_false_negatives_dict[model_type])}")
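
The tallying logic above can be illustrated on plain numbers. The sketch below assumes that each binary classifier emits two raw logits per audio segment, so it applies a softmax before comparing against the threshold; if the notebook's models already end in a softmax layer, that conversion should be skipped. The helpers `softmax` and `tally_detections` and the example logits are hypothetical.

```python
import math
from typing import Dict, List, Sequence, Tuple


def softmax(logits: Sequence[float]) -> List[float]:
    """Convert raw logits into probabilities that sum to one."""
    peak = max(logits)  # subtracting the maximum keeps exp() numerically stable
    exponentials = [math.exp(value - peak) for value in logits]
    total = sum(exponentials)
    return [value / total for value in exponentials]


def tally_detections(logits_by_label: Dict[str, Sequence[float]], true_label: str, threshold: float) -> Tuple[int, int, int]:
    """Return (true positives, false positives, false negatives) for one audio segment."""
    true_positives = false_positives = false_negatives = 0
    for label, logits in logits_by_label.items():
        detected = softmax(logits)[1] > threshold
        if detected and label == true_label:
            true_positives += 1
        elif detected:
            false_positives += 1
        elif label == true_label:
            false_negatives += 1
    return true_positives, false_positives, false_negatives


# Made-up logits for three hypothetical labels; the segment's true label is "bkcchi"
example_logits = {"amerob": [-2.0, 3.0], "bkcchi": [1.0, 1.5], "norcar": [4.0, -1.0]}
counts = tally_detections(example_logits, "bkcchi", 0.95)
print(counts)
```

In this example the correct classifier is only about 62% confident, so the 0.95 threshold turns it into a false negative, which shows how a high threshold trades recall for precision.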

In [None]:
one_dimensional_convolutional_losses = pd.DataFrame(one_dimensional_convolutional_losses)
two_dimensional_convolutional_losses = pd.DataFrame(two_dimensional_convolutional_losses)
recurrent_losses = pd.DataFrame(recurrent_losses)

In [None]:
one_dimensional_convolutional_losses_melted = one_dimensional_convolutional_losses.melt(var_name='birdcall', ignore_index=False)
two_dimensional_convolutional_losses_melted = two_dimensional_convolutional_losses.melt(var_name='birdcall', ignore_index=False)
recurrent_losses_melted = recurrent_losses.melt(var_name='birdcall', ignore_index=False)

In [None]:
one_dimensional_convolutional_losses.to_csv(f"{output_folder}1D_CNN_losses.csv")
two_dimensional_convolutional_losses.to_csv(f"{output_folder}2D_CNN_losses.csv")
recurrent_losses.to_csv(f"{output_folder}recurrent_losses.csv")