# Training data (Short audio)

The training data for this competition consists of a collection of so-called “focal recordings”. These recordings were made using semi-professional equipment (often using highly directional microphones) and primarily focus on one single species. All recordings were contributed by Xeno-canto (https://www.xeno-canto.org), one of the largest digital archives for bird sounds. Each recording comes with metadata specifying things like recording date, recording location, and (of course) the bird species that was recorded.

To get a better understanding of the metadata, let’s look at a few entries.

In [None]:
import pandas as pd
import warnings
warnings.filterwarnings(action='ignore')

train = pd.read_csv('../input/birdclef-2021/train_metadata.csv',)
train.head()

## 1. Primary species

Most importantly, the metadata specifies the audible species for each recording. The primary species annotation consists of three data fields: *primary_label*, *scientific_name*, and *common_name*. All labels have to be considered “weak labels”: we know which species is audible in the recording, but we do not know the exact timestamps of the vocalizations. Training with weakly labeled data is one of the core challenges of this competition.

Let’s look at the number of different species.

In [None]:
len(train['primary_label'].value_counts())

Our dataset contains recordings for **397** different *primary* species, all of them defined by their **eBird code** (the codes that we use as primary label). Just as Xeno-canto is a digital platform that collects audio recordings, eBird (https://ebird.org) is a citizen science project that collects observations of birds. eBird uses unique species codes to reference birds. You can access additional information on each bird species by combining the base URL “https://ebird.org/species/” with a species code from the *primary_label* column of the metadata.

Here are a few examples:

Golden-crowned Kinglet: https://ebird.org/species/gockin  
Red-winged Blackbird: https://ebird.org/species/rewbla  
American Goldfinch: https://ebird.org/species/amegfi
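The URL construction described above can be sketched in a few lines. The helper name `ebird_url` is made up for this illustration; only the base URL and the three example codes come from the text.

```python
# Base URL as given in the text above
EBIRD_BASE_URL = "https://ebird.org/species/"


def ebird_url(species_code: str) -> str:
    """Return the eBird species page for a given eBird code (hypothetical helper)."""
    return EBIRD_BASE_URL + species_code.strip()


# The three example codes listed above
for code in ["gockin", "rewbla", "amegfi"]:
    print(code, "->", ebird_url(code))
```

With the metadata loaded, the same helper could be applied to the *primary_label* column, for example via `train['primary_label'].map(ebird_url)`.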

Let’s take a look at the number of recordings for each species in the training data:

In [None]:
# Code adapted from https://www.kaggle.com/shahules/bird-watch-complete-eda-fe
# Make sure to check out the entire notebook.

import plotly.graph_objects as go

# Unique eBird codes
species = train['primary_label'].value_counts()

# Make bar chart
fig = go.Figure(data=[go.Bar(y=species.values, x=species.index)],
                layout=go.Layout(margin=go.layout.Margin(l=0, r=0, b=10, t=50)))

# Show chart
fig.update_layout(title='Number of training samples per species')
fig.show()

We can clearly see that we’re facing a so-called “long-tail” classification problem with highly imbalanced training data. We limited the number of recordings to 500 for each species, but that only affected a dozen species. Most species have fewer than 500 recordings available on Xeno-canto (1). **It is important to note that the training data contains more classes than there are species annotated in the test data.** However, the training data only contains species that are likely to occur at the test data recording sites. Crawling Xeno-canto for more samples will not be necessary this year, and (in light of the API limitations imposed by Xeno-canto, https://www.xeno-canto.org/about/terms) we strongly discourage doing so.

(1) Xeno-canto uses open data licenses, but some of them do not allow derivatives (CC-ND), so we excluded them.
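One common way to deal with a long-tail distribution is to weight classes inversely to their frequency. The sketch below is only an illustration of that idea, not a recommendation made by this notebook; the function name and the toy label list are assumptions, and the weighting formula (`total / (n_classes * count)`) is just one of several possible choices.

```python
from collections import Counter


def inverse_frequency_weights(labels):
    """Weight each class by total / (n_classes * count), so rare classes weigh more."""
    counts = Counter(labels)
    total = len(labels)
    n_classes = len(counts)
    return {label: total / (n_classes * count) for label, count in counts.items()}


# Toy labels standing in for train['primary_label'].tolist()
toy_labels = ["houspa"] * 6 + ["norcar"] * 3 + ["wesblu"] * 1
weights = inverse_frequency_weights(toy_labels)

for label, weight in sorted(weights.items(), key=lambda item: item[1]):
    print(f"{label}: {weight:.3f}")
```

Such weights could be passed to a weighted loss or used to build a sampler that draws rare species more often.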

## 2. Background species

The metadata for each recording also lists audible background species. The data field “*secondary_labels*” contains lists of eBird codes (i.e., primary labels) that recordists annotated. It is important to note that these lists might be incomplete: you might be able to hear background species even though none are specified in the metadata. Therefore, lists of secondary labels are not very reliable, but they might still be useful for multi-label training (e.g., through loss masking for background species).

Let's look at some values:

In [None]:
train['secondary_labels'].value_counts()

We can see that the majority of recordings do not have an annotation of background species. Yet, it is highly likely that most of them actually contain one or more additional species. The data also shows us that the Red-winged Blackbird (rewbla), American Robin (amerob), House Sparrow (houspa), and Northern Cardinal (norcar) appear to be some of the most common background species.

**Please note, secondary labels only contain labels of species that are actually represented in the data set.**
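To make the loss-masking idea for background species more concrete, here is a minimal sketch. It assumes that the secondary labels arrive as the string representation of a Python list (e.g. `"['amerob', 'rewbla']"`) when read from the CSV, and it interprets “loss masking” as excluding the annotated background species from the loss instead of treating them as negatives. Both the interpretation and the helper names are assumptions made for this example.

```python
import ast


def parse_secondary_labels(raw):
    """Turn a string such as "['amerob', 'houspa']" into a Python list."""
    if isinstance(raw, list):
        return raw
    return list(ast.literal_eval(raw)) if raw else []


def target_and_mask(primary, secondary_raw, classes):
    """Build a one-hot target for the primary species plus a loss mask.

    A mask value of 0.0 means "ignore this class in the loss"; here that is
    applied to annotated background species, whose presence is uncertain.
    """
    secondary = set(parse_secondary_labels(secondary_raw))
    target = [1.0 if c == primary else 0.0 for c in classes]
    mask = [0.0 if (c in secondary and c != primary) else 1.0 for c in classes]
    return target, mask


# Toy class list; the real one would hold all eBird codes in the training data
classes = ["amerob", "houspa", "norcar", "rewbla"]
target, mask = target_and_mask("norcar", "['amerob', 'rewbla']", classes)
print(target)
print(mask)
```

During training, the per-class loss would be multiplied by `mask` before averaging, so uncertain background species neither reward nor penalize the model.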

## 3. Location, location, location

Each recording comes with a recording location specified in the metadata. Data fields “*latitude*” and “*longitude*” contain GPS coordinates as provided by the recordist. In combination with the recording date (data field “*date*”), this information can be very useful to map distribution and migration patterns. Why is it important? Not all birds occur at all locations at all times!

Let's look at a few examples:


In [None]:
# Code adapted from: https://www.kaggle.com/andradaolteanu/birdcall-recognition-eda-and-audio-fe
# Make sure to check out the entire notebook. It's brilliant.

import matplotlib.pyplot as plt
import seaborn as sns
import descartes
import geopandas as gpd
from shapely.geometry import Point, Polygon

# SHP file
world_map = gpd.read_file("../input/map-data/world_shapefile.shp")

# Coordinate reference system
crs = {"init" : "epsg:4326"}

# Lat and Long need to be of type float, not object
species_list = ['norcar', 'houspa', 'wesblu', 'banana']
data = train[train['primary_label'].isin(species_list)].copy()  # copy so the casts below do not modify a view of train
data["latitude"] = data["latitude"].astype(float)
data["longitude"] = data["longitude"].astype(float)

# Create geometry
geometry = [Point(xy) for xy in zip(data["longitude"], data["latitude"])]

# Geo Dataframe
geo_df = gpd.GeoDataFrame(data, crs=crs, geometry=geometry)

# Create ID for species
species_id = geo_df["primary_label"].value_counts().reset_index()
species_id.insert(0, 'ID', range(0, 0 + len(species_id)))

species_id.columns = ["ID", "primary_label", "count"]

# Add ID to geo_df
geo_df = pd.merge(geo_df, species_id, how="left", on="primary_label")

# === PLOT ===
fig, ax = plt.subplots(figsize = (16, 10))
world_map.plot(ax=ax, alpha=0.4, color="grey")

palette = iter(sns.hls_palette(len(species_id)))
for i in range(len(species_id)):
    geo_df[geo_df["ID"] == i].plot(ax=ax, 
                                   markersize=20, 
                                   color=next(palette), 
                                   marker="o", 
                                   label = species_id['primary_label'].values[i]);
    
ax.legend()

As we can see, different species occur over different spatial scales. According to the recording locations, the House Sparrow (houspa) has occurrences around the globe, the Northern Cardinal (norcar) appears to be a typical East coast species of the U.S., the Western Bluebird (wesblu) a West coast species. The Bananaquit (banana) seems to only occur in Central and South America. 

Location data can help us to create subsets of the training data for each of the four test data recording locations (which we will explore later). But be aware: The range of certain species may not be fully reflected by recording location data, and the actual range may differ from what we can see in the data. Yet, recording locations are a good starting point.

Please note that the training data only contains species that are likely to occur at the recording locations of the test data, even though sometimes the majority of the recordings were made in Europe. If you want to know more about the range of a certain species, please take a look at the associated eBird entry.
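Creating a location-based subset, as mentioned above, could look roughly like the following sketch. It uses the haversine formula to measure great-circle distance and keeps species with at least one recording within a given radius of a site. The site coordinates, the radius, and the toy records are placeholders invented for this example; real site coordinates should be taken from the test soundscape metadata.

```python
import math


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two (lat, lon) points."""
    earth_radius_km = 6371.0
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlambda = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    return 2 * earth_radius_km * math.asin(math.sqrt(a))


def species_near(records, site_lat, site_lon, radius_km):
    """Return sorted species codes with at least one recording within radius_km.

    records: iterable of (primary_label, latitude, longitude) tuples.
    """
    return sorted({
        label
        for label, lat, lon in records
        if haversine_km(lat, lon, site_lat, site_lon) <= radius_km
    })


# Placeholder site coordinates (an assumption, not taken from the competition data)
SITE_LAT, SITE_LON = 42.48, -76.45

# Toy records standing in for rows of the training metadata
toy_records = [
    ("norcar", 40.7, -74.0),
    ("wesblu", 37.8, -122.4),
    ("banana", 10.0, -84.0),
]
nearby = species_near(toy_records, SITE_LAT, SITE_LON, radius_km=500)
print(nearby)
```

Keep the caveat from above in mind: a species missing from such a subset may still occur at the site, so the radius should be chosen generously.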

## 4. Rating

Xeno-canto has a rating system for the quality of each recording. Ratings are assigned by users, and we adapted this rating scheme for the training data. In our case, ratings range from 0.5 to 5.0 (the latter being the best possible rating) and reflect the overall quality assigned by users and the number of background species. A value of “0” means that this particular recording does not have a rating; it serves as the fallback value.

Let's see how rating values are distributed across the training data:

In [None]:
# Code adapted from https://www.kaggle.com/shahules/bird-watch-complete-eda-fe
# Again, make sure to check out the entire notebook.

hist_data = train['rating'].values.tolist()
fig = go.Figure(data=[go.Histogram(x=hist_data)], 
                layout=go.Layout(margin=go.layout.Margin(l=0, r=0, b=10, t=50)))
fig.update_layout(title='Number of recordings per rating')

fig.show()

Overall, the training data contains high-quality recordings and the majority of samples are rated 3.5 or higher. Whenever we had to limit the number of recordings per species for the training data, we used the 500 top-rated samples. Sub-sampling training data based on user rating might help to extract high-quality training samples.
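Rating-based sub-sampling could be sketched as follows. The threshold of 3.5 and the choice of whether to keep unrated recordings (rating 0) are assumptions for illustration; the function name is hypothetical.

```python
def keep_by_rating(ratings, min_rating=3.5, keep_unrated=False):
    """Return the indices of recordings to keep.

    A rating of 0 marks an unrated recording, so it is handled separately
    instead of being compared against the threshold.
    """
    kept = []
    for index, rating in enumerate(ratings):
        if rating == 0:
            if keep_unrated:
                kept.append(index)
        elif rating >= min_rating:
            kept.append(index)
    return kept


# Toy ratings standing in for train['rating'].tolist()
toy_ratings = [5.0, 3.0, 0.0, 4.5, 3.5]
print(keep_by_rating(toy_ratings))
print(keep_by_rating(toy_ratings, keep_unrated=True))
```

For rare species, a strict threshold may remove most of the available samples, so it can make sense to relax it per species.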

Other data fields of the metadata might be of value at some point during development. Here is a brief description for each of them:

* **type**: Represents the type of the vocalization, with “song” and “call” as the most common. Excluding or including recordings of certain call types might help to diversify training data. Learn more about how and why birds vocalize here: https://academy.allaboutbirds.org/birdsong/

* **author**: Acknowledgement to the recordists who contributed the recording. Some recordists focus on specific subsets of species, so there might be some value in these data.

* **filename**: A reference to the sound file in the training data.

* **license**: All recordings have an open source license which is noted in this field. Make sure to respect the license when sharing the data.

* **time**: Time of recording as stated by the recordist. Might be of value to distinguish between birds that vocalize during the day and those which only vocalize during the night. Can be used to diversify the training data.

* **url**: A link to the original recording on Xeno-canto.
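As an example of how the **time** field might be used, the sketch below extracts the hour of recording and flags night-time recordings. It assumes that times are given as “HH:MM” strings and treats anything else as unknown; the night-time window (20:00 to 05:00) is an arbitrary choice for illustration, since actual day length depends on location and season.

```python
import re

# Accepts strings such as "6:30" or "06:30"; everything else counts as unknown
_TIME_PATTERN = re.compile(r"^\s*(\d{1,2}):(\d{2})\s*$")


def recording_hour(time_str):
    """Return the hour (0-23) of an 'HH:MM' string, or None if it cannot be parsed."""
    match = _TIME_PATTERN.match(str(time_str))
    if not match:
        return None
    hour, minute = int(match.group(1)), int(match.group(2))
    if hour > 23 or minute > 59:
        return None
    return hour


def is_night(time_str, night_start=20, night_end=5):
    """Return True/False for parseable times, None for unknown ones."""
    hour = recording_hour(time_str)
    if hour is None:
        return None
    return hour >= night_start or hour < night_end


for value in ["06:30", "23:15", "?"]:
    print(repr(value), recording_hour(value), is_night(value))
```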

# Training data (Soundscapes)

One of the major obstacles in this competition is the significant gap between training and test recordings. There is a distinct shift in acoustic domains between the two and it can be very challenging to train classifiers that generalize well enough to bridge the gap. Yet, training with target samples (i.e., soundscapes) is often not possible - somebody has to annotate the data for each new deployment, for each new recording location. However, we decided to include some examples of soundscape recordings (i.e., test recordings) that can be used for validation, or even for training. These 20 recordings represent 2 of the 4 test recording locations. Yet, they might not be 100% representative: some species might be missing and only audible in the hidden test set, and recording equipment might differ. But they should nonetheless provide a good overview of what to expect in the hidden test data.

Let’s take a look at the label data for this set of recordings.

In [None]:
soundscapes = pd.read_csv('../input/birdclef-2021/train_soundscape_labels.csv',)
soundscapes.head()

We can see a few data fields and here’s a brief description for each of them:

* **row_id**: Unique identifier of a 5-second segment of each soundscape file. Use this to create the submission file entry.

* **site**: Recording site of the soundscape data. In this competition, we included recordings from 4 different sites (COL = Colombia, COR = Costa Rica, SNE = Sierra Nevada, SSW = Sapsucker Woods). **Make sure to take a look at the “test_soundscape_metadata” which contains more information on each location**. Training soundscapes only represent two of the four locations (COR and SSW).

* **audio_id**: Identifier used to reference audio recordings. Filenames contain the file ID, recording site and recording date (yyyymmdd).

* **seconds**: End time of the 5-second segment for which this entry states the label. A value of 85 would mean that this particular segment starts at 00:01:20 and lasts until 00:01:25 of the audio file.

* **birds**: Primary label (i.e., eBird code) of the audible species of this segment. “nocall” references a segment without any bird vocalization. Segments can have more than one bird; in that case, eBird codes are separated by spaces. “nocall” can never appear together with other codes.
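The **seconds** and **birds** fields described above can be decoded with a few small helpers. This is a minimal sketch; the helper names are made up for this example, and the 5-second segment length is taken from the field descriptions.

```python
SEGMENT_SECONDS = 5  # segment length as described above


def segment_bounds(end_second):
    """Return (start, end) in seconds for a segment identified by its end time."""
    return end_second - SEGMENT_SECONDS, end_second


def format_timestamp(seconds):
    """Format a number of seconds as HH:MM:SS."""
    hours, remainder = divmod(seconds, 3600)
    minutes, secs = divmod(remainder, 60)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}"


def parse_birds(birds_field):
    """Split the space-separated 'birds' field; 'nocall' becomes an empty list."""
    codes = birds_field.split()
    return [] if codes == ["nocall"] else codes


start, end = segment_bounds(85)
print(format_timestamp(start), "to", format_timestamp(end))
print(parse_birds("nocall"))
print(parse_birds("rewbla amerob"))
```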

Let’s look at the most common entries for “birds”:

In [None]:
print(soundscapes['birds'].value_counts())

“Nocall” seems to be the most common, which is no surprise: Birds only vocalize occasionally during a recording. Yet, some recordings contain very dense acoustic scenes with multiple birds vocalizing at the same time. Why is “nocall” important? There’s a simple reason: Your classifier should be able to suppress false positives for these segments, which is important for ornithologists when confronted with the detections. One of the core challenges of this competition is to reduce the number of false positives (i.e., to improve precision) without losing too many true positives (i.e., without sacrificing recall).
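To see how false positives on “nocall” segments affect precision, consider this small sketch of micro-averaged precision and recall over segments. It is meant for local validation only and is not necessarily identical to the competition's scoring; treating “nocall” as an empty label set is an assumption made here, and the toy annotations are invented.

```python
def to_label_set(birds_field):
    """Convert a 'birds' string to a set of species codes; 'nocall' maps to the empty set."""
    return {code for code in birds_field.split() if code != "nocall"}


def micro_precision_recall(true_fields, pred_fields):
    """Micro-averaged precision and recall across all segments."""
    tp = fp = fn = 0
    for true_field, pred_field in zip(true_fields, pred_fields):
        true_set, pred_set = to_label_set(true_field), to_label_set(pred_field)
        tp += len(true_set & pred_set)
        fp += len(pred_set - true_set)
        fn += len(true_set - pred_set)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Toy annotations and predictions for four segments
truth = ["nocall", "rewbla", "rewbla amerob", "nocall"]
preds = ["houspa", "rewbla", "rewbla", "nocall"]
precision, recall = micro_precision_recall(truth, preds)
print(f"precision={precision:.3f} recall={recall:.3f}")
```

In this toy case, the single prediction on a silent segment lowers precision, while the missed background species lowers recall.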

It is up to you if you use training soundscapes for validation (since they represent the hidden test set) or if you use annotated segments for training (to cope with the shift in acoustic domains). But be aware: Training with soundscape data for a few species might introduce unwanted biases when overfitting to one recording site.

# Test data

If you’re already familiar with the training soundscapes, the hidden test set should not be a surprise. It contains 20 soundscape recordings of 10-minute duration for each of the four recording sites. Again, you need to predict audible species for 5-second chunks of the audio data. The submission file needs to contain the ID of the processed audio chunk (fileID_site_time) and all audible species as a list of space-delimited eBird codes.

Let’s look at one example:

When analyzing file “*1234_SSW_20170101.ogg*” (that’s a mock filename), the audio chunk ending at second *00:00:35* of the entire file would have the unique ID “*1234_SSW_35*”. If your classifier thinks that species “bluwa1” and “redwa2” (again, mock codes) vocalize during this time, the final submission entry should look like this:

*1234_SSW_35 bluwa1 redwa2*

A submission for this file should include **ALL** segments, starting at 5 seconds. Like this:
 
*1234_SSW_5 nocall*  
*1234_SSW_10 bluwa1*  
*1234_SSW_15 nocall*  
*1234_SSW_20 bluwa1 redwa2*  
*1234_SSW_25 nocall*  
*1234_SSW_30 nocall* 

And so on...
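Generating entries for all segments of one file could be sketched like this. The function name and the `predictions` dictionary (mapping a segment's end second to a list of codes) are assumptions for illustration; the 10-minute duration and 5-second segments follow the description above, and the filename and species codes are the mock values used in the example.

```python
import os


def submission_rows(filename, predictions, duration_seconds=600, segment_seconds=5):
    """Build (row_id, birds) pairs for every segment of one soundscape file.

    predictions: dict mapping a segment's end second to a list of eBird codes.
    Segments without any predicted species are labeled 'nocall'.
    """
    stem = os.path.splitext(os.path.basename(filename))[0]
    file_id, site = stem.split("_")[:2]
    rows = []
    for end in range(segment_seconds, duration_seconds + 1, segment_seconds):
        codes = predictions.get(end, [])
        birds = " ".join(codes) if codes else "nocall"
        rows.append((f"{file_id}_{site}_{end}", birds))
    return rows


# Mock predictions matching the example above
mock_predictions = {10: ["bluwa1"], 20: ["bluwa1", "redwa2"]}
rows = submission_rows("1234_SSW_20170101.ogg", mock_predictions)

for row_id, birds in rows[:6]:
    print(row_id, birds)
```

A 10-minute file yields 120 rows, so the list covers every segment even when no species is predicted.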


Make yourself familiar with the training and test data, and make sure to check out our other notebooks. Let us know if you have any comments and, of course, don’t hesitate to start a forum thread if you have any questions.