# BirdCLEF 2022
## Identify bird calls in soundscapes

![](https://nas-national-prod.s3.amazonaws.com/styles/article_hero_inline/s3/web_aud_apa-2018_iiwi_a1-7039-1_ts_photo-justin-peter.jpg?itok=0CUwtcxL)
#### > The challenge in this competition is to identify which birds are calling in long recordings, given quite limited training data. This is the exact challenge faced by scientists trying to monitor rare birds in Hawaii. For example, there are only a few thousand individual Nene geese left in the world, which makes it difficult to acquire recordings of their calls.

# Files

#### train_metadata.csv - A wide range of metadata is provided for the training data. The most directly relevant fields are:

* primary_label - a code for the bird species. You can review detailed information about the bird codes by appending the code to https://ebird.org/species/, such as https://ebird.org/species/amecro for the American Crow.
* secondary_labels - background species as annotated by the recordist. An empty list does not mean that no background birds are audible.
* author - the eBird user who provided the recording.
* filename - the associated audio file.
* rating - a float value between 0.0 and 5.0 indicating the quality rating on Xeno-canto and the number of background species, where 5.0 is the highest and 1.0 is the lowest. A value of 0.0 means that the recording has no user rating yet.

#### train_audio/ - The bulk of the training data consists of short recordings of individual bird calls generously uploaded by users of xeno-canto.org. These files have been downsampled to 32 kHz where applicable to match the test set audio and converted to the ogg format.
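
The metadata fields listed above can be put to work with a small filter. The sketch below is a minimal example under stated assumptions: the rows are hypothetical, `secondary_labels` is assumed to be stored as the string form of a Python list (as usually happens when a list column is written to CSV), and the helper name `usable_rows` and the `min_rating` threshold are illustrative rather than part of the competition data.

```python
import ast

# Hypothetical rows mimicking train_metadata.csv (all values are made up).
rows = [
    {"primary_label": "amecro", "secondary_labels": "[]", "rating": 4.5, "filename": "amecro/a.ogg"},
    {"primary_label": "amecro", "secondary_labels": "['houspa']", "rating": 0.0, "filename": "amecro/b.ogg"},
    {"primary_label": "houspa", "secondary_labels": "[]", "rating": 1.5, "filename": "houspa/c.ogg"},
]


def usable_rows(rows, min_rating=3.0):
    """Keep rows rated at least min_rating, plus unrated rows (rating == 0.0)."""
    kept = []
    for row in rows:
        # Assumption: the list column is serialized as a string such as "['houspa']".
        secondary = ast.literal_eval(row["secondary_labels"])
        if row["rating"] == 0.0 or row["rating"] >= min_rating:
            kept.append({**row, "secondary_labels": secondary})
    return kept


kept = usable_rows(rows)
print([r["filename"] for r in kept])
```

Treating 0.0 as "unrated" rather than "worst" keeps recordings that simply have not been reviewed yet; whether that is desirable depends on how much data a species has.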

#### test_soundscapes/ - When you submit a notebook, the test_soundscapes directory will be populated with approximately 5,500 recordings to be used for scoring. These are each within a few milliseconds of 1 minute long and in the ogg audio format. Only one soundscape is available for download.


#### test.csv - Metadata for the test set. Only the first three rows are available for download; the full test.csv is provided in the hidden test set.

* row_id - A unique identifier for the row.
* file_id - A unique identifier for the audio file.
* bird - The eBird code for the row. There is one row for each of the scored species per 5-second window per audio file.
* end_time - The last second of the 5-second time window (5, 10, 15, etc.).
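
The 5-second windows described above can be enumerated directly. The sketch below assumes a 32 kHz sample rate (the rate the training audio was downsampled to) and a soundscape of exactly 60 seconds; the function name `window_bounds` is illustrative.

```python
def window_bounds(n_samples, sr=32_000, window_s=5):
    """Return (end_time, start_sample, stop_sample) for each full window."""
    step = sr * window_s
    bounds = []
    for i in range(n_samples // step):
        start = i * step
        bounds.append(((i + 1) * window_s, start, start + step))
    return bounds


bounds = window_bounds(60 * 32_000)
print(len(bounds))            # number of 5-second windows in one minute
print(bounds[0], bounds[-1])  # first and last window
```

Because the real soundscapes are only within a few milliseconds of one minute, a file that is slightly short would lose its last window under this scheme, so padding the audio before slicing may be needed.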


#### sample_submission.csv - A valid sample submission. Only the first three rows are available for download; the full file is provided in the hidden test set.

* row_id - A unique identifier for the row.
* target - True/False for whether or not the bird in question called during the 5-second window.

#### scored_birds.json - The subset of the species in the dataset that are scored.
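
Putting the file descriptions above together, one prediction row is needed per scored species, per window, per file. The sketch below builds such rows from per-window probabilities. The `row_id` layout (file id, bird code, and `end_time` joined by underscores), the 0.5 threshold, the file id, and the probabilities are all assumptions for illustration; check `test.csv` for the actual `row_id` values.

```python
def build_rows(file_id, probs, scored_birds, threshold=0.5):
    """Build submission rows; probs maps (bird, end_time) -> probability.

    Pairs missing from probs are treated as probability 0.0.
    """
    end_times = sorted({end_time for _, end_time in probs})
    rows = []
    for end_time in end_times:
        for bird in scored_birds:
            p = probs.get((bird, end_time), 0.0)
            rows.append({
                "row_id": f"{file_id}_{bird}_{end_time}",  # assumed layout
                "target": p >= threshold,
            })
    return rows


# Hypothetical inputs: two scored species, two windows of one file.
scored_birds = ["akiapo", "hawama"]
probs = {("akiapo", 5): 0.9, ("hawama", 5): 0.2, ("akiapo", 10): 0.1}
rows = build_rows("soundscape_0001", probs, scored_birds)
for row in rows:
    print(row)
```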


#### eBird_Taxonomy_v2021.csv - Data on the relationships between different species.

In [None]:
from itertools import cycle

import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
import plotly.express as px
import seaborn as sns

# For exploring audio files
import librosa
import librosa.display
import IPython.display as ipd

sns.set_theme(style="white", palette=None)

# Default matplotlib colors: a list for indexing and a cycle for iteration
color_pal = plt.rcParams["axes.prop_cycle"].by_key()["color"]
color_cycle = cycle(color_pal)

In [None]:
# Read in the CSV files. BASE_DIR ends with a slash, so file names are appended directly.
BASE_DIR = '../input/birdclef-2022/'
train = pd.read_csv(f'{BASE_DIR}train_metadata.csv')
test = pd.read_csv(f'{BASE_DIR}test.csv')
ebird = pd.read_csv(f'{BASE_DIR}eBird_Taxonomy_v2021.csv')
ss = pd.read_csv(f'{BASE_DIR}sample_submission.csv')

# Data Exploration

In [None]:
train.head()

In [None]:
train.info()

In [None]:
test

In [None]:
test.info()

In [None]:
ebird.head()

In [None]:
ebird.info()

In [None]:
ss.head()

In [None]:
print(f'There are {train.primary_label.nunique()} bird species:\n{train.primary_label.unique()}')

# EDA

In [None]:
fig, axs = plt.subplots(1, 2, figsize=(16, 6))
label_counts = train["common_name"].value_counts()

# The 20 species with the most training recordings
label_counts.head(20).plot(kind="bar", ax=axs[0], width=1, color=color_pal[0])
axs[0].set_title("Top 20 Birds by Number of Recordings", fontsize=20)

# The 20 species with the fewest training recordings
label_counts.tail(20).plot(kind="bar", ax=axs[1], width=1, color=color_pal[1])
axs[1].set_title("Bottom 20 Birds by Number of Recordings", fontsize=20)
plt.show()


* The number of examples varies widely across species.
* Some species have 500 labeled recordings, while others have fewer than 10.
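
The imbalance noted above can be quantified with a few counts. The sketch below uses made-up labels; the `rare_cutoff` of 10 mirrors the observation above, and the helper name `imbalance_summary` is illustrative.

```python
from collections import Counter


def imbalance_summary(labels, rare_cutoff=10):
    """Summarize how unevenly examples are spread across labels."""
    counts = Counter(labels)
    most, least = max(counts.values()), min(counts.values())
    rare = sorted(name for name, n in counts.items() if n < rare_cutoff)
    return {
        "n_species": len(counts),
        "max": most,
        "min": least,
        "ratio": most / least,  # how many times larger the biggest class is
        "rare": rare,           # labels with fewer than rare_cutoff examples
    }


# Hypothetical label column: three species with very different counts.
labels = ["a"] * 500 + ["b"] * 40 + ["c"] * 4
summary = imbalance_summary(labels)
print(summary)
```

On the real data, the same function could be called with `train["primary_label"]` in place of the toy list.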

In [None]:
# Share of recordings among the 50 most frequent species
fig, ax = plt.subplots(figsize=(20, 15))
train["common_name"].value_counts().head(50).plot(
    kind="pie",
    ax=ax,
    title="Species distribution",
    rotatelabels=True,
    cmap="hot",
)
plt.show()

In [None]:
fig = px.scatter_geo(
    train,
    lat="latitude",
    lon="longitude",
    color="common_name",
    width=1_000,
    height=500,
    title="BirdCLEF 2022 Training Data",
)
fig.show()

In [None]:
fig, ax = plt.subplots(1, 1, figsize=(10, 5))
# Number of recordings contributed by each of the 50 most prolific authors
train["author"].value_counts().head(50).plot(
    kind="bar", ax=ax, width=1, color=color_pal[2]
)

ax.set_title("Top 50 Authors", fontsize=20)

In [None]:
# Listen to the audio for one training example (index 1, i.e. the second row)
fn = train["filename"].values[1]
ipd.Audio(f"{BASE_DIR}train_audio/{fn}")

#### Load in the audio file as a numpy array

In [None]:
y, sr = librosa.load(f"{BASE_DIR}train_audio/{fn}")
print(f"Loaded audio as a NumPy array of shape {y.shape} with sample rate {sr}")
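
One detail worth knowing when reading the shape printed above: `librosa.load` resamples to 22,050 Hz by default, so the array is shorter than the 32 kHz file on disk; passing `sr=None` keeps the native rate. The arithmetic can be sketched as follows, using a hypothetical 10-second clip and an illustrative helper name.

```python
def expected_samples(duration_s, sr):
    """Approximate number of samples for a clip of duration_s seconds."""
    return round(duration_s * sr)


native = expected_samples(10, 32_000)     # as stored on disk
resampled = expected_samples(10, 22_050)  # at the librosa.load default rate
print(native, resampled)                  # 320000 220500
print(resampled / 22_050)                 # duration recovered in seconds: 10.0
```

Dividing `y.shape[0]` by the returned `sr` therefore gives the clip duration regardless of which rate was used for loading.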

#### Plotting 10 Random Audio Files from the training dataset


In [None]:
def plot_raw_audio(filename, birdtype, color):
    """Plot the raw waveform of one training recording."""
    y, sr = librosa.load(f"{BASE_DIR}train_audio/{filename}")
    pd.DataFrame(y).plot(
        figsize=(10, 3), title=f"{birdtype} Raw Audio", lw=0.1, color=color, legend=False
    )
    plt.show()


# Plot 10 randomly sampled recordings, each in a different color
for _, row in train.sample(10, random_state=529).iterrows():
    plot_raw_audio(row["filename"], row["common_name"], next(color_cycle))

#### Creating Spectrograms of Birds

In [None]:
def plot_audio_melspec(filename, birdtype):
    y, sr = librosa.load(f"{BASE_DIR}train_audio/{filename}")
    S = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=128, fmax=8000)

    fig, ax = plt.subplots(figsize=(10, 3))
    S_dB = librosa.power_to_db(S, ref=np.max)
    img = librosa.display.specshow(
        S_dB, x_axis="time", y_axis="mel", sr=sr, fmax=8000, ax=ax
    )
    fig.colorbar(img, ax=ax, format="%+2.0f dB")
    ax.set(title=f"Mel-frequency spectrogram for {birdtype}")
    plt.show()


for i, d in train.sample(10, random_state=529).iterrows():
    plot_audio_melspec(d["filename"], d["common_name"])
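
The shape of the spectrogram computed above follows from the framing settings. The sketch below assumes librosa's default `hop_length=512` with centered frames (neither is set explicitly in the code above), under which the number of frames is `1 + n_samples // hop_length`; the 5-second clip at 22,050 Hz is hypothetical and the helper name is illustrative.

```python
def melspec_shape(n_samples, n_mels=128, hop_length=512):
    """Shape (n_mels, n_frames) of a mel spectrogram with centered frames."""
    return (n_mels, 1 + n_samples // hop_length)


shape = melspec_shape(5 * 22_050)
print(shape)  # (128, 216)
```

Each column of the plotted image therefore covers `512 / 22050` seconds, roughly 23 ms, of audio.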

# Submission

In [None]:
# Baseline submission: predict True for every row of the sample submission
submission = pd.read_csv(f'{BASE_DIR}sample_submission.csv')
submission['target'] = True
submission.to_csv('submission.csv', index=False)
submission.head()

> Work in progress: more EDA, further data exploration, and a better submission.