# BirdCLEF 2022 Data Exploration

![img](https://academy.allaboutbirds.org/wp-content/uploads/ARTICLE-SONG-1440X8004.png)

This notebook was created on a live coding stream. [Follow here for future streams or to watch the video.](https://www.twitch.tv/medallionstallion_)

In [None]:
!pip install nb_black > /dev/null

In [None]:
%load_ext lab_black

In [None]:
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px

# For exploring audio files
import librosa
import librosa.display
import IPython.display as ipd

sns.set_theme(style="white", palette=None)
color_pal = plt.rcParams["axes.prop_cycle"].by_key()["color"]

from itertools import cycle

# Endless iterator over the same palette, used to vary colors between plots
color_cycle = cycle(color_pal)

## Data Files

We are provided with a number of files for this competition. 

CSV and JSON Files:
- `train_metadata.csv` - A wide range of metadata is provided for the training data.
- `test.csv` - Metadata for the test set.
- `sample_submission.csv` - A valid sample submission.
- `scored_birds.json` - The subset of the species in the dataset that are scored.
- `eBird_Taxonomy_v2021.csv` - Data on the relationships between different species.

Folders with Audio Files:

- `train_audio/` - The bulk of the training data consists of short recordings of individual bird calls generously uploaded by users of xeno-canto.org.
- `test_soundscapes/` - Only 1 example file is provided here. The hidden test set contains 5,500 recordings used for scoring, each roughly 1 minute long.


In [None]:
!ls -GFlash --color ../input/birdclef-2022/

In [None]:
# Read in the CSV files.
# BASE_DIR already ends with a slash, so file names are appended directly.
BASE_DIR = "../input/birdclef-2022/"
train = pd.read_csv(f"{BASE_DIR}train_metadata.csv")
test = pd.read_csv(f"{BASE_DIR}test.csv")
ebird = pd.read_csv(f"{BASE_DIR}eBird_Taxonomy_v2021.csv")
ss = pd.read_csv(f"{BASE_DIR}sample_submission.csv")
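
The file list also mentions `scored_birds.json`, which is not a CSV and is not read by the cell above. Below is a minimal sketch of how it could be loaded with the standard library. It assumes the file holds a flat JSON list of species codes; the helper name `load_scored_birds` is hypothetical, and the `primary_label` column in the commented usage is likewise an assumption about `train_metadata.csv`.

```python
import json


def load_scored_birds(path):
    """Return the contents of a JSON file assumed to hold a list of species codes."""
    with open(path) as f:
        return json.load(f)


# Possible usage in this notebook (not executed here):
# scored_birds = load_scored_birds(f"{BASE_DIR}scored_birds.json")
# train["is_scored"] = train["primary_label"].isin(scored_birds)
```

Flagging the scored species this way would make it easy to check how many training recordings they have compared with the rest of the dataset.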

# Explore Metadata

- The number of examples varies widely between bird species.
- Some birds have 500 labels while others have fewer than 10.

In [None]:
fig, axs = plt.subplots(1, 2, figsize=(12, 5))
label_counts = train["common_name"].value_counts()

# Most frequent labels in the training dataset
label_counts.head(20).plot(kind="bar", ax=axs[0], width=1, color=color_pal[0])
axs[0].set_title("Top 20 Birds with Labels", fontsize=20)

# Least frequent labels in the training dataset
label_counts.tail(20).plot(kind="bar", ax=axs[1], width=1, color=color_pal[1])
axs[1].set_title("Bottom 20 Birds with Labels", fontsize=20)

plt.tight_layout()
plt.show()
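
The class imbalance noted above can also be summarized numerically rather than read off the bar charts. The sketch below runs on a small toy frame standing in for `train`; the helper `summarize_label_counts` and its `rare_threshold` parameter are hypothetical names introduced for illustration.

```python
import pandas as pd


def summarize_label_counts(df, column="common_name", rare_threshold=10):
    """Summarize how unevenly the rows are spread across the labels in `column`."""
    counts = df[column].value_counts()
    return {
        "n_classes": int(counts.size),
        "max_count": int(counts.max()),
        "min_count": int(counts.min()),
        # Number of labels with fewer than `rare_threshold` examples
        "n_rare": int((counts < rare_threshold).sum()),
    }


# Toy data: label A has 5 rows, B has 2, C has 1
toy = pd.DataFrame({"common_name": ["A"] * 5 + ["B"] * 2 + ["C"]})
summary = summarize_label_counts(toy, rare_threshold=3)
```

Calling `summarize_label_counts(train)` in the notebook would report how many species fall under the 10-example mark.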

We are given lat/long locations. Let's try to plot them!

In [None]:
fig = px.scatter_geo(
    train,
    lat="latitude",
    lon="longitude",
    color="common_name",
    width=1_000,
    height=500,
    title="BirdCLEF 2022 Training Data",
)
fig.show()

# Training Data by Author

There are 1356 different authors in the training dataset. The number of observations per author varies from 1 to 947!
- 540 of the 1356 authors have labeled only one audio file.

In [None]:
fig, ax = plt.subplots(1, 1, figsize=(10, 5))
# Count recordings per author and plot the 50 most prolific authors
train["author"].value_counts().head(50).plot(
    kind="bar", ax=ax, width=1, color=color_pal[2]
)

ax.set_title("Top 50 Authors", fontsize=20)
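
The author figures quoted above (number of distinct authors, range of recordings per author, and how many authors contributed a single file) can be derived from one `value_counts()` call. This is a sketch on toy data; `summarize_authors` is a hypothetical helper, not part of the original notebook.

```python
import pandas as pd


def summarize_authors(df, column="author"):
    """Return basic statistics on how many rows each author contributed."""
    counts = df[column].value_counts()
    return {
        "n_authors": int(counts.size),
        "min_files": int(counts.min()),
        "max_files": int(counts.max()),
        # Authors that appear exactly once
        "n_single_file": int((counts == 1).sum()),
    }


# Toy data: author x has 3 recordings, y and z have 1 each
toy_authors = pd.DataFrame({"author": ["x", "x", "x", "y", "z"]})
author_summary = summarize_authors(toy_authors)
```

Running `summarize_authors(train)` should reproduce the numbers stated in the text if they were computed the same way.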

# Load Example Training Audio File

In [None]:
# Listen to the audio for the first training example
fn = train["filename"].values[0]
ipd.Audio(f"{BASE_DIR}train_audio/{fn}")

In [None]:
# Barn Owl example (warning: it sounds creepy)
fn = train.loc[train["common_name"] == "Barn Owl", "filename"].values[0]
ipd.Audio(f"{BASE_DIR}train_audio/{fn}")

## Load the audio file as a NumPy array

In [None]:
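# `fn` is still the Barn Owl file from the previous cell.
# By default librosa.load resamples to 22,050 Hz and mixes down to mono.
y, sr = librosa.load(f"{BASE_DIR}train_audio/{fn}")
print(f"Loaded audio as a NumPy array of shape {y.shape} at sample rate {sr} Hz")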
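
The shape and sample rate printed above are enough to work out how long a clip is: the number of samples divided by the sample rate gives the duration in seconds. The sketch below uses a synthetic array in place of a real recording, and `clip_duration_seconds` is a hypothetical helper name. librosa also offers `librosa.get_duration(y=y, sr=sr)` for the same purpose.

```python
import numpy as np


def clip_duration_seconds(y, sr):
    """Duration of an audio array in seconds (samples along the last axis)."""
    return y.shape[-1] / sr


# Synthetic stand-in for a loaded clip: 3 seconds of silence at 22,050 Hz
sr = 22050
y = np.zeros(3 * sr, dtype=np.float32)
duration = clip_duration_seconds(y, sr)
```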

## Plot 10 Random Audio Files from the training dataset

In [None]:
def plot_raw_audio(filename, birdtype, color):
    """Plot the raw waveform (amplitude per sample) of one training recording."""
    y, sr = librosa.load(f"{BASE_DIR}train_audio/{filename}")
    fig, ax = plt.subplots(figsize=(10, 3))
    ax.plot(y, lw=0.1, color=color)
    ax.set_title(f"{birdtype} Raw Audio")
    plt.show()


# Fixed random_state so the same 10 files are drawn on every run
for _, d in train.sample(10, random_state=529).iterrows():
    plot_raw_audio(d["filename"], d["common_name"], next(color_cycle))

## Create Spectrograms of Birds

In [None]:
def plot_audio_melspec(filename, birdtype):
    """Plot the mel spectrogram (in dB) of one training recording."""
    y, sr = librosa.load(f"{BASE_DIR}train_audio/{filename}")
    # Power mel spectrogram with 128 mel bands, capped at 8 kHz
    S = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=128, fmax=8000)

    fig, ax = plt.subplots(figsize=(10, 3))
    # Convert power to decibels relative to the loudest bin
    S_dB = librosa.power_to_db(S, ref=np.max)
    img = librosa.display.specshow(
        S_dB, x_axis="time", y_axis="mel", sr=sr, fmax=8000, ax=ax
    )
    fig.colorbar(img, ax=ax, format="%+2.0f dB")
    ax.set(title=f"Mel spectrogram for {birdtype}")
    plt.show()


# Same random_state as the waveform plots, so the same 10 files are shown
for _, d in train.sample(10, random_state=529).iterrows():
    plot_audio_melspec(d["filename"], d["common_name"])
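
The `librosa.power_to_db(S, ref=np.max)` call in `plot_audio_melspec` expresses the power spectrogram in decibels relative to its loudest bin. Below is a simplified NumPy sketch of that conversion, assuming librosa's documented defaults of `amin=1e-10` and `top_db=80`. The function name `power_to_db_sketch` is hypothetical, and this is meant to illustrate the arithmetic rather than replace the library call.

```python
import numpy as np


def power_to_db_sketch(S, amin=1e-10, top_db=80.0):
    """Convert a power spectrogram to dB relative to its maximum value."""
    S = np.asarray(S, dtype=float)
    ref = S.max()  # plays the role of ref=np.max
    # Floor tiny values at amin so the logarithm stays finite
    log_spec = 10.0 * np.log10(np.maximum(amin, S))
    log_spec -= 10.0 * np.log10(np.maximum(amin, ref))
    # Limit the dynamic range to top_db below the peak
    return np.maximum(log_spec, log_spec.max() - top_db)


# Toy 2x2 "spectrogram" with power values spanning many orders of magnitude
toy_S = np.array([[1.0, 0.1], [0.01, 1e-12]])
toy_dB = power_to_db_sketch(toy_S)
```

With the reference set to the maximum, the loudest bin maps to 0 dB and every other bin is negative, so the colorbar in the plots reads as decibels below the peak.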