# Intro
Welcome to the [BirdCLEF 2022](https://www.kaggle.com/c/birdclef-2022/overview) competition.

![](https://storage.googleapis.com/kaggle-competitions/kaggle/33246/logos/header.png)

We recommend this [notebook](https://www.kaggle.com/drcapa/birdclef-2021-starter) as an introduction to last year's data set and this [notebook](https://www.kaggle.com/drcapa/recognizesongapp-fromscratch-tutorial) as a tutorial on handling audio data.

**Table of contents:**
1. [Overview](#Overview)
2. [A Sample File](#SampleFile)
3. [Plot Examples](#PlotExamples)
4. [Exploratory Data Analysis](#EDA)
5. [Focus On Labels](#FocusOnLabels)
6. [Audio Data Generator](#AudioDataGenerator)

<font size="4"><span style="color: royalblue;">Please vote the notebook up if it helps you. Feel free to leave a comment above the notebook. Thank you. </span></font>

# Libraries

In [None]:
import os
import ast
import json
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import soundfile as sf
import librosa
import librosa.display
import IPython.display as display

# Path


In [None]:
path = '/kaggle/input/birdclef-2022/'
os.listdir(path)

# Load Data

In [None]:
train_meta = pd.read_csv(path+'train_metadata.csv')
test_data = pd.read_csv(path+'test.csv')
ebird_data = pd.read_csv(path+'eBird_Taxonomy_v2021.csv')
samp_subm = pd.read_csv(path+'sample_submission.csv')

with open(path+'scored_birds.json') as f:
    scored_birds = json.load(f)

# Functions
We define some helper functions.

In [None]:
def read_ogg_file(path, file):
    """ Read an ogg audio file and return a numpy array and the samplerate"""
    
    data, samplerate = sf.read(path+file)
    return data, samplerate


def plot_audio_file(data, samplerate):
    """ Plot the audio data (amplitude per sample index)"""
    
    fig = plt.figure(figsize=(8, 4))
    x = range(len(data))
    y = data
    # Draw the waveform once and label it so the legend has an entry
    plt.plot(x, y, color='red', label='amplitude')
    plt.xlabel('sample')
    plt.legend(loc='upper center')
    plt.grid()
    
    
def plot_spectrogram(data, samplerate):
    """ Plot spectrogram with mel scaling """
    
    sr = samplerate
    spectrogram = librosa.feature.melspectrogram(y=data, sr=sr)
    log_spectrogram = librosa.power_to_db(spectrogram, ref=np.max)
    librosa.display.specshow(log_spectrogram, sr=sr, x_axis='time', y_axis='mel')

# Overview <a name="Overview"></a>
**train_metadata.csv** - A wide range of metadata is provided for the training data. The most directly relevant fields are:

* primary_label - a code for the bird species. You can review detailed information about the bird codes by appending the code to https://ebird.org/species/, such as https://ebird.org/species/amecro for the American Crow.
* secondary_labels - background species as annotated by the recordist. An empty list does not mean that no background birds are audible.
* author - the eBird user who provided the recording.
* filename - the associated audio file.
* rating - a float value between 0.0 and 5.0 that reflects the quality rating on Xeno-canto and the number of background species, where 5.0 is the highest and 1.0 is the lowest. 0.0 means that the recording has no user rating yet.

In [None]:
train_meta.head()

**train_audio/** - The bulk of the training data consists of short recordings of individual bird calls generously uploaded by users of xeno-canto.org. These files have been downsampled to 32 kHz where applicable to match the test set audio and converted to the ogg format.

In [None]:
print('Number of subfolders/species:', len(os.listdir(path+'train_audio')))

**test_soundscapes/** - When you submit a notebook, the test_soundscapes directory will be populated with approximately 5,500 recordings to be used for scoring. These are each within a few milliseconds of 1 minute long and in the ogg audio format. Only one soundscape is available for download.
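
Since predictions are made per 5-second window, each soundscape has to be cut into twelve segments. The following is a minimal sketch of how the sample ranges could be computed. It assumes a sample rate of 32 kHz (the rate stated for train_audio); the function name `window_bounds` is hypothetical. Because the files are only within a few milliseconds of one minute, the number of windows is rounded and the last window is clipped to the file length.

```python
def window_bounds(n_samples, samplerate=32000, window_s=5):
    """Return (end_time, start, stop) sample ranges for consecutive windows."""
    step = samplerate * window_s
    # Round instead of flooring so a file a few ms short still yields 12 windows
    n_windows = max(1, round(n_samples / step))
    return [
        ((i + 1) * window_s, i * step, min((i + 1) * step, n_samples))
        for i in range(n_windows)
    ]


bounds = window_bounds(60 * 32000)
print(len(bounds), bounds[0], bounds[-1])
```

A segment can then be taken with `data[start:stop]`; a clipped last segment may need zero padding before it is passed to a model with a fixed input length.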

In [None]:
os.listdir(path+'test_soundscapes')

**test.csv** - Metadata for the test set. Only the first three rows are available for download; the full test.csv is provided in the hidden test set.

* row_id - A unique identifier for the row.
* file_id - A unique identifier for the audio file.
* bird - The eBird code for the row. There is one row for each of the scored species per 5-second window per audio file.
* end_time - The last second of the 5-second time window (5, 10, 15, etc).
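
The row layout described above can be sketched as follows. This is an illustration only: it assumes that `row_id` is `file_id`, `bird` and `end_time` joined by underscores, which should be checked against `test_data.head()`. The file id and the helper name `build_rows` are hypothetical, the two bird codes are just examples, and the row order is an assumption.

```python
def build_rows(file_id, birds, duration_s=60, window_s=5):
    """Create one row per scored bird per 5-second window of one audio file."""
    rows = []
    for end_time in range(window_s, duration_s + 1, window_s):
        for bird in birds:
            rows.append({
                # Assumed row_id format: <file_id>_<bird>_<end_time>
                "row_id": f"{file_id}_{bird}_{end_time}",
                "file_id": file_id,
                "bird": bird,
                "end_time": end_time,
            })
    return rows


rows = build_rows("soundscape_0001", ["akiapo", "aniani"])  # hypothetical file id
print(len(rows), rows[0]["row_id"])
```

A one-minute file with two scored birds gives 12 windows times 2 birds, so 24 rows.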

In [None]:
test_data.head()

**sample_submission.csv** - A valid sample submission. Only the first three rows are available for download; the full submission.csv is provided in the hidden test set.

* row_id - A unique identifier for the row.
* target - True/False for whether or not the bird in question called during the 5 second window.
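
A model usually produces a score per row rather than True/False, so the scores have to be thresholded before writing the file. The sketch below shows one way to do that with the standard library. The scores, the row ids, the threshold of 0.5 and the helper name `scores_to_submission` are all made up for illustration.

```python
import csv
import io


def scores_to_submission(scores, threshold=0.5):
    """Map {row_id: score} to submission rows with a boolean target."""
    return [
        {"row_id": row_id, "target": score >= threshold}
        for row_id, score in scores.items()
    ]


# Made-up scores for two rows of one 5-second window
scores = {"soundscape_0001_akiapo_5": 0.81, "soundscape_0001_aniani_5": 0.12}
submission_rows = scores_to_submission(scores)

# Write to an in-memory buffer; use open('submission.csv', 'w', newline='') for a file
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["row_id", "target"])
writer.writeheader()
writer.writerows(submission_rows)
print(buffer.getvalue())
```

The threshold trades missed calls against false alarms and is worth tuning on validation data.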

In [None]:
samp_subm

**scored_birds.json** - The subset of the species in the dataset that are scored.

In [None]:
scored_birds[0:5]

**eBird_Taxonomy_v2021.csv** - Data on the relationships between different species.

In [None]:
ebird_data.head()

# A Sample File <a name="SampleFile"></a>
We focus on the sample in the first row of the training metadata.

In [None]:
row = 0
train_meta.iloc[row]

We extract two features: the primary label, which is the name of the folder where the audio file is stored, and the filename:

In [None]:
label = train_meta.loc[row, 'primary_label']
filename = train_meta.loc[row, 'filename']

print(filename)
# Check if the file is in the folder
filename.split('/')[1] in os.listdir(path+'train_audio/'+label)

In [None]:
data, samplerate = sf.read(path+'train_audio/'+filename)
print('snippet of data:', data[:4])
print('samplerate:', samplerate)

In [None]:
plot_audio_file(data, samplerate)

Plot [spectrogram](https://en.wikipedia.org/wiki/Spectrogram) with mel scaling:

In [None]:
plot_spectrogram(data, samplerate)
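
To give a feeling for what mel scaling does to the frequency axis, here is a small sketch of one common Hz-to-mel conversion, the HTK formula. Note that this is an illustration and not necessarily what the plot above uses: librosa applies a different (Slaney-style) mel formula by default and only switches to the HTK variant when `htk=True` is passed. The function names are hypothetical.

```python
import math


def hz_to_mel_htk(freq_hz):
    """Convert a frequency in Hz to mels with the HTK formula."""
    return 2595.0 * math.log10(1.0 + freq_hz / 700.0)


def mel_to_hz_htk(mel):
    """Inverse of hz_to_mel_htk."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


# 16000 Hz is the highest representable frequency for 32 kHz audio
for freq in (500, 1000, 4000, 16000):
    print(freq, 'Hz ->', round(hz_to_mel_htk(freq), 1), 'mel')
```

The scale is roughly linear at low frequencies and logarithmic at high ones, so equal steps on the mel axis cover ever wider ranges in Hz.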

Listen to the bird:

In [None]:
display.Audio(path+'train_audio/'+filename)

# Exploratory Data Analysis <a name="EDA"></a>
There are 152 primary labels and 152 common names, and the two columns correspond to each other. The labels are not evenly distributed.
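
One common way to account for such an imbalance during training is to weight each class inversely to its frequency. The sketch below uses the "balanced" heuristic `n_samples / (n_classes * count)` on a toy label list. The helper name `inverse_frequency_weights` is hypothetical, and whether such weights help in this competition is not something this notebook evaluates.

```python
from collections import Counter


def inverse_frequency_weights(labels):
    """Return a weight per class so that rare classes count more."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {
        label: n_samples / (n_classes * count)
        for label, count in counts.items()
    }


toy_labels = ["skylar", "skylar", "skylar", "akiapo"]  # toy data, not real counts
weights = inverse_frequency_weights(toy_labels)
print(weights)
```

With the real data, `train_meta['primary_label'].tolist()` could be passed instead of the toy list.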

In [None]:
train_meta['primary_label'].value_counts()[0:5]

In [None]:
train_meta['common_name'].value_counts()[0:5]

# Focus On Labels <a name="FocusOnLabels"></a>
The secondary_labels column stores each list as its string representation.
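
As a small standalone illustration of the conversion, `ast.literal_eval` parses such a string back into a Python list. Unlike `eval`, it only accepts literals and raises an error for anything else. The cell value below is made up, and the helper name `parse_labels` is hypothetical.

```python
import ast


def parse_labels(raw):
    """Parse the string form of a label list into a real list."""
    parsed = ast.literal_eval(raw)
    if not isinstance(parsed, list):
        raise ValueError(f"expected a list, got {type(parsed).__name__}")
    return parsed


raw = "['akiapo', 'aniani']"  # made-up cell value
parsed = parse_labels(raw)
print(type(raw).__name__, '->', type(parsed).__name__, parsed)
print(parse_labels("[]"))  # an empty list stays an empty list
```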

In [None]:
row = 1
print('original type:',  type(train_meta.loc[row, 'secondary_labels']))
print('converted type:', type(ast.literal_eval(train_meta.loc[row, 'secondary_labels'])))

We convert the secondary labels of all training rows and collect the unique values:

In [None]:
labels = []
for row in train_meta.index:
    labels.extend(ast.literal_eval(train_meta.loc[row, 'secondary_labels']))
labels = list(set(labels))

print('Number of unique secondary labels:', len(labels))

# Audio Data Generator <a name="AudioDataGenerator"></a>
We use a Data Generator to load the data on demand.

**Coming Soon**
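
As a placeholder illustration, and not the generator planned for this section, the idea of loading on demand can be sketched with a plain Python generator: only the clips of the current batch are read. All names here (`audio_batches`, `records`, `fake_load`) are hypothetical, and the loader is a stand-in that returns dummy samples instead of reading a file.

```python
import random


def audio_batches(records, load_fn, batch_size=2, shuffle=True, seed=0):
    """Yield (clips, labels) batches, loading clips only when a batch is requested."""
    order = list(range(len(records)))
    if shuffle:
        random.Random(seed).shuffle(order)
    for start in range(0, len(order), batch_size):
        batch = [records[i] for i in order[start:start + batch_size]]
        clips = [load_fn(filename) for filename, _ in batch]
        labels = [label for _, label in batch]
        yield clips, labels


# Toy (filename, label) pairs and a stand-in loader returning four dummy samples
records = [("a/1.ogg", "a"), ("a/2.ogg", "a"), ("b/3.ogg", "b")]
fake_load = lambda filename: [0.0] * 4

for clips, labels in audio_batches(records, fake_load, shuffle=False):
    print(len(clips), labels)
```

In this notebook the loader could be something like `lambda f: read_ogg_file(path + 'train_audio/', f)[0]`, with the records taken from the `filename` and `primary_label` columns of `train_meta`.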

# Export

In [None]:
# Baseline: mark every bird as present in every window
samp_subm['target'] = True
samp_subm.head()

In [None]:
samp_subm.to_csv('submission.csv', index=False)