# BirdCLEF 2021 - BirdCall Identification

## Problem statement

- In this competition, you’ll automate the acoustic identification of birds in soundscape recordings. You'll examine an acoustic dataset to build detectors and classifiers to extract the signals of interest (bird calls).

- With proper sound detection and classification—aided by machine learning—researchers can improve their ability to track the status and trends of biodiversity in important ecosystems, enabling them to better support global conservation efforts.

## Provided Data

1. **train_short_audio**
     1. These are short recordings of individual bird calls, uploaded by users of [xenocanto](www.xenocanto.org). They
     constitute the bulk of your training data.

2. **train_soundscapes** 
     1. These files are similar to your test data. Each audio file is about 10 minutes long.
    
3. **train_metadata.csv** 
     1. This file contains metadata for the recordings in train_short_audio, such as site, date, filename and recordist.
4. **train_soundscape_labels**
     1. Contains the labels for the audio files in the train_soundscapes folder. Labels are given for each 5-second window of an audio file.
     2. For example, the row_id 7019_COR_5 refers to the audio file with id 7019.
     3. COR is the site at which the audio was recorded, and 5 indicates the 0 to 5 second window of the complete audio file with id 7019.
     4. The "birds" column holds the labels, i.e. the birds (separated by spaces) that were heard during that time window.
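
The row_id layout described above can be taken apart with a small helper. This is only a sketch: `RowId` and `parse_row_id` are hypothetical names, and it assumes the three parts are always separated by underscores and that the last part is the end of the 5-second window, as in the 7019_COR_5 example.

```python
from typing import NamedTuple


class RowId(NamedTuple):
    audio_id: int
    site: str
    window_end_s: int  # end of the 5-second window, in seconds


def parse_row_id(row_id: str) -> RowId:
    """Split a row_id such as '7019_COR_5' into audio id, site and window end."""
    audio_id, site, seconds = row_id.split("_")
    return RowId(int(audio_id), site, int(seconds))


parsed = parse_row_id("7019_COR_5")
window_start_s = parsed.window_end_s - 5  # 5 -> the 0 to 5 second window
print(parsed, window_start_s)
```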

## Submission File

For each row_id, i.e. (audio id)_(site name)_(5 second time_window), you need to list the birds that can be heard in that window. Submissions are evaluated on the F1 score.
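
To make the F1 evaluation concrete, here is a minimal sketch of an F1 score for a single row, comparing two space-separated label strings. `row_f1` is a hypothetical helper and `birdA`/`birdB` are made-up labels. How the scores are averaged across rows, and how empty label sets are treated, is not stated above, so both are assumptions here.

```python
def row_f1(true_birds: str, predicted_birds: str) -> float:
    """F1 between two space-separated label strings for one row_id."""
    true_set = set(true_birds.split())
    pred_set = set(predicted_birds.split())
    if not true_set and not pred_set:
        return 1.0  # assumption: two empty label sets count as a perfect match
    true_positives = len(true_set & pred_set)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(pred_set)
    recall = true_positives / len(true_set)
    return 2 * precision * recall / (precision + recall)


# One of the two true birds was predicted: precision 1.0, recall 0.5
score = row_f1("birdA birdB", "birdA")
print(round(score, 3))
```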

# Data Exploration

## Imports

In [None]:
import pandas as pd
import numpy as np
import re

import seaborn as sns
import matplotlib.pyplot as plt
import geopandas as gpd
import descartes
from shapely.geometry import Point,Polygon

from collections import Counter

## Let's analyse the train_metadata.csv file

In [None]:
base_dir = '../input/birdclef-2021/'
# base_dir already ends with '/', so the file name needs no leading slash
train_metadata = pd.read_csv(base_dir + 'train_metadata.csv')

### Shape and columns of the train_metadata

In [None]:
train_metadata.shape

In [None]:
train_metadata.head()

1. We have labels for **62874** individual xenocanto audio files
2. Each filename has 2 labels -
    1. primary_label - Primary bird in the audio file
    2. secondary_labels - Birds (if any) heard in the background
3. type - Type of call made by the bird in the audio file
4. latitude & longitude - Coordinates of the location where the audio was captured
5. scientific_name - Scientific name of the primary bird in the audio file
6. common_name - Common name of the primary bird in the audio file
7. author - Name of the recordist
8. filename - Filename as in the **train_short_audio** folder
9. date - Date on which the recording was captured
10. rating - Rating based on the quality of the captured audio
11. time - Time at which the recording was captured
12. url - URL of the recording on <https://xenocanto.org>

### Are there any null values in this file?

In [None]:
train_metadata.isnull().sum()

- **No null values; good to go.**

### Individual Column wise analysis

#### Primary labels

- **Number of unique Primary labels**

In [None]:
train_metadata.primary_label.nunique()

- **Distribution of different primary labels**

In [None]:
# rename_axis + reset_index(name=...) yields the same two columns on old and new pandas versions
primary_label_dist = (train_metadata.primary_label.value_counts()
                      .rename_axis('primary_label')
                      .reset_index(name='count'))

primary_label_dist

In [None]:
plt.figure(figsize=(24,12))
ax = sns.barplot(x = 'primary_label', y='count', data= primary_label_dist[primary_label_dist['count']>=200])
ax.set_xticklabels(ax.get_xticklabels(), rotation=90)
plt.show()

- **So the number of samples per species ranges from a maximum of 500 down to a minimum of 8.** OK, let's look at the distribution of these counts.

In [None]:
plt.figure(figsize=(24,12))
sns.histplot(primary_label_dist['count'], kde=False)  # histplot replaces the deprecated distplot
plt.title("Primary Label Distribution")

- **It looks like most bird species have roughly 75 to 200 samples.**
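
The spread of samples per species can also be summarised numerically rather than read off the plot. The sketch below uses a toy Series with made-up labels as a stand-in for train_metadata.primary_label, so the printed numbers are illustrative only.

```python
import pandas as pd

# Toy stand-in for train_metadata.primary_label; the labels are made up.
labels = pd.Series(["a"] * 5 + ["b"] * 3 + ["c"] * 1, name="primary_label")

counts = labels.value_counts()
summary = counts.describe()  # min, quartiles and max of samples per species
imbalance_ratio = counts.max() / counts.min()

print(summary)
print(f"largest class is {imbalance_ratio:.1f}x the smallest")
```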

#### Secondary label

- Number of unique secondary labels

In [None]:
train_metadata.secondary_labels.nunique()

In [None]:
# rename_axis + reset_index(name=...) yields the same two columns on old and new pandas versions
seocondary_label_dist = (train_metadata.secondary_labels.value_counts()
                         .rename_axis('secondary_labels')
                         .reset_index(name='count'))

seocondary_label_dist

- A secondary_labels entry can contain more than one species.
- The majority of these entries are empty, i.e. no other bird is heard in that audio clip.

- **OK, so let's see how these counts are actually distributed.**

In [None]:
plt.figure(figsize=(24,12))
sns.histplot(seocondary_label_dist['count'], kde=False)  # histplot replaces the deprecated distplot

- **Most of the unique values in secondary_labels are repeated only a few times.** But how often does each individual bird occur?

In [None]:
import ast
all_secondary_labels = train_metadata.secondary_labels.apply(ast.literal_eval).sum()  # literal_eval parses the list strings without executing code

In [None]:
individual_secondary_label = dict(Counter(all_secondary_labels))
individual_secondary_label_df = pd.DataFrame({'label':list(individual_secondary_label.keys()),
                                              'count':list(individual_secondary_label.values())}).sort_values('count',ascending=False)

In [None]:
individual_secondary_label_df
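
As an alternative sketch of the same count, the list strings can be parsed once and flattened with `explode`. This also gives the share of clips with no secondary label. The toy frame and the labels `birdA`/`birdB` are made up, and it assumes every entry is a Python-style list literal.

```python
import ast

import pandas as pd

# Toy stand-in for train_metadata; values mimic the stringified lists.
toy = pd.DataFrame({"secondary_labels": ["[]", "['birdA', 'birdB']", "['birdA']"]})

# Turn each string into a real list without executing arbitrary code.
parsed = toy["secondary_labels"].apply(ast.literal_eval)

# Fraction of clips in which no background bird was labelled.
share_empty = (parsed.apply(len) == 0).mean()

# One row per individual label; empty lists become NaN and are dropped.
label_counts = parsed.explode().dropna().value_counts()

print(share_empty)
print(label_counts)
```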

- Let's look at the counts of the bird labels that occur at least 200 times

In [None]:
plt.figure(figsize=(24,12))
ax = sns.barplot(x = 'label', y='count', data= individual_secondary_label_df[individual_secondary_label_df['count']>=200])
ax.set_xticklabels(ax.get_xticklabels(), rotation=90)
plt.show()

#### Geolocation column

- As there are no nulls in the coordinates, let's plot them on a world map

In [None]:
gdf = gpd.GeoDataFrame(train_metadata, geometry=gpd.points_from_xy(train_metadata.longitude, 
                                                                   train_metadata.latitude))

In [None]:
world = gpd.read_file(gpd.datasets.get_path('naturalearth_lowres'))
fig,ax = plt.subplots(figsize=(24,12))
world.plot(ax= ax, color='black', edgecolor='black')
gdf.plot(ax=ax, color='red', markersize=2)
plt.show()

- **Although the data has been collected across the globe, the majority of it comes from North America, South America and European countries.**

#### Date

In [None]:
train_metadata['year'] = train_metadata['date'].astype('str').str[0:4]
# rename_axis + reset_index(name=...) yields the same two columns on old and new pandas versions
year_dist = (train_metadata.year.value_counts().rename_axis('year').reset_index(name='count')
             .sort_values('year', ascending=True).reset_index(drop=True))

plt.figure(figsize=(24,12))
ax = sns.barplot(x='year',y='count',data = year_dist)
ax.set_xticklabels(labels= ax.get_xticklabels(), rotation= 90)
plt.title('Distribution across year of audio recordings')
plt.show()

- Some garbage date values are present, with years such as 0000, 0199, 0201, 0202 and 2104
- Most of the audio has been captured in the last decade
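
One way to flag the garbage years is to convert the year prefix to a number and keep only a plausible range. This is a sketch on a toy frame. The bounds `MIN_YEAR` and `MAX_YEAR` are assumptions chosen for illustration and are not taken from the data.

```python
import pandas as pd

# Toy stand-in for train_metadata['date'], including garbage-style values.
toy = pd.DataFrame(
    {"date": ["2019-05-01", "0000-00-00", "0201-03-02", "2104-01-01", "2015-07-20"]}
)

# First four characters as a number; anything non-numeric becomes NaN.
year = pd.to_numeric(toy["date"].astype(str).str[:4], errors="coerce")

MIN_YEAR, MAX_YEAR = 1900, 2021  # assumed plausible range, adjust as needed
valid_year = year.between(MIN_YEAR, MAX_YEAR)  # inclusive on both ends

clean = toy[valid_year]
print(clean)
```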

In [None]:
train_metadata['month'] = train_metadata['date'].astype('str').str[5:7]
# rename_axis + reset_index(name=...) yields the same two columns on old and new pandas versions
month_dist = (train_metadata.month.value_counts().rename_axis('month').reset_index(name='count')
              .sort_values('month', ascending=True).reset_index(drop=True))

plt.figure(figsize=(24,12))
ax = sns.barplot(x='month',y='count',data = month_dist[month_dist['month']!='00'])
ax.set_xticklabels(labels= ax.get_xticklabels(), rotation= 90)
plt.title('Distribution across month of audio recordings')
plt.show()

- Most of the audio has been captured between March and July

#### ratings

In [None]:
# rename_axis + reset_index(name=...) yields the same two columns on old and new pandas versions
rating_dist = (train_metadata.rating.value_counts().rename_axis('rating').reset_index(name='count')
               .sort_values('rating', ascending=True).reset_index(drop=True))

In [None]:
plt.figure(figsize=(24,12))
ax = sns.barplot(x='rating',y='count',data = rating_dist)
plt.title('Distribution across rating of audio recordings')
plt.show()

- Most of the audio clips have a rating above 3.5

#### time

In [None]:
def hour_extractor(time_object):
    # strip() removes surrounding whitespace; strip('') would remove nothing
    cleaned = time_object.strip()
    if re.match(r'^\d\d:\d\d$', cleaned):
        return cleaned[0:2]
    else:
        return 'NA'
train_metadata['hour'] = train_metadata['time'].apply(hour_extractor)

In [None]:
# rename_axis + reset_index(name=...) yields the same two columns on old and new pandas versions
hour_dist = (train_metadata['hour'].value_counts().rename_axis('hour').reset_index(name='count')
             .sort_values('hour', ascending=True).reset_index(drop=True))

plt.figure(figsize=(24,12))
ax = sns.barplot(x='hour',y='count',data = hour_dist)
plt.title('Distribution across hour of audio recordings')
plt.show()

- A significant number of audio recordings do not have a time captured. However, it looks like the majority of the recordings were captured in the first half of the day.
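
The two observations above can be put into numbers with a vectorised version of the hour extraction. This is a sketch on a toy frame with made-up time values. Treating "first half" as hours before 12 is an assumption.

```python
import pandas as pd

# Toy stand-in for train_metadata['time'], including unparseable entries.
toy = pd.DataFrame({"time": ["06:30", " 07:15", "?", "18:00", "am"]})

# Keep the hour only where the value looks like HH:MM; everything else is NaN.
hour = pd.to_numeric(
    toy["time"].str.strip().str.extract(r"^(\d\d):\d\d$", expand=False),
    errors="coerce",
)

share_missing = hour.isna().mean()
share_before_noon = (hour.dropna() < 12).mean()  # among recordings with a usable time

print(f"missing time: {share_missing:.0%}, before noon: {share_before_noon:.0%}")
```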