![](https://wallpapercave.com/wp/wp7039642.jpg)
# BirdCLEF 2022 : Introduction
---
In this notebook we take a first pass through the competition data and files, extract some tabular insights, and check whether any of them play a specific role in prediction.

**This is the first of three notebooks and serves as an introduction to the given data.
Model preparation and training are covered in the next notebooks.**

# Libraries

In [None]:
from glob import glob
import json
import numpy as np
import os
import pandas as pd
from termcolor import cprint
import warnings
warnings.filterwarnings("ignore")
%matplotlib inline

# Source of the files and data

In [None]:
train_base_path = "../input/birdclef-2022/train_metadata.csv"
test_base_path = "../input/birdclef-2022/test.csv"
sample_submission_base_path = "../input/birdclef-2022/sample_submission.csv"
bird_taxonomy_base_path = "../input/birdclef-2022/eBird_Taxonomy_v2021.csv"
labels_base_path = "../input/birdclef-2022/scored_birds.json"
train_dir = "../input/birdclef-2022/train_audio"
test_dir = "../input/birdclef-2022/test_soundscapes"

# Loading Data

In [None]:
train_df = pd.read_csv(train_base_path)
test_df = pd.read_csv(test_base_path)
samp_sub_df = pd.read_csv(sample_submission_base_path)
bird_details_df = pd.read_csv(bird_taxonomy_base_path)
# Use a context manager so the file handle is closed after reading
with open(labels_base_path, "r") as f:
    labels = json.load(f)

#### Iterating through train metadata :

We will take an overview of the train metadata first, then move on to further insights.

In [None]:
train_df.head()

As we can see, the train data consists mostly of naming columns, which tell us little about the species themselves.
The most important features should be -
#### 1. primary_label :
        It represents the target feature.
#### 2. type :
        It will help us determine the nature of the calls.
#### 3. latitude :
        It will help us retrieve the geographic location.
#### 4. longitude :
        It will help us retrieve the geographic location.
#### 5. rating :
        The quality of the recording will help us analyze the final results, as quality may differ between training and testing data.
#### 6. time :
        It will help us understand when the birds are active, which can be a contributing factor in model building.
#### 7. filename :
        It points to the audio file that each metadata row describes.

In [None]:
train_df.info()

From the train metadata we can see that most of the features are of object type and are unlikely to help the final prediction directly.

#### Iterating through test metadata :

From the test metadata we can find which features are available at prediction time.

In [None]:
test_df.head()

As we can see, the test metadata has a column named **end_time**, which marks the end of the 5-second window of the recording for which a prediction is required.
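
To make the role of **end_time** concrete, here is a minimal sketch of how a window end could be turned into sample indices. The helper name `window_bounds` is hypothetical, and the 32 kHz sample rate and 5-second window length are assumptions that should be checked against the actual audio.

```python
SAMPLE_RATE = 32_000   # Hz; assumption, verify against the loaded audio
WINDOW_SECONDS = 5     # assumed length of each prediction window


def window_bounds(end_time, sample_rate=SAMPLE_RATE, window_seconds=WINDOW_SECONDS):
    """Return (start_sample, end_sample) of the window ending at end_time seconds."""
    start_time = max(end_time - window_seconds, 0)
    return int(start_time * sample_rate), int(end_time * sample_rate)


print(window_bounds(5))   # (0, 160000)
print(window_bounds(15))  # (320000, 480000)
```

A loaded soundscape could then be sliced as `waveform[:, start:end]` before it is passed to a model.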

In [None]:
test_df.info()

Most of the training features, such as the recording location, are absent from the test metadata, although they could have been useful for training and predicting with a simple baseline model.

In [None]:
samp_sub_df.head()

The sample submission holds only the row id of each test entry and the prediction.

In [None]:
samp_sub_df.info()

## Approach :

The approach should be as follows -

            1. Create a NN model and train it on the training data.
            2. Trim and prepare the testing data and run the model's predictions over it.
            3. Check whether the one-hot encoded output really predicts the class in question.
            4. Assign True or False accordingly.
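
Steps 3 and 4 can be sketched as a simple thresholding step. This is only an illustration: the helper `to_boolean_targets`, the 0.5 threshold and the probabilities are hypothetical, and the real model output may be shaped differently.

```python
def to_boolean_targets(probabilities, threshold=0.5):
    """Map {species_code: probability} to {species_code: bool} using a threshold."""
    return {code: prob >= threshold for code, prob in probabilities.items()}


# Made-up probabilities for two species codes that appear in train_audio
example = {"barpet": 0.91, "buwtea": 0.07}
print(to_boolean_targets(example))  # {'barpet': True, 'buwtea': False}
```

The threshold is a tunable choice; a lower value trades precision for recall.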

In [None]:
bird_details_df.head()

In [None]:
bird_details_df.info()

The taxonomy details of the birds may reveal some key points that will help when preparing the model.

In [None]:
# Scored birds
labels

In [None]:
train_df.primary_label.nunique()

Let's look through the primary labels of the birds, which will be our target feature.

In [None]:
train_df.primary_label.unique()

Let's also check the secondary labels to see whether they contain any relevant data.

In [None]:
train_df.secondary_labels.apply(lambda x: len(x)).unique()

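Note that, assuming the column is read from the CSV as text, `len` here measures the length of the raw string rather than the number of labels. The values show no obvious pattern, so we move on.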
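
If the number of secondary labels per recording is of interest, the text can be parsed first. The sketch below assumes each value is a stringified Python list such as `"['barpet']"`; the helper name `count_secondary_labels` is hypothetical.

```python
import ast


def count_secondary_labels(raw):
    """Parse a stringified list such as "['a', 'b']" and return how many items it holds."""
    try:
        parsed = ast.literal_eval(raw)
    except (ValueError, SyntaxError, TypeError):
        return 0  # treat unparsable or missing values as having no labels
    return len(parsed) if isinstance(parsed, list) else 0


samples = ["[]", "['barpet']", "['barpet', 'buwtea']"]
print([count_secondary_labels(s) for s in samples])  # [0, 1, 2]
```

On the real data this could be applied as `train_df.secondary_labels.apply(count_secondary_labels).value_counts()`.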

#### Calls of the birds :

The calls of the birds will help us determine the different types of vocalization of a single species.

In [None]:
def call_extractor(data):
    # Strip the surrounding brackets of the stringified list, then split on ", "
    return data[1:-1].split(", ")

In [None]:
all_calls = train_df["type"].apply(lambda x : call_extractor(x)).tolist()
calls = []
for call_list in all_calls:
    for call in call_list:
        calls.append(call)
unique_calls = pd.Series(calls).value_counts().index
print("Total unique calls : {}".format(
                                len(unique_calls)))
unique_calls[:15]

We can see there are many different types of calls, so one approach is to use only the most abundant call types for training; otherwise the rare types may hurt the training metrics.
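
Picking the most abundant call types could look like the sketch below. It is a minimal example on made-up lists; the helper `most_common_calls` and the cut-off `top_k` are hypothetical choices, not part of the original analysis.

```python
from collections import Counter


def most_common_calls(call_lists, top_k=3):
    """Count every call type across all recordings and return the top_k most frequent."""
    counts = Counter(call for calls in call_lists for call in calls)
    return [name for name, _ in counts.most_common(top_k)]


# Made-up example: one list of call types per recording
example = [["call", "song"], ["song"], ["call", "flight call"], ["song"]]
print(most_common_calls(example, top_k=2))  # ['song', 'call']
```

Recordings whose call types all fall outside the selected set could then be filtered out or grouped under a single "other" type.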

# Exploratory Data Analysis :

Let's visualize the features and see whether any patterns emerge.

#### Importing libraries for visualization.

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
# global location of the species
unique_labels = train_df.primary_label.unique()
plt.figure(figsize=(20,6))
for label in unique_labels:
    temp_df = train_df[train_df.primary_label == label]
    plt.scatter(temp_df.longitude, temp_df.latitude)
plt.xlabel("longitude")
plt.ylabel("latitude")
plt.title("Location Wise Distribution")
# No legend: the scatter calls carry no labels, and one entry per species would be unreadable
plt.show()

Let's check the presence of the species by their absolute latitude (distance from the equator).

In [None]:
# distribution of the species from the equatorial to the polar region
unique_labels = train_df.primary_label.unique()
plt.figure(figsize=(20,6))
for label in unique_labels:
    temp_df = train_df[train_df.primary_label == label]
    # absolute latitude folds both hemispheres onto one equator-to-pole axis
    plt.scatter(temp_df.longitude, temp_df.latitude.abs())
plt.xlabel("longitude")
plt.ylabel("absolute latitude")
plt.title("Location Wise Distribution (Equatorial to Polar region)")
plt.show()

Let's check the average rating of the bird call recordings per primary label.

In [None]:
avg_ratings = train_df.groupby("primary_label").agg({"rating" : "mean"})
plt.figure(figsize = (20, 6))
sns.barplot(x = avg_ratings.index, y = avg_ratings.rating)  # keyword arguments are required by recent seaborn versions
plt.title("Average ratings on specific bird codes")
plt.xlabel("Bird Codes")
plt.ylabel("Rating")
plt.xticks(rotation = 90)
plt.show()

Let's also check the average call timings of the species.

In [None]:
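def time_converter(data):
    # Convert an "HH:MM" string to minutes after midnight; return -1 if it cannot be parsed
    try:
        hour, minute = data.split(":")
        return int(hour)*60 + int(minute)
    except (AttributeError, ValueError):
        return -1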

timings = train_df[["primary_label", "time"]].copy()
timings["time"] = timings["time"].apply(time_converter)
timings = timings[timings["time"] >= 0 ]
avg_timings = timings.groupby("primary_label").agg({"time" : "mean"})
plt.figure(figsize = (20, 6))
sns.barplot(x = avg_timings.index, y = avg_timings["time"])
plt.title("Average time on specific bird codes records")
plt.xlabel("Bird Codes")
plt.ylabel("Recording time (minutes after midnight)")
plt.xticks(rotation = 90)
plt.show()
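
One caveat with the averages above: time of day is circular, so a plain mean of 23:00 and 01:00 gives noon rather than midnight. A possible alternative is a circular mean, sketched below for minutes after midnight. The helper `circular_mean_minutes` is hypothetical and assumes a non-empty input.

```python
import math

MINUTES_PER_DAY = 1440


def circular_mean_minutes(minutes):
    """Average times of day (minutes after midnight) on a 24-hour circle."""
    angles = [2 * math.pi * m / MINUTES_PER_DAY for m in minutes]
    sin_mean = sum(math.sin(a) for a in angles) / len(angles)
    cos_mean = sum(math.cos(a) for a in angles) / len(angles)
    # atan2 recovers the mean direction; the modulo maps it back to [0, 2*pi)
    mean_angle = math.atan2(sin_mean, cos_mean) % (2 * math.pi)
    return mean_angle * MINUTES_PER_DAY / (2 * math.pi)


# 23:00 and 01:00 average to midnight instead of noon
print(round(circular_mean_minutes([23 * 60, 1 * 60])) % MINUTES_PER_DAY)  # 0
```

It could replace `"mean"` in the aggregation, for example `timings.groupby("primary_label")["time"].agg(lambda s: circular_mean_minutes(s.tolist()))`.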

Now let's check a 3D representation of latitude, longitude and recording time to see whether the three features are related.

In [None]:
# Copy the slice so the assignment below does not modify a view of train_df
feature_df = train_df[["latitude", "longitude", "time", "primary_label"]].copy()
feature_df["time"] = feature_df["time"].apply(time_converter)
feature_df = feature_df[feature_df.time >= 0]
feature_df.head()

In [None]:
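from mpl_toolkits import mplot3d  # registers the 3D projection on older matplotlib versions
fig = plt.figure(figsize=(10,10))
ax = plt.axes(projection='3d')
X = feature_df.latitude
Y = feature_df.longitude
Z = feature_df.time
ax.plot_trisurf(X, Y, Z, linewidth=0, antialiased=False)
ax.set_xlabel('latitude')
ax.set_ylabel('longitude')
ax.set_zlabel('time (minutes after midnight)')
ax.set_title('Latitude, longitude and recording time surface')
plt.show()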

The class counts of the target feature should also be checked, as class balance strongly affects prediction.

In [None]:
target_feature_dist = train_df.primary_label.value_counts()
plt.figure(figsize = (8, 8))
plt.pie(target_feature_dist.values, labels= target_feature_dist.index)
plt.title("Target Feature Distribution")
plt.show()

We can see the distribution is too imbalanced for all classes to be predicted equally well.
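
One common way to compensate for such imbalance is to weight each class inversely to its frequency. The sketch below is an assumption about a possible next step, not something done in this notebook; `inverse_frequency_weights` is a hypothetical helper.

```python
from collections import Counter


def inverse_frequency_weights(labels):
    """Return {label: weight}, where rarer labels receive larger weights."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {label: n_samples / (n_classes * count) for label, count in counts.items()}


# Made-up example: three recordings of one species, one of another
example = ["barpet", "barpet", "barpet", "buwtea"]
weights = inverse_frequency_weights(example)
print(weights["buwtea"])  # 2.0
```

On the real data the input would be `train_df.primary_label.tolist()`, and the resulting weights could be passed to a weighted loss.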

# Record Insights :
---
Now it's time to jump inside the recordings.

In [None]:
import torch
import torchaudio
from torchaudio.transforms import MelSpectrogram
import random
random.seed(42)
torchaudio.set_audio_backend("sox_io")

In [None]:
sample_ogg_file_base_path = "../input/birdclef-2022/train_audio/barpet/XC441955.ogg"
waveform, sample_rate = torchaudio.load(sample_ogg_file_base_path)
print("Sample rate : {}".format(sample_rate))
print("Shape of waveform : {}".format(waveform.shape))

In [None]:
plt.figure(figsize = (20, 6))
plt.scatter(range(waveform.shape[1]), waveform.reshape(-1), s = 1)
plt.xlabel("Sample index")
plt.ylabel("Amplitude")
plt.title("Sample audio waveform")
plt.show()

Let's see whether there is any pattern for each species.

In [None]:
melsp = MelSpectrogram(sample_rate = 32000)

In [None]:
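def pad_or_truncate(waveform, max_sample_len):
    # Truncate long clips; zero-pad short clips on both sides to exactly max_sample_len
    if waveform.shape[0] > max_sample_len:
        return waveform[:max_sample_len]
    else:
        total_pad = max_sample_len - waveform.shape[0]
        left_pad = torch.zeros(total_pad // 2)
        # the right side takes the remainder so odd padding lengths are not cut short
        right_pad = torch.zeros(total_pad - total_pad // 2)
        return torch.cat([left_pad, waveform, right_pad], dim = 0)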
        
        
def form_audio_curve(directory, num_samples = 3, max_sample_len = 60000):
    file_paths = glob(f"{directory}/*.ogg")
    #print(file_paths[:num_samples])
    random.shuffle(file_paths)
    plt.figure(figsize = (20, 6))
    for index, path in enumerate(file_paths[:num_samples]):
        waveform, _ = torchaudio.load(path)
        print(waveform.shape)
        waveform = pad_or_truncate(waveform[0], max_sample_len)
        plt.plot(waveform, label = f"audio{index+1}")
    plt.legend()
    plt.show()
form_audio_curve(directory = "../input/birdclef-2022/train_audio/buwtea")

In [None]:
for folder in os.listdir(train_dir):
    folder_path = os.path.join(train_dir, folder)
    form_audio_curve(directory = folder_path)

Having gone through every directory, it is clear that most recordings show a roughly periodic pattern, but some are very irregular and look random.

## Thanks for visiting!
## Do STAR if you like it!