In [None]:
import pkg_resources

def placeholder(x):
    raise pkg_resources.DistributionNotFound
pkg_resources.get_distribution = placeholder

!pip uninstall fastai fastcore torchaudio -y
#!pip install torch==1.8.1 torchaudio==0.8.1 fastcore==1.3.20
!pip install torch==1.9.0 torchaudio==0.9.0
!pip install fastaudio

In [None]:
try:
    import pycaret
except ImportError:
    !pip install pycaret-nightly

<hr style="border: solid 3px blue;">

# Introduction

![](https://i1.rgstatic.net/publication/259653570_History_Shut_up_and_calculate/links/5816acbe08aeffbed6c1a187/largepreview.png)

Picture Credit: https://i1.rgstatic.net

When I first encountered this problem, I had many thoughts. It was hard to figure out where to start. The more I read and tried to understand the problem, the less confident I became. I am neither an ornithologist nor an acoustic engineer. In short, I lack domain knowledge.

**Shut up and Calculate!**

This is a famous saying in quantum mechanics. It suddenly came to mind while I was thinking about various ways to solve the problem.
Rather than giving up for lack of domain knowledge, let's stop overthinking for a moment and just start.

**I would like to organize this notebook in the following order.**
> 1. Understand each dataset. Use a simple model to check which features are important.
> 2. Understand audio data.
> 3. Transform audio data and design dataloader to process them.
> 4. Design a model and train it by introducing various methods to improve performance.

In [None]:
import os
import librosa
from tqdm import tqdm

import pandas as pd
from fastaudio.all import *
from fastai.vision.all import *

import torch
import torchaudio
import fastcore
import fastai
import fastaudio
torchaudio.set_audio_backend("sox_io")

import seaborn as sns
import matplotlib.pyplot as plt
from matplotlib import rcParams
import plotly.express as px

from sklearn.preprocessing import LabelEncoder

import warnings
warnings.filterwarnings(action='ignore')

In [None]:
sns.set(style="ticks", context="talk",font_scale = 1.2)
plt.style.use("seaborn-paper")
plt.subplots_adjust(wspace=1)

------------------------------------------------------
# EDA

In [None]:
# CONFIGURATIONS
DATA_DIR = Path('../input/birdclef-2022/train_audio')

In [None]:
audio_fns = get_audio_files(DATA_DIR)
print(f'No. of audio files: {len(audio_fns)}')

# load the metadata; to save time, a subset of the training rows is sampled further below
meta_df = pd.read_csv('../input/birdclef-2022/train_metadata.csv')
test_df = pd.read_csv('../input/birdclef-2022/test.csv')

--------------------------------------------
# Analyzing Meta Data

In [None]:
meta_df.head().T.style.set_properties(**{'background-color': 'black',
                           'color': 'white',
                           'border-color': 'white'})

In [None]:
meta_df.nunique()

In [None]:
test_df.head().T.style.set_properties(**{'background-color': 'black',
                           'color': 'white',
                           'border-color': 'white'})

In [None]:
train_df = meta_df.drop(['url','filename','scientific_name','license','time','common_name','secondary_labels']
                        ,axis=1
                        ,errors='ignore')
train_df = train_df.sample(1000)

In [None]:
enc_list = ['primary_label','author','type']
for feature in enc_list:
    le = LabelEncoder()
    le = le.fit(train_df[feature])
    train_df[feature] = le.transform(train_df[feature])

------------------------------------------------------------------------
# Understanding Metadata from a simple model

In [None]:
from pycaret.classification import *

In [None]:
setup(data = train_df, 
      target = 'primary_label',
      preprocess = True,
      silent = True
     )

In [None]:
dt = create_model('dt',verbose = False)

In [None]:
plot_model(dt, plot='feature')

<span style="color:Blue"> Observation:
* latitude and longitude are important features.
* rating is of relatively low importance.
    
I don't yet know whether this information helps training, or, if it does, how to use it.

In [None]:
fig = px.scatter_geo(
    meta_df,
    lat="latitude",
    lon="longitude",
    color="primary_label",
    width=1000,
    height=500,
    title="Bird Distribution",
)
fig.show()

------------------------------------
# Understanding Train Dataset

The training data are audio files. Let's listen to a few and look at their waveforms to see what strategy might solve this problem.

In [None]:
def show_bird(audios):
    for fn in audios:
        audio = AudioTensor.create(fn)
        audio.show()

-------------------------------
## Normoc

![](https://live.staticflickr.com/7070/6873951614_2dd80c1d7c_b.jpg)

Ref: https://cdn.download.ams.birds.cornell.edu

In [None]:
fig = px.scatter_geo(
    meta_df[meta_df['primary_label'] == 'normoc'],
    lat="latitude",
    lon="longitude",
    color="primary_label",
    width=1000,
    height=500,
    title="Bird Distribution",
)
fig.show()

In [None]:
normoc_fns = get_audio_files( '../input/birdclef-2022/train_audio/normoc')

In [None]:
show_bird(normoc_fns[:3])

<span style="color:Blue"> Observation:
* The recordings are of the same bird species, but they sound different.
* Some of the audio files contain audible noise.
* The waveforms show no obvious similarity either.

--------------------------------
## Norcar

![](https://upload.wikimedia.org/wikipedia/commons/d/da/Cardinal.jpg)

Ref: https://upload.wikimedia.org

In [None]:
fig = px.scatter_geo(
    meta_df[meta_df['primary_label'] == 'norcar'],
    lat="latitude",
    lon="longitude",
    color="primary_label",
    width=1000,
    height=500,
    title="Bird Distribution",
)
fig.show()

In [None]:
norcar_fns = get_audio_files( '../input/birdclef-2022/train_audio/norcar')

In [None]:
show_bird(norcar_fns[:3])

<span style="color:Blue"> Observation:
* The recordings are of the same bird species, but they sound different.
* Some of the audio files contain audible noise.
* The waveforms show no obvious similarity either.

----------------------------------------
## Bcnher

![](https://www.birdingintaiwan.com/Black-crowned%20Night-Heron.600b.jpg)

Picture Credit: https://www.birdingintaiwan.com

In [None]:
fig = px.scatter_geo(
    meta_df[meta_df['primary_label'] == 'bcnher'],
    lat="latitude",
    lon="longitude",
    color="primary_label",
    width=1000,
    height=500,
    title="Bird Distribution",
)
fig.show()

In [None]:
bcnher_fns = get_audio_files( '../input/birdclef-2022/train_audio/bcnher')

In [None]:
show_bird(bcnher_fns[:3])

<span style="color:Blue"> Observation:
* The recordings are of the same bird species, but they sound different.
* Some of the audio files contain audible noise.
* The waveforms show no obvious similarity either.

--------------------------------------------
# Making fingerprints of audios

![](https://www.researchgate.net/publication/335398843/figure/fig1/AS:796124961058818@1566822390492/MFCC-mel-frequency-cepstral-coefficients-characteristic-vectors-extraction-flow.png)

Picture Credit: https://www.researchgate.net

We try to solve the audio problem by turning it into an image problem. For images, there are proven, high-performing CNN models, and many methods have been devised to improve their performance.

Fortunately, fastaudio, which builds on fastai, provides MFCC transforms, so let's use them to transform audio into images. In other words, we make a fingerprint of each audio clip and classify the clips by these fingerprints.

**What are mel-frequency cepstral coefficients (MFCCs)?**

> In sound processing, the mel-frequency cepstrum (MFC) is a representation of the short-term power spectrum of a sound, based on a linear cosine transform of a log power spectrum on a nonlinear mel scale of frequency.
> 
> Mel-frequency cepstral coefficients (MFCCs) are coefficients that collectively make up an MFC. They are derived from a type of cepstral representation of the audio clip (a nonlinear "spectrum-of-a-spectrum"). The difference between the cepstrum and the mel-frequency cepstrum is that in the MFC, the frequency bands are equally spaced on the mel scale, which approximates the human auditory system's response more closely than the linearly-spaced frequency bands used in the normal spectrum. This frequency warping can allow for better representation of sound, for example, in audio compression.

Ref: https://en.wikipedia.org/wiki/Mel-frequency_cepstrum
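
To make "equally spaced on the mel scale" concrete, here is a minimal sketch that uses one common (HTK-style) conversion, m = 2595 · log10(1 + f / 700). The helper names `hz_to_mel`, `mel_to_hz`, and `mel_band_edges` are illustrative and are not part of fastai or fastaudio. Libraries may use a slightly different variant of the formula, and the 16 kHz upper limit is just an assumed example value.

```python
import math

def hz_to_mel(f_hz):
    """Convert a frequency in Hz to mels (HTK-style formula)."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)

def mel_to_hz(m):
    """Inverse of hz_to_mel: convert mels back to Hz."""
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_band_edges(f_min, f_max, n_bands):
    """Return n_bands + 1 band edges in Hz that are equally spaced in mels."""
    m_min, m_max = hz_to_mel(f_min), hz_to_mel(f_max)
    step = (m_max - m_min) / n_bands
    return [mel_to_hz(m_min + i * step) for i in range(n_bands + 1)]

# Eight bands between 0 Hz and an assumed upper limit of 16 kHz.
edges = mel_band_edges(0.0, 16000.0, 8)
widths = [hi - lo for lo, hi in zip(edges, edges[1:])]

for lo, hi in zip(edges, edges[1:]):
    print(f"{lo:9.1f} Hz to {hi:9.1f} Hz (width {hi - lo:8.1f} Hz)")
```

The bands have equal width in mels, but their width in Hz grows with frequency, so low frequencies are resolved more finely than high ones.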

In [None]:
aud2mfcc = AudioToMFCC(n_mfcc=40, melkwargs={'n_fft':2048, 'hop_length':256,'n_mels':128})

--------------------------------------------------------------
# Making Dataloaders

![](https://miro.medium.com/max/1838/1*3vAYjhGh_EopD0cRdxrbOQ.png)

Picture Credit: https://miro.medium.com

## Making Pipeline

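The pipeline runs in the following order. Only the RemoveSilence, ResizeSignal, AudioToMFCC, and Delta steps are set explicitly in the cell below.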
> Resample -> DownmixMono -> RemoveSilence -> ResizeSignal -> AudioToMFCC -> Delta -> ToTensor
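
The ResizeSignal step exists because a CNN needs inputs of a fixed size, so every clip has to be cropped or padded to the same duration. The following is a simplified sketch of that idea in plain Python, not fastaudio's implementation. The function name `resize_signal` is hypothetical, the 32 kHz sample rate is an assumed example value, and the sketch always crops from the start and pads with zeros at the end, whereas a library transform may crop or pad differently.

```python
def resize_signal(samples, sample_rate, duration_ms):
    """Crop or zero-pad a mono signal to exactly duration_ms milliseconds."""
    target_len = int(sample_rate * duration_ms / 1000)
    if len(samples) >= target_len:
        # Too long: keep only the first target_len samples (a simplification).
        return list(samples[:target_len])
    # Too short: append zeros (silence) until the target length is reached.
    return list(samples) + [0.0] * (target_len - len(samples))

sample_rate = 32000                  # assumed sample rate in Hz
short_clip = [0.1] * 8000            # 0.25 s of dummy audio
long_clip = [0.1] * 64000            # 2 s of dummy audio

fixed_short = resize_signal(short_clip, sample_rate, 1000)
fixed_long = resize_signal(long_clip, sample_rate, 1000)
print(len(fixed_short), len(fixed_long))
```

Both clips come out with the same number of samples, so the MFCC images computed from them have the same width.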

In [None]:
item_tfms = [RemoveSilence(),ResizeSignal(1000), aud2mfcc, Delta()]

In [None]:
aud_digit = DataBlock(blocks=(AudioBlock, CategoryBlock),  
                 get_items=get_audio_files, 
                 splitter=RandomSplitter(),
                 item_tfms = item_tfms,
                 get_y=parent_label)

In [None]:
aud_digit.summary(DATA_DIR)

In [None]:
dls = aud_digit.dataloaders(DATA_DIR, bs=16)

In [None]:
dls.c

<span style="color:Blue"> Observation:
    
* The target consists of 152 classes.

--------------------------------------------------------------------
## Checking batch

In [None]:
plt.figure(figsize=(10, 8))
dls.show_batch(max_n=3,figsize=(20,10))

<span style="color:Blue"> Observation:
* We converted audio to images. In other words, the figures above can be regarded as fingerprints of each audio file.
   
The form of the problem has changed. As the figures above show, the audio classification problem has become an image classification problem.
    
Quantum mechanics cannot give a clear intuitive picture of how quanta behave, yet its problems can still be solved mathematically. In the same spirit, rather than trying to understand the figures above precisely, we will only think about how the model can learn from them.

------------------------------------------------------
# Modeling

In [None]:
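def audio_learner(dls, arch, loss_func, metrics):
    "Prepares a `Learner` for audio processing"
    # Stop early after 5 epochs without an accuracy gain, record activation
    # statistics for later inspection, and train in mixed precision.
    learn = Learner(dls, arch, loss_func, metrics=metrics,
                    cbs=[EarlyStoppingCallback(monitor='accuracy', patience=5),
                         ActivationStats(with_hist=True)]).to_fp16()
    # Number of input channels in one batch of images
    n_c = dls.one_batch()[0].shape[1]
    # Only reached for single-channel input; `alter_learner` must be defined
    # or imported for that case.
    if n_c == 1: alter_learner(learn)
    return learn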

In [None]:
learn = audio_learner(dls, 
                      xresnet18(n_out=dls.c),  # size the output layer to the number of classes
                      LabelSmoothingCrossEntropy(), 
                      accuracy)

--------------------------------------------
# Finding the proper learning rate

![](https://149695847.v2.pressablecdn.com/wp-content/uploads/2019/05/learning-rate.gif)

Picture Credit: https://149695847.v2.pressablecdn.com

> In machine learning and statistics, the learning rate is a tuning parameter in an optimization algorithm that determines the step size at each iteration while moving toward a minimum of a loss function. Since it influences to what extent newly acquired information overrides old information, it metaphorically represents the speed at which a machine learning model "learns". In the adaptive control literature, the learning rate is commonly referred to as gain.
> 
> In setting a learning rate, there is a trade-off between the rate of convergence and overshooting. While the descent direction is usually determined from the gradient of the loss function, the learning rate determines how big a step is taken in that direction. A too high learning rate will make the learning jump over minima but a too low learning rate will either take too long to converge or get stuck in an undesirable local minimum.

Ref: https://en.wikipedia.org/wiki/Learning_rate

The learning rate is one of the most important hyperparameters. However, choosing it is not easy and involves many considerations.
fastai provides a learning rate finder that suggests an appropriate value.
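
The trade-off described in the quote above can be seen on a toy problem. The sketch below minimizes f(x) = x² with plain gradient descent for three hand-picked learning rates. It illustrates step size only and is not how fastai's finder works; the function name `gradient_descent` and the chosen rates are assumptions made for this example.

```python
def gradient_descent(lr, steps=50, x0=1.0):
    """Minimize f(x) = x**2 with plain gradient descent and return the final x."""
    x = x0
    for _ in range(steps):
        grad = 2.0 * x       # derivative of x**2
        x = x - lr * grad    # step against the gradient
    return x

too_low = gradient_descent(lr=0.001)    # barely moves from the start
reasonable = gradient_descent(lr=0.1)   # converges towards the minimum at 0
too_high = gradient_descent(lr=1.1)     # overshoots further on every step

print(f"lr=0.001 -> x = {too_low:.4f}")
print(f"lr=0.1   -> x = {reasonable:.6f}")
print(f"lr=1.1   -> x = {too_high:.1f}")
```

With the smallest rate, x is still close to its starting value after 50 steps. With the largest rate, each step jumps over the minimum and lands further away than before.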

In [None]:
sr = learn.lr_find()
sr

------------------------------------------------
# Training

In [None]:
learn.fit_one_cycle(10, sr.lr_steep)

-------------------------------------------
# Interpreting Model

In [None]:
def plot_layer_stats(self, idx):
    # `subplots` returns (figure, axes); name the figure `fig` so it does not shadow `plt`
    fig,axs = subplots(1, 3, figsize=(15,3))
    fig.subplots_adjust(wspace=0.5)
    for o,ax,title in zip(self.layer_stats(idx),axs,('mean','std','% near zero')):
        ax.plot(o)
        ax.set_title(title)

In [None]:
plot_layer_stats(learn.activation_stats,-2)

In [None]:
plot_layer_stats(learn.activation_stats,-1)

<span style="color:Blue"> Observation:
* The activations appear to be well distributed.

-------------------------------------------
# Checking Underfitting and Overfitting

![](https://vitalflux.com/wp-content/uploads/2020/12/overfitting-and-underfitting-wrt-model-error-vs-complexity.png)

Picture Credit: https://vitalflux.com/wp-content

In [None]:
learn.recorder.plot_loss()

<span style="color:Blue"> Observation:
* Training seems to have ended before overfitting set in. However, I still don't know whether this is the optimal point. Increasing the number of epochs is costly because training takes too long.
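
The learner above uses `EarlyStoppingCallback(monitor='accuracy', patience=5)` to end training once accuracy stops improving. The following is a simplified sketch of that patience rule in plain Python, not fastai's code. The function name `early_stopping_epoch` and the accuracy values are made up for illustration.

```python
def early_stopping_epoch(scores, patience):
    """Return the epoch index at which training would stop, or None.

    Training stops once `patience` consecutive epochs fail to beat the best
    score seen so far (higher is better, as with accuracy).
    """
    best = float("-inf")
    wait = 0
    for epoch, score in enumerate(scores):
        if score > best:
            best, wait = score, 0    # new best: reset the counter
        else:
            wait += 1                # no improvement this epoch
            if wait >= patience:
                return epoch
    return None

# Made-up validation accuracies, one value per epoch.
val_accuracy = [0.10, 0.18, 0.25, 0.24, 0.25, 0.23, 0.22, 0.24]
print(early_stopping_epoch(val_accuracy, patience=5))
```

In this toy run the best accuracy is reached at epoch index 2, and training stops five epochs later because no later epoch beats it.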

----------------------------------------------
# Checking Results

In [None]:
interp = ClassificationInterpretation.from_learner(learn)
interp.plot_confusion_matrix(figsize=(30,30), dpi=240)

<span style="color:Blue"> Observation:
* Because the target has so many classes, the matrix is hard to read.
* The diagonal of the matrix suggests that the model has learned to some extent, but the result is not satisfactory.

**Let's check the cases where the model made the most mistakes!**

In [None]:
interp.most_confused(min_val=10)

<span style="color:Blue"> Observation:
* The model confuses 'bcnher' and 'brnowl' exceptionally often. I will need to investigate why.
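
The idea behind a "most confused" list is to count every misclassified (actual, predicted) pair and sort the pairs by frequency. The sketch below reimplements that idea in plain Python on a few made-up labels. It is not fastai's implementation, and the data are not results from this notebook.

```python
from collections import Counter

def most_confused(actual, predicted, min_val=1):
    """Return (actual, predicted, count) for wrong predictions, most frequent first."""
    counts = Counter((a, p) for a, p in zip(actual, predicted) if a != p)
    return [(a, p, n) for (a, p), n in counts.most_common() if n >= min_val]

# Made-up labels for illustration only.
actual    = ["bcnher", "bcnher", "bcnher", "brnowl", "norcar", "normoc"]
predicted = ["brnowl", "brnowl", "bcnher", "brnowl", "normoc", "normoc"]

print(most_confused(actual, predicted, min_val=2))
```

Raising `min_val` hides rare mistakes, so only the pairs the model mixes up repeatedly remain.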

<hr style="border: solid 3px blue;">

# Conclusion

The accuracy on the validation dataset is not satisfactory.
I need to go back to the reference material and think about ways to improve performance.

Thank you for reading! 

**I will be back!**

![](https://i.kym-cdn.com/photos/images/newsfeed/000/996/505/98a.gif)

<hr style="border: solid 3px blue;">