## BirdCLEF Call Identification Pre-Modeling
---

1. MetaData EDA
2. Audio EDA
3. Toy model with fastai

### Acknowledgments

---

* [EDA kickstarter](https://www.kaggle.com/virajkadam/birdclef-exploratory-data-analysis)
* [torch-librosa help](https://www.kaggle.com/whurobin/training-pipeline-in-pytorch-lightning/data)
* [fastaudio help](https://colab.research.google.com/drive/1hTRtTq3Tr9kgld0i78ao8gVrq_0yPTgW#scrollTo=UktmdDZ7wt8T)

In [None]:
!pip uninstall fastai -y
!pip install fastai==2.2.7
!pip install fastaudio
!pip install fastcore==1.3.19

In [None]:
### Libraries

import os, random, math
import numpy as np
import pandas as pd 
import geopandas as gpd 
import matplotlib.pyplot as plt 
import seaborn as sns 
exec(open('../input/plting/plt-apple-dark.py').read())

from fastaudio.core.all import *
from fastaudio.augment.all import *
from fastai.torch_basics import *
from fastai.vision.all import *

import torchaudio
import librosa
import librosa.display
from IPython.display import Audio
from pathlib import Path


In [None]:
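def seed_everything(seed=42):
    """Seed the Python, NumPy, and PyTorch RNGs for more reproducible runs."""
    random.seed(seed)
    os.environ['PYTHONHASHSEED'] = str(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)  # seed every visible GPU, not just the current one
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False  # autotuning can select non-deterministic kernels

seed_everything()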

In [None]:
torch.cuda.is_available()

## 1. MetaData EDA
---

* Basic EDA before modeling.
* Where were the records taken?
  * Are the records evenly distributed across space or unevenly?


In [None]:
training_metadata_df=pd.read_csv('../input/birdclef-2021/train_metadata.csv')
print(f"Len training data: {len(training_metadata_df.index)} \n\n")
training_metadata_df.sample(2)

In [None]:
n_species = training_metadata_df.primary_label.nunique()
# Species with fewer than 100 recordings
n_rare = (training_metadata_df['common_name'].value_counts() < 100).sum()
print(f'Number of different species: {n_species}')
print(f"Number of species with < 100 audio records: {n_rare}")
# Compute the share from the data instead of hard-coding the counts
percentage_low = "{:.2f}".format(n_rare / n_species * 100)
print(f"{percentage_low}% of the species are infrequent")

In [None]:
print(f"Minimum longitude: {training_metadata_df['longitude'].min()}")
print(f"Maximum longitude: {training_metadata_df['longitude'].max()}")
print(f"Minimum latitude: {training_metadata_df['latitude'].min()}")
print(f"Maximum latitude: {training_metadata_df['latitude'].max()}")

In [None]:
df = training_metadata_df.iloc[:, [0,3,4,8,9]]
gdf = gpd.GeoDataFrame(
    df, geometry=gpd.points_from_xy(df.longitude, df.latitude))
gdf_1 = gdf[(gdf.latitude<35) & (gdf.longitude>-30)]
gdf_2 = gdf[(gdf.latitude<10) & (gdf.longitude<-20)]
gdf_3 = gdf[(gdf.latitude>=35) & (gdf.longitude>=-40)]
gdf_4 = gdf[(gdf.latitude>=10) & (gdf.longitude<=-40)]
world = gpd.read_file(gpd.datasets.get_path('naturalearth_lowres'))
ax = world.plot(color='white', edgecolor='black')
# We can now plot our ``GeoDataFrame``.
gdf_1.plot(ax=ax, markersize=1)
gdf_2.plot(ax=ax, markersize=1)
gdf_3.plot(ax=ax, markersize=1)
gdf_4.plot(ax=ax, markersize=1)
plt.show()

In [None]:
total = np.sum([len(gdf_1), len(gdf_2), len(gdf_3), len(gdf_4)])
assert total == len(training_metadata_df.index)
print('The number of records for each quadrant of the earth: \n' +
       f'     -> {[len(gdf_1), len(gdf_3), len(gdf_2), len(gdf_4)]}\n' +
       f'     -> {["Af", "EU", "SA", "NA"]}')

### MetaData EDA observations
---

Most of the recordings come from North and South America and from Western and Northern Europe. Coverage across the rest of the map is sparse and uneven.

As per the Kaggle competition intro:

* "Some bird species may have local call 'dialects,' so you may want to seek geographic diversity in your training data". In addition, "while some bird calls can be made year round, such as an alarm call, some are restricted to a specific season. You may want to seek temporal diversity in your training data." 

Our EDA supports the geographic half of this advice: seeking geospatial diversity (and, by the same reasoning, temporal diversity) may be critical to keep sparsely represented recordings in the training data. For example, only 0.5% of our data comes from the southeastern quadrant of the globe.
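To make a figure like the quadrant share above easy to recompute, here is a minimal sketch that partitions coordinates into four non-overlapping quadrants and reports the fraction in each. The helper name `quadrant_shares` and the split points (the equator and 30°W) are assumptions chosen for illustration. They are not the thresholds used in the plotting cell above, and the coordinates are toy values rather than competition data.

```python
import numpy as np

def quadrant_shares(lat, lon, lat_split=0.0, lon_split=-30.0):
    """Return the fraction of points in each of four non-overlapping quadrants.

    Rows with missing coordinates should be dropped before calling this,
    because NaN comparisons are always False.
    """
    lat = np.asarray(lat, dtype=float)
    lon = np.asarray(lon, dtype=float)
    north = lat >= lat_split
    east = lon >= lon_split
    masks = {
        "NW": north & ~east,
        "NE": north & east,
        "SW": ~north & ~east,
        "SE": ~north & east,
    }
    return {name: float(mask.mean()) for name, mask in masks.items()}

# Toy coordinates (hypothetical, not taken from train_metadata.csv)
lat = [45.0, 40.0, -10.0, 52.0, -33.0]
lon = [-120.0, -75.0, -60.0, 5.0, 18.0]
shares = quadrant_shares(lat, lon)
print(shares)
```

Because every point falls into exactly one mask, the shares always sum to one, which gives a simple sanity check when the function is applied to the real `latitude` and `longitude` columns.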

## 2. Audio EDA

---

* Added subtle pink noise and removed silence
* Cropped each clip to seven seconds
* Converted the samples to uint8 to save memory
* Uploaded the result to Kaggle as the `audio-flacs-birdclef21` dataset
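
The preprocessing steps listed above could look roughly like the NumPy sketch below. This is not the code that produced the dataset. The helper names, the RMS-based silence rule, the noise level, and the 32 kHz sample rate are all assumptions, and the input is a synthetic tone rather than a real recording.

```python
import numpy as np

def pink_noise(n_samples, rng):
    """Approximate pink (1/f) noise by shaping white noise in the frequency domain."""
    spectrum = np.fft.rfft(rng.standard_normal(n_samples))
    freqs = np.fft.rfftfreq(n_samples)
    freqs[0] = freqs[1]  # avoid dividing by zero at DC
    noise = np.fft.irfft(spectrum / np.sqrt(freqs), n=n_samples)
    return noise / np.max(np.abs(noise))  # peak-normalise to [-1, 1]

def remove_silence(y, frame_len=1024, rel_threshold=0.05):
    """Drop frames whose RMS is below a fraction of the loudest frame's RMS."""
    n_frames = len(y) // frame_len
    frames = y[: n_frames * frame_len].reshape(n_frames, frame_len)
    rms = np.sqrt(np.mean(frames ** 2, axis=1))
    return frames[rms >= rel_threshold * rms.max()].ravel()

def crop_or_pad(y, n_samples):
    """Keep the first n_samples, zero-padding signals that are shorter."""
    return np.pad(y[:n_samples], (0, max(0, n_samples - len(y))))

def to_uint8(y):
    """Quantise a float signal in [-1, 1] to 8-bit unsigned integers."""
    return np.round((np.clip(y, -1.0, 1.0) + 1.0) * 127.5).astype(np.uint8)

sr = 32_000  # assumed sample rate
rng = np.random.default_rng(42)

# Synthetic 10 s recording: 4 s tone, 2 s silence, 4 s tone
t = np.arange(10 * sr) / sr
signal = 0.5 * np.sin(2 * np.pi * 440 * t)
signal[4 * sr : 6 * sr] = 0.0

noisy = signal + 0.005 * pink_noise(len(signal), rng)  # subtle pink noise
voiced = remove_silence(noisy)                         # drop the quiet gap
clip = to_uint8(crop_or_pad(voiced, 7 * sr))           # 7 s, 1 byte per sample
print(clip.dtype, clip.shape)
```

Storing one byte per sample instead of a 32-bit float cuts memory by a factor of four, at the cost of coarser amplitude resolution.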

In [None]:
audio_files = get_audio_files('../input/audio-flacs-birdclef21/audio_flac/')
audio_files

In [None]:
y, sr = torchaudio.load(audio_files[0])  # y has shape (channels, samples)
y = y.numpy()[0]  # keep the first channel as a 1-D array
print("Sample rate:", sr)
print("Signal length:", len(y))
print("Duration:", len(y)/sr, "seconds")
print("Signal:", y)
print("Shape:", y.shape)

In [None]:
# Anna's Hummingbird

annhum_audio = '../input/audio-flacs-birdclef21/audio_flac/annhum/XC57971.ogg.flac'
annhum, sr = torchaudio.load(annhum_audio)
plt.figure(figsize=(15, 5))
librosa.display.waveplot(annhum.numpy()[0], sr=sr)
Audio(annhum_audio)

In [None]:
# Blue Jay

blujay_audio = '../input/audio-flacs-birdclef21/audio_flac/blujay/XC108404.ogg.flac'
blujay, sr = torchaudio.load(blujay_audio)
plt.figure(figsize=(15, 5))
librosa.display.waveplot(blujay.numpy()[0], sr=sr)
Audio(blujay_audio)

In [None]:
# Plot previous species

gdf_blujay = gdf[gdf.primary_label == 'blujay']
gdf_annhum = gdf[gdf.primary_label == 'annhum']
world = gpd.read_file(gpd.datasets.get_path('naturalearth_lowres'))
ax = world.plot(color='white', edgecolor='black')
gdf_blujay.plot(ax=ax, markersize=0.3, label='Blue Jay (blujay)')
gdf_annhum.plot(ax=ax, markersize=0.3, label="Anna's Hummingbird (annhum)")
plt.legend()
plt.xlim([-150, -20])
plt.ylim([10, 75])
plt.show()

In [None]:
fig, ax = plt.subplots(nrows=2, ncols=1, sharex=True, figsize=(15, 10))

species = [(annhum, "Anna's Hummingbird"), (blujay, 'Blue Jay')]
for i, (wav, name) in enumerate(species):
    # STFT magnitude -> mel scale -> decibels
    sg0 = librosa.stft(wav.numpy()[0])
    sg_mag, sg_phase = librosa.magphase(sg0)
    sg1 = librosa.feature.melspectrogram(S=sg_mag, sr=sr)
    sg2 = librosa.amplitude_to_db(sg1, ref=np.min)
    librosa.display.specshow(sg2, sr=sr, y_axis='mel', fmax=8000, x_axis='time', ax=ax[i])
    ax[i].set(title=f'{name} Mel spectrogram')
    ax[i].label_outer()

## 3. Build a toy model with fastai to experiment
---

* Audio to melspec batch transform
* densenet121, as per [last comp's winning solution](https://www.kaggle.com/c/birdsong-recognition/discussion/183208)
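
To get a feel for the tensor that the mel-spectrogram transform hands to the CNN, a rough shape calculation helps. The sketch below uses the `n_fft`, `hop_length`, and `n_mels` values from the config defined in the following cells and assumes a centred STFT, which is the torchaudio default. The helper name `mel_spec_shape` is hypothetical, and the 32 kHz sample rate is an assumption; substitute the value printed in the Audio EDA section.

```python
def mel_spec_shape(n_samples, n_mels=64, n_fft=2048, hop_length=512, center=True):
    """Return (n_mels, n_frames) for a mono clip of n_samples samples."""
    if center:
        # A centred STFT pads both ends, so there is one frame per hop plus one
        n_frames = 1 + n_samples // hop_length
    else:
        # Without padding, the first window must fit entirely inside the signal
        n_frames = 1 + (n_samples - n_fft) // hop_length
    return n_mels, n_frames

sr = 32_000        # assumed sample rate
clip_seconds = 7   # clips were cropped to seven seconds
shape = mel_spec_shape(clip_seconds * sr)
print(shape)
```

Each clip therefore becomes a single-channel image whose height is the number of mel bands and whose width grows with clip length and shrinks with hop length, which is why the learner below is created with `n_in=1`.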

In [None]:
audio_files

In [None]:
training_metadata_df.sample(2)

In [None]:
class AudioConfig:
    """
    Custom mel-spectrogram config for BirdCLEF, consumed by `AudioToSpec.from_cfg`
    """
    birds = config_from_func(
        transforms.MelSpectrogram,
        "Voice",
        mel=True,      # use the mel scale
        to_db=True,    # convert amplitudes to decibels
        f_min=50.0,
        f_max=8000.0,
        n_fft=2048,
        n_mels=64,
        hop_length=2048 // 4  # 75% overlap between windows
    )

In [None]:
cfg = AudioConfig.birds()

batch_tfms = [AudioToSpec.from_cfg(cfg)]

get_y = lambda x: x.parent.name  # label = name of the species folder holding the file

audio_db = DataBlock(blocks = (AudioBlock, CategoryBlock),
                     get_items = get_audio_files,
                     splitter = RandomSplitter(),
                     batch_tfms = batch_tfms, #augments wouldn't work on batch
                     get_y=get_y)

In [None]:
dbunch = audio_db.dataloaders('../input/audio-flacs-birdclef21/audio_flac', bs=128)
dbunch.show_batch(figsize=(10, 5))

In [None]:
audio_db.summary('../input/audio-flacs-birdclef21/audio_flac')

In [None]:
from fastai.callback.data import CudaCallback

learn = cnn_learner(dbunch, 
            densenet121,
            n_in=1,
            loss_func=CrossEntropyLossFlat(),
            metrics=[error_rate],
            cbs=[CudaCallback]).to_fp16()

In [None]:
learn.lr_find() 

In [None]:
learn.fit_one_cycle(3, 6e-3)

In [None]:
learn.unfreeze()
learn.lr_find() 

In [None]:
learn.fit_one_cycle(3, lr_max=slice(1e-6,1e-4))

* Training after unfreezing doesn't appear to move the needle much.
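
For context on the `slice(1e-6, 1e-4)` passed to `fit_one_cycle` above: when given a slice with both ends set, fastai spreads the learning rates multiplicatively across the model's parameter groups, so early layers train more slowly than the head. The sketch below re-implements that spacing in plain Python. The helper name `discriminative_lrs` is hypothetical, and three parameter groups is an assumption about how the learner splits this model.

```python
def discriminative_lrs(lr_min, lr_max, n_groups):
    """Spread learning rates geometrically from lr_min to lr_max over n_groups."""
    if n_groups == 1:
        return [lr_max]
    # Constant ratio between consecutive groups
    step = (lr_max / lr_min) ** (1 / (n_groups - 1))
    return [lr_min * step ** i for i in range(n_groups)]

lrs = discriminative_lrs(1e-6, 1e-4, n_groups=3)  # 3 groups is an assumption
print([f"{lr:.0e}" for lr in lrs])
```

With these bounds the earliest layers receive a learning rate one hundred times smaller than the head, which may partly explain why the unfrozen epochs change so little.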

In [None]:
# Run test-time augmentation first so that preds and targs exist
preds, targs = learn.tta()

# Index of the most probable class for each validation clip
preds_list = preds.argmax(dim=1).tolist()
preds_tensor = TensorCategory(preds_list)

In [None]:
f1score = F1Score(average='micro')
print(f'Micro average F1 validation: {f1score(preds_tensor, targs).item()}')

In [None]:
np.savetxt("raw_pred_probabilities.csv", np.array(preds), delimiter=",")
# Class indices are integers, so write them without a float format
np.savetxt("final_preds.csv", np.array(preds_list), delimiter=",", fmt="%d")