## **INTRODUCTION**

### * Goal: In this notebook, I'll use TensorFlow and Keras to identify bird species by sound. Specifically, I'll develop a model that processes continuous audio data and recognizes the species acoustically. The aim is to help advance the science of bioacoustics and support ongoing research to protect endangered birds.
### * This notebook is part of the BirdCLEF 2022 contest sponsored by the Cornell Lab of Ornithology. It will be evaluated on 21 classes, for which there is not much data: just 1266 entries (based on primary_label).

## **ALTERNATIVE SOLUTION**

### Strategy:
### 1) Filter out audio files that contain no information. E.g.: XC182414.ogg.
![XC182414](https://i.pinimg.com/originals/58/13/d2/5813d2f6fd6832aa64fb67a78111b175.jpg)
### 2) Filter out or clean audio files with too much noise. E.g.: XC663738.ogg.
![XC663738](https://i.pinimg.com/originals/88/f3/1e/88f31e47b1b57979cc9fa3dabc5d5e44.jpg)
### 3) Identify and keep the most common song of each bird. Many of these audio files mix two or three songs, which makes it impossible for even the best model to work properly. E.g.: XC644916.ogg.
![XC644916](https://i.pinimg.com/originals/a9/e1/20/a9e12057a653959a60323c7ad649abf5.jpg)
### 4) Develop a model with TensorFlow and Keras on Kaggle, starting with two classes. Then progressively increase the number of classes until we reach the 21 requested classes.
### 5) Verify our model with another AI provider. For example, I ran tests with the Edge Impulse AI provider (https://www.edgeimpulse.com/) and achieved an accuracy of 96.9% with bird audio and two classes, as shown below:
![Edge Impulse test](https://i.pinimg.com/originals/de/a3/98/dea398c65ecd5332aef56aa2c518fb56.jpg)
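
Step 1 of the strategy could be approximated with a simple energy check. The snippet below is only a minimal sketch, not the filter actually used for this notebook: the helper name `is_silent` and the -60 dB threshold are hypothetical choices, and a real filter would need tuning against files such as XC182414.ogg.

```python
import numpy as np

def is_silent(signal: np.ndarray, threshold_db: float = -60.0) -> bool:
    """Return True if the clip's RMS level (dB relative to full scale) is below threshold_db.

    threshold_db is an illustrative assumption, not a value taken from the competition data.
    """
    if signal.size == 0:
        return True
    rms = np.sqrt(np.mean(np.square(signal, dtype=np.float64)))
    # Clamp the RMS so that an all-zero clip does not produce log10(0)
    level_db = 20.0 * np.log10(max(rms, 1e-12))
    return bool(level_db < threshold_db)

rng = np.random.default_rng(0)
empty_clip = np.zeros(32000)                    # one second of digital silence at 32 kHz
noisy_clip = 0.1 * rng.standard_normal(32000)   # one second of synthetic noise
print(is_silent(empty_clip), is_silent(noisy_clip))
```

Files flagged by such a check could then be dropped from `train` before the spectrograms are generated.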

## **LIBRARIES**
### Importing dependencies.


In [None]:
import os
import json
import tqdm
import librosa
import librosa.display
import numpy as np
import pandas as pd
import seaborn as sns
from PIL import Image
import plotly.express as px
import IPython.display as ipd
import matplotlib.pyplot as plt

### Library downloaded from: https://www.wheelodex.org/projects/noisereduce/

In [None]:
!pip install ../input/noisereduce/noisereduce-2.0.0-py3-none-any.whl

## **LOAD DATA**

### According to the Cornell Lab, many bird songs have frequency ranges between 1,000 Hz and 8,000 Hz. In addition, the frequency response of a common microphone is between 100 and 10,000 Hz.
https://www.allaboutbirds.org/news/do-bird-songs-have-frequencies-higher-than-humans-can-hear/#:~:text=Many%20bird%20songs%20have%20frequency,reach%208%2C000%20Hz%20and%20beyond.
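
To see how that frequency range relates to the settings used in this notebook, the sketch below counts how many FFT bins fall inside the `FMIN` to `FMAX` band. It assumes the values from the configuration cell (`SAMPLE_RATE = 32000`, `FMIN = 500`, `FMAX = 8500`) and `n_fft = 1024`; it is only an arithmetic check, not part of the pipeline.

```python
import numpy as np

SAMPLE_RATE = 32000   # Hz, same value as the configuration cell
N_FFT = 1024          # FFT size used for the mel spectrograms
FMIN, FMAX = 500, 8500

# The highest frequency a sampled signal can represent
nyquist = SAMPLE_RATE / 2

# Center frequency of each non-negative FFT bin (spacing is SAMPLE_RATE / N_FFT)
freqs = np.arange(N_FFT // 2 + 1) * SAMPLE_RATE / N_FFT
in_band = (freqs >= FMIN) & (freqs <= FMAX)

print(f"Nyquist frequency: {nyquist:.0f} Hz")
print(f"{in_band.sum()} of {freqs.size} FFT bins fall inside {FMIN}-{FMAX} Hz")
```

Restricting the mel filter bank to this band discards the bins above 8,500 Hz and below 500 Hz, where the text above suggests little bird song energy is expected.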

In [None]:
seed = 42
os.environ['PYTHONHASHSEED'] = str(seed)
np.random.seed(seed)

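DURATION = 15            # seconds of audio loaded from each training file
SPEC_SHAPE = (48, 128)   # (mel bands, time frames) of each spectrogram
SAMPLE_RATE = 32000      # Hz
TEST_DURATION = 5        # seconds per chunk
FMIN = 500               # lowest mel frequency in Hz
FMAX = 8500              # highest mel frequency in Hz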

In [None]:
import ast

main_dir = '../input/birdclef-2022'
train_audio_dir = main_dir + '/train_audio'
test_audio_dir = main_dir + '/test_soundscapes'
train = pd.read_csv(main_dir + '/train_metadata.csv')
# Round the recording time to the nearest half hour
train['time_dt'] = pd.to_datetime(train['time'], errors='coerce')
train['time_dt'] = train['time_dt'].dt.round('30min')
train['time_H_M'] = train['time_dt'].dt.strftime('%H:%M')
# Assumes secondary_labels holds a stringified Python list (e.g. "[]"), so parse it
# before counting; splitting on commas would report 1 for an empty list
train['secondary_label_len'] = train.secondary_labels.apply(lambda x: len(ast.literal_eval(x)))
test = pd.read_csv(main_dir + '/test.csv')
submission = pd.read_csv(main_dir + '/sample_submission.csv')
taxonomy = pd.read_csv(main_dir + '/eBird_Taxonomy_v2021.csv')
# Use a context manager so the file handle is closed after reading
with open(main_dir + '/scored_birds.json', 'r') as f:
    scored_birds = json.load(f)

In [None]:
train.head(5)

In [None]:
test.head(5)

In [None]:
submission.head(5)

In [None]:
taxonomy.head(5)

In [None]:
print("There are {} unique classes, but we will be evaluated on only {} of them".format(len(train.primary_label.unique()), len(scored_birds)))

In [None]:
print(scored_birds)

In [None]:
fig, ax = plt.subplots(figsize=(30, 8))
sns.countplot(data=train, x='primary_label', ax=ax, order=train['primary_label'].value_counts().index)
plt.xticks(rotation=90);

## **SPECTROGRAMS**
### Here you can try n_fft values of 256, 512, or 1024.
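
The choice of FFT size trades frequency resolution against time resolution. The sketch below tabulates both for the three suggested values, assuming the 32,000 Hz sample rate from the configuration cell and a window as long as the FFT; the helper `stft_resolution` is hypothetical and only illustrates the arithmetic.

```python
SAMPLE_RATE = 32000  # Hz, same value as the configuration cell

def stft_resolution(n_fft: int, sample_rate: int = SAMPLE_RATE):
    """Return (frequency bin width in Hz, window duration in ms) for a given FFT size.

    Assumes the analysis window is as long as the FFT (win_length == n_fft).
    """
    freq_res_hz = sample_rate / n_fft
    window_ms = 1000.0 * n_fft / sample_rate
    return freq_res_hz, window_ms

for n_fft in (256, 512, 1024):
    freq_res_hz, window_ms = stft_resolution(n_fft)
    print(f"n_fft={n_fft}: {freq_res_hz} Hz per bin, {window_ms} ms per window")
```

A larger FFT separates nearby pitches better but smears short calls over a longer window, so the best value probably depends on the species.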

In [None]:
import torch
import torchaudio
import noisereduce as nr
from math import ceil

def create_spectrogram(
    fname: str,
    reduce_noise: bool = False,
    frame_size: int = 5,
    frame_step: int = 2,
    channel: int = 0,
    device = "cpu",
) -> list:
    waveform, sample_rate = torchaudio.load(fname)
    
    transform = torchaudio.transforms.Spectrogram(n_fft=1024, win_length=512).to(device)
    if reduce_noise:
        waveform = torch.tensor(nr.reduce_noise(
            y=waveform,
            sr=sample_rate,
            win_length=transform.win_length,
            use_tqdm=False,
            n_jobs=2,
        ))
    step = int(frame_step * sample_rate)
    size = int(frame_size * sample_rate)
    spectrograms = []
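    # Run at least one iteration so that clips no longer than frame_size
    # still reach the padding branch below instead of being skipped
    for i in range(max(1, ceil((waveform.size()[-1] - size) / step))):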
        begin = i * step
        frame = waveform[channel][begin:begin + size]
        if len(frame) < size:
            if i == 0:
                rep = round(float(size) / len(frame))
                frame = frame.repeat(int(rep))
            elif len(frame) < (size * 0.33):
                continue
            else:
                frame = waveform[channel][-size:]
        sg = transform(frame.to(device))
        spectrograms.append(np.nan_to_num(torch.log(sg).numpy()))
        # spectrograms.append(np.nan_to_num(sg.numpy()))
    return spectrograms


path_audio = os.path.join(train_audio_dir, train["filename"][0])
print(path_audio)
sgs = create_spectrogram(path_audio, reduce_noise=True)


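# squeeze=False keeps axarr two-dimensional even when there is only one spectrogram
fig, axarr = plt.subplots(ncols=len(sgs), figsize=(4 * len(sgs), 4), squeeze=False)
for i, sg in enumerate(sgs):
    im = axarr[0, i].imshow(sg, vmin=-50, vmax=10)
plt.colorbar(im)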

In [None]:
from sklearn.preprocessing import OneHotEncoder

ohe = OneHotEncoder()
ohe.fit(train.primary_label.unique().reshape(-1, 1))

## **GENERATING THE TRAIN DATA**

In [None]:
%%time
X = []
Y = []
for ul in tqdm.tqdm(train.primary_label.unique()):
    records = train[train.primary_label==ul]
    for r in records[['filename','primary_label','secondary_labels']].values:
        file = r[0]
        pl = r[1]
        sl = r[2]
        y = ohe.transform(np.array([pl]).reshape(-1, 1)).todense()
        arr, sr = librosa.load(os.path.join(train_audio_dir, file), sr=SAMPLE_RATE, duration=DURATION)
        chunks = []
        for c_ in range(0, len(arr), (TEST_DURATION*SAMPLE_RATE)):
            chunk = arr[c_:c_ + TEST_DURATION * SAMPLE_RATE]
            if len(chunk) < int(TEST_DURATION * SAMPLE_RATE):
                break
            chunks.append(chunk)
        y_arr = []
        mel_chunks = []
        for c_ in chunks:
            hop_length = int(TEST_DURATION * SAMPLE_RATE / (SPEC_SHAPE[1] - 1))
            #Extract Mel Spec
            mel_spec = librosa.feature.melspectrogram(y=c_,sr=SAMPLE_RATE,n_fft=1024, hop_length=hop_length, 
                                                  n_mels=SPEC_SHAPE[0], fmin=FMIN, fmax=FMAX)
    
            mel_spec = librosa.power_to_db(mel_spec, ref=np.max) 
            # Normalize
            mel_spec = (mel_spec - mel_spec.min())/(mel_spec.max() - mel_spec.min())
            mel_chunks.append(np.asarray(Image.fromarray(mel_spec * 255.0).convert("L")))
            y_arr.append(y)
        # Use the encoder's class count instead of hard-coding 152
        y_arr = np.array(y_arr).reshape(-1, len(ohe.categories_[0]))
        mel_chunks = np.array(mel_chunks)
        X.extend(mel_chunks)
        Y.extend(y_arr)
        
X = np.array(X)
Y = np.array(Y) 

In [None]:
print(X.shape,Y.shape)
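
The chunking and hop-length arithmetic from the loop above can be checked in isolation. This is a minimal sketch using synthetic zero-filled signals and the constants from the configuration cell; `split_into_chunks` is a hypothetical helper that mirrors the loop's behavior of dropping a trailing chunk shorter than 5 seconds.

```python
import numpy as np

SAMPLE_RATE = 32000
DURATION = 15
TEST_DURATION = 5
SPEC_SHAPE = (48, 128)

def split_into_chunks(signal: np.ndarray, chunk_len: int) -> list:
    """Split a 1-D signal into consecutive full-length chunks, dropping a short remainder."""
    n_full = len(signal) // chunk_len
    return [signal[i * chunk_len:(i + 1) * chunk_len] for i in range(n_full)]

chunk_len = TEST_DURATION * SAMPLE_RATE
# Hop length chosen so that each 5-second chunk yields SPEC_SHAPE[1] frames
hop_length = int(chunk_len / (SPEC_SHAPE[1] - 1))
# Frame count of a centered STFT: one frame per hop plus the initial frame
n_frames = 1 + chunk_len // hop_length

full_clip = np.zeros(DURATION * SAMPLE_RATE)   # 15 seconds -> three chunks
short_clip = np.zeros(12 * SAMPLE_RATE)        # 12 seconds -> two chunks, 2 s dropped
print(len(split_into_chunks(full_clip, chunk_len)), len(split_into_chunks(short_clip, chunk_len)))
print(hop_length, n_frames)
```

Each training file therefore contributes at most three spectrograms of shape `SPEC_SHAPE`, and files shorter than 5 seconds contribute none.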

## **MODEL TRAINING**
### Now, let's build our Sequential neural network model using the Adam optimizer (try RMSProp or other optimizers and see if you can squeeze out some more accuracy!). My network architecture consists of 2 convolutional layers with an increasing number of filters, so that each successive layer extracts more features from each image (I have also tried up to four 2D convolutional layers). The pooling and dropout layers serve to increase computational efficiency and to prevent overfitting, respectively. I also resize the spectrogram to 48 x 48 in order to reduce the data processing time.
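
As a rough guide to how the layers shrink the input, the sketch below traces the tensor shape by hand. It rests on assumptions that should be checked against `model.summary()`: that the convolutions use "valid" padding with kernel size 3, that a `Conv1D` layer applied to a 4-D input only convolves along the second-to-last axis (the behavior I would expect from TensorFlow 2.x Keras), and that there are 152 classes. The helper names are hypothetical.

```python
def conv_valid(length: int, kernel: int, stride: int = 1) -> int:
    """Output length of a convolution with 'valid' padding."""
    return (length - kernel) // stride + 1

def pool(length: int, size: int = 2) -> int:
    """Output length of max pooling with equal pool size and stride."""
    return length // size

N_CLASSES = 152  # assumed number of unique primary labels

height, width = 48, 128                 # input spectrogram
width = conv_valid(width, 3)            # first conv layer (8 filters): 128 -> 126
height, width = 48, 48                  # Resizing(48, 48)
height, width = pool(height), pool(width)   # first pooling: 24 x 24
width = conv_valid(width, 3)            # second conv layer (16 filters): 24 -> 22
height, width = pool(height), pool(width)   # second pooling: 12 x 11
flat = height * width * 16              # Flatten over 16 filters
dense_params = (flat + 1) * N_CLASSES   # weights plus one bias per class

print(height, width, flat, dense_params)
```

If the convolutions were 2-D instead, the height would shrink by two at each convolution as well, so the flattened size would be smaller.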

In [None]:
from tensorflow.keras import layers
from tensorflow.keras import models
import tensorflow as tf
import tensorflow_addons as tfa

tf.random.set_seed(seed)
model = tf.keras.Sequential([
    tf.keras.layers.Conv1D(8, 3, 1, activation='relu', 
                           input_shape=(48, 128, 1)),
    tf.keras.layers.Resizing(48, 48),    
    tf.keras.layers.BatchNormalization(),
    tf.keras.layers.MaxPooling2D(),
    tf.keras.layers.Dropout(0.25),  
    
    tf.keras.layers.Conv1D(16, 3, 1, activation='relu'),
    tf.keras.layers.BatchNormalization(),
    tf.keras.layers.MaxPooling2D(), 
    tf.keras.layers.Dropout(0.25),  
        
    tf.keras.layers.Flatten(),     
    tf.keras.layers.Dense(len(train.primary_label.unique()), activation='softmax')
])

model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy', tfa.metrics.F1Score(num_classes=len(train.primary_label.unique()))])
model.summary()

In [None]:
callbacks = [tf.keras.callbacks.EarlyStopping(monitor='val_loss', 
                                              verbose=1,
                                              patience=150)]

history = model.fit(np.expand_dims(X, -1), Y, batch_size =128, epochs=150, validation_split = 0.2,
         callbacks=callbacks)
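
The effect of the `patience` argument can be illustrated with a simplified stand-in for the callback. This is a sketch of the general idea under my own assumptions, not Keras's implementation; `early_stopping_epoch` and the loss values are hypothetical.

```python
def early_stopping_epoch(val_losses, patience):
    """Return the 0-based epoch at which training would stop, or None if it never stops.

    Training stops once the monitored loss has failed to improve for `patience`
    consecutive epochs.
    """
    best = float("inf")
    wait = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best = loss
            wait = 0
        else:
            wait += 1
            if wait >= patience:
                return epoch
    return None

losses = [1.0, 0.8, 0.9, 0.85, 0.95, 0.7]
print(early_stopping_epoch(losses, patience=3))
print(early_stopping_epoch(losses, patience=150))
```

With `patience=150` and `epochs=150`, as in the cell above, the stopping condition can never be reached, so all 150 epochs run. A smaller patience would be needed for early stopping to have any effect.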

In [None]:
# saving our model for later use

model.save('model.h5')

## **EVALUATION OF THE MODEL**

In [None]:
# Plotting the accuracy of the model over the epochs

plt.figure(figsize=(15,5))
plt.plot(history.history['accuracy'])
plt.title('Model accuracy')
plt.ylabel('Accuracy')
plt.xlabel('Epoch')
plt.show()

In [None]:
# Plotting the loss of the model over the epochs

plt.figure(figsize=(15,5))
plt.plot(history.history['loss'])
plt.title('Model loss')
plt.ylabel('loss')
plt.xlabel('Epoch')
plt.show()

In [None]:
# Plotting the validation loss of the model over the epochs

plt.figure(figsize=(15,5))
plt.plot(history.history['val_loss'])
plt.title('Model val_loss')
plt.ylabel('val_loss')
plt.xlabel('Epoch')
plt.show()

In [None]:
main_dir = '../input/birdclef-2022'
test = pd.read_csv(main_dir+'/test.csv') 
submission = pd.read_csv(main_dir+'/sample_submission.csv')

In [None]:
test.head()

In [None]:
submission.head()

In [None]:
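import os
import random

test_audio_dir = '../input/birdclef-2022/test_soundscapes'
labels = ohe.categories_[0]  # class names in the order used by the one-hot encoder
chunk_len = TEST_DURATION * SAMPLE_RATE
hop_length = int(chunk_len / (SPEC_SHAPE[1] - 1))

for idx in tqdm.tqdm(range(len(test))):
    audio_id = test.loc[idx, 'file_id']
    true_label = test.loc[idx, 'bird']
    end_time = int(test.loc[idx, 'end_time'])

    # Append the extension to the file name rather than passing it as a path component
    path = os.path.join(test_audio_dir, audio_id + '.ogg')

    if os.path.isfile(path):
        # Load only the 5-second window that ends at end_time
        sig, sr = librosa.load(path, sr=SAMPLE_RATE,
                               offset=end_time - TEST_DURATION, duration=TEST_DURATION)
        # Zero-pad a short final window to the full chunk length
        sig = np.pad(sig, (0, max(0, chunk_len - len(sig))))

        # The original cell called helpers that are not defined in this notebook;
        # here the window gets the same mel preprocessing as the training data
        spec = librosa.feature.melspectrogram(y=sig, sr=SAMPLE_RATE, n_fft=1024, hop_length=hop_length,
                                              n_mels=SPEC_SHAPE[0], fmin=FMIN, fmax=FMAX)
        spec = librosa.power_to_db(spec, ref=np.max)
        spec = (spec - spec.min()) / (spec.max() - spec.min())
        spec = np.asarray(Image.fromarray(spec * 255.0).convert("L"))

        # Add batch and channel axes, then pick the most likely class
        output = model.predict(spec[np.newaxis, ..., np.newaxis], verbose=0)
        pred = int(np.argmax(output, axis=1)[0])
        submission.loc[idx, 'target'] = bool(labels[pred] == true_label)
    else:
        # Fall back to a random guess when the audio file is missing
        submission.loc[idx, 'target'] = bool(random.randint(0, 1))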

## **MAKE SUBMISSION**

In [None]:
submission.head()

In [None]:
submission.to_csv('submission.csv', index=False)

In [None]:
print('Done!')

## **CONCLUSION** 
### * There is no perfect model, and I will continue working on this one.
### * If you like this notebook, please upvote!
### * As I said in the introduction of this notebook, precision should improve once all the audio data has been filtered, either by the sponsor or by users, which takes a lot of time.
