<img src="https://i.pinimg.com/originals/a8/97/b2/a897b2f00156fdebec75f7478a330c7d.jpg" alt="drawing" height="100%" width="100%" class="center"/>

## Understand Business Problem <a id="1"></a>

In this challenge, you identify which birds are calling in long recordings, given training data generated in meaningfully different contexts. This is the exact problem facing scientists trying to automate the remote monitoring of bird populations.


## Data Overview <a id="2"></a>


`train_audio` : The training data consists of short recordings of individual bird calls generously uploaded by users of xeno-canto.org.

`test_audio` : The hidden test_audio directory contains approximately 150 recordings in mp3 format, each roughly 10 minutes long. They will not all fit in a notebook's memory at the same time. The recordings were taken at three separate remote locations. Sites 1 and 2 were labeled in 5-second increments and need matching predictions, but because labeling is so time-consuming, the site 3 files are labeled only at the file level. Accordingly, site 3 has relatively few rows in the test set and needs lower time-resolution predictions.
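
The 5-second labelling for sites 1 and 2 suggests processing each long recording window by window rather than holding every file in memory. A minimal sketch of how the window boundaries and the matching `seconds` value could be derived. The helper `iter_windows` and the 32 kHz sample rate are assumptions for illustration, not part of the competition data:

```python
def iter_windows(n_samples, sample_rate=32000, window_seconds=5):
    """Yield (seconds, start, end) for consecutive non-overlapping windows.

    `seconds` is the second that ends the window, matching the `seconds`
    column of test.csv. A trailing partial window is dropped, which is an
    assumption of this sketch.
    """
    window = sample_rate * window_seconds
    for i in range(n_samples // window):
        start = i * window
        yield (i + 1) * window_seconds, start, start + window


# A 12-second clip yields two full windows, ending at 5 s and 10 s.
windows = list(iter_windows(n_samples=12 * 32000))
```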

Two example soundscapes from another data source are also provided to illustrate how the soundscapes are labeled and how the hidden dataset folder is structured. The two example audio files are `BLKFR-10-CPL_20190611_093000.pt540.mp3` and `ORANGE-7-CAP_20190606_093000.pt623.mp3`. These soundscapes were kindly provided by Jack Dumbacher of the California Academy of Sciences' Department of Ornithology and Mammalogy.

`test.csv` : Only the first three rows are available for download; the full `test.csv` is in the hidden test set.

- `site`: Site ID.

- `row_id`: ID code for the row.

- `seconds`: the second ending the time window, if any. Site 3 time windows cover the entire audio file and have null entries for seconds.

- `audio_id`: ID code for the audio file.

`example_test_audio_metadata.csv` : Complete metadata for the example test audio. These labels have higher time precision than is used for the hidden test set.

`example_test_audio_summary.csv` : Metadata for the example test audio, converted to the same format as used in the hidden test set.

- `filename_seconds`: a row identifier.

- `birds`: all ebird codes present in the time window.

- `filename`: the name of the audio file.

- `seconds`: the second ending the time window.

`train.csv` : A wide range of metadata is provided for the training data. The most directly relevant fields are:

- `ebird_code`: a code for the bird species. You can review detailed information about the bird codes by appending the code to https://ebird.org/species/, such as https://ebird.org/species/amecro for the American Crow.

- `recordist`: the user who provided the recording.

- `location`: where the recording was taken. Some bird species may have local call 'dialects', so you may want to seek geographic diversity in your training data.

- `date`: while some bird calls can be made year round, such as an alarm call, some are restricted to a specific season. You may want to seek temporal diversity in your training data.

- `filename`: the name of the associated audio file.
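
The `location` and `date` notes above suggest checking how varied each species' recordings are before training. A minimal sketch with made-up rows standing in for `train.csv`. The column names come from the list above, the values are invented, and real dates may need `errors="coerce"` if some are malformed:

```python
import pandas as pd

# Toy rows standing in for train.csv; the values are made up for illustration.
toy = pd.DataFrame({
    "ebird_code": ["amecro", "amecro", "amecro", "aldfly", "aldfly"],
    "location": ["Ohio", "Texas", "Ohio", "Maine", "Maine"],
    "date": ["2019-05-01", "2019-11-12", "2020-01-03", "2018-06-20", "2018-06-21"],
})

# Month of recording as a rough proxy for season.
toy["month"] = pd.to_datetime(toy["date"]).dt.month

# Count distinct locations and months per species.
diversity = toy.groupby("ebird_code").agg(
    n_locations=("location", "nunique"),
    n_months=("month", "nunique"),
)
```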

In [None]:
!pip install pretrainedmodels
!pip install pydub

In [None]:
import numpy as np
import pandas as pd
import librosa
import matplotlib.pyplot as plt
import seaborn as sns
from tqdm import tqdm
import random

from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import KFold, StratifiedKFold
import math
from collections import OrderedDict

from PIL import Image
import albumentations
from pydub import AudioSegment

import torch
import torch.nn as nn
import torch.nn.functional as F

from torch.utils.data import Dataset, DataLoader

import pretrainedmodels

import warnings
warnings.filterwarnings('ignore')

## Preprocessing <a id="3"></a>

In [None]:
train = pd.read_csv("../input/birdsong-recognition/train.csv")
test = pd.read_csv("../input/birdsong-recognition/test.csv")
submission = pd.read_csv("../input/birdsong-recognition/sample_submission.csv")

### e-bird code

A code for the bird species. We need to predict `ebird_code` from the metadata and audio data.


In [None]:
print("Number of Unique birds : ", train.ebird_code.nunique())

### top10 Birds

We keep only the 10 most frequently recorded birds to build a starter model.

In [None]:
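# Keep only the 10 species with the most recordings.
top10_birds = list(train.ebird_code.value_counts().index[:10])

# .copy() avoids assigning a new column into a view of the original frame.
train = train[train.ebird_code.isin(top10_birds)].copy()

# Label-encode the targets; keep the encoder so predictions can be mapped back to codes.
label_encoder = LabelEncoder()
train["ebird_label"] = label_encoder.fit_transform(train.ebird_code.values)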
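
`LabelEncoder` assigns integers in sorted order of the class names, and `inverse_transform` maps them back, which is what turns predicted labels into `ebird_code` values again. A minimal sketch with a few toy codes:

```python
from sklearn.preprocessing import LabelEncoder

codes = ["amecro", "aldfly", "amecro", "annhum"]  # toy species codes

encoder = LabelEncoder()
labels = encoder.fit_transform(codes)

# Classes are sorted alphabetically, so aldfly -> 0, amecro -> 1, annhum -> 2.
decoded = encoder.inverse_transform(labels)
```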

### K-Fold

In [None]:
train.loc[:, "kfold"] = -1

train = train.sample(frac=1, random_state=42).reset_index(drop=True)  # shuffle reproducibly

X = train.filename.values
y = train.ebird_code.values

kfold = StratifiedKFold(n_splits=5)

for fold, (t_idx, v_idx) in enumerate(kfold.split(X, y)):
    train.loc[v_idx, "kfold"] = fold

print(train.kfold.value_counts())
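
`StratifiedKFold` keeps the class proportions of the whole set in every validation fold. A minimal sketch on toy targets (two made-up classes in a 2:1 ratio) that counts the classes per fold:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Toy targets: 10 recordings of one species and 5 of another.
y = np.array(["a"] * 10 + ["b"] * 5)
X = np.zeros((len(y), 1))  # features are ignored when computing the split

fold_counts = []
for _, v_idx in StratifiedKFold(n_splits=5).split(X, y):
    values, counts = np.unique(y[v_idx], return_counts=True)
    fold_counts.append(dict(zip(values.tolist(), counts.tolist())))
```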

### Arguments

In [None]:
class args:
    
    ROOT_PATH = "../input/birdsong-recognition/train_audio"
    
    num_classes = 10
    max_duration= 5 # seconds
    
    sample_rate = 32000
    
    img_height = 128
    img_width = 313
    std = (0.229, 0.224, 0.225)
    mean = (0.485, 0.456, 0.406)
    
    batch_size = 16
    num_workers = 4
    epochs = 5
    
    lr = 0.001
    wd = 1e-5
    momentum = 0.9
    eps = 1e-8
    betas = (0.9, 0.999)
    
    melspectrogram_parameters = {
        "n_mels": 128,
        "fmin": 20,
        "fmax": 16000
    }
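
The value `img_width = 313` is consistent with a 5-second clip at 32 kHz if the spectrogram uses a hop length of 512 samples with centred frames, which is librosa's default. This is an inference from the numbers rather than something the notebook states. The arithmetic, as a sketch:

```python
sample_rate = 32000
max_duration = 5      # seconds
hop_length = 512      # assumed: librosa's default hop length

n_samples = sample_rate * max_duration
# With centred framing, the number of frames is 1 + n_samples // hop_length.
n_frames = 1 + n_samples // hop_length
```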
    

### Loading Audio Files

In [None]:
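def load_audio(path):
    """Load an mp3 as a mono sample array, truncated to args.max_duration seconds."""
    try:
        sound = AudioSegment.from_mp3(path)
        # Resample and downmix so stereo files do not yield interleaved samples.
        sound = sound.set_frame_rate(args.sample_rate).set_channels(1)
        sound = sound[:args.max_duration * 1000]  # pydub slices in milliseconds
        sound_array = np.array(sound.get_array_of_samples())
    except Exception:
        # Fallback for unreadable files: random noise of the expected length.
        sound_array = np.random.rand(args.sample_rate * args.max_duration)

    return sound_array, args.sample_rate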

In [None]:
# Example
sound, sample_rate = load_audio("../input/birdsong-recognition/train_audio/ameavo/XC139921.mp3")
plt.plot(sound)
plt.show()
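
`get_array_of_samples()` returns integer PCM samples, so the waveform plotted above is on the integer scale of the file's sample width. A minimal sketch, assuming 16-bit audio, of rescaling to the range [-1, 1]. The helper `pcm16_to_float` is hypothetical and is not used elsewhere in this notebook:

```python
import numpy as np

def pcm16_to_float(samples):
    """Scale 16-bit PCM integers to float32 in [-1.0, 1.0)."""
    return np.asarray(samples, dtype=np.float32) / 32768.0

pcm = np.array([0, 16384, -32768, 32767], dtype=np.int16)
wave = pcm16_to_float(pcm)
```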

### Melspectrogram

- Extract mel-spectrogram features from the raw audio using librosa.

In [None]:
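def Melspectrogram(audio_path):

    y, sr = load_audio(audio_path)
    y = y.astype(float)

    # Pass y by keyword: recent librosa versions no longer accept it positionally.
    melspec = librosa.feature.melspectrogram(y=y, sr=sr, **args.melspectrogram_parameters)
    # Convert power to decibels, which compresses the dynamic range.
    melspec = librosa.power_to_db(melspec).astype(np.float32)

    return melspec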


# Example
spect = Melspectrogram("../input/birdsong-recognition/train_audio/ameavo/XC139921.mp3")
print("shape ", spect.shape)

plt.figure(figsize=(5 ,30))
plt.imshow(spect)
plt.show()
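
`librosa.power_to_db` converts the power spectrogram to decibels. A NumPy sketch of the same arithmetic under librosa's documented defaults (`ref=1.0`, `amin=1e-10`, `top_db=80.0`), written out only to show what the conversion does. The name `power_to_db_sketch` is hypothetical:

```python
import numpy as np

def power_to_db_sketch(power, ref=1.0, amin=1e-10, top_db=80.0):
    """10 * log10(power / ref), floored at `amin` and clipped to `top_db` below the peak."""
    power = np.asarray(power, dtype=np.float64)
    db = 10.0 * np.log10(np.maximum(amin, power))
    db -= 10.0 * np.log10(np.maximum(amin, ref))
    # Limit the dynamic range to top_db decibels below the maximum.
    return np.maximum(db, db.max() - top_db)

db = power_to_db_sketch(np.array([1.0, 0.1, 1e-12]))
```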

### PyTorch DataLoader

In [None]:
class BirdDataset(Dataset):
    def __init__(self, df):
        
        self.filename = df.filename.values
        self.ebird_label = df.ebird_label.values
        self.ebird_code = df.ebird_code.values
    
    def __len__(self):
        return len(self.filename)
    
    def __getitem__(self, item):
        
        filename = self.filename[item]
        ebird_code = self.ebird_code[item]
        ebird_label = self.ebird_label[item]

        spect = Melspectrogram(f"{args.ROOT_PATH}/{ebird_code}/{filename}")
        
        spect = torch.FloatTensor(spect)
        target = ebird_label
        
        return spect, target
    
def _collate_fn(batch):
    def func(p):
        return p[0].size(1)
    
    batch = sorted(batch, key=lambda sample: sample[0].size(1), reverse=True)
    longest_sample = max(batch, key=func)[0]
    freq_size = longest_sample.size(0)
    minibatch_size = len(batch)
    max_seqlength = longest_sample.size(1)
    inputs = torch.zeros(minibatch_size, 1, freq_size, max_seqlength)
    input_percentages = torch.FloatTensor(minibatch_size)
    targets = []
    for x in range(minibatch_size):
        sample = batch[x]
        tensor = sample[0]
        target = sample[1]
        seq_length = tensor.size(1)
        inputs[x][0].narrow(1, 0, seq_length).copy_(tensor)
        input_percentages[x] = seq_length / float(max_seqlength)
        targets.append(target)

    targets = torch.tensor(targets, dtype=torch.long)
    return inputs, targets, input_percentages


class AudioDataLoader(DataLoader):
    
    def __init__(self, *args, **kwargs):
    
        """
        Creates a data loader for AudioDatasets
        """
        super(AudioDataLoader, self).__init__(*args, **kwargs)
        self.collate_fn = _collate_fn
    

In [None]:
# Example 
dataset = BirdDataset(train)
d = dataset.__getitem__(10)

d[0].shape, d[1]

In [None]:
loader = AudioDataLoader(dataset, batch_size=2)

In [None]:
for i, d in enumerate(loader):
    print(d[0].shape)
    print(d[1].shape)
    if i == 0:
        break
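
The collate function zero-pads every spectrogram to the longest one in the batch and records each sample's length as a fraction of that maximum. The same idea in plain NumPy, with made-up shapes, as an illustration rather than a replacement for `_collate_fn`:

```python
import numpy as np

# Two toy "spectrograms" with 4 mel bins and different numbers of frames.
batch = [np.ones((4, 3)), np.ones((4, 5))]

# Sort longest first, as the collate function does.
batch = sorted(batch, key=lambda s: s.shape[1], reverse=True)
max_len = batch[0].shape[1]

inputs = np.zeros((len(batch), 1, 4, max_len))
percentages = np.zeros(len(batch))
for i, spec in enumerate(batch):
    inputs[i, 0, :, :spec.shape[1]] = spec   # copy, leaving the tail as zeros
    percentages[i] = spec.shape[1] / max_len
```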

### DeepSpeech2 Model

Reference : https://github.com/SeanNaren/deepspeech.pytorch/blob/master/model.py

In [None]:
import math
from collections import OrderedDict

import torch
import torch.nn as nn
import torch.nn.functional as F

supported_rnns = {
    'lstm': nn.LSTM,
    'rnn': nn.RNN,
    'gru': nn.GRU
}
supported_rnns_inv = dict((v, k) for k, v in supported_rnns.items())


class SequenceWise(nn.Module):
    def __init__(self, module):
        """
        Collapses input of dim T*N*H to (T*N)*H, and applies to a module.
        Allows handling of variable sequence lengths and minibatch sizes.
        :param module: Module to apply input to.
        """
        super(SequenceWise, self).__init__()
        self.module = module

    def forward(self, x):
        t, n = x.size(0), x.size(1)
        x = x.view(t * n, -1)
        x = self.module(x)
        x = x.view(t, n, -1)
        return x

    def __repr__(self):
        tmpstr = self.__class__.__name__ + ' (\n'
        tmpstr += self.module.__repr__()
        tmpstr += ')'
        return tmpstr


class MaskConv(nn.Module):
    def __init__(self, seq_module):
        """
        Adds padding to the output of the module based on the given lengths. This is to ensure that the
        results of the model do not change when batch sizes change during inference.
        Input needs to be in the shape of (BxCxDxT)
        :param seq_module: The sequential module containing the conv stack.
        """
        super(MaskConv, self).__init__()
        self.seq_module = seq_module

    def forward(self, x, lengths):
        """
        :param x: The input of size BxCxDxT
        :param lengths: The actual length of each sequence in the batch
        :return: Masked output from the module
        """
        for module in self.seq_module:
            x = module(x)
            mask = torch.BoolTensor(x.size()).fill_(0)
            if x.is_cuda:
                mask = mask.cuda()
            for i, length in enumerate(lengths):
                length = length.item()
                if (mask[i].size(2) - length) > 0:
                    mask[i].narrow(2, length, mask[i].size(2) - length).fill_(1)
            x = x.masked_fill(mask, 0)
        return x, lengths


class InferenceBatchSoftmax(nn.Module):
    def forward(self, input_):
        if not self.training:
            return F.softmax(input_, dim=-1)
        else:
            return input_


class BatchRNN(nn.Module):
    def __init__(self, input_size, hidden_size, rnn_type=nn.LSTM, bidirectional=False, batch_norm=True):
        super(BatchRNN, self).__init__()
        self.input_size = input_size
        self.hidden_size = hidden_size
        self.bidirectional = bidirectional
        self.batch_norm = SequenceWise(nn.BatchNorm1d(input_size)) if batch_norm else None
        self.rnn = rnn_type(input_size=input_size, hidden_size=hidden_size,
                            bidirectional=bidirectional, bias=True)
        self.num_directions = 2 if bidirectional else 1

    def flatten_parameters(self):
        self.rnn.flatten_parameters()

    def forward(self, x, output_lengths):
        if self.batch_norm is not None:
            x = self.batch_norm(x)
        #print(x.shape)
        x = nn.utils.rnn.pack_padded_sequence(x, output_lengths)
        x, h = self.rnn(x)
        x, _ = nn.utils.rnn.pad_packed_sequence(x)
        if self.bidirectional:
            x = x.view(x.size(0), x.size(1), 2, -1).sum(2).view(x.size(0), x.size(1), -1)  # (TxNxH*2) -> (TxNxH) by sum
        return x


class Lookahead(nn.Module):
    # Wang et al 2016 - Lookahead Convolution Layer for Unidirectional Recurrent Neural Networks
    # input shape - sequence, batch, feature - TxNxH
    # output shape - same as input
    def __init__(self, n_features, context):
        super(Lookahead, self).__init__()
        assert context > 0
        self.context = context
        self.n_features = n_features
        self.pad = (0, self.context - 1)
        self.conv = nn.Conv1d(self.n_features, self.n_features, kernel_size=self.context, stride=1,
                              groups=self.n_features, padding=0, bias=None)

    def forward(self, x):
        x = x.transpose(0, 1).transpose(1, 2)
        x = F.pad(x, pad=self.pad, value=0)
        x = self.conv(x)
        x = x.transpose(1, 2).transpose(0, 1).contiguous()
        return x

    def __repr__(self):
        return self.__class__.__name__ + '(' \
               + 'n_features=' + str(self.n_features) \
               + ', context=' + str(self.context) + ')'


class DeepSpeech(nn.Module):
    def __init__(self, rnn_type, rnn_hidden_size, nb_layers,
                 bidirectional, context=20):
        super(DeepSpeech, self).__init__()

        self.hidden_size = rnn_hidden_size
        self.hidden_layers = nb_layers
        self.rnn_type = rnn_type
        self.bidirectional = bidirectional

        sample_rate = args.sample_rate
        num_classes = args.num_classes

        self.conv = MaskConv(nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=(41, 11), stride=(2, 2), padding=(20, 5)),
            nn.BatchNorm2d(32),
            nn.Hardtanh(0, 20, inplace=True),
            nn.Conv2d(32, 32, kernel_size=(21, 11), stride=(2, 1), padding=(10, 5)),
            nn.BatchNorm2d(32),
            nn.Hardtanh(0, 20, inplace=True)
        ))
        
        rnn_input_size = 1024
        
        #print(rnn_input_size)

        rnns = []
        rnn = BatchRNN(input_size=rnn_input_size, hidden_size=rnn_hidden_size, rnn_type=rnn_type,
                       bidirectional=bidirectional, batch_norm=False)
        rnns.append(('0', rnn))
        for x in range(nb_layers - 1):
            rnn = BatchRNN(input_size=rnn_hidden_size, hidden_size=rnn_hidden_size, rnn_type=rnn_type,
                           bidirectional=bidirectional)
            rnns.append(('%d' % (x + 1), rnn))
        self.rnns = nn.Sequential(OrderedDict(rnns))
        self.lookahead = nn.Sequential(
            # consider adding batch norm?
            Lookahead(rnn_hidden_size, context=context),
            nn.Hardtanh(0, 20, inplace=True)
        ) if not bidirectional else None

        self.fc = nn.Sequential(
            nn.BatchNorm1d(rnn_hidden_size),
            nn.Linear(rnn_hidden_size, num_classes, bias=False)
        )
        

    def forward(self, x, lengths):
        lengths = lengths.cpu().int()
        output_lengths = self.get_seq_lens(lengths)
        x, _ = self.conv(x, output_lengths)

        sizes = x.size()
        x = x.view(sizes[0], sizes[1] * sizes[2], sizes[3])  # Collapse feature dimension
        x = x.transpose(1, 2).transpose(0, 1).contiguous()  # TxNxH
        
        #print(x.shape)

        for rnn in self.rnns:
            x = rnn(x, output_lengths)

        if not self.bidirectional:  # no need for lookahead layer in bidirectional
            x = self.lookahead(x)
        
        #print(x.shape)

        x = self.fc(x[-1])
        #x = x.transpose(0, 1)
        # identity in training mode, softmax in eval mode
        #x = self.inference_softmax(x)
        return x#, output_lengths

    def get_seq_lens(self, input_length):
        """
        Given a 1D Tensor or Variable containing integer sequence lengths, return a 1D tensor or variable
        containing the size sequences that will be output by the network.
        :param input_length: 1D Tensor
        :return: 1D Tensor scaled by model
        """
        seq_len = input_length
        for m in self.conv.modules():
            if type(m) == nn.modules.conv.Conv2d:
                seq_len = ((seq_len + 2 * m.padding[1] - m.dilation[1] * (m.kernel_size[1] - 1) - 1) / m.stride[1] + 1)
        return seq_len.int()

    @classmethod
    def load_model(cls, path):
        package = torch.load(path, map_location=lambda storage, loc: storage)
        model = DeepSpeech.load_model_package(package)
        return model

    @classmethod
    def load_model_package(cls, package):
        # Pass only the arguments that __init__ accepts; this adapted model
        # does not store labels or audio_conf.
        model = cls(rnn_type=supported_rnns[package['rnn_type']],
                    rnn_hidden_size=package['hidden_size'],
                    nb_layers=package['hidden_layers'],
                    bidirectional=package.get('bidirectional', True))
        model.load_state_dict(package['state_dict'])
        return model

    def serialize_state(self):
        return {
            'hidden_size': self.hidden_size,
            'hidden_layers': self.hidden_layers,
            'rnn_type': supported_rnns_inv.get(self.rnn_type, self.rnn_type.__name__.lower()),
            'state_dict': self.state_dict(),
            'bidirectional': self.bidirectional,
        }

    @staticmethod
    def get_param_size(model):
        params = 0
        for p in model.parameters():
            tmp = 1
            for x in p.size():
                tmp *= x
            params += tmp
        return params

In [None]:
model = DeepSpeech(rnn_type=nn.LSTM, rnn_hidden_size=1024, nb_layers=2, bidirectional=False)
model
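
The hard-coded `rnn_input_size = 1024` follows from the convolution stack: each `Conv2d` shrinks the frequency axis, and the 32 output channels are then flattened together with the remaining frequency bins. A sketch of that arithmetic using the standard convolution output-size formula (the helper `conv_out` is hypothetical):

```python
def conv_out(size, kernel, stride, padding):
    """Output length of a convolution along one axis (dilation 1)."""
    return (size + 2 * padding - (kernel - 1) - 1) // stride + 1

# Frequency axis: 128 mel bins through both conv layers.
freq = conv_out(128, kernel=41, stride=2, padding=20)   # first layer
freq = conv_out(freq, kernel=21, stride=2, padding=10)  # second layer

# Time axis: 313 frames (the value of args.img_width).
frames = conv_out(313, kernel=11, stride=2, padding=5)
frames = conv_out(frames, kernel=11, stride=1, padding=5)

rnn_input_size = 32 * freq  # 32 output channels times remaining frequency bins
```

Changing `n_mels` would change `freq`, so `rnn_input_size` would need to be updated to match.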

### Utility functions

In [None]:
def to_list(tensor):
    return tensor.detach().cpu().tolist()

class AverageMeter(object):
    """Computes and stores the average and current values"""
    def __init__(self):
        self.reset()
    
    def reset(self):
        self.val = 0
        self.avg = 0
        self.sum = 0
        self.count = 0

    def update(self, val, n=1):
        self.val = val
        self.sum += val * n
        self.count += n
        self.avg = self.sum / self.count

def get_position_accuracy(logits, labels):
    predictions = np.argmax(F.softmax(logits, dim=1).cpu().data.numpy(), axis=1)
    labels = labels.cpu().data.numpy()
    total_num = 0
    sum_correct = 0
    for i in range(len(labels)):
        if labels[i] >= 0:
            total_num += 1
            if predictions[i] == labels[i]:
                sum_correct += 1
    if total_num == 0:
        total_num = 1e-7
    return np.float32(sum_correct) / total_num, total_num
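
`AverageMeter.update(val, n)` keeps a running mean weighted by `n`, so batches of different sizes contribute in proportion to their size. A self-contained sketch of that arithmetic (the class `RunningMean` is a stand-in for illustration):

```python
class RunningMean:
    """Weighted running average, mirroring the update rule above."""

    def __init__(self):
        self.sum = 0.0
        self.count = 0

    def update(self, val, n=1):
        self.sum += val * n
        self.count += n

    @property
    def avg(self):
        return self.sum / self.count


meter = RunningMean()
meter.update(0.5, n=16)  # a full batch with accuracy 0.5
meter.update(1.0, n=4)   # a smaller final batch with accuracy 1.0
```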

### Loss function

In [None]:
def loss_fn(preds, labels):
    loss = nn.CrossEntropyLoss(ignore_index=-1)(preds, labels)
    return loss
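
`nn.CrossEntropyLoss` applies log-softmax to the logits and averages the negative log-likelihood of the target class, skipping rows whose target equals `ignore_index`. A NumPy sketch of that computation (`cross_entropy_sketch` is a hypothetical name). It also shows that an untrained 10-class model should start near a loss of ln(10), roughly 2.30:

```python
import numpy as np

def cross_entropy_sketch(logits, labels, ignore_index=-1):
    """Mean negative log-likelihood over rows whose label is not ignore_index."""
    logits = np.asarray(logits, dtype=np.float64)
    labels = np.asarray(labels)
    # Numerically stable log-softmax.
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    keep = labels != ignore_index
    picked = log_probs[np.arange(len(labels))[keep], labels[keep]]
    return -picked.mean()

# Uniform logits over 10 classes give a loss of ln(10); the ignored row is skipped.
loss = cross_entropy_sketch(np.zeros((3, 10)), [2, -1, 7])
```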

### train & validation functions

In [None]:
def train_fn(train_loader, model, optimizer, epoch):
    total_loss = AverageMeter()
    accuracies = AverageMeter()
    
    model.train()

    t = tqdm(train_loader)
    for step, d in enumerate(t):
        
        inputs = d[0].to(args.device)
        targets = d[1].to(args.device)
        input_percentages = d[2].to(args.device)
        input_sizes = input_percentages.mul_(int(inputs.size(3))).int()
        
        outputs = model(inputs, input_sizes)

        loss = loss_fn(outputs, targets)

        acc, n_position = get_position_accuracy(outputs, targets)
        

        total_loss.update(loss.item(), n_position)
        accuracies.update(acc, n_position)

        optimizer.zero_grad()
        
        loss.backward()
        optimizer.step()
        
        t.set_description(f"Train E:{epoch+1} - Loss:{total_loss.avg:0.4f} - Acc:{accuracies.avg:0.4f}")
    
    return total_loss.avg

def valid_fn(valid_loader, model, epoch):
    total_loss = AverageMeter()
    accuracies = AverageMeter()
    
    model.eval()

    t = tqdm(valid_loader)
    for step, d in enumerate(t):
        
        with torch.no_grad():
        
            inputs = d[0].to(args.device)
            targets = d[1].to(args.device)
            input_percentages = d[2].to(args.device)
            input_sizes = input_percentages.mul_(int(inputs.size(3))).int()

            outputs = model(inputs, input_sizes)

            loss = loss_fn(outputs, targets)

            acc, n_position = get_position_accuracy(outputs, targets)


            total_loss.update(loss.item(), n_position)
            accuracies.update(acc, n_position)
            
            t.set_description(f"Eval E:{epoch+1} - Loss:{total_loss.avg:0.4f} - Acc:{accuracies.avg:0.4f}")

    return total_loss.avg, accuracies.avg

In [None]:
def main(fold_index):
    
    # Set the seed before building the model so weight initialisation is reproducible
    seed = 42
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed(seed)
    torch.backends.cudnn.deterministic = True

    args.device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

    model = DeepSpeech(rnn_type=nn.LSTM, rnn_hidden_size=1024, nb_layers=2, bidirectional=False)
    model.to(args.device)
    
    optimizer = torch.optim.AdamW(model.parameters(),
                                      lr=args.lr,
                                      betas=args.betas,
                                      eps=args.eps,
                                      weight_decay=args.wd)
    
    train_df = train[~train.kfold.isin([fold_index])]
    
    train_dataset = BirdDataset(df=train_df)
    
    train_loader = AudioDataLoader(
        dataset = train_dataset,
        batch_size = args.batch_size,
        shuffle = True,
        num_workers = args.num_workers,
        pin_memory = True,
        drop_last = False
    )
    
    
    valid_df = train[train.kfold.isin([fold_index])]
    
    valid_dataset = BirdDataset(df=valid_df)
    
    valid_loader = AudioDataLoader(
        dataset = valid_dataset,
        batch_size = args.batch_size,
        shuffle = False,
        num_workers = args.num_workers,
        pin_memory = True,
        drop_last = False
    )
    
    best_acc = 0
    
    for epoch in range(args.epochs):
        train_loss = train_fn(train_loader, model, optimizer, epoch)
        valid_loss, valid_acc = valid_fn(valid_loader, model, epoch)
        
        print(f"**** Epoch {epoch+1} **==>** Accuracy = {valid_acc}")
        
        if valid_acc > best_acc:
            print("**** Model Improved !!!! Saving Model")
            torch.save(model.state_dict(), f"fold_{fold_index}.bin")
            best_acc = valid_acc  

### 5 Folds

In [None]:
# fold0
main(0)

In [None]:
# fold1
#main(1)

In [None]:
# fold2
#main(2)

In [None]:
# fold3
#main(3)

In [None]:
# fold4
#main(4)
