Hi Kagglers. A few days ago, at the beginning of this competition, I released a [dummy kernel](https://www.kaggle.com/kneroma/inference-resnest-rfcx-audio-detection) that performs poorly because of its simple training pipeline and poor data cropping strategy. I've since updated a lot of things, and I'm publishing a new version that is more robust and has a smarter training pipeline.

The goal of this work is to show that we can get a decent score with just a single **ResNeSt50** architecture, with no TTA or any fancy inference augmentations. As a great fan of the **open-source** philosophy, I will be releasing as much as I can, including my datasets, weights and tips. Unfortunately, my training code is currently very messy and would be hard to release as is, but I hope to clean it up and release it sooner or later.

# Training

* The training is based on *intelligent* crops of **10 s**, with a training set of less than **500 MB**!
* I used a stratified KFold, with n_splits = 5
* I found the learning rate scheduler very important
* Higher learning rates seem to give me better results
* For the cross-validation metric, I stuck with the F1 score, which seems to be more robust than the competition's metric
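
Since the training code is not released yet, here is only a rough sketch of what a 5-fold stratified split and an F1 check could look like with scikit-learn. The toy labels, the `0.5` threshold and the micro averaging are my assumptions for illustration, not the exact settings used for the published weights.

```python
import numpy as np
from sklearn.metrics import f1_score
from sklearn.model_selection import StratifiedKFold

# Hypothetical toy data: 24 classes with 5 clips each (not the real training set).
labels = np.repeat(np.arange(24), 5)
clip_ids = np.arange(len(labels))

# Assign every clip to one of 5 validation folds, stratified on its class label.
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
folds = np.full(len(labels), -1)
for fold, (_, valid_idx) in enumerate(skf.split(clip_ids, labels)):
    folds[valid_idx] = fold

fold_sizes = np.bincount(folds, minlength=5)

# F1 on thresholded multi-label predictions (threshold and averaging are assumptions).
y_true = np.array([[1, 0, 0], [0, 1, 0], [0, 0, 1], [1, 1, 0]])
y_prob = np.array([[0.9, 0.2, 0.1], [0.3, 0.8, 0.1], [0.2, 0.6, 0.4], [0.7, 0.4, 0.1]])
score = f1_score(y_true, (y_prob > 0.5).astype(int), average="micro")
print(fold_sizes, round(score, 4))
```

Stratifying on the label keeps rare species represented in every validation fold, which matters when the training set is this small.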

# Inference

* The inference is based on these [resnest50 weights](https://www.kaggle.com/kneroma/kkiller-rfcx-species-detection-public-checkpoints). Please don't forget to upvote the dataset to make it more visible to others
* The inference pipeline is optimized as much as I could to reduce execution time
* I'm using PyTorch's native multi-worker data loading framework
* You can try increasing the **DURATION** hyperparameter, but it can lead to a longer execution time
* Reducing the **STRIDE** hyperparameter may give better results, but it can also lead to a longer execution time
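
To get a feel for the **STRIDE** trade-off, here is a small sketch that counts the windows cut from one 60 s recording. It mirrors the `range(0, 60*sr + stride - window, stride)` expression used in the dataset class, with the sample rate factored out; the helper name `window_starts` is just something I made up for this illustration.

```python
def window_starts(duration, stride, clip_seconds=60):
    """Start times (in seconds) of the windows cut from one recording."""
    return list(range(0, clip_seconds + stride - duration, stride))


default = window_starts(duration=10, stride=5)  # the notebook's default settings
dense = window_starts(duration=10, stride=2)    # smaller stride, more overlap

print(len(default), default[-1])  # 11 windows, the last one starts at 50 s
print(len(dense), dense[-1])      # 26 windows, the last one starts at 50 s
```

Every window is one forward pass per model, so going from a stride of 5 to a stride of 2 more than doubles the work per recording.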

# Pre-computed MFCCs
> To make inference much faster (**< 10 mins on GPU**), I've created [this precomputed MFCC dataset](https://www.kaggle.com/kneroma/kkiller-rfcx-test-mfcc-1-0400). Feel free to use it in your pipelines :)
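
If you want to build a similar cache yourself, the idea is simply to save the per-recording feature array once as a `.npy` file and reload it with `np.load` at inference time. A minimal sketch is given here; the helper `cache_features`, the recording id and the random array are placeholders I made up, and a real array would come from the on-the-fly dataset class of this notebook.

```python
import tempfile
from pathlib import Path

import numpy as np


def cache_features(recording_id, features, out_dir):
    """Save one recording's feature array as <out_dir>/<recording_id>.npy."""
    out_path = Path(out_dir).joinpath(recording_id).with_suffix(".npy")
    np.save(out_path, features)
    return out_path


with tempfile.TemporaryDirectory() as tmp:
    # Placeholder features: (windows, channels, n_mels, frames), sizes are illustrative.
    fake = np.random.rand(11, 3, 128, 32).astype(np.float32)
    path = cache_features("abc123", fake, tmp)
    restored = np.load(path)
    roundtrip_ok = np.array_equal(fake, restored)

print(path.name, restored.shape, roundtrip_ok)
```

The speed-up comes from skipping audio decoding, resampling and the spectrogram computation, at the price of more disk space.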

<h2><font color="blue">If you find this work useful, please don't forget to upvote :)</font></h2>

In [None]:
!nvidia-smi

In [None]:
!pip install  resnest > /dev/null

In [None]:
# for TPU
!curl https://raw.githubusercontent.com/pytorch/xla/master/contrib/scripts/env-setup.py -o pytorch-xla-env-setup.py > /dev/null
!python pytorch-xla-env-setup.py --apt-packages libomp5 libopenblas-dev > /dev/null

In [None]:
import torch_xla
import torch_xla.core.xla_model as xm

In [None]:
import numpy as np
import librosa as lb
import soundfile as sf
import pandas as pd
from pathlib import Path

import torch
from  torch.utils.data import Dataset, DataLoader

from tqdm.notebook import tqdm

from resnest.torch import resnest50

import time

# Configs

In [None]:
# Data Loader
NUM_CLASSES = 24
SR = 32_000
DURATION =  10
STRIDE = 5


# Neural Net
TEST_BATCH_SIZE = 30
TEST_NUM_WORKERS = 2

USE_PRE_COMPUTED_MFCC = True

# DEVICE = torch.device("cuda" if torch.cuda.is_available() else "cpu")
DEVICE = xm.xla_device() # TPU
torch.set_default_tensor_type('torch.FloatTensor')

TEST_AUDIO_ROOT = Path("../input/rfcx-species-audio-detection/test")

TEST_MFCC_ROOT = "../input/kkiller-rfcx-test-mfcc-1-0400/test_mfcc_d10_s10_sr32000"

# Data

In [None]:
class MelSpecComputer:
    def __init__(self, sr, n_mels, fmin, fmax):
        self.sr = sr
        self.n_mels = n_mels
        self.fmin = fmin
        self.fmax = fmax

    def __call__(self, y):

        # Pass the signal by keyword: recent librosa versions no longer accept it positionally
        melspec = lb.feature.melspectrogram(
            y=y, sr=self.sr, n_mels=self.n_mels, fmin=self.fmin, fmax=self.fmax,
        )

        melspec = lb.power_to_db(melspec).astype(np.float32)
        return melspec

In [None]:
def mono_to_color(X, eps=1e-6, mean=None, std=None):
    X = np.stack([X, X, X], axis=-1)

    # Standardize
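    # Compare against None explicitly so that a user-supplied 0.0 is not silently replaced
    mean = X.mean() if mean is None else mean
    std = X.std() if std is None else std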
    X = (X - mean) / (std + eps)

    # Normalize to [0, 255]
    _min, _max = X.min(), X.max()

    if (_max - _min) > eps:
        V = np.clip(X, _min, _max)
        V = 255 * (V - _min) / (_max - _min)
        V = V.astype(np.uint8)
    else:
        V = np.zeros_like(X, dtype=np.uint8)

    return V


def normalize(image, mean=None, std=None):
    image = image / 255.0
    if mean is not None and std is not None:
        image = (image - mean) / std
    return np.moveaxis(image, 2, 0).astype(np.float32)


def crop_or_pad(y, length, sr, is_train=True):
    if len(y) < length:
        y = np.concatenate([y, np.zeros(length - len(y))])
    elif len(y) > length:
        if not is_train:
            start = 0
        else:
            start = np.random.randint(len(y) - length)

        y = y[start:start + length]

    y = y.astype(np.float32, copy=False)

    return y

In [None]:
class RFCXDataset(Dataset):

    def __init__(self, data, sr, n_mels=128, fmin=0, fmax=None, num_classes=NUM_CLASSES, duration=DURATION, stride=STRIDE, root=None):

        self.data = data
        
        self.sr = sr
        self.n_mels = n_mels
        self.fmin = fmin
        self.fmax = fmax or self.sr//2


        self.num_classes = num_classes
        self.duration = duration
        self.stride = stride
        self.audio_length = self.duration*self.sr
        
        # Wrap in Path so that a plain string root also works with joinpath() below
        self.root = Path(root or TEST_AUDIO_ROOT)

        self.mel_spec_computer = MelSpecComputer(sr=self.sr, n_mels=self.n_mels, fmin=self.fmin, fmax=self.fmax)
        
        self.res_type = "kaiser_best"


    def __len__(self):
        return len(self.data)
    
    def load(self, record):
        y, _ = lb.load(self.root.joinpath(record).with_suffix(".flac").as_posix(), sr=self.sr, res_type=self.res_type)
        return y
    
    def load2(self, record):
        y, orig_sr = sf.read(self.root.joinpath(record).with_suffix(".flac").as_posix())
        y = lb.resample(y, orig_sr=orig_sr, target_sr=self.sr, res_type=self.res_type)
        return y
    
    def read_index(self, idx):
        d = self.data.iloc[idx]
        record = d["recording_id"]
        
        y = self.load2(record)
        
        window = self.duration*self.sr
        stride = self.stride*self.sr
            
        y = np.stack([y[i:i+window] for i in range(0, 60*self.sr+stride-window, stride)])

        return y
            
    def process(self, y):
        melspec = self.mel_spec_computer(y) 
        image = mono_to_color(melspec)
        image = normalize(image, mean=None, std=None)
        return image

    def __getitem__(self, idx):

        y = self.read_index(idx)
        
        image = np.stack([self.process(_y) for _y in y])

        return image

In [None]:
class SimpleRFCXDataset(Dataset):
    def __init__(self, data):
        self.data = data
        
    def __getitem__(self, idx):
        row = self.data.iloc[idx]
        record_id_path = Path(row.mfcc_root).joinpath(row.recording_id).with_suffix(".npy")
        image = np.load(record_id_path)
        return image
    
    def __len__(self):
        return len(self.data)

In [None]:
%%time

data = pd.DataFrame({
    "recording_id": [path.stem for path in Path(TEST_AUDIO_ROOT).glob("*.flac")],
})
data["mfcc_root"] = TEST_MFCC_ROOT
print(data.shape)
data.head()

In [None]:
TEST_MFCC_ROOTs = [
    "../input/kkiller-rfcx-test-mfcc-0000-0400/test_mfcc_d10_s2_sr32000_0000_0400",
    "../input/kkiller-rfcx-test-mfcc-0400-0800/test_mfcc_d10_s2_sr32000_0400_0800",
    "../input/kkiller-rfcx-test-mfcc-0800-1200/test_mfcc_d10_s2_sr32000_0800_1200",
    "../input/kkiller-rfcx-test-mfcc-1200-1600/test_mfcc_d10_s2_sr32000_1200_1600",
    "../input/kkiller-rfcx-test-mfcc-1600-2000/test_mfcc_d10_s2_sr32000_1600_2000",
]

In [None]:
mfccs = []
for mfcc_root in TEST_MFCC_ROOTs:
    mfccs += [(mfcc.stem, mfcc.parent.as_posix()) for mfcc in Path(mfcc_root).glob("*.npy")]
mfccs = pd.DataFrame(mfccs, columns = ["recording_id", 'mfcc_root'])

data = data[["recording_id"]].merge(mfccs, on="recording_id")
print(data.shape)
data.head()

In [None]:
ds = RFCXDataset(data=data, sr=SR)

In [None]:
%%time

x = ds[1]
print(x.shape)

# Inference
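
The prediction loop in this section reduces each recording's window scores with a maximum, then averages the result over the folds. A minimal NumPy sketch of that reduction on made-up probabilities (the function name `aggregate` and the toy array are illustrative assumptions, not part of the pipeline):

```python
import numpy as np


def aggregate(window_probs):
    """Reduce (n_models, n_recordings, n_windows, n_classes) to (n_recordings, n_classes)."""
    per_model = window_probs.max(axis=2)  # strongest window per recording and class
    return per_model.mean(axis=0)         # average over the models (folds)


# Toy input: 2 models, 1 recording, 2 windows, 2 classes.
probs = np.array([
    [[[0.1, 0.8], [0.6, 0.2]]],  # model 0
    [[[0.3, 0.4], [0.2, 0.9]]],  # model 1
])
result = aggregate(probs)
print(result)
```

Taking the maximum over windows means a species counts as present if it is heard clearly in any single window of the recording.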

In [None]:
test_data = SimpleRFCXDataset(data) if (USE_PRE_COMPUTED_MFCC and TEST_MFCC_ROOT) else RFCXDataset(data=data, sr=SR)
test_loader = DataLoader(test_data, batch_size=TEST_BATCH_SIZE, num_workers=TEST_NUM_WORKERS)

In [None]:
def load_net(checkpoint_path):
    # All weights are overwritten by the checkpoint below, so no ImageNet download is needed
    net = resnest50(pretrained=False)
    n_features = net.fc.in_features
    net.fc = torch.nn.Linear(n_features, NUM_CLASSES)
    dummy_device = torch.device("cpu")
    net.load_state_dict(torch.load(checkpoint_path, map_location=dummy_device))
    net = net.to(DEVICE)
    net = net.eval()
    return net

In [None]:
checkpoint_paths = [
    "../input/kkiller-rfcx-species-detection-public-checkpoints/rfcx_resnest50/rfcx_resnest_50_fold0.pth",
    "../input/kkiller-rfcx-species-detection-public-checkpoints/rfcx_resnest50/rfcx_resnest_50_fold1.pth",
    "../input/kkiller-rfcx-species-detection-public-checkpoints/rfcx_resnest50/rfcx_resnest_50_fold2.pth",
    "../input/kkiller-rfcx-species-detection-public-checkpoints/rfcx_resnest50/rfcx_resnest_50_fold3.pth",
    "../input/kkiller-rfcx-species-detection-public-checkpoints/rfcx_resnest50/rfcx_resnest_50_fold4.pth",
]

nets = [
        load_net(checkpoint_path) for checkpoint_path in checkpoint_paths
]

In [None]:
preds = []
# The nets are already in eval mode (see load_net)
with torch.no_grad():
    for xb in  tqdm(test_loader):
        bsize, nframes = xb.shape[:2]
        xb = xb.to(DEVICE).view(bsize*nframes, *xb.shape[2:])

        pred = 0.
        for net in nets:
            o = net(xb)
            o = torch.sigmoid(o)
            o = o.view(bsize, nframes, *o.shape[1:]).max(1).values
            o = o.detach().cpu().numpy()

            pred += o
        
        pred /= len(nets)
        
        preds.append(pred)
preds = np.vstack(preds)
preds.shape

In [None]:
# One column per species, named s0 ... s23
species_cols = [f"s{i}" for i in range(NUM_CLASSES)]
sub = pd.DataFrame(preds, columns=species_cols)
sub["recording_id"] = data["recording_id"].values[:len(sub)]
sub = sub[["recording_id"] + species_cols]
print(sub.shape)
sub.head()

In [None]:
sub.to_csv("submission.csv", index=False)