<a href="https://colab.research.google.com/github/thirdformant/cats_dogs_audio/blob/master/notebooks/cats_dogs_audio_librosa.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Audio classification of cats and dogs

Training a machine learning classification model on images of cats and dogs is a common introductory problem. With modern convolutional neural network (CNN) architectures and transfer learning, it is now possible to achieve near-perfect classification accuracy, as demonstrated very effectively (and efficiently!) early on in the [fast.ai deep learning MOOC](https://course.fast.ai/). When browsing through the datasets on Kaggle some time ago, I stumbled upon the [Audio Cats and Dogs](https://www.kaggle.com/mmoreaux/audio-cats-and-dogs/home) data, which presents itself as 'the audio counterpart' to the typical image classification problem. As someone with a background in acoustic phonetics, I found this rather appealing. In addition, the dataset brings some new challenges to classification: first, the audio data requires considerable preprocessing if a CNN is to be used, and second, the data consists of only 277 files. More on both these topics later.

This notebook comprises two major parts. The first is a partial implementation of the approaches described in Huzaifah (2017) in his experiments on CNN-based classification of environmental sound data. While the CNN architectures he used were implemented in TensorFlow, I have chosen instead to use PyTorch. The second section is intended to improve on the approaches used by Huzaifah and increase the overall classification accuracy (TODO: finish this once the implementation is done).

# Libraries and setup

[Librosa](http://librosa.github.io/librosa/) is a general-purpose audio processing and analysis library.

In [0]:
# !apt-get install sox libsox-dev libsox-fmt-all
# !pip3 install git+git://github.com/pytorch/audio

In [0]:
import os
from pathlib import Path
from typing import Optional
from collections import Counter

import numpy as np
import pandas as pd

import librosa # Audio library

# Data viz
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split, StratifiedKFold
from sklearn.metrics import confusion_matrix

# Deep learning
import torch
# import torchaudio
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torch.utils.data.dataset import Dataset
from torch.utils.data import DataLoader

# transforms
import torchvision.transforms as transforms
# import torchaudio.transforms

In [0]:
torch.cuda.is_available()

In [0]:
# Connecting to google drive
from google.colab import drive

drive.mount('/content/gdrive')

# Reading audio data

I created a CSV file containing the filename and label of each audio file. This can be passed to the PyTorch Dataset class defined in the next section.

In [0]:
# Paths
ROOT_PATH = Path('/content/gdrive/My Drive/data/cats_dogs_audio/cats_dogs/')
CSV_PATH = Path('/content/gdrive/My Drive/data/cats_dogs_audio/all_data.csv')

In [0]:
# # Creates the csv file. Commented out because I only needed to run it once

# all_data = {
#     'label': [],
#     'filename': []
# }


# files_list = os.listdir(ROOT_PATH)
# for f in files_list:
#     if Path(f).suffix == '.wav':
#         all_data['filename'].append(f)
#         if 'cat' in f:
#             all_data['label'].append(0)
#         else:
#             all_data['label'].append(1)

# data_df = pd.DataFrame(all_data).iloc[:, ::-1] # Reverse the column order
# data_df.to_csv(CSV_PATH, index=False)

In [0]:
data_df = pd.read_csv(CSV_PATH)

In [0]:
print(f'The distribution of classes in the data is: {Counter(data_df["label"])}')

# Feature extraction

Waveforms show changes in the signal, especially in its amplitude (loudness), over time and are therefore representations of the signal in the *time domain*. This is, however, only one aspect of the signal. When it comes to training audio classification or e.g. speech recognition models, better results are achieved when considering how the frequency of the signal, rather than its amplitude, changes over time. These are known as *time-frequency* representations, the best-known of which is the spectrogram.

While Huzaifah (2017) created four time-frequency representations, only two are currently implemented here:
- Linear-scaled Short-time Fourier Transform (STFT) spectrograms
- mel-scaled STFT spectrograms.

In addition, Mel-frequency Cepstral Coefficients (MFCCs) were also extracted from the input signal.

Prior to feature extraction, all raw audio was either clipped or padded to a 4 second duration.
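As a minimal sketch of what clipping or padding to a fixed duration involves, the snippet below works on synthetic NumPy arrays rather than the real audio. The helper name `fix_duration` is hypothetical; it places the silence in front of the signal, which is the same side that `_pad_audio` in the dataset class pads on.

```python
import numpy as np

def fix_duration(signal: np.ndarray, sr: int, duration: float) -> np.ndarray:
    """Clip or zero-pad a mono signal to exactly `duration` seconds."""
    n_target = int(sr * duration)
    if len(signal) >= n_target:
        # Too long: keep only the first n_target samples
        return signal[:n_target]
    # Too short: put the missing samples in front as silence
    padding = np.zeros(n_target - len(signal), dtype=signal.dtype)
    return np.concatenate((padding, signal))

sr = 16000
short_clip = np.ones(sr * 1, dtype=np.float32)   # 1 second of dummy audio
long_clip = np.ones(sr * 6, dtype=np.float32)    # 6 seconds of dummy audio

fixed_short = fix_duration(short_clip, sr, 4)
fixed_long = fix_duration(long_clip, sr, 4)
print(len(fixed_short), len(fixed_long))
```

Both outputs have the same number of samples, which is what lets every file produce a feature matrix of the same shape.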

## PyTorch dataset definition

PyTorch datasets allow the audio features to be extracted and transformed 'on the fly' when needed, rather than being extracted in advance and saved separately.

My current implementation of this, however, is not optimal. There is a way of doing this using `torch.stft` and `torchaudio.transforms.Spectrogram` which would allow the transforms to be done on the GPU. However, I can't get it to work well for some reason: the values returned by the STFT are much smaller than those produced by `np.abs(librosa.stft)`. It might have something to do with the differing approaches to dealing with the imaginary component of the STFT, which PyTorch tensors cannot handle properly. Then again, it might not.

For now, therefore, everything here is using `librosa`. As the dataset is small, the slower speed of the CPU shouldn't be too much of an issue.

In [0]:
class AudioFeatureDataset(Dataset):
    def __init__(self, dataframe:pd.DataFrame, root_path:Path,
                 offset:float=0.0, duration:Optional[float]=None, sr:int=16000,
                 audio_transform=None, image_transform=None):
        """
        Args:
            dataframe: Pandas dataframe with `label` and `filename` columns
            root_path: root path for input data dir
            offset: time (in seconds) at which to start reading each file
            duration: desired duration (in seconds) of the audio signal
            sr: desired sampling rate of the audio signal
            audio_transform: transformations applied to the raw audio signal
            image_transform: transformations applied to the extracted features
        """
        self.root = root_path
        
        self.data = dataframe
        # Select by column name so the column order of the CSV does not matter
        self.files = np.array(self.data['filename'])
        self.classes = np.array(self.data['label'])
        
        self.sr = sr
        self.offset = offset
        self.duration = duration
        
        self.audio_transform = audio_transform
        self.image_transform = image_transform
        
        
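    def __getitem__(self, index):
        file = self.files[index]
        label = self.classes[index]
        
        # Load (and resample) the audio; at most `duration` seconds are read
        signal, sr = librosa.core.load(self.root / file, sr=self.sr,
                                       offset=self.offset,
                                       duration=self.duration)
        # Zero-pad files shorter than `duration`
        if self.duration:
            signal = self._pad_audio(signal)
        # Extract features, then convert to a single-channel 8-bit image.
        # Note: uint8 drops fractions and cannot represent values above 255
        signal = self.audio_transform(signal)
        signal = np.abs(signal)
        signal = np.expand_dims(signal, axis=2)
        signal = self.image_transform(signal.astype("uint8"))
        return signal, label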
    

    def __len__(self):
        return len(self.files)
    
    def _pad_audio(self, signal):
        '''
        Pads the start of the signal with zeros up to the target duration.
        From https://stackoverflow.com/a/32477869 with changes for clarity
        '''
        # Calculate target number of samples
        n_target = int(self.sr * self.duration)
        # Calculate number of zero samples to add
        shape = signal.shape
        padding = n_target - shape[0]
        # Create the shape of the zero padding
        shape = (padding,) + shape[1:]
        # Stack only if there is something to add
        if shape[0] > 0:
            if len(shape) > 1:
                return np.vstack((np.zeros(shape),
                                signal))
            else:
                return np.hstack((np.zeros(shape),
                                signal))
        else:
            return signal

## Short-time Fourier Transform (STFT) features

Put very briefly, any periodic waveform can be represented as a sum (possibly infinite) of sinusoids of different frequency and phase. Fourier transforms perform this decomposition on a signal, revealing its frequency and phase components.

Fourier transforms are typically applied to the signal as a whole and thus do not reveal how the frequency and phase components of the signal change over time. This can, however, be done with the STFT. First, the signal is divided into overlapping segments of equal length and a window function applied to these segments. The transform is then taken for each of these windows, with the complex-valued results being added to a matrix. 

The spectrogram of the signal is then defined as the magnitude (the absolute value) of the STFT matrix squared.

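The STFT of a signal is given by:

\begin{equation*}
X[m, \omega] = \sum_{k=0}^{\textrm{win_length}-1}\textrm{window}[k] \cdot \textrm{input}[m \times \textrm{hop_length} + k] \cdot \exp{\bigg(-j\frac{2\pi \cdot \omega k}{\textrm{win_length}}\bigg)}
\end{equation*}

where $m$ is the index of the frame, $\omega$ is the frequency bin, and `hop_length` is the number of samples between the starts of successive frames.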
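As a rough illustration of the STFT sum, the sketch below evaluates a single frame directly and checks the result against NumPy's FFT. The function name `frame_dft`, the Hann window and the random test signal are assumptions made for the example; this is not how librosa computes the transform internally.

```python
import numpy as np

def frame_dft(signal, m, win_length, hop_length, window):
    """Evaluate the windowed DFT of frame m for all non-negative frequency bins."""
    k = np.arange(win_length)                        # sample index within the frame
    start = m * hop_length                           # first sample of frame m
    segment = signal[start:start + win_length] * window
    omega = np.arange(win_length // 2 + 1)[:, None]  # frequency bins as a column
    basis = np.exp(-2j * np.pi * omega * k / win_length)
    return basis @ segment                           # sum over k for every bin

rng = np.random.default_rng(0)
signal = rng.standard_normal(4096)
win_length, hop_length = 512, 256
window = np.hanning(win_length)

direct = frame_dft(signal, 3, win_length, hop_length, window)
start = 3 * hop_length
via_fft = np.fft.rfft(signal[start:start + win_length] * window)
print(np.allclose(direct, via_fft))
```

Stacking the result for every frame `m` as columns gives the complex STFT matrix; taking `np.abs` of it gives the magnitudes used for the spectrograms.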

Varying the window length (`win_length`) affects the resolution of the STFT in terms of time and frequency. Longer windows give finer frequency resolution, but show less change across time (as the time domain is split into fewer windows), while shorter windows show greater change in time in exchange for lower frequency resolution. The former are known as narrowband transforms and the latter as wideband transforms.
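The trade-off can be made concrete with a little arithmetic. The sketch below assumes a 4 second signal sampled at 16 kHz, a hop of half the window, and centred frames (so the frame count is `1 + n_samples // hop_length`); the helper name `stft_shape` is hypothetical.

```python
def stft_shape(n_fft, n_samples, hop_length=None):
    """Number of frequency bins and time frames for a centred STFT."""
    hop_length = hop_length or n_fft // 2
    n_bins = n_fft // 2 + 1                  # non-negative frequencies only
    n_frames = 1 + n_samples // hop_length   # assumes the signal is padded at both ends
    return n_bins, n_frames

sr = 16000
n_samples = sr * 4

for n_fft in (512, 2048):
    n_bins, n_frames = stft_shape(n_fft, n_samples)
    bin_width_hz = sr / n_fft                 # spacing between frequency bins
    hop_ms = 1000 * (n_fft // 2) / sr         # spacing between frames
    print(n_fft, n_bins, n_frames, bin_width_hz, hop_ms)
```

Quadrupling the window length quadruples the number of frequency bins but leaves roughly a quarter as many frames.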

### Linear STFT spectrograms

In linear spectrograms, the frequency bins are spaced linearly, which is the unmodified output of the STFT.

Linear spectrograms were extracted from the input signal with two window lengths, 512 and 2048 samples. Note that the code below labels the 512-sample transform 'narrowband' and the 2048-sample transform 'wideband', which is the reverse of the naming described above; the labels are kept as they appear in the code. In both cases the window hop length was $\textrm{win_len} / 2$. The STFT values were converted to the logarithmic scale (dB) and spectrogram values were normalised to $[-1, 1]$.

The final images were resized to 37$\times$50 pixels for narrowband spectrograms and 154$\times$12 pixels for wideband spectrograms with Lanczos resampling.
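The dB conversion and rescaling can be sketched in plain NumPy as follows. `to_db` roughly mirrors the behaviour of `librosa.amplitude_to_db(..., ref=np.max)` with an 80 dB floor, and `rescale` is one assumed reading of 'normalised to $[-1, 1]$'; the pipeline in this notebook instead goes through an 8-bit image and `transforms.Normalize([0.5], [0.5])`, so treat this as an illustration only.

```python
import numpy as np

def to_db(magnitude, top_db=80.0, amin=1e-5):
    """Convert magnitudes to dB relative to the largest value, with a floor."""
    ref = max(magnitude.max(), amin)
    db = 20.0 * np.log10(np.maximum(magnitude, amin)) - 20.0 * np.log10(ref)
    # Nothing is allowed to fall more than top_db below the peak
    return np.maximum(db, db.max() - top_db)

def rescale(db, top_db=80.0):
    """Map the range [-top_db, 0] dB linearly onto [-1, 1]."""
    return 2.0 * (db + top_db) / top_db - 1.0

magnitude = np.array([[1.0, 0.1],
                      [0.001, 0.0]])
db = to_db(magnitude)
scaled = rescale(db)
print(db)
print(scaled)
```

The loudest bin maps to 1 and anything 80 dB or more below it maps to -1.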

In [0]:
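class LinearSpectrogram(object):
    """
    Creates a linear-frequency spectrogram (in dB) from a raw audio signal using librosa
    Args:
        n_fft: size of the FFT window
        sr: sample rate of the input signal (stored but not used by the STFT)
        hop_length: samples between frames (defaults to n_fft // 2)
        center: whether frames are centred on their timestamps (pads the signal)
    """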
    def __init__(self, n_fft, sr=16000, hop_length=None, center=True):
        self.n_fft = n_fft
        self.sr = sr
        self.hop_length = hop_length if hop_length else self.n_fft // 2
        self.center = center
    
    def __call__(self, sig):
        stft = np.abs(librosa.core.stft(sig, n_fft=self.n_fft,
                                        hop_length=self.hop_length,
                                        center=self.center))
        spectrogram = librosa.amplitude_to_db(stft, ref=np.max)
        return spectrogram

#### Narrowband

In [0]:
nb_linear_params = {
    'n_fft': 512
}

nb_lin_transform = transforms.Compose([
    LinearSpectrogram(n_fft=nb_linear_params['n_fft'])
])

nb_image_transform = transforms.Compose([
    transforms.ToPILImage(),
    transforms.Resize((37, 50), 1),  # 1 == PIL.Image.LANCZOS (4 is BOX)
    transforms.Grayscale(num_output_channels=1),
    transforms.ToTensor(),
    transforms.Normalize([0.5], [0.5])
])

An example narrowband linear STFT spectrogram is shown below:

In [0]:
nb_linear_stft = AudioFeatureDataset(data_df, ROOT_PATH,
                                    audio_transform=nb_lin_transform,
                                    image_transform=nb_image_transform,
                                    duration=4)
image_nblinear = nb_linear_stft[1][0]

plt.imshow(image_nblinear[0], origin='lower')
plt.axis('off')
plt.title('Narrowband linear spectrogram')
plt.show()

#### Wideband

In [0]:
wb_linear_params = {
    'n_fft': 2048
}

wb_lin_transform = transforms.Compose([
    LinearSpectrogram(n_fft=wb_linear_params['n_fft'])
])

wb_image_transform = transforms.Compose([
    transforms.ToPILImage(),
    transforms.Resize((154, 12), 1),  # 1 == PIL.Image.LANCZOS (4 is BOX)
    transforms.Grayscale(num_output_channels=1),
    transforms.ToTensor(),
    transforms.Normalize([0.5], [0.5])
])

In [0]:
wb_linear_stft = AudioFeatureDataset(data_df, ROOT_PATH,
                                    audio_transform=wb_lin_transform,
                                    image_transform=wb_image_transform,
                                    duration=4)
image_wblinear = wb_linear_stft[1][0]

plt.imshow(image_wblinear[0], origin='lower')
plt.axis('off')
plt.title('Wideband linear spectrogram')
plt.show()

### Mel-scale STFT spectrograms

`librosa.feature.melspectrogram()` was used to extract a [mel-scaled](https://en.wikipedia.org/wiki/Mel_scale) spectrogram by applying a mel filterbank of a specified size to the STFT frequency bin values.

128 mel bands were used for narrowband spectrograms and 512 for wideband spectrograms. As with linear spectrograms, values were log scaled and normalised to $[-1, 1]$. Images were resized to 37$\times$50 pixels for narrowband spectrograms and 154$\times$12 pixels for wideband spectrograms with Lanczos resampling.
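To give a feel for what mel scaling does to the frequency axis, the sketch below computes the edges of a small set of mel bands. It uses the HTK-style formula $m = 2595 \log_{10}(1 + f / 700)$, which is an assumption made for the example: librosa's default mel scale is a different (Slaney-style) variant, so its band edges will not match these exactly. The helper names are hypothetical.

```python
import numpy as np

def hz_to_mel(f):
    """HTK-style Hz to mel conversion."""
    return 2595.0 * np.log10(1.0 + np.asarray(f, dtype=float) / 700.0)

def mel_to_hz(m):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (np.asarray(m, dtype=float) / 2595.0) - 1.0)

def mel_band_edges(n_mels, fmin=0.0, fmax=8000.0):
    """Band edges in Hz for n_mels bands spaced evenly on the mel scale."""
    mels = np.linspace(hz_to_mel(fmin), hz_to_mel(fmax), n_mels + 2)
    return mel_to_hz(mels)

# 8 bands up to the Nyquist frequency of a 16 kHz signal
edges = mel_band_edges(n_mels=8)
widths = np.diff(edges)
print(np.round(edges))
print(np.round(widths))
```

The bands are equally wide in mels but grow steadily wider in Hz, so low frequencies are resolved more finely than high ones.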

In [0]:
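class MelSpectrogram(object):
    """
    Creates a mel-scaled spectrogram (in dB) from a raw audio signal using librosa
    Args:
        n_fft: size of the FFT window
        n_mels: number of mel bands
        sr: sample rate of the input signal
        hop_length: samples between frames (defaults to n_fft // 2)
    """
    def __init__(self, n_fft, n_mels, sr=16000, hop_length=None):
        self.n_fft = n_fft
        self.sr = sr
        self.hop_length = hop_length if hop_length else self.n_fft // 2
        self.n_mels = n_mels
    
    def __call__(self, sig):
        # Keyword arguments are required by recent librosa releases;
        # sr is passed so the mel filterbank matches the signal's sample rate
        mel = librosa.feature.melspectrogram(y=sig, sr=self.sr,
                                             n_fft=self.n_fft,
                                             hop_length=self.hop_length,
                                             n_mels=self.n_mels)
        # melspectrogram returns a power spectrogram by default, hence power_to_db
        spectrogram = librosa.power_to_db(mel, ref=np.max)
        return spectrogram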

#### Narrowband

In [0]:
nb_mel_params = {
    'n_fft': 512,
    'n_mels': 128
}

nb_mel_transform = transforms.Compose([
    MelSpectrogram(n_fft=nb_mel_params['n_fft'],
                  n_mels=nb_mel_params['n_mels'])
])

nb_image_transform = transforms.Compose([
    transforms.ToPILImage(),
    transforms.Resize((37, 50), 1),  # 1 == PIL.Image.LANCZOS (4 is BOX)
    transforms.Grayscale(num_output_channels=1),
    transforms.ToTensor(),
    transforms.Normalize([0.5], [0.5])
])

In [0]:
nb_mel_stft = AudioFeatureDataset(data_df, ROOT_PATH,
                                    audio_transform=nb_mel_transform,
                                    image_transform=nb_image_transform,
                                    duration=4)
image_nbmel = nb_mel_stft[1][0]

plt.imshow(image_nbmel[0], origin='lower')
plt.axis('off')
plt.title('Narrowband mel-scaled spectrogram')
plt.show()

#### Wideband

In [0]:
wb_mel_params = {
    'n_fft': 2048,
    'n_mels': 512
}

wb_mel_transform = transforms.Compose([
    MelSpectrogram(n_fft=wb_mel_params['n_fft'],
                  n_mels=wb_mel_params['n_mels'])
])

wb_image_transform = transforms.Compose([
    transforms.ToPILImage(),
    transforms.Resize((154, 12), 1),  # 1 == PIL.Image.LANCZOS (4 is BOX)
    transforms.Grayscale(num_output_channels=1),
    transforms.ToTensor(),
    transforms.Normalize([0.5], [0.5])
])

In [0]:
%%time
wb_mel_stft = AudioFeatureDataset(data_df, ROOT_PATH,
                                    audio_transform=wb_mel_transform,
                                    image_transform=wb_image_transform,
                                    duration=4)
image_wbmel = wb_mel_stft[1][0]

plt.imshow(image_wbmel[0], origin='lower')
plt.axis('off')
plt.title('Wideband mel-scaled spectrogram')
plt.show()

### Mel Frequency Cepstral Coefficients (MFCCs)

Along with the four spectral features, mel-frequency cepstral coefficients were also extracted from the raw audio. Per Huzaifah (2017: 2), these were 'obtained using the standard procedure', that is, by:

1. Computing the STFT of the signal
2. Applying a mel filterbank to the power spectrum of the signal, producing the mel-scaled STFT
3. Taking the logarithm of the powers of each mel frequency
4. Taking the Discrete Cosine Transform of the list of log mel powers.

The default behaviour of `librosa.feature.mfcc` is to only return the first 20 MFCCs. This was left unchanged as it was assumed to be the approach used by Huzaifah. MFCC values were normalised to $[-1, 1]$ and the image resized to 37$\times$50 pixels with Lanczos resampling.
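The last two steps of the procedure listed above can be sketched in plain NumPy. The input here is a random stand-in for a mel-scaled power spectrogram (steps 1 and 2 are skipped), and the names `dct_ii_ortho` and `mfcc_from_mel_power` are hypothetical; the sketch uses an orthonormal DCT-II and is not guaranteed to match `librosa.feature.mfcc` value for value.

```python
import numpy as np

def dct_ii_ortho(x):
    """Orthonormal DCT-II along the first axis (the mel-band axis)."""
    n = x.shape[0]
    k = np.arange(n)[:, None]            # index of the output coefficient
    i = np.arange(n)[None, :]            # index of the input band
    basis = np.cos(np.pi * k * (2 * i + 1) / (2 * n))
    scale = np.full((n, 1), np.sqrt(2.0 / n))
    scale[0] = np.sqrt(1.0 / n)
    return (scale * basis) @ x

def mfcc_from_mel_power(mel_power, n_mfcc=20, amin=1e-10):
    # Step 3: logarithm of the power in each mel band
    log_mel = 10.0 * np.log10(np.maximum(mel_power, amin))
    # Step 4: DCT across the mel bands, keeping the first n_mfcc coefficients
    return dct_ii_ortho(log_mel)[:n_mfcc]

rng = np.random.default_rng(0)
mel_power = rng.random((128, 63)) + 1e-3   # stand-in: 128 mel bands x 63 frames
coeffs = mfcc_from_mel_power(mel_power)
print(coeffs.shape)
```

Each column of `coeffs` is a compact description of the spectral envelope of one frame, which is why only the first few coefficients are usually kept.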

In [0]:
class MFCC(object):
    """
    Gets MFCCs from a raw audio signal using librosa
    Args:
        sr: sample rate of the input signal
        n_mfcc: number of coefficients to return
    """
    def __init__(self, sr=16000, n_mfcc=20):
        self.sr = sr
        self.n_mfcc = n_mfcc
    
    def __call__(self, sig):
        # y must be passed by keyword in recent librosa releases
        mfcc = librosa.feature.mfcc(y=sig, sr=self.sr, n_mfcc=self.n_mfcc)
        return mfcc

In [0]:
mfcc_transform = transforms.Compose([
    MFCC()
])

mfcc_image_transform = transforms.Compose([
    transforms.ToPILImage(),
    transforms.Resize((37, 50), 1),  # 1 == PIL.Image.LANCZOS (4 is BOX)
    transforms.Grayscale(num_output_channels=1),
    transforms.ToTensor(),
    transforms.Normalize([0.5], [0.5])
])

In [0]:
mfcc = AudioFeatureDataset(data_df, ROOT_PATH,
                                    audio_transform=mfcc_transform,
                                    image_transform=mfcc_image_transform,
                                    duration=4)
image_mfcc = mfcc[15][0]

plt.imshow(image_mfcc[0], origin='lower')
plt.axis('off')
plt.title('MFCCs')
plt.show()

# CNN architectures

Huzaifah (2017: 3) defined two CNN architectures, with Conv-5 being deeper than Conv-3. For each network, two different convolutional filters were considered, a  $3\times3$ filter and an $M\times3$ filter where $M$ spans the FFT frequency bins. So far, only the $3\times3$ filter networks are implemented here.

In [0]:
def outputSize(in_size, kernel_size, stride, padding):
    """Output size along one dimension of a convolution or pooling layer"""
    output = (in_size - kernel_size + 2 * padding) // stride + 1
    return output
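The same formula can be used to check the flattened feature sizes that are hard-coded in the model definitions further down. The sketch below is self-contained, so it restates the formula under the hypothetical name `output_size`; the layer lists are read off the `nn.Sequential` definitions and only include layers that can change the spatial size.

```python
def output_size(in_size, kernel_size, stride, padding):
    """Output size along one dimension of a convolution or pooling layer."""
    return (in_size - kernel_size + 2 * padding) // stride + 1

# (kernel_size, stride, padding) for each conv / max-pool layer, in order
CONV3_LAYERS = [(3, 1, 1), (4, 4, 1)]
CONV5_LAYERS = [(3, 1, 1), (2, 2, 1), (3, 1, 1), (2, 2, 1), (3, 1, 1)]

def feature_map_size(height, width, layers):
    """Height and width of the feature map after passing through all layers."""
    for kernel_size, stride, padding in layers:
        height = output_size(height, kernel_size, stride, padding)
        width = output_size(width, kernel_size, stride, padding)
    return height, width

for name, layers in (("conv-3", CONV3_LAYERS), ("conv-5", CONV5_LAYERS)):
    for image_size in ((37, 50), (154, 12)):
        print(name, image_size, feature_map_size(*image_size, layers))
```

Multiplying each result by the number of output channels (180 for Conv-3, 96 for Conv-5) gives the input size of the first `nn.Linear` layer.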

In [0]:
class Lambda(nn.Module):
    def __init__(self, func):
        super().__init__()
        self.func = func
    
    def forward(self, x):
        return self.func(x)
    
class PrintLambda(nn.Module):
    def __init__(self, func):
        super().__init__()
        self.func = func

    def forward(self, x):
        self.func(x)
        return x

## Conv-3

### Narrowband conv-3

In [0]:
nb_conv3_model = nn.Sequential(
    nn.Conv2d(1, 180, kernel_size=3, stride=1, padding=1),
    nn.ReLU(),
    nn.MaxPool2d(kernel_size=4, stride=4, padding=1),
    nn.Dropout(0.5),
    Lambda(lambda x: x.view(-1, 180 * 9 * 13)),
    nn.Linear(180 * 9 * 13, 800),
    nn.ReLU(),
    nn.Dropout(0.5),
    nn.Linear(800, 2)
)

# With leaky ReLU
nb_conv3_leaky_model = nn.Sequential(
    nn.Conv2d(1, 180, kernel_size=3, stride=1, padding=1),
    nn.LeakyReLU(),
    nn.MaxPool2d(kernel_size=4, stride=4, padding=1),
    nn.Dropout(0.5),
    Lambda(lambda x: x.view(-1, 180 * 9 * 13)),
    nn.Linear(180 * 9 * 13, 800),
    nn.LeakyReLU(),
    nn.Dropout(0.5),
    nn.Linear(800, 2)
)


### Wideband conv-3

In [0]:
wb_conv3_model = nn.Sequential(
    nn.Conv2d(1, 180, kernel_size=3, stride=1, padding=1),
    nn.ReLU(),
    nn.MaxPool2d(kernel_size=4, stride=4, padding=1),
    nn.Dropout(0.5),
    Lambda(lambda x: x.view(-1, 180 * 39 * 3)),
    nn.Linear(180 * 39 * 3, 800),
    nn.ReLU(),
    nn.Dropout(0.5),
    nn.Linear(800, 2)
)

# With leaky ReLU
wb_conv3_leaky_model = nn.Sequential(
    nn.Conv2d(1, 180, kernel_size=3, stride=1, padding=1),
    nn.LeakyReLU(),
    nn.MaxPool2d(kernel_size=4, stride=4, padding=1),
    nn.Dropout(0.5),
    Lambda(lambda x: x.view(-1, 180 * 39 * 3)),
    nn.Linear(180 * 39 * 3, 800),
    nn.LeakyReLU(),
    nn.Dropout(0.5),
    nn.Linear(800, 2)
)

## Conv-5

### Narrowband conv-5

In [0]:
nb_conv5_model = nn.Sequential(
    nn.Conv2d(1, 24, kernel_size=3, stride=1, padding=1),
    nn.ReLU(),
    nn.MaxPool2d(kernel_size=2, stride=2, padding=1),
    nn.Dropout(0.5),
    nn.Conv2d(24, 48, kernel_size=3, stride=1, padding=1),
    nn.ReLU(),
    nn.MaxPool2d(kernel_size=2, stride=2, padding=1),
    nn.Conv2d(48, 96, kernel_size=3, stride=1, padding=1),
    nn.ReLU(),
    Lambda(lambda x: x.view(-1, 96 * 10 * 14)),
    nn.Linear(96 * 10 * 14, 800),
    nn.ReLU(),
    nn.Dropout(0.5),
    nn.Linear(800, 2)
)

# With leaky ReLU
nb_conv5_leaky_model = nn.Sequential(
    nn.Conv2d(1, 24, kernel_size=3, stride=1, padding=1),
    nn.LeakyReLU(),
    nn.MaxPool2d(kernel_size=2, stride=2, padding=1),
    nn.Dropout(0.5),
    nn.Conv2d(24, 48, kernel_size=3, stride=1, padding=1),
    nn.LeakyReLU(),
    nn.MaxPool2d(kernel_size=2, stride=2, padding=1),
    nn.Conv2d(48, 96, kernel_size=3, stride=1, padding=1),
    nn.LeakyReLU(),
    Lambda(lambda x: x.view(-1, 96 * 10 * 14)),
    nn.Linear(96 * 10 * 14, 800),
    nn.LeakyReLU(),
    nn.Dropout(0.5),
    nn.Linear(800, 2)
)

### Wideband conv-5

In [0]:
wb_conv5_model = nn.Sequential(
    nn.Conv2d(1, 24, kernel_size=3, stride=1, padding=1),
    nn.ReLU(),
    nn.MaxPool2d(kernel_size=2, stride=2, padding=1),
    nn.Dropout(0.5),
    nn.Conv2d(24, 48, kernel_size=3, stride=1, padding=1),
    nn.ReLU(),
    nn.MaxPool2d(kernel_size=2, stride=2, padding=1),
    nn.Conv2d(48, 96, kernel_size=3, stride=1, padding=1),
    nn.ReLU(),
    Lambda(lambda x: x.view(-1, 96 * 40 * 4)),
    nn.Linear(96 * 40 * 4, 800),
    nn.ReLU(),
    nn.Dropout(0.5),
    nn.Linear(800, 2)
)

# With leaky ReLU
wb_conv5_leaky_model = nn.Sequential(
    nn.Conv2d(1, 24, kernel_size=3, stride=1, padding=1),
    nn.LeakyReLU(),
    nn.MaxPool2d(kernel_size=2, stride=2, padding=1),
    nn.Dropout(0.5),
    nn.Conv2d(24, 48, kernel_size=3, stride=1, padding=1),
    nn.LeakyReLU(),
    nn.MaxPool2d(kernel_size=2, stride=2, padding=1),
    nn.Conv2d(48, 96, kernel_size=3, stride=1, padding=1),
    nn.LeakyReLU(),
    Lambda(lambda x: x.view(-1, 96 * 40 * 4)),
    nn.Linear(96 * 40 * 4, 800),
    nn.LeakyReLU(),
    nn.Dropout(0.5),
    nn.Linear(800, 2)
)

# Training CNNs

## Class definition

Defining a class for preparing the data and training the CNNs.

In [0]:
class _WrappedDataLoader:
    def __init__(self, dl, func, dev):
        self.dl = dl
        self.func = func
        self.dev = dev

    def __len__(self):
        return len(self.dl)

    def __iter__(self):
        batches = iter(self.dl)
        for b in batches:
            yield (self.func(*b, self.dev))

            

class FitCNN(object):
    """
    Functions for training a CNN
    Args:
        train_ds: pytorch Dataset of training data
        valid_ds: pytorch Dataset of validation data
    """
    def __init__(self, train_ds, valid_ds, bs, preprocess_func, model, epochs,
                 loss_func, opt, dev):
        self.dev = dev
        self.train_ds = train_ds
        self.valid_ds = valid_ds
        self.bs = bs
        self.preprocess = preprocess_func
        self.model = model
        self.epochs = epochs
        self.loss_func = loss_func
        self.opt = opt
        
        self.preds = []
        self.actuals = []
        self.accuracy = []
        
    
    def fit(self):
        train_dl, valid_dl = self._get_data(self.train_ds, self.valid_ds, self.bs)
        train_dl = _WrappedDataLoader(train_dl, self.preprocess, self.dev)
        valid_dl = _WrappedDataLoader(valid_dl, self.preprocess, self.dev)
        
        self.model.apply(self._weights_init)
        for epoch in range(self.epochs):
            self.model.train()
            for xb, yb in train_dl:
                self._batch_loss(self.model, self.loss_func, xb, yb, self.opt)
            
            self.model.eval()
            with torch.no_grad():
                losses, nums = zip(
                    *[self._batch_loss(self.model, self.loss_func, xb, yb) for xb, yb in valid_dl]
                )
                val_loss = np.sum(np.multiply(losses, nums)) / np.sum(nums)
            if epoch % 5 == 0:
                print(epoch, val_loss)
        self.preds, self.actuals, self.accuracy = self._accuracy_score(valid_dl)
    
    
    def _get_data(self, train_ds, valid_ds, bs):
        return (
            DataLoader(train_ds, batch_size=bs, shuffle=True),
            DataLoader(valid_ds, batch_size=bs*2)
        )
    
    
    def _batch_loss(self, model, loss_func, xb, yb, opt=None):
        loss = loss_func(model(xb), yb)

        if opt is not None:
            loss.backward()
            opt.step()
            opt.zero_grad()

        return loss.item(), len(xb)
    
    
    def _accuracy_score(self, valid_dl):
        correct = 0
        total = 0
        preds = []
        labels = []
        with torch.no_grad():
            for data in valid_dl:
                b_images, b_labels = data
                outputs = self.model(b_images)
                _, b_predicted = torch.max(outputs.data, 1)
                total += b_labels.size(0)
                correct += (b_predicted == b_labels).sum().item()
                labels.extend(b_labels.cpu().tolist())
                preds.extend(b_predicted.cpu().tolist())
        accuracy = (100 * correct / total)
        print("-" * 40)
        print(f'Accuracy: {accuracy}')
        print(f"Confusion matrix:\n {confusion_matrix(labels, preds)}")
        print("-" * 40)
        return preds, labels, accuracy
        
        
    def _weights_init(self, m):
        if isinstance(m, nn.Conv2d):
            nn.init.xavier_uniform_(m.weight.data,
                                         nn.init.calculate_gain('relu'))
            m.bias.data.zero_()
        elif isinstance(m, nn.Linear):
            nn.init.xavier_uniform_(m.weight.data,
                                         nn.init.calculate_gain('relu'))
            m.bias.data.zero_()

In [0]:
def preprocess_nb(x, y, dev):
    return x.view(-1, 1, 37, 50).to(dev), y.to(dev)

def preprocess_wb(x, y, dev):
    return x.view(-1, 1, 154, 12).to(dev), y.to(dev)

## Cross validation

In [0]:
class CVModel:
    """
    Create and train a CNN with n-fold cross-validation
    """
    def __init__(self, X, nfolds, root, audio_trfms, image_trfms, dur,
                librosa=False, dev=None):
        self.X = X
        self.y = self.X['label']
        self.nfolds = nfolds
        
        self.root = root
        self.audio_trfms = audio_trfms
        self.image_trfms = image_trfms
        self.dur = dur
        self.librosa = librosa
        
        self.model = ''
        self.loss_func = ''
        self.opt = ''
        self.params = {}
        
        self.acc_scores = []
        self.train_preds = []
        self.train_actual = []
        
        if dev:
            self.dev = dev
        elif torch.cuda.is_available():
            self.dev = torch.device("cuda")
        else:
            self.dev = torch.device("cpu")
    
    def train_cnn(self, model, params):
        # CV setup
#         dev = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")
        self.model = model.to(self.dev)
        self.params = params
        
        skf = StratifiedKFold(n_splits=self.nfolds, random_state=42, shuffle=True)
        fold_count = 1
        
        for train_index, valid_index in skf.split(np.zeros(self.X.shape[0]), self.y):
            print(f"Fold {fold_count} / {self.nfolds}:")
            # Splits data into training and validation sets
            X_train, X_valid = self.X.iloc[train_index], self.X.iloc[valid_index]
            
            train_ds = AudioFeatureDataset(X_train, self.root,
                                           audio_transform=self.audio_trfms,
                                           image_transform=self.image_trfms,
                                           duration=self.dur)
                        
            valid_ds = AudioFeatureDataset(X_valid, self.root,
                                           audio_transform=self.audio_trfms,
                                           image_transform=self.image_trfms,
                                           duration=self.dur)
            
            # Trains the model and gets predictions and accuracy scores
            opt = optim.Adam(self.model.parameters(),
                             lr=self.params['lr'],
                             weight_decay=self.params['l2'])
            loss_func = nn.CrossEntropyLoss()
            
            model_fitter = FitCNN(train_ds, valid_ds, bs=self.params['bs'],
                                  preprocess_func=self.params['preprocess'],
                                  model=self.model,
                                  epochs=self.params['epochs'],
                                  loss_func=loss_func,
                                  opt=opt, dev=self.dev)
            
            
            model_fitter.fit()

            # Process model results
            self.train_preds.extend(model_fitter.preds)
            self.train_actual.extend(model_fitter.actuals)
            self.acc_scores.append(model_fitter.accuracy)
            fold_count += 1
            
        # Print model results summary
        print("=" * 40)
        print(f"Accuracy scores: {self.acc_scores}")
        print(f"Mean accuracy: {np.mean(self.acc_scores)}")
        print(f"Accuracy sd: {np.std(self.acc_scores)}")
        print(f"Confusion matrix:\n{confusion_matrix(self.train_actual, self.train_preds)}")
        print("-" * 40)
        print(f"Model parameters:\n {self.params}")
        print("=" * 40)

## Narrowband linear STFT: Conv-3

In [0]:
%%time
nblinear_conv3_params = {
    'bs': 40,
    'preprocess': preprocess_nb,
    'epochs': 50,
    'lr': 0.01,
    'l2': 0.005
}

nblinear_conv3 = CVModel(data_df, 5, ROOT_PATH,
                        nb_lin_transform, nb_image_transform,
                        4)
nblinear_conv3.train_cnn(nb_conv3_model,
                        nblinear_conv3_params)

## Narrowband linear STFT: Conv-5

In [0]:
%%time
nblinear_conv5_params = {
    'bs': 40,
    'preprocess': preprocess_nb,
    'epochs': 40,
    'lr': 0.01,
    'l2': 0.005
}

nblinear_conv5 = CVModel(data_df, 5, ROOT_PATH,
                        nb_lin_transform, nb_image_transform,
                        4)
nblinear_conv5.train_cnn(nb_conv5_model,
                        nblinear_conv5_params)

## Wideband linear STFT: Conv-3

In [0]:
wblinear_conv3_params = {
    'bs': 30,
    'preprocess': preprocess_wb,
    'epochs': 40,
    'lr': 0.01,
    'l2': 0.005
}

wblinear_conv3 = CVModel(data_df, 5, ROOT_PATH,
                        wb_lin_transform, wb_image_transform,
                        4)
wblinear_conv3.train_cnn(wb_conv3_model,
                        wblinear_conv3_params)

## Wideband linear STFT: Conv-5

In [0]:
wblinear_conv5_params = {
    'bs': 30,
    'preprocess': preprocess_wb,
    'epochs': 40,
    'lr': 0.01,
    'l2': 0.005
}

wblinear_conv5 = CVModel(data_df, 5, ROOT_PATH,
                        wb_lin_transform, wb_image_transform,
                        4)
wblinear_conv5.train_cnn(wb_conv5_model,
                        wblinear_conv5_params)

## Narrowband mel-scale STFT: Conv-3

In [0]:
%%time
nbmel_conv3_params = {
    'bs': 30,
    'preprocess': preprocess_nb,
    'epochs': 100,
    'lr': 0.005,
    'l2': 0.005
}

nbmel_conv3 = CVModel(data_df, 5, ROOT_PATH,
                        nb_mel_transform, nb_image_transform,
                        4)
nbmel_conv3.train_cnn(nb_conv3_model,
                        nbmel_conv3_params)

## Narrowband mel-scale STFT: Conv-5

In [0]:
%%time
nbmel_conv5_params = {
    'bs': 30,
    'preprocess': preprocess_nb,
    'epochs': 40,
    'lr': 0.01,
    'l2': 0.005
}

nbmel_conv5 = CVModel(data_df, 5, ROOT_PATH,
                        nb_mel_transform, nb_image_transform,
                        4)
nbmel_conv5.train_cnn(nb_conv5_model,
                        nbmel_conv5_params)

## Wideband mel-scale STFT: Conv-3

In [0]:
%%time
wbmel_conv3_params = {
    'bs': 30,
    'preprocess': preprocess_wb,
    'epochs': 40,
    'lr': 0.01,
    'l2': 0.005
}

wbmel_conv3 = CVModel(data_df, 5, ROOT_PATH,
                        wb_mel_transform, wb_image_transform,
                        4)
wbmel_conv3.train_cnn(wb_conv3_model,
                        wbmel_conv3_params)

## Wideband mel-scale STFT: Conv-5

In [0]:
%%time
wbmel_conv5_params = {
    'bs': 40,
    'preprocess': preprocess_wb,
    'epochs': 40,
    'lr': 0.01,
    'l2': 0.005
}

wbmel_conv5 = CVModel(data_df, 5, ROOT_PATH,
                        wb_mel_transform, wb_image_transform,
                        4)
# wbmel_conv5.train_cnn(wb_conv5_model,
#                         wbmel_conv5_params)

wbmel_conv5.train_cnn(wb_conv5_leaky_model,
                        wbmel_conv5_params)

## MFCCs: Conv-3

In [0]:
%%time
mfcc_conv3_params = {
    'bs': 30,
    'preprocess': preprocess_nb,
    'epochs': 40,
    'lr': 0.01,
    'l2': 0.005
}

mfcc_conv3 = CVModel(data_df, 5, ROOT_PATH,
                        mfcc_transform, nb_image_transform,
                        4)
mfcc_conv3.train_cnn(nb_conv3_model,
                        mfcc_conv3_params)