### Audio Classification in PyTorch
This notebook describes the steps taken to build a Convolutional Neural Network (CNN) for audio classification. The goal is to research the use of CNNs for audio classification and how to preprocess audio data so that it is useful for training a model.

#### 1. Introduction
This assignment aims to solve the following problem:

"The company SLIFTS (Smart Lifts in Floor Transition Systems) want to expand its elevator capabilities to handle spoken commands. In the aftermath of the global 2020 COVID pandemic, the company has noted a sharp decline in the number of passengers that use their elevators. Marketing research has shown that people are hesitant to touch physical buttons in the elevator. As one user noted “this up-button looks really yucky, I can almost see the germs crawling on it!”. The situation is extremely serious and people are even doing previously unthinkable things like taking the stairs, which has to be prevented in all cases. To resolve this problem, SLIFTS has hired Zuyd Hogeschool to research and develop elevators with voice command capabilities."

To solve this problem, the following points will be looked at closely:
1. Preprocessing audio data for training a CNN
2. Preparing the dataset for training. This includes making decisions on splitting the dataset.
3. Designing the model.
4. Implementing the model for training.
5. Evaluating the output of the model after training.
6. Fine-tuning the model after evaluating the results.

Note that the advice and conclusion are included in a separate document.

#### 2. Data Collection
For any machine learning/AI project, a dataset is needed to train, test, and evaluate the model once it has been built. Recognizing speech commands requires a dataset of audio recordings of those commands, such as 'yes' or 'down'. Fortunately, such a dataset is already available on [Kaggle](https://www.kaggle.com/datasets/antfilatov/mini-speech-commands/data).

This dataset includes the following commands: down, go, left, no, right, stop, up, yes.

In [None]:
# To make a custom dataset, define a class that inherits from the PyTorch Dataset class.
import os

import torch
import torchaudio
from torch.utils.data import Dataset

class LiftCommandDataset(Dataset):
    def __init__(self, root_dir: str, transform=None):
        self.root_dir = root_dir
        self.file_paths = []
        self.labels = []
        self.label_map = {}
        self.transform = transform

        # Scan the class folders one level deep; each folder name is used as a label.
        # Sorting keeps the label-to-index mapping reproducible across runs.
        for label in sorted(os.listdir(root_dir)):
            label_dir = os.path.join(root_dir, label)
            if os.path.isdir(label_dir):
                self.label_map[label] = len(self.label_map)
                for file_name in os.listdir(label_dir):
                    if file_name.endswith('.wav'):
                        self.file_paths.append(os.path.join(label_dir, file_name))
                        self.labels.append(self.label_map[label])

    def __len__(self):
        return len(self.file_paths)

    def __getitem__(self, index: int):
        audio_path = self.file_paths[index]
        label = self.labels[index]

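        # Load the audio as a float tensor of shape (channels, samples).
        waveform, sample_rate = torchaudio.load(audio_path, normalize=True)

        # Keep only the first channel if the file has more than one.
        if waveform.size(0) > 1:
            waveform = waveform[0:1, :]

        # Drop the channel dimension so each item is a 1-D tensor of samples.
        if waveform.dim() > 1:
            waveform = waveform.squeeze(0)

        if self.transform:
            waveform = self.transform(waveform)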

        return waveform, label


#### 3. Data Preprocessing


A model generally cannot be trained on a dataset unless all inputs are uniform in format and length. In the speech command dataset, many audio files differ in length. To make the audio files in a batch equal in length, the shorter ones are padded instead of truncating the longer ones, since padding discards no information from the recordings.

In [None]:
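def padding(batch):
    """Collate function: zero-pad the waveforms in a batch to a common length."""
    waveforms, labels = zip(*batch)

    # Debug output: show the shape of each waveform before padding.
    for i, waveform in enumerate(waveforms):
        print(f"Waveform {i} shape: {waveform.shape}")

    # pad_sequence pads along the first dimension, so each waveform must be 1-D here.
    waveforms_padded = torch.nn.utils.rnn.pad_sequence(list(waveforms), batch_first=True)
    labels = torch.tensor(labels)

    return waveforms_padded, labels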

In [3]:
from torch.utils.data import DataLoader

root_dir = './mini_speech_commands'
dataset = LiftCommandDataset(root_dir)
data_loader = DataLoader(dataset, batch_size=32, shuffle=True, collate_fn=padding)

for waveforms, labels in data_loader:
    print(waveforms.shape)
    print(labels)
    break

RuntimeError: The size of tensor a (16000) must match the size of tensor b (14861) at non-singleton dimension 1
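
The `RuntimeError` above is consistent with `pad_sequence` receiving waveforms that still carry a channel dimension, i.e. shape `(1, samples)`: it pads only along the first dimension and expects all trailing dimensions to match. This is an interpretation of the traceback, not something the notebook states. A minimal sketch with synthetic 1-D tensors (the lengths 16000 and 14861 come from the error message; the tensors themselves are made up) shows the padding the collate function is meant to perform:

```python
import torch
from torch.nn.utils.rnn import pad_sequence

# Two synthetic mono waveforms of different lengths (stand-ins for real clips).
long_clip = torch.ones(16000)
short_clip = torch.ones(14861)

# With 1-D inputs, pad_sequence zero-pads every tensor to the longest length.
padded = pad_sequence([long_clip, short_clip], batch_first=True)

print(padded.shape)
# The tail of the shorter clip is filled with zeros.
print(padded[1, 14861:].abs().sum().item())
```

Because padding happens per batch, the padded length can differ between batches; a fixed target length would be needed if the model requires a constant input size.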

To make audio files usable for training a CNN, they need to be converted to an image-like representation. For the given problem, there are two options to achieve this:
1. Mel-spectrogram: full time-frequency representation of an audio signal
2. Mel-frequency Cepstral Coefficients (MFCC): reduced set of coefficients that summarize the spectral characteristics.

Both representations make use of the Mel scale, a scale of pitches that approximates the way humans perceive sound. The Mel-spectrogram, however, keeps the full time-frequency representation of the audio signal, while MFCCs retain only a compact summary of its spectral characteristics.
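
To make the Mel scale concrete, here is a minimal sketch of one common Hz-to-Mel conversion, the HTK-style formula `m = 2595 * log10(1 + f / 700)`. The function names are hypothetical, and this is not necessarily the exact variant used later in the notebook: librosa defaults to the Slaney formulation unless `htk=True` is passed.

```python
import math


def hz_to_mel(freq_hz: float) -> float:
    """Convert a frequency in Hz to mels (HTK-style formula)."""
    return 2595.0 * math.log10(1.0 + freq_hz / 700.0)


def mel_to_hz(mel: float) -> float:
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


# Equal steps in Hz shrink on the Mel scale as frequency increases.
for freq in (250, 500, 1000, 2000, 4000, 8000):
    print(f"{freq:>5} Hz -> {hz_to_mel(freq):7.1f} mel")
```

The same 1000 Hz step covers far fewer mels near 8 kHz than near 0 Hz, which reflects how pitch resolution in human hearing is finer at low frequencies.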

For this project, the Mel-spectrogram will be used, since the richer representation might give the model more opportunity to learn underlying patterns.

In [None]:
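import librosa
import numpy as np

def extract_features(file_path):
    # sr=None keeps the file's native sample rate instead of resampling.
    y, sr = librosa.load(file_path, sr=None)
    # Power Mel-spectrogram with 128 Mel bands; shape is (n_mels, n_frames).
    mel_spectrogram = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=128)
    # Convert power to a decibel (log) scale.
    log_mel_spectrogram = librosa.power_to_db(mel_spectrogram)
    return log_mel_spectrogram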
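
The shape of the returned spectrogram depends on the clip length, which is why unequal clips matter for the CNN. As a rough sketch, assuming librosa's defaults of `hop_length=512` and `center=True` (neither is set explicitly in the notebook), the number of time frames is `1 + n_samples // hop_length`. The helper name below is hypothetical:

```python
from typing import Tuple


def mel_spectrogram_shape(n_samples: int, n_mels: int = 128,
                          hop_length: int = 512) -> Tuple[int, int]:
    """Expected (n_mels, n_frames) of a centered Mel-spectrogram."""
    n_frames = 1 + n_samples // hop_length
    return n_mels, n_frames


# Clip lengths taken from the padding error message shown earlier.
print(mel_spectrogram_shape(16000))
print(mel_spectrogram_shape(14861))
```

Under these assumptions a 16000-sample clip yields 32 frames and a 14861-sample clip yields 30, so unpadded clips would produce spectrograms of different widths.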

#### 4. Dataset Preparation
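One possible way to split the dataset is sketched below with `torch.utils.data.random_split`. The 80/10/10 ratio, the seed, and the stand-in `TensorDataset` are assumptions made for illustration; the notebook does not state which split was chosen.

```python
import torch
from torch.utils.data import TensorDataset, random_split

# Stand-in dataset: 100 fake clips with labels 0..7 (not real data).
fake_waveforms = torch.zeros(100, 160)
fake_labels = torch.arange(100) % 8
demo_dataset = TensorDataset(fake_waveforms, fake_labels)

# Assumed 80/10/10 split into train, validation, and test sets.
n_train = int(0.8 * len(demo_dataset))
n_val = int(0.1 * len(demo_dataset))
n_test = len(demo_dataset) - n_train - n_val

# A seeded generator makes the split reproducible.
generator = torch.Generator().manual_seed(42)
train_set, val_set, test_set = random_split(
    demo_dataset, [n_train, n_val, n_test], generator=generator
)

print(len(train_set), len(val_set), len(test_set))
```

In the notebook itself, `demo_dataset` would be replaced by the `LiftCommandDataset` instance, and each subset would get its own `DataLoader`.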

#### 5. Model Design

VGG-style network using convolutional layers.
The network consists of 4 convolutional blocks, followed by a flatten layer, a linear transformation, and a softmax.

In [None]:
from torch import nn

class CNNet(nn.Module):
    def __init__(self):
        super().__init__()

        # nn.Sequential applies its layers in the order they are listed.
        self.conv1 = nn.Sequential(
            nn.Conv2d(in_channels=1,
                out_channels=16,
                kernel_size=3,
                stride=1,
                padding=2),
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=2)
        )
        self.conv2 = nn.Sequential(
            nn.Conv2d(in_channels=16,
                out_channels=32,
                kernel_size=3,
                stride=1,
                padding=2),
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=2)
        )
        self.conv3 = nn.Sequential(
            nn.Conv2d(in_channels=32,
                out_channels=64,
                kernel_size=3,
                stride=1,
                padding=2),
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=2)
        )
        self.conv4 = nn.Sequential(
            nn.Conv2d(in_channels=64,
                out_channels=128,
                kernel_size=3,
                stride=1,
                padding=2),
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=2)
        )

        self.flatten = nn.Flatten()
        # 128*5*4 assumes the conv stack outputs 128 channels of size 5 x 4;
        # adjust in_features if the input spectrogram shape differs.
        # The 8 outputs correspond to the 8 commands in the dataset.
        self.linear = nn.Linear(128*5*4, 8)
        self.softmax = nn.Softmax(dim=1)

    def forward(self, input_data):
        x = self.conv1(input_data)
        x = self.conv2(x)
        x = self.conv3(x)
        x = self.conv4(x)

        x = self.flatten(x)
        logits = self.linear(x)
        predictions = self.softmax(logits)

        return predictions
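
The `in_features` of the linear layer has to match the flattened output of the four convolutional blocks, which depends on the input spectrogram shape. The sketch below applies the standard output-size formulas for `Conv2d` (kernel 3, stride 1, padding 2) and `MaxPool2d(2)` to work this out. The helper names are hypothetical, and the two input shapes are assumptions, since the notebook does not state the input shape: 64 x 44 is a shape that yields the 128*5*4 used above, and 128 x 32 is what a 16000-sample clip with `n_mels=128` would give under librosa's default hop length.

```python
def conv_block_out(size: int, kernel: int = 3, stride: int = 1,
                   padding: int = 2, pool: int = 2) -> int:
    """Output length of one Conv2d + MaxPool2d block along one dimension."""
    conv = (size + 2 * padding - kernel) // stride + 1
    return conv // pool  # MaxPool2d(kernel_size=2) halves the size, rounding down


def flattened_features(height: int, width: int,
                       channels: int = 128, n_blocks: int = 4) -> int:
    """Number of features after n_blocks conv blocks and a flatten layer."""
    for _ in range(n_blocks):
        height = conv_block_out(height)
        width = conv_block_out(width)
    return channels * height * width


print(flattened_features(64, 44))    # assumed input of 64 Mel bands x 44 frames
print(flattened_features(128, 32))   # assumed input of 128 Mel bands x 32 frames
```

If the second shape is the one actually fed to the model, the linear layer would need `128*9*3` input features instead of `128*5*4`.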


#### 6. Training Model

#### 7. Evaluation

In [None]:
from torchinfo import summary

cnn = CNNet()

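# input_size is (batch, channels, n_mels, frames). The shape 1 x 1 x 64 x 44 is an
# assumption chosen to match the 128*5*4 input of the linear layer; adjust as needed.
summary(cnn, input_size=(1, 1, 64, 44))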

#### 8. Optimization