# WavClassifier

Classify audio directly from the WAV file, extracting features with a CNN built from 1D convolutions (`Conv1d`).
State-of-the-art audio classifiers use Mel-spectrograms, as described in `./mel_spec_classifier.ipynb`, but these do not preserve phase information.

In [None]:
!pip install "ray[tune]"
import torch
from utils import *
import torch.nn as nn
import numpy as np
import torchvision.transforms as transforms
import torch.utils.data as Data
from scipy.io import wavfile
from ray import air, tune
import os
from ray.tune.schedulers import ASHAScheduler

## Mount drive
Mount Google Drive if running on Google Colab.

In [None]:
from google.colab import drive
drive.mount('/content/drive')

## Constant parameters used in training

Run `setup.sh` to mount the Google Drive that contains the GTZAN dataset.

In [None]:
GTZAN_WAV = "/content/drive/MyDrive/GTZAN/Data/genres_original/"

GENRES = {'blues': 0, 'classical': 1, 'country': 2, 'disco': 3,
          'hiphop': 4, 'jazz': 5, 'metal': 6, 'pop': 7, 'reggae': 8,
          'rock': 9}

DEVICE = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print("Using device", DEVICE)

## Training

Create a `Dataset` for the audio files.

We split each WAV file into two equal halves, for two reasons:

1. To mitigate the curse of dimensionality. The two halves are expected to differ, so this is a cheap way to increase the number of data points and reduce overfitting. It also halves the dimensionality of the input. 15 seconds should be more than enough to figure out the genre!
2. Ray Tune was not able to store `WAVDataset` instances by default because they exceeded its size constraints. We could override this by setting a larger size limit, but splitting the files kills two birds with one stone.
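
The split can be sketched on a synthetic signal. This is a minimal illustration, assuming a clip of exactly 30 seconds at 22050 Hz; the zero-filled array and the names `SAMPLE_RATE` and `TRUNCATED_LEN` are stand-ins introduced here, not part of the notebook.

```python
import numpy as np

SAMPLE_RATE = 22050      # Hz (assumed GTZAN sample rate)
TRUNCATED_LEN = 330000   # same value as `max_size` in WAVDataset

# Synthetic stand-in for one 30-second clip (assumption: exactly 661500 samples).
clip = np.zeros(30 * SAMPLE_RATE, dtype=np.int16)

# Split into two halves, then truncate each half to a common length.
first_half, second_half = np.array_split(clip, 2)
first_half = first_half[:TRUNCATED_LEN].astype(np.float32)
second_half = second_half[:TRUNCATED_LEN].astype(np.float32)

print(first_half.shape, second_half.shape)   # (330000,) (330000,)
print(TRUNCATED_LEN / SAMPLE_RATE)           # just under 15 seconds
```

Each half therefore covers slightly less than 15 seconds once truncated, and both halves receive the same genre label.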

In [None]:
class WAVDataset(Data.Dataset):
    def __init__(self):
        self.wav = []
        self.labels = []

        # The WAV files are not all the same length. Truncate every half
        # to a common length so that all datapoints have the same size.
        max_size = 330000

        # Go through all songs and tag X (waveform tensor), Y as genre.
        for genre in os.listdir(GTZAN_WAV):
            for song in os.listdir(os.path.join(GTZAN_WAV, genre)):
                abs_path = os.path.join(GTZAN_WAV, genre, song)

                # There seems to be a format issue with jazz.00054.wav, so skip it.
                if 'jazz.00054.wav' not in abs_path:
                    _, data = wavfile.read(abs_path)

                    # Split into two WAV files, each covering 15 seconds of music
                    data_1, data_2 = np.array_split(data, 2)
                    data_1 = data_1[:max_size].astype(np.float32)
                    data_2 = data_2[:max_size].astype(np.float32)

                    # Convert wav file to tensor
                    self.wav.append(torch.from_numpy(np.reshape(data_1, (1, max_size))))
                    self.wav.append(torch.from_numpy(np.reshape(data_2, (1, max_size))))

                    # Convert genre tag to associated digit
                    self.labels.append(GENRES[genre])
                    self.labels.append(GENRES[genre])
    def __len__(self):
        return len(self.wav)

    def __getitem__(self, idx):
        return self.wav[idx], self.labels[idx]

The `WavTrainer` model is a CNN with 2 convolutional layers and 2 hidden linear layers, followed by a linear classifier.
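
The input size of the first linear layer (52784 in the `WavTrainer` code below) follows from the layer sizes. As a rough check, the sketch below traces the sequence length through the two convolution and pooling stages, assuming no padding or dilation and an input of 330000 samples; `conv1d_out_len` is a helper introduced here for illustration.

```python
def conv1d_out_len(length, kernel_size, stride=1):
    """Output length of an unpadded, undilated 1D convolution or pooling layer."""
    return (length - kernel_size) // stride + 1

length = 330000                                              # samples per half clip
length = conv1d_out_len(length, kernel_size=3)               # Conv1d(1, 8, 3)   -> 329998
length = conv1d_out_len(length, kernel_size=10, stride=10)   # MaxPool1d(10, 10) -> 32999
length = conv1d_out_len(length, kernel_size=3)               # Conv1d(8, 16, 3)  -> 32997
length = conv1d_out_len(length, kernel_size=10, stride=10)   # MaxPool1d(10, 10) -> 3299

flattened = 16 * length   # 16 output channels after the second convolution
print(flattened)          # 52784
```

Changing the kernel sizes, the pooling, or the clip length changes this number, so the `nn.Linear` input size has to be recomputed alongside.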

Each `wav` file is 30 seconds long and sampled at 22050 Hz, so each file has roughly 661500 samples. We estimate, from a human standpoint, how long 'musical features' need to be in order to differentiate genres. The smallest 'features' seem to be distinguishable within a significant fraction of a second.

So, the receptive field of the convolutional layers of the CNN should cover a significant fraction of a second.

What is a significant fraction of a second? This is for hyperparameter tuning to decide. But the conclusion is that very small kernel sizes (such as 3, common in 2D convolutions) should not apply here, since we would not learn much about the song's features from essentially 0.0001 seconds of the song.
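
To put numbers on this, the receptive field of a stack of layers can be computed from their kernel sizes and strides. The sketch below is a minimal illustration that plugs in the (kernel size, stride) pairs of the `WavTrainer` layers and assumes a 22050 Hz sample rate; the variable names are introduced here.

```python
SAMPLE_RATE = 22050  # Hz

# (kernel_size, stride) per layer, in order: conv, pool, conv, pool.
layers = [(3, 1), (10, 10), (3, 1), (10, 10)]

receptive_field = 1  # input samples seen by one output value
jump = 1             # distance in input samples between adjacent outputs
for kernel_size, stride in layers:
    receptive_field += (kernel_size - 1) * jump
    jump *= stride

print(receptive_field)                # 122 samples
print(receptive_field / SAMPLE_RATE)  # roughly 0.0055 seconds
print(3 / SAMPLE_RATE)                # a single size-3 kernel: roughly 0.0001 seconds
```

Under these assumptions, one value entering the linear layers summarizes about 5.5 ms of audio. Larger kernels or more pooling would widen this window, which is the kind of choice left to hyperparameter tuning.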


In [None]:
class WavTrainer(nn.Module):
  def __init__(self, l1=1000, l2=20):
    super().__init__()

    self.conv_layer_1 = nn.Sequential(nn.Conv1d(1, 8, 3),
                                      nn.ReLU(),
                                      nn.MaxPool1d(kernel_size=10, stride=10)
                                      )

    self.conv_layer_2 = nn.Sequential(nn.Conv1d(8, 16, 3),
                                      nn.ReLU(),
                                      nn.MaxPool1d(kernel_size=10, stride=10)
                                      )

    self.flatten_layer = nn.Flatten()

    self.linear_layer_1 = nn.Sequential(nn.Linear(52784, l1),
                                        nn.ReLU())

    self.linear_layer_2 = nn.Sequential(nn.Linear(l1, l2),
                                        nn.ReLU())

    self.classifier = nn.Linear(l2, 10)

  def forward(self, x):
      # First 1D convolution layer
      x = self.conv_layer_1(x)
      # Second 1D convolution layer
      x = self.conv_layer_2(x)

      # Linear layer and classifier
      x = self.flatten_layer(x)
      x = self.linear_layer_1(x)
      x = self.linear_layer_2(x)
      x = self.classifier(x)

      return x

Create routines for training and validation, and perform hyperparameter tuning to arrive at a better-optimized model.

In [None]:
def train_wav_classifier_model(config):

    model = WavTrainer(l1=config["l1"], l2=config["l2"])
    model.to(DEVICE)
    wav_dataset = WAVDataset()

    # Train the model on the training set; Ray Tune uses the validation set to tune hyperparameters
    train_model(model, DEVICE, config, wav_dataset)

## Testing

Create a routine for testing the model. The split used is 80% for training, 10% for validation, and 10% for testing.
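
How `utils` performs this split is not shown in this notebook. As a rough sketch of one way to obtain an 80/10/10 split, the hypothetical helper below shuffles indices with a fixed seed and cuts them into three lists. The count of 1998 is illustrative and assumes 999 readable clips, each split into two halves.

```python
import random

def split_indices(num_items, train_frac=0.8, val_frac=0.1, seed=0):
    """Shuffle indices and cut them into train / validation / test lists."""
    indices = list(range(num_items))
    random.Random(seed).shuffle(indices)
    n_train = int(train_frac * num_items)
    n_val = int(val_frac * num_items)
    train = indices[:n_train]
    val = indices[n_train:n_train + n_val]
    test = indices[n_train + n_val:]   # the remainder goes to the test split
    return train, val, test

train_idx, val_idx, test_idx = split_indices(1998)
print(len(train_idx), len(val_idx), len(test_idx))   # 1598 199 201
```

Because both halves of a song are separate datapoints, a purely random split like this one can place halves of the same song in different subsets; splitting by song instead would avoid that.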

In [None]:
def test_wav_classifier_model(best_result):
    best_model = WavTrainer(l1=best_result.config["l1"], l2=best_result.config["l2"])
    best_model.to(DEVICE)

    wav_dataset = WAVDataset()

    test_model(best_model, best_result, wav_dataset, DEVICE)

## Main function

Here, we specify the hyperparameter values for Ray Tune to try, then train the model with them. The configuration below fixes a single value per hyperparameter and uses `num_samples=1`, so a single trial is run.

Finally, test the best trained model, as selected by Ray Tune.

In [None]:
def run_wav_classifier():
    config = {
        "l1": 500,
        "l2": 20,
        "lr": 0.001,
        "batch_size": 35,
        "num_epochs": 35
    }

    # Do not stop any trial before it has run at least 20 training iterations
    asha_scheduler = ASHAScheduler(time_attr='training_iteration',
                                   grace_period=20)

    # Adjust resources depending on availability
    trainable = tune.with_resources(
        tune.with_parameters(train_wav_classifier_model),
        resources={"cpu": 12, "gpu": 1},
    )
    tuner = tune.Tuner(
        trainable,
        tune_config=tune.TuneConfig(
            metric='loss',
            mode="min",
            scheduler=asha_scheduler,
            num_samples=1,
        ),
        param_space=config,
    )

    results = tuner.fit()
    best_result = results.get_best_result("loss", "min")

    test_wav_classifier_model(best_result)

run_wav_classifier()