# Applying a neural network for genre classification
## Overview
In this notebook, I create and test a simple _music genre classifier_ and explain the workflow and parameters involved in processing the data and building the network.

> Note: to understand this, one should be familiar with how Digital Signal Processing and Neural Networks work. These are explained in two of my previous notebooks that can be found here:
> * https://www.kaggle.com/mantasu/how-neural-network-works
> * https://www.kaggle.com/mantasu/asp-cheatsheet

## Motivation
This notebook helps to understand how a deep neural network can be applied to audio data.

## Resources
The main sources of reference are these videos:
* https://www.youtube.com/watch?v=szyGiObZymo
* https://www.youtube.com/watch?v=_xcFAiufwd0
* https://www.youtube.com/watch?v=Gf5DO6br0ts

# Preprocessing audio data
## Understanding constants
* `DATA_PATH` - the path of the directory which contains sub-folders named after each genre in the dataset. Each sub-folder contains a number of 30-second audio files
* `JSON_PATH` - the path of a `.json` file where the features of each audio file will be saved in a structured way
* `DURATION` - the length of each track in seconds (will help with calculations)
* `SAMPLE_RATE` - number of amplitude samples we take per second when sampling an analog signal. We don't need a high one because most of the information is contained in the lower frequencies anyway
* `SAMPLES_PER_TRACK` - total number of samples over the whole duration of a track
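
As a small worked example of how these constants relate, the sketch below recomputes `SAMPLES_PER_TRACK` and also derives the highest frequency that the chosen sample rate can represent (the Nyquist frequency, which is half the sample rate). The name `nyquist_hz` is only introduced here for illustration.

```python
# Values mirror the constants used in this notebook
DURATION = 30        # seconds per track
SAMPLE_RATE = 22050  # samples per second

# Total number of samples in one full-length track
SAMPLES_PER_TRACK = SAMPLE_RATE * DURATION

# Highest frequency representable at this sample rate (Nyquist frequency)
nyquist_hz = SAMPLE_RATE / 2

print(SAMPLES_PER_TRACK)  # 661500
print(nyquist_hz)         # 11025.0
```

So a sample rate of 22050 Hz still keeps everything below roughly 11 kHz.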

## Defining the method
`save_mfcc` - computes the MFCCs of each song, genre by genre, and saves them in a JSON file which will be used as training data. Its parameters are:
* `dataset_path` - the root directory which contains all the genre sub-directories with audio files
* `json_path` - path of the JSON file to which the training data will be saved
* `n_mfcc` - number of Mel-frequency cepstral coefficients we want to extract for each frame
    * `13` by default, a common choice because the lowest coefficients carry most of the information about the spectral envelope
* `n_fft` - number of samples in each frame the FFT will be applied to. Frames are short audio chunks (with the defaults used here, 2048 / 22050 ≈ 93 ms), so the higher the sampling rate is, the higher the frame size should be to cover the same duration
    * `2048` by default because a power of 2 lets the FFT be computed efficiently
* `hop_length` - step size (in samples) we take when shifting to the next frame before performing the FFT. Because frames are windowed (faded out at both ends) before each FFT, we need frames to overlap so that no part of the signal is lost. `hop_length` is often half the frame size, but lower values give finer time resolution
    * `512` by default, i.e., 1/4 of the default `n_fft` value
* `num_segments` - number of segments we split each audio file into. Each segment becomes a separate training sample, so splitting gives us more training data from the same files, which generally helps the network
    * `5` by default considering the length of the audio and assuming just a couple of samples will be taken from each genre
* `files_per_genre` - number of files to consider for one genre
    * `-1` by default, indicating that all the files of a genre will be considered
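
To make the relationship between `n_fft`, `hop_length` and `num_segments` more concrete, here is a minimal sketch of the arithmetic. It reuses the notebook's formula for the expected number of MFCC vectors per segment; the helper name `frames_per_segment` is hypothetical and only used for this illustration.

```python
import math

SAMPLE_RATE = 22050
DURATION = 30
SAMPLES_PER_TRACK = SAMPLE_RATE * DURATION

def frames_per_segment(num_segments, hop_length=512):
    """Expected number of MFCC vectors for one segment (notebook's formula)."""
    samples_per_segment = SAMPLES_PER_TRACK // num_segments
    return math.ceil(samples_per_segment / hop_length)

n_fft, hop_length = 2048, 512

# Duration of one frame in milliseconds
frame_ms = 1000 * n_fft / SAMPLE_RATE

# Fraction of each frame shared with the next one
overlap = 1 - hop_length / n_fft

print(round(frame_ms, 1))      # 92.9
print(overlap)                 # 0.75
print(frames_per_segment(5))   # 259
print(frames_per_segment(10))  # 130
```

With 10 segments, each training sample would therefore be a matrix of 130 frames by `n_mfcc` coefficients.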

## Workflow
Before we enter the main loop, we define a dictionary object which will shape our training data:
* `data` - a dictionary which will represent the training data that will be saved in the `.json` file
    * `mapping` - names of the genres the coefficients will be mapped to (e.g., "blues"); the position of a name in this list is its numeric label
    * `mfcc` - MFCC vectors for each segment (e.g., [0, 0.123, ...])
    * `labels` - values which represent the genre each segment's MFCC vectors belong to (e.g., 1)
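
The following sketch shows what such a dictionary could look like and that it survives a round trip through JSON. The numbers are made-up toy values (2 frames with 3 coefficients per segment), not real MFCCs.

```python
import json

# Hypothetical toy values: two genres, one segment each
data = {
    "mapping": ["blues", "classical"],
    "mfcc": [
        [[0.0, 0.123, -1.5], [0.2, 0.1, -1.1]],  # segment 0: 2 frames x 3 coefficients
        [[1.0, -0.4, 0.3], [0.9, -0.5, 0.2]],    # segment 1: 2 frames x 3 coefficients
    ],
    "labels": [0, 1],  # index into "mapping" for each segment
}

# Serialize to a JSON string and parse it back
serialized = json.dumps(data)
restored = json.loads(serialized)

# Look up the genre name of each segment through its numeric label
for mfcc, label in zip(restored["mfcc"], restored["labels"]):
    print(restored["mapping"][label], len(mfcc), "frames")
```

Because JSON only knows lists and numbers, NumPy arrays have to be converted with `tolist()` before they are stored this way.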

### Main loop
1. We loop through each sub-folder in the dataset path
    * Ensure the folder is not the root folder (i.e., is not the dataset path itself), because `os.walk` yields the root in its first iteration
    * Save the semantic label (the name of the sub-folder which represents the genre)
2. We loop through each audio file in the sub-folder
    * Load the audio file with `librosa`
3. We loop through each segment in the audio file
    * Define start and end points at which the audio should be sliced
    * Extract MFCCs for that slice/segment
    * Store the data in the `data` dictionary
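
The directory traversal can be tried out without any audio data. The sketch below builds a throwaway folder tree with empty placeholder files (the genre names and file names are made up) and walks it the same way. Sorting `dirnames` in place is an optional extra, not something the notebook does; it makes the order in which `os.walk` visits the sub-folders deterministic.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as root:
    # Build a tiny stand-in dataset: one sub-folder per genre
    for genre in ("rock", "blues", "jazz"):
        os.makedirs(os.path.join(root, genre))
        # Empty placeholder file standing in for an audio track
        open(os.path.join(root, genre, f"{genre}.00000.wav"), "w").close()

    visited = []
    for dirpath, dirnames, filenames in os.walk(root):
        dirnames.sort()  # fixes the order in which sub-folders are visited
        if dirpath == root:
            continue     # the first iteration yields the root itself
        visited.append((os.path.basename(dirpath), sorted(filenames)))

print(visited)
```

Only the three genre folders end up in `visited`; the root folder is skipped.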

In [None]:
import os       # traversing directories
import librosa  # extracting audio features
import math     # rounding up with ceil
import json     # saving data in json format

# Define important constants
DATA_PATH = "/kaggle/input/gtzan-fixed/Data/genres_original"
JSON_PATH = "/kaggle/temp/data.json"
DURATION = 30
SAMPLE_RATE = 22050
SAMPLES_PER_TRACK = SAMPLE_RATE * DURATION

def save_mfcc(dataset_path, json_path, n_mfcc=13, n_fft=2048, hop_length=512, num_segments=5, files_per_genre=-1):
    """Calculates MFCC vectors for each segment for each song in each genre and saves the data to a JSON file.
        
        Args:
            dataset_path (str):    path to audio files grouped genre-wise
            json_path (str):       path to json file where input and target data will be saved
            n_mfcc (int):          number of Mel-frequency cepstral coefficients to extract for each frame
            n_fft (int):           frame size
            hop_length (int):      step size when shifting to new frame
            num_segments (int):    number of segments each song will be divided into
            files_per_genre (int): number of files to consider for each genre
    """
    # Dictionary to store data
    data = {
        "mapping": [],
        "mfcc": [],
        "labels": []
    }
    
    # Find how many samples there are in one segment
    num_samples_per_segment = SAMPLES_PER_TRACK // num_segments
    
    # Define how many MFCC vectors there should be for one segment
    expected_number_mfcc_vectors_per_segment = math.ceil(num_samples_per_segment / hop_length) # round up: a partial hop at the end still yields a frame
    
    # Loop through all the genres
    for i, (dirpath, dirnames, filenames) in enumerate(os.walk(dataset_path)):
        # Ensure we're not at the root level (compare by value, not by identity)
        if dirpath != dataset_path:
            # Save the semantic label (the name of the genre sub-folder)
            semantic_label = os.path.basename(dirpath)
            data["mapping"].append(semantic_label)
            
            # Print which folder we are processing
            print(f"\nProcessing {semantic_label}")
            
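            # Process files for a specific genre (a negative files_per_genre means "use every file")
            selected_files = filenames if files_per_genre < 0 else filenames[:files_per_genre]
            for filename in selected_files: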
                # Load audio file
                file_path = os.path.join(dirpath, filename)
                signal, sr = librosa.load(file_path, sr=SAMPLE_RATE)
                
                # Print which file we are processing
                print(f"Processing {filename}")
                
                # Process the segments
                for segment in range(num_segments):
                    start_sample = num_samples_per_segment * segment
                    end_sample = start_sample + num_samples_per_segment
                    
                    # Extract MFCC vectors for a particular segment
                    mfcc = librosa.feature.mfcc(y=signal[start_sample:end_sample],  # recent librosa versions require the keyword y=
                                                sr=SAMPLE_RATE,
                                                n_mfcc=n_mfcc,
                                                n_fft=n_fft,
                                                hop_length=hop_length)
                    
                    # Take the transpose for simpler calculations
                    mfcc = mfcc.T
                    
                    # Store MFCC vectors for segment if it has the expected length
                    if len(mfcc) == expected_number_mfcc_vectors_per_segment:
                        data["mfcc"].append(mfcc.tolist()) # tolist required to store as json file
                        data["labels"].append(i-1)         # because when i=0, we're at the dataset path so we ignore
                        # print(f"{file_path}, segment: {segment+1}")
    
    # Create the output directory (e.g., "/kaggle/temp/") where temporary files can be stored
    output_dir = os.path.dirname(json_path)
    if output_dir:
        os.makedirs(output_dir, exist_ok=True)
    
    # Write a JSON file to the temporary directory
    with open(json_path, "w") as fp:
        json.dump(data, fp, indent=4)

# Use the method to save the data
save_mfcc(DATA_PATH, JSON_PATH, num_segments=10, files_per_genre=1)

# Implementing NN for genre classification
## Loading data
1. We create a helper method `load_data` which takes in the path of the `.json` file and returns inputs and target labels
2. We split inputs and targets into train and test sets
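
To show what the split does to the array shapes, here is a NumPy-only sketch of a shuffled 70/30 split. It is not scikit-learn's implementation of `train_test_split`, just the same idea in a few lines. The data is random and its shape (100 segments of 130 frames by 13 coefficients) is an assumption for illustration.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

# Hypothetical stand-in for the MFCC data: 100 segments, 130 frames, 13 coefficients
inputs = rng.normal(size=(100, 130, 13))
targets = np.repeat(np.arange(10), 10)  # 10 genres, 10 segments each

# Shuffle the sample indices, then cut at 70% of the data
indices = rng.permutation(len(inputs))
split = int(len(inputs) * 0.7)
train_idx, test_idx = indices[:split], indices[split:]

# Apply the same index sets to inputs and targets so they stay aligned
inputs_train, inputs_test = inputs[train_idx], inputs[test_idx]
targets_train, targets_test = targets[train_idx], targets[test_idx]

print(inputs_train.shape, inputs_test.shape)  # (70, 130, 13) (30, 130, 13)
```

Indexing inputs and targets with the same shuffled indices is what keeps every segment paired with its label.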

## Building NN
1. Input layer is a flattened matrix of segment length $\times$ mfcc count
2. 3 hidden layers use `512`, `256` and `64` neurons respectively and use **ReLU** as activation function
    * **ReLU** - improves convergence and reduces the likelihood of _vanishing gradient_
3. Output layer has 10 output nodes representing 10 genres and uses **softmax** as activation function
    * **softmax** - normalizes the values so that summing all of them results in `1`
4. Define `Adam` as our optimizer with learning rate `0.0001`
    * `Adam` - variation of _SGD_ that's very effective with deep learning
5. Compile the model and print the summary
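
The layer structure above can be sketched as a plain NumPy forward pass, which makes the role of **ReLU** and **softmax** visible and lets us count the trainable parameters. This is only an illustration with random weights, not the Keras model itself; dropout and regularization are left out, and the input shape of 130 frames by 13 coefficients is an assumption (it corresponds to `num_segments=10`).

```python
import numpy as np

def relu(x):
    """Keep positive values, replace negative ones with zero."""
    return np.maximum(0.0, x)

def softmax(x):
    """Turn each row of scores into probabilities that sum to 1."""
    shifted = x - np.max(x, axis=-1, keepdims=True)  # for numerical stability
    exp = np.exp(shifted)
    return exp / np.sum(exp, axis=-1, keepdims=True)

rng = np.random.default_rng(seed=0)
n_frames, n_mfcc = 130, 13  # assumed shape of one segment
layer_sizes = [n_frames * n_mfcc, 512, 256, 64, 10]

# Random weights and zero biases for every dense layer
weights = [rng.normal(scale=0.01, size=(n_in, n_out))
           for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:])]
biases = [np.zeros(n_out) for n_out in layer_sizes[1:]]

# A batch of 4 random "segments", flattened like the input layer does
x = rng.normal(size=(4, n_frames, n_mfcc)).reshape(4, -1)

# Hidden layers use ReLU, the output layer uses softmax
for w, b in zip(weights[:-1], biases[:-1]):
    x = relu(x @ w + b)
probabilities = softmax(x @ weights[-1] + biases[-1])

n_params = sum(w.size + b.size for w, b in zip(weights, biases))

print(probabilities.shape)  # (4, 10)
print(n_params)             # 1014218
```

Under the assumed input shape, most of the roughly one million parameters sit in the first dense layer.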

## Training
We pass in parameters to the model's `fit` method:
* Training and validation data
* Number of epochs (number of full passes over the training data)
* Batch size (number of samples used for one weight update, i.e., the gradient is computed only on a subset of the dataset)
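
As a quick illustration of how epochs and batch size interact, the sketch below counts the weight updates performed during training. The number of training samples is a made-up value, and the helper name `updates_per_epoch` is hypothetical. The last batch of an epoch may be smaller than the others, which is why the division is rounded up.

```python
import math

def updates_per_epoch(n_train, batch_size):
    """Number of batches (and therefore weight updates) in one epoch."""
    return math.ceil(n_train / batch_size)

n_train = 70     # hypothetical number of training samples
batch_size = 32
epochs = 50

steps = updates_per_epoch(n_train, batch_size)
total_updates = steps * epochs

print(steps)          # 3
print(total_updates)  # 150
```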

In [None]:
import numpy as np                                    # linear algebra
from sklearn.model_selection import train_test_split  # splitting data
import tensorflow.keras as keras                      # model creation

def load_data(dataset_path):
    """Loads training dataset from json file.
    
        :param dataset_path (str): Path to json file containing data
        :return X (ndarray): Inputs
        :return y (ndarray): Targets
    """
    # Open the dataset file and parse data to lists
    with open(dataset_path, "r") as fp:
        data = json.load(fp)
        
    # Convert lists to numpy arrays
    inputs = np.array(data["mfcc"])
    targets = np.array(data["labels"])
    
    # Return the data
    return inputs, targets
    
# Load the data with the helper method
inputs, targets = load_data(JSON_PATH)

# Split the data to train and test sets
inputs_train, inputs_test, targets_train, targets_test = train_test_split(inputs, targets, test_size=.3)

# Build the network architecture
model = keras.Sequential([
    keras.layers.Flatten(input_shape=(inputs.shape[1], inputs.shape[2])),
    keras.layers.Dense(512, activation="relu", kernel_regularizer=keras.regularizers.l2(0.001)),
    keras.layers.Dropout(0.3),
    keras.layers.Dense(256, activation="relu", kernel_regularizer=keras.regularizers.l2(0.001)),
    keras.layers.Dropout(0.3),
    keras.layers.Dense(64, activation="relu", kernel_regularizer=keras.regularizers.l2(0.001)),
    keras.layers.Dropout(0.3),
    keras.layers.Dense(10, activation="softmax"),
])

# Compile network
optimizer = keras.optimizers.Adam(learning_rate=.0001)
model.compile(optimizer=optimizer, loss="sparse_categorical_crossentropy", metrics=["accuracy"])
model.summary()

# Train network
history = model.fit(inputs_train, targets_train, validation_data=(inputs_test, targets_test), epochs=50, batch_size=32)

# Accuracy
## Checking accuracy
We define a helper plot function for plotting accuracy and error evaluations for both training and test data.

## Solving overfitting
There are several ways to solve overfitting:
* _Simpler architecture_ - a simpler model won't learn all the tiny patterns and possible artifacts
    * Implementation: reduce the number of neurons (no universal rule)
* _Data augmentation_ - artificially increase the number of training samples
    * Implementation: apply transformations to actual files (e.g., pitch shifting, time stretching)
* _Early stopping_ - choose rules to stop training
    * Implementation: stop the training if the test (validation) error doesn't improve after some number of epochs
* _Dropout_ - randomly drop neurons while training (increases network robustness because it can't rely too much on some specific neurons)
    * Implementation: on each batch, some neurons and their connections are not considered (the probability is usually chosen to be between `0.1` and `0.5`)
* _Regularization_ - adds a penalty to the error function that punishes large weights
    * **L1**: $E(\mathbf{p},\mathbf{y})=\frac{1}{2}(\mathbf{p} - \mathbf{y})^2+\lambda\sum|W_i|$
        * We sum the absolute values of all the weights and scale the sum by the regularization strength $\lambda$
        * Minimises absolute value of weights, is robust to outliers, generates a simple model
    * **L2**: $E(\mathbf{p},\mathbf{y})=\frac{1}{2}(\mathbf{p} - \mathbf{y})^2+\lambda\sum W_i^2$
        * We sum the squares of all the weights and scale the sum by the regularization strength $\lambda$
        * Minimises squared value of weights, is _not_ robust to outliers, learns complex patterns
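
The two penalty terms from the formulas above can be computed directly. The sketch below uses a made-up weight vector and $\lambda$ = `0.001` to show how differently the two penalties treat one large weight.

```python
import numpy as np

def l1_penalty(weights, lam):
    """lambda times the sum of absolute weight values."""
    return lam * np.sum(np.abs(weights))

def l2_penalty(weights, lam):
    """lambda times the sum of squared weight values."""
    return lam * np.sum(weights ** 2)

weights = np.array([0.5, -0.1, 2.0, 0.0])  # hypothetical weights
lam = 0.001

print(l1_penalty(weights, lam))  # approximately 0.0026
print(l2_penalty(weights, lam))  # approximately 0.00426
```

The single weight of `2.0` contributes about 77% of the L1 penalty but about 94% of the L2 penalty, which is why L2 pushes hardest against large weights.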

We implemented dropout (with a rate of `0.3`) and _L2_ regularisation (with $\lambda$ = `0.001`) in our model.
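
For intuition on what a dropout layer does during training, here is a minimal NumPy sketch of so-called inverted dropout, where the surviving activations are scaled up so that their expected value stays the same. It illustrates the idea and is not the Keras implementation; the function name and its arguments are made up for this example.

```python
import numpy as np

def dropout(activations, rate, rng, training=True):
    """Zero out a random fraction `rate` of activations and rescale the rest."""
    if not training or rate == 0.0:
        return activations  # dropout is only active while training
    keep_prob = 1.0 - rate
    mask = rng.random(activations.shape) < keep_prob  # True where a neuron is kept
    return activations * mask / keep_prob

rng = np.random.default_rng(seed=0)
activations = np.ones((1000, 64))  # hypothetical layer output

dropped = dropout(activations, rate=0.3, rng=rng)

print(np.mean(dropped == 0.0))  # close to 0.3
print(dropped.mean())           # close to 1.0
```

At inference time (`training=False`) the activations pass through unchanged.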

In [None]:
import matplotlib.pyplot as plt # visualizing plots

def plot_history(history):
    """Plots accuracy and error evaluations for both training and test data.
    
        :param history (keras.callbacks.History): object returned by model.fit, whose history attribute holds the per-epoch metrics
    """
    
    # Get figure and axis objects for 2 subplots
    fig, axs = plt.subplots(2)
    
    # Create accuracy subplot
    axs[0].plot(history.history["accuracy"], label="train accuracy")
    axs[0].plot(history.history["val_accuracy"], label="test accuracy")
    axs[0].set_ylabel("Accuracy")
    axs[0].legend(loc="lower right")
    axs[0].set_title("Accuracy eval")
    
    # Create error subplot
    axs[1].plot(history.history["loss"], label="train error")
    axs[1].plot(history.history["val_loss"], label="test error")
    axs[1].set_ylabel("Error")
    axs[1].set_xlabel("Epoch")
    axs[1].legend(loc="upper right")
    axs[1].set_title("Error eval")
    
    # Show the plot
    fig.tight_layout()
    plt.show()
    
# Plot accuracy and error over the epochs
plot_history(history)