# Applying Machine Learning to UrbanSound8K

## Install Packages

We use the following libraries:
- Machine learning libraries: `Keras`, `sklearn`
- Audio processing: `librosa`
- Plots: `Plotly`, `matplotlib`

In [65]:
import os
import time
import librosa
import zipfile
import numpy as np
import pandas as pd
import librosa.display
import matplotlib.pyplot as plt
import plotly.graph_objects as go
from PIL import Image

In [66]:
from google.colab import drive
drive.mount('/content/gdrive/')

Drive already mounted at /content/gdrive/; to attempt to forcibly remount, call drive.mount("/content/gdrive/", force_remount=True).


In [67]:
import pickle
with open('gdrive/MyDrive/urbansound8k/dataset_df.pickle','rb') as f:
    dataset_df = pickle.load(f)

In [70]:
# Split the dataset
from sklearn.model_selection import train_test_split

# Stack the features and the one-hot labels into arrays
X = np.array(dataset_df['features'].tolist())
y = np.array(dataset_df['labels_categorical'].tolist())

# Some classes are under-represented, so we stratify to keep the same class proportions in train/test
X_train, X_test, Y_train, Y_test = train_test_split(X,
                                                    y,
                                                    test_size=0.30,
                                                    random_state=1,
                                                    stratify=y)
# Split the held-out 30% in half to create the validation and test sets
X_test, X_val, Y_test, Y_val = train_test_split(X_test,
                                                Y_test,
                                                test_size=0.5,
                                                random_state=1,
                                                stratify=Y_test)

# Add one dimension for the channel
X_train = X_train.reshape(-1,39,174,1)
X_val = X_val.reshape(-1,39,174,1)
X_test = X_test.reshape(-1,39,174,1)
print(X_train.shape, X_val.shape, X_test.shape)

(6112, 39, 174, 1) (1310, 39, 174, 1) (1310, 39, 174, 1)
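To make the effect of stratification concrete, the following is a minimal NumPy sketch of the idea, not scikit-learn's actual implementation. The helper names (`class_proportions`, `stratified_indices`) and the toy label counts are hypothetical and only serve as an illustration: each class is split separately, so the class proportions in both subsets match.

```python
import numpy as np

def class_proportions(one_hot_labels):
    """Fraction of samples per class for a one-hot label matrix."""
    one_hot_labels = np.asarray(one_hot_labels)
    return one_hot_labels.sum(axis=0) / len(one_hot_labels)

def stratified_indices(class_ids, test_fraction, seed=1):
    """Pick test indices class by class so each class keeps its share."""
    rng = np.random.default_rng(seed)
    test_idx = []
    for cls in np.unique(class_ids):
        idx = np.flatnonzero(class_ids == cls)
        rng.shuffle(idx)
        n_test = int(round(len(idx) * test_fraction))
        test_idx.extend(idx[:n_test])
    test_idx = np.sort(np.array(test_idx))
    train_idx = np.setdiff1d(np.arange(len(class_ids)), test_idx)
    return train_idx, test_idx

# Toy imbalanced labels (hypothetical): 3 classes with 100, 100 and 40 samples
class_ids = np.repeat([0, 1, 2], [100, 100, 40])
one_hot = np.eye(3)[class_ids]

train_idx, test_idx = stratified_indices(class_ids, test_fraction=0.30)
train_props = class_proportions(one_hot[train_idx])
test_props = class_proportions(one_hot[test_idx])
print(train_props, test_props)
```

The same `class_proportions` check could be applied to `Y_train`, `Y_val` and `Y_test` to confirm that the three splits have matching class distributions.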


## Machine Learning Model

### Model Design

We are going to create a **Fully Convolutional Network** (FCN) model with a few layers, using Keras on top of TensorFlow.

In [71]:
from keras.models import Sequential
from keras.layers import Input, Dense, Dropout, Activation, Flatten
from keras.layers import Convolution2D, Conv2D, MaxPooling2D, GlobalAveragePooling2D

Our inputs are rectangular (the y axis is the MFCC index and the x axis is time), so instead of the usual square filters we use rectangular ones, which should help the network learn how the MFCCs correlate along the temporal dimension.

In [72]:
# FCN Model
def create_model(num_classes=10, input_shape=None, dropout_ratio=None):
    model = Sequential()
    if input_shape is None:
        model.add(Input(shape=(None, None, 1)))
    else:
        model.add(Input(shape=input_shape))
    model.add(Conv2D(filters=16, kernel_size=(2, 4), activation='relu'))
    model.add(MaxPooling2D(pool_size=(2, 3)))
    model.add(Conv2D(filters=32, kernel_size=(2, 4), activation='relu'))
    model.add(MaxPooling2D(pool_size=2))
    model.add(Conv2D(filters=64, kernel_size=(2, 4), activation='relu'))
    model.add(MaxPooling2D(pool_size=2))
    model.add(Conv2D(filters=128, kernel_size=(2, 4), activation='relu'))
    model.add(GlobalAveragePooling2D())
    if dropout_ratio is not None:
        model.add(Dropout(dropout_ratio))
    # Add the dense classification layer (softmax over the classes)
    model.add(Dense(num_classes, activation='softmax'))
    return model
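The feature map sizes and parameter counts of `create_model` can be traced by hand. The sketch below assumes the Keras defaults that the model relies on (`'valid'` padding and stride 1 for `Conv2D`, stride equal to the pool size for `MaxPooling2D`); the helper names are hypothetical.

```python
def conv_out(size, kernel):
    """Output length of a 'valid' convolution with stride 1."""
    return size - kernel + 1

def pool_out(size, pool):
    """Output length of max pooling with stride equal to the pool size."""
    return size // pool

def conv_params(kernel_h, kernel_w, in_channels, filters):
    """Kernel weights plus one bias per filter."""
    return kernel_h * kernel_w * in_channels * filters + filters

h, w = 39, 174  # MFCC coefficients x time frames, as in X_train
layers = [("conv", (2, 4)), ("pool", (2, 3)),
          ("conv", (2, 4)), ("pool", (2, 2)),
          ("conv", (2, 4)), ("pool", (2, 2)),
          ("conv", (2, 4))]

shapes = []
for kind, (size_h, size_w) in layers:
    fn = conv_out if kind == "conv" else pool_out
    h, w = fn(h, size_h), fn(w, size_w)
    shapes.append((h, w))

params = [conv_params(2, 4, 1, 16),
          conv_params(2, 4, 16, 32),
          conv_params(2, 4, 32, 64)]
print(shapes)
print(params)
```

The first six shapes and the three parameter counts can be compared against the output of `fcn_model.summary()`; the last shape follows from the same formula applied to the final convolution.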

As this is a multi-class classification problem, we will use the **Categorical Cross Entropy loss**. As the optimizer we will use the Keras implementation of **Adam** with its default hyperparameter values.
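As a rough illustration of what the categorical cross-entropy measures, here is a minimal NumPy sketch. The function name, the clipping constant `eps` and the toy predictions are assumptions made for this example, and the Keras implementation may differ in its details.

```python
import numpy as np

def categorical_cross_entropy(y_true, y_pred, eps=1e-7):
    """Mean over samples of -sum(y_true * log(y_pred))."""
    # Clip to avoid log(0); eps is an assumed value for numerical safety
    y_pred = np.clip(y_pred, eps, 1.0 - eps)
    return float(np.mean(-np.sum(y_true * np.log(y_pred), axis=1)))

# Toy one-hot targets and two sets of predicted probabilities (hypothetical)
y_true = np.array([[1, 0, 0], [0, 1, 0]], dtype=float)
confident = np.array([[0.9, 0.05, 0.05], [0.1, 0.8, 0.1]])
uniform = np.full((2, 3), 1 / 3)

loss_confident = categorical_cross_entropy(y_true, confident)
loss_uniform = categorical_cross_entropy(y_true, uniform)
print(loss_confident, loss_uniform)
```

A uniform prediction over `K` classes gives a loss of `log(K)`, so with 10 classes an untrained model would be expected to start near `log(10)`, about 2.3.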

In [73]:
# Create and compile the model
fcn_model = create_model(input_shape=X_train.shape[1:])
fcn_model.compile(loss='categorical_crossentropy', metrics=['accuracy'], optimizer='adam')
fcn_model.summary()

Model: "sequential_9"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 conv2d_9 (Conv2D)           (None, 38, 171, 16)       144       
                                                                 
 max_pooling2d (MaxPooling2  (None, 19, 57, 16)        0         
 D)                                                              
                                                                 
 conv2d_10 (Conv2D)          (None, 18, 54, 32)        4128      
                                                                 
 max_pooling2d_1 (MaxPoolin  (None, 9, 27, 32)         0         
 g2D)                                                            
                                                                 
 conv2d_11 (Conv2D)          (None, 8, 24, 64)         16448     
                                                                 
 max_pooling2d_2 (MaxPoolin  (None, 4, 12, 64)        

### Model training and evaluation

In [74]:
from keras.models import load_model
from keras.callbacks import ModelCheckpoint

In [75]:
!mkdir -p saved_models

In [76]:
def train_model(model, X_train, Y_train, X_val, Y_val, epochs, batch_size, callbacks):
    model.fit(X_train,
              Y_train,
              batch_size=batch_size,
              epochs=epochs,
              validation_data=(X_val, Y_val),
              callbacks=callbacks, verbose=1)
    return model

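We will create a checkpoint that saves the model whenever the validation accuracy improves. This acts as a form of **early stopping**: training still runs for every epoch, but we keep the model that performs best on the validation set.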

Creating a function to train the model will allow us to perform hyperparameter tuning faster.
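The difference between keeping the best checkpoint and actually stopping early can be sketched in plain Python. This is only an illustration of the logic, not the Keras callback code; the function `track_best`, its `patience` argument and the accuracy values are hypothetical.

```python
def track_best(val_accuracies, patience=None):
    """Return (best_epoch, best_accuracy, last_epoch_run).

    With patience=None every epoch runs and only the best one is remembered,
    which mirrors saving the best checkpoint. With a patience value, training
    stops after that many consecutive epochs without improvement.
    """
    best = float("-inf")
    best_epoch = None
    epochs_without_improvement = 0
    for epoch, acc in enumerate(val_accuracies, start=1):
        if acc > best:
            best, best_epoch = acc, epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if patience is not None and epochs_without_improvement >= patience:
                return best_epoch, best, epoch
    return best_epoch, best, len(val_accuracies)

# Made-up validation accuracies, for illustration only
history = [0.50, 0.60, 0.58, 0.65, 0.64, 0.63, 0.62, 0.66]
print(track_best(history))               # all epochs run, best one is kept
print(track_best(history, patience=3))   # stops after 3 epochs without improvement
```

In this toy run, stopping with a patience of 3 would miss the late improvement in the last epoch, which is one reason to run all epochs and keep the best checkpoint when the training budget allows it.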

In [77]:
checkpointer = ModelCheckpoint(filepath='saved_models/best_fcn.hdf5', monitor='val_accuracy', verbose=1, save_best_only=True)
callbacks = [checkpointer]

# Hyper-parameters
epochs = 100
batch_size = 256

In [78]:
# Train the model
model = train_model(model=fcn_model,
                    X_train=X_train,
                    X_val=X_val,
                    Y_train=Y_train,
                    Y_val=Y_val,
                    epochs=epochs,
                    batch_size=batch_size,
                    callbacks=callbacks)

Epoch 1/100
Epoch 1: val_accuracy improved from -inf to 0.49618, saving model to saved_models/best_fcn.hdf5
Epoch 2/100
Epoch 2: val_accuracy improved from 0.49618 to 0.58702, saving model to saved_models/best_fcn.hdf5
Epoch 3/100
Epoch 3: val_accuracy improved from 0.58702 to 0.61679, saving model to saved_models/best_fcn.hdf5
Epoch 4/100
Epoch 4: val_accuracy improved from 0.61679 to 0.64962, saving model to saved_models/best_fcn.hdf5
Epoch 5/100
Epoch 5: val_accuracy improved from 0.64962 to 0.69618, saving model to saved_models/best_fcn.hdf5
Epoch 6/100
Epoch 6: val_accuracy did not improve from 0.69618
Epoch 7/100
Epoch 7: val_accuracy improved from 0.69618 to 0.71679, saving model to saved_models/best_fcn.hdf5
Epoch 8/100
Epoch 8: val_accuracy improved from 0.71679 to 0.74885, saving model to saved_models/best_fcn.hdf5
Epoch 9/100
Epoch 9: val_accuracy improved from 0.74885 to 0.75573, saving model to saved_models/best_fcn.hdf5
Epoch 10/100
Epoch 10: val_accuracy improved from 0.75573 to 0.75649, saving model to saved_models/best_fcn.hdf5
Epoch 11/100
Epoch 11: val_accuracy improved from 0.75649 t

In [79]:
# Load the best model
best_model = load_model('saved_models/best_fcn.hdf5')

The model appears to have overfitted the training data towards the end of training, so we load the checkpoint that performed best on the validation set. The validation and test scores reported below are close, which suggests that the validation set is a reasonable estimator of test performance.

In [80]:
# Evaluate the model on the training, validation and testing sets
score = best_model.evaluate(X_train, Y_train, verbose=0)
print("Training Accuracy: ", score[1])

score = best_model.evaluate(X_val, Y_val, verbose=0)
print("Validation Accuracy: ", score[1])

score = best_model.evaluate(X_test, Y_test, verbose=0)
print("Testing Accuracy: ", score[1])

Training Accuracy:  0.9978730082511902
Validation Accuracy:  0.9076336026191711
Testing Accuracy:  0.8977099061012268


Training accuracy (about 99.8%) is well above validation accuracy (about 90.8%), which indicates overfitting. To add more regularization, we train another model with dropout before the last layer.
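As a reminder of what dropout does, here is a minimal NumPy sketch of the common "inverted dropout" formulation: units are zeroed at random during training and the survivors are rescaled, while inference leaves the input unchanged. The function signature and the toy input are assumptions made for this example.

```python
import numpy as np

def dropout(x, rate, training, rng):
    """Zero a random fraction `rate` of the units and rescale the rest."""
    if not training or rate == 0.0:
        return x  # inference: identity
    keep_mask = rng.random(x.shape) >= rate
    # Rescale so the expected activation stays the same as without dropout
    return x * keep_mask / (1.0 - rate)

rng = np.random.default_rng(0)
# Toy batch of 4 samples with 128 features each, like the pooled conv output
features = np.ones((4, 128))

train_out = dropout(features, rate=0.5, training=True, rng=rng)
eval_out = dropout(features, rate=0.5, training=False, rng=rng)
print(np.unique(train_out), eval_out.mean())
```

Because a different random subset of features is dropped at every step, the final classification layer cannot rely on any single feature, which tends to reduce overfitting.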

In [81]:
# We add a dropout ratio of 0.5
fcn_model = create_model(input_shape=X_train.shape[1:], dropout_ratio=0.5)
fcn_model.compile(loss='categorical_crossentropy', metrics=['accuracy'], optimizer='adam')
fcn_model.summary()

Model: "sequential_10"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 conv2d_13 (Conv2D)          (None, 38, 171, 16)       144       
                                                                 
 max_pooling2d_3 (MaxPoolin  (None, 19, 57, 16)        0         
 g2D)                                                            
                                                                 
 conv2d_14 (Conv2D)          (None, 18, 54, 32)        4128      
                                                                 
 max_pooling2d_4 (MaxPoolin  (None, 9, 27, 32)         0         
 g2D)                                                            
                                                                 
 conv2d_15 (Conv2D)          (None, 8, 24, 64)         16448     
                                                                 
 max_pooling2d_5 (MaxPoolin  (None, 4, 12, 64)       

In [None]:
checkpointer = ModelCheckpoint(filepath='saved_models/best_fcn_dropout.hdf5', monitor='val_accuracy',
                               verbose=1, save_best_only=True)
callbacks = [checkpointer]

model = train_model(model=fcn_model,
                    X_train=X_train,
                    X_val=X_val,
                    Y_train=Y_train,
                    Y_val=Y_val,
                    epochs=200,
                    batch_size=256,
                    callbacks=callbacks)

Epoch 1/200
Epoch 1: val_accuracy improved from -inf to 0.35954, saving model to saved_models/best_fcn_dropout.hdf5
Epoch 2/200
Epoch 2: val_accuracy improved from 0.35954 to 0.46565, saving model to saved_models/best_fcn_dropout.hdf5
Epoch 3/200
Epoch 3: val_accuracy improved from 0.46565 to 0.54351, saving model to saved_models/best_fcn_dropout.hdf5
Epoch 4/200
Epoch 4: val_accuracy improved from 0.54351 to 0.55344, saving model to saved_models/best_fcn_dropout.hdf5
Epoch 5/200
Epoch 5: val_accuracy improved from 0.55344 to 0.60305, saving model to saved_models/best_fcn_dropout.hdf5
Epoch 6/200
Epoch 6: val_accuracy improved from 0.60305 to 0.63664, saving model to saved_models/best_fcn_dropout.hdf5
Epoch 7/200
Epoch 7: val_accuracy did not improve from 0.63664
Epoch 8/200
Epoch 8: val_accuracy improved from 0.63664 to 0.68550, saving model to saved_models/best_fcn_dropout.hdf5
Epoch 9/200
Epoch 9: val_accuracy improved from 0.68550 to 0.68779, saving model to saved_models/best_fcn_d

In [None]:
best_model = load_model('saved_models/best_fcn_dropout.hdf5')

In [None]:
# Evaluate the model on the training, validation and testing sets
score = best_model.evaluate(X_train, Y_train, verbose=0)
print("Training Accuracy: ", score[1])

score = best_model.evaluate(X_val, Y_val, verbose=0)
print("Validation Accuracy: ", score[1])

score = best_model.evaluate(X_test, Y_test, verbose=0)
print("Testing Accuracy: ", score[1])

In [None]:
# Compute the confusion matrix on the test set
from sklearn import metrics
Y_pred = best_model.predict(X_test)
matrix = metrics.confusion_matrix(Y_test.argmax(axis=1), Y_pred.argmax(axis=1))

In [None]:
# Confusion matrix code (from https://github.com/triagemd/keras-eval/blob/master/keras_eval/visualizer.py)
def plot_confusion_matrix(cm, concepts, normalize=False, show_text=True, fontsize=18, figsize=(16, 12),
                          cmap=plt.cm.coolwarm_r, save_path=None, show_labels=True):
    '''
    Plot confusion matrix provided in 'cm'
    Args:
        cm: Confusion Matrix, square sized numpy array
        concepts: Name of the categories to show
        normalize: If True, divide each row by its sum so values lie between 0 and 1. Not valid if negative values.
        show_text: If True, display cell values as text. Otherwise only display cell colors.
        fontsize: Size of text
        figsize: Size of figure
        cmap: Color choice
        save_path: If `save_path` specified, save confusion matrix in that location
        show_labels: If True, label the axes with the names in `concepts`. Otherwise hide the axes.
    Returns: Nothing. Plots confusion matrix
    '''

    if cm.ndim != 2 or cm.shape[0] != cm.shape[1]:
        raise ValueError('Invalid confusion matrix shape, it should be square and ndim=2')

    if cm.shape[0] != len(concepts) or cm.shape[1] != len(concepts):
        raise ValueError('Number of concepts (%i) and dimensions of confusion matrix do not coincide (%i, %i)' %
                         (len(concepts), cm.shape[0], cm.shape[1]))

    plt.rcParams.update({'font.size': fontsize})

    cm_normalized = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]

    if normalize:
        cm = cm_normalized

    fig = plt.figure(figsize=figsize)
    ax = fig.add_subplot(111)
    cax = ax.matshow(cm, vmin=np.min(cm), vmax=np.max(cm), alpha=0.8, cmap=cmap)

    fig.colorbar(cax)
    ax.xaxis.tick_bottom()
    plt.ylabel('True label', fontweight='bold')
    plt.xlabel('Predicted label', fontweight='bold')

    if show_labels:
        n_labels = len(concepts)
        # Fix the tick positions first so that each label maps to one row/column
        ax.set_xticks(np.arange(n_labels))
        ax.set_yticks(np.arange(n_labels))
        ax.set_xticklabels(concepts, rotation='vertical')
        ax.set_yticklabels(concepts)
    else:
        plt.axis('off')

    if show_text:
        # http://stackoverflow.com/questions/21712047/matplotlib-imshow-matshow-display-values-on-plot
        min_val, max_val = 0, len(concepts)
        ind_array = np.arange(min_val, max_val, 1.0)
        x, y = np.meshgrid(ind_array, ind_array)
        for i, (x_val, y_val) in enumerate(zip(x.flatten(), y.flatten())):
            c = cm[int(x_val), int(y_val)]
            ax.text(y_val, x_val, c, va='center', ha='center')

    if save_path is not None:
        plt.savefig(save_path)

To get a better view of the model's performance and of the mistakes made between classes, we plot the confusion matrix.

In our case accuracy is a good metric because the dataset is mostly balanced, but a few classes have fewer samples (`car_horn`, `gun_shot` and `siren`), so it is worth checking the performance on these classes.

We can observe that many mistakes happen between the classes `children_playing` and `street_music`, so it may be worth spending more time analysing the possible reasons.
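To check the smaller classes numerically, per-class recall can be read off the confusion matrix as the diagonal divided by the row sums. The sketch below is a plain NumPy illustration on toy labels; the helper names and the toy data are hypothetical. Rows are true classes and columns are predicted classes, the same convention used by `metrics.confusion_matrix`.

```python
import numpy as np

def confusion_matrix_from_one_hot(y_true, y_pred, num_classes):
    """Count (true class, predicted class) pairs from one-hot/probability rows."""
    true_ids = np.argmax(y_true, axis=1)
    pred_ids = np.argmax(y_pred, axis=1)
    cm = np.zeros((num_classes, num_classes), dtype=int)
    np.add.at(cm, (true_ids, pred_ids), 1)
    return cm

def per_class_recall(cm):
    """Fraction of each true class that was predicted correctly."""
    support = cm.sum(axis=1)
    return np.divide(np.diag(cm), support,
                     out=np.zeros(len(cm)), where=support > 0)

# Toy labels for 3 classes (hypothetical)
y_true = np.eye(3)[[0, 0, 1, 1, 1, 2]]
y_pred = np.eye(3)[[0, 1, 1, 1, 2, 2]]

cm = confusion_matrix_from_one_hot(y_true, y_pred, num_classes=3)
recall = per_class_recall(cm)
print(cm)
print(recall)
```

Applied to the notebook's `matrix`, `per_class_recall(matrix)` would give one value per class in the order of the sorted class indices.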

In [None]:
class_dictionary = {3: 'dog_bark', 2: 'children_playing', 1: 'car_horn', 0: 'air_conditioner', 9: 'street_music', 6: 'gun_shot', 8: 'siren', 5: 'engine_idling', 7: 'jackhammer', 4: 'drilling'}
classes = [class_dictionary[key] for key in sorted(class_dictionary.keys())]

In [None]:
plot_confusion_matrix(matrix, classes)

## Conclusions

We observe an improvement of roughly 1-2% in test set accuracy when introducing dropout as regularization, which suggests that it was a useful addition to our model.

There are many things that we can try to improve the model's performance such as:

- Hyperparameter tuning:
  - Tuning the parameters of feature extraction
  - Tuning the network architecture (number of layers, pooling layers, number and shape of filters...)
  - Tuning the training hyperparameters (learning rate, optimizer)

- Feature extraction:
  - Use the STFT: the raw spectrogram could give the CNN more information than the MFCCs for learning correlations between frequency and time.
  - Use the mel-spectrogram: like the STFT, it could give the CNN more information than the MFCCs for learning correlations between frequency and time.
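To give an idea of what an STFT-based feature looks like, here is a minimal plain-NumPy sketch of a magnitude spectrogram. It is not the feature extraction used in this notebook; in practice a library routine such as `librosa.stft` would normally be used. The function name, the frame parameters and the test tone are assumptions for illustration, and unlike library implementations this version does not pad or center the frames.

```python
import numpy as np

def stft_magnitude(signal, n_fft=256, hop_length=128):
    """Magnitude spectrogram with a Hann window, shape (freq bins, frames)."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(signal) - n_fft) // hop_length
    frames = np.stack([
        signal[i * hop_length: i * hop_length + n_fft] * window
        for i in range(n_frames)
    ])
    # rfft keeps the n_fft // 2 + 1 non-negative frequency bins
    return np.abs(np.fft.rfft(frames, axis=1)).T

# One second of a 1000 Hz tone sampled at 8000 Hz (toy signal)
sr = 8000
t = np.arange(sr) / sr
tone = np.sin(2 * np.pi * 1000 * t)

spec = stft_magnitude(tone)
peak_bin = int(spec.mean(axis=1).argmax())
peak_hz = peak_bin * sr / 256
print(spec.shape, peak_bin, peak_hz)
```

Such a spectrogram has many more frequency rows than the 39 MFCC rows used here, so the input shape and possibly the filter and pooling sizes of the network would need to be adapted.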