# Classifying Urban sounds using Deep Learning

## 4 Model Refinement 

#### Model refinement

In our initial attempt, we were able to achieve a Classification Accuracy score of: 

* Training data Accuracy:  92.3% 
* Testing data Accuracy:  87% 

We will now see if we can improve upon that score using a Convolutional Neural Network (CNN). 

#### Feature Extraction refinement 

In the previous feature extraction stage, the MFCC vectors would vary in size for the different audio files (depending on each sample's duration). 

However, CNNs require a fixed size for all inputs. To overcome this we will zero pad the output vectors to make them all the same size. 

In [1]:
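import numpy as np
import librosa

# Number of frames every MFCC matrix is padded to
max_pad_len = 174

def extract_features(file_name):
    """Return a (40, max_pad_len) MFCC matrix for file_name, or None on failure."""
    try:
        audio, sample_rate = librosa.load(file_name) 
        mfccs = librosa.feature.mfcc(y=audio, sr=sample_rate, n_mfcc=40)
        # Zero pad along the time axis so every file yields the same shape
        pad_width = max_pad_len - mfccs.shape[1]
        mfccs = np.pad(mfccs, pad_width=((0, 0), (0, pad_width)), mode='constant')
        
    except Exception as e:
        # Also reached when a clip has more than max_pad_len frames (negative pad_width)
        print("Error encountered while parsing file: ", file_name, e)
        return None 
     
    return mfccs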
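
The padding step above can be illustrated on a toy array. The sketch below is not part of the original pipeline: `pad_or_truncate` is a hypothetical helper that pads short inputs with zeros in the same way, and additionally truncates inputs that are longer than the target length, which `np.pad` alone cannot handle.

```python
import numpy as np

def pad_or_truncate(mfccs, max_len):
    """Return `mfccs` with exactly `max_len` columns (frames)."""
    n_frames = mfccs.shape[1]
    if n_frames >= max_len:
        # Too long: keep only the first `max_len` frames
        return mfccs[:, :max_len]
    # Too short: append zero-valued frames on the right
    return np.pad(mfccs, pad_width=((0, 0), (0, max_len - n_frames)), mode='constant')

short_clip = np.ones((2, 3))   # stand-in for a short clip: 2 coefficients x 3 frames
long_clip = np.ones((2, 7))    # stand-in for a long clip: 2 coefficients x 7 frames

print(pad_or_truncate(short_clip, 5))
print(pad_or_truncate(short_clip, 5).shape, pad_or_truncate(long_clip, 5).shape)
```

Both toy inputs end up with the same shape, which is the property the CNN input layer relies on.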

In [2]:
# Load various imports 
import pandas as pd
import os
import librosa

# Set the path to the full UrbanSound dataset 
fulldatasetpath = '../UrbanSound8K/audio/'

metadata = pd.read_csv('../UrbanSound8K/metadata/UrbanSound8K.csv')

features = []

# Iterate through each sound file and extract the features 
for index, row in metadata.iterrows():
    
    file_name = os.path.join(os.path.abspath(fulldatasetpath),'fold'+str(row["fold"])+'/',str(row["slice_file_name"]))
    
    class_label = row["class"]
    data = extract_features(file_name)
    
    features.append([data, class_label])

# Convert into a pandas DataFrame 
featuresdf = pd.DataFrame(features, columns=['feature','class_label'])

print('Finished feature extraction from ', len(featuresdf), ' files') 


Finished feature extraction from  8732  files


In [17]:
from sklearn.preprocessing import LabelEncoder
from keras.utils import to_categorical

# Convert features and corresponding classification labels into numpy arrays
X = np.array(featuresdf.feature.tolist())
y = np.array(featuresdf.class_label.tolist())

# Encode the classification labels
le = LabelEncoder()
yy = to_categorical(le.fit_transform(y)) 

# split the dataset 
from sklearn.model_selection import train_test_split 

x_train, x_test, y_train, y_test = train_test_split(X, yy, test_size=0.2, random_state = 42)

### Convolutional Neural Network (CNN) model architecture 


We will modify our model to be a Convolutional Neural Network (CNN), again using Keras with a TensorFlow backend. 

Again we will use a `Sequential` model, starting with a simple model architecture consisting of four `Conv2D` convolution layers, with our final output layer being a `Dense` layer. 

The convolution layers are designed for feature detection. Each one works by sliding a filter window over the input, multiplying the filter element-wise with the values under the window, summing the products, and storing the result in a feature map. This operation is known as a convolution. 
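
As a rough illustration of the sliding-window operation, here is a minimal NumPy sketch for a single 2D input and a single filter. The function name `conv2d_valid` and the toy values are made up for this example; a Keras `Conv2D` layer also handles multiple channels, multiple filters, a bias term and learned weights, and like this sketch it does not flip the kernel.

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Slide `kernel` over `image` (no padding, stride 1) and return the feature map."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    feature_map = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            window = image[i:i + kh, j:j + kw]
            # Element-wise product of window and kernel, summed to one number
            feature_map[i, j] = np.sum(window * kernel)
    return feature_map

image = np.arange(12, dtype=float).reshape(3, 4)   # toy 3x4 input
kernel = np.array([[1.0, 0.0],
                   [0.0, -1.0]])                    # toy 2x2 filter

result = conv2d_valid(image, kernel)
print(result.shape)
print(result)
```

A 2x2 filter on a 3x4 input gives a 2x3 feature map, because the window fits in two vertical and three horizontal positions.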


The `filters` parameter specifies the number of filters (output feature maps) in each layer. The layers increase in size from 16 to 32, 64 and 128 filters, while the `kernel_size` parameter specifies the size of the kernel window, which in this case is 2, resulting in a 2x2 filter matrix. 

The first layer will receive the input shape of (40, 174, 1), where 40 is the number of MFCCs, 174 is the number of frames taking padding into account, and 1 is the number of channels, signifying that the audio is mono. 
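
The effect of these settings on the layer shapes can be checked with some simple arithmetic. The sketch below assumes the Keras defaults used in this notebook (no padding, stride 1 for the convolutions, and pooling windows of 2 with stride 2); the helper names are made up for this example. The first few values it prints should agree with the `model.summary()` output further down.

```python
def conv_out(size, kernel_size=2):
    # Convolution without padding, stride 1
    return size - kernel_size + 1

def pool_out(size, pool_size=2):
    # Pooling with stride equal to the pool size; a leftover row/column is dropped
    return size // pool_size

def conv_params(in_channels, filters, kernel_size=2):
    # One kernel_size x kernel_size x in_channels weight block plus one bias per filter
    return (kernel_size * kernel_size * in_channels + 1) * filters

rows, cols, channels = 40, 174, 1
for filters in (16, 32, 64, 128):
    params = conv_params(channels, filters)
    rows, cols = conv_out(rows), conv_out(cols)
    print(f"Conv2D({filters}): ({rows}, {cols}, {filters}), params={params}")
    rows, cols = pool_out(rows), pool_out(cols)
    print(f"MaxPooling2D: ({rows}, {cols}, {filters})")
    channels = filters
```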

The activation function we will be using for our convolutional layers is `ReLU` which is the same as our previous model. We will use a smaller `Dropout` value of 20% on our convolutional layers. 

Each convolutional layer is followed by a pooling layer of `MaxPooling2D` type, and the final convolutional block is additionally followed by a `GlobalAveragePooling2D` layer. The pooling layers reduce the dimensionality of the model (by reducing the parameters and subsequent computation requirements), which serves to shorten the training time and reduce overfitting. Max pooling takes the maximum value in each window, while global average pooling takes the average over each entire feature map, giving one value per filter, which is suitable for feeding into our `Dense` output layer.  
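
The two pooling types can be sketched in NumPy as follows. This is an illustration on a made-up 4x4 feature map, not the Keras implementation; `max_pool_2x2` and `global_average_pool` are hypothetical helper names.

```python
import numpy as np

def max_pool_2x2(feature_map):
    """Maximum of each non-overlapping 2x2 window (odd leftovers are dropped)."""
    h, w = feature_map.shape[0] // 2, feature_map.shape[1] // 2
    trimmed = feature_map[:h * 2, :w * 2]
    return trimmed.reshape(h, 2, w, 2).max(axis=(1, 3))

def global_average_pool(feature_maps):
    """Mean over height and width, leaving one value per channel."""
    return feature_maps.mean(axis=(0, 1))

fmap = np.array([[1, 2, 5, 6],
                 [3, 4, 7, 8],
                 [9, 1, 2, 3],
                 [1, 1, 4, 0]], dtype=float)

pooled = max_pool_2x2(fmap)
print(pooled)                                   # 2x2 map of window maxima

stack = np.stack([fmap, fmap * 2], axis=-1)     # shape (4, 4, 2): two channels
averaged = global_average_pool(stack)
print(averaged)                                 # one average per channel
```

Max pooling halves each spatial dimension, while global average pooling collapses them entirely, so its output length equals the number of filters in the last convolutional layer.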

Our output layer will have 10 nodes (`num_labels`), which matches the number of possible classifications. The activation for our output layer is `softmax`. Softmax makes the output sum up to 1 so the output can be interpreted as probabilities. The model will then make its prediction based on which option has the highest probability.
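
To make the softmax step concrete, here is a small NumPy sketch. The raw scores are made up and only three of the ten class names are used for brevity; it is meant as an illustration of the formula rather than of what Keras does internally.

```python
import numpy as np

def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    shifted = logits - np.max(logits)      # subtract the max for numerical stability
    exps = np.exp(shifted)
    return exps / exps.sum()

class_names = ["air_conditioner", "car_horn", "drilling"]   # 3 of the 10 classes
logits = np.array([2.0, 0.5, 1.0])                          # made-up raw scores

probs = softmax(logits)
predicted = class_names[int(np.argmax(probs))]
print(probs, probs.sum())
print("predicted:", predicted)
```

The class with the largest raw score also has the largest probability, so taking the `argmax` of the softmax output gives the prediction.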

In [18]:
import numpy as np
from keras.models import Sequential
from keras.layers import Dense, Dropout
from keras.layers import Conv2D, MaxPooling2D, GlobalAveragePooling2D

num_rows = 40
num_columns = 174
num_channels = 1

x_train = x_train.reshape(x_train.shape[0], num_rows, num_columns, num_channels)
x_test = x_test.reshape(x_test.shape[0], num_rows, num_columns, num_channels)

num_labels = yy.shape[1]
filter_size = 2

# Construct model 
model = Sequential()
model.add(Conv2D(filters=16, kernel_size=2, input_shape=(num_rows, num_columns, num_channels), activation='relu'))
model.add(MaxPooling2D(pool_size=2))
model.add(Dropout(0.2))

model.add(Conv2D(filters=32, kernel_size=2, activation='relu'))
model.add(MaxPooling2D(pool_size=2))
model.add(Dropout(0.2))

model.add(Conv2D(filters=64, kernel_size=2, activation='relu'))
model.add(MaxPooling2D(pool_size=2))
model.add(Dropout(0.2))

model.add(Conv2D(filters=128, kernel_size=2, activation='relu'))
model.add(MaxPooling2D(pool_size=2))
model.add(Dropout(0.2))
model.add(GlobalAveragePooling2D())

model.add(Dense(num_labels, activation='softmax')) 

### Compiling the model 

For compiling our model, we will use the same three parameters as the previous model: the `categorical_crossentropy` loss function, the `accuracy` metric and the `adam` optimizer. 

In [19]:
# Compile the model
model.compile(loss='categorical_crossentropy', metrics=['accuracy'], optimizer='adam') 

In [20]:
# Display model architecture summary 
model.summary()

# Calculate pre-training accuracy 
score = model.evaluate(x_test, y_test, verbose=1)
accuracy = 100*score[1]

print("Pre-training accuracy: %.4f%%" % accuracy)

Model: "sequential_2"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
conv2d_5 (Conv2D)            (None, 39, 173, 16)       80        
_________________________________________________________________
max_pooling2d_5 (MaxPooling2 (None, 19, 86, 16)        0         
_________________________________________________________________
dropout_5 (Dropout)          (None, 19, 86, 16)        0         
_________________________________________________________________
conv2d_6 (Conv2D)            (None, 18, 85, 32)        2080      
_________________________________________________________________
max_pooling2d_6 (MaxPooling2 (None, 9, 42, 32)         0         
_________________________________________________________________
dropout_6 (Dropout)          (None, 9, 42, 32)         0         
_________________________________________________________________
conv2d_7 (Conv2D)            (None, 8, 41, 64)        

### Training 

Here we will train the model. As training a CNN can take a significant amount of time, we will start with a low number of epochs and a low batch size. If we can see from the output that the model is converging, we will increase both numbers.  
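
One simple way to judge whether the model is still converging is to check whether the validation loss has improved recently. The helper below is a hypothetical sketch of such a check, not something the notebook itself uses; Keras also provides an `EarlyStopping` callback that applies a similar patience rule during training.

```python
def still_improving(val_losses, patience=3):
    """True if the best of the last `patience` losses beats every earlier loss."""
    if len(val_losses) <= patience:
        return True   # too little history to judge
    best_before = min(val_losses[:-patience])
    best_recent = min(val_losses[-patience:])
    return best_recent < best_before

# val_loss for epochs 1-8, copied from the training log printed below
early_epochs = [2.24183, 1.96316, 1.76548, 1.63127, 1.56170, 1.49144, 1.48890, 1.35168]
print(still_improving(early_epochs))

# Made-up plateau for contrast
plateau = [0.70, 0.60, 0.61, 0.62, 0.63]
print(still_improving(plateau))
```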

In [21]:
from keras.callbacks import ModelCheckpoint 
from datetime import datetime 

#num_epochs = 12
#num_batch_size = 128

num_epochs = 72
num_batch_size = 256

checkpointer = ModelCheckpoint(filepath='saved_models/weights.best.basic_cnn.hdf5', 
                               verbose=1, save_best_only=True)
start = datetime.now()

model.fit(x_train, y_train, batch_size=num_batch_size, epochs=num_epochs, validation_data=(x_test, y_test), callbacks=[checkpointer], verbose=1)


duration = datetime.now() - start
print("Training completed in time: ", duration)

Train on 6985 samples, validate on 1747 samples
Epoch 1/72

Epoch 00001: val_loss improved from inf to 2.24183, saving model to saved_models/weights.best.basic_cnn.hdf5
Epoch 2/72

Epoch 00002: val_loss improved from 2.24183 to 1.96316, saving model to saved_models/weights.best.basic_cnn.hdf5
Epoch 3/72

Epoch 00003: val_loss improved from 1.96316 to 1.76548, saving model to saved_models/weights.best.basic_cnn.hdf5
Epoch 4/72

Epoch 00004: val_loss improved from 1.76548 to 1.63127, saving model to saved_models/weights.best.basic_cnn.hdf5
Epoch 5/72

Epoch 00005: val_loss improved from 1.63127 to 1.56170, saving model to saved_models/weights.best.basic_cnn.hdf5
Epoch 6/72

Epoch 00006: val_loss improved from 1.56170 to 1.49144, saving model to saved_models/weights.best.basic_cnn.hdf5
Epoch 7/72

Epoch 00007: val_loss improved from 1.49144 to 1.48890, saving model to saved_models/weights.best.basic_cnn.hdf5
Epoch 8/72

Epoch 00008: val_loss improved from 1.48890 to 1.35168, saving model 


Epoch 00035: val_loss did not improve from 0.68218
Epoch 36/72

Epoch 00036: val_loss did not improve from 0.68218
Epoch 37/72

Epoch 00037: val_loss did not improve from 0.68218
Epoch 38/72

Epoch 00038: val_loss improved from 0.68218 to 0.62521, saving model to saved_models/weights.best.basic_cnn.hdf5
Epoch 39/72

Epoch 00039: val_loss improved from 0.62521 to 0.60002, saving model to saved_models/weights.best.basic_cnn.hdf5
Epoch 40/72

Epoch 00040: val_loss improved from 0.60002 to 0.56954, saving model to saved_models/weights.best.basic_cnn.hdf5
Epoch 41/72

Epoch 00041: val_loss did not improve from 0.56954
Epoch 42/72

Epoch 00042: val_loss improved from 0.56954 to 0.56138, saving model to saved_models/weights.best.basic_cnn.hdf5
Epoch 43/72

Epoch 00043: val_loss did not improve from 0.56138
Epoch 44/72

Epoch 00044: val_loss improved from 0.56138 to 0.55647, saving model to saved_models/weights.best.basic_cnn.hdf5
Epoch 45/72

Epoch 00045: val_loss did not improve from 0.5564

### Test the model 

Here we will review the accuracy of the model on both the training and test data sets. 

In [22]:
# Evaluating the model on the training and testing set
score = model.evaluate(x_train, y_train, verbose=0)
print("Training Accuracy: ", score[1])

score = model.evaluate(x_test, y_test, verbose=0)
print("Testing Accuracy: ", score[1])

Training Accuracy:  0.873729407787323
Testing Accuracy:  0.826559841632843


The Training and Testing accuracy scores printed above (~87.4% and ~82.7%) are both reasonably high, but in this run they are lower than those of our initial model: Training accuracy is ~5% lower and Testing accuracy is ~4% lower. 

The difference between the Training and Test scores is essentially unchanged (~5% in both cases), so the difference remains low and the model does not appear to have suffered from overfitting. 

### Predictions  

Here we will modify our previous method for testing the model's predictions on a specified audio .wav file. 

In [23]:
def print_prediction(file_name):
    prediction_feature = extract_features(file_name) 
    prediction_feature = prediction_feature.reshape(1, num_rows, num_columns, num_channels)

    # predict() returns one row of class probabilities per input sample
    # (predict_classes / predict_proba are not available in recent Keras versions)
    predicted_proba = model.predict(prediction_feature)[0]

    # The predicted class is the index with the highest probability
    predicted_index = np.argmax(predicted_proba)
    predicted_class = le.inverse_transform([predicted_index]) 
    print("The predicted class is:", predicted_class[0], '\n') 

    for i in range(len(predicted_proba)): 
        category = le.inverse_transform(np.array([i]))
        print(category[0], "\t\t : ", format(predicted_proba[i], '.32f') )

### Validation 

#### Test with sample data 

As before we will verify the predictions using a subsection of the sample audio files we explored in the first notebook. We expect the bulk of these to be classified correctly. 

In [24]:
# Class: Air Conditioner

filename = '../UrbanSound8K/audio/fold5/100852-0-0-0.wav' 
print_prediction(filename) 

The predicted class is: air_conditioner 

air_conditioner 		 :  0.89445227384567260742187500000000
car_horn 		 :  0.00122791691683232784271240234375
children_playing 		 :  0.00020793086150661110877990722656
dog_bark 		 :  0.00052182417130097746849060058594
drilling 		 :  0.05325928330421447753906250000000
engine_idling 		 :  0.00283017568290233612060546875000
gun_shot 		 :  0.00010066344839287921786308288574
jackhammer 		 :  0.04609480500221252441406250000000
siren 		 :  0.00116794812493026256561279296875
street_music 		 :  0.00013724608288612216711044311523


In [25]:
# Class: Drilling

filename = '../UrbanSound8K/audio/fold3/103199-4-0-0.wav'
print_prediction(filename) 

The predicted class is: drilling 

air_conditioner 		 :  0.00000051642001608342980034649372
car_horn 		 :  0.00002358512574573978781700134277
children_playing 		 :  0.00000032223891821558936499059200
dog_bark 		 :  0.00000154858275891456287354230881
drilling 		 :  0.99989855289459228515625000000000
engine_idling 		 :  0.00000080972461091732839122414589
gun_shot 		 :  0.00000000607995476187284111802001
jackhammer 		 :  0.00005069882536190561950206756592
siren 		 :  0.00000018746978014405613066628575
street_music 		 :  0.00002382141792622860521078109741


In [26]:
# Class: Street music 

filename = '../UrbanSound8K/audio/fold7/101848-9-0-0.wav'
print_prediction(filename) 

The predicted class is: street_music 

air_conditioner 		 :  0.01084155682474374771118164062500
car_horn 		 :  0.24718631803989410400390625000000
children_playing 		 :  0.01279234234243631362915039062500
dog_bark 		 :  0.00238284398801624774932861328125
drilling 		 :  0.00882344227284193038940429687500
engine_idling 		 :  0.00010181669495068490505218505859
gun_shot 		 :  0.00000000610049211147156711376738
jackhammer 		 :  0.00433317059651017189025878906250
siren 		 :  0.00037572311703115701675415039062
street_music 		 :  0.71316283941268920898437500000000


In [27]:
# Class: Car Horn 

filename = '../UrbanSound8K/audio/fold10/100648-1-0-0.wav'
print_prediction(filename) 

The predicted class is: car_horn 

air_conditioner 		 :  0.00313331861980259418487548828125
car_horn 		 :  0.27611202001571655273437500000000
children_playing 		 :  0.00371271255426108837127685546875
dog_bark 		 :  0.22665859758853912353515625000000
drilling 		 :  0.14113393425941467285156250000000
engine_idling 		 :  0.01280300226062536239624023437500
gun_shot 		 :  0.15091632306575775146484375000000
jackhammer 		 :  0.17114979028701782226562500000000
siren 		 :  0.01344307884573936462402343750000
street_music 		 :  0.00093731941888108849525451660156


#### Observations 

We can see that the model performs well on these samples: all four were classified correctly. 

Interestingly, car horn, which was misclassified previously, was classified correctly this time, though the per-class confidence shows it was a close decision between car horn at ~28% confidence and dog bark at ~23% confidence.  
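
How decisive a prediction is can be read off by ranking the class probabilities. The sketch below uses the values printed for the car horn sample above, rounded to four decimal places; `top_k` is a hypothetical helper written for this illustration.

```python
# Probabilities copied (rounded to 4 places) from the car horn output above
car_horn_probs = {
    "air_conditioner": 0.0031, "car_horn": 0.2761, "children_playing": 0.0037,
    "dog_bark": 0.2267, "drilling": 0.1411, "engine_idling": 0.0128,
    "gun_shot": 0.1509, "jackhammer": 0.1711, "siren": 0.0134, "street_music": 0.0009,
}

def top_k(probs, k=3):
    """Return the k (label, probability) pairs with the highest probability."""
    return sorted(probs.items(), key=lambda item: item[1], reverse=True)[:k]

ranked = top_k(car_horn_probs)
margin = ranked[0][1] - ranked[1][1]

for label, prob in ranked:
    print(f"{label}: {prob:.1%}")
print(f"margin between top two classes: {margin:.1%}")
```

A small margin between the top two classes indicates that the model was far from certain, even when the top class is the correct one.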

### Other audio

Again we will further validate our model using a sample of various copyright-free sounds that were not part of either our test or training data. 

In [28]:
filename = 'Evaluation audio/dog_bark_1.wav'
print_prediction(filename) 

The predicted class is: dog_bark 

air_conditioner 		 :  0.00059910275740548968315124511719
car_horn 		 :  0.28348797559738159179687500000000
children_playing 		 :  0.00468853907659649848937988281250
dog_bark 		 :  0.40224260091781616210937500000000
drilling 		 :  0.14014597237110137939453125000000
engine_idling 		 :  0.00143193861003965139389038085938
gun_shot 		 :  0.16420076787471771240234375000000
jackhammer 		 :  0.00112051400355994701385498046875
siren 		 :  0.00154445448424667119979858398438
street_music 		 :  0.00053813605336472392082214355469


In [29]:
filename = 'Evaluation audio/drilling_1.wav'

print_prediction(filename) 

The predicted class is: drilling 

air_conditioner 		 :  0.12794208526611328125000000000000
car_horn 		 :  0.01010722946375608444213867187500
children_playing 		 :  0.00012545159552246332168579101562
dog_bark 		 :  0.00074208871228620409965515136719
drilling 		 :  0.54446136951446533203125000000000
engine_idling 		 :  0.01167610846459865570068359375000
gun_shot 		 :  0.00001692227488092612475156784058
jackhammer 		 :  0.30254912376403808593750000000000
siren 		 :  0.00217386637814342975616455078125
street_music 		 :  0.00020569369371514767408370971680


In [30]:
filename = 'Evaluation audio/gun_shot_1.wav'

print_prediction(filename) 

The predicted class is: gun_shot 

air_conditioner 		 :  0.00001667208744038362056016921997
car_horn 		 :  0.00001367173626931617036461830139
children_playing 		 :  0.00005977243199595250189304351807
dog_bark 		 :  0.02629834413528442382812500000000
drilling 		 :  0.00743376137688755989074707031250
engine_idling 		 :  0.00021658393961843103170394897461
gun_shot 		 :  0.96557801961898803710937500000000
jackhammer 		 :  0.00003220059079467318952083587646
siren 		 :  0.00034986759419552981853485107422
street_music 		 :  0.00000113762780529214069247245789


#### Observations 

Our final model appears to have generalised reasonably well: all three of the new audio samples were classified correctly, although the dog bark prediction was made with only ~40% confidence. 