# Classifying Urban sounds using Deep Learning

## 4 Model Refinement 

### Load Preprocessed data 

In [1]:
# retrieve the preprocessed data from previous notebook

%store -r x_train 
%store -r x_test 
%store -r y_train 
%store -r y_test 
%store -r yy 
%store -r le

#### Model refinement

In our initial attempt, we achieved the following classification accuracy scores: 

* Training data Accuracy:  92.3% 
* Testing data Accuracy:  87% 

We will now see if we can improve upon that score using a Convolutional Neural Network (CNN). 

#### Feature Extraction refinement 

In the previous feature extraction stage, the MFCC vectors varied in size between audio files, depending on each sample's duration. 

However, CNNs require a fixed size for all inputs. To overcome this we will zero pad the output vectors so that they are all the same size. 

In [4]:
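import numpy as np
import librosa

max_pad_len = 174

def extract_features(file_name):
    # Load the audio and compute 40 MFCCs per frame
    audio, sample_rate = librosa.load(file_name, res_type='kaiser_fast') 
    mfccs = librosa.feature.mfcc(y=audio, sr=sample_rate, n_mfcc=40)

    # Truncate clips with more than max_pad_len frames so pad_width is never negative
    mfccs = mfccs[:, :max_pad_len]

    # Zero pad the frame axis so every output has shape (40, max_pad_len)
    pad_width = max_pad_len - mfccs.shape[1]
    mfccs = np.pad(mfccs, pad_width=((0, 0), (0, pad_width)), mode='constant')

    return mfccs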
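
To illustrate the padding step in isolation, the following minimal sketch applies the same `np.pad` call to small synthetic arrays rather than real MFCCs. The helper name `pad_or_truncate` and the toy sizes are assumptions made for this example, and truncating over-long inputs is an added safeguard rather than something the dataset is known to need.

```python
import numpy as np

def pad_or_truncate(matrix, target_len):
    """Return `matrix` with exactly `target_len` columns (hypothetical helper)."""
    n_cols = matrix.shape[1]
    if n_cols >= target_len:
        # Long enough already: keep only the first target_len columns
        return matrix[:, :target_len]
    # Too short: append zero columns on the right-hand side
    return np.pad(matrix, pad_width=((0, 0), (0, target_len - n_cols)), mode='constant')

short_clip = np.ones((2, 3))  # 2 coefficients x 3 frames
long_clip = np.ones((2, 7))   # 2 coefficients x 7 frames

padded = pad_or_truncate(short_clip, 5)
clipped = pad_or_truncate(long_clip, 5)

print(padded.shape, clipped.shape)
print(padded)
```

Both results have five columns, and the columns added to the short clip contain only zeros.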

In [6]:
# Load various imports 
import pandas as pd
import os
import librosa

# Set the path to the full UrbanSound dataset 
fulldatasetpath = '../UrbanSound8K/audio/'

metadata = pd.read_csv('../UrbanSound Dataset sample/metadata/UrbanSound8K.csv')

features = []

# Iterate through each sound file and extract the features 
for index, row in metadata.iterrows():
    
    file_name = os.path.join(os.path.abspath(fulldatasetpath), 'fold' + str(row["fold"]), str(row["slice_file_name"]))
    
    class_label = row["class_name"]
    data = extract_features(file_name)
    
    features.append([data, class_label])

# Convert into a pandas DataFrame 
featuresdf = pd.DataFrame(features, columns=['feature','class_label'])

print('Finished feature extraction from ', len(featuresdf), ' files') 



Finished feature extraction from  8732  files


In [7]:
from sklearn.preprocessing import LabelEncoder
from keras.utils import to_categorical

# Convert features and corresponding classification labels into numpy arrays
X = np.array(featuresdf.feature.tolist())
y = np.array(featuresdf.class_label.tolist())

# Encode the classification labels
le = LabelEncoder()
yy = to_categorical(le.fit_transform(y)) 

# split the dataset 
from sklearn.model_selection import train_test_split 

x_train, x_test, y_train, y_test = train_test_split(X, yy, test_size=0.2, random_state = 42)
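
The label encoding performed in the cell above can be pictured with a small NumPy-only sketch. It mimics what `LabelEncoder` followed by `to_categorical` produces, using a handful of made-up labels; it is an illustration of the idea, not the code those libraries run internally.

```python
import numpy as np

# Toy labels, assumed for this example only
labels = np.array(['siren', 'dog_bark', 'siren', 'drilling'])

# Map each distinct label to an integer index (classes are sorted alphabetically)
classes, encoded = np.unique(labels, return_inverse=True)

# Turn each integer into a one-hot row: a 1 in the class column, 0 elsewhere
one_hot = np.eye(len(classes))[encoded]

print(classes)
print(encoded)
print(one_hot)
```

Each row of `one_hot` contains a single 1, which is the target format expected by a `softmax` output trained with categorical cross-entropy.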

### Convolutional Neural Network (CNN) model architecture 


We will modify our model to be a Convolutional Neural Network (CNN), again using Keras with a TensorFlow backend. 

Again we will use a `Sequential` model, starting with a simple architecture consisting of four `Conv2D` convolution layers, with a `Dense` layer as our final output layer. 

The convolution layers are designed for feature detection. Each one works by sliding a filter window over the input, multiplying the values in the window element-wise by the filter weights, summing the products, and storing the result in a feature map. This operation is known as a convolution. 
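
The sliding-window operation can be sketched in plain NumPy. This is a simplified illustration for a single channel and a single hand-picked 2x2 kernel; the function name `conv2d_valid` is an assumption for this example, and a Keras `Conv2D` layer additionally learns the kernel values, adds a bias and handles many filters and channels at once.

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Slide `kernel` over `image` with stride 1 and no padding."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    feature_map = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            window = image[i:i + kh, j:j + kw]
            # Element-wise product of window and kernel, summed to one number
            feature_map[i, j] = np.sum(window * kernel)
    return feature_map

image = np.arange(12, dtype=float).reshape(3, 4)  # toy 3x4 input
kernel = np.array([[1.0, 0.0],
                   [0.0, -1.0]])                  # toy 2x2 filter

feature_map = conv2d_valid(image, kernel)
print(feature_map.shape)
print(feature_map)
```

A 3x4 input and a 2x2 kernel give a 2x3 feature map, because the window fits in one fewer position along each dimension.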


The `filters` parameter specifies the number of filters (output feature maps) in each layer. The layers increase in size from 16 to 32, 64 and 128 filters, while the `kernel_size` parameter specifies the size of the kernel window, which in this case is 2, resulting in a 2x2 filter matrix. 

The first layer will receive an input of shape (40, 174, 1), where 40 is the number of MFCCs, 174 is the number of frames after padding, and 1 is the number of channels, signifying that the audio is mono. 
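
The effect of the 2x2 kernels and 2x2 pooling on this input shape can be worked out by hand. The sketch below does the arithmetic, assuming the Keras defaults used in this notebook's model cell (stride 1 and no padding for `Conv2D`, non-overlapping windows for `MaxPooling2D`); the helper names are hypothetical.

```python
def conv_output(size, kernel=2):
    # An unpadded convolution with stride 1 shrinks each dimension by kernel - 1
    return size - kernel + 1

def pool_output(size, pool=2):
    # Non-overlapping pooling keeps floor(size / pool) positions
    return size // pool

def conv_params(in_channels, filters, kernel=2):
    # One kernel x kernel x in_channels weight block plus one bias per filter
    return (kernel * kernel * in_channels + 1) * filters

rows, cols, channels = 40, 174, 1
shapes, params = [], []
for filters in (16, 32, 64, 128):
    params.append(conv_params(channels, filters))
    rows, cols = conv_output(rows), conv_output(cols)
    shapes.append(('conv', rows, cols, filters))
    rows, cols = pool_output(rows), pool_output(cols)
    shapes.append(('pool', rows, cols, filters))
    channels = filters

for shape in shapes:
    print(shape)
print(params)
```

The values for the first two blocks can be compared against the output shapes and parameter counts that `model.summary()` prints later in this notebook.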

The activation function we will be using for our convolutional layers is `ReLU` which is the same as our previous model. We will use a smaller `Dropout` value of 20% on our convolutional layers. 

Each convolutional layer is followed by a pooling layer of `MaxPooling2D` type, and the final convolutional block is followed by a `GlobalAveragePooling2D` layer. The pooling layers reduce the dimensionality of the model (by reducing the parameters and subsequent computation requirements), which serves to shorten the training time and reduce overfitting. Max pooling takes the maximum value in each window, while global average pooling takes the average of each feature map, producing a vector that is suitable for feeding into our `Dense` output layer.  
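
The two pooling types can be illustrated with a small NumPy sketch. The helper `max_pool_2x2` and the toy values are assumptions for this example; it handles a single feature map, whereas the Keras layers operate on whole batches.

```python
import numpy as np

def max_pool_2x2(feature_map):
    """Non-overlapping 2x2 max pooling; odd trailing rows/columns are dropped."""
    h, w = feature_map.shape
    trimmed = feature_map[:h - h % 2, :w - w % 2]
    # Group the map into 2x2 blocks and take the maximum of each block
    blocks = trimmed.reshape(h // 2, 2, w // 2, 2)
    return blocks.max(axis=(1, 3))

feature_map = np.array([[1., 2., 0., 1., 9.],
                        [3., 4., 1., 0., 9.],
                        [0., 0., 5., 6., 9.],
                        [0., 1., 7., 8., 9.]])

pooled = max_pool_2x2(feature_map)

# Global average pooling: one average per channel over all spatial positions
stacked = np.stack([pooled, pooled * 2], axis=-1)  # toy input with 2 channels
global_avg = stacked.mean(axis=(0, 1))

print(pooled)
print(global_avg)
```

Max pooling halves each spatial dimension, while global average pooling collapses every feature map to a single number, leaving one value per channel.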

Our output layer will have 10 nodes (`num_labels`), which matches the number of possible classifications. The activation for our output layer is `softmax`. Softmax makes the outputs sum to 1 so that they can be interpreted as probabilities. The model will then make its prediction based on which option has the highest probability.
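
As a minimal sketch of what `softmax` does, the following NumPy function converts raw scores into probabilities and picks the most likely class. The three scores are hypothetical and stand in for the 10 outputs of the real model.

```python
import numpy as np

def softmax(logits):
    # Subtract the maximum first for numerical stability; the result is unchanged
    shifted = logits - np.max(logits)
    exp = np.exp(shifted)
    return exp / exp.sum()

logits = np.array([2.0, 1.0, 0.1])  # hypothetical raw scores for three classes
probs = softmax(logits)
predicted_index = int(np.argmax(probs))

print(probs)
print(probs.sum())
print(predicted_index)
```

The probabilities are all positive and sum to 1, and `np.argmax` selects the class with the highest one, which is how `print_prediction` chooses its label later in this notebook.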

In [9]:
import numpy as np
from keras.models import Sequential
from keras.layers import Dense, Dropout, Activation, Flatten
from keras.layers import Convolution2D, Conv2D, MaxPooling2D, GlobalAveragePooling2D
from keras.optimizers import Adam
# from keras.utils import np_utils
from sklearn import metrics 

num_rows = 40
num_columns = 174
num_channels = 1

x_train = x_train.reshape(x_train.shape[0], num_rows, num_columns, num_channels)
x_test = x_test.reshape(x_test.shape[0], num_rows, num_columns, num_channels)

num_labels = yy.shape[1]
filter_size = 2

# Construct model 
model = Sequential()
model.add(Conv2D(filters=16, kernel_size=2, input_shape=(num_rows, num_columns, num_channels), activation='relu'))
model.add(MaxPooling2D(pool_size=2))
model.add(Dropout(0.2))

model.add(Conv2D(filters=32, kernel_size=2, activation='relu'))
model.add(MaxPooling2D(pool_size=2))
model.add(Dropout(0.2))

model.add(Conv2D(filters=64, kernel_size=2, activation='relu'))
model.add(MaxPooling2D(pool_size=2))
model.add(Dropout(0.2))

model.add(Conv2D(filters=128, kernel_size=2, activation='relu'))
model.add(MaxPooling2D(pool_size=2))
model.add(Dropout(0.2))
model.add(GlobalAveragePooling2D())

model.add(Dense(num_labels, activation='softmax')) 

### Compiling the model 

For compiling our model, we will use the same three parameters as the previous model (the `categorical_crossentropy` loss, the `accuracy` metric and the `adam` optimizer): 

In [10]:
# Compile the model
model.compile(loss='categorical_crossentropy', metrics=['accuracy'], optimizer='adam') 
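
To make the loss and metric concrete, here is a rough NumPy sketch of categorical cross-entropy and accuracy on two made-up predictions. The function names and the clipping constant `eps` are assumptions for this example; Keras computes the same quantities internally with its own implementation.

```python
import numpy as np

def categorical_crossentropy(y_true, y_pred, eps=1e-7):
    # Clip to avoid log(0), then average -sum(true * log(pred)) over samples
    y_pred = np.clip(y_pred, eps, 1.0 - eps)
    return float(np.mean(-np.sum(y_true * np.log(y_pred), axis=1)))

def accuracy(y_true, y_pred):
    # Fraction of samples whose most probable class matches the true class
    return float(np.mean(np.argmax(y_true, axis=1) == np.argmax(y_pred, axis=1)))

y_true = np.array([[1, 0, 0], [0, 1, 0]])              # one-hot targets
y_pred = np.array([[0.7, 0.2, 0.1], [0.5, 0.3, 0.2]])  # toy softmax outputs

loss = categorical_crossentropy(y_true, y_pred)
acc = accuracy(y_true, y_pred)
print(loss, acc)
```

The loss penalises low probability on the true class, so the second sample (only 0.3 on the correct class) contributes most of it, while accuracy only records that one of the two predictions was right.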

In [11]:
# Display model architecture summary 
model.summary()

# Calculate pre-training accuracy 
score = model.evaluate(x_test, y_test, verbose=1)
accuracy = 100*score[1]

print("Pre-training accuracy: %.4f%%" % accuracy)

Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 conv2d (Conv2D)             (None, 39, 173, 16)       80        
                                                                 
 max_pooling2d (MaxPooling2  (None, 19, 86, 16)        0         
 D)                                                              
                                                                 
 dropout (Dropout)           (None, 19, 86, 16)        0         
                                                                 
 conv2d_1 (Conv2D)           (None, 18, 85, 32)        2080      
                                                                 
 max_pooling2d_1 (MaxPoolin  (None, 9, 42, 32)         0         
 g2D)                                                            
                                                                 
 dropout_1 (Dropout)         (None, 9, 42, 32)         0

### Training 

Here we will train the model. As training a CNN can take a significant amount of time, we will start with a low number of epochs and a low batch size. If we can see from the output that the model is converging, we will increase both numbers.  
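
As a rough guide to how the batch size affects the work done per epoch, the sketch below counts the weight updates per epoch for the two batch sizes that appear in the training cell. It assumes the 8732 extracted files and the 80/20 split made earlier in this notebook; the variable names are illustrative.

```python
import math

n_samples = 8732                     # files reported by the feature extraction cell
n_test = math.ceil(0.2 * n_samples)  # the test share is rounded up
n_train = n_samples - n_test

steps_per_epoch = {}
for batch_size in (128, 256):
    # The last batch of an epoch may be smaller than batch_size
    steps_per_epoch[batch_size] = math.ceil(n_train / batch_size)

print(n_train, n_test)
print(steps_per_epoch)
```

Doubling the batch size roughly halves the number of updates per epoch, which is one reason the epoch count is usually raised alongside it.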

In [12]:
from keras.callbacks import ModelCheckpoint 
from datetime import datetime 

#num_epochs = 12
#num_batch_size = 128

num_epochs = 72
num_batch_size = 256

checkpointer = ModelCheckpoint(filepath='saved_models/weights.best.basic_cnn.hdf5', 
                               verbose=1, save_best_only=True)
start = datetime.now()

model.fit(x_train, y_train, batch_size=num_batch_size, epochs=num_epochs, validation_data=(x_test, y_test), callbacks=[checkpointer], verbose=1)


duration = datetime.now() - start
print("Training completed in time: ", duration)

Epoch 1/72
Epoch 1: val_loss improved from inf to 2.06145, saving model to saved_models/weights.best.basic_cnn.hdf5
Epoch 2/72


Epoch 2: val_loss improved from 2.06145 to 1.82104, saving model to saved_models/weights.best.basic_cnn.hdf5
Epoch 3/72
Epoch 3: val_loss improved from 1.82104 to 1.58004, saving model to saved_models/weights.best.basic_cnn.hdf5
Epoch 4/72
Epoch 4: val_loss improved from 1.58004 to 1.45514, saving model to saved_models/weights.best.basic_cnn.hdf5
Epoch 5/72
Epoch 5: val_loss improved from 1.45514 to 1.37081, saving model to saved_models/weights.best.basic_cnn.hdf5
Epoch 6/72
Epoch 6: val_loss improved from 1.37081 to 1.27913, saving model to saved_models/weights.best.basic_cnn.hdf5
Epoch 7/72
Epoch 7: val_loss improved from 1.27913 to 1.22794, saving model to saved_models/weights.best.basic_cnn.hdf5
Epoch 8/72
Epoch 8: val_loss improved from 1.22794 to 1.20379, saving model to saved_models/weights.best.basic_cnn.hdf5
Epoch 9/72
Epoch 9: val_loss improved from 1.20379 to 1.14763, saving model to saved_models/weights.best.basic_cnn.hdf5
Epoch 10/72
Epoch 10: val_loss improved from 1.1476

### Test the model 

Here we will review the accuracy of the model on both the training and test data sets. 

In [13]:
# Evaluating the model on the training and testing set
score = model.evaluate(x_train, y_train, verbose=0)
print("Training Accuracy: ", score[1])

score = model.evaluate(x_test, y_test, verbose=0)
print("Testing Accuracy: ", score[1])

Training Accuracy:  0.9537580609321594
Testing Accuracy:  0.903262734413147


The training and testing accuracy scores are both high and an improvement on our initial model. Training accuracy has increased by ~3 percentage points (from 92.3% to 95.4%) and testing accuracy has also increased by ~3 percentage points (from 87% to 90.3%). 

The difference between the training and test scores is essentially unchanged at ~5 percentage points and remains low, so the model shows no sign of significant overfitting. 
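
The comparison with the initial model can be checked with a little arithmetic on the reported scores. The sketch below uses the initial scores quoted at the top of this notebook and the CNN scores printed by the evaluation cell, all expressed in percent; the dictionary names are illustrative.

```python
# Accuracy scores in percent: the initial model as quoted earlier in this
# notebook, the CNN as printed by the evaluation cell
initial = {'train': 92.3, 'test': 87.0}
cnn = {'train': 95.37580609321594, 'test': 90.3262734413147}

# Improvement of the CNN over the initial model, in percentage points
gain = {split: cnn[split] - initial[split] for split in initial}

# Gap between training and testing accuracy for each model
gap_initial = initial['train'] - initial['test']
gap_cnn = cnn['train'] - cnn['test']

print(f"Gain (train / test): {gain['train']:.1f} / {gain['test']:.1f} points")
print(f"Train-test gap: {gap_initial:.1f} (initial) vs {gap_cnn:.1f} (CNN) points")
```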

### Predictions  

Here we will modify our previous method for testing the model's predictions on a specified audio .wav file. 

In [34]:
def print_prediction(file_name):
    prediction_feature = extract_features(file_name) 
    prediction_feature = prediction_feature.reshape(1, num_rows, num_columns, num_channels)

    # Use the model's predict method to generate the probability distribution for each class
    predicted_vector = model.predict(prediction_feature)[0]  # Assuming the output is softmax and you want the first prediction

    # Get the class index and label
    predicted_class = np.argmax(predicted_vector)
    predicted_label = le.inverse_transform([predicted_class])
    
    # Print the predicted class
    print("The predicted class is:", predicted_label[0], '\n')

    # Print each class probability
    for i, prob in enumerate(predicted_vector):
        label = le.inverse_transform([i])
        print(label[0], "\t\t : ", f'{prob:.20f}')

### Validation 

#### Test with sample data 

As before we will verify the predictions using a subsection of the sample audio files we explored in the first notebook. We expect the bulk of these to be classified correctly. 

In [35]:
# Class: Air Conditioner

filename = '../UrbanSound Dataset sample/audio/100852-0-0-0.wav' 
print_prediction(filename) 

The predicted class is: air_conditioner 

air_conditioner 		 :  0.99810034036636352539
car_horn 		 :  0.00000230292471314897
children_playing 		 :  0.00005591498120338656
dog_bark 		 :  0.00001395892286382150
drilling 		 :  0.00137087260372936726
engine_idling 		 :  0.00024184925132431090
gun_shot 		 :  0.00000416626971855294
jackhammer 		 :  0.00008629060903331265
siren 		 :  0.00000028769991899935
street_music 		 :  0.00012391729978844523


In [36]:
# Class: Drilling

filename = '../UrbanSound Dataset sample/audio/103199-4-0-0.wav'
print_prediction(filename) 

The predicted class is: drilling 

air_conditioner 		 :  0.00010423925414215773
car_horn 		 :  0.00001489401529397583
children_playing 		 :  0.00000129820784877666
dog_bark 		 :  0.00000777781588112703
drilling 		 :  0.99864238500595092773
engine_idling 		 :  0.00000005987637763383
gun_shot 		 :  0.00000020046979898325
jackhammer 		 :  0.00055755622452124953
siren 		 :  0.00000000013417388034
street_music 		 :  0.00067152950214222074


In [37]:
# Class: Street music 

filename = '../UrbanSound Dataset sample/audio/101848-9-0-0.wav'
print_prediction(filename) 

The predicted class is: street_music 

air_conditioner 		 :  0.00094828725559636950
car_horn 		 :  0.00009445307659916580
children_playing 		 :  0.07455532997846603394
dog_bark 		 :  0.01619094237685203552
drilling 		 :  0.00007786870992276818
engine_idling 		 :  0.00031345422030426562
gun_shot 		 :  0.00000000133401223401
jackhammer 		 :  0.00001885023084469140
siren 		 :  0.01145354844629764557
street_music 		 :  0.89634722471237182617


In [38]:
# Class: Car Horn 

filename = '../UrbanSound Dataset sample/audio/100648-1-0-0.wav'
print_prediction(filename) 

The predicted class is: car_horn 

air_conditioner 		 :  0.00225700438022613525
car_horn 		 :  0.22851300239562988281
children_playing 		 :  0.00165492109954357147
dog_bark 		 :  0.22055572271347045898
drilling 		 :  0.14822559058666229248
engine_idling 		 :  0.00960475206375122070
gun_shot 		 :  0.19518885016441345215
jackhammer 		 :  0.17992477118968963623
siren 		 :  0.01264942344278097153
street_music 		 :  0.00142601772677153349


#### Observations 

We can see that the model performs well on these samples, classifying all four correctly. 

Interestingly, the car horn sample was a very close decision: the per-class confidence shows car horn at only ~23%, narrowly ahead of dog bark at ~22%, gun shot at ~20% and jackhammer at ~18%.  
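
How close a decision was can be quantified by ranking the class probabilities and measuring the margin between the top two. The sketch below does this for the car horn sample, using the probabilities printed above rounded to six decimal places; the variable names are illustrative.

```python
# Probabilities printed for the car horn sample, rounded to six decimals
car_horn_probs = {
    'air_conditioner': 0.002257,
    'car_horn': 0.228513,
    'children_playing': 0.001655,
    'dog_bark': 0.220556,
    'drilling': 0.148226,
    'engine_idling': 0.009605,
    'gun_shot': 0.195189,
    'jackhammer': 0.179925,
    'siren': 0.012649,
    'street_music': 0.001426,
}

# Rank classes from most to least probable
ranked = sorted(car_horn_probs.items(), key=lambda item: item[1], reverse=True)
(top_label, top_p), (second_label, second_p) = ranked[:2]
margin = top_p - second_p

print(f"{top_label} ({top_p:.3f}) vs {second_label} ({second_p:.3f}), margin {margin:.3f}")
```

A margin this small means a slight change in the input or in the trained weights could flip the prediction, so the result should be treated as uncertain.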

### Other audio

Again we will further validate our model using a sample of various copyright-free sounds that were not part of either our test or training data. 

In [31]:
filename = '../Evaluation audio/dog_bark_1.wav'
print_prediction(filename) 

The predicted class is: dog_bark 

air_conditioner 		 :  0.00077685475116595626
car_horn 		 :  0.03165749832987785339
children_playing 		 :  0.00200193352065980434
dog_bark 		 :  0.92742627859115600586
drilling 		 :  0.02914945222437381744
engine_idling 		 :  0.00033214196446351707
gun_shot 		 :  0.00486407661810517311
jackhammer 		 :  0.00058784085558727384
siren 		 :  0.00114176212809979916
street_music 		 :  0.00206223968416452408


In [32]:
filename = '../Evaluation audio/drilling_1.wav'

print_prediction(filename) 

The predicted class is: jackhammer 

air_conditioner 		 :  0.00455043744295835495
car_horn 		 :  0.00000200521253646002
children_playing 		 :  0.00000581568156121648
dog_bark 		 :  0.00009565481013851240
drilling 		 :  0.00131315516773611307
engine_idling 		 :  0.00285685434937477112
gun_shot 		 :  0.00000353644941242237
jackhammer 		 :  0.99112844467163085938
siren 		 :  0.00004353242184151895
street_music 		 :  0.00000048897226179179


In [33]:
filename = '../Evaluation audio/gun_shot_1.wav'

print_prediction(filename) 

The predicted class is: gun_shot 

air_conditioner 		 :  0.00000099360102012724
car_horn 		 :  0.00000876784179126844
children_playing 		 :  0.00126167805865406990
dog_bark 		 :  0.00551034463569521904
drilling 		 :  0.00615238584578037262
engine_idling 		 :  0.00013336223491933197
gun_shot 		 :  0.98682749271392822266
jackhammer 		 :  0.00000055469718063250
siren 		 :  0.00000494970390718663
street_music 		 :  0.00009947075886884704


#### Observations 

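The final model performs well on new audio data, though not perfectly. The dog bark and gun shot samples were classified correctly with high confidence (~93% and ~99%), while the drilling sample was misclassified as jackhammer with ~99% confidence. Overall the model appears to have generalised reasonably well, although three samples are too few to draw a firm conclusion. 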