# Classifying Urban sounds using Deep Learning

## 4 Model Refinement 

### Load Preprocessed data 

In [1]:
# retrieve the preprocessed data from previous notebook

%store -r x_train 
%store -r x_test 
%store -r y_train 
%store -r y_test 
%store -r yy 
%store -r le

#### Model refinement

In our initial attempt, we achieved the following classification accuracy scores: 

* Training data Accuracy:  92.3% 
* Testing data Accuracy:  87% 

We will now see if we can improve upon that score using a Convolutional Neural Network (CNN). 

#### Feature Extraction refinement 

In the previous feature extraction stage, the MFCC vectors varied in size between audio files (depending on each sample's duration). 

However, CNNs require a fixed size for all inputs. To overcome this we will zero pad the output vectors along the time axis so that they all have the same number of frames (174). 

In [2]:
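import numpy as np
import librosa

# Number of frames every MFCC matrix is padded to
max_pad_len = 174

def extract_features(file_name):
    """Return a (40, max_pad_len) zero-padded MFCC matrix, or None on failure."""
    try:
        audio, sample_rate = librosa.load(file_name, res_type='kaiser_fast') 
        mfccs = librosa.feature.mfcc(y=audio, sr=sample_rate, n_mfcc=40)
        # Zero pad along the time axis; a clip with more than max_pad_len
        # frames gives a negative width, which np.pad rejects
        pad_width = max_pad_len - mfccs.shape[1]
        mfccs = np.pad(mfccs, pad_width=((0, 0), (0, pad_width)), mode='constant')
        
    except Exception as e:
        print("Error encountered while parsing file: ", file_name, e)
        return None 
     
    return mfccs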
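
To see the padding step in isolation, here is a minimal sketch that uses dummy arrays in place of real MFCC matrices. The helper name `pad_or_truncate` and the truncation of over-long clips are assumptions added for illustration; the notebook's own `extract_features` only pads.

```python
import numpy as np

def pad_or_truncate(mfccs, max_len=174):
    """Return an array with exactly `max_len` columns (frames)."""
    n_frames = mfccs.shape[1]
    if n_frames >= max_len:
        # Assumption for this sketch: longer clips are cut to max_len frames
        return mfccs[:, :max_len]
    pad_width = max_len - n_frames
    # Pad only the end of the time axis with zeros
    return np.pad(mfccs, pad_width=((0, 0), (0, pad_width)), mode='constant')

short_clip = np.ones((40, 100))   # stand-in for a short clip's MFCC matrix
long_clip = np.ones((40, 200))    # stand-in for a clip longer than max_len

padded = pad_or_truncate(short_clip)
truncated = pad_or_truncate(long_clip)
print(padded.shape, truncated.shape)
```

Both results have the shape (40, 174), so they can be stacked into a single input array for the CNN.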

In [3]:
# Load various imports 
import pandas as pd
import os
import librosa

# Set the path to the full UrbanSound dataset 
fulldatasetpath = '../UrbanSound8K/audio/'

metadata = pd.read_csv('../UrbanSound8K/metadata/UrbanSound8K.csv')

features = []

# Iterate through each sound file and extract the features 
for index, row in metadata.iterrows():
    
    file_name = os.path.join(os.path.abspath(fulldatasetpath),'fold'+str(row["fold"])+'/',str(row["slice_file_name"]))
    
    class_label = row["class"]
    data = extract_features(file_name)
    
    features.append([data, class_label])

# Convert into a pandas DataFrame 
featuresdf = pd.DataFrame(features, columns=['feature','class_label'])

print('Finished feature extraction from ', len(featuresdf), ' files') 

Finished feature extraction from  8732  files


In [4]:
from sklearn.preprocessing import LabelEncoder
from keras.utils import to_categorical

# Convert features and corresponding classification labels into numpy arrays
X = np.array(featuresdf.feature.tolist())
y = np.array(featuresdf.class_label.tolist())

# Encode the classification labels
le = LabelEncoder()
yy = to_categorical(le.fit_transform(y)) 

# split the dataset 
from sklearn.model_selection import train_test_split 

x_train, x_test, y_train, y_test = train_test_split(X, yy, test_size=0.2, random_state = 42)

Using TensorFlow backend.


### Convolutional Neural Network (CNN) model architecture 


We will modify our model to be a Convolutional Neural Network (CNN), again using Keras with a TensorFlow backend. 

Again we will use a `sequential` model, starting with a simple model architecture, consisting of four `Conv2D` convolution layers, with our final output layer being a `dense` layer. 

The convolution layers are designed for feature detection. Each one works by sliding a filter window over the input, multiplying the values under the window element-wise with the filter weights, summing the products and storing the result in a feature map. This operation is known as a convolution. 
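
The sliding-window operation can be sketched in plain NumPy. This is only an illustration of the idea, not how Keras implements `Conv2D`; the function name `conv2d_valid` and the example kernel are made up for the sketch. As is usual in deep learning libraries, the kernel is applied without flipping.

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Slide `kernel` over `image` (no padding, stride 1) and return the feature map."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    feature_map = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            window = image[i:i + kh, j:j + kw]
            # Element-wise product of window and kernel, summed to one value
            feature_map[i, j] = np.sum(window * kernel)
    return feature_map

image = np.arange(12, dtype=float).reshape(3, 4)
kernel = np.array([[1.0, 0.0],
                   [0.0, -1.0]])   # top-left value minus bottom-right value
feature_map = conv2d_valid(image, kernel)
print(feature_map)
```

A 3x4 input and a 2x2 kernel give a 2x3 feature map, one value per window position.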


The `filters` parameter specifies the number of filters (output feature maps) in each layer. The layers increase in size from 16 to 32, 64 and 128 filters, while the `kernel_size` parameter specifies the size of the kernel window, which in this case is 2, resulting in a 2x2 filter matrix. 

The first layer will receive the input shape of (40, 174, 1), where 40 is the number of MFCCs, 174 is the number of frames taking padding into account, and 1 is the number of channels, as the audio is mono. 
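
As a rough check on these numbers, the output shape and parameter count of each convolution layer can be worked out by hand. The sketch below assumes stride 1 with no padding for the convolutions and non-overlapping 2x2 pooling between them; the helper names are hypothetical.

```python
def conv_output(size, kernel=2):
    # An unpadded, stride-1 convolution shrinks each dimension by kernel - 1
    return size - kernel + 1

def pool_output(size, pool=2):
    # Non-overlapping pooling floors the division
    return size // pool

def conv_params(in_channels, filters, kernel=2):
    # One kernel x kernel x in_channels weight block plus one bias per filter
    return (kernel * kernel * in_channels + 1) * filters

rows, cols, channels = 40, 174, 1
shapes, params = [], []
for filters in (16, 32, 64, 128):
    rows, cols = conv_output(rows), conv_output(cols)
    params.append(conv_params(channels, filters))
    shapes.append((rows, cols, filters))
    rows, cols = pool_output(rows), pool_output(cols)
    channels = filters

print(shapes)   # convolution output shapes, layer by layer
print(params)   # trainable parameters per convolution layer
```

The first convolution comes out at (39, 173, 16) with 80 parameters.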

The activation function we will be using for our convolutional layers is `ReLU` which is the same as our previous model. We will use a smaller `Dropout` value of 20% on our convolutional layers. 

Each convolutional layer has an associated pooling layer of `MaxPooling2D` type, with the final convolutional layer also followed by a `GlobalAveragePooling2D` layer. The pooling layers reduce the dimensionality of the model (by reducing the parameters and subsequent computation requirements), which serves to shorten the training time and reduce overfitting. Max pooling takes the maximum value in each window, while global average pooling takes the average of each feature map, which is suitable for feeding into our `dense` output layer.  
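
Both pooling types can be illustrated on a small array. This is a NumPy sketch of the idea only; `max_pool_2x2` and `global_average_pool` are hypothetical helper names, not Keras functions.

```python
import numpy as np

def max_pool_2x2(feature_map):
    """Keep the largest value in each non-overlapping 2x2 window."""
    h, w = feature_map.shape
    h, w = h - h % 2, w - w % 2          # drop a trailing odd row or column
    blocks = feature_map[:h, :w].reshape(h // 2, 2, w // 2, 2)
    return blocks.max(axis=(1, 3))

def global_average_pool(feature_maps):
    """Average each channel of a (rows, cols, channels) array to one number."""
    return feature_maps.mean(axis=(0, 1))

fm = np.array([[1., 2., 5., 6.],
               [3., 4., 7., 8.],
               [9., 10., 13., 14.],
               [11., 12., 15., 16.]])
pooled = max_pool_2x2(fm)                 # 4x4 -> 2x2

stack = np.stack([fm, 2 * fm], axis=-1)   # two channels, shape (4, 4, 2)
channel_means = global_average_pool(stack)
print(pooled)
print(channel_means)
```

Max pooling halves each spatial dimension, while global average pooling collapses every feature map to a single value, giving one number per filter for the output layer.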

Our output layer will have 10 nodes (num_labels), which matches the number of possible classifications. The activation for our output layer is `softmax`. Softmax makes the outputs sum to 1 so they can be interpreted as probabilities. The model will then make its prediction based on which option has the highest probability.
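
For illustration, softmax and the final "pick the most probable class" step can be written in a few lines of NumPy. The input values below are arbitrary example scores, not outputs of this model.

```python
import numpy as np

def softmax(logits):
    shifted = logits - np.max(logits)   # subtract the max for numerical stability
    exps = np.exp(shifted)
    return exps / exps.sum()

logits = np.array([2.0, 1.0, 0.1])      # arbitrary raw scores for three classes
probs = softmax(logits)
predicted_index = int(np.argmax(probs)) # index of the most probable class
print(probs, predicted_index)
```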

In [5]:
import numpy as np
from keras.models import Sequential
from keras.layers import Dense, Dropout, Activation, Flatten
from keras.layers import Convolution2D, Conv2D, MaxPooling2D, GlobalAveragePooling2D
from keras.optimizers import Adam
from keras.utils import np_utils
from sklearn import metrics 

num_rows = 40
num_columns = 174
num_channels = 1

x_train = x_train.reshape(x_train.shape[0], num_rows, num_columns, num_channels)
x_test = x_test.reshape(x_test.shape[0], num_rows, num_columns, num_channels)

num_labels = yy.shape[1]
filter_size = 2

# Construct model 
model = Sequential()
model.add(Conv2D(filters=16, kernel_size=2, input_shape=(num_rows, num_columns, num_channels), activation='relu'))
model.add(MaxPooling2D(pool_size=2))
model.add(Dropout(0.2))

model.add(Conv2D(filters=32, kernel_size=2, activation='relu'))
model.add(MaxPooling2D(pool_size=2))
model.add(Dropout(0.2))

model.add(Conv2D(filters=64, kernel_size=2, activation='relu'))
model.add(MaxPooling2D(pool_size=2))
model.add(Dropout(0.2))

model.add(Conv2D(filters=128, kernel_size=2, activation='relu'))
model.add(MaxPooling2D(pool_size=2))
model.add(Dropout(0.2))
model.add(GlobalAveragePooling2D())

model.add(Dense(num_labels, activation='softmax')) 








### Compiling the model 

For compiling our model, we will use the same three parameters as the previous model (loss function, metric and optimizer): 

In [6]:
# Compile the model
model.compile(loss='categorical_crossentropy', metrics=['accuracy'], optimizer='adam') 
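
To make the loss and metric concrete, here is a hedged NumPy sketch of categorical cross-entropy and accuracy for one-hot labels. It mirrors the definitions rather than Keras internals; the clipping constant `eps` and the example values are assumptions for the sketch.

```python
import numpy as np

def categorical_crossentropy(y_true, y_pred, eps=1e-7):
    # Clip to avoid log(0); eps is an assumed small constant
    y_pred = np.clip(y_pred, eps, 1.0 - eps)
    per_sample = -np.sum(y_true * np.log(y_pred), axis=-1)
    return float(per_sample.mean())

def accuracy(y_true, y_pred):
    # Fraction of samples whose most probable class matches the label
    return float(np.mean(np.argmax(y_true, axis=-1) == np.argmax(y_pred, axis=-1)))

y_true = np.array([[1.0, 0.0, 0.0],
                   [0.0, 1.0, 0.0]])
y_pred = np.array([[0.9, 0.05, 0.05],
                   [0.2, 0.7, 0.1]])

loss = categorical_crossentropy(y_true, y_pred)
acc = accuracy(y_true, y_pred)
print(loss, acc)
```

Only the probability assigned to the true class contributes to the loss, so confident correct predictions drive it towards zero.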





In [7]:
# Display model architecture summary 
model.summary()

# Calculate pre-training accuracy 
score = model.evaluate(x_test, y_test, verbose=1)
accuracy = 100*score[1]

print("Pre-training accuracy: %.4f%%" % accuracy)

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
conv2d_1 (Conv2D)            (None, 39, 173, 16)       80        
_________________________________________________________________
max_pooling2d_1 (MaxPooling2 (None, 19, 86, 16)        0         
_________________________________________________________________
dropout_1 (Dropout)          (None, 19, 86, 16)        0         
_________________________________________________________________
conv2d_2 (Conv2D)            (None, 18, 85, 32)        2080      
_________________________________________________________________
max_pooling2d_2 (MaxPooling2 (None, 9, 42, 32)         0         
_________________________________________________________________
dropout_2 (Dropout)          (None, 9, 42, 32)         0         
_________________________________________________________________
conv2d_3 (Conv2D)            (None, 8, 41, 64)         8256      
__________

### Training 

Here we will train the model. As training a CNN can take a significant amount of time, we will start with a low number of epochs and a low batch size. If we can see from the output that the model is converging, we will increase both numbers.  

In [8]:
from keras.callbacks import ModelCheckpoint 
from datetime import datetime 

#num_epochs = 12
#num_batch_size = 128

num_epochs = 72
num_batch_size = 256

checkpointer = ModelCheckpoint(filepath='saved_models/weights.best.basic_cnn.hdf5', 
                               verbose=1, save_best_only=True)
start = datetime.now()

model.fit(x_train, y_train, batch_size=num_batch_size, epochs=num_epochs, validation_data=(x_test, y_test), callbacks=[checkpointer], verbose=1)


duration = datetime.now() - start
print("Training completed in time: ", duration)

Train on 6985 samples, validate on 1747 samples
Epoch 1/72

Epoch 00001: val_loss improved from inf to 2.05335, saving model to saved_models/weights.best.basic_cnn.hdf5
Epoch 2/72

Epoch 00002: val_loss improved from 2.05335 to 1.90292, saving model to saved_models/weights.best.basic_cnn.hdf5
Epoch 3/72

Epoch 00003: val_loss improved from 1.90292 to 1.66505, saving model to saved_models/weights.best.basic_cnn.hdf5
Epoch 4/72

Epoch 00004: val_loss improved from 1.66505 to 1.52032, saving model to saved_models/weights.best.basic_cnn.hdf5
Epoch 5/72

Epoch 00005: val_loss improved from 1.52032 to 1.41978, saving model to saved_models/weights.best.basic_cnn.hdf5
Epoch 6/72

Epoch 00006: val_loss improved from 1.41978 to 1.34471, saving model to saved_models/weights.best.basic_cnn.hdf5
Epoch 7/72

Epoch 00007: val_loss improved from 1.34471 to 1.29310, saving model to saved_models/weights.best.ba


Epoch 00034: val_loss did not improve from 0.55446
Epoch 35/72

Epoch 00035: val_loss did not improve from 0.55446
Epoch 36/72

Epoch 00036: val_loss improved from 0.55446 to 0.51228, saving model to saved_models/weights.best.basic_cnn.hdf5
Epoch 37/72

Epoch 00037: val_loss improved from 0.51228 to 0.49313, saving model to saved_models/weights.best.basic_cnn.hdf5
Epoch 38/72

Epoch 00038: val_loss did not improve from 0.49313
Epoch 39/72

Epoch 00039: val_loss did not improve from 0.49313
Epoch 40/72

Epoch 00040: val_loss improved from 0.49313 to 0.45501, saving model to saved_models/weights.best.basic_cnn.hdf5
Epoch 41/72

Epoch 00041: val_loss did not improve from 0.45501
Epoch 42/72

Epoch 00042: val_loss did not improve from 0.45501
Epoch 43/72

Epoch 00043: val_loss did not improve from 0.45501
Epoch 44/72

Epoch 00044: val_loss improved from 0.45501 to 0.45002, saving model to saved_models/weights.best.basic_cnn.hdf5
Epoch 45/72

Epoch 00045: val_loss did not improve from 0.45

### Test the model 

Here we will review the accuracy of the model on both the training and test data sets. 

In [9]:
# Evaluating the model on the training and testing set
score = model.evaluate(x_train, y_train, verbose=0)
print("Training Accuracy: ", score[1])

score = model.evaluate(x_test, y_test, verbose=0)
print("Testing Accuracy: ", score[1])

Training Accuracy:  0.9404438081603436
Testing Accuracy:  0.8855180307395403


The Training and Testing accuracy scores are both high and an improvement on our initial model. Training accuracy has increased by about 1.7 percentage points (92.3% to 94.0%) and Testing accuracy by about 1.5 percentage points (87% to 88.55%). 

There is a marginal increase in the difference between the Training and Test scores (~5.5 percentage points compared to ~5.3 previously), though the difference remains low, so the CNN does not appear to overfit noticeably more than the initial model. 
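
The comparison can be checked directly from the figures reported in this notebook. The sketch below uses the initial model's scores quoted earlier and the CNN scores from the evaluation output, rounded to two decimals.

```python
initial = {"train": 92.3, "test": 87.0}    # initial model, as quoted above (%)
cnn = {"train": 94.04, "test": 88.55}      # CNN, rounded from the evaluation output (%)

# Improvement of the CNN over the initial model, in percentage points
gain = {k: round(cnn[k] - initial[k], 2) for k in initial}

# Train/test gap for each model, in percentage points
gap_initial = round(initial["train"] - initial["test"], 2)
gap_cnn = round(cnn["train"] - cnn["test"], 2)

print(gain, gap_initial, gap_cnn)
```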

### Predictions  

Here we will modify our previous method for testing the model's predictions on a specified audio .wav file. 

In [10]:
def print_prediction(file_name):
    prediction_feature = extract_features(file_name) 
    prediction_feature = prediction_feature.reshape(1, num_rows, num_columns, num_channels)

    predicted_vector = model.predict_classes(prediction_feature)
    predicted_class = le.inverse_transform(predicted_vector) 
    print("The predicted class is:", predicted_class[0], '\n') 

    predicted_proba_vector = model.predict_proba(prediction_feature) 
    predicted_proba = predicted_proba_vector[0]
    for i in range(len(predicted_proba)): 
        category = le.inverse_transform(np.array([i]))
        print(category[0], "\t\t : ", format(predicted_proba[i], '.32f') )
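
Note that `predict_classes` and `predict_proba` are not available on `Sequential` models in more recent Keras/TensorFlow releases. A minimal sketch of an equivalent, assuming only that the model exposes `predict` and returns one row of class probabilities per input, is shown below. `predict_label` and `DummyModel` are hypothetical names used for illustration; with the notebook's objects one would pass `model` and `le.classes_` instead.

```python
import numpy as np

def predict_label(model, features, class_names):
    """Return (label, probabilities) using only `model.predict`."""
    probabilities = model.predict(features)[0]       # shape: (num_classes,)
    predicted_index = int(np.argmax(probabilities))  # most probable class
    return class_names[predicted_index], probabilities

class DummyModel:
    """Hypothetical stand-in for a trained Keras model, for illustration only."""
    def predict(self, batch):
        return np.array([[0.1, 0.7, 0.2]])

label, probs = predict_label(DummyModel(),
                             np.zeros((1, 40, 174, 1)),
                             ["air_conditioner", "car_horn", "drilling"])
print(label)
```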

### Validation 

#### Test with sample data 

As before, we will verify the predictions using a subset of the sample audio files we explored in the first notebook. We expect the bulk of these to be classified correctly. 

In [11]:
# Class: Air Conditioner

filename = '../UrbanSound8K/audio/fold5/100852-0-0-0.wav' 
print_prediction(filename) 

The predicted class is: air_conditioner 

air_conditioner 		 :  0.99540388584136962890625000000000
car_horn 		 :  0.00000021822681617322814418002963
children_playing 		 :  0.00005133304875926114618778228760
dog_bark 		 :  0.00000729690236767055466771125793
drilling 		 :  0.00021975589334033429622650146484
engine_idling 		 :  0.00005994965977151878178119659424
gun_shot 		 :  0.00000166327004080812912434339523
jackhammer 		 :  0.00421318178996443748474121093750
siren 		 :  0.00000097702354651119094341993332
street_music 		 :  0.00004181402618996798992156982422


In [12]:
# Class: Drilling

filename = '../UrbanSound8K/audio/fold3/103199-4-0-0.wav'
print_prediction(filename) 

The predicted class is: drilling 

air_conditioner 		 :  0.00003185506284353323280811309814
car_horn 		 :  0.00023022975074127316474914550781
children_playing 		 :  0.00000072985699262062553316354752
dog_bark 		 :  0.00000023159915940595965366810560
drilling 		 :  0.99956482648849487304687500000000
engine_idling 		 :  0.00000000507442399211299743910786
gun_shot 		 :  0.00000001821260120493661815999076
jackhammer 		 :  0.00003626054240157827734947204590
siren 		 :  0.00000005309399497832600900437683
street_music 		 :  0.00013576135097537189722061157227


In [13]:
# Class: Street music 

filename = '../UrbanSound8K/audio/fold7/101848-9-0-0.wav'
print_prediction(filename) 

The predicted class is: street_music 

air_conditioner 		 :  0.00599978538230061531066894531250
car_horn 		 :  0.00197564391419291496276855468750
children_playing 		 :  0.02159359492361545562744140625000
dog_bark 		 :  0.00240595825016498565673828125000
drilling 		 :  0.00002193588989030104130506515503
engine_idling 		 :  0.00003434154496062546968460083008
gun_shot 		 :  0.00000000047840309491675725439563
jackhammer 		 :  0.00000520726007380289956927299500
siren 		 :  0.00224740942940115928649902343750
street_music 		 :  0.96571612358093261718750000000000


In [14]:
# Class: Car Horn 

filename = '../UrbanSound8K/audio/fold10/100648-1-0-0.wav'
print_prediction(filename) 

The predicted class is: car_horn 

air_conditioner 		 :  0.00215475377626717090606689453125
car_horn 		 :  0.24975246191024780273437500000000
children_playing 		 :  0.00224868883378803730010986328125
dog_bark 		 :  0.17384605109691619873046875000000
drilling 		 :  0.21584358811378479003906250000000
engine_idling 		 :  0.00714564463123679161071777343750
gun_shot 		 :  0.20734564960002899169921875000000
jackhammer 		 :  0.12255356460809707641601562500000
siren 		 :  0.01782992295920848846435546875000
street_music 		 :  0.00127960229292511940002441406250


#### Observations 

We can see that the model performs well on these samples: all four are assigned the expected class, three of them with over 96% confidence. 

Interestingly, car horn, which was misclassified previously, is now predicted correctly but only narrowly. The per-class confidence shows it was a close decision between car horn at ~25% confidence, drilling at ~22% and gun shot at ~21%.  

### Other audio

Again we will further validate our model using a sample of various copyright-free sounds that were not part of either our test or training data. 

In [15]:
filename = 'Evaluation audio/dog_bark_1.wav'
print_prediction(filename) 

The predicted class is: dog_bark 

air_conditioner 		 :  0.00028464887873269617557525634766
car_horn 		 :  0.01682218536734580993652343750000
children_playing 		 :  0.00104110501706600189208984375000
dog_bark 		 :  0.93084555864334106445312500000000
drilling 		 :  0.03543039783835411071777343750000
engine_idling 		 :  0.00073492329102009534835815429688
gun_shot 		 :  0.00931918993592262268066406250000
jackhammer 		 :  0.00044781234464608132839202880859
siren 		 :  0.00377836846746504306793212890625
street_music 		 :  0.00129591778386384248733520507812


In [16]:
filename = 'Evaluation audio/drilling_1.wav'

print_prediction(filename) 

The predicted class is: jackhammer 

air_conditioner 		 :  0.03760356828570365905761718750000
car_horn 		 :  0.00001720797990856226533651351929
children_playing 		 :  0.00004938651181873865425586700439
dog_bark 		 :  0.00022485245426651090383529663086
drilling 		 :  0.01510195713490247726440429687500
engine_idling 		 :  0.00094820023514330387115478515625
gun_shot 		 :  0.00000354913413502799812704324722
jackhammer 		 :  0.94593375921249389648437500000000
siren 		 :  0.00010415645374450832605361938477
street_music 		 :  0.00001337047797278501093387603760


In [17]:
filename = 'Evaluation audio/gun_shot_1.wav'

print_prediction(filename) 

The predicted class is: gun_shot 

air_conditioner 		 :  0.00010857681627385318279266357422
car_horn 		 :  0.00008629232615930959582328796387
children_playing 		 :  0.01643224433064460754394531250000
dog_bark 		 :  0.17698208987712860107421875000000
drilling 		 :  0.01865081302821636199951171875000
engine_idling 		 :  0.00137622072361409664154052734375
gun_shot 		 :  0.78479003906250000000000000000000
jackhammer 		 :  0.00000245173305302159860730171204
siren 		 :  0.00024371214385610073804855346680
street_music 		 :  0.00132761534769088029861450195312


#### Observations 

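The performance of our final model is good and it appears to generalise reasonably well to new audio data. Of the three external samples, dog bark (~93%) and gun shot (~78%) are classified correctly, while the drilling sample is confidently misclassified as jackhammer (~95%). 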