# Classifying Urban sounds using Deep Learning

## 3 Model Training and Evaluation 

### Load Preprocessed data 

In [1]:
# retrieve the preprocessed data from previous notebook

%store -r x_train 
%store -r x_test 
%store -r y_train 
%store -r y_test 
%store -r yy 
%store -r le

### Initial model architecture - MLP

We will start by constructing a Multilayer Perceptron (MLP) neural network using Keras with a TensorFlow backend.

We use a `Sequential` model, which lets us build the model layer by layer.

We will begin with a simple architecture consisting of three layers: an input layer, a hidden layer and an output layer. All three are `Dense` (fully connected) layers, a standard layer type used in many neural networks.

The first layer receives the input shape. Each sample contains 40 MFCCs (columns), so a sample has shape (1x40) and the input shape is 40.

The first two layers will each have 256 nodes and use the `ReLU` (Rectified Linear Unit) activation function. ReLU is a widely used default for hidden layers: it is cheap to compute and does not saturate for positive inputs, which tends to help gradients flow during training.

We will also apply `Dropout` with a rate of 50% after each of the first two layers. During training this randomly drops (zeroes) half of the layer's outputs on each update, which discourages the network from relying on any single node. This tends to improve generalisation and makes the network less likely to overfit the training data.
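A minimal NumPy sketch of what a 50% dropout layer does to one batch of activations during training. It uses the "inverted dropout" formulation, where surviving activations are scaled by 1/(1 - rate) so that nothing needs rescaling at inference time; as far as we know this matches how Keras' `Dropout` behaves, but the function `inverted_dropout` is our own illustration, not a Keras API.

```python
import numpy as np

def inverted_dropout(activations, rate, rng, training=True):
    """Zero a random fraction `rate` of activations and rescale the survivors.

    At inference time (training=False) the input passes through unchanged.
    """
    if not training or rate == 0.0:
        return activations
    keep_prob = 1.0 - rate
    # Boolean mask: True where the unit is kept for this update
    mask = rng.random(activations.shape) < keep_prob
    # Rescale kept units so the expected activation matches inference time
    return np.where(mask, activations / keep_prob, 0.0)

rng = np.random.default_rng(0)
h = np.ones((1, 256))                     # one sample's 256 hidden activations
dropped = inverted_dropout(h, rate=0.5, rng=rng)
print("fraction zeroed:", np.mean(dropped == 0.0))  # roughly 0.5
print("surviving value:", dropped[dropped > 0][0])  # 2.0
```

Because a different mask is drawn on every update, each training step effectively trains a different thinned sub-network.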

Our output layer will have 10 nodes (`num_labels`), matching the number of possible classes. The activation for our output layer is `softmax`, which makes the outputs sum to 1 so they can be interpreted as probabilities. The model then predicts the class with the highest probability.
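To make the shapes concrete, here is a NumPy sketch of one forward pass through an MLP with the same layer sizes (40 → 256 → 256 → 10), ending in softmax and argmax. The weights are random stand-ins rather than trained values, and dropout is omitted because it is only active during training; this only illustrates the data flow, not the Keras implementation.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def softmax(z):
    # Subtract the row max for numerical stability; this does not change the result
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(42)
# Random weights standing in for trained ones (hypothetical, for shapes only)
W1, b1 = rng.normal(0, 0.1, (40, 256)), np.zeros(256)
W2, b2 = rng.normal(0, 0.1, (256, 256)), np.zeros(256)
W3, b3 = rng.normal(0, 0.1, (256, 10)), np.zeros(10)

x = rng.normal(size=(1, 40))          # one sample of 40 MFCC values
h1 = relu(x @ W1 + b1)                # (1, 256)
h2 = relu(h1 @ W2 + b2)               # (1, 256)
probs = softmax(h2 @ W3 + b3)         # (1, 10)

print(probs.shape)                    # (1, 10)
print(round(float(probs.sum()), 6))   # 1.0
print("predicted index:", int(np.argmax(probs, axis=1)[0]))
```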

In [2]:
import numpy as np
from keras.models import Sequential
from keras.layers import Dense, Dropout, Activation, Flatten
from keras.layers import Convolution2D, MaxPooling2D
from keras.optimizers import Adam
from keras.utils import np_utils
from sklearn import metrics 
import tensorflow as tf

num_labels = yy.shape[1]
filter_size = 2

# Construct model 
model = Sequential()

model.add(Dense(256, input_shape=(40,)))
model.add(Activation('relu'))
model.add(Dropout(0.5))

model.add(Dense(256))
model.add(Activation('relu'))
model.add(Dropout(0.5))

model.add(Dense(num_labels))
model.add(Activation('softmax'))



### Compiling the model 

For compiling our model, we will use the following three parameters: 

* Loss function - we will use `categorical_crossentropy`, the standard choice for multi-class classification with one-hot encoded labels. A lower score indicates that the model is performing better.

* Metrics - we will use the `accuracy` metric, which lets us view the accuracy score on the validation data while the model trains.

* Optimizer - here we will use `adam`, a widely used adaptive optimizer that works well with its default settings in many cases.
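A small NumPy sketch of what categorical cross-entropy measures for a single one-hot label: it is the negative log of the probability the model assigns to the correct class, so a confident correct prediction gives a low loss. The function below is our own illustration; Keras' implementation may differ in details such as the clipping constant.

```python
import numpy as np

def categorical_crossentropy(y_true, y_pred, eps=1e-7):
    """Mean categorical cross-entropy over a batch.

    y_true: one-hot labels, shape (batch, classes)
    y_pred: predicted probabilities, shape (batch, classes)
    """
    y_pred = np.clip(y_pred, eps, 1.0 - eps)   # avoid log(0)
    return float(np.mean(-np.sum(y_true * np.log(y_pred), axis=1)))

y_true = np.array([[0, 1, 0]])                 # correct class is index 1
confident = np.array([[0.05, 0.90, 0.05]])
unsure = np.array([[0.30, 0.40, 0.30]])
print(categorical_crossentropy(y_true, confident))  # about 0.105, i.e. -ln(0.9)
print(categorical_crossentropy(y_true, unsure))     # about 0.916, i.e. -ln(0.4)
```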


In [3]:
# Compile the model
model.compile(loss='categorical_crossentropy', metrics=['accuracy'], optimizer='adam') 





In [4]:
# Display model architecture summary 
model.summary()

# Calculate pre-training accuracy 
score = model.evaluate(x_test, y_test, verbose=0)
accuracy = 100*score[1]

print("Pre-training accuracy: %.4f%%" % accuracy)

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
dense_1 (Dense)              (None, 256)               10496     
_________________________________________________________________
activation_1 (Activation)    (None, 256)               0         
_________________________________________________________________
dropout_1 (Dropout)          (None, 256)               0         
_________________________________________________________________
dense_2 (Dense)              (None, 256)               65792     
_________________________________________________________________
activation_2 (Activation)    (None, 256)               0         
_________________________________________________________________
dropout_2 (Dropout)          (None, 256)               0         
_________________________________________________________________
dense_3 (Dense)              (None, 10)                2570      
__________
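The `Param #` column can be checked by hand: a `Dense` layer with `n_in` inputs and `n_out` units has `n_in * n_out` weights plus `n_out` biases, while `Activation` and `Dropout` layers have no trainable parameters. The total printed here is computed from the three dense layers, since the summary output above is cut off before its totals line.

```python
def dense_params(n_in, n_out):
    # One weight per input-output pair plus one bias per output unit
    return n_in * n_out + n_out

layers = [(40, 256), (256, 256), (256, 10)]   # (inputs, units) for dense_1..dense_3
counts = [dense_params(i, o) for i, o in layers]
print(counts)        # [10496, 65792, 2570]
print(sum(counts))   # 78858
```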

### Training 

Here we will train the model. 

We will start with 100 epochs; an epoch is one complete pass through the training data. Validation loss usually falls quickly over the first few epochs, then improves more slowly and less often until it plateaus, as the log below shows.

We will also start with a small batch size (32), since very large batch sizes have been observed to reduce a model's ability to generalise.
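A quick worked example of what these settings mean in terms of weight updates, using the 6985 training samples reported in the training log. With a batch size of 32 the last batch of each epoch is smaller, so the number of updates per epoch is rounded up.

```python
import math

n_train = 6985        # from the training log: "Train on 6985 samples"
batch_size = 32
epochs = 100

steps_per_epoch = math.ceil(n_train / batch_size)  # last batch is partial
print(steps_per_epoch)            # 219
print(steps_per_epoch * epochs)   # 21900 weight updates in total
print(n_train % batch_size)       # 9 samples in the final, smaller batch
```

A smaller batch size means more (noisier) updates per epoch; a larger one means fewer, smoother updates.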

In [5]:
from keras.callbacks import ModelCheckpoint 
from datetime import datetime 

num_epochs = 100
num_batch_size = 32

checkpointer = ModelCheckpoint(filepath='saved_models/weights.best.basic_mlp.hdf5', 
                               verbose=1, save_best_only=True)
start = datetime.now()

model.fit(x_train, y_train, batch_size=num_batch_size, epochs=num_epochs, validation_data=(x_test, y_test), callbacks=[checkpointer], verbose=1)


duration = datetime.now() - start
print("Training completed in time: ", duration)

Train on 6985 samples, validate on 1747 samples
Epoch 1/100

Epoch 00001: val_loss improved from inf to 5.85880, saving model to saved_models/weights.best.basic_mlp.hdf5
Epoch 2/100

Epoch 00002: val_loss improved from 5.85880 to 1.90348, saving model to saved_models/weights.best.basic_mlp.hdf5
Epoch 3/100

Epoch 00003: val_loss improved from 1.90348 to 1.66820, saving model to saved_models/weights.best.basic_mlp.hdf5
Epoch 4/100

Epoch 00004: val_loss improved from 1.66820 to 1.41369, saving model to saved_models/weights.best.basic_mlp.hdf5
Epoch 5/100

Epoch 00005: val_loss improved from 1.41369 to 1.28841, saving model to saved_models/weights.best.basic_mlp.hdf5
Epoch 6/100

Epoch 00006: val_loss improved from 1.28841 to 1.18815, saving model to saved_models/weights.best.basic_mlp.hdf5
Epoch 7/100

Epoch 00007: val_loss improved from 1.18815 to 1.12916, saving model to saved_models/weights.


Epoch 00033: val_loss improved from 0.57710 to 0.55368, saving model to saved_models/weights.best.basic_mlp.hdf5
Epoch 34/100

Epoch 00034: val_loss did not improve from 0.55368
Epoch 35/100

Epoch 00035: val_loss did not improve from 0.55368
Epoch 36/100

Epoch 00036: val_loss improved from 0.55368 to 0.54467, saving model to saved_models/weights.best.basic_mlp.hdf5
Epoch 37/100

Epoch 00037: val_loss did not improve from 0.54467
Epoch 38/100

Epoch 00038: val_loss did not improve from 0.54467
Epoch 39/100

Epoch 00039: val_loss did not improve from 0.54467
Epoch 40/100

Epoch 00040: val_loss did not improve from 0.54467
Epoch 41/100

Epoch 00041: val_loss improved from 0.54467 to 0.52521, saving model to saved_models/weights.best.basic_mlp.hdf5
Epoch 42/100

Epoch 00042: val_loss did not improve from 0.52521
Epoch 43/100

Epoch 00043: val_loss did not improve from 0.52521
Epoch 44/100

Epoch 00044: val_loss improved from 0.52521 to 0.52431, saving model to saved_models/weights.best.


Epoch 00071: val_loss did not improve from 0.46122
Epoch 72/100

Epoch 00072: val_loss did not improve from 0.46122
Epoch 73/100

Epoch 00073: val_loss did not improve from 0.46122
Epoch 74/100

Epoch 00074: val_loss did not improve from 0.46122
Epoch 75/100

Epoch 00075: val_loss did not improve from 0.46122
Epoch 76/100

Epoch 00076: val_loss improved from 0.46122 to 0.45784, saving model to saved_models/weights.best.basic_mlp.hdf5
Epoch 77/100

Epoch 00077: val_loss did not improve from 0.45784
Epoch 78/100

Epoch 00078: val_loss did not improve from 0.45784
Epoch 79/100

Epoch 00079: val_loss improved from 0.45784 to 0.44659, saving model to saved_models/weights.best.basic_mlp.hdf5
Epoch 80/100

Epoch 00080: val_loss did not improve from 0.44659
Epoch 81/100

Epoch 00081: val_loss did not improve from 0.44659
Epoch 82/100

Epoch 00082: val_loss did not improve from 0.44659
Epoch 83/100

Epoch 00083: val_loss did not improve from 0.44659
Epoch 84/100

Epoch 00084: val_loss did not 
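The checkpoint messages in the log follow a simple rule: `ModelCheckpoint` with `save_best_only=True` monitors `val_loss` by default and writes the weights only when it is lower than the best value seen so far. The helper below is a hypothetical pure-Python sketch of that rule; the first four losses are taken from the log, and the last two are invented to show a non-improving epoch.

```python
def checkpoint_epochs(val_losses):
    """Return the 1-based epochs at which a save-best-only checkpoint would be written.

    A checkpoint is saved whenever val_loss is strictly lower than the best
    value seen so far (the best value starts at infinity).
    """
    best = float("inf")
    saved = []
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            best = loss
            saved.append(epoch)
    return saved

log_losses = [5.85880, 1.90348, 1.66820, 1.41369]       # epochs 1-4 from the log
print(checkpoint_epochs(log_losses + [1.5, 1.3]))       # [1, 2, 3, 4, 6]
```

Note that `model.fit` leaves the in-memory model with the weights from the final epoch; the best weights exist only in the checkpoint file and would have to be restored with `model.load_weights(...)` to be used.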

### Test the model 

Here we will review the accuracy of the model on both the training and test data sets. 

In [6]:
# Evaluating the model on the training and testing set
score = model.evaluate(x_train, y_train, verbose=0)
print("Training Accuracy: ", score[1])

score = model.evaluate(x_test, y_test, verbose=0)
print("Testing Accuracy: ", score[1])

Training Accuracy:  0.9191123836878461
Testing Accuracy:  0.8672009156851613


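The initial training and testing accuracy scores are quite high. The gap between them is small (about 5 percentage points: 91.9% vs 86.7%), which suggests that the model has not overfit substantially. Note, however, that the test set was also used as the validation set during training (`validation_data=(x_test, y_test)`), so it is not a fully independent hold-out and the test score may be somewhat optimistic.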

### Predictions  

Here we will build a function that lets us test the model's predictions on a specified .wav audio file. 

In [7]:
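import numpy as np
import librosa 


def extract_feature(file_name):
    """Return the time-averaged 40 MFCCs of an audio file as a (1, 40) array, or None on failure."""
    try:
        audio_data, sample_rate = librosa.load(file_name, res_type='kaiser_fast') 
        mfccs = librosa.feature.mfcc(y=audio_data, sr=sample_rate, n_mfcc=40)
        # Average each coefficient over time: (40, frames) -> (40,)
        mfccsscaled = np.mean(mfccs.T, axis=0)
        
    except Exception as e:
        print("Error encountered while parsing file: ", file_name, e)
        return None

    # Wrap in a batch dimension so the model receives shape (1, 40)
    return np.array([mfccsscaled])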
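The averaging step in `extract_feature` is what turns a variable-length clip into a fixed 40-value input. `librosa.feature.mfcc` returns an array of shape `(n_mfcc, frames)`; the sketch below uses a random stand-in of that shape (the frame count 173 is arbitrary) to show the resulting shapes and that averaging the transpose over axis 0 equals averaging the original over axis 1.

```python
import numpy as np

# Stand-in for librosa.feature.mfcc output: 40 coefficients x 173 frames
# (the frame count is arbitrary here; it depends on the clip length)
rng = np.random.default_rng(1)
mfccs = rng.normal(size=(40, 173))

# Averaging over time, as extract_feature does
mfccsscaled = np.mean(mfccs.T, axis=0)     # shape (40,)
same = mfccs.mean(axis=1)                  # equivalent, without the transpose

feature = np.array([mfccsscaled])          # shape (1, 40): one row of model input
print(mfccsscaled.shape, feature.shape)    # (40,) (1, 40)
print(np.allclose(mfccsscaled, same))      # True
```

The trade-off is that averaging discards how the sound evolves over time, which may explain some of the confusions between classes with similar overall spectra.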


In [8]:
def print_prediction(file_name):
    prediction_feature = extract_feature(file_name) 
    if prediction_feature is None:
        return

    # Class probabilities for the single sample, shape (num_labels,)
    predicted_proba = model.predict(prediction_feature)[0]

    # The predicted class is the one with the highest probability
    predicted_vector = np.array([np.argmax(predicted_proba)])
    predicted_class = le.inverse_transform(predicted_vector) 
    print("The predicted class is:", predicted_class[0], '\n') 

    for i in range(len(predicted_proba)): 
        category = le.inverse_transform(np.array([i]))
        print(category[0], "\t\t : ", format(predicted_proba[i], '.32f') )
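Decoding a prediction only needs the argmax index and the label encoder's class list. scikit-learn's `LabelEncoder` stores its classes in sorted order (`le.classes_`), so index `i` maps to the `i`-th class name alphabetically, which is the order seen in the outputs below. This sketch mimics that with a plain sorted array; the helper `decode` and the probability vector are hypothetical.

```python
import numpy as np

# The ten UrbanSound8K class names. LabelEncoder keeps its classes sorted,
# so sorting reproduces the index -> label order seen in the outputs below.
class_names = np.array(sorted([
    "dog_bark", "children_playing", "car_horn", "air_conditioner",
    "street_music", "gun_shot", "siren", "engine_idling", "jackhammer",
    "drilling",
]))

def decode(probs, class_names):
    """Return the label with the highest probability (hypothetical helper)."""
    return class_names[int(np.argmax(probs))]

probs = np.array([0.01, 0.02, 0.03, 0.04, 0.70, 0.05, 0.05, 0.04, 0.03, 0.03])
print(class_names[0], class_names[-1])   # air_conditioner street_music
print(decode(probs, class_names))        # drilling
```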

### Validation 

#### Test with sample data 

An initial sanity check to verify the predictions using a subset of the sample audio files we explored in the first notebook. We expect most of these to be classified correctly. 

In [9]:
# Class: Air Conditioner

filename = '../UrbanSound8K/audio/fold5/100852-0-0-0.wav' 
print_prediction(filename) 

The predicted class is: air_conditioner 

air_conditioner 		 :  0.99970954656600952148437500000000
car_horn 		 :  0.00001530246845504734665155410767
children_playing 		 :  0.00004839352914132177829742431641
dog_bark 		 :  0.00001115840950660640373826026917
drilling 		 :  0.00006572431448148563504219055176
engine_idling 		 :  0.00005342079748515971004962921143
gun_shot 		 :  0.00000191707545127428602427244186
jackhammer 		 :  0.00000161822993050009245052933693
siren 		 :  0.00000137717233883449807763099670
street_music 		 :  0.00009145555668510496616363525391


In [10]:
# Class: Drilling

filename = '../UrbanSound8K/audio/fold3/103199-4-0-0.wav'
print_prediction(filename) 

The predicted class is: drilling 

air_conditioner 		 :  0.00000003351729560563398990780115
car_horn 		 :  0.00003639040369307622313499450684
children_playing 		 :  0.00008489481842843815684318542480
dog_bark 		 :  0.00805395096540451049804687500000
drilling 		 :  0.71168625354766845703125000000000
engine_idling 		 :  0.00000004893985661169608647469431
gun_shot 		 :  0.00000127771272673271596431732178
jackhammer 		 :  0.00000018530435852426307974383235
siren 		 :  0.00000043167796093257493339478970
street_music 		 :  0.28013649582862854003906250000000


In [11]:
# Class: Street music 

filename = '../UrbanSound8K/audio/fold7/101848-9-0-0.wav'
print_prediction(filename) 

The predicted class is: street_music 

air_conditioner 		 :  0.00352347572334110736846923828125
car_horn 		 :  0.00111965404357761144638061523438
children_playing 		 :  0.01668413355946540832519531250000
dog_bark 		 :  0.00273463386110961437225341796875
drilling 		 :  0.00432230532169342041015625000000
engine_idling 		 :  0.00023902540851850062608718872070
gun_shot 		 :  0.00012609839905053377151489257812
jackhammer 		 :  0.01408823207020759582519531250000
siren 		 :  0.00038336782017722725868225097656
street_music 		 :  0.95677900314331054687500000000000


In [12]:
# Class: Car Horn 

filename = '../UrbanSound8K/audio/fold10/100648-1-0-0.wav'
print_prediction(filename) 

The predicted class is: dog_bark 

air_conditioner 		 :  0.00029011914739385247230529785156
car_horn 		 :  0.09502947330474853515625000000000
children_playing 		 :  0.02013516426086425781250000000000
dog_bark 		 :  0.43034020066261291503906250000000
drilling 		 :  0.16503509879112243652343750000000
engine_idling 		 :  0.00213304418139159679412841796875
gun_shot 		 :  0.03804671764373779296875000000000
jackhammer 		 :  0.01010870467871427536010742187500
siren 		 :  0.00583092914894223213195800781250
street_music 		 :  0.23305052518844604492187500000000


#### Observations 

From this brief sanity check the model seems to predict well. One error was observed: a car horn was incorrectly classified as a dog bark. 

The per-class probabilities show that this prediction was made with low confidence (43%). This is consistent with our earlier observation that dog barks and car horns have a similar spectral shape. 
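Ranking the probabilities printed above for the car horn clip (`fold10/100648-1-0-0.wav`, values rounded to five decimals) shows how far off the prediction was: the true class, `car_horn`, comes only fourth. This is a small illustrative sketch over those copied values, not part of the original pipeline.

```python
import numpy as np

class_names = ["air_conditioner", "car_horn", "children_playing", "dog_bark",
               "drilling", "engine_idling", "gun_shot", "jackhammer", "siren",
               "street_music"]
# Probabilities printed above for the car horn clip, rounded
probs = np.array([0.00029, 0.09503, 0.02014, 0.43034, 0.16504,
                  0.00213, 0.03805, 0.01011, 0.00583, 0.23305])

order = np.argsort(probs)[::-1]            # class indices, most likely first
top3 = order[:3]
for idx in top3:
    print(class_names[idx], f"{probs[idx]:.3f}")
# dog_bark 0.430
# street_music 0.233
# drilling 0.165

rank_of_true = int(np.flatnonzero(order == class_names.index("car_horn"))[0]) + 1
print("rank of car_horn:", rank_of_true)   # 4
```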

### Other audio

Here we will use a sample of various copyright-free sounds that were not part of either our training or test data to further validate our model. 

In [13]:
filename = 'Evaluation audio/dog_bark_1.wav'
print_prediction(filename) 

The predicted class is: dog_bark 

air_conditioner 		 :  0.00003152462522848509252071380615
car_horn 		 :  0.00074103189399465918540954589844
children_playing 		 :  0.09837145358324050903320312500000
dog_bark 		 :  0.61339396238327026367187500000000
drilling 		 :  0.01142074353992938995361328125000
engine_idling 		 :  0.00030014684307388961315155029297
gun_shot 		 :  0.23123309016227722167968750000000
jackhammer 		 :  0.00000032560546969762071967124939
siren 		 :  0.01272556930780410766601562500000
street_music 		 :  0.03178221359848976135253906250000


In [14]:
filename = 'Evaluation audio/drilling_1.wav'

print_prediction(filename) 

The predicted class is: air_conditioner 

air_conditioner 		 :  0.52631795406341552734375000000000
car_horn 		 :  0.00000396633504351484589278697968
children_playing 		 :  0.02661064639687538146972656250000
dog_bark 		 :  0.00023366810637526214122772216797
drilling 		 :  0.30945852398872375488281250000000
engine_idling 		 :  0.00249849865213036537170410156250
gun_shot 		 :  0.00309279770590364933013916015625
jackhammer 		 :  0.12444933503866195678710937500000
siren 		 :  0.00000264825098383880686014890671
street_music 		 :  0.00733195338398218154907226562500


In [15]:
filename = 'Evaluation audio/gun_shot_1.wav'

print_prediction(filename) 

# sample data weighted towards gun shot - the peak in the dog barking sample is similar in shape to the gun shot sample

The predicted class is: dog_bark 

air_conditioner 		 :  0.06747675687074661254882812500000
car_horn 		 :  0.00438506016507744789123535156250
children_playing 		 :  0.00190366897732019424438476562500
dog_bark 		 :  0.47365766763687133789062500000000
drilling 		 :  0.00409927684813737869262695312500
engine_idling 		 :  0.11580722033977508544921875000000
gun_shot 		 :  0.00371204037219285964965820312500
jackhammer 		 :  0.00006764467252651229500770568848
siren 		 :  0.00395335722714662551879882812500
street_music 		 :  0.32493731379508972167968750000000


In [16]:
filename = 'Evaluation audio/siren_1.wav'

print_prediction(filename) 

The predicted class is: siren 

air_conditioner 		 :  0.00000482561426906613633036613464
car_horn 		 :  0.00048245262587442994117736816406
children_playing 		 :  0.00856084469705820083618164062500
dog_bark 		 :  0.23407638072967529296875000000000
drilling 		 :  0.00001239275934494799003005027771
engine_idling 		 :  0.15051437914371490478515625000000
gun_shot 		 :  0.00215621548704802989959716796875
jackhammer 		 :  0.00000930212991079315543174743652
siren 		 :  0.58780175447463989257812500000000
street_music 		 :  0.01638141646981239318847656250000


#### Observations 

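The performance of our initial model is reasonable but leaves room for improvement on new audio data: it correctly classified the dog bark and siren clips, but misclassified the drilling clip (as air conditioner) and the gun shot (as dog bark). 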

### *In the next notebook we will refine our model*