## URBANSOUND8K DATASET

* This dataset contains 8732 labeled sound excerpts (<=4s) of urban sounds from 10 classes: air_conditioner, car_horn, children_playing, dog_bark, drilling, engine_idling, gun_shot, jackhammer, siren, and street_music.

* The files are pre-sorted into ten folds (folders named fold1-fold10) to help in the reproduction of and comparison with the automatic classification results reported in the article above.

* In addition to the sound excerpts, a CSV file containing metadata about each excerpt is also provided.

* 8732 audio files of urban sounds (see description above) in WAV format. The sampling rate, bit depth, and number of channels are the same as those of the original file uploaded to Freesound (and hence may vary from file to file).

In [None]:
import pandas as pd
import numpy as np

In [None]:
# Let's look at the data
data = pd.read_csv("../input/urbansound8k/UrbanSound8K.csv")
data.shape

In [None]:
data.head()

The metadata contains 8 columns.

1. slice_file_name: name of the audio file
1. fsID: Freesound ID of the recording from which the excerpt is taken
1. start: start time of the slice
1. end: end time of the slice
1. salience: salience rating of the sound. 1 = foreground, 2 = background
1. fold: the fold number (1–10) to which this file has been allocated
1. classID: numeric identifier of the sound class
   * 0 = air_conditioner
   * 1 = car_horn
   * 2 = children_playing
   * 3 = dog_bark
   * 4 = drilling
   * 5 = engine_idling
   * 6 = gun_shot
   * 7 = jackhammer
   * 8 = siren
   * 9 = street_music
1. class: class name
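
To make the role of these columns concrete, here is a minimal sketch of how a row's `fold` and `slice_file_name` could be combined into a file path. The helper name `build_audio_path`, the `audio` root directory, and the example values are hypothetical; the notebook's own `import_data.path_class` helper may work differently.

```python
import os
import pandas as pd

def build_audio_path(row, audio_root):
    """Return the expected path of one excerpt (hypothetical helper)."""
    # Files live in folders named fold1 ... fold10, per the dataset description.
    return os.path.join(audio_root, f"fold{int(row['fold'])}", row["slice_file_name"])

# Toy metadata row with the same column names as UrbanSound8K.csv
# (the values are illustrative, not taken from the dataset).
example = pd.DataFrame({
    "slice_file_name": ["12345-3-0-0.wav"],
    "fold": [5],
    "classID": [3],
    "class": ["dog_bark"],
})
example_path = build_audio_path(example.iloc[0], "audio")
print(example_path)
```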

### The audio data has already been sliced, excerpted, and allocated to 10 folds. Some excerpts are different slices of the same original recording. If one slice of a recording were in the training data and another slice of the same recording were in the test data, the accuracy of the final model could be artificially inflated. The original research takes care of this by allocating slices to folds such that all slices originating from the same Freesound recording go into the same fold.
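
The fold allocation described above can be checked directly from the metadata. The following is a minimal sketch on toy data; the function name `recordings_spanning_folds` is hypothetical, and only the `fsID` and `fold` column names come from the CSV.

```python
import pandas as pd

def recordings_spanning_folds(metadata):
    """Return fsID values whose slices appear in more than one fold."""
    folds_per_recording = metadata.groupby("fsID")["fold"].nunique()
    return folds_per_recording[folds_per_recording > 1].index.tolist()

# Toy metadata (illustrative values): recording 101 is split across folds 1 and 2.
toy = pd.DataFrame({
    "fsID": [100, 100, 101, 101, 102],
    "fold": [1, 1, 1, 2, 3],
})
leaky = recordings_spanning_folds(toy)
print(leaky)
```

If the allocation holds as described, calling `recordings_spanning_folds(data)` on the real metadata should return an empty list.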

In [None]:
# Let's look at the class distribution of each fold
appended = []
for i in range(1,11):
    appended.append(data[data.fold == i]['class'].value_counts())
    
class_distribution = pd.DataFrame(appended)
class_distribution = class_distribution.reset_index()
class_distribution['index'] = ["fold"+str(x) for x in range(1,11)]
class_distribution

In [None]:
# The dataset is not perfectly balanced
data['class'].value_counts(normalize=True)

### We can see that the car_horn and gun_shot classes are under-represented. For now we are not going to do any augmentation.

# We can use the Librosa Python library to extract features

The Librosa library can read audio files and convert them to their amplitude values, one value per audio sample.

Suppose an audio file is 4 s long and has a sampling rate of 22050 Hz. This means 22050 amplitude samples are recorded for each second of audio. Hence a 4 s audio file with a sampling rate of 22050 Hz can be expressed as an array of size 4*22050 = 88200.
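
The arithmetic above can be reproduced with a short sketch. A synthetic tone stands in for a decoded audio file here, which is an assumption made only so the example runs without any audio data.

```python
import numpy as np

sample_rate = 22050   # samples per second (Hz)
duration_s = 4        # clip length in seconds

n_samples = sample_rate * duration_s

# A synthetic 440 Hz tone stands in for a decoded audio file here;
# an audio loader would return a similar 1-D array of amplitudes.
t = np.arange(n_samples) / sample_rate
signal = np.sin(2 * np.pi * 440 * t)

print(n_samples, signal.shape)
```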


## Model 1: With MFCC features

In [None]:
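import subprocess
import sys
# Install progressbar through the running interpreter's pip; pip._internal is not a supported API
subprocess.check_call([sys.executable, "-m", "pip", "install", "progressbar"])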

In [None]:
# Copy the helper script used to locate the audio files into the working directory

from shutil import copyfile
copyfile(src = "../input/urbansound8k-import-data/import_data.py", dst = "../working/import_data.py")

In [None]:
# Importing required Library

import progressbar
import time
import os
import struct
import matplotlib.pyplot as plt
import IPython.display as ipd
import pandas as pd
import numpy as np
import librosa # for sound processing.
import import_data as dc # a local module

In [None]:
dataset = np.zeros(shape = (data.shape[0],2),dtype = object)
dataset.shape

## Extracting Features

In [None]:

bar = progressbar.ProgressBar(maxval=data.shape[0], widgets=[progressbar.Bar('$', '||', '||'), ' ', progressbar.Percentage()])
bar.start()
for i in range(data.shape[0]):
    # Resolve the file path and class ID for this metadata row
    fullpath, class_id = dc.path_class(data, data.slice_file_name[i])
    try:
        # Load the audio (librosa resamples to 22050 Hz mono by default)
        X, sample_rate = librosa.load(fullpath, res_type='kaiser_fast')
        # Average the 40 MFCCs over time to get one 40-dimensional vector per clip
        mfccs = np.mean(librosa.feature.mfcc(y=X, sr=sample_rate, n_mfcc=40).T, axis=0)
    except Exception:
        # The original printed an undefined name `file`; report the path instead
        print("Error encountered while parsing file: ", fullpath)
        mfccs, class_id = None, None
    dataset[i, 0], dataset[i, 1] = mfccs, class_id

    bar.update(i+1)
bar.finish()

In [None]:
np.save("dataset",dataset,allow_pickle=True)

In [None]:
l = np.load("dataset.npy",allow_pickle=True)

In [None]:
l.shape

In [None]:
l[8730,1]

In [None]:
data['class'][8730]

## Creating MFCC based Model

In [None]:
# Data Pre-processing

data_mfcc = pd.DataFrame(np.load("dataset.npy",allow_pickle= True))
data_mfcc.columns = ['feature', 'label']
data_mfcc['fold'] = data['fold']

In [None]:
data_mfcc.head()

In [None]:
data_mfcc[data_mfcc['fold'] != 5]

In [None]:
from sklearn.preprocessing import LabelEncoder
from keras.utils import np_utils
lb = LabelEncoder()

In [None]:
from keras.models import Sequential
from keras.layers import Dense, Dropout, Activation, Flatten
from keras.layers import Convolution2D, MaxPooling2D
from keras.optimizers import Adam
from keras.utils import np_utils
from sklearn import metrics 

In [None]:
y = np.array(data_mfcc.label.tolist())

In [None]:
y.shape

In [None]:
y = np.array(data_mfcc.label.tolist())

filter_size = 3
lb = LabelEncoder()
y = np_utils.to_categorical(lb.fit_transform(y))

num_labels = y.shape[1]

# build model
model = Sequential()
model.add(Dense(512, input_shape=(40,)))
model.add(Activation('relu'))
model.add(Dropout(0.5))
model.add(Dense(256))
model.add(Activation('relu'))
model.add(Dropout(0.5))
model.add(Dense(num_labels))
model.add(Activation('softmax'))
model.compile(loss='categorical_crossentropy', metrics=['accuracy'], optimizer='adam')
model.summary()

In [None]:
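from keras.models import clone_model

predicted = []
actual = []
for i in range(1,11):
    validation_data = data_mfcc[data_mfcc['fold'] == i]
    train_data = data_mfcc[data_mfcc['fold'] != i]
    
    X = np.array(train_data.feature.tolist())
    y = np.array(train_data.label.tolist())
    
    x_val = np.array(validation_data.feature.tolist())
    y_val = np.array(validation_data.label.tolist())
    
    # Fit the encoder on the training labels only and reuse it for validation
    y = np_utils.to_categorical(lb.fit_transform(y), num_classes=num_labels)
    y_val = np_utils.to_categorical(lb.transform(y_val), num_classes=num_labels)
    
    # Start every fold from freshly initialized weights. Reusing one model across
    # folds would validate it on data it was already trained on in earlier folds.
    fold_model = clone_model(model)
    fold_model.compile(loss='categorical_crossentropy', metrics=['accuracy'], optimizer='adam')
    
    fold_model.fit(X, y, batch_size=64, epochs=60, validation_data=(x_val, y_val))
    pred = fold_model.predict(x_val)
    
    predicted.append(pred)
    actual.append(y_val)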

In [None]:
from sklearn.metrics import accuracy_score

In [None]:
acc = []
for i in range(0,10):
    predict_conv = np.argmax(predicted[i],axis=1)
    actual_conv = np.argmax(actual[i],axis=1)
    acc.append(accuracy_score(actual_conv,predict_conv))

In [None]:
print("Accuracy for 10 fold cross validation",np.mean(acc))

## Model 2: With mel spectrogram features

In [None]:
# Extract mel spectrogram features from each audio file
bar = progressbar.ProgressBar(maxval=data.shape[0], widgets=[progressbar.Bar('$', '||', '||'), ' ', progressbar.Percentage()])
bar.start()
for i in range(data.shape[0]):
    # Resolve the file path and class ID for this metadata row
    fullpath, class_id = dc.path_class(data, data.slice_file_name[i])
    try:
        X, sample_rate = librosa.load(fullpath, res_type='kaiser_fast')
        # Average the mel spectrogram (128 mel bands by default) over time
        mel = np.mean(librosa.feature.melspectrogram(y=X, sr=sample_rate).T, axis=0)
    except Exception:
        # The original printed an undefined name `file`; report the path instead
        print("Error encountered while parsing file: ", fullpath)
        mel, class_id = None, None
    dataset[i, 0], dataset[i, 1] = mel, class_id

    bar.update(i+1)
bar.finish()

In [None]:
np.save("dataset_melspectrogram",dataset,allow_pickle=True)

In [None]:
# Data Pre-processing

data_mal = pd.DataFrame(np.load("dataset_melspectrogram.npy",allow_pickle= True))
data_mal.columns = ['feature', 'label']
data_mal['fold'] = data['fold']

In [None]:
data_mal.head()

In [None]:
y = np.array(data_mal.label.tolist())

filter_size = 3
lb = LabelEncoder()
y = np_utils.to_categorical(lb.fit_transform(y))

num_labels = y.shape[1]

# build model
model = Sequential()
model.add(Dense(512, input_shape=(128,)))
model.add(Activation('relu'))
model.add(Dropout(0.5))
model.add(Dense(256))
model.add(Activation('relu'))
model.add(Dropout(0.5))
model.add(Dense(num_labels))
model.add(Activation('softmax'))
model.compile(loss='categorical_crossentropy', metrics=['accuracy'], optimizer='adam')
model.summary()

In [None]:
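from keras.models import clone_model

predicted = []
actual = []
for i in range(1,11):
    validation_data = data_mal[data_mal['fold'] == i]
    train_data = data_mal[data_mal['fold'] != i]
    
    X = np.array(train_data.feature.tolist())
    y = np.array(train_data.label.tolist())
    
    x_val = np.array(validation_data.feature.tolist())
    y_val = np.array(validation_data.label.tolist())
    
    # Fit the encoder on the training labels only and reuse it for validation
    y = np_utils.to_categorical(lb.fit_transform(y), num_classes=num_labels)
    y_val = np_utils.to_categorical(lb.transform(y_val), num_classes=num_labels)
    
    # Start every fold from freshly initialized weights. Reusing one model across
    # folds would validate it on data it was already trained on in earlier folds.
    fold_model = clone_model(model)
    fold_model.compile(loss='categorical_crossentropy', metrics=['accuracy'], optimizer='adam')
    
    fold_model.fit(X, y, batch_size=64, epochs=100, validation_data=(x_val, y_val))
    pred = fold_model.predict(x_val)
    
    predicted.append(pred)
    actual.append(y_val)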

In [None]:
acc = []
for i in range(0,10):
    predict_conv = np.argmax(predicted[i],axis=1)
    actual_conv = np.argmax(actual[i],axis=1)
    acc.append(accuracy_score(actual_conv,predict_conv))

In [None]:
print("Accuracy for 10 fold cross validation",np.mean(acc))

## Conclusion:

1. 10-fold cross-validation, as described in the original paper, results in 96.7% accuracy with MFCC features and 80% accuracy with mel spectrogram features. These figures are a fair estimate only if the model is re-initialized for every fold; weights carried over from one fold to the next have already been trained on the next validation fold.
1. Because of limited time we have not tuned hyperparameters. Tuning could be tracked with reports like the one shown [here](https://wandb.ai/buntyshah/XGBoost/reports/Chicago-Crimes-Datesets-XGBooster--VmlldzoxMzcxNzE?accessToken=xprku0os6i6np7ptv1knj9xhholi24b5u85qaqaq9rgm63mlnxqurb3ik0xqh7d4).
1. Because of limited time we have not looked at AUC, F1 score, or a classification report, which could guide further improvements to the model.
1. We can augment the input sounds to improve accuracy.
1. The model is currently a very basic DNN; a stronger architecture could improve accuracy.
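
As a starting point for the per-class evaluation mentioned above, the following is a minimal sketch that computes precision, recall, and F1 per class with NumPy. The function name `per_class_f1` and the toy labels are assumptions for illustration; they are not part of the notebook and are not model output.

```python
import numpy as np

def per_class_f1(y_true, y_pred, num_classes):
    """Compute precision, recall and F1 for each class from integer labels."""
    # confusion[i, j] counts samples of true class i predicted as class j.
    confusion = np.zeros((num_classes, num_classes), dtype=int)
    np.add.at(confusion, (y_true, y_pred), 1)

    true_pos = np.diag(confusion).astype(float)
    predicted_totals = confusion.sum(axis=0)
    actual_totals = confusion.sum(axis=1)

    # Guard against division by zero for classes that never occur.
    precision = np.divide(true_pos, predicted_totals,
                          out=np.zeros(num_classes), where=predicted_totals > 0)
    recall = np.divide(true_pos, actual_totals,
                       out=np.zeros(num_classes), where=actual_totals > 0)
    denom = precision + recall
    f1 = np.divide(2 * precision * recall, denom,
                   out=np.zeros(num_classes), where=denom > 0)
    return precision, recall, f1

# Toy labels (illustrative only, not model output).
y_true = np.array([0, 0, 1, 1, 2, 2])
y_pred = np.array([0, 1, 1, 1, 2, 0])
precision, recall, f1 = per_class_f1(y_true, y_pred, num_classes=3)
print(np.round(f1, 3))
```

With the notebook's `predicted` and `actual` lists, the inputs would be the `np.argmax(..., axis=1)` arrays of each fold. `sklearn.metrics.classification_report` produces a similar per-class table in one call.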
