In [1]:
!date

Fri Feb 26 01:51:15 UTC 2021


# Applying Machine Learning on UrbanSound8k 

In [2]:
# Mount Google Drive
# Colaboratory: Can I access to my Google drive folder and file? 
# https://stackoverflow.com/questions/47744131/colaboratory-can-i-access-to-my-google-drive-folder-and-file
 
from google.colab import drive
drive.mount('/content/drive')
 
# Check the Google Drive mount
base_path = '/content/drive/MyDrive/dataset/sound_classification_ml_production/'
 
!pwd
!ls -la /content/drive/MyDrive/dataset/sound_classification_ml_production
!pwd
!ls -la .

Mounted at /content/drive
/content
total 6047
drwx------ 2 root root    4096 Jan 30 06:12  audio_files
-rw------- 1 root root 3014027 Feb  8 07:44 'Copy of dataset.json'
-rw------- 1 root root 3120372 Feb  8 07:50  dataset.json
drwx------ 2 root root    4096 Feb  8 06:27  features_mfcc_0
drwx------ 2 root root    4096 Jan 30 06:12  flask_app
drwx------ 2 root root    4096 Jan 30 06:12  .git
-rw------- 1 root root     451 Jan 30 06:12  .gitignore
drwx------ 2 root root    4096 Jan 30 06:12  images
-rw------- 1 root root    1072 Jan 30 06:12  LICENSE
drwx------ 2 root root    4096 Feb  7 07:21  my_features_mfcc
drwx------ 2 root root    4096 Feb  7 07:21  my_features_mfcc_0
drwx------ 2 root root    4096 Jan 30 06:12  notebooks
-rw------- 1 root root    9441 Jan 30 06:12  README.md
drwx------ 2 root root    4096 Feb  8 06:12  saved_models
drwx------ 2 root root    4096 Feb  8 06:12  saved_models_0
drwx------ 2 root root    4096 Jan 30 07:34  UrbanSound8K
/content
total 20
drwxr-xr-x 1 ro

## Install Packages

We install: 
- Machine learning libraries: `Keras`, `sklearn`
- Audio processing: `librosa`
- Plots: `Plotly`, `matplotlib`

In [3]:
!pip install pandas
!pip install setuptools
!pip install numpy
!pip install scikit-learn
!pip install librosa
!pip install plotly
!pip install matplotlib
!pip install pillow
!pip install keras



In [4]:
import os
import time
import librosa
import zipfile
import numpy as np
import pandas as pd
import librosa.display
import matplotlib.pyplot as plt
import plotly.graph_objects as go
from PIL import Image

In [5]:
# Unzip dataset
# !wget -P /content/drive/MyDrive/dataset/sound_classification_ml_production https://zenodo.org/record/1203745/files/UrbanSound8K.tar.gz 
# !mkdir /content/drive/MyDrive/dataset/sound_classification_ml_production/UrbanSound8K
# !tar -xzf UrbanSound8K.tar.gz -C /content/drive/MyDrive/dataset/sound_classification_ml_production/UrbanSound8K
# !rm urban8k.tgz

!ls -la /content/drive/MyDrive/dataset/sound_classification_ml_production/UrbanSound8K
# !mv /content/drive/MyDrive/dataset/sound_classification_ml_production/UrbanSound8K.tar.gz .


total 55
drwx------ 2 root root  4096 Jan 30 07:34 audio
-rw------- 1 root root 15364 May 19  2014 .DS_Store
-rw------- 1 root root 26155 May 19  2014 FREESOUNDCREDITS.txt
drwx------ 2 root root  4096 May 28  2014 metadata
-rw------- 1 root root  4932 Jun  3  2014 UrbanSound8K_README.txt


In [6]:
!pwd && ls -la

# !mkdir /content/drive/MyDrive/dataset/sound_classification_ml_production/features_mfcc
# !cp ./features_mfcc/* /content/drive/MyDrive/dataset/sound_classification_ml_production/features_mfcc

/content
total 20
drwxr-xr-x 1 root root 4096 Feb 26 01:54 .
drwxr-xr-x 1 root root 4096 Feb 26 01:50 ..
drwxr-xr-x 4 root root 4096 Feb 24 17:48 .config
drwx------ 5 root root 4096 Feb 26 01:54 drive
drwxr-xr-x 1 root root 4096 Feb 24 17:49 sample_data


## Design Choices and Models

After analysing the dataset, reading about the state of the art in audio signal classification, and drawing on some of my [previous work](https://github.com/jsalbert/Music-Genre-Classification-with-Deep-Learning), I have made the following design choices and proposals:

Train a Convolutional Neural Network and use either MFCCs, the STFT or the Mel-Spectrogram as input. 

- As the audio clips range in duration from 0 to 4 s, I pad the generated spectrogram so that all clips have equal length. 
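The padding step can be sketched as follows. This is a minimal illustration, not the notebook's implementation: `padded_frame_count` and `pad_to_length` are hypothetical helper names, and the frame count assumes a centred STFT (one frame per hop plus one) at 22050 Hz with `hop_length=512`.

```python
import numpy as np

def padded_frame_count(duration_s, sample_rate=22050, hop_length=512):
    """Frames produced by a centred STFT, rounded up to an even number."""
    n_samples = int(duration_s * sample_rate)
    n_frames = 1 + n_samples // hop_length
    return n_frames + (n_frames % 2)

def pad_to_length(features, target_frames):
    """Zero-pad a (n_features, n_frames) array on the right along the time axis."""
    missing = target_frames - features.shape[1]
    if missing < 0:
        raise ValueError("clip is longer than the target length")
    return np.pad(features, pad_width=((0, 0), (0, missing)))

max_frames = padded_frame_count(4)   # longest clip is 4 s
short_clip = np.ones((39, 50))       # stand-in for the features of a short clip
padded = pad_to_length(short_clip, max_frames)
print(max_frames, padded.shape)
```

Padding on the right with zeros keeps the start of every clip aligned, which is one simple choice among several (centring or repeating the clip are alternatives).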

Feature options:

- Using MFCCs as features:
  - It is common to compute the first 13 MFCCs together with their first and second derivatives and use them as features.
  - It is also common to use 40 MFCCs (note that Librosa's own default is `n_mfcc=20`).

- Using the STFT as features:
  - Involves less hand-crafted processing than MFCCs and the Mel-Spectrogram, so the CNN could learn its own filters rather than rely on representations designed by humans.

- Using Mel-Spectrogram as features:
  - A transformation applied to the STFT that approximates how humans perceive sound. Less hand-engineered than MFCCs but a bit more than the STFT. 

My first choice would be the STFT or the Mel-Spectrogram, as CNNs seem able to take more advantage of their time-frequency structure, but due to limited **computational resources** and time I will show the use of **MFCCs** as features, as they are much more memory efficient. 
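The memory argument can be made concrete with a rough back-of-the-envelope calculation. The sketch below assumes 8732 clips, 174 padded frames per clip, `n_fft=2048`, Librosa's default of 128 mel bands, and float32 storage of magnitudes only; these are illustrative assumptions, not measurements from this notebook.

```python
N_CLIPS = 8732        # number of clips in UrbanSound8K
N_FRAMES = 174        # padded time frames for 4 s at 22050 Hz, hop_length=512
BYTES_PER_VALUE = 4   # assumption: values stored as float32

# Rows (frequency-like axis) of each representation
rows_per_feature = {
    "stft": 1 + 2048 // 2,       # frequency bins for n_fft=2048
    "mel-spectrogram": 128,      # librosa's default n_mels
    "mfcc (13 + deltas)": 39,    # 13 MFCCs plus first and second derivatives
}

def dataset_gigabytes(n_rows):
    """Approximate size of the whole dataset in GB for a given number of rows."""
    return N_CLIPS * n_rows * N_FRAMES * BYTES_PER_VALUE / 1e9

for name, n_rows in rows_per_feature.items():
    print(f"{name:20s} {n_rows:5d} rows  ~{dataset_gigabytes(n_rows):.2f} GB")
```

Under these assumptions the STFT needs roughly 26 times as much memory as the 39-row MFCC features, since 1025 / 39 is about 26.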
 

## Dataset Preprocessing and Splits

I load all the audio data using Librosa at its default sample rate of 22050 Hz. This design decision is based on the Librosa blog 
([Source](https://librosa.org/blog/2019/07/17/resample-on-load/#Okay...-but-why-22050-Hz?--Why-not-44100-or-48000?)); different sample rates could be tried in further experiments. 

> Humans can hear up to around 20000 Hz, it's possible to successfully analyze music and speech data at much lower rates without sacrificing much. The highest pitches we usually care about detecting are around C9≈8372 Hz, well below the 11025 cutoff implied by fs=22050.

By default Librosa will load the audio in mono, giving us 1 channel.
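The numbers in the quote can be checked with a short calculation. This sketch assumes standard equal temperament with A4 = 440 Hz (MIDI note 69); `midi_to_hz` is a helper written here for illustration.

```python
def midi_to_hz(midi_note):
    """Equal-tempered frequency of a MIDI note number (A4 = note 69 = 440 Hz)."""
    return 440.0 * 2.0 ** ((midi_note - 69) / 12)

sample_rate = 22050
nyquist = sample_rate / 2   # highest frequency representable at this sample rate
c9 = midi_to_hz(120)        # C9 is MIDI note 120

print(round(c9), nyquist, c9 < nyquist)
```

C9 comes out at about 8372 Hz, comfortably below the 11025 Hz Nyquist limit for a 22050 Hz sample rate.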



In [7]:
!cat /content/drive/MyDrive/dataset/sound_classification_ml_production/UrbanSound8K/metadata/UrbanSound8K.csv

Streaming output truncated to the last 5000 lines.
17592-5-1-2.wav,17592,9.921326,13.921326,1,1,5,engine_idling
17592-5-1-3.wav,17592,10.421326,14.421326,1,1,5,engine_idling
176003-1-0-0.wav,176003,0.031344,1.1523,1,4,1,car_horn
17615-3-0-0.wav,17615,35.115512,39.115512,2,3,3,dog_bark
17615-3-0-3.wav,17615,36.615512,40.615512,2,3,3,dog_bark
17615-3-0-4.wav,17615,37.115512,41.115512,2,3,3,dog_bark
17615-3-0-6.wav,17615,38.115512,42.115512,2,3,3,dog_bark
176257-3-0-0.wav,176257,45.814379,49.814379,1,1,3,dog_bark
176258-3-1-12.wav,176258,75.897629,79.897629,1,1,3,dog_bark
176258-3-1-13.wav,176258,76.397629,80.397629,1,1,3,dog_bark
176258-3-1-18.wav,176258,78.897629,82.897629,1,1,3,dog_bark
176258-3-1-2.wav,176258,70.897629,74.897629,1,1,3,dog_bark
176631-1-0-0.wav,176631,28.32413,30.302215,2,3,1,car_horn
176634-1-0-0.wav,176634,2.441912,2.70793,1,2,1,car_horn
176638-1-0-0.wav,176638,4.51208,5.061169,1,1,1,car_horn
176638-1-1-0.wav,176638,6.724352,7.520133,1,1,1,car_horn
176638-5-0-0.wav,176638,0,4,

In [8]:
# FeatureExtractor class including librosa audio processing functions
class FeatureExtractor:
    def __init__(self, csv_file):
        self.csv_file = csv_file
        self.max_audio_duration = 4
        self.dataset_df = self._create_dataset(csv_file)
    
    @staticmethod
    def _create_dataset(csv_file):
        """
        Args:
            csv_file: path to the UrbanSound8K metadata CSV file
        Returns: A pandas dataframe with the metadata plus a `filepath` column holding each clip's relative path
        """
        dataset_df = pd.read_csv(csv_file)
        filepaths = []
        for i, row in dataset_df.iterrows():
            filepaths.append(os.path.join('UrbanSound8K/audio', 'fold'+str(row['fold']), row['slice_file_name']))
        dataset_df['filepath'] = filepaths
        return dataset_df

    @staticmethod
    def _compute_max_pad_length(max_audio_length, sample_rate=22050, n_fft=2048, hop_length=512):
        dummy_file = np.random.random(max_audio_length*sample_rate)
        stft = librosa.stft(dummy_file, n_fft=n_fft, hop_length=hop_length)
        # Return an even number for CNN computation purposes
        if stft.shape[1] % 2 != 0:
            return stft.shape[1] + 1
        return stft.shape[1]

    def compute_save_features(self, 
                        mode='mfcc', 
                        sample_rate=22050,
                        n_fft=2048,
                        hop_length=512,
                        n_mfcc=40,
                        output_path='features',
                        deltas=False
                        ):
        dataset_features = []
        max_pad = self._compute_max_pad_length(self.max_audio_duration, 
                                               sample_rate=sample_rate, 
                                               n_fft=n_fft,
                                               hop_length=hop_length)
        print('Max Padding = ', max_pad)
        
        if not os.path.exists(output_path):
            print('Creating output folder: ', output_path)
            os.makedirs(output_path)
        else:
            print('Output folder already existed')
            
        print('Saving features in ', output_path)
        i = 0
        t = time.time()
        
        features_path = []
        for relative_filepath in self.dataset_df['filepath']:
            filepath = base_path + relative_filepath
            print('compute_save_features, filepath = ' + str(filepath))

            if i % 100 == 0:
                print('{} files processed in {}s'.format(i, time.time() - t))

            print('compute_save_features, librosa.load, filepath = ' + str(filepath))
            audio_file, sample_rate = librosa.load(filepath, sr=sample_rate, res_type='kaiser_fast')
            if mode == 'mfcc':
                audio_features = self.compute_mfcc(audio_file, sample_rate, n_fft, hop_length, n_mfcc, deltas)  
            elif mode == 'stft':
                audio_features = self.compute_stft(audio_file, sample_rate, n_fft, hop_length)
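            elif mode == 'mel-spectogram':
                audio_features = self.compute_mel_spectogram(audio_file, sample_rate, n_fft, hop_length)
            else:
                # Fail early instead of hitting an undefined variable below
                raise ValueError('Unknown mode: ' + str(mode))
            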
            audio_features = np.pad(audio_features, pad_width=((0, 0), (0, max_pad - audio_features.shape[1])))
            print('compute_save_features, audio_features = ' + str(type(audio_features)) + ', ' + str(audio_features.shape))

            # Save the features as .npy with a trailing channel axis, e.g. shape (39, 174, 1) for 13 MFCCs plus deltas
            npy_path = os.path.join(output_path, filepath.split('/')[-1].replace('wav', 'npy'))
            print('compute_save_features, npy_path = ' + str(npy_path))
            print('compute_save_features, audio_features.shape = ' + str(audio_features.shape))
            tmp_audio_features = audio_features.reshape(audio_features.shape[0], audio_features.shape[1], 1)
            print('compute_save_features, tmp_audio_features.shape = ' + str(tmp_audio_features.shape))
            np.save(npy_path, tmp_audio_features)
            
            save_path = os.path.join(output_path, filepath.split('/')[-1].replace('wav', 'png'))
            print('compute_save_features, save_features, save_path = ' + str(save_path))
            print('compute_save_features, audio_features.shape = ' + str(audio_features.shape))
            self.save_features(audio_features, save_path)
            features_path.append(save_path)
            i+=1
        self.dataset_df['features_path'] = features_path
        return self.dataset_df
    
    @staticmethod
    def save_features(audio_features, filepath):
        image = Image.fromarray(audio_features)
        # To grayscale
        image = image.convert("L")
        image.save(filepath)
        print('save_features, filepath = ' + str(filepath))

    @staticmethod
    def compute_mel_spectogram(audio_file, sample_rate, n_fft, hop_length):
        return librosa.feature.melspectrogram(y=audio_file,
                                              sr=sample_rate, 
                                              n_fft=n_fft,
                                              hop_length=hop_length)
    @staticmethod
    def compute_stft(audio_file, sample_rate, n_fft, hop_length):
        return librosa.stft(audio_file, n_fft=n_fft, hop_length=hop_length)
    
    @staticmethod
    def compute_mfcc(audio_file, sample_rate, n_fft, hop_length, n_mfcc, deltas=False):
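        # Pass the audio by keyword (required by recent librosa versions) and forward
        # hop_length, which the padding length from _compute_max_pad_length assumes
        mfccs = librosa.feature.mfcc(y=audio_file,
                                     sr=sample_rate,
                                     n_fft=n_fft,
                                     hop_length=hop_length,
                                     n_mfcc=n_mfcc,
                                     )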
        # Change mode from interpolation to nearest
        if deltas:
          delta_mfccs = librosa.feature.delta(mfccs, mode='nearest')
          delta2_mfccs = librosa.feature.delta(mfccs, order=2, mode='nearest')
          return np.concatenate((mfccs, delta_mfccs, delta2_mfccs))
        return mfccs

In [9]:
# Create dataset and extract features
fe = FeatureExtractor(base_path + 'UrbanSound8K/metadata/UrbanSound8K.csv')

Disk access and librosa loading of the audio files are very slow on a Colab notebook (30-40 min), so we could load the pre-computed features instead.

In [None]:
# Compute and save features on the Colab notebook (slow; skip this cell to use the pre-computed features)
dataset_df = fe.compute_save_features(mode='mfcc', n_mfcc=13, output_path=base_path + 'my_features_mfcc', deltas=True)

Max Padding =  174
Output folder already existed
Saving features in  /content/drive/MyDrive/dataset/sound_classification_ml_production/my_features_mfcc
compute_save_features, filepath = /content/drive/MyDrive/dataset/sound_classification_ml_production/UrbanSound8K/audio/fold5/100032-3-0-0.wav
0 files processed in 0.0004324913024902344s
compute_save_features, librosa.load, filepath = /content/drive/MyDrive/dataset/sound_classification_ml_production/UrbanSound8K/audio/fold5/100032-3-0-0.wav
compute_save_features, audio_features = <class 'numpy.ndarray'>, (39, 174)
compute_save_features, npy_path = /content/drive/MyDrive/dataset/sound_classification_ml_production/my_features_mfcc/100032-3-0-0.npy
compute_save_features, audio_features.shape = (39, 174)
compute_save_features, tmp_audio_features.shape = (39, 174, 1)
compute_save_features, save_features, save_path = /content/drive/MyDrive/dataset/sound_classification_ml_production/my_features_mfcc/100032-3-0-0.png
compute_save_features, audio

In [None]:
# Unzip features
# !wget -P /content/drive/MyDrive/dataset/sound_classification_ml_production --load-cookies /tmp/cookies.txt "https://docs.google.com/uc?export=download&confirm=$(wget --quiet --save-cookies /tmp/cookies.txt --keep-session-cookies --no-check-certificate 'https://docs.google.com/uc?export=download&id=1BU2B5EcbfyGBIOkB5YC44hpzPpuqw43H' -O- | sed -rn 's/.*confirm=([0-9A-Za-z_]+).*/\1\n/p')&id=1BU2B5EcbfyGBIOkB5YC44hpzPpuqw43H" -O features_mfcc.zip && rm -rf /tmp/cookies.txt
# !unzip -q /content/drive/MyDrive/dataset/sound_classification_ml_production/features_mfcc.zip
# !rm features_mfcc.zip

In [None]:
# Download dataset.json file
# !wget -P /content/drive/MyDrive/dataset/sound_classification_ml_production --no-check-certificate "https://docs.google.com/uc?export=download&id=1pzSvGYaBXghLQFTZxlSex-Ts3T4B0X4C" -O dataset.json

In [None]:
!cat /content/drive/MyDrive/dataset/sound_classification_ml_production/dataset.json

In [None]:
dataset_path = base_path + 'dataset.json'
print(dataset_path)
dataset_df = pd.read_json(dataset_path)

For the purpose of this experiment we will load all the data in memory and process it in minibatches. With more computational resources and time we could create Dataloader objects, which would allow many other operations such as data augmentation and let us iterate faster. 
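As a rough idea of what such a loader does, here is a minimal NumPy sketch of a minibatch iterator. `iterate_minibatches` is a hypothetical helper, not part of this notebook or of Keras, and the demo arrays are placeholders with the same shape as the MFCC features.

```python
import numpy as np

def iterate_minibatches(features, labels, batch_size, shuffle=True, seed=0):
    """Yield (features, labels) minibatches, optionally in shuffled order."""
    rng = np.random.default_rng(seed)
    indices = np.arange(len(features))
    if shuffle:
        rng.shuffle(indices)
    for start in range(0, len(indices), batch_size):
        batch = indices[start:start + batch_size]
        # A real loader could load from disk or apply augmentation here
        yield features[batch], labels[batch]

X_demo = np.zeros((10, 39, 174, 1), dtype=np.float32)  # placeholder features
y_demo = np.arange(10)                                 # placeholder labels
batch_shapes = [xb.shape for xb, _ in iterate_minibatches(X_demo, y_demo, batch_size=4)]
print(batch_shapes)
```

Because batches are produced lazily, only one batch needs to be resident at a time when the per-batch work reads from disk.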

In [None]:
tmp = []
for relative_feature_path in dataset_df['features_path']:
 
  feature_path = base_path + relative_feature_path.replace('features_mfcc', 'my_features_mfcc')
  print('feature_path = ' + str(feature_path))
 
  tmp.append(feature_path)

In [None]:
# NumPy: converting between numpy arrays and images https://blog.csdn.net/weixin_40522801/article/details/106490005

dataset_df['features'] = [np.asarray(np.load(base_path + feature_path.replace('features_mfcc', 'my_features_mfcc'))) for feature_path in dataset_df['features_path']]

In [None]:
from keras.utils import to_categorical
dataset_df['labels_categorical'] = [to_categorical(label, 10) for label in dataset_df['classID']]

In [None]:
dataset_df.head()

We are going to split our dataset into train, validation and test sets. 
To keep the experiment quick we will apply the sklearn function `train_test_split` twice. 
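The two-step split can be tried on synthetic data first. The sketch below uses made-up integer labels (10 classes, 100 samples each) rather than the real dataset, with the same 70/15/15 proportions and stratification.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 10 classes with 100 samples each
labels = np.repeat(np.arange(10), 100)
samples = np.arange(len(labels)).reshape(-1, 1)

# First split: 70% train, 30% held out
X_tr, X_rest, y_tr, y_rest = train_test_split(
    samples, labels, test_size=0.30, random_state=1, stratify=labels)

# Second split: divide the held-out 30% equally into test and validation
X_te, X_va, y_te, y_va = train_test_split(
    X_rest, y_rest, test_size=0.5, random_state=1, stratify=y_rest)

print(len(X_tr), len(X_va), len(X_te))
print(np.bincount(y_tr), np.bincount(y_va), np.bincount(y_te))
```

Stratifying both splits keeps the class proportions the same in all three subsets.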

In [None]:
dataset_df_tolist = dataset_df['features'].tolist()
np_array = np.array(dataset_df_tolist)
print('dataset_df_tolist = ' + str(dataset_df_tolist))
print('np_array = ' + str(np_array))

In [None]:
# Split the dataset 
from sklearn.model_selection import train_test_split 

# Stack the per-clip features into one array; the saved .npy features already include the channel dimension
X = np.array(dataset_df['features'].tolist())
y = np.array(dataset_df['labels_categorical'].tolist())

# Some classes are under-represented, so stratify to keep the same class proportions in train and test
X_train, X_test, Y_train, Y_test = train_test_split(X, 
                                                    y, 
                                                    test_size=0.30, 
                                                    random_state=1, 
                                                    stratify=y)
# Create validation and test
X_test, X_val, Y_test, Y_val = train_test_split(X_test, 
                                                Y_test, 
                                                test_size=0.5, 
                                                random_state=1, 
                                                stratify=Y_test)

print(X_train.shape, X_val.shape, X_test.shape)

## Machine Learning Model

### Model Design

We are going to create a **Fully Convolutional Network** model with a few layers, using Keras running on top of TensorFlow. 

In [None]:
from keras.models import Sequential
from keras.layers import Input, Dense, Dropout, Activation, Flatten
from keras.layers import Convolution2D, Conv2D, MaxPooling2D, GlobalAveragePooling2D

As our images are rectangular (the y axis is the MFCC index, the x axis is time), instead of the usual square filters we are going to use rectangular ones so they can better learn the correlation of the MFCCs along the temporal dimension. 

In [None]:
# FCN Model
def create_model(num_classes=10, input_shape=None, dropout_ratio=None):
    model = Sequential()
    if input_shape is None:
        model.add(Input(shape=(None, None, 1)))
    else:
        model.add(Input(shape=input_shape))
    model.add(Conv2D(filters=16, kernel_size=(2, 4), activation='relu'))
    model.add(MaxPooling2D(pool_size=(2, 3)))
    model.add(Conv2D(filters=32, kernel_size=(2, 4), activation='relu'))
    model.add(MaxPooling2D(pool_size=2))
    model.add(Conv2D(filters=64, kernel_size=(2, 4), activation='relu'))
    model.add(MaxPooling2D(pool_size=2))
    model.add(Conv2D(filters=128, kernel_size=(2, 4), activation='relu'))
    model.add(GlobalAveragePooling2D())
    if dropout_ratio is not None:
        model.add(Dropout(dropout_ratio))
    # Add dense linear layer
    model.add(Dense(num_classes, activation='softmax'))
    return model
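To see how the rectangular kernels and pooling windows above shrink the feature map, the spatial shapes can be traced by hand. This sketch assumes a (39, 174) input and Keras' defaults of unpadded ('valid') convolutions with stride 1 and pooling with stride equal to the pool size; `conv_valid` and `max_pool` are helper names made up for the illustration.

```python
def conv_valid(shape, kernel):
    """Output height/width of an unpadded convolution with stride 1."""
    return tuple(s - k + 1 for s, k in zip(shape, kernel))

def max_pool(shape, pool):
    """Output height/width of max pooling with stride equal to the pool size."""
    return tuple(s // p for s, p in zip(shape, pool))

shape = (39, 174)  # (MFCC rows, time frames)
layers = [
    (conv_valid, (2, 4)), (max_pool, (2, 3)),
    (conv_valid, (2, 4)), (max_pool, (2, 2)),
    (conv_valid, (2, 4)), (max_pool, (2, 2)),
    (conv_valid, (2, 4)),
]

trace = [shape]
for op, size in layers:
    shape = op(shape, size)
    trace.append(shape)
print(trace)
```

The last convolution leaves a 3 x 9 map with 128 channels, which `GlobalAveragePooling2D` then reduces to a 128-dimensional vector for the softmax layer.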

As it is a multi-class classification problem we will use the **Categorical Cross Entropy loss**. As optimizer we will use the Keras implementation of **Adam** with its default hyperparameter values. 
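For intuition, categorical cross entropy on one-hot labels can be written in a few lines of NumPy. This is a simplified sketch (mean over the batch, with clipping to avoid log(0)), not Keras' implementation.

```python
import numpy as np

def categorical_cross_entropy(y_true, y_pred, eps=1e-7):
    """Mean over the batch of -sum(y_true * log(y_pred)) for one-hot labels."""
    y_pred = np.clip(y_pred, eps, 1.0)
    return float(np.mean(-np.sum(y_true * np.log(y_pred), axis=1)))

y_true = np.eye(10)[[3, 7]]                     # two one-hot labels, classes 3 and 7
uniform = np.full((2, 10), 0.1)                 # an untrained-like uniform guess
confident = np.where(y_true == 1, 0.91, 0.01)   # mostly correct predictions

loss_uniform = categorical_cross_entropy(y_true, uniform)
loss_confident = categorical_cross_entropy(y_true, confident)
print(loss_uniform, loss_confident)
```

A uniform guess over 10 classes gives a loss of ln(10), about 2.30, which is a useful reference point when reading the first epochs of a training log.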

In [None]:
# Create and compile the model
fcn_model = create_model(input_shape=X_train.shape[1:])
fcn_model.compile(loss='categorical_crossentropy', metrics=['accuracy'], optimizer='adam')
fcn_model.summary()

### Model training and evaluation

In [None]:
from keras.models import load_model
from keras.callbacks import ModelCheckpoint 

In [None]:
!mkdir -p /content/drive/MyDrive/dataset/sound_classification_ml_production/saved_models

In [None]:
def train_model(model, X_train, Y_train, X_val, Y_val, epochs, batch_size, callbacks):
    model.fit(X_train, 
              Y_train, 
              batch_size=batch_size, 
              epochs=epochs, 
              validation_data=(X_val, Y_val), 
              callbacks=callbacks, verbose=1)
    return model

We will create a checkpoint callback as a simple form of **early stopping**: training runs for all epochs, but we keep the model that performs best on the validation set. 

Wrapping training in a function will let us perform hyperparameter tuning faster. 
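The "keep only the best model" behaviour of the checkpoint can be sketched in plain Python. The accuracy values below are made up for illustration, and `best_checkpoint` is a hypothetical helper, not a Keras function.

```python
def best_checkpoint(val_accuracies):
    """Return (epoch, value) of the best validation accuracy; epochs start at 1."""
    best_epoch, best_value = None, float("-inf")
    for epoch, value in enumerate(val_accuracies, start=1):
        # A checkpoint would be written only when the monitored metric improves
        if value > best_value:
            best_epoch, best_value = epoch, value
    return best_epoch, best_value

history = [0.42, 0.55, 0.61, 0.66, 0.64, 0.65, 0.63]  # illustrative values only
print(best_checkpoint(history))
```

Here the model from epoch 4 would be the one left on disk, even though training continued for three more epochs.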

In [None]:
checkpointer = ModelCheckpoint(filepath=base_path + 'saved_models/best_fcn.hdf5', monitor='val_accuracy', verbose=1, save_best_only=True)
callbacks = [checkpointer]

# Hyper-parameters
epochs = 100
batch_size = 256

In [None]:
# Train the model
model = train_model(model=fcn_model,
                    X_train=X_train,
                    X_val=X_val,
                    Y_train=Y_train,
                    Y_val=Y_val,
                    epochs=epochs,
                    batch_size=batch_size,
                    callbacks=callbacks)

In [None]:
# Load the best model
best_model = load_model(base_path + 'saved_models/best_fcn.hdf5')

# !ls /content/drive/MyDrive/dataset/sound_classification_ml_production/saved_models

It looks like the model overfitted the training data towards the end of training. We have selected the model that performed best on the validation set, saved by the checkpoint. The similarity between the validation and test scores suggests that our training methodology is sound and that our validation set is a good estimator of test performance. 

In [None]:
# Evaluate the model on the training, validation and test sets
score = best_model.evaluate(X_train, Y_train, verbose=0)
print("Training Accuracy: ", score[1])

score = best_model.evaluate(X_val, Y_val, verbose=0)
print("Validation Accuracy: ", score[1])

score = best_model.evaluate(X_test, Y_test, verbose=0)
print("Testing Accuracy: ", score[1])

We see that there has been overfitting so we could train another model adding dropout before the last layer to add more regularization.  
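As a reminder of what dropout does during training, here is a minimal NumPy sketch of inverted dropout. It is an illustration of the technique under simple assumptions, not Keras' `Dropout` layer.

```python
import numpy as np

def dropout(activations, rate, rng):
    """Zero a random fraction `rate` of units and rescale the rest by 1 / (1 - rate)."""
    keep_mask = rng.random(activations.shape) >= rate
    return activations * keep_mask / (1.0 - rate)

rng = np.random.default_rng(0)
activations = np.ones((1000, 128))   # stand-in for the pooled 128-dimensional features
dropped = dropout(activations, rate=0.5, rng=rng)
print(np.unique(dropped), dropped.mean())
```

With a rate of 0.5 each unit is either zeroed or doubled, so the expected activation stays the same and no rescaling is needed at inference time.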

In [None]:
# We add a dropout ratio of 0.5
fcn_model = create_model(input_shape=X_train.shape[1:], dropout_ratio=0.5)
fcn_model.compile(loss='categorical_crossentropy', metrics=['accuracy'], optimizer='adam')
fcn_model.summary()

In [None]:
checkpointer = ModelCheckpoint(filepath=base_path + 'saved_models/best_fcn_dropout.hdf5', monitor='val_accuracy',
                               verbose=1, save_best_only=True)
callbacks = [checkpointer]

model = train_model(model=fcn_model,
                    X_train=X_train,
                    X_val=X_val,
                    Y_train=Y_train,
                    Y_val=Y_val,
                    epochs=200,
                    batch_size=256,
                    callbacks=callbacks)

In [None]:
best_model = load_model(base_path + 'saved_models/best_fcn_dropout.hdf5')

# !ls /content/drive/MyDrive/dataset/sound_classification_ml_production/saved_models

In [None]:
# Evaluate the model on the training, validation and test sets
score = best_model.evaluate(X_train, Y_train, verbose=0)
print("Training Accuracy: ", score[1])

score = best_model.evaluate(X_val, Y_val, verbose=0)
print("Validation Accuracy: ", score[1])

score = best_model.evaluate(X_test, Y_test, verbose=0)
print("Testing Accuracy: ", score[1])

In [None]:
# Plot a confusion matrix
from sklearn import metrics
Y_pred = best_model.predict(X_test)
matrix = metrics.confusion_matrix(Y_test.argmax(axis=1), Y_pred.argmax(axis=1))

In [None]:
# Confusion matrix code (from https://github.com/triagemd/keras-eval/blob/master/keras_eval/visualizer.py)
def plot_confusion_matrix(cm, concepts, normalize=False, show_text=True, fontsize=18, figsize=(16, 12),
                          cmap=plt.cm.coolwarm_r, save_path=None, show_labels=True):
    '''
    Plot confusion matrix provided in 'cm'
    Args:
        cm: Confusion Matrix, square sized numpy array
        concepts: Name of the categories to show
        normalize: If True, normalize each row so that values lie between 0 and 1. Not valid if negative values.
        show_text: If True, display cell values as text. Otherwise only display cell colors.
        fontsize: Size of text
        figsize: Size of figure
        cmap: Color choice
        save_path: If `save_path` specified, save confusion matrix in that location
    Returns: Nothing. Plots confusion matrix
    '''

    if cm.ndim != 2 or cm.shape[0] != cm.shape[1]:
        raise ValueError('Invalid confusion matrix shape, it should be square and ndim=2')

    if cm.shape[0] != len(concepts) or cm.shape[1] != len(concepts):
        raise ValueError('Number of concepts (%i) and dimensions of confusion matrix do not coincide (%i, %i)' %
                         (len(concepts), cm.shape[0], cm.shape[1]))

    plt.rcParams.update({'font.size': fontsize})

    cm_normalized = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]

    if normalize:
        cm = cm_normalized

    fig = plt.figure(figsize=figsize)
    ax = fig.add_subplot(111)
    cax = ax.matshow(cm, vmin=np.min(cm), vmax=np.max(cm), alpha=0.8, cmap=cmap)

    fig.colorbar(cax)
    ax.xaxis.tick_bottom()
    plt.ylabel('True label', fontweight='bold')
    plt.xlabel('Predicted label', fontweight='bold')

    if show_labels:
        n_labels = len(concepts)
        ax.set_xticklabels(concepts)
        ax.set_yticklabels(concepts)
        plt.xticks(np.arange(0, n_labels, 1.0), rotation='vertical')
        plt.yticks(np.arange(0, n_labels, 1.0))
    else:
        plt.axis('off')

    if show_text:
        # http://stackoverflow.com/questions/21712047/matplotlib-imshow-matshow-display-values-on-plot
        min_val, max_val = 0, len(concepts)
        ind_array = np.arange(min_val, max_val, 1.0)
        x, y = np.meshgrid(ind_array, ind_array)
        for i, (x_val, y_val) in enumerate(zip(x.flatten(), y.flatten())):
            c = cm[int(x_val), int(y_val)]
            ax.text(y_val, x_val, c, va='center', ha='center')

    if save_path is not None:
        plt.savefig(save_path)

To better observe the performance of the model and the mistakes made between different classes, we plot the confusion matrix. 

In our case accuracy is a good metric because the dataset is mostly balanced, but we observed a few classes with fewer samples (`car_horn`, `gun_shot` and `siren`), so it is worth checking the performance on these classes.

We can observe that many mistakes happen between the classes `children_playing` and `street_music`, so it may be worth spending a little more time on analysis to find the reasons.  
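Per-class performance and the most confused pair of classes can also be read from the matrix numerically. The sketch below uses a small made-up 3-class matrix (rows are true labels, columns are predictions); `per_class_recall` and `most_confused_pair` are hypothetical helpers.

```python
import numpy as np

def per_class_recall(cm):
    """Fraction of each true class (row) that was predicted correctly."""
    return np.diag(cm) / cm.sum(axis=1)

def most_confused_pair(cm):
    """(true_class, predicted_class) of the largest off-diagonal count."""
    off_diagonal = cm.copy()
    np.fill_diagonal(off_diagonal, 0)
    true_idx, pred_idx = np.unravel_index(np.argmax(off_diagonal), off_diagonal.shape)
    return int(true_idx), int(pred_idx)

# Toy confusion matrix, not results from this notebook
toy_cm = np.array([[8, 1, 1],
                   [0, 9, 1],
                   [4, 0, 6]])
print(per_class_recall(toy_cm))
print(most_confused_pair(toy_cm))
```

Applied to the real matrix, the recall of the smaller classes shows whether overall accuracy is hiding poor performance on them.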

In [None]:
class_dictionary = {3: 'dog_bark', 2: 'children_playing', 1: 'car_horn', 0: 'air_conditioner', 9: 'street_music', 6: 'gun_shot', 8: 'siren', 5: 'engine_idling', 7: 'jackhammer', 4: 'drilling'}
classes = [class_dictionary[key] for key in sorted(class_dictionary.keys())]

In [None]:
plot_confusion_matrix(matrix, classes)

## Conclusions

We can observe an improvement of 1-2% in test set accuracy when introducing dropout as regularization, which suggests that it was a useful addition to our model.

There are many things that we can try to improve the model's performance such as:

- Hyperparameter tuning:
  - Tuning the parameters of feature extraction
  - Tuning the network parameters (number of layers, pooling layers, number and filter shape...)
  - Tuning the network hyperparameters (Learning rate, optimizer) 

- Feature extraction:
  - Use STFT: The raw spectrogram could provide more information than the MFCCs for the CNN to learn correlations between frequency and time.
  - Use Mel-Spectrogram: The mel-spectrogram could likewise provide more information than the MFCCs for the CNN to learn correlations between frequency and time. 
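A simple way to organise such tuning is to enumerate a grid of configurations and train one model per configuration. The search space below is purely illustrative; the parameter names mirror arguments used earlier in this notebook, but the values are assumptions rather than recommendations.

```python
from itertools import product

# Hypothetical search space covering feature extraction and training settings
search_space = {
    "n_mfcc": [13, 40],
    "dropout_ratio": [None, 0.25, 0.5],
    "batch_size": [128, 256],
}

def iter_configs(space):
    """Yield one dict per combination of values in the search space."""
    keys = list(space)
    for values in product(*(space[key] for key in keys)):
        yield dict(zip(keys, values))

configs = list(iter_configs(search_space))
print(len(configs), configs[0])
```

Each configuration could then be trained and compared on validation accuracy, keeping the test set for the final selected model only.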

In [None]:
!ls /content/drive/MyDrive/dataset/audio_classifier_tutorial/UrbanSound8K

In [None]:
# import matplotlib
# matplotlib.use('Agg')
import os

import librosa
from tensorflow.keras.models import load_model
import numpy as np
# from PIL import Image
# import cv2

# Load the model .h5 file

# Read the audio data
def load_data(data_path):
    wav, sr = librosa.load(data_path, sr=16000)
    intervals = librosa.effects.split(wav, top_db=20)
    wav_output = []
    for sliced in intervals:
        wav_output.extend(wav[sliced[0]:sliced[1]])
    assert len(wav_output) >= 8000, "Valid audio is shorter than 0.5 s"
    wav_output = np.array(wav_output)
    ps = librosa.feature.melspectrogram(y=wav_output, sr=sr, hop_length=256).astype(np.float32)
    ps = ps[np.newaxis, ..., np.newaxis]
    return ps

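# NOTE: best_model above was trained on 13 MFCCs plus deltas at 22050 Hz, so this
# mel-spectrogram input is only suitable for a model trained on matching features
load_data_car_119_10785 = load_data('/content/drive/MyDrive/dataset/audio_classifier_tutorial/UrbanSound8K/predict/car_119_10785.wav')
best_model.predict(load_data_car_119_10785)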

In [None]:
!date