# Spoken Digits Recognition

What happens inside your phone when you talk to Siri or Alexa?
That question is the starting point of this notebook.

By the end of it, you should have an idea of how something you use every day, speech recognition, works.

In this notebook we simplify the problem and focus on recognizing spoken digits from 0 to 9.

### Required packages:
- librosa
- tensorflow
- scikit-learn (imported as `sklearn`)
- numpy
- matplotlib

In [None]:
import os
import random
import librosa
import numpy as np
import librosa.display
import tensorflow as tf
import matplotlib.pyplot as plt
from keras.models import Sequential
from keras.layers import Dense
from keras.models import model_from_json
from keras.layers import Flatten
from sklearn.model_selection import train_test_split

## Initialization of the parameters

In [None]:
bsize = 64  # batch size used when training
audio_features = 20
utterance_length = 25  # number of MFCC frames kept per recording (originally 35); change it to see how the results vary
ndigits = 10  # number of digits
n_mfcc = 20  # number of MFCC coefficients per frame
train_path = 'data/recordings/train/'  # path to the folder that contains the training data
prediction_path = 'data/recordings/test/'  # path to the folder that contains the data for prediction
nb_epochs = 100  # number of epochs when training the neural network

## Functions that extract the Mel Frequency Cepstrum Coefficients (MFCC) of the data

MFCCs are among the most widely used audio features in speech recognition because they try to model the way humans perceive sound.

Here is the definition of MFCCs given by Wikipedia:

*In sound processing, the mel-frequency cepstrum (MFC) is a representation of the short-term power spectrum of a sound, based on a linear cosine transform of a log power spectrum on a nonlinear mel scale of frequency.*

*Mel-frequency cepstral coefficients (MFCCs) are coefficients that collectively make up an MFC. They are derived from a type of cepstral representation of the audio clip (a nonlinear "spectrum-of-a-spectrum"). The difference between the cepstrum and the mel-frequency cepstrum is that in the MFC, the frequency bands are equally spaced on the mel scale, which approximates the human auditory system's response more closely than the linearly-spaced frequency bands used in the normal cepstrum. This frequency warping can allow for better representation of sound, for example, in audio compression.*
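
To make the "nonlinear mel scale" of the definition a little more concrete, here is a minimal sketch of one common Hz-to-mel conversion, the HTK-style formula m = 2595 · log10(1 + f / 700). It is only an illustration: the function names are our own, and librosa's default mel scale uses a slightly different (Slaney-style) variant.

```python
import math

def hz_to_mel(freq_hz):
    """Convert a frequency in Hz to mels (HTK-style formula, for illustration)."""
    return 2595.0 * math.log10(1.0 + freq_hz / 700.0)

def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)

# Four bands that are equally wide on the mel scale, between 0 Hz and 8000 Hz
low_mel, high_mel = hz_to_mel(0.0), hz_to_mel(8000.0)
mel_points = [low_mel + i * (high_mel - low_mel) / 4 for i in range(5)]
band_edges_hz = [mel_to_hz(m) for m in mel_points]

# Back in Hz, the same bands get wider and wider as the frequency increases
band_widths_hz = [hi - lo for lo, hi in zip(band_edges_hz, band_edges_hz[1:])]
print([round(edge) for edge in band_edges_hz])
print([round(width) for width in band_widths_hz])
```

Bands that are equally spaced in mels are narrow at low frequencies and wide at high frequencies, which is the "frequency warping" mentioned in the quote.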

In [None]:
def extract_mfcc(file_path, utterance_length, n_mfcc):
    # Load the .wav file as a mono floating-point time series, together with its sampling rate
    raw_w, sampling_rate = librosa.load(file_path, mono=True)

    # Compute the MFCC features: np.ndarray of shape (n_mfcc, t)
    # (recent librosa versions require the signal to be passed as the keyword argument y)
    mfcc_features = librosa.feature.mfcc(y=raw_w, sr=sampling_rate, n_fft=2048, n_mfcc=n_mfcc)

    if mfcc_features.shape[1] > utterance_length:
        # Too long: keep only the first utterance_length frames (columns)
        mfcc_features = mfcc_features[:, 0:utterance_length]
    else:
        # Too short: pad with zeros on the right, up to utterance_length frames
        mfcc_features = np.pad(mfcc_features, ((0, 0), (0, utterance_length - mfcc_features.shape[1])),
                               mode='constant', constant_values=0)

    return mfcc_features
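
The truncate-or-pad step can be tried on its own, without any audio file. The sketch below uses arrays of ones as stand-ins for real MFCC matrices; the helper name `fix_length` is ours and is not part of the notebook.

```python
import numpy as np

def fix_length(features, target_length):
    """Truncate or zero-pad a (n_coeffs, n_frames) array to target_length frames."""
    n_frames = features.shape[1]
    if n_frames > target_length:
        # Too long: keep only the first target_length columns
        return features[:, :target_length]
    # Too short (or exact): add zero columns on the right
    return np.pad(features, ((0, 0), (0, target_length - n_frames)),
                  mode='constant', constant_values=0)

short_clip = np.ones((20, 18))  # stand-in for a short recording (18 frames)
long_clip = np.ones((20, 40))   # stand-in for a long recording (40 frames)

padded = fix_length(short_clip, 25)
truncated = fix_length(long_clip, 25)
print(padded.shape, truncated.shape)
```

Both results have the same shape, which is what allows all recordings to be stacked into a single array for the network.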

In [None]:
def get_mfcc_batch(folder_path, batch_size, utterance_length):
    # Note: batch_size is kept so that the calls below still work, but it is not used;
    # the whole folder is loaded at once
    print("Loading the data...\nIt will take less than a minute")
    files = os.listdir(folder_path)  # a list containing the names of the entries in the directory
    random.shuffle(files)  # shuffle the files

    X = []  # MFCC features of every recording
    y = []  # one-hot encoded labels
    spoken_digits = []  # labels as plain integers

    for fname in files:
        # Make sure the file is a .wav file before reading its label
        if not fname.endswith(".wav"):
            continue

        # The first character of the file name is the spoken digit
        spoken_digit = int(fname[0])
        spoken_digits.append(spoken_digit)

        # Get the MFCC features for the file (n_mfcc is the global defined above)
        mfcc_features = extract_mfcc(os.path.join(folder_path, fname), utterance_length, n_mfcc)
        X.append(mfcc_features)

        # One-hot encode the label for the 10 digits 0-9
        y.append(np.eye(ndigits)[spoken_digit])

    print("Data loaded")

    return np.array(X), np.array(y), spoken_digits
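
As a small illustration of the labelling step, the sketch below reads the digit from the first character of each file name and one-hot encodes it with `np.eye`. The file names are made up for the example; we only assume, as the function above does, that each name starts with the spoken digit.

```python
import numpy as np

# Hypothetical directory listing (the real names in the dataset may differ)
file_names = ["7_speaker_3.wav", "0_speaker_1.wav", "notes.txt", "3_speaker_2.wav"]

# Keep only the .wav files, then read the digit from the first character
wav_names = [name for name in file_names if name.endswith(".wav")]
digits = [int(name[0]) for name in wav_names]

# Row d of the 10x10 identity matrix is the one-hot vector of digit d
one_hot = np.eye(10)[digits]
print(digits)
print(one_hot)
```

Filtering on the extension before reading the label keeps the list of digits aligned with the list of features, one entry per recording.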

## Visualization of a Mel Frequency Cepstrum (MFC)

You can visualize the Mel Frequency Cepstrum of a random recording taken from the dataset. Run this cell several times to see different MFCs.

Since the recordings in the dataset do not all have the same length, we pad every MFC to the same shape. This is why, for many recordings, the right-hand side of the cepstrum shows a zone with no relevant information.

In [None]:
files = os.listdir(train_path)
random.shuffle(files)
fname = files[0] #a random file in our database

mfccs = extract_mfcc(train_path + fname, utterance_length, n_mfcc)

fig, ax = plt.subplots()
img = librosa.display.specshow(mfccs, x_axis='time', ax=ax)
fig.colorbar(img, ax=ax)
ax.set(title='MFC')

## Let's build and train our neural network

We build a neural network with three fully connected hidden layers and a fully connected output layer. The first step is to flatten the data so that it can be fed to the first hidden layer.

In [None]:
model = Sequential()
model.add(Flatten())  # each (n_mfcc, utterance_length) MFCC array => 1D vector

#adding 3 fully connected hidden layers with ReLU activation function
model.add(Dense(512, activation='relu'))
model.add(Dense(256, activation='relu'))
model.add(Dense(64, activation='relu'))

#output layer
model.add(Dense(ndigits, activation='softmax'))

#compiling the model
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])

We must load the data that we are going to use to train our neural network, using the name X for the lists of MFCCs and y for the labels.

In [None]:
#loading X (the list of MFCCs) and y (the list of labels)
(X, y, spoken_digits) = get_mfcc_batch(train_path, 256, utterance_length)

### Training of the neural network

We split the data into two parts: the training data and the testing data. Then we train our neural network.

In [None]:
#separation between training data (70%) and testing data (30%)
X_tr,X_test,y_tr,y_test = train_test_split(X,y,test_size = 0.3)


#training the model
history = model.fit(X_tr, y_tr, epochs=nb_epochs, batch_size=bsize, validation_data = (X_test, y_test))

Let's have a look at the summary of the network to check that it has been built the way we want.

In [None]:
#a print of the summary of the model
model.summary()
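
One way to check the summary is to compute by hand how many parameters each fully connected layer should have: (number of inputs × number of units) weights plus one bias per unit. The sketch below does this for the layer sizes used in this notebook, assuming `n_mfcc = 20` and `utterance_length = 25` as set in the parameters; the helper name is ours.

```python
def dense_param_counts(input_size, layer_sizes):
    """Number of parameters (weights + biases) of each fully connected layer."""
    counts = []
    previous = input_size
    for units in layer_sizes:
        counts.append(previous * units + units)  # weights + biases
        previous = units
    return counts

n_mfcc = 20
utterance_length = 25

# After flattening, each recording is a vector of n_mfcc * utterance_length values
counts = dense_param_counts(n_mfcc * utterance_length, [512, 256, 64, 10])
total_params = sum(counts)
print(counts, total_params)
```

If `utterance_length` is changed, only the first layer's count changes, since it is the only one connected to the flattened input.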

## Visualization of the evolution of accuracy and loss

In [None]:
#accuracy
plt.plot(history.history['accuracy'])
plt.plot(history.history['val_accuracy'])
plt.title('Model accuracy')
plt.ylabel('Accuracy')
plt.xlabel('Epoch')
plt.legend(['Train', 'Test'], loc='upper left')
plt.show()

#loss
plt.plot(history.history['loss'])
plt.plot(history.history['val_loss']) 
plt.title('Model loss') 
plt.ylabel('Loss') 
plt.xlabel('Epoch') 
plt.legend(['Train', 'Test'], loc='upper left') 
plt.show()
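
Besides plotting them, the curves can also be read numerically. The sketch below uses a made-up dictionary with the same layout as `history.history` (one list per metric, one value per epoch) and looks for the epoch with the lowest validation loss; the numbers are invented for the example.

```python
# Made-up values, laid out like history.history
history_dict = {
    "loss": [2.1, 1.4, 0.9, 0.6, 0.4, 0.3],
    "val_loss": [2.0, 1.5, 1.1, 0.9, 1.0, 1.2],
}

val_loss = history_dict["val_loss"]

# Index (starting at 0) of the epoch with the lowest validation loss
best_epoch = min(range(len(val_loss)), key=val_loss.__getitem__)

# Gap between validation and training loss at the last epoch
final_gap = history_dict["val_loss"][-1] - history_dict["loss"][-1]

print("Best epoch:", best_epoch + 1, "| final gap:", round(final_gap, 2))
```

A training loss that keeps falling while the validation loss rises again, as in these invented numbers, is a typical sign of overfitting.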

## Saving the model

In [None]:
# serialize model to JSON
model_json = model.to_json()
with open("model/model.json", "w") as json_file:
    json_file.write(model_json)

# serialize weights to HDF5
model.save_weights("model/model.h5")
print("Saved model to disk")

## Loading the model saved on the files *model.h5* and *model.json*

In [None]:
# load json and create model
with open('model/model.json', 'r') as json_file:
    loaded_model_json = json_file.read()
loaded_model = model_from_json(loaded_model_json)

# load weights into new model
loaded_model.load_weights("model/model.h5")
print("Loaded model from disk")

# compile with the same loss and optimizer as during training
loaded_model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])

In [None]:
loaded_model.summary() #visualisation of the architecture of the network

## Prediction

How will our neural network handle recordings that it has never "heard" before? We are going to give it some new recordings and see whether it recognizes the right digits.

In [None]:
#getting the data used for prediction
X, y, spoken_digits = get_mfcc_batch(prediction_path, 256, utterance_length)

#getting the predictions that our neural network gives us
#(predict_classes was removed from recent TensorFlow/Keras versions, so we take the argmax of the predicted probabilities)
prediction_digits = np.argmax(loaded_model.predict(X), axis=-1)

accuracy = 0
total = len(prediction_digits)

for i in range(total):
    prediction_digit = prediction_digits[i]
    spoken_digit = spoken_digits[i]
    if int(prediction_digit) == int(spoken_digit):
        accuracy += 100 / total  #each correct prediction adds its share of the total
    print('Prediction : ' + str(prediction_digit) + ' | Right spoken digit : ' + str(spoken_digit) + '               => accuracy : ' + str(int(accuracy)) + '%')
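
The step from the network's output to a predicted digit can be illustrated without the model. The output layer uses a softmax, so each row of the prediction is a vector of probabilities, and the predicted class is the index of the largest one. The probabilities below are made up, with 3 classes instead of 10 to keep the example short.

```python
import numpy as np

# Made-up softmax outputs for 4 recordings and 3 classes
probabilities = np.array([
    [0.1, 0.7, 0.2],
    [0.8, 0.1, 0.1],
    [0.2, 0.3, 0.5],
    [0.6, 0.3, 0.1],
])
true_digits = np.array([1, 0, 2, 1])

# The predicted class is the index of the highest probability in each row
predicted_digits = np.argmax(probabilities, axis=-1)

# Accuracy: percentage of recordings whose prediction matches the label
accuracy = 100 * np.mean(predicted_digits == true_digits)
print(predicted_digits, accuracy)
```

Computing the accuracy from the number of recordings, rather than from a fixed count, keeps the percentage correct whatever the size of the prediction folder.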

## Bibliography

Jason Brownlee, Your First Deep Learning Project in Python with Keras Step-By-Step, Deep Learning, 2019, available at https://machinelearningmastery.com/tutorial-first-neural-network-python-keras/.

Jason Brownlee, How to Save and Load Your Keras Deep Learning Model, Deep Learning, 2019, available at https://machinelearningmastery.com/save-load-keras-deep-learning-models/.

Mohsin Baig, spoken-digit-recognition, available at https://github.com/moebg/spoken-digit-recognition/tree/master/data.

Adhish Thite, Digit Recognition from Sound, available at https://adhishthite.github.io/sound-mnist/.

Sanchit Tanwar, Building our first neural network in keras, 2019, available at https://towardsdatascience.com/building-our-first-neural-network-in-keras-bdc8abbc17f5.