# Speech Recognition Between YES and NO
Using Librosa and the Conv2D layer in Keras, this notebook predicts whether a speaker has said the word 'yes' or the word 'no'.

First, I loaded Keras, Librosa, NumPy and the os module:

In [1]:
import keras

Using TensorFlow backend.


In [2]:
import librosa
import librosa.display
import librosa.feature

In [3]:
import numpy as np 
import os
#import random

## Obtaining the Dataset

The dataset was obtained from the Google Research Blog. Here is the link to the blog: https://research.googleblog.com/2017/08/launching-speech-commands-dataset.html

The dataset was downloaded as a .tar file, so it had to be untarred before the data could be used. The archive contains many other words, but only the 'yes' and 'no' folders were used for this project. Both folders were copied into my local directory. The paths to the audio files in each folder were loaded into one array per folder, and the two arrays were then concatenated into one large array.


In [4]:
directory_in_str = ("C:\\Users\\Trisha Kadle\\Documents\\Machine_Learning\\New folder\\yes")
fn = []
directory_in_str1 = ("C:\\Users\\Trisha Kadle\\Documents\\Machine_Learning\\New folder\\no")
fn1 = []

fn2 = []

directory = os.fsencode(directory_in_str) 
for file in os.listdir(directory): 
    filename = os.fsdecode(file)
    if filename.endswith(".wav"): 
        fn.append(os.path.join(directory_in_str, filename))
        
directory1 = os.fsencode(directory_in_str1) 
for file in os.listdir(directory1): 
    filename1 = os.fsdecode(file)
    if filename1.endswith(".wav"): 
        fn1.append(os.path.join(directory_in_str1, filename1))

fn2 = np.concatenate((fn, fn1),axis = 0)
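
The two loops above do the same work for each folder, so they could be folded into one helper. The following is only a minimal sketch using the standard-library pathlib module; the helper name list_wav_files is hypothetical, and the folder names in the usage comment are placeholders rather than the paths used in this notebook. Unlike the loops above, it sorts the paths and matches the extension case-insensitively.

```python
from pathlib import Path


def list_wav_files(folder):
    """Return the sorted paths of all .wav files directly inside `folder`."""
    return sorted(
        str(p) for p in Path(folder).iterdir()
        if p.is_file() and p.suffix.lower() == ".wav"
    )


# Hypothetical usage with placeholder folder names:
# fn = list_wav_files("yes")
# fn1 = list_wav_files("no")
# fn2 = fn + fn1   # plain list concatenation keeps the 'yes' files first
```

Sorting makes the file order reproducible across runs, which os.listdir does not guarantee.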

## Getting the MFCC 

A model for speech recognition was found in this blog: 
https://blog.manash.me/building-a-dead-simple-word-recognition-engine-using-convnet-in-keras-25e72c19c12b
Modifications were made to how the mfcc_vec was created, how the yes_or_no label array was created and how the CNN was constructed.

For each audio file, the MFCCs (Mel-frequency cepstral coefficients) had to be found. MFCCs are a compact set of coefficients that summarize the short-term spectrum of the audio on the mel frequency scale. The following steps were taken to find them:
* Each audio file in fn2 was loaded using Librosa
* The signal was downsampled by keeping every third sample, to reduce the amount of computation needed
* The MFCCs were computed using Librosa
* Each MFCC matrix was zero-padded so that all MFCC matrices would be the same length
* The MFCCs for each sample were saved in a larger mfcc_vec array containing the MFCC values for all the audio samples


In [5]:
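# One (20, 11) MFCC matrix per audio file: 20 coefficients x 11 frames
mfcc_vec = np.empty((len(fn2), 20, 11))

for i, path in enumerate(fn2):
    # Load at the file's native sampling rate
    y, sr = librosa.load(path, mono=True, sr=None)
    # Keep every third sample to cut computation (no anti-aliasing filter)
    y = y[::3]
    # y is passed by keyword, which newer librosa releases require;
    # sr is left at 16000 as in the original experiment
    S = librosa.feature.mfcc(y=y, sr=16000)
    # Zero-pad the time axis so every clip has exactly 11 frames
    pad = 11 - S.shape[1]
    S = np.pad(S, pad_width=((0, 0), (0, pad)), mode='constant')
    mfcc_vec[i] = S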
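
The width of 11 used above follows from the clip length. The sketch below works through the arithmetic, assuming Librosa's default hop length of 512 samples with centered framing (one frame per hop, plus one) and a clip of at most one second at 16 kHz. The function names are hypothetical and only illustrate the calculation.

```python
def decimated_length(n_samples, step=3):
    """Length of y[::step] for a signal with n_samples samples."""
    return len(range(0, n_samples, step))


def mfcc_frame_count(n_samples, hop_length=512):
    """Frames produced by centered framing: one per hop, plus one.

    Assumption: hop_length=512 is the library default.
    """
    return 1 + n_samples // hop_length


one_second = 16000                        # samples in a 1 s clip at 16 kHz
kept = decimated_length(one_second)       # samples left after y[::3]
print(kept, mfcc_frame_count(kept))       # prints "5334 11"
```

Shorter clips give fewer than 11 frames, which is why the zero padding is needed. A clip longer than one second would give more than 11 frames, and the padding step would then fail because the pad width would be negative.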
                

## Create a 'Labels' Array

An array containing the label for each audio sample was created. If the audio sample was a 'yes', the corresponding label is 1. If the audio sample was a 'no', the corresponding label is 0. The number of 'yes' and 'no' labels was based on the lengths of the arrays containing the paths to the 'yes' and 'no' audio files.

In [6]:
yes_and_no_labels = []
yes_labels = np.full(len(fn), 1)
no_labels = np.full(len(fn1), 0)
yes_and_no_labels = np.concatenate((yes_labels,no_labels),axis = 0)

## Create Training and Test Data

The sklearn package was used to create the training and test data. The built-in function train_test_split was used to shuffle the data and then split it into 60% training data and 40% test data.
(The default value of shuffle is True in the train_test_split function.)

In [7]:
import sklearn.model_selection
from sklearn.model_selection import train_test_split
X_tr, X_ts, y_tr, y_ts = train_test_split(mfcc_vec, yes_and_no_labels, test_size=0.4)
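
As a quick sanity check on the 60/40 split, the sizes can be worked out by hand. This sketch assumes the test count is the fraction rounded up, with the remainder going to training; the function name split_sizes is hypothetical. The total of 4752 samples is taken from the training log later in this notebook (2851 + 1901).

```python
import math


def split_sizes(n_samples, test_size):
    """Return (n_train, n_test) for a fractional test_size.

    Assumption: the test count is rounded up and the rest is training data.
    """
    n_test = math.ceil(test_size * n_samples)
    return n_samples - n_test, n_test


print(split_sizes(2851 + 1901, 0.4))   # prints "(2851, 1901)"
```

Because the shuffle is random, a different split is produced on every run unless the random_state argument of train_test_split is set.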


## Reshaping the Data

To train a model with Conv2D, the input data had to be four-dimensional. The reshape function was used to add a trailing channel dimension of size 1 to the existing shape of (number_of_samples, 20, 11), giving (number_of_samples, 20, 11, 1).

In [8]:
from keras.utils import to_categorical

X_tr = X_tr.reshape(X_tr.shape[0], 20, 11, 1)
X_ts = X_ts.reshape(X_ts.shape[0], 20, 11, 1)


## Training the Data

The appropriate functions were loaded from Keras. 
A single Conv2D layer was followed by two dense layers. The final layer has 1 output and uses a sigmoid activation because this is a binary classification problem. The optimizer chosen was the Adam optimizer, and binary cross-entropy was used as the loss because the desired result is either a 0 or a 1. The model summary is printed below.

In [9]:
from keras.models import Model, Sequential
from keras.layers import Conv2D, Dropout, Flatten, Dense, Activation
from keras import optimizers

In [10]:
model = Sequential()
model.add(Conv2D(32, kernel_size=(2, 2), activation='relu', input_shape=(20, 11, 1)))
model.add(Dropout(0.25))
model.add(Flatten())
model.add(Dense(128, activation='sigmoid'))
model.add(Dropout(0.25))
model.add(Dense(1, activation='sigmoid'))

opt = optimizers.Adam(lr=0.001) # beta_1=0.9, beta_2=0.999, epsilon=1e-08, decay=0.0)
model.compile(optimizer=opt,
              loss='binary_crossentropy',
              metrics=['accuracy'])

model.summary()


_________________________________________________________________
Layer (type)                 Output Shape              Param #   
conv2d_1 (Conv2D)            (None, 19, 10, 32)        160       
_________________________________________________________________
dropout_1 (Dropout)          (None, 19, 10, 32)        0         
_________________________________________________________________
flatten_1 (Flatten)          (None, 6080)              0         
_________________________________________________________________
dense_1 (Dense)              (None, 128)               778368    
_________________________________________________________________
dropout_2 (Dropout)          (None, 128)               0         
_________________________________________________________________
dense_2 (Dense)              (None, 1)                 129       
Total params: 778,657
Trainable params: 778,657
Non-trainable params: 0
_________________________________________________________________
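
The shapes and parameter counts in the summary above can be checked by hand. The sketch below assumes a convolution without padding (the output shrinks by kernel size minus one in each direction) and one bias per filter or unit. The function names are hypothetical.

```python
def conv2d_valid_shape(height, width, kernel, filters):
    """Output shape of an unpadded convolution with stride 1."""
    kh, kw = kernel
    return height - kh + 1, width - kw + 1, filters


def conv2d_params(kernel, in_channels, filters):
    """Weights per filter plus one bias, times the number of filters."""
    kh, kw = kernel
    return (kh * kw * in_channels + 1) * filters


def dense_params(n_inputs, n_units):
    """One weight per input plus one bias, for each unit."""
    return (n_inputs + 1) * n_units


h, w, c = conv2d_valid_shape(20, 11, (2, 2), 32)
flat = h * w * c
total = (conv2d_params((2, 2), 1, 32)
         + dense_params(flat, 128)
         + dense_params(128, 1))
print((h, w, c), flat, total)   # prints "(19, 10, 32) 6080 778657"
```

Almost all of the parameters sit in the first dense layer (778,368 of 778,657), because it is connected to every one of the 6,080 flattened convolution outputs.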


## Getting the Model Accuracy

The model created above was trained on the training data and validated on the held-out test data. 50 epochs were used with a batch size of 100. The model ends with a final validation accuracy above 95%.

In [11]:
model.fit(X_tr, y_tr, batch_size=100, epochs=50, verbose=1, validation_data=(X_ts, y_ts))


Train on 2851 samples, validate on 1901 samples
Epoch 1/50
...
Epoch 50/50


<keras.callbacks.History at 0x1b8c864ae48>

## Test the Model With a New Audio File 

Using PyAudio, a new 1-second audio file with a sampling rate of 16,000 Hz was recorded and saved. The recorded audio file can be played below. The MFCCs of the new audio file are then found using Librosa, with the same preprocessing as the training data. The result is reshaped, and the model created above is used to predict whether the audio file is of a person saying 'yes' or saying 'no'. The predicted value is rounded to either a 1 or a 0 and the respective word is printed.

In [14]:
from IPython.display import Audio
import pyaudio
import wave

FORMAT = pyaudio.paInt16   # 16-bit samples
CHANNELS = 1               # mono
RATE = 16000               # sampling rate in Hz
DURATION = 1               # seconds
BLOCKSIZE = 1024           # frames read per buffer
WAVE_OUTPUT_FILENAME = "audio_samp.wav"

p = pyaudio.PyAudio()

stream = p.open(format=FORMAT,
                channels=CHANNELS,
                rate=RATE,
                input=True,
                frames_per_buffer=BLOCKSIZE)

print("Start Recording")

frames = []

for i in range(0, int(RATE / BLOCKSIZE * DURATION)):
    data = stream.read(BLOCKSIZE)
    frames.append(data)

print("Done")

stream.stop_stream()
stream.close()
p.terminate()

wf = wave.open(WAVE_OUTPUT_FILENAME, 'wb')
wf.setnchannels(CHANNELS)
wf.setsampwidth(p.get_sample_size(FORMAT))
wf.setframerate(RATE)
wf.writeframes(b''.join(frames))
wf.close()

fn_samp = "C:\\Anaconda3\\Scripts\\audio_samp.wav"
Audio(fn_samp, rate=RATE)

Start Recording
Done
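
Because the stream is read in whole blocks, the recording above is slightly shorter than one second. The sketch below repeats the loop-count arithmetic from the recording cell and, assuming Librosa's default hop length of 512 samples, checks that the clip still produces 11 MFCC frames after every third sample is kept.

```python
RATE = 16000
BLOCKSIZE = 1024
DURATION = 1

# Number of whole blocks read by the recording loop
n_blocks = int(RATE / BLOCKSIZE * DURATION)
n_samples = n_blocks * BLOCKSIZE
print(n_blocks, n_samples, n_samples / RATE)   # prints "15 15360 0.96"

# Samples left after y[::3], and frames assuming a hop length of 512
kept = len(range(0, n_samples, 3))
n_frames = 1 + kept // 512
print(kept, n_frames)                          # prints "5120 11"
```

The recorded clip is about 0.96 seconds long, which is still enough to fill the (20, 11) input without padding.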


In [15]:
fn_samp = "C:\\Anaconda3\\Scripts\\audio_samp.wav"

# Apply the same preprocessing used for the training clips
y, sr = librosa.load(fn_samp, mono=True, sr=None)
y = y[::3]
S = librosa.feature.mfcc(y=y, sr=16000)
pad = 11 - S.shape[1]
S = np.pad(S, pad_width=((0, 0), (0, pad)), mode='constant')
mfcc_samp = S

# Add the batch and channel dimensions expected by the model
sample_reshaped = mfcc_samp.reshape(1, 20, 11, 1)

# The sigmoid output is a probability; round it to 0 ('no') or 1 ('yes')
output = model.predict(sample_reshaped)
yes_or_no = np.around(output)

if yes_or_no == 1:
    print("You said YES")

if yes_or_no == 0:
    print("You said NO")

You said YES
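
Rounding the sigmoid output is the same as applying a decision threshold of 0.5. The sketch below makes that threshold explicit in plain Python; the function name label_from_probability is hypothetical, and the probabilities in the usage lines are made-up illustrative values, not outputs of the model above.

```python
def label_from_probability(p, threshold=0.5):
    """Map a sigmoid output in [0, 1] to the word it represents.

    Label 1 means 'yes' and label 0 means 'no', as in the labels array.
    """
    return "YES" if p > threshold else "NO"


print(label_from_probability(0.93))   # prints "YES"
print(label_from_probability(0.08))   # prints "NO"
```

Keeping the threshold as a parameter makes it easy to require more confidence before answering 'yes', for example by raising it above 0.5.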
