# EEL4511 Real-time DSP Applications Lab 9 Final Project

### Samuel Cuervo

### Git Repo: https://github.com/scuervo101/TI-F28379D-SPEECH-RECOGNITION

## Abstract:
Using the codec and the DSP, I will sample voice-triggered audio and attempt to recognize which word was spoken from a set of pre-trained words. The DSP will communicate back to the user over UART through a serial console on the computer. The voice recognition engine uses MFCC feature extraction, in which the audio is Hamming-windowed, transformed with an FFT, and mapped onto the mel spectrum to extract a smaller set of descriptive features. A KNN model then classifies those features to determine the label (which word) the audio sample belongs to.

# TI-F28379D-SPEECH-RECOGNITION

## DSP Speech Recognition

### Notes

- Planning to use KNN for classification and MFCC feature extraction for the audio files
- MFCC feature extraction relies on the FFT
- Features will be extracted from wav files and from a test set that I record using the microphone on the codec
- Not sure how many test files will be needed; planning on 20-30 per word
- Planning to use the TensorFlow Speech Commands data set for training
- Another possibility is to run an FFT on the data and test its pitch (yes high pitch, no low pitch)
- Python will handle the pre-computation of the ML model's parameters
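
The FFT pitch test mentioned in the notes could be sketched roughly as follows. This is only an illustration, not the project's implementation: the function name `dominant_frequency`, the 16 kHz sample rate, and the two synthetic tones are assumptions made for the example. The strongest spectral peak is also only a crude proxy for the pitch of real speech.

```python
import numpy as np

def dominant_frequency(signal, samplerate):
    """Return the frequency (Hz) of the strongest non-DC bin in the spectrum."""
    windowed = signal * np.hamming(len(signal))   # reduce spectral leakage
    spectrum = np.abs(np.fft.rfft(windowed))
    spectrum[0] = 0.0                             # ignore the DC component
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / samplerate)
    return freqs[np.argmax(spectrum)]

# Synthetic stand-ins for a "low" and a "high" utterance (assumed values)
samplerate = 16000
t = np.arange(samplerate) / samplerate
low_tone = np.sin(2 * np.pi * 150 * t)
high_tone = np.sin(2 * np.pi * 300 * t)

low_hz = dominant_frequency(low_tone, samplerate)
high_hz = dominant_frequency(high_tone, samplerate)
print(low_hz, high_hz)
```

A classifier built on this idea would compare the returned frequency against a cutoff chosen from recorded samples.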

### Functionality

The DSP will continuously sample the microphone. When audio is detected (a threshold is surpassed), it will begin recording the sampled data into the external SRAM. After recording, it will process the data in the background (ML processing, MFCC to KNN, or FFT). Responses from the DSP will be displayed over UART on a serial console.
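
The threshold-triggered recording described above might look like this in Python. It is a sketch under assumptions: `capture_on_threshold`, the threshold of 0.2, and the synthetic input stream are all hypothetical, and the real DSP would work on codec samples and write into SRAM rather than slice a NumPy array.

```python
import numpy as np

def capture_on_threshold(stream, threshold, record_len):
    """Return record_len samples starting at the first sample whose
    magnitude exceeds threshold, or None if the threshold is never crossed."""
    above = np.flatnonzero(np.abs(stream) > threshold)
    if above.size == 0:
        return None
    start = above[0]
    return stream[start:start + record_len]

# Quiet noise followed by a louder burst (synthetic stand-in for microphone input)
rng = np.random.default_rng(0)
stream = rng.normal(0.0, 0.01, 2000)
stream[800:1200] += np.sin(2 * np.pi * 0.05 * np.arange(400))

clip = capture_on_threshold(stream, threshold=0.2, record_len=400)
print(None if clip is None else len(clip))
```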

The plan is for the DSP to hold a simple conversation with the user, with replies that reflect the user's answer (e.g., if the answer is no, the DSP responds with "why not").

*Yes* and *No* will be the words initially featured. If the DSP can handle more, I will add *Hello* and *Goodbye*.

The conversation theme is the DSP taking over the world. An improper answer flashes the red LEDs; a correct answer flashes the blue LEDs.

The 8 bit LEDs will be used for debugging the state.

**Logical states:** 

- Idle (listening and waiting for a signal or a response on the microphone)
- Recording (Storing the sampling data into the SRAM)
- Processing (running ML processing or FFT; during this state, either save the next signal response or don't sample)

After processing is complete, trigger the DSP's response and send it through UART.
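
The three logical states can be modeled as a small state machine. The sketch below is illustrative only: the `State` enum, `next_state`, and the three event flags are hypothetical names, and the firmware on the DSP would be written in C.

```python
from enum import Enum, auto

class State(Enum):
    IDLE = auto()
    RECORDING = auto()
    PROCESSING = auto()

def next_state(state, sample_over_threshold, buffer_full, processing_done):
    """Return the next logical state given the current state and event flags."""
    if state is State.IDLE and sample_over_threshold:
        return State.RECORDING
    if state is State.RECORDING and buffer_full:
        return State.PROCESSING
    if state is State.PROCESSING and processing_done:
        return State.IDLE          # the response would be sent over UART here
    return state                   # no relevant event: stay in the same state

# Walk through one full cycle: idle -> recording -> processing -> idle
state = State.IDLE
trace = [state]
for events in [(True, False, False), (False, True, False), (False, False, True)]:
    state = next_state(state, *events)
    trace.append(state)
print([s.name for s in trace])
```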

# Python Speech Recognition Proof of Concept

Below is a test implementation of the DSP speech recognition using MFCC for feature extraction and KNN for classification.

It tests the classification between yes and no.

In another notebook, I will do proper K-fold testing to find the best possible parameters.

In [38]:
import numpy as np
from scipy.io import wavfile
from librosa.feature import mfcc

from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import GridSearchCV

## Data Set

I created a personal data set of the words yes and no (24 samples for each word) for the proof of concept. For the final training data, I will use the larger [TensorFlow Speech Commands](https://www.tensorflow.org/datasets/catalog/speech_commands) data set.

Below I import the data set


In [3]:
data = []
labels = []

# Read the sample data from DSP-DataSet: 24 recordings per word
for i in range(24):
    print(f"Reading data:yes_{i}.wav")
    samplerate, wav = wavfile.read(f"../../DSP-DataSet/yes_ds/yes_{i}.wav")
    data.append(wav)
    labels.append(1)  # label 1 = "yes"

    print(f"Reading data:no_{i}.wav")
    samplerate, wav = wavfile.read(f"../../DSP-DataSet/no_ds/no_{i}.wav")
    data.append(wav)
    labels.append(0)  # label 0 = "no"

# Stack into one float array; this requires every clip to have the same shape
data = np.array(data, dtype="float")
labels = np.array(labels)

Reading data:yes_0.wav
Reading data:no_0.wav
Reading data:yes_1.wav
Reading data:no_1.wav
Reading data:yes_2.wav
Reading data:no_2.wav
Reading data:yes_3.wav
Reading data:no_3.wav
Reading data:yes_4.wav
Reading data:no_4.wav
Reading data:yes_5.wav
Reading data:no_5.wav
Reading data:yes_6.wav
Reading data:no_6.wav
Reading data:yes_7.wav
Reading data:no_7.wav
Reading data:yes_8.wav
Reading data:no_8.wav
Reading data:yes_9.wav
Reading data:no_9.wav
Reading data:yes_10.wav
Reading data:no_10.wav
Reading data:yes_11.wav
Reading data:no_11.wav
Reading data:yes_12.wav
Reading data:no_12.wav
Reading data:yes_13.wav
Reading data:no_13.wav
Reading data:yes_14.wav
Reading data:no_14.wav
Reading data:yes_15.wav
Reading data:no_15.wav
Reading data:yes_16.wav
Reading data:no_16.wav
Reading data:yes_17.wav
Reading data:no_17.wav
Reading data:yes_18.wav
Reading data:no_18.wav
Reading data:yes_19.wav
Reading data:no_19.wav
Reading data:yes_20.wav
Reading data:no_20.wav
Reading data:yes_21.wav
Reading d

## MFCC Feature Extraction

To reduce the model complexity, MFCC extracts 13 coefficients that describe the audio (`n_mfcc = 13` in the code below), each averaged over time. This reduces the size of each sample from 44100 elements (1 sec audio clips) to 13 elements.

The MFCC function used here comes from the librosa library. For the DSP, I will write my own MFCC function in C using the DSP's FFT function and applying the filtering manually.

In [29]:
n_fft = 512  # defined but not passed to mfcc() below, so librosa's default FFT size applies
n_mfcc = 13
samplerate, _ = wavfile.read("../../DSP-DataSet/yes_ds/yes_0.wav")

# Compute the MFCC matrix (n_mfcc x frames) for every clip
features = np.array([mfcc(y=clip, sr=samplerate, n_mfcc=n_mfcc) for clip in data])

# Average each coefficient over time so every clip becomes one n_mfcc-element vector
processed_features = features.mean(axis=2)

features = processed_features
processed_features.shape

(48, 13)
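
The manual MFCC computation planned for the C port can be prototyped in NumPy first. The sketch below follows the usual textbook steps (Hamming window, FFT power spectrum, triangular mel filterbank, log, DCT-II) for a single frame. It is not the project's C code: the helper names, the 26 filters, and the 16 kHz test tone are assumptions, and librosa uses different scaling and defaults, so the values will not match its output numerically.

```python
import numpy as np

def hz_to_mel(hz):
    return 2595.0 * np.log10(1.0 + hz / 700.0)

def mel_to_hz(mel):
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)

def mel_filterbank(n_filters, n_fft, samplerate):
    """Triangular filters spaced evenly on the mel scale, shape (n_filters, n_fft // 2 + 1)."""
    mel_points = np.linspace(hz_to_mel(0.0), hz_to_mel(samplerate / 2.0), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_points) / samplerate).astype(int)
    bank = np.zeros((n_filters, n_fft // 2 + 1))
    for m in range(1, n_filters + 1):
        left, center, right = bins[m - 1], bins[m], bins[m + 1]
        for k in range(left, center):          # rising edge of the triangle
            bank[m - 1, k] = (k - left) / (center - left)
        for k in range(center, right):         # falling edge of the triangle
            bank[m - 1, k] = (right - k) / (right - center)
    return bank

def mfcc_frame(frame, samplerate, n_filters=26, n_mfcc=13):
    """MFCCs of one frame: Hamming window -> FFT power -> mel energies -> log -> DCT-II."""
    n_fft = len(frame)
    power = np.abs(np.fft.rfft(frame * np.hamming(n_fft))) ** 2
    energies = mel_filterbank(n_filters, n_fft, samplerate) @ power
    log_energies = np.log(energies + 1e-10)    # small offset avoids log(0)
    n = np.arange(n_filters)
    k = np.arange(n_mfcc)[:, None]
    dct_basis = np.cos(np.pi * k * (2 * n + 1) / (2 * n_filters))
    return dct_basis @ log_energies

# One 512-sample frame of a 440 Hz tone at an assumed 16 kHz sample rate
samplerate = 16000
frame = np.sin(2 * np.pi * 440 * np.arange(512) / samplerate)
coeffs = mfcc_frame(frame, samplerate)
print(coeffs.shape)
```

Averaging the per-frame coefficients over all frames of a clip would give one 13-element vector per clip, matching the shape produced above.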

## KNN Parameter Training

I split the data set so that the model is trained on a separate group from the one used for testing. Before I can train the KNN model, I must run some form of validation to find the best parameters. I am using the sklearn class GridSearchCV, which cross-validates every combination in the parameter grid and reports the best-scoring set of parameters.

In [54]:
features_train, features_test, labels_train, labels_test = train_test_split(features,labels,test_size=0.25)

# Testing parameters to find the best set
params = {
    "n_neighbors" : list(range(3, 21, 2)),
    "weights" : ["uniform", "distance"],
    "metric" : ["euclidean", "manhattan", "chebyshev"]
}

GSCV = GridSearchCV(KNeighborsClassifier(), params, verbose=1, cv=3, n_jobs=-1)

In [55]:
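# Note: the search is fit on the full data set, so the held-out test split is also seen during parameter selection
results = GSCV.fit(features, labels)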

Fitting 3 folds for each of 54 candidates, totalling 162 fits
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 12 concurrent workers.
[Parallel(n_jobs=-1)]: Done  26 tasks      | elapsed:    2.7s
[Parallel(n_jobs=-1)]: Done 162 out of 162 | elapsed:    2.9s finished
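
The candidate and fit counts in the log follow directly from the parameter grid. The short sketch below enumerates the same combinations with the standard library to show where 54 and 162 come from. The enumeration order is not claimed to match GridSearchCV's.

```python
from itertools import product

params = {
    "n_neighbors": list(range(3, 21, 2)),               # 3, 5, ..., 19 -> 9 values
    "weights": ["uniform", "distance"],                 # 2 values
    "metric": ["euclidean", "manhattan", "chebyshev"],  # 3 values
}

# Every combination of one value per parameter
candidates = [dict(zip(params, combo)) for combo in product(*params.values())]
n_candidates = len(candidates)
n_fits = n_candidates * 3     # cv=3 folds per candidate
print(n_candidates, n_fits)
```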


In [49]:
print("The best parameters for KNN classification are: ")
print(results.best_params_)
print("\nWith a score of: " + str(results.best_score_))

The best parameters for KNN classification are: 
{'metric': 'euclidean', 'n_neighbors': 3, 'weights': 'uniform'}

With a score of: 1.0


### KNN Model Training

After finding the best parameters, I train the KNN model with the training data set and test it with the test data set. Once the model has been fit, I can export its stored training vectors and labels (KNN keeps the training samples rather than learned weights) to implement it in C for the DSP.

**Note:** Due to the small sample size, the grid search reaches a perfect cross-validation score, which could mean the model is overfitting or would be inaccurate in a real-world scenario. I will be expanding the data set later.
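
Because a KNN classifier stores its training samples instead of learned weights, a C port mainly needs the training vectors, their labels, and a distance loop. The NumPy sketch below shows that logic for the selected parameters (3 neighbors, Euclidean distance, uniform weights). The function name `knn_predict` and the toy 2-D data are made up for the example, and this is not sklearn's internal implementation.

```python
import numpy as np

def knn_predict(train_features, train_labels, query, k=3):
    """Classify one feature vector by majority vote among its k nearest
    training vectors (Euclidean distance, uniform weights)."""
    distances = np.sqrt(np.sum((train_features - query) ** 2, axis=1))
    nearest = np.argsort(distances)[:k]          # indices of the k closest samples
    votes = np.bincount(train_labels[nearest])   # count labels among the neighbors
    return int(np.argmax(votes))

# Toy 2-D data standing in for the 13-element MFCC vectors (made-up values)
train_features = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
                           [1.0, 1.0], [0.9, 1.1], [1.1, 0.9]])
train_labels = np.array([0, 0, 0, 1, 1, 1])

label_a = knn_predict(train_features, train_labels, np.array([0.05, 0.05]))
label_b = knn_predict(train_features, train_labels, np.array([1.0, 0.95]))
print(label_a, label_b)
```

On the DSP, the square root can be skipped, since comparing squared distances gives the same neighbor ordering.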

In [53]:
knn = KNeighborsClassifier(n_neighbors=3, metric="euclidean", weights="uniform")
knn = knn.fit(features_train,labels_train)

pred = knn.predict(features_test)

print("Predicted labels for the test set: ")
print(pred)
print("Expected labels for the test set: ")
print(labels_test)

Predicted labels for the test set: 
[0 1 1 0 0 0 1 0 0 1 0 0]
Expected labels for the test set: 
[0 1 1 0 0 0 1 0 1 1 0 0]
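
The two printed arrays can be compared directly to get the test accuracy. The snippet below copies the labels from the output above and counts the matches.

```python
import numpy as np

# Labels copied from the printed output above
pred = np.array([0, 1, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0])
expected = np.array([0, 1, 1, 0, 0, 0, 1, 0, 1, 1, 0, 0])

correct = int(np.sum(pred == expected))
accuracy = correct / len(expected)
missed = np.flatnonzero(pred != expected)   # indices of misclassified samples
print(correct, accuracy, missed)
```

For this split, 11 of the 12 test samples are classified correctly, and the one miss is a *yes* sample (index 8) predicted as *no*.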
