This is a Wake Word Detection model for Voice Assistant, which can detect/recognize user's voice (after trained), wake and response to user whenever a wake word from user's speech is detected.
***Wake Word: the word/command that used for wake the voice assistant up, ex: Hey Siri, Hey Google,...
Objective: Build a Wake Word Detection model for Voice Assistant just by PyTorch, no API. The model will be a binary classification RNN (LSTM variant) and must be light weight so that it can be used on small machine (like raspberry pi or personal laptop)
- Collect data in the form of .wav files, environment sounds or random speech will be zero labeled data, user's voice saying the wake word will be one labeled data.
- Use the MFFCs technique to process the data.
- Build a binary classification RNN (LSTM variant) model.
- Split collected data into train set and test set, then processing them before puting them to training phase along with the model.
- Train and optimize the model, testing with demo script.
- Try to put the model into raspberry pi
In sound processing, the mel-frequency cepstrum (MFC) is a representation of the short-term power spectrum of a sound, based on a linear cosine transform of a log power spectrum on a nonlinear mel scale of frequency.
Mel-frequency cepstral coefficients (MFCCs) are coefficients that collectively make up an MFC. They are derived from a type of cepstral representation of the audio clip (a nonlinear "spectrum-of-a-spectrum"). The difference between the cepstrum and the mel-frequency cepstrum is that in the MFC, the frequency bands are equally spaced on the mel scale, which approximates the human auditory system's response more closely than the linearly-spaced frequency bands used in the normal spectrum. This frequency warping can allow for better representation of sound, for example, in audio compression that might potentially reduce the transmission bandwidth and the storage requirements of audio signals.
MFCCs are commonly derived as follows:
- Take the Fourier transform of (a windowed excerpt of) a signal.
- Map the powers of the spectrum obtained above onto the mel scale, using triangular overlapping windows or alternatively, cosine overlapping windows.
- Take the logs of the powers at each of the mel frequencies.
- Take the discrete cosine transform of the list of mel log powers, as if it were a signal.
- The MFCCs are the amplitudes of the resulting spectrum.
For more infomation: https://en.wikipedia.org/wiki/Mel-frequency_cepstrum https://www.analyticsvidhya.com/blog/2021/06/mfcc-technique-for-speech-recognition/
With the above configuration, the model contains 87249 parameters.
** After multiple times of training, I found that the best result can be obtained by distributing the data to 90/10 (90% for non-wake word, 10% for wake word).
For one labeled file, I recorded about 100 times of my own voice saying the wake word "Hey Eva" (2 seconds each), then replicated them by 70 times. For zero label, I recorded the surrounding sound, random speech from me and other people, split them in to files with 2 seconds lenght. Finally trained for 50 epochs with over 62.000 zero labeled and 7.000 one labeled audio files. I got these results:
- Best test accuracy:
- Classification Report:
After running engine.py with optimized model, I saw that the model worked well, it could easily recognize my voice saying the wake word and response back, but it not perfect, there are some pros and cons of the model:
Pros:
- Light weight
- Good sensitivity
- Short training time
- Can easily train on CPU or Colab
Cons:
- The model is hard to work in a noisy environment
- The voice assistant could wake up with user's random speech/non-wake word (rarely occur)
- If user said something similiar to the wake word, the voice assistant could also detect it as the wake word (ex: my wake word is "hey Eva", saying "hey Sheva" or even just "Eva" will also wake the assistant up)
- python3
- portaudio (for recording with pyaudio to work)
- using a linux/mac machine or wsl2 on windows (because torchaudio may not work on windows)
pip install -r requirements.txt
If you are running with just the cpu
docker build -f cpu.Dockerfile -t voiceassistant .
If you are running on a cuda enabled machine
docker build -f Dockerfile -t voiceassistant .
For more details make sure to visit these files to look at script arguments and description
wakeword/neuralnet/train.py
is used to train the model
wakeword/neuralnet/optimize_graph.py
is used to create a production ready graph that can be used in engine.py
wakeword/engine.py
is used to demo the wakeword model
wakeword/scripts/collect_wakeword_audio.py
- used to collect wakeword and environment data
wakeword/scripts/split_audio_into_chunks.py
- used to split audio into n second chunks
wakeword/scripts/split_commonvoice.py
- if you download the common voice dataset, use this script to split it into n second chunks
wakeword/scripts/create_wakeword_jsons.py
- used to create the wakeword json for training
For more details make sure to visit these files to look at script arguments and description
-
collect data
- environment and wakeword data can be collected using
python collect_wakeword_audio.py
cd VoiceAssistant/wakeword/scripts mkdir data cd data mkdir 0 1 wakewords environment python collect_wakeword_audio.py --sample_rate 8000 --seconds 2 --interactive --interactive_save_path ./data/wakewords python collect_wakeword_audio.py --sample_rate 8000 --save_path ./data/environment/env.wav
- to avoid the imbalanced dataset problem, we can duplicate the wakeword clips with
python replicate_audios.py --wakewords_dir data/wakewords/ --copy_destination data/1/ --copy_number 50
- be sure to collect other speech data like random speech or common voice dataset. split the data into n seconds chunk with
split_audio_into_chunks.py
. - put data into two seperate directory named
0
and1
.0
for non wakeword,1
for wakeword. usecreate_wakeword_jsons.py
to create train and test json - create a train and test json in this format...
// make each sample is on a seperate line {"key": "/path/to/audio/sample.wav, "label": 0} {"key": "/path/to/audio/sample.wav, "label": 1}
- environment and wakeword data can be collected using
-
train model
-
use
train.py
to train model (--no_cuda for training on just cpu)python train.py --save_checkpoint_path checkpoint/ --train_data_json data_json/train.json --test_data_json data_json/test.json --no_cuda
-
after model training use
optimize_graph.py
to create an optimized pytorch model
-
-
test
- test using the
engine.py
script
- test using the
NOTE: the test response will just randomly play audio file in demo_voice/terminator (Terminator's voice), which was downloaded from https://www.soundboard.com/sb/Pappa
*** The code was modified from a part of Michael Nguyen's AI Voice Assistant, license included.