
Wakeword Project

secretsauceai edited this page Feb 18, 2022 · 36 revisions


Wakewords are an often overlooked part of the voice assistant pipeline. However, without a wakeword, the voice assistant would always have to transcribe (ASR) everything the microphone hears and hope for the best. This is exactly why every voice assistant uses a wakeword to trigger ASR transcription.

Problem statement

Creating a production quality wakeword using FOSS solutions is much harder and requires more resources than the average FOSS developer/user has.

Quality requirements

FIXED requirements

Production quality of a wakeword is defined as follows:

  • A user can say the wakeword 5 times consecutively and each time results in a wake up trigger.
  • A user encounters at most 5 false wake ups per week of continuous usage.

FLEXED requirements

  • The amount of data needed to create a wake word
  • Which devices can run the engine in real time

User stories

  • As a developer/user I want to create my own production quality wakeword, so that I can run it with a wakeword engine to detect my custom wakeword.
  • As a developer/user I want to run my custom wakeword on a phone so that I can use my phone for waking up a voice assistant.

Currently, these user stories are out of reach for the FOSS community. Let's look into how wakewords are currently produced:

Our FOSS developer Dan wants to make a wakeword where a user can say "Hey Lama" to trigger it.

What steps does he need to take to make this wish a reality?









1. Collect data

Collect data from thousands of people saying the wakeword and 'not-wakeword'.

To ensure that the wakeword works for as many people as possible, it is important to collect as much data as possible.

Unfortunately, such massive data collection strategies are difficult to implement when a project doesn't have the resources for such a commitment.





2. Train a model

When training a wakeword model, you might think you could just throw in your data, run the training script, and out comes your production quality wakeword model. It's a bit more complicated than that. Producing a model that doesn't lock onto specific features (ie one that triggers every time someone says "hey"), wakes up every time for anyone, and doesn't falsely wake up is difficult and time consuming.




3. Testing

Test it on thousands of other people and see if it works for everyone, repeat all of the steps until you have reached production quality.






Project solution

Users can create their own production quality wakeword model using FOSS tools with their own sparse data and use the model in real time to spot wakewords.

By focusing on a TinyML approach, where wakeword models are made to work for a specific user in their environment rather than as a general wakeword model for everyone, the resources required to create a wakeword are greatly reduced. In addition, more advanced data collection, data generation, and machine learning techniques can further reduce the data requirements to produce a production quality wakeword for individual users.

The wakeword project is split into three project phases:

A user follows a data collection recipe

It all starts with user data collection for the wakeword and not-wakeword categories. A user can use the Wakeword Data Collector.

TTS voice data generation

When you don't have enough data to train a model, generate it. TTS voices are scraped in a similar way to the data collection recipe, using TTS plugins from OpenVoiceOS. The more the better!

Best model selection

How do you know if your test-training distribution yields the best model? With big data sets, randomly splitting once (ie 80/20%) is usually good enough. With sparse data sets, however, the initial test-training split becomes more important. By splitting the data set many times and training experimental models, the best initial data distribution can be found. This step can boost the model's performance on the training set by as much as ~10%.
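
As a sketch, the split-search idea above can be expressed in a few lines of Python (the `score_fn` callback is a hypothetical stand-in for training and evaluating a quick experimental model; it is not part of the actual Model Maker):

```python
import random

def best_split(files, n_trials=5, test_frac=0.2, score_fn=None, seed=0):
    """Try several random train/test splits and keep the one whose
    experimental model scores best. score_fn(train, test) stands in
    for training a quick model and returning its test score."""
    rng = random.Random(seed)
    best = None
    for _ in range(n_trials):
        shuffled = files[:]
        rng.shuffle(shuffled)
        cut = int(len(shuffled) * (1 - test_frac))
        train, test = shuffled[:cut], shuffled[cut:]
        score = score_fn(train, test)
        # keep the split that produced the best-scoring experimental model
        if best is None or score > best[0]:
            best = (score, train, test)
    return best
```

The full data set then gets re-split once, using whichever shuffle produced the strongest experimental model.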

Incremental and curriculum learning

Only add false positives(*) to the training/test set. Why add a bunch of files the model can already classify correctly, when you can give the model lessons where it needs to improve?

Speaking of lessons, you don't learn by reading pages of a textbook in a totally random order, do you? Why should a machine learning model be subjected to this added difficulty in learning? Let the machine learn with an ordered curriculum of data. This usually boosts the model's performance over the shotgun approach by 5%-10%. Not bad!

(*)NOTE: This actually worsens the raw score of the model, because it only trains and tests on hard-to-learn examples instead of giving the model an easy A. But honestly, if you are getting 98% on your test and/or training set and the model doesn't actually work correctly in the real world, you really need to reconsider your machine learning strategy. ;)
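
The false-positive mining step might look like this in Python (a minimal sketch; `model_predict` is a hypothetical stand-in for running the current model on a clip and returning its confidence):

```python
def mine_false_positives(model_predict, not_wake_clips, threshold=0.5):
    """Run the current model over known not-wake clips and keep only
    the ones it wrongly triggers on; these hard examples are what get
    added back into the training/test set."""
    return [clip for clip in not_wake_clips
            if model_predict(clip) >= threshold]
```

Everything the model already classifies correctly is left out, which is exactly why the raw score drops while real-world behavior improves.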

Noise generation recipes

Gaussian noise (static) is mixed into the pre-existing audio recordings; this makes the model more robust and helps with generalization.

A user can use other noisy data sets (ie pdsounds) to mix background noise into existing audio files, further ensuring a robust model that can wake up even in noisy environments.
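
A minimal sketch of the Gaussian mixing idea, assuming the audio has been loaded as a float numpy array (this is not the project's actual noise script, just the underlying math):

```python
import numpy as np

def mix_gaussian_noise(samples, snr_db):
    """Mix white (Gaussian) noise into an audio signal at a target
    signal-to-noise ratio in dB. `samples` is a 1-D float array."""
    signal_power = np.mean(samples ** 2)
    # scale the noise power so that 10*log10(signal/noise) == snr_db
    noise_power = signal_power / (10 ** (snr_db / 10))
    noise = np.random.normal(0.0, np.sqrt(noise_power), samples.shape)
    return samples + noise
```

The same function can be called at several SNR levels to produce the different noise intensities mentioned later in the collection method.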

Wakeword theory

Typically, a wakeword model is a binary acoustic model. Precise uses a GRU-RNN, which is a good choice for this type of data: the data is a time series, and RNNs handle that kind of data very well. An LSTM is much more computationally expensive to run than a GRU, and studies have shown (1, 2) that a GRU can perform on par with an LSTM while being computationally cheaper.

The absolute best performance could be achieved with a CRNN; however, this adds a convolutional layer, making it more computationally expensive than a GRU alone. The data requirements of a CNN also tend to be higher, therefore the hypothesis is:

due to the constraints of sparsity and computation, a single-layer GRU is the optimal approach.
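
For illustration, one GRU time step can be written out in plain numpy. A GRU computes three weighted combinations (update gate, reset gate, candidate state) versus an LSTM's four, which is where the computational saving comes from (the weight names here are illustrative, not taken from Precise):

```python
import numpy as np

def gru_step(x, h, Wz, Uz, Wr, Ur, Wh, Uh):
    """One GRU time step: x is the input frame (e.g. MFCC features),
    h is the previous hidden state; W* act on x, U* act on h."""
    sigmoid = lambda a: 1.0 / (1.0 + np.exp(-a))
    z = sigmoid(Wz @ x + Uz @ h)              # update gate
    r = sigmoid(Wr @ x + Ur @ h)              # reset gate
    h_cand = np.tanh(Wh @ x + Uh @ (r * h))   # candidate state
    # blend old state and candidate according to the update gate
    return (1 - z) * h + z * h_cand
```

In the wakeword model, the final hidden state (or its last output) feeds a single sigmoid unit that scores wake vs. not-wake.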

Collection method

Generation of data

Because the data is so sparse, more data is required to keep a balanced data set. In addition, creating noisy background data is required for the wakeword to perform in noisy environments. Experimentally, it was found that without the generated data the model had NO flexibility for any background noise when performing wake ups. Although some have claimed that introducing random Gaussian noise into samples creates totally new data, this hasn't held up in observation: if a clean file fails in testing, its Gaussian counterpart will also always fail. No matter the noise, the model uses the same feature set to classify both samples; therefore, it cannot learn new features from the generated data.

Experimentally, four noise levels were selected. It was found that if the noise levels were too high, the model would learn to identify the noise itself, even though examples of this noise were present with the same balance in the not-wake background noise data for both training and testing.

For the production model, which included a full data collection for 2 individuals (male and female), 28 TTS samples were also used. It is important to filter out any TTS samples that sound too robotic or mispronounce the wakeword. The current hypothesis on recommended TTS requirements (in addition to at least one complete data collection from an individual) is 28 samples; however, the more there are, the better the model will generally perform. Future experiments will try to reach 40-50 TTS voices.

Modeling method

See Precise Wakeword Model Maker

  • initial training and model selection
  • incremental training

Testing method

  • mics
    • the collections were performed in the production environment with a PlayStation Eye microphone
    • further testing was performed on a Blue Snowball which was not used for any collection (ensuring a more general model independent of the mic used for collection)
  • The wakeword was tested for true-positive five times in a row, as well as for a week in the production environment
  • false wake ups: the model was run for a week constantly in a very busy room. The goal was to reduce the false wake ups to under 5 a week.
  • production environment: living room with very high ceilings, tests were conducted as far as 12 meters away from the microphone.

Results

Interestingly, the room acoustics and microphone properties played much less of a role in these models than hypothesized. Only when the models have been trained incrementally on ~25k samples or more do the microphone properties start to play a role. This is counteracted by the TTS collection and the noise generation.

Before optimizing the data, the model was often unbalanced due to the incremental training. However, in experiments where the data optimization method was used, it reached the limits of the data set and resulted in a balanced data set.

Recommended reading

Neural Network-based Small-Footprint Flexible Keyword Spotting

Wake Word Detection Using Recurrent Neural Networks

Wakeword Data Collector

The Wakeword Data Collector is currently in a complete functional prototype phase. It solves the tricky data collection problem of wakewords.

Simply put, there are very specific minimum data requirements of the two data classes to classify wake words (wake-word and not-wake-word). These requirements are not generally known and best practices aren't usually followed.

The basic use case is: "What do I need to record to create a successful wakeword model using, at minimum, one user's voice?"

The purpose of the prototype is to experimentally determine the following parameters required to successfully create a production model (the user successfully uses the wakeword 5 times in a row, and there are no false triggers during at least 1 hour of testing with half an hour of TV and half an hour of user conversation):

  • Number of wakeword recordings
  • Number of variants (deeper or higher pitched voice, faster, further away from mic, closer, etc.)
  • Number of background ambient noise recordings
  • Number/Length of speaking or conversational not-wake recordings (when the user is talking and the model is trained to their voice, the likelihood of wake up is high, therefore users need to record themselves talking in general)
  • Recordings of individual syllables of the wakeword (the classifier can get stuck on identifying one sound in the whole wake word, ie for 'hey jarvis' it could falsely trigger for 'hey', 'jar', 'vis', 'hey jar' or 'jarvis')

The next phase of the wakeword data collection tool is to create actual software that could be run via website or as a skill for a user to easily collect wakeword data.

This is a good reference for wake word training.

Wakeword Model Maker

The Wakeword Model Maker generates TTS voices for wakeword data. It automatically splits the training and test set, including sub-classes. It finds the optimal split by training several models and selecting the best model. Then it generates both Gaussian and background noise files. Finally it trains on several other data sets incrementally, looking for false positives which are added to the data set.

It should be considered a functional prototype.

Engine Improvements

Currently, the only open source wakeword system that uses anything close to modern machine learning (I am looking at you, Sphinx!) is Mycroft's Precise. It uses Tensorflow 1.13 with Keras to generate a GRU-RNN based on the collected data. It has a lot of undocumented features, such as introducing background noises into wake and non-wake audio files, and it even has an unofficial branch for Tensorflow 2.3.1.

TF Lite

To improve the current state, the TF 2.3.1 branch was forked, some minor bug fixes were implemented, and it has been run and tested. The goals:

  • implement a Tensorflow lite model, which is lighter than the current models (26kb) to speed up the model and reduce the resources required to run the model (done)
    • the model currently exported to tflite is 21kb (without optimization methods)
  • implement TFLite post training quantization to further compress the model
  • test uncompressed vs various compressed models to determine speed vs accuracy optimization
  • deploy forked repo using TF 2.3.1 on raspi4 aarch64 (ARM64) (done)
    • deploy on Mycroft (done)
  • create lighter TFLite runner (binary?) for raspi4 aarch64 (arm64) (done)
  • deploy TFLite runner on Mycroft (done)
  • benchmark CPU usage of TFLite runner vs whole deployed repo (TF 2.3.1) vs original precise (~25-30% base CPU usage) (TF2 uncompressed easy deployment: ~13-20% CPU) (done)
  • document the deployment and usage of each component and improve existing documentation (done)

Note on Precise TF1 and TF2: Interestingly, the model can be trained using the TF1 version and then still be converted to TF2. Why would you want to do this? Although it has been reported that the latest version of TF no longer has performance issues in training, training with TF2 is still much slower than TF1. Therefore, it is recommended to train the model in TF1 and export it to TF2 (or TFLite).

Precise in Rust

Why? Well, Rust people will say it's faster, but the real reason is to have a nice binary that can run on any device, like phones.

The Precise runner was ported to Rust using the Rust MFCC crate. From the tests performed, the results showed performance ~10 times worse than the Python version. The problem seemed to be the Rust MFCC crate's performance. Therefore, the SpeechPy library is being ported to Rust for a quicker implementation, which will be benchmarked against the Python version.

How to: Train a Model With Precise Manually

  • Make sure you first meet the data requirements and follow best practices (and here) in regards to data collection.
    • check to make sure all datasets are using the correct audio file format (the wakeword-recorder-py uses this format already):
      • wave file format
      • channels: 1
      • sample frequency: 16000
  • Optimally split your data (use the wakeword-data-prep script for this) otherwise do it manually:
    • Perform a random 80/20 split on the base wake-word root directory (not variations!) and the background noise (not-wake-word/background) directory
    • 50/50 split for all other categories (ie variations, paragraph) based on their variation 'pairs' (further documentation to come, but it's already built into the wakeword-data-prep script)
    • create several models from randomly shuffling the data (ie 5 random shuffles)
    • select best model
    • add background noise to samples
  • Gaussian (run notebook or wakeword-data-prep script)
    • precise-add-noise (dataset source folder) (background sounds folder) (output folder)
      • not currently implemented in the script, works with precise as a command
      • download the following data sets for noise (best practice: add them as sub directories to the random directory):
  • (if you don't use the script and want to train manually) Find optimal number of epochs: precise-train -e 600 jarvis_rebooted.net jarvis_rebooted/
    • For the first training, only train on base wake-word and not-wake-word (no noise, random sound files, etc.)
    • Once you are sure you are hitting ~93%-95% move on
  • Test the model: precise-test jarvis_rebooted.net jarvis_rebooted/
    • What is failing? Why?
  • Use the model: precise-listen jarvis_rebooted.net
    • Does it work? Does it work for just part of the wakeword (ie 'hey')?
    • Remember this is the weakest model you will build; it will trigger on almost any input!
  • Incremental training: precise-train-incremental jarvis_rebooted.net jarvis_rebooted/ -r jarvis_rebooted/random/
    • First incrementally train on the random conversational and TV recordings (ie from wake-word-recorder.py)
    • Incremental training on random sounds also!
    • once done, run your model for roughly 300 epochs with the normal training method
    • Run a test and determine where it fails and why
    • Use the model: does it work? Does it wake up for parts of the wakeword only?
    • Ideally this final model will have few false wake ups but will detect the wake word every time, even in a noisy environment.
  • Convert model: precise-convert jarvis_rebooted.net
  • Deploy model
  • Test model in production
    • Say the wake word 5 times
    • Let the model run for 2h (at least 1h random conversation + 1h TV)
  • If model passes: congratulations
  • If model fails: back to the steps all over again!

How to: Install Tensorflow 2.3.1 on a Raspberry Pi 4 with aarch64 (arm64)

I got this from here.

To run the whole Precise repo, you have to install Tensorflow 2.3.1. For most platforms this is easy, but there are specific steps for a raspi4 aarch64.

  • get a fresh start (remember, the 64-bit OS is still under development)
$ sudo apt-get update
$ sudo apt-get upgrade
  • install pip and pip3 (but I am sure you already have this!)
$ sudo apt-get install python-pip python3-pip
  • remove old versions, if not placed in a virtual environment (let pip search for them)
$ sudo pip uninstall tensorflow
$ sudo pip3 uninstall tensorflow
  • install the dependencies (if not already onboard)
$ sudo apt-get install gfortran
$ sudo apt-get install libhdf5-dev libc-ares-dev libeigen3-dev
$ sudo apt-get install libatlas-base-dev libopenblas-dev libblas-dev
$ sudo apt-get install liblapack-dev
  • If you are doing this for a specific python env (which you should!) then drop the sudo -H in the rest of the instructions
  • upgrade setuptools 47.1.1 -> 50.3.0
$ sudo -H pip3 install --upgrade setuptools
$ sudo -H pip3 install pybind11
$ sudo -H pip3 install Cython==0.29.21
  • install h5py with Cython version 0.29.21 (± 6 min @1950 MHz)
$ sudo -H pip3 install h5py==2.10.0
  • install gdown to download from Google drive
$ pip3 install gdown
  • download the wheel (seriously, last time I checked you have to get it from some dude's google drive... When will they release this package officially? Also if you don't trust this part, you will have to make the wheel yourself..)
$ gdown https://drive.google.com/uc?id=1jbkp2rSZZ3YY-AM1vuHyB9hI05zrZGHg
  • install TensorFlow (± 63 min @1950 MHz)
$ sudo -H pip3 install tensorflow-2.3.1-cp37-cp37m-linux_aarch64.whl

How to: Quick and Dirty Tensorflow Lite Precise Model Deployed in Mycroft

This briefly describes how to get a TFLite model for Precise (TF 2.3.1) running directly in Mycroft (it is strongly recommended you follow the instructions for installing TF 2.3.1 first!).

  • backup the TF 1.13 Precise engine in: ~/.mycroft/precise/precise-engine, i.e. precise-engine-old
  • copy the Tensorflow 2.3.1 Tensorflow lite (TFLite) Precise repo to:
    • ~/.mycroft/precise/ and rename the directory to precise-engine
      • Note: This isn't the slimmest version of the runner; it needs to be cut down in the future, perhaps even turned into a binary!
  • add the .tflite model to the ~/.mycroft/precise directory
    • Question: does the params file also need to be included? (No, I don't think this makes a difference with .tflite; it is only this one file, so it should be fine.)
  • in ~/.mycroft/ backup the current config file: mycroft.conf, i.e. mycroft.conf.bak
  • edit the mycroft.conf to include the path for the tflite model: "local_model_file": "~/.mycroft/precise/*.tflite"
  • run Mycroft in debug mode and test it a bunch of times. The sensitivity and trigger-level might need to be fine tuned for this model.
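
For illustration, the relevant mycroft.conf fragment might look like the following. The wakeword name, model filename, and the sensitivity/trigger_level values are placeholders to be tuned; the key names follow Mycroft's standard hotword configuration:

```json
{
  "hotwords": {
    "hey mycroft": {
      "module": "precise",
      "local_model_file": "~/.mycroft/precise/hey-mycroft.tflite",
      "sensitivity": 0.5,
      "trigger_level": 3
    }
  }
}
```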