Lab 30b: Introduction to speech processing
Natural language can be a very intuitive way for all kinds of users to interact with robots. With some robots (like Kuri), it's one of only a few ways to enable rich two-way communication.
Unfortunately for us, automatic speech recognition is a difficult problem that still doesn't have perfect general-case solutions. Things have improved markedly thanks to interest in digital assistants. Thanks to products like the Echo, practical far-field speech recognition with microphone arrays is now mainstream. However, you'll find that once we adapt the available tools to robotics use cases, there's still plenty of room for improvement.
Circa 2019, state-of-the-art recognition methods are end-to-end trained neural networks. These models take raw audio data and attempt to predict a transcript. They are primarily controlled by the large tech companies that own the data necessary to train them. Practically, this means that state-of-the-art recognition is mostly accessed over web APIs (stream audio to a remote server, collect transcripts in response). Even if you could run a state-of-the-art model locally, it might be computationally expensive without the right hardware.
This is fine for many (even most) use cases, but it does add latency and makes your application completely dependent on a network connection. It's also not free.
For this class, we'll use an implementation of an HMM-based speech recognizer, a comparatively old-school approach.
It's not critical to understand how the model works in detail (you won't have to implement it; we'll use an existing package), but Wikipedia can give you the overview. If you're curious, "The Application of Hidden Markov Models to Automatic Speech Recognition" covers the method in detail.
In general, this approach transforms audio into a sequence of feature vectors, then attempts to infer a sequence of labels for these vectors. In HMM/graphical model terms, the feature vectors constitute the observations, and the words or phonemes constitute the hidden states that you would like to predict. Each word (or often each phoneme) in the recognizer's vocabulary has an acoustic model that describes the distribution of transitions between different sound feature vectors. These acoustic models are strung together to form a larger model of how sound progresses during speech. A language model, which describes the distribution of transitions between different words, is used on top of the acoustic model to inform the probability of a particular transcription. Inference in the model is done with the Viterbi algorithm.
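To make the inference step concrete, here is a minimal sketch of the Viterbi algorithm on a toy HMM. Everything in it is an assumption made for illustration: the two hidden states ("silence" and "speech"), the two discrete observation symbols ("low" and "high" frame energy), and all of the probabilities are made up, and none of it reflects how Sphinx is implemented. A real recognizer works with continuous feature vectors and a far larger state space, but the dynamic program has the same shape.

```python
import math


def viterbi(observations, states, log_start, log_trans, log_emit):
    """Return (log-probability, path) of the most likely hidden state sequence."""
    # best[t][s]: log-probability of the best path that ends in state s at time t.
    best = [{s: log_start[s] + log_emit[s][observations[0]] for s in states}]
    # back[t][s]: the predecessor of s on that best path.
    back = [{}]
    for t in range(1, len(observations)):
        best.append({})
        back.append({})
        for s in states:
            prev, score = max(
                ((p, best[t - 1][p] + log_trans[p][s]) for p in states),
                key=lambda pair: pair[1],
            )
            best[t][s] = score + log_emit[s][observations[t]]
            back[t][s] = prev
    # Pick the best final state, then follow the back-pointers to the start.
    last = max(states, key=lambda s: best[-1][s])
    path = [last]
    for t in range(len(observations) - 1, 0, -1):
        path.append(back[t][path[-1]])
    path.reverse()
    return best[-1][last], path


def to_log(table):
    """Convert a (possibly nested) dict of probabilities to log-probabilities."""
    return {
        k: to_log(v) if isinstance(v, dict) else math.log(v)
        for k, v in table.items()
    }


# Hypothetical toy model: all names and numbers below are illustrative only.
STATES = ["silence", "speech"]
LOG_START = to_log({"silence": 0.8, "speech": 0.2})
LOG_TRANS = to_log({
    "silence": {"silence": 0.7, "speech": 0.3},
    "speech": {"silence": 0.2, "speech": 0.8},
})
LOG_EMIT = to_log({
    "silence": {"low": 0.9, "high": 0.1},
    "speech": {"low": 0.3, "high": 0.7},
})
OBSERVATIONS = ["low", "high", "high", "low", "low"]

log_prob, path = viterbi(OBSERVATIONS, STATES, LOG_START, LOG_TRANS, LOG_EMIT)
print(path)  # ['silence', 'speech', 'speech', 'silence', 'silence']
```

The sketch works in log space so that multiplying many small probabilities becomes adding their logarithms, which avoids numerical underflow on long sequences. In the toy model, the transition table loosely plays the role that the language model and the strung-together acoustic models play in a full recognizer, and the emission table stands in for the per-state model of the sound features.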
The best available HMM-based recognizer is CMU Sphinx. Sphinx addresses a wide variety of possible applications, but importantly it has the tools necessary for us to tailor it to a particular vocabulary. In the next lab, we'll look at instantiating a model for commands that you care about for your project.