Here is a collection of resources for making a smart speaker. The hope is that we can build an open source one for daily use.
A simplified flowchart of a smart speaker looks like this:
+---+   +----------------+   +---+   +---+   +---+
|Mic|-->|Audio Processing|-->|KWS|-->|STT|-->|NLU|
+---+   +----------------+   +---+   +---+   +-+-+
                                               |
                                               |
+-------+   +---+   +----------------------+   |
|Speaker|<--|TTS|<--|Knowledge/Skill/Action|<--+
+-------+   +---+   +----------------------+
- Audio Processing includes Acoustic Echo Cancellation (AEC), Beamforming, Noise Suppression (NS), etc.
- Keyword Spotting (KWS) detects a keyword (such as OK Google, Hey Siri) to start a conversation.
- Speech To Text (STT) converts the captured speech into raw text.
- Natural Language Understanding (NLU) converts raw text into structured data.
- Knowledge/Skill/Action - Knowledge base and plugins (Alexa Skill, Google Action) to provide an answer.
- Text To Speech (TTS) converts the answer text back into speech for the speaker.
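To make the NLU and Knowledge/Skill/Action steps above more concrete, here is a minimal sketch of turning raw text into structured data and routing it to a skill. The intent names, the regex patterns, and the `parse` and `handle` helpers are all hypothetical and invented for illustration; real NLU engines typically rely on trained models rather than hand-written patterns.

```python
import re

# Hypothetical intent patterns, for illustration only.
# A real NLU engine would normally use a trained model instead.
INTENT_PATTERNS = {
    "set_timer": re.compile(r"set (?:a )?timer for (?P<minutes>\d+) minutes?"),
    "play_music": re.compile(r"play (?P<song>.+)"),
}


def parse(text):
    """Convert raw STT text into a structured intent dict."""
    normalized = text.lower().strip()
    for intent, pattern in INTENT_PATTERNS.items():
        match = pattern.search(normalized)
        if match:
            return {"intent": intent, "slots": match.groupdict()}
    return {"intent": "unknown", "slots": {}}


def handle(result):
    """Toy skill dispatcher: map a parsed intent to a reply for the TTS stage."""
    if result["intent"] == "set_timer":
        return "Timer set for {} minutes.".format(result["slots"]["minutes"])
    if result["intent"] == "play_music":
        return "Playing {}.".format(result["slots"]["song"])
    return "Sorry, I did not understand that."


if __name__ == "__main__":
    parsed = parse("Set a timer for 10 minutes")
    print(parsed)
    print(handle(parsed))
```

The structured dict is the hand-off point between NLU and the skill layer, so either side could be swapped for a real engine without changing the other.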
- Mycroft ⭐ - a hackable open source voice assistant
- dingdang robot - a 🇨🇳 voice interaction robot based on Jasper and built with a Raspberry Pi
- Amazon Alexa Voice Service - one of the most widely used voice assistants
- Google Assistant - arguably has the smartest brain. Its extensions, called Google Actions, can be created in a few steps with Dialogflow, and its Device Actions are well suited to smart home devices.
- Snowboy - DNN based hotword and wake word detection toolkit
- Honk - PyTorch reimplementation of Google's TensorFlow CNNs for keyword spotting
- ML-KWS-For-MCU - perhaps the most promising option for resource-constrained devices such as ARM Cortex-M7 microcontrollers
- Mozilla DeepSpeech - A TensorFlow implementation of Baidu's DeepSpeech architecture
- Kaldi
- PocketSphinx - a lightweight speech recognition engine using HMM + GMM
- Mimic - Mycroft's TTS engine, based on CMU's Flite (Festival Lite)
- MaryTTS - an open-source, multilingual text-to-speech synthesis system written in pure Java
- espeak-ng - an open source speech synthesizer that supports 99 languages and accents.
- ekho - Chinese text-to-speech engine
- WaveNet, Tacotron 2
- Acoustic Echo Cancellation
- Direction Of Arrival (DOA) - the most widely used DOA algorithm is GCC-PHAT
- Beamforming
  - BeamformIt - filter-and-sum beamforming
  - CGMM Beamforming - a reference implementation
  - MVDR Beamforming
  - GSC Beamforming
- Voice Activity Detection
  - WebRTC VAD
  - DNN VAD
- Noise Suppression
  - NS of WebRTC audio processing
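As an illustration of the GCC-PHAT algorithm mentioned in the Direction Of Arrival item above, here is a minimal NumPy sketch that estimates the time delay between two microphone signals and converts it to an angle. The function name `gcc_phat`, the `interp` upsampling factor, the 0.1 m microphone spacing, and the synthetic test signal are assumptions chosen for this example and do not come from any particular library.

```python
import numpy as np


def gcc_phat(sig, ref, fs, max_tau=None, interp=16):
    """Estimate the delay (in seconds) of `sig` relative to `ref` with GCC-PHAT.

    A positive result means `sig` arrives later than `ref`.
    """
    # Zero-pad so the circular correlation behaves like a linear one.
    n = sig.shape[0] + ref.shape[0]
    sig_fft = np.fft.rfft(sig, n=n)
    ref_fft = np.fft.rfft(ref, n=n)

    # Cross-power spectrum with PHAT weighting: keep the phase, drop the magnitude.
    cross = sig_fft * np.conj(ref_fft)
    cross /= np.abs(cross) + 1e-15

    # Inverse transform on a longer grid to interpolate the correlation peak.
    cc = np.fft.irfft(cross, n=interp * n)

    max_shift = interp * n // 2
    if max_tau is not None:
        max_shift = max(1, min(int(interp * fs * max_tau), max_shift))

    # Reorder so that index `max_shift` corresponds to zero delay.
    cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))
    shift = np.argmax(np.abs(cc)) - max_shift
    return shift / float(interp * fs)


if __name__ == "__main__":
    fs = 16000                     # sample rate in Hz (assumed)
    mic_distance = 0.1             # microphone spacing in metres (assumed)
    sound_speed = 343.0            # speed of sound in m/s

    # Synthetic test: the second channel is the first one delayed by 3 samples.
    rng = np.random.default_rng(0)
    ref = rng.standard_normal(fs)
    sig = np.concatenate((np.zeros(3), ref[:-3]))

    tau = gcc_phat(sig, ref, fs, max_tau=mic_distance / sound_speed)
    ratio = np.clip(tau * sound_speed / mic_distance, -1.0, 1.0)
    print("delay: %.6f s, angle: %.1f degrees" % (tau, np.degrees(np.arcsin(ratio))))
```

The PHAT weighting discards magnitude information, which tends to make the correlation peak sharper in reverberant rooms. With more than two microphones, the same pairwise estimate can be repeated for each pair and combined.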
- PortAudio
- libsoundio
- ALSA
- PulseAudio