# 4. Keyword Spotting Techniques

**Keyword spotting (KWS) refers to detecting and retrieving predefined keywords from an audio database.**

### 1. Techniques for Keyword Spotting
Different keyword spotting techniques are discussed below.

#### a) Template-Concatenation Model
> **Template matching**:
![](images/4_1.png)
> **Drawbacks**:
    - High computational cost
    - High false alarm (FA) rate

> **Improvement**:
    - Add a background model to absorb false alarms
![](images/4_2.png)
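
As a rough illustration of template matching with a background model, the sketch below scores a speech segment against a keyword template and a set of background templates. The notes above do not name the matching algorithm, so dynamic time warping (DTW) is an assumption here, as are the function names `dtw_distance` and `is_keyword`, the length normalisation, and the `margin` parameter.

```python
import math


def dtw_distance(template, segment):
    """Length-normalised DTW distance between two sequences of feature vectors.

    DTW is an assumed choice of matcher; the notes only say "template matching".
    """
    n, m = len(template), len(segment)
    acc = [[math.inf] * (m + 1) for _ in range(n + 1)]
    acc[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = math.dist(template[i - 1], segment[j - 1])  # Euclidean frame distance
            acc[i][j] = local + min(acc[i - 1][j], acc[i][j - 1], acc[i - 1][j - 1])
    return acc[n][m] / (n + m)


def is_keyword(segment, keyword_template, background_templates, margin=0.0):
    """Accept the segment only if the keyword template fits better than every
    background template (hypothetical decision rule: the background model
    "absorbs" segments that would otherwise become false alarms)."""
    keyword_score = dtw_distance(keyword_template, segment)
    background_score = min(dtw_distance(b, segment) for b in background_templates)
    return keyword_score + margin < background_score
```

Because every candidate segment is compared against every template, the cost grows with both the number of templates and the segment length, which matches the computational drawback listed above.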


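#### b) Hidden Markov Models (HMM)
> * Better performance than the template-based systems.
* HMMs model both the keywords and the filler (non-keyword speech). The filler can take one of the following forms:
    1. Acoustic word models
    2. Acoustic sub-word models (triphone/monophone/syllable models)
    3. Clustered models (clustered Gaussians)
    4. Vocabulary-independent models
* Can handle an arbitrary number of keywords
* **Drawbacks**:
    1. Cannot handle out-of-vocabulary (OOV) words
    2. Viterbi decoding is relatively expensive computationally
    3. Performance drops considerably when the test data is mismatched with the acoustic model's training data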

#### c) LVCSR-based Techniques
> **First transcribe the audio into text, then search the text for keywords**
* Very accurate for well-resourced languages
* Main limitations: requires a large amount of transcribed data and cannot handle OOV words

#### d) Predictive Neural Model
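> **Implementation**
* Train multiple MLP predictors. Each MLP acts as an encoder and is trained so that its output \\(\hat{a_t}\\) approximates the current input \\(a_t\\) as closely as possible. The **prediction residual** is \\(||\hat{a_t}-a_t||\\).
![](images/4_3.png)
* Model each keyword at a chosen unit size (word, syllable, or phoneme); each node corresponds to one MLP
![](images/4_4.png)
* At recognition time, compute the Euclidean-distance residual of each MLP, then use dynamic programming (DP) to find the minimum accumulated residual and the corresponding decoding path
![](images/4_5.png)
* A keyword is detected if the accumulated prediction residual is below a threshold.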
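
The recognition step above can be sketched as a left-to-right DP over a matrix of per-frame residuals. This is a minimal sketch under stated assumptions, not the method of the cited paper: it assumes the residuals \\(||\hat{a_t}-a_t||\\) have already been computed by each MLP, that every frame is assigned to exactly one node, and that a path may only stay in the current node or advance to the next one. The names `min_accumulated_residual` and `detect_keyword` are hypothetical.

```python
import math


def min_accumulated_residual(residuals):
    """residuals[t][n] is the prediction residual of node n's MLP on frame t.

    Returns (minimum accumulated residual, node index per frame). The path is
    assumed to start in node 0 and end in the last node.
    """
    num_frames = len(residuals)
    num_nodes = len(residuals[0])
    cost = [[math.inf] * num_nodes for _ in range(num_frames)]
    back = [[0] * num_nodes for _ in range(num_frames)]
    cost[0][0] = residuals[0][0]
    for t in range(1, num_frames):
        for n in range(num_nodes):
            stay = cost[t - 1][n]
            advance = cost[t - 1][n - 1] if n > 0 else math.inf
            if advance < stay:
                cost[t][n] = advance + residuals[t][n]
                back[t][n] = n - 1
            else:
                cost[t][n] = stay + residuals[t][n]
                back[t][n] = n
    total = cost[-1][-1]
    if math.isinf(total):  # fewer frames than nodes: no valid path
        return total, []
    path = [num_nodes - 1]
    for t in range(num_frames - 1, 0, -1):
        path.append(back[t][path[-1]])
    path.reverse()
    return total, path


def detect_keyword(residuals, threshold):
    """Keyword is detected when the accumulated residual is below the threshold."""
    total, _ = min_accumulated_residual(residuals)
    return total < threshold
```

In practice the accumulated residual grows with utterance length, so a deployed system would probably normalise it by the number of frames before thresholding; that refinement is left out here.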

> **Advantages**:
    1. Simple structure
    2. Easy to train
    3. Flexible training: models can be trained at the word, syllable, or phone level
    4. No need to train a non-keyword model
    5. Non-keywords are rejected based on the accumulated prediction residual score
    
> **Reference**:  
*[1] Iso K, Watanabe T. Speaker-independent word recognition using a neural prediction model[M]//Readings in speech recognition. 1990: 443-446.*

#### e) Phone Lattice Alignment
> * Build a phoneme lattice, then search it for the keyword with DP
* Much faster than the HMM-based approach
* Has no concept of a vocabulary or of OOV words

#### f) Modified Minimum Edit Distance Measure
![](images/4_6.png)
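
The figure is not described in text, so the exact modification is not specified in these notes. As a hedged illustration only, the sketch below computes a minimum edit distance between a keyword's phone sequence and a recognised phone sequence, with a pluggable substitution cost so that easily confused phones can be penalised less. The function name `weighted_edit_distance`, its parameters, and the example costs are all assumptions.

```python
def weighted_edit_distance(keyword_phones, hypothesis_phones,
                           sub_cost=None, ins_cost=1.0, del_cost=1.0):
    """Minimum edit distance with a configurable substitution cost.

    sub_cost(a, b) is a hypothetical hook, e.g. derived from a phone
    confusion matrix; the default is the classic 0/1 cost.
    """
    if sub_cost is None:
        sub_cost = lambda a, b: 0.0 if a == b else 1.0
    rows, cols = len(keyword_phones) + 1, len(hypothesis_phones) + 1
    dist = [[0.0] * cols for _ in range(rows)]
    for i in range(1, rows):
        dist[i][0] = dist[i - 1][0] + del_cost
    for j in range(1, cols):
        dist[0][j] = dist[0][j - 1] + ins_cost
    for i in range(1, rows):
        for j in range(1, cols):
            dist[i][j] = min(
                dist[i - 1][j] + del_cost,      # keyword phone missing in hypothesis
                dist[i][j - 1] + ins_cost,      # extra phone in hypothesis
                dist[i - 1][j - 1] + sub_cost(keyword_phones[i - 1],
                                              hypothesis_phones[j - 1]),
            )
    return dist[-1][-1]
```

A keyword would then be hypothesised wherever the distance to a stretch of the recognised phone sequence falls below a tuned threshold.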

#### g) Segmental Models
![](images/4_7.png)
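> **Implementation:**
* Segment the speech data according to its spectral characteristics
* Cluster the speech segments and model the clustered data with GMMs, giving a Segmental Gaussian Mixture Model (SGMM)
* During training, likewise model the data with the SGMM to obtain a Joint Multigram Model (JMM)
* Run a DP search to obtain the result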


#### h) Multilayer Perceptron (MLP)
![](images/4_8.png)
> * **Language-Independent**
* KL divergence performs better than dot product
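
To make the comparison between the two measures concrete, the sketch below computes both for a pair of posterior vectors. It assumes the MLP outputs per-frame phoneme posteriors and that the measures are used as frame-level local distances; neither assumption is stated in the notes. The names `kl_divergence` and `neg_log_dot`, and the smoothing constant `EPS`, are illustrative.

```python
import math

EPS = 1e-10  # assumed smoothing constant to avoid log(0)


def kl_divergence(p, q):
    """KL(p || q) between two discrete posterior vectors."""
    return sum(pi * math.log((pi + EPS) / (qi + EPS)) for pi, qi in zip(p, q))


def neg_log_dot(p, q):
    """Dot-product-based distance: negative log of the inner product."""
    return -math.log(sum(pi * qi for pi, qi in zip(p, q)) + EPS)


p = [0.7, 0.2, 0.1]
q = [0.1, 0.2, 0.7]
print(kl_divergence(p, p), kl_divergence(p, q))
print(neg_log_dot(p, p), neg_log_dot(p, q))
```

Note that the KL divergence is exactly zero for identical distributions, whereas the dot-product distance of a non-peaked posterior with itself is still positive.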

#### i) Deep Neural Networks
![](images/4_9.png)
> **Reference:**   
*[1] Chen G, Parada C, Heigold G. Small-footprint keyword spotting using deep neural networks[C]//ICASSP. 2014, 14: 4087-4091.*

#### j) Spectrographic Seam Patterns
![](images/4_10.png)
> * A sliding window is used for feature extraction and classification
* Extract seam-Hough features from the speech signal
* Smoothing
* SVM classifier

#### k) Spectro-Temporal Patch Features


### 2. Utterance Verification
* For **long keywords**, most systems perform well.
* For **shorter keywords**, performance degrades because of the **higher false alarm rate**.

Hence, a second stage is preferred to verify the utterance identified by the first stage: **an isolated keyword verification system helps reduce false alarms**.
![](images/4_11.png)

#### a) Confidence Measure


#### b) Hybrid Neural Network/HMM Approach
![](images/4_12.png)
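> **Basic idea**:
* Obtain the likelihoods of the keyword, anti-word, and filler from the HMM: \\(L(O|K_w), L(O|A_w), L(O|Fil)\\)
* Obtain the duration from the HMM
* Feed these features into a **back-end neural network classifier**; using more features improves classification accuracy.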

> **Reference:**   
*[1] Ou J, Chen K, Wang X, et al. Utterance verification of short keywords using hybrid neural-network/HMM approach[C]//Info-tech and Info-net, 2001. Proceedings. ICII 2001-Beijing. 2001 International Conferences on. IEEE, 2001, 2: 671-676.*   
*[2] Wu M, Panchapagesan S, Sun M, et al. Monophone-Based Background Modeling for Two-Stage On-Device Wake Word Detection[C]//2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2018: 5494-5498.*


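#### c) Cohort Word-Level Verification
> **Cohort keywords are words whose pronunciation is similar to that of the target keywords**  

> **Basic idea**:    
* Add similar-sounding anti-words to absorb false alarms caused by similar pronunciations
* Compute the difference in log-likelihood between the keyword and the anti-words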

>$$
LLR = \log P(O|{\lambda}_{kw}) - \frac{1}{B} \sum_{i=1}^{B} \log P(O|{\lambda}_{q_i})
$$
**or**
$$
LLR = \log P(O|{\lambda}_{kw}) - \max_{1 \le i \le B} \log P(O|{\lambda}_{q_i})
$$
where \\(O\\) is the observation sequence, \\({\lambda}_{kw}\\) is the keyword model, \\({\lambda}_{q_i}\\) is the model of the \\(i\\)-th cohort word, and \\(B\\) is the number of cohort words.
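
As a small worked example, the sketch below computes the log-likelihood ratio from precomputed log-likelihoods. It assumes the HMM scores \\(\log P(O|\lambda)\\) are already available as plain floats, and that the max variant subtracts the best-scoring cohort log-likelihood directly; the function name `cohort_llr` and the `mode` parameter are hypothetical.

```python
def cohort_llr(keyword_loglik, cohort_logliks, mode="mean"):
    """Log-likelihood ratio of the keyword model against its cohort models.

    keyword_loglik : log P(O | lambda_kw)
    cohort_logliks : list of log P(O | lambda_q_i), one per cohort word
    mode           : "mean" averages the cohort scores, "max" takes the best one
    """
    if not cohort_logliks:
        raise ValueError("at least one cohort word is required")
    if mode == "mean":
        competitor = sum(cohort_logliks) / len(cohort_logliks)
    elif mode == "max":
        competitor = max(cohort_logliks)
    else:
        raise ValueError("mode must be 'mean' or 'max'")
    return keyword_loglik - competitor


# Illustrative scores only, not measured data.
print(cohort_llr(-100.0, [-110.0, -120.0, -130.0], mode="mean"))
print(cohort_llr(-100.0, [-110.0, -120.0, -130.0], mode="max"))
```

The max variant is the stricter of the two: a single cohort word that fits the audio nearly as well as the keyword is enough to pull the ratio down, which is what makes it useful for rejecting similar-sounding false alarms.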