## Trigger Word Detection

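Goal: trigger word (keyword) detection in speech recognition.

1. Understand WAV audio files and audio analysis with a spectrogram.
2. Understand how to synthesize audio to generate training data from three kinds of clips: the trigger word "Activate", negatives, and backgrounds.
3. Build a model with a Conv1D layer and RNN (GRU) layers that outputs a detection signal.
4. Load an already-trained model and continue training it on the prepared data.
5. Use the predictions to generate an audio file and listen to how the trigger behaves.


What I learned:
1. How to synthesize audio to generate training data.
2. How to build a model that detects a keyword.

Things I still don't understand:

1. What TimeDistributed does.
2. Why training on the downloaded data made the model worse.
3. I still need to work through all of this hands-on.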


In [None]:
import os
import numpy as np
import matplotlib.pyplot as plt


%matplotlib inline

## 1 WAV files and spectrograms


In [None]:
import IPython
IPython.display.Audio("./raw_data/activates/1.wav")

In [None]:
IPython.display.Audio("./raw_data/negatives/4.wav")

In [None]:
IPython.display.Audio("./raw_data/backgrounds/1.wav")

The keyword is "activate"; the other clips are negative (distractor) words and background noise.

In [None]:
IPython.display.Audio("audio_examples/example_train.wav")

In [None]:
from scipy.io import wavfile

rate, data = wavfile.read("audio_examples/example_train.wav")
# print("Time steps in audio recording before spectrogram", data[:,0].shape)
# print("Time steps in input after spectrogram", x.shape)

print(rate)
print(data.shape)
print(data[500:510])

The audio is sampled at 44.1 kHz. A 10 s clip therefore contains 44,100 × 10 = 441,000 samples per channel, as expected.

Compute the spectrogram:

In [None]:
def graph_spectrogram(rate, data):
    # Note: `rate` is accepted but not used; `fs` below is hard-coded.
    nfft = 200      # Length of each window segment (samples)
    fs = 8000       # Sampling frequency passed to specgram
    noverlap = 120  # Overlap between consecutive windows (samples)
    nchannels = data.ndim
    if nchannels == 1:
        pxx, freqs, bins, im = plt.specgram(data, nfft, fs, noverlap = noverlap)
    elif nchannels == 2:
        # Stereo: use only the first channel
        pxx, freqs, bins, im = plt.specgram(data[:,0], nfft, fs, noverlap = noverlap)
    return pxx

x = graph_spectrogram(rate, data)

print(x.shape)
print(x[20:25, 1000:1005])


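I don't fully understand the internals of the spectrogram computation, but here is a rough analysis:

`nfft=200` sets the length of each analysis window and `noverlap = 120` sets the overlap between consecutive windows, so the effective step is $200 - 120 = 80$ samples. This gives $(441000 - 200) / 80 + 1 = 5511$ windows.

The general formula is $\frac{N - window}{step} + 1$. The $+1$ accounts for the first window: once it is set aside, each further step adds one window, giving $\frac{N - window}{step}$ more, and adding the first window back gives the total number of windows. These windows form the time axis, i.e. $T_x$.

The value 101 is the number of frequency bins: for real-valued input, a 200-point FFT yields $nfft / 2 + 1 = 101$ one-sided bins.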
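As a quick sanity check of the arithmetic above, the following sketch counts the windows and the one-sided FFT bins with plain NumPy. The helper name `num_windows` is made up for this note; it is not part of matplotlib.

```python
import numpy as np

def num_windows(n_samples, window, step):
    """Number of full windows of length `window` taken every `step` samples."""
    return (n_samples - window) // step + 1

nfft, noverlap = 200, 120
step = nfft - noverlap                      # 80 samples between window starts
Tx = num_windows(441000, nfft, step)

# A real-valued window of 200 samples has nfft // 2 + 1 one-sided FFT bins.
n_freq = np.fft.rfft(np.zeros(nfft)).shape[0]

print(Tx, n_freq)  # 5511 101
```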

In [None]:
Tx = 5511 # The number of time steps input to the model from the spectrogram
n_freq = 101 # Number of frequencies input to the model at each time step of the spectrogram
Ty = 1375 # The number of time steps in the output of our model

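`Ty = 1375` comes from applying a single Conv1D to the spectrogram, treated as a 1D sequence of 5511 time steps with 101 channels (one per frequency). With a kernel size of 15 and a stride of 4, the output length is $(5511 - 15) / 4 + 1 = 1375$. The layer has 196 filters, so the output has 1375 steps with a 196-dimensional vector at each step, i.e. `Ty = 1375`.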
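To make the shape bookkeeping concrete, here is a naive NumPy version of a strided 1D convolution without padding. It only illustrates the shapes (random weights, no bias) and is not how Keras implements `Conv1D`; the function name `conv1d_valid` is hypothetical.

```python
import numpy as np

def conv1d_valid(x, kernels, stride):
    """Naive strided 1D convolution without padding.

    x:       (steps, in_channels)
    kernels: (kernel_size, in_channels, filters)
    returns: (out_steps, filters)
    """
    kernel_size, in_channels, filters = kernels.shape
    out_steps = (x.shape[0] - kernel_size) // stride + 1
    flat_kernels = kernels.reshape(kernel_size * in_channels, filters)
    out = np.empty((out_steps, filters))
    for t in range(out_steps):
        # Each output step sees `kernel_size` consecutive input steps, all channels.
        window = x[t * stride : t * stride + kernel_size]
        out[t] = window.reshape(-1) @ flat_kernels
    return out

rng = np.random.default_rng(0)
x = rng.standard_normal((5511, 101))          # (Tx, n_freq)
kernels = rng.standard_normal((15, 101, 196))  # kernel size 15, 196 filters
out = conv1d_valid(x, kernels, stride=4)
print(out.shape)  # (1375, 196)
```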

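Because every clip is a 10 s audio file, the following step counts all describe the same 10 seconds:

- $441000$ (raw audio): each step is about 0.000023 s
- $5511 = T_x$ (spectrogram output steps): each step is about 0.0018 s
- $10000$ (used by the `pydub` module to synthesize audio): each step is 1 ms
- $1375 = T_y$ (output steps of the GRU layers): each step is about 0.0073 s

The third value, 10000, is used during audio synthesis: `pydub` represents a 10 s clip as 10000 steps of 1 ms each. This makes it easy to index into the background when inserting a keyword ("activate") or a negative word at a random position.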
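The four time axes can be related with a few lines of Python. This is only a sketch of the arithmetic; `ms_to_step` is a hypothetical helper name, not something provided by `pydub`.

```python
DURATION_S = 10.0
STEPS = {
    "raw audio": 441000,
    "spectrogram (Tx)": 5511,
    "pydub (ms)": 10000,
    "model output (Ty)": 1375,
}

# Duration covered by one step on each axis.
seconds_per_step = {name: DURATION_S / n for name, n in STEPS.items()}
for name, sec in seconds_per_step.items():
    print(f"{name:>18}: {sec:.6f} s per step")

def ms_to_step(ms, n_steps, total_ms=10000):
    """Map a position in milliseconds to an index on an axis with `n_steps` steps."""
    return int(ms * n_steps / total_ms)

# The middle of the clip (5000 ms) on the spectrogram axis and on the output axis.
print(ms_to_step(5000, 5511), ms_to_step(5000, 1375))  # 2755 687
```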

## 2 Synthesizing audio to generate the training set


In [None]:
from pydub import AudioSegment

def load_raw_audio():
    activates = []
    backgrounds = []
    negatives = []
    for filename in os.listdir("./raw_data/activates"):
        if filename.endswith("wav"):
            activate = AudioSegment.from_wav("./raw_data/activates/"+filename)
            activates.append(activate)
    for filename in os.listdir("./raw_data/backgrounds"):
        if filename.endswith("wav"):
            background = AudioSegment.from_wav("./raw_data/backgrounds/"+filename)
            backgrounds.append(background)
    for filename in os.listdir("./raw_data/negatives"):
        if filename.endswith("wav"):
            negative = AudioSegment.from_wav("./raw_data/negatives/"+filename)
            negatives.append(negative)
    return activates, negatives, backgrounds

# Load audio segments using pydub 
activates, negatives, backgrounds = load_raw_audio()

print("background len: " + str(len(backgrounds[0])))    # Should be 10,000, since it is a 10 sec clip
print("activate[0] len: " + str(len(activates[0])))     # Maybe around 1000, since an "activate" audio clip is usually around 1 sec (but varies a lot)
print("activate[1] len: " + str(len(activates[1])))     # Different "activate" clips can have different lengths 

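Every background clip has the same length, 10000 ms; the keyword and negative clips are much shorter.

**How the training set is generated: synthesize audio by inserting keywords and negative words at random positions in a background clip.**

A few points to note:
1. Every background clip has length 10000, i.e. 10 s.
2. Inserted keyword and negative clips must not overlap each other. (Why? A negative word could arguably also act as background noise.)
3. Output labels: right after a keyword ends, the following **50 output steps are set to 1**; all other positions are 0, as in the figure below. Note that the label has length Ty = 1375.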
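A small worked example of the labelling rule, mirroring the `insert_ones` helper defined later in this notebook. The function name `label_for` is hypothetical and exists only for this illustration.

```python
import numpy as np

Ty = 1375

def label_for(segment_ends_ms, Ty=Ty, total_ms=10000, n_ones=50):
    """Build a (1, Ty) label with `n_ones` ones after each keyword end time."""
    y = np.zeros((1, Ty))
    for end_ms in segment_ends_ms:
        start = int(end_ms * Ty / total_ms) + 1
        y[0, start:start + n_ones] = 1      # slicing clips at Ty automatically
    return y

y_mid = label_for([4000])    # keyword ends at 4.0 s
y_late = label_for([9900])   # keyword ends 0.1 s before the clip ends
print(int(y_mid.sum()), int(y_late.sum()))  # 50 13
```

A keyword that ends close to the end of the clip gets fewer than 50 ones, because the label is cut off at `Ty`.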

<img src="images/label_diagram.png" style="width:500px;height:200px;">
<center> **Figure 2** </center>

In [None]:
# Pick a random insertion position for a clip of the given length
def get_random_time_segment(segment_ms, total_ms=10000):    
    segment_start = np.random.randint(low=0, high=total_ms-segment_ms)   # Make sure segment doesn't run past the 10sec background 
    segment_end = segment_start + segment_ms - 1
    return (segment_start, segment_end)


# Check whether two (start, end) segments overlap
def is_overlapping(sa, sb):
    return not (sa[0] > sb[1] or sa[1] < sb[0])

def is_invalid_segment(segment, previous_segments):
    for s in previous_segments:
        if is_overlapping(segment, s):
            return True
    return False


# Insert one audio clip into the background
def insert_audio_clip(background, audio_clip, previous_segments):
    # background, audio_clip:  AudioSegment object
    # previous_segments:  list
    
    segment_ms = len(audio_clip)
    segment_time = get_random_time_segment(segment_ms)
    
    while is_invalid_segment(segment_time, previous_segments):
        segment_time = get_random_time_segment(segment_ms)
        
    previous_segments.append(segment_time)
    
    # Overlay the clip onto the background at the chosen start position
    new_background = background.overlay(audio_clip, position = segment_time[0])
    
    return new_background, segment_time


# Set 50 ones in the label right after the position where the keyword ends
def insert_ones(y, segment_end_ms, total_ms=10000.0):
    # y.shape = (1, Ty)
    Ty = y.shape[1]
    segment_end_y = int(segment_end_ms * Ty / total_ms)
    start = segment_end_y + 1
    y[0, start:start+50] = 1    
    return y


# Create one training example together with its label
def create_training_example(background, activates, negatives): 
    
    np.random.seed(18)
    
    # Make background quieter
    background = background - 20

    y = np.zeros((1, Ty))
    previous_segments = []

    # Randomly pick 0 to 4 "activate" clips and insert them into the background
    number_of_activates = np.random.randint(0, 5)
    random_indices = np.random.randint(len(activates), size=number_of_activates)
    
    for i in random_indices:
        activate = activates[i]
        # Insert the audio clip on the background
        background, segment_time = insert_audio_clip(background, activate, previous_segments)
        # Retrieve segment_start and segment_end from segment_time
        segment_start, segment_end = segment_time
        # Insert labels in "y"
        y = insert_ones(y, segment_end)
        
    
    # Randomly pick 0 to 2 negative clips and insert them into the background
    number_of_negatives = np.random.randint(0, 3)
    random_indices = np.random.randint(len(negatives), size=number_of_negatives)
    
    for i in random_indices:
        negative = negatives[i]
        # Insert the audio clip on the background 
        background, _ = insert_audio_clip(background, negative, previous_segments)
    
    # Standardize the volume: apply the gain that brings the clip's average loudness (dBFS) to -20
    background = background.apply_gain(-20.0 - background.dBFS)

    # Export new training example 
    wav_filename = "train.wav"
    file_handle = background.export(wav_filename, format="wav")
    print("File (train.wav) was saved in your directory.")
    
    # Get and plot spectrogram of the new recording (background with superposition of positive and negatives)
    rate, data = wavfile.read(wav_filename)
    x = graph_spectrogram(rate, data)
    
    return x, y
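The `apply_gain(-20.0 - background.dBFS)` call near the end of `create_training_example` is loudness normalization: whatever the current level is, the applied gain is the difference to the target of -20 dBFS. The sketch below reproduces the idea with NumPy on floating-point samples in [-1, 1]. It assumes an RMS-based dBFS with full scale 1.0, which is my understanding of how `pydub` defines `dBFS`; `dbfs` and `normalize_to` are hypothetical helper names.

```python
import numpy as np

def dbfs(samples):
    """RMS level in dB relative to full scale (full scale = 1.0, an assumption)."""
    rms = np.sqrt(np.mean(np.square(samples)))
    return 20.0 * np.log10(rms)

def normalize_to(samples, target_dbfs=-20.0):
    """Scale the samples so that their RMS level equals `target_dbfs`."""
    gain_db = target_dbfs - dbfs(samples)       # same form as -20.0 - background.dBFS
    return samples * 10.0 ** (gain_db / 20.0)   # a gain in dB is an amplitude factor

t = np.linspace(0, 1, 8000, endpoint=False)
quiet = 0.01 * np.sin(2 * np.pi * 440 * t)
loud = 0.5 * np.sin(2 * np.pi * 440 * t)

for name, clip in (("quiet", quiet), ("loud", loud)):
    print(name, round(dbfs(clip), 2), "->", round(dbfs(normalize_to(clip)), 2))
```

Both clips end up at the same level, so every synthesized training example has a comparable volume regardless of how many clips were overlaid.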

In [None]:
x, y = create_training_example(backgrounds[0], activates, negatives)
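One caveat of `insert_audio_clip`: its `while` loop retries until it finds a free slot, so it never terminates if the background is already full. A possible variant with a bounded number of attempts is sketched below. `find_free_segment` and `max_tries` are hypothetical names of mine and not part of the course code.

```python
import numpy as np

def find_free_segment(segment_ms, previous_segments, total_ms=10000,
                      max_tries=100, rng=None):
    """Return a (start, end) slot in ms that overlaps nothing in
    `previous_segments`, or None if no slot is found within `max_tries`."""
    rng = rng if rng is not None else np.random.default_rng()
    if segment_ms > total_ms:
        return None                      # the clip cannot fit at all
    for _ in range(max_tries):
        start = int(rng.integers(0, total_ms - segment_ms + 1))
        candidate = (start, start + segment_ms - 1)
        overlaps = any(not (candidate[0] > end or candidate[1] < begin)
                       for begin, end in previous_segments)
        if not overlaps:
            return candidate
    return None

rng = np.random.default_rng(18)
taken = []
for clip_ms in (900, 1200, 700):
    slot = find_free_segment(clip_ms, taken, rng=rng)
    if slot is not None:
        taken.append(slot)
print(taken)
print(find_free_segment(1000, [(0, 9999)], rng=rng))  # None: the background is full
```

The caller can then skip a clip when `None` is returned instead of looping forever.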

In [None]:
IPython.display.Audio("train.wav")

In [None]:
IPython.display.Audio("audio_examples/train_reference.wav")

## 3 Model

<img src="images/model.png" style="width:600px;height:600px;">
<center> **Figure 3** </center>



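Points to note:

1. One Conv1D layer with filters=196, kernel size 15, and stride 4. It reduces the 5511-step input to 1375 steps, which is Ty (computed above). Each output vector has 196 dimensions.
2. One GRU layer with 128 hidden units.
3. A second GRU layer with 128 hidden units.
4. [TimeDistributed](https://keras.io/layers/wrappers/), which applies the same Dense layer, with shared weights, independently at every time step.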
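The layer sizes above translate into parameter counts that can be checked by hand. The sketch below assumes the classic GRU formulation with one bias vector per gate (in Keras terms, `reset_after=False`); newer Keras versions default to a variant with a second set of biases, so `model.summary()` may report slightly larger GRU counts. The function names are hypothetical.

```python
def conv1d_params(kernel_size, in_channels, filters):
    """Weights of every filter plus one bias per filter."""
    return kernel_size * in_channels * filters + filters

def gru_params(input_dim, units):
    """Update gate, reset gate, and candidate state: each has input weights,
    recurrent weights, and one bias vector (assumed formulation)."""
    return 3 * (units * (input_dim + units) + units)

print(conv1d_params(15, 101, 196))  # 297136
print(gru_params(196, 128))         # 124800
print(gru_params(128, 128))         # 98688
```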

In [None]:
from keras.callbacks import ModelCheckpoint
from keras.models import Model, Sequential
from keras.layers import Dense, Activation, Dropout, Input, Masking, TimeDistributed, LSTM, Conv1D
from keras.layers import GRU, Bidirectional, BatchNormalization, Reshape
from keras.optimizers import Adam

In [None]:
def model(input_shape):
    X_input = Input(shape = input_shape)
    
    # Step 1: CONV layer
    X = Conv1D(196, 15, strides=4)(X_input)        # CONV1D
    X = BatchNormalization()(X)              # Batch normalization
    X = Activation('relu')(X)                # ReLu activation
    X = Dropout(0.8)(X)                      # dropout (use 0.8)

    # Step 2: First GRU Layer
    X = GRU(units = 128, return_sequences = True)(X)    # GRU (use 128 units and return the sequences)
    X = Dropout(0.8)(X)                                 # dropout (use 0.8)
    X = BatchNormalization()(X)                         # Batch normalization
    
    # Step 3: Second GRU Layer 
    X = GRU(units = 128, return_sequences = True)(X)    # GRU (use 128 units and return the sequences)
    X = Dropout(0.8)(X)                                 # dropout (use 0.8)
    X = BatchNormalization()(X)                         # Batch normalization
    X = Dropout(0.8)(X)                                 # dropout (use 0.8)
    
    # Step 4: Time-distributed dense layer 
    X = TimeDistributed(Dense(1, activation = "sigmoid"))(X) # time distributed  (sigmoid)

    model = Model(inputs = X_input, outputs = X)
    
    return model  

In [None]:
model = model(input_shape = (Tx, n_freq))

In [None]:
model.summary()

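Note the last layer: `TimeDist (None, 1375, 1)` has 129 parameters, which is exactly one `W` with `W.shape = (128, 1)` plus one bias `b`. So TimeDistributed does make the Dense layer share these parameters across all 1375 time steps.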
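The weight sharing can be illustrated without Keras. In this NumPy sketch, random values stand in for the GRU output, and applying one `(128, 1)` weight matrix at every time step in a loop gives the same result as a single matrix product over all steps. This is only an illustration of the idea, not the Keras implementation.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
h = rng.standard_normal((1375, 128))   # stand-in for the GRU output, one row per time step
W = rng.standard_normal((128, 1))      # the single Dense kernel
b = np.zeros(1)                        # the single Dense bias

# Apply the same Dense layer to each time step separately ...
per_step = np.stack([sigmoid(h[t] @ W + b) for t in range(h.shape[0])])
# ... which is the same as one matrix product over all time steps.
all_steps = sigmoid(h @ W + b)

print(per_step.shape, np.allclose(per_step, all_steps), W.size + b.size)
# (1375, 1) True 129
```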

## 4 Training

Here an already-trained model is loaded and then trained further on the prepared data.

In [None]:
from keras.models import load_model

model = load_model('./models/tr_model.h5')


In [None]:

# Load preprocessed training examples
X = np.load("./data/X.npy")
Y = np.load("./data/Y.npy")
X_dev = np.load("./data/X_dev.npy")
Y_dev = np.load("./data/Y_dev.npy")

print(X.shape)
print(Y.shape)
print(X_dev.shape)


In [None]:
opt = Adam(lr=0.0001, beta_1=0.9, beta_2=0.999, decay=0.01)
model.compile(loss='binary_crossentropy', optimizer=opt, metrics=["accuracy"])

In [None]:
# Fit on the training set (X, Y); the dev set is kept for evaluation below
model.fit(X, Y, batch_size = 5, epochs=1)

In [None]:
loss, acc = model.evaluate(X_dev, Y_dev)
print("Dev set accuracy = ", acc)


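The model is scored with accuracy here, but accuracy is not a good metric for this task: most of the labels are 0, so predicting 0 everywhere already reaches about 90% accuracy.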
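The effect is easy to reproduce on made-up labels. The sketch below uses synthetic data with 10% positives (not the course data) and compares accuracy with precision, recall, and F1 for a predictor that always outputs 0. `binary_metrics` is a hypothetical helper written for this note.

```python
import numpy as np

def binary_metrics(y_true, y_pred):
    """Accuracy, precision, recall, and F1 for 0/1 labels."""
    y_true = np.asarray(y_true).astype(bool)
    y_pred = np.asarray(y_pred).astype(bool)
    tp = np.sum(y_true & y_pred)
    fp = np.sum(~y_true & y_pred)
    fn = np.sum(y_true & ~y_pred)
    accuracy = float(np.mean(y_true == y_pred))
    precision = float(tp / (tp + fp)) if tp + fp > 0 else 0.0
    recall = float(tp / (tp + fn)) if tp + fn > 0 else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall > 0 else 0.0)
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

y_true = np.zeros(1000, dtype=int)
y_true[100:200] = 1                    # 10% of the steps are positive
all_zero = np.zeros_like(y_true)       # a "model" that never fires

print(binary_metrics(y_true, all_zero))
```

Accuracy comes out at 0.9 while recall and F1 are 0, which is why precision, recall, or F1 are more informative for this kind of imbalanced label.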

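**With the X, Y data downloaded from the course, training for one epoch actually made the model worse; without the extra training it produces correct output.** I suspect there is something wrong with the downloaded data.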

## 5 Prediction and output

In [None]:

# Predict
def detect_triggerword(filename):
    plt.subplot(2, 1, 1)
    
    rate, data = wavfile.read(filename)
    x = graph_spectrogram(rate, data)
    # the spectrogram outputs (freqs, Tx) and we want (Tx, freqs) to input into the model
    x  = x.swapaxes(0, 1)
    x = np.expand_dims(x, axis=0)
    predictions = model.predict(x)
    
    plt.subplot(2, 1, 2)
    plt.plot(predictions[0,:,0])
    plt.ylabel('probability')
    plt.show()
    return predictions

# Overlay a chime wherever the trigger word is detected
chime_file = "audio_examples/chime.wav"
def chime_on_activate(filename, predictions, threshold):
    audio_clip = AudioSegment.from_wav(filename)
    chime = AudioSegment.from_wav(chime_file)
    Ty = predictions.shape[1]
    # Step 1: Initialize the number of consecutive output steps to 0
    consecutive_timesteps = 0
    # Step 2: Loop over the output steps in the y
    for i in range(Ty):
        # Step 3: Increment consecutive output steps
        consecutive_timesteps += 1
        # Step 4: If prediction is higher than the threshold and more than 75 consecutive output steps have passed
        if predictions[0,i,0] > threshold and consecutive_timesteps > 75:
            # Step 5: Superpose audio and background using pydub
            audio_clip = audio_clip.overlay(chime, position = ((i / Ty) * audio_clip.duration_seconds)*1000)
            # Step 6: Reset consecutive output steps to 0
            consecutive_timesteps = 0
        
    audio_clip.export("chime_output.wav", format='wav')
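The thresholding logic in `chime_on_activate` can be tried out without any audio. The sketch below mirrors the same loop on a made-up probability curve and returns the output steps at which a chime would be inserted. `find_triggers` and `min_gap` are hypothetical names of mine.

```python
import numpy as np

def find_triggers(probs, threshold=0.5, min_gap=75):
    """Output steps where the probability exceeds `threshold` and more than
    `min_gap` steps have passed since the last trigger (or since the start)."""
    triggers = []
    steps_since_last = 0
    for i, p in enumerate(probs):
        steps_since_last += 1
        if p > threshold and steps_since_last > min_gap:
            triggers.append(i)
            steps_since_last = 0
    return triggers

probs = np.zeros(1375)
probs[100:150] = 0.9     # a 50-step detection, like the training labels
probs[300:400] = 0.9     # an unusually long detection
steps = find_triggers(probs)
print(steps)  # [100, 300, 376]

# Convert output steps to milliseconds in a 10 s clip.
print([round(i / len(probs) * 10000) for i in steps])
```

The 50-step detection fires once, while the 100-step one fires a second time after the 75-step gap has passed. Note also that nothing can fire during the first 75 steps of the clip.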

In [None]:
filename = "./raw_data/dev/1.wav"
prediction = detect_triggerword(filename)
chime_on_activate(filename, prediction, 0.5)
IPython.display.Audio("./chime_output.wav")

In [None]:
filename  = "./raw_data/dev/2.wav"
prediction = detect_triggerword(filename)
chime_on_activate(filename, prediction, 0.5)
IPython.display.Audio("./chime_output.wav")

In [None]:
# Preprocess the audio to the correct format
def preprocess_audio(filename):
    # Trim or pad audio segment to 10000ms
    padding = AudioSegment.silent(duration=10000)
    segment = AudioSegment.from_wav(filename)[:10000]
    segment = padding.overlay(segment)
    # Set frame rate to 44100
    segment = segment.set_frame_rate(44100)
    # Export as wav
    segment.export(filename, format='wav')
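Overlaying the recording onto 10 s of silence amounts to trimming or zero-padding it to a fixed length. For intuition, here is the same operation on a raw mono sample array in NumPy. It is a sketch under the assumption of a 44.1 kHz mono signal and does not replace the `pydub` version; `pad_or_trim` is a hypothetical name.

```python
import numpy as np

def pad_or_trim(samples, rate=44100, duration_s=10):
    """Return exactly rate * duration_s samples: cut the tail or pad with zeros."""
    target = rate * duration_s
    out = np.zeros(target, dtype=samples.dtype)
    n = min(len(samples), target)
    out[:n] = samples[:n]
    return out

short_clip = np.ones(1000, dtype=np.int16)      # shorter than 10 s: zero-padded
long_clip = np.ones(500000, dtype=np.int16)     # longer than 10 s: trimmed
print(len(pad_or_trim(short_clip)), len(pad_or_trim(long_clip)))  # 441000 441000
```

Keep in mind that `preprocess_audio` exports to the same `filename` it read from, so the original recording is overwritten.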

Once you've uploaded your audio file to Coursera, put the path to your file in the variable below.

In [None]:
your_filename = "audio_examples/my_audio.wav"

preprocess_audio(your_filename)
IPython.display.Audio(your_filename) # listen to the audio you uploaded 

Finally, use the model to predict when you say activate in the 10 second audio clip, and trigger a chime. If beeps are not being added appropriately, try to adjust the chime_threshold.

In [None]:
chime_threshold = 0.5
prediction = detect_triggerword(your_filename)
chime_on_activate(your_filename, prediction, chime_threshold)
IPython.display.Audio("./chime_output.wav")