# ASR Model

In this notebook we create an Automatic Speech Recognition (ASR) model based on [Mozilla's DeepSpeech Model](https://github.com/mozilla/DeepSpeech) and [Baidu's Deep Speech Paper](https://arxiv.org/abs/1412.5567). Much of the code in this notebook is based on the source code of the [deepspeech-keras library](https://github.com/val260/DeepSpeech-Keras) and in some cases directly parallels that code (we would undoubtedly fail a plagiarism checker, which is why we want to be upfront about our sources). Developing this notebook, however, required us to:
* understand how each component of the DeepSpeech model works,
* migrate a library distributed over many files into a single notebook, and
* convert code from Keras to TensorFlow 2 Keras.

## Link To Drive



There are a few files that we will need to run this notebook. These should be located in the Model directory within our repository. If you have been following the setup instructions in the Github.ipynb file, you won't have to change any of the file paths below.

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [2]:
%cd "/content/drive/My Drive/Research/FIRE/2020-Speech-Recognition/Model"
!ls

/content/drive/My Drive/Research/FIRE/2020-Speech-Recognition/Model
alphabet.py	 callbacks.py	     data	__pycache__	tensorboard
alphabet.txt	 checkpoints	     hello.m4a	results.bin	yellow.m4a
ASR_Model.ipynb  configuration.yaml  model.png	StarWars60.wav


## Install Required Libraries

As long as you are using Google Colab, the only library you need to install is python_speech_features; all other libraries come preinstalled with Google Colab. In TensorFlow 2.3, some of the functionality we rely on was deprecated, so we also roll TensorFlow back to version 2.1.0.

In [None]:
!pip install python_speech_features
!pip install tensorflow==2.1.0

## Encode Audio

To perform speech recognition on our audio data, we first preprocess each audio file into input features. The code blocks below load a given audio file with librosa and extract log filterbank features from it with python_speech_features. We also run this code on a sample audio file containing 60 seconds of the Star Wars theme to test our ability to create features from audio data.

In [5]:
import numpy as np
import python_speech_features
import librosa

conf = {"winfunc": np.hamming, "winlen": 0.025, "winstep": .01, "nfilt": 80}

def get_features(files):
    mfccs = [make_features(file) for file in files]
    X = align(mfccs)
    return X

def make_features(file_path):
    """ Use the `python_speech_features` lib to extract log filterbank features from the audio file. """
    audio, fs = librosa.load(file_path, sr=16000)
    audio = (audio * 32768).astype("int16")
    feat, energy = python_speech_features.fbank(audio, samplerate=fs, **conf)
    features = np.log(feat)
    return features

def align(arrays, default=0):
    """ Pad arrays along time dimensions. Return the single array (batch_size, time, features). """
    max_array = max(arrays, key=len)
    X = np.full(shape=[len(arrays), *max_array.shape], fill_value=default, dtype=np.float64)
    for index, array in enumerate(arrays):
        time_dim, features_dim = array.shape
        X[index, :time_dim] = array
    return X
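Zero-padding makes the batch rectangular but discards how many frames each utterance really had. A minimal sketch of keeping that information is shown below; `pad_with_lengths` is a hypothetical helper introduced here for illustration and is not part of the original library.

```python
import numpy as np

def pad_with_lengths(arrays, default=0.0):
    """Pad 2-D (time, features) arrays to a common length and return the true lengths.

    Hypothetical helper: it mirrors `align` above but also reports the
    original frame count of every array.
    """
    lengths = np.array([len(a) for a in arrays], dtype=np.int32)
    n_features = arrays[0].shape[1]
    padded = np.full((len(arrays), lengths.max(), n_features), default, dtype=np.float64)
    for i, a in enumerate(arrays):
        padded[i, :len(a)] = a  # copy the real frames; the tail keeps the fill value
    return padded, lengths

# Two toy "utterances" with 3 and 5 frames of 2 features each.
toy = [np.ones((3, 2)), np.ones((5, 2))]
X_toy, toy_lengths = pad_with_lengths(toy)
print(X_toy.shape, toy_lengths)
```

The true lengths could later be used to mask padded frames, although the rest of this notebook does not do so.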

In [6]:
X = get_features(["StarWars60.wav"])
X.shape

(1, 5999, 80)
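The time dimension of 5999 can be checked by hand. The sketch below assumes the clip is exactly 60 seconds long at 16 kHz and that framing pads the final partial window, which is how we understand python_speech_features to behave; `num_frames` is a hypothetical helper, not a library function.

```python
import math

def num_frames(num_samples, sample_rate=16000, winlen=0.025, winstep=0.01):
    """Number of analysis windows for a signal, counting a padded final partial window."""
    frame_len = int(round(winlen * sample_rate))    # 400 samples per window
    frame_step = int(round(winstep * sample_rate))  # 160 samples between window starts
    if num_samples <= frame_len:
        return 1
    return 1 + math.ceil((num_samples - frame_len) / frame_step)

# Assumption: exactly 60 s of audio at 16 kHz, i.e. 960000 samples.
print(num_frames(60 * 16000))
```

Under these assumptions the result matches the 5999 frames reported above; the last dimension, 80, is simply `nfilt` from `conf`.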

## Alphabet

The next step is to load our alphabet. The alphabet defines which characters can appear in the transcribed text. Our alphabet file includes 36 characters right now, but we may want to reduce this to only the 26 letters of the English alphabet plus a space.

In [7]:
from alphabet import Alphabet
alphabet = Alphabet("alphabet.txt")

## Create Model

In the code cells below, we create the model that we will be using for Speech To Text Recognition. 

In [8]:
from tensorflow import config
list_physical_devices = config.list_physical_devices
model_dir = "models/"
gpus = list_physical_devices("GPU")

### Layers

To build the model, we use the Functional API provided by TensorFlow Keras. The code cells below build the model based on the structure of DeepSpeech. The code also switches to the GPU-optimized CuDNNLSTM layer when a GPU is available.

In [9]:
from typing import List
from tensorflow.keras import Model
import tensorflow
#from keras.initializers import np
from tensorflow import expand_dims, squeeze
from tensorflow.compat.v1.keras.layers import CuDNNLSTM
from tensorflow.keras.layers import Input, Lambda, LSTM, Bidirectional, Dense, ReLU, \
    TimeDistributed, BatchNormalization, Dropout, ZeroPadding2D, Conv2D, Reshape

In [59]:
def get_model():
    input_dim = 80
    is_gpu = len(gpus) > 0
    output_dim = 28
    context = 7
    units = 1024
    dropouts = [0.1, .1, 0]
    #random_state = 1

    #np.random.seed(1)
    #tensorflow.random.set_seed(random_state)
    input_tensor = Input([None, input_dim], name='X')                           # Define input tensor [time, features]
    x = Lambda(expand_dims, arguments=dict(axis=-1))(input_tensor)              # Add 4th dim (channel)
    x = ZeroPadding2D(padding=(context, 0))(x)                                  # Fill zeros around time dimension
    receptive_field = (2*context + 1, input_dim)                                # Take into account fore/back-ward context
    x = Conv2D(filters=units, kernel_size=receptive_field)(x)                   # Convolve signal in time dim
    x = Lambda(squeeze, arguments=dict(axis=2))(x)                              # Squeeze into 3rd dim array
    x = ReLU(max_value=20)(x)                                                   # Add non-linearity
    x = Dropout(rate=dropouts[0])(x)                                            # Use dropout as regularization

    x = TimeDistributed(Dense(units))(x)                                        # 2nd and 3rd FC layers do feature
    x = ReLU(max_value=20)(x)                                                   # extraction based on the context
    x = Dropout(rate=dropouts[1])(x)

    x = TimeDistributed(Dense(units))(x)
    x = ReLU(max_value=20)(x)
    x = Dropout(rate=dropouts[2])(x)

    x = Bidirectional(CuDNNLSTM(units, return_sequences=True) if is_gpu else     # LSTM handles long-range dependencies
                        LSTM(units, return_sequences=True),
                        merge_mode='sum')(x)

    output_tensor = TimeDistributed(Dense(output_dim, activation='softmax'))(x)  # Per-time-step probabilities over characters

    model = Model(inputs=input_tensor, outputs=output_tensor)
    return model

model = get_model()
model.summary()

Model: "model_4"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
X (InputLayer)               [(None, None, 80)]        0         
_________________________________________________________________
lambda_8 (Lambda)            (None, None, 80, 1)       0         
_________________________________________________________________
zero_padding2d_4 (ZeroPaddin (None, None, 80, 1)       0         
_________________________________________________________________
conv2d_4 (Conv2D)            (None, None, 1, 1024)     1229824   
_________________________________________________________________
lambda_9 (Lambda)            (None, None, 1024)        0         
_________________________________________________________________
re_lu_12 (ReLU)              (None, None, 1024)        0         
_________________________________________________________________
dropout_12 (Dropout)         (None, None, 1024)        0   
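The parameter count reported for the convolution can be verified with a little arithmetic. The sketch below uses the values from `get_model` (context 7, 80 input features, 1024 filters); the helper names are ours and purely illustrative.

```python
def conv2d_params(kernel_h, kernel_w, in_channels, filters):
    """Weights plus one bias per filter for a standard 2-D convolution."""
    return kernel_h * kernel_w * in_channels * filters + filters

context, input_dim, units = 7, 80, 1024
kernel_h = 2 * context + 1  # 15 frames: 7 past, the current one, 7 future

print(conv2d_params(kernel_h, input_dim, in_channels=1, filters=units))

def conv_output_steps(time_steps):
    """Time steps left after zero-padding by `context` on both sides and a 'valid' convolution."""
    padded = time_steps + 2 * context
    return padded - kernel_h + 1

print(conv_output_steps(5999))
```

The first value agrees with the 1229824 parameters in the summary, and the second shows that padding by `context` on each side keeps the number of time steps unchanged.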

### Loss Function

For our loss function, we use a CTC Loss. The CTC loss helps with the problem of alignment where we don't know which portion of the audio corresponds to which letter (just that certain audio corresponds to certain phrases or words). This loss is described in detail [here](https://distill.pub/2017/ctc/). 

In [11]:
import tensorflow as tf
def ctc_loss(y, y_hat):
    print("calculating loss")
    def get_length(tensor):
        lengths = tf.reduce_sum(tf.ones_like(tensor), 1)
        return tf.reshape(tf.cast(lengths, tf.int32), [-1, 1])


    sequence_length = get_length(tf.reduce_max(y_hat, 2))
    label_length = get_length(y)
    ret = tf.keras.backend.ctc_batch_cost(y, y_hat, sequence_length, label_length)
    print(ret)
    return ret

loss = ctc_loss
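To build intuition for what the CTC loss measures, the sketch below computes it by brute force on a tiny example: it enumerates every frame-level path, keeps the ones that collapse to the target, and sums their probabilities. This is only an illustration under toy assumptions (two frames, one letter plus a blank); real implementations such as the Keras backend function used above rely on dynamic programming instead.

```python
import itertools
import math

def collapse(path, blank):
    """CTC collapse rule: merge consecutive repeats, then drop blanks."""
    merged = [s for i, s in enumerate(path) if i == 0 or s != path[i - 1]]
    return [s for s in merged if s != blank]

def ctc_probability(probs, target, blank):
    """Sum the probability of every path that collapses to `target` (tiny inputs only)."""
    n_steps, n_symbols = len(probs), len(probs[0])
    total = 0.0
    for path in itertools.product(range(n_symbols), repeat=n_steps):
        if collapse(path, blank) == list(target):
            path_prob = 1.0
            for t, s in enumerate(path):
                path_prob *= probs[t][s]
            total += path_prob
    return total

# Two time steps, two symbols: index 0 is the letter "a", index 1 is the blank.
toy_frame_probs = [[0.5, 0.5], [0.5, 0.5]]
toy_p = ctc_probability(toy_frame_probs, target=[0], blank=1)
toy_loss = -math.log(toy_p)
print(toy_p, toy_loss)
```

Three of the four possible paths ("aa", "a-", "-a") collapse to "a", so the target probability is 0.75 and the loss is its negative logarithm.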

### Optimizer

For our optimizer, we use the standard Adam optimizer. The settings below are the default values for an Adam optimizer but can be tuned in the future.

In [12]:
from tensorflow.keras.optimizers import Optimizer, SGD, Adam

optimizer = Adam(
    learning_rate=0.001,
    beta_1=0.9,
    beta_2=0.999,
    epsilon=1e-07,
    amsgrad=False,
    name="Adam",
)
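For readers unfamiliar with Adam, the sketch below performs a single update in NumPy using the same hyperparameters as above. It follows the textbook formulation of the algorithm; the Keras implementation may differ in small numerical details, and `adam_step` is a hypothetical helper written for this illustration.

```python
import numpy as np

def adam_step(param, grad, m, v, t, lr=0.001, beta_1=0.9, beta_2=0.999, eps=1e-7):
    """One textbook Adam update; returns the new parameter and both moment estimates."""
    m = beta_1 * m + (1 - beta_1) * grad        # running mean of gradients
    v = beta_2 * v + (1 - beta_2) * grad ** 2   # running mean of squared gradients
    m_hat = m / (1 - beta_1 ** t)               # bias correction for step t (1-based)
    v_hat = v / (1 - beta_2 ** t)
    param = param - lr * m_hat / (np.sqrt(v_hat) + eps)
    return param, m, v

w = np.array([1.0, -2.0])
g = np.array([0.3, -4.0])
w_new, m1, v1 = adam_step(w, g, m=np.zeros(2), v=np.zeros(2), t=1)
print(w - w_new)
```

On the very first step the update is close to `learning_rate` in magnitude for every parameter regardless of the gradient's scale, which is one reason the default learning rate works across many models.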

### Compile The Model

We now have enough components to compile our model with the given optimizer and loss. The DeepSpeech Keras model also passes target_tensors when compiling. As far as we understand, target_tensors lets us supply our own placeholder for the targets instead of the one Keras would create to match the model output; here that is the integer label tensor y, whose shape differs from the per-frame probabilities the model produces. Support for target_tensors was discontinued starting with TensorFlow 2.2.0, so the cell below relies on the TensorFlow 2.1.0 version we pinned earlier, and we still need to consider how to handle this deprecation.

In [60]:
from tensorflow.keras.utils import multi_gpu_model
#print(tf.__version__)
gpus_num = len(gpus)
compiled_model = multi_gpu_model(model, gpus_num) if gpus_num > 1 else model
y = Input(name='y', shape=[None], dtype='int32')
compiled_model.compile(optimizer, loss, target_tensors=[y])
#compiled_model.template_model = model

calculating loss
Tensor("loss_4/time_distributed_14_loss/ExpandDims:0", shape=(None, 1), dtype=float32)


### Test The Model

We can also test our model on the sample audio file that we had extracted features from earlier. The decoder has not been set up yet so we take a rudimentary approach of removing all consecutive duplicates. We don't expect this model to classify anything at this point since it has been created with random weights, but this allows us to add a sanity check to see that our model is processing our input features and creating an output vector of probabilities for each character. 

In [61]:
y_hat = compiled_model.predict_on_batch(X)
arr = [alphabet.string_from_label(l) for l in np.apply_along_axis(np.argmax, 1, y_hat[0])]
arr = ''.join([arr[0]] + [arr[i] for i in range(1,len(arr)) if arr[i] != arr[i-1]])
print(arr)

qdqdqdqd
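The rudimentary approach above merges consecutive duplicates but keeps every remaining symbol. A slightly more complete best-path decoder also drops the CTC blank, as sketched below on toy data. Which index is the blank is an assumption of this sketch (our understanding is that the Keras CTC utilities use the last class), and `greedy_ctc_decode` is a hypothetical helper.

```python
import numpy as np

def greedy_ctc_decode(frame_probs, blank):
    """Best-path decoding: argmax per frame, merge repeats, drop blanks."""
    best = np.argmax(frame_probs, axis=1)
    decoded = []
    previous = None
    for label in best:
        if label != previous and label != blank:
            decoded.append(int(label))
        previous = label
    return decoded

# Toy example with 4 classes; class 3 plays the role of the blank in this sketch.
frame_labels = [1, 1, 3, 0, 0, 3, 2]
one_hot_probs = np.eye(4)[frame_labels]  # one-hot "probabilities", shape (7, 4)
toy_decoded = greedy_ctc_decode(one_hot_probs, blank=3)
print(toy_decoded)
```

Because repeats are merged before blanks are removed, a blank between two identical labels still yields a doubled letter, which is how CTC can represent words such as "hello".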


## Callbacks

In [54]:
from tensorflow.keras.callbacks import Callback, TerminateOnNaN, LearningRateScheduler, ReduceLROnPlateau, History
import importlib
import callbacks
importlib.reload(callbacks)
from callbacks import ResultKeeper, CustomModelCheckpoint, CustomTensorBoard, CustomEarlyStopping
callbacks = []
callbacks.append(TerminateOnNaN())
callbacks.append(ResultKeeper("results.bin"))
callbacks.append(CustomModelCheckpoint('checkpoints'))
callbacks.append(CustomTensorBoard('tensorboard'))
callbacks.append(CustomEarlyStopping(mini_targets={5: 200, 10:100}, monitor="val_loss", patience=3))
#lr_decay = lambda epoch, lr: lr / np.power(.1, epoch)
#callbacks.append(LearningRateScheduler(lr_decay, verbose= 1))

## Train The Model

In [46]:
X = get_features(["hello.m4a", "yellow.m4a", "hello.m4a", "yellow.m4a"])
X.shape

(4, 114, 80)

In [47]:
Y = ["hello", "yello", "hello", "yello"]
labels = alphabet.get_batch_labels(Y)
labels

array([[ 8,  5, 12, 12, 15],
       [25,  5, 12, 12, 15],
       [ 8,  5, 12, 12, 15],
       [25,  5, 12, 12, 15]])
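The label matrix above is consistent with a simple mapping in which the letters a to z occupy indices 1 to 26. The sketch below reproduces it under that assumption; the symbol order (including a space at index 0) is inferred from the output rather than read from alphabet.txt, and `encode` and `decode_labels` are hypothetical stand-ins for the `Alphabet` methods.

```python
import string

# Assumption: index 0 is the space character and 1..26 are the letters a-z.
SYMBOLS = " " + string.ascii_lowercase
CHAR_TO_LABEL = {ch: i for i, ch in enumerate(SYMBOLS)}

def encode(text):
    """Map a transcript to integer labels under the assumed symbol order."""
    return [CHAR_TO_LABEL[ch] for ch in text]

def decode_labels(label_ids):
    """Inverse of `encode`: turn integer labels back into a string."""
    return "".join(SYMBOLS[i] for i in label_ids)

print(encode("hello"))
print(encode("yello"))
```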

In [62]:
compiled_model.fit(X,labels,callbacks=callbacks,batch_size=2,epochs=10,validation_split=.5)

Train on 2 samples, validate on 2 samples
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10


<tensorflow.python.keras.callbacks.History at 0x7f0ed37af908>

In [None]:
y_hat = compiled_model.predict_on_batch(X)
y_hat

## Decoder

In [57]:
def batch_tensorflow_decode(y_hat, decoder, alphabet):
    """ Batch-decode predictions using the TensorFlow decoder. """
    labels, = decoder([y_hat])
    return alphabet.get_batch_transcripts(labels)

In [43]:
from functools import partial
from tensorflow.keras import backend as K

def get_decoder(output_tensor):
    def get_length(tensor):
        lengths = tf.reduce_sum(tf.ones_like(tensor), 1)
        return tf.cast(lengths, tf.int32)

    sequence_length = get_length(tf.reduce_max(output_tensor, 2))
    top_k_decoded, _ = K.ctc_decode(output_tensor, sequence_length, greedy=False, beam_width=64)
    print(top_k_decoded[0])
    decoder = K.function([output_tensor], [top_k_decoded[0]])
    return decoder

print(model.output)
decoder = get_decoder(model.output)
decoder = partial(batch_tensorflow_decode, alphabet=alphabet, decoder=decoder)

Tensor("time_distributed_8/Identity:0", shape=(None, None, 28), dtype=float32)
Tensor("SparseToDense_2:0", shape=(None, None), dtype=int64)


In [64]:
decoder(y_hat)

['l', 'l', 'l', 'l']

## Training on A Real Dataset

In [None]:
!pip install pydub

In [None]:
import tensorflow_datasets as tfds
ds = tfds.load('speech_commands', split='train', shuffle_files=True)
assert isinstance(ds, tf.data.Dataset)
print(ds)

<_OptionsDataset shapes: {audio: (None,), label: ()}, types: {audio: tf.int64, label: tf.int64}>


In [None]:
import tensorflow_datasets as tfds
ds = tfds.load("common_voice",split="train", shuffle_files=True, data_dir="./data")
assert isinstance(ds, tf.data.Dataset)
print(ds)

In [None]:
def make_features_from_audio(audio):
    """ Use the `python_speech_features` lib to extract log filterbank features from raw audio samples. """
    #audio, fs = librosa.load(file_path, sr=16000)
    audio = audio.astype("int16")
    feat, energy = python_speech_features.fbank(audio, samplerate=16000, **conf)
    features = np.log(feat)
    return features

In [None]:
classes = "down,go,left,no,off,on,right,stop,up,yes,,".split(",")
classes

['down', 'go', 'left', 'no', 'off', 'on', 'right', 'stop', 'up', 'yes', '', '']

In [None]:
from tqdm import tqdm

gen = ds.as_numpy_iterator()
X = []
Y = []
for e in tqdm(gen):
    if e["label"] < 10:
        X.append(make_features_from_audio(e["audio"]))
        Y.append(classes[e["label"]])

85511it [03:47, 376.09it/s]


In [None]:
labels = alphabet.get_batch_labels(Y)
X_align = align(X)

In [None]:
history = compiled_model.fit(X_align,labels,callbacks=callbacks,batch_size=32,epochs=10,validation_split=.05)

In [None]:
y_hat = compiled_model.predict_on_batch(X_align[:10])
print(Y[:10])
decode = lambda y_hat: ''.join([alphabet.string_from_label(l) for l in np.apply_along_axis(np.argmax, 1, y_hat)])
reduce = lambda arr: [arr[0]] + [arr[i] for i in range(1,len(arr)) if arr[i] != arr[i-1]]
[decode(yh) for yh in y_hat]

['stop', 'down', 'no', 'yes', 'yes', 'left', 'go', 'off', 'stop', 'left']


['', '', '', '', '', '', '', '', '', '']

## Visualize Results

In [None]:
%load_ext tensorboard

The tensorboard extension is already loaded. To reload it, use:
  %reload_ext tensorboard


In [None]:
%tensorboard --logdir tensorboard