## Introduction to End-To-End Automatic Speech Recognition

We'll be using the [AN4 dataset from CMU](http://www.speech.cs.cmu.edu/databases/an4/) (with processing using `sox`).

## What is ASR?

**Automatic Speech Recognition (ASR)** refers to the problem of getting a program to automatically transcribe spoken language (speech-to-text). Our goal is usually to have a model that minimizes the **Word Error Rate (WER)** metric when transcribing speech input.
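As a rough illustration of the metric, WER is commonly computed as the word-level edit distance (substitutions, deletions, and insertions) between the reference transcript and the model's hypothesis, divided by the number of reference words. The sketch below assumes whitespace tokenization and no text normalization; the function name `word_error_rate` is ours, not part of any library.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] is the edit distance between the reference words seen so far
    # (before the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1      # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


# One substitution and one deletion against six reference words.
print(round(word_error_rate("call five five five one two",
                            "call five nine five one"), 3))  # 0.333
```

Note that WER can exceed 1.0 when the hypothesis contains many inserted words.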

Traditional speech recognition takes a generative approach, modeling the full pipeline of how speech sounds are produced in order to evaluate a speech sample. We would start from a **language model** that encapsulates the most likely orderings of words that are generated (e.g. an n-gram model), move to a **pronunciation model** for each word in that ordering (e.g. a pronunciation table), and end with an **acoustic model** that translates those pronunciations to audio waveforms (e.g. a Gaussian mixture model).

Then, if we receive some spoken input, our goal would be to find the most likely sequence of text that would result in the given audio according to our generative pipeline of models. Overall, with traditional speech recognition, we try to model `Pr(audio|transcript)*Pr(transcript)`, and take the argmax of this over the possible transcripts.
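Written as an equation, with $X$ standing for the audio and $W$ for a candidate transcript (this notation is our own shorthand), the decision rule reads:

$$
\hat{W} = \arg\max_{W} \Pr(X \mid W)\,\Pr(W)
$$

By Bayes' rule, $\Pr(W \mid X) = \Pr(X \mid W)\,\Pr(W) / \Pr(X)$, and $\Pr(X)$ does not depend on $W$, so this argmax also picks the transcript with the highest posterior probability given the audio.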

We can see the appeal of **End-To-End ASR architectures**: discriminative models that simply take the audio input and give a textual output, and in which all the components are trained together towards the same goal. The model's encoder would be akin to an acoustic model for extracting speech features, which can then be directly piped to a decoder that outputs text. If desired, we could integrate a language model that would improve our predictions.


## End-to-End ASR

We want to directly learn `Pr(transcript|audio)` in order to predict the transcripts from the original audio. Since we are dealing with sequential data, RNNs are an obvious choice. But now we have a pressing problem to deal with: since our input sequence (number of audio time steps) is not the same length as our desired output (transcript length), how do we match each time step of the audio data to the correct output characters?

Earlier speech recognition approaches relied on **temporally-aligned data**, in which each segment of time in an audio file was matched up to a corresponding speech sound such as a phoneme or word. However, if we would like to have the flexibility to predict letter-by-letter to prevent OOV (out of vocabulary) issues, then each time step in the data would have to be labeled with the letter sound that the speaker is making at that point in the audio file. With that information, it seems like we should simply be able to predict the correct letter for each time step and then collapse the repeated letters (`LLAAAAPPTOOOOP` -> `LAPTOP`). It turns out that this idea has some problems: not only does alignment make the dataset incredibly labor-intensive to label, but also, what do we do with words like 'books' that contain consecutive repeated letters?
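A few lines of Python make the repeated-letter problem concrete. This is only an illustrative sketch; `collapse_repeats` is a name we made up for the naive "merge runs of identical labels" step described above.

```python
from itertools import groupby


def collapse_repeats(frame_labels: str) -> str:
    """Merge each run of identical consecutive labels into a single label."""
    return "".join(label for label, _run in groupby(frame_labels))


print(collapse_repeats("LLAAAAPPTOOOOP"))  # LAPTOP
print(collapse_repeats("BBOOOOOKKSS"))     # BOKS: the double O of BOOKS is lost
```

Because every run of `O` frames merges into one letter, no per-frame labeling of "BOOKS" can survive this collapse step intact.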

Modern end-to-end approaches get around this using methods that don't require manual alignment at all, so that the input-output pairs are really just the raw audio and the transcript, with no extra data or labelling required. Two popular approaches are Connectionist Temporal Classification (CTC) and sequence-to-sequence models with attention.

## CTC

In normal speech recognition prediction output, we would expect to have characters such as the letters from A through Z, numbers 0 through 9, spaces ("_") and so on. CTC introduces a new intermediate output token called the **blank token** ("-") that is useful for getting around the alignment problem.

With CTC, we still predict one token per time segment of speech, but we use the blank token to figure out where we can and can't collapse the predictions. The appearance of a blank token helps separate repeating letters that should not be collapsed. For instance, with an audio snippet segmented into `T=11` time steps, we could get predictions that look like `BOO-OOO--KK`, which would then collapse to `BO-O-K`, and then we would remove the blank tokens to get our final output `BOOK`.
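The two-stage rule (merge repeats first, then drop blanks) can be sketched as follows. The helper name `ctc_collapse` is our own, and the blank symbol `-` follows the convention used in this section.

```python
from itertools import groupby

BLANK = "-"


def ctc_collapse(frame_tokens: str, blank: str = BLANK) -> str:
    """Merge runs of identical tokens, then remove the blank tokens."""
    merged = [token for token, _run in groupby(frame_tokens)]  # BOO-OOO--KK -> BO-O-K
    return "".join(token for token in merged if token != blank)


print(ctc_collapse("BOO-OOO--KK"))  # BOOK
print(ctc_collapse("BOOOOOOOOKK"))  # BOK: without a blank, the O's merge into one
```

The order of the two stages matters: removing blanks before merging would glue the two runs of `O` together and lose the double letter again.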

Now, we can predict one output token per time step, then collapse and clean to get sensible output without any fear of ambiguity from repeating letters! A simple way of getting predictions like this would be to apply a bidirectional RNN to the audio input, apply softmax over each time step's output, and then take the token with the highest probability. The method of always taking the best token at each time step is called **greedy decoding**, or **max decoding**.
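Here is a minimal sketch of greedy decoding on hand-written numbers. The softmax outputs below are invented for illustration (a real model would produce them), and `greedy_ctc_decode` is a hypothetical helper name, not a library function.

```python
from itertools import groupby


def greedy_ctc_decode(frame_probs, vocab, blank="-"):
    """Pick the most probable token at every time step, then collapse.

    frame_probs: one list of per-token probabilities per time step,
                 ordered like `vocab`.
    """
    best = [vocab[max(range(len(row)), key=row.__getitem__)] for row in frame_probs]
    merged = [token for token, _run in groupby(best)]
    return "".join(token for token in merged if token != blank)


vocab = ["-", "A", "B"]
# Made-up softmax outputs for six time steps (each row sums to 1).
frame_probs = [
    [0.1, 0.2, 0.7],  # argmax: B
    [0.2, 0.1, 0.7],  # argmax: B
    [0.8, 0.1, 0.1],  # argmax: -
    [0.1, 0.8, 0.1],  # argmax: A
    [0.6, 0.3, 0.1],  # argmax: -
    [0.2, 0.7, 0.1],  # argmax: A
]
print(greedy_ctc_decode(frame_probs, vocab))  # BAA
```

Greedy decoding is cheap, but it only follows the single best path; it does not necessarily return the transcript with the highest total probability.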


To calculate the loss for backprop, we would like to know the log probability of the model producing the correct transcript, `log(Pr(transcript|audio))`. We can get the log probability of a single intermediate output sequence (e.g. `BOO-OOO--KK`) by summing the log probabilities that the softmax assigns to each of its tokens at the corresponding time step, but note that the resulting sum is different from the log probability of the transcript itself (`BOOK`). This is because there are multiple possible output sequences of the same length that collapse to the same transcript, so we need to **marginalize over every valid sequence of length `T` that collapses to the transcript**.

Therefore, to get our transcript's log probability given our audio input, we must sum the probabilities of every sequence of length `T` that collapses to the transcript and take the log of that sum (a log-sum-exp over the per-sequence log probabilities). In practice, we can use a dynamic programming approach to calculate this, accumulating our log probabilities over different "paths" through the softmax outputs at each time step.
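The dynamic program can be sketched in plain Python. This follows the standard CTC forward recursion over the target with blanks interleaved; it is a teaching sketch under our own naming (`ctc_log_likelihood`, `log_add`), not the implementation used by any particular toolkit, and it assumes token index 0 is the blank.

```python
import math

NEG_INF = float("-inf")


def log_add(a, b):
    """Return log(exp(a) + exp(b)) without leaving the log domain."""
    if a == NEG_INF:
        return b
    if b == NEG_INF:
        return a
    hi, lo = max(a, b), min(a, b)
    return hi + math.log1p(math.exp(lo - hi))


def ctc_log_likelihood(log_probs, target, blank=0):
    """log Pr(target | audio), where log_probs[t][k] is the log-softmax
    output for token index k at time step t."""
    # Interleave blanks around the target: [blank, y1, blank, y2, ..., blank]
    ext = [blank]
    for token in target:
        ext += [token, blank]
    num_states, num_steps = len(ext), len(log_probs)

    # alpha[s]: log probability of all partial paths that end in state s
    alpha = [NEG_INF] * num_states
    alpha[0] = log_probs[0][ext[0]]
    if num_states > 1:
        alpha[1] = log_probs[0][ext[1]]

    for t in range(1, num_steps):
        new_alpha = [NEG_INF] * num_states
        for s in range(num_states):
            total = alpha[s]                          # stay in the same state
            if s >= 1:
                total = log_add(total, alpha[s - 1])  # advance by one state
            # Skipping the blank between two tokens is only allowed when
            # the tokens differ; equal tokens need the blank to stay apart.
            if s >= 2 and ext[s] != blank and ext[s] != ext[s - 2]:
                total = log_add(total, alpha[s - 2])
            new_alpha[s] = total + log_probs[t][ext[s]]
        alpha = new_alpha

    # Valid paths end on the last token or on the trailing blank.
    result = alpha[-1]
    if num_states > 1:
        result = log_add(result, alpha[-2])
    return result


# Two time steps, vocabulary [blank, A], uniform softmax outputs.
log_probs = [[math.log(0.5), math.log(0.5)] for _ in range(2)]
# Paths "AA", "A-" and "-A" all collapse to "A": 3 * 0.25 = 0.75
print(round(math.exp(ctc_log_likelihood(log_probs, [1])), 6))  # 0.75
```

The negative of this value is the CTC loss for one utterance. The recursion costs time proportional to `T` times the target length, instead of enumerating every length-`T` sequence.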

## Sequence-to-Sequence with Attention

One problem with CTC is that predictions at different time steps are conditionally independent given the input, which is an issue because the words in a continuous utterance tend to be related to each other in some sensible way. With this conditional independence assumption, we can't learn a language model that can represent such dependencies, though we can add a language model on top of the CTC output to mitigate this to some degree.

A popular alternative is to use a sequence-to-sequence model with attention. A typical seq2seq model for ASR consists of some sort of **bidirectional RNN encoder** that consumes the audio sequence time step by time step, and whose outputs are then passed to an **attention-based decoder**. Each prediction from the decoder is based on attending to some parts of the entire encoded input, as well as the previously output tokens.

The outputs of the decoder can be anything from word pieces to phonemes to letters, and since predictions are not directly tied to time steps of the input, we can just continue producing tokens one-by-one until an end token is given (or we reach a specified max output length). This way, we do not need to deal with audio alignment, and our predicted transcript is just the sequence of the outputs given by our decoder.
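To make the two ingredients concrete, here is a toy sketch: a dot-product attention step (one common scoring choice, assumed here for simplicity) and a decoding loop that stops on an end token or a maximum length. Everything named below is hypothetical; in particular `step_fn` stands in for a trained decoder, and `toy_step` just spells a fixed word.

```python
import math


def attention_context(query, encoder_states):
    """Dot-product attention: weight each encoder state by its similarity
    to the decoder query and return (weights, weighted average)."""
    scores = [sum(q * h for q, h in zip(query, state)) for state in encoder_states]
    peak = max(scores)  # subtract the max for a numerically stable softmax
    exp_scores = [math.exp(score - peak) for score in scores]
    norm = sum(exp_scores)
    weights = [e / norm for e in exp_scores]
    dim = len(encoder_states[0])
    context = [sum(w * state[d] for w, state in zip(weights, encoder_states))
               for d in range(dim)]
    return weights, context


def greedy_decode(step_fn, encoder_states, start_token, end_token, max_len):
    """Emit tokens one by one until `end_token` appears or `max_len` is hit.

    step_fn(previous_tokens, encoder_states) is a stand-in for a trained
    attention decoder; it returns the next token.
    """
    tokens = [start_token]
    while len(tokens) - 1 < max_len:
        next_token = step_fn(tokens, encoder_states)
        if next_token == end_token:
            break
        tokens.append(next_token)
    return tokens[1:]  # drop the start token


def toy_step(previous_tokens, encoder_states):
    """Toy decoder: spells a fixed word, then emits the end token."""
    word = "BOOK"
    position = len(previous_tokens) - 1
    return word[position] if position < len(word) else "<eos>"


print(greedy_decode(toy_step, [], "<sos>", "<eos>", max_len=10))  # ['B', 'O', 'O', 'K']
```

Since the output length is decoupled from the number of audio time steps, repeated letters such as the double `O` need no special handling here.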


## Taking a Look at Our Data (AN4)

The AN4 dataset, also known as the Alphanumeric dataset, was collected and published by Carnegie Mellon University. It consists of recordings of people spelling out addresses, names, telephone numbers, etc., one letter or number at a time, as well as their corresponding transcripts. We choose to use AN4 for this tutorial because it is relatively small, with 948 training and 130 test utterances, and so it trains quickly.

Before we get started, let's download and prepare the dataset. The utterances are available as `.sph` files, so we will need to convert them to `.wav` for later processing. If you are not using Google Colab, please make sure you have [SoX](http://sox.sourceforge.net/) installed for this step; see the "Downloads" section of the linked SoX homepage. (If you are using Google Colab, SoX should have already been installed in the setup cell at the beginning.)

In [1]:
import os 
data_dir = 'data/'

# Create the data directory if it does not exist yet
os.makedirs(data_dir, exist_ok=True)

In [3]:
import glob
import os 
import subprocess
import tarfile 
import wget

# Download the dataset
print("*****")
an4_path = os.path.join(data_dir, 'an4_sphere.tar.gz')
if not os.path.exists(an4_path):
    an4_url = 'https://dldata-public.s3.us-east-2.amazonaws.com/an4_sphere.tar.gz'
    an4_path = wget.download(an4_url, data_dir)
    print(f"Dataset downloaded at {an4_path}")
else:
    print("Tarfile already exists")

if not os.path.exists(os.path.join(data_dir, 'an4')):
    # Untar, then convert each .sph file to .wav (using sox)
    with tarfile.open(an4_path) as tar:
        tar.extractall(path=data_dir)
    print('Converting .sph to .wav..')
    sph_list = glob.glob(os.path.join(data_dir, 'an4', '**', '*.sph'), recursive=True)
    for sph_path in sph_list:
        wav_path = sph_path[:-4] + ".wav"
        cmd = ['sox', sph_path, wav_path]
        # check=True raises an error if sox is missing or a conversion fails
        subprocess.run(cmd, check=True)
print("Finished conversion. \n*****")

*****
Dataset downloaded at data//an4_sphere.tar.gz
Converting .sph to .wav..
Finished conversion. 
*****
