# Overview

In the original [publication](https://arxiv.org/pdf/1211.3711.pdf), Graves defined his sequence transducer in the context of an RNN. In fact, there is a section in the text with the heading "Recurrent Neural Network Transducer". As a result, some texts will refer to the Neural Sequence Transducer as an RNNT or RNN-T model. 

Over time, however, the model has been extended and generalized so that it need not use an RNN. It is also considered a sequence-to-sequence model. More information on its relation to other models can be found [here](https://lorenlugosch.github.io/posts/2020/11/transducer/).

# Use Cases

>Although RNN-T, CTC and AED offer very good accuracy in recognizing speech, RNN-T typically outperform the others and while being naturally suitable for online streaming mode, it enables the development and deployment of real-time speech recognition applications.
>
> https://whatsnext.nuance.com/innovation-research/automatic-speech-recognition-on-prediction-network-architecture/

# Alignment

## The Problem

While reading up on the Neural Sequence Transducer model, I kept coming across the concept of alignment.

I found that it was one of the core motivations for the [publication](https://arxiv.org/pdf/1211.3711.pdf) of the Transducer.

> RNNs traditionally require a pre-defined alignment between the input and output sequences to perform transduction. This is a severe limitation since finding the alignment is the most difficult aspect of
many sequence transduction problems.

The author claims that prior to the advancements made in this paper:

> RNNs are usually restricted to problems where the alignment between the input and output sequence is known in advance.

At first it was hard to understand what the term was referring to. The math was quite dense and abstract. So I googled to find more intuitive explanations of what alignment is in terms of machine translation.

I found the following articles, which helped me understand what alignment means and what is being aligned. In short, alignment is the process of mapping the tokens in the input sequence to tokens in the output sequence.

Consider translation problems where there is not a one-to-one mapping between the input sequence and the output sequence. For example, when translating between English and French, the number of words in the English phrase may not match the number in the French phrase. Take the phrase "It's raining": the equivalent French phrase is "Il pleut", so the input has the same number of words as the output. But what about "It is raining"? In this circumstance we have three input words for two output words; there is no one-to-one mapping between the words.

[Cross Validated answer](https://stats.stackexchange.com/questions/272012/what-does-alignment-between-input-and-output-mean-for-recurrent-neural-network)

Additionally, the order of words may not be the same. For example, consider this translation from Portuguese to English:

```
Uma maçã grande e vermelha
(1)   (2)  (3)  (4)   (5)
 |      \ /   _______/
 |       X   /
 |      / \ /
 |     /   X
 |    /   / \
(1) (3) (5)  (2)
 A  big red apple
```

[AI Stack Exchange answer](https://ai.stackexchange.com/questions/26184/what-is-the-purpose-of-alignment-in-the-self-attention-mechanism-of-transforme): Transformers will automatically solve alignment while translating.
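To make the idea concrete for myself, here is a minimal sketch of how the alignment in the diagram could be written down as data. The variable and function names (`target_to_source`, `aligned_pairs`, `unaligned_source_words`) are my own and purely illustrative; they do not come from either answer.

```python
source = ["Uma", "maçã", "grande", "e", "vermelha"]
target = ["A", "big", "red", "apple"]

# For each target word, the 0-based index of the source word it translates.
# This mirrors the (1) (3) (5) (2) labels in the diagram, shifted to 0-based.
target_to_source = [0, 2, 4, 1]


def aligned_pairs(source, target, target_to_source):
    """Return (source word, target word) pairs in target order."""
    return [(source[i], word) for word, i in zip(target, target_to_source)]


def unaligned_source_words(source, target_to_source):
    """Return the source words that no target word points to."""
    used = set(target_to_source)
    return [word for i, word in enumerate(source) if i not in used]


pairs = aligned_pairs(source, target, target_to_source)
leftover = unaligned_source_words(source, target_to_source)
```

Here `pairs` links "grande" to "big" and "vermelha" to "red", while `leftover` holds the one source word ("e") that has no counterpart in the English phrase. The index list is not in increasing order, which is exactly the reordering the diagram shows.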

This problem exists in many architectures, not just RNN-Ts ([source](https://arxiv.org/abs/2112.07806)).



## Implementation Details

One of the core components of the way that the RNN-T handles alignment is the null character $\varnothing$. This character basically means "nothing" or "no character". The $\varnothing$ character is injected into the output sequence to denote gaps between the input and output sequence. One such permutation might be as follows:

$$ i_1, i_2, i_3, i_4 \Rightarrow o_1, o_2, \varnothing, o_3 $$


Many such permutations represent the same output sequence, since removing the $\varnothing$ characters from any of them leaves the same result. The model is therefore trained so that the probability of obtaining the output sequence (which doesn't contain the $\varnothing$ character) is maximized, where that probability is the sum of the probabilities of all the permutations that correspond to it. In this sense, the model is trained on the alignments and the alignments play an integral role in producing the correct outputs.
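As a rough illustration, the sketch below lists every way of placing three outputs into four input slots and checks that stripping the null symbol always gives back the same output sequence. It assumes the simplified one-symbol-per-input picture from the example above (the transducer in the paper is more flexible than this), and the names `BLANK`, `collapse`, and `simple_alignments` are my own.

```python
from itertools import combinations

BLANK = "<null>"  # ASCII stand-in for the null symbol


def collapse(alignment):
    """Remove every null symbol, leaving the plain output sequence."""
    return [symbol for symbol in alignment if symbol != BLANK]


def simple_alignments(outputs, num_inputs):
    """Yield every way to place `outputs`, in order, into `num_inputs` slots."""
    for slots in combinations(range(num_inputs), len(outputs)):
        alignment = [BLANK] * num_inputs
        for slot, symbol in zip(slots, outputs):
            alignment[slot] = symbol
        yield alignment


outputs = ["o1", "o2", "o3"]
candidates = list(simple_alignments(outputs, num_inputs=4))
for alignment in candidates:
    print(alignment, "->", collapse(alignment))
```

One of the printed candidates is `['o1', 'o2', '<null>', 'o3']`, which is the permutation written out above, and all of them collapse to the same three outputs.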

The process of maximizing the likelihood is a bit over my head but is discussed in the original paper as well as in this helpful [article](https://lorenlugosch.github.io/posts/2020/11/transducer/).
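For what it's worth, my understanding is that the likelihood is computed with a forward pass over a grid of (input step, number of outputs emitted so far), rather than by listing every alignment. Below is a minimal pure-Python sketch of that recursion as I read it from the paper. The function name, the table layout, and the toy probabilities are my own assumptions, and a real implementation would work in log space and also compute gradients.

```python
def transducer_likelihood(null_prob, label_prob):
    """Sum the probability of every alignment using a forward pass.

    null_prob[t][u]:  probability of emitting the null symbol at input step t
                      after u output symbols have been emitted.
    label_prob[t][u]: probability of emitting the next target symbol there.
    Both are T x (U + 1) tables; label_prob[t][U] is never read.
    """
    T = len(null_prob)
    U = len(null_prob[0]) - 1
    alpha = [[0.0] * (U + 1) for _ in range(T)]
    alpha[0][0] = 1.0
    for t in range(T):
        for u in range(U + 1):
            if t > 0:  # arrived by emitting null at (t - 1, u)
                alpha[t][u] += alpha[t - 1][u] * null_prob[t - 1][u]
            if u > 0:  # arrived by emitting a target symbol at (t, u - 1)
                alpha[t][u] += alpha[t][u - 1] * label_prob[t][u - 1]
    # Every alignment ends with a null emitted at the last grid point.
    return alpha[T - 1][U] * null_prob[T - 1][U]


# Toy case: 2 input steps, 1 target symbol, every probability set to 0.5.
toy_null = [[0.5, 0.5], [0.5, 0.5]]
toy_label = [[0.5, 0.5], [0.5, 0.5]]
likelihood = transducer_likelihood(toy_null, toy_label)
print(likelihood)
```

In the toy case there are two alignments (symbol, null, null and null, symbol, null), each with probability 0.5 × 0.5 × 0.5, so the total is 0.25. Training then amounts to making this quantity as large as possible for the correct output sequence.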

# Architecture

The model consists of two RNNs whose outputs are combined into an output distribution:

<center><img src='./images/neural_sequence_transducer_model_architecture.png' style='width:30%'></center>

## Transcription Network
The first RNN is referred to as the Transcription Network, which the author defines as:

> a bidirectional RNN (Schuster & Paliwal, 1997) that scans the input sequence $x$ forwards and backwards with two separate hidden layers, both of which feed forward to a single output layer. Bidirectional RNNs are preferred because each output vector depends on the whole input sequence (rather than on the previous inputs only, as
is the case with normal RNNs); however we have not tested to what extent this impacts performance.
> 
> ...
> 
> The transcription network is similar to a Connectionist Temporal Classification RNN, which also uses a null output to define a distribution over input-output alignments.


This network accepts the input sequence and outputs a sequence of transcription vectors (also referred to as the transcription sequence in the paper).

## Prediction Network

The second network is referred to as the Prediction Network. It scans the output sequence (not the transcription vectors) and outputs a sequence of prediction vectors.

## Output Distribution (Joiner)

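According to the publication, the output distribution quantifies the likelihood of obtaining a particular alignment $a$ given the input sequence $x$. It is defined as $\Pr(a \in \bar{\mathcal{Y}}^* \mid x)$, where $\bar{\mathcal{Y}}^*$ is the set of all sequences over the output symbols extended with $\varnothing$.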

In more recent [articles](https://whatsnext.nuance.com/innovation-research/automatic-speech-recognition-on-prediction-network-architecture/) the calculation of the probability distribution is treated as a discrete module referred to as the joiner. This module is seen as being interchangeable, allowing multiple calculations for the distribution.
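As one possible joiner, the sketch below follows my reading of the original paper: add a transcription vector and a prediction vector element-wise, then normalize with a softmax to get a distribution over the output symbols plus $\varnothing$. The function names and the toy vectors are my own, and other joiners (for example a small feed-forward layer) could be swapped in for `joiner`.

```python
import math


def joiner(f_t, g_u):
    """Combine one transcription vector and one prediction vector into a
    probability distribution over the output symbols plus the null symbol."""
    scores = [f + g for f, g in zip(f_t, g_u)]
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(score - peak) for score in scores]
    total = sum(exps)
    return [value / total for value in exps]


def output_lattice(transcription, prediction):
    """Apply the joiner to every (transcription, prediction) vector pair."""
    return [[joiner(f_t, g_u) for g_u in prediction] for f_t in transcription]


# Toy vectors: 2 input steps, 2 prediction steps, 3 symbols (null + 2 labels).
transcription = [[0.0, 1.0, -1.0], [2.0, 0.0, 0.5]]
prediction = [[0.5, 0.5, 0.5], [-1.0, 0.0, 1.0]]
lattice = output_lattice(transcription, prediction)
print(lattice[0][0])
```

`lattice[t][u]` is then the distribution over the next symbol at input step `t` after `u` outputs, and each of those distributions sums to 1.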