Overview



# Clarification


The Neural Transducer is different from, but strongly related to, the Sequence Transducer published by Graves. The Neural Transducer paper cites the latter as:

- \[8] Alex Graves. Sequence Transduction with Recurrent Neural Networks. In International Conference on Machine Learning: Representation Learning Workshop, 2012.
- \[9] Alex Graves, Abdel-rahman Mohamed, and Geoffrey Hinton. Speech Recognition with Deep Recurrent Neural Networks. In IEEE International Conference on Acoustics, Speech and Signal Processing, 2013.


# Motivation

## CTC Model Shortcomings

I have read in a number of places online that the shortcomings of the CTC model were one of the main reasons for the publication of the transducer. For [example](https://www.assemblyai.com/blog/an-overview-of-transducer-models-for-asr/):

> RNNTs were inspired by the limitations of CTC being highly dependent on an external Language Model to perform well.

While the Neural Transducer paper does not directly name CTC as a motivation, it does mention the shortcomings of CTC's independence assumption (which we will cover) in a short section discussing the related "sequence transducer" model, which the Neural Transducer claims to generalize. That being said, I was curious, so here is the argument:

Connectionist Temporal Classification (CTC) is another model [published](https://www.cs.toronto.edu/~graves/icml_2006.pdf?ref=assemblyai.com) by Graves et al. in 2006, six years before the Sequence Transducer.

A CTC model is considered simple because it consists of a single module, the Encoder, which is used to model ...

<center><img src='./images/ctc_architecture.png' style='width:50%'></center>

A key assumption is that the model's predictions are independent of each other (more precisely, conditionally independent given the input). As discussed in this [article](https://www.assemblyai.com/blog/an-overview-of-transducer-models-for-asr/):

> a CTC model ... is, theoretically, overall less accurate because the CTC loss function does not incorporate context.
> 
> Imagine a CTC model outputs the transcript "I have for apples". As a human reading this, you can immediately spot the error. "I have for apples" should be "I have four apples". For a CTC model "for" is just as sensical as "four" as they are phonetically similar. Since a CTC model’s outputs are conditionally independent of each other, the output of the word "for" does not take into consideration the surrounding context of words "I have ... apples".
>
> Because of these shortcomings, CTC models require an external Language Model, trained separately on millions to billions of sentences, to correct any linguistic errors the CTC model may output.

To summarize the shortcomings:

- Reliance on external language model
- Fast to train but slow to converge
- Poor performance on proper nouns
- Worse performance compared to transducer
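
To convince myself of what the independence assumption means in practice, here is a minimal sketch of greedy CTC decoding. Everything in it is an assumption made for illustration: the blank symbol `_`, the function names, and the made-up per-frame probabilities are not taken from any paper or library. The point is that each frame is decided on its own and the probability of an alignment is a plain product of per-frame terms, so a weak "u" frame cannot be rescued by the surrounding context.

```python
BLANK = "_"  # hypothetical blank symbol chosen for this sketch


def ctc_greedy_decode(frame_probs):
    """Pick the best symbol per frame, then collapse repeats and drop blanks.

    frame_probs: list of dicts mapping symbol -> probability, one dict per frame.
    """
    # Each frame is decided in isolation: no frame sees what the others emitted.
    best_path = [max(probs, key=probs.get) for probs in frame_probs]

    decoded = []
    previous = None
    for symbol in best_path:
        if symbol != previous and symbol != BLANK:
            decoded.append(symbol)
        previous = symbol
    return "".join(decoded)


def path_probability(frame_probs, path):
    """Probability of one alignment under the independence assumption."""
    prob = 1.0
    for probs, symbol in zip(frame_probs, path):
        prob *= probs[symbol]  # a plain product, with no term linking frames
    return prob


# Made-up acoustic scores for an utterance that sounds like "for" / "four".
frames = [
    {"f": 0.9, "_": 0.1},
    {"f": 0.6, "o": 0.3, "_": 0.1},
    {"o": 0.8, "_": 0.2},
    {"u": 0.4, "_": 0.6},  # the "u" loses here regardless of context
    {"r": 0.7, "_": 0.3},
]

transcript = ctc_greedy_decode(frames)
best_alignment_prob = path_probability(frames, "ffo_r")
```

In this toy setup the decoder settles on "for", because nothing in the model can say that "four" is the better fit before "apples". That missing piece is what the external language model is meant to supply.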

## Online Translation

The motivation according to the original publication was as follows:

> Sequence-to-sequence models have achieved impressive results on various tasks. However, they are unsuitable for tasks that require incremental predictions to be made as more data arrives or tasks that have long input sequences and output sequences.
> 
> ...
>
> Speech recognition is an example of such an online task ... Similarly, instant translation systems
>
> (In such scenarios) users prefer seeing an ongoing transcription of speech over receiving it at the “end” of an utterance... systems would be much more effective if ... translated online, rather than after entire utterances.
>
> This is because (these systems) generate an output sequence conditioned on an entire input sequence. In this paper, we present a Neural Transducer that can make incremental predictions as more input arrives, without redoing the entire computation.

**Note**: The paper specifically cites several papers when defining sequence-to-sequence models, as we will discuss later.

# Origin

The Neural Transducer was [published](https://arxiv.org/abs/1511.04868) in November 2015.



The paper goes on to cite a number of papers (17, 4, 1, 6, 3, 20, 18, 15, 19) in reference to the term "sequence-to-sequence model":

- \[1] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural Machine Translation by Jointly Learning to Align and Translate. In International Conference on Learning Representations, 2015.
- \[3] William Chan, Navdeep Jaitly, Quoc V Le, and Oriol Vinyals. Listen, attend and spell. arXiv preprint arXiv:1508.01211, 2015.
- \[4] Kyunghyun Cho, Bart van Merrienboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation. In Conference on Empirical Methods in Natural Language Processing, 2014.
- \[6] Jan Chorowski, Dzmitry Bahdanau, Dmitriy Serdyuk, Kyunghyun Cho, and Yoshua Bengio. Attention-Based Models for Speech Recognition. In Neural Information Processing Systems, 2015.
- \[15] Alessandro Sordoni, Michel Galley, Michael Auli, Chris Brockett, Yangfeng Ji, Margaret Mitchell, Jian-Yun Nie, Jianfeng Gao, and Bill Dolan. A neural network approach to context-sensitive generation of conversational responses. arXiv preprint arXiv:1506.06714, 2015.
- \[17] Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. Sequence to Sequence Learning with Neural Networks. In Neural Information Processing Systems, 2014.
- \[18] Oriol Vinyals, Lukasz Kaiser, Terry Koo, Slav Petrov, Ilya Sutskever, and Geoffrey Hinton. Grammar as a foreign language. In NIPS, 2015.
- \[19] Oriol Vinyals and Quoc V. Le. A neural conversational model. In ICML Deep Learning Workshop, 2015.
- ...

However, it consistently refers to \[17] and even borrows a diagram from that paper to explain the differences in model architectures.

> RNNTs however, did not get any real serious attention until the paper “Streaming End-to-end Speech Recognition for Mobile Devices” in 2018, which demonstrated the ability to use RNNTs on mobile devices for accurate speech recognition.
> https://www.assemblyai.com/blog/an-overview-of-transducer-models-for-asr/

# Characteristics







> The Transducer (sometimes called the “RNN Transducer” or “RNN-T”, though it need not use RNNs) is a sequence-to-sequence model.
>
> https://lorenlugosch.github.io/posts/2020/11/transducer/

According to the [publication](https://arxiv.org/abs/1511.04868), the first major difference is that the model's conditional distribution is not based on the entire input sequence but on the currently available (possibly partial) sequence.

> Unlike sequence-to-sequence models, the Neural Transducer computes the next-step distribution conditioned on the partially observed input sequence and the partially generated sequence. At each time step, the transducer can decide to emit zero to many output symbols.

As such, when new information arrives at a new time step, the model can evaluate this information and determine whether it has enough to translate or needs to keep waiting.
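
To picture the control flow, here is a minimal sketch of block-wise processing. It is not the paper's model: `toy_block_step` is a hypothetical stand-in for the neural network (it simply upper-cases every word that has been completed), and all function names and the block size are my own assumptions. What it does mirror is the behaviour quoted above: input arrives in blocks, each block can produce zero to many outputs, and state is carried from one block to the next.

```python
def split_into_blocks(inputs, block_size):
    """Yield consecutive chunks of the input, as if they arrived over time."""
    for start in range(0, len(inputs), block_size):
        yield inputs[start:start + block_size]


def toy_block_step(block, state):
    """Hypothetical stand-in for the network: emit every word completed so far.

    `state` holds the unfinished word, mimicking the hidden state that the
    transducer transfers from one block to the next.
    """
    emitted = []
    for char in block:
        if char == " ":
            if state:
                emitted.append(state.upper())
            state = ""
        else:
            state += char
    return emitted, state


def transduce_online(inputs, block_size, step=toy_block_step):
    """Yield (block, outputs) pairs as soon as each block has been consumed."""
    state = ""
    for block in split_into_blocks(inputs, block_size):
        outputs, state = step(block, state)
        yield block, outputs  # outputs may be empty: the model keeps waiting


results = list(transduce_online("go to the laboratory ", block_size=6))
for block, outputs in results:
    print(repr(block), "->", outputs)
```

With this input the first block emits two words, the third block emits nothing because "laboratory" is still incomplete, and the last block finishes it. A sequence-to-sequence model in the sense of \[17] would instead wait for the whole string before producing anything.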

Making an inference, I believe the next major difference between the two classes of models lies in the intermediate vector being passed between the encoder and decoder. In \[17] a fixed-length vector is passed between encoder and decoder, but looking at the paper I believe the transducer ...

Additionally, the paper provides the following diagram to compare and contrast the transducer with the seq2seq model:

> <center><img src='./images/seq2seq_vs_transducer.png'></center>
>
> Figure 1: High-level comparison of our method with sequence-to-sequence models. (a) Sequence-to-sequence model [17]. (b) The Neural Transducer (this paper) which emits output symbols as data come in (per block) and transfers the hidden state across blocks.

# Architecture

> The Encoder models the acoustic features of speech, and the Predictor acts as a Language Model to learn language information from the training data. Finally, the Joint network takes in the predictions from the Encoder and Predictor to produce a label.
> 
> The Predictor and the Joiner network are conditionally dependent, so the next prediction is reliant on the previous prediction. These combinations of modules trained jointly make an external Language Model unnecessary to gain high accuracy.
>
> https://www.assemblyai.com/blog/an-overview-of-transducer-models-for-asr/
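
To see how the three modules could fit together, here is a minimal pure-Python sketch of a joint network. The quoted article does not specify the combination rule, so the additive `tanh` form used here is an assumption on my part (it is one common choice for RNN-T style models), and the function names, vector sizes, and numbers are all made up. The encoder and predictor are represented only by their output vectors.

```python
import math


def joint_logits(encoder_out, predictor_out, output_weights):
    """Combine every encoder frame t with every predictor step u.

    encoder_out:    T vectors of size H (acoustic features per frame)
    predictor_out:  U vectors of size H (summary of the labels emitted so far)
    output_weights: V vectors of size H (one row per output label)
    Returns a T x U x V nested list of unnormalised scores.
    """
    lattice = []
    for enc in encoder_out:
        row = []
        for pred in predictor_out:
            # Assumed combination rule: add the two vectors, squash with tanh.
            hidden = [math.tanh(e + p) for e, p in zip(enc, pred)]
            row.append([sum(w * h for w, h in zip(weights, hidden))
                        for weights in output_weights])
        lattice.append(row)
    return lattice


def softmax(scores):
    """Turn unnormalised scores into a probability distribution over labels."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Made-up outputs: T=3 frames, U=2 predictor steps, H=2 features, V=4 labels.
encoder_out = [[0.1, -0.2], [0.4, 0.3], [0.0, 0.5]]
predictor_out = [[0.0, 0.0], [0.2, -0.1]]
output_weights = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5], [-1.0, 1.0]]

lattice = joint_logits(encoder_out, predictor_out, output_weights)
label_probs = softmax(lattice[0][0])
```

Because the predictor's vector depends on the labels emitted so far, the distribution at each `(t, u)` position is conditioned on the previous predictions. That dependence is exactly what the CTC model lacks.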