# Overview

# Origin and Etymology
## Ancient Times
The concepts of encoding and decoding are nothing new. They have been used for thousands, if not tens of thousands, of years to send and receive messages or to record information in documents. The basic idea of encoding is that the "natural form" of a text is not conducive to the task at hand. This may be because we need to transmit the information through some physical mechanism that requires a different format; for example, sending Morse code over a wire. Or it may be because we want to keep the information secret; for example, we may encrypt data before saving it. Decoding, consequently, is simply the process of translating the encoded text back into its original format.
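
As a rough illustration of the encode/decode round trip described above, here is a minimal sketch using a small subset of Morse code. The function names `encode` and `decode` and the reduced table are my own choices for the example; a real table would cover the full alphabet.

```python
# A small subset of International Morse code; a full table would cover
# every letter, digit, and punctuation mark.
MORSE = {"S": "...", "O": "---", "E": ".", "T": "-", "A": ".-", "N": "-."}
REVERSE = {code: letter for letter, code in MORSE.items()}

def encode(text: str) -> str:
    """Translate plain text into Morse symbols separated by spaces."""
    return " ".join(MORSE[ch] for ch in text.upper())

def decode(signal: str) -> str:
    """Translate space-separated Morse symbols back into plain text."""
    return "".join(REVERSE[sym] for sym in signal.split())

message = "sos"
signal = encode(message)
print(signal)          # ... --- ...
print(decode(signal))  # SOS
```

Note that decoding only recovers the original up to what the code can represent: letter case is lost here because this table has a single entry per letter.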

The word originates from the Latin *codex*, which refers to a book of laws. Within the codex are the codes, or laws, for a particular subject. Thinking abstractly, in terms of a language, the codex defines how information is structured within that language. Using the codex, one can construct segments of text that comply with it and are thus in-code, or en-coded (from the Old French).

Despite the concept being used for thousands of years, from what I gather, the word encode first appeared in the 1930s.

## Machine Translation (1940s - 2000s)
In the context of machine learning, the term's meaning is still consistent with its historical usage.

According to this [article](https://syncedreview.com/2017/08/17/history-and-frontier-of-the-neural-machine-translation/#:~:text=In%20the%20year%20of%201997,do%20machine%20translations%20%5B2%5D.):

> Machine translation (MT) is utilizing the power of machines to do “automatic translation of text from one natural language (the source language) to another (the target language)” [1]. The idea of doing translation using machines was first raised by Warren Weaver in 1949. For a long time (1950s~1980s), machine translation was done through the study of the linguistic information about the source and target languages, generating translations based on the dictionaries and grammars, which is called rule-based machine translation (RBMT). With the development of Statistics, statistical models started to be applied to machine translation, which generates translations based on the analysis of bilingual text corpus. This method is known as the statistical machine translation (SMT), which gained better performance than RBMT and dominated the field from the 1980s to 2000s. In the year of 1997, Ramon Neco and Mikel Forcada came up with the idea of using “encoder-decoder” structure to do machine translations [2]. A few years later in 2003, a group of researchers at the University of Montreal led by Yoshua Bengio developed a language model based on neural networks [3], which improved the data sparsity problem of traditional SMT models. Their work laid a foundation for the future usage of neural networks on machine translation.
> 
> 
> References:
> 
> - (1) Russell, S. & Norvig, P. (1995). Artificial intelligence: a modern approach
> - (2) Neco, R. P., & Forcada, M. L. (1997, June). Asynchronous translations with recurrent neural nets. In Neural Networks, 1997., International Conference on (Vol. 4, pp. 2535-2540). IEEE.
> - (3) Bengio, Y., Ducharme, R., Vincent, P., & Jauvin, C. (2003). A neural probabilistic language model. Journal of machine learning research, 3(Feb), 1137-1155.

Machine translation, and SMT in particular, had several issues to overcome, however.

## The Reordering Problem

According to this [paper](https://qmro.qmul.ac.uk/xmlui/handle/123456789/2517):
> The reordering problem in SMT originates from the fact that not all the words in a sentence can be consecutively translated. This means words must be skipped and be translated out of their order in the source sentence to produce a fluent and grammatically correct sentence in the target language. The main reason that reordering is needed is the fundamental word order differences between languages.

For example, in English we might say "the white house", while in French we might say "la maison blanche" (directly translated as "the house white").

> Therefore, reordering becomes a more dominant issue, the more source and target languages are structurally different.
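
To make the "white house" example concrete, here is a small sketch, assuming a hypothetical three-word dictionary `EN_TO_FR` that I made up for illustration. It translates word by word, shows that the correct French order only appears among the reorderings, and prints how quickly the number of possible orderings grows with sentence length.

```python
from itertools import permutations
from math import factorial

# Hypothetical word-for-word dictionary, for illustration only.
EN_TO_FR = {"the": "la", "white": "blanche", "house": "maison"}

def word_by_word(sentence: str) -> list[str]:
    """Translate each word in place, keeping the source word order."""
    return [EN_TO_FR[w] for w in sentence.lower().split()]

def all_orderings(words: list[str]) -> list[str]:
    """Enumerate every possible ordering of the translated words."""
    return [" ".join(p) for p in permutations(words)]

literal = word_by_word("the white house")
print(" ".join(literal))                   # la blanche maison
candidates = all_orderings(literal)
print("la maison blanche" in candidates)   # True
print(len(candidates), factorial(10))      # 6 3628800
```

Three words give only 6 orderings, but a ten-word sentence already has over 3.6 million, which gives a feel for why an exhaustive search over reorderings is not practical.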


According to this [article](https://direct.mit.edu/coli/article/42/2/163/1530/A-Survey-of-Word-Reordering-in-Statistical-Machine) published in 2016:

> Word reordering is one of the most difficult aspects of statistical machine translation (SMT), and an important factor of its quality and efficiency. Despite the vast amount of research published to date, the interest of the community in this problem has not decreased, and no single method appears to be strongly dominant across language pairs. Instead, the choice of the optimal approach for a new translation task still seems to be mostly driven by empirical trials.
> 
> ...
> 
> the core SMT methods (Brown et al. 1990, 1993; Berger et al. 1996; Koehn, Och, and Marcu 2003) learn direct correspondences between source and target language from collections of translated sentences, without the need for abstract linguistic representations. The main advantages of SMT are versatility and cost-effectiveness: In principle, the same modeling framework can be applied to any pair of languages with minimal engineering effort, given sufficient amounts of translation data. However, experience in a diverse range of language pairs has revealed that this form of modeling is highly sensitive to structural differences between source and target language, particularly at the level of word order.
> 
> ...
> 
> Searching for the overall best translation throughout the space of all possible reorderings is, however, computationally intractable (Knight 1999). This crucial fact has motivated an impressive amount of research around two inter-related questions: namely, how to effectively restrict the set of allowed word permutations and how to detect the best permutation among them.
>
> ...
>
>  String-based SMT (Sections 2.1 and 2.2) treats translation as a sequential task: The target sentence is built from left to right while the input units are visited in different orders and no dependencies other than word adjacency are considered. ... Tree-based SMT (Section 2.3) posits the existence of a tree structure to explain translation as a hierarchical process and to capture dependencies among non-adjacent text units. Problem decomposition is therefore based on this structure: An optimal translation is sought for each word span corresponding to a node in the tree, from the leaves up to the root. Whereas string-based SMT has to search over all input permutations that do not violate some general reordering constraints, tree-based SMT considers only those permutations that result from transforming a given tree representing the input sentence (as for example permuting each node's children).
>
> ...



## Neural Machine Translation (2010s - Present)

> In 2013, Nal Kalchbrenner and Phil Blunsom proposed a new end-to-end encoder-decoder structure for machine translation [4]. This model will encode a given source text into a continuous vector using Convolutional Neural Network (CNN), and then use Recurrent Neural Network (RNN) as the decoder to transform the state vector into the target language. Their work can be treated as the birth of the Neural Machine Translation (NMT), which is a method that uses deep learning neural networks to map among natural language. NMT’s nonlinear mapping differs from the linear SMT models, and describes the semantic equivalence using the state vectors which connect encoder and decoder. In addition, the RNN is supposed to be capable of capturing information behind an infinite length of sentences and solving the problem of “long distance reordering” [29]. However, the problem of “exploding/vanishing gradient” [28] makes RNN hard to actually handle the long distance dependencies; accordingly, the NMT model did not achieve a good performance at the beginning.
> 
> References
>
> - (4) Kalchbrenner, N., & Blunsom, P. (2013, October). Recurrent Continuous Translation Models. In EMNLP (Vol. 3, No. 39, p. 413).
> - (28) Pascanu, R., Mikolov, T., & Bengio, Y. (2013, February). On the difficulty of training recurrent neural networks. In International Conference on Machine Learning (pp. 1310-1318).
> - (29) Sudoh, K., Duh, K., Tsukada, H., Hirao, T., & Nagata, M. (2010, July). Divide and translate: improving long distance reordering in statistical machine translation. In Proceedings of the Joint Fifth Workshop on Statistical Machine Translation and MetricsMATR (pp. 418-427). Association for Computational Linguistics.
>
> [source](https://syncedreview.com/2017/08/17/history-and-frontier-of-the-neural-machine-translation/#:~:text=In%20the%20year%20of%201997,do%20machine%20translations%20%5B2%5D.)

### Vanishing/Exploding Gradient Problem

One of the most common problems when working with Deep Neural Networks is referred to as the Vanishing and/or Exploding Gradient Problem. 

So what is a gradient, and what does it have to do with Deep Neural Networks? 

Recall that when training a neural network, we perform forward propagation followed by backpropagation. In forward propagation, the input data flows through the input layer, then the hidden layers, then the output layer of the neural network, producing the network's prediction. That prediction is compared with the correct output, and backpropagation then moves in the reverse direction, adjusting the weights so that the network yields a more accurate prediction for the same input. For backpropagation to work, some process must optimize the weights, and this is where the gradient comes in. The gradient is the vector of partial derivatives of the error (loss) with respect to each weight; it tells us how the error changes as each weight changes. The optimization process underlying backpropagation is typically gradient descent: an algorithm that repeatedly steps the weights against the gradient in search of a (local) minimum of the error.
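
Here is a minimal sketch of gradient descent on a single weight, assuming a toy loss `(w - 3)^2` that I picked only because its minimum is obvious. The names `loss`, `gradient`, and `gradient_descent` are illustrative and not taken from any library.

```python
def loss(w: float) -> float:
    """A toy loss with its minimum at w = 3."""
    return (w - 3.0) ** 2

def gradient(w: float) -> float:
    """Derivative of the toy loss with respect to w."""
    return 2.0 * (w - 3.0)

def gradient_descent(w: float, learning_rate: float = 0.1, steps: int = 100) -> float:
    """Repeatedly step against the gradient to reduce the loss."""
    for _ in range(steps):
        w -= learning_rate * gradient(w)
    return w

w_final = gradient_descent(w=0.0)
print(round(w_final, 4))  # 3.0
```

In a real network the same update is applied to every weight at once, with backpropagation supplying the gradient for each of them.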

OK, so we now know what a gradient is, what's the vanishing or exploding gradient problem?

The vanishing gradient problem arises when the gradient used by gradient descent approaches zero before backpropagation has reached the layers nearest the input. Normally the gradient approaches zero as we approach a local minimum, but when it shrinks toward zero simply because it has been propagated back through many layers, the network is not actually being calibrated: the weights in those early layers remain relatively unchanged, resulting in poor predictive performance. The exploding gradient problem is the opposite: the gradient keeps growing as it is propagated back, which again prevents convergence and model calibration.
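
A rough way to see both effects numerically: by the chain rule, the gradient reaching an early layer is a product of one factor per layer. The sketch below assumes, for simplicity, that every layer contributes the same factor (a single weight times the sigmoid derivative); real networks have a different factor at each layer, so treat this as a caricature rather than a model of any particular network.

```python
import math

def sigmoid_derivative(x: float) -> float:
    """Derivative of the logistic sigmoid; it never exceeds 0.25."""
    s = 1.0 / (1.0 + math.exp(-x))
    return s * (1.0 - s)

def backpropagated_gradient(weight: float, depth: int, pre_activation: float = 0.0) -> float:
    """Multiply the per-layer factor (weight * activation derivative)
    across `depth` layers, as the chain rule does during backpropagation."""
    factor = weight * sigmoid_derivative(pre_activation)
    return factor ** depth

for depth in (1, 5, 20):
    small = backpropagated_gradient(weight=1.0, depth=depth)  # factor 0.25 per layer
    large = backpropagated_gradient(weight=8.0, depth=depth)  # factor 2.0 per layer
    print(depth, small, large)
```

With a per-layer factor below 1 the product collapses toward zero as depth grows (vanishing); with a factor above 1 it blows up (exploding).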

Some texts will refer to the neural network or its training process as being, or becoming, unstable when the underlying model suffers from the vanishing/exploding gradient problem.

In 1991, Hochreiter completed his [diploma thesis](http://www.bioinf.jku.at/publications/older/3804.pdf), which provided an analysis of the gradient problem. [According to Schmidhuber](https://people.idsia.ch/~juergen/fundamentaldeeplearningproblem.html), his thesis supervisor, Hochreiter's work formally showed that deep neural networks are hard to train because they suffer from the now famous problem of vanishing or exploding gradients.

### Short Term vs Long Term Memory Problem

So what causes the vanishing or exploding gradient problem?

From what I could find, there are a number of causes. One such issue, mentioned on [StackOverflow](https://stackoverflow.com/questions/54286472/q-why-long-short-term-memorylstm-is-called-as-a-long-and-short-both-type-of-m), is that the NN may be "forgetting" information relevant to the prediction, rendering it unable to "remember" it when needed.

> Think of for example a piece of text. "Barnie is a big red dog, with little ears and a long black tail. He is 12 years old". If your task was to figure out what "He" refers to in the second sentence, you would send this data into an LSTM network, and it would analyze each word individually. The calculations for a single word is the Short-Term Memory. However the calculations of each word (the hidden state), ... is passed on and included when analyzing the next word. ... therefore storing the Short-Term data (calculations of individual word) over Long periods of time (passing the hidden states to the next word).

The trick here is that there needs to be a balance: not all information is relevant, and the longer the memory, the more complex the model and the more noise is potentially introduced into the system.
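
The idea of passing a hidden state from word to word can be sketched with a plain recurrent step. This is a simplification, not an LSTM: tokens are stood in for by single numbers, and the weights `w_hidden` and `w_input` are hypothetical fixed values rather than learned ones.

```python
import math

def rnn_step(hidden: float, token_value: float,
             w_hidden: float = 0.5, w_input: float = 1.0) -> float:
    """Combine the previous hidden state with the current input."""
    return math.tanh(w_hidden * hidden + w_input * token_value)

def run_rnn(token_values: list[float]) -> list[float]:
    """Process a sequence one token at a time, carrying the hidden state forward."""
    hidden = 0.0
    states = []
    for value in token_values:
        hidden = rnn_step(hidden, value)
        states.append(hidden)
    return states

# Hypothetical numeric stand-ins for tokens; real models use learned embeddings.
states = run_rnn([1.0, 0.0, 0.0, 0.0, 0.0])
print(states)  # the influence of the first token fades at each later step
```

The first token is the only non-zero input, yet it still shows up in every later state, only weaker each time. That fading is the "forgetting" the quote is getting at.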

In response to this problem, new model architectures were proposed that provide enhanced memory capabilities. We will talk about these in more detail later in this article. Additionally, attention mechanisms were proposed to filter out irrelevant information provided by the memory mechanisms.

So we see that the vanishing or exploding gradient problems are related to the Short Term vs Long Term Memory problem.

### Potential Solutions To The Gradient Problem

One way to help prevent this is to initialize the weights to random values. This is a common approach when searching via gradient descent: by initializing the weights to random values, the search algorithm has a better chance of converging. A good read on the subject can be found [here](https://www.comet.com/site/blog/vanishing-exploding-gradients-in-deep-neural-networks/#:~:text=Exploding%20is%20the%20opposite%20of,Network%2C%20not%20the%20activation%20function).
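
As one concrete example of random initialization, here is a sketch of Glorot (Xavier) uniform initialization, a widely used scheme that scales the random range by the layer's size. Picking this particular scheme is my assumption for the sake of the example; the text above only says "random values".

```python
import math
import random

def glorot_uniform(fan_in: int, fan_out: int, rng: random.Random) -> list[list[float]]:
    """Draw a fan_in x fan_out weight matrix from U(-limit, limit),
    with limit = sqrt(6 / (fan_in + fan_out))."""
    limit = math.sqrt(6.0 / (fan_in + fan_out))
    return [[rng.uniform(-limit, limit) for _ in range(fan_out)]
            for _ in range(fan_in)]

rng = random.Random(0)  # fixed seed so the sketch is repeatable
weights = glorot_uniform(fan_in=4, fan_out=3, rng=rng)
for row in weights:
    print(row)
```

Tying the range to the layer size is meant to keep the per-layer factors in the gradient product from being systematically too small or too large at the start of training.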


In addition to this approach, Schmidhuber [states](https://people.idsia.ch/~juergen/fundamentaldeeplearningproblem.html) (in 2013) that there are four known methods to overcome the vanishing gradient problem:

1. Unsupervised pre-training for a hierarchy of (recurrent) neural networks

    According to Schmidhuber, "This greatly facilitated subsequent supervised credit assignment through back-propagation."
   
   - ([1](https://people.idsia.ch/~juergen/fundamentaldeeplearningproblem.html)) Sepp Hochreiter's Fundamental Deep Learning Problem
   - (2) J. Schmidhuber. Learning complex, extended sequences using the principle of history compression, Neural Computation, 4(2):234-242, 1992 (based on TR FKI-148-91, 1991).

2. LSTM-like networks

    These types of networks avoid the problem through a special architecture that is unaffected by it.

3. Faster GPU-based computers

    These do not solve the problem, but they reduce its impact so that, despite the poorly trained network, the models still perform within the bounds of practicality.

4. Using alternative optimization algorithms besides gradient descent

   The space of NN weights can be searched by algorithms which do not rely on error gradients or gradient descent.
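
To give a feel for the fourth option, here is a minimal sketch of a gradient-free search: random hill climbing over two weights. The toy loss and the name `random_search` are hypothetical, and this is only one of many gradient-free methods, not the specific technique Schmidhuber refers to.

```python
import random

def toy_loss(weights: list[float]) -> float:
    """Toy loss with its minimum at weights = [1.0, -2.0]."""
    return (weights[0] - 1.0) ** 2 + (weights[1] + 2.0) ** 2

def random_search(steps: int = 2000, step_size: float = 0.1, seed: int = 0) -> list[float]:
    """Hill climbing: keep a random perturbation only if it lowers the loss."""
    rng = random.Random(seed)
    best = [0.0, 0.0]
    best_loss = toy_loss(best)
    for _ in range(steps):
        candidate = [w + rng.gauss(0.0, step_size) for w in best]
        candidate_loss = toy_loss(candidate)
        if candidate_loss < best_loss:
            best, best_loss = candidate, candidate_loss
    return best

found = random_search()
print(found, toy_loss(found))
```

No derivative is ever computed, so there is no gradient to vanish or explode. The price is that this kind of search scales poorly as the number of weights grows.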

Additionally, we will see that newer architectures are able to use gradients reliably and produce better results; see, for example, [Mikolov et al. 2014](https://arxiv.org/abs/1412.7753).

#### Long Short-Term Memory (LSTM) and Memory Distance

Long Short-Term Memory (LSTM) is a type of Recurrent Neural Network (RNN) [published](https://www.researchgate.net/publication/13853244_Long_Short-term_Memory) in 1995 by Hochreiter under the supervision of Schmidhuber. This was followed by a more popular [paper](https://www.bioinf.jku.at/publications/older/2604.pdf) in 1997, which is typically cited as the origin of the term. Its renown stems primarily from the claim that it overcomes the vanishing gradient problem. In addition, the paper reports more stable and faster training, resulting in RNNs that outperform the previous generation.

Another term that pops up tangentially is the concept of distance. We might see long-term memory and long-distance memory used interchangeably. The relationship between the two terms comes from the way that LSTM, and memory in general, works in the context of machine translation. An input sequence of text is provided, and the mechanism generating the translation needs to "keep in mind" a certain trailing set of information. The idea is that memory effectively stores an array of sequence tokens; the more recently observed ones occurred in the short term and are indexed closer to (at a shorter distance from) the current token being evaluated.

LSTM also claims to extend the distance, or term, of the memory while allowing the model to filter out noise (unimportant tokens in the input sequence).
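
Here is a sketch of a single LSTM step, reduced to scalars so the gating is easy to follow. It uses the commonly taught variant with a forget gate, which is not necessarily identical to the original 1997 formulation, and the parameter dictionary `p` holds hypothetical hand-picked values rather than learned ones.

```python
import math

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

def lstm_step(x: float, h_prev: float, c_prev: float,
              p: dict[str, float]) -> tuple[float, float]:
    """One step of a scalar LSTM cell; `p` holds hypothetical weights and biases."""
    f = sigmoid(p["wf"] * x + p["uf"] * h_prev + p["bf"])    # forget gate: how much old memory to keep
    i = sigmoid(p["wi"] * x + p["ui"] * h_prev + p["bi"])    # input gate: how much new information to store
    o = sigmoid(p["wo"] * x + p["uo"] * h_prev + p["bo"])    # output gate: how much memory to expose
    g = math.tanh(p["wg"] * x + p["ug"] * h_prev + p["bg"])  # candidate memory content
    c = f * c_prev + i * g   # long-term cell state
    h = o * math.tanh(c)     # short-term hidden state
    return h, c

# Illustrative values only: a large forget bias keeps the memory almost intact,
# and a very negative input bias filters out new inputs.
params = {"wf": 0.0, "uf": 0.0, "bf": 10.0,
          "wi": 0.0, "ui": 0.0, "bi": -10.0,
          "wo": 0.0, "uo": 0.0, "bo": 10.0,
          "wg": 1.0, "ug": 0.0, "bg": 0.0}

h, c = 0.0, 1.0  # start with something already stored in the cell
for x in [0.3, -0.8, 0.5, 0.1]:
    h, c = lstm_step(x, h, c, params)
print(c)  # stays close to 1.0 because the gates preserve the stored value
```

The cell state `c` is updated additively and scaled by the forget gate, so when that gate stays near 1 the stored value (and the gradient flowing back through it) survives across many steps. In a trained network the gates depend on the input, which is how noise gets filtered.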

Since then, the original LSTM model has led to a whole sub-family of models. More information on this topic can be found [here](https://mindmajix.com/what-is-lstm).

Additionally, because of its complexity, researchers have attempted to obtain similar results with simpler models. Reading between the lines in some of these papers, it looks like the math and architecture of LSTM are quite complicated compared to the modern front runners in the space.


### Sequence to Sequence (seq2seq)

A quick note on terminology:

At a certain point, people in the machine translation field started thinking about translation as a mapping between sequences of text in different languages; see, for example, Tomáš Mikolov's 2012 [PhD thesis](https://www.fit.vut.cz/study/phd-thesis-file/283/283.pdf).

The first [paper](https://arxiv.org/abs/1511.06391) I could find that uses the term seq2seq was published in November 2015 by Vinyals, Bengio, and Kudlur. It describes seq2seq as an existing framework which employs the chain rule (backpropagation and/or maximum likelihood) and implicitly characterizes the framework as having an encoder-decoder architecture.

I bring this up because the term starts to appear in papers, surveys, and summaries, so it is worth introducing early.

### The Fixed Vector Problem

While the LSTM model was the first to overcome the vanishing/exploding gradient problem, it introduced a new problem: the fixed vector problem. The intermediate representation passed between the encoder and decoder has a fixed dimensionality, regardless of the length of the input sequence.

This problem persisted for some time. For example, in September 2014, Sutskever et al. published a [paper](https://arxiv.org/abs/1409.3215) which presented a general end-to-end approach to sequence learning (i.e. learning how to map sequences) that makes minimal assumptions about the structure of the sequences. We use the term method or approach, rather than model, because it uses the encoder-decoder framework and thus multiple models: a multilayered Long Short-Term Memory (LSTM) to map the input sequence to a vector of a fixed dimensionality, and then another deep LSTM to decode the target sequence from the vector.
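
The fixed vector problem is easy to see in a toy encoder. In the sketch below, `encode_sequence` and its update rule are hypothetical stand-ins for a trained LSTM encoder; the only point is that the output has the same size no matter how long the input is.

```python
import math

HIDDEN_SIZE = 4  # the fixed dimensionality of the intermediate vector

def encode_sequence(token_ids: list[int]) -> list[float]:
    """Fold a sequence of any length into a vector of HIDDEN_SIZE numbers.
    The update rule is a hypothetical stand-in for a trained LSTM."""
    state = [0.0] * HIDDEN_SIZE
    for token in token_ids:
        state = [math.tanh(0.5 * s + math.sin(token * (k + 1)))
                 for k, s in enumerate(state)]
    return state

short = encode_sequence([3, 7])
long = encode_sequence(list(range(1, 101)))
print(len(short), len(long))  # 4 4
```

A two-token sentence and a hundred-token sentence are both squeezed into four numbers, so the longer the input, the more information has to be discarded before the decoder ever sees it.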

### Variable Vectors, Context Vectors, Parallel Training, and Attention

The solution, however, began to emerge with the following advancements:

First, in June 2014, Cho, Bengio, et al. published a [paper](https://arxiv.org/abs/1406.1078) proposing a novel neural network model called RNN Encoder-Decoder, in which the encoder and decoder are both RNNs and are trained jointly (that is, together rather than one after the other). The paper considers the LSTM unit as one of many activation functions that can "plug into" an RNN and suggests that its own proposed hidden unit is much simpler to compute. It notes that this hidden unit performs a similar function to the LSTM unit but "may also be considered an adaptive variant of a leaky-integration unit (Bengio et al., 2013)".

Then, in the same month, June 2014, Mnih et al. published a [paper](https://arxiv.org/abs/1406.6247) introducing the Recurrent Attention Model (RAM), which I believe is the first use of attention within RNNs. In this architecture, the model automatically adjusts the "bandwidth" or "vector length" being observed during the training process.

And in September 2014, Bahdanau, Cho, and Bengio published a [paper](https://arxiv.org/abs/1409.0473) proposing an encoder-decoder that does not use a fixed-length intermediary. Instead, they propose:

> an extension to the encoder–decoder model which learns to align and translate jointly. Each time the proposed model generates a word in a translation, it (soft-)searches for a set of positions in a source sentence where the most relevant information is concentrated. The model then predicts a target word based on the context vectors associated with these source positions and all the previous generated target words.
>
> The most important distinguishing feature of this approach from the basic encoder–decoder is that it does not attempt to encode a whole input sentence into a single fixed-length vector. Instead, it encodes the input sentence into a sequence of vectors and chooses a subset of these vectors adaptively while decoding the translation. This frees a neural translation model from having to squash all the information of a source sentence, regardless of its length, into a fixed-length vector. We show this allows a model to cope better with long sentences.

This paper introduces the concept of a Context Vector which replaces the prior fixed length intermediary representations.
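
A context vector can be sketched as a weighted average of the encoder's per-word vectors, where the weights say how relevant each source position is to the word currently being generated. For brevity this sketch scores relevance with a plain dot product; as far as I can tell the paper itself uses a small learned alignment network instead, so treat the scoring as a swapped-in simplification. All vectors are hypothetical.

```python
import math

def softmax(scores: list[float]) -> list[float]:
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def context_vector(decoder_state: list[float],
                   encoder_states: list[list[float]]) -> list[float]:
    """Weight each encoder state by its relevance to the decoder state and sum them."""
    scores = [sum(d * e for d, e in zip(decoder_state, enc)) for enc in encoder_states]
    weights = softmax(scores)
    size = len(encoder_states[0])
    return [sum(w * enc[k] for w, enc in zip(weights, encoder_states))
            for k in range(size)]

# Hypothetical 2-dimensional encoder states for a three-word source sentence.
encoder_states = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
print(context_vector([4.0, 0.0], encoder_states))  # dominated by the first state
```

Because a fresh context vector is computed for every output word, the decoder is no longer limited to one fixed summary of the whole sentence.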

## Transformers and Attention

In June 2017, the transformer architecture was [published](https://arxiv.org/abs/1706.03762) by Vaswani et al. This new architecture incorporated and expanded upon the prior advancements with respect to attention. The model is based solely on attention mechanisms, dispensing with recurrence and convolutions entirely.
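
The core operation of the transformer is scaled dot-product attention, softmax(QKᵀ / √d_k) V. Below is a bare-bones sketch of that one formula in plain Python. It leaves out everything else in the architecture (learned projections, multiple heads, positional encodings), and the token vectors are hypothetical.

```python
import math

def softmax(scores: list[float]) -> list[float]:
    """Normalize scores into weights that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def scaled_dot_product_attention(queries, keys, values):
    """For each query, average the values, weighted by query-key similarity."""
    d_k = len(keys[0])
    outputs = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in keys]
        weights = softmax(scores)
        outputs.append([sum(w * v[j] for w, v in zip(weights, values))
                        for j in range(len(values[0]))])
    return outputs

# Self-attention: hypothetical token vectors serve as queries, keys, and values.
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
attended = scaled_dot_product_attention(tokens, tokens, tokens)
print(attended)
```

Every token attends to every other token directly, with no hidden state carried from step to step, which is what lets the model drop recurrence.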

# Architecture
The family of encoder-decoder models is characterized by an architecture in which an encoder maps a source sentence into a fixed-length vector, from which a decoder generates a translation.