# Overview

# Origin

The Transformer model was [published](https://arxiv.org/abs/1706.03762) in 2017 to address several problems in machine translation. Machine translation, as the name suggests, is the process of using a machine (an ML model) to translate from one language into another. For more information about its lineage and the course of events leading to its development, see the complementary article on [encoder-decoders](Encoder-Decoders.ipynb).

# Architecture

In the 2017 [paper](https://arxiv.org/abs/1706.03762) titled "Attention Is All You Need", the following architectural diagram is published:

<center><img src="./images/transformer_model_architecture.png" style="width:50%"></center>

In my humble opinion, this diagram leaves a lot to be desired. I will try to expand and explain the various components.

## Inputs

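The paper explains that the "inputs" shown in the architectural diagram refer to a sequence of symbols. As the paper is written to address machine translation, we can assume that these are linguistic symbols. Note that they are not necessarily individual letters: the paper's experiments encode sentences with byte-pair encoding or word-pieces, so each symbol is a token (a word or a piece of a word).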

## Input Embeddings

The paper does not specify how the embeddings are calculated. Instead, it states that it uses learned embeddings similar to those used by other sequence transduction models. 



### Neural Sequence Transducer Model

**Note**: Reading through the paper, in all other instances, the term "transduction model" is prefixed with either "sequence" or "neural sequence". This distinction is important, as I have found that a Neural Transducer is different from a Neural Sequence Transducer according to its author.

Reading through the paper I found a passage which I believe helps define what a Neural Sequence Transduction Model is:

> Most competitive neural sequence transduction models have an encoder-decoder structure [5, 2, 35].

Looking at the sources I see:

> \[2] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by jointly 
learning to align and translate. CoRR, abs/1409.0473, 2014
> 
> \[5] Kyunghyun Cho, Bart van Merrienboer, Caglar Gulcehre, Fethi Bougares, Holger Schwenk,
and Yoshua Bengio. Learning phrase representations using rnn encoder-decoder for statistical
machine translation. CoRR, abs/1406.1078, 2014
> 
> \[35] Ilya Sutskever, Oriol Vinyals, and Quoc VV Le. Sequence to sequence learning with neural
networks. In Advances in Neural Information Processing Systems, pages 3104–3112, 2014

And cross referencing these (specifically \[[2](https://arxiv.org/abs/1409.0473)]) with Graves' paper on Neural Transducers I see the same sources being cited:

> Graves, A. (2012). Sequence transduction with recurrent neural networks. In Proceedings of the
29th International Conference on Machine Learning (ICML 2012).


Each token in the input sequence is converted into a vector of dimension $d_{model}$. This means that if the input sequence is $N$ tokens long, the resulting embedding matrix would be  $N \times d_{model}$.
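To make the shapes concrete, here is a minimal sketch of an embedding lookup. The vocabulary size, $d_{model}$, and token ids are made-up values for illustration, and the table is filled with random numbers, whereas in a real model it would be a learned parameter.

```python
import numpy as np

# Hypothetical sizes, chosen only for illustration.
vocab_size = 10  # number of distinct tokens
d_model = 4      # embedding dimension

# In a trained model this table is learned; here it is random.
rng = np.random.default_rng(0)
embedding_table = rng.normal(size=(vocab_size, d_model))

# An input sequence of N = 5 token ids.
token_ids = np.array([3, 1, 4, 1, 5])

# Row lookup: one d_model-dimensional vector per token.
embeddings = embedding_table[token_ids]
print(embeddings.shape)  # (5, 4)
```

Note that the two occurrences of token `1` receive identical rows, which is one reason a separate positional signal is needed.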

I had not heard the term "sequence transduction model" before. I [googled](https://machinelearningmastery.com/transduction-in-machine-learning/) to find out what it means, and I believe its meaning is derived from that of transductive reasoning or transductive logic.

<center><img src='./images/logic_diagram.png' style='width:50%'></center>

With transductive logic, we are able to look at examples (i.e. train on examples) and learn how to predict outcomes without explicitly defining an a priori model of all the rules involved in the predictions. Instead, the rules are abstracted away and implicitly "baked into" the transductive process as training takes place.

In this way, some neural networks are transducers; i.e. models that are able to infer a mapping which allows us to directly predict an output given an input, without needing to evaluate an intermediate set of explicit rules.

I was able to find a citation for the original [publication](https://arxiv.org/abs/1511.04868) of the transducer.

I was curious as to what specific algorithm is used to learn the embeddings. According to a 

## Positional Encoding
### Purpose
The purpose of the positional encoding is to record the location of the tokens in the input sequence relative to each other as well as their absolute position. The positional information of symbols contains or implies important syntactic information that is relevant when training the model and making predictions.

As the paper notes, there are many choices for methodology of assigning positional encodings. 

After googling I found this [article](https://kazemnejad.com/blog/transformer_architecture_positional_encoding/) explaining that there are four criteria for selecting a good positional encoding algorithm:
1. Each position should be represented by a unique encoding
2. Distance between positions should be consistent
3. The algorithm should be deterministic
4. The range (output of the function) should be bounded

The importance of the last criterion was not initially intuitive. I didn't understand why we couldn't, for example, simply number the tokens in the input sequence incrementally, starting with 0. Reading on, the article suggests this would likely prevent the transformer model from generalizing. If we try to make predictions for sequences which are longer than, or of a different length from, those in the training set, the model may not perform well, as it has only been trained on a specific range of values.

Conveniently, the method chosen for the transformer model meets these criteria. Additionally, the authors offer the following reasons for why they chose the method they did:
- They needed a way to combine the positional information with the information contained in the word embeddings. Their solution was to calculate the element-wise sum. Thus they needed an algorithm which would produce encodings of exactly the same dimensionality as the embeddings (i.e. a matrix with the same dimensions).
- They wanted the encoding to be simple so that the attention mechanism in the model could easily learn to use it. For any fixed offset $k$, $PE_{pos+k}$ can be expressed as a linear function of $PE_{pos}$, which they hypothesized would make it easy for the model to learn to attend by relative position.
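The linearity argument can be checked numerically. The sketch below uses the sinusoidal formula defined in the Algorithm section and relies on the angle-addition identities: for each frequency, shifting a position by an offset $k$ amounts to multiplying the (sin, cos) pair by a 2x2 matrix that depends only on $k$. The function names, `d_model = 8`, and the chosen positions are my own illustrative assumptions, not something taken from the paper.

```python
import numpy as np

d_model = 8  # illustrative size
i = np.arange(d_model // 2)
w = 1.0 / 10000 ** (2 * i / d_model)  # one frequency per (sin, cos) pair

def pe_pairs(pos):
    """(sin, cos) pair for every frequency, shape (d_model/2, 2)."""
    return np.stack([np.sin(pos * w), np.cos(pos * w)], axis=-1)

def offset_matrices(k):
    """One 2x2 matrix per frequency; depends on the offset k only."""
    c, s = np.cos(k * w), np.sin(k * w)
    top = np.stack([c, s], axis=-1)      # row producing sin((pos + k) * w)
    bottom = np.stack([-s, c], axis=-1)  # row producing cos((pos + k) * w)
    return np.stack([top, bottom], axis=-2)  # shape (d_model/2, 2, 2)

pos, k = 7, 5
# Apply each frequency's matrix to that frequency's (sin, cos) pair.
shifted = np.einsum("fij,fj->fi", offset_matrices(k), pe_pairs(pos))
print(np.allclose(shifted, pe_pairs(pos + k)))  # True
```

The same matrices work for any starting position, which is what makes a relative offset expressible as a linear map.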

### Algorithm

The transformer accepts input sequences of length $N$. From these inputs it generates the corresponding Input Embeddings of dimension $N \times d_{model}$, where each token has a corresponding embedding represented by a $d_{model}$-dimensional vector. The transformer then calculates the Positional Encodings $PE$ corresponding to each embedding. Each Positional Encoding, for each token, is also represented by a $d_{model}$-dimensional vector.

<center><img src='./images/position_encoding_simple.png' style="width:30%"></center>
The Positional Encoding $PE$ for the token at position $pos$ and dimension $i$ can be calculated as:

$$ PE_{(pos,2i)} = sin(pos/10000^{2i/d_{model}}) $$

$$ PE_{(pos,2i+1)} = cos(pos/10000^{2i/d_{model}}) $$

We can simplify this equation by introducing $w_i=\frac{1}{10000^{2i/d_{model}}}$ such that:

$$ PE_{(pos,2i)} = sin(pos*w_i) $$

$$ PE_{(pos,2i+1)} = cos(pos*w_i) $$

This results in a positional encoding vector for the token at position $pos$ such that:

$$ \vec{PE}_{pos} = \begin{bmatrix}
sin(pos*w_0) \\
cos(pos*w_0) \\
\vdots \\
sin(pos*w_{d_{model}/2-1}) \\
cos(pos*w_{d_{model}/2-1}) \\
\end{bmatrix}$$

Note: the index $i$ only needs to run from $0$ to $d_{model}/2 - 1$ rather than over all $d_{model}$ dimensions, because each value of $i$ produces two elements of the vector (a sine and a cosine).
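As a rough sketch, the formulas above translate to NumPy as follows. The function name `positional_encoding` and its parameters are my own. I am assuming that $d_{model}$ is even and that $i$ runs from $0$ to $d_{model}/2 - 1$, with sines in the even columns and cosines in the odd columns.

```python
import numpy as np

def positional_encoding(n_positions, d_model):
    """Sinusoidal encodings, shape (n_positions, d_model). Assumes even d_model."""
    pos = np.arange(n_positions)[:, np.newaxis]  # shape (N, 1)
    i = np.arange(d_model // 2)[np.newaxis, :]   # shape (1, d_model/2)
    w = 1.0 / 10000 ** (2 * i / d_model)         # frequencies w_i
    pe = np.zeros((n_positions, d_model))
    pe[:, 0::2] = np.sin(pos * w)  # even columns: PE(pos, 2i)
    pe[:, 1::2] = np.cos(pos * w)  # odd columns:  PE(pos, 2i+1)
    return pe

pe = positional_encoding(n_positions=50, d_model=128)
print(pe.shape)   # (50, 128)
print(pe[0, :4])  # position 0: [0. 1. 0. 1.]
```

Because the result has the same shape as the embedding matrix, it can be added to it element-wise, and every value stays within $[-1, 1]$.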

#### Intuition

At this point I found myself wondering why we are using this complicated structure. It may satisfy the criteria mentioned earlier, but how can this possibly tell us anything meaningful about the position of a token in a sequence?

I found a very helpful explanation in this [article](https://kazemnejad.com/blog/transformer_architecture_positional_encoding/).

We can see an example of the values of the sinusoidal functions below. Each token in a sequence with a max length of 50 has a corresponding value for each dimension of the PE space. In the graph below, each row represents the individual PE vector associated with a particular token position. We can see that this algorithm produces a "criss-cross" pattern which initially is quite similar to that of a grid or matrix.

<center><img src='./images/position_encoding_128_bit_example.png' style='width: 75%'></center>

Looking carefully at this grid, we see that things begin to warp: as you go deeper along the x-axis, you notice that the bands of values along the y-axis begin to stretch and drift. It appears as if each depth has its own rate of change relative to the positions. And this is the key observation which helps explain the method behind the madness.

What we are doing here creates a pattern that is very similar to the one created by numbers represented in binary. Recall that a binary number consists of a series of bits. The number represented by the bit string is calculated by adding together powers of two, each multiplied by its bit; the exponent is equal to the bit's position in the string. Positions are counted from the right, so the least significant bit (i.e. the bit corresponding to $2^0$) is the rightmost bit. For example, with a 4-bit string, the number 3 would be: $0 \times 2^3 + 0 \times 2^2 + 1 \times 2^1 + 1 \times 2^0$. Extrapolating this further we would arrive at the following table:

<center><img src='./images/binary_table.png' style='width:30%'></center>
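The table can also be reproduced with a few lines of Python. The flip count per bit is something I added as a simple way to quantify how quickly each column changes; the helper name `flips` is hypothetical.

```python
n_bits = 4

# Print every 4-bit number next to its bit string.
for number in range(2 ** n_bits):
    print(number, format(number, f"0{n_bits}b"))

def flips(bit):
    """Count how often the given bit changes while counting from 0 to 15."""
    values = [(n >> bit) & 1 for n in range(2 ** n_bits)]
    return sum(a != b for a, b in zip(values, values[1:]))

# Index 0 is the least significant bit.
flip_counts = [flips(bit) for bit in range(n_bits)]
print(flip_counts)  # [15, 7, 3, 1]
```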

Notice how the least significant bit is changing the fastest while the more significant bits change more slowly. This pattern is very similar to the pattern that is shown by the sinusoidal functions. All we need to do is reverse the order of the bits and the similarity between the patterns emerges:


<center><img src='./images/position_encoding_128_bit_example_compared.png' style='width: 100%'></center>



The original paper did not have much information about the calculations involved in deriving the positional encoding. I googled and found this [article](https://towardsdatascience.com/transformers-explained-visually-part-2-how-it-works-step-by-step-b49fa4a64f34) which did offer some guidance.

It states that the Position Encoding is computed independently of the input sequence.

## Positional Embedding

# Implementations