<a href="https://colab.research.google.com/github/rahiakela/transformers-for-natural-language-processing/blob/main/1-model-architecture-of-the-transformer/1_positional_encoding.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Positional encoding

The Transformer's subsequent layers do not start empty-handed. They have learned word embeddings that already provide information on how the words can be associated.

However, a big chunk of information is missing because no additional vector or information indicates a word's position in a sequence.

The designers of the Transformer came up with yet another innovative feature: positional encoding.

Let's see how positional encoding works.

We enter this positional encoding function of the Transformer with no idea of the position of a word in a sequence:

<img src='https://github.com/rahiakela/img-repo/blob/master/transformers-for-natural-language-processing/position-encoding.png?raw=1' width='800'/>

We cannot create independent positional vectors: they would slow down the training of the Transformer and make the attention sub-layers very complex to work with.

**The idea is to add a positional encoding value to the input embedding instead of having additional vectors to describe the position of a token in a sequence.**

We also know that the Transformer expects a fixed size $d_{model} = 512$ (or other constant value for the model) for each vector of the output of the positional encoding function.

## Setup

In [None]:
!pip install --upgrade gensim

In [None]:
import torch
import nltk
nltk.download('punkt')

In [3]:
import math
import numpy as np
from nltk.tokenize import sent_tokenize, word_tokenize 
import gensim 
from gensim.models import Word2Vec 
from sklearn.metrics.pairwise import cosine_similarity
import matplotlib.pyplot as plt
import warnings 
warnings.filterwarnings(action = 'ignore') 

In [None]:
!wget https://raw.githubusercontent.com/rahiakela/transformers-for-natural-language-processing/main/1-model-architecture-of-the-transformer/text.txt

In [5]:
dprint=0 # prints outputs if set to 1, default=0

# loading txt file (the with-block closes the file once it has been read)
with open("text.txt", "r") as sample:
  s = sample.read()

# processing escape characters
f = s.replace("\n", " ")

data = []
# sentence parsing
for i in sent_tokenize(f):
  temp = []
  # tokenize the sentence into words
  for j in word_tokenize(i):
    temp.append(j.lower())
  data.append(temp)

# Creating Skip Gram model
# gensim >= 4.0 names the embedding size `vector_size` (earlier releases used `size`)
model = gensim.models.Word2Vec(data, min_count=1, vector_size=512, window=5, sg=1)

In [6]:
# 1-The 2-black 3-cat 4-sat 5-on 6-the 7-couch 8-and 9-the 10-brown 11-dog 12-slept 13-on 14-the 15-rug.
word1 = "black"
word2 = "brown"

pos1 = 2
pos2 = 10

# word vectors are looked up through the model's KeyedVectors (`model.wv`)
a = model.wv[word1]
b = model.wv[word2]

if dprint == 1:
  print(a)

# compute cosine similarity
dot = np.dot(a, b)
norm_a = np.linalg.norm(a)
norm_b = np.linalg.norm(b)

cos = dot / (norm_a * norm_b)

aa = a.reshape(1, 512)
bb = b.reshape(1, 512)
cos_lib = cosine_similarity(aa, bb)
cos_lib

array([[0.99987805]], dtype=float32)

## Introduction

If we go back to the sentence we used in the word embedding sub-layer, we can see that **black** and **brown** may be similar, but they are far apart:

```
The black cat sat on the couch and the brown dog slept on the rug.
```

The word **black** is in position 2, pos=2, and the word **brown** is in position 10, pos=10.
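
As a quick check of these positions, the following minimal sketch splits the sentence on whitespace and counts tokens from 1. The `position` helper is a hypothetical name introduced here for illustration; it is not part of the notebook.

```python
sentence = "The black cat sat on the couch and the brown dog slept on the rug."

# Lowercase, drop the final period, then split on whitespace
tokens = sentence.lower().rstrip(".").split()

def position(word):
  """Return the 1-based position of the first occurrence of word."""
  return tokens.index(word) + 1

pos_black = position("black")
pos_brown = position("brown")
print(pos_black, pos_brown)
```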

Our problem is to find a way to add a value to the word embedding of each word so that it carries that positional information. However, we need to add a value to each of the $d_{model} = 512$ dimensions! For each word embedding vector, we need a value for every dimension $i$ in $range(0,512)$ of the word embedding vectors of **black** and **brown**.

There are many ways to achieve this goal. The designers found a clever way to use a unit sphere to represent positional encoding with sine and cosine values that will thus remain small but very useful.

Vaswani et al. (2017) provide sine and cosine functions so that we can generate different frequencies for the positional encoding (PE) for each position and each dimension $i$ of the $d_{model} = 512$ of the word embedding vector:

$$ PE_{(pos,2i)}=\sin\left(\frac{pos}{10000^{\frac{2i}{d_{model}}}}\right)$$

$$ PE_{(pos,2i+1)}=\cos\left(\frac{pos}{10000^{\frac{2i}{d_{model}}}}\right)$$

If we walk through the word embedding vector with the constant $d_{model} = 512$, we begin at dimension 0 and end at dimension 511. **This means that the sine function will be applied to the even dimensions and the cosine function to the odd dimensions**. Some implementations do it differently. In that case, the domain of the sine function can be $i \in [0, 255]$ and the domain of the cosine function can be $i \in [256, 511]$. This will produce similar results.

A literal translation into Python produces the following code for a positional vector $pe[0][i]$ for a position `pos`:

In [7]:
def positional_encoding(pos, pe):
  for i in range(0, 512, 2):
    pe[0][i] = math.sin(pos / (10000 ** ((2 * i) / 512)))
    pe[0][i + 1] = math.cos(pos / (10000 ** ((2 * i) / 512)))
  return pe
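
For comparison, here is a minimal sketch that follows the index convention of the formulas above, where the exponent is $2i/d_{model}$ and $2i$ is itself the even dimension index. In the loop above, `i` already steps through the even dimensions 0, 2, 4, ..., so `(2 * i) / 512` applies twice the exponent of the formula; the values still stay within $[-1, 1]$, but the frequencies differ from the vectorized `div_term` version used later in this notebook. The function name `positional_encoding_vector` and its plain-list return value are assumptions made for illustration.

```python
import math

def positional_encoding_vector(pos, d_model=512):
  """Return the d_model-long sinusoidal encoding of one position as a list."""
  pe = [0.0] * d_model
  for k in range(0, d_model, 2):  # k is the even dimension index, i.e. 2i
    angle = pos / (10000 ** (k / d_model))
    pe[k] = math.sin(angle)      # even dimensions use sine
    pe[k + 1] = math.cos(angle)  # odd dimensions use cosine
  return pe

pe_2 = positional_encoding_vector(2)
print(pe_2[:4])
```

Because it returns a fresh list for each call, this version can be evaluated for several positions without one result overwriting another.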

Before going further, you might want to see the plot of the sine function, for example, for pos=2.

<img src='https://github.com/rahiakela/img-repo/blob/master/transformers-for-natural-language-processing/pos-2-plot.png?raw=1' width='800'/>

The same plot for pos=10 looks like this:

<img src='https://github.com/rahiakela/img-repo/blob/master/transformers-for-natural-language-processing/pos-10-plot.png?raw=1' width='800'/>

If we go back to the sentence we are parsing in this section, we can see that black is in position pos=2 and brown is in position pos=10:

```
The black cat sat on the couch and the brown dog slept on the rug.
```

If we apply the sine and cosine functions literally for pos=2, we obtain a size=512 positional encoding vector, stored in the first row of `pe`:

In [8]:
d_model=512
max_length=20
max_len=max_length

pe = torch.zeros(max_len, d_model)
positional_encoding(2, pe)

tensor([[ 9.0930e-01, -4.1615e-01,  9.5814e-01,  ...,  1.0000e+00,
          2.1492e-08,  1.0000e+00],
        [ 0.0000e+00,  0.0000e+00,  0.0000e+00,  ...,  0.0000e+00,
          0.0000e+00,  0.0000e+00],
        [ 0.0000e+00,  0.0000e+00,  0.0000e+00,  ...,  0.0000e+00,
          0.0000e+00,  0.0000e+00],
        ...,
        [ 0.0000e+00,  0.0000e+00,  0.0000e+00,  ...,  0.0000e+00,
          0.0000e+00,  0.0000e+00],
        [ 0.0000e+00,  0.0000e+00,  0.0000e+00,  ...,  0.0000e+00,
          0.0000e+00,  0.0000e+00],
        [ 0.0000e+00,  0.0000e+00,  0.0000e+00,  ...,  0.0000e+00,
          0.0000e+00,  0.0000e+00]])

We also obtain a size=512 positional encoding vector for position 10, pos=10:

In [9]:
positional_encoding(10, pe)

tensor([[-5.4402e-01, -8.3907e-01,  1.1878e-01,  ...,  1.0000e+00,
          1.0746e-07,  1.0000e+00],
        [ 0.0000e+00,  0.0000e+00,  0.0000e+00,  ...,  0.0000e+00,
          0.0000e+00,  0.0000e+00],
        [ 0.0000e+00,  0.0000e+00,  0.0000e+00,  ...,  0.0000e+00,
          0.0000e+00,  0.0000e+00],
        ...,
        [ 0.0000e+00,  0.0000e+00,  0.0000e+00,  ...,  0.0000e+00,
          0.0000e+00,  0.0000e+00],
        [ 0.0000e+00,  0.0000e+00,  0.0000e+00,  ...,  0.0000e+00,
          0.0000e+00,  0.0000e+00],
        [ 0.0000e+00,  0.0000e+00,  0.0000e+00,  ...,  0.0000e+00,
          0.0000e+00,  0.0000e+00]])

Having obtained these results with an intuitive literal translation, we would now like to check whether they are meaningful.

The cosine similarity function used for word embedding comes in handy for having
a better visualization of the proximity of the positions:

In [10]:
cosine_similarity(positional_encoding(2, pe), positional_encoding(10, pe))

array([[1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0.
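
Note that `positional_encoding` writes into row 0 of `pe` and returns that same tensor on every call, so the two arguments in the cell above are one and the same object. This is why the entry at `[0][0]` is 1 and every other row is zero. A minimal sketch of comparing the two positions as separate vectors follows. It uses plain Python lists, the same exponent `(2 * i) / 512` as the loop above, and two hypothetical helpers, `encode` and `cosine`, that are not part of the notebook.

```python
import math

def encode(pos, d_model=512):
  """Build a fresh encoding list for one position (same exponent as the loop above)."""
  vec = [0.0] * d_model
  for i in range(0, d_model, 2):
    angle = pos / (10000 ** ((2 * i) / d_model))
    vec[i] = math.sin(angle)
    vec[i + 1] = math.cos(angle)
  return vec

def cosine(u, v):
  """Cosine similarity of two equal-length lists."""
  dot = sum(x * y for x, y in zip(u, v))
  norm_u = math.sqrt(sum(x * x for x in u))
  norm_v = math.sqrt(sum(y * y for y in v))
  return dot / (norm_u * norm_v)

similarity = cosine(encode(2), encode(10))
print(similarity)
```

Computed this way, the value should land close to the positional similarity of about 0.86 printed at the end of this notebook.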

The similarity between the positions of the words black and brown differs from their lexical
field (groups of words that go together) similarity:

```python
cosine_similarity(black, brown)= [[0.9998901]]
```

The encoding of the position shows a lower similarity value than the word
embedding similarity.

The positional encoding has taken these words apart. Bear in mind that word
embeddings will vary with the corpus used to train them.

**The problem is now how to add the positional encoding to the word embedding
vectors.**

## Adding positional encoding to the embedding vector

The authors of the Transformer found a simple way by merely adding the positional encoding vector to the word embedding vector:

<img src='https://github.com/rahiakela/img-repo/blob/master/transformers-for-natural-language-processing/position-encoding2.png?raw=1' width='800'/>

If we go back and take the word embedding of black, for example, and name it
$y_1=black$, we are ready to add it to the positional vector $pe(2)$ we obtained with positional encoding functions. We will obtain the positional encoding $pc(black)$ of the input word black:

$$ pc(black)=y_1 + pe(2)$$

The solution is straightforward. **However, if we apply it as shown, we might lose the information of the word embedding, which will be minimized by the positional encoding vector.**
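
To get a feel for the magnitudes involved, note that each sine and cosine pair shares one angle, so its squared values sum to 1 and the positional vector has norm $\sqrt{d_{model}/2} = 16$ for $d_{model} = 512$, whatever the position. The sketch below checks this numerically; `pe_norm` is a hypothetical helper, and the exponent follows the loop used in this notebook.

```python
import math

def pe_norm(pos, d_model=512):
  """Euclidean norm of the sinusoidal positional vector for one position."""
  total = 0.0
  for i in range(0, d_model, 2):
    angle = pos / (10000 ** ((2 * i) / d_model))
    total += math.sin(angle) ** 2 + math.cos(angle) ** 2  # each pair adds 1
  return math.sqrt(total)

norm_2 = pe_norm(2)
norm_10 = pe_norm(10)
print(norm_2, norm_10)
```

A word embedding whose norm is much smaller than 16 contributes little to the sum, which is the concern raised above.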

There are many possibilities to increase the value of $y_1$ to make sure that the information of the word embedding layer can be used efficiently in the subsequent layers.

One of the many possibilities is to scale $y_1$, the word embedding of black, by a constant factor:

$$y_1 \cdot \sqrt{d_{model}}$$

We can now add the positional vector to the embedding vector of the word black,
both of which are the same size 512.

```python
for i in range(0, 512,2):
  pe[0][i] = math.sin(pos / (10000 ** ((2 * i)/d_model)))
  pc[0][i] = (y[0][i]*math.sqrt(d_model))+ pe[0][i]
  pe[0][i+1] = math.cos(pos / (10000 ** ((2 * i)/d_model)))
  pc[0][i+1] = (y[0][i+1]*math.sqrt(d_model))+ pe[0][i+1]
```


Let's define some parameters and variables.

In [12]:
pe1=aa.copy()  # pe for 1
pe2=aa.copy()  # pe for 2
pe3=aa.copy()  # pe for 3
paa=aa.copy()  # y for 1
pba=bb.copy()  # y for 2

d_model=512
max_print=d_model
max_length=20

In [13]:
for i in range(0, max_print, 2):
  pe1[0][i] = math.sin(pos1 / (10000 ** ((2 * i) / d_model)))
  paa[0][i] = (paa[0][i] * math.sqrt(d_model)) + pe1[0][i]

  pe1[0][i + 1] = math.cos(pos1 / (10000 ** ((2 * i) / d_model)))
  paa[0][i + 1] = (paa[0][i + 1] * math.sqrt(d_model)) + pe1[0][i + 1]

  if dprint == 1:
    print(i, pe1[0][i], i + 1, pe1[0][i + 1])
    print(i, paa[0][i], i + 1, paa[0][i + 1])
    print("\n")

max_len = max_length
pe = torch.zeros(max_len, d_model)
position = torch.arange(0, max_len, dtype=torch.float).unsqueeze(1)
div_term = torch.exp(torch.arange(0, d_model, 2).float() * (-math.log(10000.0) / d_model))
pe[:, 0::2] = torch.sin(position * div_term)
pe[:, 1::2] = torch.cos(position * div_term)

print(pe[:, 0::2])

tensor([[ 0.0000e+00,  0.0000e+00,  0.0000e+00,  ...,  0.0000e+00,
          0.0000e+00,  0.0000e+00],
        [ 8.4147e-01,  8.2186e-01,  8.0196e-01,  ...,  1.1140e-04,
          1.0746e-04,  1.0366e-04],
        [ 9.0930e-01,  9.3641e-01,  9.5814e-01,  ...,  2.2279e-04,
          2.1492e-04,  2.0733e-04],
        ...,
        [-9.6140e-01, -6.3753e-01, -1.1153e-01,  ...,  1.8938e-03,
          1.8268e-03,  1.7623e-03],
        [-7.5099e-01, -9.9638e-01, -8.6358e-01,  ...,  2.0052e-03,
          1.9343e-03,  1.8659e-03],
        [ 1.4988e-01, -4.9773e-01, -9.2024e-01,  ...,  2.1165e-03,
          2.0418e-03,  1.9696e-03]])


In [14]:
print(pe[:, 1::2])

tensor([[ 1.0000,  1.0000,  1.0000,  ...,  1.0000,  1.0000,  1.0000],
        [ 0.5403,  0.5697,  0.5974,  ...,  1.0000,  1.0000,  1.0000],
        [-0.4161, -0.3509, -0.2863,  ...,  1.0000,  1.0000,  1.0000],
        ...,
        [-0.2752, -0.7704, -0.9938,  ...,  1.0000,  1.0000,  1.0000],
        [ 0.6603,  0.0850, -0.5042,  ...,  1.0000,  1.0000,  1.0000],
        [ 0.9887,  0.8673,  0.3914,  ...,  1.0000,  1.0000,  1.0000]])
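
The `div_term` line computes the inverse frequencies through an exponential and a logarithm rather than a power. A minimal sketch in plain Python, with the hypothetical names `direct` and `via_exp`, suggests that both forms give the same values, since $e^{-k \ln(10000)/d_{model}} = 10000^{-k/d_{model}}$ for each even index $k$:

```python
import math

d_model = 512
even_indices = range(0, d_model, 2)

# Direct power form: 1 / 10000^(k / d_model)
direct = [1.0 / (10000 ** (k / d_model)) for k in even_indices]

# Exponential form used for div_term: exp(k * (-ln(10000) / d_model))
via_exp = [math.exp(k * (-math.log(10000.0) / d_model)) for k in even_indices]

# Largest absolute difference between the two forms
max_gap = max(abs(a - b) for a, b in zip(direct, via_exp))
print(max_gap)
```

Any remaining gap is floating-point rounding, so the two formulations can be treated as interchangeable in this setting.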


The same operation is applied to the word brown and all of the other words in a
sequence. The output of this algorithm, which is not rule-based, might slightly vary during each run.

In [20]:
for i in range(0, max_print, 2):
  pe2[0][i] = math.sin(pos2 / (10000 ** ((2 * i) / d_model)))
  pba[0][i] = (pba[0][i] * math.sqrt(d_model)) + pe2[0][i]

  pe2[0][i + 1] = math.cos(pos2 / (10000 ** ((2 * i) / d_model)))
  pba[0][i + 1] = (pba[0][i + 1] * math.sqrt(d_model)) + pe2[0][i + 1]

  if dprint == 1:
    print(i, pe2[0][i], i + 1, pe2[0][i + 1])
    print(i, pba[0][i], i + 1, pba[0][i + 1])
    print("\n")

We can apply the cosine similarity function to the positional encoding vectors of black and brown:

```python
cosine_similarity(pc(black), pc(brown)) = [[0.9627094]]
```

We now have a clear view of the positional encoding process through the three
cosine similarity functions we applied to the three states representing the words black and brown:


In [23]:
print(word1, word2)

cos_lib = cosine_similarity(aa, bb)
print(cos_lib, " >> word similarity")

cos_lib = cosine_similarity(pe1, pe2)
print(cos_lib, "  >> positional similarity")

cos_lib = cosine_similarity(paa, pba)
print(cos_lib, "  >> positional encoding similarity")

if dprint == 1:
  print(word1)
  print("embedding")
  print(aa)
  print("positional encoding")
  print(pe1)
  print("encoded embedding")
  print(paa)

  print(word2)
  print("embedding")
  print(bb)
  print("positional encoding")
  print(pe2)
  print("encoded embedding")
  print(pba)

black brown
[[0.99987805]]  >> word similarity
[[0.8600013]]   >> positional similarity
[[0.9608184]]   >> positional encoding similarity


We saw that the initial word similarity of their embeddings was very high, with a value of 0.99. Then we saw the positional encoding vector of positions 2 and
10 drew these two words apart with a lower similarity value of 0.86.

Finally, we added the word embedding vector of each word to its respective
positional encoding vector. We saw that this brought the cosine similarity of the two words to 0.96.

**The positional encoding of each word now contains the initial word embedding
information and the positional encoding values.**

**The output of positional encoding is passed to the multi-head attention sub-layer.**