https://www.youtube.com/watch?v=ZMxVe-HK174&list=PLTl9hO2Oobd97qfWC40gOSU8C0iu0m2l4&index=3

Positional encodings are used for 3 reasons:  
- periodicity: sine and cosine repeat, so given a word vector, the encoding helps determine how much attention we should pay to the words 5, 10, 15 positions after it
- bounded values: the positional encodings are constrained between -1 and 1. Without bounded values, when the attention matrices are calculated a word vector would not be able to attend to word vectors that are far away from it in the sequence, and so could not derive any context from them
- easy to extrapolate to long sequences: positional encodings can be computed for words in sequences longer than any seen in the training set
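
A minimal sketch of the last two points, assuming the sinusoidal formula given further down in this notebook. `sinusoidal_pe` is a hypothetical helper name introduced only for this illustration.

```python
import torch

def sinusoidal_pe(num_positions, d_model):
    # hypothetical helper: sinusoidal encodings for positions 0..num_positions-1
    even_i = torch.arange(0, d_model, 2).float()
    denominator = torch.pow(10000, even_i / d_model)
    position = torch.arange(num_positions, dtype=torch.float).reshape(num_positions, 1)
    angles = position / denominator
    # interleave sin (even dimensions) and cos (odd dimensions)
    return torch.stack([torch.sin(angles), torch.cos(angles)], dim=2).flatten(start_dim=1)

short = sinusoidal_pe(10, 6)
long = sinusoidal_pe(1000, 6)

# every value stays within [-1, 1], however large the position gets
print(long.abs().max().item() <= 1.0)
# the encoding of a position does not depend on the sequence length,
# so the first 10 rows of the long sequence match the short one
print(torch.allclose(short, long[:10]))
```

Nothing here is learned, so the same function can be called with any number of positions, including lengths not seen during training.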

In [1]:
import torch
import torch.nn as nn

In [2]:
max_sequence_length = 10
# number of dimensions for PE
d_model = 6

$$PE(position, 2i)=\sin\left(\frac{position}{10000^{\frac{2i}{d_{model}}}}\right)$$ 

$$PE(position, 2i+1)=\cos\left(\frac{position}{10000^{\frac{2i}{d_{model}}}}\right)$$ 

which can be rewritten as:  

$PE(position, i)=\sin\left(\frac{position}{10000^{\frac{i}{d_{model}}}}\right)$, when $i$ is even  

$PE(position, i)=\cos\left(\frac{position}{10000^{\frac{i-1}{d_{model}}}}\right)$, when $i$ is odd

*i* is the dimension index  
*position* is the position of the word in the sequence
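
As a worked example, assuming $d_{model}=6$ as set above, the even dimension indices are $i \in \{0, 2, 4\}$ and the denominators work out to:

$$10000^{\frac{0}{6}}=1, \quad 10000^{\frac{2}{6}}\approx 21.5443, \quad 10000^{\frac{4}{6}}\approx 464.1590$$

For $position=1$, the first two dimensions are then:

$$PE(1, 0)=\sin\left(\frac{1}{1}\right)\approx 0.8415, \quad PE(1, 1)=\cos\left(\frac{1}{1}\right)\approx 0.5403$$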

In [3]:
even_i = torch.arange(0, d_model, 2).float() 

In [4]:
even_denominator = torch.pow(10000, even_i/d_model)

In [5]:
odd_i = torch.arange(1, d_model, 2).float()

In [6]:
odd_denominator = torch.pow(10000, (odd_i -1)/d_model)

In [7]:
even_denominator

tensor([  1.0000,  21.5443, 464.1590])

In [8]:
odd_denominator

tensor([  1.0000,  21.5443, 464.1590])

*even_denominator* and *odd_denominator* are identical, because subtracting 1 from each odd index gives back the even indices. We therefore only need one of them, which we call *denominator*
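
A short check of this equality, written as a self-contained sketch that recomputes both tensors with the same `d_model = 6` used above:

```python
import torch

d_model = 6
even_i = torch.arange(0, d_model, 2).float()   # indices 0, 2, 4
odd_i = torch.arange(1, d_model, 2).float()    # indices 1, 3, 5

even_denominator = torch.pow(10000, even_i / d_model)
odd_denominator = torch.pow(10000, (odd_i - 1) / d_model)

# (odd_i - 1) is 0, 2, 4 again, so both tensors hold the same values
print(torch.allclose(even_denominator, odd_denominator))
```

Note that the two tensors only have the same length when `d_model` is even.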

In [9]:
denominator = even_denominator

In [10]:
position = torch.arange(max_sequence_length, dtype=torch.float).reshape(max_sequence_length, 1)

In [11]:
even_PE = torch.sin(position / denominator)
odd_PE = torch.cos(position / denominator)


In [12]:
even_PE

tensor([[ 0.0000,  0.0000,  0.0000],
        [ 0.8415,  0.0464,  0.0022],
        [ 0.9093,  0.0927,  0.0043],
        [ 0.1411,  0.1388,  0.0065],
        [-0.7568,  0.1846,  0.0086],
        [-0.9589,  0.2300,  0.0108],
        [-0.2794,  0.2749,  0.0129],
        [ 0.6570,  0.3192,  0.0151],
        [ 0.9894,  0.3629,  0.0172],
        [ 0.4121,  0.4057,  0.0194]])

In [13]:
odd_PE

tensor([[ 1.0000,  1.0000,  1.0000],
        [ 0.5403,  0.9989,  1.0000],
        [-0.4161,  0.9957,  1.0000],
        [-0.9900,  0.9903,  1.0000],
        [-0.6536,  0.9828,  1.0000],
        [ 0.2837,  0.9732,  0.9999],
        [ 0.9602,  0.9615,  0.9999],
        [ 0.7539,  0.9477,  0.9999],
        [-0.1455,  0.9318,  0.9999],
        [-0.9111,  0.9140,  0.9998]])

In [14]:
# interleave the odd and even positional encodings
stacked = torch.stack([even_PE, odd_PE], dim=2)
# provides vectors of positional encodings for each word (in this case, 10 words)
PE = torch.flatten(stacked, start_dim=1, end_dim=2)

In [15]:
PE.shape

torch.Size([10, 6])
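
To see why `torch.stack` followed by `torch.flatten` interleaves the two tensors, here is a small sketch using made-up values (the numbers are placeholders chosen so the ordering is easy to read, not real encodings):

```python
import torch

# placeholder values: even-indexed and odd-indexed columns for 2 positions
even = torch.tensor([[0., 2., 4.], [10., 12., 14.]])
odd = torch.tensor([[1., 3., 5.], [11., 13., 15.]])

# stacking on a new last dimension pairs each even value with its odd neighbour
stacked = torch.stack([even, odd], dim=2)                     # shape (2, 3, 2)
# flattening the last two dimensions lays the pairs out side by side
interleaved = torch.flatten(stacked, start_dim=1, end_dim=2)  # shape (2, 6)

print(interleaved)
```

Each row comes out in the order even, odd, even, odd, and so on, which is the column layout the formula expects.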

In [None]:
class PositionalEncoding(nn.Module):
    def __init__(self, d_model, max_sequence_length):
        super().__init__()
        self.max_sequence_length = max_sequence_length
        self.d_model = d_model

    def forward(self):
        even_i = torch.arange(0, self.d_model, 2).float()
        denominator = torch.pow(10000, even_i / self.d_model)
        position = torch.arange(self.max_sequence_length, dtype=torch.float).reshape(self.max_sequence_length, 1)
        even_PE = torch.sin(position / denominator)
        odd_PE = torch.cos(position / denominator)
        # interleave the odd and even positional encodings
        stacked = torch.stack([even_PE, odd_PE], dim=2)
        # provides one vector of positional encodings per position (max_sequence_length rows)
        PE = torch.flatten(stacked, start_dim=1, end_dim=2)
        return PE

# d_model must be even so that the sin and cos columns pair up
pe = PositionalEncoding(d_model=6, max_sequence_length=10)
pe.forward()
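
One way to sanity-check the vectorised computation is to compare it against a direct, element-by-element evaluation of the formula. This is a sketch rather than part of the original notebook; `pe_reference` is a hypothetical helper, and the tolerance is an assumption chosen to absorb float32 rounding.

```python
import math
import torch

def pe_reference(position, i, d_model):
    # hypothetical helper: evaluate the formula for a single (position, i) pair
    exponent = (i if i % 2 == 0 else i - 1) / d_model
    angle = position / (10000 ** exponent)
    return math.sin(angle) if i % 2 == 0 else math.cos(angle)

d_model, max_sequence_length = 6, 10

# vectorised version, same steps as in the cells above
even_i = torch.arange(0, d_model, 2).float()
denominator = torch.pow(10000, even_i / d_model)
position = torch.arange(max_sequence_length, dtype=torch.float).reshape(max_sequence_length, 1)
stacked = torch.stack([torch.sin(position / denominator), torch.cos(position / denominator)], dim=2)
PE = torch.flatten(stacked, start_dim=1, end_dim=2)

# loop version, one element at a time
expected = torch.tensor(
    [[pe_reference(p, i, d_model) for i in range(d_model)] for p in range(max_sequence_length)]
)

print(torch.allclose(PE, expected, atol=1e-5))
```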