# Positional Encoding

The order of the words in a sentence is important for conveying the intended meaning. Because a transformer processes tokens independently and simultaneously, it's crucial to capture the position of each token to preserve that meaning.

Consider the phrases:

"King and Queen are awesome"

&

"Queen and King are awesome"

These sentences are slightly different, but because they contain the same words, they produce the same set of embedding vectors; nothing in the embeddings records which word came first. We introduce positional encoding to address this issue. It incorporates information about the position of each embedding within a sequence. Positional encoding is added to the input embeddings, enabling the model to differentiate between the positions of the various elements in the input sequence.

Once positional encoding is added to our example phrases, their vector representations differ from one another.
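
To make the problem concrete, here is a minimal sketch in plain Python. The 2-dimensional embedding values and the helper names (`EMB`, `positional_encoding`, `encode`) are made up for illustration and do not come from a trained model. It suggests that the two phrases contain exactly the same set of token vectors until a sinusoidal positional encoding is added.

```python
import math

# Hypothetical 2-dimensional embeddings, chosen only for illustration.
EMB = {
    "King": [0.9, 0.1],
    "and": [0.2, 0.2],
    "Queen": [0.1, 0.9],
    "are": [0.3, 0.4],
    "awesome": [0.5, 0.5],
}

def positional_encoding(pos, d_model):
    """Sinusoidal encoding: sine on even indices, cosine on odd indices."""
    pe = []
    for k in range(d_model):
        angle = pos / (10000 ** ((k - k % 2) / d_model))
        pe.append(math.sin(angle) if k % 2 == 0 else math.cos(angle))
    return pe

def encode(sentence, use_position):
    """Return one (rounded) vector per word, optionally with position added."""
    vectors = []
    for pos, word in enumerate(sentence.split()):
        vec = EMB[word]
        if use_position:
            pe = positional_encoding(pos, len(vec))
            vec = [v + p for v, p in zip(vec, pe)]
        vectors.append(tuple(round(v, 4) for v in vec))
    return vectors

a = "King and Queen are awesome"
b = "Queen and King are awesome"

# Without positions, both phrases yield the same collection of vectors.
same_without = sorted(encode(a, False)) == sorted(encode(b, False))
# With positions, "King" at position 0 differs from "King" at position 2.
same_with = sorted(encode(a, True)) == sorted(encode(b, True))

print(same_without)  # True
print(same_with)     # False
```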


![Positional Encoding Image](/Users/williamzebrowski/transform_llm/images/cosine-sine.png)

Positional encoding consists of a series of sine and cosine waves and involves two parameters:

1. The `pos` parameter

- PE(`pos`, 2i)
- PE(`pos`, 2i + 1)

..represents the position along the wave, similar to the time variable `t` or the `x` coordinate in a standard plot:

![Position Param Image](/Users/williamzebrowski/transform_llm/images/pos-param.png)

2. The parameter `i`, which is the dimension index:

- PE(pos, `2i`)
- PE(pos, `2i + 1`)

..effectively generates a unique sine or cosine wave for each embedding dimension and controls the number of oscillations of that wave. An oscillation is a single cycle of a wave, moving from the highest point to the lowest point and back to the highest point, and the number of oscillations is how many times the wave cycles across the positions being encoded. Each of these waves is added to a different dimension of the word embedding, which is how positional information is encoded within each embedding.
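
How `i` controls the number of oscillations can be sketched numerically. The snippet below assumes the base of 10000 used in the standard sinusoidal formulation and an embedding size of 8 chosen only for illustration; `frequency` and `wavelength` are hypothetical helper names.

```python
import math

def frequency(i, d_model, base=10000.0):
    """Angular frequency of the sine/cosine pair with dimension index i."""
    return 1.0 / (base ** (2 * i / d_model))

def wavelength(i, d_model, base=10000.0):
    """Number of positions needed for one full oscillation."""
    return 2 * math.pi / frequency(i, d_model, base)

d_model = 8  # assumed embedding size for this illustration
for i in range(d_model // 2):
    print(f"i={i}: frequency={frequency(i, d_model):.4f}, "
          f"wavelength={wavelength(i, d_model):.1f} positions")
```

A small `i` gives a wave that oscillates quickly from one position to the next, while a larger `i` gives a much slower wave.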

This is part of the design of positional encodings in transformer models, which use sine and cosine functions to generate these values.

- For even indices `(0, 2, ...)`, the positional encoding uses a sine function.
- For odd indices `(1, 3, ...)`, the positional encoding uses a cosine function.

where:
- `pos` is the position of the word in the sentence.
- `i` is the dimension index; `2i` and `2i + 1` are the even and odd indices within the embedding vector.
- `d` is the dimensionality of the word embeddings (4 in the example below).
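
Written out, the standard sinusoidal formulation, which the PyTorch code later in this section follows, is shown below. Here `i` runs over the sine/cosine pairs, so `2i` and `2i + 1` are the even and odd embedding indices.

```latex
PE(pos, 2i) = \sin\!\left(\frac{pos}{10000^{2i/d}}\right), \qquad
PE(pos, 2i+1) = \cos\!\left(\frac{pos}{10000^{2i/d}}\right)
```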


![Dimension Index Image](/Users/williamzebrowski/transform_llm/images/dim-index.png)

Let's explore further with an example, beginning with a sequence of embeddings for the phrase 'Transformers are awesome'. Each row represents an embedding for a specific word, and each column corresponds to an element in that embedding.

Input Embeddings:

| Word         | Dim 0 | Dim 1 | Dim 2 | Dim 3 |
|--------------|-------|-------|-------|-------|
| Transformers | 0.2   | 0.4   | 0.1   | 0.3   |
| are          | 0.5   | 0.2   | 0.7   | 0.9   |
| awesome      | 0.8   | 0.6   | 0.4   | 0.2   |

`POS(t)` is the specific position of each word embedding within the sequence.  

| <span style="color: purple">POS(t)</span>|
|-------|
| 0     |
| 1     |
| 2     |

In this example, the sequence length is 3.

Each word embedding has a dimensionality of 4, and its dimensions are treated differently depending on whether their indices are odd or even.

| Dimensions | 0   | <span style="color: purple">1</span> | 2   | <span style="color: purple">3</span> |
|-------------|-----|-----|-----|-----|


Let's add positional encoding to the embedding for `Transformers`:

| Transformers| 0.2 | 0.4 | 0.1 | 0.3 |
|-------------|-----|-----|-----|-----|

Recall:
- For even indices `(0, 2, ...)`, the positional encoding uses a sine function.
- For odd indices `(1, 3, ...)`, the positional encoding uses a cosine function.

For each dimension in the positional encoding, you introduce the corresponding sine and cosine waves. Thus, for `Transformers (pos = 0)`:

For `i=0`, a sine wave is added at the even index and a cosine wave at the odd index, and this pattern is repeated for `i=1`, ensuring that each dimension is uniquely represented by its own wave.

`i=0` 

![Position 0 Image](/Users/williamzebrowski/transform_llm/images/tran_pos0.png)

`i=1`

![Position 1 Image](/Users/williamzebrowski/transform_llm/images/tran_pos1.png)



Similarly, the positional encoding values are calculated for the words "are" and "awesome".

A positional encoding table:

| Word         | Dim 0 | Dim 1 | Dim 2 | Dim 3 |
|--------------|-------|-------|-------|-------|
| Transformers | 0     | 1     | 0     | 1     |
| are          | 0.84  | 0.54  | 0.01  | 1.00  |
| awesome      | 0.91  | -0.42 | 0.02  | 1.00  |
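
The table can be reproduced with a few lines of plain Python. This is a sketch that assumes the standard sinusoidal formula with base 10000 and `d = 4`; `sinusoidal_encoding` is a hypothetical helper name, not part of any library.

```python
import math

def sinusoidal_encoding(pos, d_model, base=10000.0):
    """Return the positional encoding vector for one position."""
    vector = []
    for k in range(d_model):
        i = k // 2  # index of the sine/cosine pair
        angle = pos / (base ** (2 * i / d_model))
        # Even embedding indices use sine, odd indices use cosine.
        vector.append(math.sin(angle) if k % 2 == 0 else math.cos(angle))
    return vector

words = ["Transformers", "are", "awesome"]
table = [sinusoidal_encoding(pos, d_model=4) for pos in range(len(words))]

for word, row in zip(words, table):
    print(f"{word:>12}: " + "  ".join(f"{value:7.4f}" for value in row))
```

Passing a larger `d_model` to the same helper gives a wider table with one sine/cosine pair per additional value of `i`.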


The 3 graphs represent positional encoding, one for each dimension of the word embeddings.  For the sequence length of 3, you see only three values for each sine function.

![Positional Graph Image](/Users/williamzebrowski/transform_llm/images/pos_graph.png)

Let's generate positional encoding with an embedding dimension of 8 to depict a realistic setting:

| Word         | Dim 0 | Dim 1 | Dim 2 | Dim 3 | Dim 4 | Dim 5 | Dim 6 | Dim 7 |
|--------------|-------|-------|-------|-------|-------|-------|-------|-------|
| Transformers | 0     | 1     | 0     | 1     | 0     | 1     | 0     | 1     |
| are          | 0.84  | 0.54  | 0.10  | 1.00  | 0.01  | 1.00  | 0.00  | 1.00  |
| awesome      | 0.91  | -0.42 | 0.20  | 0.98  | 0.02  | 1.00  | 0.00  | 1.00  |
| pos=3        | -     | -     | -     | -     | -     | -     | -     | -     |
| pos=4        | -     | -     | -     | -     | -     | -     | -     | -     |
| pos=99       | -1.00 | 0.04  | -0.46 | -0.89 | 0.84  | 0.55  | 0.10  | 1.00  |


Positional encoding can be conceptualized as a series of vectors, where each vector captures a specific location within the sequence. When this positional vector is added to its corresponding embedding vector, the combination preserves the positional information, ensuring that the order of the elements in the sequence is maintained within the resulting vector. In models such as GPT, positional encodings are not static but rather learnable parameters. These learnable parameters, represented by tensors, are added to the embedding vector and optimized during training.

Segment embeddings, used in certain models such as BERT, are related to positional encodings: they indicate which segment of the input a token belongs to. You can integrate segment embeddings into the existing embeddings alongside the positional encodings.

1. For instance, if the embedding size is 4, a word might be represented as:

`x_w = [0.2, 0.4, 0.1, 0.3]`

2. For a position `i` in a sequence, the positional encoding might be:

`p_i = [0, 1, 0, 1]`

3. Add the positional encoding to the word embedding. This produces a new vector that contains both the word's semantic meaning and its position in the sequence.

Let's perform the addition element-wise:
- For the first element: `0.2 + 0 = 0.2`
- For the second element: `0.4 + 1 = 1.4`
- For the third element: `0.1 + 0 = 0.1`
- For the fourth element: `0.3 + 1 = 1.3`

`x'_w = x_w + p_i = [0.2, 0.4, 0.1, 0.3] + [0, 1, 0, 1] = [0.2, 1.4, 0.1, 1.3]`

4. Learnable positional encoding in GPT models

Learnable positional encodings offer flexibility and can adapt to the specific patterns of the dataset, potentially leading to better performance on tasks where the relative position of tokens is particularly important.

The learnable positional encoding for position `i` might initially be:

`w_i = [0.02, 0.04, 0.08, 0.06]`

5. `Segment Embeddings` are additional embeddings used in models like BERT to distinguish between different segments of text (e.g., sentences in a document).

Segment embeddings allow the model to distinguish between different parts of the input, such as the question and answer in a question-answering task. By adding a unique embedding to tokens from different segments, the model can learn segment-specific representations.

`s_i = [0.05, 0.07, 0.1, 0.02]`

6. Integrate `Segment Embeddings` with the existing word embeddings and positional encodings.

The integration of segment embeddings with word embeddings and positional encodings enriches the input representation with multiple facets of information: semantic content (word embeddings), position in the sequence (positional encodings), and role in the context of multiple sequences (segment embeddings). This comprehensive representation enables the model to perform complex reasoning over the input.


`x''_w = x'_w + s_i = [0.2, 1.4, 0.1, 1.3] + [0.05, 0.07, 0.1, 0.02] = [0.25, 1.47, 0.2, 1.32]`
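
The element-wise additions in the steps above can be sketched in plain Python. The vectors are the illustrative `x_w`, `p_i`, and `s_i` values from the text; `add_vectors` is a hypothetical helper, and rounding is applied only to keep floating-point noise out of the printed result.

```python
def add_vectors(a, b):
    """Element-wise sum of two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return [round(x + y, 4) for x, y in zip(a, b)]

x_w = [0.2, 0.4, 0.1, 0.3]      # word embedding
p_i = [0, 1, 0, 1]              # positional encoding
s_i = [0.05, 0.07, 0.1, 0.02]   # segment embedding

x_pos = add_vectors(x_w, p_i)     # word + position
x_full = add_vectors(x_pos, s_i)  # word + position + segment

print(x_pos)   # [0.2, 1.4, 0.1, 1.3]
print(x_full)  # [0.25, 1.47, 0.2, 1.32]
```

All three vectors must share the same dimensionality, which is why the helper checks the lengths before adding.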




Let's see how we can implement positional encoding in PyTorch.

1. We generate the actual embeddings for an input sequence that the positional encoder will use.

In [1]:
import torch
import torch.nn as nn

vocab = {'Transformers': 0, 'are': 1, 'awesome': 2}
token_indices = [vocab['Transformers']]

VOCAB_SIZE = len(vocab)  # Size of the vocabulary
EMBEDDING_DIM = 4  # Dimension of the embedding vector

# Create an embedding layer
embedding = nn.Embedding(VOCAB_SIZE, EMBEDDING_DIM)

# Convert token_indices to a tensor
token_indices_tensor = torch.tensor(token_indices, dtype=torch.long)

# Generate embeddings for the given indices
transformer_token = embedding(token_indices_tensor)

print(transformer_token)

tensor([[ 0.4098,  0.1499, -0.3744,  0.0252]], grad_fn=<EmbeddingBackward0>)


Next, we take the generated embedding for the word 'Transformers' and create the positional encoding using the sine and cosine functions, adding it to the original embedding to produce the final position-encoded embedding.

We will break down each line that's needed and include comments explaining exactly what is happening:

In [2]:
import math
import torch

# Define parameters
emb_size = 4
seq_len = 1

# Let's get the "Transformer" word embedding from above:
print(f"\nWord embedding:")
print(transformer_token)

# Calculate the scaling factor for each dimension
# This corresponds to 1 / 10000^(2i/d_model) in the formula
den = torch.exp(-torch.arange(0, emb_size, 2) * math.log(10000) / emb_size)
print("\nScaling factor:")
print(den)

# Create a position tensor for the sequence
pos = torch.arange(0, seq_len).reshape(seq_len, 1)
print("\nPosition:")
print(pos)

# Initialize the positional embedding tensor with zeros
# This tensor will hold the final positional encodings
pos_embedding = torch.zeros((seq_len, emb_size))
print("\nInitialize Position Embedding:")
print(pos_embedding)

# Fill the even-indexed columns (0, 2, ...) of the positional embedding with sine values
# The sine function is applied to the element-wise product of the position and denominator tensors
pos_embedding[:, 0::2] = torch.sin(pos * den)
print("\nsine:")
print(pos_embedding[:, 0::2])

# Fill the odd-indexed columns (1, 3, ...) of the positional embedding with cosine values
# The cosine function is applied to the element-wise product of the position and denominator tensors
pos_embedding[:, 1::2] = torch.cos(pos * den)
print("\ncosine:")
print(pos_embedding[:, 1::2])

# Add positional encodings to the token embedding
encoded_embedding = transformer_token + pos_embedding

# Print the positional encodings and final encoded embedding
print("\nPositional Encoding:")
print(pos_embedding)
print("\nFinal Encoded Embedding:")
print(encoded_embedding)


Word embedding:
tensor([[ 0.4098,  0.1499, -0.3744,  0.0252]], grad_fn=<EmbeddingBackward0>)

Scaling factor:
tensor([1.0000, 0.0100])

Position:
tensor([[0]])

Initialize Position Embedding:
tensor([[0., 0., 0., 0.]])

sine:
tensor([[0., 0.]])

cosine:
tensor([[1., 1.]])

Positional Encoding:
tensor([[0., 1., 0., 1.]])

Final Encoded Embedding:
tensor([[ 0.4098,  1.1499, -0.3744,  1.0252]], grad_fn=<AddBackward0>)
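
The scaling factor in the cell above is written with `exp` and `log` rather than as a direct power. A quick check in plain Python (using the `math` module instead of PyTorch so it runs anywhere) suggests the two forms agree; `emb_size = 4` matches the cell above, and `den_exp` and `den_pow` are names introduced here for illustration.

```python
import math

emb_size = 4

# Form used in the cell above: exp(-2i * ln(10000) / d) for 2i = 0, 2, ...
den_exp = [math.exp(-two_i * math.log(10000) / emb_size)
           for two_i in range(0, emb_size, 2)]

# Direct form from the formula: 1 / 10000^(2i / d)
den_pow = [1.0 / (10000 ** (two_i / emb_size))
           for two_i in range(0, emb_size, 2)]

print(den_exp)
print(den_pow)
```

Both lists hold one scaling factor per sine/cosine pair, matching the `Scaling factor` output of `1.0000` and `0.0100` up to floating-point precision.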


Great! Now that we have a sense of how positional encodings are created and added to input embeddings, let's create a `PositionalEncoding` class that takes embeddings as input, calculates the positional encodings, and adds them to the provided embeddings.

In [3]:
class PositionalEncoding(nn.Module):
    """
    Positional encoding module to add positional information to the embeddings.
    """
    def __init__(self, d_model, max_len=10):
        """
        Initialize the positional encoding.
        
        Args:
            d_model (int): The dimension of the model.
            max_len (int): The maximum length of the sequences.
        """
        super(PositionalEncoding, self).__init__()
        
        # Create a positional encoding matrix as per the formula
        self.encoding = torch.zeros(max_len, d_model)
        pos = torch.arange(0, max_len, dtype=torch.float).unsqueeze(1)
        den = torch.exp(torch.arange(0, d_model, 2).float() * (-math.log(10000.0) / d_model))
        self.encoding[:, 0::2] = torch.sin(pos * den)
        self.encoding[:, 1::2] = torch.cos(pos * den)
        self.encoding = self.encoding.unsqueeze(0)

    def forward(self, x):
        """
        Add positional encoding to the input tensor.
        
        Args:
            x (torch.Tensor): The input tensor with shape (batch_size, seq_len, d_model).
        
        Returns:
            torch.Tensor: The input tensor with added positional encoding.
        """
        seq_len = x.size(1)
        return x + self.encoding[:, :seq_len, :].to(x.device)

# Define parameters
emb_size = 4
seq_len = 3  # Length of the sequence: "Transformers are awesome"

# Vocabulary and token indices
vocab = {'Transformers': 0, 'are': 1, 'awesome': 2}
token_indices = [vocab['Transformers'], vocab['are'], vocab['awesome']]

VOCAB_SIZE = len(vocab)  # Size of the vocabulary
EMBEDDING_DIM = 4  # Dimension of the embedding vector

# Create an embedding layer
embedding = nn.Embedding(VOCAB_SIZE, EMBEDDING_DIM)

# Convert token_indices to a tensor
token_indices_tensor = torch.tensor(token_indices, dtype=torch.long)

# Generate embeddings for the given indices
embeddings = embedding(token_indices_tensor)

print(f"\nWord embeddings:")
print(embeddings)

# Initialize the positional encoding module with a smaller max_len
pos_encoding = PositionalEncoding(d_model=emb_size, max_len=10)  # Adjusted max_len

# Add positional encodings to the token embeddings
encoded_embeddings = pos_encoding(embeddings.unsqueeze(0))

# Print the positional encodings and final encoded embeddings
print("\nPositional Encoding:")
print(pos_encoding.encoding[:, :seq_len, :])
print("\nFinal Encoded Embeddings:")
print(encoded_embeddings)



Word embeddings:
tensor([[ 0.0428, -0.5897, -1.4232, -0.2149],
        [ 0.0338, -0.7020, -1.5582,  0.4966],
        [ 0.5453, -0.2923,  0.2946, -1.0503]], grad_fn=<EmbeddingBackward0>)

Positional Encoding:
tensor([[[ 0.0000,  1.0000,  0.0000,  1.0000],
         [ 0.8415,  0.5403,  0.0100,  0.9999],
         [ 0.9093, -0.4161,  0.0200,  0.9998]]])

Final Encoded Embeddings:
tensor([[[ 0.0428,  0.4103, -1.4232,  0.7851],
         [ 0.8752, -0.1617, -1.5482,  1.4966],
         [ 1.4546, -0.7085,  0.3146, -0.0505]]], grad_fn=<AddBackward0>)


The output above shows a full sequence being run through the `PositionalEncoding` class.

1. We can see the input embeddings we created for the input sequence "Transformers are awesome".

2. We used sine and cosine functions to generate positional encodings based on the position of each word in the sequence.

3. The positional encodings were added to the input embeddings, resulting in the final encoded embeddings. This process adds positional information to the embeddings, allowing the model to understand the order of words in the sequence.

## Attention Mechanism

- Attention mechanisms employ the query, key, and value matrices.
- The key and value matrices are row-aligned: the key row that best matches the query vector selects the same row of the value matrix.
- You can apply attention mechanisms to word embeddings. This process helps capture contextual relationships between words.
- You can refine the attention formula by applying the softmax function to the output of the dot product between the query vector and the keys.
- To apply attention to sequences, you can consolidate all the query vectors into a single matrix.
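
The bullets above can be sketched in plain Python. The key, value, and query numbers are invented for illustration, and the helper names (`softmax`, `attend`) are hypothetical. The sketch leaves out refinements such as scaling the scores, so it should be read as an outline of the idea rather than a full implementation.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, keys, values):
    """Weight each value row by how well its key row matches the query."""
    scores = [sum(q * k for q, k in zip(query, key)) for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    output = [sum(w * value[j] for w, value in zip(weights, values))
              for j in range(dim)]
    return output, weights

# Row r of `keys` belongs to row r of `values`.
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
values = [[10.0, 0.0], [0.0, 10.0], [5.0, 5.0]]

# Stacking the query vectors into one matrix handles a whole sequence.
queries = [[4.0, 0.0], [0.0, 4.0]]
results = [attend(q, keys, values) for q in queries]

for query, (output, weights) in zip(queries, results):
    print(query, [round(w, 3) for w in weights], [round(o, 3) for o in output])
```

Each query produces a set of weights that sum to 1, and the output is the weighted average of the value rows.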


## Self-Attention Mechanisms

The self-attention mechanism is the heart of the language transformer. Each word in the sequence attends to every other word in parallel to generate embeddings.

These embeddings are used to predict the next word in the sentence.

For example, in the table below, the sequence of words is presented on the left and the predicted word is on the right:

| Input sequence (wt-2,wt-1)|  Predicted word  |
|---------------------------|------------------|
|         Not like          |       Hate       |
|         Not hate          |       Like       |
|         Do like           |       Like       |
|         Do hate           |       Hate       |

When you input the sequence 'Not like', the language model will predict 'Hate'. Similarly, the input sequence 'Do like' would predict the word 'Like'. When the context of a word changes, the meaning of the words that follow also changes.

The words are transformed into a matrix of the embedded sequence, where each embedded word is a column vector within the sequence matrix.

![Image](/Users/williamzebrowski/transform_llm/images/atten-mech-wembed.png)

For example, X represents the matrix, and the subscript 'Not, Like' represents the word sequence, with each column corresponding to a word embedding.

![Image](/Users/williamzebrowski/transform_llm/images/atten-mech-matrix-subs.png)

This means the matrix X with the subscript 'Not, Like' contains the embeddings for 'Not' and 'Like'.

Therefore, each sequence is converted into a matrix representation and treated as a sample sequence in the dataset.
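
As a small illustration of this layout, the sketch below builds the matrix for 'Not like' with one column per word. The 3-dimensional embedding values and the `sequence_matrix` helper are invented for the example.

```python
# Hypothetical 3-dimensional embeddings, invented for illustration.
embeddings = {
    "not":  [0.1, 0.7, 0.2],
    "like": [0.9, 0.3, 0.5],
    "hate": [0.8, 0.1, 0.6],
    "do":   [0.2, 0.6, 0.4],
}

def sequence_matrix(words):
    """Stack word embeddings as columns: shape (emb_size, num_words)."""
    columns = [embeddings[w.lower()] for w in words]
    emb_size = len(columns[0])
    return [[col[r] for col in columns] for r in range(emb_size)]

X_not_like = sequence_matrix(["Not", "like"])
for row in X_not_like:
    print(row)
```

Reading down the first column gives the embedding for 'Not', and reading down the second gives the embedding for 'like'.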

### Query projections with bias

![Image](/Users/williamzebrowski/transform_llm/images/query-learnable-params.png)


Self-attention mechanisms in language modeling involve three key components: the query `Q`, the key `K`, and the value `V`.

This process begins with the input embeddings combined with learnable parameters. You can derive a query matrix by multiplying the input sequence matrix by learnable parameters known as the query projection `weights and biases`.

Here, a row vector of 1's with the same length as the number of tokens is also used, so that the bias is added to every token's column.
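
One way to read this step, assuming the column-per-word layout described earlier, is `Q = W_q X + b_q 1`, where `1` is the row vector of ones that copies the bias to every token. The sketch below uses made-up weights and plain Python lists; `matmul` and `add` are hypothetical helpers.

```python
def matmul(a, b):
    """Multiply matrix a (n x m) by matrix b (m x p)."""
    return [[sum(a[r][k] * b[k][c] for k in range(len(b)))
             for c in range(len(b[0]))] for r in range(len(a))]

def add(a, b):
    """Element-wise sum of two matrices with the same shape."""
    return [[x + y for x, y in zip(row_a, row_b)] for row_a, row_b in zip(a, b)]

# X: one column per token (2 tokens, embedding size 3). Values are made up.
X = [[0.1, 0.9],
     [0.7, 0.3],
     [0.2, 0.5]]

W_q = [[1.0, 0.0, 0.0],   # query projection weights (2 x 3), made up
       [0.0, 1.0, 1.0]]
b_q = [[0.5],             # query projection bias as a column vector (2 x 1)
       [-0.5]]
ones = [[1.0, 1.0]]       # row vector of ones, one entry per token (1 x 2)

# The product b_q @ ones repeats the bias in every token column.
Q = add(matmul(W_q, X), matmul(b_q, ones))
for row in Q:
    print([round(v, 4) for v in row])
```

The same pattern, with different weights and biases, would apply to the key and value projections.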

### Key projections with bias

![Image](/Users/williamzebrowski/transform_llm/images/key-learnable-params.png)

Similarly, you can derive the key matrix by multiplying the input sequence matrix by a different set of learnable parameters, the key projection `weights and biases`.


### Value projections with bias

![Image](/Users/williamzebrowski/transform_llm/images/value-learnable-params.png)

You can apply the same process to obtain the value matrix, where the input is again multiplied by a set of learnable parameters known as the value projection `weights and biases`.




In [4]:
import torch

# Assume emb_size = 4 for simplicity
emb_size = 4
# Number of words in the phrase
num_words = 3

# Create dummy embeddings for the phrase "Transformers are awesome"
token_embeddings = torch.randn(num_words, emb_size)
print("Initial Token Embeddings:")
print(token_embeddings)

Initial Token Embeddings:
tensor([[-2.1941e-01, -1.9092e-04,  7.6247e-01, -9.9991e-01],
        [ 2.3665e+00, -7.6504e-01,  7.8260e-01, -9.7778e-01],
        [-3.4908e-01, -6.9195e-02, -3.0242e-01, -9.7220e-02]])


In [5]:
import math
import torch
import torch.nn as nn

class PositionalEncoding(nn.Module):
    def __init__(self, emb_size, dropout, maxlen=5000):
        super(PositionalEncoding, self).__init__()
        # Create a positional encoding matrix as per the Transformer paper's formula
        den = torch.exp(-torch.arange(0, emb_size, 2) * math.log(10000) / emb_size)
        pos = torch.arange(0, maxlen).reshape(maxlen, 1)
        pos_embedding = torch.zeros((maxlen, emb_size))
        pos_embedding[:, 0::2] = torch.sin(pos * den)
        pos_embedding[:, 1::2] = torch.cos(pos * den)
        pos_embedding = pos_embedding.unsqueeze(-2)
        
        self.dropout = nn.Dropout(dropout)
        self.register_buffer('pos_embedding', pos_embedding)

    def forward(self, token_embedding: torch.Tensor) -> torch.Tensor:
        # Apply the positional encodings to the input token embeddings
        return self.dropout(token_embedding + self.pos_embedding[:token_embedding.size(0), :])

In [6]:
!pip install -q torchtext torch matplotlib pandas numpy tqdm torchdata

In [7]:
!pip install --upgrade torch torchdata



In [8]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import os
import io
import re
import math

import torch
import torch.nn as nn
from torch import optim
from torch.utils.data import DataLoader
from torch.utils.data.dataset import Dataset
import torch.nn.functional as F

from torchtext.datasets import YahooAnswers
from torchtext.data.utils import get_tokenizer
from torchtext.vocab import build_vocab_from_iterator
import torchtext.transforms as T
from torch.hub import load_state_dict_from_url
from torchtext.data.functional import sentencepiece_tokenizer, load_sp_model

from tqdm.notebook import trange, tqdm



In [9]:
from torch.distributions import Categorical

In [10]:
# Define the hyperparameters
learning_rate = 1e-4

nepochs = 100

batch_size = 32

max_len_q = 32
max_len_a = 64

data_set_root = "../datasets"

# We'll be using the YahooAnswers Dataset
# Note that for torchtext these datasets are NOT Pytorch dataset classes "YahooAnswers" is a function that
# returns a Pytorch DataPipe!

# Pytorch DataPipes vvv
# https://pytorch.org/data/main/torchdata.datapipes.iter.html

# vvv Good Blog on the difference between DataSet and DataPipe
# https://medium.com/deelvin-machine-learning/comparison-of-pytorch-dataset-and-torchdata-datapipes-486e03068c58

# Depending on the dataset sometimes the dataset doesn't download and gives an error
# and you'll have to download and extract manually 
# "The datasets supported by torchtext are datapipes from the torchdata project, which is still in Beta status"

# Un-comment to trigger the DataPipe to download the data vvv
dataset_train = YahooAnswers(root=data_set_root, split="train")
data = next(iter(dataset_train))

# Side-Note I've noticed that the WikiText dataset is no longer able to be downloaded :(



ModuleNotFoundError: Package `portalocker` is required to be installed to use this datapipe.Please use `pip install 'portalocker>=2.0.0'` or`conda install -c conda-forge 'portalocker>=2/0.0'`to install the package

In [None]:
# ## "Train" a Sentence Piece Tokenizer with the train data capping the vocab size to 20000 tokens
# from torchtext.data.functional import generate_sp_model

# with open(os.path.join(data_set_root, "datasets/YahooAnswers/train.csv")) as f:
#     with open(os.path.join(data_set_root, "datasets/YahooAnswers/data.txt"), "w") as f2:
#         for i, line in enumerate(f):
#             text_only = "".join(line.split(",")[1:])
#             filtered = re.sub(r'\\|\\n|;', ' ', text_only.replace('"', ' ').replace('\n', ' ')) # remove newline characters
#             f2.write(filtered.lower() + "\n")


# generate_sp_model(os.path.join(data_set_root, "datasets/YahooAnswers/data.txt"), 
#                   vocab_size=20000, model_prefix='spm_user_ya')

In [None]:
class YahooQA(Dataset):
    def __init__(self, num_datapoints, test_train="train"):
        self.df = pd.read_csv(os.path.join(data_set_root, "datasets/YahooAnswers/" + test_train + ".csv"),
                              names=["Class", "Q_Title", "Q_Content", "A"])
        
        self.df.fillna('', inplace=True)
        self.df['Q'] = self.df['Q_Title'] + ': ' + self.df['Q_Content']
        self.df.drop(['Q_Title', 'Q_Content'], axis=1, inplace=True)
        self.df['Q'] = self.df['Q'].str.replace(r'\\n|\\|\\r|\\r\\n|\n|"', ' ', regex=True)
        self.df['A'] = self.df['A'].str.replace(r'\\n|\\|\\r|\\r\\n|\n|"', ' ', regex=True)

    def __getitem__(self, index):
        question_text = self.df.loc[index]["Q"].lower()
        answer_text = self.df.loc[index]["A"].lower()

        return question_text, answer_text

    def __len__(self):
        return len(self.df)

In [None]:
dataset_train = YahooQA(num_datapoints=data_set_root, test_train="train")
dataset_test = YahooQA(num_datapoints=data_set_root, test_train="test")

In [None]:
data_loader_train = DataLoader(dataset_train, batch_size=batch_size, shuffle=True, num_workers=4, drop_last=True)
data_loader_test = DataLoader(dataset_test, batch_size=batch_size, shuffle=True, num_workers=4)

In [None]:
sp_model = load_sp_model("spm_user_ya.model")
tokenizer = sentencepiece_tokenizer(sp_model)

In [None]:
def yield_tokens(file_path):
    with io.open(file_path, encoding = 'utf-8') as f:
        for line in f:
            yield [line.split("\t")[0]]
            
vocab = build_vocab_from_iterator(yield_tokens("spm_user_ya.vocab"), 
                                  specials= ['<pad>', '<soq>', '<eoq>', '<soa>', '<eoa>', '<unk>'], # special case tokens
                                  special_first=True)
vocab.set_default_index(vocab['<unk>'])

In [None]:
tokenizer_transform = T.SentencePieceTokenizer("spm_user_ya.model")

In [None]:
q_tranform = T.Sequential(
    # Tokenize with the pre-existing tokenizer model
    T.SentencePieceTokenizer("spm_user_ya.model"),
    # Convert the tokens to indices based on the given vocabulary
    T.VocabTransform(vocab=vocab),
    # Add <soq> at the beginning of each question. 1 because the index for <soq>
    # in the vocabulary is 1 (see the `specials` list above)
    T.AddToken(1, begin=True),
    # Crop the sentence if it is longer than the max length
    T.Truncate(max_seq_len=max_len_q),
    # Add <eoq> at the end of each question. 2 because the index for <eoq>
    # in the vocabulary is 2
    T.AddToken(2, begin=False),
    # Convert the list of lists to a tensor; this also pads shorter
    # sentences with the <pad> token (index 0)
    # so that all sentences in the batch are the same length
    T.ToTensor(padding_value=0)
)

a_tranform = T.Sequential(
    # Tokenize with the pre-existing tokenizer model
    T.SentencePieceTokenizer("spm_user_ya.model"),
    # Convert the tokens to indices based on the given vocabulary
    T.VocabTransform(vocab=vocab),
    # Add <soa> at the beginning of each answer. 3 because the index for <soa>
    # in the vocabulary is 3 (see the `specials` list above)
    T.AddToken(3, begin=True),
    # Crop the sentence if it is longer than the max length
    T.Truncate(max_seq_len=max_len_a),
    # Add <eoa> at the end of each answer. 4 because the index for <eoa>
    # in the vocabulary is 4
    T.AddToken(4, begin=False),
    # Convert the list of lists to a tensor; this also pads shorter
    # sentences with the <pad> token (index 0)
    # so that all sentences in the batch are the same length
    T.ToTensor(padding_value=0)
)
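
To see what these transform pipelines do to a single sentence, here is a plain-Python sketch of the same steps applied to lists of token indices. It does not use torchtext; `encode_sequence`, `pad_batch`, and the sample ids are hypothetical, the special-token ids follow the order of the `specials` list (`<pad>`=0, `<soq>`=1, `<eoq>`=2), and padding to the longest sequence in the batch is an assumption about how the final step behaves.

```python
PAD, SOQ, EOQ = 0, 1, 2  # ids follow the `specials` order defined earlier

def encode_sequence(token_ids, start_id, end_id, max_len):
    """Mirror the pipeline order: add start, truncate, add end."""
    ids = [start_id] + list(token_ids)   # AddToken(begin=True)
    ids = ids[:max_len]                  # Truncate(max_seq_len)
    ids = ids + [end_id]                 # AddToken(begin=False)
    return ids

def pad_batch(batch, pad_id=PAD):
    """Pad every sequence to the longest one in the batch."""
    longest = max(len(seq) for seq in batch)
    return [seq + [pad_id] * (longest - len(seq)) for seq in batch]

batch = [
    encode_sequence([17, 42], SOQ, EOQ, max_len=4),
    encode_sequence([5, 6, 7, 8, 9, 10], SOQ, EOQ, max_len=4),
]
padded = pad_batch(batch)
for row in padded:
    print(row)
```

Because the end token is added after truncation, an encoded sequence can be up to `max_len + 1` tokens long.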

In [None]:
# sinusoidal positional embeds
class SinusoidalPosEmb(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.dim = dim

    def forward(self, x):
        device = x.device
        half_dim = self.dim // 2
        emb = math.log(10000) / (half_dim - 1)
        emb = torch.exp(torch.arange(half_dim, device=device) * -emb)
        emb = x[:, None] * emb[None, :]
        emb = torch.cat((emb.sin(), emb.cos()), dim=-1)
        return emb

    
# Attention block with self-attention with/without causal masking
class AttentionBlock(nn.Module):
    def __init__(self, hidden_size=128, num_heads=4, masking=True):
        super(AttentionBlock, self).__init__()
        self.masking = masking

        self.multihead_attn = nn.MultiheadAttention(hidden_size, num_heads=num_heads, batch_first=True)
                
    def forward(self, x_in, kv_in):
        if self.masking:
            bs, l, h = x_in.shape
            mask = torch.triu(torch.ones(l, l, device=x_in.device), 1).bool()
        else:
            mask = None
            
        return self.multihead_attn(x_in, kv_in, kv_in, attn_mask=mask)[0]

    
# Transformer block with self-attention with/without causal masking
class TransformerBlock(nn.Module):
    def __init__(self, hidden_size=128, num_heads=4, decoder=False, masking=True):
        super(TransformerBlock, self).__init__()
        self.decoder = decoder

        self.norm1 = nn.LayerNorm(hidden_size)
        self.attn1 = AttentionBlock(hidden_size=hidden_size, num_heads=num_heads, masking=masking)
        
        if self.decoder:
            self.norm2 = nn.LayerNorm(hidden_size)
            self.attn2 = AttentionBlock(hidden_size=hidden_size, num_heads=num_heads, masking=False)
        
        self.norm_mlp = nn.LayerNorm(hidden_size)
        self.mlp = nn.Sequential(nn.Linear(hidden_size, hidden_size * 4),
                                 nn.ELU(),
                                 nn.Linear(hidden_size * 4, hidden_size))
                
    def forward(self, x, kv_cross=None):
        x = self.attn1(x, x) + x
        x = self.norm1(x)

        if self.decoder:
            x = self.attn2(x, kv_cross) + x
            x = self.norm2(x)

        x = self.mlp(x) + x
        return self.norm_mlp(x)
    
    
class Encoder(nn.Module):
    def __init__(self, num_emb, hidden_size=128, num_layers=3, num_heads=4):
        super(Encoder, self).__init__()
        
        # Create an embedding for each token
        self.embedding = nn.Embedding(num_emb, hidden_size)
        self.pos_emb = SinusoidalPosEmb(hidden_size)
        
        self.blocks = nn.ModuleList([
            TransformerBlock(hidden_size, num_heads, decoder=False, masking=False) for _ in range(num_layers)
        ])
                
    def forward(self, input_seq):        
        input_embs = self.embedding(input_seq)
        bs, l, h = input_embs.shape

        # Add a unique embedding to each token embedding depending on its position in the sequence
        seq_indx = torch.arange(l, device=input_seq.device)
        pos_emb = self.pos_emb(seq_indx).reshape(1, l, h).expand(bs, l, h)
        embs = input_embs + pos_emb
        
        # Pass the output of each block on to the next block
        output = embs
        for block in self.blocks:
            output = block(output)
        
        return output

    
class Decoder(nn.Module):
    def __init__(self, num_emb, hidden_size=128, num_layers=3, num_heads=4):
        super(Decoder, self).__init__()
        
        # Create an embedding for each token
        self.embedding = nn.Embedding(num_emb, hidden_size)
        self.pos_emb = SinusoidalPosEmb(hidden_size)
        
        self.blocks = nn.ModuleList([
            TransformerBlock(hidden_size, num_heads, decoder=True) for _ in range(num_layers)
        ])
                
        self.fc_out = nn.Linear(hidden_size, num_emb)
        
    def forward(self, input_seq, encoder_output):        
        input_embs = self.embedding(input_seq)
        bs, l, h = input_embs.shape

        # Add a unique embedding to each token embedding depending on its position in the sequence
        seq_indx = torch.arange(l, device=input_seq.device)
        pos_emb = self.pos_emb(seq_indx).reshape(1, l, h).expand(bs, l, h)
        embs = input_embs + pos_emb
        
        # Pass the output of each block on to the next block
        output = embs
        for block in self.blocks:
            output = block(output, kv_cross=encoder_output)
        
        return self.fc_out(output)

    
# "Encoder-Decoder" Style Transformer with self-attention
class EncoderDecoder(nn.Module):
    def __init__(self, num_emb, hidden_size=128, num_layers=(3, 3), num_heads=4):
        super(EncoderDecoder, self).__init__()
        
        # Create an embedding for each token
        self.encoder = Encoder(num_emb=num_emb, hidden_size=hidden_size, 
                               num_layers=num_layers[0], num_heads=num_heads)
        
        self.decoder = Decoder(num_emb=num_emb, hidden_size=hidden_size, 
                               num_layers=num_layers[1], num_heads=num_heads)

        
    def forward(self, input_seq, target_seq):        
        encoded_seq = self.encoder(input_seq)
        decoded_seq = self.decoder(target_seq, encoded_seq)

        return decoded_seq
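
The causal mask built in `AttentionBlock` with `torch.triu(..., 1)` marks every position above the diagonal as blocked, so a token cannot attend to tokens that come after it. A plain-Python sketch of the same pattern for a sequence of length 4 (the helper name `causal_mask` is hypothetical):

```python
def causal_mask(length):
    """True where attention is blocked: query row r may not see column c > r."""
    return [[c > r for c in range(length)] for r in range(length)]

mask = causal_mask(4)
for row in mask:
    print(" ".join("x" if blocked else "." for blocked in row))
```

The first row blocks everything except the first token, while the last row blocks nothing.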

In [None]:
device = torch.device(0 if torch.cuda.is_available() else 'cpu')

In [None]:
hidden_size = 512

num_layers = (3, 6)
num_heads = 16

# Create model
tf_generator = EncoderDecoder(num_emb=len(vocab), num_layers=num_layers, 
                              hidden_size=hidden_size, num_heads=num_heads).to(device)

# Initialize the optimizer with above parameters
optimizer = optim.Adam(tf_generator.parameters(), lr=learning_rate, weight_decay=1e-4)

# Define the loss function
loss_fn = nn.CrossEntropyLoss()

# Custom transform that will randomly replace a token with <pad>
# td = TokenDrop(prob=0.2)

In [None]:
# Let's see how many Parameters our Model has!
num_model_params = 0
for param in tf_generator.parameters():
    num_model_params += param.flatten().shape[0]

print("-This Model Has %d (Approximately %d Million) Parameters!" % (num_model_params, num_model_params//1e6))

In [None]:
training_loss_logger = []

In [None]:
for epoch in trange(0, nepochs, leave=False, desc="Epoch"):
    tf_generator.train()
    steps = 0
    for q_text, a_text in tqdm(data_loader_train, desc="Training", leave=False):
        q_text_tokens = q_tranform(list(q_text)).to(device)
        a_text_tokens = a_tranform(list(a_text)).to(device)
        a_input_text = a_text_tokens[:, 0:-1]
        a_output_text = a_text_tokens[:, 1:]
        
        bs = q_text_tokens.shape[0]

        pred = tf_generator(q_text_tokens, a_input_text)

        loss = loss_fn(pred.transpose(1, 2), a_output_text)

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        
        training_loss_logger.append(loss.item())
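
The slicing of `a_text_tokens` in the loop above implements teacher forcing: the decoder input is the answer without its last token, and the target is the answer without its first token. A plain-Python sketch with made-up token ids (3 and 4 stand for `<soa>` and `<eoa>`, as in the vocabulary defined earlier; the other ids are invented):

```python
SOA, EOA = 3, 4  # special-token ids from the vocabulary defined earlier

answer_tokens = [SOA, 21, 57, 9, EOA]  # made-up ids for one answer

decoder_input = answer_tokens[:-1]   # what the decoder sees
decoder_target = answer_tokens[1:]   # what it should predict at each step

for seen, target in zip(decoder_input, decoder_target):
    print(f"after token {seen:>2} the model should predict {target}")
```

Each position in the input is paired with the token that follows it, so the model learns to predict the next token at every step.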
        

In [None]:
_ = plt.figure(figsize=(10, 5))
_ = plt.plot(training_loss_logger[1000:])
_ = plt.title("Training Loss")

In [None]:
window_size = 512
data = np.convolve(np.array(training_loss_logger), np.ones(window_size)/window_size, mode="valid")
_ = plt.figure(figsize=(10, 5))
_ = plt.plot(data[10000:])
_ = plt.title("Training Loss")

In [None]:
q_text, a_text = next(iter(data_loader_test))

In [None]:
q_text[0]

In [None]:
a_text[0]

In [None]:
# init_prompt = ["what is that largest ocean in the world?: "]
init_prompt = [q_text[0]]

input_tokens = q_tranform(init_prompt).to(device)

# Add Start-Of-Answer token to prompt the network to start generating the answer!
# input_tokens = torch.cat((input_tokens, 3 * torch.ones(1, 1, device=device).long()), 1)
soa_token = 3 * torch.ones(1, 1).long()
print(input_tokens)
print(vocab.lookup_tokens(input_tokens[0].cpu().numpy()))

In [None]:
temp = 0.8

In [None]:
log_tokens = [soa_token]
tf_generator.eval()

with torch.no_grad():
    encoded_seq = tf_generator.encoder(input_tokens.to(device))

    for i in range(100):
        input_tokens = torch.cat(log_tokens, 1)
        data_pred = tf_generator.decoder(input_tokens.to(device), encoded_seq)
#         We can take the token with the highest prob
#         input_tokens = data_pred[:, -1].argmax().reshape(1, 1)
        
        # Or sample from the distribution of probs!
        dist = Categorical(logits=data_pred[:, -1]/temp)
        next_tokens = dist.sample().reshape(1, 1)
        
        log_tokens.append(next_tokens.cpu())
        
        if next_tokens.item() == 4:
            break
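
Dividing the logits by `temp` before sampling changes how peaked the distribution is. The sketch below uses invented logits and a hypothetical `softmax_with_temperature` helper in plain Python to illustrate the effect.

```python
import math

def softmax_with_temperature(logits, temp):
    """Softmax of logits / temp; a lower temp sharpens the distribution."""
    scaled = [value / temp for value in logits]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.1]  # invented scores for three candidate tokens

for temp in (0.5, 0.8, 1.0, 2.0):
    probs = softmax_with_temperature(logits, temp)
    print(temp, [round(p, 3) for p in probs])
```

A temperature below 1 concentrates probability on the highest-scoring token, while a temperature above 1 spreads it more evenly across the candidates.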

In [None]:
pred_text = "".join(vocab.lookup_tokens(torch.cat(log_tokens, 1)[0].numpy()))
print(pred_text)

In [None]:
pred_text.replace("▁", " ").replace("<unk>", "").replace("<eoa>", "")

In [None]:
plt.plot(F.softmax(data_pred[:, -1]/temp, -1).cpu().numpy().flatten())