`transformer - language translation`

# model
> code and explanation of the transformer model architecture.

<img src="https://miro.medium.com/max/1280/1*oh6wljc7WoW8G-0KNNBJww.jpeg" width=800>

> * the **transformer model** is a transduction model.
> * a **transduction model** is a type of neural network model that maps an input sequence to an output sequence, where both can be of **variable length**.

## input embedding
> converts the subword-tokenized input data into **embeddings** of size **(sequence, d_model)**
* where **sequence** is the total number of input tokens, and
* **d_model** is the embedding dimension of the transformer model, fixed at **512**

**what happens to the input tokens after we feed them into the `Input Embedding Layer`?**
> the **Input Embedding Layer** converts the input tokens into **512-dimensional embeddings**, that is, each token is represented by a vector of size 512. for example, if there are 6 input tokens (after subword tokenization), the output matrix has size **(6,512)**.

1. **input words ---> subword tokenizer ---> tokens: (6,)**
> * `["word1 word2 word3 word4"]` ---> **BPE** ---> `["token1", "token2", "token3", "token4", "token5", "token6"]`

2. **tokens:(6,) ---> Input Embedding Layer ---> embeddings: (6,512)**
> * token1 --> `[0.0004, 0.34, ... 512th value]` a single token is represented by a vector of size 512
> * token2 --> `[0.0004, 0.34, ... 512th value]` a single token is represented by a vector of size 512
> * token3 --> `[0.0004, 0.34, ... 512th value]` a single token is represented by a vector of size 512
> * token4 --> `[0.0004, 0.34, ... 512th value]` a single token is represented by a vector of size 512
> * token5 --> `[0.0004, 0.34, ... 512th value]` a single token is represented by a vector of size 512
> * token6 --> `[0.0004, 0.34, ... 512th value]` a single token is represented by a vector of size 512
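
The two steps above can be sketched with a bare `nn.Embedding`. This is a minimal illustration, not the notebook's final layer: the vocabulary size of 1000 and the six token ids are hypothetical values chosen only for the demo, standing in for the output of a real subword tokenizer. The scaling by the square root of `d_model` mirrors what the `InputEmbeddings` class in this notebook does in its `forward` method.

```python
import math

import torch
import torch.nn as nn

torch.manual_seed(0)  # make the randomly initialised weights reproducible

d_model = 512      # embedding size used throughout this notebook
vocab_size = 1000  # hypothetical vocabulary size, chosen only for this demo

# the embedding layer is a lookup table with one d_model-sized row per token id
embedding = nn.Embedding(num_embeddings=vocab_size, embedding_dim=d_model)

# six hypothetical token ids, standing in for the output of a subword tokenizer
token_ids = torch.tensor([5, 42, 7, 999, 0, 13])

# look up one 512-dimensional vector per token, then scale by sqrt(d_model)
embeddings = embedding(token_ids) * math.sqrt(d_model)

print(token_ids.shape)   # tokens: (6,)
print(embeddings.shape)  # embeddings: (6, 512)
```

Each row of `embeddings` is simply the (scaled) row of the weight table selected by the corresponding token id, so identical tokens always map to identical vectors.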

now, these embeddings are fed into the Encoder Block.

In [31]:
import torch
import torch.nn as nn

class InputEmbeddings(nn.Module):
    def __init__(self, d_model:int, vocab_size:int) -> None:
        """Token embedding layer: maps each token id to a vector of size d_model.

        Args:
            d_model (int): size of the word embedding vector
            vocab_size (int): vocabulary size -> how many words are there in the vocabulary
        """
        super().__init__()
        self.d_model=d_model
        self.vocab_size=vocab_size
        self.embedding=nn.Embedding(
            num_embeddings=vocab_size,
            embedding_dim=d_model
        )

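    def forward(self,x):
        # scale the embeddings by sqrt(d_model); torch.sqrt expects a tensor,
        # so a plain Python power is used for the integer d_model
        return self.embedding(x) * (self.d_model ** 0.5)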

## positional encoding
Positional encoding adds information about the position of each word in a sequence to the word embeddings. This is important because the transformer model does not use recurrence or convolution, and therefore has no inherent notion of word order. Each position is mapped to a vector of the same size as the word embedding, and the two are added together. The vector for each position is computed using a combination of sine and cosine functions.

> * it captures the **position** of each word.
> * computed only once and **reused** for every sentence during both **training** and **inference**.
> * we add this **positional encoding matrix** to the **input embedding matrix**
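
To make the sine and cosine construction concrete, here is a small sketch in plain Python, independent of PyTorch. It assumes the standard sinusoidal form, where even dimension `2i` holds `sin(pos / 10000^(2i/d_model))` and odd dimension `2i+1` holds the matching cosine. The helper name `sinusoidal_table`, the tiny sizes, and the constant toy embeddings are hypothetical, chosen only so the values are easy to inspect.

```python
import math


def sinusoidal_table(seq_len: int, d_model: int) -> list:
    """Build a (seq_len, d_model) table of sinusoidal positional encodings."""
    table = []
    for pos in range(seq_len):
        row = [0.0] * d_model
        for i in range(0, d_model, 2):  # i walks over the even dimensions
            angle = pos / (10000 ** (i / d_model))
            row[i] = math.sin(angle)          # even dimension: sine
            if i + 1 < d_model:
                row[i + 1] = math.cos(angle)  # odd dimension: cosine
        table.append(row)
    return table


# computed once, for the longest sequence we expect (hypothetical sizes)
table = sinusoidal_table(seq_len=10, d_model=8)

# toy "embeddings" for a 6-token sentence: every value is 0.5
toy_embeddings = [[0.5] * 8 for _ in range(6)]

# reuse the first 6 rows of the table and add them element-wise
encoded = [
    [e + p for e, p in zip(emb_row, pe_row)]
    for emb_row, pe_row in zip(toy_embeddings, table[:6])
]

print(table[0])                       # position 0 alternates sin(0) and cos(0)
print(len(encoded), len(encoded[0]))  # one encoded row per token
```

Because the table depends only on the position and `d_model`, a shorter sentence just takes a shorter slice of the same table, which is why it never needs to be recomputed or learned.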

<img src="https://miro.medium.com/max/1272/1*YqVm4d_OmlE-J17r4i2yIg.png" width=500>

In [37]:
class PositionalEncoding(nn.Module):
    def __init__(self, d_model:int, seq_len:int, dropout:float) -> None:
        """
        Args:
            d_model (int): size of the word embedding vector
            seq_len (int): maximum length of the sentence
            dropout (float): dropout probability, used to reduce overfitting
        """
        super().__init__()
        self.d_model=d_model
        self.seq_len=seq_len
        self.dropout=nn.Dropout(p=dropout)

        # create positional encoding matrix of shape (seq_len,d_model)
        pe=torch.zeros(size=(seq_len, d_model))
        # create a vector of shape (seq_len, 1)
        position=torch.arange(start=0,end=seq_len, dtype=torch.float).unsqueeze(dim=1)
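        # denominator of the angle: 10000^(2i/d_model), one value per pair of dimensions
        # (torch.log expects a tensor, so the constant is wrapped in torch.tensor)
        denominator=torch.exp(torch.arange(start=0,end=d_model,step=2).float() * (torch.log(torch.tensor(10000.0))/d_model))
        # apply sin to the even indices of the embedding dimension
        pe[:, ::2]=torch.sin(input=position/denominator)
        # apply cos to the odd indices of the embedding dimension
        pe[:, 1::2]=torch.cos(input=position/denominator)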

        # convert dim to batch processing: (seq_len, d_model) ---> (1, seq_len, d_model)
        pe=pe.unsqueeze(dim=0)

        # save the tensor to the module but not as a learned parameter
        self.register_buffer(name="pe", tensor=pe)

    def forward(self, x):
        # the positional encoding (pe) tensor is created only once and must not be learned
        x=x + (self.pe[:, :x.shape[1], :]).requires_grad_(False)
        return self.dropout(x)

## layer normalization