# 1. Introduction

In this notebook we'll explore the Transformer architecture, a neural network that takes advantage of parallel processing and allows you to substantially speed up the training process. 

**After this project you'll be able to**:
* Understand each step of the Transformer network
* Understand attention
* Translate a paper into code

### What is the Transformer Model?
The Transformer model, introduced in *Attention Is All You Need* (2017), is a deep learning architecture that replaces recurrence with self-attention to process sequences in parallel.
* Paper: [Attention Is All You Need](https://arxiv.org/abs/1706.03762)
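
To make "self-attention processes the sequence in parallel" concrete, here is a minimal NumPy sketch of single-head self-attention. The projection matrices `w_q`, `w_k`, `w_v` and the sizes are illustrative assumptions with random values, not the notebook's final implementation. The point is that every position is handled by the same few matrix products, with no step-by-step recurrence.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row max for numerical stability before exponentiating
    x = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(x)
    return e / np.sum(e, axis=axis, keepdims=True)

def self_attention(x, w_q, w_k, w_v):
    # x: (seq_length, d_model); all positions are projected in one matmul each
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / np.sqrt(k.shape[-1])   # (seq_length, seq_length)
    weights = softmax(scores, axis=-1)        # each row sums to 1
    return weights @ v, weights

rng = np.random.default_rng(0)
seq_length, d_model = 5, 8                    # illustrative sizes (assumption)
x = rng.standard_normal((seq_length, d_model))
w_q, w_k, w_v = (rng.standard_normal((d_model, d_model)) for _ in range(3))

output, weights = self_attention(x, w_q, w_k, w_v)
print(output.shape, weights.shape)            # (5, 8) (5, 5)
```

Each row of `weights` says how much one position attends to every other position, which is the information a recurrent network would otherwise have to carry through its hidden state.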

<img src="images/Transformer.png" alt="Encoder" width="600"/>
<caption><center><font color='Green'><b>Figure 1: The Transformer model architecture.</b></font></center></caption>


# 2. Import Dependencies

In [3]:
# standard
import numpy as np
import matplotlib.pyplot as plt

# TensorFlow
import tensorflow as tf
from tensorflow.keras.layers import Embedding, MultiHeadAttention, Dense, Input, Dropout, LayerNormalization

# 3. Token Embedding

This layer converts token indices into dense vectors, similar to how PyTorch/TensorFlow embedding layers work.
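
As a rough illustration of what "converting token indices into dense vectors" means, the sketch below shows that an embedding lookup is just row selection, and that it gives the same result as multiplying one-hot vectors by the embedding matrix. The sizes and the random matrix are assumptions chosen for readability.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, d_model = 6, 4                      # illustrative sizes (assumption)
embedding_matrix = rng.standard_normal((vocab_size, d_model))

token_indices = np.array([2, 0, 5])

# Direct lookup: pick rows 2, 0 and 5 of the matrix
looked_up = embedding_matrix[token_indices]     # (3, d_model)

# Equivalent view: one-hot rows times the embedding matrix
one_hot = np.eye(vocab_size)[token_indices]     # (3, vocab_size)
via_matmul = one_hot @ embedding_matrix         # (3, d_model)

print(looked_up.shape)                          # (3, 4)
print(np.allclose(looked_up, via_matmul))       # True
```

The lookup form is preferred in practice because it avoids building the large one-hot matrix.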

<a name='6'></a>
### 3.1 - He Initialization

The embedding matrix is initialized with "He Initialization", named for the first author of He et al., 2015.

He initialization draws weights from a standard normal distribution and scales them by $\sqrt{\frac{2}{\text{dimension of the previous layer}}}$, which is what it recommends for layers with a ReLU activation. The `TokenEmbedding` class in this section applies the same scaling, using `d_model` in place of the previous layer's dimension.
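
As a quick numerical check of the scaling factor above, the following sketch draws a weight matrix with He-style scaling and compares its empirical standard deviation with the target $\sqrt{2/\text{fan\_in}}$. The names `fan_in` and `n_rows` and their values are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
fan_in = 128      # dimension used in the scaling factor (assumption)
n_rows = 10000    # number of weight rows (assumption)

# Standard normal samples scaled by sqrt(2 / fan_in)
he_std = np.sqrt(2.0 / fan_in)
weights = rng.standard_normal((n_rows, fan_in)) * he_std

print("target std:", he_std)            # 0.125
print("empirical std:", weights.std())  # close to the target
```

With `fan_in = 128` the target standard deviation is exactly 0.125, so the entries of the matrix start out small but not vanishing.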

In [4]:
class TokenEmbedding:
    def __init__(self, VOCAB_SIZE, d_model):

        # VOCAB_SIZE: total number of unique tokens in the vocabulary
        # d_model: dimension of each token's embedding vector
        self.VOCAB_SIZE = VOCAB_SIZE
        self.d_model = d_model

        # Initialize the embedding matrix with He-style scaling: sqrt(2 / d_model)
        self.embedding_matrix = np.random.randn(VOCAB_SIZE, d_model) * np.sqrt(2 / d_model)

    def toDenseVector(self, TOKEN_INDICES):
        # Look up the embedding vector (one row of the matrix) for each token index
        return self.embedding_matrix[TOKEN_INDICES]

### TEST CASE

In [None]:
VOCAB_SIZE = 10000
d_model = 128
token_embedding = TokenEmbedding(VOCAB_SIZE, d_model)

token_indices = np.array([
    [1, 5, 7, 2, 8],  # First sequence
    [4, 3, 9, 6, 0]   # Second sequence
])

embedded_tokens = token_embedding.toDenseVector(token_indices)
print("Token Embeddings Shape:", embedded_tokens.shape)  # Expected: (2, 5, 128)

# 2   => batch size (two sentences at once)
# 5   => sequence length (five tokens per sentence)
# 128 => embedding dimension (d_model)

Token Embeddings Shape: (2, 5, 128)

# 4. Positional Encoding

In sequence-to-sequence tasks, the relative order of the data is extremely important to its meaning. When training sequential neural networks such as RNNs, inputs are fed into the network in order, so information about the order of the data reaches the model automatically. However, when we train a Transformer network using multi-head attention, we feed our data into the model all at once. While this dramatically reduces training time, the model receives no information about the order of the data. This is where positional encoding is useful: we can explicitly encode the positions of our inputs and pass them into the network using these sine and cosine formulas:
$$
PE_{(pos, 2i)} = \sin\left(\frac{pos}{10000^{\frac{2i}{d}}}\right)
\tag{1}$$

$$
PE_{(pos, 2i+1)} = \cos\left(\frac{pos}{10000^{\frac{2i}{d}}}\right)
\tag{2}$$

* $d$ is the dimension of the word embedding and positional encoding
* $pos$ is the position of the word.
* $k$ indexes the dimensions of the positional encoding, with $i = k // 2$, so even $k$ uses the sine formula and odd $k$ uses the cosine formula.
  
This ensures each position has a unique encoding and helps the model differentiate between words based on their order.
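
The definitions above can be sketched as a direct, unvectorized loop: sine on even dimensions and the matching cosine on odd dimensions, with $i = k // 2$. The function name `positional_encoding_loop` and the tiny sizes are assumptions for illustration. It is meant as a readable reference, not an efficient implementation.

```python
import numpy as np

def positional_encoding_loop(max_seq_length, d):
    # Fill the table one entry at a time, following the formulas literally
    pe = np.zeros((max_seq_length, d))
    for pos in range(max_seq_length):
        for k in range(d):
            i = k // 2
            angle = pos / (10000 ** (2 * i / d))
            # Even dimensions use sine, odd dimensions use cosine
            pe[pos, k] = np.sin(angle) if k % 2 == 0 else np.cos(angle)
    return pe

pe = positional_encoding_loop(max_seq_length=4, d=4)
print(np.round(pe, 4))
```

Position 0 encodes to `[0, 1, 0, 1]` because every angle is zero, and each later position gets a different row, which is what lets the model tell positions apart.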

In [12]:
import numpy as np

class PositionalEncoding:
    def __init__(self, max_seq_length, d_model):
        self.max_seq_length = max_seq_length
        self.d_model = d_model
        self.positional_encoding = self.compute_positional_encoding(max_seq_length, d_model)

    def compute_positional_encoding(self, max_seq_length, d_model):
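        # Column vector of positions: shape (max_seq_length, 1)
        position = np.arange(max_seq_length)[:, np.newaxis]
        # 1 / 10000^(2i / d_model) for each even dimension index 2i
        div_term = np.exp(np.arange(0, d_model, 2) * -(np.log(10000.0) / d_model))
        pos_enc = np.zeros((max_seq_length, d_model))
        pos_enc[:, 0::2] = np.sin(position * div_term)  # even dimensions
        pos_enc[:, 1::2] = np.cos(position * div_term)  # odd dimensions
        return pos_enc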

    def forward(self, token_embeddings):
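        batch_size, seq_length, d_model = token_embeddings.shape
        assert d_model == self.d_model, "Embedding dimension must match d_model"
        assert seq_length <= self.max_seq_length, "Sequence length exceeds max_seq_length"
        # Broadcast the first seq_length position rows across the batch
        return token_embeddings + self.positional_encoding[:seq_length, :]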



### TEST CASE

In [13]:
# Example usage
max_seq_length = 100  # Maximum sequence length
d_model = 128        # Embedding dimension

pos_enc = PositionalEncoding(max_seq_length, d_model)

# Generate dummy token embeddings (batch_size=2, seq_length=5, d_model=128)
token_embeddings = np.random.randn(2, 5, d_model)

# Add positional encoding
encoded_tokens = pos_enc.forward(token_embeddings)

print("Positional Encoded Shape:", encoded_tokens.shape)  # Expected: (2, 5, 128)

Positional Encoded Shape: (2, 5, 128)


# 5. Scaled Dot Product Attention

# 6. Multi-Head Attention

# 7. FeedForward Network

# 8. Layer Normalization & Skip Connection

# 9. Encoder

# 10. Decoder

# 11. Transformer Model