# Normal Implementation

In [1]:
prompt = "Hello World I am an attention seeker I love attention"

In [2]:
dic_words = {s.lower():i for i,s in enumerate(sorted(prompt.split(" ")))}

In [3]:
print(dic_words)

{'hello': 0, 'i': 2, 'world': 3, 'am': 4, 'an': 5, 'attention': 7, 'love': 8, 'seeker': 9}


In [5]:
tokens = [dic_words[i.lower()] for i in prompt.split(" ")]
print(tokens)

import torch
tokens_tf = torch.tensor(tokens)
print(tokens_tf)

[0, 3, 2, 4, 5, 7, 9, 2, 8, 7]
tensor([0, 3, 2, 4, 5, 7, 9, 2, 8, 7])


In [10]:
from torch.nn import Embedding

torch.manual_seed(123)
vocab_size = 50000
embedder = Embedding(vocab_size, 3)

In [11]:
token_embedding = embedder(tokens_tf).detach()
print(token_embedding)
print(token_embedding.shape)

tensor([[ 0.3374, -0.1778, -0.3035],
        [-1.1925,  0.6984, -1.4097],
        [-0.2196, -0.3792,  0.7671],
        [ 0.1794,  1.8951,  0.4954],
        [ 0.2692, -0.0770, -1.0205],
        [ 1.3010,  1.2753, -0.2010],
        [-1.1481, -1.1589,  0.3255],
        [-0.2196, -0.3792,  0.7671],
        [ 0.4965, -1.5723,  0.9666],
        [ 1.3010,  1.2753, -0.2010]])
torch.Size([10, 3])


In [12]:
d = token_embedding.shape[1]
d_q, d_k, d_v = 24, 24, 28

In [13]:
w_query = torch.nn.Parameter(torch.rand(d,d_q))
w_key = torch.nn.Parameter(torch.rand(d,d_k))
w_value = torch.nn.Parameter(torch.rand(d,d_v))

In [15]:
query = token_embedding @ w_query
key = token_embedding @ w_key
value = token_embedding @ w_value

In [21]:
import math
from torch.nn.functional import softmax
q_kt = query @ key.T
f = q_kt/math.sqrt(d_k)
s = softmax(f, dim = -1)
context_vector = s @ value

# Detailed explanation of the code

## Map each word to an integer index

In [None]:
word_map = {s.lower(): i for i, s in enumerate(sorted(prompt.split(" ")))}
print(word_map)
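
Because the sentence repeats some words ("I" and "attention"), enumerating the sorted word list directly skips indices: the mapping printed earlier contains no 1 and no 6. The sketch below shows one possible alternative, assuming contiguous ids are wanted. It lowercases first and deduplicates with `set`. The names `words` and `vocab` are illustrative and not part of the notebook, which keeps using the original mapping.

```python
prompt = "Hello World I am an attention seeker I love attention"

# Lowercase first, then deduplicate, so each distinct word gets exactly one id.
words = sorted(set(w.lower() for w in prompt.split(" ")))

# Enumerating the deduplicated list yields ids 0..len(words)-1 with no gaps.
vocab = {w: i for i, w in enumerate(words)}
print(vocab)
```

With this variant the embedding table would need only `len(vocab)` rows.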

## Convert the sentence to integer tokens

In [None]:
tokens_int = [word_map[i.lower()] for i in prompt.split(" ")]
print(tokens_int)

In [None]:
import torch
tokens_tf = torch.tensor(tokens_int)
print(tokens_tf)
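
An embedding layer expects integer indices, so it can help to confirm the dtype of the token tensor before the lookup. A small sketch, reusing the token ids printed earlier in the notebook:

```python
import torch

# Token ids as printed earlier in the notebook.
tokens_int = [0, 3, 2, 4, 5, 7, 9, 2, 8, 7]
tokens_tf = torch.tensor(tokens_int)

# A tensor built from Python ints gets an integer dtype, which is what
# nn.Embedding needs for its index lookup.
print(tokens_tf.dtype)
print(tokens_tf.shape)
```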

## Converting tokens to vectors

The self-attention mechanism works on vectors, not on strings or integer indices, so the token indices must first be converted to vectors.


### Use of manual_seed(123)


`torch.manual_seed(123)` sets the **random seed** of PyTorch's default random number generator. The sections below explain what this means and why it matters.

---

#### **What Does `123` Indicate?**
- The number `123` is an arbitrary integer value that you choose as the **seed** for the random number generator.
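- This seed initializes the random number generator, so that random operations in your code (e.g., weight initialization in neural networks, random shuffling of data, etc.) produce the **same results every time** you run the code on the same setup. Results can still differ across PyTorch versions or hardware, and some GPU operations are nondeterministic even with a fixed seed.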

---

#### **Why Do We Use `torch.manual_seed(123)`?**

##### 1. **Reproducibility**
- In machine learning and deep learning, many operations involve randomness, such as:
  - Initializing weights in a neural network.
  - Shuffling data during training.
  - Dropout layers.
- Without setting a random seed, these operations would produce different results each time you run the code, making it difficult to reproduce experiments or debug issues.
- By setting a fixed seed (e.g., `123`), you ensure that the random numbers generated are the same every time you run the code. This makes your experiments **reproducible**.

##### 2. **Debugging**
- When debugging a model, you want to ensure that the behavior is consistent across runs. If the results change every time you run the code, it becomes difficult to identify whether a change in behavior is due to a bug or randomness.
- Setting a random seed eliminates this variability, making debugging easier.

##### 3. **Fair Comparisons**
- When comparing different models or hyperparameters, you want to ensure that the comparison is fair and not influenced by random initialization or other stochastic factors.
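- Using the same random seed gives every run the same stream of random numbers, so differences in results are less likely to come from initialization or data order. Identical initial weights are only guaranteed when the models create the same parameters in the same order.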

---

#### **How Does It Work?**
- When you call `torch.manual_seed(123)`, PyTorch’s random number generator is initialized with the seed value `123`.
- Any subsequent random operations (e.g., `torch.rand()`, `nn.Embedding` initialization, etc.) will produce the same sequence of random numbers every time you run the code.

---

#### **Example**

##### Without Setting a Seed
```python
import torch

# Random tensor without setting a seed
print(torch.rand(3))  # Output will be different every time
```

Output (Run 1, illustrative values):
```
tensor([0.1234, 0.5678, 0.9101])
```

Output (Run 2, illustrative values):
```
tensor([0.4321, 0.8765, 0.1098])
```

- The output differs between runs because, when no seed is set, PyTorch initializes its random number generator with a different seed each time the program starts.

---

##### With Setting a Seed
```python
import torch

# Set the random seed
torch.manual_seed(123)

# Random tensor with a fixed seed
print(torch.rand(3))  # Output will be the same every time
```

Output (Run 1):
```
tensor([0.2961, 0.5166, 0.2517])
```

Output (Run 2):
```
tensor([0.2961, 0.5166, 0.2517])
```

- The output is the same every time because the random number generator is initialized with the same seed (`123`).
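
The seed fixes the whole *sequence* of random numbers, not a single value: successive calls still return different tensors, and re-seeding restarts the sequence. The sketch below illustrates this. It also shows a local `torch.Generator`, which is one way to get reproducible draws without changing the global state. The variable names are illustrative.

```python
import torch

torch.manual_seed(123)
first = torch.rand(3)
second = torch.rand(3)   # continues the same stream, so it differs from `first`

torch.manual_seed(123)   # re-seeding restarts the stream
again = torch.rand(3)    # reproduces `first`

print(torch.equal(first, again))
print(torch.equal(first, second))

# A local generator carries its own seed and leaves the global generator alone.
gen = torch.Generator().manual_seed(123)
local = torch.rand(3, generator=gen)
print(local)
```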

---

#### **Why Use `123` Specifically?**
- The value `123` is arbitrary. You can use any integer value as the seed (e.g., `42`, `100`, `999`, etc.).
- The choice of seed does not affect the quality of randomness; it only ensures reproducibility.
- Commonly used seeds include `42` (a popular choice in the machine learning community) or `123` (as in this example).

---

#### **When to Use `torch.manual_seed`?**
You should use `torch.manual_seed` in the following scenarios:
1. **Weight Initialization**:
   - When initializing the weights of a neural network, setting a seed first makes the initial weights identical across runs.
   ```python
   import torch
   import torch.nn as nn  # also needed by the snippets below
   torch.manual_seed(123)
   model = nn.Linear(10, 1)  # Weights will be the same every time
   ```

2. **Data Shuffling**:
   - When shuffling data during training, you want to ensure that the order of the data is consistent across runs.
   ```python
   torch.manual_seed(123)
   indices = torch.randperm(100)  # Same shuffle every time
   ```

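3. **Dropout Layers**:
   - Dropout layers use randomness to deactivate neurons during training. The random mask is drawn when the layer is applied, not when it is created, so the seed must be set before the forward pass for the same neurons to be deactivated on every run.
   ```python
   torch.manual_seed(123)
   dropout = nn.Dropout(0.5)
   out = dropout(torch.ones(4))  # the mask is drawn here, at call time
   ```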

4. **Embedding Layers**:
   - Embedding layers use random initialization for their weights. Setting a seed ensures that the embeddings are initialized consistently.
   ```python
   torch.manual_seed(123)
   embed = nn.Embedding(100, 10)  # Same embeddings every time
   ```
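
The scenarios above can be checked directly. The following sketch uses dropout as the example: seeding immediately before each forward pass yields the same mask. The tensor `x` and the variable names are illustrative.

```python
import torch
import torch.nn as nn

dropout = nn.Dropout(p=0.5)   # modules start in training mode, so dropout is active
x = torch.ones(8)

torch.manual_seed(123)
out_a = dropout(x)

torch.manual_seed(123)        # same seed right before the call: same mask
out_b = dropout(x)

print(out_a)
print(torch.equal(out_a, out_b))
```

Kept entries are scaled by `1 / (1 - p)`, so with `p=0.5` every element of the output is either 0 or 2.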

---

#### **Summary**
- `torch.manual_seed(123)` sets the random seed for PyTorch’s random number generator to `123`.
- This ensures that any random operations in your code produce the same results every time you run it, which is crucial for **reproducibility**, **debugging**, and **fair comparisons**.
- The value `123` is arbitrary; you can use any integer value as the seed.

In [None]:
torch.manual_seed(123)

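### Setting a vocabulary size

`vocab_size` is the number of rows in the embedding table, i.e. the number of distinct token indices the layer can look up. The example sentence contains only 8 unique words (largest index 9), so any value of at least 10 would work here; 50,000 simply assumes a large vocabulary.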

In [None]:
vocab_size = 50000

### Creating an Embedding Layer

* **What is an embedding layer?**

  - An embedding layer is a lookup table that maps integer indices (representing words) to dense vectors of fixed size (embeddings).

  - Each word in the vocabulary is represented by a vector of size 3 (in this case).



* **Parameters:**

  - vocab_size: The size of the vocabulary (50,000 in this case).

  - 3: The dimensionality of the embedding vectors. Each word will be represented by a 3-dimensional vector.



* **How it works:**

  - The embedding layer is essentially a matrix of size (vocab_size, embedding_dim). For example, if vocab_size = 50000 and embedding_dim = 3, the embedding layer is a matrix of size (50000, 3).

  - When you pass an integer index to the embedding layer, it looks up the corresponding row in this matrix and returns the embedding vector.

In [None]:
import torch.nn as nn
embed = nn.Embedding(vocab_size, embedding_dim=3)
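
To make the "lookup table" description concrete, the sketch below checks that calling the layer on some indices returns exactly the corresponding rows of its weight matrix. The index values are arbitrary examples.

```python
import torch
import torch.nn as nn

torch.manual_seed(123)
embed = nn.Embedding(50000, 3)   # weight matrix of shape (50000, 3)

idx = torch.tensor([0, 3, 2])
looked_up = embed(idx)           # forward pass through the layer
by_indexing = embed.weight[idx]  # plain row indexing into the weight matrix

print(embed.weight.shape)
print(torch.equal(looked_up, by_indexing))
```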

### Embedding the tokenized int indices

* **Input:**

  - tokens_tf: A tensor of integer indices representing the sentence.

* **What happens here?**

  - The embedding layer (embed) takes the integer indices and maps each index to its corresponding embedding vector.

  - For example, in the run shown at the top of this notebook:

    - Index 0 -> Embedding vector [0.3374, -0.1778, -0.3035]

    - Index 8 -> Embedding vector [0.4965, -1.5723, 0.9666]

    - And so on.

* **Output:**

  - The output is a tensor of shape (sequence_length, embedding_dim), where:

    - sequence_length is the number of words in the sentence (10 in this case).

    - embedding_dim is the dimensionality of the embedding vectors (3 in this case).

* **Why .detach()?**

  - The .detach() method returns a new tensor that holds the same values but is cut off from the computation graph, so it is treated as a constant and no gradients flow back through it to the embedding layer during backpropagation. Here that means the embedded vectors serve as fixed inputs to the attention computation. More generally, detach() is useful when you want to stop gradient calculations on part of a model or avoid the memory cost of tracking them.

  - https://medium.com/biased-algorithms/understanding-tensor-detach-in-pytorch-a-practical-guide-e859a7713f28#:~:text=detach()%20is%20invaluable%20when,memory%20optimization%20in%20complex%20models.
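
The effect of `.detach()` can be inspected through `requires_grad` and `grad_fn`. A minimal sketch, using a small embedding table of assumed size (10, 3) rather than the notebook's 50,000 rows:

```python
import torch
import torch.nn as nn

embed = nn.Embedding(10, 3)
idx = torch.tensor([0, 3, 2])

attached = embed(idx)            # part of the autograd graph
detached = embed(idx).detach()   # same values, no gradient history

print(attached.requires_grad, attached.grad_fn is not None)
print(detached.requires_grad, detached.grad_fn is None)
```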



In [None]:
embedded_sentence = embed(tokens_tf).detach()

In [None]:
print(embedded_sentence)
print(embedded_sentence.shape)

## Defining Weight Matrices

The embedding dimension (d) is required to define the size of the weight matrices (W_query, W_key, W_value) for the self-attention mechanism.

In [None]:
d = embedded_sentence.shape[1]

**Purpose:**
  - The variables d_q, d_k, and d_v define the dimensionality of the query, key, and value vectors in the self-attention mechanism.

**What do these represent?**

  - d_q: Dimensionality of the query vectors.

  - d_k: Dimensionality of the key vectors.

  - d_v: Dimensionality of the value vectors.

**Why are they different?**

  - The query and key dimensions must match (d_q = d_k), because attention scores are dot products between query and key vectors. The value dimension d_v is independent of them and sets the size of each output context vector.

  - Here, d_q = 24, d_k = 24, and d_v = 28 are arbitrary choices for demonstration.

(It’s important to note that $d$ represents the size of each word vector $x$.)

Since we are computing the dot product between the query and key vectors, these two vectors must contain the same number of elements ($d_q = d_k$). However, the number of elements in the value vector $v^{(i)}$, which determines the size of the resulting context vector, is arbitrary.

In [None]:
d_q, d_k, d_v = 24, 24, 28
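
The dimension constraints can be verified with a quick shape check. In the sketch below, random inputs stand in for the embeddings (an assumption for brevity), and a deliberately mismatched key size shows that the query-key product fails when `d_q != d_k`.

```python
import torch

n, d = 10, 3                     # sequence length and embedding size
d_q, d_k, d_v = 24, 24, 28

x = torch.rand(n, d)             # stand-in for the embedded sentence
query = x @ torch.rand(d, d_q)
key = x @ torch.rand(d, d_k)
value = x @ torch.rand(d, d_v)

scores = query @ key.T           # defined because d_q == d_k
context = scores @ value         # width is set by d_v
print(scores.shape, context.shape)

# With mismatched sizes the matrix product is undefined and matmul raises.
bad_key = x @ torch.rand(d, d_k + 1)
try:
    query @ bad_key.T
    mismatch_raised = False
except RuntimeError:
    mismatch_raised = True
print(mismatch_raised)
```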

### Weight Matrices for Self-Attention

1. torch.rand(d, d_q):

  - Generates a random tensor of shape (d, d_q) with values sampled from a uniform distribution on [0, 1).

  - For example, if d = 3 and d_q = 24, this creates a tensor of shape (3, 24), so that multiplying the (10, 3) embedding matrix by it yields a (10, 24) query matrix.

2. torch.nn.Parameter(...):

  - In neural networks, certain tensors (e.g., weights of layers) need to be updated during training to minimize the loss function. These tensors are called learnable parameters.

  - Wrapping a tensor in torch.nn.Parameter marks it as learnable: it has requires_grad=True, so PyTorch tracks it for gradient computation, and it is registered automatically when assigned as an attribute of an nn.Module.

  - During the forward pass, these weight matrices project the input embeddings into query, key, and value vectors.

  - During backpropagation, PyTorch computes gradients for these matrices; an optimizer then uses those gradients to update their values and reduce the loss. This notebook runs only the forward pass, so the matrices keep their random initial values.


3. W_query, W_key, W_value:

  - These are the weight matrices for the query, key, and value projections, respectively.

  - Their shapes are:

    - W_query: (d, d_q)

    - W_key: (d, d_k)

    - W_value: (d, d_v)

In [None]:
w_query = torch.nn.Parameter(torch.rand(d,d_q))
w_key = torch.nn.Parameter(torch.rand(d,d_k))
w_value = torch.nn.Parameter(torch.rand(d,d_v))
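
What `nn.Parameter` adds over a plain tensor is easiest to see inside a module. The sketch below uses a hypothetical `Projections` class (not part of the notebook) that holds one parameter and one plain tensor, then lists what PyTorch registered.

```python
import torch
import torch.nn as nn

class Projections(nn.Module):  # hypothetical helper, for illustration only
    def __init__(self, d, d_q):
        super().__init__()
        # Wrapped in nn.Parameter: registered with the module and tracked by autograd.
        self.w_query = nn.Parameter(torch.rand(d, d_q))
        # Plain tensor attribute: neither registered nor tracked.
        self.scratch = torch.rand(d, d_q)

proj = Projections(3, 24)
names = [name for name, _ in proj.named_parameters()]

print(names)
print(proj.w_query.requires_grad, proj.scratch.requires_grad)
```

Only registered parameters are returned by `parameters()`, which is what an optimizer would typically receive.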

## Computing unnormalized attention weights

In [None]:
query = embedded_sentence @ w_query
keys = embedded_sentence @ w_key
values = embedded_sentence @ w_value
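
The matrix product above projects all tokens at once; each row of the result is the projection of one token. A small sketch with random stand-in data (an assumption, not the notebook's embeddings) confirms that projecting a single token gives the same row.

```python
import torch

torch.manual_seed(123)
embedded = torch.rand(10, 3)   # stand-in for the embedded sentence: 10 tokens, d = 3
w_query = torch.rand(3, 24)    # shape (d, d_q)

query = embedded @ w_query     # all tokens at once: shape (10, 24)
query_2 = embedded[2] @ w_query  # the third token alone: shape (24,)

print(query.shape, query_2.shape)
print(torch.allclose(query[2], query_2))
```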

### Attention Formula

$$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$

Here $Q$, $K$, and $V$ are the `query`, `keys`, and `values` matrices computed above.
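
The formula can be written as a single function. This is a minimal sketch for 2-D inputs, with no batching, masking, or multiple heads. The function name `attention` and the random inputs are illustrative.

```python
import math
import torch
import torch.nn.functional as F

def attention(q, k, v):
    """Compute softmax(q k^T / sqrt(d_k)) v for 2-D q, k, v."""
    d_k = k.shape[-1]
    scores = q @ k.T / math.sqrt(d_k)    # (n, n) scaled similarity scores
    weights = F.softmax(scores, dim=-1)  # each row sums to 1
    return weights @ v                   # (n, d_v) weighted averages of the rows of v

torch.manual_seed(123)
q = torch.rand(10, 24)
k = torch.rand(10, 24)
v = torch.rand(10, 28)

out = attention(q, k, v)
print(out.shape)
```

Because each output row is a weighted average of the rows of `v`, its entries can never exceed the column-wise maximum of `v`.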

In [None]:
q_kt = query @ keys.T

In [None]:
import math
f = q_kt / math.sqrt(d_k)
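
A common justification for dividing by $\sqrt{d_k}$ is that the variance of a dot product grows with the vector length, which would push softmax toward very peaked outputs. The sketch below illustrates this under the assumption that query and key entries are independent standard normal values. That is the usual textbook setting, not the uniform `torch.rand` initialization used in this notebook.

```python
import math
import torch

torch.manual_seed(0)
d_k = 24

# Assumption: independent standard-normal entries for queries and keys.
q = torch.randn(10000, d_k)
k = torch.randn(10000, d_k)

raw = (q * k).sum(dim=-1)        # 10000 dot products of length-d_k vectors
scaled = raw / math.sqrt(d_k)

# The raw variance is close to d_k; after scaling it is close to 1.
print(raw.var().item(), scaled.var().item())
```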

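### Softmax

Softmax is applied along the last dimension (`dim=-1`), turning each row of scaled scores into attention weights that are non-negative and sum to 1.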

In [None]:
import torch.nn.functional as F
s = F.softmax(f, dim = -1)
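
A small check of what `dim=-1` means here: softmax normalizes each row independently. The toy scores below are made up for illustration, and the result is compared against a direct implementation of the definition.

```python
import torch
import torch.nn.functional as F

# Toy scores: two rows, three columns.
f = torch.tensor([[1.0, 2.0, 3.0],
                  [0.0, 0.0, 0.0]])

s = F.softmax(f, dim=-1)
row_sums = s.sum(dim=-1)

# Direct implementation of the definition, applied row by row.
manual = torch.exp(f) / torch.exp(f).sum(dim=-1, keepdim=True)

print(s)
print(row_sums)
```

Equal scores, as in the second row, give equal weights.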

In [None]:
context_vector = s @ values
print(context_vector)
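
The steps above can be collected into one module. This is a sketch rather than a reference implementation: the class name `SelfAttention` is hypothetical, the weights use the same `torch.rand` initialization as the notebook, and the input is random stand-in data instead of the embedded sentence.

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class SelfAttention(nn.Module):  # hypothetical wrapper around the notebook's steps
    def __init__(self, d, d_k, d_v):
        super().__init__()
        self.d_k = d_k
        self.w_query = nn.Parameter(torch.rand(d, d_k))
        self.w_key = nn.Parameter(torch.rand(d, d_k))
        self.w_value = nn.Parameter(torch.rand(d, d_v))

    def forward(self, x):
        query = x @ self.w_query
        keys = x @ self.w_key
        values = x @ self.w_value
        scores = query @ keys.T / math.sqrt(self.d_k)
        weights = F.softmax(scores, dim=-1)
        return weights @ values  # one context vector of size d_v per token

torch.manual_seed(123)
x = torch.rand(10, 3)            # stand-in for the embedded sentence
attn = SelfAttention(d=3, d_k=24, d_v=28)
context = attn(x)
print(context.shape)
```

Each of the 10 tokens receives a 28-dimensional context vector, matching the shape of `context_vector` in the notebook.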