### Step 0: Import the necessary libraries

In [None]:
import numpy as np
from scipy.special import softmax # Scientific Python

### Step 1: Represent the Input

* d<sub>model</sub> for the original transformer is 512.
* We scale it down to d<sub>model</sub> = 4.

Let's say the input sequence has 3 items. Since d<sub>model</sub> is 4, each of the 3 inputs must be a 4-dimensional vector.

In [None]:
print("Input: 3 inputs, d_model: 4")
x = np.array([[1.0, 0.0, 1.0, 0.0], # Input 1
              [0.0, 2.0, 0.0, 2.0], # Input 2
              [1.0, 1.0, 1.0, 1.0]]) # Input 3
print(x)

Input: 3 inputs, d_model: 4
[[1. 0. 1. 0.]
 [0. 2. 0. 2.]
 [1. 1. 1. 1.]]


### Step 2: Initialize the weight matrices

All inputs share 3 weight matrices:
* Q<sub>w</sub>, trained to produce the queries.
* K<sub>w</sub>, trained to produce the keys.
* V<sub>w</sub>, trained to produce the values.

In the original transformer, each weight matrix projects to d<sub>k</sub> = 64 dimensions.
Let's scale the matrices down to d<sub>k</sub> = 3, so each one has shape d<sub>model</sub> × d<sub>k</sub> = 4 × 3.

In [None]:
print("Weight matrix for query vector")
w_query = np.array([
    [1, 0, 1],
    [1, 0, 0],
    [0, 0, 1],
    [0, 1, 1],
])
print(w_query)

print("\nWeight matrix for key vector")
w_key = np.array([
    [0, 0, 1],
    [1, 1, 0],
    [0, 1, 0],
    [1, 1, 0]
])
print(w_key)

print("\nWeight matrix for value vector")
w_value = np.array([
    [0, 2, 0],
    [0, 3, 0],
    [1, 0, 3],
    [1, 1, 0]
])
print(w_value)

Weight matrix for query vector
[[1 0 1]
 [1 0 0]
 [0 0 1]
 [0 1 1]]

Weight matrix for key vector
[[0 0 1]
 [1 1 0]
 [0 1 0]
 [1 1 0]]

Weight matrix for value vector
[[0 2 0]
 [0 3 0]
 [1 0 3]
 [1 1 0]]


### Step 3: Matrix multiplication to obtain Q, K, V matrices, where each matrix contains 3 vectors for Input 1, Input 2, and Input 3

In [None]:
print("\nMatrix multiplication to get Q matrix")
Q = np.matmul(x, w_query)
print(Q)

print("\nMatrix multiplication to get K matrix")
K = np.matmul(x, w_key)
print(K)

print("\nMatrix multiplication to get V matrix")
V = np.matmul(x, w_value)
print(V)


Matrix multiplication to get Q matrix
[[1. 0. 2.]
 [2. 2. 2.]
 [2. 1. 3.]]

Matrix multiplication to get K matrix
[[0. 1. 1.]
 [4. 4. 0.]
 [2. 3. 1.]]

Matrix multiplication to get V matrix
[[1. 2. 3.]
 [2. 8. 0.]
 [2. 6. 3.]]
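As a sanity check on the products above, the query for Input 1 can be reproduced by hand. This is a minimal sketch; the names `q1_by_hand` and `q1_matmul` are illustrative and not part of the notebook.

```python
import numpy as np

x = np.array([[1.0, 0.0, 1.0, 0.0],
              [0.0, 2.0, 0.0, 2.0],
              [1.0, 1.0, 1.0, 1.0]])
w_query = np.array([[1, 0, 1],
                    [1, 0, 0],
                    [0, 0, 1],
                    [0, 1, 1]])

# Input 1 is [1, 0, 1, 0], so its query is row 0 plus row 2 of w_query.
q1_by_hand = 1.0 * w_query[0] + 1.0 * w_query[2]
q1_matmul = x[0] @ w_query

print(q1_by_hand)  # [1. 0. 2.]
print(q1_matmul)   # [1. 0. 2.]
```

Both results agree with the first row of Q printed above. The same reasoning applies to K and V.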


### Step 4: Scaled attention scores

$$ \text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right) V $$

Where:
- \( Q \) is the Query matrix,
- \( K \) is the Key matrix,
- \( V \) is the Value matrix,
- \( d_k \) is the dimension of the Key vectors.


$$ \sqrt{d_k} = \sqrt{3} \approx 1.73 \text{, which we round down to } 1 \text{ for this example} $$
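To see what the simplification changes, the sketch below compares the softmax weights obtained with a divisor of 1 against those obtained with the actual factor from the formula. It assumes the Q and K matrices from Step 3; the variable names are illustrative.

```python
import numpy as np
from scipy.special import softmax

Q = np.array([[1., 0., 2.], [2., 2., 2.], [2., 1., 3.]])
K = np.array([[0., 1., 1.], [4., 4., 0.], [2., 3., 1.]])

d_k = K.shape[1]                           # 3 in this toy example
raw_scores = Q @ K.T                       # equivalent to dividing by 1
scaled_scores = raw_scores / np.sqrt(d_k)  # scaling from the formula

weights_unscaled = softmax(raw_scores, axis=1)
weights_scaled = softmax(scaled_scores, axis=1)

# Compare the weights of the second query under both choices.
print(weights_unscaled[1].round(4))
print(weights_scaled[1].round(4))
```

Dividing by the square root of d<sub>k</sub> shrinks the scores, so the softmax distribution for each query becomes less peaked. The rest of this walkthrough keeps the divisor of 1.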



In [None]:
print("Scaled Attention Scores")
# d_k holds the scaling divisor: sqrt(d_k) = sqrt(3), rounded down to 1 for this example
d_k = 1
attention_scores = np.matmul(Q, K.T) / d_k
print(attention_scores)

Scaled Attention Scores
[[ 2.  4.  4.]
 [ 4. 16. 12.]
 [ 4. 12. 10.]]


**Key Notes:**
* Rows represent different queries.
* Columns represent different keys.

**Observations:**
* Higher values in a row mean that the query pays more attention to the corresponding key.
* For example, in the second row, the value 16 means that the second query gives much more attention to the second key compared to the others.
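The observation above can be checked programmatically. A minimal sketch, assuming the score matrix printed in this step; `top_key` is an illustrative name.

```python
import numpy as np

attention_scores = np.array([[2., 4., 4.],
                             [4., 16., 12.],
                             [4., 12., 10.]])

# For each query (row), find the index of the key (column) with the highest score.
# np.argmax returns the first index when there is a tie, as in row 0.
top_key = np.argmax(attention_scores, axis=1)
print(top_key)  # [1 1 1]
```

Index 1 is the second key, so all three queries score the second key highest. The first query scores the second and third keys equally.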

### Step 5: Scaled softmax attention score for each vector

In [None]:
print("Scaled softmax score for each vector")
attention_scores[0] = softmax(attention_scores[0])
attention_scores[1] = softmax(attention_scores[1])
attention_scores[2] = softmax(attention_scores[2])

print(attention_scores)

Scaled softmax score for each vector
[[6.33789383e-02 4.68310531e-01 4.68310531e-01]
 [6.03366485e-06 9.82007865e-01 1.79861014e-02]
 [2.95387223e-04 8.80536902e-01 1.19167711e-01]]
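The three row-wise calls can also be written as a single call with `axis=1`. The sketch below does that and compares the result with a hand-written softmax; `weights_manual` and `shifted` are illustrative names.

```python
import numpy as np
from scipy.special import softmax

scores = np.array([[2., 4., 4.],
                   [4., 16., 12.],
                   [4., 12., 10.]])

# axis=1 normalizes each row (each query) independently.
weights = softmax(scores, axis=1)

# The same computation written out by hand. Subtracting the row maximum
# avoids overflow in np.exp without changing the result.
shifted = scores - scores.max(axis=1, keepdims=True)
weights_manual = np.exp(shifted) / np.exp(shifted).sum(axis=1, keepdims=True)

# Each row is a probability distribution over the keys.
print(weights.sum(axis=1))
```

Writing the result to a new array also keeps the raw scores available, whereas the cell above overwrites them in place.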


### Step 6: Final attention representation

In [None]:
print("Attention values for input#1 vector")
# Weight each value vector by the attention weight that input #1 assigns to it
print("Attention 1")
attention1 = attention_scores[0][0] * V[0]
print(attention1)
print("Attention 2")
attention2 = attention_scores[0][1] * V[1]
print(attention2)
print("Attention 3")
attention3 = attention_scores[0][2] * V[2]
print(attention3)

Attention values for input#1 vector
Attention 1
[0.06337894 0.12675788 0.19013681]
Attention 2
[0.93662106 3.74648425 0.        ]
Attention 3
[0.93662106 2.80986319 1.40493159]


**Key Notes:**
* Each printed vector is one value vector scaled by the softmax weight that the first query (input #1) assigns to the matching key.
* Columns correspond to the d<sub>k</sub> = 3 dimensions of the value vectors.

**Observation:**
* `Attention 2 [0.93662106 3.74648425 0.]`:
1. This is the contribution of the second value vector `[2. 8. 0.]` to the output for input #1, scaled by the weight 0.46831053.
2. The 0.0 in the third column comes from the third component of the second value vector being 0. It does not mean the query ignores anything.
* `Attention 3 [0.93662106 2.80986319 1.40493159]`:
1. This is the contribution of the third value vector `[2. 6. 3.]`, scaled by the same weight 0.46831053, because the first query scores the second and third keys equally (both 4).
2. `Attention 1` is much smaller because the first query gives the first key a weight of only 0.06337894.
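The three weighted vectors can be computed in one expression with broadcasting. This is a sketch that assumes the softmax row for input #1 and the V matrix from Step 3; `weights_q1` and `contributions` are illustrative names.

```python
import numpy as np

# Softmax weights of input #1 over the three keys (first row from Step 5).
weights_q1 = np.array([0.06337894, 0.46831053, 0.46831053])
V = np.array([[1., 2., 3.],
              [2., 8., 0.],
              [2., 6., 3.]])

# weights_q1[:, None] has shape (3, 1), so row i of V is multiplied by weight i.
contributions = weights_q1[:, None] * V
print(contributions)
```

Row `i` of `contributions` corresponds to the vector printed under `Attention i+1` above.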


### Step 7: Summing up the results

To get the attention output for the first word, input #1, we sum the three weighted value vectors element-wise.

In [None]:
attention_for_input1 = attention1 + attention2 + attention3
attention_for_input1

array([1.93662106, 6.68310531, 1.59506841])
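Steps 4 to 7 can be collapsed into two matrix operations that produce the output for all three inputs at once. The sketch below is one way to write it; the function name `attention` and its `scale` parameter are assumptions for illustration, with `scale=1.0` mirroring the simplification used in this example.

```python
import numpy as np
from scipy.special import softmax

Q = np.array([[1., 0., 2.], [2., 2., 2.], [2., 1., 3.]])
K = np.array([[0., 1., 1.], [4., 4., 0.], [2., 3., 1.]])
V = np.array([[1., 2., 3.], [2., 8., 0.], [2., 6., 3.]])

def attention(Q, K, V, scale=1.0):
    """Compute softmax(Q K^T / scale) V, normalizing over the keys of each query."""
    weights = softmax(Q @ K.T / scale, axis=1)
    # Row i of the product is the weighted sum of value vectors for query i.
    return weights @ V

output = attention(Q, K, V)
print(output[0])  # attention output for input #1
print(output)     # one row per input
```

The first row should match the vector obtained above for input #1. Passing `scale=np.sqrt(3)` would apply the scaling from the formula in Step 4 instead.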

### Multihead Attention

Here we simply repeat the 7 steps above for each head and each word. After that, we concatenate the attention results of the 8 heads using **`hstack`**, which concatenates arrays/vectors horizontally.
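The concatenation can be sketched on the toy input as follows. Randomly generated weights stand in for trained ones, and the helper `head_output` is a hypothetical name, so the numbers themselves carry no meaning; only the shapes are of interest.

```python
import numpy as np
from scipy.special import softmax

rng = np.random.default_rng(0)  # random weights stand in for trained ones

x = np.array([[1.0, 0.0, 1.0, 0.0],
              [0.0, 2.0, 0.0, 2.0],
              [1.0, 1.0, 1.0, 1.0]])
n_heads, d_model, d_k = 8, 4, 3

def head_output(x, w_q, w_k, w_v):
    """Run the single-head steps for all inputs with one set of weights."""
    Q, K, V = x @ w_q, x @ w_k, x @ w_v
    weights = softmax(Q @ K.T / np.sqrt(K.shape[1]), axis=1)
    return weights @ V

heads = []
for _ in range(n_heads):
    # Each head gets its own query, key, and value weight matrices.
    w_q = rng.standard_normal((d_model, d_k))
    w_k = rng.standard_normal((d_model, d_k))
    w_v = rng.standard_normal((d_model, d_k))
    heads.append(head_output(x, w_q, w_k, w_v))

# Each head yields a (3, 3) array; hstack joins them along the columns.
multi_head = np.hstack(heads)
print(multi_head.shape)  # (3, 24)
```

Each input ends up with 8 × 3 = 24 values. The original architecture also applies a final linear projection to the concatenated result, which is omitted in this sketch.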

## Transformers from Hugging Face

In [None]:
# Ensure that hugging face transformers are installed
! pip -qq install transformers

In [None]:
from transformers import pipeline

translator = pipeline('translation_en_to_fr')

print(translator("Demostration of transformers, which helps us to translate between languages easily!!", max_length=80))

No model was supplied, defaulted to google-t5/t5-base and revision 686f1db (https://huggingface.co/google-t5/t5-base).
Using a pipeline without specifying a model name and revision in production is not recommended.


[{'translation_text': 'Démostration des transformateurs, qui nous aide à traduire facilement entre les langues!!'}]
