**Multi-Head Attention Sublayer**

- The input of the multi-head attention sublayer in the first layer of the encoder stack is a vector that 
contains the embedding and the positional encoding of each word. The subsequent layers of the stack 
do not repeat these operations; each one takes the output of the previous layer as its input.
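
As a rough illustration of what such an input might look like, the sketch below adds sinusoidal positional encodings (the scheme described in the original Transformer paper) to placeholder embeddings. The function name `positional_encoding`, the random embeddings, and the assumption that `d_model` is even are illustrative choices and not part of this notebook.

```python
import numpy as np

def positional_encoding(seq_len, d_model):
    """Sinusoidal positional encoding; assumes d_model is even."""
    positions = np.arange(seq_len)[:, None]        # shape (seq_len, 1)
    even_dims = np.arange(0, d_model, 2)[None, :]  # the "2i" indices
    angles = positions / np.power(10000.0, even_dims / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)                   # even dimensions
    pe[:, 1::2] = np.cos(angles)                   # odd dimensions
    return pe

# Hypothetical embeddings: random values standing in for learned ones
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(3, 4))               # 3 words, d_model=4

# Input to the first encoder layer: embedding plus positional encoding
layer_input = embeddings + positional_encoding(3, 4)
print(layer_input.shape)
```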

In [1]:
#@ LOADING THE REQUIRED LIBRARIES AND DEPENDENCIES
import numpy as np
from scipy.special import softmax

**Step 1: Represent the Input**

In [2]:
#@ REPRESENT THE INPUT
print("Step 1: Input: 3 inputs, d_model=4")

x = np.array([[1.0, 0.0, 1.0, 0.0],              # Input 1
              [0.0, 2.0, 0.0, 2.0],              # Input 2
              [1.0, 1.0, 1.0, 1.0]])             # Input 3

print(x)

Step 1: Input: 3 inputs, d_model=4
[[1. 0. 1. 0.]
 [0. 2. 0. 2.]
 [1. 1. 1. 1.]]


**Step 2: Initializing the weight matrices**

In [3]:
#@ INITIALIZING THE WEIGHT MATRICES
print("Step 2: Weight 3 Dimensions x d_model=4")
print("W_query")
w_query = np.array([[1, 0, 1],
                   [1, 0, 0],
                   [0, 0, 1],
                   [0, 1, 1]])

print(w_query)

Step 2: Weight 3 Dimensions x d_model=4
W_query
[[1 0 1]
 [1 0 0]
 [0 0 1]
 [0 1 1]]


In [4]:
print("W_key")
w_key = np.array([[0, 0, 1],
                  [1, 1, 0],
                  [0, 1, 0],
                  [1, 1, 0]])
print(w_key)

W_key
[[0 0 1]
 [1 1 0]
 [0 1 0]
 [1 1 0]]


In [5]:
print("W_value")
w_value = np.array([[0, 2, 0],
                    [0, 3, 0],
                    [1, 0, 3],
                    [1, 1, 0]])

print(w_value)

W_value
[[0 2 0]
 [0 3 0]
 [1 0 3]
 [1 1 0]]


**Step 3: Matrix Multiplication to obtain Q, K, and V**

In [6]:
print("Queries: x * w_query")
Q = np.matmul(x, w_query)
print(Q)

Queries: x * w_query
[[1. 0. 2.]
 [2. 2. 2.]
 [2. 1. 3.]]


In [7]:
print("Keys: x * w_key")
K = np.matmul(x, w_key)
print(K)

Keys: x * w_key
[[0. 1. 1.]
 [4. 4. 0.]
 [2. 3. 1.]]


In [8]:
print("Values: x * w_value")
V = np.matmul(x, w_value)
print(V)

Values: x * w_value
[[1. 2. 3.]
 [2. 8. 0.]
 [2. 6. 3.]]


**Step 4: Scaled Attention Scores**

In [9]:
#@ SCALED ATTENTION SCORES
k_d = 1                      # Scaling divisor; stands in for sqrt(d_k), simplified to 1 in this example
attention_scores = (Q @ K.transpose()/k_d)
print(attention_scores)

[[ 2.  4.  4.]
 [ 4. 16. 12.]
 [ 4. 12. 10.]]
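
The cell above divides by 1, so the scores are effectively unscaled. In the standard scaled dot-product formulation the divisor is the square root of the key dimension, which is 3 here. The following sketch, which copies `Q` and `K` from Step 3 and uses the illustrative names `unscaled_scores` and `scaled_scores`, shows what that scaling would look like.

```python
import numpy as np

# Q and K copied from the outputs of Step 3
Q = np.array([[1.0, 0.0, 2.0],
              [2.0, 2.0, 2.0],
              [2.0, 1.0, 3.0]])
K = np.array([[0.0, 1.0, 1.0],
              [4.0, 4.0, 0.0],
              [2.0, 3.0, 1.0]])

d_k = K.shape[-1]                        # key dimension, 3 in this example
unscaled_scores = Q @ K.T                # same values as the cell above
scaled_scores = unscaled_scores / np.sqrt(d_k)
print(scaled_scores)
```

Dividing by the square root of `d_k` shrinks every score by the same factor, so the ranking of the keys for each query is unchanged, but the softmax in the next step becomes less sharply peaked.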


**Step 5: Scaled Softmax Attention Scores for Each Vector**

In [10]:
#@ SCALED SOFTMAX ATTENTION SCORES FOR EACH VECTOR
attention_scores[0]=softmax(attention_scores[0])
attention_scores[1]=softmax(attention_scores[1])
attention_scores[2]=softmax(attention_scores[2])
print(attention_scores[0])
print(attention_scores[1])
print(attention_scores[2])

[0.06337894 0.46831053 0.46831053]
[6.03366485e-06 9.82007865e-01 1.79861014e-02]
[2.95387223e-04 8.80536902e-01 1.19167711e-01]
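
The three per-row calls can also be written as one row-wise operation. The sketch below uses a hand-written helper, `row_softmax` (a name introduced here for illustration), so that the computation is visible; calling `softmax(attention_scores, axis=1)` from `scipy.special` should give the same result.

```python
import numpy as np

def row_softmax(scores):
    """Softmax over each row, shifted by the row maximum for numerical stability."""
    shifted = scores - scores.max(axis=1, keepdims=True)
    exp_scores = np.exp(shifted)
    return exp_scores / exp_scores.sum(axis=1, keepdims=True)

# Attention scores copied from the output of Step 4
scores = np.array([[2.0,  4.0,  4.0],
                   [4.0, 16.0, 12.0],
                   [4.0, 12.0, 10.0]])

weights = row_softmax(scores)
print(weights)
print(weights.sum(axis=1))   # each row of weights sums to 1
```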


**Step 6: The Final Attention Representation**

In [11]:
#@ ATTENTION REPRESENTATIONS
print("Attention 1")
# Weight each value vector by input 1's attention score for that vector
attention1=attention_scores[0][0]*V[0]
print(attention1)

print("\nAttention 2")
attention2=attention_scores[0][1]*V[1]
print(attention2)

print("\nAttention 3")
attention3=attention_scores[0][2]*V[2]
print(attention3)

Attention 1
[0.06337894 0.12675788 0.19013681]

Attention 2
[0.93662106 3.74648425 0.        ]

Attention 3
[0.93662106 2.80986319 1.40493159]


**Step 7: Summing up the Results**

In [12]:
#@ SUM THE RESULTS TO CREATE OUTPUT MATRIX
attention_input1 = attention1+attention2+attention3
print(attention_input1)

[1.93662106 6.68310531 1.59506841]
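
Steps 4 to 7 compute the result for input 1 only. Assuming the same divisor of 1, the whole procedure can be collapsed into a single matrix expression that handles every input at once. The helper name `attention` and its `divisor` parameter are illustrative and do not come from the notebook.

```python
import numpy as np

def attention(Q, K, V, divisor=1.0):
    """softmax(Q K^T / divisor) V, with the softmax taken over each row."""
    scores = Q @ K.T / divisor
    scores = scores - scores.max(axis=1, keepdims=True)   # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=1, keepdims=True)
    return weights @ V                                     # weighted sum of value vectors

# Q, K, and V copied from the outputs of Step 3
Q = np.array([[1.0, 0.0, 2.0], [2.0, 2.0, 2.0], [2.0, 1.0, 3.0]])
K = np.array([[0.0, 1.0, 1.0], [4.0, 4.0, 0.0], [2.0, 3.0, 1.0]])
V = np.array([[1.0, 2.0, 3.0], [2.0, 8.0, 0.0], [2.0, 6.0, 3.0]])

output = attention(Q, K, V)
print(output)   # row 0 matches attention_input1 above
```

Each row of `output` is the attention representation of one input, so the first row reproduces the sum from Step 7.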


**Step 8: Repeat the Same Steps for Inputs 1 to 3**

In [13]:
#@ RANDOM PLACEHOLDER FOR THE ATTENTION OUTPUT OF ONE HEAD: 3 INPUTS x 64 DIMENSIONS
attention_head1=np.random.random((3, 64))
print(attention_head1)

[[0.51803227 0.06593474 0.57898974 0.26156816 0.45903392 0.54301963
  0.36944504 0.78062509 0.84132218 0.80327461 0.76400561 0.53534402
  0.68802208 0.3250726  0.94307388 0.56236725 0.71137752 0.70393443
  0.80570251 0.07422293 0.80974836 0.93014795 0.14419805 0.91441808
  0.64473565 0.7361935  0.97487457 0.83497554 0.99635858 0.55097136
  0.18568183 0.68184303 0.20092957 0.64811001 0.53270291 0.63644098
  0.49144787 0.75809754 0.50862987 0.91943953 0.77325465 0.33735139
  0.76314969 0.93319312 0.78110282 0.1827154  0.53686553 0.45725193
  0.35386854 0.6857252  0.37433003 0.69211331 0.47597565 0.2182687
  0.52112479 0.46600617 0.0706057  0.76639653 0.2188229  0.05410546
  0.77817158 0.06417579 0.11750989 0.12067076]
 [0.78121092 0.55654516 0.11760405 0.80981876 0.97131738 0.74502757
  0.9683498  0.35195239 0.13098691 0.77428144 0.99673375 0.9935002
  0.2529348  0.02599951 0.50620214 0.43188773 0.05327587 0.22232563
  0.19142361 0.20829672 0.15946058 0.1350743  0.91098839 0.60906112
  0

**Step 9: The Output of Heads of Attention Sublayer**

In [14]:
#@ RANDOM PLACEHOLDERS FOR THE OUTPUTS OF THE 8 HEADS
z0h1=np.random.random((3, 64))
z1h2=np.random.random((3, 64))
z2h3=np.random.random((3, 64))
z3h4=np.random.random((3, 64))
z4h5=np.random.random((3, 64))
z5h6=np.random.random((3, 64))
z6h7=np.random.random((3, 64))
z7h8=np.random.random((3, 64))
print("Shape of one head: ",z0h1.shape)
print("Dimension of 8 heads:" ,64*8)

Shape of one head:  (3, 64)
Dimension of 8 heads: 512


**Step 10: Concatenation of Output of the Heads**

In [15]:
#@ CONCATENATING THE HEADS OF THE OUTPUTS
output_attention=np.hstack((z0h1,z1h2,z2h3,z3h4,z4h5,z5h6,z6h7,z7h8))
print(output_attention)

[[0.63543689 0.26536128 0.82749327 ... 0.83781243 0.17844576 0.72299349]
 [0.66922363 0.25694862 0.71219946 ... 0.93571269 0.82527417 0.44343282]
 [0.16468276 0.13075808 0.15698728 ... 0.3454452  0.03596274 0.93268568]]
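
The notebook stops at the concatenation. In the original Transformer design, the concatenated heads are then multiplied by a learned output projection matrix before leaving the sublayer. The sketch below illustrates that last step under loud assumptions: the heads and the matrix `W_o` are random placeholders, and all variable names are introduced here for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
num_heads, seq_len, d_head = 8, 3, 64
d_model = num_heads * d_head                     # 8 * 64 = 512

# Random placeholders for the 8 head outputs, each of shape (3, 64)
heads = [rng.random((seq_len, d_head)) for _ in range(num_heads)]

# Concatenate along the feature axis, as in Step 10
concatenated = np.hstack(heads)                  # shape (3, 512)

# Hypothetical output projection; a trained model would learn these weights
W_o = rng.random((d_model, d_model))
sublayer_output = concatenated @ W_o             # shape (3, 512)

print(concatenated.shape, sublayer_output.shape)
```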


In [17]:
#@ IMPLEMENTING THE TRANSFORMER MODEL FROM HUGGING FACE
from transformers import pipeline
translator = pipeline("translation_en_to_fr")
print(translator("Hello, I am Saugat Regmi, Currently learning Machine Learning", max_length=50))

No model was supplied, defaulted to t5-base and revision 686f1db (https://huggingface.co/t5-base).
Using a pipeline without specifying a model name and revision in production is not recommended.


[{'translation_text': "Bonjour, je suis Saugat Regmi, en cours d'apprentissage Machine Learning"}]


**merci**