**MultiHead Attention Sublayer**

- The input to the multi-head attention sublayer of the first layer of the encoder stack is, 
for each word, a vector that contains its embedding and its positional encoding. The next 
layers of the stack do not repeat these operations; they take the previous layer's output as input.
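
As a rough illustration of how an embedding and a positional encoding might be combined, here is a minimal sketch. It assumes the sinusoidal encoding from the original Transformer paper and simple addition. The function name `sinusoidal_positional_encoding`, the random `embeddings`, and the seed are assumptions for illustration and are not part of this notebook.

```python
import numpy as np

def sinusoidal_positional_encoding(seq_len, d_model):
    """Return a (seq_len, d_model) matrix of sinusoidal positional encodings."""
    positions = np.arange(seq_len)[:, None]            # shape (seq_len, 1)
    dims = np.arange(d_model)[None, :]                 # shape (1, d_model)
    # Each pair of dimensions (2i, 2i+1) shares the same frequency
    angle_rates = 1.0 / np.power(10000.0, (2 * (dims // 2)) / d_model)
    angles = positions * angle_rates
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles[:, 0::2])              # even dimensions use sine
    pe[:, 1::2] = np.cos(angles[:, 1::2])              # odd dimensions use cosine
    return pe

rng = np.random.default_rng(0)                         # arbitrary seed (assumption)
embeddings = rng.random((3, 4))                        # hypothetical embeddings: 3 words, d_model=4
layer1_input = embeddings + sinusoidal_positional_encoding(3, 4)
print(layer1_input.shape)                              # (3, 4)
```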

In [1]:
#@ LOADING THE REQUIRED LIBRARIES AND DEPENDENCIES
import numpy as np
from scipy.special import softmax

**Step 1: Represent the Input**

In [2]:
#@ REPRESENT THE INPUT
print("Step 1: Input: 3 inputs, d_model=4")

x = np.array([[1.0, 0.0, 1.0, 0.0],              # Input 1
              [0.0, 2.0, 0.0, 2.0],              # Input 2
              [1.0, 1.0, 1.0, 1.0]])             # Input 3

print(x)

Step 1: Input: 3 inputs, d_model=4
[[1. 0. 1. 0.]
 [0. 2. 0. 2.]
 [1. 1. 1. 1.]]


**Step 2: Initializing the weight matrices**

In [3]:
#@ INITIALIZING THE WEIGHT MATRICES
print("Step 2: Weight 3 Dimensions x d_model=4")
print("W_query")
w_query = np.array([[1, 0, 1],
                   [1, 0, 0],
                   [0, 0, 1],
                   [0, 1, 1]])

print(w_query)

Step 2: Weight 3 Dimensions x d_model=4
W_query
[[1 0 1]
 [1 0 0]
 [0 0 1]
 [0 1 1]]


In [4]:
print("W_key")
w_key = np.array([[0, 0, 1],
                  [1, 1, 0],
                  [0, 1, 0],
                  [1, 1, 0]])
print(w_key)

W_key
[[0 0 1]
 [1 1 0]
 [0 1 0]
 [1 1 0]]


In [5]:
print("W_value")
w_value = np.array([[0, 2, 0],
                    [0, 3, 0],
                    [1, 0, 3],
                    [1, 1, 0]])

print(w_value)

W_value
[[0 2 0]
 [0 3 0]
 [1 0 3]
 [1 1 0]]


**Step 3: Matrix Multiplication to obtain Q, K, and V**

In [6]:
print("Queries: x * w_query")
Q = np.matmul(x, w_query)
print(Q)

Queries: x * w_query
[[1. 0. 2.]
 [2. 2. 2.]
 [2. 1. 3.]]


In [7]:
print("Keys: x * w_key")
K = np.matmul(x, w_key)
print(K)

Keys: x * w_key
[[0. 1. 1.]
 [4. 4. 0.]
 [2. 3. 1.]]


In [8]:
print("Values: x * w_value")
V = np.matmul(x, w_value)
print(V)

Values: x * w_value
[[1. 2. 3.]
 [2. 8. 0.]
 [2. 6. 3.]]


**Step 4: Scaled Attention Scores**

In [9]:
#@ SCALED ATTENTION SCORES
k_d = 1                      # Stands in for sqrt(d_k); sqrt(3) is about 1.73, simplified to 1 here
attention_scores = (Q @ K.transpose()) / k_d
print(attention_scores)

[[ 2.  4.  4.]
 [ 4. 16. 12.]
 [ 4. 12. 10.]]
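
The cell above uses a scaling factor of 1. As a point of comparison, here is a minimal sketch that divides by the exact square root of the key dimension. It assumes `Q` and `K` are the matrices printed in Step 3; the name `scaled_scores` is introduced here for illustration.

```python
import numpy as np

# Q and K copied from the outputs of Step 3
Q = np.array([[1., 0., 2.],
              [2., 2., 2.],
              [2., 1., 3.]])
K = np.array([[0., 1., 1.],
              [4., 4., 0.],
              [2., 3., 1.]])

d_k = K.shape[-1]                         # key dimension (3 in this toy example)
scaled_scores = Q @ K.T / np.sqrt(d_k)    # exact scaling instead of a factor of 1
print(scaled_scores)
```

The scaling only shrinks every score by the same factor, so the ranking of scores within each row is unchanged, but the softmax in the next step becomes less peaked.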


**Step 5: Scaled Softmax Attention Scores for Each Vector**

In [10]:
#@ SCALED SOFTMAX ATTENTION SCORES FOR EACH VECTOR
attention_scores[0]=softmax(attention_scores[0])
attention_scores[1]=softmax(attention_scores[1])
attention_scores[2]=softmax(attention_scores[2])
print(attention_scores[0])
print(attention_scores[1])
print(attention_scores[2])

[0.06337894 0.46831053 0.46831053]
[6.03366485e-06 9.82007865e-01 1.79861014e-02]
[2.95387223e-04 8.80536902e-01 1.19167711e-01]
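
The three row-wise calls above can also be written as one call. A minimal sketch, assuming the unscaled scores from Step 4 and using the `axis` argument of `scipy.special.softmax`; the names `scores` and `weights` are introduced here for illustration.

```python
import numpy as np
from scipy.special import softmax

# Attention scores copied from the output of Step 4
scores = np.array([[2., 4., 4.],
                   [4., 16., 12.],
                   [4., 12., 10.]])

weights = softmax(scores, axis=1)   # normalize each row independently
print(weights)
print(weights.sum(axis=1))          # each row should sum to 1
```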


**Step 6: The Final Attention Representation**

In [11]:
#@ ATTENTION REPRESENTATIONS
print("Attention 1")
attention1=attention_scores[0][0]*V[0]     # softmax weight of input 1 (for query 1) times value 1
print(attention1)

print("\nAttention 2")
attention2=attention_scores[0][1]*V[1]
print(attention2)

print("\nAttention 3")
attention3=attention_scores[0][2]*V[2]
print(attention3)

Attention 1
[0.06337894 0.12675788 0.19013681]

Attention 2
[0.93662106 3.74648425 0.        ]

Attention 3
[0.93662106 2.80986319 1.40493159]


**Step 7: Summing up the Results**

In [12]:
#@ SUM THE RESULTS TO CREATE OUTPUT MATRIX
attention_input1 = attention1+attention2+attention3
print(attention_input1)

[1.93662106 6.68310531 1.59506841]
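
Steps 6 and 7 together amount to a weighted sum of the rows of `V`, which can be written as a single vector-matrix product. A minimal sketch, assuming `V` from Step 3 and the unscaled scores of the first input from Step 4:

```python
import numpy as np
from scipy.special import softmax

# Values copied from the output of Step 3
V = np.array([[1., 2., 3.],
              [2., 8., 0.],
              [2., 6., 3.]])

# Softmax weights of the first input, recomputed from its scores [2, 4, 4]
weights_input1 = softmax(np.array([2., 4., 4.]))

# Weighted sum of the value vectors in one product
attention_input1 = weights_input1 @ V
print(attention_input1)
```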


**Step 8: Repeat the Same Steps for Inputs 1 to 3**

In [13]:
# Random values stand in for the output of one trained head (3 inputs, 64 dimensions)
attention_head1=np.random.random((3, 64))
print(attention_head1)

[[7.02968061e-01 4.78933152e-01 1.55845508e-01 9.88262330e-01
  9.76938629e-01 9.78654027e-01 3.77457489e-01 5.06229302e-01
  7.51550229e-01 9.59441860e-01 5.70817809e-02 1.31432354e-01
  4.02944707e-01 4.06536263e-01 8.06707901e-01 4.55212756e-01
  3.80748823e-01 5.77114545e-01 9.74070787e-01 7.08102374e-01
  8.83868278e-01 9.02558661e-01 2.77232430e-02 7.30864194e-01
  6.57596439e-01 7.69553162e-01 7.95469750e-01 6.14468045e-01
  3.20692290e-01 1.42336004e-01 2.06866035e-02 4.26051180e-02
  3.18676726e-01 1.55514270e-01 9.05085397e-01 1.74706790e-02
  4.87706995e-02 9.58028477e-01 7.92824838e-01 3.00061095e-01
  5.08009776e-02 1.49978213e-01 5.51797925e-01 1.13682038e-01
  7.75111448e-02 1.59331427e-01 9.06658179e-01 2.81491899e-01
  6.09717204e-01 7.63891560e-01 2.43232169e-01 5.17703150e-01
  6.81168058e-01 5.52535938e-01 3.96887037e-01 2.93843453e-02
  4.57757113e-01 9.16622655e-01 2.66865969e-02 2.60963474e-01
  3.88968526e-01 7.36779249e-01 9.05427539e-02 9.92435488e-01]
 [8.933
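
The cell above uses random numbers as a placeholder. For the small 3-dimensional example of Steps 1 to 7, the same steps can be applied to all three inputs at once. A minimal sketch, in which `toy_attention` is a hypothetical helper name and `scale=1.0` mirrors the simplified scaling factor used in Step 4:

```python
import numpy as np
from scipy.special import softmax

x = np.array([[1.0, 0.0, 1.0, 0.0],
              [0.0, 2.0, 0.0, 2.0],
              [1.0, 1.0, 1.0, 1.0]])
w_query = np.array([[1, 0, 1], [1, 0, 0], [0, 0, 1], [0, 1, 1]])
w_key = np.array([[0, 0, 1], [1, 1, 0], [0, 1, 0], [1, 1, 0]])
w_value = np.array([[0, 2, 0], [0, 3, 0], [1, 0, 3], [1, 1, 0]])

def toy_attention(x, w_q, w_k, w_v, scale=1.0):
    """Steps 3 to 7 for every input row at once."""
    Q, K, V = x @ w_q, x @ w_k, x @ w_v          # Step 3: project the inputs
    weights = softmax(Q @ K.T / scale, axis=1)   # Steps 4 and 5: scores, then row-wise softmax
    return weights @ V                           # Steps 6 and 7: weighted sum of values

attention_all = toy_attention(x, w_query, w_key, w_value)
print(attention_all)   # row 0 corresponds to the result of Step 7
```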

**Step 9: The Output of Heads of Attention Sublayer**

In [14]:
# Random values stand in for the outputs of 8 trained heads, each of shape (3, 64)
z0h1=np.random.random((3, 64))
z1h2=np.random.random((3, 64))
z2h3=np.random.random((3, 64))
z3h4=np.random.random((3, 64))
z4h5=np.random.random((3, 64))
z5h6=np.random.random((3, 64))
z6h7=np.random.random((3, 64))
z7h8=np.random.random((3, 64))
print("Shape of one head: ",z0h1.shape)
print("Dimension of 8 heads:" ,64*8)

Shape of one head:  (3, 64)
Dimension of 8 heads: 512


**Step 10: Concatenation of Output of the Heads**

In [15]:
#@ CONCATENATING THE HEADS OF THE OUTPUTS
output_attention=np.hstack((z0h1,z1h2,z2h3,z3h4,z4h5,z5h6,z6h7,z7h8))
print(output_attention)

[[0.98440752 0.14025898 0.55047621 ... 0.3296116  0.31675955 0.78442565]
 [0.73133939 0.64405049 0.46110023 ... 0.56983026 0.56319405 0.46128129]
 [0.44253367 0.41572166 0.62073751 ... 0.76825721 0.88000215 0.23723273]]
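
To confirm what the concatenation does to the shapes, here is a minimal sketch that keeps the heads in a list. The list `heads`, the seeded generator, and the use of `np.concatenate` in place of `np.hstack` are choices made for illustration; the values are again random stand-ins, not trained outputs.

```python
import numpy as np

rng = np.random.default_rng(0)                    # arbitrary seed (assumption)
heads = [rng.random((3, 64)) for _ in range(8)]   # 8 stand-in heads of shape (3, 64)

# Join the heads side by side along the feature axis
output_attention = np.concatenate(heads, axis=1)
print(output_attention.shape)                     # (3, 512)
```

Each input row keeps its position, and its 8 head outputs of 64 dimensions sit next to each other, giving 8 x 64 = 512 features per input.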


In [17]:
#@ IMPLEMENTING THE TRANSFORMER MODEL FROM HUGGING FACE
from transformers import pipeline
translator = pipeline("translation_en_to_fr")
print(translator("Hello, I am Saugat Regmi. Currently learning Machine Learning", max_length=50))

No model was supplied, defaulted to t5-base and revision 686f1db (https://huggingface.co/t5-base).
Using a pipeline without specifying a model name and revision in production is not recommended.


[{'translation_text': 'Bonjour, je suis Saugat Regmi.'}]


**merci**