# Multi-head attention mechanism 

Instead of using a single attention head, we can use multiple attention heads. In the previous section, we learned how to compute the attention matrix $Z$. Rather than computing a single attention matrix $Z$, we can compute several. But what is the use of computing multiple attention matrices?

Let’s understand this with an example. Consider the phrase 'All is well'. Say we need to compute the self-attention of the word 'well'. After computing the similarity scores, suppose we have:

![title](images/23.png)

As the figure above shows, the self-attention value of the word 'well' is the sum of the value vectors weighted by the scores. Looking closely, the attention value of 'well' is dominated by the other word 'All': we multiply the value vector of 'All' by 0.6 and the value vector of 'well' itself by only 0.4, so $z_{\text{well}}$ draws 60% of its value from the value vector of 'All' and only 40% from the value vector of 'well'.
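The weighted sum described above can be sketched in a few lines of NumPy. The value vectors below are made-up numbers chosen only for illustration, and the weight of 0.0 for 'is' is an assumption implied by the 60%/40% split in the text; only the weights 0.6 and 0.4 come from the example itself.

```python
import numpy as np

# Hypothetical 3-dimensional value vectors for the words in 'All is well'.
# These numbers are invented purely for illustration.
v_all = np.array([1.0, 0.0, 2.0])
v_is = np.array([0.0, 3.0, 1.0])
v_well = np.array([4.0, 1.0, 0.0])

# Attention weights for the word 'well': 0.6 for 'All' and 0.4 for 'well'
# (from the text), and an assumed 0.0 for 'is'.
weights = np.array([0.6, 0.0, 0.4])

V = np.stack([v_all, v_is, v_well])  # shape (3, 3): one row per word
z_well = weights @ V                 # weighted sum of the value vectors

print(z_well)
```

Because the weight on 'All' is larger, `z_well` lies closer to `v_all` than it would if the word attended mostly to itself.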

This kind of dominance is useful only when the meaning of the word itself is ambiguous. For example, consider the sentence:

'A dog ate the food because it was hungry'



Say we are computing the self-attention for the word 'it'. After computing the similarity scores, suppose we have:

![title](images/24.png)

As the figure above shows, the attention value of the word 'it' is just the value vector of the word 'dog'; that is, the attention value of 'it' is dominated by 'dog'. This is fine here, because the word 'it' is ambiguous on its own: it could refer to either the dog or the food.

Thus, when the value vectors of other words dominate the word itself, the dominance is useful if the word is ambiguous, as in the example above; otherwise, it makes it harder to capture the right meaning of the word. To make our results more reliable, instead of computing a single attention matrix, we compute multiple attention matrices and then concatenate their results. The idea behind multi-head attention is that each head has its own weight matrices and can therefore weight the words differently, so several heads together can give a more accurate attention matrix than a single head. Let’s explore this in detail.

Suppose we are computing two attention matrices, $Z_1$ and $Z_2$. First, let's compute the attention matrix $Z_1$.

We learned that to compute the attention matrix, we create three new matrices: the query, key, and value matrices. To create the query $Q_1$, key $K_1$, and value $V_1$ matrices, we introduce three new weight matrices, $W^Q_1$, $W^K_1$, and $W^V_1$. We create the query, key, and value matrices by multiplying the input matrix $X$ by $W^Q_1$, $W^K_1$, and $W^V_1$, respectively.

Now the attention matrix $Z_1$ can be computed as:

$$Z_1 = \text{softmax} \bigg ( \frac{Q_1 K_1^T}{\sqrt{d_k}}\bigg) V_1 $$
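As a minimal sketch of the computation above, the following NumPy code builds $Q_1$, $K_1$, and $V_1$ from an input matrix and evaluates the formula for $Z_1$. The function names `softmax` and `attention_head`, the dimensions, and the random weights are all assumptions made for illustration; in a real model the weight matrices are learned during training.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row-wise maximum for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention_head(X, W_q, W_k, W_v):
    # Project the input into query, key, and value matrices.
    Q = X @ W_q
    K = X @ W_k
    V = X @ W_v
    d_k = K.shape[-1]
    # Scaled dot-product scores, one row per query word.
    scores = Q @ K.T / np.sqrt(d_k)
    # Each row of the softmax output sums to 1.
    weights = softmax(scores, axis=-1)
    return weights @ V

# Illustrative sizes: 3 words, input dimension 8, head dimension 4.
rng = np.random.default_rng(0)
n_words, d_model, d_k = 3, 8, 4

X = rng.normal(size=(n_words, d_model))   # input matrix
W_q1 = rng.normal(size=(d_model, d_k))    # stands in for W^Q_1
W_k1 = rng.normal(size=(d_model, d_k))    # stands in for W^K_1
W_v1 = rng.normal(size=(d_model, d_k))    # stands in for W^V_1

Z_1 = attention_head(X, W_q1, W_k1, W_v1)
print(Z_1.shape)
```

Each row of `Z_1` is a weighted average of the rows of the value matrix, which is exactly the weighted sum of value vectors discussed in the earlier examples.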


Now, let’s compute the second attention matrix $Z_2$:

As before, to compute the attention matrix, we create the query, key, and value matrices. To create the query $Q_2$, key $K_2$, and value $V_2$ matrices, we introduce three new weight matrices, $W^Q_2$, $W^K_2$, and $W^V_2$. We create the query, key, and value matrices by multiplying the input matrix $X$ by $W^Q_2$, $W^K_2$, and $W^V_2$, respectively.

Now the attention matrix $Z_2$ can be computed as:

$$Z_2 = \text{softmax} \bigg ( \frac{Q_2 K_2^T}{\sqrt{d_k}}\bigg) V_2 $$


Similarly, we can compute $h$ attention matrices. Suppose we have 8 attention matrices, $Z_1$ to $Z_8$. We can then concatenate all the attention heads (attention matrices) and multiply the result by a new weight matrix $W_0$ to create the final attention matrix, as shown below:

$$\text{Multi-head attention} = \text{Concatenate} (Z_1, Z_2, \dots, Z_i, \dots, Z_8) \, W_0 $$
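The concatenate-and-project step can be sketched as follows, assuming 8 heads as in the example above. The helper names (`softmax`, `attention_head`, `multi_head_attention`), the dimensions, and the random weights are hypothetical choices for illustration only; in practice every weight matrix, including $W_0$, is learned.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row-wise maximum for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention_head(X, W_q, W_k, W_v):
    # One attention matrix Z_i = softmax(Q_i K_i^T / sqrt(d_k)) V_i.
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    d_k = K.shape[-1]
    return softmax(Q @ K.T / np.sqrt(d_k), axis=-1) @ V

def multi_head_attention(X, heads, W_0):
    # heads is a list of (W_q, W_k, W_v) tuples, one tuple per head.
    Z = [attention_head(X, W_q, W_k, W_v) for W_q, W_k, W_v in heads]
    # Concatenate along the feature axis: shape (n_words, h * d_k).
    concat = np.concatenate(Z, axis=-1)
    # Project the concatenated heads with the output weight matrix.
    return concat @ W_0

# Illustrative sizes: 3 words, input dimension 16, 8 heads of dimension 2.
rng = np.random.default_rng(0)
n_words, d_model, h = 3, 16, 8
d_k = d_model // h

X = rng.normal(size=(n_words, d_model))
heads = [
    tuple(rng.normal(size=(d_model, d_k)) for _ in range(3))
    for _ in range(h)
]
W_0 = rng.normal(size=(h * d_k, d_model))

output = multi_head_attention(X, heads, W_0)
print(output.shape)
```

Choosing the head dimension as `d_model // h` is one common design choice: the concatenated matrix then has the same width as the input, so the result has one row per word with the same dimension as the rows of $X$.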

Now that we have learned how the multi-head attention mechanism works, we will learn another interesting concept called positional encoding in the next section.

















