## Multi-Head Attention

### Why scaling is important before softmax

In [3]:
import torch
import pandas as pd
import numpy as np
import random

In [2]:
# define the tensor
tensor = torch.tensor([0.5, -0.3, 0.7, -0.4, -0.3])

# apply softmax to the original values
softmax_result = torch.softmax(tensor, dim=-1)
print("Softmax without scaling:", softmax_result)

# multiply the values by 8 to mimic large, unscaled dot products, then apply softmax
scaled_tensor = tensor * 8
softmax_scaled_result = torch.softmax(scaled_tensor, dim=-1)
print("Softmax with scaling:", softmax_scaled_result)

Softmax without scaling: tensor([0.2836, 0.1274, 0.3463, 0.1153, 0.1274])
Softmax with scaling: tensor([1.6787e-01, 2.7892e-04, 8.3145e-01, 1.2533e-04, 2.7892e-04])


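* This is not good: once the values are multiplied by 8, almost all of the weight lands on one or two entries and the rest are close to zero. In the attention mechanism, such a sharply peaked distribution means most tokens are effectively ignored.
* **Thus, we need to scale the scores down before applying softmax, so that the weights (which always add up to 1) are spread more evenly.**
* **Scaling** is done by dividing the dot product of the query matrix and the transposed key matrix by `sqrt(d_k)`, the square root of the key dimension. The result is the scaled **attention scores**.
* Softmax then converts the **attention scores** into **attention weights**.
* The goal is to keep the variance of the dot product stable as the embedding dimension grows.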
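
The points above can be sketched in a few lines of NumPy. This is a minimal illustration rather than a full attention layer: the helper names `softmax` and `attention_weights`, the random inputs, the seed, and the sizes `num_tokens = 6` and `d_k = 64` are all assumptions made for the example.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row-wise max for numerical stability; the result is unchanged.
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)

def attention_weights(queries, keys, scale=True):
    # Attention scores: dot product of every query with every key.
    scores = queries @ keys.T
    if scale:
        # Divide by sqrt(d_k), the square root of the key dimension.
        scores = scores / np.sqrt(keys.shape[-1])
    # Softmax turns each row of scores into weights that sum to 1.
    return softmax(scores, axis=-1)

rng = np.random.default_rng(0)  # fixed seed, chosen arbitrarily
num_tokens, d_k = 6, 64         # hypothetical sizes for illustration
queries = rng.standard_normal((num_tokens, d_k))
keys = rng.standard_normal((num_tokens, d_k))

unscaled = attention_weights(queries, keys, scale=False)
scaled = attention_weights(queries, keys, scale=True)

# Both variants sum to 1 per row; what differs is how peaked each row is.
print("Row sums (unscaled):", unscaled.sum(axis=-1))
print("Row sums (scaled):  ", scaled.sum(axis=-1))
print("Mean largest weight per row (unscaled):", unscaled.max(axis=-1).mean())
print("Mean largest weight per row (scaled):  ", scaled.max(axis=-1).mean())
```

Comparing the largest weight in each row is one simple way to see how peaked the distribution is. Dividing the scores by a constant greater than 1 can only flatten a softmax row, so the scaled weights are never more concentrated than the unscaled ones.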

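The variance claim can be made precise with a short derivation. It rests on an assumption not stated in the text above: the components $q_i$ and $k_i$ are mutually independent with mean 0 and variance 1 (as with standard normal draws), and $d$ denotes the vector dimension.

$$
\begin{aligned}
\mathbb{E}[q \cdot k] &= \sum_{i=1}^{d} \mathbb{E}[q_i]\,\mathbb{E}[k_i] = 0 \\
\operatorname{Var}(q_i k_i) &= \mathbb{E}[q_i^2]\,\mathbb{E}[k_i^2] - \left(\mathbb{E}[q_i]\,\mathbb{E}[k_i]\right)^2 = 1 \\
\operatorname{Var}(q \cdot k) &= \sum_{i=1}^{d} \operatorname{Var}(q_i k_i) = d \\
\operatorname{Var}\!\left(\frac{q \cdot k}{\sqrt{d}}\right) &= \frac{1}{d}\,\operatorname{Var}(q \cdot k) = 1
\end{aligned}
$$

Under that assumption the unscaled dot product has standard deviation $\sqrt{d}$, so its typical magnitude grows with the dimension, while the scaled version keeps unit variance for any $d$.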

In [None]:
# Function to compute the variance of dot products before and after scaling

def compute_variance(dim, trials=1000):
    dot_products = []
    scaled_dot_products = []

    # Draw `trials` pairs of random vectors from a standard normal distribution
    for _ in range(trials):
        q = np.random.randn(dim)
        k = np.random.randn(dim)

        # Unscaled dot product of the two vectors
        dot_product = np.dot(q, k)
        dot_products.append(dot_product)

        # Scale the dot product by dividing by sqrt(dim)
        scaled_dot_product = dot_product / np.sqrt(dim)
        scaled_dot_products.append(scaled_dot_product)

    