# Self Attention

The self-attention mechanism, also known as intra-attention, is a key component of transformer-based models. It lets the model capture relationships between different words within the same sentence or sequence by assigning an attention weight to every pair of words. The variant described here, scaled dot-product attention, is the form used in transformers.

To explain the self-attention mechanism, consider the sample sentence "The cat sat on the mat." We'll walk through a simplified version of self-attention based on dot products.

   **Embedding**: First, each word in the sentence is transformed into a word embedding, which represents the word's meaning in a vector space.
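
As a minimal sketch of this step, an embedding can be pictured as a lookup table with one row per vocabulary word. The names `vocab` and `embedding_table`, the dimension of 4, and the random initialisation are illustrative assumptions; in a trained model the rows are learned.

```python
import numpy as np

# Hypothetical toy vocabulary for the sample sentence (lower-cased, punctuation dropped).
tokens = ["the", "cat", "sat", "on", "the", "mat"]
vocab = {word: idx for idx, word in enumerate(sorted(set(tokens)))}

# Randomly initialised embedding table: one row per vocabulary entry.
# These values are placeholders; a real model learns them during training.
embedding_dim = 4
rng = np.random.default_rng(0)
embedding_table = rng.normal(size=(len(vocab), embedding_dim))

# Look up one vector per token; both occurrences of "the" share the same row.
token_ids = np.array([vocab[word] for word in tokens])
embeddings = embedding_table[token_ids]

print(embeddings.shape)  # (6, 4)
```

The result has one row per token in the sentence, which is the input the next step works on.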

   **Query, Key, and Value**: Each word embedding is used to generate three vectors: a query, a key, and a value. They are obtained by multiplying the embedding by three separate learnable weight matrices, one per vector type. These vectors allow the model to compare and relate different words in the sentence.

   For example, the word "cat" has its own query, key, and value vectors.

   The query vector encodes what the current word is looking for in the rest of the sequence. (When the model processes "cat", the query of "cat" is compared against every word in the sentence.)

   The key vector represents what each word offers for matching. (Every word in the sentence, "The", "cat", "sat", "on", "the", "mat", has a key that queries are compared against.)

   The value vector represents the content associated with each word, that is, the information passed on when that word is attended to. (Like the keys, there is one value vector per word in the sentence.)

   In self-attention, all three vectors are derived from the same sentence. Comparing queries with keys determines how relevant each word is to the current word, while the value vectors hold the content itself. The attention mechanism uses the queries and keys to compute attention weights, then uses those weights to combine the value vectors into a context vector.
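
A minimal sketch of the projection step, under assumed sizes: the names `X`, `W_q`, `W_k`, `W_v` and the dimensions are illustrative, and the weight matrices are random stand-ins for learned parameters.

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, d_model, d_k = 6, 8, 4  # illustrative sizes, not taken from the text

X = rng.normal(size=(seq_len, d_model))  # one embedding row per word

# Three separate projection matrices (random here, learned in a real model).
W_q = rng.normal(size=(d_model, d_k))
W_k = rng.normal(size=(d_model, d_k))
W_v = rng.normal(size=(d_model, d_k))

# Every word receives all three vectors: row i of Q, K, and V belongs to word i.
Q = X @ W_q
K = X @ W_k
V = X @ W_v

print(Q.shape, K.shape, V.shape)  # (6, 4) (6, 4) (6, 4)
```

Note that the projected dimension does not have to equal the embedding dimension; the only requirement for the dot product is that queries and keys have the same length.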

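   **Similarity Calculation**: To compute the attention weights, we calculate the similarity between each query vector and all key vectors. This is done by taking the dot product between the query and key vectors and dividing the result by the square root of the dimension of the key vectors. The division keeps the scores from growing with the vector dimension, which would otherwise push the softmax in the next step toward near one-hot outputs.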
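
To illustrate why the scores are divided by the square root of the key dimension, the sketch below compares the spread of raw and scaled dot products. It assumes query and key entries drawn independently from a standard normal distribution, which is a simplification rather than a property of any real model; `score_std` and `n_tokens` are hypothetical names.

```python
import numpy as np

rng = np.random.default_rng(0)
n_tokens = 500  # illustrative number of query/key vectors

def score_std(d_k):
    """Return the standard deviation of raw and scaled dot-product scores."""
    Q = rng.normal(size=(n_tokens, d_k))
    K = rng.normal(size=(n_tokens, d_k))
    raw = Q @ K.T                 # score[i, j] = dot(query i, key j)
    scaled = raw / np.sqrt(d_k)   # divide by sqrt of the key dimension
    return raw.std(), scaled.std()

for d_k in (4, 64, 256):
    raw_std, scaled_std = score_std(d_k)
    print(f"d_k={d_k:3d}  raw std={raw_std:6.2f}  scaled std={scaled_std:5.2f}")
```

Under this assumption the raw spread grows roughly like the square root of `d_k`, while the scaled spread stays close to 1 regardless of dimension.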

   **Attention Weights**: The similarity scores are passed through the softmax function to obtain attention weights. The softmax ensures that the weights for each query are positive and sum to 1, so they can be read as the relative importance of each word in the sentence to the query word.
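
A minimal sketch of a row-wise softmax. Subtracting the row maximum before exponentiating is a common numerical-stability trick assumed here; it does not change the result, because the softmax is unaffected by adding a constant to every score in a row.

```python
import numpy as np

def softmax(scores, axis=-1):
    """Row-wise softmax with max-subtraction to avoid overflow in exp."""
    shifted = scores - np.max(scores, axis=axis, keepdims=True)
    exp_scores = np.exp(shifted)
    return exp_scores / np.sum(exp_scores, axis=axis, keepdims=True)

scores = np.array([[2.0, 1.0, 0.1],             # ordinary scores
                   [1000.0, 1000.0, 1000.0]])   # would overflow a naive exp
weights = softmax(scores)

print(weights)
print(weights.sum(axis=-1))  # every row sums to 1
```

Larger scores receive larger weights, and equal scores receive equal weights, even when the raw values are too large to exponentiate directly.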

   **Context Vector**: The attention weights are used to weight the value vectors. The value vectors hold the information that the model will focus on for each word. For each word, the weighted sum of the value vectors, using that word's attention weights as coefficients, produces its context vector, which captures the relevant information from the entire sentence.
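
The steps above can be summarised in a single expression. The symbols are notation chosen here rather than taken from the text: $X$ is the matrix of word embeddings (one row per word), $W_Q$, $W_K$, $W_V$ are the three weight matrices, and $d_k$ is the dimension of the key vectors.

$$
Q = X W_Q, \qquad K = X W_K, \qquad V = X W_V
$$

$$
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V
$$

The softmax is applied to each row separately, so row $i$ of the result is the context vector for word $i$: $c_i = \sum_j a_{ij}\, v_j$, where $a_{ij}$ is the attention weight that word $i$ assigns to word $j$.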

In [1]:
import numpy as np
import tensorflow as tf
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.layers import Embedding

In [2]:
def self_attention(sentence, W_query, W_key, W_value):
    # Compute query, key, and value vectors
    queries = np.dot(sentence, W_query)
    keys = np.dot(sentence, W_key)
    values = np.dot(sentence, W_value)
    
    # Compute similarity scores using dot product and scaling
    scores = np.dot(queries, keys.T) / np.sqrt(keys.shape[-1])
    
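    # Compute attention weights using a numerically stable softmax
    # (subtracting each row's maximum leaves the result unchanged but avoids overflow)
    exp_scores = np.exp(scores - np.max(scores, axis=1, keepdims=True))
    attention_weights = exp_scores / np.sum(exp_scores, axis=1, keepdims=True)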
    
    # Compute context vector using attention weights and values
    context_vector = np.dot(attention_weights, values)
    
    return context_vector, attention_weights


# Sample sentence: four tokens, each represented by a 3-dimensional embedding
sentence = np.array([[0.2, 0.3, -0.1],
                     [0.5, -0.2, 0.4],
                     [-0.3, 0.1, 0.6],
                     [0.4, 0.5, -0.3]])

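# Weight matrices (randomly initialised here, so results vary between runs;
# in a real model they are learned during training)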
W_query = np.random.randn(sentence.shape[1], sentence.shape[1])
W_key = np.random.randn(sentence.shape[1], sentence.shape[1])
W_value = np.random.randn(sentence.shape[1], sentence.shape[1])

# Apply self-attention mechanism
context_vector, attention_weights = self_attention(sentence, W_query, W_key, W_value)

# Print results
print("Sentence:")
print(sentence)
print("\nContext Vector:")
print(context_vector)
print("\nAttention Weights:")
print(attention_weights)

Sentence:
[[ 0.2  0.3 -0.1]
 [ 0.5 -0.2  0.4]
 [-0.3  0.1  0.6]
 [ 0.4  0.5 -0.3]]

Context Vector:
[[ 0.06301839 -0.13436515  0.01266342]
 [-0.29911514 -0.23966649  0.09011734]
 [-0.10170689 -0.14678699  0.18305492]
 [ 0.21985998 -0.09596336 -0.04224468]]

Attention Weights:
[[0.22933771 0.17550718 0.4118571  0.18329801]
 [0.25683411 0.27349775 0.09179122 0.37787692]
 [0.20338818 0.41051068 0.13779926 0.24830188]
 [0.1920107  0.10477828 0.57860311 0.1246079 ]]
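
As a quick way to read the attention weights printed above, the sketch below checks that each row sums to 1 and finds which position each token weights most heavily. Because the weight matrices in this example are random rather than trained, the pattern carries no linguistic meaning.

```python
import numpy as np

# Attention weights copied from the output printed above.
attention_weights = np.array([
    [0.22933771, 0.17550718, 0.4118571,  0.18329801],
    [0.25683411, 0.27349775, 0.09179122, 0.37787692],
    [0.20338818, 0.41051068, 0.13779926, 0.24830188],
    [0.1920107,  0.10477828, 0.57860311, 0.1246079],
])

# Row i holds the weights token i assigns to every token, so each row sums to 1.
print(np.allclose(attention_weights.sum(axis=1), 1.0))  # True

# Index of the most heavily weighted position for each token.
print(attention_weights.argmax(axis=1))  # [2 3 1 2]
```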


## Self Attention with Keras

We will use the Keras `Embedding` layer to generate word embeddings.

We first tokenize the sentence and convert it to a sequence of word indices using the tokenizer. Then, we create an embedding layer with the specified input dimension (`vocab_size`) and output dimension (`embedding_dim`).

We apply the embedding layer to the padded sequences, resulting in embedded sequences. These embedded sequences are the word embeddings for the sentence. Because the layer is untrained, the embeddings are randomly initialised rather than meaningful.

Finally, we pass the embedded sequences to the `self_attention` function along with the weight matrices to obtain the context vectors and attention weights. The results are printed for inspection.

In [3]:
def self_attention(sentence, W_query, W_key, W_value):
    # Compute query, key, and value vectors
    queries = tf.matmul(sentence, W_query)
    keys = tf.matmul(sentence, W_key)
    values = tf.matmul(sentence, W_value)
    
    # Compute similarity scores using dot product and scaling
    scores = tf.matmul(queries, keys, transpose_b=True) / tf.sqrt(tf.cast(keys.shape[-1], tf.float32))
    
    # Compute attention weights using softmax
    attention_weights = tf.nn.softmax(scores, axis=-1)
    
    # Compute context vector using attention weights and values
    context_vector = tf.matmul(attention_weights, values)
    
    return context_vector, attention_weights
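
The TensorFlow version above transposes only the last two axes of the keys (`transpose_b=True`) and applies the softmax over the last axis because the output of an embedding layer carries a leading batch axis. As a rough NumPy sketch of the same shape handling (`batched_self_attention` is a hypothetical name, and the inputs are random placeholders):

```python
import numpy as np

def batched_self_attention(x, W_query, W_key, W_value):
    """x: (batch, seq_len, d_model); weight matrices: (d_model, d_k)."""
    queries = x @ W_query  # (batch, seq_len, d_k)
    keys = x @ W_key
    values = x @ W_value

    # Swap only the last two axes of keys so the batch axis stays in front.
    scores = queries @ np.swapaxes(keys, -1, -2) / np.sqrt(keys.shape[-1])

    # Softmax over the last axis: one distribution per query token.
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)

    return weights @ values, weights

rng = np.random.default_rng(0)
x = rng.normal(size=(1, 5, 100))  # one sentence, five tokens, 100-dim embeddings
W_q, W_k, W_v = (rng.normal(size=(100, 100)) for _ in range(3))

context, weights = batched_self_attention(x, W_q, W_k, W_v)
print(context.shape, weights.shape)  # (1, 5, 100) (1, 5, 5)
```

The sizes mirror a batch of one five-word sentence with 100-dimensional embeddings: one context vector per token, and a 5 by 5 matrix of attention weights.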

In [4]:
# Sample English sentence
sentence = "I love natural language processing"

# Prepare tokenizer
tokenizer = Tokenizer()
tokenizer.fit_on_texts([sentence])

# Convert sentence to sequence of word indices
sequences = tokenizer.texts_to_sequences([sentence])
padded_sequences = pad_sequences(sequences, padding="post")

# Create word index and vocabulary size
word_index = tokenizer.word_index
vocab_size = len(word_index) + 1

# Define embedding dimension
embedding_dim = 100

# Create embedding layer
embedding_layer = Embedding(input_dim=vocab_size, output_dim=embedding_dim)

# Apply embedding layer to the padded sequences
embedded_sequences = embedding_layer(padded_sequences)

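# Weight matrices (randomly initialised here; learned during training in a real model).
# Cast to float32 to match the dtype of the embedding layer's output.
W_query = np.random.randn(embedding_dim, embedding_dim).astype("float32")
W_key = np.random.randn(embedding_dim, embedding_dim).astype("float32")
W_value = np.random.randn(embedding_dim, embedding_dim).astype("float32")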

# Apply self-attention mechanism
context_vector, attention_weights = self_attention(embedded_sequences, W_query, W_key, W_value)

# Print results
print("Sentence:")
print(sentence)
print("\nContext Vector:")
print(context_vector)
print("\nAttention Weights:")
print(attention_weights)

Sentence:
I love natural language processing

Context Vector:
tf.Tensor(
[[[-0.21116059  0.1338353   0.13333878  0.05117755  0.30933744
   -0.0305737  -0.06057891  0.08202386  0.0193999  -0.13128015
    0.19713676  0.17998238  0.00066628 -0.03445598  0.11257076
   -0.11821473  0.156113    0.05705294 -0.05195492  0.05657417
    0.07310655  0.06868353 -0.01852534  0.01886342  0.05810512
   -0.16444445  0.0838085  -0.10200395 -0.07993244  0.08214889
   -0.09336944  0.13650562  0.09028789 -0.05318436 -0.00255455
    0.16088843 -0.06157837 -0.01500967 -0.02165653 -0.026959
    0.03200218 -0.06000656  0.11999457 -0.12722476 -0.18554088
   -0.0638378  -0.14260234 -0.11518177  0.22430842  0.12695412
    0.00529285  0.00837352  0.00952021 -0.10221426 -0.07873211
    0.05573184 -0.10702249 -0.10238901  0.02699872 -0.05801383
   -0.1904501   0.09435353  0.31905353 -0.08835943  0.10708644
   -0.08998151 -0.13991088 -0.15970846 -0.09349819 -0.16864042
   -0.2549234