<a href="https://colab.research.google.com/github/rsidorchuk93/transformers/blob/main/Transformers_attention_encoding_theory_and_implementation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Algorithm implementing attention mechanism in transformers

Theory:


1.   **What types of optimizers for a cost function exist? Gradient descent is a simple one; what others are there?** Adam is an adaptive optimization algorithm that combines ideas from momentum and RMSprop. It adapts the learning rate for each parameter using running estimates of the first and second moments of the gradients, which often lets it converge faster and with less tuning than plain gradient descent. Other common choices include SGD with momentum, RMSprop, Adagrad, and AdamW.
2.   **Difference between an autoregressive model and a bidirectional encoder.** An autoregressive model predicts each element of a sequence from the elements before it; the technique comes from time series forecasting, and decoder-style language models such as GPT use it to predict the next token. A bidirectional encoder such as BERT is a neural network architecture in which every position can attend to both the preceding and the following context, which suits understanding tasks but not left-to-right generation.
3.   **Objective of the encoder and decoder.** In an autoencoder, the encoder transforms the input into a lower-dimensional representation and the decoder reconstructs the original input from it, so the network learns a compressed representation that keeps the most important information while minimizing reconstruction error. In a transformer, the encoder instead maps the input sequence to contextual representations of the same length, and the decoder generates the output sequence one token at a time while attending to the encoder output.
4.   **What loss function do transformers use?** Transformers are typically trained with the cross-entropy loss, which measures the difference between the predicted probability distribution (the output of the model) and the true probability distribution (the ground truth).
The softmax function is typically used together with the cross-entropy loss to convert the output of the model, which is a vector of scores, into a probability distribution over the vocabulary. Softmax normalizes the scores so that they are positive and sum to 1, allowing them to be interpreted as probabilities.
5.   **Positional encoding: how it is calculated.** Positional encoding adds information about each token's position in the sequence, since self-attention on its own ignores token order. The original transformer uses fixed sine and cosine functions of different frequencies, which are added to the embedding vector of each token: for position pos and dimension pair i, dimension 2i is sin(pos / 10000^(2i/d_model)) and dimension 2i+1 is cos(pos / 10000^(2i/d_model)). The frequency depends only on the dimension index, while the position is the argument of the function. All components have the same amplitude of 1; the high-frequency (low-index) dimensions capture fine-grained position differences and the low-frequency (high-index) dimensions capture coarse-grained ones.
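
The sinusoidal positional encoding from item 5 can be sketched in NumPy as follows. This is a minimal illustration rather than part of the notebook's pipeline: the function name `positional_encoding`, the assumption that `d_model` is even, and the example sizes (16 positions, 30 dimensions) are chosen here for illustration only.

```python
import numpy as np

def positional_encoding(seq_len, d_model):
    """Sinusoidal positional encoding; d_model is assumed to be even."""
    positions = np.arange(seq_len)[:, np.newaxis]        # (seq_len, 1)
    pair_index = np.arange(d_model // 2)[np.newaxis, :]  # (1, d_model / 2)

    # the frequency depends only on the dimension pair, not on the position
    angles = positions / np.power(10000.0, 2 * pair_index / d_model)

    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)  # even dimensions
    pe[:, 1::2] = np.cos(angles)  # odd dimensions
    return pe

# illustrative sizes (assumptions, not values fixed by the notebook)
pe = positional_encoding(seq_len=16, d_model=30)
print(pe.shape)  # (16, 30)
```

The resulting matrix would be added element-wise to a token embedding matrix of the same shape, one row per token.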

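As a small numeric illustration of item 4 in the theory list, the sketch below turns a vector of scores into probabilities with softmax and computes the cross-entropy loss against a single correct token. The scores and the target index are made-up values used only to show the calculation.

```python
import numpy as np
from scipy.special import softmax

# hypothetical model scores (logits) for one position over a 4-token vocabulary
logits = np.array([2.0, 1.0, 0.1, -1.0])
target = 0  # index of the correct token (an assumed label)

# softmax turns the scores into a probability distribution
probs = softmax(logits)

# cross-entropy with a one-hot target reduces to -log(p[target])
loss = -np.log(probs[target])

print(probs)
print(loss)
```

The loss shrinks toward 0 as the probability assigned to the correct token approaches 1.
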
In [None]:
# import libraries (NumPy and SciPy only, since PyTorch can't be used)
import numpy as np
from scipy.special import softmax

In [None]:
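# 1. Tokenize input text
def tokenize(text):
  # character-level tokens; list() keeps the order and repeated characters,
  # whereas set() would drop duplicates and scramble the order
  tokens = list(text)
  return tokens

tokens = tokenize('My name is Roman')
print(tokens)


# an additional step is to add positional encodings to the embeddings

['M', 'y', ' ', 'n', 'a', 'm', 'e', ' ', 'i', 's', ' ', 'R', 'o', 'm', 'a', 'n']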


In [None]:
# 2. Calculate embeddings
# define embedding dimension
emb_dim = 30

def embeddings(tokens, emb_dim):
  # one random vector per distinct token, so repeated tokens share an embedding
  # (random values stand in for learned weights)
  vocab = sorted(set(tokens))
  table = {token: np.random.randn(emb_dim) for token in vocab}

  # look up each token in order and keep one row per token;
  # averaging the rows into a single vector would leave nothing to attend over
  return np.stack([table[token] for token in tokens])

embed = embeddings(tokens, emb_dim)
embed.shape == (len(tokens), emb_dim)

True

In [None]:
# 3. Implement self-attention from embeddings

def self_attention(embed):
  # pairwise dot-product scores, scaled by sqrt(emb_dim): shape (seq_len, seq_len)
  scores = np.dot(embed, embed.T) / np.sqrt(embed.shape[-1])

  # apply softmax over each row, so every token's attention weights sum to 1
  attn_weights = softmax(scores, axis = -1)

  # weighted sum of the embeddings; the weights are already normalized,
  # so no further division is needed
  return np.dot(attn_weights, embed)


self_attention(embed).shape == embed.shape

True
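
The cell above uses the embeddings directly as queries, keys, and values. In a transformer these are usually three separate learned linear projections of the input. The sketch below shows that variant for a single head, under stated assumptions: the function name `qkv_self_attention`, the sizes, and the random projection matrices (standing in for learned weights) are illustrative and not taken from the notebook.

```python
import numpy as np
from scipy.special import softmax

def qkv_self_attention(x, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention; x has shape (seq_len, d_model)."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / np.sqrt(k.shape[-1])  # (seq_len, seq_len)
    weights = softmax(scores, axis=-1)       # each row sums to 1
    return weights @ v, weights

rng = np.random.default_rng(0)
seq_len, d_model, d_head = 5, 30, 8          # illustrative sizes
x = rng.standard_normal((seq_len, d_model))

# random matrices stand in for learned projection weights
w_q, w_k, w_v = (rng.standard_normal((d_model, d_head)) for _ in range(3))

out, weights = qkv_self_attention(x, w_q, w_k, w_v)
print(out.shape)      # (5, 8)
print(weights.shape)  # (5, 5)
```

Row `i` of `weights` shows how strongly token `i` attends to every token in the sequence, including itself.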

The self-attention layer above is followed by a feed-forward network, and each of the two is followed by layer normalization.

The encoder takes as inputs:

1.   the number of self-attention heads

2.   the dropout probability for the feed-forward network

3.   the feed-forward dimension, the number of layers, and the hidden dimension
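
One way these pieces might fit together is sketched below in NumPy. It is a rough illustration under several assumptions that the notes above do not spell out: residual connections around each sub-layer, a ReLU feed-forward network, layer normalization without learned scale and shift, dropout applied to each sub-layer output, and random matrices in place of trained weights. All function and parameter names are hypothetical.

```python
import numpy as np
from scipy.special import softmax

def layer_norm(x, eps=1e-5):
    # normalize each token vector to zero mean and unit variance (no learned gain/bias)
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True)
    return (x - mean) / (std + eps)

def dropout(x, p, rng, training):
    # inverted dropout; a no-op outside training
    if not training or p == 0.0:
        return x
    mask = rng.random(x.shape) >= p
    return x * mask / (1.0 - p)

def multi_head_attention(x, params, num_heads):
    seq_len, d_model = x.shape
    d_head = d_model // num_heads  # assumes d_model is divisible by num_heads

    def split(t):
        # (seq_len, d_model) -> (num_heads, seq_len, d_head)
        return t.reshape(seq_len, num_heads, d_head).transpose(1, 0, 2)

    q, k, v = (split(x @ params[name]) for name in ("w_q", "w_k", "w_v"))
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_head)  # (heads, seq, seq)
    heads = softmax(scores, axis=-1) @ v                 # (heads, seq, d_head)
    concat = heads.transpose(1, 0, 2).reshape(seq_len, d_model)
    return concat @ params["w_o"]

def encoder_layer(x, params, num_heads, dropout_p, rng, training=False):
    # self-attention sub-layer, then residual connection and layer normalization
    attn = multi_head_attention(x, params, num_heads)
    x = layer_norm(x + dropout(attn, dropout_p, rng, training))
    # feed-forward sub-layer (ReLU), then residual connection and layer normalization
    hidden = np.maximum(0.0, x @ params["w_1"] + params["b_1"])
    ff = hidden @ params["w_2"] + params["b_2"]
    return layer_norm(x + dropout(ff, dropout_p, rng, training))

def init_layer(d_model, ff_dim, rng):
    # random weights stand in for trained parameters
    scale = 1.0 / np.sqrt(d_model)
    params = {name: rng.standard_normal((d_model, d_model)) * scale
              for name in ("w_q", "w_k", "w_v", "w_o")}
    params["w_1"] = rng.standard_normal((d_model, ff_dim)) * scale
    params["b_1"] = np.zeros(ff_dim)
    params["w_2"] = rng.standard_normal((ff_dim, d_model)) / np.sqrt(ff_dim)
    params["b_2"] = np.zeros(d_model)
    return params

def encoder(x, layers, num_heads, dropout_p, rng, training=False):
    for params in layers:
        x = encoder_layer(x, params, num_heads, dropout_p, rng, training)
    return x

rng = np.random.default_rng(0)
d_model, ff_dim, num_heads, num_layers = 30, 64, 3, 2  # illustrative sizes
layers = [init_layer(d_model, ff_dim, rng) for _ in range(num_layers)]
x = rng.standard_normal((16, d_model))                 # 16 token embeddings

out = encoder(x, layers, num_heads, dropout_p=0.1, rng=rng)
print(out.shape)  # (16, 30)
```

The output keeps the input shape, one vector per token, which is what allows identical encoder layers to be stacked.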