# Attention Mechanisms and Transformer Models

## Introduction

Attention mechanisms have revolutionized the field of deep learning, particularly in natural language processing (NLP). They enable models to focus on specific parts of the input when generating each part of the output, effectively capturing long-range dependencies and improving performance on tasks like machine translation, text summarization, and question answering.

Transformer models, introduced by Vaswani et al. in 2017 [[1]](#ref1), rely entirely on attention mechanisms, dispensing with recurrence and convolutions. This design underlies later models such as BERT, GPT, and T5, and has been extended beyond NLP to areas such as computer vision.

In this tutorial, we'll delve into attention mechanisms, explore the Transformer architecture, implement a Transformer model using TensorFlow and Keras, and discuss the latest developments in this rapidly evolving field.

## Table of Contents

1. [Understanding Attention Mechanisms](#1)
   - [Motivation](#1.1)
   - [Mathematical Formulation](#1.2)
2. [The Transformer Model](#2)
   - [Architecture Overview](#2.1)
   - [Multi-Head Attention](#2.2)
   - [Positional Encoding](#2.3)
3. [Implementing a Transformer Model](#3)
   - [Dataset Preparation](#3.1)
   - [Building the Transformer](#3.2)
   - [Training the Model](#3.3)
   - [Evaluating Performance](#3.4)
4. [Latest Developments in Transformers](#4)
   - [BERT (Bidirectional Encoder Representations from Transformers)](#4.1)
   - [GPT Series](#4.2)
   - [T5 (Text-to-Text Transfer Transformer)](#4.3)
   - [Vision Transformers (ViT)](#4.4)
5. [Conclusion](#5)
6. [References](#6)


<a id="1"></a>
## 1. Understanding Attention Mechanisms

<a id="1.1"></a>
### Motivation

In traditional sequence-to-sequence (seq2seq) models, the encoder compresses the input sequence into a fixed-length context vector, which the decoder then uses to generate the output sequence. This approach can struggle with long input sequences because the fixed-length vector cannot capture all the necessary information.

Attention mechanisms address this limitation by allowing the decoder to selectively focus on different parts of the input sequence at each decoding step.

<a id="1.2"></a>
### Mathematical Formulation

The attention mechanism computes a weighted sum of the encoder's hidden states. Unnormalized alignment scores measure the relevance of each hidden state to the current decoding step, and a softmax turns those scores into attention weights that sum to 1.

Given:

- Encoder hidden states: $\mathbf{H} = [\mathbf{h}_1, \mathbf{h}_2, \dots, \mathbf{h}_n]$
- Decoder hidden state at time $t$: $\mathbf{s}_t$

The attention weights $\alpha_{t,i}$ are computed using an alignment model $a$:

$$
e_{t,i} = a(\mathbf{s}_{t-1}, \mathbf{h}_i)
$$

$$
\alpha_{t,i} = \frac{\exp(e_{t,i})}{\sum_{k=1}^{n} \exp(e_{t,k})}
$$

The context vector $\mathbf{c}_t$ is the weighted sum:

$$
\mathbf{c}_t = \sum_{i=1}^{n} \alpha_{t,i} \mathbf{h}_i
$$

The decoder then uses $\mathbf{c}_t$ to generate the output at time $t$.

**Alignment Model Options:**

- **Dot Product:** $e_{t,i} = \mathbf{s}_{t-1}^\top \mathbf{h}_i$
- **Additive (Bahdanau Attention):** $e_{t,i} = \mathbf{v}^\top \tanh(\mathbf{W}_1 \mathbf{s}_{t-1} + \mathbf{W}_2 \mathbf{h}_i)$
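
The formulas above can be traced on a small example. The following is a minimal NumPy sketch, not the implementation used later in this notebook: the sizes, the random inputs, and the helper names (`dot_score`, `additive_score`, `attend`) are assumptions chosen for illustration.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    z = np.exp(x - np.max(x))
    return z / z.sum()

def dot_score(s_prev, H):
    """Dot-product alignment: e_i = s_prev . h_i for each row h_i of H."""
    return H @ s_prev

def additive_score(s_prev, H, W1, W2, v):
    """Additive (Bahdanau) alignment: e_i = v^T tanh(W1 s_prev + W2 h_i)."""
    return np.tanh(s_prev @ W1.T + H @ W2.T) @ v

def attend(scores, H):
    """Turn scores into weights alpha and the context vector c = sum_i alpha_i h_i."""
    alpha = softmax(scores)
    return alpha, alpha @ H

rng = np.random.default_rng(0)
n, d, d_att = 5, 4, 3            # sequence length, hidden size, alignment size (illustrative)
H = rng.normal(size=(n, d))      # encoder hidden states h_1..h_n as rows
s_prev = rng.normal(size=d)      # previous decoder state s_{t-1}
W1 = rng.normal(size=(d_att, d))
W2 = rng.normal(size=(d_att, d))
v = rng.normal(size=d_att)

alpha_dot, c_dot = attend(dot_score(s_prev, H), H)
alpha_add, c_add = attend(additive_score(s_prev, H, W1, W2, v), H)
print(alpha_dot.sum(), alpha_add.sum())   # both sets of weights sum to 1
```

Both scoring functions produce one score per encoder position; only the way the score is computed differs, while the softmax and the weighted sum are shared.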

**Reference:**

- Bahdanau, D., Cho, K., & Bengio, Y. (2014). *Neural Machine Translation by Jointly Learning to Align and Translate*. [arXiv:1409.0473](https://arxiv.org/abs/1409.0473)

<a id="2"></a>
## 2. The Transformer Model

<a id="2.1"></a>
### Architecture Overview

The Transformer architecture removes recurrence entirely and relies solely on attention mechanisms to capture dependencies between input and output.

**Key Components:**

- **Encoder:** Processes the input sequence and generates representations.
- **Decoder:** Generates the output sequence, using the encoder's representations and previous outputs.

**Advantages:**

- **Parallelization:** Without recurrence, computations can be parallelized over sequence positions.
- **Long-Range Dependencies:** Attention allows direct connections between any two positions in the sequence.

<a id="2.2"></a>
### Multi-Head Attention

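Multi-head attention runs several attention operations in parallel, each on its own learned projection of the queries, keys, and values, so that different heads can attend to different positions and different aspects of the input.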

**Scaled Dot-Product Attention:**

Given queries $\mathbf{Q}$, keys $\mathbf{K}$, and values $\mathbf{V}$:

$$
\text{Attention}(\mathbf{Q}, \mathbf{K}, \mathbf{V}) = \text{softmax}\left(\frac{\mathbf{Q}\mathbf{K}^\top}{\sqrt{d_k}}\right)\mathbf{V}
$$

- $d_k$: Dimension of the key vectors (used for scaling).
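
Vaswani et al. motivate the $\sqrt{d_k}$ factor by noting that, for query and key components with zero mean and unit variance, the dot product has variance $d_k$, so unscaled logits grow with the key dimension and push the softmax toward saturation. The sketch below checks this empirically; the sample size and the chosen dimensions are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def logit_variance(d_k, scaled, n_samples=5000):
    """Empirical variance of q.k (optionally divided by sqrt(d_k))
    for q and k with independent standard-normal components."""
    q = rng.normal(size=(n_samples, d_k))
    k = rng.normal(size=(n_samples, d_k))
    logits = np.sum(q * k, axis=1)
    if scaled:
        logits = logits / np.sqrt(d_k)
    return logits.var()

for d_k in (4, 64, 256):
    raw = logit_variance(d_k, scaled=False)
    scaled = logit_variance(d_k, scaled=True)
    print(f"d_k={d_k}: raw variance ~ {raw:.1f}, scaled variance ~ {scaled:.2f}")
```

The raw variance tracks $d_k$, while the scaled variance stays near 1 regardless of dimension.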

**Multi-Head Attention:**

$$
\text{MultiHead}(\mathbf{Q}, \mathbf{K}, \mathbf{V}) = \text{Concat}(\text{head}_1, \dots, \text{head}_h) \mathbf{W}^O
$$

Where each head is:

$$
\text{head}_i = \text{Attention}(\mathbf{Q}\mathbf{W}_i^Q, \mathbf{K}\mathbf{W}_i^K, \mathbf{V}\mathbf{W}_i^V)
$$

- $\mathbf{W}_i^Q, \mathbf{W}_i^K, \mathbf{W}_i^V$: Projection matrices for queries, keys, and values of head $i$.
- $\mathbf{W}^O$: Output projection applied to the concatenated heads.
- $h$: Number of heads.

<a id="2.3"></a>
### Positional Encoding

Since the Transformer has no recurrence or convolution, positional encoding is added to the input embeddings to provide the model with information about the position of each token in the sequence.

**Sinusoidal Positional Encoding:**

$$
\text{PE}_{(pos, 2i)} = \sin\left(\frac{pos}{10000^{2i/d_{\text{model}}}}\right)
$$

$$
\text{PE}_{(pos, 2i+1)} = \cos\left(\frac{pos}{10000^{2i/d_{\text{model}}}}\right)
$$

- $pos$: Position in the sequence.
- $i$: Index of the sine/cosine pair, $0 \le i < d_{\text{model}}/2$; dimensions $2i$ and $2i+1$ share the same frequency.
- $d_{\text{model}}$: Dimension of the model.

**Reference:**

- Vaswani, A., et al. (2017). *Attention Is All You Need*. [arXiv:1706.03762](https://arxiv.org/abs/1706.03762)

In [None]:
# Positional Encoding
import numpy as np
import tensorflow as tf

def get_angles(pos, i, d_model):
    angle_rates = 1 / np.power(10000, (2 * (i//2)) / np.float32(d_model))
    return pos * angle_rates

def positional_encoding(position, d_model):
    angle_rads = get_angles(np.arange(position)[:, np.newaxis],
                            np.arange(d_model)[np.newaxis, :],
                            d_model)

    # Apply sin to even indices in the array; 2i
    angle_rads[:, 0::2] = np.sin(angle_rads[:, 0::2])

    # Apply cos to odd indices in the array; 2i+1
    angle_rads[:, 1::2] = np.cos(angle_rads[:, 1::2])

    pos_encoding = angle_rads[np.newaxis, ...]

    return tf.cast(pos_encoding, dtype=tf.float32)

# Example usage
pos_encoding = positional_encoding(50, 512)
print(pos_encoding.shape)  # (1, 50, 512)
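
Two properties of the sinusoidal encoding are easy to verify numerically. The sketch below is a NumPy-only re-implementation (the name `sinusoidal_pe` and the small sizes are assumptions) that checks that every position vector has the same norm and that the dot product between two encodings depends only on their offset.

```python
import numpy as np

def sinusoidal_pe(num_positions, d_model):
    """Sinusoidal table of shape (num_positions, d_model); d_model is assumed even."""
    pos = np.arange(num_positions)[:, None]             # (num_positions, 1)
    i = np.arange(d_model // 2)[None, :]                # pair index, (1, d_model / 2)
    angles = pos / np.power(10000.0, 2 * i / d_model)   # (num_positions, d_model / 2)
    pe = np.zeros((num_positions, d_model))
    pe[:, 0::2] = np.sin(angles)   # even dimensions 2i
    pe[:, 1::2] = np.cos(angles)   # odd dimensions 2i+1
    return pe

pe = sinusoidal_pe(50, 16)

# Each sine/cosine pair contributes sin^2 + cos^2 = 1, so the squared norm is d_model / 2.
sq_norms = np.sum(pe ** 2, axis=1)

# PE(p) . PE(p + offset) = sum_i cos(offset * w_i), which does not depend on p.
offset = 3
dots = np.array([pe[p] @ pe[p + offset] for p in range(50 - offset)])
print(sq_norms[:3], dots[:3])
```

The second property is one reason the encoding is convenient for attention: relative offsets are visible through dot products alone.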

In [None]:
# Scaled Dot-Product Attention
def scaled_dot_product_attention(q, k, v, mask):
    """Calculate the attention weights."""
    matmul_qk = tf.matmul(q, k, transpose_b=True)

    # Scale matmul_qk
    dk = tf.cast(tf.shape(k)[-1], tf.float32)
    scaled_attention_logits = matmul_qk / tf.math.sqrt(dk)

    # Add the mask to the scaled tensor
    if mask is not None:
        scaled_attention_logits += (mask * -1e9)

    # Softmax on the last axis
    attention_weights = tf.nn.softmax(scaled_attention_logits, axis=-1)

    output = tf.matmul(attention_weights, v)

    return output, attention_weights
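
To see what the additive `mask * -1e9` term does, here is a small NumPy analogue of the function above. It is a sketch under the same convention (a mask value of 1 blocks a key); the shapes and random inputs are illustrative assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    z = np.exp(x - np.max(x, axis=axis, keepdims=True))
    return z / z.sum(axis=axis, keepdims=True)

def attention(q, k, v, mask=None):
    """NumPy analogue of scaled dot-product attention; mask uses 1 = blocked."""
    logits = q @ np.swapaxes(k, -1, -2) / np.sqrt(k.shape[-1])
    if mask is not None:
        logits = logits + mask * -1e9   # blocked logits become hugely negative
    weights = softmax(logits, axis=-1)
    return weights @ v, weights

rng = np.random.default_rng(0)
q = rng.normal(size=(2, 4))   # 2 queries, d_k = 4
k = rng.normal(size=(3, 4))   # 3 keys
v = rng.normal(size=(3, 5))   # 3 values, d_v = 5

mask = np.array([[0.0, 0.0, 1.0]])   # block the third key for every query
out, w = attention(q, k, v, mask)
print(w)   # the last column is effectively zero
```

Because the blocked logit underflows to zero after the softmax, the result is the same as attending over the first two keys only.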

In [None]:
# Multi-Head Attention
from tensorflow.keras.layers import Layer

class MultiHeadAttention(Layer):
    def __init__(self, d_model, num_heads):
        super(MultiHeadAttention, self).__init__()
        assert d_model % num_heads == 0

        self.num_heads = num_heads
        self.d_model = d_model

        self.depth = d_model // num_heads

        self.wq = tf.keras.layers.Dense(d_model)
        self.wk = tf.keras.layers.Dense(d_model)
        self.wv = tf.keras.layers.Dense(d_model)

        self.dense = tf.keras.layers.Dense(d_model)

    def split_heads(self, x, batch_size):
        """Split the last dimension into (num_heads, depth)."""
        x = tf.reshape(x, (batch_size, -1, self.num_heads, self.depth))
        return tf.transpose(x, perm=[0, 2, 1, 3])

    def call(self, v, k, q, mask):
        batch_size = tf.shape(q)[0]

        q = self.wq(q)
        k = self.wk(k)
        v = self.wv(v)

        q = self.split_heads(q, batch_size)
        k = self.split_heads(k, batch_size)
        v = self.split_heads(v, batch_size)

        # Scaled Dot-Product Attention
        scaled_attention, attention_weights = scaled_dot_product_attention(q, k, v, mask)

        # Concatenate heads
        scaled_attention = tf.transpose(scaled_attention, perm=[0, 2, 1, 3])
        concat_attention = tf.reshape(scaled_attention, (batch_size, -1, self.d_model))

        output = self.dense(concat_attention)

        return output, attention_weights
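
The reshape and transpose in `split_heads`, and their inverse in `call`, are the least obvious part of the layer. The NumPy sketch below mirrors those two steps on a tiny tensor; `merge_heads` is a hypothetical helper name for the "concatenate heads" step.

```python
import numpy as np

def split_heads(x, num_heads):
    """(batch, seq_len, d_model) -> (batch, num_heads, seq_len, depth)."""
    batch, seq_len, d_model = x.shape
    depth = d_model // num_heads
    return x.reshape(batch, seq_len, num_heads, depth).transpose(0, 2, 1, 3)

def merge_heads(x):
    """(batch, num_heads, seq_len, depth) -> (batch, seq_len, d_model)."""
    batch, num_heads, seq_len, depth = x.shape
    return x.transpose(0, 2, 1, 3).reshape(batch, seq_len, num_heads * depth)

x = np.arange(2 * 3 * 8, dtype=np.float32).reshape(2, 3, 8)  # batch=2, seq_len=3, d_model=8
heads = split_heads(x, num_heads=4)                           # depth = 8 / 4 = 2
restored = merge_heads(heads)
print(heads.shape, restored.shape)
```

Each head receives a contiguous slice of the feature dimension for every position, and merging restores the original layout exactly.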

<a id="3"></a>
## 3. Implementing a Transformer Model

We'll implement a Transformer model for machine translation using TensorFlow and Keras.

<a id="3.1"></a>
### Dataset Preparation

We'll use a Portuguese-to-English translation dataset built from TED talk transcripts (`ted_hrlr_translate/pt_to_en`).

In [None]:
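import tensorflow_datasets as tfds
import tensorflow_text  # Registers the custom ops the saved tokenizers depend on

# Load the dataset as (Portuguese, English) string pairs
examples, metadata = tfds.load('ted_hrlr_translate/pt_to_en', with_info=True, as_supervised=True)
train_examples, val_examples = examples['train'], examples['validation']

# Tokenization
# Assumes the tokenizer SavedModel has already been downloaded and extracted
# into a directory with this name in the working directory.
model_name = 'ted_hrlr_translate_pt_en_converter'
tokenizers = tf.saved_model.load(model_name)

def tokenize_pairs(pt, en):
    """Tokenize a batch of sentence pairs and zero-pad each side to a dense tensor."""
    pt = tokenizers.pt.tokenize(pt)  # RaggedTensor of token ids
    pt = pt.to_tensor()              # Pad with 0 to the longest sentence in the batch

    en = tokenizers.en.tokenize(en)
    en = en.to_tensor()
    return pt, en

BUFFER_SIZE = 20000
BATCH_SIZE = 64

# Batch the raw strings first: the tokenizers operate on batches of strings,
# and to_tensor() then pads each batch to a common length.
train_dataset = train_examples.cache()
train_dataset = train_dataset.shuffle(BUFFER_SIZE).batch(BATCH_SIZE)
train_dataset = train_dataset.map(tokenize_pairs, num_parallel_calls=tf.data.AUTOTUNE)
train_dataset = train_dataset.prefetch(tf.data.AUTOTUNE)

val_dataset = val_examples.batch(BATCH_SIZE).map(tokenize_pairs)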
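
Batching variable-length token sequences requires padding them to a common length. The plain-Python sketch below shows the idea with made-up token ids; `pad_batch` is a hypothetical helper, and id 0 is assumed to be the padding id, which matches the loss function later in this notebook.

```python
def pad_batch(sequences, pad_id=0):
    """Right-pad lists of token ids to the length of the longest one."""
    max_len = max(len(seq) for seq in sequences)
    return [seq + [pad_id] * (max_len - len(seq)) for seq in sequences]

# Three tokenized sentences of different lengths (illustrative ids)
batch = pad_batch([[2, 15, 7, 3], [2, 9, 3], [2, 3]])
for row in batch:
    print(row)
```

The padded positions carry no information, which is why both the attention masks and the loss need to ignore them.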

<a id="3.2"></a>
### Building the Transformer

Now we'll build the Transformer model by combining the components we've implemented.

In [None]:
# Encoder Layer
class EncoderLayer(tf.keras.layers.Layer):
    def __init__(self, d_model, num_heads, dff, rate=0.1):
        super(EncoderLayer, self).__init__()

        self.mha = MultiHeadAttention(d_model, num_heads)
        self.ffn = tf.keras.Sequential([
            tf.keras.layers.Dense(dff, activation='relu'),  # (batch_size, seq_len, dff)
            tf.keras.layers.Dense(d_model)  # (batch_size, seq_len, d_model)
        ])

        self.layernorm1 = tf.keras.layers.LayerNormalization(epsilon=1e-6)
        self.layernorm2 = tf.keras.layers.LayerNormalization(epsilon=1e-6)

        self.dropout1 = tf.keras.layers.Dropout(rate)
        self.dropout2 = tf.keras.layers.Dropout(rate)

    def call(self, x, training, mask):
        attn_output, _ = self.mha(x, x, x, mask)  # Self-attention
        attn_output = self.dropout1(attn_output, training=training)
        out1 = self.layernorm1(x + attn_output)  # Residual connection

        ffn_output = self.ffn(out1)
        ffn_output = self.dropout2(ffn_output, training=training)
        out2 = self.layernorm2(out1 + ffn_output)  # Residual connection

        return out2
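
Each sub-layer in `EncoderLayer` follows the same pattern: apply the sub-layer, add the input back (residual connection), then normalize. The NumPy sketch below illustrates that pattern for the feed-forward sub-layer. It is a simplification under stated assumptions: the weights are random, dropout is omitted, and the layer normalization has no learned scale or shift, unlike the Keras layer.

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    """Normalize the last axis to zero mean and unit variance (no learned parameters)."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def feed_forward(x, W1, b1, W2, b2):
    """Position-wise FFN: ReLU(x W1 + b1) W2 + b2, applied to each position independently."""
    return np.maximum(0.0, x @ W1 + b1) @ W2 + b2

rng = np.random.default_rng(0)
seq_len, d_model, dff = 3, 8, 32
x = rng.normal(size=(seq_len, d_model))
W1, b1 = rng.normal(size=(d_model, dff)) * 0.1, np.zeros(dff)
W2, b2 = rng.normal(size=(dff, d_model)) * 0.1, np.zeros(d_model)

out = layer_norm(x + feed_forward(x, W1, b1, W2, b2))   # residual, then normalize
print(out.shape)
```

Because the feed-forward network acts on each position separately, only the attention sub-layer mixes information across positions.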

In [None]:
# Decoder Layer
class DecoderLayer(tf.keras.layers.Layer):
    def __init__(self, d_model, num_heads, dff, rate=0.1):
        super(DecoderLayer, self).__init__()

        self.mha1 = MultiHeadAttention(d_model, num_heads)
        self.mha2 = MultiHeadAttention(d_model, num_heads)

        self.ffn = tf.keras.Sequential([
            tf.keras.layers.Dense(dff, activation='relu'),  # (batch_size, seq_len, dff)
            tf.keras.layers.Dense(d_model)  # (batch_size, seq_len, d_model)
        ])

        self.layernorm1 = tf.keras.layers.LayerNormalization(epsilon=1e-6)
        self.layernorm2 = tf.keras.layers.LayerNormalization(epsilon=1e-6)
        self.layernorm3 = tf.keras.layers.LayerNormalization(epsilon=1e-6)

        self.dropout1 = tf.keras.layers.Dropout(rate)
        self.dropout2 = tf.keras.layers.Dropout(rate)
        self.dropout3 = tf.keras.layers.Dropout(rate)

    def call(self, x, enc_output, training, look_ahead_mask, padding_mask):
        attn1, attn_weights_block1 = self.mha1(x, x, x, look_ahead_mask)  # Masked MHA
        attn1 = self.dropout1(attn1, training=training)
        out1 = self.layernorm1(attn1 + x)

        attn2, attn_weights_block2 = self.mha2(enc_output, enc_output, out1, padding_mask)  # MHA with encoder output
        attn2 = self.dropout2(attn2, training=training)
        out2 = self.layernorm2(attn2 + out1)

        ffn_output = self.ffn(out2)
        ffn_output = self.dropout3(ffn_output, training=training)
        out3 = self.layernorm3(ffn_output + out2)

        return out3, attn_weights_block1, attn_weights_block2
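
The `look_ahead_mask` and `padding_mask` arguments are not constructed anywhere in this notebook. The NumPy sketch below shows one plausible way to build them, assuming the convention used by `scaled_dot_product_attention` (1 marks a blocked position) and a padding id of 0. The function names are hypothetical.

```python
import numpy as np

def padding_mask(token_ids, pad_id=0):
    """1.0 where a token is padding; shape (batch, 1, 1, seq_len)
    so it broadcasts over heads and query positions."""
    return (token_ids == pad_id).astype(np.float32)[:, None, None, :]

def look_ahead_mask(size):
    """1.0 above the diagonal, so position i cannot attend to positions j > i."""
    return np.triu(np.ones((size, size), dtype=np.float32), k=1)

tar = np.array([[5, 8, 2, 0]])   # one target sequence with a trailing pad token

# A key is blocked if it lies in the future OR is padding.
combined = np.maximum(look_ahead_mask(tar.shape[1]), padding_mask(tar))
print(combined[0, 0])
```

The combined mask is what the first (masked) attention block of the decoder would receive, while the second block only needs the padding mask of the encoder input.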

In [None]:
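# Complete Transformer Model
# Note: Encoder and Decoder are not defined in this notebook. They are assumed to be
# stacks of num_layers EncoderLayer / DecoderLayer blocks applied on top of token
# embeddings plus positional encoding, with the Decoder also returning attention weights.
class Transformer(tf.keras.Model):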
    def __init__(self, num_layers, d_model, num_heads, dff,
                 input_vocab_size, target_vocab_size, pe_input, pe_target, rate=0.1):
        super(Transformer, self).__init__()

        self.encoder = Encoder(num_layers, d_model, num_heads, dff,
                               input_vocab_size, pe_input, rate)

        self.decoder = Decoder(num_layers, d_model, num_heads, dff,
                               target_vocab_size, pe_target, rate)

        self.final_layer = tf.keras.layers.Dense(target_vocab_size)

    def call(self, inp, tar, training, enc_padding_mask,
             look_ahead_mask, dec_padding_mask):
        enc_output = self.encoder(inp, training, enc_padding_mask)

        dec_output, attention_weights = self.decoder(
            tar, enc_output, training, look_ahead_mask, dec_padding_mask)

        final_output = self.final_layer(dec_output)

        return final_output, attention_weights
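
During training, the `tar` argument is typically the target sequence shifted by one position, so that the decoder predicts token *t+1* from tokens up to *t* (teacher forcing). The sketch below illustrates the shift with NumPy; the token ids are made up, with 2 and 3 standing in for start and end tokens and 0 for padding.

```python
import numpy as np

# Hypothetical token ids: 2 = start, 3 = end, 0 = padding
tar = np.array([[2, 11, 12, 13, 3],
                [2, 21, 22, 3, 0]])

tar_inp = tar[:, :-1]    # what the decoder sees:   [start, w1, w2, ...]
tar_real = tar[:, 1:]    # what it should predict:  [w1, w2, ..., end]

print(tar_inp)
print(tar_real)
```

The look-ahead mask is what makes this valid: without it, the decoder could read the answer for position *t+1* directly from its own input.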

<a id="3.3"></a>
### Training the Model

We'll set the hyperparameters, instantiate the Transformer, and define the optimizer, loss, and metrics used for training.

In [None]:
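# Hyperparameters
num_layers = 4
d_model = 128
dff = 512
num_heads = 8
# The saved tokenizers already include the start and end tokens in their vocabulary
input_vocab_size = tokenizers.pt.get_vocab_size().numpy()
target_vocab_size = tokenizers.en.get_vocab_size().numpy()
dropout_rate = 0.1

# Instantiate the Transformer
# pe_input / pe_target are assumed to set the maximum position covered by the
# positional encoding; the vocabulary sizes serve only as a generous upper bound.
transformer = Transformer(num_layers, d_model, num_heads, dff,
                          input_vocab_size, target_vocab_size,
                          pe_input=input_vocab_size,
                          pe_target=target_vocab_size,
                          rate=dropout_rate)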

# Optimizer
class CustomSchedule(tf.keras.optimizers.schedules.LearningRateSchedule):
    def __init__(self, d_model, warmup_steps=4000):
        super(CustomSchedule, self).__init__()

        self.d_model = tf.cast(d_model, tf.float32)
        self.warmup_steps = warmup_steps

    def __call__(self, step):
        # The optimizer passes an integer step; rsqrt requires a float
        step = tf.cast(step, tf.float32)
        arg1 = tf.math.rsqrt(step)                  # Decay phase: step^-0.5
        arg2 = step * (self.warmup_steps ** -1.5)   # Warmup phase: linear increase

        return tf.math.rsqrt(self.d_model) * tf.math.minimum(arg1, arg2)

learning_rate = CustomSchedule(d_model)
optimizer = tf.keras.optimizers.Adam(learning_rate, beta_1=0.9, beta_2=0.98,
                                     epsilon=1e-9)

# Loss and Metrics
def loss_function(real, pred):
    mask = tf.math.logical_not(tf.math.equal(real, 0))
    loss_ = tf.keras.losses.SparseCategoricalCrossentropy(
        from_logits=True, reduction='none')(real, pred)

    mask = tf.cast(mask, dtype=loss_.dtype)
    loss_ *= mask

    return tf.reduce_sum(loss_) / tf.reduce_sum(mask)

train_loss = tf.keras.metrics.Mean(name='train_loss')
train_accuracy = tf.keras.metrics.SparseCategoricalAccuracy(name='train_accuracy')
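
The learning-rate schedule defined by `CustomSchedule` is easier to reason about with concrete numbers. The plain-Python sketch below evaluates the same formula, using the notebook's `d_model = 128` and the default `warmup_steps = 4000`; the function name is a hypothetical stand-in.

```python
def transformer_lr(step, d_model=128, warmup_steps=4000):
    """lr = d_model^-0.5 * min(step^-0.5, step * warmup_steps^-1.5), for step >= 1."""
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)

steps = [1, 1000, 4000, 16000, 64000]
rates = [transformer_lr(s) for s in steps]
peak = transformer_lr(4000)

for s, r in zip(steps, rates):
    print(f"step {s:>6}: lr = {r:.6f}")
```

The rate rises linearly during warmup, peaks when `step == warmup_steps` (where the two arguments of `min` coincide), and then decays proportionally to the inverse square root of the step.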

<a id="3.4"></a>
### Evaluating Performance

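After training, we can inspect the model qualitatively by translating a sentence with greedy decoding: at each step, the most likely next token is appended to the output and fed back into the decoder.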

In [None]:
# Evaluation Function
def evaluate(inp_sentence, max_length=40):
    """Greedy decoding: feed the tokens generated so far back into the decoder."""
    # Tokenize the Portuguese input; the tokenizer adds the start and end tokens
    encoder_input = tokenizers.pt.tokenize(tf.constant([inp_sentence])).to_tensor()

    # Tokenizing an empty string yields just the English [START] and [END] ids
    start_end = tokenizers.en.tokenize(tf.constant(['']))[0]
    start_token, end_token = start_end[0], start_end[1]
    output = tf.reshape(start_token, (1, 1))

    # create_masks is assumed to be defined elsewhere; it is not shown in this notebook
    for i in range(max_length):
        enc_padding_mask, combined_mask, dec_padding_mask = create_masks(
            encoder_input, output)

        # Predictions
        predictions, attention_weights = transformer(encoder_input,
                                                     output,
                                                     False,
                                                     enc_padding_mask,
                                                     combined_mask,
                                                     dec_padding_mask)

        predictions = predictions[:, -1:, :]  # (batch_size, 1, vocab_size)
        predicted_id = tf.argmax(predictions, axis=-1)  # int64, like the tokenizer ids

        # Stop as soon as the end token is predicted
        if predicted_id[0, 0] == end_token:
            return tf.squeeze(output, axis=0), attention_weights

        output = tf.concat([output, predicted_id], axis=-1)

    return tf.squeeze(output, axis=0), attention_weights

# Function to translate
def translate(sentence):
    result, attention_weights = evaluate(sentence)

    # Convert the generated ids back to text
    predicted_sentence = tokenizers.en.detokenize(result[tf.newaxis, :])[0].numpy().decode('utf-8')

    print('Input: {}'.format(sentence))
    print('Predicted translation: {}'.format(predicted_sentence))

# Example
translate("este é um problema que temos que resolver.")
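
The decoding loop in `evaluate` does not depend on the Transformer itself. The plain-Python sketch below isolates the greedy strategy with a toy scoring function standing in for the model; everything here (`greedy_decode`, `toy_scores`, the six-token vocabulary) is an assumption for illustration.

```python
def greedy_decode(next_token_scores, start_id, end_id, max_length=40):
    """Append the highest-scoring next token until end_id or max_length is reached."""
    output = [start_id]
    for _ in range(max_length):
        scores = next_token_scores(output)   # one score per vocabulary id
        predicted_id = max(range(len(scores)), key=scores.__getitem__)
        if predicted_id == end_id:
            break
        output.append(predicted_id)
    return output[1:]                        # drop the start token

# Toy stand-in for the model: it always prefers the id after the last one,
# so decoding counts upward until it reaches END.
VOCAB_SIZE, START, END = 6, 0, 5

def toy_scores(prefix):
    target = min(prefix[-1] + 1, END)
    return [1.0 if i == target else 0.0 for i in range(VOCAB_SIZE)]

decoded = greedy_decode(toy_scores, START, END)
print(decoded)
```

Greedy decoding commits to the single best token at every step; alternatives such as beam search keep several candidate prefixes, at a higher computational cost.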

<a id="4"></a>
## 4. Latest Developments in Transformers

<a id="4.1"></a>
### BERT (Bidirectional Encoder Representations from Transformers)

- Introduced by Devlin et al. in 2018 [[2]](#ref2)
- Uses only the encoder stack of the Transformer, so each token can attend to context on both its left and its right
- Pre-trained on large text corpora with masked language modeling and next sentence prediction
- Achieved state-of-the-art results on benchmarks such as GLUE and SQuAD when it was published
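
The masked language modeling objective can be sketched in a few lines. Following the procedure described by Devlin et al., about 15% of positions are selected for prediction; of those, 80% are replaced with a mask token, 10% with a random token, and 10% are left unchanged. The code below is a simplified illustration: the token ids, `mask_id`, and `vocab_size` are made up, and special tokens are not treated differently.

```python
import random

def mask_tokens(token_ids, mask_id, vocab_size, select_prob=0.15, seed=0):
    """Return (corrupted_ids, labels); labels are None where no prediction is required."""
    rng = random.Random(seed)
    corrupted, labels = list(token_ids), [None] * len(token_ids)
    for pos, tok in enumerate(token_ids):
        if rng.random() >= select_prob:
            continue                        # position not selected for prediction
        labels[pos] = tok                   # the model must recover the original token
        roll = rng.random()
        if roll < 0.8:
            corrupted[pos] = mask_id        # 80%: replace with the mask token
        elif roll < 0.9:
            corrupted[pos] = rng.randrange(vocab_size)   # 10%: random token
        # remaining 10%: leave the token unchanged
    return corrupted, labels

tokens = list(range(100, 1100))             # hypothetical token ids
corrupted, labels = mask_tokens(tokens, mask_id=4, vocab_size=30000)
print(sum(label is not None for label in labels), "positions selected")
```

The loss is computed only at the selected positions, which is what lets the encoder use context from both directions without trivially copying the answer.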

<a id="4.2"></a>
### GPT Series

- Generative Pre-trained Transformer models by OpenAI
- GPT-1 (2018), GPT-2 (2019), GPT-3 (2020), GPT-4 (2023)
- Uses a decoder-only Transformer with masked (causal) self-attention
- Pre-trained on large text corpora to predict the next token
- Capable of generating coherent and contextually relevant text

<a id="4.3"></a>
### T5 (Text-to-Text Transfer Transformer)

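- Introduced by Raffel et al. in 2019 [[3]](#ref3)
- Unified framework that converts all NLP tasks into a text-to-text format, with a task prefix in the input (for example, "translate English to German: ...")
- Uses the full encoder-decoder Transformer architecture
- Reported strong results on benchmarks such as GLUE, SuperGLUE, and SQuAD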

<a id="4.4"></a>
### Vision Transformers (ViT)

- Introduced by Dosovitskiy et al. in 2020 [[4]](#ref4)
- Applies Transformer architecture to image recognition tasks
- Splits images into patches and treats them as tokens
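
The patch-splitting step can be expressed as a reshape. The NumPy sketch below uses a 224x224 RGB image and 16x16 patches; `patchify` is a hypothetical helper name, and the learned linear projection, class token, and position embeddings that ViT applies afterwards are omitted.

```python
import numpy as np

def patchify(image, patch_size):
    """Split an (H, W, C) image into flattened non-overlapping patches.

    Returns an array of shape (num_patches, patch_size * patch_size * C).
    Assumes H and W are divisible by patch_size.
    """
    h, w, c = image.shape
    gh, gw = h // patch_size, w // patch_size
    patches = image.reshape(gh, patch_size, gw, patch_size, c)
    patches = patches.transpose(0, 2, 1, 3, 4)   # (gh, gw, patch, patch, c)
    return patches.reshape(gh * gw, patch_size * patch_size * c)

image = np.arange(224 * 224 * 3, dtype=np.float32).reshape(224, 224, 3)
tokens = patchify(image, patch_size=16)
print(tokens.shape)   # one row per patch
```

Each row plays the role that a token embedding plays in NLP, so the rest of the encoder can stay unchanged.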

<a id="5"></a>
## 5. Conclusion

Attention mechanisms and Transformer models have fundamentally changed the landscape of deep learning, particularly in NLP. By effectively modeling long-range dependencies and enabling parallel computation, Transformers have achieved state-of-the-art results across numerous tasks. Ongoing research continues to push the boundaries, extending Transformers to new domains and improving their efficiency.

<a id="6"></a>
## 6. References

1. <a id="ref1"></a>Vaswani, A., et al. (2017). *Attention Is All You Need*. [arXiv:1706.03762](https://arxiv.org/abs/1706.03762)
2. <a id="ref2"></a>Devlin, J., et al. (2018). *BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding*. [arXiv:1810.04805](https://arxiv.org/abs/1810.04805)
3. <a id="ref3"></a>Raffel, C., et al. (2019). *Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer*. [arXiv:1910.10683](https://arxiv.org/abs/1910.10683)
4. <a id="ref4"></a>Dosovitskiy, A., et al. (2020). *An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale*. [arXiv:2010.11929](https://arxiv.org/abs/2010.11929)

---

This notebook provides an in-depth exploration of attention mechanisms and Transformer models. You can run the code cells to see how Transformers are implemented and experiment with the models.