# Text processing with Transformers


A transformer is a deep learning architecture for processing, understanding, and generating text in human language.

It was developed by researchers at Google and is based on the multi-head attention mechanism. The architecture was proposed in the 2017 paper ["Attention Is All You Need"](https://proceedings.neurips.cc/paper_files/paper/2017/file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf).

Some of the most impactful LLMs, including BERT, GPT, and T5, to name a few, are all based on transformers.

## Transformer architecture

<img src="./img/transformers.png" alt="transformers" style="width: 300px;"/>

The transformer has two main stacks:

* Encoder

* Decoder

Each stack has a number of layers containing Multi-Head Attention and Feed-Forward layers.

They don't have recurrent or convolutional layers.

**Transformers vs RNN:**

* Transformers do not rely on recurrent layers as part of their neural network components. 

* They can significantly outperform RNNs at capturing long-range dependencies in long text sequences. This is thanks to attention mechanisms, which, together with positional encoding of tokens, weight the relative importance of different words in a sentence when making inferences.

* Thanks to attention mechanisms, transformers process all tokens of a sequence in parallel rather than one at a time, leading to faster model training and, when the full sequence is available up front, faster inference.

**Types of transformer architecture:**

* Encoder-Decoder: translation, summarization (T5, BART)
* Encoder only: text classification, extractive QA (BERT)
* Decoder only: text generation, generative QA (GPT)

# PyTorch Transformer

The model dimension **d_model** refers to the dimensionality of embeddings used throughout the entire model to represent inputs, outputs, and the intermediate information processed in between. 

Attention mechanisms typically have multiple heads that perform parallel computations, specializing in capturing different types of text dependencies. The number of heads, specified in **nhead**, must evenly divide the model dimension, since each head works on an equal slice of the embedding.

The depth of the model largely depends on the number of encoder and decoder layers.
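
As a quick illustration of the relationship between the model dimension and the number of heads, each head works on a slice of size `d_model // nhead`. The helper below is hypothetical (it is not part of PyTorch) and only sketches the arithmetic, assuming the split has to be exact.

```python
def head_dim(d_model: int, nhead: int) -> int:
    """Hypothetical helper: per-head embedding size for a given model dimension."""
    if d_model % nhead != 0:
        raise ValueError(f"d_model={d_model} is not divisible by nhead={nhead}")
    return d_model // nhead


print(head_dim(512, 8))  # 64
```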

In [1]:
import torch
import torch.nn as nn

In [8]:
d_model = 512
nhead = 8
num_encoder_layers = 6
num_decoder_layers = 6

In [None]:
# PyTorch Transformer implementation
model = nn.Transformer(
    d_model=d_model,
    nhead=nhead,
    num_encoder_layers=num_encoder_layers,
    num_decoder_layers=num_decoder_layers,
)

In [10]:
print(model)

Transformer(
  (encoder): TransformerEncoder(
    (layers): ModuleList(
      (0-5): 6 x TransformerEncoderLayer(
        (self_attn): MultiheadAttention(
          (out_proj): NonDynamicallyQuantizableLinear(in_features=512, out_features=512, bias=True)
        )
        (linear1): Linear(in_features=512, out_features=2048, bias=True)
        (dropout): Dropout(p=0.1, inplace=False)
        (linear2): Linear(in_features=2048, out_features=512, bias=True)
        (norm1): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
        (norm2): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
        (dropout1): Dropout(p=0.1, inplace=False)
        (dropout2): Dropout(p=0.1, inplace=False)
      )
    )
    (norm): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
  )
  (decoder): TransformerDecoder(
    (layers): ModuleList(
      (0-5): 6 x TransformerDecoderLayer(
        (self_attn): MultiheadAttention(
          (out_proj): NonDynamicallyQuantizableLinear(in_features=512, o
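
To see how tensors flow through such a model, here is a minimal sketch of a forward pass. The sizes are deliberately tiny and chosen only for illustration (they are assumptions, not the values used above), and the inputs are random tensors standing in for already-embedded source and target sequences.

```python
import torch
import torch.nn as nn

# Tiny sizes (assumed for illustration) so the example runs quickly
small = nn.Transformer(
    d_model=32,
    nhead=4,
    num_encoder_layers=1,
    num_decoder_layers=1,
    dim_feedforward=64,
)
small.eval()

# Default layout is (sequence length, batch size, d_model)
src = torch.rand(10, 2, 32)  # source: 10 tokens, batch of 2
tgt = torch.rand(7, 2, 32)   # target: 7 tokens, batch of 2

with torch.no_grad():
    out = small(src, tgt)

# The output follows the target sequence length
print(out.shape)  # torch.Size([7, 2, 32])
```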

# Attention mechanism & Positional encoding

<img src="./img/attention.png" alt="attention.png" style="width: 400px;"/>

The Attention Mechanism in the transformer assigns importance to words within a sentence. 

In the example above, 'it' is understood to be most related to 'animal', then 'street', then 'the', in descending order of significance. This helps ensure that in tasks like translation, the model's interpretation aligns with human understanding.

**SELF** and **MULTI-HEAD** attention:

* Self-Attention assigns significance to words within a sentence. In *The cat, which was on the roof, was scared,* the mechanism links "was scared" directly to "The cat". 

* Multi-Head Attention is akin to deploying multiple spotlights, each highlighting a different relationship. In the same example, one head could relate "was scared" to "The cat," another to "the roof," and another to "was on".

## Positional encoding

The attention mechanism by itself has no notion of word order, so it requires information about the *position of each token in the sequence*.

The positional encoding precedes the attention layers and supplies this information.

Instead of the raw token index, transformers use a positional encoding scheme in which each position/index is mapped to a vector calculated by sine and cosine functions of varying frequencies.
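
For reference, the sinusoidal scheme from "Attention Is All You Need" can be written as follows, where $pos$ is the token position, $i$ indexes pairs of embedding dimensions, and $d_{model}$ is the model dimension:

$$
PE_{(pos,\,2i)} = \sin\!\left(\frac{pos}{10000^{2i/d_{model}}}\right),
\qquad
PE_{(pos,\,2i+1)} = \cos\!\left(\frac{pos}{10000^{2i/d_{model}}}\right)
$$

Even dimensions use the sine and odd dimensions use the cosine of the same frequency, so each position receives a distinct pattern across the embedding.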

<img src="./img/positional_encoding.png" alt="positional_encoding" style="width: 400px;"/>

The output of the positional encoding layer is a matrix in which each row is the embedding of one token of the sequence summed with its positional encoding.

<img src="./img/positional_encoding_end.png" alt="positional_encoding_end" style="width: 400px;"/>

In [12]:
import math

class PositionalEncoder(nn.Module):
    def __init__(self, d_model, max_seq_length=512):
        super().__init__()
        self.d_model = d_model
        self.max_seq_length = max_seq_length

        # initialize positional encoding matrix
        pe = torch.zeros(max_seq_length, d_model)

        # initialize position indices for the sequence
        # 'unsqueeze' turns them into a column so they broadcast against div_term
        position = torch.arange(0, max_seq_length, dtype=torch.float).unsqueeze(1)

        # scaler for positional indices
        div_term = torch.exp(
            torch.arange(0, d_model, 2, dtype=torch.float) *
            -(math.log(10000.0) / d_model)
        )
        # apply scaler to positional indices combined with sine and cosine functions
        pe[:, 0::2] = torch.sin(position * div_term)
        pe[:, 1::2] = torch.cos(position * div_term)
        
        # unsqueeze to add batch dimension
        pe = pe.unsqueeze(0)

        # set matrix as non-trainable using register_buffer
        self.register_buffer('pe', pe)

    def forward(self, x):
        # add the positional encodings to the whole sequence embeddings contained in tensor x
        x = x + self.pe[:, :x.size(1)]
        return x
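
A quick way to sanity-check the encoding is to build the table on its own and inspect it. The function below is a hypothetical standalone version of the table computed inside the class (the name `sinusoidal_encoding` and the small sizes are assumptions for illustration).

```python
import math
import torch


def sinusoidal_encoding(max_seq_length: int, d_model: int) -> torch.Tensor:
    """Hypothetical standalone version of the positional encoding table."""
    position = torch.arange(max_seq_length, dtype=torch.float).unsqueeze(1)
    div_term = torch.exp(
        torch.arange(0, d_model, 2, dtype=torch.float) * -(math.log(10000.0) / d_model)
    )
    pe = torch.zeros(max_seq_length, d_model)
    pe[:, 0::2] = torch.sin(position * div_term)  # even columns
    pe[:, 1::2] = torch.cos(position * div_term)  # odd columns
    return pe


pe = sinusoidal_encoding(max_seq_length=50, d_model=8)
print(pe.shape)  # torch.Size([50, 8])
print(pe[0])     # position 0: sin(0) = 0 in even columns, cos(0) = 1 in odd columns
```

All values stay within [-1, 1], so adding the table to the embeddings does not change their scale dramatically.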

## Attention mechanism in details

<img src="./img/attention_mechanism.png" alt="attention_mechanism" style="width: 400px;"/>

Each embedding is first projected into three vectors of equal dimension (query, key, and value) by applying three separate linear transformations, each with its own weights learned during training.

Scaled dot-product attention is the most common self-attention approach. It computes the dot product between every query-key pair in a sequence as a similarity measure, scaled by the square root of the key dimension, to yield a matrix of attention scores between words.

Applying softmax to the scores yields a matrix of attention weights, indicating the relevance or attention that the model must pay to each token in a sequence like "orange is my favorite fruit" for a given query token, such as "orange". In this example, "favorite" and "fruit" are the two words to pay the highest attention to when processing the word "orange".

Attention weights are then multiplied by the values to obtain updated token embeddings with relevant information about the sequence.
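
The steps above correspond to the formula softmax(QKᵀ / √d_k) V. The following is a minimal single-head sketch using random tensors in place of learned projections; the sizes and variable names are assumptions for illustration only.

```python
import math
import torch
import torch.nn.functional as F

torch.manual_seed(0)

seq_len, d_k = 5, 8  # e.g. the five tokens of "orange is my favorite fruit"
Q = torch.randn(seq_len, d_k)  # queries
K = torch.randn(seq_len, d_k)  # keys
V = torch.randn(seq_len, d_k)  # values

# Attention scores: similarity of every query with every key, scaled by sqrt(d_k)
scores = Q @ K.T / math.sqrt(d_k)    # (seq_len, seq_len)

# Softmax over the key dimension turns each row into weights that sum to 1
weights = F.softmax(scores, dim=-1)  # (seq_len, seq_len)

# Updated token embeddings: weighted sum of the values
updated = weights @ V                # (seq_len, d_k)

print(weights.sum(dim=-1))  # every row sums to 1
print(updated.shape)        # torch.Size([5, 8])
```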

Transformers are implemented with multiple attention heads so that each head can learn a different kind of relationship, as shown in the picture below.

<img src="./img/multi_headed_attention.png" alt="multi_headed_attention" style="width: 800px;"/>

Multi-headed attention concatenates attention-head outputs and linearly projects them to keep consistent embedding dimensions.

In [15]:
class MultiHeadAttention(nn.Module):
    def __init__(self, d_model, num_heads):
        super().__init__()
        self.d_model = d_model
        # number of attention heads handling embedding size head_dim
        self.num_heads = num_heads
        self.head_dim = d_model // num_heads

        # linear transformations for attention inputs
        self.query_linear = nn.Linear(d_model, d_model)
        self.key_linear = nn.Linear(d_model, d_model)
        self.value_linear = nn.Linear(d_model, d_model)
        
        # final concatenated output
        self.output_layer = nn.Linear(d_model, d_model)

    def split_heads(self, x, batch_size):
        # splits the input across the heads
        x = x.view(batch_size, -1, self.num_heads, self.head_dim)
        return x.permute(0,2,1,3).contiguous().view(batch_size * self.num_heads, -1 , self.head_dim)

    def compute_attention(self, query, key, mask=None):
        # calculates the attention weights inside each head
        # query and key have shape (batch_size * num_heads, seq_len, head_dim)
        scores = torch.matmul(query, key.transpose(-2, -1)) / (self.head_dim ** 0.5)
        if mask is not None:
            scores = scores.masked_fill(mask == 0, -1e9)
        attention_weights = torch.nn.functional.softmax(scores, dim=-1)
        return attention_weights

    def forward(self, query, key, value, mask=None):
        batch_size = query.size(0)

        query = self.split_heads(
            self.query_linear(query), batch_size
        )
        key = self.split_heads(
            self.key_linear(key), batch_size
        )
        value = self.split_heads(
            self.value_linear(value), batch_size
        )

        attention_weights = self.compute_attention(query, key, mask)

        output = torch.matmul(attention_weights, value)
        output = output.view(
            batch_size, 
            self.num_heads, 
            -1, 
            self.head_dim
        ).permute(0,2,1,3).contiguous().view(batch_size, -1 , self.d_model)

        return self.output_layer(output)
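
The reshaping used to split the embedding across heads and to concatenate the head outputs again can be checked in isolation. The sketch below uses small, assumed sizes and applies the same `view`/`permute` pattern outside the class to show that merging exactly undoes splitting.

```python
import torch

batch_size, seq_len, d_model, num_heads = 2, 5, 16, 4  # assumed toy sizes
head_dim = d_model // num_heads
x = torch.randn(batch_size, seq_len, d_model)

# split: (batch, seq, d_model) -> (batch * heads, seq, head_dim)
split = (
    x.view(batch_size, seq_len, num_heads, head_dim)
    .permute(0, 2, 1, 3)
    .contiguous()
    .view(batch_size * num_heads, seq_len, head_dim)
)

# merge: (batch * heads, seq, head_dim) -> (batch, seq, d_model)
merged = (
    split.view(batch_size, num_heads, seq_len, head_dim)
    .permute(0, 2, 1, 3)
    .contiguous()
    .view(batch_size, seq_len, d_model)
)

print(split.shape)             # torch.Size([8, 5, 4])
print(torch.equal(merged, x))  # True
```

Because the round trip is lossless, the final linear layer receives the head outputs laid side by side in the original embedding dimension.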

## TransformerEncoder Model

In [2]:
sample_texts = [
    'I love this product',
    'This is terrible',
    'Could be better',
    'This is the best',
]

In [3]:
labels = [1,0,0,1]

In [4]:
train_data, test_data = sample_texts[:3], sample_texts[3:]
train_labels, test_labels = labels[:3], labels[3:]

TransformerEncoderLayer:

* d_model - the embedding size; larger values give the model more representational capacity
* nhead - the number of attention heads, which determines how many word contexts the model can focus on simultaneously, impacting its contextual understanding

In [9]:
class TransformerEncoder(nn.Module):
    def __init__(self, embed_size, heads, num_layers, dropout):
        super().__init__()
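        # batch_first=True because inputs are shaped (batch, tokens, embed_size)
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(
                d_model=embed_size, nhead=heads, dropout=dropout, batch_first=True
            ),
            num_layers=num_layers
        )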
        self.fc = nn.Linear(embed_size, 2)

    def forward(self, x):
        x = self.encoder(x)
        x = x.mean(dim=1)
        return self.fc(x)
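
To make the tensor shapes in this kind of classifier concrete, here is a small sketch that runs a random "sentence" through an encoder stack, averages over the tokens, and maps the result to two class logits. The sizes are assumptions for illustration, and the sketch assumes inputs laid out as (batch, tokens, embedding), which requires `batch_first=True`.

```python
import torch
import torch.nn as nn

embed_size, heads = 16, 4  # small sizes assumed for illustration

layer = nn.TransformerEncoderLayer(d_model=embed_size, nhead=heads, batch_first=True)
encoder = nn.TransformerEncoder(layer, num_layers=2)
fc = nn.Linear(embed_size, 2)
encoder.eval()

x = torch.rand(1, 3, embed_size)  # one sentence with three token embeddings

with torch.no_grad():
    encoded = encoder(x)          # (1, 3, 16): one contextual vector per token
    pooled = encoded.mean(dim=1)  # (1, 16): average over the token dimension
    logits = fc(pooled)           # (1, 2): one score per class

print(encoded.shape, pooled.shape, logits.shape)
```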

In [None]:
model = TransformerEncoder(embed_size=512, heads=8, num_layers=3, dropout=0.5)

In [12]:
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
loss_function = nn.CrossEntropyLoss()

In [None]:
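# training loop
# NOTE: token_embeddings is never defined in this notebook. As an assumption, it is
# built here as a dict mapping each training token to a random (1, 512) embedding.
token_embeddings = {
    token: torch.rand(1, 512) for sentence in train_data for token in sentence.split()
}

model.train()
for epoch in range(10):
    for sentence, label in zip(train_data, train_labels):
        # Split the sentences into tokens and stack the embeddings
        tokens = sentence.split()
        data = torch.stack([token_embeddings[token] for token in tokens], dim=1)
        output = model(data)
        loss = loss_function(output, torch.tensor([label]))
        # Zero the gradients, perform a backward pass, and update the weights
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()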

In [13]:
# testing
def predict(sentence):
    model.eval()
    # Deactivate the gradient computations and get the sentiment prediction.
    with torch.no_grad():
        tokens = sentence.split()
        data = torch.stack([token_embeddings.get(token, torch.rand((1, 512))) for token in tokens], dim=1)
        output = model(data)
        predicted = torch.argmax(output, dim=1)
    return 'positive' if predicted.item() == 1 else 'negative'

## RNN with Attention Model

In [2]:
sample_text = [
    "the animal didn't cross the street because it was too tired",
    "the cat sat on the mat",
]

In [7]:
# vocabulary and word index
vocab = set(' '.join(sample_text).split())
word_to_idx = {word:idx for idx, word in enumerate(vocab)}
ix_to_word = {idx:word for idx, word in enumerate(vocab)}

In [15]:
# encoder/decoder data
pairs = [sentence.split() for sentence in sample_text]
input_data = [[word_to_idx[word] for word in sentence[:-1]] for sentence in pairs]
target_data = [word_to_idx[sentence[-1]] for sentence in pairs]

inputs = [torch.tensor(seq, dtype=torch.long) for seq in input_data]
targets = torch.tensor(target_data, dtype=torch.long)

In [33]:
embedding_dim = 10
hidden_dim = 16
vocab_size = len(vocab)

class RNNWithAttentionModel(nn.Module):
    def __init__(self, vocab_size, embedding_dim, hidden_dim):
        super().__init__()
        
        # Embedding layer translates word indexes to vectors
        self.embeddings = nn.Embedding(vocab_size, embedding_dim)
        
        # RNN layer for sequential processing
        self.rnn = nn.RNN(embedding_dim, hidden_dim, batch_first=True)

        # Attention layer computes word significances by linearly transforming hidden_dim to one,
        # yielding a single attention score per word
        self.attention = nn.Linear(hidden_dim, 1)

        # Final layer outputs vocab_size scores; the largest one pinpoints the predicted word index
        self.fc = nn.Linear(hidden_dim, vocab_size)

    def forward(self, x):
        
        # word indexes are embedded
        x = self.embeddings(x)

        # process embeddings in the sequential layer, generating an output for each word
        out, _ = self.rnn(x)

        # attention scores are derived by applying a linear transformation to the RNN outputs,
        # removing the trailing dimension with squeeze(2), and normalizing with softmax over the sequence.
        att_weights = torch.nn.functional.softmax(self.attention(out).squeeze(2), dim=1)

        # Context vector is formed by multiplying the attention weights with the RNN outputs
        # and summing over the sequence dimension with torch.sum, giving a weighted sum of the outputs.
        # unsqueeze(2) restores the trailing dimension so the weights broadcast against the RNN outputs.
        # The context vector is then fed into the fc layer for the final prediction.
        context = torch.sum(att_weights.unsqueeze(2) * out, dim=1)
        
        return self.fc(context)

def pad_sequences(batch):
    """
    Ensures consistent sequence lengths by padding the input sequences with zeros,
    using torch.cat and torch.stack, avoiding any potential length discrepancies
    """
    max_len = max([len(i) for i in batch])
    return torch.stack(
        [torch.cat([seq, torch.zeros(max_len - len(seq)).long()])
             for seq in batch]
    )
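
The padding and attention-pooling steps can be traced on a toy batch. The sketch below is self-contained and uses arbitrary word indices and freshly initialized (untrained) layers, so only the shapes and the normalization of the weights are meaningful; all sizes are assumptions for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

# Two index sequences of different lengths (values are arbitrary word indices)
batch = [torch.tensor([3, 1, 4]), torch.tensor([2])]
max_len = max(len(seq) for seq in batch)
padded = torch.stack(
    [torch.cat([seq, torch.zeros(max_len - len(seq)).long()]) for seq in batch]
)
print(padded)  # rows: [3, 1, 4] and [2, 0, 0]

embeddings = nn.Embedding(5, 10)
rnn = nn.RNN(10, 16, batch_first=True)
attention = nn.Linear(16, 1)

out, _ = rnn(embeddings(padded))                            # (2, 3, 16)
att_weights = F.softmax(attention(out).squeeze(2), dim=1)   # (2, 3), rows sum to 1
context = torch.sum(att_weights.unsqueeze(2) * out, dim=1)  # (2, 16)

print(att_weights.sum(dim=1))
print(context.shape)  # torch.Size([2, 16])
```

Note that padding with zeros reuses index 0, which is also assigned to a real word by `enumerate`, and that padded positions still receive some attention weight. A dedicated padding index combined with a mask would be one way to avoid this, although it is not needed for the small example here.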

In [34]:
loss_function = nn.CrossEntropyLoss()
model = RNNWithAttentionModel(vocab_size, embedding_dim, hidden_dim)
optimizer = torch.optim.Adam(model.parameters(), lr=0.01)

In [37]:
# training
for epoch in range(300):
    model.train()
    optimizer.zero_grad()
    padded_inputs = pad_sequences(inputs)
    output = model(padded_inputs)
    loss = loss_function(output, targets)
    loss.backward()
    optimizer.step()

    if (epoch % 100) == 0:
        print(f'epoch {epoch}, loss: {loss.item()}')

epoch 0, loss: 0.0005866951541975141
epoch 100, loss: 0.0004257845284882933
epoch 200, loss: 0.00032556717633269727


In [40]:
# testing
for seq, target in zip(input_data, target_data):
    input_test = torch.tensor(seq, dtype=torch.long).unsqueeze(0)
    model.eval()
    output = model(input_test)
    predictions = ix_to_word[torch.argmax(output).item()]
    print(f'\nInput: {" ".join([ix_to_word[i] for i in seq])}')
    print(f'\nTarget: {ix_to_word[target]}')
    print(f'\nOutput: {predictions}')


Input: the animal didn't cross the street because it was too

Target: tired

Output: tired

Input: the cat sat on the

Target: mat

Output: mat
