<a href="https://colab.research.google.com/github/rastringer/code_first_ml/blob/main/transformers.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Transformers

<img src="https://github.com/rastringer/code_first_ml/blob/main/images/transformer_architecture.png?raw=true" width="500"/>

## Word vectors



In [None]:
import spacy

# Load the pre-trained English model
nlp = spacy.load("en_core_web_sm")

# Example sentence
sentence = "Word vectors are awesome!"

# Process the sentence using spaCy
doc = nlp(sentence)

# Access the word vectors for each token in the sentence
for token in doc:
    print(f"{token.text}: {token.vector[:5]}...")  # Displaying the first 5 components of the vector


Word: [-0.6131599 -1.3359097  0.5144141  0.582945  -0.5807983]...
vectors: [-0.16492632  1.2883415  -0.17733231  0.5517031   0.6896729 ]...
are: [ 0.08803613  0.7008368  -0.5576158  -0.5089512   0.14558357]...
awesome: [ 1.9866399  0.9055047 -0.7493017 -0.5317862 -0.8689432]...
!: [-0.9200196  -0.08114609  0.08567562 -1.8008881   0.09100179]...


<img src="https://github.com/rastringer/code_first_ml/blob/main/images/word_vectors.png?raw=true" width="500"/>

### The difficulties of word embeddings

Word vectors and embeddings are very useful; however, the ambiguity of language causes problems when we assign a single fixed vector to each word. For example, in the following sentences, "trainers" has a different meaning in each context.

"Mustafa loved running in his new trainers"

"Svitlana said the gym had the best trainers around"

Linguists call words like this, which share a spelling but have unrelated meanings, *homonyms*. A related phenomenon is *polysemy*, where a single word has several distinct but related meanings. For example,

"Joan wrote a program to calculate Euclidean distance"

"The program featured Mozart's The Marriage of Figaro"

In short, we need a way of finding the meaning of words based on their relevance to other words in the text. Step forward, Attention.
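To make the problem concrete, here is a minimal sketch of a static embedding table. The three-dimensional vectors and the `embed` helper are made up for illustration; real embeddings are learned and have hundreds of dimensions. The point is that a plain lookup returns the same vector for "trainers" whatever the surrounding words are.

```python
# Toy static embedding table: one fixed vector per word (values are invented).
embeddings = {
    "trainers": [0.9, 0.1, 0.4],
    "running": [0.8, 0.2, 0.1],
    "gym": [0.3, 0.7, 0.5],
}


def embed(sentence, word):
    """Look up `word` as it appears in `sentence`; the context is ignored."""
    tokens = sentence.lower().split()
    assert word in tokens, f"{word!r} not found in sentence"
    return embeddings[word]


shoes = embed("Mustafa loved running in his new trainers", "trainers")
coaches = embed("Svitlana said the gym had the best trainers around", "trainers")

# Both senses map to the same vector: the table cannot tell shoes from coaches.
print(shoes == coaches)
```

A contextual model needs to produce different vectors for the two occurrences, which is what attention provides.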



## Attention

Each transformer layer has two steps in which the model refines what the words and text mean. In ML parlance, each step updates the "hidden state" of the inputs to the model.

The first is the attention step, in which the transformer compares each word to all the other words in the sequence, looking for context and shared significance.

The second is the feed-forward step, in which the model tries to capture more complex patterns and relationships between words. Both steps are accomplished by mathematical transformations: matrix multiplications followed by nonlinear functions.
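The attention step can be sketched in a few lines of plain Python. This is a simplified illustration rather than the model's actual implementation: the two-dimensional token vectors are invented, and the same vectors serve as queries, keys and values, whereas a real transformer learns separate projections for each. The helper names `softmax` and `attend` are hypothetical.

```python
import math


def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    largest = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - largest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, keys, values):
    """Scaled dot-product attention for a single query vector."""
    dim = len(query)
    # Compare the query with every key, scaling by sqrt(dim)
    scores = [
        sum(q * k for q, k in zip(query, key)) / math.sqrt(dim) for key in keys
    ]
    weights = softmax(scores)
    # The output is the weighted average of the value vectors
    output = [
        sum(w * value[i] for w, value in zip(weights, values))
        for i in range(len(values[0]))
    ]
    return weights, output


# Invented 2-d vectors for three tokens
tokens = ["new", "trainers", "running"]
vectors = [[1.0, 0.0], [0.6, 0.8], [0.0, 1.0]]

# How much does "trainers" attend to each token, including itself?
weights, output = attend(vectors[1], vectors, vectors)
for token, weight in zip(tokens, weights):
    print(f"{token}: {weight:.3f}")
```

The output vector for "trainers" is now a blend of all three tokens, weighted by how similar each one is to it.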

<img src="https://github.com/rastringer/code_first_ml/blob/main/images/attention_diagram.png?raw=true" width="800"/>

[Diagram](https://distill.pub/2016/augmented-rnns/) from Olah and Carter, 2016

### From text to genomics and vision

<img src="https://github.com/rastringer/code_first_ml/blob/main/images/vit_transformer.png?raw=true" width="800"/>

<img src="https://github.com/rastringer/code_first_ml/blob/main/images/vit_attention.png?raw=true" width="400"/>

Images from ["An image is worth 16 x 16 words"](https://arxiv.org/pdf/2010.11929.pdf)

### Self-attention

<img src="https://github.com/rastringer/code_first_ml/blob/main/images/self_attention.png?raw=true" width="800"/>

<img src="https://github.com/rastringer/code_first_ml/blob/main/images/multi_head.png?raw=true" width="800"/>

<img src="https://github.com/rastringer/code_first_ml/blob/main/images/transformer_encoder.png?raw=true" width="800"/>

With each layer, the model's understanding of the text improves. Here are the outputs at each layer, up to layer 23, of GPT-2 when given the following prompt:

"Q: What is the capital of France?"
"A: Paris"
"Q: What is the capital of Poland?"
"A: "

```
0  ( [ The:, \n at and Act A
1  A The ( [ Is59 \n At and40
2  A [ ( The At Is Act at59,
3  A [ ( Act At Is The CH An at
4  A [ At Q (Q The Are M An
5  A M No At The payable Q Qu (Q
6  No M A The C Die An H En Qu
7  C A No The M n P N H An
8  A The C P H No n Ass N T
9  A C No nil The Ch P An H N
10  A The G C N P No Me An Le
11  A C N None P G The Pr Ce H
12  Unknown None C G A N Bar The Ch P
13  C P N G B A Unknown St None The
14  St N G P Poland B C Pol A D
15  Poland P St Pol Warsaw Polish N B G Germany
16  Poland Warsaw Polish Poles Budapest Prague Pol Germany Berlin Moscow
17  Poland Warsaw Polish Poles Budapest Prague � Pol Lithuania Moscow
18  Poland Warsaw Polish Prague Budapest Poles Moscow � Berlin Kiev
19  Warsaw Poland Polish Budapest Prague Moscow Berlin Kiev � Frankfurt
20  Warsaw Poland Prague Budapest Polish Moscow Kiev Berlin Frankfurt Brussels
21  Warsaw Poland Polish Prague Budapest � Kiev Sz Berlin Moscow
22  Warsaw Poland Prague Budapest K W Kiev Sz Moscow Berlin
23  Warsaw W K Br Po B L Z P Poland

```


In [None]:
import torch
import torch.nn as nn

class SelfAttention(nn.Module):
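    def __init__(self, embed_dim, num_heads):
        super().__init__()
        # Each head gets an equal slice of the embedding, so the split must be exact
        assert embed_dim % num_heads == 0, "embed_dim must be divisible by num_heads"
        self.embed_dim = embed_dim
        self.num_heads = num_heads
        self.head_dim = embed_dim // num_heads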

        self.query_proj = nn.Linear(embed_dim, self.head_dim * num_heads)
        self.key_proj = nn.Linear(embed_dim, self.head_dim * num_heads)
        self.value_proj = nn.Linear(embed_dim, self.head_dim * num_heads)

        self.softmax = nn.Softmax(dim=-1)
        self.output_proj = nn.Linear(embed_dim, embed_dim)

    def forward(self, x):
        batch_size, seq_len, embed_dim = x.size()

        # Calculate queries, keys, and values (all with multiple heads)
        queries = self.query_proj(x).view(batch_size, seq_len, self.num_heads, self.head_dim)
        keys = self.key_proj(x).view(batch_size, seq_len, self.num_heads, self.head_dim)
        values = self.value_proj(x).view(batch_size, seq_len, self.num_heads, self.head_dim)

        # Transpose for efficient dot-product attention calculation
        queries = queries.transpose(1, 2)  # [batch_size, num_heads, seq_len, head_dim]
        keys = keys.transpose(1, 2)
        values = values.transpose(1, 2)

        # Compute scaled dot-product attention
        scores = torch.matmul(queries, keys.transpose(-2, -1)) / (self.head_dim ** 0.5)
        attention_weights = self.softmax(scores)

        # Apply attention weights to values
        output = torch.matmul(attention_weights, values)

        # Concatenate heads and project back to original embedding dimension
        output = output.transpose(1, 2).contiguous()
        output = output.view(batch_size, seq_len, embed_dim)
        output = self.output_proj(output)

        return output



torch.Size([2, 5, 256])
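For reference, the `SelfAttention` cell above follows the standard scaled dot-product attention formula, applied once per head, where $d_k$ corresponds to `head_dim`:

$$
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V,
\qquad d_k = \frac{\texttt{embed\_dim}}{\texttt{num\_heads}}
$$

With `embed_dim=256` and `num_heads=8`, each head works with $d_k = 256 / 8 = 32$ dimensions, so the scores are divided by $\sqrt{32} \approx 5.66$ before the softmax.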


In [None]:
import torch
import torch.nn as nn

class TransformerEncoderLayer(nn.Module):
    def __init__(self, embed_dim, num_heads, hidden_dim, dropout=0.1):
        super().__init__()
        self.self_attention = SelfAttention(embed_dim, num_heads)
        self.norm1 = nn.LayerNorm(embed_dim)
        self.feed_forward = nn.Sequential(
            nn.Linear(embed_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, embed_dim)
        )
        self.norm2 = nn.LayerNorm(embed_dim)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        # Self-attention
        x = self.self_attention(x) + x  # Residual connection
        x = self.norm1(x)

        # Feed-forward
        x = self.feed_forward(x) + x  # Residual connection
        x = self.norm2(x)

        return self.dropout(x)

class TransformerEncoder(nn.Module):
    def __init__(self, vocab_size, embed_dim, num_layers, num_heads, hidden_dim):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.layers = nn.ModuleList([
            TransformerEncoderLayer(embed_dim, num_heads, hidden_dim) for _ in range(num_layers)
        ])

    def forward(self, x):
        x = self.embedding(x)
        for layer in self.layers:
            x = layer(x)
        return x
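The "add the input back, then normalize" pattern used twice in `TransformerEncoderLayer` can be traced by hand on a single vector. The sketch below is a dependency-free illustration under stated assumptions: `sublayer` is a hypothetical stand-in for attention or the feed-forward network, and `layer_norm` omits the learnable scale and shift that `nn.LayerNorm` also applies.

```python
import math


def layer_norm(x, eps=1e-5):
    """Normalize a vector to zero mean and (approximately) unit variance."""
    mean = sum(x) / len(x)
    variance = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(variance + eps) for v in x]


def sublayer(x):
    """Hypothetical stand-in for attention or the feed-forward network."""
    return [0.5 * v for v in x]


x = [1.0, 2.0, 3.0, 4.0]

# Residual connection: add the sublayer's output to its own input
residual = [out + inp for out, inp in zip(sublayer(x), x)]

# Layer normalization keeps the values on a stable scale from layer to layer
normed = layer_norm(residual)

print(residual)
print([round(v, 3) for v in normed])
```

The residual connection lets each layer make a small adjustment to its input rather than replacing it, which is one reason deep stacks of these layers remain trainable.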


In [None]:
# Hyperparameters and toy data: placeholder values so the cell runs.
# Replace source_sentence_tokens with your own preprocessed token IDs.
vocab_size, embed_dim, num_layers, num_heads, hidden_dim = 1000, 256, 2, 8, 512
num_epochs = 3
source_sentence_tokens = torch.randint(0, vocab_size, (2, 5))  # (batch_size, seq_len)

model = TransformerEncoder(vocab_size, embed_dim, num_layers, num_heads, hidden_dim)
optimizer = torch.optim.Adam(model.parameters())
criterion = nn.CrossEntropyLoss()

for epoch in range(num_epochs):
    # ...
    source_embeddings = model(source_sentence_tokens)  # (batch_size, seq_len, embed_dim)

    # Hypothetical: Next steps would be feeding these embeddings to a decoder
    # to generate the target language translation and calculate loss using criterion.
    # ...


In [None]:

# Example usage
input_tensor = torch.randn(2, 5, 256)  # (batch_size, sequence_length, embedding_dim)
attention_layer = SelfAttention(embed_dim=256, num_heads=8)
output = attention_layer(input_tensor)
print(output.shape)

In [None]:
import torch
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F

# Define the Transformer model
class Transformer(nn.Module):
    def __init__(self, input_vocab_size, output_vocab_size, d_model=256, nhead=4, num_encoder_layers=3, num_decoder_layers=3):
        super().__init__()

        # Separate embedding tables, since source and target vocabularies may differ in size
        self.src_embedding = nn.Embedding(input_vocab_size, d_model)
        self.tgt_embedding = nn.Embedding(output_vocab_size, d_model)
        self.transformer = nn.Transformer(
            d_model=d_model,
            nhead=nhead,
            num_encoder_layers=num_encoder_layers,
            num_decoder_layers=num_decoder_layers,
            batch_first=True,  # inputs are (batch_size, seq_length, d_model)
        )
        self.fc = nn.Linear(d_model, output_vocab_size)

    def forward(self, src, tgt):
        src = self.src_embedding(src)
        tgt = self.tgt_embedding(tgt)

        output = self.transformer(src, tgt)
        output = self.fc(output)  # (batch_size, seq_length, output_vocab_size)

        return output

# Generate some toy data
input_vocab_size = 10
output_vocab_size = 10
seq_length = 5
batch_size = 2

# Random input and target sequences with the same batch size
input_sequence = torch.randint(0, input_vocab_size, (batch_size, seq_length))
target_sequence = torch.randint(0, output_vocab_size, (batch_size, seq_length))  # Make sure the batch size matches

# Initialize the Transformer model
model = Transformer(input_vocab_size, output_vocab_size)
# Define loss and optimizer
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=0.001)

# Training loop
num_epochs = 50
for epoch in range(num_epochs):
    # Forward pass
    output = model(input_sequence, target_sequence)

    # Flatten output and target for the loss calculation. Use new names so that
    # target_sequence keeps its (batch_size, seq_length) shape, because the model
    # needs it again on the next forward pass.
    flat_output = output.reshape(-1, output_vocab_size)
    flat_target = target_sequence.reshape(-1)

    # Calculate the loss
    loss = criterion(flat_output, flat_target)

    # Backward pass and optimization
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    # Print the loss every few epochs
    if (epoch + 1) % 10 == 0:
        print(f'Epoch [{epoch+1}/{num_epochs}], Loss: {loss.item():.4f}')

# Testing the model (eval mode disables dropout; no gradients are needed)
model.eval()
test_input_sequence = torch.randint(0, input_vocab_size, (batch_size, seq_length))
with torch.no_grad():
    output = model(test_input_sequence, target_sequence)
predicted_sequence = torch.argmax(output, dim=2)

print("Input Sequence:")
print(test_input_sequence)
print("Target Sequence:")
print(target_sequence.view(batch_size, seq_length))
print("Predicted Sequence:")
print(predicted_sequence)