<a href="https://colab.research.google.com/github/rahiakela/deep-learning-research-and-practice/blob/main/deep-learning-with-pytorch-step-by-step/Part-III-NLP/04_transformer.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

##Transformer

We’re actually quite close to developing our own version of the famous
Transformer model. The encoder-decoder architecture with positional encoding
is missing only a few details to effectively "transform and roll out" :-)

First, we need to revisit the multi-headed attention mechanism to make it less
computationally expensive by using narrow attention. Then, we’ll learn about a
new kind of normalization: layer normalization.

Finally, we’ll add some more bells
and whistles: dropout, residual connections, and more "layers".

##Setup

In [None]:
try:
    import google.colab
    import requests
    url = 'https://raw.githubusercontent.com/dvgodoy/PyTorchStepByStep/master/config.py'
    r = requests.get(url, allow_redirects=True)
    open('config.py', 'wb').write(r.content)
except ModuleNotFoundError:
    pass

from config import *
config_chapter10()
# This is needed to render the plots in this chapter
from plots.chapter8 import *
from plots.chapter9 import *
from plots.chapter10 import *

Downloading files from GitHub repo to Colab...
Finished!


In [None]:
import copy
import numpy as np

import torch
import torch.optim as optim
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import DataLoader, Dataset, random_split, TensorDataset
from torchvision.transforms import Compose, Normalize, Pad

from data_generation.square_sequences import generate_sequences
from data_generation.image_classification import generate_dataset
from helpers import index_splitter, make_balanced_sampler
from stepbystep.v4 import StepByStep
# These are the classes we built in Chapter 9
from seq2seq import PositionalEncoding, subsequent_mask, EncoderDecoderSelfAttn

In [None]:
device = "cuda" if torch.cuda.is_available() else "cpu"

##Narrow Attention

We used full attention heads to build a multi-headed attention mechanism,
and we called it wide attention. Although this mechanism works well, it gets
prohibitively expensive as the number of dimensions grows.

That’s where narrow attention comes in: each attention head gets a chunk of the
transformed data points (projections) to work with.
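To get a rough feel for the difference in cost, we can count parameters. The sketch below is only back-of-the-envelope arithmetic, and it rests on two assumptions that are not spelled out in this section: that every wide attention head owns full-size projections for "queries," "keys," and "values," and that the concatenated context vectors are projected back to `d_model` dimensions. The helper names are made up for illustration.

```python
def linear_params(n_in, n_out):
    """Weights plus biases of a single affine transformation."""
    return n_in * n_out + n_out


def wide_attention_params(n_heads, d_model):
    # Assumption: each head has its own full-size Q, K, and V projections,
    # and the concatenated contexts (n_heads * d_model) are projected back
    per_head = 3 * linear_params(d_model, d_model)
    output = linear_params(n_heads * d_model, d_model)
    return n_heads * per_head + output


def narrow_attention_params(n_heads, d_model):
    # A single set of Q, K, and V projections is chunked across the heads,
    # so the count does not depend on n_heads
    projections = 3 * linear_params(d_model, d_model)
    output = linear_params(d_model, d_model)
    return projections + output


for d_model in (4, 64, 512):
    wide = wide_attention_params(n_heads=8, d_model=d_model)
    narrow = narrow_attention_params(n_heads=8, d_model=d_model)
    print(f"d_model={d_model:4d}  wide={wide:9d}  narrow={narrow:9d}")
```

Under these assumptions, the wide version grows with the number of heads, while the narrow version does not.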

###Chunking

The attention heads do not
use chunks of the original data points, but rather those of their
projections.

Why?

To understand why, let’s take an example of an affine transformation, one that
generates "values" ($v_0$) from the first data point ($x_0$).

![](https://github.com/rahiakela/deep-learning-research-and-practice/blob/main/deep-learning-with-pytorch-step-by-step/Part-III-NLP/images/attn_narrow_transf.png?raw=1)

The transformation above takes a single data point of four dimensions (features) and turns it into a "value" (also with four dimensions) that’s going to be used in the attention mechanism.

At first sight, it may look like we’ll get the same result whether we split the inputs
into chunks or we split the projections into chunks. But that’s definitely not the case.
So, let’s zoom in and look at the individual weights inside that transformation.

![](https://github.com/rahiakela/deep-learning-research-and-practice/blob/main/deep-learning-with-pytorch-step-by-step/Part-III-NLP/images/multihead_chunking.png?raw=1)

On the left, the correct approach: It computes the projections first and chunks them later. It is clear that each value in the projection (from $v_{00}$ to $v_{03}$) is a linear combination of all features in the data point.

Since each head is working with a subset of the projected
dimensions, these projected dimensions may end up
representing different aspects of the underlying data. For
natural language processing tasks, for example, some attention
heads may correspond to linguistic notions of syntax and
coherence. A particular head may attend to the direct objects of
verbs, while another head may attend to objects of prepositions,
and so on.

Now, compare it to the wrong approach, on the right: By chunking it first, each value in the projection is a linear combination of a subset of the features only.

Why is it so bad?

First, it is a simpler model (the wrong approach has only eight weights while the correct one has sixteen), so its learning capacity is limited. Second, since each head can only look at a subset of the features, it simply cannot learn about long-range dependencies in the inputs.
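We can check both points with a small experiment. This is a sketch, not code from the chapter: the two approaches are modeled with bias-free `nn.Linear` layers, and the names `project_then_chunk` and `chunk_then_project` are made up for illustration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
x0 = torch.randn(1, 4)  # one data point with four features

# Correct approach: project all four features, then chunk the projection
full_projection = nn.Linear(4, 4, bias=False)  # 4 x 4 = 16 weights
# Wrong approach: chunk the features, then project each chunk on its own
chunk_projections = nn.ModuleList(
    [nn.Linear(2, 2, bias=False) for _ in range(2)]  # 2 x (2 x 2) = 8 weights
)


def project_then_chunk(x):
    return full_projection(x).chunk(2, dim=-1)


def chunk_then_project(x):
    chunks = x.chunk(2, dim=-1)
    return tuple(proj(c) for proj, c in zip(chunk_projections, chunks))


n_full = sum(p.numel() for p in full_projection.parameters())
n_chunked = sum(p.numel() for p in chunk_projections.parameters())
print(n_full, n_chunked)

# Change only the last feature and see which "first head" chunks react
x0_perturbed = x0.clone()
x0_perturbed[0, 3] += 1.0

with torch.no_grad():
    changed_correct = not torch.allclose(
        project_then_chunk(x0)[0], project_then_chunk(x0_perturbed)[0]
    )
    changed_wrong = not torch.allclose(
        chunk_then_project(x0)[0], chunk_then_project(x0_perturbed)[0]
    )
print(changed_correct, changed_wrong)
```

In the wrong approach, the first chunk is computed from the first two features only, so it cannot react to a change in the last feature, no matter what the weights are.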

Now, let’s use a source sequence of length two as input, with each data point
having four features like the chunking example above, to illustrate our new self-attention mechanism.

![](https://github.com/rahiakela/deep-learning-research-and-practice/blob/main/deep-learning-with-pytorch-step-by-step/Part-III-NLP/images/narrow-attention1.png?raw=1)

The flow of information goes like this:

* Both data points (x0 and x1) go through distinct affine transformations to generate the corresponding "values" (v0 and v1) and "keys" (k0 and k1), which we’ll be calling projections.

* Both data points also go through another affine transformation to generate the corresponding "queries" (q0 and q1).

* Each projection has the same number of dimensions as the inputs (four).

* Instead of simply using the projections, as former attention heads did, this attention head uses only a chunk of the projections to compute the context vector.

* Since projections have four dimensions, let’s split them into two chunks—blue (left) and green (right)—of two dimensions each.

* The first attention head uses only blue chunks to compute its context vector, which, like the projections, has only two dimensions.

* The second attention head (not depicted in the figure above) uses the green chunks to compute the other half of the context vector, which, in the end, has the desired dimension.

* Like the former multi-headed attention mechanism, the context vector goes through a feed-forward network to generate the "hidden states" (only the first one is depicted in the figure above).
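
The steps above can be sketched in a few lines for a single sequence. This is an illustration under assumptions, not the class we will build next: the layer names (`proj_q`, `proj_k`, `proj_v`) are made up, the weights are random, and the final feed-forward step is left out. The second half repeats the computation for all heads at once using a reshape, to show that both routes give the same context vector.

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
n_heads, d_model = 2, 4
d_k = d_model // n_heads  # dimensions per chunk

x = torch.randn(1, 2, d_model)  # N=1 sequence, L=2 data points, D=4 features

# One affine transformation each for "queries," "keys," and "values"
proj_q = nn.Linear(d_model, d_model)
proj_k = nn.Linear(d_model, d_model)
proj_v = nn.Linear(d_model, d_model)

with torch.no_grad():
    q, k, v = proj_q(x), proj_k(x), proj_v(x)  # each N, L, D

    # One head at a time, each using only its own chunk of the projections
    contexts = []
    for head in range(n_heads):
        cols = slice(head * d_k, (head + 1) * d_k)
        q_h, k_h, v_h = q[..., cols], k[..., cols], v[..., cols]  # N, L, d_k
        scores = q_h @ k_h.transpose(-2, -1) / math.sqrt(d_k)  # N, L, L
        alphas = F.softmax(scores, dim=-1)
        contexts.append(alphas @ v_h)  # N, L, d_k
    context = torch.cat(contexts, dim=-1)  # N, L, D

    # Same computation for all heads at once
    def make_chunks(t):
        # N, L, D -> N, n_heads, L, d_k
        return t.view(t.size(0), t.size(1), n_heads, d_k).transpose(1, 2)

    scores = make_chunks(q) @ make_chunks(k).transpose(-2, -1) / math.sqrt(d_k)
    batched = F.softmax(scores, dim=-1) @ make_chunks(v)  # N, n_heads, L, d_k
    batched = batched.transpose(1, 2).contiguous().view(1, -1, d_model)

print(context.shape)
print(torch.allclose(context, batched, atol=1e-5))
```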



###Multi-Headed Attention

The new multi-headed attention class is more than a combination of both the
Attention and MultiHeadedAttention classes: It implements the chunking of the projections and introduces dropout for attention scores.

In [None]:
class MultiHeadAttention(nn.Module):

  def __init__(self, n_heads, d_model, dropout=0.1):
    super(MultiHeadAttention, self).__init__()

    self.n_heads = n_heads
    self.d_model = d_model
    self.d_k = d_model // n_heads

    # Affine transformations for Q, K, and V
    self.QUERY = nn.Linear(d_model, d_model)
    self.KEY = nn.Linear(d_model, d_model)
    self.VALUE = nn.Linear(d_model, d_model)

    self.linear_layer = nn.Linear(d_model, d_model)
    self.dropout = nn.Dropout(p=dropout)
    self.alphas = None

  def make_chunks(self, x):
    batch_size, seq_len = x.size(0), x.size(1)
    # N, L, D -> N, L, n_heads, d_k (splits the last dimension in two)
    x = x.view(batch_size, seq_len, self.n_heads, self.d_k)
    # N, L, n_heads, d_k -> N, n_heads, L, d_k
    x = x.transpose(1, 2)
    return x

  def init_keys(self, key):
    # N, n_heads, L, d_k
    # Chunking the key, and value projections
    self.projection_key = self.make_chunks(self.KEY(key))
    self.projection_value = self.make_chunks(self.VALUE(key))

  def score_function(self, query):
    # scaled dot product
    # N, n_heads, L, d_k x N, n_heads, d_k, L -> N, n_heads, L, L
    # Chunking the query projections
    projection_query = self.make_chunks(self.QUERY(query))
    dot_products = torch.matmul(projection_query, self.projection_key.transpose(-2, -1))
    scores = dot_products / np.sqrt(self.d_k)
    return scores

  def attention(self, query, mask=None):
    # Query is batch-first: N, L, D
    # Score function will generate scores for each head
    scores = self.score_function(query)  # N, n_heads, L, L
    if mask is not None:
      scores = scores.masked_fill(mask==0, -1e9)
    alphas = F.softmax(scores, dim=-1)  # N, n_heads, L, L
    alphas = self.dropout(alphas)
    self.alphas = alphas.detach()

    # N, n_heads, L, L x N, n_heads, L, d_k -> N, n_heads, L, d_k
    context = torch.matmul(alphas, self.projection_value)
    return context

  def output_function(self, contexts):
    # N, L, D
    output = self.linear_layer(contexts)  # N, L, D
    return output

  def forward(self, query, mask=None):
    if mask is not None:
      # N, 1, L, L - every head uses the same mask
      mask = mask.unsqueeze(1)

    # N, n_heads, L, d_k
    contexts = self.attention(query, mask=mask)
    # Concatenating the context vectors
    # N, L, n_heads, d_k
    contexts = contexts.transpose(1, 2).contiguous()
    # N, L, n_heads * d_k = N, L, d_model
    contexts = contexts.view(query.size(0), -1, self.d_model)
    # N, L, d_model
    output = self.output_function(contexts)
    return output

We can generate some dummy points corresponding to a mini-batch of 16
sequences (N), each sequence having two data points (L), each data point having
four features (F):

In [None]:
dummy_points = torch.randn(16, 2, 4) # N, L, F
multi_head_attention = MultiHeadAttention(n_heads=2, d_model=4, dropout=0.0)
multi_head_attention.init_keys(dummy_points)
output = multi_head_attention(dummy_points) # N, L, D
output.shape

torch.Size([16, 2, 4])

Since we’re using the data points as "keys," "values," and "queries," this is a self-attention mechanism.
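
The example above doesn’t use a mask, but `forward()` accepts one and adds a dimension to it so that every head shares the same mask. Here is a minimal sketch of that broadcasting on its own, assuming random scores and a made-up lower-triangular mask (each position may only attend to itself and to earlier positions); the variable names are for illustration only.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
N, n_heads, L = 1, 2, 3
scores = torch.randn(N, n_heads, L, L)  # stand-in for scaled dot products

# Hypothetical mask: position i may attend to positions <= i
mask = torch.tril(torch.ones(L, L)).unsqueeze(0)  # N, L, L
mask = mask.unsqueeze(1)                          # N, 1, L, L

# The singleton dimension broadcasts over the heads
masked_scores = scores.masked_fill(mask == 0, -1e9)
alphas = F.softmax(masked_scores, dim=-1)  # N, n_heads, L, L
print(alphas[0, 0])
print(alphas[0, 1])
```

Both heads end up with zeros above the diagonal, and every row still adds up to one.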

The figure below depicts a multi-headed attention mechanism with its two heads,
blue (left) and green (right), and the first data point being used as a "query" to
generate the first "hidden state" ($h_0$).

![](https://github.com/rahiakela/deep-learning-research-and-practice/blob/main/deep-learning-with-pytorch-step-by-step/Part-III-NLP/images/narrow-attention2.png?raw=1)


The important thing to remember here is: "**Multi-headed attention chunks the projections, not the inputs.**"

###Stacking Encoders and Decoders

Let’s make our encoder-decoder architecture deeper by stacking two encoders on
top of one another, and then do the same with two decoders.

![](https://github.com/rahiakela/deep-learning-research-and-practice/blob/main/deep-learning-with-pytorch-step-by-step/Part-III-NLP/images/stacking-encoders-decoders1.png?raw=1)

The output of one encoder feeds the next, and the last encoder outputs states as usual. These states feed the cross-attention mechanism of all stacked
decoders. The output of one decoder feeds the next, and the last decoder outputs predictions as usual.

The figure above represents an encoder-decoder architecture with two "layers"
each. But we’re not stopping there: We’re stacking six "layers"!

![](https://github.com/rahiakela/deep-learning-research-and-practice/blob/main/deep-learning-with-pytorch-step-by-step/Part-III-NLP/images/stacking-encoders-decoders2.png?raw=1)

By the way, that’s exactly how a Transformer is built!

Cool! Is this a Transformer already then?

Not yet, no. We need to work further on the "sub-layers" to transform the architecture above into a real Transformer.

###Wrapping "Sub-Layers"

As our model grows deeper, with many stacked "layers," we’re going to run into
familiar issues, like the vanishing gradients problem. In computer vision models, this issue was successfully addressed by the addition of other components, like batch normalization and residual connections.

But we also know that dropout works pretty well as a regularizer, so we can throw
that in the mix as well.

We’ll wrap each and every "sub-layer" with them! Cool, right? But that brings up another question: How to wrap them?

It turns out, we can wrap a "sub-layer" in one of two ways: norm-last or norm-first.

![](https://github.com/rahiakela/deep-learning-research-and-practice/blob/main/deep-learning-with-pytorch-step-by-step/Part-III-NLP/images/stacking-encoders-decoders3.png?raw=1)

Let’s turn the diagrams above into equations:

$$
\Large
\begin{aligned}
&\text{outputs}_{\text{norm-last}}=&\text{norm(inputs + dropout(sublayer(inputs)))}
\\
&\text{outputs}_{\text{norm-first}}=&\text{inputs + dropout(sublayer(norm(inputs)))}
\end{aligned}
$$

The equations are almost the same, except for the fact that the norm-last wrapper normalizes the outputs and the norm-first wrapper normalizes the inputs. That’s a small, yet important, difference.
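
In code, the two wrappers could look like the sketch below. It assumes a plain linear layer as a stand-in for the "sub-layer" and uses layer normalization for the norm; the function names `norm_last` and `norm_first` are made up for illustration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
d_model = 4
inputs = torch.randn(16, 2, d_model)  # N, L, D

sublayer = nn.Linear(d_model, d_model)  # stand-in for attention or feed-forward
norm = nn.LayerNorm(d_model)
dropout = nn.Dropout(p=0.1)


def norm_last(inputs):
    # norm(inputs + dropout(sublayer(inputs)))
    return norm(inputs + dropout(sublayer(inputs)))


def norm_first(inputs):
    # inputs + dropout(sublayer(norm(inputs)))
    return inputs + dropout(sublayer(norm(inputs)))


out_last = norm_last(inputs)
out_first = norm_first(inputs)
print(out_last.shape, out_first.shape)
# The norm-last outputs are normalized across features: means are (almost) zero
print(out_last.mean(dim=-1).abs().max())
```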

Why?

If you’re using positional encoding, you want to normalize your inputs, so norm-first is more convenient.

Since norm-first leaves the output of each "sub-layer" unnormalized, we’ll normalize the final outputs; that is, the output of the last "layer".

From now on, we’re sticking with norm-first, thus normalizing the inputs:


$$
\Large
\begin{aligned}
&\text{outputs}_{\text{norm-first}}=&\text{inputs + dropout(sublayer(norm(inputs)))}
\end{aligned}
$$

By wrapping each and every "sub-layer" inside both encoder "layers" and decoder "layers," we’ll arrive at the desired Transformer architecture.


###Transformer Encoder

![](https://github.com/rahiakela/deep-learning-research-and-practice/blob/main/deep-learning-with-pytorch-step-by-step/Part-III-NLP/images/stacking-encoders-decoders4.png?raw=1)

On the left, the encoder uses a norm-last wrapper, and its output (the encoder’s states) is given by:

$$
\large
\begin{aligned}
&\text{outputs}_{\text{norm-last}}=&\text{norm}(\underbrace{\text{norm(inputs + att(inputs))}}_{\text{Output of SubLayer}_0} + \text{ffn}(\underbrace{\text{norm(inputs + att(inputs))}}_{\text{Output of SubLayer}_0}))
\end{aligned}
$$

On the right, the encoder uses a norm-first wrapper, and its output (the encoder’s states) is given by:

$$
\large
\begin{aligned}
&\text{outputs}_{\text{norm-first}}=&\underbrace{\text{inputs + att(norm(inputs))}}_{\text{Output of SubLayer}_0}+\text{ffn(norm(}\underbrace{\text{inputs + att(norm(inputs))}}_{\text{Output of SubLayer}_0}))
\end{aligned}
$$

The norm-first wrapper allows the inputs to flow unimpeded (the inputs aren’t
normalized) all the way to the top while adding the results of each "sub-layer" along
the way.
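
One way to see this is to silence the "sub-layers" and check what comes out at the top. The sketch below is a toy experiment under assumptions: the attention and feed-forward "sub-layers" are replaced by linear layers with all weights set to zero (the `zero_sublayer` helper is made up), and dropout is left out, as in the equations above.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
d_model = 4
inputs = torch.randn(1, 2, d_model)  # N, L, D


def zero_sublayer():
    # Stand-in "sub-layer" that contributes nothing to its output
    layer = nn.Linear(d_model, d_model)
    nn.init.zeros_(layer.weight)
    nn.init.zeros_(layer.bias)
    return layer


att, ffn = zero_sublayer(), zero_sublayer()
norm1, norm2 = nn.LayerNorm(d_model), nn.LayerNorm(d_model)

with torch.no_grad():
    # norm-first: the inputs travel along the residual path untouched
    sub0 = inputs + att(norm1(inputs))
    out_first = sub0 + ffn(norm2(sub0))

    # norm-last: the inputs are normalized at every step
    sub0 = norm1(inputs + att(inputs))
    out_last = norm2(sub0 + ffn(sub0))

print(torch.equal(out_first, inputs))
print(torch.allclose(out_last, inputs))
```

With silent "sub-layers," the norm-first encoder hands back its inputs exactly, while the norm-last encoder returns a normalized version of them.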

Which one is best?

There is no straight answer to this question. It actually resembles the debate about placing the batch normalization layer before or after the
activation function.

Let’s see it in code, starting with the "layer," and all its wrapped "sub-layers":

In [None]:
class EncoderLayer(nn.Module):

  def __init__(self, n_heads, d_model, ff_units, dropout=0.1):
    super().__init__()

    self.n_heads = n_heads
    self.d_model = d_model
    self.ff_units = ff_units
    self.dropout = dropout

    self.self_attention_heads = MultiHeadAttention(n_heads, d_model, dropout)
    self.feed_forward_network = nn.Sequential(
        nn.Linear(d_model, ff_units),
        nn.ReLU(),
        nn.Dropout(dropout),
        nn.Linear(ff_units, d_model)
    )

    # define layer normalization
    self.norm1 = nn.LayerNorm(d_model)
    self.norm2 = nn.LayerNorm(d_model)
    self.dropout1 = nn.Dropout(dropout)
    self.dropout2 = nn.Dropout(dropout)

  def forward(self, query, mask=None):
    # Sublayer #0
    # Norm
    norm_query = self.norm1(query)
    # Multi-headed Attention
    self.self_attention_heads.init_keys(norm_query)
    states = self.self_attention_heads(norm_query, mask)
    # Add
    attention = query + self.dropout1(states)

    # Sublayer #1
    # Norm
    norm_attention = self.norm2(attention)
    # Feed Forward
    output = self.feed_forward_network(norm_attention)
    # Add
    output = attention + self.dropout2(output)
    return output
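
The `nn.LayerNorm(d_model)` layers used above standardize each data point across its features, independently of the other data points in the sequence or mini-batch. The sketch below compares PyTorch's layer with a manual computation; the dummy data and variable names are made up for illustration, and the comparison assumes a freshly created layer (its learnable scale and shift start at one and zero).

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
d_model = 4
x = torch.randn(16, 2, d_model) * 3 + 5  # N, L, D with arbitrary scale and shift

layer_norm = nn.LayerNorm(d_model)
normed = layer_norm(x)

# Manual equivalent: statistics over the last (feature) dimension only
mean = x.mean(dim=-1, keepdim=True)
var = x.var(dim=-1, unbiased=False, keepdim=True)  # biased variance
manual = (x - mean) / torch.sqrt(var + layer_norm.eps)

print(torch.allclose(normed, manual, atol=1e-5))
print(normed.mean(dim=-1).abs().max())
```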

Now we can stack a bunch of "layers" like that to build an actual encoder. Its constructor takes an instance of an EncoderLayer, the number
of "layers" we’d like to stack on top of one another, and a max length of the source
sequence that’s going to be used for the positional encoding.

The final outputs are, as usual, the states of the
encoder that will feed the cross-attention mechanism of every "layer" of the
decoder.

In [None]:
class TransformerEncoder(nn.Module):

  def __init__(self, encoder_layer, n_layers=1, max_len=100):
    super().__init__()

    self.d_model = encoder_layer.d_model
    self.positional_encoding = PositionalEncoding(max_len, self.d_model)
    self.norm_layer = nn.LayerNorm(self.d_model)
    self.layers = nn.ModuleList([copy.deepcopy(encoder_layer) for _ in range(n_layers)])

  def forward(self, query, mask=None):
    # Positional Encoding
    x = self.positional_encoding(query)
    for layer in self.layers:
      x = layer(x, mask)
    # Norm
    return self.norm_layer(x)
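
A note on `copy.deepcopy()` in the constructor: it gives every stacked "layer" its own parameters, starting from the same initial values. The sketch below illustrates the difference from simply repeating the same instance, using a small linear layer as a stand-in for an `EncoderLayer`; it is only an illustration, not part of the model.

```python
import copy
import torch
import torch.nn as nn

torch.manual_seed(0)
layer = nn.Linear(4, 4)  # stand-in for an EncoderLayer instance (20 parameters)

# Deep copies: three independent "layers" with identical initial values
stacked = nn.ModuleList([copy.deepcopy(layer) for _ in range(3)])
# Repeating the same instance: three references to a single "layer"
tied = nn.ModuleList([layer] * 3)

same_values = all(torch.equal(l.weight, layer.weight) for l in stacked)
shares_weights = any(l.weight is layer.weight for l in stacked)
n_stacked = sum(p.numel() for p in stacked.parameters())
n_tied = sum(p.numel() for p in tied.parameters())

print(same_values, shares_weights)
print(n_stacked, n_tied)
```

The deep copies hold three times as many parameters as the repeated instance, since `parameters()` counts shared tensors only once.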

###Transformer Decoder