<a href="https://colab.research.google.com/github/rahiakela/deep-learning-research-and-practice/blob/main/deep-learning-with-pytorch-step-by-step/Part-III-NLP/04_transformer.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

##Transformer

We’re actually quite close to developing our own version of the famous
Transformer model. The encoder-decoder architecture with positional encoding
is missing only a few details to effectively "transform and roll out" :-)

First, we need to revisit the multi-headed attention mechanism to make it less
computationally expensive by using narrow attention. Then, we’ll learn about a
new kind of normalization: layer normalization.

Finally, we’ll add some more bells
and whistles: dropout, residual connections, and more "layers".

##Setup

In [None]:
try:
    import google.colab
    import requests
    url = 'https://raw.githubusercontent.com/dvgodoy/PyTorchStepByStep/master/config.py'
    r = requests.get(url, allow_redirects=True)
    open('config.py', 'wb').write(r.content)
except ModuleNotFoundError:
    pass

from config import *
config_chapter10()
# This is needed to render the plots in this chapter
from plots.chapter8 import *
from plots.chapter9 import *
from plots.chapter10 import *

In [None]:
import copy
import numpy as np

import torch
import torch.optim as optim
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import DataLoader, Dataset, random_split, TensorDataset
from torchvision.transforms import Compose, Normalize, Pad

from data_generation.square_sequences import generate_sequences
from data_generation.image_classification import generate_dataset
from helpers import index_splitter, make_balanced_sampler
from stepbystep.v4 import StepByStep
# These are the classes we built in Chapter 9
from seq2seq import PositionalEncoding, subsequent_mask, EncoderDecoderSelfAttn

In [None]:
device = "cuda" if torch.cuda.is_available() else "cpu"

##Narrow Attention

We used full attention heads to build a multi-headed attention mechanism, which we
called wide attention. Although this mechanism works well, it gets prohibitively
expensive as the number of dimensions grows, since every head works with the full set of dimensions.

That’s where narrow attention comes in: each attention head gets only a chunk of the
transformed data points (projections) to work with.
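To get a feeling for the difference in cost, the sketch below counts the parameters of the query, key, and value projections in both setups. It assumes that each wide attention head owns its own full-size linear layers, and the sizes (`d_model`, `n_heads`) and the helper `count_params` are illustrative choices, not values from the text.

```python
import torch.nn as nn

def count_params(module):
    # Total number of learnable parameters (weights and biases)
    return sum(p.numel() for p in module.parameters())

d_model, n_heads = 512, 8  # illustrative sizes (assumption)

# Wide attention (assumed layout): every head has its own full-size
# projections for queries, keys, and values
wide = nn.ModuleList([
    nn.ModuleList([nn.Linear(d_model, d_model) for _ in range(3)])
    for _ in range(n_heads)
])

# Narrow attention: a single set of full-size projections,
# whose outputs are later split into n_heads chunks
narrow = nn.ModuleList([nn.Linear(d_model, d_model) for _ in range(3)])

wide_params = count_params(wide)
narrow_params = count_params(narrow)
print(wide_params, narrow_params, wide_params / narrow_params)
```

Under these assumptions, the projections of the wide version cost `n_heads` times as many parameters as those of the narrow version.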

###Chunking

The attention heads do not
use chunks of the original data points, but rather those of their
projections.

Why?

To understand why, let’s take an example of an affine transformation, one that
generates "values" ($v_0$) from the first data point ($x_0$).

![](https://github.com/rahiakela/deep-learning-research-and-practice/blob/main/deep-learning-with-pytorch-step-by-step/Part-III-NLP/images/attn_narrow_transf.png?raw=1)

The transformation above takes a single data point of four dimensions (features) and turns it into a "value" (also with four dimensions) that’s going to be used in the attention mechanism.

At first sight, it may look like we’ll get the same result whether we split the inputs
into chunks or we split the projections into chunks. But that’s definitely not the case.
So, let’s zoom in and look at the individual weights inside that transformation.

![](https://github.com/rahiakela/deep-learning-research-and-practice/blob/main/deep-learning-with-pytorch-step-by-step/Part-III-NLP/images/multihead_chunking.png?raw=1)

On the left, the correct approach: It computes the projections first and chunks them later. It is clear that each value in the projection (from $v_{00}$ to $v_{03}$) is a linear combination of all features in the data point.

Since each head is working with a subset of the projected
dimensions, these projected dimensions may end up
representing different aspects of the underlying data. For
natural language processing tasks, for example, some attention
heads may correspond to linguistic notions of syntax and
coherence. A particular head may attend to the direct objects of
verbs, while another head may attend to objects of prepositions,
and so on.

Now, compare it to the wrong approach, on the right: By chunking the data point first, each value in the projection is a linear combination of a subset of the features only.

Why is it so bad?

First, it is a simpler model (the wrong approach has only eight weights while the correct one has sixteen), so its learning capacity is limited. Second, since each head can only look at a subset of the features, the heads simply cannot learn about long-range dependencies in the inputs.
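The two approaches can be compared in a minimal sketch. It uses bias-free linear layers so that only the weights are counted (a simplification of the affine transformation), and the names `proj`, `proj_a`, and `proj_b` are made up for illustration. Nudging the last feature of the data point changes the first chunk only when the projection is computed before chunking.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
x0 = torch.randn(1, 4)  # a single data point with four features

# Correct approach: project all four features first, chunk later
proj = nn.Linear(4, 4, bias=False)        # 16 weights
v0_chunks = proj(x0).chunk(2, dim=-1)     # two chunks of two dimensions

# Wrong approach: chunk the inputs first, project each chunk on its own
proj_a = nn.Linear(2, 2, bias=False)      # 4 weights
proj_b = nn.Linear(2, 2, bias=False)      # 4 weights
xa, xb = x0.chunk(2, dim=-1)
v0_wrong = torch.cat([proj_a(xa), proj_b(xb)], dim=-1)

# Nudge the LAST feature and check whether the FIRST chunk reacts
x0_perturbed = x0.clone()
x0_perturbed[0, 3] += 1.0
xa_perturbed = x0_perturbed.chunk(2, dim=-1)[0]

first_chunk_correct_changed = not torch.allclose(
    proj(x0_perturbed).chunk(2, dim=-1)[0], v0_chunks[0]
)
first_chunk_wrong_changed = not torch.allclose(
    proj_a(xa_perturbed), proj_a(xa)
)
print(first_chunk_correct_changed, first_chunk_wrong_changed)
```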

Now, let’s use a source sequence of length two as input, with each data point
having four features like the chunking example above, to illustrate our new self-attention mechanism.

![](https://github.com/rahiakela/deep-learning-research-and-practice/blob/main/deep-learning-with-pytorch-step-by-step/Part-III-NLP/images/attn_narrow_first_head.png?raw=1)

The flow of information goes like this:

* Both data points ($x_0$ and $x_1$) go through distinct affine transformations to
generate the corresponding "values" ($v_0$ and $v_1$) and "keys" ($k_0$ and $k_1$), which
we’ll be calling projections.

* Both data points also go through another affine transformation to generate
the corresponding "queries" ($q_0$ and $q_1$).

* Each projection has the same number of dimensions as the inputs (four).

* Instead of simply using the projections, as former attention heads did, this
attention head uses only a chunk of the projections to compute the context
vector.

* Since projections have four dimensions, let’s split them into two chunks—blue (left) and green (right)—of two dimensions each.

* The first attention head uses only the blue chunks to compute its context vector, which, like the chunks, has only two dimensions.

* The second attention head (not depicted in the figure above) uses the green chunks to compute the other half of the context vector. Concatenating the two halves yields a context vector with the desired four dimensions.

* Like the former multi-headed attention mechanism, the context vector goes
through a feed-forward network to generate the "hidden states" (only the first
one is depicted in the figure above).
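The flow above can be sketched for this toy case (a sequence of length two, four features, two heads). This is only an illustration under a few assumptions: the scores are scaled dot products, a single linear layer (`linear_out`) stands in for the feed-forward network, and the helper `make_chunks` is a hypothetical name.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
n_heads, d_model = 2, 4
d_k = d_model // n_heads           # dimensions per chunk (two)

x = torch.randn(1, 2, d_model)     # (batch, sequence length, features)

# One full-size affine transformation per kind of projection
linear_query = nn.Linear(d_model, d_model)
linear_key = nn.Linear(d_model, d_model)
linear_value = nn.Linear(d_model, d_model)
# Assumption: a single linear layer plays the role of the feed-forward network
linear_out = nn.Linear(d_model, d_model)

def make_chunks(t):
    # (N, L, D) -> (N, L, n_heads, d_k) -> (N, n_heads, L, d_k)
    batch, seq_len, _ = t.shape
    return t.view(batch, seq_len, n_heads, d_k).transpose(1, 2)

# Project first, chunk later
q = make_chunks(linear_query(x))
k = make_chunks(linear_key(x))
v = make_chunks(linear_value(x))

# Each head computes scaled dot-product attention on its own chunk
scores = q @ k.transpose(-2, -1) / d_k ** 0.5   # (N, n_heads, L, L)
alphas = F.softmax(scores, dim=-1)
context = alphas @ v                            # (N, n_heads, L, d_k)

# Concatenate the heads' context vectors back to d_model dimensions
context = context.transpose(1, 2).contiguous().view(1, 2, d_model)
out = linear_out(context)                       # "hidden states"
print(context.shape, out.shape)
```

Index 0 along the head dimension corresponds to the blue chunks and index 1 to the green ones, so both heads are computed in a single batched operation.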



###Multi-Headed Attention