## 11.8. Transformers for Vision

### 11.8.1. Model

Start with an image of height h, width w, and c channels. The image is split into patches of size p x p, which turns it into m = h x w / (p x p) patches. Each patch is then flattened into a vector of length c x p x p.
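The patch arithmetic can be checked with plain tensor operations. This is a minimal sketch, not the notebook's method: the sizes (c = 3, h = w = 96, p = 16) are assumed for illustration, and `Tensor.unfold` is used only to cut the image into non-overlapping tiles.

```python
import torch

# Assumed example sizes (illustrative only): 3-channel 96x96 image, 16x16 patches.
c, h, w, p = 3, 96, 96, 16
img = torch.arange(c * h * w, dtype=torch.float32).reshape(c, h, w)

# Number of patches: m = (h / p) * (w / p) = h * w / (p * p)
m = (h // p) * (w // p)

# unfold(dim, size, step) slides a window of `size` with stride `step` along
# `dim` and appends the window contents as a new trailing dimension.
tiles = img.unfold(1, p, p).unfold(2, p, p)   # (c, h/p, w/p, p, p)

# Put the patch grid first, then flatten each patch to length c * p * p.
patches = tiles.permute(1, 2, 0, 3, 4).reshape(m, c * p * p)

print(m, patches.shape)  # 36 torch.Size([36, 768])
```

With these sizes there are 36 patches, each flattened to 3 x 16 x 16 = 768 values.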

A special vector representing the cls (class) token is prepended to the m patch vectors, giving m + 1 vectors in total.

Positional embeddings are added to these m + 1 vectors, and the result is fed into the Transformer encoder.

Finally, the first output vector (the one at the cls position) is passed through normalization and an MLP to produce the class label.

In [2]:
import torch
from torch import nn
from d2l import torch as d2l

### 11.8.2. Patch Embedding



In [3]:
class PatchEmbedding(nn.Module):
    def __init__(self, img_size=96, patch_size=16, num_hiddens=512):
        super().__init__()
        def _make_tuple(x):
            if not isinstance(x, (list, tuple)):
                return (x, x)
            return x
        img_size, patch_size = _make_tuple(img_size), _make_tuple(patch_size)
        self.num_patches = (img_size[0] // patch_size[0]) * (
            img_size[1] // patch_size[1])
        self.conv = nn.LazyConv2d(num_hiddens, kernel_size=patch_size,
                                  stride=patch_size)

    def forward(self, X):
        # Output shape: (batch size, no. of patches, no. of channels)
        # e.g. (4, 3, 96, 96) -> (4, 512, 6, 6) -> (4, 36, 512), where 36 is
        # the number of patches (queries) and 512 is the embedding dimension d
        return self.conv(X).flatten(2).transpose(1, 2)

In [4]:
img_size, patch_size, num_hiddens, batch_size = 96, 16, 512, 4
patch_emb = PatchEmbedding(img_size, patch_size, num_hiddens)
X = torch.zeros(batch_size, 3, img_size, img_size)
print(patch_emb(X).shape,
      (batch_size, (img_size//patch_size)**2, num_hiddens))

torch.Size([4, 36, 512]) (4, 36, 512)
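A convolution whose kernel size and stride both equal the patch size computes the same thing as flattening each patch and applying one shared linear layer. The sketch below checks this numerically under assumed sizes (d = 32 instead of 512, to keep it small); it uses a plain `nn.Conv2d` rather than the `PatchEmbedding` class above so that it stands alone.

```python
import torch
from torch import nn

torch.manual_seed(0)
c, p, d = 3, 16, 32                       # assumed example sizes
conv = nn.Conv2d(c, d, kernel_size=p, stride=p)
X = torch.randn(2, c, 96, 96)

# Path 1: strided convolution, as in the patch embedding above.
out_conv = conv(X).flatten(2).transpose(1, 2)             # (2, 36, d)

# Path 2: cut out the patches explicitly, flatten each one, and apply a
# linear map that reuses the convolution's weights and bias.
patches = X.unfold(2, p, p).unfold(3, p, p)               # (2, c, 6, 6, p, p)
patches = patches.permute(0, 2, 3, 1, 4, 5).reshape(2, 36, c * p * p)
out_lin = patches @ conv.weight.reshape(d, -1).T + conv.bias

# The two paths should agree up to floating-point tolerance.
print(torch.allclose(out_conv, out_lin, atol=1e-4))
```

This is why the patch embedding can be a single convolution layer: splitting, flattening, and projecting happen in one call.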




### 11.8.3. Vision Transformer Encoder

The MLP of ViT plays the same role as the position-wise FFN of the original Transformer encoder, but differs from it in two ways. First, here the activation function uses the Gaussian error linear unit (GELU), which can be considered as a smoother version of the ReLU used in the position-wise FFN. Second, dropout is applied to the output of each fully connected layer in the MLP for regularization, whereas the position-wise FFN itself applies no dropout.
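To see what "smoother version of the ReLU" means, the sketch below evaluates both activations on a few assumed sample points and checks `nn.GELU` against its closed form x * Phi(x), where Phi is the standard normal CDF.

```python
import torch
from torch import nn

x = torch.tensor([-3.0, -1.0, 0.0, 1.0, 3.0])   # assumed sample points

relu_out = nn.ReLU()(x)
gelu_out = nn.GELU()(x)

# Closed form of the exact GELU: x * Phi(x), with Phi written via erf.
gelu_manual = 0.5 * x * (1 + torch.erf(x / 2 ** 0.5))

print(relu_out)
print(gelu_out)
print(torch.allclose(gelu_out, gelu_manual, atol=1e-6))
```

ReLU maps every negative input to exactly zero, while GELU lets small negative values through (roughly -0.16 at x = -1) and approaches the identity for large positive inputs.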

The vision Transformer encoder block implementation just follows the pre-normalization design in Fig. 11.8.1, where normalization is applied right before multi-head attention or the MLP (that is, "norm first, then attention"). In contrast to post-normalization (“add & norm” in Fig. 11.7.1), where normalization is placed right after residual connections, pre-normalization leads to more effective or efficient training for Transformers.
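The difference between the two orderings comes down to where `LayerNorm` sits relative to the residual addition. The class below is a hypothetical helper written only for this comparison; it uses PyTorch's built-in `nn.MultiheadAttention` rather than `d2l.MultiHeadAttention`, and the sizes are assumed.

```python
import torch
from torch import nn

class ResidualSublayers(nn.Module):
    """Hypothetical helper contrasting pre- and post-normalization."""
    def __init__(self, d, num_heads, pre_norm=True):
        super().__init__()
        self.pre_norm = pre_norm
        self.ln1, self.ln2 = nn.LayerNorm(d), nn.LayerNorm(d)
        self.attn = nn.MultiheadAttention(d, num_heads, batch_first=True)
        self.mlp = nn.Sequential(
            nn.Linear(d, 2 * d), nn.GELU(), nn.Linear(2 * d, d))

    def forward(self, X):
        if self.pre_norm:
            # Pre-norm: normalize the sublayer input; the residual path
            # itself is never normalized.
            Y = self.ln1(X)
            X = X + self.attn(Y, Y, Y, need_weights=False)[0]
            return X + self.mlp(self.ln2(X))
        # Post-norm ("add & norm"): add the residual first, then normalize.
        X = self.ln1(X + self.attn(X, X, X, need_weights=False)[0])
        return self.ln2(X + self.mlp(X))

X = torch.ones(2, 100, 24)
outs = {}
for pre_norm in (True, False):
    blk = ResidualSublayers(24, 8, pre_norm).eval()
    with torch.no_grad():
        outs[pre_norm] = blk(X)
    print(pre_norm, outs[pre_norm].shape)
```

Both variants preserve the input shape. In the post-norm variant the block output is itself normalized, whereas in the pre-norm variant the input passes through the residual path unchanged.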

In [5]:
class ViTMLP(nn.Module):
    def __init__(self, mlp_num_hiddens, mlp_num_outputs, dropout=0.5):
        super().__init__()
        self.dense1 = nn.LazyLinear(mlp_num_hiddens)
        self.gelu = nn.GELU()
        self.dropout1 = nn.Dropout(dropout)
        self.dense2 = nn.LazyLinear(mlp_num_outputs)
        self.dropout2 = nn.Dropout(dropout)

    def forward(self, x):
        return self.dropout2(self.dense2(self.dropout1(self.gelu(
            self.dense1(x)))))

In [6]:
class ViTBlock(nn.Module):
    def __init__(self, num_hiddens, norm_shape, mlp_num_hiddens,
                 num_heads, dropout, use_bias=False):
        super().__init__()
        self.ln1 = nn.LayerNorm(norm_shape)
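        # Note: this call uses the older d2l signature, which appears to be
        # (key_size, query_size, value_size, num_hiddens, num_heads, dropout,
        # bias); newer d2l releases take only (num_hiddens, num_heads,
        # dropout, bias).
        self.attention = d2l.MultiHeadAttention(
            num_hiddens, num_hiddens, num_hiddens, num_hiddens, num_heads,
            dropout, use_bias)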
        self.ln2 = nn.LayerNorm(norm_shape)
        self.mlp = ViTMLP(mlp_num_hiddens, num_hiddens, dropout)

    def forward(self, X, valid_lens=None):
        X = X + self.attention(*([self.ln1(X)] * 3), valid_lens)
        return X + self.mlp(self.ln2(X))

In [7]:
X = torch.ones((2, 100, 24))
encoder_blk = ViTBlock(24, 24, 48, 8, 0.5)
encoder_blk.eval()
print(encoder_blk(X).shape, X.shape)

torch.Size([2, 100, 24]) torch.Size([2, 100, 24])


### 11.8.4. Putting It All Together

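Putting the steps above together: after the patch embedding, the cls token is prepended and the learnable positional embeddings are added, and only then is dropout applied.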

In [8]:
class ViT(nn.Module):
    """Vision Transformer."""
    def __init__(self, img_size, patch_size, num_hiddens, mlp_num_hiddens,
                 num_heads, num_blks, emb_dropout, blk_dropout, lr=0.1,
                 use_bias=False, num_classes=10):
        super().__init__()
        # self.save_hyperparameters()  # d2l.Module helper; nn.Module lacks it
        self.patch_embedding = PatchEmbedding(
            img_size, patch_size, num_hiddens)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, num_hiddens))
        num_steps = self.patch_embedding.num_patches + 1  # Add the cls token
        # Positional embeddings are learnable
        self.pos_embedding = nn.Parameter(
            torch.randn(1, num_steps, num_hiddens))
        self.dropout = nn.Dropout(emb_dropout)
        self.blks = nn.Sequential()
        for i in range(num_blks):
            self.blks.add_module(f"{i}", ViTBlock(
                num_hiddens, num_hiddens, mlp_num_hiddens,
                num_heads, blk_dropout, use_bias))
        self.head = nn.Sequential(nn.LayerNorm(num_hiddens),
                                  nn.Linear(num_hiddens, num_classes))

    def forward(self, X):
        X = self.patch_embedding(X)
        X = torch.cat((self.cls_token.expand(X.shape[0], -1, -1), X), 1)
        X = self.dropout(X + self.pos_embedding)
        for blk in self.blks:
            X = blk(X)
        return self.head(X[:, 0])

In [9]:
img_size, patch_size = 96, 16
num_hiddens, mlp_num_hiddens, num_heads, num_blks = 512, 2048, 8, 2
emb_dropout, blk_dropout, lr = 0.1, 0.1, 0.1
model = ViT(img_size, patch_size, num_hiddens, mlp_num_hiddens, num_heads,
            num_blks, emb_dropout, blk_dropout, lr)

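AttributeError: 'ViT' object has no attribute 'save_hyperparameters'

Note: this error appears to come from a run in which `self.save_hyperparameters()` was still called in `__init__`. That method is provided by d2l's `HyperParameters` mixin (inherited by `d2l.Module` and `d2l.Classifier`), not by `nn.Module`, so with the call commented out as in the class above, the model should be constructed without this error.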