### Vision Transformer
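Vision Transformer (ViT) is an attention-based image classification model proposed by Alexey Dosovitskiy et al. in the paper "An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale". The core idea is to split an image into fixed-size patches, turn those patches into a sequence, and process that sequence with a Transformer architecture.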

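The main architectural points of the Vision Transformer are:

1. **Image Patching**:
   - The input image is split into non-overlapping patches of a fixed size. Each patch is usually a square region, obtained by dividing the image into a grid. These patches are the model's input.

2. **Embedding Layer**:
   - Each patch is converted into a vector that serves as its representation, usually by a single learnable linear projection applied to the flattened patch. This turns the pixel content of each patch into a vector the model can work with.

3. **Positional Embedding**:
   - To inject information about where each patch sits, ViT adds a position embedding to the corresponding patch representation. This is needed because self-attention on its own has no notion of the order of the elements in a sequence.

4. **Transformer Encoder**:
   - The embedded patch sequence is fed into a Transformer encoder. The encoder consists of several identical layers, each containing a self-attention mechanism (Self-Attention) and a feedforward neural network (Feedforward Neural Network). These layers let the model capture long-range dependencies in the input sequence.

5. **Classification Head**:
   - The original ViT prepends a learnable class token to the sequence and feeds its output vector from the encoder into a linear layer for image classification, which is also what this notebook does. Averaging the encoder's output vectors over all positions (global average pooling) is an alternative way to obtain a global image representation.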

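Overall, the key idea of the Vision Transformer is to use the Transformer's strong sequence-modeling ability to process image information. By converting an image into a sequence and using self-attention to capture the relationships between patches, ViT achieves impressive performance on image classification tasks.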
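
The pipeline outlined above can be traced end to end in a few lines. This is a minimal sketch, not the notebook's final model: the toy sizes (32x32 images, 8x8 patches, 64-dimensional embeddings, 3 classes) are assumptions chosen so it runs quickly, and it uses the class-token readout from the original paper (mean-pooling the encoder outputs would be the alternative).

```python
import torch
from torch import nn

torch.manual_seed(0)

# Hypothetical toy sizes (much smaller than ViT-Base) so the sketch runs quickly
batch, channels, img_size, patch_size = 2, 3, 32, 8
embed_dim, num_heads, num_classes = 64, 4, 3
num_patches = (img_size // patch_size) ** 2  # 4 x 4 = 16 patches

images = torch.randn(batch, channels, img_size, img_size)

# 1. Image patching: cut each image into non-overlapping P x P patches
patches = images.unfold(2, patch_size, patch_size).unfold(3, patch_size, patch_size)
# patches: [B, C, H/P, W/P, P, P] -> one flat vector per patch: [B, N, C*P*P]
patches = patches.permute(0, 2, 3, 1, 4, 5).reshape(batch, num_patches, -1)

# 2. Patch embedding: one learnable linear projection shared by all patches
patch_embed = nn.Linear(channels * patch_size ** 2, embed_dim)
tokens = patch_embed(patches)  # [B, N, D]

# 3. Prepend a learnable class token, then add learnable position embeddings
class_token = nn.Parameter(torch.randn(1, 1, embed_dim))
pos_embed = nn.Parameter(torch.randn(1, num_patches + 1, embed_dim))
tokens = torch.cat((class_token.expand(batch, -1, -1), tokens), dim=1) + pos_embed

# 4. Transformer encoder (two pre-norm layers here, ViT-Base uses twelve)
layer = nn.TransformerEncoderLayer(d_model=embed_dim, nhead=num_heads,
                                   dim_feedforward=4 * embed_dim, dropout=0.0,
                                   activation="gelu", batch_first=True, norm_first=True)
encoder = nn.TransformerEncoder(layer, num_layers=2)
encoded = encoder(tokens)  # [B, N + 1, D]

# 5. Classification head applied to the class token (index 0 of the sequence)
head = nn.Linear(embed_dim, num_classes)
logits = head(encoded[:, 0])
print(tokens.shape, logits.shape)
```

The rest of the notebook builds each of these stages by hand with the ViT-Base sizes.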

(Just as a convolutional neural network uses convolution as its front-end layer, here the attention mechanism takes that role.)

https://arxiv.org/abs/2010.11929

The original research paper.

### Prepare Env

In [None]:
# For this notebook to run with updated APIs, we need torch 1.12+ and torchvision 0.13+
try:
    import torch
    import torchvision
    assert int(torch.__version__.split(".")[1]) >= 12 or int(torch.__version__.split(".")[0]) == 2, "torch version should be 1.12+"
    assert int(torchvision.__version__.split(".")[1]) >= 13, "torchvision version should be 0.13+"
    print(f"torch version: {torch.__version__}")
    print(f"torchvision version: {torchvision.__version__}")
except Exception:
    print("[INFO] torch/torchvision versions not as required, installing the CUDA 11.8 builds.")
    !pip3 install -U torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
    import torch
    import torchvision
    print(f"torch version: {torch.__version__}")
    print(f"torchvision version: {torchvision.__version__}")

In [None]:
# Continue with regular imports
import matplotlib.pyplot as plt
import torch
import torchvision

from torch import nn
from torchvision import transforms

# Try to get torchinfo, install it if it doesn't work
try:
    from torchinfo import summary
except:
    print("[INFO] Couldn't find torchinfo... installing it.")
    !pip install -q torchinfo
    from torchinfo import summary

# Try to import the going_modular directory, download it from GitHub if it doesn't work
try:
    from going_modular.going_modular import data_setup, engine
    from helper_functions import download_data, set_seeds, plot_loss_curves
except:
    # Get the going_modular scripts
    print("[INFO] Couldn't find going_modular or helper_functions scripts... downloading them from GitHub.")
    !git clone https://github.com/mrdbourke/pytorch-deep-learning
    !mv pytorch-deep-learning/going_modular .
    !mv pytorch-deep-learning/helper_functions.py . # get the helper_functions.py script
    !rm -rf pytorch-deep-learning
    from going_modular.going_modular import data_setup, engine
    from helper_functions import download_data, set_seeds, plot_loss_curves

In [None]:
device = "cuda" if torch.cuda.is_available() else "cpu"
device

### 1. Get data

In [None]:
# Download pizza, steak, sushi images from GitHub
image_path = download_data(source="https://github.com/mrdbourke/pytorch-deep-learning/raw/main/data/pizza_steak_sushi.zip",
                           destination="pizza_steak_sushi")
image_path

In [None]:
# Setup directory paths to train and test images
train_dir = image_path / "train"
test_dir = image_path / "test"

### 2. Create datasets and dataloaders (check batch size and image size)

In [None]:
# Create image size (from Table 3 in the ViT paper)
IMG_SIZE = 224

# Create transform pipeline manually
manual_transforms = transforms.Compose([
    transforms.Resize((IMG_SIZE, IMG_SIZE)),
    transforms.ToTensor(),
])
print(f"Manually created transforms: {manual_transforms}")

In [None]:
# The paper uses a batch size of 4096, which is 128x our 32, but we stick with 32 because of limited hardware.
# Set the batch size
BATCH_SIZE = 32 # this is lower than the ViT paper but it's because we're starting small

# Create data loaders
train_dataloader, test_dataloader, class_names = data_setup.create_dataloaders(
    train_dir=train_dir,
    test_dir=test_dir,
    transform=manual_transforms, # use manually created transforms
    batch_size=BATCH_SIZE
)

train_dataloader, test_dataloader, class_names

In [None]:
# Get a batch of images
image_batch, label_batch = next(iter(train_dataloader))

# Get a single image from the batch
image, label = image_batch[0], label_batch[0]

# View the batch shapes
image.shape, label

In [None]:
# Plot image with matplotlib
plt.imshow(image.permute(1, 2, 0)) # rearrange image dimensions to suit matplotlib [color_channels, height, width] -> [height, width, color_channels]
plt.title(class_names[label])
plt.axis(False);

### 3. Replicating the paper: an overview

On model structure: blocks make up the model and layers make up blocks; inside a layer, functions such as forward() are applied to the input tensor (much like building with Lego):

1. Layer - takes an input, performs a function on it, returns an output.
2. Block - a collection of layers, takes an input, performs a series of functions on it, returns an output.
3. Architecture (or model) - a collection of blocks, takes an input, performs a series of functions on it, returns an output.

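Architecture overview in my own words:

Embedded Patches: initial processing of the image. The image is flattened into vectors (Patch Embedding) and combined with a position encoding (Position Embedding).

x_input = [class_token, image_patch_1, image_patch_2, image_patch_3...] + [class_token_position, image_patch_1_position, image_patch_2_position, image_patch_3_position...]

Transformer Encoder: LNorm -> MHA (multi-head attention) -> LNorm -> MLP (multilayer perceptron)

MHA block: multi-head attention is a key component of the Transformer for processing sequence data. It lets different parts of the input sequence be attended to at the same time. Each head learns a different attention pattern, so the model can capture different kinds of information in the sequence. **The outputs of the heads are concatenated and linearly projected, and that result is added to the block's original input through a residual connection to form the final output.**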

x_output_MSA_block = MSA_layer(LN_layer(x_input)) + x_input
Notice the skip/residual connection on the end (adding the input of the layers to the output of the layers).
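
To make the equation above concrete, here is a from-scratch sketch of one MSA block. It is an illustration under assumptions, not the notebook's implementation (which uses `nn.MultiheadAttention` later): the helper `multi_head_self_attention`, the random bias-free weights `w_qkv` and `w_out`, and the toy sizes are all hypothetical.

```python
import math
import torch
from torch import nn

torch.manual_seed(0)

def multi_head_self_attention(x, w_qkv, w_out, num_heads):
    """Bias-free multi-head self-attention on x of shape [batch, tokens, dim]."""
    batch, tokens, dim = x.shape
    head_dim = dim // num_heads
    q, k, v = (x @ w_qkv).chunk(3, dim=-1)  # three tensors of shape [B, N, D]

    def split_heads(t):
        # [B, N, D] -> [B, heads, N, head_dim]
        return t.reshape(batch, tokens, num_heads, head_dim).transpose(1, 2)

    q, k, v = split_heads(q), split_heads(k), split_heads(v)
    # Scaled dot-product attention, computed independently inside every head
    weights = (q @ k.transpose(-2, -1) / math.sqrt(head_dim)).softmax(dim=-1)
    heads = weights @ v  # [B, heads, N, head_dim]
    # Concatenate the heads again, then mix them with one linear projection
    concatenated = heads.transpose(1, 2).reshape(batch, tokens, dim)
    return concatenated @ w_out, weights

embedding_dim, num_heads = 32, 4  # toy sizes, not the ViT-Base values
x_input = torch.randn(1, 5, embedding_dim)
layer_norm = nn.LayerNorm(embedding_dim)
w_qkv = torch.randn(embedding_dim, 3 * embedding_dim) / math.sqrt(embedding_dim)
w_out = torch.randn(embedding_dim, embedding_dim) / math.sqrt(embedding_dim)

# x_output_MSA_block = MSA_layer(LN_layer(x_input)) + x_input
msa_output, attention_weights = multi_head_self_attention(
    layer_norm(x_input), w_qkv, w_out, num_heads)
x_output_msa_block = msa_output + x_input
print(x_output_msa_block.shape, attention_weights.shape)
```

The residual is added once, to the block output after the heads have been concatenated and projected, so the output keeps the input's shape.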

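MLP block: a multilayer perceptron generally has the structure linear layer -> non-linear layer -> linear layer -> non-linear layer, and so on.

Here it is: **layer norm -> linear layer -> non-linear layer -> dropout -> linear layer -> dropout**

In the Transformer, combining the attention heads and applying the MLP at each position are two separate steps:

1. **Output of each attention head:** in multi-head self-attention, the outputs of the heads are concatenated and passed through a single linear projection (with no activation function), which gives one combined representation per position. This projection mixes the information learned by the different heads.

2. **Output of each position:** in every encoder layer, the MLP is then applied independently to each position's representation (the projected concatenation of the heads). It usually consists of two linear transformations with an activation function in between. This is where the non-linearity comes from, so the model can learn more complex relationships and features.

Taken together, the role of the MLP in the Transformer is to introduce a non-linear transformation and increase the model's expressive power, which helps it learn the complex relationships in a sequence. In practice the MLP is a linear transformation, an activation function (ReLU in the original Transformer, GELU in ViT), and another linear transformation.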

x_output_MLP_block = MLP_layer(LN_layer(x_output_MSA_block)) + x_output_MSA_block
Notice the skip/residual connection on the end (adding the input of the layers to the output of the layers).

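LNorm block: layer normalization, which rescales the features of each token. It helps shorten training time and can improve generalization (by reducing the risk of overfitting).

0 index of x_output_MLP_block: a linear transformation (the classification head, applied to the class token at index 0)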
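
What layer normalization does to a single token can be shown with plain Python. This is a sketch under assumptions: the helper `layer_norm` is hypothetical, it uses the biased variance and a small `eps` for numerical stability, and it leaves out the learnable scale and shift that `nn.LayerNorm` also applies.

```python
import math

def layer_norm(vector, eps=1e-5):
    """Normalize one token's features to zero mean and (almost) unit variance."""
    mean = sum(vector) / len(vector)
    variance = sum((v - mean) ** 2 for v in vector) / len(vector)  # biased variance
    return [(v - mean) / math.sqrt(variance + eps) for v in vector]

token = [2.0, 4.0, 6.0, 8.0]  # made-up feature values for one token
normalized = layer_norm(token)

new_mean = sum(normalized) / len(normalized)
new_variance = sum((v - new_mean) ** 2 for v in normalized) / len(normalized)
print(normalized)
print(new_mean, new_variance)
```

Each token is normalized on its own, across its embedding dimension, so the result does not depend on the batch size or on the other tokens in the sequence.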
y = Linear_layer(LN_layer(x_output_MLP_block[0]))

---

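Patch embedding usually refers, in computer vision, to splitting an image into small blocks (usually square regions) and applying an embedding operation to each block, mapping it into a vector space of fixed dimension. In a traditional CNN, the input image passes through a series of convolution and pooling operations that gradually shrink the spatial dimensions until a global representation is left. With patch embedding, the image is cut into patches, each patch becomes one vector after the embedding step, and these vectors together form the representation of the whole image. Because every patch has its own embedding, local features are kept explicitly, which helps the model understand local structure in the image. Patch embedding is commonly combined with self-attention in deep learning models for images. In such models it can also make input size more flexible, since a larger image simply yields a longer patch sequence.

Position embedding is a technique for handling sequence data in deep learning. Its job is to give the model information about where each element sits in the input sequence. In natural language processing (NLP), position embeddings are typically used for text sequences such as sentences or paragraphs. In models built on self-attention, such as the Transformer, the elements of a sequence are processed simultaneously, without regard to their position. Position embeddings are introduced so the model can tell elements at different positions apart. A common approach is to add a vector that encodes the position to each input embedding vector. The original Transformer generates these vectors from a combination of sine and cosine functions, which captures relationships between relative positions; ViT instead uses learnable 1D position embeddings.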
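
For reference, the fixed sine/cosine scheme of the original Transformer can be sketched in plain Python. This is only an illustration of that scheme: the function name `sinusoidal_position_encoding` and the sizes below are assumptions, and this notebook follows ViT in using a learnable `nn.Parameter` rather than this table.

```python
import math

def sinusoidal_position_encoding(num_positions, dim):
    """Fixed encoding: even indices use sin, odd indices use cos."""
    table = []
    for pos in range(num_positions):
        row = []
        for i in range(0, dim, 2):
            angle = pos / (10000 ** (i / dim))  # wavelength grows with the index i
            row.append(math.sin(angle))
            row.append(math.cos(angle))
        table.append(row[:dim])  # trim in case dim is odd
    return table

# 197 positions = 196 patches + 1 class token; dim is kept tiny for readability
encoding = sinusoidal_position_encoding(num_positions=197, dim=8)
print(len(encoding), len(encoding[0]))
print(encoding[0])  # position 0 alternates sin(0) = 0 and cos(0) = 1
```

Either variant is added element-wise to the token embeddings, so its shape has to match `[sequence_length, embedding_dimension]`.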




### It's all about the embedding.
Mapping high-dimensional data to a lower-dimensional representation is the most important step.

### 4. Split data into patches and creating the class, position and patch embedding

In [4]:
# calculating patch embedding input and output shapes by hand
# create example values
height = 224
width = 224
color_channels = 3
patch_size = 16

# calculate N (number of patches)
number_of_patches = int((height * width) / patch_size ** 2)
print(f"Number of patches (N) with image height (H={height}), width (W={width}) and patch size (P={patch_size}): {number_of_patches}.")

Number of patches (N) with image height (H=224), width (W=224) and patch size (P=16): 196.


- input shape is image with 2D size: H * W * C
- output shape is sequence of flattened 2D patches with size: N * (P ** 2 * C)

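This reduces the number of tensor dimensions from three to two: the 3D image (H x W x C) becomes a 2D sequence (N x P²·C). The total number of values stays the same.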
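
The reshaping itself can be sketched with plain tensor operations. This is an assumption-laden illustration rather than the approach the notebook ends up using: the image is a dummy tensor in `[H, W, C]` layout filled with `torch.arange` so that individual values can be traced.

```python
import torch

height, width, color_channels, patch_size = 224, 224, 3, 16

# Dummy image in [H, W, C] layout, every value is unique so it can be traced
image_hwc = torch.arange(height * width * color_channels,
                         dtype=torch.float32).reshape(height, width, color_channels)

rows, cols = height // patch_size, width // patch_size  # 14 x 14 grid of patches

patches = (image_hwc
           .reshape(rows, patch_size, cols, patch_size, color_channels)  # split H and W
           .permute(0, 2, 1, 3, 4)                                       # [rows, cols, P, P, C]
           .reshape(rows * cols, patch_size * patch_size * color_channels))

print(patches.shape)                        # [N, P^2 * C]
print(patches.numel() == image_hwc.numel()) # no values are added or dropped
```

Row 0 of `patches` holds exactly the pixels of the top-left 16x16 patch, flattened.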

In [5]:
# input shape
embedding_layer_input_shape = (height, width, color_channels)

# output shape
embedding_layer_output_shape = (number_of_patches, patch_size ** 2 * color_channels)

print(f"Input shape (single 2D image): {embedding_layer_input_shape}")
print(f"Output shape (single 2D image flattened into patches): {embedding_layer_output_shape}")

Input shape (single 2D image): (224, 224, 3)
Output shape (single 2D image flattened into patches): (196, 768)


In [1]:
224 * 224 * 3, 224 * 224

(150528, 50176)

In [7]:
196 * 768

150528

In [None]:
# view single image
plt.imshow(image.permute(1, 2, 0))
plt.title(class_names[label])
plt.axis(False)

In [None]:
# change image shape to be compatible with matplotlib (color_channels, height, width) -> (height, width, color_channels)
image_permuted = image.permute(1, 2, 0)

# index to plot the top row of patched pixels
patch_size = 16
plt.figure(figsize=(patch_size, patch_size))
plt.imshow(image_permuted[:patch_size, :, :]);

In [None]:
# Turn image into patches (only on patch_size : 224 x 16)
img_size = 224
patch_size = 16
num_patches = img_size // patch_size
assert img_size % patch_size == 0, "Image size must be divisible by patch size"
print(f"Number of patches per row: {num_patches}\nPatch size: {patch_size} pixels x {patch_size} pixels")

# create a series of subplots
fig, axs = plt.subplots(
    nrows = 1,
    ncols = img_size // patch_size,
    figsize = (num_patches, num_patches),
    sharex = True,
    sharey = True
)

# iterate through number of patches in the top row
for i, patch in enumerate(range(0, img_size, patch_size)):
    axs[i].imshow(image_permuted[:patch_size, patch:patch + patch_size, :]);
    axs[i].set_xlabel(i + 1)
    axs[i].set_xticks([])
    axs[i].set_yticks([])

In [None]:
# Turn image into patches (full image 224 x 224)
img_size = 224
patch_size = 16
num_patches = img_size // patch_size
assert img_size % patch_size == 0, "Image size must be divisible by patch size"
print(f"Number of patches per row: {num_patches}\
        \nNumber of patches per column: {num_patches}\
        \nTotal patches: {num_patches * num_patches}\
        \nPatch size: {patch_size} pixels x {patch_size} pixels")

# create a series of subplots
fig, axs = plt.subplots(
    nrows = img_size // patch_size,
    ncols = img_size // patch_size,
    figsize = (num_patches, num_patches),
    sharex = True,
    sharey = True
)

# Loop through height and width of image
for i, patch_height in enumerate(range(0, img_size, patch_size)):
    for j, patch_width in enumerate(range(0, img_size, patch_size)):
        # plot the permuted image patch
        axs[i, j].imshow(image_permuted[patch_height:patch_height + patch_size,
                                        patch_width:patch_width + patch_size, :]);
        axs[i, j].set_ylabel(i + 1,
                             rotation = "horizontal",
                             horizontalalignment = "right",
                             verticalalignment = "center")
        axs[i, j].set_xlabel(j + 1)
        axs[i, j].set_xticks([])
        axs[i, j].set_yticks([])
        axs[i, j].label_outer()

# set a super title
fig.suptitle(f"{class_names[label]} -> Patchified", fontsize=16)
plt.show()

### Create image patches with torch.nn.Conv2d()
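Turn these patches into an embedding with torch.nn.Conv2d() by setting the kernel_size and stride parameters to patch_size. (In other words, the image patches cut out above can be obtained with a 2D convolution: just set both the kernel side length and the stride to patch_size.)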
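
Why a convolution works here can be checked numerically: with `kernel_size == stride == patch_size`, the convolution applies the same linear projection to every flattened patch. The sketch below is a check under assumed toy sizes (32x32 image, 8x8 patches, 16 output channels), using `torch.nn.functional.unfold` to extract the patches.

```python
import torch
from torch import nn
import torch.nn.functional as F

torch.manual_seed(0)

# Toy sizes (assumptions) so the comparison is quick to run
in_channels, img_size, patch_size, embedding_dim = 3, 32, 8, 16
x = torch.randn(1, in_channels, img_size, img_size)

conv = nn.Conv2d(in_channels, embedding_dim,
                 kernel_size=patch_size, stride=patch_size, padding=0)

with torch.no_grad():
    # Path 1: convolution, then flatten the feature map into a token sequence
    conv_tokens = conv(x).flatten(2).transpose(1, 2)  # [1, N, D]

    # Path 2: cut out the patches explicitly and apply the same weights as a matmul
    flat_patches = F.unfold(x, kernel_size=patch_size,
                            stride=patch_size).transpose(1, 2)  # [1, N, C*P*P]
    weight = conv.weight.reshape(embedding_dim, -1)             # [D, C*P*P]
    linear_tokens = flat_patches @ weight.T + conv.bias         # [1, N, D]

max_difference = (conv_tokens - linear_tokens).abs().max().item()
print(conv_tokens.shape, max_difference)
```

The two paths agree up to floating-point rounding, so the convolution is simply a convenient way to patchify and project in one call.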

In [None]:
from torch import nn

# Set the patch size
patch_size=16

# Create the Conv2d layer with hyperparameters from the ViT paper
conv2d = nn.Conv2d(in_channels=3, # number of color channels
                   out_channels=768, # from Table 1: Hidden size D, this is the embedding size
                   kernel_size=patch_size, # could also use (patch_size, patch_size)
                   stride=patch_size,
                   padding=0)

In [None]:
# Pass the image through the convolutional layer
image_out_of_conv = conv2d(image.unsqueeze(0)) # add a single batch dimension (color_channels, height, width) -> (batch, color_channels, height, width)
print(image_out_of_conv.shape)

Its output shape can be read as:

torch.Size([1, 768, 14, 14]) -> [batch_size, embedding_dim, feature_map_height, feature_map_width]
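
The 14 x 14 feature map follows from the standard output-size formula for a convolution. A small sketch, assuming dilation 1 (the `nn.Conv2d` default); the helper name `conv2d_output_size` is hypothetical.

```python
def conv2d_output_size(input_size, kernel_size, stride, padding=0):
    """Output length along one spatial dimension of a convolution (dilation = 1)."""
    return (input_size + 2 * padding - kernel_size) // stride + 1

feature_map_side = conv2d_output_size(input_size=224, kernel_size=16, stride=16)
number_of_patches = feature_map_side ** 2
print(feature_map_side, number_of_patches)  # 14 196
```

Because the kernel and the stride are both equal to the patch size, the windows tile the image without overlap and each output position corresponds to exactly one patch.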

In [None]:
# Plot random 5 convolutional feature maps
import random
random_indexes = random.sample(range(0, 768), k=5) # pick 5 numbers between 0 and the embedding size
print(f"Showing random convolutional feature maps from indexes: {random_indexes}")

# Create plot
fig, axs = plt.subplots(nrows=1, ncols=5, figsize=(12, 12))

# Plot random image feature maps
for i, idx in enumerate(random_indexes):
    image_conv_feature_map = image_out_of_conv[:, idx, :, :] # index on the output tensor of the convolutional layer
    axs[i].imshow(image_conv_feature_map.squeeze().detach().numpy())
    axs[i].set(xticklabels=[], yticklabels=[], xticks=[], yticks=[]);

In [None]:
# Get a single feature map in tensor form
single_feature_map = image_out_of_conv[:, 0, :, :]
single_feature_map, single_feature_map.requires_grad

### Flattening the patch embedding with torch.nn.Flatten()

In [None]:
# Current tensor shape
print(f"Current tensor shape: {image_out_of_conv.shape} -> [batch, embedding_dim, feature_map_height, feature_map_width]")

In [None]:
# Create flatten layer
flatten = nn.Flatten(start_dim=2, # flatten feature_map_height (dimension 2)
                     end_dim=3) # flatten feature_map_width (dimension 3)

### Put patch parts all together

In [None]:
# 1. View single image
plt.imshow(image.permute(1, 2, 0)) # adjust for matplotlib
plt.title(class_names[label])
plt.axis(False);
print(f"Original image shape: {image.shape}")

# 2. Turn image into feature maps
image_out_of_conv = conv2d(image.unsqueeze(0)) # add batch dimension to avoid shape errors
print(f"Image feature map shape: {image_out_of_conv.shape}")

# 3. Flatten the feature maps
image_out_of_conv_flattened = flatten(image_out_of_conv)
print(f"Flattened image feature map shape: {image_out_of_conv_flattened.shape}")

In [None]:
# Get flattened image patch embeddings in right shape
image_out_of_conv_flattened_reshaped = image_out_of_conv_flattened.permute(0, 2, 1) # [batch_size, P^2•C, N] -> [batch_size, N, P^2•C]
print(f"Patch embedding sequence shape: {image_out_of_conv_flattened_reshaped.shape} -> [batch_size, num_patches, embedding_size]")

In [None]:
# Get a single flattened feature map
single_flattened_feature_map = image_out_of_conv_flattened_reshaped[:, :, 0] # index: (batch_size, number_of_patches, embedding_dimension)

# Plot the flattened feature map visually
plt.figure(figsize=(22, 22))
plt.imshow(single_flattened_feature_map.detach().numpy())
plt.title(f"Flattened feature map shape: {single_flattened_feature_map.shape}")
plt.axis(False);

In [None]:
# See the flattened feature map as a tensor
single_flattened_feature_map, single_flattened_feature_map.requires_grad, single_flattened_feature_map.shape

In [None]:
# 1. Create a class which subclasses nn.Module
class PatchEmbedding(nn.Module):
    """
    Turns a 2D input image into a 1D sequence learnable embedding vector.

    Args:
        in_channels (int): Number of color channels for the input image. (Defaults to 3)
        patch_size (int): Size of patches to convert input image into.(Defaults to 16)
        embedding_dim (int): Size of embedding to turn image into. (Defaults to 768)
    """
    # 2. Initialize the class with appropriate variable
    def __init__(
        self,
        in_channels: int=3,
        patch_size: int=16,
        embedding_dim: int=768
    ):
        super().__init__()

        # Store the patch size so forward() does not rely on a global variable
        self.patch_size = patch_size

        # 3. Create a layer to turn an image into patches
        self.patcher = nn.Conv2d(
            in_channels = in_channels,
            out_channels = embedding_dim,
            kernel_size = patch_size,
            stride = patch_size,
            padding = 0
        )

        # 4. Create a layer to flatten the patch feature maps into a single dimension
        self.flatten = nn.Flatten(
            start_dim = 2,
            end_dim = 3
        )

    # 5. Define the forward method
    def forward(self, x):
        # Create assertion to check that inputs are the correct shape
        image_resolution = x.shape[-1]
        assert image_resolution % self.patch_size == 0, f"Input image size must be divisible by patch size, image shape: {image_resolution}, patch_size: {self.patch_size}"

        # Perform the forward pass
        x_patched = self.patcher(x)
        x_flattened = self.flatten(x_patched)
        # 6. Make sure the output shape has the right order
        return x_flattened.permute(0, 2, 1) # adjust so the embedding is on the final dimension [batch_size, P^2*C, N] -> [batch_size, N, P^2*C]

In [None]:
set_seeds()

# Create an instance of patch embedding layer
patchify = PatchEmbedding(
    in_channels = 3, 
    patch_size = 16,
    embedding_dim = 768
)

# Pass a single image through
print(f"Input image shape: {image.unsqueeze(0).shape}")
patch_embedded_image = patchify(image.unsqueeze(0)) # add an extra batch dimension on the 0th index, otherwise this will error
print(f"Output patch embedding shape: {patch_embedded_image.shape}")

### Class token embedding

In [None]:
# View the patch embedding and patch embedding shape
print(patch_embedded_image)
print(f"Patch embedding shape: {patch_embedded_image.shape} -> [batch_size, number_of_patches, embedding_dimension]")

patch_embedding_with_class_token = torch.cat((class_token, patch_embedding), dim=1)

Here torch.ones() is used to fill the initial class token; ideally the class token is initialized randomly with torch.randn().

In [4]:
import torch

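# Create a tensor with requires_grad=True
x = torch.tensor([2.0], requires_grad=True)

# Perform some operations
y = x ** 2
z = 2 * y + 3

# Backpropagate from z to compute dz/dx
z.backward()

# Print the gradient (dz/dx = 4x, which is 8 at x = 2)
print(x.grad)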

tensor([8.])


In [None]:
# Get the batch size and embedding dimension
batch_size = patch_embedded_image.shape[0]
embedding_dimension = patch_embedded_image.shape[-1]

# Create the class token embedding as a learnable parameter that shares the same size as the embedding dimension (D)
class_token = nn.Parameter(torch.ones(batch_size, 1, embedding_dimension), # [batch_size, number_of_tokens, embedding_dimension]
                           requires_grad=True) # make sure the embedding is learnable, i.e. it gets gradients and is updated through backpropagation

# Show the first 10 examples of the class_token
print(class_token[:, :, :10])

# Print the class_token shape
print(f"Class token shape: {class_token.shape} -> [batch_size, number_of_tokens, embedding_dimension]")

In [None]:
# Add the class token embedding to the front of the patch embedding
patch_embedded_image_with_class_embedding = torch.cat((class_token, patch_embedded_image),
                                                      dim=1) # concat on first dimension

# Print the sequence of patch embeddings with the prepended class token embedding
print(patch_embedded_image_with_class_embedding)
print(f"Sequence of patch embeddings with class token prepended shape: {patch_embedded_image_with_class_embedding.shape} -> [batch_size, number_of_patches, embedding_dimension]")

Note that the sequence length (number_of_patches) has increased by one.

### Position Embedding
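This matters because it encodes where each patch sits in the sequence; without it, the patches would lose their ordering relative to one another.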
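
The claim above can be checked directly: without a position embedding, self-attention produces the same token outputs no matter how the tokens are ordered. The sketch below is an illustration under assumptions (a randomly initialized `nn.MultiheadAttention`, toy sizes, and a random stand-in for the learnable position embedding).

```python
import torch
from torch import nn

torch.manual_seed(0)

embedding_dim, num_tokens = 16, 6  # toy sizes
attention = nn.MultiheadAttention(embedding_dim, num_heads=4, batch_first=True)
attention.eval()

tokens = torch.randn(1, num_tokens, embedding_dim)
position = torch.randn(1, num_tokens, embedding_dim)  # stand-in for a position embedding
perm = torch.arange(num_tokens).flip(0)               # reverse the token order

def self_attend(x):
    out, _ = attention(x, x, x, need_weights=False)
    return out

with torch.no_grad():
    # Without positions: shuffling the input only shuffles the output
    order_is_invisible = torch.allclose(
        self_attend(tokens)[:, perm], self_attend(tokens[:, perm]), atol=1e-5)

    # With positions added after shuffling, the same patch content ends up in
    # a different slot, and the outputs are no longer a mere reordering
    order_matters = not torch.allclose(
        self_attend(tokens + position)[:, perm],
        self_attend(tokens[:, perm] + position), atol=1e-5)

print(order_is_invisible, order_matters)
```

This is why the position embedding is added before the sequence enters the encoder.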

In [6]:
# View the sequence of patch embeddings with the prepended class embedding
patch_embedded_image_with_class_embedding, patch_embedded_image_with_class_embedding.shape

In [None]:
# Calculate N (number of patches)
number_of_patches = int((height * width) / patch_size**2)

# Get embedding dimension
embedding_dimension = patch_embedded_image_with_class_embedding.shape[2]

# Create the learnable 1D position embedding
position_embedding = nn.Parameter(torch.ones(1,
                                             number_of_patches+1,
                                             embedding_dimension),
                                  requires_grad=True) # make sure it's learnable

# Show the first 10 sequences and 10 position embedding values and check the shape of the position embedding
print(position_embedding[:, :10, :10])
print(f"Position embedding shape: {position_embedding.shape} -> [batch_size, number_of_patches, embedding_dimension]")

In [None]:
# Add the position embedding to the patch and class token embedding
patch_and_position_embedding = patch_embedded_image_with_class_embedding + position_embedding
print(patch_and_position_embedding)
print(f"Patch embeddings, class token prepended and positional embeddings added shape: {patch_and_position_embedding.shape} -> [batch_size, number_of_patches, embedding_dimension]")

### Put them all together
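Step-by-step breakdown:
1. The image starts out with shape `[3, 224, 224]`.
2. Adding a batch dimension (**unsqueeze**) gives shape `[1, 3, 224, 224]`. (The extra dimension is needed so the shape fits what torch layers expect.)
3. A **Conv2D** layer extracts features and gives shape `[1, 768, 14, 14]` (batch_size, embedding_dim, feature_map_height, feature_map_width). (The output channels are set to 768, and the kernel size and stride both equal the patch size of 16, so each side of the feature map is 224/16 = 14.)
4. **Flatten** merges dimensions 2 and 3, giving shape `[1, 768, 196]` (batch_size, embedding_dim, num_patches).
5. **permute** swaps dimensions 1 and 2, giving shape `[1, 196, 768]`.
6. Create a class token of shape `[1, 1, 768]` (an **nn.Parameter** built from ones or randn with requires_grad=True; random initialization is preferable).
7. Use **torch.cat** to join the class token and patch_embedded_image along dimension 1, giving shape `[1, 197, 768]`.
8. Create a position embedding of shape `[1, 197, 768]` (an **nn.Parameter** built from ones or randn with requires_grad=True; random initialization is preferable).
9. Add the position embedding to the result of step 7 to get the final result, which has the same shape `[1, 197, 768]`.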

In [None]:
set_seeds()

# 1. Set patch size
patch_size = 16

# 2. Print shape of original image tensor and get the image dimensions
print(f"Image tensor shape: {image.shape}")
height, width = image.shape[1], image.shape[2]

# 3. Get image tensor and add batch dimension
x = image.unsqueeze(0)
print(f"Input image with batch dimension shape: {x.shape}")

# 4. Create patch embedding layer
patch_embedding_layer = PatchEmbedding(in_channels=3,
                                       patch_size=patch_size,
                                       embedding_dim=768)

# 5. Pass image through patch embedding layer
patch_embedding = patch_embedding_layer(x)
print(f"Patching embedding shape: {patch_embedding.shape}")

# 6. Create class token embedding
batch_size = patch_embedding.shape[0]
embedding_dimension = patch_embedding.shape[-1]
class_token = nn.Parameter(torch.ones(batch_size, 1, embedding_dimension),
                           requires_grad=True) # make sure it's learnable
print(f"Class token embedding shape: {class_token.shape}")

# 7. Prepend class token embedding to patch embedding
patch_embedding_class_token = torch.cat((class_token, patch_embedding), dim=1)
print(f"Patch embedding with class token shape: {patch_embedding_class_token.shape}")

# 8. Create position embedding
number_of_patches = int((height * width) / patch_size**2)
position_embedding = nn.Parameter(torch.ones(1, number_of_patches+1, embedding_dimension),
                                  requires_grad=True) # make sure it's learnable

# 9. Add position embedding to patch embedding with class token
patch_and_position_embedding = patch_embedding_class_token + position_embedding
print(f"Patch and position embedding shape: {patch_and_position_embedding.shape}")

### Multi-Head Attention

In [None]:
# 1. Create a class that inherits from nn.Module
class MultiheadSelfAttentionBlock(nn.Module):
    "Create a multi-head self-attention block."
    # 2. Initialize the class with hyperparameters from Table 1
    def __init__(
        self, 
        embedding_dim:int = 768,
        num_heads:int = 12,
        attn_dropout:float = 0
    ):
        super().__init__()

        # 3. Create the Norm layer (LN)
        self.layer_norm = nn.LayerNorm(normalized_shape = embedding_dim)

        # 4. Create the MHA layer
        self.multihead_attn = nn.MultiheadAttention(
            embed_dim = embedding_dim,
            num_heads = num_heads,
            dropout = attn_dropout,
            batch_first = True
        )

    # 5. Create a forward() method to pass the data through the layers
    def forward(self, x):
        x = self.layer_norm(x)
        attn_output, _ = self.multihead_attn(
            query = x,
            key = x,
            value = x, 
            need_weights = False
        )
        return attn_output

In [None]:
# Create an instance of MHABlock
multihead_self_attention_block = MultiheadSelfAttentionBlock(
    embedding_dim = 768,
    num_heads = 12
)

# Pass patch and position image embedding through MHABlock
patched_image_through_msa_block = multihead_self_attention_block(patch_and_position_embedding)
print(f"Input shape of MSA block: {patch_and_position_embedding.shape}")
print(f"Output shape MSA block: {patched_image_through_msa_block.shape}")

### Multilayer Perceptron (MLP)
layer norm -> linear layer -> non-linear layer -> dropout -> linear layer -> dropout

In [None]:
# 1. Create a class that inherits from nn.Module
class MLPBlock(nn.Module):
    "Create a layer normalized multilayer perceptron block."
    # 2. Initialize the class with hyperparameters
    def __init__(
        self,
        embedding_dim:int = 768,
        mlp_size:int = 3072, # MLP size from Table 1 for ViT-Base
        dropout:float = 0.1  # Dropout from Table 3 for ViT-Base
    ):
        super().__init__()

        # 3. Create the Norm layer (LN)
        self.layer_norm = nn.LayerNorm(normalized_shape = embedding_dim)

        # 4. Create the Multilayer perceptron layers
        self.mlp = nn.Sequential(
            nn.Linear(in_features = embedding_dim,
                      out_features = mlp_size),
            nn.GELU(),
            nn.Dropout(p = dropout),
            nn.Linear(in_features = mlp_size,
                      out_features = embedding_dim),
            nn.Dropout(p = dropout)
        )

    # 5. Create a forward() method to pass the data through the layers
    def forward(self, x):
        x = self.layer_norm(x)
        x = self.mlp(x)
        return x

In [None]:
# Create an instance of MLPBlock
mlp_block = MLPBlock(
    embedding_dim = 768,
    mlp_size = 3072,
    dropout = 0.1
)

# Pass output of MHABlock through MLPBlock
patched_image_through_mlp_block = mlp_block(patched_image_through_msa_block)
print(f"Input shape of MLP block: {patched_image_through_msa_block.shape}")
print(f"Output shape of MLP block: {patched_image_through_mlp_block.shape}")

### Transformer Encoder
Add residual connection: x_input -> MSA_block -> MSA_block_output + **x_input** -> MLP_block -> MLP_block_output + MSA_block_output + **x_input** -> ...

In [None]:
# 1. Create a class that inherits from nn.Module
class TransformerEncoderBlock(nn.Module):
    "Create a Transformer Encoder block."
    # 2. Initialize the class with hyperparameters from Table 1 and 3
    def __init__(
        self,
        embedding_dim:int = 768,
        num_heads:int = 12,
        mlp_size:int = 3072,
        mlp_dropout:float = 0.1,
        attn_dropout:float = 0
    ):
        super().__init__()

        # 3. Create MSA block (equation 2)
        self.msa_block = MultiheadSelfAttentionBlock(
            embedding_dim = embedding_dim,
            num_heads = num_heads,
            attn_dropout = attn_dropout
        )

        # 4. Create MLP block (equation 3)
        self.mlp_block = MLPBlock(
            embedding_dim = embedding_dim,
            mlp_size = mlp_size,
            dropout = mlp_dropout
        )

    # 5. Create a forward() method
    def forward(self, x):

        # 6. Create residual connection for MSA block (add the input to the output)
        x = self.msa_block(x) + x

        # 7. Create residual connection for MLP block (add the input to the output)
        x = self.mlp_block(x) + x

        return x

In [None]:
# Create an instance of TransformerEncoderBlock
transformer_encoder_block = TransformerEncoderBlock()

# # Print an input and output summary of our Transformer Encoder (uncomment for full output)
# summary(model=transformer_encoder_block,
#         input_size=(1, 197, 768), # (batch_size, num_patches, embedding_dimension)
#         col_names=["input_size", "output_size", "num_params", "trainable"],
#         col_width=20,
#         row_settings=["var_names"])

PyTorch Transformer layers from torch.nn:

In [None]:
# Create the same as above with torch.nn.TransformerEncoderLayer()
torch_transformer_encoder_layer = nn.TransformerEncoderLayer(
    d_model = 768,
    nhead = 12,
    dim_feedforward = 3072,
    dropout = 0.1,
    activation = "gelu",
    batch_first = True,
    norm_first = True
)

torch_transformer_encoder_layer

In [None]:
# # Get the output of PyTorch's version of the Transformer Encoder (uncomment for full output)
# summary(model=torch_transformer_encoder_layer,
#         input_size=(1, 197, 768), # (batch_size, num_patches, embedding_dimension)
#         col_names=["input_size", "output_size", "num_params", "trainable"],
#         col_width=20,
#         row_settings=["var_names"])

Because the ViT architecture stacks multiple Transformer layers on top of one another (Table 1 shows 12 layers for ViT-Base), you can do this with torch.nn.TransformerEncoder(encoder_layer, num_layers):

- encoder_layer - the target Transformer Encoder layer, created with torch.nn.TransformerEncoderLayer().
- num_layers - the number of Transformer Encoder layers to stack together.
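
A minimal sketch of that stacking, with assumed toy sizes (64-dimensional embeddings, 4 heads, 3 layers) instead of the ViT-Base values so it runs quickly:

```python
import torch
from torch import nn

# One pre-norm encoder layer with GELU, configured like the ViT layer but smaller
encoder_layer = nn.TransformerEncoderLayer(
    d_model=64,
    nhead=4,
    dim_feedforward=256,
    dropout=0.1,
    activation="gelu",
    batch_first=True,
    norm_first=True,
)

# Stack independent copies of the layer (ViT-Base would use num_layers=12)
transformer_encoder = nn.TransformerEncoder(encoder_layer=encoder_layer, num_layers=3)

x = torch.randn(1, 17, 64)  # (batch_size, sequence_length, embedding_dimension)
out = transformer_encoder(x)
print(len(transformer_encoder.layers), out.shape)
```

Each stacked layer has its own weights, and the sequence shape is unchanged from input to output, which is what makes the layers stackable.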

### Create ViT (put all together)

In [None]:
# 1. Create a ViT class that inherits from nn.Module
class ViT(nn.Module):
    """Creates a Vision Transformer architecture with ViT-Base hyperparameters by default."""
    # 2. Initialize the class with hyperparameters from Table 1 and Table 3
    def __init__(self,
                 img_size:int=224, # Training resolution from Table 3 in ViT paper
                 in_channels:int=3, # Number of channels in input image
                 patch_size:int=16, # Patch size
                 num_transformer_layers:int=12, # Layers from Table 1 for ViT-Base
                 embedding_dim:int=768, # Hidden size D from Table 1 for ViT-Base
                 mlp_size:int=3072, # MLP size from Table 1 for ViT-Base
                 num_heads:int=12, # Heads from Table 1 for ViT-Base
                 attn_dropout:float=0, # Dropout for attention projection
                 mlp_dropout:float=0.1, # Dropout for dense/MLP layers
                 embedding_dropout:float=0.1, # Dropout for patch and position embeddings
                 num_classes:int=1000): # Default for ImageNet but can customize this
        super().__init__() # don't forget the super().__init__()!

        # 3. Make sure the image size is divisible by the patch size
        assert img_size % patch_size == 0, f"Image size must be divisible by patch size, image size: {img_size}, patch size: {patch_size}."

        # 4. Calculate number of patches (height * width/patch^2)
        self.num_patches = (img_size * img_size) // patch_size**2

        # 5. Create learnable class embedding (needs to go at front of sequence of patch embeddings)
        self.class_embedding = nn.Parameter(data=torch.randn(1, 1, embedding_dim),
                                            requires_grad=True)

        # 6. Create learnable position embedding
        self.position_embedding = nn.Parameter(data=torch.randn(1, self.num_patches+1, embedding_dim),
                                               requires_grad=True)

        # 7. Create embedding dropout value
        self.embedding_dropout = nn.Dropout(p=embedding_dropout)

        # 8. Create patch embedding layer
        self.patch_embedding = PatchEmbedding(in_channels=in_channels,
                                              patch_size=patch_size,
                                              embedding_dim=embedding_dim)

        # 9. Create Transformer Encoder blocks (we can stack Transformer Encoder blocks using nn.Sequential())
        # Note: The "*" unpacks the list so each block is passed as a separate argument
        self.transformer_encoder = nn.Sequential(*[TransformerEncoderBlock(embedding_dim=embedding_dim,
                                                                            num_heads=num_heads,
                                                                            mlp_size=mlp_size,
                                                                            mlp_dropout=mlp_dropout,
                                                                            attn_dropout=attn_dropout) for _ in range(num_transformer_layers)])

        # 10. Create classifier head
        self.classifier = nn.Sequential(
            nn.LayerNorm(normalized_shape=embedding_dim),
            nn.Linear(in_features=embedding_dim,
                      out_features=num_classes)
        )

    # 11. Create a forward() method
    def forward(self, x):

        # 12. Get batch size
        batch_size = x.shape[0]

        # 13. Create class token embedding and expand it to match the batch size (equation 1)
        class_token = self.class_embedding.expand(batch_size, -1, -1) # "-1" means keep that dimension's size unchanged (try this line on its own)

        # 14. Create patch embedding (equation 1)
        x = self.patch_embedding(x)

        # 15. Concat class embedding and patch embedding (equation 1)
        x = torch.cat((class_token, x), dim=1)

        # 16. Add position embedding to patch embedding (equation 1)
        x = self.position_embedding + x

        # 17. Run embedding dropout (Appendix B.1)
        x = self.embedding_dropout(x)

        # 18. Pass patch, position and class embedding through transformer encoder layers (equations 2 & 3)
        x = self.transformer_encoder(x)

        # 19. Put the class token output (index 0 of the sequence) through the classifier (equation 4)
        x = self.classifier(x[:, 0]) # run on each sample in a batch at 0 index

        return x
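
The shape bookkeeping in steps 3 to 6 and step 15 can be checked by hand. The following is a minimal sketch, assuming ViT-Base/16 settings (224x224 input, 16x16 patches, embedding size 768); the variable names are illustrative and are not read from the class above.

```python
# Assumed ViT-Base/16 settings (illustrative values, not read from the class above)
img_size, patch_size, embedding_dim = 224, 16, 768

# Same check as step 3: the image must tile evenly into patches
assert img_size % patch_size == 0

# Step 4: number of patches N = (H * W) / P^2
num_patches = (img_size * img_size) // patch_size**2

# Step 15 prepends one class token, so the sequence grows by one
seq_len = num_patches + 1

# Step 6: the position embedding has shape (1, N + 1, D) so it broadcasts over the batch
position_embedding_shape = (1, seq_len, embedding_dim)

print(num_patches, seq_len, position_embedding_shape)  # 196 197 (1, 197, 768)
```

This is why `position_embedding` is created with `self.num_patches+1` rows: it has to line up with the sequence after the class token is concatenated.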

In [None]:
# Example of creating the class embedding and expanding over a batch dimension
batch_size = 32
class_token_embedding_single = nn.Parameter(data=torch.randn(1, 1, 768)) # create a single learnable class token
class_token_embedding_expanded = class_token_embedding_single.expand(batch_size, -1, -1) # expand the single learnable class token across the batch dimension, "-1" means "keep this dimension's size unchanged"

# Print out the change in shapes
print(f"Shape of class token embedding single: {class_token_embedding_single.shape}")
print(f"Shape of class token embedding expanded: {class_token_embedding_expanded.shape}")
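
In `Tensor.expand`, passing `-1` for a dimension leaves its size as it is, and the result is a view rather than a copy. A small sketch to confirm both points, using a plain tensor in place of the learnable parameter (an assumption made only to keep the example short):

```python
import torch

class_token = torch.randn(1, 1, 768)        # one token: (1, 1, D)
expanded = class_token.expand(32, -1, -1)   # -1 keeps dims 1 and 2 at their current size

print(expanded.shape)

# expand() returns a view: no data is copied, both tensors share the same storage
shares_memory = expanded.data_ptr() == class_token.data_ptr()
print(shares_memory)
```

Because the expanded tensor shares storage with the original, every sample in the batch uses the same class token values, and gradients flow back to the single parameter.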

In [None]:
set_seeds()

# Create a random tensor with same shape as a single image
random_image_tensor = torch.randn(1, 3, 224, 224) # (batch_size, color_channels, height, width)

# Create an instance of ViT with the number of classes we're working with (pizza, steak, sushi)
vit = ViT(num_classes=len(class_names))

# Pass the random image tensor to our ViT instance
vit(random_image_tensor)

In [None]:
from torchinfo import summary

# # Print a summary of our custom ViT model using torchinfo (uncomment for actual output)
# summary(model=vit,
#         input_size=(32, 3, 224, 224), # (batch_size, color_channels, height, width)
#         # col_names=["input_size"], # uncomment for smaller output
#         col_names=["input_size", "output_size", "num_params", "trainable"],
#         col_width=20,
#         row_settings=["var_names"]
# )

### Train model

In [None]:
from going_modular import engine

# Setup the optimizer to optimize our ViT model parameters using hyperparameters from the ViT paper
optimizer = torch.optim.Adam(params=vit.parameters(),
                             lr=3e-3, # Base LR from Table 3 for ViT-* ImageNet-1k
                             betas=(0.9, 0.999), # default values but also mentioned in ViT paper section 4.1 (Training & Fine-tuning)
                             weight_decay=0.3) # from the ViT paper section 4.1 (Training & Fine-tuning) and Table 3 for ViT-* ImageNet-1k

# Setup the loss function for multi-class classification
loss_fn = torch.nn.CrossEntropyLoss()

# Set the seeds
set_seeds()

# Train the model and save the training results to a dictionary
results = engine.train(model=vit,
                       train_dataloader=train_dataloader,
                       test_dataloader=test_dataloader,
                       optimizer=optimizer,
                       loss_fn=loss_fn,
                       epochs=10,
                       device=device)

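**Why the results are underwhelming**:

The original model was trained on a very large dataset, with large batches and a long training schedule. Because of that scale, the original setup relied on many regularization techniques to prevent overfitting. After all, it involves a great deal of data and a great many parameters.

Our small dataset therefore cannot produce comparably good results.

This also shows that strong results in modern experiments often depend on large datasets and high-performance hardware. **Compute, data, and algorithms** are all indispensable. As the field develops, models keep getting larger.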
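
To get a feel for how many parameters are involved, the count for a ViT-Base/16 with a 3-class head can be worked out by hand. This is a rough sketch under stated assumptions that the class above does not confirm: the patch embedding is a 16x16 convolution with bias, attention uses biased Q/K/V and output projections, and each encoder block has two LayerNorms.

```python
# Rough parameter count for ViT-Base/16 with a 3-class head (assumptions in the lead-in)
D, mlp, layers, patch, channels, classes = 768, 3072, 12, 16, 3, 3
num_patches = (224 // patch) ** 2

patch_embedding = channels * patch * patch * D + D   # conv weights + bias
class_token = D
position_embedding = (num_patches + 1) * D

layer_norm = 2 * D                                   # weight + bias
attention = (3 * D * D + 3 * D) + (D * D + D)        # QKV projection + output projection
mlp_block = (D * mlp + mlp) + (mlp * D + D)          # two linear layers
encoder_block = 2 * layer_norm + attention + mlp_block

head = layer_norm + (D * classes + classes)          # LayerNorm + linear classifier

total = patch_embedding + class_token + position_embedding + layers * encoder_block + head
print(f"{total:,}")  # 85,800,963 under these assumptions
```

Almost all of these parameters sit in the 12 encoder blocks, which is why a model of this size needs far more data than a small image folder to train from scratch.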

In [None]:
from helper_functions import plot_loss_curves

# Plot our ViT model's loss curves
plot_loss_curves(results)

### Using a pretrained ViT from torchvision.models on the same dataset

In [None]:
# The following requires torch v1.12+ and torchvision v0.13+
import torch
import torchvision
print(torch.__version__)
print(torchvision.__version__)

In [None]:
device = "cuda" if torch.cuda.is_available() else "cpu"
device

In [None]:
# 1. Get pretrained weights for ViT-Base
pretrained_vit_weights = torchvision.models.ViT_B_16_Weights.DEFAULT # requires torchvision >= 0.13, "DEFAULT" means best available

# 2. Setup a ViT model instance with pretrained weights
pretrained_vit = torchvision.models.vit_b_16(weights=pretrained_vit_weights).to(device)

# 3. Freeze the base parameters
for parameter in pretrained_vit.parameters():
    parameter.requires_grad = False

# 4. Change the classifier head (set the seeds to ensure same initialization with linear head)
set_seeds()
pretrained_vit.heads = nn.Linear(in_features=768, out_features=len(class_names)).to(device)
# pretrained_vit # uncomment for model output
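
The order of steps 3 and 4 matters: freezing first and replacing the head afterwards leaves only the new head trainable, because a freshly created `nn.Linear` has `requires_grad=True` by default. A minimal sketch with a hypothetical `TinyModel` standing in for the pretrained ViT (it is not the real architecture):

```python
import torch
from torch import nn

class TinyModel(nn.Module):
    """Hypothetical stand-in for a pretrained backbone plus classifier head."""
    def __init__(self):
        super().__init__()
        self.backbone = nn.Linear(8, 4)
        self.heads = nn.Linear(4, 10)

    def forward(self, x):
        return self.heads(self.backbone(x))

model = TinyModel()

# Freeze every existing parameter
for parameter in model.parameters():
    parameter.requires_grad = False

# Replace the head; the new layer's parameters are trainable by default
model.heads = nn.Linear(4, 3)

total = sum(p.numel() for p in model.parameters())
trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"total: {total}, trainable: {trainable}")
```

The same two sums can be run on `pretrained_vit` to confirm that only the new head will be updated during training.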

In [None]:
# # Print a summary using torchinfo (uncomment for actual output)
# summary(model=pretrained_vit,
#         input_size=(32, 3, 224, 224), # (batch_size, color_channels, height, width)
#         # col_names=["input_size"], # uncomment for smaller output
#         col_names=["input_size", "output_size", "num_params", "trainable"],
#         col_width=20,
#         row_settings=["var_names"]
# )

In [None]:
from helper_functions import download_data

# Download pizza, steak, sushi images from GitHub
image_path = download_data(source="https://github.com/mrdbourke/pytorch-deep-learning/raw/main/data/pizza_steak_sushi.zip",
                           destination="pizza_steak_sushi")
image_path

In [None]:
# Setup train and test directory paths
train_dir = image_path / "train"
test_dir = image_path / "test"
train_dir, test_dir

Ensure your own custom data is transformed/formatted in the same way as the data the original model was trained on.

In [None]:
# Get automatic transforms from pretrained ViT weights
pretrained_vit_transforms = pretrained_vit_weights.transforms()
print(pretrained_vit_transforms)

In [None]:
# Setup dataloaders
train_dataloader_pretrained, test_dataloader_pretrained, class_names = data_setup.create_dataloaders(train_dir=train_dir,
                                                                                                     test_dir=test_dir,
                                                                                                     transform=pretrained_vit_transforms,
                                                                                                     batch_size=32) # Could increase if we had more samples, such as here: https://arxiv.org/abs/2205.01580 (there are other improvements there too...)


In [None]:
from going_modular.going_modular import engine

# Create optimizer and loss function
optimizer = torch.optim.Adam(params=pretrained_vit.parameters(),
                             lr=1e-3)
loss_fn = torch.nn.CrossEntropyLoss()

# Train the classifier head of the pretrained ViT feature extractor model
set_seeds()
pretrained_vit_results = engine.train(model=pretrained_vit,
                                      train_dataloader=train_dataloader_pretrained,
                                      test_dataloader=test_dataloader_pretrained,
                                      optimizer=optimizer,
                                      loss_fn=loss_fn,
                                      epochs=10,
                                      device=device)

In [None]:
# Plot the loss curves
from helper_functions import plot_loss_curves

plot_loss_curves(pretrained_vit_results)

In [None]:
# Save the model
from going_modular import utils

utils.save_model(model=pretrained_vit,
                 target_dir="models",
                 model_name="08_pretrained_vit_feature_extractor_pizza_steak_sushi.pth")

In [None]:
from pathlib import Path

# Get the model size in bytes then convert to megabytes
pretrained_vit_model_size = Path("models/08_pretrained_vit_feature_extractor_pizza_steak_sushi.pth").stat().st_size // (1024*1024) # division converts bytes to megabytes (roughly)
print(f"Pretrained ViT feature extractor model size: {pretrained_vit_model_size} MB")

In [None]:
# predict on custom data
import requests

# Import function to make predictions on images and plot them
from helper_functions import pred_and_plot_image

# Setup custom image path
custom_image_path = image_path / "04-pizza-dad.jpeg"

# Download the image if it doesn't already exist
if not custom_image_path.is_file():
    with open(custom_image_path, "wb") as f:
        # When downloading from GitHub, need to use the "raw" file link
        request = requests.get("https://raw.githubusercontent.com/mrdbourke/pytorch-deep-learning/main/images/04-pizza-dad.jpeg")
        print(f"Downloading {custom_image_path}...")
        f.write(request.content)
else:
    print(f"{custom_image_path} already exists, skipping download.")

# Predict on custom image
pred_and_plot_image(model=pretrained_vit,
                    image_path=custom_image_path,
                    class_names=class_names)