# **Self-Attention Computer Vision**

Self-Attention Computer Vision, distributed as the `self_attention_cv` package, is a PyTorch-based library that collects self-attention building blocks for computer vision in one place. It includes a variety of self-attention layers and pre-trained models that can be dropped into any custom architecture, so users do not have to build these layers or blocks from scratch. Heavier models such as TransUNet and ViT can likewise be incorporated into custom models and trained in a short time, even in a CPU environment. According to its contributors, Adaloglou Nicolas and Sergios Karagiannakos, the library is still under development and is regularly updated with new models and architectures.

To read more about it, please refer to [this article](https://analyticsindiamag.com/pytorch-code-for-self-attention-computer-vision/).

This notebook draws on the following sources and papers:

https://github.com/The-AI-Summer/self-attention-cv
https://arxiv.org/pdf/1706.03762.pdf
https://analyticsindiamag.com/going-beyond-cnn-stand-alone-self-attention/
https://arxiv.org/pdf/2101.11605

# **Code Implementation**

In [None]:
!python -m pip install pip --upgrade --user -q --no-warn-script-location
!python -m pip install numpy pandas seaborn matplotlib scipy statsmodels scikit-learn tensorflow keras opencv-python pillow scikit-image torch torchvision \
    self_attention_cv --user -q --no-warn-script-location

# Restart the kernel so that the freshly installed packages can be imported
import IPython
IPython.Application.instance().kernel.do_shutdown(True)


## Multi-head Self Attention

According to the authors of the paper Attention Is All You Need:

> "An attention function can be described as mapping a query and a set of key-value pairs to an output, where the query, keys, values, and output are all vectors. The output is computed as a weighted sum of the values. The weight assigned to each value is computed by a compatibility function of the query with the corresponding key."
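
The description above can be sketched in a few lines of plain PyTorch. This is a minimal single-head illustration, not the library's implementation: the function name `scaled_dot_product_attention`, the absence of learned projections, and the mask convention (a boolean mask where `True` marks pairs to ignore) are assumptions made for the example. The compatibility function used here is the scaled dot product from the same paper.

```python
import math
import torch


def scaled_dot_product_attention(q, k, v, mask=None):
    """Return the weighted sum of values and the attention weights.

    q, k, v: tensors of shape [batch, tokens, dim].
    mask: optional boolean tensor [tokens, tokens]; True marks query-key
          pairs to ignore (an assumed convention for this sketch).
    """
    # Compatibility of every query with every key, scaled by sqrt(dim)
    scores = q @ k.transpose(-2, -1) / math.sqrt(q.shape[-1])
    if mask is not None:
        scores = scores.masked_fill(mask, float('-inf'))
    # Softmax turns each row of scores into weights that sum to 1
    weights = torch.softmax(scores, dim=-1)
    return weights @ v, weights


torch.manual_seed(0)
x = torch.rand(2, 10, 64)  # [batch, tokens, dim]
out, weights = scaled_dot_product_attention(x, x, x)
print(out.shape)      # torch.Size([2, 10, 64])
print(weights.shape)  # torch.Size([2, 10, 10])
```

Passing the same tensor as query, key, and value is what makes this self-attention. A multi-head layer runs several such attentions in parallel on learned projections of the input and concatenates the results.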

In [None]:
import torch
from self_attention_cv import MultiHeadSelfAttention

model = MultiHeadSelfAttention(dim=64)
x = torch.rand(16, 10, 64)  # [batch, tokens, dim]
mask = torch.zeros(10, 10)  # tokens X tokens
mask[5:8, 5:8] = 1
y = model(x, mask)

print('Shape of output is: ', y.shape)
print('-'*70)
print('Output corresponding to the first token/patch in the first batch \n')
print(y.detach().numpy()[0][0])


## Axial Attention

Axial attention is a family of self-attention layers used in autoregressive models such as Axial Transformers, which take high-dimensional inputs such as high-resolution images. Instead of attending over all pixels at once, it applies attention along one axis of the input at a time. The following code demonstrates the axial attention block on randomly generated image data of size 64 by 64.

In [None]:
# Axial Attention
from self_attention_cv import AxialAttentionBlock

model = AxialAttentionBlock(in_channels=256, dim=64, heads=8)
x = torch.rand(1, 256, 64, 64)  # [batch, channels, height, width]
y = model(x)

print('Shape of output is: ', y.shape)
print('-'*70)
print('Output corresponding to the first channel of the first image \n')
print(y.detach().numpy()[0][0])
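
To make the idea behind axial attention concrete, here is a rough sketch that attends along the width axis and then along the height axis. It is not how `AxialAttentionBlock` is implemented: the helper names `attend` and `axial_attention` are hypothetical, and the sketch uses a single head with no learned projections or positional terms.

```python
import torch


def attend(x):
    """Single-head self-attention over the second-to-last axis.

    x: [..., length, dim]. No learned projections, to keep the sketch short.
    """
    scores = x @ x.transpose(-2, -1) / x.shape[-1] ** 0.5
    return torch.softmax(scores, dim=-1) @ x


def axial_attention(x):
    """x: [batch, channels, height, width]; the output has the same shape."""
    # Width axis: every row of the image is a sequence of W tokens
    rows = attend(x.permute(0, 2, 3, 1))      # [B, H, W, C]
    # Height axis: every column is a sequence of H tokens
    cols = attend(rows.permute(0, 2, 1, 3))   # [B, W, H, C]
    return cols.permute(0, 3, 2, 1)           # back to [B, C, H, W]


x = torch.rand(1, 256, 64, 64)
y = axial_attention(x)
print(y.shape)  # torch.Size([1, 256, 64, 64])

# Number of query-key pairs for a 64 x 64 image
h = w = 64
full_pairs = (h * w) ** 2          # attention over all pixels at once
axial_pairs = h * w * (h + w)      # one pass per axis
print(full_pairs, axial_pairs)     # 16777216 524288
```

The pair counts at the end show why this factorisation matters for high-resolution inputs: two passes along single axes compare far fewer token pairs than one pass over every pixel.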

## Bottleneck Attention

Bottleneck Transformers employ multi-head self-attention layers in multiple computer vision tasks. The whole transformer block is available as a module in the library. The following code demonstrates the Bottleneck block with randomly generated feature maps of size 32 by 32.

In [None]:
from self_attention_cv.bottleneck_transformer import BottleneckBlock
x = torch.rand(1, 512, 32, 32)
bottleneck_block = BottleneckBlock(in_channels=512, fmap_size=(32, 32), heads=4, out_channels=1024, pooling=True)
y = bottleneck_block(x)

print('Shape of output is: ', y.shape)
print('-'*70)
print('Output corresponding to the first row of the first channel, first image \n')
print(y.detach().numpy()[0][0][0])

## Transformer Encoder


The encoder part of the base Transformer architecture can be obtained with the module TransformerEncoder. The following code demonstrates the usage of this module with randomly generated tokens of dimension 64.

In [None]:
# Transformer Encoder
from self_attention_cv import TransformerEncoder

model = TransformerEncoder(dim=64, blocks=6, heads=8)
x = torch.rand(16, 10, 64)  # [batch, tokens, dim]
mask = torch.zeros(10, 10)  # tokens X tokens
mask[5:8, 5:8] = 1
y = model(x, mask)

print('Shape of output is: ', y.shape)
print('-'*70)
print('Output corresponding to the first token/patch in the first batch \n')
print(y.detach().numpy()[0][0])

## Vision Transformer 

Vision Transformer (ViT) became popular across computer vision tasks, achieving state-of-the-art performance in many applications at the time of its publication. Although a few more recent architectures outperform ViT, most of them are built on top of it. The basic ViT is available as a module so that it can be used in any custom architecture. The following code demonstrates the module's usage with randomly generated 3-channel color images of size 256 by 256 in a 10-class classification problem.

In [None]:
from self_attention_cv import ViT

model = ViT(img_dim=256, in_channels=3, patch_dim=16, num_classes=10, dim=512)
x = torch.rand(2, 3, 256, 256)
y = model(x) # [2,10]

print('Shape of output is: ', y.shape)
print('-'*70)
print('Output corresponding to the first image \n')
print(y.detach().numpy()[0])
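
The first step of a ViT is to turn each image into a sequence of patch tokens. The sketch below shows that step in isolation for the same settings as above (256 by 256 images, 16 by 16 patches, token dimension 512). It is only an illustration under those assumptions and may differ from what the `ViT` module does internally; the variable names are made up for the example.

```python
import torch
from torch import nn

img_dim, patch_dim, channels, dim = 256, 16, 3, 512
x = torch.rand(2, channels, img_dim, img_dim)

# Cut every image into non-overlapping patch_dim x patch_dim patches
b, c, h, w = x.shape
patches = x.reshape(b, c, h // patch_dim, patch_dim, w // patch_dim, patch_dim)
patches = patches.permute(0, 2, 4, 3, 5, 1)  # [B, H/p, W/p, p, p, C]

# Flatten each patch into one vector: (256 / 16)^2 = 256 tokens of 16*16*3 = 768 values
tokens = patches.reshape(b, -1, patch_dim * patch_dim * c)
print(tokens.shape)  # torch.Size([2, 256, 768])

# A linear layer projects every flattened patch to the transformer dimension
project = nn.Linear(patch_dim * patch_dim * c, dim)
embedded = project(tokens)
print(embedded.shape)  # torch.Size([2, 256, 512])
```

From here on, the patch tokens are processed like the token sequences in the Transformer Encoder section, which is why the sequence length grows quadratically as the patch size shrinks.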

## Vision Transformer with ResNet50

A Vision Transformer with a ResNet backbone performs well on many computer vision tasks. The following code demonstrates the corresponding module's usage with randomly generated 3-channel color images of size 256 by 256 in a 10-class classification problem.

In [None]:
from self_attention_cv import ResNet50ViT

model = ResNet50ViT(img_dim=256, pretrained_resnet=False,
                    blocks=6, num_classes=10,
                    dim_linear_block=256, dim=256)
x = torch.rand(2, 3, 256, 256)
y = model(x) # [2,10]

print('Shape of output is: ', y.shape)
print('-'*70)
print('Output corresponding to the first image \n')
print(y.detach().numpy()[0])

## TransUNet 

TransUNet reported state-of-the-art results on medical image segmentation tasks at the time of its publication. This architecture is available as a module in the self_attention_cv library. The following code demonstrates the module's usage with randomly generated 3-channel color images of size 128 by 128. The output has the same height and width as the input images, with one channel per class.

In [None]:
from self_attention_cv.transunet import TransUnet
x = torch.rand(2, 3, 128, 128)
model = TransUnet(in_channels=3, img_dim=128, vit_blocks=8,
                  vit_dim_linear_mhsa_block=512, classes=5)
y = model(x) # [2, 5, 128, 128]

print('Shape of output is: ', y.shape)
print('-'*70)
print('Output corresponding to the first class channel of the first image \n')
print(y.detach().numpy()[0][0])
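
A segmentation output of shape [batch, classes, height, width] is usually turned into a label map by picking the highest-scoring class for every pixel. The snippet below sketches that post-processing step on a random stand-in tensor; it assumes the model output holds unnormalised per-class scores, which should be checked against the library before relying on it.

```python
import torch

# Stand-in for the model output y: [batch, classes, height, width]
logits = torch.rand(2, 5, 128, 128)

# Per-pixel class probabilities (sum to 1 over the class axis)
probs = torch.softmax(logits, dim=1)

# Most likely class for every pixel: integers in the range 0..4
pred = probs.argmax(dim=1)
print(pred.shape)  # torch.Size([2, 128, 128])
```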

## 1D Absolute Positional Embedding

Positional embeddings tell a self-attention layer where each token sits in the input. They come in two forms: absolute and relative. Position-aware self-attention models exhibit memory efficiency and improved performance. The Self-Attention Computer Vision library has separate modules for absolute and relative positional embeddings for 1D and 2D data. The following code demonstrates the 1-dimensional absolute positional embedding module with 20 tokens and a head dimension of 64.

In [None]:
from self_attention_cv.pos_embeddings import AbsPosEmb1D

model = AbsPosEmb1D(tokens=20, dim_head=64)
# batch heads tokens dim_head
x = torch.rand(2, 3, 20, 64)
y = model(x)

print('Shape of output is: ', y.shape)
print('-'*70)
print('Output corresponding to the first token in the first head, first batch \n')
print(y.detach().numpy()[0][0][0])
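
One common way to realise an absolute positional embedding is to learn one vector per position and take its dot product with every query, which yields a position-dependent term for the attention scores. The sketch below follows that idea. The class name `AbsolutePositionScores` is hypothetical, and the sketch is not guaranteed to match `AbsPosEmb1D` in detail.

```python
import torch
from torch import nn


class AbsolutePositionScores(nn.Module):
    """Hypothetical module, for illustration only."""

    def __init__(self, tokens, dim_head):
        super().__init__()
        # One learned vector per absolute position
        self.pos = nn.Parameter(torch.randn(tokens, dim_head) * dim_head ** -0.5)

    def forward(self, q):
        # q: [batch, heads, tokens, dim_head]
        # result: [batch, heads, tokens, tokens], query i against position j
        return torch.einsum('bhid,jd->bhij', q, self.pos)


abs_scores = AbsolutePositionScores(tokens=20, dim_head=64)
q = torch.rand(2, 3, 20, 64)
scores = abs_scores(q)
print(scores.shape)  # torch.Size([2, 3, 20, 20])
```

The resulting tensor has the same shape as the query-key attention scores, so it can simply be added to them before the softmax.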

## 1D Relative Positional Embedding

Relative positional embeddings have been reported to give better performance in neural machine translation than absolute positional embeddings. The following code demonstrates the 1-dimensional relative positional embedding module with 20 tokens and a head dimension of 64.

In [None]:
from self_attention_cv.pos_embeddings import RelPosEmb1D

model = RelPosEmb1D(tokens=20, dim_head=64, heads=3)
x = torch.rand(2, 3, 20, 64)
y = model(x)

print('Shape of output is: ', y.shape)
print('-'*70)
print('Output corresponding to the first token in the first head, first batch \n')
print(y.detach().numpy()[0][0][0])
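
In a relative scheme, the learned vector depends on the offset between the query position and the key position rather than on the key position itself. The sketch below implements this with a lookup table over offsets. The class name `RelativePositionScores` is hypothetical, and the direct index lookup is chosen for readability; it is an assumption of this example, not a description of how `RelPosEmb1D` computes its output.

```python
import torch
from torch import nn


class RelativePositionScores(nn.Module):
    """Hypothetical module, for illustration only."""

    def __init__(self, tokens, dim_head):
        super().__init__()
        # One learned vector per offset in [-(tokens - 1), tokens - 1]
        self.rel = nn.Parameter(torch.randn(2 * tokens - 1, dim_head) * dim_head ** -0.5)
        idx = torch.arange(tokens)
        # index[i, j] = (j - i) shifted so that it starts at 0
        self.register_buffer('index', idx[None, :] - idx[:, None] + tokens - 1)

    def forward(self, q):
        # q: [batch, heads, tokens, dim_head]
        emb = self.rel[self.index]  # [tokens, tokens, dim_head]
        return torch.einsum('bhid,ijd->bhij', q, emb)


rel_scores = RelativePositionScores(tokens=20, dim_head=64)
q = torch.rand(2, 3, 20, 64)
scores = rel_scores(q)
print(scores.shape)  # torch.Size([2, 3, 20, 20])
```

Because the table is indexed by `j - i`, token pairs that are the same distance apart share one embedding vector, regardless of where they occur in the sequence.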

## 2D Relative Positional Embedding

The following code demonstrates the 2-dimensional relative positional embedding module on a 32 by 32 feature map (1,024 positions) with a head dimension of 128.

In [None]:
from self_attention_cv.pos_embeddings import RelPosEmb2D
dim = 32  # spatial dim of the feat map
model = RelPosEmb2D(
    feat_map_size=(dim, dim),
    dim_head=128)

x = torch.rand(2, 4, dim*dim, 128)
y = model(x)

print('Shape of output is: ', y.shape)
print('-'*70)
print('Output corresponding to the first patch in the first head, first batch \n')
print(y.detach().numpy()[0][0][0])