<h1 style = "font-family: garamond; font-size: 40px; font-style: normal; letter-spacing: 1px; background-color: #f6f5f5; color :#6666ff; border-radius: 200px 200px; text-align:center">Vision Transformers</h1>

![](https://neurohive.io/wp-content/uploads/2020/10/archhhh2-770x388.png)


<p style = "font-family: garamond; font-size: 20px; font-style: normal; border-radius: 10px 10px; text-align:center">We split an image into fixed-size patches, linearly embed each of them, add position embeddings, and feed the resulting sequence of vectors to a standard Transformer encoder. To perform classification, we use the standard approach of adding an extra learnable “classification token” to the sequence.<br><br>A major challenge in applying Transformers to images without a CNN is that self-attention between individual pixels scales quadratically with the number of pixels. ViT sidesteps this by splitting each image into small patches (16x16 pixels in this notebook) and attending between patches instead.<br><br>When pre-trained on large datasets, the Vision Transformer (ViT) attains excellent results compared to state-of-the-art convolutional networks while requiring substantially fewer computational resources to train.</p>
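
As a rough illustration of the patch pipeline described above, the NumPy sketch below splits one 224x224 RGB image into 16x16 patches, projects each flattened patch to 128 dimensions, prepends a classification token, and adds position embeddings. The helper `image_to_patches` and the randomly initialised `embed_weights`, `cls_token`, and `pos_embed` are illustrative stand-ins (in a real model they are learned parameters); this is not the code used inside `vit_pytorch`.

```python
import numpy as np

def image_to_patches(image, patch_size):
    """Split an (H, W, C) image into flattened, non-overlapping square patches."""
    h, w, c = image.shape
    assert h % patch_size == 0 and w % patch_size == 0, "image must tile evenly"
    grid_h, grid_w = h // patch_size, w // patch_size
    # (grid_h, patch, grid_w, patch, C) -> (grid_h, grid_w, patch, patch, C)
    patches = image.reshape(grid_h, patch_size, grid_w, patch_size, c)
    patches = patches.transpose(0, 2, 1, 3, 4)
    # One row per patch, each row holding patch_size * patch_size * C values.
    return patches.reshape(grid_h * grid_w, patch_size * patch_size * c)

rng = np.random.default_rng(0)
image = rng.random((224, 224, 3))          # stand-in for a real image

patches = image_to_patches(image, 16)      # shape (196, 768)

# Hypothetical parameters; random here, learned in an actual model.
embed_weights = rng.normal(scale=0.02, size=(768, 128))
cls_token = np.zeros((1, 128))
pos_embed = rng.normal(scale=0.02, size=(197, 128))

tokens = patches @ embed_weights                               # shape (196, 128)
sequence = np.concatenate([cls_token, tokens], axis=0) + pos_embed

print(patches.shape, sequence.shape)
```

The resulting `sequence` has 197 rows: one classification token followed by 196 patch tokens, which is what the Transformer encoder would consume.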

<p p style = "font-family: garamond; font-size:30px; font-style: normal;background-color: #f6f5f5; color :#6666ff; border-radius: 10px 10px; text-align:center">Install and Import Libraries</p>

In [None]:
!pip install -q vit-pytorch
!pip install -q nystrom-attention

In [None]:
# Python library used for working with arrays.
import numpy as np

# Python library to interact with the file system.
import os

# Software library written for data manipulation and analysis. 
import pandas as pd

# fastai library for computer vision tasks
from fastai.vision.all import *
from fastai.metrics import *

# Developing and training neural network based deep learning models.
import torch

# Vision Transformer
from vit_pytorch.efficient import ViT

# Nystromformer
from nystrom_attention import Nystromformer

<p p style = "font-family: garamond; font-size:30px; font-style: normal;background-color: #f6f5f5; color :#6666ff; border-radius: 10px 10px; text-align:center">Load Training Data</p>

In [None]:
dataset_path = Path('../input/ranzcr-clip-catheter-line-classification')
os.listdir(dataset_path)

In [None]:
train_df = pd.read_csv(dataset_path/'train.csv')

In [None]:
train_df.head()

In [None]:
train_df['path'] = train_df['StudyInstanceUID'].map(lambda x:str(dataset_path/'train'/x)+'.jpg')
train_df = train_df.drop(columns=['StudyInstanceUID'])
train_df.head(10)
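
One practical upside of building the path by string concatenation, as above, is that it is safe for identifiers that contain dots, which UIDs of this kind typically do. `Path.with_suffix` would instead treat the last UID component as a file extension and replace it. A small sketch with a made-up UID and a hypothetical `data` directory:

```python
from pathlib import PurePosixPath

root = PurePosixPath('data')   # hypothetical dataset root
uid = '1.2.826.0.1'            # made-up UID containing dots

# Appending the extension as a string keeps the full UID.
concatenated = str(root / 'train' / uid) + '.jpg'

# with_suffix replaces everything after the last dot in the file name.
replaced = str((root / 'train' / uid).with_suffix('.jpg'))

print(concatenated)  # data/train/1.2.826.0.1.jpg
print(replaced)      # data/train/1.2.826.0.jpg
```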

<p p style = "font-family: garamond; font-size:30px; font-style: normal;background-color: #f6f5f5; color :#6666ff; border-radius: 10px 10px; text-align:center">Image Augmentation on Dataset</p>

In [None]:
# Transforms applied to each image individually (e.g. resizing).
item_tfms = RandomResizedCrop(224, min_scale=0.75, ratio=(1.,1.)) 

# Transforms applied to a whole batch of images at once (e.g. most augmentations).
batch_tfms = [*aug_transforms(size=224, max_warp=0), Normalize.from_stats(*imagenet_stats)]

In [None]:
label_names = list(train_df.columns[:11])

In [None]:
data = DataBlock(blocks=(ImageBlock, MultiCategoryBlock(encoded=True, vocab=label_names)), # multi-label target
                 splitter = RandomSplitter(seed=42),# split data into training and validation subsets.
                 get_x = ColReader(12),# image path: column 12, after the 11 label columns and PatientID.
                 get_y = ColReader(list(range(11))), # targets: the 11 label columns (0 to 10).
                 item_tfms = item_tfms,
                 batch_tfms = batch_tfms)

In [None]:
dls = data.dataloaders(train_df,bs=16)

# We can call show_batch() to see what a sample of a batch looks like.
dls.show_batch()

<p p style = "font-family: garamond; font-size:30px; font-style: normal;background-color: #f6f5f5; color :#6666ff; border-radius: 10px 10px; text-align:center">Nystromformer</p>

![](https://raw.githubusercontent.com/lucidrains/nystrom-attention/master/diagram.png)

<p style = "font-family: garamond; font-size: 20px; font-style: normal; border-radius: 10px 10px; text-align:center">The proposed architecture of efficient self-attention via Nystrom approximation. Each box represents an input, output, or intermediate matrix. The variable name and the size of the matrix are shown inside each box. × denotes matrix multiplication, and + denotes matrix addition.<br><br>The orange boxes are the matrices used in the Nystrom approximation. The green boxes are the skip connection added in parallel to the approximation. The dashed bounding box illustrates the three matrices of the Nystrom approximation of the softmax matrix in self-attention.</p>
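
To make the three-matrix structure more tangible, here is a minimal single-head NumPy sketch of the idea. It makes several simplifying assumptions: landmarks are taken as segment means of the queries and keys, the pseudo-inverse is computed with `np.linalg.pinv` rather than the iterative approximation used in the Nystromformer paper, and the skip connection is omitted. The function names are illustrative and do not correspond to the `nystrom_attention` package API.

```python
import numpy as np

def softmax(x):
    """Row-wise softmax with the usual max-subtraction for stability."""
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def exact_attention(q, k, v):
    """Standard softmax attention, quadratic in the sequence length."""
    scale = q.shape[-1] ** -0.5
    return softmax(q @ k.T * scale) @ v

def nystrom_attention_sketch(q, k, v, num_landmarks):
    """Approximate softmax attention using num_landmarks landmark points."""
    n, d = q.shape
    assert n % num_landmarks == 0, "sketch assumes n is a multiple of num_landmarks"
    scale = d ** -0.5
    segment = n // num_landmarks
    # Landmarks: the mean of each contiguous segment of queries / keys.
    q_land = q.reshape(num_landmarks, segment, d).mean(axis=1)
    k_land = k.reshape(num_landmarks, segment, d).mean(axis=1)
    # The three matrices of the approximation.
    left = softmax(q @ k_land.T * scale)         # (n, m)
    middle = softmax(q_land @ k_land.T * scale)  # (m, m)
    right = softmax(q_land @ k.T * scale)        # (m, n)
    return left @ np.linalg.pinv(middle) @ (right @ v)

rng = np.random.default_rng(0)
n, d = 16, 16
q, k, v = (rng.normal(size=(n, d)) for _ in range(3))

exact = exact_attention(q, k, v)
approx = nystrom_attention_sketch(q, k, v, num_landmarks=4)
full = nystrom_attention_sketch(q, k, v, num_landmarks=n)

print(approx.shape)
print("mean abs error with 4 landmarks:", np.abs(approx - exact).mean())
```

With as many landmarks as tokens, every segment holds a single token, so the three matrices coincide and the sketch reproduces exact attention up to floating-point error. Fewer landmarks trade accuracy for lower cost.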

In [None]:
efficient_transformer = Nystromformer(
    # Last dimension of output tensor after linear transformation nn.Linear(..., dim).
    dim = 128,
    # Number of Transformer blocks.
    depth = 6,
    # Number of heads in Multi-head Attention layer.
    heads = 8,
    # Number of landmarks used in the Nystrom approximation.
    num_landmarks = 256
)

In [None]:
model = ViT(
    # Last dimension of output tensor after linear transformation nn.Linear(..., dim).
    dim = 128,
    # If you have rectangular images, make sure image_size is the maximum of the width and height.
    image_size = 224,
    # n = (image_size // patch_size) ** 2 and n must be greater than 16.
    patch_size = 16,
    # Number of classes to classify.
    num_classes = 11,
    # Plug in your own efficient attention transformer (Linformer/Reformer/Nystromformer).
    transformer = efficient_transformer
)
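
For reference, the arithmetic below works out how many tokens the settings above produce. It assumes one classification token is prepended to the patch tokens, as described at the top of the notebook.

```python
image_size = 224
patch_size = 16
num_landmarks = 256

# Number of non-overlapping patches per image.
num_patches = (image_size // patch_size) ** 2

# Assumption: one classification token is prepended to the patch tokens.
seq_len = num_patches + 1

print(num_patches)               # 196
print(seq_len)                   # 197
print(seq_len <= num_landmarks)  # True
```

Note that the sequence (197 tokens) is shorter than `num_landmarks = 256`. How the library handles that case is not shown in this notebook, so it may be worth experimenting with a smaller landmark count.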

<p p style = "font-family: garamond; font-size:30px; font-style: normal;background-color: #f6f5f5; color :#6666ff; border-radius: 10px 10px; text-align:center">Training</p>

In [None]:
# Group together some dls, a model, and metrics to handle training
learn = Learner(dls, model, metrics = [accuracy_multi]) # Compute accuracy when input and target are the same size.
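
`accuracy_multi` scores every label of every image independently. The pure-Python sketch below mirrors that idea, assuming a sigmoid followed by a 0.5 threshold (which we believe matches fastai's defaults). `multi_label_accuracy` is an illustrative helper, not the fastai function.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def multi_label_accuracy(logits, targets, thresh=0.5):
    """Fraction of individual (image, label) predictions that match the target."""
    correct = total = 0
    for row_logits, row_targets in zip(logits, targets):
        for logit, target in zip(row_logits, row_targets):
            predicted = sigmoid(logit) > thresh
            correct += int(predicted == bool(target))
            total += 1
    return correct / total

# Two images, three labels each (toy values).
logits = [[2.0, -1.0, 0.3], [-0.5, 1.5, -2.0]]
targets = [[1, 0, 0], [0, 1, 1]]

acc = multi_label_accuracy(logits, targets)
print(acc)  # 4 of the 6 labels are predicted correctly
```

Because each label counts separately, this metric can look high on datasets where most labels are 0 for most images.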

In [None]:
# Choosing a good learning rate
learn.lr_find()

In [None]:
# We can use the fine_tune function to train a model with this given learning rate
learn.fine_tune(1,base_lr=0.0002290867705596611)

<p p style = "font-family: garamond; font-size:30px; font-style: normal;background-color: #f6f5f5; color :#6666ff; border-radius: 10px 10px; text-align:center">Load Test File</p>

In [None]:
sample_df = pd.read_csv(dataset_path/'sample_submission.csv')
sample_df.head()

In [None]:
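# Work on a copy so the original sample submission stays untouched.
_sample_df = sample_df.copy()
# Add a placeholder PatientID column so that, after dropping StudyInstanceUID,
# 'path' ends up at column index 12 as in train_df (the column read by ColReader(12)).
_sample_df['PatientID'] = 'None'
_sample_df['path'] = _sample_df['StudyInstanceUID'].map(lambda x:str(dataset_path/'test'/x)+'.jpg')
_sample_df = _sample_df.drop(columns=['StudyInstanceUID'])
# Build a test DataLoader that applies the validation transforms of dls.
test_dl = dls.test_dl(_sample_df)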


In [None]:
test_dl.show_batch()

<p p style = "font-family: garamond; font-size:30px; font-style: normal;background-color: #f6f5f5; color :#6666ff; border-radius: 10px 10px; text-align:center">Test Time Augmentation</p>

![](https://preview.ibb.co/kH61v0/pipeline.png)

<p style = "font-family: garamond; font-size: 20px; font-style: normal; border-radius: 10px 10px; text-align:center">Similar to what data augmentation does for the training set, Test Time Augmentation (TTA) applies random modifications to the test images. Instead of showing the regular, “clean” images to the trained model only once, we show it several augmented versions of each image. We then average the predictions for each image and take that as our final guess.<br><br>This works because averaging predictions over randomly modified images also averages out their errors. The error in any single prediction vector can be large enough to produce a wrong answer, but after averaging, the correct answer tends to stand out.</p>
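
The averaging step can be sketched with simulated numbers. In the example below, made-up "true" probabilities plus Gaussian noise stand in for a model's predictions on eight augmented copies of two images; no real model or augmentation is involved, and all values are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up "ideal" probabilities for 2 images and 3 labels.
true_probs = np.array([[0.9, 0.1, 0.2],
                       [0.2, 0.8, 0.6]])

# Simulate predictions on n_aug augmented copies: ideal values plus noise.
n_aug = 8
noise = rng.normal(scale=0.2, size=(n_aug, *true_probs.shape))
aug_preds = np.clip(true_probs + noise, 0.0, 1.0)   # shape (8, 2, 3)

# TTA: average the predictions over the augmentation axis.
tta_preds = aug_preds.mean(axis=0)                  # shape (2, 3)

single_pass_error = np.abs(aug_preds - true_probs).mean()
tta_error = np.abs(tta_preds - true_probs).mean()

print("average error of a single pass:", single_pass_error)
print("error after averaging:         ", tta_error)
```

The error of the averaged prediction can never exceed the average error of the individual passes, and with independent noise it is usually noticeably smaller.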

In [None]:
# Return predictions on the ds_idx dataset or dl using Test Time Augmentation
preds, _ = learn.tta(dl=test_dl,n=8)

<p p style = "font-family: garamond; font-size:30px; font-style: normal;background-color: #f6f5f5; color :#6666ff; border-radius: 10px 10px; text-align:center">Create Submission File</p>

In [None]:
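# Copy the sample submission and overwrite its label columns (positions 1 to 11)
# with the averaged TTA predictions in a single vectorized assignment.
submission_df = sample_df.copy()
label_cols = list(submission_df.columns[1:len(label_names) + 1])
submission_df[label_cols] = preds.numpy().astype(np.float32)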

In [None]:
submission_df.head(10)

In [None]:
submission_df.to_csv('submission.csv', index=False)
