###  **Agricultural Crop (Paddy) Disease Classification**

##### **Introduction:**

Image classification has long been one of the standard tasks for benchmarking computer vision algorithms. Over the past decade, deep Convolutional Neural Network (CNN) variants such as VGGNet, Inception, and ResNet have achieved extraordinary success in the field. Meanwhile, in natural language processing, the other mainstream branch of deep learning, the attention and bidirectional encoding mechanisms adopted in the Transformer have driven a major shift in how language models are built and applied. In recent years, researchers have investigated transplanting the Transformer design into computer vision tasks, giving rise to the family of Vision Transformer (ViT) models.

This project aims to classify diseases of paddy, a staple agricultural crop in South and South-east Asia. The dataset contains over 10,000 images annotated with 9 disease classes and 1 normal class. Ten images per class were sampled at random and visualized with OpenCV. The images generally look similar and are quite challenging to distinguish by eye.

<br>

Three models were tested: <br>

<ol>
<li>an Inception-ResNet model with a configurable depth of residual convolution blocks; between consecutive sequences of these blocks, an inception convolution block down-samples the feature maps;  </li><br>
<li>a Vision Transformer model, with the standard Transformer encoder components and an MLP head (a feed-forward network) that projects the features to the 10 classes; </li><br>
<li>a convolution- and pooling-based Vision Transformer model, in which all dense-layer operations in the Transformer encoder and the subsequent pooling layer are replaced by 2D convolution operations. </li><br>
</ol>

The idea of the Vision Transformer is to partition the image into smaller patches. These patches carry the sequential and spatial-proximity information of the pixels in an image, much like words within a sentence or paragraph. The patches are then arranged in order, combined with positional embeddings, and fed into the multi-head attention layer. The output of the attention layer is passed to a feed-forward network for classification.
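As a rough illustration of the patch bookkeeping, the sketch below computes how many patches a non-overlapping split produces and how long each flattened patch is. The helper name `vit_token_shape` is hypothetical; the 640×480 input, 32-pixel patches, and 3 channels are the settings used later in this notebook.

```python
def vit_token_shape(image_size, patch_size, channels=3, cls_token=True):
    """Return (num_patches, patch_dim, sequence_length) for a non-overlapping split."""
    height, width = image_size
    patch_h, patch_w = patch_size
    if height % patch_h or width % patch_w:
        raise ValueError("image dimensions must be divisible by the patch size")
    # one token per patch, laid out row by row
    num_patches = (height // patch_h) * (width // patch_w)
    # each patch is flattened into a vector of all its pixel values
    patch_dim = channels * patch_h * patch_w
    # an optional class token is prepended to the patch sequence
    seq_len = num_patches + (1 if cls_token else 0)
    return num_patches, patch_dim, seq_len


print(vit_token_shape((640, 480), (32, 32)))  # (300, 3072, 301)
```

These figures agree with the `[-1, 300, 3072]` and `[-1, 301, 1024]` shapes reported in the model summary of the standard ViT.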

The Vision Transformer code was adapted from the following GitHub repo:  <a href="https://github.com/lucidrains/vit-pytorch">https://github.com/lucidrains/vit-pytorch</a>. For the convolution-based Vision Transformer (the 3rd model), the code was modified in this notebook to combine ideas from the CvT and PiT models.

The results showed that the Inception-ResNet, despite using more dropout and having the fewest parameters of the three models, severely overfitted. It failed to learn the key features from these highly similar images, reaching only 19% accuracy on the testing set. The Vision Transformers, on the other hand, overfitted much less and achieved much higher testing-set accuracies of 75% (convolution-based) and 82% (standard ViT).

One possible explanation is that the attention mechanism in the Vision Transformers was able to capture small, hard-to-recognize features in the images, which boosted the models' ability to separate the disease classes.
<br>


In [1]:
import numpy as np
import pandas as pd
import os
import json
import random
import matplotlib.pyplot as plt
import collections
import math
from functools import reduce

In [2]:
import torch
from torch import nn
from torch.utils.data import DataLoader, Dataset
from torchvision import transforms
from torchvision import datasets
from torch.utils import data
from sklearn.model_selection import train_test_split

In [4]:
## Add configuration
!mkdir -p ~/.kaggle
!cp '/content/drive/MyDrive/Colab Notebooks/kaggle.json' ~/.kaggle/
!chmod 600 ~/.kaggle/kaggle.json

In [5]:
## Download dataset from Kaggle
!kaggle competitions download -c paddy-disease-classification

Downloading paddy-disease-classification.zip to /content
100% 1.02G/1.02G [00:42<00:00, 51.9MB/s]
100% 1.02G/1.02G [00:42<00:00, 25.8MB/s]


In [None]:
!unzip paddy-disease-classification.zip

In [7]:
## image dataloader
img_norm = transforms.Compose([
           transforms.Resize((640,480)),
           transforms.ToTensor()
           ])
train_data = datasets.ImageFolder(root = "/content/train_images/", transform = img_norm)
train_data_loader = DataLoader(train_data, batch_size = 16, shuffle=True)

In [None]:
## check image dataset
img, _ = train_data[0]
print("Number of Images: " + str(len(train_data)))
print("Number of Classes: " + str(len(train_data.classes)))
print("Image Shape: " + str(img.size()))

Number of Images: 10407
Number of Classes: 10
Image Shape: torch.Size([3, 640, 480])


In [None]:
## number of images per class
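## look counts up by class index so the pairing does not depend on Counter insertion order
counts = collections.Counter(train_data.targets)
[(name, counts[idx]) for idx, name in enumerate(train_data.classes)]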

[('bacterial_leaf_blight', 479),
 ('bacterial_leaf_streak', 380),
 ('bacterial_panicle_blight', 337),
 ('blast', 1738),
 ('brown_spot', 965),
 ('dead_heart', 1442),
 ('downy_mildew', 620),
 ('hispa', 1594),
 ('normal', 1764),
 ('tungro', 1088)]

In [None]:
## sample positions per class, read from the label list so that no image has to be loaded
query_pos = [[pos for pos, y in enumerate(train_data.targets) if y == n] for n in range(len(train_data.classes))]

In [None]:
print(len(query_pos))
print([len(x) for x in query_pos])

10
[479, 380, 337, 1738, 965, 1442, 620, 1594, 1764, 1088]


In [8]:
## set class weight for class imbalance
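## counts indexed by class id, so the weight order always matches train_data.classes
class_counts = collections.Counter(train_data.targets)
class_weights = torch.Tensor([class_counts[i] for i in range(len(train_data.classes))])
## inverse frequency, scaled so that the most frequent class has weight 1
class_weights_norm = 1 / (class_weights / torch.max(class_weights))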

In [None]:
print(class_weights)

tensor([ 479.,  380.,  337., 1738.,  965., 1442.,  620., 1594., 1764., 1088.])


In [None]:
print(class_weights_norm)

tensor([3.6827, 4.6421, 5.2344, 1.0150, 1.8280, 1.2233, 2.8452, 1.1066, 1.0000,
        1.6213])
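The weighting above is an inverse class frequency, scaled so that the largest class gets weight 1. As a minimal standard-library sketch (the helper name `inverse_frequency_weights` is hypothetical), the same numbers can be reproduced from the per-class counts printed earlier:

```python
counts = {
    "bacterial_leaf_blight": 479,
    "bacterial_leaf_streak": 380,
    "bacterial_panicle_blight": 337,
    "blast": 1738,
    "brown_spot": 965,
    "dead_heart": 1442,
    "downy_mildew": 620,
    "hispa": 1594,
    "normal": 1764,
    "tungro": 1088,
}


def inverse_frequency_weights(class_counts):
    """Weight each class by (largest class size) / (its own size)."""
    largest = max(class_counts.values())
    return {name: largest / n for name, n in class_counts.items()}


weights = inverse_frequency_weights(counts)
for name, w in weights.items():
    print(f"{name:26s}{w:.4f}")
```

The rarest class, `bacterial_panicle_blight`, ends up weighted about 5.2 times as heavily as `normal`, so mistakes on it contribute proportionally more to the loss.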


In [None]:
!pip install einops

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting einops
  Downloading einops-0.4.1-py3-none-any.whl (28 kB)
Installing collected packages: einops
Successfully installed einops-0.4.1


In [9]:
## Inception ResNet

class InceptionBlock(nn.Module):
    def __init__(self, dim_in, dim_out):
        super().__init__()
        self.conv1 = nn.Conv2d(dim_in, dim_out, kernel_size=1, stride=1, padding=0)
        self.conv2a = nn.Conv2d(dim_in, dim_out, kernel_size=3, stride=4, padding=1)
        self.conv2b = nn.Conv2d(dim_out, dim_out * 4, kernel_size=3, stride=4, padding=1)
        self.conv2c = nn.Conv2d(dim_out * 4, dim_out, kernel_size=1, stride=1, padding=0)
        self.postconv = nn.Conv2d(dim_out * 3, dim_out, kernel_size=1, stride=1, padding=0)
        self.maxpool = nn.MaxPool2d(kernel_size=3, stride=4, padding=1)
        self.bn = nn.BatchNorm2d(dim_out)
        self.act = nn.LeakyReLU()
    def forward(self, x):
        x1 = self.maxpool(self.act(self.bn(self.conv1(x))))
        x2 = self.act(self.bn(self.conv2a(x)))
        x3 = self.act(self.bn(self.conv2c(self.conv2b(self.conv1(x)))))
        x = torch.cat((x1, x2, x3), 1)
        x = self.act(self.bn(self.postconv(x)))
        return x

class ResBlock(nn.Module):
    def __init__(self, dim_in, dim_out):
        super().__init__()
        self.conv1 = nn.Conv2d(dim_in, dim_out, kernel_size=1, stride=1, padding=0)
        self.conv2 = nn.Conv2d(dim_out, dim_out, kernel_size=3, stride=1, padding=1)
        self.conv3 = nn.Conv2d(dim_out, dim_in, kernel_size=1, stride=1, padding=0)
        self.bn = nn.BatchNorm2d(dim_in)
        self.act = nn.LeakyReLU()
    def forward(self, x):
        x_re = self.conv3(self.conv2(self.conv1(x)))
        x_re = self.act(self.bn(x_re))
        x = torch.add(x, x_re)
        return x

class InceptionResNet(nn.Module):
    def __init__(self, depth, res_depth, dim_in, dim_out, height, width, nclass):
        super().__init__()
        self.depth = depth
        self.nclass = nclass
        self.res_depth = res_depth
        self.height, self.width = height, width
        self.out_height = self.height // (4 ** self.depth) \
                            if self.height % (4 ** self.depth) == 0 \
                            else self.height // (4 ** self.depth) + 1
        self.out_width = self.width // (4 ** self.depth) \
                            if self.width % (4 ** self.depth) == 0 \
                            else self.width // (4 ** self.depth) + 1

        self.conv_in = nn.Conv2d(dim_in, dim_out, kernel_size=1, stride=1, padding=0)
        dim_in, dim_out = dim_out, dim_out
        layers = nn.ModuleList([])
        for n in range(self.depth):
            for m in range(self.res_depth):
                layers.append(ResBlock(dim_in, dim_out * 2))
            layers.append(InceptionBlock(dim_in, dim_out * 4))
            layers.append(nn.Dropout(0.2))
            dim_in, dim_out = dim_out * 4, dim_out * 4
        self.layers = nn.Sequential(*layers)
        self.dense1 = nn.Linear(dim_out * self.out_height * self.out_width, self.out_height * self.out_width)
        self.dense2 = nn.Linear(self.out_height * self.out_width, self.nclass)
        self.bn = nn.BatchNorm1d(self.out_height * self.out_width)
        self.act = nn.ReLU()
        self.flatten = nn.Flatten()
        self.dropout = nn.Dropout(0.2)

    def forward(self, x):
        x = self.conv_in(x)
        x = self.layers(x)
        x = self.flatten(x)
        x = self.dropout(x)
        x = self.act(self.bn(self.dense1(x)))
        x = self.dropout(x)
        x = self.dense2(x)
        return x
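To see where the input size of `dense1` comes from, the following sketch traces the spatial size through the stride-4 inception blocks with the usual convolution output formula, floor((n + 2p - k) / s) + 1. It assumes the settings used for `cnn_model` in this notebook (`depth=3`, `dim_out=8`, 640×480 input); the helper names are hypothetical.

```python
def conv_out(size, kernel=3, stride=4, padding=1):
    """Output length of a convolution or pooling window along one axis."""
    return (size + 2 * padding - kernel) // stride + 1


def feature_map_after(height, width, depth):
    """Apply one stride-4 down-sampling step per inception block."""
    for _ in range(depth):
        height, width = conv_out(height), conv_out(width)
    return height, width


h, w = feature_map_after(640, 480, depth=3)
channels = 8 * 4 ** 3  # the channel count quadruples in each of the 3 stages
print(h, w, channels * h * w)  # 10 8 40960
```

With kernel 3, stride 4, and padding 1, each step is equivalent to a ceiling division by 4, which is what the `out_height` and `out_width` expressions in the constructor compute in closed form.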

In [10]:
cnn_model = InceptionResNet(depth=3, res_depth=2, dim_in=3, dim_out=8, height=640, width=480, nclass=10)

In [11]:
cnn_model = cnn_model.cuda()

In [None]:
## Vision-Transformer
from einops import rearrange, repeat
from einops.layers.torch import Rearrange

def pair(t):
    return t if isinstance(t, tuple) else (t, t)

class PreNorm(nn.Module):
    def __init__(self, dim, fn):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.fn = fn
    def forward(self, x, **kwargs):
        return self.fn(self.norm(x), **kwargs)

## Position-wise feed-forward (MLP) block applied after each attention layer
class FeedForward(nn.Module):
    def __init__(self, dim, hidden_dim, dropout = 0.1):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim, hidden_dim),
            nn.GELU(),
            nn.Dropout(dropout),
            nn.Linear(hidden_dim, dim),
            nn.Dropout(dropout)
        )
    def forward(self, x):
        return self.net(x)

## Multi-head Transformer Attention module
class Attention(nn.Module):
    def __init__(self, dim, heads = 8, dim_head = 64, dropout = 0.1):
        super().__init__()
        inner_dim = dim_head * heads
        project_out = not (heads == 1 and dim_head == dim)
        
        self.heads = heads
        self.scale = dim_head ** -0.5
        self.attend = nn.Softmax(dim = -1)
        self.dropout = nn.Dropout(dropout)
        self.to_qkv = nn.Linear(dim, inner_dim * 3, bias = False)
        self.to_out = nn.Sequential(
            nn.Linear(inner_dim, dim),
            nn.Dropout(dropout)
        )
        
    def forward(self, x):
        ## map and split
        qkv = self.to_qkv(x).chunk(3, dim = -1)
        ## 3-dim to 4-dim and reshape
        q, k, v = map(lambda t: rearrange(t, 'b n (h d) -> b h n d', h = self.heads), qkv)
        ## dot product
        dots = torch.matmul(q, k.transpose(-1, -2)) * self.scale
        ## attention
        attn = self.attend(dots)
        attn = self.dropout(attn)
        ## attention output and v
        out = torch.matmul(attn, v)
        ## reshape 4-dim to 3-dim and dense output
        out = rearrange(out, 'b h n d -> b n (h d)')
        return self.to_out(out)

## Transformer Encoder
class Transformer(nn.Module):
    def __init__(self, dim, depth, heads, dim_head, mlp_dim, dropout = 0.1):
        super().__init__()
        self.layers = nn.ModuleList([])
        for _ in range(depth):
            self.layers.append(nn.ModuleList([
                PreNorm(dim, Attention(dim, heads = heads, dim_head = dim_head, dropout = dropout)),
                PreNorm(dim, FeedForward(dim, mlp_dim, dropout = dropout))
            ]))
            
    def forward(self, x):
        for attn, ff in self.layers:
            x = attn(x) + x
            x = ff(x) + x
        return x

## Projected patch embedding -> Positional Embedding -> Transformer Encoder -> MLP head
class ViT(nn.Module):
    def __init__(self, *, image_size, patch_size, num_classes, dim, depth, heads, mlp_dim, pool = 'cls', 
                 channels = 3, dim_head = 64, dropout = 0.1, emb_dropout = 0.01):
        super().__init__()
        image_height, image_width = pair(image_size)
        patch_height, patch_width = pair(patch_size)
        num_patches = (image_height // patch_height) * (image_width // patch_width)
        patch_dim = channels * patch_height * patch_width
        
        self.to_patch_embedding = nn.Sequential(
            Rearrange('b c (h p1) (w p2) -> b (h w) (p1 p2 c)', p1 = patch_height, p2 = patch_width),
            nn.Linear(patch_dim, dim),
        )
        self.pos_embedding = nn.Parameter(torch.randn(1, num_patches + 1, dim))
        self.cls_token = nn.Parameter(torch.randn(1, 1, dim))
        self.dropout = nn.Dropout(emb_dropout)
        self.pool = pool
        self.to_latent = nn.Identity()
        self.transformer = Transformer(dim, depth, heads, dim_head, mlp_dim, dropout)
        self.mlp_head = nn.Sequential(
            nn.LayerNorm(dim),
            nn.Linear(dim, num_classes)
        )

    def forward(self, img):
        x = self.to_patch_embedding(img)
        b, n, _ = x.shape
        cls_tokens = repeat(self.cls_token, '1 1 d -> b 1 d', b = b)
        x = torch.cat((cls_tokens, x), dim=1)
        x += self.pos_embedding[:, :(n + 1)]
        x = self.dropout(x)
        x = self.transformer(x)
        x = x.mean(dim = 1) if self.pool == 'mean' else x[:, 0]
        x = self.to_latent(x)
        return self.mlp_head(x)
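The core of the `Attention` module is scaled dot-product attention. The sketch below is a deliberately simplified, single-head version written with plain Python lists; it leaves out the learned `to_qkv` projection, the multi-head split, and dropout, and the function names are hypothetical.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of scores."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(q, k, v):
    """Single-head scaled dot-product attention on lists of token vectors."""
    scale = len(q[0]) ** -0.5  # same role as dim_head ** -0.5 in the module
    out = []
    for qi in q:
        # similarity of this query with every key
        scores = [scale * sum(a * b for a, b in zip(qi, kj)) for kj in k]
        weights = softmax(scores)
        # weighted average of the value vectors
        out.append([sum(w * vj[d] for w, vj in zip(weights, v))
                    for d in range(len(v[0]))])
    return out


tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
mixed = attention(tokens, tokens, tokens)
print(mixed)
```

Each output token is a convex combination of the value vectors, so every token can draw information from every other patch in a single layer.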

In [None]:
vit_model = ViT(
    image_size = (640, 480),
    patch_size = 32,
    num_classes = 10,
    dim = 1024,
    depth = 5,
    heads = 10,
    dim_head = 128,
    mlp_dim = 2048,
    pool = 'cls', 
    channels = 3, 
    dropout = 0.1,
    emb_dropout = 0.01
)
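As a sanity check on this configuration, the parameter counts of the main linear layers can be worked out by hand. The sketch below uses only the hyperparameters passed above; the helper `linear_params` is hypothetical and simply counts weights plus optional biases.

```python
def linear_params(n_in, n_out, bias=True):
    """Number of trainable parameters in a dense layer."""
    return n_in * n_out + (n_out if bias else 0)


dim, heads, dim_head, mlp_dim = 1024, 10, 128, 2048
patch_dim = 3 * 32 * 32        # channels * patch height * patch width
inner_dim = heads * dim_head   # width of the concatenated attention heads

counts = {
    "patch_embedding": linear_params(patch_dim, dim),
    "to_qkv": linear_params(dim, 3 * inner_dim, bias=False),
    "to_out": linear_params(inner_dim, dim),
    "ff_in": linear_params(dim, mlp_dim),
    "ff_out": linear_params(mlp_dim, dim),
}
for name, n in counts.items():
    print(f"{name:16s}{n:>12,}")
```

The first four values (3,146,752, 3,932,160, 1,311,744, and 2,099,200) match the corresponding `Linear` rows in the `torchsummary` output for `vit_model`.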

In [None]:
vit_model = vit_model.cuda()

In [None]:
## Convolution-based Vision-Transformer

from math import sqrt
from einops import rearrange, repeat
from einops.layers.torch import Rearrange
import torch.nn.functional as F
from fractions import Fraction

def pair(t):
    return t if isinstance(t, tuple) else (t, t)

def cast_tuple(val, num):
    return val if isinstance(val, tuple) else (val,) * num

def conv_output_size(image_size, kernel_size, stride, padding = 0):
    return int(((image_size - kernel_size + (2 * padding)) / stride) + 1)

class LayerNorm(nn.Module):
    def __init__(self, dim, eps = 1e-5):
        super().__init__()
        self.eps = eps
        self.g = nn.Parameter(torch.ones(1, dim, 1, 1))
        self.b = nn.Parameter(torch.zeros(1, dim, 1, 1))
    def forward(self, x):
        var = torch.var(x, dim = 1, unbiased = False, keepdim = True)
        mean = torch.mean(x, dim = 1, keepdim = True)
        return (x - mean) / (var + self.eps).sqrt() * self.g + self.b

class PreNorm(nn.Module):
    def __init__(self, dim, fn):
        super().__init__()
        self.norm = LayerNorm(dim)
        self.fn = fn
    def forward(self, x, **kwargs):
        x = self.norm(x)
        return self.fn(x, **kwargs)
    
# depthwise convolution, for attention-projection & token-pooling
class DepthWiseConv2d(nn.Module):
    def __init__(self, dim_in, dim_out, kernel_size, padding, stride, bias = True):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(dim_in, dim_out, kernel_size = kernel_size, stride = stride, 
                      padding = padding, groups = dim_in, bias = bias),
            nn.Conv2d(dim_out, dim_out, kernel_size = 1, bias = bias)
        )
    def forward(self, x):
        return self.net(x)
    
class FeedForward(nn.Module):
    def __init__(self, dim, aspect_ratio, hidden_dim, dropout = 0.1):
        super().__init__()
        self.aspect_ratio = aspect_ratio
        self.net = nn.Sequential(
            nn.Conv2d(dim, hidden_dim, 1),
            nn.GELU(),
            nn.Dropout(dropout),
            nn.Conv2d(hidden_dim, dim, 1),
            nn.Dropout(dropout)
        )
    def forward(self, x):
        return self.net(x)

class Attention(nn.Module):
    def __init__(self, dim, aspect_ratio, proj_kernel, heads = 8, dim_head = 64, dropout = 0.1):
        super().__init__()
        inner_dim = dim_head *  heads
        padding = proj_kernel // 2
        self.aspect_ratio = aspect_ratio
        self.heads = heads
        self.scale = dim_head ** -0.5

        self.attend = nn.Softmax(dim = -1)
        self.dropout = nn.Dropout(dropout)
        
        self.to_q = DepthWiseConv2d(dim, inner_dim, proj_kernel, padding = padding, stride = 1, bias = False)
        self.to_kv = DepthWiseConv2d(dim, inner_dim * 2, proj_kernel, padding = padding, stride = 1, bias = False)
        self.to_out = nn.Sequential(
            nn.Conv2d(inner_dim, dim, 1),
            nn.Dropout(dropout)
        )

    def forward(self, x):
        ## x is (batch, channels, height, width); keep the width y to undo the flattening later
        b, n, _, y, h = *x.shape, self.heads
        q, k, v = (self.to_q(x), *self.to_kv(x).chunk(2, dim = 1))
        q, k, v = map(lambda t: rearrange(t, 'b (h d) x y -> (b h) (x y) d', h = h), (q, k, v))
        
        dots = torch.einsum('b i d, b j d -> b i j', q, k) * self.scale
        attn = self.attend(dots)
        attn = self.dropout(attn)
        
        out = torch.einsum('b i j, b j d -> b i d', attn, v)
        out = rearrange(out, '(b h) (x y) d -> b (h d) x y', h = h, y = y)
        out = self.to_out(out)
        return out

## Transformer Encoder
class Transformer(nn.Module):
    def __init__(self, dim, aspect_ratio, proj_kernel, depth, heads, dim_head, mlp_dim, dropout = 0.1):
        super().__init__()
        self.layers = nn.ModuleList([])
        for _ in range(depth):
            self.layers.append(nn.ModuleList([
                PreNorm(dim, Attention(dim, aspect_ratio, 
                                       proj_kernel = proj_kernel, 
                                       heads = heads, 
                                       dim_head = dim_head, 
                                       dropout = dropout)),
                PreNorm(dim, FeedForward(dim, aspect_ratio, mlp_dim, dropout = dropout))
            ]))
            
    def forward(self, x):
        for attn, ff in self.layers:
            x = attn(x) + x
            x = ff(x) + x
        return x
    
## Pooling layer
class Pool(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.downsample = DepthWiseConv2d(dim, dim * 2, kernel_size = 3, stride = 2, padding = 1)
    def forward(self, x):
        tokens = self.downsample(x)
        return tokens

class PoolConViT(nn.Module):
    def __init__(
        self, *,
        image_size,
        patch_size,
        num_classes,
        proj_kernel = 3, 
        dim = 256,
        depth = (3, 3, 3), 
        heads = 10,
        mlp_dim = 2048,
        dim_head = 64,
        dropout = 0.1,
        emb_dropout = 0.01,
        channels = 3
    ):
        super().__init__()
        heads = cast_tuple(heads, len(depth))
        image_height, image_width = pair(image_size)
        patch_height, patch_width = pair(patch_size)
        patch_dim = channels * patch_height * patch_width
        aspect_ratio = image_height / image_width

        unfold_height = image_height // (patch_size // 2)
        unfold_width = image_width // (patch_size // 2)
        output_height_size = conv_output_size(image_height, patch_height, patch_height // 2)
        output_width_size = conv_output_size(image_width, patch_width, patch_width // 2)
        num_patches = output_height_size * output_width_size

        self.to_patch_embedding = nn.Sequential(
            nn.Unfold(kernel_size = patch_size, stride = patch_size // 2),
            Rearrange('b n c -> b c n'),
            nn.Linear(patch_dim, dim * heads[0])
        )
        
        self.pre_encoder_head = nn.Sequential(
            nn.Linear(dim * heads[0], dim),
            Rearrange('b c n -> b n c'),
            nn.Linear(num_patches + 1, unfold_height * unfold_width),
            Rearrange('b c (m n) -> b c m n', m = unfold_height)
        )
        
        self.pos_embedding = nn.Parameter(torch.randn(1, num_patches + 1, dim * heads[0]))
        self.cls_token = nn.Parameter(torch.randn(1, 1, dim * heads[0]))
        self.dropout = nn.Dropout(emb_dropout)

        layers = []
        aspect_height = unfold_height
        for ind, (layer_depth, layer_heads) in enumerate(zip(depth, heads)):
            not_last = ind < (len(depth) - 1)
            layers.append(Transformer(dim, aspect_ratio, 
                                      proj_kernel, layer_depth, 
                                      layer_heads, dim_head, mlp_dim, dropout))
            if not_last:
                layers.append(Pool(dim))
                dim *= 2
        self.layers = nn.Sequential(*layers)
        
        self.mlp_head = nn.Sequential(
            Rearrange('b c m n -> b (c m n)'),
            nn.Linear(dim * \
                      math.ceil(unfold_height / (2 ** (len(depth) - 1))) * \
                      math.ceil(unfold_width / (2 ** (len(depth) - 1))), 
                      num_classes)
        )

    def forward(self, img):
        x = self.to_patch_embedding(img)
        b, n, _ = x.shape
        cls_tokens = repeat(self.cls_token, '1 1 d -> b 1 d', b = b)
        x = torch.cat((cls_tokens, x), dim=1)
        x += self.pos_embedding[:, :(n+1)]
        x = self.dropout(x)
        x = self.pre_encoder_head(x)
        x = self.layers(x)
        ## no class-token pooling here: the MLP head flattens the whole feature map
        x = self.mlp_head(x)
        return x

In [None]:
pcvit_model = PoolConViT(image_size = (640, 480), 
                         patch_size = 32,
                         num_classes = 10,
                         proj_kernel = 3,
                         dim = 160,
                         depth = (3, 3, 3), 
                         heads = 10,
                         dim_head = 64,
                         mlp_dim = 1600,
                         dropout = 0.1,
                         emb_dropout = 0.01,
                         channels = 3)
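The shape bookkeeping of `PoolConViT` is easier to follow with concrete numbers. The sketch below assumes the settings above (640×480 input, `patch_size = 32` with a stride of half a patch, `dim = 160`, three stages) and recomputes the number of overlapping patches, the 2D grid the tokens are reshaped to, and the input size of the final linear layer.

```python
import math


def conv_output_size(size, kernel, stride, padding=0):
    """Number of sliding-window positions along one axis."""
    return (size + 2 * padding - kernel) // stride + 1


image_h, image_w, patch, dim, stages = 640, 480, 32, 160, 3
stride = patch // 2  # neighbouring patches overlap by half a patch

n_patches = (conv_output_size(image_h, patch, stride)
             * conv_output_size(image_w, patch, stride))
grid_h, grid_w = image_h // stride, image_w // stride  # target 2D token grid

n_pools = stages - 1  # a Pool layer follows every stage except the last
final_h = math.ceil(grid_h / 2 ** n_pools)
final_w = math.ceil(grid_w / 2 ** n_pools)
final_dim = dim * 2 ** n_pools  # channels double at every Pool layer

print(n_patches, (grid_h, grid_w), final_dim * final_h * final_w)
# 1131 (40, 30) 51200
```

The 1131 patches and the 40×30 grid agree with the `Unfold` and `Rearrange` rows in the `torchsummary` output for `pcvit_model`.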

In [None]:
pcvit_model = pcvit_model.cuda()

In [None]:
!pip install torchsummary

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [12]:
from torchsummary import summary
summary(cnn_model, input_size = (3,640,480), device="cuda")

----------------------------------------------------------------
        Layer (type)               Output Shape         Param #
            Conv2d-1          [-1, 8, 640, 480]              32
            Conv2d-2         [-1, 16, 640, 480]             144
            Conv2d-3         [-1, 16, 640, 480]           2,320
            Conv2d-4          [-1, 8, 640, 480]             136
       BatchNorm2d-5          [-1, 8, 640, 480]              16
         LeakyReLU-6          [-1, 8, 640, 480]               0
          ResBlock-7          [-1, 8, 640, 480]               0
            Conv2d-8         [-1, 16, 640, 480]             144
            Conv2d-9         [-1, 16, 640, 480]           2,320
           Conv2d-10          [-1, 8, 640, 480]             136
      BatchNorm2d-11          [-1, 8, 640, 480]              16
        LeakyReLU-12          [-1, 8, 640, 480]               0
         ResBlock-13          [-1, 8, 640, 480]               0
           Conv2d-14         [-1, 32, 6

In [None]:
from torchsummary import summary
summary(vit_model, input_size = (3,640,480), device="cuda")

----------------------------------------------------------------
        Layer (type)               Output Shape         Param #
         Rearrange-1            [-1, 300, 3072]               0
            Linear-2            [-1, 300, 1024]       3,146,752
           Dropout-3            [-1, 301, 1024]               0
         LayerNorm-4            [-1, 301, 1024]           2,048
            Linear-5            [-1, 301, 3840]       3,932,160
           Softmax-6         [-1, 10, 301, 301]               0
           Dropout-7         [-1, 10, 301, 301]               0
            Linear-8            [-1, 301, 1024]       1,311,744
           Dropout-9            [-1, 301, 1024]               0
        Attention-10            [-1, 301, 1024]               0
          PreNorm-11            [-1, 301, 1024]               0
        LayerNorm-12            [-1, 301, 1024]           2,048
           Linear-13            [-1, 301, 2048]       2,099,200
             GELU-14            [-1, 30

In [None]:
from torchsummary import summary
summary(pcvit_model, input_size = (3,640,480), device="cuda")

----------------------------------------------------------------
        Layer (type)               Output Shape         Param #
            Unfold-1           [-1, 3072, 1131]               0
         Rearrange-2           [-1, 1131, 3072]               0
            Linear-3           [-1, 1131, 1600]       4,916,800
           Dropout-4           [-1, 1132, 1600]               0
            Linear-5            [-1, 1132, 160]         256,160
         Rearrange-6            [-1, 160, 1132]               0
            Linear-7            [-1, 160, 1200]       1,359,600
         Rearrange-8          [-1, 160, 40, 30]               0
         LayerNorm-9          [-1, 160, 40, 30]               0
           Conv2d-10          [-1, 640, 40, 30]           5,760
           Conv2d-11          [-1, 640, 40, 30]         409,600
  DepthWiseConv2d-12          [-1, 640, 40, 30]               0
           Conv2d-13         [-1, 1280, 40, 30]          11,520
           Conv2d-14         [-1, 1280,

In [13]:
def dataset_split(dataset, val_split=0.2, random_state=42):
    train_idx, val_idx = train_test_split(
        list(range(len(dataset))), 
        test_size=val_split, 
        stratify=dataset.targets, 
        random_state=random_state)
    datasets = {}
    datasets['train'] = torch.utils.data.Subset(dataset, train_idx)
    datasets['test'] = torch.utils.data.Subset(dataset, val_idx)
    return datasets, (train_idx, val_idx)

## splitting training and testing sets
train_data, split_indx = dataset_split(train_data)

In [None]:
print(len(train_data['train']))
print(len(train_data['test']))

In [14]:
def dataset_split(dataset, val_split=0.1, random_state=42):
    ## read the labels through the Subset indices instead of loading every image
    y = [dataset.dataset.targets[i] for i in dataset.indices]
    train_idx, val_idx = train_test_split(
        list(range(len(dataset))), 
        test_size=val_split, 
        stratify=y, 
        random_state=random_state)
    datasets = {}
    datasets['train'] = torch.utils.data.Subset(dataset, train_idx)
    datasets['val'] = torch.utils.data.Subset(dataset, val_idx)
    return datasets

## splitting validation set in training set
train_ds = train_data['train']
train_ds = dataset_split(train_ds)

In [None]:
print(len(train_ds['train']))
print(len(train_ds['val']))

7492
833
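The two-stage split sizes can be checked with a little arithmetic. This sketch assumes that `train_test_split` rounds the held-out share up to the next whole sample, which is consistent with the 7492 / 833 printed above; the helper `split_sizes` is hypothetical.

```python
import math


def split_sizes(n_samples, held_out_fraction):
    """Return (kept, held_out) sizes, rounding the held-out part up."""
    n_held_out = math.ceil(held_out_fraction * n_samples)
    return n_samples - n_held_out, n_held_out


n_train_full, n_test = split_sizes(10407, 0.2)   # first split: train vs. test
n_train, n_val = split_sizes(n_train_full, 0.1)  # second split: train vs. validation
print(n_train, n_val, n_test)  # 7492 833 2082
```

Overall, roughly 72% of the images are used for training, 8% for validation, and 20% for testing.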


In [15]:
from sklearn.metrics import precision_score, recall_score, f1_score

class TrainingProcessor:
    def __init__(self, train_data, model, epochs, batch_size=8, learning_rate=0.0001, scheduler=True, class_weights=None):
        super(TrainingProcessor, self).__init__()
        self.train_data = train_data
        self.batch_size = batch_size
        self.traindataloader = DataLoader(self.train_data['train'], batch_size = self.batch_size, shuffle=True)
        self.valdataloader = DataLoader(self.train_data['val'], batch_size = self.batch_size, shuffle=True)
        
        self.epochs = epochs
        self.learning_rate = learning_rate
        self.model = model
        self.optimizer = torch.optim.Adam(self.model.parameters(), lr = self.learning_rate)
        
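        ## the plateau scheduler is optional; keep the attribute defined either way
        self.scheduler = None
        if scheduler:
            self.scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
                self.optimizer, mode='min', factor=0.5, patience=3, min_lr=5e-6
            )
        ## None means an unweighted cross-entropy loss
        self.class_weights = class_weights
            
    def loss(self, y_pred, y_true):
        weights = None
        if self.class_weights is not None:
            ## move the weights to the device and dtype of the logits
            weights = torch.as_tensor(self.class_weights, dtype=y_pred.dtype, device=y_pred.device)
        weighted_ce_loss = nn.CrossEntropyLoss(weight = weights)(y_pred, y_true)
        return weighted_ce_loss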
    
    def get_metrics(self, y_pred_tags, y_true_tags):
        correct_pred = (y_pred_tags == y_true_tags).float()
        accuracy = correct_pred.sum() / len(correct_pred)
        y_pred_tags, y_true_tags = y_pred_tags.cpu().numpy(), y_true_tags.cpu().numpy()
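        ## scikit-learn expects (y_true, y_pred); the opposite order swaps precision and recall
        precision = precision_score(y_true_tags, y_pred_tags, average=None)
        recall = recall_score(y_true_tags, y_pred_tags, average=None)
        F1 = f1_score(y_true_tags, y_pred_tags, average=None)
        ## macro average over the classes that occur in this batch
        return accuracy.item(), np.mean(precision), np.mean(recall), np.mean(F1)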
    
    def evaluate(self):
        with torch.no_grad():
            self.model.eval()
            evaluated_metrics = []
            evaluated_loss = []
            for i, (batch_x, batch_y) in enumerate(self.valdataloader):
                ## forward pass only, no weight update
                batch_x, batch_y = batch_x.cuda(), batch_y.cuda()
                out = self.model(batch_x)
                _, out_tags = torch.max(torch.log_softmax(out, dim = 1), dim = 1)
                out_loss = self.loss(out, batch_y)
                evaluated_loss.append(out_loss.item())
                ## get metrics
                accuracy, precision, recall, F1 = self.get_metrics(out_tags, batch_y)
                evaluated_metrics.append([accuracy, precision, recall, F1])
        return evaluated_loss, evaluated_metrics
        
    def train(self, start_epoch=0):
        for epoch in range(start_epoch, self.epochs + start_epoch):
            ## initialization
            self.model.train()
            metrics = []
            for i, (batch_x, batch_y) in enumerate(self.traindataloader):
                ## iteration training
                batch_x, batch_y = batch_x.cuda(), batch_y.cuda()
                self.optimizer.zero_grad()
                out = self.model(batch_x)
                _, out_tags = torch.max(torch.log_softmax(out, dim = 1), dim = 1)
                out_loss = self.loss(out, batch_y)
                out_loss.backward()
                self.optimizer.step()
                ## get metrics
                accuracy, precision, recall, F1 = self.get_metrics(out_tags, batch_y)
                metrics.append([accuracy, precision, recall, F1])
            
            val_loss, val_metrics = self.evaluate()
            val_loss = np.mean(val_loss)
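            ## the scheduler is optional, so step it only when one was created
            if getattr(self, "scheduler", None) is not None:
                self.scheduler.step(val_loss)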
            val_accuracy, val_precision, val_recall, val_F1 = \
                np.mean([x[0] for x in val_metrics]), \
                np.mean([x[1] for x in val_metrics]), \
                np.mean([x[2] for x in val_metrics]), \
                np.mean([x[3] for x in val_metrics])
            
            ## note: this is the loss of the last training batch only, not the epoch mean
            train_loss = out_loss.item()
            train_accuracy, train_precision, train_recall, train_F1 = \
                np.mean([x[0] for x in metrics]), \
                np.mean([x[1] for x in metrics]), \
                np.mean([x[2] for x in metrics]), \
                np.mean([x[3] for x in metrics])
            
            print("Epoch " + str(epoch+1) + " || " + \
                  "train_loss: {:.4f} val_loss: {:.4f} | ".format(train_loss, val_loss) + \
                  "train_acc: {:.4f} val_acc: {:.4f} | ".format(train_accuracy, val_accuracy) + \
                  "train_p: {:.4f} val_p: {:.4f} | ".format(train_precision, val_precision) + \
                  "train_r: {:.4f} val_r: {:.4f} | ".format(train_recall, val_recall) + \
                  "train_F1: {:.4f} val_F1: {:.4f} | ".format(train_F1, val_F1)
                  )
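
Note that the `train_loss` in the log line is `out_loss.item()` from the final batch of each epoch, while `val_loss` is averaged with `np.mean`, so the two columns are not directly comparable and `train_loss` can jump between epochs. A minimal sketch of averaging the per-batch loss instead, assuming a hypothetical helper `RunningMean` that is not part of the notebook:

```python
class RunningMean:
    """Accumulate per-batch values and report their mean (hypothetical helper)."""

    def __init__(self):
        self.total = 0.0
        self.count = 0

    def update(self, value, n=1):
        # weight each batch by its size so a smaller last batch is not over-counted
        self.total += float(value) * n
        self.count += n

    def mean(self):
        # NaN signals that nothing has been accumulated yet
        return self.total / self.count if self.count else float("nan")


# illustrative per-batch (loss, batch size) pairs: made-up numbers, not taken from the runs
epoch_loss = RunningMean()
for batch_loss, batch_size in [(2.0, 16), (1.0, 16), (4.0, 8)]:
    epoch_loss.update(batch_loss, n=batch_size)
print(epoch_loss.mean())
```

Inside `train`, something like `epoch_loss.update(out_loss.item(), n=batch_x.size(0))` could be called after each optimizer step, with `epoch_loss.mean()` reported as the epoch's training loss.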

In [16]:
import warnings
warnings.filterwarnings("ignore", category=UserWarning) 

In [17]:
cnn_dag = TrainingProcessor(train_data=train_ds, 
                            model=cnn_model, 
                            epochs=25, 
                            batch_size=16, 
                            learning_rate=1e-5, 
                            scheduler=True, 
                            class_weights=class_weights_norm)

In [18]:
cnn_dag.train(start_epoch=0)

Epoch 1 || train_loss: 2.4609 val_loss: 4.9846 | train_acc: 0.1975 val_acc: 0.1639 | train_p: 0.1655 val_p: 0.1262 | train_r: 0.1491 val_r: 0.0294 | train_F1: 0.1346 val_F1: 0.0426 | 
Epoch 2 || train_loss: 2.3819 val_loss: 5.8422 | train_acc: 0.3386 val_acc: 0.1639 | train_p: 0.2840 val_p: 0.1277 | train_r: 0.2821 val_r: 0.0229 | train_F1: 0.2517 val_F1: 0.0372 | 
Epoch 3 || train_loss: 2.3030 val_loss: 5.4597 | train_acc: 0.4360 val_acc: 0.1816 | train_p: 0.3721 val_p: 0.1499 | train_r: 0.3822 val_r: 0.0441 | train_F1: 0.3406 val_F1: 0.0589 | 
Epoch 4 || train_loss: 1.9688 val_loss: 5.2524 | train_acc: 0.5181 val_acc: 0.1675 | train_p: 0.4479 val_p: 0.1372 | train_r: 0.4647 val_r: 0.0322 | train_F1: 0.4221 val_F1: 0.0463 | 
Epoch 5 || train_loss: 1.5794 val_loss: 5.3371 | train_acc: 0.5809 val_acc: 0.1816 | train_p: 0.5211 val_p: 0.1472 | train_r: 0.5341 val_r: 0.0447 | train_F1: 0.4913 val_F1: 0.0590 | 
Epoch 6 || train_loss: 1.4161 val_loss: 5.0423 | train_acc: 0.6366 val_acc: 0.16

In [None]:
vit_dag = TrainingProcessor(train_data=train_ds, 
                            model=vit_model, 
                            epochs=50, 
                            batch_size=16, 
                            learning_rate=1e-4, 
                            scheduler=True, 
                            class_weights=class_weights_norm)

In [None]:
vit_dag.train(start_epoch=0)

Epoch 1 || train_loss: 1.5637 val_loss: 1.8645 | train_acc: 0.2775 val_acc: 0.3715 | train_p: 0.2443 val_p: 0.3227 | train_r: 0.2138 val_r: 0.3019 | train_F1: 0.1975 val_F1 0.2757 | 
Epoch 2 || train_loss: 1.5533 val_loss: 1.7967 | train_acc: 0.3483 val_acc: 0.3314 | train_p: 0.2978 val_p: 0.2971 | train_r: 0.2890 val_r: 0.2530 | train_F1: 0.2598 val_F1 0.2425 | 
Epoch 3 || train_loss: 1.0114 val_loss: 1.6843 | train_acc: 0.4084 val_acc: 0.4009 | train_p: 0.3536 val_p: 0.3695 | train_r: 0.3456 val_r: 0.3446 | train_F1: 0.3152 val_F1 0.3190 | 
Epoch 4 || train_loss: 1.3637 val_loss: 1.6213 | train_acc: 0.4498 val_acc: 0.4882 | train_p: 0.3946 val_p: 0.4269 | train_r: 0.3900 val_r: 0.4126 | train_F1: 0.3561 val_F1 0.3871 | 
Epoch 5 || train_loss: 2.2374 val_loss: 1.4902 | train_acc: 0.4936 val_acc: 0.4988 | train_p: 0.4338 val_p: 0.4259 | train_r: 0.4293 val_r: 0.4304 | train_F1: 0.3956 val_F1 0.3988 | 
Epoch 6 || train_loss: 1.3471 val_loss: 1.4339 | train_acc: 0.5281 val_acc: 0.4965 | 

In [None]:
pcvit_dag = TrainingProcessor(train_data=train_ds, 
                              model=pcvit_model, 
                              epochs=25, 
                              batch_size=16, 
                              learning_rate=5e-5, 
                              scheduler=True, 
                              class_weights=class_weights_norm)

In [None]:
pcvit_dag.train(start_epoch=0)

Epoch 1 || train_loss: 2.1893 val_loss: 1.8701 | train_acc: 0.2195 val_acc: 0.3538 | train_p: 0.1925 val_p: 0.3081 | train_r: 0.1499 val_r: 0.3020 | train_F1: 0.1404 val_F1: 0.2768 | 
Epoch 2 || train_loss: 1.1201 val_loss: 1.4013 | train_acc: 0.4495 val_acc: 0.5413 | train_p: 0.3874 val_p: 0.4939 | train_r: 0.3883 val_r: 0.4951 | train_F1: 0.3546 val_F1: 0.4648 | 
Epoch 3 || train_loss: 2.2696 val_loss: 1.2309 | train_acc: 0.6269 val_acc: 0.6344 | train_p: 0.5621 val_p: 0.5598 | train_r: 0.5711 val_r: 0.5531 | train_F1: 0.5339 val_F1: 0.5302 | 
Epoch 4 || train_loss: 0.2315 val_loss: 1.1673 | train_acc: 0.7777 val_acc: 0.6675 | train_p: 0.7290 val_p: 0.6174 | train_r: 0.7339 val_r: 0.6208 | train_F1: 0.7069 val_F1: 0.5866 | 
Epoch 5 || train_loss: 0.2780 val_loss: 1.2280 | train_acc: 0.8645 val_acc: 0.6804 | train_p: 0.8335 val_p: 0.6139 | train_r: 0.8381 val_r: 0.6195 | train_F1: 0.8188 val_F1: 0.5950 | 
Epoch 6 || train_loss: 0.0353 val_loss: 1.1273 | train_acc: 0.9259 val_acc: 0.73

In [19]:
## prediction on test dataset
from sklearn.metrics import classification_report
test_ds = train_data['test']
testdataloader = DataLoader(test_ds, batch_size = 16, shuffle=True)
traindataloader = DataLoader(train_data['train'], batch_size = 16, shuffle=True)

def get_tags(model, dataloader):
    with torch.no_grad():
        model.eval()
        true_tags = []
        pred_tags = []
        for i, (batch_x, batch_y) in enumerate(dataloader):
            ## batch-wise inference (no gradient updates)
            batch_x, batch_y = batch_x.cuda(), batch_y.cuda()
            out = model(batch_x)
            out_tags = torch.argmax(out, dim = 1)  ## log_softmax is monotonic, so argmax over the raw logits gives the same labels
            true_tags.append(batch_y.cpu().numpy())
            pred_tags.append(out_tags.cpu().numpy())
    return true_tags, pred_tags
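
`get_tags` returns two lists of per-batch arrays, so the labels have to be flattened into one sequence before they are passed to `classification_report`. A small sketch of that step, assuming a hypothetical helper `flatten_batches` (plain lists stand in for the NumPy arrays here, and the labels are made up):

```python
from itertools import chain


def flatten_batches(batches):
    """Concatenate per-batch label sequences into one flat list."""
    return list(chain.from_iterable(batches))


# two toy batches of true and predicted labels
true_batches = [[0, 3, 3], [5, 1]]
pred_batches = [[0, 3, 5], [5, 1]]

y_true = flatten_batches(true_batches)
y_pred = flatten_batches(pred_batches)
print(y_true, y_pred)
```

With NumPy arrays, `np.concatenate(batches)` does the same thing in a single call.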

In [20]:
cnn_eval_tags = get_tags(cnn_model, testdataloader)

In [None]:
vit_eval_tags = get_tags(vit_model, testdataloader)

In [None]:
pcvit_eval_tags = get_tags(pcvit_model, testdataloader)

In [21]:
## Inception ResNet results on testing set
print(classification_report([y for x in cnn_eval_tags[0] for y in x], [y for x in cnn_eval_tags[1] for y in x]))

              precision    recall  f1-score   support

           0       0.00      0.00      0.00        96
           1       0.00      0.00      0.00        76
           2       0.00      0.00      0.00        67
           3       0.17      0.98      0.30       348
           4       0.00      0.00      0.00       193
           5       0.50      0.21      0.29       288
           6       0.00      0.00      0.00       124
           7       0.00      0.00      0.00       319
           8       0.00      0.00      0.00       353
           9       0.00      0.00      0.00       218

    accuracy                           0.19      2082
   macro avg       0.07      0.12      0.06      2082
weighted avg       0.10      0.19      0.09      2082
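
The report above suggests that the Inception-ResNet collapsed onto a single class: class 3 has a recall of 0.98, while its precision of 0.17 is close to that class's share of the test set (348 / 2082 ≈ 0.17), which is what one would expect if nearly every image were labelled 3. Counting the predicted labels makes this kind of collapse visible directly. A minimal sketch, assuming a hypothetical helper `prediction_share` and made-up labels rather than the notebook's actual predictions:

```python
from collections import Counter


def prediction_share(pred_labels):
    """Return {label: fraction of all predictions}, most frequent label first."""
    counts = Counter(pred_labels)
    total = len(pred_labels)
    return {label: n / total for label, n in counts.most_common()}


# toy predictions dominated by one class
toy_preds = [3] * 18 + [5] * 2
shares = prediction_share(toy_preds)
print(shares)
```

With the notebook's outputs, the flattened contents of `cnn_eval_tags[1]` could be passed in place of `toy_preds`.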



In [None]:
## Vision Transformer results on testing set
print(classification_report([y for x in vit_eval_tags[0] for y in x], [y for x in vit_eval_tags[1] for y in x]))

              precision    recall  f1-score   support

           0       0.76      0.58      0.66        96
           1       0.86      0.78      0.81        76
           2       0.75      0.75      0.75        67
           3       0.84      0.80      0.82       348
           4       0.81      0.75      0.78       193
           5       0.89      0.89      0.89       288
           6       0.76      0.76      0.76       124
           7       0.81      0.82      0.81       319
           8       0.81      0.87      0.84       353
           9       0.78      0.89      0.83       218

    accuracy                           0.82      2082
   macro avg       0.81      0.79      0.80      2082
weighted avg       0.82      0.82      0.82      2082
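
The `macro avg` row is the unweighted mean of the per-class scores, whereas `weighted avg` weights each class by its support, so the two diverge when rare classes score differently from common ones. A short sketch that recomputes both F1 averages from the rounded per-class values printed above (the results only approximate the report, since the inputs are already rounded to two decimals):

```python
# per-class F1 and support copied from the Vision Transformer report
f1 = [0.66, 0.81, 0.75, 0.82, 0.78, 0.89, 0.76, 0.81, 0.84, 0.83]
support = [96, 76, 67, 348, 193, 288, 124, 319, 353, 218]

# macro: every class counts equally
macro_f1 = sum(f1) / len(f1)
# weighted: every class counts in proportion to its number of test images
weighted_f1 = sum(f * s for f, s in zip(f1, support)) / sum(support)

print(f"macro F1    ~ {macro_f1:.3f}")
print(f"weighted F1 ~ {weighted_f1:.3f}")
```

The recomputed values come out at roughly 0.795 and 0.815, in line with the reported 0.80 and 0.82.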



In [None]:
## Convolution-based Vision Transformer results on testing set
print(classification_report([y for x in pcvit_eval_tags[0] for y in x], [y for x in pcvit_eval_tags[1] for y in x]))

              precision    recall  f1-score   support

           0       0.55      0.45      0.49        96
           1       0.86      0.78      0.81        76
           2       0.77      0.64      0.70        67
           3       0.80      0.75      0.77       348
           4       0.71      0.67      0.69       193
           5       0.80      0.80      0.80       288
           6       0.80      0.75      0.77       124
           7       0.73      0.77      0.75       319
           8       0.77      0.82      0.80       353
           9       0.69      0.80      0.74       218

    accuracy                           0.75      2082
   macro avg       0.75      0.72      0.73      2082
weighted avg       0.75      0.75      0.75      2082

