> The visual transformer (ViT) introduced by Dosovitskiy et al. is an architecture directly inherited from Natural Language Processing, but applied to image classification with raw image patches as input. Their paper
> presents promising results with transformers trained with a large, strongly supervised image dataset that is not publicly
> available. The authors concluded that transformers “do not generalize well when
> trained on insufficient amounts of data”, and used extensive computing resources
> to train their models. Both aspects limit the adoption of ViT, and more generally
> of transformers, for researchers without access to such computing resources or
> without privileged access to a large private dataset.
> In this paper, we show that none of this is required: we actually train a
> transformer on a single 8-GPU node in two to three days (53 hours of pretraining, and optionally 20 hours of fine-tuning). This vanilla transformer is
> competitive with convnets of a similar number of parameters and efficiency, using Imagenet as the sole training set, and does not include a single convolution. We build upon the visual transformer architecture from Dosovitskiy et
> al., which is very close to the original token-based transformer architecture where word embeddings are replaced with patch embeddings. With
> our Data-efficient image Transformers (DeiT), we report large improvements
> over previous results, see Figure 1. They mainly come from DeiT’s better training strategy for visual transformers, at both the initial training and the fine-tuning stage. Our ablation study details the key ingredients for a successful
> training, and hopefully will serve as guidelines for future work.



![](https://github.com/facebookresearch/deit/raw/main/.github/deit.png)

In [None]:
!pip install -q timm==0.3.2

In [None]:
import os
import cv2
import json
import numpy as np
import pandas as pd
from functools import partial

import matplotlib.pyplot as plt


import torch
from torch import nn
import albumentations as A

from timm.utils import accuracy
from timm.data import Mixup
from timm.models import create_model
from timm.loss import LabelSmoothingCrossEntropy, SoftTargetCrossEntropy
from timm.scheduler import create_scheduler
from timm.optim import create_optimizer
from timm.utils import NativeScaler, get_state_dict, ModelEma
from timm.models.registry import register_model
from timm.models.vision_transformer import VisionTransformer, _cfg


In [None]:
train_df = pd.read_csv('../input/cassava-leaf-disease-classification/train.csv')
print('Shape of Train df', train_df.shape)

In [None]:
train_df.head()

**Splitting the dataset into train and validation sets**

In [None]:
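# Hold out 400 randomly sampled images per class for validation
val_df = pd.DataFrame()
for l in train_df.label.unique():
    val_df = pd.concat([train_df[train_df.label == l].sample(400), val_df])
# Remove the held-out rows by index; this keeps the label column as integers
train_df = train_df.drop(val_df.index)
print(val_df.shape, train_df.shape)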

The classes in this dataset are imbalanced, as the label histogram of the training set shows.

In [None]:
train_df.label.plot.hist(bins=25);

**SIMPLE DATA OVERSAMPLING**

In [None]:
def return_df(df, c):
    x = (c // df.shape[0] ) + 1
    return pd.concat([df]*x).iloc[:c]


### SIMPLE DATA OVERSAMPLING
MAX_COUNT = 13200
label = np.unique(train_df.label)
train_df_modified = pd.DataFrame([], columns = ['image_id', 'label'])
for l in label:
    df = return_df(train_df[train_df.label == l], MAX_COUNT)
    train_df_modified = pd.concat([train_df_modified, df])
train_df_modified.shape
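
To make the effect of `return_df` concrete, here is a minimal sketch on a toy frame. The three-row `toy` DataFrame is made up for illustration; the function itself is copied from the cell above. Rows are repeated cyclically until the requested count is reached, so minority classes are duplicated rather than augmented, and a class with more than `c` rows would simply be truncated.

```python
import pandas as pd

def return_df(df, c):
    # Repeat the frame enough times to cover c rows, then cut to exactly c rows.
    x = (c // df.shape[0]) + 1
    return pd.concat([df] * x).iloc[:c]

# Hypothetical toy data: a class with only three images.
toy = pd.DataFrame({"image_id": ["a.jpg", "b.jpg", "c.jpg"], "label": [0, 0, 0]})
oversampled = return_df(toy, 7)
print(oversampled["image_id"].tolist())
# ['a.jpg', 'b.jpg', 'c.jpg', 'a.jpg', 'b.jpg', 'c.jpg', 'a.jpg']
```

Because the duplicates are exact copies, the variety seen during training comes only from the random augmentations applied later.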

**TRAINSET AFTER SAMPLING**

In [None]:
train_df_modified.label.plot.hist(bins=25);

**VALIDATION SET AFTER SAMPLING**

In [None]:
val_df.label.plot.hist(bins=25);

In [None]:
import numpy as np

import matplotlib.pyplot as plt

with open('../input/cassava-leaf-disease-classification/label_num_to_disease_map.json', 'r') as file:
    label_map = json.load(file)

def show_examples(images):
    _indexes = [(i, j) for i in range(4) for j in range(4)]
    
    f, ax = plt.subplots(4, 4, figsize=(16, 16))
    for (img, title), (i, j) in zip(images, _indexes):
        ax[i, j].imshow(img)
        ax[i, j].set_title(title)
    f.tight_layout()
    plt.show()

def read_random_images(df):
    data = df.sample(16)
    Images, Label = data.values[:,0], data.values[:,1].astype(int)
    Path = '../input/cassava-leaf-disease-classification/train_images/'
    result = []
    for d, label in zip(Images, Label):
        title = f"Label:{label_map[str(label)]}"
        _image = cv2.imread(Path + d)[:,:,::-1]
        result.append((_image, title))
    print('Showing Sample ...')
    show_examples(result)

In [None]:
label_map

In [None]:
read_random_images(train_df)

In [None]:

class LeafClassificationDataset(object):
    
    def __init__(self, df, 
                 transform = None, 
                 path = '../input/cassava-leaf-disease-classification/train_images'):
        
        self.path = path
        self.df = df
        self.transform = transform

    def __len__(self):
        return self.df.shape[0]
    
    def __getitem__(self, idx):
        im = self.df.iloc[idx].image_id
        img = self.img_loader(im)
        label = int(self.df.iloc[idx, 1])
        return img, torch.tensor(label)
    
    
    def img_loader(self,name):
        path = os.path.join(self.path, name)
        image = cv2.imread(path)
        image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)
        if self.transform:
            image = self.transform(image = image)['image']
        image = torch.as_tensor(np.ascontiguousarray(image.transpose(2, 0, 1)))
        
        return image
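
The last step of `img_loader` converts the image from OpenCV's height x width x channel layout to the channel-first layout that PyTorch models expect. A small sketch of that step on a dummy array (the 4x6 image is an assumption for illustration):

```python
import numpy as np

# Hypothetical 4x6 RGB image in height x width x channel (HWC) layout.
hwc = np.zeros((4, 6, 3), dtype=np.float32)
hwc[..., 0] = 1.0  # fill the first channel so it can be tracked

# Move channels first (CHW) and make the memory contiguous again,
# since transpose only returns a strided view.
chw = np.ascontiguousarray(hwc.transpose(2, 0, 1))
print(chw.shape)  # (3, 4, 6)
```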


**DATA AUGMENTATION**

In [None]:
# Declare an augmentation pipeline
image_size = 224
train_transforms = A.Compose([
        A.LongestMaxSize(max_size=image_size),
        A.PadIfNeeded(image_size, image_size, border_mode=2),
        A.RandomRotate90(),
        A.Flip(),
        A.RandomBrightnessContrast(brightness_limit=0.4, contrast_limit=0.4, p=0.7),
        A.Transpose(),
        A.OneOf([
            A.IAAAdditiveGaussianNoise(),
            A.GaussNoise(),
        ], p=0.2),
        A.OneOf([
            A.MotionBlur(p=.2),
            A.MedianBlur(blur_limit=3, p=0.6),
            A.Blur(blur_limit=3, p=0.1),
        ], p=0.7),
        A.ShiftScaleRotate(shift_limit=0.0625, scale_limit=0.2, rotate_limit=45, p=0.7),
        A.OneOf([
            A.OpticalDistortion(p=0.3),
            A.GridDistortion(p=.1),
            A.IAAPiecewiseAffine(p=0.3),
        ], p=0.5),
        A.OneOf([
            A.CLAHE(clip_limit=2),
            A.IAASharpen(),
            A.IAAEmboss(),
            A.RandomBrightnessContrast(),            
        ], p=0.5),
        A.HueSaturationValue(p=0.5),
        A.Normalize()
    ])
valid_transforms = A.Compose([
        A.LongestMaxSize(max_size=image_size),
        A.PadIfNeeded(image_size, image_size, border_mode=2),
        A.Normalize()
    ])

**DATALOADERS**

In [None]:
import collections
from sklearn.model_selection import train_test_split

from torch.utils.data import DataLoader

def get_loaders(train_df, val_df, batch_size = 32,num_workers=0,train_transforms_fn = None,valid_transforms_fn = None):

    # Creates our train dataset
    train_dataset = LeafClassificationDataset( train_df, train_transforms_fn)


    # Creates our valid dataset
    valid_dataset = LeafClassificationDataset( val_df, valid_transforms_fn)

    train_loader = DataLoader(
      train_dataset,
      batch_size=batch_size,
      shuffle=True,
      num_workers=num_workers,
      drop_last=True,
    )

    valid_loader = DataLoader(
      valid_dataset,
      batch_size=batch_size,
      shuffle=False,
      num_workers=num_workers,
      drop_last=True,
    )

    loaders = collections.OrderedDict()
    loaders["train"] = train_loader
    loaders["valid"] = valid_loader

    return loaders

In [None]:
batch_size = 180

print(f"batch_size: {batch_size}")

loaders = get_loaders( train_df_modified, val_df,batch_size,0, train_transforms, valid_transforms)
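
Both loaders are created with `drop_last=True`, so an incomplete final batch is discarded. The sketch below works out how many batches that gives, assuming five classes with 13200 oversampled training rows and 400 validation rows each (the helper `batches_per_epoch` is hypothetical, not part of PyTorch).

```python
def batches_per_epoch(n_samples, batch_size, drop_last=True):
    # Number of batches a DataLoader yields for a map-style dataset.
    full, remainder = divmod(n_samples, batch_size)
    if drop_last or remainder == 0:
        return full
    return full + 1

batch_size = 180
train_rows = 5 * 13200  # assumed: 5 classes x MAX_COUNT
val_rows = 5 * 400      # assumed: 5 classes x 400 held-out images

print(batches_per_epoch(train_rows, batch_size), train_rows % batch_size)
print(batches_per_epoch(val_rows, batch_size), val_rows % batch_size)
```

Under these assumptions, 20 validation images are skipped on every evaluation pass; setting `drop_last=False` for the validation loader would include them.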

**REGISTERING THE MODEL**


The patch size is 16 and the image size is 224.
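
With these two numbers, each image is cut into a 14 x 14 grid of non-overlapping patches. The following NumPy sketch illustrates only the patch-splitting step; the helper name `patchify` is hypothetical, and the actual model additionally applies a learned projection to each patch (to `embed_dim=384` here) and prepends a class token.

```python
import numpy as np

def patchify(image, patch_size):
    """Split a (C, H, W) array into flattened, non-overlapping patches."""
    c, h, w = image.shape
    assert h % patch_size == 0 and w % patch_size == 0
    gh, gw = h // patch_size, w // patch_size
    # (C, gh, P, gw, P) -> (gh, gw, C, P, P) -> (gh * gw, C * P * P)
    patches = image.reshape(c, gh, patch_size, gw, patch_size)
    patches = patches.transpose(1, 3, 0, 2, 4)
    return patches.reshape(gh * gw, c * patch_size * patch_size)

image = np.random.rand(3, 224, 224).astype(np.float32)
patches = patchify(image, 16)
print(patches.shape)  # (196, 768)
```

So the transformer sees a sequence of 196 patch tokens per image, each built from 3 x 16 x 16 = 768 raw pixel values.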

In [None]:
@register_model
def deit_small_patch16_224(pretrained=False, **kwargs):
    model = VisionTransformer(img_size=224,
        patch_size=16, embed_dim=384, depth=12, num_heads=6, mlp_ratio=4, qkv_bias=True,
        norm_layer=partial(nn.LayerNorm, eps=1e-6), **kwargs)
    d_c = _cfg()
    d_c['input_size'] = (3,224,224)
    model.default_cfg = d_c
    if pretrained:
        checkpoint = torch.hub.load_state_dict_from_url(
            url="https://dl.fbaipublicfiles.com/deit/deit_small_patch16_224-cd65a155.pth",
            map_location="cpu", check_hash=True
        )
        checkpoint["model"].pop('head.weight')
        checkpoint["model"].pop('head.bias')
        
        model.load_state_dict(checkpoint["model"], strict=False)
    return model

In [None]:
from timm.models import create_model
model = create_model(
    'deit_small_patch16_224',
    pretrained=True,
    num_classes=5)


In [None]:
model_ema = ModelEma(
            model,
            decay=0.99,
            device='cuda')
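
`ModelEma` keeps an exponential moving average of the model weights. The update rule can be sketched in plain Python as follows; the function `ema_update` and the scalar "weights" are illustrative assumptions, not timm's code.

```python
def ema_update(ema_weights, model_weights, decay=0.99):
    # Each averaged weight moves a fraction (1 - decay) toward the current weight.
    return [decay * e + (1.0 - decay) * m
            for e, m in zip(ema_weights, model_weights)]

ema = [0.0]
for step in range(3):
    # Pretend the model weight stays at 1.0 for three updates.
    ema = ema_update(ema, [1.0])
print(ema[0])  # close to 1 - 0.99**3
```

With `decay=0.99`, the average reacts slowly: after three updates it has covered only about 3% of the distance to the current weight.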

In [None]:
n_parameters = sum(p.numel() for p in model.parameters() if p.requires_grad)
print('number of params:', n_parameters)

linear_scaled_lr = 0.0001 * batch_size * 1 / 512.0
lr = linear_scaled_lr
optimizer  = torch.optim.Adam(model.parameters(), lr=lr)
loss_scaler = NativeScaler()

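# The second argument of CosineAnnealingLR is T_max (the number of scheduler steps
# in half a cosine period); 7 matches the number of epochs in the training loop.
lr_scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=7)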

criterion = nn.CrossEntropyLoss()
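
The learning rate above follows a linear scaling rule: a base rate defined for a reference batch size of 512 is scaled in proportion to the actual batch size. A small sketch of the arithmetic; the function name and the `world_size` parameter are assumptions, with `world_size` standing in for the factor `1` in the cell above (presumably the number of processes).

```python
def linear_scaled_lr(base_lr, batch_size, world_size=1, reference_batch_size=512.0):
    # Scale the base learning rate by the effective batch size.
    return base_lr * batch_size * world_size / reference_batch_size

lr = linear_scaled_lr(0.0001, 180)
print(lr)  # roughly 3.5e-05
```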

In [None]:
mixup_fn = Mixup(
            mixup_alpha=0.2, cutmix_alpha=0.03, 
            prob=0.8, switch_prob=0.3, mode= 'batch', num_classes=5)
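
For intuition, here is a NumPy sketch of plain mixup, which blends pairs of images and their one-hot labels with a coefficient drawn from a Beta distribution. This is the textbook formulation under simplifying assumptions, not timm's implementation: CutMix, the switching probability, and label smoothing are left out, and the helper `mixup_batch` is hypothetical.

```python
import numpy as np

def mixup_batch(x, y, num_classes, alpha=0.2, rng=None):
    rng = np.random.default_rng() if rng is None else rng
    lam = rng.beta(alpha, alpha)           # mixing coefficient in [0, 1]
    perm = rng.permutation(len(x))         # partner sample for each item
    y_onehot = np.eye(num_classes, dtype=np.float32)[y]
    x_mixed = lam * x + (1.0 - lam) * x[perm]
    y_mixed = lam * y_onehot + (1.0 - lam) * y_onehot[perm]
    return x_mixed, y_mixed, lam

x = np.random.rand(4, 3, 8, 8).astype(np.float32)  # dummy image batch
y = np.array([0, 1, 2, 3])
x_mixed, y_mixed, lam = mixup_batch(x, y, num_classes=5, rng=np.random.default_rng(0))
print(y_mixed.sum(axis=1))  # every row still sums to 1
```

Because the mixed targets are soft distributions rather than class indices, a loss that accepts soft targets (such as the imported `SoftTargetCrossEntropy`) is the usual companion when mixup is enabled.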

In [None]:
@torch.no_grad()
def evaluate(data_loader, model, device):
    criterion = torch.nn.CrossEntropyLoss()

    # switch to evaluation mode
    model.eval()
    acc1_mean = 0.0
    acc5_mean = 0.0
    loss_mean = 0.0
    Batch = len(data_loader)
    print('Total Batch:', Batch)
    
    for images, target in data_loader:
        images = images.to(device, non_blocking=True)
        target = target.to(device, non_blocking=True)

        # compute output
        with torch.cuda.amp.autocast():
            output = model(images)
            loss = criterion(output, target)

        # with only 5 classes, Acc@5 is always 100; it is kept for completeness
        acc1, acc5 = accuracy(output, target, topk=(1, 5))
        acc1_mean += acc1.item()
        acc5_mean += acc5.item()
        loss_mean += loss.item()

    # report means over all batches
    print('* Acc@1 {:.3f} Acc@5 {:.3f} loss {:.3f}'
          .format(acc1_mean/Batch, acc5_mean/Batch, loss_mean/Batch))
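
As a sanity check on the two metrics, top-k accuracy can be sketched in NumPy as below. The helper `topk_accuracy` and the example logits are assumptions for illustration; the notebook itself uses `timm.utils.accuracy`.

```python
import numpy as np

def topk_accuracy(logits, targets, k):
    # Indices of the k largest logits in each row.
    topk = np.argsort(-logits, axis=1)[:, :k]
    # A sample counts as correct if its target is among those k indices.
    hits = (topk == targets[:, None]).any(axis=1)
    return 100.0 * hits.mean()

logits = np.array([[0.10, 0.50, 0.20, 0.10, 0.10],
                   [0.90, 0.00, 0.05, 0.03, 0.02],
                   [0.20, 0.20, 0.10, 0.40, 0.10]])
targets = np.array([1, 2, 4])

top1 = topk_accuracy(logits, targets, k=1)
top5 = topk_accuracy(logits, targets, k=5)
print(top1, top5)
```

With only five classes, every target is always among the top five predictions, so Acc@5 is 100% by construction and Acc@1 is the informative number.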


In [None]:
import math
import sys  # needed for sys.exit below
from datetime import datetime

def train_one_epoch(model, criterion,
                    data_loader, optimizer,
                    device, epoch, loss_scaler, max_norm,
                    model_ema, mixup_fn):

    model.train()
    model.to(device)
    header = 'Epoch: [{}]'.format(epoch)
    print_freq = 50
    count = 0
    Batch = len(data_loader)
    print('Total Batch:', Batch)
    loss_mean = 0.0
    for samples, targets in data_loader:
        count +=1
        samples = samples.to(device, non_blocking=True)
        targets = targets.to(device, non_blocking=True)

#         if mixup_fn is not None:
#             samples, _ = mixup_fn(samples, targets)

            

        with torch.cuda.amp.autocast():
            outputs = model(samples)
            
            loss = criterion(outputs, targets)

        loss_value = loss.item()
        loss_mean += loss_value 

        if not math.isfinite(loss_value):
            print("Loss is {}, stopping training".format(loss_value))
            sys.exit(1)

        optimizer.zero_grad()

        # this attribute is added by timm on one optimizer (adahessian)
        is_second_order = hasattr(optimizer, 'is_second_order') and optimizer.is_second_order
        loss_scaler(loss, optimizer, clip_grad=max_norm,
                    parameters=model.parameters(), create_graph=is_second_order)

        torch.cuda.synchronize()
        if model_ema is not None:
            model_ema.update(model)

        if count % print_freq == 0:
            current_time = datetime.now().strftime("%H:%M:%S")
            print(f'{current_time}\tEPOCH: [{epoch}] STEP: [{count}/{Batch}], LOSS: {loss_mean/print_freq}')
            loss_mean = 0.0
    torch.save({'model': model.state_dict(),
                'optimizer': optimizer.state_dict(),
                'lr_scheduler': lr_scheduler.state_dict(),
                'epoch': epoch,
                'model_ema': get_state_dict(model_ema),
                }, 'checkpoint.pth')


In [None]:
print("Start training")
max_accuracy = 0.0
for epoch in range(0, 7):

    train_stats = train_one_epoch(
        model, criterion, loaders['train'],
        optimizer, 'cuda', epoch, loss_scaler,
        None, model_ema, mixup_fn
    )

    lr_scheduler.step(epoch)
    evaluate( loaders['valid'], model, 'cuda')
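
The loop steps a cosine annealing scheduler once per epoch. A minimal sketch of the closed-form schedule, assuming the half-period `t_max` equals the seven epochs of the loop (the function name and `t_max=7` are assumptions for illustration):

```python
import math

def cosine_annealing_lr(base_lr, epoch, t_max, eta_min=0.0):
    # Decays from base_lr at epoch 0 to eta_min at epoch t_max along half a cosine.
    return eta_min + 0.5 * (base_lr - eta_min) * (1.0 + math.cos(math.pi * epoch / t_max))

base_lr = 0.0001 * 180 / 512.0
schedule = [cosine_annealing_lr(base_lr, e, t_max=7) for e in range(8)]
for e, value in enumerate(schedule):
    print(e, f"{value:.2e}")
```

The half-period is what makes the schedule decay: if it is far smaller than one epoch, whole-epoch steps land on (nearly) full cosine periods and the learning rate stays essentially constant.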