<img src="https://media.zenfs.com/en/usa_today_entertainment_893/557b114cf813d0a6e44a34b7e6a48eef">

# ***Introduction***

This notebook is an experiment with the Kaggle competition [Dogs vs. Cats Redux: Kernels Edition](https://www.kaggle.com/competitions/dogs-vs-cats-redux-kernels-edition/overview) that ended 6 years ago. While it's not really fair to compare my results to the top results of the competition, I wanted to test how well I could perform using a pretrained ViT model, and also explore the different ViT models that are available in PyTorch (via torchvision).

This is the first time I have participated in a computer vision competition, so a big part of this notebook is me figuring out how to work with images in PyTorch and how to set up a complete training pipeline.

In this notebook, I focus on two versions of the ViT model, ViT-B_16 and ViT-L_16. I wanted to explore how model size affects both the results and the training time.

As this is my first time working with images in PyTorch, I am sure there are many things that could be improved, especially in the training pipeline. But I will leave things as they are for now, and try to improve on them in future image classification projects.
<br><br><br>
***Beware:*** No type hints or docstrings can be found in this notebook! :)

In [1]:
!pip install jupyter_black -q
!pip install albumentations -q
!pip install kaggle -q

# ***Imports***

In [2]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from PIL import Image

from tqdm import tqdm

import torch
import torch.nn as nn
from torch.utils.data import Dataset, DataLoader, random_split
import torch.nn.utils as utils
import torchvision
from torch.optim import lr_scheduler
import torch.nn.functional as F

import albumentations as A
from albumentations.pytorch import ToTensorV2

from time import time
from sklearn.metrics import log_loss as logloss
from sklearn.model_selection import KFold

import re
import cv2
import os
import copy
import random

import gc
import multiprocessing

os.environ["KAGGLE_USERNAME"] = "davidvikstrand"
os.environ["KAGGLE_KEY"] = XX

from colorama import Fore, Style

rs = Style.RESET_ALL
gr = Fore.GREEN
rd = Fore.RED
cy = Fore.CYAN
ye = Fore.YELLOW
ma = Fore.MAGENTA
bl = Fore.BLUE
gld = Fore.YELLOW + Style.BRIGHT
wh = Fore.WHITE + Style.BRIGHT

import jupyter_black

jupyter_black.load()

# ***Constants***

In [3]:
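seed = 420
batch_size = 32
num_workers = multiprocessing.cpu_count()
num_outp = 1
device = "cuda"
# Default input size. The transform factories below use it as a default argument,
# and load_vit_b_16 / load_vit_l_16 overwrite it (384 / 512) via `global img_s`.
img_s = 384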

# ***Set Up***

In [4]:
class CatDog(Dataset):
    def __init__(self, root="train", transform=None):
        self.images = os.listdir(root)
        self.images.sort(key=lambda x: int(re.findall(r"\d+", x)[0]))
        self.root = root
        self.transform = transform

    def __len__(self):
        return len(self.images)

    def __getitem__(self, idx):
        file = self.images[idx]
        img = Image.open(os.path.join(self.root, file))
        if self.transform is not None:
            img = self.transform(image=np.array(img))["image"]

        label = 1 if "dog" in file else 0 if "cat" in file else -1
        return img, label, file
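
The numeric sort in `CatDog` matters because `os.listdir` returns files in arbitrary order, while the submission later numbers its rows 1..N. Below is a minimal sketch of the sorting and labelling logic on its own. The file names are hypothetical examples of the layout assumed here (`cat.<n>.jpg` / `dog.<n>.jpg` for training images, `<n>.jpg` for test images), and the helper names are not part of the notebook.

```python
import re


def numeric_key(filename):
    """Sort key: the first run of digits in the file name, as an integer."""
    return int(re.findall(r"\d+", filename)[0])


def label_from_name(filename):
    """1 for dog, 0 for cat, -1 when the name carries no label (test images)."""
    return 1 if "dog" in filename else 0 if "cat" in filename else -1


# Hypothetical file names, deliberately out of order.
train_files = ["dog.10.jpg", "cat.2.jpg", "dog.2.jpg", "cat.10.jpg"]
test_files = ["10.jpg", "2.jpg", "1.jpg"]

# A plain string sort would put "10.jpg" before "2.jpg"; the numeric key does not.
sorted_train = sorted(train_files, key=numeric_key)
sorted_test = sorted(test_files, key=numeric_key)
labels = [label_from_name(f) for f in sorted_train]
```

Because Python's sort is stable, files that share a number (such as `cat.2.jpg` and `dog.2.jpg`) keep their original relative order.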

In [5]:
def transform_dense(img_s=img_s):
    return A.Compose(
        [
            A.RandomResizedCrop(
                p=1.0,
                height=img_s,
                width=img_s,
                scale=(0.7, 1.2),
                ratio=(0.75, 1.3),
                interpolation=1,
            ),
            A.HorizontalFlip(p=0.5),
            A.ColorJitter(
                brightness=0.3, contrast=0.3, saturation=0.3, hue=0.35, p=0.7
            ),
            A.ShiftScaleRotate(
                shift_limit=0.0,
                scale_limit=0.1,
                rotate_limit=30,
                interpolation=cv2.INTER_LINEAR,
                border_mode=cv2.BORDER_CONSTANT,
                p=0.8,
            ),
            A.Blur(blur_limit=(1, 3), p=0.25),
            A.CoarseDropout(max_holes=2, max_height=50, max_width=50, p=0.5),
            A.Normalize(
                mean=[0.485, 0.456, 0.406],
                std=[0.229, 0.224, 0.225],
                max_pixel_value=255.0,
            ),
            ToTensorV2(),
        ]
    )


def transform_medium(img_s=img_s):
    return A.Compose(
        [
            A.Resize(height=img_s, width=img_s),
            A.HorizontalFlip(p=0.5),
            A.OneOf(
                [
                    A.ColorJitter(
                        brightness=0.2, contrast=0.2, saturation=0.2, hue=0.2, p=1
                    ),
                    A.CoarseDropout(max_holes=1, max_height=10, max_width=10, p=1),
                ],
                p=0.7,
            ),
            A.ShiftScaleRotate(
                shift_limit=0.1,
                scale_limit=0.1,
                rotate_limit=20,
                interpolation=cv2.INTER_LINEAR,
                border_mode=cv2.BORDER_CONSTANT,
                p=0.8,
            ),
            A.Normalize(
                mean=[0.485, 0.456, 0.406],
                std=[0.229, 0.224, 0.225],
                max_pixel_value=255.0,
            ),
            ToTensorV2(),
        ]
    )


def transform_light(img_s=img_s):
    return A.Compose(
        [
            A.Resize(height=img_s, width=img_s),
            A.HorizontalFlip(p=0.5),
            A.ShiftScaleRotate(
                shift_limit=0.0,
                scale_limit=0.1,
                rotate_limit=15,
                interpolation=cv2.INTER_LINEAR,
                border_mode=cv2.BORDER_CONSTANT,
                p=0.8,
            ),
            A.Normalize(
                mean=[0.485, 0.456, 0.406],
                std=[0.229, 0.224, 0.225],
                max_pixel_value=255.0,
            ),
            ToTensorV2(),
        ]
    )


def basic_transform(img_s=img_s):
    return A.Compose(
        [
            A.Resize(height=img_s, width=img_s),
            A.Normalize(
                mean=[0.485, 0.456, 0.406],
                std=[0.229, 0.224, 0.225],
                max_pixel_value=255.0,
            ),
            ToTensorV2(),
        ]
    )
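
Every pipeline above ends with the same `A.Normalize` step. As a rough illustration of what that step does to a single pixel, here is a plain-Python sketch, assuming the usual per-channel formula `(value / max_pixel_value - mean) / std`. The helper name is made up for this example and albumentations itself is not used.

```python
IMAGENET_MEAN = [0.485, 0.456, 0.406]
IMAGENET_STD = [0.229, 0.224, 0.225]


def normalize_pixel(rgb, mean=IMAGENET_MEAN, std=IMAGENET_STD, max_pixel_value=255.0):
    """Scale an 8-bit RGB pixel to [0, 1], then standardize each channel."""
    return [(v / max_pixel_value - m) / s for v, m, s in zip(rgb, mean, std)]


black = normalize_pixel([0, 0, 0])        # every channel ends up below zero
white = normalize_pixel([255, 255, 255])  # every channel ends up above zero
```

The pretrained weights expect inputs on this scale, which is why the same mean and standard deviation appear in the training, validation and test transforms.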

# ***Models***

In [6]:
def load_vit_b_16(n_out=1):
    global img_s
    img_s = 384

    vit_weights = torchvision.models.ViT_B_16_Weights.IMAGENET1K_SWAG_E2E_V1

    ViT = torchvision.models.vit_b_16(weights=vit_weights)
    ViT.heads = nn.Linear(in_features=768, out_features=n_out)

    for param in ViT.parameters():
        param.requires_grad = False
    for param in ViT.heads.parameters():
        param.requires_grad = True

    return ViT.to(device)


load_vit_b_16();

Downloading: "https://download.pytorch.org/models/vit_b_16_swag-9ac1b537.pth" to /root/.cache/torch/hub/checkpoints/vit_b_16_swag-9ac1b537.pth


  0%|          | 0.00/331M [00:00<?, ?B/s]

In [8]:
def load_vit_l_16(n_out=1):
    global img_s
    img_s = 512

    vit_weights = torchvision.models.ViT_L_16_Weights.IMAGENET1K_SWAG_E2E_V1

    ViT = torchvision.models.vit_l_16(weights=vit_weights)
    ViT.heads = nn.Linear(in_features=1024, out_features=n_out)

    for param in ViT.parameters():
        param.requires_grad = False
    for param in ViT.heads.parameters():
        param.requires_grad = True

    return ViT.to(device)


load_vit_l_16();

Downloading: "https://download.pytorch.org/models/vit_l_16_swag-4f3808c9.pth" to /root/.cache/torch/hub/checkpoints/vit_l_16_swag-4f3808c9.pth


  0%|          | 0.00/1.14G [00:00<?, ?B/s]

In [10]:
class Ensemble(nn.Module):
    def __init__(self, models):
        super(Ensemble, self).__init__()
        self.models = nn.ModuleList(models)

    def forward(self, x):
        outputs = [model(x) for model in self.models]
        return torch.stack(outputs).mean(dim=0)
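
Note that `Ensemble.forward` averages the raw model outputs (logits), and the sigmoid is only applied afterwards in `make_predictions`. This is not the same as averaging the individual probabilities. The sketch below uses hypothetical logits and plain Python to show the difference for a single image.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def ensemble_logit(logits):
    """Mean of the member logits, mirroring torch.stack(outputs).mean(dim=0)."""
    return sum(logits) / len(logits)


# Hypothetical outputs of three models for one image.
member_logits = [4.0, 2.0, -1.0]

# What the notebook's pipeline computes: average the logits, then apply the sigmoid.
p_from_mean_logit = sigmoid(ensemble_logit(member_logits))

# The alternative: apply the sigmoid per model, then average the probabilities.
p_mean_of_probs = sum(sigmoid(z) for z in member_logits) / len(member_logits)
```

Neither choice is wrong, but they can give noticeably different probabilities when the members disagree, which matters for a log loss metric.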

# ***Utils***

Here are some utility functions that I will use in the training pipeline.

In [11]:
def p(train, i, loss, acc, log_loss):
    if train:
        print(
            f"{i}: [Train] Loss: {bl}{np.mean(loss):.4f}{rs}, "
            f"Accuracy: {bl}{np.mean(acc):.4f}{rs}",
            f"log_loss: {bl}{np.mean(log_loss):.4f}{rs}",
        )
    else:
        print(
            f"[Validation] Loss: {gr}{np.mean(loss):.4f}{rs}, "
            f"Accuracy: {gr}{np.mean(acc):.4f}{rs},",
            f"Log Loss: {gr}{np.mean(log_loss):.4f}{rs}\n",
        )


def get_scores(logs, labels):
    pred = torch.sigmoid(logs)
    acc = ((pred > 0.5) == labels).sum() / pred.size(0)

    pred = torch.clip(pred, 0.005, 0.995).cpu().detach().numpy()
    log_loss = torch.tensor(logloss(labels.cpu().numpy(), pred))
    return acc, log_loss


def focal_loss(logits, targets, criterion, alpha=0.5, gamma=2):
    loss_score = criterion(logits, targets)

    prob = torch.sigmoid(logits)
    factor = (1 - prob) ** gamma
    loss = alpha * factor * loss_score

    return loss.mean()
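
For comparison, here is a plain-Python sketch of the commonly cited binary focal loss, `-alpha_t * (1 - p_t)**gamma * log(p_t)`, where `p_t` is the probability assigned to the true class. It is not the function used in this notebook: `focal_loss` above receives a `BCEWithLogitsLoss` criterion with the default mean reduction, so it scales a single averaged loss value by `(1 - prob)**gamma` without looking at the label. The reference below is written for readability and is not numerically hardened for extreme logits.

```python
import math


def binary_focal_loss(logits, targets, alpha=0.5, gamma=2.0):
    """Mean focal loss over per-sample logits and 0/1 targets."""
    total = 0.0
    for z, y in zip(logits, targets):
        p = 1.0 / (1.0 + math.exp(-z))
        p_t = p if y == 1 else 1.0 - p              # probability of the true class
        alpha_t = alpha if y == 1 else 1.0 - alpha  # class-dependent weight
        total += -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)
    return total / len(logits)


easy = binary_focal_loss([4.0], [1])   # confident and correct: heavily down-weighted
hard = binary_focal_loss([-4.0], [1])  # confident and wrong: keeps most of its loss
```

With `gamma=0` the modulating factor disappears and the result is just `alpha` times the ordinary binary cross-entropy.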

def empty_cache():
    gc.collect()
    torch.cuda.empty_cache()


def make_predictions(model, loader, batch_size=batch_size):
    print("Predicting...")
    model.eval()
    all_preds = torch.empty((len(loader.dataset), 1))

    with torch.no_grad():
        for i, (data, *_) in enumerate(tqdm(loader)):
            outputs = model(data.to(device))
            all_preds[i * batch_size : (i + 1) * batch_size] = outputs.sigmoid()

    return all_preds.squeeze()


def get_model(model, lr, wd, scheduler):
    optimizer = torch.optim.Adam(
        model.parameters(),
        lr=lr,
        weight_decay=wd,
        amsgrad=True,
    )
    if len(scheduler) == 2:
        scheduler, scheduler_params = scheduler
        scheduler = scheduler(optimizer, **scheduler_params)
    else:
        scheduler = None

    return model, optimizer, scheduler


def set_seed(seed=420):
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False
    np.random.seed(seed)
    random.seed(seed)


def remove_seed():
    s = lambda: random.randint(0, 10**8)  # randint needs integer bounds, not the float 1e8
    torch.manual_seed(s())
    torch.cuda.manual_seed_all(s())
    torch.backends.cudnn.deterministic = False
    torch.backends.cudnn.benchmark = True
    np.random.seed()
    random.seed()


def early_stopping(
    stopping, best_loss, best_state, val_loss, model, patience, cur_epoch
):
    stopping += 1
    if val_loss <= best_loss:
        best_loss, stopping = val_loss, 0
        best_state = copy.deepcopy(model.state_dict())

    if stopping >= patience:
        print(f"{Fore.RED}Early Stopping at epoch {cur_epoch+1}{rs}")

    return stopping, best_loss, best_state, stopping >= patience
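
The bookkeeping in `early_stopping` is easier to follow on a list of numbers. The sketch below replays the same rule (reset the counter whenever the validation loss is less than or equal to the best so far, stop once the counter reaches `patience`) on hypothetical validation losses. It leaves out the model state copy, and the function name is made up for this example.

```python
def run_early_stopping(val_losses, patience):
    """Return (best_loss, best_epoch, stopped_epoch) for a sequence of validation losses.

    stopped_epoch is None when patience is never exhausted.
    """
    best_loss, best_epoch, stale = float("inf"), None, 0
    for epoch, loss in enumerate(val_losses):
        stale += 1
        if loss <= best_loss:  # same "<=" rule as early_stopping above
            best_loss, best_epoch, stale = loss, epoch, 0
        if stale >= patience:
            return best_loss, best_epoch, epoch
    return best_loss, best_epoch, None


# Hypothetical validation losses: improvement stops after the third epoch.
history = [0.0100, 0.0090, 0.0085, 0.0088, 0.0091, 0.0087]
result = run_early_stopping(history, patience=2)
never_stops = run_early_stopping(history, patience=float("inf"))
```

The training runs in this notebook set `patience` to `np.inf`, which corresponds to the second call: the best loss is still tracked, but training never stops early.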

def submit_predictions(preds):
    test_preds = np.clip(preds, 0.005, 0.995)

    submission_df = pd.DataFrame(
        {"id": range(1, len(test_preds) + 1), "label": test_preds}
    )
    submission_df.to_csv("submission.csv", index=False)
    !kaggle competitions submit -c dogs-vs-cats-redux-kernels-edition -f submission.csv -m "Message"


def get_data(train_amn, val_amn, model):
    img_s = {"vit_b_16": 384, "vit_l_16": 512}[model]
    train_dataset = CatDog(root="train/", transform=transform_light(img_s))
    test_dataset = CatDog(root="test/", transform=basic_transform(img_s))

    val_s = int(val_amn * (tr_len := len(train_dataset)))
    train_set, val_set = random_split(train_dataset, [tr_len - val_s, val_s])
    val_set = copy.deepcopy(val_set)
    val_set.dataset.transform = basic_transform(img_s)

    train_s = int(train_amn * (tr_len := len(train_set)))
    train_set, _ = random_split(train_set, [train_s, tr_len - train_s])

    return train_set, val_set, test_dataset


def get_loader(dataset, bs, shuffle=False):
    return DataLoader(
        dataset,
        shuffle=shuffle,
        batch_size=bs,
        num_workers=num_workers,
        pin_memory=True,
    )
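
Both `get_scores` and `submit_predictions` clip the probabilities to [0.005, 0.995]. The plain-Python sketch below shows what that does to the log loss: a confidently wrong prediction can cost at most about 5.3 per image, and a perfect prediction still costs about 0.005. The function name is made up for this example.

```python
import math


def clipped_log_loss(y_true, p_pred, lo=0.005, hi=0.995):
    """Mean binary log loss after clipping the predictions to [lo, hi]."""
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, lo), hi)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)


worst_case = clipped_log_loss([1], [0.0])  # confidently wrong, clipped to 0.005
best_case = clipped_log_loss([1], [1.0])   # confidently right, clipped to 0.995
```

One consequence is that the "Log Loss" value printed during training cannot drop below roughly 0.005, however good the model gets.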

def set_transform(data_set, transf_d, epoch, bs):
    transf = [transf_d[k] for k in transf_d if epoch in k] or [transform_light]

    data_set = copy.deepcopy(data_set)
    data_set.dataset.transform = transf[0](img_s)
    return get_loader(data_set, bs, shuffle=True)


def one_epoch(loader, train, model, criterion, optimizer, scaler, focal=False):
    model.train(train)

    metrics = [], [], []
    for i, (data, labels, _) in enumerate(tqdm(loader), 1):
        data, labels = data.to(device), labels.to(device).unsqueeze(1).float()

        if labels.unique().numel() == 1:
            continue

        if train:
            with torch.cuda.amp.autocast():
                logs = model(data)
                if focal:
                    loss_score = focal_loss(logs, labels, criterion)
                else:
                    loss_score = criterion(logs, labels)

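            optimizer.zero_grad()
            scaler.scale(loss_score).backward()
            # Unscale first so the threshold applies to the true gradients, and clip
            # before the optimizer step so that the clipping actually takes effect.
            scaler.unscale_(optimizer)
            utils.clip_grad_norm_(model.parameters(), 0.5)
            scaler.step(optimizer)
            scaler.update()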
        else:
            with torch.no_grad():
                logs = model(data)
                loss_score = criterion(logs, labels)

        for j, item in enumerate((loss_score, *get_scores(logs, labels))):
            metrics[j].append(item.item())

    p(train, i, *metrics)
    return np.mean(metrics[2])
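
`one_epoch` relies on `clip_grad_norm_` with a maximum norm of 0.5. As a rough picture of what that call does, the plain-Python sketch below rescales a flat list of gradient values so that their global L2 norm does not exceed the limit. The small `eps` in the denominator is an assumption about the implementation detail, and the function name is made up for this example.

```python
import math


def clip_by_global_norm(grads, max_norm, eps=1e-6):
    """Rescale a flat list of gradient values so their L2 norm is at most max_norm."""
    total_norm = math.sqrt(sum(g * g for g in grads))
    scale = min(1.0, max_norm / (total_norm + eps))  # never scales gradients up
    return [g * scale for g in grads], total_norm


clipped, norm_before = clip_by_global_norm([3.0, 4.0], max_norm=0.5)
untouched, _ = clip_by_global_norm([0.03, 0.04], max_norm=0.5)
```

Clipping only influences an update if it happens before the optimizer step. With a `GradScaler`, the gradients also have to be unscaled first, otherwise the threshold is compared against scaled values.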

# ***Main***

This is not an optimal implementation; I will definitely improve on it in future projects.

In [None]:
def train(
    model_fn, data, epochs, deterministic, params, transform_d, load_best_state
):
    start = time()
    scaler = torch.cuda.amp.GradScaler()
    criterion = nn.BCEWithLogitsLoss().to(device)

    set_seed() if deterministic else remove_seed()

    wd, lr, bs, bs_val, patience, focal, warm_up, *scheduler = params.values()
    model, optimizer, scheduler = get_model(model_fn(), lr, wd, scheduler)

    print(
        f"Parameters: {bl}batch_size={bs}, weight_decay={wd}",
        f"learning_rate={lr}{rs}\n",
    )

    train_data, val_data, test_data = data
    val_loader, test_loader = [
        get_loader(data, bs_val) for data in (val_data, test_data)
    ]

    stopping, best_loss, best_state = 0, np.inf, None

    for epoch in range(epochs):
        print(f"{wh}[Epoch {epoch+1}/{epochs}]{rs}")

        train_args = [model, criterion, optimizer, scaler]
        train_loader = set_transform(train_data, transform_d, epoch, bs)

        one_epoch(train_loader, True, *train_args, epoch in focal)
        val_loss = one_epoch(val_loader, False, *train_args)
        empty_cache()

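        # NOTE: `warm_up` is a list, so `[warm_up, epochs - 1]` never contains `epoch`
        # itself and the warm-up epochs are not skipped (the log below steps after epoch 1).
        # Use `[*warm_up, epochs - 1]` if the warm-up epochs should be skipped.
        if scheduler is not None and epoch not in [warm_up, epochs - 1]: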
            scheduler.step()
            print(f"Updated lr -> {ma}{scheduler.get_last_lr()[0]:.2e}{rs}\n")

        stopping, best_loss, best_state, early_stop = early_stopping(
            stopping, best_loss, best_state, val_loss, model, patience, epoch
        )

        if early_stop and epoch != epochs - 1:
            break

    model.load_state_dict(best_state) if load_best_state else None
    predictions = make_predictions(model, test_loader, bs_val)
    submit_predictions(predictions)
    empty_cache()

    print(f"\n{bl}(ﾉ´ヮ`)ﾉ*  Finished in {(time() - start) / 60:.2f} minutes{rs}")
    return model, predictions
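
The `train` loop steps the scheduler once per epoch. With a `StepLR`-style decay the learning rate after `k` steps is `base_lr * gamma ** (k // step_size)`. The small sketch below uses the values chosen for ViT-B_16 later in the notebook (learning rate 2e-3, gamma 0.7, step size 1), and the function name is made up for this example.

```python
def step_lr_schedule(base_lr, gamma, n_steps, step_size=1):
    """Learning rate after 0..n_steps scheduler steps under a StepLR-style decay."""
    return [base_lr * gamma ** (k // step_size) for k in range(n_steps + 1)]


vit_b_schedule = step_lr_schedule(base_lr=2e-3, gamma=0.7, n_steps=3)
formatted = [f"{lr:.2e}" for lr in vit_b_schedule]
```

The last three entries match the `Updated lr ->` lines printed during the ViT-B_16 run.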

# ***Training***

In [15]:
data_inp = {
    "train_amn": 1,
    "val_amn": 0.2,
}

data_vit_b_16 = get_data(**data_inp, model="vit_b_16")
data_vit_l_16 = get_data(**data_inp, model="vit_l_16")

## ***ViT-B_16***

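Here are all the parameters that need to be set before training.

The **transform_d** parameter is a dictionary that maps epochs to the augmentation applied to the training data (validation and test images always use basic_transform). Each key is a tuple of zero-based epoch indices and each value is the transform to use for those epochs; epochs not covered by any key fall back to transform_light. In this example, transform_light is applied for epochs 0-3 and transform_medium for epoch 4. Since this run only trains for 4 epochs (indices 0-3), only transform_light is actually used.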
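
The lookup that `set_transform` performs on this dictionary can be sketched without any image library. In the example below, strings stand in for the transform factories, and `pick_transform` is a hypothetical helper that mirrors the list comprehension in `set_transform`.

```python
def pick_transform(transform_d, epoch, default="transform_light"):
    """Return the value of the first key tuple containing `epoch`, else the default."""
    matches = [value for key, value in transform_d.items() if epoch in key]
    return matches[0] if matches else default


# Strings stand in for the transform factories so the sketch runs without albumentations.
schedule = {
    tuple(range(4)): "transform_light",
    tuple(range(4, 5)): "transform_medium",
    tuple(range(16, 21)): "transform_dense",
}

chosen = [pick_transform(schedule, epoch) for epoch in (0, 3, 4, 10, 16)]
```

Epoch 10 is not covered by any key, so it falls back to the default.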

In [19]:
train_inp = {
    "model_fn": load_vit_b_16,
    "data": data_vit_b_16,
    "epochs": 4,
    "deterministic": False,
    "load_best_state": False,
    "params": {
        "weight_decay": 5e-5,
        "learning_rate": 2e-3,
        "bs_train": 32,
        "bs_val": 64,
        "patience": np.inf,
        "focal": [2, 3],
        "warm_up": [0],
        "scheduler": lr_scheduler.StepLR,
        "scheduler_args": {"step_size": 1, "gamma": 0.7},
    },
    "transform_d": {
        tuple(range(4)): transform_light,
        tuple(range(4, 5)): transform_medium,
        tuple(range(16, 21)): transform_dense,
    },
}

model, predictions = train(**train_inp)

Parameters: batch_size=32, weight_decay=5e-05 learning_rate=0.002

[Epoch 1/4]

100%|██████████| 624/624 [00:51<00:00, 12.18it/s]

624: [Train] Loss: 0.0123, Accuracy: 0.9961 log_loss: 0.0159

100%|██████████| 78/78 [00:43<00:00,  1.80it/s]

[Validation] Loss: 0.0036, Accuracy: 0.9990, Log Loss: 0.0082

Updated lr -> 1.40e-03

[Epoch 2/4]

100%|██████████| 624/624 [00:50<00:00, 12.35it/s]

624: [Train] Loss: 0.0050, Accuracy: 0.9989 log_loss: 0.0091

100%|██████████| 78/78 [00:42<00:00,  1.86it/s]

[Validation] Loss: 0.0034, Accuracy: 0.9990, Log Loss: 0.0080

Updated lr -> 9.80e-04

[Epoch 3/4]

100%|██████████| 624/624 [00:50<00:00, 12.34it/s]

624: [Train] Loss: 0.0007, Accuracy: 0.9993 log_loss: 0.0072

100%|██████████| 78/78 [00:42<00:00,  1.85it/s]

[Validation] Loss: 0.0032, Accuracy: 0.9992, Log Loss: 0.0079

Updated lr -> 6.86e-04

[Epoch 4/4]

100%|██████████| 624/624 [00:50<00:00, 12.30it/s]

624: [Train] Loss: 0.0008, Accuracy: 0.9991 log_loss: 0.0075

100%|██████████| 78/78 [00:41<00:00,  1.86it/s]

[Validation] Loss: 0.0031, Accuracy: 0.9992, Log Loss: 0.0078

Predicting...

100%|██████████| 196/196 [01:40<00:00,  1.96it/s]

100%|█████████████████████████████████████████| 137k/137k [00:00<00:00, 316kB/s]
Successfully submitted to Dogs vs. Cats Redux: Kernels Edition
(ﾉ´ヮ`)ﾉ*  Finished in 7.96 minutes


![vit_b](imgs/vit_b_16.png)

After just four epochs, the ViT-B_16 model would already place first on the leaderboard with a score of 0.03210, beating the competition's top score of 0.03302. With about 6 minutes of training and 2 minutes of prediction, this is a pretty good result. Now let's see if the ViT-L_16 can beat it.

## ***ViT-L_16***

In [21]:
train_inp = {
    "model_fn": load_vit_l_16,
    "data": data_vit_l_16,
    "epochs": 2,
    "deterministic": False,
    "load_best_state": False,
    "params": {
        "weight_decay": 1e-3,
        "learning_rate": 2e-3,
        "bs_train": 32,
        "bs_val": 32,
        "patience": np.inf,
        "focal": [1],
        "warm_up": [],
        "scheduler": lr_scheduler.StepLR,
        "scheduler_args": {"step_size": 1, "gamma": 0.9},
    },
    "transform_d": {
        (0,): transform_light,
        (1,): transform_medium,
        (): transform_dense,
    },
}

model, predictions = train(**train_inp)

Parameters: batch_size=32, weight_decay=0.001 learning_rate=0.002

[Epoch 1/2]

100%|██████████| 624/624 [05:05<00:00,  2.04it/s]

624: [Train] Loss: 0.0100, Accuracy: 0.9970 log_loss: 0.0134

100%|██████████| 156/156 [04:18<00:00,  1.65s/it]

[Validation] Loss: 0.0029, Accuracy: 0.9996, Log Loss: 0.0071

Updated lr -> 1.80e-03

[Epoch 2/2]

100%|██████████| 624/624 [05:04<00:00,  2.05it/s]

624: [Train] Loss: 0.0006, Accuracy: 0.9995 log_loss: 0.0067

100%|██████████| 156/156 [04:17<00:00,  1.65s/it]

[Validation] Loss: 0.0025, Accuracy: 0.9996, Log Loss: 0.0069

Predicting...

100%|██████████| 391/391 [10:42<00:00,  1.64s/it]

100%|█████████████████████████████████████████| 137k/137k [00:00<00:00, 252kB/s]
Successfully submitted to Dogs vs. Cats Redux: Kernels Edition
(ﾉ´ヮ`)ﾉ*  Finished in 29.59 minutes


![vit_l](imgs/vit_l_16.png)

The ViT-L_16 model is significantly slower than the ViT-B_16 model, which is no surprise given the number of parameters in each. In torchvision, ViT-B/16 has roughly 86M parameters and ViT-L/16 roughly 304M, about 3.5 times as many.
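
As a rough check on the parameter counts, they can be worked out by hand. The sketch below assumes the standard ViT layout (hidden size 768 or 1024, 12 or 24 encoder blocks, MLP size 3072 or 4096, 16x16 patches) together with the input sizes 384 and 512 and the single-output head used by the loaders in this notebook. The function name is made up for this example.

```python
def vit_param_count(hidden, layers, mlp, image_size, n_out, patch=16, channels=3):
    """Parameter count of a ViT encoder with a single linear head."""
    tokens = (image_size // patch) ** 2 + 1        # patches plus the class token
    patch_embed = channels * patch * patch * hidden + hidden
    embeddings = hidden + tokens * hidden          # class token + position embeddings
    attention = 4 * hidden * hidden + 4 * hidden   # q, k, v and output projections
    feed_forward = 2 * hidden * mlp + mlp + hidden
    layer_norms = 4 * hidden                       # two LayerNorms per block
    block = attention + feed_forward + layer_norms
    final_norm = 2 * hidden
    head = hidden * n_out + n_out
    return patch_embed + embeddings + layers * block + final_norm + head


vit_b = vit_param_count(hidden=768, layers=12, mlp=3072, image_size=384, n_out=1)
vit_l = vit_param_count(hidden=1024, layers=24, mlp=4096, image_size=512, n_out=1)
ratio = vit_l / vit_b
```

Under these assumptions the two models come out at about 86.1M and 304.2M parameters. Because everything except the head is frozen, only 769 and 1,025 of those parameters are actually trained.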

Training for two epochs took around 20 minutes, and the prediction time was 11 minutes. But the result was an improvement over the ViT-B_16 model, with a score of 0.03063.

I made a couple of training runs with different parameters, and the ones I used were the ones that gave the best results. So the two models don't use exactly the same parameters, but rather the ones that worked best for each model.

# ***Main with CV***

In my previous Kaggle competition with tabular data, I used cross-validation to get a better estimate of the model's performance. I wanted to try the same thing with this competition, but I wasn't sure how to do it with images. I tried to implement it in a similar way as I did with tabular data, but with some slight modifications.

I wanted to do a cross validation with 5 folds, and for each fold save the model state dict. Then after the cross validation is done, I will create an ensemble of the 5 models and use that for the final prediction. 

This would take a lot longer to perform, but I was curious to see if it would improve the results.


As with the last training loop, this is not an optimal implementation and is done for experimental purposes. 
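
To make the fold logic concrete, here is a small plain-Python stand-in for a shuffled K-fold split. It is only a sketch: the training function below uses scikit-learn's `KFold`, whose shuffling will differ, and the function name and the sample count of 23 are chosen purely for illustration.

```python
import random


def k_fold_indices(n_samples, n_splits=5, seed=0):
    """Yield (train_idx, val_idx) pairs; every sample is validated exactly once."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # The first n_samples % n_splits folds get one extra sample.
    fold_sizes = [
        n_samples // n_splits + (i < n_samples % n_splits) for i in range(n_splits)
    ]
    start = 0
    for size in fold_sizes:
        val_idx = indices[start : start + size]
        train_idx = indices[:start] + indices[start + size :]
        yield train_idx, val_idx
        start += size


folds = list(k_fold_indices(n_samples=23, n_splits=5))
val_sizes = [len(val_idx) for _, val_idx in folds]
```

Each of the five models is trained on four fifths of the data and validated on the remaining fifth, so the ensemble as a whole has seen every training image.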

In [None]:
def cv_train(
    model_fn, data, epochs, deterministic, params, transform_d, n_splits=5
):
    start = time()
    wd, lr, bs, bs_val, patience, focal, warm_up, *scheduler_init = params.values()

    print(
        f"Parameters: {bl}batch_size={bs}, weight_decay={wd}",
        f"learning_rate={lr}{rs}\n",
    )

    train_data, _, test_data = data
    test_loader = get_loader(test_data, bs_val)

    kf = KFold(n_splits=n_splits, shuffle=True)
    models = []

    for fold, (train_index, val_index) in enumerate(kf.split(train_data)):
        print(f"{gld}Fold {fold + 1}/{n_splits}{rs}\n")

        set_seed() if deterministic else remove_seed()
        train_data_fold = torch.utils.data.Subset(train_data, train_index)
        val_data_fold = torch.utils.data.Subset(train_data, val_index)

        train_loader = set_transform(train_data_fold, transform_d, fold, bs)
        val_loader = get_loader(val_data_fold, bs_val)

        model = model_fn()
        scaler = torch.cuda.amp.GradScaler()
        criterion = nn.BCEWithLogitsLoss().to(device)

        model, optimizer, scheduler = get_model(model, lr, wd, scheduler_init)

        best_loss, best_state = np.inf, None

        for epoch in range(epochs):
            print(f"{wh}[Epoch {epoch+1}/{epochs}]{rs}")

            train_args = [model, criterion, optimizer, scaler]

            one_epoch(train_loader, True, *train_args, epoch in focal)
            val_loss = one_epoch(val_loader, False, *train_args)
            empty_cache()

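            # NOTE: `warm_up` is a list, so `[warm_up, epochs - 1]` never contains `epoch`
            # itself and the warm-up epochs are not skipped.
            # Use `[*warm_up, epochs - 1]` if the warm-up epochs should be skipped.
            if scheduler is not None and epoch not in [warm_up, epochs - 1]: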
                scheduler.step()
                print(f"Updated lr -> {ma}{scheduler.get_last_lr()[0]:.2e}{rs}\n")

            _, best_loss, best_state, early_stop = early_stopping(
                0, best_loss, best_state, val_loss, model, patience, epoch
            )

            if early_stop and epoch != epochs - 1:
                break

        models.append(copy.deepcopy(model))

        del model, optimizer, scaler, criterion, scheduler

    ensemble = Ensemble(models)
    predictions = make_predictions(ensemble, test_loader, bs_val)
    submit_predictions(predictions)
    empty_cache()

    print(f"\n{bl}(ﾉ´ヮ`)ﾉ*  Finished in {(time() - start) / 60:.2f} minutes{rs}")

    return models, ensemble, predictions

# ***Training with CV***

In [None]:
data_inp = {
    "train_amn": 1,
    "val_amn": 0,
}

data_vit_b_16_cv = get_data(**data_inp, model="vit_b_16")
data_vit_l_16_cv = get_data(**data_inp, model="vit_l_16")

## ***ViT-B_16***

5-fold cross-validation with 3 epochs for each fold.

In [None]:
train_inp = {
    "model_fn": load_vit_b_16,
    "data": data_vit_b_16_cv,
    "epochs": 3,
    "deterministic": False,
    "params": {
        "weight_decay": 5e-5,
        "learning_rate": 2e-3,
        "bs_train": 32,
        "bs_val": 64,
        "patience": np.inf,
        "focal": [2],
        "warm_up": [0],
        "scheduler": lr_scheduler.StepLR,
        "scheduler_args": {"step_size": 1, "gamma": 0.7},
    },
    "transform_d": {
        tuple(range(4, 4)): transform_light,
        tuple(range(1, 2)): transform_medium,
        tuple(range(16, 21)): transform_dense,
    },
}

models, ensemble, predictions = cv_train(**train_inp)

Parameters: batch_size=32, weight_decay=5e-05 learning_rate=0.002

Fold 1/5

[Epoch 1/3]

100%|██████████| 624/624 [00:46<00:00, 13.48it/s]

624: [Train] Loss: 0.0116, Accuracy: 0.9964 log_loss: 0.0152

100%|██████████| 78/78 [00:40<00:00,  1.92it/s]

[Validation] Loss: 0.0083, Accuracy: 0.9986, Log Loss: 0.0109

Updated lr -> 1.40e-03

[Epoch 2/3]

100%|██████████| 624/624 [00:47<00:00, 13.25it/s]

624: [Train] Loss: 0.0036, Accuracy: 0.9990 log_loss: 0.0080

100%|██████████| 78/78 [00:38<00:00,  2.01it/s]

[Validation] Loss: 0.0074, Accuracy: 0.9984, Log Loss: 0.0103

Updated lr -> 9.80e-04

[Epoch 3/3]

100%|██████████| 624/624 [00:47<00:00, 13.22it/s]

624: [Train] Loss: 0.0007, Accuracy: 0.9993 log_loss: 0.0073

100%|██████████| 78/78 [00:38<00:00,  2.05it/s]

[Validation] Loss: 0.0076, Accuracy: 0.9986, Log Loss: 0.0105

Fold 2/5

[Epoch 1/3]

100%|██████████| 624/624 [00:47<00:00, 13.22it/s]

624: [Train] Loss: 0.0117, Accuracy: 0.9968 log_loss: 0.0152

100%|██████████| 78/78 [00:38<00:00,  2.03it/s]

[Validation] Loss: 0.0046, Accuracy: 0.9986, Log Loss: 0.0091

Updated lr -> 1.40e-03

[Epoch 2/3]

100%|██████████| 624/624 [00:47<00:00, 13.23it/s]

624: [Train] Loss: 0.0043, Accuracy: 0.9987 log_loss: 0.0085

100%|██████████| 78/78 [00:38<00:00,  2.03it/s]

[Validation] Loss: 0.0043, Accuracy: 0.9978, Log Loss: 0.0089

Updated lr -> 9.80e-04

[Epoch 3/3]

100%|██████████| 624/624 [00:46<00:00, 13.28it/s]

624: [Train] Loss: 0.0008, Accuracy: 0.9993 log_loss: 0.0075

100%|██████████| 78/78 [00:39<00:00,  1.99it/s]

[Validation] Loss: 0.0057, Accuracy: 0.9982, Log Loss: 0.0103

Fold 3/5

[Epoch 1/3]

100%|██████████| 624/624 [00:47<00:00, 13.28it/s]

624: [Train] Loss: 0.0110, Accuracy: 0.9970 log_loss: 0.0146

100%|██████████| 78/78 [00:39<00:00,  1.98it/s]

[Validation] Loss: 0.0099, Accuracy: 0.9976, Log Loss: 0.0134

Updated lr -> 1.40e-03

[Epoch 2/3]

100%|██████████| 624/624 [00:47<00:00, 13.20it/s]

624: [Train] Loss: 0.0041, Accuracy: 0.9988 log_loss: 0.0085

100%|██████████| 78/78 [00:39<00:00,  1.98it/s]

[Validation] Loss: 0.0084, Accuracy: 0.9984, Log Loss: 0.0120

Updated lr -> 9.80e-04

[Epoch 3/3]

100%|██████████| 624/624 [00:47<00:00, 13.27it/s]

624: [Train] Loss: 0.0006, Accuracy: 0.9992 log_loss: 0.0067

100%|██████████| 78/78 [00:39<00:00,  1.99it/s]

[Validation] Loss: 0.0090, Accuracy: 0.9982, Log Loss: 0.0126

Fold 4/5

[Epoch 1/3]

100%|██████████| 624/624 [00:47<00:00, 13.26it/s]

624: [Train] Loss: 0.0124, Accuracy: 0.9966 log_loss: 0.0159

100%|██████████| 78/78 [00:38<00:00,  2.03it/s]

[Validation] Loss: 0.0045, Accuracy: 0.9986, Log Loss: 0.0090

Updated lr -> 1.40e-03

[Epoch 2/3]

100%|██████████| 624/624 [00:46<00:00, 13.29it/s]

624: [Train] Loss: 0.0040, Accuracy: 0.9988 log_loss: 0.0083

100%|██████████| 78/78 [00:38<00:00,  2.01it/s]

[Validation] Loss: 0.0040, Accuracy: 0.9984, Log Loss: 0.0086

Updated lr -> 9.80e-04

[Epoch 3/3]

100%|██████████| 624/624 [00:47<00:00, 13.22it/s]

624: [Train] Loss: 0.0008, Accuracy: 0.9994 log_loss: 0.0075

100%|██████████| 78/78 [00:39<00:00,  1.97it/s]

[Validation] Loss: 0.0045, Accuracy: 0.9982, Log Loss: 0.0092

Fold 5/5

[Epoch 1/3]

100%|██████████| 624/624 [00:47<00:00, 13.15it/s]

624: [Train] Loss: 0.0135, Accuracy: 0.9961 log_loss: 0.0168

100%|██████████| 78/78 [00:39<00:00,  1.97it/s]

[Validation] Loss: 0.0040, Accuracy: 0.9988, Log Loss: 0.0085

Updated lr -> 1.40e-03

[Epoch 2/3]

100%|██████████| 624/624 [00:47<00:00, 13.20it/s]

624: [Train] Loss: 0.0043, Accuracy: 0.9987 log_loss: 0.0084

100%|██████████| 78/78 [00:39<00:00,  2.00it/s]

[Validation] Loss: 0.0048, Accuracy: 0.9982, Log Loss: 0.0094

Updated lr -> 9.80e-04

[Epoch 3/3]

100%|██████████| 624/624 [00:47<00:00, 13.23it/s]

624: [Train] Loss: 0.0008, Accuracy: 0.9991 log_loss: 0.0075

100%|██████████| 78/78 [00:39<00:00,  1.98it/s]

[Validation] Loss: 0.0040, Accuracy: 0.9990, Log Loss: 0.0086

Predicting...

100%|██████████| 196/196 [07:24<00:00,  2.27s/it]

100%|█████████████████████████████████████████| 137k/137k [00:00<00:00, 230kB/s]
Successfully submitted to Dogs vs. Cats Redux: Kernels Edition
(ﾉ´ヮ`)ﾉ*  Finished in 29.13 minutes


![vit_b_cv](imgs/vit_b_16_cv_5fold.png)

The total time for training and prediction was 29 minutes, and the score was 0.03162, which is an improvement over the previous ViT-B_16 prediction. Even though an ensemble like this is likely to be more robust than a single model, it is still worth questioning whether the extra time spent on training and prediction is worth it.

## ***ViT-L_16***

5-fold cross-validation with 2 epochs for each fold.

In [None]:
train_inp = {
    "model_fn": load_vit_l_16,
    "data": data_vit_l_16_cv,
    "epochs": 2,
    "deterministic": False,
    "params": {
        "weight_decay": 1e-3,
        "learning_rate": 2e-3,
        "bs_train": 32,
        "bs_val": 32,
        "patience": np.inf,
        "focal": [1],
        "warm_up": [],
        "scheduler": lr_scheduler.StepLR,
        "scheduler_args": {"step_size": 1, "gamma": 0.9},
    },
    "transform_d": {
        (0,): transform_light,
        (1,): transform_medium,
        (): transform_dense,
    },
}

models, ensemble, predictions = cv_train(**train_inp)

Parameters: batch_size=32, weight_decay=0.001 learning_rate=0.002

Fold 1/5

[Epoch 1/2]

100%|██████████| 624/624 [04:31<00:00,  2.30it/s]

624: [Train] Loss: 0.0082, Accuracy: 0.9980 log_loss: 0.0121

100%|██████████| 156/156 [03:59<00:00,  1.53s/it]

[Validation] Loss: 0.0039, Accuracy: 0.9996, Log Loss: 0.0073

Updated lr -> 1.80e-03

[Epoch 2/2]

100%|██████████| 624/624 [04:32<00:00,  2.29it/s]

624: [Train] Loss: 0.0005, Accuracy: 0.9994 log_loss: 0.0067

100%|██████████| 156/156 [03:58<00:00,  1.53s/it]

[Validation] Loss: 0.0035, Accuracy: 0.9996, Log Loss: 0.0073

Fold 2/5

[Epoch 1/2]

100%|██████████| 624/624 [04:31<00:00,  2.30it/s]

624: [Train] Loss: 0.0125, Accuracy: 0.9948 log_loss: 0.0161

100%|██████████| 156/156 [03:59<00:00,  1.53s/it]

[Validation] Loss: 0.0013, Accuracy: 0.9998, Log Loss: 0.0058

Updated lr -> 1.80e-03

[Epoch 2/2]

100%|██████████| 624/624 [04:32<00:00,  2.29it/s]

624: [Train] Loss: 0.0006, Accuracy: 0.9995 log_loss: 0.0067

100%|██████████| 156/156 [03:58<00:00,  1.53s/it]

[Validation] Loss: 0.0012, Accuracy: 0.9998, Log Loss: 0.0057

Fold 3/5

[Epoch 1/2]

100%|██████████| 624/624 [04:31<00:00,  2.30it/s]

624: [Train] Loss: 0.0089, Accuracy: 0.9971 log_loss: 0.0125

100%|██████████| 156/156 [03:58<00:00,  1.53s/it]

[Validation] Loss: 0.0059, Accuracy: 0.9988, Log Loss: 0.0101

Updated lr -> 1.80e-03

[Epoch 2/2]

100%|██████████| 624/624 [04:31<00:00,  2.29it/s]

624: [Train] Loss: 0.0007, Accuracy: 0.9993 log_loss: 0.0068

100%|██████████| 156/156 [03:58<00:00,  1.53s/it]

[Validation] Loss: 0.0051, Accuracy: 0.9988, Log Loss: 0.0094

Fold 4/5

[Epoch 1/2]

100%|██████████| 624/624 [04:32<00:00,  2.29it/s]

624: [Train] Loss: 0.0099, Accuracy: 0.9969 log_loss: 0.0136

100%|██████████| 156/156 [03:59<00:00,  1.53s/it]

[Validation] Loss: 0.0018, Accuracy: 0.9996, Log Loss: 0.0063

Updated lr -> 1.80e-03

[Epoch 2/2]

100%|██████████| 624/624 [04:32<00:00,  2.29it/s]

624: [Train] Loss: 0.0005, Accuracy: 0.9994 log_loss: 0.0066

100%|██████████| 156/156 [03:59<00:00,  1.53s/it]

[Validation] Loss: 0.0013, Accuracy: 0.9996, Log Loss: 0.0059

Fold 5/5

[Epoch 1/2]

100%|██████████| 624/624 [04:32<00:00,  2.29it/s]

624: [Train] Loss: 0.0074, Accuracy: 0.9984 log_loss: 0.0111

100%|██████████| 156/156 [03:59<00:00,  1.53s/it]

[Validation] Loss: 0.0054, Accuracy: 0.9988, Log Loss: 0.0091

Updated lr -> 1.80e-03

[Epoch 2/2]

100%|██████████| 624/624 [04:32<00:00,  2.29it/s]

624: [Train] Loss: 0.0006, Accuracy: 0.9995 log_loss: 0.0066

100%|██████████| 156/156 [03:59<00:00,  1.53s/it]

[Validation] Loss: 0.0056, Accuracy: 0.9986, Log Loss: 0.0096

Predicting...

100%|██████████| 391/391 [49:10<00:00,  7.54s/it]

100%|█████████████████████████████████████████| 137k/137k [00:00<00:00, 233kB/s]
Successfully submitted to Dogs vs. Cats Redux: Kernels Edition
(ﾉ´ヮ`)ﾉ*  Finished in 134.74 minutes


![vit_l_cv](imgs/vit_l_16_cv_5fold.png)

Total time was 2 hours and 15 minutes, and the score was 0.03056. This is the best score in this notebook, but it's only a 0.00007 improvement over the previous ViT-L_16 prediction (0.03063). In this case, I don't think the extra time spent on training and prediction is worth it since the ViT-L_16 is already a quite robust model.

Thanks for reading this notebook, I hope you found it interesting. If you have any feedback or suggestions, please let me know!
<center><img src="https://cdn.shopify.com/s/files/1/0100/8176/3385/products/59bb03a64ae09944b3f86bb9bcfdd8de_580x.jpg?v=1608234066"></center>