<br>
<h1 style = "font-size:60px; font-family:Garamond ; font-weight : normal; background-color: #f6f5f5 ; color : #fe346e; text-align: center; border-radius: 100px 100px;">PyTorch Tutorial Series<br> Inference Series Part I</h1>
<br>

OK, I admit it: with the recent influx of pretty notebooks, I cannot help but "copy" their style. After all, aesthetically pleasing notebooks make me want to read them more.

---

This notebook is Part I of the PyTorch Tutorial Inference Series. I will detail how to save and load weights in PyTorch. I will split this notebook into two parts:

1. First part: real inference in action. For now, SETI is the only competition covered, but I intend to reuse this notebook across multiple competitions.

2. Second part: the tutorial on PyTorch.

---

I have added a back to top button for each section for easy navigation.

References:

1. [Using args and kwargs](https://note.nkmk.me/en/python-args-kwargs-usage/#:~:text=In%20Python%2C%20by%20adding%20*%20and,arguments)

<a id="top"></a>

<div class="list-group" id="list-tab" role="tablist">
<h3 class="list-group-item list-group-item-action active" data-toggle="list" style='color:white; background:black; border:0' role="tab" aria-controls="home"><center>Quick Navigation</center></h3>

    
* [Dependencies](#1)
* [Configurations](#2)
* [Seeding](#3)
* [Utility](#30)
    * [Numpy to Latex](#31)
* [Loading Files](#4)
* [Dataset](#5)
* [Augmentations](#6)
* [Model Instantiation](#7)    
* [Inference by Folds](#8)
 
    
* [Saving and Loading Model Weights](#20)
    
    
* [Dissecting Inference by Folds](#60)
    * [Problem Settings](#61)

<a id="1"></a>

<h1 style = "font-family: garamond; font-size: 40px; font-style: normal; letter-spcaing: 3px; background-color: #f6f5f5; color :#fe346e; border-radius: 100px 100px; text-align:center">Dependencies</h1>

In [None]:
!pip install -q git+https://github.com/rwightman/pytorch-image-models.git
!pip install -q torchsummary
!pip install -q -U git+https://github.com/albu/albumentations --no-cache-dir
!pip install -q neptune-client 

from IPython.display import clear_output 
clear_output()

In [None]:
import sys
geffnet_path = '../input/hongnangeffnet/gen-efficientnet-pytorch-master-hongnan'
# timm_path = '../input/pytorchimagemodelsmasteroct302020/pytorch-image-models-master'
timm_path = '../input/pytorch-image-models/pytorch-image-models-master'
vit_path = '../input/vision-transformer-pytorch/VisionTransformer-Pytorch'
sys_paths = [geffnet_path, timm_path, vit_path]
for paths in sys_paths:
    sys.path.append(paths)

import geffnet
import timm
from vision_transformer_pytorch import VisionTransformer

In [None]:
import math
import os
import random
import warnings
from typing import *

import albumentations
import cv2
# import neptune.new as neptune
import numpy as np
import pandas as pd
import seaborn as sns
import timm
import torch
import torch.nn.functional as F
from albumentations.pytorch.transforms import ToTensorV2
from IPython.display import clear_output
from matplotlib import pyplot as plt
from sklearn.model_selection import GroupKFold, KFold, StratifiedKFold
from sklearn.preprocessing import LabelEncoder
from torch import nn
from torch.autograd import Variable
from torch.optim.lr_scheduler import (CosineAnnealingLR,
                                      CosineAnnealingWarmRestarts,
                                      ReduceLROnPlateau, _LRScheduler)
from torch.optim.optimizer import Optimizer
from torch.utils.data import DataLoader, Dataset
from torchsummary import summary
from torchvision import models
from tqdm.notebook import tqdm

warnings.filterwarnings("ignore")
clear_output()


<a href="#top">Back to top</a>

<a id="2"></a>

<h1 style = "font-family: garamond; font-size: 40px; font-style: normal; letter-spcaing: 3px; background-color: #f6f5f5; color :#fe346e; border-radius: 100px 100px; text-align:center">Configurations</h1>

In [None]:
CONFIG = {
    "COMPETITION_NAME": "SETI Breakthrough Listen - E.T. Signal Search",
    "MODEL": {"MODEL_FACTORY": "timm", "MODEL_NAME": "efficientnet_b0"},
    "WORKSPACE": "Kaggle",
    "DATA": {
        "TARGET_COL_NAME": "target",
        "IMAGE_COL_NAME": "id",
        "NUM_CLASSES": 1,
        "CLASS_LIST": [0, 1],
        "IMAGE_SIZE": 640,
        "CHANNEL_MODE": "spatial_6ch",
        "USE_MIXUP": True
    },
    "CROSS_VALIDATION": {"SCHEMA": 'StratifiedKFold', "NUM_FOLDS": 4},
    "TRAIN": {
        "DATALOADER": {
            "batch_size": 32,
            "shuffle": True,  # using random sampler
            "num_workers": 4,
            "drop_last": False,
        },
        "SETTINGS": {
            "IMAGE_SIZE": 640,
            "NUM_EPOCHS": 3,
            "USE_AMP": True,
            "USE_GRAD_ACCUM": False,
            "ACCUMULATION_STEP": 1,
            "DEBUG": True,
            "VERBOSE": True,
            "VERBOSE_STEP": 10,
        },
    },
    "VALIDATION": {
        "DATALOADER": {
            "batch_size": 32,
            "shuffle": False,
            "num_workers": 4,
            "drop_last": False,
        }
    },
    "TEST": {
        "DATALOADER": {
            "batch_size": 64,
            "shuffle": False,
            "num_workers": 4,
            "drop_last": False,
        }
    },
    "OPTIMIZER": {
        "NAME": "AdamW",
        "OPTIMIZER_PARAMS": {"lr": 1e-4, "eps": 1.0e-8, "weight_decay": 1.0e-3},
    },
    "SCHEDULER": {
        "NAME": "CosineAnnealingWarmRestarts",
        "SCHEDULER_PARAMS": {
            "T_0": 4,
            "T_mult": 1,
            "eta_min": 1.0e-7,
            "last_epoch": -1,
            "verbose": True,
        },
        "CUSTOM": "GradualWarmupSchedulerV2",
        "CUSTOM_PARAMS": {"multiplier": 10, "total_epoch": 1},
        "VAL_STEP": False,
    },
    "CRITERION_TRAIN": {
        "NAME": "BCEWithLogitsLoss",
        "LOSS_PARAMS": {
            "weight": None,
            "size_average": None,
            "reduce": None,
            "reduction": "mean",
            "pos_weight": None
        },
    },
    "CRITERION_VALIDATION": {
        "NAME": "BCEWithLogitsLoss",
        "LOSS_PARAMS": {
            "weight": None,
            "size_average": None,
            "reduce": None,
            "reduction": "mean",
            "pos_weight": None
        },
    },
    "TRAIN_TRANSFORMS": {
        "VerticalFlip": {"p": 0.5},
        "HorizontalFlip": {"p": 0.5},
        "Resize": {"height": 640, "width": 640, "p": 1},
    },
    "VALID_TRANSFORMS": {
        "Resize": {"height": 640, "width": 640, "p": 1},
    },
    "TEST_TRANSFORMS": {
        "Resize": {"height": 640, "width": 640, "p": 1},
    },
    "TTA_TRANSFORMS": [{
        "Resize": {"height": 640, "width": 640, "p": 1},
    },
        {
        "Resize": {"height": 640, "width": 640, "p": 1},
    }],
    "PATH": {
        "DATA_DIR": "/content/",
        "TRAIN_CSV": "../input/seti-breakthrough-listen/train_labels.csv",
        "TRAIN_PATH": "../input/seti-breakthrough-listen/train",

        "TEST_CSV": "../input/seti-breakthrough-listen/sample_submission.csv",
        "TEST_PATH": "../input/seti-breakthrough-listen/test",
        "SAMPLE_SUBMISSION_CSV": "../input/seti-breakthrough-listen/sample_submission.csv",
        "SAVE_WEIGHT_PATH": "../input/et41-efficientnetb0",
        "OOF_PATH": "./",
        "LOG_PATH": "./log.txt"
    },
    "SEED": 19921930,
    "DEVICE": "cuda",
    "GPU": "V100",
}

In [None]:
config = CONFIG
device = config['DEVICE']

<a href="#top">Back to top</a>

<a id="3"></a>

<h1 style = "font-family: garamond; font-size: 40px; font-style: normal; letter-spcaing: 3px; background-color: #f6f5f5; color :#fe346e; border-radius: 100px 100px; text-align:center">Seeding</h1>

In [None]:
def seed_all(seed: int = 1930):
    """Seed all random number generators."""
    print("Using Seed Number {}".format(seed))

    os.environ["PYTHONHASHSEED"] = str(
        seed
    )  # set PYTHONHASHSEED env var at fixed value
    torch.manual_seed(seed)  # seeds the PyTorch RNG on all devices (CPU and CUDA)
    torch.cuda.manual_seed(seed)  # explicit seed for the current GPU
    torch.cuda.manual_seed_all(seed)  # explicit seed for every GPU
    np.random.seed(seed)  # numpy pseudo-random generator
    random.seed(seed)  # python built-in pseudo-random generator
    # trade speed for reproducible cuDNN behaviour
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False
    torch.backends.cudnn.enabled = False


def seed_worker(_worker_id):
    """Seed a worker with the given ID."""
    worker_seed = torch.initial_seed() % 2 ** 32
    np.random.seed(worker_seed)
    random.seed(worker_seed)

In [None]:
seed_all(config['SEED'])
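
`seed_worker` is defined above but only takes effect once it is handed to a `DataLoader`. Below is a minimal sketch of how that wiring might look, assuming a toy `TensorDataset` in place of the competition data. `make_loader` and `epoch_order` are hypothetical helpers, and `num_workers=0` is used only to keep the demo light (the `worker_init_fn` hook runs only when `num_workers > 0`).

```python
import random

import numpy as np
import torch
from torch.utils.data import DataLoader, TensorDataset


def seed_worker(_worker_id):
    """Re-seed the NumPy and Python RNGs inside each DataLoader worker."""
    worker_seed = torch.initial_seed() % 2 ** 32
    np.random.seed(worker_seed)
    random.seed(worker_seed)


def make_loader(seed: int) -> DataLoader:
    # hypothetical helper: a toy dataset holding the integers 0..9
    dataset = TensorDataset(torch.arange(10))
    generator = torch.Generator()
    generator.manual_seed(seed)  # controls the shuffling order
    return DataLoader(
        dataset,
        batch_size=5,
        shuffle=True,
        num_workers=0,  # raise this in practice; seed_worker then runs per worker
        worker_init_fn=seed_worker,
        generator=generator,
    )


def epoch_order(seed: int) -> list:
    # flatten the first epoch's batches into one list of sample values
    return [int(v) for (batch,) in make_loader(seed) for v in batch]


order_a = epoch_order(1930)
order_b = epoch_order(1930)
print(order_a == order_b)  # same seed, same shuffling order
```

Passing an explicitly seeded `generator` is what pins the shuffling order; `worker_init_fn` covers the NumPy and Python RNGs used inside the workers (for example by augmentations).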

<a href="#top">Back to top</a>

<a id="30"></a>

<h1 style = "font-family: garamond; font-size: 40px; font-style: normal; letter-spcaing: 3px; background-color: #f6f5f5; color :#fe346e; border-radius: 100px 100px; text-align:center">Utility Functions</h1>

<a id="31"></a>

<h2 style = "font-family: garamond; font-size: 40px; font-style: normal; letter-spcaing: 3px; background-color: #f6f5f5; color :#fe346e; border-radius: 100px 100px; text-align:center">Converting Numpy Arrays into Latex Form</h2>

[How to convert Numpy Array into Latex form](https://stackoverflow.com/questions/17129290/numpy-2d-and-1d-array-to-latex-bmatrix)

In [None]:
def bmatrix(a):
    """Return a LaTeX bmatrix.

    :param a: numpy array with at most two dimensions
    :returns: LaTeX bmatrix as a string
    """
    if len(a.shape) > 2:
        raise ValueError('bmatrix can at most display two dimensions')
    # np.inf replaces the np.infty alias, which was removed in NumPy 2.0
    lines = np.array2string(a, max_line_width=np.inf).replace('[', '').replace(']', '').splitlines()
    rv = [r'\begin{bmatrix}']
    rv += ['  ' + ' & '.join(l.split()) + r'\\' for l in lines]
    rv += [r'\end{bmatrix}']
    return '\n'.join(rv)

<a href="#top">Back to top</a>

<a id="4"></a>

<h1 style = "font-family: garamond; font-size: 40px; font-style: normal; letter-spcaing: 3px; background-color: #f6f5f5; color :#fe346e; border-radius: 100px 100px; text-align:center">Loading Data</h1>

In [None]:
def get_file_path(image_id, train_path=None, test_path=None, *args):
    # In SETI the .npy files are nested in sub-folders named after the first
    # character of the image id, e.g. root/0/0abc.npy.
    # *args is a hook for special cases in other competitions; it is unused here.

    if train_path is not None:
        return os.path.join(train_path, '{}/{}.npy'.format(image_id[0], image_id))

    if test_path is not None:
        return os.path.join(test_path, '{}/{}.npy'.format(image_id[0], image_id))


train_path = config['PATH']['TRAIN_PATH']
test_path = config['PATH']['TEST_PATH']
train = pd.read_csv(CONFIG['PATH']['TRAIN_CSV'])
test = pd.read_csv(CONFIG['PATH']['TEST_CSV'])

train['file_path'] = train['id'].apply(get_file_path, train_path=train_path)
test['file_path'] = test['id'].apply(get_file_path, test_path=test_path)

display(train.head(3))
display(test.head(3))
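
To make the nested folder layout concrete, here is a small standalone sketch of the same path rule. `build_npy_path` and the example id are hypothetical, and `posixpath` is used so that the result looks the same on every operating system.

```python
import posixpath


def build_npy_path(root: str, image_id: str) -> str:
    """Files live in a sub-folder named after the first character of the id."""
    return posixpath.join(root, image_id[0], "{}.npy".format(image_id))


example_path = build_npy_path("../input/seti-breakthrough-listen/train", "0abc123")
print(example_path)  # ../input/seti-breakthrough-listen/train/0/0abc123.npy
```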


<a href="#top">Back to top</a>

<a id="5"></a>

<h1 style = "font-family: garamond; font-size: 40px; font-style: normal; letter-spcaing: 3px; background-color: #f6f5f5; color :#fe346e; border-radius: 100px 100px; text-align:center">Dataset</h1>

The `Dataset` class is often unique to each competition. Below is the one I use for SETI. As you can see, in SETI we may train on different channel layouts (see `CHANNEL_MODE`), which is not the case in most image classification tasks. Therefore, the `Dataset` is defined a little differently here.

In [None]:
class AlienDataset(Dataset):
    def __init__(self, df, config, transform=None, mode='train'):
        self.df = df  # this assumes we have a df to begin with and not getting files from directory directly
        self.config = config
        # this line assumes you have a column called file_path, considering putting inside config
        self.file_names = df['file_path'].values
        self.labels = df[config['DATA']['TARGET_COL_NAME']].values
        self.transform = transform
        self.mode = mode

    def __len__(self):
        return len(self.df)

    def __getitem__(self, idx):
        image = np.load(self.file_names[idx])  # -> (6, 273, 256)

        if self.config['DATA']['CHANNEL_MODE'] == 'spatial_6ch':
            image = image.astype(np.float32)
            image = np.vstack(image)  # no transpose here (1638, 256)
            # image = np.vstack(image).transpose((1, 0)) -> (256, 1638)

        elif self.config['DATA']['CHANNEL_MODE'] == 'spatial_3ch':
            image = image[::2].astype(np.float32)
            image = np.vstack(image).transpose((1, 0))

        elif self.config['DATA']['CHANNEL_MODE'] == '6_channel':
            image = image.astype(np.float32)
            image = np.transpose(image, (1, 2, 0))

        elif self.config['DATA']['CHANNEL_MODE'] == '3_channel':
            image = image[::2].astype(np.float32)
            image = np.transpose(image, (1, 2, 0))

        if self.transform:
            image = self.transform(image)
        else:
            image = torch.from_numpy(image).float()

        if self.mode == 'test':
            return image
        else:
            label = torch.tensor(self.labels[idx]).float()
            return image, label
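
To see what each `CHANNEL_MODE` branch does to the array, the sketch below applies the same NumPy operations to a dummy sample and collects the resulting shapes. It assumes the `(6, 273, 256)` input shape noted in the comment of `__getitem__`; the `shapes` dictionary is only for illustration.

```python
import numpy as np

# dummy sample with the (6, 273, 256) shape noted in the Dataset's comment
sample = np.zeros((6, 273, 256), dtype=np.float32)

shapes = {
    # stack all 6 arrays vertically into one tall single-channel image
    "spatial_6ch": np.vstack(sample).shape,
    # keep every other array, stack them, then swap height and width
    "spatial_3ch": np.vstack(sample[::2]).transpose((1, 0)).shape,
    # move the 6 arrays to the last axis: height x width x channels
    "6_channel": np.transpose(sample, (1, 2, 0)).shape,
    # same, but with every other array only
    "3_channel": np.transpose(sample[::2], (1, 2, 0)).shape,
}

for mode, shape in shapes.items():
    print(mode, shape)
```

The two `spatial_*` modes yield a 2d array (one channel), while the `*_channel` modes yield a 3d array with the channels last, which is the layout albumentations expects.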


<a href="#top">Back to top</a>

<a id="6"></a>

<h1 style = "font-family: garamond; font-size: 40px; font-style: normal; letter-spcaing: 3px; background-color: #f6f5f5; color :#fe346e; border-radius: 100px 100px; text-align:center">Augmentations</h1>

In [None]:
class Transform:

    def __init__(self, aug_kwargs: Dict):
        albu_augs = [getattr(albumentations, name)(**kwargs)
                     for name, kwargs in aug_kwargs.items()]
        albu_augs.append(ToTensorV2(p=1))

        self.transform = albumentations.Compose(albu_augs)

    def __call__(self, image):
        image = self.transform(image=image)["image"]
        return image

<a href="#top">Back to top</a>

<a id="7"></a>

<h1 style = "font-family: garamond; font-size: 40px; font-style: normal; letter-spcaing: 3px; background-color: #f6f5f5; color :#fe346e; border-radius: 100px 100px; text-align:center">Model Instantiation</h1>

We use a custom activation in our custom `head`. `head` is jargon for the `classifier` layer of a CNN. In general, people fine-tune a pretrained model by removing its last classifier layer and replacing it with one that outputs the correct number of classes.

In [None]:
sigmoid = torch.nn.Sigmoid()


class Swish(torch.autograd.Function):
    @staticmethod
    def forward(ctx, i):
        result = i * sigmoid(i)
        ctx.save_for_backward(i)
        return result

    @staticmethod
    def backward(ctx, grad_output):
        i = ctx.saved_tensors[0]  # saved_variables is the deprecated name
        sigmoid_i = sigmoid(i)
        return grad_output * (sigmoid_i * (1 + i * (1 - sigmoid_i)))


class Swish_Module(torch.nn.Module):
    def forward(self, x):
        return Swish.apply(x)
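
A hand-written `backward` is easy to get subtly wrong, so it is worth checking. The sketch below uses a standalone copy of the `Swish` function (written with `ctx.saved_tensors`) and compares its forward pass against PyTorch's built-in `F.silu`, which computes the same `x * sigmoid(x)`, and its backward pass against numerical gradients via `torch.autograd.gradcheck`. The variable names are illustrative only.

```python
import torch
import torch.nn.functional as F


class Swish(torch.autograd.Function):
    @staticmethod
    def forward(ctx, i):
        ctx.save_for_backward(i)
        return i * torch.sigmoid(i)

    @staticmethod
    def backward(ctx, grad_output):
        (i,) = ctx.saved_tensors
        sigmoid_i = torch.sigmoid(i)
        # d/di [i * sigmoid(i)] = sigmoid(i) * (1 + i * (1 - sigmoid(i)))
        return grad_output * (sigmoid_i * (1 + i * (1 - sigmoid_i)))


# gradcheck compares against finite differences, so it wants double precision
x = torch.randn(4, 3, dtype=torch.double, requires_grad=True)

forward_matches = torch.allclose(Swish.apply(x), F.silu(x))
backward_matches = torch.autograd.gradcheck(Swish.apply, (x,))
print(forward_matches, backward_matches)
```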

Unfortunately, a single `model` class is not necessarily reusable in every competition. However, a generic architecture like the one presented below should suffice for most.

In [None]:
class AlienSingleHead(torch.nn.Module):
    """A custom model."""

    def __init__(
        self,
        config: dict,
        pretrained: bool = True,
    ):
        """Construct a custom model."""
        super().__init__()
        self.config = config
        self.pretrained = pretrained
        
        print("Pretrained is {}".format(self.pretrained))

        self.activation = Swish_Module()
        
        self.architecture = {
            "backbone": None,
            "bottleneck": None,
            "classifier_head": None,
        }

        _model_factory = (
            timm.create_model
            if self.config["MODEL"]["MODEL_FACTORY"] == "timm"
            else geffnet.create_model
        )
        
        # the "spatial" modes stack the arrays into one single-channel image;
        # the other modes keep one input channel per array (fallback: 3)
        in_chans_by_mode = {"spatial_6ch": 1, "spatial_3ch": 1, "6_channel": 6, "3_channel": 3}
        in_chans = in_chans_by_mode.get(config["DATA"]["CHANNEL_MODE"], 3)

        self.model = _model_factory(
            model_name=self.config["MODEL"]["MODEL_NAME"],
            pretrained=self.pretrained,
            in_chans=in_chans,
        )

        # reset head
        self.model.reset_classifier(num_classes=0, global_pool="avg")
        
        # after resetting, there is no longer any classifier head, therefore it is the backbone now.
        self.architecture["backbone"] = self.model
        
        # get out features of the last cnn layer from backbone, which is also the in features of the next layer
        self.in_features = self.architecture["backbone"].num_features

        self.single_head_fc = torch.nn.Sequential(
            torch.nn.Linear(self.in_features, self.in_features),
            self.activation,
            torch.nn.Dropout(p=0.5),
            torch.nn.Linear(self.in_features, self.config["DATA"]["NUM_CLASSES"]),
        )
        self.architecture["classifier_head"] = self.single_head_fc


    # feature map after cnn layer
    def extract_features(self, x):
        feature_logits = self.architecture["backbone"](x)
        # TODO: caution, if you use forward_features, then you need reshape. See test.py
        return feature_logits

    def forward(self, x):
        feature_logits = self.extract_features(x)
        classifier_logits = self.architecture["classifier_head"](feature_logits)
        return classifier_logits

A good practice that is often overlooked is to run a simple `forward-pass` through your `model` first. This catches shape and channel bugs early instead of later.

In [None]:
model_forward_pass = AlienSingleHead(config,pretrained=False)
train_dataset = AlienDataset(train, config, transform=Transform(config["TRAIN_TRANSFORMS"]))
train_loader = DataLoader(train_dataset, batch_size=4, shuffle=True,
                          num_workers=4, pin_memory=True, drop_last=True)

for image, label in train_loader:
    output = model_forward_pass(image)
    print(output)
    break
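
The same smoke test can also be run without touching the data at all, by pushing a random tensor of the expected shape through the network. This is a minimal sketch, assuming a tiny stand-in network (`TinyNet` is hypothetical, not the `AlienSingleHead` above) with a single input channel and a single output logit.

```python
import torch
from torch import nn


class TinyNet(nn.Module):
    """Hypothetical stand-in: conv backbone followed by a pooled linear head."""

    def __init__(self, in_chans: int = 1, num_classes: int = 1):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(in_chans, 8, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),  # global average pooling
            nn.Flatten(),
        )
        self.head = nn.Linear(8, num_classes)

    def forward(self, x):
        return self.head(self.backbone(x))


dummy_batch = torch.randn(4, 1, 64, 64)  # (batch, channels, height, width)
with torch.no_grad():
    dummy_output = TinyNet()(dummy_batch)
print(dummy_output.shape)  # one logit per image
```

If the channel count of the dummy tensor does not match what the first layer expects, the error surfaces here, long before a full training or inference run.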

<a href="#top">Back to top</a>

<a id="8"></a>

<h1 style = "font-family: garamond; font-size: 40px; font-style: normal; letter-spcaing: 3px; background-color: #f6f5f5; color :#fe346e; border-radius: 100px 100px; text-align:center">Inference By Folds</h1>

The `inference_by_fold` function has served me faithfully across competitions; I only ever have to worry about changing `sigmoid` to `softmax` depending on the model's settings. One thing to note is that I save a lot of things in the checkpoint dictionary next to the `model` `state_dict`. In particular, I save the `oof` predictions there. This makes it easy to pull out the `oof` predictions when I forget to save them separately during training, which happens often.

See the example section further down for a better understanding of how it works under the hood.
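
As a rough illustration of such a checkpoint, the sketch below saves a dictionary holding both the weights and some out-of-fold predictions, then loads it back. `"model_state_dict"` is the key the inference code looks for; the `"oof_preds"` key, the toy `nn.Linear` model and the in-memory buffer are assumptions made for this example, not the exact format used in training.

```python
import io

import torch
from torch import nn

model = nn.Linear(4, 1)  # toy stand-in for the real network

# "model_state_dict" is the key the inference code looks for;
# "oof_preds" is a hypothetical key name used for illustration
checkpoint = {
    "model_state_dict": model.state_dict(),
    "oof_preds": torch.tensor([0.1, 0.7, 0.3]),
}

buffer = io.BytesIO()  # in-memory file, so nothing is written to disk
torch.save(checkpoint, buffer)
buffer.seek(0)

loaded = torch.load(buffer, map_location="cpu")
restored = nn.Linear(4, 1)
restored.load_state_dict(loaded["model_state_dict"])
print(sorted(loaded.keys()), loaded["oof_preds"])
```

In practice the buffer would be a file path such as a per-fold `.pt` file, and the extra entries can be read back without ever rebuilding the model.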

In [None]:
def inference_by_fold(config, model, state_dicts, test_loader):
    model.to(device)
    model.eval()
    
    all_folds_preds = []
    
    with torch.no_grad():
        
        for fold_num, state in enumerate(state_dicts):
            if "model_state_dict" not in state:
                model.load_state_dict(state)
            else:
                model.load_state_dict(state["model_state_dict"])

            current_fold_preds = []
            for data in tqdm(test_loader, position=0, leave=True):
                images = data
                images = images.to(device)
                logits = model(images)

                sigmoid_preds = logits.sigmoid().detach().cpu().numpy()
                current_fold_preds.append(sigmoid_preds)

            current_fold_preds = np.concatenate(current_fold_preds, axis=0)
            all_folds_preds.append(current_fold_preds)
        avg_preds = np.mean(all_folds_preds, axis=0)
    return avg_preds

def LoadTestSet(test_df: pd.DataFrame, config):
    """Run inference on the test set and write submission.csv."""
    model = AlienSingleHead(config, pretrained=False)
    model.to(device)

    # consider adding args or kwargs here to accommodate multiple tta transforms
    def test_transforms(config=config, tta=False):

        transforms_dict = {}
        # the plain test transform is always built; tta is optional
        transforms_dict['transforms_test'] = Transform(config["TEST_TRANSFORMS"])

        if tta is not False:
            # one Transform per entry of the TTA_TRANSFORMS list
            transforms_dict['transforms_tta'] = [
                Transform(tta_configs) for tta_configs in config["TTA_TRANSFORMS"]
            ]

        return transforms_dict

    transforms_test = test_transforms(config, tta=False)['transforms_test']

    test_dataset = AlienDataset(df=test_df, config=config, mode='test', transform=transforms_test)
    # a TTA dataset and loader can be built the same way from transforms_dict['transforms_tta']

    test_loader = DataLoader(test_dataset, batch_size=64, shuffle=False,
                             num_workers=4, pin_memory=True)

    # one checkpoint path per fold
    weights = ["../input/et-alien-weights/efficientnet_b0_fold_1.pt"]

    # map_location lets checkpoints saved on a GPU load on whatever `device` is
    state_dicts = [torch.load(path, map_location=device)['model_state_dict'] for path in weights]

    predictions = inference_by_fold(config=config, model=model, state_dicts=state_dicts, test_loader=test_loader)
    test_df['target'] = predictions
    test_df[['id', 'target']].to_csv('submission.csv', index=False)
    display(test_df.head())

    # distribution of the predicted probabilities
    plt.figure(figsize=(12, 6))
    plt.hist(test_df.target, bins=100)

In [None]:
LoadTestSet(test, config)

<a href="#top">Back to top</a>

<a id="60"></a>

<h1 style = "font-family: garamond; font-size: 40px; font-style: normal; letter-spcaing: 3px; background-color: #f6f5f5; color :#fe346e; border-radius: 100px 100px; text-align:center">Dissecting Inference By Folds</h1>

[Reference I on np.concatenate](https://stackoverflow.com/questions/63722692/what-does-numpy-concatenate-do-with-a-single-argument)

When I started out, I had a hard time understanding what "averaging predictions by folds" means. This could be attributed to me jumping the gun. Here, I give a full overview with a simple example to illustrate the idea.

<a id="61"></a>

<h2 style = "font-family: garamond; font-size: 40px; font-style: normal; letter-spcaing: 3px; background-color: #f6f5f5; color :#fe346e; border-radius: 100px 100px; text-align:center">Problem Settings</h2>

- Batch Size: 2
- Number of Test Images: 3
- Number of Folds: 5
- Classification Problem with Softmax Activation
- Number of classes: 5
- all_preds_array: holds 5 inner arrays, one per fold; each inner array is 2d with shape (3, 5), i.e. 3 rows and 5 columns, where 3 is the number of images and 5 the number of class probabilities predicted per image.

I purposely set the `batch_size` to 2 and the total number of images to predict on to 3. This means your `test_loader` needs two iterations to complete the predictions: a batch of 2 images, then a batch of 1.

The `all_preds_array` below holds dummy predictions that I made up for the 3 images across 5 folds/models. Note that it contains 5 inner arrays; for example, `all_preds_array[0]` holds the first fold's predictions for all 3 images. This inner array is a 2d array of shape 3 by 5, whose first row, `all_preds_array[0][0]`, holds the 5 probabilities output by `softmax` for image 1. To be more pedantic, `all_preds_array[0][0][0]` being 0.01 means that fold 1 predicts a probability of 0.01 that image 1 belongs to class 1, and `all_preds_array[0][0][1]` being 0.02 means that fold 1 predicts a probability of 0.02 that the same image belongs to class 2. Note that each row must sum to 1 in this case, because we are using `softmax`, whose outputs over all 5 classes sum to 1.

---

So we have 5 folds; let us focus on just image 1 for simplicity. We perform inference on image 1 five times, once with each fold's trained weights (the architecture is the same), holding everything else constant.

- Image 1 Fold 1 Predictions: `[0.01, 0.02, 0.03, 0.9, 0.04]`
- Image 1 Fold 2 Predictions: `[0.03, 0.05, 0.01, 0.88, 0.03]`
- Image 1 Fold 3 Predictions: `[0.05, 0.03, 0.08, 0.83, 0.01]`
- Image 1 Fold 4 Predictions: `[0.02, 0.01, 0.03, 0.92, 0.02]`
- Image 1 Fold 5 Predictions: `[0.01, 0.02, 0.02, 0.93, 0.02]`

We then add up the predictions of all 5 folds and **average** them. This is akin to performing a mean ensemble. Averaging usually produces a more robust result (can you explain why?).
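
For image 1, the averaging can be done by hand in a few lines of plain Python. This is only an illustrative sketch using the five rows listed above; the variable names are made up for the example.

```python
# the five folds' softmax outputs for image 1, copied from the list above
image_1_preds = [
    [0.01, 0.02, 0.03, 0.90, 0.04],  # fold 1
    [0.03, 0.05, 0.01, 0.88, 0.03],  # fold 2
    [0.05, 0.03, 0.08, 0.83, 0.01],  # fold 3
    [0.02, 0.01, 0.03, 0.92, 0.02],  # fold 4
    [0.01, 0.02, 0.02, 0.93, 0.02],  # fold 5
]

num_folds = len(image_1_preds)
# zip(*...) groups the values class by class, so each mean runs across the folds
image_1_avg = [round(sum(class_probs) / num_folds, 3) for class_probs in zip(*image_1_preds)]
print(image_1_avg)  # [0.024, 0.026, 0.034, 0.892, 0.024]
```

The averaged row still sums to 1, and class 4 remains the clear winner with an averaged probability of 0.892.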

---

> So we have trained 5 models because we used 5-fold cross-validation. For each of the 5 fold models, we load its weights and use them to make predictions on the unseen test images. As a result, we have 5 predictions for each test image. To be clear, if you have 1000 unseen test images, indexed by $i \in \{1, 2, \ldots, 1000\}$, then each image $i$ has 5 predictions, which we can call $P(i_{j})$ for $j \in \{1, 2, 3, 4, 5\}$, where $j$ indexes the fold. By convention, we take the **average/mean** of these 5 sets of predictions as the final prediction value/probability. This is then submitted to Kaggle and we get what we call the LB score, which should correlate with your CV/OOF score.
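
Written as a formula, and assuming $K$ denotes the number of folds ($K = 5$ here) and $\bar{P}(i)$ the final prediction for image $i$ (both symbols are introduced here only for illustration), the fold average reads:

$$\bar{P}(i) = \frac{1}{K} \sum_{j=1}^{K} P(i_{j})$$

When each $P(i_{j})$ is a vector of class probabilities, the sum is taken class by class, so $\bar{P}(i)$ is again a vector of the same length.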



In [None]:
all_preds_array = \
np.array([np.array([[0.01, 0.02, 0.03, 0.9 , 0.04],
           [0.02, 0.8 , 0.05, 0.06, 0.07],
           [0.01, 0.8 , 0.03, 0.09, 0.07]]), 
 
 np.array([[0.03,   0.05  , 0.01  , 0.88  ,  0.03],
           [0.01  , 0.82  , 0.02  , 0.07  ,  0.08],
           [0.005 , 0.81  , 0.0205, 0.0555, 0.109]]),
 
 np.array([[0.05, 0.03, 0.08, 0.83, 0.01],
           [0.05, 0.89, 0.01, 0.02, 0.03],
           [0.03, 0.78, 0.05, 0.02, 0.12]]),
 
 np.array([[0.02, 0.01, 0.03, 0.92, 0.02],
           [0.05, 0.85, 0.01, 0.01, 0.08],
           [0.02, 0.88, 0.03, 0.05, 0.02]]), 
 
 np.array([[0.01, 0.02, 0.02, 0.93, 0.02],
           [0.03, 0.76, 0.06, 0.06, 0.09],
           [0.02, 0.83, 0.01, 0.07, 0.07]])])

# np.mean over axis=0 averages across the first axis (the 5 folds), element by element,
# so the result has shape (3, 5): one averaged row of class probabilities per image.
print(np.mean(all_preds_array, axis=0))
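
The choice of `axis` is easy to mix up, so here is a smaller toy stack (2 folds, 2 images, 3 classes, all values made up) that contrasts averaging across folds with averaging across images. It is a sketch for intuition only.

```python
import numpy as np

# toy stack: 2 folds x 2 images x 3 classes
toy_preds = np.array([
    [[0.2, 0.3, 0.5], [0.6, 0.3, 0.1]],  # fold 1
    [[0.4, 0.1, 0.5], [0.2, 0.5, 0.3]],  # fold 2
])

across_folds = toy_preds.mean(axis=0)   # averages fold 1 with fold 2, per image
across_images = toy_preds.mean(axis=1)  # averages image 1 with image 2, per fold

print(toy_preds.shape, across_folds.shape)
print(across_folds)
```

Only `axis=0` gives one averaged prediction per image, which is what we want for the ensemble; `axis=1` would instead blend different images together within each fold.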

The cell below, whose input I have hidden, demonstrates how batching works in the previous example:

In [None]:
### Fold 1 ###
### First Batch: 2 Predictions ###
fold_1_pred_1 = np.array(
    [[0.01, 0.02, 0.03, 0.9, 0.04], [0.02, 0.8, 0.05, 0.06, 0.07]])

### Second Batch: 1 Prediction ###

fold_1_pred_2 = np.array([[0.01, 0.8, 0.03, 0.09, 0.07]])


### Fold 2 ###
### First Batch: 2 Predictions ###

fold_2_pred_1 = np.array([[0.03, 0.05, 0.01, 0.88, 0.03], [
                         0.01, 0.82, 0.02, 0.07, 0.08]])

### Second Batch: 1 Prediction ###

fold_2_pred_2 = np.array([[0.005, 0.81, 0.0205, 0.0555, 0.109]])


### Fold 3 ###
### First Batch: 2 Predictions ###

fold_3_pred_1 = np.array([[0.05, 0.03, 0.08, 0.83, 0.01], [
                         0.05, 0.89, 0.01, 0.02, 0.03]])

### Second Batch: 1 Prediction ###

fold_3_pred_2 = np.array([[0.03, 0.78, 0.05, 0.02, 0.12]])


### Fold 4 ###
### First Batch: 2 Predictions ###

fold_4_pred_1 = np.array([[0.02, 0.01, 0.03, 0.92, 0.02], [
                         0.05, 0.85, 0.01, 0.01, 0.08]])

### Second Batch: 1 Prediction ###

fold_4_pred_2 = np.array([[0.02, 0.88, 0.03, 0.05, 0.02]])


### Fold 5 ###
### First Batch: 2 Predictions ###

fold_5_pred_1 = np.array([[0.01, 0.02, 0.02, 0.93, 0.02], [
                         0.03, 0.76, 0.06, 0.06, 0.09]])

### Second Batch: 1 Prediction ###

fold_5_pred_2 = np.array([[0.02, 0.83, 0.01, 0.07, 0.07]])


### This list should contain the predictions of all 5 folds ###
all_folds_preds = []


fold_1_preds = []

### Fold 1 ###
### All Batches: Total of 3 Predictions in the following format ###

fold_1_preds = [fold_1_pred_1, fold_1_pred_2]
#print('\nfold_1_preds before np.concatenate\n',fold_1_preds)

### Concatenate the list of per-batch arrays into a single (3, 5) numpy array ###

fold_1_preds = np.concatenate(fold_1_preds, axis=0)

# Good to know: passing the list variable or writing the list inline is equivalent. #
# np.concatenate joins the arrays in the list along axis 0, i.e. it stacks their rows. #

fold_1_preds_ = np.concatenate([fold_1_pred_1, fold_1_pred_2], axis=0)
#print('fold_1_preds after np.concatenate\n',fold_1_preds)
#print('fold_1_preds after np.concatenate using different method\n',fold_1_preds_)

all_folds_preds.append(fold_1_preds)
print('All Folds Pred list after Fold 1 is \n\n{}\n\n'.format(all_folds_preds))
###################################################################################

fold_2_preds = []

### Fold 2 ###
### All Batches: Total of 3 Predictions in the following format ###

fold_2_preds = [fold_2_pred_1, fold_2_pred_2]
fold_2_preds = np.concatenate(fold_2_preds, axis=0)

all_folds_preds.append(fold_2_preds)
print('All Folds Pred list after Fold 1+2 is \n\n{}\n\n'.format(all_folds_preds))


fold_3_preds = []

### Fold 3 ###
### All Batches: Total of 3 Predictions in the following format ###

fold_3_preds = [fold_3_pred_1, fold_3_pred_2]
fold_3_preds = np.concatenate(fold_3_preds, axis=0)
all_folds_preds.append(fold_3_preds)
print('All Folds Pred list after fold 1+2+3 is \n\n{}\n\n'.format(all_folds_preds))


fold_4_preds = []

### Fold 4 ###
### All Batches: Total of 3 Predictions in the following format ###

fold_4_preds = [fold_4_pred_1, fold_4_pred_2]
fold_4_preds = np.concatenate(fold_4_preds, axis=0)
all_folds_preds.append(fold_4_preds)
print('All Folds Pred list after fold 1+2+3+4 is \n\n{}\n\n'.format(all_folds_preds))


fold_5_preds = []

### Fold 5 ###
### All Batches: Total of 3 Predictions in the following format ###

fold_5_preds = [fold_5_pred_1, fold_5_pred_2]
fold_5_preds = np.concatenate(fold_5_preds, axis=0)
all_folds_preds.append(fold_5_preds)
print('All Folds Pred list after fold 1+2+3+4+5 is \n\n{}\n\n'.format(all_folds_preds))


###### Finally, np.mean over axis=0 averages the 5 (3, 5) arrays element-wise across folds ######
###### Do not concatenate here: that would glue the folds into one (15, 5) array ######
avg_pred_without_concat = np.mean(all_folds_preds, axis=0)
print('Average predictions is \n\n{}'.format(avg_pred_without_concat))
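
The walkthrough above can be compressed into a loop. The sketch below uses only folds 1 and 2 from the example to stay short, and also shows why concatenating the folds at the end would be wrong. The names `batches_per_fold`, `stacked` and `wrong` are made up for this illustration.

```python
import numpy as np

# per-fold lists of batch outputs: a batch of 2 images, then a batch of 1
# (values copied from folds 1 and 2 of the walkthrough above)
batches_per_fold = [
    [np.array([[0.01, 0.02, 0.03, 0.9, 0.04], [0.02, 0.8, 0.05, 0.06, 0.07]]),
     np.array([[0.01, 0.8, 0.03, 0.09, 0.07]])],
    [np.array([[0.03, 0.05, 0.01, 0.88, 0.03], [0.01, 0.82, 0.02, 0.07, 0.08]]),
     np.array([[0.005, 0.81, 0.0205, 0.0555, 0.109]])],
]

all_folds_preds = []
for fold_batches in batches_per_fold:
    # (2, 5) and (1, 5) -> (3, 5): join the batches back into one array per fold
    all_folds_preds.append(np.concatenate(fold_batches, axis=0))

stacked = np.stack(all_folds_preds)              # (2, 3, 5): folds x images x classes
avg_preds = stacked.mean(axis=0)                 # (3, 5): average across folds
wrong = np.concatenate(all_folds_preds, axis=0)  # (6, 5): folds glued on as extra rows

print(stacked.shape, avg_preds.shape, wrong.shape)
```

`np.concatenate` is the right tool for joining batches within a fold, because batches are more rows of the same prediction table; across folds, each array is another opinion on the same rows, so they are averaged instead.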

In [None]:
# Converting to matrix in latex form

print((bmatrix(fold_1_preds)+'\n'))
print((bmatrix(fold_2_preds)+'\n'))
print((bmatrix(fold_3_preds)+'\n'))
print((bmatrix(fold_4_preds)+'\n'))
print((bmatrix(fold_5_preds)+'\n'))

five_folds_add = (fold_1_preds+fold_2_preds+fold_3_preds+fold_4_preds+fold_5_preds)
# five_folds_add
print((bmatrix(five_folds_add)+'\n'))

five_folds_add_avg = (fold_1_preds+fold_2_preds+fold_3_preds+fold_4_preds+fold_5_preds)/5
five_folds_add_avg
# print((bmatrix(five_folds_add_avg)+'\n'))

<a href="#top">Back to top</a>

### Step By Step

We create the `inference_by_fold` function; a step by step explanation based on my own test set is as follows:

#### Qn 1: Why do we call `model.to(device)` in PyTorch?

- We move the `model`'s parameters and buffers to `device` (GPU or CPU), which tells PyTorch where the computation runs; the input tensors must be moved to the same device.
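
A minimal sketch of the idea, assuming a toy `nn.Linear` model and a device chosen at runtime (the notebook itself hard-codes `"cuda"` in the config):

```python
import torch
from torch import nn

# fall back to the CPU when no GPU is visible
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

toy_model = nn.Linear(3, 1).to(device)  # parameters now live on `device`
inputs = torch.randn(2, 3).to(device)   # inputs must be moved as well
outputs = toy_model(inputs)

param_device = next(toy_model.parameters()).device
print(param_device.type, outputs.device.type)
```

If the model and the inputs sit on different devices, the forward pass raises a runtime error, which is why the inference loop calls `images.to(device)` on every batch.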

#### Qn 2: Why do we set model to `eval()` mode in PyTorch during inference/validation?

- Next, we set the `model` to `eval()` mode since we are in the inference phase. This is extremely important: if you DO NOT set `eval` mode, your model's predictions during inference will be off. Why? I could spend the whole day explaining, but the simple idea is that your model contains layers that behave differently during training, such as the **regularization** layer `nn.Dropout()` or the common `nn.BatchNorm1d(2d)` layers. If the model is still in `train` mode during inference, your predictions will experience DROPOUT as well, and BatchNorm will normalize with the current batch's statistics. This leads to different, and possibly worse, predictions every time you run inference.

> What eval mode does to BatchNorm: During training, this layer keeps a running estimate of its computed mean and variance. The running sum is kept with a default momentum of 0.1. During evaluation, this running mean/variance is used for normalization.
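To make the difference between `train` and `eval` mode concrete, here is a minimal sketch using a standalone `nn.Dropout` layer. It is a toy example, not the competition model, and the tensor size and seed are arbitrary choices of mine:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)  # only to make the toy example repeatable

dropout = nn.Dropout(p=0.5)
x = torch.ones(1, 1000)

# train mode: roughly half of the entries are zeroed, the survivors are scaled by 1 / (1 - p)
dropout.train()
train_out = dropout(x)

# eval mode: dropout is a no-op, the output equals the input
dropout.eval()
eval_out = dropout(x)

print("zeros in train mode:", int((train_out == 0).sum()))
print("eval output equals input:", torch.equal(eval_out, x))
```

If you forget `model.eval()`, every forward pass behaves like the `train_out` line, which is why predictions change from run to run.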

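#### Qn 3: Why do we use `torch.no_grad()` in PyTorch?

- [Answer](https://stackoverflow.com/questions/63351268/torch-no-grad-affects-on-model-accuracy): Since we are in the inference phase, we wrap the forward pass in `torch.no_grad()` so that PyTorch does not build the computation graph or store intermediate results for gradients. This saves memory and time; it does not change the predictions themselves.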
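A small sketch of what `torch.no_grad()` changes, using a toy `nn.Linear` as a hypothetical stand-in for the real model: the numbers coming out are the same, only the gradient bookkeeping is switched off.

```python
import torch
import torch.nn as nn

toy_model = nn.Linear(4, 2)  # hypothetical stand-in for the real model
toy_model.eval()
x = torch.randn(3, 4)

out_with_grad = toy_model(x)      # tracked: the output carries a grad_fn
with torch.no_grad():
    out_no_grad = toy_model(x)    # not tracked: no graph is built

print(out_with_grad.requires_grad)                  # True
print(out_no_grad.requires_grad)                    # False
print(torch.allclose(out_with_grad, out_no_grad))   # True
```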

With the above questions cleared, we can go back to our function.

1. We move the `model` to `device`, set it to `eval()` mode, and wrap everything in `torch.no_grad()`, for the reasons given in the three questions above.
2. Initiate with an empty list: `all_folds_preds=[]`. By the end, this `list` should contain 5 `numpy` arrays (one per fold), each of shape `3 by 5`, or `n by 5` in general, where `n` is the total number of test images. Alternatively, as in the previous section, instead of `appending` we could add up each fold's predictions and divide by 5 at the end. (We can explore this, but for now we stick to the code below.)

3. Here we start looping through each fold's model to make predictions. Since there are 5 folds, we have 5 models, and `states_dicts` is a `list` holding 5 dictionaries, one per model, each containing that model's weights. The `if-else` clause can be ignored, since in future I will standardize the way I save models to strictly the latter: `model.load_state_dict(state['model_state_dict'])`

        for fold_num, state in enumerate(states_dicts):
            if 'model_state_dict' not in state:
                model.load_state_dict(state)
            else:
                model.load_state_dict(state['model_state_dict'])
                
4. Initiate another empty list: `current_fold_preds = []`. By the end of the inner loop, it holds all the predictions of fold $i$ in the following format:

        [array([[0.01, 0.02, 0.02, 0.93, 0.02],
                [0.03, 0.76, 0.06, 0.06, 0.09]]),
                
         array([[0.02, 0.83, 0.01, 0.07, 0.07]])]
         
    Note that the `list` contains only 2 entries: the first is an array of shape (2,5), and the second is an array of shape (1,5). This is because there are only 3 test images, and hence 3 predictions. They are split up because we set `batch_size` to 2: when iterating through the `DataLoader`, the first iteration appends the first 2 images' predictions, and the second iteration appends the remaining image's predictions.
    
    
5. This part should be relatively easy to understand. In essence, we loop through the `DataLoader/TestLoader` and predict each batch of images by calling `model(images)`, which returns **logits** because that is how we defined the forward pass in our **Model**. We then convert the **logits** into **softmax predictions**. The naming might not be suggestive enough, so note that `logits` is a **tensor**, while `softmax_preds` is a **numpy array**.

            for data in tqdm(test_loader, position=0, leave=True):
                img_ids, images, labels = data
                images = images.to(device)
                
                logits = model(images)
                softmax_preds = torch.nn.Softmax(dim=1)(input=logits).to('cpu').numpy()
                #print('softmax predictions for fold {} is {}'.format(fold_num+1,softmax_preds))
                # do not use argmax here as we still need these softmax probabilities for averaging.
                current_fold_preds.append(softmax_preds)
                #print(current_fold_preds)
                
6. After finishing each inner loop, we `concatenate` `current_fold_preds` and `append` the result to `all_folds_preds`. The `concatenate` call merges the `list` of per-batch arrays into a single `array`, as shown below. You can compare and contrast this with point 4.

 
            current_fold_preds = np.concatenate(current_fold_preds, axis = 0)
            all_folds_preds.append(current_fold_preds)
            
            
             array([[0.01, 0.02, 0.02, 0.93, 0.02],
                    [0.03, 0.76, 0.06, 0.06, 0.09],
                    [0.02, 0.83, 0.01, 0.07, 0.07]])
                    
                    
7. At this juncture, we are nearing the end. What we have described above can be summarized compactly as follows: 

    Outer Loop: We loop through `states_dicts` (5 folds = 5 models), and for each fold/model,

    Inner Loop: We loop through the `DataLoader` to predict all the images in the test set.
    
    Final results: `all_folds_preds = [fold_1_preds, fold_2_preds, fold_3_preds, fold_4_preds, fold_5_preds]`
    
    Finally, we average the `all_folds_preds` using `avg_preds = np.mean(all_folds_preds, axis=0)` to get our averaged predictions: 
    
        [[0.024  0.026  0.034  0.892  0.024 ]
         [0.032  0.824  0.03   0.044  0.07  ]
         [0.017  0.82   0.0281 0.0571 0.0778]]
         
    Which matches our matrix:
    
    $$\text{Dividing/Averaging all 5 Fold's Predictions}=\begin{bmatrix}
  0.024 & 0.026 & 0.034 & 0.892 & 0.024\\
  0.032 & 0.824 & 0.03 & 0.044 & 0.07\\
  0.017 & 0.82 & 0.0281 & 0.0571 & 0.0778\\
\end{bmatrix}$$
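The batch splitting in point 4 and the `concatenate` in point 6 can be reproduced with a small sketch. Everything here is made up for illustration: the 3 "images" are random tensors, and the logits are random numbers standing in for `model(images)`.

```python
import numpy as np
import torch
from torch.utils.data import DataLoader, TensorDataset

# 3 fake "images" (random tensors) standing in for the 3 test images
fake_images = torch.randn(3, 3, 8, 8)
loader = DataLoader(TensorDataset(fake_images), batch_size=2, shuffle=False)

current_fold_preds = []
for (images,) in loader:
    # fake logits for 5 classes; a real model would return model(images)
    logits = torch.randn(images.shape[0], 5)
    current_fold_preds.append(torch.softmax(logits, dim=1).numpy())

batch_shapes = [p.shape for p in current_fold_preds]
print(batch_shapes)                  # [(2, 5), (1, 5)]

# merge the per-batch arrays into one (n_images, n_classes) array
current_fold_preds = np.concatenate(current_fold_preds, axis=0)
print(current_fold_preds.shape)      # (3, 5)
```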


**Final note**

If you only pass in one fold/model, this inference function will still work.

**Numpy Shape Rows vs Columns**


There are 3 images in the test set. Each fold's predictions for these 3 images form an array of `shape` `[3,5]`, i.e. 3 rows and 5 columns, where $\text{row}_i$ is a (1 by 5) vector holding the predicted probability of each of the 5 classes for image $i$. To be even more explicit: if fold 1's prediction on the first image (call it $\text{image}_1$) is $[0.01, 0.02, 0.03, 0.9 , 0.04]$, then the predicted probability of $\text{image}_1$ being class 1 is $1\%$, class 2 is $2\%$, class 3 is $3\%$, class 4 is $90\%$ and class 5 is $4\%$.


Those with a Linear Algebra background can think of a (3 by 5) 2-dimensional array as a (3 x 5) **Matrix**.

To be clear, the 5 matrices below represent the predictions of each fold. 


$$\text{Fold 1 Predictions}=\begin{bmatrix}
  0.01 & 0.02 & 0.03 & 0.9 & 0.04\\
  0.02 & 0.8 & 0.05 & 0.06 & 0.07\\
  0.01 & 0.8 & 0.03 & 0.09 & 0.07\\
\end{bmatrix}$$


$$\text{Fold 2 Predictions}=\begin{bmatrix}
  0.03 & 0.05 & 0.01 & 0.88 & 0.03\\
  0.01 & 0.82 & 0.02 & 0.07 & 0.08\\
  0.005 & 0.81 & 0.0205 & 0.0555 & 0.109\\
\end{bmatrix}$$


$$\text{Fold 3 Predictions}=\begin{bmatrix}
  0.05 & 0.03 & 0.08 & 0.83 & 0.01\\
  0.05 & 0.89 & 0.01 & 0.02 & 0.03\\
  0.03 & 0.78 & 0.05 & 0.02 & 0.12\\
\end{bmatrix}$$

$$\text{Fold 4 Predictions}=\begin{bmatrix}
  0.02 & 0.01 & 0.03 & 0.92 & 0.02\\
  0.05 & 0.85 & 0.01 & 0.01 & 0.08\\
  0.02 & 0.88 & 0.03 & 0.05 & 0.02\\
\end{bmatrix}$$

$$\text{Fold 5 Predictions}=\begin{bmatrix}
  0.01 & 0.02 & 0.02 & 0.93 & 0.02\\
  0.03 & 0.76 & 0.06 & 0.06 & 0.09\\
  0.02 & 0.83 & 0.01 & 0.07 & 0.07\\
\end{bmatrix}$$



All that is left is to add these 5 matrices and divide by 5, as follows:

$$\text{Adding all 5 Fold's Predictions}=\begin{bmatrix}
  0.12 & 0.13 & 0.17 & 4.46 & 0.12\\
  0.16 & 4.12 & 0.15 & 0.22 & 0.35\\
  0.085 & 4.1 & 0.1405 & 0.2855 & 0.389\\
\end{bmatrix}$$


Dividing by 5 (averaging):


$$\text{Dividing/Averaging all 5 Fold's Predictions}=\begin{bmatrix}
  0.024 & 0.026 & 0.034 & 0.892 & 0.024\\
  0.032 & 0.824 & 0.03 & 0.044 & 0.07\\
  0.017 & 0.82 & 0.0281 & 0.0571 & 0.0778\\
\end{bmatrix}$$
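As a sanity check on the arithmetic, the sketch below re-types the five fold matrices from this section and compares `np.mean(..., axis=0)` with "add and divide by 5". It also shows why concatenating the folds is not what we want: the shape no longer matches the number of test images.

```python
import numpy as np

fold_1 = np.array([[0.01, 0.02, 0.03, 0.90, 0.04],
                   [0.02, 0.80, 0.05, 0.06, 0.07],
                   [0.01, 0.80, 0.03, 0.09, 0.07]])
fold_2 = np.array([[0.03, 0.05, 0.01, 0.88, 0.03],
                   [0.01, 0.82, 0.02, 0.07, 0.08],
                   [0.005, 0.81, 0.0205, 0.0555, 0.109]])
fold_3 = np.array([[0.05, 0.03, 0.08, 0.83, 0.01],
                   [0.05, 0.89, 0.01, 0.02, 0.03],
                   [0.03, 0.78, 0.05, 0.02, 0.12]])
fold_4 = np.array([[0.02, 0.01, 0.03, 0.92, 0.02],
                   [0.05, 0.85, 0.01, 0.01, 0.08],
                   [0.02, 0.88, 0.03, 0.05, 0.02]])
fold_5 = np.array([[0.01, 0.02, 0.02, 0.93, 0.02],
                   [0.03, 0.76, 0.06, 0.06, 0.09],
                   [0.02, 0.83, 0.01, 0.07, 0.07]])

all_folds_preds = [fold_1, fold_2, fold_3, fold_4, fold_5]

# np.mean treats the list as one (5, 3, 5) array and averages over the fold axis
avg_by_mean = np.mean(all_folds_preds, axis=0)
avg_by_sum = sum(all_folds_preds) / 5

print(np.asarray(all_folds_preds).shape)               # (5, 3, 5)
print(avg_by_mean.shape)                               # (3, 5)
print(np.allclose(avg_by_mean, avg_by_sum))            # True
print(np.concatenate(all_folds_preds, axis=0).shape)   # (15, 5): 15 rows for 3 images
```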

# Inference Function

In [None]:
def inference_by_fold(model, states_dicts, test_loader, device):
    model.to(device)
    model.eval()
    
    with torch.no_grad():
        all_folds_preds = []
        for fold_num, state in enumerate(states_dicts):
            if 'model_state_dict' not in state:
                model.load_state_dict(state)
            else:
                model.load_state_dict(state['model_state_dict'])
                
            current_fold_preds = []
            for data in tqdm(test_loader, position=0, leave=True):
                img_ids, images, labels = data
                images = images.to(device)
                
                logits = model(images)
                softmax_preds = torch.nn.Softmax(dim=1)(input=logits).to('cpu').numpy()
                # print('softmax predictions for fold {} is {}'.format(fold_num+1,softmax_preds))
                # do not use argmax here as we still need these softmax probabilities for averaging.
                current_fold_preds.append(softmax_preds)
                #print(current_fold_preds)
            
            current_fold_preds = np.concatenate(current_fold_preds, axis = 0)
            all_folds_preds.append(current_fold_preds)
            

        avg_preds = np.mean(all_folds_preds, axis=0)
        #print("Averaging all 5 folds of softmax predictions", avg_preds)

    
    return avg_preds

# My test images

It is good practice to pass `map_location` to `torch.load`, i.e. `torch.load(path, map_location=device)`, in case you run inference on CPU with weights that were saved from a GPU.
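A minimal sketch of `map_location` in action. The toy `nn.Linear` model and the file name are hypothetical; the point is only that the loaded tensors end up on whichever device you ask for:

```python
import os
import tempfile

import torch
import torch.nn as nn

toy_model = nn.Linear(4, 2)  # hypothetical stand-in for the trained model
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy_fold_0.pt")  # hypothetical file name
    torch.save(toy_model.state_dict(), path)

    # map_location remaps every stored tensor to the given device while loading,
    # so weights saved from a GPU can be loaded on a CPU-only machine.
    state = torch.load(path, map_location=device)

print({name: tensor.device.type for name, tensor in state.items()})
```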

In [None]:
# my_own_test_images = [['11252426.jpg',-1], ['11574961.jpg',-1], ['218377.jpg',-1]] # labels are not impt
# test_df = pd.DataFrame(my_own_test_images, columns = ['image_id', 'label'])

# # Changing my test path here because I am lazy.
# config.test_path = '../input/cassava-test-images'
# states_dicts  = [torch.load(os.path.join(config.weights_dir,'{}_fold_{}_best_val.pt'.format(config.effnet,fold)),map_location=torch.device(config.device)) for fold in config.train_folds_used]
# own_test_dataset = Cassava(test_df, transforms=get_test_transforms(), test=True)
# own_test_loader = DataLoader(own_test_dataset, batch_size=2, shuffle=False, num_workers=4)

# mymodel  = CustomEfficientNet(config, pretrained=False)
# my_predictions = inference_by_fold(mymodel, states_dicts, own_test_loader, config.device)

In [None]:
# my_predictions

# For one fold

In [None]:
# sample = pd.read_csv('../input/cassava-leaf-disease-classification/sample_submission.csv')

In [None]:
# device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
# weights_dir = '../input/cassavahongnan/'
# model = CustomEfficientNet(config, pretrained=False)
# states_dicts = [torch.load('../input/cassavahongnan/tf_efficientnet_b3_ns_best_loss_fold_0.pt', map_location=device)]
# test_dataset = Cassava(sample, transforms=get_test_transforms(), test=True)
# test_loader = DataLoader(test_dataset, batch_size=config.batch_size, shuffle=False, num_workers=4)
# predictions = inference_by_fold(model, states_dicts, test_loader, device)
# # submission
# sample['label'] = predictions.argmax(1)
# sample[['image_id', 'label']].to_csv('submission.csv', index=False)
# sample.head()

# For Multiple Folds

Constant Reminder: We have trained 5 models because we used 5-fold cross validation. For each of the 5 fold models, we load its weights and use them to make predictions on the unseen test images. As a result, we have 5 predictions for each test image. To be clear, if you have 1000 unseen test images, indexed by $$i \in \{1,2,...,1000\}$$ then for each image $i$ there are 5 predictions, which we can call $$P(i_{j}) ~~~ \forall j \in \{1,2,3,4,5\}$$ where $j$ is the fold index. By convention, we take the **average/mean** of these 5 sets of predictions as the final predicted probabilities. The resulting predictions are then submitted to Kaggle and we get what we call the LB score, which should correlate with your CV/OOF score.

# Submission Cautions

When running inference for a submission, I think it is better not to print anything for debugging, otherwise it might take up too much RAM/GPU memory.

And last but not least, remember to use [`argmax(1)`](https://numpy.org/doc/stable/reference/generated/numpy.argmax.html) on the predictions because we want a label. Note very carefully that `argmax(1)` returns, for each row, the index of the largest value. In other words, if you have an array as follows:

    array([[0.00694168, 0.02380404, 0.19128187, 0.00819152, 0.7697809 ]])
    
Then calling `argmax(1)` on it returns **index 4** (the position of the fifth element), corresponding to class/label 4 in this competition. However, one should be cautious, as the labels of a competition are sometimes 1, 2, 3, 4, 5 instead of 0, 1, 2, 3, 4. In that case our prediction's `argmax(1)` will still return 4, which **DOES NOT CORRESPOND** to the correct class (5). In this case, one can create a mapping from index to label.
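A short sketch of the index-to-label mapping idea. The second row of `predictions` and the 1-to-5 labelling scheme are made up for illustration; this competition itself uses labels 0 to 4.

```python
import numpy as np

predictions = np.array([[0.00694168, 0.02380404, 0.19128187, 0.00819152, 0.7697809],
                        [0.10, 0.60, 0.10, 0.10, 0.10]])

pred_index = predictions.argmax(1)   # index of the largest value in each row
print(pred_index)                    # [4 1]

# hypothetical competition whose labels run 1..5 instead of 0..4
index_to_label = {0: 1, 1: 2, 2: 3, 3: 4, 4: 5}
pred_label = np.array([index_to_label[int(i)] for i in pred_index])
print(pred_label)                    # [5 2]
```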

In [None]:
# ====================================================
# inference and submission
# ====================================================


# model = CustomEfficientNet(config, pretrained=False)
# # change test path back
# config.test_path = '../input/cassava-leaf-disease-classification/test_images'
# states_dicts = [torch.load(os.path.join(config.weights_dir,'{}_fold_{}_best_val.pt'.format(config.effnet,fold)),map_location=torch.device(config.device)) for fold in config.train_folds_used]
# test_dataset = Cassava(sample, transforms=get_test_transforms(), test=True)
# test_loader = DataLoader(test_dataset, batch_size=config.batch_size, shuffle=False, num_workers=4)
# predictions = inference_by_fold(model, states_dicts, test_loader, config.device)
# predictions
# # submission
# sample['label'] = predictions.argmax(1)
# sample[['image_id', 'label']].to_csv('submission.csv', index=False)
# sample.head()

<a id="20"></a>

<h1 style = "font-family: garamond; font-size: 40px; font-style: normal; letter-spcaing: 3px; background-color: #f6f5f5; color :#3aaa80; border-radius: 100px 100px; text-align:center">Saving and Loading Models' Weights</h1>

More often than not, we don't have the resources to run the model day and night. Kaggle or Colab disconnects you once a session reaches a certain number of hours. This bothered me when I first started out, because my model had often not **converged** yet by then.

Therefore, it is important to be able to **checkpoint** or **save** our model, for two main reasons:

**1. Save the model's weights and use it later to inference or make predictions.**

**2. Save the model's weights as a checkpoint and use it later to resume training.**

I will be using examples from both the PyTorch website and a book that I bought. References:

1. https://pytorch.org/tutorials/

2. Deep Learning with PyTorch Step-by-Step: A Beginner's Guide, by Daniel Voigt Godoy

<a href="#top">Back to top</a>

### Question

Now, I am unsure if this has been answered already. I remember my buddy telling me that, anecdotally, one can "reset" the LR a little when resuming training from a model's checkpoint.

It is a bit counter-intuitive to me. Say I trained a model for 16 epochs with a scheduler such as `OneCycleLR` + Adam; when I resume training, should I reset the initial learning rate? I would think resetting is counter-intuitive, as the purpose of the learning rate schedule is to let the model slowly converge to a minimum (the global one if the function is convex).

So the question is:

Should I reset the learning rate when resuming training, and if yes, reset it to what?

If we should not reset, what is a good way to "extract" the last learning rate, given that some schedulers depend on factors like the epoch count…


Uncle CPMP's advice: to be added and credited later.
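This does not settle whether one *should* reset the learning rate, but on the narrower "how do I extract the last learning rate" part, one option is `scheduler.get_last_lr()`, together with saving and restoring the optimizer's and scheduler's `state_dict`. The sketch below uses a toy model and made-up hyperparameters (`max_lr=1e-2`, `total_steps=100`), so treat it as an illustration rather than a recipe:

```python
import torch


def make_training_objects():
    # toy model and hyperparameters, purely illustrative
    model = torch.nn.Linear(4, 2)
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
    scheduler = torch.optim.lr_scheduler.OneCycleLR(optimizer, max_lr=1e-2, total_steps=100)
    return model, optimizer, scheduler


model, optimizer, scheduler = make_training_objects()
for _ in range(10):
    loss = model(torch.randn(8, 4)).sum()
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    scheduler.step()

# the learning rate the scheduler set most recently
lr_at_checkpoint = scheduler.get_last_lr()[0]
checkpoint = {"optimizer": optimizer.state_dict(), "scheduler": scheduler.state_dict()}

# "resume": build fresh objects, then restore their state
_, new_optimizer, new_scheduler = make_training_objects()
new_optimizer.load_state_dict(checkpoint["optimizer"])
new_scheduler.load_state_dict(checkpoint["scheduler"])

print(lr_at_checkpoint, new_scheduler.get_last_lr()[0], new_optimizer.param_groups[0]["lr"])
```

After restoring, the scheduler continues from step 10 of its cycle instead of starting over, so no manual bookkeeping of the learning rate is needed in this setup.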


In [None]:
import numpy as np
import torch
from torch.cuda.amp import GradScaler

# create_model, device, train_data_loader and the upper-case constants (MAX_LR, PCT_START,
# DIV_FACTOR, EPOCHS, GRADIENT_ACCUMULATION) are assumed to be defined elsewhere in the notebook.


def save_checkpoint(model, optimizer, scheduler, scaler, epoch, fold, seed, fname):
    # Bundle everything needed to resume training into a single dictionary.
    checkpoint = {
        'model': model.state_dict(),
        'optimizer': optimizer.state_dict(),
        'scheduler': scheduler.state_dict(),
        'scaler': scaler.state_dict(),
        'epoch': epoch,
        'fold': fold,
        'seed': seed,
        }
    torch.save(checkpoint, '../checkpoints/%s/%s_%d_%d.pt' % (fname, fname, fold, seed))


def load_checkpoint(fold, seed, fname):
    # Re-create the objects with the same settings used for training, then restore their states.
    model = create_model().to(device)
    optimizer = torch.optim.Adam(model.parameters(), lr=MAX_LR)
    scheduler = torch.optim.lr_scheduler.OneCycleLR(optimizer=optimizer,
                                              pct_start=PCT_START,
                                              div_factor=DIV_FACTOR,
                                              max_lr=MAX_LR,
                                              epochs=EPOCHS,
                                              steps_per_epoch=int(np.ceil(len(train_data_loader)/GRADIENT_ACCUMULATION)))
    scaler = GradScaler()
    checkpoint = torch.load('../checkpoints/%s/%s_%d_%d.pt' % (fname, fname, fold, seed), map_location=device)
    model.load_state_dict(checkpoint['model'])
    optimizer.load_state_dict(checkpoint['optimizer'])
    scheduler.load_state_dict(checkpoint['scheduler'])
    scaler.load_state_dict(checkpoint['scaler'])
    epoch = checkpoint['epoch']  # the epoch at which the checkpoint was saved
    return model, optimizer, scheduler, scaler, epoch



## What is a state_dict? 

In PyTorch, the learnable parameters (i.e. weights and biases) of an
``torch.nn.Module`` model are contained in the model’s *parameters*
(accessed with ``model.parameters()``). A *state_dict* is simply a
Python dictionary object that maps each layer to its parameter tensor.
Note that only layers with learnable parameters (convolutional layers,
linear layers, etc.) and registered buffers (batchnorm's running_mean)
have entries in the model’s *state_dict*. Optimizer
objects (``torch.optim``) also have a *state_dict*, which contains
information about the optimizer’s state, as well as the hyperparameters
used.

Because *state_dict* objects are Python dictionaries, they can be easily
saved, updated, altered, and restored, adding a great deal of modularity
to PyTorch models and optimizers.

## Simple Example

`in_channels` is the number of channels of the input to the convolutional layer. So, for example, in the case of the convolutional layer that applies to the image, `in_channels` refers to the number of channels of the image. In the case of an RGB image, `in_channels == 3` (red, green and blue); in the case of a gray image, `in_channels == 1`.

`out_channels` is the number of feature maps, which is often equivalent to the number of kernels that you apply to the input. See [here](https://stats.stackexchange.com/a/292064/82135) for more info. 

`kernel_size` is just the size of the kernel, usually `3x3` or `5x5`.

Convolutional layer: consider a convolutional layer which takes $l$ feature maps as input and produces $k$ feature maps as output, with a filter size of $n \times m$.

For example, suppose the input has $l=32$ feature maps, the output has $k=64$ feature maps, and the filter size is $n=3$ by $m=3$. It is important to understand that we don't simply have a $3 \times 3$ filter; we actually have a $3 \times 3 \times 32$ filter, as our input has 32 channels. The layer learns 64 different $3 \times 3 \times 32$ filters, so the total number of weights is $$n \times m \times l \times k$$ There is also one bias term for each output feature map, so the total number of parameters is $$(n \times m \times l + 1) \times k$$

Think of each slice of the kernel as an $n \times n$ matrix; its $n \times n = n^2$ entries are the weights (a.k.a. the values that are multiplied with the input when you apply the convolution).
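The two formulas can be checked against PyTorch directly. A small sketch, using the same numbers as the example ($n=m=3$, $l=32$, $k=64$):

```python
import torch.nn as nn

n, m = 3, 3      # filter height and width
l, k = 32, 64    # number of input and output feature maps

conv = nn.Conv2d(in_channels=l, out_channels=k, kernel_size=(n, m))

num_weights = conv.weight.numel()
num_biases = conv.bias.numel()

print(conv.weight.shape)                               # torch.Size([64, 32, 3, 3])
print(num_weights, n * m * l * k)                      # 18432 18432
print(num_weights + num_biases, (n * m * l + 1) * k)   # 18496 18496
```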

In [None]:
# Define model
class TheModelClass(nn.Module):
    def __init__(self):
        super(TheModelClass, self).__init__()
        
        self.conv1 = nn.Conv2d(in_channels=3, out_channels =6, kernel_size =5)
        self.pool = nn.MaxPool2d(2, 2)
        # here in channels = 6 because out channels of previous is 6.
        self.conv2 = nn.Conv2d(6, 16, 5)
        self.fc1 = nn.Linear(16 * 5 * 5, 120)
        self.fc2 = nn.Linear(120, 84)
        self.fc3 = nn.Linear(84, 10)

    def forward(self, x):
        x = self.pool(F.relu(self.conv1(x)))
        x = self.pool(F.relu(self.conv2(x)))
        x = x.view(-1, 16 * 5 * 5)
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        x = self.fc3(x)
        return x

# Initialize model
model = TheModelClass()

# Initialize optimizer
optimizer = torch.optim.SGD(model.parameters(), lr=0.001, momentum=0.9)

# Print model's state_dict
print("Model's state_dict:")
for param_tensor in model.state_dict():
    print(param_tensor, "\t", model.state_dict()[param_tensor].size())

# Print optimizer's state_dict
print("Optimizer's state_dict:")
for var_name in optimizer.state_dict():
    print(var_name, "\t", optimizer.state_dict()[var_name])
    


[Link to print parameters/weights for each layer](https://stackoverflow.com/questions/49201236/check-the-total-number-of-parameters-in-a-pytorch-model)

[Calculation of parameters in CNN](https://medium.com/@iamvarman/how-to-calculate-the-number-of-parameters-in-the-cnn-5bd55364d7ca)

[LeNet Architecture](https://engmrk.com/lenet-5-a-classic-cnn-architecture/#:~:text=The%20LeNet%2D5%20architecture%20consists,and%20finally%20a%20softmax%20classifier.)

In [None]:
from prettytable import PrettyTable

def count_parameters(model):
    table = PrettyTable(["Modules", "Parameters"])
    total_params = 0
    for name, parameter in model.named_parameters():
        if not parameter.requires_grad: continue
        param = parameter.numel()
        table.add_row([name, param])
        total_params+=param
    print(table)
    print(f"Total Trainable Params: {total_params}")
    return total_params
    
count_parameters(model)

# Indeed, conv1.weight holds 5x5x3x6 = 450 parameters (note the input here has 3 channels); conv1.bias adds 6 more.

## Loading and Saving

**Save**

It is not a secret that one should use `torch.save` to save a model's weights in the following manner:

`torch.save(model.state_dict(), PATH)`



**Load**

And similarly, one should use `torch.load` to load a model's weights.

    model = TheModelClass(*args, **kwargs)
    model.load_state_dict(torch.load(PATH))
    model.eval()
 
 
**Advanced Save and Load**

One can also save more than just the weights, for example everything needed to resume training:

    torch.save({
                'epoch': epoch,
                'model_state_dict': model.state_dict(),
                'optimizer_state_dict': optimizer.state_dict(),
                'loss': loss,
                ...
                }, PATH)
                
                
And to load them, we do it as follows:

    model = TheModelClass(*args, **kwargs)
    optimizer = TheOptimizerClass(*args, **kwargs)

    checkpoint = torch.load(PATH)
    model.load_state_dict(checkpoint['model_state_dict'])
    optimizer.load_state_dict(checkpoint['optimizer_state_dict'])
    epoch = checkpoint['epoch']
    loss = checkpoint['loss']

    model.eval()
    # - or -
    model.train()

**Caution**

Notice that the `load_state_dict()` function takes a dictionary object, **NOT a path to a saved object**. This means that you must deserialize the saved `state_dict` before you pass it to the `load_state_dict()` function. For example, you **CANNOT** load using `model.load_state_dict(PATH)`.
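A quick sketch of the caution above, using a toy `nn.Linear` and a hypothetical file name. Which exception you get when passing a path depends on the PyTorch version, so the sketch catches both `TypeError` and `AttributeError`:

```python
import os
import tempfile

import torch
import torch.nn as nn

toy_model = nn.Linear(4, 2)
restored_model = nn.Linear(4, 2)   # same architecture, different random weights

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy_checkpoint.pt")  # hypothetical file name
    torch.save({"epoch": 3, "model_state_dict": toy_model.state_dict()}, path)

    # wrong: passing the path itself to load_state_dict
    try:
        restored_model.load_state_dict(path)
        path_was_accepted = True
    except (TypeError, AttributeError):
        path_was_accepted = False

    # right: deserialize with torch.load first, then hand over the dictionary
    checkpoint = torch.load(path, map_location="cpu")
    restored_model.load_state_dict(checkpoint["model_state_dict"])

print(path_was_accepted, checkpoint["epoch"])
print(torch.equal(toy_model.weight, restored_model.weight))
```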

## **Loading and Saving on GPU**


**Save:**

    torch.save(model.state_dict(), PATH)

**Load:**

    device = torch.device("cuda")
    model = TheModelClass(*args, **kwargs)
    model.load_state_dict(torch.load(PATH))
    model.to(device)
 
When loading a model on a GPU that was trained and saved on GPU, simply
convert the initialized ``model`` to a CUDA optimized model using
``model.to(torch.device('cuda'))``. Also, be sure to use the
``.to(torch.device('cuda'))`` function on all model inputs to prepare
the data for the model. Note that calling ``my_tensor.to(device)``
returns a new copy of ``my_tensor`` on GPU. It does NOT overwrite
``my_tensor``. Therefore, remember to manually overwrite tensors:
``my_tensor = my_tensor.to(torch.device('cuda'))``. Make sure to call `input = input.to(device)` on any input tensors that you feed to the model, as you will see later.   
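Here is a small sketch of the "modules move in place, tensors must be reassigned" point. It falls back to CPU when no GPU is available, so it runs anywhere; the toy `nn.Linear` is a hypothetical stand-in for the real model.

```python
import torch
import torch.nn as nn

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

toy_model = nn.Linear(4, 2)
toy_model.to(device)               # modules are moved in place

my_tensor = torch.randn(3, 4)
my_tensor.to(device)               # result discarded: the name my_tensor still points to the original
my_tensor = my_tensor.to(device)   # reassign to actually use the moved tensor

output = toy_model(my_tensor)      # model and input now live on the same device
print(output.shape, output.device.type)
```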

# Using it on a real example

**Reminder on what is a state_dict**

In PyTorch, the learnable parameters (i.e. weights and biases) of an
``torch.nn.Module`` model are contained in the model’s *parameters*
(accessed with ``model.parameters()``). A *state_dict* is simply a
Python dictionary object that maps each layer to its parameter tensor.
Note that only layers with learnable parameters (convolutional layers,
linear layers, etc.) and registered buffers (batchnorm's running_mean)
have entries in the model’s *state_dict*. Optimizer
objects (``torch.optim``) also have a *state_dict*, which contains
information about the optimizer’s state, as well as the hyperparameters
used.

In [None]:
# # First, we define the model, since we will be loading our own "pretrained weights", there is no need
# # for one to set pretrained = True here, moreover, this competition is a No-Internet competition.

# effnet_model = CustomEfficientNet(config, pretrained=False)


# # The below code displays each entry's name, along with its size and number of elements.
# # Since the model is set to pretrained = False, the weights are initialized randomly (the exact scheme depends on the model implementation).
# # However, if you set pretrained=True, the weights are the same every time because they come from the pretrained checkpoint.

# def count_parameters(model):
#     table = PrettyTable(["Modules", "Size of Modules", "Number of Parameters"])
#     total_params = 0
#     for param_name in model.state_dict():
#         param_tensor_size = model.state_dict()[param_name].size()
#         num_of_params = model.state_dict()[param_name].numel()
#         table.add_row([param_name, param_tensor_size, num_of_params])
#         total_params+=num_of_params
#     print(table)
#     print(f"Total Params and Buffers in state_dict: {total_params}")
#     return total_params
    
# count_parameters(effnet_model)

In [None]:
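# Secondly, we use torch.load to load the weights that we trained ourselves. Make sure the weights were
# trained with the same EfficientNet variant: I once took EfficientNetB3 weights and used them with an
# EfficientNetB0 model. torch.load itself will not raise an error, since it only deserializes the dictionary.
# With the default strict=True, load_state_dict raises on missing/unexpected keys or mismatched shapes, but with
# strict=False, missing and unexpected keys are silently skipped, and disaster ensues.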

# state_dict_fold_1 = torch.load('../input/flowers/tf_efficientnet_b5_ns_fold_1_best_val.pt')

## CHECK EACH LAYER MATCHES IN SIZE.
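One way to check sizes before loading is to compare shapes entry by entry. The helper below, `find_shape_mismatches`, is a hypothetical function of my own (not part of PyTorch), and the two `nn.Linear` layers stand in for two different model variants:

```python
import torch.nn as nn

small_model = nn.Linear(4, 2)   # hypothetical stand-ins for two different model variants
large_model = nn.Linear(8, 2)


def find_shape_mismatches(state_dict, model):
    """Return {name: (shape_in_state_dict, shape_in_model)} for every entry whose shape differs."""
    model_state = model.state_dict()
    mismatches = {}
    for name, tensor in state_dict.items():
        if name in model_state and tensor.shape != model_state[name].shape:
            mismatches[name] = (tuple(tensor.shape), tuple(model_state[name].shape))
    return mismatches


mismatches = find_shape_mismatches(large_model.state_dict(), small_model)
print(mismatches)   # {'weight': ((2, 8), (2, 4))}

# load_state_dict also refuses mismatched shapes
try:
    small_model.load_state_dict(large_model.state_dict())
    load_failed = False
except RuntimeError:
    load_failed = True
print(load_failed)   # True
```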

In [None]:
# # Thirdly, state_dict_fold_1 is an OrderedDict, before we proceed, it is very important to call
# # load_state_dict on the model! Because if not, we will get the same results as the previous 
# # count_parameters(effnet_model) when the model has yet to be trained!

# from collections import OrderedDict

# effnet_model.load_state_dict(state_dict_fold_1)
# # note that after you load the weights using load_state_dict, the effnet_model.state_dict changes!

# # Note that state_dict_fold_1 and effnet_model.state_dict() are equal!
# # Check the keys of both dict matches
# assert len(state_dict_fold_1) == len(effnet_model.state_dict())

# count = 0
# for param_name_1,param_name_2 in zip(state_dict_fold_1.keys(), effnet_model.state_dict().keys()):
#     if param_name_1 == param_name_2:
#         count+=1
# assert count==len(state_dict_fold_1) == len(effnet_model.state_dict())

# # ok the above code shows both have same param names, now to check if each tensor match:

# tensor_match_count = 0
# for param_name in state_dict_fold_1.keys() & effnet_model.state_dict().keys():
#     state_dict_fold_1_tensor = state_dict_fold_1[param_name].to(config.device)
#     effnet_model_state_dict = effnet_model.state_dict()[param_name].to(config.device)
#     if torch.all(torch.eq(state_dict_fold_1_tensor, effnet_model_state_dict)):
#         tensor_match_count+=1

# assert tensor_match_count==count==len(state_dict_fold_1) == len(effnet_model.state_dict())
# # so indeed all tensors matched.
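The same check can be tried on a toy model without any competition files. This is a sketch under my own assumptions: a small `nn.Sequential` replaces the EfficientNet, and the "saved" weights come from a second randomly initialized copy rather than from disk.

```python
import torch
import torch.nn as nn


def make_toy_model():
    # hypothetical stand-in for CustomEfficientNet
    return nn.Sequential(nn.Linear(4, 8), nn.BatchNorm1d(8), nn.Linear(8, 2))


source_model = make_toy_model()
target_model = make_toy_model()

saved_state = source_model.state_dict()
equal_before_loading = torch.equal(source_model[0].weight, target_model[0].weight)

target_model.load_state_dict(saved_state)
loaded_state = target_model.state_dict()

# same parameter/buffer names in the same order, and identical tensors
same_keys = list(saved_state.keys()) == list(loaded_state.keys())
same_tensors = all(torch.equal(saved_state[name], loaded_state[name]) for name in saved_state)

print(equal_before_loading, same_keys, same_tensors)
```

Before `load_state_dict` the two models hold different random weights; afterwards every entry matches, including the batchnorm buffers.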

[References](https://www.kaggle.com/c/siim-isic-melanoma-classification/discussion/175614)

[Understanding TTA](https://stepup.ai/test_time_data_augmentation/)