<h1 style="text-align: center; font-family: Verdana; font-size: 40px; font-style: normal; font-weight: bold; text-decoration: none; text-transform: none; font-variant: small-caps; letter-spacing: 3px; color: #42288a; background-color: #ffffff;">Learnable Histograms: Statistical Context Features for Deep Neural Networks</h1>

---
<h1 style="text-align: center; font-family: Verdana; font-size: 35px; font-weight: bold; ">ABSTRACT</h1>
<h1 style="text-align: left; font-family: Lucida Handwriting; font-size: 17px; font-style: normal; font-weight: none; text-decoration: none; text-transform: none; font-variant: none; letter-spacing: 3px; color: #429ef5; background-color: #ffffff;">Statistical features, such as histogram, Bag-of-Words (BoW)
and Fisher Vector, were commonly used with hand-crafted features in
conventional classification methods, but attract less attention since the
popularity of deep learning methods. In this paper, we propose a learnable histogram layer, which learns histogram features within deep neural
networks in end-to-end training. Such a layer is able to back-propagate
(BP) errors, learn optimal bin centers and bin widths, and be jointly
optimized with other layers in deep networks during training. Two vision problems, semantic segmentation and object detection, are explored
by integrating the learnable histogram layer into deep networks, which
show that the proposed layer could be well generalized to different applications. In-depth investigations are conducted to provide insights on
the newly introduced layer.
</h1>

<a style="font-size:20px; color:#ccb637; font-weight:bold" href="https://arxiv.org/abs/1804.09398"> PAPER LINK </a>


<h1 style="text-align: center; font-family: Verdana; font-size: 17px; font-weight: none; color: #548f5f; ">
    Context features play a crucial role in many vision classification problems, such as semantic segmentation, object detection, and pose estimation. <br> <br>
    Context features can be broadly categorized as statistical or non-statistical, depending on whether they discard the spatial order of the context information. Non-statistical context features dominate most deep learning methods, which have gained increasing attention in recent years. <br> <br>
    Statistical context features, on the other hand, were mostly used in conventional classification methods with hand-crafted features. Commonly used statistical features include the histogram, Bag-of-Words (BoW), the Fisher vector, and second-order pooling. <br> <br>
    In this paper, the histogram, a statistical feature, is expressed as a sequence of convolutions over feature maps. Unlike existing deep learning methods that treat statistical operations as a separate module, the proposed histogram layer can back-propagate (BP) errors and learn optimal bin centers and bin widths during training. These properties allow it to be integrated into a neural network and trained end-to-end. In this way, the appearance and statistical features in a neural network can adapt to each other and thus lead to better classification accuracy.
</h1>
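
To make the idea of learnable bin centers and bin widths concrete, here is a minimal pure-Python sketch of a soft histogram. It assumes the piecewise-linear voting function max(0, 1 - |x - center| * width), which is my reading of the paper; the names `bin_vote` and `soft_histogram`, and the choice to average the votes, are illustrative assumptions and not part of the notebook.

```python
def bin_vote(x, center, width):
    """Soft vote of value x for one bin: max(0, 1 - |x - center| * width)."""
    return max(0.0, 1.0 - abs(x - center) * width)


def soft_histogram(values, centers, width):
    """Average the soft votes of all values, giving one entry per bin."""
    n = len(values)
    return [sum(bin_vote(v, c, width) for v in values) / n for c in centers]


# Three bins centered at 0, 0.5 and 1; each bin spans 0.5 on either side.
centers = [0.0, 0.5, 1.0]
hist = soft_histogram([0.1, 0.5, 0.9], centers, width=2.0)
print(hist)
```

Because the vote is piecewise linear in both `center` and `width`, gradients can flow to both, which is what lets a network tune the bins during training.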

<h1 style="text-align: left; font-family: Verdana; font-size: 25px; font-weight: bold; color: #8C705F; ">
MY QUEST
    </h1>

<h1 style="text-align: left; font-family: Verdana; font-size: 16px; font-weight: none; color: #1F0318; ">
When I first started with the CASSAVA IMAGE CLASSIFICATION competition, I came across some comprehensive EDA notebooks like <a href="https://www.kaggle.com/foolofatook/starter-eda-cassava-leaf-disease/notebook">this</a> and <a href="https://www.kaggle.com/tanulsingh077/how-to-become-leaf-doctor-with-deep-learning"> this </a>, where the authors cluster images and try to understand them using their pixel statistics. I was surprised by how much we can learn about images from their pixel values and histograms alone. <br> <br>
    I then tried to find out whether there is a way to combine this statistical information with Deep Learning models and get the best of both worlds. During my exploration, I found this amazing paper! <br><br>
    The paper gives us a method to combine these image statistics with CNN feature vectors. I encourage you to read it, since it explains the approach clearly. The main idea is to build a LEARNABLE HISTOGRAM LAYER and use it in the model. <br><br>
    The paper was originally written for OBJECT DETECTION and SEMANTIC SEGMENTATION, but I have tried to use it for a classification task as well. The layer itself is the same in all of these tasks, so feel free to use it in your model. (After you upvote 😉)
    </h1>

## IMPORTING LIBRARIES AND SETTINGS

In [None]:
# installing timm
!pip install -q timm

import os
import cv2
import copy
import time
import random
import pickle

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

import warnings
warnings.filterwarnings("ignore")

import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torch.optim import lr_scheduler
import torchvision
from torchvision import models
from torch.utils.data import DataLoader, Dataset
from torch.cuda import amp
from tqdm.notebook import tqdm

from sklearn.model_selection import train_test_split, StratifiedKFold
from sklearn.metrics import accuracy_score
from sklearn.utils import class_weight

from collections import defaultdict
import albumentations as A
from albumentations.pytorch import ToTensorV2

import timm

In [None]:
# defining paths
ROOT_DIR = "../input/cassava-leaf-disease-classification"
TRAIN_DIR = "../input/cassava-leaf-disease-classification/train_images"
TEST_DIR = "../input/cassava-leaf-disease-classification/test_images"

In [None]:
# CFG -  Here we can change different options
class CFG:
    model_name = 'tf_efficientnet_b4_ns'
    img_size = 512
    scheduler = 'CosineAnnealingLR'
    T_max = 10
    T_0 = 10
    lr = 1e-5
    min_lr = 1e-7
    batch_size = 20
    weight_decay = 1e-6
    seed = 42
    num_classes = 5
    num_epochs = 3
    n_fold = 5
    smoothing = 0.2
    device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")    
    
    # THIS IS SOMETHING NEW. We will see where this is used later
    Binsize = 100


In [None]:
# SEEDING

def set_seed(seed = 42):
    '''Sets the seed of the entire notebook so results are the same every time we run.
    This is for REPRODUCIBILITY.'''
    np.random.seed(seed)
    random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed(seed)
    # When running on the CuDNN backend, two further options must be set
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False
    # Set a fixed value for the hash seed
    os.environ['PYTHONHASHSEED'] = str(seed)
    
set_seed(CFG.seed)

## DATASET 

In [None]:
df = pd.read_csv(f"{ROOT_DIR}/train.csv")

skf = StratifiedKFold(n_splits=CFG.n_fold, shuffle=True, random_state=CFG.seed)
for fold, ( _, val_) in enumerate(skf.split(X=df, y=df.label)):
    df.loc[val_ , "kfold"] = int(fold)
    
df['kfold'] = df['kfold'].astype(int)

class CassavaLeafDataset(Dataset):
    def __init__(self, root_dir, df, transforms=None):
        self.root_dir = root_dir
        self.df = df
        self.transforms = transforms
        
    def __len__(self):
        return len(self.df)
    
    def __getitem__(self, index):
        img_path = os.path.join(self.root_dir, self.df.iloc[index, 0])
        img = cv2.imread(img_path)
        img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)
        label = self.df.iloc[index, 1]
        
        if self.transforms:
            img = self.transforms(image=img)["image"]
            
        return img, label

## AUGMENTATIONS

In [None]:
data_transforms = {
    "train": A.Compose([
        A.RandomResizedCrop(CFG.img_size, CFG.img_size),
        A.Transpose(p=0.5),
        A.HorizontalFlip(p=0.5),
        A.VerticalFlip(p=0.5),
        A.ShiftScaleRotate(p=0.5),
        A.HueSaturationValue(
                hue_shift_limit=0.2, 
                sat_shift_limit=0.2, 
                val_shift_limit=0.2, 
                p=0.5
            ),
        A.RandomBrightnessContrast(
                brightness_limit=(-0.1,0.1), 
                contrast_limit=(-0.1, 0.1), 
                p=0.5
            ),
        A.Normalize(
                mean=[0.485, 0.456, 0.406], 
                std=[0.229, 0.224, 0.225], 
                max_pixel_value=255.0, 
                p=1.0
            ),
        A.CoarseDropout(p=0.5),
        A.Cutout(p=0.5),
        ToTensorV2()], p=1.),
    
    "valid": A.Compose([
        A.CenterCrop(CFG.img_size, CFG.img_size, p=1.),
        A.Resize(CFG.img_size, CFG.img_size),
        A.Normalize(
                mean=[0.485, 0.456, 0.406], 
                std=[0.229, 0.224, 0.225], 
                max_pixel_value=255.0, 
                p=1.0
            ),
        ToTensorV2()], p=1.)
}

## DEFINING LOSS

<h1 style="text-align: left; font-family: monospace; font-size: 18px; font-weight: none; color: #548f5f; "> 
    Here I use Taylor Cross Entropy and Label Smoothing Combo loss. <a href="https://www.kaggle.com/yerramvarun/cassava-taylorce-loss-label-smoothing-combo">REFERENCE </a> </h1>
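
To show what this combined loss computes, here is a minimal pure-Python sketch for a single sample. It mirrors the structure of the PyTorch classes in the next cell (second-order Taylor softmax, then a label-smoothed cross entropy), but the function names `taylor_softmax`, `smoothed_target` and `taylor_ce` are illustrative assumptions, not part of the referenced implementation.

```python
import math


def taylor_softmax(logits, n=2):
    """Normalise the n-th order Taylor expansion of exp(x) over the logits."""
    terms = [sum(x ** i / math.factorial(i) for i in range(n + 1)) for x in logits]
    total = sum(terms)
    return [t / total for t in terms]


def smoothed_target(label, num_classes=5, smoothing=0.2):
    """Put 1 - smoothing on the true class and spread the rest evenly."""
    off_value = smoothing / (num_classes - 1)
    return [1.0 - smoothing if k == label else off_value for k in range(num_classes)]


def taylor_ce(logits, label, smoothing=0.2):
    """Cross entropy between the smoothed target and the Taylor softmax."""
    probs = taylor_softmax(logits)
    target = smoothed_target(label, len(logits), smoothing)
    return -sum(t * math.log(p) for t, p in zip(target, probs))


print(taylor_softmax([2.0, 0.0, -1.0, 0.5, 0.0]))
print(smoothed_target(2))
print(taylor_ce([2.0, 0.0, -1.0, 0.5, 0.0], label=0))
```

With an even order `n`, every Taylor term is strictly positive, so the logarithm is always defined; this is why the classes below assert `n % 2 == 0`.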

In [None]:
# implementations reference - https://github.com/CoinCheung/pytorch-loss/blob/master/pytorch_loss/taylor_softmax.py
# paper - https://www.ijcai.org/Proceedings/2020/0305.pdf

class TaylorSoftmax(nn.Module):

    def __init__(self, dim=1, n=2):
        super(TaylorSoftmax, self).__init__()
        assert n % 2 == 0
        self.dim = dim
        self.n = n

    def forward(self, x):
        
        fn = torch.ones_like(x)
        denor = 1.
        for i in range(1, self.n+1):
            denor *= i
            fn = fn + x.pow(i) / denor
        out = fn / fn.sum(dim=self.dim, keepdim=True)
        return out

class LabelSmoothingLoss(nn.Module):

    def __init__(self, classes, smoothing=0.0, dim=-1): 
        super(LabelSmoothingLoss, self).__init__() 
        self.confidence = 1.0 - smoothing 
        self.smoothing = smoothing 
        self.cls = classes 
        self.dim = dim 
    def forward(self, pred, target): 
        """Taylor Softmax and log are already applied on the logits"""
        #pred = pred.log_softmax(dim=self.dim) 
        with torch.no_grad(): 
            true_dist = torch.zeros_like(pred) 
            true_dist.fill_(self.smoothing / (self.cls - 1)) 
            true_dist.scatter_(1, target.data.unsqueeze(1), self.confidence) 
        return torch.mean(torch.sum(-true_dist * pred, dim=self.dim))
    

class TaylorCrossEntropyLoss(nn.Module):

    def __init__(self, n=2, ignore_index=-1, reduction='mean', smoothing=0.2):
        super(TaylorCrossEntropyLoss, self).__init__()
        assert n % 2 == 0
        self.taylor_softmax = TaylorSoftmax(dim=1, n=n)
        self.reduction = reduction
        self.ignore_index = ignore_index
        self.lab_smooth = LabelSmoothingLoss(CFG.num_classes, smoothing=smoothing)

    def forward(self, logits, labels):

        log_probs = self.taylor_softmax(logits).log()
        #loss = F.nll_loss(log_probs, labels, reduction=self.reduction,
        #        ignore_index=self.ignore_index)
        loss = self.lab_smooth(log_probs, labels)
        return loss


## TRAINING FUNCTIONS

In [None]:
def train_model(model, criterion, optimizer, scheduler, num_epochs, dataloaders, dataset_sizes, device, fold):
    start = time.time()
    best_model_wts = copy.deepcopy(model.state_dict())
    best_acc = 0.0
    history = defaultdict(list)
    scaler = amp.GradScaler()

    for epoch in range(1,num_epochs+1):
        print('Epoch {}/{}'.format(epoch, num_epochs))
        print('-' * 10)

        # Each epoch has a training and validation phase
        for phase in ['train','valid']:
            if(phase == 'train'):
                model.train() # Set model to training mode
            else:
                model.eval() # Set model to evaluation mode
            
            running_loss = 0.0
            running_corrects = 0.0
            
            # Iterate over data
            for inputs,labels in tqdm(dataloaders[phase]):
                inputs = inputs.to(device)
                labels = labels.to(device)

                # zero the parameter gradients
                optimizer.zero_grad()

                # forward
                # track history if only in train
                with torch.set_grad_enabled(phase == 'train'):
                    with amp.autocast():
                        outputs = model(inputs)
                        _, preds = torch.max(outputs,1)
                        loss = criterion(outputs, labels)

                    # backward + optimize only if in training phase
                    if phase == 'train':
                        scaler.scale(loss).backward()
                        scaler.step(optimizer)
                        scaler.update()


                running_loss += loss.item()*inputs.size(0)
                running_corrects += torch.sum(preds == labels.data).double().item()

            
            epoch_loss = running_loss/dataset_sizes[phase]
            epoch_acc = running_corrects/dataset_sizes[phase]

            history[phase + ' loss'].append(epoch_loss)
            history[phase + ' acc'].append(epoch_acc)

            if phase == 'train' and scheduler is not None:
                scheduler.step()

            print('{} Loss: {:.4f} Acc: {:.4f}'.format(
                phase, epoch_loss, epoch_acc))
            
            # deep copy the model
            if phase=='valid' and epoch_acc >= best_acc:
                best_acc = epoch_acc
                best_model_wts = copy.deepcopy(model.state_dict())
                PATH = f"Fold{fold}_{best_acc}_epoch{epoch}.bin"
                torch.save(model.state_dict(), PATH)

        print()

    end = time.time()
    time_elapsed = end - start
    print('Training complete in {:.0f}h {:.0f}m {:.0f}s'.format(
        time_elapsed // 3600, (time_elapsed % 3600) // 60, (time_elapsed % 3600) % 60))
    print("Best Accuracy ",best_acc)

    # load best model weights
    model.load_state_dict(best_model_wts)
    return model, history, best_acc

In [None]:
def run_fold(model, criterion, optimizer, scheduler, device, fold, num_epochs=10):
    valid_df = df[df.kfold == fold]
    train_df = df[df.kfold != fold]
    
    train_data = CassavaLeafDataset(TRAIN_DIR, train_df, transforms=data_transforms["train"])
    valid_data = CassavaLeafDataset(TRAIN_DIR, valid_df, transforms=data_transforms["valid"])
    
    dataset_sizes = {
        'train' : len(train_data),
        'valid' : len(valid_data)
    }
    
    train_loader = DataLoader(dataset=train_data, batch_size=CFG.batch_size, num_workers=4, pin_memory=True, shuffle=True)
    valid_loader = DataLoader(dataset=valid_data, batch_size=CFG.batch_size, num_workers=4, pin_memory=True, shuffle=False)
    
    dataloaders = {
        'train' : train_loader,
        'valid' : valid_loader
    }

    model, history, best_acc = train_model(model, criterion, optimizer, scheduler, num_epochs, dataloaders, dataset_sizes, device, fold)
    
    return model, history, best_acc


In [None]:
def fetch_scheduler(optimizer):
    if CFG.scheduler == 'CosineAnnealingLR':
        scheduler = lr_scheduler.CosineAnnealingLR(optimizer, T_max=CFG.T_max, eta_min=CFG.min_lr)
    elif CFG.scheduler == 'CosineAnnealingWarmRestarts':
        scheduler = lr_scheduler.CosineAnnealingWarmRestarts(optimizer, T_0=CFG.T_0, T_mult=1, eta_min=CFG.min_lr)
    elif CFG.scheduler is None:
        return None
        
    return scheduler
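
To see what the `CosineAnnealingLR` option does to the learning rate with the values in `CFG`, here is a small sketch of the closed-form cosine schedule. The helper name `cosine_annealing_lr` is an assumption for illustration; the defaults are copied from `CFG.lr`, `CFG.min_lr` and `CFG.T_max`.

```python
import math


def cosine_annealing_lr(epoch, base_lr=1e-5, min_lr=1e-7, t_max=10):
    """Closed-form cosine annealing from base_lr (epoch 0) to min_lr (epoch t_max)."""
    cos_term = (1 + math.cos(math.pi * epoch / t_max)) / 2
    return min_lr + (base_lr - min_lr) * cos_term


# Learning rate seen by each of the CFG.num_epochs = 3 training epochs.
for epoch in range(3):
    print(epoch, cosine_annealing_lr(epoch))
```

With `T_max = 10` and only 3 epochs, the schedule covers just the flat start of the cosine curve, so the learning rate stays above 9e-6 throughout. Lowering `T_max` is one option if a stronger decay is wanted within such a short run.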

## A LOOK AT ORIGINAL MODEL

<h1 style="text-align: left; font-family: monospace; font-size: 18px; font-weight: none; color: #548f5f; "> 
    Let's take a look at the journey of our image as it passes through our EFFICIENTNET-B4-NS. <br> <br>
    We will use the library <a href="https://github.com/sksq96/pytorch-summary"> TORCHSUMMARY </a>
    </h1>

In [None]:
# EXPAND THE OUTPUT TO SEE THE FULL SUMMARY

# Getting the model - standard code
model = timm.create_model(CFG.model_name, pretrained=True)
num_features = model.classifier.in_features
model.classifier = nn.Linear(num_features, CFG.num_classes)

# installing torch summary
!pip install -q torchsummary

# import it
from torchsummary import summary

# use it!
summary(model, (3, 512,512), device='cpu')

<h1 style="text-align: center; font-family: arial; font-size: 18px; font-weight: none; color: #548f5f; "> 
    We will apply the HistNet layer to our classification task by appending a feature matrix between layers 418 and 419 (refer to the summary above).
</h1>

In [None]:
from IPython.display import Image
Image("../input/images/arch.png")

## HISTOGRAM LAYER

<h1 style="text-align: left; font-family: arial; font-size: 18px; font-weight: none; color: #548f5f; ">
    The Histogram layer is implemented as a separate Module which will be used in our main Model
    </h1>

In [None]:
class HistModule(nn.Module):
    
    def __init__(self, in_shape=(1792, 16,16), B=6):
        
        super(HistModule, self).__init__()
        
        # reduce the 1792 backbone channels to 5 maps, one per class: (5, 16, 16)
        self.preconv = nn.Conv2d(1792, 5, kernel_size=(1,1))
        
        # CONV1: weights are fixed, only the biases are learned
        
        # Filter for class 0
        self.conv1_0 = nn.Conv2d(5, B, kernel_size=(1,1))
        # initializing weight appropriately + freezing
        # (requires_grad must be set on the parameter itself, not on .data)
        self.conv1_0.weight.data.fill_(0)
        self.conv1_0.weight.data[:, 0, :, :] = 1
        self.conv1_0.weight.requires_grad = False
        
        # Filter for class 1
        self.conv1_1 = nn.Conv2d(5, B, kernel_size=(1,1))
        # initializing weight appropriately + freezing
        self.conv1_1.weight.data.fill_(0)
        self.conv1_1.weight.data[:, 1, :, :] = 1
        self.conv1_1.weight.requires_grad = False
        
        # Filter for class 2
        self.conv1_2 = nn.Conv2d(5, B, kernel_size=(1,1))
        # initializing weight appropriately + freezing
        self.conv1_2.weight.data.fill_(0)
        self.conv1_2.weight.data[:, 2, :, :] = 1
        self.conv1_2.weight.requires_grad = False
        
        # Filter for class 3
        self.conv1_3 = nn.Conv2d(5, B, kernel_size=(1,1))
        # initializing weight appropriately + freezing
        self.conv1_3.weight.data.fill_(0)
        self.conv1_3.weight.data[:, 3, :, :] = 1
        self.conv1_3.weight.requires_grad = False
        
        # Filter for class 4
        self.conv1_4 = nn.Conv2d(5, B, kernel_size=(1,1))
        # initializing weight appropriately + freezing
        self.conv1_4.weight.data.fill_(0)
        self.conv1_4.weight.data[:, 4, :, :] = 1
        self.conv1_4.weight.requires_grad = False
        
        
        # CONV 2: weights are learned, the bias is fixed to 1
        
        self.conv2 = nn.Conv2d(5*B, 5*B, kernel_size=(1,1))
        self.conv2.bias.data.fill_(1)
        self.conv2.bias.requires_grad = False
        
    
    def forward(self, x):
        # get to class size
        inp = self.preconv(x)
        
        # classwise convs
        cls0 = self.conv1_0(inp)
        cls1 = self.conv1_1(inp)
        cls2 = self.conv1_2(inp)
        cls3 = self.conv1_3(inp)
        cls4 = self.conv1_4(inp)
        
        # concatenate
        concat = torch.cat([cls0, cls1, cls2, cls3, cls4], 1)
        concat = torch.abs(concat)
        
        # conv2
        out = self.conv2(concat)
        out = F.relu(out)
        
        # final output shape = (1792 + 5*B, 16, 16)
        finout = torch.cat([x, out], 1)
        
        return finout

# check the output shape
hs = HistModule()
hs(torch.rand(1,1792, 16, 16)).shape
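
For comparison, here is a compact sketch of the same conv1 → abs → conv2 → ReLU pattern written with grouped 1x1 convolutions, so that each input channel gets its own `B` bins and every bin has exactly one center and one width. This is an illustrative variant, not the `HistModule` above: the class name `GroupedHistLayer`, the value range `lo`/`hi`, and the evenly spaced initial centers are all assumptions. Note that `HistModule` uses a full `5*B x 5*B` conv2, which also mixes bins across classes.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class GroupedHistLayer(nn.Module):
    """Hypothetical per-channel soft histogram: relu(1 - width * |x - center|)."""

    def __init__(self, channels=5, bins=6, lo=0.0, hi=1.0):
        super().__init__()
        # Conv1: weight fixed to 1, learnable bias = -center (one per bin).
        self.conv1 = nn.Conv2d(channels, channels * bins, kernel_size=1, groups=channels)
        # Conv2: learnable weight = -width (one per bin), bias fixed to 1.
        self.conv2 = nn.Conv2d(channels * bins, channels * bins, kernel_size=1,
                               groups=channels * bins)
        with torch.no_grad():
            self.conv1.weight.fill_(1.0)
            centers = torch.linspace(lo, hi, bins).repeat(channels)
            self.conv1.bias.copy_(-centers)
            self.conv2.weight.fill_(-(bins - 1) / (hi - lo))
            self.conv2.bias.fill_(1.0)
        # Freeze the fixed parts on the parameters themselves.
        self.conv1.weight.requires_grad = False
        self.conv2.bias.requires_grad = False

    def forward(self, x):
        # x: (N, channels, H, W) -> votes: (N, channels * bins, H, W)
        return F.relu(self.conv2(torch.abs(self.conv1(x))))


layer = GroupedHistLayer(channels=5, bins=6)
votes = layer(torch.rand(2, 5, 16, 16))
print(votes.shape)
```

With this initialisation, neighbouring bins overlap so that the votes of each channel sum to 1 for inputs inside `[lo, hi]`; training is then free to move the centers and widths away from that starting point.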

## MAKING THE FINAL MODEL

<h1 style="text-align: left; font-family: arial; font-size: 18px; font-weight: none; color: #548f5f; ">
    I have loaded up my original weights for the EFFNET B4 Model and fine tuned them. <br>
    All the weights are preloaded except the Histogram layers which will be learned.<br>
    The learning rate is kept low as we are only fine tuning the model.<br> 
</h1>

In [None]:
class MyHistEffnet(nn.Module):
    
    def __init__(self, Binsize = 6):
        # initialize parent
        super(MyHistEffnet, self).__init__()
        
        # initialize bin size
        self.bins = Binsize
        
        # make model
        self.model = timm.create_model(CFG.model_name, pretrained=True)
        num_features = self.model.classifier.in_features
        self.model.classifier = nn.Linear(num_features, CFG.num_classes)
        
        """
        We are not passing the image through the complete model, hence we don't need the output of the last two layers.
        Let's get rid of them!
        """
        self.modified_model = nn.Sequential(*list(self.model.children())[:-2])
        
        # defining histogram layer
        self.histmod = HistModule(B=Binsize)
        
        # THE LAST TWO LAYERS
        self.global_pool = nn.AdaptiveAvgPool2d(output_size=1)
        self.fc = nn.Linear(1792+(5*self.bins), 5)
        
        
    def forward(self, x):
        
        # get the feature vector
        ftrvec = self.modified_model(x) # 1792, 16, 16
        
        # pass through the histmodule
        x = self.histmod(ftrvec) # 1792+(5*bins), 16, 16
        
        # global average pool
        x = self.global_pool(x) # 1792+(5*bins), 1, 1
        x = x.view(-1, 1792+(5*self.bins))
        x = self.fc(x)
        
        return x

# check the output shape
mod = MyHistEffnet(Binsize=CFG.Binsize)
mod(torch.rand((1,3,512,512))).shape
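
The tensor shapes inside `MyHistEffnet` follow directly from the image size and the bin count, so they can be checked without running the model. The helper below is a hypothetical sketch (the name `hist_effnet_shapes` is not part of the notebook) and assumes the backbone downsamples by a factor of 32, which matches the 512 → 16 feature maps noted in the code comments.

```python
def hist_effnet_shapes(img_size=512, backbone_channels=1792, num_classes=5,
                       bins=100, stride=32):
    """Return the (C, H, W) shapes before and after the histogram module."""
    side = img_size // stride                # spatial size of the feature map
    hist_channels = num_classes * bins       # one set of bins per class
    total = backbone_channels + hist_channels
    return {
        "feature_map": (backbone_channels, side, side),
        "after_hist": (total, side, side),
        "fc_in": total,                      # input size of the final Linear layer
    }


print(hist_effnet_shapes(bins=100))
print(hist_effnet_shapes(bins=6))
```

With `CFG.Binsize = 100` the final linear layer sees 2292 features, compared with 1822 for the default of 6 bins, so the bin count directly controls how much weight the histogram features get in the classifier input.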

<h1 style="text-align: left; font-family: arial; font-size: 18px; font-weight: none; color: #548f5f; ">
I am loading my fold0 weights, which originally gave 0.902 validation accuracy. Let's see if we can improve on that. <br>
</h1>

In [None]:
# The binsize defined in the config is used here
HistModel = MyHistEffnet(Binsize=CFG.Binsize)
HistModel.to(CFG.device)

# remember to load the weights in the "model" only
HistModel.model.load_state_dict(torch.load('../input/images/Fold0_weights.bin'))

In [None]:
# defining optimizer, criterion, scheduler
optimizer = optim.Adam(HistModel.parameters(), lr=CFG.lr, weight_decay=CFG.weight_decay, amsgrad=False)
criterion = TaylorCrossEntropyLoss(n=2, smoothing=0.2)
scheduler = fetch_scheduler(optimizer) 

## START THE TRAINING!

In [None]:
model, history, ba = run_fold(HistModel, criterion, optimizer, 
                              scheduler, device=CFG.device, fold=0, num_epochs=CFG.num_epochs)

<h1 style="text-align: left; font-family: arial; font-size: 18px; font-weight: none; color: #940a36; ">
    The final accuracy is 0.907, which is only slightly better than the initial 0.902, but this is just the baseline.
    <br> 
    The Binsize can still be tuned, and we can try a different fine-tuning strategy or a different model as the backbone.
    <br>
    The possibilities are endless!
</h1>

<p style="font-size:20px"> Thank you for making it this far. If the implementation and the code helped you, consider UPVOTING! </p>