## Label Smoothing Cross-Entropy Loss from Scratch with PyTorch

# [Link to my YouTube video explaining this whole notebook](https://www.youtube.com/watch?v=PIpJn8TZJO8&list=PLxqBkZuBynVQqJTE9nRM2p7Tb12TDPlnq&index=2)

[![Imgur](https://imgur.com/BMYxc3W.png)](https://www.youtube.com/watch?v=PIpJn8TZJO8&list=PLxqBkZuBynVQqJTE9nRM2p7Tb12TDPlnq&index=2)



In [1]:
import torch
import numpy as np
import torch.nn as nn
import torch.nn.functional as F
from torch.autograd import Variable
from torch.nn.modules.loss import _WeightedLoss

## Formulae for Label Smoothing Cross Entropy loss


![Imgur](https://imgur.com/OADA4gm.png)

![Imgur](https://imgur.com/PuwOVQk.png)


Label smoothing is designed to make the model a little less certain of its decisions by slightly changing its targets:


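#### So, instead of wanting to predict 1 for the correct class and 0 for all the others, we ask it to predict 1-ε+ε/N for the correct class and ε/N for each of the others, where ε is a (small) positive number and N is the number of classes. Equivalently, the loss is (1-ε) times the usual cross-entropy on the correct class plus ε times the average of -log p over all N classes. This can be written as: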

In [2]:
class LabelSmoothingLoss(torch.nn.Module):
    def __init__(self, epsilon: float = 0.1, 
                 reduction="mean", weight=None):
        super(LabelSmoothingLoss, self).__init__()
        self.epsilon   = epsilon
        self.reduction = reduction
        self.weight    = weight

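    def reduce_loss(self, loss):
        # Apply the configured reduction; any value other than
        # 'mean' or 'sum' returns the per-sample loss unchanged.
        if self.reduction == 'mean':
            return loss.mean()
        if self.reduction == 'sum':
            return loss.sum()
        return loss

    def linear_combination(self, i, j):
        # Blend the true-class term i with the uniform term j.
        return (1 - self.epsilon) * i + self.epsilon * j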

    def forward(self, predict_tensor, target):
        assert 0 <= self.epsilon < 1

        if self.weight is not None:
            self.weight = self.weight.to(predict_tensor.device)

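        num_classes = predict_tensor.size(-1)

        # Log-probabilities over the class dimension.
        log_preds = F.log_softmax(predict_tensor, dim=-1)

        # Uniform term: -log p summed over all classes, then reduced over the batch.
        loss = self.reduce_loss(-log_preds.sum(dim=-1))

        # True-class term: the standard negative log-likelihood.
        negative_log_likelihood_loss = F.nll_loss(
            log_preds, target, reduction=self.reduction, weight=self.weight
        )
        return self.linear_combination(negative_log_likelihood_loss, loss / num_classes)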

In [6]:
loss_criterion = LabelSmoothingLoss(epsilon=0.5)

# Raw (unnormalized) scores for 3 samples over 5 classes.
predict_tensor = torch.tensor([[0, 0.2, 0.7, 0.1, 0],
                               [0, 0.9, 0.2, 0.2, 1],
                               [1, 0.2, 0.7, 0.9, 1]])

# Class indices. Plain tensors are enough: Variable has been deprecated since PyTorch 0.4.
target = torch.tensor([2, 1, 0])

loss_label_smoothed = loss_criterion(predict_tensor, target)

print(loss_label_smoothed)

tensor(1.4670)
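
As a sanity check, the same number can be reproduced by building the smoothed target distribution explicitly and taking the cross-entropy against it. This is only a minimal sketch: the names `smoothed`, `manual_loss` and `builtin_loss` are illustrative, and the `label_smoothing` argument of `F.cross_entropy` assumes PyTorch 1.10 or newer.

```python
import torch
import torch.nn.functional as F

# Same logits and targets as in the cell above.
logits = torch.tensor([[0.0, 0.2, 0.7, 0.1, 0.0],
                       [0.0, 0.9, 0.2, 0.2, 1.0],
                       [1.0, 0.2, 0.7, 0.9, 1.0]])
target = torch.tensor([2, 1, 0])
epsilon = 0.5
num_classes = logits.size(-1)

# Smoothed targets: (1 - epsilon) on the true class plus epsilon / N on every class.
one_hot = F.one_hot(target, num_classes).float()
smoothed = (1 - epsilon) * one_hot + epsilon / num_classes

# Cross-entropy against the smoothed distribution, averaged over the batch.
log_probs = F.log_softmax(logits, dim=-1)
manual_loss = -(smoothed * log_probs).sum(dim=-1).mean()

# Built-in equivalent (assumption: PyTorch >= 1.10, where label_smoothing was added).
builtin_loss = F.cross_entropy(logits, target, label_smoothing=epsilon)

print(manual_loss, builtin_loss)
```

Both values should agree with the `tensor(1.4670)` printed above, which suggests that the linear combination in `LabelSmoothingLoss` is the cross-entropy against targets of 1-ε+ε/N for the correct class and ε/N for the rest.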


#### The negative log likelihood loss - `torch.nn.functional.nll_loss`


The cross-entropy loss and the (negative) log-likelihood are
the same in the following sense:

If you apply PyTorch’s CrossEntropyLoss to your output layer,
you get the same result as applying PyTorch’s NLLLoss to a
LogSoftmax layer added after your original output layer.

(I suspect – but don’t know for a fact – that using
CrossEntropyLoss will be more efficient because it
can collapse some calculations together, and doesn’t
introduce an additional layer.)
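
The equivalence described above can be checked numerically. The sketch below uses made-up random logits and class indices (they are not from this notebook) and compares `nn.CrossEntropyLoss` on raw scores with `nn.NLLLoss` applied after `nn.LogSoftmax`.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
logits = torch.randn(4, 5)           # hypothetical batch: 4 samples, 5 classes
target = torch.tensor([1, 0, 4, 2])  # hypothetical class indices

# Cross-entropy applied directly to the raw scores.
ce = nn.CrossEntropyLoss()(logits, target)

# Negative log-likelihood applied to explicit log-probabilities.
nll = nn.NLLLoss()(nn.LogSoftmax(dim=-1)(logits), target)

print(torch.allclose(ce, nll))
```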

You are trying to maximize the “likelihood” of your model
parameters (weights) having the right values. Maximizing
the likelihood is the same as maximizing the log-likelihood,
which is the same as minimizing the negative-log-likelihood.
For the classification problem, the cross-entropy is the
negative-log-likelihood. 

Any loss consisting of a negative log-likelihood is a cross-entropy between the empirical distribution defined by the training set and the probability distribution defined by the model.
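
One way to make this statement concrete is the following sketch. The notation is an assumption of this note, not taken from the notebook: a training set of $M$ pairs $(x_i, y_i)$, an empirical distribution $\hat{p}_{\text{data}}$ that puts mass $1/M$ on each pair, and model probabilities $p_{\theta}(y \mid x)$.

$$
H(\hat{p}_{\text{data}}, p_{\theta})
= -\,\mathbb{E}_{(x,y)\sim\hat{p}_{\text{data}}}\left[\log p_{\theta}(y \mid x)\right]
= -\frac{1}{M}\sum_{i=1}^{M}\log p_{\theta}(y_i \mid x_i)
$$

The right-hand side is the average negative log-likelihood of the training labels under the model.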


![Imgur](https://imgur.com/pf8iEb8.png)

From the above we can see that minimizing the cross-entropy is the same as minimizing the negative log-likelihood of the model. Many deep learning libraries implement their cross-entropy loss this way: in PyTorch, for example, it computes a log-softmax followed by the negative log-likelihood loss under the hood.
