## Label smoothing

Label smoothing is a regularization technique that makes a classifier more robust and helps it generalize by discouraging overconfident predictions.

It is particularly useful when some training labels are wrong: the model can train around mislabeled data, which improves its robustness and performance. In practice, setting a `label_smoothing` parameter normally helps generalisation (test accuracy and AUC) in the presence of label noise.


### Problem of Overconfidence and its solution with Model Calibration

An overconfident model is not calibrated: its predicted probabilities are consistently higher than its accuracy. For example, it may predict 0.9 for inputs where the accuracy is only 0.6. Note that models with small test errors can still be overconfident.

Hence model calibration is important for model interpretability and reliability.

A classification model is calibrated if its predicted probabilities of outcomes reflect their accuracy. For example, consider 100 examples within our dataset, each with predicted probability 0.9 by our model. If our model is calibrated, then 90 of them should be classified correctly. Similarly, among another 100 examples with predicted probability 0.6, we would expect only 60 to be classified correctly.
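One way to check calibration numerically is to bin predictions by confidence and compare each bin's mean confidence with its accuracy. The sketch below does this and summarises the gaps as an expected calibration error. The function name, the equal-width binning and the toy data are illustrative assumptions, not something prescribed by the text above.

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """Weighted average of |mean confidence - accuracy| over confidence bins."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    # Equal-width bins over [0, 1]; a confidence of exactly 1.0 goes in the last bin.
    bin_ids = np.minimum((confidences * n_bins).astype(int), n_bins - 1)
    ece = 0.0
    for b in range(n_bins):
        mask = bin_ids == b
        if not mask.any():
            continue
        gap = abs(confidences[mask].mean() - correct[mask].mean())
        ece += mask.mean() * gap  # weight by the fraction of examples in the bin
    return ece

# Overconfident case: 100 predictions at 0.9, only 60 of them correct.
overconfident = expected_calibration_error(
    np.full(100, 0.9), np.r_[np.ones(60), np.zeros(40)])
# Calibrated case: 100 predictions at 0.9, 90 of them correct.
calibrated = expected_calibration_error(
    np.full(100, 0.9), np.r_[np.ones(90), np.zeros(10)])
print(overconfident, calibrated)
```

The overconfident case yields a gap of about 0.3 (0.9 predicted against 0.6 observed), while the calibrated case is essentially zero.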


------------------------------------------------------------------

### Formula of Label Smoothing

Label smoothing replaces the one-hot encoded label vector y_hot with a mixture of y_hot and the uniform distribution:


`y_ls = (1 - α) * y_hot + α / K`

where K is the number of label classes, and α is a hyperparameter that determines the amount of smoothing. If α = 0, we obtain the original one-hot encoded y_hot. If α = 1, we get the uniform distribution.


Example application of the formula above:

Say you were training a model for binary classification. Your labels would be 0 for cat and 1 for not cat.

Now, say you set `label_smoothing = 0.2`.

Using the equation above, a one-hot label of [0 1] becomes:

```
new_onehot_labels = [0 1] * (1 - 0.2) + 0.2 / 2 = [0 1] * 0.8 + 0.1

new_onehot_labels = [0.1 0.9]

```

These are soft labels instead of hard labels (0 and 1). Soft labels give a lower loss when a prediction disagrees with the given label, so a mislabeled example penalizes the model slightly less and teaches it the wrong thing to a slightly lesser degree.

In other words, instead of using the hard one-hot labels, where the true class has value 1, we scale them by (1-α), where α is the smoothing parameter. We then add the uniform term α/K to every entry, where K is the total number of classes.
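The smoothing step can be written in a couple of lines. This is a minimal sketch assuming the labels are already one-hot encoded float tensors; `smooth_labels` is a hypothetical helper name, not a library function.

```python
import torch

def smooth_labels(y_hot: torch.Tensor, alpha: float) -> torch.Tensor:
    """Mix one-hot labels with the uniform distribution over K classes."""
    num_classes = y_hot.size(-1)  # K, taken from the last dimension
    return (1.0 - alpha) * y_hot + alpha / num_classes

# Two binary one-hot labels, smoothed with alpha = 0.2.
y_hot = torch.tensor([[0.0, 1.0],
                      [1.0, 0.0]])
y_ls = smooth_labels(y_hot, alpha=0.2)
print(y_ls)
print(y_ls.sum(dim=-1))  # each smoothed row still sums to 1
```

For the label [0, 1] and α = 0.2 this gives [0.1, 0.9]: the true class keeps most of the probability mass and the remainder is shared uniformly.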


------------------------------------------------------------------------------

The problem without label smoothing in a multi-class classification setting is the following.

For the cross-entropy loss to really be at a minimum, each logit corresponding to the correct class needs to be significantly higher than the rest. For example, for an image img-1.jpg whose correct class is is_dog with a logit of 4.7, that 4.7 needs to be significantly higher than the other logits for the same image. The same holds for every other example.

A mathematical argument for this was presented in a [blog post by Lei Mao](https://leimao.github.io/blog/Cross-Entropy-KL-Divergence-MLE/), which explains why minimizing the cross-entropy loss is equivalent to maximum likelihood estimation.


This can cause two problems. First, it may result in overfitting: if the model learns to assign full probability to the ground-truth label for each training example, it is not guaranteed to generalize. Second, it encourages the differences between the largest logit and all others to become large.

In other words, our model could become overconfident in its predictions, because to really minimise the loss it needs to be very sure of everything it predicts. This is bad because it is then harder for the model to generalise and easier for it to overfit to the training data.

Intuitively, label smoothing restrains the logit value for the correct class to be closer to the logit values for the other classes.
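The effect on the logits can be seen by evaluating the loss for logits of the form [gap, 0, ..., 0] as the gap grows. The sketch below is illustrative only: `loss_at_gap` is a hypothetical helper, and K = 3 with α = 0.1 are assumed values.

```python
import torch
import torch.nn.functional as F

def loss_at_gap(gap: float, alpha: float, num_classes: int = 3) -> float:
    """Cross-entropy for logits [gap, 0, ..., 0] when class 0 is correct."""
    logits = torch.zeros(1, num_classes)
    logits[0, 0] = gap
    target = torch.zeros(1, num_classes)
    target[0, 0] = 1.0
    # alpha = 0 leaves the one-hot target unchanged.
    target = (1.0 - alpha) * target + alpha / num_classes
    log_probs = F.log_softmax(logits, dim=-1)
    return float(-(target * log_probs).sum())

for gap in (1.0, 3.0, 5.0, 10.0, 20.0):
    hard = loss_at_gap(gap, alpha=0.0)
    soft = loss_at_gap(gap, alpha=0.1)
    print(f"gap={gap:5.1f}  hard-label loss={hard:.4f}  smoothed loss={soft:.4f}")
```

With hard labels the loss keeps shrinking as the gap grows, so the optimizer is always rewarded for a larger gap. With smoothed labels the loss reaches a minimum at a moderate gap and rises again beyond it.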

---------------------------------------------------------

## Formulae for Label Smoothing Cross Entropy loss


Label smoothing is designed to make the model a little less certain of its decisions by changing its target slightly.

So, instead of wanting to predict 1 for the correct class and 0 for all the others, we ask it to predict 1-ε+ε/N for the correct class and ε/N for each of the others, with ε a (small) positive number and N the number of classes. For large N this is roughly 1-ε for the correct class. This can be written as:


![Imgur](https://imgur.com/mwo8Tfl.png)

i.e.

![Imgur](https://imgur.com/3AQ3Ns5.png)



-------------------------------------------------------------------------------------------------------

### Fastai/PyTorch Implementation of Label Smoothing Cross Entropy loss

https://towardsdatascience.com/label-smoothing-as-another-regularization-trick-7b34c50dc0b9

Label smoothing changes the target vector by a small amount ε. Thus, instead of asking our model to predict 1 for the right class, we ask it to predict roughly 1-ε for the correct class and to spread the remaining ε uniformly over the classes. The cross-entropy loss function with label smoothing then takes the form below.

![](assets/2022-02-19-01-46-14.png)

In this formula, ce(x) denotes the standard cross-entropy loss of x (i.e. -log(p(x))), ε is a small positive number, i is the correct class and N is the number of classes.
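As a sketch of where such a formula comes from, assume the smoothed target mixes the one-hot vector with the uniform distribution, so that $y_j = (1-\varepsilon)\,\mathbb{1}[j = i] + \varepsilon/N$. This is an assumption consistent with the mixture formula given earlier, with ε in the role of α and N in the role of K. Writing $p_j$ for the predicted probability of class $j$, the cross-entropy against this target splits into two terms:

$$
\begin{aligned}
\text{loss} &= -\sum_{j=1}^{N} y_j \log p_j \\
            &= -(1-\varepsilon)\,\log p_i \;-\; \frac{\varepsilon}{N}\sum_{j=1}^{N} \log p_j \\
            &= (1-\varepsilon)\,ce(i) \;+\; \varepsilon \sum_{j=1}^{N} \frac{ce(j)}{N}
\end{aligned}
$$

The first term is the usual loss on the correct class, scaled by 1-ε. The second is the average of -log p over all classes, scaled by ε. These are the two terms combined on the return line of the implementation that follows.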


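```py
import torch.nn.functional as F
from torch import nn

# Adapted from fastai's LabelSmoothingCrossEntropy. fastai's own `Module` base class
# calls super().__init__() automatically; with plain nn.Module we call it explicitly.
class LabelSmoothingCrossEntropy(nn.Module):
    y_int = True  # class attribute kept from the fastai version
    def __init__(self, eps: float = 0.1, reduction='mean'):
        super().__init__()
        self.eps, self.reduction = eps, reduction

    def forward(self, output, target):
        c = output.size()[-1]  # number of classes
        log_preds = F.log_softmax(output, dim=-1)
        if self.reduction == 'sum':
            loss = -log_preds.sum()
        else:
            # Sum over classes here; dividing by c happens on the return line.
            loss = -log_preds.sum(dim=-1)
            if self.reduction == 'mean':
                loss = loss.mean()
        # eps * (mean over classes of -log p) + (1 - eps) * NLL of the true class
        return loss * self.eps / c + (1 - self.eps) * F.nll_loss(log_preds, target.long(), reduction=self.reduction)

```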


-------------

### Example of the Problem (without Label Smoothing)

Suppose we have K = 3 classes and the label belongs to the 1st class. Let [a, b, c] be the logit vector.

If we do not use label smoothing, the label vector is the one-hot encoded vector [1, 0, 0], so training pushes the model towards a ≫ b and a ≫ c.

For example, applying softmax to the logit vector [10, 0, 0] gives [0.9999, 0, 0], rounded to 4 decimal places.

If we use label smoothing with α = 0.1, the smoothed label vector is [0.9333, 0.0333, 0.0333]. The logit vector [3.3322, 0, 0] reproduces it exactly after softmax, so the model no longer needs a large gap between a and the other logits.

This is why we call label smoothing a regularization technique: it restrains the largest logit from becoming much bigger than the rest.
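The numbers in this example can be reproduced with a few lines of PyTorch. This is a small check rather than part of any training code; fixing b = c = 0 is an assumption made for convenience, since softmax depends only on differences between logits.

```python
import math
import torch

# Hard label: a large logit gap pushes the softmax output towards [1, 0, 0].
print(torch.softmax(torch.tensor([10.0, 0.0, 0.0]), dim=-1))

# Smoothed label for K = 3 classes and alpha = 0.1.
alpha, K = 0.1, 3
y_ls = (1 - alpha) * torch.tensor([1.0, 0.0, 0.0]) + alpha / K

# With b = c = 0, softmax([a, 0, 0]) == y_ls  <=>  a = log(y_ls[0] / y_ls[1]).
a = math.log(y_ls[0].item() / y_ls[1].item())
probs = torch.softmax(torch.tensor([a, 0.0, 0.0]), dim=-1)
print(round(a, 4), probs)
```

The required logit drops from "as large as possible" to log(28) ≈ 3.3322, which is the smaller gap described above.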

----------------------

### Q: When do we use label smoothing?
A: Whenever a classification neural network suffers from overfitting and/or overconfidence, we can try label smoothing.

### Q: How do we choose α?
A: Just like other regularization hyperparameters, there is no formula for choosing α. It is usually done by trial and error, and α = 0.1 is a good place to start.

### Q: Can we use distributions other than uniform distribution in label smoothing?
A: Technically yes. In [4] the theoretical groundwork is developed for arbitrary distributions. That being said, the vast majority of empirical studies on label smoothing use the uniform distribution.

### Q: Is label smoothing used outside deep learning?

A: Not really. Most popular non-deep learning methods do not use the softmax function. Thus label smoothing is usually not applicable.

## Label-Smoothing-CrossEntropyLoss from Scratch

```py
import torch
import numpy as np
import torch.nn as nn
import torch.nn.functional as F
from torch.autograd import Variable
from torch.nn.modules.loss import _WeightedLoss
```

## Formulae for Label Smoothing Cross Entropy loss


![Imgur](https://imgur.com/OADA4gm.png)

![Imgur](https://imgur.com/PuwOVQk.png)



```py
class LabelSmoothingLoss(torch.nn.Module):
    def __init__(self, epsilon: float = 0.1,
                 reduction="mean", weight=None):
        super().__init__()
        self.epsilon = epsilon
        self.reduction = reduction
        self.weight = weight  # optional per-class weights, passed on to nll_loss

    def reduce_loss(self, loss):
        # Apply the requested reduction; any other value returns per-example losses.
        if self.reduction == "mean":
            return loss.mean()
        if self.reduction == "sum":
            return loss.sum()
        return loss

    def linear_combination(self, i, j):
        return (1 - self.epsilon) * i + self.epsilon * j

    def forward(self, predict_tensor, target):
        assert 0 <= self.epsilon < 1

        if self.weight is not None:
            self.weight = self.weight.to(predict_tensor.device)

        num_classes = predict_tensor.size(-1)
        log_preds = F.log_softmax(predict_tensor, dim=-1)

        # Uniform part: -sum over classes of log p, divided by num_classes on return.
        loss = self.reduce_loss(-log_preds.sum(dim=-1))

        # One-hot part: standard negative log-likelihood of the true class.
        negative_log_likelihood_loss = F.nll_loss(
            log_preds, target, reduction=self.reduction, weight=self.weight
        )
        return self.linear_combination(negative_log_likelihood_loss, loss / num_classes)
```

#### The negative log likelihood loss - `torch.nn.functional.nll_loss`


The cross-entropy loss and the (negative) log-likelihood are the same in the following sense: if you apply PyTorch's `CrossEntropyLoss` to your output layer, you get the same result as applying PyTorch's `NLLLoss` to a `LogSoftmax` layer added after your original output layer.

(I suspect, but don't know for a fact, that using `CrossEntropyLoss` will be more efficient because it can collapse some calculations together and doesn't introduce an additional layer.)

You are trying to maximize the "likelihood" of your model parameters (weights) having the right values. Maximizing the likelihood is the same as maximizing the log-likelihood, which is the same as minimizing the negative log-likelihood. For the classification problem, the cross-entropy is the negative log-likelihood.

Any loss consisting of a negative log-likelihood is a cross-entropy between the empirical distribution defined by the training set and the probability distribution defined by the model.


![Imgur](https://imgur.com/pf8iEb8.png)

From the above we can see that minimizing the cross-entropy is the same as minimizing the negative log-likelihood of the model. This is also how many deep learning libraries implement it: in PyTorch, for example, the cross-entropy loss combines log-softmax and the negative log-likelihood loss under the hood.
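The equivalence is easy to confirm numerically. The sketch below uses random logits with arbitrary sizes (4 examples, 5 classes) and compares three ways of computing the same quantity.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
logits = torch.randn(4, 5)            # 4 examples, 5 classes (arbitrary sizes)
target = torch.tensor([2, 1, 0, 4])   # integer class indices

# 1. Cross-entropy applied directly to the logits.
ce = F.cross_entropy(logits, target)
# 2. Negative log-likelihood applied to log-softmax outputs.
nll = F.nll_loss(F.log_softmax(logits, dim=-1), target)
# 3. The same thing by hand: mean of -log p(correct class).
manual = -F.log_softmax(logits, dim=-1)[torch.arange(4), target].mean()

print(ce.item(), nll.item(), manual.item())
```

All three values agree up to floating-point error.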


```py
loss_criterion = LabelSmoothingLoss(epsilon=0.5)

# Raw scores (logits) for 3 examples and 5 classes
predict_tensor = torch.tensor([[0.0, 0.2, 0.7, 0.1, 0.0],
                               [0.0, 0.9, 0.2, 0.2, 1.0],
                               [1.0, 0.2, 0.7, 0.9, 1.0]])

# Integer index of the correct class for each example
target = torch.tensor([2, 1, 0])

loss_label_smoothed = loss_criterion(predict_tensor, target)

print(loss_label_smoothed)
```

```
tensor(1.4670)
```
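As a cross-check, recent PyTorch versions expose label smoothing directly through the `label_smoothing` argument of `F.cross_entropy` (available from PyTorch 1.10, so an older install is assumed not to have it). The sketch below recomputes the loss for the same logits and targets with the built-in and with the two-term formula written out. It does not depend on the class defined above.

```python
import torch
import torch.nn.functional as F

logits = torch.tensor([[0.0, 0.2, 0.7, 0.1, 0.0],
                       [0.0, 0.9, 0.2, 0.2, 1.0],
                       [1.0, 0.2, 0.7, 0.9, 1.0]])
target = torch.tensor([2, 1, 0])
eps = 0.5

# Built-in label smoothing with the default 'mean' reduction.
builtin = F.cross_entropy(logits, target, label_smoothing=eps)

# Same quantity written out: (1 - eps) * NLL + eps * (mean over classes of -log p).
log_preds = F.log_softmax(logits, dim=-1)
manual = (1 - eps) * F.nll_loss(log_preds, target) \
         + eps * (-log_preds.mean(dim=-1)).mean()

print(builtin.item(), manual.item())
```

Both values come out at about 1.4670, matching the result of the from-scratch loss.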


```py
predict_tensor.size(-1)  # size of the last dimension, i.e. the number of classes
```

```
5
```