
This notebook breaks down how the `cross_entropy` function (the functional counterpart of `CrossEntropyLoss`, used for classification) is implemented in PyTorch, and how it relates to softmax, `log_softmax`, and `nll` (negative log-likelihood).

In [0]:
import torch
import torch.nn as nn
import torch.nn.functional as F

In [0]:
batch_size, n_classes = 5, 3
x = torch.randn(batch_size, n_classes)
x.shape

torch.Size([5, 3])

In [0]:
x

tensor([[ 1.3877, -1.2656, -0.9112],
        [ 0.6011,  0.0092, -0.0982],
        [-0.0674, -0.2059,  0.3410],
        [ 1.3918,  0.2355,  0.4159],
        [-0.6652, -1.6354, -0.9315]])

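`torch.randint(high, size=...)`, used below to generate the targets, draws integers uniformly from $[0, \text{high})$, so the upper bound itself is never drawn. `size` is the shape of the output tensor.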

```Softmax Regression = Multi-class Logistic Regression```

**Multi-class classification** generalizes **logistic regression**, which handles binary classification. Note that in both cases we assume the classes are mutually exclusive.

The **logistic regression** loss is computed from the output of the neural network, $\hat{y}$. To that end, the linear output of the last layer, $z^L \in \mathbb{R}$, is passed through the sigmoid function $\sigma({z^L})=\frac{e^{z^L}}{1+ e^{z^L}}=\hat{y}$. With $y \in \{0, 1\}$ as the ground truth, the loss is defined as
$$
\ell(y, \hat{y}) = -(y\log\hat{y} + (1-y)\log(1-\hat{y}))
$$
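
As a small illustration of the formula above, the following sketch evaluates the sigmoid and the binary loss in plain Python. The function names `sigmoid` and `binary_cross_entropy` and the logit value are chosen here for illustration and are not part of the notebook.

```python
import math

def sigmoid(z):
    """Map a real-valued logit z to a probability in (0, 1).

    1 / (1 + e^{-z}) is algebraically the same as e^z / (1 + e^z).
    """
    return 1.0 / (1.0 + math.exp(-z))

def binary_cross_entropy(y, y_hat):
    """Loss for one sample: y is the 0/1 label, y_hat the predicted probability."""
    return -(y * math.log(y_hat) + (1 - y) * math.log(1 - y_hat))

z = 2.0                                             # illustrative logit of the last layer
y_hat = sigmoid(z)                                  # predicted probability of class 1
loss_if_positive = binary_cross_entropy(1, y_hat)   # label agrees with the prediction
loss_if_negative = binary_cross_entropy(0, y_hat)   # label contradicts the prediction
print(y_hat, loss_if_positive, loss_if_negative)
```

The loss is small when the predicted probability agrees with the label and grows as the two disagree.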

On the other hand, the **multi-class classification** loss receives a vector $\hat{\textbf{y}}$, obtained by feeding the linear output of the last layer $\textbf{z}^L$ to the softmax function. Softmax is defined element-wise as $\text{softmax}(\textbf{z}^L)_i=\frac{e^{{\textbf{z}}^L_i}}{\sum_{j=1}^{n}e^{{\textbf{z}}^L_j}}=\hat{\textbf{y}}_i$, where $\hat{\textbf{y}} \in \mathbb{R}^n$ is the output of the network, $n$ is the number of classes, and $\textbf{y}$ is the one-hot ground truth vector. For a single sample, the loss is
$$
\ell(\textbf{y}, \hat{\textbf{y}}) = -\sum_{k=1}^{n}{\textbf{y}}_k\log(\hat{\textbf{y}}_k)
$$
and for a batch of $m$ samples the per-sample losses are averaged:
$$
L = -\frac{1}{m}\sum_{j=1}^{m}\sum_{k=1}^{n}{\textbf{y}}^{(j)}_k\log(\hat{\textbf{y}}^{(j)}_k)
$$
The above loss function is called **cross entropy** loss.
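
To see why softmax regression is described as a generalization of logistic regression, the sketch below checks that a two-class softmax gives the same probability as a sigmoid applied to the difference of the two logits. The helper names and the logit values are illustrative assumptions, not code from the notebook.

```python
import math

def sigmoid(z):
    """Logistic function, as used for binary classification."""
    return 1.0 / (1.0 + math.exp(-z))

def softmax(zs):
    """Softmax over a plain list of logits."""
    exps = [math.exp(z) for z in zs]
    total = sum(exps)
    return [e / total for e in exps]

z = [0.3, -1.2]                      # illustrative logits for two classes
probs = softmax(z)                   # two probabilities that sum to 1
p_sigmoid = sigmoid(z[0] - z[1])     # sigmoid of the logit difference
print(probs[0], p_sigmoid)           # the two values agree up to rounding
```

Dividing the numerator and denominator of the two-class softmax by $e^{z_1}$ yields $\frac{1}{1+e^{-(z_1-z_2)}}$, which is the sigmoid of the logit difference.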


In [0]:
target = torch.randint(n_classes, size = (batch_size,), dtype = torch.long)

In [0]:
target

tensor([2, 0, 1, 0, 1])

In [0]:
x.exp()

tensor([[4.0058, 0.2821, 0.4020],
        [1.8241, 1.0093, 0.9065],
        [0.9349, 0.8139, 1.4064],
        [4.0222, 1.2656, 1.5157],
        [0.5142, 0.1949, 0.3940]])

In [0]:
x.exp().sum(-1).unsqueeze(-1)

tensor([[4.6899],
        [3.7398],
        [3.1551],
        [6.8035],
        [1.1030]])

In [0]:
x.exp()/x.exp().sum(-1).unsqueeze(-1)

tensor([[0.8541, 0.0601, 0.0857],
        [0.4877, 0.2699, 0.2424],
        [0.2963, 0.2580, 0.4457],
        [0.5912, 0.1860, 0.2228],
        [0.4662, 0.1767, 0.3572]])

# Building the loss function by applying $\text{softmax}$ and then negative log-likelihood

This version is closest to the math formula, but it is not numerically stable, because `exp` can overflow for large inputs.

In the following, we first compute $\hat{\textbf{y}}$ by applying the softmax function to $\textbf{z}^L$. Then, because the ground truth vector contains a single $1$, one particular element of $\hat{\textbf{y}}$ is picked for each data point, and the negative $\log$ of that element is taken. Since we process several data points at once, we average over all the samples.
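
The claim that a one-hot ground truth reduces the sum to a single picked element can be checked with a short sketch. The tensors `y_hat` and `labels` below are made-up illustrative values, and `F.one_hot` is used only to build the explicit one-hot vectors.

```python
import torch
import torch.nn.functional as F

# Illustrative probabilities (each row sums to 1) and integer class labels.
y_hat = torch.tensor([[0.7, 0.2, 0.1],
                      [0.1, 0.3, 0.6]])
labels = torch.tensor([0, 2])

# Explicit one-hot ground truth, summed over classes as in the formula.
y = F.one_hot(labels, num_classes=3).float()
loss_one_hot = -(y * y_hat.log()).sum(dim=-1).mean()

# Shortcut: pick the probability of the true class in each row.
rows = torch.arange(labels.shape[0])
loss_indexed = -y_hat[rows, labels].log().mean()

print(loss_one_hot, loss_indexed)  # both forms give the same value
```

Indexing avoids building the one-hot matrix, which is why the notebook's loss takes integer targets.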


In [0]:
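def softmax(x):
  # Normalize the exponentiated logits so that each row sums to 1.
  return x.exp() / (x.exp().sum(-1)).unsqueeze(-1)

def nl(input, target):
  # Pick the probability of the true class per row; average the negative logs.
  return -input[range(target.shape[0]), target].log().mean()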

In [0]:
target.shape[0]

5

In [0]:
range(target.shape[0])

range(0, 5)

In [0]:
pred = softmax(x)
pred

tensor([[0.8541, 0.0601, 0.0857],
        [0.4877, 0.2699, 0.2424],
        [0.2963, 0.2580, 0.4457],
        [0.5912, 0.1860, 0.2228],
        [0.4662, 0.1767, 0.3572]])

In [0]:
pred[range(target.shape[0]), target]

tensor([0.0857, 0.4877, 0.2580, 0.5912, 0.1767])

In [0]:
-pred[range(target.shape[0]), target].log()

tensor([2.4566, 0.7180, 1.3550, 0.5256, 1.7335])

In [0]:
-pred[range(target.shape[0]), target].log().mean()

tensor(1.3577)

In [0]:
pred = softmax(x)
loss = nl(pred, target)
loss

tensor(1.3577)

# Building the loss function by applying log-softmax and then negative log-likelihood

PyTorch's documentation for `log_softmax` notes that, while it is mathematically equivalent to log(softmax(x)), doing these two operations separately is slower and numerically unstable; the function uses an alternative formulation to compute the output and gradient correctly.

Since we need just one element of $\hat{\textbf{y}}$, selected by the ground truth (say the $i$-th element), we can simplify the log of the softmax instead of first applying softmax and then taking the log: $\log\left(\frac{e^{\textbf{z}_i^L}}{\sum_k e^{\textbf{z}_k^L}}\right)=\textbf{z}_i^L-\log\left(\sum_k e^{\textbf{z}_k^L}\right)$

```python
x - x.exp().sum(-1).log().unsqueeze(-1)
```
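
The one-line expression above still calls `exp` on the raw inputs, so it can overflow for large logits. A common remedy, sketched here as an assumption rather than as a description of PyTorch's internal kernel, is to subtract the row maximum before exponentiating. The function names and the extreme test logits are illustrative.

```python
import torch
import torch.nn.functional as F

def naive_log_softmax(x):
    # Direct translation of the identity; exp() overflows for large logits.
    return x - x.exp().sum(-1).log().unsqueeze(-1)

def shifted_log_softmax(x):
    # Subtracting the row maximum does not change the result mathematically,
    # but it keeps every exponent <= 0, so exp() cannot overflow.
    shifted = x - x.max(dim=-1, keepdim=True).values
    return shifted - shifted.exp().sum(-1).log().unsqueeze(-1)

big = torch.tensor([[1000.0, 0.0, -1000.0]])  # illustrative extreme logits
print(naive_log_softmax(big))      # not finite: exp(1000) overflows in float32
print(shifted_log_softmax(big))    # finite values
print(F.log_softmax(big, dim=-1))  # built-in result for comparison
```

The shift works because adding a constant to every logit in a row changes the numerator and denominator of softmax by the same factor.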

In [0]:
m = nn.LogSoftmax(dim=-1)

In [0]:
input = torch.randn(2, 3)

In [0]:
m(input)

tensor([[-2.7986, -0.8772, -0.6479],
        [-1.7172, -0.6639, -1.1855]])

In [0]:
input

tensor([[-0.6471,  1.2743,  1.5036],
        [-0.6790,  0.3743, -0.1474]])

In [0]:
input.exp()

tensor([[0.5236, 3.5764, 4.4981],
        [0.5071, 1.4539, 0.8630]])

In [0]:
input.exp().sum(-1).unsqueeze(-1)

tensor([[8.5980],
        [2.8240]])

In [0]:
input.exp()/input.exp().sum(-1).unsqueeze(-1)

tensor([[0.0609, 0.4160, 0.5232],
        [0.1796, 0.5148, 0.3056]])

In [0]:
(input.exp()/input.exp().sum(-1).unsqueeze(-1)).log()

tensor([[-2.7986, -0.8772, -0.6479],
        [-1.7172, -0.6639, -1.1855]])

In [0]:
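def log_softmax(x):
  # log(softmax(x)) computed as x - log(sum(exp(x))) along the class axis.
  return x - x.exp().sum(-1).log().unsqueeze(-1)

def nll(input, target):
  # input already holds log-probabilities, so no further log is needed.
  return -input[range(target.shape[0]), target].mean()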

pred = log_softmax(x)
loss = nll(pred, target)
loss

tensor(1.3577)

# Using the PyTorch functional module to calculate the loss

In [0]:
pred = F.log_softmax(x, dim=-1)
loss = F.nll_loss(pred, target)
loss

tensor(1.3577)

# Finding the loss in one shot using cross entropy loss

In [0]:
F.cross_entropy(x, target)

tensor(1.3577)
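
As a closing check, the sketch below compares the module form `nn.CrossEntropyLoss` with `F.cross_entropy`, and uses `reduction='none'` to recover the per-sample losses before averaging. The seed and the tensors `logits` and `labels` are illustrative assumptions; they are not the `x` and `target` used earlier in the notebook.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)                    # arbitrary seed, only for repeatability
logits = torch.randn(5, 3)              # illustrative batch: 5 samples, 3 classes
labels = torch.randint(3, size=(5,))    # class indices in {0, 1, 2}

criterion = nn.CrossEntropyLoss()       # module form, default reduction='mean'
loss_module = criterion(logits, labels)
loss_functional = F.cross_entropy(logits, labels)

# reduction='none' returns one loss per sample instead of the batch mean.
per_sample = F.cross_entropy(logits, labels, reduction='none')

# The two-step route: log-softmax followed by negative log-likelihood.
two_step = F.nll_loss(F.log_softmax(logits, dim=-1), labels, reduction='none')

print(loss_module, loss_functional, per_sample.mean())
```

All three scalar values printed at the end agree, and `per_sample` matches `two_step` element by element.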