Classification: Cross Entropy Loss
Classification attempts to place an item into one of two or more classes. Many of the tasks we'll be performing in this course are classification tasks. In this lesson, we'll dig into the cross entropy loss function.

When we're performing classification tasks, our outputs for each item look like a vector of scores which we can interpret as probabilities. We generally compare this against a vector representing the actual class, where the index of the class is represented by a 1 and all the other indices are 0. We want a loss function that gets lower as the score of the correct class gets closer to 1. This is what cross entropy loss does. We'll go over cross entropy in detail to understand the loss function, then go over how to use it in practice in torch.

Cross entropy for a single inference is given by

\( H(p, q) = -\sum_{i=1}^{M} q_i \log p_i \)

where

\( p \) is a vector of predicted probabilities, 1 per class
\( p_i \) is a single probability for one of the output classes
\( q \) is a one-hot vector of labels encoding the true class
\( q_i \) is the value at a single label
\( M \) is the number of classes
Let's take, for instance, a 3-class task. For a single inference, we will have a softmax-normalized vector that we interpret as a probability distribution over the labels, so it sums to 1.0. For example, let's take the vector \( [0.1, 0.7, 0.2] \). We also have our actual distribution that encodes the true label. Let's consider a true distribution of \( [0, 1, 0] \). If we apply the equation, we can see that for indices 0 and 2, \( q_i = 0 \), so regardless of what \( p_i \) is, the contribution of these indices will be 0. All we really care about is the index of the true label. In this case, the loss for this item will be:

\( H(p, q) = -(0 \cdot \log 0.1 + 1 \cdot \log 0.7 + 0 \cdot \log 0.2) = -\log 0.7 \approx 0.357 \)

Let's use numpy or torch to check our work from the example above. First, let's check that the negative log of 0.7 matches our calculations above.

In [1]:
import numpy as np

In [2]:
#check our work
-np.log(0.7)

0.35667494393873245

Now, let's define a function H that takes in a vector of scores and a vector indicating the correct labels, then apply it to our example case. We'll see that the values we return match.

In [3]:
import torch

In [4]:
def H(p,q):
  return (-1*q*p.log()).sum()

In [5]:
# Apply H to our example: predicted probabilities and one-hot label
H(torch.tensor([0.1,0.7,0.2]),torch.tensor([0,1,0]))

tensor(0.3567)

In [6]:
t=torch.tensor([0.1,0.7,0.2])

In [7]:
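def our_cross_entropy(yhat,y):
  # yhat: vector of predicted probabilities; y: index of the true class
  act=yhat[y]
  # Only the true class's probability contributes to the loss
  return -act.log()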

In [8]:
our_cross_entropy(t,1)

tensor(0.3567)

In practice, we won't be calculating the loss for single items, but for a batch. To calculate the cross entropy loss for a batch, we calculate the loss for each row and take the mean. To make this easier, let's use the indices of the labels instead of the one-hot vectors. Below, we define a function that takes the predicted probabilities (simulated here by a softmax-normalized tensor of random numbers) and some label indices, and calculates the cross entropy loss.

In [9]:
def avg_cross_entropy(yhat,y):
  # For each row i, pick the probability at label index y[i], then average the negative logs
  return -yhat[range(y.shape[0]),y].log().mean()

In [10]:
t=torch.randn(3,3).softmax(dim=1)
t

tensor([[0.7502, 0.0828, 0.1670],
        [0.3724, 0.2735, 0.3541],
        [0.0895, 0.8934, 0.0171]])

In [11]:
y=torch.randint(low=0, high=3, size=(3,))
y

tensor([0, 2, 0])

In [12]:
avg_cross_entropy(t,y)

tensor(1.2463)
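
As a sanity check, the index-based shortcut should agree with the one-hot formula applied row by row. The sketch below assumes the probabilities and labels shown in the outputs above (copied by hand, so they are rounded to four decimals) and uses `F.one_hot` to rebuild the one-hot vectors; the variable names are illustrative only.

```python
import torch
import torch.nn.functional as F

# Predicted probabilities and label indices copied from the example outputs above
probs = torch.tensor([[0.7502, 0.0828, 0.1670],
                      [0.3724, 0.2735, 0.3541],
                      [0.0895, 0.8934, 0.0171]])
label_idx = torch.tensor([0, 2, 0])

# Index-based version: pick each row's probability at its label index
loss_from_indices = -probs[torch.arange(len(label_idx)), label_idx].log().mean()

# One-hot version: apply the formula to each row, then average over the batch
one_hot = F.one_hot(label_idx, num_classes=3).float()
loss_from_one_hot = -(one_hot * probs.log()).sum(dim=1).mean()

print(loss_from_indices, loss_from_one_hot)
```

Both routes give the same value because every term with \( q_i = 0 \) drops out of the sum.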

One common problem in deep learning is that of precision. Here, we're not talking about the ML metric - we're talking about the computer's limited ability to store really large or really small numbers in memory. For each value we work with, torch will allocate a specific amount of memory, and there's a limit to how close to or far from zero a number can be and still fit within that memory. If we are working with numbers near those limits, operations on them are often imprecise because we just can't store an accurate representation of the result. Let's multiply some small and large numbers to illustrate this.

In [17]:
a = 0.00000000000000000000000000001
a * a

9.999999999999998e-59

In [18]:
a = 100000000000000000000000000000.
a * a

9.999999999999998e+57

In deep learning, it's common to work with very small or very large numbers. When we multiply small numbers together, they get even smaller and closer to that limit of precision. There is a useful property of logs/exponents that helps us calculate these numbers a bit more precisely:

\( \log(a \cdot b) = \log(a) + \log(b) \)
\( a \cdot b = e^{\log(a) + \log(b)} \)
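
To see why this helps, here is a minimal sketch in plain Python. The list of 50 probabilities is made up for illustration; the point is only that the direct product falls below what a 64-bit float can store, while the sum of logs stays in a comfortable range.

```python
import math

# 50 small probabilities (hypothetical values chosen for illustration)
probs = [1e-10] * 50

# Multiplying directly: the true result, 1e-500, is far too small for a 64-bit float
product = 1.0
for p in probs:
    product *= p

# Summing the logs instead keeps the value well within range
log_sum = sum(math.log(p) for p in probs)

print(product)   # underflows to 0.0
print(log_sum)   # log of the product, roughly -1151.29
```

Once the product has underflowed to zero, the information is gone; the log-space value can still be compared, averaged, or differentiated.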


Let's revisit the softmax function that we apply to turn the logits (outputs of our final layer) into a probability distribution. Given a vector of logits \( z \) :

\( \sigma(z)_i = \frac{e^{z_i}}{\sum_{j=1}^{N} e^{z_j}} \)
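
The formula translates almost directly into torch. The sketch below is a straightforward, unoptimized version for illustration (`manual_softmax` is a hypothetical helper, not part of torch), checked against the built-in `.softmax` method.

```python
import torch

def manual_softmax(z):
    # Exponentiate each logit, then divide by the sum of exponentials
    exp_z = z.exp()
    return exp_z / exp_z.sum(dim=-1, keepdim=True)

z = torch.tensor([1.0, 2.0, 3.0])
probs = manual_softmax(z)

print(probs)                                     # approximately [0.0900, 0.2447, 0.6652]
print(probs.sum())                               # sums to 1
print(torch.allclose(probs, z.softmax(dim=-1)))  # matches torch's softmax
```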


Between softmax and cross entropy, there are a lot of logs and exponents happening. Without going into too much detail, torch's implementation of cross entropy loss combines the softmax and cross entropy loss functions in a way that's more efficient and numerically stable than first normalizing the logits to a probability distribution and then taking the log. For a more detailed explanation of cross entropy loss and torch's implementation, check out this video.
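
A small sketch of the difference, using deliberately extreme logits that are made up for illustration. The naive route normalizes first and then takes the log; the combined route uses `F.log_softmax` followed by `F.nll_loss`, which the torch documentation describes as equivalent to cross entropy loss.

```python
import torch
import torch.nn.functional as F

# Hypothetical extreme logits: the model is confidently wrong (true class is 1)
logits = torch.tensor([[100.0, -100.0, 0.0]])
labels = torch.tensor([1])

# Naive route: normalize to probabilities first, then take the log.
# The probability of class 1 underflows to 0 in float32, so the loss becomes inf.
naive_loss = -logits.softmax(dim=1)[0, labels[0]].log()

# Combined route: log_softmax stays in log space and never forms that tiny probability
log_probs = F.log_softmax(logits, dim=1)
stable_loss = F.nll_loss(log_probs, labels)

print(naive_loss)                       # inf
print(stable_loss)                      # a finite loss of about 200
print(F.cross_entropy(logits, labels))  # same value as stable_loss
```

An infinite loss would break gradient-based training, which is one practical reason to hand the raw logits to the loss function.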

Let's see how cross entropy loss works in practice.

In [19]:
def make_classification_logits(n_classes, n_samples, pct_correct, confidence=1):
    """
    This function returns simulated logits and classes.

    n_classes: number of classes
    n_samples: number of rows
    pct_correct: float between 0 and 1. The higher it is,
                 the higher the % of logits that will
                 generate the correct output.
    confidence: controls how confident our logits are.
                Closer to 0: less confident
                Larger: more confident
    """
    classes = list(range(n_classes))
    # Randomly make logits
    logits = np.random.uniform(-5., 5., (n_samples, n_classes))
    # Randomly make labels
    labels = np.random.choice(classes, size=(n_samples))
    # Find the largest absolute value in each row of logits
    maxs = np.abs(logits).max(axis=1)
    # For each row...
    for i in range(len(maxs)):
        # If we want the answer to be right...
        if np.random.random() <= pct_correct:
            # Make the correct item the highest logit
            logits[i, labels[i]] = maxs[i] + np.random.random()*confidence
        # If we want it to be wrong...
        else:
            # Make the highest logit a different index
            _c = classes.copy()
            _c.remove(classes[labels[i]])
            _i = np.random.choice(_c)
            logits[i, _i] = maxs[i] + np.random.random()/10

    # Return logits and labels
    return torch.FloatTensor(logits), torch.tensor(labels)

Let's use the function defined above to make some classification logits and associated labels.

In [20]:
# Create some logits and associated labels.
# There will be some error here!
logits, labels = make_classification_logits(3, 10, 0.8, confidence=1)
logits

tensor([[ 4.2177, -4.0934,  0.0668],
        [-0.7385,  2.6636,  3.4101],
        [-4.0501,  1.9145,  4.6214],
        [ 4.8040,  1.1541, -2.3394],
        [ 2.5674, -2.1252,  1.5663],
        [-2.0453,  3.3044,  0.0779],
        [ 4.1436,  5.1625,  4.5696],
        [ 4.2516,  3.3550, -0.4895],
        [-0.7235, -0.0780,  2.7425],
        [-4.7485,  0.8806,  5.0635]])

Let's use the .softmax method to normalize the logits to a probability score.

In [21]:
# What are the normalized predicted probabilities for each class?
logits.softmax(dim=1)

tensor([[9.8426e-01, 2.4190e-04, 1.5502e-02],
        [1.0596e-02, 3.1819e-01, 6.7122e-01],
        [1.6065e-04, 6.2557e-02, 9.3728e-01],
        [9.7392e-01, 2.5315e-02, 7.6946e-04],
        [7.2642e-01, 6.6558e-03, 2.6692e-01],
        [4.5471e-03, 9.5745e-01, 3.8007e-02],
        [1.8863e-01, 5.2255e-01, 2.8882e-01],
        [7.0587e-01, 2.8797e-01, 6.1616e-03],
        [2.8639e-02, 5.4614e-02, 9.1675e-01],
        [5.3965e-05, 1.5024e-02, 9.8492e-01]])

Let's find the indices with the highest probability. These will serve as our predictions, and we can compare them against our labels.

In [22]:
# How well do they match with our labels?
labels

tensor([0, 2, 2, 1, 0, 1, 1, 0, 1, 2])
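
One way to get the predictions and compare them with the labels is `argmax` along the class dimension. The sketch below re-creates the logits and labels from the outputs above by hand so that it runs on its own; `preds` and `accuracy` are illustrative names.

```python
import torch

# Logits and labels copied from the outputs above
logits = torch.tensor([[ 4.2177, -4.0934,  0.0668],
                       [-0.7385,  2.6636,  3.4101],
                       [-4.0501,  1.9145,  4.6214],
                       [ 4.8040,  1.1541, -2.3394],
                       [ 2.5674, -2.1252,  1.5663],
                       [-2.0453,  3.3044,  0.0779],
                       [ 4.1436,  5.1625,  4.5696],
                       [ 4.2516,  3.3550, -0.4895],
                       [-0.7235, -0.0780,  2.7425],
                       [-4.7485,  0.8806,  5.0635]])
labels = torch.tensor([0, 2, 2, 1, 0, 1, 1, 0, 1, 2])

# The predicted class is the index of the largest probability in each row
preds = logits.softmax(dim=1).argmax(dim=1)

# Fraction of rows where the prediction matches the label
accuracy = (preds == labels).float().mean()

print(preds)     # tensor([0, 2, 2, 0, 0, 1, 1, 0, 2, 2])
print(accuracy)  # 8 of 10 rows match
```

Rows 3 and 8 are the two misses, which is consistent with the `pct_correct` of 0.8 used to generate this batch.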

Now let's use our logits to calculate cross entropy loss for this entire batch of data using nn.CrossEntropyLoss() and F.cross_entropy. Both take in the logits and labels, and return the cross entropy loss for the batch.

In [24]:
from torch import nn

In [25]:
cross_entropy = nn.CrossEntropyLoss()

In [26]:
cross_entropy(logits, labels)

tensor(0.8439)

In [28]:
import torch.nn.functional as F

In [29]:
F.cross_entropy(logits, labels)

tensor(0.8439)
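
The batch loss is a mean, so it can hide which items contribute most. As a sketch, passing `reduction="none"` to `F.cross_entropy` returns one loss per row instead; the logits and labels below are copied by hand from the outputs above so the block runs on its own.

```python
import torch
import torch.nn.functional as F

# Logits and labels copied from the outputs above
logits = torch.tensor([[ 4.2177, -4.0934,  0.0668],
                       [-0.7385,  2.6636,  3.4101],
                       [-4.0501,  1.9145,  4.6214],
                       [ 4.8040,  1.1541, -2.3394],
                       [ 2.5674, -2.1252,  1.5663],
                       [-2.0453,  3.3044,  0.0779],
                       [ 4.1436,  5.1625,  4.5696],
                       [ 4.2516,  3.3550, -0.4895],
                       [-0.7235, -0.0780,  2.7425],
                       [-4.7485,  0.8806,  5.0635]])
labels = torch.tensor([0, 2, 2, 1, 0, 1, 1, 0, 1, 2])

# One loss value per row instead of a single averaged scalar
per_item = F.cross_entropy(logits, labels, reduction="none")

print(per_item)         # rows 3 and 8 (the wrong predictions) have the largest losses
print(per_item.mean())  # averaging recovers the batch loss
```

The two confidently wrong rows account for most of the batch loss, which is the behavior we want from a classification loss.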

Finally, let's apply our cross entropy loss function from earlier over this batch and make sure we get the same result.

In [30]:
torch.mean(
    torch.tensor(
        [our_cross_entropy(lo, la)
         for lo, la # softmax of logits, labels
         in zip(logits.softmax(dim=1), labels)
        ]
    )
)

tensor(0.8439)

In this lesson, we reviewed cross entropy loss, the general loss function for classification. We learned a little bit about numerical precision, and how multiplying small or large numbers together can result in some numerical error. This is why we use the logits to calculate the loss instead of the normalized probabilities.