Classification: Multicategory Classification

Sometimes you may want a single model to make several mutually independent decisions. In this scenario, the model returns one output per category, and each output is the probability for its own binary (yes/no) decision. Because the categories are not mutually exclusive, the probabilities for an item need not sum to 1.0 and may sum to more.

Let's say we want to build a model to analyze the sentiment of tweets about the NBA. We could train two models, one for sentiment (1 is positive, 0 is negative) and another that classifies whether a tweet is about the NBA. Alternatively, we could use a single model with two outputs: one for whether the tweet is positive or negative (sentiment), and another for whether it is about the NBA. The probability of one should not depend on the other.

Let's simulate some potential logits:

In [2]:
import torch

In [3]:
# Column 0 may be the logit for p(happy) = 1 - p(sad)
# and column 1 may be the logit for p(lakers) = 1 - p(not lakers)

logits=torch.rand(10,2)*2
logits

tensor([[1.6969, 0.9234],
        [1.5039, 0.8510],
        [0.9432, 1.1963],
        [0.5297, 0.6372],
        [0.5523, 0.8420],
        [0.0046, 0.3863],
        [1.9631, 0.9802],
        [1.5399, 1.0572],
        [0.1167, 0.9180],
        [1.5522, 0.3613]])

Now, we need labels for these logits. We'll need one column of labels per column of logits.

In [4]:
labels=torch.randint(0,2,(10,2)).float()
labels

tensor([[1., 0.],
        [1., 0.],
        [0., 1.],
        [1., 1.],
        [1., 1.],
        [0., 1.],
        [0., 1.],
        [1., 0.],
        [0., 1.],
        [1., 0.]])

To convert the logits to probabilities, we still need to squash them into the range (0, 1). But the probabilities in each row no longer need to add up to 1.0, since the categories are not mutually exclusive. Notice that sigmoid is applied to each value independently.

In [5]:
logits.sigmoid()    # elementwise, so no dim argument is needed (unlike softmax)

tensor([[0.8451, 0.7157],
        [0.8182, 0.7008],
        [0.7198, 0.7679],
        [0.6294, 0.6541],
        [0.6347, 0.6989],
        [0.5012, 0.5954],
        [0.8769, 0.7271],
        [0.8234, 0.7422],
        [0.5291, 0.7146],
        [0.8252, 0.5893]])
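
To make the contrast with softmax concrete, here is a minimal plain-Python sketch (no torch, so the arithmetic stays visible) applied to the first row of the logits above. The helper names `sigmoid` and `softmax` are written here for illustration and are not the torch functions.

```python
import math

def sigmoid(x):
    # Squash one logit into (0, 1), independently of any other value.
    return 1.0 / (1.0 + math.exp(-x))

def softmax(row):
    # Normalize a whole row so that its entries sum to 1.0.
    exps = [math.exp(x) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

row = [1.6969, 0.9234]  # first row of the logits printed above

independent = [sigmoid(x) for x in row]  # one probability per decision
exclusive = softmax(row)                 # probabilities that compete

print([round(p, 4) for p in independent])
print(round(sum(independent), 4))  # exceeds 1.0 for this row
print(round(sum(exclusive), 4))    # 1.0
```

Sigmoid lets both decisions be "yes" at once, which is what independent categories require. Softmax would force the two outputs to compete for a single unit of probability.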

We can still use F.binary_cross_entropy_with_logits or nn.BCEWithLogitsLoss() to calculate the loss here. With the default reduction='mean', it averages the loss over every logit/label pair for us.

In [6]:
from torch import nn

In [8]:
bce=nn.BCEWithLogitsLoss()

In [9]:
bce(logits,labels)

tensor(0.6709)
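
As a rough check on what this number means, the sketch below recomputes it in plain Python from the printed logits and labels. It uses the usual binary cross entropy on logits, -[y*log(sigmoid(x)) + (1-y)*log(1-sigmoid(x))], written in an algebraically equivalent, numerically stable form. `bce_with_logits` is a helper name chosen here, not a torch function. Because the printed logits are rounded to four decimals, the result should only be expected to agree to about three decimals.

```python
import math

# Logits and labels copied from the printed tensors above.
logits = [[1.6969, 0.9234], [1.5039, 0.8510], [0.9432, 1.1963],
          [0.5297, 0.6372], [0.5523, 0.8420], [0.0046, 0.3863],
          [1.9631, 0.9802], [1.5399, 1.0572], [0.1167, 0.9180],
          [1.5522, 0.3613]]
labels = [[1, 0], [1, 0], [0, 1], [1, 1], [1, 1],
          [0, 1], [0, 1], [1, 0], [0, 1], [1, 0]]

def bce_with_logits(x, y):
    # Stable form of -[y*log(sigmoid(x)) + (1-y)*log(1-sigmoid(x))].
    return max(x, 0.0) - x * y + math.log1p(math.exp(-abs(x)))

# One loss per logit/label pair, across both columns.
losses = [bce_with_logits(x, y)
          for logit_row, label_row in zip(logits, labels)
          for x, y in zip(logit_row, label_row)]

# Averaging over all 20 pairs mirrors reduction='mean'.
mean_loss = sum(losses) / len(losses)
print(round(mean_loss, 4))  # close to the 0.6709 reported above
```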

Let's perform a sanity check by calculating the binary cross entropy for each column separately. Because both columns contain the same number of elements, the mean of these two values should equal the output from the code above.

In [10]:
bce(logits[:,1],labels[:,1])

tensor(0.6930)

In [11]:
bce(logits[:,0],labels[:,0])

tensor(0.6488)

In [12]:
(bce(logits[:,0],labels[:,0])+bce(logits[:,1], labels[:,1]))/2  # should match the loss over all outputs above

tensor(0.6709)
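
The agreement can also be seen on paper. Writing $\ell_{ij}$ for the loss at row $i$ and column $j$ (notation introduced here for illustration), with $N = 10$ rows and $C = 2$ columns, the mean over all elements factors into a mean of per-column means:

```latex
\frac{1}{NC}\sum_{i=1}^{N}\sum_{j=1}^{C}\ell_{ij}
  = \frac{1}{C}\sum_{j=1}^{C}\left(\frac{1}{N}\sum_{i=1}^{N}\ell_{ij}\right)
```

With the values above, (0.6488 + 0.6930) / 2 = 0.6709.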

In this lesson, we reviewed the case for multicategory (multilabel) classification and how to apply binary cross entropy to build models that output more than one decision at a time.