## Multi-class multi-label classification

We now turn to multi-label classification, whereby multiple labels can be assigned to each example. As a first example
of the reach of LTNs, we shall see how the previous example can be extended naturally using LTN to account for multiple
labels, which is not always a trivial extension for most ML algorithms.

The standard approach to the multi-label problem is to provide explicit negative examples for each class. By contrast,
LTN can use background knowledge to relate classes directly to each other, thus becoming a powerful tool in the case of
the multi-label problem, where typically the labelled data is scarce.

We explore the Leptograpsus crabs dataset, consisting of 200 examples of 5 morphological measurements of 50 crabs.
The task is to classify the crabs according to their colour and sex. There are four labels: blue, orange, male, and female.
The colour labels are mutually-exclusive, and so are the labels for sex. LTN will be used to specify such information
logically.

For this specific task, LTN uses the following language and grounding:

**Domains:**
- $items$, denoting the examples from the crabs data set;
- $labels$, denoting the class labels.

**Variables:**
- $x_{blue}, x_{orange}, x_{male}, x_{female}$ for the positive examples of each class;
- $x$, used to denote all the examples;
- $D(x_{blue}) = D(x_{orange}) = D(x_{male}) = D(x_{female}) = D(x) = items$.

**Constants:**
- $l_{blue}, l_{orange}, l_{male}, l_{female}$: the labels of each class;
- $D(l_{blue}) = D(l_{orange}) = D(l_{male}) = D(l_{female}) = labels$.

**Predicates:**
- $P(x,l)$ denoting the fact that item $x$ is labelled as $l$;
- $D_{in}(P) = items,labels$.

**Axioms:**

- $\forall x_{blue} P(x_{blue}, l_{blue})$: all the examples coloured by blue should have label $l_{blue}$;
- $\forall x_{orange} P(x_{orange}, l_{orange})$: all the examples coloured by orange should have label $l_{orange}$;
- $\forall x_{male} P(x_{male}, l_{male})$: all the examples that are males should have label $l_{male}$;
- $\forall x_{female} P(x_{female}, l_{female})$: all the examples that are females should have label $l_{female}$;
- $\forall x \lnot (P(x, l_{blue}) \land P(x, l_{orange}))$: if an example $x$ is labelled as blue, it cannot be labelled
as orange too;
- $\forall x \lnot (P(x, l_{male}) \land P(x, l_{female}))$: if an example $x$ is labelled as male, it cannot be labelled
as female too.

Notice how the last two logical rules represent the mutual exclusion of the labels on colour and sex, respectively.
As a result, negative examples are not used explicitly in this specification.


**Grounding:**
- $\mathcal{G}(items)=\mathbb{R}^{5}$, items are described by 5 features;
- $\mathcal{G}(labels)=\mathbb{N}^{4}$, we use a one-hot encoding to represent labels;
- $\mathcal{G}(x_{blue}) \in \mathbb{R}^{m_1 \times 5}, \mathcal{G}(x_{orange}) \in \mathbb{R}^{m_2 \times 5},\mathcal{G}(x_{male}) \in \mathbb{R}^{m_3 \times 5},\mathcal{G}(x_{female}) \in \mathbb{R}^{m_4 \times 5}$.
These sequences are not mutually-exclusive, one example can for instance be in both $x_{blue}$ and $x_{male}$;
- $\mathcal{G}(x) \in \mathbb{R}^{m \times 5}$, that is, $\mathcal{G}(x)$ is a sequence of all the examples;
- $\mathcal{G}(l_{blue}) = [1, 0, 0, 0]$, $\mathcal{G}(l_{orange}) = [0, 1, 0, 0]$, $\mathcal{G}(l_{male}) = [0, 0, 1, 0]$, $\mathcal{G}(l_{female}) = [0, 0, 0, 1]$;
- $\mathcal{G}(P \mid \theta): x,l \mapsto l^\top \cdot \sigma\left(\operatorname{MLP}_{\theta}(x)\right)$, where $MLP$
has four output neurons corresponding to as many labels, and $\cdot$ denotes the dot product as a way of selecting an
output for $\mathcal{G}(P \mid \theta)$. In fact, multiplying the $MLP$’s output by the one-hot vector $l^\top$ gives the probability
corresponding to the label denoted by $l$. By contrast with the previous example, notice the use
of a *sigmoid* function instead of a *softmax* function. We need that because labels are not mutually exclusive anymore.


### Dataset

Now, let's import the dataset.

The Leptograpsus crabs dataset consists of 200 examples. Every example is represented by 5 features. The dataset
is subdivided into train and test set. In particular, we use 160 examples for training and 40 for test.

In [1]:
import torch
import pandas as pd

df = pd.read_csv("datasets/crabs.dat", sep=" ", skipinitialspace=True)
df = df.sample(frac=1)  # shuffle dataset
df = df.replace({'B': 0, 'O': 1, 'M': 2, 'F': 3})

features = torch.tensor(df[['FL', 'RW', 'CL', 'CW', 'BD']].to_numpy())
labels_sex = torch.tensor(df['sex'].to_numpy())
labels_color = torch.tensor(df['sp'].to_numpy())

train_data = features[:160].float()
test_data = features[160:].float()
train_sex_labels = labels_sex[:160].long()
test_sex_labels = labels_sex[160:].long()
train_color_labels = labels_color[:160].long()
test_color_labels = labels_color[160:].long()

### LTN setting

In order to define our knowledge base (axioms), we need to define predicate $P$, constants $l_{blue}$, $l_{orange}$, $l_{male}$,
$l_{female}$, connectives, universal quantifier, and the `SatAgg` operator.

For the connectives and quantifier, we use the stable product configuration (seen in the tutorials).

For predicate $P$, we have two models. The first one implements an $MLP$ which outputs the logits for the four classes of
the dataset, given an example $x$ in input. The second model takes as input a labelled example $(x,l)$, it computes the logits
using the first model and then returns the prediction (*sigmoid*) for class $l$.

The constants $l_{blue}$, $l_{orange}$, $l_{male}$, and $l_{female}$, represent the one-hot labels for the four classes, as we have already seen in the
definition of the grounding for this task.

`SatAgg` is defined using the `pMeanError` aggregator.

In [2]:
import ltn

# we define the constants
l_blue = ltn.Constant(torch.tensor([1, 0, 0, 0]))
l_orange = ltn.Constant(torch.tensor([0, 1, 0, 0]))
l_male = ltn.Constant(torch.tensor([0, 0, 1, 0]))
l_female = ltn.Constant(torch.tensor([0, 0, 0, 1]))

# we define predicate P
class MLP(torch.nn.Module):
    """
    This model returns the logits for the classes given an input example. It does not compute the softmax, so the output
    are not normalized.
    This is done to separate the accuracy computation from the satisfaction level computation. Go through the example
    to understand it.
    """
    def __init__(self, layer_sizes=(5, 16, 16, 8, 4)):
        super(MLP, self).__init__()
        self.elu = torch.nn.ELU()
        self.dropout = torch.nn.Dropout(0.2)
        self.linear_layers = torch.nn.ModuleList([torch.nn.Linear(layer_sizes[i - 1], layer_sizes[i])
                                                  for i in range(1, len(layer_sizes))])

    def forward(self, x, training=False):
        """
        Method which defines the forward phase of the neural network for our multi class classification task.
        In particular, it returns the logits for the classes given an input example.

        :param x: the features of the example
        :param training: whether the network is in training mode (dropout applied) or validation mode (dropout not applied)
        :return: logits for example x
        """
        for layer in self.linear_layers[:-1]:
            x = self.elu(layer(x))
            if training:
                x = self.dropout(x)
        logits = self.linear_layers[-1](x)
        return logits


class LogitsToPredicate(torch.nn.Module):
    """
    This model has inside a logits model, that is a model which compute logits for the classes given an input example x.
    The idea of this model is to keep logits and probabilities separated. The logits model returns the logits for an example,
    while this model returns the probabilities given the logits model.

    In particular, it takes as input an example x and a class label l. It applies the logits model to x to get the logits.
    Then, it applies a softmax function to get the probabilities per classes. Finally, it returns only the probability related
    to the given class l.
    """
    def __init__(self, logits_model):
        super(LogitsToPredicate, self).__init__()
        self.logits_model = logits_model
        self.sigmoid = torch.nn.Sigmoid()

    def forward(self, x, l, training=False):
        logits = self.logits_model(x, training=training)
        probs = self.sigmoid(logits)
        out = torch.sum(probs * l, dim=1)
        return out

mlp = MLP()
P = ltn.Predicate(LogitsToPredicate(mlp))

# we define the connectives, quantifiers, and the SatAgg
Not = ltn.Connective(ltn.fuzzy_ops.NotStandard())
And = ltn.Connective(ltn.fuzzy_ops.AndProd())
Forall = ltn.Quantifier(ltn.fuzzy_ops.AggregPMeanError(p=2), quantifier="f")
SatAgg = ltn.fuzzy_ops.SatAgg()

### Utils

Now, we need to define some utility classes and functions.

We define a standard PyTorch data loader, which takes as input the dataset and returns a generator of batches of data.
In particular, we need a data loader instance for training data and one for testing data.

Then, we define functions to evaluate the model performances. The model is evaluated on the test set using the following metrics:
- the satisfaction level of the knowledge base: measure the ability of LTN to satisfy the knowledge;
- the classification accuracy: this time, the accuracy is defined as $1 - HL$, where $HL$ is the average Hamming loss,
i.e. the fraction of labels predicted incorrectly, with a classification threshold of 0.5 (given an example $u$,
if the model outputs a value greater than 0.5 for class $C$ then $u$ is deemed as belonging to class $C$).

In [3]:
from sklearn.metrics import accuracy_score
import numpy as np

class DataLoader(object):
    def __init__(self,
                 data,
                 labels,
                 batch_size=1,
                 shuffle=True):
        self.data = data
        self.labels_sex = labels[0]
        self.labels_color = labels[1]
        self.batch_size = batch_size
        self.shuffle = shuffle

    def __len__(self):
        return int(np.ceil(self.data.shape[0] / self.batch_size))

    def __iter__(self):
        n = self.data.shape[0]
        idxlist = list(range(n))
        if self.shuffle:
            np.random.shuffle(idxlist)

        for _, start_idx in enumerate(range(0, n, self.batch_size)):
            end_idx = min(start_idx + self.batch_size, n)
            data = self.data[idxlist[start_idx:end_idx]]
            labels_sex = self.labels_sex[idxlist[start_idx:end_idx]]
            labels_color = self.labels_color[idxlist[start_idx:end_idx]]

            yield data, labels_sex, labels_color


# define metrics for evaluation of the model

# it computes the overall satisfaction level on the knowledge base using the given data loader (train or test)
def compute_sat_level(loader):
    mean_sat = 0
    for data, labels_sex, labels_color in loader:
        x = ltn.Variable("x", data)
        x_blue = ltn.Variable("x_blue", data[labels_color == 0])
        x_orange = ltn.Variable("x_orange", data[labels_color == 1])
        x_male = ltn.Variable("x_male", data[labels_sex == 2])
        x_female = ltn.Variable("x_female", data[labels_sex == 3])
        mean_sat += SatAgg(
            Forall(x_blue, P(x_blue, l_blue)),
            Forall(x_orange, P(x_orange, l_orange)),
            Forall(x_male, P(x_male, l_male)),
            Forall(x_female, P(x_female, l_female)),
            Forall(x, Not(And(P(x, l_blue), P(x, l_orange)))),
            Forall(x, Not(And(P(x, l_male), P(x, l_female))))
        )
    mean_sat /= len(loader)
    return mean_sat

# it computes the overall accuracy of the predictions of the trained model using the given data loader
# (train or test)
def compute_accuracy(loader, threshold=0.5):
    mean_accuracy = 0.0
    for data, labels_sex, labels_color in loader:
        predictions = mlp(data).detach().numpy()
        labels_male = (labels_sex == 2)
        labels_female = (labels_sex == 3)
        labels_blue = (labels_color == 0)
        labels_orange = (labels_color == 1)
        onehot = np.stack([labels_blue, labels_orange, labels_male, labels_female], axis=-1).astype(np.int32)
        predictions = predictions > threshold
        predictions = predictions.astype(np.int32)
        nonzero = np.count_nonzero(onehot - predictions, axis=-1).astype(np.float32)
        multilabel_hamming_loss = nonzero / predictions.shape[-1]
        mean_accuracy += np.mean(1 - multilabel_hamming_loss)

    return mean_accuracy / len(loader)

# create train and test loader
train_loader = DataLoader(train_data, (train_sex_labels, train_color_labels), 64, shuffle=True)
test_loader = DataLoader(test_data, (test_sex_labels, test_color_labels), 64, shuffle=False)

### Learning

Let us define $D$ the data set of all examples. The objective function is given by $\operatorname{SatAgg}_{\phi \in \mathcal{K}} \mathcal{G}_{\boldsymbol{\theta}, x \leftarrow \boldsymbol{D}}(\phi)$.

In practice, the optimizer uses the following loss function:

$\boldsymbol{L}=\left(1-\underset{\phi \in \mathcal{K}}{\operatorname{SatAgg}} \mathcal{G}_{\boldsymbol{\theta}, x \leftarrow \boldsymbol{B}}(\phi)\right)$

where $B$ is a mini batch sampled from $D$.

### Querying

To illustrate the learning of constraints by LTN, we have queried three formulas
that were not explicitly part of the knowledge base, over time during learning:
- $\phi_1: \forall x (P(x, l_{blue}) \implies \lnot P(x, l_{orange}))$;
- $\phi_2: \forall x (P(x, l_{blue}) \implies P(x, l_{orange}))$;
- $\phi_3: \forall x (P(x, l_{blue}) \implies P(x, l_{male}))$.

For querying, we use $p=5$ when approximating the universal quantifiers with
`pMeanError`. A higher $p$ denotes a stricter universal quantification with a stronger
focus on outliers. We should expect $\phi_1$ to hold true (every
blue crab cannot be orange and vice-versa), and we should expect $\phi_2$ (every blue crab is also orange) and $\phi_3$
(every blue crab is male) to be false.

In the following, we define some functions computing the three formulas and the implication connective, since we need it
to define the formulas.

In [4]:
Implies = ltn.Connective(ltn.fuzzy_ops.ImpliesReichenbach())

def phi1(features):
    x = ltn.Variable("x", features)
    return Forall(x, Implies(P(x, l_blue), Not(P(x, l_orange))), p=5)

def phi2(features):
    x = ltn.Variable("x", features)
    return Forall(x, Implies(P(x, l_blue), P(x, l_orange)), p=5)

def phi3(features):
    x = ltn.Variable("x", features)
    return Forall(x, Implies(P(x, l_blue), P(x, l_male)), p=5)

# it computes the satisfaction level of a formula phi using the given data loader (train or test)
def compute_sat_level_phi(loader, phi):
    mean_sat = 0
    for features, _, _ in loader:
        mean_sat += phi(features).value
    mean_sat /= len(loader)
    return mean_sat

In the following, we learn our LTN in the multi-class multi-label classification task using the satisfaction of the knowledge base as
an objective. In other words, we want to learn the parameters $\theta$ of binary predicate $P$ in such a way the three
axioms in the knowledge base are maximally satisfied. We train our model for 500 epochs and use the `Adam` optimizer.

In [5]:
optimizer = torch.optim.Adam(P.parameters(), lr=0.001)

for epoch in range(500):
    train_loss = 0.0
    for batch_idx, (data, labels_sex, labels_color) in enumerate(train_loader):
        optimizer.zero_grad()
        # we ground the variables with current batch data
        x = ltn.Variable("x", data)
        x_blue = ltn.Variable("x_blue", data[labels_color == 0])
        x_orange = ltn.Variable("x_orange", data[labels_color == 1])
        x_male = ltn.Variable("x_male", data[labels_sex == 2])
        x_female = ltn.Variable("x_female", data[labels_sex == 3])
        sat_agg = SatAgg(
            Forall(x_blue, P(x_blue, l_blue)),
            Forall(x_orange, P(x_orange, l_orange)),
            Forall(x_male, P(x_male, l_male)),
            Forall(x_female, P(x_female, l_female)),
            Forall(x, Not(And(P(x, l_blue), P(x, l_orange)))),
            Forall(x, Not(And(P(x, l_male), P(x, l_female))))
        )
        loss = 1. - sat_agg
        loss.backward()
        optimizer.step()
        train_loss += loss.item()
    train_loss = train_loss / len(train_loader)

    # we print metrics every 20 epochs of training
    if epoch % 20 == 0:
        print(" epoch %d | loss %.4f | Train Sat %.3f | Test Sat %.3f | Train Acc %.3f | Test Acc %.3f | "
                        "Test Sat Phi 1 %.3f | Test Sat Phi 2 %.3f | Test Sat Phi 3 %.3f " %
              (epoch, train_loss, compute_sat_level(train_loader),
                        compute_sat_level(test_loader),
                        compute_accuracy(train_loader), compute_accuracy(test_loader),
                        compute_sat_level_phi(test_loader, phi1), compute_sat_level_phi(test_loader, phi2),
                        compute_sat_level_phi(test_loader, phi3)))

 epoch 0 | loss 0.5147 | Train Sat 0.502 | Test Sat 0.510 | Train Acc 0.497 | Test Acc 0.500 | Test Sat Phi 1 0.813 | Test Sat Phi 2 0.882 | Test Sat Phi 3 0.838 
 epoch 20 | loss 0.3738 | Train Sat 0.626 | Test Sat 0.626 | Train Acc 0.531 | Test Acc 0.475 | Test Sat Phi 1 0.554 | Test Sat Phi 2 0.794 | Test Sat Phi 3 0.752 
 epoch 40 | loss 0.3693 | Train Sat 0.631 | Test Sat 0.630 | Train Acc 0.500 | Test Acc 0.500 | Test Sat Phi 1 0.554 | Test Sat Phi 2 0.787 | Test Sat Phi 3 0.782 
 epoch 60 | loss 0.3562 | Train Sat 0.644 | Test Sat 0.641 | Train Acc 0.770 | Test Acc 0.725 | Test Sat Phi 1 0.578 | Test Sat Phi 2 0.742 | Test Sat Phi 3 0.809 
 epoch 80 | loss 0.3108 | Train Sat 0.689 | Test Sat 0.679 | Train Acc 0.812 | Test Acc 0.800 | Test Sat Phi 1 0.653 | Test Sat Phi 2 0.594 | Test Sat Phi 3 0.804 
 epoch 100 | loss 0.2584 | Train Sat 0.740 | Test Sat 0.718 | Train Acc 0.861 | Test Acc 0.819 | Test Sat Phi 1 0.741 | Test Sat Phi 2 0.397 | Test Sat Phi 3 0.784 
 epoch 120 | los

Notice that variables $x_{blue}$, $x_{orange}$, $x_{male}$, and $x_{female}$ are grounded batch by batch with new data
arriving from the data loader. This is exactly what
we mean with $\mathcal{G}_{x \leftarrow \boldsymbol{B}}(\phi(x))$, where $B$ is a mini-batch sampled by the data loader.

Notice also that `SatAgg` takes as input the four axioms and returns one truth value which can be interpreted as the satisfaction
level of the knowledge base.

Note that after 100 epochs the test accuracy is around 1. This shows the power of LTN in learning
the multi-class multi-label classification task only using the satisfaction of a knowledge base as an objective.

At the beginning of the training, the truth values of $\phi_1$, $\phi_2$, and $\phi_3$ are non-informative. Instead, during
training, one can see a trend towards the satisfaction of $\phi_1$, and an opposite trend for $\phi_2$ and $\phi_3$, as expected.
This shows the ability of LTN to query and reason on never-seen formulas.
