# Multiclass logarithmic loss

A quick look at the definition and behavior of the logloss as used in the Kaggle Tabular Playground Series competition of May 2021. The logloss uses natural logs rather than log2 (as in https://en.wikipedia.org/wiki/Cross_entropy). In addition, predicted probabilities are clipped with an epsilon value so that the loss can never become infinite. The definition used in the competition is exactly the one from sklearn. TensorFlow's implementation is similar (with a larger epsilon), as is the one in LightGBM (https://github.com/microsoft/LightGBM/blob/master/src/metric/multiclass_metric.hpp).
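
To make the definition concrete, here is a minimal NumPy sketch of a clipped multiclass logloss. The function name `multiclass_logloss` and the demo arrays are made up for illustration. The sketch only clips the probabilities and does not renormalize rows, so it is an approximation of what the libraries do, not a copy of any of their implementations.

```python
import numpy as np

def multiclass_logloss(y_true, y_pred, eps=1e-15):
    """Mean negative natural log of the probability given to the true class.

    Illustrative helper (not a library function): probabilities are clipped
    to [eps, 1 - eps] so that log(0) can never occur.
    """
    y_true = np.asarray(y_true, dtype=int)
    y_pred = np.clip(np.asarray(y_pred, dtype=float), eps, 1 - eps)
    # Pick, for every row, the predicted probability of its true class
    picked = y_pred[np.arange(len(y_true)), y_true]
    return float(-np.mean(np.log(picked)))

# Toy data (assumption, not competition data): always predict class 1
demo_true = [1, 0, 1, 2]
demo_pred = [[0.0, 1.0, 0.0, 0.0]] * len(demo_true)

loss_15 = multiclass_logloss(demo_true, demo_pred)           # eps = 1e-15
loss_7 = multiclass_logloss(demo_true, demo_pred, eps=1e-7)  # larger eps
print(loss_15, loss_7)
```

Half of the toy rows are wrong, so the first value is about 0.5 * 34.54 and the second about 0.5 * 16.12: the choice of epsilon directly sets the penalty for a confident wrong prediction.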

# Loading data

In [None]:
import pandas as pd
import numpy as np
train = pd.read_csv('../input/tabular-playground-series-may-2021/train.csv')
test = pd.read_csv('../input/tabular-playground-series-may-2021/test.csv')
train['target']

In [None]:
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
le.fit(train['target'].unique())
le.classes_

In [None]:
target = le.transform(train['target'])
target

# Submission

In [None]:
def submit(test_pred, filename):
    submission = pd.DataFrame(test_pred, columns=le.classes_)
    submission.insert(0, 'id', test['id'])
    submission.to_csv(filename, index=False)

# Baseline model: majority class

In [None]:
from sklearn.metrics import log_loss

In [None]:
import numpy as np
labels, frequencies = np.unique(target, return_counts=True)
(labels, frequencies)

In [None]:
frequencies/len(train)

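Calculated value (loss per row, with probabilities clipped to epsilon = 10^-15):
* y = class 1 (57.5% of cases): logloss = -log(1) = 0
* y != class 1 (42.5% of cases): logloss = -log(10^-15) = 34.5

The mean over all rows is therefore about 0.425 * 34.5 = 14.7.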

In [None]:
pred = [[0.0,1.0,0.0,0.0]]*len(train)

In [None]:
import math
print(f'logloss:{-math.log(10**-15)*(1-frequencies[1]/len(train))}')

Calculate with sklearn

In [None]:
log_loss(target, pred, labels=[0,1,2,3])

Calculate with TensorFlow, which uses an epsilon of 10^-7 (non-configurable?), so the expected value is lower:

In [None]:
import math
print(f'logloss:{-math.log(10**-7)*(1-frequencies[1]/len(train))}')

In [None]:
from tensorflow.keras.losses import SparseCategoricalCrossentropy
scc = SparseCategoricalCrossentropy()
scc(target, pred).numpy()

## Check with leaderboard

In [None]:
test_pred = [[0.0,1.0,0.0,0.0]]*len(test)
submit(test_pred, 'baseline_majority.csv')

Leaderboard score is 14.62209, which seems to match the sklearn value.

# Baseline model: a priori probabilities

Calculated value (loss per row):
* y = class 0 (8.49% of cases): logloss = -log(0.0849) = 2.47
* y = class 1 (57.5% of cases): logloss = -log(0.57497) = 0.55
* etc.

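This is the best constant prediction: among all predictions that ignore the features, the class priors minimize the expected logloss, which then equals the entropy of the class distribution.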
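
A small numerical check of that claim, as a sketch: the class distribution `p` below is illustrative and is not the competition's actual set of frequencies, and `expected_logloss` is a helper made up for this example. It compares the loss at the priors with the loss of many random constant predictions.

```python
import numpy as np

def expected_logloss(p, q):
    """Expected logloss when classes occur with probabilities p
    and the same distribution q is predicted for every row."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    return float(-np.sum(p * np.log(q)))

# Illustrative class distribution (assumption, NOT the real frequencies)
p = np.array([0.1, 0.6, 0.2, 0.1])

# Predicting the priors themselves gives the entropy of p
loss_at_priors = expected_logloss(p, p)

# Random alternative constant predictions, drawn uniformly from the simplex
rng = np.random.default_rng(0)
candidates = rng.dirichlet(np.ones(len(p)), size=1000)
losses = np.array([expected_logloss(p, q) for q in candidates])

print(loss_at_priors, losses.min())
```

None of the random candidates should beat the priors, which is what Gibbs' inequality predicts.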

In [None]:
pred = frequencies/len(train)

In [None]:
print(f'logloss:{-np.sum(pred * np.log(pred))}')

In [None]:
pred = np.tile(frequencies/len(train), (len(train),1))
pred

In [None]:
log_loss(target, pred, labels=[0,1,2,3])

In [None]:
from tensorflow.keras.losses import SparseCategoricalCrossentropy
scc = SparseCategoricalCrossentropy()
scc(target, pred).numpy()

## Check with leaderboard

In [None]:
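# Priors come from the training set, so normalize by len(train), not len(test);
# dividing by len(test) would give rows that do not sum to 1.
test_pred = np.tile(frequencies/len(train), (len(test),1))
submit(test_pred, 'baseline_apriori.csv')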

Leaderboard score is 1.11634, which seems to match.