<a href="https://colab.research.google.com/github/rahiakela/machine-learning-algorithms/blob/main/neural-networks-from-scratch/05-calculating-loss/cross_entropy_loss_from_scratch.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Cross-Entropy Loss from Scratch

To train a model, we tweak the weights and biases to improve the model’s accuracy and confidence. To do this, we calculate how much error the model has. The loss function, also referred to as the cost function, is the algorithm that quantifies how wrong a model is. Loss is the value this function produces. Since loss is the model’s error, we ideally want it to be 0.

The output of a neural network is actually confidence, and more confidence in the correct answer is better. Because of this, we strive to increase correct confidence and decrease misplaced confidence.

## Setup

In [1]:
import numpy as np
import math

## Categorical Cross-Entropy Loss

If you’re familiar with linear regression, then you already know one of the loss functions used with neural networks that do regression: squared error (or mean squared error with neural networks).

The model has a softmax activation function for the output layer, which means it’s outputting a probability distribution. Categorical cross-entropy is explicitly used to compare a ground-truth probability distribution (`y` or `targets`) and some predicted distribution (`y-hat` or `predictions`), so it makes sense to use cross-entropy here. It is also one of the most commonly used loss functions with a softmax activation on the output layer.

The formula for calculating the categorical cross-entropy of `y` (actual/desired distribution) and `y-hat` (predicted distribution) is:

$$
L_i = - \sum_j y_{i,j} \log(\hat{y}_{i,j})
$$

In general, the log loss error function is what we apply to the output of a binary logistic regression model — there are only two classes in the distribution, each of them applying to a single output (neuron) which is targeted as a `0` or `1`. In our case, we have a classification model that returns a probability distribution over all of the outputs. Cross-entropy
compares two probability distributions. 

In our case, we have a softmax output, let’s say it’s:

In [2]:
softmax_output = [0.7, 0.1, 0.2]

We have 3 class confidences in the above output, and let’s assume that the desired prediction is the first class (index 0, which is currently 0.7). If that’s the intended prediction, then the desired probability distribution is `[1, 0, 0]`.

Arrays or vectors like this are called one-hot, meaning one of the values is “hot” (on), with a value of 1, and the rest are “cold” (off), with values of 0. When comparing the model’s results to a one-hot vector using cross-entropy, the other parts of the equation zero out, and the target probability’s log loss is multiplied by 1, making the cross-entropy calculation relatively simple. This is also a special case of the cross-entropy calculation, called categorical cross-entropy.
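
Written out, the collapse for a one-hot target can be sketched as follows, where $k$ (a symbol introduced here purely for illustration) is the index of the target class for sample $i$:

$$
L_i = - \sum_j y_{i,j} \log(\hat{y}_{i,j}) = - \left( 1 \cdot \log(\hat{y}_{i,k}) + \sum_{j \neq k} 0 \cdot \log(\hat{y}_{i,j}) \right) = - \log(\hat{y}_{i,k})
$$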

<img src='https://github.com/rahiakela/machine-learning-algorithms/blob/main/neural-networks-from-scratch/05-calculating-loss/images/1.png?raw=1' width='800'/>

To exemplify this: if we take a softmax output of `[0.7, 0.1, 0.2]` and targets of `[1, 0, 0]`, we can apply the calculations as follows:

In [3]:
L = -(1 * math.log(0.7) + 0 * math.log(0.1) + 0 * math.log(0.2))
print(L)

0.35667494393873245


In [4]:
L = -(-0.35667494393873245 + 0  + 0)
print(L)

0.35667494393873245


Let’s see the Python code for this:

In [5]:
# An example output from the output layer of the neural network
softmax_output = [0.7, 0.1, 0.2]
# Ground truth
target_output = [1, 0, 0]

loss = -(math.log(softmax_output[0]) * target_output[0] +
         math.log(softmax_output[1]) * target_output[1] +
         math.log(softmax_output[2]) * target_output[2])

print(loss)

0.35667494393873245


That’s the full categorical cross-entropy calculation, but we can make a few assumptions given one-hot target vectors.
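
One such simplification, sketched here under the assumption that the target is strictly one-hot: every term multiplied by 0 drops out, so only the negative log of the confidence at the target index remains. The names `target_index`, `simplified_loss`, and `full_loss` are introduced here for illustration.

```python
import math

softmax_output = [0.7, 0.1, 0.2]
target_output = [1, 0, 0]

# Position of the single 1 in the one-hot target vector
target_index = target_output.index(1)

# Only the target class contributes to the sum
simplified_loss = -math.log(softmax_output[target_index])

# Full sum over all classes, for comparison
full_loss = -sum(t * math.log(p) for t, p in zip(target_output, softmax_output))

print(simplified_loss)
print(full_loss)
```

Both values should agree, since the non-target terms are multiplied by 0.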

A model’s output confidences might look like `[0.22, 0.6, 0.18]` or `[0.32, 0.36, 0.32]`. In both cases, the argmax of these vectors returns the second class as the prediction, but the model’s confidence in that prediction is high only for the first of them. Categorical cross-entropy loss accounts for that and outputs a larger loss the lower the confidence is:

In [6]:
print(math.log(1.))
print(math.log(0.95))
print(math.log(0.9))
print(math.log(0.8))
print('...')
print(math.log(0.2))
print(math.log(0.1))
print(math.log(0.05))
print(math.log(0.01))

0.0
-0.05129329438755058
-0.10536051565782628
-0.2231435513142097
...
-1.6094379124341003
-2.3025850929940455
-2.995732273553991
-4.605170185988091


When the confidence level equals 1, meaning the model is 100% “sure” about its prediction, the loss value for this sample equals 0. The loss value rises as the confidence level approaches 0. Note that the values printed above are the raw logs; the loss is their negation, so it is always non-negative. You might also wonder why we did not print the result of log(0); we’ll explain that shortly.
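
To tie this back to the two example confidence vectors mentioned earlier, here is a minimal sketch. It assumes the second class (index 1) is the correct one for both vectors, which the text does not state explicitly; the variable names are illustrative.

```python
import numpy as np

# The two example confidence vectors; class 1 is ASSUMED to be the target
confident = np.array([0.22, 0.6, 0.18])
uncertain = np.array([0.32, 0.36, 0.32])
target_index = 1

for name, output in (("confident", confident), ("uncertain", uncertain)):
    predicted_class = np.argmax(output)   # same prediction for both vectors
    loss = -np.log(output[target_index])  # larger when confidence is lower
    print(name, predicted_class, loss)
```

Both vectors yield the same predicted class, but the less confident one receives the larger loss.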

Log is short for logarithm and is defined as the solution for the x-term in an equation of the form $a^x = b$.

For example, $10^x = 100$ can be solved with a log: $\log_{10}(100)$, which evaluates to 2. This property of the log function is especially beneficial when $e$ (Euler’s number, `~2.71828`) is used as the base (where 10 is in the example). The logarithm with $e$ as its base is referred to as the natural logarithm, natural log, or simply log. You may also see this written as ln: $\ln(x) = \log(x) = \log_e(x)$.
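
These relationships can be checked with Python’s standard `math` module. The following is a small sketch; the variable names are illustrative.

```python
import math

# log base 10 of 100: the x that solves 10**x == 100
base10_result = math.log10(100)
print(base10_result)

# math.log with a single argument is the natural log (base e)
natural_log_of_e = math.log(math.e)
print(natural_log_of_e)

# Change of base: log_10(100) expressed through natural logs
change_of_base = math.log(100) / math.log(10)
print(change_of_base)
```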

The natural log represents the solution for the x-term in the equation $e^x = b$; for example, $e^x = 5.2$ is solved by $\log(5.2)$.

In [7]:
b = 5.2
print(np.log(b))

1.6486586255873816


We can confirm this by exponentiating our result:

In [8]:
print(math.e ** 1.6486586255873816)

5.199999999999999


The small difference is the result of floating-point precision in Python.

Consider a scenario with a neural network that performs classification between three classes, and the neural network classifies in batches of three. After running through the softmax activation function with a batch of 3 samples and 3 classes, the network’s output layer yields:

In [9]:
# Probabilities for 3 samples
softmax_outputs = np.array([
  [ 0.7 , 0.1 , 0.2 ],
  [ 0.1 , 0.5 , 0.4 ],
  [ 0.02 , 0.9 , 0.08 ]
])

We need a way to dynamically calculate the categorical cross-entropy, which we now know is a negative log calculation. To determine which value in the softmax output to calculate the negative log from, we simply need to know our target values.

In this example, there are 3 classes; let’s say we’re trying to classify something as a “dog,” “cat,” or “human.” A dog is class 0 (at index 0), a
cat class 1 (index 1), and a human class 2 (index 2). 

Let’s assume the batch of three sample inputs to this neural network is being mapped to the target values of a dog, cat, and cat. So the targets (as a list of target indices) would be `[0, 1, 1]`.

In [10]:
softmax_outputs = [
  [ 0.7 , 0.1 , 0.2 ],
  [ 0.1 , 0.5 , 0.4 ],
  [ 0.02 , 0.9 , 0.08 ]
]
class_targets = [ 0 , 1 , 1 ] # dog, cat, cat

- The first value, 0, in `class_targets` means the first softmax output distribution’s intended prediction was the one at index 0 of `[0.7, 0.1, 0.2]`; the model has a 0.7 confidence score that this observation is a dog.

- This continues throughout the batch, where the intended target of the 2nd softmax distribution, `[0.1, 0.5, 0.4]`, was at an index of 1; the model only has a 0.5 confidence score that this is a cat, so it is less certain about this observation.

- In the last sample, the target is also at index 1 (the 2nd value) of the softmax distribution, a value of 0.9 in this case, which is a pretty high confidence.

With a collection of softmax outputs and their intended targets, we can map these indices to retrieve the values from the softmax distributions:

In [11]:
for target_index, distribution in zip(class_targets, softmax_outputs):
  print(distribution[target_index])

0.7
0.5
0.9


This can be further simplified using NumPy (we’re creating a NumPy array of the Softmax outputs this time):

In [12]:
# Probabilities for 3 samples
softmax_outputs = np.array([
  [ 0.7 , 0.1 , 0.2 ],
  [ 0.1 , 0.5 , 0.4 ],
  [ 0.02 , 0.9 , 0.08 ]
])

class_targets = [0, 1, 1] # dog, cat, cat

print(softmax_outputs[[0, 1, 2], class_targets])

[0.7 0.5 0.9]


What are the 0, 1, and 2 values? NumPy lets us index an array in multiple ways. One of them is to use a list filled with indices, and that’s convenient for us: we can use `class_targets` for this purpose, as it already contains the list of indices that we are interested in. The catch is that these indices select along the array’s second dimension, the class confidences within each row.

To pick one value per row, we also need to explicitly index the array in its first dimension. This dimension contains the per-sample predictions and we, of course, want to retain them all. We can achieve that by using a list containing the numbers from 0 up to the last row index. We know we’re going to have as many indices as distributions in our entire batch, so we can use `range()` instead of typing each value ourselves:

In [13]:
print(softmax_outputs[range(len(softmax_outputs)), class_targets])

[0.7 0.5 0.9]
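
To see why the row indices are needed, the sketch below compares indexing only the second dimension with indexing both. It uses `np.arange` in place of `range` (both produce the same row indices here); the names `all_rows` and `per_sample` are illustrative.

```python
import numpy as np

softmax_outputs = np.array([
    [0.7, 0.1, 0.2],
    [0.1, 0.5, 0.4],
    [0.02, 0.9, 0.08],
])
class_targets = [0, 1, 1]  # dog, cat, cat

# Indexing only the second dimension applies every target index to every row,
# so the result is still a 2D array rather than one value per sample
all_rows = softmax_outputs[:, class_targets]
print(all_rows.shape)

# Pairing row i with class_targets[i] picks exactly one confidence per sample
per_sample = softmax_outputs[np.arange(len(softmax_outputs)), class_targets]
print(per_sample)
```

The values we want sit on the diagonal of `all_rows`; supplying the row indices selects them directly.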


This returns an array of the confidences at the target indices for each of the samples. Now we apply the negative log to this array:

In [14]:
print(-np.log(softmax_outputs[range(len(softmax_outputs)), class_targets]))

[0.35667494 0.69314718 0.10536052]


Finally, we want an average loss per batch to have an idea about how our model is doing during training. There are many ways to calculate an average in Python; the most basic form of an average is the arithmetic mean: `sum(iterable) / len(iterable)`. NumPy has a function, `np.mean()`, that computes this average on arrays, so we will use that instead.

In [15]:
neg_log = -np.log(softmax_outputs[range(len(softmax_outputs)), class_targets])
average_loss = np.mean(neg_log)

print(average_loss)

0.38506088005216804
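
As a quick sanity check, the plain-Python arithmetic mean and `np.mean` should agree on the sample-wise losses. This is a sketch using the three target confidences from above; `plain_mean` and `numpy_mean` are illustrative names.

```python
import numpy as np

# Sample-wise losses for the target confidences 0.7, 0.5, and 0.9
neg_log = -np.log(np.array([0.7, 0.5, 0.9]))

plain_mean = sum(neg_log) / len(neg_log)  # arithmetic mean by hand
numpy_mean = np.mean(neg_log)             # same quantity via NumPy

print(plain_mean, numpy_mean)
```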


We have already learned that targets can be one-hot encoded, where all values except for one are zeros, and the correct label’s position is filled with 1. Since we implemented this to work with sparse labels (as in our training data), we have to add a check for whether they are one-hot encoded and handle that case a bit differently. The check can be performed by counting the dimensions: if targets are single-dimensional (like a list), they are sparse, but if there are 2 dimensions (like a list of lists), then there is a set of one-hot encoded vectors.

In this second case, instead of picking out the confidences at the target labels, we have to multiply the confidences by the targets, zeroing out all values except the ones at the correct labels, and then sum along the row axis (axis 1).

We have to add a test to the code we just wrote for the number of dimensions,
move calculations of the log values outside of this new if statement, and implement the solution for the one-hot encoded labels following the first equation:

In [16]:
# Probabilities for 3 samples
softmax_outputs = np.array([
  [ 0.7 , 0.1 , 0.2 ],
  [ 0.1 , 0.5 , 0.4 ],
  [ 0.02 , 0.9 , 0.08 ]
])

class_targets = np.array([
  [1, 0, 0],
  [0, 1, 0],
  [0, 1, 0]
])

# Probabilities for target values - only if categorical labels
if len(class_targets.shape) == 1:
  correct_confidences = softmax_outputs[range(len(softmax_outputs)), class_targets]
elif len(class_targets.shape) == 2:  # Mask values - only for one-hot encoded labels
  correct_confidences = np.sum(softmax_outputs * class_targets, axis=1)

# Losses
neg_log = - np.log(correct_confidences)
average_loss = np.mean(neg_log)

print(average_loss)

0.38506088005216804
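
One way to confirm that both label formats lead to the same result is to wrap the branch in a small helper and call it twice. The sketch below uses a hypothetical function name, `mean_cross_entropy`, and builds the one-hot targets from the sparse ones with `np.eye`; it does not yet guard against confidences of exactly 0.

```python
import numpy as np

def mean_cross_entropy(y_pred, y_true):
    """Average categorical cross-entropy for sparse or one-hot targets."""
    y_pred = np.asarray(y_pred)
    y_true = np.asarray(y_true)
    if y_true.ndim == 1:
        # Sparse labels: pick the confidence at each sample's target index
        correct_confidences = y_pred[np.arange(len(y_pred)), y_true]
    elif y_true.ndim == 2:
        # One-hot labels: zero out non-target confidences, then sum each row
        correct_confidences = np.sum(y_pred * y_true, axis=1)
    else:
        raise ValueError("y_true must have 1 or 2 dimensions")
    return np.mean(-np.log(correct_confidences))

softmax_outputs = np.array([
    [0.7, 0.1, 0.2],
    [0.1, 0.5, 0.4],
    [0.02, 0.9, 0.08],
])
sparse_targets = np.array([0, 1, 1])         # dog, cat, cat
one_hot_targets = np.eye(3)[sparse_targets]  # class indices -> one-hot rows

sparse_loss = mean_cross_entropy(softmax_outputs, sparse_targets)
one_hot_loss = mean_cross_entropy(softmax_outputs, one_hot_targets)
print(sparse_loss, one_hot_loss)
```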


Before we move on, there is one additional problem to solve. The softmax output, which is also an input to this loss function, consists of numbers in the range from 0 to 1 - a list of confidences.
It is possible that the model will have full confidence for one label making all the remaining confidences zero. Similarly, it is also possible that the model will assign full confidence to a value that wasn’t the target.

If we then try to calculate the loss of this confidence of 0:

In [17]:
-np.log(0)


inf

From the mathematical point of view, $\log(0)$ is undefined. We already know the following dependence: if $y = \log(x)$, then $e^y = x$. The question of what the resulting $y$ is in $y = \log(0)$ is the same as the question of what the $y$ is in $e^y = 0$. No such $y$ exists, since $e^y$ is positive for every real $y$.

In Python with NumPy, the negative natural logarithm of 0 evaluates to an infinitely big number (`inf`) rather than being undefined, and NumPy prints a warning about a division by 0. Consistent with this, raising $e$ to the power of negative infinity evaluates to 0:

In [18]:
np.e ** (-np.inf)

0.0
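
The same trend can be seen with finite inputs. In this small sketch, `values` is an illustrative name: as the exponent becomes more negative, the result shrinks towards 0 without reaching it.

```python
import numpy as np

# e**y for increasingly negative y: the result shrinks but stays above 0
exponents = np.array([-1.0, -10.0, -100.0])
values = np.exp(exponents)

for y, value in zip(exponents, values):
    print(y, value)
```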

We’ll run into a similar situation when calculating the derivative of the absolute value function, which does not exist for an input of 0, and we’ll have to make some decisions to work around this.

With optimization, we will also have a problem calculating gradients, starting with a mean value of all sample-wise losses since a single infinite value in a list will cause the average of that list to also be infinite:

In [19]:
np.mean([1, 2, 3, -np.log(0)])


inf

We could add a very small value to the confidence to prevent it from being a zero.

In [20]:
-np.log(1e-7)

16.11809565095832

Adding a very small value, one ten-millionth (`1e-7`), to the confidence at its far edge will insignificantly impact the result, but this method yields two additional issues.

First, in the case where the confidence value is 1:

In [22]:
-np.log(1 + 1e-7)

-9.999999505838704e-08

When the model is fully correct in a prediction and puts all the confidence in the correct label, loss becomes a negative value instead of being 0.

The other problem here is shifting confidence towards 1, even if by a very small value. To prevent both issues, it’s better to clip values from both sides by the same number, `1e-7` in our case.

That means that the lowest possible value will become `1e-7` (like in the demonstration we just performed) but the highest possible value, instead of being `1+1e-7`, will become `1-1e-7` (so slightly less than 1):

In [23]:
-np.log(1 - 1e-7)

1.0000000494736474e-07

This will prevent loss from being exactly 0, making it a very small value instead, but won’t make it a negative value and won’t bias overall loss towards 1.

Within our code, using NumPy, we’ll accomplish that with the `np.clip()` function:

In [26]:
y_pred = softmax_outputs
y_pred_clipped = np.clip(y_pred, 1e-7, 1 - 1e-7)
y_pred_clipped

array([[0.7 , 0.1 , 0.2 ],
       [0.1 , 0.5 , 0.4 ],
       [0.02, 0.9 , 0.08]])

This function can perform clipping on an array of values, so we can apply it to the predictions directly and save this as a separate array, which we’ll use shortly.
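
To illustrate the effect of clipping on the loss itself, here is a minimal sketch with a hypothetical two-sample batch in which the second sample puts zero confidence on its target class. The batch values and variable names are made up for illustration.

```python
import numpy as np

# Hypothetical batch: the second sample assigns 0 confidence to its target
y_pred = np.array([
    [0.7, 0.1, 0.2],
    [1.0, 0.0, 0.0],
])
y_true = np.array([0, 1])
rows = np.arange(len(y_pred))

# Without clipping, the zero confidence produces an infinite loss
# (the divide-by-zero warning is silenced only for this demonstration)
with np.errstate(divide="ignore"):
    raw_losses = -np.log(y_pred[rows, y_true])

# With clipping, every confidence lies in [1e-7, 1 - 1e-7]
y_pred_clipped = np.clip(y_pred, 1e-7, 1 - 1e-7)
clipped_losses = -np.log(y_pred_clipped[rows, y_true])

print(raw_losses)
print(clipped_losses)
print(np.mean(clipped_losses))
```

With clipping, the worst-case sample loss is capped at `-np.log(1e-7)`, so the batch mean stays finite.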

## Categorical Cross-Entropy Loss Class