# Chapter 5: Calculating Network Error with Loss

The **loss function**, also referred to as the **cost function**, is the algorithm that quantifies how wrong a model is. **Loss** is the measure of this metric. Since loss is the model’s error, we ideally want it to be 0. <br>
You may wonder why we do not calculate the error of a model based on the argmax accuracy. <br>
Recall our earlier example of confidence: [0.22, 0.6, 0.18] vs [0.32, 0.36, 0.32]. If the correct class were indeed the middle one (index 1), the model accuracy would be identical between the two. But are these two examples really as accurate as each other? <br>
They are not. Accuracy simply applies an argmax to the output to find the index of the biggest value, so it ignores how confident the model was. The output of a neural network is actually confidence, and more confidence in the correct answer is better. Because of this, we strive to **increase correct confidence and decrease misplaced confidence.**

**Categorical Cross-Entropy Loss**

We are dealing with a classification problem here. The model has a softmax activation function for the output layer, which means it outputs a probability distribution. **Categorical cross-entropy** is explicitly used to compare a “**ground-truth**” probability distribution ($y$, or “targets”) and some predicted distribution ($\hat{y}$, or “predictions”), so it makes sense to use cross-entropy here. **It is also one of the most commonly used loss functions with a softmax activation on the output layer.** <br>
<br>
The formula for calculating the categorical cross-entropy of $y$ (actual/desired distribution) and $\hat{y}$ (predicted distribution) is:
$$L_{i} = -\sum_{j} y_{i,j}\log{\hat{y}_{i,j}}$$
Where $L_{i}$ denotes the sample loss value, $i$ is the i-th sample in the set, $j$ is the label/output index, $y$ denotes the target values, and $\hat{y}$ denotes the predicted values. Here $\log$ is the natural logarithm, as computed by `np.log` later in this chapter. <br>
In our case, we have a classification model that returns a probability distribution over all of the outputs, and cross-entropy compares two probability distributions. Let’s say the softmax output is: <br>
```
softmax_output = [0.7,0.1,0.2]
```
Which probability distribution do we intend to compare this to? <br>
We have 3 class confidences in the above output, and let’s assume that the desired prediction is the first class (index 0, which is currently 0.7). If that’s the intended prediction, then the desired probability distribution is  [1, 0, 0]. <br> <br>
When comparing the model’s results to a one-hot vector using cross-entropy, the other parts of the equation zero out, and the target probability’s log loss is multiplied by 1, making the cross-entropy calculation relatively simple. This is also a special case of the cross-entropy calculation, called categorical cross-entropy. To exemplify this, if we take a softmax output of [0.7, 0.1, 0.2] and targets of [1, 0, 0], we can apply the calculations as follows:
$$L_{i} = -(1 \cdot \log{0.7} + 0 \cdot \log{0.1} + 0 \cdot \log{0.2}) = -\log{0.7} \approx 0.357$$
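The sum above can be evaluated term by term in plain Python. This is only a minimal sketch of the formula for a single sample; the variable names `softmax_output` and `target_output` are chosen here for illustration.

```python
import math

softmax_output = [0.7, 0.1, 0.2]   # predicted distribution (y-hat)
target_output = [1, 0, 0]          # one-hot ground truth (y)

# Sum y_j * log(y_hat_j) over every class index j, then negate.
loss = -sum(y * math.log(y_hat)
            for y, y_hat in zip(target_output, softmax_output))

print(loss)  # roughly 0.3567, i.e. -log(0.7)
```

Only the first term contributes, because the other two are multiplied by a target value of 0.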

For coding purposes, we can simplify it to:
$$L_{i} = -\log{\hat{y}_{i,k}}$$
Where $L_{i}$ denotes the sample loss value, $i$ is the i-th sample in a set, $k$ is the index of the target label (ground-truth label), and $\hat{y}$ denotes the predicted values. <br>
This simplification is possible because the targets are **one-hot** vectors. Arrays or vectors like this are called one-hot, meaning one of the values is “hot” (on), with a value of 1, and the rest are “cold” (off), with values of 0.
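As a small illustration of why the simplification holds, the sketch below builds a one-hot vector from a class index and checks that the full sum and the simplified expression agree. Building the vector from a row of `np.eye` is just one convenient way to do it, and the names `n_classes` and `target_index` are assumptions made for this example.

```python
import numpy as np

n_classes = 3
target_index = 0                      # k: index of the correct class

# Row k of the identity matrix is the one-hot vector for class k.
one_hot_target = np.eye(n_classes, dtype=int)[target_index]
print(one_hot_target)                 # [1 0 0]

softmax_output = np.array([0.7, 0.1, 0.2])

# Full formula: every class takes part in the sum.
full_loss = -np.sum(one_hot_target * np.log(softmax_output))
# Simplified formula: only the confidence at the target index is used.
simplified_loss = -np.log(softmax_output[target_index])

print(np.isclose(full_loss, simplified_loss))   # True
```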

For both these examples
```
[0.22, 0.6, 0.18] 
[0.32, 0.36, 0.32] 
```
the argmax of these vectors will return the second class as the prediction, but the model’s confidence about these predictions is high only for one of them. The Categorical Cross-Entropy Loss accounts for that and **outputs a larger loss the lower the confidence is**.
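To make the difference concrete, here is a minimal sketch that computes both the argmax and the simplified cross-entropy loss for the two vectors, assuming the correct class is index 1 in both cases. The labels `confident` and `hesitant` are made up for this example.

```python
import math

# The two confidence vectors from the text; the correct class is index 1.
examples = {
    "confident": [0.22, 0.6, 0.18],
    "hesitant": [0.32, 0.36, 0.32],
}
target_index = 1

losses = {}
for name, output in examples.items():
    predicted_index = output.index(max(output))   # plain-Python argmax
    losses[name] = -math.log(output[target_index])
    print(name, "argmax:", predicted_index, "loss:", round(losses[name], 4))
# confident argmax: 1 loss: 0.5108
# hesitant argmax: 1 loss: 1.0217
```

Both predictions count as correct for accuracy, yet the hesitant one is penalized with roughly twice the loss.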

So how do we compute this in code? We start with a batch of softmax outputs and their targets:


In [2]:
# A batch of three samples with three classes gives the following softmax distribution
softmax_outputs = [[0.7,0.1,0.2],
                   [0.1,0.5,0.4],
                   [0.02,0.9,0.08]]

# the classes are dogs[0], cats[1] & humans[2]
# target class indices for these three samples
class_targets = [0,1,1]

The first value, 0, in class_targets means the first softmax output distribution’s intended prediction was the one at index 0 of [0.7, 0.1, 0.2]; <br>
the model has a 0.7 confidence score that this observation is a dog. <br>
This continues throughout the batch, where the intended target of the 2nd softmax distribution, [0.1, 0.5, 0.4], was at index 1; <br>
the model only has a 0.5 confidence score that this is a cat, so it is less certain about this observation. <br>
In the last sample, the target is also index 1 of the softmax distribution, a value of 0.9 in this case, which is a pretty high confidence. <br>
We can pick out these target confidences by pairing each target index with its distribution:

In [3]:
for targ_idx,distribution in zip(class_targets,softmax_outputs):
    print(distribution[targ_idx])

0.7
0.5
0.9


In [4]:
# using numpy 
import numpy as np
softmax_outputs = np.array(softmax_outputs)
print(softmax_outputs[range(len(softmax_outputs)),class_targets])

[0.7 0.5 0.9]


In [5]:
print(-np.log(softmax_outputs[range(len(softmax_outputs)),class_targets]))

[0.35667494 0.69314718 0.10536052]


Finally, we want an average loss per batch to have an idea about how our model is doing during training. We'll use the arithmetic mean.

In [6]:
neg_loss = -np.log(softmax_outputs[range(len(softmax_outputs)),class_targets])
avg_loss = np.mean(neg_loss)
print(avg_loss)

0.38506088005216804


The target values can be **sparse** (class indices) or **one-hot** encoded, so we need to add a check in the loss calculation. If the targets have a single dimension, like a list, they are sparse; if they have two dimensions, like a list of lists, they are one-hot encoded. <br>
In this second case, we’ll implement a solution using the first equation from this chapter, instead of filtering out the confidences at the target labels. Multiplying the outputs by the one-hot targets zeroes out every confidence except the one at the target label, and summing along each row leaves just that value.

In [13]:
softmax_outputs = np.array([[0.7,0.1,0.2],
                   [0.1,0.5,0.4],
                   [0.02,0.9,0.08]])
class_targets = np.array([[1,0,0],
                          [0,1,0],
                          [0,1,0]])

In [16]:
# probabilities of target values
# only if sparse (categorical) labels
if len(class_targets.shape) == 1:
    correct_confidences = softmax_outputs[
        range(len(softmax_outputs)),
        class_targets
    ]

# mask values - only for one-hot encoded labels
elif len(class_targets.shape) == 2:
    correct_confidences = np.sum(
        softmax_outputs*class_targets,
        axis = 1
    )

# losses
neg_loss = -np.log(correct_confidences)

average_loss = np.mean(neg_loss)
print(average_loss)


0.38506088005216804


**Another Problem:**
Before we move on, there is one additional problem to solve. The softmax output, which is also an input to this loss function, consists of numbers in the range from 0 to 1, a list of confidences. It is possible that the model will have full confidence for one label, making all the remaining confidences zero. Similarly, it is also possible that the model will assign full confidence to a value that wasn’t the target. In that case the confidence at the target index is 0, and the loss calculation has to take the log of 0, which is negative infinity:

In [20]:
np.log(0)


-inf

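**Solution**: <br>
Clip both ends of the confidence range by a very small number, say one ten-millionth (1e-7), so that every confidence lies between 1e-7 and 1 - 1e-7. <br>
This keeps the loss finite (without clipping, a target confidence of 0 gives $-\log{0} = \infty$). Clipping the upper end by the same amount means the smallest possible loss is about 1e-7 rather than exactly zero.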

In [19]:
-np.log(1e-7)

16.11809565095832

In [21]:
-np.log(1-1e-7)

1.0000000494736474e-07
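The effect of clipping on a whole batch can be sketched as follows. The batch here is hypothetical: its first sample puts zero confidence on its target class, which is enough to make the unclipped batch mean infinite. `np.errstate` is used only to silence the log-of-zero warning in this demonstration.

```python
import numpy as np

# Hypothetical batch: sample 0 puts zero confidence on its target class.
y_pred = np.array([[0.0, 1.0, 0.0],
                   [0.7, 0.2, 0.1]])
y_true = np.array([0, 0])
rows = np.arange(len(y_pred))

# Without clipping: -log(0) is infinite, so the batch mean is infinite too.
with np.errstate(divide="ignore"):      # silence the log(0) warning
    raw_losses = -np.log(y_pred[rows, y_true])
print(np.mean(raw_losses))              # inf

# With clipping: every confidence is kept inside [1e-7, 1 - 1e-7].
y_pred_clipped = np.clip(y_pred, 1e-7, 1 - 1e-7)
clipped_losses = -np.log(y_pred_clipped[rows, y_true])
print(np.isfinite(np.mean(clipped_losses)))   # True
```

A single fully wrong sample therefore still produces a large loss (about 16.1), but it no longer destroys the mean for the rest of the batch.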

**The Categorical Cross-Entropy Loss Class**

No matter which loss function we use, the overall loss is always the mean value of all sample losses. <br>
The Loss class contains the calculate method, which calls our loss object’s forward method and calculates the mean value of the returned sample losses.

In [22]:
# Common loss class
class Loss:
    # Calculates the data and regularization losses 
    # given model output and ground truth values
    def calculate(self,output,y):
        # calculate sample losses
        sample_losses = self.forward(output,y)

        #calculate mean loss
        data_loss = np.mean(sample_losses)

        #Return loss
        return data_loss

In [23]:
#Cross Entropy Loss
class Loss_CategoricalCrossEntropy(Loss): #using inheritance here

    def forward(self,y_pred,y_true):

        # number of sample in a batch
        samples = len(y_pred)

        # Clip data to prevent taking the log of 0 
        # Clip both sides to not drag mean towards any value 
        y_pred_clipped = np.clip(y_pred, 1e-7, 1 - 1e-7) 

        # Probabilities for target values - 
        # only if categorical labels 
        if len(y_true.shape) == 1: 
            correct_confidences = y_pred_clipped[ 
                range(samples), 
                y_true 
            ] 
 
        # Mask values - only for one-hot encoded labels 
        elif len(y_true.shape) == 2: 
            correct_confidences = np.sum( 
                y_pred_clipped * y_true, axis=1 
            ) 

        # Losses
        negative_log_likelihoods = -np.log(correct_confidences)
        return negative_log_likelihoods
    

In [24]:
loss_function = Loss_CategoricalCrossEntropy()
loss = loss_function.calculate(softmax_outputs,class_targets)

In [25]:
print(loss)

0.38506088005216804
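The call above used one-hot targets. As a rough check that the class behaves the same way with sparse labels, the sketch below repeats a condensed version of the two classes so it runs on its own and compares both encodings. The final `else` branch that raises an error is an addition made for this example, not part of the original cells.

```python
import numpy as np

class Loss:
    def calculate(self, output, y):
        # Mean of the per-sample losses returned by the subclass
        return np.mean(self.forward(output, y))

class Loss_CategoricalCrossEntropy(Loss):
    def forward(self, y_pred, y_true):
        y_pred_clipped = np.clip(y_pred, 1e-7, 1 - 1e-7)
        if len(y_true.shape) == 1:        # sparse labels
            correct = y_pred_clipped[range(len(y_pred)), y_true]
        elif len(y_true.shape) == 2:      # one-hot labels
            correct = np.sum(y_pred_clipped * y_true, axis=1)
        else:                             # added here for illustration
            raise ValueError("targets must be 1-D or 2-D")
        return -np.log(correct)

softmax_outputs = np.array([[0.7, 0.1, 0.2],
                            [0.1, 0.5, 0.4],
                            [0.02, 0.9, 0.08]])
sparse_targets = np.array([0, 1, 1])
one_hot_targets = np.array([[1, 0, 0],
                            [0, 1, 0],
                            [0, 1, 0]])

loss_function = Loss_CategoricalCrossEntropy()
sparse_loss = loss_function.calculate(softmax_outputs, sparse_targets)
one_hot_loss = loss_function.calculate(softmax_outputs, one_hot_targets)
print(np.isclose(sparse_loss, one_hot_loss))   # True
```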


**Accuracy calculation** <br>
While loss is a useful metric for optimizing a model, the metric commonly used in practice along with loss is **accuracy**: the fraction of samples for which the largest confidence falls on the correct class.

In [2]:
import numpy as np
# Probabilities of 3 samples 
softmax_outputs = np.array([[0.7, 0.2, 0.1], 
                            [0.5, 0.1, 0.4], 
                            [0.02, 0.9, 0.08]]) 
# Target (ground-truth) labels for 3 samples 
class_targets = np.array([0, 1, 1]) 
 
# Calculate values along second axis (axis of index 1) 
predictions = np.argmax(softmax_outputs, axis=1) 

# If targets are one-hot encoded - convert them 
if len(class_targets.shape) == 2: 
    class_targets = np.argmax(class_targets, axis=1) 

# True evaluates to 1; False to 0 
accuracy = np.mean(predictions==class_targets) 
 
 
print('acc:', accuracy)

acc: 0.6666666666666666
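The same steps can be wrapped in a small function so that sparse and one-hot targets are handled in one place. This is only a sketch; the helper name `accuracy_from_outputs` is hypothetical and does not come from the chapter.

```python
import numpy as np

def accuracy_from_outputs(outputs, targets):
    """Hypothetical helper: fraction of samples whose argmax matches the target."""
    predictions = np.argmax(outputs, axis=1)
    targets = np.asarray(targets)
    if targets.ndim == 2:                 # one-hot -> class indices
        targets = np.argmax(targets, axis=1)
    return np.mean(predictions == targets)

softmax_outputs = np.array([[0.7, 0.2, 0.1],
                            [0.5, 0.1, 0.4],
                            [0.02, 0.9, 0.08]])

sparse_acc = accuracy_from_outputs(softmax_outputs, [0, 1, 1])
one_hot_acc = accuracy_from_outputs(softmax_outputs,
                                    [[1, 0, 0], [0, 1, 0], [0, 1, 0]])
print(sparse_acc, one_hot_acc)   # both are 2/3: samples 0 and 2 are correct
```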


**Combining everything**


In [3]:
import numpy as np 
import nnfs 
from nnfs.datasets import spiral_data 
 
nnfs.init() 
 
 
# Dense layer 
class Layer_Dense: 
 
    # Layer initialization 
    def __init__(self, n_inputs, n_neurons): 
        # Initialize weights and biases 
        self.weights = 0.01 * np.random.randn(n_inputs, n_neurons) 
        self.biases = np.zeros((1, n_neurons)) 
 
    # Forward pass 
    def forward(self, inputs): 
        # Calculate output values from inputs, weights and biases 
        self.output = np.dot(inputs, self.weights) + self.biases 
 
 
# ReLU activation 
class Activation_ReLU: 
 
    # Forward pass 
    def forward(self, inputs): 
        # Calculate output values from inputs 
        self.output = np.maximum(0, inputs)
    
# Softmax activation 
class Activation_Softmax: 
 
    # Forward pass 
    def forward(self, inputs): 
 
        # Get unnormalized probabilities 
        exp_values = np.exp(inputs - np.max(inputs, axis=1, 
                                            keepdims=True)) 
        # Normalize them for each sample 
        probabilities = exp_values / np.sum(exp_values, axis=1, 
                                            keepdims=True) 
 
        self.output = probabilities 
 
 
# Common loss class 
class Loss: 
 
    # Calculates the data and regularization losses 
    # given model output and ground truth values 
    def calculate(self, output, y): 
 
        # Calculate sample losses 
        sample_losses = self.forward(output, y) 
 
        # Calculate mean loss 
        data_loss = np.mean(sample_losses) 
 
        # Return loss 
        return data_loss 
 
 
# Cross-entropy loss 
class Loss_CategoricalCrossentropy(Loss): 
 
    # Forward pass 
    def forward(self, y_pred, y_true): 
 
        # Number of samples in a batch 
        samples = len(y_pred) 
 
        # Clip data to prevent taking the log of 0 
        # Clip both sides to not drag mean towards any value
        y_pred_clipped = np.clip(y_pred, 1e-7, 1 - 1e-7)

        # Probabilities for target values - 
        # only if categorical labels 
        if len(y_true.shape) == 1: 
            correct_confidences = y_pred_clipped[ 
                range(samples), 
                y_true 
            ] 
 
        # Mask values - only for one-hot encoded labels 
        elif len(y_true.shape) == 2: 
            correct_confidences = np.sum( 
                y_pred_clipped * y_true, 
                axis=1 
            ) 
 
        # Losses 
        negative_log_likelihoods = -np.log(correct_confidences) 
        return negative_log_likelihoods 
 
 
 
# Create dataset 
X, y = spiral_data(samples=100, classes=3) 
 
# Create Dense layer with 2 input features and 3 output values 
dense1 = Layer_Dense(2, 3) 
 
# Create ReLU activation (to be used with Dense layer): 
activation1 = Activation_ReLU() 
 
# Create second Dense layer with 3 input features (as we take output 
# of previous layer here) and 3 output values 
dense2 = Layer_Dense(3, 3) 
 
# Create Softmax activation (to be used with Dense layer): 
activation2 = Activation_Softmax() 
 
# Create loss function 
loss_function = Loss_CategoricalCrossentropy() 
 
# Perform a forward pass of our training data through this layer 
dense1.forward(X) 
 
# Perform a forward pass through activation function 
# it takes the output of first dense layer here 
activation1.forward(dense1.output) 

# Perform a forward pass through second Dense layer 
# it takes outputs of activation function of first layer as inputs 
dense2.forward(activation1.output) 

# Perform a forward pass through activation function 
# it takes the output of second dense layer here 
activation2.forward(dense2.output) 

# Let's see output of the first few samples: 
print(activation2.output[:5]) 

# Perform a forward pass through loss function 
# it takes the output of the softmax activation here and returns loss 
loss = loss_function.calculate(activation2.output, y) 

# Print loss value 
print('loss:', loss)

# Calculate accuracy from output of activation2 and targets 
# calculate values along second axis (axis=1) 
predictions = np.argmax(activation2.output, axis=1) 
if len(y.shape) == 2: 
    y = np.argmax(y, axis=1) 
accuracy = np.mean(predictions==y) 
# Print accuracy 
print('acc:', accuracy)

[[0.33333334 0.33333334 0.33333334]
 [0.3333332  0.3333332  0.33333364]
 [0.3333329  0.33333293 0.3333342 ]
 [0.3333326  0.33333263 0.33333477]
 [0.33333233 0.3333324  0.33333528]]
loss: 1.0986104
acc: 0.34


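Again, we get ~0.33 values since the model is random, and its average loss is also not great for these data, as we’ve not yet trained our model on how to correct its errors. A loss of about 1.0986 is what a model that spreads its confidence evenly over three classes produces, since $-\log(1/3) \approx 1.0986$, and the accuracy of 0.34 is close to random guessing.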
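As a rough sanity check on the loss printed above, the sketch below computes the cross-entropy of a perfectly uniform three-class prediction and compares it with the natural log of 3. The batch of five samples and their targets are made up for this example.

```python
import numpy as np

n_classes = 3

# Five hypothetical samples, each predicted as exactly 1/3 per class.
uniform_output = np.full((5, n_classes), 1 / n_classes)
targets = np.array([0, 1, 2, 0, 1])

# Confidence at each target index, then the mean negative log.
confidences = uniform_output[np.arange(len(targets)), targets]
uniform_loss = np.mean(-np.log(confidences))

print(uniform_loss, np.log(n_classes))   # both are about 1.0986
```

Whatever the targets are, a uniform prediction gives the same loss, so an untrained model landing near this value is expected.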