<a href="https://colab.research.google.com/github/sandipanpaul21/Neural-Network-in-Python/blob/main/05_Code.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

#### Calculating Network Error with Loss

- With a randomly initialized model, or even one initialized with a more sophisticated approach, our goal is to train, or teach, the model over time. To train a model, we tweak the weights and biases to improve the model's accuracy and confidence.
- To do this, we calculate how much error the model has. The **loss function**, also referred to as the **cost function**, is the algorithm that quantifies how wrong a model is.
- **Loss** is the measure of this metric. Since loss is the model's error, we ideally want it to be 0.

#### Categorical Cross-Entropy Loss

- Categorical cross-entropy is explicitly used to compare a "ground-truth" probability distribution (y, or "targets") and some predicted distribution (y-hat, or "predictions").
- Since a softmax output layer produces a probability distribution, it makes sense to use cross-entropy here.
- It is also one of the most commonly used loss functions with a softmax activation on the output layer.

The formula for calculating the categorical cross-entropy of y (actual/desired distribution) and y-hat (predicted distribution), summed over all classes, is:

     Loss = - sum(Actual * log(Predicted))

- Cross-entropy compares two probability distributions. 

  In our case, we have a softmax output: softmax_output = [0.7, 0.1, 0.2]

**Which probability distribution do we intend to compare this to?**
 
- We have 3 class confidences in the above output, and let's assume that the desired prediction is the first class (index 0, which is currently 0.7).
- If that's the intended prediction, then the desired probability distribution is [1, 0, 0].
- Cross-entropy can also work on probability distributions like [0.2, 0.5, 0.3]; they wouldn't have to look like the one above.
- That said, in our classification case the desired probabilities will consist of a 1 in the desired class and a 0 in the remaining undesired classes.

- Arrays or vectors like this are called **one-hot**, meaning one of the values is "hot" (on), with a value of 1, and the rest are "cold" (off), with values of 0.
- When comparing the model's results to a one-hot vector using cross-entropy, the other parts of the equation zero out, and the target probability's log loss is multiplied by 1, making the cross-entropy calculation relatively simple. This is also a special case of the cross-entropy calculation, called categorical cross-entropy.
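To make the difference between one-hot and general target distributions concrete, here is a minimal sketch. The helper name `cross_entropy` and the choice of [0.2, 0.5, 0.3] as a target distribution are assumptions made for illustration; they are not part of the notebook's own code.

```python
import math

def cross_entropy(target, predicted):
    # Hypothetical helper: sum of -target_j * log(predicted_j) over all classes.
    # Classes with a target of 0 contribute nothing, so they are skipped.
    return -sum(t * math.log(p) for t, p in zip(target, predicted) if t > 0)

predicted = [0.7, 0.1, 0.2]

# One-hot target: only the first class contributes to the sum
one_hot_loss = cross_entropy([1, 0, 0], predicted)

# General target distribution: every class contributes
soft_loss = cross_entropy([0.2, 0.5, 0.3], predicted)

print("One-hot target loss :", one_hot_loss)
print("Soft target loss    :", soft_loss)
```

With the one-hot target the result depends only on the confidence assigned to the desired class, which is the simplification described above.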

In [1]:
import math

# Predicted Output
softmax_output = [0.7,0.1,0.2]  

# Truth 
target_output = [1,0,0]

# Loss = - sum(Actual * log(Predicted)) over all classes
loss = -(target_output[0] * math.log(softmax_output[0]) +
         target_output[1] * math.log(softmax_output[1]) +
         target_output[2] * math.log(softmax_output[2]))

print("Model Loss :",loss)

Model Loss : 0.35667494393873245
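Because the target is one-hot, the same value can be obtained without the full sum. The following sketch compares both routes; the variable names `full_sum`, `target_index` and `shortcut` are introduced here for illustration only.

```python
import math

softmax_output = [0.7, 0.1, 0.2]
target_output = [1, 0, 0]

# Full sum over all classes, as in the cell above
full_sum = -sum(t * math.log(p) for t, p in zip(target_output, softmax_output))

# Shortcut: look up the position of the 1 and take -log of that confidence only
target_index = target_output.index(1)
shortcut = -math.log(softmax_output[target_index])

print("Full sum :", full_sum)
print("Shortcut :", shortcut)
```

Both routes agree because every term multiplied by a 0 in the target vanishes.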


In [2]:
import numpy as np

# Predicted
softmax_output = np.array([[0.7, 0.1,0.2],
                           [0.1, 0.5, 0.4],
                           [0.02,0.9,0.08]])

# Truth
target_output = [0,1,1]

print(softmax_output[[0,1,2],target_output])

[0.7 0.5 0.9]


- NumPy lets us index an array in multiple ways.
- One of them is to use a list of indices, which is convenient for us: we can use target_output for this purpose, as it already contains the indices we are interested in.
- The catch is that these indices select along the second dimension, the class columns within each row.
- To pair each index with its own row, we also need to index the first dimension explicitly. This dimension contains the individual predictions and we, of course, want to retain all of them.
- We can achieve that with a list of row indices from 0 up to the last row. There are as many rows as distributions in the batch, so we can use range() instead of typing each value ourselves:

In [3]:
print("Softmax Output :\n",softmax_output)
print("Length of Softmax Output :",len(softmax_output))
print("Range of Softmax Output :",range(len(softmax_output)))
print("Target Output :",target_output)
print("For each row index in range(0, n), pick the probability at that row's target class index")
print("Target Class Probability Row Wise : ",softmax_output[range(len(softmax_output)),target_output])

Softmax Output :
 [[0.7  0.1  0.2 ]
 [0.1  0.5  0.4 ]
 [0.02 0.9  0.08]]
Length of Softmax Output : 3
Range of Softmax Output : range(0, 3)
Target Output : [0, 1, 1]
For each row index in range(0, n), pick the probability at that row's target class index
Target Class Probability Row Wise :  [0.7 0.5 0.9]
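The sketch below illustrates why the row index is needed and shows an equivalent formulation. Indexing only the columns returns a full 3x3 block, whereas pairing row and column indices returns one value per sample. The use of `np.arange` and `np.take_along_axis` is an alternative chosen for illustration, not something the notebook itself uses.

```python
import numpy as np

softmax_output = np.array([[0.7, 0.1, 0.2],
                           [0.1, 0.5, 0.4],
                           [0.02, 0.9, 0.08]])
target_output = np.array([0, 1, 1])

# Column-only indexing: picks columns 0, 1, 1 for EVERY row (3x3 result)
columns_only = softmax_output[:, target_output]

# Paired indexing: row i is matched with column target_output[i]
paired = softmax_output[np.arange(len(softmax_output)), target_output]

# Same selection via take_along_axis; indices need a trailing axis
via_take = np.take_along_axis(softmax_output, target_output[:, None], axis=1).ravel()

print("Column-only shape :", columns_only.shape)
print("Paired indexing   :", paired)
print("take_along_axis   :", via_take)
```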


- So far this works with sparse labels (class indices, as in our training data), so we have to add a check for one-hot encoded targets and handle that case a bit differently.
- In the example above, the true values are a single list of class indices, [0, 1, 1], with one correct class per sample.
- The same targets can also be represented as one-hot encoded values, [[1, 0, 0], [0, 1, 0], [0, 1, 0]], which is a list of lists.
- The check can be performed by counting the dimensions: if the targets are single-dimensional (like a list), they are sparse class indices.
- If there are 2 dimensions (like a list of lists), then they are a set of one-hot encoded vectors.
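To make the two label formats concrete, here is a small sketch that converts sparse labels to one-hot vectors and back, and counts the dimensions of each. The `np.eye` indexing trick and the names `sparse_targets`, `one_hot_targets` and `n_classes` are assumptions made for this illustration.

```python
import numpy as np

sparse_targets = np.array([0, 1, 1])   # one class index per sample
n_classes = 3                          # assumed number of classes

# Row i of the identity matrix is the one-hot vector for class i
one_hot_targets = np.eye(n_classes, dtype=int)[sparse_targets]

# argmax along each row recovers the original class indices
recovered = np.argmax(one_hot_targets, axis=1)

print("Sparse dimensions  :", sparse_targets.ndim)
print("One-hot dimensions :", one_hot_targets.ndim)
print("One-hot targets    :\n", one_hot_targets)
print("Recovered          :", recovered)
```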

In [4]:
softmax_output = np.array([[0.7, 0.1,0.2],
                           [0.1, 0.5, 0.4],
                           [0.02,0.9,0.08]])

class_target = np.array([[1,0,0],
                        [0,1,0],
                        [0,1,0]])

print("Number of Dimensions of Class Target (1 = sparse labels, 2 = one-hot) :",len(class_target.shape))
print("Range over Dimensions of Class Target :",range(len(class_target.shape)))

# Probabilities for target values
# only if categorical labels

if len(class_target.shape) == 1:
  print("Class Target is of One List")
  print("List contains True Values")
  correct_confidence = softmax_output[range(len(softmax_output)),class_target]

elif len(class_target.shape) == 2:
  print("\nClass Target has more than One List")
  print("One Hot Encoded Way")
  correct_confidence = np.sum(softmax_output * class_target,axis = 1)
  print("Chosen Probability from every list :",correct_confidence)

# Losses = -log(probabilities)
neg_log = -np.log(correct_confidence)
print("\nNegative Log(Chosen Probability) :",neg_log)

# Average of Loss
average_loss = np.mean(neg_log)
print("\nAverage of Negative Log(Chosen Probability) :",average_loss)

Number of Dimensions of Class Target (1 = sparse labels, 2 = one-hot) : 2
Range over Dimensions of Class Target : range(0, 2)

Class Target has more than One List
One Hot Encoded Way
Chosen Probability from every list : [0.7 0.5 0.9]

Negative Log(Chosen Probability) : [0.35667494 0.69314718 0.10536052]

Average of Negative Log(Chosen Probability) : 0.38506088005216804


- The softmax output, which is also an input to this loss function, consists of numbers in the range from 0 to 1 - a list of confidences. 
- It is possible that the model will have full confidence for one label making all the remaining confidences zero. 
- Similarly, it is also possible that the model will assign full confidence to a value that wasn't the target, leaving a confidence of 0 for the target class.

In [5]:
# If we then try to calculate the loss of this confidence of 0:
-np.log(0) # No exception: NumPy emits a divide-by-zero RuntimeWarning and returns inf

inf

- Before we explain this, we need to talk about log(0).
- From the mathematical point of view, log(0) is undefined. We already know the following dependence: if y=log(x), then e^y=x.
- The question of what the resulting y is in y=log(0) is the same as the question of what the y is in e^y=0.
- In simplified terms, the constant e to any power is always a positive number, and there is no y resulting in e^y=0.
- This means that log(0) is undefined. We need to be aware of what log(0) is; "undefined" does not mean that we don't know anything about it.

**Since log(0) is undefined, what's the result for a value very close to 0?**

Ans. We can take the limit: as x approaches 0 from the positive side without ever reaching it, log(x) tends to negative infinity, so the loss -log(x) grows without bound.
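The limit can be observed numerically. This sketch evaluates -log(x) for a few shrinking values of x (the specific values are arbitrary choices for illustration) and also shows that Python's built-in `math.log`, unlike NumPy, raises an exception for an input of exactly 0.

```python
import math

# Arbitrary, progressively smaller positive confidences
xs = [1e-1, 1e-3, 1e-7, 1e-100, 1e-300]
losses = [-math.log(x) for x in xs]

for x, loss in zip(xs, losses):
    print(f"x = {x:g}  ->  -log(x) = {loss:.4f}")

# math.log rejects 0 outright, whereas np.log(0) warns and returns -inf
try:
    math.log(0)
    raised = False
except ValueError:
    raised = True
print("math.log(0) raised ValueError :", raised)
```

The loss keeps growing as x shrinks, but only slowly, since it grows with the logarithm of 1/x.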

In [6]:
# We could add a very small value to the confidence to prevent it from being zero, for example 1e-7 (0.0000001):
- np.log(1e-7) 

16.11809565095832

In [7]:
# Adding a very small value, one ten-millionth, to the confidence at its far edge will insignificantly impact the result,
# but this method yields 2 additional issues.

# First, in the case where the confidence value is 1, adding even the smallest value gives a probability greater than 1, which is not possible:
- np.log(1+1e-7) 

-9.999999505838704e-08

- When the model is fully correct in a prediction and puts all the confidence in the correct label, the loss becomes a negative value instead of being 0.
- The other problem here is shifting confidence towards 1, even if by a very small value.
- To prevent both issues, it's better to clip values from both sides by the same number, 1e-7 in our case.
- That means that the lowest possible value will become 1e-7 (as in the demonstration we just performed), but the highest possible value, instead of being 1+1e-7, will become 1-1e-7 (so slightly less than 1):

In [8]:
- np.log(1-1e-7)

1.0000000494736474e-07

#### Clipping Example

Clip (limit) the values in an array.

- Given an interval, values outside the interval are clipped to the interval edges. 
- For example, if an interval of [0, 1] is specified, values smaller than 0 become 0, and values larger than 1 become 1.

      a = np.arange(10)
      a # array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
      
      np.clip(a, 1, 8) # array([1, 1, 2, 3, 4, 5, 6, 7, 8, 8])
      np.clip(a, 8, 1) # array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1])

- Clipping the predictions this way will prevent the loss from being exactly 0, making it a very small value instead, but it won't make the loss negative and won't bias the overall loss towards 1.

      y_pred_clipped = np.clip(y_pred, 1e-7, 1 - 1e-7)  # (array, lower bound, upper bound)

- This method can perform clipping on an array of values, so we can apply it to the predictions directly and save the result as a separate array.
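As a quick check of the clipping behaviour, the sketch below applies it to a made-up array of confidences that includes the two problematic edge values, 0 and 1. The sample values in `y_pred` are assumptions chosen for illustration.

```python
import numpy as np

# Hypothetical confidences, including the edge cases 0 and 1
y_pred = np.array([0.0, 1e-9, 0.5, 1.0])

# Clip from both sides by the same amount
y_pred_clipped = np.clip(y_pred, 1e-7, 1 - 1e-7)
losses = -np.log(y_pred_clipped)

print("Clipped :", y_pred_clipped)
print("Losses  :", losses)
```

Every loss is now finite and positive: the confidence of 0 yields a loss of about 16.118 rather than inf, and the confidence of 1 yields a tiny positive loss rather than 0 or a negative value.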

In [9]:
# Function for Cross Entropy Loss

# Common loss class
class Loss:
  # Calculates the data loss
  # given model output and ground truth values
  def calculate(self,output,y):
    # Calculate Sample Losses
    sample_losses = self.forward(output,y)
    # Average of Loss
    data_loss = np.mean(sample_losses)
    # Return Loss
    return data_loss

class Loss_CategoricalCrossEntropy(Loss) :
  # Forward Pass
  def forward(self, y_pred, y_true):
    
    # Number of samples in Batch
    samples = len(y_pred)

    # Clip Data to prevent log(0)
    # Clip both sides to not drag mean towards any value
    y_pred_clipped = np.clip(y_pred, 1e-7, 1 - 1e-7)
    
    # Probabilities for target values 
    # only if categorical labels 
    if len(y_true.shape) == 1:
      print("Actual Target is of One List")
      print("List contains True Values")
      correct_confidence = y_pred_clipped[range(samples),y_true]

    # Mask values, only for one-hot encoded labels
    elif len(y_true.shape) == 2:
      print("\nActual Target has more than One List")
      print("One Hot Encoded Way")
      correct_confidence = np.sum(y_pred_clipped * y_true,axis = 1)
      print("Chosen Probability from every list :",correct_confidence)

    # Losses = -log(probabilities)
    negative_log_likelihood = -np.log(correct_confidence)
    return negative_log_likelihood
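The same logic can be exercised on the small batch used earlier. The following is a compact, function-based sketch rather than the class defined above; the name `categorical_cross_entropy` and the extra `ValueError` branch are assumptions added for illustration.

```python
import numpy as np

def categorical_cross_entropy(y_pred, y_true):
    # Hypothetical helper: per-sample losses for sparse or one-hot targets
    y_pred = np.asarray(y_pred, dtype=float)
    y_true = np.asarray(y_true)
    clipped = np.clip(y_pred, 1e-7, 1 - 1e-7)   # avoid log(0)
    if y_true.ndim == 1:
        # Sparse labels: pick one confidence per row
        confidences = clipped[np.arange(len(clipped)), y_true]
    elif y_true.ndim == 2:
        # One-hot labels: mask out the non-target classes
        confidences = np.sum(clipped * y_true, axis=1)
    else:
        raise ValueError("targets must be 1- or 2-dimensional")
    return -np.log(confidences)

softmax_output = [[0.7, 0.1, 0.2],
                  [0.1, 0.5, 0.4],
                  [0.02, 0.9, 0.08]]

sparse_losses = categorical_cross_entropy(softmax_output, [0, 1, 1])
one_hot_losses = categorical_cross_entropy(softmax_output,
                                           [[1, 0, 0], [0, 1, 0], [0, 1, 0]])
# Fully wrong prediction: target confidence is 0 before clipping
worst_case = categorical_cross_entropy([[0.0, 1.0]], [0])

print("Mean loss (sparse)  :", np.mean(sparse_losses))
print("Mean loss (one-hot) :", np.mean(one_hot_losses))
print("Worst-case loss     :", worst_case)
```

Both label formats give the same mean loss of roughly 0.385, and the fully wrong prediction produces a large but finite loss thanks to clipping.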

In [10]:
import nnfs   
from nnfs.datasets import spiral_data

nnfs.init()   # Will set random seed = 0, to make it repeatable

class Layer_Dense:
    def __init__(self, n_inputs, n_neurons):
        self.weights = 0.1 * np.random.randn(n_inputs, n_neurons)
        self.biases = np.zeros((1, n_neurons))
    def forward(self, inputs):
        self.output = np.dot(inputs, self.weights) + self.biases

class Activation_ReLU:
    def forward(self, inputs):
        self.output = np.maximum(0, inputs)

class Activation_Softmax:
    def forward(self, inputs):
        exp_values = np.exp(inputs - np.max(inputs, axis=1, keepdims=True))
        probabilities = exp_values / np.sum(exp_values, axis=1, keepdims=True)
        self.output = probabilities

class Loss:
  def calculate(self, output,y):
    sample_losses = self.forward(output,y)
    data_loss = np.mean(sample_losses)
    return data_loss

class Loss_CategoricalCrossEntropy(Loss) :
  def forward(self, y_pred, y_true):
    # Number of samples in the batch
    samples = len(y_pred)
    y_pred_clipped = np.clip(y_pred, 1e-7, 1 - 1e-7)
    if len(y_true.shape) == 1:
      correct_confidence = y_pred_clipped[range(samples),y_true]

    elif len(y_true.shape) == 2:
      correct_confidence = np.sum(y_pred_clipped * y_true,axis = 1)
    negative_log_likelihood = -np.log(correct_confidence)
    return negative_log_likelihood

X, y = spiral_data(samples=100, classes=3)

dense1 = Layer_Dense(2,3)
activation1 = Activation_ReLU()

dense2 = Layer_Dense(3, 3)
activation2 = Activation_Softmax()

dense1.forward(X)
activation1.forward(dense1.output)

dense2.forward(activation1.output)
activation2.forward(dense2.output)

print("Activation Function Output :\n",activation2.output[:5])

# Create loss function
loss_function = Loss_CategoricalCrossEntropy()

# Perform a forward pass through the loss function
# it takes the softmax output of the second dense layer and the targets, and returns the loss
loss = loss_function.calculate(activation2.output,y)

print("Loss :",loss)

Activation Function Output :
 [[0.33333334 0.33333334 0.33333334]
 [0.33331734 0.3333183  0.33336434]
 [0.3332888  0.33329153 0.33341965]
 [0.33325943 0.33326396 0.33347666]
 [0.33323312 0.33323926 0.33352762]]
Loss : 1.098445


Again, we get ~0.33 values since the model is random. Its average loss of about 1.098 is close to ln(3) ≈ 1.0986, the loss of a uniform guess across 3 classes, which is not great for these data, as we have not yet trained the model on how to correct its errors.
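The loss value above can be sanity-checked against the loss of a perfectly uniform prediction. This sketch assumes 3 classes, as in the spiral data call above; the names `uniform` and `baseline_loss` are introduced for illustration.

```python
import numpy as np

n_classes = 3

# A model with no preference assigns 1/3 to every class
uniform = np.full(n_classes, 1 / n_classes)

# Whatever the target class is, its confidence is 1/3
baseline_loss = -np.log(uniform[0])

print("Baseline loss :", baseline_loss)
print("ln(3)         :", np.log(n_classes))
```

The baseline is about 1.0986, so a loss of 1.098445 indicates that the untrained network is doing essentially no better than a uniform guess.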

#### Accuracy Calculation
- While loss is a useful metric for optimizing a model, the metric commonly used in practice along with loss is accuracy, which describes, as a fraction, how often the largest confidence is the correct class.
- Conveniently, we can reuse existing variable definitions to calculate the accuracy metric.
- We will use the argmax values from the softmax outputs and then compare these to the targets.


In [11]:
# Model Output
softmax_output = np.array([[0.7, 0.1,0.2],
                           [0.5, 0.1, 0.4],
                           [0.02,0.9,0.08]])

# Ground Truth
class_target = np.array([0,1,1])

predictions = np.argmax(softmax_output,axis = 1)
print("Model Predicted : ",predictions)
print("Actual Truth : ",class_target)

# If targets are one hot encoded - convert them
if len(class_target.shape) == 2:
  class_target = np.argmax(class_target,axis = 1)

# Evaluation Accuracy
accuracy = np.mean(predictions == class_target)
print("Accuracy :",accuracy)

Model Predicted :  [0 0 1]
Actual Truth :  [0 1 1]
Accuracy : 0.6666666666666666
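The accuracy steps can be wrapped into a small helper that accepts either label format. The function name `accuracy_score_from_softmax` is hypothetical and is used here only to show that sparse and one-hot targets lead to the same result.

```python
import numpy as np

def accuracy_score_from_softmax(softmax_output, targets):
    # Hypothetical helper: fraction of rows whose argmax matches the target
    predictions = np.argmax(softmax_output, axis=1)
    targets = np.asarray(targets)
    if targets.ndim == 2:
        # One-hot targets: convert to class indices first
        targets = np.argmax(targets, axis=1)
    return np.mean(predictions == targets)

softmax_output = np.array([[0.7, 0.1, 0.2],
                           [0.5, 0.1, 0.4],
                           [0.02, 0.9, 0.08]])

acc_sparse = accuracy_score_from_softmax(softmax_output, [0, 1, 1])
acc_one_hot = accuracy_score_from_softmax(softmax_output,
                                          [[1, 0, 0], [0, 1, 0], [0, 1, 0]])

print("Accuracy (sparse)  :", acc_sparse)
print("Accuracy (one-hot) :", acc_one_hot)
```

Two of the three predictions match their targets, so both calls return 2/3.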


In [12]:
import nnfs   
from nnfs.datasets import spiral_data

nnfs.init()   # Will set random seed = 0, to make it repeatable

class Layer_Dense:
    def __init__(self, n_inputs, n_neurons):
        self.weights = 0.1 * np.random.randn(n_inputs, n_neurons)
        self.biases = np.zeros((1, n_neurons))
    def forward(self, inputs):
        self.output = np.dot(inputs, self.weights) + self.biases

class Activation_ReLU:
    def forward(self, inputs):
        self.output = np.maximum(0, inputs)

class Activation_Softmax:
    def forward(self, inputs):
        exp_values = np.exp(inputs - np.max(inputs, axis=1, keepdims=True))
        probabilities = exp_values / np.sum(exp_values, axis=1, keepdims=True)
        self.output = probabilities

class Loss:
  def calculate(self, output,y):
    sample_losses = self.forward(output,y)
    data_loss = np.mean(sample_losses)
    return data_loss

class Loss_CategoricalCrossEntropy(Loss) :
  def forward(self, y_pred, y_true):
    # Number of samples in the batch
    samples = len(y_pred)
    y_pred_clipped = np.clip(y_pred, 1e-7, 1 - 1e-7)
    if len(y_true.shape) == 1:
      correct_confidence = y_pred_clipped[range(samples),y_true]

    elif len(y_true.shape) == 2:
      correct_confidence = np.sum(y_pred_clipped * y_true,axis = 1)
    negative_log_likelihood = -np.log(correct_confidence)
    return negative_log_likelihood

X, y = spiral_data(samples=100, classes=3)

dense1 = Layer_Dense(2,3)
activation1 = Activation_ReLU()

dense2 = Layer_Dense(3, 3)
activation2 = Activation_Softmax()

dense1.forward(X)
activation1.forward(dense1.output)

dense2.forward(activation1.output)
activation2.forward(dense2.output)

print("Activation Function Output :\n",activation2.output[:5])

# Create loss function
loss_function = Loss_CategoricalCrossEntropy()

# Perform a forward pass through the loss function
# it takes the softmax output of the second dense layer and the targets, and returns the loss
loss = loss_function.calculate(activation2.output,y)

print("Loss :",loss)

# Accuracy
predictions = np.argmax(activation2.output,axis = 1)

# If targets are one hot encoded - convert them
if len(y.shape) == 2:
  y = np.argmax(y,axis = 1)

# Evaluation Accuracy
accuracy = np.mean(predictions == y)
print("Accuracy :",accuracy)

Activation Function Output :
 [[0.33333334 0.33333334 0.33333334]
 [0.33331734 0.3333183  0.33336434]
 [0.3332888  0.33329153 0.33341965]
 [0.33325943 0.33326396 0.33347666]
 [0.33323312 0.33323926 0.33352762]]
Loss : 1.098445
Accuracy : 0.34


We perform a forward pass through our network and calculate the loss and accuracy metrics, which signal that the untrained model is performing poorly. We will embark on optimization in the next chapter!