#### Backpropagation and optimization require us to know how far our output values are from the actual labels. Since the output is a probability distribution over classes, we need a loss function to measure how far the predictions are from the true values.

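One example for regression: 

Mean Absolute Error: the average of | predicted val - target val | over all samples

MAE is relatively simple. Typically, we use RMSE (root mean squared error), which squares each error before averaging and so penalizes large errors more heavily.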
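
As a rough illustration of the two regression losses, here is a minimal sketch. The `predicted` and `target` arrays are made-up values, not data from this notebook.

```python
import numpy as np

# Hypothetical values, chosen only for illustration
predicted = np.array([2.5, 0.0, 2.0, 8.0])
target = np.array([3.0, -0.5, 2.0, 7.0])

errors = predicted - target  # per-sample error

mae = np.mean(np.abs(errors))         # mean absolute error
rmse = np.sqrt(np.mean(errors ** 2))  # root mean squared error

print("MAE: ", mae)
print("RMSE:", rmse)
```

RMSE is never smaller than MAE on the same errors, and the gap grows when a few errors are much larger than the rest.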

For classification, the loss function is generally Categorical Cross-Entropy.

The categorical cross-entropy loss is defined as:

$$
L_i = - \sum_{j} y_{i,j} \log(\hat{y}_{i,j})
$$

where:
- \( L_i \) is the loss for the \(i\)-th example.
- \( y_{i,j} \) is the true label (one-hot encoded) for the \(j\)-th class of the \(i\)-th example.
- \( \hat{y}_{i,j} \) is the predicted probability for the \(j\)-th class of the \(i\)-th example.
- The summation is over all classes \( j \).

One hot encoding ultimately reduces this to:

$$
L_i = - \log(\hat{y}_{i,k})
$$

where:
- \( L_i \) is the sample loss value.
- \( i \) is the index of the sample in the set.
- \( k \) is the target label index, i.e. the index of the correct class.
- \( \hat{y}_{i,k} \) is the predicted probability of the correct class for the \(i\)-th sample.

#### One Hot Encoding...

Produces vectors that represent labels with binary values: a 1 at the index of the label's class and a 0 everywhere else.

For instance, if I have 5 classes and the label represents the 3rd class, the one hot vector may look like this:

[0, 0, 1, 0, 0]
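
One possible way to build such vectors in NumPy is to index the rows of an identity matrix with the class labels. This is only a sketch; the `labels` array is a hypothetical example.

```python
import numpy as np

n_classes = 5
labels = np.array([2, 0, 4])  # hypothetical class indices; 2 is the 3rd class

# Row k of the identity matrix is the one-hot vector for class k
one_hot = np.eye(n_classes, dtype=int)[labels]
print(one_hot)

# argmax along each row recovers the original class indices
recovered = np.argmax(one_hot, axis=1)
print(recovered)
```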

#### Logarithms...

In this context, "log" typically refers to the natural logarithm:

$$
y = \log_e x = \ln x
$$

e is Euler's number!




In [4]:
# A bit on Logs.
# Generally speaking, a logarithm solves for x in the equation: e ^ x = b
# So, log(b) = x


import numpy as np
import math

b = 5.2

print(np.log(b))

print(math.e ** 1.6486586255873816 )

1.6486586255873816
5.199999999999999


#### How do one hot encoding and loss function work together?

Classes: 3
Label:   0
One-hot: [1, 0, 0]
Prediction: [0.7, 0.1, 0.2]

$$
L_i = - \sum_{j} y_{i,j} \log(\hat{y}_{i,j}) = -(1 \cdot \log(0.7) + 0 \cdot \log(0.1) + 0 \cdot \log(0.2))
$$

One hot encoding means I only have one non-zero class label, and it's a one. So I only need the -log of the predicted probability for the target class.

This simplifies as:

$$
L_i = - \log(\hat{y}_{i,k}) = - \log(0.7) \approx 0.35667494393873245
$$

#### What does it look like to code categorical cross entropy?

In [7]:
import math


softmax_output = [0.7, 0.1, 0.2]
target_class = 0 # so output should be highest at index 0
target_output = [1, 0, 0]

loss = -(math.log(softmax_output[0]) * target_output[0] + # target out is 1...
         math.log(softmax_output[1]) * target_output[1] + # zeroes out
         math.log(softmax_output[2]) * target_output[2])  # zeroes out

print(loss)

loss = -math.log(softmax_output[0])

print(loss)

print(-math.log(0.7))

# But what if the softmax output for the target class was lower?

print(-math.log(0.5))

# The lower the predicted value is when it should be one, the higher the -log(predicted_val), and thus the higher the loss.


0.35667494393873245
0.35667494393873245
0.35667494393873245
0.6931471805599453
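
To see the trend over a wider range, the sketch below prints -log(p) for a handful of confidence values. The values in `confidences` are arbitrary picks for illustration.

```python
import math

# Arbitrary confidences for the target class, from high to low
confidences = [0.99, 0.9, 0.7, 0.5, 0.1, 0.01]
losses = [-math.log(p) for p in confidences]

for p, loss in zip(confidences, losses):
    print(f"confidence {p:<5} -> loss {loss:.4f}")
```

The loss stays small while the confidence is near 1 and grows without bound as the confidence approaches 0.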


#### Implementing Loss... 

We will need to calculate loss in batches. 

In [17]:
softmax_outputs = np.array([[0.7, 0.1, 0.2],
                           [0.1, 0.5, 0.4],
                           [0.02, 0.9, 0.08]])

class_targets = [0, 1, 1] # each entry is the target class index for the corresponding row (sample) of softmax_outputs

# The goal is to grab the predicted probability that corresponds to the class I was supposed to predict, i.e. the target class.
# In the above case, the batch holds three samples, and each sample has predictions for three classes.
# The first element in each row corresponds to the first class (label = 0), and so on.
# So given the class targets, I need the first element of the first row, the second element of the second row,
# and the second element of the third row. NumPy integer array indexing does this easily.

print(softmax_outputs[[0,1,2], class_targets]) # pairs each row index with its target column index

# 0,1,2 are the first dimension (row) indices. Class targets give the second dimension (column) indices.

# But I don't have to hardcode the row indices for the batch length!

print(softmax_outputs[
    range(len(softmax_outputs)), class_targets
])

# To calculate loss, I just need -logs...
print()
print("Calculated loss for each batch: ", -np.log(softmax_outputs[
    range(len(softmax_outputs)), class_targets
]))
print()
print("Average loss over all three batches: ", np.mean(-np.log(softmax_outputs[
    range(len(softmax_outputs)), class_targets
])))

[0.7 0.5 0.9]
[0.7 0.5 0.9]

Calculated loss for each batch:  [0.35667494 0.69314718 0.10536052]

Average loss over all three batches:  0.38506088005216804
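
If the targets arrive one-hot encoded instead of as class indices, one option is to multiply the predictions by the targets and sum each row. The sketch below reuses the values from the cell above and checks that both routes agree; `one_hot_targets` is built here only for the comparison.

```python
import numpy as np

softmax_outputs = np.array([[0.7, 0.1, 0.2],
                            [0.1, 0.5, 0.4],
                            [0.02, 0.9, 0.08]])

class_targets = np.array([0, 1, 1])
one_hot_targets = np.eye(3)[class_targets]  # same targets, one-hot encoded

# Route 1: index the target column of each row directly
sparse_conf = softmax_outputs[np.arange(len(softmax_outputs)), class_targets]

# Route 2: the one-hot mask zeroes out every non-target probability
one_hot_conf = np.sum(softmax_outputs * one_hot_targets, axis=1)

print(sparse_conf)
print(one_hot_conf)
print(np.mean(-np.log(one_hot_conf)))
```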


#### Finding average loss like this is great and all, but what about when I have to take -log(0)? NumPy returns inf (with a warning), and I certainly can't take a meaningful mean with an infinite value floating around in there.

This would happen in cases where the network's confidence for the correct class is zero...pretty bad, right? Happens.

To account for this, we can clip very low predictions up to a small positive value. To avoid some one-sided bias from clipping the low end, we clip the high end as well.

Typical approach: clip predictions to the range [1e-7, 1 - 1e-7], which keeps each sample loss between -log(1 - 1e-7) and -log(1e-7).
ypred_clipped = np.clip(ypred, 1e-7, 1 - 1e-7)
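
A small sketch of the problem and the fix, using a hypothetical `y_pred` array that contains the two extreme confidences:

```python
import numpy as np

# Hypothetical confidences for the correct class, including both extremes
y_pred = np.array([0.0, 0.5, 1.0])

# Without clipping: -log(0) is inf, so the mean is inf as well
with np.errstate(divide="ignore"):  # silence the divide-by-zero warning
    raw_losses = -np.log(y_pred)
print(raw_losses, np.mean(raw_losses))

# With clipping: every loss is finite, so the mean is usable
y_pred_clipped = np.clip(y_pred, 1e-7, 1 - 1e-7)
clipped_losses = -np.log(y_pred_clipped)
print(clipped_losses, np.mean(clipped_losses))
```

With this clipping range, the largest loss a single sample can contribute is -log(1e-7), roughly 16.12.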

In [2]:
import numpy as np
import nnfs
from nnfs.datasets import spiral_data

nnfs.init()

class Layer_Dense:
    def __init__(self, n_inputs, n_neurons):      
        self.weights = 0.10 * np.random.randn(n_inputs, n_neurons)
        self.biases = np.zeros((1, n_neurons)) 
    def forward(self, inputs):
        self.output = np.dot(inputs, self.weights) + self.biases
        
class Activation_ReLU:
    def forward(self, inputs):
        self.output = np.maximum(0,inputs)
        
class Activation_Softmax:
    def forward(self, inputs):
        exp_values = np.exp(inputs - np.max(inputs, axis=1, keepdims=True))
        probabilities = exp_values / np.sum(exp_values, axis=1, keepdims=True)
        self.output = probabilities
        
class Loss: # define base class to work with various forms of loss function
    def calculate(self, output, y):
        sample_losses = self.forward(output, y)
        data_loss = np.mean(sample_losses)
        return data_loss
        
class Loss_CategoricalCrossentropy(Loss):
    def forward(self, y_pred, y_true):
        samples = len(y_pred)
        y_pred_clipped = np.clip(y_pred, 1e-7, 1-1e-7)
        
        # Need to handle both sparse class indices and one-hot encoded targets
        
        if len(y_true.shape) == 1 : # 1-D targets: sparse class indices
            correct_confidences = y_pred_clipped[range(samples), y_true] # grabs predicted probs that correspond to index of target vals
        elif len(y_true.shape) == 2 : # 2-D targets: one-hot encoded vectors
            correct_confidences = np.sum(y_pred_clipped*y_true, axis=1) # one-hot mask zeroes out non-target probs, so each row sum is the target prob

        negative_log_likelihoods = -np.log(correct_confidences)
        return negative_log_likelihoods
    


X,y = spiral_data(samples=100, classes=3) # generate spiral data with two input features: 100 samples for each of 3 classes, 300 samples in total

dense1 = Layer_Dense(2,3) # create first layer, receiving 2 inputs and passing to 3 output neurons
activation1 = Activation_ReLU() # initialize ReLU

dense2 = Layer_Dense(3,3) # create output layer, receiving 3 inputs from layer 1 and passing to 3 classes as defined in X,y above
activation2 = Activation_Softmax() # initialize output layer activation, Softmax

dense1.forward(X) # pass training data into first layer
activation1.forward(dense1.output) # apply ReLU to clip values below zero

dense2.forward(activation1.output) # pass clipped values from layer 1 to final layer
activation2.forward(dense2.output) # apply Softmax to treat negatives and normalize range to 0,1 to produce probabilities

print("Output probabilities: ", "\n",activation2.output[:5]) # shows the first 5 of 300 rows, each holding one probability per class
print()

loss_function = Loss_CategoricalCrossentropy()
loss = loss_function.calculate(activation2.output, y)

print("Loss: ", loss)

Output probabilities:  
 [[0.33333334 0.33333334 0.33333334]
 [0.33331734 0.3333183  0.33336434]
 [0.3332888  0.33329153 0.33341965]
 [0.33325943 0.33326396 0.33347666]
 [0.33323312 0.33323926 0.33352762]]

Loss:  1.098445
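
A quick sanity check on that number: the untrained network's outputs are all close to 1/3, and a model that assigns exactly 1/3 to each of three classes has a cross-entropy loss of ln(3). The sketch below computes that reference value; it does not rerun the network.

```python
import numpy as np

n_classes = 3

# A perfectly uniform prediction: probability 1/3 for each class
uniform_prediction = np.full(n_classes, 1.0 / n_classes)

# The loss is the same whichever class is the target
loss_uniform = -np.log(uniform_prediction[0])

print(loss_uniform)       # loss of a uniform guess
print(np.log(n_classes))  # same value, written as ln(3)
```

A loss near ln(3), about 1.0986, therefore suggests the network is still essentially guessing, which is expected before any training.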


#### Just in case, calculate accuracy...

In [5]:
softmax_outputs = np.array([[0.7, 0.1, 0.2],
                           [0.5, 0.1, 0.4], # notice that index 0 is predicted when target is index 1
                           [0.02, 0.9, 0.08]])

class_targets = [0, 1, 1]

predictions = np.argmax(softmax_outputs, axis=1)
accuracy = np.mean(predictions == class_targets)

print("Predictions: ", predictions)
print()
print("Accuracy: ", accuracy)

Predictions:  [0 0 1]

Accuracy:  0.6666666666666666
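
If the targets are one-hot encoded, they can be converted to class indices with argmax before comparing. A minimal sketch, assuming the same predictions as above and a hypothetical `one_hot_targets` array:

```python
import numpy as np

softmax_outputs = np.array([[0.7, 0.1, 0.2],
                            [0.5, 0.1, 0.4],
                            [0.02, 0.9, 0.08]])

# Hypothetical one-hot version of the targets [0, 1, 1]
one_hot_targets = np.array([[1, 0, 0],
                            [0, 1, 0],
                            [0, 1, 0]])

# Convert one-hot rows to class indices; leave 1-D targets untouched
if one_hot_targets.ndim == 2:
    class_targets = np.argmax(one_hot_targets, axis=1)
else:
    class_targets = one_hot_targets

predictions = np.argmax(softmax_outputs, axis=1)
accuracy = np.mean(predictions == class_targets)

print("Accuracy: ", accuracy)
```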


#### How do we decrease loss? This is the subject of optimization!

# Unfortunately, Neural Networks from Scratch content ends here. I will move on to implement these ideas in TensorFlow and Keras, where we can continue the journey of backpropagation and optimization.