## __Applying Backpropagation__
Let's start by editing the `Dense` class. We can take the gradients we worked out in the previous notebook and apply them here.

In [26]:
class Dense:
  def __init__(self, input_neurons, output_neurons):
    self.weights = 0.1*np.random.randn(input_neurons, output_neurons)
    self.biases = np.zeros((1, output_neurons))
  def forward(self, inputs):
    self.inputs = inputs
    self.output = np.dot(inputs, self.weights) + self.biases
  def backprop(self, dvalues):
    self.dinputs = np.dot(dvalues, self.weights.T)
    self.dweights = np.dot(self.inputs.T, dvalues)
    self.dbiases = np.sum(dvalues, axis=0, keepdims=True)
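
One way to gain confidence in the three formulas in `backprop` is to compare them with a numerical gradient. The sketch below is a standalone check, not part of the notebook's model: the toy shapes, the random seed, and the choice of "sum of all outputs" as a stand-in loss are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy sizes (assumed for illustration): 4 samples, 3 inputs, 2 outputs.
inputs = rng.normal(size=(4, 3))
weights = 0.1 * rng.normal(size=(3, 2))
biases = np.zeros((1, 2))

def forward(w):
    # Same computation as Dense.forward
    return inputs @ w + biases

# Stand-in loss: the sum of all outputs, so the upstream gradient is all ones.
dvalues = np.ones((4, 2))

# Analytic gradients, using the same formulas as Dense.backprop
dinputs = dvalues @ weights.T
dweights = inputs.T @ dvalues
dbiases = np.sum(dvalues, axis=0, keepdims=True)

# Numerical gradient of the stand-in loss with respect to each weight
eps = 1e-6
numeric = np.zeros_like(weights)
for r in range(weights.shape[0]):
    for c in range(weights.shape[1]):
        w_plus = weights.copy()
        w_plus[r, c] += eps
        w_minus = weights.copy()
        w_minus[r, c] -= eps
        numeric[r, c] = (forward(w_plus).sum() - forward(w_minus).sum()) / (2 * eps)

print(np.allclose(dweights, numeric, atol=1e-6))  # True
```

Note that each gradient has the same shape as the quantity it belongs to: `dweights` matches `weights`, `dbiases` matches `biases`, and `dinputs` matches `inputs`.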

Now let's edit the `ReLu` class.

In [27]:
class ReLu:
  def forward(self, inputs):
    self.inputs = inputs
    self.output = np.maximum(0, inputs)
  def backprop(self, dvalues):
    self.dinputs = dvalues.copy()
    self.dinputs[self.inputs <= 0] = 0
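
To see what the masking step does, here is a small standalone sketch with made-up numbers (the arrays are assumptions for illustration). The gradient passes through unchanged wherever the input was positive and is zeroed everywhere else.

```python
import numpy as np

# Made-up forward inputs and upstream gradients for two samples
inputs = np.array([[-2.0, 0.0, 3.0],
                   [1.5, -0.5, 0.0]])
dvalues = np.array([[10.0, 20.0, 30.0],
                    [40.0, 50.0, 60.0]])

# Same two steps as ReLu.backprop
dinputs = dvalues.copy()
dinputs[inputs <= 0] = 0

# Only the positions where inputs > 0 keep their gradient (30 and 40 here)
print(dinputs)
```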

As you can see, in both classes we added an attribute called `inputs`, which is saved during the forward pass. We need those inputs when we calculate our gradients during backpropagation. But what about the softmax and loss function derivatives? We can actually combine them, which makes our gradient calculations simpler and faster. Let's start by taking the partial derivative of the loss function with respect to the inputs of the softmax function. We can use the chain rule to do this.

### $\frac{\partial L_{i}}{\partial z_{i,k}}$

This represents the gradient of the loss $L_{i}$ for sample $i$ with respect to the softmax input $z_{i,k}$ for class $k$. Because every softmax output $\hat y_{i,j}$ depends on every input $z_{i,k}$, the chain rule sums over all classes $j$:

### $\frac{\partial L_{i}}{\partial z_{i,k}} = \sum_{j}\frac{\partial L_{i}}{\partial \hat y_{i,j}}\times\frac{\partial \hat y_{i,j}}{\partial z_{i,k}}$

Each term is the gradient of the loss function with respect to a softmax output, times the gradient of that softmax output with respect to the input. The gradients are:

### $\frac{\partial L_{i}}{\partial \hat y_{i,j}} = - \frac {y_{i,j}}{\hat y_{i,j}}$
Calculated from $L_{i} = -\sum_{j} y_{i,j}\log(\hat y_{i,j})$ using the derivative of the logarithm.

There are two cases for the softmax gradient. Since $j$ and $k$ both index classes, either $j=k$ or $j \neq k$:

### $\frac{\partial \hat y_{i,j}}{\partial z_{i,k}} = \hat y_{i, k}(1 - \hat y_{i, k}), (j = k)$

### $\frac{\partial \hat y_{i,j}}{\partial z_{i,k}} = -\hat y_{i, j} \times \hat y_{i, k}, (j \neq k)$

These tell us how an input affects the output of its own class and the outputs of the other classes. Combining both gradients, and using the fact that the true probabilities $y_{i,j}$ sum to 1, we get our solution.

### $\frac{\partial L_{i}}{\partial z_{i,k}} = \hat y_{i, k} - y_{i, k}$
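
The combination step can be written out explicitly. This is a sketch of the algebra, assuming the loss is the categorical cross-entropy $L_{i} = -\sum_{j} y_{i,j}\log(\hat y_{i,j})$ and that the true probabilities for a sample sum to 1 (as they do for one-hot labels). Here $j$ runs over all classes:

$$
\begin{aligned}
\frac{\partial L_{i}}{\partial z_{i,k}}
&= \sum_{j} \left(-\frac{y_{i,j}}{\hat y_{i,j}}\right) \frac{\partial \hat y_{i,j}}{\partial z_{i,k}} \\
&= -\frac{y_{i,k}}{\hat y_{i,k}}\,\hat y_{i,k}(1 - \hat y_{i,k}) + \sum_{j \neq k} \frac{y_{i,j}}{\hat y_{i,j}}\,\hat y_{i,j}\,\hat y_{i,k} \\
&= -y_{i,k} + y_{i,k}\,\hat y_{i,k} + \hat y_{i,k}\sum_{j \neq k} y_{i,j} \\
&= -y_{i,k} + \hat y_{i,k}\sum_{j} y_{i,j} \\
&= \hat y_{i,k} - y_{i,k}
\end{aligned}
$$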

In simpler terms, the gradient of the loss with respect to the softmax inputs is the predicted probability minus the true probability. This is simple to apply: for the incorrect classes the true probability is 0, so we leave the predicted values unchanged, and for the correct class the true probability is 1, so we just subtract 1. Because the loss is averaged over the batch, we also divide the gradient by the number of samples. We set up a combined class, initialize our variables, and add a `backprop` method.
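
The "predicted minus true" rule can be checked numerically for a single sample. The sketch below is standalone: the helper functions `softmax` and `cross_entropy`, the logits in `z`, and the chosen true class are all assumptions for illustration, not part of the notebook's classes.

```python
import numpy as np

def softmax(z):
    # Numerically stable softmax for a single sample
    exp = np.exp(z - np.max(z))
    return exp / np.sum(exp)

def cross_entropy(z, true_class):
    # Loss for one sample: negative log of the correct class probability
    return -np.log(softmax(z)[true_class])

z = np.array([2.0, -1.0, 0.5])  # made-up softmax inputs
true_class = 0

# Analytic gradient: predicted probabilities, minus 1 at the true class
analytic = softmax(z)
analytic[true_class] -= 1

# Numerical gradient by central differences
eps = 1e-6
numeric = np.zeros_like(z)
for k in range(len(z)):
    step = np.zeros_like(z)
    step[k] = eps
    numeric[k] = (cross_entropy(z + step, true_class)
                  - cross_entropy(z - step, true_class)) / (2 * eps)

print(np.allclose(analytic, numeric, atol=1e-6))  # True
```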

In [28]:
class Softmax:
  def forward(self, inputs):
    exp_values = np.exp(inputs - np.max(inputs, axis=1, keepdims=True))
    self.output = exp_values / np.sum(exp_values, axis=1, keepdims=True)

class Loss:
  def calculate(self, y_pred, y_true):
    samples = len(y_pred)
    y_pred = np.clip(y_pred, 1e-7, 1-1e-7)
    correct_confidences = y_pred[range(samples), y_true]
    return -np.mean(np.log(correct_confidences))

class Softmax_Loss:
  def __init__(self):
    self.activation = Softmax()
    self.loss = Loss()
  def forward(self, inputs, y_true):
    self.activation.forward(inputs)
    self.output = self.activation.output
    return self.loss.calculate(self.output, y_true)
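  def backprop(self, dvalues, y_true):
    # dvalues holds the softmax outputs; y_true holds integer class labels
    samples = len(dvalues)
    self.dinputs = dvalues.copy()
    # Predicted minus true: subtract 1 at each sample's correct class
    self.dinputs[range(samples), y_true] -= 1
    # Average over the batch so the gradient matches the mean loss
    self.dinputs /= samples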
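
Both `Loss.calculate` and `Softmax_Loss.backprop` index with `y_true`, so they expect integer class labels rather than one-hot rows. If labels arrive one-hot encoded, one option is to convert them first. The sketch below uses made-up probabilities and labels (assumptions for illustration) and repeats the same gradient steps on them.

```python
import numpy as np

# Made-up one-hot labels for three samples and three classes
y_one_hot = np.array([[0, 1, 0],
                      [1, 0, 0],
                      [0, 0, 1]])

# Convert to integer class labels
y_sparse = np.argmax(y_one_hot, axis=1)
print(y_sparse)  # [1 0 2]

# Made-up softmax outputs for the same three samples
probs = np.array([[0.2, 0.7, 0.1],
                  [0.6, 0.3, 0.1],
                  [0.1, 0.1, 0.8]])

# Same steps as Softmax_Loss.backprop
samples = len(probs)
dinputs = probs.copy()
dinputs[range(samples), y_sparse] -= 1
dinputs /= samples

# Each row of the gradient sums to zero, since each row of probs sums to 1
print(np.allclose(dinputs.sum(axis=1), 0))  # True
```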

# __Training Model__

We are finally finished with backpropagation and can start training our model. We will start by getting our dataset and defining our neural network. Then we will set some hyperparameters:

1. Epochs - How many times the model will pass over the full training data
2. Batch Size - How many samples the model will process at once
3. Learning Rate - How much of the gradient we will use in each update
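
To make these three settings concrete, the sketch below works out how many weight updates they imply. The training-set size `n_train` is a hypothetical number chosen for illustration; the real value depends on the dataset.

```python
import math

n_train = 1000      # hypothetical number of training samples
epochs = 20
batch_size = 32
learning_rate = 0.001

# One update per batch; the last batch may be smaller than batch_size
updates_per_epoch = math.ceil(n_train / batch_size)
total_updates = epochs * updates_per_epoch
print(updates_per_epoch, total_updates)  # 32 640

# A single update moves a weight by learning_rate times its gradient
weight, gradient = 0.5, 2.0  # made-up values
weight -= learning_rate * gradient
print(weight)  # 0.498
```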

After that we will create a training loop that runs once per epoch. Inside it we will shuffle our data and create a second loop that processes the data batch by batch. For each batch we will perform the forward pass and then backpropagation, update our weights and biases using the learning rate and the gradients, and calculate the loss and accuracy so we can display them. After training, we will try our model on a test dataset and see how it does.
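
The shuffling and batching steps can be tried on their own with tiny made-up arrays. In this sketch (the data and batch size are assumptions for illustration), the features and labels are indexed with the same permutation, so every sample keeps its label.

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up data: row i holds the value i, and its label is also i
X = np.arange(10, dtype=float).reshape(10, 1)
y = np.arange(10)

# Shuffle both arrays with the same permutation so pairs stay aligned
indices = rng.permutation(len(X))
X_shuffled, y_shuffled = X[indices], y[indices]

batch_size = 4
batch_sizes = []
for i in range(0, len(X_shuffled), batch_size):
    X_batch = X_shuffled[i:i + batch_size]
    y_batch = y_shuffled[i:i + batch_size]
    # Every sample is still paired with its own label
    assert np.array_equal(X_batch[:, 0], y_batch)
    batch_sizes.append(len(X_batch))

print(batch_sizes)  # [4, 4, 2]
```

Slicing past the end of an array is safe in NumPy, which is why the final batch simply comes out shorter.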

In [29]:
import numpy as np
import pickle
from sklearn.model_selection import train_test_split

with open('dataset.p', 'rb') as file:
  X, y = pickle.load(file)
X = X.reshape(X.shape[0], -1)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

layer_one = Dense(784, 128)
relu_one = ReLu()
layer_two = Dense(128, 64)
relu_two = ReLu()
layer_three = Dense(64, 10)
softmax_loss = Softmax_Loss()

epochs = 20
batch_size = 32
learning_rate = 0.001

for epoch in range(epochs):
  indices = np.arange(len(X_train))
  np.random.shuffle(indices)
  X_train, y_train = X_train[indices], y_train[indices]
  
  for i in range(0, len(X_train), batch_size):
    X_batch, y_batch = X_train[i:i+batch_size], y_train[i:i+batch_size]
     
    layer_one.forward(X_batch)
    relu_one.forward(layer_one.output)
    layer_two.forward(relu_one.output)
    relu_two.forward(layer_two.output)
    layer_three.forward(relu_two.output)
    loss = softmax_loss.forward(layer_three.output, y_batch)

    predictions = np.argmax(softmax_loss.output, axis=1)
    accuracy = np.mean(predictions == y_batch)

    softmax_loss.backprop(softmax_loss.output, y_batch)
    layer_three.backprop(softmax_loss.dinputs)
    relu_two.backprop(layer_three.dinputs)
    layer_two.backprop(relu_two.dinputs)
    relu_one.backprop(layer_two.dinputs)
    layer_one.backprop(relu_one.dinputs)

    layer_one.weights -= learning_rate * layer_one.dweights
    layer_one.biases -= learning_rate * layer_one.dbiases
    layer_two.weights -= learning_rate * layer_two.dweights
    layer_two.biases -= learning_rate * layer_two.dbiases
    layer_three.weights -= learning_rate * layer_three.dweights
    layer_three.biases -= learning_rate * layer_three.dbiases

  print(f'Epoch {epoch+1}/{epochs}, Loss: {loss:.5f}, Accuracy: {accuracy * 100:.2f}%')

Epoch 1/20, Loss: 0.66908, Accuracy: 75.00%
Epoch 2/20, Loss: 0.92674, Accuracy: 78.12%
Epoch 3/20, Loss: 0.38734, Accuracy: 90.62%
Epoch 4/20, Loss: 0.75978, Accuracy: 87.50%
Epoch 5/20, Loss: 0.27943, Accuracy: 90.62%
Epoch 6/20, Loss: 0.32375, Accuracy: 90.62%
Epoch 7/20, Loss: 0.65538, Accuracy: 81.25%
Epoch 8/20, Loss: 0.16089, Accuracy: 96.88%
Epoch 9/20, Loss: 0.64477, Accuracy: 84.38%
Epoch 10/20, Loss: 0.27178, Accuracy: 96.88%
Epoch 11/20, Loss: 0.14102, Accuracy: 93.75%
Epoch 12/20, Loss: 0.45884, Accuracy: 84.38%
Epoch 13/20, Loss: 0.18597, Accuracy: 93.75%
Epoch 14/20, Loss: 0.25083, Accuracy: 93.75%
Epoch 15/20, Loss: 0.09507, Accuracy: 100.00%
Epoch 16/20, Loss: 0.22362, Accuracy: 96.88%
Epoch 17/20, Loss: 0.20952, Accuracy: 93.75%
Epoch 18/20, Loss: 0.03129, Accuracy: 100.00%
Epoch 19/20, Loss: 0.02739, Accuracy: 100.00%
Epoch 20/20, Loss: 0.08755, Accuracy: 96.88%
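
The `print` call sits outside the batch loop, so `loss` and `accuracy` hold the values from the last batch of each epoch only. One possible way to report figures for the whole epoch is to accumulate per-batch results weighted by batch size. The sketch below uses hypothetical per-batch numbers rather than real training output.

```python
# Hypothetical per-batch results: (mean loss, accuracy, batch size)
batch_results = [(0.50, 0.75, 32), (0.40, 0.875, 32), (0.10, 1.00, 8)]

total_loss = 0.0
total_correct = 0.0
total_samples = 0
for loss, accuracy, n in batch_results:
    # Weight each batch by its size so a short final batch counts less
    total_loss += loss * n
    total_correct += accuracy * n
    total_samples += n

epoch_loss = total_loss / total_samples
epoch_accuracy = total_correct / total_samples
print(f'Loss: {epoch_loss:.5f}, Accuracy: {epoch_accuracy * 100:.2f}%')
```

In the training loop, the same three running totals could be reset at the start of each epoch and updated after every batch.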


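Nice, it looks like our model did a great job, with a high accuracy on the training data. Keep in mind that each printed line reports the loss and accuracy of only the last batch in that epoch, which is why the numbers jump around from one epoch to the next. Let's see how it does on the testing data.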

In [30]:
layer_one.forward(X_test)
relu_one.forward(layer_one.output)
layer_two.forward(relu_one.output)
relu_two.forward(layer_two.output)
layer_three.forward(relu_two.output)

test_loss = softmax_loss.forward(layer_three.output, y_test)
print(f'Test Loss: {test_loss:.5f}') 

predictions = np.argmax(softmax_loss.output, axis=1) 
accuracy = np.mean(predictions == y_test)
print(f'Test Accuracy: {accuracy * 100:.2f}%')

Test Loss: 0.25982
Test Accuracy: 93.07%
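
The forward pass is now written out twice, once for training and once for testing. A small helper can remove that repetition. The function `forward_pass` and the stand-in `Scale` layer below are hypothetical names introduced for this sketch; the only thing they rely on is the `forward` method and `output` attribute that the notebook's layers already have.

```python
import numpy as np

class Scale:
    """Stand-in layer (hypothetical) with the same forward/output interface."""
    def __init__(self, factor):
        self.factor = factor
    def forward(self, inputs):
        self.output = inputs * self.factor

def forward_pass(layers, inputs):
    # Feed each layer's output into the next layer and return the last output
    for layer in layers:
        layer.forward(inputs)
        inputs = layer.output
    return inputs

out = forward_pass([Scale(2.0), Scale(3.0)], np.array([[1.0, -1.0]]))
print(out.tolist())  # [[6.0, -6.0]]
```

With the notebook's objects, `forward_pass([layer_one, relu_one, layer_two, relu_two, layer_three], X_test)` would return the values to hand to `softmax_loss.forward`.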


It seems like the testing accuracy is not as high as we would like it to be. Not to worry: next we will cover optimizers and how they can make the learning process faster and more effective for the model.