<a href="https://colab.research.google.com/github/sgcortes/KerasTensor/blob/master/07_Simple_Backprop_using_gradient_Tape.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Automatic Differentiation and GradientTape

* The **backpropagation** algorithm is used heavily in neural networks to update the model's parameters.
* The algorithm works by moving backwards through the network, finding the partial derivatives of the loss function w.r.t. the model's parameters, and then performing the parameter updates.
* A key task in backpropagation is to first find the gradient value for each trainable parameter of the model.
* **The backpropagation algorithm uses the chain rule to find these partial derivatives.**
* Deriving and defining the gradient calculation for each parameter by hand is tedious, because a model can have a very large number of parameters.
* However, when using modern deep learning frameworks such as TensorFlow, PyTorch or MXNet, we generally don't have to calculate these gradients manually. It's done automatically for us.


<img src='https://learnopencv.com/wp-content/uploads/2022/01/c4_02_GradientTape_Poster.png' width=400 align='center'>

## Table of Contents

* [1 Objective](#1-Objective)
* [2 GradientTape](#2-GradientTape)
    * [2.1 Auto-differentiation Example](#2.1-Auto-differentiation-Example)
* [3 GradientTape Arguments](#3-GradientTape-Arguments)
* [4 Backpropagation Example](#4-Backpropagation-Example)
    * [4.1 Problem Setup](#4.1-Problem-Setup)
    * [4.2 Sigmoid Function](#4.2-Sigmoid-Function)
    * [4.3 Binary Cross-Entropy](#4.3-Binary-Cross-Entropy)
    * [4.4 Solving Manually](#4.4-Solving-Manually)
    * [4.5 Using GradientTape](#4.5-Using-GradientTape)
* [5 Direct Implementation using GradientTape](#5-Direct-Implementation-using-GradientTape)

## 1 Objective

In this notebook, we'll study the `GradientTape API` provided by TensorFlow, and see for ourselves how easy it is to perform automatic differentiation.
    
1. First, we will explain what `GradientTape` is and show through an example how it can be used for automatic differentiation.
2. Next, we'll briefly discuss the different GradientTape arguments.
3. Finally, we take a backpropagation example to demonstrate the effectiveness of the GradientTape API.
4. In the section **Backpropagation Example**, we start by explaining the example problem we are trying to solve, then go over the two steps in backpropagation and define the mathematical equations involved. After that, we explore three different ways to solve the example problem.
    1. ***Manually:*** In this section, we solve the problem manually, by first deriving the derivative equations and defining the gradient functions for the individual functions. We then compute the value of each derivative involved in the chain rule and use those values to solve its equations.
    2. ***Using GradientTape:*** Next, you'll see how GradientTape lets us quickly find the gradients for all the individual functions involved, without defining the gradient functions ourselves. We then use these gradient values to solve the chain rule equations.
    3. ***Direct method:*** Finally, we see a more general and direct way of using GradientTape. Here, we only need to focus on executing the forward pass correctly, and at the end we get the final gradients w.r.t. the variables with just one method call. You needn't worry about implementing the chain rule here; GradientTape handles it internally.


## 2 GradientTape

From the TensorFlow documentation:

> ***"TensorFlow provides the tf.GradientTape API for automatic differentiation; that is, computing the gradient of a computation with respect to some inputs, usually tf.Variables. TensorFlow "records" relevant operations executed inside the context of a tf.GradientTape onto a "tape". <br>
TensorFlow then uses that tape to compute the gradients of a "recorded" computation using reverse mode differentiation."***

### 2.1 Auto-differentiation Example

In [None]:
import tensorflow as tf

In [None]:
# Declare a tensorflow Variable

x = tf.Variable(2.0)

Within the **`with tf.GradientTape() as tape`** context manager we'll perform some operations.


> **"To differentiate automatically, TensorFlow needs to remember what operations happen in what order during the forward pass. Then, during the backward pass, TensorFlow traverses this list of operations in reverse order to compute gradients."**

In [None]:
with tf.GradientTape() as tape:
    # perform squaring operation
    y = x ** 2
    
    # Now that GradientTape has recorded the operation  
    # we can calculate the gradient of the operation i.e. dy/dx
    

# Note: We are now outside the GradientTape context.
# Gradient calculations and updates should be performed
# outside the context; otherwise those operations are
# recorded on the tape as well, which increases memory usage.


dy_dx = tape.gradient(y, x)

* The `.gradient(...)` method is responsible for actually calculating the gradients.
<br><br>

**`.gradient(...)` method's signature**

```text
Signature:
tape.gradient(
    target,
    sources,
    output_gradients=None,
    unconnected_gradients=<UnconnectedGradients.NONE: 'none'>,
)

Docstring:
Computes the gradient using operations recorded in context of this tape.

Note: Unless you set `persistent=True` a GradientTape can only be used to
compute one set of gradients (or jacobians).

Args:
  target: a list or nested structure of Tensors or Variables to be
    differentiated.
  sources: a list or nested structure of Tensors or Variables. `target`
    will be differentiated against elements in `sources`.
  output_gradients: a list of gradients, one for each element of
    target. Defaults to None.
  unconnected_gradients: a value which can either hold 'none' or 'zero' and
    alters the value which will be returned if the target and sources are
    unconnected. The possible values and effects are detailed in
    'UnconnectedGradients' and it defaults to 'none'.

Returns:
  a list or nested structure of Tensors (or IndexedSlices, or None),
  one for each element in `sources`. Returned structure is the same as
  the structure of `sources`.
```
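
To make the `sources` and `unconnected_gradients` arguments more concrete, here is a minimal sketch. The variable names `x` and `z` are illustrative and not part of the original example; `z` is deliberately left out of the computation so that it is unconnected to the target.

```python
import tensorflow as tf

x = tf.Variable(3.0)
z = tf.Variable(5.0)  # illustrative: never used when computing y

# persistent=True because .gradient(...) is called twice below
with tf.GradientTape(persistent=True) as tape:
    y = x ** 2

# Default ('none'): an unconnected source yields None
grads_none = tape.gradient(y, [x, z])

# 'zero': an unconnected source yields a zero tensor instead
grads_zero = tape.gradient(
    y, [x, z], unconnected_gradients=tf.UnconnectedGradients.ZERO
)

del tape  # release the resources held by the persistent tape

print(grads_none)
print(grads_zero)
```

The returned structure mirrors `sources`: a list of two sources gives back a list of two gradients.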

In [None]:
# Gradient value

print(dy_dx.numpy())

We got an output of 4.0.<br>
Now, let's calculate and check it manually:

**Derivative:**<br>

$$y = x^{2} $$

$$\frac{dy}{dx} = \frac{d (x^{2})}{dx}$$ <br><br>

Using the power rule, $\mathbf{\frac{d (x^{n})}{dx} = n * x^{n-1}}$. Therefore,

$$\frac{dy}{dx} = 2 * x^{2-1}$$

$$\frac{dy}{dx} = 2 * x $$

$$\frac{dy}{dx} = 2 * 2 $$

$$\frac{dy}{dx} = 4 $$
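
The same value can also be sanity-checked numerically. The sketch below uses a central finite difference in plain Python; the helper name `central_difference` and the step size `h` are assumptions chosen for illustration, not something TensorFlow uses internally.

```python
def f(x):
    # The function being differentiated: y = x^2
    return x ** 2


def central_difference(func, x, h=1e-5):
    # Approximates d(func)/dx at x using a symmetric difference quotient
    return (func(x + h) - func(x - h)) / (2 * h)


approx = central_difference(f, 2.0)
print(approx)  # should be very close to the analytic value 2 * x = 4
```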



## 3 GradientTape Arguments

<hr style="border:none; height: 4px; background-color:#D3D3D3" />

The `tf.GradientTape` constructor takes two parameters:

* **watch_accessed_variables:** (Boolean, Default: `True`) Controls whether the tape will automatically `watch` any (trainable) variables that are accessed while the tape is active. This means gradients can be requested from any result computed in the tape derived from reading a trainable Variable. If False, users must explicitly `watch` any Variables they want to request gradients from.
* **persistent:** (Boolean, Default: `False`) Controls whether or not to create a persistent gradient tape. This should be used when you need to compute more than one set of gradients.
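
A short sketch of both arguments in use. The values and the names `c`, `u` and `z` are illustrative only.

```python
import tensorflow as tf

x = tf.Variable(2.0)
c = tf.constant(3.0)

# persistent=True: the same tape can serve several .gradient(...) calls
with tf.GradientTape(persistent=True) as tape:
    y = x ** 2
    z = y ** 2  # z = x^4

dy_dx = tape.gradient(y, x)  # 2 * x
dz_dx = tape.gradient(z, x)  # 4 * x^3
del tape  # free the tape's resources once we are done with it

# watch_accessed_variables=False: nothing is watched automatically,
# so only the explicitly watched tensor `c` gets a gradient
with tf.GradientTape(watch_accessed_variables=False) as tape:
    tape.watch(c)
    u = x * c

du_dx, du_dc = tape.gradient(u, [x, c])

print(dy_dx, dz_dx)
print(du_dx, du_dc)  # du_dx is None because x was not watched
```

Note that `tape.watch(...)` also works on constants, which are never watched automatically.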
    

<hr style="border:none; height: 4px; background-color:#D3D3D3" />

## 4 Backpropagation Example

### 4.1 Problem Setup

**Our goal here is to show the effectiveness of the `GradientTape` API.<br>
We'll demonstrate how you can use the GradientTape API to minimize a loss function using the backpropagation algorithm.**

We will take an example that will include operations closely resembling a deep learning pipeline.
<img src='https://learnopencv.com/wp-content/uploads/2022/01/c4_02_weight_update.png' width=600 align='center'>

**Let's define our equations:**


$$ A_{1} = x_{1} * w_{1} + x_{2} * w_{2} + b$$

$$ A_{2} = sigmoid(A_{1})$$

$$ A_{3} = J(A_{2}, Y)$$




___

1. In the above equations, we have:
    1. 3 scalar variables: $\bf{w_{1}, w_{2}}$ and $\bf{b}$
    2. 2 input scalar constants $\bf{x_{1}}$ and $\bf{x_{2}}$
2. $Y$ represents the constant scalar ground-truth class.
3. $A_{1}$ computes the weighted addition of the inputs.
4. $A_{2}$ applies the Sigmoid function ($\sigma(x)$) to $A_{1}$.
5. $A_{3}$ applies the Binary cross-entropy loss function ($J(...)$) to calculate the loss between $A_{2}$ and $Y$.

The Backpropagation algorithm involves two steps:
1. Using the **Chain rule** to get the gradient of loss with respect to the variables.
2. Updating the variables using the **weight-update rule.**



We will first use the Chain rule to find out gradients: $A_{3}'(w_{1})$, $A_{3}'(w_{2})$ and $A_{3}'(b)$. Then update the variables using the weight-update rule to show the decrease in loss.



<hr style="border:none; height: 4px; background-color:#D3D3D3" />


**The Chain rule equations for the above-mentioned definitions are as follows:**

1. $A_{3}'(w_{1})$ is given by:
$$\frac{\partial (A_{3})}{\partial (w_{1})} = \frac{\partial (A_{3})}{\partial (A_{2})} \;*\; \frac{\partial (A_{2})}{\partial (A_{1})} \;*\; \frac{\partial (A_{1})}{\partial (w_{1})}$$


2. $A_{3}'(w_{2})$ is given by:
$$\frac{\partial (A_{3})}{\partial (w_{2})} = \frac{\partial (A_{3})}{\partial (A_{2})} \;*\; \frac{\partial (A_{2})}{\partial (A_{1})}\;*\; \frac{\partial (A_{1})}{\partial (w_{2})}$$


3. $A_{3}'(b)$ is given by:
$$\frac{\partial (A_{3})}{\partial (b)} = \frac{\partial (A_{3})}{\partial (A_{2})} \;*\; \frac{\partial (A_{2})}{\partial (A_{1})}\;*\; \frac{\partial (A_{1})}{\partial (b)}$$


<hr style="border:none; height: 4px; background-color:#D3D3D3" />

**Weight-Update Rule**

After calculating the gradients, you can update the weights using the following equations:

$$w_{1} \leftarrow w_{1} - \gamma \frac{\partial A_{3}}{\partial w_{1}}$$

$$w_{2} \leftarrow w_{2} - \gamma \frac{\partial A_{3}}{\partial w_{2}}$$


$$b \leftarrow b - \gamma \frac{\partial A_{3}}{\partial b}$$

where $\gamma$ represents the learning rate.
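
As a minimal illustration of the update rule, the sketch below applies a single step to a hypothetical one-parameter loss $L(w) = (w - 3)^{2}$. This toy loss is an assumption made for illustration and is not the loss used in this notebook.

```python
def loss(w):
    # Hypothetical loss with its minimum at w = 3
    return (w - 3.0) ** 2


def grad(w):
    # Analytic derivative of the hypothetical loss
    return 2.0 * (w - 3.0)


w = 0.0
gamma = 0.1  # learning rate

loss_before = loss(w)
w = w - gamma * grad(w)  # weight-update rule: w <- w - gamma * dL/dw
loss_after = loss(w)

print(w, loss_before, loss_after)
```

Because the step moves against the gradient, the loss after the update is smaller than before, provided the learning rate is small enough.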

###  4.2 Sigmoid Function

**The Sigmoid function and its derivative are given by:**

Let $\mathbf{y' = \sigma(z)}$, then:
$$y' = \sigma(z) = \frac{1}{1 + e^{-z}}$$

**Derivative of sigmoid w.r.t. its input.**

$$\frac{\partial y'}{\partial z}  = \sigma(z)(1 - \sigma(z)) = y'(1-y')$$
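
A quick way to convince yourself of this identity is to compare it with the gradient that `GradientTape` reports. The sketch below uses TensorFlow's built-in `tf.math.sigmoid` and an arbitrary input value chosen for illustration.

```python
import tensorflow as tf

z = tf.Variable(1.28)  # arbitrary input value

with tf.GradientTape() as tape:
    y_prime = tf.math.sigmoid(z)

# Gradient computed by automatic differentiation
auto_grad = tape.gradient(y_prime, z)

# Gradient from the closed-form expression y' * (1 - y')
closed_form = y_prime * (1.0 - y_prime)

print(auto_grad.numpy(), closed_form.numpy())
```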

### 4.3 Binary Cross-Entropy

The Binary cross-entropy is given by:

$$ J(y^{'}, y) = -y\log(y^{'}) - (1-y)\log(1-y^{'})$$


Derivative of Binary cross-entropy w.r.t its input $y'$.

$$\frac{\partial J(y',\;y)}{\partial y'} = -\frac{y}{y'} + \frac{1-y}{1-y'}$$
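
The derivative can be checked numerically with a central finite difference. This is a plain-Python sketch; the prediction `y_hat = 0.8`, the label `y = 1.0` and the step size `h` are illustrative assumptions.

```python
import math


def bce(y_hat, y):
    # Binary cross-entropy for a single prediction
    return -y * math.log(y_hat) - (1 - y) * math.log(1 - y_hat)


def grad_bce(y_hat, y):
    # Closed-form derivative of BCE w.r.t. the prediction y_hat
    return -y / y_hat + (1 - y) / (1 - y_hat)


y_hat, y = 0.8, 1.0
h = 1e-6

numeric = (bce(y_hat + h, y) - bce(y_hat - h, y)) / (2 * h)
analytic = grad_bce(y_hat, y)

print(numeric, analytic)  # the two values should agree closely
```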



Now, let's derive the individual derivatives in the Chain rule equations, one by one:<br><br>


1) $\mathbf{\large{\frac{\partial(A_{3})}{\partial(A_{2})}}} = \large{\frac{\partial J(A_{2},\;Y)}{\partial A_{2}}} = \mathbf{\large-\frac{Y}{A_{2}} + \frac{1\;-\;Y}{1\;-\;A_{2}}}$ <br><br><br>


2) $\large{\mathbf{\frac{\partial(A_{2})}{\partial(A_{1})}} = \frac{\partial (\sigma(A_{1}))}{\partial A_{1}} = \small{\mathbf{\sigma(A_{1}) * (1 - \sigma(A_{1}))}}}$ <br><br><br>

3) $\large{\mathbf{\frac{\partial (A_{1})}{\partial (w_{1})}} = \frac{\partial (x_{1} * w_{1}\;+\;x_{2} * w_{2}\;+\;b)}{\partial (w_{1})} =  \frac{\partial (x_{1} * w_{1})}{\partial(w_{1})} = \mathbf{x_{1}}}$ <br><br><br>

4) $\large{\mathbf{\frac{\partial (A_{1})}{\partial (w_{2})}} = \frac{\partial (x_{1} * w_{1}\;+\;x_{2} * w_{2}\;+\;b)}{\partial (w_{2})} =  \frac{\partial (x_{2} * w_{2})}{\partial(w_{2})} = \mathbf{x_{2}}}$ <br><br><br>


5) $\large{\mathbf{\frac{\partial (A_{1})}{\partial (b)}} = \frac{\partial (x_{1} * w_{1}\;+\;x_{2} * w_{2}\;+\;b)}{\partial (b)} =  \frac{\partial (b)}{\partial(b)} = \mathbf{1}}$ <br><br><br>
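
As a side note that goes slightly beyond the step-by-step derivation above, multiplying derivatives 1) and 2) gives a compact expression, which is handy as a sanity check on the numbers computed later:

$$\frac{\partial A_{3}}{\partial A_{1}} = \left(-\frac{Y}{A_{2}} + \frac{1-Y}{1-A_{2}}\right) * A_{2}(1 - A_{2})$$

$$\frac{\partial A_{3}}{\partial A_{1}} = -Y(1 - A_{2}) + (1 - Y)A_{2}$$

$$\frac{\partial A_{3}}{\partial A_{1}} = A_{2} - Y$$

Combined with derivatives 3) to 5), this means $\frac{\partial A_{3}}{\partial w_{1}} = (A_{2} - Y) * x_{1}$, $\frac{\partial A_{3}}{\partial w_{2}} = (A_{2} - Y) * x_{2}$ and $\frac{\partial A_{3}}{\partial b} = A_{2} - Y$.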

Let's define the constants and variables that we'll be using throughout this example

In [None]:
# let's first define the constants and variables

x1 = tf.constant(1.3, name="x1")
x2 = tf.constant(2.1, name="x2")
lr = tf.constant(0.1, name="learning_rate")
Y  = tf.constant(1.0, name="ground_truth")

# ---------
w1 = tf.Variable(0.7, name="w1")
w2 = tf.Variable(-0.3, name="w2")
b  = tf.Variable(1.0, name="b")

# Text formatting
bold = "\033[1m"
end = "\033[0m"

### 4.4 Solving Manually

In this section, we define the functions necessary for manually solving the Chain rule.

Here we are defining the functions for the involved equations, along with their gradient functions.

| Method|Notes|
|:----|:----|
| `wx_plus_b(...)`    |Implements the first equation $A_{1} = w_{1}*x_{1} + w_{2}*x_{2} + b$.|
|`grad_wx_plus_b(...)`|Returns the gradients of $A_{1}$ w.r.t. the variables $w_{1}$, $w_{2}$ and $b$ during the backward pass.|
|`sigmoid(...)`|Function to implement the sigmoid activation function in equation $A_{2}$.|
|`grad_sigmoid(...)`|Returns the gradient of the `sigmoid` function w.r.t. its input.|
|`bce_loss(...)`|Function to implement the binary cross entropy loss function used in equation $A_{3}$.|
|`grad_bce_loss(...)`|Returns the gradient of the `bce_loss` function w.r.t. its input.|

In [None]:
# Implementation for equation A1:  x1w1 + x2w2 + b and its derivative

def wx_plus_b(x1, x2, w1, w2, b):
    return x1 * w1 + x2 * w2 + b


# Partial derivatives of A1 w.r.t. w1, w2 and b: (x1, x2, 1)
def grad_wx_plus_b(x1, x2):
    return x1, x2, tf.constant(1.0)

# -------------------------------------

# Implementation for equation A2: Sigmoid and its derivative

def sigmoid(x):
    return 1 / (1 + tf.math.exp(-x))


# Derivative of sigmoid w.r.t. its input.
def grad_sigmoid(x):
    return sigmoid(x) * (1.0 - sigmoid(x))


# -------------------------------------

# Implementation for equation A3: Binary cross-entropy and its derivative

def bce_loss(y_hat, y):
    loss = -(y * tf.math.log(y_hat)) - ((1 - y) * tf.math.log(1.0 - y_hat))
    return loss


# Derivative of binary cross-entropy w.r.t. its input.
def grad_bce_loss(y_hat, y):
    return -(y / y_hat) + ((1.0 - y) / (1.0 - y_hat))
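
As a cross-check, the hand-written functions can be compared against TensorFlow's built-ins. The sketch below repeats `sigmoid` and `bce_loss` so that it runs on its own, and uses `tf.nn.sigmoid_cross_entropy_with_logits`, which computes the same loss directly from the pre-activation value. The input values are illustrative.

```python
import tensorflow as tf


def sigmoid(x):
    return 1 / (1 + tf.math.exp(-x))


def bce_loss(y_hat, y):
    return -(y * tf.math.log(y_hat)) - ((1 - y) * tf.math.log(1.0 - y_hat))


logit = tf.constant(1.28)  # illustrative pre-activation value
label = tf.constant(1.0)

hand_sigmoid = sigmoid(logit)
builtin_sigmoid = tf.math.sigmoid(logit)

hand_loss = bce_loss(hand_sigmoid, label)
# The built-in takes the logit, not the sigmoid output
builtin_loss = tf.nn.sigmoid_cross_entropy_with_logits(labels=label, logits=logit)

print(hand_sigmoid.numpy(), builtin_sigmoid.numpy())
print(hand_loss.numpy(), builtin_loss.numpy())
```

The hand-written versions are fine for this small example. For predictions very close to 0 or 1 the logarithm can overflow, which is one reason the logit-based built-in is often preferred in practice.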

In the **`forward`** function, the input data is passed through the equations one by one in a forward manner.

In [None]:
def forward(x1, x2, w1, w2, b, Y):
    A1 = wx_plus_b(x1, x2, w1, w2, b)
    A2 = sigmoid(A1)
    A3 = bce_loss(A2, Y)
    
    return_dict = {
        "A1": A1,
        "A2": A2,
        "A3": A3
    }
    return return_dict

The **`backward`** function is responsible for calculating the derivative of $A_{3}$ w.r.t. the variables $w_{1}, w_{2}, b$.<br>
In this function, we first calculate the gradient values for each partial derivative involved in the chain rule equation and then we use them to solve the chain rule.

In [None]:
def backward(x1, x2, Y, A1, A2):
    
    # Compute the gradients of A3 w.r.t  A2 i.e dA3/dA2
    d_bce_loss = grad_bce_loss(A2, Y)

    # Compute the gradients A2 w.r.t A1 i.e dA2/dA1
    d_sigmoid = grad_sigmoid(A1)

    # Compute the gradients of the weighted sum A1 w.r.t. weights and bias
    # dA1/dw1, dA1/dw2, dA1/db
    d_w1, d_w2, d_b = grad_wx_plus_b(x1, x2)

    # Using chain rule to find overall gradient of Loss w.r.t weights and bias
    w1_grad = d_bce_loss * d_sigmoid * d_w1
    w2_grad = d_bce_loss * d_sigmoid * d_w2
    b_grad  = d_bce_loss * d_sigmoid * d_b
    
    
    return_dict = {
        "dA3_dA2": d_bce_loss,
        "dA2_dA1": d_sigmoid,
        "dA1_dw1": d_w1,
        "dA1_dw2": d_w2,
        "dA1_db" : d_b,
        "dA3_dw1": w1_grad,
        "dA3_dw2": w2_grad,
        "dA3_db" : b_grad,
    }

    return return_dict

**Next we will:**

1. Execute the `forward` function to get the `initial Loss`.
2. Execute the `backward` function to get the derivative of `Loss` w.r.t. the variables.
3. Perform the **weight updates**.
4. Print the **`New Loss`**. 

In [None]:
# Performing forward pass

forward_outputs = forward(x1, x2, w1, w2, b, Y)

print(f"{bold}Forward Pass:{end}\n")

print(f"{bold}A1:{end} {forward_outputs['A1'].numpy()}")
print(f"{bold}A2:{end} {forward_outputs['A2'].numpy()}")
print(f"{bold}A3:{end} {forward_outputs['A3'].numpy()} <---{bold} Initial Loss{end}")

In [None]:
# Performing backward pass

A1 = forward_outputs['A1']
A2 = forward_outputs['A2']

backward_outputs = backward(x1, x2, Y, A1, A2)

print(f"{bold}Backward Pass: Step 1{end}\n")
print(f"{bold}Individual Derivatives:{end}\n")

print(f"{bold}dA3/dA2{end} = {backward_outputs['dA3_dA2']}")
print(f"{bold}dA2/dA1{end} = {backward_outputs['dA2_dA1']}")
print(f"{bold}dA1/dw1{end} = {backward_outputs['dA1_dw1']}")
print(f"{bold}dA1/dw2{end} = {backward_outputs['dA1_dw2']}")
print(f"{bold}dA1/db {end} = {backward_outputs['dA1_db']}")

print("\n-----------------\n")

print(f"{bold}Gradient of A3 w.r.t. variables:{end}\n")

print(f"{bold}dA3/dw1{end} = {backward_outputs['dA3_dw1']}")
print(f"{bold}dA3/dw2{end} = {backward_outputs['dA3_dw2']}")
print(f"{bold}dA3/db {end} = {backward_outputs['dA3_db']}")

The **`weight_update`** function applies the weight-update rule to change the weight values during backpropagation. 

In [None]:
def weight_update(w1, w2, b, dw1, dw2, db, lr):
    
    # w1, w2 and b are objects of tf.Variable class
    # They are updated in place
    
    w1.assign_sub(lr * dw1) # w1 = w1 - lr * dw1
    w2.assign_sub(lr * dw2)
    b.assign_sub(lr * db)

    return w1, w2, b

In [None]:
w1_grad = backward_outputs["dA3_dw1"]
w2_grad = backward_outputs["dA3_dw2"]
b_grad  = backward_outputs["dA3_db"]

# Keep a copy of the old w and b values for comparison,
# as w and b will be updated in place

w1_old = tf.identity(w1, name="old_w1")
w2_old = tf.identity(w2, name="old_w2")
b_old  = tf.identity(b,  name="old_b")

# Perform Weight Update
w1_updated, w2_updated, b_updated = weight_update(w1, w2, b, w1_grad, w2_grad, b_grad, lr)

print(f"{bold}Backward Pass: Step 2{end}\n")
print(f"{bold}Parameter Updates{end}\n")

print(f"{bold}w1{end} --> {bold}Old:{end} {w1_old.numpy():<20} {bold}New:{end} {w1_updated.numpy()}")
print(f"{bold}w2{end} --> {bold}Old:{end} {w2_old.numpy():<20} {bold}New:{end} {w2_updated.numpy()}")
print(f"{bold}b{end}  --> {bold}Old:{end} {b_old.numpy():<19}  {bold}New:{end} {b_updated.numpy()}")

**Comparing the Old and New Loss**

In [None]:
# New loss

new_forward_outputs = forward(x1, x2, w1_updated, w2_updated, b_updated, Y)

old_A3 = forward_outputs["A3"]
new_A3 = new_forward_outputs["A3"]

# We can also pass w1, w2 and b directly, as the variables are updated in place
# new_loss = forward(x1, x2, w1, w2, b, Y)["A3"]

print(f"{bold}Checking New Loss{end}:\n")

print(f"{bold}LOSS{end} --> {bold}Old:{end} {old_A3.numpy():<20} {bold}New:{end} {new_A3.numpy()}")

### 4.5 Using GradientTape

In [None]:
# Redefine the constants and variables

x1 = tf.constant(1.3, name="x1")
x2 = tf.constant(2.1, name="x2")
lr = tf.constant(0.1, name="learning_rate")
Y  = tf.constant(1.0, name="ground_truth")

# ---------
w1 = tf.Variable(0.7, name="w1")
w2 = tf.Variable(-0.3, name="w2")
b  = tf.Variable(1.0, name="b")

* Using `GradientTape` to perform automatic differentiation.
* Here, `persistent=True` because we'll be calling the `.gradient(...)` method more than once.

In [None]:
with tf.GradientTape(persistent=True) as tape:
    # record operations
    A1 = w1 * x1 + w2 * x2 + b
    A2 = sigmoid(A1)
    A3 = bce_loss(A2, Y)

In [None]:
print(f"{bold}Forward Pass:{end}\n")

print(f"{bold}A1:{end} {A1.numpy()}")
print(f"{bold}A2:{end} {A2.numpy()}")
print(f"{bold}A3:{end} {A3.numpy()} <---{bold} Initial Loss{end}")

In [None]:
print(f"{bold}Backward Pass: Step 1{end}\n")
print(f"{bold}Individual Derivatives:{end}\n")


dA3_dA2 = tape.gradient(A3, A2)

dA2_dA1 = tape.gradient(A2, A1)

dA1_dw1 = tape.gradient(A1, w1)

dA1_dw2 = tape.gradient(A1, w2)

dA1_db  = tape.gradient(A1, b)


print(f"{bold}dA3/dA2{end} = {dA3_dA2}")
print(f"{bold}dA2/dA1{end} = {dA2_dA1}")
print(f"{bold}dA1/dw1{end} = {dA1_dw1}")
print(f"{bold}dA1/dw2{end} = {dA1_dw2}")
print(f"{bold}dA1/db{end} =  {dA1_db}")

print("\n-----------------\n")


# implementing Chain rule
dA3_dw1 = dA3_dA2 * dA2_dA1 * dA1_dw1
dA3_dw2 = dA3_dA2 * dA2_dA1 * dA1_dw2
dA3_db  = dA3_dA2 * dA2_dA1 * dA1_db

print(f"{bold}Gradient of A3 w.r.t. variables:{end}\n")

print(f"{bold}dA3/dw1{end} = {dA3_dw1}")
print(f"{bold}dA3/dw2{end} = {dA3_dw2}")
print(f"{bold}dA3/db{end}  = {dA3_db}")

**You can clearly see that the values calculated manually are the same as those derived via GradientTape.**

<hr style="border:none; height: 4px; background-color:#D3D3D3" />

## 5 Direct Implementation using GradientTape

TensorFlow also allows a more general and direct approach:

> ***Here, we directly pass all the variables with respect to which we want to calculate the derivatives. TensorFlow then applies the chain rule internally and returns the final gradients.***

In [None]:
# Redefine the constants and variables

x1 = tf.constant(1.3, name="x1")
x2 = tf.constant(2.1, name="x2")
lr = tf.constant(0.1, name="learning_rate")
Y  = tf.constant(1.0, name="ground_truth")

# ---------
w1 = tf.Variable(0.7, name="w1")
w2 = tf.Variable(-0.3, name="w2")
b  = tf.Variable(1.0, name="b")

In [None]:
def compute(x1, x2, w1, w2, b, Y):
    
    # Notice we have not used persistent=True
    
    with tf.GradientTape() as tape:
        outputs = forward(x1, x2, w1, w2, b, Y)
    
    # passing all variables with respect to which 
    # we want to calculate the derivative of A3
    grads = tape.gradient(outputs["A3"], [w1, w2, b])
    
    return outputs, grads

In [None]:
forward_outputs, gradients = compute(x1, x2, w1, w2, b, Y)

print(f"{bold}Backward Pass: Step 1{end}\n")
print(f"{bold}Direct Gradient of A3 w.r.t. variables using GradientTape:{end}\n")

print(f"{bold}dA3/dw1{end} = {gradients[0]}")
print(f"{bold}dA3/dw2{end} = {gradients[1]}")
print(f"{bold}dA3/db{end}  = {gradients[2]}")

In [None]:
# Keep a copy of the old w and b values for comparison,
# as w and b will be updated in place

w1_old = tf.identity(w1, name="old_w1")
w2_old = tf.identity(w2, name="old_w2")
b_old  = tf.identity(b,  name="old_b")

# Perform Weight Update

w1_updated, w2_updated, b_updated = weight_update(w1, w2, b, gradients[0], gradients[1], gradients[2], lr)

print(f"{bold}Backward Pass: Step 2{end}\n")
print(f"{bold}Parameter Updates{end}\n")

print(f"{bold}w1{end} --> {bold}Old:{end} {w1_old.numpy():<20} {bold}New:{end} {w1_updated.numpy()}")
print(f"{bold}w2{end} --> {bold}Old:{end} {w2_old.numpy():<20} {bold}New:{end} {w2_updated.numpy()}")
print(f"{bold}b{end}  --> {bold}Old:{end} {b_old.numpy():<19}  {bold}New:{end} {b_updated.numpy()}")

In [None]:
# New loss computation

new_forward_outputs = forward(x1, x2, w1_updated, w2_updated, b_updated, Y)

old_A3 = forward_outputs["A3"]
new_A3 = new_forward_outputs["A3"]

# We can also pass w1, w2 and b directly, as the variables are updated in place
# new_loss = forward(x1, x2, w1, w2, b, Y)["A3"]

print(f"{bold}Checking New Loss{end}:\n")

print(f"{bold}LOSS{end} --> {bold}Old:{end} {old_A3.numpy():<20} {bold}New:{end} {new_A3.numpy()}")
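
The single update above can be repeated in a loop. The following self-contained sketch shows one possible way to do that with the direct method. The step count of 5 and the use of `tf.math.sigmoid` in place of the hand-written `sigmoid` are choices made for illustration.

```python
import tensorflow as tf

x1, x2, Y = tf.constant(1.3), tf.constant(2.1), tf.constant(1.0)
w1, w2, b = tf.Variable(0.7), tf.Variable(-0.3), tf.Variable(1.0)
lr = 0.1

losses = []
for step in range(5):
    # Forward pass, recorded on a fresh tape each iteration
    with tf.GradientTape() as tape:
        A1 = x1 * w1 + x2 * w2 + b
        A2 = tf.math.sigmoid(A1)
        loss = -(Y * tf.math.log(A2)) - (1.0 - Y) * tf.math.log(1.0 - A2)

    # Backward pass: gradients of the loss w.r.t. all variables in one call
    grads = tape.gradient(loss, [w1, w2, b])

    # Weight-update rule, applied in place
    for var, grad in zip([w1, w2, b], grads):
        var.assign_sub(lr * grad)

    losses.append(float(loss))
    print(f"step {step}: loss = {losses[-1]:.5f}")
```

Each iteration creates a new non-persistent tape, since `.gradient(...)` is called only once per forward pass. The printed loss should decrease from one step to the next.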