<a href="https://colab.research.google.com/github/stephenbeckr/convex-optimization-class/blob/main/Demos/AutoDiffByHand.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Forward vs Reverse mode Autodiff: explicit example
Copied from 2024 SciML class
#### Background
If $F: \mathbb{R}^n \to \mathbb{R}^m$, we can write $F(\vec{x}) = ( F_i(\vec{x}) )_{i=1}^m$ for component functions $F_i : \mathbb{R}^n \to \mathbb{R}$, and we define the Jacobian $J_F$ to be the $m\times n$ matrix of partial derivatives, so that the $(i,j)$ entry of $J_F$ is $\frac{\partial F_i}{\partial x_j}(\vec{x})$.  *Note that if $m=1$ then the Jacobian is just the transpose of the gradient.*

For multivariate functions, because derivatives (i.e., the Jacobian) are matrices and because matrix multiplication does not commute, we have to be careful with the order we write the chain rule in. The correct order is:
$$J_{f \circ g}(\vec{x}) = J_f(g(\vec{x})) \cdot J_g(\vec{x}).$$
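
As a quick numerical sanity check of this ordering, here is a minimal sketch using `torch.autograd.functional.jacobian`. The functions `inner` and `outer` and the matrix `M` are made up for this illustration only; they play the roles of $g$ and $f$ above.

```python
import torch
from torch.autograd.functional import jacobian

torch.manual_seed(0)
M = torch.randn(3, 4, dtype=torch.float64)  # hypothetical matrix, used only in this example

def inner(x):
    """Plays the role of g: R^4 -> R^3."""
    return torch.tanh(M @ x)

def outer(y):
    """Plays the role of f: R^3 -> R^2."""
    return torch.stack([y.sum(), (y ** 2).sum()])

x0 = torch.randn(4, dtype=torch.float64)

# Jacobian of the composition, computed directly: shape (2, 4)
J_composed = jacobian(lambda x: outer(inner(x)), x0)

# Chain rule: J_f evaluated at g(x0), times J_g evaluated at x0: (2, 3) @ (3, 4)
J_chain = jacobian(outer, inner(x0)) @ jacobian(inner, x0)

chain_rule_holds = torch.allclose(J_composed, J_chain)
print(J_composed.shape, chain_rule_holds)
```

Note that the product in the other order, a $3\times 4$ matrix times a $2\times 3$ matrix, is not even defined here, which is one way to catch an ordering mistake.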

In [1]:
import torch
import matplotlib.pyplot as plt
import sys
import numpy as np
print("Torch version is", torch.__version__)
print("Numpy version is", np.__version__)
print("Python version is", sys.version)

Torch version is 2.6.0+cu124
Numpy version is 2.0.2
Python version is 3.11.12 (main, Apr  9 2025, 08:55:54) [GCC 11.4.0]


### Problem 1
We'll explore forward-mode and reverse-mode automatic differentiation. Implementing autodiff to work in general requires a lot of programming (especially for reverse-mode), so instead we'll specialize to one particular function, and choose a function with a straightforward "computational graph". Let
$$f:\mathbb{R}^{d_0} \to \mathbb{R}, \quad f(\vec{x}) = \text{sum}\left( \sigma( B \cdot \sigma( A \cdot \vec{x} ) ) \right)$$
where $\vec{x}\in\mathbb{R}^{d_0}$, $A \in \mathbb{R}^{d_1 \times d_0}$, $B\in \mathbb{R}^{d_2\times d_1}$ and $\sigma(\alpha) = (1+e^{-\alpha})^{-1}$ is the 1D logistic function (aka *the* "sigmoid" in ML terminology) and $\sigma$ applied to a vector is done componentwise.  Basically, this is a simple feed-forward neural net.  We're going to compute the gradient of $f$, $\nabla f(\vec{x})$, aka $J_f(\vec{x})^\top$.  *Be careful: in neural net training, we take gradients with respect to the weight matrices, but in this problem we treat the weights as fixed and differentiate with respect to $\vec{x}$, which is slightly simpler because $\vec{x}$ is a vector rather than a matrix.*

#### Part 1a
Let $h_1(\vec{x}) = A \cdot \vec{x}$, $h_2( \vec{y} ) = \sigma(\vec{y})$, $h_3(\vec{y}) = B\cdot \vec{y}$, $h_4= h_2$, and $h_5(\vec{y})= \text{sum}(\vec{y})$.
Then
$$f(\vec{x}) = \text{sum}\Big( \overbrace{\sigma( \overbrace{B \cdot \underbrace{\sigma( \underbrace{A \cdot \vec{x}}_{\vec{y}_1} )}_{\vec{y}_2} }^{\vec{y}_3} )}^{\vec{y}_4} \Big)
= h_5(h_4(h_3(h_2(h_1(\vec{x})))))$$
so we can write the Jacobian of $f$ as
$$J_f(\vec{x}) = J_{h_5}( \vec{y}_4) \cdot J_{h_4}( \vec{y}_3 ) \cdot
 J_{h_3}( \vec{y}_2 ) \cdot
  J_{h_2}( \vec{y}_1 ) \cdot J_{h_1}( \vec{x} ).$$
For part (a), mathematically work out what the Jacobian of each of the $h_k$ functions is and write out your answer.
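
One way to write out the answer, consistent with the code in the later parts (here $\text{diag}(\vec{v})$ denotes the diagonal matrix with $\vec{v}$ on its diagonal, $\vec{1}$ is the all-ones vector, and $\sigma'$ is applied componentwise; this notation is introduced only for this summary):

$$
\begin{aligned}
J_{h_1}(\vec{x}) &= A \in \mathbb{R}^{d_1 \times d_0}, \\
J_{h_2}(\vec{y}_1) &= \text{diag}\big(\sigma'(\vec{y}_1)\big) \in \mathbb{R}^{d_1 \times d_1}, \\
J_{h_3}(\vec{y}_2) &= B \in \mathbb{R}^{d_2 \times d_1}, \\
J_{h_4}(\vec{y}_3) &= \text{diag}\big(\sigma'(\vec{y}_3)\big) \in \mathbb{R}^{d_2 \times d_2}, \\
J_{h_5}(\vec{y}_4) &= \vec{1}^\top \in \mathbb{R}^{1 \times d_2},
\end{aligned}
\qquad \text{where } \sigma'(\alpha) = \sigma(\alpha)\big(1-\sigma(\alpha)\big).
$$

The dimensions chain together as $(1\times d_2)(d_2\times d_2)(d_2\times d_1)(d_1\times d_1)(d_1\times d_0)$, giving a $1\times d_0$ Jacobian.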

### Part 1b: implement the function
Implement the function $f$ in code, and use an existing automatic differentiation package (I suggest PyTorch) to get the gradient, which we will later use to check the correctness of our code.  *I suggest choosing moderate values for $d_0,d_1,d_2$ and making them different from each other to help find bugs in your code.  The matrices $A$ and $B$ can be arbitrary, e.g., random.*

In [2]:
sigma = torch.nn.Sigmoid() # really "logistic function"

d0 = 100
d1 = 105
d2 = 95
torch.manual_seed(100)
# dtype = torch.float32 # the default
dtype = torch.float64
A   = torch.randn( (d1,d0), dtype=dtype) # parameters
B   = torch.randn( (d2,d1), dtype=dtype )
x   = torch.randn( (d0,1), dtype=dtype , requires_grad=True )

def f(x):
    return torch.sum( sigma( B@sigma(A@x)) )


f(x)

tensor(53.0627, dtype=torch.float64, grad_fn=<SumBackward0>)

### Part 1c: implement the gradient, forward-style
 Implement a gradient for your function in the forward-mode style. That is, calculate
$$J_f(\vec{x}) = J_{h_5}( \vec{y}_4) \cdot \Bigg( J_{h_4}( \vec{y}_3 ) \cdot
 \Big(J_{h_3}( \vec{y}_2 ) \cdot
  \Big(J_{h_2}( \vec{y}_1 ) \cdot J_{h_1}( \vec{x} )\Big)\Big)\Bigg).$$
This gradient function should be in the same function that also calculates $f(\vec{x})$, so now have that function return two values, $f(\vec{x})$ and $J_f(\vec{x})$.
Compare with the autodiff software (PyTorch) at one or more points $\vec{x}$ to make sure you get the right answer. *Note: PyTorch works in single precision by default, in which case you shouldn't expect more than about 5 to 8 matching digits. In double precision, you should see about 10 to 15 matching digits.*
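
To make the forward-mode idea concrete, here is a minimal sketch that pushes a single tangent vector $\vec{v}$ through the computation alongside the values, producing the directional derivative $J_f(\vec{x})\vec{v}$. Repeating this for each standard basis vector recovers the full Jacobian, which is equivalent to the matrix product written above. The sizes, the helper `f_and_directional_derivative`, and the `s*(1-s)` form of `dsigmoid` are choices made for this illustration.

```python
import torch

torch.manual_seed(0)
d0, d1, d2 = 5, 7, 6  # small illustrative sizes (assumed, not from the assignment)
dtype = torch.float64
A = torch.randn(d1, d0, dtype=dtype)
B = torch.randn(d2, d1, dtype=dtype)
x = torch.randn(d0, dtype=dtype, requires_grad=True)

def dsigmoid(y):
    """Derivative of the logistic function, applied componentwise."""
    s = torch.sigmoid(y)
    return s * (1 - s)

def f_and_directional_derivative(x, v):
    """Return f(x) and J_f(x) @ v, propagating the tangent v with the values."""
    y1 = A @ x
    t1 = A @ v                   # J_{h1} v
    y2 = torch.sigmoid(y1)
    t2 = dsigmoid(y1) * t1       # J_{h2} (diagonal) applied to t1
    y3 = B @ y2
    t3 = B @ t2                  # J_{h3} applied to t2
    y4 = torch.sigmoid(y3)
    t4 = dsigmoid(y3) * t3       # J_{h4} (diagonal) applied to t3
    return y4.sum(), t4.sum()    # J_{h5} is a row of ones, i.e. a sum

# One pass per basis vector e_j gives the j-th entry of the Jacobian
basis = torch.eye(d0, dtype=dtype)
J_entries = torch.stack(
    [f_and_directional_derivative(x.detach(), e)[1] for e in basis]
)

# Reference gradient from PyTorch's (reverse-mode) autodiff
fx = torch.sigmoid(B @ torch.sigmoid(A @ x)).sum()
(grad,) = torch.autograd.grad(fx, x)

jvp_matches = torch.allclose(J_entries, grad)
print(jvp_matches)
```

Each pass costs about as much as one evaluation of $f$, and $d_0$ passes are needed, which is why forward mode is expensive when the input dimension is large.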


In [3]:
def dsigmoid(x):
    """
    d/dx sigma(x) for an activation function sigma
    Here, we're assuming the activation function is logistic/sigmoid
    i.e., sigma(x) = 1/(1+e^{-x}) = e^x / (1+e^x)
    so
    d/dx sigma(x) = sigma(x)*(1-sigma(x)) = 1/(e^x + 2 + e^{-x})
                  = e^x / (1+e^x)^2 = 1/(2*(1+cosh(x)))
    """
    # s = sigma(x)
    # return s * (1-s)

    return 1/(2*(1+torch.cosh(x)))

    # This is *less* stable
    # ex =  torch.exp(x)
    # return ex / (1+ex)**2

def f_and_Jacobian(x, mode='reverse'):
    y1 = torch.matmul(A,x)
    y2 = torch.matmul(B,sigma(y1) )
    fx = torch.sum(sigma(y2))

    if mode.lower() == 'reverse':
        # compute Jacobian starting at the end (reverse-mode)
        # Let's force y2 to be of size (d2,1) rather than (d2,)
        #   (right now, it depends on whether input is size (d0,1) or (d0,) )
        z3 = dsigmoid(y2.reshape((-1,1))) # implicitly doing ones vector times diagonal matrix
        z2 = z3.T @ B # z2 is now a row vector
        z1 = z2 * dsigmoid(y1.ravel()) # diagonal matrix multiply
        J_f = z1 @ A
    elif mode.lower() == 'forward':
        z2 = dsigmoid(y1.reshape(-1,1) ) * A
        z3 = B @ z2
        z4 = dsigmoid(y2) * z3
        J_f = torch.sum( z4, 0, keepdim=True)
    else:
        raise ValueError('That mode is not implemented')

    return fx, J_f

In [5]:
fx, gx = f_and_Jacobian(x.detach(), mode='forward' )
gx = gx.T # Gradient is the transpose of the Jacobian

# And compute the gradient using PyTorch's autodiff to check our answer
if x.grad is not None:
    x.grad.data.zero_()
out = f(x)
out.backward()
grad = x.grad

# Print out some metrics:
print(f'"allclose" is {torch.allclose( gx, grad )}')
print(f'||gx-grad||_infty|| is {torch.linalg.norm(gx-grad,ord=np.inf):.2e}')

"allclose" is True
||gx-grad||_infty|| is 1.89e-15


### Part 1d: implement the gradient, backward-style
 Implement a gradient for your function in the reverse-mode style. That is, calculate
$$
J_f(\vec{x}) = \Bigg(\Big(\Big( J_{h_5}( \vec{y}_4) \cdot  J_{h_4}( \vec{y}_3 ) \Big) \cdot
 J_{h_3}( \vec{y}_2 ) \Big) \cdot
  J_{h_2}( \vec{y}_1 ) \Bigg) \cdot J_{h_1}( \vec{x} ).
$$
Again, implement this in the same function that calculates $f(\vec{x})$, since you will want to save some intermediate values on the forward pass.
Comparing with the autodiff software, make sure you get the right answer.
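
The two parenthesizations give the same matrix but very different intermediate results. The sketch below builds each Jacobian explicitly for small, made-up sizes (dense diagonal matrices are wasteful and used here only for illustration), multiplies them in both orders, and records the shape of every partial product.

```python
import torch

torch.manual_seed(0)
d0, d1, d2 = 4, 6, 5  # small illustrative sizes (assumed)
dtype = torch.float64
A = torch.randn(d1, d0, dtype=dtype)
B = torch.randn(d2, d1, dtype=dtype)
x = torch.randn(d0, dtype=dtype)

def dsigmoid(y):
    s = torch.sigmoid(y)
    return s * (1 - s)

# Forward pass, saving the intermediate values needed by the Jacobians
y1 = A @ x
y2 = torch.sigmoid(y1)
y3 = B @ y2

# Explicit Jacobians of h_1, ..., h_5
J1 = A
J2 = torch.diag(dsigmoid(y1))
J3 = B
J4 = torch.diag(dsigmoid(y3))
J5 = torch.ones(1, d2, dtype=dtype)

# Forward-mode order: start from J1 and multiply on the left
fwd_steps = [J1]
for J in (J2, J3, J4, J5):
    fwd_steps.append(J @ fwd_steps[-1])

# Reverse-mode order: start from J5 and multiply on the right
rev_steps = [J5]
for J in (J4, J3, J2, J1):
    rev_steps.append(rev_steps[-1] @ J)

same_jacobian = torch.allclose(fwd_steps[-1], rev_steps[-1])
fwd_shapes = [tuple(s.shape) for s in fwd_steps]
rev_shapes = [tuple(s.shape) for s in rev_steps]
print("forward intermediates:", fwd_shapes)
print("reverse intermediates:", rev_shapes)
print("same Jacobian:", same_jacobian)
```

In the forward order every intermediate has $d_0$ columns, so the work involves matrix-matrix products; in the reverse order every intermediate is a single row, so only vector-matrix products are needed.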

In [7]:
# Already implemented above

fx, gx = f_and_Jacobian(x.detach(), mode='reverse' )
gx = gx.T # Gradient is the transpose of the Jacobian

# And compute the gradient using PyTorch's autodiff to check our answer
if x.grad is not None:
    x.grad.data.zero_()
out = f(x)
out.backward()
grad = x.grad

# Print out some metrics:
print(f'"allclose" is {torch.allclose( gx, grad )}')
print(f'||gx-grad||_infty|| is {torch.linalg.norm(gx-grad,ord=np.inf):.2e}')

"allclose" is True
||gx-grad||_infty|| is 1.67e-15


### Part 1e: complexity
What is the computational complexity for the forward-mode style (in terms of $d_0,d_1,d_2$)? What about for reverse-mode style?

**Solution**

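- For **forward-mode**, the complexity is $\mathcal{O}(d_0 d_1 d_2)$, dominated by the matrix-matrix product with $B$
- For **reverse-mode**, the complexity is $\mathcal{O}(d_1 (d_0 + d_2))$, since every product is a row vector times a matrix; this is much better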
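
To get a feel for the gap, here is a rough sketch that counts only the scalar multiplications in the products that `f_and_Jacobian` performs after the forward pass. The helper names are hypothetical, and the counts ignore additions, the evaluation of `dsigmoid`, and constant factors.

```python
def forward_mode_cost(d0, d1, d2):
    """Multiplications in the forward-mode branch."""
    scale_A = d1 * d0        # dsigmoid(y1) * A     (row scaling of a d1 x d0 matrix)
    matmul = d2 * d1 * d0    # B @ z2               (matrix-matrix product)
    scale_z3 = d2 * d0       # dsigmoid(y2) * z3    (row scaling of a d2 x d0 matrix)
    return scale_A + matmul + scale_z3

def reverse_mode_cost(d0, d1, d2):
    """Multiplications in the reverse-mode branch."""
    vec_B = d2 * d1          # z3.T @ B             (row vector times matrix)
    scale = d1               # z2 * dsigmoid(y1)    (elementwise on a row vector)
    vec_A = d1 * d0          # z1 @ A               (row vector times matrix)
    return vec_B + scale + vec_A

for dims in [(100, 105, 95), (8000, 8000, 8000)]:
    fwd = forward_mode_cost(*dims)
    rev = reverse_mode_cost(*dims)
    print(f"d0,d1,d2 = {dims}: forward {fwd:.3e}, reverse {rev:.3e}, ratio {fwd / rev:.1f}")
```

Measured wall-clock ratios will generally differ from these counts, since matrix-matrix products use the hardware more efficiently than vector-matrix products.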

### Part 1f: run it for a large size
 Set $d_0 = d_1 = d_2 = 8000$ and choose $A,B$ to be random matrices (double-precision). Time how long it takes your code to run in forward-mode, and how long it takes to run in reverse-mode, and also compare with how long it takes the autodiff package to run.
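
The cells below use the notebook magic `%%time`. Outside a notebook, a comparable measurement can be sketched with `time.perf_counter`. The helpers `jacobian_by_hand` and `timed`, and the small size `n = 300`, are assumptions made so that this sketch is self-contained and runs quickly.

```python
import time
import torch

torch.manual_seed(0)
n = 300  # assumed small size; the assignment uses 8000
dtype = torch.float64
A = torch.randn(n, n, dtype=dtype)
B = torch.randn(n, n, dtype=dtype)
x = torch.randn(n, 1, dtype=dtype)

def dsigmoid(y):
    s = torch.sigmoid(y)
    return s * (1 - s)

def jacobian_by_hand(x, mode):
    """Jacobian (1 x n) of sum(sigmoid(B @ sigmoid(A @ x))) in the given mode."""
    y1 = A @ x
    y3 = B @ torch.sigmoid(y1)
    if mode == 'forward':
        Z = B @ (dsigmoid(y1) * A)                 # matrix-matrix product
        return torch.sum(dsigmoid(y3) * Z, 0, keepdim=True)
    if mode == 'reverse':
        z = dsigmoid(y3).T @ B                     # row vector times matrix
        return (z * dsigmoid(y1).T) @ A            # row vector times matrix
    raise ValueError('That mode is not implemented')

def timed(fn, *args):
    """Return fn(*args) together with the elapsed wall time in seconds."""
    start = time.perf_counter()
    result = fn(*args)
    return result, time.perf_counter() - start

J_fwd, t_fwd = timed(jacobian_by_hand, x, 'forward')
J_rev, t_rev = timed(jacobian_by_hand, x, 'reverse')
print(f"forward: {t_fwd:.4f} s, reverse: {t_rev:.4f} s")
print("results agree:", torch.allclose(J_fwd, J_rev))
```

A single timing like this is noisy; for small sizes it is worth repeating the call several times and taking the minimum or median.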

In [8]:
d0 = int(8e3)
d1 = d0
d2 = d0

torch.manual_seed(100)
# dtype = torch.float32 # the default
dtype = torch.float64
A   = torch.randn( (d1,d0), dtype=dtype) # parameters
B   = torch.randn( (d2,d1), dtype=dtype )
x   = torch.randn( (d0,1), dtype=dtype , requires_grad=True )

# f and f_and_Jacobian read A and B as global variables at call time, so they would
#   pick up the new values automatically; we repeat the definitions here for convenience:
def f(x):
    return torch.sum( sigma( B@sigma(A@x)) )
def f_and_Jacobian(x, mode='reverse'):
    y1 = torch.matmul(A,x)          # we need to store this for use in reverse mode
    y2 = torch.matmul(B,sigma(y1) ) # we need to store this for use in reverse mode
    fx = torch.sum(sigma(y2))

    if mode.lower() == 'reverse':
        # compute Jacobian starting at the end (reverse-mode)
        # Let's force y2 to be of size (d2,1) rather than (d2,)
        #   (right now, it depends on whether input is size (d0,1) or (d0,) )
        z3 = dsigmoid(y2.reshape((-1,1))) # implicitly doing ones vector times diagonal matrix
        z2 = z3.T @ B # z2 is now a row vector
        z1 = z2 * dsigmoid(y1.ravel()) # diagonal matrix multiply
        J_f = z1 @ A
    elif mode.lower() == 'forward':
        z2 = dsigmoid(y1.reshape(-1,1) ) * A
        z3 = B @ z2
        z4 = dsigmoid(y2) * z3
        J_f = torch.sum( z4, 0, keepdim=True)
    else:
        raise ValueError('That mode is not implemented')

    return fx, J_f

In [9]:
%%time
if x.grad is not None:
    x.grad.data.zero_()

out = f(x)
out.backward()
grad = x.grad
print('== The time for PyTorch to do AutoDiff (it uses reverse mode) ==')

== The time for PyTorch to do AutoDiff (it uses reverse mode) ==
CPU times: user 177 ms, sys: 0 ns, total: 177 ms
Wall time: 178 ms


In [10]:
%%time
fx, gx = f_and_Jacobian(x.detach(), mode='forward' )
gx = gx.T
print('== The time for our own forward mode ==')

== The time for our own forward mode ==
CPU times: user 30.7 s, sys: 1.19 s, total: 31.9 s
Wall time: 32.9 s


In [11]:
%%time
fx, gx = f_and_Jacobian(x.detach(), mode='reverse' )
gx = gx.T
print('== The time for our own reverse mode ==')

== The time for our own reverse mode ==
CPU times: user 190 ms, sys: 65 µs, total: 190 ms
Wall time: 190 ms


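Conclusion: for this kind of function (many inputs, a single scalar output), reverse-mode is much faster than forward-mode: about 0.19 s versus about 33 s of wall time in the runs above, with our reverse-mode code roughly matching PyTorch's own autodiff (0.18 s).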