# Automatic Differentiation with Steepest Descent

I'm going to try out a small autodiff example by using it to solve a least squares problem with steepest descent (SDLS). We then implement a couple of tests to make sure our derivatives are correct: 1) a comparison with the closed-form solution, and 2) a test based on Taylor series introduced by Lars. SDLS works as follows.

\begin{gather}
f(x)=\frac{1}{2}\Vert{Ax-b}\Vert^2 \\
\min_{x}f(x)
\end{gather}

We know from vector calculus that the steepest descent direction at a point $x$ is the negative gradient evaluated at that point.
The minimizer itself is the point where the gradient vanishes:

\begin{gather}
\frac{\partial}{\partial x} f(x)=\nabla f(x)=A^T(Ax-b)=0
\end{gather}
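The gradient formula above can be sanity-checked with a Taylor-series test. The sketch below is one common form of such a check, and whether it matches the test referred to in the introduction is an assumption. It uses plain NumPy, takes $f(x)=\frac{1}{2}\Vert Ax-b\Vert^2$ (the same scaling as the `.5 *` factor in `loss_function`), and the names `taylor_errors`, `e0` and `e1` are made up for illustration. If the gradient is right, the zeroth-order error `e0` should shrink roughly linearly in `h`, while the first-order error `e1` should shrink quadratically, so about a factor of 4 each time `h` is halved.

```python
import numpy as np

def f(x, A, b):
    """Least-squares loss 0.5 * ||A x - b||^2."""
    r = A @ x - b
    return 0.5 * float(r @ r)

def grad_f(x, A, b):
    """Closed-form gradient A^T (A x - b)."""
    return A.T @ (A @ x - b)

def taylor_errors(x, v, A, b, hs):
    """Zeroth- and first-order Taylor remainders along direction v."""
    f0 = f(x, A, b)
    slope = float(grad_f(x, A, b) @ v)  # directional derivative at x
    e0 = np.array([abs(f(x + h * v, A, b) - f0) for h in hs])
    e1 = np.array([abs(f(x + h * v, A, b) - f0 - h * slope) for h in hs])
    return e0, e1

# Illustrative data: the same A and b as the notebook, as 1-D float arrays.
A = np.array([[3.0, 2.0], [2.0, 6.0]])
b = np.array([2.0, -8.0])
x = np.array([1.0, -1.0])   # arbitrary test point
v = np.array([0.5, 1.0])    # arbitrary test direction
hs = 0.5 ** np.arange(1, 8)

e0, e1 = taylor_errors(x, v, A, b, hs)
for h, a0, a1 in zip(hs, e0, e1):
    print(f"h={h:.5f}  e0={a0:.3e}  e1={a1:.3e}")
```

The same check works for an autodiff gradient by swapping `grad_f` for the function under test.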

To choose the step size $\alpha$ by exact line search, we solve a second, one-dimensional optimization problem along the descent direction.

\begin{gather}
x_{(i+1)}=x_{(i)}-\alpha\nabla f(x_{(i)}) \\
\min_{\alpha}\Vert{Ax_{(i+1)}-b}\Vert^2 \\
\frac{\partial}{\partial \alpha} f(x_{(i+1)}) = \nabla f(x_{(i+1)})^T \frac{\partial}{\partial \alpha} x_{(i+1)} 
= -\nabla f(x_{(i+1)})^T \nabla f(x_{(i)})
\end{gather}

It's interesting that the step size that gives the lowest error along the previous gradient direction is the one
that makes the next gradient orthogonal to the previous gradient. Hence the zig-zagging path we see in SDLS with exact line search. Expanding this condition further gives a closed-form solution for $\alpha$:

\begin{gather}
\nabla f(x_{(i+1)})^T \nabla f(x_{(i)})=0 \\
(A^T(Ax_{(i+1)}-b))^T \nabla f(x_{(i)})=0 \\
(A^T(A(x_{(i)}-\alpha\nabla f(x_{(i)}))-b))^T \nabla f(x_{(i)})=0 \\
(A^T(Ax_{(i)}-\alpha A\nabla f(x_{(i)})-b))^T \nabla f(x_{(i)})=0 \\
(A^T(Ax_{(i)}-b-\alpha A\nabla f(x_{(i)})))^T \nabla f(x_{(i)})=0 \\
(\nabla f(x_{(i)})-\alpha A^T A\nabla f(x_{(i)}))^T \nabla f(x_{(i)})=0 \\
(\nabla f(x_{(i)})^T-\alpha \nabla f(x_{(i)})^T A^T A) \nabla f(x_{(i)})=0 \\
\alpha = \frac{\nabla f(x_{(i)})^T \nabla f(x_{(i)})}{\nabla f(x_{(i)})^T A^T A \nabla f(x_{(i)})}
\end{gather}
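The closed-form step size can be checked numerically. The following is a minimal sketch in plain NumPy, using the closed-form gradient $A^T(Ax-b)$ and illustrative names (`g0`, `g1`, `x1`) that are not part of the notebook. It takes one step from the origin and checks two things: the new gradient should be orthogonal to the old one, and nudging $\alpha$ in either direction should not lower the loss.

```python
import numpy as np

A = np.array([[3.0, 2.0], [2.0, 6.0]])
b = np.array([2.0, -8.0])

def f(x):
    """Least-squares loss 0.5 * ||A x - b||^2."""
    r = A @ x - b
    return 0.5 * float(r @ r)

def grad_f(x):
    """Closed-form gradient A^T (A x - b)."""
    return A.T @ (A @ x - b)

x0 = np.zeros(2)
g0 = grad_f(x0)

# Closed-form exact line search step size.
Ag0 = A @ g0
alpha = float(g0 @ g0) / float(Ag0 @ Ag0)

x1 = x0 - alpha * g0
g1 = grad_f(x1)

print("alpha     =", alpha)
print("g1 . g0   =", float(g1 @ g0))  # should be zero up to rounding

# Perturbing alpha in either direction should not reduce the loss.
for delta in (-1e-3, 1e-3):
    print(f"f(alpha{delta:+.0e}) - f(alpha) =", f(x0 - (alpha + delta) * g0) - f(x1))
```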

I'm going to use autodiff on the first optimization problem to compute the gradient of $f(x)$,
and use this closed-form solution for $\alpha$.

In [9]:
%matplotlib inline
import matplotlib.pyplot as plt
import autograd.numpy as np
import autograd as ad

In [84]:
def loss_function(x, A, b):
    """
    Computes the least-squares loss 0.5 * ||A x - b||^2 for a linear system.
    """
    return .5 * np.linalg.norm(A @ x - b)**2

def line_search(alpha, x, A, b, d):
    """
    Computes the least-squares loss along x + alpha * d, as a function of alpha.
    """
    return .5 * np.linalg.norm(A @ (x + alpha * d) - b)**2

# ad.grad differentiates with respect to the first argument of each function.
loss_grad = ad.grad(loss_function)
# Gradient with respect to alpha; not used by SDLS, which uses the closed-form step.
line_search_grad = ad.grad(line_search)

def SDLS(A, x, b, maxIter):
    """
    Steepest descent with exact line search for the least-squares problem.
    """
    for i in range(maxIter):
        # AUTODIFF HERE: descent direction is the negative gradient
        d = -loss_grad(x, A, b)
        # Closed-form exact line search step size (no autodiff needed)
        Ad = A @ d
        alpha = (d.T @ d) / (Ad.T @ Ad)
        x = x + alpha * d
    return x

In [85]:
A = np.array([[3, 2], [2, 6]])
b = np.array([[2],[-8]])

In [86]:
x = np.zeros_like(b)
#x = np.array([2,-2])[np.newaxis].T

In [101]:
solution = SDLS(A, x, b, 50)
print(solution)

[[ 2.]
 [-2.]]
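
For the closed-form comparison mentioned at the top, a minimal sketch: because this particular $A$ is square and invertible, the least-squares minimizer is just the solution of $Ax=b$, so `np.linalg.solve` should reproduce the iterate printed above. The variable names `x_solve` and `x_lstsq` are illustrative, and plain NumPy is used so the snippet stands alone.

```python
import numpy as np

A = np.array([[3.0, 2.0], [2.0, 6.0]])
b = np.array([[2.0], [-8.0]])

# Direct solve, valid here because A is square and nonsingular.
x_solve = np.linalg.solve(A, b)

# General least-squares solve, which also covers non-square A.
x_lstsq, residuals, rank, singular_values = np.linalg.lstsq(A, b, rcond=None)

print(x_solve)
print(x_lstsq)
```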


In [102]:
solution = SDLS(A, x, b, 100)
print(solution)

[[nan]
 [nan]]


Why does it return nan at 100 iterations? 
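
A plausible explanation, not verified against the run above: after enough iterations the iterate lands on the solution to machine precision, the residual and gradient become exactly zero, and the step size turns into $0/0$, which is `nan`. The gradient of `np.linalg.norm` at a zero residual may hit the same kind of $0/0$ inside the autodiff. Once `alpha` is `nan`, every later `x` is `nan` too. The sketch below shows the $0/0$ directly and adds a simple stopping test. It is an assumption-laden stand-in rather than the notebook's code: `sdls_guarded` and `tol` are hypothetical names, the closed-form gradient $A^T(Ax-b)$ replaces `loss_grad`, and vectors are 1-D.

```python
import numpy as np

A = np.array([[3.0, 2.0], [2.0, 6.0]])
b = np.array([2.0, -8.0])

# What the step-size formula gives once the direction is exactly zero.
d_zero = np.zeros(2)
Ad_zero = A @ d_zero
with np.errstate(invalid="ignore"):  # silence the 0/0 RuntimeWarning
    alpha_at_zero = (d_zero @ d_zero) / (Ad_zero @ Ad_zero)
print("alpha for d = 0:", alpha_at_zero)

def sdls_guarded(A, x, b, max_iter, tol=1e-12):
    """Steepest descent with exact line search; stops once the gradient is tiny."""
    for _ in range(max_iter):
        d = -(A.T @ (A @ x - b))  # closed-form gradient stands in for loss_grad
        Ad = A @ d
        dd = float(d @ d)
        denom = float(Ad @ Ad)
        # Stop before dividing when the gradient (or the denominator) has vanished.
        if np.sqrt(dd) <= tol or denom == 0.0:
            break
        x = x + (dd / denom) * d
    return x

x_guarded = sdls_guarded(A, np.zeros(2), b, 100)
print(x_guarded)
```

With the guard in place, asking for more iterations than needed just returns the converged iterate instead of propagating `nan`.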