# Motivation for Conjugate Gradient 

In steepest descent with exact line search, the basic intuition is that we choose a starting point, say $x_{(0)}$.
From this starting point, we find the direction in which the objective function decreases most steeply, which is the negative gradient.
To know how far to go in that direction $d_{(i)}$ before the objective function stops decreasing, we pick the step size that gives the
lowest value of the objective function along that direction.
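
As a concrete illustration, here is a minimal sketch of steepest descent with exact line search. It assumes a quadratic objective $f(x) = \frac{1}{2}x^T A x - b^T x$ with symmetric positive-definite $A$, an assumption made here so that the step size has the closed form $\alpha = \frac{d^T d}{d^T A d}$. The helper names, the matrix, and the starting point are all hypothetical, and plain Python lists are used to keep the sketch dependency-free.

```python
def matvec(A, v):
    """Multiply a matrix (list of rows) by a vector (list)."""
    return [sum(a_ij * v_j for a_ij, v_j in zip(row, v)) for row in A]


def dot(u, v):
    """Inner product of two vectors given as lists."""
    return sum(u_i * v_i for u_i, v_i in zip(u, v))


def steepest_descent(A, b, x0, tol=1e-10, max_iter=10_000):
    """Minimize f(x) = 0.5 x^T A x - b^T x for symmetric positive-definite A.

    Returns the final iterate and the number of steps taken.
    """
    x = list(x0)
    for k in range(max_iter):
        # Steepest-descent direction: the negative gradient, d = b - A x.
        d = [b_i - Ax_i for b_i, Ax_i in zip(b, matvec(A, x))]
        dd = dot(d, d)
        if dd ** 0.5 < tol:
            return x, k
        # Exact line search: alpha minimizes f(x + alpha * d) along d.
        alpha = dd / dot(d, matvec(A, d))
        x = [x_i + alpha * d_i for x_i, d_i in zip(x, d)]
    return x, max_iter


# Hypothetical example problem whose minimizer is (2, -2).
A = [[3.0, 2.0], [2.0, 6.0]]
b = [2.0, -8.0]
x_min, steps = steepest_descent(A, b, x0=[-2.0, -2.0])
```

Even on this small two-dimensional problem, the loop takes more than two steps before the gradient norm drops below the tolerance.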

While this sounds like a strategy that would work well in a lot of cases, there are scenarios where steepest descent struggles. It's also not the most efficient. With exact line search, each step is orthogonal to the step right before it (optimizing the step size $\alpha$ guarantees that the gradients at successive iterations are orthogonal). This results in a zig-zag pattern, in which the same directions are revisited over and over. But what if you could just collapse all iterations that travel in the same direction into one step? For example, in the 2D case, if I were able to do that, my algorithm would stop after only 2 steps (there are only 2 orthogonal directions in $\mathbb{R}^2$), whereas steepest descent could take many more, especially if my problem is ill-conditioned.
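
To see why exact line search forces this orthogonality, assume $f$ is differentiable and write $x_{(i+1)} = x_{(i)} + \alpha_{(i)} d_{(i)}$ with $d_{(i)} = -\nabla f(x_{(i)})$ (notation assumed here). At the minimizing step size, the derivative along the search direction must vanish:

$$
\left.\frac{d}{d\alpha} f\big(x_{(i)} + \alpha d_{(i)}\big)\right|_{\alpha = \alpha_{(i)}}
= \nabla f\big(x_{(i+1)}\big)^T d_{(i)} = 0
\quad\Longrightarrow\quad
\nabla f\big(x_{(i+1)}\big)^T \nabla f\big(x_{(i)}\big) = 0 .
$$

So each new gradient, and hence each new steepest-descent direction, is perpendicular to the previous one.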

## The Method of Conjugate Directions

Orthogonality is one of a numerical analyst's best friends, and in this case it comes into play yet again. If our solution lives in a vector space of dimension $n$, say $\mathbb{R}^n$, and every step fully takes care of one of $n$ orthogonal directions, then our algorithm would travel at most $n$ steps. If the solution happens to lie along just one of those directions from our starting point, it could arrive in even fewer steps! But therein lies our first problem. How do we choose what directions $d_{(0)}, d_{(1)}, \dots, d_{(n-1)}$ to travel in, given that we start at $x_{(0)}$?
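
A small experiment hints at why this choice is not obvious. The sketch below assumes the quadratic objective $f(x) = \frac{1}{2}x^T A x - b^T x$ (an assumption, not something fixed by the text) and performs one exact line search along each coordinate axis of $\mathbb{R}^2$. The function name `search_along`, the matrices, and the starting point are hypothetical.

```python
def matvec(A, v):
    """Multiply a matrix (list of rows) by a vector (list)."""
    return [sum(a * x for a, x in zip(row, v)) for row in A]


def dot(u, v):
    """Inner product of two vectors given as lists."""
    return sum(a * b for a, b in zip(u, v))


def search_along(A, b, x0, directions):
    """Do one exact line search per direction on f(x) = 0.5 x^T A x - b^T x."""
    x = list(x0)
    for d in directions:
        # Negative gradient at the current point: r = b - A x.
        r = [b_i - Ax_i for b_i, Ax_i in zip(b, matvec(A, x))]
        # Step size that minimizes f along d.
        alpha = dot(d, r) / dot(d, matvec(A, d))
        x = [x_i + alpha * d_i for x_i, d_i in zip(x, d)]
    return x


axes = [[1.0, 0.0], [0.0, 1.0]]  # two orthogonal directions in R^2
b = [2.0, -8.0]
x0 = [0.0, 0.0]

# Diagonal A: the minimizer (2/3, -4/3) is reached after two line searches.
x_diag = search_along([[3.0, 0.0], [0.0, 6.0]], b, x0, axes)

# Coupled A: the minimizer (2, -2) is NOT reached after two line searches.
x_coupled = search_along([[3.0, 2.0], [2.0, 6.0]], b, x0, axes)
```

With the diagonal matrix, two line searches along the axes land on the minimizer. With the coupled matrix, the same two orthogonal directions stop short of it, which suggests that the directions have to be chosen with the problem in mind.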

In [4]:
import autograd.numpy as np
import autograd as ag
import matplotlib.pyplot as plt

def softmax(x, y):
    """Compute the softmax of the two-element vector [x, y]."""
    exps = np.exp(np.array([x, y]))
    return exps / np.sum(exps)

In [3]:
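# ag.grad needs a scalar output, so differentiate the first softmax component (four times, with respect to x).
der = ag.grad(ag.grad(ag.grad(ag.grad(lambda x, y: softmax(x, y)[0]))))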

In [None]:
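# The grid range and resolution are placeholder assumptions; the original left them unspecified.
grid = np.linspace(-2.0, 2.0, 50)
X, Y = np.meshgrid(grid, grid)
# softmax takes two scalars, so evaluate it at the point (a, a); a = 1.0 is a placeholder value.
a = 1.0
y1 = softmax(a, a)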