# Gradient descent methods

(Notebook prepared by [Emmanuel Rachelson](https://github.com/erachelson/))

In this section, we concentrate on finding the minimum of a given differentiable function $f$. Try to keep in mind the general illustration from section 1 to remember why we do this.

In order to make things didactic and graphical, we shall work on functions from $\mathbb{R}^2$ to $\mathbb{R}$, which can be easily plotted (either as surfaces in 3D or as contour plots in 2D).

## 1. Descent algorithms

Descent algorithms define a sequence of steps "rolling down" the function's surface (similarly to how water would flow down from a mountain), that is a sequence of points $x_k$ of decreasing value $f(x_k)$. To construct this sequence, one defines at each step a descent direction $d_k$ and a descent step $\alpha_k$. The next point in the sequence is thus $x_{k+1} = x_k + \alpha_k d_k$.

Since the gradient $\nabla f$ of $f$ indicates the steepest ascent direction, $-\nabla f$ is a descent direction, hence the name **gradient descent**.
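
As a quick sanity check of this claim, here is a minimal sketch on a hypothetical test function (`toy` and `toy_grad` are illustrative names, not part of this notebook). It compares the function's value after a small step against the gradient and after a small step along it.

```python
import numpy as np


def toy(x):
    """Hypothetical test function: an elongated quadratic bowl."""
    return x[0] ** 2 + 3.0 * x[1] ** 2


def toy_grad(x):
    """Analytical gradient of `toy`."""
    return np.array([2.0 * x[0], 6.0 * x[1]])


x = np.array([1.0, -2.0])
g = toy_grad(x)
alpha = 1e-3  # small step size, chosen arbitrarily for the illustration

# A small step against the gradient should decrease the function...
descent_value = toy(x - alpha * g)
# ...while a small step along the gradient should increase it.
ascent_value = toy(x + alpha * g)

print("f(x)            :", toy(x))
print("f(x - alpha * g):", descent_value)
print("f(x + alpha * g):", ascent_value)
```

Note that this only holds for small enough steps: the gradient is local information.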

Let's illustrate that. You don't need to read the full formula of the function below, but for your curiosity, it is built as follows:
\begin{align}
&(x'_0, x'_1) = rotate_{\pi/6}(x_0,x_1)\\
&g(z_0,z_1) = \left(a_4 z_0^4 + a_3 z_0^3 + a_2 z_0^2 + a_1 z_0 + a_0 \right) \left( b_2 z_1^2 + b_1 z_1 + b_0 \right)\\
&f(x_0,x_1) = g\left(rotate_{\pi/6} \left(x_0,x_1\right)\right)\\
&\textrm{with }[a_4, a_3, a_2, a_1, a_0] = [0.019217, 0.013158, -0.423455, -0.247614, 4.]\\
&\textrm{and }[b_2, b_1, b_0] = [0.1, 0., 0.1]
\end{align}

In [None]:
import numpy as np

def func(x):
    a = np.pi / 6.0
    c = np.cos(a)
    s = np.sin(a)
    xx = c * x[0] + s * x[1]
    yy = -s * x[0] + c * x[1]
    p = np.poly1d(
        [
            0.019217057452351031,
            0.013158736688148412,
            -0.42345571095569301,
            -0.24761472187941180,
            4.0,
        ]
    )
    q = np.poly1d([0.1, 0.0, 0.1])
    
    return p(xx) * q(yy)


def func_der(x):
    a = np.pi / 6.0
    c = np.cos(a)
    s = np.sin(a)
    xx = c * x[0] + s * x[1]
    yy = -s * x[0] + c * x[1]
    p = np.poly1d(
        [
            0.019217057452351031,
            0.013158736688148412,
            -0.42345571095569301,
            -0.24761472187941180,
            4.0,
        ]
    )
    pp = np.polyder(p)
    q = np.poly1d([0.1, 0.0, 0.1])
    qq = np.polyder(q)
    grad0 = c * pp(xx) * q(yy) - s * qq(yy) * p(xx)
    grad1 = s * pp(xx) * q(yy) + c * qq(yy) * p(xx)
    
    return np.array([grad0, grad1]).T

Let's first plot this function. Again, you don't need a deep understanding of the plotting functions below.
- The first one plots a function $f$ in 3D;
- The second one plots the contour levels of function $f$ in 2D, with several options to plot the function's gradient in specific points.

An example is provided below.

In [None]:
import matplotlib.pyplot as plt

from mpl_toolkits.mplot3d import axes3d
from matplotlib import cm


def plot_3d_func(f, X0, X1):
    """Plots function f over the grid of X0 and X1 arguments.
    X0 and X1 should be numpy arrays.
    """
    X = np.meshgrid(X0, X1)
    Z = f(X)
    fig, ax = plt.subplots(figsize=(10, 10), subplot_kw=dict(projection="3d"))
    surf = ax.plot_surface(
        X[0],
        X[1],
        Z,
        rstride=1,
        cstride=1,
        cmap=cm.coolwarm,
        linewidth=0,
        antialiased=False,
        alpha=0.3,
    )

    for label in ax.get_xticklabels() + ax.get_yticklabels() + ax.get_zticklabels():
        label.set_fontsize("large")

    ax.set_xlabel("X")
    ax.set_ylabel("Y")
    ax.set_zlabel("Z")

    return fig, ax


def plot_contours_func(
    f,
    X0,
    X1,
    levels=np.array([]),
    xp=np.empty((2, 0)),
    plot_line=False,
    f_der=None,
    add_levels=True,
):
    """Contour plot of function f.

    - X0 and X1 should be numpy arrays defining a grid over which the contour
      plot will be drawn.
    - levels should be a numpy array providing user-defined contour levels
      (otherwise, 10 regularly spaced levels are generated)
    - xp should be a (2,n)-numpy array of (x0,x1)-coordinates for additional
      points to plot on the graph
    - plot_line indicates whether a line should be plotted between the xp points
    - f_der should be a function returning the derivative of f when evaluated in
      (x0,x1). It is used to plot the gradient in xp.
      If f_der=None no derivatives will be plotted
    - add_levels indicates whether to add the contour levels for the xp points
    """

    X = np.meshgrid(X0, X1)
    Z = f(X)
    fig, ax = plt.subplots(figsize=(10, 10))
    ax.axis("equal")

    if np.equal(len(levels), 0):
        levels = np.arange(np.min(Z), np.max(Z), (np.max(Z) - np.min(Z)) / 10.0)

    if np.not_equal(xp.shape[1], 0):
        z = f(xp)
        if add_levels:
            levels = np.append(levels, z)
        ax.scatter(
            xp[0, :],
            xp[1, :],
            cmap=cm.autumn,
            c=-np.arange(xp.shape[1]),
            edgecolors="black",
        )
        if plot_line:
            ax.plot(xp[0, :], xp[1, :])
        if f_der is not None:
            grad = f_der(xp)
            ax.quiver(xp[0, :], xp[1, :], grad[:, 0], grad[:, 1])

    levels = np.sort(levels)
    cont = ax.contour(X[0], X[1], Z, levels)
    ax.clabel(cont, cont.levels, inline=True, fontsize=10)

    return fig, ax


def clean_axis(ax, center=(0, 0)):
    ax.spines["right"].set_visible(False)
    ax.spines["top"].set_visible(False)
    ax.spines["bottom"].set_color("#bab0ac")
    ax.spines["left"].set_color("#bab0ac")

    if center is not None:
        ax.spines["bottom"].set_position(("data", center[1]))
        ax.spines["left"].set_position(("data", center[0]))

    for label in ax.get_xticklabels() + ax.get_yticklabels():
        label.set_fontsize("large")

In [None]:
X0 = np.arange(-6, 6, 0.1)
X1 = np.arange(-6, 6, 0.1)
xd = np.array([[4.0, 3.0, 4.0], [1.5, -0.4, -0.4]])
levels = np.array([0.15, 0.2, 0.25, 0.5, 1.0, 2.0, 3.0, 4.0, 5.0])

fig, ax = plot_3d_func(func, X0, X1)
fig, ax = plot_contours_func(func, X0, X1, levels, xp=xd, f_der=func_der)
ax.set_xlim((-7, 7))
ax.set_ylim((-7, 7))
clean_axis(ax)

## 2. Gradient descent with fixed step size

Let's take a starting point $x_0$ and write a gradient descent procedure that builds the sequence of points $x_{k+1} = x_k + \alpha_k d_k$ with $d_k = -\dfrac{\nabla_x f(x_k)}{\|\nabla_x f(x_k)\|}$ (that is, a unit vector in the opposite direction of the gradient) and a fixed step size $\alpha_k$.

<div class="alert alert-warning"><b>Exercise:</b>
<ul>
<li> Write an algorithm that starts in $(-5,-4)$ and uses gradient descent on the `func` function defined above, with fixed-length steps of length $\alpha = 0.1$, for 20 steps (if you need to compute the norm of a vector, use [`np.linalg.norm`](https://docs.scipy.org/doc/numpy/reference/generated/numpy.linalg.norm.html)).
<li> Use `plot_contours_func` to plot the sequence of points obtained.
<li> Did the algorithm find the minimum? Were there enough steps? Increase the number of steps until you reach a vicinity of the minimum.
<li> What are the coordinates of this minimum? What is the value of $f$ there? Is this a local or a global minimum?
<li> How far do you get from the true minimum?
<li> Restart your algorithm at $(4,-4)$, do you obtain the same result?
</ul>
</div>

In [None]:
# %load solutions/code1.py

Recall that the gradient's norm in $x$ is the steepest increase rate of the function in $x$. So as long as this norm is non-zero, it means we can still move up or down the function's surface by following the gradient's direction. Conversely, if the gradient becomes zero, it means we have reached a "flat" area in the function's surface, which is likely to be a (local) minimum.

<div class="alert alert-warning"> <b>Exercise:</b>
Copy-paste your code from the previous cells in the cell below to modify it. We'll work from the initial point $(x_0,x_1)=(-5,-4)$ in this exercise.
<ul>
<li> Is there a better way to stop the algorithm than a predefined number of steps? For example using the gradient's norm?
<li> Plot the evolution of the gradient's norm (actually you'll see it's clearer to plot the [log](https://docs.scipy.org/doc/numpy/reference/generated/numpy.log.html) of the gradient's norm) along the first 100 iterations. Is the norm of the gradient always a good way to control the algorithm's stopping? What happens after (around) the 34th iteration?
<li> Plot the sequence of points after iteration 34 by using `plot_contours_func` on the $[-3.05,-2.9]\times[-1.8,-1.6]$ domain.
<li> Also plot the (log of the) value of the objective function after iteration 34.
</ul>
</div>

In [None]:
# %load solutions/code2.py

Apparently, depending on the value of the threshold we put on the gradient's norm, the algorithm does not terminate. In all cases, fixed-step sizes do not allow us to reach the local minimum but rather oscillate around it. So it seems we need adaptive, decreasing step-sizes. Any ideas as to what values we should pick for the step-size?
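
The oscillation can be reproduced on a much simpler function. The sketch below assumes a hypothetical circular bowl (`bowl_grad`) and a helper `fixed_step_descent` written for this illustration only. With unit-length directions and a constant step length, the iterates end up hopping back and forth across the minimum, so neither the distance to the minimum nor the gradient's norm goes to zero.

```python
import numpy as np


def bowl_grad(x):
    """Gradient of the hypothetical function f(x) = x_0^2 + x_1^2."""
    return 2.0 * x


def fixed_step_descent(grad, x_start, alpha=0.1, n_steps=50):
    """Normalized-gradient descent with a constant step length alpha."""
    x = np.array(x_start, dtype=float)
    path = [x.copy()]
    for _ in range(n_steps):
        g = grad(x)
        norm = np.linalg.norm(g)
        if norm == 0.0:  # exactly stationary point: no direction to follow
            break
        x = x - alpha * g / norm
        path.append(x.copy())
    return np.array(path)


path = fixed_step_descent(bowl_grad, [1.03, 0.0], alpha=0.1, n_steps=50)
dist = np.linalg.norm(path, axis=1)  # distance to the minimum (0, 0)

# The last distances alternate between two values instead of shrinking.
print(dist[-6:])
```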

## 3. A first attempt at adaptive step sizes

<div class="alert alert-warning"> <b>Exercise:</b> Try implementing a step-size equal to the gradient's norm (again, copy-paste your code from the previous cells to avoid rewriting everything).
<ul>
<li> Stop the algorithm when the gradient's norm falls below $0.001$
<li> How many iterations until convergence?
</ul>
</div>

In [None]:
# %load solutions/code3.py

So apparently, taking a step length proportional to the gradient's norm seems like a good idea. It makes sense intuitively since the norm of the gradient indicates how steep the function is locally: the steeper the surface, the bigger the step we can expect to make.<br>
<br>
However, this intuition can prove very (very) wrong! In particular, taking a step size equal to the gradient's norm can be catastrophic. To convince yourself, consider the function $f(x)=10\cos(x)$ on the $[0,4\pi]$ domain and check below what happens on the first step of gradient descent from $x=\pi/2$. <br>

In [None]:
X = np.arange(0, 4 * np.pi, 0.01)
Z = 10.0 * np.cos(X)

fig, ax = plt.subplots()
clean_axis(ax)

ax.plot(X, Z)
x0 = np.pi / 2.0
val0 = 10.0 * np.cos(x0)
ax.plot(x0, 0.0, "ro")
grad0 = -10.0 * np.sin(x0)
x1 = x0 - grad0
val1 = 10.0 * np.cos(x1)
ax.arrow(x0, 0.0, -grad0, 0.0, head_width=0.5, length_includes_head=True)
ax.plot(x1, 0.0, "ro")

print("First point x0:", x0)
print("Gradient in x0:", grad0)
print("Second point x1=x0-grad0:", x1)

## 4. Line search

So the gradient gives the steepest descent direction locally but this does not give any reliable information as to the best step size. We learn from this that the step size needs to adapt to the actual function in order for the sequence $f(x_k)$ to actually be decreasing. This is generally done via **line search**. Once a descent direction $d_k$ has been chosen in $x_k$, line search consists in defining the univariate function $g(\alpha)=f(x_k+\alpha\cdot d_k)$ and minimizing this function. Then, the found value for $\alpha$ is used as $\alpha_k$ and the process is repeated from $x_{k+1}=x_{k} + \alpha_k d_k$.<br>
<br>
The minimization of the scalar function $g$ can be done in a number of ways:
<ul>
<li> If $g'(\alpha)=0$ can be easily solved analytically, then it provides a series of candidates for a minimum. 
<li> Interpolation methods such as Cubic interpolation, Quadratic interpolation (Brent method) or the Golden section method (all left to your curiosity), that do not require the knowledge of an analytical form of $g'$ can be used to narrow down a minimum.
</ul>

Fortunately for us, `scipy.optimize` provides a [`minimize_scalar`](https://docs.scipy.org/doc/scipy-0.17.1/reference/generated/scipy.optimize.minimize_scalar.html) function that performs this tedious line search for us.
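
To see what a single line search step looks like with this function, here is a minimal sketch on a hypothetical test function (`toy`, `toy_grad`, `x_k` and `d_k` are illustrative, not taken from this notebook). It builds $g(\alpha)$ explicitly and lets `minimize_scalar` pick $\alpha_k$.

```python
import numpy as np
from scipy.optimize import minimize_scalar


def toy(x):
    """Hypothetical test function: an elongated quadratic bowl."""
    return x[0] ** 2 + 3.0 * x[1] ** 2


def toy_grad(x):
    """Analytical gradient of `toy`."""
    return np.array([2.0 * x[0], 6.0 * x[1]])


x_k = np.array([1.0, -2.0])
grad_k = toy_grad(x_k)
d_k = -grad_k / np.linalg.norm(grad_k)  # unit-length descent direction


def g(alpha):
    """Restriction of the function to the search line through x_k."""
    return toy(x_k + alpha * d_k)


result = minimize_scalar(g)  # no bounds given: Brent's method is used
alpha_k = result.x
x_next = x_k + alpha_k * d_k

print("step size:", alpha_k)
print("f(x_k) =", toy(x_k), " f(x_next) =", toy(x_next))
```

At the returned step size, the slope of $g$ is numerically zero, which is what "minimizing along the line" means.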

Those interested in going further on the topic of step size selection in descent methods can check the following advanced topics (do that later if you are curious, or you probably won't have time to finish this notebook):
<ul>
<li> Armijo rule and Wolfe conditions
<li> Goldstein rule
<li> Robbins-Monro stochastic approximation
</ul>

Before going any further, let us recall an important property. When the descent direction is the opposite of the gradient and the step size is found by (exact) line search as we just introduced, two successive descent directions are necessarily orthogonal. Why? Since $\alpha_k$ minimizes $g(\alpha)=f(x_k+\alpha\cdot d_k)$, we have $g'(\alpha_k) = \nabla f(x_{k+1})^T d_k = 0$. Put differently, if $d_{k+1}$ was not orthogonal to $d_k$, that would mean there is a component of $d_{k+1}$ along $d_k$, which would mean that in $x_{k+1}$, it would still be possible to decrease $f$ just by moving along the $d_k$ direction, but this is impossible since $x_{k+1}$ is the minimum of $f(x_k+\alpha\cdot d_k)$. So $d_{k+1}$ is necessarily orthogonal to $d_k$.<br>
<br>
Note that, in dimension 2, that leaves little choice for the direction $d_{k+1}$ given $d_k$. But in higher dimensions, there is an infinity of unit vectors that are orthogonal to $d_k$, so what this property says really is just that two successive descent directions are orthogonal and nothing else.
<div class="alert alert-warning"> <b>Exercise:</b> Just take a minute to make sure you understood this property.<br>
</div>
<div class="alert alert-warning"><b>Exercise:</b><br>
Now, reuse your previous code to write a descent method where the descent direction is the normalized gradient and the step size is found by line search.<br>
<ul>
<li> Can you confirm the above property (graphically)?
<li> How many steps until convergence?
</ul>
</div>

In [None]:
# %load solutions/code4.py

Did you notice the orthogonal descent directions?
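
The property can also be checked numerically. The sketch below assumes a hypothetical quadratic function $f(x) = \frac{1}{2}x^TAx$ (the matrix `A` is made up for the illustration), for which the exact line search step has a closed form, and computes the dot products between successive gradient descent directions.

```python
import numpy as np

# Hypothetical symmetric positive definite matrix defining f(x) = 0.5 x^T A x.
A = np.array([[3.0, 1.0], [1.0, 2.0]])


def grad(x):
    """Gradient of f(x) = 0.5 x^T A x."""
    return A @ x


x = np.array([2.0, -1.0])
directions = []
for _ in range(5):
    d = -grad(x)  # steepest descent direction (not normalized here)
    # Exact minimizer of alpha -> f(x + alpha d) for a quadratic function.
    alpha = (d @ d) / (d @ A @ d)
    directions.append(d)
    x = x + alpha * d

# Dot products between consecutive directions: all numerically zero.
dots = [directions[k] @ directions[k + 1] for k in range(len(directions) - 1)]
print(dots)
```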

## 5. Convexity

Let's take a step back and consider the general question of finding minima of differentiable functions. So far, what we have intuitively done corresponds to saying that if we roll down the function's surface in the opposite direction of the gradient, we might end up in a zero-gradient point that is a local minimum.

We could do that precisely because our functions were differentiable; their gradient exists.

But the reverse implication is not true: zero-gradient points are not necessarily minima! Consider the two following functions.

In [None]:
X0 = np.arange(-1.0, 1.0, 0.05)
X1 = np.arange(-1.0, 1.0, 0.05)
X = np.meshgrid(X0, X1)
Z = -X[0] * X[0] - X[1] * X[1]

fig, ax = plt.subplots(figsize=(10, 10), subplot_kw=dict(projection="3d"))
surf = ax.plot_surface(
    X[0],
    X[1],
    Z,
    rstride=1,
    cstride=1,
    cmap=cm.coolwarm,
    linewidth=0,
    antialiased=False,
    alpha=1,
)

ax.set_xlabel("X")
ax.set_ylabel("Y")
ax.set_zlabel("Z")

clean_axis(ax)

In [None]:
Z = -X[0] * X[0] * X[0] - X[1] * X[1] * X[1]
fig, ax = plt.subplots(figsize=(10, 10), subplot_kw=dict(projection="3d"))

surf = ax.plot_surface(
    X[0],
    X[1],
    Z,
    rstride=1,
    cstride=1,
    cmap=cm.coolwarm,
    linewidth=0,
    antialiased=False,
    alpha=1,
)

ax.set_xlabel("X")
ax.set_ylabel("Y")
ax.set_zlabel("Z")

clean_axis(ax)

Both these functions have a null gradient in $(0,0)$ but neither has a minimum in $(0,0)$.

So $\hat{x}$ is a minimum $\Rightarrow$ $\nabla_x f(\hat{x}) = 0$ is true but the reverse implication is false.

Take a point $x$ where the surface is flat (with a zero gradient). To guarantee that this point is a (local) minimum we need to guarantee that the function is (locally) **convex** around this point. Technically, that means that for any two points $y$ and $z$, if we draw a line between $(y, f(y))$ and $(z, f(z))$, the function will actually sit below that line. Formally:
$$f\textrm{ is convex }\Leftrightarrow \forall (y,z)\in\mathbb{R}^n\times\mathbb{R}^n, \forall \lambda \in [0,1], f(\lambda y + (1-\lambda) z) \leq \lambda f(y) + (1-\lambda) f(z)$$
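
The inequality above can be probed numerically. The following sketch uses a hypothetical helper, `violates_convexity`, written for this illustration: it samples random triples $(y, z, \lambda)$ and looks for a violation. Finding one proves non-convexity, while finding none proves nothing and only suggests convexity on the sampled region.

```python
import numpy as np


def violates_convexity(f, dim=2, n_samples=1000, scale=5.0, seed=0):
    """Return True if a sampled (y, z, lambda) breaks the convexity inequality."""
    rng = np.random.default_rng(seed)
    for _ in range(n_samples):
        y = rng.uniform(-scale, scale, size=dim)
        z = rng.uniform(-scale, scale, size=dim)
        lam = rng.uniform(0.0, 1.0)
        lhs = f(lam * y + (1.0 - lam) * z)
        rhs = lam * f(y) + (1.0 - lam) * f(z)
        if lhs > rhs + 1e-9:  # small tolerance for rounding errors
            return True
    return False


def bowl(x):
    """Convex example."""
    return x[0] ** 2 + x[1] ** 2


def dome(x):
    """Concave example (same as the first surface plotted above)."""
    return -x[0] ** 2 - x[1] ** 2


print("bowl violates the inequality:", violates_convexity(bowl))
print("dome violates the inequality:", violates_convexity(dome))
```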

Let's link that with the gradient. A zero gradient means a flat surface, not a minimum. What we were missing before is the fact that the surface goes up when we move away from the minimum. In other words, when we move away from a minimum (in *any* direction), the slope of the function must increase. This means that the derivative of the gradient must be positive. But the gradient is a vector, so its derivative is a matrix. The gradient's derivative is called the **Hessian matrix**. It is written:
$$H_f(x) = \nabla_x^2 f(x) = \left[ \begin{array}{ccc}
\frac{\partial^2 f}{\partial x_0^2}(x) & \cdots & \frac{\partial^2 f}{\partial x_0 \partial x_n}(x)\\
\vdots & \ddots & \vdots \\
\frac{\partial^2 f}{\partial x_0 \partial x_n}(x) & \cdots & \frac{\partial^2 f}{\partial x_n^2}(x)
\end{array}\right]$$

And we have the equivalence:
$$H_f(x)\textrm{ is a positive definite matrix} \Leftrightarrow f\textrm{ is convex in }x$$

We say that a function is strictly convex in $x$ if $H_f(x)$ is positive definite. If it is only positive semi-definite, the function is only convex (not strictly). What does that mean?

Recall that a symmetric matrix $A$ is positive semi-definite iff $\forall x\in\mathbb{R}^n, x^T A x \geq 0$, and positive definite iff the inequality is strict for all $x\neq 0$. It is equivalent to say that a matrix is positive definite and that its eigen-values are strictly positive. On the other hand, the Hessian's elements indicate how fast the gradient's components increase when we slightly move away from $x$ in a given direction. So saying that the Hessian is positive definite is equivalent to saying that its eigen-values are all strictly positive, which corresponds in turn to saying that the gradient increases, whatever the direction we take to move away from $x$. In the end, $H_f(x)>0$ is a very natural definition of convexity.

In particular, if the Hessian is only positive semi-definite, that means some eigen-values can be zero, which means that the gradient does not increase nor decrease in the directions of the corresponding eigen-vectors. In this case, the function is not strictly convex, just convex.

Similarly, if $f$ is convex in all possible $x$, we say it is globally convex.
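
In practice, the sign of the Hessian's eigen-values can be inspected with `np.linalg.eigvalsh`. The sketch below uses a hypothetical helper, `classify_stationary_point`, written for this illustration, and applies it to Hessians computed by hand at $(0,0)$ for a few simple functions, including the two surfaces plotted above.

```python
import numpy as np


def classify_stationary_point(hessian, tol=1e-12):
    """Classify a zero-gradient point from the eigen-values of its Hessian."""
    eigenvalues = np.linalg.eigvalsh(hessian)  # for symmetric matrices
    if np.all(eigenvalues > tol):
        return "positive definite: strict local minimum"
    if np.all(eigenvalues < -tol):
        return "negative definite: strict local maximum"
    if np.any(eigenvalues > tol) and np.any(eigenvalues < -tol):
        return "indefinite: saddle point"
    return "semi-definite: second-order information is inconclusive"


# Hessians at (0, 0), derived by hand for each function.
examples = {
    "-x0^2 - x1^2": np.diag([-2.0, -2.0]),
    "-x0^3 - x1^3": np.zeros((2, 2)),
    "x0^2 - x1^2": np.diag([2.0, -2.0]),
    "x0^2 + x1^2": np.diag([2.0, 2.0]),
}
for name, hessian in examples.items():
    print(name, "->", classify_stationary_point(hessian))
```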

Why did we go through all this?
1. First because it is important to formalize what convexity is, both geometrically (through the definition above: the function between $y$ and $z$ sits below the line that connects $f(y)$ and $f(z)$) and analytically (the Hessian is positive definite in all $x$).
2. Because we will now use this characterization of convexity to look at the function's shape (and not only the gradient) to improve our gradient descent.

## 6. Adapting the descent directions to the function's conditioning

Let us now take a look at a last issue in descent methods. In the previous example, you may have noticed the sequence of points keeps bouncing off the "walls" of the surface, moving in consecutive orthogonal directions. That would actually be perfect if these "walls" were circular because in that case the first gradient would directly point to the minimum. Unfortunately, in real life, functions are rarely so well-behaved. Still, wouldn't it be nice to have a series of descent directions that automatically adapt to the local shape of the function?

Formally, the local *conditioning* of a convex function in $x$ is the ratio $\frac{M}{m}$ between the biggest and the smallest eigen-value of its Hessian matrix $H=\nabla^2 f(x)$ in $x$. Graphically, the conditioning measures how "well-behaved" the function is for gradient descent (that is, how "circular" the contour lines are). A well-conditioned function locally looks like a circular crater (and the conditioning is close to 1), while an ill-conditioned function locally looks like the Grand Canyon (and the conditioning is large).

Let's illustrate this on the two functions $f_0(x_0,x_1) = x_0^2+x_1^2$ and $f_1(x_0,x_1) = x_0^2+10x_1^2$.

<div class="alert alert-warning"><b>Exercise:</b>
<ul>
<li> What are the Hessian matrices of $f_0$ and $f_1$? Do these matrices depend on $x$?
<li> What is the conditioning of $f_0$? And $f_1$?
</ul>
</div>

Let's illustrate that graphically.

In [None]:
def func0(x):
    return x[0] * x[0] + x[1] * x[1]


def func1(x):
    return x[0] * x[0] + 10 * x[1] * x[1]


X0 = np.arange(-1, 1, 0.05)
X1 = np.arange(-1, 1, 0.05)

fig, ax = plot_contours_func(func0, X0, X1)
clean_axis(ax)

fig, ax = plot_contours_func(func1, X0, X1)
clean_axis(ax)

Interestingly, changing the basis vectors (that is, changing the coordinate system) used to describe our function changes its conditioning. Recall old memories of linear algebra: if you move from the canonical basis to the basis formed by the eigen-vectors of the Hessian matrix, each rescaled by the inverse square root of its eigen-value, then, locally, you get a perfect conditioning of $1$.<br>
<br>
So, locally in $x$, for a strictly convex function, there exists a basis in which the function is well-conditioned and we would prefer to express the gradient in this basis in order to define a descent direction. It so happens (as said in the previous paragraph) that this change of coordinates is entirely determined by the Hessian in $x$.<br>
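
Here is a small numerical illustration of this change of coordinates, assuming a hypothetical constant Hessian `H` (made up for the example). The new coordinates $y$ are defined by $x = Py$, where the columns of $P$ are the eigen-vectors of $H$ rescaled by the inverse square roots of the eigen-values. In these coordinates the Hessian becomes $P^T H P$.

```python
import numpy as np

# Hypothetical constant Hessian (symmetric positive definite).
H = np.array([[3.0, 1.0], [1.0, 2.0]])

eigenvalues, eigenvectors = np.linalg.eigh(H)
conditioning = eigenvalues.max() / eigenvalues.min()

# Change of coordinates x = P y: eigen-vectors rescaled by 1 / sqrt(eigen-value).
P = eigenvectors @ np.diag(1.0 / np.sqrt(eigenvalues))
H_new = P.T @ H @ P  # Hessian of the function y -> f(P y)

new_eigenvalues = np.linalg.eigvalsh(H_new)
new_conditioning = new_eigenvalues.max() / new_eigenvalues.min()

print("conditioning in the canonical basis:", conditioning)
print("conditioning in the rescaled basis :", new_conditioning)
```
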
<br>
The idea of the Conjugate Gradients method is to take successive descent directions that, instead of being orthogonal to each other in the standard coordinate system ($d_k^Td_{k+1} = 0$) are orthogonal to each other in the basis spanned by the columns of the Hessian $H$ in $x$, that is $d_k^THd_{k+1} = 0$.<br>
<br>
Such descent directions are said to be $H$-conjugate. The property of the Conjugate Gradients method is that it adapts to the local shape of the function. One can prove that by taking successive conjugate descent directions, one finally reaches the function's minimum. In practice, we can expect the Conjugate Gradients method to converge faster than plain gradient descent precisely because it avoids the "bouncing phenomenon" on the function's contour lines.<br>
<br>
To simplify the notations, from now on, we shall write $g_k = \nabla_x f(x_k)$ for the function's gradient at point $x_k$.<br>
<br>
In practice, one wants to avoid computing the Hessian at each step to find the next descent direction. Because the gradients generated along the successive line searches are orthogonal to each other, finding the current descent direction $d_k$ that is $H$-conjugate to all previous descent directions can actually be simplified to an iterative formula that only requires the knowledge of the current gradient $g_k$, the previous gradient $g_{k-1}$ and the previous descent direction $d_{k-1}$.<br>
<br>
In the case of quadratic functions (constant Hessian), the consecutive descent directions are generated by:
$$d_{k} = -g_k + \frac{\left(g_k\right)^T g_k}{(g_{k-1})^Tg_{k-1}} d_{k-1}$$

In the general case, Conjugate Gradients methods incrementally construct the sequence of descent directions using the Fletcher-Reeves or the Polak-Ribière formula. The latter is the most commonly used one:
$$d_{k} = -g_k + \frac{\left(g_k\right)^T \left(g_k - g_{k-1} \right)}{(g_{k-1})^Tg_{k-1}} d_{k-1}$$
<br>
Recall that as previously, in a Conjugate Gradient method, the step-size is found by line search.<br>
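
To make the update rule concrete, here is a minimal sketch of the quadratic case, assuming a hypothetical function $f(x) = \frac{1}{2}x^TAx - b^Tx$ (`A` and `b` are made up for the illustration). It uses the first formula above together with the closed-form exact line search available for quadratic functions, and checks that the two directions obtained are $H$-conjugate (here $H = A$).

```python
import numpy as np

# Hypothetical quadratic function f(x) = 0.5 x^T A x - b^T x, with Hessian A.
A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, -1.0])


def grad(x):
    """Gradient of the quadratic function."""
    return A @ x - b


x = np.array([2.0, -1.0])
g = grad(x)
d = -g  # the first direction is the plain steepest descent direction
directions = []
for _ in range(2):
    alpha = -(g @ d) / (d @ A @ d)  # exact line search along d
    x = x + alpha * d
    g_new = grad(x)
    beta = (g_new @ g_new) / (g @ g)  # coefficient of the quadratic-case formula
    directions.append(d)
    d = -g_new + beta * d
    g = g_new

print("d_0^T H d_1 =", directions[0] @ A @ directions[1])
print("d_0^T d_1   =", directions[0] @ directions[1])
print("gradient norm after the updates:", np.linalg.norm(g))
```
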
<br>
Implementing a Conjugate Gradients method can be a little tedious and we won't have the time during this class to do it. However, `scipy.optimize` provides a [`fmin_cg`](https://docs.scipy.org/doc/scipy/reference/generated/scipy.optimize.fmin_cg.html) function (which is actually equivalent to calling the function [`minimize`](https://docs.scipy.org/doc/scipy/reference/generated/scipy.optimize.minimize.html) with the argument `method='CG'`). Run the code below to see the descent performed by the CG method.

In [None]:
import scipy.optimize as sopt

x0 = np.array([-5.0, -4.0])
res = sopt.fmin_cg(func, x0, fprime=func_der, retall=True, disp=True)
xopt = res[0]
steps = np.array(res[1])

print(steps)

X0 = np.arange(-6, 6, 0.1)
X1 = np.arange(-6, 6, 0.1)
levels = np.array([0.15, 0.2, 0.25, 0.5, 1.0, 2.0, 3.0, 4.0, 5.0])
fig, ax = plot_contours_func(
    func,
    X0,
    X1,
    levels=levels,
    xp=steps.T,
    plot_line=True,
    add_levels=False,
    f_der=func_der,
)
clean_axis(ax)

X0 = np.arange(-2.99, -2.98, 0.001)
X1 = np.arange(-1.726, -1.720, 0.001)
fig, ax = plot_contours_func(
    func, X0, X1, xp=steps[4:, :].T, plot_line=True, add_levels=False, f_der=func_der
)
clean_axis(ax, None)

Did you notice something unexpected between the 3rd and 4th point?

Yes indeed! It seems the line search procedure did not really find a minimum and went way too far. Actually, it seems to happen also between the 1st and 2nd point, where, this time, it did not go as far as expected. That's because [`fmin_cg`](https://docs.scipy.org/doc/scipy/reference/generated/scipy.optimize.fmin_cg.html) performs the line search using [Wolfe conditions](https://en.wikipedia.org/wiki/Wolfe_conditions), which define stopping conditions for performing *inexact line search*. Inexact line search methods provide an efficient way of computing an acceptable step length $\alpha$ that reduces the objective function 'sufficiently', rather than minimizing the objective function over $\alpha\in \mathbb{R}^+$ exactly. So it is simply a matter of tradeoff between computational time and line search accuracy.
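
To give an idea of what "acceptable" means, the sketch below checks the two (weak) Wolfe conditions for a candidate step on a hypothetical one-dimensional function. The helper `satisfies_wolfe` is written for this illustration and is not how `fmin_cg` is implemented; the constants `c1` and `c2` are commonly used values, assumed here.

```python
import numpy as np


def satisfies_wolfe(f, grad, x, d, alpha, c1=1e-4, c2=0.9):
    """Check the (weak) Wolfe conditions for step alpha along direction d."""
    slope0 = grad(x) @ d  # directional derivative at alpha = 0 (negative)
    x_new = x + alpha * d
    # Sufficient decrease (Armijo): the step is not too long.
    sufficient_decrease = f(x_new) <= f(x) + c1 * alpha * slope0
    # Curvature condition: the step is not too short.
    curvature = grad(x_new) @ d >= c2 * slope0
    return bool(sufficient_decrease and curvature)


def f(x):
    """Hypothetical test function f(x) = x^2 (in dimension 1)."""
    return float(x @ x)


def grad(x):
    return 2.0 * x


x = np.array([1.0])
d = -grad(x)  # along this line, the exact minimizer is alpha = 0.5
for alpha in (1e-3, 0.3, 0.5, 1.0):
    print(alpha, satisfies_wolfe(f, grad, x, d, alpha))
```

A whole range of step sizes around the exact minimizer is accepted, which is why the points returned by an inexact line search need not sit at the minimum along the search direction.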

<div class="alert alert-warning"><b>Exercise:</b> Given all the experiments carried out in this section, with the same function and the same initialization of the search, fill the "number of steps before convergence" column in the table below.
</div>

| Algorithm                              | Number of steps before convergence |
|----------------------------------------|------------------------------------|
| Gradient descent with fixed step sizes | $\sim30$                           |
| Gradient descent with line search      | $10$                               |
| Conjugate gradients with line search   | $2$                                |

<div class="alert alert-warning"><b>Exercise:</b>
How many steps does it take for a conjugate gradients method to converge on the quadratic, ill-conditioned $f_1(x_0,x_1) = x_0^2+10x_1^2$ function?
</div>

In [None]:
# %load solutions/code5.py

<div class="alert alert-info">
    <b>Let's wrap everything up.</b>
    <br/>In this section we have focused on gradient descent methods. We have seen that:
<ol>
<li> Descent methods define a sequence $x_{k+1} = x_k + \alpha_k d_k$, where $d_k$ is a descent direction and $\alpha_k$ is the step size.
<li> The opposite of the gradient gives a descent direction.
<li> The step size needs to be adapted to guarantee convergence. For this, we introduced line search and its properties.
<li> The conjugate gradients method takes descent directions that account for the local shape (convexity, conditioning) of the function and converges faster.
</ol>
</div>