# Derivative-based methods

## Thanks and Credits
The core exercises are taken directly from [Daniel Newman's GitHub repository](https://github.com/dtnewman/stochastic_gradient_descent), which is freely distributed.

In [None]:
%matplotlib inline
import numpy as np
import scipy as sp
import matplotlib.pyplot as plt
from scipy.optimize import fmin
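# 'seaborn-white' was renamed to 'seaborn-v0_8-white' in matplotlib 3.6; use whichever is available
plt.style.use('seaborn-v0_8-white' if 'seaborn-v0_8-white' in plt.style.available else 'seaborn-white')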
plt.rcParams.update({'font.size': 18})

### Gradient Descent

<b>Gradient descent</b>, also known as <b>steepest descent</b>, is an optimization algorithm for finding a local minimum of a function. To find a local minimum, the algorithm "steps" in the direction of the negative gradient. <b>Gradient ascent</b> is the same as gradient descent, except that it steps in the direction of the positive gradient and therefore finds local maxima instead of minima. The algorithm of gradient descent can be outlined as follows:

&nbsp;&nbsp;&nbsp; 1: &nbsp; Choose initial guess $x_0$ <br>
&nbsp;&nbsp;&nbsp;    2: &nbsp; <b>for</b> k = 0, 1, 2, ... <b>do</b> <br>
&nbsp;&nbsp;&nbsp;    3:   &nbsp;&nbsp;&nbsp;&nbsp;&nbsp; $s_k$ = -$\nabla f(x_k)$ <br>
&nbsp;&nbsp;&nbsp;    4:   &nbsp;&nbsp;&nbsp;&nbsp;&nbsp; choose $\eta_k$ to minimize $f(x_k+\eta_k s_k)$ <br>
&nbsp;&nbsp;&nbsp;    5:   &nbsp;&nbsp;&nbsp;&nbsp;&nbsp; $x_{k+1} = x_k + \eta_k s_k$ <br>
&nbsp;&nbsp;&nbsp;    6: &nbsp;  <b>end for</b>

As a simple example, let's find a local minimum of the function $f(x) = x^3-2x^2+2$.

In [None]:
def f(x) : 
    # Fill in the blanks
    # Takes in a number (or a numpy array, as in the plot below) and returns the function value(s)
    return 0 * x

In [None]:
# This creates an array from -1 to 2.5 with 1000 points in between
x = np.linspace(-1,2.5,1000)

# Familiarize yourself with this syntax for matplotlib.
# It should become pretty intuitive the more you see it.
plt.plot(x, f(x))
plt.xlabel('x')
plt.ylabel('f(x)')
plt.xlim([-1,2.5])
plt.ylim([0,3])
plt.show()

We can see from the plot above that our local minimum is near $x \approx 1.3$, but let's pretend that we don't know that, so we set our starting point (arbitrarily, in this case) at $x_0 = 2$.
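
As a cross-check on the estimate read off the plot, the stationary points of this particular $f$ can be worked out by hand. This is a short derivation using the standard second-derivative test, added here as an aside rather than taken from the original exercise:

$$
\begin{aligned}
f'(x) &= 3x^2 - 4x = x(3x - 4) = 0 \;\Rightarrow\; x = 0 \ \text{or}\ x = \tfrac{4}{3} \\
f''(x) &= 6x - 4, \qquad f''(0) = -4 < 0, \qquad f''\!\left(\tfrac{4}{3}\right) = 4 > 0
\end{aligned}
$$

So $x = 0$ is a local maximum and $x = 4/3 \approx 1.33$ is the local minimum, with $f(4/3) = 22/27 \approx 0.81$. Gradient descent started at $x_0 = 2$ should therefore end up close to $1.33$.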

In [None]:
x_old = 0
x_new = 2 # The algorithm starts at x=2
n_k = 0.1 # step size parameter eta above
precision = 0.0001 # some desired precision so that we can stop iterating

# lists that can append values, used for plotting later
x_list, y_list = [x_new], [f(x_new)]

# returns the value of the derivative of our function
def f_prime(x):
    # fill in f_prime
    # Takes in a number (or a numpy array) and returns the value of the derivative
    return 0.0 * x

# Fill in the algorithm
# Fill in criterion : iterate till values are "sufficiently" close to one another
while 0 > 1:
    # Remove this pass statement
    pass
    # Do calculation as per the formula above here
    
    # Append values to list here ...

    
print("Local minimum occurs at:", x_new)
print("Number of steps:", len(x_list))

The figures below show the route that was taken to find the local minimum.

In [None]:
plt.figure(figsize=[10,3])
plt.subplot(1,2,1)
plt.plot(x,f(x))
plt.plot(x_list,y_list,"ro-", ms=12)
plt.xlim([-1,2.5])
plt.ylim([0,3])
plt.xlabel('x')
plt.ylabel('f(x)')
plt.title("Gradient descent")
plt.subplot(1,2,2)
plt.plot(x,f(x))
plt.plot(x_list,y_list,"ro-", ms=12)
plt.xlim([1.2,2.1])
plt.ylim([0,3])
plt.xlabel('x')
plt.ylabel('f(x)')
plt.title("Gradient descent (zoomed in)")
plt.show()

You'll notice that the step size (also called learning rate) in the implementation above is constant, unlike the algorithm in the pseudocode. Doing this makes it easier to implement the algorithm. However, it also presents some issues: If the step size is too small, then convergence will be very slow, but if we make it too large, then the method may fail to converge at all. 
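
To make the trade-off concrete, here is a minimal sketch that runs constant-step descent on the cubic from above with three different step sizes. The names `cubic_prime` and `constant_step_descent`, the step cap `max_steps`, and the blow-up threshold `1e6` are all assumptions made for this illustration, not part of the exercise.

```python
def cubic_prime(x):
    # derivative of f(x) = x**3 - 2*x**2 + 2
    return 3 * x * x - 4 * x


def constant_step_descent(x0, eta, precision=1e-4, max_steps=10_000):
    """Hypothetical helper: constant-step gradient descent in 1D.

    Returns (final x, number of steps taken, whether the stopping test was met).
    """
    x_old, x_new = x0, x0 - eta * cubic_prime(x0)
    steps = 1
    while abs(x_new - x_old) > precision:
        if steps >= max_steps or abs(x_new) > 1e6:
            return x_new, steps, False  # gave up: too slow or blowing up
        x_old = x_new
        x_new = x_old - eta * cubic_prime(x_old)
        steps += 1
    return x_new, steps, True


for eta in (0.01, 0.1, 0.6):
    x_final, steps, converged = constant_step_descent(2.0, eta)
    print(f"eta={eta}: x={x_final:.4g}, steps={steps}, converged={converged}")
```

Starting from $x_0 = 2$, the smallest step needs many more iterations than `eta = 0.1`, while `eta = 0.6` overshoots the minimum and runs off toward large negative $x$, where this cubic is unbounded below.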

One solution is to choose the step size adaptively at each iteration, for example by using `scipy`'s `fmin` function to find the step size $\eta_k$ that minimizes $f(x_k + \eta_k s_k)$. I will showcase this in class only.
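
Ahead of the in-class demonstration, here is a rough sketch of what such a line search might look like for the cubic above. The helper names (`cubic`, `cubic_prime`, `descent_with_line_search`), the starting guess of `0.1` for the step size, and the cap `max_steps` are assumptions for illustration. Note that `fmin` is a local Nelder-Mead search, so it finds a nearby minimizing step size rather than a guaranteed global one.

```python
from scipy.optimize import fmin


def cubic(x):
    return x**3 - 2 * x**2 + 2


def cubic_prime(x):
    return 3 * x**2 - 4 * x


def descent_with_line_search(x0, precision=1e-4, max_steps=100):
    """Hypothetical helper: steepest descent with a per-step line search."""
    x = x0
    path = [x0]
    for _ in range(max_steps):
        s_k = -cubic_prime(x)  # search direction
        # fmin passes the trial step size as a length-1 array, hence eta[0]
        eta_k = float(fmin(lambda eta: cubic(x + eta[0] * s_k), 0.1, disp=False)[0])
        x_next = x + eta_k * s_k
        path.append(x_next)
        converged = abs(x_next - x) < precision
        x = x_next
        if converged:
            break
    return x, path


x_min, path = descent_with_line_search(2.0)
print("Local minimum occurs at:", x_min)
print("Number of steps:", len(path))
```

Because each step size is chosen to minimize $f$ along the search direction, far fewer iterations are needed than with a constant step, at the cost of an inner optimization per step.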

Another approach is to shrink the step size over time using a decrease constant $d$:
$\eta(t+1) = \eta(t) / (1+t \times d)$. This is commonly done in supervised machine-learning methods, where a variation of steepest descent called Stochastic Gradient Descent (SGD) is used.
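
A minimal sketch of this decay rule, applied to the cubic from above, could look as follows. The names `cubic_prime` and `descent_with_decay` and the cap `max_steps` are assumptions for illustration; the starting values mirror the ones used in the next cell.

```python
def cubic_prime(x):
    # derivative of f(x) = x**3 - 2*x**2 + 2
    return 3 * x**2 - 4 * x


def descent_with_decay(x0, eta0, d, precision=1e-4, max_steps=1000):
    """Hypothetical helper: gradient descent with eta(t+1) = eta(t) / (1 + t*d)."""
    x_old, x_new = None, x0
    eta, t = eta0, 0
    etas = []  # step sizes actually used, kept for inspection
    while (x_old is None or abs(x_new - x_old) > precision) and t < max_steps:
        x_old = x_new
        x_new = x_old - eta * cubic_prime(x_old)
        etas.append(eta)
        eta = eta / (1 + t * d)  # shrink the step size
        t += 1
    return x_new, etas


x_min, etas = descent_with_decay(2.0, 0.17, 1)
print("Local minimum occurs at:", x_min)
print("Step sizes used:", etas)
```

Applied literally, the rule compounds: after $t$ updates the step size has been divided by $\prod_{k=0}^{t-1}(1 + k d)$, which is $t!$ for $d = 1$. The steps therefore become tiny very quickly, and the loop may stop because the steps are small rather than because the minimum has been reached exactly.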

In [None]:
x_old = 0
x_new = 2 # The algorithm starts at x=2
n_k = 0.17 # step size
precision = 0.0001
t, d = 0, 1

x_list, y_list = [x_new], [f(x_new)]

# returns the value of the derivative of our function
def f_prime(x):
    # fill in f_prime or use one filled above
    return 0.0 * x

# Fill in the algorithm
# Fill in criterion : iterate till values are "sufficiently" close to one another
while 0 > 1:
    # Remove this pass statement
    pass
    # Do calculation here
    
    # Adapt eta here
    
    # Append to list here ..


print("Local minimum occurs at:", x_new)
print("Number of steps:", len(x_list))

### Gradient Descent in two dimensions

The same algorithm works regardless of the number of dimensions! The derivatives are now gradients, and hence vectors.

Let's find the minimum of the function $ x^2 + \texttt{stretch_factor}*y^2 $, where `stretch_factor` is a variable that can be changed, using the constant step-size version of the steepest-descent algorithm.
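
Writing $s$ as shorthand for `stretch_factor` (notation introduced here for brevity), the gradient needed by the algorithm and the resulting constant-step update work out to:

$$
\nabla f(x, y) = \begin{bmatrix} 2x \\ 2 s y \end{bmatrix}, \qquad
\begin{bmatrix} x_{k+1} \\ y_{k+1} \end{bmatrix} = \begin{bmatrix} (1 - 2\eta)\, x_k \\ (1 - 2\eta s)\, y_k \end{bmatrix}
$$

Each coordinate is simply multiplied by a fixed factor at every step, so the iterates shrink toward the minimum at the origin only when both factors have magnitude below one.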

In [None]:
x_old = np.array([0.0, 0.0])
x_new = np.array([6.0, 6.0]) # The algorithm starts at x=6.0,6.0
n_k = 0.1 # step size
precision = 0.0001
t, d = 0, 1

# controls how the contour plot is stretched in the x/y-direction
stretch_factor = 1.0

def f(x):
    # fill in, takes in an array of size (2,) and spits out a single number
    return 0.0 * x[0] + 0.0 * x[1]

# returns the value of the derivative of our function
def f_prime(x):
    # fill in 
    # Takes in an array of size (2,) and spits out an array of (2,)
    return 0.0 * x

# lists that can append values, used for plotting later
x_list, y_list = [x_new], [f(x_new)]

# Fill in criterion : iterate till values are "sufficiently" close to one another
while 0 > 1:
    # Fill in algorithm

    # You should see that with numpy you can essentially write
    # code that 'looks' like you are operating on a single number
    # but you are doing array operations!
    
    # Append values to the lists here, used for plotting later
    pass

print("Local minimum occurs at:", x_new)
print("Number of steps:", len(x_list))

In [None]:
fig = plt.figure(figsize=(8,8))
ax = fig.add_subplot(111)
x_collection = np.array(x_list)
x_collection = x_collection if x_collection.shape[1] == 2 else x_collection.T
ax.plot(x_collection[:, 0], x_collection[:, 1], 'ro-', ms=14)
grid_x = np.linspace(-6.0, 6.0, 100)
grid_y = np.linspace(-6.0, 6.0, 100)
X,Y = np.meshgrid(grid_x, grid_y)
Z = f([X, Y])
ax.contourf(X, Y ,Z, cmap=plt.cm.viridis)
ax.set_xlabel('x')
ax.set_ylabel('y')
ax.set_title('f(x,y)')
ax.set_aspect('equal')

### Brittle
But the method is very easy to break. Try changing the `stretch_factor` in the example above: we started from 1, so how about changing it to 2, 4, 8, 16, ...?
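
One way to explore this systematically is to rerun the constant-step descent for several values of `stretch_factor` and record whether it settles. The sketch below is an illustration under stated assumptions: the helper `run_descent`, the cap `max_steps`, and the blow-up threshold `1e6` are not part of the exercise, and the gradient $(2x,\ 2\,\texttt{stretch\_factor}\,y)$ is hard-coded.

```python
import numpy as np


def run_descent(stretch_factor, eta=0.1, precision=1e-4, max_steps=500):
    """Hypothetical helper: constant-step descent on x^2 + stretch_factor * y^2."""
    scale = np.array([2.0, 2.0 * stretch_factor])  # gradient is scale * x
    x_old = np.array([6.0, 6.0])
    for step in range(1, max_steps + 1):
        x_new = x_old - eta * scale * x_old
        if np.linalg.norm(x_new) > 1e6:
            return x_new, step, False  # iterates are blowing up
        if np.linalg.norm(x_new - x_old) < precision:
            return x_new, step, True  # successive iterates are close enough
        x_old = x_new
    return x_old, max_steps, False  # never settled within max_steps


for s in (1, 2, 4, 8, 16):
    x_final, steps, converged = run_descent(s)
    print(f"stretch_factor={s:2d}: steps={steps}, converged={converged}")
```

With `eta = 0.1`, the $y$-coordinate is multiplied by $1 - 0.2\,\texttt{stretch\_factor}$ at each step. Once `stretch_factor` exceeds 10, that factor has magnitude above one and the iterates diverge, unless the step size is reduced accordingly.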


The conjugate gradient method overcomes this _difficulty_ with `stretch_factor`.

## Method of Conjugate Gradients

If we need to minimize a function of the form

$$ \mathbf{x}^* = \textrm{argmin} \left( {\tfrac {1}{2}} \mathbf{x}^{\mathsf {T}} \mathbf{A} \mathbf{x} - \mathbf{x}^{\mathsf {T}}\mathbf{b} \right) $$

which, for a symmetric positive-definite $\mathbf{A}$, reduces to solving $ \mathbf{A} \mathbf{x} - \mathbf{b} = 0$, we can use the following algorithm (found [here](https://en.wikipedia.org/wiki/Conjugate_gradient_method#The_resulting_algorithm)). An approachable introduction to CG can be found [here](http://www.cs.cmu.edu/~quake-papers/painless-conjugate-gradient.pdf).

\begin{aligned}
&\mathbf {r} _{0}:=\mathbf {b} -\mathbf {Ax} _{0}\\
&{\hbox{if }}\mathbf {r} _{0}{\text{ is sufficiently small, then return }}\mathbf {x} _{0}{\text{ as the result}}\\
&\mathbf {p} _{0}:=\mathbf {r} _{0}\\
&k:=0\\
&{\text{repeat}}\\
&\qquad \alpha _{k}:={\frac {\mathbf {r} _{k}^{\mathsf {T}}\mathbf {r} _{k}}{\mathbf {p} _{k}^{\mathsf {T}}\mathbf {Ap} _{k}}}\\
&\qquad \mathbf {x} _{k+1}:=\mathbf {x} _{k}+\alpha _{k}\mathbf {p} _{k}\\
&\qquad \mathbf {r} _{k+1}:=\mathbf {r} _{k}-\alpha _{k}\mathbf {Ap} _{k}\\
&\qquad {\hbox{if }}\mathbf {r} _{k+1}{\text{ is sufficiently small, then exit loop}}\\
&\qquad \beta _{k}:={\frac {\mathbf {r} _{k+1}^{\mathsf {T}}\mathbf {r} _{k+1}}{\mathbf {r} _{k}^{\mathsf {T}}\mathbf {r} _{k}}}\\
&\qquad \mathbf {p} _{k+1}:=\mathbf {r} _{k+1}+\beta _{k}\mathbf {p} _{k}\\
&\qquad k:=k+1\\
&{\text{end repeat}}\\
&{\text{return }}\mathbf {x} _{k+1}{\text{ as the result}}
\end{aligned}
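
The steps above translate almost line by line into NumPy. The following is a sketch rather than a reference implementation: the function name `conjugate_gradient`, the tolerance `eps`, the cap `max_iter`, and the small test system (a symmetric positive-definite $2 \times 2$ matrix chosen for illustration) are all assumptions.

```python
import numpy as np


def conjugate_gradient(A, b, x0, eps=1e-10, max_iter=50):
    """Sketch of CG for A x = b, assuming A is symmetric positive-definite."""
    x = np.asarray(x0, dtype=float).copy()
    r = b - A @ x  # initial residual
    history = [x.copy()]
    if np.linalg.norm(r) < eps:
        return x, history
    p = r.copy()  # first search direction is the residual
    for _ in range(max_iter):
        Ap = A @ p
        rr = r @ r
        alpha = rr / (p @ Ap)  # step length along p
        x = x + alpha * p
        r = r - alpha * Ap  # updated residual
        history.append(x.copy())
        if np.linalg.norm(r) < eps:
            break
        beta = (r @ r) / rr
        p = r + beta * p  # next direction, conjugate to the previous ones
    return x, history


A_demo = np.array([[4.0, 1.0], [1.0, 3.0]])
b_demo = np.array([1.0, 2.0])
x_cg, history = conjugate_gradient(A_demo, b_demo, x0=[2.0, 1.0])
print("CG solution:", x_cg)
print("Direct solve:", np.linalg.solve(A_demo, b_demo))
print("Number of iterates stored:", len(history))
```

In exact arithmetic CG solves an $n \times n$ symmetric positive-definite system in at most $n$ iterations, so for a $2 \times 2$ problem the history should contain only a handful of points.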

We can cast the problem seen before, minimizing $x^2 + \texttt{stretch_factor} * y^2$, into the following form (the extra factor of $\tfrac{1}{2}$ rescales the function but does not move its minimizer):

\begin{equation*}
\mathbf{x}^* = \textrm{argmin} \left( {\tfrac {1}{2}} \mathbf{x}^{\mathsf {T}} \cdot \begin{bmatrix}
1 & 0\\
0 & \texttt{stretch_factor}
\end{bmatrix}
\cdot \mathbf{x} - \mathbf{x}^{\mathsf {T}}
\begin{bmatrix}
0 \\
0
\end{bmatrix}\right) \\
\end{equation*}
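
Expanding the matrix form shows why it matches the earlier problem. Here $s$ stands for `stretch_factor` and $\mathbf{x} = (x, y)^{\mathsf T}$, a shorthand introduced for this aside:

$$
\tfrac{1}{2}\begin{bmatrix} x & y \end{bmatrix}\begin{bmatrix} 1 & 0 \\ 0 & s \end{bmatrix}\begin{bmatrix} x \\ y \end{bmatrix} - \begin{bmatrix} x & y \end{bmatrix}\begin{bmatrix} 0 \\ 0 \end{bmatrix} = \tfrac{1}{2}\left(x^2 + s\,y^2\right)
$$

This is half of the function minimized earlier, so both share the same minimizer. Setting the gradient $\mathbf{A}\mathbf{x} - \mathbf{b} = (x,\ s\,y)^{\mathsf T}$ to zero gives $\mathbf{x}^* = (0, 0)$ for any $s > 0$.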


In [None]:
stretch_factor = 100.0
A = np.random.randn(2,2) # What do you think A should be? 
b = np.random.randn(2,)  # What do you think b should be? 

In [None]:
# Initial guess for the solution
x = np.array([6.0, 6.0])
x_list = [x]

# Optional : use a "max" number of iterations beyond which the simulation
# doesn't run
i = 0
imax = 10 

# Tolerance
eps = 0.0001

# Start algorithm here
# Do some initial setup before the repeat block above

# initial setup

# Setup conditions for the loop
while 0 > 1 and 1  > 2:
    # Complex processing

    # Loop counter
    i += 1
    
    # Don't forget to append data to list!
    
    pass

In [None]:
fig = plt.figure(figsize=(8,8))
ax = fig.add_subplot(111)
x_collection = np.array(x_list)
x_collection = x_collection if x_collection.shape[1] == 2 else x_collection.T
ax.plot(x_collection[:, 0], x_collection[:, 1], 'ro-', ms=14)
grid_x = np.linspace(-6.0, 6.0, 100)
grid_y = np.linspace(-6.0, 6.0, 100)
X,Y = np.meshgrid(grid_x, grid_y)
Z = f([X, Y])
ax.contourf(X, Y ,Z, cmap=plt.cm.viridis)
ax.set_aspect('equal')

## Is this realistic?

That's great, but how useful is it for real-life functions with the following properties?
- Multi-modal (the above was a unimodal function, with one global minimum)
- Non-convex (the above was a convex function)
- Non-separable (in the above example $x$ and $y$ contribute through separate, independent terms)
- Non-linear (the above problem is essentially linear, since its gradient is linear in $\mathbf{x}$)

To test that, let's take the Rastrigin function, discussed a couple of lectures ago, and apply steepest descent and CG to minimize it. We need to locally linearize the problem at every step, which involves finding the gradient (first derivatives: a vector) and the Hessian (second derivatives: a matrix) of the function. The Rastrigin function in two dimensions is:
$$f(\mathbf{x}) = 20 + \left[ x^2 - 10 \cos\left(2 \pi x \right) \right] + \left[ y^2 - 10 \cos\left(2 \pi y \right) \right]$$

For this part, I'll demonstrate the math and code in class.
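
As a starting point before the in-class demonstration, the function together with its gradient and Hessian might be coded as below. The function names are assumptions for illustration, the constant `20` ties the code to the two-dimensional formula above, and the finite-difference comparison is only a sanity check on the hand-derived gradient.

```python
import numpy as np


def rastrigin(x):
    # two-dimensional Rastrigin function; x is an array of shape (2,)
    x = np.asarray(x, dtype=float)
    return 20.0 + np.sum(x**2 - 10.0 * np.cos(2.0 * np.pi * x))


def rastrigin_grad(x):
    # gradient: each component is 2*x_i + 20*pi*sin(2*pi*x_i)
    x = np.asarray(x, dtype=float)
    return 2.0 * x + 20.0 * np.pi * np.sin(2.0 * np.pi * x)


def rastrigin_hessian(x):
    # Hessian is diagonal because the function is a sum of one-variable terms
    x = np.asarray(x, dtype=float)
    return np.diag(2.0 + 40.0 * np.pi**2 * np.cos(2.0 * np.pi * x))


# sanity check: compare the analytic gradient with central finite differences
point = np.array([0.3, -1.2])
h = 1e-6
fd_grad = np.array([
    (rastrigin(point + h * e) - rastrigin(point - h * e)) / (2 * h)
    for e in np.eye(2)
])
print("Analytic gradient:", rastrigin_grad(point))
print("Finite-difference gradient:", fd_grad)
```

The sign of the Hessian's diagonal entries changes with position, which is one way to see that the function is non-convex even though it is still separable.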