# Lab 2 (Part A) - Introduction to gradient descent

This part of the Lab is a step by step introduction to the gradient descent algorithm. It will help you understand how it works. Make sure that you check the videos of lecture 2 before starting this Lab:
- Introduction to Linear Regression: https://www.youtube.com/watch?v=-wmjwMWRsZU&list=PLS8J_PRPtGfdnPf9QqT7Itnn2rAHsoWqY&index=3
- Introduction to Nonlinear Regression: https://www.youtube.com/watch?v=Hyu8QMLEHrE&list=PLS8J_PRPtGfdnPf9QqT7Itnn2rAHsoWqY&index=4

First, please select the Python code cell below and run it to initialize some plots. You DO NOT have to understand the code in this cell.

In [2]:
# Always run this cell before anything else. DO NOT modify this code.
%matplotlib notebook

import sys
sys.path.insert(0, 'labutils/')

from lab2utils import lab2partA1, lab2partA2
lab2A1, lab2A2 = lab2partA1(), lab2partA2()

# 1. Minimizing a function of one parameter with gradient descent

In this section, you are given a function $F(a) = (a + 5)^2$ of one parameter $a$ (a scalar value). The goal is to minimize the function $F$ using the gradient descent algorithm.

You are asked to read and complete the Python code below to perform gradient descent (read carefully the code, the comments, and the *TODO* comments in red). The function `dF(a)` that you should complete corresponds to $\frac{\partial F(a)}{\partial a}$, i.e., the derivative of the function $F(a)$ with respect to the parameter $a$.

If your implementation of gradient descent is correct and the value of $\alpha$ is correctly choosen, then the value of $a$ should approach $-5$ and the value of $F(a)$ should approach $0$. This is because the minimum of the function $F(a)$ is $0$ when $a = -5$.

Once your code works well, you can re-run it with different values of the learning rate $\alpha$ and see the difference in terms of the number of iterations it takes until convergence. For example, you can try the following values for $\alpha$: 0.01, 0.3, and 0.9. Please note that if your learning rate $\alpha$ is too large (e.g. $\alpha = 1.5$ for this example), then $F(a)$ can **diverge** and *blow up*, resulting in values which are too large for computer calculations. If your value of $F(a)$ increases or even blows up, stop the execution, adjust your learning rate and try again.

You can also re-run the code with a different initial value for the parameter $a$. For example, you can try an initial value of $a = 0$ or $a = -15$.

In [4]:
# DO NOT modify the definition of the function F(a)
def F(a):
    return (a + 5)**2


""" TODO:
Write here the definition of the function `dF(a)`, which is
the derivative of `F(a)` with respect to the parameter `a`
"""
def dF(a):
    
    return 2*(a+5)


alpha = 0.1            # The learning rate of gradient descent
#alpha = 0.01
#alpha = 0.3
#alpha = 0.9
a = 7                  # The initial value of a (any initial value is ok)
max_iterations = 100   # Maximum number of iterations to perform
epsilon = 0.0001       # Some small number to test for convergence (i.e. to stop if F(a) does not decrease too much)

for itr in range(max_iterations):
    lab2A1.plot(itr, F, a) # This plots an animation (DO NOT modify this line)
    prev = F(a) # Save the value of F(a)
    
    
    """ TODO:
    Write here the gradient descent step to update the value  of the parameter a.
    Hint: You need to use `alpha` and `dF(a)`.
    """
    a = a - alpha * dF(a)

    """ TODO:
    Replace the boolean variable `CONDITION` below with a condition to break-out
    of the loop if we are close to convergence. Hint: You need to use `prev` (the 
    previous value of F(a)), the current value of `F(a)` (after a has been updated), and `epsilon`.
    """
    CONDITION = abs(prev-F(a)) <= epsilon # Replace True with a boolean condition
    if CONDITION:
        break


<IPython.core.display.Javascript object>

# 2. Minimizing a function of two parameters with gradient descent

In this section, you are given a function $F(a, b) = 5 + a^2 + \frac{3}{2} b^2 + a b~$ of two parameters $a$ and $b$ (scalar values). The goal is to minimize the function $F(a, b)$ using the gradient descent algorithm.

You are asked to read and complete the Python code below to perform gradient descent (read carefully the code, the comments, and the *TODO* comments in red). The first function `dFa(a, b)` that you should complete corresponds to $\frac{\partial F(a, b)}{\partial a}$, i.e., the derivative of the function $F(a, b)$ with respect to the first parameter $a$. The second function `dFb(a, b)` that you should complete corresponds to $\frac{\partial F(a, b)}{\partial b}$, i.e., the derivative of the function $F(a, b)$ with respect to the second parameter $b$.

Note that $\frac{\partial F(a, b)}{\partial a} = 2 a + b$, and $\frac{\partial F(a, b)}{\partial b} = 3 b + a$.

If your implementation is correct and the value of $\alpha$ is correctly chosen, then the value of both $a$ and $b$ should approach $0$ and the value of $F(a, b)$ should approach $5$. This is because the minimum of the function $F(a, b)$ is $5$ when $a = 0$ and $b = 0$.

Once your code works well, you can re-run it with different values of the learning rate $\alpha$, and different values of the initial parameters $a$ and $b$.

In [13]:
# DO NOT modify the definition of the function F(a, b)
def F(a, b):
    return 5 + a**2 + 1.5 * b**2 + a * b


""" TODO:
Write here the definition of the function dFa(a, b), which is
the derivative of F(a, b) with respect to the first parameter a
"""
def dFa(a, b):
    return 2*a + b


""" TODO:
Write here the definition of the function dFb(a, b), which is
the derivative of F(a, b) with respect to the second parameter b
"""
def dFb(a, b):
    return 3*b + a


alpha = 0.1            # The learning rate of gradient descent
a, b = 80, 90          # The initial values of a and b (any initial values are ok)
max_iterations = 100   # Maximum number of iterations
epsilon = 0.0001       # Some small number to test for convergence (i.e. to stop if F(a) does not decrease too much)

for itr in range(max_iterations):
    lab2A2.plot(itr, F, a, b) # This plots an animation (DO NOT modify this line)
    prev = F(a, b) # Save the value of F(a, b)
    
    
    """ TODO:
    Write here the gradient descent step to update the value of parameters `a` and `b` simultaneously.
    Hint: You need to use `alpha`, `dFa(a, b)` and `dFb(a, b)`
    """
    gradF = [dFa(a, b), dFb(a, b)]
    a, b = ((a - alpha*dFa(a, b)),(b - alpha*dFb(a, b)))
    
    
    """ TODO:
    Replace the boolean variable `CONDITION` below with a condition to break-out
    of the loop if we are close to convergence. You need to use `prev` the previous value of
    F(a, b), the current value of `F(a, b)`, and `epsilon`.
    """
    CONDITION = abs(prev - F(a, b)) < epsilon # Replace True with a boolean condition
    if CONDITION:
        break
        

<IPython.core.display.Javascript object>

# 3. Minimizing a function of multiple parameters with gradient descent

This section is similar to the previous one, but you will minimize a function of multiple parameters (i.e., a vector of $p$ parameters: $\theta \in \mathbb{R}^p$).

First, you are asked to write the function $F(\theta)$ in the following Python code. The function $F(\theta)$ is defined as:
$$F(\theta) = \sum_{j} \theta_j^2$$

The gradient of the function $F(\theta)$ is denoted as $\nabla F(\theta)$. This is a vector containing the derivative of $F(\theta)$ with respect to each parameter $\theta_j$:
$$\nabla F(\theta) = \left ( \frac{\partial F(\theta)}{\partial \theta_0}, \frac{\partial F(\theta)}{\partial \theta_1}, \frac{\partial F(\theta)}{\partial \theta_2}, \dots \right )$$ 

Write the definition of the function `gradF(theta)` in the following Python code. This function corresponds to $\nabla F(\theta)$. It should return an array containing the derivative of $F(\theta)$ with respect to each parameter $\theta_j$.

If your implementation is correct and the value of $\alpha$ is correctly chosen, then you should end up getting all parameter values close to $0$ and the value of $F(\theta)$ should approach $0$. This is because the minimum of the function $F(\theta)$ is $0$ when $\theta = \vec{0} $ (the null vector).

Once your code works well, you can re-run it with a different value of the learning rate $\alpha$, and a different initial parameters vector $\theta$.

In [16]:
import numpy as np, matplotlib.pylab as plt

""" TODO:
Write here the definition of the function `F(theta)`, where `theta` is an array of parameters.
"""
def F(theta):
    return theta.T @ theta


""" TODO:
Write here the definition of the function `gradF(theta)`, the gradient of F(theta).
This function should return an array where the j'th value is this array is the 
derivative of F(theta) with respect to the j'th parameter theta_j.
"""
def gradF(theta):
    return sum(theta*2)


alpha = 0.1                      # The learning rate of gradient descent
#alpha = 0.01
#alpha = 0.3
#alpha = 0.9
theta = np.array([80, 90, -20])  # Some initial parameters vector: theta = [theta_0, theta_1, theta_2, ...]
max_iterations = 100             # Maximum number of iterations
epsilon = 0.000001               # Some small number to test for convergence (i.e. to stop if F(a) does not decrease too much)
history = []

for itr in range(max_iterations):
    prev = F(theta)
    print("iteration = {}, theta = {}, F(theta) = {}".format(itr, theta, prev))
    history.append(F(theta))
    """ TODO:
    Write here the gradient descent step to update the parameters vector `theta`.
    All the parameter values in theta should be updated simultaneously.
    Hint: You need to use `alpha` and `gradF(theta)`
    """
    theta = theta - alpha*gradF(theta)
    
    
    """ TODO:
    Replace the boolean variable `CONDITION` below with a condition to break-out
    of the loop if we are close to convergence. Hint: You need to use `prev` the 
    previous value of F(theta), the current value of `F(theta)`, and `epsilon`.
    """
    CONDITION = abs(prev - F(theta)) < epsilon # Replace True with a boolean condition
    if CONDITION:
        break


""" TODO:
Produce a plot here showing the value of F(theta) at each iteration.
You might need to modify the above code to save all the historical values of F(theta)
Note: you can use ax.plot(...) to do this plot
"""
fig, ax = plt.subplots()
ax.plot(history)
# Plot here the number of iterations vs. the history of values of F(theta)
ax.set_xlabel("Number of iterations")
ax.set_ylabel("F(theta)")
fig.show()

iteration = 0, theta = [ 80  90 -20], F(theta) = 14900
iteration = 1, theta = [ 50.  60. -50.], F(theta) = 8600.0
iteration = 2, theta = [ 38.  48. -62.], F(theta) = 7592.0
iteration = 3, theta = [ 33.2  43.2 -66.8], F(theta) = 7430.72
iteration = 4, theta = [ 31.28  41.28 -68.72], F(theta) = 7404.9152
iteration = 5, theta = [ 30.512  40.512 -69.488], F(theta) = 7400.786432
iteration = 6, theta = [ 30.2048  40.2048 -69.7952], F(theta) = 7400.125829119999
iteration = 7, theta = [ 30.08192  40.08192 -69.91808], F(theta) = 7400.020132659198
iteration = 8, theta = [ 30.032768  40.032768 -69.967232], F(theta) = 7400.003221225471
iteration = 9, theta = [ 30.0131072  40.0131072 -69.9868928], F(theta) = 7400.000515396075
iteration = 10, theta = [ 30.00524288  40.00524288 -69.99475712], F(theta) = 7400.00008246337
iteration = 11, theta = [ 30.00209715  40.00209715 -69.99790285], F(theta) = 7400.000013194139
iteration = 12, theta = [ 30.00083886  40.00083886 -69.99916114], F(theta) = 7400.000002

<IPython.core.display.Javascript object>