# Question 1: Gradient Descent 
Consider the function:

$$f(u,v,b)=-\log\sigma(u+b)-\log\sigma(v+b)-\log\sigma\bigg(-\frac{u+v+2b}{2}\bigg)+\frac{u^2+v^2+b^2}{100}$$

where $u,v,b \in \mathbb{R}$ and 

$$\sigma(x) = {1\over 1+e^{-x}} = {e^{x} \over e^{x} + 1} $$

is the sigmoid function. Objective functions of this kind will reappear in a more complex form when we discuss neural networks. This one is in fact the logistic regression objective for three data points with $L_2$-regularization. You might have learned about logistic regression in another course such as data analytics for engineers.
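To make the logistic-regression reading concrete, here is a minimal sketch. The three data points and labels are not given in the text; they are one assumed choice that reproduces $f$ when the weights are $w=(u,v)$, the bias is $b$, and the regularizer (with assumed strength `lam=0.01`) also penalizes the bias. The names `X`, `y`, `logistic_loss` and `lam` are hypothetical.

```python
import numpy as np

def f(u, v, b):
    # f as defined above, using -log(sigma(x)) = log(1 + exp(-x))
    s = (u + v + 2 * b) / 2
    return (np.log1p(np.exp(-(u + b))) + np.log1p(np.exp(-(v + b)))
            + np.log1p(np.exp(s)) + (u**2 + v**2 + b**2) / 100)

# Assumed data set (hypothetical, chosen to match f): features X, labels y in {-1, +1}
X = np.array([[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]])
y = np.array([1.0, 1.0, -1.0])

def logistic_loss(w, b, lam=0.01):
    # sum_i -log(sigma(y_i * (w . x_i + b))) + lam * (||w||^2 + b^2)
    margins = y * (X @ w + b)
    return np.sum(np.log1p(np.exp(-margins))) + lam * (w @ w + b**2)

# Both expressions should agree at any point, e.g. the starting point (4, 2, 1)
print(f(4, 2, 1), logistic_loss(np.array([4.0, 2.0]), 1.0))
```

Under this reading, the first two terms of $f$ are the losses of two positively labelled points and the third term is the loss of one negatively labelled point halfway between them.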

We try to find the minimum of $f$ with the gradient descent algorithm. In particular, we evaluate various step-size policies. To do this, you need to implement the following:
- Implement a function that takes a point $(u,v,b)$ and returns the gradient of $f$ at this point.
- Implement a function `gradient_descent(f, grad_f, eta, (u_0, v_0, b_0), max_iter=100)` that performs `max_iter` gradient descent steps $$x_{t + 1} \leftarrow x_t - \eta(t) \cdot \nabla f(x_t)$$ where $x_t = (u_t, v_t, b_t)$, $f$ is the function to be minimized, $\nabla f$ returns the gradient (implemented by `grad_f`), $\eta(t)$ returns the step size at iteration $t$ (implemented by `eta`) and $(u_0,v_0,b_0)$ is the starting point (initialization).

Using these functions, perform $100$ gradient descent steps, starting at $(u_0,v_0,b_0) = (4,2,1)$, and report the final function value $f(u_{100},v_{100}, b_{100})$ and the lowest (best) function value achieved throughout the $100$ steps for each of the step-size policies below.

## Definition of the gradient of $f$
For $f(u,v,b) = - \log\left(\frac{e^{u+b}}{e^{u+b}+1}\right) - \log\left(\frac{e^{v+b}}{e^{v+b}+1}\right) - \log\left(\frac{e^{- \frac{u+v+2b}{2}}}{e^{- \frac{u+v+2b}{2}}+1}\right) + \frac{u^2 + v^2 + b^2}{100}$, we can compute the gradient of $f$ at a point $(u,v,b)$ as follows:
- $\frac{\partial f}{\partial u} = - \frac{1}{e^{u+b}+1} + \frac{1}{2e^{- \frac{u+v+2b}{2}}+2} + \frac{u}{50}$
- $\frac{\partial f}{\partial v} = - \frac{1}{e^{v+b}+1} + \frac{1}{2e^{- \frac{u+v+2b}{2}}+2} + \frac{v}{50}$
- $\frac{\partial f}{\partial b} = - \frac{1}{e^{u+b}+1} - \frac{1}{e^{v+b}+1} + \frac{1}{e^{- \frac{u+v+2b}{2}}+1} + \frac{b}{50}$
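As a brief sketch of where these expressions come from (a standard calculation that the text does not spell out), note that $\sigma'(x) = \sigma(x)\,(1-\sigma(x))$, so

$$
\frac{d}{dx}\big[-\log\sigma(x)\big] = -\frac{\sigma'(x)}{\sigma(x)} = -\big(1-\sigma(x)\big) = -\frac{1}{e^{x}+1}.
$$

With $x = u+b$ this gives the first term of $\frac{\partial f}{\partial u}$. For the third term of $f$ the argument is $x = -\frac{u+v+2b}{2}$, whose derivative with respect to $u$ (or $v$) is $-\frac{1}{2}$, so the chain rule yields

$$
-\frac{1}{2}\cdot\left(-\frac{1}{e^{-\frac{u+v+2b}{2}}+1}\right) = \frac{1}{2e^{-\frac{u+v+2b}{2}}+2}.
$$

For $b$ the derivative of the argument is $-1$ instead, which removes the factor $\frac{1}{2}$. The regularizer contributes $\frac{2u}{100} = \frac{u}{50}$, and likewise for $v$ and $b$.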

$$
\nabla f(u,v,b)=\left[\begin{array}{c}
\dfrac{\partial f}{\partial u}(u,v,b)\\
\dfrac{\partial f}{\partial v}(u,v,b) \\
\dfrac{\partial f}{\partial b}(u,v,b)
\end{array}\right] = \left[\begin{array}{c}
- \dfrac{1}{e^{u+b}+1} + \dfrac{1}{2e^{- \frac{u+v+2b}{2}}+2} + \dfrac{u}{50} \\
- \dfrac{1}{e^{v+b}+1} + \dfrac{1}{2e^{- \frac{u+v+2b}{2}}+2} + \dfrac{v}{50} \\
- \dfrac{1}{e^{u+b}+1} - \dfrac{1}{e^{v+b}+1} + \dfrac{1}{e^{- \frac{u+v+2b}{2}}+1} + \dfrac{b}{50}
\end{array}\right]
$$
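A quick way to gain confidence in an analytic gradient like this one is to compare it with central finite differences. The sketch below is only a sanity check under stated assumptions: `numerical_grad` and the step `h=1e-6` are hypothetical choices, and `f` is written with the identity $-\log\sigma(x) = \log(1+e^{-x})$.

```python
import numpy as np

def f(u, v, b):
    # f as defined above, using -log(sigma(x)) = log(1 + exp(-x))
    s = (u + v + 2 * b) / 2
    return (np.log1p(np.exp(-(u + b))) + np.log1p(np.exp(-(v + b)))
            + np.log1p(np.exp(s)) + (u**2 + v**2 + b**2) / 100)

def grad_f(u, v, b):
    # analytic gradient from the formulas above
    s = (u + v + 2 * b) / 2
    shared = 1 / (2 * np.exp(-s) + 2)  # term shared by the u- and v-derivatives
    grad_u = -1 / (np.exp(u + b) + 1) + shared + u / 50
    grad_v = -1 / (np.exp(v + b) + 1) + shared + v / 50
    grad_b = (-1 / (np.exp(u + b) + 1) - 1 / (np.exp(v + b) + 1)
              + 2 * shared + b / 50)
    return np.array([grad_u, grad_v, grad_b])

def numerical_grad(func, x, h=1e-6):
    # hypothetical helper: central differences, one coordinate at a time
    x = np.asarray(x, dtype=float)
    g = np.zeros_like(x)
    for i in range(x.size):
        e = np.zeros_like(x)
        e[i] = h
        g[i] = (func(*(x + e)) - func(*(x - e))) / (2 * h)
    return g

x = np.array([4.0, 2.0, 1.0])
# largest absolute difference between analytic and numerical gradient
print(np.max(np.abs(grad_f(*x) - numerical_grad(f, x))))
```

If the printed difference is not close to zero, either the gradient formulas or the implementation of $f$ contains a sign or factor error.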

In [1]:
# implementation of function returning the gradient of f at a point (u,v,b)
import numpy as np
def grad_f(u, v, b):
    # gradient of f at (u,v,b)
    grad_u = -1/(np.exp(u+b)+1) + 1/(2*np.exp(-(u+v+2*b)/2)+2) + u/50
    grad_v = -1/(np.exp(v+b)+1) + 1/(2*np.exp(-(u+v+2*b)/2)+2) + v/50
    grad_b = -1/(np.exp(u+b)+1) - 1/(np.exp(v+b)+1) + 1/(np.exp(-(u+v+2*b)/2)+1) + b/50
    return grad_u, grad_v, grad_b

## Definition of the gradient descent algorithm
The gradient descent algorithm is defined as follows:
- Input: function $f$, its gradient `grad_f`, step-size policy `eta` (a function of the iteration $t$), starting point $(u_0,v_0,b_0)$, number of iterations `max_iter`
- Output: the final value $f(u_{T},v_{T},b_{T})$ and $f_{\text{best}} = \min_{1\leq t \leq T} f(u_t,v_t,b_t)$, where $T$ is `max_iter` (here $T = 100$)


In [2]:
# implementation of the gradient descent algorithm
def gradient_descent(f, grad_f, eta, x0, max_iter=100):
    # initialization: start at x0; the best value is tracked over x_1, ..., x_max_iter
    x = np.array(x0, dtype=float)
    f_best = np.inf
    # iterate
    for t in range(max_iter):
        # compute gradient at the current point
        grad = np.array(grad_f(*x))
        # update x using the step size for iteration t
        x = x - eta(t) * grad
        # update f_best
        f_best = min(f_best, f(*x))
    return f(*x), f_best

## Question 1a (8 points)
Use a constant step size strategy: implement a function `eta_const( t, c=0.2 )` that returns for each iteration `t` the constant `c` as step size. Using this step-size policy, what are the results? 

In [3]:
# implementation of the constant step size strategy
def eta_const(t, c=0.2):
    return c

What is the final function value after $100$ iterations? So $f(u_{100}, v_{100}, b_{100}) =$

In [4]:
# perform gradient descent with constant step size
u0, v0, b0 = 4, 2, 1
x0 = (u0, v0, b0)
# f as defined above; sigma(x) = 1/(1+exp(-x)), so the sign inside each exp matters
f = lambda u, v, b: -np.log(1/(np.exp(-(u+b))+1)) - np.log(1/(np.exp(-(v+b))+1)) - np.log(1/(np.exp((u+v+2*b)/2)+1)) + (u**2+v**2+b**2)/100

f_final, f_best = gradient_descent(f, grad_f, eta_const, x0)
f_final

What is the best function value obtained throughout the training process? So $\displaystyle\min _{1\leq t \leq 100}f(u_t,v_t,b_t) =$

In [5]:
f_best

## Question 1b (6 points)
Use a continuously decreasing step size strategy: implement a function `eta_sqrt( t, c=0.5 )` that returns for iteration $t$ the step size $\frac{c}{\sqrt{t+1}}$. Using this step size policy, what are the results?
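A minimal sketch of such a policy is shown below. It assumes that $t$ is counted from $0$, as in a loop `for t in range(max_iter)`, so the first step uses step size $c$.

```python
import math

def eta_sqrt(t, c=0.5):
    # step size c / sqrt(t + 1); t = 0 is assumed to be the first iteration
    return c / math.sqrt(t + 1)

print([eta_sqrt(t) for t in (0, 3, 99)])  # → [0.5, 0.25, 0.05]
```

The function can be passed directly as the `eta` argument, for example `gradient_descent(f, grad_f, eta_sqrt, x0)`.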


What is the final function value after $100$ iterations? So $f(u_{100},v_{100},b_{100}) =$


What is the best function value obtained throughout the training process? $\displaystyle\min _{1\leq t \leq 100} f(u_t,v_t,b_t)=$

## Question 1c (6 points)


Use a multi-step step-size strategy: implement a function `eta_multistep( t, milestones=[30,50,80], c=0.5, eta_init=1.0 )` that returns a step-size that is initially set to `eta_init`, but is decayed at each milestone by multiplying it with factor `c`. For example:

$$\texttt{eta\_multistep( t, [20,50], c=0.1, eta\_init=1 )} = \begin{cases}
1    & \text{if } t < 20 \\
0.1  & \text{if } 20 \leq t < 50 \\
0.01 & \text{if } 50 \leq t
\end{cases}$$
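One possible sketch of this policy counts how many milestones have been reached and decays the initial step size once per milestone. It assumes a milestone takes effect from iteration $t$ equal to the milestone onward, and it uses a tuple as the default for `milestones` (instead of the list in the signature above) to avoid a mutable default argument; passing a list works the same way.

```python
def eta_multistep(t, milestones=(30, 50, 80), c=0.5, eta_init=1.0):
    # number of milestones already reached at iteration t (assumes t >= m counts)
    n_passed = sum(t >= m for m in milestones)
    # decay the initial step size once per reached milestone
    return eta_init * c ** n_passed

# step sizes around the milestones of the example above
for t in (0, 19, 20, 49, 50):
    print(t, eta_multistep(t, [20, 50], c=0.1, eta_init=1))
```

Because the result is computed in floating point, values such as $0.1^2$ may print with a small rounding error rather than exactly `0.01`.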

What is the final function value after $100$ iterations? So $f(u_{100}, v_{100}, b_{100}) =$

What is the best function value obtained throughout the training process? $\displaystyle\min _{1\leq t \leq 100} f(u_t,v_t,b_t) =$