# Convex Optimization - Algorithms for Unconstrained Optimization, Gradient-Descent

> GD, Smoothness, Strict Convexity, Line Search

- hide: true
- toc: true
- badges: true
- comments: true
- categories: ['Optimization','Applied Mathematics','Proofs']

# PLAN

1. Unconstrained algorithms 
2. Oracle Access Model of order 1
3. Develop GD from two perspectives - linear and quadratic
4. Run GD on model problems x^2 and |x|
5. Develop the notion of M-smooth and m-strongly convex based on step 4
6. Analyze the performance of GD
7. Develop Accelerated GD
8. Develop Subgradient Descent

# Introduction

***Gradient descent*** is a powerful, yet incredibly simple, algorithm for unconstrained, convex optimization. We can think of it as the analog of a ***greedy algorithm*** in the setting of convex, continuous optimization. That is, it is our attempt to do the best we can locally, given limited information about the objective $f$ and limited computational power.

# The Gradient Descent Algorithm

As an iterative algorithm, it relies on ***initialization*** and an ***update step***.

Let $x$ be the initial iterate, and let the update be given by $x^+ = x + \eta d$ for some directional unit-vector $d$ and ***step-size*** parameter $\eta > 0$.

We base the algorithm on the assumption that the linear approximation of the objective, evaluated at the next iterate $x^+$, is a good-enough local approximation of its true value at $x^+$. That is:

<br>
$$f(x + \eta d) \approx f(x) + \eta \nabla f(x)^T d \tag{1.1}$$
<br>

Since we wish to minimize $f$, it would be wise to insist that the objective at $x^+$ improves or, at the very least, does not worsen. This is the locally optimal choice that presents itself to us. We insist: 

<br>
$$f(x + \eta d) \approx f(x) + \eta \nabla f(x)^T d \leq f(x) \tag{1.2}$$
<br>

Since $f(x)$ is fixed and $\eta > 0$, this amounts to minimizing the inner product $\nabla f(x)^Td$ over unit vectors $d$. By the Cauchy-Schwarz inequality, the minimum is attained when $d$ points directly opposite to the gradient, i.e. $d = - \frac{\nabla f(x)}{||\nabla f(x)||_2}$ (assuming $\nabla f(x) \neq 0$).
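As a quick numerical sanity check of this choice, the sketch below compares the normalized negative gradient against many random unit directions. The gradient vector `g` is a made-up example, not something taken from a particular objective.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical gradient at the current iterate (its Euclidean norm is 13).
g = np.array([3.0, -4.0, 12.0])

# Direction chosen in the text: the unit vector opposite to the gradient.
d_star = -g / np.linalg.norm(g)

# Many random unit vectors to compare against.
candidates = rng.standard_normal((1000, 3))
candidates /= np.linalg.norm(candidates, axis=1, keepdims=True)

# Inner product with the gradient for the chosen and the random directions.
best_random = (candidates @ g).min()
chosen = g @ d_star  # equals -||g||_2 up to rounding

print("chosen direction:", chosen)
print("best random direction:", best_random)
```

No random unit direction should give a smaller inner product than the chosen one, which is what the Cauchy-Schwarz argument predicts.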

The update step becomes:

<br>
$$x^+ = x - \eta \frac{\nabla f(x)}{||\nabla f(x)||_2}$$
<br>

Relabeling $\eta$ so that it absorbs the normalization constant $1/||\nabla f(x)||_2$, we obtain the final gradient descent update step:

<br>
$$x^+ = x - \eta \nabla f(x) \tag{1.3}$$
<br>

This makes intuitive sense because the negative gradient direction is the direction of steepest decrease. So, it's only natural that the update should take us in this most enticing direction given that we've specified our goal is to minimize the objective.
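The update (1.3) can be sketched in a few lines of Python. This is a minimal illustration with a fixed step-size; the function name `gradient_descent`, its parameters, and the example objective $f(x) = \frac{1}{2}||x||_2^2$ are assumptions made for the example.

```python
import numpy as np

def gradient_descent(grad, x0, eta, num_steps):
    """Repeat x+ = x - eta * grad(x) and return the list of all iterates."""
    x = np.asarray(x0, dtype=float)
    iterates = [x.copy()]
    for _ in range(num_steps):
        x = x - eta * grad(x)  # update step (1.3)
        iterates.append(x.copy())
    return iterates

# Hypothetical objective f(x) = 0.5 * ||x||^2, whose gradient is x itself.
iterates = gradient_descent(lambda x: x, x0=[4.0, -2.0], eta=0.5, num_steps=20)
print(iterates[-1])
```

With this objective and $\eta = 0.5$, each step halves the iterate, so the sequence shrinks towards the minimizer at the origin.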

# Important Questions

Given the ease with which we came up with the algorithm, we should ask ourselves the following questions. 

1. Is gradient descent guaranteed to converge for all step-sizes? 
2. How should we choose a step-size that guarantees convergence? 
3. What's the rate of convergence of gradient descent? Does the rate depend on step-size? Does it depend on properties of the objective function?
4. How should we choose a step-size that maximizes convergence rate?

We will explore each of these questions shortly but, before doing that, it's worth taking a high-level look at convex optimization itself. Perhaps the most important question to ask is this: does gradient descent's convergence rate, for an optimally chosen step-size, give a taxonomy of easier-to-harder problems within the scope of convex optimization? The answer, as it turns out, is *yes*.

Let's start by running gradient descent on two quintessential convex problems in $\mathbb{R}$, $f(x) = x^2$ and $h(x) = |x|$.
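A small experiment along these lines might look as follows. It is only a sketch: the starting point, the fixed step-size $\eta = 0.3$, and the number of steps are arbitrary choices, and since $|x|$ is not differentiable at $0$ the code uses $\text{sign}(x)$ in place of the gradient, with the value $0$ at $x = 0$.

```python
def run_gd(grad, x0, eta, num_steps):
    """Fixed step-size gradient descent in one dimension; returns all iterates."""
    x = x0
    history = [x]
    for _ in range(num_steps):
        x = x - eta * grad(x)
        history.append(x)
    return history

def grad_f(x):
    # f(x) = x^2 has gradient 2x.
    return 2.0 * x

def subgrad_h(x):
    # h(x) = |x|: sign(x) away from 0; at 0 we pick 0 (an assumption).
    return float((x > 0) - (x < 0))

hist_f = run_gd(grad_f, x0=1.0, eta=0.3, num_steps=20)
hist_h = run_gd(subgrad_h, x0=1.0, eta=0.3, num_steps=20)

print("x^2 :", hist_f[-3:])
print("|x| :", hist_h[-3:])
```

On $x^2$ every step multiplies the iterate by $1 - 2\eta = 0.4$, so the iterates contract geometrically. On $|x|$ the step length is always $\eta$ regardless of how close we are to the minimizer, so with this fixed step-size the iterates end up hopping back and forth across $0$ instead of settling.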



If $f$ is $M$-smooth, i.e. its gradient is Lipschitz continuous with constant $M$, then at every point $x$ we have the following quadratic upper bound:

<br>
$$f(y) \leq f(x) + \nabla f(x)^T(y-x) + \frac{M}{2} ||y-x||_2^2 \ \ \forall y$$
<br>

How does this help us choose a better step-size?

Plug the step $x^+ = x - \eta \nabla f(x)$ into the upper bound:

<br>
$$
\begin{aligned}
f(x^+) &\leq f(x) + \nabla f(x)^T(-\eta \nabla f(x)) + \frac{M}{2} ||\eta \nabla f(x)||_2^2 \\
\iff f(x^+) &\leq f(x) - \eta || \nabla f(x) ||_2^2 + \frac{M \eta^2}{2} ||\nabla f(x)||_2^2
\end{aligned}
$$
<br>

The right-hand side is a quadratic function of the step-size $\eta$, so this upper bound on $f(x^+)$ can be minimized with respect to $\eta$. Essentially, we trust that the upper bound is a good quadratic approximation of $f$ and choose its minimizer as the next iterate. This mirrors Newton's method (NM), where we trust a quadratic estimate and set its minimizer as the next iterate. But here, we require no second-order information about the function.

There is also a view that does not appeal to NM. We have a fixed $x$ and the quadratic upper bound at that $x$. The objective at any next iterate is no larger than the quadratic upper bound evaluated *at* that iterate. So we choose the step-size that minimizes this upper bound. This guarantees that $f(x^+)$ is at most the minimum of the upper bound, which is the tightest guarantee we can get on $f(x^+)$ with the information at hand. So we, once again, do the locally optimal thing.

The best choice turns out to be $\eta = \frac{1}{M}$: the upper bound depends on $\eta$ through the factor $-\eta + \frac{M \eta^2}{2}$, and setting its derivative to zero gives $-1 + M\eta = 0$. With this step-size the guarantee becomes $f(x^+) \leq f(x) - \frac{1}{2M} ||\nabla f(x)||_2^2$.
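To see the step-size $\eta = \frac{1}{M}$ in action, here is a sketch on a randomly generated convex quadratic $f(x) = \frac{1}{2} x^T A x$, for which $M$ is the largest eigenvalue of $A$. The matrix, the dimension, and the iteration count are arbitrary assumptions. At each step the code checks that the objective decreases by at least $\frac{1}{2M} ||\nabla f(x)||_2^2$.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical convex quadratic f(x) = 0.5 * x^T A x with A symmetric PSD.
B = rng.standard_normal((5, 5))
A = B.T @ B
M = np.linalg.eigvalsh(A).max()  # smoothness constant of this f

def f(x):
    return 0.5 * x @ A @ x

def grad(x):
    return A @ x

x = rng.standard_normal(5)
eta = 1.0 / M
values = [f(x)]
holds = []
for _ in range(50):
    g = grad(x)
    x_next = x - eta * g
    # Guaranteed value from the quadratic upper bound with eta = 1/M.
    guaranteed = f(x) - (g @ g) / (2.0 * M)
    holds.append(f(x_next) <= guaranteed + 1e-9)
    x = x_next
    values.append(f(x))

print("guarantee held at every step:", all(holds))
print("first and last objective values:", values[0], values[-1])
```

Because the guaranteed decrease is never negative, the objective values form a non-increasing sequence for this choice of step-size.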


For functions that are $M$-smooth, the same constant $M$ works at every point. If no such $M$ exists, however, we are out of luck: we don't have this upper bound. It is then that we use either Newton's method or a quasi-Newton method.
