# Convex Optimization - Algorithms for Unconstrained Optimization, Gradient-Descent

> GD, Smoothness, Strict Convexity, Line Search

- hide: true
- toc: true
- badges: true
- comments: true
- categories: ['Optimization','Applied Mathematics','Proofs']

# PLAN

1. Unconstrained algorithms 
2. Oracle Access Model of order 1
3. Develop GD from two perspectives - linear and quadratic
4. Run GD on model problems x^2 and |x|
5. Develop the notion of M-smooth and m-strongly convex based on step 4
6. Analyze the performance of GD

7. Develop Subgradient Descent
8. Develop Accelerated GD

# Introduction

***Gradient descent (GD)*** is a powerful, yet incredibly simple, algorithm for unconstrained, convex optimization. We can think of it as the analog of a ***greedy algorithm*** in the setting of convex, continuous optimization. That is, it is our attempt to do best locally given only limited information about the objective $f$, and limited computational power.

# The Gradient Descent Algorithm

Like all iterative algorithms, gradient descent relies on an ***initialization*** and an ***update step***.

Let $x$ be the initial iterate, and let the update be given by $x^+ = x + \eta d$ for some directional unit-vector $d$ and ***step-size*** parameter $\eta > 0$.

We base the algorithm on the assumption that the linear approximation of the objective around the current iterate $x$, evaluated at the next iterate $x^+$, is a good-enough local approximation of its true value at $x^+$.

That is:

<br>
$$f(x^+) = f(x + \eta d) \approx f(x) + \eta \nabla f(x)^T d \ \ \forall d \tag{1.1}$$
<br>

Immediately, a locally optimal choice presents itself to us. Since we wish to minimize $f$, it would be wise to insist that the objective at $x^+$ improves or, at the very least, does not worsen. 

That is, we insist: 

<br>
$$f(x^+) \approx f(x) + \eta \nabla f(x)^T d \leq f(x) \tag{1.2}$$
<br>

In fact, since we are greedy in our approach, it's in our interest to get $f(x^+)$ to be as small as possible. Since $f(x)$ is fixed and $\eta > 0$, this amounts to minimizing the inner-product $\nabla f(x)^Td$ over unit vectors $d$. By the Cauchy-Schwarz inequality, the minimizer points directly opposite to the gradient, i.e. $d = - \frac{\nabla f(x)}{||\nabla f(x)||_2}$ (assuming $\nabla f(x) \neq 0$).

The update step becomes:

<br>
$$x^+ = x - \eta \frac{\nabla f(x)}{||\nabla f(x)||_2}$$
<br>

By re-labeling, we let $\eta$ absorb the normalization factor $\frac{1}{||\nabla f(x)||_2}$, obtaining the final gradient descent update step:

<br>
$$x^+ = x - \eta \nabla f(x) \tag{1.3}$$
<br>

This makes intuitive sense because the negative gradient direction is the direction in which the objective decreases most. So, it's only natural that the update should take us in this most enticing direction.
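As a minimal sketch of update (1.3), the loop below runs fixed step-size GD given a gradient function. The function name `gradient_descent`, its parameters, and the test objective $f(x) = \frac{1}{2}||x||_2^2$ are illustrative assumptions, not part of the derivation above.

```python
import numpy as np

def gradient_descent(grad, x0, eta, num_steps):
    """Run fixed step-size GD, x+ = x - eta * grad(x), and return all iterates."""
    x = np.asarray(x0, dtype=float)
    iterates = [x.copy()]
    for _ in range(num_steps):
        x = x - eta * grad(x)  # update step (1.3)
        iterates.append(x.copy())
    return np.array(iterates)

# Hypothetical test objective: f(x) = 0.5 * ||x||^2, whose gradient is x itself.
iterates = gradient_descent(grad=lambda x: x, x0=[4.0, -2.0], eta=0.5, num_steps=10)
print(iterates[-1])
```

With this objective and $\eta = 0.5$, each step halves the iterate, so the final iterate is the initial point scaled by $2^{-10}$.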

It's worth exploring another avenue leading to the same GD update rule. Instead of defining the step $x^+ = x + \eta d$ and then determining the direction $d$, we can determine the step and the direction in one fell swoop.

Starting from the linear approximation:

<br>
$$
f(y) \approx f(x) + \nabla f(x)^T(y - x) \ \ \forall y \tag{2.1}
$$
<br>

We can insist, in a greedy fashion, that the next iterate $x^+$ be the minimizer of the linear approximation. That is, we insist:

<br>
$$
x^+ = \arg \min_y f(x) + \nabla f(x)^T(y - x)
$$
<br>

But since the linear approximation is unbounded below whenever $\nabla f(x) \neq 0$, this problem has no finite minimizer. So, we introduce a parametrized penalty term that prevents $x^+$ from venturing too far from the current iterate $x$. That is, we augment the linear approximation and set:

<br>
$$
x^+ = \arg \min_y f(x) + \nabla f(x)^T(y - x) + \eta ||y - x||_2^2
$$
<br>

Now, since the RHS is a simple quadratic in $y$ with $\eta > 0$, it has a unique minimizer which can be found by using the ***unconstrained optimality condition***. This just means taking the gradient of the RHS w.r.t. the optimization variable $y$, setting it to zero, and solving for $y$.

<br>
$$
\begin{aligned}
\nabla f(x) &+ 2 \eta (y - x) = 0 \\ \\
\iff 2 \eta y &= 2 \eta x - \nabla f(x) \\
\iff y &= x - \frac{1}{2 \eta} \nabla f(x)
\end{aligned}
$$
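<br>

So $x^+ = x - \frac{1}{2 \eta} \nabla f(x)$, which is exactly the update (1.3) with step-size $\frac{1}{2\eta}$. Note that here $\eta$ is the penalty parameter: a larger penalty keeps $x^+$ closer to $x$, i.e. it yields a smaller step.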
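The closed-form minimizer $y = x - \frac{1}{2\eta}\nabla f(x)$ can be sanity-checked numerically. The sketch below uses arbitrary stand-in values for $x$, $f(x)$ and $\nabla f(x)$ (all assumptions for illustration) and confirms that random perturbations of the closed-form point never achieve a smaller value of the penalized linear model.

```python
import numpy as np

def penalized_model(y, x, fx, gx, eta):
    """Linear approximation of f at x plus the penalty eta * ||y - x||_2^2."""
    return fx + gx @ (y - x) + eta * np.sum((y - x) ** 2)

rng = np.random.default_rng(0)
x = np.array([1.0, -2.0])
fx = 3.0                    # arbitrary stand-in for f(x); it does not affect the minimizer
gx = np.array([0.5, 4.0])   # arbitrary stand-in for the gradient of f at x
eta = 2.0                   # penalty parameter

y_star = x - gx / (2.0 * eta)  # closed-form minimizer from the optimality condition
best = penalized_model(y_star, x, fx, gx, eta)

# Evaluate the model at 1000 random perturbations of y_star.
perturbed = [
    penalized_model(y_star + rng.normal(size=2), x, fx, gx, eta)
    for _ in range(1000)
]
print(best, min(perturbed))
```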

# Oracle Access Model

With the foresight 

# Important Questions

Given the ease with which we came up with the algorithm, we should ask ourselves the following questions. 

1. Is GD guaranteed to converge for all step-sizes? 
2. How should we choose a step-size that guarantees convergence? 
3. What's the rate of convergence of GD? Does the rate depend on step-size? Does it depend on properties of the objective function?
4. How should we choose a step-size that maximizes convergence rate?

We will explore each of these questions, and more, shortly. However, before doing so, it's worth taking a bird's eye look at the problem of convex optimization itself. Perhaps the most important question to ask is this: does gradient descent's convergence rate, for an optimally chosen step-size, give a taxonomy of easier-to-harder problems within the scope of convex optimization? The answer, as it turns out, is *yes*.

Let's start by running fixed step-size GD on two quintessential convex problems in $\mathbb{R}$, $f(x) = x^2$ and $h(x) = |x|$.

## Simple Examples of Gradient Descent

First, let's run the algorithm on $h(x) = |x|$ for $x \in \mathbb{R}$. 

Since $|x|$ is non-differentiable at $x = 0$, the gradient has a discontinuity at $x = 0$. Non-differentiability, such as this, will eventually lead us to introduce the notion of ***sub-gradients***, but for now we can get away with using the discontinuous gradient:

$$
h'(x) = 
\begin{cases} 
\begin{aligned} 
-1 \ &\textrm{if $x < 0$} \\ 
1 \ &\textrm{if $x > 0$} 
\end{aligned}
\end{cases}
$$

Then, for a fixed $\eta > 0$, the update step is:

<br>
$$x^+ = x \pm \eta$$
<br>

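where the sign in front of $\eta$ depends on where the previous iterate, $x$, falls inside the domain $(-\infty, 0) \cup (0, \infty)$: it is $+$ when $x < 0$ and $-$ when $x > 0$, so each step moves towards $0$ by exactly $\eta$.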

Now, switching our attention to $f(x) = x^2$, we compute its GD update as follows. We compute the gradient as $f'(x) = 2x$ which leads to the fixed step-size update:

<br>
$$x^+ = x - 2\eta x$$
<br>

Note that $x^* = 0$ is the unique optimal solution to both $f(x)$ and $h(x)$. With this in mind, there are two key observations to make. 

The first is that, for GD on $f(x) = x^2$, the update $2\eta x$ is large in magnitude when $x$ is far away from $x^* = 0$. So, if the iterate is far from the optimal solution, GD makes fast progress towards it. The second observation is that, as $x \rightarrow x^*$, the update becomes small in magnitude. So, as the iterate comes close to the optimal solution, GD takes smaller and smaller steps which converge to $0$ in a summable way. Since $x^+ = (1 - 2\eta)x$, the iterates contract geometrically whenever $|1 - 2\eta| < 1$. This means we can make the sub-optimality $|f(x) - f(x^*)|$ smaller than any $\epsilon > 0$ for any fixed step-size $0 < \eta < 1$; for $\eta \geq 1$ the iterates oscillate or diverge.
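A small sketch of fixed step-size GD on $f(x) = x^2$ makes the shrinking steps visible. The helper name `gd_on_square` and the chosen starting point and step-sizes are illustrative assumptions. The second run uses $\eta = 1.5$, for which the factor $1 - 2\eta = -2$ makes the iterates grow instead of shrink.

```python
def gd_on_square(x0, eta, num_steps):
    """Fixed step-size GD on f(x) = x^2: x+ = x - 2*eta*x = (1 - 2*eta) * x."""
    xs = [float(x0)]
    for _ in range(num_steps):
        xs.append(xs[-1] - 2.0 * eta * xs[-1])
    return xs

# eta = 0.25 gives the contraction factor 1 - 2*eta = 0.5.
xs = gd_on_square(x0=10.0, eta=0.25, num_steps=20)
steps = [abs(b - a) for a, b in zip(xs, xs[1:])]  # step lengths |x+ - x|
print(xs[-1], steps[0], steps[-1])

# eta = 1.5 gives the factor -2, so the iterates blow up.
diverging = gd_on_square(x0=1.0, eta=1.5, num_steps=5)
print(diverging)
```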

Neither of these observations holds for GD on $h(x) = |x|$ since the step length $\eta$ is fixed regardless of the Euclidean distance between $x$ and $x^* = 0$. In particular, this means GD is not fast for $x$ far away from $x^*$ and does not slow down as $x$ nears $x^*$. Arbitrary accuracy is, also, impossible in this setting for a fixed step-size $\eta$. Unless an iterate lands exactly on $0$, the iterates eventually cycle between $x^T$ and $x^T - \eta \, \textrm{sign}(x^T)$, where $x^T$ is the first iterate in the open $\eta$-neighborhood of $x^* = 0$, so the sub-optimality also cycles between two values and depends on the choice of $\eta$.
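The cycling behaviour is easy to reproduce. In the sketch below, the helper name `gd_on_abs` and the values $x_0 = 1$, $\eta = 0.375$ are assumptions chosen so that every iterate is exactly representable in floating point; the loop stops early if an iterate lands exactly on $0$, where the derivative is undefined.

```python
def gd_on_abs(x0, eta, num_steps):
    """Fixed step-size GD on h(x) = |x|, using h'(x) = sign(x) for x != 0."""
    xs = [float(x0)]
    for _ in range(num_steps):
        x = xs[-1]
        if x == 0.0:
            break  # derivative undefined at 0; we are also already optimal
        xs.append(x - eta if x > 0 else x + eta)
    return xs

xs = gd_on_abs(x0=1.0, eta=0.375, num_steps=12)
print(xs)
```

Once an iterate enters $(-\eta, \eta)$, the sequence alternates between two points on opposite sides of $0$ and the sub-optimality never drops below the smaller of their magnitudes.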

We say GD on $f(x) = x^2$ enjoys the ***self-tuning property***, whereas GD on $h(x) = |x|$ does not. This speaks to the fact that the self-tuning is a property of the objective functions, rather than GD itself. In a sense, functions *like* $x^2$ will all have self-tuning while functions *like* $|x|$ do not and this is what introduces a taxonomy of easier-to-harder convex optimization problems. What it means, precisely, to be *like* $x^2$ or $|x|$ will be made rigorous in the next few sections.

If $f$ is $M$-smooth, i.e. its gradient is Lipschitz continuous with constant $M$, then at every point $x$ we have the following quadratic upper bound:

<br>
$$f(y) \leq f(x) + \nabla f(x)^T(y-x) + \frac{M}{2} ||y-x||_2^2 \ \ \forall y$$
<br>
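To make the bound concrete, the sketch below checks it numerically for one hypothetical example that is not part of the text above: $f(x) = \log(1 + e^x)$ on $\mathbb{R}$, whose second derivative is at most $\frac{1}{4}$, so $M = \frac{1}{4}$ is a valid smoothness constant.

```python
import numpy as np

def f(x):
    return np.logaddexp(0.0, x)  # log(1 + e^x), computed stably

def grad_f(x):
    return 1.0 / (1.0 + np.exp(-x))  # derivative of f (the sigmoid)

M = 0.25  # f''(x) = sigmoid(x) * (1 - sigmoid(x)) <= 1/4

rng = np.random.default_rng(0)
x = rng.uniform(-5.0, 5.0, size=10_000)
y = rng.uniform(-5.0, 5.0, size=10_000)

# Quadratic upper bound built at x and evaluated at y.
upper = f(x) + grad_f(x) * (y - x) + 0.5 * M * (y - x) ** 2
gap = upper - f(y)  # should be non-negative for every sampled pair
print(gap.min())
```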

How does this help us choose step size in a better way?

Plug the step $x^+ = x - \eta \nabla f(x)$ into the upper bound:

<br>
$$
\begin{aligned}
f(x^+) &\leq f(x) + \nabla f(x)^T(-\eta \nabla f(x)) + \frac{M}{2} ||-\eta \nabla f(x)||_2^2 \\
&= f(x) - \eta || \nabla f(x) ||_2^2 + \frac{M \eta^2}{2} ||\nabla f(x)||_2^2
\end{aligned}
$$
<br>

But the RHS is a quadratic function in $\eta$, the step-size. So, the upper bound on the next objective value $f(x^+)$ can be minimized w.r.t. $\eta$. Essentially, we trust the upper bound as a quadratic model of $f$ and choose its minimizer as the next iterate, just as Newton's method (NM) trusts its quadratic estimate and sets its minimizer as the next iterate. But here, we require no second-order information about the function beyond the constant $M$.

The non-NM view is this. We have a fixed $x$ and the quadratic upper bound (UB) at that $x$. The objective value at any next iterate is at most the quadratic UB evaluated *at* that iterate. So we choose the step-size that minimizes this UB. This guarantees $f(x^+) \leq$ min of UB, which is the tightest guarantee we can get on $f(x^+)$ with the information at hand. So we, once again, do the locally optimal thing and choose that.

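The best choice turns out to be $\eta = \frac{1}{M}$. To see this, minimize the bound $f(x) - \eta ||\nabla f(x)||_2^2 + \frac{M \eta^2}{2} ||\nabla f(x)||_2^2$ over $\eta$ by setting its derivative w.r.t. $\eta$ to zero:

<br>
$$-||\nabla f(x)||_2^2 + M \eta ||\nabla f(x)||_2^2 = 0 \iff \eta = \frac{1}{M}$$
<br>

Substituting back gives the guaranteed decrease $f(x^+) \leq f(x) - \frac{1}{2M} ||\nabla f(x)||_2^2$.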
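As a hedged numerical illustration, the sketch below uses a hypothetical quadratic objective $f(x) = \frac{1}{2} x^T A x$ with a symmetric positive definite matrix $A$, for which the largest eigenvalue of $A$ is a valid $M$. It scans step-sizes on a grid, locates the minimizer of the bound $f(x) - \eta ||\nabla f(x)||_2^2 + \frac{M\eta^2}{2}||\nabla f(x)||_2^2$, and compares $f(x^+)$ for $\eta = \frac{1}{M}$ against the guarantee $f(x) - \frac{1}{2M}||\nabla f(x)||_2^2$. The matrix, starting point and grid are assumptions chosen for illustration.

```python
import numpy as np

A = np.array([[3.0, 1.0], [1.0, 2.0]])  # symmetric positive definite (assumed example)
M = np.linalg.eigvalsh(A).max()         # smoothness constant of f for this quadratic

def f(x):
    return 0.5 * x @ A @ x

def grad_f(x):
    return A @ x

x = np.array([2.0, -1.0])
g = grad_f(x)

# Evaluate the upper bound on a grid of step-sizes in [0, 2/M].
etas = np.linspace(0.0, 2.0 / M, 201)
bound = f(x) - etas * (g @ g) + 0.5 * M * etas**2 * (g @ g)
best_eta = etas[np.argmin(bound)]

x_plus = x - g / M  # GD step with eta = 1/M
print(best_eta, 1.0 / M)
print(f(x_plus), f(x) - (g @ g) / (2.0 * M))
```

The first printed pair compares the grid minimizer with $\frac{1}{M}$; the second compares the actual value $f(x^+)$ with the guaranteed upper bound on it.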


For functions that are $M$-smooth, the same $M$ works at every point. If no such $M$ exists, we are out of luck: we don't have this upper bound. It is then that we use either NM or quasi-NM.
