# Convex Optimization - Algorithms for Unconstrained Optimization, Gradient-Descent

> GD, Smoothness, Strict Convexity, Line Search

- hide: true
- toc: true
- badges: true
- comments: true
- categories: ['Optimization','Applied Mathematics','Proofs']

# PLAN

1. Unconstrained algorithms 
~~2. Oracle Access Model of order 1~~
~~3. Develop GD from two perspectives - linear and quadratic~~
4. Run GD on model problems x^2 and |x|
5. Develop the notion of M-smooth and m-strongly convex based on step 4
6. Analyze the performance of GD

8. Develop Accelerated GD
7. Develop Subgradient Descent

# Introduction

***Gradient descent (GD)*** is a powerful, yet incredibly simple, optimization algorithm. We can think of it as a ***greedy algorithm*** in the setting of continuous optimization. That is, it is our best, local attempt at optimization given only limited information about the objective $f(x)$, and having limited computational power. For now, we focus on the simpler case of unconstrained optimization in order to develop the key algorithmic ideas. Later on, perhaps in a different post, we will explore modifications to gradient descent that make it suitable for constrained optimization.

# The Gradient Descent Algorithm

Like all iterative algorithms, gradient descent relies on an ***initialization*** and an ***update step***. In this section, we explore two ideas that play a key role in developing the GD algorithm.

## Idea 1 - Greedy Choice of Direction

Let $x$ be the initial iterate, and let the update be given by:
<br>
$$x^+ = x + \eta d$$ 
<br>

for some directional unit-vector $d$ and ***step-size*** parameter $\eta > 0$.

We base the algorithm on the assumption that the linear approximation of the objective at the next iterate $x^+$ is a good-enough estimate of its true value at $x^+$.

That is:

<br>
$$f(x^+) = f(x + \eta d) \approx f(x) + \eta \nabla f(x)^T d \ \ \forall d \tag{1.1}$$
<br>

Immediately, a locally optimal choice presents itself to us. Since we wish to minimize $f(x)$, it would be wise to insist that the objective at $x^+$ improves or, at least, does not worsen. 

That is, we insist: 

<br>
$$f(x^+) \approx f(x) + \eta \nabla f(x)^T d \leq f(x) \tag{1.2}$$
<br>

And, since we are greedy in our approach, we wish to make $f(x^+)$ as small as possible. Since, on the RHS, $f(x)$ is fixed and $\eta > 0$, this amounts to minimizing the inner-product $\nabla f(x)^Td$ over unit vectors $d$. By the Cauchy-Schwarz inequality, the minimum is attained when $d$ points in the direction opposite to the gradient, i.e. $d = - \frac{\nabla f(x)}{||\nabla f(x)||_2}$ (assuming $\nabla f(x) \neq 0$).

The update step becomes:

<br>
$$x^+ = x - \eta \frac{\nabla f(x)}{||\nabla f(x)||_2}$$
<br>

By re-labeling, $\eta$ can absorb the normalization constant. This obtains the gradient descent update step as it's often introduced in the textbooks: 

<br>
$$x^+ = x - \eta \nabla f(x) \tag{1.3}$$
<br>

This makes intuitive sense because the negative gradient direction is the direction in which the objective decreases most. So, it's only natural that the update should take us in this most enticing direction.
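As a quick numerical sanity check of the greedy choice, the sketch below samples random unit directions in the plane and confirms that none of them gives a smaller inner product with the gradient than the normalized negative gradient. The objective $f(x) = x_1^2 + 3x_2^2$, the evaluation point, and the helper names are illustrative assumptions, not part of the derivation above.

```python
import math
import random

def grad_f(x):
    # Gradient of the illustrative objective f(x) = x1^2 + 3*x2^2 (an assumption).
    return [2.0 * x[0], 6.0 * x[1]]

def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

def norm(u):
    return math.sqrt(dot(u, u))

def steepest_descent_direction(g):
    # d = -g / ||g||_2, valid only when the gradient is non-zero.
    n = norm(g)
    return [-gi / n for gi in g]

random.seed(0)
x = [1.0, -2.0]
g = grad_f(x)
d_star = steepest_descent_direction(g)
best_value = dot(g, d_star)  # equals -||g||_2

# Compare against random unit directions d = (cos t, sin t).
sampled_values = []
for _ in range(1000):
    t = random.uniform(0.0, 2.0 * math.pi)
    sampled_values.append(dot(g, [math.cos(t), math.sin(t)]))

print(best_value, min(sampled_values))
```

No sampled direction should beat `best_value`, which matches the Cauchy-Schwarz bound $\nabla f(x)^T d \geq -||\nabla f(x)||_2$ for unit vectors $d$.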

## Idea 2 - Greedy Choice of Next Iterate

Instead of defining the update step $x^+ = x + \eta d$ and then choosing the locally optimal direction $d$ greedily, we can choose the update step and the direction, both, in one fell swoop.

Starting from the linear approximation:

<br>
$$
f(y) \approx f(x) + \nabla f(x)^T(y - x) \ \ \forall y \tag{2.1}
$$
<br>

We can now insist, in a greedy fashion, that the next iterate $x^+$ be the minimizer of the linear approximation. That is, we insist:

<br>
$$
x^+ = \arg \min_y f(x) + \nabla f(x)^T(y - x) \tag{2.2}
$$
<br>

But since the linear approximation is unbounded below whenever $\nabla f(x) \neq 0$, the minimization has no finite solution: $y$ can run off to infinity along $-\nabla f(x)$. To avoid this problem, we introduce a penalty term, weighted by a parameter $\eta > 0$, that prevents $x^+$ from venturing too far from the current iterate $x$. That is:

<br>
$$
x^+ = \arg \min_y f(x) + \nabla f(x)^T(y - x) + \eta ||y - x||_2^2 \tag{2.3}
$$
<br>

Now, since the objective on the RHS is a simple, strictly convex quadratic in $y$, it has a unique minimizer which can be found by using the ***unconstrained optimality condition***. This just means taking the gradient of the objective w.r.t. the optimization variable $y$, setting it to zero, and then solving for the unique root. This obtains: 

<br>
$$x^+ = x - \frac{1}{2 \eta} \nabla f(x)$$
<br>

By re-labeling, we, once again, get the canonical form of the GD update step $(1.3)$.
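For completeness, the optimality condition can be worked out explicitly. Writing $q(y)$ for the objective in $(2.3)$ (a label introduced here only for convenience):

<br>
$$
\begin{aligned}
q(y) &= f(x) + \nabla f(x)^T(y - x) + \eta ||y - x||_2^2 \\
\nabla q(y) &= \nabla f(x) + 2\eta (y - x) = 0 \\
\implies y &= x - \frac{1}{2\eta} \nabla f(x)
\end{aligned}
$$
<br>

Note that the penalty weight in $(2.3)$ and the resulting step length are inversely related: a heavier penalty means a shorter step. The re-labeling replaces $\frac{1}{2\eta}$ with a new step-size parameter.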

# Important Questions in Analysis

Given the ease with which we came up with the algorithm, we should ask ourselves the following questions:

1. Is GD sensitive to initialization?
2. Is GD guaranteed to converge for all step-sizes?
3. How should we choose a step-size that guarantees convergence? 
4. What's the rate of convergence of GD? Does the rate depend on step-size? Does it depend on properties of the objective function?
5. How should we choose a step-size that maximizes convergence rate?

We will shortly explore each of these questions and more. However, before doing so, it's worth taking a bird's eye look at the problem of convex optimization itself. 

Perhaps the most important question to ask ourselves is this: does gradient descent's convergence rate, for an optimally chosen step-size, give a taxonomy of easier-to-harder problems within the scope of convex optimization? The answer, as it turns out, is *yes*.

## Initialization

From this point on, we will limit our discussion to convex objectives in order to eliminate the possibility of strictly ***local optimizers*** and ***inflection points***, both of which GD, by construction, can get stuck at given a badly chosen initial point. This ensures the only ***stationary points***, points at which $\nabla f(x) = 0$ and the GD update makes no further progress, are global minimizers. On convex functions, GD, as we will soon discover, has a convergence guarantee that holds independently of initialization, provided the step-size is chosen suitably.

## Fixed Step-Size GD

To kickstart our analysis of GD, we consider the fixed step-size algorithm first. Let's take two quintessential convex problems in $\mathbb{R}$, $f(x) = x^2$ and $h(x) = |x|$, and analyze GD's performance on these objectives.

### Simple Analysis of Fixed Step-Size GD

First, let's run the algorithm on $h(x) = |x|$ for $x \in \mathbb{R}$. 

Since $|x|$ is non-differentiable at $x = 0$, the gradient has a discontinuity at $x = 0$. Non-differentiability, such as this, will eventually lead us to introduce the notion of ***sub-gradients***, but for now we can get away with using the discontinuous gradient:

$$
h'(x) = 
\begin{cases} 
\begin{aligned} 
-1 \ &\textrm{if $x < 0$} \\ 
1 \ &\textrm{if $x > 0$} 
\end{aligned}
\end{cases}
$$

Then, for a fixed $\eta > 0$, the update step is:

<br>
$$x^+ = x \pm \eta$$
<br>

where the sign depends on where the previous iterate, $x$, falls inside the domain $(-\infty, 0) \cup (0, \infty)$: we add $\eta$ when $x < 0$ and subtract it when $x > 0$.

Now, switching our attention to $f(x) = x^2$, we compute its GD update as follows. We compute the gradient as $f'(x) = 2x$ which leads to the fixed step-size update:

<br>
$$x^+ = x - 2\eta x$$
<br>

Note that $x^* = 0$ is the unique optimizer of both $f(x)$ and $h(x)$. With this in mind, there are two key observations to make. 

The first is that, for $x$ far away from $x^* = 0$, the update, $2\eta x$, is large in magnitude. So, if the iterate is far from the optimizer, GD makes fast progress towards it. 

The second observation is that, as $x \rightarrow x^*$, the update becomes small in magnitude. So, as the iterate comes close to the optimizer, GD takes smaller and smaller steps, whose lengths shrink geometrically and are therefore summable. This means we can make the sub-optimality $|f(x) - f(x^*)|$ arbitrarily small with a fixed step-size, provided $0 < \eta < 1$ so that the factor $1 - 2\eta$ in $x^+ = (1 - 2\eta)x$ has magnitude less than one.
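The contraction behind these observations is easy to see numerically. Since $x^+ = (1 - 2\eta)x$, each step multiplies the iterate by $1 - 2\eta$. The sketch below runs the update for two illustrative step-sizes; the values of `eta`, the starting point, and the function name `gd_quadratic` are assumptions chosen for the demonstration.

```python
def gd_quadratic(x0, eta, steps):
    # Fixed step-size GD on f(x) = x^2, whose gradient is f'(x) = 2x.
    xs = [x0]
    for _ in range(steps):
        x = xs[-1]
        xs.append(x - eta * 2.0 * x)
    return xs

small = gd_quadratic(x0=5.0, eta=0.1, steps=50)  # factor 1 - 2*eta = 0.8
large = gd_quadratic(x0=5.0, eta=1.5, steps=50)  # factor 1 - 2*eta = -2.0

# Step lengths |x_{t+1} - x_t| shrink geometrically when |1 - 2*eta| < 1.
step_lengths = [abs(b - a) for a, b in zip(small, small[1:])]

print("eta=0.1: last iterate", small[-1], "sub-optimality", small[-1] ** 2)
print("eta=1.5: magnitude of last iterate", abs(large[-1]))
```

With `eta = 0.1` the iterates shrink by a factor of $0.8$ per step, while with `eta = 1.5` they double in magnitude at every step. So the self-tuning behavior on $x^2$ relies on $|1 - 2\eta| < 1$, i.e. $0 < \eta < 1$.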

Neither of these observations holds for GD on $h(x) = |x|$, since the magnitude of the update, $\eta$, is fixed regardless of the distance between $x$ and $x^* = 0$. In particular, this means GD is not fast for $x$ far away from $x^*$ and does not slow down as $x$ nears $x^*$. Arbitrary accuracy is also impossible in the setting of a fixed step-size $\eta$. Once an iterate $x_T$ lands in the open $\eta$-neighborhood of $x^* = 0$ (without hitting $0$ exactly), the next step overshoots the optimizer and the step after that returns to $x_T$. The iterates therefore cycle between $x_T$ and $x_T - \eta$ if $x_T > 0$, or between $x_T$ and $x_T + \eta$ if $x_T < 0$, and the sub-optimality cycles between two values which depend on the choice of $\eta$. That is, the sub-optimality cannot be made arbitrarily small for a fixed choice of $\eta$. To be clear, the iterates still reach an $\eta$-neighborhood of the optimizer, but progress is slow and the accuracy is limited. Arbitrary accuracy for problems such as this can only be achieved by choosing a sequence of diminishing step-sizes $\{ \eta_t \}_{t=1}^T$, which shrink the update since the gradient itself does not diminish. Of course, the sequence must be chosen with care since it's possible to *'run out of steam'* before reaching the optimizer: this happens when the step-sizes shrink so quickly that their sum is too small to cover the distance to $x^*$.
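The cycling behavior, and the effect of diminishing step-sizes, can be reproduced with a short experiment. This is only a sketch: the starting point, the fixed step-size $0.3$, the schedule $\eta_t = 0.3/\sqrt{t}$, and the convention of returning $0$ for the derivative at $x = 0$ are all assumptions made for illustration.

```python
def sign(x):
    # Derivative of |x| away from 0; returning 0 at x = 0 is a convention assumed here.
    return (x > 0) - (x < 0)

def gd_abs(x0, step_size, steps):
    # GD on h(x) = |x|; step_size(t) returns the step-size used at iteration t.
    xs = [x0]
    for t in range(1, steps + 1):
        x = xs[-1]
        xs.append(x - step_size(t) * sign(x))
    return xs

fixed = gd_abs(1.0, lambda t: 0.3, 200)
diminishing = gd_abs(1.0, lambda t: 0.3 / t ** 0.5, 200)  # eta_t = 0.3 / sqrt(t)

print("fixed, last 4 iterates:", fixed[-4:])
print("diminishing, last iterate:", diminishing[-1])
```

With the fixed step-size, the iterates end up alternating between roughly $0.1$ and $-0.2$, so the sub-optimality never drops below about $0.1$. With the diminishing schedule, the iterates keep closing in on $x^* = 0$.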

We say GD on $f(x) = x^2$ enjoys the ***self-tuning property***, whereas GD on $h(x) = |x|$ does not. This speaks to the fact that self-tuning is a property of the objective function, rather than of GD itself. 

As an overview of the theory we will soon develop, functions *like* $x^2$ will all have the self-tuning property while functions *like* $|x|$ will not. This is what ends up introducing a taxonomy of easier-to-harder convex optimization problems. What it means, precisely, to be *like* $x^2$ or $|x|$ will be made rigorous in the next few sections.

# Smoothness and Strong Convexity

As we saw above, gradient descent with a fixed step-size behaved much better on $f(x) = x^2$ than on $h(x) = |x|$. Since both of these problems are convex, $x^2$ must have additional properties not shared by $|x|$ that make it more amenable to optimization by GD. These properties turn out to be ***smoothness*** and ***strong convexity***. We will see that these properties provide insight into choosing the best fixed-step size which guarantees faster convergence of GD.

We can start by asking ourselves what makes the two quintessential functions $f(x) = x^2$ and $h(x) = |x|$ different from one another. Since the GD update step relies on the gradient, it helps thinking in terms of the differences of the gradients instead of the objective functions themselves.

The first difference of note is that the gradient of $|x|$ has a discontinuity at $x = 0$ that's not present in the gradient of $x^2$. At a point of discontinuity the gradient experiences an abrupt jump. So, in general, jumps in the gradient must pose a problem for GD. 

The second thing to note is that $|x|$ is flat compared to $x^2$. In flat regions, the gradient is constant. So, in general, constant regions in the gradient must pose a problem for GD. 

Both of these scenarios can be ruled out with a [***Lipschitz condition***](https://en.wikipedia.org/wiki/Lipschitz_continuity) on the gradient. Lipschitz conditions are both regularity conditions and growth conditions, so they rule out abrupt jumps and contain the growth of the gradient.

We are ready to define the two properties mentioned in the beginning of this section.

> **Smoothness:** &nbsp; We say a function $f(x)$ is ***M-smooth*** if its gradient is ***M-Lipschitz***. That is, if:
<br>
$$\exists M > 0 \ \ s.t. \ \ ||\nabla f(x) - \nabla f(y)||_2 \leq M||x-y||_2 \ \ \forall x,y$$
<br>

This is a universal upper-bound on the change in gradient which rules out jumps.

> **Strong Convexity:** &nbsp; We say a function $f(x)$ is ***m-strongly-convex*** if:
<br>
$$\exists m > 0 \ \ s.t. \ \ ||\nabla f(x) - \nabla f(y)||_2 \geq m||x-y||_2 \ \ \forall x,y$$
<br>

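This is a universal lower-bound on the change in gradient which rules out the possibility of a constant gradient. For the convex objectives considered here, it follows from the more common inner-product form of strong convexity, $(\nabla f(x) - \nabla f(y))^T(x - y) \geq m||x-y||_2^2$, by the Cauchy-Schwarz inequality.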

In particular, an $M$-smooth, and $m$-strongly-convex function $f(x)$ has the property that:

<br>
$$m||x-y||_2 \leq ||\nabla f(x) - \nabla f(y)||_2 \leq M||x-y||_2 \ \ \forall x,y \tag{3.1}$$
<br>
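To make $(3.1)$ concrete, the sketch below evaluates the ratio $|f'(x) - f'(y)| / |x - y|$ for the two model problems at a few hand-picked pairs of points. The pairs and helper names are illustrative assumptions.

```python
def grad_ratio(grad, x, y):
    # |grad(x) - grad(y)| / |x - y| for scalars x != y.
    return abs(grad(x) - grad(y)) / abs(x - y)

def grad_sq(x):
    return 2.0 * x  # derivative of x^2

def grad_abs(x):
    return (x > 0) - (x < 0)  # derivative of |x| away from 0

# x^2: the ratio equals 2 for every pair, so m = M = 2.
pairs = [(-3.0, 1.0), (0.5, 0.25), (10.0, -7.0)]
ratios_sq = [grad_ratio(grad_sq, x, y) for x, y in pairs]

# |x|: pairs straddling 0 make the ratio blow up (no finite M) ...
ratios_abs_jump = [grad_ratio(grad_abs, -eps, eps) for eps in (1e-1, 1e-3, 1e-6)]
# ... and pairs on the same side of 0 make it vanish (no m > 0).
ratio_abs_flat = grad_ratio(grad_abs, 1.0, 2.0)

print(ratios_sq, ratios_abs_jump, ratio_abs_flat)
```

For $x^2$ both bounds in $(3.1)$ hold with $m = M = 2$, whereas for $|x|$ the ratio is unbounded across the kink and zero on either side of it, so neither bound can hold.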

For a twice-differentiable function, there's a more compact way to express these properties using the Hessian. It makes use of an ordering on symmetric matrices induced by matrix [***definiteness***](https://en.wikipedia.org/wiki/Definite_matrix).

<br>
$$||\nabla f(x) - \nabla f(y)||_2 \leq M||x-y||_2 \ \ \forall x,y$$
$$\iff$$
$$||\nabla^2 f(x)||_2 \leq M \ \ \forall x$$
$$\iff$$
$$\nabla^2 f(x) \preceq  MI \ \ \forall x \tag{4.1}$$
<br>

The first equivalence is by the [***Mean Value Theorem***](https://en.wikipedia.org/wiki/Mean_value_theorem) and the second follows from the definition of the ***matrix norm***, using the fact that the Hessian of a convex function is positive semi-definite.

Line $(4.1)$ should be read as *'the maximum eigenvalue of the Hessian $\nabla^2 f(x)$ is at most $M$'*. 

By a symmetric argument, we also have: 

<br>
$$\nabla^2 f(x) \succeq mI \ \ \forall x \tag{4.2}$$
<br>

Which should be read as *'the minimum eigenvalue of the Hessian $\nabla^2 f(x)$ is at least $m$'*.

Together, $(4.1)$ and $(4.2)$ give the analog of $(3.1)$ for twice-differentiable $M$-smooth and $m$-strongly-convex functions: 

<br>
$$mI \preceq \nabla^2 f(x) \preceq MI \ \ \forall x \tag{3.2}$$
<br>

Since the Hessian represents the curvature of the function, $(3.2)$ is a two-sided bound on the curvature of $f(x)$. So, we see that smoothness and strong convexity also regulate function shape itself. The lower-bound rules out flatness, while the upper-bound rules out kinks such as corners and cusps.
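For a quadratic objective the Hessian is constant, so the best constants $m$ and $M$ can be read off directly as its extreme eigenvalues. The following sketch does this for a hypothetical $2 \times 2$ example, $f(x) = \frac{1}{2} x^T A x$ with a matrix $A$ chosen purely for illustration.

```python
import math

def eig_sym_2x2(a, b, c):
    # Eigenvalues of the symmetric matrix [[a, b], [b, c]], smallest first.
    mean = (a + c) / 2.0
    radius = math.sqrt(((a - c) / 2.0) ** 2 + b ** 2)
    return mean - radius, mean + radius

# Illustrative quadratic f(x) = 0.5 * x^T A x with A = [[3, 1], [1, 2]] (an assumption).
# Its Hessian is A everywhere, so m and M are the extreme eigenvalues of A.
m, M = eig_sym_2x2(3.0, 1.0, 2.0)
print("m =", m, "M =", M)
```

Here $m = \frac{5 - \sqrt{5}}{2} > 0$ and $M = \frac{5 + \sqrt{5}}{2}$, so this quadratic is both strongly convex and smooth.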

## Quadratic Bounds

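Smoothness and strong convexity each translate into a quadratic bound on the objective itself. The regularity condition that rules out jumps in the gradient gives an upper bound, while the growth condition that rules out flatness gives a lower bound.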

If $f$ is $M$-smooth, then at every point $x$ we have the quadratic upper bound:

<br>
$$f(y) \leq f(x) + \nabla f(x)^T(y-x) + \frac{M}{2} ||y-x||_2^2 \ \ \forall y$$
<br>
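As a numerical illustration of this bound, the sketch below checks it on a grid for the softplus function $\log(1 + e^x)$, whose second derivative is at most $\frac{1}{4}$, so it is $M$-smooth with $M = \frac{1}{4}$. The choice of function, grid, and tolerance are assumptions made for the example.

```python
import math

def f(x):
    # Softplus log(1 + e^x): convex, with f''(x) = s(x) * (1 - s(x)) <= 1/4.
    return math.log1p(math.exp(x))

def df(x):
    return 1.0 / (1.0 + math.exp(-x))  # sigmoid s(x)

M = 0.25
grid = [i / 4.0 for i in range(-20, 21)]  # points in [-5, 5]

def upper_bound(x, y):
    # Quadratic upper bound on f(y), built from information at x.
    return f(x) + df(x) * (y - x) + 0.5 * M * (y - x) ** 2

violations = [(x, y) for x in grid for y in grid
              if f(y) > upper_bound(x, y) + 1e-12]
print("violations of the quadratic upper bound:", len(violations))
```

No pair on the grid should violate the bound, and the bound is attained with equality at $y = x$.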

How does this help us choose step size in a better way?

Plug the step $x^+ = x - \eta \nabla f(x)$ into the upper bound:

<br>
$$
\begin{aligned}
f(x^+) &\leq f(x) + \nabla f(x)^T(-\eta \nabla f(x)) + \frac{M}{2} ||-\eta \nabla f(x)||_2^2 \\
\iff f(x^+) &\leq f(x) -  \eta || \nabla f(x) ||_2^2 + \frac{M \eta^2}{2} ||\nabla f(x)||_2^2
\end{aligned}
$$
<br>

But the RHS is a quadratic function in $\eta$, the step-size. So, the upper bound on the next objective value $f(x^+)$ can be minimized w.r.t. $\eta$. Essentially, we trust that the upper bound is a good quadratic approximation and choose its minimizer as the next iterate. This is just like Newton's method (NM), where we trust the quadratic estimate and set its minimizer as the next iterate. But here, we require no second-order information about the function beyond the constant $M$.

The non-NM view is this. We have a fixed $x$ and the quadratic upper bound at that $x$. The objective value at any next iterate is no larger than the quadratic upper bound evaluated *at* that iterate. So we choose a step-size that minimizes this upper bound. This guarantees that $f(x^+)$ is at most the minimum of the upper bound, which is the tightest guarantee we can get on $f(x^+)$ with the information at hand. So we, once again, do the locally optimal thing and choose that.

The best choice turns out to be $\eta = \frac{1}{M}$, which is where the quadratic in $\eta$ on the RHS attains its minimum.
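Substituting $\eta = \frac{1}{M}$ into the upper bound, with $x^+ - x = -\eta \nabla f(x)$, gives the per-step guarantee $f(x^+) \leq f(x) - \frac{1}{2M} ||\nabla f(x)||_2^2$. The sketch below checks this guarantee along a GD run on a hypothetical quadratic $f(x) = \frac{1}{2} x^T A x$; the matrix $A$, the starting point, and the iteration count are assumptions chosen for illustration.

```python
import math

A = [[3.0, 1.0], [1.0, 2.0]]  # illustrative Hessian (an assumption)

def grad(x):
    # Gradient of f(x) = 0.5 * x^T A x is A x.
    return [A[0][0] * x[0] + A[0][1] * x[1],
            A[1][0] * x[0] + A[1][1] * x[1]]

def f(x):
    Ax = grad(x)
    return 0.5 * (x[0] * Ax[0] + x[1] * Ax[1])

# Smoothness constant M = largest eigenvalue of A = (5 + sqrt(5)) / 2.
M = (5.0 + math.sqrt(5.0)) / 2.0
eta = 1.0 / M

x = [4.0, -3.0]
values = [f(x)]
guarantee_held = []
for _ in range(100):
    g = grad(x)
    g_sq = g[0] ** 2 + g[1] ** 2
    x_next = [x[0] - eta * g[0], x[1] - eta * g[1]]
    # Check f(x+) <= f(x) - ||grad f(x)||^2 / (2M), up to rounding.
    guarantee_held.append(f(x_next) <= f(x) - g_sq / (2.0 * M) + 1e-12)
    x = x_next
    values.append(f(x))

print(all(guarantee_held), values[-1])
```

Every step decreases the objective by at least $\frac{1}{2M} ||\nabla f(x)||_2^2$, so progress stalls only where the gradient vanishes.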


For functions that are $M$-smooth, the same constant $M$ works at every point. If no such $M$ exists, we do not have this upper bound, and it is then that we turn to either Newton's method or quasi-Newton methods.


In this analysis, we're only concerned with convex, unconstrained objectives $f(x)$. So, by the first-order characterization of convexity, the linear approximation is actually a lower-bound on the true value at $x^+$.

<br>
$$f(x^+) \geq f(x) + \eta \nabla f(x)^T d \ \ \forall d \tag{1.4}$$
<br>