# Optimisation for Machine Learning

November 08, 2023

```python
import numpy as np
import matplotlib.pyplot as plt
plt.style.use('default')
plt.rc('text', usetex=True)
plt.rc('font', family='sans-serif')
plt.rc('font', size=14)
plt.rc('axes', titlesize=14)
plt.rc('axes', labelsize=14)
plt.rc('xtick', labelsize=14)
plt.rc('ytick', labelsize=14)
plt.rc('legend', fontsize=14)
plt.rc('lines', markersize=10)
```

## Finite sum problems

$$
\min_{w \in \mathbb{R}^d} f(w) = \frac{1}{n} \sum_{i=1}^n f_i(w)
$$

where $f_i$ is convex and $L$-smooth for all $i$.

**Example:** Linear least squares
$$
\mathbb{X} = \begin{bmatrix} x_1^T \\ \vdots \\ x_n^T \end{bmatrix} \in \mathbb{R}^{n \times d}, \quad y = \begin{bmatrix} y_1 \\ \vdots \\ y_n \end{bmatrix} \in \mathbb{R}^n
$$
$$
f(w) = \frac{1}{2n} \| \mathbb{X} w - y \|_2^2 = \frac{1}{n} \sum_{i=1}^n \underbrace{\frac{1}{2} \left( x_i^T w - y_i \right)^2}_{f_i(w)}
$$
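
As a quick numerical check of the finite-sum form, the sketch below evaluates the least-squares objective both as $\frac{1}{2n} \| \mathbb{X} w - y \|_2^2$ and as the average of the per-sample losses $f_i(w) = \frac{1}{2} (x_i^T w - y_i)^2$. The synthetic data, the seed and the helper names `f` and `f_i` are assumptions made for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)   # fixed seed, synthetic data (assumption)
n, d = 200, 5
X = rng.normal(size=(n, d))      # row i of X is x_i^T
y = X @ rng.normal(size=d) + 0.1 * rng.normal(size=n)

def f_i(w, i):
    """Loss of the single data point i: 0.5 * (x_i^T w - y_i)^2."""
    return 0.5 * (X[i] @ w - y[i]) ** 2

def f(w):
    """Full objective: (1 / 2n) * ||X w - y||^2."""
    return np.sum((X @ w - y) ** 2) / (2 * n)

w = rng.normal(size=d)
finite_sum = np.mean([f_i(w, i) for i in range(n)])
print(f(w), finite_sum)          # the two values agree up to rounding
```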

In a typical ML setup, $n \gg 1$.

The main cost of an algorithm is accessing the data.

Assumptions:
- $f_i$ is convex and $L$-smooth for all $i$, $f_i \in \mathcal{C}^1$ for all $i$

Gradient descent uses the full gradient $\nabla f(w) = \frac{1}{n} \sum_{i=1}^n \nabla f_i(w)$.

Iteration $k$: $w_{k+1} = w_k - \alpha_k \nabla f(w_k) = w_k - \alpha_k \frac{1}{n} \sum_{i=1}^n \nabla f_i(w_k)$, where $\alpha_k > 0$ is the step size.

One iteration of GD requires access to all $n$ data points. This is expensive for large $n$.
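
A minimal sketch of gradient descent on the least-squares example, to make the per-iteration cost visible: every call to `grad_f` multiplies by the whole data matrix. The synthetic data and the choice $\alpha = 1/L$, with $L$ taken as the largest eigenvalue of $\frac{1}{n} \mathbb{X}^T \mathbb{X}$, are assumptions for this demo.

```python
import numpy as np

rng = np.random.default_rng(0)   # synthetic data (assumption)
n, d = 200, 5
X = rng.normal(size=(n, d))
y = X @ rng.normal(size=d) + 0.1 * rng.normal(size=n)

def f(w):
    return 0.5 * np.mean((X @ w - y) ** 2)

def grad_f(w):
    # Full gradient (1/n) X^T (X w - y): touches all n data points.
    return X.T @ (X @ w - y) / n

# Smoothness constant of this quadratic objective.
L = np.linalg.eigvalsh(X.T @ X / n).max()
alpha = 1.0 / L

w = np.zeros(d)
values = [f(w)]
for k in range(100):
    w = w - alpha * grad_f(w)    # one GD iteration = one pass over the data
    values.append(f(w))

print(values[0], values[-1])
```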

**GOAL**: Find a method that minimises $f(w)$ with iterations that are cheaper than those of GD.

**Note**: Minimising $f(w) = \frac{1}{n} \sum_{i=1}^n f_i(w)$ is known as **empirical risk minimisation**; it uses a finite number of data points.
Most of what we study also applies to $\min_{w \in \mathbb{R}^d} f(w) = \mathbb{E}_{x \sim \mathcal{D}} \left[ f(x, w) \right]$, where $x$ is a random variable with probability distribution $\mathcal{D}$.

## Stochastic gradient descent (SGD)

$$
\begin{align*}
& f(w) = \frac{1}{n} \sum_{i=1}^n f_i(w) \quad \text{where each } f_i \in \mathcal{C}^1 \text{ is } L_i\text{-smooth} \\
& \text{Gradient descent: } w_{k+1} = w_k - \alpha_k \frac{1}{n} \sum_{i=1}^n \nabla f_i(w_k) \\
& \text{Stochastic gradient descent: } w_{k+1} = w_k - \alpha_k \nabla f_{i_k}(w_k) \quad \text{where } i_k \sim \{1, \dots, n\} \text{ uniformly at random}
\end{align*}
$$

One iteration of SGD accesses only one data point (through $f_{i_k}$), while one iteration of GD accesses all $n$ data points.
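
The SGD update can be sketched as follows for the least-squares losses $f_i(w) = \frac{1}{2}(x_i^T w - y_i)^2$. The data, the step size `alpha=0.01`, the iteration count and the helper names `grad_f_i` and `sgd` are illustrative assumptions, not tuned choices.

```python
import numpy as np

rng = np.random.default_rng(0)           # fixed seed for reproducibility
n, d = 200, 5
X = rng.normal(size=(n, d))              # synthetic data (assumption)
y = X @ rng.normal(size=d) + 0.1 * rng.normal(size=n)

def f(w):
    return 0.5 * np.mean((X @ w - y) ** 2)

def grad_f_i(w, i):
    # Gradient of f_i(w) = 0.5 * (x_i^T w - y_i)^2: uses one data point only.
    return (X[i] @ w - y[i]) * X[i]

def sgd(w0, alpha, num_iters, rng):
    w = w0.copy()
    for _ in range(num_iters):
        i_k = rng.integers(n)            # uniform on {0, ..., n-1}
        w = w - alpha * grad_f_i(w, i_k)
    return w

w0 = np.zeros(d)
w_sgd = sgd(w0, alpha=0.01, num_iters=2000, rng=rng)
print(f(w0), f(w_sgd))
```

The 2000 SGD iterations above read 2000 data points in total, the same amount of data as 10 full GD iterations on this problem.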

**Note**: Does SGD decrease the objective function $f(w)$? Not always!

**Example**: $f_1(w) = 2w^2$, $f_2(w) = -w^2$, $f(w) = \frac{1}{2} \left( f_1(w) + f_2(w) \right) = \frac{1}{2} w^2$.
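If the index $i_k = 2$ is drawn, the gradient is $\nabla f_2(w_k) = -2w_k$, so $w_{k+1} = w_k + 2 \alpha_k w_k = \left( 1 + 2 \alpha_k \right) w_k$. For any $\alpha_k > 0$ and $w_k \neq 0$ this gives $|w_{k+1}| > |w_k|$, hence $f(w_{k+1}) > f(w_k)$: the step moves away from the minimiser $w^* = 0$. (Here $f_2$ is not convex, although $f$ is.)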
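
The example can be checked numerically. The starting point `w = 1.0` and the step size `alpha = 0.1` are arbitrary values chosen for illustration.

```python
def f(w):
    return 0.5 * w ** 2               # f = (f_1 + f_2) / 2

def grad_f2(w):
    return -2.0 * w                   # gradient of f_2(w) = -w^2

alpha = 0.1
w = 1.0
w_next = w - alpha * grad_f2(w)       # SGD step that happens to draw i_k = 2

# The iterate moves away from the minimiser 0 and the objective increases.
print(w_next, f(w), f(w_next))
```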

**Note**: SGD is not a descent method. The function value might increase because the gradient is random and inexact. In practice, SGD works well for ML applications with many data points. In those cases, SGD tends to converge faster than GD.

Iteration of SGD: $w_{k+1} = w_k - \alpha_k \nabla f_{i_k}(w_k) \quad \text{where } \alpha_k > 0 \text{ and } i_k \sim \{1, \dots, n\} \text{ uniformly at random}$

### **Question**: How to choose $\alpha_k$ and $i_k$?

##### Step size $\alpha_k$
- Constant step size: $\alpha_k = \alpha$ for all $k \geq 0$
- Decreasing step size: $\alpha_k = \frac{\alpha}{k}$ for all $k \geq 1$. This is a good choice for convex problems because $\lim_{k \to \infty} \alpha_k = 0$, $\sum_{k=1}^\infty \alpha_k = \infty$ and $\sum_{k=1}^\infty \alpha_k^2 < \infty$. Examples: $\alpha_k = \frac{1}{k}$ and $\alpha_k = \frac{1}{k^\beta}$ with $\beta \in (\frac{1}{2}, 1]$ satisfy all three conditions; $\alpha_k = \frac{1}{\sqrt{k}}$ also tends to zero with $\sum_k \alpha_k = \infty$, but $\sum_k \alpha_k^2 = \infty$.
- Hybrid step size: Start with a fixed step size $\alpha_0$ and switch to a decreasing step size $\alpha_k = \frac{\alpha_0}{k}$ for $k \geq 1$ once we stop making progress.
- Bold driver: $\alpha_{k+1} = \begin{cases} \alpha_k & \text{if } f(w_{k+1}) < f(w_k) \\ \beta \alpha_k & \text{if } f(w_{k+1}) \geq f(w_k) \end{cases}$ where $\beta \in (0, 1)$ is a constant, so the step size shrinks whenever the function value fails to decrease.
- Backtracking line search: $\alpha_k = \beta^j \alpha_0$ where $\beta \in (0, 1)$ and $j = \min \{ j \in \mathbb{N} \mid f(w_k - \beta^j \alpha_0 \nabla f_{i_k}(w_k)) \leq f(w_k) - \frac{\beta^j}{2} \alpha_0 \| \nabla f_{i_k}(w_k) \|_2^2 \}$. Line searches do not work well for SGD: checking the function decrease is expensive because we need to access all the data points.
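
To make the summability conditions on decreasing step sizes concrete, the sketch below computes partial sums of $\alpha_k$ and $\alpha_k^2$ for $\alpha_k = 1/k^\beta$, $k \geq 1$. The helper names and the cut-off of $10^5$ terms are assumptions; partial sums only suggest, and do not prove, convergence or divergence.

```python
import numpy as np

def constant_step(alpha):
    # alpha_k = alpha for every k
    return lambda k: alpha

def decreasing_step(alpha, beta=1.0):
    # alpha_k = alpha / k**beta, defined for k >= 1
    return lambda k: alpha / k ** beta

ks = np.arange(1, 100001)
partial_sums = {}
for beta in (1.0, 0.75, 0.5):
    steps = decreasing_step(1.0, beta)(ks)
    partial_sums[beta] = (steps.sum(), (steps ** 2).sum())
    print(beta, partial_sums[beta])
```

For $\beta = 1$ the sum of squares settles near $\pi^2 / 6$, while for $\beta = \frac{1}{2}$ the squares form the harmonic series and their partial sums keep growing.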

#### Convergence Analysis
**Gradient descent**: $\nabla f$ is $L$-Lipschitz continuous, i.e. $\| \nabla f(w) - \nabla f(v) \|_2 \leq L \| w - v \|_2$ for all $w, v \in \mathbb{R}^d$, which implies $f(w) \leq f(v) + \nabla f(v)^T (w - v) + \frac{L}{2} \| w - v \|_2^2$ for all $w, v \in \mathbb{R}^d$. Applying this with $w = w_{k+1}$ and $v = w_k$:
$$
\begin{align*}
f(w_{k+1}) &\leq f(w_k) + \nabla f(w_k)^T (w_{k+1} - w_k) + \frac{L}{2} \| w_{k+1} - w_k \|_2^2 \\
&= f(w_k) - \alpha_k \| \nabla f(w_k) \|_2^2 + \frac{L \alpha_k^2}{2} \| \nabla f(w_k) \|_2^2 \\
&= f(w_k) + \| \nabla f(w_k) \|_2^2 \left( \frac{L \alpha_k^2}{2} - \alpha_k \right)
\end{align*}
$$

Find the step size $\alpha_k$ that minimises the bound:
$$
\begin{align*}
\phi(\alpha) &= \frac{L \alpha^2}{2} - \alpha \\
\phi'(\alpha) &= L \alpha - 1 = 0 \;\Rightarrow\; \alpha = \frac{1}{L} \\
\phi\left(\frac{1}{L}\right) &= \frac{1}{2L} - \frac{1}{L} = -\frac{1}{2L} < 0
\end{align*}
$$

$$
\begin{align*}
f(w_{k+1}) = f\left( w_k - \frac{1}{L} \nabla f(w_k) \right) &\leq f(w_{k}) - \frac{1}{2L} \| \nabla f(w_k) \|_2^2 \\
\underbrace{f(w_k) - f(w_{k+1})}_{\text{decrease in function value}} &\geq \frac{1}{2L} \| \nabla f(w_k) \|_2^2
\end{align*}
$$
Sum the decrease over all iterations:
$$
\begin{align*}
f(w_0) - f(w_{k}) = \sum_{i=0}^{k-1} \left( f(w_i) - f(w_{i+1}) \right) &\geq \frac{1}{2L} \sum_{i=0}^{k-1} \| \nabla f(w_i) \|_2^2 \\
&\geq \frac{1}{2L} \sum_{i=0}^{k-1} \left( \min_{j=0, \dots, k-1} \| \nabla f(w_j) \|_2^2 \right) \\
&= \frac{k}{2L} \left( \min_{j=0, \dots, k-1} \| \nabla f(w_j) \|_2^2 \right)
\end{align*}
$$
If $f$ is bounded below by $f^*$, the left-hand side is at most $f(w_0) - f^*$ for every $k$, so
$$
\sum_{k=0}^\infty \| \nabla f(w_k) \|_2^2 \leq 2L \left( f(w_0) - f^* \right) < \infty \text{ and } \lim_{k \to \infty} \| \nabla f(w_k) \|_2 = 0
$$

Rate of convergence of GD ($f$ may be non-convex):
$$
\begin{align*}
& f(w_0) - f^* \geq \frac{1}{2L} \sum_{i=0}^{k-1} \| \nabla f(w_i) \|_2^2 \geq \frac{k}{2L} \left( \min_{i=0, \dots, k-1} \| \nabla f(w_i) \|_2^2 \right) \\
& \Rightarrow \min_{i=0, \dots, k-1} \| \nabla f(w_i) \|_2 \leq \sqrt{\frac{2L \left( f(w_0) - f^* \right)}{k}} = \mathcal{O} \left( \frac{1}{\sqrt{k}} \right)
\end{align*}
$$
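
The bound $\min_{i < k} \| \nabla f(w_i) \|_2^2 \leq \frac{2L (f(w_0) - f^*)}{k}$ for GD with $\alpha = \frac{1}{L}$ can be checked numerically. The sketch below does so on a synthetic least-squares problem (which happens to be convex; the bound itself does not need convexity). The data and the use of `np.linalg.lstsq` to obtain $f^*$ are assumptions for this demo.

```python
import numpy as np

rng = np.random.default_rng(1)   # synthetic data (assumption)
n, d = 100, 4
X = rng.normal(size=(n, d))
y = rng.normal(size=n)

def f(w):
    return 0.5 * np.mean((X @ w - y) ** 2)

def grad_f(w):
    return X.T @ (X @ w - y) / n

L = np.linalg.eigvalsh(X.T @ X / n).max()            # smoothness constant
f_star = f(np.linalg.lstsq(X, y, rcond=None)[0])     # optimal value

w = np.zeros(d)
f0 = f(w)
sq_norms, running_min, bounds = [], [], []
for k in range(1, 51):
    g = grad_f(w)                                    # gradient at w_{k-1}
    sq_norms.append(np.sum(g ** 2))
    running_min.append(min(sq_norms))                # min over i = 0..k-1
    bounds.append(2 * L * (f0 - f_star) / k)
    w = w - g / L                                    # GD step with alpha = 1/L

print(running_min[-1], bounds[-1])
```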

**Stochastic gradient descent**: 
$f \in \mathcal{C}^1$ is $L$-smooth, i.e. $\nabla f$ is $L$-Lipschitz continuous. We also have $\nabla f(w) = \frac{1}{n} \sum_{i=1}^n \nabla f_i(w)$ and $w_{k+1} = w_k - \alpha_k \nabla f_{i_k}(w_k)$, where $i_k \sim \{1, \dots, n\}$ uniformly at random.
$$
\begin{align*}
f(w_{k+1}) &\leq f(w_k) + \nabla f(w_k)^T (w_{k+1} - w_k) + \frac{L}{2} \| w_{k+1} - w_k \|_2^2 \\
&= f(w_k) + \nabla f(w_k)^T \left( -\alpha_k \nabla f_{i_k}(w_k) \right) + \frac{L \alpha_k^2}{2} \| \nabla f_{i_k}(w_k) \|_2^2 \quad (*)
\end{align*}
$$
Because of the randomness in $i_k$, we cannot find a step size that ensures $f(w_{k+1}) \leq f(w_k)$.
We can show a decrease on average by taking the expected value with respect to $i_k$ in the last inequality. Think of the gradient $\underbrace{\nabla f_{i_k}(w_k)}_{g(w_k, i_k) \, = \, \text{estimate of } \nabla f(w_k)}$ as a random variable.

Assume that at every iteration $k$, $i_k$ is chosen uniformly at random and independently of $i_0, \dots, i_{k-1}$, and satisfies:
- $\mathbb{E}_{i_k} \left[ \nabla f_{i_k}(w_k) \right] = \nabla f(w_k)$ (unbiased estimate of $\nabla f(w_k)$)
- $\mathbb{E}_{i_k} \left[ \| \nabla f_{i_k}(w_k) \|_2^2 \right] \leq \| \nabla f(w_k) \|_2^2 + \sigma^2$ (bounded variance: since $\mathbb{E} \| g \|_2^2 = \| \mathbb{E} g \|_2^2 + \mathbb{E} \| g - \mathbb{E} g \|_2^2$, this is equivalent to $\mathbb{E}_{i_k} \left[ \| \nabla f_{i_k}(w_k) - \nabla f(w_k) \|_2^2 \right] \leq \sigma^2$)

**Example**:
Drawing $i_k$ uniformly at random from $\{1, \dots, n\}$:
$$
\mathbb{E}_{i_k} \left[ \nabla f_{i_k}(w_k) \right] = \sum_{i=1}^n \underbrace{\mathbb{P}(i = i_k)}_{\frac{1}{n}} \nabla f_i(w_k) = \frac{1}{n} \sum_{i=1}^n \nabla f_i(w_k) = \nabla f(w_k)
$$
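
Both properties of the uniformly sampled gradient can be verified exactly on a small least-squares problem, because the expectation over $i_k$ is just an average over the $n$ indices. The data and variable names below are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)   # synthetic data (assumption)
n, d = 200, 5
X = rng.normal(size=(n, d))
y = rng.normal(size=n)
w = rng.normal(size=d)

residuals = X @ w - y
per_sample = residuals[:, None] * X          # row i is grad f_i(w)
full_grad = X.T @ residuals / n              # grad f(w)

# Expectation under uniform sampling = plain average over i.
mean_grad = per_sample.mean(axis=0)
second_moment = np.mean(np.sum(per_sample ** 2, axis=1))
variance = np.mean(np.sum((per_sample - full_grad) ** 2, axis=1))

print(np.abs(mean_grad - full_grad).max())
print(second_moment, np.sum(full_grad ** 2) + variance)
```

At this particular $w$, the quantity `variance` is the smallest $\sigma^2$ for which the second assumption holds; the assumption itself asks for one $\sigma^2$ that works at every iterate.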

Continue the convergence analysis of SGD from $(*)$:
$$
f(w_{k+1}) \leq f(w_k) - \alpha_k \nabla f(w_k)^T \nabla f_{i_k}(w_k) + \frac{L \alpha_k^2}{2} \| \nabla f_{i_k}(w_k) \|_2^2 \quad (*)
$$
Take the expectation with respect to $i_k$ on both sides (note that $w_k$ does not depend on $i_k$):
$$
\begin{align*}
\mathbb{E}_{i_k} \left[ f(w_{k+1}) \right] &\leq f(w_k) - \alpha_k \nabla f(w_k)^T \underbrace{\mathbb{E}_{i_k} \left[ \nabla f_{i_k}(w_k) \right]}_{\nabla f(w_k)} + \frac{L \alpha_k^2}{2} \underbrace{\mathbb{E}_{i_k} \left[ \| \nabla f_{i_k}(w_k) \|_2^2 \right]}_{\leq \| \nabla f(w_k) \|_2^2 + \sigma^2} \\
&\leq f(w_k) - \alpha_k \| \nabla f(w_k) \|_2^2 + \frac{L \alpha_k^2}{2} \| \nabla f(w_k) \|_2^2 + \underbrace{\frac{L \alpha_k^2}{2} \sigma^2}_{\text{noise/variance term}}
\end{align*}
$$
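This allows us to choose the step size so that the objective decreases in expectation as long as the gradient is large compared with the noise. For $\alpha_k \leq \frac{1}{L}$ the bound gives $\mathbb{E}_{i_k} \left[ f(w_{k+1}) \right] \leq f(w_k) - \frac{\alpha_k}{2} \| \nabla f(w_k) \|_2^2 + \frac{L \alpha_k^2}{2} \sigma^2$, so $\mathbb{E}_{i_k} \left[ f(w_{k+1}) \right] \leq f(w_k)$ whenever $\| \nabla f(w_k) \|_2^2 \geq L \alpha_k \sigma^2$.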

Convergence result for $f$ $\mu$-strongly convex and $L$-smooth, where $\mu$-strong convexity means:
$$
f(v) \geq f(w) + \nabla f(w)^T (v - w) + \frac{\mu}{2} \| v - w \|_2^2 \quad \text{for all } v, w \in \mathbb{R}^d
$$

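#### Theorem: Constant step size
Let $f$ be $\mu$-strongly convex and $L$-smooth, with $f(w^*) = \min f$.
Suppose that we run $K$ iterations of SGD with constant step size $\alpha_k = \alpha \leq \frac{1}{L}$,
and that the distribution of $i_k$ satisfies the assumptions above.
Then:
$$
\mathbb{E} \left[ f(w_K) - f(w^*) \right] \leq \underbrace{\left( 1 - \alpha \mu \right)^K}_{\text{convergence rate}} \underbrace{\left( f(w_0) - f(w^*) - \frac{L \alpha}{2 \mu} \sigma^2 \right)}_{\text{initial error}} + \underbrace{\frac{L \alpha}{2 \mu} \sigma^2}_{\text{noise term}}
$$
The factor $\frac{1}{\mu}$ in the noise term comes from unrolling the one-step bound $\mathbb{E} \left[ f(w_{k+1}) - f(w^*) \right] \leq (1 - \alpha \mu) \, \mathbb{E} \left[ f(w_k) - f(w^*) \right] + \frac{L \alpha^2}{2} \sigma^2$, which follows from the expected-decrease inequality and $\| \nabla f(w) \|_2^2 \geq 2 \mu \left( f(w) - f(w^*) \right)$.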

**Recall on Gradient Descent**
For a fixed step size $\alpha_k = \alpha \leq \frac{1}{L}$ and $f$ $\mu$-strongly convex:
$$
f(w_k) - f(w^*) \leq (1 - \mu \alpha)^k \left( f(w_0) - f(w^*) \right)
$$
This is the SGD bound with $\sigma^2 = 0$.

Reducing the step size leads to a smaller noise term but a worse convergence rate, since $(1 - \mu\alpha) \to 1$ as $\alpha \to 0$.

Gradient descent gives $f(w_k) \to f(w^*)$ deterministically, with a linear convergence rate.
SGD with a constant step size only guarantees that $\limsup_{k \to \infty} \mathbb{E} \left[ f(w_k) \right] - f(w^*)$ is at most the noise term of the theorem, which is proportional to $\alpha \sigma^2$; this corresponds to practical behaviour.
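
The trade-off between step size and noise floor can be observed with a small experiment. The sketch below runs constant-step SGD on a synthetic least-squares problem with two step sizes and averages $f(w_k) - f(w^*)$ over the last iterates. The data, the two step sizes, the iteration counts and the helper `tail_gap` are assumptions for illustration; the comparison is qualitative and does not evaluate the constants of the theorem.

```python
import numpy as np

rng = np.random.default_rng(0)   # synthetic, noisy data (assumption)
n, d = 200, 5
X = rng.normal(size=(n, d))
y = X @ rng.normal(size=d) + 0.5 * rng.normal(size=n)

def f(w):
    return 0.5 * np.mean((X @ w - y) ** 2)

f_star = f(np.linalg.lstsq(X, y, rcond=None)[0])   # optimal value

def tail_gap(alpha, num_iters=5000, tail=1000, seed=1):
    """Average of f(w_k) - f* over the last `tail` SGD iterates."""
    step_rng = np.random.default_rng(seed)
    w = np.zeros(d)
    gaps = []
    for _ in range(num_iters):
        i_k = step_rng.integers(n)                       # uniform index
        w = w - alpha * (X[i_k] @ w - y[i_k]) * X[i_k]   # SGD step
        gaps.append(f(w) - f_star)
    return np.mean(gaps[-tail:])

gap_large = tail_gap(alpha=0.05)
gap_small = tail_gap(alpha=0.005)
print(gap_large, gap_small)
```

With the larger step size the iterates reach their plateau sooner, but the plateau sits further above $f(w^*)$; the smaller step size ends closer to $f(w^*)$ at the price of slower initial progress.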