# Optimisation for Machine Learning

September 27, 2023

### Logistics
Contact: [Clement Royer](mailto:clement.royer@lamsade.dauphine.fr)
Lecture web page: [URL](https://www.lamsade.dauphine.fr/%7Ecroyer/teachOAA.html)
Exam: 60% (2h), December 13, 2023, 10:00 AM - 12:00 PM
Project: 40%, running from October 6, 2023 to December 23, 2023

```python
import numpy as np
import matplotlib.pyplot as plt

# Plot styling: LaTeX-rendered serif text (usetex=True requires a working LaTeX installation)
plt.style.use('default')
plt.rc('text', usetex=True)
plt.rc('font', family='serif')
# Use the same font size (18) for all text elements
plt.rc('font', size=18)
plt.rc('axes', titlesize=18)
plt.rc('axes', labelsize=18)
plt.rc('xtick', labelsize=18)
plt.rc('ytick', labelsize=18)
plt.rc('legend', fontsize=18)
plt.rc('lines', markersize=10)
```

## GRADIENT METHODS AND CONVEX OPTIMIZATION

**Problem of interest**: $\min_{w \in \mathbb{R}^d} f(w)$, where $f: \mathbb{R}^d \rightarrow \mathbb{R}$ is convex and differentiable.

When $f$ is convex, we do not have to worry about spurious local minima: every local minimum is a global minimum.

**Assumption**: $f$ belongs to the class of $C^{1,1}_{L}$ functions, i.e. $f$ is continuously differentiable and its gradient is $L$-Lipschitz continuous.
- $f$ is $C^1$ (continuously differentiable): at every point $w \in \mathbb{R}^d$ there exists a gradient $\nabla f(w) \in \mathbb{R}^d$ that describes how $f$ varies locally around $w$, and $\nabla f$ depends continuously on $w$.
- The gradient mapping $\nabla f: \mathbb{R}^d \rightarrow \mathbb{R}^d, w \mapsto \nabla f(w)$ is $L$-Lipschitz continuous, where $L > 0$ is the Lipschitz constant of $\nabla f$. This means that $\forall w, w' \in \mathbb{R}^d$, $\|\nabla f(w) - \nabla f(w')\| \leq L \|w - w'\|$. (In the limiting case $L = 0$, $\nabla f$ is constant, i.e. $\nabla f(w) = \nabla f(w')$ for all $w, w' \in \mathbb{R}^d$.)

**Examples of $C^{1,1}_{L}$ functions**:
- Linear least squares: $f(w) = \frac{1}{2} \|Xw - y\|^2$ where $X \in \mathbb{R}^{n \times d}$ and $y \in \mathbb{R}^n$.
- Logistic regression objective function: $f(w) = \sum_{i=1}^n \log(1 + \exp(-y_i x_i^T w))$ where $x_i \in \mathbb{R}^d$ and $y_i \in \{-1, 1\}$.
- Quadratic function: $f(w) = \frac{1}{2} w^T A w - b^T w$ where $A \in \mathbb{R}^{d \times d}$ is symmetric (and positive semidefinite for $f$ to be convex) and $b \in \mathbb{R}^d$.
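As a rough numerical illustration of the first two examples, the sketch below evaluates their gradients on synthetic data and checks the Lipschitz inequality on random pairs of points. The data, the seed, the helper names (`grad_least_squares`, `grad_logistic`, `lipschitz_holds`) and the constants `L_ls` $= \lambda_{\max}(X^T X)$ and `L_log` $= \lambda_{\max}(X^T X)/4$ are assumptions introduced here; they are standard valid Lipschitz constants for these two objectives, not values taken from the lecture.

```python
import numpy as np

rng = np.random.default_rng(0)            # fixed seed, synthetic data (assumption)
n, d = 50, 5
X = rng.standard_normal((n, d))
y_reg = rng.standard_normal(n)            # real-valued targets for least squares
y_cls = rng.choice([-1.0, 1.0], size=n)   # labels in {-1, 1} for logistic regression

def grad_least_squares(w):
    # Gradient of 0.5 * ||Xw - y||^2
    return X.T @ (X @ w - y_reg)

def grad_logistic(w):
    # Gradient of sum_i log(1 + exp(-y_i x_i^T w))
    margins = y_cls * (X @ w)
    coeff = -y_cls / (1.0 + np.exp(margins))
    return X.T @ coeff

# Largest eigenvalue of X^T X (eigvalsh returns eigenvalues in ascending order)
lam_max = np.linalg.eigvalsh(X.T @ X)[-1]
L_ls = lam_max          # Lipschitz constant of the least-squares gradient
L_log = lam_max / 4.0   # valid Lipschitz constant for the logistic gradient

def lipschitz_holds(grad, L, trials=1000):
    # Check ||grad(w) - grad(w')|| <= L * ||w - w'|| on random pairs
    for _ in range(trials):
        w, w2 = rng.standard_normal(d), rng.standard_normal(d)
        lhs = np.linalg.norm(grad(w) - grad(w2))
        if lhs > L * np.linalg.norm(w - w2) + 1e-9:
            return False
    return True

ls_ok = lipschitz_holds(grad_least_squares, L_ls)
log_ok = lipschitz_holds(grad_logistic, L_log)
print(ls_ok, log_ok)
```

A random check of this kind cannot prove the inequality; it only fails to find a counterexample. For the quadratic example with symmetric $A$, the same test could be run with $\nabla f(w) = Aw - b$ and $L = \|A\|_2$.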

**Remark**: The $C^{1,1}_{L}$ assumption is a simplifying one. In practice:
- Functions are not always $C^{1}$.
- Functions/gradients are not always Lipschitz continuous on the whole space $\mathbb{R}^d$. $\rightarrow$ It is possible to use local Lipschitz constants instead of a global Lipschitz constant.

$C^{1,1}_{L}$ functions are sometimes called $L$-smooth functions, and $L$ is called the smoothness constant.

**Properties of $C^{1,1}_{L}$ functions**: If $f$ is $C^{1,1}_{L}$ and convex, then

- $\forall w, w' \in \mathbb{R}^d$, $f(w') \leq f(w) + \nabla f(w)^T (w' - w) + \frac{L}{2} \|w' - w\|^2$ $\rightarrow$ Upper bound on $f(w')$ by a quadratic function of $w'$.
- $\forall w, w' \in \mathbb{R}^d$, $f(w') \geq f(w) + \nabla f(w)^T (w' - w)$ $\rightarrow$ Lower bound on $f(w')$ by a linear function of $w'$. (This is actually a characterization of convexity for $C^1$ functions.)
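The two bounds can be checked numerically on a simple instance. The sketch below uses the least-squares objective with synthetic data and $L = \lambda_{\max}(X^T X)$; the data, the seed, and the names `f`, `grad_f` and `bounds_hold` are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)   # fixed seed, synthetic data (assumption)
n, d = 40, 4
X = rng.standard_normal((n, d))
y = rng.standard_normal(n)

def f(w):
    # Least-squares objective 0.5 * ||Xw - y||^2
    r = X @ w - y
    return 0.5 * (r @ r)

def grad_f(w):
    return X.T @ (X @ w - y)

# Smoothness constant of f: largest eigenvalue of X^T X
L = np.linalg.eigvalsh(X.T @ X)[-1]

bounds_hold = True
for _ in range(1000):
    w, w2 = rng.standard_normal(d), rng.standard_normal(d)
    linear = f(w) + grad_f(w) @ (w2 - w)                  # linear lower bound
    quadratic = linear + 0.5 * L * np.sum((w2 - w) ** 2)  # quadratic upper bound
    tol = 1e-9 * (1.0 + abs(f(w2)))                       # floating-point slack
    if not (linear - tol <= f(w2) <= quadratic + tol):
        bounds_hold = False
print(bounds_hold)
```

For this objective the gap between $f(w')$ and the linear bound is exactly $\frac{1}{2}\|X(w' - w)\|^2$, which is why both inequalities hold with this choice of $L$.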

**Toward gradient descent**:
1. Suppose that we are at $w \in \mathbb{R}^d$ and we know $f(w)$ and $\nabla f(w)$.
2. If $\|\nabla f(w)\| = 0$ then $w$ is a global minimum of $f$ (because $f$ is convex).
3. When $\|\nabla f(w)\| \neq 0$, the quadratic upper bound from the properties above, applied with $w' = w + v$, lets us find a step $v$ such that $f(w) + \nabla f(w)^T v + \frac{L}{2} \|v\|^2 < f(w)$, which implies $f(w + v) \leq f(w) + \nabla f(w)^T v + \frac{L}{2} \|v\|^2 < f(w)$. For instance, $v = -\frac{1}{L} \nabla f(w)$ minimizes the upper bound and gives $f(w + v) \leq f(w) - \frac{1}{2L} \|\nabla f(w)\|^2$.
4. We can then replace $w$ by $w + v$ and repeat the process until $\|\nabla f(w)\| = 0$.
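The steps above can be sketched as a short loop. This is a minimal illustration under stated assumptions, not necessarily the variant developed in the lecture: the step is taken as $-\frac{1}{L} \nabla f(w)$ (the minimizer of the quadratic upper bound), the objective is least squares on synthetic data, and the exact test $\|\nabla f(w)\| = 0$ is replaced by a small tolerance `tol`. The function name `gradient_descent` and its parameters are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(2)   # fixed seed, synthetic data (assumption)
n, d = 40, 4
X = rng.standard_normal((n, d))
y = rng.standard_normal(n)

def f(w):
    r = X @ w - y
    return 0.5 * (r @ r)

def grad_f(w):
    return X.T @ (X @ w - y)

# Smoothness constant of the least-squares objective
L = np.linalg.eigvalsh(X.T @ X)[-1]

def gradient_descent(w0, max_iter=10_000, tol=1e-8):
    # Repeat w <- w - (1/L) * grad f(w) until the gradient norm is below tol
    w = w0.copy()
    values = [f(w)]
    for _ in range(max_iter):
        g = grad_f(w)
        if np.linalg.norm(g) <= tol:   # numerical stand-in for ||grad f(w)|| = 0
            break
        w = w - g / L                  # step minimizing the quadratic upper bound
        values.append(f(w))
    return w, values

w_final, values = gradient_descent(np.zeros(d))
# Reference solution from a direct least-squares solver
w_star = np.linalg.lstsq(X, y, rcond=None)[0]
print(len(values) - 1, np.linalg.norm(w_final - w_star))
```

The list `values` records $f$ along the iterates; by the upper bound, each step with this step size cannot increase the objective, which can be verified by inspecting the list.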