# Optimisation for Machine Learning

September 20, 2023

### Logistics
- Contact: [Clement Royer](mailto:clement.royer@lamsade.dauphine.fr)
- Course website: [URL](https://www.lamsade.dauphine.fr/%7Ecroyer/teachOAA.html)
- Exam: 60% (2h), December 13, 2023, 10:00 AM - 12:00 PM
- Project: 40%, from October 6, 2023 to December 23, 2023

In [3]:
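import numpy as np
import matplotlib.pyplot as plt
# Plot style. In Matplotlib 3.6+ this style is named 'seaborn-v0_8-paper'.
plt.style.use('seaborn-paper')
# Render text with LaTeX (requires a LaTeX installation) in a serif font.
plt.rc('text', usetex=True)
plt.rc('font', family='serif')
# Use the same 18 pt size for body text, titles, labels, ticks and legends.
plt.rc('font', size=18)
plt.rc('axes', titlesize=18)
plt.rc('axes', labelsize=18)
plt.rc('xtick', labelsize=18)
plt.rc('ytick', labelsize=18)
plt.rc('legend', fontsize=18)
# Larger markers and a 12 x 12 inch default figure.
plt.rc('lines', markersize=10)
plt.rcParams['figure.figsize'] = [12, 12]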

#### What is optimisation?
- Field of study that is concerned with finding the best decision (or decisions) from a set of available alternatives. In the simplest case, this is the problem of finding a maximum (or minimum) of a real function $f(x)$ of a real variable $x$.
- As a subfield of theoretical computer science and computational mathematics, it matured between the 1980s and the 2000s, when efficient software packages became available.
- Over the past two decades (2000s to 2020s), the focus shifted to data-driven optimisation: the rise of machine learning and big data led to specific optimisation problems and formulations, and changed the preferred classes of algorithms. Some old algorithms became trendy again, while others (e.g. Newton's method) lost relevance.

#### Typical optimisation problem in ML:
Start with data $D = \{(x_i, y_i)\}_{i=1}^n$ and a model $h(x, \theta)$, where $\theta$ is the parameter to be optimised. The goal is to discover a mapping between the inputs $x_i$ and the outputs $y_i$, that is, to choose $h$ such that $h(x_i, \theta) \approx y_i$ for all $i$.

#### Process: Training/Learning
Typically $h$ is parameterised by a vector $\theta \in R^d$, and learning consists of finding the best $\theta$ such that $h(x_i, \theta) \approx y_i$ for all $i$.
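
To make the setup concrete, here is a minimal sketch under stated assumptions: the notes do not fix a particular $h$, so the linear model $h(x, \theta) = x^\top \theta$, the synthetic data and the name `theta_true` are all hypothetical choices made for illustration.

```python
import numpy as np

def h(X, theta):
    """Assumed linear model: one prediction x_i^T theta per row of X."""
    return X @ theta

rng = np.random.default_rng(0)
n, d = 50, 3
X = rng.normal(size=(n, d))                # inputs x_1, ..., x_n stacked as rows
theta_true = np.array([1.0, -2.0, 0.5])    # hypothetical ground-truth parameter
y = h(X, theta_true) + 0.01 * rng.normal(size=n)  # outputs with small noise

# Largest deviation |h(x_i, theta) - y_i| for a good and a poor parameter
max_err_true = np.max(np.abs(h(X, theta_true) - y))
max_err_zero = np.max(np.abs(h(X, np.zeros(d)) - y))
print(max_err_true, max_err_zero)
```

A parameter close to `theta_true` gives $h(x_i, \theta) \approx y_i$ for all $i$, whereas $\theta = 0$ does not. Learning amounts to finding such a parameter from the data alone.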

An optimisation problem associated with this task has the following form:
$$
\min_{\theta \in R^d} f_D(\theta) + \lambda R(\theta)
$$
where $f_D(\theta)$ is the loss function, $R(\theta)$ is the regularisation term and $\lambda$ is the regularisation parameter.
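
The formulation above leaves $f_D$ and $R$ unspecified. As one possible instance, the sketch below assumes a mean squared error loss for a linear model and a squared Euclidean norm regulariser. Both choices, and the helper name `objective`, are illustrative and not prescribed by the notes.

```python
import numpy as np

def f_D(theta, X, y):
    """Assumed loss: half the mean squared error of a linear model."""
    residual = X @ theta - y
    return 0.5 * np.mean(residual ** 2)

def R(theta):
    """Assumed regulariser: half the squared Euclidean norm."""
    return 0.5 * np.dot(theta, theta)

def objective(theta, X, y, lam):
    """Regularised objective f_D(theta) + lam * R(theta)."""
    return f_D(theta, X, y) + lam * R(theta)

X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
y = np.array([1.0, 2.0, 3.0])
theta = np.array([1.0, 2.0])   # fits this toy data exactly, so f_D(theta) = 0

print(objective(theta, X, y, lam=0.0))   # loss only
print(objective(theta, X, y, lam=0.1))   # loss plus penalty on the size of theta
```

With $\lambda = 0$ only the data fit matters. Increasing $\lambda$ penalises large parameter values even when the fit is perfect.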

#### Notation:
From here on, the parameter vector (written $\theta$ above) is denoted $w$.
- Parameters: $w = \begin{bmatrix} w_1 \\ \vdots \\ w_d \end{bmatrix} \in R^d$ is the vector of parameters to be optimised.
- Objective function: $f(w) = f_D(w) + \lambda R(w)$ is the function to be minimised.
- Optimal solution: $w^* \in \arg \min_{w \in R^d} f(w)$ is a vector of parameters that minimises the objective function.
- Optimal value: $f^* = f(w^*)$ is the value of the objective function at an optimal solution.
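
To illustrate $w^*$ and $f^*$, the sketch below assumes a ridge-regression objective $f(w) = \frac{1}{2n}\|Xw - y\|^2 + \frac{\lambda}{2}\|w\|^2$ on random data. This is one hypothetical choice of $f$, picked because its minimiser is available in closed form by setting the gradient to zero.

```python
import numpy as np

def f(w, X, y, lam):
    """Assumed ridge objective: (1/2n)||Xw - y||^2 + (lam/2)||w||^2."""
    n = X.shape[0]
    return 0.5 / n * np.sum((X @ w - y) ** 2) + 0.5 * lam * np.dot(w, w)

rng = np.random.default_rng(0)
n, d, lam = 100, 4, 0.1
X = rng.normal(size=(n, d))
y = rng.normal(size=n)

# Gradient = 0  <=>  (X^T X / n + lam * I) w = X^T y / n
A = X.T @ X / n + lam * np.eye(d)
b = X.T @ y / n
w_star = np.linalg.solve(A, b)   # optimal solution w*
f_star = f(w_star, X, y, lam)    # optimal value f* = f(w*)

# Sanity check: random perturbations of w* should not decrease f
perturbed = [f(w_star + 0.1 * rng.normal(size=d), X, y, lam) for _ in range(5)]
print(f_star, min(perturbed))
```

Most objectives met in machine learning have no such closed form, which is why iterative algorithms are needed to approximate $w^*$.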