In [1]:
import numpy as np
import matplotlib.pyplot as plt
plt.style.use('default')
plt.rc('text', usetex=True)
plt.rc('font', family='sans-serif')
plt.rc('font', size=14)
plt.rc('axes', titlesize=14)
plt.rc('axes', labelsize=14)
plt.rc('xtick', labelsize=14)
plt.rc('ytick', labelsize=14)
plt.rc('legend', fontsize=14)
plt.rc('lines', markersize=10)

## Large scale optimisation

$\min_{w \in \mathbb{R}^d} f(w)$ with $d \gg 1$. We want to avoid computations that involve all $d$ coordinates of $w$ at every iteration.

Approach: At each iteration, update a single coordinate $w_i$.

#### Coordinate descent method

Start from $w_0 \in \mathbb{R}^d$.
At iteration $k$, choose $j_k \in \{1, \dots, d\}$ and update $w_{k+1} = w_k - \alpha_k \nabla_{j_k} f(w_k) e_{j_k}$, where $\alpha_k > 0$ is the step size, $e_{j_k}$ is the $j_k$-th canonical basis vector and $\nabla_{j_k} f(w_k)$ is the partial derivative of $f$ with respect to the $j_k$-th coordinate.

This is interesting when $\nabla_{j_k} f(w_k)$ can be computed without having to access the full vector $w_k$.

Example:
Sparse least squares with $\ell_2$ regularisation:
$$
\mathbf{X} \in \mathbb{R}^{n \times d}, \quad y \in \mathbb{R}^n, \quad \lambda > 0, \qquad \min_{w \in \mathbb{R}^d} f(w), \quad f(w) = \frac{1}{2n} \| \mathbf{X} w - y \|_2^2 + \lambda \| w \|_2^2
$$

Gradient:
$$
\nabla f(w) = \frac{1}{n} \mathbf{X}^T (\mathbf{X} w - y) + 2 \lambda w
$$
The cost of computing $\nabla f(w)$ depends on the number of non-zero entries in $\mathbf{X}$.

For $j = 1, \dots, d$, with $x_i^T$ the $i$-th row of $\mathbf{X}$:
$$
\nabla_j f(w) = \frac{1}{n} \sum_{i=1}^n x_{ij} (x_i^T w - y_i) + 2 \lambda w_j
$$
The cost of computing $\nabla_j f(w)$ depends only on the number of non-zero entries in the $j$-th column of $\mathbf{X}$.
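
The partial derivative above can be sketched in NumPy as follows. This is a minimal illustration, not a reference implementation: the function names, the cached residual $\mathbf{X} w - y$ and the random test data are all assumptions introduced here. Caching the residual is what lets a coordinate step touch only the rows where column $j$ is non-zero.

```python
import numpy as np

def partial_derivative(X, residual, w, lam, j):
    """j-th partial derivative of f, given the cached residual X @ w - y."""
    n = X.shape[0]
    rows = np.flatnonzero(X[:, j])  # only rows with x_ij != 0 contribute
    return X[rows, j] @ residual[rows] / n + 2 * lam * w[j]

def coordinate_step(X, residual, w, lam, j, alpha):
    """In-place step w_j <- w_j - alpha * grad_j, keeping the residual in sync."""
    g = partial_derivative(X, residual, w, lam, j)
    delta = -alpha * g
    w[j] += delta
    rows = np.flatnonzero(X[:, j])
    residual[rows] += delta * X[rows, j]  # X @ w changes only on these rows
    return g

# Hypothetical sparse test problem (roughly 80% of the entries are zero).
rng = np.random.default_rng(0)
n, d, lam = 50, 20, 0.1
X = rng.normal(size=(n, d)) * (rng.random((n, d)) < 0.2)
y = rng.normal(size=n)
w = np.zeros(d)
residual = X @ w - y

# Stacking the d partial derivatives recovers the full gradient.
full_grad = X.T @ (X @ w - y) / n + 2 * lam * w
coord_grad = np.array([partial_derivative(X, residual, w, lam, j) for j in range(d)])
```

With a dense array, `np.flatnonzero` still scans the whole column; with a compressed sparse column representation of $\mathbf{X}$ the non-zero row indices would be stored directly, so the step cost would match the count of non-zeros in column $j$.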

#### Variants of coordinate descent

How should we choose $j_k$?

Cyclic coordinate descent: $j_k$ cycles through the coordinates $1, 2, \dots, d$ in a fixed order. This might not converge (counterexample by Powell).

A variant cycles through a permutation of the coordinates $1, \dots, d$ that is redrawn every $d$ iterations. It has better convergence guarantees than the fixed-order version.

Randomised coordinate descent: at every iteration, $j_k$ is drawn uniformly at random from $\{1, \dots, d\}$. This comes with good convergence guarantees.
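
The selection rules above can be sketched as index generators. This is only an illustration: the generator names are made up here, and indices are 0-based ($0, \dots, d-1$) to match NumPy rather than the $1, \dots, d$ used in the text.

```python
from itertools import islice

import numpy as np

def cyclic(d):
    """Fixed order 0, 1, ..., d-1, repeated forever."""
    while True:
        yield from range(d)

def shuffled_cyclic(d, rng):
    """A fresh random permutation of the coordinates every d iterations."""
    while True:
        yield from (int(j) for j in rng.permutation(d))

def uniform_random(d, rng):
    """One coordinate drawn uniformly at random at every iteration."""
    while True:
        yield int(rng.integers(d))

rng = np.random.default_rng(0)
first_cycle = list(islice(cyclic(3), 7))              # fixed order, wraps around
one_pass = list(islice(shuffled_cyclic(5, rng), 5))   # each coordinate exactly once
draws = list(islice(uniform_random(4, rng), 20))      # repeats are possible
```

Any of these generators can drive the same coordinate descent loop, which makes it easy to compare the rules on one problem.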

Block coordinate descent: we update a few coordinates at every iteration.

Example: Given matrix $\mathbf{M} \in \mathbb{R}^{n \times d}$, one wants to find a low rank approximation of $\mathbf{M}$:
$$
\mathbf{L} \in \mathbb{R}^{n \times r}, \quad \mathbf{R} \in \mathbb{R}^{d \times r}, \quad r \ll \min(n, d), \qquad \mathbf{M} \approx \mathbf{L} \mathbf{R}^T
$$
Find $\mathbf{L}$ and $\mathbf{R}$ such that $\| \mathbf{M} - \mathbf{L} \mathbf{R}^T \|_F^2$ is minimised. This problem is non-convex and difficult to solve.

Block coordinate descent over $\mathbf{L}$ and $\mathbf{R}$:
At iteration $k$, update the blocks $L_{i_k, :}$ and $R_{:, j_k}$.
$$
\begin{aligned}
L_{i_k, :} &\leftarrow \arg \min_{L_{i_k, :}} \| \mathbf{M} - \mathbf{L} \mathbf{R}^T \|_F^2 \\
R_{:, j_k} &\leftarrow \arg \min_{R_{:, j_k}} \| \mathbf{M} - \mathbf{L} \mathbf{R}^T \|_F^2
\end{aligned}
$$
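
The block updates above can be sketched with exact least-squares solves. Several choices here are assumptions rather than part of the notes: the factors are taken with a small inner dimension `r`, the block of $\mathbf{R}$ is taken to be the row `R[j, :]` (that is, column $j$ of $\mathbf{R}^T$), the blocks are picked at random, and the function names and test matrix are hypothetical.

```python
import numpy as np

def objective(M, L, R):
    return np.linalg.norm(M - L @ R.T, "fro") ** 2

def update_row_of_L(M, L, R, i):
    # Only row i of M - L R^T depends on L[i, :], so this is a small
    # least-squares problem: min over l of ||M[i, :] - R @ l||^2.
    L[i, :] = np.linalg.lstsq(R, M[i, :], rcond=None)[0]

def update_row_of_R(M, L, R, j):
    # Only column j of M - L R^T depends on R[j, :]:
    # min over v of ||M[:, j] - L @ v||^2.
    R[j, :] = np.linalg.lstsq(L, M[:, j], rcond=None)[0]

# Hypothetical test matrix of exact rank r.
rng = np.random.default_rng(0)
n, d, r = 30, 20, 3
M = rng.normal(size=(n, r)) @ rng.normal(size=(r, d))
L = rng.normal(size=(n, r))
R = rng.normal(size=(d, r))

history = [objective(M, L, R)]
for k in range(2000):
    update_row_of_L(M, L, R, rng.integers(n))
    update_row_of_R(M, L, R, rng.integers(d))
    history.append(objective(M, L, R))
```

Each block subproblem is convex even though the joint problem is not, and since every update minimises the objective exactly over its block, the recorded objective values cannot increase (up to rounding).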

Coordinate descent is well suited to a distributed setting: the data can be distributed across processors and the partial derivatives computed in parallel. At every iteration, each processor chooses a random coordinate and updates it, in parallel with the others. This is called parallel coordinate descent.

Synchronous:
$w_k \rightarrow \{\text{every processor updates its own coordinate}\} \rightarrow w_{k+1}$

Asynchronous:
$w_k \rightarrow \{\text{every processor updates its own coordinate without waiting for the others}\} \rightarrow w_{k+1}$

The asynchronous version works well in practice, as in the Hogwild! algorithm (Recht et al., 2011). Hogwild! combines the advantages of SGD and coordinate descent: it is an SGD algorithm in which the stochastic gradients are computed in parallel, and each processor updates only the coordinates of the shared $w$ touched by its gradient, without locking.
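
The synchronous scheme can be simulated in a single process, as in the sketch below. Everything specific here is an assumption made for illustration: the quadratic test objective $\frac{1}{2} w^T A w - b^T w$, the number of simulated processors `P`, and the conservative step size $1 / (P \max_j A_{jj})$, which is chosen so that applying $P$ coordinate updates at once cannot increase this particular objective.

```python
import numpy as np

def f(A, b, w):
    """Hypothetical test objective: a strongly convex quadratic."""
    return 0.5 * w @ A @ w - b @ w

rng = np.random.default_rng(0)
d, P = 40, 4                          # P plays the role of the number of processors
B = rng.normal(size=(d, d))
A = B.T @ B / d + np.eye(d)           # symmetric positive definite
b = rng.normal(size=d)
w = np.zeros(d)
alpha = 1.0 / (P * A.diagonal().max())  # conservative step for P simultaneous updates

values = [f(A, b, w)]
for k in range(500):
    coords = rng.choice(d, size=P, replace=False)  # one distinct coordinate per processor
    grads = A[coords] @ w - b[coords]              # every processor reads the same w_k
    w[coords] -= alpha * grads                     # barrier: all updates applied together
    values.append(f(A, b, w))
```

An asynchronous version would drop the barrier, so that a processor may compute its partial derivative from a vector that other processors have already modified. That behaviour depends on real concurrency and is not reproduced by this single-process sketch.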

#### Decentralised learning

$$ \min_{w \in \mathbb{R}^d} \sum_{i=1}^n f_i(w) $$

The setting: we have $n$ agents, and agent $i$ only has access to its own $f_i$. How can we minimise the sum in a decentralised way, with no agent having access to all of the $f_i$?

Each agent keeps its own copy of the variables, $w_i \in \mathbb{R}^d$, and minimises $f_i(w_i)$, subject to the consensus constraint $w_i = z$ for all $i$, where $z \in \mathbb{R}^d$ is a shared variable.
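
A small worked example shows why the agents must agree on a common point. Assume quadratic local objectives $f_i(w) = \frac{1}{2} \| w - a_i \|_2^2$, where the vectors $a_i \in \mathbb{R}^d$ are hypothetical data held by agent $i$ and are not part of the setting above. Then:

$$
\nabla \sum_{i=1}^n f_i(w) = \sum_{i=1}^n (w - a_i) = 0 \iff w^\star = \frac{1}{n} \sum_{i=1}^n a_i, \qquad \arg \min_{w_i \in \mathbb{R}^d} f_i(w_i) = a_i
$$

Left alone, agent $i$ would stop at $a_i$. The minimiser of the sum is the average of the $a_i$, which the agents can only reach if their local copies are forced to coincide.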

