# Nonlinear optimization: preconditioning

## Introduction to optimization and operations research

Michel Bierlaire


In [None]:

import numpy as np



Consider the function $f:\mathbb{R}^2 \to \mathbb{R}$ defined as
$$
f(x)= \frac{1}{2}x_1^2 + \frac{101}{2} x_2^2 +  x_1 x_2.
$$

# Question 1
Implement a function in Python that calculates the function and its first and second derivatives.

The derivatives of the function are
$$
\nabla f(x) = \begin{pmatrix*} x_1+x_2
\\ x_1+101x_2 \end{pmatrix*},\;
\nabla^2 f(x) =
\begin{pmatrix*}
1 & 1 \\  1 & 101
\end{pmatrix*}.
$$


In [None]:
def the_function(x: np.ndarray) -> tuple[float, np.ndarray, np.ndarray]:
    """Calculates the function and its derivatives

    :param x: a vector of dimension 2
    :return: a tuple with the value of the function, the gradient and the second derivatives matrix
    """
    f = 0.5 * x[0] * x[0] + 101 / 2 * x[1] * x[1] + x[0] * x[1]
    g = np.array([x[0] + x[1], x[0] + 101 * x[1]])
    h = np.array([[1, 1], [1, 101]])
    return f, g, h



Test it at the point $x = (1, 1)$.

In [None]:
x = np.array([1, 1])
function, gradient, hessian = the_function(x)
print(f'f(x)={function}')
print(f'gradient(x)={gradient}')
print(f'hessian(x)=\n{hessian}')
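
A quick way to gain confidence in the analytical gradient is to compare it with a finite-difference approximation. The sketch below is only an illustration: it repeats the definition of `the_function` so that it runs on its own, and the helper `finite_difference_gradient`, the step `eps` and the tolerance are assumptions introduced here, not part of the exercise.

```python
import numpy as np


def the_function(x):
    """Same function as above: value, gradient and Hessian of f."""
    f = 0.5 * x[0] ** 2 + 101 / 2 * x[1] ** 2 + x[0] * x[1]
    g = np.array([x[0] + x[1], x[0] + 101 * x[1]])
    h = np.array([[1.0, 1.0], [1.0, 101.0]])
    return f, g, h


def finite_difference_gradient(x, eps=1e-6):
    """Central finite-difference approximation of the gradient (hypothetical helper)."""
    x = np.asarray(x, dtype=float)
    approx = np.zeros_like(x)
    for i in range(len(x)):
        step = np.zeros_like(x)
        step[i] = eps
        # Evaluate f on both sides of x along coordinate i.
        f_plus, _, _ = the_function(x + step)
        f_minus, _, _ = the_function(x - step)
        approx[i] = (f_plus - f_minus) / (2 * eps)
    return approx


x_test = np.array([1.0, 1.0])
_, analytical_gradient, _ = the_function(x_test)
numerical_gradient = finite_difference_gradient(x_test)
print(f'{analytical_gradient=}')
print(f'{numerical_gradient=}')
print(np.allclose(analytical_gradient, numerical_gradient, atol=1e-5))
```

Because $f$ is quadratic, the central difference is exact up to rounding errors, so the two vectors should agree closely.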


# Question 2
Let $L_k$ be a nonsingular matrix. Consider the change of variables
$$
x' = L_k^T x,
$$
and the function in the new variables
$$
\tilde{f}(x') = f(L_k^{-T} x').
$$

The gradient of the new function is
$$ \nabla \tilde{f}(x') = L_k^{-1} \nabla f(L_k^{-T} x'),$$
that is, the solution of the system
$$L_k \nabla \tilde{f}(x') = \nabla f(L_k^{-T} x').$$

The Hessian of the new function is
$$ \nabla^2 \tilde{f}(x') = L_k^{-1} \nabla^2 f(L_k^{-T} x') L_k^{-T},$$
that is, the solution of the system
$$L_k \nabla^2 \tilde{f}(x') =  D^T_k,$$
where $D_k$ is the solution of the system
$$\nabla^2 f(L_k^{-T} x') = L_k D_k.$$
Indeed, as $\nabla^2 f$ is symmetric, $D_k^T = \nabla^2 f(L_k^{-T} x') L_k^{-T}$.

Implement a Python function that calculates this function and its first and second derivatives.

In [None]:


def preconditioned_function(
    x: np.ndarray, l_k: np.ndarray
) -> tuple[float, np.ndarray, np.ndarray]:
    """Calculates the preconditioned function, its gradient and its Hessian.

    :param x: a vector of dimension 2, expressed in the new variables.
    :param l_k: matrix defining the change of variables.
    :return: a tuple with the value of the function, the gradient and the Hessian.
    """
    x_original = np.linalg.solve(l_k.T, x)
    f, g, h = the_function(x_original)
    the_gradient = np.linalg.solve(l_k, g)
    d_k = np.linalg.solve(l_k, h)
    the_hessian = np.linalg.solve(l_k, d_k.T)
    return f, the_gradient, the_hessian
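
The two successive linear solves used for the Hessian can be compared with the direct formula $L_k^{-1} \nabla^2 f \, L_k^{-T}$. The following is a minimal sketch under stated assumptions: the matrix `l_example` is an arbitrary nonsingular lower triangular matrix chosen for illustration only, and the explicit inverse is computed solely for the comparison.

```python
import numpy as np

# Hessian of f (constant) and an arbitrary nonsingular lower triangular
# matrix, chosen here only for illustration.
hessian_f = np.array([[1.0, 1.0], [1.0, 101.0]])
l_example = np.array([[2.0, 0.0], [1.0, 3.0]])

# Two-solve approach: first L D = H, then L X = D^T.
d_example = np.linalg.solve(l_example, hessian_f)
hessian_by_solves = np.linalg.solve(l_example, d_example.T)

# Direct formula L^{-1} H L^{-T}, with an explicit inverse.
l_inverse = np.linalg.inv(l_example)
hessian_by_inverse = l_inverse @ hessian_f @ l_inverse.T

print(hessian_by_solves)
print(hessian_by_inverse)
print(np.allclose(hessian_by_solves, hessian_by_inverse))
```

Solving linear systems instead of forming the inverse is the usual choice in numerical code, which is presumably why the function above is written that way.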



Let $L_k$ be the Cholesky factor of the second derivatives matrix, that is, the lower triangular matrix such that
$$
L_k L_k^T=  \nabla^2 f(x_k).
$$

In [None]:
l_k = np.linalg.cholesky(hessian)
print(l_k)


Check that it is indeed the Cholesky factor by calculating $L_k L_k^T$.

In [None]:
print(l_k @ l_k.T)


The product must be equal to the Hessian.

In [None]:
print(hessian)


Evaluate the preconditioned function at the point $x'=L_k^T x$, where $x=(1,1)$.

In [None]:
prec_x = l_k.T @ x
print(f'{prec_x=}')


In [None]:
prec_function, prec_gradient, prec_hessian = preconditioned_function(prec_x, l_k)
print(f'Preconditioned f(x)={prec_function}')
print(f'Preconditioned gradient(x)={prec_gradient}')
print(f'Preconditioned hessian(x)=\n{prec_hessian}')



Note that the second derivatives matrix is the identity matrix.

Here are some additional details to understand the above results.
The matrix $L_k$ defining the change of variables is the
Cholesky factor of the second derivatives matrix, that is
$$
L_k = \begin{pmatrix*}
1 & 0 \\ 1 & 10
\end{pmatrix*}, \; \forall k,
$$
as
$$
L_k L_k^T = \begin{pmatrix*}
1 & 1 \\  1 & 101
\end{pmatrix*}.
$$
Therefore, the change of variables is $x'=L_k^Tx$, that is
$$
x' = \begin{pmatrix*}
1 & 1 \\ 0 & 10
\end{pmatrix*} x
$$
or
\begin{align*}
x_1' &= x_1 + x_2 \\
x_2' &= 10 x_2.
\end{align*}

We write the change of variables in the opposite direction as
$x=L_k^{-T} x'$, that is
$$
x = \left(\begin{array}{rr}
1 & -\frac{1}{10} \\ 0 & \frac{1}{10}
\end{array}\right) x'
$$
or
\begin{align*}
x_1 &= x_1' -\frac{1}{10} x_2', \\
x_2 &= \frac{1}{10} x_2'.
\end{align*}
The function $\tilde{f}$ is therefore defined as
\begin{align*}
\tilde{f}(x') &= f(x_1' -\frac{1}{10} x_2',\frac{1}{10} x_2') \\
&=  \frac{1}{2}\left(x_1' -\frac{1}{10} x_2'\right)^2 +
\frac{101}{2} \left(\frac{1}{10} x_2'\right)^2 +  \left(x_1'
-\frac{1}{10} x_2'\right) \left(\frac{1}{10} x_2'\right), \\
&= \frac{1}{2} x_1'^2 + \frac{1}{2} x_2'^2.
\end{align*}

The derivatives of $\tilde{f}$ are
$$
\nabla \tilde{f}(x'_1,x'_2) = \left(\begin{array}{c}x'_1 \\ x'_2\end{array}\right),\;
\nabla^2 \tilde{f} =
\begin{pmatrix*}
1 & 0 \\ 0 & 1
\end{pmatrix*}.
$$
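
The closed form $\tilde{f}(x') = \frac{1}{2} x_1'^2 + \frac{1}{2} x_2'^2$ can also be checked numerically. The sketch below evaluates $f(L_k^{-T} x')$ at a few random points and compares it with $\frac{1}{2}\|x'\|^2$. The names `l_factor` and `f_original`, the random seed and the number of samples are assumptions made for this illustration.

```python
import numpy as np

# Cholesky factor derived above.
l_factor = np.array([[1.0, 0.0], [1.0, 10.0]])


def f_original(x):
    """The function f in the original variables."""
    return 0.5 * x[0] ** 2 + 101 / 2 * x[1] ** 2 + x[0] * x[1]


rng = np.random.default_rng(seed=0)
max_gap = 0.0
for _ in range(5):
    x_prime = rng.normal(size=2)
    # Back to the original variables: x = L^{-T} x'.
    x_back = np.linalg.solve(l_factor.T, x_prime)
    f_tilde = f_original(x_back)
    closed_form = 0.5 * x_prime @ x_prime
    max_gap = max(max_gap, abs(f_tilde - closed_form))
    print(f'{f_tilde:.6f} {closed_form:.6f}')
print(f'{max_gap=:.3g}')
```

The two columns should coincide up to rounding errors.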

# Question 3
Apply one iteration of the steepest descent
algorithm to $\tilde{f}$ from the point $x'=L_k^T x$, where $x=(1,1)$, that is
$$
x'_{k+1} = x'_k - \alpha \nabla \tilde{f}(x'_k),
$$
where the step size is
$$
\alpha = \frac{\nabla \tilde{f}(x'_k)^T \nabla \tilde{f}(x'_k)}{\nabla
\tilde{f}(x'_k)^T \nabla^2 \tilde{f}(x'_k) \nabla \tilde{f}(x'_k)}.
$$
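The resulting iterate $x'_{k+1}$ is the Cauchy point, that is, the minimizer of the quadratic model along the steepest descent direction.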


The starting point and its image in the new variables are
$$
x_0 = \left(\begin{array}{c}1 \\ 1\end{array}\right), \; {x_0}' = \left(\begin{array}{c}2 \\ 10\end{array}\right),
$$
so that
$$
\nabla \tilde{f}({x_0}') = \left(\begin{array}{c}2 \\ 10\end{array}\right).
$$
The step to perform along the steepest descent direction is
$$
\alpha = \frac{\nabla \tilde{f}({x_0}')^T\nabla
\tilde{f}({x_0}')}{\nabla \tilde{f}({x_0}')^T\nabla^2
\tilde{f}({x_0}')\nabla \tilde{f}({x_0}')}.
$$
As $\nabla^2 \tilde{f}({x_0}')$ is the identity matrix, the numerator and the denominator are equal, and
$\alpha=1$. In fact, this holds for any $x_0$ where the gradient is not zero.

In [None]:
alpha = np.dot(prec_gradient, prec_gradient) / np.dot(
    prec_gradient, prec_hessian @ prec_gradient
)
print(f'{alpha=:.3g}')


Therefore, we obtain
$$
x'_1 = x'_0 - \alpha \nabla \tilde{f}({x_0}') = \left(\begin{array}{c}2 \\ 10\end{array}\right)
- \left(\begin{array}{c}2 \\ 10\end{array}\right) = \left(\begin{array}{c}0 \\ 0\end{array}\right).
$$


In [None]:
new_prec_x = prec_x - alpha * prec_gradient
print(f'{new_prec_x=}')


Identify the corresponding point in the original variables.
In the original variables, we have
\begin{align*}
x_1 &= x_1' -\frac{1}{10} x_2' = 0, \\
x_2 &= \frac{1}{10} x_2' = 0.
\end{align*}
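This is the optimal solution of the problem, as $\nabla f(0) = 0$ and $\nabla^2 f$ is positive definite. It is no coincidence: when $L_k$ is the Cholesky factor of the Hessian, the preconditioned steepest descent step with $\alpha=1$ coincides with Newton's step, which minimizes a strictly convex quadratic function in one iteration.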

In [None]:
new_x = np.linalg.solve(l_k.T, new_prec_x)
print(f'{new_x=}')
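
For comparison, here is a sketch of the same Cauchy step applied directly to $f$ in the original variables, without preconditioning. It is not part of the exercise. The names `hessian_f`, `x_start`, `alpha_plain` and `x_plain` are introduced here for illustration, and the gradient is calculated as $\nabla^2 f \, x$, which is valid because $f(x) = \frac{1}{2} x^T \nabla^2 f \, x$.

```python
import numpy as np

hessian_f = np.array([[1.0, 1.0], [1.0, 101.0]])
x_start = np.array([1.0, 1.0])

# Gradient of the quadratic f at x_start: (x1 + x2, x1 + 101 x2).
gradient_f = hessian_f @ x_start

# Exact step along the steepest descent direction (Cauchy point).
alpha_plain = (gradient_f @ gradient_f) / (gradient_f @ hessian_f @ gradient_f)
x_plain = x_start - alpha_plain * gradient_f

print(f'{alpha_plain=:.3g}')
print(f'{x_plain=}')
```

In this setting the step is much shorter than 1 and the iterate does not reach the origin, which illustrates the effect of the change of variables.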