# L9c: Nonlinear Optimization
To solve many real-world problems, we need methods for optimizing nonlinear objective functions subject to equality and inequality constraints. We explore the theoretical foundations and practical algorithms for constrained nonlinear optimization.

> __Learning Objectives__
>
> By the end of this lecture, students will be able to:
> - **Understand and apply the Karush-Kuhn-Tucker (KKT) conditions** to characterize optimal solutions of constrained nonlinear optimization problems, recognizing when constraint qualifications such as Slater's condition ensure that KKT conditions are necessary for optimality.
> - **Implement gradient descent with barrier and penalty methods** to solve constrained optimization problems by augmenting the objective function with smooth penalty terms that enforce constraints and adapting these penalties over iterations.
> - **Apply simulated annealing** as a derivative-free heuristic for global optimization, understanding how temperature control and acceptance probabilities enable escape from local optima and practical strategies for parameter selection.

Let's get started by looking at the general problem that we want to solve.
___

## General Problem
We start with a general nonlinear optimization problem and the theory needed to characterize its optimal solutions.

Suppose we want to minimize a nonlinear objective function $f(x)$ subject to equality constraints $h_j(x) = 0$ for $j = 1, \ldots, p$ and inequality constraints $g_i(x) \leq 0$ for $i = 1, \ldots, m$. The problem can be formulated as follows:
$$
\begin{align*}
    \min_{x\in\mathbb{R}^n} \; f(x)
    \quad\text{s.t.}\quad
    \begin{cases}
    g_i(x) \le 0, & i = 1,\dots,m,\\
    h_j(x) = 0, & j = 1,\dots,p,
    \end{cases}
\end{align*}
$$

### Lagrangian
For this problem, we introduce multipliers $\lambda_i\ge{0}$ for each inequality and $\nu_j$ (free) for each equality. The Lagrangian is then given by:
$$
\boxed{
\begin{align*}
\mathcal L(x,\lambda,\nu) & =\;f(x)\;+\;\sum_{i=1}^m \lambda_i\,g_i(x)\;+\;\sum_{j=1}^p \nu_j\,h_j(x) \\
\lambda_i & \ge 0\;(i=1,\dots,m)\quad\text{convention}\\
\end{align*}}
$$
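
The Lagrangian can be evaluated directly from its definition. The snippet below is a minimal sketch: the function name `lagrangian` and the toy problem (minimize $x_1^2 + x_2^2$ subject to $0.75 - x_1 \le 0$ and $x_1 + x_2 - 1 = 0$) are assumptions made for illustration and do not come from the lecture.

```python
def lagrangian(f, gs, hs, x, lam, nu):
    """Evaluate L(x, lam, nu) = f(x) + sum_i lam_i*g_i(x) + sum_j nu_j*h_j(x)."""
    if any(l < 0.0 for l in lam):
        raise ValueError("inequality multipliers must be non-negative")
    value = f(x)
    value += sum(l * g(x) for l, g in zip(lam, gs))   # inequality terms
    value += sum(v * h(x) for v, h in zip(nu, hs))    # equality terms
    return value

# Hypothetical toy problem (illustration only):
#   minimize x1^2 + x2^2  s.t.  0.75 - x1 <= 0  and  x1 + x2 - 1 = 0
f = lambda x: x[0] ** 2 + x[1] ** 2
gs = [lambda x: 0.75 - x[0]]
hs = [lambda x: x[0] + x[1] - 1.0]

x_feasible = [0.8, 0.2]          # satisfies both constraints
L_value = lagrangian(f, gs, hs, x_feasible, lam=[2.0], nu=[3.0])
f_value = f(x_feasible)
```

At any feasible point with $\lambda_i \ge 0$, each term $\lambda_i g_i(x)$ is non-positive and each $h_j(x)$ is zero, so `L_value` cannot exceed `f_value`. This lower-bound property is what makes the multipliers useful.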



### Karush-Kuhn-Tucker (KKT) Conditions
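The Karush–Kuhn–Tucker (KKT) conditions are first-order necessary conditions for optimality in constrained nonlinear problems; they generalize the method of Lagrange multipliers to handle both equality and inequality constraints. Assuming $f$, $g_i$, and $h_j$ are differentiable and a constraint qualification holds, a local minimizer $x^*$ must satisfy the following conditions for some multipliers $(\lambda^*,\nu^*)$: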

1. __Stationarity__: The gradient of the Lagrangian with respect to $x$ must vanish at the optimal point $x^*$:
    $$
    \begin{align*}
    \nabla_x\mathcal L(x^*,\lambda^*,\nu^*) = 0\quad\Longleftrightarrow\quad\nabla f(x^*) + \sum_{i=1}^{m} \lambda_i^* \nabla g_i(x^*) + \sum_{j=1}^{p} \nu_j^* \nabla h_j(x^*) = 0.
    \end{align*}
    $$
2. __Primal feasibility__: The constraints must be satisfied at the optimal point $x^*$:
    $$
    \begin{align*}
    & g_i(x^*) \le 0 \quad(i = 1, \ldots, m)\\
    & h_j(x^*) = 0 \quad(j = 1, \ldots, p).
    \end{align*}
    $$
3. **Dual feasibility**: The Lagrange multipliers for the inequality constraints must be non-negative:
    $$\lambda_i^* \ge 0 \quad(i = 1, \ldots, m).$$
4. **Complementary slackness**: For each inequality constraint, either the constraint is active (i.e., $g_i(x^*) = 0$) or the corresponding multiplier is zero ($\lambda_i^* = 0$):
    $$\lambda_i^* \cdot g_i(x^*) = 0 \quad(i = 1, \ldots, m).$$

These conditions provide a framework for analyzing constrained optimization problems. They are __necessary__ for optimality under regularity (technical) conditions such as constraint qualifications (e.g., Slater's condition).
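
The four conditions can be checked numerically at a candidate point. The sketch below assumes a hypothetical toy problem (minimize $x_1^2 + x_2^2$ subject to $0.75 - x_1 \le 0$ and $x_1 + x_2 - 1 = 0$) and a helper name, `kkt_report`, chosen for illustration; the candidate multipliers $\lambda = 1$ and $\nu = -0.5$ were obtained by solving the stationarity equations by hand.

```python
def kkt_report(grad_f, gs, grad_gs, hs, grad_hs, x, lam, nu, tol=1e-8):
    """Return a dict of booleans, one per KKT condition, at (x, lam, nu)."""
    stationarity = list(grad_f(x))
    for l, dg in zip(lam, grad_gs):
        for d, dg_d in enumerate(dg(x)):
            stationarity[d] += l * dg_d
    for v, dh in zip(nu, grad_hs):
        for d, dh_d in enumerate(dh(x)):
            stationarity[d] += v * dh_d
    return {
        "stationarity": all(abs(s) <= tol for s in stationarity),
        "primal_feasibility": all(g(x) <= tol for g in gs)
        and all(abs(h(x)) <= tol for h in hs),
        "dual_feasibility": all(l >= -tol for l in lam),
        "complementary_slackness": all(
            abs(l * g(x)) <= tol for l, g in zip(lam, gs)
        ),
    }

# Hypothetical toy problem (illustration only)
grad_f = lambda x: [2.0 * x[0], 2.0 * x[1]]
gs = [lambda x: 0.75 - x[0]]
grad_gs = [lambda x: [-1.0, 0.0]]
hs = [lambda x: x[0] + x[1] - 1.0]
grad_hs = [lambda x: [1.0, 1.0]]

# Candidate with the inequality active (g = 0) and a positive multiplier
at_candidate = kkt_report(grad_f, gs, grad_gs, hs, grad_hs,
                          x=[0.75, 0.25], lam=[1.0], nu=[-0.5])
# A feasible but non-optimal point: stationarity fails
at_other = kkt_report(grad_f, gs, grad_gs, hs, grad_hs,
                      x=[1.0, 0.0], lam=[0.0], nu=[0.0])
```

Because this toy problem is convex, the point that passes all four checks is also its global minimizer.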

> __Slater's Condition__: A constraint qualification for convex problems (convex $f$ and $g_i$, affine $h_j$) that ensures the KKT conditions are necessary for optimality. It requires the existence of at least one point $x_0$ that strictly satisfies all inequality constraints and satisfies all equality constraints:
> $$g_i(x_0) < 0 \quad \text{for all } i = 1, \ldots, m \quad \text{and} \quad h_j(x_0) = 0 \quad \text{for all } j = 1, \ldots, p$$
> In other words, there exists a feasible point with slack in the inequality constraints.

__Interesting__: When $f$ and each $g_i$ are convex and each $h_j$ is affine (linear plus a constant), any point satisfying the KKT conditions is a global minimizer.

> __Convexity__: A function $f(x)$ is convex if for all $x, y$ and $\alpha \in [0,1]$:
> $$f(\alpha x + (1-\alpha)y) \leq \alpha f(x) + (1-\alpha)f(y)$$
> Geometrically, this means the line segment connecting any two points on the graph lies above (or on) the graph itself. Convex functions are "bowl-shaped": they have no separate local valleys in which a descent method could get trapped. Examples: $f(x) = x^2$ is convex; $f(x) = e^x$ is convex; $f(x) = -x^2$ is not convex. Convexity is valuable in optimization because any local minimum is guaranteed to be a global minimum.
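
The defining inequality can be probed numerically by sampling. The sketch below is a sanity check under stated assumptions (the helper name `passes_convexity_check`, the interval $[-3, 3]$, and the sample count are all illustrative choices). Passing the check does not prove convexity, but a single violation disproves it.

```python
import math
import random

def passes_convexity_check(f, lo=-3.0, hi=3.0, trials=1000, seed=0, tol=1e-9):
    """Sample (x, y, a) and test f(a*x + (1-a)*y) <= a*f(x) + (1-a)*f(y)."""
    rng = random.Random(seed)
    for _ in range(trials):
        x, y, a = rng.uniform(lo, hi), rng.uniform(lo, hi), rng.random()
        lhs = f(a * x + (1.0 - a) * y)
        rhs = a * f(x) + (1.0 - a) * f(y)
        if lhs > rhs + tol:        # inequality violated: f is not convex
            return False
    return True

square_ok = passes_convexity_check(lambda x: x ** 2)
exp_ok = passes_convexity_check(math.exp)
neg_square_ok = passes_convexity_check(lambda x: -(x ** 2))
```

For the three examples in the text, the check passes for $x^2$ and $e^x$ and finds a violation for $-x^2$.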


___

<div>
    <center>
        <img src="figs/Fig-GD-Schematic.svg" width="680"/>
    </center>
</div>

## Gradient Descent
Gradient descent iteratively updates a solution in the direction of steepest descent, i.e., along the negative gradient of the objective. However, gradient descent is designed for unconstrained problems, so we must adapt it before it can handle a constrained nonlinear optimization problem.

To incorporate constraints, we use barrier and penalty methods that add constraint-handling terms to the objective function.

> __Penalty and Barrier Methods__: Both methods add constraint-handling terms to the objective function. For inequality constraints, we use logarithmic barrier functions, which are defined only at strictly feasible points ($g_i(x) < 0$) and grow without bound as $x$ approaches a constraint boundary, discouraging the iterates from leaving the feasible region. For equality constraints, we use quadratic penalty terms, which are zero when $h_j(x) = 0$ and increase as the violation grows.

Consider the following augmented objective function:
$$
\begin{align*}
    \min_{x\in\mathbb{R}^n}\;P_{\mu,\rho}(x)\;&=f(x)\;-\;\underbrace{\frac{1}{\mu}\sum_{i=1}^m\ln\bigl(-\,g_i(x)\bigr)}_{\text{barrier term}}\;+\;\underbrace{\frac{1}{2\rho}\sum_{j=1}^p 
    \bigl[h_j(x)\bigr]^2}_{\text{penalty term}},\quad\text{where}\quad\mu>0,\;\rho>0\\
\end{align*}
$$
The barrier term $-\frac{1}{\mu}\sum_{i=1}^m\ln\bigl(-\,g_i(x)\bigr)$ is defined only where every $g_i(x) < 0$ and grows without bound as any $g_i(x)\to 0^{-}$, which keeps the iterates strictly inside the inequality constraints. The penalty term $\frac{1}{2\rho}\sum_{j=1}^p\bigl[h_j(x)\bigr]^2$ penalizes violations of the equality constraints.
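
A minimal sketch of the augmented objective follows. The function name `augmented_objective`, the toy problem, and the convention of returning $+\infty$ at points that are not strictly feasible are assumptions made for illustration.

```python
import math

def augmented_objective(f, gs, hs, x, mu, rho):
    """P_{mu,rho}(x) = f(x) - (1/mu)*sum ln(-g_i(x)) + (1/(2*rho))*sum h_j(x)^2."""
    g_vals = [g(x) for g in gs]
    if any(v >= 0.0 for v in g_vals):
        return math.inf            # assumption: barrier treated as +inf here
    barrier = -sum(math.log(-v) for v in g_vals) / mu
    penalty = sum(h(x) ** 2 for h in hs) / (2.0 * rho)
    return f(x) + barrier + penalty

# Hypothetical toy problem (illustration only):
#   minimize x1^2 + x2^2  s.t.  0.75 - x1 <= 0  and  x1 + x2 - 1 = 0
f = lambda x: x[0] ** 2 + x[1] ** 2
gs = [lambda x: 0.75 - x[0]]
hs = [lambda x: x[0] + x[1] - 1.0]

P_inside = augmented_objective(f, gs, hs, [1.0, 0.0], mu=100.0, rho=0.01)
P_outside = augmented_objective(f, gs, hs, [0.5, 0.5], mu=100.0, rho=0.01)
```

At the strictly feasible point `[1.0, 0.0]` the equality constraint holds, so only the objective and the barrier contribute. The point `[0.5, 0.5]` violates the inequality, so the sketch reports an infinite value rather than evaluating the logarithm of a non-positive number.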

> __Parameters__
>
> The parameters $\mu$ and $\rho$ control the strength of these penalties:
> * __Barrier weight__: The $\mu$ parameter is decreased over iterations to enforce stricter adherence to inequality constraints. As $\mu\to0$, the coefficient $\frac{1}{\mu}$ increases, strengthening the barrier effect.
> * __Penalty weight__: The $\rho$ parameter is decreased over iterations to enforce stricter adherence to equality constraints. As $\rho\to0$, the coefficient $\frac{1}{\rho}$ increases, strengthening the penalty effect.
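
To make the role of $\rho$ concrete, consider a small illustrative problem that is not part of the lecture: minimize $f(x) = x^2$ subject to the single equality constraint $h(x) = x - 1 = 0$, with no inequality constraints. The augmented objective and its minimizer are:
$$
\begin{align*}
P_{\rho}(x) &= x^2 + \frac{1}{2\rho}\,(x-1)^2,\\
\frac{dP_{\rho}}{dx} &= 2x + \frac{1}{\rho}\,(x-1) = 0
\quad\Longrightarrow\quad x_{\rho} = \frac{1}{1+2\rho}.
\end{align*}
$$
For $\rho = 0.5$ the minimizer is $x_{\rho} = 0.5$, for $\rho = 0.05$ it is $x_{\rho}\approx 0.909$, and as $\rho\to0$ it approaches the constrained solution $x^* = 1$.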

### Algorithm
The algorithm iteratively updates the solution $x_k$ using the gradient of the augmented objective function $P_{\mu,\rho}(x)$.

__Initialization__: Given an initial guess $x_0$ that is strictly feasible for the inequality constraints ($g_i(x_0) < 0$ for all $i$, so that the barrier term is defined), set $\mu > 0$ and $\rho > 0$. Specify a tolerance $\epsilon > 0$, a maximum number of iterations $K$, and a step size (learning rate) $\alpha > 0$. Set $\texttt{converged} \gets \texttt{false}$, set the iteration counter $k \gets 0$, and specify values for the penalty update parameters $\tau_{\mu},\tau_{\rho}\in\left(0,1\right)$.

While not $\texttt{converged}$ __do__:
1. Compute the gradient: $\nabla P_{\mu,\rho}(x_k) = \nabla f(x_k) + \frac{1}{\mu} \sum_{i=1}^m \frac{\nabla g_i(x_k)}{-g_i(x_k)} + \frac{1}{\rho} \sum_{j=1}^p h_j(x_k) \nabla h_j(x_k)$ evaluated at the current solution $x_k$.
2. Update the solution: $x_{k+1} = x_k - \alpha \nabla P_{\mu,\rho}(x_k)$.
3. Check convergence: 
     - If $\|x_{k+1} - x_k\|_{2} \leq \epsilon$, set $\texttt{converged} \gets \texttt{true}$. Return $x_{k+1}$ as the approximate solution.
     - If $k \geq K$, set $\texttt{converged} \gets \texttt{true}$ and warn that maximum iterations were reached without convergence.
4. Increment the iteration counter: $k \gets k + 1$, update $\mu\gets \tau_\mu\,\mu$ and $\rho\gets \tau_\rho\,\rho$ as needed, and repeat.

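As $\mu\to0$, the barrier term grows stronger and keeps the solution farther from the constraint boundaries. As $\rho\to0$, the penalty term grows stronger to enforce the equality constraints. Two practical caveats apply: a stronger barrier also pushes the minimizer of $P_{\mu,\rho}$ away from any active inequality constraint, and a larger penalty coefficient $\frac{1}{\rho}$ makes $P_{\mu,\rho}$ steeper, so a fixed step size $\alpha$ may need to shrink for the iteration to remain stable.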
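
The loop above can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions rather than a definitive implementation: the function names, the toy problem, and all parameter values are hypothetical. Two choices go beyond the algorithm as written: the step is halved whenever an update would leave the strictly feasible region, and $\tau_\mu = \tau_\rho = 1$ by default (no parameter update), because the step size here is fixed.

```python
import math

def grad_augmented(grad_f, gs, grad_gs, hs, grad_hs, x, mu, rho):
    """Gradient of P_{mu,rho} at a strictly feasible x (every g_i(x) < 0)."""
    grad = list(grad_f(x))
    for g, dg in zip(gs, grad_gs):
        g_val = g(x)
        for d, dg_d in enumerate(dg(x)):
            grad[d] += dg_d / (-g_val) / mu    # (1/mu) * grad g_i / (-g_i)
    for h, dh in zip(hs, grad_hs):
        h_val = h(x)
        for d, dh_d in enumerate(dh(x)):
            grad[d] += h_val * dh_d / rho      # (1/rho) * h_j * grad h_j
    return grad

def gradient_descent(grad_f, gs, grad_gs, hs, grad_hs, x0, mu, rho,
                     alpha=1e-3, eps=1e-8, K=50_000, tau_mu=1.0, tau_rho=1.0):
    """Return (x, converged, iterations) for the barrier/penalty iteration."""
    if any(g(x0) >= 0.0 for g in gs):
        raise ValueError("x0 must satisfy g_i(x0) < 0 for every i")
    x = list(x0)
    for k in range(K):
        grad = grad_augmented(grad_f, gs, grad_gs, hs, grad_hs, x, mu, rho)
        step = alpha
        x_new = [xi - step * gi for xi, gi in zip(x, grad)]
        # Safeguard (an assumption, not part of the algorithm above):
        # halve the step until the candidate is strictly feasible.
        while any(g(x_new) >= 0.0 for g in gs):
            step *= 0.5
            x_new = [xi - step * gi for xi, gi in zip(x, grad)]
        change = math.sqrt(sum((a - b) ** 2 for a, b in zip(x_new, x)))
        x = x_new
        if change <= eps:                      # ||x_{k+1} - x_k||_2 <= eps
            return x, True, k + 1
        mu, rho = tau_mu * mu, tau_rho * rho   # update "as needed"
    return x, False, K

# Hypothetical toy problem (illustration only):
#   minimize x1^2 + x2^2  s.t.  0.75 - x1 <= 0  and  x1 + x2 - 1 = 0
grad_f = lambda x: [2.0 * x[0], 2.0 * x[1]]
gs = [lambda x: 0.75 - x[0]]
grad_gs = [lambda x: [-1.0, 0.0]]
hs = [lambda x: x[0] + x[1] - 1.0]
grad_hs = [lambda x: [1.0, 1.0]]

x_hat, converged, iterations = gradient_descent(
    grad_f, gs, grad_gs, hs, grad_hs, x0=[1.0, 0.0], mu=100.0, rho=0.01
)
```

With these settings the iterate should settle near $(0.76, 0.24)$, close to but not exactly at the constrained minimizer $(0.75, 0.25)$, because $\mu$ and $\rho$ stay finite. Passing $\tau_\rho < 1$ with a fixed $\alpha$ eventually makes the penalty term too steep for the step size, so a decreasing step or an outer loop that re-solves for each parameter setting would be needed in that case.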
___

## Simulated Annealing
Simulated annealing (SA) is a derivative-free optimization method inspired by the annealing process in metallurgy, developed by Kirkpatrick, Gelatt, and Vecchi (1983). It explores the solution space by accepting both improving and non-improving moves, controlled by a temperature parameter. This allows the algorithm to avoid getting stuck in local optima.

The algorithm minimizes the augmented objective function:
$$
\begin{align*}
    \min_{x\in\mathbb{R}^n}\;P_{\mu,\rho}(x)\;&=f(x)\;-\;\frac{1}{\mu}\sum_{i=1}^m\ln\bigl(-\,g_i(x)\bigr)\;+\;\frac{1}{2\rho}\sum_{j=1}^p 
    \bigl[h_j(x)\bigr]^2,\quad\text{where}\quad\mu>0,\;\rho>0\\
\end{align*}
$$
where $f(x)$ is the objective function, $g_i(x)$ are inequality constraints, and $h_j(x)$ are equality constraints.

__Initialize__: Given an initial solution $x_0$ that is strictly feasible for the inequality constraints ($g_i(x_0) < 0$ for all $i$), penalty parameters $\mu > 0$ and $\rho > 0$, an initial temperature $T\gets{T_\circ}$, a cooling rate parameter $\alpha\in(0,1)$, the maximum number of iterations $K$ per temperature, and a minimum temperature $T_{\text{min}}$. Specify a step size $\beta > 0$, set $\texttt{converged} \gets \texttt{false}$, set $x^{\star} \gets x_0$ as the best solution found, and $x_{c}\gets{x}_{0}$ as the current solution. Specify penalty update parameters $\tau_{\mu},\tau_{\rho}\in\left(0,1\right)$.

While not $\texttt{converged}$ __do__:

1. For $k = 1\,\text{to}\,K$:
   - Generate a candidate solution: $x^{\prime} \gets x_{c} + \beta\cdot\texttt{randn}(\texttt{size}(x_{c}))$. If $x^{\prime}$ violates an inequality constraint ($g_i(x^{\prime}) \geq 0$ for some $i$), the barrier term is undefined; one option is to treat $P_{\mu,\rho}(x^{\prime})$ as $+\infty$ so that the candidate is rejected.
   - Compute the change in objective value: $\Delta P \gets P_{\mu,\rho}(x^{\prime}) - P_{\mu,\rho}(x_{c})$
      - If $\Delta P < 0$, accept the new solution: $x_{c} \gets x^{\prime}$ (improving move).
      - If $\Delta P \geq 0$, compute the acceptance probability $p\gets\exp(-\Delta P / T)$ and draw $u \sim \texttt{Uniform}(0,1)$. If $u \leq p$, accept: $x_c \gets x^{\prime}$ (uphill move); otherwise keep $x_c$.
   - If $P_{\mu,\rho}(x_c) < P_{\mu,\rho}(x^{\star})$, update the best solution: $x^{\star}\gets{x}_{c}$.
2. Update penalty parameters: $\mu\gets \tau_\mu\,\mu$ and $\rho\gets \tau_\rho\,\rho$.
3. Check convergence: 
   - If $T \leq T_{\text{min}}$, set $\texttt{converged} \gets \texttt{true}$ and return $x^{\star}$.
   - Otherwise, update temperature: $T \gets \alpha T$
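
The steps above can be sketched as follows. This is a minimal sketch, not a reference implementation: the function names, the one-dimensional multimodal test objective, the box constraints $-3 < x < 3$, and every parameter value are hypothetical. Candidates that are not strictly feasible are given an infinite cost and rejected, which is an assumption of this sketch.

```python
import math
import random

def augmented(f, gs, hs, x, mu, rho):
    """P_{mu,rho}(x); +inf when x is not strictly feasible (assumption)."""
    g_vals = [g(x) for g in gs]
    if any(v >= 0.0 for v in g_vals):
        return math.inf
    barrier = -sum(math.log(-v) for v in g_vals) / mu
    penalty = sum(h(x) ** 2 for h in hs) / (2.0 * rho)
    return f(x) + barrier + penalty

def simulated_annealing(f, gs, hs, x0, mu, rho, T0, T_min, alpha, K, beta,
                        tau_mu=0.99, tau_rho=0.99, seed=0):
    """Return the best solution found; x0 must be strictly feasible."""
    rng = random.Random(seed)
    x_c, x_best, T = list(x0), list(x0), T0
    while True:
        for _ in range(K):
            # Gaussian perturbation of the current solution
            x_new = [xi + beta * rng.gauss(0.0, 1.0) for xi in x_c]
            P_new = augmented(f, gs, hs, x_new, mu, rho)
            if math.isfinite(P_new):           # infeasible candidates rejected
                dP = P_new - augmented(f, gs, hs, x_c, mu, rho)
                if dP < 0.0 or rng.random() <= math.exp(-dP / T):
                    x_c = x_new                # improving or accepted uphill move
            if augmented(f, gs, hs, x_c, mu, rho) < augmented(f, gs, hs, x_best, mu, rho):
                x_best = list(x_c)
        mu, rho = tau_mu * mu, tau_rho * rho   # update penalty parameters
        if T <= T_min:
            return x_best
        T *= alpha                             # cool down

# Hypothetical multimodal test problem (illustration only)
f = lambda x: x[0] ** 2 + 4.0 * math.cos(3.0 * x[0])
gs = [lambda x: x[0] - 3.0, lambda x: -3.0 - x[0]]   # -3 < x < 3
hs = []                                              # no equality constraints

x_best = simulated_annealing(f, gs, hs, x0=[2.5], mu=100.0, rho=1.0,
                             T0=5.0, T_min=1e-3, alpha=0.9, K=50, beta=0.5)
```

Because every accepted candidate has a finite augmented cost, the returned point is always strictly feasible. The result depends on the seed and the cooling schedule, so it is worth repeating the run with several seeds before trusting any single answer.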


### Selecting $T_{\circ}$ and $K$

#### Sample‐and‐set approach for $T_\circ$
A practical way to select initial temperature $T_\circ$:
1. From the initial solution $x_0$, generate $M={100}$ random neighbors and record their cost changes $\Delta{P}_{1},\dots,\Delta{P}_{M}$ relative to $P_{\mu,\rho}(x_0)$.
2. Find uphill moves: $J^{+}\gets\left\{j: \Delta{P}_{j}>0,\,j=1,\dots,M\right\}$. Compute mean uphill cost $\overline{\Delta P}_{+}\gets\texttt{mean}\left\{\Delta{P}_{j}\right\}_{j\in J^{+}}$.
3. Choose desired initial acceptance rate $p_{\circ}\in(0.6,0.9)$.
4. Set $T_{\circ}\gets{-{\overline{\Delta P}_{+}}/{\,\ln{p_{\circ}}}}$
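
The sample-and-set procedure can be sketched as follows. The function names, the Gaussian neighbor generator with step size `beta`, and the quadratic stand-in for $P_{\mu,\rho}$ are assumptions made for illustration; any callable cost function could be passed instead.

```python
import math
import random

def temperature_for_acceptance(mean_uphill, p0):
    """Solve exp(-mean_uphill / T) = p0 for T."""
    return -mean_uphill / math.log(p0)

def initial_temperature(P, x0, beta, p0=0.8, M=100, seed=0):
    """Estimate T0 from M random neighbors of x0."""
    rng = random.Random(seed)
    P0 = P(x0)
    deltas = []
    for _ in range(M):
        neighbor = [xi + beta * rng.gauss(0.0, 1.0) for xi in x0]
        deltas.append(P(neighbor) - P0)
    # Keep finite uphill moves only (infeasible neighbors may cost +inf)
    uphill = [d for d in deltas if d > 0.0 and math.isfinite(d)]
    if not uphill:
        raise ValueError("no uphill moves sampled; increase M or beta")
    return temperature_for_acceptance(sum(uphill) / len(uphill), p0)

# Hypothetical cost function standing in for P_{mu,rho} (illustration only)
P = lambda x: x[0] ** 2 + x[1] ** 2
T0 = initial_temperature(P, x0=[1.0, 0.0], beta=0.1, p0=0.8)
```

By construction, an uphill move whose cost equals the sampled mean is accepted with probability $p_\circ$ at temperature $T_\circ$. Since $\ln p_\circ < 0$, the resulting temperature is positive.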

#### Heuristic for choosing $K$
Adapt the number of iterations per temperature based on acceptance rates:
1. Initially, set $K\gets{c}\cdot\texttt{size}(x)$, where $c\in\left[10,50\right]$.
2. After each temperature, compute the acceptance rate $\hat{p}$ (fraction of accepted moves).
    - If $\hat{p}>0.8$, decrease $K\gets\lceil 0.75K\rceil$.
    - If $\hat{p}<{0.2}$, increase $K\gets\lceil 1.5K\rceil$.
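
The heuristic translates directly into two small helpers. The names `initial_K` and `adapt_K` are hypothetical, and the thresholds are the ones listed above.

```python
import math

def initial_K(n_variables, c=10):
    """Initial iterations per temperature: K = c * size(x), with c in [10, 50]."""
    return c * n_variables

def adapt_K(K, p_hat):
    """Adjust K from the acceptance rate p_hat observed at the last temperature."""
    if p_hat > 0.8:
        return math.ceil(0.75 * K)   # most moves accepted: fewer iterations
    if p_hat < 0.2:
        return math.ceil(1.5 * K)    # few moves accepted: more iterations
    return K                         # acceptance rate in range: keep K

K_start = initial_K(n_variables=2, c=10)
K_high = adapt_K(100, p_hat=0.9)
K_low = adapt_K(100, p_hat=0.1)
K_same = adapt_K(100, p_hat=0.5)
```

In the annealing loop, `adapt_K` would be called once per temperature level, after counting accepted moves, and before the next inner loop starts.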


___

## Lab
In lab `L9d`, we'll use gradient descent to solve an unconstrained logistic regression binary classification problem using the banknote dataset.


## Summary

In this lecture, we explored approaches to constrained and unconstrained nonlinear optimization:

> __Key takeaways:__
>
> 1. **KKT conditions for constrained optimization**: The Karush–Kuhn–Tucker conditions are necessary conditions for optimality in constrained nonlinear problems (under a constraint qualification), generalizing Lagrange multipliers to both equality and inequality constraints. These conditions consist of stationarity, primal feasibility, dual feasibility, and complementary slackness. When the problem is convex, they are also sufficient for global optimality.
> 2. **Gradient descent with barrier and penalty methods**: For constrained problems, gradient descent can be applied by augmenting the objective with logarithmic barrier terms for inequality constraints and quadratic penalty terms for equality constraints. Decreasing the penalty parameters over iterations enforces stricter adherence to constraints while maintaining computational tractability.
> 3. **Simulated annealing for escaping local optima**: Simulated annealing probabilistically accepts both improving and non-improving moves controlled by temperature. The algorithm can escape local optima by accepting uphill moves early and converging to better solutions as temperature decreases. Careful selection of initial temperature and iteration counts improves effectiveness.

These methods provide tools for solving complex constrained and unconstrained optimization problems.

___