Let's get started by looking at the general problem that we want to solve.
___

## General Problem
We begin with a general nonlinear optimization problem and the theory needed to establish optimality. Suppose we want to minimize a nonlinear objective function $f(x)$ subject to equality constraints $h_j(x) = 0$ for $j = 1, \ldots, p$ and inequality constraints $g_i(x) \leq 0$ for $i = 1, \ldots, m$. The problem can be formulated as follows:
$$
\begin{align*}
    \min_{x\in\mathbb{R}^n} \; f(x)
    \quad\text{s.t.}\quad
    \begin{cases}
    g_i(x) \le 0, & i = 1,\dots,m,\\
    h_j(x) = 0, & j = 1,\dots,p,
    \end{cases}
\end{align*}
$$

### Lagrangian
For this problem, we introduce multipliers $\lambda_i\ge{0}$ for each inequality and $\nu_j$ (free) for each equality. The Lagrangian is then given by:
$$
\boxed{
\begin{align*}
\mathcal L(x,\lambda,\nu) & =\;f(x)\;+\;\sum_{i=1}^m \lambda_i\,g_i(x)\;+\;\sum_{j=1}^p \nu_j\,h_j(x) \\
\lambda_i & \ge 0\;(i=1,\dots,m)\quad\text{convention}\\
\end{align*}}
$$

### Karush-Kuhn-Tucker (KKT) Conditions
The Karush–Kuhn–Tucker (KKT) conditions play a central role in the theory and practice of constrained nonlinear optimization by generalizing the method of Lagrange multipliers to handle both equality and inequality constraints. Assuming $f$, $g_i$, and $h_j$ are continuously differentiable and a suitable constraint qualification holds (e.g., the linear independence constraint qualification (LICQ), or Slater’s condition for convex problems), the following conditions are necessary for $x^*$ to be a local minimizer:

1. __Stationarity__: The gradient of the Lagrangian with respect to $x$ must vanish at the optimal point $x^*$:
    $$
    \begin{align*}
    \nabla_x\mathcal L(x^*,\lambda^*,\nu^*) = 0\quad\Longleftrightarrow\quad\nabla f(x^*) + \sum_{i=1}^{m} \lambda_i^* \nabla g_i(x^*) + \sum_{j=1}^{p} \nu_j^* \nabla h_j(x^*) = 0.
    \end{align*}
    $$
2. __Primal feasibility__: The constraints must be satisfied at the optimal point $x^*$:
    $$
    \begin{align*}
    & g_i(x^*) \le 0 \quad(i = 1, \ldots, m)\\
    & h_j(x^*) = 0 \quad(j = 1, \ldots, p).
    \end{align*}
    $$
3. **Dual feasibility**: The Lagrange multipliers for the inequality constraints must be non-negative:
    $$\lambda_i^* \ge 0 \quad(i = 1, \ldots, m).$$
4. **Complementary slackness**: For each inequality constraint, either the constraint is active (i.e., $g_i(x^*) = 0$) or the corresponding multiplier is zero ($\lambda_i^* = 0$):
    $$\lambda_i^* \cdot g_i(x^*) = 0 \quad(i = 1, \ldots, m).$$

These conditions provide a framework for analyzing and solving constrained optimization problems. In general, they are necessary but not sufficient for optimality. However, if $f$ and each $g_i$ are convex and each $h_j$ is affine, then the KKT conditions are also sufficient: any point satisfying them is a global minimizer.
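
To make the four conditions concrete, here is a minimal sketch that evaluates the residual of each KKT condition at a candidate point. The helper `kkt_residuals` and the small two-variable problem are hypothetical illustrations (not part of the original notes); the problem is convex, so a point with zero residuals is a global minimizer.

```python
import numpy as np

def kkt_residuals(grad_f, g, jac_g, h, jac_h, x, lam, nu):
    """Residual of each KKT condition at (x, lam, nu); every entry is ~0 at a KKT point."""
    # Stationarity: grad f + sum_i lam_i grad g_i + sum_j nu_j grad h_j
    stationarity = grad_f(x) + jac_g(x).T @ lam + jac_h(x).T @ nu
    return {
        "stationarity": float(np.linalg.norm(stationarity)),
        "primal_inequality": float(np.max(np.maximum(g(x), 0.0))),  # violation of g <= 0
        "primal_equality": float(np.max(np.abs(h(x)))),             # violation of h = 0
        "dual": float(np.max(np.maximum(-lam, 0.0))),               # violation of lam >= 0
        "complementary_slackness": float(np.max(np.abs(lam * g(x)))),
    }

# Hypothetical problem: min x1^2 + x2^2  s.t.  1.5 - x1 <= 0,  x1 + x2 - 2 = 0
grad_f = lambda x: 2.0 * x
g = lambda x: np.array([1.5 - x[0]])
jac_g = lambda x: np.array([[-1.0, 0.0]])   # rows are gradients of g_i
h = lambda x: np.array([x[0] + x[1] - 2.0])
jac_h = lambda x: np.array([[1.0, 1.0]])    # rows are gradients of h_j

# Candidate solution and multipliers (worked out by hand for this problem)
x_star = np.array([1.5, 0.5])
lam_star = np.array([2.0])
nu_star = np.array([-1.0])

residuals = kkt_residuals(grad_f, g, jac_g, h, jac_h, x_star, lam_star, nu_star)
print(residuals)
```

Here the inequality is active ($g(x^*) = 0$), so complementary slackness allows a strictly positive multiplier $\lambda^* = 2$.
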
___

<div>
    <center>
        <img src="figs/Fig-GD-Schematic.svg" width="680"/>
    </center>
</div>

## Gradient Descent
Given the general problem of constrained nonlinear optimization, we can apply _gradient descent_ to find a solution. 

Gradient descent is a first-order optimization algorithm that iteratively updates the solution by moving in the direction of the steepest descent of the objective function. However, by default, gradient descent is designed for unconstrained optimization problems. Thus, we need to adapt it to handle constraints.

Toward this end, we can use a barrier or penalty method to incorporate the constraints into the objective function.

> __Penalty and Barrier Methods__: Both approaches fold the constraints into the objective function. A _penalty_ term, such as a quadratic penalty on each equality constraint, is zero when the constraint holds and grows with the size of the violation, discouraging infeasible candidates. A __barrier function__, used here for the inequality constraints, is finite only at strictly feasible points ($g_i(x) < 0$) and grows without bound as $x$ approaches the constraint boundary, which keeps the iterates in the interior of the feasible region.

Consider the following (augmented) objective function that combines the original objective function with penalty terms:
$$
\begin{align*}
    \min_{x\in\mathbb{R}^n}\;P_{\mu,\rho}(x)\;&=f(x)\;-\;\underbrace{\frac{1}{\mu}\sum_{i=1}^m\ln\bigl(-\,g_i(x)\bigr)}_{\text{barrier term}}\;+\;\underbrace{\frac{1}{2\rho}\sum_{j=1}^p 
    \bigl[h_j(x)\bigr]^2}_{\text{penalty term}},\quad\text{where}\quad\mu>0,\;\rho>0\\
\end{align*}
$$
The smooth barrier term $-\frac{1}{\mu}\sum_{i=1}^m\ln\bigl(-\,g_i(x)\bigr)$ is defined only at strictly feasible points ($g_i(x) < 0$ for all $i$) and grows without bound as any $g_i(x)\to0^-$, while the penalty term $\frac{1}{2\rho}\sum_{j=1}^p\bigl[h_j(x)\bigr]^2$ grows with the violation of the equality constraints.
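
As a minimal sketch, the augmented objective can be evaluated as follows. The function name `augmented_objective`, the choice to return $+\infty$ outside the barrier's domain, and the small test problem are assumptions made for illustration.

```python
import numpy as np

def augmented_objective(f, g, h, x, mu, rho):
    """P(x) = f(x) - (1/mu) * sum(ln(-g_i(x))) + (1/(2*rho)) * sum(h_j(x)^2)."""
    gx, hx = np.asarray(g(x)), np.asarray(h(x))
    if np.any(gx >= 0):
        # Assumption: points outside the barrier's domain get the value +infinity.
        return np.inf
    barrier = -np.sum(np.log(-gx)) / mu
    penalty = np.sum(hx**2) / (2.0 * rho)
    return f(x) + barrier + penalty

# Hypothetical problem: f(x) = x1^2 + x2^2, g(x) = -x1 <= 0, h(x) = x1 + x2 - 1 = 0
f = lambda x: x @ x
g = lambda x: np.array([-x[0]])
h = lambda x: np.array([x[0] + x[1] - 1.0])

P_inside = augmented_objective(f, g, h, np.array([0.5, 1.0]), mu=10.0, rho=1.0)
P_near = augmented_objective(f, g, h, np.array([1e-6, 1.0]), mu=10.0, rho=1.0)
P_outside = augmented_objective(f, g, h, np.array([-0.1, 1.0]), mu=10.0, rho=1.0)
print(P_inside, P_near, P_outside)
```

The barrier contribution is small well inside the feasible region, larger near the boundary $x_1 = 0$, and infinite once the inequality constraint is violated.
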

> __Parameters__
>
> The parameters $\mu$ and $\rho$ control the strength of these penalties:
> * __Barrier weight__: The $\mu$ parameter is typically _decreased_ over iterations to enforce stricter adherence to the inequality constraints. As $\mu\to0$, the coefficient $\frac{1}{\mu}$ grows, strengthening the barrier effect.
> * __Penalty weight__: The $\rho$ parameter is typically _decreased_ over iterations to enforce stricter adherence to the equality constraints. As $\rho\to0$, the coefficient $\frac{1}{\rho}$ grows, strengthening the penalty effect.
> 
> We'll provide some heuristics for updating these parameters in the algorithm below.

### Algorithm
Let's develop a simple gradient descent algorithm for this problem. The algorithm iteratively updates the solution $x_k$ using the gradient of the augmented objective function $P_{\mu,\rho}(x)$.

__Initialization__: Given an initial guess $x_0$ that is strictly feasible for the inequality constraints ($g_i(x_0) < 0$ for all $i$, so that the barrier term is defined), set $\mu > 0$ and $\rho > 0$. Specify a tolerance $\epsilon > 0$, a maximum number of iterations $K$, and a step size (learning rate) $\alpha > 0$. Set $\texttt{converged} \gets \texttt{false}$, set the iteration counter $k \gets 0$, and specify values for the penalty update parameters $\tau_{\mu},\tau_{\rho}\in\left(0,1\right)$.

While not $\texttt{converged}$ __do__:
1. Compute the gradient: $\nabla P_{\mu,\rho}(x_k) = \nabla f(x_k) + \frac{1}{\mu} \sum_{i=1}^m \frac{\nabla g_i(x_k)}{-g_i(x_k)} + \frac{1}{\rho} \sum_{j=1}^p h_j(x_k) \nabla h_j(x_k)$ evaluated at the current solution $x_k$.
2. Update the solution: $x_{k+1} = x_k - \alpha \nabla P_{\mu,\rho}(x_k)$. $\texttt{Note}$: $\alpha$ is fixed here, but it can be adapted dynamically based on the convergence behavior.
3. Check convergence: 
     - If $\|x_{k+1} - x_k\|_{2} \leq \epsilon$, set $\texttt{converged} \gets \texttt{true}$. Return $x_{k+1}$ as the approximate solution. $\texttt{Note}$: here we look at the Euclidean norm of the difference between the current and next solution. However, many other criteria can be used, such as the change in the objective function value or the gradient norm.
     - If $k \geq K$, set $\texttt{converged} \gets \texttt{true}$. Warn that the maximum number of iterations has been reached without convergence.
4. Increment the iteration counter: $k \gets k + 1$, update $\mu\gets \tau_\mu\,\mu$ and $\rho\gets \tau_\rho\,\rho$ as needed, and repeat.

As $\mu\to0$, the coefficient $\frac{1}{\mu}$ in the barrier term grows, creating an increasingly strong barrier that keeps the solution away from constraint boundaries (where $g_i(x)\to 0^-$). Similarly, as $\rho\to0$, the coefficient $\frac{1}{\rho}$ in the penalty term grows, enforcing $h_j(x)\to0$ ever more strictly.
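
The loop above can be sketched in a few lines. This is an illustration under stated assumptions, not a reference implementation: the lower bounds `mu_min` and `rho_min` (which keep the fixed step size stable as the weights grow) and the step-halving safeguard (which keeps iterates strictly feasible) are additions not present in the algorithm as written, and the test problem is hypothetical.

```python
import numpy as np

def barrier_penalty_gd(grad_f, g, jac_g, h, jac_h, x0, mu=100.0, rho=1.0,
                       alpha=5e-3, eps=1e-8, K=20000,
                       tau_mu=0.9, tau_rho=0.9, mu_min=10.0, rho_min=1e-2):
    """Fixed-step gradient descent on P_{mu,rho}; returns (x, iterations, converged)."""
    x = np.asarray(x0, dtype=float)
    if np.any(g(x) >= 0):
        raise ValueError("x0 must be strictly feasible: g_i(x0) < 0 for all i")
    for k in range(K):
        # Step 1: grad P = grad f + (1/mu) sum grad g_i / (-g_i) + (1/rho) sum h_j grad h_j
        grad = (grad_f(x)
                + (1.0 / mu) * jac_g(x).T @ (1.0 / -g(x))
                + (1.0 / rho) * jac_h(x).T @ h(x))
        # Step 2: update; assumption: halve the step if it would leave the interior
        step = alpha
        x_new = x - step * grad
        while np.any(g(x_new) >= 0):
            step *= 0.5
            x_new = x - step * grad
        # Step 3: convergence check on the Euclidean norm of the update
        if np.linalg.norm(x_new - x) <= eps:
            return x_new, k + 1, True
        x = x_new
        # Step 4: shrink mu and rho; the floors are an assumption, not in the text
        mu = max(tau_mu * mu, mu_min)
        rho = max(tau_rho * rho, rho_min)
    return x, K, False  # maximum number of iterations reached

# Hypothetical problem: min x1^2 + x2^2  s.t.  -x1 <= 0,  x1 + x2 - 1 = 0
grad_f = lambda x: 2.0 * x
g = lambda x: np.array([-x[0]])
jac_g = lambda x: np.array([[-1.0, 0.0]])
h = lambda x: np.array([x[0] + x[1] - 1.0])
jac_h = lambda x: np.array([[1.0, 1.0]])

x_hat, iters, ok = barrier_penalty_gd(grad_f, g, jac_g, h, jac_h, x0=[1.0, 1.0])
print(x_hat, iters, ok)
```

The constrained minimizer of this problem is $(0.5, 0.5)$. The sketch lands near that point rather than exactly on it, because the finite barrier and penalty weights bias the minimizer of $P_{\mu,\rho}$.
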
___

## Simulated Annealing
Simulated annealing (SA) is inspired by the annealing process in metallurgy, where controlled cooling of a material allows it to reach a low-energy state. SA explores the solution space by accepting both improving and non-improving moves, allowing it to escape local optima. Developed by Kirkpatrick, Gelatt, and Vecchi in 1983, SA is a probabilistic technique that uses a temperature parameter to control the exploration of the solution space.

Let's take a look at the basic steps of the simulated annealing algorithm for minimizing the (augmented) objective function $P_{\mu,\rho}(x)$:
$$
\begin{align*}
    \min_{x\in\mathbb{R}^n}\;P_{\mu,\rho}(x)\;&=f(x)\;-\;\frac{1}{\mu}\sum_{i=1}^m\ln\bigl(-\,g_i(x)\bigr)\;+\;\frac{1}{2\rho}\sum_{j=1}^p 
    \bigl[h_j(x)\bigr]^2,\quad\text{where}\quad\mu>0,\;\rho>0\\
\end{align*}
$$
where $f(x)$ is the objective function, $g_i(x)$ are the inequality constraints, and $h_j(x)$ are the equality constraints.


__Initialize__: Given an initial solution guess $x_0$ that is strictly feasible for the inequality constraints, penalty parameters $\mu > 0$ and $\rho > 0$, an initial temperature $T\gets{T_\circ}$, a cooling rate parameter $\alpha\in(0,1)$, the number of iterations $K$ per temperature, and a minimum temperature $T_{\text{min}}$. Specify a perturbation scale $\beta > 0$ (the standard deviation of the random steps), set $\texttt{converged} \gets \texttt{false}$, set $x^{\star} \gets x_0$ as the best solution found so far, and $x_{c}\gets{x}_{0}$ as the current solution. Specify values for the penalty update parameters $\tau_{\mu},\tau_{\rho}\in\left(0,1\right)$.

While not $\texttt{converged}$ __do__:

1. For $k = 1\,\text{to}\,K$:
   - Generate a _new_ candidate solution: $x^{\prime} \gets x_{c} + \beta\cdot\texttt{randn}(\texttt{size}(x_{c}))$.
   - Compute the change in the objective function between the __new__ solution and the __current__ solution: $\Delta P \gets P_{\mu,\rho}(x^{\prime}) - P_{\mu,\rho}(x_{c})$. If $x^{\prime}$ violates an inequality constraint ($g_i(x^{\prime}) \geq 0$ for some $i$), the barrier term is undefined; treat $P_{\mu,\rho}(x^{\prime})$ as $+\infty$ and reject the candidate.
      - _Downhill move_: If $\Delta P < 0$, accept the new solution (new solution becomes the current solution): $x_{c} \gets x^{\prime}$.
      - _Uphill move_: If $\Delta P \geq 0$, accept the new solution with probability $p\gets\exp(-\Delta P / T)$. Roll a uniform random number $u \gets \texttt{Uniform}(0,1)$. If $u \leq p$, accept the new solution: $x_c \gets x^{\prime}$; otherwise, keep $x_c$.
   - Compute the change in the objective function between the __current__ solution and the __best__ solution: $\Delta P^{\star} \gets  P_{\mu,\rho}(x_c) - P_{\mu,\rho}(x^{\star})$. If $\Delta P^{\star}<0$, update the __best__ solution: $x^{\star}\gets{x}_{c}$.
2. Update the penalty parameters: $\mu\gets \tau_\mu\,\mu$ and $\rho\gets \tau_\rho\,\rho$.
3. Check for convergence: 
   - If $T \leq T_{\text{min}}$, set $\texttt{converged} \gets \texttt{true}$ and return the best solution $x^{\star}$.
   - Otherwise, update the temperature: $T \gets \alpha T$
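
A compact sketch of these steps is shown below. It assumes a callable `P(x, mu, rho)` that returns `np.inf` when an inequality constraint is violated; the function names, default parameter values, fixed random seed, and test problem are illustrative choices rather than part of the algorithm above.

```python
import numpy as np

def simulated_annealing(P, x0, mu=100.0, rho=1.0, T0=1.0, alpha=0.9, K=200,
                        T_min=1e-3, beta=0.1, tau_mu=0.99, tau_rho=0.95, seed=0):
    """Minimize P(x, mu, rho) by simulated annealing; returns the best point found."""
    rng = np.random.default_rng(seed)
    x_c = np.asarray(x0, dtype=float)
    if not np.isfinite(P(x_c, mu, rho)):
        raise ValueError("x0 must be strictly feasible so that P(x0) is finite")
    x_best = x_c.copy()
    T = T0
    while True:
        # Re-evaluate under the current (mu, rho), since P changes when they change
        P_c, P_best = P(x_c, mu, rho), P(x_best, mu, rho)
        for _ in range(K):
            x_new = x_c + beta * rng.standard_normal(x_c.shape)  # candidate
            P_new = P(x_new, mu, rho)
            dP = P_new - P_c
            # Downhill moves are always accepted; uphill with probability exp(-dP/T).
            # Infeasible candidates (dP = inf) are rejected.
            if np.isfinite(dP) and (dP < 0 or rng.random() <= np.exp(-dP / T)):
                x_c, P_c = x_new, P_new
            if P_c < P_best:
                x_best, P_best = x_c.copy(), P_c
        mu, rho = tau_mu * mu, tau_rho * rho  # penalty parameter update
        if T <= T_min:
            return x_best
        T *= alpha  # cool down

# Hypothetical problem: min x1^2 + x2^2  s.t.  -x1 <= 0,  x1 + x2 - 1 = 0
def P(x, mu, rho):
    g = np.array([-x[0]])
    h = np.array([x[0] + x[1] - 1.0])
    if np.any(g >= 0):
        return np.inf
    return x @ x - np.sum(np.log(-g)) / mu + np.sum(h**2) / (2.0 * rho)

x_best = simulated_annealing(P, x0=[1.0, 1.0])
print(x_best)
```

Because $\mu$ and $\rho$ change between temperature levels, the sketch re-evaluates $P_{\mu,\rho}$ at both the current and the best point at the start of each level, so the comparison always uses the same objective.
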


### Selecting $T_{\circ}$ and $K$
A common question is how to choose the initial temperature $T_\circ$ and the number of iterations $K$ per temperature. Here are two standard heuristics:


#### Sample-and-set approach for $T_\circ$
Let's look at a simple algorithm for selecting a $T_\circ$ value, called the _sample-and-set_ approach:
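1. From the initial solution $x_0$ (and default values for the other parameters), generate $M={100}$ random neighbors $x^{\prime}_{1},\dots,x^{\prime}_{M}$ and compute their cost changes $\Delta{P}_{j}\gets P_{\mu,\rho}(x^{\prime}_{j}) - P_{\mu,\rho}(x_0)$ for $j=1,\dots,M$.
2. Find the indices of the uphill moves, $J^{+}\gets\left\{j: \Delta{P}_{j}>0,\,j=1,\dots,M\right\}$, and compute the mean _uphill_ cost change: $\overline{\Delta P}_{+}\gets\texttt{mean}\left\{\Delta{P}_{j}\right\}_{j\in J^{+}}$.
3. Choose an initial desired acceptance rate $p_{\circ}\in(0.6,0.9)$.
4. Set $T_{\circ}\gets{-{\overline{\Delta P}_{+}}/{\,\ln{p_{\circ}}}}$, so that an average uphill move is accepted with probability $\exp(-\overline{\Delta P}_{+}/T_{\circ}) = p_{\circ}$ at the initial temperature.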
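
The four steps translate directly into a short function. In this sketch, neighbors are drawn with the same Gaussian perturbation used for candidate moves in the annealing loop; the function name, the single-argument objective `P(x)` (with $\mu$ and $\rho$ held fixed), and the defaults are assumptions for illustration.

```python
import numpy as np

def initial_temperature(P, x0, beta=0.1, M=100, p0=0.8, seed=0):
    """Sample-and-set estimate of T0 from the mean uphill cost change around x0."""
    rng = np.random.default_rng(seed)
    x0 = np.asarray(x0, dtype=float)
    P0 = P(x0)
    # Step 1: cost changes of M random neighbors of x0
    dP = np.array([P(x0 + beta * rng.standard_normal(x0.shape)) - P0
                   for _ in range(M)])
    # Step 2: keep the finite uphill moves and average them
    uphill = dP[np.isfinite(dP) & (dP > 0)]
    if uphill.size == 0:
        raise ValueError("no finite uphill moves sampled; increase M or beta")
    # Steps 3-4: T0 = -mean(uphill) / ln(p0); ln(p0) < 0, so T0 > 0
    return -uphill.mean() / np.log(p0)

# Hypothetical objective (a simple bowl) used only to exercise the function
P_demo = lambda x: x @ x
T0_low = initial_temperature(P_demo, [1.0, 1.0], p0=0.6)
T0_high = initial_temperature(P_demo, [1.0, 1.0], p0=0.9)
print(T0_low, T0_high)
```

With the same random samples, a higher target acceptance rate $p_\circ$ yields a higher initial temperature, as the formula suggests.
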

#### Heuristic for choosing $K$
Next, let's sketch out a simple heuristic for choosing $K$ that updates the number of iterations per temperature as the simulated annealing algorithm proceeds.
1. Initially, choose $K\gets{c}\cdot\texttt{size}(x)$, where $c\in\left[10,50\right]$.
2. After each temperature, compute the fraction of _accepted_ moves $\hat{p}$.
    - If $\hat{p}>0.8$, decrease $K\gets\lceil 0.75K\rceil$ (rounded up to the next integer).
    - If $\hat{p}<{0.2}$, increase $K\gets\lceil 1.5K\rceil$ (rounded up to the next integer).
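
As a small sketch, the update rule can be written as a function that is called once per temperature level. The names `initial_K` and `update_K` and the default $c = 20$ are hypothetical.

```python
import math

def initial_K(n, c=20):
    """Initial iterations per temperature: K = c * n, with c chosen in [10, 50]."""
    return c * n

def update_K(K, p_hat):
    """Adapt K from the fraction of accepted moves p_hat at the last temperature."""
    if p_hat > 0.8:
        return math.ceil(0.75 * K)  # most moves accepted: spend fewer iterations
    if p_hat < 0.2:
        return math.ceil(1.5 * K)   # few moves accepted: spend more iterations
    return K                        # otherwise leave K unchanged

K = initial_K(n=5)                  # 100 iterations for a 5-dimensional problem
K_after_hot = update_K(K, p_hat=0.9)
K_after_cold = update_K(K, p_hat=0.1)
```
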


___

## Lab
In lab `L9d`, we'll use gradient descent to solve an unconstrained logistic regression binary classification problem using the banknote dataset.


## Summary

In this notebook, we've explored the foundations of constrained and unconstrained nonlinear optimization through both gradient-based and heuristic approaches:

> __Key takeaways:__
>
> 1. **KKT conditions as optimality criteria**: The Karush–Kuhn–Tucker conditions provide necessary conditions for optimality in constrained nonlinear problems, generalizing Lagrange multipliers to handle both equality and inequality constraints. These conditions, stationarity, primal feasibility, dual feasibility, and complementary slackness, form the theoretical foundation for identifying optimal solutions, and when the problem is convex, they become sufficient conditions for global optimality.
> 2. **Gradient descent with barrier and penalty methods**: For constrained problems, gradient descent can be adapted by augmenting the objective function with logarithmic barrier terms for inequality constraints and quadratic penalty terms for equality constraints. The penalty parameters are gradually decreased over iterations to enforce stricter adherence to constraints, allowing the algorithm to approach feasible optimal solutions while maintaining computational tractability.
> 3. **Simulated annealing for global optimization**: As a derivative-free heuristic method, simulated annealing probabilistically accepts both improving and non-improving moves controlled by a temperature parameter, enabling escape from local optima. The algorithm's effectiveness depends on careful selection of the initial temperature (via sample-and-set heuristics based on uphill move distributions) and adaptive iteration counts per temperature level (adjusted based on acceptance rates).

These optimization methods provide the essential toolkit for solving complex constrained and unconstrained problems across engineering and machine learning applications.

___