# Optimization - Duality

> Lagrangian Duality, Weak and Strong Duality, Complementary Slackness, Farkas' Lemma, Separation Arguments and Theorems of the Alternative

- hide: false
- use_math: true
- toc: true
- badges: true
- comments: true
- categories: ['Optimization','Applied Mathematics','Proofs']
- image: images/lp-duality.png

# Introduction

Every linear program, and every optimization problem in general, has a closely related problem called its dual which can be colloquially thought of as its evil twin. The primal and the dual represent two different perspectives on the same problem. 

In the most general case, if the primal is a minimization problem, its dual is a maximization problem. In the case of constrained optimization, if the primal is a minimization in $n$ variables and $m$ constraints, its dual is a maximization in $m$ variables and $n$ constraints. 

Furthermore, any value attained by the dual problem is a lower bound of all the values attained by the primal and, in particular, the primal optimal value. This property, called *Weak Duality*, is at the core of duality. Deriving a dual that obtains, at the very least, a useful lower bound to the primal optimal value is one of the nascent ideas behind *Duality Theory*.

In the case of problems which exhibit *Strong Duality*, such as linear programs and most convex non-linear optimization problems, the primal and the dual optima are strictly equal. That is, solving the dual guarantees that we've also solved the primal. Furthermore, since taking the dual of the dual gives back the primal, this relationship is true in the converse — if we've solved the primal then we've also solved its dual.

This is what makes Duality Theory useful in practice. Having a related, possibly easier, optimization problem gives applied scientists a huge computational advantage. Even if the dual does not turn out to be easier to solve and/or Strong Duality does not hold, we still stand to gain structural insight about the problem.

In this post, we will show how to form the dual of a problem, examine its relationship with the primal in detail, and list possible primal-dual outcomes. In doing so, we will look at duality in the general case of constrained optimization problems, in a specific type of unconstrained problem, and in linear programs.

# Deriving the Dual of a Constrained Problem

Let's first focus on deriving the dual of constrained optimization problems. 

As we shall see later, certain types of unconstrained problems have duals which arise from the [Fenchel-Legendre Transform](https://en.wikipedia.org/wiki/Convex_conjugate). However, in constrained problems, it is the constraints themselves which give rise to duality through the [Lagrangian](https://en.wikipedia.org/wiki/Lagrangian_relaxation).  

Take the most general form of constrained problem with $m$ inequality and $p$ equality constraints and assume nothing, as of yet, about its convexity. To make the discussion interesting, assume the problem is non-trivial (i.e. its constraint set is non-empty and contains more than one feasible point). Furthermore, so that we may have a solution to speak of, assume the problem is bounded with the finite optimal value of $f_0(x^*)$ for some optimizer $x^*$.

$$
\begin{cases}
\min_x: f_0(x)
\\
s.t.: \begin{aligned} &f_i(x) \leq 0 \ \ i = 1, ...,m
\\ 
&h_i(x) = 0 \ \ i = 1, ... ,p
\end{aligned}
\end{cases}
$$

The idea is to penalize infeasible choices of $x$ using functions that express our *displeasure* for certain choices. 

At first we use the *infinitely-hard* penalty functions $\mathbb{1}_-$ and $\mathbb{1}_0$ which are defined as follows:

$$\mathbb{1}_-(u) = 
\begin{cases}
\begin{aligned} 
&0  &\textrm{if} \ u \leq 0
\\
&\infty  &\textrm{if} \ u > 0
\end{aligned}
\end{cases}$$

$$\mathbb{1}_0(u) = 
\begin{cases}
\begin{aligned} 
&0  &\textrm{if} \ u = 0
\\
&\infty  &\textrm{if} \ u \ne 0
\end{aligned}
\end{cases}$$

Then the equivalent unconstrained problem is: 
$$\min_x: \mathcal{J}(x)$$

where $\mathcal{J}(x) = f_0(x) + \sum_{i=1}^m \mathbb{1}_-(f_i(x)) + \sum_{i=1}^p \mathbb{1}_0(h_i(x))$, which can also be expressed as:

$$\mathcal{J}(x) = \begin{cases}\begin{aligned} 
&f_0(x) \ \ \textrm{if $x$ is feasible}
\\
&\infty \ \ \textrm{otherwise}
\end{aligned}\end{cases}$$

Informally, if $\hat x$ is chosen s.t. *any* of the constraints is broken then the minimization incurs an infinitely positive penalty. Therefore, such a $\hat x$ will never be selected over a solution $x$ that gives a finite value $f_0(x)$. Moreover, since $f_0(x^*) \leq f_0(x)$ for all feasible $x$ by optimality of $x^*$ in the original problem, the optimal value will be $f_0(x^*)$.

Hence: 
$$\min_x \mathcal{J}(x) = f_0(x^*) \tag{1}$$
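
To make the penalty construction concrete, here is a minimal Python sketch. The toy problem (minimize $x^2$ subject to $1 - x \leq 0$), the function names, and the search grid are all assumptions made for illustration. Its constrained optimum is $x^* = 1$ with $f_0(x^*) = 1$.

```python
import math

# Hypothetical toy problem (not from the post): minimize x^2 subject to 1 - x <= 0.
def f0(x):
    return x * x

def f1(x):
    return 1.0 - x

def hard_penalty(u):
    """The infinitely-hard penalty: 0 when u <= 0, +infinity otherwise."""
    return 0.0 if u <= 0 else math.inf

def J(x):
    """Unconstrained objective: f0 plus the penalty on the constraint value."""
    return f0(x) + hard_penalty(f1(x))

# Brute-force search on a grid over [-3, 3]; the grid is an illustrative assumption.
grid = [i / 100 for i in range(-300, 301)]
x_best = min(grid, key=J)
print(x_best, J(x_best))  # grid minimizer of J and its value
```

Every infeasible grid point evaluates to infinity, so the search can only return a feasible point, which is the content of equation $(1)$.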

Furthermore, since the minimizer $x^*$ for the original problem is feasible, $\mathcal{J}(x^*) = f_0(x^*)$ by definition. That is:
$$
\begin{aligned}
\mathcal{J}(x^*) &= \min_x \mathcal{J}(x) \\
&\textrm{or, equivalently} \\
x^* &= \arg \min_x \mathcal{J}(x)
\end{aligned} \tag{2}
$$

The first identity (equation $(1)$) shows that it suffices to minimize the unconstrained objective $\mathcal{J}(x)$ instead of the original problem since doing so results in $f_0(x^*)$, the optimal value of the original. The second identity (equation $(2)$), on the other hand, says that a minimizer of the unconstrained problem $\mathcal{J}(x)$ is also a minimizer of the original problem. This turns out to be important, since minimizing an unconstrained problem is typically easier.

As we know, the local minima of unconstrained problems occur at their stationary points which can be identified using the *gradient optimality condition*. Once identified, a global minimizer can be observed by evaluating the objective at each stationary point.

But we cannot find the gradient of $\mathcal{J}(x)$ and set it to zero because the infinitely-hard penalty functions are discontinuous and, therefore also, non-differentiable. That is, $\nabla \mathcal{J}(x)$ simply does not exist.

To sidestep this difficulty we use linear relaxations instead of $\mathbb{1}_-$ and $\mathbb{1}_0$. 

## The Lagrangian, Dual Variables, and the Dual Function

The Lagrangian linear relaxation, sometimes simply referred to as the *Lagrangian*, is: 

$$\mathcal{L}(x,\lambda,\mu) = f_0(x) + \sum_{i=1}^m \lambda_i f_i(x) + \sum_{i=1}^p \mu_i h_i(x) \ \ \textrm{where} \ \lambda \geq 0$$

We call the $\lambda_i$'s the *Lagrange multipliers* corresponding to the inequality constraints, and the $\mu_i$'s the *Lagrange multipliers* corresponding to the equality constraints. The vectors $\lambda$ and $\mu$ are called the *Lagrange multiplier vectors* or, for reasons that will soon become apparent, the *dual variables*. 

> Note: In many sources, the Lagrangian is simply stated as $\mathcal{L}(x,\lambda) = f_0(x) + \sum_{i=1}^m \lambda_i f_i(x)$. Indeed, by breaking down the equality constraints $h_i(x) = 0$ into $h_i(x) \leq 0$ and $-h_i(x) \leq 0$, we can transform a problem with equality constraints into one with only inequality constraints. So, the latter representation of the Lagrangian is still general enough to account for equality constraints.
<br>

## A Lagrangian Lower Bound on the Optimal

Not only does the Lagrangian augment a constrained problem into an unconstrained problem that's solvable by using the method of unconstrained optimization, it also gives us the *dual problem*.

The first thing to note about the Lagrangian is that $\lambda \geq 0$ is necessary. This is because, in the event that an inequality constraint is violated, say $f_i(x) > 0$ for some $i$, the corresponding $\lambda_i$ must be non-negative in order to apply a positive penalty for the violation. This extends to all the entries of $\lambda$, resulting in the $\lambda \geq 0$ coordinate-wise condition.

On the other hand, $\mu$ is free to assume any value since the equality constraints can be violated in either direction and both scenarios must be penalized.

The second thing to note about $\mathcal{L}$ is that, even though we apply a positive penalty that scales linearly in the severity of the violation, this penalty is *still* not as severe as the infinite penalty we were applying in $\mathcal{J}$. Also, in the Lagrangian, we actually *reward* feasible choices of $x$ that have margin. That is, in the event that $f_i(x) < 0$ strictly for some $i$, $\lambda_i f_i(x)$ is a non-positive reward for the minimization problem. 

All of this is to say that the Lagrangian is a lower bound of the unconstrained problem with the infinitely-hard penalty. That is:

$$\mathcal{L}(x,\lambda,\mu) \leq \mathcal{J}(x) \ \ \forall x, \lambda \geq 0, \mu$$ 

This can also be seen by noticing that $\lambda_i f_i(x) \leq \mathbb{1}_-(f_i(x))$ and $\mu_i h_i(x) \leq \mathbb{1}_0(h_i(x))$ for all $i$. 
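
The bound can be sanity-checked numerically. The sketch below assumes a hypothetical one-constraint problem (minimize $x^2$ subject to $1 - x \leq 0$) and compares the Lagrangian with the hard-penalty objective on a small grid of $x$ and $\lambda \geq 0$ values. The grids are arbitrary choices.

```python
import math

# Hypothetical toy problem: minimize x^2 subject to 1 - x <= 0.
def f0(x):
    return x * x

def f1(x):
    return 1.0 - x

def J(x):
    """Hard-penalty objective: f0 on the feasible set, +infinity elsewhere."""
    return f0(x) if f1(x) <= 0 else math.inf

def lagrangian(x, lam):
    """Linear relaxation of the penalty, valid for lam >= 0."""
    return f0(x) + lam * f1(x)

xs = [i / 10 for i in range(-30, 31)]    # x in [-3, 3]
lams = [j / 2 for j in range(0, 21)]     # lam in [0, 10]
bound_holds = all(lagrangian(x, lam) <= J(x) for x in xs for lam in lams)
print(bound_holds)
```

A grid check like this is no proof, but it shows both cases at work: at feasible points the multiplier term is non-positive, and at infeasible points the right-hand side is infinite.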

But then:

$$\min_x \mathcal{L}(x,\lambda,\mu) \leq \mathcal{J}(x) \ \ \forall x, \lambda \geq 0, \mu \tag{1}$$

In particular, if $x^*$ is an optimal solution to the original problem, we have:

$$\min_x \mathcal{L}(x,\lambda,\mu) \leq \mathcal{J}(x^*) \ \ \forall \lambda \geq 0, \mu$$

But $\mathcal{J}(x^*) = f_0(x^*) = p^*$, so we have:

$$\min_x \mathcal{L}(x,\lambda,\mu) \leq p^* \ \ \forall \lambda \geq 0, \mu \tag{2}$$

Designating the original problem as the *primal*, we call $g(\lambda, \mu) := \min_x \mathcal{L}(x, \lambda, \mu)$ the *dual function* since it represents a lower bound on the primal optimal value. 

Then inequality $(2)$ becomes our first formulation of Weak Duality.

$$g(\lambda, \mu) \leq p^* \ \ \forall \lambda \geq 0, \mu \tag{WD 1}$$
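
As a worked illustration, take the hypothetical problem of minimizing $x^2$ subject to $1 - x \leq 0$, for which $p^* = 1$. Its Lagrangian $x^2 + \lambda(1 - x)$ is minimized at $x = \lambda/2$, which gives $g(\lambda) = \lambda - \lambda^2/4$. The sketch below compares this closed form with a brute-force minimization over an assumed grid and checks the bound $g(\lambda) \leq p^*$.

```python
def lagrangian(x, lam):
    return x * x + lam * (1.0 - x)

def g_closed_form(lam):
    """Dual function derived by hand: the minimizer over x is x = lam / 2."""
    return lam - lam * lam / 4.0

def g_numeric(lam):
    """Brute-force minimum of the Lagrangian over a grid on [-5, 5]."""
    xs = [i / 1000 for i in range(-5000, 5001)]
    return min(lagrangian(x, lam) for x in xs)

p_star = 1.0                              # primal optimum of the toy problem
lams = [j / 4 for j in range(0, 33)]      # lam in [0, 8]
weak_duality_holds = all(g_closed_form(lam) <= p_star for lam in lams)
max_gap = max(abs(g_numeric(lam) - g_closed_form(lam)) for lam in lams)
print(weak_duality_holds, max_gap)
```

The bound is tight only at $\lambda = 2$, where $g(2) = 1 = p^*$. Every other multiplier gives a strictly weaker lower bound.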

From here, we move to define the *dual problem*.

## The Lagrange Dual Problem

It's natural, at this point, to ask what the *tightest* lower bound $d^*$ on $p^*$ is. That is, what's the largest $d^*$ s.t. $d^* \leq p^*$? 

This amounts to finding the values $\lambda^* \geq 0$, and $\mu^*$ for which $g(\lambda^*, \mu^*)$ is maximized. We call this the *Lagrange dual problem* (or simply the *dual problem*).

As a general optimization problem, it can be stated as:

$$
\begin{cases}
\max_{\lambda, \mu}: g(\lambda, \mu)
\\
s.t.: \lambda \geq  0
\end{cases}
$$

Looking at the above, we can see why the Lagrange multipliers are referred to as the *dual variables* — they simply end up being the variables of the dual problem.
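
One simple way to attack the dual problem numerically is projected gradient ascent. The sketch below assumes a hypothetical toy problem (minimize $x^2$ subject to $1 - x \leq 0$), whose dual function has the closed form used in the code. The step size and iteration count are arbitrary choices.

```python
def dual_function(lam):
    """g(lam) for the toy problem: x^2 + lam * (1 - x) is minimized at x = lam / 2."""
    x = lam / 2.0
    return x * x + lam * (1.0 - x)

def dual_gradient(lam):
    """Derivative of g; for this example it equals the constraint value 1 - x at x = lam / 2."""
    return 1.0 - lam / 2.0

lam = 0.0
step = 0.5
for _ in range(200):
    # Ascent step followed by projection onto the constraint lam >= 0.
    lam = max(0.0, lam + step * dual_gradient(lam))

d_star = dual_function(lam)
print(lam, d_star)
```

The iterates converge to $\lambda^* = 2$ with $d^* = 1$, which matches the primal optimal value of this particular problem.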

# Weak Duality

We've already stated Weak Duality as $(WD \ 1)$. In this section, we give a few alternate formulations.

If $x^*$ is primal optimal, and $\lambda^* \geq 0$, $\mu^*$ are dual optimal then we have Weak Duality in terms of the tightest lower bound as:

$$
\begin{aligned} 
g(\lambda^*, \mu^*) &\leq \mathcal{J}(x^*) \\
&\textrm{or} \\ 
d^* &\leq p^*
\end{aligned} \tag{WD 2} 
$$

## The Max-Min Characterization

Since $g(\lambda, \mu) := \min_{x} \mathcal{L}(x, \lambda, \mu)$, the dual optimal value, as can be seen from the dual problem, is:
$$d^* = \max_{\lambda \geq 0, \mu} \left\{ \min_x \mathcal{L} (x, \lambda, \mu) \right\} \tag{3}$$ 

We will now see that the primal optimal can be similarly expressed as:

$$p^* = \min_x \left\{ \max_{\lambda \geq 0, \mu} \mathcal{L} (x, \lambda, \mu) \right\} \tag{4}$$

To see this note that, for a fixed $x$, maximizing the Lagrangian over $\lambda \geq 0$ and $\mu$ recovers $\mathcal{J}(x)$ — the unconstrained problem with the infinitely-hard penalty. 
That is: 

$$\max_{\lambda \geq 0, \mu} \mathcal{L}(x, \lambda, \mu) = \mathcal{J}(x) \ \ \forall x \tag{5}$$

To prove this claim, fix $x$ and consider two possibilities. If all inequality constraints are respected, that is $f_i(x) \leq 0$ $\forall i$, then, in order to maximize $\mathcal{L}$, the best we can do is set $\lambda_i = 0$ $\forall i$, so the inequality terms contribute nothing beyond $f_0(x)$. In the case when *any* inequality constraint is violated, that is $f_i(x) > 0$ for some $i$, the result of maximizing $\mathcal{L}$ is $\infty$ by choosing $\lambda_i \rightarrow \infty$ and $\lambda_j = 0$ $\forall j \ne i$. 

Using similar logic, if all equality constraints are respected then $h_i(x) = 0$ $\forall i$. In this case any choice of $\mu$ leaves the value unchanged, in agreement with $\mathcal{J}(x) = f_0(x)$. If, on the other hand, some equality constraint is violated then $h_i(x) \ne 0$ for some $i$. By choosing $\mu_i \rightarrow \pm \infty$, where the sign matches that of $h_i(x)$, the result can be made $\infty$.

Minimizing equation $(5)$ over $x$ yields: 

$$ \min_x \mathcal{J}(x) = \min_x \left\{ \max_{\lambda \geq 0, \mu} \mathcal{L}(x, \lambda, \mu) \right\}$$

But $\min_x \mathcal{J}(x) = p^*$, concluding the proof of claim $(4)$.
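
The two cases in the proof of $(5)$ can be seen numerically by capping $\lambda$ and letting the cap grow. The sketch below uses a hypothetical toy problem (minimize $x^2$ subject to $1 - x \leq 0$). The caps and test points are arbitrary.

```python
def f0(x):
    return x * x

def f1(x):
    return 1.0 - x

def lagrangian(x, lam):
    return f0(x) + lam * f1(x)

def capped_max(x, cap):
    """Maximum of the Lagrangian over lam in {0, 1, ..., cap}."""
    return max(lagrangian(x, lam) for lam in range(cap + 1))

feasible_x, infeasible_x = 2.0, 0.0
feasible_values = [capped_max(feasible_x, cap) for cap in (10, 100, 1000)]
infeasible_values = [capped_max(infeasible_x, cap) for cap in (10, 100, 1000)]
print(feasible_values, infeasible_values)
```

At the feasible point the maximum stays at $f_0(x) = 4$ however large the cap is, while at the infeasible point it grows without bound as the cap increases.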

This gives us a way to express Weak Duality in a more symmetric way.

$$\max_{\lambda \geq 0, \mu} \left\{ \min_x \mathcal{L} (x, \lambda, \mu) \right\} \leq \min_x \left\{ \max_{\lambda \geq 0, \mu} \mathcal{L} (x, \lambda, \mu) \right\} \tag{WD 3}$$

## Max-Min Inequality and its Interpretations

Weak Duality can also be derived through a non-optimization lens using the quantities

$$\max_{\lambda \geq 0, \mu} \left\{ \min_x \mathcal{L} (x, \lambda, \mu) \right\} \ \ \textrm{and} \ \ \min_x \left\{ \max_{\lambda \geq 0, \mu} \mathcal{L} (x, \lambda, \mu) \right\}$$

by using the [max-min inequality](https://en.wikipedia.org/wiki/Max%E2%80%93min_inequality).

### The Max-Min Inequality

The max-min inequality makes no assumptions about the function. It's simply true for all functions of the form $f: X \times Y \rightarrow \mathbb{R}$, and it states that:
$$
\inf_{y\in Y} \left\{ \sup_{x\in X} f(x,y) \right\} \geq \sup_{x\in X} \left\{ \inf_{y\in Y} f(x,y) \right\}
$$

Since no assumption is made on $f$, the inequality certainly also holds for $\mathcal{L}$. And, since we're in the special case where the optima are assumed to exist, the functions attain their optima. That is, we can replace $\sup$ and $\inf$ in the inequality with $\max$ and $\min$. This results in the symmetric formulation of Weak Duality as stated above.

All that remains in establishing Weak Duality is to prove the max-min inequality.

For any $f$, and $x \in X$, $y \in Y$ we have:
$$f(x,y) \geq \min_x f(x,y) \ \ \forall y$$
The right hand side is now only a function of $y$, so maximizing both sides w.r.t. $y$ yields: 
$$ \max_y f(x,y) \geq \max_y \left\{ \min_x f(x,y) \right\} \ \ \forall x$$
The right hand side is now a constant, so minimizing both sides w.r.t. $x$ results in the desired conclusion.
$$\min_x \left\{ \max_y f(x,y) \right\} \geq \max_y \left\{ \min_x f(x,y) \right\}$$
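
Since the inequality holds for every function, it must hold in particular for arbitrary finite tables of numbers. The sketch below is an illustration rather than a proof: it draws random tables, with $f(x, y)$ stored as `table[x][y]`, and checks that the gap between the two sides is never negative. The table sizes and seed are assumptions.

```python
import random

random.seed(0)  # fixed seed so the check is reproducible

def max_min_gap(table):
    """Return min_x max_y f - max_y min_x f for f(x, y) = table[x][y]."""
    n_rows, n_cols = len(table), len(table[0])
    min_max = min(max(table[x][y] for y in range(n_cols)) for x in range(n_rows))
    max_min = max(min(table[x][y] for x in range(n_rows)) for y in range(n_cols))
    return min_max - max_min

gaps = []
for _ in range(1000):
    table = [[random.uniform(-1.0, 1.0) for _ in range(5)] for _ in range(4)]
    gaps.append(max_min_gap(table))

print(min(gaps) >= 0.0)
```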


### Game-Theoretic Interpretation

An intuitive way to see the validity of the max-min inequality comes from Game Theory.

Suppose two players $X$, and $Y$, are playing a game in which player $X$'s goal is to minimize the score $f(X,Y)$ whereas player $Y$'s goal is to maximize it. Suppose, per the rules of the game, player $X$ has the first turn. Player $X$ plays $x$ with no knowledge of how player $Y$ will respond. Player $Y$, then, plays $y(x)$ which is a choice informed by that of player $X$. So player $Y$ has a clear tactical advantage in this game. 

The quintessential example is the game of *Rock, Paper, Scissors*. The simultaneous game is fair. However, if one player gets to see the other player's choice first, they will win every time.

Conversely, if a second game is played such that player $Y$ goes first, the advantage lies with player $X$. 

Formally, suppose the game is described by $f(x,y)$ where $x$ and $y$ represent the choices available to players $X$ and $Y$ respectively. 

The score of the first game will be $\min_x \left\{ \max_y f(x,y) \right\}$. 

Similarly, in the second game the score will be $\max_y \left\{ \min_x f(x,y) \right\}$.

Since player $Y$, whose goal is to maximize the score, has an advantage in the first game, we have the max-min inequality 

$$\min_x \left\{ \max_y f(x,y) \right\} \geq \max_y \left\{ \min_x f(x,y) \right\}$$
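
The *Rock, Paper, Scissors* example can be worked out explicitly. In the sketch below the scoring convention is an assumption: $+1$ when player $Y$ wins, $-1$ when player $X$ wins, and $0$ on a tie. We then compute the score of both sequential games.

```python
MOVES = ["rock", "paper", "scissors"]
BEATS = {"rock": "scissors", "paper": "rock", "scissors": "paper"}  # key beats value

def score(x, y):
    """Score from Y's point of view: +1 if y beats x, -1 if x beats y, 0 on a tie."""
    if BEATS[y] == x:
        return 1
    if BEATS[x] == y:
        return -1
    return 0

# First game: X commits first, then Y responds with full knowledge of x.
x_moves_first = min(max(score(x, y) for y in MOVES) for x in MOVES)
# Second game: Y commits first, then X responds with full knowledge of y.
y_moves_first = max(min(score(x, y) for x in MOVES) for y in MOVES)
print(x_moves_first, y_moves_first)  # 1 -1
```

Whoever moves second always wins, so the score is $1$ when $X$ moves first and $-1$ when $Y$ moves first, and the gap between the two games is as large as it can be.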

# Strong Duality

Strong Duality is the case in which the primal and the dual optima agree. That is: 

$$
\begin{aligned}
g(\lambda^*, \mu^*) &= \mathcal{J}(x^*) \\
&\textrm{or} \\
p^* &= d^*
\end{aligned} \tag{SD 1}
$$

Alternatively, in its max-min characterization:

$$\max_{\lambda \geq 0, \mu} \left\{ \min_x \mathcal{L} (x, \lambda, \mu) \right\} = \min_x \left\{ \max_{\lambda \geq 0, \mu} \mathcal{L} (x, \lambda, \mu) \right\} \tag{SD 2}$$

Optimization problems that exhibit this property are called Strongly Dual. 

As mentioned in the introduction, Strong Duality gives us the ability to solve a related, possibly easier, optimization problem. As we shall see it also gives us powerful optimality conditions. So, knowing in advance whether or not a problem is Strongly Dual will be beneficial to us.

All linear programs are Strongly Dual as we will prove. When it comes to non-linear optimization, however, Strong Duality is not guaranteed. The good news is that sufficient conditions for Strong Duality exist and we will give them shortly.

## An Easier Dual Problem 

As mentioned briefly in the introduction, Strong Duality gives us the option of solving the original optimization problem in a different, possibly easier way. Let's qualify this statement further.

The original, possibly non-convex, problem was that of finding the primal optimal value $p^* = \min_x \left\{ \max_{\lambda \geq 0, \mu} \mathcal{L} (x, \lambda, \mu) \right\}$.

This amounts to fixing some $x$, and then maximizing $\mathcal{L}$ over $\lambda \geq 0$, $\mu$. As we've mentioned before this recovers $\mathcal{J}(x)$ which is a non-differentiable function.

Meanwhile, the dual problem is that of finding the dual optimal value $d^* = \max_{\lambda \geq 0, \mu} \left\{ \min_x \mathcal{L} (x, \lambda, \mu) \right\}$.

This amounts to fixing some $\lambda \geq 0$, $\mu$, and then minimizing $\mathcal{L}$ over $x$.

Minimizing the Lagrangian over $x$ may still be difficult, but at least it lends itself to using the method of unconstrained optimization. Furthermore, the resulting dual function $g(\lambda, \mu) = \min_x \mathcal{L}(x, \lambda, \mu)$ is *always* easy to maximize over $\lambda \geq 0$ and $\mu$, since it's a concave objective with a linear constraint $\lambda \geq 0$. The reason the objective is concave, of course, is that it's the pointwise minimum of functions that are affine in $(\lambda, \mu)$, one for each fixed $x$. 

If Strong Duality holds then we've just found an easier approach to the original problem. To see why, recall that:

$$g(\lambda, \mu) := \min_x \mathcal{L}(x, \lambda,\mu) \\ \textrm{and} \\ \mathcal{J}(x) = \max_{\lambda \geq 0, \mu} \mathcal{L}(x, \lambda, \mu) \ \ \textrm{see $(5)$}$$

Then:

$$g(\lambda, \mu) \leq \mathcal{L}(x, \lambda, \mu) \leq \mathcal{J}(x) \ \ \forall x, \lambda \geq 0, \mu \tag{6}$$

The point-wise inequality $(6)$ is true in general since, for any $f: X \times Y \rightarrow \mathbb{R}$, $\min_x f(x,y) \leq f(x,y) \leq \max_y f(x,y)$. It is also, in particular, true for the primal-dual optimal pair $x^*$, and $(\lambda^*, \mu^*)$. That is: 

$$g(\lambda^*, \mu^*) \leq \mathcal{L}(x^*, \lambda^*, \mu^*) \leq \mathcal{J}(x^*)$$

However, by $(SD \ 1)$, $g(\lambda^*, \mu^*) = \mathcal{J}(x^*)$.

This forces $\mathcal{L}(x^*, \lambda^*, \mu^*) = g(\lambda^*, \mu^*) = \min_x \mathcal{L}(x, \lambda^*, \mu^*)$, which says exactly that $x^*$ is a minimizer of $\mathcal{L}(x, \lambda^*, \mu^*)$ over $x$. 

So, once we have $\lambda^*$ and $\mu^*$ which, as mentioned above, are easy to find, we can use the method of unconstrained optimization on $\mathcal{L}(x,\lambda^*, \mu^*)$ to find $x^*$. This is the way to solve the dual and, as we can see, it's an easier problem than solving the primal directly which involves minimizing $\mathcal{J}(x)$.
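
As an illustration of this last step, suppose we already know $\lambda^* = 2$ for the hypothetical problem of minimizing $x^2$ subject to $1 - x \leq 0$. The sketch below recovers $x^*$ by plain gradient descent on $\mathcal{L}(x, \lambda^*)$. The starting point, step size, and iteration count are arbitrary choices.

```python
LAM_STAR = 2.0  # dual optimum of the toy problem, assumed known here

def lagrangian_grad_x(x, lam):
    """Derivative in x of the Lagrangian x^2 + lam * (1 - x)."""
    return 2.0 * x - lam

x = -3.0  # deliberately infeasible starting point
for _ in range(100):
    x -= 0.25 * lagrangian_grad_x(x, LAM_STAR)

print(x, x * x)  # recovered minimizer and its primal objective value
```

No constraint handling is needed in the loop. The multiplier $\lambda^*$ already encodes the constraint, and the iterates converge to $x^* = 1$.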
 
In conclusion, Strong Duality gives us an easier way to solve the original problem. However, if it fails to hold not all hope is lost... By solving the dual problem we can still obtain a useful lower bound for the primal optimal value through Weak Duality.

## Optimality Conditions

Strong Duality yields powerful optimality conditions known as *Complementary Slackness* and the *Stationarity Condition*. These are often collected under the mantle of the [*Karush–Kuhn–Tucker (KKT) Conditions*](https://en.wikipedia.org/wiki/Karush%E2%80%93Kuhn%E2%80%93Tucker_conditions), which are simply Complementary Slackness, the Stationarity Condition, and the feasibility conditions of optimal solutions put together. 

Without further assumptions, the KKT Conditions are necessary, but insufficient, for optimality. However, for convex problems that are Strongly Dual the KKT Conditions become a *certificate of optimality*. That is, they are both necessary and sufficient for optimality.

### Complementary Slackness

Strong Duality gives the following optimality condition.

> **Complementary Slackness:** &nbsp; Suppose $x^*$ and $(\lambda^*, \mu^*)$ are primal-dual optimal for a Strongly Dual problem. Then $\lambda^*_i f_i(x^*) = 0 \ \ \forall i$.
<br>

Informally, if a primal constraint at an optimal $x^*$ is *loose*, that is $f_i(x^*) < 0$, then its corresponding dual variable $\lambda^*_i$ in the optimal dual solution $\lambda^*$ must be zero. Conversely, if the dual variable $\lambda_i^*$ is positive then the corresponding constraint must be *tight*. On the other hand, if a primal constraint is *tight* at $x^*$, Complementary Slackness tells us nothing about its corresponding dual variable. 

Let's prove this result.

Suppose $x^*$ and $(\lambda^*, \mu^*)$ are primal-dual optimal for a Strongly Dual problem as in the hypothesis. 

Then, by $(SD \ 1)$, $g(\lambda^*, \mu^*) = \mathcal{J}(x^*)$.

But since $\mathcal{J}(x^*) = f_0(x^*)$, we have:

$$
\begin{aligned}
f_0(x^*) &= g(\lambda^*, \mu^*) \\ 
&= \min_x \mathcal{L}(x, \lambda^*, \mu^*) \\ 
&\leq \mathcal{L}(x^*, \lambda^*, \mu^*) \\
&=  f_0(x^*) + \sum_{i=1}^m \lambda_i^* f_i(x^*) + \sum_{i=1}^p \mu_i^* h_i(x^*) \\
&\leq f_0(x^*)
\end{aligned}
$$

To see why the last inequality holds, note that $\sum_{i=1}^p \mu_i^* h_i(x^*) = 0$ since $h_i(x^*) = 0 \ \ \forall i$ by feasibility of $x^*$. Then again, by feasibility of $x^*$, $\forall i$ $f_i(x^*) \leq 0$. And since $\lambda^* \geq 0$ by dual feasibility, we have $\sum_{i=1}^m \lambda^*_i f_i(x^*) \leq 0$. 

But taken altogether this says $f_0(x^*) \leq f_0(x^*)$, which can *only* be true if every inequality in the chain holds with equality. Then it must be the case that $\sum_{i=1}^m \lambda^*_i f_i(x^*) = 0$. But, being a sum of non-positive terms, $\sum_{i=1}^m \lambda^*_i f_i(x^*) = 0$ *if and only if* $\lambda^*_i f_i(x^*) = 0 \ \ \forall i$ which is what we wanted to show. 
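
Here is a small numerical illustration with one tight and one loose constraint. The problem is hypothetical: minimize $x^2$ subject to $1 - x \leq 0$ and $x - 5 \leq 0$. Its primal optimum is $x^* = 1$, and the pair $\lambda^* = (2, 0)$ is dual optimal because it closes the duality gap.

```python
# Hypothetical toy problem: minimize x^2 subject to 1 - x <= 0 and x - 5 <= 0.
constraints = [lambda x: 1.0 - x, lambda x: x - 5.0]

def dual_function(lam):
    """g(lam) in closed form: the Lagrangian is minimized at x = (lam1 - lam2) / 2."""
    x = (lam[0] - lam[1]) / 2.0
    return x * x + sum(l * f(x) for l, f in zip(lam, constraints))

x_star = 1.0
lam_star = (2.0, 0.0)

gap = x_star ** 2 - dual_function(lam_star)   # duality gap at this primal-dual pair
products = [l * f(x_star) for l, f in zip(lam_star, constraints)]
slackness_holds = all(p == 0.0 for p in products)
print(gap, slackness_holds)
```

The first constraint is tight and carries a positive multiplier. The second has a margin of $4$, so its multiplier has to vanish.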

#### Economic Intuition of Complementary Slackness

In order to understand what Complementary Slackness means intuitively, the concept of dual variables as *marginal prices* is useful. The dual variable associated with a primal constraint is called the marginal price of that constraint. It represents how much the objective function *would* improve were the constraint relaxed. 

That is, suppose our goal is to maximize profit and a particular constraint represents how much of a resource we are allowed to utilize. The question we're asking ourselves is: if we utilized more of this resource, how much would we gain in profit? 

Complementary Slackness says: if the marginal price (i.e. a dual variable $\lambda^*_i$) is positive then the profit could be increased by utilizing more of this resource. Hence, the constraint corresponding to $\lambda^*_i$ must be tight. Otherwise $x^*$ is not optimal, and a better solution can be found by making the non-binding constraint binding. 

Note that in non-linear optimization this is only local behavior. That is, $\lambda^*_i$ predicts the improvement in the objective function *only* for small-enough change in the amount $f_i(x^*)$ of resource utilized.

We will see why this is the case when we consider the geometric intuition of Complementary Slackness in the case of linear programs.

### Stationarity Condition

Strong Duality also gives another condition of optimality.

> **Stationarity Condition:** &nbsp; Suppose $x^*$ and $(\lambda^*, \mu^*)$ are primal-dual optimal for a Strongly Dual problem. Then $x^*$ is a stationary point of $\mathcal{L}(x, \lambda^*, \mu^*)$. That is: 
$$\nabla_x f_0(x^*) + \sum_{i=1}^m \lambda^*_i\nabla_xf_i(x^*) + \sum_{i=1}^p \mu^*_i\nabla_xh_i(x^*) = 0$$
<br>

This follows from our discussion in the section titled ['An Easier Dual Problem'](https://v-poghosyan.github.io/blog/optimization/applied%20mathematics/proofs/2022/02/07/Optimization-LP-Duality.html#An-Easier-Dual-Problem). To summarize, we found that a primal optimal $x^*$ is a minimizer of the unconstrained, differentiable objective $\mathcal{L}(x, \lambda^*, \mu^*)$ where $(\lambda^*, \mu^*)$ are dual optimal.

So, to find $x^*$, we simply apply the first-order necessary condition to $\mathcal{L}(x, \lambda^*, \mu^*)$ which immediately gives the Stationarity Condition. 

#### Geometric Intuition of Stationarity Condition

PICTURE

#### Generalization of Unconstrained Optimization

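The Stationarity Condition plays the same role as the condition $\nabla f_0(x) = 0$ in unconstrained optimization, except that it restricts attention to feasible directions.

It also reduces to the usual first-order condition when the problem is already unconstrained: with no $f_i$ or $h_i$ terms, it reads $\nabla_x f_0(x^*) = 0$.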

### Karush-Kuhn-Tucker (KKT) Conditions 

As mentioned above, the feasibility conditions together with Complementary Slackness and Stationarity Condition are often collected together into one mantle called the KKT Conditions. 

> **KKT Conditions:** &nbsp; $x^*, (\lambda^*, \mu^*)$ satisfy the KKT conditions if the following hold:
&nbsp;
> 1. $\nabla_x f_0(x^*) + \sum_{i \in I} \lambda^*_i\nabla_xf_i(x^*) + \sum_{i=1}^p \mu^*_i\nabla_xh_i(x^*) = 0$ where $I$ is the set of indices of the active inequality constraints.
> 2. $\lambda^*_if_i(x^*) = 0 \ \ \forall i$
> 3. $f_i(x^*) \leq 0 \ \ \forall i$ 
> 4. $h_i(x^*) = 0 \ \ \forall i$ 
> 5. $\lambda^* \geq 0$
<br>

What we did not mention earlier was that these conditions apply only to problems with differentiable objective and constraints. For the case in which one (or more) of the objective or constraints is not differentiable, there is a subdifferential version of the KKT Conditions. However, this is beyond the scope of this post.

As promised, the KKT Conditions together with Strong Duality give a certificate of optimality.

> **Certificate of Optimality** &nbsp; If the problem is convex and Strong Duality holds, then $x^*, (\lambda^*, \mu^*)$ are primal-dual optimal if and only if they satisfy the KKT conditions. 
<br> 
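
A sketch of how the five conditions can be checked mechanically is given below. It is hard-coded for a hypothetical convex problem: minimize $x_1^2 + x_2^2$ subject to $-x_1 \leq 0$ and $x_1 + x_2 - 1 = 0$. The function name and the tolerance are assumptions. The candidate $x^* = (0.5, 0.5)$, $\lambda^* = 0$, $\mu^* = -1$ passes every check, whereas the feasible point $(1, 0)$ with the same multipliers fails stationarity.

```python
def kkt_report(x, lam, mu, tol=1e-9):
    """Check each KKT condition for: min x1^2 + x2^2  s.t.  -x1 <= 0,  x1 + x2 - 1 = 0."""
    x1, x2 = x
    grad_f0 = (2.0 * x1, 2.0 * x2)
    f1, grad_f1 = -x1, (-1.0, 0.0)             # inequality constraint and its gradient
    h1, grad_h1 = x1 + x2 - 1.0, (1.0, 1.0)    # equality constraint and its gradient
    stationarity = [grad_f0[k] + lam * grad_f1[k] + mu * grad_h1[k] for k in range(2)]
    return {
        "stationarity": all(abs(s) <= tol for s in stationarity),
        "complementary_slackness": abs(lam * f1) <= tol,
        "primal_inequality": f1 <= tol,
        "primal_equality": abs(h1) <= tol,
        "dual_feasibility": lam >= 0,
    }

good = kkt_report((0.5, 0.5), lam=0.0, mu=-1.0)
bad = kkt_report((1.0, 0.0), lam=0.0, mu=-1.0)
print(all(good.values()), bad["stationarity"])
```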

## Slater's Condition - Sufficient Condition for Strong Duality

Even though not all Strongly Dual problems are convex and not all convex programs are Strongly Dual, convexity together with *Slater's Condition* is sufficient for Strong Duality.  
> **Slater's Condition:** &nbsp; $\exists \ \hat x$ s.t. $f_i(\hat x) < 0$, and $h_i(\hat x) = 0$ $\forall i$.
<br>

<br>
> Note: The equality constraints $h_i(\hat x) = 0$ are often stated as linear equality constraints $A \hat x = b$ in certain sources. This is because a convex optimization problem, in its standard form, requires all of the $h_i$'s to be affine.
<br>

Informally, Slater's Condition says that the existence of a feasible point which has margin w.r.t. all the inequality constraints is needed in addition to convexity. In even simpler terms, the region cut out by the inequality constraints must have an interior point that also satisfies the equality constraints. 

The sufficient condition for Strong Duality is then

> **Sufficient Condition for Strong Duality:** &nbsp; Any convex optimization problem satisfying Slater's Condition has Strong Duality.
<br>

The proof of this is beyond what we're trying to achieve in this post. However we motivate it geometrically in the following paragraph. 

### Geometric Intuition Behind Slater's Condition

PICTURE

## Saddle Point Interpretation

In this section, we will give the saddle point interpretation of Strong Duality through the *Saddle Point Theorem* which states the following.

> **Saddle Point Theorem:** &nbsp; If $x^*$ and $(\lambda^*, \mu^*)$ are primal and dual optimal solutions for a convex problem which satisfies Slater's Condition, they form a saddle point of the associated Lagrangian. Conversely, if $(x^*,\lambda^*, \mu^*)$ is a saddle point of a Lagrangian, then $x^*$ is primal optimal, and $(\lambda^*, \mu^*)$ is dual optimal, for the associated Strongly Dual problem.
<br>

<br>

> Note: Convexity together with Slater's Condition is sufficient, but not necessary, for Strong Duality, so the hypothesis of the theorem is stronger than it needs to be. The first direction only uses two facts: $x^*$ minimizes $\mathcal{L}(x, \lambda^*, \mu^*)$ with $\mathcal{L}(x^*, \lambda^*, \mu^*) = \mathcal{J}(x^*)$, as shown in the section titled 'An Easier Dual Problem', and $\mathcal{L}(x^*, \lambda, \mu) \leq \mathcal{J}(x^*)$ for all $\lambda \geq 0$ and $\mu$. Both hold for any Strongly Dual problem whose primal and dual optima are attained, whether or not it is convex. 
<br>

First, let's show that the max-min inequality holds with strict equality at saddle points. This will yield one direction of the Saddle Point Theorem proof.


### Max-Min Equality at Saddle Points 

For certain types of functions, informally speaking those that are saddle-shaped, the max-min inequality holds with strict equality. 

The proof is as follows.

Let $f: X \times Y \rightarrow \mathbb{R}$ be saddle-shaped.

We call a point $(\hat x, \hat y) \in X \times Y$ a *saddle-point* of $f(x,y)$ if $f(\hat x, y) \leq f(\hat x, \hat y) \leq f(x, \hat y)$ for all $x \in X$ and $y \in Y$.

In other words, $\hat x$ minimizes $f(x, \hat y)$ over $X$, and $\hat y$ maximizes $f(\hat x,y)$ over $Y$. That is

$$f(\hat x, \hat y) = \inf_{x \in X} f(x, \hat y) \ \ \textrm{and} \ \ f(\hat x, \hat y) = \sup_{y \in Y} f(\hat x, y)$$

But then

$$\sup_{y \in Y} \left\{ \inf_{x \in X} f(x,y) \right\} \geq \inf_{x \in X} f(x, \hat y) = f(\hat x, \hat y) = \sup_{y \in Y} f(\hat x, y) \geq \inf_{x \in X} \left\{ \sup_{y \in Y} f(x,y) \right\}$$

The max-min inequality gives the reverse inequality between the two outer quantities, so both of them are equal to $f(\hat x, \hat y)$. In other words, the order of optimization over $X$ and $Y$ does not matter, and we're in the equality case of the max-min inequality.

$$\sup_{y \in Y} \left\{ \inf_{x \in X} f(x,y) \right\} = \inf_{x \in X} \left\{ \sup_{y \in Y} f(x,y) \right\}$$
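
A standard saddle-shaped example is $f(x, y) = x^2 - y^2$, which is convex in $x$, concave in $y$, and has a saddle point at the origin. The sketch below checks the saddle-point definition and the equality of the two nested optima on an assumed grid over $[-2, 2] \times [-2, 2]$.

```python
def f(x, y):
    return x * x - y * y

grid = [i / 10 for i in range(-20, 21)]  # [-2, 2] in steps of 0.1
x_hat, y_hat = 0.0, 0.0

# Saddle-point definition: f(x_hat, y) <= f(x_hat, y_hat) <= f(x, y_hat).
is_saddle = all(f(x_hat, y) <= f(x_hat, y_hat) <= f(x, y_hat) for x in grid for y in grid)

sup_inf = max(min(f(x, y) for x in grid) for y in grid)
inf_sup = min(max(f(x, y) for y in grid) for x in grid)
print(is_saddle, sup_inf, inf_sup)
```

Both nested optima equal $f(0, 0) = 0$ on the grid.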


### Proof of Saddle Point Theorem

**$\implies$:**

Suppose $x^*$ and $(\lambda^*, \mu^*)$ are primal and dual optimal solutions for a convex problem which satisfies Slater's Condition. Then the problem is Strongly Dual by the Sufficient Condition for Strong Duality. 



\\\  NEED KKT CONDITIONS  ///


# Linear Programs

Linear programs are the best setting in which to build intuition for Complementary Slackness and the other properties of Strong Duality that appear in the KKT Conditions.

# Weak Duality in Linear Programs

Let's now focus on linear programs.

Suppose the primal is an LP of the form

$$
\begin{cases}
\min_x: c^Tx
\\
s.t.: \begin{aligned} &Ax \geq b
\\ 
&x \geq 0
\end{aligned}
\end{cases}
$$

By constructing the Lagrangian and going through the steps above we can show its dual is of the form

$$
\begin{cases}
\max_p: b^Tp
\\
s.t.: \begin{aligned} &A^Tp \leq c
\\ 
&p \geq 0
\end{aligned}
\end{cases}
$$

Weak Duality states

> **Weak Duality:** &nbsp; For any primal feasible $x$ and for all dual feasible $p$, $c^Tx \geq b^Tp$.
<br>

That is, the objective value $b^Tp$ of any dual feasible solution is a *lower bound* for the objective value $c^Tx$ of every primal feasible solution. Conversely, the objective value $c^Tx$ of any primal feasible solution is an *upper bound* for the objective value $b^Tp$ of every dual feasible solution. 

##  Proof of Weak Duality

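Let $(p, x)$ be respectively dual-primal feasible. Then $c^Tx = x^Tc \geq x^TA^Tp = (Ax)^Tp \geq b^Tp$. The first inequality holds because $x \geq 0$ and $A^Tp \leq c$, and the second because $p \geq 0$ and $Ax \geq b$.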
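
The chain of inequalities can be traced on a concrete instance. In the sketch below the data $A$, $b$, $c$ and the feasible points $x$ and $p$ are made up for illustration, and the helper functions are hypothetical.

```python
# Hypothetical LP data: primal min c^T x s.t. Ax >= b, x >= 0.
A = [[1.0, 2.0], [3.0, 1.0]]
b = [4.0, 6.0]
c = [2.0, 3.0]

def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

def mat_vec(M, v):
    return [dot(row, v) for row in M]

def transpose(M):
    return [list(col) for col in zip(*M)]

def primal_feasible(x, tol=1e-9):
    return all(xi >= -tol for xi in x) and all(
        ax >= bi - tol for ax, bi in zip(mat_vec(A, x), b))

def dual_feasible(p, tol=1e-9):
    return all(pi >= -tol for pi in p) and all(
        atp <= ci + tol for atp, ci in zip(mat_vec(transpose(A), p), c))

x = [4.0, 0.0]   # a primal feasible point, not optimal
p = [1.0, 0.0]   # a dual feasible point, not optimal
chain = (dot(c, x), dot(mat_vec(A, x), p), dot(b, p))  # c^T x >= (Ax)^T p >= b^T p
print(primal_feasible(x), dual_feasible(p), chain)
```

For this pair the chain reads $8 \geq 4 \geq 4$, so the dual point certifies that no primal feasible point can have an objective value below $4$.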

# Strong Duality in Linear Programs

While Weak Duality is a useful result, the real strength of duality theory lies in *Strong Duality*. Strong duality is a re-statement of Von Neumann's [Minimax Theorem](https://en.wikipedia.org/wiki/Minimax_theorem) which lays out the conditions for which the max-min inequality holds with strict equality. Roughly speaking, it holds for functions that are saddle-shaped, that is, convex in one variable and concave in the other.

Instead of proving the Minimax Theorem in the general case, we will stay topical and prove Strong Duality for LP's. That is, the Minimax Theorem as it pertains to the special case of linear programs... 

> **Strong Duality:** &nbsp; If the primal is feasible and bounded with optimal $x^*$ then the dual is also feasible and bounded. Furthermore, if the dual has optimum $p^*$ then $c^Tx^* = b^Tp^*$.
<br>

To prove Strong Duality, we require *Farkas' Lemma*.

## Farkas' Lemma

*Farkas' Lemma* belongs to the class of theorems called *Theorems of the Alternative*. These are theorems stating that exactly one of two statements holds true.

The lemma simply states that a given vector $c$ is either a [conic combination](https://v-poghosyan.github.io/blog/optimization/applied%20mathematics/proofs/2022/01/23/Optimization-Review-of-Linear-Algebra-and-Geometry.html#Conic-Combinations-of-$n$-Points) of the vectors $a_i$, $i \in I$, or it's separated from their cone by some hyperplane. 

We state Farkas' Lemma without offering proof since it has such an obvious geometric interpretation.

> **Farkas' Lemma:** &nbsp; For any vector $c$ and vectors $a_i \ \ (i \in I)$, exactly one of the following two statements holds:  
&nbsp;
> * $\exists p \geq 0$ s.t. $c = \sum_{i \in I} a_ip_i$
> * $\exists$ vector $d$ s.t. $d^Ta_i \geq 0 \ \ \forall i \in I$ but $d^Tc < 0$
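
To make the two alternatives concrete, here is a minimal sketch that verifies a *given* certificate for each statement. It does not search for certificates. The generators `a` and the test vectors are made-up illustrative data, and the function names are hypothetical.

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(ui * vi for ui, vi in zip(u, v))


def certifies_conic_combination(a, c, p):
    """First alternative: p >= 0 and c equals sum_i p_i a_i."""
    if any(pi < 0 for pi in p):
        return False
    combo = [sum(pi * ai[k] for pi, ai in zip(p, a)) for k in range(len(c))]
    return combo == list(c)


def certifies_separation(a, c, d):
    """Second alternative: d^T a_i >= 0 for all i, but d^T c < 0."""
    return all(dot(d, ai) >= 0 for ai in a) and dot(d, c) < 0


# Hypothetical generators: their cone is the wedge {(u, v) : 0 <= v <= u}
a = [(1, 0), (1, 1)]

c_inside = (3, 2)    # equals 1*(1, 0) + 2*(1, 1), so it lies in the cone
c_outside = (0, 1)   # lies outside the wedge

inside_ok = certifies_conic_combination(a, c_inside, p=(1, 2))
outside_ok = certifies_separation(a, c_outside, d=(1, -1))

# The same d does not separate c_inside from the cone
d_separates_inside = certifies_separation(a, c_inside, d=(1, -1))

print(inside_ok, outside_ok, d_separates_inside)
```

Here `p = (1, 2)` witnesses the first statement for `c_inside`, while `d = (1, -1)` witnesses the second statement for `c_outside`. Integer data is used so that the equality test in the first checker is exact.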

## Proof of Strong Duality in LP's

\\\ ANOTHER SUFFICIENT CONDITION FOR STRONG DUALITY IS BEING A LINEAR PROGRAM. THE PROOF BELOW IS INDEPENDENT OF PAST SUFFICIENT CONDITIONS. ///

The proof is by construction. 

Suppose $x^*$ is a primal optimal solution. Let the set $I_{x^*} = \{ i : a_i^Tx^* = b_i\}$ be the set of the indices of the active constraints at $x^*$. Our goal is to construct a dual optimal solution $p^*$ s.t. $c^Tx^* = b^Tp^*$. 

Let $d$ be any vector that satisfies $d^Ta_i \geq 0 \ \ \forall i \in I_{x^*}$. That is, $d$ is a feasible direction w.r.t. all the active constraints.

A small, positive $\epsilon$-step in the direction of $d$ results in a point $x^* + \epsilon d$ that's still feasible. The fact that the step is small is what guarantees no inactive constraints are violated.

Let's compare the value of the objective at $x^* + \epsilon d$ to the value of the objective at $x^*$.

By the assumption that $x^*$ is optimal, we have $c^Tx^* \leq c^T(x^* + \epsilon d) = c^Tx^* + \epsilon c^Td$. Subtracting $c^Tx^*$ and dividing by $\epsilon > 0$ gives $c^Td = d^Tc \geq 0$.

> Note: $d^Tc$ is nothing but the *directional derivative* at the minimizer $x^*$. It is a *first-order necessary-condition* that the *directional derivative* in any feasible direction $d$ be non-negative at any minimizer $x^*$. This is analogous to the first-derivative test for scalar-valued functions. So, this result should have been expected...
<br>

But since $d$ is a vector s.t. $d^Ta_i \geq 0 \ \ \forall i \in I_{x^*}$ and $d^Tc \geq 0$, then $d$ does *not* separate $c$ from the cone of the $a_i$'s. And since $d$ was arbitrary, this puts us in the setting of Farkas' Lemma. Namely, there exist *no* vectors $d$ that separate $c$ from the cone. This means the second statement in Farkas' Lemma fails and the first must be true: $c$ must be a conic combination of the $a_i$'s that are active at the minimizer. In other words, $\exists p \geq 0$ s.t. $c = \sum_{i \in I_{x^*}} p_ia_i$. 

> Note: $c = \sum_{i \in I_{x^*}} p_ia_i$ should remind us of the Lagrange optimality condition in the general case of convex optimization. Recall that the Lagrange condition states that a point $x^*$ is optimal for a convex problem with objective $f(x)$ and constraints $g_i(x)=0$ if and only if $\exists  \lambda_i$ for each active constraint s.t. $\nabla f(x^*) = \sum_i \lambda_i \nabla g_i(x^*)$. In fact, Farkas' Lemma is what underpins the Lagrange condition through the assumption that the non-linear objective $f$ and non-linear constraints $g_i$ behave linearly in a small neighborhood of $x^*$. 
<br>

But $p$ has dimension equal to only the number of active constraints at $x^*$. To be a dual variable at all, it must have dimension equal to the number of all primal constraints. We extend $p$ to $p^*$ by setting all the entries that do not correspond to the active constraints at $x^*$ to be zero. 

That is $p^*_i = \begin{cases} p_i \ \ \textrm{if} \ \  i \in I_{x^*} \\ 0   \ \ \textrm{if} \ \  i \notin I_{x^*} \end{cases}$. 

Now $A^Tp^*  = \sum_{i} p^*_ia_i = c$, so any feasibility condition in the dual, whether it be $A^Tp \leq c$, $A^Tp \geq c$, or $A^Tp = c$, is satisfied by $p^*$. 

Furthermore, the dual objective at $p^*$ agrees with the primal objective at $x^*$.

$$b^Tp^* = \sum_{i} b_ip_i^* = \sum_{i \in I_{x^*}} b_ip_i^* + \sum_{i \notin I_{x^*}} b_ip_i^* = \sum_{i \in I_{x^*}} a_i^Tx^*p_i^* = (\sum_{i \in I_{x^*}} p_ia_i^T)x^* = c^Tx^* $$

However, it still remains to be shown that $p^*$ is dual optimal. 

Whenever the primal objective and the dual objective agree on a value, the respective solutions must be primal-dual optimal. This is simply true by Weak Duality, which states that $b^Tp \leq c^Tx^*$ for all dual feasible $p$. So, $c^Tx^*$ is an upper bound on the dual objective. But the dual is a maximization problem and $p^*$ attains this upper bound, $b^Tp^* = c^Tx^*$, so $p^*$ must be dual optimal.
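
The construction in the proof can be traced on a small example. The sketch below uses a made-up LP in the general form $\min c^Tx$ s.t. $a_i^Tx \geq b_i$, with the non-negativity constraints written as two extra rows. The point `x_star` is assumed to be the primal optimum, and the $2 \times 2$ solve via Cramer's rule assumes exactly two linearly independent active constraints, which holds for this instance but not in general.

```python
from fractions import Fraction as F

# Hypothetical LP:  min c^T x  s.t.  a_i^T x >= b_i  for every row i.
# Rows 2 and 3 (0-indexed) encode x_1 >= 0 and x_2 >= 0.
a = [(1, 2), (3, 1), (1, 0), (0, 1)]
b = [4, 6, 0, 0]
c = (1, 1)

x_star = (F(8, 5), F(6, 5))  # assumed primal optimal vertex


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(ui * vi for ui, vi in zip(u, v))


# Active set I_{x*}: constraints that hold with equality at x*
active = [i for i in range(len(a)) if dot(a[i], x_star) == b[i]]

# Solve  c = p_1 * a_first + p_2 * a_second  for the two active rows
# using Cramer's rule (assumes exactly two independent active rows).
(a11, a12), (a21, a22) = a[active[0]], a[active[1]]
det = a11 * a22 - a21 * a12
p_active = [F(c[0] * a22 - a21 * c[1], det),
            F(a11 * c[1] - c[0] * a12, det)]

# Extend p to p* by putting zeros on the inactive constraints
p_star = [F(0)] * len(a)
for idx, value in zip(active, p_active):
    p_star[idx] = value

# Check that sum_i p*_i a_i reproduces c, and compare objective values
reconstructed_c = tuple(sum(p_star[i] * a[i][k] for i in range(len(a)))
                        for k in range(len(c)))
primal_value = dot(c, x_star)
dual_value = dot(b, p_star)

print(active)
print(p_star)
print(reconstructed_c == c, primal_value == dual_value)
```

Exact rational arithmetic via `fractions.Fraction` is used so that the active-set test and the final equality checks are not affected by floating-point rounding. On this instance the multipliers come out as $p^* = (2/5, 1/5, 0, 0)$ and both objectives equal $14/5$.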


NOTE THAT WE HAVE CONSTRUCTED THE DUAL OPTIMAL BY EXPLICITLY SATISFYING THE KKT CONDITIONS IN THE LINEAR CASE. SO THIS IS THE PROOF THAT KKT PAIR => PRIMAL/DUAL OPTIMAL.

# Theorems of the Alternative

As mentioned earlier, these are theorems that describe mutually exclusive scenarios that together comprise the entire outcome space. Formally, these are theorems of the form $(A \implies \neg B) \land (\neg A \implies B)$ where $A$ and $B$ are logical statements.

Note that theorems of equivalence (i.e. theorems of the form *'the following are equivalent - TFAE'*) can also be formulated as theorems of the alternative. To say that $A$ and $B$ are equivalent means $A \iff B$. But this breaks down as $(A \implies B) \land (B \implies A)$. Letting $\hat B = \neg B$ we can rewrite the above as $(A \implies \neg \hat B) \land (B \implies A)$. But, by taking the contrapositive, $B \implies A$ becomes $\neg A \implies \neg B$, which is to say $\neg A \implies \hat B$. In summary, we have shown that $A \iff B$ is equivalent to $(A \implies \neg \hat B) \land (\neg A \implies \hat B)$.
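
Since only two propositional variables are involved, the claimed equivalence can be confirmed by brute force over the truth table. This is a small sketch; the helper `implies` is a hypothetical name introduced just for this check.

```python
from itertools import product


def implies(u, v):
    """Material implication: u => v."""
    return (not u) or v


rows = []
for A, B in product([False, True], repeat=2):
    B_hat = not B                      # B-hat is defined as the negation of B
    equivalence = (A == B)             # A <=> B
    alternative = implies(A, not B_hat) and implies(not A, B_hat)
    rows.append((A, B, equivalence, alternative))

# The two formulations agree on every row of the truth table
all_match = all(eq == alt for _, _, eq, alt in rows)
print(all_match)
```

The check covers all four assignments of truth values, so agreement on every row is a complete proof of the propositional identity.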

So, the class of theorems of the alternative is much broader than it appears and includes theorems of equivalence.

## Example of a Theorem of the Alternative

To see how we can prove a theorem of the alternative, it helps to state one. 

> **Theorem:** &nbsp; Exactly one of the following two statements must hold for a given matrix $A$.
&nbsp;
> 1. $\exists x \ne 0$ s.t. $Ax = 0$ and $x \geq 0$
> 2. $\exists p$ s.t. $p^TA > 0$
<br>

### Using a Separation Argument

At the heart of separation arguments lies this simple fact. 

> **Separating Hyperplane Theorem:** For any closed convex set $C$, if a point $\omega \notin C$ then there exists a hyperplane strictly separating $\omega$ and $C$.
<br>

Farkas' Lemma, for instance, is proved by a separation argument that uses, as its convex set, the cone of all conic combinations of the $a_i$'s. The conclusion is immediate since in Farkas' Lemma the first statement plainly says that $c$ belongs to this convex set, and the second statement plainly says there exists a hyperplane separating $c$ from it. 

This is the pattern all separation arguments must follow. In general, it may take a bit of work to define the problem-specific convex set and to show that the two statements are *really* talking about belonging to this set and separation from it. Once these three things are accomplished, the proof is complete. 

Using this idea, let's give a proof of the above theorem of the alternative using a separation argument.

#### Proof

First order of business is to come up with a convex set. 

Let's take $C = \{ z : z = Ay, \sum_i y_i = 1, y \geq 0 \}$ to be the convex hull of the columns of $A$.

The first statement in the theorem was that $\exists x \ne 0$ s.t. $Ax = 0$ and $x \geq 0$.

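Since $x \ne 0$ and $x \geq 0$, we have $\sum_i x_i > 0$, so we can rescale $x$ as $y = \alpha x$ with $\alpha = 1 / \sum_i x_i$ so that $\sum_i y_i = 1$, while still $Ay = 0$ and $y \geq 0$.

So, the first statement is equivalent to saying the origin belongs to the convex hull $C$ (i.e. $0 \in C$).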

The second statement was that $\exists p$ s.t. $p^TA > 0$. This is equivalent to saying that all the columns of $A$ lie to one side of the separating hyperplane introduced by $p$.

But every $z \in C$ is a convex combination of $A$'s columns, $z = Ay$ with $y \geq 0$ and $\sum_i y_i = 1$. Writing $A_i$ for the $i$-th column, each $p^TA_i > 0$ and the weights $y_i$ are non-negative and not all zero, so $p^Tz = \sum_i y_i \, p^TA_i > 0$. That is, $p^Tz > 0 \ \ \forall z \in C$, so all of $C$ lies strictly on one side of the hyperplane. 

But, of course, $p^T0 = 0$ (not $> 0$). So, according to the second statement, the origin is separated from $C$. 

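So the two statements are mutually exclusive. Conversely, if the first statement fails then $0 \notin C$. Since $C$ is closed and convex (it is the convex hull of finitely many points), the Separating Hyperplane Theorem gives a $p$ with $p^Tz > 0 \ \ \forall z \in C$. In particular this holds for every column of $A$, so $p^TA > 0$ and the second statement holds. Thus exactly one of the two statements holds, which concludes the proof. 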

### Using Strong Duality

Strong duality isn't just a tool for applied science, it has important theoretical uses. For instance, now that we've proven it we can use Strong Duality, instead of a separation argument, to prove theorems of the alternative. 

Since it gives us feasibility of two different constraint sets, it makes sense to use duality to prove theorems of existence. 

Let's take the aforementioned theorem of the alternative for example...

#### Proof

To prove the theorem we need to show two things. First, we need to show $1 \implies \neg 2$, then we need to show $\neg 1 \implies 2$.

The $1 \implies \neg 2$ direction is simple. 

Suppose $\exists x \ne 0$ s.t. $Ax = 0$ and $x \geq 0$. 

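Then $\forall p \ \ (p^TA)x = p^T(Ax) = p^T0 = 0$. But if some $p$ satisfied $p^TA > 0$, then since $x \geq 0$ and $x \ne 0$ we would have $(p^TA)x > 0$, a contradiction. So no such $p$ exists.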

We tackle the $\neg 1 \implies 2$ direction using duality.

The strategy is to construct a linear program based on $\neg 1$ such that the feasibility of its dual implies $2$.

We can express $\neg 1$ as '$\forall x \ne 0$, either $Ax \ne 0$ or $x \not\geq 0$.' Equivalently, '$x \ne 0 \implies Ax \ne 0$ or $x \not\geq 0$.' Taking the contrapositive, statement $\neg 1$ becomes '$Ax = 0$ and  $x \geq 0 \implies x = 0$.' 

So, let's form the LP 

$
\begin{cases}
\max_x: \textbf{1}^Tx
\\
s.t.: \begin{aligned} &Ax = 0
\\ 
&x \geq 0
\end{aligned}
\end{cases}
$

Note that $x = 0$ is a feasible solution to the LP. Furthermore, assuming $\neg 1$ guarantees that $x = 0$ is the only feasible solution. Thus, the LP is feasible and bounded, with optimal value $0$. 

By Strong Duality, its dual must also be feasible and bounded. 

The dual is...

$
\begin{cases}
\min_p: \textbf{0}^Tp
\\
s.t.: p^TA \geq \textbf{1}
\end{cases}
$

... and since it's feasible, $\exists p$ s.t. $p^TA \geq 1 > 0$ which demonstrates the truth of statement $2$.  
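
The theorem proved above can be illustrated by checking certificates on two small matrices. This is a sketch with made-up data: `A1` is chosen so that statement $1$ holds and `A2` so that statement $2$ holds, and the function names are hypothetical. The helper `convex_weights` performs the rescaling used in the separation argument.

```python
from fractions import Fraction


def certifies_statement_1(A, x):
    """Statement 1: x != 0, x >= 0 and A x = 0."""
    nonzero = any(xi != 0 for xi in x)
    nonnegative = all(xi >= 0 for xi in x)
    in_null_space = all(sum(aij * xj for aij, xj in zip(row, x)) == 0
                        for row in A)
    return nonzero and nonnegative and in_null_space


def certifies_statement_2(A, p):
    """Statement 2: every entry of p^T A is strictly positive."""
    return all(sum(pi * aij for pi, aij in zip(p, col)) > 0
               for col in zip(*A))


def convex_weights(x):
    """Rescale a non-zero x >= 0 (integer entries) so the entries sum to 1."""
    total = sum(x)
    return [Fraction(xi, total) for xi in x]


# Hypothetical matrix whose columns (1, 2) and (-1, -2) cancel each other
A1 = [[1, -1],
      [2, -2]]
x = [1, 1]
statement_1_holds_for_A1 = certifies_statement_1(A1, x)
y = convex_weights(x)  # weights placing the origin in the convex hull

# Hypothetical matrix whose columns (1, 1) and (2, 1) share a half-plane
A2 = [[1, 2],
      [1, 1]]
p = [1, 1]
statement_2_holds_for_A2 = certifies_statement_2(A2, p)

# The same p is not a certificate of statement 2 for A1
p_works_for_A1 = certifies_statement_2(A1, p)

print(statement_1_holds_for_A1, y, statement_2_holds_for_A2, p_works_for_A1)
```

Note that a failed certificate for one particular `p` does not by itself show statement $2$ is false. For `A1` that conclusion comes from the theorem, since statement $1$ has been certified.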

# Complementary Slackness

*Complementary Slackness* is a fundamental relationship that holds between any primal optimal solution and any dual optimal solution. 

In the preceding section on Strong Duality we constructed a dual optimal by setting those of its variables that corresponded to the inactive constraints of the primal optimal to be zero. 

This is true in general, for all primal-dual optimal pairs. 

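If a primal constraint is loose at some primal optimal, then the corresponding variable in the dual optimal is zero. Likewise, if a dual constraint is loose at some dual optimal, then the corresponding variable in the primal optimal is zero. 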

Formally, this can be stated as

> **Complementary Slackness:** if $x$ is primal feasible and $p$ is dual feasible, then $x$ and $p$ are respectively optimal iff:
&nbsp;
> 1. $(b_i - \sum_{j} a_{ij}x_j)p_i = 0 \ \ \forall i$
> 2. $(\sum_{i} a_{ij}p_i - c_j)x_j = 0  \ \ \forall j$
<br>

Recall that the dual optimal constructed in the proof of Strong Duality has exactly this structure: its variables corresponding to the primal's slack constraints are zero. In other words, it was constructed in such a way as to satisfy the Complementary Slackness theorem. So, the fact that this generalizes to all primal-dual optima shouldn't surprise us. 

However, the above does not constitute a proof of Complementary Slackness, so let's offer one.

Take as a starting point the primal-dual pair

$
\textrm{P} \ \ 
\begin{cases}
\min_x: c^Tx
\\
s.t.: \begin{aligned} &Ax \geq b
\\ 
&x \geq 0
\end{aligned}
\end{cases}
$

$
\textrm{D} \ \ 
\begin{cases}
\max_p: b^Tp
\\
s.t.: \begin{aligned} &A^Tp \leq c
\\ 
&p \geq 0
\end{aligned}
\end{cases}
$

## Proof of Complementary Slackness

**Sufficiency $\impliedby$:**

Suppose both equalities hold.

Summing each over all $i$'s and $j$'s respectively and adding the results we get

$$\sum_i \left(b_i - \sum_j a_{ij}x_j \right)p_i + \sum_j \left( \sum_i a_{ij}p_i - c_j \right)x_j = 0$$

Which simplifies to 

$$\sum_i b_ip_i - \sum_i \sum_j a_{ij}x_jp_i + \sum_j \sum_i a_{ij}p_ix_j - \sum_j c_jx_j = 0$$

Or, in matrix-vector form

$$b^Tp - p^TAx + p^TAx - c^Tx = 0$$

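The middle two terms cancel, and we get $b^Tp = c^Tx$. 

By Weak Duality, no dual feasible solution can exceed $c^Tx$ and no primal feasible solution can fall below $b^Tp$, so $x$ and $p$ are primal-dual optimal.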

**Necessity $\implies$:**

Suppose $x$ and $p$ are primal-dual optimal. 

By Strong Duality $b^Tp = c^Tx$. 

In other words, $b^Tp - c^Tx = 0$. Adding and subtracting the terms canceled in the first part, we can bring the sum to the form

$$b^Tp - p^TAx + p^TAx - c^Tx = 0$$

Which is, once again, the same as

$$\sum_i \left(b_i - \sum_j a_{ij}x_j \right)p_i + \sum_j \left( \sum_i a_{ij}p_i - c_j \right)x_j = 0$$

But $p$ is dual feasible, so $p_i \geq 0 \ \ \forall i$. And since $x$ is primal feasible, $Ax \geq b$ implies $(b_i - \sum_j a_{ij}x_j) \leq 0 \ \ \forall i$. 

Similarly, $x$ is primal feasible, so $x_j \geq 0 \ \ \forall j$. And since $p$ is dual feasible, $A^Tp \leq c$ implies $( \sum_i a_{ij}p_i - c_j) \leq 0 \ \ \forall j$. 

So the above expression is a sum of all non-positive terms that adds up to zero. This can only happen if each term is equal to zero. 
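
Both directions of the theorem can be seen on a small instance of the pair P and D. The sketch below uses made-up data in which the third primal constraint is slack at the optimum. The pair `(x_opt, p_opt)` is assumed optimal, while `(x_bad, p_bad)` is a feasible but non-optimal pair, and the helper name `slackness_terms` is hypothetical.

```python
from fractions import Fraction as F

# Hypothetical data for  P: min c^T x  s.t.  A x >= b, x >= 0
A = [[1, 2],
     [3, 1],
     [1, 1]]
b = [4, 6, 1]
c = [1, 1]


def slackness_terms(A, b, c, x, p):
    """Return the products from conditions 1 and 2 of the theorem."""
    m, n = len(b), len(c)
    primal_terms = [(b[i] - sum(A[i][j] * x[j] for j in range(n))) * p[i]
                    for i in range(m)]
    dual_terms = [(sum(A[i][j] * p[i] for i in range(m)) - c[j]) * x[j]
                  for j in range(n)]
    return primal_terms, dual_terms


def objective(weights, point):
    """Linear objective value weights^T point."""
    return sum(w * v for w, v in zip(weights, point))


# Assumed optimal pair: constraint 3 is slack at x_opt, so its multiplier is 0
x_opt = [F(8, 5), F(6, 5)]
p_opt = [F(2, 5), F(1, 5), F(0)]
opt_primal_terms, opt_dual_terms = slackness_terms(A, b, c, x_opt, p_opt)

# A feasible but non-optimal pair: the slack constraint 3 gets a positive multiplier
x_bad = [4, 0]
p_bad = [0, 0, 1]
bad_primal_terms, bad_dual_terms = slackness_terms(A, b, c, x_bad, p_bad)

print(opt_primal_terms, opt_dual_terms)
print(objective(c, x_opt), objective(b, p_opt))
print(bad_primal_terms, bad_dual_terms)
print(objective(c, x_bad), objective(b, p_bad))
```

For the first pair every product vanishes and the two objectives coincide. For the second pair the products sum to $-3$, which is exactly $b^Tp - c^Tx = 1 - 4$, in line with the identity used in both directions of the proof.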