# Optimization - Duality

> Lagrangian Duality, Weak and Strong Duality, Optimality Conditions, Farkas' Lemma, and Theorems of the Alternative

- hide: false
- toc: true
- badges: true
- comments: true
- categories: ['Optimization','Applied Mathematics','Proofs']

# Introduction

Every convex optimization problem, designated as the ***primal***, has a related problem called its ***dual*** which can be colloquially thought of as its evil twin. The primal and the dual represent two different perspectives on the same problem. 

In general, if the primal is a minimization problem, its dual is a maximization problem. In the case of constrained optimization, the dual has one variable per primal constraint. For linear programs in particular, if the primal is a minimization in $n$ variables and $m$ constraints then its dual is a maximization in $m$ variables and $n$ constraints. 

Furthermore, *any* feasible value of the dual is a lower-bound for *all* feasible values of the primal. In particular, should they both exist, the dual optimum is a lower bound for the primal optimum. This property, called ***weak duality***, is at the core of ***duality theory***. Having a problem that obtains, at the very least, a useful lower-bound for the primal optimum and, possibly, the primal optimum itself is the motivating idea behind formulating the dual.

In the best case scenario, problems exhibit a property called ***strong duality***, which guarantees the primal and the dual optima agree with each other. Strongly dual problems include, but are not limited to, all linear programs and a category of convex non-linear optimization problems. For such problems, solving the dual guarantees that we've also solved the primal. Furthermore, taking the dual of the dual gives back the primal. So the relationship also holds in the converse: if we've solved the primal then we've also solved its dual.

This is what makes duality theory so useful in practice. Having a related, usually easier, optimization problem gives applied scientists a huge computational advantage. However, even if the dual does not turn out to be any easier to solve or strong duality fails to hold, we still stand to gain structural insights about the primal problem.

In this post we will show how the dual of a problem arises, examine in detail its relationship with the primal, and list all possible primal-dual outcomes. In doing so, we will look at duality in the general case of constrained optimization problems, in the specific case of linear programs, and in a certain category of unconstrained problems.

# The Dual of a Constrained Problem

First, let's focus on deriving the dual of a constrained optimization problem. We shall see that, in a sense, constraints are what give rise to duality through the [Lagrangian](https://en.wikipedia.org/wiki/Lagrangian_relaxation). Certain types of unconstrained problems also have duals which arise from introducing dummy constraints or directly through the [Fenchel-Legendre Transform](https://en.wikipedia.org/wiki/Convex_conjugate).

Take the most general form of a convex, constrained problem with $m$ inequality and $p$ equality constraints. To make the discussion interesting, assume the problem is non-trivial (i.e. its constraint set is non-empty and contains more than one feasible point). Furthermore, so that we may have a solution to speak of, assume the problem is bounded with the finite optimum $f_0(x^*)$ for some optimizer $x^*$.

<br>
$$
\begin{aligned}
\min_x &: f_0(x)
\\
s.t. &: \begin{aligned} &f_i(x) \leq 0 \ \ i = 1, ...,m
\\ 
&h_i(x) = 0 \ \ i = 1, ... ,p
\end{aligned}
\end{aligned} \tag{P}
$$
<br>

> Note: The $f_i$'s and the $h_i$'s in the constraints must necessarily be convex in order for their sublevel-sets, and hence the problem itself, to be convex. However, the equality constraints may be given as $Ax = b$ in some other sources. These representations are almost equivalent. The $0$-th level-set of $Ax - b$ is indeed a convex set. However, the $h_i$'s in the equality constraints $h_i(x) = 0$ need not be linear for their $0$-th level-set $\{ x : h_i(x) = 0 \}$ to be convex. For example, in $\mathbb{R}$, $x^2 = 0$ does represent a convex level-set. Note, however, that $x^2 = 0$ can be reduced to $x = 0$ which is, indeed, linear. The notion of [quasi-linearity](https://en.wikipedia.org/wiki/Quasiconvex_function) is what's needed here but, in practice, we simply *define* a general convex problem as having only linear equality constraints. Doing so assists in the analysis of problems and in the development of computational methods.

Since optimizing an unconstrained problem is considerably easier than optimizing a constrained problem, we seek to augment the constrained problem into an equivalent unconstrained problem. 

The idea is to penalize infeasible $x$ using functions that express our *displeasure* for certain choices. 

At first we use the *infinitely-hard penalty functions* $\mathbb{1}_-$ and $\mathbb{1}_0$ which are defined as follows:

<br>
$$\mathbb{1}_-(u) = 
\begin{cases}
\begin{aligned} 
&0  &\textrm{if} \ u \leq 0
\\
&\infty  &\textrm{if} \ u > 0
\end{aligned}
\end{cases}$$
<br>
$$\mathbb{1}_0(u) = 
\begin{cases}
\begin{aligned} 
&0  &\textrm{if} \ u = 0
\\
&\infty  &\textrm{if} \ u \ne 0
\end{aligned}
\end{cases}$$
<br>

Then the equivalent unconstrained problem can be stated as:

<br>
$$\min_x: \mathcal{J}(x)$$
<br>

where $\mathcal{J}(x) = f_0(x) + \sum_{i=1}^m \mathbb{1}_-(f_i(x)) + \sum_{i=1}^p \mathbb{1}_0(h_i(x))$. 

Equivalently, we can express the objective $\mathcal{J}(x)$ as:

<br>
$$\mathcal{J}(x) = \begin{cases}\begin{aligned} 
&f_0(x) \ \ \textrm{if $x$ is feasible}
\\
&\infty \ \ \textrm{otherwise}
\end{aligned}\end{cases}$$
<br>


Informally, if $\hat x$ is chosen s.t. *one or more* of the constraints are violated then the minimization incurs an infinite penalty. Therefore, such a $\hat x$ will never be selected over any feasible choice $x$ which gives a finite value $f_0(x)$. Moreover, by optimality of $x^*$ in the original problem, we have $f_0(x^*) \leq f_0(x)$ for all feasible $x$. So, the optimum of $\mathcal{J}(x)$ will also be $f_0(x^*)$.

That is:

<br>
$$\min_x \mathcal{J}(x) = f_0(x^*) \tag{1}$$
<br>

Moreover, since the optimizer $x^*$ for the original problem is feasible, $\mathcal{J}(x^*) = f_0(x^*)$ by definition. It follows from $(1)$ that:

<br>
$$\mathcal{J}(x^*) = \min_x \mathcal{J}(x) \tag{2.1}$$
<br>

Or, equivalently:

<br>
$$x^* = \arg \min_x \mathcal{J}(x) \tag{2.2}$$
<br>

$(1)$ says that it suffices to minimize the unconstrained objective $\mathcal{J}(x)$ instead of the original problem since doing so results in $f_0(x^*)$, the optimum of the constrained problem. $(2.1)$ and $(2.2)$, on the other hand, say that it suffices to find an optimizer $x^*$ of the unconstrained problem, since such a point will also be an optimizer of the constrained problem.
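As a quick sanity check of $(1)$ and $(2.2)$, here is a minimal sketch on a toy problem that is not part of the derivation above: minimize $f_0(x) = (x-2)^2$ subject to $f_1(x) = x - 1 \leq 0$. The problem, the search grid, and the helper names are all assumptions made purely for illustration.

```python
import math

# Toy problem (an illustrative assumption, not from the text):
#   minimize (x - 2)^2  subject to  x - 1 <= 0
def f0(x):
    return (x - 2.0) ** 2

def f1(x):
    return x - 1.0

def hard_penalty(u):
    """Infinitely-hard penalty applied to an inequality constraint value u."""
    return 0.0 if u <= 0 else math.inf

def J(x):
    """Unconstrained objective: f0 on the feasible set, infinity elsewhere."""
    return f0(x) + hard_penalty(f1(x))

# Brute-force minimization of J over a grid on [-3, 3].
grid = [i / 100.0 for i in range(-300, 301)]
x_best = min(grid, key=J)

print(x_best, J(x_best))
```

On this grid the minimizer of $\mathcal{J}$ lands on the constrained optimizer $x = 1$, with value $1$, even though $f_0$ on its own is minimized at the infeasible point $x = 2$.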

As we know, the local optima of differentiable unconstrained problems occur at their *stationary points* which can be easily identified using the *unconstrained optimality condition*.

> **Unconstrained Optimality Condition:** &nbsp; If $x^*$ is an optimizer of the differentiable unconstrained objective $f_0(x)$ then $\nabla f_0(x^*) = 0$. That is, $x^*$ is a ***stationary point*** of $f_0(x)$.

Once the stationary points have been found, a global minimizer can be identified among them simply by evaluating the objective at each stationary point.

However, we're immediately beset by a problem. We cannot find the gradient of $\mathcal{J}(x)$ and set it to zero because the infinitely-hard penalty functions are discontinuous and non-differentiable. That is, $\nabla \mathcal{J}(x)$ simply does not exist.

To sidestep this difficulty we use linear relaxations instead of $\mathbb{1}_-$ and $\mathbb{1}_0$. 


## The Lagrangian, Dual Variables, and the Dual Function

The ***Lagrangian linear relaxation***, sometimes simply referred to as the ***Lagrangian***, is:

<br>
$$\mathcal{L}(x,\lambda,\mu) = f_0(x) + \sum_{i=1}^m \lambda_i f_i(x) + \sum_{i=1}^p \mu_i h_i(x)$$
$$\textrm{where} \ \lambda \geq 0$$
<br>

We call the $\lambda_i$'s the ***Lagrange multipliers*** corresponding to the inequality constraints, and the $\mu_i$'s those corresponding to the equality constraints. The vectors $\lambda$ and $\mu$, composed of these Lagrange multipliers, are called the ***Lagrange multiplier vectors*** or, for reasons that will soon become apparent, the ***dual variables***. 

> Note: In some sources, the Lagrangian is simply stated as $\mathcal{L}(x,\lambda) = f_0(x) + \sum_{i=1}^m \lambda_i f_i(x)$. Indeed, by separating the equality constraints $h_i(x) = 0$ into $h_i(x) \leq 0$ and $-h_i(x) \leq 0$, we can transform a problem with equality constraints into one with only inequality constraints. So, this formulation of the Lagrangian is still general enough to account for problems with equality constraints.

## A Lagrangian Lower-Bound

Not only does the Lagrangian relax the unconstrained augmentation of the constrained problem, it also plays a natural role in the formulation of the ***dual problem*** as promised.

The first thing to note about the Lagrangian is that the coordinate-wise $\lambda \geq 0$ condition is crucial. This is because, in the event that an inequality constraint is violated, say $f_i(x) > 0$, the corresponding $\lambda_i$ must be non-negative in order to apply a positive penalty to the minimization. 

On the other hand, $\mu$ is free to assume any value since the equality constraints can be violated in either direction and both scenarios must be penalized.

The second thing to note about the Lagrangian is that, even though it applies a positive penalty that scales linearly in the severity of the violation, this penalty is, nevertheless, not as severe as the infinite penalty applied in $\mathcal{J}(x)$. Also, in the Lagrangian, we may actually be *rewarding* feasible choices of $x$ that have margin. That is, in the event that $f_i(x) < 0$, the term $\lambda_i f_i(x)$ is non-positive, which acts as a reward in the minimization problem. 

All of this is to say that the Lagrangian is a point-wise lower-bound of the unconstrained problem. That is, the following inequality holds:

<br>
$$\mathcal{L}(x,\lambda,\mu) \leq \mathcal{J}(x) \ \ \forall x, \lambda \geq 0, \mu \tag{3.1}$$
<br>

This fact can also be seen by graphing each of the linear and infinite penalties and noticing that $\lambda_i f_i(x) \leq \mathbb{1}_-(f_i(x))$ and $\mu_i h_i(x) \leq \mathbb{1}_0(h_i(x))$ for every constraint $i$. 
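The point-wise bound $(3.1)$ can be spot-checked numerically. The sketch below assumes a toy problem that is not from the text (minimize $(x-2)^2$ subject to $x - 1 \leq 0$) and a handful of arbitrarily chosen multipliers. It also shows how the bound can fail once $\lambda$ is allowed to be negative.

```python
import math

# Toy problem (an illustrative assumption, not from the text):
#   minimize (x - 2)^2  subject to  x - 1 <= 0
def f0(x):
    return (x - 2.0) ** 2

def f1(x):
    return x - 1.0

def J(x):
    """Hard-penalty objective: f0 if x is feasible, infinity otherwise."""
    return f0(x) if f1(x) <= 0 else math.inf

def lagrangian(x, lam):
    """Linear relaxation of J with a single multiplier lam."""
    return f0(x) + lam * f1(x)

grid = [i / 10.0 for i in range(-30, 31)]
multipliers = [0.0, 0.5, 1.0, 2.0, 10.0]  # all non-negative

# L(x, lam) <= J(x) at every grid point, for every lam >= 0.
bound_holds = all(
    lagrangian(x, lam) <= J(x) for x in grid for lam in multipliers
)

# With a negative multiplier the bound can break at a feasible point.
breaks_for_negative_lam = lagrangian(0.0, -1.0) > J(0.0)

print(bound_holds, breaks_for_negative_lam)
```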


Taking $\inf$ w.r.t. $x$ of the LHS in $(3.1)$, we obtain something of interest.

<br>
$$\inf_x \mathcal{L}(x,\lambda,\mu) \leq \mathcal{J}(x) \ \ \forall x, \lambda \geq 0, \mu \tag{3.2}$$
<br>

> Note: The Lagrangian may be unbounded below in $x$ for a given $(\lambda, \mu)$, in which case the LHS is simply $-\infty$ and the bound is trivial. We shall see later, once we define the ***dual function*** and the ***duality gap***, that if this happens for every $\lambda \geq 0, \mu$ then the dual function is identically $-\infty$ and the duality gap is $\infty$. For now, we assume the minimum *is* attained and thus $\inf_x \mathcal{L}(x, \lambda, \mu) = \min_x \mathcal{L}(x, \lambda, \mu)$.

Designating the original problem as the *primal* $(P)$, we call $g(\lambda, \mu) := \min_x \mathcal{L}(x, \lambda, \mu)$ the ***dual function*** because it exhibits the property of weak duality. That is, per $(3.2)$, any feasible value of $g(\lambda, \mu)$ is a lower-bound for any feasible value of the primal.

Taking $\min$ over $x$ of both sides in $(3.1)$, we have a more specific flavor of weak duality.

<br>
$$g(\lambda,\mu) \leq \min_x \mathcal{J}(x) \ \ \forall \lambda \geq 0, \mu$$
<br>

And, since $\mathcal{J}(x^*) = f_0(x^*) = \min_x \mathcal{J}(x)$, we have:

<br>
$$g(\lambda,\mu) \leq f_0(x^*) \ \ \forall \lambda \geq 0, \mu \tag{3.3}$$
<br>

That is, any feasible value of the dual is a lower-bound for the primal optimum.

Maximizing the LHS of $(3.3)$ over $\lambda \geq 0$ and $\mu$, noting that the RHS is a constant and assuming the LHS attains its $\max$, we get an even more specific flavor of weak duality.

<br>
$$\max_{\lambda \geq 0, \mu} g(\lambda,\mu) \leq f_0(x^*) \tag{3.4}$$
<br>

That is, the dual optimum is a lower-bound for the primal optimum.

From here we move, quite naturally, to defining the *dual problem* $(D)$.

## The Lagrange Dual Problem

It's natural to ask what the *tightest* lower bound on the primal optimal value $f_0(x^*)$ is. This amounts to finding the values $\lambda^* \geq 0$ and $\mu^*$ for which $g(\lambda^*, \mu^*)$ is maximized. We call this the ***Lagrange dual problem*** or, simply, the ***dual problem***.

It can be stated as:

<br>
$$
\begin{aligned}
\max_{\lambda, \mu} &: g(\lambda, \mu)
\\
s.t. &: \lambda \geq  0
\end{aligned} \tag{D}
$$
<br>

Looking at the above, it becomes immediately clear why we were motivated to call $\lambda$ and $\mu$ the *dual variables*. They are the variables of the dual problem.
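To make the dual problem concrete, here is a minimal sketch for a toy problem that is an assumption of this example rather than something from the text: minimize $(x-2)^2$ subject to $x - 1 \leq 0$, whose primal optimum is $f_0(x^*) = 1$ at $x^* = 1$. Setting $\partial \mathcal{L} / \partial x = 2(x-2) + \lambda = 0$ gives the inner minimizer $x = 2 - \lambda/2$, so the dual function works out to $g(\lambda) = \lambda - \lambda^2/4$. The grid search over $\lambda$ is only a stand-in for a proper solver.

```python
# Toy problem (an illustrative assumption, not from the text):
#   minimize (x - 2)^2  subject to  x - 1 <= 0,  primal optimum 1 at x = 1
def lagrangian(x, lam):
    return (x - 2.0) ** 2 + lam * (x - 1.0)

def g(lam):
    """Dual function: minimize the Lagrangian over x in closed form."""
    x_min = 2.0 - lam / 2.0  # root of dL/dx = 2(x - 2) + lam
    return lagrangian(x_min, lam)

primal_opt = 1.0

# Dual problem (D): maximize g over lam >= 0, here by brute force on a grid.
lams = [i / 100.0 for i in range(0, 501)]
lam_star = max(lams, key=g)
dual_opt = g(lam_star)

# Weak duality: every dual value is a lower bound on the primal optimum.
weak_duality_holds = all(g(lam) <= primal_opt + 1e-12 for lam in lams)

print(lam_star, dual_opt, weak_duality_holds)
```

For this particular problem the best lower bound is attained at $\lambda^* = 2$ and coincides with the primal optimum, so the bound happens to be tight.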

# Weak Duality and Interpretations

We now return to the general setting of constrained optimization.

We've already seen weak duality formulated as $(3.2)$, $(3.3)$, and $(3.4)$. But, there's yet another, more symmetric, formulation of weak duality.

Suppose $x^*$ and $(\lambda^*, \mu^*)$ are primal-dual optimal. Then from $(3.4)$ we have weak duality in terms of the primal and dual optima, $f_0(x^*)$ and $g(\lambda^*, \mu^*)$, as:

<br>
$$g(\lambda^*, \mu^*) \leq f_0(x^*) \tag{3.5}$$
<br>

But since $g(\lambda^*, \mu^*)$ is the solution to the dual $(D)$, and $g(\lambda, \mu) = \min_x \mathcal{L}(x, \lambda, \mu)$:

<br>
$$g(\lambda^*, \mu^*) = \max_{\lambda \geq 0, \mu} \left\{ \min_x \mathcal{L}(x, \lambda, \mu) \right\} \tag{4.1}$$
<br>

Similarly, it can be shown that:

<br>
$$f_0(x^*) = \min_x \left\{ \max_{\lambda \geq 0, \mu} \mathcal{L}(x, \lambda, \mu) \right\} \tag{4.2}$$
<br>

To see this, note that for some $x$ fixed by the outer minimizer, maximizing the Lagrangian over $\lambda \geq 0$ and $\mu$ recovers $\mathcal{J}(x)$. 

If all of the inequality constraints are respected, that is $f_i(x) \leq 0$ $\forall i$, then, in order to maximize the Lagrangian, the best we can do is set $\lambda_i = 0$ $\forall i$. In case *any* inequality constraint is violated, that is $f_i(x) > 0$ for some $i$, the result of maximizing the Lagrangian can be made $\infty$ by choosing $\lambda_i \rightarrow \infty$ and $\lambda_j = 0$ $\forall j \ne i$. 

Using similar logic, if all equality constraints are respected then $h_i(x) = 0$ $\forall i$. In this case $\mu_i$ can be chosen to be any value. If, on the other hand, some equality constraint is violated then $h_i(x) \ne 0$ for some $i$. By choosing $\mu_i \rightarrow \pm \infty$, where the sign depends on the direction of the violation, the result can be made $\infty$.

Thus we have shown that:

<br>
$$\begin{aligned}\max_{\lambda \geq 0, \mu} \mathcal{L}(x,\lambda,\mu) &= \begin{cases}\begin{aligned} 
&f_0(x) \ \ \textrm{if $x$ is feasible}
\\
&\infty \ \ \textrm{otherwise}
\end{aligned}\end{cases} \\ &= \mathcal{J}(x)\end{aligned}$$
<br>

Now, since $x^*$ is the solution to the primal $(P)$ and $\min_x \mathcal{J}(x) = f_0(x^*)$ we have $(4.2)$ as promised. 

Then, weak duality can be stated as:

<br>
$$\max_{\lambda \geq 0, \mu} \left\{ \min_x \mathcal{L}(x, \lambda, \mu) \right\} \leq \min_x \left\{ \max_{\lambda \geq 0, \mu} \mathcal{L}(x, \lambda, \mu) \right\} \tag{3.6}$$
<br>

## The Max-Min Inequality

The inequality expressed as $(3.6)$ is, in fact, a general result in mathematics called the [*Max-Min Inequality*](https://en.wikipedia.org/wiki/Max%E2%80%93min_inequality). To summarize, the Max-Min Inequality makes no assumptions about the function. It's simply true for all functions of the form $f: X \times Y \rightarrow \mathbb{R}$, and it states that:

<br>
$$\sup_{y\in Y} \left\{ \inf_{x\in X} f(x,y) \right\} \leq \inf_{x\in X} \left\{ \sup_{y\in Y} f(x,y) \right\}$$
<br>

Since no assumption is made on $f$, the inequality also holds for the Lagrangian, $\mathcal{L}$. And, since we're in the special case where the optimal values of the primal $(P)$ and the dual $(D)$ are assumed to exist, the functions attain the respective optima. That is, we can replace $\sup$ and $\inf$ in the inequality with $\max$ and $\min$ which obtains the promised symmetric formulation of weak duality as $(3.6)$.

We can now prove weak duality through a non-optimization lens simply by proving the Max-Min Inequality.

For any $f$, and $x \in X$, $y \in Y$ we have:

<br>
$$f(x,y) \leq \sup_y f(x,y) \ \ \forall x$$
<br>

The right hand side is now only a function of $x$, so minimizing both sides w.r.t. $x$ yields:

<br>
$$ \inf_x f(x,y) \leq \inf_x \left\{ \sup_y f(x,y) \right\} \ \ \forall y$$
<br>

The right hand side is now a constant, so maximizing both sides w.r.t. $y$ results in the desired conclusion.

<br>
$$\sup_y \left\{ \inf_x f(x,y) \right\} \leq \inf_x \left\{ \sup_y f(x,y) \right\}$$
<br>

> Note: As we can see, the Max-Min Inequality proof mirrors the steps taken to obtain $(3.2)$ through $(3.4)$ from $(3.1)$. In fact, $(3.1)$ is of the form $f(x,y) \leq \sup_y f(x,y) \ \ \forall x$, the first step of the Max-Min Inequality proof, since, in $(3.1)$, $\mathcal{J}(x)$ is, as shown earlier, equivalent to $\max_{\lambda \geq 0, \mu} \mathcal{L}(x, \lambda, \mu)$. 

### Game-Theoretic Interpretation

The Max-Min Inequality is perhaps best understood intuitively as a game between two adversarial optimizers. 

The LHS of the Max-Min Inequality can be interpreted as the following game. First, the outer maximizer, player $Y$, fixes its choice $y$. Then, the inner minimizer, player $X$, chooses $x_y = \arg \inf_x f(x,y)$ which depends on the outer's choice of $y$. Suppose $y^* = \arg \sup_y f(x,y)$ is what player $Y$'s choice would have been were it to act independently of the actions of player $X$. We can imagine a scenario in which the score $f(x_{y^*}, y^*)$ is less than the score $f(x_y, y)$ for some other choice of $y$. So, player $Y$ cannot do as well as it would've done independently, whereas player $X$ is free to do its best. Hence, player $X$, the second player, restricts the choices of player $Y$, the first player. 

If the goal is to score low then player $X$ has the advantage by playing second. Conversely, if the goal is to score high, player $Y$ has the advantage if it goes after player $X$. This is exactly what the Max-Min Inequality says.
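The order-of-play effect is easy to see on a small finite game. The payoff table below is a made-up example (an assumption for illustration only): rows are player $X$'s choices, columns are player $Y$'s choices, $X$ wants a low score and $Y$ wants a high one.

```python
# payoff[x][y]: score when player X picks row x and player Y picks column y.
# The numbers are a hypothetical example chosen to show a strict gap.
payoff = [
    [1, 4],
    [3, 2],
]
xs = range(len(payoff))
ys = range(len(payoff[0]))

# LHS of the Max-Min Inequality: Y commits first, X replies with the lowest score.
max_min = max(min(payoff[x][y] for x in xs) for y in ys)

# RHS of the Max-Min Inequality: X commits first, Y replies with the highest score.
min_max = min(max(payoff[x][y] for y in ys) for x in xs)

print(max_min, min_max)  # prints "2 3"
```

Whoever moves second does at least as well as they would moving first: the score is $2$ when the minimizer replies last and $3$ when the maximizer replies last.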

# Strong Duality

Strong duality is the case in which the primal and the dual optima agree exactly. 

<br>
$$g(\lambda^*, \mu^*) = f_0(x^*) \tag{5.1}$$
<br>


Alternatively, in its Max-Min characterization:

<br>
$$\max_{\lambda \geq 0, \mu} \left\{ \min_x \mathcal{L} (x, \lambda, \mu) \right\} = \min_x \left\{ \max_{\lambda \geq 0, \mu} \mathcal{L} (x, \lambda, \mu) \right\} \tag{5.2}$$
<br>

A common way to say a problem is strongly dual is to say its ***duality gap*** is zero. The duality gap is defined as the difference between the primal and dual optima, that is $f_0(x^*) - g(\lambda^*, \mu^*)$. This characterization, then, follows immediately from the definition of strong duality as stated in $(5.1)$.

Optimization problems that exhibit the property of strong duality are called ***strongly dual***. 

As mentioned briefly in the introduction, strong duality gives applied scientists the ability to solve an equivalent, usually easier, dual optimization problem instead of the primal one which may be difficult to solve.
Strong duality also obtains powerful optimality conditions which allow us to check if suspected optima are, indeed, optimal. We will soon make both of these claims rigorous but, for now, it's enough to think of them as benefits of strong duality so that we understand the value in knowing, in advance, whether or not a given problem is strongly dual. 

We shall see that all linear programs are strongly dual by a direct proof. When it comes to non-linear optimization, however, strong duality is not a guarantee. The good news is that sufficient conditions for strong duality do exist and will be provided next.

## Slater's Condition - Sufficient Condition for Strong Duality

Not all convex programs are strongly dual. However, convexity together with ***Slater's condition*** is sufficient to establish strong duality. 

> **Slater's Condition:** &nbsp; $\exists \ \hat x$ s.t. $f_i(\hat x) < 0$, and $h_i(\hat x) = 0$ $\forall i$.

Informally, Slater's condition says that the existence of a feasible point which has margin w.r.t. all the inequality constraints is needed in addition to convexity. In even simpler terms, the feasible region must, roughly speaking, have an interior point. 

The sufficient condition for strong duality is then:

> **Sufficient Condition for Strong Duality:** &nbsp; Any convex optimization problem satisfying Slater's condition has zero duality gap.

The proof of this is beyond what we're trying to accomplish in this post.

## The Max-Min Equality

Just as weak duality is the Max-Min Inequality in disguise, strong duality is the Max-Min Equality which, in general, holds for functions $f: X \times Y \rightarrow \mathbb{R}$ that have additional structure. Roughly speaking, when $f$ is saddle-shaped, that is convex in one variable and concave in the other, the Max-Min Inequality holds with equality. 

The following theorem, which we offer without proof, translates this result into the setting of optimization.

> **Saddle Point Theorem:** &nbsp; If $x^*$ and $(\lambda^*, \mu^*)$ are primal and dual optimal solutions for a convex problem which satisfies Slater's condition, they form a saddle point of the associated Lagrangian. Furthermore, if $(x^*, (\lambda^*, \mu^*))$ is a saddle point of a Lagrangian, then $x^*$ is primal optimal and $(\lambda^*, \mu^*)$ is dual optimal for the associated problem, and the ***duality gap*** is zero.

> Note: This theorem should *not* be taken as a ***certificate of strong duality***. If the Lagrangian is saddle-shaped then the associated problem is strongly dual; however, the converse is not true. Since not all strongly dual problems are convex problems which satisfy Slater's condition, if a problem is strongly dual it is *not* guaranteed that its Lagrangian is saddle-shaped.

In keeping with the game theoretic intuition developed in the section on weak duality, one can imagine a game in which the first player's optimal choice is independent of the second player's actions. In such a game, both players are free to play their best strategies and, consequently, the order of play is not important.
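A small numerical illustration of the Max-Min Equality, using the generic saddle-shaped function $f(x, y) = x^2 - y^2$ (convex in $x$, concave in $y$). This function and the grid are assumptions chosen for illustration; it is not the Lagrangian of any problem discussed above.

```python
def f(x, y):
    """Saddle-shaped test function: convex in x, concave in y."""
    return x * x - y * y

# Both players choose from the same grid on [-2, 2].
grid = [i / 10.0 for i in range(-20, 21)]

# Maximizer (y) commits first, minimizer (x) replies.
max_min = max(min(f(x, y) for x in grid) for y in grid)

# Minimizer (x) commits first, maximizer (y) replies.
min_max = min(max(f(x, y) for y in grid) for x in grid)

print(max_min, min_max)
```

Both orders of play give the same value, attained at the saddle point $(0, 0)$, so here the order of play does not matter.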

## An Easier Dual Problem 


Let's further qualify what we mean when we say strong duality gives an equivalent, usually easier, problem to solve. 

At the start of this post we considered a general convex program. However, everything we've discussed about Lagrangian duality applies to non-convex problems too. Suppose the primal problem is non-convex. The task is that of finding the primal optimum:

<br>
$$f_0(x^*) = \min_x \left\{ \max_{\lambda \geq 0, \mu} \mathcal{L} (x, \lambda, \mu) \right\}$$
<br>

But maximizing the Lagrangian over $\lambda \geq 0$ and $\mu$ for a fixed $x$, recovers $\mathcal{J}(x)$: a non-differentiable objective. So, we cannot use the unconstrained optimality condition in finding the stationary points of $\mathcal{J}(x)$ which is what's required in the next step.

Meanwhile, the dual problem is that of finding the dual optimum:

<br>
$$g(\lambda^*, \mu^*) = \max_{\lambda \geq 0, \mu} \left\{ \min_x \mathcal{L} (x, \lambda, \mu) \right\}$$
<br>

Minimizing the Lagrangian over $x$ for fixed $\lambda \geq 0$ and $\mu$ may still be a difficult problem but, at least, it lends itself to using the method of unconstrained optimization. Moreover, the resulting dual function $g(\lambda, \mu) = \min_x \mathcal{L}(x, \lambda, \mu)$ is a point-wise minimum of affine functions in $\lambda$ and $\mu$, so it's always concave in those variables. Meanwhile, the constraint $\lambda \geq 0$ is a simple convex, in fact linear, constraint. So, overall, the dual problem is a convex problem regardless of the convexity of the primal. 

Solving the convex dual problem is usually easier than solving the non-convex primal. However, even if the primal is a convex problem to begin with, the dual may still be easier to solve. The primal could have more variables than constraints, in which case its dual, which has one variable per primal constraint, is a problem in fewer variables and may be easier to solve.
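The concavity of the dual function can be checked numerically even when the primal is non-convex. The sketch below uses a made-up non-convex objective $f_0(x) = x^4 - 3x^2 + x$ with the constraint $x - 0.5 \leq 0$, and approximates the inner minimization by a grid search. The problem, the grids, and the tolerance are all assumptions for illustration.

```python
# Hypothetical non-convex primal (an assumption for illustration):
#   minimize x^4 - 3x^2 + x  subject to  x - 0.5 <= 0
def f0(x):
    return x ** 4 - 3.0 * x ** 2 + x

def f1(x):
    return x - 0.5

xs = [i / 100.0 for i in range(-300, 301)]

def g(lam):
    """Dual function, approximated by minimizing the Lagrangian over a grid in x."""
    return min(f0(x) + lam * f1(x) for x in xs)

# f0 violates midpoint convexity between x = -1 and x = 1.
primal_not_convex = f0(0.0) > 0.5 * (f0(-1.0) + f0(1.0))

# g satisfies midpoint concavity for every pair of multipliers on the grid.
lams = [i / 2.0 for i in range(0, 11)]
dual_concave = all(
    g(0.5 * (a + b)) >= 0.5 * (g(a) + g(b)) - 1e-9
    for a in lams
    for b in lams
)

print(primal_not_convex, dual_concave)
```

Since the grid version of $g$ is a minimum of finitely many affine functions of $\lambda$, the concavity check passes regardless of the shape of $f_0$.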

So, in the settings described above, if strong duality holds we've found an easier approach to the primal problem. If strong duality fails to hold then, at the very least, we've found a useful lower-bound to the primal optimum. 

## Optimality Conditions
    
Strong duality also obtains two powerful optimality conditions known as the ***stationarity condition*** and ***complementary slackness***. These are often bundled into the [*Karush–Kuhn–Tucker (KKT) Conditions*](https://en.wikipedia.org/wiki/Karush%E2%80%93Kuhn%E2%80%93Tucker_conditions) which will be provided shortly.
    
### Stationarity Condition

In the section titled [An Easier Dual Problem](https://v-poghosyan.github.io/blog/optimization/applied%20mathematics/proofs/2022/02/07/Optimization-LP-Duality.html#An-Easier-Dual-Problem) we mentioned that the dual problem is that of finding the dual optimal value:

<br>
$$g(\lambda^*, \mu^*) = \max_{\lambda \geq 0, \mu} \left\{ \min_x \mathcal{L} (x, \lambda, \mu) \right\}$$
<br>

If strong duality holds, this dual optimum agrees with the primal optimum. That is: 

<br>
$$g(\lambda^*, \mu^*) = f_0(x^*)$$
<br>

It turns out there's more that can be said. As we saw earlier, optimizing the unconstrained objective $\mathcal{J}(x)$ not only resulted in the primal optimum $f_0(x^*)$ for some optimizer $x^*$ of the constrained problem; the very same point $x^*$ turned out to be an optimizer of $\mathcal{J}(x)$. Similarly, we can show that the primal optimizer $x^*$ of a primal-dual optimal pair $(x^*, (\lambda^*, \mu^*))$ minimizes $\mathcal{L}(x, \lambda^*, \mu^*)$. In other words, the primal optimizer $x^*$ is a stationary point of the Lagrangian at the dual optimum $(\lambda^*,\mu^*)$.

That is:

<br>
$$x^* = \arg \min_x \mathcal{L} (x, \lambda^*, \mu^*) \tag{6.1}$$ 
<br>

Or, equivalently:

<br>
$$\min_x \mathcal{L}(x, \lambda^*, \mu^*) = \mathcal{L}(x^*, \lambda^*, \mu^*) \tag{6.2}$$
<br>

We can think of $(6.1)$ and $(6.2)$ as the analogs of $(2.2)$ and $(2.1)$, respectively, for the Lagrangian. This is exactly what we've been looking for. Recall that the original motivation in augmenting the constrained problem into $\mathcal{J}(x)$ was to find the former's optimizer using the unconstrained optimality condition on $\mathcal{J}(x)$. $(2.1)$ or $(2.2)$ would then guarantee that the optimizer of $\mathcal{J}(x)$ we found was, itself, the optimizer of the original problem. Failing that, we relaxed $\mathcal{J}(x)$ into $\mathcal{L}(x, \lambda, \mu)$ hoping we can still do the same. $(6.1)$ and $(6.2)$ guarantee we can. They say that the optimizer $x^*$ of the original problem can be found by optimizing the unconstrained objective $\mathcal{L}(x, \lambda^*, \mu^*)$ and, since the latter is differentiable w.r.t. $x$ whenever the objective and constraints are, we can now proceed.

In practice, however, $(6.1)$ and $(6.2)$ only give us a way to solve for a primal optimal $x^*$ directly if a dual optimal $(\lambda^*, \mu^*)$ is already known. This is useful any time the dual problem is easier to solve than the primal.

More generally, this fact gives us a way to check if a pair $(x^*,(\lambda^*,\mu^*))$ is primal-dual optimal, an optimality condition known as the *stationarity condition*.
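Here is a minimal sketch of recovering a primal optimizer from a known dual optimizer via $(6.1)$. It assumes a toy problem that is not from the text: minimize $(x-2)^2$ subject to $x - 1 \leq 0$, for which the dual function is $g(\lambda) = \lambda - \lambda^2/4$ and hence $\lambda^* = 2$. The grid search stands in for whatever unconstrained method one would actually use.

```python
# Toy problem (an illustrative assumption, not from the text):
#   minimize (x - 2)^2  subject to  x - 1 <= 0
def lagrangian(x, lam):
    return (x - 2.0) ** 2 + lam * (x - 1.0)

# Dual optimizer, assumed already known (maximizer of g(lam) = lam - lam^2 / 4).
lam_star = 2.0

# (6.1): x* minimizes the unconstrained objective L(x, lam*).
grid = [i / 100.0 for i in range(-300, 301)]
x_star = min(grid, key=lambda x: lagrangian(x, lam_star))

print(x_star, lagrangian(x_star, lam_star))
```

The recovered point $x^* = 1$ is feasible and gives $\mathcal{L}(x^*, \lambda^*) = 1$, which matches the primal optimum of this toy problem.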

> **Stationarity Condition:** &nbsp; Suppose $x^*$ and $(\lambda^*, \mu^*)$ are primal-dual optimal for a strongly dual problem. Then:
<br>
$$\nabla_x f_0(x^*) + \sum_{i=1}^m \lambda^*_i\nabla_xf_i(x^*) + \sum_{i=1}^p \mu^*_i\nabla_xh_i(x^*) = 0$$

The stationarity condition is obtained simply by an application of the unconstrained optimality condition to $\mathcal{L}(x, \lambda^*, \mu^*)$.

<br>
$$\nabla_x \mathcal{L} (x^*, \lambda^*, \mu^*) = 0$$ 
<br>

Then, expanding the LHS gives:

<br>
$$\nabla_x f_0(x^*) + \sum_{i=1}^m \lambda^*_i\nabla_xf_i(x^*) + \sum_{i=1}^p \mu^*_i\nabla_xh_i(x^*) = 0$$ 
<br>

For the sake of completeness, since we stated them without offering a proof, let's prove the equivalent claims $(6.1)$ and $(6.2)$ from which stationarity condition ultimately follows.

#### Proof of Claims (6.1) and (6.2)

Suppose $x^*$ and $(\lambda^*, \mu^*)$ are primal-dual optimal for a strongly dual problem. 

The following point-wise inequality holds in general since its LHS is a minimization over $x$ and its RHS is a maximization over $(\lambda, \mu)$ of the Lagrangian.

<br>
$$g(\lambda, \mu) \leq \mathcal{L}(x, \lambda, \mu) \leq \mathcal{J}(x) \ \ \forall x, \lambda \geq 0, \mu$$
<br>

It is also, in particular, true for the primal-dual optimal pair. That is:

<br>
$$g(\lambda^*, \mu^*) \leq \mathcal{L}(x^*, \lambda^*, \mu^*) \leq \mathcal{J}(x^*) \tag{7.1}$$
<br>

However, $\mathcal{J}(x^*) = f_0(x^*)$ and, by strong duality, $g(\lambda^*, \mu^*) = f_0(x^*)$. Hence, $g(\lambda^*, \mu^*) = \mathcal{J}(x^*)$ and $(7.1)$ actually holds with equality throughout. In particular:

<br>
$$\mathcal{L}(x^*, \lambda^*, \mu^*) = g(\lambda^*, \mu^*) \tag{7.2}$$ 
<br>

Substituting the definition of the dual function for the RHS of $(7.2)$, we get:

<br>
$$\mathcal{L}(x^*, \lambda^*, \mu^*) = \min_x \mathcal{L}(x, \lambda^*, \mu^*)$$ 
<br>

Which is exactly $(6.2)$ and, by equivalence, also $(6.1)$.

### Complementary Slackness

Strong duality also obtains another optimality condition known as *complementary slackness (CS)*.

> **Complementary Slackness (CS):** &nbsp; Suppose $x^*$ and $(\lambda^*, \mu^*)$ are primal-dual optimal for a strongly dual problem. Then:
<br>
$$\lambda^*_i f_i(x^*) = 0 \ \ \forall i$$

Informally, if a primal constraint at an optimal $x^*$ is *loose*, that is $f_i(x^*) < 0$, then its corresponding dual variable $\lambda^*_i$ in the dual optimal $\lambda^*$ must be zero. Conversely, if the dual variable $\lambda_i^*$ is positive then the corresponding constraint must be *tight*.

Note that if a primal constraint is *tight* at $x^*$, complementary slackness tells us nothing about its corresponding dual variable. 

#### Proof of Complementary Slackness

Suppose $x^*$ and $(\lambda^*, \mu^*)$ are primal-dual optimal for a strongly dual problem. 

Starting from strong duality, $f_0(x^*) = g(\lambda^*, \mu^*)$, and expanding the RHS we obtain:

<br>
$$
\begin{aligned}
f_0(x^*) &= g(\lambda^*, \mu^*) \\ 
&= \min_x \mathcal{L}(x, \lambda^*, \mu^*) \\ 
&= \mathcal{L}(x^*, \lambda^*, \mu^*) \\
&=  f_0(x^*) + \sum_{i=1}^m \lambda_i^* f_i(x^*) + \sum_{i=1}^p \mu_i^* h_i(x^*) \\
&\leq f_0(x^*)
\end{aligned} \tag{8.1}
$$
<br>

The first equality holds by strong duality, the second holds by the definition of the dual function, the third equality holds by $(6.2)$, and the fourth is true by expansion of $\mathcal{L}(x^*, \lambda^*, \mu^*)$.

To see why the last inequality holds, note that: 

<br>
$$\sum_{i=1}^p \mu_i^* h_i(x^*) = 0$$
<br>

since, by feasibility of $x^*$, $h_i(x^*) = 0 \ \ \forall i$. Then again, by feasibility of $x^*$, we have:

<br>
$$f_i(x^*) \leq 0  \ \ \forall i \tag{8.2}$$
<br>

Furthermore, by dual feasibility, $\lambda^* \geq 0$. So, together with $(8.2)$, we have: 

<br>
$$\sum_{i=1}^m \lambda^*_i f_i(x^*) \leq 0$$
<br>

But taken altogether $(8.1)$ says $f_0(x^*) \leq f_0(x^*)$ which can *only* hold with equality. 

Then it must be the case that $\sum_{i=1}^m \lambda^*_i f_i(x^*) = 0$.

Being a sum of non-positive terms, $\sum_{i=1}^m \lambda^*_i f_i(x^*) = 0$ *if and only if* 

<br>
$$\lambda^*_i f_i(x^*) = 0 \ \ \forall i \tag{8.3}$$
<br>

which is complementary slackness.

## Karush-Kuhn-Tucker (KKT) Conditions 

As mentioned above, complementary slackness and stationarity condition are often bundled into the KKT Conditions.

In the absence of strong duality the KKT Conditions are, under suitable regularity assumptions on the constraints, necessary but not sufficient for optimality. However, for problems which are strongly dual the KKT Conditions become a ***certificate of optimality***. That is, they are both necessary and sufficient.

> **KKT Conditions:** &nbsp; The primal-dual pair $(x^*, (\lambda^*, \mu^*))$ satisfies the ***KKT conditions*** if the following hold:
&nbsp;
> 1. $\nabla_x f_0(x^*) + \sum_{i=1}^m \lambda^*_i\nabla_xf_i(x^*) + \sum_{i=1}^p \mu^*_i\nabla_xh_i(x^*) = 0$
> 2. $\lambda^*_if_i(x^*) = 0 \ \ \forall i$
> 3. $f_i(x^*) \leq 0 \ \ \forall i$ 
> 4. $h_i(x^*) = 0 \ \ \forall i$ 
> 5. $\lambda^* \geq 0$

We recognize *KKT-1* as the stationarity condition, and *KKT-2* as complementary slackness. *KKT-3* through *5* simply ensure primal-dual feasibility.

Primal-dual pairs which satisfy the KKT Conditions are called ***KKT pairs***.
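
To make the five conditions concrete, here is a minimal sketch that checks them numerically. It assumes a toy problem that is not part of the discussion above: minimize $x_1^2 + x_2^2$ subject to $f_1(x) = 1 - x_1 - x_2 \leq 0$, with no equality constraints. The helper name `is_kkt_pair` and the tolerance are likewise assumptions made for illustration.

```python
# Toy problem (an assumption for illustration):
#   minimize   f0(x) = x1^2 + x2^2
#   subject to f1(x) = 1 - x1 - x2 <= 0   (no equality constraints)

def grad_f0(x):
    return [2 * x[0], 2 * x[1]]

def f1(x):
    return 1 - x[0] - x[1]

def grad_f1(x):
    return [-1.0, -1.0]

def is_kkt_pair(x, lam, tol=1e-9):
    """Check KKT-1, 2, 3 and 5 for the toy problem (KKT-4 is vacuous here)."""
    stationarity = all(
        abs(g0 + lam * g1) <= tol for g0, g1 in zip(grad_f0(x), grad_f1(x))
    )
    comp_slackness = abs(lam * f1(x)) <= tol
    primal_feasible = f1(x) <= tol
    dual_feasible = lam >= -tol
    return stationarity and comp_slackness and primal_feasible and dual_feasible

print(is_kkt_pair([0.5, 0.5], 1.0))  # candidate optimum with its multiplier
print(is_kkt_pair([1.0, 1.0], 0.0))  # feasible point that is not stationary
```

The pair $x^* = (0.5, 0.5)$, $\lambda^* = 1$ passes every check, while the feasible point $(1, 1)$ with $\lambda = 0$ fails stationarity.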

> Note: These conditions only apply to problems with differentiable objective and constraints. For the case in which one or more of the objective or constraints is non-differentiable, there is an easy generalization of the KKT conditions using sub-differentials. However, sub-differentials are beyond the scope of this post.

As promised, KKT Conditions together with strong duality obtain the following certificate of optimality.

> **Certificate of Optimality:** &nbsp; If strong duality holds, then $x^*, (\lambda^*, \mu^*)$ are primal-dual optimal if and only if they are a KKT Pair.

We have already shown one direction of this in the sections on [stationarity condition](https://v-poghosyan.github.io/blog/optimization/applied%20mathematics/proofs/2022/02/07/Optimization-Duality.html#Stationarity-Condition) and [complementary slackness](https://v-poghosyan.github.io/blog/optimization/applied%20mathematics/proofs/2022/02/07/Optimization-Duality.html#Complementary-Slackness), where we proved that being a primal-dual optimal pair in a strongly dual problem guarantees $(x^*, (\lambda^*, \mu^*))$ is also a KKT pair.

Showing the other direction affords us a geometric viewpoint of the KKT conditions. [*Farkas' Lemma*](https://en.wikipedia.org/wiki/Farkas%27_lemma), which we will shortly provide, is what underpins this result. 

In fact, $x^*$ being primal optimal is enough to guarantee the existence of a $(\lambda, \mu)$ s.t. $(x^*, (\lambda, \mu))$ is a KKT pair without the need for strong duality. If strong duality does hold, however, we also have that the $(\lambda, \mu)$ are dual optimal. In other words, if strong duality holds then being a KKT pair guarantees primal-dual optimality.

We will offer a proof of the above for the simple case of linear programs (which, as we may recall, are strongly dual). In this proof, we will construct a dual variable from a primal optimal $x^*$ by enforcing the KKT conditions, and then show that the dual variable obtained through this construction is, in fact, dual optimal.

### Generalization of Unconstrained Optimization


--- UNREVISED SECTION BEGINS ---

The KKT conditions represent a strict generalization of the unconstrained optimality condition for use in constrained problems. 

Note that if there are no constraints, the KKT conditions simply reduce to the familiar unconstrained optimality condition:

<br>
$$\nabla_x f_0(x^*) = 0$$

In order to discuss optimality in constrained problems, we must first define the concept of a *feasible direction*. 

<br>
> **Feasible Direction:** &nbsp; A unit vector $d$ is called a *feasible direction* at $x$ if $x + \epsilon d$ remains feasible for all small enough $\epsilon > 0$.
<br>

Then, we can generalize the unconstrained optimality condition by using Taylor Expansion as follows. 

For small enough $\epsilon > 0$, we can estimate $f_0(x^* + \epsilon d)$, where $x^*$ is optimal, for any feasible $d$ by its linear approximation:
<br> 
$$f_0(x^* + \epsilon d) \approx f_0(x^*) + \epsilon \nabla f_0(x^*)^Td$$

But since $x^*$ is optimal, we have:
<br>
$$
\begin{aligned}
f_0(x^*) &\leq f_0(x^* + \epsilon d) \\
& \approx f_0(x^*) + \epsilon \nabla f_0(x^*)^Td
\end{aligned}
$$

Dividing by $\epsilon > 0$ and letting $\epsilon \to 0$ gives $\nabla f_0(x^*)^Td \geq 0$. And since $d$ was an arbitrary feasible direction, the result holds for all feasible directions $d$. 

<br>
> **Constrained Optimality Condition:** &nbsp; If $x^*$ is an optimizer of $f_0$ over some constraint set then, for any feasible direction $d$ at $x^*$, $\nabla f_0(x^*)^Td \geq 0$.
<br>
<br>

In words, the directional derivative of the objective function in *any* feasible direction at the optimizer must be non-negative. This ensures that moving in any allowable direction does not improve the objective.
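
The condition can be probed numerically. The sketch below assumes a toy problem chosen only for illustration: minimize $x_1^2 + x_2^2$ subject to $x_1 + x_2 \geq 1$, whose minimizer is $x^* = (0.5, 0.5)$. Since the single constraint is linear and binding at $x^*$, a direction $d$ is feasible exactly when $d_1 + d_2 \geq 0$. The sampling scheme and the names used are assumptions.

```python
import math

x_star = (0.5, 0.5)
grad_f0 = (2 * x_star[0], 2 * x_star[1])   # gradient of x1^2 + x2^2 at x*

def is_feasible_direction(d, tol=1e-12):
    """For the binding linear constraint x1 + x2 >= 1, x* + eps*d stays
    feasible for small eps > 0 exactly when d1 + d2 >= 0."""
    return d[0] + d[1] >= -tol

# Sample unit vectors around the circle and keep the feasible ones.
directions = [
    (math.cos(math.radians(k)), math.sin(math.radians(k))) for k in range(360)
]
feasible = [d for d in directions if is_feasible_direction(d)]

# Directional derivative of f0 at x* along each feasible direction.
slopes = [grad_f0[0] * d[0] + grad_f0[1] * d[1] for d in feasible]
print(len(feasible), min(slopes) >= -1e-9)
```

Every sampled feasible direction has a non-negative directional derivative, while infeasible directions such as $d = (-1, 0)$ would decrease the objective.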

We've already seen that being a primal-dual optimal pair guarantees that $(x^*, (\lambda^*, \mu^*))$ is also a KKT Pair. But, in showing that, we did not use the constrained optimality condition at $x^*$. Doing so is worth it, however, because it provides a key geometric insight. So, let's show that if $x^*$ satisfies the constrained optimality condition then its KKT Pair exists. 

If a particular constraint is loose at $x^*$ then taking a small enough step in any direction from $x^*$ does not violate it. Formally, if $f_i(x^*) < 0$, then $f_i(x^* + \epsilon d) \leq 0 \ \ \forall d$ for small enough $\epsilon > 0$. So, loose constraints do not pose any restrictions on the feasible directions.

However, if a constraint is tight at $x^*$, that is $f_i(x^*) = 0$, then we must be careful not to violate it. For small enough $\epsilon > 0$, we can estimate $f_i(x^* + \epsilon d)$ by its linear Taylor Expansion as:
<br> 
$$f_i(x^* + \epsilon d) \approx f_i(x^*) + \epsilon \nabla f_i(x^*)^Td$$

For feasibility, we want $f_i(x^* + \epsilon d) \leq 0$. So, we require:
<br>
$$f_i(x^*) + \epsilon \nabla f_i(x^*)^Td \leq 0$$

But since $f_i$ is tight at $x^*$, $f_i(x^*) = 0$, which leaves us with:
<br> 
$$\nabla f_i(x^*)^Td \leq 0 \ \ \forall i \ \textrm{that are binding at $x^*$}$$

Clearly the above is a restriction on $d$. The feasible directions can now be stated as:
<br>
$$d \ \textrm{s.t.} \ \nabla f_i(x^*)^Td \leq 0 \ \ \forall i \ \textrm{that are binding at $x^*$} \tag{8.1}$$

Or, equivalently:
<br>
$$d \ \textrm{s.t.} \ - \nabla f_i(x^*)^Td \geq 0 \ \ \forall i \ \textrm{that are binding at $x^*$} \tag{8.2}$$

But, since $x^*$ is optimal, by the constrained optimality condition:
<br>
$$\nabla f_0(x^*)^Td \geq 0 \ \ \forall \ \textrm{feasible} \ d \tag{8.3}$$

That is, for all $d$ as in $(8.2)$.

But together, $(8.2)$ and $(8.3)$ say that $\nexists \ d$ which defines a separating hyperplane between $\nabla f_0(x^*)$ and the $-\nabla f_i(x^*)$ of the binding constraints $i$. This means that the *only other* alternative must be true: $\nabla f_0(x^*)$ must lie in the cone of the $-\nabla f_i(x^*)$'s. Incidentally, this is what's known as a *theorem of the alternative*, specifically Farkas' Lemma, which will soon be covered in detail.

Formally, $\exists \ \lambda^* \geq 0$ s.t.
<br>
$$\nabla f_0(x^*) + \sum_{i \in I} \lambda^*_i \nabla f_i(x^*) = 0 \tag{8.4}$$

Where $I = \{i : f_i(x^*) = 0 \}$ is the set of active inequality constraints.

But, upon closer examination, $(8.4)$ is exactly $(KKT \ 1)$, $(KKT \ 2)$, and $(KKT \ 5)$ rolled into one condition. The remaining conditions, $(KKT \ 3)$ and $(KKT \ 4)$, of course, follow from the assumed feasibility of $x^*$.

We can also show that if $(x^*, (\lambda^*, \mu^*))$ is a KKT Pair then $x^*$ satisfies the constrained optimality condition. Note, however, that this says nothing about the dual optimality of $(\lambda^*, \mu^*)$ unless Strong Duality is assumed. The argument is simply reversed: if $\nabla f_0(x^*)$ is in the aforementioned cone then moving in any feasible direction cannot decrease the objective to first order, which is exactly what the constrained optimality condition says. This, of course, provides a strong geometric interpretation of the stationarity condition.

# The Dual of an Unconstrained Problem

As mentioned briefly, in the case of certain types of unconstrained problems, the ***Fenchel-Legendre (FL) Transform*** is what gives rise to the dual. 

First, we define the FL transform which is also known as a ***convex conjugate*** for reasons that will soon become apparent. 

<br>
> **FL Transform / Convex Conjugate:** &nbsp; The *FL Transform* or *Convex Conjugate* of a function $f: \mathbb{R}^n \rightarrow \mathbb{R}$ is: 
$$f^*(y) = \sup_x \left\{y^Tx - f(x)\right\}$$
<br>

We note some key properties of the FL Transform.

## FL Transform - a Convex Operation

The FL Transform $f^*$ is always convex regardless of the convexity of $f$. 

That's because, for a fixed $x$, $y^Tx - f(x)$ is an affine function of $y$. So, $f^*$ is a point-wise supremum of affine functions, making it convex. 

## The Case of Involution

The double FL Transform $f^{**}$ does not always recover $f$. To see this, note that, as an FL Transform of the function $f^*$, $f^{**}$ is always convex. Therefore, $f^{**} \ne f$ if $f$ is non-convex. 

But convexity alone is not enough to guarantee involution. We need an additional condition on $f$, namely that its sub-level sets must be closed, to ensure $f^{**} = f$.
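
Both properties can be observed on a grid. The sketch below is a discretized approximation, not an exact transform: it assumes one-dimensional functions sampled on hypothetical grids `xs` and `ys`, and replaces the supremum by a maximum over grid points. It compares $f^{**}$ with $f$ for the convex $f(x) = x^2/2$ and for a non-convex double well $(x^2 - 1)^2$.

```python
def conjugate(values, points, slopes):
    """Discrete FL transform: for each slope s, max over grid points p
    of (s * p - value at p)."""
    return [max(s * p - v for p, v in zip(points, values)) for s in slopes]

xs = [i / 10 for i in range(-20, 21)]   # primal grid on [-2, 2]
ys = [j / 10 for j in range(-40, 41)]   # slope grid on [-4, 4]

def biconjugate(values):
    f_star = conjugate(values, xs, ys)   # f* sampled on the slope grid
    return conjugate(f_star, ys, xs)     # f** sampled back on the primal grid

convex = [0.5 * x * x for x in xs]
double_well = [(x * x - 1) ** 2 for x in xs]   # non-convex

# For the convex function the double transform recovers f on the grid.
gap_convex = max(abs(a - b) for a, b in zip(convex, biconjugate(convex)))

# For the double well, f** sits strictly below f between the two wells.
mid = xs.index(0.0)
print(gap_convex)
print(double_well[mid], biconjugate(double_well)[mid])
```

At $x = 0$ the double well takes the value $1$ while its double transform is $0$, the value of its convex envelope there.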

## Inverse Gradients

If $f$ has closed sub-level sets and is convex then the gradients of $f$ and $f^*$ are inverses. That is, assuming both $f$ and $f^*$ are differentiable:
<br>
$$y = \nabla f(x) \iff x = \nabla f^*(y)$$

Let's prove the $\implies$ direction. 

Suppose $y = \nabla f(x)$. By $f$'s convexity: 
<br>
$$f(\hat x) \geq f(x) + y^T(\hat x - x) \ \ \forall \hat x$$

And so:
<br>
$$y^T \hat x - f(\hat x) \leq y^T x - f(x) \ \ \forall \hat x$$

Taking the supremum over $\hat x$ on the left, and noting that the bound is attained at $\hat x = x$, we obtain:
<br>
$$f^*(y) = y^T x - f(x)$$

The desired result follows by taking the gradient of both sides w.r.t. $y$. That is:
<br>
$$\nabla f^*(y) = x$$

The $\impliedby$ direction is similar. We start from the assumption that $x = \nabla f^*(y)$ and get the desired result by using the involution property $f^{**} = f$. 
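
As a quick check of the inverse-gradient property, consider $f(x) = e^x$, whose conjugate is $f^*(y) = y \log y - y$ for $y > 0$. This example is an assumption chosen for illustration. Its gradients are $\nabla f(x) = e^x$ and $\nabla f^*(y) = \log y$, which are inverses of each other.

```python
import math

def grad_f(x):         # f(x) = exp(x)
    return math.exp(x)

def f_conj(y):         # f*(y) = y*log(y) - y, valid for y > 0
    return y * math.log(y) - y

def grad_f_conj(y):    # derivative of f*
    return math.log(y)

x = 0.7
y = grad_f(x)          # y = grad f(x)

# x is recovered as grad f*(y), and the supremum defining f*(y) is attained at x.
print(math.isclose(grad_f_conj(y), x))
print(math.isclose(f_conj(y), y * x - math.exp(x)))
```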

## FL Duality

As mentioned, the FL Transform has a natural role in duality.

Suppose the unconstrained optimization problem is:
<br>
$$\min_x : f(x) + h(Ax)$$

Where $f$ and $h$ are convex functions, and $A$ is a matrix representing a bounded linear transformation. 

We introduce a dummy variable $y$ and form the artificial constraint $y = Ax$. The problem becomes:
<br>
$$
\begin{aligned} 
\min_{x,y} &: f(x) + h(y) \\ 
s.t. &: Ax = y
\end{aligned}$$

Forming the Lagrangian gives us:
<br>
$$\mathcal{L}(x,y,z) = f(x) + h(y) + z^T(Ax - y)$$

Then, the dual function is the following FL Transform:
<br>
$$
\begin{aligned}
g(z) &= \min_{x,y} \mathcal{L}(x,y,z) \\
&= \min_{x,y} f(x) + h(y) + z^T(Ax - y) \\
&= \min_{x,y} (A^Tz)^Tx + f(x) - z^Ty + h(y) \\ 
&= \min_x \left\{ (A^Tz)^Tx + f(x) \right\} + \min_y \left\{ -z^Ty + h(y) \right\} \\
&= \min_x \left\{ -\left((-A^Tz)^Tx - f(x)\right) \right\} + \min_y \left\{ -\left(z^Ty - h(y)\right) \right\} \\
&= - \max_x \left\{ (-A^Tz)^Tx - f(x) \right\} - \max_y \left\{ z^Ty - h(y) \right\} \\
&= - f^*(-A^Tz) - h^*(z)
\end{aligned}
$$

And, consequently, the dual problem is: 
<br> 
$$\max_z: - f^*(-A^Tz) - h^*(z)$$

Note that the negative of an FL Transform is always concave, regardless of the convexity of $f$ and $h$. So, the dual problem is a maximization of a concave function, which is an easy optimization problem.
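
Here is a small worked instance, under assumptions made only for illustration: scalar variables, $f(x) = x^2/2$, $h(y) = (y - c)^2/2$, and $A = a$ with hypothetical data $a = 2$, $c = 3$. The conjugates are $f^*(u) = u^2/2$ and $h^*(z) = cz + z^2/2$, and both optimizers are available in closed form by setting derivatives to zero.

```python
import math

a, c = 2.0, 3.0   # hypothetical problem data

def primal(x):    # f(x) + h(a*x) with f(x) = x^2/2 and h(y) = (y - c)^2/2
    return 0.5 * x**2 + 0.5 * (a * x - c) ** 2

def f_conj(u):    # conjugate of x^2/2
    return 0.5 * u**2

def h_conj(z):    # conjugate of (y - c)^2/2
    return c * z + 0.5 * z**2

def dual(z):      # g(z) = -f*(-A^T z) - h*(z)
    return -f_conj(-a * z) - h_conj(z)

x_opt = a * c / (1 + a**2)   # root of the primal derivative x + a*(a*x - c)
z_opt = -c / (1 + a**2)      # root of the dual derivative -a^2*z - c - z

print(primal(x_opt), dual(z_opt))
print(math.isclose(primal(x_opt), dual(z_opt)))
```

The two optimal values agree, as strong duality predicts for this convex problem, and any other $z$ gives a dual value below every primal value.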

# Linear Programs

Linear programs are the best setting in which to build intuition for complementary slackness and the other properties of strong duality that appear in the KKT conditions.

# Weak Duality in Linear Programs

Let's now focus on linear programs.

Suppose the primal is an LP of the form

$
\begin{cases}
\min_x: c^Tx
\\
s.t.: \begin{aligned} &Ax \geq b
\\ 
&x \geq 0
\end{aligned}
\end{cases}
$

By constructing the Lagrangian and going through the steps above we can show its dual is of the form

$
\begin{cases}
\max_p: b^Tp
\\
s.t.: \begin{aligned} &A^Tp \leq c
\\ 
&p \geq 0
\end{aligned}
\end{cases}
$

Weak Duality states

> **Weak Duality:** &nbsp; For any primal feasible $x$ and for all dual feasible $p$, $c^Tx \geq b^Tp$.
<br>

That is, the objective value $b^Tp$ of any dual feasible solution is a *lower bound* for the objective value $c^Tx$ of every primal feasible solution. Conversely, the objective value $c^Tx$ of any primal feasible solution is an *upper bound* for the objective value $b^Tp$ of every dual feasible solution. 

##  Proof of Weak Duality

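Let $(p, x)$ be respectively dual-primal feasible. Then $c^Tx = x^Tc \geq x^TA^Tp \geq b^Tp$. The first inequality holds because $c \geq A^Tp$ and $x \geq 0$; the second holds because $Ax \geq b$ and $p \geq 0$.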
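
Weak duality is easy to observe on a tiny LP. The sketch below assumes hypothetical data, $\min 2x_1 + 3x_2$ s.t. $x_1 + x_2 \geq 1$, $x \geq 0$, and simply enumerates feasible points on a coarse grid rather than solving either problem.

```python
A = [[1.0, 1.0]]   # hypothetical LP data: min c^T x  s.t.  A x >= b, x >= 0
b = [1.0]
c = [2.0, 3.0]

def primal_feasible(x):
    rows_ok = all(
        sum(a * xj for a, xj in zip(row, x)) >= bi for row, bi in zip(A, b)
    )
    return rows_ok and all(xj >= 0 for xj in x)

def dual_feasible(p):
    cols_ok = all(
        sum(A[i][j] * p[i] for i in range(len(p))) <= c[j] for j in range(len(c))
    )
    return cols_ok and all(pi >= 0 for pi in p)

grid = [k / 4 for k in range(13)]   # 0, 0.25, ..., 3

primal_values = [
    c[0] * x1 + c[1] * x2
    for x1 in grid for x2 in grid if primal_feasible([x1, x2])
]
dual_values = [b[0] * p for p in grid if dual_feasible([p])]

# Every dual value is a lower bound for every primal value.
print(max(dual_values) <= min(primal_values))
print(min(primal_values), max(dual_values))
```

On this grid the smallest primal value and the largest dual value coincide at $2$, attained at $x = (1, 0)$ and $p = 2$, so this LP also illustrates strong duality.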

# Strong Duality

While Weak Duality is a useful result, the real strength of duality theory lies in *Strong Duality*. Strong duality is a re-statement of Von Neumann's [Minimax Theorem](https://en.wikipedia.org/wiki/Minimax_theorem) which lays out the conditions for which the max-min inequality holds with strict equality. Roughly speaking, it holds for functions that are saddle-shaped, that is, convex in one variable and concave in the other.

Instead of proving the Minimax Theorem in the general case, we will stay topical and prove Strong Duality for LP's. That is, the Minimax Theorem as it pertains to the special case of linear programs... 

> **Strong Duality:** &nbsp; If the primal is feasible and bounded with optimal $x^*$ then the dual is also feasible and bounded. Furthermore, if the dual has optimum $p^*$ then $c^Tx^* = b^Tp^*$.
<br>

To prove Strong Duality, we require *Farkas' Lemma*.

## Farkas' Lemma

*Farkas' Lemma* belongs to the class of theorems called *Theorems of the Alternative*. These are theorems stating that exactly one of two statements holds true.

The lemma simply states that a given vector $c$ is either a [conic combination](https://v-poghosyan.github.io/blog/optimization/applied%20mathematics/proofs/2022/01/23/Optimization-Review-of-Linear-Algebra-and-Geometry.html#Conic-Combinations-of-$n$-Points) of $a_i$'s for some $i \in I$, or it's separated from their cone by some hyperplane. 

We state Farkas' Lemma without offering proof since it has such an obvious geometric interpretation.

> **Farkas' Lemma:** &nbsp; For any vector $c$ and $a_i \ \ (i \in I)$ exactly one of the following two statements holds:  
&nbsp;
> * $\exists p \geq 0$ s.t. $c = \sum_{i \in I} a_ip_i$
> * $\exists$ vector $d$ s.t. $d^Ta_i \geq 0 \ \ \forall i \in I$ but $d^Tc < 0$
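
The two alternatives can be computed explicitly in a simple special case. The sketch below assumes two linearly independent generators $a_1, a_2 \in \mathbb{R}^2$, so that $c = p_1a_1 + p_2a_2$ has a unique solution $p$. If $p \geq 0$ the first statement holds. Otherwise the row of the inverse matrix belonging to a negative $p_k$ serves as the separating $d$, since it satisfies $d^Ta_k = 1$, $d^Ta_i = 0$ for $i \ne k$, and $d^Tc = p_k < 0$. The function name `farkas_2d` is hypothetical.

```python
def farkas_2d(a1, a2, c, tol=1e-12):
    """For linearly independent a1, a2 in R^2 return ('cone', p) when
    c = p1*a1 + p2*a2 with p >= 0, else ('separated', d)."""
    det = a1[0] * a2[1] - a2[0] * a1[1]
    if abs(det) < tol:
        raise ValueError("a1 and a2 must be linearly independent")
    # Rows of the inverse of M = [a1 a2] (generators as columns).
    inv_rows = [(a2[1] / det, -a2[0] / det), (-a1[1] / det, a1[0] / det)]
    p = [row[0] * c[0] + row[1] * c[1] for row in inv_rows]   # solves M p = c
    if min(p) >= -tol:
        return "cone", p
    k = p.index(min(p))
    return "separated", inv_rows[k]   # d.a_i >= 0 for both i, and d.c = p_k < 0

a1, a2 = (1.0, 0.0), (0.0, 1.0)
print(farkas_2d(a1, a2, (1.0, 1.0)))    # c inside the cone
print(farkas_2d(a1, a2, (-1.0, 1.0)))   # c outside the cone
```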

## Proof of Strong Duality in LP's

> Note: Being a linear program is another sufficient condition for strong duality. The proof below does not rely on any of the sufficient conditions discussed earlier.

The proof is by construction. 

Suppose $x^*$ is a primal optimal solution. Let the set $I_{x^*} = \{ i : a_i^Tx^* = b_i\}$ be the set of the indices of the active constraints at $x^*$. Our goal is to construct a dual optimal solution $p^*$ s.t. $c^Tx^* = b^Tp^*$. 

Let $d$ be any vector that satisfies $d^Ta_i \geq 0 \ \ \forall i \in I_{x^*}$. That is, $d$ is a feasible direction w.r.t. all the active constraints.

A small, positive $\epsilon$-step in the direction of $d$ results in a point $x^* + \epsilon d$ that's still feasible. The fact that the step is small is what guarantees no inactive constraints are violated.

Let's compare the value of the objective at $x^* + \epsilon d$ to the value of the objective at $x^*$.

By the assumption that $x^*$ is optimal, we have $c^Tx^* \leq c^T(x^* + \epsilon d) = c^Tx^* + \epsilon c^Td$. Thus, $c^Td = d^Tc \geq 0$.

> Note: $d^Tc$ is nothing but the *directional derivative* at the minimizer $x^*$. It is a *first-order necessary-condition* that the *directional derivative* in any feasible direction $d$ be non-negative at any minimizer $x^*$. This is analogous to the first-derivative test for scalar-valued functions. So, this result should have been expected...
<br>

But since $d$ is a vector s.t. $d^Ta_i \geq 0 \ \ \forall i \in I_{x^*}$ and $d^Tc \geq 0$, then $d$ does *not* separate $c$ from the cone of the $a_i$'s. And since $d$ was arbitrary, this puts us in the setting of Farkas' Lemma. Namely, there exist *no* vectors $d$ that separate $c$ from the cone. This means the second statement in Farkas' Lemma fails and the first must be true: $c$ must be a conic combination of the $a_i$'s that are active at the minimizer. In other words, $\exists p \geq 0$ s.t. $c = \sum_{i \in I_{x^*}} p_ia_i$. 

> Note: $c = \sum_{i \in I_{x^*}} p_ia_i$ should remind us of the Lagrange optimality condition in the general case of convex optimization. Recall that the Lagrange condition states that a point $x^*$ is optimal for a convex problem with objective $f(x)$ and constraints $g_i(x)=0$ if and only if $\exists  \lambda_i$ for each active constraint s.t. $\nabla f(x^*) = \sum_i \lambda_i \nabla g_i(x^*)$. In fact, Farkas' Lemma is what underpins the Lagrange condition through the assumption that the non-linear objective $f$ and non-linear constraints $g_i$ behave linearly in a small neighborhood of $x^*$. 
<br>

But $p$ has dimension equal to only the number of active constraints at $x^*$. To be a dual variable at all, it must have dimension equal to the number of all primal constraints. We extend $p$ to $p^*$ by setting all the entries that do not correspond to the active constraints at $x^*$ to be zero. 

That is $p^*_i = \begin{cases} p_i \ \ \textrm{if} \ \  i \in I_{x^*} \\ 0   \ \ \textrm{if} \ \  i \notin I_{x^*} \end{cases}$. 

Now $A^Tp^*  = \sum_{i} p^*_ia_i = c$, so any feasibility condition in the dual, whether it be $A^Tp \leq c$, $A^Tp \geq c$, or $A^Tp = c$, is satisfied by $p^*$. 

Furthermore, the dual objective at $p^*$ agrees with the primal objective at $x^*$.

$$b^Tp^* = \sum_{i} b_ip_i^* = \sum_{i \in I_{x^*}} b_ip_i^* + \sum_{i \notin I_{x^*}} b_ip_i^* = \sum_{i \in I_{x^*}} a_i^Tx^*p_i^* = (\sum_{i \in I_{x^*}} p_ia_i^T)x^* = c^Tx^* $$

However, it still remains to be shown that $p^*$ is dual optimal. 

Whenever the primal objective and the dual objective agree on a value, the respective solutions must be primal-dual optimal. This is simply true by Weak Duality, which states that $b^Tp \leq c^Tx^*$ for every dual feasible $p$. So, $c^Tx^*$ is an upper bound on the value of any dual feasible solution. Since the dual is a maximization problem and $p^*$ attains this bound, $p^*$ is dual optimal.
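
The construction can be traced on a tiny LP. The sketch below assumes hypothetical data with every constraint written as $a_i^Tx \geq b_i$: $\min 2x_1 + 3x_2$ s.t. $x_1 + x_2 \geq 1$, $x_1 \geq 0$, $x_2 \geq 0$, with primal optimum $x^* = (1, 0)$. It finds the active set, solves $c = \sum_{i \in I_{x^*}} p_ia_i$ over the two active rows by Cramer's rule, and pads the result with zeros.

```python
rows = [(1.0, 1.0), (1.0, 0.0), (0.0, 1.0)]   # a_i for constraints a_i^T x >= b_i
b = [1.0, 0.0, 0.0]
c = (2.0, 3.0)
x_star = (1.0, 0.0)                            # primal optimum of this toy LP

def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

# Indices of the constraints that hold with equality at x*.
active = [
    i for i, (a, bi) in enumerate(zip(rows, b)) if abs(dot(a, x_star) - bi) < 1e-12
]

# Solve c = p_i * a_i + p_k * a_k over the two active rows by Cramer's rule.
i, k = active
det = rows[i][0] * rows[k][1] - rows[k][0] * rows[i][1]
p_i = (c[0] * rows[k][1] - rows[k][0] * c[1]) / det
p_k = (rows[i][0] * c[1] - c[0] * rows[i][1]) / det

# Extend to a full dual vector: zero on every inactive constraint.
p_star = [0.0] * len(rows)
p_star[i], p_star[k] = p_i, p_k

print(active, p_star)
print(dot(b, p_star), dot(c, x_star))
```

Here the multipliers come out non-negative, as Farkas' Lemma guarantees, and the dual objective $b^Tp^*$ matches the primal objective $c^Tx^*$.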


> Note: We have constructed the dual optimal by explicitly satisfying the KKT conditions in the linear case. So, this is the proof that a KKT pair is primal-dual optimal for linear programs.

# Theorems of the Alternative

As mentioned earlier, these are theorems that describe mutually exclusive scenarios that together comprise the entire outcome space. Formally, these are theorems of the form $(A \implies \neg B) \land (\neg A \implies B)$ where $A$ and $B$ are logical statements.

Note that theorems of equivalence (i.e. theorems of the form *'the following are equivalent - TFAE'*) can also be formulated as theorems of the alternative. To say that $A$ and $B$ are equivalent means $A \iff B$. But this breaks down as $(A \implies B) \land (B \implies A)$. Letting $\hat B = \neg B$ we can rewrite the above as $(A \implies \neg \hat B) \land (B \implies A)$. But, by taking the contrapositive, $B \implies A$ becomes $\neg A \implies \neg B$, which is to say $\neg A \implies \hat B$. In summary, we have shown that $A \iff B$ is equivalent to $(A \implies \neg \hat B) \land (\neg A \implies \hat B)$.

So, the class of theorems of the alternative is much broader than it appears and includes theorems of equivalence.

## Example of a Theorem of the Alternative

To see how we can prove a theorem of the alternative, it helps to state one. 

> **Theorem:** &nbsp; Exactly one of the following two statements must hold for a given matrix $A$.
&nbsp;
> 1. $\exists x \ne 0$ s.t. $Ax = 0$ and $x \geq 0$
> 2. $\exists p$ s.t. $p^TA > 0$
<br>

### Using a Separation Argument

At the heart of separation arguments lies this simple fact. 

> **Separating Hyperplane Theorem:** For any closed convex set $C$, if a point $\omega \notin C$ then there exists a hyperplane separating $\omega$ and $C$.
<br>

Farkas' Lemma, for instance, is proved by a separation argument that uses, as its convex set, the conic combination of the $a_i$'s. The conclusion is immediate since in Farkas' Lemma the first statement plainly says that a vector belongs to the convex set, and the second statement plainly says there exists a separating hyperplane between the two. 

This is the pattern all separation arguments must follow. In general, it may take a bit of work to define the problem-specific convex set and to show that the two statements are *really* talking about belonging to this set and separation from it. Once these three things are accomplished, however, the proof is complete. 

Using this idea, let's give a proof of the above theorem of the alternative using a separation argument.

#### Proof

First order of business is to come up with a convex set. 

Let's take $C = \{ z : z = Ay, \sum_i y_i = 1, y \geq 0 \}$ to be the convex hull of the columns of $A$.

The first statement in the theorem was that $\exists x \ne 0$ s.t. $Ax = 0$ and $x \geq 0$.

Since $x \ne 0$ and $x \geq 0$ we have $\sum_i x_i > 0$, so we can rescale $x$ as $y = \alpha x$ with $\alpha = 1 / \sum_i x_i$ so that $\sum_i y_i = 1$, while still $Ay = 0$ and $y \geq 0$.

So, the first statement is equivalent to saying the origin belongs to the convex hull $C$ (i.e. $0 \in C$).

The second statement was that $\exists p$ s.t. $p^TA > 0$. This is equivalent to saying that all the columns of $A$ lie to one side of the separating hyperplane introduced by $p$.

But all $z \in C$ are convex combinations of $A$'s columns. In particular since they're a convex combination they're also a conic combination, so all $z \in C$ also lie on the same side of the hyperplane. That is $p^Tz > 0 \ \ \forall z \in C$. 

But, of course, $p^T0 = 0$ (not $> 0$). So, according to the second statement, the origin is separated from $C$. 

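So the two statements are mutually exclusive. Conversely, if the first statement fails then $0 \notin C$, and the Separating Hyperplane Theorem provides a $p$ with $p^Tz > 0 \ \ \forall z \in C$, in particular for every column of $A$, which is the second statement. Hence exactly one of the two statements holds, which concludes the proof. 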

### Using Strong Duality

Strong duality isn't just a tool for applied science, it has important theoretical uses. For instance, now that we've proven it we can use Strong Duality, instead of a separation argument, to prove theorems of the alternative. 

Since it gives us feasibility of two different constraint sets, it makes sense to use duality to prove theorems of existence. 

Let's take the aforementioned theorem of the alternative for example...

#### Proof

To prove the theorem we need to show two things. First, we need to show $1 \implies \neg 2$, then we need to show $\neg 1 \implies 2$.

The $1 \implies \neg 2$ direction is simple. 

Suppose $\exists x \ne 0$ s.t. $Ax = 0$ and $x \geq 0$. 

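Then $\forall p \ \ (p^TA)x = p^T(Ax) = p^T0 = 0$. But if $p^TA > 0$ held for some $p$, then $x \geq 0$ and $x \ne 0$ would force $(p^TA)x > 0$, a contradiction. So no such $p$ exists.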

We tackle the $\neg 1 \implies 2$ direction using duality.

The strategy is to construct a linear program based on $\neg 1$ such that the feasibility of its dual implies $2$.

We can express $\neg 1$ as '$\forall x \ne 0$, either $Ax \ne 0$ or $x \not\geq 0$.' Equivalently, '$x \ne 0 \implies Ax \ne 0$ or $x \not\geq 0$.' Taking the contrapositive, $\neg 1$ becomes '$Ax = 0$ and  $x \geq 0 \implies x = 0$.' 

So, let's form the LP 

$
\begin{cases}
\max_x: \textbf{1}^Tx
\\
s.t.: \begin{aligned} &Ax = 0
\\ 
&x \geq 0
\end{aligned}
\end{cases}
$

Note that $x = 0$ is a feasible solution to the LP. Furthermore, assuming $\neg 1$ guarantees that $x = 0$ is the only feasible solution. Thus, the LP is feasible and bounded. 

By Strong Duality, its dual must also be feasible and bounded. 

The dual is...

$
\begin{cases}
\min_p: \textbf{0}^Tp
\\
s.t.: p^TA \geq \textbf{1}
\end{cases}
$

... and since it's feasible, $\exists p$ s.t. $p^TA \geq 1 > 0$ which demonstrates the truth of statement $2$.  
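
To see the theorem at work, the sketch below checks certificates for two hypothetical $1 \times 2$ matrices. For $A_1 = [1, -1]$ the vector $x = (1, 1)$ witnesses statement $1$, and for $A_2 = [1, 2]$ the scalar $p = 1$ witnesses statement $2$. The helper names are assumptions, and the code only verifies given certificates rather than searching for them.

```python
def satisfies_statement_1(A, x, tol=1e-12):
    """x != 0, x >= 0 and A x = 0."""
    Ax = [sum(a * xj for a, xj in zip(row, x)) for row in A]
    nonzero = any(abs(xj) > tol for xj in x)
    nonnegative = all(xj >= -tol for xj in x)
    in_null_space = all(abs(v) <= tol for v in Ax)
    return nonzero and nonnegative and in_null_space

def satisfies_statement_2(A, p):
    """Every entry of p^T A is strictly positive."""
    n_cols = len(A[0])
    return all(
        sum(p[i] * A[i][j] for i in range(len(A))) > 0 for j in range(n_cols)
    )

A1 = [[1.0, -1.0]]
A2 = [[1.0, 2.0]]

print(satisfies_statement_1(A1, [1.0, 1.0]))   # certificate for statement 1
print(satisfies_statement_2(A2, [1.0]))        # certificate for statement 2
```

For $A_1$ no $p$ can work, since $p$ and $-p$ cannot both be positive. For $A_2$ the only non-negative solution of $x_1 + 2x_2 = 0$ is $x = 0$.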

# Complementary Slackness

*Complementary Slackness* is a fundamental property that exists between any primal optimal solution and any dual optimal solution. 

In the preceding section on Strong Duality we constructed a dual optimal by setting those of its variables that corresponded to the inactive constraints of the primal optimal to be zero. 

This is true in general, for all primal-dual optimal pairs. 

If a primal constraint is loose at some primal optimal, then the corresponding variable in the dual optimal is zero. Likewise, if a dual constraint is loose at some dual optimal, then the corresponding primal variable is zero. 

Formally, this can be stated as

> **Complementary Slackness:** if $x$ is primal feasible and $p$ is dual feasible, then $x$ and $p$ are respectively optimal iff:
&nbsp;
> 1. $(b_i - \sum_{j} a_{ij}x_j)p_i = 0 \ \ \forall i$
> 2. $(\sum_{i} a_{ij}p_i - c_j)x_j = 0  \ \ \forall j$
<br>

Recall that the dual optimal constructed in the proof of Strong Duality was built precisely so as to satisfy Complementary Slackness. So, the fact that this generalizes to all primal-dual optima shouldn't surprise us. 

However, the above does not constitute a proof of Complementary Slackness, so let's offer one.

Take as a starting point the primal-dual pair

$
\textrm{P} \ \ 
\begin{cases}
\min_x: c^Tx
\\
s.t.: \begin{aligned} &Ax \geq b
\\ 
&x \geq 0
\end{aligned}
\end{cases}
$

$
\textrm{D} \ \ 
\begin{cases}
\max_p: b^Tp
\\
s.t.: \begin{aligned} &A^Tp \leq c
\\ 
&p \geq 0
\end{aligned}
\end{cases}
$

## Proof of Complementary Slackness

**Sufficiency $\impliedby$:**

Suppose both equalities hold.

Summing each over all $i$'s and $j$'s respectively and adding the results we get

$$\sum_i \left(b_i - \sum_j a_{ij}x_j \right)p_i + \sum_j \left( \sum_i a_{ij}p_i - c_j \right)x_j = 0$$

Which simplifies to 

$$\sum_i b_ip_i - \sum_i \sum_j a_{ij}x_jp_i + \sum_j \sum_i a_{ij}p_ix_j - \sum_j c_jx_j = 0$$

Or, in matrix-vector form

$$b^Tp - p^TAx + p^TAx - c^Tx = 0$$

The middle two terms cancel, and we get $b^Tp = c^Tx$. 

By Weak Duality, $x$ and $p$ are primal-dual optimal.

**Necessity $\implies$:**

Suppose $x$ and $p$ are primal-dual optimal. 

By Strong Duality $b^Tp = c^Tx$. 

In other words, $b^Tp - c^Tx = 0$. Adding and subtracting the terms canceled in the first part, we can bring the sum to the form

$$b^Tp - p^TAx + p^TAx - c^Tx = 0$$

Which is, once again, the same as

$$\sum_i \left(b_i - \sum_j a_{ij}x_j \right)p_i + \sum_j \left( \sum_i a_{ij}p_i - c_j \right)x_j = 0$$

But $p$ is dual feasible, so $p_i \geq 0 \ \ \forall i$. And since $x$ is primal feasible, $Ax \geq b$ implies $(b_i - \sum_j a_{ij}x_j) \leq 0 \ \ \forall i$. 

Similarly, $x_j \geq 0 \ \ \forall j$ and, since $p$ is dual feasible, $A^Tp \leq c$ implies $( \sum_i a_{ij}p_i - c_j) \leq 0 \ \ \forall j$. 

So the above expression is a sum of all non-positive terms that adds up to zero. This can only happen if each term is equal to zero. 
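
Both conditions are easy to verify numerically. The sketch below assumes the hypothetical LP $\min 2x_1 + 3x_2$ s.t. $x_1 + x_2 \geq 1$, $x \geq 0$, whose primal and dual optima are $x = (1, 0)$ and $p = 2$. The function name and tolerance are assumptions.

```python
A = [[1.0, 1.0]]   # hypothetical LP data: min c^T x  s.t.  A x >= b, x >= 0
b = [1.0]
c = [2.0, 3.0]

def complementary_slackness_holds(x, p, tol=1e-12):
    m, n = len(b), len(c)
    # Condition 1: (b_i - sum_j a_ij x_j) * p_i = 0 for every row i.
    row_terms = [
        (b[i] - sum(A[i][j] * x[j] for j in range(n))) * p[i] for i in range(m)
    ]
    # Condition 2: (sum_i a_ij p_i - c_j) * x_j = 0 for every column j.
    col_terms = [
        (sum(A[i][j] * p[i] for i in range(m)) - c[j]) * x[j] for j in range(n)
    ]
    return all(abs(t) <= tol for t in row_terms + col_terms)

print(complementary_slackness_holds([1.0, 0.0], [2.0]))   # optimal pair
print(complementary_slackness_holds([0.0, 1.0], [2.0]))   # feasible, x not optimal
```

The second pair is feasible but fails condition 2, consistent with the fact that its primal value $3$ exceeds the dual value $2$.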