CS524: Introduction to Optimization Lecture 30
======================================

## Michael Ferris<br> Computer Sciences Department <br> University of Wisconsin-Madison 

## November 13, 2023
--------------


In [1]:
%load_ext gams.magic
m = gams.exchange_container

# Classifiers

Consider now the situation in which we have two sets of points in  $\mathbb{R}^n$ which are labeled as $P_+$ and $P_-$. Denoting any one of these points by $a$, we would like to construct a function $\phi$ so that 
$$\phi(a)>0 \text{ if } a \in P_+$$ and 
$$\phi(a)<0 \text{ if } a \in P_-$$ 

The function $\phi$ is known as a <font color=red>classifier</font>. Given a new point $a$, we can use $\phi$ to classify $a$ as belonging to either $P_+$ (if $\phi(a)>0$) or $P_-$ (if $\phi(a)<0$). An example of such a problem constructs a linear function 
$$\phi(a)=a^Tw-\gamma$$ 
to classify fine needle aspirates of tumors as either malignant or benign. We give a brief description of <font color=red>support vector machines</font>, a modern tool for classification.

![image1.png](image1.png)

The figure above shows a separating hyperplane $w^Tx=\gamma$ for two point sets, obtained by finding a pair $(w,\gamma)$ that is feasible for (1), together with the bounding hyperplanes $w^Tx=\gamma \pm 1$.

We start by describing the construction of a linear classifier, which has the form $\phi(a)=w^Ta-\gamma$, where $w \in \mathbb{R}^n$ and $\gamma \in\mathbb{R}$. Ideally, the hyperplane defined by $\phi(a)=0$ should completely separate the two sets $P_+$ and $P_-$, so that $\phi(a)>0$ for $a \in P_+$ and $\phi(a)<0$ for $a \in P_-$. If such a pair $(w,\gamma)$ exists, then by rescaling $(w,\gamma)$ to

$$
\frac{(w,\gamma)}{\min_{a\in P_+\cup P_-} |w^Ta-\gamma|},
$$

we have that
$$
a \in P_+ \Rightarrow \phi(a) = w^Ta-\gamma \geq 1, \tag{1a}
$$
$$
a \in P_- \Rightarrow \phi(a) = w^Ta-\gamma \leq -1. \tag{1b}
$$

To express these conditions as a system of linear inequalities, we denote the points by $a_i$ and let $m=|P_+|+|P_-|$, where $|P_+|$ and $|P_-|$ denote the number of points in $P_+$ and $P_-$ respectively. We then define labels $y_i$ for each point $a_i$ as follows:

$$
y_{i} =
  \begin{cases}
    1       & \quad \text{if } a_i \in P_+;\\
    -1  & \quad \text{if } a_i \in P_-;
  \end{cases}
$$

Conditions (1) can thus be written succinctly as follows:

$$
y_i(a_i^Tw-\gamma)\geq 1 \tag{2}
$$
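As a minimal sketch of the rescaling argument and of condition (2), the following pure-Python snippet takes a hand-picked separating pair $(w,\gamma)$ for a hypothetical toy data set, divides it by $\min_i |w^Ta_i-\gamma|$, and checks that $y_i(a_i^Tw-\gamma)\geq 1$ holds for every point. The data, the helper `phi`, and the initial $(w,\gamma)$ are illustrative assumptions, not part of the lecture's model.

```python
def phi(w, gamma, a):
    """Linear classifier phi(a) = a^T w - gamma."""
    return sum(wi * ai for wi, ai in zip(w, a)) - gamma

# Hypothetical toy data: the first two points are in P_+, the last two in P_-.
points = [(2.0, 2.0), (4.0, 1.0), (0.0, 0.0), (-1.0, 1.0)]
labels = [1, 1, -1, -1]

# A strictly separating pair (w, gamma), chosen by hand for this sketch.
w, gamma = (1.0, 1.0), 2.0

# Rescale by the smallest absolute classifier value over all points.
scale = min(abs(phi(w, gamma, a)) for a in points)
w_scaled = tuple(wi / scale for wi in w)
gamma_scaled = gamma / scale

# Condition (2): y_i (a_i^T w - gamma) >= 1 for every point.
margins = [y * phi(w_scaled, gamma_scaled, a) for a, y in zip(points, labels)]
satisfies_condition_2 = all(value >= 1.0 for value in margins)
```

After rescaling, at least one point attains the value 1 exactly, so it lies on one of the bounding hyperplanes.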

# Margin maximization

If it is possible to separate the two sets of points, then it is desirable to maximize the distance (margin) between the bounding hyperplanes, which is depicted above by the Euclidean distance between the two dotted lines. It can be shown [1] that this separation margin is

$$
\frac {2}{\|w\|'},
$$

where $\|w\|'$ denotes the dual norm. If we take the norm to be the Euclidean ($\ell_2$) norm, which is self-dual, then maximizing $2/\|w\|'$ is equivalent to minimizing $\|w\|$, or equivalently $\|w\|^2 = w^Tw$.
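For the Euclidean case, the margin formula can be checked directly. The short derivation below is a sketch for the $\ell_2$ norm only; the point $x_0$ and the step length $t$ are names introduced here for illustration. Take any $x_0$ on the lower bounding hyperplane, so that $w^Tx_0=\gamma-1$, and move a distance $t$ along the unit normal $w/\|w\|_2$:

$$
w^T\left(x_0 + t\,\frac{w}{\|w\|_2}\right) = \gamma - 1 + t\,\|w\|_2 .
$$

This equals $\gamma+1$, the value on the upper bounding hyperplane, exactly when

$$
t = \frac{2}{\|w\|_2},
$$

which is the distance between the two bounding hyperplanes.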

Hence, we can solve the following quadratic program to find the separating hyperplane (defined by $w$ and $\gamma$, and hence the classifier function $\phi$) with maximum (Euclidean) margin:

$$
\min_{w,\gamma} \frac{1}{2}w^Tw \text{ s.t. } y_i(a_i^Tw-\gamma)\geq 1, i=1,\ldots,m \tag{3}
$$

The <font color=red>support vectors</font> are the points $a_j$ that lie on the bounding hyperplanes and for which the corresponding Lagrange multipliers of the constraints of (3) are positive. These correspond to active constraints in (3).

In practice, the two sets are usually not linearly separable, so no separating hyperplane exists. In such cases, the quadratic program (3) is infeasible, but we can define other problems that identify separating hyperplanes "as nearly as practicable," in some sense.

![image2.png](image2.png)

The figure above shows two linearly nonseparable point sets and the hyperplane obtained by solving a problem of the form (5). The bounding hyperplanes are also shown. In this formulation, the support vectors are the points from each set $P_-$ and $P_+$ that lie on the wrong side of their respective bounding hyperplanes.

We can define a vector $\delta$ whose components indicate the amount by which the constraints (2) are violated, as follows:

$$
y_i(a_i^Tw-\gamma)+\delta_i \geq 1, \delta_i \geq 0. \tag{4}
$$

We could measure the total violation by summing the components of $\delta$, and add some multiple of this quantity to the objective of (3), to obtain

$$
\min_{w,\gamma,\delta} \frac{1}{2}w^Tw+\nu \displaystyle\sum_{i} \delta_i \text{ s.t. } y_i(a_i^Tw-\gamma)+\delta_i \geq 1, \delta_i \geq 0, \tag{5}
$$

where $\nu$ is some positive parameter. Note that this is equivalent to:

$$
\min_{w,\gamma} \frac{1}{2}w^Tw+\nu \displaystyle\sum_{i} \max(1-y_{i}(a_i^Tw-\gamma),0)
$$
where the latter term is a sum of loss functions (and the particular one used here is called the hinge loss $\ell_H(r) = \max(0,1-y r)$ where $y = \pm 1$).
This problem (5) is referred to as a (linear) <font color=red>support vector machine</font> [6, 2, 4].
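The equivalence between (5) and the hinge-loss form can be illustrated numerically: for fixed $(w,\gamma)$, the smallest feasible $\delta_i$ in (4) is $\max(1-y_i(a_i^Tw-\gamma),0)$. The sketch below evaluates these violations and the objective of (5) for hypothetical data and a hand-picked $(w,\gamma)$ and $\nu$; it evaluates the objective only and does not solve the quadratic program.

```python
def dot(u, v):
    """Inner product of two equal-length sequences."""
    return sum(ui * vi for ui, vi in zip(u, v))

def svm_objective(w, gamma, points, labels, nu):
    """Return the objective of (5) and the minimal violations delta for fixed (w, gamma)."""
    delta = [max(1.0 - y * (dot(a, w) - gamma), 0.0)
             for a, y in zip(points, labels)]
    return 0.5 * dot(w, w) + nu * sum(delta), delta

# Hypothetical data: the third point is labeled -1 but lies on the positive side.
points = [(2.0, 2.0), (0.0, 0.0), (1.5, 1.5)]
labels = [1, -1, -1]

# Hand-picked classifier and penalty parameter (assumptions for this sketch).
w, gamma, nu = (0.5, 0.5), 1.0, 2.0

objective, delta = svm_objective(w, gamma, points, labels, nu)
```

Only the misclassified third point contributes a positive $\delta_i$, so the penalty term is $\nu$ times that single violation.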


## Nonlinear separator

Instead of a linear separator, we can transform the data $a_i$ by some nonlinear transformation $\psi$ and then perform support vector machine classification on the vectors $\psi(a_i)$ instead.  The nonlinear <font color=red>classification function</font> operates as:
  \begin{align*} \phi (\psi(a)) > 0 & \text{ implies } a \in P_+, \\
    \phi (\psi(a)) < 0 & \text{ implies } a \in P_-
    \end{align*}

  The separating hypersurface is then $\{x: w^T \psi(x) - \gamma  = 0\}$ and the optimization problem is:
   $$
    \min_{w,\gamma} \nu \sum_{j=1}^m \max ( 1 - y_j (w^T\psi(a_j)-\gamma),0) + \frac{1}{2} w^Tw
  $$
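To make the idea concrete, here is a small sketch with a hypothetical feature map $\psi(a)=(a_1^2,a_2^2)$ on $\mathbb{R}^2$. Points inside the unit circle cannot be separated from points outside it by a line in the original coordinates, but in the transformed coordinates the hand-picked pair $w=(-1,-1)$, $\gamma=-1$ gives $w^T\psi(a)-\gamma = 1-a_1^2-a_2^2$, which is positive exactly inside the circle. The map, the data, and $(w,\gamma)$ are illustrative assumptions rather than the output of an optimization.

```python
def psi(a):
    """Hypothetical nonlinear feature map: square each coordinate."""
    return (a[0] ** 2, a[1] ** 2)

def classify(a, w, gamma):
    """Return +1 or -1 from the sign of w^T psi(a) - gamma (0 is reported as -1)."""
    features = psi(a)
    score = sum(wi * fi for wi, fi in zip(w, features)) - gamma
    return 1 if score > 0 else -1

# Hand-picked separator in feature space: score = 1 - a_1^2 - a_2^2.
w, gamma = (-1.0, -1.0), -1.0

inside = [(0.5, 0.0), (0.0, -0.3), (-0.2, 0.2)]   # taken as P_+
outside = [(2.0, 0.0), (0.0, -1.5), (-1.0, 1.0)]  # taken as P_-

inside_labels = [classify(a, w, gamma) for a in inside]
outside_labels = [classify(a, w, gamma) for a in outside]
```

The separating hypersurface $\{x: w^T\psi(x)-\gamma=0\}$ is the unit circle in the original space, although it is a hyperplane in feature space.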

# Duality

The dual problem (exercise) is:
  $$ \min_{\alpha \in \mathbb{R}^m} \frac{1}{2} \alpha^T Q \alpha - \sum_i \alpha_i \text{ s.t. } 0 \leq \alpha_i \leq \nu, \; y^T \alpha = 0 $$
  where
  $$ Q_{ij} = y_i y_j \psi(a_i)^T \psi(a_j) $$
  The solution $w$ of the primal can be recovered as
  $$ w = \sum_{j=1}^m y_j \alpha_j \psi(a_j) $$
  and $\gamma$ is the multiplier on the constraint $y^T \alpha = 0$.

  Support vectors are those points $a_i$ whose multipliers $\alpha_i$ are nonzero.
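A tiny worked instance may help here. The sketch below uses two hypothetical points with $\psi$ taken as the identity, builds $Q$, and recovers $w$ from a dual solution. The value $\alpha=(1/2,1/2)$ is derived by hand rather than computed by a solver: $y^T\alpha=0$ forces $\alpha_1=\alpha_2=t$, the dual objective becomes $2t^2-2t$, and its minimizer $t=1/2$ is feasible assuming $\nu\geq 1/2$. As an alternative to reading $\gamma$ off as a multiplier, the sketch recovers it from the active constraint of a support vector with $0<\alpha_i<\nu$, which is an assumption of this example.

```python
def dot(u, v):
    """Inner product of two equal-length sequences."""
    return sum(ui * vi for ui, vi in zip(u, v))

def psi(a):
    """Identity feature map, so this example reduces to the linear case."""
    return a

# Hypothetical data: one point in P_+ and one in P_-.
points = [(1.0, 0.0), (-1.0, 0.0)]
labels = [1, -1]
features = [psi(a) for a in points]
n_pts = len(points)

# Q_ij = y_i y_j psi(a_i)^T psi(a_j)
Q = [[labels[i] * labels[j] * dot(features[i], features[j])
      for j in range(n_pts)] for i in range(n_pts)]

# Hand-derived dual solution (valid when nu >= 1/2).
alpha = [0.5, 0.5]
dual_objective = 0.5 * sum(alpha[i] * Q[i][j] * alpha[j]
                           for i in range(n_pts) for j in range(n_pts)) - sum(alpha)

# Recover w = sum_j y_j alpha_j psi(a_j).
w = [sum(labels[j] * alpha[j] * features[j][k] for j in range(n_pts))
     for k in range(len(features[0]))]

# Support vectors have nonzero alpha; use one to recover gamma from
# the active constraint y_i (w^T psi(a_i) - gamma) = 1.
support = [j for j in range(n_pts) if alpha[j] > 0]
i = support[0]
gamma = dot(w, features[i]) - labels[i]

primal_objective = 0.5 * dot(w, w)
```

In this instance the primal objective equals the negative of the dual objective, as expected since the dual is written as a minimization.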

# Kernels

The dual problem involves $\psi$ only through the inner products $\psi(a_i)^T\psi(a_j)$ that appear in $Q$, so it can be specified directly by a <font color=red>kernel function</font> $K$, with $K(a_i,a_j)$ replacing $\psi(a_i)^T\psi(a_j)$.
  The classifier function then operates as:
  \begin{align*} \sum_{j=1}^m y_j \alpha_j \psi(a_j)^T \psi(a) - \gamma = w^T\psi(a) - \gamma > 0
    & \text{ implies } a \in P_+, \\
    \sum_{j=1}^m y_j \alpha_j \psi(a_j)^T \psi(a) - \gamma < 0 & \text{ implies } a \in P_-
    \end{align*}
    which can be evaluated directly via the kernel function as:
    \begin{align*} \sum_{j=1}^m y_j \alpha_j K(a_j, a) - \gamma > 0 & \text{ implies } a \in P_+, \\
      \sum_{j=1}^m y_j \alpha_j K(a_j,a) - \gamma < 0 & \text{ implies } a \in P_-
    \end{align*}
    
A popular choice of kernel is the Gaussian kernel:
    $$ K(a_j,a_i) := \exp \left( -\frac{1}{2 \sigma} \left\Vert a_j - a_i\right\Vert^2 \right) $$

with related Gram matrix $Q$ given by
    $$ Q_{ij} = y_i y_j K(a_i, a_j) $$

<font color=red>This can be used in the model to determine $w$ (essentially $\alpha$) and $\gamma$ for the classifier.</font>
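The kernel formulas above can be sketched in a few lines of Python. The snippet follows the parametrization used in these notes, with $2\sigma$ in the denominator (other references often write $2\sigma^2$). The multipliers `alpha` and the offset `gamma` are placeholder values chosen for illustration, not the result of solving the dual, and the function names are hypothetical.

```python
import math

def gaussian_kernel(a, b, sigma):
    """K(a, b) = exp(-||a - b||^2 / (2 sigma)), as parametrized in these notes."""
    sq_dist = sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return math.exp(-sq_dist / (2.0 * sigma))

def gram_matrix(points, labels, sigma):
    """Q_ij = y_i y_j K(a_i, a_j)."""
    n_pts = len(points)
    return [[labels[i] * labels[j] * gaussian_kernel(points[i], points[j], sigma)
             for j in range(n_pts)] for i in range(n_pts)]

def classify(a, points, labels, alpha, gamma, sigma):
    """Sign of sum_j y_j alpha_j K(a_j, a) - gamma (a score of 0 is reported as -1)."""
    score = sum(y * al * gaussian_kernel(p, a, sigma)
                for p, y, al in zip(points, labels, alpha)) - gamma
    return 1 if score > 0 else -1

# Hypothetical training data and placeholder dual values.
points = [(1.0, 0.0), (-1.0, 0.0)]
labels = [1, -1]
alpha, gamma, sigma = [0.5, 0.5], 0.0, 1.0

Q = gram_matrix(points, labels, sigma)
label_near_first = classify((0.8, 0.1), points, labels, alpha, gamma, sigma)
label_near_second = classify((-2.0, 0.0), points, labels, alpha, gamma, sigma)
```

Note that the classifier never forms $\psi$ or $w$ explicitly; it only evaluates $K$ between the new point and the training points.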

A function $K(x,y)$ is a valid kernel if it corresponds to an inner product in some (perhaps infinite dimensional) feature space.

General condition: construct the Gram matrix $K(a_i,a_j)$ and check that it is positive semidefinite (this is Mercer's condition).
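A rough numerical version of this check is sketched below. It builds the Gram matrix for a hypothetical sample and tests positive semidefiniteness by attempting a Cholesky-style factorization in pure Python. This is an illustrative sketch under stated assumptions: the tolerance is arbitrary, in practice one would more likely compute eigenvalues with a linear algebra library, and passing the test on one sample is only a necessary condition, since the matrix must be positive semidefinite for every finite set of points. The function $-\|x-y\|^2$ is included as an example that fails the test.

```python
import math

def gaussian_kernel(a, b, sigma=1.0):
    """K(a, b) = exp(-||a - b||^2 / (2 sigma)), as parametrized in these notes."""
    sq_dist = sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return math.exp(-sq_dist / (2.0 * sigma))

def negative_sq_distance(a, b):
    """Not a valid kernel: its Gram matrix has a zero diagonal."""
    return -sum((ai - bi) ** 2 for ai, bi in zip(a, b))

def is_positive_semidefinite(K, tol=1e-10):
    """Try to factor the symmetric matrix K as L L^T; report failure on a negative pivot."""
    n = len(K)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = K[i][j] - sum(L[i][k] * L[j][k] for k in range(j))
            if i == j:
                if s < -tol:
                    return False          # negative pivot
                L[i][i] = math.sqrt(max(s, 0.0))
            elif L[j][j] > tol:
                L[i][j] = s / L[j][j]
            elif abs(s) > tol:
                return False              # zero pivot with a nonzero entry below it
    return True

# Hypothetical sample of points.
points = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]

gram_gaussian = [[gaussian_kernel(a, b) for b in points] for a in points]
gram_invalid = [[negative_sq_distance(a, b) for b in points] for a in points]

gaussian_ok = is_positive_semidefinite(gram_gaussian)
invalid_ok = is_positive_semidefinite(gram_invalid)
```

Here the Gaussian Gram matrix passes, while the negative squared distance fails on its first off-diagonal entry.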

In [2]:
%gams_cleanup --closedown