# SVM

This file covers the use of support vector machines (SVMs) for classification, i.e. problems with discrete labels.

---

## Intro

Support vector machines (SVMs) are a set of supervised learning methods used for classification, regression and outlier detection. This file only covers SVMs for classification.

## Math Principle

Here is the recommended [video](https://www.youtube.com/watch?v=_PwhiWxHK8o)

### Important Concepts

1. Decision Boundary
2. Widest Gutter
3. Kernel Function

### Decision Boundary

Here is a simple illustration of the two-dimensional case.

![svm1](../../../src/svm.png)

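The vector $w$ is perpendicular to the vector $c$, and we define the line $l$ as the set of points $v$ satisfying
$$\langle v,w\rangle = -b$$
A point $v$ is therefore classified as positive when it lies on or beyond $l$ in the direction of $w$, which gives the decision rule
$$\langle v,w\rangle +b \geq 0$$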
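The decision rule can be sketched in a few lines of plain Python. The values of `w` and `b` below are hypothetical and chosen only for illustration; `dot` and `classify` are helper names introduced here, not part of any library.

```python
def dot(a, b):
    """Inner product of two equal-length sequences."""
    return sum(x * y for x, y in zip(a, b))


def classify(v, w, b):
    """Return +1 if <v, w> + b >= 0, otherwise -1."""
    return 1 if dot(v, w) + b >= 0 else -1


# Hypothetical boundary: the line x + y = 3, i.e. w = (1, 1) and b = -3.
w, b = (1.0, 1.0), -3.0

print(classify((3.0, 2.0), w, b))  # 3 + 2 - 3 = 2, on the positive side
print(classify((0.5, 1.0), w, b))  # 0.5 + 1 - 3 = -1.5, on the negative side
```

Points exactly on the line satisfy the rule with equality and are assigned to the positive class in this sketch.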

### Widest Gutter

The basic idea for separating the two classes is to make the gutter between them as wide as possible. Since $l$ is the central line, we require
$$\langle x_+,w\rangle + b \geq 1 \ \text{for all positive training samples}\ x_+, \quad \langle x_-, w\rangle + b \leq -1 \ \text{for all negative training samples}\ x_-$$

For convenience, we define the symbol $sgn(\cdot)$ as the class label of a sample: $sgn(x_+)=1$, $sgn(x_-)=-1$. The two constraints above can then be combined into
$$sgn(x_i)(\langle x_i, w\rangle +b)\geq 1, \text{for all}\ x_i \in \text{training set}.$$

For samples $x_i$ that lie exactly on the edge of the gutter, the constraint holds with equality: $sgn(x_i)(\langle x_i, w \rangle +b)-1=0$.

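Taking $x_+$ and $x_-$ on the two edges of the gutter, the width of the gutter is the projection of $x_+-x_-$ onto the unit normal, $\langle(x_+-x_-),\frac{w}{||w||}\rangle$. On the edges we have $\langle x_+,w\rangle=1-b$ and $\langle x_-,w\rangle=-1-b$, so the width equals $\frac{2}{||w||}$.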
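As a quick numeric check of the width formula, the sketch below picks a hypothetical $w$ and $b$ plus one point on each gutter edge, and compares the projection with $\frac{2}{||w||}$. All numbers are made up for illustration.

```python
import math

# Hypothetical boundary with ||w|| = 5.
w, b = (3.0, 4.0), -2.0
norm_w = math.hypot(*w)

# One point on each gutter edge: <x, w> + b equals +1 and -1 respectively.
x_plus = (1.0, 0.0)    # 3*1 + 4*0 - 2 = +1
x_minus = (-1.0, 1.0)  # 3*(-1) + 4*1 - 2 = -1

# Project the difference vector onto the unit normal w / ||w||.
diff = (x_plus[0] - x_minus[0], x_plus[1] - x_minus[1])
width = (diff[0] * w[0] + diff[1] * w[1]) / norm_w

print(width, 2 / norm_w)  # both values should agree
```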

To maximize the width, we need to minimize $||w||$, which is equivalent to minimizing $\frac{1}{2}||w||^2$ (the square and the factor $\frac{1}{2}$ are just for mathematical convenience). We then apply Lagrange multipliers $\lambda_i \geq 0$ to find the constrained extremum.

$$L=\frac{1}{2} ||w||^2 - \sum \lambda_i[sgn(x_i)(\langle w,x_i\rangle+b)-1]$$

We have

$$
\left\lbrace\begin{aligned}
\frac{\partial L}{\partial w}=0\\
\frac{\partial L}{\partial b}=0
\end{aligned}
\right.
\Rightarrow
\left\lbrace\begin{aligned}
w=\sum\lambda_i sgn(x_i)x_i\\
\sum\lambda_i sgn(x_i)=0
\end{aligned}
\right.
$$

Plugging these results back into the original expression, we have
$$\displaystyle L=\sum_i\lambda_i -\frac{1}{2}\sum_i\sum_j \lambda_i\lambda_j sgn(x_i) sgn(x_j)\langle x_i,x_j\rangle$$

The decision rule for an unknown sample $u$ becomes

$$\left\langle\sum\lambda_i sgn(x_i)x_i, u\right\rangle+b \geq 0 \Rightarrow u \ \text{is classified as positive}$$
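The relations $w=\sum\lambda_i sgn(x_i)x_i$ and $\sum\lambda_i sgn(x_i)=0$ can be checked numerically. The sketch below assumes scikit-learn's `SVC` with a linear kernel and a very large `C` as a stand-in for the hard-margin problem; the toy data is hypothetical. In scikit-learn, `dual_coef_` holds the products $\lambda_i\, sgn(x_i)$ for the support vectors only.

```python
import numpy as np
from sklearn.svm import SVC

# Hypothetical linearly separable toy data.
X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0],
              [3.0, 3.0], [4.0, 3.0], [3.0, 4.0]])
y = np.array([-1, -1, -1, 1, 1, 1])

# A very large C approximates the hard-margin SVM derived above.
clf = SVC(kernel="linear", C=1e6).fit(X, y)

# dual_coef_ stores lambda_i * sgn(x_i), one entry per support vector.
signed_lambda = clf.dual_coef_[0]
w = signed_lambda @ clf.support_vectors_  # w = sum_i lambda_i sgn(x_i) x_i
b = clf.intercept_[0]

print(np.allclose(w, clf.coef_[0]))          # w matches the fitted weights
print(abs(signed_lambda.sum()) < 1e-6)       # sum_i lambda_i sgn(x_i) = 0

# Apply the decision rule to an unseen sample u.
u = np.array([2.5, 3.5])
print(np.sign(w @ u + b), clf.predict([u])[0])
```

Only the support vectors carry a non-zero $\lambda_i$, which is why summing over `support_vectors_` is enough to recover $w$.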

### Kernel Function

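For problems that are not linearly separable, we can use a transformation $\phi$ s.t. $\lbrace \phi(x_+)\rbrace$ and $\lbrace \phi(x_-)\rbrace$ are separable. Both the optimization and the decision rule then depend on the data only through inner products such as $\langle \phi(x_i),\phi(x_j)\rangle$ and $\langle \phi(x_i),\phi(u)\rangle$.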

For convenience, we define the kernel function $K(x_i,x_j)=\langle\phi(x_i),\phi(x_j)\rangle$, so that $\phi$ never has to be computed explicitly.
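A small worked example may help here. For two-dimensional inputs, the degree-2 polynomial kernel $\langle x,y\rangle^2$ corresponds to the explicit map $\phi(x)=(x_1^2,\sqrt{2}x_1x_2,x_2^2)$. The sketch below, with hypothetical helper names `phi` and `kernel`, checks that both routes give the same number.

```python
import math


def dot(a, b):
    """Inner product of two equal-length sequences."""
    return sum(p * q for p, q in zip(a, b))


def phi(x):
    """Explicit feature map for the degree-2 polynomial kernel in 2D."""
    x1, x2 = x
    return (x1 * x1, math.sqrt(2) * x1 * x2, x2 * x2)


def kernel(x, y):
    """Same inner product, computed without building phi(x) or phi(y)."""
    return dot(x, y) ** 2


x, y = (1.0, 2.0), (3.0, 4.0)
print(kernel(x, y))         # computed in the original 2D space
print(dot(phi(x), phi(y)))  # computed in the 3D feature space
```

The kernel route needs only one inner product in the original space, which is the point of the kernel trick when $\phi$ maps into a very high-dimensional space.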

---

## Advantage & Disadvantage

- Advantage
  - Effective in high dimensional spaces.
  - Still effective in cases where number of dimensions is greater than the number of samples.
  - Uses a subset of training points in the decision function (called support vectors), so it is also memory efficient.
  - Versatile: different Kernel functions can be specified for the decision function. Common kernels are provided, but it is also possible to specify custom kernels.
- Disadvantage
  - If the number of features is much greater than the number of samples, avoiding over-fitting when choosing the kernel function and regularization term is crucial.
  - SVMs do not directly provide probability estimates; these are calculated using an expensive five-fold cross-validation.

## Realization

There are three main SVM classifiers in scikit-learn: `SVC`, `NuSVC` and `LinearSVC`. `LinearSVC` is a faster implementation restricted to a linear kernel, so it does not accept a `kernel` parameter.
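As a minimal sketch of how the three classifiers are used, the example below fits each of them on a synthetic two-class dataset from `make_blobs`. The dataset and the hyperparameter values are assumptions chosen for illustration, not recommendations.

```python
from sklearn.datasets import make_blobs
from sklearn.svm import SVC, NuSVC, LinearSVC

# Synthetic, well-separated two-class data (hypothetical example data).
X, y = make_blobs(n_samples=100, centers=2, cluster_std=0.6, random_state=0)

models = {
    "SVC": SVC(kernel="rbf", C=1.0),
    "NuSVC": NuSVC(kernel="rbf", nu=0.5),
    "LinearSVC": LinearSVC(C=1.0),  # linear only, no kernel argument
}

scores = {}
for name, model in models.items():
    model.fit(X, y)
    scores[name] = model.score(X, y)  # accuracy on the training data
    print(name, scores[name])
```

Training accuracy is used here only to show the shared `fit` / `score` interface; a real evaluation would use held-out data.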

### SVC

Compared to the basic SVM, `SVC` introduces the penalty parameter $C$ and slack variables $\xi_i$, which allow some samples to violate the margin. The optimization problem becomes
$$\displaystyle \mathop{min}\limits_{w, b, \xi} \frac{1}{2}\lVert w \rVert^2 + C \sum_i \xi_i$$

subject to
$$sgn(x_i)(\langle w, \phi(x_i)\rangle+b) \geq 1 - \xi_i, \ \xi_i \geq 0$$

$\xi_i$ measures how far $x_i$ lies from its correct margin boundary. When $x_i$ is misclassified or falls inside the margin, the chosen hypersurface is penalized according to $\xi_i$ and $C$.
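At the optimum, each slack variable takes the smallest value allowed by its constraint, $\xi_i=\max(0,\ 1-sgn(x_i)(\langle w,\phi(x_i)\rangle+b))$. The sketch below evaluates this for a few hypothetical (label, score) pairs, where the score stands for $\langle w,\phi(x_i)\rangle+b$; the helper name `slack` and the numbers are made up for illustration.

```python
def slack(label, score):
    """Smallest xi satisfying label * score >= 1 - xi and xi >= 0."""
    return max(0.0, 1.0 - label * score)


# Hypothetical (label, score) pairs, score = <w, phi(x_i)> + b.
samples = [
    (+1, 2.0),   # correct and beyond the margin: no penalty
    (+1, 0.4),   # correct but inside the margin: small penalty
    (-1, 0.5),   # misclassified: larger penalty
    (-1, -1.0),  # exactly on its margin boundary: no penalty
]

xis = [slack(label, score) for label, score in samples]
C = 1.0
penalty = C * sum(xis)  # the C * sum_i xi_i term of the objective

print(xis)
print(penalty)
```

A larger `C` makes each violation more expensive, pushing the solution towards fewer margin errors at the cost of a narrower gutter.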

Similarly, by introducing Lagrange multipliers, we obtain the dual problem

$$\mathop{min}\limits_{\alpha}\frac{1}{2}\alpha^TQ\alpha-e^T\alpha\ \text{subject to} \ \sum_i sgn(x_i)\alpha_i=0,\ 0\leq \alpha_i \leq C$$

$Q$ is an $n \times n$ positive semidefinite matrix with $Q_{ij}=sgn(x_i)sgn(x_j)K(x_i,x_j)$, and $e$ is the vector of all ones. The $\alpha_i$ are called the dual coefficients.

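Once the optimization problem is solved, the decision function for a sample $u$ is
$$\sum_i sgn(x_i)\alpha_iK(x_i,u)+b$$
and the predicted class is given by its sign. Only the support vectors, i.e. the samples with $\alpha_i > 0$, contribute to the sum.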
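The decision function can be reproduced by hand from a fitted model. The sketch below assumes scikit-learn's `SVC` with an RBF kernel $K(x,u)=\exp(-\gamma\lVert x-u\rVert^2)$ and an explicitly chosen `gamma`; the dataset and the query point `u` are hypothetical. In scikit-learn, `dual_coef_` stores the products $sgn(x_i)\alpha_i$ for the support vectors.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Synthetic two-class data (hypothetical example data).
X, y = make_blobs(n_samples=60, centers=2, cluster_std=1.5, random_state=1)

gamma = 0.5  # fixed explicitly so the kernel can be recomputed by hand
clf = SVC(kernel="rbf", C=1.0, gamma=gamma).fit(X, y)


def rbf(a, b):
    """RBF kernel K(a, b) = exp(-gamma * ||a - b||^2)."""
    return np.exp(-gamma * np.sum((a - b) ** 2))


u = np.array([0.0, 0.0])  # hypothetical query point

# sum_i sgn(x_i) * alpha_i * K(x_i, u) + b over the support vectors
manual = sum(
    coef * rbf(sv, u)
    for coef, sv in zip(clf.dual_coef_[0], clf.support_vectors_)
) + clf.intercept_[0]

print(manual, clf.decision_function([u])[0])  # the two values should agree
```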

### NuSVC

`NuSVC` is a reparameterization of `SVC` and is therefore mathematically equivalent.

We introduce a new parameter $\nu$ (instead of $C$) which controls the number of support vectors and margin errors: $\nu \in (0,1]$ is an upper bound on the fraction of margin errors and a lower bound on the fraction of support vectors. A margin error corresponds to a sample that lies on the wrong side of its margin boundary: it is either misclassified, or it is correctly classified but does not lie beyond the margin.
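The lower bound on the fraction of support vectors can be observed empirically. The sketch below fits scikit-learn's `NuSVC` for a few values of `nu` on a balanced synthetic dataset and records the fraction of training samples that end up as support vectors. The dataset and the `nu` values are assumptions chosen for illustration.

```python
from sklearn.datasets import make_blobs
from sklearn.svm import NuSVC

# Balanced, overlapping two-class data (hypothetical example data).
X, y = make_blobs(n_samples=100, centers=2, cluster_std=2.0, random_state=0)

fractions = {}
for nu in (0.1, 0.3, 0.6):
    clf = NuSVC(nu=nu, kernel="rbf").fit(X, y)
    # support_ holds the indices of the support vectors in X
    fractions[nu] = len(clf.support_) / len(X)
    print(nu, fractions[nu])
```

Increasing `nu` forces more samples to become support vectors, which typically gives a wider, softer margin.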