<a href="https://colab.research.google.com/github/yiboxu20/MachineLearning/blob/main/Resources/Module1/SVM4.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Multiclass SVMs
### One-against-all scheme:
- Construct $K$ binary SVMs with parameters $(\mathbf{w}^j, w_0^j),1\le j\le K$. The $j$-th SVM classifies class $j$ against all other classes.

- Given a new sample $\mathbf{x}$, the farther it lies on the "positive" side of the decision boundary of the $j$-th binary SVM, the more likely we think it belongs to class $j$.

- That is, the predicted class is set to
  $$ \arg\max_{1\le j\le K}\{\phi(\mathbf{x})\cdot\mathbf{w}^{j} +w_0^j\}$$
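The one-against-all rule above can be sketched in a few lines of plain Python. The parameter values are hypothetical, chosen only for illustration, and classes are indexed from 0 rather than 1.

```python
def dot(u, v):
    """Inner product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))


def predict_one_vs_all(phi_x, weights, biases):
    """Return (predicted class index, list of the K scores)."""
    scores = [dot(phi_x, w) + b for w, b in zip(weights, biases)]
    best = max(range(len(scores)), key=lambda j: scores[j])
    return best, scores


# Hypothetical parameters (w^j, w_0^j) for K = 3 classes and d = 2 features.
weights = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
biases = [0.0, 0.0, 0.5]

label, scores = predict_one_vs_all([2.0, 0.5], weights, biases)
print(label, scores)
```

Here the identity feature map $\phi(\mathbf{x})=\mathbf{x}$ is assumed; any other feature map can be applied to the input before calling the function.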

### One-against-one scheme:

- Construct a binary SVM for each pair of classes, giving $\frac{K(K-1)}{2}$ binary SVMs in total for $K$-class classification.

- All the binary classifiers are tested; for each of them, a win for one class is
a vote for that class. The class with the most votes wins.
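A minimal sketch of the voting step, assuming each pairwise classifier is available as a decision function whose positive output means a win for the first class of the pair. The three decision functions below are hypothetical stand-ins for trained binary SVMs.

```python
from collections import Counter
from itertools import combinations


def predict_one_vs_one(x, pairwise, num_classes):
    """pairwise[(a, b)] is a decision function; a positive value is a win for a."""
    votes = Counter()
    for a, b in combinations(range(num_classes), 2):
        winner = a if pairwise[(a, b)](x) > 0 else b
        votes[winner] += 1
    # Ties, if any, go to whichever tied class received a vote first.
    return votes.most_common(1)[0][0], votes


# Hypothetical pairwise decision functions for K = 3 classes on a scalar input.
pairwise = {
    (0, 1): lambda x: 1.0 - x,
    (0, 2): lambda x: 2.0 - x,
    (1, 2): lambda x: 3.0 - x,
}

winner, votes = predict_one_vs_one(1.5, pairwise, num_classes=3)
print(winner, dict(votes))
```

With $K=3$ there are $3\cdot 2/2=3$ pairwise classifiers, matching the count stated above.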

Example: MNIST classification

# Summary of SVM and logistic regression: a unified framework

- Construct a **decision function** (or **classifier**) $f_\theta(\mathbf{x}):\mathbb{R}^d \rightarrow \mathbb{R}$, parameterized by $\theta$.
  -  Linear model: $f_\theta(\mathbf{x})= \mathbf{x}\mathbf{w}+w_0$ with $\theta=(\mathbf{w}, w_0)$.
  - Polynomial model up to degree $2$: $f_\theta(\mathbf{x})=\phi(\mathbf{x})\mathbf{w} +w_0$, with $\phi(\mathbf{x})=[x_1, \dots, x_d, x_1^2, \dots, x_d^2, \sqrt{2}x_1x_2, \dots, \sqrt{2}x_{d-1}x_d]$ and $\theta=(\mathbf{w}, w_0)$.

- **Decision boundary** is given by the equation $f_{\theta}(\mathbf{x})=0$

- The magnitude of $f_\theta(\mathbf{x})$, or equivalently of $yf_\theta(\mathbf{x})$ with label $y\in\{\pm1\}$, reflects how far the sample $\mathbf{x}$ is from the decision boundary.

- Sign of $f_\theta(\mathbf{x})$ predicts the class. Sign of $yf_\theta(\mathbf{x})$ indicates if sample $\mathbf{x}$ is
correctly classified.

- Choose a decreasing **loss function** $\ell: \mathbb{R}\rightarrow \mathbb{R}$ that penalizes the discrepancy between the model output and the corresponding label.

- Inconsistency between the output $f_\theta(\mathbf{x})$ and $y$ is measured by $\ell(yf_\theta(\mathbf{x}))$; a negative $yf_\theta(\mathbf{x})$ implies misclassification and incurs a relatively large loss.






Fit the model on training dataset $\{\mathbf{x}^{(i)}, y^{(i)}\}_{i=1}^N$,

$$\boxed{\min_{\theta}\lambda\sum_{i=1}^N\ell(y^{(i)}f_\theta(\mathbf{x}^{(i)}))+ R(\theta) } $$
The objective often includes a regularizer $R(\theta)$ such as $\|\theta\|_2^2$ or $\|\theta\|_1$; the weight $\lambda$ balances the loss term against the regularizer.
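The boxed objective can be evaluated with interchangeable losses and regularizers. The sketch below assumes a linear model and, as in the SVM cases, applies the regularizer to $\mathbf{w}$ only (not to $w_0$); the data and parameter values are made up for illustration.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


# Losses as functions of the margin z = y * f(x).
def zero_one(z):
    return 1.0 if z < 0 else 0.0


def log_loss(z):
    return math.log(1.0 + math.exp(-z))


def hinge(z):
    return max(1.0 - z, 0.0)


# Regularizers as functions of the weight vector w.
def no_reg(w):
    return 0.0


def l2_squared(w):
    return sum(wj * wj for wj in w)


def objective(w, w0, X, y, loss, regularizer, lam=1.0):
    """lam * sum_i loss(y_i * f(x_i)) + R(w) for the linear model f(x) = x.w + w0."""
    data_term = sum(loss(yi * (dot(xi, w) + w0)) for xi, yi in zip(X, y))
    return lam * data_term + regularizer(w)


# Toy data: the third sample is misclassified by w = [1, 0], w0 = 0.
X = [[2.0, 1.0], [-0.5, 0.0], [0.5, 1.0]]
y = [1, -1, -1]
w, w0 = [1.0, 0.0], 0.0

print(objective(w, w0, X, y, zero_one, no_reg))   # number of misclassified samples
print(objective(w, w0, X, y, hinge, l2_squared))  # soft-margin style objective
print(objective(w, w0, X, y, log_loss, no_reg))   # negative log likelihood
```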



### 1.  Perceptron:
- Classifier: $f_\theta(\mathbf{x})= \mathbf{x}\mathbf{w} +w_0$.

- Loss function: 0-1 loss function. $\ell(z)=\mathbb{1}_{z<0}$.

- Regularizer: None.

- Minimize the total number of misclassified training samples

- Solver: SGD with the "fake" gradient
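A minimal sketch of the perceptron solver, reading the "fake" gradient as the classic perceptron update: the 0-1 loss has zero gradient almost everywhere, so a misclassified sample instead moves the parameters by $y\mathbf{x}$. The data are a made-up separable toy set, and samples are visited in a fixed order rather than sampled at random.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def perceptron_sgd(X, y, epochs=20, lr=1.0):
    """Perceptron updates; stops early once an epoch makes no mistakes."""
    w = [0.0] * len(X[0])
    w0 = 0.0
    for _ in range(epochs):
        mistakes = 0
        for xi, yi in zip(X, y):
            # "<= 0" (rather than "< 0") so that the all-zero start triggers updates.
            if yi * (dot(xi, w) + w0) <= 0:
                w = [wj + lr * yi * xj for wj, xj in zip(w, xi)]
                w0 += lr * yi
                mistakes += 1
        if mistakes == 0:
            break
    return w, w0


# Hypothetical linearly separable data.
X = [[2.0, 1.0], [1.0, 2.0], [-1.0, -1.0], [-2.0, -1.0]]
y = [1, 1, -1, -1]

w, w0 = perceptron_sgd(X, y)
margins = [yi * (dot(xi, w) + w0) for xi, yi in zip(X, y)]
print(w, w0, margins)
```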

### 2.   Logistic regression:
- Classifier: $f_\theta(\mathbf{x})=\mathbf{x}\mathbf{w} +w_0$.
- Loss function: log-loss function. $\ell(z)=\log(1+\exp(-z))$. If $y\in\{\pm 1\}$, $-\log p(y|\mathbf{x},\theta)=\ell(yf_\theta(\mathbf{x}))$, i.e., $p(y|\mathbf{x},\theta)=\frac{1}{1+\exp(-yf_\theta(\mathbf{x}))}$.

- Regularizer: None.

- Minimize the negative log likelihood when the probability is defined as before.

- Solver: SGD with the true gradient
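For the log loss the true gradient is available: $\frac{d}{dz}\log(1+\exp(-z))=-\frac{1}{1+\exp(z)}$, so by the chain rule each sample contributes $-\frac{y\,\mathbf{x}}{1+\exp(yf_\theta(\mathbf{x}))}$ to the gradient with respect to $\mathbf{w}$. The sketch below applies this per-sample step on a made-up toy set; the step size and epoch count are arbitrary choices, and samples are cycled in a fixed order instead of being drawn at random.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def log_loss(z):
    return math.log(1.0 + math.exp(-z))


def total_loss(w, w0, X, y):
    return sum(log_loss(yi * (dot(xi, w) + w0)) for xi, yi in zip(X, y))


def logistic_sgd(X, y, epochs=200, lr=0.1):
    w = [0.0] * len(X[0])
    w0 = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            z = yi * (dot(xi, w) + w0)
            g = 1.0 / (1.0 + math.exp(z))  # magnitude of d(loss)/dz
            # Gradient descent step: w <- w + lr * g * y * x.
            w = [wj + lr * g * yi * xj for wj, xj in zip(w, xi)]
            w0 += lr * g * yi
    return w, w0


# Hypothetical linearly separable data.
X = [[2.0, 1.0], [1.0, 2.0], [-1.0, -1.0], [-2.0, -1.0]]
y = [1, 1, -1, -1]

initial = total_loss([0.0, 0.0], 0.0, X, y)
w, w0 = logistic_sgd(X, y)
final = total_loss(w, w0, X, y)
print(initial, final)
```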

### 3. Soft Margin Classification
- Classifier: $f_\theta(\mathbf{x})= \mathbf{x}\mathbf{w} +w_0$.

- Loss function: Hinge loss function. $\ell(z)=\max\{1-z, 0\}$.

- Regularizer: Tikhonov ($l_2$) regularization, $\|\mathbf{w}\|_2^2$.

- Minimize the hinge loss with Tikhonov regularization.

- Solver: solve the primal problem directly with SGD or solve the dual problem with SMO.
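Solving the primal problem with SGD can be sketched as follows. Because the hinge loss is not differentiable at $z=1$, a subgradient is used there. The regularizer $\|\mathbf{w}\|_2^2$ is split evenly across the $N$ samples, which is one common convention and an assumption here; the data, step size, and epoch count are illustrative.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def hinge(z):
    return max(1.0 - z, 0.0)


def svm_objective(w, w0, X, y, lam):
    """lam * sum of hinge losses + ||w||_2^2."""
    data_term = sum(hinge(yi * (dot(xi, w) + w0)) for xi, yi in zip(X, y))
    return lam * data_term + dot(w, w)


def svm_sgd(X, y, lam=1.0, epochs=200, lr=0.01):
    n = len(X)
    w = [0.0] * len(X[0])
    w0 = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            # Per-sample share of the regularizer: gradient of ||w||^2 / n.
            grad_w = [2.0 * wj / n for wj in w]
            grad_w0 = 0.0
            if yi * (dot(xi, w) + w0) < 1.0:
                # Subgradient of the hinge term when the margin is below 1.
                grad_w = [gj - lam * yi * xj for gj, xj in zip(grad_w, xi)]
                grad_w0 = -lam * yi
            w = [wj - lr * gj for wj, gj in zip(w, grad_w)]
            w0 -= lr * grad_w0
    return w, w0


# Hypothetical linearly separable data.
X = [[2.0, 1.0], [1.0, 2.0], [-1.0, -1.0], [-2.0, -1.0]]
y = [1, 1, -1, -1]

initial = svm_objective([0.0, 0.0], 0.0, X, y, lam=1.0)
w, w0 = svm_sgd(X, y, lam=1.0)
final = svm_objective(w, w0, X, y, lam=1.0)
print(initial, final)
```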

### 4. Hard Margin Classification
- Same as Soft Margin Classification with $\lambda =+\infty$.

- Solver: solve the dual problem with SMO.

### 5. Kernel SVM
- Classifier: $f_\theta(\mathbf{x})= \phi(\mathbf{x})\mathbf{w} +w_0=\sum_{i=1}^N\alpha_iy^{(i)}\mathcal{K}(\mathbf{x}^{(i)}, \mathbf{x})+w_0$.

- Loss function: Hinge loss function. $\ell(z)=\max\{1-z, 0\}$.

- Regularizer: Tikhonov ($l_2$) regularization, $\|\mathbf{w}\|_2^2$.

- Minimize the hinge loss with Tikhonov regularization.

- Solver: solve the dual problem with SMO.

  - Polynomial kernels: $\mathcal{K}(\mathbf{x}, \mathbf{z}) = (1+\mathbf{x}^\top\mathbf{z})^k $.

  - Gaussian kernels: $\mathcal{K}(\mathbf{x}, \mathbf{z}) = \exp(-\|\mathbf{x}-\mathbf{z}\|^2/2\sigma^2)$.
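The two kernels and the kernel form of the classifier can be written directly from the formulas above. In this sketch the coefficients $\alpha_i$ and the offset $w_0$ are hypothetical placeholders; in practice they would come from solving the dual problem.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def polynomial_kernel(x, z, k=2):
    """K(x, z) = (1 + x.z)^k"""
    return (1.0 + dot(x, z)) ** k


def gaussian_kernel(x, z, sigma=1.0):
    """K(x, z) = exp(-||x - z||^2 / (2 sigma^2))"""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))


def kernel_decision(x, X_train, y_train, alphas, w0, kernel):
    """f(x) = sum_i alpha_i * y_i * K(x_i, x) + w0"""
    return sum(
        a * yi * kernel(xi, x) for a, yi, xi in zip(alphas, y_train, X_train)
    ) + w0


# Hypothetical training points and dual coefficients (not computed by SMO).
X_train = [[1.0, 1.0], [-1.0, -1.0]]
y_train = [1, -1]
alphas = [0.5, 0.5]

f_value = kernel_decision([1.0, 0.5], X_train, y_train, alphas, 0.0, gaussian_kernel)
print(polynomial_kernel([1, 2], [3, 4], k=2), f_value)
```

Note that the classifier never forms $\phi(\mathbf{x})$ explicitly; it only needs kernel evaluations between the new point and the training points.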













<img src="https://github.com/yiboxu20/MachineLearning/blob/main/Resources/images/losses.png?raw=true" width="500" />

1. log loss. $\ell(z)=\log(1+\exp(-z))$
2. exp loss. $\ell(z)=\exp(-z)$

Both are smooth convex relaxations of the 0-1 loss function.
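These properties can be checked numerically. The sketch below tabulates the two losses next to the 0-1 loss and tests, on a small grid of margins chosen for illustration, that each is decreasing and midpoint-convex; the helper names are introduced here.

```python
import math


def zero_one(z):
    return 1.0 if z < 0 else 0.0


def log_loss(z):
    return math.log(1.0 + math.exp(-z))


def exp_loss(z):
    return math.exp(-z)


def is_decreasing(f, grid):
    return all(f(a) > f(b) for a, b in zip(grid, grid[1:]))


def is_midpoint_convex(f, grid, tol=1e-12):
    # On an evenly spaced grid, b is the midpoint of a and c.
    return all(
        f(b) <= 0.5 * (f(a) + f(c)) + tol
        for a, b, c in zip(grid, grid[1:], grid[2:])
    )


grid = [i / 4.0 for i in range(-12, 13)]  # margins z from -3 to 3

for z in (-2.0, -1.0, 0.0, 1.0, 2.0):
    print(f"z={z:+.1f}  0-1={zero_one(z):.0f}  "
          f"log={log_loss(z):.3f}  exp={exp_loss(z):.3f}")
```

On this grid the exp loss also sits on or above the 0-1 loss everywhere, since $\exp(-z)\ge 1$ for $z\le 0$.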

## SVM for regression
Fit the model on the training dataset $\{\mathbf{x}^{(i)}, y^{(i)}\}_{i=1}^N$. In regression, we have

$$\boxed{\min_{\theta}\lambda\sum_{i=1}^N\ell(y^{(i)}, f(\mathbf{x}^{(i)},\mathbf{w}))+ R(\theta) } $$
where $f(\mathbf{x}^{(i)},\mathbf{w})$ is the regression function, i.e., $f(\mathbf{x}^{(i)},\mathbf{w})= \phi(\mathbf{x}^{(i)})\mathbf{w}$. The feature map $\phi(\mathbf{x})$ can consist of polynomials up to order $k$, or even Gaussian basis functions
$$\phi(\mathbf{x})=\left[\exp\left(-\frac{\|\mathbf{x}-\mathbf{x}^{(1)}\|^2}{2\sigma^2}\right), \dots, \exp\left(-\frac{\|\mathbf{x}-\mathbf{x}^{(N)}\|^2}{2\sigma^2}\right) \right]\in \mathbb{R}^N $$

### 1. Ridge regression

- Loss function: Square loss. $\ell(y^{(i)}, f(\mathbf{x}^{(i)},\mathbf{w})) =\left(y^{(i)}- f(\mathbf{x}^{(i)},\mathbf{w})\right)^2$.

- Regularizer: Tikhonov regularization. $\|\mathbf{w}\|^2_2$
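Because $f(\mathbf{x},\mathbf{w})=\phi(\mathbf{x})\mathbf{w}$ is linear in $\mathbf{w}$, the ridge objective has a closed-form minimizer. As a sketch, stack the features into a matrix $\Phi$ whose $i$-th row is $\phi(\mathbf{x}^{(i)})$ and the targets into a vector $\mathbf{y}$ (both are notation introduced here for the derivation). Setting the gradient to zero gives

$$-2\lambda\Phi^\top(\mathbf{y}-\Phi\mathbf{w})+2\mathbf{w}=0 \quad\Longrightarrow\quad (\lambda\Phi^\top\Phi+I)\mathbf{w}=\lambda\Phi^\top\mathbf{y} \quad\Longrightarrow\quad \mathbf{w}=\left(\Phi^\top\Phi+\tfrac{1}{\lambda}I\right)^{-1}\Phi^\top\mathbf{y}.$$

For $\lambda>0$ the matrix $\Phi^\top\Phi+\tfrac{1}{\lambda}I$ is positive definite, so the inverse always exists.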

### 2. LASSO

- Loss function: Square loss. $\ell(y^{(i)}, f(\mathbf{x}^{(i)},\mathbf{w})) =\left(y^{(i)}- f(\mathbf{x}^{(i)},\mathbf{w})\right)^2$.

- Regularizer: $l_1$ regularization. $\|\mathbf{w}\|_1$

### 3.  $\epsilon$-insensitive loss

- Loss function: **$\epsilon$-insensitive loss function**:
$$\ell(y^{(i)}, f(\mathbf{x}^{(i)},\mathbf{w}))=\max(|y^{(i)}- f(\mathbf{x}^{(i)},\mathbf{w})|-\epsilon , 0) $$

- Regularizer: Tikhonov regularization. $\|\mathbf{w}\|^2_2$

### 4. Huber loss
- Loss function: Huber loss, $\ell(y^{(i)}, f(\mathbf{x}^{(i)},\mathbf{w}))=h(y^{(i)}- f(\mathbf{x}^{(i)},\mathbf{w}))$
$$h(r) = \begin{cases} r^2 & \text{if }|r|\le c  \\ 2c|r|-c^2 & \text{otherwise} \end{cases} $$

- Regularizer: Tikhonov regularization. $\|\mathbf{w}\|^2_2$

- Mixed quadratic/linear behavior: quadratic for small residuals and linear for large ones, which gives robustness to outliers.
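The regression losses above can be compared on a few residuals $r=y-f(\mathbf{x},\mathbf{w})$. In this sketch the thresholds $\epsilon=0.5$ and $c=1$ are arbitrary illustrative choices.

```python
def square_loss(r):
    return r * r


def eps_insensitive(r, eps=0.5):
    """Zero inside the tube |r| <= eps, linear outside."""
    return max(abs(r) - eps, 0.0)


def huber(r, c=1.0):
    """Quadratic for |r| <= c, linear (slope 2c) beyond that."""
    return r * r if abs(r) <= c else 2.0 * c * abs(r) - c * c


# Small, moderate, and outlier-sized residuals.
for r in (0.25, 1.0, 5.0):
    print(f"r={r:4.2f}  square={square_loss(r):6.2f}  "
          f"eps-insensitive={eps_insensitive(r):5.2f}  huber={huber(r):5.2f}")
```

For the outlier-sized residual $r=5$, the square loss is 25 while the Huber loss is 9, which illustrates why the linear tail limits the influence of outliers.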



<img src="https://github.com/yiboxu20/MachineLearning/blob/main/Resources/images/loss_function.png?raw=true" width="500" />

### Final notes on Loss functions
Regressors and classifiers can be constructed by a “mix ‘n’ match” of loss
functions and regularizers to obtain a learning machine suited to a
particular application.

- $l_1$-SVM
$$ \min_{\mathbf{w}\in \mathbb{R}^d, w_0\in \mathbb{R}} \lambda\sum_{i=1}^N \max\left\{0, 1-y^{(i)}(\mathbf{x}^{(i)}\mathbf{w} +w_0)\right\} +\frac{1}{2} \|\mathbf{w}\|_1$$
- Least squares SVM

$$ \min_{\mathbf{w}\in \mathbb{R}^d, w_0\in \mathbb{R}} \lambda\sum_{i=1}^N \left(\max\left\{0, 1-y^{(i)}(\mathbf{x}^{(i)}\mathbf{w} +w_0)\right\}\right)^2 +\frac{1}{2} \|\mathbf{w}\|_2^2$$