# Support Vector Machines
### Jack Bennetto
#### February 2, 2017

## Objectives
 * Explain margins, support vectors, and hyperplanes
 * Compute a linear SVC model, tuning C
 * Compute a non-linear SVM, tuning $\gamma$ or degree
 * Explain the relationship of all the hyperparameters to bias and variance

## Agenda

Morning agenda

 * SVM as a classifier
 * Decision boundaries
 * Maximum-margin classification
 
Afternoon agenda
 * Soft margins and support-vector classification
 * Additional features for non-linear boundaries
 * Kernels and SVMs
 * Other stuff

# Morning Lecture

## What is SVM?

The SVM was first developed in the '60s by Vladimir Vapnik, but it wasn't well known until the '90s.

Advantages
 * Best model for well-separated data 
 * Works well for high-dimensional data
 * Non-linear kernels can match non-linear boundaries
 
Disadvantages
 * Doesn't do well at providing probability
 * Not very interpretable, or visualizable for high-dimensional data
 * Slow to train
 * Need to standardize/normalize features



SVM can also be used for regression.

In [None]:
import numpy as np
import scipy.stats as scs

import matplotlib.pyplot as plt
%matplotlib inline

from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

In [None]:
def plot_boundary(ax, b0, b1, b2, margin=0, ls='-', color='k', alpha=1.0):
    '''
    Plot a decision boundary on an existing axis
    '''
        
    # save the limits, so the decision boundary doesn't expand them
    xlim = ax.get_xlim()
    ylim = ax.get_ylim()

    # boundary_x are the x axis values for the boundary, from X[0]
    # boundary_y are the y axis values for the boundary, from X[1]
    boundary_x = np.zeros(4)
    boundary_y = np.zeros(4)
    boundary_x[:2] = xlim[0], xlim[1]
    boundary_y[:2] = (margin - b0 - b1 * boundary_x[:2]) / b2
    boundary_y[2:] = ylim[0], ylim[1]
    boundary_x[2:] = (margin - b0 - b2 * boundary_y[2:]) / b1
    
    # we could just plot all the points,
    # but if we do that with a dotted line it plots over itself
    irng = [boundary_x.argmin(), boundary_x.argmax()]
    ax.plot(boundary_x[irng], boundary_y[irng], color=color, ls=ls, alpha=alpha)
    # restore the saved limits
    ax.set_xlim(xlim)
    ax.set_ylim(ylim)

## Decision boundaries

SVM is focused on situations 

Let's choose some points, plot them, and consider a couple of possible boundaries.

For the rest of the morning, we'll only consider cases where we have two classes that can be completely separated with a single line. In the afternoon we'll extend that to include soft margins and non-linear boundaries.

In [None]:
np.random.seed(42)
npts = 100
X_a = np.zeros((npts, 2))
# use integer division so the slice indices and sample counts are ints
X_a[npts//2:, 0] = scs.norm(0,1).rvs(npts//2)
X_a[npts//2:, 1] = scs.norm(0,5).rvs(npts//2)
X_a[:npts//2, 0] = scs.norm(24,10).rvs(npts//2)
X_a[:npts//2, 1] = scs.norm(0,1).rvs(npts//2)
# rotate a bit
X_a = X_a.dot([[1,-.2],[.2,1]])
y_a = np.zeros(npts, dtype='int32')
y_a[npts//2:] = 1

model_lr_a = LogisticRegression(intercept_scaling=100)
model_lr_a.fit(X_a, y_a)
model_sv_a = SVC(kernel='linear')
model_sv_a.fit(X_a, y_a)

In [None]:
fig, ax = plt.subplots(figsize=(15,8))
ax.scatter(X_a[:,0], X_a[:,1], color=np.array(['r', 'b'])[y_a])
ax.set_aspect('equal')

plot_boundary(ax, model_lr_a.intercept_[0], model_lr_a.coef_[0][0], model_lr_a.coef_[0][1], color='k')
plot_boundary(ax, model_sv_a.intercept_[0], model_sv_a.coef_[0][0], model_sv_a.coef_[0][1], color='k')
ax.plot([3.41,3.41],[-15,10], color='k')
#plot_boundary(ax, 4.17468436879, -1.00207931899, 0.200425620857, color='g')

ax.text(2.4,8.8, 'a', fontsize=16)
ax.text(4.6,8.8, 'b', fontsize=16)
ax.text(7.2,8.8, 'c', fontsize=16)

Which line do you like best?

These three lines represent three different models.

## Maximum-margin classification

One approach to choosing the best decision boundary is to find the one that's the farthest from any of the points. The distance from the boundary to the nearest points is called the margin.

In [None]:
fig, ax = plt.subplots(figsize=(15,8))
ax.scatter(X_a[:,0], X_a[:,1], color=np.array(['r', 'b'])[y_a])
ax.set_aspect('equal')

b0 = model_sv_a.intercept_[0]
b1 = model_sv_a.coef_[0][0]
b2 = model_sv_a.coef_[0][1]
plot_boundary(ax, b0, b1, b2, color='k')
plot_boundary(ax, b0, b1, b2, margin=-1, ls=':', color='k')
plot_boundary(ax, b0, b1, b2, margin=1,  ls=':', color='k')
ax.scatter(model_sv_a.support_vectors_[:,0], model_sv_a.support_vectors_[:,1], s=200, edgecolors='black', facecolors='none')

The points at the margins (circled) are called *support vectors*.
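As a rough check of these definitions, here is a minimal sketch on a tiny made-up dataset (the points and the names `X_toy`, `y_toy`, and `model_toy` are illustrative, not the lecture data). It assumes a very large `C` is a good stand-in for the hard maximum-margin classifier, and reads the margin off the fitted coefficients.

```python
import numpy as np
from sklearn.svm import SVC

# Hypothetical, linearly separable toy data on either side of x1 = 0
X_toy = np.array([[-3., 0.], [-2., 1.], [-2., -1.],
                  [2., 0.], [3., 1.], [3., -1.]])
y_toy = np.array([0, 0, 0, 1, 1, 1])

# A very large C approximates the hard (maximum) margin classifier
model_toy = SVC(kernel='linear', C=1e6)
model_toy.fit(X_toy, y_toy)

# The coefficients are scaled so the margins sit at decision_function = +/-1,
# so the distance from the boundary to either margin is 1 / ||w||
w = model_toy.coef_[0]
margin_width = 1 / np.linalg.norm(w)

# Support vectors lie on the margins, so their decision values are close to +/-1
sv_decision = model_toy.decision_function(model_toy.support_vectors_)
print('margin:', margin_width)
print('support vectors:', model_toy.support_vectors_)
print('decision values:', sv_decision)
```

Only the points nearest the boundary show up in `support_vectors_`; moving any of the other points (without crossing the margin) would leave the fit unchanged.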

## Hyperplanes

A line is great when we have two features, but what if we have more?

A **hyperplane** is an affine subspace with a dimension of one less than the ambient space. That's what we need for a decision boundary.

If we just have one feature, that's a point. If we have two features (as above) the hyperplane is a line. In three dimensions it's a plane, and in four dimensions...

Mathematically, a hyperplane in $p$-dimensional space can be described by

$$\beta_0 + \beta_1x_1 + \beta_2x_2 + ...  + \beta_px_p = 0$$

**Warning: For most classification problems we use $y_i \in \{0, 1\}$. For SVC we use $y_i \in \{-1, 1\}$.**

So

$\beta_0 + \beta_1x_{i1} + ...  + \beta_px_{ip} > 0$ for $y_i = +1$

$\beta_0 + \beta_1x_{i1} + ...  + \beta_px_{ip} < 0$ for $y_i = -1$

or

$y_i(\beta_0 + \beta_1x_{i1} + ...  + \beta_px_{ip}) > 0$

We can describe the maximum-margin classifier (MMC) as finding the maximum value of $M$ for which

$$y_i(\beta_0 + \beta_1x_{i1} + ...  + \beta_px_{ip}) \ge M \;\forall\; i$$

with the constraint 

$$\sum_{j=1}^p \beta_j^2 = 1$$


(To see that, recall that in general $\frac{\bar x_i \cdot \bar v}{||\bar v||}$ is the length of the projection of $\bar x_i$ along the vector $\bar v$, which is the distance of $\bar x_i$ from the hyperplane $\bar v \cdot \bar x = 0$.)
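The projection argument can be checked numerically. This is a small sketch with a made-up hyperplane and made-up points (`beta_0`, `beta`, and `points` are illustrative values, not anything fitted above). It computes the signed distance $(\beta_0 + \beta \cdot x) / ||\beta||$ for each point.

```python
import numpy as np

# Hypothetical hyperplane in 2-d: beta_0 + beta_1*x_1 + beta_2*x_2 = 0
beta_0 = -5.0
beta = np.array([3.0, 4.0])

points = np.array([[3.0, 4.0],   # on the positive side
                   [1.0, 0.5],   # on the hyperplane: 3*1 + 4*0.5 - 5 = 0
                   [0.0, 0.0]])  # on the negative side

# Signed distance from the hyperplane: (beta_0 + beta . x) / ||beta||
signed_dist = (beta_0 + points.dot(beta)) / np.linalg.norm(beta)
print(signed_dist)           # distances 4, 0 and -1

# The sign says which side of the hyperplane each point is on
print(np.sign(signed_dist))
```

If $\beta$ is scaled to unit length, the denominator is 1 and $\beta_0 + \beta \cdot x$ is itself the distance, which is why the unit-norm constraint lets $M$ be read as a distance.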

Note that for the support vectors on the margin, we have

$$y_i(\beta_0 + \beta_1x_{i1} + ...  + \beta_px_{ip}) - M = 0$$

That's a constrained-optimization problem, so we use Lagrange multipliers. In the end, the coefficients will be linear combinations of the $x_i$ using the multipliers $\alpha_i$. Most importantly, the Lagrangian depends on the $x_i$ only through their dot products. While there isn't a closed-form solution, the problem is convex, so solving it numerically won't get stuck in a local optimum.
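To see the "linear combination" claim in a fitted model, here is a minimal sketch using sklearn's `make_blobs` for data (the dataset and the names `X_blob`, `y_blob`, and `model_blob` are assumptions for illustration). For a binary linear-kernel `SVC`, `dual_coef_` holds $y_i \alpha_i$ for the support vectors only.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Illustrative two-class data
X_blob, y_blob = make_blobs(n_samples=60, centers=2, cluster_std=1.0,
                            random_state=0)
model_blob = SVC(kernel='linear', C=1.0).fit(X_blob, y_blob)

# Rebuild the coefficients as sum_i (y_i * alpha_i) * x_i over the support vectors
w_from_dual = model_blob.dual_coef_.dot(model_blob.support_vectors_)

print(np.allclose(w_from_dual, model_blob.coef_))
print(len(model_blob.support_), 'of', len(X_blob), 'points are support vectors')
```

Every point that isn't a support vector has $\alpha_i = 0$, so it contributes nothing to the sum.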


One helpful video on the math (that uses a slightly different formalism): https://www.youtube.com/watch?v=_PwhiWxHK8o

Note that for MMC pre-processing will be necessary to normalize or standardize the features.
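One common way to do that in sklearn is to put a scaler and the classifier in a pipeline, so the scaling is learned from the training data only. A minimal sketch, assuming synthetic data from `make_classification` with one feature deliberately put on a much larger scale (the names `X_raw`, `y_raw`, and `scaled_svc` are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Illustrative data; blow up the scale of the first feature
X_raw, y_raw = make_classification(n_samples=200, n_features=5, random_state=0)
X_raw[:, 0] *= 1000

# StandardScaler centers each feature and scales it to unit variance
# before the SVC ever sees it
scaled_svc = make_pipeline(StandardScaler(), SVC(kernel='linear', C=1.0))
scaled_svc.fit(X_raw, y_raw)
print(scaled_svc.score(X_raw, y_raw))
```

Because the margin is a distance, a feature measured in much larger units would otherwise dominate it.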

This afternoon we'll discuss a few generalizations to MMC for situations when the classes aren't separable by a hyperplane.

# Afternoon Lecture

## Soft margins

So far, we've only considered the case in which there exists a hyperplane that can divide the two classes. In the real world, even with well-separated classes, we'll always end up with points on the wrong side. To deal with that we'll add a cost to points that violate the margin (or, equivalently, allow a certain 'budget' for points to go past the margin).

First, a function to generate a bunch of points from distributions I choose that might overlap a bit, depending on the seed.

In [None]:
def generate_points(seed):
    '''Generate 100 points in 2 different classes/distributions based on a seed,
    chosen so they might overlap slightly.'''
    np.random.seed(seed)
    npts = 100
    X = np.zeros((npts, 2))
    # integer division keeps the slice indices and sample counts as ints
    X[npts//2:, 0] = scs.norm(0,1.5).rvs(npts//2)
    X[npts//2:, 1] = scs.norm(0,1.5).rvs(npts//2)
    X[:npts//2, 0] = scs.norm(4,1.5).rvs(npts//2)
    X[:npts//2, 1] = scs.norm(6,1.5).rvs(npts//2)
    y = np.zeros(npts, dtype='int32')
    y[npts//2:] = 1
    return X, y

In [None]:
def projection(b0, b1, b2, x1, x2):
    '''Find the projection of the point (x1, x2) onto the line
    b0 + b1*x + b2*y = 0, and its distance from that line in units
    of the margin width'''
    x1p = (b2*( b2*x1 - b1*x2) - b1*b0) / (b1**2 + b2**2)
    x2p = (b1*(-b2*x1 + b1*x2) - b2*b0) / (b1**2 + b2**2)
    # the margin width M is 1/||beta||
    m = 1 / np.sqrt(b1**2 + b2**2)
    dist = np.sqrt((x1-x1p)**2 + (x2-x2p)**2) / m
    return x1p, x2p, dist

In [None]:
C = 1
X, y = generate_points(8)
model_sv = SVC(kernel='linear', C=C)
model_sv.fit(X, y)

fig, ax = plt.subplots(figsize=(8,8))
ax.scatter(X[:,0], X[:,1], color=np.array(['r', 'b'])[y], s=8)
ax.set_aspect('equal')
b0 = model_sv.intercept_[0]
b1 = model_sv.coef_[0][0]
b2 = model_sv.coef_[0][1]
plot_boundary(ax, b0, b1, b2, color='k')
plot_boundary(ax, b0, b1, b2, margin=-1, ls=':', color='k')
plot_boundary(ax, b0, b1, b2, margin=1,  ls=':', color='k')
ax.scatter(model_sv.support_vectors_[:,0], model_sv.support_vectors_[:,1], s=200, edgecolors='black', facecolors='none')
total_slack=0.
for i, support in enumerate(model_sv.support_vectors_):
    margin = np.sign(model_sv.dual_coef_[0][i])
    x1p, x2p, dist = projection(b0-margin, b1, b2, support[0], support[1]) 
    ax.plot([support[0], x1p], [support[1], x2p], 'g')
    total_slack += dist
ax.set_title('Soft margins with SVC, C={0}, total slack={1:.2f}'.format(C, total_slack))
ax.set_xlim((-2,5))
ax.set_ylim((0,6))

The *slack* $\epsilon_i$ associated with $x_i$ is the distance that the point is past its margin, divided by the width of the margin $M$. So $\epsilon_i$ is

 * 0 if $x_i$ is correctly classified, either right on the margin or outside it,
 * 1 if $x_i$ is at the decision boundary,
 * 2 if $x_i$ is at the opposite margin, etc.

We can allow a total budget $C$ of slack:

$$\sum_{i=1}^n \epsilon_i \le C$$

while requiring

$$y_i(\beta_0 + \beta_1x_{i1} + ...  + \beta_px_{ip}) \ge M (1 - \epsilon_i)\;\forall\; i$$

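**Warning: C is used inconsistently. Here $C$ is a budget for slack, but in sklearn `C` is a penalty on slack, so a larger `C` allows less slack (a harder margin).**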

To solve this we again use Lagrange multipliers.
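The slack of each point can also be read off a fitted model without any geometry. Since the decision function is scaled so the margins sit at $\pm 1$, the slack in units of $M$ is $\max(0, 1 - y_i f(x_i))$. Here is a minimal sketch on made-up overlapping blobs (the data and the names `X_soft`, `y_pm`, and `model_soft` are assumptions for illustration):

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Illustrative overlapping classes; convert labels from {0, 1} to {-1, +1}
X_soft, y01 = make_blobs(n_samples=100, centers=2, cluster_std=2.5,
                         random_state=1)
y_pm = 2 * y01 - 1

model_soft = SVC(kernel='linear', C=1.0).fit(X_soft, y_pm)

# slack = max(0, 1 - y_i * f(x_i)), measured in units of the margin width
slack = np.maximum(0, 1 - y_pm * model_soft.decision_function(X_soft))

print('total slack:', slack.sum())
print('points past their margin:', (slack > 0).sum())
print('misclassified points:', (slack > 1).sum())
```

A slack above 1 means the point is on the wrong side of the decision boundary, matching the list above.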

### The Penalty term

Let's investigate what happens when we try different values for C for the same data.

In [None]:
def svc_plots(X, y):
    '''Fit an SVC and plot a graph for each of several values of the penalty C'''
    fig, axes = plt.subplots(1,5,figsize=(20,8))
    for ax, C in zip(axes, [100, 10, 1, .1, .01]):
        model_sv = SVC(kernel='linear', C=C)
        model_sv.fit(X, y)
        ax.scatter(X[:,0], X[:,1], color=np.array(['r', 'b'])[y], s=5)
        ax.set_aspect('equal')
        b0 = model_sv.intercept_[0]
        b1 = model_sv.coef_[0][0]
        b2 = model_sv.coef_[0][1]
        plot_boundary(ax, b0, b1, b2, color='k')
        plot_boundary(ax, b0, b1, b2, margin=-1, ls=':', color='k')
        plot_boundary(ax, b0, b1, b2, margin=1,  ls=':', color='k')
        ax.scatter(model_sv.support_vectors_[:,0], model_sv.support_vectors_[:,1], s=200, edgecolors='black', facecolors='none')
        ax.set_title('C = {}'.format(C))

In [None]:
X,y = generate_points(101)
svc_plots(X,y)

In [None]:
X,y = generate_points(99)
svc_plots(X,y)

For the second row we're taking a different sample from the same distribution and finding the decision boundary. The decision boundaries on the right-hand (C=0.01) graphs are pretty similar for the two different samples, while for the left-hand graphs (C=100) they are pretty different. On the other hand, the right-hand graphs are pretty far from the "true" boundary. This suggests a bias-variance tradeoff.
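In practice, one way to choose a point on that tradeoff is to cross-validate over C. This is a minimal sketch using `GridSearchCV` on made-up blobs rather than the samples plotted above (the data and the names `X_cv`, `y_cv`, and `search` are assumptions for illustration):

```python
from sklearn.datasets import make_blobs
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Illustrative overlapping classes
X_cv, y_cv = make_blobs(n_samples=200, centers=2, cluster_std=2.5,
                        random_state=2)

# Try the same grid of C values as in the plots, scoring by 5-fold accuracy
search = GridSearchCV(SVC(kernel='linear'),
                      param_grid={'C': [100, 10, 1, 0.1, 0.01]},
                      cv=5)
search.fit(X_cv, y_cv)

print(search.best_params_, search.best_score_)
```

The per-value scores are in `search.cv_results_['mean_test_score']`, which is worth looking at since neighboring values of C often score about the same.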

Let's get a lot more random samples and plot the decision boundaries to see the variance.

In [None]:
fig, axes = plt.subplots(1,5,figsize=(20,8))
for ax, C in zip(axes, [100, 10, 1, .1, .01]):
    ax.set_aspect('equal')
    ax.set_title('C = {}'.format(C))
    ax.set_xlim((-4,10))
    ax.set_ylim((-4,12))

for seed in range(100):
    X,y = generate_points(seed)
    for ax, C in zip(axes, [100, 10, 1, .1, .01]):
        model = SVC(kernel='linear', C=C)
        model.fit(X, y)
        plot_boundary(ax, model.intercept_[0], model.coef_[0][0], model.coef_[0][1], color='k', alpha=0.2)


## Adding polynomial features

Consider a simple 1-dimensional classification problem:

In [None]:
x = scs.uniform(-10,20).rvs(50)
y = ((x > 4) | (x < -4)).astype(int)

In [None]:
jitter = scs.uniform(-0.03, .06).rvs(50)
fig, ax = plt.subplots(figsize=(8,.5))
ax.scatter(x, jitter, alpha=0.5, color=np.array(['r', 'b'])[y])
ax.axis('off')
plt.show()

In [None]:
x2 = np.power(x, 2)
fig, ax = plt.subplots(figsize=(8,6))
ax.scatter(x, x2, alpha=0.5, color=np.array(['r', 'b'])[y])

Now we can fit an SVC model and get the boundary.

In [None]:
model = SVC(kernel='linear')
model.fit(np.stack([x, x2]).T, y)
fig, ax = plt.subplots(figsize=(8,6))
ax.scatter(x, x2, alpha=0.5, color=np.array(['r', 'b'])[y] )
plot_boundary(ax, model.intercept_[0], model.coef_[0][0], model.coef_[0][1], color='k')
plot_boundary(ax, model.intercept_[0], model.coef_[0][0], model.coef_[0][1], margin=-1, ls=':', color='k')
plot_boundary(ax, model.intercept_[0], model.coef_[0][0], model.coef_[0][1], margin=1,  ls=':', color='k')

In [None]:
npts = 1000
x1 = scs.uniform(-10, 20).rvs(npts)
x2 = scs.uniform(-10, 20).rvs(npts)
#y = ((x1**2 + x2**2) > 50).astype(int)
y = ((x1**2 - x2**2) > 10).astype(int)
fig, axes = plt.subplots(1, 3, figsize=(16,5))
for ax, y in zip(axes, [(x1**2 + x2**2) > 40,
                        (x1**2 - x2**2) > 10,
                        (x1*x2*x2 - x1**3) < 50
                       ]):
    y = y.astype(int)
    ax.set_xlim(-10,10)
    ax.set_ylim(-10,10)

    ax.scatter(x1, x2, color=np.array(['r','b'])[y], s=10)

We can handle all of these sorts of boundaries by adding lots of extra dimensions. That's great, except we really need to add many, many dimensions.

Suppose we have 3 features and want to consider all 2nd-order polynomials. How many more features will we need to add?
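One way to check your answer is to let sklearn build the features and count them. A minimal sketch, assuming `PolynomialFeatures` without the constant column and a throwaway array `X_three` made up for illustration:

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# Two illustrative samples with 3 features each
X_three = np.arange(6, dtype=float).reshape(2, 3)

# All terms up to degree 2: the originals, their squares, and the cross terms
poly = PolynomialFeatures(degree=2, include_bias=False)
X_poly = poly.fit_transform(X_three)

print(X_three.shape[1], 'features ->', X_poly.shape[1], 'features')
```

Try raising `degree` or the number of columns to see how quickly the count grows.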

Here's a video showing this in 3-d space: https://www.youtube.com/watch?v=3liCbRZPrZA

## Kernels

It turns out that the math above only depends on the dot products of the vectors with each other. We could replace that with a different kernel and get different results.

$$\begin{align}
K(x_i, x_{i'}) & = \sum_{j=1}^p x_{ij} x_{i'j} &\text{Linear Kernel}\\
K(x_i, x_{i'}) & = (1 + \sum_{j=1}^p x_{ij} x_{i'j})^d & \text{Polynomial Kernel}\\
K(x_i, x_{i'}) & = \exp(-\gamma\sum_{j=1}^p (x_{ij} - x_{i'j})^2) & \text{Radial Basis Function Kernel}
\end{align}$$
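These kernels are short enough to write out directly. The sketch below computes each one with numpy on a few random points and compares against the helpers in `sklearn.metrics.pairwise`. The RBF kernel here uses the squared distance between the two points, $\exp(-\gamma \sum_j (x_{ij} - x_{i'j})^2)$. The array `X_k` and the values of `d` and `gamma` are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.metrics.pairwise import linear_kernel, polynomial_kernel, rbf_kernel

rng = np.random.RandomState(0)
X_k = rng.normal(size=(5, 3))   # 5 illustrative points with 3 features
d, gamma = 3, 0.5

# Linear kernel: the matrix of all pairwise dot products
dots = X_k.dot(X_k.T)
K_lin = dots

# Polynomial kernel: (1 + dot product)^d
K_poly = (1 + dots) ** d

# RBF kernel: exp(-gamma * squared distance between the two points)
sq_dists = ((X_k[:, None, :] - X_k[None, :, :]) ** 2).sum(axis=2)
K_rbf = np.exp(-gamma * sq_dists)

print(np.allclose(K_lin, linear_kernel(X_k)))
print(np.allclose(K_poly, polynomial_kernel(X_k, degree=d, gamma=1, coef0=1)))
print(np.allclose(K_rbf, rbf_kernel(X_k, gamma=gamma)))
```

Note that sklearn's polynomial kernel is $(\gamma \, x \cdot x' + c_0)^d$, and `SVC(kernel='poly')` defaults to `coef0=0`, so matching the formula above takes `gamma=1, coef0=1`.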

Polynomial kernels require the degree as a hyperparameter.

How will this affect bias and variance?

RBF kernels require a hyperparameter $\gamma$. Larger $\gamma$ will tighten the boundaries, so...?
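To explore that question, here is a minimal sketch that fits RBF SVCs with a few values of $\gamma$ on made-up data with a circular boundary and compares training and test accuracy. The data, the train/test split, and the particular $\gamma$ values are assumptions for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Illustrative data: class 1 is everything outside a circle around the origin
rng = np.random.RandomState(0)
X_circ = rng.uniform(-10, 10, size=(400, 2))
y_circ = ((X_circ ** 2).sum(axis=1) > 40).astype(int)
X_tr, X_te, y_tr, y_te = train_test_split(X_circ, y_circ, random_state=0)

# Record (train accuracy, test accuracy) for each gamma
scores = {}
for gamma in [0.001, 0.1, 10]:
    model_rbf = SVC(kernel='rbf', gamma=gamma, C=1.0).fit(X_tr, y_tr)
    scores[gamma] = (model_rbf.score(X_tr, y_tr), model_rbf.score(X_te, y_te))
    print('gamma =', gamma, 'train/test accuracy:', scores[gamma])
```

Compare the gap between the two numbers at each $\gamma$: a big gap points to high variance, and low accuracy on both points to high bias.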


## Multi-class classification

Up until now, we've been focusing on binary classification.

SVM (like Logistic Regression) doesn't naturally extend to multiple target classes. There are a couple of approaches to expanding a binary classifier to a multi-class problem.

#### One-vs-rest (OvR)

The fastest approach is to build one model for each class, comparing each class to everything else. For any point we choose the class that's the most favored by the corresponding model. This is used in sklearn's LogisticRegression as well as LinearSVC.

#### One-vs-one (OvO)

The other approach is to build a model for each of the $k \choose 2$ pairs of classes, training each on just those samples that are in one class or the other. In this approach the overall model predicts the class that is chosen by the most models. This is used by sklearn's SVC and NuSVC.
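The difference shows up in the shape of the decision function. A minimal sketch with four made-up classes from `make_blobs` (the data and the names `X_mc`, `y_mc`, `ovo`, and `ovr` are assumptions for illustration):

```python
from sklearn.datasets import make_blobs
from sklearn.svm import SVC, LinearSVC

# Illustrative data with k = 4 classes
X_mc, y_mc = make_blobs(n_samples=200, centers=4, random_state=0)

# One-vs-one: one column per pair of classes, 4 choose 2 = 6
ovo = SVC(kernel='linear', decision_function_shape='ovo').fit(X_mc, y_mc)
print(ovo.decision_function(X_mc).shape)   # (200, 6)

# One-vs-rest: one column per class
ovr = LinearSVC().fit(X_mc, y_mc)
print(ovr.decision_function(X_mc).shape)   # (200, 4)
```

`SVC` always trains the pairwise models. Its default `decision_function_shape='ovr'` only reshapes the output to one column per class.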

## Probabilities

Sklearn's SVC doesn't generate probabilities by default; to do so you need to create the model with `probability=True`. In this case it uses cross-validation to fit a logistic function, which slows down the fit.
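A minimal sketch of what that looks like, on made-up blobs (the data and the names `X_pr`, `y_pr`, and `model_pr` are assumptions for illustration):

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Illustrative two-class data
X_pr, y_pr = make_blobs(n_samples=100, centers=2, cluster_std=2.0,
                        random_state=0)

# probability=True adds the internal cross-validated calibration step;
# random_state makes that step repeatable
model_pr = SVC(kernel='linear', probability=True, random_state=0)
model_pr.fit(X_pr, y_pr)

# One row per sample, one column per class, rows summing to 1
proba = model_pr.predict_proba(X_pr[:5])
print(proba)
print(model_pr.predict(X_pr[:5]))
```

Because the probabilities come from a separate calibration fit, they can occasionally disagree with `predict`, especially on small datasets.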

## Unbalanced classes

If you have seriously unbalanced classes and a small C (i.e., soft margin), the smaller class might get overwhelmed. One way to deal with this is with the `class_weight` argument.
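Here is a minimal sketch of the effect, using made-up data with 190 points in one class and 10 in the other. The data and the `recalls` dictionary are assumptions for illustration, and the exact numbers will depend on the sample. With `class_weight='balanced'`, sklearn scales `C` for each class inversely to the class frequency.

```python
import numpy as np
from sklearn.svm import SVC

# Illustrative unbalanced data: 190 points in class 0, 10 in class 1
rng = np.random.RandomState(0)
X_big = rng.normal(loc=0.0, scale=1.5, size=(190, 2))
X_small = rng.normal(loc=3.0, scale=1.5, size=(10, 2))
X_imb = np.vstack([X_big, X_small])
y_imb = np.array([0] * 190 + [1] * 10)

# Fraction of the small class that each model gets right
recalls = {}
for weights in [None, 'balanced']:
    model_imb = SVC(kernel='linear', C=0.01, class_weight=weights)
    model_imb.fit(X_imb, y_imb)
    pred = model_imb.predict(X_imb)
    recalls[str(weights)] = (pred[y_imb == 1] == 1).mean()
    print(weights, 'minority recall:', recalls[str(weights)])
```

The tradeoff is that weighting up the small class will usually cost some accuracy on the large one.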

## Other loss functions

SVM uses a "hinge loss" function, but others are possible with the LinearSVC.
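In `LinearSVC` the choice is made with the `loss` argument, which accepts `'hinge'` and `'squared_hinge'`. A minimal sketch on made-up blobs, fitting with each loss and reporting the average hinge loss on the training data (the data and the names `X_ls`, `y_ls_pm`, and `mean_hinge` are assumptions for illustration):

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import LinearSVC

# Illustrative two-class data with labels in {-1, +1}
X_ls, y_ls = make_blobs(n_samples=200, centers=2, cluster_std=2.0,
                        random_state=0)
y_ls_pm = 2 * y_ls - 1

mean_hinge = {}
for loss in ['hinge', 'squared_hinge']:
    model_ls = LinearSVC(loss=loss, C=1.0, max_iter=10000).fit(X_ls, y_ls_pm)
    # y * f(x) is positive for correctly classified points, >= 1 outside the margin
    margins = y_ls_pm * model_ls.decision_function(X_ls)
    mean_hinge[loss] = np.maximum(0, 1 - margins).mean()
    print(loss, 'mean hinge loss:', mean_hinge[loss])
```

The squared hinge penalizes large violations more heavily, so the two fits can differ when there are points far on the wrong side.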

In [None]:
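xpts = np.linspace(-3,3,100)
# hinge loss: max(0, 1 - x), where x = y * f(x) is the signed margin
yhinge = 1 - xpts
yhinge[yhinge < 0] = 0
# logistic (log) loss: log(1 + exp(-x))
ylogit = np.log(1 + np.exp(-xpts))
fig, ax = plt.subplots()
ax.plot(xpts, yhinge, label='hinge')
ax.plot(xpts, ylogit, label='logistic')
ax.legend()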

ax.set_ylim((-1, 3))