### Support Vector Machines

In this chapter, we discuss the support vector machine (SVM), an approach
for classification that was developed in the computer science community in
the 1990s and that has grown in popularity since then. SVMs have been
shown to perform well in a variety of settings, and are often considered one
of the best “out of the box” classifiers.

The support vector machine is a generalization of a simple and intuitive
classifier called the maximal margin classifier, which we introduce in
Section 9.1. Though it is elegant and simple, we will see that this classifier
unfortunately cannot be applied to most data sets, since it requires that
the classes be separable by a linear boundary. In Section 9.2, we introduce
the support vector classifier, an extension of the maximal margin classifier
that can be applied in a broader range of cases. Section 9.3 introduces the
support vector machine, which is a further extension of the support vector
classifier in order to accommodate non-linear class boundaries. Support
vector machines are intended for the binary classification setting in which
there are two classes; in Section 9.4 we discuss extensions of support vector
machines to the case of more than two classes. In Section 9.5 we discuss
the close connections between support vector machines and other statistical
methods such as logistic regression.

People often loosely refer to the maximal margin classifier, the support
vector classifier, and the support vector machine as “support vector
machines”. To avoid confusion, we will carefully distinguish between these
three notions in this chapter.

#### Maximal Margin Classifier

In this section, we define a hyperplane and introduce the concept of an
optimal separating hyperplane.

##### What Is a Hyperplane?

In a p-dimensional space, a hyperplane is a flat affine subspace of dimension $ p - 1 $. For instance, in two dimensions, a hyperplane is a flat one-dimensional subspace, that is, a line. In three dimensions, a hyperplane is a flat two-dimensional subspace, that is, a plane. In $ p > 3 $ dimensions, it can be hard to visualize a hyperplane, but the notion of a $(p - 1)$-dimensional flat subspace still applies.

The mathematical definition of a hyperplane is quite simple. In two dimensions, a hyperplane is defined by the equation:

$$
\beta_0 + \beta_1 X_1 + \beta_2 X_2 = 0 \tag{9.1}
$$

for parameters $ \beta_0, \beta_1, $ and $ \beta_2 $. When we say that (9.1) “defines” the hyperplane, we mean that any $ X = (X_1, X_2)^T $ for which (9.1) holds is a point on the hyperplane. Note that (9.1) is simply the equation of a line, since indeed in two dimensions, a hyperplane is a line.

Equation (9.1) can be easily extended to the p-dimensional setting:

$$
\beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_p X_p = 0 \tag{9.2}
$$

Equation (9.2) defines a hyperplane in p-dimensional space, again in the sense that if a point $ X = (X_1, X_2, \ldots, X_p)^T $ (i.e. a vector of length $ p $) satisfies (9.2), then $ X $ lies on the hyperplane.

Now, suppose that $ X $ does not satisfy (9.2); rather,

$$
\beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_p X_p > 0. \tag{9.3}
$$

Then this tells us that $ X $ lies to one side of the hyperplane. On the other hand, if

$$
\beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_p X_p < 0, \tag{9.4}
$$

then $ X $ lies on the other side of the hyperplane. So we can think of the hyperplane as dividing p-dimensional space into two halves. One can easily determine on which side of the hyperplane a point lies by simply calculating the sign of the left-hand side of (9.2). A hyperplane in two-dimensional space is shown in Figure 9.1.
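The sign rule in (9.3) and (9.4) can be sketched in a few lines of Python. This is only an illustration: the function names `hyperplane_value` and `side` and the coefficients $ \beta_0 = 1, \beta_1 = 2, \beta_2 = 3 $ are assumptions made for this example, not values taken from the text.

```python
def hyperplane_value(beta0, beta, x):
    """Left-hand side of (9.2): beta0 + beta_1 * x_1 + ... + beta_p * x_p."""
    return beta0 + sum(b * xj for b, xj in zip(beta, x))


def side(beta0, beta, x):
    """Return 1 or -1 for the two sides of the hyperplane, 0 if x lies on it."""
    value = hyperplane_value(beta0, beta, x)
    return (value > 0) - (value < 0)


# Hypothetical hyperplane in two dimensions: 1 + 2 * X1 + 3 * X2 = 0
beta0, beta = 1.0, [2.0, 3.0]

print(side(beta0, beta, [1.0, 1.0]))    # 1 + 2 + 3 = 6 > 0, so this prints 1
print(side(beta0, beta, [-1.0, -1.0]))  # 1 - 2 - 3 = -4 < 0, so this prints -1
print(side(beta0, beta, [1.0, -1.0]))   # 1 + 2 - 3 = 0, so this prints 0
```

The same function works unchanged for any $ p $, since it only sums over the coordinates it is given.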



##### Classification Using a Separating Hyperplane

Now suppose that we have an $ n \times p $ data matrix $ X $ that consists of $ n $ training observations in p-dimensional space:

$$
x_1 =
\begin{bmatrix}
x_{11} \\
\vdots \\
x_{1p}
\end{bmatrix}
\quad
\ldots
\quad
x_n =
\begin{bmatrix}
x_{n1} \\
\vdots \\
x_{np}
\end{bmatrix} \tag{9.5}
$$

and that these observations fall into two classes—that is, $ y_1, \ldots, y_n \in \{ -1, 1 \} $ where $ -1 $ represents one class and $ 1 $ the other class.  We also have a test observation, a $ p $-vector of observed features $ x^* = (x^*_1, \ldots, x^*_p)^T $. Our goal is to develop a classifier based on the training data that will correctly classify the test observation using its feature measurements. We have seen a number of approaches for this task, such as linear discriminant analysis and logistic regression in Chapter 4, and classification trees, bagging, and boosting in Chapter 8. We will now see a new approach that is based upon the concept of a separating hyperplane.

Suppose that it is possible to construct a hyperplane that separates the training observations perfectly according to their class labels. Examples of three such separating hyperplanes are shown in the left-hand panel of Figure 9.2. We can label the observations from the blue class as $ y_i = 1 $ and those from the purple class as $ y_i = -1 $. Then a separating hyperplane has the property that:

$$
\beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \cdots + \beta_p x_{ip} > 0 \quad \text{if } y_i = 1,
$$

and

$$
\beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \cdots + \beta_p x_{ip} < 0 \quad \text{if } y_i = -1.
$$

Equivalently, a separating hyperplane has the property that:

$$
y_i (\beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \cdots + \beta_p x_{ip}) > 0
$$

for all $ i = 1, \ldots, n $.
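The condition $ y_i (\beta_0 + \beta_1 x_{i1} + \cdots + \beta_p x_{ip}) > 0 $ for all $ i $ is easy to check directly. The sketch below does so on a small made-up data set; the helper name `is_separating`, the four observations, and both candidate hyperplanes are assumptions chosen for illustration only.

```python
def is_separating(beta0, beta, X, y):
    """True if y_i * (beta0 + beta . x_i) > 0 holds for every training observation."""
    return all(
        yi * (beta0 + sum(b * xij for b, xij in zip(beta, xi))) > 0
        for xi, yi in zip(X, y)
    )


# Hypothetical training data with p = 2 and labels in {-1, 1}
X = [[2.0, 2.0], [3.0, 1.0], [-1.0, -2.0], [-2.0, 0.0]]
y = [1, 1, -1, -1]

# Candidate 1: X1 + X2 = 0 puts each class strictly on its own side
print(is_separating(0.0, [1.0, 1.0], X, y))   # True

# Candidate 2: X1 - 3.5 = 0 leaves every observation on the negative side
print(is_separating(-3.5, [1.0, 0.0], X, y))  # False
```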

If a separating hyperplane exists, we can use it to construct a very natural classifier: a test observation is assigned a class depending on which side of the hyperplane it is located. The right-hand panel of Figure 9.2 shows an example of such a classifier. That is, we classify the test observation $ x^* $ based on the sign of

$$
f(x^*) = \beta_0 + \beta_1 x^*_1 + \beta_2 x^*_2 + \cdots + \beta_p x^*_p.
$$

If $ f(x^*) $ is positive, then we assign the test observation to class 1, and if $ f(x^*) $ is negative, then we assign it to class -1. We can also make use of the magnitude of $ f(x^*) $. If $ f(x^*) $ is far from zero, then this means that $ x^* $ lies far from the hyperplane, and so we can be confident about our class assignment for $ x^* $. On the other hand, if $ f(x^*) $ is close to zero, then $ x^* $ is located near the hyperplane, and so we are less certain about the class assignment for $ x^* $. Not surprisingly, and as we see in Figure 9.2, a classifier that is based on a separating hyperplane leads to a linear decision boundary.
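As a rough sketch of this rule, the code below returns both the predicted class and the magnitude $ |f(x^*)| $ for a test observation. The function names, the hyperplane $ X_1 + X_2 = 0 $, and the two test points are assumptions for illustration. The text does not say what to do when $ f(x^*) = 0 $; here such a point is arbitrarily assigned to class -1.

```python
def f(beta0, beta, x_star):
    """Evaluate f(x*) = beta0 + beta_1 * x*_1 + ... + beta_p * x*_p."""
    return beta0 + sum(b * xj for b, xj in zip(beta, x_star))


def classify(beta0, beta, x_star):
    """Return (predicted class, |f(x*)|) for a test observation."""
    value = f(beta0, beta, x_star)
    # Assumption: a point lying exactly on the hyperplane is assigned to class -1.
    label = 1 if value > 0 else -1
    return label, abs(value)


# Hypothetical separating hyperplane X1 + X2 = 0
beta0, beta = 0.0, [1.0, 1.0]

print(classify(beta0, beta, [3.0, 2.0]))    # (1, 5.0): far from the hyperplane
print(classify(beta0, beta, [0.25, -0.5]))  # (-1, 0.25): close to the hyperplane
```

The magnitudes are only comparable across test points for a fixed set of coefficients, since rescaling all of $ \beta_0, \ldots, \beta_p $ by a positive constant rescales $ f(x^*) $ without moving the hyperplane.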



##### The Maximal Margin Classifier

In general, if our data can be perfectly separated using a hyperplane, then there will in fact exist an infinite number of such hyperplanes. This is because a given separating hyperplane can usually be shifted a tiny bit up or down, or rotated, without coming into contact with any of the observations. Three possible separating hyperplanes are shown in the left-hand panel of Figure 9.2. In order to construct a classifier based upon a separating hyperplane, we must have a reasonable way to decide which of the infinitely many possible separating hyperplanes to use.

A natural choice is the maximal margin hyperplane (also known as the optimal separating hyperplane), which is the separating hyperplane that is farthest from the training observations. That is, we can compute the (perpendicular) distance from each training observation to a given separating hyperplane; the smallest such distance is the minimal distance from the observations to the hyperplane, and is known as the margin. The maximal margin hyperplane is the separating hyperplane for which the margin is largest; that is, it is the hyperplane that has the largest minimum distance to the training observations. We can then classify a test observation based on which side of the maximal margin hyperplane it lies. This is known as the maximal margin classifier. We hope that a classifier that has a large margin on the training data will also have a large margin on the test data, and hence will classify the test observations correctly. Although the maximal margin classifier is often successful, it can also lead to overfitting when $ p $ is large.
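To make the definition of the margin concrete, the sketch below computes it for two candidate separating hyperplanes. It relies on the standard geometric fact, not stated explicitly above, that the perpendicular distance from $ x_i $ to the hyperplane is $ |\beta_0 + \beta_1 x_{i1} + \cdots + \beta_p x_{ip}| / \sqrt{\beta_1^2 + \cdots + \beta_p^2} $. The function name `margin`, the data, and both hyperplanes are assumptions made for this example.

```python
import math


def margin(beta0, beta, X, y):
    """Smallest perpendicular distance from the observations to the hyperplane.

    Returns None when the hyperplane does not separate the two classes.
    """
    norm = math.sqrt(sum(b * b for b in beta))
    # Signed distance: positive exactly when observation i is on its correct side.
    signed = [
        yi * (beta0 + sum(b * xij for b, xij in zip(beta, xi))) / norm
        for xi, yi in zip(X, y)
    ]
    smallest = min(signed)
    return smallest if smallest > 0 else None


# Hypothetical training data with p = 2
X = [[2.0, 2.0], [3.0, 1.0], [-1.0, -2.0], [-2.0, 0.0]]
y = [1, 1, -1, -1]

margin_a = margin(0.0, [1.0, 1.0], X, y)   # hyperplane X1 + X2 = 0
margin_b = margin(-0.5, [1.0, 0.0], X, y)  # hyperplane X1 - 0.5 = 0

print(margin_a, margin_b)
print(margin_b > margin_a)  # True: the second hyperplane has the larger margin
```

Both hyperplanes separate the data, but the second one is preferred under the maximal margin criterion. Neither is claimed to be the maximal margin hyperplane for this data set; the code only compares two candidates.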

If $ \beta_0, \beta_1, \ldots, \beta_p $ are the coefficients of the maximal margin hyperplane, then the maximal margin classifier classifies the test observation $ x^* $ based on the sign of

$$
f(x^*) = \beta_0 + \beta_1 x^*_1 + \beta_2 x^*_2 + \cdots + \beta_p x^*_p.
$$

Figure 9.3 shows the maximal margin hyperplane on the data set of Figure 9.2. Comparing the right-hand panel of Figure 9.2 to Figure 9.3, we see that the maximal margin hyperplane shown in Figure 9.3 does indeed result in a greater minimal distance between the observations and the separating hyperplane—that is, a larger margin. In a sense, the maximal margin hyperplane represents the mid-line of the widest “slab” that we can insert between the two classes.

Examining Figure 9.3, we see that three training observations are equidistant from the maximal margin hyperplane and lie along the dashed lines indicating the width of the margin. These three observations are known as support vectors, since they are vectors in $ p $-dimensional space (in Figure 9.3, $ p = 2 $) and they “support” the maximal margin hyperplane in the sense that if these points were moved slightly, then the maximal margin hyperplane would move as well. Interestingly, the maximal margin hyperplane depends directly on the support vectors, but not on the other observations: a movement of any of the other observations would not affect the separating hyperplane, provided that the observation’s movement does not cause it to cross the boundary set by the margin. The fact that the maximal margin
hyperplane depends directly on only a small subset of the observations is
an important property that will arise later in this chapter when we discuss
the support vector classifier and support vector machines.
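The dependence on the support vectors can be illustrated numerically. The sketch below finds an approximate maximal margin hyperplane in two dimensions by brute force: it tries many unit normal directions and, for each, places the hyperplane midway between the two classes. This is a stand-in chosen for simplicity, not the optimization procedure normally used to fit the classifier. The function names, the grid of 3600 directions, and the data set are all assumptions made for this example.

```python
import math


def max_margin_2d(X, y, steps=3600):
    """Brute-force search over unit normal directions w in two dimensions.

    For a fixed w, the best hyperplane sits midway between the closest
    projections of the two classes, so the margin is half of that gap.
    Returns (margin, beta0, w), or None if no tried direction separates.
    """
    best = None
    for k in range(steps):
        theta = 2 * math.pi * k / steps
        w = (math.cos(theta), math.sin(theta))
        proj = [w[0] * x1 + w[1] * x2 for x1, x2 in X]
        lo_pos = min(p for p, yi in zip(proj, y) if yi == 1)
        hi_neg = max(p for p, yi in zip(proj, y) if yi == -1)
        m = (lo_pos - hi_neg) / 2
        if m > 0 and (best is None or m > best[0]):
            best = (m, -(lo_pos + hi_neg) / 2, w)
    return best


def support_vectors(X, beta0, w, m, tol=1e-6):
    """Observations whose distance to the hyperplane equals the margin."""
    return [x for x in X
            if abs(abs(beta0 + w[0] * x[0] + w[1] * x[1]) - m) < tol]


# Hypothetical separable data: three observations per class
y = [1, 1, 1, -1, -1, -1]
X = [(1, 1), (2, 3), (3, 2), (-1, -1), (-2, -3), (-3, -2)]
m, beta0, w = max_margin_2d(X, y)
sv = support_vectors(X, beta0, w, m)
print(round(m, 4), sv)

# Move a non-support observation away from the boundary: (3, 2) -> (5, 4)
X_far = [(1, 1), (2, 3), (5, 4), (-1, -1), (-2, -3), (-3, -2)]
m_far, beta0_far, w_far = max_margin_2d(X_far, y)
print(w_far == w, round(m_far, 4))  # same direction and margin as before

# Move a support vector instead: (1, 1) -> (1.5, 0.5)
X_sv = [(1.5, 0.5), (2, 3), (3, 2), (-1, -1), (-2, -3), (-3, -2)]
m_sv, beta0_sv, w_sv = max_margin_2d(X_sv, y)
print(w_sv == w, round(m_sv, 4))    # the hyperplane has rotated
```

In this toy data set only two observations are support vectors, whereas Figure 9.3 has three; the number depends on the data. Because the search is restricted to a finite grid of directions, the result is exact only when the optimal direction happens to lie on the grid, as it does for the first two runs here.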