# Support Vector Machines (SVM) - Summary and Practical Examples

This notebook summarizes key theoretical concepts from the chapter on Support Vector Machines (SVM), including:
- Linear SVM classification (hard margin, soft margin)
- Kernel trick and nonlinear SVM classification
- SVM regression
- Online SVMs and hinge loss

Practical examples use Scikit-Learn to illustrate core concepts.


## 1. Introduction to Support Vector Machines

SVM is a powerful model for classification, regression, and outlier detection. It seeks the widest possible margin ("street") separating classes by maximizing the distance between the decision boundary and the closest training points called **support vectors**. This large margin principle improves generalization.

**Key idea:** Fit a hyperplane that separates classes while maximizing the margin.

## 2. Linear SVM Classification

- **Hard margin:** strict separation with no margin violations allowed; it works only if the data is perfectly linearly separable, and it is sensitive to outliers.
- **Soft margin:** allows some margin violations, controlled by the hyperparameter \$ C \$: a smaller \$ C \$ tolerates more violations in exchange for a wider margin, while a larger \$ C \$ penalizes violations more heavily and narrows the margin.
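
To make the role of \$ C \$ concrete, the sketch below fits a linear-kernel `SVC` with a small and a large value of `C` and compares the resulting margin width, computed as 2 / ||w||. The dataset, the two `C` values, and the `margins` dictionary are choices made for this illustration, not part of the original notebook.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

iris = load_iris()
X = iris.data[:, [2, 3]]            # petal length and width
y = (iris.target == 2).astype(int)  # 1 for Iris virginica, 0 otherwise

margins = {}  # illustrative container: C -> margin width
for C in (0.01, 100):
    model = make_pipeline(StandardScaler(), SVC(kernel="linear", C=C))
    model.fit(X, y)
    svc = model.named_steps["svc"]
    w = svc.coef_[0]
    # The two margin boundaries are 2 / ||w|| apart in the scaled feature space
    margins[C] = 2 / np.linalg.norm(w)
    print(f"C={C}: margin width={margins[C]:.3f}, "
          f"support vectors={len(svc.support_)}")
```

The small `C` should yield the wider street, at the price of more instances falling inside it.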

### SVM Objective (Soft Margin):
\$ \min_{w,b,\zeta} \; \frac{1}{2} \|w\|^2 + C \sum_{i=1}^m \zeta_i \quad \text{subject to} \quad y_i(w^T x_i + b) \geq 1 - \zeta_i, \; \zeta_i \geq 0 \$

### SVM prediction function for a new instance \$x\$:
\$ f(x) = \text{sign}(w^T x + b) \$

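Here the labels are \$ y_i \in \{-1, +1\} \$ and \$ \zeta_i \$ is a slack variable measuring how far instance \$ i \$ violates the margin. Training finds \$w, b\$ that maximize the margin while keeping the total violation small.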
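
As a rough check of the prediction function, the following sketch recomputes the scores \$ w^T x + b \$ by hand from a fitted `LinearSVC` and compares them with the model's own output. It assumes 0/1 labels, so a positive score maps to class 1 rather than to +1; `max_iter` and `random_state` are values picked for this illustration.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

iris = load_iris()
X = StandardScaler().fit_transform(iris.data[:, [2, 3]])
y = (iris.target == 2).astype(int)  # 1 for Iris virginica, 0 otherwise

clf = LinearSVC(C=1, loss="hinge", max_iter=10000, random_state=42)
clf.fit(X, y)

# Learned hyperplane parameters
w, b = clf.coef_[0], clf.intercept_[0]

# Decision scores computed by hand: w^T x + b for every instance
scores = X @ w + b

# A positive score corresponds to the positive class (label 1 here)
manual_pred = (scores > 0).astype(int)

print(np.allclose(scores, clf.decision_function(X)))
print(np.array_equal(manual_pred, clf.predict(X)))
```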

In [1]:
# Linear SVM classification example on iris dataset
from sklearn.datasets import load_iris
from sklearn.svm import LinearSVC
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
import numpy as np

iris = load_iris()
X = iris.data[:, [2, 3]]  # petal length and width
y = (iris.target == 2).astype(int)  # 1 for Iris virginica, 0 otherwise

svm_clf = Pipeline([
    ("scaler", StandardScaler()),
    ("linear_svc", LinearSVC(C=1, loss="hinge"))
])
svm_clf.fit(X, y)

# Predict a sample
print(svm_clf.predict([[5.5, 1.7]]))  # expected: 1 (Iris virginica)

[1]


## 3. Feature Scaling Importance

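SVMs are sensitive to feature scales: a feature with a much larger range dominates the distance computations, so the margin ends up being determined mostly by that feature. Standardize features (zero mean, unit variance) before training, for example with `StandardScaler` as in the pipeline above.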
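
A minimal sketch of what standardization does, using synthetic data invented for this illustration (two features on very different scales, loosely an income-like and a rating-like column):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Two hypothetical features with very different scales
X_raw = np.column_stack([
    rng.normal(50_000, 15_000, 200),  # large-scale feature
    rng.normal(3, 1, 200),            # small-scale feature
])

X_scaled = StandardScaler().fit_transform(X_raw)

print("raw std per feature:   ", X_raw.std(axis=0))
print("scaled mean per feature:", X_scaled.mean(axis=0))
print("scaled std per feature: ", X_scaled.std(axis=0))
```

After scaling, both columns contribute on a comparable footing to dot products and distances, which is what the margin is built from.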

## 4. Nonlinear SVM Classification and Kernel Trick

Most datasets are not linearly separable. Two approaches:
- Manually add polynomial features to make the data linearly separable (costly for high degrees).
- Use the kernel trick to compute dot products in a high-dimensional feature space implicitly, without ever building the feature mapping.

Common kernels:
- Linear kernel: \$ K(a,b) = a^T b \$
- Polynomial kernel: \$ K(a,b) = (\gamma a^T b + r)^d \$
- Gaussian RBF kernel: \$ K(a,b) = \exp(-\gamma \|a - b\|^2) \$
- Sigmoid kernel: \$ K(a,b) = \tanh(\gamma a^T b + r) \$

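The kernel trick lets SVMs learn complex decision boundaries efficiently. In Scikit-Learn's `SVC`, \$ \gamma \$, \$ r \$, and \$ d \$ correspond to the `gamma`, `coef0`, and `degree` parameters.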
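
The four kernel formulas above can be evaluated directly with NumPy and cross-checked against the helpers in `sklearn.metrics.pairwise`. The two vectors and the values of `gamma`, `r`, and `d` below are arbitrary choices for this sketch.

```python
import numpy as np
from sklearn.metrics.pairwise import (
    linear_kernel, polynomial_kernel, rbf_kernel, sigmoid_kernel
)

# Two arbitrary example vectors, stored as 1-row matrices
a = np.array([[1.0, 2.0]])
b = np.array([[0.5, -1.0]])
gamma, r, d = 0.5, 1.0, 3

dot = (a @ b.T).item()  # a^T b

# Kernels written out exactly as in the formulas
k_linear = dot
k_poly = (gamma * dot + r) ** d
k_rbf = np.exp(-gamma * np.sum((a - b) ** 2))
k_sigmoid = np.tanh(gamma * dot + r)

# Same quantities from Scikit-Learn's pairwise helpers
print(np.isclose(k_linear, linear_kernel(a, b)[0, 0]))
print(np.isclose(k_poly, polynomial_kernel(a, b, degree=d, gamma=gamma, coef0=r)[0, 0]))
print(np.isclose(k_rbf, rbf_kernel(a, b, gamma=gamma)[0, 0]))
print(np.isclose(k_sigmoid, sigmoid_kernel(a, b, gamma=gamma, coef0=r)[0, 0]))
```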

In [2]:
# Nonlinear SVM with polynomial kernel on moons dataset
from sklearn.datasets import make_moons
from sklearn.svm import SVC
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_moons(n_samples=100, noise=0.15, random_state=42)
poly_kernel_svm_clf = Pipeline([
    ("scaler", StandardScaler()),
    ("svm_clf", SVC(kernel="poly", degree=3, coef0=1, C=5))
])
poly_kernel_svm_clf.fit(X, y)

print('Support vectors count:', len(poly_kernel_svm_clf.named_steps["svm_clf"].support_))

Support vectors count: 23


## 5. Gaussian RBF Kernel

The Gaussian RBF kernel implicitly maps instances to an infinite-dimensional space, which lets SVMs learn very complex boundaries with only a few hyperparameters to tune. \$ \gamma \$ controls how far the influence of each instance reaches (a larger \$ \gamma \$ makes the influence more local and the boundary more irregular), and \$ C \$ controls the trade-off between margin width and violations.
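
To get a feel for how \$ \gamma \$ and \$ C \$ interact, one option is to fit a small grid and look at training accuracy and support vector counts. The grid values and the `results` dictionary below are assumptions made for this sketch; training accuracy is shown only to illustrate flexibility, not as a measure of generalization.

```python
from sklearn.datasets import make_moons
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = make_moons(n_samples=100, noise=0.15, random_state=42)

results = {}  # (gamma, C) -> (training accuracy, number of support vectors)
for gamma in (0.1, 5):
    for C in (0.001, 1000):
        model = make_pipeline(
            StandardScaler(), SVC(kernel="rbf", gamma=gamma, C=C)
        ).fit(X, y)
        n_sv = len(model.named_steps["svc"].support_)
        results[(gamma, C)] = (model.score(X, y), n_sv)
        print(f"gamma={gamma}, C={C}: "
              f"train accuracy={results[(gamma, C)][0]:.2f}, support vectors={n_sv}")
```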

In [3]:
# SVM with RBF kernel on moons dataset
rbf_kernel_svm_clf = Pipeline([
    ("scaler", StandardScaler()),
    ("svm_clf", SVC(kernel="rbf", gamma=2, C=1))
])
rbf_kernel_svm_clf.fit(X, y)
print('Support vectors count:', len(rbf_kernel_svm_clf.named_steps["svm_clf"].support_))

Support vectors count: 29


## 6. Dual Problem and Kernel Trick

SVM training can be expressed as a dual quadratic optimization problem focusing on the support vectors and dot products between input vectors.

By substituting dot products with kernel functions, the "kernel trick" allows learning in high-dimensional feature spaces without explicit transformation.

This avoids ever computing the high-dimensional features explicitly, which keeps training tractable and makes nonlinear classification practical.
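
One way to see the dual form at work is to rebuild a kernel SVM's decision scores from its support vectors alone. The sketch below assumes a binary `SVC` with an RBF kernel, for which `dual_coef_` holds the signed dual coefficients of the support vectors; the dataset and hyperparameters are picked for illustration.

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.metrics.pairwise import rbf_kernel
from sklearn.svm import SVC

X, y = make_moons(n_samples=100, noise=0.15, random_state=42)

gamma = 2.0
svc = SVC(kernel="rbf", gamma=gamma, C=1).fit(X, y)

# Kernel values between every instance and every support vector
K = rbf_kernel(X, svc.support_vectors_, gamma=gamma)  # shape (100, n_SV)

# Dual-form decision function: weighted sum of kernel values plus bias
manual_scores = K @ svc.dual_coef_[0] + svc.intercept_[0]

print(np.allclose(manual_scores, svc.decision_function(X), atol=1e-6))
```

Only the support vectors appear in the sum, which is why prediction cost scales with their number rather than with the size of the training set.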

## 7. SVM Regression (Support Vector Regression)

SVMs can also perform regression by fitting a "tube" around the regression function. The tube extends \$ \epsilon \$ on either side of the prediction; errors inside it are ignored and errors outside it are penalized.

As with classification, both linear and kernelized versions exist (`LinearSVR` and `SVR` in Scikit-Learn).
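
The "ignore errors within the tube" idea corresponds to the epsilon-insensitive loss. A small NumPy sketch follows; the helper name `epsilon_insensitive_loss` and the sample values are made up for this illustration.

```python
import numpy as np

def epsilon_insensitive_loss(y_true, y_pred, epsilon):
    """Per-instance loss: zero inside the tube, linear outside it."""
    return np.maximum(0.0, np.abs(y_true - y_pred) - epsilon)

y_true = np.array([1.0, 2.0, 3.0, 4.0])
y_pred = np.array([1.2, 2.0, 4.0, 2.5])

# Absolute errors are 0.2, 0.0, 1.0, 1.5; only the last two exceed epsilon
losses = epsilon_insensitive_loss(y_true, y_pred, epsilon=0.5)
print(losses)
```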

The example below fits a linear SVR with Scikit-Learn's `LinearSVR`.

In [4]:
# Linear Support Vector Regression example
from sklearn.svm import LinearSVR

# Generate linear data with noise (no random seed is set, so the fitted coefficient varies between runs)
X_reg = 2 * np.random.rand(100, 1)
y_reg = 4 + 3 * X_reg.ravel() + np.random.randn(100)

svr_reg = LinearSVR(epsilon=0.5, C=1.0, max_iter=10000)
svr_reg.fit(X_reg, y_reg)
print(f"SVR coefficient: {svr_reg.coef_}")

SVR coefficient: [3.01087415]


## 8. Online SVMs and Hinge Loss

SVMs can be trained incrementally using stochastic gradient descent on the hinge loss function:

\$ J(w,b) = \frac{1}{2} \|w\|^2 + C \sum_i \max(0, 1 - y_i(w^T x_i + b)) \$

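The hinge loss is zero for instances that are correctly classified and lie outside the margin (\$ y_i(w^T x_i + b) \geq 1 \$). For instances inside the margin or on the wrong side of the boundary, it grows linearly with the size of the violation.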
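
The behaviour of the hinge loss is easy to verify numerically. In the sketch below, the function name `hinge_loss` and the example labels and scores are illustrative; labels are assumed to be -1 or +1.

```python
import numpy as np

def hinge_loss(t, scores):
    """Per-instance hinge loss for labels t in {-1, +1} and raw scores w^T x + b."""
    return np.maximum(0.0, 1.0 - t * scores)

t = np.array([1, 1, 1, -1])
scores = np.array([2.0, 0.5, -1.0, -3.0])

# Instance 1: correct and outside the margin      -> loss 0
# Instance 2: correct side but inside the margin  -> loss 0.5
# Instance 3: wrong side of the boundary          -> loss 2
# Instance 4: negative class, well outside margin -> loss 0
losses = hinge_loss(t, scores)
print(losses)
```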

Online training is useful for large-scale or streaming data.
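
As a rough sketch of incremental training, Scikit-Learn's `SGDClassifier` with `loss="hinge"` can be updated one mini-batch at a time through `partial_fit`. Several things here are assumptions for illustration: the setting `alpha = 1 / (m * C)` is a common heuristic for matching the regularization strength of a soft-margin SVM, the batch and epoch counts are arbitrary, and the scaler is fitted on the full dataset for simplicity, which a true streaming setup could not do.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import SGDClassifier
from sklearn.preprocessing import StandardScaler

iris = load_iris()
X = StandardScaler().fit_transform(iris.data[:, [2, 3]])
y = (iris.target == 2).astype(int)  # 1 for Iris virginica, 0 otherwise

m, C = len(X), 1.0
sgd = SGDClassifier(loss="hinge", alpha=1 / (m * C), random_state=42)

classes = np.array([0, 1])  # partial_fit needs the full label set up front
rng = np.random.default_rng(42)

for epoch in range(20):
    shuffled = rng.permutation(m)
    # Feed the data in 5 mini-batches, as if it were arriving in chunks
    for batch in np.array_split(shuffled, 5):
        sgd.partial_fit(X[batch], y[batch], classes=classes)

acc = sgd.score(X, y)
print(f"training accuracy after incremental updates: {acc:.2f}")
```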

For very large nonlinear problems, deeper models like neural networks may be preferable.

This notebook provides a practical and theoretical overview of SVM to get started with training and using Support Vector Machines effectively.