Support vector machines (SVMs) are powerful machine learning models capable of linear and nonlinear classification, regression, and novelty detection. They do not scale well to very large datasets, but they perform very well on small to medium-sized datasets (hundreds to thousands of instances).

# Linear SVM

The decision boundary of a linear SVM classifier is the line that separates the two classes while staying as far as possible from the closest training instances. You can think of an SVM classifier as fitting the widest possible street between the classes; this is called large margin classification. Adding more instances off the street does not affect the decision boundary at all: it is fully determined (or "supported") by the instances located on the edge of the street. These instances are called support vectors. SVMs are sensitive to feature scales, so scale the features (for example with StandardScaler) before training.
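
To make the idea of support vectors concrete, here is a minimal sketch. It assumes `SVC(kernel="linear")` as a stand-in for a linear SVM, because it exposes the support vector indices in `support_` after fitting. The names `functional_margin` and `is_support` are introduced only for this illustration. With a soft margin, the support vectors are the instances on the edge of the street plus any that fall inside it or on the wrong side.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

iris = load_iris(as_frame=True)
X = iris.data[["petal length (cm)", "petal width (cm)"]].values
y = (iris.target == 2).to_numpy()  # Iris virginica vs. the rest

# Scale first: SVMs are sensitive to feature scales.
X_scaled = StandardScaler().fit_transform(X)

# A linear-kernel SVC records which training instances are support vectors.
linear_svc = SVC(kernel="linear", C=1)
linear_svc.fit(X_scaled, y)

n_support_vectors = len(linear_svc.support_)
print(f"{n_support_vectors} of {len(X)} instances are support vectors")

# Label in {-1, +1} times the decision score:
# a value <= 1 means the instance is on the edge of the street or inside it.
t = np.where(y, 1, -1)
functional_margin = t * linear_svc.decision_function(X_scaled)

is_support = np.zeros(len(X), dtype=bool)
is_support[linear_svc.support_] = True

print("largest value among support vectors:", functional_margin[is_support].max())
print("smallest value among the others:   ", functional_margin[~is_support].min())
```

Up to the solver's tolerance, the support vectors should all have a value of at most 1 and every other instance a value of at least 1, which is why instances off the street do not influence the boundary.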

**Hard margin classification**

We strictly require that all instances be off the street and on the correct side. This has two issues: it only works if the data is linearly separable, and it is sensitive to outliers. With outliers, the resulting model probably won't generalize well.

**Soft margin classification**

The objective is to find a good balance between keeping the street as wide as possible and limiting margin violations (instances that end up in the middle of the street or on the wrong side).

**Regularization using C**

Reducing C makes the street wider but leads to more margin violations, i.e. more instances supporting the street. This lowers the risk of overfitting, but if C is reduced too much the model ends up underfitting. If an SVM model is overfitting, try regularizing it by reducing C.
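
The effect of C on the number of instances supporting the street can be checked with a short experiment. This is only a sketch: it assumes `SVC(kernel="linear")` (which reports support vector counts in `n_support_`), and the three C values are arbitrary choices for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

iris = load_iris(as_frame=True)
X = iris.data[["petal length (cm)", "petal width (cm)"]].values
y = (iris.target == 2).to_numpy()  # Iris virginica vs. the rest

support_counts = {}
for C in (0.01, 1, 100):  # illustrative values, not recommendations
    clf = make_pipeline(StandardScaler(), SVC(kernel="linear", C=C))
    clf.fit(X, y)
    # n_support_ holds the number of support vectors per class.
    support_counts[C] = int(clf[-1].n_support_.sum())
    print(f"C={C}: {support_counts[C]} support vectors")
```

The smallest C should report the most support vectors, reflecting a wider street with more margin violations.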

The scores that an SVM uses to make predictions measure the signed distance between each instance and the decision boundary. They are available through the decision_function() method.

In [1]:
from sklearn.datasets import load_iris
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

In [2]:
iris = load_iris(as_frame=True)
X = iris.data[["petal length (cm)", "petal width (cm)"]].values
y = (iris.target == 2)  # Iris virginica

svm_clf = make_pipeline(StandardScaler(),
                        LinearSVC(C=1, dual=True, random_state=42))
svm_clf.fit(X, y)

In [3]:
X_new = [[5.5, 1.7], [5.0, 1.5]]
svm_clf.predict(X_new)

array([ True, False])

In [4]:
svm_clf.decision_function(X_new)

array([ 0.66163411, -0.22036063])

# Nonlinear SVM classification

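For nonlinear datasets, one approach is to add more features, such as polynomial features, so that the dataset becomes linearly separable, and then use a LinearSVC classifier. The training time complexity of LinearSVC is roughly O(m × n), where m is the number of instances and n is the number of features.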

Kernelized SVMs (the SVC class): training time complexity is usually between O(m^2 × n) and O(m^3 × n).

kernel="poly" hyperparameters: degree and coef0. coef0 controls how much the model is influenced by high-degree terms versus low-degree terms.

kernel="rbf": conceptually adds features computed using a Gaussian RBF similarity function. Hyperparameters: gamma and C. gamma acts like a regularization hyperparameter: if the model is overfitting, reduce gamma; if it is underfitting, increase gamma.

In [5]:
from sklearn.datasets import make_moons
from sklearn.preprocessing import PolynomialFeatures

X, y = make_moons(n_samples=100, noise=0.15, random_state=42)

polynomial_svm_clf = make_pipeline(
    PolynomialFeatures(degree=3),
    StandardScaler(),
    LinearSVC(C=10, max_iter=10_000, dual=True, random_state=42)
)
polynomial_svm_clf.fit(X, y)

# Polynomial Kernel

In [6]:
from sklearn.svm import SVC

poly_kernel_svm_clf = make_pipeline(StandardScaler(),
                                    SVC(kernel="poly", degree=3, coef0=1, C=5))
poly_kernel_svm_clf.fit(X, y)

In [7]:
rbf_kernel_svm_clf = make_pipeline(StandardScaler(),
                                   SVC(kernel="rbf", gamma=5, C=0.001))
rbf_kernel_svm_clf.fit(X, y)
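
To show what the similarity function behind kernel="rbf" computes, the sketch below implements the Gaussian RBF by hand and compares it with scikit-learn's `rbf_kernel`. The helper `gaussian_rbf` and the points in `X_demo` are hypothetical, made up for this illustration. Note that SVC does not actually build these features explicitly; it relies on the kernel trick.

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

def gaussian_rbf(a, b, gamma):
    """Similarity of two points: exp(-gamma * squared Euclidean distance)."""
    return np.exp(-gamma * np.sum((a - b) ** 2))

# Hypothetical points; the first one is used as the landmark.
X_demo = np.array([[0.0, 0.0], [1.0, 0.0], [3.0, 4.0]])
landmark = X_demo[0]

for gamma in (0.1, 5):
    similarities = [gaussian_rbf(x, landmark, gamma) for x in X_demo]
    # A larger gamma makes the similarity drop off faster with distance.
    print(f"gamma={gamma}:", np.round(similarities, 4))

# The hand-written function should agree with scikit-learn's implementation.
manual = np.array([[gaussian_rbf(a, b, 5) for b in X_demo] for a in X_demo])
library = rbf_kernel(X_demo, X_demo, gamma=5)
print("matches rbf_kernel:", np.allclose(manual, library))
```

A larger gamma narrows the bell-shaped curve, so each instance influences a smaller region and the decision boundary becomes more irregular. That is why reducing gamma acts as regularization.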

# SVM Regression

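SVM regression reverses the classification objective: instead of fitting the widest possible street between two classes, it tries to fit as many instances as possible on the street while limiting margin violations (instances off the street). The width of the street is controlled by the hyperparameter epsilon.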

Reducing epsilon increases the number of support vectors, which regularizes the model. Adding more training instances within the margin does not affect the model's predictions; the model is therefore said to be epsilon-insensitive.
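
The link between epsilon and the number of support vectors can be observed directly. The sketch below assumes `SVR(kernel="linear")`, since it exposes the support vector indices in `support_`. The noisy linear dataset and the two epsilon values are made up for this illustration.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Hypothetical noisy linear dataset: y = 4 + 3x + Gaussian noise
rng = np.random.default_rng(42)
X = 2 * rng.random((50, 1))
y = 4 + 3 * X[:, 0] + rng.standard_normal(50)

support_counts = {}
for epsilon in (0.1, 1.5):  # narrow street vs. wide street
    reg = make_pipeline(StandardScaler(),
                        SVR(kernel="linear", epsilon=epsilon))
    reg.fit(X, y)
    # Instances on the edge of the street or outside it are support vectors.
    support_counts[epsilon] = len(reg[-1].support_)
    print(f"epsilon={epsilon}: {support_counts[epsilon]} support vectors")
```

With the narrow street most instances fall outside it and become support vectors, while the wide street contains most of them.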

To tackle nonlinear regression tasks, use a kernelized SVM model (the SVR class).

LinearSVR (like LinearSVC) scales linearly with the size of the training set, while SVR (like SVC) gets much too slow when the training set grows very large.

**How are predictions made?**

A linear SVM classifier first computes the decision function w^T x + b, where w is the feature weights vector, x is the instance's feature vector, and b is the bias term. If the result is positive, the predicted class y_hat is the positive class (1); otherwise it is the negative class (0).
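
The prediction rule can be reproduced by hand from a fitted model. This sketch assumes the same Iris setup used earlier in these notes and reads w and b from the `coef_` and `intercept_` attributes of LinearSVC. The scaler is applied manually so that the weights and the inputs live in the same feature space.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

iris = load_iris(as_frame=True)
X = iris.data[["petal length (cm)", "petal width (cm)"]].values
y = (iris.target == 2).to_numpy()  # Iris virginica vs. the rest

scaler = StandardScaler().fit(X)
linear_clf = LinearSVC(C=1, dual=True, max_iter=10_000, random_state=42)
linear_clf.fit(scaler.transform(X), y)

w = linear_clf.coef_[0]       # feature weights vector
b = linear_clf.intercept_[0]  # bias term

X_new = scaler.transform([[5.5, 1.7], [5.0, 1.5]])
scores = X_new @ w + b        # decision function w^T x + b
y_pred = scores > 0           # positive score -> positive class

print("scores:     ", scores)
print("predictions:", y_pred)
```

The manual scores should match `linear_clf.decision_function(X_new)`, and thresholding them at zero should match `linear_clf.predict(X_new)`.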

**How is training done?**

Training means finding the weights vector w and the bias term b that make the street (margin) as wide as possible while limiting the number of margin violations. The width of the street is inversely proportional to the norm of w, so to make the street wider we need to make w smaller. Tweaking the bias term b shifts the margin around without affecting its size. To avoid margin violations, the decision function must be greater than or equal to 1 for all positive training instances and less than or equal to -1 for all negative ones. For soft margin classification, a slack variable zeta(i) >= 0 measures how much the i-th instance is allowed to violate the margin.
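
The trade-off between street width and margin violations can be measured on a fitted model. The sketch below assumes the standard formulation in which the street edges sit where the decision function equals +1 and -1, so the street width is 2 / ||w|| and the slack of instance i is max(0, 1 - t(i) * (w^T x(i) + b)) with t(i) in {-1, +1}. It uses `SVC(kernel="linear")` and two arbitrary C values purely for illustration.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

iris = load_iris(as_frame=True)
X = StandardScaler().fit_transform(
    iris.data[["petal length (cm)", "petal width (cm)"]].values)
t = np.where(iris.target == 2, 1, -1)  # labels encoded as -1 / +1

results = {}
for C in (0.01, 100):  # illustrative values only
    clf = SVC(kernel="linear", C=C).fit(X, t)
    w, b = clf.coef_[0], clf.intercept_[0]

    street_width = 2 / np.linalg.norm(w)        # smaller w -> wider street
    slack = np.maximum(0, 1 - t * (X @ w + b))  # per-instance margin violation

    results[C] = (street_width, slack.sum())
    print(f"C={C}: street width={street_width:.2f}, "
          f"total slack={slack.sum():.2f}, "
          f"violations={(slack > 1e-6).sum()}")
```

A small C should give a wider street at the cost of more total slack, while a large C narrows the street to reduce violations.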

In [9]:
from sklearn.svm import LinearSVR
import numpy as np

# these 3 lines generate a simple noisy linear dataset
np.random.seed(42)
X = 2 * np.random.rand(50, 1)
y = 4 + 3 * X[:, 0] + np.random.randn(50)

svm_reg = make_pipeline(StandardScaler(),
                        LinearSVR(epsilon=0.5, dual=True, random_state=42))
svm_reg.fit(X, y)