# Algorithm & Formulation

A Support Vector Machine (SVM) is a supervised learning algorithm for classification, regression, and outlier detection. For classification, an SVM finds the decision boundary that maximizes the margin between classes.

- Maximizing margin → better generalization
- Uses convex optimization → global optimum guaranteed
- Can model non-linear decision boundaries using kernels

Throughout, we assume a training set of $n$ labelled points:

$$
\Large \{(x_i, y_i)\}_{i=1}^n, \quad x_i \in \mathbb{R}^d,\; y_i \in \{-1, +1\}
$$

We begin with the simplest possible classifier: a hyperplane.

$$
\Large w \cdot x + b = 0
$$

Here $w$ is a $d$-dimensional normal vector and $b$ is the bias. The classification rule is:

$$
\LARGE \hat{y} =\begin{cases}+1, & \text{if } w \cdot x + b \ge 0 \\-1, & \text{if } w \cdot x + b < 0\end{cases}
$$
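As a minimal sketch of this rule, the snippet below applies the sign of $w \cdot x + b$ to a few points. The values of `w` and `b`, the sample points, and the helper name `predict` are illustrative assumptions, not learned parameters.

```python
import numpy as np

def predict(X, w, b):
    """Return +1 where w.x + b >= 0 and -1 otherwise."""
    scores = X @ w + b
    return np.where(scores >= 0, 1, -1)

# Hypothetical hyperplane: -0.5 * x1 + 0.5 * x2 + 1 = 0
w = np.array([-0.5, 0.5])
b = 1.0

# The last point lies exactly on the hyperplane and is assigned +1 by the rule
X = np.array([[2, 2], [4, 0], [3, 1]])
print(predict(X, w, b))
```

Points on the boundary itself fall into the $+1$ class because the rule uses $\ge 0$.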

If the data is linearly separable, there exist infinitely many hyperplanes that correctly classify it. So the question is: which hyperplane should we choose for the best generalization?

Support vector machines answer this by adopting the maximum margin principle: among all separating hyperplanes, choose the one that maximizes the distance to the nearest data points from both classes. This distance is called the margin.

# Objective & Optimization

To formalize the margin, we fix the scale of $(w, b)$ and define two parallel hyperplanes:

- Positive margin hyperplane: $w \cdot x + b = +1$
- Negative margin hyperplane: $w \cdot x + b = -1$

These hyperplanes are parallel to the decision boundary and pass through the closest points of each class (the support vectors).

All training points must lie on or outside the margin hyperplanes, i.e.

$$
\Large y_i(w \cdot x_i + b) \ge 1 \quad \text{for all } i
$$

This inequality ensures correct classification and enforces a minimum margin.

The distance between the two margin hyperplanes is:

$$
\Large \text{Margin width} = \frac{2}{\|w\|}
$$
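A quick numerical check can make the constraint and the width formula concrete. The sketch below uses the same four points as the toy dataset in the implementation section; the hyperplane `w = (-0.5, 0.5)`, `b = 1` is a hand-picked assumption chosen so that every point sits on a margin hyperplane.

```python
import numpy as np

X = np.array([[2, 2], [4, 4], [4, 0], [6, 2]])
y = np.array([1, 1, -1, -1])

# Candidate canonical hyperplane (assumed for illustration, not fitted)
w = np.array([-0.5, 0.5])
b = 1.0

# y_i (w.x_i + b) must be at least 1 for every training point
functional_margins = y * (X @ w + b)

# Distance between the hyperplanes w.x + b = +1 and w.x + b = -1
margin_width = 2.0 / np.linalg.norm(w)

print("y_i (w.x_i + b):", functional_margins)
print("Margin width:", margin_width)
```

Here every constraint holds with equality, and the width $2/\|w\| = 2\sqrt{2}$ matches the Euclidean distance between the points $(2, 2)$ and $(4, 0)$.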

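So our objective is to maximize this margin, or equivalently to minimize $\|w\|$. In practice we minimize $\frac{1}{2}\|w\|^2$, which has the same minimizer and is differentiable everywhere.

Final primal optimization problem (hard-margin SVM):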

$$
\begin{aligned}
\min_{w,\, b} \quad & \frac{1}{2}\|w\|^2 \\
\text{subject to} \quad & y_i \bigl(w \cdot x_i + b\bigr) \ge 1,\quad i = 1, \dots, n
\end{aligned}
$$

We write the constraints in standard form:

$$
g_i(w,b) = 1 - y_i (w \cdot x_i + b) \le 0
$$

We introduce Lagrange multipliers, one multiplier $\alpha_i \ge 0$ per constraint, so that each multiplier enforces one constraint:

$$
\mathcal{L}(w,b,\alpha)=\frac{1}{2}\|w\|^2+\sum_{i=1}^n\alpha_i \left(1 - y_i (w \cdot x_i + b)\right)
$$

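We take the partial derivative with respect to $w$ and set it to zero:

$$
\frac{\partial \mathcal{L}}{\partial w}=w - \sum_{i=1}^n \alpha_i y_i x_i = 0 \quad\Longrightarrow\quad w = \sum_{i=1}^n \alpha_i y_i x_i
$$

We do the same with the partial derivative with respect to $b$:

$$
\frac{\partial \mathcal{L}}{\partial b}=- \sum_{i=1}^n \alpha_i y_i = 0 \quad\Longrightarrow\quad \sum_{i=1}^n \alpha_i y_i = 0
$$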
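Substituting these stationarity conditions back into $\mathcal{L}$ eliminates $w$ and $b$ and leaves a problem in $\alpha$ alone. As a sketch in the same notation, this is the standard hard-margin dual:

$$
\begin{aligned}
\max_{\alpha} \quad & \sum_{i=1}^n \alpha_i - \frac{1}{2}\sum_{i=1}^n\sum_{j=1}^n \alpha_i \alpha_j y_i y_j (x_i \cdot x_j) \\
\text{subject to} \quad & \alpha_i \ge 0,\quad \sum_{i=1}^n \alpha_i y_i = 0
\end{aligned}
$$

The training data enters only through the dot products $x_i \cdot x_j$, which is what makes the kernel substitution in the Kernel Trick section possible.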

So far we have assumed that the data is perfectly separable. In practice, data is noisy and classes overlap, so perfect separation may not exist and the original constraint $y_i(w \cdot x_i + b) \ge 1$ may be impossible to satisfy. Slack variables $\xi_i \ge 0$ allow controlled violations:

$$
y_i \bigl(w \cdot x_i + b\bigr) \ge 1 - \xi_i
$$

- $\xi_i = 0$: correct side, on or outside the margin
- $0 < \xi_i < 1$: correct side, inside the margin
- $\xi_i = 1$: on the decision boundary
- $\xi_i > 1$: misclassified
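The four cases can be illustrated numerically. For a fixed $(w, b)$, the smallest slack that satisfies the relaxed constraint is $\xi_i = \max(0,\, 1 - y_i(w \cdot x_i + b))$. The classifier, the points, and the helper `describe` below are hypothetical, chosen so that each case appears once.

```python
import numpy as np

# Hypothetical classifier and points (assumed for illustration)
w = np.array([1.0, 0.0])
b = 0.0
X = np.array([[2.0, 0.0], [0.5, 0.0], [0.0, 0.0], [-1.0, 0.0]])
y = np.array([1, 1, 1, 1])

# Smallest slack satisfying y_i (w.x_i + b) >= 1 - xi_i with xi_i >= 0
xi = np.maximum(0.0, 1.0 - y * (X @ w + b))

def describe(s):
    """Map a slack value to the case it falls into."""
    if s == 0:
        return "correct, on or outside margin"
    if s < 1:
        return "correct, inside margin"
    if s == 1:
        return "on decision boundary"
    return "misclassified"

for point, s in zip(X, xi):
    print(point, s, describe(s))
```

The exact comparisons with 0 and 1 work here because the example values are exactly representable; with fitted models a tolerance would be safer.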

The updated optimization problem (soft-margin SVM) is:

$$
\begin{aligned}
\min_{w,\, b,\, \xi} \quad & \frac{1}{2}\|w\|^2+C \sum_{i=1}^n \xi_i \\
\text{subject to} \quad & y_i \bigl(w \cdot x_i + b\bigr) \ge 1 - \xi_i,\quad \xi_i \ge 0,\quad i = 1, \dots, n
\end{aligned}
$$

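$C > 0$ controls the penalty strength: a large $C$ penalizes margin violations heavily and approaches the hard-margin solution, while a small $C$ tolerates more violations in exchange for a wider margin.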
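To see how $C$ shifts the trade-off, the sketch below evaluates the soft-margin objective for two hand-picked candidate hyperplanes on a small 1-D dataset. The data, both candidates, and the helper `soft_margin_objective` are assumptions for illustration, not the output of an optimizer.

```python
import numpy as np

def soft_margin_objective(w, b, X, y, C):
    """0.5 * ||w||^2 + C * sum(xi), with xi_i = max(0, 1 - y_i (w.x_i + b))."""
    xi = np.maximum(0.0, 1.0 - y * (X @ w + b))
    return 0.5 * np.dot(w, w) + C * xi.sum()

# Hypothetical 1-D data: the positive point at x = 0.5 sits close to the negatives
X = np.array([[2.0], [3.0], [0.5], [-2.0], [-3.0]])
y = np.array([1, 1, 1, -1, -1])

wide = (np.array([0.5]), 0.0)    # wide margin, lets the point at x = 0.5 violate it
narrow = (np.array([0.8]), 0.6)  # narrower margin, no violations

for C in (0.1, 10.0):
    obj_wide = soft_margin_objective(*wide, X, y, C)
    obj_narrow = soft_margin_objective(*narrow, X, y, C)
    print(f"C={C}: wide={obj_wide:.3f}, narrow={obj_narrow:.3f}")
```

With the small $C$ the wide-margin candidate has the lower objective despite its violation; with the large $C$ the violation becomes too expensive and the narrow-margin candidate wins.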

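Once we add slack variables, the Lagrangian has two types of constraints, with multipliers $\alpha_i \ge 0$ for the margin constraints and $\mu_i \ge 0$ for the constraints $\xi_i \ge 0$: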

$$
\Large \mathcal{L}=\frac{1}{2}\|w\|^2+C \sum_{i=1}^n \xi_i+\sum_{i=1}^n \alpha_i \bigl(1 - \xi_i - y_i (w \cdot x_i + b)\bigr)-\sum_{i=1}^n \mu_i \xi_i
$$

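Setting the derivative with respect to $\xi_i$ to zero gives $C - \alpha_i - \mu_i = 0$. Since $\mu_i \ge 0$, this bounds each multiplier:

$$
0 \le \alpha_i \le C
$$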

The slack penalty $C$ limits how much violation we are willing to tolerate. Because $\alpha_i \le C$, no single point can dominate the solution arbitrarily: the SVM caps the influence of every training point.

- $\alpha_i = 0$: the point is not a support vector and does not affect the decision function
- $0 < \alpha_i < C$: the point lies exactly on the margin
- $\alpha_i = C$: the point violates the margin or is misclassified

After solving the optimization problem, we substitute the optimal $w = \sum_{i=1}^n \alpha_i y_i x_i$ back into the model:

$$
f(x) = w \cdot x + b=\left( \sum_{i=1}^n \alpha_i y_i x_i \right) \cdot x + b
$$

$$
f(x)=\sum_{i=1}^n \alpha_i y_i (x_i \cdot x) + b
$$
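The equivalence of the two forms can be checked numerically. The sketch below reuses the four-point toy dataset from the implementation section; the multipliers `alpha` and the bias `b` are hand-derived assumptions for that dataset (one valid hard-margin solution), not solver output.

```python
import numpy as np

X = np.array([[2, 2], [4, 4], [4, 0], [6, 2]], dtype=float)
y = np.array([1, 1, -1, -1], dtype=float)

# Hand-derived multipliers for this toy set (an assumption for illustration):
# only the points (2, 2) and (4, 0) carry non-zero weight.
alpha = np.array([0.25, 0.0, 0.25, 0.0])
b = 1.0

# Primal weights recovered from w = sum_i alpha_i y_i x_i
w = (alpha * y) @ X

def f_primal(x):
    return np.dot(w, x) + b

def f_dual(x):
    # Uses only dot products between the training points and the query point
    return np.sum(alpha * y * (X @ x)) + b

x_new = np.array([3.0, 3.0])
print("w:", w)
print("primal:", f_primal(x_new), "dual:", f_dual(x_new))
```

Points with $\alpha_i = 0$ drop out of the sum, so only the support vectors need to be stored for prediction.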

Replacing the dot product with a kernel function $K$ gives the kernel generalization of the decision function:

$$
f(x)=\sum_{i=1}^n \alpha_i y_i K(x_i, x) + b
$$

# Kernel Trick

In many problems, the data is not linearly separable in the input space, so we map it into a higher-dimensional feature space:

$$
\Large \phi: \mathbb{R}^d \rightarrow \mathcal{H}
$$

The decision function then becomes:

$$
\Large f(x) = \sum_{i=1}^n \alpha_i y_i (\phi(x_i) \cdot \phi(x)) + b
$$

But computing $\phi(x)$ explicitly may be very expensive, or even impossible when the feature space is infinite-dimensional. Suppose instead that we can find a function $K$ such that:

$$
\Large K(x_i, x) = \phi(x_i) \cdot \phi(x)
$$

Then we do not need to compute $\phi(x)$ at all, and the final kernelized decision function becomes:

$$
\Large f(x) =\sum_{i=1}^n \alpha_i y_i K(x_i, x) + b
$$
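As a sketch of this decision function in action, the snippet below evaluates it on XOR-style data with a Gaussian kernel $\exp(-\gamma\|u - v\|^2)$. The multipliers are not solver output: by symmetry all four are equal, and the value used is a hand-derived assumption (for $\gamma = 1$, hard margin, $b = 0$) that places every training point on the margin. The helper names `rbf` and `decision` are hypothetical.

```python
import numpy as np

def rbf(u, v, gamma=1.0):
    """Gaussian kernel between two vectors."""
    return np.exp(-gamma * np.sum((u - v) ** 2))

# XOR-style data: not linearly separable in the input space
X = np.array([[0, 0], [1, 1], [0, 1], [1, 0]], dtype=float)
y = np.array([1, 1, -1, -1], dtype=float)

# Equal multipliers by symmetry (hand-derived assumption, gamma = 1)
a_val = 1.0 / (1.0 + np.exp(-2.0) - 2.0 * np.exp(-1.0))
alpha = np.full(4, a_val)
b = 0.0

def decision(x):
    # f(x) = sum_i alpha_i y_i K(x_i, x) + b, without ever forming phi(x)
    k = np.array([rbf(x_i, x) for x_i in X])
    return np.sum(alpha * y * k) + b

scores = np.array([decision(x) for x in X])
print(scores)
print(np.sign(scores))
```

Every training point receives a score whose sign matches its label, even though no hyperplane in the original two-dimensional space separates the classes.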

The Gaussian (RBF) kernel, $K(x, z) = \exp(-\gamma \|x - z\|^2)$, corresponds to an infinite-dimensional feature space.
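For scalar inputs this can be made concrete with the series $e^{-\gamma (x - z)^2} = e^{-\gamma x^2} e^{-\gamma z^2} \sum_{k=0}^{\infty} \frac{(2\gamma)^k x^k z^k}{k!}$, which has one feature coordinate per term. The sketch below truncates that feature map and compares the resulting dot product with the exact kernel value. The helper names `rbf_1d` and `phi_truncated` and the test inputs are assumptions for illustration.

```python
import math
import numpy as np

def rbf_1d(x, z, gamma=1.0):
    """Gaussian (RBF) kernel for scalar inputs."""
    return math.exp(-gamma * (x - z) ** 2)

def phi_truncated(x, n_terms, gamma=1.0):
    """First n_terms coordinates of the series feature map for scalar x."""
    k = np.arange(n_terms)
    factorials = np.array([math.factorial(int(i)) for i in k], dtype=float)
    return math.exp(-gamma * x ** 2) * np.sqrt((2.0 * gamma) ** k / factorials) * x ** k

x, z = 0.5, -0.3
exact = rbf_1d(x, z)
for n_terms in (1, 3, 10):
    approx = float(np.dot(phi_truncated(x, n_terms), phi_truncated(z, n_terms)))
    print(n_terms, approx, abs(approx - exact))
```

The approximation error shrinks as more coordinates are kept, but no finite number of coordinates reproduces the kernel exactly for all inputs.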

# Simple Implementation

In [1]:
import numpy as np

# Toy linearly separable dataset
X = np.array([[2, 2], [4, 4], [4, 0], [6, 2]])
y = np.array([1, 1, -1, -1])  # labels {-1, +1}

# Initialize weights and bias
w = np.zeros(2)
b = 0.0

# Learning rate
lr = 0.01
epochs = 1000

# Sub-gradient descent on the regularized hinge loss (simplified primal SVM)
lam = 0.01  # L2 regularization strength
for epoch in range(epochs):
    for i in range(len(X)):
        if y[i] * (np.dot(w, X[i]) + b) < 1:
            # Misclassified or inside margin: hinge term and regularizer both contribute
            w += lr * (y[i] * X[i] - 2 * lam * w)
            b += lr * y[i]
        else:
            # Correct and outside margin: only the regularizer contributes
            w -= lr * 2 * lam * w

print("Weights:", w)
print("Bias:", b)


Weights: [-0.52568088  0.8202978 ]
Bias: 0.5000000000000002


# Soft Margin 

In [2]:
from sklearn.svm import SVC

# Example dataset
X = np.array([[2, 2], [4, 4], [4, 0], [6, 2], [5, 5], [1, 0]])
y = np.array([1, 1, -1, -1, 1, -1])

# Soft-margin linear SVM
C = 1.0
clf = SVC(kernel='linear', C=C)
clf.fit(X, y)

print("Weights (w):", clf.coef_)
print("Bias (b):", clf.intercept_)


Weights (w): [[-0.5       1.249856]]
Bias (b): [-0.499808]


# Non-Linearly Separable Data

In [3]:
# Non-linearly separable data
X = np.array([[0, 0], [1, 1], [0, 1], [1, 0]])
y = np.array([1, 1, -1, -1])

# RBF kernel SVM
clf_rbf = SVC(kernel='rbf', C=1.0, gamma=1.0)
clf_rbf.fit(X, y)

print("Support vectors:\n", clf_rbf.support_vectors_)


Support vectors:
 [[0. 1.]
 [1. 0.]
 [0. 0.]
 [1. 1.]]


# RBF Kernel

In [4]:
def rbf_kernel(X1, X2, gamma=1.0):
    K = np.zeros((X1.shape[0], X2.shape[0]))
    for i in range(X1.shape[0]):
        for j in range(X2.shape[0]):
            K[i,j] = np.exp(-gamma * np.linalg.norm(X1[i]-X2[j])**2)
    return K

K = rbf_kernel(X, X, gamma=1.0)
print("RBF Kernel matrix:\n", K)


RBF Kernel matrix:
 [[1.         0.13533528 0.36787944 0.36787944]
 [0.13533528 1.         0.36787944 0.36787944]
 [0.36787944 0.36787944 1.         0.13533528]
 [0.36787944 0.36787944 0.13533528 1.        ]]
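The double loop above is easy to read but scales poorly. As an alternative sketch, the same matrix can be computed with array operations using $\|a - b\|^2 = \|a\|^2 + \|b\|^2 - 2\, a \cdot b$. The function name `rbf_kernel_vectorized` is hypothetical.

```python
import numpy as np

def rbf_kernel_vectorized(X1, X2, gamma=1.0):
    # Squared distances via ||a - b||^2 = ||a||^2 + ||b||^2 - 2 a.b
    sq1 = np.sum(X1 ** 2, axis=1)[:, None]
    sq2 = np.sum(X2 ** 2, axis=1)[None, :]
    # Clip tiny negative values caused by floating-point rounding
    sq_dists = np.maximum(sq1 + sq2 - 2.0 * X1 @ X2.T, 0.0)
    return np.exp(-gamma * sq_dists)

X = np.array([[0, 0], [1, 1], [0, 1], [1, 0]], dtype=float)
K = rbf_kernel_vectorized(X, X, gamma=1.0)
print(K)
```

The result is symmetric with ones on the diagonal, since every point has zero distance to itself.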


# Kernel Mapping

In [5]:
# Define an explicit feature mapping phi(x) for the polynomial kernel
# K(x, z) = (x . z)^2, compute the dot product in feature space, then
# compute the kernel directly in input space. The two values agree,
# so the explicit mapping is unnecessary.

import numpy as np

x = np.array([1, 2])
z = np.array([3, 4])

def phi(x):
    return np.array([
        x[0] ** 2,
        np.sqrt(2) * x[0] * x[1],
        x[1] ** 2
    ])
phi_x = phi(x)
phi_z = phi(z)

dot_feature_space = np.dot(phi_x, phi_z)
dot_feature_space


np.float64(121.0)

In [6]:
kernel_value = (np.dot(x, z)) ** 2
kernel_value


np.int64(121)

In [7]:
np.isclose(dot_feature_space, kernel_value)


np.True_