# How the Algorithm Works

**The objective of an SVM is to find a hyperplane that separates the data into two classes for binary classification.**

## Key Points

### Hyperplane
A hyperplane is a subspace whose dimension is one less than that of its ambient space. For instance, in an $n$-dimensional space, the hyperplane is $(n-1)$-dimensional: a line in two dimensions, a plane in three.

For SVMs, the goal is to maximize the margin between the two classes. The equation of the hyperplane can be represented as:

$$wx - b = 0$$

where $w$ is the weight vector, $x$ is the feature vector, $b$ is the bias, and $wx$ denotes the dot product $w \cdot x$.

To classify data points, we use the following conditions:

- For a data point with label $y = 1$, we want $wx - b \geq 1$.
- For a data point with label $y = -1$, we want $wx - b \leq -1$.

In general, we aim for:

$$y(wx - b) \geq 1$$

for all data points.
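The conditions above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function names `decision_value`, `predict`, and `satisfies_margin` and the toy values of `w`, `b`, and `points` are assumptions introduced here, and plain lists are used only to keep the sketch dependency-free.

```python
def decision_value(x, w, b):
    """Signed score w.x - b for a single feature vector x."""
    return sum(w_j * x_j for w_j, x_j in zip(w, x)) - b


def predict(x, w, b):
    """Assign +1 or -1 according to the side of the hyperplane x falls on."""
    return 1 if decision_value(x, w, b) >= 0 else -1


def satisfies_margin(x, y, w, b):
    """True when y * (w.x - b) >= 1: correct side and outside the margin."""
    return y * decision_value(x, w, b) >= 1


# Hypothetical toy values, chosen only for illustration.
w, b = [1.0, 1.0], 0.0
points = [([2.0, 1.0], 1), ([-1.0, -2.0], -1), ([0.25, 0.25], 1)]

for x, y in points:
    print(x, predict(x, w, b), satisfies_margin(x, y, w, b))
```

The third point is classified correctly (its score is positive) but fails the margin condition, because its score of $0.5$ is less than $1$.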

### Margin

The margin is the distance from the hyperplane to the nearest data points of either class; these nearest points are called support vectors. A larger margin generally gives the classifier better generalization ability.
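As a short derivation of why widening the margin amounts to shrinking $\|w\|$, take $x_+$ and $x_-$ (symbols introduced here for illustration) to be points lying exactly on the two margin boundaries:

$$wx_+ - b = 1, \qquad wx_- - b = -1$$

Subtracting the second equation from the first gives $w(x_+ - x_-) = 2$. Projecting $x_+ - x_-$ onto the unit normal $w / \|w\|$ then gives the distance between the two boundaries:

$$\frac{w(x_+ - x_-)}{\|w\|} = \frac{2}{\|w\|}$$

so each boundary lies $1 / \|w\|$ from the hyperplane. For example, $w = (3, 4)$ has $\|w\| = 5$, giving a total width of $0.4$.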

## Gradient Equations

The gradients of the cost function with respect to the weights $w$ and bias $b$ are used to update these parameters during optimization. The cost function combines a hinge-loss term, which penalizes points that are misclassified or fall inside the margin, with a regularization term $\lambda \|w\|^2$; keeping $\|w\|$ small is what widens the margin.
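The cost function is not written out in this section. One form consistent with the gradients below, stated here as an assumption, is the per-sample regularized hinge loss:

$$J_i = \lambda \|w\|^2 + \max\left(0,\ 1 - y_i(wx_i - b)\right)$$

When $y_i(wx_i - b) \geq 1$ the $\max$ term is zero, so only $\lambda \|w\|^2$ contributes to the gradient. Otherwise the $\max$ term equals $1 - y_i(wx_i - b)$, which is linear in $w$ and $b$ and adds a constant contribution to each gradient.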

For a data point $x_i$ with label $y_i$ that satisfies $y_i(wx_i - b) \geq 1$ (correctly classified and outside the margin), the gradients are:

$$\frac{\partial J}{\partial w} = 2\lambda w$$
$$\frac{\partial J}{\partial b} = 0$$

For a data point $x_i$ with label $y_i$ that does not satisfy $y_i(wx_i - b) \geq 1$ (either incorrectly classified or within the margin), the gradients are:

$$\frac{\partial J}{\partial w} = 2\lambda w - y_ix_i$$
$$\frac{\partial J}{\partial b} = y_i$$

Here, $J$ is the SVM cost function, $w$ is the weight vector, $b$ is the bias term, $x_i$ is the $i$-th data point, and $y_i$ is its label. The regularization parameter $\lambda$ controls the trade-off between widening the margin and keeping each $x_i$ on the correct side of it. Minimizing $J$ therefore favors, among the separating hyperplanes (when one exists), the one with the largest margin, which is central to SVM's classification strategy.
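The two gradient cases can be turned into a per-sample gradient descent loop. The sketch below is one possible reading, not the definitive training procedure: the function name `train_svm`, the learning rate `lr`, the epoch count, the value of `lam` (standing in for $\lambda$), and the toy data are all assumptions introduced for illustration. It uses plain Python lists to stay dependency-free; NumPy would be the more usual choice in practice.

```python
def train_svm(X, y, lr=0.01, lam=0.01, epochs=200):
    """Per-sample gradient descent on lam * ||w||^2 + hinge loss."""
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        for x_i, y_i in zip(X, y):
            score = sum(w_j * x_j for w_j, x_j in zip(w, x_i)) - b
            if y_i * score >= 1:
                # Outside the margin: only the regularizer contributes.
                w = [w_j - lr * (2 * lam * w_j) for w_j in w]
            else:
                # Hinge term is 1 - y_i * (w.x_i - b), so its derivative is
                # -y_i * x_i with respect to w and +y_i with respect to b.
                w = [w_j - lr * (2 * lam * w_j - y_i * x_j)
                     for w_j, x_j in zip(w, x_i)]
                b -= lr * y_i
    return w, b


def predict(x, w, b):
    """Assign +1 or -1 according to the sign of w.x - b."""
    return 1 if sum(w_j * x_j for w_j, x_j in zip(w, x)) - b >= 0 else -1


# Hypothetical, linearly separable toy data with labels in {-1, +1}.
X = [[-2.0, -1.0], [-1.0, -2.0], [-2.0, -2.0],
     [2.0, 1.0], [1.0, 2.0], [2.0, 2.0]]
y = [-1, -1, -1, 1, 1, 1]

w, b = train_svm(X, y)
print(w, b, [predict(x, w, b) for x in X])
```

Updating on one sample at a time mirrors the per-point form of the gradients above; averaging the gradients over a batch would be an equally reasonable design.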
