# Support Vector Machines
Support vector machines (SVMs) are a classification method in which we try to draw a decision boundary that separates two classes by the largest possible margin. The simplified mathematics behind this goes as follows:

We start with two equations, one for a point $\boldsymbol{x}_{pos}$ lying on the positive margin and one for a point $\boldsymbol{x}_{neg}$ lying on the negative margin.

$$ w_0 + \boldsymbol{w}^T\boldsymbol{x}_{pos} = 1 \\ w_0 + \boldsymbol{w}^T\boldsymbol{x}_{neg} = -1 $$

These two equations describe hyperplanes that are parallel to the decision boundary, one on each side of it. We want the distance between these two hyperplanes to be as large as possible. To get an expression for that distance, we subtract the second equation from the first, which gives

$$ w_0 + \boldsymbol w^T \boldsymbol x_{pos} - w_0 - \boldsymbol w^T \boldsymbol x_{neg} = \boldsymbol w^T(\boldsymbol x_{pos} - \boldsymbol x_{neg}) = 2 $$

To normalize this equation, we divide both sides by the $l_2$ norm of $\boldsymbol w$,

$$ ||\boldsymbol w|| = \sqrt{\sum_{j=1}^m w^2_j } $$

to get

$$ \frac{\boldsymbol w^T(\boldsymbol x_{pos} - \boldsymbol x_{neg})}{||\boldsymbol w ||} = \frac2{||\boldsymbol w ||} $$
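As a quick numerical check of the normalized equation, the sketch below places one point on each of the two hyperplanes and confirms that their distance along $\boldsymbol w$ equals $2 / ||\boldsymbol w||$. The values of `w` and `w0` are made up for illustration and are not taken from any fitted model.

```python
import numpy as np

# Hypothetical weights and bias, chosen only for illustration
w = np.array([3.0, 4.0])
w0 = -2.0
norm_w = np.linalg.norm(w)

# Margin width from the formula 2 / ||w||
margin = 2.0 / norm_w

# A point on the decision boundary (w0 + w^T x = 0)
x_boundary = -w0 * w / np.dot(w, w)

# Step along the unit normal by 1 / ||w|| to reach each margin hyperplane
unit_w = w / norm_w
x_pos = x_boundary + unit_w / norm_w  # satisfies w0 + w^T x = +1
x_neg = x_boundary - unit_w / norm_w  # satisfies w0 + w^T x = -1

# Left-hand side of the normalized equation
distance = np.dot(w, x_pos - x_neg) / norm_w

print("2 / ||w||:", margin)
print("distance between hyperplanes:", distance)
```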

The left-hand side of the normalized equation is the distance between the two hyperplanes, i.e. the margin. We can now use a variety of optimization methods to maximize the right-hand side, $\frac2{||\boldsymbol w ||}$, and with it the margin, under the constraint that every training point is still classified correctly. This constraint can be expressed most concisely as

$$ y^{(i)} (w_0 + \boldsymbol w^T \boldsymbol x^{(i)} ) \ge 1 \quad \forall i $$

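which basically says that the sign of each classification value $w_0 + \boldsymbol w^T \boldsymbol x^{(i)}$ should match the "label" $y^{(i)} \in \{-1, 1\}$ of that point, and that its magnitude should be at least 1, so no point falls inside the margin.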
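The constraint is easy to check numerically. Here is a minimal sketch, assuming labels in $\{-1, 1\}$; the helper name `satisfies_margin` and the toy data are hypothetical and only serve as illustration.

```python
import numpy as np

def satisfies_margin(w0, w, X, y):
    """Return a boolean array that is True where y_i * (w0 + w^T x_i) >= 1."""
    scores = w0 + X @ w
    return y * scores >= 1.0

# Hypothetical data and weights for illustration
X = np.array([[2.0, 2.0], [0.0, 0.0], [1.0, 1.0]])
y = np.array([1, -1, 1])
w = np.array([1.0, 1.0])
w0 = -2.0

ok = satisfies_margin(w0, w, X, y)
print(ok)
```

In this toy setup the third point lies exactly on the decision boundary, so it violates the constraint even though it is not on the wrong side.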

## Non-linearly separable cases
For datasets that are not linearly separable, we can introduce (what I am going to call) a fudge factor $\xi^{(i)}$, more commonly known as a slack variable, into the constraints. If we split the concisely written constraint above into its two cases, we can write this fudge factor into our conditions as

$$ w_0 + \boldsymbol w^T \boldsymbol x^{(i)} \ge 1 - \xi^{(i)} \text{ if } y^{(i)} = 1 \\ w_0 + \boldsymbol w^T \boldsymbol x^{(i)} \le -1 + \xi^{(i)} \text{ if } y^{(i)} = -1  $$

for $i = 1\dots N$, with $\xi^{(i)} \ge 0$. Maximizing $\frac2{||\boldsymbol w ||}$ is equivalent to minimizing $\frac12 ||\boldsymbol w ||^2$, so the fudge factor enters the objective (which we have now turned into a minimization) as

$$ \frac12 ||\boldsymbol w ||^2 + C \left(\sum_i\xi^{(i)}\right)$$

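Here, $C$ can be treated as a hyperparameter that controls how strongly we penalize misclassification: a large $C$ tolerates few margin violations, while a small $C$ allows more of them in exchange for a wider margin.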
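To make the role of $C$ concrete, the sketch below evaluates this objective for a fixed, hypothetical $\boldsymbol w$ and $w_0$. It assumes the standard convention that each $\xi^{(i)}$ is non-negative and takes the smallest value that satisfies the relaxed constraint, which works out to $\max(0, 1 - y^{(i)}(w_0 + \boldsymbol w^T \boldsymbol x^{(i)}))$. The function name and toy data are illustrative only.

```python
import numpy as np

def soft_margin_objective(w0, w, X, y, C):
    """Return (objective, xi) for labels y in {-1, 1}."""
    scores = w0 + X @ w
    # Smallest non-negative slack that satisfies each relaxed constraint
    xi = np.maximum(0.0, 1.0 - y * scores)
    objective = 0.5 * np.dot(w, w) + C * xi.sum()
    return objective, xi

# Hypothetical data and weights for illustration
X = np.array([[2.0, 2.0], [0.0, 0.0], [1.0, 1.0]])
y = np.array([1, -1, 1])
w = np.array([1.0, 1.0])
w0 = -2.0

obj_small_C, xi = soft_margin_objective(w0, w, X, y, C=1.0)
obj_large_C, _ = soft_margin_objective(w0, w, X, y, C=10.0)

print("slack per point:", xi)
print("objective with C=1:", obj_small_C)
print("objective with C=10:", obj_large_C)
```

The same margin violation costs more as $C$ grows, which is why an optimizer facing a large $C$ will prefer a narrower margin over leaving points inside it.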

Now, to actually do it.

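In [1]:
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Toy two-class data (an assumption: the original cell never defined X and y)
X, y = make_blobs(n_samples=100, centers=2, random_state=1)

# fit() requires the feature matrix X and the label vector y
svm = SVC(kernel='linear', C=1.0, random_state=1)
svm.fit(X, y)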
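Once a linear `SVC` has been fitted, its learned parameters can be tied back to the formulas above. The following is a self-contained sketch under stated assumptions: the data comes from `make_blobs` purely for illustration, and the variable names `w`, `w0`, and `margin` are chosen to mirror the notation in this section.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Synthetic two-class data (an assumption; any labeled two-class data would do)
X, y = make_blobs(n_samples=100, centers=2, random_state=1)

svm = SVC(kernel='linear', C=1.0, random_state=1)
svm.fit(X, y)

# For a linear kernel, coef_ holds w and intercept_ holds w_0
w = svm.coef_[0]
w0 = svm.intercept_[0]

# Margin width 2 / ||w||
margin = 2.0 / np.linalg.norm(w)

# w_0 + w^T x should reproduce the classifier's decision values
scores = w0 + X @ w
matches = np.allclose(scores, svm.decision_function(X), atol=1e-6)

print("margin width:", margin)
print("number of support vectors:", len(svm.support_vectors_))
print("manual scores match decision_function:", matches)
```

Note that `make_blobs` labels the classes 0 and 1 rather than -1 and 1; scikit-learn handles that mapping internally, and positive decision values correspond to the class listed second in `svm.classes_`.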