# Support Vector Machines

Support vector machines are classification algorithms that divide a data set into categories by slicing **through the widest gap between categories.**  

In general, the Support Vector Machine copes with irrelevant data better than the K Nearest Neighbors algorithm, and it tends to handle outliers better as well. It often reaches similar accuracy, only much faster.  


## Vector Basics
- A vector has both a magnitude and a direction.
- The dot product tells you how much of one vector points in the direction of another. Here, the dot product is used to measure the distance from an example to the median of the street (the line of separation between the - and + examples).  
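
To make the two bullet points above concrete, here is a minimal sketch in plain Python. The helper names (`dot`, `magnitude`, `scalar_projection`) and the sample vectors are made up for illustration; they do not come from any particular library.

```python
import math

def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(ui * vi for ui, vi in zip(u, v))

def magnitude(v):
    """Length (magnitude) of a vector."""
    return math.sqrt(dot(v, v))

def scalar_projection(u, v):
    """How much of u points in the direction of v."""
    return dot(u, v) / magnitude(v)

u = (3.0, 4.0)   # hypothetical example vector
v = (1.0, 0.0)   # unit vector along the first axis

print(magnitude(u))             # 5.0
print(scalar_projection(u, v))  # 3.0
```

Dividing by the magnitude of `v` is what turns the raw dot product into a length along `v`. When `v` is already a unit vector, the dot product alone gives that length.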

More about vectors <a href="https://www.svm-tutorial.com/2014/11/svm-understanding-math-part-2/">here</a>


## How Do Support Vector Machines Work?
Given a set of training examples – each of which is marked for belonging to one of two categories – a support vector machine training algorithm builds a model. This model assigns new examples into one of the two categories. This makes the support vector machine a non-probabilistic binary linear classifier.

More specifically, an SVM model represents the data points as points in space and separates the categories with an open gap that is as wide as possible. New data points are predicted to belong to a category based on which side of the gap they fall on.
<br>
For example, consider this visualization:
<img src="https://www.freecodecamp.org/news/content/images/2020/06/image-57.png">
Here, if a new data point falls on the left side of the green line, it will be labeled with the red category. Similarly, if a new data point falls on the right side of the green line, it will be labeled as belonging to the blue category.
This green line is called a **hyperplane**, which is an important piece of vocabulary for support vector machine algorithms.

### What is a hyperplane?
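A hyperplane is a flat boundary with one dimension fewer than the space it divides: a line in 2 dimensions, a plane in 3 dimensions, and the same idea generalized to more than 3 dimensions.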
<img src="https://miro.medium.com/max/1280/1*H2QEWsP9-W4rBdIaxfVExg.jpeg">
<br>
Let’s take a look at a different visual representation of a support vector machine:
<img src="https://www.freecodecamp.org/news/content/images/2020/06/image-58.png">

In this diagram, the hyperplane is labelled as the **optimal hyperplane**. Support vector machine theory defines the optimal hyperplane as the one that maximizes the margin between the closest data points from each category.

As you can see, the margin lines touch three data points: two from the red category and one from the blue category. The data points that touch the margin lines are called **support vectors**, and they are where support vector machines get their name.  

Once you find the support vectors, you want to create margin lines that are **maximally separated** from each other. From here, we can easily find the decision boundary: it runs midway between the two margin lines, so take the total width and divide it by 2.


<a href="https://www.freecodecamp.org/news/a-no-code-intro-to-the-9-most-important-machine-learning-algorithms-today/">source</a>

### How could we define a hyperplane?

Let’s look at the two-dimensional case first. Two-dimensional linearly separable data can be separated by a line. The equation of the line is `y=ax+b`. We rename `x` to `x1` and `y` to `x2` and we get:
`ax1−x2+b=0`

If we define `x = (x1,x2)` and `w = (a,−1)`, we get:
`w⋅x+b=0`

This equation is derived from two-dimensional vectors. But in fact, it also works for any number of dimensions. This is the equation of the hyperplane.
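
As a quick check of the derivation above, the sketch below builds `w = (a, −1)` from a line and evaluates `w⋅x+b` for a few points. The line `y = 2x + 1`, the sample points, and the function name `hyperplane_value` are all hypothetical choices for illustration.

```python
def hyperplane_value(w, x, b):
    """Evaluate w . x + b for a point x."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b

a, b = 2.0, 1.0        # hypothetical line y = 2x + 1
w = (a, -1.0)          # w = (a, -1), as defined in the text

on_line = (3.0, 7.0)   # 7 = 2*3 + 1, so this point lies on the line
below = (3.0, 0.0)     # same x1, smaller x2
above = (3.0, 10.0)    # same x1, larger x2

print(hyperplane_value(w, on_line, b))  # 0.0
print(hyperplane_value(w, below, b))    # 7.0
print(hyperplane_value(w, above, b))    # -3.0
```

Note that with this particular `w`, points drawn below the line produce positive values and points drawn above it produce negative values. Which side counts as positive depends on the direction of `w`, not on how the picture is oriented.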


## Classifier
Once we have the hyperplane, we can use it to make predictions. We define the hypothesis function h as:  
![Capture.PNG](attachment:Capture.PNG)

The point above or on the hyperplane will be classified as class +1, and the point below the hyperplane will be classified as class -1.

Or, we can simply say that the classification function is just: `sign(x.w + b)`  [sign being positive or negative]
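
The classification rule described above can be sketched in a few lines of Python. The values of `w` and `b` here are hypothetical, picked only so the arithmetic is easy to follow; a real SVM would learn them from training data.

```python
def predict(w, b, x):
    """Return +1 if x is on or above the hyperplane w . x + b = 0, else -1."""
    score = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1 if score >= 0 else -1

w = (1.0, 1.0)   # hypothetical weight vector
b = -3.0         # hypothetical bias

print(predict(w, b, (2.0, 2.0)))  # 1   (score is 1)
print(predict(w, b, (1.0, 2.0)))  # 1   (score is 0, exactly on the hyperplane)
print(predict(w, b, (0.0, 1.0)))  # -1  (score is -2)
```

The `>= 0` comparison follows the convention stated above that a point lying on the hyperplane is assigned to class +1.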

So basically, the goal of the SVM learning algorithm is to find a hyperplane that separates the data accurately. There might be many such hyperplanes, so we need to find the best one, which is often referred to as the optimal hyperplane.


## SVM optimization problem
Let’s first consider the equation of the hyperplane `w⋅x+b=0`. We know that if a point `x` is on the hyperplane, `w⋅x+b=0`. If the point is not on the hyperplane, the value of `w⋅x+b` is either positive or negative. Among all the training examples, we want to know which point is closest to the hyperplane.

The catch is that we have no idea what `w` and `b` are. There are an infinite number of `w`'s and `b`'s that might satisfy our equation, but we want the "best" separating hyperplane: the one with the most width between the data it is separating. We can probably guess that this is going to play a part in the optimization of `w` and `b`.  
<br>
Once we find a `w` and `b` that satisfy the constrained problem (the vector `w` with the smallest magnitude, together with the largest `b`), our decision function for classifying unknown points simply asks for the sign of `x.w + b`.
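
The text above does not spell out the constraint itself, so the sketch below assumes the standard hard-margin formulation: every training example must satisfy `y * (w⋅x + b) >= 1`, and the width of the street is then `2 / ||w||`. The tiny data set and the candidate values of `w` and `b` are made up for illustration.

```python
import math

def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

def satisfies_constraints(w, b, points, labels):
    """Assumed hard-margin constraint: y * (w . x + b) >= 1 for every example."""
    return all(y * (dot(w, x) + b) >= 1 for x, y in zip(points, labels))

def margin_width(w):
    """Width of the street under the assumed formulation: 2 / ||w||."""
    return 2.0 / math.sqrt(dot(w, w))

# Hypothetical, linearly separable toy data.
points = [(1.0, 1.0), (2.0, 2.0), (4.0, 4.0), (5.0, 5.0)]
labels = [-1, -1, 1, 1]

wide = ((0.5, 0.5), -3.0)     # smaller ||w||
narrow = ((1.0, 1.0), -6.0)   # larger ||w||
too_small = ((0.1, 0.1), -0.6)

for w, b in (wide, narrow, too_small):
    print(w, b, satisfies_constraints(w, b, points, labels), margin_width(w))
```

Both `wide` and `narrow` satisfy the constraints, but `wide` has the smaller magnitude and therefore the wider margin. `too_small` shrinks `w` further and breaks the constraints, which is why minimizing the magnitude of `w` only makes sense together with them.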

The SVM's optimization problem is a convex problem, where the convex shape is the magnitude of vector w:
<img src="https://pythonprogramming.net/static/images/machine-learning/svm-convex-problem.png">  
<br>
The objective of this convex problem is to find the minimum magnitude of vector w. One way to solve convex problems is by *"stepping down"* until you cannot get any further down. You start with a large step, quickly getting down. Once you pass the bottom, you go back the other way with smaller steps, repeating this until you find the absolute bottom.
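
The stepping-down idea can be illustrated on a much simpler problem than the SVM itself. The sketch below minimizes a one-variable convex function with progressively smaller steps. The function `step_down_minimize`, the step sizes, and the bowl-shaped example function are assumptions made for this illustration; this is not a full SVM solver.

```python
def step_down_minimize(f, start, step_sizes=(1.0, 0.1, 0.01)):
    """Walk downhill on a convex function f, shrinking the step each round.

    Assumes f is convex and has a minimum; otherwise the loop may not stop.
    """
    x = start
    for step in step_sizes:
        while True:
            if f(x - step) < f(x):
                x -= step          # moving left goes further down
            elif f(x + step) < f(x):
                x += step          # moving right goes further down
            else:
                break              # neither direction helps at this step size
    return x

# Hypothetical convex "bowl" with its bottom at 2.37.
def bowl(x):
    return (x - 2.37) ** 2

best = step_down_minimize(bowl, start=10.0)
print(round(best, 2))
```

Large steps get near the bottom quickly, and each smaller step size refines the answer. Because the function is convex, there are no local dips to get stuck in.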

## SVM for Non-Linear Data Sets
- Sometimes, we cannot find a straight line to separate the two categories. In that case, we will use the **Kernel Trick!**  
<br>
- The basic idea is that when a data set is inseparable in the current dimensions, **add another dimension**, maybe that way the data will be separable.  
<br>
- To solve this problem, we shouldn’t just **blindly add another dimension**. We should transform the space so that we generate this difference in level between the categories intentionally.  
<br>
- Let's assume that we add another dimension called `X3`. The important part of the transformation is that in the new dimension the points are arranged using the formula `x3 = x1² + x2²`.  
Kernels are what let an SVM apply transformations like this. Popular kernels include the polynomial kernel and the Gaussian kernel, which is also known as the Radial Basis Function (RBF) kernel.  
(<a href="https://towardsdatascience.com/svm-and-kernel-svm-fed02bef1200">source</a>)
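
Here is a small sketch of the added dimension described in the list above. The points, the function name `lift`, and the threshold are invented for illustration: one category sits near the origin and the other surrounds it, so no straight line in `(x1, x2)` separates them.

```python
def lift(point):
    """Add a third coordinate x3 = x1^2 + x2^2 to a 2-D point."""
    x1, x2 = point
    return (x1, x2, x1 ** 2 + x2 ** 2)

# Hypothetical data: an inner cluster surrounded by an outer ring.
inner = [(0.5, 0.0), (0.0, -0.5), (-0.3, 0.4)]
outer = [(2.0, 0.0), (0.0, -2.0), (-1.5, 1.5)]

threshold = 1.0  # hypothetical cut along the new x3 axis

for p in inner + outer:
    x3 = lift(p)[2]
    print(p, "x3 =", round(x3, 2), "inner" if x3 < threshold else "outer")
```

In the lifted space, the flat plane `x3 = 1` separates the two groups, even though the original two-dimensional data needs a circle to do the same job.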

Kernels are similarity functions: they take two inputs and return a similarity computed with inner products. Not only can you create your own new machine learning algorithms with kernels, you can also adapt existing machine learning algorithms to use kernels.  

What kernels allow us to do is work in many dimensions without actually paying the processing cost of computing in them. Kernels do have a requirement: they rely on inner products ("dot product" here means the same as "inner product").  

To verify whether or not we can get away with using kernels, we need to confirm that every interaction with our feature space is an inner product. We'll start at the end, and work our way back to confirm this.  
If you look at the SVM decision function `y = sign(x.w + b)`, the input `x` only appears inside a dot product, so we can use kernels.

## Why Kernels?
Generally, kernels will be defined by something like: `k(x, x')`  
<br>
The kernel function is applied to x and x prime, and will equal the inner product of z and z prime, where the z values are from the z dimension (our new dimension space).
The z values are the result of some function(x), and these z values are dotted together to give us our kernel function's result.
<img src="https://pythonprogramming.net/static/images/machine-learning/z-values-kernel-function-of-x.png">
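
As one concrete instance of this idea, the sketch below uses a degree-2 polynomial kernel on 2-D inputs. The kernel `(x⋅x')²` gives the same number as first mapping both inputs to the z space with `phi` and then taking their dot product. The function names and sample inputs are chosen for illustration only.

```python
import math

def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

def phi(x):
    """Explicit map from 2-D x space to 3-D z space for this kernel."""
    x1, x2 = x
    return (x1 * x1, math.sqrt(2) * x1 * x2, x2 * x2)

def poly_kernel(x, x_prime):
    """Degree-2 polynomial kernel: works directly in x space."""
    return dot(x, x_prime) ** 2

x = (1.0, 2.0)
x_prime = (3.0, 4.0)

print(poly_kernel(x, x_prime))        # 121.0
print(dot(phi(x), phi(x_prime)))      # approximately 121.0
```

The kernel never builds the z vectors, yet it returns their inner product. That is the saving that makes higher-dimensional spaces affordable.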

There are quite a few pre-made kernels, but typically the default is the Radial Basis Function (RBF) kernel, since it corresponds to a feature space with an "infinite" number of dimensions.
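
For reference, the RBF kernel is usually written as `exp(-gamma * ||x - x'||²)`. The sketch below is a minimal plain-Python version. The parameter name `gamma` and its default value are assumptions made for this example, not settings taken from the text.

```python
import math

def rbf_kernel(x, x_prime, gamma=1.0):
    """RBF (Gaussian) kernel: similarity decays with squared distance."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, x_prime))
    return math.exp(-gamma * squared_distance)

origin = (0.0, 0.0)
near = (0.1, 0.1)
far = (3.0, 3.0)

print(rbf_kernel(origin, origin))  # 1.0 (identical points are maximally similar)
print(rbf_kernel(origin, near))
print(rbf_kernel(origin, far))
```

The result is always between 0 and 1: identical points score exactly 1, and the score falls toward 0 as the points move apart. A larger `gamma` makes that fall-off steeper.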

## Soft Margin Support Vector Machine
First, there are two major reasons why the soft-margin classifier might be superior. One reason is that our data is not perfectly linearly separable, but is very close, so it makes more sense to continue using the default linear kernel. The other reason is that, even if you are using a kernel, you may wind up with significant **overfitting** if you use a hard margin. For example, consider:
<img src="https://pythonprogramming.net/static/images/machine-learning/example%20data-not-linearly-separable.png">
Assuming a hard-margin and the support vector hyperplanes, it looks like this:
<img src="https://pythonprogramming.net/static/images/machine-learning/hard-margin-with-many-support-vectors.png">
In this case, every single data sample for the positive class is a support vector, and only two of the negative class aren't support vectors. This signals to us that there is a high chance of overfitting.  
<br>
What if we did something like this instead:
<img src="https://pythonprogramming.net/static/images/machine-learning/linear-soft-margin-example.png">
We have a couple of errors or violations, noted by arrows, but this is likely to classify future feature sets better overall. What we have here is a **"soft margin"** classifier, which allows for some **"slack"** on the errors that we might get in the optimization process.  
<br>
The closer to 0 the slack is, the more "hard-margin" we are. The higher the slack, the softer the margin is. If slack were 0, then we'd have a typical hard-margin classifier. As you might guess, however, we'd ideally like to minimize slack.
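
One common way to put numbers on slack, assumed here because the text does not give a formula, is `slack = max(0, 1 - y * (w⋅x + b))` per example, with the soft-margin objective `0.5 * ||w||² + C * (sum of slacks)`. The weight `C`, the function names, and the toy values below are all hypothetical.

```python
def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

def slack(w, b, x, y):
    """Assumed slack for one example: 0 when it is outside the margin."""
    return max(0.0, 1.0 - y * (dot(w, x) + b))

def soft_margin_objective(w, b, points, labels, C=1.0):
    """Assumed objective: small ||w|| plus a penalty C on the total slack."""
    total_slack = sum(slack(w, b, x, y) for x, y in zip(points, labels))
    return 0.5 * dot(w, w) + C * total_slack

w, b = (1.0, 0.0), 0.0   # hypothetical hyperplane x1 = 0

print(slack(w, b, (2.0, 0.0), 1))    # 0.0  (correct side, outside the margin)
print(slack(w, b, (0.5, 0.0), 1))    # 0.5  (correct side, inside the margin)
print(slack(w, b, (0.5, 0.0), -1))   # 1.5  (wrong side of the hyperplane)

points = [(2.0, 0.0), (0.5, 0.0), (-2.0, 0.0)]
labels = [1, 1, -1]
print(soft_margin_objective(w, b, points, labels))  # 1.0
```

With this formula, a slack of 0 for every example recovers the hard-margin case, and raising `C` pushes the optimizer to tolerate fewer violations.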