# Support Vector Machine (SVM)
***Objective***: To maximize the **Margin**, which is the distance between the separating **hyperplane** (i.e., the decision boundary) and the **Support Vectors** (i.e., the training samples that are closest to this hyperplane).

- ***Q***: Why do we need _decision boundaries with large margins_?
- ***A***: They tend to have a _lower generalization error_, whereas models with small margins are more prone to overfitting.

>The _positive_ and _negative_ hyperplanes that are parallel to the decision boundary can be expressed as follows:
$$w_0 + w^Tx_{pos} = 1$$
$$w_0 + w^Tx_{neg} = -1$$

>>By subtracting those two linear equations from each other, we get:
$$\Rightarrow w^T(x_{pos}-x_{neg}) =2$$

>>We can **normalize** this by the length of the vector $w$, which is defined as follows:
$$\|w\| =\sqrt{\sum_{j=1}^{m} w_j^2}$$

>So we arrive at the following equation:
### $$\frac{w^T(x_{pos}-x_{neg})}{\|w\|}=\frac{2}{\|w\|}$$
The left side of the equation can be interpreted as the _distance between the positive and negative hyperplanes_ (i.e., the **margin** that we want to maximize).
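The margin formula can be checked numerically. The sketch below uses hypothetical values for $w$ and $w_0$ (they are not taken from any fitted model), picks one point on each margin hyperplane, and compares the projected distance with $\frac{2}{\|w\|}$.

```python
import numpy as np

# Hypothetical weights and bias, chosen only for illustration
w = np.array([3.0, 4.0])
w0 = -1.0

w_norm = np.linalg.norm(w)  # sqrt(3^2 + 4^2) = 5

# One point on each margin hyperplane, taken along the direction of w
x_pos = (1 - w0) * w / w_norm**2   # satisfies w0 + w^T x = 1
x_neg = (-1 - w0) * w / w_norm**2  # satisfies w0 + w^T x = -1

# Project the difference onto the unit normal w / ||w||
margin = w @ (x_pos - x_neg) / w_norm

print(margin)      # distance between the two hyperplanes
print(2 / w_norm)  # closed-form value 2 / ||w||
```

Both printed values agree, and shrinking $\|w\|$ widens the margin.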

Now,

***Objective function of the SVM***: The objective is to maximize this margin, i.e., to maximize $\color{purple}{\frac{2}{\|w\|}}$, under the constraint that the samples are classified correctly, which can be written as follows:
$$w_0+w^Tx^{(i)}\geq 1 \;\text{if}\; y^{(i)} = 1$$
$$w_0+w^Tx^{(i)}\leq -1 \;\text{if}\; y^{(i)} = -1$$

This can also be written more compactly as follows:
### $$y^{(i)}(w_0+w^Tx^{(i)})\geq 1 \quad \forall i$$

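In practice, though, it is easier to minimize the equivalent term $\color{purple}{\frac{1}{2}\|w\|^2}$: maximizing $\frac{2}{\|w\|}$ is the same as minimizing $\|w\|$, and the squared form is a convex quadratic that can be solved by quadratic programming.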

***Dealing with nonlinearly separable data using slack variables***: The slack variable $\xi$ was introduced because the linear constraints need to be relaxed for nonlinearly separable data to allow "_convergence of the optimization in the presence of misclassifications under the appropriate cost penalization_."
The non-negative slack variable is simply added to the linear constraints:
$$w_0+w^Tx^{(i)}\geq 1 - \xi^{(i)} \;\text{if}\; y^{(i)} = 1$$
$$w_0+w^Tx^{(i)}\leq -1 + \xi^{(i)} \;\text{if}\; y^{(i)} = -1$$

So

**The new objective to be minimized:** 
### $$\frac{1}{2}\|w\|^2 + C(\sum_{i}\xi^{(i)})$$
Using the variable $C$, we can control the penalty for misclassification. Large values of $C$ correspond to large error penalties, whereas smaller values of $C$ make us less strict about misclassification errors. We can therefore use the parameter $C$ to control the width of the margin and tune the bias-variance trade-off.
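To make the role of $C$ concrete, the sketch below evaluates the soft-margin objective for a fixed, hypothetical hyperplane on a four-point toy set. The function name `soft_margin_objective` and the data are assumptions made for illustration. For a fixed $(w, w_0)$, the smallest slack that satisfies the relaxed constraints is $\xi^{(i)} = \max(0,\, 1 - y^{(i)}(w_0 + w^Tx^{(i)}))$, which is what the code computes.

```python
import numpy as np

def soft_margin_objective(w, w0, X, y, C):
    """Return (objective, slacks) for a fixed hyperplane (w, w0)."""
    # Functional margin y * (w0 + w^T x) of every sample
    scores = y * (X @ w + w0)
    # Smallest non-negative slack that satisfies the relaxed constraint
    slacks = np.maximum(0.0, 1.0 - scores)
    objective = 0.5 * np.dot(w, w) + C * slacks.sum()
    return objective, slacks

# Toy data (hypothetical): the last point lies on the wrong side of the boundary
X_toy = np.array([[2.0, 0.0], [0.5, 0.0], [-2.0, 0.0], [0.5, 0.0]])
y_toy = np.array([1, 1, -1, -1])
w_toy = np.array([1.0, 0.0])
w0_toy = 0.0

for C in (0.1, 1.0, 10.0):
    objective, slacks = soft_margin_objective(w_toy, w0_toy, X_toy, y_toy, C)
    print(f"C={C}: slacks={slacks}, objective={objective}")
```

The slacks are identical in every iteration because the hyperplane is fixed; only their weight in the objective changes. With a large $C$ the optimizer is pushed to reduce the slacks, even at the cost of a narrower margin.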

In [None]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

In [None]:
df = pd.read_csv('Classified Data', index_col=0)

## Data Exploration

## Data Cleaning

## Building the Model


### Creating features and Labels

### Preprocessing/Scaling
The `sklearn.preprocessing` package provides several common utility functions and transformer classes to change raw feature vectors into a representation that is more suitable for the downstream estimators.

*Standardization of datasets* is a common requirement for many machine learning estimators implemented in scikit-learn; they might behave badly if the individual features do not more or less look like standard normally distributed data: *Gaussian with zero mean and unit variance.*

In practice we often ignore the shape of the distribution and just transform the data to center it by removing the mean value of each feature, then scale it by dividing non-constant features by their standard deviation.
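A minimal sketch of this centering and scaling step with `StandardScaler`, run on a small hypothetical array (`X_demo` is made up for illustration and is not the notebook's data):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical features on very different scales
X_demo = np.array([[1.0, 200.0],
                   [2.0, 300.0],
                   [3.0, 400.0]])

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_demo)  # subtract column mean, divide by column std

print(X_scaled.mean(axis=0))  # approximately 0 for each feature
print(X_scaled.std(axis=0))   # approximately 1 for each feature
```

When a held-out test set is used, the scaler is usually fitted on the training split only and then applied to the test split with `transform`, so that no test statistics leak into training.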

### Splitting the dataset
While experimenting with any learning algorithm, it is important not to test the prediction of an estimator on the data used to fit the estimator as this would not be evaluating the performance of the estimator on new data. This is why datasets are often split into train and test data.
#### A random permutation, to split the data randomly
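```python
# Assumes X and y are NumPy arrays (for pandas objects, index with .iloc instead)
np.random.seed(42)  # fix the seed so the shuffle is reproducible
indices = np.random.permutation(len(X))  # shuffled row indices
# Hold out the last 20 shuffled samples for testing
X_train = X[indices[:-20]]
y_train = y[indices[:-20]]
X_test = X[indices[-20:]]
y_test = y[indices[-20:]]
```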
#### But we will use the `train_test_split` function from `sklearn.model_selection`

In [1]:
from sklearn.model_selection import train_test_split

In [None]:
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=101)

### Importing the Model

### Create and fit a Classifier

## Predictions

## Evaluation

In [3]:
from sklearn.metrics import classification_report, confusion_matrix
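The steps outlined in the headings above (scaling, splitting, fitting, predicting, evaluating) can be sketched end to end as follows. This is only an illustration under stated assumptions: it uses synthetic data from `make_classification` rather than the notebook's `Classified Data` file, and the linear kernel and `C=1.0` are arbitrary choices, not tuned values.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the real dataset (assumption)
X_syn, y_syn = make_classification(n_samples=200, n_features=4, random_state=101)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, test_size=0.3, random_state=101)

# Fit the scaler on the training split only, then apply it to both splits
scaler = StandardScaler().fit(X_tr)
X_tr_scaled = scaler.transform(X_tr)
X_te_scaled = scaler.transform(X_te)

# C controls the misclassification penalty discussed earlier
clf = SVC(kernel="linear", C=1.0)
clf.fit(X_tr_scaled, y_tr)

pred = clf.predict(X_te_scaled)
cm = confusion_matrix(y_te, pred)
print(cm)
print(classification_report(y_te, pred))
```

Rerunning the sketch with different values of `C` is a simple way to observe the bias-variance trade-off on held-out data.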