# Support Vector Machine

This tutorial is based on the applied exercises in the book <font color="orange">"An Introduction to Statistical Learning with Applications in R"</font> (ISLR). It demonstrates how to use R to train an SVM model on real-world datasets. Besides the exercises covered here, the other applied exercises in the book are worth trying on your own.

The R package used for SVMs is `e1071`, which provides the `svm()` and `tune()` functions.

## Task 1

We have seen that we can fit an SVM with a non-linear kernel in order to perform classification using a non-linear decision boundary. We will now see that we can also obtain a non-linear decision boundary by performing logistic regression using non-linear transformations of the features. 

### (a) 
Generate a data set with n = 500 and p = 2, such that the observations belong to two classes with a quadratic decision boundary between them. For instance, you can do this as follows: 

In [None]:
x1=runif(500)-0.5
x2=runif(500)-0.5
y=1*(x1^2-x2^2 > 0) 

### (b) 
Plot the observations, colored according to their class labels. Your plot should display X1 on the x-axis and X2 on the y-axis.

In [None]:
plot(x1, x2, xlab = "X1", ylab = "X2", col = (4 - y), pch = (3 - y))

### (c) 
Fit a logistic regression model to the data, using X1 and X2 as predictors.
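A minimal sketch of this step, assuming the simulated variables from (a) are collected in a data frame named `data`. The seed and the object name `lm.fit` are illustrative assumptions, not taken from the original notebook.

```r
# Regenerate the data from (a); the seed is an assumption added for reproducibility
set.seed(1)
x1 <- runif(500) - 0.5
x2 <- runif(500) - 0.5
y <- 1 * (x1^2 - x2^2 > 0)
data <- data.frame(x1 = x1, x2 = x2, y = y)

# Logistic regression with X1 and X2 entering linearly
lm.fit <- glm(y ~ x1 + x2, data = data, family = binomial)

# The coefficient table reports a z-test p-value for each predictor
summary(lm.fit)
```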

The results show that neither variable is statistically significant for predicting y.

### (d) 
* Apply this model to the training data in order to obtain a predicted class label for each training observation. 

* Plot the observations, colored according to the predicted class labels. The decision boundary should be linear. 
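The two steps above could be sketched as follows. The 0.5 probability threshold and the names `lm.prob` and `lm.pred` are assumptions chosen for illustration; the plot reuses the color and symbol coding from (b).

```r
# Regenerate the data and the linear logistic fit (seed is an assumption)
set.seed(1)
x1 <- runif(500) - 0.5
x2 <- runif(500) - 0.5
y <- 1 * (x1^2 - x2^2 > 0)
data <- data.frame(x1 = x1, x2 = x2, y = y)
lm.fit <- glm(y ~ x1 + x2, data = data, family = binomial)

# Predicted probabilities on the training data, thresholded at 0.5
lm.prob <- predict(lm.fit, data, type = "response")
lm.pred <- ifelse(lm.prob > 0.5, 1, 0)

# Color and symbol follow the predicted class rather than the true class
plot(data$x1, data$x2, col = 4 - lm.pred, pch = 3 - lm.pred,
     xlab = "X1", ylab = "X2")
```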

The decision boundary is linear, as seen in the figure.

### (e) 
Now fit a logistic regression model to the data using non-linear functions of $X_1$ and $X_2$ as predictors (e.g. $X_1^2$, $X_1 \times X_2$, $\log(X_2)$, and so forth).
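One possible choice of non-linear terms is sketched below: quadratic polynomials in each feature plus the interaction. This particular formula is an assumption, and other transformations would work as well. Because the classes here are perfectly separated by $X_1^2 - X_2^2$, `glm()` will typically warn that fitted probabilities are numerically 0 or 1; such separation inflates the standard errors, which can leave every coefficient looking insignificant.

```r
# Regenerate the data from (a); the seed is an assumption added for reproducibility
set.seed(1)
x1 <- runif(500) - 0.5
x2 <- runif(500) - 0.5
y <- 1 * (x1^2 - x2^2 > 0)
data <- data.frame(x1 = x1, x2 = x2, y = y)

# Logistic regression on quadratic terms and the x1 * x2 interaction
nl.fit <- glm(y ~ poly(x1, 2) + poly(x2, 2) + I(x1 * x2),
              data = data, family = binomial)
summary(nl.fit)
```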

Here again, none of the variables is statistically significant.

### (f) 
* Apply this model to the training data in order to obtain a predicted class label for each training observation. 
* Plot the observations, colored according to the predicted class labels. The decision boundary should be obviously non-linear. If it is not, then repeat (a)-(e) until you come up with an example in which the predicted class labels are obviously non-linear. 
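A sketch of the prediction and plot for the non-linear model, under the same assumptions as before (illustrative seed, quadratic-plus-interaction formula, 0.5 threshold). The names `nl.prob` and `nl.pred` are hypothetical.

```r
# Regenerate the data and the non-linear logistic fit (seed and formula are assumptions)
set.seed(1)
x1 <- runif(500) - 0.5
x2 <- runif(500) - 0.5
y <- 1 * (x1^2 - x2^2 > 0)
data <- data.frame(x1 = x1, x2 = x2, y = y)
nl.fit <- glm(y ~ poly(x1, 2) + poly(x2, 2) + I(x1 * x2),
              data = data, family = binomial)

# Predicted class labels on the training data
nl.prob <- predict(nl.fit, data, type = "response")
nl.pred <- ifelse(nl.prob > 0.5, 1, 0)

# Training accuracy, then the plot colored by predicted class
mean(nl.pred == data$y)
plot(data$x1, data$x2, col = 4 - nl.pred, pch = 3 - nl.pred,
     xlab = "X1", ylab = "X2")
```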

The non-linear decision boundary closely resembles the true decision boundary.

### (g) 
* Fit a support vector classifier to the data with X1 and X2 as predictors. 
* Obtain a class prediction for each training observation. 
* Plot the observations, colored according to the predicted class labels. 

In [None]:
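library(e1071)
# Assumes `data` holds x1, x2 and y from (a), e.g. data <- data.frame(x1 = x1, x2 = x2, y = y)
# svm() performs classification only when the response is a factor
data$y <- as.factor(data$y)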
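The fit and the predictions used by the plotting cells could look like the sketch below. The value `cost = 0.01` and the name `svm.fit` are assumptions; only `preds` is a name the plotting code relies on.

```r
library(e1071)

# Regenerate the data from (a); the seed is an assumption added for reproducibility
set.seed(1)
x1 <- runif(500) - 0.5
x2 <- runif(500) - 0.5
y <- 1 * (x1^2 - x2^2 > 0)
data <- data.frame(x1 = x1, x2 = x2, y = as.factor(y))

# Support vector classifier: an SVM with a linear kernel (cost value is an assumption)
svm.fit <- svm(y ~ x1 + x2, data = data, kernel = "linear", cost = 0.01)

# Class prediction for each training observation
preds <- predict(svm.fit, data)

# How many observations fall into each predicted class
table(preds)
```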

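The plotting code below depends on the predicted values in `preds`. If no observation is predicted as class 0, the first subset is empty and `plot()` raises an error; in that case, run the commented-out alternative cell, which draws class 1 first.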

In [None]:
plot(data[preds == 0, ]$x1, data[preds == 0, ]$x2, col = (4 - 0), pch = (3 - 0), xlab = "X1", ylab = "X2")
points(data[preds == 1, ]$x1, data[preds == 1, ]$x2, col = (4 - 1), pch = (3 - 1))

In [None]:
# Alternative for the case where no observation is predicted as class 0: draw class 1 first.
# plot(data[preds == 1, ]$x1, data[preds == 1, ]$x2, col = (4 - 1), pch = (3 - 1), xlab = "X1", ylab = "X2")
# points(data[preds == 0, ]$x1, data[preds == 0, ]$x2, col = (4 - 0), pch = (3 - 0))

Even with a low cost, this support vector classifier assigns all points to a single class.

### (h) 
* Fit an SVM using a non-linear kernel to the data. 
* Obtain a class prediction for each training observation. 
* Plot the observations, colored according to the predicted class labels. 

In [None]:
data$y <- as.factor(data$y)
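A sketch of the non-linear fit, assuming a radial kernel with `gamma = 1`. Both the kernel choice and the gamma value are assumptions; the exercise only asks for some non-linear kernel.

```r
library(e1071)

# Regenerate the data from (a); the seed is an assumption added for reproducibility
set.seed(1)
x1 <- runif(500) - 0.5
x2 <- runif(500) - 0.5
y <- 1 * (x1^2 - x2^2 > 0)
data <- data.frame(x1 = x1, x2 = x2, y = as.factor(y))

# SVM with a radial kernel (gamma value is an assumption)
svm.radial <- svm(y ~ x1 + x2, data = data, kernel = "radial", gamma = 1)
preds <- predict(svm.radial, data)

# Convert the factor predictions back to 0/1 to reuse the color coding from (b)
cls <- as.numeric(as.character(preds))
plot(data$x1, data$x2, col = 4 - cls, pch = 3 - cls, xlab = "X1", ylab = "X2")
```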

Here again, the non-linear decision boundary closely resembles the true decision boundary.

### (i) 
Comment on your results. 

We may conclude that an SVM with a non-linear kernel and logistic regression with interaction terms are both very effective at finding non-linear decision boundaries. Conversely, an SVM with a linear kernel and logistic regression without any interaction terms perform poorly when the true boundary is non-linear. One argument in favor of the SVM is that logistic regression requires some manual work to find the right interaction terms, whereas with the SVM we only need to tune gamma.

## Task 2 

This problem involves the OJ data set which is part of the ISLR package. 

### (a) 
Create a training set containing a random sample of 800 observations, and a test set containing the remaining observations. 

In [None]:
library(ISLR)

set.seed(1)
train <- sample(nrow(OJ), 800)
OJ.train <- OJ[train, ]
OJ.test <- OJ[-train, ]

### (b) 
Fit a support vector classifier to the training data using cost=0.01, with Purchase as the response and the other variables as predictors. Use the summary() function to produce summary statistics, and describe the results obtained. 
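A minimal sketch of this fit, repeating the split from (a) so that it runs on its own. The object name `svm.linear` is an assumption. The support vector counts printed by `summary()` depend on the random split, so they may differ from the figures quoted in the text if your R version draws a different sample for the same seed.

```r
library(ISLR)
library(e1071)

# Same split as in (a)
set.seed(1)
train <- sample(nrow(OJ), 800)
OJ.train <- OJ[train, ]
OJ.test <- OJ[-train, ]

# Support vector classifier: linear kernel with cost = 0.01
svm.linear <- svm(Purchase ~ ., data = OJ.train, kernel = "linear", cost = 0.01)

# Reports the kernel, the cost, and the number of support vectors per class
summary(svm.linear)
```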

The support vector classifier uses 432 support vectors out of 800 training points. Of these, 217 belong to level MM and the remaining 215 belong to level CH.

### (c) 
What are the training and test error rates?

In [None]:
library(caret)
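The error rates can be computed in base R as the share of misclassified observations, as sketched below. The helper `error_rate()` is a hypothetical name introduced here. With `caret` loaded, `confusionMatrix()` reports the accuracy instead, and the error rate is one minus that value.

```r
library(ISLR)
library(e1071)

# Same split and classifier as in (a) and (b)
set.seed(1)
train <- sample(nrow(OJ), 800)
OJ.train <- OJ[train, ]
OJ.test <- OJ[-train, ]
svm.linear <- svm(Purchase ~ ., data = OJ.train, kernel = "linear", cost = 0.01)

# Hypothetical helper: misclassification rate of a fitted model on a data set
error_rate <- function(model, newdata) {
  mean(predict(model, newdata) != newdata$Purchase)
}

train.err <- error_rate(svm.linear, OJ.train)
test.err <- error_rate(svm.linear, OJ.test)
c(train = train.err, test = test.err)
```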

### (d) 
Use the tune() function to select an optimal cost. Consider values in the range 0.01 to 10. 
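A sketch of the cross-validated search, assuming a grid of costs spaced evenly on the log scale between 0.01 and 10. The grid and the second seed are assumptions; `tune()` uses 10-fold cross-validation by default, so the selected value can vary with the seed.

```r
library(ISLR)
library(e1071)

# Same split as in (a)
set.seed(1)
train <- sample(nrow(OJ), 800)
OJ.train <- OJ[train, ]

# Candidate costs from 0.01 to 10 (grid is an assumption)
cost.grid <- 10^seq(-2, 1, by = 0.25)

# Cross-validation over the cost grid for the linear kernel
set.seed(2)
tune.out <- tune(svm, Purchase ~ ., data = OJ.train, kernel = "linear",
                 ranges = list(cost = cost.grid))
summary(tune.out)

# Cost with the lowest cross-validation error
tune.out$best.parameters
```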

The cross-validation results indicate that the optimal cost is 0.1.

### (e) 
Compute the training and test error rates using this new value for cost.
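A sketch of the refit, plugging in the cost of 0.1 reported in (d). Substitute your own tuned value if your search selects a different one. The helper `error_rate()` is a hypothetical name.

```r
library(ISLR)
library(e1071)

# Same split as in (a)
set.seed(1)
train <- sample(nrow(OJ), 800)
OJ.train <- OJ[train, ]
OJ.test <- OJ[-train, ]

# Refit the linear classifier with the tuned cost (0.1 is the value quoted in (d))
svm.tuned <- svm(Purchase ~ ., data = OJ.train, kernel = "linear", cost = 0.1)

# Hypothetical helper: misclassification rate of a fitted model on a data set
error_rate <- function(model, newdata) {
  mean(predict(model, newdata) != newdata$Purchase)
}

train.err <- error_rate(svm.tuned, OJ.train)
test.err <- error_rate(svm.tuned, OJ.test)
c(train = train.err, test = test.err)
```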

### (f) 
Repeat parts (b) through (e) using a support vector machine with a radial kernel. Use the default value for “gamma”.
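The whole sequence for the radial kernel might be sketched as follows. The initial fit here leaves `cost` at its default, which is an assumption about how the figures in the text were obtained; the cost grid, the second seed, and the helper `error_rate()` are also assumptions.

```r
library(ISLR)
library(e1071)

# Same split as in (a)
set.seed(1)
train <- sample(nrow(OJ), 800)
OJ.train <- OJ[train, ]
OJ.test <- OJ[-train, ]

# Hypothetical helper: misclassification rate of a fitted model on a data set
error_rate <- function(model, newdata) {
  mean(predict(model, newdata) != newdata$Purchase)
}

# Radial kernel with default gamma (and default cost, an assumption)
svm.radial <- svm(Purchase ~ ., data = OJ.train, kernel = "radial")
summary(svm.radial)
c(train = error_rate(svm.radial, OJ.train), test = error_rate(svm.radial, OJ.test))

# Cross-validate the cost, then evaluate the best model found
set.seed(2)
tune.radial <- tune(svm, Purchase ~ ., data = OJ.train, kernel = "radial",
                    ranges = list(cost = 10^seq(-2, 1, by = 0.25)))
best.radial <- tune.radial$best.model
c(train = error_rate(best.radial, OJ.train), test = error_rate(best.radial, OJ.test))
```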

The radial kernel with default gamma produces 379 support vectors, of which 188 belong to level CH and the remaining 191 belong to level MM. The classifier has a training error of 14.5% and a test error of 17%, which is a slight improvement over the linear kernel. We now use cross-validation to find the optimal cost.

### (g) 
Repeat parts (b) through (e) using a support vector machine with a polynomial kernel. Set degree=2. 
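The same sequence for a polynomial kernel of degree 2 is sketched below, under the same assumptions as for the radial kernel (default cost for the initial fit, an illustrative cost grid and seed, and the hypothetical helper `error_rate()`).

```r
library(ISLR)
library(e1071)

# Same split as in (a)
set.seed(1)
train <- sample(nrow(OJ), 800)
OJ.train <- OJ[train, ]
OJ.test <- OJ[-train, ]

# Hypothetical helper: misclassification rate of a fitted model on a data set
error_rate <- function(model, newdata) {
  mean(predict(model, newdata) != newdata$Purchase)
}

# Polynomial kernel of degree 2 (default cost is an assumption)
svm.poly <- svm(Purchase ~ ., data = OJ.train, kernel = "polynomial", degree = 2)
summary(svm.poly)
c(train = error_rate(svm.poly, OJ.train), test = error_rate(svm.poly, OJ.test))

# Cross-validate the cost with the degree held at 2
set.seed(2)
tune.poly <- tune(svm, Purchase ~ ., data = OJ.train, kernel = "polynomial",
                  degree = 2, ranges = list(cost = 10^seq(-2, 1, by = 0.25)))
best.poly <- tune.poly$best.model
c(train = error_rate(best.poly, OJ.train), test = error_rate(best.poly, OJ.test))
```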

The polynomial kernel with default gamma produces 454 support vectors, of which 224 belong to level CH and the remaining 230 belong to level MM. The classifier has a training error of 17.2% and a test error of 18.8%, which is no improvement over the linear kernel. We now use cross-validation to find the optimal cost.

Tuning reduces both the training and test error rates.

Overall, which approach seems to give the best results on this data?

Overall, the radial basis kernel seems to produce the lowest misclassification error on both the training and test data.