# Introduction to Support Vector Machines

The goal of a support vector machine (SVM) is to find the optimal separating hyperplane, that is, the one that maximizes the margin between the classes in the training data. The first thing this definition tells us is that an SVM needs training data, which makes it a supervised learning algorithm. It is also important to know that an SVM is a classification algorithm, which means we will use it to predict whether something belongs to a particular class.
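
To make "maximizing the margin" concrete, here is a minimal sketch on four made-up 2-D points (not the lab data). It assumes a linear kernel and a very large `C` to approximate a hard-margin SVM; for a linear SVM the margin width is 2 / ||w||, where w is the normal vector of the hyperplane.

```python
import numpy as np
from sklearn import svm

# Toy, linearly separable data (made up for illustration).
X_toy = np.array([[0.0, 0.0], [0.0, 1.0], [2.0, 0.0], [2.0, 1.0]])
y_toy = np.array([0, 0, 1, 1])

# A very large C approximates a hard-margin SVM.
toy_clf = svm.SVC(kernel="linear", C=1e6)
toy_clf.fit(X_toy, y_toy)

w = toy_clf.coef_[0]        # normal vector of the hyperplane w.x + b = 0
b = toy_clf.intercept_[0]   # offset of the hyperplane
margin_width = 2.0 / np.linalg.norm(w)

print("w =", w, "b =", b)
print("margin width =", margin_width)
print("support vectors:")
print(toy_clf.support_vectors_)
```

For these points the widest separating band lies between x = 0 and x = 2, so the reported margin width should come out close to 2.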

The page at http://www.svm-tutorial.com/svm-tutorial/ contains a basic tutorial on SVMs. It is highly recommended that you read it before proceeding with this lab exercise.

In the first part, you will use scikit-learn's built-in SVM function.
Note: *Stack Overflow* is a programmer's best friend. If you have any doubts, syntax-related or otherwise, there is a high probability that someone has already posted about it.

We will use the breast cancer database provided by the UCI Machine Learning Repository.

* The file https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/breast-cancer-wisconsin.data contains the breast cancer database.

* The file https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/breast-cancer-wisconsin.names describes what each attribute in the data represents.

* In short, Class <b>2</b> represents a benign tumour and Class <b>4</b> represents a malignant tumour.
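
The raw `.data` file is plain comma-separated text, and the code below refers to columns named `id` and `class`. The sketch that follows shows one way to attach column names while loading and to map the class codes to readable labels. The short column identifiers are this sketch's own choice, loosely following the attribute list in the `.names` file, and the three rows are made up to stand in for the real file.

```python
import io
import pandas as pd

# Hypothetical short names, in the attribute order given by the .names file.
COLUMNS = [
    "id", "clump_thickness", "unif_cell_size", "unif_cell_shape",
    "marg_adhesion", "single_epith_cell_size", "bare_nuclei",
    "bland_chrom", "norm_nucleoli", "mitoses", "class",
]

# Made-up rows in the same layout as the real file ('?' marks a missing value).
sample = io.StringIO(
    "1,5,1,1,1,2,1,3,1,1,2\n"
    "2,8,10,10,8,7,10,9,7,1,4\n"
    "3,6,8,8,1,3,?,3,7,1,4\n"
)

sample_df = pd.read_csv(sample, names=COLUMNS)

# Class 2 is benign, class 4 is malignant.
sample_df["diagnosis"] = sample_df["class"].map({2: "benign", 4: "malignant"})
print(sample_df[["id", "bare_nuclei", "class", "diagnosis"]])
```

Passing `names=` is only needed when the file has no header row; if your local copy already starts with a header line, a plain `pd.read_csv(path)` is enough.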

## Using the *scikit-learn* SVM function

In [12]:
# Import the required libraries.

import numpy as np

# Before sklearn 0.18, cross_validation was imported like this:
# from sklearn import preprocessing, cross_validation, neighbors, svm

from sklearn import preprocessing, neighbors, svm
from sklearn.model_selection import cross_validate, train_test_split
import pandas as pd 

In [13]:
# Read the required data. 
# This data is from UCI Machine Learning repository's Breast cancer database.
# https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/breast-cancer-wisconsin.data
# Refer to https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/breast-cancer-wisconsin.names
# for more info on data.

df = pd.read_csv('data/breast-cancer-wisconsin.data.txt')
# Handle missing attributes: '?' in the data denotes a missing value.
df.replace('?', -99999, inplace=True)
# Drop the sample id column (keyword form; the positional axis argument was removed in pandas 2.0).
df.drop(columns=['id'], inplace=True)
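
The cell above keeps every row and marks missing values with the extreme sentinel -99999. A common alternative, shown here only as a sketch and not as the method this lab uses, is to count the `?` markers and drop the affected rows. The small frame is made up for illustration.

```python
import pandas as pd

# Made-up frame with one missing entry marked by '?'.
demo = pd.DataFrame({
    "bare_nuclei": ["1", "10", "?", "2"],
    "class": [2, 4, 4, 2],
})

# Count how many '?' markers the column holds.
missing_count = demo["bare_nuclei"].eq("?").sum()
print("missing entries:", missing_count)

# Alternative to the sentinel: turn '?' into NaN, then drop those rows.
dropped = demo.copy()
dropped["bare_nuclei"] = pd.to_numeric(dropped["bare_nuclei"], errors="coerce")
dropped = dropped.dropna()

print(len(demo), "rows before,", len(dropped), "rows after dropping")
```

Which option works better depends on the classifier: a sentinel keeps all samples but adds an extreme outlier to the feature, while dropping rows loses data but keeps the feature range intact.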

In [14]:
# Get the features and labels.
X = np.array(df.drop(columns=['class']))
y = np.array(df['class'])

In [15]:
# Use a random 80/20 split of the data for training and testing, respectively.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
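
Because the split above is random, the accuracy changes from run to run. The sketch below shows two optional `train_test_split` arguments that make results easier to compare: `random_state` fixes the shuffle, and `stratify` keeps the class proportions the same in both parts. The data is made up, and the seed value 42 is arbitrary.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up data: 100 samples, 9 features, labels 2 (65 samples) and 4 (35 samples).
rng = np.random.default_rng(0)
X_demo = rng.integers(1, 11, size=(100, 9))
y_demo = np.array([2] * 65 + [4] * 35)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo,
    test_size=0.2,
    random_state=42,   # fixed seed: the same split on every run
    stratify=y_demo,   # preserve the class balance in both parts
)

print(X_tr.shape, X_te.shape)
print("share of class 4 in the test set:", np.mean(y_te == 4))
```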

In [16]:
# Using scikit learn's SVM function
clf = svm.SVC()
clf.fit(X_train, y_train)

SVC()

In [17]:
# Compute the accuracy on the held-out test set.
accuracy = clf.score(X_test, y_test)
print(accuracy)

0.6428571428571429
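
An accuracy figure is easier to judge next to a trivial baseline. The sketch below uses scikit-learn's `DummyClassifier` to always predict the most frequent class. The labels are made up with a 65/35 balance, which is roughly the benign/malignant ratio reported in the `.names` file.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Made-up labels with a 65/35 class balance.
y_train_demo = np.array([2] * 65 + [4] * 35)
y_test_demo = np.array([2] * 13 + [4] * 7)

# The baseline ignores the features, so a column of zeros is enough here.
X_train_demo = np.zeros((len(y_train_demo), 1))
X_test_demo = np.zeros((len(y_test_demo), 1))

baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X_train_demo, y_train_demo)
baseline_accuracy = baseline.score(X_test_demo, y_test_demo)
print(baseline_accuracy)
```

A classifier that scores close to this baseline is doing little better than always answering "benign", so it is worth running the same check on the real `X_train`, `y_train`, `X_test` and `y_test`.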


In [18]:
# Finding out the prediction of this SVM classifier on new data points. 
# Here [4,2,1,1,1,2,3,2,1] and [4,2,1,2,2,2,3,2,1] are the two data points
example_measures = np.array([[4,2,1,1,1,2,3,2,1], [4,2,1,2,2,2,3,2,1]])

# To get rid of the deprecation warning about 1-D input, reshape to (n_samples, n_features) with the line below.
example_measures = example_measures.reshape(len(example_measures), -1)

prediction = clf.predict(example_measures)
print(prediction)


[2 2]
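
The reshape in the cell above matters most when predicting a single data point: scikit-learn estimators expect a 2-D array of shape `(n_samples, n_features)`. A minimal sketch of the shapes involved, using the same example values:

```python
import numpy as np

# A single sample written as a flat list is 1-D.
single = np.array([4, 2, 1, 1, 1, 2, 3, 2, 1])
print(single.shape)

# reshape(1, -1) turns it into one row with nine feature columns.
single_2d = single.reshape(1, -1)
print(single_2d.shape)

# A batch of two samples is already 2-D, so the reshape leaves it unchanged.
batch = np.array([[4, 2, 1, 1, 1, 2, 3, 2, 1], [4, 2, 1, 2, 2, 2, 3, 2, 1]])
batch_2d = batch.reshape(len(batch), -1)
print(batch_2d.shape)
```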


#### Questions:

  1. What do you think is the significance of the `df.drop` call that removes the `id` column in the above example? Run the program with this line commented out. What do you observe, and why?
  2. Do you observe any difference in the accuracy or predictions of the SVM and k-NN classifiers?
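
For the second question, both classifiers can be fitted on the same split and scored side by side. The sketch below assumes default hyperparameters for both models and uses synthetic data from `make_classification` as a stand-in; swap in the notebook's `X` and `y` for the real comparison.

```python
from sklearn import neighbors, svm
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Synthetic stand-in data with 9 features, like the breast cancer set.
X_syn, y_syn = make_classification(n_samples=300, n_features=9, random_state=0)
X_a, X_b, y_a, y_b = train_test_split(X_syn, y_syn, test_size=0.2, random_state=0)

models = {
    "SVM": svm.SVC(),
    "k-NN": neighbors.KNeighborsClassifier(),
}

# Fit each model on the same training part and score it on the same test part.
scores = {}
for name, model in models.items():
    model.fit(X_a, y_a)
    scores[name] = model.score(X_b, y_b)
    print(name, scores[name])
```

Using one shared split keeps the comparison fair, since any difference in the scores then comes from the models rather than from the data they happened to see.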