<a href="https://colab.research.google.com/github/vt-ai-ml/fall2019-meetings/blob/master/SVM.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Support Vector Machines

In [0]:
# import statements, must always run
from sklearn.svm import SVC
from sklearn import feature_extraction, model_selection
import matplotlib.pyplot as plt
import pandas as pd
from collections import Counter

%matplotlib inline

## General Idea

Motivating example:
Assume we have some fruits: pears (green) and apples (blue). Let's say we plot them by their weight (x-axis). We then find an unknown fruit (black). Is it a pear or an apple?

(Note: this is a binary classification problem; that is, we want to assign each data point one of two labels.)

In [0]:
gx = [1, 2, 3]
bx = [5, 6, 7]
y = [1, 1, 1]

plt.scatter(gx, y, c='green')
plt.scatter(bx, y, c='blue')
plt.scatter([4.5], [1], c='black')
plt.show()

Solution: We can draw a line on the graph. Everything on the left side we call a pear, and everything on the right side we call an apple.
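
As a minimal sketch, this rule can be written as a threshold on weight. The function name `classify` and the default boundary of 3.5 (the orange line in the next plot) are illustrative choices, not part of any library.

```python
def classify(weight, boundary=3.5):
    """Label a fruit by which side of the boundary its weight falls on."""
    # left of the line -> pear, right of the line -> apple
    return 'pear' if weight < boundary else 'apple'

# the unknown (black) fruit from the plot weighs 4.5
print(classify(4.5))  # apple
```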

In [0]:
gx = [1, 2, 3]
bx = [5, 6, 7]
y = [1, 1, 1]

plt.scatter(gx, y, c='green')
plt.scatter(bx, y, c='blue')
plt.plot([3.5, 3.5], [.995, 1.005], 'orange')
plt.scatter([4.5], [1], c='black')
plt.show()

Are there other possible lines we could draw instead?

In [0]:
gx = [1, 2, 3]
bx = [5, 6, 7]
y = [1, 1, 1]

plt.scatter(gx, y, c='green')
plt.scatter(bx, y, c='blue')

plt.plot([3.5, 3.5], [.995, 1.005], 'orange')
plt.plot([4, 4], [.995, 1.005], 'red')
plt.plot([4.5, 4.5], [.995, 1.005], 'yellow')
plt.show()

How do we determine which line is the best one? Why?

In [0]:
gx = [1, 2, 3]
bx = [5, 6, 7]
y = [1, 1, 1]

plt.scatter(gx, y, c='green')
plt.scatter(bx, y, c='blue')

plt.plot([4, 4], [.995, 1.005], 'red')
plt.show()

Solution: We want to maximize the "margin".

Margin: distance between the line (hyperplane) and the closest data points.
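
To make the definition concrete, the margin of each candidate line from the earlier plot can be computed directly. This is a small sketch in plain Python; the helper name `margin` is made up for illustration.

```python
# pear and apple weights from the plots
pears = [1, 2, 3]
apples = [5, 6, 7]

def margin(boundary, points):
    """Distance from a 1-D boundary to the closest data point."""
    return min(abs(p - boundary) for p in points)

# compare the three candidate lines (orange, red, yellow)
for boundary in [3.5, 4, 4.5]:
    print(boundary, margin(boundary, pears + apples))
# 3.5 0.5
# 4 1
# 4.5 0.5
```

The line at 4 has the largest margin of the three, which is why it is the one we keep.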

In [0]:
gx = [1, 2, 3]
bx = [5, 6, 7]
y = [1, 1, 1]

plt.scatter(gx, y, c='green')
plt.scatter(bx, y, c='blue')

plt.plot([3, 3], [.995, 1.005], 'orange', ls='--')
plt.plot([4, 4], [.995, 1.005], 'red')
plt.plot([5, 5], [.995, 1.005], 'yellow', ls='--')
plt.show()

Why is maximizing the margin a good approach?

Intuition: We want a line that divides the data into two parts but is also as far as possible from the data points on both sides. Then, if a new data point falls closer to the hyperplane (the "middle") than the points we have already seen, it will still end up on the correct side.
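
One way to check this intuition is to let scikit-learn find the maximum-margin line for the fruit data and then classify two new fruits that lie close to the middle. This is a sketch, assuming the labels 0 = pear and 1 = apple and two hypothetical new weights, 3.75 and 4.25.

```python
from sklearn.svm import SVC

# one feature (weight) per fruit; 0 = pear, 1 = apple
X = [[1], [2], [3], [5], [6], [7]]
labels = [0, 0, 0, 1, 1, 1]

clf = SVC(kernel='linear')
clf.fit(X, labels)

# a linear SVM separates at w * x + b = 0, so the boundary is x = -b / w
w = clf.coef_[0][0]
b = clf.intercept_[0]
boundary = -b / w
print('boundary:', round(boundary, 2))

# classify two new points that sit near the middle
predictions = clf.predict([[3.75], [4.25]])
print(predictions)
```

The learned boundary should come out close to 4, the midpoint between the heaviest pear and the lightest apple.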

In [0]:
gx = [1, 2, 3]
bx = [5, 6, 7]
y = [1, 1, 1]

plt.scatter(gx, y, c='green')
plt.scatter(bx, y, c='blue')

plt.plot([3.5, 3.5], [.995, 1.005], 'orange')
plt.plot([4, 4], [.995, 1.005], 'red')
plt.plot([4.5, 4.5], [.995, 1.005], 'yellow')
plt.scatter([3.75], [1], c='black')
plt.scatter([4.25], [1], c='black')
plt.show()

So what if we have more than just weight? That is, what if we want to classify based on more than one feature?

Solution: Add more dimensions! We make each dimension represent a new feature and we (still) use a hyperplane to split the data. [Multiple dimension examples.](https://blog.statsbot.co/support-vector-machines-tutorial-c1618e635e93)

So, if we want to classify by $n$ features, we plot the data in $n$ dimensions, where each coordinate is one feature. Then we draw the hyperplane, which is the equivalent of a "line" that separates the data in $n$ dimensions (a line in 2D, a plane in 3D, etc.).
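
A sketch of the two-feature case, using made-up (weight, diameter) pairs; the numbers and the second feature are assumptions for illustration only. With a linear kernel, `coef_` and `intercept_` describe the hyperplane $w \cdot x + b = 0$.

```python
from sklearn.svm import SVC

# each row is one fruit: [weight, diameter] (hypothetical values)
X = [[1, 2], [2, 1], [3, 2],   # pears
     [5, 6], [6, 5], [7, 6]]   # apples
labels = [0, 0, 0, 1, 1, 1]    # 0 = pear, 1 = apple

clf = SVC(kernel='linear')
clf.fit(X, labels)

# one weight per feature, plus an offset
print('w =', clf.coef_[0])
print('b =', clf.intercept_[0])

# classify a new fruit from both features at once
new_label = clf.predict([[2, 2]])[0]
print(new_label)
```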

## Implementation Details

Kernel:

So far we've been drawing a straight line to separate the data, but there are more options:
* linear (line)
* polynomial (e.g. $x^2,x^3$)
* radial basis function (RBF)

Polynomial Kernel Example:
![Polynomial Example](https://i.imgur.com/rlCdLqV.jpg)

RBF Kernel Example:
![RBF Example](https://i.imgur.com/edNXrb9.jpg)
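
To see why the kernel choice matters, here is a sketch that fits a linear and an RBF kernel to synthetic data where one class forms a ring around the other. The data set (`make_circles`) and its parameters are assumptions chosen for illustration; they do not come from the fruit example.

```python
from sklearn.datasets import make_circles
from sklearn.svm import SVC

# two concentric rings: no straight line can separate them
X, labels = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)

scores = {}
for kernel in ['linear', 'rbf']:
    clf = SVC(kernel=kernel)
    clf.fit(X, labels)
    scores[kernel] = clf.score(X, labels)  # accuracy on the training data
    print(kernel, scores[kernel])
```

On data like this, the RBF kernel would be expected to score far higher than the linear one.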

## Advantages/Disadvantages and Use Cases

Advantages:
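* Works for both linearly and non-linearly separable data (with a suitable kernel)
* Can be used both when the labels are known and when they are unknown (you can still find trends even if the label is unknown)
* Helps avoid overfitting, because maximizing the margin acts as a form of regularization
* Efficient methods exist for training the model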

Disadvantages:
* Choosing the kernel to use
* Choosing other parameters (C and gamma, see Extra Info)
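
One common way to ease these choices is to try several settings and keep the one that cross-validates best. The sketch below uses `model_selection.GridSearchCV` on synthetic data; the candidate values for `kernel`, `C` and `gamma` are arbitrary assumptions, not recommendations.

```python
from sklearn import model_selection
from sklearn.datasets import make_circles
from sklearn.svm import SVC

# synthetic two-class data, only for illustration
X, labels = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)

# every combination below is trained and scored with 5-fold cross-validation
param_grid = {
    'kernel': ['linear', 'rbf'],
    'C': [0.1, 1, 10],
    'gamma': [0.1, 1, 10],
}
search = model_selection.GridSearchCV(SVC(), param_grid, cv=5)
search.fit(X, labels)

print(search.best_params_)
print(search.best_score_)
```

The grid grows quickly with every extra parameter value, so in practice the candidate lists are kept short.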

Example use cases of SVM:
* Face detection
* Text categorization
* Bioinformatics (e.g., protein classification)

## Spam Example

What's commonly used as the opposite of spam (for email)?

In [0]:
# load/display data
data_url = 'https://raw.githubusercontent.com/ParakweetLabs/EmailIntentDataSet/master/src/resources/Ask0729-fixed.txt'
df = pd.read_csv(data_url, sep='\t', names=['Ham?', 'Subject Line'])
df

In [0]:
# How many of each type do we have?
# (Series.value_counts replaces the deprecated top-level pd.value_counts)
print(df['Ham?'].value_counts())

In [0]:
# plot the 10 most common words
most_common_words = Counter(' '.join(df['Subject Line']).split()).most_common(10)
words_df = pd.DataFrame(most_common_words, columns=['word', 'count'])
words_df.plot.bar(x='word', y='count')

Basic NLP:

Stop words: common words which are normally filtered out, because they have no real significance in our data (e.g., "the", "to", "and")

An SVM only works with numeric coordinates. Solution: convert words to numbers, for example by counting how often each word occurs in each subject line.
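
A sketch of what this conversion looks like on two made-up subject lines (not taken from the data set). `CountVectorizer` builds a vocabulary and turns each line into a row of word counts; with `stop_words='english'`, common words such as "the" are dropped first.

```python
from sklearn.feature_extraction.text import CountVectorizer

# two hypothetical subject lines
subjects = ['Please review the report', 'The report and the budget']

vectorizer = CountVectorizer(stop_words='english')
counts = vectorizer.fit_transform(subjects)

# word -> column index
print(vectorizer.vocabulary_)
# one row per subject line, one column per remaining word
print(counts.toarray())
```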

In [0]:
# remove stop words and map words to numbers
f = feature_extraction.text.CountVectorizer(stop_words='english')
X = f.fit_transform(df['Subject Line'])
Y = df['Ham?'].map({'No': 0, 'Yes': 1})

In [0]:
# create one classifier per kernel
rbf = SVC(kernel='rbf', gamma='auto')
linear = SVC(kernel='linear')  # gamma is not used by the linear kernel
poly = SVC(kernel='poly', gamma='auto')

# train on the full data set
rbf.fit(X, Y)
linear.fit(X, Y)
poly.fit(X, Y)

# accuracy on the training data (not on unseen data)
print('rbf', rbf.score(X, Y))
print('linear', linear.score(X, Y))
print('poly', poly.score(X, Y))
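
The scores above are measured on the same data the models were trained on, which tends to be optimistic. Below is a sketch of holding out a test set with `model_selection.train_test_split`. It uses a handful of invented subject lines and labels so that it runs on its own; the split size and random seed are arbitrary choices.

```python
from sklearn import feature_extraction, model_selection
from sklearn.svm import SVC

# invented examples: 1 = ham, 0 = spam (hypothetical labels)
subjects = [
    'Meeting notes for Monday', 'Project report attached', 'Lunch on Friday',
    'Budget review meeting', 'Notes from the project call', 'Friday report deadline',
    'Win a free prize now', 'Free money offer', 'Claim your prize today',
    'Cheap offer just for you', 'Win money now', 'Free prize offer today',
]
is_ham = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0]

vectorizer = feature_extraction.text.CountVectorizer(stop_words='english')
features = vectorizer.fit_transform(subjects)

# keep 25% of the rows aside; stratify keeps both classes in each part
X_train, X_test, y_train, y_test = model_selection.train_test_split(
    features, is_ham, test_size=0.25, random_state=0, stratify=is_ham)

clf = SVC(kernel='linear')
clf.fit(X_train, y_train)

train_score = clf.score(X_train, y_train)
test_score = clf.score(X_test, y_test)  # accuracy on data the model has not seen
print('train', train_score)
print('test', test_score)
```

With so few rows the test score is very noisy; the point is only the pattern of fitting on one part and scoring on the other.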

## References
* [Do SVMs only do binary classification?](https://www.quora.com/Do-SVMs-only-do-binary-classification)
* [What is the intuition behind margin in SVM?](https://www.quora.com/What-is-the-intuition-behind-margin-in-SVM)
* [Support Vector Machines Tutorial](https://data-flair.training/blogs/svm-support-vector-machine-tutorial/)
* [Support Vector Machines for Classification](https://mubaris.com/posts/svm/)
* [Real-Life Applications of SVM](https://data-flair.training/blogs/applications-of-svm/)

