Exercise 7 - Advanced Support Vector Machines
=====

Support vector machines (SVMs) let us predict categories. In this exercise we will use SVMs in practice: formatting the data correctly, visualising the SVM model, and then evaluating it.

We will be looking at __prions__ - misfolded proteins that are associated with several fatal neurodegenerative diseases (kind of like Daleks, if you have seen Doctor Who). Using examples of each protein's mass and weight, we will build a predictive model to detect prions in blood samples.

In [None]:
# Run this code!
# It sets up the graphing configuration
import warnings
warnings.filterwarnings("ignore")
import matplotlib.pyplot as graph
%matplotlib inline
graph.rcParams['figure.figsize'] = (15,5)
graph.rcParams["font.family"] = 'DejaVu Sans'
graph.rcParams["font.size"] = '12'
graph.rcParams['image.cmap'] = 'rainbow'

Step 1
-----

Let's load up the data first and save it temporarily as `myData`. Our data set is called "PrionData.csv".

In [None]:
import pandas as pd
import numpy as np

###---REPLACE ??? BELOW WITH 'Data/PrionData.csv' (INCLUDING THE QUOTES) TO LOAD THE DATA FROM THAT FILE 
myData = pd.read_csv(???)
###

Step 2
-----

Let's take a look at the data.

We'll just look at the very top of the data set using the `dataset.head()` method.

In [None]:
###--- WRITE print(myData.head()) TO VIEW THE TOP 5 DATA POINTS OF THE DATA SET ---###

###

Looks like we have an extra column. This happens regularly when exporting data sets from a program like Excel and then importing them into a dataframe.

Step 3
-----

Let's get rid of that extra column, and then check that it's gone. You don't have to code anything here.

In [None]:
data = myData.drop(['Unnamed: 0'], axis = 1)
print(data.head())

All gone!
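
The extra column is usually a saved row index. As a minimal sketch (using a tiny made-up CSV rather than PrionData.csv, so the values are illustrative only), passing `index_col=0` to `pd.read_csv` reads that column as the index instead, so there is nothing to drop afterwards:

```python
import io
import pandas as pd

# A tiny made-up CSV that mimics a file exported together with its row index.
# These values are illustrative only; they are not taken from PrionData.csv.
csv_text = """,mass,weight,prion_status
0,0.21,0.35,non-prion
1,0.74,0.62,prion
"""

# Default read: the unnamed first column shows up as 'Unnamed: 0'
with_extra = pd.read_csv(io.StringIO(csv_text))
print(with_extra.columns.tolist())

# index_col=0 treats the first column as the row index instead
without_extra = pd.read_csv(io.StringIO(csv_text), index_col=0)
print(without_extra.columns.tolist())
```

Either approach works; dropping the column, as we did in Step 3, is handy when the file has already been loaded.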

Step 4
-----

Let's graph the data set to better understand what we're working with.

Looking at the output of the last step, we can see that the column holding the 'categories' we're looking at is called __prion_status__ (the label).

Fill in the ???'s below with our variables __mass__ and __weight__.

In [None]:
###--- REPLACE THE ??? BELOW WITH 'mass' (INCLUDING THE QUOTES) ---###
X = data[???]
###

###--- REPLACE THE ??? BELOW WITH 'weight' (INCLUDING THE QUOTES) ---###
Y = data[???]
###

target = [1 if data['prion_status'][n] == 'prion' else 0 for n in range(0, data['prion_status'].shape[0])]

graph.scatter(X, Y, c = target, zorder = 10, s = 40)

graph.title("classification plot for prion data")
# X holds mass and Y holds weight, so label the axes to match
graph.xlabel("mass")
graph.ylabel("weight")

graph.show()
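
The list comprehension that builds `target` turns each label into 1 (prion) or 0 (non-prion). As a rough sketch on a small made-up Series (not the real data set), the same conversion can also be written as a single vectorised comparison:

```python
import pandas as pd

# Made-up labels for illustration (not the real data set)
status = pd.Series(['prion', 'non-prion', 'non-prion', 'prion'])

# Loop version, in the same style as the cell above
target_loop = [1 if status[n] == 'prion' else 0 for n in range(0, status.shape[0])]

# Vectorised version: compare the whole column at once, then cast True/False to 1/0
target_vec = (status == 'prion').astype(int)

print(target_loop)
print(target_vec.tolist())
```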

Step 5
-------

Let's split up our data into test and training sets. We'll start by checking the total number of instances in our dataset by using the DataFrame attribute *shape*. The first number is the one we want.

In [None]:
###--- REPLACE THE ??? BELOW WITH THE NAME OF THE ATTRIBUTE WE WANT TO LOOK AT - shape ---###
data.???
###
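
If you are unsure what `shape` gives back, here is a small sketch on a made-up DataFrame. It is an attribute rather than a method, so there are no brackets, and it holds a `(rows, columns)` tuple:

```python
import pandas as pd

# A made-up frame with 3 rows and 2 columns
example = pd.DataFrame({'mass': [0.1, 0.5, 0.9], 'weight': [0.2, 0.4, 0.8]})

# shape is a (rows, columns) tuple, so it can be unpacked directly
n_rows, n_columns = example.shape
print(n_rows, n_columns)
```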

Step 6
-----

That's a decent number! We'll use 400 examples for our training set, and the remainder for our test set.

In [None]:
# This makes our training set out of the first 400 examples
train_X = data.drop(['prion_status'], axis = 1).truncate(after = 399)
train_Y = data['prion_status'].truncate(after = 399)

In [None]:
###--- REPLACE THE ???s BELOW WITH 400 TO MAKE THE TEST SET OUT OF THE REMAINING EXAMPLES ---###
test_X = data.drop(['prion_status'], axis = 1).truncate(before = ???).reset_index(drop = True)
test_Y = data['prion_status'].truncate(before = ???).reset_index(drop = True)
###
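
Splitting by row position works when the rows are in no particular order. If they were sorted (say, all prions first), a shuffled split would be safer. The sketch below is one possible alternative, not what this exercise uses: it applies scikit-learn's `train_test_split` to made-up data, and the 80/20 split and `random_state` are arbitrary choices for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up stand-in for the prion data: 100 rows, exactly 1 in 5 labelled 'prion'
rng = np.random.RandomState(0)
demo = pd.DataFrame({
    'mass': rng.rand(100),
    'weight': rng.rand(100),
    'prion_status': ['prion' if n % 5 == 0 else 'non-prion' for n in range(100)],
})

features = demo.drop(['prion_status'], axis = 1)
labels = demo['prion_status']

# Shuffle, then hold out 20% of the rows.
# stratify keeps the proportion of prions similar in both sets.
demo_train_X, demo_test_X, demo_train_Y, demo_test_Y = train_test_split(
    features, labels, test_size = 0.2, random_state = 0, stratify = labels)

print(demo_train_X.shape, demo_test_X.shape)
```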

Step 7
-----

Well done! Let's look at a summary of our training data. Complete the calls below with `describe()` (for example, `train_X.describe()`) to get a better idea of what we're dealing with.

In [None]:
###--- REPLACE THE ???s BELOW WITH 'describe()' ---###
print(train_X.???)
print(train_Y.???)
###

314 non-prions out of 400, which means there are 86 prions in there. That looks about right if we refer to the graph we made in Step 4.
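
For a column of text labels, `describe()` reports the number of rows (`count`), the number of distinct values (`unique`), the most common value (`top`) and how often it appears (`freq`). A small sketch on made-up labels shows this next to `value_counts()`, which lists the count for every category:

```python
import pandas as pd

# Made-up labels for illustration: 3 non-prions and 2 prions
demo_Y = pd.Series(['non-prion', 'prion', 'non-prion', 'non-prion', 'prion'])

# describe() summarises a text column as count, unique, top and freq
summary = demo_Y.describe()
print(summary)

# value_counts() shows the count for every category
counts = demo_Y.value_counts()
print(counts)
```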

Let's take a look at our test set too. Use the `describe()` method again, this time looking at __test__ instead of train.

In [None]:
###--- CODE THE DESCRIBE FUNCTION BELOW ---###
print(test_X.???)
print(test_Y.???)
###

Looks good to me! Alright, enough of that - let's make an SVM.

Step 8
-----

Below we will make an SVM, similar to the previous exercise.

Remember, the syntax for SVMs is:

`SVM_Model = svm.SVC().fit(features, labels)`

In [None]:
from sklearn import svm
SVM_Model = svm.SVC(gamma = 'auto')

###--- REPLACE THE ???s BELOW WITH OUR FEATURES train_X AND LABELS train_Y ---###
SVM_Model = svm.SVC(gamma = 'auto').fit(???, ???)
###

Well done! We've made an SVM model from our training set.
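
To see the fit-then-predict pattern in isolation, here is a minimal sketch on made-up, well-separated points (not the prion data). The cluster positions and the `demo_` names are assumptions chosen only for illustration:

```python
import numpy as np
from sklearn import svm

# Made-up points: 'non-prion' clustered near (0.2, 0.2), 'prion' near (0.8, 0.8)
rng = np.random.RandomState(0)
demo_X = np.vstack([0.2 + 0.05 * rng.randn(20, 2), 0.8 + 0.05 * rng.randn(20, 2)])
demo_Y = np.array(['non-prion'] * 20 + ['prion'] * 20)

# fit() learns the decision boundary from the features and labels
demo_model = svm.SVC(gamma = 'auto').fit(demo_X, demo_Y)

# predict() returns one label per row of features
demo_P = demo_model.predict([[0.2, 0.2], [0.8, 0.8]])
print(demo_P)
```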

Step 9
-----

Let's use our model to make some predictions. Fill in the ???'s below with the data we want to predict.

In [None]:
###--- REPLACE THE ???'S BELOW WITH OUR train_X AND test_X FEATURE SETS TO GENERATE THE PREDICTIONS ---###
train_P = SVM_Model.predict(???.values)
test_P = SVM_Model.predict(???.values)
###

In [None]:
# Don't edit!
# This changes our labels so that 1 = prion and 0 = non-prion
def numericise(data):
    return np.asarray([1 if data[n] == 'prion' else 0 for n in range(0, np.asarray(data).shape[0])])

Now to plot it!

In [None]:
# Don't edit this! It's our method to plot the graph

def plotSVM(clf, data, target):
    #numericise our target set
    target = numericise(target)

    graph.scatter(data['mass'], data['weight'], c = target, zorder = 10, edgecolor = 'k', s = 40)
    
    # Put the result into a colour plot
    XX, YY = np.mgrid[0:1:255j, 0:1:255j]
    Z = clf.decision_function(np.c_[XX.ravel(), YY.ravel()]).reshape(XX.shape)
    graph.pcolormesh(XX, YY, Z > 0)
    graph.contour(XX, YY, Z, colors = ['k', 'k', 'k'], linestyles = ['--', '-', '--'], levels = [-.5, 0, .5])
    
    graph.ylim(0, 1)
    graph.xlim(0, 1)
    
    graph.show()

In [None]:
plotSVM(SVM_Model, train_X, train_Y)
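
The coloured background in the plot comes from asking the model for a score at every point on a grid. The sketch below repeats that idea on a much coarser grid with made-up data (the `demo_` names and cluster positions are assumptions), and checks that the sign of `decision_function` matches what `predict` returns for a two-class SVC:

```python
import numpy as np
from sklearn import svm

# Made-up two-cluster data, as a stand-in for the prion features
rng = np.random.RandomState(1)
demo_X = np.vstack([0.3 + 0.05 * rng.randn(20, 2), 0.7 + 0.05 * rng.randn(20, 2)])
demo_Y = np.array(['non-prion'] * 20 + ['prion'] * 20)
demo_model = svm.SVC(gamma = 'auto').fit(demo_X, demo_Y)

# A coarse 5 x 5 grid over the unit square (plotSVM uses 255 x 255)
XX, YY = np.mgrid[0:1:5j, 0:1:5j]
grid_points = np.c_[XX.ravel(), YY.ravel()]  # one (mass, weight) pair per row

# Signed score per grid point: positive on one side of the boundary, negative on the other
Z = demo_model.decision_function(grid_points).reshape(XX.shape)

# For a two-class SVC, a positive score corresponds to the second entry of classes_
side = demo_model.classes_[(Z.ravel() > 0).astype(int)]
print(Z.shape)
print(np.array_equal(side, demo_model.predict(grid_points)))
```

This is why `plotSVM` colours the mesh with `Z > 0` and draws the solid contour at level 0: that contour is the decision boundary.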

Step 10
-----

Looks good to me. Now let's take a look at our test set.

Remember, the syntax for plotting SVMs with our helper function is:

`plotSVM(SVM_Model, features, labels)`

Add our __test__ set below to see how it looks.

In [None]:
###--- REPLACE THE ???s BELOW WITH test_X AND test_Y ---###
plotSVM(SVM_Model, ???, ???)
###

In [None]:
# Run this code too!
# But don't edit it - this is a function to evaluate the SVM's accuracy
def evaluateSVM(pred, real, name):
    pred, real = numericise(pred), numericise(real)
    x = 0
    for i in range(0, real.shape[0]):
        if pred[i] == real[i]:
            x = x + 1
    print(name, "Set Accuracy:", x / real.shape[0] * 100, "%")

Step 11
-----

Nearly there. Let's evaluate the model's accuracy on our training and test sets.

In [None]:
###--- REPLACE THE ???s BELOW WITH train_Y AND test_Y TO EVALUATE OUR PREDICTIONS ---###
evaluateSVM(train_P, ???, 'Train')
evaluateSVM(test_P, ???, 'Test')
###

That's a good result. 
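
Accuracy here is simply the share of predictions that match the real labels. As a small sketch on made-up labels, scikit-learn's `accuracy_score` gives the same figure that `evaluateSVM` works out with its loop (as a fraction rather than a percentage):

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Made-up predictions and true labels for illustration
pred = np.array(['prion', 'non-prion', 'non-prion', 'prion'])
real = np.array(['prion', 'non-prion', 'prion', 'prion'])

# Fraction of predictions that match the true labels
acc = accuracy_score(real, pred)
print("Accuracy:", acc * 100, "%")

# The same figure computed by hand: the mean of the element-wise matches
manual = np.mean(pred == real)
print("Manual:", manual * 100, "%")
```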

Conclusion
------

Well done! We've taken a data set, cleaned and prepared it, made an SVM, and then evaluated it.

You can go back to the course now, or you can carry on playing around with different SVM Kernels below. It doesn't take long to switch kernels.

OPTIONAL: Step 12
-----

Want to have a play around with different kernels for your SVM models? It's really easy!

The default kernel is a Radial Basis Function kernel (`rbf`), but there are a few more you can choose from: linear (`linear`), polynomial (`poly`), and sigmoid (`sigmoid`). Let's try them out.

If you wanted to use a linear kernel, all you need to do is add `kernel='linear'` to your model. Like this:

`SVM_Model = svm.SVC(kernel='linear')`

Give it a go with all the different kernels below. The first one is done for you.

In [None]:
# Make a new linear SVM model
SVM_Model = svm.SVC(kernel = 'linear').fit(train_X, train_Y)

# Plot the new linear SVM model
plotSVM(SVM_Model, train_X, train_Y)
plotSVM(SVM_Model, test_X, test_Y)

# Use the model to predict the training and test sets.
train_P = SVM_Model.predict(train_X.values)
test_P = SVM_Model.predict(test_X.values)

# Evaluate the model using the training and test sets
evaluateSVM(train_P, train_Y, 'Train')
evaluateSVM(test_P, test_Y, 'Test')

You can see the hyperplane is a straight line!

Now let's try a sigmoid kernel.

In [None]:
# Make a new sigmoid SVM model

###--- REPLACE THE ??? BELOW WITH 'sigmoid' (INCLUDING THE QUOTES) ---###
SVM_Model = svm.SVC(kernel = ???, gamma = 4, coef0 = 0).fit(train_X, train_Y)
###

plotSVM(SVM_Model, train_X, train_Y)
plotSVM(SVM_Model, test_X, test_Y)

train_P = SVM_Model.predict(train_X.values)
test_P = SVM_Model.predict(test_X.values)

evaluateSVM(train_P, train_Y, 'Train')
evaluateSVM(test_P, test_Y, 'Test')

Perhaps a sigmoid kernel isn't a good idea for this data set...

Let's try a polynomial kernel.

In [None]:
# Make a new polynomial SVM model

###--- REPLACE THE ??? BELOW WITH 'poly' (INCLUDING THE QUOTES) ---###
SVM_Model = svm.SVC(kernel = ???, gamma = 10, degree = 3, coef0 = 0).fit(train_X, train_Y)
###

plotSVM(SVM_Model, train_X, train_Y)
plotSVM(SVM_Model, test_X, test_Y)

train_P = SVM_Model.predict(train_X.values)
###--- TRY TO FILL THIS ONE IN BY YOURSELF BY LOOKING AT WHAT WE'VE DONE PREVIOUSLY ---###
test_P = ???(test_X.values)
###

evaluateSVM(train_P, train_Y, 'Train')
evaluateSVM(test_P, test_Y, 'Test')

If we were to carry on analyzing prions like this, a polynomial kernel looks like a good choice. If the data set were more complicated, we could try different degrees for the polynomial to see which one is the most accurate. This is called __`tuning`__ a model.
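
Trying different degrees can be sketched as a simple loop. The example below uses made-up data (points inside a circle are labelled 'prion') and scores each degree with 5-fold cross-validation via `cross_val_score`. The data, the `gamma = 1` and `coef0 = 1` settings, and the list of degrees are all assumptions for illustration; they differ from the values used in the cells above.

```python
import numpy as np
from sklearn import svm
from sklearn.model_selection import cross_val_score

# Made-up two-class data: points inside a circle are 'prion', the rest 'non-prion'
rng = np.random.RandomState(0)
demo_X = rng.rand(200, 2)
inside = ((demo_X - 0.5) ** 2).sum(axis = 1) < 0.1
demo_Y = np.where(inside, 'prion', 'non-prion')

# Try several polynomial degrees and record the mean 5-fold cross-validated accuracy
scores = {}
for degree in [1, 2, 3, 4]:
    model = svm.SVC(kernel = 'poly', gamma = 1, degree = degree, coef0 = 1)
    scores[degree] = cross_val_score(model, demo_X, demo_Y, cv = 5).mean()

for degree, score in scores.items():
    print("degree", degree, "mean accuracy:", round(score, 3))

# Pick the degree with the highest mean accuracy on this made-up data
best_degree = max(scores, key = scores.get)
print("Best degree on this made-up data:", best_degree)
```

Cross-validation is used here so that the degree is not chosen by peeking at the test set.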

Well done!