# Modern Data Science 
**(Module 03: Pattern Classification)**

---
- Materials in this module include resources collected from various open-source online repositories.
- You are free to use, change and distribute this package.

Prepared by and for 
**Student Members** |
2006-2018 [TULIP Lab](http://www.tulip.org.au), Australia

---


# Session J - Support Vector Machines

### Import the required Python modules and the data

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

In [None]:
data = pd.read_table("pc_data/cervical.txt")
data.shape

In [None]:
data.head()

### Let's look at the library sizes (total read counts per sample)

In [None]:
sizes = data.sum(numeric_only=True)
sizes.plot.bar()

### Normalization

Our library sizes are very different. Let's normalize the data for library size so that we are comparing apples to apples.

Here I use the counts per million (CPM) normalization method. We divide each count by the library size of its sample to get the proportion of total reads for each gene, then multiply by 1 million to get counts per million.

In [None]:
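# Set the gene IDs aside so that only the numeric count columns are normalized
ID = data.ID
data = data.drop(columns='ID')
# Library size of each sample (column sums)
sums = data.sum()
# Counts per million: divide each column by its library size, then scale
cpm = data.div(sums) * 1000000
cpm.insert(loc=0, column='ID', value=ID)
cpm.shape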

In [None]:
cpm.head()

In [None]:
sizes = cpm.sum(numeric_only=True)
sizes.plot.bar()
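As a quick sanity check of the CPM formula described above, here is a minimal sketch on a toy count table. The gene names, sample names and counts are made up for illustration and are not taken from the cervical data set. The two toy samples have the same relative expression but a tenfold difference in library size, so their CPM values should come out identical.

```python
import pandas as pd

# Toy count table (hypothetical values): rows are genes, columns are samples.
toy_counts = pd.DataFrame(
    {"S1": [10, 30, 60], "S2": [100, 300, 600]},
    index=["geneA", "geneB", "geneC"],
)

# Library size = total reads per sample (column sums).
toy_sizes = toy_counts.sum()

# CPM: each count divided by its sample's library size, times one million.
toy_cpm = toy_counts.div(toy_sizes, axis="columns") * 1_000_000

print(toy_cpm)
print(toy_cpm.sum())  # each column should now sum to (about) one million
```

After normalization the difference in sequencing depth between `S1` and `S2` disappears, which is the effect the second bar plot shows for the real data.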

### Re-format the data into a form suitable for input to the machine learning algorithm

In [None]:
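# Transpose so that rows are samples and columns are genes
cpm = cpm.transpose()
# Drop the first row (the gene IDs) and convert the rest to a numeric array
cpm = np.array(cpm[1:], dtype=float)
# Labels assume the first 29 sample columns are normal and the last 29 are tumor
class_labels = np.array(["normal"]*29 + ["tumor"]*29)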

### Train/Test Split - split the data into a training set and a testing set

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(cpm, class_labels, test_size=0.20, random_state=2)

In [None]:
print(X_train.shape)
print(X_test.shape)
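With only 58 samples, a random split can leave the test set with an uneven mix of classes. One optional variation, not used in the cell above, is to pass `stratify` to `train_test_split` so that both subsets keep the same normal/tumor ratio. The sketch below uses synthetic stand-in data; `X_demo`, `y_demo` and the other `*_demo`/`*d_*` names are hypothetical and exist only for this illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the expression matrix: 58 samples x 5 features.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(58, 5))
y_demo = np.array(["normal"] * 29 + ["tumor"] * 29)

# stratify=y_demo keeps the normal/tumor ratio the same in both subsets.
Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, test_size=0.20, random_state=2, stratify=y_demo
)

print(Xd_train.shape, Xd_test.shape)
print(np.unique(yd_test, return_counts=True))
```

Whether stratification is worth using here is a judgment call; it mainly makes the accuracy estimate less sensitive to the choice of `random_state`.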

### Now we'll use support vector machines with a linear kernel to create a classifier

In [None]:
from sklearn import svm
svc = svm.SVC(kernel='linear', C=1.0).fit(X_train, y_train)
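To make the linear kernel less of a black box, the sketch below reproduces the classifier's decision by hand. For a two-class linear SVM the score of a sample `x` is `w . x + b`, where `w` is stored in `coef_` and `b` in `intercept_`; a positive score maps to the second entry of `classes_`. The data here are synthetic clusters, not the cervical data, and all `*_demo` names are hypothetical.

```python
import numpy as np
from sklearn import svm

# Synthetic two-class data (hypothetical): two well-separated clusters.
rng = np.random.default_rng(0)
X_demo = np.vstack([
    rng.normal(-2.0, 1.0, size=(20, 3)),
    rng.normal(2.0, 1.0, size=(20, 3)),
])
y_demo = np.array(["normal"] * 20 + ["tumor"] * 20)

demo_svc = svm.SVC(kernel="linear", C=1.0).fit(X_demo, y_demo)

# For a linear kernel the decision score is w . x + b.
manual_scores = X_demo @ demo_svc.coef_.ravel() + demo_svc.intercept_[0]

# Positive scores map to classes_[1], negative scores to classes_[0].
manual_labels = np.where(
    manual_scores > 0, demo_svc.classes_[1], demo_svc.classes_[0]
)

# Both checks compare the hand computation against scikit-learn's own output.
print(np.allclose(manual_scores, demo_svc.decision_function(X_demo)))
print((manual_labels == demo_svc.predict(X_demo)).all())
```

The same attributes are available on the `svc` object fitted above, where each entry of `coef_` corresponds to one gene.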

### Predict classifications of the test set data with our model and see how well the model performs

Here are the predictions:

In [None]:
predictions = svc.predict(X_test)
print(predictions)

Here is the ground truth (we know what the samples are ahead of time):

In [None]:
print(y_test)

Here is our prediction accuracy:

In [None]:
print(f"Prediction accuracy = {svc.score(X_test, y_test) * 100:.1f}%")
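The score reported above is simply the fraction of test samples whose predicted label matches the ground truth. The sketch below shows that calculation by hand and adds a confusion matrix, which breaks the errors down by class. The labels are hypothetical and are not results from the cervical data set.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

# Hypothetical labels for illustration only.
y_true_demo = np.array(["normal", "normal", "tumor", "tumor", "tumor", "normal"])
y_pred_demo = np.array(["normal", "tumor", "tumor", "tumor", "normal", "normal"])

# Accuracy is the fraction of predictions that match the ground truth.
manual_accuracy = np.mean(y_pred_demo == y_true_demo)
print(manual_accuracy, accuracy_score(y_true_demo, y_pred_demo))

# Rows are true classes, columns are predicted classes, in the order of `labels`.
cm = confusion_matrix(y_true_demo, y_pred_demo, labels=["normal", "tumor"])
print(cm)
```

To apply this to the real model, `y_test` and `predictions` from the cells above could be passed in place of the demo arrays.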