## Train a classifier to predict leukemia subtype using the Support Vector Machines method
<p>Support Vector Machines (SVMs) are supervised classification models. The training algorithm analyzes the features of phenotype-labeled samples and tries to find the line or plane (more formally, the <a href="https://en.wikipedia.org/wiki/Hyperplane" target="_blank">hyperplane</a>) that separates the classes. The trained model can then be applied to unlabeled samples to predict their classes.</p>
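
To make the idea of a separating hyperplane concrete, the following minimal sketch applies the decision rule of an already-trained linear model: a sample is assigned to a class according to the sign of `w · x + b`. The weights, bias, and samples are hypothetical values chosen for illustration, not values learned from the datasets in this notebook.

```python
def decision_value(weights, bias, sample):
    """Signed score of a sample relative to the hyperplane w . x + b = 0."""
    return sum(w * x for w, x in zip(weights, sample)) + bias


def predict_side(weights, bias, sample):
    """Return 1 for samples on the non-negative side of the hyperplane, else 0."""
    return 1 if decision_value(weights, bias, sample) >= 0 else 0


# Hypothetical hyperplane in two dimensions: x0 - x1 = 0
weights = [1.0, -1.0]
bias = 0.0

# Two hypothetical samples, one on each side of the hyperplane
samples = [[3.0, 1.0], [0.5, 2.0]]
predictions = [predict_side(weights, bias, s) for s in samples]
print(predictions)  # [1, 0]
```

Training an SVM amounts to choosing the weights and bias from labeled data; prediction only needs this sign test.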

## Import required Python libraries

In [6]:
import numpy as np
import pandas as pd
from sklearn import svm


### Utility function to read expression dataset
- Accepts a file in the GenePattern [RES](http://www.broadinstitute.org/cancer/software/genepattern/file-formats-guide#RES) format
- Returns a numpy.ndarray with one row per sample and one column per gene

In [7]:
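def read_dataset(filename):
    # Read the header line to find out which columns carry a name
    with open(filename + ".res", "r") as f:
        all_hdrs = f.readline().split('\t')
    # Keep only the columns with a non-empty header; blank-header columns are skipped
    col_nums = [i for i, x in enumerate(all_hdrs) if x not in ('', '\n')]
    # Drop column 0 (Description); the next kept column becomes the row index
    col_nums.remove(0)
    # Use the filename argument (not the global trainfile) and skip the two
    # non-data lines that follow the header
    result_df = pd.read_table(filename + ".res", header=0, index_col=0, usecols=col_nums, skiprows=[1, 2], skipinitialspace=True)
    # as_matrix() was removed from pandas; to_numpy() is its replacement
    result_matrix = result_df.to_numpy().transpose()
    return result_matrix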
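
The column selection in `read_dataset` can be illustrated on a tiny in-memory example using only the standard library. The sketch below assumes the RES layout described in the GenePattern guide (a header in which each sample name is followed by a blank cell, a description line, a row-count line, then data rows in which each value is followed by a call). The sample names, gene names, and values are made up for illustration.

```python
# Hypothetical RES-style content: 2 genes, 2 samples
res_text = (
    "Description\tAccession\tS1\t\tS2\t\n"
    "\t\tfirst sample\t\tsecond sample\t\n"
    "2\n"
    "gene one\tG1\t10\tP\t20\tA\n"
    "gene two\tG2\t30\tP\t40\tP\n"
)

lines = res_text.splitlines()
headers = lines[0].split("\t")

# Columns with a non-empty header: Description, Accession, and one per sample
named_cols = [i for i, h in enumerate(headers) if h != ""]
# Skip Description and Accession to keep only the expression-value columns
value_cols = named_cols[2:]

# Data rows start after the header, description, and row-count lines
genes_by_samples = [
    [float(fields[c]) for c in value_cols]
    for fields in (line.split("\t") for line in lines[3:])
]

# Transpose so that each row is a sample and each column is a gene
samples_by_genes = [list(col) for col in zip(*genes_by_samples)]
print(value_cols)        # [2, 4]
print(samples_by_genes)  # [[10.0, 30.0], [20.0, 40.0]]
```

The transposed layout matters because scikit-learn estimators expect one row per sample.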

### Utility function to write result data matrix to a file

In [8]:
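def write_matrix(data_matrix, fname):
    # Write one tab-separated line per row. np.atleast_1d lets a 1-D array
    # (such as the output of clf.predict) be written as one value per line.
    with open(fname, "w") as f:
        for row in data_matrix:
            f.write("\t".join(str(v) for v in np.atleast_1d(row)) + "\n")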

### Base filenames for train and test sets:

In [2]:
testfile = "GCM_Test"
trainfile = "GCM_Training"

### Read class label files for train and test sets:

In [3]:
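labels = {}
vectors = {}

for split in ["Training", "Test"]:
    class_file = "GCM_" + split + ".cls"
    # The with-block closes the file once the three lines have been read
    with open(class_file, "r") as c_file:
        c_file.readline()  # skip the header line
        # Second line: class names, after a leading "# "
        labels[split] = c_file.readline()[2:].split()
        # Third line: one class assignment per sample
        vectors[split] = c_file.readline().split()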
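
The three-line structure that this loop relies on can be seen in a small in-memory example. The sketch below assumes the CLS layout from the GenePattern guide (a counts line, a `#`-prefixed line of class names, then one assignment per sample). The class names and assignments are hypothetical.

```python
import io

# Hypothetical CLS content: 4 samples, 2 classes
cls_text = "4 2 1\n# ALL AML\n0 1 1 0\n"

with io.StringIO(cls_text) as c_file:
    # First line: number of samples, number of classes, and a constant
    num_samples, num_classes, _ = c_file.readline().split()
    # Second line: class names, after the leading "# "
    class_names = c_file.readline()[2:].split()
    # Third line: one class index per sample, kept as strings
    assignments = c_file.readline().split()

# Map each sample's class index back to its name
sample_names = [class_names[int(a)] for a in assignments]
print(class_names)   # ['ALL', 'AML']
print(assignments)   # ['0', '1', '1', '0']
print(sample_names)  # ['ALL', 'AML', 'AML', 'ALL']
```

Comparing `len(assignments)` with `int(num_samples)` is a cheap sanity check before training.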

## Read train and test datasets

In [54]:
train_matrix = read_dataset(trainfile)
test_matrix = read_dataset(testfile)

## Create linear SVM model on training dataset

In [None]:
clf = svm.LinearSVC()
data_fit = clf.fit(train_matrix, vectors["Training"])
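
For readers without the GCM files at hand, the same `fit` and `predict` calls can be tried on a toy dataset. This is a minimal sketch with made-up, well-separated two-dimensional points; the data, the string labels, and the `random_state` value are assumptions for illustration only.

```python
from sklearn import svm

# Hypothetical training data: two well-separated clusters
toy_train = [[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]]
# String class labels, in the same form as values read from a .cls file
toy_labels = ["0", "0", "0", "1", "1", "1"]

toy_clf = svm.LinearSVC(random_state=0)
toy_clf.fit(toy_train, toy_labels)

# One new point near each cluster
toy_predictions = toy_clf.predict([[0.5, 0.5], [5.5, 5.5]])
print(list(toy_predictions))
```

After fitting, `toy_clf.coef_` and `toy_clf.intercept_` hold the weights and bias of the learned hyperplane.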

## Run resulting model on test dataset
- Numbers indicate predicted class for each sample

In [None]:
data_predict = clf.predict(test_matrix)
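
Because the test set's true classes were read earlier into `vectors["Test"]`, the predictions can be scored against them. The sketch below shows one way to do this in plain Python; the predicted values, true values, and class names are hypothetical stand-ins for `data_predict`, `vectors["Test"]`, and `labels["Test"]`.

```python
# Hypothetical stand-ins for data_predict and vectors["Test"]
predicted = ["0", "2", "1", "1"]
actual = ["0", "2", "2", "1"]
# Hypothetical stand-in for labels["Test"]
class_names = ["Breast", "Prostate", "Lung"]

# Fraction of samples whose predicted class matches the true class
correct = sum(p == a for p, a in zip(predicted, actual))
accuracy = correct / len(actual)

# Translate predicted class indices into readable class names
predicted_names = [class_names[int(p)] for p in predicted]

print(accuracy)         # 0.75
print(predicted_names)  # ['Breast', 'Lung', 'Prostate', 'Prostate']
```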

## Write results to file

In [None]:
write_matrix(data_predict,"prediction_results.txt")

## References

Golub, T.R., Slonim, D.K., Tamayo, P., Huard, C., Gaasenbeek, M., Mesirov, J.P., Coller, H., Loh, M., Downing, J.R., Caligiuri, M.A., Bloomfield, C.D., and Lander, E.S. 1999. Molecular Classification of Cancer: Class Discovery and Class Prediction by Gene Expression. <a href="http://science.sciencemag.org/content/286/5439/531.long" target="_blank">Science 286:531-537</a>.

Rifkin, R., Mukherjee, S., Tamayo, P., Ramaswamy, S., Yeang, C-H, Angelo, M., Reich, M., Poggio, T., Lander, E.S., Golub, T.R., Mesirov, J.P. 2003. An Analytical Method for Multiclass Molecular Cancer Classification. <a href="http://epubs.siam.org/doi/abs/10.1137/S0036144502411986" target="_blank">SIAM Review 45(4):706-723</a>.

Ramaswamy, S., Tamayo, P., Rifkin, R., Mukherjee, S., Yeang, C-H, Angelo, M., Ladd, C., Reich, M., Latulippe, E., Mesirov, J.P., Poggio, T., Gerald, W., Loda, M., Lander, E.S., Golub, T.R. 2001. Multiclass cancer diagnosis using tumor gene expression signatures. <a href="http://www.pnas.org/content/98/26/15149.." target="_blank">PNAS 98(26):15149–15154</a>.

Slonim, D.K., Tamayo, P., Mesirov, J.P., Golub, T.R., Lander, E.S. 2000. Class prediction and discovery using gene expression data. In <a href="http://dl.acm.org/citation.cfm?id=332564" target="_blank">Proceedings of the Fourth Annual International Conference on Computational Molecular Biology (RECOMB)</a>. ACM Press, New York. pp. 263-272.

Guyon, I., Weston, J., Barnhill, S., & Vapnik, V. 2002. Gene selection for cancer classification using support vector machines. <a href="https://link.springer.com/article/10.1023%2FA%3A1012487302797" target="_blank">Mach. Learn., 46(1-3), 389–422</a>.

Hsu, C-W., Chang, C-C., Lin, C-J. 2016. <a href="https://www.csie.ntu.edu.tw/~cjlin/papers/guide/guide.pdf" target="_blank">A Practical Guide to Support Vector Classification</a>.

Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cournapeau, D., Brucher, M., Perrot, M., Duchesnay, E. 2011. Scikit-learn: Machine Learning in Python. <a href="http://jmlr.csail.mit.edu/papers/v12/pedregosa11a.html" target="_blank">JMLR 12, pp. 2825-2830</a>.