# Support Vector Classification using RAPIDS cuML

This notebook demonstrates how to use cuML SVM on the forest cover type dataset and measures its performance compared to scikit-learn SVM and ThunderSVM. This notebook provides supplementary information for the Benchmark section of the [RAPIDS cuML SVC](https://nvda.ws/3c3Qy8H) blog post.

We also have a simpler [notebook](https://github.com/rapidsai/cuml/blob/branch-0.13/notebooks/svm_demo.ipynb) on how to use cuML SVM.

In [None]:
import cuml
import numpy as np
import sklearn.metrics
import sklearn.preprocessing
import sklearn.svm

from cudf import Series
import matplotlib.pyplot as plt
from scipy.sparse import csr_matrix
from sklearn.datasets import fetch_covtype
from sklearn.model_selection import train_test_split
from timeit import default_timer

For the measurements in the blog, we compiled [ThunderSVM](https://github.com/Xtra-Computing/thundersvm) from source (this [version](https://github.com/Xtra-Computing/thundersvm/tree/f604b42e15012d164cdf5ee14b528ab94d535b91)). You can also [search PyPI](https://pypi.org/search/?q=thundersvm) for a package that matches your CUDA version and pip install that.

In [None]:
# !pip install thundersvm-cuda10
try:
    import thundersvm
    thundersvm_loaded = True
except ImportError:
    thundersvm_loaded = False

## Load the dataset
We use the [covertype](https://archive.ics.uci.edu/ml/datasets/covertype) dataset and apply the following processing steps:
- Transform it into a binary classification problem (class 2 or not).
- Scale the first 10 columns. The remaining columns are binary inputs, which we leave unchanged.
- Split the data: use 90% of the samples for training and 10% for testing.

In [None]:
X, y = fetch_covtype(data_home='/mydata/blog/scikit_learn_data', return_X_y=True)

y = (y==2).astype(np.float32) # Make it into binary classification

X[:,0:10] = sklearn.preprocessing.StandardScaler().fit_transform(X[:,0:10])

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=77, test_size=0.1)

## Train and predict with cuML SVC
We define an SVC classifier using a radial basis function ('rbf') kernel with C=10, gamma=1, and a 2 GB kernel cache (`cache_size` is given in MB).

In [None]:
svc = cuml.svm.SVC(kernel='rbf', C=10, gamma=1, cache_size=2000)
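
As a reminder of what `gamma` controls, the RBF kernel between two vectors is commonly defined as exp(-gamma * ||x - y||^2). The snippet below is a minimal pure-Python sketch of that formula for illustration only. The function name `rbf_kernel_value` is hypothetical, and this is not how cuML evaluates the kernel internally.

```python
import math

def rbf_kernel_value(x, y, gamma=1.0):
    """Evaluate exp(-gamma * ||x - y||^2) for two equal-length vectors."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    # Squared Euclidean distance between the two vectors
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)

# Identical vectors give a kernel value of 1; the value decays with distance.
same = rbf_kernel_value([0.0, 0.0], [0.0, 0.0], gamma=1.0)
unit_apart = rbf_kernel_value([1.0, 0.0], [0.0, 0.0], gamma=1.0)
```

A larger `gamma` makes the kernel value drop faster with distance, so each training vector influences a smaller neighborhood.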

### Fit the classifier

In [None]:
svc.fit(X_train, y_train)

The warning message about column major data layout can be ignored.

### Make prediction and check accuracy

In [None]:
y_pred = svc.predict(X_test)
cuml.metrics.accuracy_score(y_test, y_pred)

## Benchmark SVC classifiers
The actual training time depends on the dataset and the model parameters. The penalty parameter (C) has a particularly strong effect. Theoretically, the training time for an SVM is around O(N^2) for small C and closer to O(N^3) for large C, where N is the number of training vectors. The example shown here scales quadratically. Because of this, fitting the whole dataset on the CPU would take a very long time. Instead, we measure the execution time on several subsets of the data.
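
One way to check the scaling behavior from measured timings is to fit a straight line to log(time) versus log(N); the slope approximates the exponent. The sketch below is a hedged illustration using only the standard library. The helper name `scaling_exponent` is hypothetical, and the timings are synthetic values generated from an exact quadratic law, not measurements.

```python
import math

def scaling_exponent(sizes, times):
    """Least-squares slope of log(time) against log(size).

    Entries with non-positive or NaN times are skipped.
    """
    pts = [(math.log(n), math.log(t))
           for n, t in zip(sizes, times)
           if n > 0 and t > 0 and math.isfinite(t)]
    if len(pts) < 2:
        raise ValueError("need at least two valid (size, time) pairs")
    mean_x = sum(x for x, _ in pts) / len(pts)
    mean_y = sum(y for _, y in pts) / len(pts)
    sxx = sum((x - mean_x) ** 2 for x, _ in pts)
    if sxx == 0:
        raise ValueError("sizes must not all be equal")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pts)
    return sxy / sxx

# Synthetic timings following t = 1e-8 * N^2 (illustrative only)
sizes = [1000, 10000, 100000]
times = [1e-8 * n ** 2 for n in sizes]
exponent = scaling_exponent(sizes, times)
```

With real measurements, small training sizes are usually dominated by fixed overhead, so it may be more informative to fit only the larger sizes.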

### Prepare data
While cuML uses only dense input, ThunderSVM and scikit-learn SVM run more efficiently on this dataset using a sparse representation. Therefore, we transform the dataset to a sparse format.

In [None]:
X_train = csr_matrix(X_train.astype(np.float32))
X_test = csr_matrix(X_test.astype(np.float32))
print('Fraction of nonzeros {:4.1f} %'.format(X_train.nnz / np.prod(X_train.shape)*100))
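
To see why the fraction of nonzeros matters, the following rough sketch compares the storage needed for a dense matrix with that of a CSR matrix. It assumes 4-byte values (float32) and 4-byte indices, which is an assumption about the index type. The shape and the nonzero count are hypothetical placeholders, not properties of the covertype data.

```python
def dense_bytes(n_rows, n_cols, itemsize=4):
    """Storage for a dense matrix: one value per entry."""
    return n_rows * n_cols * itemsize

def csr_bytes(n_rows, nnz, itemsize=4, index_itemsize=4):
    """Storage for CSR: values + column indices + row pointer array."""
    return nnz * (itemsize + index_itemsize) + (n_rows + 1) * index_itemsize

# Hypothetical example: 1000 x 54 matrix with one fifth of the entries nonzero
n_rows, n_cols = 1000, 54
nnz = (n_rows * n_cols) // 5

dense_size = dense_bytes(n_rows, n_cols)
csr_size = csr_bytes(n_rows, nnz)
```

Under these assumptions, CSR storage is smaller than dense storage roughly when fewer than half of the entries are nonzero.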

### Helper routines to benchmark SVM classifiers

In [None]:
def time_svm(clf, X_train, X_test, y_train, y_test):
    """ Helper routine to measure the time to fit and predict with a classifier.
    
    Parameters:
    clf - initialized classifier
    X_train - feature vectors for training
    X_test - feature vectors for testing
    y_train - train labels
    y_test - test labels
    
    Returns a list of four values [m, t_fit, t_pred, acc]:
    m - number of training vectors
    t_fit - time to train in seconds
    t_pred - time to predict in seconds
    acc - accuracy score
    """
    
    # Measure time to fit
    start = default_timer()
    clf.fit(X_train, y_train)
    stop = default_timer()
    t_fit = stop - start
    
    # Measure time to predict (on the training vectors)
    start = default_timer()
    clf.predict(X_train)
    stop = default_timer()
    t_pred = stop - start
        
    # Calculate accuracy
    y_pred = clf.predict(X_test)
    if isinstance(y_pred, Series):
        acc = cuml.metrics.accuracy_score(y_test, y_pred)
    else: 
        acc = sklearn.metrics.accuracy_score(y_test, y_pred)
        
    return [X_train.shape[0], t_fit, t_pred, acc]

def cuml_time_svm(X_train, X_test, y_train, y_test, params):
    clf = cuml.svm.SVC(**params)
    if isinstance(X_train, csr_matrix):
        # cuML needs dense input matrices
        X_train = X_train.toarray()
        X_test = X_test.toarray()
    return time_svm(clf, X_train, X_test, y_train, y_test)    

def skl_time_svm(X_train, X_test, y_train, y_test, params):
    if X_train.shape[0] <= 50000: # cut-off for Sklearn training
        clf = sklearn.svm.SVC(**params)
        return time_svm(clf, X_train, X_test, y_train, y_test)
    else:
        return [X_train.shape[0], np.nan, np.nan, np.nan]
    
def thunder_time_svm(X_train, X_test, y_train, y_test, params):
    thu_params = dict(params)
    if thu_params['kernel']=='poly':
        thu_params['kernel'] = 'polynomial'
    clf = thundersvm.SVC(**thu_params)
    return time_svm(clf, X_train, X_test, y_train, y_test)

def run_benchmark(X_train, X_test, y_train, y_test, m_list, params, run_skl=True, run_thunder=True, run_cuml=True):
    # We store the benchmark results in matrices with four columns: m, t_fit, t_pred, accuracy
    res_skl = np.zeros((len(m_list),4)) 
    res_cuml = np.zeros((len(m_list),4)) 
    res_thunder = np.zeros((len(m_list),4))
        
    i = 0
    for m in m_list:
        X_in = X_train[:m,:]
        y_in = y_train[:m]

        if run_cuml:            
            res_cuml[i,:] = cuml_time_svm(X_in, X_test, y_in, y_test, params)
            print('cuML    time for training size {:6} is {:4.2f} sec, accuracy {:%}'.format(m, res_cuml[i,1], res_cuml[i,3]))
            
        if run_skl:
            res_skl[i,:] = skl_time_svm(X_in, X_test, y_in, y_test, params)
            print('Skl     time for training size {:6} is {:4.2f} sec, accuracy {:%}'.format(m, res_skl[i,1], res_skl[i,3]))
            
        if run_thunder:
            res_thunder[i,:] = thunder_time_svm(X_in, X_test, y_in, y_test, params)
            print('Thunder time for training size {:6} is {:4.2f} sec, accuracy {:%}'.format(m, res_thunder[i,1], res_thunder[i,3]))
        i += 1      

    return res_cuml, res_skl, res_thunder

### Define SVC parameters
We use the same kernel, gamma, and cache size as above, but with C=1.

In [None]:
params = {'kernel':'rbf', 'C':1, 'gamma': 1, 'cache_size':2000}

### Run benchmark

In [None]:
# We start with a warmup
_ = run_benchmark(X_train, X_test, y_train, y_test, [10, 100], params, run_thunder=thundersvm_loaded)

# Run the benchmark
m_list = [10, 100, 1000, 10000, 50000, 100000, 200000, 300000, 400000, X_train.shape[0]]
res_cuml, res_skl, res_thunder = run_benchmark(X_train, X_test, y_train, y_test, m_list, params, run_thunder=thundersvm_loaded)
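
The result matrices can also be summarized as speedups. The sketch below is one possible way to do that, assuming the four-column layout `[m, t_fit, t_pred, acc]` used by `run_benchmark`. The helper name `fit_time_speedup` is hypothetical, and the rows in the example are made-up placeholder numbers, not measurements. Rows where a timing is missing (NaN, as for the skipped scikit-learn runs) or zero (as for a library that was not run) yield NaN.

```python
import math

def fit_time_speedup(res_ref, res_new):
    """Per-row ratio of fit times (column 1): reference time / new time.

    Works on NumPy arrays or nested lists with rows [m, t_fit, t_pred, acc].
    Returns NaN for rows where either timing is missing or not positive.
    """
    speedups = []
    for row_ref, row_new in zip(res_ref, res_new):
        t_ref, t_new = float(row_ref[1]), float(row_new[1])
        valid = (math.isfinite(t_ref) and math.isfinite(t_new)
                 and t_ref > 0 and t_new > 0)
        speedups.append(t_ref / t_new if valid else math.nan)
    return speedups

# Placeholder rows for illustration only (not benchmark results)
demo_ref = [[1000, 2.0, 0.5, 0.9], [100000, math.nan, math.nan, math.nan]]
demo_new = [[1000, 0.5, 0.1, 0.9], [100000, 4.0, 1.0, 0.9]]
demo_speedup = fit_time_speedup(demo_ref, demo_new)
```

After the benchmark has run, calling `fit_time_speedup(res_skl, res_cuml)` would give the scikit-learn to cuML fit-time ratio for each training size.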

### Plot the results

In [None]:
fig = plt.figure(figsize = (15,4))
ax = fig.add_subplot(131)
ax.plot(res_skl[:,0], res_skl[:,1], 'o-', label='scikit-learn')
ax.plot(res_thunder[:,0], res_thunder[:,1], 's-', label='ThunderSVM')
ax.plot(res_cuml[:,0], res_cuml[:,1], '>-', label='cuML SVM')
   
ax.set_xlabel('n_samples')
ax.set_ylabel('train time (s)')
ax.legend()
ax.set_title('time to train')

ax = fig.add_subplot(132)
ax.plot(res_skl[:,0], res_skl[:,2], 'o-', label='scikit-learn')
ax.plot(res_thunder[:,0], res_thunder[:,2], 's-', label='ThunderSVM')
ax.plot(res_cuml[:,0], res_cuml[:,2], '>-', label='cuML SVM')
ax.set_xlabel('n_samples')
ax.set_ylabel('predict time (s)')
ax.legend()
ax.set_title('time to predict')

ax = fig.add_subplot(133)
ax.plot(res_skl[:,0], res_skl[:,3]*100, 'o-', label='scikit-learn')
ax.plot(res_thunder[:,0], res_thunder[:,3]*100, '*-', label='ThunderSVM')
ax.plot(res_cuml[:,0], res_cuml[:,3]*100, '>-', label='cuML SVM')
ax.set_xlabel('n_samples')
ax.set_ylabel('accuracy %')
ax.legend()
ax.set_title('accuracy')
plt.show()