## Introduction and Key Concepts

cuML accelerates machine learning on GPUs. The library follows a few key principles, and understanding them will help you take full advantage of cuML.

### 1. Where possible, match the scikit-learn API

cuML estimators look and feel just like scikit-learn estimators. You initialize them with key parameters, fit them with a fit method, then call predict or transform for inference.

### 2. Accept flexible input types, return predictable output types

cuML estimators can accept NumPy arrays, cuDF DataFrames, CuPy arrays, 2D PyTorch tensors, and really any kind of standards-based Python array input you can throw at them.

By default, outputs will mirror the data type you provided.

### 3. Be fast!

On a modern GPU, cuML estimators can outperform their CPU-based equivalents by anything from 4x (for a medium-sized linear regression) to over 1000x (for large-scale t-SNE dimensionality reduction). In many cases, the performance advantage becomes apparent only as the dataset grows.

## cuML Estimators

This notebook provides an overview of several machine learning estimators in cuML, demonstrating how to train them and evaluate them with built-in metric functions. For more in-depth information and examples, please refer to the documentation.

In [34]:
import cudf
import cuml
from cuml.model_selection import train_test_split
from cuml.datasets import make_classification, make_regression

NFEATURES = 20

X, y = make_classification(
    n_samples=100000,
    n_features=NFEATURES,
    n_informative=NFEATURES,
    n_redundant=0,
    n_classes=2,
    class_sep=0.01,
    random_state=12
)

Xr, yr = make_regression(
    n_samples=100000,
    n_features=NFEATURES,
    n_informative=NFEATURES,
    noise=90,
    random_state=12
)


X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    random_state=0)

Xr_train, Xr_test, yr_train, yr_test = train_test_split(Xr, yr,
                                                        random_state=42)

### Regressors

In [35]:
clf = cuml.linear_model.Lasso()
clf.fit(Xr_train, yr_train)
print(clf.score(Xr_test, yr_test))
print(clf.coef_)

0.8731609582901001
[42.26208   31.648277  97.97266    8.028292  98.45462   17.758625
 29.110216  42.0269    41.503304  50.976448  38.841347  79.8951
 15.154717  35.345844  77.980385  89.59795   27.576366   3.3293636
 37.284145   2.56229  ]


In [36]:
clf = cuml.linear_model.Ridge()
clf.fit(Xr_train, yr_train)
print(clf.score(Xr_test, yr_test))
print(clf.coef_)

0.8735741376876831
[43.275894  32.63923   98.98531    9.039696  99.46612   18.758041
 30.10358   43.022507  42.504524  51.973602  39.85379   80.87418
 16.1243    36.368073  78.97133   90.585976  28.600662   4.368889
 38.307487   3.5439644]


In [37]:
clf = cuml.neighbors.KNeighborsRegressor()
clf.fit(Xr_train, yr_train)
print(clf.score(Xr_test, yr_test))

0.6750473976135254
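
As in scikit-learn, the number printed by `score` for a regressor is understood here to be the coefficient of determination, R². The block below is a minimal pure-Python sketch of that metric, not cuML's implementation; the helper name `r2_score_sketch` and the toy values are made up for illustration.

```python
def r2_score_sketch(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    # Residual sum of squares: error left over after the model's predictions.
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    # Total sum of squares: error of always predicting the mean of y_true.
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


y_true = [3.0, -0.5, 2.0, 7.0]
y_pred = [2.5, 0.0, 2.0, 8.0]
r2 = r2_score_sketch(y_true, y_pred)
print(round(r2, 4))  # 0.9486
```

A score of 1.0 corresponds to perfect predictions, while always predicting the mean of `y_true` gives 0.0.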


### Classifiers

In [38]:
clf = cuml.linear_model.LogisticRegression()
clf.fit(X_train, y_train)
print(clf.score(X_test, y_test))
print(clf.coef_.ravel())

[E] [12:51:03.483279] L-BFGS line search failed
0.497079998254776
[-3.7582314e-03  9.1156486e-04  3.1163811e-03  1.0851971e-03
  7.8051696e-03  4.0029390e-03  4.2287796e-04 -7.8248983e-04
  4.5197555e-03  3.6304430e-03  2.1824990e-04 -1.9584077e-03
  8.6510426e-04  4.9314434e-03 -4.2862576e-03  4.9139344e-05
 -3.5439176e-03  1.7734672e-03  4.5621200e-03 -1.0905696e-02]


In [39]:
clf = cuml.linear_model.LogisticRegression(penalty="l1")
clf.fit(X_train, y_train)
print(clf.score(X_test, y_test))
print(clf.coef_.ravel())

0.49724000692367554
[-3.7409374e-03  9.3495927e-04  3.0662529e-03  1.0176267e-03
  7.7631096e-03  4.0295441e-03  4.1809116e-04 -7.7687390e-04
  4.5163226e-03  3.6031262e-03  2.0021595e-04 -2.0515765e-03
  8.1636111e-04  4.8390576e-03 -4.2563956e-03  5.7828958e-05
 -3.5315230e-03  1.7584465e-03  4.5539355e-03 -1.0913123e-02]


In [40]:
clf = cuml.neighbors.KNeighborsClassifier(n_neighbors=5)
clf.fit(X_train, y_train)
print(clf.score(X_test, y_test))

0.9721199870109558


In [41]:
clf = cuml.neighbors.KNeighborsClassifier(n_neighbors=10)
clf.fit(X_train, y_train)
print(clf.score(X_test, y_test))

0.974399983882904


### Intermission: CPU Performance Comparison

In [42]:
from sklearn.neighbors import KNeighborsClassifier as sk_KNeighborsClassifier

In [43]:
X_train_cpu, X_test_cpu = X_train.get(), X_test.get()
y_train_cpu, y_test_cpu = y_train.get(), y_test.get()

In [45]:
%%time

clf = sk_KNeighborsClassifier(n_neighbors=10, n_jobs=-1)
clf.fit(X_train_cpu, y_train_cpu)
print(clf.score(X_test_cpu, y_test_cpu))

0.9744
CPU times: user 12min 17s, sys: 23min 57s, total: 36min 15s
Wall time: 50.2 s


In [46]:
%%time

clf = cuml.neighbors.KNeighborsClassifier(n_neighbors=10)
clf.fit(X_train, y_train)
print(clf.score(X_test, y_test))

0.974399983882904
CPU times: user 45.9 ms, sys: 159 ms, total: 205 ms
Wall time: 201 ms
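
Dividing the two wall-clock times reported above gives a rough speedup for this particular run. The arithmetic below is only a sketch; the figure depends on the hardware and will vary from run to run.

```python
cpu_wall_seconds = 50.2    # scikit-learn KNeighborsClassifier, wall time reported above
gpu_wall_seconds = 0.201   # cuML KNeighborsClassifier, wall time reported above (201 ms)

speedup = cpu_wall_seconds / gpu_wall_seconds
print(f"Approximate speedup: {speedup:.0f}x")  # Approximate speedup: 250x
```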


### Back to cuML Classifiers

In [9]:
clf = cuml.svm.SVC()
clf.fit(X_train, y_train)
print(clf.score(X_test, y_test))

0.9753999710083008


In [10]:
clf = cuml.svm.SVC(C=10)
clf.fit(X_train, y_train)
print(clf.score(X_test, y_test))

0.980679988861084


### Clusterers

In [11]:
kmeans = cuml.cluster.KMeans(n_clusters=2)
kmeans.fit(X_train)

print("labels:")
print(kmeans.labels_)
print("cluster_centers:")
print(kmeans.cluster_centers_)

labels:
[1 0 1 ... 1 0 1]
cluster_centers:
[[-0.52050364  1.3660686  -0.5398778   1.138281    0.885171    0.59392375
   0.24359192  0.5743805  -0.02063601 -0.30981672  0.4525381  -0.4883322
  -0.36849728 -0.40376672 -0.27069744  0.80245805  0.9221232  -0.26733166
  -0.38114342  0.44678494]
 [ 0.5452945  -1.3669842   0.5409455  -1.1116333  -0.90776545 -0.6070346
  -0.25297645 -0.58496463  0.03812201  0.31686893 -0.43159282  0.49286634
   0.3576051   0.42367586  0.28590336 -0.8342137  -0.9239553   0.24012113
   0.36838338 -0.49222034]]


### Dimensionality Reduction

In [12]:
pca = cuml.decomposition.PCA(n_components=2)
pca.fit(X_train)

print(f'Components: {pca.components_}')
print(f'Explained variance: {pca.explained_variance_}')
exp_var = pca.explained_variance_ratio_
print(f'Explained variance ratio: {exp_var}')

print(f'Singular values: {pca.singular_values_}')
print(f'Mean: {pca.mean_}')
print(f'Noise variance: {pca.noise_variance_}')

Components: [[-0.19620655  0.48750752 -0.21592808  0.37008977  0.31834662  0.19903679
   0.10499372  0.21698622  0.00413262 -0.1112079   0.10932487 -0.1532098
  -0.12400512 -0.1290357  -0.10037347  0.30771324  0.3142921  -0.096614
  -0.1489145   0.16214935]
 [-0.01918813 -0.15371218  0.34882915  0.31031907  0.17148638 -0.0455441
  -0.23478465 -0.18442973 -0.06603965 -0.12563434  0.41213584 -0.3326168
  -0.14456522 -0.13729379  0.22417799 -0.18594134  0.07163288  0.17343424
   0.36539772  0.2084818 ]]
Explained variance: [12.816997 12.112327]
Explained variance ratio: [0.09732978 0.09197865]
Singular values: [980.43964 953.1067 ]
Mean: [ 0.01252334 -0.00078583  0.00066356  0.01305394 -0.01151238 -0.00669956
 -0.00475181 -0.00543115  0.00875004  0.00360132  0.01036651  0.00238477
 -0.00535896  0.01005383  0.00766974 -0.01607423 -0.00113762 -0.01354437
 -0.00629009 -0.02283039]
Noise variance: [0.]
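
The printed attributes are related to one another. Assuming cuML follows scikit-learn's convention that `explained_variance_` equals the squared singular values divided by `n_samples - 1`, and assuming the training split holds 75,000 rows (a 25% test split of the 100,000 generated samples), the singular values can be recovered from the explained variances. The numbers are copied from the output above.

```python
import math

n_samples = 75_000  # assumed size of X_train (75% of 100,000 samples)

explained_variance = [12.816997, 12.112327]  # from the output above
singular_values = [980.43964, 953.1067]      # from the output above

# Under the assumed convention: singular_value = sqrt(variance * (n_samples - 1))
reconstructed = [math.sqrt(var * (n_samples - 1)) for var in explained_variance]

for printed, rebuilt in zip(singular_values, reconstructed):
    print(f"printed={printed:.2f}  reconstructed={rebuilt:.2f}")
```

The two columns agree to within float32 rounding, which is consistent with the assumed convention.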


## cuML Preprocessing

cuML supports a wide variety of preprocessing capabilities, mirroring the scikit-learn API where appropriate and extending it in other areas.

In [13]:
from cuml.preprocessing import Binarizer

binarizer = Binarizer()
binarizer.fit_transform(X_train)

array([[0., 0., 0., ..., 1., 0., 0.],
       [0., 0., 0., ..., 1., 0., 0.],
       [1., 0., 0., ..., 1., 1., 1.],
       ...,
       [1., 0., 1., ..., 0., 1., 0.],
       [1., 1., 0., ..., 1., 0., 1.],
       [0., 0., 1., ..., 1., 0., 0.]], dtype=float32)
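
With its default settings, `Binarizer` is expected to behave like the scikit-learn transformer it mirrors: values strictly greater than a threshold of 0.0 become 1, and everything else becomes 0. Below is a pure-Python sketch of that rule; `binarize_sketch` is a hypothetical helper, not part of cuML.

```python
def binarize_sketch(rows, threshold=0.0):
    """Map each value to 1.0 if it is strictly above the threshold, else 0.0."""
    return [[1.0 if value > threshold else 0.0 for value in row] for row in rows]


sample = [[-1.2, 0.0, 0.7], [2.5, -0.1, 0.0]]
binarized = binarize_sketch(sample)
print(binarized)  # [[0.0, 0.0, 1.0], [1.0, 0.0, 0.0]]
```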

In [14]:
from cuml.preprocessing import KBinsDiscretizer

discretizer = KBinsDiscretizer(n_bins=3, encode='ordinal', strategy='uniform')
discretizer.fit_transform(X_train)

array([[1, 1, 1, ..., 1, 1, 1],
       [1, 1, 0, ..., 2, 0, 1],
       [1, 1, 0, ..., 1, 1, 2],
       ...,
       [1, 1, 1, ..., 1, 1, 1],
       [1, 2, 1, ..., 1, 1, 1],
       [1, 1, 1, ..., 1, 1, 1]], dtype=int32)
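
To make the `strategy='uniform'` and `encode='ordinal'` settings concrete: each feature's range is split into `n_bins` equal-width intervals, and each value is replaced by the index of the interval it falls in. The following single-column sketch assumes a non-constant column and that a value lying exactly on an interior edge belongs to the bin on its right; `uniform_bins_sketch` is a hypothetical helper, not the cuML implementation.

```python
from bisect import bisect_right


def uniform_bins_sketch(column, n_bins=3):
    """Return the ordinal bin index of each value using equal-width bins."""
    lo, hi = min(column), max(column)
    width = (hi - lo) / n_bins
    # Interior edges only: the outermost edges are the column min and max.
    interior_edges = [lo + width * i for i in range(1, n_bins)]
    return [bisect_right(interior_edges, value) for value in column]


column = [0.0, 1.0, 2.0, 3.0, 4.0, 6.0]
bin_ids = uniform_bins_sketch(column)
print(bin_ids)  # [0, 0, 1, 1, 2, 2]
```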

In [15]:
from cuml.preprocessing import PolynomialFeatures

poly = PolynomialFeatures(2)
result = poly.fit_transform(X_train)
print(result.shape)
print(result)

(75000, 231)
[[ 1.00000000e+00 -1.14202595e+00 -6.54440224e-01 ...  6.81346804e-02
   4.15505946e-01  2.53388143e+00]
 [ 1.00000000e+00 -1.45847034e+00 -8.49974513e-01 ...  4.62020073e+01
   2.33691654e+01  1.18202190e+01]
 [ 1.00000000e+00  2.27742687e-01 -1.14044130e+00 ...  1.71411586e+00
   4.44261217e+00  1.15142765e+01]
 ...
 [ 1.00000000e+00  2.66800523e+00 -7.49021471e-01 ...  1.02879000e+00
  -3.14288735e+00  9.60132027e+00]
 [ 1.00000000e+00  8.77428353e-01  4.40193510e+00 ...  1.05567837e+00
  -2.13801026e+00  4.33000088e+00]
 [ 1.00000000e+00 -3.55584711e-01 -1.69996846e+00 ...  2.73472723e-03
   3.68916094e-02  4.97669697e-01]]
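
The 231 output columns follow from counting monomials: for 20 input features and degree 2 there is 1 bias term, 20 linear terms, and 210 squares and pairwise products. A short check using only the standard library:

```python
from itertools import combinations_with_replacement
from math import comb

n_features, degree = 20, 2

# Enumerate every monomial of total degree 0, 1, ..., degree.
n_terms = sum(
    1
    for d in range(degree + 1)
    for _ in combinations_with_replacement(range(n_features), d)
)
print(n_terms)  # 231

# Closed form for the same count: C(n_features + degree, degree).
closed_form = comb(n_features + degree, degree)
print(closed_form)  # 231
```

The count grows quickly with the degree, which is worth keeping in mind before feeding polynomial features into a downstream estimator.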


In [16]:
stemmer = cuml.preprocessing.text.stem.PorterStemmer()
word_stems = cudf.Series(['revivals', 'singing', 'adjustable'])
stemmer.stem(word_stems)

0     reviv
1      sing
2    adjust
dtype: object

## Pipelines

To quote the wonderful scikit-learn documentation, `Pipeline` "sequentially [applies] a list of transforms and a final estimator" to a dataset.

By collecting transformations and training into a single pipeline, we can confidently do things like cross-validation and hyper-parameter optimization without worrying about data leakage.

cuML transformations and estimators are fully compatible with the scikit-learn Pipeline API.
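
To see what data leakage means here, consider a scaler whose statistics are computed once on all of the data rather than on the training fold alone. The sketch below uses a hypothetical, pure-Python `StandardScalerSketch` (not the cuML class) and made-up numbers to show how the fitted mean shifts when a held-out value leaks into the fit. A pipeline avoids this because `fit` only ever sees the training data.

```python
from statistics import mean, pstdev


class StandardScalerSketch:
    """Toy one-dimensional scaler: subtract the mean, divide by the std."""

    def fit(self, values):
        self.mean_ = mean(values)
        self.scale_ = pstdev(values) or 1.0  # guard against zero variance
        return self

    def transform(self, values):
        return [(v - self.mean_) / self.scale_ for v in values]


train = [1.0, 2.0, 3.0, 4.0]
held_out = [10.0]

# Correct: statistics come from the training fold only.
clean = StandardScalerSketch().fit(train)
print(clean.mean_)  # 2.5

# Leaky: the held-out value influences the statistics used for training.
leaky = StandardScalerSketch().fit(train + held_out)
print(leaky.mean_)  # 4.0

# The held-out point is transformed with training-only statistics.
print(clean.transform(held_out))
```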

In [17]:
clf = cuml.svm.SVC()
clf.fit(X_train, y_train)
clf.score(X_test, y_test)

0.9753999710083008

In [18]:
from sklearn.pipeline import Pipeline

pipe = Pipeline([
    ('scaler', cuml.preprocessing.StandardScaler()),
    ('svc', cuml.svm.SVC()),
])
pipe.fit(X_train, y_train)
pipe.score(X_test, y_test)

0.9747200012207031

In [19]:
# This is entirely on the GPU!

pipe = Pipeline([
    ("polynomial_features", cuml.preprocessing.PolynomialFeatures(2)),
    ('scaler', cuml.preprocessing.StandardScaler()),
    ('svc', cuml.svm.SVC()),
])
pipe.fit(X_train, y_train)
pipe.score(X_test, y_test)

0.9798799753189087

## Explainability

Model explainability is often critically important. cuML provides a GPU-accelerated SHAP Kernel Explainer and a Permutation Explainer.

In [20]:
from cuml.explainer import KernelExplainer

Xr, yr = make_regression(
    n_samples=102,
    n_features=10,
    noise=0.1,
    random_state=42)

Xr_train, Xr_test, yr_train, yr_test = train_test_split(
    Xr,
    yr,
    test_size=2,
    random_state=42)

model = cuml.svm.SVR().fit(Xr_train, yr_train)

cu_explainer = KernelExplainer(
    model=model.predict,
    data=Xr_train,
    is_gpu_model=True)

cu_shap_values = cu_explainer.shap_values(Xr_test)
cu_shap_values

array([[-2.6573505 ,  0.43958002, -0.09425537, -0.10002708,  0.03937579,
        -0.20362636,  0.19862506,  0.59732676, -0.09788719, -0.1783818 ],
       [ 2.8891382 ,  0.12803541,  0.091938  ,  0.0609557 ,  0.06538174,
        -0.00621617,  0.15530689,  0.08116505,  0.07060298, -0.02306223]],
      dtype=float32)
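
Each row above holds one value per feature for one test sample. Kernel SHAP approximates Shapley values, which split the difference between a prediction and a baseline prediction across the features. The sketch below computes exact Shapley values by brute force for a made-up three-feature linear model. It uses a single baseline point instead of a background dataset, and it is not how `KernelExplainer` is implemented; `shapley_values_sketch` and `toy_model` are hypothetical names.

```python
from itertools import permutations
from math import factorial


def shapley_values_sketch(predict, x, baseline):
    """Exact Shapley values: average marginal contribution over all orderings."""
    n = len(x)
    phi = [0.0] * n
    for order in permutations(range(n)):
        current = list(baseline)
        previous = predict(current)
        for i in order:
            current[i] = x[i]             # switch feature i from baseline to x
            updated = predict(current)
            phi[i] += updated - previous  # marginal contribution of feature i
            previous = updated
    return [total / factorial(n) for total in phi]


def toy_model(v):
    return 2.0 * v[0] - 1.0 * v[1] + 0.5 * v[2]


x = [1.0, 2.0, 4.0]
baseline = [0.0, 0.0, 0.0]
phi = shapley_values_sketch(toy_model, x, baseline)
print(phi)  # [2.0, -2.0, 2.0]

# Additivity: the values sum to the prediction minus the baseline prediction.
print(sum(phi) == toy_model(x) - toy_model(baseline))  # True
```

Brute force needs every ordering of the features, which is why approximations such as Kernel SHAP are used beyond a handful of features.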

## Pickling Models

So far, we've only stored our models in memory. This final section demonstrates basic pickling of both single-GPU and multi-GPU cuML models for persistence.

### Single GPU

We can pickle individual estimators.

In [21]:
import pickle

In [22]:
with open("kmeans.pkl", "wb") as f:
    pickle.dump(kmeans, f)
with open("kmeans.pkl", "rb") as f:
    loaded_kmeans = pickle.load(f)
loaded_kmeans.labels_

array([1, 0, 1, ..., 1, 0, 1], dtype=int32)

We can even pickle the pipeline we made earlier.

In [23]:
with open("model.pkl", "wb") as f:
    pickle.dump(pipe, f)
with open("model.pkl", "rb") as f:
    loaded_pipeline = pickle.load(f)

print(loaded_pipeline.score(X_test, y_test))
loaded_pipeline.predict(X_test)

0.9798799753189087


array([1, 0, 1, ..., 1, 1, 0])

### Multi-GPU

This is the first use of a multi-GPU model in this notebook. cuML uses Dask for distributed model training. We'll walk through an example below, but we encourage you to explore the cuML documentation for more information and examples.

In [24]:
from dask.distributed import Client
from dask_cuda import LocalCUDACluster

cluster = LocalCUDACluster(CUDA_VISIBLE_DEVICES="0,1")
client = Client(cluster)
client

Client
  Scheduler: tcp://127.0.0.1:35899
  Dashboard: http://127.0.0.1:8787/status
Cluster
  Workers: 2
  Cores: 2
  Memory: 1.48 TiB


In [25]:
from cuml.dask.datasets import make_blobs
from cuml.dask.cluster import KMeans

n_workers = len(client.scheduler_info()["workers"].keys())

X, y = make_blobs(n_samples=5000,
                  n_features=20,
                  centers=3,
                  cluster_std=0.4,
                  random_state=0,
                  n_parts=n_workers*5)

X = X.persist()
y = y.persist()

dist_model = KMeans(n_clusters=5)
dist_model.fit(X)

<cuml.dask.cluster.kmeans.KMeans at 0x7fa970d25700>

We can combine the distributed cuML model into a single-GPU model. Then, we can pickle it as we did above.

In [26]:
single_gpu_model = dist_model.get_combined_model()
with open("distributed_kmeans_model.pkl", "wb") as f:
    pickle.dump(single_gpu_model, f)

In [27]:
with open("distributed_kmeans_model.pkl", "rb") as f:
    loaded_single_gpu_model = pickle.load(f)
loaded_single_gpu_model.cluster_centers_

array([[-4.429179 ,  5.4974027, -5.678268 , -1.6267937, -9.382265 ,
         0.5799359,  4.296654 , -2.9607224, -5.014361 ,  9.608334 ,
         8.356784 , -6.348927 , -6.358568 ,  2.0520344,  4.157308 ,
        -9.048441 ,  4.6119018,  8.822718 ,  6.8774323,  2.209063 ],
       [-5.7150435,  2.1754901, -3.9756098, -1.690938 , -5.2170424,
         7.543781 ,  2.7733128,  8.433652 ,  1.6140108,  0.9986322,
        -2.719721 ,  4.483001 , -4.410575 ,  2.3989952,  1.6275728,
        -2.4665058, -5.2080455, -1.7500803, -8.178893 ,  2.6448464],
       [ 4.8104296,  8.403568 , -9.23086  ,  9.379987 ,  8.524553 ,
        -1.0736501,  3.3421483, -7.806808 , -0.5735142,  0.2650673,
         5.5215764, -4.1019826,  4.268111 , -2.8475323,  3.6268995,
        -4.1613436, -3.608778 ,  6.2141366, -6.914194 , -1.0919937],
       [-5.9435043,  2.2477934, -3.766747 , -1.6608801, -5.4187307,
         7.610018 ,  3.0834143,  8.648033 ,  1.5557412,  1.1172905,
        -3.0038812,  4.449345 , -4.4684057,  