# Introduction to cuML

What is cuML?

## Introduction and key concepts

cuML accelerates machine learning on GPUs. The library follows a few key principles, and understanding them will help you take full advantage of cuML.

### 1. Where possible, match the scikit-learn API

cuML estimators look and feel just like scikit-learn estimators. You initialize them with key parameters, fit them with a fit method, then call predict or transform for inference.
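
The pattern can be sketched without a GPU. The class below is a hypothetical toy estimator (`MeanRegressor` and its `shrinkage` parameter are invented for illustration and are not part of cuML or scikit-learn). It follows the same conventions: hyperparameters go in the constructor, `fit` stores learned state and returns the estimator, and `predict` uses that state.

```python
class MeanRegressor:
    """Hypothetical toy estimator illustrating the scikit-learn-style contract."""

    def __init__(self, shrinkage=0.0):
        # Hyperparameters are set at construction time and are never learned.
        self.shrinkage = shrinkage

    def fit(self, X, y):
        # Learned state lives in attributes with a trailing underscore.
        self.mean_ = (1.0 - self.shrinkage) * sum(y) / len(y)
        return self  # returning self allows chaining: est.fit(X, y).predict(X)

    def predict(self, X):
        # Predict the stored mean for every row of X.
        return [self.mean_ for _ in X]


X = [[0.0], [1.0], [2.0], [3.0]]
y = [1.0, 2.0, 3.0, 6.0]

model = MeanRegressor().fit(X, y)
predictions = model.predict([[10.0], [20.0]])
print(predictions)
```

The cuML estimators in this notebook, such as `cuml.linear_model.LogisticRegression`, are driven through the same three steps.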

### 2. Accept flexible input types, return predictable output types

cuML estimators accept NumPy arrays, cuDF DataFrames, CuPy arrays, 2D PyTorch tensors, and essentially any other standards-based Python array input you can throw at them.

By default, outputs will mirror the data type you provided.

### 3. Be fast!

On a modern GPU, cuML estimators can outperform CPU-based equivalents by anywhere from 4x (for a medium-sized linear regression) to over 1000x (for large-scale t-SNE dimensionality reduction). In many cases, the performance advantage grows with the size of the dataset.

## cuML vs scikit-learn

## cuML Estimators

This notebook provides an overview of several machine learning estimators in cuML and demonstrates how to train and evaluate them with built-in metrics functions. For more in-depth information and examples, please refer to the cuML documentation.

In [121]:
import cudf
import cuml
from cuml.svm import SVC
from cuml.preprocessing import StandardScaler, PolynomialFeatures
from cuml.model_selection import train_test_split
from cuml.datasets import make_classification

NFEATURES = 20

X, y = make_classification(
    n_samples=100000,
    n_features=NFEATURES,
    n_informative=NFEATURES,
    n_redundant=0,
    n_classes=2,
    class_sep=0.01,
    random_state=0
)

X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    random_state=0)

### Regressors

### Classifiers

In [86]:
clf = cuml.linear_model.LogisticRegression()
clf.fit(X_train, y_train)
print(clf.score(X_test, y_test))
print(clf.coef_.ravel())

0.5044000148773193
[-0.00013668 -0.0040487   0.00017586 -0.00540551 -0.00147197 -0.00204108
  0.00046605  0.00248705  0.0037112   0.00328126  0.00025797  0.00146517
  0.00565837 -0.00182544  0.00371502 -0.001151   -0.00102958  0.00103293
 -0.00349904 -0.00844826]


In [92]:
clf = cuml.linear_model.LogisticRegression(penalty="l1")
clf.fit(X_train, y_train)
print(clf.score(X_test, y_test))
print(clf.coef_.ravel())

[E] [09:28:56.829869] QWL-QN line search failed
0.5035600066184998
[-0.00010016 -0.00403982  0.00014952 -0.00536854 -0.00146793 -0.00205053
  0.000467    0.00254936  0.00367489  0.00327556  0.00027593  0.00140943
  0.00564877 -0.00188041  0.00372037 -0.001163   -0.00100568  0.00103245
 -0.00349095 -0.00846354]


In [98]:
clf = cuml.neighbors.KNeighborsClassifier(n_neighbors=5)
clf.fit(X_train, y_train)
print(clf.score(X_test, y_test))

0.9712399840354919


In [99]:
clf = cuml.neighbors.KNeighborsClassifier(n_neighbors=10)
clf.fit(X_train, y_train)
print(clf.score(X_test, y_test))

0.9761199951171875
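
To make the role of `n_neighbors` concrete, here is a minimal plain-Python sketch of brute-force k-nearest-neighbors classification: each query point takes the majority label among its `n_neighbors` closest training points. It illustrates the general technique only. The helper `knn_predict` and the toy data are hypothetical, and nothing here describes how cuML implements the search or breaks ties.

```python
from collections import Counter
import math


def knn_predict(train_X, train_y, query, n_neighbors):
    """Classify `query` by majority vote among its nearest training points."""
    # Brute force: Euclidean distance from the query to every training point.
    distances = [
        (math.dist(point, query), label)
        for point, label in zip(train_X, train_y)
    ]
    # Keep the labels of the n_neighbors closest points.
    nearest = sorted(distances, key=lambda pair: pair[0])[:n_neighbors]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Toy data: three points near the origin (class 0), three near (1, 1) (class 1).
train_X = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (1.0, 1.0), (1.1, 0.9), (0.9, 1.2)]
train_y = [0, 0, 0, 1, 1, 1]

pred_near_origin = knn_predict(train_X, train_y, (0.3, 0.3), n_neighbors=3)
pred_near_ones = knn_predict(train_X, train_y, (0.8, 0.9), n_neighbors=3)
print(pred_near_origin, pred_near_ones)
```

A larger `n_neighbors` averages over more points, which smooths the decision boundary at the cost of local detail.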


In [101]:
clf = cuml.svm.SVC()
clf.fit(X_train, y_train)
print(clf.score(X_test, y_test))

0.9762799739837646


In [103]:
clf = cuml.svm.SVC(C=10)
clf.fit(X_train, y_train)
print(clf.score(X_test, y_test))

0.9816399812698364


### Clusterers

In [110]:
kmeans = cuml.cluster.KMeans()
kmeans.fit(X_train)

print("labels:")
print(kmeans.labels_)
print("cluster_centers:")
print(kmeans.cluster_centers_)

labels:
[3 1 7 ... 5 3 7]
cluster_centers:
[[-1.2001286   0.2915472   0.73022574 -1.4476534   0.12419276  0.6032779
  -1.3671764   0.4796237  -1.2630613  -1.9679332  -0.05651553  1.1516273
   0.0911471  -0.47311252 -0.07382447 -0.40850422  1.1987472   1.3432238
  -1.3668767  -0.67849404]
 [ 1.1188177  -0.36091506 -0.05150037  1.4357436   0.7456412  -1.5450572
   2.1083956  -0.0056184  -0.4879958   0.5860339   0.93496215  1.467241
  -2.200274    0.0739655  -0.98056763 -0.82171    -1.1563358  -1.5384593
  -0.29150116 -0.17171538]
 [ 0.3375449  -1.3611176   1.5672036  -0.33588976  0.44241628  1.1863414
  -0.54662424 -0.86296666  0.53161645 -0.9384144   0.362147   -0.87245566
  -1.7132689  -1.983615    1.2805843   0.49160016 -0.97978216  0.17033358
  -1.10449     3.1581523 ]
 [ 1.4229035  -1.0691209  -0.19737092 -1.6390495   1.2999135   0.8948438
   0.26133105  2.3867197   0.18294281  1.8707604   0.577021    1.6806108
   0.9930441  -0.68840504 -2.0330095   1.0724192   0.93112355  1.0622234

### Dimensionality Reduction

In [119]:
pca = cuml.decomposition.PCA(n_components=2)
pca.fit(X_train)

print(f'Components: {pca.components_}')
print(f'Explained variance: {pca.explained_variance_}')
exp_var = pca.explained_variance_ratio_
print(f'Explained variance ratio: {exp_var}')

print(f'Singular values: {pca.singular_values_}')
print(f'Mean: {pca.mean_}')
print(f'Noise variance: {pca.noise_variance_}')

Components: [[-0.22738276 -0.04777836  0.12365042  0.15176034 -0.13479295  0.11212273
  -0.09851787 -0.4067749   0.01762475 -0.35134166  0.0454287  -0.23628762
  -0.31318912 -0.10454161  0.4274099  -0.1378494  -0.18946336 -0.2125933
  -0.24190477  0.26424477]
 [ 0.01717635  0.32100588 -0.13514338  0.38807714 -0.13682115 -0.46206656
   0.10881904 -0.23131591  0.0594103   0.05421089 -0.1797142  -0.3431435
   0.15159144  0.26999882  0.0112941  -0.04975148 -0.07820843 -0.10089543
  -0.02100286 -0.39810154]]
Explained variance: [13.981019 11.280637]
Explained variance ratio: [0.10451543 0.08432865]
Singular values: [1023.99335  919.8024 ]
Mean: [ 0.00325871  0.00313095 -0.00831611 -0.00489904  0.01347446  0.00533854
 -0.00385251  0.00124974 -0.01193748 -0.00200827 -0.00380131  0.01164985
 -0.00314607 -0.00098215 -0.00985608 -0.01344736 -0.00536451 -0.02178996
 -0.00178788 -0.01472216]
Noise variance: [0.]
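
The printed attributes are related to one another. Assuming cuML follows the usual PCA convention (as scikit-learn does), each explained variance equals the squared singular value divided by `n_samples - 1`, and each ratio is the explained variance divided by the total variance. The sketch below checks this against the numbers printed above. It assumes 75,000 training rows, which is the row count printed by the `PolynomialFeatures` cell in this notebook.

```python
import math

n_samples = 75_000  # rows in X_train (assumed from the shape printed in this notebook)
singular_values = [1023.99335, 919.8024]            # copied from the output above
explained_variance = [13.981019, 11.280637]         # copied from the output above
explained_variance_ratio = [0.10451543, 0.08432865]  # copied from the output above

# Explained variance recovered from the singular values: s**2 / (n_samples - 1).
recovered = [s ** 2 / (n_samples - 1) for s in singular_values]
print(recovered)

# Each ratio is variance / total variance, so both components
# should imply (approximately) the same total variance.
implied_totals = [v / r for v, r in zip(explained_variance, explained_variance_ratio)]
print(implied_totals)
```

Both components imply a total variance of roughly 133.77 across the 20 features, so the two retained components together explain about 19% of it.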


## cuML Preprocessing

cuML supports a wide variety of preprocessing capabilities, mirroring the scikit-learn API where appropriate and extending it in other areas.

In [132]:
from cuml.preprocessing import Binarizer

binarizer = Binarizer()
binarizer.fit_transform(X_train)

array([[0., 1., 1., ..., 1., 0., 1.],
       [0., 1., 0., ..., 1., 1., 0.],
       [0., 1., 0., ..., 1., 1., 0.],
       ...,
       [0., 0., 0., ..., 0., 1., 0.],
       [0., 0., 0., ..., 1., 1., 1.],
       [1., 1., 0., ..., 1., 0., 0.]], dtype=float32)
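
As a rough illustration of what the binarizer computes, here is a plain-Python sketch. It assumes cuML mirrors scikit-learn's `Binarizer` semantics: a default threshold of 0.0, with values strictly greater than the threshold mapped to 1 and everything else mapped to 0. The helper `binarize` and the sample values are hypothetical.

```python
def binarize(rows, threshold=0.0):
    """Map values strictly greater than `threshold` to 1.0, everything else to 0.0."""
    return [[1.0 if value > threshold else 0.0 for value in row] for row in rows]


sample = [[-2.2, 1.05, 0.0], [0.3, -0.7, 4.1]]
binarized = binarize(sample)
print(binarized)
```

Note that a value exactly equal to the threshold maps to 0 under this assumption.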

In [134]:
from cuml.preprocessing import KBinsDiscretizer

discretizer = KBinsDiscretizer(n_bins=3, encode='ordinal', strategy='uniform')
discretizer.fit_transform(X_train)

array([[1, 1, 1, ..., 1, 1, 1],
       [1, 1, 1, ..., 2, 2, 1],
       [1, 1, 1, ..., 2, 1, 1],
       ...,
       [0, 1, 0, ..., 1, 1, 1],
       [1, 1, 1, ..., 1, 1, 1],
       [1, 2, 1, ..., 2, 1, 1]], dtype=int32)
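
With `strategy='uniform'` and `encode='ordinal'`, each feature is split into equal-width bins and every value is replaced by its bin index. The sketch below shows the idea for a single column in plain Python. The helper `uniform_ordinal_bins` is hypothetical, and the edge handling follows scikit-learn's convention, which cuML is assumed (not confirmed here) to share.

```python
from bisect import bisect_right


def uniform_ordinal_bins(column, n_bins):
    """Assign each value the index of its equal-width bin (0 .. n_bins - 1)."""
    lo, hi = min(column), max(column)
    # Interior edges only: values below the first edge fall in bin 0,
    # values at or above the last interior edge fall in the final bin.
    interior_edges = [lo + (hi - lo) * i / n_bins for i in range(1, n_bins)]
    return [bisect_right(interior_edges, value) for value in column]


column = [0.0, 1.0, 2.0, 3.0, 6.0]
binned = uniform_ordinal_bins(column, n_bins=3)
print(binned)
```

Here the bin edges are 0, 2, 4, and 6, so a value of exactly 2.0 lands in bin 1. Because bins are computed per feature, the same raw value can receive different indices in different columns.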

In [144]:
from cuml.preprocessing import PolynomialFeatures

poly = PolynomialFeatures(2)
result = poly.fit_transform(X_train)
print(result.shape)
print(result)

(75000, 231)
[[ 1.0000000e+00 -2.2086940e+00  1.0529693e+00 ...  1.4669826e+00
  -1.1677995e+00  9.2963308e-01]
 [ 1.0000000e+00 -3.6910493e+00  3.7494469e+00 ...  2.9491693e+01
  -1.4921916e+01  7.5500441e+00]
 [ 1.0000000e+00 -2.4439375e+00  2.5241287e+00 ...  3.8922136e+00
  -1.6989938e+00  7.4162936e-01]
 ...
 [ 1.0000000e+00 -4.3003263e+00 -2.8717377e+00 ...  1.4919987e-03
  -1.6069251e-01  1.7307043e+01]
 [ 1.0000000e+00 -8.7786084e-01 -1.6236888e+00 ...  1.0309407e+00
   2.9249423e+00  8.2985249e+00]
 [ 1.0000000e+00  3.1875753e+00  4.1635799e+00 ...  5.7780309e+00
   3.6880007e+00  2.3539765e+00]]
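
The 231 columns in the shape above can be checked by counting monomials. For 20 input features and degree 2, the expansion contains one bias column, 20 linear terms, 20 squares, and one cross term per pair of distinct features. This assumes the defaults match scikit-learn's (`include_bias=True`, `interaction_only=False`), which is consistent with the leading column of ones in the output.

```python
import math

n_features = 20  # NFEATURES in this notebook
degree = 2

# Number of monomials of total degree <= `degree` in `n_features` variables,
# including the constant bias term: C(n_features + degree, degree).
n_output_features = math.comb(n_features + degree, degree)

# The same count, itemized for degree 2.
bias = 1
linear = n_features
squares = n_features
cross_terms = math.comb(n_features, 2)  # one per unordered pair of distinct features
itemized = bias + linear + squares + cross_terms

print(n_output_features, itemized)
```

The count grows quickly with the number of features and the degree, which is worth keeping in mind for GPU memory.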


In [151]:
stemmer = cuml.preprocessing.text.stem.PorterStemmer()
word_stems = cudf.Series(['revivals', 'singing', 'adjustable'])
stemmer.stem(word_stems)

0     reviv
1      sing
2    adjust
dtype: object

## Pipelines

To quote the wonderful scikit-learn documentation, a `Pipeline` lets you "sequentially apply a list of transforms and a final estimator" to a dataset.

By collecting transformations and training into a single pipeline, we can confidently do things like cross-validation and hyperparameter optimization without worrying about data leakage.

cuML transformations and estimators are fully compatible with the scikit-learn Pipeline API.
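
The data leakage concern can be illustrated with a small plain-Python sketch. A scaler must learn its statistics from the training data only. If it is fitted on the combined data, held-out values influence how the training data is transformed. The helpers `fit_scaler` and `transform` and the toy values are hypothetical. The population standard deviation is used on the assumption that this matches the usual `StandardScaler` behavior.

```python
import statistics


def fit_scaler(values):
    """Learn centering and scaling statistics from `values` only."""
    return statistics.fmean(values), statistics.pstdev(values)


def transform(values, mean, std):
    """Standardize `values` with previously learned statistics."""
    return [(v - mean) / std for v in values]


train = [1.0, 2.0, 3.0, 4.0]
test = [100.0]  # an extreme held-out value

# Leaky: statistics computed on train + test, so the held-out point
# influences how the training data is scaled.
leaky_mean, leaky_std = fit_scaler(train + test)

# Pipeline-style: statistics come from the training data alone and are
# then reused, unchanged, to transform the held-out data.
clean_mean, clean_std = fit_scaler(train)
test_scaled = transform(test, clean_mean, clean_std)

print(leaky_mean, clean_mean)
```

A pipeline enforces the second pattern automatically: calling `fit` fits every step on the training data, and `score` or `predict` only applies the already fitted transforms.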

In [159]:
clf = cuml.svm.SVC()
clf.fit(X_train, y_train)
clf.score(X_test, y_test)

0.9759600162506104

In [160]:
from sklearn.pipeline import Pipeline

pipe = Pipeline([
    ('scaler', cuml.preprocessing.StandardScaler()),
    ('svc', cuml.svm.SVC()),
])
pipe.fit(X_train, y_train)
pipe.score(X_test, y_test)

0.9761999845504761

In [161]:
# This is entirely on the GPU!

pipe = Pipeline([
    ("polynomial_features", cuml.preprocessing.PolynomialFeatures(2)),
    ('scaler', cuml.preprocessing.StandardScaler()),
    ('svc', cuml.svm.SVC()),
])
pipe.fit(X_train, y_train)
pipe.score(X_test, y_test)

0.9809200167655945

## Explainability

Model explainability is often critically important. cuML provides a GPU-accelerated SHAP Kernel Explainer and a Permutation Explainer.

## Pickling Models

So far, we've only stored our models in memory. This final section demonstrates basic pickling of both single-GPU and multi-GPU cuML models for persistence.

This is the first use of a multi-GPU model in this notebook. cuML uses Dask for distributed model training. We'll walk through an example below, but we encourage you to explore the cuML documentation for more information and examples.
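
As a minimal single-process sketch of the pickle round trip (this is not the Dask-based multi-GPU walkthrough mentioned above), the code below serializes an object and loads it back using only the standard library. The `SimpleNamespace` object is a hypothetical stand-in for a fitted model. With cuML installed, a fitted estimator would take its place.

```python
import io
import pickle
from types import SimpleNamespace


def round_trip(model):
    """Serialize `model` to bytes and load it back, as you would via a file."""
    buffer = io.BytesIO()
    pickle.dump(model, buffer)
    buffer.seek(0)
    return pickle.load(buffer)


# Hypothetical stand-in for a fitted estimator's learned state (not a cuML class).
original = SimpleNamespace(coef_=[0.5, -1.0], intercept_=0.25)
restored = round_trip(original)

print(restored.coef_, restored.intercept_)
print(restored is original)
```

To persist to disk, replace the in-memory buffer with a file opened in binary mode (`open(path, "wb")` for writing and `open(path, "rb")` for reading). Only unpickle files from sources you trust, since loading a pickle can execute arbitrary code.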