# Introduction to cuML

What is cuML?

## Introduction and key concepts

cuML accelerates machine learning on GPUs. The library follows a few key principles, and understanding them will help you take full advantage of cuML.

### 1. Where possible, match the scikit-learn API

cuML estimators look and feel just like scikit-learn estimators. You initialize them with key parameters, fit them with a `fit` method, then call `predict` or `transform` for inference.
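The initialize, fit, predict pattern described above can be sketched as follows. This is a minimal illustration rather than a canonical recipe: the synthetic data and the names `rng`, `true_coef`, and `predictions` are assumptions made for the example, and the `except ImportError` fallback to scikit-learn is there only so the snippet also runs on a machine without cuML installed.

```python
import numpy as np

try:
    # GPU path: needs a CUDA-capable GPU and a cuML installation.
    from cuml.linear_model import LinearRegression
except ImportError:
    # CPU fallback with the same interface, used only so the sketch runs anywhere.
    from sklearn.linear_model import LinearRegression

# Synthetic data (an assumption for this example): y is a noisy
# linear function of three features.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3)).astype(np.float32)
true_coef = np.array([2.0, -1.0, 0.5], dtype=np.float32)
y = X @ true_coef + 0.01 * rng.normal(size=1000).astype(np.float32)

# 1. Initialize the estimator with its key parameters.
model = LinearRegression(fit_intercept=True)

# 2. Fit it on the training data.
model.fit(X, y)

# 3. Call predict for inference.
predictions = model.predict(X[:5])
print(predictions)
```

Because only the import line differs between the two branches, the rest of the code is unchanged whether it runs on the GPU or the CPU.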

### 2. Accept flexible input types, return predictable output types

cuML estimators can accept NumPy arrays, cuDF dataframes, CuPy arrays, 2D PyTorch tensors, and essentially any other standards-based Python array input.

By default, outputs will mirror the data type you provided.
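As a small sketch of this mirroring behavior, the example below passes a NumPy array to a clusterer and checks the type of the result. The blob data and the variable names are assumptions chosen for illustration, and the scikit-learn fallback exists only so the snippet runs without a GPU (where NumPy output is of course the only option).

```python
import numpy as np

try:
    from cuml.cluster import KMeans  # GPU path
except ImportError:
    from sklearn.cluster import KMeans  # CPU fallback so the sketch runs anywhere

# Two well-separated blobs, held in a host-side NumPy array
# (illustrative data, not from the notebook).
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(loc=-5.0, size=(100, 2)),
    rng.normal(loc=5.0, size=(100, 2)),
]).astype(np.float32)

kmeans = KMeans(n_clusters=2, random_state=0)
labels = kmeans.fit_predict(X)

# NumPy went in, so NumPy comes back out.
print(type(labels))
```

With cuML, passing a CuPy array or a cuDF dataframe instead should by default give back the corresponding GPU-side type. cuML also provides `cuml.set_global_output_type` to override the default if you want one output type regardless of input.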

### 3. Be fast!

On a modern GPU, cuML estimators can outperform CPU-based equivalents by factors ranging from 4x (for a medium-sized linear regression) to over 1000x (for large-scale t-SNE dimensionality reduction). In many cases, the performance advantage only becomes apparent as the dataset grows.

## cuML vs. scikit-learn

## cuML Estimators

This notebook provides an overview of several machine learning estimators in cuML, demonstrating how to train and evaluate them with built-in metrics functions.

### Regressors

### Classifiers

### Clusterers

### Dimensionality Reduction

## cuML Preprocessing

## Pipelines

To quote the wonderful scikit-learn documentation, a `Pipeline` lets you "sequentially apply a list of transforms and a final estimator" to a dataset. By collecting the transformations and the training step into a single pipeline, we can confidently do things like cross-validation and hyperparameter optimization without worrying about data leakage.

cuML transformations and estimators are fully compatible with the scikit-learn Pipeline API.
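To make the data-leakage point concrete, here is a hedged sketch of cross-validating a whole pipeline with scikit-learn's `cross_val_score`. Because the scaler lives inside the pipeline, it is refit on the training folds only in each split. The dataset sizes and the 5-fold setting are assumptions for illustration, and the scikit-learn fallback imports are there only so the snippet runs without cuML. How smoothly the cuML path works may depend on the installed cuML and scikit-learn versions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

try:
    # GPU path
    from cuml.preprocessing import StandardScaler
    from cuml.svm import SVC
except ImportError:
    # CPU fallback so the sketch runs anywhere
    from sklearn.preprocessing import StandardScaler
    from sklearn.svm import SVC

# Small illustrative dataset (sizes are assumptions for this example).
X, y = make_classification(
    n_samples=2000, n_features=10, n_informative=5, random_state=0
)
X = X.astype(np.float32)

pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("svc", SVC()),
])

# Each fold fits the scaler and the SVC on the training split only,
# then scores on the held-out split.
scores = np.asarray(cross_val_score(pipe, X, y, cv=5))
print(scores.mean())
```

Scaling the full dataset before splitting would instead let statistics from the held-out fold influence training, which is the leakage the pipeline avoids.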

## Explainability

Model explainability is often critically important. cuML provides a GPU-accelerated SHAP Kernel Explainer and a Permutation Explainer.

## Pickling Models

So far, we've only stored our models in memory. This final section demonstrates basic pickling of both single-GPU and multi-GPU cuML models for persistence.

This is the first use of a multi-GPU model in this notebook. cuML uses Dask for distributed model training. We'll walk through an example below, but we encourage you to explore the cuML documentation for more information and examples.
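For the single-GPU case, a minimal sketch of a pickle round trip looks like the following. It covers only the single-GPU part of this section; the Dask-based multi-GPU workflow is not shown. The synthetic data and variable names are assumptions for the example, and the scikit-learn fallback is included only so the snippet runs on a machine without cuML.

```python
import pickle

import numpy as np

try:
    from cuml.linear_model import LinearRegression  # GPU path
except ImportError:
    from sklearn.linear_model import LinearRegression  # CPU fallback

# Illustrative data: the target is the row sum plus a constant.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 4)).astype(np.float32)
y = (X.sum(axis=1) + 1.0).astype(np.float32)

model = LinearRegression().fit(X, y)

# Serialize the fitted model to bytes, then restore it.
# pickle.dump / pickle.load do the same with a file object.
payload = pickle.dumps(model)
restored = pickle.loads(payload)

# The restored model should reproduce the original predictions.
same = np.allclose(
    np.asarray(model.predict(X)),
    np.asarray(restored.predict(X)),
)
print(same)
```

As with any pickle, the file should be loaded in an environment with compatible library versions, and only from sources you trust.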

In [23]:
from cuml.svm import SVC
from cuml.preprocessing import StandardScaler, PolynomialFeatures
from cuml.model_selection import train_test_split
from cuml.datasets import make_classification

from sklearn.pipeline import Pipeline

In [35]:
NFEATURES = 20

X, y = make_classification(
    n_samples=100000,
    n_features=NFEATURES,
    n_informative=NFEATURES,
    n_redundant=0,
    n_classes=2,
    class_sep=0.01,
    random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    random_state=0)

In [37]:
clf = SVC()
clf.fit(X_train, y_train)
clf.score(X_test, y_test)

0.9762799739837646

In [38]:
pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('svc', SVC()),
])
# The pipeline can be used as any other estimator
# and avoids leaking the test set into the train set
pipe.fit(X_train, y_train)
pipe.score(X_test, y_test)

0.9764000177383423

In [41]:
pipe = Pipeline([
    ("polynomial_feat", PolynomialFeatures(2)),
    ('scaler', StandardScaler()),
    ('svc', SVC()),
])
# Adding degree-2 polynomial features ahead of the scaler is just one
# more pipeline step; fitting and scoring work exactly as before
pipe.fit(X_train, y_train)
pipe.score(X_test, y_test)

0.981440007686615