# Introduction to cuML 

cuML is a suite of GPU-accelerated machine learning algorithms, designed to accelerate your data science and analytical workloads. From pre-processing data through to training and evaluating models, cuML provides a user-friendly API and a wide range of functionality to help you get the most from your GPUs.

### Key Concepts

The following key concepts sit at the core of cuML's design, and enable you to get the most out of your data:

#### 1. Where possible, match the scikit-learn API

cuML estimators look and feel just like scikit-learn estimators. You initialize them with key parameters, fit them with a fit method, then call predict or transform for inference.

#### 2. Accept flexible input types, return predictable output types

cuML estimators can accept NumPy arrays, cuDF DataFrames, CuPy arrays, 2D PyTorch tensors, and really any kind of standards-based Python array input you can throw at them.

By default, outputs will mirror the data type you provided.

#### 3. Be fast!

On a modern GPU, cuML estimators can exceed the performance of CPU-based equivalents by a factor of anything from 4x (for a medium-sized linear regression) to over 1000x (for large-scale t-SNE dimensionality reduction). In many cases, the performance advantage appears as the dataset grows.

In this notebook we step through some of the functionality of cuML, in the context of a standard data science workflow. 

We begin by importing the cuML module, as well as cuDF, and simulating some data to use in the rest of the notebook.

In [None]:
import cudf
import cuml

In the next cell we simulate 100,000 data samples. Each sample has 20 features, and belongs to one of two distinct classes.

In [None]:
from cuml.datasets import make_classification, make_regression

NFEATURES = 20

X, y = make_classification(
    n_samples=100000,
    n_features=NFEATURES,
    n_informative=NFEATURES,
    n_redundant=0,
    n_classes=2,
    class_sep=0.01,
    random_state=12
)


Let's take a look at one sample below:

In [None]:
print(X[0], y[0])

## Split data into training and testing set

We use the `train_test_split` function to divide our data into training and testing sets. 

We'll use the testing set later to evaluate the performance of the models we train. 

In [None]:
from cuml.model_selection import train_test_split

## set train_size such that 70% of data is in the training set, 30% in the test
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.7, random_state=101)

In [None]:
print(len(X_train))
print(len(X_test))
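
To make the split concrete, here is a minimal NumPy sketch of what a shuffled 70/30 split does. The helper name `simple_train_test_split` and the toy arrays are assumptions for illustration; this is not cuML's implementation and it runs on CPU arrays only.

```python
import numpy as np

def simple_train_test_split(X, y, train_size=0.7, seed=0):
    """Hypothetical helper: shuffle the row indices, then cut at train_size."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))              # shuffled row indices
    n_train = int(round(train_size * len(X)))  # number of training rows
    train_idx, test_idx = idx[:n_train], idx[n_train:]
    return X[train_idx], X[test_idx], y[train_idx], y[test_idx]

# 20 toy samples with 2 features each, and alternating labels
X_demo = np.arange(40, dtype=np.float32).reshape(20, 2)
y_demo = np.arange(20) % 2

Xtr, Xte, ytr, yte = simple_train_test_split(X_demo, y_demo)
print(len(Xtr), len(Xte))
```

Every row ends up in exactly one of the two sets, which is what lets us treat the test set as unseen data later on.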

## Explore and preprocess the data

Now that we have split our data into training and test sets we can begin to apply transformations. Just like scikit-learn, cuML estimators admit the _initialise_, _fit_, and _predict_ or _transform_ functionality. 

Let's see this in action with the `MaxAbsScaler`. This scaler transforms each feature (column) of our data set by scaling it so that the maximum absolute value of each feature is 1.

In [None]:
from cuml.preprocessing import MaxAbsScaler

In [None]:
## initialise the estimator
ma_scaler = MaxAbsScaler()

# fit the scaler to our training data
ma_scaler.fit(X_train)


# transform the testing data: 
ma_scaler.transform(X_test)
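
As a rough illustration of what the fit and transform steps compute, here is a NumPy sketch of max-abs scaling. The synthetic `train` and `test` arrays are assumptions, and this is a CPU sketch of the idea rather than cuML's code.

```python
import numpy as np

rng = np.random.default_rng(0)
# Three features on very different scales
train = rng.normal(scale=[1.0, 10.0, 100.0], size=(1000, 3))
test = rng.normal(scale=[1.0, 10.0, 100.0], size=(200, 3))

# "fit": learn one scale factor per column from the training data only
max_abs = np.abs(train).max(axis=0)

# "transform": divide every column by its learned factor
train_scaled = train / max_abs
test_scaled = test / max_abs

print(np.abs(train_scaled).max(axis=0))  # 1 for every training column
print(np.abs(test_scaled).max(axis=0))   # near 1, but not guaranteed to be 1
```

Because the factors come from the training data, transformed test values can fall slightly outside the [-1, 1] range.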

Similarly, we can use a `RobustScaler` to transform the data so that each feature is on a similar scale. 

This scaler removes the median and scales the data according to the interquartile range.

In [None]:
from cuml.preprocessing import RobustScaler

# initialise the estimator
rs = RobustScaler()

# fit the estimator to the training data
rs.fit(X_train)

# transform testing data
rs.transform(X_test)

And we can inspect properties of the Scaler, such as the scale factor: 

In [None]:
rs.scale_
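
The following NumPy sketch shows the arithmetic behind this kind of robust scaling, assuming the default choice of the 25th to 75th percentile range. The synthetic data and variable names are assumptions, and the sketch is not cuML's implementation.

```python
import numpy as np

rng = np.random.default_rng(1)
data = rng.normal(loc=5.0, scale=2.0, size=(1000, 2))
data[0] = [1e6, -1e6]  # a single extreme outlier row

# Per-column statistics that are barely affected by the outlier
median = np.median(data, axis=0)
q25, q75 = np.percentile(data, [25, 75], axis=0)
iqr = q75 - q25  # plays the role of the scale factor

robust = (data - median) / iqr

print(np.median(robust, axis=0))  # centred on zero
print(iqr)
```

Using the median and interquartile range instead of the mean and standard deviation is what keeps a handful of extreme values from dominating the scaling.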

## Dimensionality Reduction

When exploring our data, we often want to project the features down to two dimensions so that we can plot and visualise the data set, and see if we can identify patterns.

We begin by using Principal Component Analysis (PCA), a linear dimensionality reduction technique. PCA requires input data to be on the same scale, so we first transform our data using the RobustScaler.

In [None]:
from cuml import PCA

In [None]:
%%time
# initialise the estimator
pca = PCA(n_components = 2)

# fit the estimator to our training data
pca.fit(rs.transform(X_train))

## transform our testing data
pca_test = pca.transform(rs.transform(X_test))

We can examine the proportion of variance explained by the PCA and inspect the components:

In [None]:
print(f'Components: {pca.components_}')
print(f'Explained variance: {pca.explained_variance_}')
exp_var = pca.explained_variance_ratio_
print(f'Explained variance ratio: {exp_var}')
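
To see where these quantities come from, here is a minimal NumPy sketch of PCA via the singular value decomposition. The synthetic data and names such as `centered` and `projected` are assumptions; cuML may use a different solver internally, so treat this as an illustration of the maths only.

```python
import numpy as np

rng = np.random.default_rng(2)
# 500 samples, 5 features with deliberately different variances
data = rng.normal(size=(500, 5)) * np.array([5.0, 3.0, 1.0, 0.5, 0.1])

centered = data - data.mean(axis=0)
# Rows of Vt are the principal directions; S holds the singular values
U, S, Vt = np.linalg.svd(centered, full_matrices=False)

explained_variance = S**2 / (len(data) - 1)
explained_variance_ratio = explained_variance / explained_variance.sum()

n_components = 2
projected = centered @ Vt[:n_components].T  # shape (500, 2)

print(explained_variance_ratio[:n_components])
```

The explained variance ratio of a component is its share of the total variance, so the first two entries tell us how much information survives the projection to two dimensions.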

PCA is fast, but there are more sophisticated techniques we can use to possibly expose more structure in the data. Due to the non-linearity of these alternative dimensionality reduction techniques, they are more computationally expensive. However, we benefit here from the acceleration provided by NVIDIA GPUs and the RAPIDS implementations.

UMAP is a non-linear dimensionality reduction technique:

In [None]:
from cuml import UMAP

In [None]:
%%time
umap = UMAP(n_components = 2)
umap.fit(X_train)
umap_test = umap.transform(X_test)

As you can see, UMAP is notably slower than PCA, but let's see if it allows us to uncover more structure in our data by plotting the projected test data: 

In [None]:
import matplotlib.pyplot as plt
import cupy as cp

In [None]:
# transferring data to the CPU to plot
umap_cpu = cp.asnumpy(umap_test)
pca_cpu = cp.asnumpy(pca_test)

In [None]:
plt.scatter(umap_cpu[:,0], umap_cpu[:,1], c=cp.asnumpy(y_test))

In [None]:
plt.scatter(pca_cpu[:,0], pca_cpu[:,1], c=cp.asnumpy(y_test))

### Training a model

Now that we've transformed our data and have been able to identify structure in it, we can go ahead and train a model to distinguish between the two classes. Let's start by training a logistic regression model.

Again, we follow the _initialise_, _fit_, _predict_ workflow that we used with the scalers and dimensionality reduction techniques earlier in the notebook. 

In [None]:
## initialise
clr = cuml.LogisticRegression()

## fit to scaled data
clr.fit(rs.transform(X_train), y_train)

## predict 
clr_preds = clr.predict(rs.transform(X_test))
clr_preds

### Evaluating the model

cuML provides a range of built in metrics to evaluate model performance. 


In [None]:
cuml.metrics.accuracy.accuracy_score(y_test, clr_preds)

It looks like this prediction accuracy is only slightly higher than 50%. We would expect similar results by just tossing a coin to allocate classes. Let's investigate this further by looking at a confusion matrix:

In [None]:
cuml.metrics.confusion_matrix(y_test, clr_preds)

The confusion matrix tells us that there are many misclassifications in both the '0' and '1' classes. Let's try to train another model and see if we can get better performance:

In [None]:
# initialise
ckn = cuml.neighbors.KNeighborsClassifier()

# fit
ckn.fit(rs.transform(X_train), y_train)

# predict
ckn_preds = ckn.predict(rs.transform(X_test))
ckn_preds

In [None]:
cuml.metrics.accuracy.accuracy_score(y_test, ckn_preds)

In [None]:
cuml.metrics.confusion_matrix(y_test, ckn_preds)

For our dataset, the k-nearest neighbour model is much better at predicting classes than the Logistic Regression model. 
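
To make the link between the confusion matrix and accuracy explicit, here is a small NumPy sketch on hand-made labels. The helper `confusion_counts` is hypothetical and only mirrors the usual convention of true labels on the rows and predictions on the columns.

```python
import numpy as np

def confusion_counts(y_true, y_pred, n_classes=2):
    """Hypothetical helper: rows are true labels, columns are predictions."""
    cm = np.zeros((n_classes, n_classes), dtype=int)
    for t, p in zip(y_true, y_pred):
        cm[t, p] += 1
    return cm

y_true = np.array([0, 0, 0, 1, 1, 1, 1, 0])
y_pred = np.array([0, 1, 1, 1, 0, 1, 1, 0])

cm = confusion_counts(y_true, y_pred)
accuracy = np.trace(cm) / cm.sum()  # correct predictions sit on the diagonal

print(cm)        # [[2 2]
                 #  [1 3]]
print(accuracy)  # 0.625
```

Swapping the two arguments transposes the matrix, so it is worth keeping the order consistent when comparing models.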

## Pipelines

To quote the wonderful scikit-learn documentation, `Pipeline` "sequentially [applies] a list of transforms and a final estimator" to a dataset.

By collecting transformations and training into a single pipeline, we can confidently do things like cross-validation and hyper-parameter optimization without worrying about data leakage.

cuML transformations and estimators are fully compatible with the scikit-learn Pipeline API.

In our previous examples we used a RobustScaler followed by a k-Nearest neighbour model. Let's put those together in a pipeline:

In [None]:
from sklearn.pipeline import Pipeline

pipe = Pipeline([
    ('scaler', cuml.preprocessing.RobustScaler()),
    ('knn', cuml.neighbors.KNeighborsClassifier()),
])

We can fit the whole pipeline in one command, and make predictions from the raw data, without having to first call the scaler, then the model.

In [None]:
%%time
pipe.fit(X_train, y_train)

In [None]:
%%time
pipe.predict(X_test)

In [None]:
cuml.metrics.confusion_matrix(y_test, pipe.predict(X_test))
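
To illustrate what "sequentially applies a list of transforms and a final estimator" means, here is a toy sketch of the fit and predict flow. `MiniPipeline`, `MedianCenterer` and `SignClassifier` are hypothetical classes written for this example; they are not part of scikit-learn or cuML, and the real `Pipeline` does considerably more.

```python
import numpy as np

class MedianCenterer:
    """Toy transformer: subtracts per-column medians learned in fit."""
    def fit(self, X, y=None):
        self.median_ = np.median(X, axis=0)
        return self

    def transform(self, X):
        return X - self.median_

class SignClassifier:
    """Toy estimator: predicts class 1 when the first feature is positive."""
    def fit(self, X, y):
        return self

    def predict(self, X):
        return (X[:, 0] > 0).astype(int)

class MiniPipeline:
    """Hypothetical stand-in that mimics the fit/predict flow of a pipeline."""
    def __init__(self, steps):
        self.steps = steps

    def fit(self, X, y):
        for _, step in self.steps[:-1]:
            X = step.fit(X, y).transform(X)  # fit each transform, pass data on
        self.steps[-1][1].fit(X, y)          # final estimator sees transformed data
        return self

    def predict(self, X):
        for _, step in self.steps[:-1]:
            X = step.transform(X)            # reuse statistics learned in fit
        return self.steps[-1][1].predict(X)

X_fit = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])
y_fit = np.array([0, 0, 0, 1, 1])

mini = MiniPipeline([("center", MedianCenterer()), ("clf", SignClassifier())])
preds = mini.fit(X_fit, y_fit).predict(np.array([[2.5], [3.5]]))
print(preds)  # [0 1]
```

The statistics used at prediction time are the ones learned during `fit`, which is why wrapping the scaler and model together helps avoid data leakage during cross-validation.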

### Sidebar: comparison with scikit-learn

Although we're using the scikit-learn Pipeline above, all of our data remains on the GPU throughout the execution. Let's see how long the equivalent transformations and modeling take when we run them on the CPU:

In [None]:
import pandas as pd

# transfer data to the CPU (cp.asnumpy copies CuPy arrays into NumPy arrays)
cpu_X_train = pd.DataFrame(cp.asnumpy(X_train))
cpu_X_test = pd.DataFrame(cp.asnumpy(X_test))
cpu_y_train = cp.asnumpy(y_train)

In [None]:
import sklearn
from sklearn import neighbors, preprocessing

cpu_pipe = Pipeline([
    ('scaler', sklearn.preprocessing.RobustScaler()),
    ('knn', sklearn.neighbors.KNeighborsClassifier()),
])

In [None]:
%%time
cpu_pipe.fit(cpu_X_train, cpu_y_train)

In [None]:
%%time
cpu_pipe.predict(cpu_X_test)

So we can run the equivalent pipeline on the CPU by swapping in the scikit-learn estimators, but it is orders of magnitude slower to do so.

## Explainability

Model explainability is often critically important. cuML provides a GPU-accelerated SHAP Kernel Explainer and a Permutation Explainer.

In [None]:
from cuml.explainer import KernelExplainer
from cuml.datasets import make_regression
from cuml.model_selection import train_test_split
import cuml

Xr, yr = make_regression(
    n_samples=102,
    n_features=10,
    noise=0.1,
    random_state=42)

Xr_train, Xr_test, yr_train, yr_test = train_test_split(
    Xr,
    yr,
    test_size=2,
    random_state=42)

model = cuml.svm.SVR().fit(Xr_train, yr_train)

cu_explainer = KernelExplainer(
    model=model.predict,
    data=Xr_train,
    is_gpu_model=True)

cu_shap_values = cu_explainer.shap_values(Xr_test)
cu_shap_values
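
Kernel SHAP estimates Shapley values, so it can help to see what those values are on a problem small enough to solve exactly. The sketch below brute-forces Shapley values for a toy linear model with three features. The helper `exact_shap`, the toy model and the synthetic background data are all assumptions for illustration; this is not how `KernelExplainer` is implemented.

```python
import itertools
import math
import numpy as np

def exact_shap(predict, x, background):
    """Brute-force Shapley values for one sample (only feasible for few features)."""
    n = len(x)

    def value(subset):
        # Features in `subset` take the sample's values; the rest keep background values
        mixed = background.copy()
        cols = np.array(subset, dtype=int)
        mixed[:, cols] = x[cols]
        return predict(mixed).mean()

    phi = np.zeros(n)
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for size in range(n):
            weight = (math.factorial(size) * math.factorial(n - size - 1)
                      / math.factorial(n))
            for subset in itertools.combinations(others, size):
                phi[i] += weight * (value(subset + (i,)) - value(subset))
    return phi

weights = np.array([2.0, -1.0, 0.5])

def toy_predict(X):
    """Toy linear model standing in for a fitted estimator's predict method."""
    return X @ weights + 3.0

rng = np.random.default_rng(3)
background = rng.normal(size=(50, 3))
x = np.array([1.0, 2.0, -1.0])

phi = exact_shap(toy_predict, x, background)
base_value = toy_predict(background).mean()

# Additivity: base value plus the SHAP values recovers the prediction
print(base_value + phi.sum(), toy_predict(x[None, :])[0])
```

The exact computation needs a number of model evaluations that grows exponentially with the feature count, which is why sampling-based approximations and GPU acceleration matter for realistic models.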

## Pickling Models

So far, we've only stored our models in memory. This final section demonstrates basic pickling of cuML models and pipelines for persistence. This allows us to load these models into other environments or programs and use them to make predictions on new data.

We can pickle individual estimators.

In [None]:
import pickle

In [None]:
with open("model.pkl", "wb") as f:
    pickle.dump(model, f)
with open("model.pkl", "rb") as f:
    loaded_model = pickle.load(f)
loaded_model

We can even pickle the pipeline we made earlier.

In [None]:
with open("pipeline.pkl", "wb") as f:
    pickle.dump(pipe, f)
with open("pipeline.pkl", "rb") as f:
    loaded_pipeline = pickle.load(f)

print(loaded_pipeline.score(X_test, y_test))
loaded_pipeline.predict(X_test)
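
If you want to check the save and load pattern without a GPU, the same round trip can be sketched with a plain Python object. The `model_state` dictionary below is a hypothetical stand-in for a fitted model, and the temporary directory is only there to keep the example from leaving files behind.

```python
import os
import pickle
import tempfile

# Stand-in for a fitted model's state (hypothetical values, not a cuML object)
model_state = {"coef": [0.5, -1.25], "intercept": 2.0}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "model_state.pkl")
    with open(path, "wb") as f:  # the context manager closes the file for us
        pickle.dump(model_state, f)
    with open(path, "rb") as f:
        restored_state = pickle.load(f)

print(restored_state == model_state)
```

The restored object is an independent copy with the same contents, which is the property we rely on when moving a model to another program.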

We hope this notebook has shown you how you can use cuML to carry out your standard Machine Learning and analytics workflows on NVIDIA GPUs. 

To find out more, check out [RAPIDS.ai](http://rapids.ai) and look at the cuML [docs](https://docs.rapids.ai/api/cuml/stable/) to see the full range of the cuML functionality. 