# Training and Evaluating Machine Learning Models with cuML

cuML is NVIDIA's GPU-accelerated machine learning library that implements popular ML algorithms with CUDA optimization. It 
provides scikit-learn-like APIs while leveraging GPU acceleration to deliver significant speedups compared to CPU-based 
implementations. cuML is part of the RAPIDS suite of open-source software libraries.

This notebook explores several basic machine learning estimators in cuML, demonstrating how to train them and evaluate them 
with built-in metrics functions. All of the models are trained on synthetic data generated by cuML's dataset utilities. The 
estimators covered are:

1. Random Forest Classifier
2. UMAP
3. DBSCAN
4. Linear Regression

In [1]:
import cuml
from cupy import asnumpy 
from joblib import dump, load

## Classification

### Random Forest Classification and Accuracy metrics

The Random Forest classification algorithm builds several decision trees, and aggregates each of their outputs to 
make a prediction. For more information on cuML's implementation of the Random Forest Classification model please refer 
to: [Random Forest Classifier API Documentation](https://docs.rapids.ai/api/cuml/stable/api.html#cuml.ensemble.RandomForestClassifier)
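
The aggregation step can be sketched in plain Python. This is only an illustration of majority voting, not cuML's implementation; the `majority_vote` helper and the per-tree predictions are hypothetical, and real implementations may instead average per-tree class probabilities.

```python
from collections import Counter


def majority_vote(tree_predictions):
    """Combine per-tree class predictions into one prediction per sample.

    tree_predictions[t][i] is the class that tree t predicts for sample i.
    """
    n_samples = len(tree_predictions[0])
    combined = []
    for i in range(n_samples):
        votes = Counter(tree[i] for tree in tree_predictions)
        # most_common(1) returns [(label, count)] for the most frequent label
        combined.append(votes.most_common(1)[0][0])
    return combined


# three hypothetical trees voting on four samples
tree_predictions = [
    [0, 1, 1, 0],
    [0, 1, 0, 0],
    [1, 1, 1, 0],
]
forest_prediction = majority_vote(tree_predictions)
print(forest_prediction)
```

Using an odd number of trees for a binary task, as in this toy example, avoids tied votes.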

Accuracy score is the ratio of correct predictions to the total number of predictions. It is used to measure the performance 
of classification models. For more information on the accuracy score metric please refer to [Wikipedia's article on Accuracy and Precision](https://en.wikipedia.org/wiki/Accuracy_and_precision).

For more information on cuML's implementation of accuracy score metrics please refer to the [cuML Accuracy Score API Documentation](https://docs.rapids.ai/api/cuml/stable/api.html#cuml.metrics.accuracy.accuracy_score).
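
As a quick illustration of the definition above, the accuracy score can be computed by hand. This is a minimal plain-Python sketch with made-up labels; the `accuracy` helper is hypothetical and is not the cuML or scikit-learn function.

```python
def accuracy(y_true, y_pred):
    """Return the fraction of predictions that match the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


y_true = [0, 1, 1, 0, 1]
y_pred = [0, 1, 0, 0, 1]  # one mistake, at index 2
score = accuracy(y_true, y_pred)
print(score)
```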

The cell below shows an end-to-end pipeline for the Random Forest Classification model. The dataset is generated with 
cuML's `make_classification` function, split into training and test sets, and used to train the model and run predictions. 
The model's accuracy is then computed with both the cuML and scikit-learn accuracy metrics so that the two values can be compared.

In [2]:
from cuml.datasets.classification import make_classification
from cuml.model_selection import train_test_split
from cuml.ensemble import RandomForestClassifier as cuRF
from sklearn.metrics import accuracy_score


In [None]:
%%time

# synthetic dataset dimensions
n_samples = 1000
n_features = 10
n_classes = 2

# random forest depth and size
n_estimators = 25
max_depth = 10

# generate synthetic data [ binary classification task ]
X, y = make_classification(
    n_classes=n_classes,
    n_features=n_features,
    n_samples=n_samples,
    random_state=0,
)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = cuRF(
    max_depth=max_depth,
    n_estimators=n_estimators,
    random_state=0,
)

trained_RF = model.fit(X_train, y_train)

predictions = model.predict(X_test)

cu_score = cuml.metrics.accuracy_score(y_test, predictions)
sk_score = accuracy_score(asnumpy(y_test), asnumpy(predictions))

print(" cuml accuracy: ", cu_score)
print(" sklearn accuracy : ", sk_score)

# save
dump(trained_RF, "RF.model")

# to reload the model uncomment the line below
# loaded_model = load('RF.model')

## Dimensionality Reduction and Clustering

### UMAP and Trustworthiness metrics
UMAP is a non-linear dimensionality reduction algorithm. It can also be used for visualization.
For additional information on the UMAP model please refer to the [RAPIDS UMAP documentation](https://docs.rapids.ai/api/cuml/stable/api.html#cuml.UMAP).

Trustworthiness measures the extent to which the local structure of the input data is retained in the embedding. Samples that 
appear among a point's nearest neighbors in the embedding, but are not among its nearest neighbors in the original space, are 
penalized in proportion to their rank in the original space.

**Additional Resources:**
- [scikit-learn trustworthiness documentation](https://scikit-learn.org/stable/modules/generated/sklearn.manifold.trustworthiness.html)
- [cuML trustworthiness documentation](https://docs.rapids.ai/api/cuml/stable/api.html#cuml.metrics.trustworthiness.trustworthiness)
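
To make the penalty concrete, here is a small plain-Python sketch of the trustworthiness formula as described in the scikit-learn documentation linked above. It is a brute-force illustration on toy data, not the cuML or scikit-learn implementation; the helper names and the toy points are made up, and the formula assumes `n_neighbors` is less than half the number of samples.

```python
import math


def neighbor_ranks(points):
    """ranks[i][j] is the rank of j among i's neighbors (1 = nearest)."""
    n = len(points)
    ranks = []
    for i in range(n):
        order = sorted(
            (j for j in range(n) if j != i),
            key=lambda j: math.dist(points[i], points[j]),
        )
        ranks.append({j: r for r, j in enumerate(order, start=1)})
    return ranks


def trustworthiness_sketch(X, X_embedded, n_neighbors=2):
    n, k = len(X), n_neighbors
    ranks_in = neighbor_ranks(X)
    ranks_emb = neighbor_ranks(X_embedded)
    penalty = 0
    for i in range(n):
        for j, rank_emb in ranks_emb[i].items():
            # j is a k-nearest neighbor in the embedding but not in the input
            if rank_emb <= k and ranks_in[i][j] > k:
                penalty += ranks_in[i][j] - k
    return 1.0 - 2.0 * penalty / (n * k * (2 * n - 3 * k - 1))


X = [(0.0,), (1.0,), (2.0,), (3.0,), (4.0,), (5.0,)]
scrambled = [(0.0,), (5.0,), (1.0,), (4.0,), (2.0,), (3.0,)]

perfect_score = trustworthiness_sketch(X, X)
scrambled_score = trustworthiness_sketch(X, scrambled)
print(perfect_score, scrambled_score)
```

An embedding that preserves every neighborhood scores 1.0, while the scrambled embedding scores lower because it introduces neighbors that were far apart in the input.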

The cell below shows an end-to-end pipeline for the UMAP model. The blobs dataset is created with cuML's equivalent of the 
`make_blobs` function and used as the input. The embedding produced by UMAP's `fit` followed by `transform` is evaluated using 
the trustworthiness function, and the values obtained from scikit-learn and cuML are compared below.


In [4]:
from cuml.datasets import make_blobs
from cuml.manifold.umap import UMAP as cuUMAP
from sklearn.manifold import trustworthiness
import numpy as np

In [None]:
%%time

n_samples = 1000
n_features = 100
cluster_std = 0.1

X_blobs, y_blobs = make_blobs(
    n_samples=n_samples,
    cluster_std=cluster_std,
    n_features=n_features,
    random_state=0,
    dtype=np.float32,
)

trained_UMAP = cuUMAP(n_neighbors=10).fit(X_blobs)
X_embedded = trained_UMAP.transform(X_blobs)

cu_score = cuml.metrics.trustworthiness(X_blobs, X_embedded)
sk_score = trustworthiness(asnumpy(X_blobs), asnumpy(X_embedded))

print("cuml's trustworthiness score:", cu_score)
print("sklearn's trustworthiness score:", sk_score)

# save
dump(trained_UMAP, "UMAP.model")

# to reload the model uncomment the line below
# loaded_model = load('UMAP.model')

### DBSCAN and Adjusted Rand Index

DBSCAN is a popular density-based clustering algorithm: it groups points that are closely packed together and labels points 
in low-density regions as noise. For additional information on the DBSCAN model please refer to the 
[DBSCAN documentation](https://docs.rapids.ai/api/cuml/stable/api.html#cuml.DBSCAN).

We create the blobs dataset using the cuML equivalent of the `make_blobs` function.

The adjusted Rand index measures the similarity between two clusterings of the same data, and it is adjusted to 
account for the agreement expected from chance groupings of elements.
For more information on the adjusted Rand index please refer to the [Wikipedia article on the Rand index](https://en.wikipedia.org/wiki/Rand_index).
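
The chance adjustment can be illustrated with a short plain-Python sketch based on the contingency-table formula from the Wikipedia article linked above. The `adjusted_rand_index` helper and the toy labelings are made up for illustration; this is not the cuML or scikit-learn implementation.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_a, labels_b):
    """Adjusted Rand index of two labelings of the same samples."""
    n = len(labels_a)
    if n < 2:
        return 1.0  # no pairs to compare
    # pairs of samples grouped together in both labelings
    sum_cells = sum(comb(c, 2) for c in Counter(zip(labels_a, labels_b)).values())
    # pairs grouped together within each labeling separately
    sum_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    sum_b = sum(comb(c, 2) for c in Counter(labels_b).values())
    expected = sum_a * sum_b / comb(n, 2)  # agreement expected by chance
    maximum = (sum_a + sum_b) / 2
    if maximum == expected:
        return 1.0  # degenerate case, e.g. a single cluster in both labelings
    return (sum_cells - expected) / (maximum - expected)


# same grouping with swapped label names: perfect agreement
same_grouping = adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0])
# groupings that never agree on a pair: worse than chance
crossed_grouping = adjusted_rand_index([0, 0, 1, 1], [0, 1, 0, 1])
print(same_grouping, crossed_grouping)
```

Note that the score depends only on which samples are grouped together, not on the label values themselves.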

The cell below shows an end-to-end pipeline for the DBSCAN model. The output of DBSCAN's `fit_predict` is evaluated using the 
adjusted Rand index function. The values obtained from scikit-learn's and cuML's adjusted Rand index metrics are compared below.

In [8]:
from cuml import DBSCAN as cumlDBSCAN
from sklearn.metrics import adjusted_rand_score

In [None]:
n_samples = 1000
n_features = 100
cluster_std = 0.1

X_blobs, y_blobs = make_blobs(
    n_samples=n_samples,
    n_features=n_features,
    cluster_std=cluster_std,
    random_state=0,
    dtype=np.float32,
)

cuml_dbscan = cumlDBSCAN(eps=3, min_samples=2)

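# fit_predict fits the model and returns the cluster labels in a single call,
# so a separate fit() beforehand would only cluster the data twice
cu_y_pred = cuml_dbscan.fit_predict(X_blobs)
trained_DBSCAN = cuml_dbscan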

cu_adjusted_rand_index = cuml.metrics.cluster.adjusted_rand_score(y_blobs, cu_y_pred)
sk_adjusted_rand_index = adjusted_rand_score(asnumpy(y_blobs), asnumpy(cu_y_pred))

print(f"cuml's adjusted Rand index score: {cu_adjusted_rand_index}")
print(f"sklearn's adjusted Rand index score: {sk_adjusted_rand_index}")

# save and optionally reload
dump(trained_DBSCAN, "DBSCAN.model")

# to reload the model uncomment the line below
# loaded_model = load('DBSCAN.model')

## Regression

### Linear Regression and R^2 score
Linear Regression is a simple machine learning model where the response `y` is modeled by a linear combination of the 
predictors in `X`.
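
For a single predictor, the least-squares fit has a simple closed form. The sketch below is a plain-Python illustration with made-up data; the `fit_line` helper is hypothetical and is not how cuML solves the general multi-feature problem.

```python
def fit_line(x, y):
    """Ordinary least squares for one predictor: y ~ slope * x + intercept."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # covariance of x and y divided by the variance of x gives the slope
    s_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    s_xx = sum((xi - mean_x) ** 2 for xi in x)
    slope = s_xy / s_xx
    intercept = mean_y - slope * mean_x
    return slope, intercept


x = [0.0, 1.0, 2.0, 3.0]
y = [1.0, 3.0, 5.0, 7.0]  # generated from y = 2x + 1 with no noise
slope, intercept = fit_line(x, y)
print(slope, intercept)
```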

`R^2` score, also known as the coefficient of determination, is a metric for scoring regression models. It measures the 
proportion of the total variation in the response that is explained by the model. For more information on the `R^2`
score metric please refer to the [Wikipedia page on the coefficient of determination](https://en.wikipedia.org/wiki/Coefficient_of_determination).
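
The definition translates directly into code. The following is a minimal plain-Python sketch with made-up values; the `r2` helper is hypothetical and is not the cuML or scikit-learn function.

```python
def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))  # unexplained variation
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)  # total variation
    # note: ss_tot is zero when y_true is constant, which this sketch does not handle
    return 1.0 - ss_res / ss_tot


y_true = [1.0, 2.0, 3.0]
perfect = r2(y_true, [1.0, 2.0, 3.0])
mean_only = r2(y_true, [2.0, 2.0, 2.0])  # always predicting the mean
print(perfect, mean_only)
```

A model that always predicts the mean of the targets scores 0, and a model that predicts every target exactly scores 1.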

For more information on cuML's implementation of the `r2` score metric please refer to the [cuML R2 Score Documentation](https://docs.rapids.ai/api/cuml/stable/api.html).

The cell below uses the Linear Regression model to compare the results of the cuML and scikit-learn `R^2` score metrics. For more 
information on cuML's implementation of the Linear Regression model please refer to the [cuML Linear Regression Documentation](https://docs.rapids.ai/api/cuml/stable/api.html).

In [10]:
from cuml.datasets import make_regression
from cuml.model_selection import train_test_split
from cuml.linear_model import LinearRegression as cuLR
from sklearn.metrics import r2_score


In [None]:
%%time

n_samples = 2**10
n_features = 100
n_info = 70

X_reg, y_reg = make_regression(
    n_samples=n_samples,
    n_features=n_features,
    n_informative=n_info,
    random_state=123,
)

X_reg_train, X_reg_test, y_reg_train, y_reg_test = train_test_split(
    X_reg, y_reg, train_size=0.8, random_state=10
)

cuml_reg_model = cuLR(fit_intercept=True, normalize=True, algorithm="eig")

trained_LR = cuml_reg_model.fit(X_reg_train, y_reg_train)
cu_preds = trained_LR.predict(X_reg_test)

cu_r2 = cuml.metrics.r2_score(y_reg_test, cu_preds)
sk_r2 = r2_score(asnumpy(y_reg_test), asnumpy(cu_preds))

print(f"cuml's r2 score : {cu_r2}")
print(f"sklearn's r2 score : {sk_r2}")

# save and reload
dump(trained_LR, "LR.model")

# to reload the model uncomment the line below 
# loaded_model = load('LR.model')

# Example integration with an existing workflow
Chances are that you already have an existing workflow that uses `scikit-learn`. You may even have custom transformers implemented for data preprocessing.

This example takes a pipeline that uses [skrub](https://skrub-data.org/stable/index.html), an open-source package that aims at bridging the gap between tabular data sources and machine-learning models.

Here we will show how different tools that leverage the `scikit-learn` API can be combined in a pipeline.

## Easy learning on a dataframe

Let's first retrieve the dataset, using one of the downloaders from the `skrub.datasets` module. Like all the downloaders,
`fetch_employee_salaries` returns a dataset with attributes `X` and `y`. `X` is a dataframe which contains 
the features (aka design matrix, explanatory variables, independent variables). `y` is a column (pandas Series) which
contains the target (aka dependent variable, response variable) that we want to learn to predict from `X`. In this case `y` is the annual salary.

In [None]:
# colab only: uncomment to install skrub
# ! pip install skrub

In [None]:
from skrub.datasets import fetch_employee_salaries

dataset = fetch_employee_salaries()
employees, salaries = dataset.X, dataset.y
employees

Most machine-learning algorithms work with arrays of numbers. The challenge here is that the employees dataframe is a 
heterogeneous set of columns: some are numerical (`'year_first_hired'`), some dates (`'date_first_hired'`), some have a few 
categorical entries (`'gender'`), some many (`'employee_position_title'`). Therefore our table needs to be "vectorized": processed
to extract numeric features.
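
As a rough idea of what "vectorizing" means, the sketch below turns a few mixed-type rows into numeric features by one-hot encoding a categorical column. It is a plain-Python illustration only: the `vectorize` helper and the row values are invented, and it does not reflect how `skrub` handles dates or high-cardinality columns.

```python
def vectorize(rows, numeric_columns, categorical_columns):
    """Turn dict rows into numeric feature lists via one-hot encoding."""
    # the sorted set of values seen in each categorical column
    categories = {
        col: sorted({row[col] for row in rows}) for col in categorical_columns
    }
    feature_names = list(numeric_columns)
    for col in categorical_columns:
        feature_names += [f"{col}_{cat}" for cat in categories[col]]

    matrix = []
    for row in rows:
        features = [float(row[col]) for col in numeric_columns]
        for col in categorical_columns:
            # one indicator feature per category value
            features += [1.0 if row[col] == cat else 0.0 for cat in categories[col]]
        matrix.append(features)
    return feature_names, matrix


# invented rows that reuse two column names from the employees table
rows = [
    {"year_first_hired": 1998, "gender": "F"},
    {"year_first_hired": 2007, "gender": "M"},
    {"year_first_hired": 2012, "gender": "F"},
]
feature_names, matrix = vectorize(rows, ["year_first_hired"], ["gender"])
print(feature_names)
print(matrix)
```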

`skrub` provides a custom transformer, called `TableVectorizer`, to preprocess the data for us.

In [None]:
from skrub import TableVectorizer

vectorizer = TableVectorizer()
vectorized_employees = vectorizer.fit_transform(employees)
vectorized_employees

## A simple Pipeline for tabular data

The `TableVectorizer` outputs data that can be understood by a scikit-learn estimator. Therefore we can easily build a
2-step scikit-learn `Pipeline` that we can fit, test or cross-validate and that works well on tabular data.
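
Cross-validating a pipeline means fitting it repeatedly on one part of the data and scoring it on the held-out part. The sketch below shows how consecutive, unshuffled folds can be formed, which is what scikit-learn does for a regressor with its default settings (five folds). The `k_fold_indices` helper is a hypothetical plain-Python stand-in, not scikit-learn's own code.

```python
def k_fold_indices(n_samples, n_splits=5):
    """Yield (train_indices, test_indices) for consecutive, unshuffled folds."""
    fold_sizes = [n_samples // n_splits] * n_splits
    # spread any remainder over the first folds
    for i in range(n_samples % n_splits):
        fold_sizes[i] += 1

    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, test
        start += size


folds = list(k_fold_indices(10, n_splits=5))
for train, test in folds:
    print("train:", train, "test:", test)
```

Each sample lands in exactly one test fold, so the pipeline is fitted five times and every row is used for scoring once.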

In [14]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_validate
from sklearn.pipeline import make_pipeline

In [None]:

pipeline = make_pipeline(TableVectorizer(), RandomForestRegressor())
pipeline

In [None]:
%%time 
# In colab this will take ~ 3-4min to run - you might want to skip this cell
results = cross_validate(pipeline, employees, salaries)
scores = results["test_score"]
print(f"R2 score:  mean: {np.mean(scores):.3f}; std: {np.std(scores):.3f}")
print(f"mean fit time: {np.mean(results['fit_time']):.3f} seconds")

Now let's swap out `scikit-learn`'s `RandomForestRegressor` for `cuML`'s and notice that it runs in seconds instead of 
minutes.

In [None]:
from cuml.ensemble import RandomForestRegressor as cuRFR

pipeline = make_pipeline(TableVectorizer(), cuRFR())

In [None]:
%%time

results = cross_validate(pipeline, employees, salaries)
scores = results["test_score"]
print(f"R2 score:  mean: {np.mean(scores):.3f}; std: {np.std(scores):.3f}")
print(f"mean fit time: {np.mean(results['fit_time']):.3f} seconds")

## Conclusion

In this notebook, we learned:

* How to use cuML's drop-in replacements for scikit-learn estimators
* How cuML can accelerate machine learning workflows on GPU
* How to integrate cuML with scikit-learn Pipelines
* The performance benefits of GPU-accelerated machine learning with cuML

To learn more, we encourage you to visit the [cuML documentation](https://docs.rapids.ai/api/cuml/stable/).

In the next notebook we will learn about `cuml.accel`:

[Next Notebook: 5 cuml.accel →](https://colab.research.google.com/github/rapidsai-community/tutorial/blob/main/5.cuml_accel.ipynb)