# Training and Evaluating Machine Learning Models in cuML

This notebook explores several basic machine learning estimators in cuML, demonstrating how to train them and evaluate them with built-in metrics functions. All of the models are trained on synthetic data, generated by cuML's dataset utilities.

1. Random Forest Classifier
2. UMAP
3. DBSCAN
4. Linear Regression


[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/rapidsai/cuml/blob/tree/branch-0.14/docs/source/estimator_intro.ipynb)

## Classification

### Random Forest Classification and Accuracy metrics

The Random Forest classification model builds several decision trees and aggregates their outputs to make a prediction. For more information on cuML's implementation of the Random Forest Classification model, please refer to:
https://docs.rapids.ai/api/cuml/stable/api.html#cuml.ensemble.RandomForestClassifier
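
The aggregation step can be pictured as a majority vote over the per-tree predictions. The following is a minimal pure-Python sketch of that idea and not cuML's implementation, which may aggregate differently. The helper `majority_vote` and the toy per-tree predictions are hypothetical.

```python
from collections import Counter


def majority_vote(tree_predictions):
    """Combine per-tree class predictions into one label per sample.

    tree_predictions: list of lists, one inner list of labels per tree.
    """
    n_samples = len(tree_predictions[0])
    aggregated = []
    for i in range(n_samples):
        # Count how many trees voted for each label on sample i
        votes = Counter(tree[i] for tree in tree_predictions)
        # most_common(1) returns the label with the highest vote count
        aggregated.append(votes.most_common(1)[0][0])
    return aggregated


# Hypothetical predictions from three trees for four samples
toy_tree_predictions = [
    [0, 1, 1, 2],
    [0, 1, 2, 2],
    [1, 1, 1, 2],
]
ensemble_prediction = majority_vote(toy_tree_predictions)
print(ensemble_prediction)
```

Each sample receives the label that most trees agree on, so a single tree's mistake can be outvoted by the others.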

Accuracy score is the ratio of correct predictions to the total number of predictions. It is used to measure the performance of classification models. 
For more information on the accuracy score metric please refer to: https://en.wikipedia.org/wiki/Accuracy_and_precision

For more information on cuML's implementation of accuracy score metrics please refer to: https://rapidsai.github.io/projects/cuml/en/0.10.0/api.html#cuml.metrics.accuracy.accuracy_score
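
As a rough illustration of the definition above, the ratio of correct predictions to total predictions can be computed by hand. This is a minimal sketch in plain Python, with a hypothetical helper `accuracy` and made-up labels. It is not how cuML or sklearn compute the score internally.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


# Hypothetical labels: 4 of the 5 predictions are correct
toy_true = [0, 1, 2, 2, 1]
toy_pred = [0, 1, 2, 1, 1]
toy_accuracy = accuracy(toy_true, toy_pred)
print(toy_accuracy)
```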

The cell below shows an end-to-end pipeline for the Random Forest Classification model. The dataset is generated with cuML's equivalent of the make_blobs function and then split into training and test sets. The model is trained on the training set and used to predict on the test set. The predictions are then scored with both cuML's and sklearn's accuracy metrics, and the two values are compared.

In [1]:
import numpy as np
import cuml

from cuml.datasets.blobs import blobs as make_blobs
from cuml.ensemble import RandomForestClassifier as curfc
from cuml.preprocessing.model_selection import train_test_split

from sklearn.metrics import accuracy_score

n_samples = 1000
n_features = 10
n_info = 7

X_blobs, y_blobs = make_blobs(n_samples=n_samples, cluster_std=0.1,
                              n_features=n_features, random_state=0,
                              dtype=np.float32)

X_blobs_train, X_blobs_test, y_blobs_train, y_blobs_test = train_test_split(X_blobs,
                                                                            y_blobs, train_size=0.8,
                                                                            random_state=10)

cuml_class_model = curfc(max_features=1.0, n_bins=8, max_depth=10,
                         split_algo=0, min_rows_per_node=2,
                         n_estimators=30)
cuml_class_model.fit(X_blobs_train, y_blobs_train)
cu_preds = cuml_class_model.predict(X_blobs_test)

cu_accuracy = cuml.metrics.accuracy_score(y_blobs_test, cu_preds)
sk_accuracy = accuracy_score(y_blobs_test, cu_preds)

print("cuml's accuracy score : ", cu_accuracy)
print("sklearn's accuracy score : ", sk_accuracy)

cuml's accuracy score :  1.0
sklearn's accuracy score :  1.0




## Clustering

### UMAP and Trustworthiness metrics
UMAP is a non-linear dimensionality reduction algorithm. It can also be used for visualization.
For additional information on the UMAP model, please refer to the documentation at https://rapidsai.github.io/projects/cuml/en/stable/api.html#cuml.UMAP

Trustworthiness measures the extent to which the local structure of the data is retained in the embedding produced by the model. A sample that appears among the nearest neighbors of a point in the embedding, but was not among its nearest neighbors in the original space, is penalized. For more information on the trustworthiness metric, please refer to: https://scikit-learn.org/dev/modules/generated/sklearn.manifold.t_sne.trustworthiness.html

The documentation for cuML's implementation of the trustworthiness metric is at: https://rapidsai.github.io/projects/cuml/en/stable/api.html#cuml.metrics.trustworthiness.trustworthiness
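
To make the penalty concrete, here is a small brute-force sketch of the commonly used trustworthiness definition: for each point, every one of its `n_neighbors` nearest neighbors in the embedding is penalized by how far outside the top `n_neighbors` it ranked in the original space. The function name `trustworthiness_sketch` and the toy data are hypothetical. It assumes Euclidean distance and `n_neighbors < n_samples / 2`, and it is not the cuML or sklearn implementation.

```python
import math


def trustworthiness_sketch(X, X_embedded, n_neighbors=2):
    """Brute-force trustworthiness for small lists of coordinate tuples."""
    n = len(X)

    def neighbor_order(points, i):
        # Indices of all other points, sorted by Euclidean distance to point i
        others = [j for j in range(n) if j != i]
        return sorted(others, key=lambda j: math.dist(points[i], points[j]))

    penalty = 0
    for i in range(n):
        # Rank 1 is the nearest neighbor of point i in the original space
        original_rank = {j: r + 1 for r, j in enumerate(neighbor_order(X, i))}
        for j in neighbor_order(X_embedded, i)[:n_neighbors]:
            # Only neighbors that ranked beyond n_neighbors originally are penalized
            penalty += max(0, original_rank[j] - n_neighbors)

    k = n_neighbors
    return 1.0 - penalty * 2.0 / (n * k * (2.0 * n - 3.0 * k - 1.0))


# Hypothetical 2-D data with two well-separated groups
toy_X = [(0, 0), (1, 0), (2, 0), (10, 0), (11, 0), (12, 0)]
# An embedding that keeps every neighbor ordering intact
faithful = [(0,), (1,), (2,), (10,), (11,), (12,)]
# An embedding that interleaves the two groups
scrambled = [(0,), (10,), (1,), (11,), (2,), (12,)]

faithful_score = trustworthiness_sketch(toy_X, faithful)
scrambled_score = trustworthiness_sketch(toy_X, scrambled)
print(faithful_score, scrambled_score)
```

The faithful embedding incurs no penalty, while the scrambled one pulls distant points into each other's neighborhoods and scores lower.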

The cell below shows an end-to-end UMAP pipeline. The blobs dataset is created with cuML's equivalent of the make_blobs function and used as the input. The output of UMAP's fit_transform is evaluated using the trustworthiness function, and the values obtained from sklearn's and cuML's trustworthiness are compared.


In [2]:
import cuml
import numpy as np

from cuml.datasets.blobs import blobs as make_blobs
from cuml.manifold.umap import UMAP as cuUMAP

from sklearn.manifold import trustworthiness

# Generate a dataset of "blobs" (groups of nearby points) so there is interesting local structure for UMAP to preserve

n_samples = 2**10
n_features = 100

centers = round(n_samples*0.4)
X_blobs, y_blobs = make_blobs(n_samples=n_samples, cluster_std=0.1,
                              n_features=n_features, random_state=0,
                              dtype=np.float32)


X_embedded = cuUMAP(n_neighbors=10).fit_transform(X_blobs)

cu_score = cuml.metrics.trustworthiness(X_blobs, X_embedded)
sk_score = trustworthiness(X_blobs, X_embedded)

print(" cuml's trustworthiness score : ", cu_score)
print(" sklearn's trustworthiness score : ", sk_score)



 cuml's trustworthiness score :  0.8747406726747047
 sklearn's trustworthiness score :  0.8747260626845472


### DBSCAN and Adjusted Rand Index
DBSCAN is a popular and powerful density-based clustering algorithm. For additional information on the DBSCAN model, please refer to the documentation at https://rapidsai.github.io/projects/cuml/en/stable/api.html#cuml.DBSCAN

The blobs dataset is created using cuML's equivalent of the make_blobs function.

The adjusted Rand index measures the similarity between two clusterings of the same data, adjusted for the agreement expected from chance groupings of elements.
For more information on the adjusted Rand index, please refer to: https://en.wikipedia.org/wiki/Rand_index
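
The chance adjustment can be illustrated with the standard pair-counting formula built from a contingency table of the two labelings. The sketch below is plain Python with a hypothetical helper `adjusted_rand_index` and toy labels. It assumes at least two samples and is not the cuML or sklearn implementation.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_true, labels_pred):
    """Pair-counting adjusted Rand index for two labelings of the same samples."""
    n = len(labels_true)
    # Contingency table: how many samples share each (true, predicted) label pair
    contingency = Counter(zip(labels_true, labels_pred))
    true_sizes = Counter(labels_true)
    pred_sizes = Counter(labels_pred)

    # Number of sample pairs grouped together in both, in true, and in predicted
    sum_cells = sum(comb(c, 2) for c in contingency.values())
    sum_true = sum(comb(c, 2) for c in true_sizes.values())
    sum_pred = sum(comb(c, 2) for c in pred_sizes.values())

    expected = sum_true * sum_pred / comb(n, 2)  # agreement expected by chance
    maximum = (sum_true + sum_pred) / 2
    if maximum == expected:
        # Degenerate case, e.g. both labelings put every sample in one cluster
        return 1.0
    return (sum_cells - expected) / (maximum - expected)


toy_labels = [0, 0, 1, 1, 2, 2]
# Same grouping with different label names: the score ignores the names
renamed = [1, 1, 2, 2, 0, 0]
# A coarser grouping that only partly agrees
coarser = [0, 0, 0, 1, 1, 1]

renamed_score = adjusted_rand_index(toy_labels, renamed)
coarser_score = adjusted_rand_index(toy_labels, coarser)
print(renamed_score, coarser_score)
```

Because the score depends only on which samples are grouped together, DBSCAN's cluster ids do not need to match the ids used when the blobs were generated.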

The cell below shows an end-to-end DBSCAN pipeline. The output of DBSCAN's fit_predict is evaluated using the adjusted Rand index function, and the values obtained from sklearn's and cuML's adjusted Rand score metrics are compared.

In [3]:
import numpy as np
import cuml

from cuml.datasets.blobs import blobs as make_blobs
from cuml import DBSCAN as cumlDBSCAN

from sklearn.metrics import adjusted_rand_score

n_samples = 2**10
n_features = 100

centers = round(n_samples*0.4)
X_blobs, y_blobs = make_blobs(n_samples=n_samples, cluster_std=0.01,
                              n_features=n_features, random_state=0,
                              dtype=np.float32)

cuml_dbscan = cumlDBSCAN(eps=3, min_samples=2)
cu_y_pred = cuml_dbscan.fit_predict(X_blobs)

cu_y_pred = cu_y_pred.copy_to_host()
y_blobs = y_blobs.copy_to_host()

cu_adjusted_rand_index = cuml.metrics.cluster.adjusted_rand_score(y_blobs, cu_y_pred)
sk_adjusted_rand_index = adjusted_rand_score(y_blobs, cu_y_pred)

print(" cuml's adjusted Rand index score : ", cu_adjusted_rand_index)
print(" sklearn's adjusted Rand index score : ", sk_adjusted_rand_index)


 cuml's adjusted Rand index score :  1.0
 sklearn's adjusted Rand index score :  1.0




## Regression

### Linear regression and R^2 score
Linear Regression is a simple machine learning model where the response y is modelled by a linear combination of the predictors in X.
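
For intuition, the single-predictor case has a closed-form least-squares solution. The sketch below fits a slope and intercept in plain Python. The helper `fit_simple_linear_regression` and the toy data are hypothetical, and cuML's solver (for example the `eig` algorithm used later in this notebook) works differently and handles many predictors.

```python
def fit_simple_linear_regression(x, y):
    """Ordinary least squares for one predictor: y ~ slope * x + intercept."""
    n = len(x)
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    # Slope is the covariance of x and y divided by the variance of x
    covariance = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    variance = sum((xi - x_mean) ** 2 for xi in x)
    slope = covariance / variance
    intercept = y_mean - slope * x_mean
    return slope, intercept


# Hypothetical noise-free data generated as y = 2x + 1
toy_x = [0.0, 1.0, 2.0, 3.0]
toy_y = [1.0, 3.0, 5.0, 7.0]
slope, intercept = fit_simple_linear_regression(toy_x, toy_y)
print(slope, intercept)
```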

The R^2 score, also known as the coefficient of determination, is a metric for scoring regression models. It measures the proportion of the variation in the response that the model's predictions explain.
For more information on the R^2 score metric, please refer to: https://en.wikipedia.org/wiki/Coefficient_of_determination

For more information on cuML's implementation of the r2 score metric, please refer to: https://rapidsai.github.io/projects/cuml/en/stable/api.html#cuml.metrics.regression.r2_score
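
The definition can be checked by hand as one minus the ratio of the residual sum of squares to the total sum of squares. This is a minimal plain-Python sketch with a hypothetical helper `r2` and made-up values. It assumes the true values are not all identical and is not the cuML or sklearn implementation.

```python
def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_mean = sum(y_true) / len(y_true)
    # Squared error left unexplained by the predictions
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    # Total squared variation of the true values around their mean
    ss_tot = sum((t - y_mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Hypothetical targets and predictions
toy_y_true = [3.0, 5.0, 7.0, 9.0]
toy_y_pred = [2.5, 5.0, 7.5, 9.0]
toy_r2 = r2(toy_y_true, toy_y_pred)
print(toy_r2)
```

A score of 1.0 means the predictions match the targets exactly, and a score of 0.0 means the model does no better than always predicting the mean.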

The cell below uses the Linear Regression model to compare the R^2 scores computed by cuML and sklearn. For more information on cuML's implementation of the Linear Regression model, please refer to:
https://docs.rapids.ai/api/cuml/stable/api.html#linear-regression

In [4]:
import numpy as np
import cuml

from cuml.datasets import make_regression
from cuml.linear_model import LinearRegression as culr
from cuml.preprocessing.model_selection import train_test_split

from sklearn.metrics import r2_score

n_samples = 2**10
n_features = 100
n_info = 70

X_reg, y_reg = make_regression(n_samples=n_samples, n_features=n_features,
                               n_informative=n_info, random_state=123, dtype=np.float32)

# using cuML's train_test_split function to divide the dataset into training and testing splits
X_reg_train, X_reg_test, y_reg_train, y_reg_test = train_test_split(X_reg,
                                                                    y_reg, train_size=0.8,
                                                                    random_state=10)
cuml_reg_model = culr(fit_intercept=True,
                      normalize=True,
                      algorithm='eig')
cuml_reg_model.fit(X_reg_train, y_reg_train)
cu_preds = cuml_reg_model.predict(X_reg_test)

cu_r2 = cuml.metrics.r2_score(y_reg_test, cu_preds)
sk_r2 = r2_score(y_reg_test, cu_preds)

print("cuml's r2 score : ", cu_r2)
print("sklearn's r2 score : ", sk_r2)

cuml's r2 score :  1.0
sklearn's r2 score :  0.9999999999987945


