## Clustering

### UMAP and Trustworthiness metrics
UMAP (Uniform Manifold Approximation and Projection) is a non-linear dimensionality reduction algorithm that is also commonly used for visualization.
For additional information on the UMAP model, please refer to the [RAPIDS UMAP documentation](https://docs.rapids.ai/api/cuml/stable/api.html#cuml.UMAP).

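Trustworthiness measures the extent to which the local structure of the data is retained in the model's embedding. A sample that 
appears among a point's nearest neighbors in the embedding, but is not among its nearest neighbors in the original space, is 
penalized in proportion to its rank in the original space. The score lies between 0 and 1, where 1 means the local structure is fully preserved.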
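
The definition above can be sketched in a few lines of pure Python. This is a minimal illustration for small inputs, not the cuML or scikit-learn implementation: the function name `trustworthiness_score` is hypothetical, Euclidean distance is assumed, and the normalization follows the formula given in the scikit-learn documentation.

```python
import math


def trustworthiness_score(X, X_embedded, n_neighbors=5):
    """Pure-Python trustworthiness for small inputs (illustrative helper)."""
    n = len(X)
    if not 0 < n_neighbors < n / 2:
        raise ValueError("n_neighbors must be positive and less than n_samples / 2")

    def neighbor_order(points, i):
        # Indices of all other points, nearest first (Euclidean distance).
        others = [j for j in range(n) if j != i]
        return sorted(others, key=lambda j: math.dist(points[i], points[j]))

    penalty = 0
    for i in range(n):
        # Rank (1 = nearest) of every other point in the original space.
        rank_in_input = {j: r for r, j in enumerate(neighbor_order(X, i), start=1)}
        # An embedded neighbor that was not a true neighbor adds (rank - k).
        for j in neighbor_order(X_embedded, i)[:n_neighbors]:
            penalty += max(0, rank_in_input[j] - n_neighbors)

    k = n_neighbors
    return 1.0 - 2.0 * penalty / (n * k * (2 * n - 3 * k - 1))


# Six points on a line, and an "embedding" that scrambles their order.
points = [(0.0,), (1.0,), (2.0,), (3.0,), (4.0,), (5.0,)]
shuffled = [(0.0,), (5.0,), (1.0,), (4.0,), (2.0,), (3.0,)]

perfect = trustworthiness_score(points, points, n_neighbors=2)
degraded = trustworthiness_score(points, shuffled, n_neighbors=2)
print(perfect, degraded)
```

An embedding identical to the input incurs no penalty and scores exactly 1.0, while the scrambled embedding scores lower because distant points have become neighbors.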

**Additional Resources:**
- [scikit-learn trustworthiness documentation](https://scikit-learn.org/stable/modules/generated/sklearn.manifold.trustworthiness.html)
- [cuML trustworthiness documentation](https://docs.rapids.ai/api/cuml/stable/api.html#cuml.metrics.trustworthiness.trustworthiness)

The cell below shows an end-to-end UMAP pipeline. The input blobs dataset is created with cuML's equivalent of the `make_blobs` 
function. The embedding produced by UMAP's `fit` and `transform` is evaluated using the trustworthiness function, and the
values obtained from scikit-learn and cuML are compared.


In [4]:
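import cuml
import numpy as np
from cupy import asnumpy  # copies device arrays to host NumPy arrays
from joblib import dump, load  # assumed source of dump/load (file-name based API)
from cuml.datasets import make_blobs
from cuml.manifold.umap import UMAP as cuUMAP
from sklearn.manifold import trustworthiness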

In [None]:
%%time

n_samples = 1000
n_features = 100
cluster_std = 0.1

X_blobs, y_blobs = make_blobs(
    n_samples=n_samples,
    cluster_std=cluster_std,
    n_features=n_features,
    random_state=0,
    dtype=np.float32,
)

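# fit UMAP on the GPU, then project the same data into the embedding space
trained_UMAP = cuUMAP(n_neighbors=10).fit(X_blobs)
X_embedded = trained_UMAP.transform(X_blobs)

# score the embedding with cuML (device arrays) and scikit-learn (host copies)
cu_score = cuml.metrics.trustworthiness(X_blobs, X_embedded)
sk_score = trustworthiness(asnumpy(X_blobs), asnumpy(X_embedded))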

print("cuml's trustworthiness score:", cu_score)
print("sklearn's trustworthiness score:", sk_score)

# save the trained model to disk
dump(trained_UMAP, "UMAP.model")

# to reload the model, uncomment the line below
# loaded_model = load('UMAP.model')

### DBSCAN and Adjusted Rand Index

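DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is a density-based clustering algorithm: it groups points 
that lie in dense regions and labels points in sparse regions as noise. For additional information on the DBSCAN model, please refer to the 
[DBSCAN documentation](https://docs.rapids.ai/api/cuml/stable/api.html#cuml.DBSCAN).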
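
To make the roles of the `eps` and `min_samples` parameters concrete, here is a minimal pure-Python sketch of the DBSCAN idea. It is not how cuML implements the algorithm: the function name `dbscan_labels` is hypothetical, Euclidean distance is assumed, and a point's neighborhood is assumed to include the point itself when compared against `min_samples`.

```python
import math

NOISE = -1


def dbscan_labels(points, eps, min_samples):
    """Brute-force DBSCAN for small inputs (illustrative helper)."""
    n = len(points)
    # eps-neighborhood of every point; each point is its own neighbor.
    neighbors = [
        [j for j in range(n) if math.dist(points[i], points[j]) <= eps]
        for i in range(n)
    ]
    labels = [None] * n
    cluster = 0
    for i in range(n):
        if labels[i] is not None:
            continue
        if len(neighbors[i]) < min_samples:
            labels[i] = NOISE  # may later be claimed as a border point
            continue
        # i is a core point: start a new cluster and grow it outwards.
        labels[i] = cluster
        frontier = list(neighbors[i])
        while frontier:
            j = frontier.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point: joins but is not expanded
            if labels[j] is not None:
                continue
            labels[j] = cluster
            if len(neighbors[j]) >= min_samples:
                frontier.extend(neighbors[j])  # j is also a core point
        cluster += 1
    return labels


sample = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (50, 50)]
sample_labels = dbscan_labels(sample, eps=1.5, min_samples=2)
print(sample_labels)  # [0, 0, 0, 1, 1, -1]
```

The two dense groups become clusters 0 and 1, and the isolated point is labeled -1, the same noise label that scikit-learn and cuML use.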

We create the blobs dataset using the cuML equivalent of the `make_blobs` function.

The adjusted Rand index measures the similarity between two clusterings of the same data, corrected for the agreement 
expected from chance grouping of elements.
For more information on the adjusted Rand index, please refer to the [Wikipedia article on the Rand index](https://en.wikipedia.org/wiki/Rand_index).
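
The chance correction can be made explicit by computing the adjusted Rand index directly from pair counts. The sketch below uses only the standard library. The helper name `adjusted_rand_index` is hypothetical, and the code illustrates the formula rather than the scikit-learn or cuML implementation.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_true, labels_pred):
    """Adjusted Rand index computed from pair counts (illustrative helper)."""
    if len(labels_true) != len(labels_pred):
        raise ValueError("both labelings must cover the same samples")
    n = len(labels_true)

    # Pairs of samples that share a cluster in both labelings, in the
    # reference labeling, and in the predicted labeling, respectively.
    pairs_both = sum(comb(c, 2) for c in Counter(zip(labels_true, labels_pred)).values())
    pairs_true = sum(comb(c, 2) for c in Counter(labels_true).values())
    pairs_pred = sum(comb(c, 2) for c in Counter(labels_pred).values())

    total_pairs = comb(n, 2)
    if total_pairs == 0:
        return 1.0
    expected = pairs_true * pairs_pred / total_pairs  # agreement expected by chance
    maximum = (pairs_true + pairs_pred) / 2
    if maximum == expected:
        # Degenerate case, e.g. both labelings put everything in one cluster.
        return 1.0
    return (pairs_both - expected) / (maximum - expected)


same_up_to_renaming = adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0])
partial_agreement = adjusted_rand_index([0, 0, 1, 2], [0, 0, 1, 1])
print(same_up_to_renaming, partial_agreement)
```

The first score is 1.0 because the metric ignores how clusters are named. The second is 4/7 (about 0.571), since merging two of the reference clusters costs some agreement.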

The cell below shows an end-to-end DBSCAN pipeline. The output of DBSCAN's `fit_predict` is evaluated using the adjusted 
Rand index, and the values obtained from scikit-learn and cuML are compared.

In [8]:
from cuml import DBSCAN as cumlDBSCAN
from sklearn.metrics import adjusted_rand_score

In [None]:
n_samples = 1000
n_features = 100
cluster_std = 0.1

X_blobs, y_blobs = make_blobs(
    n_samples=n_samples,
    n_features=n_features,
    cluster_std=cluster_std,
    random_state=0,
    dtype=np.float32,
)

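cuml_dbscan = cumlDBSCAN(eps=3, min_samples=2)

# fit_predict fits the model and returns the cluster labels in one call,
# so a separate fit() beforehand would only repeat the work
cu_y_pred = cuml_dbscan.fit_predict(X_blobs)
trained_DBSCAN = cuml_dbscan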

cu_adjusted_rand_index = cuml.metrics.cluster.adjusted_rand_score(y_blobs, cu_y_pred)
sk_adjusted_rand_index = adjusted_rand_score(asnumpy(y_blobs), asnumpy(cu_y_pred))

print(f"cuml's adjusted Rand index score: {cu_adjusted_rand_index}")
print(f"sklearn's adjusted Rand index score: {sk_adjusted_rand_index}")

# save and optionally reload
dump(trained_DBSCAN, "DBSCAN.model")

# to reload the model, uncomment the line below
# loaded_model = load('DBSCAN.model')