# Uniform Manifold Approximation and Projection (UMAP)

UMAP is a non-linear dimensionality reduction algorithm. It is also commonly used for visualization.

The model accepts array-like input, either on the host (NumPy arrays) or on the device (Numba arrays or other cuda_array_interface-compliant objects), as well as cuDF DataFrames.

To convert your dataset to cuDF format, please read the cuDF documentation at https://rapidsai.github.io/projects/cudf/en/latest/.

For additional information on the UMAP model, please refer to the documentation at https://rapidsai.github.io/projects/cuml/en/0.6.0/api.html#cuml.UMAP

In [None]:
import os
import numpy as np

import pandas as pd
import cudf as gd

from sklearn import datasets

from sklearn.metrics import adjusted_rand_score
from sklearn.cluster import KMeans

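# trustworthiness is exposed from sklearn.manifold in recent scikit-learn releases;
# the private sklearn.manifold.t_sne module is no longer available there
from sklearn.manifold import trustworthiness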

from umap import UMAP
from cuml.manifold.umap import UMAP as cumlUMAP

## Generate Blobs Data

In [None]:
n_samples = 500
n_features = 10
n_centers = 5

In [None]:
data, labels = datasets.make_blobs(n_samples=n_samples, 
                                   n_features=n_features, 
                                   centers=n_centers)

In [None]:
cuml_umap = cumlUMAP()
embedding = cuml_umap.fit_transform(data)

Score the embedding: cluster it with scikit-learn's KMeans and compare the predicted clusters with the true blob labels using the adjusted Rand score.

In [None]:
adjusted_rand_score(labels, KMeans(n_centers).fit_predict(embedding))
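To make the score above easier to interpret, here is a minimal pure-Python sketch of the adjusted Rand index computed from pair counts. The function name `adjusted_rand_index` is hypothetical and this is not scikit-learn's implementation; it only illustrates the quantity that `adjusted_rand_score` reports. A value of 1.0 means the two labelings agree up to a renaming of clusters, and values near 0 mean agreement no better than chance.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_true, labels_pred):
    """Illustrative adjusted Rand index (hypothetical helper, not sklearn's code)."""
    n = len(labels_true)
    if n != len(labels_pred):
        raise ValueError("both labelings must have the same length")

    total_pairs = comb(n, 2)
    if total_pairs == 0:
        return 1.0  # fewer than two samples: nothing to disagree on

    # Number of sample pairs that share a cluster in both labelings
    cells = Counter(zip(labels_true, labels_pred))
    sum_cells = sum(comb(c, 2) for c in cells.values())

    # Number of sample pairs that share a cluster in each labeling separately
    sum_true = sum(comb(c, 2) for c in Counter(labels_true).values())
    sum_pred = sum(comb(c, 2) for c in Counter(labels_pred).values())

    # Agreement expected by chance, and the maximum attainable agreement
    expected = sum_true * sum_pred / total_pairs
    max_index = (sum_true + sum_pred) / 2
    if max_index == expected:
        return 1.0  # degenerate case, e.g. one cluster in both labelings

    return (sum_cells - expected) / (max_index - expected)
```

Because the index only looks at which pairs of samples are grouped together, the arbitrary cluster ids returned by KMeans do not need to match the ids in `labels`.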

## Load Iris Data

In [None]:
iris = datasets.load_iris()
data = iris.data

## Fit cuML UMAP Model

In [None]:
%%time
cuml_umap = cumlUMAP(n_neighbors=10, 
                     min_dist=0.01)

embedding = cuml_umap.fit_transform(data)

## Evaluate Trustworthiness

In [None]:
# pass n_neighbors by keyword; recent scikit-learn releases require it
trustworthiness(iris.data, embedding, n_neighbors=10)
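Trustworthiness measures how well the embedding preserves local neighborhoods: a point is penalized whenever one of its k nearest neighbors in the embedding is not among its k nearest neighbors in the original space, and the penalty grows with that neighbor's rank in the original space. The following is a minimal NumPy sketch of that definition, assuming brute-force Euclidean distances; the name `trustworthiness_sketch` is hypothetical and the code is not scikit-learn's implementation.

```python
import numpy as np


def trustworthiness_sketch(X, X_embedded, n_neighbors=5):
    """Illustrative trustworthiness score in [0, 1] (hypothetical helper)."""
    X = np.asarray(X, dtype=float)
    X_embedded = np.asarray(X_embedded, dtype=float)
    n = X.shape[0]
    k = n_neighbors
    if not 0 < k < n / 2:
        raise ValueError("n_neighbors must be positive and below n_samples / 2")

    def pairwise_distances(A):
        # Brute-force Euclidean distances; fine for small data sets only
        D = np.linalg.norm(A[:, None, :] - A[None, :, :], axis=-1)
        np.fill_diagonal(D, np.inf)  # a point is never its own neighbor
        return D

    rows = np.arange(n)[:, None]

    # rank_orig[i, j] is the 1-based rank of j among i's neighbors in input space
    order_orig = np.argsort(pairwise_distances(X), axis=1)
    rank_orig = np.empty((n, n), dtype=int)
    rank_orig[rows, order_orig] = np.arange(1, n + 1)[None, :]

    # Indices of the k nearest neighbors of each point in the embedded space
    nn_embedded = np.argsort(pairwise_distances(X_embedded), axis=1)[:, :k]

    # Embedded neighbors ranked beyond k in input space contribute (rank - k)
    penalty = np.clip(rank_orig[rows, nn_embedded] - k, 0, None).sum()

    return 1.0 - penalty * 2.0 / (n * k * (2.0 * n - 3.0 * k - 1.0))
```

A score of 1.0 means every embedded neighborhood consists only of true neighbors from the original space; lower scores indicate that the embedding pulled unrelated points close together.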

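Create a boolean selection mask of length 150, one entry per iris sample. Each entry is drawn independently and is True with probability 0.75, so roughly 75% of the rows are selected; the exact share varies from run to run.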

In [None]:
iris_selection = np.random.choice(
    [True, False], 150, replace=True, p=[0.75, 0.25])

data = iris.data[iris_selection]
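The masking pattern used above can be sketched in isolation with NumPy. This example assumes a seeded `np.random.default_rng` generator (the seed value 42 is an arbitrary choice, not part of the original notebook) so that the split is reproducible, and it uses a stand-in array in place of `iris.data`.

```python
import numpy as np

# Seeded generator so the same rows are selected on every run (seed is arbitrary)
rng = np.random.default_rng(42)

n_rows = 150
# Each entry is True with probability 0.75, independently of the others
selection = rng.choice([True, False], size=n_rows, replace=True, p=[0.75, 0.25])

# Stand-in for iris.data: 150 rows, 4 columns
data = np.arange(n_rows * 4).reshape(n_rows, 4)

fit_rows = data[selection]       # rows where the mask is True
holdout_rows = data[~selection]  # complement: rows where the mask is False

# The two parts are disjoint and together cover every row
print(fit_rows.shape[0], holdout_rows.shape[0])
```

Since the draws are independent, the selected share is only approximately 75%; the two parts always add up to the full data set.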

In [None]:
cuml_umap = cumlUMAP(n_neighbors=10, min_dist=0.01, verbose=False)
cuml_umap.fit(data)

# build a held-out iris subset by inverting the selection mask (roughly 25% of the rows)
new_data = iris.data[~iris_selection]
# project the held-out rows into the embedding space learned by the fitted model
embedding = cuml_umap.transform(new_data)

In [None]:
# calculate the trustworthiness score of the embedding of the held-out rows (new_data)
trust = trustworthiness(new_data, embedding, n_neighbors=10)
print(trust)