# Uniform Manifold Approximation and Projection (UMAP)

UMAP is a non-linear dimensionality reduction algorithm. It is often used to project high-dimensional data into two or three dimensions for visualization.

The model accepts array-like input: NumPy arrays in host memory, device arrays (Numba arrays or any object that implements `__cuda_array_interface__`), and cuDF DataFrames.

To convert your dataset to cuDF format, see the cuDF documentation at https://docs.rapids.ai/api/cudf/stable.

For more information on the UMAP model, see the documentation at https://rapidsai.github.io/projects/cuml/en/stable/api.html#cuml.UMAP

In [None]:
import os
import numpy as np

import pandas as pd
import cudf as gd

from sklearn import datasets

from sklearn.metrics import adjusted_rand_score
from sklearn.cluster import KMeans

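# trustworthiness is exposed publicly in sklearn.manifold;
# the private sklearn.manifold.t_sne module is not available in recent releases
from sklearn.manifold import trustworthiness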

from cuml.manifold.umap import UMAP as cumlUMAP

## Generate Data

In [None]:
n_samples = 500
n_features = 10
n_centers = 5

n_neighbors = 10

In [None]:
data, labels = datasets.make_blobs(n_samples=n_samples, 
                                   n_features=n_features, 
                                   centers=n_centers)

## Fit Embeddings

In [None]:
cuml_umap = cumlUMAP()
embedding = cuml_umap.fit_transform(data)

## Evaluate Neighborhoods

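Cluster the embedding with scikit-learn's k-means and compare the resulting cluster assignments with the original blob labels using the adjusted Rand score. A score of 1.0 means the clusters recovered from the embedding match the original labels exactly, which suggests the embedding preserved the local neighborhood structure well.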

In [None]:
adjusted_rand_score(labels, KMeans(n_clusters=n_centers).fit_predict(embedding))
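
The adjusted Rand score is used here because k-means numbers its clusters arbitrarily, so cluster IDs cannot be compared with the original labels directly. The following minimal sketch uses hand-made label vectors (hypothetical, not part of this notebook's data) to show that the score depends only on how samples are grouped, not on which ID each group receives.

```python
from sklearn.metrics import adjusted_rand_score

# Hypothetical toy labels, for illustration only.
true_labels = [0, 0, 1, 1, 2, 2]

# Same grouping as true_labels, but with the cluster IDs permuted.
permuted = [2, 2, 0, 0, 1, 1]

# Different grouping: the first two groups are mixed together.
mismatched = [0, 1, 0, 1, 2, 2]

perfect_score = adjusted_rand_score(true_labels, permuted)
partial_score = adjusted_rand_score(true_labels, mismatched)

print(perfect_score, partial_score)
```

The permuted labelling scores a perfect 1.0, while the mixed grouping scores lower.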

## Load Iris Data

In [None]:
iris = datasets.load_iris()
data = iris.data

## Fit Embeddings

In [None]:
%%time
cuml_umap = cumlUMAP(n_neighbors=n_neighbors, 
                     min_dist=0.01)

embedding = cuml_umap.fit_transform(data)

## Evaluate Trustworthiness

Trustworthiness measures how well an embedding preserves local neighborhood structure. For each point, it checks whether its nearest neighbors in the embedding are also near neighbors in the input space. Points that appear as neighbors only in the embedding are penalized in proportion to their rank in the input space. The score lies in [0, 1], and higher is better.

In [None]:
# n_neighbors is keyword-only in recent scikit-learn releases
trustworthiness(iris.data, embedding, n_neighbors=10)
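
To give a feel for the range of the score, here is a small CPU-only sketch using scikit-learn's `trustworthiness` on synthetic data. The data, the variable names, and the choice of `n_neighbors=5` are assumptions made for illustration. An "embedding" identical to the input keeps every neighborhood intact, whereas a random embedding unrelated to the input keeps almost none.

```python
import numpy as np
from sklearn.manifold import trustworthiness

rng = np.random.default_rng(0)

# Hypothetical data: 100 points in 5 dimensions.
X = rng.normal(size=(100, 5))

# Using the input itself as the embedding preserves every neighborhood.
identity_score = trustworthiness(X, X, n_neighbors=5)

# A random 2-D embedding has no relation to the neighborhoods in X.
random_embedding = rng.normal(size=(100, 2))
random_score = trustworthiness(X, random_embedding, n_neighbors=5)

print(identity_score, random_score)
```

A useful embedding should score much closer to the first value than to the second.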

## Split Train / Test Data

Create a boolean selection mask that assigns each sample to the training set with probability 0.75 and to the test set otherwise, giving roughly a 75% / 25% split.

In [None]:
iris_selection = np.random.choice(
    [True, False], 150, replace=True, p=[0.75, 0.25])

data = iris.data[iris_selection]
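
Because the mask is drawn independently for each sample, the resulting split sizes vary from run to run. The sketch below contrasts that approach with scikit-learn's `train_test_split`, which yields fixed sizes. It uses a synthetic stand-in array with the same shape as `iris.data`; the array and the seeds are assumptions made for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Stand-in for iris.data: 150 samples, 4 features.
X = rng.normal(size=(150, 4))

# Random mask: each row goes to the training set with probability 0.75,
# so the split sizes are only approximately 75% / 25%.
mask = rng.choice([True, False], size=150, replace=True, p=[0.75, 0.25])
train_masked, test_masked = X[mask], X[~mask]

# Fixed-size alternative: 112 training rows and 38 test rows for 150 samples.
train_fixed, test_fixed = train_test_split(X, test_size=0.25, random_state=0)

print(train_masked.shape, test_masked.shape)
print(train_fixed.shape, test_fixed.shape)
```

Either approach works for this notebook; the fixed-size split is simply easier to reproduce.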

### Train Model

In [None]:
cuml_umap = cumlUMAP(n_neighbors=n_neighbors, min_dist=0.01, verbose=False)
cuml_umap.fit(data)

# Build the held-out set by inverting the selection mask (roughly 25% of the samples)
new_data = iris.data[~iris_selection]

## Transform New Data

In [None]:
embedding = cuml_umap.transform(new_data)

## Evaluate Trustworthiness

Evaluating the trustworthiness on predictions from unseen data gives an indication of UMAP's ability to map the unseen data onto the manifold constructed from the training data. 

In [None]:
# n_neighbors is keyword-only in recent scikit-learn releases
trustworthiness(new_data, embedding, n_neighbors=10)