# Nearest Neighbors Multi-Node Multi-GPU (MNMG) Demo

The nearest neighbors multi-node multi-GPU (MNMG) implementation leverages Dask to spread data and computations across multiple workers. cuML uses a One Process Per GPU (OPG) layout, which maps a single Dask worker to each GPU.

The main difference between cuML's MNMG implementation of nearest neighbors and the single-GPU implementation is that the `kneighbors()` query partitions can be broadcast to each of the workers in batches, and the nearest neighbors search is then performed in parallel.
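The general idea of searching partitions independently and merging the partial results can be sketched on the CPU with the standard library alone. This is a minimal illustration under stated assumptions, not cuML's actual implementation: the function names `local_knn` and `partitioned_knn` are hypothetical, the "workers" are plain Python lists, and the loop over partitions runs sequentially where a real cluster would run it in parallel.

```python
import heapq
import itertools
import math


def local_knn(partition, offset, query, k):
    """Brute-force search of one partition.

    Returns up to k (distance, global_index) pairs, sorted by distance.
    """
    candidates = (
        (math.dist(query, point), offset + i) for i, point in enumerate(partition)
    )
    return heapq.nsmallest(k, candidates)


def partitioned_knn(partitions, queries, k):
    """Search every partition for every query, then merge the partial results."""
    # Global row offset of each partition, so indices refer to the full dataset.
    offsets = [0]
    for part in partitions[:-1]:
        offsets.append(offsets[-1] + len(part))

    results = []
    for query in queries:
        # Hypothetical stand-in for per-worker searches; run sequentially here.
        partial = [
            local_knn(part, off, query, k) for part, off in zip(partitions, offsets)
        ]
        # Each partial list is sorted, so a k-way merge yields the global top k.
        merged = heapq.merge(*partial)
        results.append(list(itertools.islice(merged, k)))
    return results


index_partitions = [
    [(0.0, 0.0), (1.0, 0.0), (5.0, 5.0)],  # global rows 0-2
    [(0.1, 0.0), (4.0, 4.0), (0.0, 0.2)],  # global rows 3-5
]
queries = [(0.0, 0.0), (4.5, 4.5)]

neighbors = partitioned_knn(index_partitions, queries, k=2)
indices = [[idx for _, idx in row] for row in neighbors]
print(indices)
```

In this toy data, the two nearest rows for the first query come from different partitions (rows 0 and 3), which is why the per-partition results have to be merged rather than simply concatenated.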

Unlike the single-GPU implementation, the MNMG nearest neighbors API currently requires a Dask cuDF DataFrame as input. `kneighbors()` also returns a Dask cuDF DataFrame. The Dask cuDF DataFrame API is very similar to the Dask DataFrame API, but the underlying DataFrames are cuDF rather than Pandas.

For information on converting your dataset to Dask cuDF format: https://rapidsai.github.io/projects/cudf/en/latest/dask-cudf.html

For additional information on cuML's MNMG nearest neighbors implementation: https://rapidsai.github.io/projects/cuml/en/latest/api.html#nearest-neighbors

In [1]:
import numpy as np

import pandas as pd
import cudf as gd

from cuml.dask.common import to_dask_df
from cuml.dask.datasets import make_blobs

from dask.distributed import Client, wait
from dask_cuda import LocalCUDACluster

from sklearn.neighbors import NearestNeighbors as skNeighbors
from cuml.dask.neighbors import NearestNeighbors as cumlNeighbors

## Start Dask Cluster

We can use the `LocalCUDACluster` to start a Dask cluster on a single machine with one worker mapped to each GPU. This is called one-process-per-GPU (OPG). 

In [2]:
cluster = LocalCUDACluster(threads_per_worker=1)
client = Client(cluster)

Port 8787 is already in use. 
Perhaps you already have a cluster running?
Hosting the diagnostics dashboard on a random port instead.


## Define Parameters

In [11]:
n_samples = 100000
n_features = 50

n_total_partitions = len(list(client.has_what().keys()))

n_neighbors = 5

n_query = 5000

## Generate Data

### Device

We can generate a dask_cudf.DataFrame of synthetic data for multiple clusters using `cuml.dask.datasets.make_blobs`.

In [12]:
X_cudf_train, _ = make_blobs(n_samples, 
                             n_features,
                             centers = 5, 
                             n_parts = n_total_partitions,
                             cluster_std=0.1, 
                             verbose=True)

Generating 50000 samples across 2 partitions on 2 workers (total=100000 samples)


In [13]:
X_cudf_query, _ = make_blobs(n_query, 
                             n_features,
                             centers = 5, 
                             n_parts = n_total_partitions,
                             cluster_std=0.1, 
                             verbose=True)

Generating 2500 samples across 2 partitions on 2 workers (total=5000 samples)


### Host

We use `cuml.dask.common.to_dask_df` to convert a dask_cudf.DataFrame in device memory into a dask.DataFrame backed by Pandas DataFrames in host memory. Since our baseline is not distributed, we use `compute()` to bring our data to a single process.

In [14]:
wait([X_cudf_train, X_cudf_query])

X_df_train = to_dask_df(X_cudf_train).compute()
X_df_query = to_dask_df(X_cudf_query).compute()

## Scikit-learn model

Since there is no distributed Scikit-learn equivalent to cuML's MNMG Nearest Neighbors implementation, we will use the basic brute-force nearest neighbors implementation from Scikit-learn as our baseline. 

In [15]:
%%time
knn_sk = skNeighbors(algorithm="brute", n_jobs=-1)
knn_sk.fit(X_df_train)

CPU times: user 5.33 ms, sys: 1.34 ms, total: 6.67 ms
Wall time: 5.38 ms


NearestNeighbors(algorithm='brute', leaf_size=30, metric='minkowski',
                 metric_params=None, n_jobs=-1, n_neighbors=5, p=2, radius=1.0)

In [16]:
%%time
D_sk, I_sk = knn_sk.kneighbors(X_df_query, n_neighbors)

CPU times: user 41.8 s, sys: 53.2 s, total: 1min 35s
Wall time: 11.1 s


## cuML Model

In [17]:
%%time
knn_cuml = cumlNeighbors(algorithm="brute")
knn_cuml.fit(X_cudf_train)

CPU times: user 40.8 ms, sys: 11.8 ms, total: 52.7 ms
Wall time: 66.6 ms


<cuml.dask.neighbors.nearest_neighbors.NearestNeighbors at 0x7f6b285912b0>

In [19]:
%%time
D_cuml, I_cuml = knn_cuml.kneighbors(X_cudf_query)

CPU times: user 235 ms, sys: 31.1 ms, total: 266 ms
Wall time: 1.32 s


## Compare Results

cuML currently uses FAISS for exact nearest neighbors search, which limits inputs to single precision. This can introduce round-off errors when floats of different magnitudes are added. As a result, it's very likely that the cuML results will not match Scikit-learn's nearest neighbors exactly. You can read more in the [FAISS wiki](https://github.com/facebookresearch/faiss/wiki/FAQ#why-do-i-get-weird-results-with-brute-force-search-on-vectors-with-large-components).
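The round-off effect itself is easy to reproduce without a GPU. The sketch below is a standard-library illustration, not FAISS or cuML code: the helper `to_float32` is a hypothetical name that emulates single precision by round-tripping a Python float through `struct`.

```python
import struct


def to_float32(x):
    """Round a Python float (double precision) to the nearest float32 value."""
    return struct.unpack("f", struct.pack("f", x))[0]


large = to_float32(1.0e8)
small = to_float32(1.0)

# Double precision has enough mantissa bits to keep the small addend.
sum_double = large + small

# Single precision: adjacent float32 values near 1e8 are 8 apart,
# so the result rounds back to 1e8 and the small addend is lost.
sum_single = to_float32(large + small)

print("double precision keeps the addend:", sum_double - large)
print("single precision drops the addend:", sum_single - large)
```

When two candidate neighbors are nearly equidistant from a query, errors of this kind can be enough to swap their order or to replace one with the other, which is why the comparisons below use a tolerance instead of exact equality.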

### Distances

In [31]:
passed = np.allclose(D_sk, D_cuml.compute().as_gpu_matrix(), atol=1e-3)
print('compare knn: cuml vs sklearn distances %s' % ('equal' if passed else 'NOT equal'))

compare knn: cuml vs sklearn distances equal


### Indices

In [30]:
sk_sorted = np.sort(I_sk, axis=1)
cuml_sorted = np.sort(I_cuml.compute().as_gpu_matrix(), axis=1)

diff = sk_sorted - cuml_sorted

# Pass if the number of mismatched index entries is below 1% of the number of queries
passed = (len(diff[diff != 0]) / n_query) < 1e-2
print('compare knn: cuml vs sklearn indexes %s' % ('equal' if passed else 'NOT equal'))

compare knn: cuml vs sklearn indexes equal
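An alternative, order-insensitive check is to measure how many of the reference neighbors appear anywhere in the corresponding row of the other result. The sketch below is a hypothetical helper, not part of cuML or Scikit-learn, and it uses small hand-written index lists in place of `I_sk` and `I_cuml`.

```python
def neighbor_overlap(expected, actual):
    """Fraction of expected neighbor indices also found in the same row of actual."""
    matched = 0
    total = 0
    for exp_row, act_row in zip(expected, actual):
        # Set intersection ignores the order of neighbors within a row.
        matched += len(set(exp_row) & set(act_row))
        total += len(exp_row)
    return matched / total


reference = [[0, 3, 7], [2, 4, 9]]
candidate = [[3, 0, 7], [4, 2, 8]]  # same neighbors reordered, one mismatch

overlap = neighbor_overlap(reference, candidate)
print(f"overlap: {overlap:.3f}")
```

Compared with subtracting sorted rows, this counts each substituted neighbor once. With the sorted-difference approach, a single substitution can shift the remaining entries and register as several mismatches in the same row.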
