# K-Means Multi-Node Multi-GPU (MNMG) Demo

cuML's multi-node multi-GPU (MNMG) K-Means implementation uses Dask to distribute data and computation across multiple workers. cuML uses a one-process-per-GPU (OPG) layout, which maps a single Dask worker to each GPU.

The main difference between cuML's MNMG implementation of k-means and the single-GPU implementation is that, within each iteration, the workers process their partitions of the data in parallel, and only the centroids are shared between iterations. The MNMG version also provides the same scalable k-means++ initialization algorithm as the single-GPU version.
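To make the "only the centroids are shared" idea concrete, here is a minimal pure-Python sketch of a partitioned k-means update. It is not cuML's implementation: the helper names `partial_stats` and `update_centroids` are hypothetical, and plain lists stand in for the per-GPU partitions. Each partition reports only per-cluster sums and counts, which are combined into new centroids.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def partial_stats(partition, centroids):
    """Work done independently on one partition (one worker in this analogy)."""
    dim = len(centroids[0])
    sums = [[0.0] * dim for _ in centroids]
    counts = [0] * len(centroids)
    for p in partition:
        # Assign the point to its nearest centroid.
        k = min(range(len(centroids)), key=lambda j: sq_dist(p, centroids[j]))
        counts[k] += 1
        for d in range(dim):
            sums[k][d] += p[d]
    return sums, counts


def update_centroids(partitions, centroids):
    """Combine per-partition sums and counts into new centroids."""
    dim = len(centroids[0])
    total_sums = [[0.0] * dim for _ in centroids]
    total_counts = [0] * len(centroids)
    for part in partitions:  # each call could run on a different worker
        sums, counts = partial_stats(part, centroids)
        for k in range(len(centroids)):
            total_counts[k] += counts[k]
            for d in range(dim):
                total_sums[k][d] += sums[k][d]
    # Keep the old centroid if a cluster received no points.
    return [
        tuple(s / total_counts[k] for s in total_sums[k]) if total_counts[k] else centroids[k]
        for k in range(len(centroids))
    ]


# Two partitions of a tiny 2-D dataset, and two starting centroids.
partitions = [
    [(0.0, 0.0), (0.0, 2.0), (10.0, 10.0)],
    [(2.0, 0.0), (2.0, 2.0), (10.0, 12.0)],
]
centroids = [(1.0, 0.0), (9.0, 9.0)]
for _ in range(3):
    centroids = update_centroids(partitions, centroids)
print(centroids)
```

The raw points never leave their partition in this sketch; only the small sums and counts are combined, which is why the communication cost does not grow with the number of samples.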

Unlike the single-GPU implementation, the MNMG k-means API requires a Dask cuDF DataFrame as input. `predict()` and `transform()` also return a Dask cuDF DataFrame. The Dask cuDF DataFrame API is very similar to the Dask DataFrame API, but the underlying DataFrames are cuDF rather than Pandas.

For information on converting your dataset to Dask cuDF format: https://rapidsai.github.io/projects/cudf/en/latest/dask-cudf.html

For additional information on cuML's MNMG k-means implementation: https://rapidsai.github.io/projects/cuml/en/latest/api.html#k-means-clustering

In [None]:
import numpy as np

import pandas as pd
import cudf as gd

from cuml.dask.common import to_dask_df
from cuml.dask.datasets import make_blobs

from cuml.metrics import adjusted_rand_score

%matplotlib inline
import matplotlib.pyplot as plt

from dask.distributed import Client, wait
from dask_cuda import LocalCUDACluster

from dask_ml.cluster import KMeans as skKMeans
from cuml.dask.cluster.kmeans import KMeans as cumlKMeans

## Start Dask Cluster

We can use the `LocalCUDACluster` to start a Dask cluster on a single machine with one worker mapped to each GPU. This is called one-process-per-GPU (OPG). 

In [None]:
cluster = LocalCUDACluster(threads_per_worker=1)
client = Client(cluster)

In [None]:
# Use scikit-learn's host-side adjusted_rand_score for the final comparison.
# This rebinds the name imported from cuml.metrics above.
from sklearn.metrics import adjusted_rand_score

## Define Parameters

In [None]:
n_samples = 1000000
n_features = 2

# One partition per Dask worker (one worker per GPU)
n_total_partitions = len(client.has_what())

## Generate Data

### Device

We can generate a dask_cudf.DataFrame of synthetic data for multiple clusters using `cuml.dask.datasets.make_blobs`.

In [None]:
X_cudf, Y_cudf = make_blobs(n_samples,
                            n_features,
                            centers=5,
                            n_parts=n_total_partitions,
                            cluster_std=0.1,
                            verbose=True)

### Host

We use `cuml.dask.common.to_dask_df` to convert a dask_cudf.DataFrame in device memory into a dask.DataFrame backed by Pandas DataFrames in host memory.

In [None]:
X_df = to_dask_df(X_cudf)

## Dask-ML Model

scikit-learn has no equivalent to cuML's multi-node multi-GPU K-means, so we use Dask-ML's distributed implementation for comparison.

In [None]:
%%time
kmeans_sk = skKMeans(init="k-means||",
                     n_clusters=5,
                     n_jobs=-1)
kmeans_sk.fit(X_df)

In [None]:
%%time
labels_sk = kmeans_sk.predict(X_df).compute()

## cuML Model

In [None]:
%%time
kmeans_cuml = cumlKMeans(init="k-means||",
                         n_clusters=5)
kmeans_cuml.fit(X_cudf)

In [None]:
%%time
labels_cuml = kmeans_cuml.predict(X_cudf).compute()

## Compare Results

In [None]:
%%time
score = adjusted_rand_score(labels_sk, labels_cuml.to_pandas().values)

In [None]:
passed = score == 1.0
print('compare kmeans: cuml vs Dask-ML labels are ' + ('equal' if passed else 'NOT equal'))
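The comparison above uses the adjusted Rand score rather than direct label equality because the cluster ids assigned by two k-means runs are arbitrary: the same grouping can come back with different id numbers. The following is a minimal pure-Python sketch of the adjusted Rand index, written only to illustrate that property. The function name `adjusted_rand_index` is hypothetical and this is not the scikit-learn or cuML implementation.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_a, labels_b):
    """Adjusted Rand index computed from pair counts of a contingency table."""
    n = len(labels_a)
    total_pairs = comb(n, 2)
    if total_pairs == 0:  # fewer than two samples: treat as trivially identical
        return 1.0
    # Pairs of samples that share a cluster in both labelings.
    sum_cells = sum(comb(c, 2) for c in Counter(zip(labels_a, labels_b)).values())
    # Pairs of samples that share a cluster within each labeling.
    sum_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    sum_b = sum(comb(c, 2) for c in Counter(labels_b).values())
    expected = sum_a * sum_b / total_pairs
    max_index = (sum_a + sum_b) / 2
    if max_index == expected:  # degenerate case, e.g. one cluster in both labelings
        return 1.0
    return (sum_cells - expected) / (max_index - expected)


a = [0, 0, 1, 1, 2, 2]
b = [2, 2, 0, 0, 1, 1]  # same grouping as `a`, different label ids
c = [0, 1, 0, 1, 0, 1]  # a genuinely different grouping

same = adjusted_rand_index(a, b)
different = adjusted_rand_index(a, c)
print(same, different)
```

Here `a` and `b` score as a perfect match even though no individual label is equal, while `c` scores much lower.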