# Dask cuML - kNN End-to-end 

This notebook assumes the following prerequisites:
- A CUDA-aware MPI is installed
- `mpi4py` is installed or built against that exact same Open MPI library


Once the prerequisites above are in place:
- Start a Dask scheduler with `dask-scheduler --scheduler-file="cluster.json"`. __NOTE:__ This creates the file `cluster.json` in the current directory, so you don't need to keep track of ports manually when starting workers and clients.
- Start a set of workers with `mpirun --mca pml ob1 --mca osc ^ucx -np 2 dask-mpi --no-nanny --no-scheduler --nthreads 1 --memory-limit 3000000000 --scheduler-file cluster.json`
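
Because workers and clients find the scheduler through `cluster.json`, it can help to confirm that the file exists before launching them. The following is a minimal sketch using only the standard library. The helper name `wait_for_scheduler_file` is hypothetical, and the sketch assumes the scheduler file is JSON with an `address` field; check your own file if in doubt.

```python
import json
import os
import time


def wait_for_scheduler_file(path="cluster.json", timeout=30.0, poll=0.5):
    """Block until `path` exists and holds valid JSON, then return its "address".

    Hypothetical helper: assumes the scheduler file contains an "address" key.
    """
    deadline = time.monotonic() + timeout
    while True:
        if os.path.exists(path):
            try:
                with open(path) as fh:
                    return json.load(fh)["address"]
            except (ValueError, KeyError):
                # The file may still be mid-write; retry until the deadline.
                pass
        if time.monotonic() >= deadline:
            raise TimeoutError(f"no usable scheduler file at {path!r}")
        time.sleep(poll)
```

Calling `wait_for_scheduler_file()` before starting the workers, or before creating the `Client`, fails early with a `TimeoutError` instead of leaving a process waiting on a scheduler that was never started.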

__NOTE:__ Currently, `dask-mpi` does not provide a way to specify custom environment variables when starting the workers, so `CUDA_VISIBLE_DEVICES` has to be set after the workers are up. It must be set before any CUDA context is created (i.e., before any CUDA-based library is imported on the worker).
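
The ordering constraint in the note can be enforced with a small guard. This is a conservative sketch, not part of `dask-mpi` or cuML: the helper `set_visible_devices` and the module list `CUDA_MODULES` are hypothetical, and treating "already imported" as "context may already exist" is a deliberately cautious assumption.

```python
import os
import sys

# Assumed list of CUDA-touching packages; adjust to whatever your workers import.
CUDA_MODULES = ("numba.cuda", "cupy", "cudf", "cuml")


def set_visible_devices(visible, loaded_modules=None):
    """Set CUDA_VISIBLE_DEVICES, refusing if a CUDA library is already imported."""
    # `loaded_modules` exists only so the check can be exercised without a GPU.
    loaded = sys.modules if loaded_modules is None else loaded_modules
    already = [name for name in CUDA_MODULES if name in loaded]
    if already:
        raise RuntimeError(
            "CUDA_VISIBLE_DEVICES must be set before importing: " + ", ".join(already)
        )
    os.environ["CUDA_VISIBLE_DEVICES"] = visible
    return visible
```

Raising instead of silently overwriting the variable makes a too-late assignment visible, since a change made after the context exists would otherwise have no effect.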

In [17]:
from dask.distributed import Client

In [18]:
# Run this if you are using an MPI-based cluster
client = Client(scheduler_file="cluster.json")

In [19]:
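# GPU device IDs to hand out, one per worker
devs = [0, 1]
# Addresses of all workers currently connected to the scheduler
workers = list(client.has_what().keys())
# Use at most one worker per available device
worker_devs = workers[0:min(len(devs), len(workers))]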

In [20]:
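def set_visible(i, n):
    """Make device i the first visible CUDA device on the calling worker."""
    import os
    all_devices = list(range(n))
    # Rotate the device list so that device i comes first, e.g. i=1, n=2 -> "1,0"
    vd = ",".join(map(str, all_devices[i:] + all_devices[:i]))
    print(str(vd))
    print("Selecting Device : "  + str(i))
    os.environ["CUDA_VISIBLE_DEVICES"] = vd
    
    # Import numba only after the environment variable is set
    import numba.cuda
    print("Cur device: " + str(numba.cuda.get_current_device().id))
    
# Run set_visible once on each selected worker, pinned there via `workers=[worker]`
dev_assigned = [client.submit(set_visible, dev, len(devs), workers = [worker]) for dev, worker in zip(devs, worker_devs)]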
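
The device rotation used in `set_visible` can be checked without a cluster or a GPU. The sketch below reproduces only the string computation and the worker-to-device pairing; the function names and the worker addresses are placeholders introduced for illustration, and it assumes, as the cell above does, that `devs` is `0..n-1`.

```python
def rotated_visible_devices(i, n):
    """Return a CUDA_VISIBLE_DEVICES string listing n devices with device i first."""
    devices = list(range(n))
    return ",".join(map(str, devices[i:] + devices[:i]))


def plan_assignment(workers, devs):
    """Pair workers with devices one-to-one; extras on either side are ignored."""
    return {
        worker: rotated_visible_devices(dev, len(devs))
        for dev, worker in zip(devs, workers)
    }


# Placeholder worker addresses, for illustration only.
plan = plan_assignment(["tcp://worker-a", "tcp://worker-b"], [0, 1])
print(plan)  # {'tcp://worker-a': '0,1', 'tcp://worker-b': '1,0'}
```

Each worker still sees every device, but a different one appears first in its list, so device 0 inside each worker process refers to a distinct physical GPU.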

In [21]:
import dask_cudf
import cudf
import numpy as np

from dask_cuml import knn as cumlKNN

In [22]:
X = cudf.DataFrame([('a', np.array([0, 1, 2, 3, 4], np.float32)),
                    ('b', np.array([5, 6, 7, 7, 8], np.float32))])

X_df = dask_cudf.from_cudf(X, chunksize=1).persist()

In [23]:
lr = cumlKNN.KNN()
lr.fit(X_df)

KEY TO PART DICT: {"('from_cudf-4f13a4fb9e3f4624827e0b37059daf15', 0)": <Future: status: finished, type: DataFrame, key: ('from_cudf-4f13a4fb9e3f4624827e0b37059daf15', 0)>, "('from_cudf-4f13a4fb9e3f4624827e0b37059daf15', 1)": <Future: status: finished, type: DataFrame, key: ('from_cudf-4f13a4fb9e3f4624827e0b37059daf15', 1)>, "('from_cudf-4f13a4fb9e3f4624827e0b37059daf15', 2)": <Future: status: finished, type: DataFrame, key: ('from_cudf-4f13a4fb9e3f4624827e0b37059daf15', 2)>, "('from_cudf-4f13a4fb9e3f4624827e0b37059daf15', 3)": <Future: status: finished, type: DataFrame, key: ('from_cudf-4f13a4fb9e3f4624827e0b37059daf15', 3)>}
WHO HAS: {"('from_cudf-4f13a4fb9e3f4624827e0b37059daf15', 3)": ('tcp://10.2.166.167:38879',), "('from_cudf-4f13a4fb9e3f4624827e0b37059daf15', 0)": ('tcp://10.2.166.167:36567',), "('from_cudf-4f13a4fb9e3f4624827e0b37059daf15', 2)": ('tcp://10.2.166.167:36567',), "('from_cudf-4f13a4fb9e3f4624827e0b37059daf15', 1)": ('tcp://10.2.166.167:38879',)}
WORKER_MAP: [(('10.

In [24]:
g = lr.kneighbors(X, 1)


In [25]:
worker, f = g

In [26]:
D, I = f.result()

In [27]:
print(str(D))

     0
0  0.0
1  0.0
2  0.0
3  1.0
4  5.0


In [None]:
print(str(I))
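
For a dataset this small, the distributed result can be sanity-checked against a brute-force reference. The sketch below is plain Python so that it runs without a GPU or any RAPIDS library. The function `brute_force_kneighbors` is hypothetical, and the use of squared Euclidean distance is an assumption made for this sketch, not a statement about what `dask_cuml` computes.

```python
def brute_force_kneighbors(index_rows, query_rows, k):
    """Return (distances, indices) of the k nearest index rows for each query row.

    Distances are squared Euclidean (an assumption made for this sketch).
    """
    all_dists, all_idx = [], []
    for q in query_rows:
        # Score every index row against the query, then keep the k closest.
        scored = sorted(
            (sum((a - b) ** 2 for a, b in zip(q, row)), j)
            for j, row in enumerate(index_rows)
        )[:k]
        all_dists.append([d for d, _ in scored])
        all_idx.append([j for _, j in scored])
    return all_dists, all_idx


# Same rows as the notebook's X: columns 'a' and 'b' zipped together.
rows = list(zip([0.0, 1.0, 2.0, 3.0, 4.0], [5.0, 6.0, 7.0, 7.0, 8.0]))
D_ref, I_ref = brute_force_kneighbors(rows, rows, k=1)
print(D_ref)
print(I_ref)
```

When every row is indexed, a self-query with `k=1` matches each row to itself at distance zero. Differences between such a reference and the distributed result, for example nonzero distances in a self-query, would suggest that some partitions did not reach the index.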