
 Dask cuML - Multi-GPU Machine Learning Algorithms

Dask cuML contains parallel machine learning algorithms that can make use of multiple GPUs on a single host. It is able to play nicely with other projects in the Dask ecosystem, as well as other RAPIDS projects, such as Dask cuDF.


As an example, the following Python snippet loads input from a CSV file into a Dask cuDF DataFrame and performs a NearestNeighbors query in parallel on multiple GPUs:

# Create a Dask CUDA cluster w/ one worker per device
from dask_cuda import LocalCUDACluster
from dask.distributed import Client
cluster = LocalCUDACluster()
client = Client(cluster)

# Read CSV file in parallel across workers
import dask_cudf
df = dask_cudf.read_csv("/path/to/csv")

# Fit a NearestNeighbors model and query it
from dask_cuml.neighbors import NearestNeighbors
nn = NearestNeighbors(n_neighbors=10)
nn.fit(df)
distances, indices = nn.kneighbors(df)

Dask CUDA Clusters

Using the LocalCUDACluster()

Clusters of Dask workers can be started in several different ways. One of the simplest methods for non-CUDA Dask clusters is LocalCluster. For a CUDA-aware variant of LocalCluster that works well with Dask cuML, check out LocalCUDACluster from the dask-cuda project.

Note: It's important to make sure the LocalCUDACluster is instantiated in your code before any CUDA contexts are created (e.g., before importing Numba or cuDF). Otherwise, your workers may all be mapped to the same device.
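
For instance, a minimal sketch of a safe startup order, assuming a single node with one worker per GPU, might look like this:

# Start the CUDA cluster first, before anything creates a CUDA context
from dask_cuda import LocalCUDACluster
from dask.distributed import Client

cluster = LocalCUDACluster()   # one worker per visible GPU
client = Client(cluster)      # attach a Dask client to the cluster

# Only now import libraries that create a CUDA context on import
import cudf
import dask_cudf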

Using the dask-worker command

If you will be starting your workers with the dask-worker command, Dask cuML requires that each worker be started with its own unique CUDA_VISIBLE_DEVICES setting.

For example, a user with a workstation containing two devices would want their workers started with the following CUDA_VISIBLE_DEVICES settings (one per worker):

CUDA_VISIBLE_DEVICES=0,1 dask-worker --nprocs 1 --nthreads 1 scheduler_host:8786
CUDA_VISIBLE_DEVICES=1,0 dask-worker --nprocs 1 --nthreads 1 scheduler_host:8786

This enables each worker to map the device memory of its local cuDF DataFrames to a separate device.

Note: If starting Dask workers using dask-worker, --nprocs 1 must be used.
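
Once the workers are running, the client side looks the same as with LocalCUDACluster. A brief sketch, assuming the scheduler address scheduler_host:8786 from the commands above:

# Connect to the externally started scheduler and its CUDA workers
from dask.distributed import Client
client = Client("scheduler_host:8786")

# Dask cuML estimators can now be used exactly as in the snippet above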

Supported Algorithms

  • Nearest Neighbors
  • Linear Regression (see the sketch below)

More ML algorithms are being worked on.
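
NearestNeighbors is demonstrated in the snippet above. For Linear Regression, a rough, hypothetical sketch follows; the dask_cuml.linear_model import path and the fit/predict signatures are assumptions based on the scikit-learn-style API used elsewhere in cuML, so check the docs for the exact interface:

# Assumed import path and scikit-learn-style API (not verified against this release)
from dask_cuml.linear_model import LinearRegression

lr = LinearRegression()
lr.fit(X_df, y_df)               # X_df, y_df: hypothetical dask_cudf DataFrame / Series
predictions = lr.predict(X_df)   # distributed prediction across the workers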


Installation

Dask cuML requires cuML to be installed. Refer to cuML on GitHub for more information.


Dask cuML can be installed using the rapidsai conda channel (if you have CUDA 9.2 installed, change the cudatoolkit=10.0 dependency to cudatoolkit=9.2 instead):

conda install -c nvidia -c rapidsai -c conda-forge -c defaults dask-cuml cudatoolkit=10.0


Dask cuML can also be installed using pip:

pip install dask-cuml

Build/Install from Source

Dask cuML depends on:

  • dask
  • dask_cudf
  • dask_cuda
  • cuml

Dask cuML can be installed with the following command at the root of the repository:

python setup.py install

Tests can be run using pytest:

py.test dask_cuml/test


Find out more details on the RAPIDS site

Open GPU Data Science

The RAPIDS suite of open source software libraries aims to enable execution of end-to-end data science and analytics pipelines entirely on GPUs. It relies on NVIDIA® CUDA® primitives for low-level compute optimization, while exposing that GPU parallelism and high-bandwidth memory speed through user-friendly Python interfaces.
