# cuML.accel Quickstart

`cuml.accel` is cuML's zero-code-change accelerator: once it is loaded, supported scikit-learn estimators run on the GPU through cuML without changes to the existing code.

This notebook compares three nearest-neighbor searches on the same data: scikit-learn on the CPU, cuML brute force on the GPU, and cuML with an IVF-Flat index on the GPU. The cells below import cuML's `NearestNeighbors` directly rather than loading `cuml.accel`.
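
For reference, turning the accelerator on from Python could look like the sketch below. It assumes the `cuml.accel.install()` entry point and that it runs before scikit-learn is imported; the helper name `enable_cuml_accel` is made up for this example. In a notebook, the `%load_ext cuml.accel` magic plays the same role, and a script can be launched with `python -m cuml.accel script.py`.

```python
def enable_cuml_accel() -> bool:
    """Turn on cuml.accel if cuML is installed; return whether it was enabled."""
    try:
        import cuml.accel  # only importable in a RAPIDS environment
    except ImportError:
        return False
    # Assumption: install() must run before scikit-learn is imported
    # so that its estimators can be redirected to the GPU.
    cuml.accel.install()
    return True


accel_enabled = enable_cuml_accel()
print(f"cuml.accel enabled: {accel_enabled}")
```

On a machine without cuML the helper simply reports `False`, so the rest of a script can fall back to plain scikit-learn.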

Useful links:
- cuML documentation: https://docs.rapids.ai/api/cuml/stable

## Installation

To run this notebook, you need a machine with an NVIDIA GPU and a proper RAPIDS environment. The recommended way to install RAPIDS is with Conda.

Please visit the [RAPIDS Release Selector](https://rapids.ai/start.html#get-rapids) to get the specific `conda install` command for your system.

An example command might look like this (do not run, use the official selector):
`conda create -n rapids-25.08 -c rapidsai -c conda-forge -c nvidia cuml cudf cupy python=3.10 cuda-version=12.0`
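
Before running the cells below, it may help to confirm that the environment can locate the packages this notebook imports. The following is a small sketch using only the standard library; the helper name `find_missing` and the `REQUIRED_PACKAGES` tuple are illustrative, and the check only confirms that the packages are installed, not that a GPU is usable.

```python
import importlib.util

# Top-level modules imported by this notebook.
REQUIRED_PACKAGES = ("cuml", "cupy", "cudf", "sklearn")


def find_missing(packages):
    """Return the names in `packages` that cannot be located for import."""
    return [name for name in packages if importlib.util.find_spec(name) is None]


missing = find_missing(REQUIRED_PACKAGES)
if missing:
    print("Missing packages:", ", ".join(missing))
else:
    print("All required packages were found.")
```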

```python
import time

import cudf
import cupy as cp
import numpy as np

# Import the models we will compare
from sklearn.neighbors import NearestNeighbors as skNearestNeighbors
from cuml.neighbors import NearestNeighbors as cumlNearestNeighbors

from cuml.datasets import make_blobs
```

## Generate Data

We will generate a random dataset of blobs to test the algorithm's performance. The data is created directly on the GPU using `cuml.datasets.make_blobs` and a copy is transferred to the CPU for the scikit-learn comparison.

```python
# Define parameters for the experiment
n_samples = 50000
n_features = 50
n_neighbors = 10

# Generate the data directly on the GPU using cuML's make_blobs
X_gpu, _ = make_blobs(n_samples=n_samples,
                      n_features=n_features,
                      random_state=42)

# Create a copy on the CPU (as a NumPy array) for the scikit-learn comparison
X_cpu = X_gpu.get()

print("Data generated successfully:")
print(f"Shape of the data: {X_gpu.shape}")
print(f"X_gpu type: {type(X_gpu)}")
print(f"X_cpu type: {type(X_cpu)}")
```

```text
Data generated successfully:
Shape of the data: (50000, 50)
X_gpu type: <class 'cupy.ndarray'>
X_cpu type: <class 'numpy.ndarray'>
```


## Performance Comparison

### 1. scikit-learn (CPU Baseline)

First, we'll run the standard scikit-learn `NearestNeighbors` implementation on the CPU using the NumPy data. This will serve as our performance baseline.

```python
# Instantiate the scikit-learn model
model_sk = skNearestNeighbors(n_neighbors=n_neighbors)

# Start the timer (perf_counter is a monotonic, high-resolution clock)
start_time = time.perf_counter()

# Fit the model and find the neighbors
# Note: We use the CPU data (X_cpu)
model_sk.fit(X_cpu)
distances, indices = model_sk.kneighbors(X_cpu)

# Stop the timer
end_time = time.perf_counter()

# Calculate and print the duration
time_sk = end_time - start_time
print(f"Scikit-learn (CPU) time: {time_sk:.4f} seconds")
```

```text
Scikit-learn (CPU) time: 1.0724 seconds
```


### 2. Standard cuML (GPU)

Now we'll perform the same task using cuML on the GPU. We specify `algorithm='brute'` to keep the comparison like-for-like: with 50 features, scikit-learn's default `algorithm='auto'` is expected to choose brute-force search as well.

```python
# Instantiate the standard cuML model
model_cuml = cumlNearestNeighbors(n_neighbors=n_neighbors, algorithm='brute')

# Start the timer
start_time = time.perf_counter()

# Fit the model and find the neighbors
# Note: We use the GPU data (X_gpu)
model_cuml.fit(X_gpu)
distances_cuml, indices_cuml = model_cuml.kneighbors(X_gpu)

# IMPORTANT: GPU work can run asynchronously, so wait for it to finish
# before stopping the timer
cp.cuda.runtime.deviceSynchronize()

# Stop the timer
end_time = time.perf_counter()

# Calculate and print the duration
time_cuml = end_time - start_time
print(f"Standard cuML (GPU, Brute) time: {time_cuml:.4f} seconds")
print(f"Speedup vs CPU: {time_sk / time_cuml:.2f}x")
```

```text
Standard cuML (GPU, Brute) time: 0.2801 seconds
Speedup vs CPU: 3.83x
```
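
The timings in this notebook come from single runs, which also absorb one-time setup costs. A reusable helper that adds a warm-up run, repeats the measurement, and synchronizes before stopping the clock might look like the sketch below. The name `time_call` and its parameters are hypothetical, not part of cuML or CuPy.

```python
import statistics
import time


def time_call(fn, repeats=5, warmup=1, sync=None):
    """Return the median wall-clock seconds of `fn()` over `repeats` runs.

    `sync` is an optional zero-argument callable that blocks until queued
    GPU work has finished, e.g. `cp.cuda.runtime.deviceSynchronize`.
    """
    for _ in range(warmup):
        fn()  # untimed run to absorb one-time setup costs
        if sync is not None:
            sync()
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        if sync is not None:
            sync()  # wait for the GPU before stopping the clock
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)


# CPU-only demonstration with a stand-in workload.
elapsed = time_call(lambda: sum(i * i for i in range(100_000)))
print(f"median time: {elapsed:.6f} seconds")
```

For the GPU cells, the same helper could be called with `sync=cp.cuda.runtime.deviceSynchronize` and a function that runs `fit` followed by `kneighbors`.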


### 3. Accelerated cuML (GPU with IVF-Flat)

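Finally, we'll replace the exhaustive search with an approximate one. The `ivfflat` algorithm (IVF-Flat, an inverted-file index) is not brute-force. It first clusters the data into partitions. When searching for neighbors, it compares the query against the partition centers and scans only the closest few partitions instead of the entire dataset. This reduces the work per query as `n_samples` grows, at the cost of occasionally missing a true neighbor. Note that the timing below includes building the index during `fit`.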
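
The partition-then-probe idea can be illustrated with a toy, pure-Python sketch. This is not cuML's implementation: the function names (`build_ivf`, `ivf_search`, `brute_search`) and parameters (`n_lists`, `n_probes`) are chosen for this illustration, and the k-means step is deliberately crude.

```python
import math
import random


def nearest_center(point, centers):
    """Index of the center closest to `point`."""
    return min(range(len(centers)), key=lambda c: math.dist(point, centers[c]))


def build_ivf(data, n_lists, n_iters=5, seed=0):
    """Cluster `data` with a few k-means steps and build inverted lists."""
    rng = random.Random(seed)
    centers = [list(p) for p in rng.sample(data, n_lists)]
    for _ in range(n_iters):
        groups = [[] for _ in range(n_lists)]
        for p in data:
            groups[nearest_center(p, centers)].append(p)
        for c, members in enumerate(groups):
            if members:  # keep the old center if a cluster went empty
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    # Inverted lists: for each center, the ids of the points assigned to it.
    lists = [[] for _ in range(n_lists)]
    for i, p in enumerate(data):
        lists[nearest_center(p, centers)].append(i)
    return centers, lists


def ivf_search(query, data, centers, lists, k, n_probes):
    """Scan only the `n_probes` partitions whose centers are closest."""
    order = sorted(range(len(centers)), key=lambda c: math.dist(query, centers[c]))
    candidates = [i for c in order[:n_probes] for i in lists[c]]
    candidates.sort(key=lambda i: (math.dist(query, data[i]), i))
    return candidates[:k]


def brute_search(query, data, k):
    """Exhaustive search, used here as the reference answer."""
    ids = sorted(range(len(data)), key=lambda i: (math.dist(query, data[i]), i))
    return ids[:k]


rng = random.Random(42)
data = [[rng.gauss(0, 1) for _ in range(4)] for _ in range(500)]
centers, lists = build_ivf(data, n_lists=10)

query = data[0]
exact = brute_search(query, data, k=5)
approx = ivf_search(query, data, centers, lists, k=5, n_probes=2)
print("exact :", exact)
print("approx:", approx)
print("overlap:", len(set(exact) & set(approx)), "of", len(exact))
```

Probing every partition (`n_probes` equal to `n_lists`) reproduces the brute-force answer; probing fewer trades some accuracy for less work per query.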

```python
# Instantiate the cuML model using the IVF-Flat index
model_accel = cumlNearestNeighbors(n_neighbors=n_neighbors, algorithm='ivfflat')

# Start the timer
start_time = time.perf_counter()

# Fit the model (builds the index) and find the neighbors
model_accel.fit(X_gpu)
distances_accel, indices_accel = model_accel.kneighbors(X_gpu)

# Synchronize the GPU
cp.cuda.runtime.deviceSynchronize()

# Stop the timer
end_time = time.perf_counter()

# Calculate and print the duration
time_accel = end_time - start_time
print(f"Accelerated cuML (GPU, IVF-Flat) time: {time_accel:.4f} seconds")
print(f"Speedup vs CPU: {time_sk / time_accel:.2f}x")
print(f"Speedup vs Standard GPU: {time_cuml / time_accel:.2f}x")
```

```text
Accelerated cuML (GPU, IVF-Flat) time: 0.2235 seconds
Speedup vs CPU: 4.80x
Speedup vs Standard GPU: 1.25x
```
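
Because an IVF-Flat search scans only some partitions, it can miss true neighbors, so a speed comparison is usually paired with an accuracy check. One common measure is recall: the fraction of the exact neighbors that the approximate search also returned. The sketch below uses a hypothetical helper, `recall_at_k`, and made-up neighbor ids rather than results from this notebook.

```python
def recall_at_k(exact_ids, approx_ids):
    """Mean fraction of each row's exact neighbors found by the approximate search."""
    if len(exact_ids) != len(approx_ids):
        raise ValueError("both results must cover the same queries")
    hits = 0
    total = 0
    for exact_row, approx_row in zip(exact_ids, approx_ids):
        hits += len(set(exact_row) & set(approx_row))
        total += len(exact_row)
    return hits / total


# Hypothetical neighbor ids for three queries with k=3.
exact_ids = [[0, 4, 7], [1, 2, 9], [2, 1, 5]]
approx_ids = [[0, 7, 4], [1, 9, 3], [2, 1, 5]]
print(f"recall@3: {recall_at_k(exact_ids, approx_ids):.3f}")
```

With the arrays from this notebook, something like `cp.asnumpy(indices_cuml).tolist()` and `cp.asnumpy(indices_accel).tolist()` could be passed as the two arguments.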




## Summary

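Let's summarize the results from our three tests. In this run, cuML brute force on the GPU was 3.83x faster than scikit-learn on the CPU, and the index-based `ivfflat` algorithm was 4.80x faster than the CPU baseline (1.25x faster than GPU brute force). Each figure comes from a single timed run that includes `fit`, so expect some variation between runs and across hardware.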

```python
print("--- Performance Summary ---")
print(f"1. scikit-learn (CPU):         {time_sk:.4f} seconds")
print(f"2. cuml (GPU, Brute):          {time_cuml:.4f} seconds (Speedup: {time_sk / time_cuml:.2f}x)")
print(f"3. cuml (GPU, IVF-Flat):       {time_accel:.4f} seconds (Speedup: {time_sk / time_accel:.2f}x)")
```

```text
--- Performance Summary ---
1. scikit-learn (CPU):         1.0724 seconds
2. cuml (GPU, Brute):          0.2801 seconds (Speedup: 3.83x)
3. cuml (GPU, IVF-Flat):       0.2235 seconds (Speedup: 4.80x)
```
