# cuML Quickstart: CPU vs. GPU Acceleration

This notebook provides a direct performance comparison for the K-Nearest Neighbors algorithm, showcasing the speed advantage of using `cuml` on a GPU over `scikit-learn` on a CPU.

We will compare three implementations:
1.  **Scikit-learn**: The standard CPU baseline.
2.  **cuML (Brute Force)**: A direct, apples-to-apples comparison on the GPU.
3.  **cuML (IVF-Flat)**: A more advanced, index-based approximate algorithm for even greater acceleration on large datasets.

**Useful Links:**
* cuML Documentation: https://docs.rapids.ai/api/cuml/stable

## 1. Setup

### Environment

To run this notebook, you need a machine with an NVIDIA GPU and a RAPIDS environment. The recommended way to install RAPIDS is with Conda, as it manages both the Python packages and the underlying CUDA libraries.

Please visit the **[RAPIDS Release Selector](https://rapids.ai/start.html#get-rapids)** to generate the specific `conda` command for your system.

An example command might look like this (do not run it as-is; use the official selector to generate yours):
`conda create -n rapids-25.08 -c rapidsai -c conda-forge -c nvidia cuml cudf cupy python=3.10 cuda-version=12.0`
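
Before running the import cell, it can help to confirm that the packages this notebook uses are visible to the active environment. The sketch below uses only the standard library. `missing_packages` is a hypothetical helper name, not part of RAPIDS, and it only checks that each package can be located, not that a working GPU is present.

```python
import importlib.util

def missing_packages(names):
    """Return the subset of top-level package names that cannot be found."""
    return [name for name in names if importlib.util.find_spec(name) is None]

# Packages imported by this notebook (sklearn is scikit-learn's import name).
required = ["cudf", "cupy", "cuml", "numpy", "pandas", "sklearn"]
missing = missing_packages(required)
print("All packages found." if not missing else f"Missing packages: {missing}")
```

If anything is reported missing, the active Conda environment is probably not the RAPIDS one.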

In [None]:
import cudf
import cupy as cp
import numpy as np
import pandas as pd
import time

# Import the models we will compare
from sklearn.neighbors import NearestNeighbors as skNearestNeighbors
from cuml.neighbors import NearestNeighbors as cumlNearestNeighbors
from cuml.datasets import make_blobs

### Data Generation

We'll generate a synthetic dataset to use for our benchmarks. The data will be created directly on the GPU using `cuml.datasets.make_blobs`, and a copy will be transferred to the CPU for the scikit-learn comparison.

In [None]:
# Define parameters for the experiment
n_samples = 50000
n_features = 50
n_neighbors = 10

# Generate the data directly on the GPU
X_gpu, _ = make_blobs(n_samples=n_samples, 
                      n_features=n_features, 
                      random_state=42)

# Transfer a copy to the CPU (as a NumPy array) for scikit-learn
X_cpu = X_gpu.get()

print("Data generated successfully:")
print(f"Shape of the data: {X_gpu.shape}")
print(f"X_gpu type: {type(X_gpu)}")
print(f"X_cpu type: {type(X_cpu)}")
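
To put the dataset size in context, the back-of-the-envelope calculation below estimates its memory footprint, and that of a full pairwise distance matrix if a brute-force search were to build it in one piece. It assumes 4-byte `float32` values, which is believed to be the default `dtype` of `cuml.datasets.make_blobs`; check `X_gpu.dtype` to confirm on your installation.

```python
# Parameters copied from the cell above.
n_samples = 50000
n_features = 50
bytes_per_value = 4  # ASSUMPTION: float32 data

data_bytes = n_samples * n_features * bytes_per_value
# A dense n_samples x n_samples distance matrix, if it were built in one piece.
pairwise_bytes = n_samples * n_samples * bytes_per_value

print(f"Dataset:              {data_bytes / 1e6:.1f} MB")
print(f"Full pairwise matrix: {pairwise_bytes / 1e9:.1f} GB")
```

The dataset itself is small, but the full distance matrix is not, which is why brute-force implementations generally process queries in batches rather than all at once.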

## 2. Performance Comparison

Now, let's time each of the three implementations.
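
Each cell in this section times a single run, which can be noisy and, on the GPU, may include one-time initialization costs. A minimal sketch of a more robust approach is shown here: run once untimed as a warm-up, then report the best of several timed runs. `benchmark` and its parameters are hypothetical names introduced for illustration, and the toy workload stands in for a real model call.

```python
import time

def benchmark(fn, sync=None, warmup=1, repeats=3):
    """Return the best wall-clock time in seconds over `repeats` calls of fn().

    sync: optional zero-argument callable run before the clock is read,
          e.g. a GPU synchronization function. Use None for CPU-only code.
    """
    for _ in range(warmup):
        fn()  # untimed run absorbs one-time setup costs
        if sync is not None:
            sync()
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        if sync is not None:
            sync()  # wait for outstanding GPU work before stopping the clock
        best = min(best, time.perf_counter() - start)
    return best

# Toy CPU workload so the sketch runs anywhere.
elapsed = benchmark(lambda: sum(i * i for i in range(100_000)))
print(f"Best of 3: {elapsed:.4f} seconds")
```

For the GPU cells, a call might look like `benchmark(lambda: model_cuml.kneighbors(X_gpu), sync=cp.cuda.runtime.deviceSynchronize)`.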

### A. Scikit-learn (CPU Baseline)

First, we'll run the standard `NearestNeighbors` on the CPU. This will serve as our performance baseline.

In [None]:
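# Instantiate the scikit-learn model. algorithm='brute' pins the search method
# so it matches the cuML brute-force run; the default ('auto') may pick a tree.
model_sk = skNearestNeighbors(n_neighbors=n_neighbors, algorithm='brute')

# Start the timer (perf_counter is a monotonic, high-resolution clock)
start_time = time.perf_counter()

# Fit the model and find the neighbors using CPU data
model_sk.fit(X_cpu)
distances, indices = model_sk.kneighbors(X_cpu)

# Stop the timer
end_time = time.perf_counter()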

# Calculate and print the duration
time_sk = end_time - start_time
print(f"Scikit-learn (CPU) time: {time_sk:.4f} seconds")

### B. Standard cuML (GPU Brute-Force)

Next, we'll perform the same task on the GPU. We'll specify `algorithm='brute'` so that cuML performs an exhaustive search, which keeps the comparison with the CPU baseline as direct as possible.

In [None]:
# Instantiate the standard cuML model
model_cuml = cumlNearestNeighbors(n_neighbors=n_neighbors, algorithm='brute')

# Start the timer (perf_counter is a monotonic, high-resolution clock)
start_time = time.perf_counter()

# Fit the model and find the neighbors using GPU data
model_cuml.fit(X_gpu)
distances_cuml, indices_cuml = model_cuml.kneighbors(X_gpu)

# IMPORTANT: GPU work can run asynchronously, so wait for it to finish
# before reading the clock
cp.cuda.runtime.deviceSynchronize()

# Stop the timer
end_time = time.perf_counter()

# Calculate and print the duration
time_cuml = end_time - start_time
print(f"Standard cuML (GPU, Brute) time: {time_cuml:.4f} seconds")
print(f"Speedup vs CPU: {time_sk / time_cuml:.2f}x")
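
For intuition about what a brute-force search computes, here is a small CPU-only NumPy sketch: it measures the distance from every query to every data point and keeps the `k` smallest. It illustrates the technique, not cuML's actual GPU implementation, and `brute_force_knn` is a hypothetical helper name.

```python
import numpy as np

def brute_force_knn(data, queries, k):
    """Exact k-nearest neighbors by Euclidean distance (NumPy sketch)."""
    # Squared distances via ||q - x||^2 = ||q||^2 - 2 q.x + ||x||^2
    sq = (
        (queries ** 2).sum(axis=1)[:, None]
        - 2.0 * queries @ data.T
        + (data ** 2).sum(axis=1)[None, :]
    )
    np.maximum(sq, 0.0, out=sq)  # clamp tiny negatives caused by rounding
    # Select the k smallest per row, then sort only those k.
    idx = np.argpartition(sq, k - 1, axis=1)[:, :k]
    rows = np.arange(queries.shape[0])[:, None]
    order = np.argsort(sq[rows, idx], axis=1)
    idx = idx[rows, order]
    return np.sqrt(sq[rows, idx]), idx

rng = np.random.default_rng(0)
points = rng.standard_normal((200, 8))
dist, ind = brute_force_knn(points, points, k=3)
print(ind[:3])
print(dist[:3].round(3))
```

Because the queries are the data points themselves, the first neighbor of each row is the point itself at a distance of approximately zero. The same is true of the `kneighbors(X_gpu)` calls in this notebook.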

### C. Accelerated cuML (GPU with IVF-Flat)

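Finally, we'll test the more advanced `ivfflat` (IVF-Flat) algorithm. Instead of comparing each query against every point, it partitions the data into clusters when the index is built and then searches only the clusters closest to each query. This makes it an approximate method: it can miss some true neighbors in exchange for faster queries on large datasets. Building the index adds an upfront cost, which the timing below includes because `fit` is inside the timed region.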

In [None]:
# Instantiate the accelerated cuML model
model_accel = cumlNearestNeighbors(n_neighbors=n_neighbors, algorithm='ivfflat')

# Start the timer (perf_counter is a monotonic, high-resolution clock)
start_time = time.perf_counter()

# Fit the model (which includes building the index) and find neighbors
model_accel.fit(X_gpu)
distances_accel, indices_accel = model_accel.kneighbors(X_gpu)

# Synchronize the GPU so all work has finished before reading the clock
cp.cuda.runtime.deviceSynchronize()

# Stop the timer
end_time = time.perf_counter()

# Calculate and print the duration
time_accel = end_time - start_time
print(f"Accelerated cuML (GPU, IVF-Flat) time: {time_accel:.4f} seconds")
print(f"Speedup vs CPU: {time_sk / time_accel:.2f}x")
print(f"Speedup vs Standard GPU: {time_cuml / time_accel:.2f}x")
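
The partition-then-probe idea behind an IVF-Flat index can be sketched on the CPU with NumPy. This is a teaching sketch under simplifying assumptions (a few plain Lloyd iterations for clustering, a Python loop over queries), not cuML's implementation. The names `build_ivf_flat`, `search_ivf_flat`, `n_lists`, and `n_probes` are hypothetical.

```python
import numpy as np

def _sq_dists(a, b):
    """Squared Euclidean distances between every row of a and every row of b."""
    return ((a[:, None, :] - b[None, :, :]) ** 2).sum(axis=2)

def build_ivf_flat(data, n_lists, n_iters=10, seed=0):
    """Cluster the data and return (centroids, inverted lists of row indices)."""
    rng = np.random.default_rng(seed)
    centroids = data[rng.choice(len(data), n_lists, replace=False)].copy()
    for _ in range(n_iters):
        assign = _sq_dists(data, centroids).argmin(axis=1)
        for c in range(n_lists):
            members = data[assign == c]
            if len(members):  # keep the old centroid if a list is empty
                centroids[c] = members.mean(axis=0)
    assign = _sq_dists(data, centroids).argmin(axis=1)
    lists = [np.flatnonzero(assign == c) for c in range(n_lists)]
    return centroids, lists

def search_ivf_flat(data, centroids, lists, queries, k, n_probes):
    """For each query, scan only the n_probes lists with the closest centroids."""
    probe = np.argsort(_sq_dists(queries, centroids), axis=1)[:, :n_probes]
    out = np.full((len(queries), k), -1, dtype=np.int64)  # -1 marks "not found"
    for i, q in enumerate(queries):
        cand = np.concatenate([lists[c] for c in probe[i]])
        d = ((data[cand] - q) ** 2).sum(axis=1)
        top = np.argsort(d)[:k]
        out[i, :len(top)] = cand[top]
    return out

rng = np.random.default_rng(0)
points = rng.standard_normal((2000, 8))
centroids, lists = build_ivf_flat(points, n_lists=16)
approx = search_ivf_flat(points, centroids, lists, points[:100], k=5, n_probes=2)
full = search_ivf_flat(points, centroids, lists, points[:100], k=5, n_probes=16)
shared = (approx[:, :, None] == full[:, None, :]).any(axis=2).mean()
print(f"Fraction of neighbors shared with the full scan: {shared:.3f}")
```

Probing every list (`n_probes == n_lists`) reduces to an exhaustive search. Probing fewer lists scans less data per query but can miss neighbors that fall in unprobed lists, which is the speed versus accuracy trade-off of this family of indexes.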

## 3. Summary

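On a typical GPU system, moving from the CPU to the GPU should provide a significant speedup. A specialized, index-based algorithm like `ivfflat` can boost query performance further, particularly on datasets larger than the one used here, at the cost of returning approximate results. Each timing covers a single run that includes `fit`, so the exact numbers will vary with hardware and from run to run. The table below collects the measured times.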

In [None]:
# Create a summary DataFrame for a clean display
summary_data = {
    "Implementation": ["Scikit-learn (CPU)", "cuML (GPU, Brute)", "cuML (GPU, IVF-Flat)"],
    "Time (s)": [time_sk, time_cuml, time_accel],
    "Speedup vs CPU": [1.0, time_sk / time_cuml, time_sk / time_accel]
}
summary_df = pd.DataFrame(summary_data)

# Format the floats for better readability
summary_df['Time (s)'] = summary_df['Time (s)'].map('{:.4f}'.format)
summary_df['Speedup vs CPU'] = summary_df['Speedup vs CPU'].map('{:.2f}x'.format)

display(summary_df)
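
Speed is only half the picture for IVF-Flat: because it is an approximate method, it is worth checking how many of the exact neighbors it actually recovers. A minimal sketch of a recall measurement follows, using small hand-written index arrays so it runs without a GPU. `recall_at_k` is a hypothetical helper name.

```python
import numpy as np

def recall_at_k(approx_indices, exact_indices):
    """Mean fraction of exact neighbors that also appear in the approximate result."""
    approx_indices = np.asarray(approx_indices)
    exact_indices = np.asarray(exact_indices)
    # hits[i, j] is True when the j-th exact neighbor of row i was also
    # returned (in any position) by the approximate search for row i.
    hits = (exact_indices[:, :, None] == approx_indices[:, None, :]).any(axis=2)
    return float(hits.mean())

exact = np.array([[0, 1, 2], [3, 4, 5]])
approx = np.array([[0, 2, 9], [5, 4, 3]])  # row 0 misses neighbor 1
print(f"Recall: {recall_at_k(approx, exact):.3f}")
```

In this notebook, the inputs would be the index arrays returned by the brute-force and IVF-Flat cells, copied to the host first, for example `recall_at_k(cp.asnumpy(indices_accel), cp.asnumpy(indices_cuml))`. This assumes both calls returned CuPy arrays, as they should when the input is a CuPy array.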