# A Developer's Guide to Profiling and Best Practices in cuML

This notebook is a practical guide for developers who want to optimize GPU-accelerated machine learning workflows with cuML and RAPIDS. It follows a typical optimization path: initial benchmarking, deeper profiling, troubleshooting common bottlenecks, and a closing summary of best practices for writing efficient code.

This guide will cover:
1.  **Benchmarking & Profiling**: Moving from simple timing to understanding *where* time is spent using `cProfile` and NVIDIA's Nsight Systems.
2.  **Troubleshooting Common Bottlenecks**: Tackling the two most frequent issues: Out-of-Memory errors and slow data transfers between the CPU and GPU.
3.  **A Summary of Best Practices**: A final checklist of principles for high-performance code.

In [None]:
import cudf
import cupy as cp
import numpy as np
import pandas as pd
import gc
import time
import cProfile
import pstats

from cuml.neighbors import NearestNeighbors
from cuml.datasets import make_blobs

## Part 1: From Simple Timing to Deep Profiling

You can't optimize what you can't measure. The first step in any optimization journey is to understand your code's performance.

### Step 1: Basic Benchmarking with `time`

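The simplest method is to time your code's execution. It gives you a high-level baseline but doesn't tell you *why* the code is slow. Because GPU work is launched asynchronously, synchronize the device before stopping the timer; otherwise you may measure only the time needed to queue the work.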

In [None]:
X_gpu, _ = make_blobs(n_samples=100_000, n_features=50, random_state=42)
model = NearestNeighbors(n_neighbors=10, algorithm='ivfflat')

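# perf_counter is a monotonic, high-resolution clock intended for interval timing
start_time = time.perf_counter()

model.fit(X_gpu)
distances, indices = model.kneighbors(X_gpu)
# Wait for all queued GPU work to finish before stopping the clock
cp.cuda.runtime.deviceSynchronize()

end_time = time.perf_counter()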

print(f"Total execution time: {end_time - start_time:.4f} seconds")
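
A single wall-clock measurement can be noisy, and the first call on a GPU often includes one-time setup cost. The sketch below shows one way to take repeated measurements after a warm-up run. The `benchmark` helper and its `sync`, `warmup`, and `repeats` parameters are hypothetical names introduced here for illustration, and the workload is a CPU-only stand-in so the example runs anywhere.

```python
import statistics
import time


def benchmark(fn, sync=None, warmup=1, repeats=5):
    """Time fn() several times and return the per-run durations in seconds.

    `sync` is an optional zero-argument callable that blocks until pending
    asynchronous work is done. For GPU code this could be something like
    cp.cuda.runtime.deviceSynchronize (an assumption, not exercised here).
    """
    def run_once():
        fn()
        if sync is not None:
            sync()  # wait for queued work before the clock is read

    for _ in range(warmup):
        run_once()  # absorb one-time setup cost; not recorded

    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        run_once()
        durations.append(time.perf_counter() - start)
    return durations


# CPU-only stand-in workload
durations = benchmark(lambda: sum(i * i for i in range(100_000)))
print(f"median: {statistics.median(durations):.6f} s, min: {min(durations):.6f} s")
```

Reporting the median rather than one run makes the baseline less sensitive to outliers such as a slow first iteration.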

### Step 2: Finding Python Bottlenecks with `cProfile`

To understand which *functions* are taking the most time, we use a profiler. Python's built-in `cProfile` is a good first tool. It records every function call and reports the cumulative time spent in each one, which makes it well suited to finding slow spots in your Python code. Keep in mind that it measures host-side time only: GPU work shows up as time spent in whichever Python call launches it or waits for it.

In [None]:
def knn_task_to_profile():
    X_gpu, _ = make_blobs(n_samples=100_000, n_features=50, random_state=42)
    model = NearestNeighbors(n_neighbors=10, algorithm='ivfflat')
    model.fit(X_gpu)
    distances, indices = model.kneighbors(X_gpu)
    cp.cuda.runtime.deviceSynchronize()

# Run the profiler and save the stats to a file
cProfile.run('knn_task_to_profile()', 'knn_profile_stats')

# Load and print the stats, sorted by cumulative time
p = pstats.Stats('knn_profile_stats')
p.strip_dirs().sort_stats('cumulative').print_stats(15)
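
If you would rather not write a stats file, the same information can be collected in memory around a specific region of code. The following is a minimal sketch using only the standard library. `slow_helper` and `pipeline` are hypothetical CPU-only stand-ins for a real workload.

```python
import cProfile
import io
import pstats


def slow_helper():
    # Stand-in for an expensive step
    return sum(i * i for i in range(200_000))


def pipeline():
    # Stand-in for the code region you want to profile
    return [slow_helper() for _ in range(3)]


profiler = cProfile.Profile()
profiler.enable()      # start collecting
pipeline()
profiler.disable()     # stop collecting

# Send the report to a string buffer instead of stdout or a file
buffer = io.StringIO()
stats = pstats.Stats(profiler, stream=buffer)
stats.strip_dirs().sort_stats("cumulative").print_stats(10)

report = buffer.getvalue()
print(report)
```

Capturing the report as a string makes it easy to log it or to search it for a particular function name.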

### Step 3: Deep GPU Analysis with NVIDIA Nsight Systems (`nsys`)

`cProfile` shows you the Python world. To see what's happening on the GPU itself (CUDA kernel launches, memory copies, and so on), you need a system-level profiler. **NVIDIA Nsight Systems (`nsys`)** is the standard tool for this.

The workflow involves running this command-line tool on a standalone script, which we create below for convenience.

In [None]:
# We'll create a standalone Python script to be profiled by nsys
script_content = """
import cupy as cp
from cuml.neighbors import NearestNeighbors
from cuml.datasets import make_blobs

X_gpu, _ = make_blobs(n_samples=100_000, n_features=50, random_state=42)
model = NearestNeighbors(n_neighbors=10, algorithm='ivfflat')
model.fit(X_gpu)
distances, indices = model.kneighbors(X_gpu)
cp.cuda.runtime.deviceSynchronize()

print("Profiling script finished.")
"""

with open('profile_script.py', 'w') as f:
    f.write(script_content)

print("File 'profile_script.py' created successfully.")
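
A timeline is easier to read when the stages of a script are labeled. One option is to wrap each stage in an NVTX range, which Nsight Systems can display as a named span. The sketch below assumes the optional `nvtx` Python package is installed and falls back to a no-op when it is not. `named_range` and `run_stage` are hypothetical helpers, and the stages are stand-ins for `model.fit` and `model.kneighbors`.

```python
import contextlib

try:
    import nvtx  # optional package; assumed to be installed when profiling

    def named_range(name):
        # nvtx.annotate works as a context manager that marks a named range
        return nvtx.annotate(message=name)
except ImportError:
    def named_range(name):
        # Fallback so the script still runs when nvtx is not installed
        return contextlib.nullcontext()


def run_stage(name, fn):
    """Run fn() inside a named range and return its result."""
    with named_range(name):
        return fn()


# Stand-in stages; in profile_script.py these would wrap the fit and query calls
fitted = run_stage("fit", lambda: sorted([3, 1, 2]))
queried = run_stage("kneighbors", lambda: fitted[:2])
print(fitted, queried)
```

The fallback keeps the script usable outside a profiling session, so the annotations can stay in place.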

#### How to Run `nsys`

To generate a detailed report, run the following command in a **separate terminal** (with the conda environment activated). **Note: This requires Nsight Systems, which ships with the NVIDIA CUDA Toolkit and is also available as a standalone download.**

```bash
nsys profile python profile_script.py
```

## Part 2: Troubleshooting the Most Common Bottlenecks

Profiling often reveals two main culprits for poor performance in GPU data science: memory issues and data transfer overhead.

### Bottleneck 1: The Memory Wall (Out-of-Memory Errors)

GPUs have a fixed amount of VRAM. An "Out of Memory" (OOM) error is the most common issue you'll face.

**Key Tactics for Memory Management:**
1.  **Check Available Memory**: Always be aware of your memory budget.
2.  **Use `float32`**: `float64` uses twice the memory and is rarely necessary for ML.
3.  **Delete and Collect**: Actively delete large, unused objects and call the garbage collector.
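
To make the first two tactics concrete, the size of a dense array can be estimated before allocating it: rows times columns times bytes per element. The sketch below is a rough estimate only. `estimate_dense_gb`, `fits_in_budget`, and the 50% headroom are hypothetical choices made for illustration, and real algorithms need additional workspace beyond their input.

```python
# Bytes per element for the two floating-point types discussed above
ITEMSIZE_BYTES = {"float32": 4, "float64": 8}


def estimate_dense_gb(n_samples, n_features, dtype="float32"):
    """Return the size in GB (1e9 bytes) of a dense n_samples x n_features array."""
    return n_samples * n_features * ITEMSIZE_BYTES[dtype] / 1e9


def fits_in_budget(required_gb, free_gb, headroom=0.5):
    """Hypothetical rule of thumb: use at most `headroom` of the free memory."""
    return required_gb <= free_gb * headroom


size64 = estimate_dense_gb(10_000_000, 10, "float64")
size32 = estimate_dense_gb(10_000_000, 10, "float32")
print(f"float64: {size64:.2f} GB, float32: {size32:.2f} GB")

# Example with an assumed 1 GB of free GPU memory
print("float64 fits:", fits_in_budget(size64, free_gb=1.0))
print("float32 fits:", fits_in_budget(size32, free_gb=1.0))
```

In practice the `free_gb` value would come from a memory query such as `cp.cuda.runtime.memGetInfo()` rather than a hard-coded number.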

In [None]:
# 1. Check current memory
free_mem, total_mem = cp.cuda.runtime.memGetInfo()
print(f"GPU Memory: {free_mem / 1e9:.2f} GB Free / {total_mem / 1e9:.2f} GB Total")

# 2. Attempt to create a large float64 array (~0.8 GB; this might fail on GPUs with little free memory)
try:
    large_arr64 = make_blobs(n_samples=10_000_000, n_features=10, dtype=np.float64)[0]
    print(f"float64 Array created, using {large_arr64.nbytes / 1e9:.2f} GB")
    del large_arr64
    gc.collect()
except Exception as e:
    # Allocation failures surface here; report the error instead of stopping the notebook
    print(f"Could not create float64 array: {e}")


# 3. Use float32 for efficiency and actively manage memory
large_arr32 = make_blobs(n_samples=10_000_000, n_features=10, dtype=np.float32)[0]
print(f"float32 Array created, using {large_arr32.nbytes / 1e9:.2f} GB")

del large_arr32
gc.collect()
print("float32 array deleted and memory collected.")

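# Note: if a memory pool is in use, freed blocks can stay reserved by the pool,
# so the free memory reported here may not rise immediately.
free_mem_after, _ = cp.cuda.runtime.memGetInfo()
print(f"GPU Memory after cleanup: {free_mem_after / 1e9:.2f} GB Free")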

### Bottleneck 2: The PCIe Bridge (CPU-GPU Data Transfers)

Moving data between the CPU's RAM and the GPU's VRAM is slow. Unnecessary transfers will destroy your performance gains.

> **The Golden Rule of RAPIDS:** Stay on the GPU. Only move data to the CPU (`.get()` or `.to_pandas()`) when you are completely finished with your computation.

In [None]:
X_gpu, _ = make_blobs(n_samples=5000, n_features=500, random_state=42)

# Warm-up: run the GPU reduction once so any one-time kernel setup is not timed below
X_gpu.max(axis=0)
cp.cuda.runtime.deviceSynchronize()

# BAD: Transferring data inside a loop
start_time = time.perf_counter()
max_values_bad = []
for i in range(X_gpu.shape[1]):
    col_cpu = X_gpu[:, i].get()  # SLOW: GPU -> CPU transfer in each iteration
    max_values_bad.append(col_cpu.max())
end_time = time.perf_counter()
print(f"BAD PRACTICE (transfer in loop): {end_time - start_time:.4f} seconds")

# GOOD: Compute on GPU, transfer only the final result
start_time = time.perf_counter()
max_values_gpu = X_gpu.max(axis=0)      # FAST: Stays on GPU
max_values_good = max_values_gpu.get()  # FAST: One transfer of the small result array
end_time = time.perf_counter()
print(f"GOOD PRACTICE (all on GPU):    {end_time - start_time:.4f} seconds")

## Part 3: A Checklist for Best Practices

1.  **Measure First**: Use `time` for quick checks and profilers (`cProfile`, `nsys`) to find real bottlenecks before optimizing.
2.  **Stay on the GPU**: Adhere to the Golden Rule. Minimize data transfers between CPU and GPU.
3.  **Be Memory-Conscious**: Prefer `float32` precision. Actively `del` large objects and call `gc.collect()`.
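4.  **Choose the Right Algorithm**: Algorithm choice can matter as much as hardware. For `NearestNeighbors`, `brute` performs an exact search while `ivfflat` builds an approximate index that trades some accuracy for speed. Understand your model's parameters to make informed choices.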
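
Because `ivfflat` is an approximate index, timing alone does not tell the whole story: it also helps to check how many of the exact neighbors it recovers. The sketch below computes a simple recall score from two sets of neighbor indices. `recall_at_k` is a hypothetical helper, and the index lists are made-up values standing in for the output of a `brute` search and an `ivfflat` search, assumed to be already copied to the host as plain lists.

```python
def recall_at_k(exact_indices, approx_indices):
    """Fraction of exact neighbors that the approximate search also returned.

    Both arguments are sequences of rows, one row of neighbor ids per query.
    """
    if len(exact_indices) != len(approx_indices):
        raise ValueError("both results must cover the same number of queries")

    hits = 0
    total = 0
    for exact_row, approx_row in zip(exact_indices, approx_indices):
        hits += len(set(exact_row) & set(approx_row))  # neighbors found by both
        total += len(exact_row)
    return hits / total if total else 0.0


# Made-up neighbor ids for two queries with k = 3
exact = [[0, 4, 7], [1, 2, 9]]
approx = [[0, 7, 5], [1, 2, 9]]

recall = recall_at_k(exact, approx)
print(f"recall@3 = {recall:.2f}")
```

A recall close to 1.0 alongside a shorter runtime suggests the approximate index is a reasonable trade for that dataset.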

## Conclusion

This guide has walked through a developer's journey of optimizing a `cuml` workflow. By understanding how to profile, troubleshoot common errors, and apply best practices, you can build highly performant, GPU-accelerated machine learning pipelines.