Describe the bug
`cuml.metrics.trustworthiness` crashes with `cudaErrorIllegalAddress` when called with exactly n=76927 samples. The crash is deterministic, reproducible with random synthetic data, and independent of data content, `n_neighbors`, and `batch_size`. Both smaller and larger values of n (e.g., 76924, 80000, 148781) work fine.
The crash originates in `cuvs-src/cpp/src/stats/detail/trustworthiness_score.cuh` and poisons the CUDA context, making all subsequent GPU operations fail.
Steps/Code to reproduce bug
```python
import numpy as np
from cuml.metrics import trustworthiness

rng = np.random.RandomState(42)
X_high = rng.randn(76927, 768).astype(np.float32)
X_low = rng.randn(76927, 30).astype(np.float32)

# This crashes:
score = trustworthiness(X_high, X_low, n_neighbors=5, batch_size=512)
```
Changing 76927 to 76924 or 80000 makes the call succeed.
Expected behavior
`trustworthiness` should return a float score for any valid n.
Actual behavior
```
RuntimeError: CUDA error encountered at:
file=/__w/cuml/cuml/python/libcuml/build/py3-none-linux_x86_64/_deps/cuvs-src/cpp/src/stats/detail/trustworthiness_score.cuh
cudaErrorIllegalAddress: an illegal memory access was encountered
```
After this error, the CUDA context is permanently corrupted and all subsequent GPU operations fail.
Diagnostic results
We ran a systematic binary search over n and a batch_size sweep on an A10G GPU:
n sweep (synthetic random float32 data, dim_high=768, dim_low=30, k=5, batch=512):
| n | Result |
|---|--------|
| 50,000 | ✅ Pass |
| 70,000 | ✅ Pass |
| 75,000 | ✅ Pass |
| 76,924 | ✅ Pass |
| 76,927 | ❌ Crash |
| 80,000 | ✅ Pass |
| 100,000 | ✅ Pass |
| 148,781 | ✅ Pass |
batch_size sweep at n=76927 (all crash): 64, 128, 256, 384, 512, 768, 1024
The bug is triggered by the specific value of n, not by crossing a size threshold; `batch_size` has no effect.
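For anyone reproducing the sweep: because a single crash poisons the CUDA context, each n has to run in its own process. A minimal sketch of such a harness is below. The `run_trial` body (dims 768/30, k=5, batch 512) matches the repro above; everything else (function names, the injectable `trial` parameter) is our own scaffolding, not a cuML API.

```python
import subprocess
import sys

def run_trial(n: int) -> bool:
    """Run one trustworthiness call in a fresh subprocess.

    A crashing call corrupts the CUDA context, so every n must be
    tested in an isolated process; the parent only inspects the
    exit code.
    """
    code = (
        "import numpy as np\n"
        "from cuml.metrics import trustworthiness\n"
        "rng = np.random.RandomState(42)\n"
        f"Xh = rng.randn({n}, 768).astype(np.float32)\n"
        f"Xl = rng.randn({n}, 30).astype(np.float32)\n"
        "trustworthiness(Xh, Xl, n_neighbors=5, batch_size=512)\n"
    )
    return subprocess.run([sys.executable, "-c", code]).returncode == 0

def sweep(ns, trial=run_trial):
    """Map each n to True (pass) or False (crash)."""
    return {n: trial(n) for n in ns}
```

The `trial` parameter is injectable so the harness itself can be exercised without a GPU.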
Environment
- cuML: 26.2.0 (`cuml-cu12==26.2.0`, installed via pip)
- cuvs: bundled with `libcuml-cu12==26.2.0`
- CUDA toolkit: 12.8.1
- GPU: NVIDIA A10G (24 GB VRAM)
- Driver: SageMaker `ml.g5.2xlarge` default
- Python: 3.10
- OS: Linux (SageMaker training container, `pytorch-cuml-sagemaker-foundry:latest`)
- NumPy: 1.26.4
- CuPy: 13.6.0
Additional context
This was discovered in a production Optuna hyperparameter tuning pipeline that calls `trustworthiness` in a loop over different UMAP parameter combinations. When the input dataset happens to have exactly 76,927 rows, the first call crashes and poisons the CUDA context, causing all subsequent trials to fail.
The crash appears to originate in the internal brute-force KNN (`cuvs::neighbors::brute_force::search`) or pairwise distance (`cuvs::distance::pairwise_distance`) call within `trustworthiness_score.cuh`. The fact that only specific n values trigger it (while nearby values like 76,924 and 80,000 are fine) suggests an integer arithmetic edge case in kernel grid/block dimension calculations, likely a modular arithmetic issue where `n % tile_size` produces a degenerate value that leads to an out-of-bounds memory access. (This hypothesis was generated with assistance from Claude.)
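To make the modular-arithmetic hypothesis concrete: the actual tile size used inside cuvs is an assumption on our part, but the remainders of the tested n values modulo common CUDA tile/block sizes are suggestive. The crashing n=76,927 sits exactly one row below a multiple of 128 (remainder 127, the largest possible partial tile), while the nearby passing values do not.

```python
# Illustrative only: remainders of the tested n values modulo
# common CUDA tile/block widths. 76,927 leaves the maximal
# partial remainder (127) mod 128, consistent with a
# last-partial-tile edge case; 76,924 and 80,000 do not.
TILE_SIZES = (32, 64, 128, 256)

def remainders(n):
    """Remainder of n modulo each candidate tile size."""
    return {t: n % t for t in TILE_SIZES}

for n in (76924, 76927, 76928, 80000):
    print(n, remainders(n))
```

This does not prove which kernel is at fault, but it narrows the search to code paths that special-case the final partial tile.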
Workaround: padding the input arrays to a multiple of 128 rows before calling `trustworthiness` avoids the bug. In this case, appending a single duplicate row to make n=76928 ran without error.
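The padding workaround can be sketched as a small NumPy helper; the function name and the choice to duplicate the last row are ours, and note that duplicated rows can slightly perturb the resulting score, so keep the padding minimal:

```python
import numpy as np

def pad_to_multiple(X: np.ndarray, multiple: int = 128) -> np.ndarray:
    """Duplicate the last row until len(X) is a multiple of `multiple`.

    Applied identically to X_high and X_low before calling
    trustworthiness, this sidesteps the crash at n=76927 by
    moving n to the next multiple of 128 (76928).
    """
    pad = (-len(X)) % multiple
    if pad == 0:
        return X
    return np.vstack([X, np.repeat(X[-1:], pad, axis=0)])
```

Usage: `score = trustworthiness(pad_to_multiple(X_high), pad_to_multiple(X_low), n_neighbors=5, batch_size=512)`.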