
[BUG] cuml.metrics.trustworthiness crashes with cudaErrorIllegalAddress for specific values of n #2049

@jswtraveler

Description


Describe the bug

cuml.metrics.trustworthiness crashes with cudaErrorIllegalAddress when called with exactly n=76927 samples. The crash is deterministic and reproducible with random synthetic data, and it is independent of data content, n_neighbors, and batch_size. Both smaller and larger values of n (e.g., 76924, 80000, 148781) work fine.

The crash originates in cuvs-src/cpp/src/stats/detail/trustworthiness_score.cuh and poisons the CUDA context, making all subsequent GPU operations fail.

Steps/Code to reproduce bug

import numpy as np
from cuml.metrics import trustworthiness

rng = np.random.RandomState(42)
X_high = rng.randn(76927, 768).astype(np.float32)
X_low = rng.randn(76927, 30).astype(np.float32)

# This crashes:
score = trustworthiness(X_high, X_low, n_neighbors=5, batch_size=512)

Changing 76927 to 76924 or 80000 makes it pass.

Expected behavior

trustworthiness should return a float score for any valid n.

Actual behavior

RuntimeError: CUDA error encountered at:
  file=/__w/cuml/cuml/python/libcuml/build/py3-none-linux_x86_64/_deps/cuvs-src/cpp/src/stats/detail/trustworthiness_score.cuh
  cudaErrorIllegalAddress: an illegal memory access was encountered

After this error, the CUDA context is permanently corrupted and all subsequent GPU operations fail.

Diagnostic results

We ran a systematic binary search and batch_size sweep on an A10G GPU:

n sweep (synthetic random float32 data, dim_high=768, dim_low=30, k=5, batch=512):

      n    Result
 50,000    ✅ Pass
 70,000    ✅ Pass
 75,000    ✅ Pass
 76,924    ✅ Pass
 76,927    ❌ Crash
 80,000    ✅ Pass
100,000    ✅ Pass
148,781    ✅ Pass

batch_size sweep at n=76927 (all crash): 64, 128, 256, 384, 512, 768, 1024

The bug is triggered by this specific value of n, not by crossing a size threshold, and batch_size has no effect.
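For reference, the sweep harness looked roughly like the sketch below. `probe` is a stand-in here: the real run built random (n, 768) and (n, 30) float32 arrays and called trustworthiness, and because a crash poisons the CUDA context, each probe had to run in a fresh process.

```python
# Sketch of the n sweep harness. `probe` stands in for the real check,
# which called cuml.metrics.trustworthiness in a fresh subprocess per n
# (a crash poisons the CUDA context, so probes cannot share a process).
def probe(n):
    # Real body (crashes at n=76927):
    #   X_high = rng.randn(n, 768).astype(np.float32)
    #   X_low = rng.randn(n, 30).astype(np.float32)
    #   trustworthiness(X_high, X_low, n_neighbors=5, batch_size=512)
    return n != 76927  # observed pass/fail pattern on the A10G

results = {n: probe(n) for n in
           (50_000, 70_000, 75_000, 76_924, 76_927, 80_000, 100_000, 148_781)}
print(results)
```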

Environment

  • cuML: 26.2.0 (cuml-cu12==26.2.0, installed via pip)
  • cuvs: bundled with libcuml-cu12==26.2.0
  • CUDA toolkit: 12.8.1
  • GPU: NVIDIA A10G (24 GB VRAM)
  • Driver: SageMaker ml.g5.2xlarge default
  • Python: 3.10
  • OS: Linux (SageMaker training container, pytorch-cuml-sagemaker-foundry:latest)
  • NumPy: 1.26.4
  • CuPy: 13.6.0

Additional context

This was discovered in a production Optuna hyperparameter tuning pipeline. The pipeline runs cuml_trustworthiness inside a loop over different UMAP parameter combinations. When the input dataset happens to have exactly 76,927 rows, the first call to trustworthiness crashes and poisons the CUDA context, causing all subsequent trials to fail.

The crash appears to be in the internal brute-force KNN (cuvs::neighbors::brute_force::search) or pairwise distance (cuvs::distance::pairwise_distance) call within trustworthiness_score.cuh. The fact that only specific n values trigger it (while nearby values like 76,924 and 80,000 are fine) suggests an integer arithmetic edge case in kernel grid/block dimension calculations, likely a modular arithmetic issue where n % tile_size produces a degenerate value that causes an out-of-bounds memory access. (This hypothesis was suggested by Claude and has not been verified against the kernel source.)
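A quick residue check (pure Python, no GPU) is at least consistent with that hypothesis, assuming a hypothetical tile size of 128, an assumption motivated only by the reported multiple-of-128 padding workaround: the crashing n leaves a final partial tile of 127 rows, the maximal possible remainder, while no passing n in the sweep does.

```python
# Residues of each tested n modulo a hypothetical tile size of 128.
# TILE = 128 is an assumption, not taken from the cuvs kernel source.
TILE = 128
tested = (50_000, 70_000, 75_000, 76_924, 76_927, 80_000, 100_000, 148_781)
residues = {n: n % TILE for n in tested}
print(residues)
# 76927 is the only tested value whose final partial tile would hold
# 127 rows, i.e. the maximal possible remainder.
```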

Workaround: Padding the input arrays to a multiple of 128 rows before calling trustworthiness avoids the bug. In this case adding a single duplicate row to make n=76928 worked without error.
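A minimal sketch of that workaround is below. The helper name `pad_rows_to_multiple` is ours, not a cuml API, and duplicated rows can perturb the trustworthiness score slightly, so this is a stopgap rather than a fix.

```python
import numpy as np

def pad_rows_to_multiple(X, multiple=128):
    """Repeat the last row of X until its row count is a multiple of `multiple`.

    Hypothetical helper, not part of cuml. Duplicated rows may perturb the
    trustworthiness score slightly.
    """
    pad = (-X.shape[0]) % multiple
    if pad == 0:
        return X
    return np.vstack([X, np.repeat(X[-1:], pad, axis=0)])

# n=76927 needs exactly one duplicate row to reach 76928 (= 601 * 128)
```

Repeating the last row preserves dtype and feature layout; appending zero or mean rows would dodge the crash equally well.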
