
[BUG] cuml.metrics.trustworthiness crashes with cudaErrorIllegalAddress for specific values of n #2049

@jswtraveler

Description


Describe the bug

cuml.metrics.trustworthiness crashes with cudaErrorIllegalAddress when called with exactly n=76927 samples. The crash is deterministic and reproducible with random synthetic data, and it is independent of data content, n_neighbors, and batch_size. Both smaller and larger values of n (e.g., 76924, 80000, 148781) work fine.

The crash originates in cuvs-src/cpp/src/stats/detail/trustworthiness_score.cuh and poisons the CUDA context, making all subsequent GPU operations fail.

Steps/Code to reproduce bug

import numpy as np
from cuml.metrics import trustworthiness

rng = np.random.RandomState(42)
X_high = rng.randn(76927, 768).astype(np.float32)
X_low = rng.randn(76927, 30).astype(np.float32)

# This crashes:
score = trustworthiness(X_high, X_low, n_neighbors=5, batch_size=512)

Changing 76927 to 76924 or 80000 makes it pass.

Expected behavior

trustworthiness should return a float score for any valid n.

Actual behavior

RuntimeError: CUDA error encountered at:
  file=/__w/cuml/cuml/python/libcuml/build/py3-none-linux_x86_64/_deps/cuvs-src/cpp/src/stats/detail/trustworthiness_score.cuh
  cudaErrorIllegalAddress: an illegal memory access was encountered

After this error, the CUDA context is permanently corrupted and all subsequent GPU operations fail.

Diagnostic results

We ran a systematic binary search and batch_size sweep on an A10G GPU:

n sweep (synthetic random float32 data, dim_high=768, dim_low=30, k=5, batch=512):

      n    Result
 50,000    ✅ Pass
 70,000    ✅ Pass
 75,000    ✅ Pass
 76,924    ✅ Pass
 76,927    ❌ Crash
 80,000    ✅ Pass
100,000    ✅ Pass
148,781    ✅ Pass

batch_size sweep at n=76927 (all crash): 64, 128, 256, 384, 512, 768, 1024

The bug is triggered by this specific value of n, not by crossing a size threshold, and batch_size has no effect.
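For reference, the sweep harness looked roughly like the sketch below. `probe` is a stand-in here: the real run built random (n, 768) and (n, 30) float32 arrays and called trustworthiness, and because a crash poisons the CUDA context, each probe had to run in a fresh process.

```python
# Sketch of the n sweep harness. `probe` stands in for the real check,
# which called cuml.metrics.trustworthiness in a fresh subprocess per n
# (a crash poisons the CUDA context, so probes cannot share a process).
def probe(n):
    # Real body (crashes at n=76927):
    #   X_high = rng.randn(n, 768).astype(np.float32)
    #   X_low = rng.randn(n, 30).astype(np.float32)
    #   trustworthiness(X_high, X_low, n_neighbors=5, batch_size=512)
    return n != 76927  # observed pass/fail pattern on the A10G

results = {n: probe(n) for n in
           (50_000, 70_000, 75_000, 76_924, 76_927, 80_000, 100_000, 148_781)}
print(results)
```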

Environment

  • cuML: 26.2.0 (cuml-cu12==26.2.0, installed via pip)
  • cuvs: bundled with libcuml-cu12==26.2.0
  • CUDA toolkit: 12.8.1
  • GPU: NVIDIA A10G (24 GB VRAM)
  • Driver: SageMaker ml.g5.2xlarge default
  • Python: 3.10
  • OS: Linux (SageMaker training container, pytorch-cuml-sagemaker-foundry:latest)
  • NumPy: 1.26.4
  • CuPy: 13.6.0

Additional context

This was discovered in a production Optuna hyperparameter tuning pipeline. The pipeline runs cuml_trustworthiness inside a loop over different UMAP parameter combinations. When the input dataset happens to have exactly 76,927 rows, the first call to trustworthiness crashes and poisons the CUDA context, causing all subsequent trials to fail.

The crash appears to be in the internal brute-force KNN (cuvs::neighbors::brute_force::search) or pairwise distance (cuvs::distance::pairwise_distance) call within trustworthiness_score.cuh. The fact that only specific n values trigger it (while nearby values like 76,924 and 80,000 are fine) suggests an integer arithmetic edge case in kernel grid/block dimension calculations, likely a modular arithmetic issue where n % tile_size produces a degenerate value that causes an out-of-bounds memory access. (This hypothesis was suggested by Claude and has not been verified against the kernel source.)
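A quick residue check (pure Python, no GPU) is at least consistent with that hypothesis, assuming a hypothetical tile size of 128, an assumption motivated only by the reported multiple-of-128 padding workaround: the crashing n leaves a final partial tile of 127 rows, the maximal possible remainder, while no passing n in the sweep does.

```python
# Residues of each tested n modulo a hypothetical tile size of 128.
# TILE = 128 is an assumption, not taken from the cuvs kernel source.
TILE = 128
tested = (50_000, 70_000, 75_000, 76_924, 76_927, 80_000, 100_000, 148_781)
residues = {n: n % TILE for n in tested}
print(residues)
# 76927 is the only tested value whose final partial tile would hold
# 127 rows, i.e. the maximal possible remainder.
```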

Workaround: Padding the input arrays to a multiple of 128 rows before calling trustworthiness avoids the bug. In this case adding a single duplicate row to make n=76928 worked without error.
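A minimal sketch of that workaround is below. The helper name `pad_rows_to_multiple` is ours, not a cuml API, and duplicated rows can perturb the trustworthiness score slightly, so this is a stopgap rather than a fix.

```python
import numpy as np

def pad_rows_to_multiple(X, multiple=128):
    """Repeat the last row of X until its row count is a multiple of `multiple`.

    Hypothetical helper, not part of cuml. Duplicated rows may perturb the
    trustworthiness score slightly.
    """
    pad = (-X.shape[0]) % multiple
    if pad == 0:
        return X
    return np.vstack([X, np.repeat(X[-1:], pad, axis=0)])

# n=76927 needs exactly one duplicate row to reach 76928 (= 601 * 128)
```

Repeating the last row preserves dtype and feature layout; appending zero or mean rows would dodge the crash equally well.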
