[BUG] cuVS IVFPQ: Occasional CUDA error in raft/util/cudart_utils.hpp or CUDA error 700 #810

**Describe the bug**
Hello NVIDIA friends, we are benchmarking cuVS IVFPQ through Faiss. I occasionally get a CUDA error when adding vectors to the index.

Setup details:
- GPU: A100 (80GB RAM)
- 50M vectors with d=256 (it reproduces with generated random float data)
- Vectors are saved to a file with np.lib.format.open_memmap or np.save; it reproduces with both. I was **not** able to reproduce when generating embeddings and adding them directly, without the file IO, so this might be an issue with NumPy file loading instead. Let me know if you see the same! For a given file, it seems to fail at the same place every time (the same ith batch during add), which would suggest some kind of file corruption, but there are no NaNs in the vectors. Is there anything else I should look for?
- The vectors are confirmed to contain no NaNs (the check in the code block after this list does not trigger).
- It is nowhere near OOMing. nvidia-smi reports about 8 GB / 80 GB in use on the GPU when it hits the error for m=128. With generated embeddings and no file IO, usage reaches 27 GB / 80 GB for m=256.
```
    # check for NaN
    for i in range(0, len(arr)):
        if np.isnan(np.sum(arr[i])):
            print(f"FOUND NAN on i: {i}")
            break
```
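
Beyond NaN, a few other properties might be worth ruling out on the loaded array: infinities, an unexpected dtype, and non-contiguous memory. Below is a minimal sketch of a chunked check, assuming the file loads as a 2D array. `validate_vectors` and its parameters are hypothetical names, not part of Faiss or cuVS.

```python
import numpy as np

def validate_vectors(arr, chunk=1_000_000):
    """Scan arr in chunks; return a list of (description, row_index) problems.

    row_index is -1 for problems that concern the whole array.
    """
    problems = []
    if arr.dtype != np.float32:
        problems.append((f"dtype is {arr.dtype}, expected float32", -1))
    if not arr.flags["C_CONTIGUOUS"]:
        problems.append(("array is not C-contiguous", -1))
    for start in range(0, len(arr), chunk):
        # Materialize one chunk in memory so a memmap is read piece by piece
        block = np.asarray(arr[start:start + chunk])
        bad_rows = np.flatnonzero(~np.isfinite(block).all(axis=1))
        if bad_rows.size:
            # Report only the first offending row, then stop scanning
            problems.append(("non-finite value (NaN or Inf)", start + int(bad_rows[0])))
            break
    return problems
```

An empty list means none of these checks fired; it does not rule out other kinds of corruption.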

The error appears either as this from Faiss:
```
Traceback (most recent call last):
  File "/data/users/mnorris/fbsource/fbcode/scripts/mnorris/run_cuvs.py", line 246, in <module>
    index.add(x)
  File "/home/mnorris/.conda/envs/faiss_gpu_cuvs/lib/python3.12/site-packages/faiss/class_wrappers.py", line 230, in replacement_add
    self.add_c(n, swig_ptr(x))
  File "/home/mnorris/.conda/envs/faiss_gpu_cuvs/lib/python3.12/site-packages/faiss/swigfaiss_avx512.py", line 12216, in add
    return _swigfaiss_avx512.GpuIndex_add(self, arg2, x)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: C++ exception CUDA error encountered at: file=/home/mnorris/.conda/envs/faiss_gpu_cuvs/include/raft/core/interruptible.hpp line=303: 
Faiss assertion 'err__ == cudaSuccess' failed in virtual faiss::gpu::StandardGpuResourcesImpl::~StandardGpuResourcesImpl() at /home/runner/miniconda3/conda-bld/faiss-pkg_1743488319982/work/faiss/gpu/StandardGpuResources.cpp:141; details: CUDA error 700 an illegal memory access was encountered
```
or this:
```
Traceback (most recent call last):
  File "/data/users/mnorris/fbsource/fbcode/scripts/mnorris/run_cuvs.py", line 260, in <module>
    index.add(x[i : i + BATCH_SIZE])
  File "/home/mnorris/.conda/envs/faiss_gpu_cuvs/lib/python3.12/site-packages/faiss/class_wrappers.py", line 230, in replacement_add
    self.add_c(n, swig_ptr(x))
  File "/home/mnorris/.conda/envs/faiss_gpu_cuvs/lib/python3.12/site-packages/faiss/swigfaiss_avx512.py", line 12216, in add
    return _swigfaiss_avx512.GpuIndex_add(self, arg2, x)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: C++ exception CUDA error encountered at: file=/home/mnorris/.conda/envs/faiss_gpu_cuvs/include/raft/util/cudart_utils.hpp line=148:
```
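
CUDA errors such as error 700 are reported asynchronously, so the file and line in the traceback may not be where the illegal access actually happened. One way to narrow this down might be to force synchronous kernel launches with the `CUDA_LAUNCH_BLOCKING` environment variable. A minimal sketch, assuming the variable is set before `faiss` is imported so that it is in place before CUDA initializes (the helper name is hypothetical):

```python
import os

def enable_blocking_launches():
    """Make CUDA kernel launches synchronous so errors surface at the failing call."""
    os.environ["CUDA_LAUNCH_BLOCKING"] = "1"
    return os.environ["CUDA_LAUNCH_BLOCKING"]

enable_blocking_launches()
# import faiss  # import only after the variable has been set
```

Running the reproducer under NVIDIA's `compute-sanitizer` tool may also help identify the faulting kernel, at the cost of a much slower run.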

**Steps/Code to reproduce bug**

1. Install via conda (pasted below)
2. Run the Python code below to reproduce:
```
import cupy as cp
import faiss
import numpy as np
from numpy.lib.format import open_memmap

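BATCH_SIZE = 1_000_000
d = 256
batches = 50
FILE_PATH = "vectors.npy"  # placeholder path, set this to wherever the vectors should be stored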

def gpu_ivfpq(m, cuvs=False):
    index = faiss.index_factory(
        d, f"IVF128,PQ{m}x8", faiss.METRIC_INNER_PRODUCT
    )
    co = faiss.GpuClonerOptions()
    co.use_cuvs = cuvs
    # co.useFloat16 = True  # Error happens regardless of this setting?
    res = faiss.StandardGpuResources()
    gpu_index = faiss.index_cpu_to_gpu(res, 0, index, co)
    return gpu_index

# Generates embeddings and saves to a file. Recommend running this as separate script then loading data later.
nb=50_000_000
rs = np.random.RandomState(1234)
x = rs.normal(size=(nb, d))
fp = np.lib.format.open_memmap(
    FILE_PATH,
    mode="w+",
    shape=(nb, d),
    dtype=np.float32,
)
print(f"len(x): {len(x)}") # sanity check 50M
for i, arr in enumerate(x):
    fp[i] = arr[0]
    if i % BATCH_SIZE == 0:
        print(f"flushing at i: {i}")
        fp.flush()
fp.flush()
```
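
Note that in the loop above `arr[0]` is a single scalar, so each row of the file is filled with one repeated value, and `rs.normal(size=(nb, d))` allocates the whole float64 array in host memory at once. If the intent is to store the full random rows, a batched variant could look like the sketch below. `write_random_vectors` and its arguments are hypothetical names, and switching to it may change whether the error reproduces.

```python
import numpy as np
from numpy.lib.format import open_memmap

def write_random_vectors(path, nb, d, batch_size=1_000_000, seed=1234):
    """Write nb random float32 vectors of dimension d to a .npy file in batches."""
    rs = np.random.RandomState(seed)
    fp = open_memmap(path, mode="w+", dtype=np.float32, shape=(nb, d))
    for start in range(0, nb, batch_size):
        stop = min(start + batch_size, nb)
        # Generate one batch at a time and store every component of every row
        fp[start:stop] = rs.normal(size=(stop - start, d)).astype(np.float32)
        fp.flush()
    del fp  # drop the memmap so the file is closed
    return path
```

For the sizes in this report the call would be `write_random_vectors(FILE_PATH, 50_000_000, 256)`.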
```
from numpy.lib.format import open_memmap

# Can run this part as a separate script since data generation takes a while and you want to iterate.
# It reuses the imports, BATCH_SIZE, batches, d, FILE_PATH and gpu_ivfpq from the block above.
x = open_memmap(
    FILE_PATH,
    mode="r",
    shape=(BATCH_SIZE * batches, d),
    dtype=np.float32,
)
# Run m=256 just to use more memory / stress test to repro the issue. Usually reproduces on m=128.
for m in [128, 256]:
    index = gpu_ivfpq(m, True)

    index.train(x)
    # Added in batches to see where it dies. .add(x) to add all at once also reproduces.
    for i in range(0, len(x), BATCH_SIZE):
        print(f"adding ith batch: {i}")
        index.add(x[i : i + BATCH_SIZE])  # ** dies here **
    print(f"sanity check index.ntotal: {index.ntotal}")

    # Other stuff here omitted like doing the actual benchmark with search() ...

    del index
```
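
Because the error only shows up when the data comes from a file, one experiment that might help separate the memmap from the index is to copy each batch into a fresh in-memory, C-contiguous float32 array before adding it, and to record which batch fails. A minimal sketch; `add_in_batches` is a hypothetical helper, and `add_fn` stands in for `index.add`:

```python
import numpy as np

def add_in_batches(x, add_fn, batch_size):
    """Feed x to add_fn in in-memory float32 batches.

    Returns the start row of the first batch that raises RuntimeError, or None.
    """
    for start in range(0, len(x), batch_size):
        # np.array copies by default, which detaches the batch from the memmap
        batch = np.ascontiguousarray(
            np.array(x[start:start + batch_size], dtype=np.float32)
        )
        try:
            add_fn(batch)
        except RuntimeError as exc:
            print(f"batch starting at row {start} failed: {exc}")
            return start
    return None
```

In the reproducer this would be called as `add_in_batches(x, index.add, BATCH_SIZE)`. If the error disappears with copied batches, that would point toward how the memmap slices are passed down rather than toward the index itself.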

**Expected behavior**

No error: index.add should complete for every batch.


**Environment details (please complete the following information):**
 - Environment location: [Bare-metal, Docker, Cloud(specify cloud provider)]
 - Method of RAFT install: [conda]
```
  conda create -yn faiss_gpu_cuvs
  conda activate faiss_gpu_cuvs
  conda install -y -c pytorch -c rapidsai -c conda-forge pytorch/label/nightly::faiss-gpu-cuvs pytorch pytorch-cuda numpy
  conda install -y -c conda-forge cupy
```
