Describe the bug
Hello Nvidia friends, we are working on benchmarking cuVS IVFPQ. I am getting a CUDA error when trying to add vectors to the index.
Setup details:
- GPU: A100 (80GB RAM)
- 50M vectors with d=256 (it reproduces with generated random float data)
- Vectors are saved to a file with np.lib.format.open_memmap or np.save; it reproduces with both. I was not able to reproduce it when generating embeddings and adding them directly, without the file IO, so this might be an issue with NumPy file loading instead. Let me know if you see the same! For a given file, it seems to fail at the same place every time (the same ith batch during add), which would suggest something like file corruption, but there are no NaNs in the vectors. Is there anything else I should look for?
- The vectors are confirmed to contain no NaNs (the check shown after this list does not trigger).
- It is nowhere near OOMing. nvidia-smi shows about 8 GB / 80 GB in use on the GPU when it hits the error for m=128. With generated embeddings and no file IO, usage reaches 27 GB / 80 GB for m=256.
# check for NaN
for i in range(0, len(arr)):
    if np.isnan(np.sum(arr[i])):
        print(f"FOUND NAN on i: {i}")
        break
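Beyond NaNs, a few other properties of the loaded array may be worth ruling out: infinities, an unexpected dtype, or a layout that is not C-contiguous. The following is a minimal sketch of a chunked check. `first_nonfinite_row` and `describe_vectors` are hypothetical helper names, not part of Faiss or cuVS, and the chunk size is arbitrary.

```python
import numpy as np


def first_nonfinite_row(arr, chunk_rows=1_000_000):
    """Return the index of the first row containing NaN or +/-Inf, or None."""
    for start in range(0, len(arr), chunk_rows):
        # Pull one chunk at a time so a large memmap is never fully loaded.
        chunk = np.asarray(arr[start : start + chunk_rows])
        bad_rows = ~np.isfinite(chunk).all(axis=1)
        if bad_rows.any():
            return start + int(np.argmax(bad_rows))
    return None


def describe_vectors(arr):
    """Summarize properties that are assumed to matter for index.add()."""
    return {
        "dtype": str(arr.dtype),
        "shape": tuple(arr.shape),
        "c_contiguous": bool(arr.flags["C_CONTIGUOUS"]),
        "first_nonfinite_row": first_nonfinite_row(arr),
    }
```

Calling `describe_vectors(x)` on the loaded file would show whether the array is float32, has the expected (50M, 256) shape, and contains only finite values.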
The error itself either appears as this from Faiss:
Traceback (most recent call last):
File "/data/users/mnorris/fbsource/fbcode/scripts/mnorris/run_cuvs.py", line 246, in <module>
index.add(x)
File "/home/mnorris/.conda/envs/faiss_gpu_cuvs/lib/python3.12/site-packages/faiss/class_wrappers.py", line 230, in replacement_add
self.add_c(n, swig_ptr(x))
File "/home/mnorris/.conda/envs/faiss_gpu_cuvs/lib/python3.12/site-packages/faiss/swigfaiss_avx512.py", line 12216, in add
return _swigfaiss_avx512.GpuIndex_add(self, arg2, x)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: C++ exception CUDA error encountered at: file=/home/mnorris/.conda/envs/faiss_gpu_cuvs/include/raft/core/interruptible.hpp line=303:
Faiss assertion 'err__ == cudaSuccess' failed in virtual faiss::gpu::StandardGpuResourcesImpl::~StandardGpuResourcesImpl() at /home/runner/miniconda3/conda-bld/faiss-pkg_1743488319982/work/faiss/gpu/StandardGpuResources.cpp:141; details: CUDA error 700 an illegal memory access was encountered
or this:
Traceback (most recent call last):
File "/data/users/mnorris/fbsource/fbcode/scripts/mnorris/run_cuvs.py", line 260, in <module>
index.add(x[i : i + BATCH_SIZE])
File "/home/mnorris/.conda/envs/faiss_gpu_cuvs/lib/python3.12/site-packages/faiss/class_wrappers.py", line 230, in replacement_add
self.add_c(n, swig_ptr(x))
File "/home/mnorris/.conda/envs/faiss_gpu_cuvs/lib/python3.12/site-packages/faiss/swigfaiss_avx512.py", line 12216, in add
return _swigfaiss_avx512.GpuIndex_add(self, arg2, x)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: C++ exception CUDA error encountered at: file=/home/mnorris/.conda/envs/faiss_gpu_cuvs/include/raft/util/cudart_utils.hpp line=148:
Steps/Code to reproduce bug
- Install via conda (commands are listed under Environment details below)
- Run the Python code below to reproduce:
import cupy as cp
import faiss
import numpy as np
from numpy.lib.format import open_memmap
BATCH_SIZE = 1_000_000
d = 256
batches = 50
def gpu_ivfpq(m, cuvs=False):
    index = faiss.index_factory(
        d, f"IVF128,PQ{m}x8", faiss.METRIC_INNER_PRODUCT
    )
    co = faiss.GpuClonerOptions()
    co.use_cuvs = cuvs
    # co.useFloat16 = True  # Error happens regardless of this setting?
    res = faiss.StandardGpuResources()
    gpu_index = faiss.index_cpu_to_gpu(res, 0, index, co)
    return gpu_index
# Generates embeddings and saves to a file. Recommend running this as separate script then loading data later.
nb=50_000_000
rs = np.random.RandomState(1234)
x = rs.normal(size=(nb, d))
fp = np.lib.format.open_memmap(
    FILE_PATH,
    mode="w+",
    shape=(nb, d),
    dtype=np.float32,
)
print(f"len(x): {len(x)}")  # sanity check 50M
for i, arr in enumerate(x):
    # Note: arr[0] is a scalar, so it is broadcast across every element of row i.
    fp[i] = arr[0]
    if i % BATCH_SIZE == 0:
        print(f"flushing at i: {i}")
        fp.flush()
fp.flush()
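Materializing all 50M x 256 samples with a single rs.normal call allocates a float64 array of roughly 100 GB before anything is written. A minimal sketch of generating and writing the file chunk by chunk is below. `write_random_vectors` is a hypothetical helper, and it assumes the intent is to store each full row of random values as float32, which differs from the row-by-row loop above.

```python
import numpy as np
from numpy.lib.format import open_memmap


def write_random_vectors(path, nb, d, chunk_rows=1_000_000, seed=1234):
    """Write an (nb, d) float32 .npy file of normal samples in chunks."""
    rs = np.random.RandomState(seed)
    fp = open_memmap(path, mode="w+", shape=(nb, d), dtype=np.float32)
    for start in range(0, nb, chunk_rows):
        stop = min(start + chunk_rows, nb)
        # Only one chunk of float64 samples is held in memory at a time.
        fp[start:stop] = rs.normal(size=(stop - start, d)).astype(np.float32)
        fp.flush()
    del fp  # drop the memmap so the file handle is released
```

For the setup in this report, the call would be `write_random_vectors(FILE_PATH, 50_000_000, 256)`.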
from numpy.lib.format import open_memmap
# Can run this part as a separate script since data generation takes a while and you want to iterate
x = open_memmap(
    FILE_PATH,
    mode="r",
    shape=(BATCH_SIZE * batches, d),
    dtype=np.float32,
)
# Run m=256 just to use more memory / stress test to repro the issue. Usually reproduces on m=128.
for m in [128, 256]:
    index = gpu_ivfpq(m, True)
    index.train(x)
    # Added in batches to see where it dies. .add(x) to add all at once also reproduces.
    for i in range(0, len(x), BATCH_SIZE):
        print(f"adding ith batch: {i}")
        index.add(x[i : i + BATCH_SIZE])  # ** dies here **
    print(f"sanity check index.ntotal: {index.ntotal}")
    # Other stuff here omitted like doing the actual benchmark with search() ...
    del index
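Since the failure only appears when the vectors come from a file, one way to narrow it down might be to copy each batch into a fresh in-memory float32 array before calling add, so the index never receives a memmap view directly. A minimal sketch, where `iter_contiguous_batches` is a hypothetical helper and not a Faiss API:

```python
import numpy as np


def iter_contiguous_batches(x, batch_size):
    """Yield (start, batch) pairs where batch is an in-memory float32 copy."""
    for start in range(0, len(x), batch_size):
        # np.array copies by default and returns a plain ndarray,
        # so the result no longer references the memory-mapped file.
        batch = np.array(x[start : start + batch_size], dtype=np.float32, order="C")
        yield start, batch
```

In the add loop, `for start, batch in iter_contiguous_batches(x, BATCH_SIZE): index.add(batch)` would replace the direct slice. If the error disappears, that points toward the memmap path; if it persists at the same batch, the contents of that batch are the more likely factor. Running with the environment variable CUDA_LAUNCH_BLOCKING=1 may also make the reported error location more precise, since kernel launches then run synchronously.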
Expected behavior
No error
Environment details (please complete the following information):
- Environment location: [Bare-metal, Docker, Cloud(specify cloud provider)]
- Method of RAFT install: [conda]
conda create -yn faiss_gpu_cuvs
conda activate faiss_gpu_cuvs
conda install -y -c pytorch -c rapidsai -c conda-forge pytorch/label/nightly::faiss-gpu-cuvs pytorch pytorch-cuda numpy
conda install -y -c conda-forge cupy
Additional context
Add any other context about the problem here.