Sparse Conv only works on GPU0 without context manager. #544

dddab · 2022-12-17T16:47:38Z

NVIDIA-SMI 470.82.01 Driver Version: 470.82.01 CUDA Version: 11.4

python3 --version
Python 3.9.5

import spconv
spconv.version
'2.2.6'

To reproduce

import numpy as np
import spconv.pytorch as spconv
import torch

num_channels = 32
N = 4

features = torch.rand([N, num_channels])
# Coordinates with shape [N, ndim + 1], batch index must be put in indices[:, 0]
indices_np = np.array([[0, 1, 1, 2], [0, 2, 1, 1], [1, 7, 8, 2], [1, 7, 9, 2]]).astype(np.int32)
indices = torch.from_numpy(indices_np)
spatial_shape = [513, 517, 29]
batch_size = 2
device_idx = 0
x = spconv.SparseConvTensor(features.cuda(device_idx), indices.cuda(device_idx), spatial_shape, batch_size)
conv = spconv.SubMConv3d(num_channels, 16, 3, padding=1, bias=False, indice_key="subm1").cuda(device_idx)
print(conv(x))

Command: CUDA_LAUNCH_BLOCKING=1 python3 /repro.py

If I run with device_idx = 0, the output is as expected:

<spconv.pytorch.core.SparseConvTensor object at 0x7f9f76e15ca0>

If I run with device_idx = 1, the output is:

[Exception|implicit_gemm_pair]indices=torch.Size([4, 4]),bs=2,ss=[513, 517, 29],algo=ConvAlgo.MaskImplicitGemm,ksize=[3, 3, 3],stride=[1, 1, 1],padding=[1, 1, 1],dilation=[1, 1, 1],subm=True,transpose=False
spconv try to save debug data to /debug.txt, but failed with exception CUDA error: an illegal memory access was encountered. please check your SPCONV_DEBUG_SAVE_PATH
Traceback (most recent call last):
  File "/repro.py", line 17, in <module>
    print(conv(x))
  File "/usr/local/lib/python3.9/dist-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/usr/local/lib/python3.9/dist-packages/spconv/pytorch/conv.py", line 515, in forward
    raise e
  File "/usr/local/lib/python3.9/dist-packages/spconv/pytorch/conv.py", line 492, in forward
    res = ops.get_indice_pairs_implicit_gemm(
  File "/usr/local/lib/python3.9/dist-packages/spconv/pytorch/ops.py", line 546, in get_indice_pairs_implicit_gemm
    SpconvOps.sort_1d_by_key_allocator(pair_mask_tv[j],
RuntimeError: radix_sort: failed on 1st step: cudaErrorIllegalAddress: an illegal memory access was encountered

No debug data was saved to the SPCONV_DEBUG_SAVE_PATH file.

To me it looks like it is trying to compute the rules on GPU0 even though all the tensors are on GPU1.
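One way to check this suspicion is to compare the device a tensor lives on with PyTorch's current CUDA device, which stays at 0 unless something changes it. The sketch below uses only PyTorch; device_mismatch is a hypothetical helper name, not part of spconv or torch, and a mismatch only shows that the condition described above holds, not that spconv is the component reading the current device.

```python
import torch


def device_mismatch(tensor):
    """Return True if `tensor` is on a GPU other than the current CUDA device.

    Hypothetical diagnostic helper; CPU tensors never count as a mismatch.
    """
    if not tensor.is_cuda:
        return False
    # torch.cuda.current_device() is the index used by code that does not
    # pick a device explicitly; it defaults to 0.
    return tensor.device.index != torch.cuda.current_device()


features = torch.rand(4, 32)
if torch.cuda.device_count() > 1:
    # Same placement as the failing case in the repro (device_idx = 1).
    features = features.cuda(1)
print(features.device, device_mismatch(features))
```

On a machine with at least two GPUs and no surrounding context manager, this should report a mismatch for tensors placed on GPU1.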

Note that it works fine if we run the same code inside a with torch.cuda.device(1): block.
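A minimal sketch of that workaround, generalized so the current CUDA device always follows the input tensor. current_device_guard is a hypothetical helper name, and torch.nn.Linear stands in for spconv.SubMConv3d so the snippet also runs without spconv or a GPU; whether this covers every spconv code path is an assumption, not something confirmed here.

```python
import contextlib

import torch


def current_device_guard(tensor):
    """Context manager that makes `tensor`'s GPU the current CUDA device.

    Hypothetical helper: returns a no-op context for CPU tensors.
    """
    if tensor.is_cuda:
        # torch.cuda.device switches the current device on entry and
        # restores the previous one on exit.
        return torch.cuda.device(tensor.device)
    return contextlib.nullcontext()


layer = torch.nn.Linear(4, 2)  # stand-in for the sparse conv layer
x = torch.rand(3, 4)
if torch.cuda.device_count() > 1:
    layer, x = layer.cuda(1), x.cuda(1)

with current_device_guard(x):
    out = layer(x)
print(out.device)
```

In the repro, the equivalent change would be to wrap the print(conv(x)) call in with torch.cuda.device(device_idx):.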

dddab changed the title ~~Sparse Conv only works on GPU0~~ Sparse Conv only works on GPU0 without context manager. Dec 17, 2022
