Register jagged_index_select to CUDA and CPU backend for inference
Summary:
We have a new model that uses the jagged_index_select op, and it fails when we script the model and run it on the inference server, with the following error:
```
NotImplementedError: Could not run 'fbgemm::jagged_index_select' with arguments from the 'CUDA' backend. This could be because the operator doesn't exist for this backend, or was omitted during the selective/custom build process (if using custom build). If you are a Facebook employee using PyTorch on mobile, please visit https://fburl.com/ptmfixes for possible resolutions. 'fbgemm::jagged_index_select' is only available for these backends: [BackendSelect, Python, FuncTorchDynamicLayerBackMode, Functionalize, Named, Conjugate, Negative, ZeroTensor, ADInplaceOrView, AutogradOther, AutogradCPU, AutogradCUDA, AutogradHIP, AutogradXLA, AutogradMPS, AutogradIPU, AutogradXPU, AutogradHPU, AutogradVE, AutogradLazy, AutogradMTIA, AutogradPrivateUse1, AutogradPrivateUse2, AutogradPrivateUse3, AutogradMeta, AutogradNestedTensor, Tracer, AutocastCPU, AutocastCUDA, FuncTorchBatched, FuncTorchVmapMode, Batched, VmapMode, FuncTorchGradWrapper, PythonTLSSnapshot, FuncTorchDynamicLayerFrontMode, PreDispatch, PythonDispatcher].

BackendSelect: fallthrough registered at fbcode/caffe2/aten/src/ATen/core/BackendSelectFallbackKernel.cpp:3 [backend fallback]
Python: registered at fbcode/caffe2/aten/src/ATen/core/PythonFallbackKernel.cpp:153 [backend fallback]
FuncTorchDynamicLayerBackMode: registered at fbcode/caffe2/aten/src/ATen/functorch/DynamicLayer.cpp:498 [backend fallback]
Functionalize: registered at fbcode/caffe2/aten/src/ATen/FunctionalizeFallbackKernel.cpp:290 [backend fallback]
Named: registered at fbcode/caffe2/aten/src/ATen/core/NamedRegistrations.cpp:7 [backend fallback]
Conjugate: registered at fbcode/caffe2/aten/src/ATen/ConjugateFallback.cpp:17 [backend fallback]
Negative: registered at fbcode/caffe2/aten/src/ATen/native/NegateFallback.cpp:19 [backend fallback]
ZeroTensor: registered at fbcode/caffe2/aten/src/ATen/ZeroTensorFallback.cpp:86 [backend fallback]
ADInplaceOrView: fallthrough registered at fbcode/caffe2/aten/src/ATen/core/VariableFallbackKernel.cpp:71 [backend fallback]
AutogradOther: registered at fbcode/deeplearning/fbgemm/fbgemm_gpu/src/jagged_tensor_ops/jagged_tensor_ops_autograd.cpp:867 [autograd kernel]
AutogradCPU: registered at fbcode/deeplearning/fbgemm/fbgemm_gpu/src/jagged_tensor_ops/jagged_tensor_ops_autograd.cpp:867 [autograd kernel]
AutogradCUDA: registered at fbcode/deeplearning/fbgemm/fbgemm_gpu/src/jagged_tensor_ops/jagged_tensor_ops_autograd.cpp:867 [autograd kernel]
AutogradHIP: registered at fbcode/deeplearning/fbgemm/fbgemm_gpu/src/jagged_tensor_ops/jagged_tensor_ops_autograd.cpp:867 [autograd kernel]
AutogradXLA: registered at fbcode/deeplearning/fbgemm/fbgemm_gpu/src/jagged_tensor_ops/jagged_tensor_ops_autograd.cpp:867 [autograd kernel]
AutogradMPS: registered at fbcode/deeplearning/fbgemm/fbgemm_gpu/src/jagged_tensor_ops/jagged_tensor_ops_autograd.cpp:867 [autograd kernel]
AutogradIPU: registered at fbcode/deeplearning/fbgemm/fbgemm_gpu/src/jagged_tensor_ops/jagged_tensor_ops_autograd.cpp:867 [autograd kernel]
AutogradXPU: registered at fbcode/deeplearning/fbgemm/fbgemm_gpu/src/jagged_tensor_ops/jagged_tensor_ops_autograd.cpp:867 [autograd kernel]
AutogradHPU: registered at fbcode/deeplearning/fbgemm/fbgemm_gpu/src/jagged_tensor_ops/jagged_tensor_ops_autograd.cpp:867 [autograd kernel]
AutogradVE: registered at fbcode/deeplearning/fbgemm/fbgemm_gpu/src/jagged_tensor_ops/jagged_tensor_ops_autograd.cpp:867 [autograd kernel]
AutogradLazy: registered at fbcode/deeplearning/fbgemm/fbgemm_gpu/src/jagged_tensor_ops/jagged_tensor_ops_autograd.cpp:867 [autograd kernel]
AutogradMTIA: registered at fbcode/deeplearning/fbgemm/fbgemm_gpu/src/jagged_tensor_ops/jagged_tensor_ops_autograd.cpp:867 [autograd kernel]
AutogradPrivateUse1: registered at fbcode/deeplearning/fbgemm/fbgemm_gpu/src/jagged_tensor_ops/jagged_tensor_ops_autograd.cpp:867 [autograd kernel]
AutogradPrivateUse2: registered at fbcode/deeplearning/fbgemm/fbgemm_gpu/src/jagged_tensor_ops/jagged_tensor_ops_autograd.cpp:867 [autograd kernel]
AutogradPrivateUse3: registered at fbcode/deeplearning/fbgemm/fbgemm_gpu/src/jagged_tensor_ops/jagged_tensor_ops_autograd.cpp:867 [autograd kernel]
AutogradMeta: registered at fbcode/deeplearning/fbgemm/fbgemm_gpu/src/jagged_tensor_ops/jagged_tensor_ops_autograd.cpp:867 [autograd kernel]
AutogradNestedTensor: registered at fbcode/deeplearning/fbgemm/fbgemm_gpu/src/jagged_tensor_ops/jagged_tensor_ops_autograd.cpp:867 [autograd kernel]
Tracer: registered at fbcode/caffe2/torch/csrc/autograd/TraceTypeManual.cpp:296 [backend fallback]
AutocastCPU: fallthrough registered at fbcode/caffe2/aten/src/ATen/autocast_mode.cpp:383 [backend fallback]
AutocastCUDA: fallthrough registered at fbcode/caffe2/aten/src/ATen/autocast_mode.cpp:250 [backend fallback]
FuncTorchBatched: registered at fbcode/caffe2/aten/src/ATen/functorch/LegacyBatchingRegistrations.cpp:710 [backend fallback]
FuncTorchVmapMode: fallthrough registered at fbcode/caffe2/aten/src/ATen/functorch/VmapModeRegistrations.cpp:28 [backend fallback]
Batched: registered at fbcode/caffe2/aten/src/ATen/LegacyBatchingRegistrations.cpp:1075 [backend fallback]
VmapMode: fallthrough registered at fbcode/caffe2/aten/src/ATen/VmapModeRegistrations.cpp:33 [backend fallback]
FuncTorchGradWrapper: registered at fbcode/caffe2/aten/src/ATen/functorch/TensorWrapper.cpp:201 [backend fallback]
PythonTLSSnapshot: registered at fbcode/caffe2/aten/src/ATen/core/PythonFallbackKernel.cpp:161 [backend fallback]
FuncTorchDynamicLayerFrontMode: registered at fbcode/caffe2/aten/src/ATen/functorch/DynamicLayer.cpp:494 [backend fallback]
PreDispatch: registered at fbcode/caffe2/aten/src/ATen/core/PythonFallbackKernel.cpp:165 [backend fallback]
PythonDispatcher: registered at fbcode/caffe2/aten/src/ATen/core/PythonFallbackKernel.cpp:157 [backend fallback]
```
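Reading the key list above: the op only has Autograd* wrapper kernels registered, with no CPU or CUDA backend entry, so inference-mode dispatch (which bypasses the Autograd keys) has nowhere to go. As a sketch, one way to inspect an operator's dispatch table is a private PyTorch helper (an assumption on my part; its availability and output format vary by release):
```python
import torch
import fbgemm_gpu  # loads the fbgemm operator library into the dispatcher

# Dump the per-dispatch-key kernel registrations for the op. Before this
# change the table shows Autograd* entries but no CPU/CUDA backend kernel.
print(torch._C._dispatch_dump("fbgemm::jagged_index_select"))
```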
This diff registers the jagged_index_select forward path with the CUDA and CPU backends so that the op can be used in inference.
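A minimal sketch of the pattern that used to fail (hypothetical module name and shapes, mirroring the new test below): scripting a module that calls the op and running it under inference mode. With this registration the call dispatches straight to the backend kernel.
```python
import torch
import fbgemm_gpu  # registers the fbgemm:: ops

class SelectRows(torch.nn.Module):  # hypothetical module for illustration
    def forward(
        self, values: torch.Tensor, lengths: torch.Tensor, indices: torch.Tensor
    ) -> torch.Tensor:
        # The op returns [output_values, output_lengths]; take the values.
        return torch.ops.fbgemm.jagged_index_select(values, lengths, indices)[0]

scripted = torch.jit.script(SelectRows())
values = torch.rand(10, 4)         # 10 jagged rows flattened, 4 columns
lengths = torch.tensor([3, 2, 5])  # row lengths; sum matches values.size(0)
indices = torch.tensor([0, 2])     # jagged rows to select

with torch.inference_mode():
    # Previously raised NotImplementedError on the CUDA/CPU backend keys.
    out = scripted(values, lengths, indices)
```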

Differential Revision: D47462912

fbshipit-source-id: a0cc7dd641dd99ed911ee6114c21495a39f90b44
Pengchao Wang authored and facebook-github-bot committed Jul 15, 2023
1 parent 55cd84c commit 907839c
Showing 3 changed files with 79 additions and 0 deletions.
5 changes: 5 additions & 0 deletions fbgemm_gpu/src/jagged_tensor_ops/jagged_tensor_ops.cu
```diff
@@ -37,3 +37,8 @@ FBGEMM_OP_DISPATCH(CUDA, "jagged_2d_to_dense", fbgemm_gpu::jagged_2d_to_dense);
 FBGEMM_OP_DISPATCH(CUDA, "jagged_softmax", fbgemm_gpu::jagged_softmax);
 FBGEMM_OP_DISPATCH(CUDA, "jagged_jagged_bmm", fbgemm_gpu::jagged_jagged_bmm);
 FBGEMM_OP_DISPATCH(CUDA, "jagged_dense_bmm", fbgemm_gpu::jagged_dense_bmm);
+
+FBGEMM_OP_DISPATCH(
+    CUDA,
+    "jagged_index_select",
+    fbgemm_gpu::jagged_index_select_2d);
```
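FBGEMM_OP_DISPATCH is FBGEMM's registration macro; conceptually (a sketch, not the macro's exact expansion) it binds the named fbgemm operator to a kernel under the given dispatch key, much like the plain PyTorch registration API:
```cpp
#include <torch/library.h>

// Sketch: register a CUDA kernel for an already-declared fbgemm operator.
// In inference mode the Autograd* keys are skipped, so dispatch must find a
// backend (CPU/CUDA) kernel here instead of only the autograd wrapper.
TORCH_LIBRARY_IMPL(fbgemm, CUDA, m) {
  m.impl(
      "jagged_index_select",
      TORCH_FN(fbgemm_gpu::jagged_index_select_2d));
}
```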
1 change: 1 addition & 0 deletions fbgemm_gpu/src/jagged_tensor_ops/jagged_tensor_ops_cpu.cpp
```diff
@@ -1730,6 +1730,7 @@ TORCH_LIBRARY_IMPL(fbgemm, CPU, m) {
   DISPATCH_TO_CPU(
       "jagged_index_select_2d_forward",
       fbgemm_gpu::jagged_index_select_2d_forward_cpu);
+  DISPATCH_TO_CPU("jagged_index_select", fbgemm_gpu::jagged_index_select_2d);
   DISPATCH_TO_CPU(
       "jagged_index_add_2d_forward",
       fbgemm_gpu::jagged_index_add_2d_forward_cpu);
```
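For readers unfamiliar with the op's semantics, here is a hedged pure-PyTorch sketch of what jagged_index_select computes (jagged_index_select_ref is a name introduced here for illustration; the test below compares against its own reference helper): it selects whole jagged rows, returning the concatenated values and the selected lengths.
```python
import torch

def jagged_index_select_ref(values, lengths, indices):
    # lengths -> start offset of each jagged row within the flattened values.
    offsets = torch.cat([lengths.new_zeros(1), lengths.cumsum(0)])
    idx = indices.long()
    # Gather each selected row's slice of values and concatenate the slices.
    rows = [values[offsets[i] : offsets[i] + lengths[i]] for i in idx]
    return torch.cat(rows), lengths[idx]

values = torch.arange(10.0).unsqueeze(1)  # 3 jagged rows flattened: |3|2|5|
lengths = torch.tensor([3, 2, 5])
out, out_lengths = jagged_index_select_ref(values, lengths, torch.tensor([0, 2]))
# out stacks rows 0 and 2 (3 + 5 = 8 values); out_lengths == tensor([3, 5])
```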
73 changes: 73 additions & 0 deletions fbgemm_gpu/test/jagged_tensor_ops_test.py
```diff
@@ -1763,6 +1763,79 @@ def test_jagged_index_select_2d(
             atol=1e-2 if jagged_tensor_dtype in [torch.half, torch.bfloat16] else None,
         )
 
+    @unittest.skipIf(*running_on_github)
+    @given(
+        max_seq_length=st.integers(5, 10),
+        batch_size=st.integers(1, 128),
+        num_cols=st.integers(1, 128),
+        num_jagged_tensor_rows=st.integers(1, 128),
+        index_dtype=st.sampled_from([torch.int, torch.long]),
+        jagged_tensor_dtype=st.sampled_from(
+            [
+                torch.float,
+                torch.half,
+                torch.int,
+                torch.long,
+            ]  # Disable torch.bfloat16 due to large error bound
+        ),
+        use_cpu=st.booleans()
+        if (gpu_available and not TEST_WITH_ROCM)
+        else st.just(False)
+        if (gpu_available and TEST_WITH_ROCM)
+        else st.just(True),
+    )
+    @settings(max_examples=20, deadline=None)
+    def test_jagged_index_select_2d_in_inference(
+        self,
+        max_seq_length: int,
+        batch_size: int,
+        num_cols: int,
+        num_jagged_tensor_rows: int,
+        index_dtype: torch.dtype,
+        jagged_tensor_dtype: torch.dtype,
+        use_cpu: bool,
+    ) -> None:
+        device = torch.device("cpu" if use_cpu else "cuda")
+        is_float = jagged_tensor_dtype in [torch.float, torch.half, torch.bfloat16]
+        lengths = torch.randint(
+            low=0,
+            high=max_seq_length,
+            size=(num_jagged_tensor_rows,),
+            dtype=index_dtype,
+            device=device,
+        )
+        indices, _ = torch.sort(
+            torch.randint(
+                low=0,
+                high=num_jagged_tensor_rows,
+                size=(batch_size,),
+                dtype=index_dtype,
+                device=device,
+            )
+        )
+        if is_float:
+            values = torch.rand(
+                int(lengths.sum().item()),
+                num_cols,
+                dtype=jagged_tensor_dtype,
+                device=device,
+            )
+        else:
+            values = torch.randint(
+                2**16,
+                (int(lengths.sum().item()), num_cols),
+                dtype=jagged_tensor_dtype,
+                device=device,
+            )
+        values_ref = values.detach().clone()
+
+        with torch.inference_mode():
+            output, _ = torch.ops.fbgemm.jagged_index_select(values, lengths, indices)
+            output_ref = self.jagged_index_select_2d_ref(
+                values_ref, lengths, indices, device
+            )
+            assert torch.equal(output, output_ref)
+
     @given(
         batch_size=st.integers(1, 128),
         max_length=st.integers(0, 128),
```
