
[RFC] Unify device check and guard in kernel wrapper #55241

Closed
wants to merge 34 commits

Conversation


@wenleix wenleix commented Apr 2, 2021

Stack from ghstack:

For most kernels run on the CUDA backend, the following preparation steps are required:

  1. Make sure all the tensors are on the same device (except for
    0-dimension tensors, which are allowed to stay on CPU). We will denote
    this device as common_device.

  2. Acquire the device guard on common_device (illustrated right below).
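
For reference, "acquiring the device guard" means constructing a RAII guard that switches the current CUDA device to common_device for the duration of the kernel. A minimal illustration using the existing c10::cuda::CUDAGuard (the helper function itself is hypothetical, not code from this PR):

#include <ATen/ATen.h>
#include <c10/cuda/CUDAGuard.h>

// Hypothetical helper, for illustration only. `t` is assumed to live on a
// CUDA device (CUDAGuard rejects CPU devices).
void launch_on_device_of(const at::Tensor& t) {
  // RAII guard: sets the current CUDA device to t's device and restores
  // the previous device when the guard goes out of scope.
  c10::cuda::CUDAGuard guard(t.device());
  // ... launch CUDA work for t here ...
}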

Today, these two steps are done separately: (1) is done in kernels
in an ad hoc way, while (2) is done in the kernel wrapper (e.g. RegisterCUDA.cpp).

The kernel wrapper assumes (1) will be done in the kernel, so it just picks up
the first tensor to acquire the device guard. There are at least two issues with
the current approach:

  1. A kernel may not implement the device check, or may implement it in an inconsistent way.

  2. If the first tensor happens to be a 0-dimension CPU tensor, (2) will acquire
    the wrong device guard. We have seen such issues before
    (db2b273), so the kernel might need to re-acquire the
    device guard in that situation, which is fragile.
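
To make this concrete, below is a minimal sketch (not the actual codegen output proposed in this PR; the helper, wrapper, and kernel names are hypothetical) of what a unified check-and-guard in a generated CUDA wrapper could look like: compute common_device over all tensor arguments while skipping 0-dimension CPU tensors, check it, and acquire the guard on that checked device instead of on the first argument.

#include <ATen/ATen.h>
#include <c10/cuda/CUDAGuard.h>
#include <c10/util/Optional.h>

namespace {

// Fold one tensor into the common device. Undefined tensors and
// 0-dimension CPU tensors (CPU scalars) are skipped.
void update_common_device(c10::optional<at::Device>& common_device,
                          const at::Tensor& t) {
  if (!t.defined()) return;
  if (t.device().is_cpu() && t.dim() == 0) return;
  if (!common_device.has_value()) {
    common_device = t.device();
    return;
  }
  TORCH_CHECK(*common_device == t.device(),
      "Expected all tensors to be on the same device, but found at least ",
      "two devices, ", *common_device, " and ", t.device(), "!");
}

// Hypothetical CUDA kernel entry point (normally registered via codegen).
at::Tensor add_kernel_cuda(const at::Tensor& self, const at::Tensor& other);

} // namespace

// Hypothetical generated wrapper, i.e. the kind of code emitted into
// RegisterCUDA.cpp. Simplified: assumes at least one argument is a CUDA tensor.
at::Tensor wrapper_add(const at::Tensor& self, const at::Tensor& other) {
  c10::optional<at::Device> common_device;
  update_common_device(common_device, self);
  update_common_device(common_device, other);
  // The guard is acquired on the checked common_device, so a leading
  // 0-dimension CPU tensor can no longer select the wrong device.
  c10::cuda::OptionalCUDAGuard guard(common_device);
  return add_kernel_cuda(self, other);
}

With such a scheme, the same-device check and the device guard come from one place, so individual kernels no longer need their own ad hoc variants like the ones listed in the appendix below.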

Appendix

As concrete examples, here are a few implementations in the codebase that perform these device checks:

a. TensorIteratorBase::compute_types in ATen/TensorIterator.cpp. Used by kernels that rely on TensorIterator to perform element-wise operations:

// Checks all tensors are on the same device, if requested
if (config.check_all_same_device_) {
  // Handles CPU scalars on CUDA kernels that support them
  if (!common_device.is_cpu() &&
      config.allow_cpu_scalars_ && !op.is_output && op.tensor.dim() == 0 &&
      op.tensor.is_cpu()) {
    TORCH_CHECK(current_cpu_scalars_on_non_cpu < max_cpu_scalars_on_non_cpu,
                "Trying to pass too many CPU scalars to non-CPU kernel!");
    ++current_cpu_scalars_on_non_cpu;
  } else if (op.device != common_device) {
    TORCH_CHECK(false,
                "Expected all tensors to be on the same device, but "
                "found at least two devices, ", common_device, " and ", op.device, "!");
  }
}

b. checkSameGPU and checkAllSameGPU in ATen/TensorUtils.cpp. Used in many CUDA kernels:

void checkSameGPU(CheckedFrom c, const TensorArg& t1, const TensorArg& t2) {
  if (! (t1->is_cuda()) || ! (t2->is_cuda())) {
    std::ostringstream oss;
    if (! t1->is_cuda()) {
      oss << "Tensor for " << t1 << " is on CPU, ";
    }
    if (! t2->is_cuda()) {
      oss << "Tensor for " << t2 << " is on CPU, ";
    }
    oss << "but expected " << ((!(t1->is_cuda() || t2->is_cuda())) ? "them" : "it")
        << " to be on GPU (while checking arguments for " << c << ")";
    AT_ERROR(oss.str());
  }
  TORCH_CHECK(
      t1->get_device() == t2->get_device(),
      "Expected tensor for ", t1, " to have the same device as tensor for ", t2,
      "; but device ", t1->get_device(), " does not equal ", t2->get_device(),
      " (while checking arguments for ", c, ")");
}

void checkAllSameGPU(CheckedFrom c, ArrayRef<TensorArg> tensors) {
  checkAllSame(c, tensors, checkSameGPU);
}

c. check_attributes in ATen/native/RNN.h. Used in a few RNN kernels:

inline void check_attributes(const Tensor& input, const TensorList& params, const TensorList& hiddens, bool check_dtype=false) {
  auto input_device = input.device();
  auto input_dtype = input.scalar_type();

  auto check_tensors = [&](const std::string& name, const Tensor& t) {
    if (!t.defined()) return;
    auto t_device = t.device();
    TORCH_CHECK(input_device == t_device,
                "Input and ", name, " tensors are not at the same device, found input tensor at ",
                input_device, " and ", name, " tensor at ", t_device);
    if (check_dtype) {
      auto t_dtype = t.scalar_type();
      TORCH_CHECK(input_dtype == t_dtype,
                  "Input and ", name, " tensors are not the same dtype, found input tensor with ",
                  input_dtype, " and ", name, " tensor with ", t_dtype);
    }
  };

  for (auto h : hiddens) check_tensors("hidden", h);
  for (auto p : params) check_tensors("parameter", p);
}

d. dot_check in ATen/native/cuda/LinearAlgebra.cu:

TORCH_CHECK(
    self.device() == other.device(),
    "expected all tensors to be on the same device. Found: ",
    self.device(),
    ", ",
    other.device());

These device checks make slightly different assumptions. For example, (a) allows 0-dimension CPU tensors, (b) assumes everything is on a GPU device, and (c) also checks all tensors in a TensorList.
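
A unified, wrapper-level check would need to subsume these variations, e.g. by also folding every tensor of a TensorList argument into common_device. A hedged sketch, reusing the hypothetical update_common_device helper from the sketch above:

// Hypothetical overload for TensorList arguments (illustration only).
void update_common_device(c10::optional<at::Device>& common_device,
                          at::TensorList tensors) {
  for (const at::Tensor& t : tensors) {
    update_common_device(common_device, t);
  }
}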

Differential Revision: D27540071

facebook-github-bot commented Apr 2, 2021

💊 CI failures summary and remediations

As of commit 636d485 (more details on the Dr. CI page):


  • 7/7 failures possibly* introduced in this PR
    • 2/7 non-scanned failure(s)

🕵️ 5 new failures recognized by patterns

The following CI failures do not appear to be due to upstream breakages:

See CircleCI build pytorch_linux_xenial_cuda10_2_cudnn7_py3_gcc7_test2 (1/5)

Step: "Run tests" (full log | diagnosis details | 🔁 rerun)

Apr 24 00:19:11 AssertionError: False is not tr...difference of 0 (121 vs. 121) occuring at index 0.
Apr 24 00:19:11 ======================================================================
Apr 24 00:19:11 FAIL [1.666s]: test_to (__main__.PackedSequenceTest)
Apr 24 00:19:11 ----------------------------------------------------------------------
Apr 24 00:19:11 Traceback (most recent call last):
Apr 24 00:19:11   File "test_nn.py", line 190, in test_to
Apr 24 00:19:11     self.assertEqual(b, a.to(cuda))
Apr 24 00:19:11   File "/opt/conda/lib/python3.6/site-packages/torch/testing/_internal/common_utils.py", line 1360, in assertEqual
Apr 24 00:19:11     exact_dtype=exact_dtype, exact_device=exact_device)
Apr 24 00:19:11   File "/opt/conda/lib/python3.6/site-packages/torch/testing/_internal/common_utils.py", line 1326, in assertEqual
Apr 24 00:19:11     super().assertTrue(result, msg=self._get_assert_msg(msg, debug_msg=debug_msg))
Apr 24 00:19:11 AssertionError: False is not true : Tensors failed to compare as equal!Found 0 different element(s) (out of 12), with the greatest difference of 0 (121 vs. 121) occuring at index 0.
Apr 24 00:19:11 
Apr 24 00:19:11 ----------------------------------------------------------------------
Apr 24 00:19:11 Ran 1 test in 1.671s
Apr 24 00:19:11 
Apr 24 00:19:11 FAILED (failures=1)
Apr 24 00:19:11 
Apr 24 00:19:11 Generating XML reports...
Apr 24 00:19:11 Generated XML report: test-reports/python-unittest/test_nn/TEST-PackedSequenceTest-20210424001909.xml
Apr 24 00:19:12 Traceback (most recent call last):
Apr 24 00:19:12   File "test/run_test.py", line 1156, in <module>

See CircleCI build pytorch_windows_vs2019_py36_cuda10.1_test2 (2/5)

Step: "Test" (full log | diagnosis details | 🔁 rerun)

AssertionError: "Expected all tensors to be on ...evice, but inputs are on cpu and out is on cuda:0"

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "C:\Users\circleci\project\build\win_tmp\build\torch\testing\_internal\common_device_type.py", line 292, in instantiated_test
    result = test_fn(self, *args)
  File "C:\Users\circleci\project\build\win_tmp\build\torch\testing\_internal\common_device_type.py", line 723, in only_fn
    return fn(slf, device, *args, **kwargs)
  File "test_tensor_creation_ops.py", line 721, in test_cat_stack_cross_devices
    torch.cat((cpu, cpu), out=out_cuda)
AssertionError: "Expected all tensors to be on the same device" does not match "torch.cat(): all input tensors and out must be on the same device, but inputs are on cpu and out is on cuda:0"

----------------------------------------------------------------------
Ran 561 tests in 131.261s

FAILED (failures=1, skipped=78)

Generating XML reports...
Generated XML report: test-reports\python-unittest\test_tensor_creation_ops\TEST-TestLikeTensorCreationCPU-20210423224708.xml
Generated XML report: test-reports\python-unittest\test_tensor_creation_ops\TEST-TestLikeTensorCreationCUDA-20210423224708.xml
Generated XML report: test-reports\python-unittest\test_tensor_creation_ops\TEST-TestRandomTensorCreationCPU-20210423224708.xml

See CircleCI build pytorch_linux_xenial_cuda10_2_cudnn7_py3_jit_legacy_test (3/5)

Step: "Run tests" (full log | diagnosis details | 🔁 rerun)

Apr 24 00:04:24 RuntimeError: CUDA error: an illegal memory access was encountered
Apr 24 00:04:24 ----------------------------------------------------------------------
Apr 24 00:04:24 Traceback (most recent call last):
Apr 24 00:04:24   File "/var/lib/jenkins/workspace/test/jit/test_data_parallel.py", line 88, in test_python_submodule_script
Apr 24 00:04:24     self.check_replicas(module, replicas)
Apr 24 00:04:24   File "/var/lib/jenkins/workspace/test/jit/test_data_parallel.py", line 82, in check_replicas
Apr 24 00:04:24     self.assertEqual(replica(replica_input).data, expected_output)
Apr 24 00:04:24   File "/opt/conda/lib/python3.6/site-packages/torch/testing/_internal/common_utils.py", line 1321, in assertEqual
Apr 24 00:04:24     exact_device=exact_device)
Apr 24 00:04:24   File "/opt/conda/lib/python3.6/site-packages/torch/testing/_internal/common_utils.py", line 1168, in _compareTensors
Apr 24 00:04:24     b = b.cpu()
Apr 24 00:04:24 RuntimeError: CUDA error: an illegal memory access was encountered
Apr 24 00:04:24 
Apr 24 00:04:24 ----------------------------------------------------------------------
Apr 24 00:04:24 Ran 117 tests in 36.702s
Apr 24 00:04:24 
Apr 24 00:04:24 FAILED (errors=1, skipped=1)
Apr 24 00:04:24 
Apr 24 00:04:24 Generating XML reports...
Apr 24 00:04:24 Generated XML report: test-reports/python-unittest/test_jit_legacy/TEST-jit.test_async.TestAsync-20210424000347.xml
Apr 24 00:04:24 Generated XML report: test-reports/python-unittest/test_jit_legacy/TEST-jit.test_autodiff_subgraph_slicing.TestAutodiffSubgraphSlicing-20210424000347.xml
Apr 24 00:04:24 Generated XML report: test-reports/python-unittest/test_jit_legacy/TEST-jit.test_backends.TestBackends-20210424000347.xml

See CircleCI build pytorch_linux_xenial_cuda10_2_cudnn7_py3_gcc7_test1 (4/5)

Step: "Run tests" (full log | diagnosis details | 🔁 rerun)

Apr 24 00:08:44 AssertionError: False is not tr...6164904832839966), which occurred at index (0, 0).
Apr 24 00:08:44   File "/opt/conda/lib/python3.6/site-packages/torch/testing/_internal/common_utils.py", line 1030, in wrapper
Apr 24 00:08:44     method(*args, **kwargs)
Apr 24 00:08:44   File "/opt/conda/lib/python3.6/site-packages/torch/testing/_internal/common_device_type.py", line 292, in instantiated_test
Apr 24 00:08:44     result = test_fn(self, *args)
Apr 24 00:08:44   File "/opt/conda/lib/python3.6/site-packages/torch/testing/_internal/common_device_type.py", line 747, in multi_fn
Apr 24 00:08:44     return fn(slf, devices, *args, **kwargs)
Apr 24 00:08:44   File "test_torch.py", line 7336, in test_multidevice_serialization
Apr 24 00:08:44     self.assertEqual(cp, original)
Apr 24 00:08:44   File "/opt/conda/lib/python3.6/site-packages/torch/testing/_internal/common_utils.py", line 1326, in assertEqual
Apr 24 00:08:44     super().assertTrue(result, msg=self._get_assert_msg(msg, debug_msg=debug_msg))
Apr 24 00:08:44 AssertionError: False is not true : Tensors failed to compare as equal!With rtol=1.3e-06 and atol=1e-05, found 4532020583610935537 element(s) (out of 16) whose difference(s) exceeded the margin of error (including 0 nan comparisons). The greatest difference was 5.3299023951003925e-43 (-1.6164904832839966 vs. -1.6164904832839966), which occurred at index (0, 0).
Apr 24 00:08:44 
Apr 24 00:08:44 ----------------------------------------------------------------------
Apr 24 00:08:44 Ran 14 tests in 2.541s
Apr 24 00:08:44 
Apr 24 00:08:44 FAILED (failures=1)
Apr 24 00:08:44 
Apr 24 00:08:44 Generating XML reports...
Apr 24 00:08:44 Generated XML report: test-reports/python-unittest/test_torch/TEST-TestDevicePrecisionCUDA-20210424000841.xml
Apr 24 00:08:44 Traceback (most recent call last):
Apr 24 00:08:44   File "test/run_test.py", line 1156, in <module>

See CircleCI build pytorch_windows_vs2019_py36_cuda10.1_test1 (5/5)

Step: "Test" (full log | diagnosis details | 🔁 rerun)

AssertionError: "Expected all tensors to be on ... arugment for argument mat2 in method wrapper_mm)"
======================================================================
FAIL [0.004s]: test_matmul_device_mismatch (__main__.TestCudaComm)
----------------------------------------------------------------------
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cpu and cuda:0! (when checking arugment for argument mat2 in method wrapper_mm)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "test_cuda.py", line 3836, in test_matmul_device_mismatch
    cpu @ cuda
AssertionError: "Expected all tensors to be on the same common device" does not match "Expected all tensors to be on the same device, but found at least two devices, cpu and cuda:0! (when checking arugment for argument mat2 in method wrapper_mm)"

----------------------------------------------------------------------
Ran 159 tests in 82.063s

FAILED (failures=1, skipped=67)

Generating XML reports...
Generated XML report: test-reports\python-unittest\test_cuda\TEST-TestCuda-20210423231007.xml
Generated XML report: test-reports\python-unittest\test_cuda\TEST-TestCudaComm-20210423231007.xml
Traceback (most recent call last):

ci.pytorch.org: 1 failed



@@ -4347,6 +4350,7 @@
SparseCUDA: hspmm_sparse_cuda

- func: copy_sparse_to_sparse_(Tensor(a!) self, Tensor src, bool non_blocking=False) -> Tensor(a!)
device_guard: False

wenleix (Contributor Author):

This calls into Tensor::to:

inline void copy_into_sparse(const SparseTensor& self, const Tensor& indices, const Tensor& values, bool non_blocking) {
  alias_into_sparse(
      self,
      indices.to(self._indices().options(), non_blocking, /*copy=*/true),
      values.to(self._values().options(), non_blocking, /*copy=*/true));
}

which handles the device guard:

if (options.has_device()) {
  options = options.device(ensure_has_index(options.device()));
}

Another contributor replied:

Maybe this should have been a composite haha

For kernels run on CUDA backend, the following preparations are required:

1. Make sure all the tensors are on the same device (except for
0-dimension tensor, which is allowed to stay in CPU). We will denote
this device as `common_device`.

2. Acquire the device guard on `common_device`.

Today, these two preparations are done separately. (1) is done in kernels
in an adhoc way, while (2) is done in the kernel wrapper (e.g. `RegisterCUDA.cpp`).

The kernel wrapper assumes (1) will be done in kernel, so it just pick up
the first tensor to acquire device guard. There are at least two issues with
the current approach:

1. Kernel may not implement the device check, or implement in an inconsistent way.

2. If the first tensor happened to be a 0-dimension CPU tensor, (2) will acquire
the wrong device guard. We have seen issues before
(db2b273) thus kernel might need to re-acquire
device guard in such situation, which is fragile.


As a concrete example, here are a few implementations in codebase to do the device checks:

a. `TensorIteratorBase::compute_types` in `ATen/TensorIterator.cpp`. Used in kernels rely on `TensorIterator` to perform element-wise operations: https://github.com/pytorch/pytorch/blob/6d030c14cf4bd2e59b6e4899a9ee4645d2199e68/aten/src/ATen/TensorIterator.cpp#L326-L340

b.  `checkSameGPU` and `checkAllSameGPU` in `Aten/TensorUtils.cpp`:. Used in a lot of CUDA kernels. https://github.com/pytorch/pytorch/blob/6d030c14cf4bd2e59b6e4899a9ee4645d2199e68/aten/src/ATen/TensorUtils.cpp#L130-L152

c. `check_attributes` in `ATen/native/RNN.h`. Used in a few RNN kernels: https://github.com/pytorch/pytorch/blob/6d030c14cf4bd2e59b6e4899a9ee4645d2199e68/aten/src/ATen/native/RNN.h#L30-L50

d. `dot_check` in `ATen/native/cuda/LinearAlgebra.cu`.  https://github.com/pytorch/pytorch/blob/6d030c14cf4bd2e59b6e4899a9ee4645d2199e68/aten/src/ATen/native/cuda/LinearAlgebra.cu#L378-L383
These device checks have slightly different assumption. For example, (a) allows 0-dimension CPU tensors, (b) assumes everything is on GPU device, and (c) also checks all tensors in `TensorList`. 




Differential Revision: [D27540071](https://our.internmc.facebook.com/intern/diff/D27540071/)

[ghstack-poisoned]
…apper"


For kernels run on CUDA backend, the following preparations are required:

1. Make sure all the tensors are on the same device (except for
0-dimension tensor, which is allowed to stay in CPU). We will denote
this device as `common_device`.

2. Acquire the device guard on `common_device`.

Today, these two preparations are done separately. (1) is done in kernels
in an adhoc way, while (2) is done in the kernel wrapper (e.g. `RegisterCUDA.cpp`).

The kernel wrapper assumes (1) will be done in kernel, so it just pick up
the first tensor to acquire device guard. There are at least two issues with
the current approach:

1. Kernel may not implement the device check, or implement in an inconsistent way.

2. If the first tensor happened to be a 0-dimension CPU tensor, (2) will acquire
the wrong device guard. We have seen issues before
(db2b273) thus kernel might need to re-acquire
device guard in such situation, which is fragile.


As a concrete example, here are a few implementations in codebase to do the device checks:

a. `TensorIteratorBase::compute_types` in `ATen/TensorIterator.cpp`. Used in kernels rely on `TensorIterator` to perform element-wise operations: https://github.com/pytorch/pytorch/blob/6d030c14cf4bd2e59b6e4899a9ee4645d2199e68/aten/src/ATen/TensorIterator.cpp#L326-L340

b.  `checkSameGPU` and `checkAllSameGPU` in `Aten/TensorUtils.cpp`:. Used in a lot of CUDA kernels. https://github.com/pytorch/pytorch/blob/6d030c14cf4bd2e59b6e4899a9ee4645d2199e68/aten/src/ATen/TensorUtils.cpp#L130-L152

c. `check_attributes` in `ATen/native/RNN.h`. Used in a few RNN kernels: https://github.com/pytorch/pytorch/blob/6d030c14cf4bd2e59b6e4899a9ee4645d2199e68/aten/src/ATen/native/RNN.h#L30-L50

d. `dot_check` in `ATen/native/cuda/LinearAlgebra.cu`.  https://github.com/pytorch/pytorch/blob/6d030c14cf4bd2e59b6e4899a9ee4645d2199e68/aten/src/ATen/native/cuda/LinearAlgebra.cu#L378-L383
These device checks have slightly different assumption. For example, (a) allows 0-dimension CPU tensors, (b) assumes everything is on GPU device, and (c) also checks all tensors in `TensorList`. 




Differential Revision: [D27540071](https://our.internmc.facebook.com/intern/diff/D27540071/)

[ghstack-poisoned]
For kernels run on CUDA backend, the following preparation steps are required for most kernels:

1. Make sure all the tensors are on the same device (except for
0-dimension tensor, which is allowed to stay in CPU). We will denote
this device as `common_device`.

2. Acquire the device guard on `common_device`.

Today, these two steps are done separately. (1) is done in kernels
in an adhoc way, while (2) is done in the kernel wrapper (e.g. `RegisterCUDA.cpp`).

The kernel wrapper assumes (1) will be done in kernel, so it just pick up
the first tensor to acquire device guard. There are at least two issues with
the current approach:

1. Kernel may not implement the device check, or implement in an inconsistent way.

2. If the first tensor happened to be a 0-dimension CPU tensor, (2) will acquire
the wrong device guard. We have seen issues before
(db2b273) thus kernel might need to re-acquire
device guard in such situation, which is fragile.

# Appendix

As a concrete example, here are a few implementations in codebase to do the device checks:

a. `TensorIteratorBase::compute_types` in `ATen/TensorIterator.cpp`. Used in kernels rely on `TensorIterator` to perform element-wise operations: https://github.com/pytorch/pytorch/blob/6d030c14cf4bd2e59b6e4899a9ee4645d2199e68/aten/src/ATen/TensorIterator.cpp#L326-L340

b.  `checkSameGPU` and `checkAllSameGPU` in `Aten/TensorUtils.cpp`:. Used in a lot of CUDA kernels. https://github.com/pytorch/pytorch/blob/6d030c14cf4bd2e59b6e4899a9ee4645d2199e68/aten/src/ATen/TensorUtils.cpp#L130-L152

c. `check_attributes` in `ATen/native/RNN.h`. Used in a few RNN kernels: https://github.com/pytorch/pytorch/blob/6d030c14cf4bd2e59b6e4899a9ee4645d2199e68/aten/src/ATen/native/RNN.h#L30-L50

d. `dot_check` in `ATen/native/cuda/LinearAlgebra.cu`.  https://github.com/pytorch/pytorch/blob/6d030c14cf4bd2e59b6e4899a9ee4645d2199e68/aten/src/ATen/native/cuda/LinearAlgebra.cu#L378-L383
These device checks have slightly different assumption. For example, (a) allows 0-dimension CPU tensors, (b) assumes everything is on GPU device, and (c) also checks all tensors in `TensorList`. 




Differential Revision: [D27540071](https://our.internmc.facebook.com/intern/diff/D27540071/)

[ghstack-poisoned]
For kernels run on CUDA backend, the following preparation steps are required for most kernels:

1. Make sure all the tensors are on the same device (except for
0-dimension tensor, which is allowed to stay in CPU). We will denote
this device as `common_device`.

2. Acquire the device guard on `common_device`.

Today, these two steps are done separately. (1) is done in kernels
in an adhoc way, while (2) is done in the kernel wrapper (e.g. `RegisterCUDA.cpp`).

The kernel wrapper assumes (1) will be done in kernel, so it just pick up
the first tensor to acquire device guard. There are at least two issues with
the current approach:

1. Kernel may not implement the device check, or implement in an inconsistent way.

2. If the first tensor happened to be a 0-dimension CPU tensor, (2) will acquire
the wrong device guard. We have seen issues before
(db2b273) thus kernel might need to re-acquire
device guard in such situation, which is fragile.

# Appendix

As a concrete example, here are a few implementations in codebase to do the device checks:

a. `TensorIteratorBase::compute_types` in `ATen/TensorIterator.cpp`. Used in kernels rely on `TensorIterator` to perform element-wise operations: https://github.com/pytorch/pytorch/blob/6d030c14cf4bd2e59b6e4899a9ee4645d2199e68/aten/src/ATen/TensorIterator.cpp#L326-L340

b.  `checkSameGPU` and `checkAllSameGPU` in `Aten/TensorUtils.cpp`:. Used in a lot of CUDA kernels. https://github.com/pytorch/pytorch/blob/6d030c14cf4bd2e59b6e4899a9ee4645d2199e68/aten/src/ATen/TensorUtils.cpp#L130-L152

c. `check_attributes` in `ATen/native/RNN.h`. Used in a few RNN kernels: https://github.com/pytorch/pytorch/blob/6d030c14cf4bd2e59b6e4899a9ee4645d2199e68/aten/src/ATen/native/RNN.h#L30-L50

d. `dot_check` in `ATen/native/cuda/LinearAlgebra.cu`.  https://github.com/pytorch/pytorch/blob/6d030c14cf4bd2e59b6e4899a9ee4645d2199e68/aten/src/ATen/native/cuda/LinearAlgebra.cu#L378-L383
These device checks have slightly different assumption. For example, (a) allows 0-dimension CPU tensors, (b) assumes everything is on GPU device, and (c) also checks all tensors in `TensorList`. 




Differential Revision: [D27540071](https://our.internmc.facebook.com/intern/diff/D27540071/)

[ghstack-poisoned]
For kernels run on CUDA backend, the following preparation steps are required for most kernels:

1. Make sure all the tensors are on the same device (except for
0-dimension tensor, which is allowed to stay in CPU). We will denote
this device as `common_device`.

2. Acquire the device guard on `common_device`.

Today, these two steps are done separately. (1) is done in kernels
in an adhoc way, while (2) is done in the kernel wrapper (e.g. `RegisterCUDA.cpp`).

The kernel wrapper assumes (1) will be done in kernel, so it just pick up
the first tensor to acquire device guard. There are at least two issues with
the current approach:

1. Kernel may not implement the device check, or implement in an inconsistent way.

2. If the first tensor happened to be a 0-dimension CPU tensor, (2) will acquire
the wrong device guard. We have seen issues before
(db2b273) thus kernel might need to re-acquire
device guard in such situation, which is fragile.

# Appendix

As a concrete example, here are a few implementations in codebase to do the device checks:

a. `TensorIteratorBase::compute_types` in `ATen/TensorIterator.cpp`. Used in kernels rely on `TensorIterator` to perform element-wise operations: https://github.com/pytorch/pytorch/blob/6d030c14cf4bd2e59b6e4899a9ee4645d2199e68/aten/src/ATen/TensorIterator.cpp#L326-L340

b.  `checkSameGPU` and `checkAllSameGPU` in `Aten/TensorUtils.cpp`:. Used in a lot of CUDA kernels. https://github.com/pytorch/pytorch/blob/6d030c14cf4bd2e59b6e4899a9ee4645d2199e68/aten/src/ATen/TensorUtils.cpp#L130-L152

c. `check_attributes` in `ATen/native/RNN.h`. Used in a few RNN kernels: https://github.com/pytorch/pytorch/blob/6d030c14cf4bd2e59b6e4899a9ee4645d2199e68/aten/src/ATen/native/RNN.h#L30-L50

d. `dot_check` in `ATen/native/cuda/LinearAlgebra.cu`.  https://github.com/pytorch/pytorch/blob/6d030c14cf4bd2e59b6e4899a9ee4645d2199e68/aten/src/ATen/native/cuda/LinearAlgebra.cu#L378-L383
These device checks have slightly different assumption. For example, (a) allows 0-dimension CPU tensors, (b) assumes everything is on GPU device, and (c) also checks all tensors in `TensorList`. 




Differential Revision: [D27540071](https://our.internmc.facebook.com/intern/diff/D27540071/)

[ghstack-poisoned]
wenleix pushed a commit that referenced this pull request Apr 20, 2021
Pull Request resolved: #55241

For kernels run on CUDA backend, the following preparation steps are required for most kernels:

1. Make sure all the tensors are on the same device (except for
0-dimension tensor, which is allowed to stay in CPU). We will denote
this device as `common_device`.

2. Acquire the device guard on `common_device`.

Today, these two steps are done separately. (1) is done in kernels
in an adhoc way, while (2) is done in the kernel wrapper (e.g. `RegisterCUDA.cpp`).

The kernel wrapper assumes (1) will be done in kernel, so it just pick up
the first tensor to acquire device guard. There are at least two issues with
the current approach:

1. Kernel may not implement the device check, or implement in an inconsistent way.

2. If the first tensor happened to be a 0-dimension CPU tensor, (2) will acquire
the wrong device guard. We have seen issues before
(db2b273) thus kernel might need to re-acquire
device guard in such situation, which is fragile.

# Appendix

As a concrete example, here are a few implementations in codebase to do the device checks:

a. `TensorIteratorBase::compute_types` in `ATen/TensorIterator.cpp`. Used in kernels rely on `TensorIterator` to perform element-wise operations: https://github.com/pytorch/pytorch/blob/6d030c14cf4bd2e59b6e4899a9ee4645d2199e68/aten/src/ATen/TensorIterator.cpp#L326-L340

b.  `checkSameGPU` and `checkAllSameGPU` in `Aten/TensorUtils.cpp`:. Used in a lot of CUDA kernels. https://github.com/pytorch/pytorch/blob/6d030c14cf4bd2e59b6e4899a9ee4645d2199e68/aten/src/ATen/TensorUtils.cpp#L130-L152

c. `check_attributes` in `ATen/native/RNN.h`. Used in a few RNN kernels: https://github.com/pytorch/pytorch/blob/6d030c14cf4bd2e59b6e4899a9ee4645d2199e68/aten/src/ATen/native/RNN.h#L30-L50

d. `dot_check` in `ATen/native/cuda/LinearAlgebra.cu`.  https://github.com/pytorch/pytorch/blob/6d030c14cf4bd2e59b6e4899a9ee4645d2199e68/aten/src/ATen/native/cuda/LinearAlgebra.cu#L378-L383
These device checks have slightly different assumption. For example, (a) allows 0-dimension CPU tensors, (b) assumes everything is on GPU device, and (c) also checks all tensors in `TensorList`.






ghstack-source-id: 126921579

Differential Revision: [D27540071](https://our.internmc.facebook.com/intern/diff/D27540071/)

**NOTE FOR REVIEWERS**: This PR has internal Facebook specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D27540071/)!
…pper"



Differential Revision: [D27540071](https://our.internmc.facebook.com/intern/diff/D27540071/)

[ghstack-poisoned]
wenleix pushed a commit that referenced this pull request Apr 20, 2021
Pull Request resolved: #55241


ghstack-source-id: 126924122

Differential Revision: [D27540071](https://our.internmc.facebook.com/intern/diff/D27540071/)

**NOTE FOR REVIEWERS**: This PR has internal Facebook specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D27540071/)!

Differential Revision: [D27540071](https://our.internmc.facebook.com/intern/diff/D27540071/)

[ghstack-poisoned]

Differential Revision: [D27540071](https://our.internmc.facebook.com/intern/diff/D27540071/)

[ghstack-poisoned]
wenleix pushed a commit that referenced this pull request Apr 20, 2021
Pull Request resolved: #55241


ghstack-source-id: 126967104

Differential Revision: [D27540071](https://our.internmc.facebook.com/intern/diff/D27540071/)

**NOTE FOR REVIEWERS**: This PR has internal Facebook specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D27540071/)!

Differential Revision: [D27540071](https://our.internmc.facebook.com/intern/diff/D27540071/)

[ghstack-poisoned]
wenleix pushed a commit that referenced this pull request Apr 21, 2021
Pull Request resolved: #55241


ghstack-source-id: 127011444

Differential Revision: [D27540071](https://our.internmc.facebook.com/intern/diff/D27540071/)

**NOTE FOR REVIEWERS**: This PR has internal Facebook specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D27540071/)!

Differential Revision: [D27540071](https://our.internmc.facebook.com/intern/diff/D27540071/)

[ghstack-poisoned]
wenleix pushed a commit that referenced this pull request Apr 22, 2021
Pull Request resolved: #55241


ghstack-source-id: 127145739

Differential Revision: [D27540071](https://our.internmc.facebook.com/intern/diff/D27540071/)

**NOTE FOR REVIEWERS**: This PR has internal Facebook specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D27540071/)!
@wenleix
Contributor Author

wenleix commented Apr 23, 2021

One issue that caused a CUDA memory problem is `normal_` -- while it uses `TensorIterator`, it only uses the metadata part of `TensorIterator` (but not the execution part), so the device guard is not fetched for it.
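
A rough sketch of this metadata-only pattern, assuming the kernel then launches its own CUDA code instead of going through TI's execution path (this is not the actual `normal_` implementation):

```cpp
#include <ATen/ATen.h>
#include <ATen/TensorIterator.h>

// The iterator is built only for shape/dtype/device bookkeeping; since no
// for_each()/gpu_kernel() call follows, a device guard acquired inside TI's
// execution path would never run, and the kernel must set the device itself.
void metadata_only_sketch(at::Tensor& self) {
  auto iter = at::TensorIteratorConfig()
                  .add_output(self)
                  .check_all_same_dtype(false)
                  .build();
  // ... a hand-rolled CUDA launch would go here instead of iter.for_each(...)
  (void)iter;
}
```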

…ard in kernel wrapper"



Differential Revision: [D27540071](https://our.internmc.facebook.com/intern/diff/D27540071/)

[ghstack-poisoned]
@wenleix
Contributor Author

wenleix commented Apr 23, 2021

We found that making TI fetch the device guard is a bit tricky:

  • We could fetch the device guard at execution time. However, TI does not have a single execution entry point: usually it is for_each(), but it can also be parallel_reduce() or serial_for_each(). To make things more complicated, for_each() might delegate to serial_for_each(), but users might also call serial_for_each() directly.

  • Some kernels only use TI for metadata management, not the execution part (e.g. normal_). There are also kernels that use TI but may or may not use the TI execution path (e.g. copy_).

Given this, and the general complexity of this PR (device check + TI device guard + ...), we will fall back to a more incremental approach by first just adding more device checks, as mentioned in #56570 (comment). cc @ezyang
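
To make the entry-point issue above concrete, here is a small self-contained toy model (not the real `TensorIterator`) of why acquiring the guard in only one execution entry point is not enough:

```cpp
#include <functional>
#include <iostream>

// Toy iterator with three execution entry points, mirroring for_each(),
// serial_for_each(), and parallel_reduce(). The "guard" is only acquired in
// for_each(), so callers that go straight to serial_for_each() -- or that use
// the iterator for metadata only -- never get it.
struct ToyIter {
  int device = 1;  // pretend this is the common CUDA device index

  void for_each(const std::function<void()>& loop) {
    std::cout << "guard set for device " << device << "\n";  // guard acquired here
    serial_for_each(loop);                                   // delegated path is covered
  }
  void serial_for_each(const std::function<void()>& loop) {
    loop();  // a direct call skips the guard entirely
  }
  void parallel_reduce(const std::function<void()>& loop) {
    loop();  // a third entry point that would need its own guard as well
  }
};

int main() {
  ToyIter it;
  it.for_each([] { std::cout << "guarded launch\n"; });
  it.serial_for_each([] { std::cout << "unguarded launch\n"; });
}
```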


Differential Revision: [D27540071](https://our.internmc.facebook.com/intern/diff/D27540071/)

[ghstack-poisoned]
wenleix pushed a commit that referenced this pull request Apr 23, 2021
Pull Request resolved: #55241

For kernels run on CUDA backend, the following preparation steps are required for most kernels:

1. Make sure all the tensors are on the same device (except for
0-dimension tensor, which is allowed to stay in CPU). We will denote
this device as `common_device`.

2. Acquire the device guard on `common_device`.

Today, these two steps are done separately. (1) is done in kernels
in an adhoc way, while (2) is done in the kernel wrapper (e.g. `RegisterCUDA.cpp`).

The kernel wrapper assumes (1) will be done in kernel, so it just pick up
the first tensor to acquire device guard. There are at least two issues with
the current approach:

1. Kernel may not implement the device check, or implement in an inconsistent way.

2. If the first tensor happened to be a 0-dimension CPU tensor, (2) will acquire
the wrong device guard. We have seen issues before
(db2b273) thus kernel might need to re-acquire
device guard in such situation, which is fragile.

# Appendix

As a concrete example, here are a few implementations in codebase to do the device checks:

a. `TensorIteratorBase::compute_types` in `ATen/TensorIterator.cpp`. Used in kernels rely on `TensorIterator` to perform element-wise operations: https://github.com/pytorch/pytorch/blob/6d030c14cf4bd2e59b6e4899a9ee4645d2199e68/aten/src/ATen/TensorIterator.cpp#L326-L340

b.  `checkSameGPU` and `checkAllSameGPU` in `Aten/TensorUtils.cpp`:. Used in a lot of CUDA kernels. https://github.com/pytorch/pytorch/blob/6d030c14cf4bd2e59b6e4899a9ee4645d2199e68/aten/src/ATen/TensorUtils.cpp#L130-L152

c. `check_attributes` in `ATen/native/RNN.h`. Used in a few RNN kernels: https://github.com/pytorch/pytorch/blob/6d030c14cf4bd2e59b6e4899a9ee4645d2199e68/aten/src/ATen/native/RNN.h#L30-L50

d. `dot_check` in `ATen/native/cuda/LinearAlgebra.cu`.  https://github.com/pytorch/pytorch/blob/6d030c14cf4bd2e59b6e4899a9ee4645d2199e68/aten/src/ATen/native/cuda/LinearAlgebra.cu#L378-L383
These device checks have slightly different assumption. For example, (a) allows 0-dimension CPU tensors, (b) assumes everything is on GPU device, and (c) also checks all tensors in `TensorList`.






ghstack-source-id: 127311961

Differential Revision: [D27540071](https://our.internmc.facebook.com/intern/diff/D27540071/)

**NOTE FOR REVIEWERS**: This PR has internal Facebook specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D27540071/)!
@wenleix
Contributor Author

wenleix commented Apr 23, 2021

Command to generate the changes in native_functions.yaml (a few are missing): https://gist.github.com/wenleix/a7d1c7eb9c504738c3c863c9a4c69839

@github-actions

Looks like this PR hasn't been updated in a while so we're going to go ahead and mark this as Stale.
Feel free to remove the Stale label if you feel this was a mistake.
If you are unable to remove the Stale label please contact a maintainer in order to do so.
If you want the bot to never mark this PR stale again, add the no-stale label.
Stale pull requests will automatically be closed after 30 days of inactivity.

@github-actions github-actions bot added the Stale label Apr 13, 2022
@github-actions github-actions bot closed this May 13, 2022
@facebook-github-bot facebook-github-bot deleted the gh/wenleix/18/head branch June 12, 2022 14:24