
[RFC] Unify device check and guard in kernel wrapper #55241

Closed
wants to merge 34 commits

Conversation


@wenleix wenleix commented Apr 2, 2021

Stack from ghstack:

For most kernels run on the CUDA backend, the following preparation steps are required:

  1. Make sure all the tensors are on the same device (except for
    0-dimension tensors, which are allowed to stay on CPU). We will denote
    this device as common_device.

  2. Acquire the device guard on common_device (illustrated right below).
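
For reference, "acquiring the device guard" means constructing a RAII guard that switches the current CUDA device to common_device for the duration of the kernel. A minimal illustration using the existing c10::cuda::CUDAGuard (the helper function itself is hypothetical, not code from this PR):

#include <ATen/ATen.h>
#include <c10/cuda/CUDAGuard.h>

// Hypothetical helper, for illustration only. `t` is assumed to live on a
// CUDA device (CUDAGuard rejects CPU devices).
void launch_on_device_of(const at::Tensor& t) {
  // RAII guard: sets the current CUDA device to t's device and restores
  // the previous device when the guard goes out of scope.
  c10::cuda::CUDAGuard guard(t.device());
  // ... launch CUDA work for t here ...
}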

Today, these two steps are done separately: (1) is done in kernels
in an ad hoc way, while (2) is done in the kernel wrapper (e.g. RegisterCUDA.cpp).

The kernel wrapper assumes (1) will be done in the kernel, so it just picks up
the first tensor to acquire the device guard. There are at least two issues with
the current approach:

  1. A kernel may not implement the device check, or may implement it in an inconsistent way.

  2. If the first tensor happens to be a 0-dimension CPU tensor, (2) will acquire
    the wrong device guard. We have seen such issues before
    (db2b273), so the kernel might need to re-acquire the
    device guard in that situation, which is fragile.
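
To make this concrete, below is a minimal sketch (not the actual codegen output proposed in this PR; the helper, wrapper, and kernel names are hypothetical) of what a unified check-and-guard in a generated CUDA wrapper could look like: compute common_device over all tensor arguments while skipping 0-dimension CPU tensors, check it, and acquire the guard on that checked device instead of on the first argument.

#include <ATen/ATen.h>
#include <c10/cuda/CUDAGuard.h>
#include <c10/util/Optional.h>

namespace {

// Fold one tensor into the common device. Undefined tensors and
// 0-dimension CPU tensors (CPU scalars) are skipped.
void update_common_device(c10::optional<at::Device>& common_device,
                          const at::Tensor& t) {
  if (!t.defined()) return;
  if (t.device().is_cpu() && t.dim() == 0) return;
  if (!common_device.has_value()) {
    common_device = t.device();
    return;
  }
  TORCH_CHECK(*common_device == t.device(),
      "Expected all tensors to be on the same device, but found at least ",
      "two devices, ", *common_device, " and ", t.device(), "!");
}

// Hypothetical CUDA kernel entry point (normally registered via codegen).
at::Tensor add_kernel_cuda(const at::Tensor& self, const at::Tensor& other);

} // namespace

// Hypothetical generated wrapper, i.e. the kind of code emitted into
// RegisterCUDA.cpp. Simplified: assumes at least one argument is a CUDA tensor.
at::Tensor wrapper_add(const at::Tensor& self, const at::Tensor& other) {
  c10::optional<at::Device> common_device;
  update_common_device(common_device, self);
  update_common_device(common_device, other);
  // The guard is acquired on the checked common_device, so a leading
  // 0-dimension CPU tensor can no longer select the wrong device.
  c10::cuda::OptionalCUDAGuard guard(common_device);
  return add_kernel_cuda(self, other);
}

With such a scheme, the same-device check and the device guard come from one place, so individual kernels no longer need their own ad hoc variants like the ones listed in the appendix below.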

Appendix

As concrete examples, here are a few implementations in the codebase that perform these device checks:

a. TensorIteratorBase::compute_types in ATen/TensorIterator.cpp. Used by kernels that rely on TensorIterator to perform element-wise operations:

// Checks all tensors are on the same device, if requested
if (config.check_all_same_device_) {
  // Handles CPU scalars on CUDA kernels that support them
  if (!common_device.is_cpu() &&
      config.allow_cpu_scalars_ && !op.is_output && op.tensor.dim() == 0 &&
      op.tensor.is_cpu()) {
    TORCH_CHECK(current_cpu_scalars_on_non_cpu < max_cpu_scalars_on_non_cpu,
                "Trying to pass too many CPU scalars to non-CPU kernel!");
    ++current_cpu_scalars_on_non_cpu;
  } else if (op.device != common_device) {
    TORCH_CHECK(false,
                "Expected all tensors to be on the same device, but "
                "found at least two devices, ", common_device, " and ", op.device, "!");
  }
}

b. checkSameGPU and checkAllSameGPU in ATen/TensorUtils.cpp. Used in many CUDA kernels:

void checkSameGPU(CheckedFrom c, const TensorArg& t1, const TensorArg& t2) {
  if (! (t1->is_cuda()) || ! (t2->is_cuda())) {
    std::ostringstream oss;
    if (! t1->is_cuda()) {
      oss << "Tensor for " << t1 << " is on CPU, ";
    }
    if (! t2->is_cuda()) {
      oss << "Tensor for " << t2 << " is on CPU, ";
    }
    oss << "but expected " << ((!(t1->is_cuda() || t2->is_cuda())) ? "them" : "it")
        << " to be on GPU (while checking arguments for " << c << ")";
    AT_ERROR(oss.str());
  }
  TORCH_CHECK(
      t1->get_device() == t2->get_device(),
      "Expected tensor for ", t1, " to have the same device as tensor for ", t2,
      "; but device ", t1->get_device(), " does not equal ", t2->get_device(),
      " (while checking arguments for ", c, ")");
}

void checkAllSameGPU(CheckedFrom c, ArrayRef<TensorArg> tensors) {
  checkAllSame(c, tensors, checkSameGPU);
}

c. check_attributes in ATen/native/RNN.h. Used in a few RNN kernels:

inline void check_attributes(const Tensor& input, const TensorList& params, const TensorList& hiddens, bool check_dtype=false) {
  auto input_device = input.device();
  auto input_dtype = input.scalar_type();

  auto check_tensors = [&](const std::string& name, const Tensor& t) {
    if (!t.defined()) return;
    auto t_device = t.device();
    TORCH_CHECK(input_device == t_device,
                "Input and ", name, " tensors are not at the same device, found input tensor at ",
                input_device, " and ", name, " tensor at ", t_device);
    if (check_dtype) {
      auto t_dtype = t.scalar_type();
      TORCH_CHECK(input_dtype == t_dtype,
                  "Input and ", name, " tensors are not the same dtype, found input tensor with ",
                  input_dtype, " and ", name, " tensor with ", t_dtype);
    }
  };

  for (auto h : hiddens) check_tensors("hidden", h);
  for (auto p : params) check_tensors("parameter", p);
}

d. dot_check in ATen/native/cuda/LinearAlgebra.cu:

TORCH_CHECK(
    self.device() == other.device(),
    "expected all tensors to be on the same device. Found: ",
    self.device(),
    ", ",
    other.device());

These device checks make slightly different assumptions. For example, (a) allows 0-dimension CPU tensors, (b) assumes everything is on a GPU device, and (c) also checks all tensors in a TensorList.
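
A unified, wrapper-level check would need to subsume these variations, e.g. by also folding every tensor of a TensorList argument into common_device. A hedged sketch, reusing the hypothetical update_common_device helper from the sketch above:

// Hypothetical overload for TensorList arguments (illustration only).
void update_common_device(c10::optional<at::Device>& common_device,
                          at::TensorList tensors) {
  for (const at::Tensor& t : tensors) {
    update_common_device(common_device, t);
  }
}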

Differential Revision: D27540071

facebook-github-bot commented Apr 2, 2021

💊 CI failures summary and remediations

As of commit 636d485 (more details on the Dr. CI page):


  • 7/7 failures possibly* introduced in this PR
    • 2/7 non-scanned failure(s)

🕵️ 5 new failures recognized by patterns

The following CI failures do not appear to be due to upstream breakages:

See CircleCI build pytorch_linux_xenial_cuda10_2_cudnn7_py3_gcc7_test2 (1/5)

Step: "Run tests" (full log | diagnosis details | 🔁 rerun)

Apr 24 00:19:11 AssertionError: False is not tr...difference of 0 (121 vs. 121) occuring at index 0.
Apr 24 00:19:11 ======================================================================
Apr 24 00:19:11 FAIL [1.666s]: test_to (__main__.PackedSequenceTest)
Apr 24 00:19:11 ----------------------------------------------------------------------
Apr 24 00:19:11 Traceback (most recent call last):
Apr 24 00:19:11   File "test_nn.py", line 190, in test_to
Apr 24 00:19:11     self.assertEqual(b, a.to(cuda))
Apr 24 00:19:11   File "/opt/conda/lib/python3.6/site-packages/torch/testing/_internal/common_utils.py", line 1360, in assertEqual
Apr 24 00:19:11     exact_dtype=exact_dtype, exact_device=exact_device)
Apr 24 00:19:11   File "/opt/conda/lib/python3.6/site-packages/torch/testing/_internal/common_utils.py", line 1326, in assertEqual
Apr 24 00:19:11     super().assertTrue(result, msg=self._get_assert_msg(msg, debug_msg=debug_msg))
Apr 24 00:19:11 AssertionError: False is not true : Tensors failed to compare as equal!Found 0 different element(s) (out of 12), with the greatest difference of 0 (121 vs. 121) occuring at index 0.
Apr 24 00:19:11 
Apr 24 00:19:11 ----------------------------------------------------------------------
Apr 24 00:19:11 Ran 1 test in 1.671s
Apr 24 00:19:11 
Apr 24 00:19:11 FAILED (failures=1)
Apr 24 00:19:11 
Apr 24 00:19:11 Generating XML reports...
Apr 24 00:19:11 Generated XML report: test-reports/python-unittest/test_nn/TEST-PackedSequenceTest-20210424001909.xml
Apr 24 00:19:12 Traceback (most recent call last):
Apr 24 00:19:12   File "test/run_test.py", line 1156, in <module>

See CircleCI build pytorch_windows_vs2019_py36_cuda10.1_test2 (2/5)

Step: "Test" (full log | diagnosis details | 🔁 rerun)

AssertionError: "Expected all tensors to be on ...evice, but inputs are on cpu and out is on cuda:0"

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "C:\Users\circleci\project\build\win_tmp\build\torch\testing\_internal\common_device_type.py", line 292, in instantiated_test
    result = test_fn(self, *args)
  File "C:\Users\circleci\project\build\win_tmp\build\torch\testing\_internal\common_device_type.py", line 723, in only_fn
    return fn(slf, device, *args, **kwargs)
  File "test_tensor_creation_ops.py", line 721, in test_cat_stack_cross_devices
    torch.cat((cpu, cpu), out=out_cuda)
AssertionError: "Expected all tensors to be on the same device" does not match "torch.cat(): all input tensors and out must be on the same device, but inputs are on cpu and out is on cuda:0"

----------------------------------------------------------------------
Ran 561 tests in 131.261s

FAILED (failures=1, skipped=78)

Generating XML reports...
Generated XML report: test-reports\python-unittest\test_tensor_creation_ops\TEST-TestLikeTensorCreationCPU-20210423224708.xml
Generated XML report: test-reports\python-unittest\test_tensor_creation_ops\TEST-TestLikeTensorCreationCUDA-20210423224708.xml
Generated XML report: test-reports\python-unittest\test_tensor_creation_ops\TEST-TestRandomTensorCreationCPU-20210423224708.xml

See CircleCI build pytorch_linux_xenial_cuda10_2_cudnn7_py3_jit_legacy_test (3/5)

Step: "Run tests" (full log | diagnosis details | 🔁 rerun)

Apr 24 00:04:24 RuntimeError: CUDA error: an illegal memory access was encountered
Apr 24 00:04:24 ----------------------------------------------------------------------
Apr 24 00:04:24 Traceback (most recent call last):
Apr 24 00:04:24   File "/var/lib/jenkins/workspace/test/jit/test_data_parallel.py", line 88, in test_python_submodule_script
Apr 24 00:04:24     self.check_replicas(module, replicas)
Apr 24 00:04:24   File "/var/lib/jenkins/workspace/test/jit/test_data_parallel.py", line 82, in check_replicas
Apr 24 00:04:24     self.assertEqual(replica(replica_input).data, expected_output)
Apr 24 00:04:24   File "/opt/conda/lib/python3.6/site-packages/torch/testing/_internal/common_utils.py", line 1321, in assertEqual
Apr 24 00:04:24     exact_device=exact_device)
Apr 24 00:04:24   File "/opt/conda/lib/python3.6/site-packages/torch/testing/_internal/common_utils.py", line 1168, in _compareTensors
Apr 24 00:04:24     b = b.cpu()
Apr 24 00:04:24 RuntimeError: CUDA error: an illegal memory access was encountered
Apr 24 00:04:24 
Apr 24 00:04:24 ----------------------------------------------------------------------
Apr 24 00:04:24 Ran 117 tests in 36.702s
Apr 24 00:04:24 
Apr 24 00:04:24 FAILED (errors=1, skipped=1)
Apr 24 00:04:24 
Apr 24 00:04:24 Generating XML reports...
Apr 24 00:04:24 Generated XML report: test-reports/python-unittest/test_jit_legacy/TEST-jit.test_async.TestAsync-20210424000347.xml
Apr 24 00:04:24 Generated XML report: test-reports/python-unittest/test_jit_legacy/TEST-jit.test_autodiff_subgraph_slicing.TestAutodiffSubgraphSlicing-20210424000347.xml
Apr 24 00:04:24 Generated XML report: test-reports/python-unittest/test_jit_legacy/TEST-jit.test_backends.TestBackends-20210424000347.xml

See CircleCI build pytorch_linux_xenial_cuda10_2_cudnn7_py3_gcc7_test1 (4/5)

Step: "Run tests" (full log | diagnosis details | 🔁 rerun)

Apr 24 00:08:44 AssertionError: False is not tr...6164904832839966), which occurred at index (0, 0).
Apr 24 00:08:44   File "/opt/conda/lib/python3.6/site-packages/torch/testing/_internal/common_utils.py", line 1030, in wrapper
Apr 24 00:08:44     method(*args, **kwargs)
Apr 24 00:08:44   File "/opt/conda/lib/python3.6/site-packages/torch/testing/_internal/common_device_type.py", line 292, in instantiated_test
Apr 24 00:08:44     result = test_fn(self, *args)
Apr 24 00:08:44   File "/opt/conda/lib/python3.6/site-packages/torch/testing/_internal/common_device_type.py", line 747, in multi_fn
Apr 24 00:08:44     return fn(slf, devices, *args, **kwargs)
Apr 24 00:08:44   File "test_torch.py", line 7336, in test_multidevice_serialization
Apr 24 00:08:44     self.assertEqual(cp, original)
Apr 24 00:08:44   File "/opt/conda/lib/python3.6/site-packages/torch/testing/_internal/common_utils.py", line 1326, in assertEqual
Apr 24 00:08:44     super().assertTrue(result, msg=self._get_assert_msg(msg, debug_msg=debug_msg))
Apr 24 00:08:44 AssertionError: False is not true : Tensors failed to compare as equal!With rtol=1.3e-06 and atol=1e-05, found 4532020583610935537 element(s) (out of 16) whose difference(s) exceeded the margin of error (including 0 nan comparisons). The greatest difference was 5.3299023951003925e-43 (-1.6164904832839966 vs. -1.6164904832839966), which occurred at index (0, 0).
Apr 24 00:08:44 
Apr 24 00:08:44 ----------------------------------------------------------------------
Apr 24 00:08:44 Ran 14 tests in 2.541s
Apr 24 00:08:44 
Apr 24 00:08:44 FAILED (failures=1)
Apr 24 00:08:44 
Apr 24 00:08:44 Generating XML reports...
Apr 24 00:08:44 Generated XML report: test-reports/python-unittest/test_torch/TEST-TestDevicePrecisionCUDA-20210424000841.xml
Apr 24 00:08:44 Traceback (most recent call last):
Apr 24 00:08:44   File "test/run_test.py", line 1156, in <module>

See CircleCI build pytorch_windows_vs2019_py36_cuda10.1_test1 (5/5)

Step: "Test" (full log | diagnosis details | 🔁 rerun)

AssertionError: "Expected all tensors to be on ... arugment for argument mat2 in method wrapper_mm)"
======================================================================
FAIL [0.004s]: test_matmul_device_mismatch (__main__.TestCudaComm)
----------------------------------------------------------------------
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cpu and cuda:0! (when checking arugment for argument mat2 in method wrapper_mm)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "test_cuda.py", line 3836, in test_matmul_device_mismatch
    cpu @ cuda
AssertionError: "Expected all tensors to be on the same common device" does not match "Expected all tensors to be on the same device, but found at least two devices, cpu and cuda:0! (when checking arugment for argument mat2 in method wrapper_mm)"

----------------------------------------------------------------------
Ran 159 tests in 82.063s

FAILED (failures=1, skipped=67)

Generating XML reports...
Generated XML report: test-reports\python-unittest\test_cuda\TEST-TestCuda-20210423231007.xml
Generated XML report: test-reports\python-unittest\test_cuda\TEST-TestCudaComm-20210423231007.xml
Traceback (most recent call last):

ci.pytorch.org: 1 failed



@@ -4347,6 +4350,7 @@
SparseCUDA: hspmm_sparse_cuda

- func: copy_sparse_to_sparse_(Tensor(a!) self, Tensor src, bool non_blocking=False) -> Tensor(a!)
device_guard: False

wenleix (Contributor Author):

This calls into Tensor::to:

inline void copy_into_sparse(const SparseTensor& self, const Tensor& indices, const Tensor& values, bool non_blocking) {
  alias_into_sparse(
      self,
      indices.to(self._indices().options(), non_blocking, /*copy=*/true),
      values.to(self._values().options(), non_blocking, /*copy=*/true));
}

which handles the device guard:

if (options.has_device()) {
  options = options.device(ensure_has_index(options.device()));
}

Another contributor replied:

Maybe this should have been a composite haha

For kernels run on CUDA backend, the following preparations are required:

1. Make sure all the tensors are on the same device (except for
0-dimension tensor, which is allowed to stay in CPU). We will denote
this device as `common_device`.

2. Acquire the device guard on `common_device`.

Today, these two preparations are done separately. (1) is done in kernels
in an adhoc way, while (2) is done in the kernel wrapper (e.g. `RegisterCUDA.cpp`).

The kernel wrapper assumes (1) will be done in kernel, so it just pick up
the first tensor to acquire device guard. There are at least two issues with
the current approach:

1. Kernel may not implement the device check, or implement in an inconsistent way.

2. If the first tensor happened to be a 0-dimension CPU tensor, (2) will acquire
the wrong device guard. We have seen issues before
(db2b273) thus kernel might need to re-acquire
device guard in such situation, which is fragile.


As a concrete example, here are a few implementations in codebase to do the device checks:

a. `TensorIteratorBase::compute_types` in `ATen/TensorIterator.cpp`. Used in kernels rely on `TensorIterator` to perform element-wise operations: https://github.com/pytorch/pytorch/blob/6d030c14cf4bd2e59b6e4899a9ee4645d2199e68/aten/src/ATen/TensorIterator.cpp#L326-L340

b.  `checkSameGPU` and `checkAllSameGPU` in `Aten/TensorUtils.cpp`:. Used in a lot of CUDA kernels. https://github.com/pytorch/pytorch/blob/6d030c14cf4bd2e59b6e4899a9ee4645d2199e68/aten/src/ATen/TensorUtils.cpp#L130-L152

c. `check_attributes` in `ATen/native/RNN.h`. Used in a few RNN kernels: https://github.com/pytorch/pytorch/blob/6d030c14cf4bd2e59b6e4899a9ee4645d2199e68/aten/src/ATen/native/RNN.h#L30-L50

d. `dot_check` in `ATen/native/cuda/LinearAlgebra.cu`.  https://github.com/pytorch/pytorch/blob/6d030c14cf4bd2e59b6e4899a9ee4645d2199e68/aten/src/ATen/native/cuda/LinearAlgebra.cu#L378-L383
These device checks have slightly different assumption. For example, (a) allows 0-dimension CPU tensors, (b) assumes everything is on GPU device, and (c) also checks all tensors in `TensorList`. 




Differential Revision: [D27540071](https://our.internmc.facebook.com/intern/diff/D27540071/)

[ghstack-poisoned]
…apper"


For kernels run on CUDA backend, the following preparations are required:

1. Make sure all the tensors are on the same device (except for
0-dimension tensor, which is allowed to stay in CPU). We will denote
this device as `common_device`.

2. Acquire the device guard on `common_device`.

Today, these two preparations are done separately. (1) is done in kernels
in an adhoc way, while (2) is done in the kernel wrapper (e.g. `RegisterCUDA.cpp`).

The kernel wrapper assumes (1) will be done in kernel, so it just pick up
the first tensor to acquire device guard. There are at least two issues with
the current approach:

1. Kernel may not implement the device check, or implement in an inconsistent way.

2. If the first tensor happened to be a 0-dimension CPU tensor, (2) will acquire
the wrong device guard. We have seen issues before
(db2b273) thus kernel might need to re-acquire
device guard in such situation, which is fragile.


As a concrete example, here are a few implementations in codebase to do the device checks:

a. `TensorIteratorBase::compute_types` in `ATen/TensorIterator.cpp`. Used in kernels rely on `TensorIterator` to perform element-wise operations: https://github.com/pytorch/pytorch/blob/6d030c14cf4bd2e59b6e4899a9ee4645d2199e68/aten/src/ATen/TensorIterator.cpp#L326-L340

b.  `checkSameGPU` and `checkAllSameGPU` in `Aten/TensorUtils.cpp`:. Used in a lot of CUDA kernels. https://github.com/pytorch/pytorch/blob/6d030c14cf4bd2e59b6e4899a9ee4645d2199e68/aten/src/ATen/TensorUtils.cpp#L130-L152

c. `check_attributes` in `ATen/native/RNN.h`. Used in a few RNN kernels: https://github.com/pytorch/pytorch/blob/6d030c14cf4bd2e59b6e4899a9ee4645d2199e68/aten/src/ATen/native/RNN.h#L30-L50

d. `dot_check` in `ATen/native/cuda/LinearAlgebra.cu`.  https://github.com/pytorch/pytorch/blob/6d030c14cf4bd2e59b6e4899a9ee4645d2199e68/aten/src/ATen/native/cuda/LinearAlgebra.cu#L378-L383
These device checks have slightly different assumption. For example, (a) allows 0-dimension CPU tensors, (b) assumes everything is on GPU device, and (c) also checks all tensors in `TensorList`. 




Differential Revision: [D27540071](https://our.internmc.facebook.com/intern/diff/D27540071/)

[ghstack-poisoned]
For kernels run on CUDA backend, the following preparation steps are required for most kernels:

1. Make sure all the tensors are on the same device (except for
0-dimension tensor, which is allowed to stay in CPU). We will denote
this device as `common_device`.

2. Acquire the device guard on `common_device`.

Today, these two steps are done separately. (1) is done in kernels
in an adhoc way, while (2) is done in the kernel wrapper (e.g. `RegisterCUDA.cpp`).

The kernel wrapper assumes (1) will be done in kernel, so it just pick up
the first tensor to acquire device guard. There are at least two issues with
the current approach:

1. Kernel may not implement the device check, or implement in an inconsistent way.

2. If the first tensor happened to be a 0-dimension CPU tensor, (2) will acquire
the wrong device guard. We have seen issues before
(db2b273) thus kernel might need to re-acquire
device guard in such situation, which is fragile.

# Appendix

As a concrete example, here are a few implementations in codebase to do the device checks:

a. `TensorIteratorBase::compute_types` in `ATen/TensorIterator.cpp`. Used in kernels rely on `TensorIterator` to perform element-wise operations: https://github.com/pytorch/pytorch/blob/6d030c14cf4bd2e59b6e4899a9ee4645d2199e68/aten/src/ATen/TensorIterator.cpp#L326-L340

b.  `checkSameGPU` and `checkAllSameGPU` in `Aten/TensorUtils.cpp`:. Used in a lot of CUDA kernels. https://github.com/pytorch/pytorch/blob/6d030c14cf4bd2e59b6e4899a9ee4645d2199e68/aten/src/ATen/TensorUtils.cpp#L130-L152

c. `check_attributes` in `ATen/native/RNN.h`. Used in a few RNN kernels: https://github.com/pytorch/pytorch/blob/6d030c14cf4bd2e59b6e4899a9ee4645d2199e68/aten/src/ATen/native/RNN.h#L30-L50

d. `dot_check` in `ATen/native/cuda/LinearAlgebra.cu`.  https://github.com/pytorch/pytorch/blob/6d030c14cf4bd2e59b6e4899a9ee4645d2199e68/aten/src/ATen/native/cuda/LinearAlgebra.cu#L378-L383
These device checks have slightly different assumption. For example, (a) allows 0-dimension CPU tensors, (b) assumes everything is on GPU device, and (c) also checks all tensors in `TensorList`. 




Differential Revision: [D27540071](https://our.internmc.facebook.com/intern/diff/D27540071/)

[ghstack-poisoned]
For kernels run on CUDA backend, the following preparation steps are required for most kernels:

1. Make sure all the tensors are on the same device (except for
0-dimension tensor, which is allowed to stay in CPU). We will denote
this device as `common_device`.

2. Acquire the device guard on `common_device`.

Today, these two steps are done separately. (1) is done in kernels
in an adhoc way, while (2) is done in the kernel wrapper (e.g. `RegisterCUDA.cpp`).

The kernel wrapper assumes (1) will be done in kernel, so it just pick up
the first tensor to acquire device guard. There are at least two issues with
the current approach:

1. Kernel may not implement the device check, or implement in an inconsistent way.

2. If the first tensor happened to be a 0-dimension CPU tensor, (2) will acquire
the wrong device guard. We have seen issues before
(db2b273) thus kernel might need to re-acquire
device guard in such situation, which is fragile.

# Appendix

As a concrete example, here are a few implementations in codebase to do the device checks:

a. `TensorIteratorBase::compute_types` in `ATen/TensorIterator.cpp`. Used in kernels rely on `TensorIterator` to perform element-wise operations: https://github.com/pytorch/pytorch/blob/6d030c14cf4bd2e59b6e4899a9ee4645d2199e68/aten/src/ATen/TensorIterator.cpp#L326-L340

b.  `checkSameGPU` and `checkAllSameGPU` in `Aten/TensorUtils.cpp`:. Used in a lot of CUDA kernels. https://github.com/pytorch/pytorch/blob/6d030c14cf4bd2e59b6e4899a9ee4645d2199e68/aten/src/ATen/TensorUtils.cpp#L130-L152

c. `check_attributes` in `ATen/native/RNN.h`. Used in a few RNN kernels: https://github.com/pytorch/pytorch/blob/6d030c14cf4bd2e59b6e4899a9ee4645d2199e68/aten/src/ATen/native/RNN.h#L30-L50

d. `dot_check` in `ATen/native/cuda/LinearAlgebra.cu`.  https://github.com/pytorch/pytorch/blob/6d030c14cf4bd2e59b6e4899a9ee4645d2199e68/aten/src/ATen/native/cuda/LinearAlgebra.cu#L378-L383
These device checks have slightly different assumption. For example, (a) allows 0-dimension CPU tensors, (b) assumes everything is on GPU device, and (c) also checks all tensors in `TensorList`. 




Differential Revision: [D27540071](https://our.internmc.facebook.com/intern/diff/D27540071/)

[ghstack-poisoned]
For kernels run on CUDA backend, the following preparation steps are required for most kernels:

1. Make sure all the tensors are on the same device (except for
0-dimension tensor, which is allowed to stay in CPU). We will denote
this device as `common_device`.

2. Acquire the device guard on `common_device`.

Today, these two steps are done separately. (1) is done in kernels
in an adhoc way, while (2) is done in the kernel wrapper (e.g. `RegisterCUDA.cpp`).

The kernel wrapper assumes (1) will be done in kernel, so it just pick up
the first tensor to acquire device guard. There are at least two issues with
the current approach:

1. Kernel may not implement the device check, or implement in an inconsistent way.

2. If the first tensor happened to be a 0-dimension CPU tensor, (2) will acquire
the wrong device guard. We have seen issues before
(db2b273) thus kernel might need to re-acquire
device guard in such situation, which is fragile.

# Appendix

As a concrete example, here are a few implementations in codebase to do the device checks:

a. `TensorIteratorBase::compute_types` in `ATen/TensorIterator.cpp`. Used in kernels rely on `TensorIterator` to perform element-wise operations: https://github.com/pytorch/pytorch/blob/6d030c14cf4bd2e59b6e4899a9ee4645d2199e68/aten/src/ATen/TensorIterator.cpp#L326-L340

b.  `checkSameGPU` and `checkAllSameGPU` in `Aten/TensorUtils.cpp`:. Used in a lot of CUDA kernels. https://github.com/pytorch/pytorch/blob/6d030c14cf4bd2e59b6e4899a9ee4645d2199e68/aten/src/ATen/TensorUtils.cpp#L130-L152

c. `check_attributes` in `ATen/native/RNN.h`. Used in a few RNN kernels: https://github.com/pytorch/pytorch/blob/6d030c14cf4bd2e59b6e4899a9ee4645d2199e68/aten/src/ATen/native/RNN.h#L30-L50

d. `dot_check` in `ATen/native/cuda/LinearAlgebra.cu`.  https://github.com/pytorch/pytorch/blob/6d030c14cf4bd2e59b6e4899a9ee4645d2199e68/aten/src/ATen/native/cuda/LinearAlgebra.cu#L378-L383
These device checks have slightly different assumption. For example, (a) allows 0-dimension CPU tensors, (b) assumes everything is on GPU device, and (c) also checks all tensors in `TensorList`. 




Differential Revision: [D27540071](https://our.internmc.facebook.com/intern/diff/D27540071/)

[ghstack-poisoned]
wenleix pushed a commit that referenced this pull request Apr 20, 2021
Pull Request resolved: #55241

For kernels run on CUDA backend, the following preparation steps are required for most kernels:

1. Make sure all the tensors are on the same device (except for
0-dimension tensor, which is allowed to stay in CPU). We will denote
this device as `common_device`.

2. Acquire the device guard on `common_device`.

Today, these two steps are done separately. (1) is done in kernels
in an adhoc way, while (2) is done in the kernel wrapper (e.g. `RegisterCUDA.cpp`).

The kernel wrapper assumes (1) will be done in kernel, so it just pick up
the first tensor to acquire device guard. There are at least two issues with
the current approach:

1. Kernel may not implement the device check, or implement in an inconsistent way.

2. If the first tensor happened to be a 0-dimension CPU tensor, (2) will acquire
the wrong device guard. We have seen issues before
(db2b273) thus kernel might need to re-acquire
device guard in such situation, which is fragile.

# Appendix

As a concrete example, here are a few implementations in codebase to do the device checks:

a. `TensorIteratorBase::compute_types` in `ATen/TensorIterator.cpp`. Used in kernels rely on `TensorIterator` to perform element-wise operations: https://github.com/pytorch/pytorch/blob/6d030c14cf4bd2e59b6e4899a9ee4645d2199e68/aten/src/ATen/TensorIterator.cpp#L326-L340

b.  `checkSameGPU` and `checkAllSameGPU` in `Aten/TensorUtils.cpp`:. Used in a lot of CUDA kernels. https://github.com/pytorch/pytorch/blob/6d030c14cf4bd2e59b6e4899a9ee4645d2199e68/aten/src/ATen/TensorUtils.cpp#L130-L152

c. `check_attributes` in `ATen/native/RNN.h`. Used in a few RNN kernels: https://github.com/pytorch/pytorch/blob/6d030c14cf4bd2e59b6e4899a9ee4645d2199e68/aten/src/ATen/native/RNN.h#L30-L50

d. `dot_check` in `ATen/native/cuda/LinearAlgebra.cu`.  https://github.com/pytorch/pytorch/blob/6d030c14cf4bd2e59b6e4899a9ee4645d2199e68/aten/src/ATen/native/cuda/LinearAlgebra.cu#L378-L383
These device checks have slightly different assumption. For example, (a) allows 0-dimension CPU tensors, (b) assumes everything is on GPU device, and (c) also checks all tensors in `TensorList`.






ghstack-source-id: 126921579

Differential Revision: [D27540071](https://our.internmc.facebook.com/intern/diff/D27540071/)

**NOTE FOR REVIEWERS**: This PR has internal Facebook specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D27540071/)!
…pper"



Differential Revision: [D27540071](https://our.internmc.facebook.com/intern/diff/D27540071/)

[ghstack-poisoned]
wenleix pushed a commit that referenced this pull request Apr 20, 2021
Pull Request resolved: #55241


ghstack-source-id: 126924122

Differential Revision: [D27540071](https://our.internmc.facebook.com/intern/diff/D27540071/)

**NOTE FOR REVIEWERS**: This PR has internal Facebook specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D27540071/)!

Differential Revision: [D27540071](https://our.internmc.facebook.com/intern/diff/D27540071/)

[ghstack-poisoned]

Differential Revision: [D27540071](https://our.internmc.facebook.com/intern/diff/D27540071/)

[ghstack-poisoned]
wenleix pushed a commit that referenced this pull request Apr 20, 2021
Pull Request resolved: #55241


ghstack-source-id: 126967104

Differential Revision: [D27540071](https://our.internmc.facebook.com/intern/diff/D27540071/)

**NOTE FOR REVIEWERS**: This PR has internal Facebook specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D27540071/)!

Differential Revision: [D27540071](https://our.internmc.facebook.com/intern/diff/D27540071/)

[ghstack-poisoned]
wenleix pushed a commit that referenced this pull request Apr 21, 2021
Pull Request resolved: #55241


ghstack-source-id: 127011444

Differential Revision: [D27540071](https://our.internmc.facebook.com/intern/diff/D27540071/)

**NOTE FOR REVIEWERS**: This PR has internal Facebook specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D27540071/)!

Differential Revision: [D27540071](https://our.internmc.facebook.com/intern/diff/D27540071/)

[ghstack-poisoned]
wenleix pushed a commit that referenced this pull request Apr 22, 2021
Pull Request resolved: #55241


ghstack-source-id: 127145739

Differential Revision: [D27540071](https://our.internmc.facebook.com/intern/diff/D27540071/)

**NOTE FOR REVIEWERS**: This PR has internal Facebook specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D27540071/)!
@wenleix
Contributor Author

wenleix commented Apr 23, 2021

One issue that caused a CUDA memory problem is `normal_` -- while it uses `TensorIterator`, it only uses the metadata part of `TensorIterator` (but not the execution part), so the device guard is not fetched for it.
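
A rough sketch of this metadata-only pattern, assuming the kernel then launches its own CUDA code instead of going through TI's execution path (this is not the actual `normal_` implementation):

```cpp
#include <ATen/ATen.h>
#include <ATen/TensorIterator.h>

// The iterator is built only for shape/dtype/device bookkeeping; since no
// for_each()/gpu_kernel() call follows, a device guard acquired inside TI's
// execution path would never run, and the kernel must set the device itself.
void metadata_only_sketch(at::Tensor& self) {
  auto iter = at::TensorIteratorConfig()
                  .add_output(self)
                  .check_all_same_dtype(false)
                  .build();
  // ... a hand-rolled CUDA launch would go here instead of iter.for_each(...)
  (void)iter;
}
```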

…ard in kernel wrapper"



Differential Revision: [D27540071](https://our.internmc.facebook.com/intern/diff/D27540071/)

[ghstack-poisoned]
@wenleix
Contributor Author

wenleix commented Apr 23, 2021

We found that making TI fetch the device guard is a bit tricky:

  • We could fetch the device guard at execution time. However, TI does not have a single execution entry point: usually it is for_each(), but it can also be parallel_reduce() or serial_for_each(). To make things more complicated, for_each() might delegate to serial_for_each(), but users might also call serial_for_each() directly.

  • Some kernels only use TI for metadata management, not the execution part (e.g. normal_). There are also kernels that use TI but may or may not use the TI execution path (e.g. copy_).

Given this, and the general complexity of this PR (device check + TI device guard + ...), we will fall back to a more incremental approach by first just adding more device checks, as mentioned in #56570 (comment). cc @ezyang
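
To make the entry-point issue above concrete, here is a small self-contained toy model (not the real `TensorIterator`) of why acquiring the guard in only one execution entry point is not enough:

```cpp
#include <functional>
#include <iostream>

// Toy iterator with three execution entry points, mirroring for_each(),
// serial_for_each(), and parallel_reduce(). The "guard" is only acquired in
// for_each(), so callers that go straight to serial_for_each() -- or that use
// the iterator for metadata only -- never get it.
struct ToyIter {
  int device = 1;  // pretend this is the common CUDA device index

  void for_each(const std::function<void()>& loop) {
    std::cout << "guard set for device " << device << "\n";  // guard acquired here
    serial_for_each(loop);                                   // delegated path is covered
  }
  void serial_for_each(const std::function<void()>& loop) {
    loop();  // a direct call skips the guard entirely
  }
  void parallel_reduce(const std::function<void()>& loop) {
    loop();  // a third entry point that would need its own guard as well
  }
};

int main() {
  ToyIter it;
  it.for_each([] { std::cout << "guarded launch\n"; });
  it.serial_for_each([] { std::cout << "unguarded launch\n"; });
}
```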


Differential Revision: [D27540071](https://our.internmc.facebook.com/intern/diff/D27540071/)

[ghstack-poisoned]
wenleix pushed a commit that referenced this pull request Apr 23, 2021
Pull Request resolved: #55241

For kernels run on CUDA backend, the following preparation steps are required for most kernels:

1. Make sure all the tensors are on the same device (except for
0-dimension tensor, which is allowed to stay in CPU). We will denote
this device as `common_device`.

2. Acquire the device guard on `common_device`.

Today, these two steps are done separately. (1) is done in kernels
in an adhoc way, while (2) is done in the kernel wrapper (e.g. `RegisterCUDA.cpp`).

The kernel wrapper assumes (1) will be done in kernel, so it just pick up
the first tensor to acquire device guard. There are at least two issues with
the current approach:

1. Kernel may not implement the device check, or implement in an inconsistent way.

2. If the first tensor happened to be a 0-dimension CPU tensor, (2) will acquire
the wrong device guard. We have seen issues before
(db2b273) thus kernel might need to re-acquire
device guard in such situation, which is fragile.

# Appendix

As a concrete example, here are a few implementations in codebase to do the device checks:

a. `TensorIteratorBase::compute_types` in `ATen/TensorIterator.cpp`. Used in kernels rely on `TensorIterator` to perform element-wise operations: https://github.com/pytorch/pytorch/blob/6d030c14cf4bd2e59b6e4899a9ee4645d2199e68/aten/src/ATen/TensorIterator.cpp#L326-L340

b.  `checkSameGPU` and `checkAllSameGPU` in `Aten/TensorUtils.cpp`:. Used in a lot of CUDA kernels. https://github.com/pytorch/pytorch/blob/6d030c14cf4bd2e59b6e4899a9ee4645d2199e68/aten/src/ATen/TensorUtils.cpp#L130-L152

c. `check_attributes` in `ATen/native/RNN.h`. Used in a few RNN kernels: https://github.com/pytorch/pytorch/blob/6d030c14cf4bd2e59b6e4899a9ee4645d2199e68/aten/src/ATen/native/RNN.h#L30-L50

d. `dot_check` in `ATen/native/cuda/LinearAlgebra.cu`.  https://github.com/pytorch/pytorch/blob/6d030c14cf4bd2e59b6e4899a9ee4645d2199e68/aten/src/ATen/native/cuda/LinearAlgebra.cu#L378-L383
These device checks have slightly different assumption. For example, (a) allows 0-dimension CPU tensors, (b) assumes everything is on GPU device, and (c) also checks all tensors in `TensorList`.






ghstack-source-id: 127311961

Differential Revision: [D27540071](https://our.internmc.facebook.com/intern/diff/D27540071/)

**NOTE FOR REVIEWERS**: This PR has internal Facebook specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D27540071/)!
@wenleix
Contributor Author

wenleix commented Apr 23, 2021

Command to generate the changes in native_functions.yaml (a few are missing): https://gist.github.com/wenleix/a7d1c7eb9c504738c3c863c9a4c69839

@github-actions

Looks like this PR hasn't been updated in a while so we're going to go ahead and mark this as Stale.
Feel free to remove the Stale label if you feel this was a mistake.
If you are unable to remove the Stale label please contact a maintainer in order to do so.
If you want the bot to never mark this PR stale again, add the no-stale label.
Stale pull requests will automatically be closed after 30 days of inactivity.

@github-actions github-actions bot added the Stale label Apr 13, 2022
@github-actions github-actions bot closed this May 13, 2022
@facebook-github-bot facebook-github-bot deleted the gh/wenleix/18/head branch June 12, 2022 14:24