[RFC] Unify device check and guard in kernel wrapper #55241
Conversation
For kernels run on the CUDA backend, the following preparations are required:

1. Make sure all the tensors are on the same device (except for 0-dimension tensors, which are allowed to stay on CPU). We will denote this device as `common_device`.
2. Acquire the device guard on `common_device`.

Today, these two preparations are done separately. (1) is done in kernels in an ad-hoc way, while (2) is done in the kernel wrapper (e.g. `RegisterCUDA.cpp`). The kernel wrapper assumes (1) will be done in the kernel, so it just picks up the first tensor to acquire the device guard. There are at least two issues with the current approach:

1. A kernel may not implement the device check, or may implement it in an inconsistent way. In fact, there are a few implementations in the codebase that do the checks.
2. If the first tensor happens to be a 0-dimension CPU tensor, (2) will acquire the wrong device guard. We have seen issues before (db2b273), so a kernel might need to re-acquire the device guard in such a situation, which is fragile.

Differential Revision: [D27540071](https://our.internmc.facebook.com/intern/diff/D27540071/)

[ghstack-poisoned]
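To make the proposal concrete, here is a minimal sketch, assuming a hypothetical helper `update_common_device` and a hand-written `wrapper_add` (neither is the actual codegen output nor the helper this PR adds), of how a wrapper could unify the device check with guard acquisition:

```cpp
#include <ATen/ATen.h>
#include <c10/core/DeviceGuard.h>
#include <c10/util/Optional.h>

// Illustrative helper: fold one argument into the running common device.
// 0-dimension CPU tensors are exempt, matching the rule described above.
static void update_common_device(c10::optional<c10::Device>& common_device,
                                 const at::Tensor& tensor,
                                 const char* arg_name) {
  if (!tensor.defined()) return;
  // Skip 0-dim CPU tensors; they may legally accompany CUDA tensors.
  if (tensor.dim() == 0 && tensor.device().is_cpu()) return;
  if (!common_device.has_value()) {
    common_device = tensor.device();
    return;
  }
  TORCH_CHECK(tensor.device() == *common_device,
              "Expected argument '", arg_name, "' to be on device ",
              *common_device, " but found ", tensor.device());
}

// Sketch of a unified wrapper: verify devices first, then acquire the
// guard on the verified common device instead of on the first tensor.
at::Tensor wrapper_add(const at::Tensor& self, const at::Tensor& other) {
  c10::optional<c10::Device> common_device = c10::nullopt;
  update_common_device(common_device, self, "self");
  update_common_device(common_device, other, "other");
  const c10::OptionalDeviceGuard guard(common_device);
  return at::add(self, other);  // real codegen would call the kernel directly
}
```

Because the guard is built from the verified `common_device` rather than from whichever tensor happens to come first, a leading 0-dimension CPU tensor can no longer select the wrong device.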
💊 CI failures summary and remediations

As of commit 636d485 (more details on the Dr. CI page):

🕵️ 5 new failures recognized by patterns. The following CI failures do not appear to be due to upstream breakages:

pytorch_linux_xenial_cuda10_2_cudnn7_py3_gcc7_test2 (1/5)
Step: "Run tests" (full log | diagnosis details | 🔁 rerun)
…l wrapper" For kernels run on CUDA backend, the following preparations are required: 1. Make sure all the tensors are on the same device (except for 0-dimension tensor, which is allowed to stay in CPU). We will denote this device as `common_device`. 2. Acquire the device guard on `common_device`. Today, these two preparations are done separately. (1) is done in kernels in an adhoc way, while (2) is done in the kernel wrapper (e.g. `RegisterCUDA.cpp`). The kernel wrapper assumes (1) will be done in kernel, so it just pick up the first tensor to acquire device guard. There are at least two issues with the current approach: 1. Kernel may not implement the device check, or implement in an inconsistent way. 2. If the first tensor happened to be a 0-dimension CPU tensor, (2) will acquire the wrong device guard. We have seen issues before (db2b273) thus kernel might need to re-acquire device guard in such situation, which is fragile. As a concrete example, here are a few implementations in codebase to do the device checks: a. `TensorIteratorBase::compute_types` in `ATen/TensorIterator.cpp`. Used in kernels rely on `TensorIterator` to perform element-wise operations: https://github.com/pytorch/pytorch/blob/6d030c14cf4bd2e59b6e4899a9ee4645d2199e68/aten/src/ATen/TensorIterator.cpp#L326-L340 b. `checkSameGPU` and `checkAllSameGPU` in `Aten/TensorUtils.cpp`:. Used in a lot of CUDA kernels. https://github.com/pytorch/pytorch/blob/6d030c14cf4bd2e59b6e4899a9ee4645d2199e68/aten/src/ATen/TensorUtils.cpp#L130-L152 c. `check_attributes` in `ATen/native/RNN.h`. Used in a few RNN kernels: https://github.com/pytorch/pytorch/blob/6d030c14cf4bd2e59b6e4899a9ee4645d2199e68/aten/src/ATen/native/RNN.h#L30-L50 d. `dot_check` in `ATen/native/cuda/LinearAlgebra.cu`. https://github.com/pytorch/pytorch/blob/6d030c14cf4bd2e59b6e4899a9ee4645d2199e68/aten/src/ATen/native/cuda/LinearAlgebra.cu#L378-L383 These device checks have slightly different assumption. For example, (a) allows 0-dimension CPU tensors, (b) assumes everything is on GPU device, and (c) also checks all tensors in `TensorList`. Differential Revision: [D27540071](https://our.internmc.facebook.com/intern/diff/D27540071/) [ghstack-poisoned]
Review comment on the `native_functions.yaml` hunk:

```diff
@@ -4347,6 +4350,7 @@
   SparseCUDA: hspmm_sparse_cuda

 - func: copy_sparse_to_sparse_(Tensor(a!) self, Tensor src, bool non_blocking=False) -> Tensor(a!)
+  device_guard: False
```
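For context, `device_guard: False` tells the code generator to omit the device-guard line from the generated wrapper. A hedged sketch of the difference (the wrapper names are illustrative, not the actual `RegisterCUDA.cpp` output):

```cpp
#include <ATen/ATen.h>
#include <ATen/DeviceGuard.h>      // at::device_of
#include <ATen/NativeFunctions.h>  // at::native::copy_sparse_to_sparse_
#include <c10/core/DeviceGuard.h>  // c10::OptionalDeviceGuard

// With the default (no `device_guard: False`), the generated wrapper
// switches to the device of the first tensor argument before calling in:
at::Tensor& wrapper_with_guard(at::Tensor& self, const at::Tensor& src,
                               bool non_blocking) {
  const c10::OptionalDeviceGuard device_guard(at::device_of(self));
  return at::native::copy_sparse_to_sparse_(self, src, non_blocking);
}

// With `device_guard: False`, the guard is omitted and the callee is
// responsible for device correctness itself:
at::Tensor& wrapper_without_guard(at::Tensor& self, const at::Tensor& src,
                                  bool non_blocking) {
  return at::native::copy_sparse_to_sparse_(self, src, non_blocking);
}
```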
This calls into `Tensor::to`:

pytorch/aten/src/ATen/SparseTensorUtils.h (lines 34 to 39 in bf37bf7):

```cpp
inline void copy_into_sparse(const SparseTensor& self, const Tensor& indices, const Tensor& values, bool non_blocking) {
  alias_into_sparse(
      self,
      indices.to(self._indices().options(), non_blocking, /*copy=*/true),
      values.to(self._values().options(), non_blocking, /*copy=*/true));
}
```
which handles the device guard:

pytorch/aten/src/ATen/native/TensorConversions.cpp (lines 83 to 85 in bf37bf7):

```cpp
if (options.has_device()) {
  options = options.device(ensure_has_index(options.device()));
}
```
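In other words, an op declared with `device_guard: False` can remain device-correct by delegating every cross-device move to `Tensor::to`, which performs its own device handling internally. A minimal illustration (the function `copy_like` is hypothetical, not PyTorch code):

```cpp
#include <ATen/ATen.h>

// Hypothetical helper: device correctness without an explicit wrapper
// guard, by delegating the cross-device move to Tensor::to.
at::Tensor copy_like(const at::Tensor& src, const at::Tensor& dst_template) {
  // No device guard acquired here; `to` handles the device switch itself
  // when materializing the copy on the destination device.
  return src.to(dst_template.options(), /*non_blocking=*/false, /*copy=*/true);
}
```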
Maybe this should have been a composite haha
One issue that causes CUDA memory problems is because of
We found that making TensorIterator (TI) fetch the device guard is a bit tricky:
Given this, and the general complexity of this PR (device check + TI device guard + ...), we will fall back to a more incremental approach by first just adding more device checks, as mentioned in #56570 (comment). cc @ezyang
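For context, here is a minimal sketch of what such a per-argument device check could look like; the helper name and signature are illustrative, not the actual codegen API:

```cpp
#include <ATen/ATen.h>

// Illustrative per-argument check that codegen could emit into each kernel
// wrapper; it accumulates common_device while visiting every tensor argument.
// (Name and signature are a sketch, not the actual PyTorch helper.)
void check_and_update_common_device_sketch(
    c10::optional<c10::Device>& common_device,
    const at::Tensor& tensor,
    const char* op_name,
    const char* arg_name) {
  if (!tensor.defined()) {
    return;  // undefined optional tensors carry no device
  }
  if (tensor.dim() == 0 && tensor.device().is_cpu()) {
    return;  // 0-dim CPU tensors may legitimately differ from common_device
  }
  if (!common_device.has_value()) {
    common_device = tensor.device();
    return;
  }
  TORCH_CHECK(*common_device == tensor.device(),
              op_name, ": expected argument '", arg_name, "' to be on ",
              *common_device, ", but it is on ", tensor.device());
}
```

A generated wrapper would then call this once per tensor argument, and only afterwards acquire the device guard on the accumulated `common_device`.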
Command to generate change in
Looks like this PR hasn't been updated in a while so we're going to go ahead and mark this as `Stale`.
Stack from ghstack:

For kernels run on the CUDA backend, the following preparation steps are required for most kernels:

1. Make sure all the tensors are on the same device (except for 0-dimension tensors, which are allowed to stay on CPU). We will denote this device as `common_device`.
2. Acquire the device guard on `common_device`.

Today, these two steps are done separately. (1) is done in kernels in an ad hoc way, while (2) is done in the kernel wrapper (e.g. `RegisterCUDA.cpp`). The kernel wrapper assumes (1) will be done in the kernel, so it just picks up the first tensor to acquire the device guard. There are at least two issues with the current approach:

1. A kernel may not implement the device check, or may implement it in an inconsistent way.
2. If the first tensor happens to be a 0-dimension CPU tensor, (2) will acquire the wrong device guard. We have seen issues before (db2b273) where the kernel then has to re-acquire the device guard itself, which is fragile.
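To make the proposal concrete, here is a minimal sketch of a wrapper that performs both steps itself. The operator and kernel names (`wrapper_add`, `add_kernel`) are hypothetical, and the real generated wrappers differ:

```cpp
#include <ATen/ATen.h>
#include <c10/core/DeviceGuard.h>

at::Tensor add_kernel(const at::Tensor& self, const at::Tensor& other);  // hypothetical kernel

// Sketch: compute common_device across all tensor arguments (step 1), then
// acquire the guard on it (step 2), instead of guarding blindly on the first
// argument's device.
at::Tensor wrapper_add(const at::Tensor& self, const at::Tensor& other) {
  c10::optional<c10::Device> common_device;
  for (const at::Tensor* t : {&self, &other}) {
    // 0-dim CPU tensors are exempt from the same-device requirement.
    if (t->dim() == 0 && t->device().is_cpu()) {
      continue;
    }
    if (!common_device.has_value()) {
      common_device = t->device();
    } else {
      TORCH_CHECK(*common_device == t->device(),
                  "Expected all tensors to be on ", *common_device,
                  ", but found an argument on ", t->device());
    }
  }
  // If every argument is a 0-dim CPU tensor, common_device stays empty and
  // the guard is a no-op.
  c10::OptionalDeviceGuard guard(common_device);
  return add_kernel(self, other);
}
```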
Appendix

As a concrete example, here are a few implementations in the codebase that do device checks:

a. `TensorIteratorBase::compute_types` in `ATen/TensorIterator.cpp`. Used in kernels that rely on `TensorIterator` to perform element-wise operations: https://github.com/pytorch/pytorch/blob/6d030c14cf4bd2e59b6e4899a9ee4645d2199e68/aten/src/ATen/TensorIterator.cpp#L326-L340

b. `checkSameGPU` and `checkAllSameGPU` in `ATen/TensorUtils.cpp`. Used in a lot of CUDA kernels: https://github.com/pytorch/pytorch/blob/6d030c14cf4bd2e59b6e4899a9ee4645d2199e68/aten/src/ATen/TensorUtils.cpp#L130-L152

c. `check_attributes` in `ATen/native/RNN.h`. Used in a few RNN kernels: https://github.com/pytorch/pytorch/blob/6d030c14cf4bd2e59b6e4899a9ee4645d2199e68/aten/src/ATen/native/RNN.h#L30-L50

d. `dot_check` in `ATen/native/cuda/LinearAlgebra.cu`: https://github.com/pytorch/pytorch/blob/6d030c14cf4bd2e59b6e4899a9ee4645d2199e68/aten/src/ATen/native/cuda/LinearAlgebra.cu#L378-L383
These device checks make slightly different assumptions. For example, (a) allows 0-dimension CPU tensors, (b) assumes everything is on a GPU device, and (c) also checks all tensors in a `TensorList`.
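To illustrate how assumption (c) differs, here is a sketch of a check that also walks a `TensorList`; the helper name is illustrative, not the actual RNN helper:

```cpp
#include <ATen/ATen.h>

// Sketch in the spirit of check_attributes: the input tensor fixes the
// expected device, and every tensor in the list must match it exactly
// (note: no 0-dim CPU exemption here).
void check_all_on_device_of(const at::Tensor& input, at::TensorList params) {
  const c10::Device expected = input.device();
  for (const at::Tensor& p : params) {
    TORCH_CHECK(p.device() == expected,
                "Expected all tensors to be on ", expected,
                ", but found one on ", p.device());
  }
}
```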
Differential Revision: D27540071