Migrate glu from the THC to ATen (CUDA) #61153
Conversation
Fixes gh-24571, fixes gh-24572
Closes gh-39586, closes gh-38697

Benchmarks
----------

The benchmarks were run with nvprof calling the operator in a loop. They show reliable improvements for large tensors, but the TH implementation seems to fare better for smaller tensors. For sufficiently large tensors, the ATen implementation does win.

| Shape | Dim | Master Forward (us) | This PR Forward (us) | Master Backward (us) | This PR Backward (us) |
|-------------:|-----|:-------------------:|:--------------------:|:--------------------:|:---------------------:|
| 128, 1000    | 0   | 2.4770 | 2.0820 | 3.0440 | 3.4680 |
|              | 1   | 2.7060 | 4.4850 | 3.3380 | 3.6250 |
| 128, 10000   | 0   | 26.531 | 21.366 | 38.083 | 34.623 |
|              | 1   | 27.680 | 30.465 | 38.943 | 35.204 |
| 128, 100000  | 0   | 292.09 | 219.56 | 355.57 | 324.49 |
|              | 1   | 260.43 | 243.08 | 332.25 | 323.37 |
| 128, 1000000 | 0   | 2475.7 | 1874.6 | 3810.1 | 3215.7 |
|              | 1   | 2586.3 | 2380.9 | 3349.9 | 3207.8 |
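For reference, the measurement setup described above could look roughly like the driver below, run under `nvprof`; the shape, dim, dtype, and iteration counts here are illustrative assumptions rather than the exact harness used for the table.

```cpp
// Hypothetical micro-benchmark driver: run the binary under `nvprof` and read
// the per-kernel average times from the profile.
#include <torch/torch.h>
#include <cuda_runtime.h>

int main() {
  const auto x = torch::randn({128, 100000},
                              torch::dtype(torch::kFloat).device(torch::kCUDA));
  const int64_t dim = 0;  // the table sweeps dim 0 and 1

  // Warm-up to exclude one-time CUDA context / allocator setup.
  for (int i = 0; i < 10; ++i) {
    at::glu(x, dim);
  }
  cudaDeviceSynchronize();

  // Timed region: nvprof averages the kernel time over these launches.
  for (int i = 0; i < 1000; ++i) {
    at::glu(x, dim);
  }
  cudaDeviceSynchronize();
  return 0;
}
```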
💊 CI failures summary and remediations
As of commit 3ab233d (more details on the Dr. CI page and at hud.pytorch.org/pr/61153):
❄️ 1 failure tentatively classified as flaky, but reruns have not yet been triggered to confirm.
// glu backward
// -----------------------------------
template <typename scalar_t, typename OffsetCalc, typename StrideType>
__global__ void glu_backward_kernel(
Why do you need a one-off kernel for this instead of a TensorIterator gpu_kernel? The reason it was done this way in THC was that the number of inputs was limited to 3, so these tricks were needed, but with TI there can be arbitrarily many inputs, and thus backward can be a regular TI kernel.
I tried with gpu_kernel_multiple_outputs and it's ~1.5x slower in CUDA time, and it also has greater CPU overhead since we need to dispatch 4 times to narrow the inputs/outputs.
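For context, a sketch of what that gpu_kernel_multiple_outputs variant might look like is below. The function name glu_backward_ti_sketch and the exact setup are assumptions rather than the code that was benchmarked, but it shows where the four narrow() dispatches and the single TI pass come from.

```cpp
#include <ATen/ATen.h>
#include <ATen/Dispatch.h>
#include <ATen/native/TensorIterator.h>
#include <ATen/native/cuda/Loops.cuh>
#include <c10/core/WrapDimMinimal.h>
#include <thrust/tuple.h>

namespace at { namespace native {

Tensor glu_backward_ti_sketch(const Tensor& grad_output, const Tensor& input,
                              int64_t dim) {
  const auto wrap_dim = c10::maybe_wrap_dim(dim, input.dim());
  const auto half = input.size(wrap_dim) / 2;
  auto grad_input = at::empty_like(input);

  // Each narrow() below goes through the dispatcher: the "4 dispatches" of
  // CPU overhead mentioned above.
  auto a  = input.narrow(wrap_dim, 0, half);          // first half of input
  auto b  = input.narrow(wrap_dim, half, half);       // second half of input
  auto ga = grad_input.narrow(wrap_dim, 0, half);     // grad w.r.t. first half
  auto gb = grad_input.narrow(wrap_dim, half, half);  // grad w.r.t. second half

  auto iter = TensorIteratorConfig()
      .add_output(ga)
      .add_output(gb)
      .add_input(a)
      .add_input(b)
      .add_input(grad_output)
      .build();

  AT_DISPATCH_FLOATING_TYPES(iter.common_dtype(), "glu_backward_ti_sketch", [&] {
    gpu_kernel_multiple_outputs(
        iter,
        [] GPU_LAMBDA(scalar_t a_, scalar_t b_, scalar_t go)
            -> thrust::tuple<scalar_t, scalar_t> {
          const scalar_t sig_b = scalar_t(1) / (scalar_t(1) + ::exp(-b_));
          // d/da (a * sigmoid(b)) = sigmoid(b)
          // d/db (a * sigmoid(b)) = a * sigmoid(b) * (1 - sigmoid(b))
          return thrust::make_tuple(
              go * sig_b, go * a_ * sig_b * (scalar_t(1) - sig_b));
        });
  });
  return grad_input;
}

}}  // namespace at::native
```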
Minor comment about code structure
  return grad_input;
}

Tensor glu_backward_cuda(const Tensor& grad_output, const Tensor& input, int64_t dim) {
Given that this and glu_backward_cuda_out are CUDA-only functions, you can put them directly in Activation.cu and skip the REGISTER/DECLARE/DEFINE of glu_backward_cuda_stub.
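For readers unfamiliar with the dispatch-stub machinery being referred to, here is a rough contrast of the two options; the DECLARE/DEFINE/REGISTER_DISPATCH macros are real ATen constructs, but the type and function names around them are hypothetical.

```cpp
#include <ATen/native/DispatchStub.h>
#include <ATen/native/TensorIterator.h>

namespace at { namespace native {

// Option A: stub plumbing, useful when a device-agnostic wrapper (e.g. in
// Activation.cpp) must reach per-device kernel implementations.
using glu_backward_fn = void (*)(TensorIterator&);
DECLARE_DISPATCH(glu_backward_fn, glu_backward_cuda_stub);  // in a shared header
DEFINE_DISPATCH(glu_backward_cuda_stub);                    // in Activation.cpp
// REGISTER_DISPATCH(glu_backward_cuda_stub, &glu_backward_kernel_impl);  // in Activation.cu

// Option B (what the comment suggests): since glu_backward_cuda and
// glu_backward_cuda_out exist only for CUDA, define them directly in
// Activation.cu, launch the kernel from there, and bind them to the CUDA
// dispatch key in native_functions.yaml, with no stub declaration at all.

}}  // namespace at::native
```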
@ngimel has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
// We explicitly iterate over the first half of the input tensor, and
// gI_byte_offset and I_byte_offset are the offsets to access the
// corresponding index in the second half of the tensor.
CUDA_KERNEL_LOOP(i, numel) {
nit: you don't need CUDA_KERNEL_LOOP here, as the launch already guarantees num_threads >= num_elements.
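A generic illustration of the suggested simplification (not the PR's kernel): when the launch guarantees grid * blockDim.x >= numel, a single bounds check is enough and the grid-stride CUDA_KERNEL_LOOP macro can be dropped.

```cpp
// Each element gets exactly one thread, so one guard replaces the stride loop.
__global__ void scale_kernel(int numel, float alpha, float* out, const float* in) {
  const int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < numel) {
    out[i] = alpha * in[i];
  }
}
```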
const auto N = iter.numel();
auto offset_calculator = make_element_offset_calculator<3>(iter);
TORCH_INTERNAL_ASSERT_DEBUG_ONLY(N > 0 && N <= std::numeric_limits<int32_t>::max());
int64_t grid = (N + NUM_THREADS - 1) / NUM_THREADS;
NUM_THREADS (num_threads) is 64 here, which is fine for vectorized TI kernels, but for this kernel I verified that using num_threads=256 speeds things up for smaller sizes by ~10-15%.
TORCH_INTERNAL_ASSERT_DEBUG_ONLY(N > 0 && N <= std::numeric_limits<int32_t>::max());
const auto offset_calculator = make_element_offset_calculator<3>(iter);
constexpr int64_t block_size = 256;
const int64_t grid = (N - block_size - 1) / block_size;
typo here (N + block_size - 1)
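Concretely, the grid computation needs to round up rather than down; a quick sanity check with a hypothetical element count:

```cpp
// Corrected launch-grid computation (N is the element count from the code
// above; 256 matches the block size suggested earlier in the review):
constexpr int64_t block_size = 256;
const int64_t grid = (N + block_size - 1) / block_size;  // ceil(N / block_size)
// Sanity check with N = 1000: (1000 + 255) / 256 = 4 blocks, i.e. 1024 threads,
// while the buggy (N - block_size - 1) / block_size gives 743 / 256 = 2 blocks
// and would leave the last 488 elements unprocessed.
```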
Stack from ghstack:
Fixes gh-24571, fixes gh-24572
Closes gh-39586
Benchmarks
The benchmarks were run with nvprof calling the operator in a loop. They show reliable improvements for large tensors, but the TH implementation seems to fare better for smaller tensors. For sufficiently large tensors, the ATen implementation does win.
Differential Revision: D29538093