Migrate glu from the THC to ATen (CUDA) #61153
Conversation
Fixes gh-24571, fixes gh-24572
Closes gh-39586, closes gh-38697

Benchmarks
----------

The benchmarks were run with nvprof calling the operator in a loop. They show reliable improvements for large tensors, but the TH implementation seems to fare better for smaller tensors. For sufficiently large tensors, the ATen implementation does win.

| Shape | Dim | Master Forward (us) | This PR Forward (us) | Master Backward (us) | This PR Backward (us) |
|-------------:|-----|:-------------------:|:--------------------:|:--------------------:|:---------------------:|
| 128, 1000    | 0   | 2.4770 | 2.0820 | 3.0440 | 3.4680 |
|              | 1   | 2.7060 | 4.4850 | 3.3380 | 3.6250 |
| 128, 10000   | 0   | 26.531 | 21.366 | 38.083 | 34.623 |
|              | 1   | 27.680 | 30.465 | 38.943 | 35.204 |
| 128, 100000  | 0   | 292.09 | 219.56 | 355.57 | 324.49 |
|              | 1   | 260.43 | 243.08 | 332.25 | 323.37 |
| 128, 1000000 | 0   | 2475.7 | 1874.6 | 3810.1 | 3215.7 |
|              | 1   | 2586.3 | 2380.9 | 3349.9 | 3207.8 |
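For reference, the measurement setup described above could look roughly like the driver below, run under `nvprof`; the shape, dim, dtype, and iteration counts here are illustrative assumptions rather than the exact harness used for the table.

```cpp
// Hypothetical micro-benchmark driver: run the binary under `nvprof` and read
// the per-kernel average times from the profile.
#include <torch/torch.h>
#include <cuda_runtime.h>

int main() {
  const auto x = torch::randn({128, 100000},
                              torch::dtype(torch::kFloat).device(torch::kCUDA));
  const int64_t dim = 0;  // the table sweeps dim 0 and 1

  // Warm-up to exclude one-time CUDA context / allocator setup.
  for (int i = 0; i < 10; ++i) {
    at::glu(x, dim);
  }
  cudaDeviceSynchronize();

  // Timed region: nvprof averages the kernel time over these launches.
  for (int i = 0; i < 1000; ++i) {
    at::glu(x, dim);
  }
  cudaDeviceSynchronize();
  return 0;
}
```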
💊 CI failures summary and remediations
As of commit 3ab233d (more details on the Dr. CI page and at hud.pytorch.org/pr/61153):
❄️ 1 failure tentatively classified as flaky, but reruns have not yet been triggered to confirm.
// glu backward
// -----------------------------------
template <typename scalar_t, typename OffsetCalc, typename StrideType>
__global__ void glu_backward_kernel(
Why do you need a one-off kernel for this instead of a TensorIterator gpu_kernel? The reason it was done this way in THC was that the number of inputs was limited to 3, so these tricks were needed, but with TI there can be arbitrarily many inputs, and thus backward can be a regular TI kernel.
I tried with gpu_kernel_multiple_outputs and it's ~1.5x slower in CUDA time, and it also has greater CPU overhead since we need to dispatch 4 times to narrow the inputs/outputs.
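For context, a sketch of what that gpu_kernel_multiple_outputs variant might look like is below. The function name glu_backward_ti_sketch and the exact setup are assumptions rather than the code that was benchmarked, but it shows where the four narrow() dispatches and the single TI pass come from.

```cpp
#include <ATen/ATen.h>
#include <ATen/Dispatch.h>
#include <ATen/native/TensorIterator.h>
#include <ATen/native/cuda/Loops.cuh>
#include <c10/core/WrapDimMinimal.h>
#include <thrust/tuple.h>

namespace at { namespace native {

Tensor glu_backward_ti_sketch(const Tensor& grad_output, const Tensor& input,
                              int64_t dim) {
  const auto wrap_dim = c10::maybe_wrap_dim(dim, input.dim());
  const auto half = input.size(wrap_dim) / 2;
  auto grad_input = at::empty_like(input);

  // Each narrow() below goes through the dispatcher: the "4 dispatches" of
  // CPU overhead mentioned above.
  auto a  = input.narrow(wrap_dim, 0, half);          // first half of input
  auto b  = input.narrow(wrap_dim, half, half);       // second half of input
  auto ga = grad_input.narrow(wrap_dim, 0, half);     // grad w.r.t. first half
  auto gb = grad_input.narrow(wrap_dim, half, half);  // grad w.r.t. second half

  auto iter = TensorIteratorConfig()
      .add_output(ga)
      .add_output(gb)
      .add_input(a)
      .add_input(b)
      .add_input(grad_output)
      .build();

  AT_DISPATCH_FLOATING_TYPES(iter.common_dtype(), "glu_backward_ti_sketch", [&] {
    gpu_kernel_multiple_outputs(
        iter,
        [] GPU_LAMBDA(scalar_t a_, scalar_t b_, scalar_t go)
            -> thrust::tuple<scalar_t, scalar_t> {
          const scalar_t sig_b = scalar_t(1) / (scalar_t(1) + ::exp(-b_));
          // d/da (a * sigmoid(b)) = sigmoid(b)
          // d/db (a * sigmoid(b)) = a * sigmoid(b) * (1 - sigmoid(b))
          return thrust::make_tuple(
              go * sig_b, go * a_ * sig_b * (scalar_t(1) - sig_b));
        });
  });
  return grad_input;
}

}}  // namespace at::native
```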
Minor comment about code structure
  return grad_input;
}

Tensor glu_backward_cuda(const Tensor& grad_output, const Tensor& input, int64_t dim) {
Given that this and glu_backward_cuda_out are CUDA-only functions, you can put them directly in Activation.cu and skip the REGISTER/DECLARE/DEFINE of glu_backward_cuda_stub.
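For readers unfamiliar with the dispatch-stub machinery being referred to, here is a rough contrast of the two options; the DECLARE/DEFINE/REGISTER_DISPATCH macros are real ATen constructs, but the type and function names around them are hypothetical.

```cpp
#include <ATen/native/DispatchStub.h>
#include <ATen/native/TensorIterator.h>

namespace at { namespace native {

// Option A: stub plumbing, useful when a device-agnostic wrapper (e.g. in
// Activation.cpp) must reach per-device kernel implementations.
using glu_backward_fn = void (*)(TensorIterator&);
DECLARE_DISPATCH(glu_backward_fn, glu_backward_cuda_stub);  // in a shared header
DEFINE_DISPATCH(glu_backward_cuda_stub);                    // in Activation.cpp
// REGISTER_DISPATCH(glu_backward_cuda_stub, &glu_backward_kernel_impl);  // in Activation.cu

// Option B (what the comment suggests): since glu_backward_cuda and
// glu_backward_cuda_out exist only for CUDA, define them directly in
// Activation.cu, launch the kernel from there, and bind them to the CUDA
// dispatch key in native_functions.yaml, with no stub declaration at all.

}}  // namespace at::native
```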
@ngimel has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
// We explicitly iterate over the first half of the input tensor, and
// gI_byte_offset and I_byte_offset are the offsets to access the
// corresponding index in the second half of the tensor.
CUDA_KERNEL_LOOP(i, numel) {
nit: you don't need CUDA_KERNEL_LOOP here, as the launch already guarantees num_threads >= num_elements.
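A generic illustration of the suggested simplification (not the PR's kernel): when the launch guarantees grid * blockDim.x >= numel, a single bounds check is enough and the grid-stride CUDA_KERNEL_LOOP macro can be dropped.

```cpp
// Each element gets exactly one thread, so one guard replaces the stride loop.
__global__ void scale_kernel(int numel, float alpha, float* out, const float* in) {
  const int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < numel) {
    out[i] = alpha * in[i];
  }
}
```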
const auto N = iter.numel();
auto offset_calculator = make_element_offset_calculator<3>(iter);
TORCH_INTERNAL_ASSERT_DEBUG_ONLY(N > 0 && N <= std::numeric_limits<int32_t>::max());
int64_t grid = (N + NUM_THREADS - 1) / NUM_THREADS;
NUM_THREADS (num_threads) is 64 here, which is fine for vectorized TI kernels, but for this kernel I verified that using num_threads=256 speeds things up for smaller sizes by ~10-15%.
TORCH_INTERNAL_ASSERT_DEBUG_ONLY(N > 0 && N <= std::numeric_limits<int32_t>::max());
const auto offset_calculator = make_element_offset_calculator<3>(iter);
constexpr int64_t block_size = 256;
const int64_t grid = (N - block_size - 1) / block_size;
typo here (N + block_size - 1)
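Concretely, the grid computation needs to round up rather than down; a quick sanity check with a hypothetical element count:

```cpp
// Corrected launch-grid computation (N is the element count from the code
// above; 256 matches the block size suggested earlier in the review):
constexpr int64_t block_size = 256;
const int64_t grid = (N + block_size - 1) / block_size;  // ceil(N / block_size)
// Sanity check with N = 1000: (1000 + 255) / 256 = 4 blocks, i.e. 1024 threads,
// while the buggy (N - block_size - 1) / block_size gives 743 / 256 = 2 blocks
// and would leave the last 488 elements unprocessed.
```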
Stack from ghstack:
Fixes gh-24571, fixes gh-24572
Closes gh-39586
Benchmarks
The benchmarks were run with nvprof calling the operator in a loop. They show reliable improvements for large tensors, but the TH implementation seems to fare better for smaller tensors. For sufficiently large tensors, the ATen implementation does win.
Differential Revision: D29538093