Migrate glu from the THC to ATen (CUDA) #61153

Closed
peterbell10 wants to merge 4 commits

Conversation

@peterbell10 (Collaborator) commented Jul 2, 2021

Stack from ghstack:

Fixes gh-24571, fixes gh-24572
Closes gh-39586

Benchmarks

The benchmarks were run with nvprof, calling the operator in a loop. They show
reliable improvements for large tensors, but the TH implementation seems to fare
better for smaller tensors; for sufficiently large tensors, the ATen
implementation does win.

|        Shape | Dim | Master Forward (us) | This PR Forward (us) | Master Backward (us) | This PR Backward (us) |
|-------------:|-----|:-------------------:|:--------------------:|:--------------------:|:---------------------:|
|    128, 1000 | 0   |        2.4770       |        2.0820        |        3.0440        |         3.4680        |
|              | 1   |        2.7060       |        4.4850        |        3.3380        |         3.6250        |
|   128, 10000 | 0   |        26.531       |        21.366        |        38.083        |         34.623        |
|              | 1   |        27.680       |        30.465        |        38.943        |         35.204        |
|  128, 100000 | 0   |        292.09       |        219.56        |        355.57        |         324.49        |
|              | 1   |        260.43       |        243.08        |        332.25        |         323.37        |
| 128, 1000000 | 0   |        2475.7       |        1874.6        |        3810.1        |         3215.7        |
|              | 1   |        2586.3       |        2380.9        |        3349.9        |         3207.8        |

Differential Revision: D29538093
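
The benchmark harness itself is not included in the PR. Below is a minimal sketch of the kind of loop one might profile under nvprof; the shape, dim, iteration counts, and gradient handling are illustrative placeholders, not the exact setup behind the numbers above.

```cpp
// Minimal sketch (not the exact harness used for the table): run glu forward
// and backward repeatedly so nvprof can average per-kernel times.
#include <torch/torch.h>
#include <cuda_runtime.h>

int main() {
  torch::manual_seed(0);
  auto x = torch::randn({128, 1000},
                        torch::dtype(torch::kFloat).device(torch::kCUDA))
               .requires_grad_(true);
  const int64_t dim = 0;

  auto step = [&] {
    auto y = at::glu(x, dim);   // forward kernel
    y.sum().backward();         // glu_backward kernel
    x.mutable_grad().reset();   // drop the gradient so it doesn't accumulate
  };

  for (int i = 0; i < 10; ++i) step();     // warm-up: allocator, CUDA context
  cudaDeviceSynchronize();

  for (int i = 0; i < 1000; ++i) step();   // measured region for nvprof
  cudaDeviceSynchronize();
  return 0;
}
```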

@facebook-github-bot (Contributor) commented Jul 2, 2021

💊 CI failures summary and remediations

As of commit 3ab233d (more details on the Dr. CI page and at hud.pytorch.org/pr/61153):



❄️ 1 failure tentatively classified as flaky, but reruns have not yet been triggered to confirm:

See CircleCI build pytorch_linux_xenial_cuda11_1_cudnn8_py3_gcc7_test2 (1/1)

Step: "Run tests" (full log | diagnosis details | 🔁 rerun) ❄️

Jul 03 03:45:30 unknown file: Failure
Jul 03 03:45:30 frame #7: build/bin/test_api() [0xc0b8c5]
Jul 03 03:45:30 frame #8: build/bin/test_api() [0xc0bb65]
Jul 03 03:45:30 frame #9: testing::internal::UnitTestImpl::RunAllTests() + 0xbf9 (0xc0cba9 in build/bin/test_api)
Jul 03 03:45:30 frame #10: testing::UnitTest::Run() + 0x8f (0xc0ce4f in build/bin/test_api)
Jul 03 03:45:30 frame #11: main + 0xc8 (0x5833a8 in build/bin/test_api)
Jul 03 03:45:30 frame #12: __libc_start_main + 0xf0 (0x7f6f462f1840 in /lib/x86_64-linux-gnu/libc.so.6)
Jul 03 03:45:30 frame #13: _start + 0x29 (0x5b9a19 in build/bin/test_api)
Jul 03 03:45:30 " thrown in the test body.
Jul 03 03:45:30 [  FAILED  ] IntegrationTest.MNIST_CUDA (4 ms)
Jul 03 03:45:30 [ RUN      ] IntegrationTest.MNISTBatchNorm_CUDA
Jul 03 03:45:30 unknown file: Failure
Jul 03 03:45:30 C++ exception with description "Error opening images file at test/cpp/api/mnist/train-images-idx3-ubyte
Jul 03 03:45:30 Exception raised from read_images at /var/lib/jenkins/workspace/torch/csrc/api/src/data/datasets/mnist.cpp:67 (most recent call first):
Jul 03 03:45:30 frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x6b (0x7f6f5fcf4c6b in /var/lib/jenkins/workspace/build/lib/libc10.so)
Jul 03 03:45:30 frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 0xce (0x7f6f5fcf047e in /var/lib/jenkins/workspace/build/lib/libc10.so)
Jul 03 03:45:30 frame #2: <unknown function> + 0x4201cf2 (0x7f6f64341cf2 in /var/lib/jenkins/workspace/build/lib/libtorch_cpu.so)
Jul 03 03:45:30 frame #3: torch::data::datasets::MNIST::MNIST(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, torch::data::datasets::MNIST::Mode) + 0x46 (0x7f6f64342d96 in /var/lib/jenkins/workspace/build/lib/libtorch_cpu.so)
Jul 03 03:45:30 frame #4: IntegrationTest_MNISTBatchNorm_CUDA_Test::TestBody() + 0x9d6 (0x7843b6 in build/bin/test_api)
Jul 03 03:45:30 frame #5: void testing::internal::HandleExceptionsInMethodIfSupported<testing::Test, void>(testing::Test*, void (testing::Test::*)(), char const*) + 0x4a (0xc1489a in build/bin/test_api)
Jul 03 03:45:30 frame #6: build/bin/test_api() [0xc0b2d6]
Jul 03 03:45:30 frame #7: build/bin/test_api() [0xc0b8c5]


peterbell10 added a commit that referenced this pull request Jul 2, 2021
Fixes gh-24571, fixes gh-24572
Closes gh-39586, closes gh-38697

ghstack-source-id: 197f4f7
Pull Request resolved: #61153
@peterbell10 added the "module: porting" label (Issues related to porting TH/THNN legacy to ATen native) on Jul 2, 2021
@peterbell10 requested a review from ngimel on July 2, 2021 at 00:07
// glu backward
// -----------------------------------
template <typename scalar_t, typename OffsetCalc, typename StrideType>
__global__ void glu_backward_kernel(
@ngimel (Collaborator) commented Jul 2, 2021

Why do you need a one-off kernel for this instead of a TensorIterator gpu_kernel? The reason it was done this way in THC was that the number of inputs was limited to 3, so these tricks were needed, but with TI there can be arbitrarily many inputs, and thus backward can be a regular TI kernel.

@peterbell10 (Collaborator, Author) commented Jul 2, 2021

I tried with gpu_kernel_multiple_outputs and it's ~1.5x slower in CUDA time; it also has greater CPU overhead since we need to dispatch 4 times to narrow the inputs/outputs.
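
For context, a rough sketch of the TensorIterator route being compared here (the four narrows plus ATen's internal `gpu_kernel_multiple_outputs` helper). This is illustrative rather than the code that was benchmarked; half/bfloat16 handling, shape checks, and the exact helper signatures are simplified.

```cpp
// Rough sketch of the TensorIterator-based alternative (not the PR's kernel):
// narrow input and grad_input into halves, then let TI handle the indexing.
// Would live in a .cu file inside ATen.
#include <ATen/ATen.h>
#include <ATen/native/TensorIterator.h>
#include <ATen/native/cuda/Loops.cuh>
#include <thrust/tuple.h>

namespace {

void glu_backward_ti_sketch(at::Tensor& grad_input, const at::Tensor& grad_output,
                            const at::Tensor& input, int64_t dim) {
  const int64_t half = input.size(dim) / 2;
  // The four extra dispatches mentioned above: two narrows into the input
  // halves and two into grad_input.
  auto a  = input.narrow(dim, 0, half);      // first half of input
  auto b  = input.narrow(dim, half, half);   // second half (the gate)
  auto ga = grad_input.narrow(dim, 0, half);
  auto gb = grad_input.narrow(dim, half, half);

  auto iter = at::TensorIteratorConfig()
                  .add_output(ga)
                  .add_output(gb)
                  .add_input(a)
                  .add_input(b)
                  .add_input(grad_output)
                  .build();

  AT_DISPATCH_FLOATING_TYPES(iter.common_dtype(), "glu_backward_ti_sketch", [&] {
    at::native::gpu_kernel_multiple_outputs(
        iter, [] GPU_LAMBDA(scalar_t a_v, scalar_t b_v, scalar_t go)
                  -> thrust::tuple<scalar_t, scalar_t> {
          const scalar_t sig_b = scalar_t(1) / (scalar_t(1) + std::exp(-b_v));
          return thrust::make_tuple(
              go * sig_b,                                   // grad wrt first half
              go * a_v * sig_b * (scalar_t(1) - sig_b));    // grad wrt gate half
        });
  });
}

}  // namespace
```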

@ngimel (Collaborator) left a review comment

Minor comment about code structure

return grad_input;
}

Tensor glu_backward_cuda(const Tensor& grad_output, const Tensor& input, int64_t dim) {
Collaborator

Given that this and glu_backward_cuda_out are CUDA-only functions, you can put them directly in Activation.cu and skip the REGISTER/DECLARE/DEFINE of glu_backward_cuda_stub.
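
To illustrate the structural point being made, here is a plain-C++ sketch of the indirection in question (hypothetical names, not ATen's actual DispatchStub machinery): the stub pattern lets a shared .cpp call whichever backend registered itself, which buys nothing when only CUDA implements the op.

```cpp
// Plain-C++ illustration of the stub indirection vs. a direct definition.
#include <cstdio>

// Stub pattern: a function pointer the backend fills in at load time
// (roughly what DECLARE/DEFINE/REGISTER_DISPATCH generate).
using glu_backward_fn = void (*)(float* grad_in, const float* grad_out, int n);
static glu_backward_fn glu_backward_stub = nullptr;

static void glu_backward_cuda_impl(float* grad_in, const float* grad_out, int n) {
  for (int i = 0; i < n; ++i) grad_in[i] = grad_out[i];  // stand-in for a kernel launch
}

static const bool registered = [] {      // REGISTER_DISPATCH analogue
  glu_backward_stub = glu_backward_cuda_impl;
  return true;
}();

// Direct approach (the suggestion for a CUDA-only op): define and call the
// backend function directly, no stub needed.
static void glu_backward_cuda(float* grad_in, const float* grad_out, int n) {
  glu_backward_cuda_impl(grad_in, grad_out, n);
}

int main() {
  float go[4] = {1, 2, 3, 4}, gi[4] = {};
  glu_backward_stub(gi, go, 4);  // call through the registered stub
  glu_backward_cuda(gi, go, 4);  // or call directly
  std::printf("%.1f\n", gi[0]);
  return 0;
}
```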

peterbell10 added a commit that referenced this pull request Jul 2, 2021
ghstack-source-id: 7c5b8b2
Pull Request resolved: #61153
@ngimel (Collaborator) commented Jul 2, 2021

@ngimel has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

// We explicitly iterate over the first half of the input tensor, and
// gI_byte_offset and I_byte_offset are the offsets to access the
// corresponding index in the second half of the tensor.
CUDA_KERNEL_LOOP(i, numel) {
Collaborator

nit: you don't need CUDA_KERNEL_LOOP here as you have num_threads >= num_elements
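
As a standalone illustration of this nit (hypothetical names, not the PR's kernel): CUDA_KERNEL_LOOP expands to a grid-stride loop, which is only needed when the grid might not cover every element; with a ceil-div launch that guarantees one thread per element, a single bounds check suffices.

```cpp
#include <cstdio>
#include <cuda_runtime.h>

constexpr int kBlockSize = 256;

// Grid-stride form (what CUDA_KERNEL_LOOP expands to): safe for any grid size.
__global__ void scale_grid_stride(float* x, int n, float a) {
  for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
       i += blockDim.x * gridDim.x) {
    x[i] *= a;
  }
}

// Single-check form: sufficient when the launch guarantees grid * block >= n.
__global__ void scale_one_pass(float* x, int n, float a) {
  const int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) {
    x[i] *= a;
  }
}

int main() {
  const int n = 1 << 20;
  float* x = nullptr;
  cudaMalloc(&x, n * sizeof(float));
  cudaMemset(x, 0, n * sizeof(float));
  const int grid = (n + kBlockSize - 1) / kBlockSize;  // ceil-div covers all n elements
  scale_grid_stride<<<grid, kBlockSize>>>(x, n, 2.0f);
  scale_one_pass<<<grid, kBlockSize>>>(x, n, 0.5f);
  cudaDeviceSynchronize();
  cudaFree(x);
  std::printf("done\n");
  return 0;
}
```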

const auto N = iter.numel();
auto offset_calculator = make_element_offset_calculator<3>(iter);
TORCH_INTERNAL_ASSERT_DEBUG_ONLY(N > 0 && N <= std::numeric_limits<int32_t>::max());
int64_t grid = (N + NUM_THREADS - 1) / NUM_THREADS;
Collaborator

NUM_THREADS and num_threads are 64, which is fine for vectorized TI kernels, but here I verified that using num_threads=256 speeds things up for smaller sizes by ~10-15%.

peterbell10 added a commit that referenced this pull request Jul 2, 2021
ghstack-source-id: 90a6726
Pull Request resolved: #61153
@ngimel (Collaborator) commented Jul 2, 2021

@ngimel has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

TORCH_INTERNAL_ASSERT_DEBUG_ONLY(N > 0 && N <= std::numeric_limits<int32_t>::max());
const auto offset_calculator = make_element_offset_calculator<3>(iter);
constexpr int64_t block_size = 256;
const int64_t grid = (N - block_size - 1) / block_size;
Collaborator

Typo here: this should be (N + block_size - 1).
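
Spelled out as a standalone helper (the names mirror the snippet above), the corrected ceiling division looks like this; the `-` variant under-counts blocks and would leave elements unprocessed.

```cpp
// The corrected grid computation: '+' rounds up, so every element gets a block.
#include <cstdint>

constexpr int64_t block_size = 256;

constexpr int64_t ceil_div(int64_t n, int64_t d) {
  return (n + d - 1) / d;
}

static_assert(ceil_div(1, block_size) == 1, "a single element still needs one block");
static_assert(ceil_div(256, block_size) == 1, "exact multiple: one block");
static_assert(ceil_div(257, block_size) == 2, "one element over: one extra block");

int main() { return 0; }
```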

peterbell10 added a commit that referenced this pull request Jul 3, 2021
ghstack-source-id: d25ede7
Pull Request resolved: #61153
@ngimel (Collaborator) commented Jul 3, 2021

@ngimel has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@facebook-github-bot (Contributor)
@ngimel merged this pull request in 0dc4047.

@facebook-github-bot deleted the gh/peterbell10/92/head branch on July 10, 2021 at 14:16
Labels
cla signed · Merged · module: porting (Issues related to porting TH/THNN legacy to ATen native) · open source