Fix for possible RNG offset calculation bug in cuda vectorized dropout with VEC=2 #50110

mcarilli · 2021-01-05T22:08:13Z

The offset calculation (which gives an estimated ceiling on the number of 32-bit values in the philox sequence that any thread in the launch will use) uses the hardcoded UNROLL value of 4, and assumes the hungriest threads can use every value (.x, .y, .z, and .w) their curand_uniform4 calls provide. However, the way fused_dropout_kernel_vec is currently written, that assumption isn't true in the VEC=2 case: in each iteration of the grid x VEC stride loop, each thread calls curand_uniform4 once, uses rand.x and rand.y, and discards rand.z and rand.w. This means (I think) that, for a given totalElements, curand_uniform4 may be called twice as many times per thread in the VEC=2 case as in the VEC=4 case or the fully unrolled code path. The offset calculation is a good estimate for the latter two cases, but is probably wrong (too low) for the fused_dropout_kernel_vec<..., /*VEC=*/2> code path.
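To make the mismatch concrete, here is a rough host-side sketch of the arithmetic, not the kernel code itself. The names `kUnroll`, `ceil_div`, `estimated_offset`, and `values_drawn_no_reuse` are hypothetical, and the form of the estimate is an assumption based on the description above (a per-thread ceiling rounded up in units of UNROLL). It compares that estimate with the number of 32-bit values the hungriest thread would draw if every loop iteration calls curand_uniform4 once.

```cpp
#include <cstdint>

// Hypothetical host-side model; names are illustrative, not taken from Dropout.cu.
constexpr std::int64_t kUnroll = 4;  // the hardcoded UNROLL value mentioned above

// Ceiling division for positive integers.
constexpr std::int64_t ceil_div(std::int64_t a, std::int64_t b) {
  return (a + b - 1) / b;
}

// Assumed form of the per-thread offset estimate: every thread is expected to
// make full use of all four values from each curand_uniform4 call.
constexpr std::int64_t estimated_offset(std::int64_t total, std::int64_t threads) {
  return ceil_div(total, threads * kUnroll) * kUnroll;
}

// 32-bit values drawn by the hungriest thread (thread 0) when each iteration
// of the grid x VEC stride loop calls curand_uniform4 exactly once.
constexpr std::int64_t values_drawn_no_reuse(std::int64_t total,
                                             std::int64_t threads,
                                             std::int64_t vec) {
  return ceil_div(total, threads * vec) * 4;
}

// Example launch: 2^20 elements, 64 blocks of 256 threads.
static_assert(estimated_offset(1 << 20, 256 * 64) == 64, "estimate");
static_assert(values_drawn_no_reuse(1 << 20, 256 * 64, 4) == 64, "VEC=4 fits");
static_assert(values_drawn_no_reuse(1 << 20, 256 * 64, 2) == 128, "VEC=2 overshoots");
```

Under these assumptions, the VEC=4 path stays within the estimate while the VEC=2 path draws twice as many values as the offset accounts for.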

The present PR inserts some value-reuse in fused_dropout_kernel_vec to align the number of times curand_uniform4 is called for launches with the same totalElements in the VEC=2 and VEC=4 cases. The diff should

- make the offset calculation valid for all code paths
- provide a very small perf boost by reducing the number of curand_uniform4 calls in the VEC=2 path
- ~~make results bitwise accurate for all code paths~~ nvm, tensor elements are assigned to threads differently in the unrolled, VEC 2 and VEC 4 cases, so we're screwed here no matter what.
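The value-reuse idea can be sketched as a small host-side model of one thread's loop. This is only an illustration under assumptions, not the code in the diff: `curand_calls` and `have_leftover` are hypothetical names, and the model just counts how often a thread would call curand_uniform4 if, in the VEC=2 case, it consumed the leftover rand.z and rand.w on every other iteration. It needs C++14 or later because of the loop inside a constexpr function.

```cpp
#include <cstdint>

// Hypothetical model of a single thread; not the kernel code from the PR.
// Returns how many times a thread that runs `iterations` passes of the
// grid x VEC stride loop would call curand_uniform4.
constexpr std::int64_t curand_calls(std::int64_t iterations, int vec, bool reuse) {
  std::int64_t calls = 0;
  bool have_leftover = false;  // true while rand.z / rand.w from the last call are unused
  for (std::int64_t i = 0; i < iterations; ++i) {
    if (vec == 2 && reuse && have_leftover) {
      // Reuse step: consume the two leftover values instead of drawing new ones.
      have_leftover = false;
    } else {
      ++calls;                     // fresh curand_uniform4 call: four new values
      have_leftover = (vec == 2);  // VEC=2 uses .x/.y only, so .z/.w remain
    }
  }
  return calls;
}
```

For the same totalElements, a VEC=2 thread runs roughly twice as many loop iterations as a VEC=4 thread, so in this model `curand_calls(32, 2, true)` and `curand_calls(16, 4, true)` both come out to 16 calls, whereas `curand_calls(32, 2, false)` is 32.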

@ngimel what do you think?

facebook-github-bot · 2021-01-05T22:08:22Z

💊 CI failures summary and remediations

As of commit 9a03677 (more details on the Dr. CI page):

3/9 failures possibly* introduced in this PR
- 3/3 non-CircleCI failure(s)
6/9 broken upstream at merge base 093aca0 on Jan 05 from 11:00am to 6:28pm

🚧 6 fixed upstream failures:

These were probably caused by upstream breakages that were already fixed.

Please rebase on the viable/strict branch.

If your commit is older than viable/strict, run these commands:

git fetch https://github.com/pytorch/pytorch viable/strict
git rebase FETCH_HEAD

- pytorch_linux_xenial_py3_6_gcc5_4_test on Jan 05 from 11:00am to 5:24pm (04e86be - 9529ae3)
- pytorch_windows_vs2019_py36_cuda10.1_test1 on Jan 05 from 1:31pm to 2:57pm (e442ac1 - e3c56dd)
- pytorch_linux_bionic_py3_6_clang9_test on Jan 05 from 11:00am to 5:24pm (04e86be - 4a6c178)
- pytorch_linux_xenial_cuda10_2_cudnn7_py3_gcc7_test1 on Jan 05 from 11:03am to 6:28pm (bbae677 - 5e1c8f2)
- pytorch_linux_xenial_py3_clang5_asan_test1 on Jan 05 from 11:27am to 6:07pm (c115957 - 9529ae3)
- pytorch_linux_bionic_py3_8_gcc9_coverage_test1 on Jan 05 from 11:27am to 6:28pm (c115957 - 9529ae3)

ci.pytorch.org: 2 failed

Failed: pr/caffe2-pytorch-linux-bionic-rocm3.10-py3.6-test
Failed: pr/pytorch-linux-bionic-rocm3.10-py3.6

This comment was automatically generated by Dr. CI.

This comment has been revised 19 times.

facebook-github-bot

@ngimel has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

facebook-github-bot · 2021-01-06T06:41:32Z

@ngimel merged this pull request in 57d489e.

…t with VEC=2 (pytorch#50110)

Summary:
The [offset calculation](https://github.com/pytorch/pytorch/blob/e3c56ddde67ca1a49159ffa886d889b6e65c7033/aten/src/ATen/native/cuda/Dropout.cu#L328) (which gives an estimated ceiling on the most 32-bit values in the philox sequence any thread in the launch will use) uses the hardcoded UNROLL value of 4, and assumes the hungriest threads can use every value (.x, .y, .z, and .w) their curand_uniform4 calls provide. However, the way fused_dropout_kernel_vec is currently written, that assumption isn't true in the VEC=2 case: Each iteration of the `grid x VEC` stride loop, each thread calls curand_uniform4 once, uses rand.x and rand.y, and discards rand.z and rand.w. This means (I _think_) curand_uniform4 may be called twice as many times per thread in the VEC=2 case as for the VEC=4 case or the fully unrolled code path, which means the offset calculation (which is a good estimate for the latter two cases) is probably wrong for the `fused_dropout_kernel_vec<..., /*VEC=*/2>` code path.

The present PR inserts some value-reuse in fused_dropout_kernel_vec to align the number of times curand_uniform4 is called for launches with the same totalElements in the VEC=2 and VEC=4 cases. The diff should
- make the offset calculation valid for all code paths
- provide a very small perf boost by reducing the number of curand_uniform4 calls in the VEC=2 path
- ~~make results bitwise accurate for all code paths~~ nvm, tensor elements are assigned to threads differently in the unrolled, VEC 2 and VEC 4 cases, so we're screwed here no matter what.

ngimel what do you think?

Pull Request resolved: pytorch#50110

Reviewed By: smessmer

Differential Revision: D25790121

Pulled By: ngimel

fbshipit-source-id: f8f533ad997268c6f323cf4d225de547144247a8

compiles

180e173

facebook-github-bot added the cla signed label Jan 5, 2021

comment

9a03677

mcarilli changed the title ~~Possible fix for possible bug with cuda vectorized dropout in the VEC=2 case~~ Fix for possible RNG offset calculation bug in cuda vectorized dropout with VEC=2 Jan 5, 2021

pytorchbot added the open source label Jan 5, 2021

ngimel approved these changes Jan 5, 2021

facebook-github-bot reviewed Jan 5, 2021

facebook-github-bot closed this in 57d489e Jan 6, 2021

facebook-github-bot added the Merged label Jan 6, 2021
