Fixed CUDA randint generation for large ranges. #126066

tringwald · 2024-05-13T13:38:55Z

For large ranges, calls to CUDA randint use a different unroll_factor to generate random ints. This unroll_factor was not considered correctly in the calculation of the Philox offsets. Thus, some of the random states were reused, resulting in lower entropy (see #125224).
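To illustrate the failure mode described above, here is a toy model of a counter-based generator. It is a sketch under stated assumptions, not the actual kernel code: the function names, the thread count, and the idea that each launch advances a shared offset by "engine calls per thread" are hypothetical simplifications. The point is only that computing the offset increment with a larger unroll factor than the one the kernel really uses makes the next launch start inside the counter range the previous launch already consumed.

```python
import math


def engine_calls_per_thread(numel, num_threads, unroll_factor):
    """Engine calls each thread makes when one call yields unroll_factor outputs."""
    return math.ceil(numel / (num_threads * unroll_factor))


def consumed_states(start_offset, numel, num_threads, unroll_factor):
    """Set of (thread, counter) pairs that one launch reads (toy model)."""
    calls = engine_calls_per_thread(numel, num_threads, unroll_factor)
    return {
        (thread, start_offset + call)
        for thread in range(num_threads)
        for call in range(calls)
    }


def overlap_between_launches(numel, num_threads, actual_unroll, assumed_unroll):
    """Count states shared by two consecutive launches.

    actual_unroll  -- outputs per engine call the kernel really produces
    assumed_unroll -- outputs per engine call used to compute the offset increment
    """
    first = consumed_states(0, numel, num_threads, actual_unroll)
    # The generator offset advances by what the host *thinks* was consumed.
    increment = engine_calls_per_thread(numel, num_threads, assumed_unroll)
    second = consumed_states(increment, numel, num_threads, actual_unroll)
    return len(first & second)


# Hypothetical numbers: 1024 outputs, 8 threads, kernel yields 2 outputs per call.
buggy = overlap_between_launches(1024, 8, actual_unroll=2, assumed_unroll=4)
fixed = overlap_between_launches(1024, 8, actual_unroll=2, assumed_unroll=2)
print(buggy, fixed)
```

In this model the mismatched increment leaves half of the first launch's states to be read again by the second launch, while the matching increment leaves none.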

pytorch-bot · 2024-05-13T13:38:58Z

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/126066


❌ 3 New Failures, 8 Pending, 1 Unrelated Failure

As of commit 1b281aa with merge base d578039:

NEW FAILURES - The following jobs have failed:

pull / linux-focal-cuda12.1-py3.10-gcc9-bazel-test / build-and-test (default, 1, 1, linux.4xlarge.nvidia.gpu) (gh)
Analysis of target '//:ATen_CPU_AVX2' failed; build aborted: Analysis failed
pull / linux-focal-cuda12.1-py3.10-gcc9-sm86 / test (default, 5, 5, linux.g5.4xlarge.nvidia.gpu) (gh)
test_linalg.py::TestLinalgCUDA::test_inverse_errors_large_cuda_complex128
pull / linux-focal-cuda12.4-py3.10-gcc9-bazel-test / build-and-test (default, 1, 1, linux.4xlarge.nvidia.gpu) (gh)

FLAKY - The following job failed but was likely due to flakiness present on trunk:

pull / linux-jammy-py3.8-gcc11 / test (distributed, 1, 2, linux.2xlarge) (gh) (similar failure)
Process completed with exit code 1.

This comment was automatically generated by Dr. CI and updates every 15 minutes.


tringwald · 2024-05-13T14:26:12Z

@r-barnes Thanks for reviewing, I added some type annotations and changed the C++ parameters to const.


tringwald · 2024-05-14T07:35:03Z

@pytorchbot rebase

pytorchmergebot · 2024-05-14T07:36:35Z

@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here

pytorchmergebot · 2024-05-14T07:36:40Z

Successfully rebased cuda-randint-randomness-for-large-range onto refs/remotes/origin/viable/strict, please pull locally before adding more changes (for example, via git checkout cuda-randint-randomness-for-large-range && git pull --rebase)

eqy · 2024-05-19T19:17:02Z

CC @drisspg who might know more about the SDPA tests

tringwald · 2024-05-19T20:54:10Z

Thanks @eqy. Those tests in test_transformers.py use torch._fill_mem_eff_dropout_mask_, which in turn calls a custom CUDA kernel to populate the dropout mask with uniform values before thresholding. I'm not sure why we don't use torch.rand there, but it seems like replacing the custom impl with torch.rand yields some weird test failures.
I've rolled back the test changes for now, so I can more easily debug the other failures, but we should probably reconsider if we need a custom rand impl for those tests.
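For readers unfamiliar with the pattern mentioned above (fill a buffer with uniform values, then threshold it to obtain a dropout mask), a minimal pure-Python sketch follows. It uses the standard library generator rather than any CUDA kernel, and the function name, the keep convention (`True` means keep), and the seed handling are assumptions for illustration only; it does not reproduce what `torch._fill_mem_eff_dropout_mask_` does internally.

```python
import random


def uniform_threshold_mask(numel, p_drop, seed):
    """Build a keep-mask by drawing uniforms in [0, 1) and thresholding at p_drop."""
    rng = random.Random(seed)  # seeded so the mask is reproducible
    uniforms = [rng.random() for _ in range(numel)]
    # An element is kept when its uniform draw is at least the drop probability.
    return [u >= p_drop for u in uniforms]


mask = uniform_threshold_mask(numel=16, p_drop=0.5, seed=0)
print(sum(mask), "of", len(mask), "elements kept")
```

Because the mask depends only on the uniform draws, any difference between two uniform generators (for example, a custom kernel versus a library call) changes the mask even when the seed is the same, which is one way such tests can become sensitive to the generator in use.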


tringwald requested a review from eqy as a code owner May 13, 2024 13:38

pytorchbot added the open source label May 13, 2024

r-barnes reviewed May 13, 2024

test/test_cuda.py (outdated)

test/test_cuda.py (outdated)

aten/src/ATen/native/cuda/DistributionTemplates.h (outdated)

tringwald mentioned this pull request May 13, 2024

Strange behavior of randint using device=cuda #125224

Open

tringwald force-pushed the cuda-randint-randomness-for-large-range branch from 3ea6988 to 849bf9e Compare May 13, 2024 21:26

eqy reviewed May 14, 2024

test/test_cuda.py (outdated)

eqy approved these changes May 14, 2024


pytorchmergebot force-pushed the cuda-randint-randomness-for-large-range branch from b09c3f1 to cb7925c Compare May 14, 2024 07:36

tringwald force-pushed the cuda-randint-randomness-for-large-range branch 4 times, most recently from 303b76e to 0a7226b Compare May 18, 2024 20:05

tringwald force-pushed the cuda-randint-randomness-for-large-range branch from ada1975 to 993afca Compare June 8, 2024 13:28

tringwald requested review from lezcano, nikitaved and IvanYashchuk as code owners June 8, 2024 21:36

pytorch-bot bot added the release notes: linalg_frontend label Jun 8, 2024

tringwald added 5 commits June 13, 2024 20:38

Fixed CUDA randint generation for large ranges.

2212d61

Fixed offset calculation.

77a2627

Replaced call to torch._fill_mem_eff_dropout_mask_ in transformer tests with torch.rand.

1697b08

Rolled back transformer test changes. Using original offset calculation with ceil.

bc08650

Increased the niter count for test_svd_lowrank. Apparently this test relied on overlapping random states, which should now be fixed.

e080abb

tringwald force-pushed the cuda-randint-randomness-for-large-range branch from cafe610 to e080abb Compare June 13, 2024 19:24

tringwald added 2 commits June 13, 2024 23:15

Reverted niter change, use total_elements upper bound for the offset.

4b21d52

Hardcoded the offset increments for every dist_func.

1b281aa
