improve handling of precision issue in torch.multinomial (solves #4858) #5774
Conversation
Thanks for the PR, @t-vi! Could you add a test case for this?
I can, but it won't actually catch the error reliably.
Is the problem that the CUDA RNG is returning a number in the range (0.0, 1] instead of [0, 1)?
In addition, it's fine if the test doesn't always catch the error, as long as it doesn't error out on correct code. I don't know how random number generation works on CUDA, but if you could fix a random seed that makes the random number 1, using that seed could be a test.
Minor comment below if we do go with this solution.
aten/src/THC/THCTensorRandom.cuh
Outdated
// the code below will move it to the last non-zero probability
// this actually can happen when the random number is 1
// (github pytorch issue #4858).
if (size > 0) {
Ah sorry, just read your comment @t-vi. I think a specific random seed that fails would be good. For the second case, if the test doesn't take a long time to run, that would be good (but yeah, it might take too long).
So I have a random seed that does cause this, but I'm not sure how strongly it depends on specifics of the current setup.
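For illustration, a minimal sketch of the kind of seed-pinned regression test being discussed; the seed, tensor sizes, and iteration count below are placeholders, not the values from the actual PR or issue:

```python
import torch

# A distribution in the style of issue #4858: many trailing bins with
# probability exactly 0.0.
freqs = torch.zeros(1000, device="cuda")
freqs[:500] = 1.0 / 500

# Placeholder seed: a real regression test would pin a seed that is known
# to make the CUDA generator produce the problematic draw (1.0).
torch.cuda.manual_seed(5774)

for _ in range(10_000):
    sample = torch.multinomial(freqs, 1000, replacement=True)
    # The bug in #4858 shows up as an index pointing at a 0.0-probability bin.
    assert bool((freqs[sample] > 0).all()), "sampled a zero-probability event"
```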
@pytorchbot test this please
@@ -140,8 +140,16 @@ __device__ int binarySearchForMultinomial(T* dist,
@pytorchbot test this please
LGTM
Hi,
I think I figured out #4858:
If you do `torch.cumsum(freqs, 0)`, you see that the cumulative sum isn't one by the time the first of the final 0.0 probabilities occurs, so there is a precision issue at the upper end of the range (making the breakage more subtle).
What seems to happen in `aten/src/THC/THCTensorRandom.cuh` is that the (0.0, 1.0] random value `val` generated by `cuda_uniform` is passed from `sampleMultinomialWithReplacement` to `binarySearchForMultinomial`; in some cases (1.0?) `lt(midVal, val)` seems to always evaluate to `true` and we end up with `start = size` (`start` being the result of the binary search).
Then the `if (start == size)` catches this precision issue but does the wrong thing in setting `start = 0`. A more correct thing is to set `start = size - 1` (taking care of `size = 0`, unless the compiler can figure out that it cannot happen) and let the code that follows find a non-zero-probability bin.
With the attached small fix, @coventry's test case passes 1 million iterations, where it used to run into an error reliably in fewer than 10,000 iterations (and I checked that it was indeed reaching the `start = 0` branch in this case).
I hope this helps.
Best regards
Thomas
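To make the failure mode concrete, here is a small CPU-side Python sketch that mimics the logic described above. It is an illustration, not the CUDA kernel itself: `binary_search_for_multinomial` is a pure-Python mock of `binarySearchForMultinomial`, the cumulative values are chosen by hand to imitate a float32 sum that stops just short of 1.0, and the walk-back loop reflects one reading of "the code that follows" (it skips buckets whose cumulative value does not increase, i.e. zero-probability bins).

```python
import torch

def binary_search_for_multinomial(cumdist, val):
    # Mock of the kernel's binary search: find the first bucket whose
    # cumulative probability is >= val.
    size = cumdist.numel()
    start, end = 0, size
    while end - start > 0:
        mid = start + (end - start) // 2
        if cumdist[mid].item() < val:   # corresponds to lt(midVal, val)
            start = mid + 1
        else:
            end = mid
    return start

# Hand-made cumulative distribution: bin 0 and the last two bins have zero
# probability, and rounding leaves the total just below 1.0 in float32.
cumdist = torch.tensor([0.0, 0.25, 0.5, 0.75, 0.99999994, 0.99999994, 0.99999994],
                       dtype=torch.float32)
size = cumdist.numel()

# The (0.0, 1.0] uniform draw can be exactly 1.0; every bucket then compares
# less-than val and the search runs off the end.
val = 1.0
start = binary_search_for_multinomial(cumdist, val)
assert start == size

# Old fallback: start = 0, which here returns bin 0, a zero-probability
# event (the symptom in #4858).
# Fixed fallback: clamp to the last bucket and let the walk-back below find
# the last bin with non-zero probability.
start = size - 1 if size > 0 else 0
while start >= 1 and cumdist[start].item() == cumdist[start - 1].item():
    start -= 1

print(start)  # 4: the last bucket whose probability is non-zero
```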