
improve handling of precision issue in torch.multinomial (solves #4858) #5774


Merged
merged 3 commits into pytorch:master on Mar 17, 2018

Conversation

@t-vi (Collaborator) commented Mar 14, 2018

Hi,

I think I figured out #4858:
If you run torch.cumsum(freqs, 0), you can see that the cumulative sum has not reached exactly 1.0 by the point where the trailing 0.0 probabilities begin, so there is a precision issue at the upper end of the range (which makes the breakage more subtle).
What seems to happen in aten/src/THC/THCTensorRandom.cuh is this: the random value val in (0.0, 1.0] generated by cuda_uniform is passed from sampleMultinomialWithReplacement to binarySearchForMultinomial, and in some cases (val = 1.0?) the comparison lt(midVal, val) apparently evaluates to true for every bin, so we end up with start = size (start being the result of the binary search).
The if (start == size) check then catches this precision issue, but it does the wrong thing by setting start = 0. The more correct thing is to set start = size - 1 (taking care of size = 0, unless the compiler can figure out that it cannot happen) and to let the code that follows find a non-zero-probability bin.
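
To make the failure mode concrete, here is a minimal Python sketch of the same logic (not the actual CUDA kernel); the distribution, the helper name, and the value of `val` below are illustrative, and the clamp at the end mirrors the proposed `start = size - 1` fix rather than the old `start = 0` behaviour:

```python
import torch

# Illustrative distribution with trailing zero-probability bins (as in #4858).
# Depending on rounding, the cumulative sum may top out just below 1.0.
freqs = torch.tensor([0.3, 0.3, 0.4, 0.0, 0.0], dtype=torch.float32)
freqs = freqs / freqs.sum()
cumdist = torch.cumsum(freqs, 0)

def binary_search_for_multinomial(cumdist, val):
    # Same idea as binarySearchForMultinomial in THCTensorRandom.cuh:
    # find the first bin whose cumulative probability is >= val.
    start, end = 0, len(cumdist)
    while start < end:
        mid = (start + end) // 2
        if float(cumdist[mid]) < val:   # corresponds to lt(midVal, val)
            start = mid + 1
        else:
            end = mid
    if start == len(cumdist):
        # val was larger than every cumulative value, e.g. val == 1.0 while
        # the cumulative sum sits slightly below 1.0 due to rounding.
        # The old code reset start to 0; the fix clamps to the last bin and
        # lets the follow-up code move to the last non-zero-probability entry.
        start = len(cumdist) - 1
    return start

# With the clamp, a random value of exactly 1.0 ends up in the last bin
# instead of wrapping around to bin 0.
print(binary_search_for_multinomial(cumdist, 1.0))
```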

With the attached small fix, @coventry's test case passes 1 million iterations, whereas it used to run into the error reliably within fewer than 10,000 iterations (and I checked that it was indeed reaching the start = 0 branch in those cases).

I hope this helps.

Best regards

Thomas

@zou3519 (Contributor) commented Mar 14, 2018

Thanks for the PR, @t-vi! Could you add a test case for this in test/test_cuda.py please?

@t-vi (Collaborator, Author) commented Mar 14, 2018

I can, but it won't actually catch the error reliably.
Edit: @zou3519 Would you want a test that tries a specific random seed that fails (which I would venture produces a "1.0" random number), or would you prefer one that tries 100k or 1M seeds?

@zou3519 (Contributor) left a review comment


Is the problem that the CUDA RNG is returning a number in the range (0.0, 1] instead of [0, 1)?

In addition, it's fine if the test doesn't always catch the error, as long as it doesn't error out on correct code. I don't know how random number generation works on CUDA, but if you can fix a random seed that makes the random number come out as 1, a test using that seed would work.

Minor comment below if we do go with this solution.

// the code below will move it to the last non-zero probability
// this actually can happen when the random number is 1
// (github pytorch issue #4858).
if (size > 0) {

@zou3519 (Contributor) commented Mar 14, 2018

Ah sorry, just read your comment @t-vi. I think a specific random seed that fails would be good. For the second case, if the test doesn't take a long time to run that would be good (but yeah it might take too long)

@t-vi (Collaborator, Author) commented Mar 14, 2018

So I have a random seed that does cause this, but I'm not sure how strongly it depends on specifics of the current setup of sampleMultinomialWithReplacement.
Trying 100k seeds takes ~70 seconds on my GTX1080, so it would seem to be a bit over the top.
However, the first error happens at seed 11042, so 10k seeds is not enough...
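
For reference, a fixed-seed regression test along the lines discussed might look roughly like the sketch below. The distribution, the seed handling, and the sample count are placeholders (the thread does not spell out the exact test that was added), so treat this purely as an illustration of the fixed-seed approach:

```python
import torch

def test_multinomial_rng_edge_case():
    # Placeholder seed: the discussion above mentions the first failure at
    # 11042 in the seed sweep, but whether a given seed triggers the bad
    # random value depends on the RNG setup, so this is illustrative only.
    torch.cuda.manual_seed(11042)
    # Distribution with trailing zero-probability bins, as in issue #4858.
    freqs = torch.cuda.FloatTensor([0.2, 0.3, 0.5, 0.0, 0.0])
    samples = torch.multinomial(freqs, 1000, replacement=True)
    # Zero-probability bins must never be drawn, even if the RNG produces
    # a value of exactly 1.0.
    assert (samples < 3).all()
```

In the actual test suite such a check would live in test/test_cuda.py and use that file's existing assertion helpers.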

@ezyang (Contributor) commented Mar 14, 2018

@pytorchbot test this please

@@ -140,8 +140,16 @@ __device__ int binarySearchForMultinomial(T* dist,


@ezyang (Contributor) commented Mar 15, 2018

@pytorchbot test this please

@ssnl (Collaborator) left a review comment


LGTM

@soumith merged commit 7cbe63d into pytorch:master on Mar 17, 2018
pytorchmergebot pushed a commit that referenced this pull request Nov 1, 2024
Fixes [#5774](pytorch/test-infra#5774)
# Overview
Remove benchmark tests from rerun-disabled-tests; these runs are considered non-unittest.
See the one-page doc: [[Bootcamp Task] Remove non-unittest test during rerun-disabled-tests](https://docs.google.com/document/d/1xffkt_LNC5ZLsoVQDmuKbNqYnMUW_xYYStv66Pr-qac/edit?tab=t.0)

# Steps to fix the issue
- [ ] Create inductor-unittest.yml to handle unit test and daily rerun for inductor
- [ ] Create Inductor-cu124-unittest.yml to handle unit tests and daily rerun for inductor-cu124
- [x] Disable benchmark test in mixed test such as CPP_Wrapper which includes both unittest and benchmark test

Pull Request resolved: #139422
Approved by: https://github.com/huydhn
pytorchmergebot pushed a commit that referenced this pull request Nov 1, 2024
Fixes [#5774](pytorch/test-infra#5774)
# Overview
Remove benchmark tests from rerun-disabled-tests; these runs are considered non-unittest.
See the one-page doc: [[Bootcamp Task] Remove non-unittest test during rerun-disabled-tests](https://docs.google.com/document/d/1xffkt_LNC5ZLsoVQDmuKbNqYnMUW_xYYStv66Pr-qac/edit?tab=t.0)

# Steps to fix the issue
- [ ] Create inductor-unittest.yml to handle unit test and daily rerun for inductor
- [x] [**THIS PR**] Create Inductor-cu124-unittest.yml to handle unit tests and daily rerun for inductor-cu124
- [ ] Disable benchmark test in mixed test such as CPP_Wrapper which includes both unittest and benchmark test

Pull Request resolved: #139407
Approved by: https://github.com/huydhn

Co-authored-by: Huy Do <huydhn@gmail.com>
pytorchmergebot pushed a commit that referenced this pull request Nov 1, 2024

Fixes [#5774](pytorch/test-infra#5774)
# Overview
Remove benchmark tests from rerun-disabled-tests; these runs are considered non-unittest.
See the one-page doc: [[Bootcamp Task] Remove non-unittest test during rerun-disabled-tests](https://docs.google.com/document/d/1xffkt_LNC5ZLsoVQDmuKbNqYnMUW_xYYStv66Pr-qac/edit?tab=t.0)

# Manual Test
- Test run Inductor.yml:
https://github.com/pytorch/pytorch/actions/runs/11603287758/job/32309968542?pr=139337
- Test run inductor-unittest.yml ([3cbd83d](3cbd83d))
https://github.com/pytorch/pytorch/actions/runs/11605399925/job/32315737205?pr=139337

# Steps to fix the issue

- [x]  [**THIS PR**] Create inductor-unittest.yml to handle unit test and daily rerun for inductor
- [ ] Create Inductor-cu124-unittest.yml to handle unit tests and daily rerun for inductor-cu124
- [ ] Disable benchmark test in mixed test such as CPP_Wrapper which includes both unittest and benchmark test

Pull Request resolved: #139337
Approved by: https://github.com/huydhn