torch.multinomial chooses elements with zero weight #13867

Closed
jcjohnson opened this issue Nov 12, 2018 · 16 comments

@jcjohnson (Contributor) commented Nov 12, 2018

🐛 Bug

torch.multinomial occasionally samples elements with zero weight. This should never happen.

To Reproduce

I've been unable to reproduce this issue with randomly generated weights, so I've included a particular value of weights from my application that triggers this behavior:

 wget https://cs.stanford.edu/people/jcjohns/weights.pt

These weights are all nonnegative (but contain a lot of zeros), have a nonzero sum, and contain no NaNs or Infs.
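For completeness, a quick check of those properties (illustrative only, not part of the original script) before the repro below:

import torch

# Illustrative check: the weights should be nonnegative, finite, and sum to a positive value.
weights = torch.load('weights.pt')
assert (weights >= 0).all()
assert torch.isfinite(weights).all()  # no NaNs or Infs
assert weights.sum().item() > 0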

import torch

torch.manual_seed(1)
weights = torch.load('weights.pt')
N, S = weights.shape[0], 4096
num_trials = 100
for trial in range(1, num_trials + 1):
  print('Starting trial %d / %d' % (trial, num_trials))
  weights[weights < 0] = 0.0
  samples = weights.multinomial(S, replacement=True)
  sampled_weights = weights[samples]
  assert sampled_weights.min() > 0

I fail the assertion on trial 6.

Environment

PyTorch version: 1.0.0.dev20181112
Is debug build: No
CUDA used to build PyTorch: 9.0.176

OS: Ubuntu 16.04.4 LTS
GCC version: (Ubuntu 5.4.0-6ubuntu1~16.04.10) 5.4.0 20160609
CMake version: version 3.5.1

Python version: 3.7
Is CUDA available: Yes
CUDA runtime version: 9.0.176
GPU models and configuration:
GPU 0: Quadro GP100
GPU 1: Quadro GP100

Nvidia driver version: 396.51
cuDNN version: Could not collect

Versions of relevant libraries:
[pip] Could not collect
[conda] pytorch 0.4.1 py37_py36_py35_py27__9.0.176_7.1.2_2 pytorch
[conda] pytorch-nightly 1.0.0.dev20181112 py3.7_cuda9.0.176_cudnn7.1.2_0 pytorch
[conda] torchvision 0.2.1
[conda] torchvision 0.2.1 py37_1 pytorch

@zou3519 (Contributor) commented Nov 12, 2018

@jcjohnson can you confirm that you are running the latest PyTorch when running this script (print(torch.__version__))? I think we fixed an identical bug a while ago, but it looks like that fix wasn't enough.

@jcjohnson (Contributor, Author) commented:

@zou3519 I just reinstalled from the nightly build, version 1.0.0.dev20181112. Can you point me to the earlier bugfix?

@zou3519 (Contributor) commented Nov 12, 2018

My bad, it looks like we fixed this for CUDA but we did not test on CPU: #4858. We'll look into it and get it fixed, thank you for the report :)

@jcjohnson (Contributor, Author) commented:

That's weird -- I'm seeing this issue only on CUDA, and it works properly when I cast weights to CPU.
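For reference, the CPU cross-check looks roughly like this (an illustrative sketch reusing weights and S from the repro script, not the exact code that was run):

weights_cpu = weights.cpu()
samples = weights_cpu.multinomial(S, replacement=True)
assert weights_cpu[samples].min() > 0  # passes on CPU, per the observation above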

@zou3519 (Contributor) commented Nov 12, 2018

Got it, I didn't realize your weights were on CUDA. I can reproduce the assertion failure using your weights, so something is indeed wrong with the multinomial implementation.

@zou3519 (Contributor) commented Nov 12, 2018

I'm wondering if floating-point error could be to blame. One interesting thing to note is that weights < 0 returns False for element 0:

(Pdb) weights
tensor([1.6399e-05, 1.1493e-05, 1.0797e-05,  ..., 0.0000e+00, 0.0000e+00,
        0.0000e+00], device='cuda:0')
(Pdb) weights < 0
tensor([0, 0, 0,  ..., 0, 0, 0], device='cuda:0', dtype=torch.uint8)
(Pdb) weights[weights < 0] = 0
(Pdb) weights[0]
tensor(1.6399e-05, device='cuda:0')

@jcjohnson (Contributor, Author) commented:

Isn't that correct? 1.6399e-05 is small but positive.

However, many of the weights are quite small (and will become even smaller if multinomial internally renormalizes them to sum to one), so I wouldn't be surprised if some floating-point error were to blame.
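To illustrate the scale involved (an illustrative example, not taken from the report): float32 can only resolve steps of roughly 6e-8 just below 1.0, so a renormalized weight much smaller than that contributes nothing once the running cumulative sum is close to 1.

import torch

eps = torch.finfo(torch.float32).eps               # ~1.19e-07
c = torch.tensor(0.9999999, dtype=torch.float32)   # a cumulative sum near 1
w = torch.tensor(1e-8, dtype=torch.float32)        # a tiny renormalized weight
print((c + w) == c)                                # True: w is absorbed entirely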

@zou3519 (Contributor) commented Nov 12, 2018

Of course -- my apologies, I was reading that too quickly.

@jcjohnson (Contributor, Author) commented:

No worries, I'm grateful for the fast response =)

@syed-ahmed (Contributor) commented Nov 13, 2018

@jcjohnson @zou3519 I think the problem is more with how we are seeding the Mersenne Twister engine. I recently learned that the Mersenne Twister's 19937-bit state is very prone to getting into a bad state when the engine is seeded with a number that has many zero bits ("all zeros causes it to not work at all, whereas lots of zero bits are merely bad" - http://www.pcg-random.org/posts/cpp-seeding-surprises.html). I ran your script with seed = 10, and it breaks the assertion at trial 17.

Your script passes with my current PR #13070 (the PR is almost done and is waiting on some builds). I have changed the CUDA generator engine for multinomial to the Philox engine, and I suppose the script passes because the Philox engine doesn't carry as much state as a Mersenne Twister engine and we are seeding it properly with a 64-bit number.

@D-X-Y commented Dec 11, 2018

In https://pytorch.org/docs/stable/torch.html?highlight=multinomial#torch.multinomial, the example is:

>>> weights = torch.tensor([0, 10, 3, 0], dtype=torch.float) # create a tensor of weights
>>> torch.multinomial(weights, 4)
tensor([ 1,  2,  0,  0])

Why does torch.multinomial output [1, 2, 0, 0]? Since replacement=False, it should not generate the same index twice.
Is this a bug, or can anyone help explain it?
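One hedged reading of that example (not an official answer): only two of the four categories have nonzero weight, so with replacement=False a request for four samples ends up returning zero-weight indices on this version, whereas requesting at most two is well defined:

import torch

weights = torch.tensor([0, 10, 3, 0], dtype=torch.float)
print(torch.multinomial(weights, 2))  # expected: indices 1 and 2 in some order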

@jcjohnson (Contributor, Author) commented:

Is there any update on this? It has been two months.

@syed-ahmed (Contributor) commented:

Hi @jcjohnson. Apologies for the super long delay! The PR I referred to above became huge for review, so I'm currently breaking it up into two parts. I promise to push both parts by the end of this week.

@jcjohnson (Contributor, Author) commented:

Thanks! Your PR looks pretty nontrivial indeed, so I'm not surprised it has taken a while to get sorted out. I'm looking forward to it!

@t-vi (Collaborator) commented Jan 16, 2019

So, in terms of a minimal fix:
The cumsum result (which I'm unfortunately not able to see by calling cumsum manually, but obtained with cuda-gdb) seems to include 0.99997884, 0.999978781 in that order at the critical positions, i.e. it is not monotonically non-decreasing.
Our logic to avoid zero-probability items essentially checks for cumdist[n-1] == cumdist[n], but that doesn't work here.

I think the main options for a minimal fix ("1.0.1") are

  • write a cumsum replacement that returns tensors with non-decreasing entries for non-negative inputs,
  • pass the non-cumulated distribution to the sampling/bisection and check that for 0 in the above line.

I would expect the second to be the least risky fix because it seems to add the least logic.
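For illustration, a CPU-side Python sketch of the second option (a sketch of the idea only; the actual fix belongs in the CUDA binary-search kernel, and torch.searchsorted stands in here for that bisection):

import torch

def multinomial_sketch(weights, num_samples):
    # weights: 1-D nonnegative tensor with a strictly positive sum
    cumdist = weights.cumsum(0)
    total = cumdist[-1]
    u = torch.rand(num_samples) * total
    # Bisection against the cumulative distribution, as before.
    idx = torch.searchsorted(cumdist, u, right=True).clamp(max=weights.numel() - 1)
    # Post-processing per the second option: consult the non-cumulated
    # distribution and step back past any zero-weight category the
    # bisection may have landed on.
    for i in range(num_samples):
        j = int(idx[i])
        while j > 0 and weights[j].item() == 0:
            j -= 1
        idx[i] = j
    return idx

This mirrors the eventual change described in the commit message below: the binary-search post-processing consults the (non-cumulated) distribution instead of relying on the cumsum alone.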

@t-vi (Collaborator) commented Jan 16, 2019

I seem to have a simpler repro:

# test corner case from Issue #13867
torch.cuda.manual_seed(33)
probs = torch.randn(1_000_000, device='cuda').clamp(min=0) * 3e-5
samples = probs.multinomial(1_000_000, replacement=True)
assert probs[samples].min().item() > 0

I'll have the PR in a few moments.

t-vi added a commit to t-vi/pytorch that referenced this issue Jan 16, 2019
The cumsum over the probabilities may fail to be monotonically
non-decreasing. Thus it is hard to detect zero-probability
classes using just the cumsum.
This changes the binary search post-processing to use the
(non-cumulated) distribution instead.

Thank you, @jcjohnson, for the bug report with a
reproducing case.

Fixes: pytorch#13867
soumith pushed a commit that referenced this issue Feb 4, 2019
Summary:
The cumsum over the probabilities may fail to be monotonically
non-decreasing. Thus it is hard to detect zero-probability
classes using just the cumsum.
This changes the binary search post-processing to use the
(non-cumulated) distribution instead.

Thank you, jcjohnson, for the bug report with a
reproducing case.

Fixes: #13867
Pull Request resolved: #16075

Differential Revision: D13695565

Pulled By: soumith

fbshipit-source-id: 02c4d6f868f0050c1ae7d333f4317c5610e49cd9