stats broadcasting in rvs (Trac #1544) #2069

Closed · scipy-gitbot opened this issue Apr 25, 2013 · 22 comments
Labels: defect (a clear bug or issue that prevents SciPy from being installed or used as expected), Migrated from Trac, scipy.stats

@scipy-gitbot
Original ticket http://projects.scipy.org/scipy/ticket/1544 on 2011-10-30 by @josef-pkt, assigned to unknown.

I thought that rvs would create a broadcast random array for different loc values:

import numpy as np
from scipy import stats


bins = stats.norm.ppf(np.linspace(0,1,11))

loc = np.zeros(1000)
scale = 1

print(bins)
print()

np.random.seed(7654321)
print(np.histogram(stats.norm.rvs(size=loc.shape), bins)[0])
np.random.seed(7654321)
print(np.histogram(loc + scale * stats.norm.rvs(size=loc.shape), bins)[0])
np.random.seed(7654321)
print(np.histogram(stats.norm.rvs(loc=loc, scale=scale), bins)[0])
np.random.seed(7654321)
print(np.histogram(stats.norm(loc=loc, scale=scale).rvs(), bins)[0])
np.random.seed(7654321)

However, it just draws one random variate and broadcasts it, so all the values are identical:

[       -inf -1.28155157 -0.84162123 -0.52440051 -0.2533471   0.          0.2533471
  0.52440051  0.84162123  1.28155157         inf]

[ 99 104  92 100 103  95  96  95 108 108]
[ 99 104  92 100 103  95  96  95 108 108]
[   0    0 1000    0    0    0    0    0    0    0]
[   0    0 1000    0    0    0    0    0    0    0]
>>> rvs = stats.norm(loc=loc, scale=scale).rvs()
>>> rvs[:10]
array([-0.70256379, -0.70256379, -0.70256379, -0.70256379, -0.70256379,
       -0.70256379, -0.70256379, -0.70256379, -0.70256379, -0.70256379])
@scipy-gitbot

@WarrenWeckesser wrote on 2011-10-31

It does broadcast; you just have to give arguments with compatible shapes to get the broadcasting that you want:

In [35]: stats.norm(loc=array([0.0, 10.0, 100.0]), scale=10).rvs()
Out[35]: array([-11.81964492,  -1.81964492,  88.18035508])

In [36]: stats.norm(loc=array([0.0, 10.0, 100.0]), scale=array([[1.0], [10.0]])).rvs()
Out[36]: 
array([[ -0.59910884,   9.40089116,  99.40089116],
       [ -5.99108836,   4.00891164,  94.00891164]])

In [37]: stats.norm(loc=array([0.0, 10.0, 100.0]), scale=array([[1.0], [10.0]])).rvs(size=(2,3))
Out[37]: 
array([[  0.52229266,  10.732356  ,  99.6656165 ],
       [  8.80268499,   1.22946138,  92.26971629]])

In [38]: stats.norm(loc=array([0.0, 10.0, 100.0]), scale=array([[1.0], [10.0]])).rvs(size=(3,2,3))
Out[38]: 
array([[[  -0.30050128,   11.66806108,   98.84132578],
        [ -14.6134854 ,    7.83944809,  117.22379634]],

       [[   1.36336413,    9.35687459,   99.07673582],
        [ -10.38386458,   11.80348136,  114.20432812]],

       [[  -0.45469292,    8.51569085,   99.4730159 ],
        [  -8.69438476,    6.80314046,   86.96980011]]])

In input [36], loc has shape (3,) and scale has shape (2,1), which broadcast to (2,3).
In input [38], the desired size is given as (3,2,3). This shape is compatible for broadcasting with loc and scale, so the output has shape (3,2,3).

@scipy-gitbot

@WarrenWeckesser wrote on 2011-10-31

Sorry, my previous post missed the key issue, which is the following surprising behavior:

In [47]: stats.norm(loc=zeros(5)).rvs()
Out[47]: array([ 0.0450558,  0.0450558,  0.0450558,  0.0450558,  0.0450558])

In [48]: stats.norm.rvs(loc=np.zeros(5))
Out[48]: array([-0.51003533, -0.51003533, -0.51003533, -0.51003533, -0.51003533])

Moreover, the "random" output shown in Out [36] of my previous post looks decidedly nonrandom.

It appears that the given value of the 'size' keyword determines the number of random values generated; after generating this many samples, the result is broadcast with the other arguments. Smells like a bug to me. Is there a good use case for the current behavior?
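
Roughly, what the old code did (an illustrative sketch, not the actual scipy source) is draw `size` samples, defaulting to one, and only then apply loc and scale, so numpy broadcasting replicates the single draw:

import numpy as np

# Illustrative sketch of the pre-fix behavior: one standard normal draw is
# made because size defaults to 1, and the loc/scale shift then broadcasts
# that single value across all of loc.
loc = np.zeros(5)
scale = 1.0
vals = np.random.standard_normal(size=1)  # a single draw
result = loc + scale * vals               # broadcast to five identical values
print(result)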

A work-around is to give the size explicitly. For example:

In [49]: stats.norm.rvs(loc=np.zeros(5), size=5)
Out[49]: array([-0.18638924,  1.98262646,  1.72811717,  1.3536744 , -0.52807763])

In [50]: stats.norm.rvs(loc=np.zeros((5,1)), size=5)
Out[50]: 
array([[-1.40732211,  2.25772776,  0.65555884, -0.85373538, -0.29639723],
       [-1.40732211,  2.25772776,  0.65555884, -0.85373538, -0.29639723],
       [-1.40732211,  2.25772776,  0.65555884, -0.85373538, -0.29639723],
       [-1.40732211,  2.25772776,  0.65555884, -0.85373538, -0.29639723],
       [-1.40732211,  2.25772776,  0.65555884, -0.85373538, -0.29639723]])

In [51]: stats.norm.rvs(loc=np.zeros((5,1)), size=(5,5))
Out[51]: 
array([[-0.02202415, -0.38135011,  0.19462152, -0.75862294,  0.26596077],
       [-0.00354676,  0.02734615,  1.39238526, -0.13878292, -1.02193461],
       [ 0.62605714, -1.02270551, -0.33135211, -0.90668301, -1.64451121],
       [ 0.52518014, -0.84042038,  0.70364437, -0.32555321, -0.92514762],
       [-0.07142783,  0.19321172,  1.65455672,  0.67108554,  0.95630008]])

@scipy-gitbot

@WarrenWeckesser wrote on 2011-10-31

It gets worse: it seems broadcasting of additional shape parameters does not work. For example, the gamma distribution has one additional parameter. In the following, In [103] works as expected. I expected In [104] to produce an array with shape (2,2), but instead it raises an exception. And I would expect In [105] to complain about incompatible shapes.

In [103]: stats.gamma.rvs(2.0, size=(2,2))
Out[103]: 
array([[ 0.21833399,  0.28700829],
       [ 5.51574125,  1.68236874]])

In [104]: stats.gamma.rvs(array([2.0, 5.0]), size=(2,2))
ERROR: An unexpected error occurred while tokenizing input
The following traceback may be corrupted or invalid
The error message is: ('EOF in multi-line statement', (102, 0))

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
/Users/warren/training_notes/statistics/<ipython-input-104-856552d2e375> in <module>()
----> 1 stats.gamma.rvs(array([2.0, 5.0]), size=(2,2))

/Users/warren/local_scipy/lib/python2.7/site-packages/scipy/stats/distributions.pyc in rvs(self, *args, **kwds)
    694             return loc*ones(size, 'd')
    695 
--> 696         vals = self._rvs(*args)
    697         if self._size is not None:
    698             vals = reshape(vals, size)

/Users/warren/local_scipy/lib/python2.7/site-packages/scipy/stats/distributions.pyc in _rvs(self, a)
   3323     """
   3324     def _rvs(self, a):
-> 3325         return mtrand.standard_gamma(a, self._size)
   3326     def _pdf(self, x, a):
   3327         return exp(self._logpdf(x, a))

/Library/Frameworks/Python.framework/Versions/7.1/lib/python2.7/site-packages/numpy/random/mtrand.so

in mtrand.RandomState.standard_gamma (numpy/random/mtrand/mtrand.c:8075)()

/Library/Frameworks/Python.framework/Versions/7.1/lib/python2.7/site-packages/numpy/random/mtrand.so

in mtrand.cont1_array (numpy/random/mtrand/mtrand.c:1691)()

ValueError: shape mismatch: objects cannot be broadcast to a single shape

In [105]: stats.gamma.rvs(array([2.0, 5.0, 10.0, 15.0]), size=(2,2))
Out[105]: 
array([[  1.46430036,   4.39812092],
       [ 10.84626667,  16.80892809]])

@rgommers

rgommers commented Jun 9, 2013

Similar broadcasting issue described at http://thread.gmane.org/gmane.comp.python.scientific.user/33953

Looks like there are no generic broadcasting tests. One more gap that needs to be closed.
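
A generic broadcasting test might look roughly like this (a hypothetical, pytest-style sketch written against the numpy-style size convention, not a test from the scipy suite):

import numpy as np
from scipy import stats

def test_rvs_broadcasts_loc():
    # loc has shape (3,); with the numpy-style convention, size=(2, 3) is
    # broadcast-compatible with it and every element is an independent draw.
    sample = stats.norm.rvs(loc=[0.0, 10.0, 100.0], scale=1.0, size=(2, 3),
                            random_state=np.random.RandomState(1234))
    assert sample.shape == (2, 3)
    assert len(np.unique(sample)) == sample.size  # no repeated values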

@WarrenWeckesser added this to the 0.17.0 milestone Nov 19, 2015
@WarrenWeckesser

This was recently reported on Stack Overflow: http://stackoverflow.com/questions/33810477/unexpected-scipy-stats-uniform-behaviour

From the link: "This bug or feature has cost me a couple of days of confusion and debugging in my larger program." Ouch.

I'm upping the priority of this issue, and setting the milestone to 0.17. Generating nonrandom rvs is really bad.

@WarrenWeckesser

Ugh. Fixing this so the API is nice and numpythonic (i.e. fully broadcasting) will almost certainly break backwards compatibility for anyone who has noticed the problem and used what looks like a reasonable work-around.

The way I wish rvs worked is as follows:

  1. Broadcast all the requested parameters (i.e. arg1, arg2, ... (or their named variants), along with loc and scale). E.g. p = np.broadcast_arrays(*(args + (loc,) + (scale,))) (not valid code, but I hope the intent is clear).
  2. For each parameter combination in p, generate size random samples.

The shape of the return value:

     size       return shape
-------------   ------------------
       1        p.shape
      >1        p.shape + (size,)
tuple of ints   p.shape + size

Some examples, with input parameters arg1 and arg2 (the shape () means a scalar):

arg1.shape = (5,)
arg2.shape = (5,)
size = 1
p = np.broadcast_arrays(arg1, arg2) has shape (5,).
-> return shape is (5,)

arg1.shape = (5,)
arg2.shape = (5,)
size = 3
p = np.broadcast_arrays(arg1, arg2) has shape (5,).
-> return shape is (5, 3)

arg1.shape = ()
arg2.shape = ()
size = 1
p = np.broadcast_arrays(arg1, arg2) has shape ().
-> return shape is ()

arg1.shape = ()
arg2.shape = ()
size = 3
p = np.broadcast_arrays(arg1, arg2) has shape ().
-> return shape is (3,)

arg1.shape = (3, 1)
arg2.shape = (1, 4)
size = 5
p = np.broadcast_arrays(arg1, arg2) has shape (3, 4).
-> return shape is (3, 4, 5)

arg1.shape = (3, 1)
arg2.shape = (1, 4)
size = (2, 2)
p = np.broadcast_arrays(arg1, arg2) has shape (3, 4).
-> return shape is (3, 4, 2, 2)
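
A minimal helper sketching this shape rule (hypothetical code, not a scipy API):

import numpy as np

def proposed_rvs_shape(size, *params):
    # Hypothetical helper: broadcast the shape parameters, then append `size`
    # according to the table above.
    p_shape = np.broadcast_arrays(*[np.asarray(a) for a in params])[0].shape
    if size == 1:
        return p_shape
    if np.isscalar(size):
        return p_shape + (size,)
    return p_shape + tuple(size)

print(proposed_rvs_shape(3, np.zeros(5), np.ones(5)))                 # (5, 3)
print(proposed_rvs_shape((2, 2), np.zeros((3, 1)), np.ones((1, 4))))  # (3, 4, 2, 2)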

The problem is that, as was noted above (and I even suggested it as a work-around in the recent question on stackoverflow), currently a call such as

rvs(np.zeros(5), np.arange(1, 6), size=5)

generates an array with shape (5,). That is, for each combination of parameters, it generates one sample. With the "fully broadcasting" API, that call should create an array with shape (5, 5).

@WarrenWeckesser

Proposal 1:

  • Add code to the rvs method that raises an exception when it is given arguments that currently generate nonrandom variates. (To be honest, I haven't thought this through completely, but I think this can be done.)
  • Deprecate rvs.
  • Add a new method, say random_variates (or something more succinct), that implements the fully broadcasting API.
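
A rough sketch of the guard described in the first bullet (hypothetical helper, not what scipy implements):

import numpy as np

def check_rvs_size(size, *params):
    # Refuse calls whose output would contain repeated (nonrandom) draws
    # under the old behavior: `size` samples are drawn and then broadcast
    # against the shape parameters, so repeats appear whenever the number
    # of draws is smaller than the number of output elements.
    p_shape = np.broadcast_arrays(*[np.asarray(a) for a in params])[0].shape
    size_shape = () if size is None else tuple(np.atleast_1d(size))
    out_shape = np.broadcast_arrays(np.empty(p_shape), np.empty(size_shape))[0].shape
    if int(np.prod(size_shape)) < int(np.prod(out_shape)):
        raise ValueError("size produces fewer draws than the broadcast output "
                         "needs; the result would contain repeated values")

# check_rvs_size(None, np.zeros(5))  # would raise: one draw, five outputs
# check_rvs_size(5, np.zeros(5))     # fine: five draws, five outputs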

@josef-pkt

I would suggest adding a new sizeb keyword and warning on the old keyword (and possibly, after a few versions, making size an alias for sizeb).
sizeb, with b for broadcasting, or something like that.

I got pretty used to rvs, and a new keyword might work as well as a new method.

@WarrenWeckesser

Yes, I think a new keyword could work, too. But then you've got an ugly API, with rvs having both size and sizeb. And "after a few versions make size an alias for sizeb" is very unappealing. I'd rather get a clean, final API as soon as possible. If we use a new keyword, I think the old one, size, should just be deprecated and eventually go away.

@perimosocordiae

Peanut gallery here: I'm +1 for a new method. I think it'll be much easier to explain to users, without them needing to fully understand the specifics of the bug.

@WarrenWeckesser

On the other hand, after looking at the code, I suspect the current rvs API is intended to work like numpy's random number generators, but the scipy implementation is broken. In the numpy API, the shapes of the parameters must be compatible (in the broadcasting sense) with the shape given by size. For example,

In [61]: np.random.uniform([0, 2, 4], [1, 3, 5])
Out[61]: array([ 0.93276429,  2.57968065,  4.17083731])

In [62]: np.random.uniform([0, 2, 4], [1, 3, 5], size=3)  # same as above
Out[62]: array([ 0.50793932,  2.87066072,  4.40605079])

In [63]: np.random.uniform([0, 2, 4], [1, 3, 5], size=4)  # Shapes not compatible
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-63-d40c36827673> in <module>()
----> 1 np.random.uniform([0, 2, 4], [1, 3, 5], size=4)

mtrand.pyx in mtrand.RandomState.uniform (numpy/random/mtrand/mtrand.c:11984)()

mtrand.pyx in mtrand.cont2_array (numpy/random/mtrand/mtrand.c:3180)()

ValueError: shape mismatch: objects cannot be broadcast to a single shape

Note that np.random.uniform([0, 2, 4], [1, 3, 5], size=1) also generates an error.

You can get a 2-d array, but the length of the second dimension must be 3:

In [64]: np.random.uniform([0, 2, 4], [1, 3, 5], size=(1,3))  # Result has shape (1,3)
Out[64]: array([[ 0.45860527,  2.67216864,  4.43484719]])

In [65]: np.random.uniform([0, 2, 4], [1, 3, 5], size=(4,3)) # Result has shape (4,3)
Out[65]: 
array([[ 0.39288997,  2.14821499,  4.12672776],
       [ 0.46386454,  2.2271585 ,  4.28880301],
       [ 0.61072223,  2.46905283,  4.84252237],
       [ 0.2323965 ,  2.98491287,  4.46665662]])

Less trivial broadcasting of the arguments:

In [68]: np.random.uniform([[0], [2], [4]], [4, 5, 6])
Out[68]: 
array([[ 2.71257007,  4.07918867,  0.6399625 ],
       [ 2.99374625,  3.56465233,  4.01187348],
       [ 4.        ,  4.58467304,  5.62378487]])

In [69]: np.random.uniform([[0], [2], [4]], [4, 5, 6], size=(3, 3))
Out[69]: 
array([[ 1.25742168,  1.91351582,  3.61490645],
       [ 3.02601869,  3.36855778,  3.38390496],
       [ 4.        ,  4.33135316,  5.30008562]])

The size argument doesn't allow one of its dimensions to be broadcast:

In [70]: np.random.uniform([[0], [2], [4]], [4, 5, 6], size=(3, 1, 3))
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-70-f7eb9d26bd41> in <module>()
----> 1 np.random.uniform([[0], [2], [4]], [4, 5, 6], size=(3, 1, 3))

mtrand.pyx in mtrand.RandomState.uniform (numpy/random/mtrand/mtrand.c:11984)()

mtrand.pyx in mtrand.cont2_array (numpy/random/mtrand/mtrand.c:3209)()

ValueError: size is not compatible with inputs

But you can do

In [71]: np.random.uniform([[0], [2], [4]], [4, 5, 6], size=(2, 3, 3))
Out[71]: 
array([[[ 1.52874668,  1.83954109,  2.08695179],
        [ 2.53803585,  3.68740984,  3.59386316],
        [ 4.        ,  4.43851451,  4.07922234]],

       [[ 1.60360839,  3.61115076,  5.46961197],
        [ 3.42959651,  4.50676028,  2.31481641],
        [ 4.        ,  4.46076607,  4.04545661]]])

So, another approach to this issue is to fix scipy's broken implementation of numpy's API for random variate generation.

@WarrenWeckesser

Although I like my original proposal for the interpretation of size, I now think we should fix the existing rvs method to act the same as the numpy random variate generators. That appears to have been the original intent, and it will be less disruptive than deprecating either the function or the size keyword.

@rgommers

The rvs methods are quite widely used, and mostly without loc/scale broadcasting, I'd expect. Deprecating them will indeed be quite painful. I haven't thought about this very hard yet, but my first impression is that just changing the broadcasting behavior and apologizing to the few people who've noticed and worked around this issue is the best way to go.

@rgommers

This looks like a nontrivial change to make though; may be a bit late for 0.17.0.

@WarrenWeckesser

I'll do what I can in the next day or so, but in my experience, changes to the distributions code always ramify far more than I expect. If it doesn't make it into 0.17, so be it.

@ev-br

ev-br commented Nov 22, 2015

+1 for interpreting the inconsistency with numpy.random as a bug.

@josef-pkt

If the fix is not consistent with the current workaround of giving the full size, then it still might break quite a bit of user code, and I would not do it without the regular deprecation policy.

I don't remember whether statsmodels has any old code for this; we use numpy almost exclusively because it has almost all the distributions that we need. Some of the Bayesian packages might use this more, at least based on some questions from the past. pymc has or had a wrapper around scipy.stats.distributions (I haven't checked in years).

@WarrenWeckesser

Proof-of-concept, work-in-progress pull request: #5526

@WarrenWeckesser removed this from the 0.17.0 milestone Nov 30, 2015
WarrenWeckesser added a commit to WarrenWeckesser/scipy that referenced this issue Jun 17, 2016
The class rv_generic was updated to handle broadcasting of the
arguments to the rvs() method.

The following distributions required additional custom changes to
support broadcasting in rvs():
    levy_stable, pearson3, rice, truncnorm,
    hypergeom, planck, randint.

The function `numpy.broadcast_to` is used in several places.  This
function was added to numpy in version 1.10, so a copy was added to
_lib/_numpy_compat.py to support older versions of numpy.

Closes scipygh-2069.
WarrenWeckesser added a commit to WarrenWeckesser/scipy that referenced this issue Jun 18, 2016
@WarrenWeckesser

A proposed fix is in #5526. It makes rvs() follow the broadcasting and size convention of numpy's random variate generators.
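
With that convention (a behavior sketch based on current scipy after the fix), array-valued parameters give one independent draw per element, and size must be broadcast-compatible with the parameter shapes:

import numpy as np
from scipy import stats

# Post-fix behavior sketch (numpy-style convention).
print(stats.norm.rvs(loc=np.zeros(3)))               # three distinct values
print(stats.norm.rvs(loc=np.zeros(3), size=(2, 3)))  # shape (2, 3)
# stats.norm.rvs(loc=np.zeros(3), size=4)  # raises ValueError: incompatible size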

@ev-br closed this as completed in 3d5024b Jun 19, 2016
@ev-br added this to the 0.18.0 milestone Jun 19, 2016
@stsievert

stsievert commented Feb 3, 2018

I think I ran into the same bug as #2069 (comment). I'd appreciate some help on this.

Here's my traceback:

Traceback (most recent call last):
  File "/home/ec2-user/WideResNet-pytorch/train.py", line 496, in <module>
    main()
  File "/home/ec2-user/WideResNet-pytorch/train.py", line 272, in main
    train_d = train(train_loader, model, criterion, optimizer, epoch)
  File "/home/ec2-user/WideResNet-pytorch/train.py", line 372, in train
    r = optimizer.step()
  File "/home/ec2-user/WideResNet-pytorch/pytorch_ps_mpi/ps.py", line 132, in step
    msgs = list(msgs)
  File "/home/ec2-user/WideResNet-pytorch/pytorch_ps_mpi/ps.py", line 129, in <genexpr>
    msgs = (future.result() for future in self.futures)
  File "/home/ec2-user/anaconda3/lib/python3.6/concurrent/futures/_base.py", line 398, in result
    return self.__get_result()
  File "/home/ec2-user/anaconda3/lib/python3.6/concurrent/futures/_base.py", line 357, in __get_result
    raise self._exception
  File "/home/ec2-user/anaconda3/lib/python3.6/concurrent/futures/thread.py", line 55, in run
    result = self.fn(*self.args, **self.kwargs)
  File "/home/ec2-user/WideResNet-pytorch/pytorch_ps_mpi/ps.py", line 94, in format_for_send
    code = encode(grad.data, **kwargs)
  File "/home/ec2-user/WideResNet-pytorch/codings/qsgd.py", line 39, in encode
    mask = stats.bernoulli.rvs(probs).astype('bool')
  File "/home/ec2-user/anaconda3/lib/python3.6/site-packages/scipy/stats/_distn_infrastructure.py", line 2810, in rvs
    return super(rv_discrete, self).rvs(*args, **kwargs)
  File "/home/ec2-user/anaconda3/lib/python3.6/site-packages/scipy/stats/_distn_infrastructure.py", line 954, in rvs
    vals = self._rvs(*args)
  File "/home/ec2-user/anaconda3/lib/python3.6/site-packages/scipy/stats/_discrete_distns.py", line 109, in _rvs
    return binom_gen._rvs(self, 1, p)
  File "/home/ec2-user/anaconda3/lib/python3.6/site-packages/scipy/stats/_discrete_distns.py", line 40, in _rvs
    return self._random_state.binomial(n, p, self._size)
  File "mtrand.pyx", line 3788, in mtrand.RandomState.binomial
  File "mtrand.pyx", line 386, in mtrand.discnp_array
ValueError: shape mismatch: objects cannot be broadcast to a single shape

This is a confusing bug that happens seemingly randomly with low probability in a larger program, and I'm having difficulty getting a minimum working example.

I have verified that type(probs) == np.ndarray and probs.ndim == 1. I'm running under scipy.__version__ == 1.0.0.

@WarrenWeckesser

From the traceback you can see that threads are being used in your code. Do you know if multiple threads are calling bernoulli.rvs()? If so, then my guess is that the scipy code is not thread-safe. scipy.stats.bernoulli is a global object. The rvs() method of bernoulli sets the _size attribute, and then accesses that attribute in the _rvs() method, and passes it to the numpy binomial random number generator. It is the numpy line that is raising the exception:

  File "/home/ec2-user/anaconda3/lib/python3.6/site-packages/scipy/stats/_discrete_distns.py", line 40, in _rvs
    return self._random_state.binomial(n, p, self._size)
  File "mtrand.pyx", line 3788, in mtrand.RandomState.binomial
  File "mtrand.pyx", line 386, in mtrand.discnp_array
ValueError: shape mismatch: objects cannot be broadcast to a single shape

That error can be generated by calling np.random.binomial with a size argument that is not compatible with the shape of p:

In [2]: np.random.binomial(1, [0.25, 0.5, 0.75], size=2)
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-2-8e55ac8604bf> in <module>()
----> 1 np.random.binomial(1, [0.25, 0.5, 0.75], size=2)

mtrand.pyx in mtrand.RandomState.binomial (numpy/random/mtrand/mtrand.c:32482)()

mtrand.pyx in mtrand.discnp_array (numpy/random/mtrand/mtrand.c:9934)()

ValueError: shape mismatch: objects cannot be broadcast to a single shape

Two threads calling bernoulli.rvs(p) with arguments that have different lengths could also result in that error. It is possible for the first thread to set the _size attribute for its array, but before it calls the numpy function, the second thread sets the _size attribute to a different size. Then, when the first thread calls the numpy function, the _size attribute doesn't match the size of its array.

@stsievert

Thanks for the quick response @WarrenWeckesser. I think you've resolved my problem, thank you.

Do you know if multiple threads are calling bernoulli.rvs()?

Yup, my code uses concurrent.futures.ThreadPoolExecutor in ps.py. In the function that was being executed on multiple threads, I was using scipy.stats.bernoulli. I changed my usage from scipy.stats.bernoulli.rvs(probs) to np.random.rand(len(probs)) < probs.
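
For reference, a minimal sketch of that replacement (plain numpy, with no shared scipy distribution object holding a _size attribute):

import numpy as np

# One Bernoulli draw per probability.
probs = np.array([0.1, 0.5, 0.9])          # example probabilities
mask = np.random.rand(len(probs)) < probs  # boolean array
print(mask)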
