Wishart fails with valid arguments #538

Closed
fonnesbeck opened this issue May 2, 2014 · 23 comments · Fixed by #701
@fonnesbeck
Member

Models that use the Wishart as a prior currently fail with valid parameters. For example,

s = pm.Wishart('s', 4, 3, np.eye(3))

Fails with the following:

TypeError: Wrong number of dimensions: expected 0, got 2 with shape (3, 3)

I've tried a variety of 3x3 matrices, but all yield similar errors.

PS - I have a pull request that removed the redundant second parameter.

@jsalvatier
Member

I think the issue is that you didn't specify 'shape'. That's definitely
something we should be able to infer, though.


@fonnesbeck
Member Author

Yeah, I tried that, and got:

---------------------------------------------------------------------------
AsTensorError                             Traceback (most recent call last)
<ipython-input-73-6f02cca3962c> in <module>()
      1 with pm.Model():
----> 2     s = pm.Wishart('s', 4, 3, np.eye(3), shape=(3,3))

/Users/fonnescj/Code/pymc/pymc/distributions/distribution.pyc in __new__(cls, name, *args, **kwargs)
     17             data = kwargs.pop('observed', None)
     18             dist = cls.dist(*args, **kwargs)
---> 19             return model.Var(name, dist, data)
     20         elif name is None:
     21             return object.__new__(cls) #for pickle

/Users/fonnescj/Code/pymc/pymc/model.pyc in Var(self, name, dist, data)
    142         """
    143         if data is None:
--> 144             var = FreeRV(name=name, distribution=dist, model=self)
    145             self.free_RVs.append(var)
    146         else:

/Users/fonnescj/Code/pymc/pymc/model.pyc in __init__(self, type, owner, index, name, distribution, model)
    321             self.tag.test_value = np.ones(
    322                 distribution.shape, distribution.dtype) * distribution.default()
--> 323             self.logp_elemwiset = distribution.logp(self)
    324             self.model = model
    325 

/Users/fonnescj/Code/pymc/pymc/distributions/multivariate.pyc in logp(self, X)
    179              2) - n * log(IVI) - 2 * multigammaln(p, n / 2)) / 2,
    180 
--> 181             all(n > p - 1))

/Library/Python/2.7/site-packages/theano/tensor/basic.pyc in all(x, axis, keepdims)
   5078 
   5079 def all(x, axis=None, keepdims=False):
-> 5080     out = elemwise.All(axis)(x)
   5081 
   5082     if keepdims:

/Library/Python/2.7/site-packages/theano/gof/op.pyc in __call__(self, *inputs, **kwargs)
    409         """
    410         return_list = kwargs.pop('return_list', False)
--> 411         node = self.make_node(*inputs, **kwargs)
    412         if self.add_stack_trace_on_call:
    413             self.add_tag_trace(node)

/Library/Python/2.7/site-packages/theano/tensor/elemwise.pyc in make_node(self, input)
   1639 
   1640     def make_node(self, input):
-> 1641         input = as_tensor_variable(input)
   1642         if input.dtype not in ["int8", "uint8"]:
   1643             input = theano.tensor.neq(input, 0)

/Library/Python/2.7/site-packages/theano/tensor/basic.pyc in as_tensor_variable(x, name, ndim)
    181     if isinstance(x, bool):
    182         raise AsTensorError(
--> 183             "Cannot cast True or False as a tensor variable. Please use 1 or "
    184             "0. This error might be caused by using the == operator on "
    185             "Variables. v == w does not do what you think it does, "

AsTensorError: Cannot cast True or False as a tensor variable. Please use 1 or 0. This error might be caused by using the == operator on Variables. v == w does not do what you think it does, use theano.tensor.eq(v, w) instead.

It does not seem to like the boundary check.
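
For context, the check that trips here is all(n > p - 1) in the Wishart logp: when n and p arrive as plain Python numbers, the comparison evaluates to a Python bool, and Theano's all() refuses to cast a bool to a tensor. Below is a minimal sketch of a fully symbolic version of that constraint (illustrative only, not necessarily the fix that eventually landed; the variable names are made up):

import theano.tensor as tt

n = tt.dscalar('n')                    # degrees of freedom as a symbolic scalar
p = 3                                  # dimension of the scale matrix
constraint = tt.all(tt.gt(n, p - 1))   # stays a Theano variable, never a raw bool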

@jsalvatier
Member

Hmm that seems bad. Do we have tests for it?

@fonnesbeck
Member Author

We have one, which it passes. Does my example fail for you?

@jsalvatier
Member

Yup, it fails for me too.


@twiecki
Member

twiecki commented Aug 17, 2014

#547 has a bunch of changes to Wishart. Might that fix it? @kadeng?

@kadeng

kadeng commented Aug 18, 2014

@twiecki: You should definitely take a look at the changes I made there. I think that the current Wishart implementation does not really work. I found discrepancies between the likelihood formula in the code and the formulas I found in the literature (for example, in Gelman's Bayesian Data Analysis and in Murphy's Machine Learning: A Probabilistic Perspective). What's problematic is that different authors seem to use different parametrizations and to name the parameters inconsistently, so in the end I was pretty confused.

That said, I haven't spent a lot of time validating that my fix is in turn correct and works, so please don't rely on it; use it as a suggestion :) The following paper is also an interesting read when it comes to actually using the Wishart or inverse Wishart as a prior for the precision or covariance matrix of MVNs: http://ba.stat.cmu.edu/journal/2013/vol08/issue02/huang.pdf
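
For reference, the parametrization used by most of those sources (and the one the logp in the code appears to target) is, for a p x p positive-definite X ~ W_p(V, n) with scale matrix V and n > p - 1 degrees of freedom,

p(X \mid V, n) = \frac{|X|^{(n-p-1)/2} \exp\left(-\tfrac{1}{2}\operatorname{tr}(V^{-1}X)\right)}{2^{np/2}\,|V|^{n/2}\,\Gamma_p(n/2)},

where \Gamma_p is the multivariate gamma function. Much of the confusion comes from whether the second matrix argument is taken to be the scale V or its inverse, and whether the distribution is placed on a covariance or a precision matrix.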

@wgreenr

wgreenr commented Mar 29, 2015

I recently tried to sample from a Wishart prior using:

import pymc3 as pm
import numpy as np

prec_prior = np.array([[ 25.3968254,   -1.58730159],
                       [ -1.58730159,   6.34920635]])
with pm.Model() as model:
    prec = pm.Wishart('prec', 100.0, prec_prior / 100.0, shape=(2, 2))
    step = pm.Metropolis()
    trace = pm.sample(10000, step)

Instead, I obtained a clearly diverging sampling process, with some entries of the sampled matrices drifting towards negative infinity and in no way resembling the prior precision matrix:

[figure: diverging Wishart trace]

Then I decided to check whether my notions about the Wishart distribution are fundamentally flawed. Using the code from http://www.mit.edu/~mattjj/released-code/hsmm/stats_util.py,
I obtained samples that look as one might expect, given a quite narrow prior centered at the precision matrix prec_prior:

    sample_wishart(prec_prior / 100.0, 100.0)
    array([[ 26.37928925,  -3.88792305],
           [ -3.88792305,   7.6372514 ]])

The question is: is this behavior of the MCMC sampler expected, or is this a bug? And how does one properly sample from the Wishart?

@kiudee
Contributor

kiudee commented Mar 29, 2015

Yes, this is a bug and the same behavior I observed with the Wishart distribution (see issue #672).
If you can find the source of the error in the current code, we would greatly appreciate it.

I recommend taking a look at the LKJ distribution, which is a prior distribution for the correlation matrix instead of the covariance/precision matrix.
This has the advantage that you can specify separate priors for the standard deviations and the correlations.
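
As a small numerical illustration of that split (fixed values only, no pymc3 API): with an LKJ prior on the correlation matrix C and a separate prior on the standard deviations sd, the covariance is assembled as diag(sd) * C * diag(sd).

import numpy as np

C = np.array([[1.0, 0.3],
              [0.3, 1.0]])      # correlation matrix (what the LKJ prior is placed on)
sd = np.array([2.0, 0.5])       # per-dimension standard deviations

Sigma = np.diag(sd).dot(C).dot(np.diag(sd))
print Sigma
# [[ 4.    0.3 ]
#  [ 0.3   0.25]]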

@wgreenr

wgreenr commented Mar 29, 2015

I will run some tests; the Wishart logp seems to look good.

@twiecki
Member

twiecki commented Mar 29, 2015

Seems like we're not enforcing the tau > 0 constraint.

@kiudee
Contributor

kiudee commented Mar 29, 2015

Indeed, this is the case. (See below)

I added the constraint and tried it with the given example:

import numpy as np
import pymc3 as pm
import scipy.optimize as opt
prec_prior = np.array([[ 25.3968254,   -1.58730159],
                       [ -1.58730159,   6.34920635]])
with pm.Model() as model:
    prec = pm.Wishart('prec', 100.0, prec_prior / 100.0, shape=(2, 2))

    start = pm.find_MAP(fmin=opt.fmin_powell)
    step = pm.Metropolis()
    trace = pm.sample(10000, step, start)

Here is the resulting trace (without any thinning):
[figure: precision trace]

PS: I’m sorry, it is too late here. I just realized that this is not true. The matrices need to be positive definite.

@twiecki
Member

twiecki commented Mar 30, 2015

Oh, right. I experimented a bit with all(gt(eig(X), 0)) as a condition but didn't get it to work yet.

@twiecki
Member

twiecki commented Apr 18, 2015

It should actually read all(gt(eigh(X)[0], 0)) which at least does not raise an exception. However, I'm still seeing the divergent sampling... maybe @karlnapf can advise on sampling from the Wishart distribution.

@twiecki
Member

twiecki commented Apr 18, 2015

The other thing we have to enforce is the symmetry:

        return bound(
            ((n - p - 1) * log(IXI) - trace(matrix_inverse(V).dot(X)) -
                n * p * log(2) - n * log(IVI) - 2 * multigammaln(n / 2., p)) / 2,
            gt(n, (p - 1)),
            all(gt(eigh(X)[0], 0)),
            eq(X, X.T)
        )

Unfortunately, it now becomes quite obvious that trying to sample this matrix naively with these constraints is impossible, so we should probably try to sample only the upper triangle and use the Cholesky decomposition.
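
A rough sketch of that idea (illustrative only, not merged code; names are made up): sample just the p*(p+1)/2 free entries of a triangular factor and build the matrix as L * L.T, which is symmetric and positive semi-definite by construction, so none of the hard constraints above need to be checked during sampling.

import numpy as np
import theano.tensor as T

p = 3
tril_idx = np.tril_indices(p)      # indices of the p*(p+1)/2 free entries
n_free = len(tril_idx[0])

def packed_to_matrix(packed):
    """Map a length-n_free vector to a p x p symmetric PSD matrix."""
    L = T.zeros((p, p))
    L = T.set_subtensor(L[tril_idx], packed)
    return L.dot(L.T)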

@kadeng

kadeng commented Apr 18, 2015

Sampling from the Wishart and inverse Wishart seems to be something that
has received some rather sophisticated analysis to date. Make
sure you take at least a quick look at these:

http://www.math.wustl.edu/~sawyer/hmhandouts/Wishart.pdf
http://www.gwu.edu/~forcpgm/YuChengKu-030510final-WishartYu-ChengKu.pdf

best,

Kai


@twiecki
Member

twiecki commented Apr 18, 2015

Thanks @kadeng. I'm starting to wonder if we shouldn't just ditch the Wishart. @betanalpha suggests here: https://www.youtube.com/watch?v=xWQpEAyI5s8 that it's a terrible choice as a prior to begin with and suggests using the LKJ which we already have working.

@betanalpha

Referencing a youtube video -- is that a first? There are two problems going on with the Wishart.

Firstly, because of the positive-definite and symmetric constraints the probability of generating a valid sample by varying every element is zero (technically nonzero on floating point, but small enough to make it unfeasible). Really the only way to sample from these objects is to transform the target distribution to the N * (N + 1) / 2 dimensional unconstrained space, sample there, and then map the samples back to the constrained space. For details see the "Transformations of Constrained Variables" chapter in the Stan manual.
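
A minimal sketch of such a transform in plain NumPy (a log-Cholesky map along the lines of the Stan manual chapter; the function name and layout are illustrative): an unconstrained vector of length N(N+1)/2 is packed into a Cholesky factor whose diagonal is kept positive by exponentiation, and L * L.T is then symmetric positive-definite by construction.

import numpy as np

def unconstrained_to_spd(y, p):
    """Map an unconstrained vector of length p*(p+1)/2 to a symmetric
    positive-definite p x p matrix."""
    L = np.zeros((p, p))
    idx = 0
    for i in range(p):
        for j in range(i + 1):
            L[i, j] = np.exp(y[idx]) if i == j else y[idx]   # positive diagonal
            idx += 1
    return L.dot(L.T)

# Example: 6 unconstrained numbers give a valid 3 x 3 matrix.
print unconstrained_to_spd(np.random.randn(6), 3)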

Secondly, the Wishart distribution is very heavy-tailed, which causes all kinds of problems for naive samplers. Formally, heavy tails are obstructions to geometric ergodicity, which is essentially the minimal condition needed to trust MCMC expectations in practice. More intuitively, when the tails of the target distribution are heavy the sampler drifts around in the tails and can't find its way back into the bulk of the distribution. NUTS will technically work okay, but only if you let the depth of the tree grow very, very large and incur the resulting computational cost. LKJ has much nicer tails and consequently pairs much better with most MCMC samplers. We'd also argue that the shape of LKJ priors is more in line with modelers' intuitions than the Wishart (in particular, there is a really nice interpretation in terms of the Cholesky factor), which is why we recommend them as defaults.

twiecki added a commit that referenced this issue Apr 18, 2015
The Wishart has never worked correctly and produced
divergent samples. One of the reasons was that we did
not enforce the matrix to be symmetric or positive
semi-definite.

Now, however, it is impossible to randomly sample a
valid matrix, so we raise a warning that the Wishart
is currently basically unusable and recommend
the LKJ prior instead, which has many desirable properties.

For more discussions, see #538.
@twiecki
Member

twiecki commented Apr 18, 2015

Many thanks for chiming in @betanalpha; I actually meant to connect with you to get some feedback on pymc3, if you don't mind. One thing that Stan does very well is the automatic transformations. I think it wouldn't be too hard to implement something similar here.

I now added a warning not to use the Wishart unless we fix it using the transform method.

Here's an example of how one can use the LKJ prior in pymc3:
https://github.com/pymc-devs/pymc3/blob/master/pymc3/examples/LKJ_correlation.py

@wgreenr

wgreenr commented Apr 18, 2015

Thanks everyone, especially @betanalpha and @twiecki, for the insightful comments! It seems that LKJ is the way to go.

@twiecki
Member

twiecki commented Apr 21, 2015

I tried the transform approach as suggested by @betanalpha and tried to follow the Stan docs. This is the model I came up with. @betanalpha, if you have a sec it'd be great to get your input on whether you think this is correctly set up; the results make sense, but this is the first time I've worked with multivariate transformations.

import pymc3 as pm
import numpy as np
import theano
import theano.tensor as T
import scipy.stats
import scipy.linalg
import matplotlib.pyplot as plt

covariance = np.matrix([[2, .5],
                        [.5, 1]])

prec = np.linalg.inv(covariance)

mean = [.5, 1]
data = scipy.stats.multivariate_normal(mean, covariance).rvs(5000)

plt.scatter(data[:, 0], data[:, 1])

S = np.matrix([[1, .5],
               [.5, 1]])
L = scipy.linalg.cholesky(S)
nu = 5
with pm.Model() as model:
    mu = pm.Normal('mu', mu=0, sd=1, shape=2)
    c = T.sqrt(pm.ChiSquared('c', nu - np.arange(2, 4), shape=2))
    z = pm.Normal('z', 0, 1)
    A = T.stacklists([[c[0], 0], 
                      [z, c[1]]])

    # L * A * A.T * L.T ~ Wishart(L*L.T, nu)
    wishart = pm.Deterministic('wishart', T.dot(T.dot(T.dot(L, A), A.T), L.T))
    cov = pm.Deterministic('cov', T.nlinalg.matrix_inverse(wishart))
    lp = pm.MvNormal('likelihood', mu=mu, tau=wishart, observed=data)

    start = pm.find_MAP()
    step = pm.NUTS(scaling=start)
    trace = pm.sample(500, step)

pm.traceplot(trace[100:]);

print np.mean(trace['wishart'], axis=0)
print prec

print np.mean(trace['cov'], axis=0)
print covariance

Output:

[[ 0.57725294 -0.29173294]
 [-0.29173294  1.16447192]]
[[ 0.57142857 -0.28571429]
 [-0.28571429  1.14285714]]

[[ 1.9955087   0.50381087]
 [ 0.50381087  0.9889754 ]]
[[ 2.   0.5]
 [ 0.5  1. ]]


@twiecki twiecki mentioned this issue Apr 22, 2015
@twiecki
Member

twiecki commented Apr 26, 2015

I refactored this now into a function with the same syntax as a distribution. However, it's a bit odd as it adds RVs itself to the model. I think it's pretty convenient though. @jsalvatier @fonnesbeck thoughts on adding this?

import pymc3 as pm
import numpy as np
import theano
import theano.tensor as T
import scipy.stats
import scipy.linalg
import matplotlib.pyplot as plt

covariance = np.matrix([[2, .5, -.5],
                        [.5, 1.,  0.],
                        [-.5, 0., 0.5]])

prec = np.linalg.inv(covariance)

mean = [.5, 1, .2]
data = scipy.stats.multivariate_normal(mean, covariance).rvs(5000)

plt.scatter(data[:, 0], data[:, 1])

S = np.eye(3)
L = scipy.linalg.cholesky(S)
nu = 5

def WishartBartlett(name, S, nu, is_cholesky=False, return_cholesky=False):
    """
    Bartlett decomposition of the Wishart distribution. As the Wishart
    distribution requires the matrix to be symmetric positive semi-definite,
    it is impossible for MCMC to ever propose acceptable matrices.

    Instead, we can use the Bartlett decomposition, which samples a lower
    triangular matrix. Specifically:

    If A ~ [[sqrt(c_1), 0, ...],
            [z_21, sqrt(c_2), 0, ...],
            [z_31, z_32, sqrt(c_3), ...]]
    with c_i ~ Chi²(nu - i + 1) and z_ij ~ N(0, 1), then, for L the Cholesky
    factor of S,
    L * A * A.T * L.T ~ Wishart(L * L.T, nu)

    See http://en.wikipedia.org/wiki/Wishart_distribution#Bartlett_decomposition
    for more information.

    :Parameters:
      S : ndarray
        p x p positive definite matrix
      nu : int
        Degrees of freedom, > dim(S).
      is_cholesky : bool (default=False)
        Input matrix S is already Cholesky decomposed as S.T * S
      return_cholesky : bool (default=False)
        Only return the Cholesky decomposed matrix.

    :Note:
      This is not a standard Distribution class but follows a similar
      interface. Besides the Wishart distribution, it will add the RVs
      c and z to your model, which make up the matrix.
    """

    L = S if is_cholesky else scipy.linalg.cholesky(S)

    diag_idx = np.diag_indices_from(S)
    tril_idx = np.tril_indices_from(S, k=-1)
    n_diag = len(diag_idx[0])
    n_tril = len(tril_idx[0])
    c = T.sqrt(pm.ChiSquared('c', nu - np.arange(2, 2+n_diag), shape=n_diag))
    z = pm.Normal('z', 0, 1, shape=n_tril)
    # Construct A matrix
    A = T.zeros(S.shape, dtype=np.float32)
    A = T.set_subtensor(A[diag_idx], c)
    A = T.set_subtensor(A[tril_idx], z)

    # L * A * A.T * L.T ~ Wishart(L*L.T, nu)
    if return_cholesky:
        return pm.Deterministic(name, T.dot(L, A))
    else:
        return pm.Deterministic(name, T.dot(T.dot(T.dot(L, A), A.T), L.T))

with pm.Model() as model:
    mu = pm.Normal('mu', mu=0, sd=1, shape=3)

    prec = WishartBartlett('prec', S, nu)
    cov = pm.Deterministic('cov', T.nlinalg.matrix_inverse(prec))

    lp = pm.MvNormal('likelihood', mu=mu, tau=prec, observed=data)

    start = pm.find_MAP()
    step = pm.NUTS(scaling=start)
    trace = pm.sample(500, step)

pm.traceplot(trace[100:]);

@betanalpha

Sorry for the late reply -- seems reasonable although I can't comment on the integration of the transformation into your sampler.
