Wishart fails with valid arguments #538
I think the issue is that you didn't specify 'shape'.
Yeah, I tried that, and got:
It does not seem to like the boundary check.
Hmm that seems bad. Do we have tests for it?
We have one, which it passes. Does my example fail for you?
Yup, it fails for me too.
@twiecki: You should definitely take a look at the changes I made there. I think the current Wishart implementation does not really work. I found discrepancies between the likelihood formula in the code and the formulas in the literature (for example, in Gelman's Bayesian Data Analysis and in Murphy's Machine Learning: A Probabilistic Perspective). What's problematic is that different authors seem to use different parametrizations and to name the parameters inconsistently, so in the end I was pretty confused. But I haven't spent a lot of time validating that my fix is in turn correct, so please don't rely on it; use it as a suggestion :) The following paper is also an interesting read when it comes to actually using the Wishart or inverse Wishart as a prior for the precision or covariance matrix of an MVN: http://ba.stat.cmu.edu/journal/2013/vol08/issue02/huang.pdf
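For reference, here is my transcription of the density the discussion seems to converge on (the standard Wishart with p x p scale matrix V and n > p - 1 degrees of freedom, matching the `bound` expression quoted later in this thread); worth double-checking against BDA before relying on it:

```latex
\log p(X \mid V, n) =
    \tfrac{1}{2}\Bigl[(n - p - 1)\log\lvert X\rvert
    - \operatorname{tr}\!\bigl(V^{-1} X\bigr)
    - n p \log 2 - n \log\lvert V\rvert\Bigr]
    - \log \Gamma_p\!\bigl(\tfrac{n}{2}\bigr)
```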
Recently I tried to sample from a Wishart prior using:

```python
import pymc3 as pm
import numpy as np

prec_prior = np.array([[25.3968254, -1.58730159],
                       [-1.58730159, 6.34920635]])

with pm.Model() as model:
    prec = pm.Wishart('prec', 100.0, prec_prior / 100.0, shape=(2, 2))
    step = pm.Metropolis()
    trace = pm.sample(10000, step)
```

Instead I obtained a clearly diverging sampling process, with some entries of the sampled matrices moving towards negative infinity and not in any way resembling the prior precision matrix. Then I decided to check whether my notions about the Wishart distribution are fundamentally flawed, using the code from http://www.mit.edu/~mattjj/released-code/hsmm/stats_util.py:

```python
sample_wishart(prec_prior / 100.0, 100.0)
# array([[ 26.37928925,  -3.88792305],
#        [ -3.88792305,   7.6372514 ]])
```

The question is: is this behavior of the MCMC sampler expected, or is it a bug? And how does one properly sample from a Wishart?
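For an independent check of what draws from this prior should look like, here is a minimal sketch using `scipy.stats.wishart` (assuming a SciPy version recent enough to include it). Since `Wishart(V, n)` has mean `n * V`, draws with `df=100` and `scale=prec_prior / 100` should hover around `prec_prior`:

```python
import numpy as np
from scipy.stats import wishart

prec_prior = np.array([[25.3968254, -1.58730159],
                       [-1.58730159, 6.34920635]])

# Exact draws from Wishart(df=100, scale=prec_prior / 100); the mean is
# df * scale = prec_prior, so samples should resemble the prior matrix.
draws = wishart(df=100, scale=prec_prior / 100.0).rvs(size=1000)
print(draws.mean(axis=0))
```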
Yes, this is a bug, and the same behavior I observed with the Wishart distribution (see issue #672). I recommend taking a look at the LKJ distribution, which is a prior on the correlation matrix instead of the covariance/precision matrix.
Will run some tests; the Wishart logp seems to look good.
Seems like we're not enforcing the tau > 0 constraint.
Oh, right. I experimented a bit with
It should actually read
The other thing we have to enforce is the symmetry:

```python
return bound(
    ((n - p - 1) * log(IXI) - trace(matrix_inverse(V).dot(X)) -
     n * p * log(2) - n * log(IVI) - 2 * multigammaln(n / 2., p)) / 2,
    gt(n, (p - 1)),
    all(gt(eigh(X)[0], 0)),
    eq(X, X.T)
)
```

Unfortunately, it now becomes quite obvious that trying to sample this matrix naively with these constraints is impossible, so we should probably sample only the upper triangle and use the Cholesky decomposition.
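To see concretely why naive proposals fail, here is a toy illustration (hypothetical, just for intuition): an elementwise random-walk proposal perturbs each entry independently, so it breaks symmetry with probability one, and the `eq(X, X.T)` check can essentially never pass:

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.array([[2.0, 0.5],
              [0.5, 1.0]])  # a valid value: symmetric and positive definite

# An elementwise Metropolis-style proposal perturbs every entry independently.
proposal = X + rng.normal(scale=0.1, size=X.shape)

print(np.allclose(proposal, proposal.T))  # False almost surely: symmetry broken
print(np.all(np.linalg.eigvalsh(X) > 0))  # True: X itself passes the eigenvalue check
```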
Sampling from the Wishart and inverse Wishart seems to be something this handout covers well: http://www.math.wustl.edu/~sawyer/hmhandouts/Wishart.pdf

Best, Kai
Thanks @kadeng. I'm starting to wonder if we shouldn't just ditch the Wishart. @betanalpha argues here that it's a terrible choice of prior to begin with and suggests using the LKJ, which we already have working: https://www.youtube.com/watch?v=xWQpEAyI5s8
Referencing a YouTube video -- is that a first? There are two problems going on with the Wishart.

Firstly, because of the positive-definite and symmetric constraints, the probability of generating a valid sample by varying every element is zero (technically nonzero in floating point, but small enough to make it infeasible). Really the only way to sample these objects is to transform the target distribution to the N * (N + 1) / 2-dimensional unconstrained space, sample there, and then map the samples back to the constrained space. For details see the "Transformations of Constrained Variables" chapter in the Stan manual.

Secondly, the Wishart distribution is very heavy-tailed, which causes all kinds of problems for naive samplers. Formally, heavy tails are obstructions to geometric ergodicity, which is essentially the minimal condition needed to trust MCMC expectations in practice. More intuitively, when the tails of the target distribution are heavy, the sampler drifts around in the tails and can't find its way back into the bulk of the distribution. NUTS will technically work okay, but only if you let the depth of the tree grow very, very large and incur the resulting computational cost.

The LKJ has much nicer tails and consequently pairs much better with most MCMC samplers. We'd also argue that the shape of LKJ priors is more in line with modelers' intuitions than the Wishart (in particular, there is a really nice interpretation in terms of the Cholesky factor), hence why we recommend them as defaults.
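A minimal sketch (my own, not taken from Stan or PyMC3) of the unconstrained transformation described above, using the common log-Cholesky parametrization; the Jacobian adjustment the log-density would need is omitted:

```python
import numpy as np

def to_unconstrained(X):
    # Cholesky-factor X = L L.T, take the log of the (positive) diagonal,
    # and keep the lower triangle as a flat vector of length p*(p+1)/2.
    p = X.shape[0]
    L = np.linalg.cholesky(X)
    L[np.diag_indices(p)] = np.log(L[np.diag_indices(p)])
    return L[np.tril_indices(p)]

def from_unconstrained(v, p):
    # Inverse map: rebuild the Cholesky factor (exponentiating the diagonal
    # restores positivity) and form L @ L.T, which is symmetric PD by construction.
    L = np.zeros((p, p))
    L[np.tril_indices(p)] = v
    L[np.diag_indices(p)] = np.exp(L[np.diag_indices(p)])
    return L @ L.T

X = np.array([[2.0, 0.5],
              [0.5, 1.0]])
v = to_unconstrained(X)                          # 3 unconstrained numbers for p = 2
print(np.allclose(from_unconstrained(v, 2), X))  # True: exact round trip
```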
The Wishart has never worked correctly and produced divergent samples. One of the reasons was that we did not enforce the matrix to be either symmetric or positive semi-definite. Now, however, it is impossible to randomly sample a valid matrix, so we raise a warning that the Wishart is basically unusable currently and point to the LKJ prior instead, which has many desirable properties. For more discussion, see #538.
Many thanks for chiming in @betanalpha, I actually meant to connect with you to get some feedback on this. I now added a warning not to use the Wishart unless we fix it using the transform method. Here's an example of how one can use the LKJ prior:
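A minimal sketch of building a covariance matrix from an LKJ prior, assuming this era's `pm.LKJCorr(name, n, p)` signature (shape parameter `n`, dimension `p`, returning the `p*(p-1)/2` off-diagonal correlations as a flat vector); both that signature and the assumption that the element ordering matches `np.triu_indices` are worth verifying:

```python
import numpy as np
import pymc3 as pm
import theano.tensor as T

n_var = 3

with pm.Model() as model:
    # Separate priors on the standard deviations and the correlation matrix.
    sd = pm.HalfCauchy('sd', 2.5, shape=n_var)
    # Assumed signature: shape parameter n, dimension p; returns a flat
    # vector of the p*(p-1)/2 off-diagonal correlations.
    corr_vec = pm.LKJCorr('corr_vec', n=1, p=n_var)

    # Rebuild the full correlation matrix from the flat vector.
    corr = T.eye(n_var)
    corr = T.set_subtensor(corr[np.triu_indices(n_var, k=1)], corr_vec)
    corr = corr + corr.T - T.eye(n_var)

    # Covariance = diag(sd) . corr . diag(sd)
    cov = pm.Deterministic('cov', T.diag(sd).dot(corr).dot(T.diag(sd)))
```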
Thanks everyone, especially @betanalpha and @twiecki, for the insightful comments! It seems that LKJ is the way to go.
I tried the transform approach as suggested by @betanalpha and tried to follow the Stan docs. This is the model I came up with. @betanalpha, if you have a sec it'd be great to get your input on whether you think this is correctly set up; the results make sense, but this is the first time I've worked with multivariate transformations.

```python
import pymc3 as pm
import numpy as np
import theano
import theano.tensor as T
import scipy.stats
import scipy.linalg
import matplotlib.pyplot as plt

covariance = np.matrix([[2, .5],
                        [.5, 1]])
prec = np.linalg.inv(covariance)
mean = [.5, 1]
data = scipy.stats.multivariate_normal(mean, covariance).rvs(5000)
plt.scatter(data[:, 0], data[:, 1])

S = np.matrix([[1, .5],
               [.5, 1]])
L = scipy.linalg.cholesky(S)
nu = 5

with pm.Model() as model:
    mu = pm.Normal('mu', mu=0, sd=1, shape=2)
    # Bartlett decomposition: chi-distributed diagonal, standard-normal
    # off-diagonal.
    c = T.sqrt(pm.ChiSquared('c', nu - np.arange(2, 4), shape=2))
    z = pm.Normal('z', 0, 1)
    A = T.stacklists([[c[0], 0],
                      [z, c[1]]])
    # L * A * A.T * L.T ~ Wishart(L*L.T, nu)
    wishart = pm.Deterministic('wishart', T.dot(T.dot(T.dot(L, A), A.T), L.T))
    cov = pm.Deterministic('cov', T.nlinalg.matrix_inverse(wishart))
    lp = pm.MvNormal('likelihood', mu=mu, tau=wishart, observed=data)

    start = pm.find_MAP()
    step = pm.NUTS(scaling=start)
    trace = pm.sample(500, step)

pm.traceplot(trace[100:])
print(np.mean(trace['wishart'], axis=0))
print(prec)
print(np.mean(trace['cov'], axis=0))
print(covariance)
```
I refactored this now into a function with the same syntax as a distribution. However, it's a bit odd in that it adds RVs itself to the model. I think it's pretty convenient, though. @jsalvatier @fonnesbeck thoughts on adding this?

```python
import pymc3 as pm
import numpy as np
import theano
import theano.tensor as T
import scipy.stats
import scipy.linalg
import matplotlib.pyplot as plt

covariance = np.matrix([[2, .5, -.5],
                        [.5, 1., 0.],
                        [-.5, 0., 0.5]])
prec = np.linalg.inv(covariance)
mean = [.5, 1, .2]
data = scipy.stats.multivariate_normal(mean, covariance).rvs(5000)
plt.scatter(data[:, 0], data[:, 1])

S = np.eye(3)
L = scipy.linalg.cholesky(S)
nu = 5

def WishartBartlett(name, S, nu, is_cholesky=False, return_cholesky=False):
    """
    Bartlett decomposition of the Wishart distribution. As the Wishart
    distribution requires the matrix to be symmetric positive semi-definite,
    it is impossible for MCMC to ever propose acceptable matrices.

    Instead, we can use the Bartlett decomposition, which samples a lower
    triangular matrix. Specifically:

    If A ~ [[sqrt(c_1), 0, ...],
            [z_21, sqrt(c_2), 0, ...],
            [z_31, z_32, sqrt(c_3), ...]]

    with c_i ~ Chi²(n-i+1) and z_ij ~ N(0, 1), then
    L * A * A.T * L.T ~ Wishart(L * L.T, nu).

    See http://en.wikipedia.org/wiki/Wishart_distribution#Bartlett_decomposition
    for more information.

    :Parameters:
      S : ndarray
        p x p positive definite matrix.
      nu : int
        Degrees of freedom, > dim(S).
      is_cholesky : bool (default=False)
        Input matrix S is already Cholesky decomposed as S.T * S.
      return_cholesky : bool (default=False)
        Only return the Cholesky decomposed matrix.

    :Note:
      This is not a standard Distribution class but follows a similar
      interface. Besides the Wishart distribution, it will add RVs
      c and z to your model which make up the matrix.
    """
    L = S if is_cholesky else scipy.linalg.cholesky(S)
    diag_idx = np.diag_indices_from(S)
    tril_idx = np.tril_indices_from(S, k=-1)
    n_diag = len(diag_idx[0])
    n_tril = len(tril_idx[0])

    c = T.sqrt(pm.ChiSquared('c', nu - np.arange(2, 2 + n_diag), shape=n_diag))
    z = pm.Normal('z', 0, 1, shape=n_tril)

    # Construct the A matrix: chi-distributed diagonal, normal off-diagonal.
    A = T.zeros(S.shape, dtype=np.float32)
    A = T.set_subtensor(A[diag_idx], c)
    A = T.set_subtensor(A[tril_idx], z)

    # L * A * A.T * L.T ~ Wishart(L*L.T, nu)
    if return_cholesky:
        return pm.Deterministic(name, T.dot(L, A))
    else:
        return pm.Deterministic(name, T.dot(T.dot(T.dot(L, A), A.T), L.T))

with pm.Model() as model:
    mu = pm.Normal('mu', mu=0, sd=1, shape=3)
    prec = WishartBartlett('prec', S, nu)
    cov = pm.Deterministic('cov', T.nlinalg.matrix_inverse(prec))
    lp = pm.MvNormal('likelihood', mu=mu, tau=prec, observed=data)

    start = pm.find_MAP()
    step = pm.NUTS(scaling=start)
    trace = pm.sample(500, step)

pm.traceplot(trace[100:])
```
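As a pure-NumPy sanity check of the Bartlett construction (my own sketch, using the textbook degrees of freedom c_i ~ Chi²(nu - i + 1)), the empirical mean of the draws should approach nu * S:

```python
import numpy as np
import scipy.linalg

def sample_wishart_bartlett(S, nu, size=1000, seed=0):
    """Draw Wishart(S, nu) samples via the Bartlett decomposition."""
    rng = np.random.default_rng(seed)
    p = S.shape[0]
    L = scipy.linalg.cholesky(S, lower=True)
    draws = np.empty((size, p, p))
    for k in range(size):
        A = np.zeros((p, p))
        # Diagonal: sqrt of chi-squared draws with nu - i + 1 dof (i = 1..p).
        A[np.diag_indices(p)] = np.sqrt(rng.chisquare(nu - np.arange(p)))
        # Strict lower triangle: standard normals.
        A[np.tril_indices(p, k=-1)] = rng.standard_normal(p * (p - 1) // 2)
        LA = L @ A
        draws[k] = LA @ LA.T
    return draws

S = np.eye(3)
nu = 5
draws = sample_wishart_bartlett(S, nu, size=20000)
print(draws.mean(axis=0))  # should be close to nu * S = 5 * I
```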
Sorry for the late reply -- seems reasonable, although I can't comment on the integration of the transformation into your sampler.
Models that use the Wishart as a prior currently fail with valid parameters. For example,
Fails with the following:
I've tried a variety of 3x3 matrices, but all yield similar errors.
PS - I have a pull request that removed the redundant second parameter.