ADVI, NUTS and Metropolis produce significantly different results #1163
This is entirely possible. Metropolis can mix really poorly sometimes, and ADVI can give really bad approximations. You would have a better idea of whether something is wrong if you looked at some convergence diagnostics. If the ELBO in ADVI has not converged to a stationary value and the MCMC samplers have not converged, then it is premature to compare them. Also, in general, you should not compare NUTS and Metropolis based on sampling speed: NUTS is vastly more efficient than Metropolis in terms of the effective sample size of the resulting trace. BTW, you appear to be overwriting your data.
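A minimal sketch of such checks, assuming the PyMC3 API of that era; `model` stands in for the model under discussion and all settings are illustrative:

```python
import pymc3 as pm
import matplotlib.pyplot as plt

with model:
    # ADVI: inspect the ELBO trajectory; it should level off at a stationary value.
    v_params = pm.variational.advi(n=50000)
plt.plot(v_params.elbo_vals)
plt.xlabel('iteration'); plt.ylabel('ELBO')

with model:
    # Metropolis: run more than one chain so R-hat can be computed.
    trace = pm.sample(5000, step=pm.Metropolis(), njobs=2)
print(pm.diagnostics.gelman_rubin(trace))  # values near 1 suggest convergence
print(pm.diagnostics.effective_n(trace))   # effective sample size, not raw draws
```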
The ELBO in ADVI stabilizes very quickly. Metropolis, on the other hand, seems to suffer from bad initialization: h2 stays at zero and does not move at all. Is there any way to use the mean of VB to initialize Metropolis?
What do […]? You may also get better mileage out of specifying […].
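Part of this reply was lost in extraction; if the suggestion was about initialization from the variational fit, one reading of the exchange is the following sketch (assuming the old `pm.variational.advi` API, with `model` as a stand-in):

```python
import pymc3 as pm

with model:
    v_params = pm.variational.advi(n=50000)
    # Seed Metropolis at the ADVI posterior means.
    trace = pm.sample(5000, step=pm.Metropolis(), start=v_params.means)
```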
Not using `start` didn't improve the results for Metropolis. Meanwhile, I realized that I had an intercept in the […]. This is the new code (I followed your advice): […]
NUTS is on the right path (the value matches the ML estimate), but judging by the traceplot it is not quite converged yet. Perhaps more iterations would help, but it is quite slow (5,000 iterations take more than an hour). Metropolis still does not move, and ADVI is pretty far off too.
@kayhan-batmanghelich […]
@junpenglao I noticed that if I run NUTS for about 20,000 iterations, the posterior value is close to the maximum likelihood estimate, but that takes very, very long. I couldn't get ADVI to produce a reasonable result. Metropolis also does not produce good results; I tried initializing it with the ADVI values and increasing the iterations, with no success. Overall it is very surprising: this is a basic mixed effect model, perhaps a bit big, but really nothing special. My final goal is not to solve this specific mixed effect model but rather to use it as a stepping stone toward training more complicated models. I am very interested in getting PyMC3 working, since it lets me focus on the model rather than the inference algorithm.
Actually, it makes a difference, at least in the way […]. And for me, only Metropolis works, but not NUTS and ADVI...
I am trying to do the same thing - if you want, we can share the code and try to solve this together.
@junpenglao I found STAN very slow for this problem. This is a basic model; there is not much to change in it. This is my model: […] Other than a numerical issue, I don't have any other explanation for why it doesn't work.
How did STAN's results compare?
@datnamer For STAN, I tried both NUTS and ADVI. NUTS sat at the first warmup iteration (iteration 1) for 8 hours, so I stopped it. Full-rank ADVI is relatively fast (a couple of hours) but does not converge to the correct answer.
@kayhan-batmanghelich I think your model might be ill-conditioned for your data; you should try to reparameterize the model or perform some preprocessing.
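One common reparameterization for this kind of hierarchy is the non-centered form, sketched below. This is an editor's illustration of the generic technique, not necessarily the change meant here; `M`, `N`, `X`, `L`, and `Pheno` are as in the model quoted later in the thread:

```python
import pymc3 as pm
import theano.tensor as T

with pm.Model() as noncentered_model:
    h2 = pm.Uniform('h2')
    sigma2 = pm.HalfCauchy('eps', 5)
    w = pm.Normal('w', mu=0, sd=100, shape=M)
    # Draw z on a unit scale and rescale deterministically, so the sampler
    # avoids the funnel-shaped coupling between z and (h2, sigma2).
    z_raw = pm.Normal('z_raw', mu=0, sd=1, shape=N)
    z = pm.Deterministic('z', z_raw * (h2 * sigma2) ** 0.5)
    y = pm.Normal('y', mu=T.dot(L, z) + T.dot(X, w),
                  sd=((1 - h2) * sigma2) ** 0.5, observed=Pheno)
```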
Glad we are at least keeping up with Stan.
I would be interested in how ATMCMC performs with your model, as it is the perfect case for a multiply peaked distribution. Would you be interested in trying it?

```python
import pymc3 as pm
import numpy as np
from pymc3.step_methods import ATMCMC as atmcmc
import theano.tensor as tt
from matplotlib import pylab as plt

test_folder = ('ATMIP_TEST')
n_chains = 500
n_steps = 100
tune_interval = 25
njobs = 1

n = 4
mu1 = np.ones(n) * (1. / 2)
mu2 = -mu1
stdev = 0.1
sigma = np.power(stdev, 2) * np.eye(n)
isigma = np.linalg.inv(sigma)
dsigma = np.linalg.det(sigma)
w1 = stdev
w2 = (1 - stdev)

def two_gaussians(x):
    log_like1 = - 0.5 * n * tt.log(2 * np.pi) \
                - 0.5 * tt.log(dsigma) \
                - 0.5 * (x - mu1).tt.dot(isigma).dot(x - mu1)
    log_like2 = - 0.5 * n * tt.log(2 * np.pi) \
                - 0.5 * tt.log(dsigma) \
                - 0.5 * (x - mu2).T.dot(isigma).dot(x - mu2)
    return tt.log(w1 * tt.exp(log_like1) + w2 * tt.exp(log_like2))

with pm.Model() as ATMIP_test:
    X = pm.Uniform('X',
                   shape=n,
                   lower=-2. * np.ones_like(mu1),
                   upper=2. * np.ones_like(mu1),
                   testval=-1. * np.ones_like(mu1),
                   transform=None)
    like = pm.Deterministic('like', two_gaussians(X))
    llk = pm.Potential('like', like)

with ATMIP_test:
    step = atmcmc.ATMCMC(n_chains=n_chains, tune_interval=tune_interval,
                         likelihood_name=ATMIP_test.deterministics[0].name)
    trcs = atmcmc.ATMIP_sample(
        n_steps=n_steps,
        step=step,
        njobs=njobs,
        progressbar=True,
        trace=test_folder,
        model=ATMIP_test)

pm.summary(trcs)
Pltr = pm.traceplot(trcs, combined=True)
plt.show(Pltr[0][0])
```
@hvasbath Sure, I will give it a go. However, I think there are some mistakes in the ATMIP_2gaussians example: in `two_gaussians`, `(x - mu1).tt.dot(isigma)` should be `(x - mu1).T.dot(isigma)`, as `tt` is the `theano.tensor` module, not the tensor's transpose.
That must have accidentally gotten in there when the way to import theano.tensor was automatically changed from T to tt. So originally it was only a capital T.
@hvasbath I see. But I "...would need to add a likelihood variable to your model that contains the model likelihood and pass it to the input." Does the likelihood variable passed as `likelihood_name` need to be the one with the observed RV, or can I also do […]?
It needs to be a deterministic variable. A potential is not being traced in the traces; that's why I have the deterministic variable up there. For your model, if I understand it correctly, it would be fine to put 'y' in the likelihood_name. But as you have Normal priors, you cannot ignore them. So create a deterministic variable like this:

```python
with pm.Model() as mixedEffect_model:
    # M, N, L, X and Pheno as in your model
    ### hyperpriors
    h2 = pm.Uniform('h2', transform=None)
    sigma2 = pm.HalfCauchy('eps', 5, transform=None)
    # beta_0 = pm.Uniform('beta_0', lower=-1000, upper=1000)  # a replacement for improper prior
    w = pm.Normal('w', mu=0, sd=100, shape=M)
    z = pm.Normal('z', mu=0, sd=(h2 * sigma2) ** 0.5, shape=N)
    g = T.dot(L, z)
    y = pm.Normal('y', mu=g + T.dot(X, w),
                  sd=((1 - h2) * sigma2) ** 0.5, observed=Pheno)
    like = pm.Deterministic('like', h2.logpt + sigma2.logpt + w.logpt +
                            z.logpt + y.logpt)
```

Transforms have to be switched off, otherwise it won't work; but anyway, for Metropolis sampling I couldn't see a major improvement using them so far. Then you can call it like this:

```python
from pymc3.step_methods import ATMCMC

with mixedEffect_model:
    step = ATMCMC.ATMCMC(n_chains=500, tune_interval=20,
                         likelihood_name=mixedEffect_model.deterministics[0].name)
    trace = ATMCMC.ATMIP_sample(n_steps=200, step=step, njobs=1,
                                progressbar=True, model=mixedEffect_model,
                                trace='Test')
```

It will save all the traces in the 'Test' subdirectory of the current directory.
Let me know what you think!
Thanks a lot @hvasbath!!! I will try it right away. For me this was the hard part to understand - if you could write it down somewhere as documentation, that would be great!
Actually, there is a description in the class definition of the ATMCMC object, but I will add an example to it. Also, a flag for deleting the intermediate stages will be in the updated version of the code. These can fill the hard disk quite fast, but I usually keep all of the stages in order to see how it samples.
@hvasbath very cool, thanks for chiming in. This might be a good real-world example of ATMCMC usage.
Could you provide some more information on how to choose the optimal parameters for the sampler? If I run it as is, ATMCMC returns estimates very similar to MCMC and NUTS, except […]
@hvasbath, CC: @junpenglao, @twiecki. Thanks @hvasbath for your suggestion. I am trying ATMCMC on my model. My first try failed and I got this error: […] It seems that it is a numpy issue. I am not sure I can install that version of numpy for Python 2.7, so I switched to 3.5. @twiecki, is the numpy requirement enforced during installation? Thanks,
You probably can just update numpy to 1.11 and stick with 2.7. However, in general I recommend updating to 3.5.
Apparently we don't enforce the right version, good point.
@kayhan-batmanghelich I am using Python 2.7 with the latest numpy version; that is no problem. And yes, you need a newer numpy version!
@twiecki you are right; if we get this to run properly and the results are as expected, maybe we can use this example for the docs?
@hvasbath definitely.
I installed […]
I reduced the number of cores and it still produces the same error. This seems like a numerical issue. It can also be an initialization issue, because out of three trials one got through and ran for a few hours (it didn't finish because I got disconnected from our server). So far NUTS (after a long run of 20,000 iterations) produces good results.
The initial start points for your traces are being created by calling the .random() method for each variable.
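A hedged sketch of checking those seeds by hand, assuming each free variable exposes the `.random()` method mentioned above and transforms are switched off as required earlier (`mixedEffect_model` as defined above):

```python
import numpy as np

# Reproduce the seeding step: draw from each free variable's prior and
# flag non-finite values before committing to a long run.
for var in mixedEffect_model.vars:
    draws = np.atleast_1d(var.random())
    if not np.isfinite(draws).all():
        print(var.name, 'gave non-finite draws:', draws)
```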
@hvasbath yes, you are right; it has nothing to do with cores. Sorry for the misunderstanding. I meant I am going to replicate the experiment....
OK, this time I got "lucky" and it didn't take 10 hours to fail; it failed within 30 minutes. This is the code: […] This is the error: […]
But exactly that is the problem. The point where it crashes is when it reads the initial trace population and calculates a proposal covariance from it. It is enough to have one NaN or inf in one trace there, and it will spread over the whole matrix. What theano version do you have? There was a major fix between 0.8.2 and 0.9.0 that stopped the alloc function from returning NaNs, which resulted in exactly this problem. I always recommend the development version of theano. Let's hope that's it ;) .
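One way to look for such values, sketched under the assumption that the stage traces were written with the text backend to the directory passed as `trace=` (the exact stage path below is a guess):

```python
import numpy as np
from pymc3.backends.text import load

with mixedEffect_model:            # the backend needs a model context to load
    stage_trace = load('Test/stage_0')  # hypothetical stage directory
for varname in stage_trace.varnames:
    vals = stage_trace.get_values(varname)
    if not np.isfinite(vals).all():
        print(varname, 'contains NaN or inf')
```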
@hvasbath OK, but at least in the log I couldn't find any inf/nan. Perhaps there is a nan/inf right before writing to the stage files, but there is no way to check (other than debugging line by line). I can request up to 5 days on a node, but it is hard to predict when it crashes or how long it takes. Based on your previous reply, it may take up to 30 stages. If one stage takes about 8 hours (using 12 cores), 30 stages take about 240 hrs = 10 days. Is there any way to start from where I left off, in case I get kicked off by our admin :)
The operation where it crashes comes after sampling of the stage is finished, so when it crashes, something must be going wrong in the transitional stage. Could you please send me such a trace file next time it crashes? A way to restart at a given stage is under construction; I simply had no time to work on it further, but I guess next week I could find the time, and it should not be too difficult. Implementing such an exception would obviously be good to have and would need to be done as well. You might want to consider downscaling your problem first, until you are sure that the model is stable and not crashing. As was visible from your earlier post, your matrix is on the order of 4000 by 4000, which is also why it takes so long to run one forward model. Do you have a specifically compiled ATLAS/BLAS version for your cluster? If not, that would be important to do. Then try using only one core for sampling and enable theano's internal parallelisation of the matrix operations - how to do that is in the theano docs. That parallelisation can be much more effective and could speed up your model significantly.
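A sketch of the environment-flag route to Theano's internal parallelism; the specific BLAS library (OpenBLAS here) and thread count are assumptions to adapt to the cluster:

```python
import os

# Must be set before theano is first imported; flags are read at import time.
os.environ['THEANO_FLAGS'] = 'openmp=True,blas.ldflags=-lopenblas'
os.environ['OMP_NUM_THREADS'] = '12'

import theano
import pymc3 as pm
```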
Any update on this? @kayhan-batmanghelich
Hi @twiecki, no update on that. So far this is my conclusion: Metropolis and ADVI didn't produce correct results. NUTS produces correct results after 20,000 iterations, but that took about a day. ATMCMC was also slow at this scale: it took three days to reach stage 3, and I guess I needed 10-30 more stages, which renders ATMCMC impractical for this problem. I know the model is correct; as I mentioned, the NUTS results seem right. The goal was to achieve faster inference, since I would like to make the model more complicated, and slow inference makes that almost impossible. Anyway, PyMC3 is great and I am sure I will use it for other problems in my research, but for this specific one I probably need to write a custom inference algorithm. Finally, some suggestions; of course I understand people are busy, so feel free to ignore them :) -- it would be great if ATMCMC could resume in case the job crashes, etc. It also seems a bit unstable; more informative messages would be helpful. Thanks everyone for the input,
@kayhan-batmanghelich I just recently found out why it is so slow. That is because of how the backends work; see this issue: #1264. I am also working on continuing the sampling from a later stage... In which way is it unstable? Please let me know, otherwise I can't improve it ;) . The sampler is rather young, and it is basically my fault if something doesn't work well. ;)
@hvasbath Thanks for your reply. Please let me know whenever you merge the changes into the main repository, and I will try it again. Regarding instability, it produces the nan/inf values at different stages of the sampling (please see above). It is hard to reproduce; for example, last time I couldn't reproduce it because my job got killed by the scheduler after three days and it was still running. It is not your fault, my friend; thank you for sharing your code :)
With the example data and the new auto-init (#1523), this converges immediately.
@twiecki Would you please post the code as well? We have already talked about different ways of doing it here. Also, would you please let me know which example data you used?
I used this:

```python
import numpy as np, pymc3 as pm, theano.tensor as T, matplotlib.pyplot as plt

M = 6    # number of columns in X - fixed effect
N = 10   # number of columns in L - random effect
nobs = 10

# generate design matrix using patsy
from patsy import dmatrices
import pandas as pd
predictors = []
for s1 in range(N):
    for c1 in range(2):
        for c2 in range(3):
            for i in range(nobs):
                predictors.append(np.asarray([c1 + 1, c2 + 1, s1 + 1]))
tbltest = pd.DataFrame(predictors, columns=['Condi1', 'Condi2', 'subj'])
tbltest['Condi1'] = tbltest['Condi1'].astype('category')
tbltest['Condi2'] = tbltest['Condi2'].astype('category')
tbltest['subj'] = tbltest['subj'].astype('category')
tbltest['tempresp'] = np.random.normal(size=(nobs * M * N, 1))

Y, X = dmatrices("tempresp ~ Condi1*Condi2", data=tbltest, return_type='matrix')
Terms = X.design_info.column_names
_, L = dmatrices('tempresp ~ -1+subj', data=tbltest, return_type='matrix')
X = np.asarray(X)   # fixed effect
L = np.asarray(L)   # mixed effect
Y = np.asarray(Y)

# generate data
w0 = [5, 1, 2, 3, 1, 1]
z0 = np.random.normal(size=(N,))
Pheno = np.dot(X, w0) + np.dot(L, z0) + Y.flatten()

#%%
with pm.Model() as mixedEffect_model:
    ### hyperpriors
    h2 = pm.Uniform('h2')
    sigma2 = pm.HalfCauchy('eps', 5)
    # beta_0 = pm.Uniform('beta_0', lower=-1000, upper=1000)  # a replacement for improper prior
    w = pm.Normal('w', mu=0, sd=100, shape=M)
    z = pm.Normal('z', mu=0, sd=(h2 * sigma2) ** 0.5, shape=N)
    g = T.dot(L, z)
    y = pm.Normal('y', mu=g + T.dot(X, w),
                  sd=((1 - h2) * sigma2) ** 0.5, observed=Pheno)
    trace = pm.sample(5000)
```
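A short follow-up for pulling out the quantity of interest from that run (names as in the code above):

```python
# h2 is the proportion of variance explained by the random effect,
# the quantity this whole thread is about.
pm.summary(trace, varnames=['h2'])
print('posterior mean of h2:', trace['h2'].mean())
```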
@twiecki Thanks, so Metropolis converges quickly. I will try it on my data and report back. I presume I need to update pymc, since there have been significant changes. Thanks.
@kayhan-batmanghelich Great, that would be helpful. Make sure not to pass a step-method object, and update to master.
Dear PyMC developers,
I am trying to develop a simple mixed effect model using pymc (see the code below). I have tried NUTS, ADVI, and Metropolis for inference. Aside from variations in speed for this model (time: ADVI ~= Metropolis << NUTS), the results are significantly different. To be more specific, I am interested in estimating $h^2$, which is basically the proportion of variance explained by the random effect.
This is the code: […]
ADVI: […]
Metropolis: […]
NUTS (slow): […]
Given that `w` and `beta_0` are treated as parameters with (approximately) improper priors, I expect the results to be very close to the maximum likelihood estimate. Using maximum likelihood estimation (more specifically, restricted maximum likelihood estimation), I expect values around 0.5 for $h^2$, with a standard error of around 0.07. The results are very different: very different from ML, and also very different across the inference engines. NUTS is closer to ML, but given that the results are drastically different, I don't know how to explain it. Any idea?
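The post does not say which REML implementation produced the 0.5 estimate; as one concrete baseline, a statsmodels fit along these lines would give a comparable number (a sketch only, reusing the synthetic `tbltest` data and `Pheno` response from the code posted above):

```python
import statsmodels.formula.api as smf

tbltest['Pheno'] = Pheno  # use the generated response, not the noise column
md = smf.mixedlm("Pheno ~ Condi1*Condi2", tbltest, groups=tbltest["subj"])
mdf = md.fit(reml=True)
var_re = mdf.cov_re.iloc[0, 0]            # random-effect variance
h2_reml = var_re / (var_re + mdf.scale)   # proportion explained by the random effect
print(h2_reml)
```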