Variational inference with AD? #708
Comments
We don't currently have this implemented, but I suspect it would fit well.
Neat! This would be fantastic and I think ...
It would be :) Just to be clear though, this was more of an "I'll just leave this right here" kinda deal... I would have no idea where to even start.
Understood :). Looking at the paper it actually doesn't look that bad.
I've been looking over that paper with a student of mine. Looks promising.
Relatedly, there is this. Nice to see a few worthy alternatives to MCMC showing up.
Wow. Flexible inference on larger datasets would be a killer app and also ameliorate Python's dearth of statistical modeling packages.
They are using ...
It's the stochastic estimate of the gradient. If you're using ... @akucukelbir, thoughts?
@syclik thanks for alerting me to this thread. Very exciting. The notation is really poor in that workshop paper; I threw it together too quickly. Expect a much more readable (and thorough) arXiv preprint soon.

I'm not too familiar with ... Do we have transformations in ...?

@twiecki we only use rmsprop in the context of "forgetting" about past gradients. I think it would be more accurate to say we use a windowed version of adaGrad with stochastic gradient ascent.

@datnamer the arXiv preprint will present results of using a data subsampling strategy with this algorithm. The speed improvements are dramatic.
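For concreteness, the update being described is roughly the following: a windowed (exponentially decaying) AdaGrad accumulator driving stochastic gradient ascent on the ELBO. This is a minimal sketch; the names and constants are illustrative, not the paper's actual implementation.

```python
import numpy as np

def windowed_adagrad_ascent_step(params, grad, sq_grad_window,
                                 eta=0.1, alpha=0.9, eps=1e-16):
    """One stochastic gradient ASCENT step with a windowed AdaGrad scaling.

    `grad` is a noisy (e.g. Monte Carlo) estimate of the ELBO gradient.
    Unlike plain AdaGrad, the squared-gradient accumulator decays with
    factor `alpha`, so old gradients are gradually "forgotten".
    """
    sq_grad_window = alpha * sq_grad_window + (1.0 - alpha) * grad ** 2
    params = params + eta * grad / (eps + np.sqrt(sq_grad_window))
    return params, sq_grad_window
```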
@akucukelbir @syclik Thanks for chiming in here! Regarding the question of transformations, they do exist, but are not automatic (although that would probably not be too hard). Here is an example: ...

Your help in getting this implemented in ...
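The example referenced above isn't preserved in this thread, but the idea of a manual (non-automatic) transformation is roughly this: work with an unconstrained parameter and add the log-Jacobian of the transform to the log posterior. A generic sketch under an assumed toy model, not PyMC3's actual API:

```python
import numpy as np

def log_post_unconstrained(log_sigma, data):
    """Log posterior in terms of log_sigma (unconstrained) instead of sigma > 0.

    Illustrative model: data ~ Normal(0, sigma), sigma ~ HalfNormal(1).
    The trailing + log_sigma is the log-Jacobian of sigma = exp(log_sigma).
    """
    sigma = np.exp(log_sigma)
    log_prior = -0.5 * sigma ** 2                                 # HalfNormal(1), up to a constant
    log_lik = np.sum(-0.5 * (data / sigma) ** 2 - np.log(sigma))  # Normal(0, sigma) likelihood
    return log_prior + log_lik + log_sigma                        # Jacobian adjustment
```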
Great! Could we chat about this after June 5? I'd love to talk in more detail. (I'll put it in my calendar, but just to be safe could one of you also ping me?)
Sounds good!
I'd love to chat with you guys as well.
@akucukelbir Want to try and find a time to chat about that?
absolutely. the arXiv paper should be up in a day or two. i'll let you guys know on this thread, and we can schedule a time. (having the paper in front of us would help guide the discussion, i think.) sound good?
👍
hi folks. ok so i'm having some problems with arXiv. (should be resolved soon.) in the meantime, i'll just host the preprint on my website. here it is: http://www.proditus.com/papers/KRGB_preprint.pdf
are we all in similar time zones? (i'm in new york.) if so, we can maybe start proposing some time-slots to chat? looking forward!
Sounds great. I'm in CET, @jsalvatier is in PT, I think. So something like 11am EST might work well for everyone.
I'm in Central, so anything works for me.
if we're shooting for 11am eastern, then the earliest i could do is friday (jun 12). if later in the day is a possibility, then 3pm eastern today (jun 10) is also an option.
Friday would be best for me.
11am EST works fine for me. I'm jsalvatier on both skype and gmail.
I can make it too.
Why don't we use appear.in? Works in the browser without any account requirements or software. Let's say: ... Sound good?
perfect. friday at 11am eastern on appear.in. looking forward!
wow. i didn't realize you guys had HMC. i'll be sure to cite the 2010 paper for the camera-ready version of our paper. that's a big omission on my part. apologies. talk to you in 2 hours!
Actually, the 2010 paper is pre-HMC. We are working on something now, but it is not yet published or in press.
@aflaxman thanks Abie, do you know if there's a good way to figure out how to use the internal optimizer?
Thanks Thomas, exactly what I was looking for.
Looks like you found something even more suited than sklearn. If it seems like the sklearn route would still be helpful, let me know more about what you are after, and I can take a look.
Basically, I'm looking for any stochastic optimizers with a general ...
@datnamer were you ever able to dig up the dataset?
Hi all. First thing that occurred to me on reading this paper was "this would be great to add into PyMC". Glad to see it's already being worked on!

@jsalvatier Re AdaGrad implementations, there are a fair few Theano-based "deep learning" packages around which should implement the common variants on SGD. The one I'm using is Blocks, which has implemented AdaGrad, AdaDelta, etc. in a modular way which can be compiled as updates to Theano shared variables: https://blocks.readthedocs.org/en/latest/api/algorithms.html Not sure quite how coupled those are to the rest of the library, but they might be a useful starting point, and they aren't actually that much code anyway. You probably do want to implement this in a way that uses Theano compiled functions to do the updates, though, rather than a completely generic numpy-based optimiser routine. With Theano the updates can run entirely on the GPU without having to transfer parameters between GPU and CPU (something that can kill the performance advantage). If there's any way this could support streaming training batches (e.g. from a Python iterator) rather than requiring the whole dataset to be loaded into memory, that would also be great.

@datnamer numpy mmapped arrays might be one way around this, but they might restrict the way you use the library a bit. I imagine most people working with seriously large datasets are happy to roll a bit of their own code if necessary to deliver minibatches to an algorithm here, or even to implement their own training loop, provided you expose the right APIs. To my mind that might even be preferable to an "all or nothing" black-box approach where you don't have any control over how the dataset is iterated over.

@akucukelbir I have a question, actually, about the data subsampling strategy mentioned in the paper. It describes the updates as being O(BMK) per minibatch, where K is the number of parameters, implying you update all the parameters for every batch. How do you handle the case where you have IID data with latent variable(s) for each datapoint -- do you only update the variational parameters for the datapoints in the current batch, or do you need to maintain parameters for every point in the dataset and update them all on every batch? The latter seems impractical in a big-data situation. I can think of ways you might be able to implement the former approach with ADVI, but it might require analysing the model's dependency structure a bit to find which latent variables it can treat this way, and then doing a few inner iterations of inference for the per-batch variational parameters before updating the shared variational parameters. IIRC Hoffman's online inference for LDA works this way, for example.
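To make the "compiled updates to Theano shared variables" point concrete, here is a rough sketch. The symbolic input `x`, the `minibatch_cost` graph, and the `minibatch_iterator` are hypothetical stand-ins for whatever the ADVI implementation would build; this is not Blocks' code.

```python
import numpy as np
import theano
import theano.tensor as tt

def compile_adagrad_step(x, minibatch_cost, params, learning_rate=0.1, eps=1e-6):
    """Compile a Theano function that performs one AdaGrad step in place.

    `x` is the symbolic minibatch input, `minibatch_cost` a symbolic scalar
    (e.g. a negative ELBO estimate), and `params` a list of theano.shared
    variables. The accumulators are shared variables too, so the whole update
    can stay on the GPU; data can then be fed from any Python iterator.
    """
    updates = []
    for p in params:
        acc = theano.shared(np.zeros_like(p.get_value()))
        g = tt.grad(minibatch_cost, p)
        acc_new = acc + g ** 2
        updates.append((acc, acc_new))
        updates.append((p, p - learning_rate * g / tt.sqrt(acc_new + eps)))
    return theano.function([x], minibatch_cost, updates=updates)

# step = compile_adagrad_step(x, cost, params)
# for batch in minibatch_iterator():   # streaming batches, nothing held in memory
#     step(batch)
```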
@mjwillson at the moment, we only support models with "global" latent variables. e.g. we marginalize out the "local" latent variables in a mixture model. what you describe is indeed what SVI does. it's not immediately clear, however, how to implement that in ADVI.
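To illustrate what "marginalizing out the local latent variables in a mixture" means in practice, here is a generic numpy/scipy sketch (not PyMC3 or Stan code): the per-point component assignments are summed out, leaving only global parameters for ADVI to handle.

```python
import numpy as np
from scipy.special import logsumexp

def mixture_loglik(x, log_weights, means, sigmas):
    """Log-likelihood of 1-D data under a Gaussian mixture, with the per-point
    component assignments (the "local" latent variables) summed out.
    Only the global parameters (weights, means, sigmas) remain.
    """
    # log p(x_i) = logsumexp_k [ log w_k + log N(x_i | mu_k, sigma_k) ]
    log_comp = (log_weights
                - 0.5 * ((x[:, None] - means) / sigmas) ** 2
                - np.log(sigmas) - 0.5 * np.log(2.0 * np.pi))
    return np.sum(logsumexp(log_comp, axis=1))
```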
@akucukelbir that makes sense, thanks for the response :) I can imagine it would be fiddly to get that working in a general setting.
@mjwillson indeed. nevertheless, it's a problem well worth tackling. let me know if at any point you become interested in working on it.
@mjwillson @akucukelbir @jsalvatier @twiecki Have you seen this BBVI in numpy?! https://github.com/HIPS/autograd/blob/b305f211a0db3f73ee2bed2cd6fb5ff16fe7c8df/examples/black_box_svi.py
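The linked file is the full example; the core idea -- a Monte Carlo ELBO for a diagonal-Gaussian variational family, differentiated with autograd via the reparameterisation trick -- fits in a few lines. The sketch below follows the same idea rather than copying the example's code; the `logprob` function is a hypothetical stand-in for the model.

```python
import autograd.numpy as np
import autograd.numpy.random as npr
from autograd import grad

def make_elbo(logprob, D, n_samples=20):
    """Stochastic ELBO estimate for a diagonal-Gaussian q(z).

    `logprob(z)` should accept an (n_samples, D) array of latent values and
    return the joint log density of each row (a stand-in for the model).
    """
    def elbo(var_params):
        mean, log_std = var_params[:D], var_params[D:]
        eps = npr.randn(n_samples, D)
        z = mean + np.exp(log_std) * eps          # reparameterisation trick
        entropy = np.sum(log_std)                 # Gaussian entropy, up to a constant
        return entropy + np.mean(logprob(z))
    return elbo

# elbo = make_elbo(my_logprob, D)
# elbo_grad = grad(elbo)                           # gradient w.r.t. the variational parameters
# var_params = var_params + step_size * elbo_grad(var_params)   # stochastic gradient ascent
```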
That's really nice. I wonder how it compares to Theano performance-wise? It would be great to be able to someday free PyMC from the constraints that Theano currently places on it.
Are those feature constraints, coding constraints, or both?

My dream is to have AD for a combination of Numba and DyND, the new dynamic array library and NumPy successor, with missing data, user-defined types, etc. Numba is already working well with DyND, and more support is planned, IIRC. I think if we give it a big push in PyData, this will be big for the ecosystem.

I opened an issue here: HIPS/autograd#51 Any help to get this some momentum would be awesome.
Sort of both, but I was thinking primarily of semantic constraints, such as the inability to write loops. Theano is an additional layer that PyMC3 users have to deal with in order to build models.
I think cgt is looking at allowing loops.
While these are interesting propositions, I think that would be more for pymc4; we should focus on getting pymc3 out the door.
@datnamer Wow, I wasn't aware of autograd. That's like dark magic :) I guess it's probably not as fast as Theano (doesn't seem to do any symbolic graph-rewriting optimisations, won't compile new kernels for you, ...), but maybe it's worth it for the simplicity, and it does seem to have support for gpuarray.

I wasn't aware of CGT (http://rll.berkeley.edu/cgt/) either; this looks very neat as a better Theano. Anyone using it / have a feel for how mature it is? (Sorry, getting slightly off-topic.)
@twiecki if cgt is supposed to be almost a drop-in Theano replacement... maybe it could go in pymc3?
The problem I see with autograd is that it will be very slow, as it's just using numpy. I hope they explore numba to speed things up.

The problem I see with cgt is that it's still a young project. Last time I checked it still lacked features, so it would not be a drop-in replacement, and I expect the cost of changing the backend to be quite high (even if it doesn't appear too bad at first sight). And it doesn't really solve any problems -- users would just have to learn cgt instead of theano.

At this point, we are really close to having something quite powerful and usable built on Theano. Putting on the finishing touches will be a much better ROI.
a bit late to the discussion, but i think all of this is very exciting stuff.
@twiecki that makes sense. There is also this package that compiles numpy to Theano: https://github.com/LowinData/pyautodiff But I don't see that it can handle loops (I'll ask).

@akucukelbir I'm happy that you are involved and following!
+1 to @akucukelbir.

@datnamer, it looks like pyautodiff is a misnomer. One of the reasons it's going to have trouble with loops is that Theano does symbolic differentiation, not automatic differentiation. (With enough restrictions, loops are fine, but in general, it's going to be difficult to symbolically differentiate.)
Makes sense. I wonder if there is a way to compile loops to Theano's scan.

@syclik that seems like a decent drawback. Are there any benefits to symbolic diff vs. auto diff?
@datnamer From what I can tell, the benefit of autodiff is that the expression graph is constructed on the fly, meaning you can use loops and control flow without having to express them symbolically. That could make data-dependent control flow a lot more natural, but it could still have pitfalls if you make control flow decisions based on parameters (it can't magically backpropagate the error past non-differentiable control flow operations -- and in practice it would only know about the one code path that the forwards evaluation went down). Autodiff is probably more limited in terms of graph optimisations too, since you don't have the expression graph upfront, although some clever just-in-time stuff might be possible.
@mjwillson, that's exactly right. Since the expression graph is constructed for each evaluation, autodiffing an algorithm that has different branching behavior from run to run is possible. Regarding differentiating past non-differentiable operations, that just doesn't work (for math... the rest follows).

Regarding graph optimizations: that's correct. With symbolic differentiation operating on a static expression graph, you can do some really neat optimization. This limits what you can express in symbolic differentiation, but I'd buy the argument that maybe you can restructure what you need into a static expression. With automatic differentiation, you're not guaranteed that the expression graph that's generated for a particular execution is static from run to run. Of course, in most circumstances it is, so someone really clever could do something just-in-time. If you wanted to restrict the expressiveness of autodiff to guarantee a static expression graph, then you should just use symbolic differentiation.
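As a small illustration of the "graph constructed per evaluation" point, here is a hypothetical autograd example with data-dependent control flow: the number of loop iterations, and hence the expression graph, depends on the input value, which a static symbolic graph cannot express without something like scan.

```python
import autograd.numpy as np
from autograd import grad

def halve_until_small(x, threshold=1e-3):
    """Ordinary Python control flow: how many times we halve depends on x."""
    while np.abs(x) > threshold:
        x = 0.5 * x
    return x

dhalve = grad(halve_until_small)
print(halve_until_small(2.0), dhalve(2.0))
# The derivative is 0.5**k for however many halvings were actually executed.
```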
This is implemented now. |
Can the theano infrastructure handle this?
stan-dev/stan#1421
http://andrewgelman.com/2015/02/18/vb-stan-black-box-black-box-variational-bayes/