
Variational inference with AD? #708

Closed · datnamer opened this issue May 13, 2015 · 64 comments

@datnamer

Can the theano infrastructure handle this?

stan-dev/stan#1421
http://andrewgelman.com/2015/02/18/vb-stan-black-box-black-box-variational-bayes/

@jsalvatier
Member

We don't currently have this implemented, but I suspect it would fit well into pymc3.


@twiecki
Member

twiecki commented May 14, 2015

Neat! This would be fantastic and I think theano has some features that would make this work without too much hassle.

@datnamer
Author

It would be :) Just to be clear though, this was more of an "I'll just leave this right here" kind of deal... I would have no idea where to even start.

@twiecki
Member

twiecki commented May 14, 2015

Understood :). Looking at the paper it actually doesn't look that bad.

@fonnesbeck
Member

I've been looking over that paper with a student of mine. Looks promising.

@fonnesbeck
Member

Relatedly, there is this. Nice to see a few worthy alternatives to MCMC showing up.

@datnamer
Author

Wow. Flexible inference on larger datasets would be a killer app, and it would also help remedy Python's dearth of statistical modeling packages.

@twiecki
Member

twiecki commented May 15, 2015

They are using RMSProp to maximize. There are two Python implementations:
One that works with Theano, from the theanets library: http://theanets.readthedocs.org/en/latest/generated/theanets.trainer.RmsProp.html
One from climin: http://climin.readthedocs.org/en/latest/rmsprop.html
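For reference, a hand-rolled RMSProp update in Theano looks roughly like this (a toy quadratic objective, purely for illustration; not pymc3 code):

```python
import numpy as np
import theano
import theano.tensor as T

# Toy objective: a quadratic in a shared parameter vector.
w = theano.shared(np.zeros(5), name="w")
loss = T.sum((w - 1.0) ** 2)
grad = T.grad(loss, w)

# RMSProp: keep a decaying average of squared gradients and rescale the step by it.
lr, rho, eps = 0.01, 0.9, 1e-6
acc = theano.shared(np.zeros(5), name="acc")
new_acc = rho * acc + (1 - rho) * grad ** 2
updates = [(acc, new_acc),
           (w, w - lr * grad / T.sqrt(new_acc + eps))]

step = theano.function([], loss, updates=updates)
for _ in range(200):
    step()
```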

@twiecki
Member

twiecki commented May 15, 2015

[image: the gradient estimator from the paper] is what we need to figure out how to compute.

@syclik

syclik commented May 27, 2015

It's the stochastic estimate of the gradient. If you're using theano, this should be cheaper than in Stan because it's already done symbolically, I think.
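As a rough sketch of that symbolic route (a single-sample reparameterized estimate for a mean-field Gaussian approximation; `logp` below is a stand-in toy model, not the paper's exact estimator):

```python
import numpy as np
import theano
import theano.tensor as T
from theano.tensor.shared_randomstreams import RandomStreams

D = 2                                            # number of latent dimensions (toy value)
srng = RandomStreams(seed=123)

mu = theano.shared(np.zeros(D), name="mu")       # variational means
omega = theano.shared(np.zeros(D), name="omega") # variational log-stddevs

def logp(theta):
    # Stand-in model: standard normal log-density (up to a constant).
    return -0.5 * T.sum(theta ** 2)

eta = srng.normal((D,))                          # eta ~ N(0, I)
theta = mu + T.exp(omega) * eta                  # reparameterized draw from q
elbo = logp(theta) + T.sum(omega)                # single-sample E_q[log p] + Gaussian entropy (up to const.)

g_mu, g_omega = T.grad(elbo, [mu, omega])        # the stochastic gradient, obtained symbolically
step = theano.function([], [g_mu, g_omega])      # each call draws a fresh eta
```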

@akucukelbir, thoughts?

@akucukelbir

@syclik thanks for alerting me to this thread. Very exciting.

The notation is really poor in that workshop paper; I threw it together too quickly. Expect a much more readable (and thorough) arxiv preprint soon.

I'm not too familiar with pymc or theano, but the primary reason our algorithm works so well in Stan is the automatic transformations. Without them, you'd have to specify the variational approximation by hand.

Do we have transformations in pymc? If so, given that we've worked out a lot of the kinks in Stan, it shouldn't be hard to implement in pymc too.

@twiecki we only use rmsprop in the context of "forgetting" about past gradients. I think it would be more accurate to say we use a windowed version of AdaGrad with stochastic gradient ascent.

@datnamer the arxiv preprint will present results of using a data subsampling strategy with this algorithm. The speed improvements are dramatic.
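To make the transformation point concrete, here is a toy numpy sketch (my own illustration, not Stan or pymc code): a positive-constrained parameter is mapped to the real line, and the log-density picks up a log-Jacobian term, so a Gaussian approximation on the unconstrained variable is legitimate.

```python
import numpy as np

# Toy example: sigma > 0 with an Exponential(1) prior; transform to zeta = log(sigma).
def logp_sigma(sigma):
    return -sigma                      # log p(sigma) for sigma > 0

def logp_zeta(zeta):
    sigma = np.exp(zeta)               # inverse transform back to the constrained space
    log_jacobian = zeta                # log |d sigma / d zeta| = log(exp(zeta)) = zeta
    return logp_sigma(sigma) + log_jacobian

# A Gaussian variational approximation can now be placed directly on zeta,
# since zeta ranges over the whole real line.
```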

@twiecki
Member

twiecki commented May 27, 2015

@akucukelbir @syclik Thanks for chiming in here!

Regarding the question of transformations, they do exist, but they are not automatic (although automating them would probably not be too hard). Here is an example:
https://github.com/pymc-devs/pymc3/blob/master/pymc3/examples/stochastic_volatility.py#L56

Your help in getting this implemented in pymc3 would be much appreciated!

@akucukelbir

Great! Could we chat about this after June 5? I'd love to talk in more detail. (I'll put it in my calendar, but just to be safe could one of you also ping me?)

@twiecki
Member

twiecki commented May 27, 2015

Sounds good!

@jsalvatier
Member

I'd love to chat with you guys as well.


@twiecki
Member

twiecki commented Jun 9, 2015

@akucukelbir Want to try and find a time to chat about that?

@akucukelbir

absolutely. the arXiv paper should be up in a day or two. i'll let you guys know on this thread, and we can schedule a time. (having the paper in front of us would help guide the discussion, i think.) sound good?

@twiecki
Member

twiecki commented Jun 9, 2015

👍

@akucukelbir

hi folks. ok so i'm having some problems with arXiv. (should be resolved soon.) in the meantime, i'll just host the preprint on my website. here it is:

http://www.proditus.com/papers/KRGB_preprint.pdf

are we all in similar time zones? (i'm in new york.) if so, we can maybe start proposing some time-slots to chat?

looking forward!

@twiecki
Member

twiecki commented Jun 10, 2015

Sounds great. I'm in CET, @jsalvatier is in PT I think. So something like 11am EST might work well for everyone.

@fonnesbeck
Member

I'm in Central, so anything works for me.

@akucukelbir

if we're shooting for 11am eastern, then the earliest i could do is friday (jun 12).

if later in day is a possibility, then 3pm eastern today (jun 10) is also an option.

@twiecki
Member

twiecki commented Jun 10, 2015

Friday would be best for me.

@jsalvatier
Member

11am EST works fine for me. I'm jsalvatier on both skype and gmail.


@syclik

syclik commented Jun 10, 2015

I can make it too.

@fonnesbeck
Member

Why don't we use appear.in? It works in the browser without any account requirements or software. Let's say:

https://appear.in/pymc3

Sound good?

@akucukelbir

perfect. friday at 11am eastern on appear.in. looking forward!

@akucukelbir

wow. i didn't realize you guys had HMC. i'll be sure to cite the 2010 paper for the camera-ready version of our paper. that's a big omission on my part. apologies.

talk to you in 2 hours!

@fonnesbeck
Member

Actually, the 2010 paper is pre-HMC. We are working on something now, but it is not yet published or in press.

@jsalvatier
Member

@aflaxman Thanks Abie, do you know if there's a good way to figure out how to use the internal optimizer?

@twiecki
Member

twiecki commented Jun 19, 2015

scikit-learn only has SGD, but the paper is using AdaGrad I think. Lasagne seems to offer all of those for Theano: https://github.com/Lasagne/Lasagne/blob/master/lasagne/updates.py
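Wiring one of those update rules into a Theano function is just a few lines (toy objective, purely illustrative):

```python
import numpy as np
import theano
import theano.tensor as T
import lasagne

# Toy objective: minimize a quadratic in a shared parameter vector.
w = theano.shared(np.zeros(3, dtype=theano.config.floatX), name="w")
loss = T.sum((w - 1.0) ** 2)

# lasagne.updates.adagrad returns an update dictionary we can hand to theano.function.
updates = lasagne.updates.adagrad(loss, [w], learning_rate=0.1)
step = theano.function([], loss, updates=updates)

for _ in range(200):
    step()
```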

@jsalvatier
Member

Thanks Thomas, exactly what I was looking for.


@aflaxman
Contributor

Looks like you found something even more suited than sklearn. If it seems like the sklearn route would still be helpful, let me know more what you are after, and I can take a look.

@jsalvatier
Member

Basically, I'm looking for any stochastic optimizers with a general interface, similar to how scipy.optimize has a general interface.
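Something with roughly this shape, hypothetical names only, just to illustrate the kind of interface I mean:

```python
import numpy as np

def minimize_stochastic(grad, x0, batches, step_size=0.01, rho=0.9, eps=1e-6):
    """Hypothetical generic interface: grad(x, batch) -> gradient, batches -> iterable of minibatches."""
    x = np.array(x0, dtype=float)
    acc = np.zeros_like(x)                         # RMSProp-style running average of squared gradients
    for batch in batches:
        g = grad(x, batch)
        acc = rho * acc + (1 - rho) * g ** 2
        x = x - step_size * g / np.sqrt(acc + eps)
    return x

# usage sketch: fit a mean under squared loss, one random minibatch at a time
data = np.random.randn(1000, 3) + 2.0
batches = (data[np.random.choice(len(data), 32)] for _ in range(500))
x_hat = minimize_stochastic(lambda x, b: 2 * (x - b.mean(axis=0)), np.zeros(3), batches)
```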


@jsalvatier
Member

@datnamer were you ever able to dig up the dataset?

@mjwillson

Hi all

First thing that occurred to me on reading this paper was "this would be great to add into PyMC". Glad to see it's already being worked on!

@jsalvatier Re AdaGrad implementations, there are a fair few theano-based "deep learning" packages around which should implement the common variants on SGD. The one I'm using is Blocks, which has implemented AdaGrad, AdaDelta etc here in a modular way which can be compiled as updates to theano shared variables:

https://blocks.readthedocs.org/en/latest/api/algorithms.html
https://github.com/mila-udem/blocks/blob/master/blocks/algorithms/__init__.py#L722

Not sure quite how coupled those are to the rest of the library but might be a useful starting point, and aren't actually that much code anyway. You probably do want to implement this in a way that uses theano compiled functions to do the updates though, rather than a completely generic numpy-based optimiser routine. With theano the updates can run entirely on the GPU without having to transfer parameters between GPU and CPU (something that can kill the performance advantage).

If there's any way this could support streaming training batches (e.g. from a python iterator) rather than requiring the whole dataset to be loaded into memory, that would also be great. @datnamer numpy mmapped arrays might be one way around this but might restrict the way you use the library a bit.

I imagine most people working with seriously large datasets are happy to roll a bit of their own code if necessary to deliver minibatches to an algorithm here, or even to implement their own training loop, provided you expose the right APIs. To my mind that might even be preferable to an "all or nothing" black-box approach where you don't have any control over how the dataset is iterated over.
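Roughly the pattern I have in mind (a toy sketch, not a proposal for any particular API): stream batches from a Python iterator into a shared variable, so the compiled update step stays on the GPU and only the batch data gets transferred.

```python
import numpy as np
import theano
import theano.tensor as T

# Shared variables: the current minibatch and the parameters being optimized.
minibatch = theano.shared(np.zeros((100, 10), dtype=theano.config.floatX), name="minibatch")
w = theano.shared(np.zeros(10, dtype=theano.config.floatX), name="w")

# Toy per-batch objective for illustration; a real model would use its own log-density.
loss = T.sum((T.dot(minibatch, w) - 1.0) ** 2) / minibatch.shape[0]
grad = T.grad(loss, w)
step = theano.function([], loss, updates=[(w, w - 0.01 * grad)])

def batches(stream, size=100):
    """Yield full minibatches from any iterable of rows, without loading everything into memory."""
    buf = []
    for row in stream:
        buf.append(row)
        if len(buf) == size:
            yield np.asarray(buf, dtype=theano.config.floatX)
            buf = []

for batch in batches(iter(np.random.randn(1000, 10))):
    minibatch.set_value(batch)   # only the batch moves host -> device; the update stays compiled
    step()
```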

@akucukelbir I have a question actually about the data subsampling strategy mentioned in the paper. It describes the updates as being O(BMK) per minibatch, where K is the number of parameters, implying you update all the parameters for every batch.

How do you handle the case where you have IID data with latent variable(s) for each datapoint -- do you only update the variational parameters for the datapoints in the current batch, or do you need to maintain parameters for every point in the dataset and update them all on every batch? The latter seems impractical in a big data situation.

I can think of ways you might be able to implement the former approach with ADVI, but it might require analysing the model's dependency structure a bit to find which latent variables it can treat this way, and then doing a few inner iterations of inference for the per-batch variational parameters before updating the shared variational parameters. IIRC Hoffman's online inference for LDA works this way, for example.

@akucukelbir

@mjwillson at the moment, we only support models with "global" latent variables; e.g. we marginalize out the "local" latent variables in a mixture model.

what you describe is indeed what SVI does. it's not immediately clear, however, how to implement that in ADVI.

@mjwillson

@akucukelbir that makes sense, thanks for the response :) I can imagine it would be fiddly to get that working in a general setting.

@akucukelbir

@mjwillson indeed. nevertheless, it's a problem well worth tackling. let me know if at any point you become interested in working on it.

@datnamer
Author

@fonnesbeck
Member

That's really nice. I wonder how it compares to Theano performance-wise? It would be great to be able to someday free PyMC from the constraints that Theano currently places on it.

@datnamer
Author

Are those feature constraints, coding constraints, or both? My dream is to have AD for a combination of Numba and DyND (the new dynamic-array library positioned as a NumPy successor), with missing data, user-defined types, etc.

Numba is already working well with Dynd, and more support is planned iirc

I think if we give it a big push in pydata, this will be big for the ecosystem.

I opened an issue here: HIPS/autograd#51

Any help to get this some momentum would be awesome.

@fonnesbeck
Member

Sort of both, but I was thinking primarily of semantic constraints, such as the inability to write loops. Theano is an additional layer that PyMC3 users have to deal with in order to build models.

@datnamer
Author

I think cgt is looking at allowing loops.

@twiecki
Member

twiecki commented Oct 15, 2015

While these are interesting propositions, I think that would be more for pymc4; we should focus on getting pymc3 out the door.

@mjwillson

@datnamer Wow I wasn't aware of autograd. That's like dark magic :)

I guess it's probably not as fast as Theano (doesn't seem to do any symbolic graph-rewriting optimisations, won't compile new kernels for you, ...) but maybe it's worth it for the simplicity, and it does seem to have support for gpuarray.

I wasn't aware of CGT (http://rll.berkeley.edu/cgt/) either, this looks very neat as a better Theano. Anyone using it / have a feel on how mature it is?

(Sorry getting slightly off-topic)

@datnamer
Author

@twiecki if cgt is supposed to be almost a drop-in theano replacement... maybe it could go in pymc3?

@twiecki
Member

twiecki commented Oct 16, 2015

The problem I see with autograd is that it will be very slow as it's just using numpy. I hope they explore numba to speed things up.

The problem I see with cgt is that it's still a young project. Last time I checked, it still lacked features, so it would not be a drop-in replacement, and I expect the cost of changing the backend to be quite high (even if it doesn't look too bad at first sight). And it doesn't really solve any problems -- users would just have to learn cgt instead of theano.

At this point, we are really close to having something quite powerful and usable built on Theano. Putting on the finishing touches will be a much better ROI.

@akucukelbir

a bit late to the discussion, but i think all of this is very exciting stuff.

@datnamer
Author

@twiecki that makes sense. There is also this package that compiles numpy to theano: https://github.com/LowinData/pyautodiff

But I don't see that it can handle loops (I'll ask).

@akucukelbir I'm happy that you are involved and following!

@syclik

syclik commented Oct 16, 2015

+1 to @akucukelbir

@datnamer, it looks like pyautodiff is a misnomer. One of the reasons it's going to have trouble with loops is that theano does symbolic differentiation, not automatic differentiation. (With enough restrictions, loops are fine, but in general they're going to be difficult to differentiate symbolically.)

@datnamer
Author

Makes sense. I wonder if there is a way to compile loops to theano's scan.
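For anyone following along, this is roughly what the explicit scan version of a loop looks like (a toy cumulative product, just for illustration):

```python
import numpy as np
import theano
import theano.tensor as T

x = T.vector("x")

def step(x_t, acc):
    # One loop iteration: multiply the running product by the current element.
    return acc * x_t

outputs, _ = theano.scan(fn=step, sequences=x, outputs_info=T.ones(()))
total = outputs[-1]                    # product of all elements
g = T.grad(total, x)                   # Theano differentiates through the scan

f = theano.function([x], [total, g])
print(f(np.array([1.0, 2.0, 3.0], dtype=theano.config.floatX)))
```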

@syclik seems like a decent drawback. Are there any benefits to symbolic diff vs auto diff?

@mjwillson

@datnamer From what I can tell the benefit of autodiff is that the expression graph is constructed on-the-fly, meaning you can use loops and control flow without having to express them symbolically.

That could make data-dependent control flow a lot more natural, but it could still have pitfalls if you make control-flow decisions based on parameters (it can't magically backpropagate the error past non-differentiable control-flow operations -- and in practice it would only know about the one code path that the forward evaluation went down).

Autodiff is probably more limited in terms of graph optimisations too, since you don't have the expression graph upfront, although some clever just-in-time stuff might be possible.
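A tiny autograd illustration of that point (my own toy example): because the graph is rebuilt on every call, only the branch actually taken needs to be differentiated.

```python
import autograd.numpy as np
from autograd import grad

def f(x):
    if x > 0:                    # data-dependent control flow, written as ordinary Python
        y = np.log(1.0 + x)
    else:
        y = x ** 2
    return np.sin(y)

df = grad(f)
print(df(2.0), df(-2.0))         # each call differentiates the code path it actually took
```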

@syclik

syclik commented Oct 19, 2015

@mjwillson, that's exactly right. Since the expression graph is constructed for each evaluation, autodiffing an algorithm that has different branching behavior from run to run is possible.

Regarding differentiating past non-differentiable operations, that just doesn't work (for math... the rest follows).

Regarding graph optimizations: that's correct. With symbolic differentiation operating on a static expression graph, you can do some really neat optimization. This limits what you can express in symbolic differentiation, but I'd buy the argument that maybe you can restructure what you need into a static expression. With automatic differentiation, you're not guaranteed that the expression graph that's generated for a particular execution is static from run to run. Of course, in most circumstances, it is, so someone really clever could do something just-in-time. If you wanted to restrict the expressiveness of autodiff to guarantee a static expression graph, then you should just use symbolic differentiation.

@twiecki
Member

twiecki commented May 18, 2016

This is implemented now.

@twiecki twiecki closed this as completed May 18, 2016