
Variational inference with AD? #708

Closed · datnamer opened this issue May 13, 2015 · 64 comments

@datnamer

Can the theano infrastructure handle this?

stan-dev/stan#1421
http://andrewgelman.com/2015/02/18/vb-stan-black-box-black-box-variational-bayes/

@jsalvatier
Member

We don't currently have this implemented, but I suspect it would fit well into pymc3.


@twiecki
Member

twiecki commented May 14, 2015

Neat! This would be fantastic and I think theano has some features that would make this work without too much hassle.

@datnamer
Author

It would be :) Just to be clear though, this was more of an "I'll just leave this right here" kind of deal... I would have no idea where to even start.

@twiecki
Member

twiecki commented May 14, 2015

Understood :). Looking at the paper it actually doesn't look that bad.

@fonnesbeck
Member

I've been looking over that paper with a student of mine. Looks promising.

@fonnesbeck
Member

Relatedly, there is this. Nice to see a few worthy alternatives to MCMC showing up.

@datnamer
Author

Wow. Flexible inference on larger datasets would be a killer app, and it would also help remedy Python's dearth of statistical modeling packages.

@twiecki
Member

twiecki commented May 15, 2015

They are using RMSProp to maximize. There are two Python implementations:
One that works with Theano, from the theanets library: http://theanets.readthedocs.org/en/latest/generated/theanets.trainer.RmsProp.html
One from climin: http://climin.readthedocs.org/en/latest/rmsprop.html
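For reference, a hand-rolled RMSProp update in Theano looks roughly like this (a toy quadratic objective, purely for illustration; not pymc3 code):

```python
import numpy as np
import theano
import theano.tensor as T

# Toy objective: a quadratic in a shared parameter vector.
w = theano.shared(np.zeros(5), name="w")
loss = T.sum((w - 1.0) ** 2)
grad = T.grad(loss, w)

# RMSProp: keep a decaying average of squared gradients and rescale the step by it.
lr, rho, eps = 0.01, 0.9, 1e-6
acc = theano.shared(np.zeros(5), name="acc")
new_acc = rho * acc + (1 - rho) * grad ** 2
updates = [(acc, new_acc),
           (w, w - lr * grad / T.sqrt(new_acc + eps))]

step = theano.function([], loss, updates=updates)
for _ in range(200):
    step()
```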

@twiecki
Member

twiecki commented May 15, 2015

[image: the gradient estimator from the paper] is what we need to figure out how to compute.

@syclik

syclik commented May 27, 2015

It's the stochastic estimate of the gradient. If you're using theano, this should be cheaper than in Stan because it's already done symbolically, I think.
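As a rough sketch of that symbolic route (a single-sample reparameterized estimate for a mean-field Gaussian approximation; `logp` below is a stand-in toy model, not the paper's exact estimator):

```python
import numpy as np
import theano
import theano.tensor as T
from theano.tensor.shared_randomstreams import RandomStreams

D = 2                                            # number of latent dimensions (toy value)
srng = RandomStreams(seed=123)

mu = theano.shared(np.zeros(D), name="mu")       # variational means
omega = theano.shared(np.zeros(D), name="omega") # variational log-stddevs

def logp(theta):
    # Stand-in model: standard normal log-density (up to a constant).
    return -0.5 * T.sum(theta ** 2)

eta = srng.normal((D,))                          # eta ~ N(0, I)
theta = mu + T.exp(omega) * eta                  # reparameterized draw from q
elbo = logp(theta) + T.sum(omega)                # single-sample E_q[log p] + Gaussian entropy (up to const.)

g_mu, g_omega = T.grad(elbo, [mu, omega])        # the stochastic gradient, obtained symbolically
step = theano.function([], [g_mu, g_omega])      # each call draws a fresh eta
```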

@akucukelbir, thoughts?

@akucukelbir

@syclik thanks for alerting me to this thread. Very exciting.

The notation is really poor in that workshop paper; I threw it together too quickly. Expect a much more readable (and thorough) arxiv preprint soon.

I'm not too familiar with pymc or theano, but the primary reason our algorithm works so well in Stan is the automatic transformations. Without them, you'd have to specify the variational approximation by hand.

Do we have transformations in pymc? If so, given that we've worked out a lot of the kinks in Stan, it shouldn't be hard to implement in pymc too.

@twiecki we only use rmsprop in the context of "forgetting" about past gradients. I think it would be more accurate to say we use a windowed version of AdaGrad with stochastic gradient ascent.

@datnamer the arxiv preprint will present results of using a data subsampling strategy with this algorithm. The speed improvements are dramatic.
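To make the transformation point concrete, here is a toy numpy sketch (my own illustration, not Stan or pymc code): a positive-constrained parameter is mapped to the real line, and the log-density picks up a log-Jacobian term, so a Gaussian approximation on the unconstrained variable is legitimate.

```python
import numpy as np

# Toy example: sigma > 0 with an Exponential(1) prior; transform to zeta = log(sigma).
def logp_sigma(sigma):
    return -sigma                      # log p(sigma) for sigma > 0

def logp_zeta(zeta):
    sigma = np.exp(zeta)               # inverse transform back to the constrained space
    log_jacobian = zeta                # log |d sigma / d zeta| = log(exp(zeta)) = zeta
    return logp_sigma(sigma) + log_jacobian

# A Gaussian variational approximation can now be placed directly on zeta,
# since zeta ranges over the whole real line.
```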

@twiecki
Member

twiecki commented May 27, 2015

@akucukelbir @syclik Thanks for chiming in here!

Regarding the question of transformations, they do exist, but they are not automatic (although automating them would probably not be too hard). Here is an example:
https://github.com/pymc-devs/pymc3/blob/master/pymc3/examples/stochastic_volatility.py#L56

Your help in getting this implemented in pymc3 would be much appreciated!

@akucukelbir

Great! Could we chat about this after June 5? I'd love to talk in more detail. (I'll put it in my calendar, but just to be safe could one of you also ping me?)

@twiecki
Member

twiecki commented May 27, 2015

Sounds good!

@jsalvatier
Member

I'd love to chat with you guys as well.


@twiecki
Member

twiecki commented Jun 9, 2015

@akucukelbir Want to try and find a time to chat about that?

@akucukelbir

absolutely. the arXiv paper should be up in a day or two. i'll let you guys know on this thread, and we can schedule a time. (having the paper in front of us would help guide the discussion, i think.) sound good?

@twiecki
Member

twiecki commented Jun 9, 2015

👍

@akucukelbir

hi folks. ok so i'm having some problems with arXiv. (should be resolved soon.) in the meantime, i'll just host the preprint on my website. here it is:

http://www.proditus.com/papers/KRGB_preprint.pdf

are we all in similar time zones? (i'm in new york.) if so, we can maybe start proposing some time-slots to chat?

looking forward!

@twiecki
Member

twiecki commented Jun 10, 2015

Sounds great. I'm in CET, @jsalvatier is in PT I think. So something like 11am EST might work well for everyone.

@fonnesbeck
Member

I'm in Central, so anything works for me.

@akucukelbir

if we're shooting for 11am eastern, then the earliest i could do is friday (jun 12).

if later in day is a possibility, then 3pm eastern today (jun 10) is also an option.

@twiecki
Member

twiecki commented Jun 10, 2015

Friday would be best for me.

@jsalvatier
Member

11am EST works fine for me. I'm jsalvatier on both skype and gmail.


@syclik

syclik commented Jun 10, 2015

I can make it too.

@fonnesbeck
Member

Why don't we use appear.in? It works in the browser without any account requirements or software. Let's say:

https://appear.in/pymc3

Sound good?

@akucukelbir

perfect. friday at 11am eastern on appear.in. looking forward!

@akucukelbir

wow. i didn't realize you guys had HMC. i'll be sure to cite the 2010 paper for the camera-ready version of our paper. that's a big omission on my part. apologies.

talk to you in 2 hours!

@fonnesbeck
Member

Actually, the 2010 paper is pre-HMC. We are working on something now, but it is not yet published or in press.

@jsalvatier
Member

@aflaxman Thanks Abie, do you know if there's a good way to figure out how to use the internal optimizer?

@twiecki
Member

twiecki commented Jun 19, 2015

scikit-learn only has SGD, but the paper is using AdaGrad I think. Lasagne seems to offer all of those for Theano: https://github.com/Lasagne/Lasagne/blob/master/lasagne/updates.py
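Wiring one of those update rules into a Theano function is just a few lines (toy objective, purely illustrative):

```python
import numpy as np
import theano
import theano.tensor as T
import lasagne

# Toy objective: minimize a quadratic in a shared parameter vector.
w = theano.shared(np.zeros(3, dtype=theano.config.floatX), name="w")
loss = T.sum((w - 1.0) ** 2)

# lasagne.updates.adagrad returns an update dictionary we can hand to theano.function.
updates = lasagne.updates.adagrad(loss, [w], learning_rate=0.1)
step = theano.function([], loss, updates=updates)

for _ in range(200):
    step()
```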

@jsalvatier
Member

Thanks Thomas, exactly what I was looking for.


@aflaxman
Contributor

Looks like you found something even more suited than sklearn. If it seems like the sklearn route would still be helpful, let me know more what you are after, and I can take a look.

@jsalvatier
Member

Basically, I'm looking for any stochastic optimizers with a general interface, similar to how scipy.optimize has a general interface.
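Something with roughly this shape, hypothetical names only, just to illustrate the kind of interface I mean:

```python
import numpy as np

def minimize_stochastic(grad, x0, batches, step_size=0.01, rho=0.9, eps=1e-6):
    """Hypothetical generic interface: grad(x, batch) -> gradient, batches -> iterable of minibatches."""
    x = np.array(x0, dtype=float)
    acc = np.zeros_like(x)                         # RMSProp-style running average of squared gradients
    for batch in batches:
        g = grad(x, batch)
        acc = rho * acc + (1 - rho) * g ** 2
        x = x - step_size * g / np.sqrt(acc + eps)
    return x

# usage sketch: fit a mean under squared loss, one random minibatch at a time
data = np.random.randn(1000, 3) + 2.0
batches = (data[np.random.choice(len(data), 32)] for _ in range(500))
x_hat = minimize_stochastic(lambda x, b: 2 * (x - b.mean(axis=0)), np.zeros(3), batches)
```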


@jsalvatier
Member

@datnamer were you ever able to dig up the dataset?

@mjwillson

Hi all

First thing that occurred to me on reading this paper was "this would be great to add into PyMC". Glad to see it's already being worked on!

@jsalvatier Re AdaGrad implementations, there are a fair few theano-based "deep learning" packages around which should implement the common variants on SGD. The one I'm using is Blocks, which has implemented AdaGrad, AdaDelta etc here in a modular way which can be compiled as updates to theano shared variables:

https://blocks.readthedocs.org/en/latest/api/algorithms.html
https://github.com/mila-udem/blocks/blob/master/blocks/algorithms/__init__.py#L722

Not sure quite how coupled those are to the rest of the library but might be a useful starting point, and aren't actually that much code anyway. You probably do want to implement this in a way that uses theano compiled functions to do the updates though, rather than a completely generic numpy-based optimiser routine. With theano the updates can run entirely on the GPU without having to transfer parameters between GPU and CPU (something that can kill the performance advantage).

If there's any way this could support streaming training batches (e.g. from a python iterator) rather than requiring the whole dataset to be loaded into memory, that would also be great. @datnamer numpy mmapped arrays might be one way around this but might restrict the way you use the library a bit.

I imagine most people working with seriously large datasets are happy to roll a bit of their own code if necessary to deliver minibatches to an algorithm here, or even to implement their own training loop, provided you expose the right APIs. To my mind that might even be preferable to an "all or nothing" black-box approach where you don't have any control over how the dataset is iterated over.
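Roughly the pattern I have in mind (a toy sketch, not a proposal for any particular API): stream batches from a Python iterator into a shared variable, so the compiled update step stays on the GPU and only the batch data gets transferred.

```python
import numpy as np
import theano
import theano.tensor as T

# Shared variables: the current minibatch and the parameters being optimized.
minibatch = theano.shared(np.zeros((100, 10), dtype=theano.config.floatX), name="minibatch")
w = theano.shared(np.zeros(10, dtype=theano.config.floatX), name="w")

# Toy per-batch objective for illustration; a real model would use its own log-density.
loss = T.sum((T.dot(minibatch, w) - 1.0) ** 2) / minibatch.shape[0]
grad = T.grad(loss, w)
step = theano.function([], loss, updates=[(w, w - 0.01 * grad)])

def batches(stream, size=100):
    """Yield full minibatches from any iterable of rows, without loading everything into memory."""
    buf = []
    for row in stream:
        buf.append(row)
        if len(buf) == size:
            yield np.asarray(buf, dtype=theano.config.floatX)
            buf = []

for batch in batches(iter(np.random.randn(1000, 10))):
    minibatch.set_value(batch)   # only the batch moves host -> device; the update stays compiled
    step()
```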

@akucukelbir I have a question actually about the data subsampling strategy mentioned in the paper. It describes the updates as being O(BMK) per minibatch, where K is the number of parameters, implying you update all the parameters for every batch.

How do you handle the case where you have IID data with latent variable(s) for each datapoint -- do you only update the variational parameters for the datapoints in the current batch, or do you need to maintain parameters for every point in the dataset and update them all on every batch? The latter seems impractical in a big data situation.

I can think of ways you might be able to implement the former approach with ADVI, but it might require analysing the model's dependency structure a bit to find which latent variables it can treat this way, and then doing a few inner iterations of inference for the per-batch variational parameters before updating the shared variational parameters. IIRC Hoffman's online inference for LDA works this way, for example.

@akucukelbir

@mjwillson at the moment, we only support models with "global" latent variables; e.g. we marginalize out the "local" latent variables in a mixture model.

what you describe is indeed what SVI does. it's not immediately clear, however, how to implement that in ADVI.

@mjwillson

@akucukelbir that makes sense, thanks for the response :) I can imagine it would be fiddly to get that working in a general setting.

@akucukelbir

@mjwillson indeed. nevertheless, it's a problem well worth tackling. let me know if at any point you become interested in working on it.

@datnamer
Author

@fonnesbeck
Member

That's really nice. I wonder how it compares to Theano performance-wise? It would be great to be able to someday free PyMC from the constraints that Theano currently places on it.

@datnamer
Author

Are those feature constraints, coding constraints, or both? My dream is to have AD for a combination of Numba and DyND (the new dynamic-array library positioned as a NumPy successor), with missing data, user-defined types, etc.

Numba is already working well with Dynd, and more support is planned iirc

I think if we give it a big push in pydata, this will be big for the ecosystem.

I opened an issue here: HIPS/autograd#51

Any help to get this some momentum would be awesome.

@fonnesbeck
Member

Sort of both, but I was thinking primarily of semantic constraints, such as the inability to write loops. Theano is an additional layer that PyMC3 users have to deal with in order to build models.

@datnamer
Author

I think cgt is looking at allowing loops.

@twiecki
Member

twiecki commented Oct 15, 2015

While these are interesting propositions, I think that would be more for pymc4; we should focus on getting pymc3 out the door.

@mjwillson

@datnamer Wow I wasn't aware of autograd. That's like dark magic :)

I guess it's probably not as fast as Theano (doesn't seem to do any symbolic graph-rewriting optimisations, won't compile new kernels for you, ...) but maybe it's worth it for the simplicity, and it does seem to have support for gpuarray.

I wasn't aware of CGT (http://rll.berkeley.edu/cgt/) either, this looks very neat as a better Theano. Anyone using it / have a feel on how mature it is?

(Sorry getting slightly off-topic)

@datnamer
Author

@twiecki if cgt is supposed to be almost a drop-in theano replacement... maybe it could go in pymc3?

@twiecki
Member

twiecki commented Oct 16, 2015

The problem I see with autograd is that it will be very slow as it's just using numpy. I hope they explore numba to speed things up.

The problem I see with cgt is that it's still a young project. Last time I checked, it still lacked features, so it would not be a drop-in replacement, and I expect the cost of changing the backend to be quite high (even if it doesn't look too bad at first sight). And it doesn't really solve any problems -- users would just have to learn cgt instead of theano.

At this point, we are really close to having something quite powerful and usable built on Theano. Putting on the finishing touches will be a much better ROI.

@akucukelbir

a bit late to the discussion, but i think all of this is very exciting stuff.

@datnamer
Author

@twiecki that makes sense. There is also this package that compiles numpy to theano: https://github.com/LowinData/pyautodiff

But I don't see that it can handle loops (I'll ask).

@akucukelbir I'm happy that you are involved and following!

@syclik

syclik commented Oct 16, 2015

+1 to @akucukelbir

@datnamer, it looks like pyautodiff is a misnomer. One of the reasons it's going to have trouble with loops is that theano does symbolic differentiation, not automatic differentiation. (With enough restrictions, loops are fine, but in general they're going to be difficult to differentiate symbolically.)

@datnamer
Author

Makes sense. I wonder if there is a way to compile loops to theano's scan.
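For anyone following along, this is roughly what the explicit scan version of a loop looks like (a toy cumulative product, just for illustration):

```python
import numpy as np
import theano
import theano.tensor as T

x = T.vector("x")

def step(x_t, acc):
    # One loop iteration: multiply the running product by the current element.
    return acc * x_t

outputs, _ = theano.scan(fn=step, sequences=x, outputs_info=T.ones(()))
total = outputs[-1]                    # product of all elements
g = T.grad(total, x)                   # Theano differentiates through the scan

f = theano.function([x], [total, g])
print(f(np.array([1.0, 2.0, 3.0], dtype=theano.config.floatX)))
```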

@syclik seems like a decent drawback. Are there any benefits to symbolic diff vs auto diff?

@mjwillson

@datnamer From what I can tell the benefit of autodiff is that the expression graph is constructed on-the-fly, meaning you can use loops and control flow without having to express them symbolically.

That could make data-dependent control flow a lot more natural, but it could still have pitfalls if you make control-flow decisions based on parameters (it can't magically backpropagate the error past non-differentiable control-flow operations -- and in practice it would only know about the one code path that the forward evaluation went down).

Autodiff is probably more limited in terms of graph optimisations too, since you don't have the expression graph upfront, although some clever just-in-time stuff might be possible.
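A tiny autograd illustration of that point (my own toy example): because the graph is rebuilt on every call, only the branch actually taken needs to be differentiated.

```python
import autograd.numpy as np
from autograd import grad

def f(x):
    if x > 0:                    # data-dependent control flow, written as ordinary Python
        y = np.log(1.0 + x)
    else:
        y = x ** 2
    return np.sin(y)

df = grad(f)
print(df(2.0), df(-2.0))         # each call differentiates the code path it actually took
```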

@syclik

syclik commented Oct 19, 2015

@mjwillson, that's exactly right. Since the expression graph is constructed for each evaluation, autodiffing an algorithm that has different branching behavior from run to run is possible.

Regarding differentiating past non-differentiable operations, that just doesn't work (for math... the rest follows).

Regarding graph optimizations: that's correct. With symbolic differentiation operating on a static expression graph, you can do some really neat optimization. This limits what you can express in symbolic differentiation, but I'd buy the argument that maybe you can restructure what you need into a static expression. With automatic differentiation, you're not guaranteed that the expression graph that's generated for a particular execution is static from run to run. Of course, in most circumstances, it is, so someone really clever could do something just-in-time. If you wanted to restrict the expressiveness of autodiff to guarantee a static expression graph, then you should just use symbolic differentiation.

@twiecki
Member

twiecki commented May 18, 2016

This is implemented now.

@twiecki twiecki closed this as completed May 18, 2016