Implicit reparameterization gradients #51

Closed

dustinvtran opened this issue May 23, 2018 · 33 comments
@dustinvtran
Member

dustinvtran commented May 23, 2018

The idea is simple to implement and well-scoped as part of TF Distributions. I personally like the idea of even having it be the default for Gamma, Beta, and others. The gradient implementations may be tricky. Maybe @mfigurnov has an implementation we can build on? (if open source or google-internal)
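For concreteness, a minimal NumPy/SciPy sketch of the implicit gradient for a Gamma(alpha, 1) sample. The finite difference for dF/dalpha is purely illustrative; the paper instead differentiates the CDF computation itself:

```python
import numpy as np
from scipy import special, stats

def gamma_sample_grad(z, alpha, eps=1e-6):
    # Implicit reparameterization: with F(z; alpha) the Gamma(alpha, 1) CDF,
    # holding the uniform noise fixed gives dz/dalpha = -(dF/dalpha) / (dF/dz).
    dF_dalpha = (special.gammainc(alpha + eps, z) -
                 special.gammainc(alpha - eps, z)) / (2.0 * eps)
    dF_dz = stats.gamma.pdf(z, alpha)  # dF/dz is just the density
    return -dF_dalpha / dF_dz

z = np.random.gamma(shape=3.0)
print(gamma_sample_grad(z, 3.0))
```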

@axch
Contributor

axch commented May 24, 2018

Wow, that's a really nice trick -- I wish I had thought of it! Re: gradient implementations: The paper discusses deriving them from the CDF using forward-mode AD. That (or reverse-mode) could be a good place to start on CPU, especially if TF is willing to incorporate a C++ AD system (the Stan math library, perhaps?) as a dependency. Is CUDA close enough to C++ that the same AD system could just work?

The bad news is that any AD approach will likely have to accept some error, because differentiating a numeric approximation is worse than numerically approximating a (correct) derivative. The good news is that for basic distributions like Gamma, unit-testing the derivative code is pretty easy: just test against an expensive, high-precision numerical method (like Richardson extrapolation in 64-bit or 128-bit floats).
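To make that testing idea concrete, here is a sketch of such a reference computation in float64, with a hypothetical `richardson_derivative` helper applied to the regularized Gamma CDF from SciPy:

```python
import numpy as np
from scipy.special import gammainc

def richardson_derivative(f, x, h=1e-2, levels=4):
    # Central differences with step halving, refined by Richardson extrapolation.
    T = np.zeros((levels, levels))
    for i in range(levels):
        hi = h / 2.0**i
        T[i, 0] = (f(x + hi) - f(x - hi)) / (2.0 * hi)
        for j in range(1, i + 1):
            T[i, j] = T[i, j - 1] + (T[i, j - 1] - T[i - 1, j - 1]) / (4.0**j - 1)
    return T[levels - 1, levels - 1]

# Reference value for d/dalpha of the regularized Gamma CDF at a fixed point z:
z, alpha = 2.0, 3.0
print(richardson_derivative(lambda a: gammainc(a, z), alpha))
```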

Hm. I wonder whether it would be possible to construct a better (and still fast) approximation of the true derivative by using the expensive-but-accurate computation as a reference. For instance, can an optimal set of Chebyshev polynomials up to, say, 5th order be found empirically? One could even use the AD as a starting point, and find Chebyshev polynomials to cancel off the difference between the AD results and the accurate results.

@davmre
Contributor

davmre commented May 24, 2018

I imagine that getting a full-fledged C++ AD system into TF would be quite a heavy lift, but since this approach isn't AD'ing user code, just a small number of well-understood computations (e.g., the gamma CDF), couldn't one just do the AD manually? That is, insofar as forward-mode AD is 'just' running your computation on dual numbers x+eps instead of x, for any given algorithm you should be able to just explicitly do this, grounding out the real and infinitesimal parts in your code.
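As a toy illustration of grounding out the real and infinitesimal parts, a hypothetical `Dual` class applied to a made-up example function (not the gamma CDF):

```python
from dataclasses import dataclass
import math

@dataclass
class Dual:
    # Dual number val + eps*dot with eps^2 = 0; carries value and derivative together.
    val: float
    dot: float

    def __add__(self, other):
        return Dual(self.val + other.val, self.dot + other.dot)

    def __mul__(self, other):
        return Dual(self.val * other.val,
                    self.val * other.dot + self.dot * other.val)

def dual_exp(x):
    e = math.exp(x.val)
    return Dual(e, e * x.dot)

# d/dx [x * exp(x)] at x = 1.5: seed dot = 1.0 and read off the infinitesimal part.
x = Dual(1.5, 1.0)
y = x * dual_exp(x)
print(y.val, y.dot)  # derivative is exp(1.5) * (1 + 1.5)
```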

@axch
Contributor

axch commented May 24, 2018

Yes. The question in my mind is whether the computations in question really are few in number and well-understood; but one could certainly start down that path and see how much ground one covered before getting tired.

@srvasude reopened this May 24, 2018
@srvasude
Contributor

Right, I think we can override the gradient computations with this explicitly.

The idea (from talking with @jvdillon) is that a distribution can have an overridable gradient method for samples. That way a user can switch between different ways of overriding the gradient (and for non-reparameterizable distributions, we would error out explicitly).

Re: the Chebyshev approximation: how would you use it here? Are you saying that for univariate distributions you'd use the inverse-CDF transform (the support is [0, 1]) and then apply a Chebyshev approximation to that (rather than doing implicit differentiation with the CDF)? At some point you'd have to compute the inverse CDF of your random draw, right? That can be expensive.

@axch
Contributor

axch commented May 24, 2018

No, I was just musing about how to get better precision for the derivative of the CDF than you would get by ADing the implementation. Namely, if you have a slow but accurate way to compute some analytically obnoxious function (in this case, f(x) = d/dx CDF_foo(x)|_x), and you have a fast but somewhat inaccurate way to compute it (in this case, applying AD to CDF_foo), is it possible to empirically select coefficients to construct a Chebyshev approximation of the difference, so as to ship

def f(x):
  return AD(CDF_foo)(x) + <some polynomial>(x)

The idea would be to pick the polynomial to minimize the maximum (empirically determined) error between this (which is presumably still fast) and the slow but accurate thing. It's not clear that this idea is workable, and it certainly isn't necessary to get started, but it seemed cute.

To clarify: the slow and accurate thing I have in mind is numerically computing d/dx CDF_foo(x)|_x by Richardson extrapolation in high precision arithmetic (either 64-bit, or, if that's not good enough, busting out a suitable very high precision library).
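A rough sketch of how such a fit could look, using stand-in functions rather than the actual CDF derivative; NumPy's Chebyshev.fit is least-squares, so a true minimax fit would need something like the Remez algorithm:

```python
import numpy as np
from numpy.polynomial import Chebyshev

# Toy stand-ins (not the gamma CDF): "fast" is a truncated Taylor series for exp,
# "accurate" is numpy's exp, playing the role of the slow high-precision reference.
def fast(x):
    return 1.0 + x + x**2 / 2.0 + x**3 / 6.0

accurate = np.exp

xs = np.linspace(-1.0, 1.0, 2001)
# Least-squares Chebyshev fit of the residual up to 5th order.
correction = Chebyshev.fit(xs, accurate(xs) - fast(xs), deg=5)

def corrected(x):
    return fast(x) + correction(x)

print(np.max(np.abs(accurate(xs) - corrected(xs))))  # far smaller than the raw residual
```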

@mfigurnov

mfigurnov commented May 24, 2018

Thanks for the interest in the work! Hopefully, this will be incorporated into TFP soon :) Let me comment on some of the points. TL;DR: take a look at the appendix; it answers some of your questions.

Regarding an auto-diff system: it's overkill to bring one in just for this feature. The CDF for Gamma is computed using basic arithmetic and the Gamma function (see appendix B). So manual forward-mode differentiation is somewhat annoying, but very much doable. This is basically what we did (except for small tweaks described in appendix B).

Regarding computing the derivative directly: there is a closed-form expression in appendix C for that. However, it involves a hypergeometric function that is tricky to compute. We use it to obtain the ground-truth values.

Regarding the error of the derivative: it is close to the error of the CDF, so there doesn't seem to be any error accumulation going on.

@srvasude
Contributor

@axch: Understood, thanks! Yeah, that's a fairly interesting approach (although, as you said, it is not clear to me that it is completely workable).

@mfigurnov

mfigurnov commented May 24, 2018

Also, take a look at the numbers in table 2. For Gamma, the average derivative dz/dalpha is one (see appendix A), so you cannot really expect the error to go below 1e-7 for float32 and 1e-16 for float64. So the correction term is an interesting idea, but might be unnecessary for practical purposes.

@yorkerlin

yorkerlin commented May 28, 2018

@mfigurnov Could you comment on the difference between the method in Eq. 27 at page 18 of https://arxiv.org/pdf/1206.6679.pdf and your method?
Any comments on sparse Gamma distributions when \alpha<1e-6? (see https://arxiv.org/pdf/1509.01631.pdf)

@yorkerlin

yorkerlin commented May 28, 2018

@mfigurnov Any comments on the Generalized inverse Gaussian (GIG) distribution? It seems it is non-trivial to compute the CDF for GIG.

@mfigurnov

@yorkerlin

Could you comment on the difference between the method in Eq. 27 at page 18 of https://arxiv.org/pdf/1206.6679.pdf and your method?

Thanks for the interest in our work. The difference compared to Salimans&Knowles is that we (i) propose using automatic differentiation to compute the derivative of the CDF and experimentally evaluate the effectiveness of this approach; (ii) present a more general form of this implicit gradient (using standardization functions instead of just the CDF, and also provide an expression for the multivariate case); (iii) highlight the connection between the implicit derivative and the standard reparameterization trick. We also recently became aware that a related approach was developed by Hoffman&Blei "Structured Stochastic Variational Inference". Additionally, we found a 1988 paper by Suri&Zazanis on perturbation analysis https://www.jstor.org/stable/2632185 that proposes both the explicit reparameterization and the eqn. (27) of Salimans&Knowles in Section 4, so the idea has been known for a lot longer than we thought :) We are currently updating the related work section to include and discuss these works.

Any comments on sparse Gamma distributions when \alpha<1e-6? (see https://arxiv.org/pdf/1509.01631.pdf)

The error of our method is actually better for extremely low \alpha than for larger ones. So we do not use any additional approximations.

Any comments on the Generalized inverse Gaussian (GIG) distribution? It seems it is non-trivial to compute the CDF for GIG.

There is a method for approximation of the CDF of this distribution: http://epub.wu.ac.at/1548/ . Since it uses a polynomial approximation, it should be straightforward to differentiate. I am not sure how practical this is, though, since the method seems to have a slow set-up. (Non-generalized) inverse Gaussian Distribution has tractable CDF and tractable derivative of the CDF, so you could definitely use implicit gradients for that distribution.

@yorkerlin

yorkerlin commented May 29, 2018

@mfigurnov
Thanks for your message. Nice work :)

About the multivariate case
It seems that it is non-trivial to compute the gradient w.r.t. the covariance matrix for a finite mixture of multivariate Gaussians with full correlation if the multivariate CDF transform is used. It is relatively easy to compute the gradient w.r.t. mixing weights in this case. Do you have any experimental result on this?

@fritzo

fritzo commented Jun 7, 2018

In PyTorch we followed an approach similar to what @axch suggests by constructing a minimax numerical approximation to the gradient directly (rather than combining multiple approximations via AD). Here's our write-up https://arxiv.org/abs/1806.01851 and derivations for Gamma and Beta/Dirichlet, in case it helps. IIRC our Gamma doesn't work well for alpha<1e-3 due to loss of precision in the sampler. cc @martinjankowiak

@yorkerlin

yorkerlin commented Jun 7, 2018

@fritzo nice work!
For the multivariate Gaussian, it would be great if you could compare your method against the following one.
Let q(x) = N(x | \mu, V), where C is the (lower-triangular) Cholesky factor of the covariance matrix V (C*C^T = V).
The gradient w.r.t. C can be computed as
dC = tril(dV*(2*C)), where dV = \nabla_{V} E_{q(x)}[f(x)] can be computed using Eq. 19 of Opper & Archambeau: http://www0.cs.ucl.ac.uk/staff/c.archambeau/publ/neco_mo09_web.pdf
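A tiny NumPy sketch of that conversion, assuming dV is the symmetric gradient w.r.t. the covariance V (e.g. obtained from the Opper & Archambeau identity):

```python
import numpy as np

def chol_grad(dV, C):
    # With V = C @ C.T and C lower triangular, the chain rule gives
    # dL/dC = tril((dV + dV.T) @ C), i.e. tril(2 * dV @ C) when dV is symmetric.
    return np.tril(2.0 * dV @ C)

# Example: dV from some objective, C the Cholesky factor of V.
C = np.linalg.cholesky(np.array([[2.0, 0.5], [0.5, 1.0]]))
dV = np.array([[0.3, 0.1], [0.1, 0.2]])   # symmetric gradient w.r.t. V
print(chol_grad(dV, C))
```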

For Gaussian models, we use the method above in our paper (see Figure 2 of https://arxiv.org/pdf/1511.00146.pdf).
For Gamma models, we use the method proposed by Knowles in another paper (see Figure 1 of https://arxiv.org/pdf/1703.04265.pdf). I think your proposed method could improve its performance.

@yorkerlin

yorkerlin commented Jun 7, 2018

@fritzo The method I mentioned uses second-order information about the target function f(x), so it may not be fair to compare it against your method. Note: the REINFORCE method uses zero-order information about f(x), while the reparametrization method uses first-order information.

@mfigurnov

@yorkerlin We don't have results for mixtures of full-covariance multivariate Normals, unfortunately. One way to obtain gradients w.r.t. the covariance matrices could be ancestral sampling: you sample the point from one of the mixture components and backpropagate to just that component's mean/covariance. I am fairly sure that this estimator is unbiased, but it obviously has higher variance than if you properly accounted for the posterior mixture probabilities.
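A minimal TFP sketch of that ancestral-sampling estimator, with a hypothetical 3-component, 2-D mixture (names and shapes are made up for illustration):

```python
import tensorflow as tf
import tensorflow_probability as tfp
tfd = tfp.distributions

locs = tf.Variable(tf.random.normal([3, 2]))
scale_trils = tf.Variable(tf.eye(2, batch_shape=[3]))
logits = tf.constant([0.0, 0.0, 0.0])

with tf.GradientTape() as tape:
  k = tfd.Categorical(logits=logits).sample()          # pick a component; no gradient to the weights
  comp = tfd.MultivariateNormalTriL(loc=tf.gather(locs, k),
                                    scale_tril=tf.gather(scale_trils, k))
  x = comp.sample()                                    # standard reparameterization within the component
  loss = tf.reduce_sum(tf.square(x))                   # stand-in for f(x)

grads = tape.gradient(loss, [locs, scale_trils])       # nonzero only for the sampled component
```

Note that, as in the comment above, this only gives gradients for the component means/covariances; the mixing weights receive no gradient this way.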

@martinjankowiak

@yorkerlin No, we didn't compare to that (second-order) approach, for one thing because computing Hessians in high dimensions can be prohibitively expensive. Of course, you can approximate the Hessian, in which case a natural thing to do is to follow the approach taken in https://arxiv.org/abs/1705.07880. cc @fritzo

@yorkerlin

yorkerlin commented Jun 16, 2018

@martinjankowiak
For high-dimensional problems, a diagonal Gaussian distribution could be used. It is relatively easy to compute the diagonal elements of the Hessian matrix.

@mfigurnov

Gamma, Beta, Dirichlet, Student's t and inverse Gamma distributions are now fully reparameterized, see the following commits: TensorFlow (tensorflow/tensorflow@b894f68), TFP (633e4f2)

The numerical method for the derivative of Gamma samples is incorporated into Eigen: https://bitbucket.org/eigen/eigen/pull-requests/403/derivative-of-the-incomplete-gamma/

Thanks to the TF team for their help! :)

@dustinvtran
Member Author

Thanks @mfigurnov! Fantastic work. Closing this issue.

@jvdillon
Contributor

What about truncated_normal?

@jvdillon reopened this Jul 23, 2018
@dustinvtran
Member Author

There are many distributions that could be interpreted as part of this broad issue. It's probably worth having the other extensions as separate issues to measure progress. Thoughts?

@srvasude
Contributor

I agree with Dustin on this. I read this bug as covering the trick for some fairly common distributions (Gamma, StudentT, Beta, etc., i.e. what the paper addresses). Closing this bug; we will open new bugs on a case-by-case basis.

@yorkerlin

yorkerlin commented Jun 17, 2019

You may look at our poster for the ICML workshop on Stein's method, where we weaken the smoothness assumption of implicit reparameterization gradients. In the poster we focus only on the exponential family; however, the idea can be readily extended to general continuous univariate distributions. For the multivariate case, you can use Theorem 4 in our poster.

TL;DR: The target function can be locally Lipschitz instead of continuously differentiable. For example, ReLU is not continuously differentiable.

@yorkerlin

yorkerlin commented Jun 17, 2019

@jvdillon For truncated_normal, if the truncation interval is fixed, you can still use the trick, but you have to ensure that the boundary condition holds (see Lemma 3 in our poster).

@brianwa84
Contributor

brianwa84 commented Aug 12, 2020 via email

@Mossi8

Mossi8 commented Aug 13, 2020

Move the sample inside the tape?
Brian Patton | Software Engineer | bjp@google.com

On Wed, Aug 12, 2020 at 6:20 AM Mossi8 wrote: Hi, backpropagation is not working for the gamma distribution. Have there been any changes?

  concentration = tf.constant(3.0)
  rate = tf.constant(2.0)
  dist = tfp.distributions.Gamma(concentration, rate)
  samples = dist.sample(5)  # Shape [5]
  with tf.GradientTape() as tape:
    loss = tf.reduce_mean(tf.square(samples))  # Arbitrary loss function
  # Unbiased stochastic gradients of the loss function
  grads = tape.gradient(loss, [concentration, rate])
  print(grads)  # [None, None]

Thanks!!!

Hi, it still gives None.

This comes from a model I built which samples from a gamma distribution:
tfp.distributions.Gamma(concentration=mean, rate=logvar, name='Gamma').sample()

But I get this error:

LookupError: No gradient defined for operation 'gradients/Gamma_1/sample/IdentityN_grad/RandomGammaGrad' (op type: RandomGammaGrad)

However, if I sample from a Normal it works just fine.

Thank you!!!

@brianwa84
Contributor

brianwa84 commented Aug 13, 2020 via email

@jvdillon
Contributor

jvdillon commented Aug 17, 2020 via email

@brianwa84
Contributor

brianwa84 commented Aug 17, 2020 via email

@Mossi8

Mossi8 commented Aug 17, 2020

https://colab.research.google.com/gist/brianwa84/46e9c880245d2fe1905cc782a68be79c/gamma.ipynb

Hi, it is solved now. You were right: I had a persistent=True somewhere, and it was taking second-order derivatives through the gamma reparameterization. And I was not familiar with the LookupError.

Thank you so much!
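For future readers, a minimal sketch of the pattern Brian suggested (sampling inside the tape, with the parameter tensors watched), assuming TF2 eager execution:

```python
import tensorflow as tf
import tensorflow_probability as tfp

concentration = tf.constant(3.0)
rate = tf.constant(2.0)

with tf.GradientTape() as tape:
  # Constants must be watched explicitly; tf.Variables would be tracked automatically.
  tape.watch([concentration, rate])
  # Sampling *inside* the tape lets the implicit reparameterization gradient
  # flow from the samples back to concentration and rate.
  samples = tfp.distributions.Gamma(concentration, rate).sample(5)
  loss = tf.reduce_mean(tf.square(samples))

print(tape.gradient(loss, [concentration, rate]))  # two finite tensors, not [None, None]
```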

@danleonte

danleonte commented Feb 17, 2021

Hi

Is there a way to use the current implementation of the reparametrization trick in

tfp.distributions.Gamma(concentration=mean,rate=logvar,name='Gamma').sample()

for a truncated gamma distribution? Would throwing away samples that are too big work (even if it is wasteful)?

@brianwa84
Contributor

brianwa84 commented Feb 17, 2021 via email
