Implicit reparameterization gradients #51

Closed

dustinvtran opened this issue May 23, 2018 · 33 comments
@dustinvtran
Member

dustinvtran commented May 23, 2018

The idea is simple to implement and well-scoped as part of TF Distributions. I personally like the idea of even having it be the default for Gamma, Beta, and others. The gradient implementations may be tricky. Maybe @mfigurnov has an implementation we can build on? (if open source or google-internal)
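For concreteness, a minimal NumPy/SciPy sketch of the implicit gradient for a Gamma(alpha, 1) sample. The finite difference for dF/dalpha is purely illustrative; the paper instead differentiates the CDF computation itself:

```python
import numpy as np
from scipy import special, stats

def gamma_sample_grad(z, alpha, eps=1e-6):
    # Implicit reparameterization: with F(z; alpha) the Gamma(alpha, 1) CDF,
    # holding the uniform noise fixed gives dz/dalpha = -(dF/dalpha) / (dF/dz).
    dF_dalpha = (special.gammainc(alpha + eps, z) -
                 special.gammainc(alpha - eps, z)) / (2.0 * eps)
    dF_dz = stats.gamma.pdf(z, alpha)  # dF/dz is just the density
    return -dF_dalpha / dF_dz

z = np.random.gamma(shape=3.0)
print(gamma_sample_grad(z, 3.0))
```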

@axch
Contributor

axch commented May 24, 2018

Wow, that's a really nice trick -- I wish I had thought of it! Re: gradient implementations: The paper discusses deriving them from the CDF using forward-mode AD. That (or reverse-mode) could be a good place to start on CPU, especially if TF is willing to incorporate a C++ AD system (the Stan math library, perhaps?) as a dependency. Is CUDA close enough to C++ that the same AD system could just work?

The bad news is that any AD approach will likely have to accept some error, because differentiating a numeric approximation is worse than numerically approximating a (correct) derivative. The good news is that for basic distributions like Gamma, unit-testing the derivative code is pretty easy: just test against an expensive, high-precision numerical method (like Richardson extrapolation in 64-bit or 128-bit floats).
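To make that testing idea concrete, here is a sketch of such a reference computation in float64, with a hypothetical `richardson_derivative` helper applied to the regularized Gamma CDF from SciPy:

```python
import numpy as np
from scipy.special import gammainc

def richardson_derivative(f, x, h=1e-2, levels=4):
    # Central differences with step halving, refined by Richardson extrapolation.
    T = np.zeros((levels, levels))
    for i in range(levels):
        hi = h / 2.0**i
        T[i, 0] = (f(x + hi) - f(x - hi)) / (2.0 * hi)
        for j in range(1, i + 1):
            T[i, j] = T[i, j - 1] + (T[i, j - 1] - T[i - 1, j - 1]) / (4.0**j - 1)
    return T[levels - 1, levels - 1]

# Reference value for d/dalpha of the regularized Gamma CDF at a fixed point z:
z, alpha = 2.0, 3.0
print(richardson_derivative(lambda a: gammainc(a, z), alpha))
```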

Hm. I wonder whether it would be possible to construct a better (and still fast) approximation of the true derivative by using the expensive-but-accurate computation as a reference. For instance, can an optimal set of Chebyshev polynomials up to, say, 5th order be found empirically? One could even use the AD as a starting point, and find Chebyshev polynomials to cancel off the difference between the AD results and the accurate results.

@davmre
Contributor

davmre commented May 24, 2018

I imagine that getting a full-fledged C++ AD system into TF would be quite a heavy lift, but since this approach isn't AD'ing user code, just a small number of well-understood computations (e.g., the gamma CDF), couldn't one just do the AD manually? That is, insofar as forward-mode AD is 'just' running your computation on dual numbers x+eps instead of x, for any given algorithm you should be able to just explicitly do this, grounding out the real and infinitesimal parts in your code.
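As a toy illustration of grounding out the real and infinitesimal parts, a hypothetical `Dual` class applied to a made-up example function (not the gamma CDF):

```python
from dataclasses import dataclass
import math

@dataclass
class Dual:
    # Dual number val + eps*dot with eps^2 = 0; carries value and derivative together.
    val: float
    dot: float

    def __add__(self, other):
        return Dual(self.val + other.val, self.dot + other.dot)

    def __mul__(self, other):
        return Dual(self.val * other.val,
                    self.val * other.dot + self.dot * other.val)

def dual_exp(x):
    e = math.exp(x.val)
    return Dual(e, e * x.dot)

# d/dx [x * exp(x)] at x = 1.5: seed dot = 1.0 and read off the infinitesimal part.
x = Dual(1.5, 1.0)
y = x * dual_exp(x)
print(y.val, y.dot)  # derivative is exp(1.5) * (1 + 1.5)
```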

@axch
Contributor

axch commented May 24, 2018

Yes. The question in my mind is whether the computations in question really are few in number and well-understood; but one could certainly start down that path and see how much ground one covered before getting tired.

@srvasude reopened this May 24, 2018
@srvasude
Contributor

Right, I think we can override the gradient computations with this explicitly.

The idea (from talking with @jvdillon) is that a distribution can have an overridable gradient method for samples. That way a user can switch between different ways of overriding the gradient (and for non-reparameterizable distributions, we would error out explicitly).

Re: the Chebyshev approximation: how would you use it here? Are you saying that for univariate distributions you'd use the inverse-CDF transform (the support is [0, 1]) and then apply a Chebyshev approximation to that (rather than doing implicit differentiation with the CDF)? At some point you'd have to compute the inverse CDF of your random draw, right? That can be expensive.

@axch
Contributor

axch commented May 24, 2018

No, I was just musing about how to get better precision for the derivative of the CDF than you would get by ADing the implementation. Namely, if you have a slow but accurate way to compute some analytically obnoxious function (in this case, f(x) = d/dx CDF_foo(x)|_x), and you have a fast but somewhat inaccurate way to compute it (in this case, applying AD to CDF_foo), is it possible to empirically select coefficients to construct a Chebyshev approximation of the difference, so as to ship

def f(x):
  return AD(CDF_foo)(x) + <some polynomial>(x)

The idea would be to pick the polynomial to minimize the maximum (empirically determined) error between this (which is presumably still fast) and the slow but accurate thing. It's not clear that this idea is workable, and it certainly isn't necessary to get started, but it seemed cute.

To clarify: the slow and accurate thing I have in mind is numerically computing d/dx CDF_foo(x)|_x by Richardson extrapolation in high precision arithmetic (either 64-bit, or, if that's not good enough, busting out a suitable very high precision library).
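A rough sketch of how such a fit could look, using stand-in functions rather than the actual CDF derivative; NumPy's Chebyshev.fit is least-squares, so a true minimax fit would need something like the Remez algorithm:

```python
import numpy as np
from numpy.polynomial import Chebyshev

# Toy stand-ins (not the gamma CDF): "fast" is a truncated Taylor series for exp,
# "accurate" is numpy's exp, playing the role of the slow high-precision reference.
def fast(x):
    return 1.0 + x + x**2 / 2.0 + x**3 / 6.0

accurate = np.exp

xs = np.linspace(-1.0, 1.0, 2001)
# Least-squares Chebyshev fit of the residual up to 5th order.
correction = Chebyshev.fit(xs, accurate(xs) - fast(xs), deg=5)

def corrected(x):
    return fast(x) + correction(x)

print(np.max(np.abs(accurate(xs) - corrected(xs))))  # far smaller than the raw residual
```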

@mfigurnov

mfigurnov commented May 24, 2018

Thanks for the interest in the work! Hopefully, this will be incorporated into TFP soon :) Let me comment on some of the points. TL;DR: take a look at the appendix; it answers some of your questions.

Regarding an auto-diff system: it's overkill to bring one in just for this feature. The CDF for Gamma is computed using basic arithmetic and the Gamma function (see appendix B). So manual forward-mode differentiation is somewhat annoying, but very much doable. This is basically what we did (except for small tweaks described in appendix B).

Regarding computing the derivative directly: there is a closed-form expression in appendix C for that. However, it involves a hypergeometric function that is tricky to compute. We use it to obtain the ground-truth values.

Regarding the error of the derivative: it is close to the error of the CDF, so there doesn't seem to be any error accumulation going on.

@srvasude
Contributor

@axch: Understood, thanks! Yeah, that's a fairly interesting approach (although, as you said, it is not clear to me that it is completely workable).

@mfigurnov

mfigurnov commented May 24, 2018

Also, take a look at the numbers in table 2. For Gamma, the average derivative dz/dalpha is one (see appendix A), so you cannot really expect the error to go below 1e-7 for float32 and 1e-16 for float64. So the correction term is an interesting idea, but might be unnecessary for practical purposes.

@yorkerlin

yorkerlin commented May 28, 2018

@mfigurnov Could you comment on the difference between the method in Eq. 27 at page 18 of https://arxiv.org/pdf/1206.6679.pdf and your method?
Any comments on sparse Gamma distributions when \alpha<1e-6? (see https://arxiv.org/pdf/1509.01631.pdf)

@yorkerlin

yorkerlin commented May 28, 2018

@mfigurnov Any comments on the Generalized inverse Gaussian (GIG) distribution? It seems it is non-trivial to compute the CDF for GIG.

@mfigurnov

@yorkerlin

Could you comment on the difference between the method in Eq. 27 at page 18 of https://arxiv.org/pdf/1206.6679.pdf and your method?

Thanks for the interest in our work. The difference compared to Salimans&Knowles is that we (i) propose using automatic differentiation to compute the derivative of the CDF and experimentally evaluate the effectiveness of this approach; (ii) present a more general form of this implicit gradient (using standardization functions instead of just the CDF, and also provide an expression for the multivariate case); (iii) highlight the connection between the implicit derivative and the standard reparameterization trick. We also recently became aware that a related approach was developed by Hoffman&Blei "Structured Stochastic Variational Inference". Additionally, we found a 1988 paper by Suri&Zazanis on perturbation analysis https://www.jstor.org/stable/2632185 that proposes both the explicit reparameterization and the eqn. (27) of Salimans&Knowles in Section 4, so the idea has been known for a lot longer than we thought :) We are currently updating the related work section to include and discuss these works.

Any comments on sparse Gamma distributions when \alpha<1e-6? (see https://arxiv.org/pdf/1509.01631.pdf)

The error of our method is actually better for extremely low \alpha than for larger ones. So we do not use any additional approximations.

Any comments on the Generalized inverse Gaussian (GIG) distribution? It seems it is non-trivial to compute the CDF for GIG.

There is a method for approximation of the CDF of this distribution: http://epub.wu.ac.at/1548/ . Since it uses a polynomial approximation, it should be straightforward to differentiate. I am not sure how practical this is, though, since the method seems to have a slow set-up. (Non-generalized) inverse Gaussian Distribution has tractable CDF and tractable derivative of the CDF, so you could definitely use implicit gradients for that distribution.

@yorkerlin

yorkerlin commented May 29, 2018

@mfigurnov
Thanks for your message. Nice work :)

About the multivariate case
It seems that it is non-trivial to compute the gradient w.r.t. the covariance matrix for a finite mixture of multivariate Gaussians with full correlation if the multivariate CDF transform is used. It is relatively easy to compute the gradient w.r.t. mixing weights in this case. Do you have any experimental result on this?

@fritzo

fritzo commented Jun 7, 2018

In PyTorch we followed an approach similar to what @axch suggests by constructing a minimax numerical approximation to the gradient directly (rather than combining multiple approximations via AD). Here's our write-up https://arxiv.org/abs/1806.01851 and derivations for Gamma and Beta/Dirichlet, in case it helps. IIRC our Gamma doesn't work well for alpha<1e-3 due to loss of precision in the sampler. cc @martinjankowiak

@yorkerlin

yorkerlin commented Jun 7, 2018

@fritzo nice work!
For the multivariate Gaussian, it would be great if you could compare your method against the following one.
Let q(x) = N(x | \mu, V), where C is the (lower-triangular) Cholesky factor of the covariance matrix V (C*C^T = V).
The gradient w.r.t. C can be computed as
dC = tril(dV*(2*C)), where dV = \nabla_{V} E_{q(x)}[f(x)] can be computed using Eq. 19 of Opper & Archambeau: http://www0.cs.ucl.ac.uk/staff/c.archambeau/publ/neco_mo09_web.pdf
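A tiny NumPy sketch of that conversion, assuming dV is the symmetric gradient w.r.t. the covariance V (e.g. obtained from the Opper & Archambeau identity):

```python
import numpy as np

def chol_grad(dV, C):
    # With V = C @ C.T and C lower triangular, the chain rule gives
    # dL/dC = tril((dV + dV.T) @ C), i.e. tril(2 * dV @ C) when dV is symmetric.
    return np.tril(2.0 * dV @ C)

# Example: dV from some objective, C the Cholesky factor of V.
C = np.linalg.cholesky(np.array([[2.0, 0.5], [0.5, 1.0]]))
dV = np.array([[0.3, 0.1], [0.1, 0.2]])   # symmetric gradient w.r.t. V
print(chol_grad(dV, C))
```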

For Gaussian models, we use the method above in our paper (see Figure 2 of https://arxiv.org/pdf/1511.00146.pdf).
For Gamma models, we use the method proposed by Knowles in another paper (see Figure 1 of https://arxiv.org/pdf/1703.04265.pdf). I think your proposed method could improve its performance.

@yorkerlin

yorkerlin commented Jun 7, 2018

@fritzo The method I mentioned uses second-order information about the target function f(x), so it may not be fair to compare it against your method. Note: the REINFORCE method uses zero-order information about f(x), while the reparametrization method uses first-order information.

@mfigurnov

@yorkerlin We don't have results for mixtures of full-covariance multivariate Normals, unfortunately. One way to obtain gradients w.r.t. the covariance matrices could be ancestral sampling: you sample the point from one of the mixture components and backpropagate to just that component's mean/covariance. I am fairly sure that this estimator is unbiased, but it obviously has higher variance than if you properly accounted for the posterior mixture probabilities.
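A minimal TFP sketch of that ancestral-sampling estimator, with a hypothetical 3-component, 2-D mixture (names and shapes are made up for illustration):

```python
import tensorflow as tf
import tensorflow_probability as tfp
tfd = tfp.distributions

locs = tf.Variable(tf.random.normal([3, 2]))
scale_trils = tf.Variable(tf.eye(2, batch_shape=[3]))
logits = tf.constant([0.0, 0.0, 0.0])

with tf.GradientTape() as tape:
  k = tfd.Categorical(logits=logits).sample()          # pick a component; no gradient to the weights
  comp = tfd.MultivariateNormalTriL(loc=tf.gather(locs, k),
                                    scale_tril=tf.gather(scale_trils, k))
  x = comp.sample()                                    # standard reparameterization within the component
  loss = tf.reduce_sum(tf.square(x))                   # stand-in for f(x)

grads = tape.gradient(loss, [locs, scale_trils])       # nonzero only for the sampled component
```

Note that, as in the comment above, this only gives gradients for the component means/covariances; the mixing weights receive no gradient this way.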

@martinjankowiak

@yorkerlin No, we didn't compare to that (second-order) approach, for one thing because computing Hessians in high dimensions can be prohibitively expensive. Of course, you can approximate the Hessian, in which case a natural thing to do is to follow the approach taken in https://arxiv.org/abs/1705.07880. cc @fritzo

@yorkerlin

yorkerlin commented Jun 16, 2018

@martinjankowiak
For high-dimensional problems, a diagonal Gaussian distribution could be used. It is relatively easy to compute the diagonal elements of the Hessian matrix.

@mfigurnov

Gamma, Beta, Dirichlet, Student's t and inverse Gamma distributions are now fully reparameterized, see the following commits: TensorFlow (tensorflow/tensorflow@b894f68), TFP (633e4f2)

The numerical method for the derivative of Gamma samples is incorporated into Eigen: https://bitbucket.org/eigen/eigen/pull-requests/403/derivative-of-the-incomplete-gamma/

Thanks to the TF team for their help! :)

@dustinvtran
Member Author

Thanks @mfigurnov! Fantastic work. Closing this issue.

@jvdillon
Contributor

What about truncated_normal?

@jvdillon reopened this Jul 23, 2018
@dustinvtran
Member Author

There are many distributions that could be interpreted as part of this broad issue. It's probably worth having the other extensions as separate issues to measure progress. Thoughts?

@srvasude
Contributor

I agree with Dustin on this. I read this bug as covering the trick for some fairly common distributions (Gamma, StudentT, Beta, etc., i.e. what the paper addresses). Closing this bug; we will open new bugs on a case-by-case basis.

@yorkerlin

yorkerlin commented Jun 17, 2019

You may look at our poster for the ICML workshop on Stein's method, where we weaken the smoothness assumption of implicit reparameterization gradients. In the poster we focus only on the exponential family; however, the idea can be readily extended to general continuous univariate distributions. For the multivariate case, you can use Theorem 4 in our poster.

TL;DR: The target function can be locally Lipschitz instead of continuously differentiable. For example, ReLU is not continuously differentiable.

@yorkerlin

yorkerlin commented Jun 17, 2019

@jvdillon For truncated_normal, if the truncation interval is fixed, you can still use the trick, but you have to ensure that the boundary condition holds (see Lemma 3 in our poster).

@brianwa84
Contributor

brianwa84 commented Aug 12, 2020 via email

@Mossi8

Mossi8 commented Aug 13, 2020

Move the sample inside the tape?
Brian Patton | Software Engineer | bjp@google.com

On Wed, Aug 12, 2020 at 6:20 AM Mossi8 wrote: Hi, backpropagation is not working for the gamma distribution. Have there been any changes?

  concentration = tf.constant(3.0)
  rate = tf.constant(2.0)
  dist = tfp.distributions.Gamma(concentration, rate)
  samples = dist.sample(5)  # Shape [5]
  with tf.GradientTape() as tape:
    loss = tf.reduce_mean(tf.square(samples))  # Arbitrary loss function
  # Unbiased stochastic gradients of the loss function
  grads = tape.gradient(loss, [concentration, rate])
  print(grads)  # [None, None]

Thanks!!!

Hi, it still gives None.

This comes from a model I built which samples from a gamma distribution:
tfp.distributions.Gamma(concentration=mean, rate=logvar, name='Gamma').sample()

But I get this error:

LookupError: No gradient defined for operation 'gradients/Gamma_1/sample/IdentityN_grad/RandomGammaGrad' (op type: RandomGammaGrad)

However, if I sample from a Normal it works just fine.

Thank you!!!

@brianwa84
Contributor

brianwa84 commented Aug 13, 2020 via email

@jvdillon
Contributor

jvdillon commented Aug 17, 2020 via email

@brianwa84
Contributor

brianwa84 commented Aug 17, 2020 via email

@Mossi8

Mossi8 commented Aug 17, 2020

https://colab.research.google.com/gist/brianwa84/46e9c880245d2fe1905cc782a68be79c/gamma.ipynb

Hi, it is solved now. You were right: I had a persistent=True somewhere, and it was taking second-order derivatives through the gamma reparameterization. And I was not familiar with the LookupError.

Thank you so much!
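For future readers, a minimal sketch of the pattern Brian suggested (sampling inside the tape, with the parameter tensors watched), assuming TF2 eager execution:

```python
import tensorflow as tf
import tensorflow_probability as tfp

concentration = tf.constant(3.0)
rate = tf.constant(2.0)

with tf.GradientTape() as tape:
  # Constants must be watched explicitly; tf.Variables would be tracked automatically.
  tape.watch([concentration, rate])
  # Sampling *inside* the tape lets the implicit reparameterization gradient
  # flow from the samples back to concentration and rate.
  samples = tfp.distributions.Gamma(concentration, rate).sample(5)
  loss = tf.reduce_mean(tf.square(samples))

print(tape.gradient(loss, [concentration, rate]))  # two finite tensors, not [None, None]
```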

@danleonte

danleonte commented Feb 17, 2021

Hi

Is there a way to use the current implementation of the reparametrization trick in

tfp.distributions.Gamma(concentration=mean,rate=logvar,name='Gamma').sample()

for a truncated gamma distribution? Would throwing away samples that are too big work (even if it is wasteful)?

@brianwa84
Contributor

brianwa84 commented Feb 17, 2021 via email
