Gradients of non-scalars (higher rank Jacobians) #675
zackchase@, you are right about the current |
It'd be pretty hard to support gradients of non-scalars with our current setup, since it would require every gradient function to handle extra-rank inputs. The one possibility I could see would be to add some sort of map facility for registering how to add extra ranks to ops, then compute gradients with respect to the extra rank by computing the lower rank and calling the registered map transformations. Someone asked for map a while back, so if anyone wanted to tackle this task that might be the way to go. Handling it at the gradient-function level is probably bad, since it would add required complexity to an existing feature. Warning: this is a pretty large change, so a good deal of discussion would be in order before starting. |
Hi Geoffrey, thanks for taking an interest in this issue. I was initially confused by the use of "rank" to describe the number of dimensions of an array. Should we avoid this term in the thread title and documentation to preempt confusion with the linear-algebra notion of matrix rank? |
Tensor rank is very standard terminology: http://mathworld.wolfram.com/TensorRank.html |
Cool. The terminology gets funny when we talk about rank-R decompositions of tensors, meaning the tensor can be represented as a sum of R outer products of rank-1 tensors, but that's probably not a problem for us to solve here. One thing I'd like to do is compute the Frobenius norm of the Jacobian of the log probabilities for use as a smoothness penalty, much like the penalty used in a contractive autoencoder. In that case, since we only seek a scalar in the end, is there a more efficient method than separately calculating the derivative of each output with respect to the inputs? |
Are you saying your network has a bunch of outputs, and then you combine them into a single scalar that you are trying to optimize? In that case, you should differentiate with respect to that single scalar. |
Not exactly. I'm saying one may want to penalize the norm of the Jacobian of the mapping function: `cost(y, yhat, X) = loss(y, yhat) + norm(Jacobian(log(yhat), X))` |
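A Jacobian-norm penalty like the one above can be checked numerically without any framework support. Here is a minimal numpy sketch (the toy model, `log_softmax`, and `numerical_jacobian` helper are all illustrative, not part of TensorFlow): build the Jacobian column by column with finite differences, then take its squared Frobenius norm.

```python
import numpy as np

def log_softmax(z):
    z = z - z.max()
    return z - np.log(np.exp(z).sum())

def model(x, W, b):
    # Toy "log probability" mapping: log_softmax of an affine transform.
    return log_softmax(W @ x + b)

def numerical_jacobian(f, x, eps=1e-6):
    # One forward-difference pass per input coordinate -> one Jacobian column.
    y0 = f(x)
    J = np.zeros((y0.size, x.size))
    for j in range(x.size):
        xp = x.copy()
        xp[j] += eps
        J[:, j] = (f(xp) - y0) / eps
    return J

rng = np.random.default_rng(0)
W, b = rng.normal(size=(2, 3)), np.zeros(2)
x = rng.normal(size=3)

J = numerical_jacobian(lambda v: model(v, W, b), x)  # shape (2, 3)
penalty = np.sum(J ** 2)  # squared Frobenius norm of the Jacobian
```

This is only a check, not a training-time implementation: it costs one extra forward pass per input dimension, which is exactly the expense the rest of the thread discusses.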
Ah, sorry for not reading carefully. You're correct that (as far as I know) there's no easy way to do that in current TensorFlow. According to someone more knowledgeable than I am, people generally implement such contractive autoencoders by writing out the first derivative manually. They also generally restrict the penalty to one layer at a time for speed, since computing the full Jacobian of a multilayer network is quite expensive. |
Regardless, it would be good to have a way to call derivatives of vectors and receive gradients of the expected shape. |
Differentiating with respect to one variable at a time is similar to how it works in Theano. I agree it may be confusing that TensorFlow automatically turns many output variables into one by taking their sum. An alternative would be to fail if more than one output variable is specified, or to have a wrapper that automatically calls the existing gradient function on each output variable.

The reason for "one output variable at a time" in TensorFlow (and Theano) is that we do reverse-mode AD by default. In reverse AD you have a single target scalar quantity and you propagate sensitivities with respect to that quantity. If we did forward AD instead, we would naturally support multiple output variables, but could only compute derivatives with respect to one scalar input variable at a time. Supporting mixed-mode propagation to cover the "multiple inputs / multiple outputs" case in the most efficient way could require a lot of extra plumbing.

If you have a small number of output variables but a large number of input variables, the standard thing to do is to apply reverse AD with respect to each output variable in a loop. This is what Theano recommends for computing the Hessian, for instance: http://deeplearning.net/software/theano/tutorial/gradients.html#computing-the-hessian. If you have a small number of input variables but a large number of output variables, the most efficient approach would be to run forward-mode AD for each input variable in a loop. Forward-mode AD is not implemented; it would require adding an equivalent of Theano's "Rop" operator to differentiable ops, plus some plumbing to call it instead of the existing op "gradient" function (the existing gradient function is an equivalent of the "Lop" operation, i.e. "left-multiply the sensitivity vector by the op's Jacobian"). |
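To make the forward-mode point above concrete, here is a tiny dual-number sketch in plain Python (the `Dual` class and the function `f` are illustrative, not TensorFlow code): one forward pass with a derivative seed on a single input yields the derivative of every output with respect to that input, i.e. one column of the Jacobian per pass.

```python
# Minimal forward-mode AD via dual numbers: each value carries its
# derivative ("dot") with respect to one seeded input variable.
class Dual:
    def __init__(self, val, dot=0.0):
        self.val, self.dot = val, dot

    def __add__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        return Dual(self.val + other.val, self.dot + other.dot)
    __radd__ = __add__

    def __mul__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        # Product rule: (uv)' = u'v + uv'
        return Dual(self.val * other.val,
                    self.val * other.dot + self.dot * other.val)
    __rmul__ = __mul__

def f(x1, x2):
    # Two outputs, two inputs: forward mode gives one Jacobian COLUMN per pass.
    return [x1 * x2 + x1, 3 * x1 * x1 + x2]

# Seed x1 with dot=1 at the point (x1, x2) = (2, 5):
y = f(Dual(2.0, 1.0), Dual(5.0, 0.0))
column = [out.dot for out in y]  # d(outputs)/d(x1) = [x2 + 1, 6*x1] = [6.0, 12.0]
```

Getting the full Jacobian this way takes one pass per input variable, which is why forward mode wins only when inputs are few and outputs are many.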
I was hoping to implement higher order derivatives using the map function but am getting an error message I can't quite get my head around. My implementation is (in pseudo code)
When I fetch the hessian, I get the error message
I assumed that TensorFlow has an issue because it doesn't know about the loop. Looping in Python is of course fine, but I'd like to avoid introducing an extra node to the graph for every parameter. Any suggestions? |
@yuanbyu: Do you understand this issue with |
Note for anyone who comes across this thread: |
We don't support higher-order gradients for while_loop/map_fn/scan/fold. You should see an informative error message if you try to do that. |
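Higher-order gradients that do not go through while_loop/map_fn/scan work directly via nested differentiation. A minimal sketch with the TF 2.x `tf.GradientTape` API (an eager-mode API that postdates most of this thread; shown here only to illustrate the nesting pattern, assuming a modern TensorFlow install):

```python
import tensorflow as tf

x = tf.constant(3.0)
with tf.GradientTape() as outer:
    outer.watch(x)
    with tf.GradientTape() as inner:
        inner.watch(x)
        y = x ** 3
    # First derivative, computed while the outer tape is still recording:
    dy = inner.gradient(y, x)   # 3 * x**2 = 27
d2y = outer.gradient(dy, x)     # 6 * x = 18
```

The key point is that `inner.gradient` is called inside the outer tape's context, so the first-derivative computation is itself differentiable.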
@yaroslavvb Any plans on adding forward mode AD? I filed an issue on it a couple weeks ago but haven't heard back. |
@vladfi1 I'm no longer at Brain, so I wouldn't know. I'd say it is unlikely to ever be part of core TensorFlow. There are >450 ops in TF, so the Brain team would have to implement a forward-AD grad method for all 450 ops and maintain them forever, or else explain why someone's favorite op doesn't have forward-AD support. It seems more realistic that someone would create a separately maintained library that does forward AD and uses TensorFlow as a backend, kind of like autograd but with TensorFlow instead of numpy. |
Is tf.test.compute_gradient some kind of function that we can use to get the Jacobian matrix (not as a tensor but as a numpy.ndarray) of a vector tensor y w.r.t. a vector tensor x? |
There are no built-in Jacobians in TensorFlow; instead, anything called 'grad' or 'gradient' computes a vector-Jacobian product (also called Lop in Theano). See https://en.wikipedia.org/wiki/Automatic_differentiation#Reverse_accumulation
|
@yaroslavvb Thanks for your reply. I see why it's expensive to do that. |
@marcociccone in TF 1.4 seems to be working fine, if this is an alternative for you. |
This is indeed a bug in TF. It's caused by taking gradients(y, x) inside a while loop such that the computation of y from x goes through a different while loop, something like:
So in @mholzel's script, it's from passing the outcome of one jacobian call to the other jacobian call. (BTW, thanks very much for the easy repro.) Unfortunately this is quite tricky to fix. I'll try to take another look at it tomorrow and see if I can come up with something. |
This is actually extremely tricky to fix. I'm not sure how this was working in 1.4; was it definitely giving the right answer? The fundamental problem is that the second jacobian call (i.e. the hessian) is calling tf.gradients() inside a while loop, and that backprop calculation must go through the while loop from the first jacobian call. TF computes while-loop gradients using stacks to store intermediate loop values, so if you're doing that calculation multiple times via another loop, we'd have to somehow re-use the stack values on each iteration. This is conceptually possible but would be a pretty big change. I can at least try to improve the error message, though. |
I have never seen the nested call work, but I did not try 1.4. |
@mholzel For the double jacobian, suppose you just map tf.hessians instead of tf.gradients. It doesn't generalize to arbitrary order, but it does get the Hessian of a tensor w.r.t. variables. |
There is now an experimental new approach to doing Jacobians here: |
Did anyone compare the performance of the new functionality and the previous solution posted by @mholzel here? I am finding the new |
gradients_test.py in the same file has some benchmarks showing that the pfor-based jacobian generally works much faster than the while-loop-based one. Can you share your benchmark? |
I threw together this function; it uses tf.map_fn and assumes batch-style input, but at least it works for batches. It's pretty high level, but it seems to address some of the ideas in the conversation above:
|
@zackchase Is this resolved? Please close If it was resolved already. Thanks! |
tf.GradientTape.jacobian and tf.GradientTape.batch_jacobian APIs have been added for computing Jacobians. These are based on experimental pfor logic which can also be disabled to fallback to loop based implementation. |
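A minimal sketch of those two APIs (assuming TF 2.x eager mode; the toy function is illustrative): `jacobian` returns the full Jacobian with shape `y.shape + x.shape`, while `batch_jacobian` treats the leading dimension as a batch and returns shape `[batch, y_dim, x_dim]`.

```python
import tensorflow as tf

x = tf.constant([[1.0, 2.0],
                 [3.0, 4.0]])  # batch of 2 examples, input dim 2
with tf.GradientTape(persistent=True) as tape:
    tape.watch(x)
    # Two outputs per example: y0 = x0 * x1, y1 = x0 ** 2
    y = tf.stack([x[:, 0] * x[:, 1], x[:, 0] ** 2], axis=1)  # shape [2, 2]

full = tape.jacobian(y, x)              # shape [2, 2, 2, 2]: y.shape + x.shape
per_example = tape.batch_jacobian(y, x) # shape [2, 2, 2]: [batch, y_dim, x_dim]
del tape  # release the persistent tape
```

`batch_jacobian` is the one you usually want for per-example Jacobians, since the cross-example blocks of the full Jacobian are zero anyway when each output depends only on its own example.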
Currently, if you call gradients(ys, xs), it will return the sum of dy/dx over all ys for each x in xs. I believe this doesn't accord with the a priori mathematical notion of the derivative of a vector. I'd like a way to take the derivative of ys w.r.t. xs where both are vectors and have a Jacobian matrix returned. By extension, I'd like to take the derivative of a vector w.r.t. a matrix and get back a 3-tensor. There doesn't seem to be a convenient TensorFlow function to compute the Jacobian or higher-order derivatives. Am I missing something, or is this functionality that we could add?
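The summing behavior described above is easiest to see on a linear map, where the Jacobian is explicit. A minimal numpy sketch (the names are illustrative): for y = A x, the Jacobian is exactly A, and reverse mode with the implicit all-ones seed returns the column sums of A rather than A itself.

```python
import numpy as np

A = np.array([[1., 2.],
              [3., 4.],
              [5., 6.]])
x = np.ones(2)
# y = A @ x is linear, so its Jacobian dy/dx is exactly A (shape [3, 2]).
# gradients(ys, xs) behaves like a vector-Jacobian product seeded with v = ones:
v = np.ones(3)
summed_grad = v @ A  # column sums of the Jacobian: [9., 12.]
```

Recovering the full [3, 2] Jacobian instead requires three such products, one per one-hot seed vector, which is exactly the per-output loop discussed throughout this thread.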