Gradients of non-scalars (higher rank Jacobians) #675

Open
zackchase opened this Issue Jan 4, 2016 · 64 comments

@zackchase
Contributor

zackchase commented Jan 4, 2016

Currently, if you call gradients(ys, xs), it returns the sum of dy/dx over all ys for each x in xs. I believe this doesn't accord with the a priori mathematical notion of the derivative of a vector. I'd like a way to take the derivative of ys with respect to xs, where both are vectors, and have a Jacobian matrix returned. By extension, I'd like to take the derivative of a vector with respect to a matrix and get back a 3-tensor. There doesn't seem to be a convenient TensorFlow function to compute the Jacobian or higher-order derivatives. Am I missing something, or is this functionality we could add?

@keveman


Contributor

keveman commented Jan 4, 2016

zackchase@, you are right about the current gradients function. Currently, you can compute the Jacobian of, say, a vector by calling gradients multiple times, once for every scalar component (obtained by slicing) of the original vector, and reassembling the results. Contributions are welcome to make this nicer and more efficient.
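
A minimal sketch of this slice-and-reassemble approach, assuming graph-mode TensorFlow 1.x, a 1-D y with a statically known length, and the hypothetical helper name jacobian_by_slicing:

import tensorflow as tf

def jacobian_by_slicing(y, x):
    # One tf.gradients call per scalar component of y, then reassemble by stacking.
    rows = [tf.gradients(y[i], x)[0]            # dy[i]/dx, same shape as x
            for i in range(int(y.shape[0]))]    # requires a statically known length
    return tf.stack(rows)                       # shape: [len(y)] + x.shape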

@girving


Contributor

girving commented Jan 4, 2016

It'd be pretty hard to support gradients of non-scalars with our current setup, since it would require every gradient function to handle extra rank input. The one possibility I could see would be if we add some sort of map facility to register how to add extra ranks to ops, then compute gradients with respect to extra rank by computing lower rank and calling the registered map transformations.

Someone asked for map a while back, so if anyone wanted to tackle this task that might be the way to go. Handling it at the gradient function level is probably bad, since it would add required complexity to an existing feature. Warning: This is a pretty large change, so a good deal of discussion would be in order before starting.

@girving girving changed the title from Problems Calculating the Jacobian to Gradients of non-scalars (higher rank Jacobians) Jan 4, 2016

@girving girving added the enhancement label Jan 4, 2016

@zackchase


Contributor

zackchase commented Jan 4, 2016

Hi Geoffrey, thanks for taking an interest in this issue. I was initially confused by the use of "rank" to describe the number of dimensions of the array. Should we avoid this name in the thread title and documentation to preempt confusion from overloading the linear-algebra notion of rank?

@girving


Contributor

girving commented Jan 4, 2016

Tensor rank is very standard terminology: http://mathworld.wolfram.com/TensorRank.html

@zackchase


Contributor

zackchase commented Jan 4, 2016

Cool. The terminology gets funny when we talk about rank-R decompositions of tensors, meaning the tensor can be represented as a sum of R outer products of rank-1 tensors, but that's probably not a problem for us to solve here.

One thing I thought of is that I would like to compute the Frobenius norm of the Jacobian of the log probabilities for use as a smoothness penalty, much like the penalty used in a contractive autoencoder. In this case, since we only seek a scalar at the end, is there a more efficient method than separately calculating the derivative of each output with respect to the inputs?

@girving


Contributor

girving commented Jan 4, 2016

Are you saying your network has a bunch of outputs, and then you combine them into a single scalar that you are trying to optimize? In that case, you should differentiate with respect to that single scalar.

@zackchase


Contributor

zackchase commented Jan 4, 2016

Not exactly. I'm saying if one wants to penalize the norm of the Jacobian of the mapping function, the optimization objective would be (pseudocode):

cost(y, yhat, X) = loss(y, yhat) + norm(Jacobian(log(yhat), X))
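
A hedged sketch of how such a penalty could be assembled with today's tf.gradients, assuming log_yhat is a 1-D tensor with a static length, X is the input tensor, and jacobian_frobenius_penalty and lam are hypothetical names (the per-component loop is exactly the expensive part discussed here):

import tensorflow as tf

def jacobian_frobenius_penalty(log_yhat, X):
    # Squared Frobenius norm of d(log_yhat)/dX, built one output component at a time.
    sq_norms = []
    for i in range(int(log_yhat.shape[0])):     # requires a static output size
        g = tf.gradients(log_yhat[i], X)[0]     # one row of the Jacobian
        sq_norms.append(tf.reduce_sum(tf.square(g)))
    return tf.add_n(sq_norms)

# cost = loss + lam * jacobian_frobenius_penalty(tf.log(yhat), X)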

@girving


Contributor

girving commented Jan 4, 2016

Ah, sorry for not reading carefully. You're correct that (as far as I know) there's no easy way to do that in current TensorFlow. According to someone more knowledgeable than I, people generally implement such contractive autoencoders by writing out the first derivative manually. They also generally restrict themselves to one layer at a time for speed reasons, since computing the full Jacobian of a multilayer network is quite expensive.

@zackchase


Contributor

zackchase commented Jan 9, 2016

Regardless, it would be good to have a way to take derivatives of vectors and receive gradients of the expected shape.

@yaroslavvb


Contributor

yaroslavvb commented Jan 9, 2016

Differentiating with respect to one output variable at a time is similar to how it works in Theano. I agree it may be confusing that TensorFlow automatically turns many output variables into one by taking the sum. An alternative would be to fail if more than one output variable is specified, or to have a wrapper that automatically calls the existing gradients function on each output variable.

The reason for "one output variable at a time" in TensorFlow (and Theano) is that we do reverse-mode AD by default. In reverse AD you have a single target scalar quantity and you propagate sensitivities with respect to that quantity. In contrast, if we did forward AD instead, we would naturally support multiple output variables, but we could only compute derivatives with respect to one scalar input variable at a time. Supporting mixed-mode propagation to cover the "multiple inputs/multiple outputs" case in the most efficient way could require a lot of extra plumbing.

If you have a small number of output variables but a large number of input variables, the standard thing to do is to apply reverse AD to each output variable in a loop. This is what Theano recommends for computing the Hessian, for instance: http://deeplearning.net/software/theano/tutorial/gradients.html#computing-the-hessian. If you have a small number of input variables but a large number of output variables, the most efficient thing to do would be to run forward-mode AD for each input variable in a loop. Forward-mode AD is not implemented and would require adding an equivalent of Theano's "Rop" operator to the differentiable ops, plus some plumbing to call them instead of the existing op "gradient" functions (the existing gradient function is the equivalent of the Lop operation, i.e. "left-multiply a sensitivity vector by the op's Jacobian").
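
A minimal sketch of the "reverse AD in a loop" recipe in graph-mode TensorFlow 1.x, assuming a scalar loss, a 1-D x with a statically known size, and the hypothetical helper name hessian_by_loop:

import tensorflow as tf

def hessian_by_loop(loss, x):
    # Reverse AD once for the gradient, then once per gradient entry for the Hessian rows.
    grad = tf.gradients(loss, x)[0]             # shape [n]
    rows = [tf.gradients(grad[i], x)[0]         # d(grad[i])/dx, shape [n]
            for i in range(int(x.shape[0]))]    # needs a statically known n
    return tf.stack(rows)                       # shape [n, n]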

@tillahoffmann


Contributor

tillahoffmann commented Apr 27, 2016

I was hoping to implement higher-order derivatives using the map function but am getting an error message I can't quite get my head around. My implementation is (in pseudocode)

params = tf.Variable("some initial value")
loss = some_function(params)
grads = tf.gradients(loss, params)[0]
hess = tf.map_fn(lambda grad: tf.gradients(grad, params)[0], grads)

When I fetch the Hessian, I get the error message

InvalidArgumentError: All inputs to node map/while/gradients/map/TensorArrayUnpack_grad/TensorArrayGrad/TensorArrayGrad must be from the same frame.

I assumed that TensorFlow has an issue because it doesn't know about params in the loop (cf. non_sequences in Theano's scan), and extended map_fn to pass extra arguments to the loop. Unfortunately, the extra arguments get wrapped in an identity transformation, and tf.gradients(params, tf.identity(params)) gives [None], which seems a bit unintuitive.

Looping in Python is of course fine, but I'd like to avoid introducing an extra node in the graph for every parameter. Any suggestions?

@girving


Contributor

girving commented Apr 28, 2016

@yuanbyu: Do you understand this issue with tf.map_fn?

@girving girving added the triaged label Jun 8, 2016

@girving


Contributor

girving commented Jun 9, 2016

Note for anyone who comes across this thread: tf.map_fn is an unrelated thing involving control flow, not something related to mapping over extra rank tensors.

@yuanbyu


Contributor

yuanbyu commented Aug 24, 2016

We don't support higher-order gradients for while_loop/map_fn/scan/fold. You should see an informative error message if you try to do that.

@yuanbyu yuanbyu closed this Aug 24, 2016

@vladfi1


Contributor

vladfi1 commented Oct 3, 2016

@yaroslavvb Any plans on adding forward mode AD? I filed an issue on it a couple weeks ago but haven't heard back.

@yaroslavvb


Contributor

yaroslavvb commented Oct 3, 2016

@vladfi1 I'm no longer at Brain, so I wouldn't know. I would say it is unlikely to ever be part of core TensorFlow. There are >450 ops in TF, so the Brain team would have to implement a forward-AD method for all 450 ops and maintain them forever, or alternatively explain why someone's favorite op doesn't have forward-AD support. It seems more realistic that someone would create a separately maintained library that does forward AD and uses TensorFlow as a backend, kind of like autograd but with TensorFlow instead of NumPy as the backend.

@myaooo


myaooo commented Feb 20, 2017

Is tf.test.compute_gradient some kind of function we can use to get the Jacobian matrix (not as a tensor but as a numpy.ndarray) of a vector tensor y w.r.t. a vector tensor x?

@yaroslavvb

Contributor

yaroslavvb commented Feb 20, 2017

@myaooo


myaooo commented Feb 21, 2017

@yaroslavvb Thanks for your reply. I see why it's expensive to do that.
Do you have any suggestions if I want to get the numerical value of the Jacobian for particular inputs x? The only not-so-expensive workaround I can think of is to apply a perturbation to each dimension of x and approximate the result.
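
A small NumPy sketch of that perturbation idea, purely as an illustration; f stands for any Python callable that evaluates y at a given x (for example, a thin wrapper around sess.run), and numerical_jacobian is a hypothetical helper name:

import numpy as np

def numerical_jacobian(f, x, eps=1e-5):
    # Central-difference approximation of df/dx for f mapping R^n to R^m.
    x = np.asarray(x, dtype=np.float64)
    m = np.asarray(f(x)).size
    jac = np.zeros((m, x.size))
    for i in range(x.size):
        dx = np.zeros_like(x)
        dx.flat[i] = eps
        jac[:, i] = (np.asarray(f(x + dx)).ravel()
                     - np.asarray(f(x - dx)).ravel()) / (2.0 * eps)
    return jac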

@tillahoffmann


Contributor

tillahoffmann commented Feb 21, 2017

Hessians are supported. But, as you mentioned, they are expensive to compute.

@yaroslavvb


Contributor

yaroslavvb commented Feb 21, 2017

@myaooo currently tf.gradients gives the gradient of a function that outputs a scalar. You could call it multiple times, once for each output component, i.e. [tf.gradients(component, var) for component in tf.unstack(vector)].
There's a trick for computing Hessians more efficiently, described in #4897 (comment)

@el3ment


el3ment commented Mar 11, 2017

Adding a comment that this was also a bit of a gotcha for me, and simultaneously adding a vote to consider a change to the API. Calling tf.gradients(matrix, vector, aggregation_method=None), I would have expected it to return a tensor of the same shape as matrix, but it returns the gradient of sum(matrix) instead.

@yaroslavvb


Contributor

yaroslavvb commented Mar 11, 2017

It's a little less surprising when you know that there's no efficient algorithm to compute the Jacobian in this setting; hence all neural-net frameworks use reverse-mode AD, which requires the target to be a scalar.

@girving


Contributor

girving commented Mar 13, 2017

@el3ment I consider this a bug. Unfortunately, I've looked into fixing it, and there's a surprising number of tests within TensorFlow that depend on it. I didn't get past said piles of tests to see whether there's interesting downstream code in Google that uses it.

@deeptimhe


deeptimhe commented Mar 21, 2017

@el3ment @girving
I ran into this confusion as well.

x = tf.constant([3], dtype=tf.float32)
y = tf.constant([9,8,7,6,5], dtype=tf.float32)
z = x * y
sess.run(tf.gradients(z, x))

and I got array([35.])

@isabeaups


isabeaups commented Mar 25, 2017

Just trying to understand something. I was trying to make a hack of tf.gradients that would give, for a y of shape (M,N) and an x of shape (Q,P), a gradient tensor of shape (M,N,Q,P), as one would naturally expect. However, as mentioned already here, what one gets is a shape-(Q,P) tensor which is the gradient of the sum of the elements of y. Now what I can't figure out, looking into the TensorFlow code, is where that sum over the elements of y is made. Is it at the beginning or at the end? Could someone help me pinpoint the lines of code where that is done?

@yaroslavvb


Contributor

yaroslavvb commented Mar 25, 2017

Note that "summing ys together" is sort of buried deep in backprop. Note that TensorFlow uses Reverse Mode AD in order to compute gradients, and reverse mode AD only supports a scalar output function. If you have non-scalar output, we don't have any algorithm much better than calling reverse mode AD on each of the components of the output separate.

(there's a trick mentioned by ian here which is a bit better on Python overhead compared to calling gradients many times)

That said, the place where summation of ys happens is in line 479-481 in gradients_impl.py

First note that grad_ys are set to the same value in this line
grad_ys = _DefaultGradYs(grad_ys, ys, colocate_gradients_with_ops)
Later there's this block


 # Add the initial gradients for the ys.
    for y, grad_y in zip(ys, grad_ys):
      _SetGrad(grads, y, grad_y)

Note that if you had a single y*=y1+y2+y3+..., then the gradients would propagate in the same way -- the same backprop value from y would be copied over to nodes y1,y2,y3, so this block implicitly treats ys as being added up into the final sum
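
A small sketch of this equivalence in graph-mode TensorFlow 1.x: seeding tf.gradients with its default all-ones grad_ys gives the same result as differentiating the explicit sum of the ys:

import tensorflow as tf

x = tf.constant([1.0, 2.0, 3.0])
y = x * x                                       # non-scalar output

g_default = tf.gradients(y, x)[0]               # implicit all-ones grad_ys seed
g_sum = tf.gradients(tf.reduce_sum(y), x)[0]    # explicit sum of the ys

with tf.Session() as sess:
    print(sess.run([g_default, g_sum]))         # both are [2. 4. 6.]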

@isabeaups


isabeaups commented Mar 25, 2017

Thanks for your response @yaroslavvb!

But there are two things which confuse me from your answer:

(1)

we don't have any algorithm much better than calling reverse-mode AD on each component of the output separately.

I'm pretty sure that cannot be the case. If you think of a deep network outputting a vector y, then the gradient of y[0] is basically the same as that of y[1]; they only differ by the elements of the last weight matrix, i.e. y[0] = W(0,j) V(j, ...) while y[1] = W(1,j) V(j, ...).

(2) You say the summation happens at lines 479-481 of gradients_impl.py.
But just above those lines, there is the following comment in the code:
# grads: op => list of gradients received on each output endpoint of the
# op. The gradients for each endpoint are initially collected as a list.
# When it is time to call the op's gradient function, for each endpoint we
# aggregate the list of received gradients into a Add() Operation if there
# is more than one.

And indeed, when one looks into the function _SetGrad, there is no addition, only appending to a list, and all the ys, if I understand correctly, are still kept separate by being different keys of the dictionary grads.

So I am utterly confused by this.

Also, thank you very much for @goodfeli's trick, but I don't really understand what he means or how to implement it in practice.

@yaroslavvb


Contributor

yaroslavvb commented Mar 25, 2017

@isabeaups the two gradients can differ significantly when there are non-linear operators, as is the case for neural networks.

As a toy example, suppose there's some operation such that, when f(x)=1 and you compute its gradient, one of the backprop operators hits the limits of its numeric range and produces a NaN, which is a valid float32 value. If you want to backprop from the vector [1,2], you are going to have to do a full backprop from 1 and from 2 to determine which value causes the NaN. There's no shortcut that will let you keep the same size of intermediate matrices and backprop from both values in parallel.

@jeisses


jeisses commented Aug 3, 2017

@liber145 I have to compute semi-large Jacobians and am also having performance issues using a for loop.

After looking at the code by @tillahoffmann I came up with the following function, which gives a decent speedup for larger N:

def jacobian(y_flat, x):
    n = y_flat.shape[0]

    loop_vars = [
        tf.constant(0, tf.int32),
        tf.TensorArray(tf.float32, size=n),
    ]

    _, jacobian = tf.while_loop(
        lambda j, _: j < n,
        lambda j, result: (j+1, result.write(j, tf.gradients(y_flat[j], x))),
        loop_vars)

    return jacobian.stack()

In my case the speed is similar to Theano's jacobian function.

@tillahoffmann


Contributor

tillahoffmann commented Aug 3, 2017

@jeisses, the implementation of hessians on the master branch looks very similar to the one you suggested (see here).

@jeisses


jeisses commented Aug 3, 2017

@tillahoffmann, yes, I did derive the function directly from that code 😄.

But I'm looking for just a Jacobian matrix; doesn't the hessians function compute second-order derivatives, and still use aggregated gradients? Or could I use the hessians function for this?

Thanks

@tillahoffmann


Contributor

tillahoffmann commented Aug 4, 2017

@jeisses, of course, my bad. In fact, it would be great if we could integrate your proposed Jacobian implementation and simply implement the Hessian as hessian(y, x) = jacobian(gradient(y, x), x).
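
A hedged sketch of that composition, using a plain unstack-based jacobian rather than the while_loop version (which, as reported later in this thread, does not nest); both helper names are hypothetical, y is assumed scalar, and x is assumed 1-D with a static size:

import tensorflow as tf

def jacobian_unstacked(y, x):
    # Jacobian of a 1-D tensor y w.r.t. x: one tf.gradients call per component.
    return tf.stack([tf.gradients(y_i, x)[0] for y_i in tf.unstack(y)])

def hessian_via_jacobian(y, x):
    # hessian(y, x) = jacobian(gradient(y, x), x) for a scalar y and a 1-D x.
    grad = tf.gradients(y, x)[0]
    return jacobian_unstacked(grad, x)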

@shaifugpt


shaifugpt commented Sep 15, 2017

@shoyer I used your version to compute the Jacobian, passing y=model.total_loss and x=weights. The value of the Jacobian for the different weights is exactly the same as the value of the gradients. Why is that?

@shaifugpt


shaifugpt commented Sep 27, 2017

@candidj0


candidj0 commented Sep 27, 2017

Sorry @shaifugpt, try this one:

def jacobian(y, x, n):
    y_list = tf.unstack(y, num = n)
    jacobian_list = [[tf.gradients(y_, x)[0][i] for y_ in tf.unstack(y_list[i])] for i in range(n)] # list [grad(y0, x), grad(y1, x), ...]
    return tf.stack(jacobian_list)

n is the batch size.

@shaifugpt


shaifugpt commented Sep 28, 2017

@oborchers


oborchers commented Sep 29, 2017

@jeisses and @shoyer I am somewhat confused by the implementation because of the resulting shape of J.

According to https://en.wikipedia.org/wiki/Jacobian_matrix_and_determinant
Let f: R^n -> R^m; then J(f) is an m x n matrix. Given that the input has a batch size s, J should be an s x m x n array, which the implementation of @candidj0 gives us. (However, it is slow due to the for loop.)

[None, n] doesn't work.

Setting s = 1:
jacobian_1 gives (1, 1, 1, 500) in 3s,
jacobian_2 gives (1, 10, 500) in 3s,
jacobian_3 gives (1, 10, 1, 500) in 3s,

whereas s = 20 gives:
jacobian_1 gives (20, 1, 20, 500) in 3s,
jacobian_2 gives (20, 10, 500) in 19s,
jacobian_3 gives (20, 10, 20, 500) in 61s.

Have I missed a point somewhere in the implementation (because I would really like to get that speedup)?

import tensorflow as tf
import numpy as np
import time

def jacobian_1(y_flat, x):
    n = y_flat.shape[0]

    loop_vars = [
        tf.constant(0, tf.int32),
        tf.TensorArray(tf.float32, size=n),
    ]
    _, jacobian = tf.while_loop(
        lambda j, _: j < n,
        lambda j, result: (j+1, result.write(j, tf.gradients(y_flat[j], x))),
        loop_vars)
    return jacobian.stack()

def jacobian_2(y, x, n):
    y_list = tf.unstack(y, num = n)
    jacobian_list = [[tf.gradients(y_, x)[0][i] for y_ in tf.unstack(y_list[i])] for i in range(n)] # list [grad(y0, x), grad(y1, x), ...]
    return tf.stack(jacobian_list)

def jacobian_3(y, x):
  y_flat = tf.reshape(y, (-1,))
  jacobian_flat = tf.stack(
      [tf.gradients(y_i, x)[0] for y_i in tf.unstack(y_flat)])
  return tf.reshape(jacobian_flat, y.shape.concatenate(x.shape))

s = 20
n = 500
m = 10

x = tf.placeholder(tf.float32, [s, n])
w = tf.Variable(tf.truncated_normal([n, m], stddev=0.1))
b = tf.Variable(tf.constant(0.1, shape=[m]))
y = tf.matmul(x, w) + b

init = tf.global_variables_initializer()
sess = tf.Session()
sess.run(init)

start = time.time()
j = jacobian_1(y, x)
j_out = sess.run(j, feed_dict={x:np.random.rand(s,n)})
print(str(int(time.time() - start)) + " Seconds: " + str(j_out.shape))

start = time.time()
j_2 = jacobian_2(y,x, s)
j_out = sess.run(j_2, feed_dict={x:np.random.rand(s,n)})
print(str(int(time.time() - start)) + " Seconds: " + str(j_out.shape))

start = time.time()
j_3 = jacobian_3(y,x)
j_out = sess.run(j_3, feed_dict={x:np.random.rand(s,n)})
print(str(int(time.time() - start)) + " Seconds: " + str(j_out.shape))
@ModarTensai


ModarTensai commented Oct 12, 2017

For my part, I prefer computing the Jacobian matrix using very lightweight operations in the graph.
Since tf.gradients returns the sum, I mask the output at a single index and then compute the gradient.
I compute the Jacobian for each point in batches, then stack them at the end outside the graph.
Here is a running example based on @oborchers' example that produces an (s x m x n) array:

import time
import numpy as np
import tensorflow as tf


def jacobian(session, y, x, points, batch_size, as_matrix=True):
    """The Jacobian matrix of `y` w.r.t. `x` at `points`

    Let f(x) be some function that has a Jacobian A at point p
    then, f(p) = y = Ap+b
    where A of shape mxn, p of shape nx1 and b of shape mx1

    Args:
        y: The output tensor
        x: The input tensor
        points: The points of linearization where it can be many points
            of shape [num_points, *self.features_shape]
        batch_size: How many rows of the Jacobian to compute at once
        as_matrix: Whether to return the Jacobian as a matrix or retain
            the shape of the input

    Returns:
        The Jacobian matrices for the given points
        of shape [num_points, *jacobian_shape]
        If `as_matrix`, jacobian_shape is [y.size, *x.shape]
        else, jacobian_shape is [y.size, x.size]
    """
    # add and/or get cached ops to the graph
    if not hasattr(session.graph, "_placeholder"):
        session.graph._placeholder = {}
    if not hasattr(session.graph, "_gradient"):
        session.graph._gradient = {}
    with session.graph.as_default():
        if y.dtype in session.graph._placeholder:
            placeholder = session.graph._placeholder[y.dtype]
        else:
            placeholder = tf.placeholder(y.dtype)
            session.graph._placeholder[y.dtype] = placeholder

        if (y, x) in session.graph._gradient:
            gradient = session.graph._gradient[(y, x)]
        else:
            gradient = tf.gradients(placeholder * y, x)[0]
            session.graph._gradient[(y, x)] = gradient

    # extract the Jacobians for all points
    jacobians_list = []
    for i in range(points.shape[0]):
        # extract the Jacobian matrix for a single point
        partials_list = []
        point = points[i:i + 1, :]
        shape = y.shape.as_list()[1:]
        repeated_point = point
        for mask in masks_batches(shape, batch_size):
            # repeat the point according to the mask's batch_size
            batch_size = mask.shape[0]
            if repeated_point.shape[0] < batch_size:
                repeated_point = np.vstack([point] * batch_size)
            if repeated_point.shape[0] > batch_size:
                repeated_point = repeated_point[:batch_size, :]
            feed = {placeholder: mask, x: repeated_point}
            partial = session.run(gradient, feed_dict=feed)
            partials_list.append(partial)
        jacobian = np.vstack(partials_list)

        # reshape it as a matrix
        if as_matrix:
            jacobian = jacobian.reshape(jacobian.shape[0], -1)

        jacobians_list.append(jacobian)

    # stack Jacobians
    jacobians = np.stack(jacobians_list)

    return jacobians


def masks_batches(shape, batch_size):
    """Batches iterator over all possible masks of the given shape

    A mask is a numpy.ndarray of shape `shape` of all zeros except
    for a single position it is one. It is useful to get those masks
    in batches instead of getting them one by one.

    Args:
        shape: The shape of each mask
        batch_size: How many masks to return in each iteration

    Returns:
        A batch of masks of shape [batch_size, *shape]
    """
    num_rows = np.prod(shape)
    if num_rows < batch_size:
        batch_size = num_rows

    eye = np.eye(batch_size)
    _mask = np.zeros((batch_size, *shape))
    mask = _mask.reshape(batch_size, -1)

    num_batches = -(-num_rows // batch_size)
    for i in range(num_batches):
        start = i * batch_size
        end = min(start + batch_size, num_rows)

        # check if last batch is smaller than batch size
        if end - start < batch_size:
            batch_size = end - start
            eye = np.eye(batch_size)
            _mask = np.zeros((batch_size, *shape))
            mask = _mask.reshape(batch_size, -1)

        mask[:, start:end] = eye
        yield _mask
        mask[:, start:end] = 0


if __name__ == '__main__':
    m = 10
    n = 500
    s = 20

    x = tf.placeholder(tf.float32)
    w = tf.Variable(tf.truncated_normal([n, m], stddev=0.1))
    b = tf.Variable(tf.constant(0.1, shape=[m]))
    y = tf.matmul(x, w) + b

    init = tf.global_variables_initializer()
    sess = tf.Session()
    sess.run(init)

    start = time.time()
    j_out = jacobian(sess, y, x, np.random.rand(s, n), m)
    w_out = sess.run(w)
    # they should be equal and error ~ < 1e-6 (single precision)
    error = np.linalg.norm(w_out.T - np.mean(j_out, axis=0))
    if error < 1e-6:
        print("Correct Jacobian!")
    else:
        print("Error was {}".format(error))
    print(str(int(time.time() - start)) + " Seconds: " + str(j_out.shape))
@candidj0


candidj0 commented Oct 12, 2017

@oborchers I didn't check the timing, but maybe you can try:

def body(y, x, i):
    n = tf.shape(y)[0]
    loop_vars = [
        tf.constant(0, tf.int32),
        tf.TensorArray(tf.float32, size=n),
    ]
    _, jacobian = tf.while_loop(
        lambda j, _: j < n,
        lambda j, result: (j+1, result.write(j, tf.gradients(y[j], x)[0][i])),
        loop_vars)
    return jacobian.stack()

def tf_jacobian(y, x, n):
    loop_vars = [
        tf.constant(0, tf.int32),
        tf.TensorArray(tf.float32, size=n),
    ]
    _, jacobian = tf.while_loop(
        lambda i, _: i < n,
        lambda i, result: (i+1, result.write(i, body(y[i], x, i))),
        loop_vars)
    return jacobian.stack()

jacobians = tf_jacobian(y, x, n) where n is the batch size.

@aribenjamin


aribenjamin commented Oct 21, 2017

@ModarTensai this is great! Thank you.

It produces an error for me, though, when m=1. I'd like to use this to produce per-example gradients of a single-valued output, so my application will have m=1 as well. Is there an easy fix you know of, before I dig in?

@ModarTensai


ModarTensai commented Oct 21, 2017

@aribenjamin If m = 1, w is a vector, not a matrix, but I fixed and updated the code.
All I needed to change was line 53, from

repeated_point = np.zeros(1)

to

repeated_point = point

Thanks for noticing the bug!

@oborchers


oborchers commented Nov 9, 2017

@ModarTensai and @candidj0 : Awesome guys! You rock, thank you! I'll test them over the weekend and see how they work on a toy dataset.

@czhang96


czhang96 commented Dec 5, 2017

There are no built-in jacobians in TensorFlow, instead anything called 'grad' or 'gradient' computes Jacobian-vector product (also called LOp in theano),

Sanity checking, but LOp is a vector-Jacobian product, not a Jacobian-vector product, correct?
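
For reference, a small graph-mode TensorFlow 1.x sketch of the operation in question: passing an explicit grad_ys to tf.gradients contracts the sensitivity vector against the Jacobian from the left, i.e. a vector-Jacobian (Lop-style) product:

import tensorflow as tf

x = tf.constant([1.0, 2.0, 3.0])
y = x * x                                   # Jacobian is diag(2x) = diag([2, 4, 6])
v = tf.constant([1.0, 0.0, 0.0])            # the "left" sensitivity vector

vjp = tf.gradients(y, x, grad_ys=v)[0]      # v^T J, a vector-Jacobian product

with tf.Session() as sess:
    print(sess.run(vjp))                    # [2. 0. 0.]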

@mholzel


Contributor

mholzel commented Feb 3, 2018

Nobody seems to have posted any follow-up, but the code proposed by @candidj0 and @jeisses does not work when nested (tested in tensorflow 1.5.0). So computing the hessian by nesting will not work (@tillahoffmann). Let me make this a bit more concrete. I am using the following jacobian implementation:

import tensorflow as tf

# Note: this helper shadows Python's built-in map().
def map(f, x, dtype=None, parallel_iterations=10):
    '''
    Apply f to each of the elements in x using the specified number of parallel iterations.

    Important points:
    1. By "elements in x", we mean that we will be applying f to x[0],...x[tf.shape(x)[0]-1].
    2. The output size of f(x[i]) can be arbitrary. However, if the dtype of that output
       is different than the dtype of x, then you need to specify that as an additional argument.
    '''
    if dtype is None:
        dtype = x.dtype

    n = tf.shape(x)[0]
    loop_vars = [
        tf.constant(0, n.dtype),
        tf.TensorArray(dtype, size=n),
    ]
    _, fx = tf.while_loop(
        lambda j, _: j < n,
        lambda j, result: (j + 1, result.write(j, f(x[j]))),
        loop_vars,
        parallel_iterations=parallel_iterations
    )
    return fx.stack()

def jacobian(fx, x, parallel_iterations=10):
    '''
    Given a tensor fx, which is a function of x, vectorize fx (via tf.reshape(fx, [-1])),
    and then compute the jacobian of each entry of fx with respect to x.
    Specifically, if x has shape (m,n,...,p), and fx has L entries (tf.size(fx)=L), then
    the output will be (L,m,n,...,p), where output[i] will be (m,n,...,p), with each entry denoting the
    gradient of output[i] wrt the corresponding element of x.
    '''
    return map(lambda fxi: tf.gradients(fxi, x)[0],
               tf.reshape(fx, [-1]),
               dtype=x.dtype,
               parallel_iterations=parallel_iterations)

I am using this because it supports dynamic sizes, that is, one of the dimensions can be None.

However, this simple test

    import numpy
    import tensorflow as tf
    from numpy.random import randn

    numpy.random.seed(0)

    # Here is how everything would look in numpy
    x = randn(3, 3)
    A = randn(2, 3)
    y = numpy.dot(A,numpy.dot(x,x))

    # and in tensorflow... 
    xtf = tf.constant(x, tf.float64)
    Atf = tf.constant(A, tf.float64)
    ytf = tf.matmul(tf.matmul(Atf, xtf), xtf)

    with tf.Session() as sess:

        # Now let's try to compute the jacobian 
        dydx = jacobian(ytf, xtf)
        print(sess.run(dydx))
        
        # and the hessian... 
        d2ydx2 = tf.squeeze(jacobian(dydx, xtf))
        print(sess.run(d2ydx2))

throws the error

ValueError: Cannot use 'while_1/while/gradients/f_count_1' as input to 'while_1/while/gradients/f_count' because they are in different while loops. See info log for more details.

Does anybody know the issue here?
The first jacobian is correct. The second one (essentially the hessian) throws an error.

Contributor

mholzel commented Feb 9, 2018

I just tested the previous code on the nightly Docker build, and the error remains.

dancasas commented Apr 24, 2018

@mholzel I wanted to thank you for that jacobian implementation; it seems to work perfectly for a problem I am working on. In my opinion, this capability should be pushed to the main TensorFlow branch, as it is useful in many problems.

marcociccone commented Apr 27, 2018

@mholzel Have you found a way to nest the while loops to make it work?

Contributor

mholzel commented Apr 27, 2018

No. It seems like a really subtle bug in either the gradient or the while-loop implementation. That requires somebody with a lot more knowledge than I have about what is going on there.

marcociccone commented Apr 27, 2018

Damn, thanks anyway... Is there anyone who can help? Maybe we can bother @ebrevdo?

dancasas commented Apr 27, 2018

@marcociccone it seems to be working fine in TF 1.4, if that is an alternative for you.

Member

skye commented May 4, 2018

This is indeed a bug in TF. It's caused by calling tf.gradients(y, x) inside a while loop when the computation of y from x goes through a different while loop, something like:

x = ...
y = while_loop(..., [x])
z = while_loop(..., tf.gradients(y, x), [y])

So in @mholzel's script, it's from passing the outcome of one jacobian call to the other jacobian call. (BTW, thanks very much for the easy repro.)

Unfortunately this is quite tricky to fix. I'll try to take another look at it tomorrow and see if I can come up with something.

Member

skye commented May 5, 2018

This is actually extremely tricky to fix. I'm not sure how this was working in 1.4; was it definitely giving the right answer?

The fundamental problem is that the second jacobian call (i.e. the hessian) is calling tf.gradients() inside a while loop, and that backprop calculation must go through the while loop from the first jacobian call. TF computes while loop gradients using stacks to store intermediate loop values, so if you're doing that calculation multiple times via another loop, we'd have to somehow re-use the stack values on each iteration. This is conceptually possible but would be a pretty big change. I can at least try to improve the error message though.
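
[Editor's note: one possible workaround, offered only as a sketch under the assumption that tf.size(fx) is statically known: build the Jacobian with a plain Python loop and tf.stack instead of tf.while_loop. With no while loop in the graph, tf.gradients can be nested to get the Hessian, at the cost of a graph that grows with the number of outputs. The name jacobian_static is mine, not part of TensorFlow.]

import tensorflow as tf

def jacobian_static(fx, x):
    """Jacobian of fx wrt x built with a Python loop; requires fx to have a fully static shape."""
    flat = tf.reshape(fx, [-1])
    n = int(flat.shape[0])                                 # known at graph-construction time
    rows = [tf.gradients(flat[i], x)[0] for i in range(n)]
    return tf.stack(rows)                                  # shape: (n,) + x.shape

# Nesting works because no tf.while_loop is involved:
# hessian = jacobian_static(jacobian_static(y, x), x)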

Contributor

mholzel commented May 5, 2018

I have never seen the nested call work, but I did not try 1.4.

Joshuaalbert commented Jun 12, 2018

@mholzel for the double jacobian, suppose you just map to tf.hessians instead of tf.gradients. It doesn't solve the problem to arbitrary order, but it does get the hessian of a tensor wrt variables.
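
[Editor's note: for reference, a minimal sketch of that suggestion (my own example, assuming a scalar objective and a 1-D variable, the case tf.hessians handles most directly).]

import tensorflow as tf

x = tf.Variable([1.0, 2.0, 3.0])
loss = tf.reduce_sum(x ** 3)        # scalar function of x

hess = tf.hessians(loss, x)[0]      # shape (3, 3): d^2 loss / dx_i dx_j

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    print(sess.run(hess))           # diag([6., 12., 18.])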

agarwal-ashish commented Jul 12, 2018

There is now an experimental new approach to doing Jacobians here:
https://github.com/tensorflow/tensorflow/blob/master/tensorflow/python/ops/parallel_for/gradients.py#L28
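
[Editor's note: a rough usage sketch for anyone who wants to try that module. The import path and the expected output shape are my assumptions based on that file and may differ between versions; treat this as a guess rather than settled API.]

import tensorflow as tf
# Experimental module; path and signature may change between releases.
from tensorflow.python.ops.parallel_for.gradients import jacobian

x = tf.constant([[1.0, 2.0],
                 [3.0, 4.0]])
y = tf.matmul(x, x)

J = jacobian(y, x)          # expected shape: y.shape + x.shape = (2, 2, 2, 2)

with tf.Session() as sess:
    print(sess.run(J).shape)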

Member

tensorflowbutler commented Aug 15, 2018

Nagging Assignee @skye: It has been 32 days with no activity and this issue has an assignee. Please update the label and/or status accordingly.
