
Gradients of non-scalars (higher rank Jacobians) #675

Closed
zackchase opened this issue Jan 4, 2016 · 68 comments

Labels: type:feature Feature requests

Comments

@zackchase (Contributor)

Currently, if you call gradients(ys, xs), it will return the sum of dy/dx over all ys for each x in xs. I believe this doesn't accord with an a priori mathematical notion of the derivative of a vector. I'd like a way to take the derivative of ys wrt xs where both are vectors and have a Jacobian matrix returned. By extension, I'd like to take the derivative of a vector wrt a matrix and get back a 3-tensor. There doesn't seem to be a convenient TensorFlow function to compute the Jacobian or higher order derivatives. Am I missing something, or is this functionality that we could add?
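For concreteness, here is a minimal sketch of the summing behavior I mean (TF 1.x graph-mode style; the printed value is what I would expect, not something verified here):

```python
import tensorflow as tf

x = tf.constant([1.0, 2.0, 3.0])
y = x * x  # elementwise square, so the true Jacobian is diag(2 * x)

# tf.gradients gives d(sum(y))/dx, i.e. the Jacobian rows summed together,
# rather than the full 3x3 Jacobian matrix.
g = tf.gradients(y, x)[0]

with tf.Session() as sess:
    print(sess.run(g))  # [2. 4. 6.] -- a single vector, not a matrix
```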

@keveman (Contributor) commented Jan 4, 2016

zackchase@, you are right about the current gradients function. Currently, you can compute the Jacobian of, say, a vector by calling gradients multiple times, once for every scalar component (obtained by slicing) of the original vector, and reassembling the results. Contributions are welcome to make this nicer and more efficient.
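Roughly, a sketch of that workaround (assuming a small, statically known number of output elements; the names here are just illustrative):

```python
import tensorflow as tf

def jacobian_by_slicing(y, x):
    # One tf.gradients call per scalar component of y, then stack the rows.
    # Assumes y has a small, statically known number of elements.
    y_flat = tf.reshape(y, [-1])
    n = y_flat.shape.as_list()[0]
    rows = [tf.gradients(y_flat[i], x)[0] for i in range(n)]
    return tf.stack(rows)  # shape: [num_elements(y)] + shape(x)

x = tf.constant([1.0, 2.0])
y = tf.stack([x[0] * x[1], x[0] + x[1], x[0] ** 2])
J = jacobian_by_slicing(y, x)  # 3x2 Jacobian of y w.r.t. x
```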

keveman added the "stat:contribution welcome" label Jan 4, 2016
@girving (Contributor) commented Jan 4, 2016

It'd be pretty hard to support gradients of non-scalars with our current setup, since it would require every gradient function to handle extra-rank input. The one possibility I could see would be to add some sort of map facility for registering how to add extra ranks to ops, then compute gradients over the extra ranks by computing the lower-rank gradient and calling the registered map transformations.

Someone asked for map a while back, so if anyone wanted to tackle this task, that might be the way to go. Handling it at the gradient-function level is probably bad, since it would add required complexity to an existing feature. Warning: this is a pretty large change, so a good deal of discussion would be in order before starting.

girving changed the title from "Problems Calculating the Jacobian" to "Gradients of non-scalars (higher rank Jacobians)" Jan 4, 2016
@zackchase (Contributor, Author)

Hi Geoffrey, thanks for taking an interest in this issue. I was initially confused by the use of "rank" to describe the number of dimensions of the array. Should we avoid this term in the thread title and documentation to preempt confusion with the linear-algebra notion of rank?

@girving (Contributor) commented Jan 4, 2016

Tensor rank is very standard terminology: http://mathworld.wolfram.com/TensorRank.html

@zackchase (Contributor, Author)

Cool. The terminology gets funny when we talk about rank-R decompositions of tensors, meaning a tensor that can be represented as a sum of R outer products of rank-1 tensors, but that's probably not a problem for us to solve here.

One thing I thought of is that I would like to compute the Frobenius norm of the Jacobian of the log probabilities for use as a smoothness penalty, much like the one used in a contractive autoencoder. In this case, since we only seek a scalar at the end, is there a more efficient method than separately calculating the derivative of each output with respect to the inputs?

@girving (Contributor) commented Jan 4, 2016

Are you saying your network has a bunch of outputs, and then you combine them into a single scalar that you are trying to optimize? In that case, you should differentiate with respect to that single scalar.

@zackchase (Contributor, Author)

Not exactly. I'm saying one might want to penalize the norm of the Jacobian of the mapping function.
So the optimization objective would be (pseudocode):

cost(y, yhat, X) = loss(y, yhat) + norm(Jacobian(log(yhat), X))
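Spelled out a little more, a rough sketch of that penalty with the current tf.gradients, one call per output component (using the squared Frobenius norm for simplicity; `X`, `yhat`, and the weight `lam` are just placeholders from the pseudocode):

```python
import tensorflow as tf

def jacobian_frobenius_penalty(log_yhat, X):
    # Squared Frobenius norm of d(log_yhat)/dX, built from one tf.gradients
    # call per output component. Only practical for a small number of outputs
    # with a statically known shape.
    components = tf.unstack(tf.reshape(log_yhat, [-1]))
    squared = [tf.reduce_sum(tf.square(tf.gradients(c, X)[0])) for c in components]
    return tf.add_n(squared)

# cost = loss(y, yhat) + lam * jacobian_frobenius_penalty(tf.log(yhat), X)
```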

@girving (Contributor) commented Jan 4, 2016

Ah, sorry for not reading carefully. You're correct that (as far as I know) there's no easy way to do that in current TensorFlow. According to someone more knowledgeable than I, people generally implement such contractive autoencoders by writing out the first derivative manually. Also, they generally restrict it to one layer at a time for speed reasons, since computing the full Jacobian of a multilayer network is quite expensive.

@zackchase (Contributor, Author)

Regardless, it would be good to have a way to take derivatives of vectors and receive gradients of the expected shape.

@yaroslavvb (Contributor)

Differentiating one output variable at a time is similar to how it works in Theano. I agree it may be confusing when TensorFlow automatically turns many variables into one by taking the sum. An alternative would be to fail if there's more than one output variable specified, or to have a wrapper that automatically calls the existing gradient function on each output variable.

The reason for "one output variable at a time" in TensorFlow (and Theano) is that we do reverse-mode AD by default. In reverse AD you have a single target scalar quantity and you propagate sensitivities with respect to that quantity. In contrast, if we did forward AD instead, we would naturally support multiple output variables, but only compute the derivative with respect to one scalar input variable at a time. Supporting mixed-mode propagation to cover the "multiple inputs / multiple outputs" case in the most efficient way could be a lot of extra plumbing.

If you have a small number of output variables but a large number of input variables, the standard thing to do is to apply reverse AD to each output variable in a loop. This is what Theano recommends for computing the Hessian, for instance: http://deeplearning.net/software/theano/tutorial/gradients.html#computing-the-hessian. If you have a small number of input variables but a large number of output variables, the most efficient thing to do would be to run forward-mode AD for each input variable in a loop. Forward-mode AD is not implemented and would require adding an equivalent of Theano's "Rop" operator to differentiable ops, plus some plumbing to call them instead of the existing op "gradient" function (the existing gradient function is an equivalent of the Lop operation, i.e. "left-multiply the sensitivity vector by the op's Jacobian").
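In TensorFlow, that reverse-AD-in-a-loop recipe for the Hessian looks roughly like this (a sketch only; each tf.gradients call is one reverse-mode pass, and tf.hessians in later versions does essentially this):

```python
import tensorflow as tf

x = tf.Variable([1.0, 2.0, 3.0])
loss = tf.reduce_sum(x ** 3)  # any scalar objective

grad = tf.gradients(loss, x)[0]  # first reverse pass: d(loss)/dx = 3 * x**2

# One more reverse pass per component of the gradient; each call yields one
# row of the Hessian. For this loss the result is diag(6 * x).
hessian = tf.stack([tf.gradients(grad[i], x)[0] for i in range(3)])
```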

@tillahoffmann (Contributor)

I was hoping to implement higher-order derivatives using the map function but am getting an error message I can't quite get my head around. My implementation is (in pseudocode)

params = tf.Variable("some initial value")
loss = some_function(params)
grads = tf.gradients(loss, params)[0]
hess = tf.map_fn(lambda grad: tf.gradients(grad, params)[0], grads)

When I fetch the Hessian, I get the error message

InvalidArgumentError: All inputs to node map/while/gradients/map/TensorArrayUnpack_grad/TensorArrayGrad/TensorArrayGrad must be from the same frame.

I assumed that TensorFlow has an issue because it doesn't know about params in the loop (cf. non_sequences in Theano's scan), and extended map_fn to pass extra arguments to the loop. Unfortunately, the extra arguments get wrapped in an identity transformation, and tf.gradients(params, tf.identity(params)) gives [None], which seems a bit unintuitive.

Looping in Python is of course fine, but I'd like to avoid introducing an extra node to the graph for every parameter. Any suggestions?

@girving (Contributor) commented Apr 28, 2016

@yuanbyu: Do you understand this issue with tf.map_fn?

girving added the triaged label Jun 8, 2016
@girving (Contributor) commented Jun 9, 2016

Note for anyone who comes across this thread: tf.map_fn is an unrelated thing involving control flow, not something related to mapping over extra rank tensors.

@yuanbyu (Contributor) commented Aug 24, 2016

We don't support higher-order gradients for while_loop/map_fn/scan/fold. You should see an informative error message if you try to do that.

yuanbyu closed this as completed Aug 24, 2016
@vladfi1 (Contributor) commented Oct 3, 2016

@yaroslavvb Any plans on adding forward mode AD? I filed an issue on it a couple weeks ago but haven't heard back.

@yaroslavvb (Contributor) commented Oct 3, 2016

@vladfi1 I'm no longer at Brain, so I wouldn't know. I would say it is unlikely to ever be part of core TensorFlow. There are >450 ops in TF, so the Brain team would have to implement a forward-AD grad method for all 450 ops and maintain them forever, or alternatively have to explain why someone's favorite op doesn't have forward-AD support. It seems more realistic that someone would create a separately maintained library that does forward AD and uses TensorFlow as the backend, kind of like autograd but with TensorFlow instead of numpy underneath.

@myaooo commented Feb 20, 2017

Is tf.test.compute_gradient some kind of function we can use to get the Jacobian matrix (not as a tensor but as a numpy.ndarray) of a vector tensor y w.r.t. a vector tensor x?

@yaroslavvb (Contributor) commented Feb 20, 2017 via email

@myaooo commented Feb 21, 2017

@yaroslavvb Thanks for your reply. I see why it's expensive to do that.
Do you have any suggestions if I want to get the numerical value of the Jacobian for specific inputs x? The only not-too-expensive workaround I can think of is to apply a perturbation to each dimension of x to get an approximate result.
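Something along these lines is what I have in mind (plain numpy central differences; `f` would be a small wrapper that feeds x into the graph and fetches y):

```python
import numpy as np

def numerical_jacobian(f, x, eps=1e-6):
    # Central-difference approximation of dy/dx, one pair of evaluations of f
    # per input dimension. f: 1-D numpy array -> 1-D numpy array.
    x = np.asarray(x, dtype=np.float64)
    y0 = np.asarray(f(x))
    J = np.zeros((y0.size, x.size))
    for j in range(x.size):
        dx = np.zeros_like(x)
        dx[j] = eps
        J[:, j] = (np.asarray(f(x + dx)) - np.asarray(f(x - dx))) / (2.0 * eps)
    return J
```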

@dancasas

@marcociccone it seems to be working fine in TF 1.4, if that is an alternative for you.

@skye (Member) commented May 4, 2018

This is indeed a bug in TF. It's caused by taking tf.gradients(y, x) inside a while loop such that the computation of y from x goes through a different while loop, something like:

x = ...
y = while_loop(..., [x])
z = while_loop(..., tf.gradients(y, x), [y])

So in @mholzel's script, it's from passing the outcome of one jacobian call to the other jacobian call. (BTW, thanks very much for the easy repro.)

Unfortunately this is quite tricky to fix. I'll try to take another look at it tomorrow and see if I can come up with something.

skye self-assigned this and unassigned agarwal-ashish May 4, 2018
tensorflowbutler removed the stat:awaiting tensorflower label May 4, 2018
@skye (Member) commented May 5, 2018

This is actually extremely tricky to fix. I'm not sure how this was working in 1.4; was it definitely giving the right answer?

The fundamental problem is that the second jacobian call (i.e. the Hessian) calls tf.gradients() inside a while loop, and that backprop calculation must go through the while loop from the first jacobian call. TF computes while-loop gradients using stacks to store intermediate loop values, so if you're doing that calculation multiple times via another loop, we'd have to somehow re-use the stack values on each iteration. This is conceptually possible but would be a pretty big change. I can at least try to improve the error message, though.

@mholzel (Contributor) commented May 5, 2018

I have never seen the nested call work, but I did not try 1.4.

@Joshuaalbert

@mholzel for the double Jacobian, suppose you just map to tf.hessians instead of tf.gradients. It doesn't solve the problem to arbitrary order, but it does get the Hessian of a tensor wrt variables.

@agarwal-ashish

There is now an experimental new approach to doing Jacobians here:
https://github.com/tensorflow/tensorflow/blob/master/tensorflow/python/ops/parallel_for/gradients.py#L28
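A rough usage sketch (the import path and function name are taken from the linked file; treat the exact signature as unverified and subject to change):

```python
import tensorflow as tf
from tensorflow.python.ops.parallel_for.gradients import jacobian  # experimental

x = tf.constant([1.0, 2.0, 3.0])
y = tf.stack([x[0] * x[1], tf.reduce_sum(x ** 2)])

# Builds the full dy/dx using pfor rather than a Python-level loop over outputs.
J = jacobian(y, x)  # expected shape: [2, 3]
```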

@dancasas

> There is now an experimental new approach to doing Jacobians here:
> https://github.com/tensorflow/tensorflow/blob/master/tensorflow/python/ops/parallel_for/gradients.py#L28

Did anyone compare the performance of the new functionality and the previous solution posted by @mholzel here? I am finding the new jacobian included in gradients.py way slower than @mholzel's solution...

@agarwal-ashish

> > There is now an experimental new approach to doing Jacobians here:
> > https://github.com/tensorflow/tensorflow/blob/master/tensorflow/python/ops/parallel_for/gradients.py#L28
>
> Did anyone compare the performance of the new functionality and the previous solution posted by @mholzel here? I am finding the new jacobian included in gradients.py way slower than @mholzel's solution...

gradients_test.py in the same file has some benchmarks showing that the pfor-based jacobian generally works much faster than a while-loop-based one. Can you share your benchmark?

tensorflowbutler removed the stat:awaiting tensorflower label Oct 26, 2018
@GJBoth commented Oct 28, 2018

I threw together this function. It uses tf.map_fn and assumes batch-style input, but at least it works for batches. It's pretty high level but seems to address some of the ideas in the conversation above:

```python
def Jacobian(X, y):
    J = tf.map_fn(lambda m: tf.gradients(y[:, :, m:m+1], X)[0], tf.range(tf.shape(y)[-1]), tf.float32)
    J = tf.transpose(tf.squeeze(J), perm=[1, 0, 2])
    return J
```

@jvishnuvardhan (Contributor)

@zackchase Is this resolved? Please close it if it was resolved already. Thanks!

@agarwal-ashish

tf.GradientTape.jacobian and tf.GradientTape.batch_jacobian APIs have been added for computing Jacobians. These are based on the experimental pfor logic, which can also be disabled to fall back to a loop-based implementation.
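A minimal example (2.x-style eager code; the shapes in the comments are what I would expect):

```python
import tensorflow as tf

x = tf.constant([[1.0, 2.0], [3.0, 4.0]])
with tf.GradientTape(persistent=True) as tape:
    tape.watch(x)
    y = tf.sin(x) * 2.0  # elementwise, so each row of y depends only on the same row of x

J = tape.jacobian(y, x)         # full Jacobian, expected shape [2, 2, 2, 2]
Jb = tape.batch_jacobian(y, x)  # per-example Jacobian (dim 0 as batch), expected shape [2, 2, 2]
del tape  # release the persistent tape's resources
```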
