
Provide unaggregated gradients tensors #4897

Closed
seerdecker opened this issue Oct 11, 2016 · 33 comments
@seerdecker seerdecker commented Oct 11, 2016

As described here, TF is inflexible when it comes to access to the gradients:
http://stackoverflow.com/questions/35731506/unaggregated-gradients-gradients-per-example-in-tensorflow?rq=1

Please provide a method where the user can retrieve the raw gradients, not the averaged gradients. Requiring the user to compute their own gradients is impractical -- the framework should work for the user, not the other way around.

Use case: this is needed for reinforcement learning, where the gradients of one net need to be backpropagated through another net (in separate steps).


@yaroslavvb yaroslavvb commented Oct 11, 2016

TensorFlow is just a regular automatic differentiation system: it gives you the gradient you ask for. If you ask for the gradient of the mean loss over all examples, the gradient is aggregated; if you ask for the gradient of the loss on a single example, you get the "unaggregated" gradient for that particular datapoint. Naively, if you have a batch of size k, you could make k tf.gradients calls, and that would give you the gradient for each of the k examples. You can make this more efficient by reformulating your task; here's a post from @goodfeli. For networks with conv layers it's more tricky.
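To make the aggregation point concrete, here is a minimal numpy sketch (not TF code; the toy model, data, and names are invented for illustration) in which k per-example gradients are computed one at a time, and their mean matches the gradient of the mean loss:

```python
import numpy as np

# Toy linear model: per-example loss l_i = (x_i . w - y_i)^2.
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 3))   # batch of k = 4 examples
y = rng.normal(size=4)
w = rng.normal(size=3)

# "k gradient calls": one analytic gradient per example.
per_example_grads = np.stack([2 * (X[i] @ w - y[i]) * X[i] for i in range(4)])

# What a single aggregated call on the mean loss would return.
mean_grad = 2 * (X @ w - y) @ X / 4

assert np.allclose(per_example_grads.mean(axis=0), mean_grad)
```

The per-example gradients carry an extra leading batch dimension; averaging over it recovers the aggregated gradient.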


@seerdecker seerdecker commented Oct 11, 2016

I saw the article. This isn't what I'm looking for, since I'd have to do all the computations by hand, which is difficult and error-prone. Doing k tf.gradients() calls is slow.

Obtaining the batch of gradients is required for many algorithms. It really should be something that's easy to do.


@yaroslavvb yaroslavvb commented Oct 11, 2016

[tf.gradients(cost, ...) for cost in example_costs] is the way to get gradients for multiple costs. If that's too slow, it would be useful to provide timelines to see where the slowness happens.


@seerdecker seerdecker commented Oct 11, 2016

If I follow what you're saying, example_costs would be defined at graph instantiation time, right? So that forces a fixed batch size.


@yaroslavvb yaroslavvb commented Oct 11, 2016

example_costs in my example is a Python list of costs you want to differentiate with respect to, and you can change this list at runtime. For efficiency, you may want a small number of per-example-cost gradient graphs that you reuse at runtime.

Generally though, the "retrieve raw gradients" request is ill-specified: there is no place in TensorFlow where "per-example gradients" are added together, because there's no notion of an "example" when differentiating. That is, you can encode your examples as rows of a data matrix, or as columns, or in some other way, and that would give you the same gradient but different sets of "per-example gradients". The AD system of TensorFlow gives you the freedom to choose the example encoding and does not need to know about your choice, which I think is what's needed from a general computation framework.

That said, it would be useful to see implementations of methods like Ian's, or others that give ways of computing per-example gradients efficiently.


@seerdecker seerdecker commented Oct 11, 2016

Thanks for the answer. Yes, some example implementations would come in handy.

I don't understand how TF is providing freedom here. The implicit assumption that the last matrix rank corresponds to the batch instances seems to be built into the design. For instance, how could you pack your data in rows in a matrix and pass that through a convolutional layer? The convolutional layer imposes semantics on the content of your tensors.

The notion of a batch of gradients is well-defined in general (not speaking about TF specifically). You have a batch of inputs. The output of the network is a vector containing the loss value for each input. The batch of gradients is a tensor containing one entry per (input, loss) pair.

In TF, in terms of semantics, I think all you'd have to do is to allow the user to specify a vector loss instead of a scalar loss in tf.gradients(), and then retrieve the requested batch of gradients.

In practice TensorFlow is a deep learning framework, and the AD system is subservient to that goal. If it's hard to retrieve something as fundamental as a batch of gradients, it seems to me that something is missing in the API.


@goodfeli goodfeli commented Oct 11, 2016

On the contrary, it's because it's an AD framework rather than a DL framework that it is somewhat difficult to get the batch gradients. The gradient of an individual example is a deep learning concept. As a pure math engine, tensorflow just exposes the concept of the gradient with respect to a specific tensor. Because there's no tensor representing the parameters as used on a single example, the AD engine doesn't have a concept for the DL value you want.

This actually does have a reasonably efficient solution that doesn't require n tf.gradients calls. Do n calls to convolution, wrapping the same parameters in a different identity op each time. Then do one tf.gradients call getting the gradient with respect to each of the identity copies.


@seerdecker seerdecker commented Oct 11, 2016

Alright, thanks for the help.


@yaroslavvb yaroslavvb commented Oct 11, 2016

Baking the NN assumptions in too deeply is why DistBelief ended up so inflexible. In TensorFlow you can minimize a loss like sum(XWX'), where X is a single matrix-valued example, so that gives you a per-example gradient. Allowing tf.gradients to differentiate with respect to a vector-valued tensor seems like a good idea; that would make it simpler to compute Hessians as well as per-example gradients.
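As a quick check of the sum(XWX') framing, here is a numpy sketch (the sizes and names are my own illustration) that derives the gradient with respect to W analytically for one matrix-valued example X, then verifies it by finite differences:

```python
import numpy as np

# Gradient of sum(X W X^T) w.r.t. W for a single matrix-valued example X.
rng = np.random.default_rng(5)
X = rng.normal(size=(4, 3))
W = rng.normal(size=(3, 3))

# sum(X W X^T) == u^T W u with u = X^T 1, so the gradient is outer(u, u).
u = X.T @ np.ones(4)
grad_W = np.outer(u, u)

# Finite-difference check of one entry.
eps = 1e-6
Wb = W.copy(); Wb[0, 1] += eps
fd = ((X @ Wb @ X.T).sum() - (X @ W @ X.T).sum()) / eps
assert abs(fd - grad_W[0, 1]) < 1e-3
```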


@asimshankar asimshankar commented Oct 21, 2016

Closing this due to inactivity (and it seems that there was a healthy discussion). Feel free to re-open if you think something needs to change here.


@jpiabrantes jpiabrantes commented Nov 1, 2016

When I ask TensorFlow for the gradient of a function that returns a scalar for every example, tf.gradients just gives me one gradient instead of a gradient per example. My "loss" function is not a sum or an average, so I can't understand this behaviour. Simple example:

import numpy as np
import tensorflow as tf  # TF 1.x-era APIs

x = tf.placeholder(tf.float32, [None, 3], name="input")
W1 = tf.get_variable("W1", shape=[3, 1],
                     initializer=tf.contrib.layers.xavier_initializer())
output = tf.nn.relu(tf.matmul(x, W1))
grads = tf.gradients(output, [W1])  # non-scalar output: TF sums over the batch
sess = tf.InteractiveSession()
init = tf.initialize_all_variables()
sess.run(init)
print(sess.run(grads, feed_dict={x: np.random.rand(10, 3)}))
# >> [array([[ 0.83393538],
#            [ 0.16146532],
#            [ 0.43589821]], dtype=float32)]

@jmacglashan jmacglashan commented Nov 22, 2016

I agree that it would be useful for various algorithms to be able to get each individual gradient per batch, and it doesn't seem easy to do. As jpiabrantes commented above, the issue is not limited to asking for the gradient of a tensor that does a batch reduction. Of course, the gradient of an aggregation operation should itself be aggregated; the problem is that tf.gradients also aggregates even when the output of the tensor with respect to which you are differentiating doesn't aggregate.

I understand why it is the way it is for mathematical generality, but the very fact that tf.gradients has a flag for how aggregation is done suggests that we should be able to specify a batch mode where it does not aggregate the gradients and instead returns an NxMx... tensor, where N is the number of gradients over which it would normally aggregate and Mx... is the shape of the variable.

@goodfeli goodfeli commented Apr 2, 2017


@altosaar altosaar commented Apr 25, 2017

Thank you @goodfeli !

It would be very useful to support this in tensorflow for calculating second-order gradients (e.g. fisher information, etc).

Any hope for official support or is this the recommended solution?

cc @datang1992


@goodfeli goodfeli commented Apr 25, 2017


@yaroslavvb yaroslavvb commented Apr 25, 2017

@altosaar I recently had to compute per-example gradients, and ended up with the implementation here. To make things fast I had to drop tf.gradients and compute things manually. There's a two-orders-of-magnitude speed-up if you use specialized structures; specifically, use the "khatri-rao" product, which gives you a batch of matmul gradients given a batch of activations plus a batch of backprops. It can be computed efficiently through einsum.
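The einsum trick can be sketched in plain numpy (toy shapes and names are mine): for a dense layer, a batch of activations and a batch of backprops give the per-example weight gradients in one einsum, and summing them over the batch recovers the usual aggregated gradient:

```python
import numpy as np

rng = np.random.default_rng(1)
n, d_in, d_out = 5, 4, 3
acts = rng.normal(size=(n, d_in))        # layer inputs, one row per example
backprops = rng.normal(size=(n, d_out))  # dL/d(pre-activation), one row per example

# Per-example gradients of W (shape n x d_in x d_out) in one einsum --
# a batched outer product, one per batch row.
per_example = np.einsum('ni,nj->nij', acts, backprops)

# The usual aggregated gradient is their sum over the batch.
aggregated = acts.T @ backprops
assert np.allclose(per_example.sum(axis=0), aggregated)
```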


@altosaar altosaar commented Apr 25, 2017

Awesome, thank you @goodfeli. @yaroslavvb that implementation is extremely helpful - exactly what we're trying to do 👍 👍 👍

Also cc @ebrevdo based on past discussions!


@shoyer shoyer commented May 6, 2017

See #675 (comment) for a proposal for tf.jacobian.

I'm not sure this is exactly the issue discussed here, but I'll just note how I solved my "per example gradients" use case with the helper function in the linked comment. With scalar input/output, jacobian(outputs, inputs) calculates the right result. For batched inputs/outputs, jacobian(tf.reduce_sum(outputs, 0), inputs) gives me the result I'm looking for, because each inputs example only affects the corresponding entry in outputs, so the other entries drop out when differentiating the sum.
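Why the reduce_sum trick works can be checked numerically; in this numpy sketch (illustrative function and names of my choosing), each example only affects its own output entry, so the gradient of the summed output stacks the per-example gradients row by row:

```python
import numpy as np

# Per-example "loss": outputs[i] = sum(inputs[i] ** 2); each example only
# affects its own output, so d(sum(outputs))/d(inputs) stacks the
# per-example gradients row by row.
rng = np.random.default_rng(2)
inputs = rng.normal(size=(4, 3))

def total(z):
    return (z ** 2).sum()  # plays the role of tf.reduce_sum(outputs, 0)

# Gradient of the summed output; row i is the gradient for example i.
grad_of_sum = 2 * inputs

# Finite-difference check of one entry.
eps = 1e-6
bumped = inputs.copy(); bumped[1, 2] += eps
fd = (total(bumped) - total(inputs)) / eps
assert abs(fd - grad_of_sum[1, 2]) < 1e-4
```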


@sitzikbs sitzikbs commented Jun 25, 2017

@goodfeli, how would the code given above change if I wish to get the gradient within a specific layer? (that is, between the output of the layer and the weights at the input of the layer)

examples = tf.split(batch, batch_size)
weight_copies = [tf.identity(weights) for x in examples]
output = tf.stack([f(x, w) for x, w in zip(examples, weight_copies)])
cost = cost_function(output)
per_example_gradients = tf.gradients(cost, weight_copies)


@goodfeli goodfeli commented Jun 28, 2017

The mathematical concept of a gradient is defined only when the output is a single scalar. (If you give TensorFlow an output that isn't a scalar, it will give you the gradient of the sum). You probably want a Jacobian. So you'd need to make a loop that computes the gradient once per entry in the output.
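The per-entry loop can be sketched in numpy with analytic gradients (toy sizes and names are mine): for y = W @ x, the gradient of each output entry y[k] with respect to x is the k-th row of W, and stacking those per-entry gradients yields the Jacobian:

```python
import numpy as np

# Jacobian of a vector-valued output, assembled one gradient at a time --
# one "gradient call" per output entry, done analytically for y = W @ x.
rng = np.random.default_rng(4)
W = rng.normal(size=(3, 5))
x = rng.normal(size=5)

# grad of y[k] w.r.t. x is W[k]; stack the per-entry gradients into J.
J = np.stack([W[k] for k in range(3)])

# Finite-difference check of one input coordinate (column 2 of J).
eps = 1e-6
xb = x.copy(); xb[2] += eps
fd = ((W @ xb) - (W @ x)) / eps
assert np.allclose(fd, J[:, 2], atol=1e-4)
```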


@sitzikbs sitzikbs commented Jun 29, 2017

Thank you for the reply.
I think my follow-up question is more relevant on Stack Overflow, so I posted it there. If you get a chance I would really appreciate the help, since no one has answered it in a while. (The basic idea is that I wish to use the gradients of one layer as inputs to another layer, like Fisher Vectors, but I really can't get it working.)
Here is the link to my question there.


@heimanba89 heimanba89 commented Apr 12, 2018

Hi @goodfeli @yaroslavvb, I have a problem that is very similar to per-sample gradients. tf.gradients() currently outputs the sum of gradients over the batch; is it possible to output the sum of an element-wise power of the gradients, i.e., the sum over the batch of (dy/dx)^2 instead of the current sum over the batch of dy/dx?


@joshuawchen joshuawchen commented May 14, 2018

@goodfeli As far as I understand, one needs to split the input variable into batch_size tensors. However, one would need to know the batch_size (as an integer) a priori when constructing this with tf.split. Is there an easy way to use your pseudocode, since we usually don't have the batch_size a priori?


@pkadambi pkadambi commented Oct 17, 2018

@sitzikbs Did you end up using something similar to the pseudocode that @goodfeli suggested? If so, was this fast, and does it work on batch level ops like batch norm?


@sitzikbs sitzikbs commented Oct 18, 2018

> @sitzikbs Did you end up using something similar to the pseudocode that @goodfeli suggested? If so, was this fast, and does it work on batch-level ops like batch norm?

@pkadambi I ended up computing the gradients manually in an external function; it's in the 3dmfv function in the tf_utils file at this repo: link


@Lockenlui Lockenlui commented Nov 22, 2018

> This is only pseudocode, but the basic idea is:
>
>     examples = tf.split(batch, batch_size)
>     weight_copies = [tf.identity(weights) for x in examples]
>     output = tf.stack([f(x, w) for x, w in zip(examples, weight_copies)])
>     cost = cost_function(output)
>     per_example_gradients = tf.gradients(cost, weight_copies)

Can somebody please elaborate the pseudo code given by @goodfeli a little bit more. A small example for a tiny MLP would be very nice!
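Not authoritative, but here is one way the idea can be worked through for a tiny MLP in plain numpy, with the backward pass written by hand so the per-example gradients are kept instead of summed (all names and shapes here are my own illustration):

```python
import numpy as np

rng = np.random.default_rng(6)
n, d, h = 4, 3, 2
X = rng.normal(size=(n, d))
W1 = rng.normal(size=(d, h))
W2 = rng.normal(size=(h, 1))

def loss_per_example(X, W1, W2):
    hidden = np.maximum(X @ W1, 0.0)   # one-hidden-layer ReLU MLP
    return (hidden @ W2).ravel() ** 2  # scalar loss per example

# Manual backprop, kept per-example instead of summed over the batch.
hidden = np.maximum(X @ W1, 0.0)
out = (hidden @ W2).ravel()
d_out = 2 * out                                  # dL_i / d(out_i)
d_hidden = d_out[:, None] * W2.T                 # (n, h)
d_pre = d_hidden * (hidden > 0)                  # ReLU mask
grads_W2 = (hidden * d_out[:, None])[:, :, None]  # (n, h, 1)
grads_W1 = np.einsum('nd,nh->ndh', X, d_pre)      # (n, d, h)

# Finite-difference check of one per-example entry of grads_W1.
eps = 1e-6
W1b = W1.copy(); W1b[1, 0] += eps
fd = (loss_per_example(X, W1b, W2)[2] - loss_per_example(X, W1, W2)[2]) / eps
assert abs(fd - grads_W1[2, 1, 0]) < 1e-3
```

Summing grads_W1 and grads_W2 over their leading axis gives the usual batch gradients.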


@mgbukov mgbukov commented May 24, 2019

is there an efficient tflow way to compute the following covariance matrix M_{xy} of derivatives:

M_{xy} = Cov(\partial_x f(x,y), \partial_y g(x,y))
       = <\partial_x f(x,y) \partial_y g(x,y)> - <\partial_x f(x,y)> <\partial_y g(x,y)>

where f(x,y), g(x,y) are differentiable functions of some vectors x and y, \partial_x is the gradient along x, and <> is the minibatch average.

The second part of the expression, <\partial_x f(x,y)>*<\partial_y g(x,y)>, can be readily computed using tf.gradients. However, the first part, <\partial_x f(x,y) \partial_y g(x,y)>, represents one of those instances discussed above where taking the partial derivatives does not commute with the minibatch average.

From the discussion above it sounds like this type of operation is what TensorFlow should be able to do.
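With analytic per-example gradients in hand, the covariance above reduces to two einsum/outer-product averages. A numpy sketch, using illustrative choices f(x,y) = (x.y)^2 and g(x,y) = x.y per example (functions, shapes, and names are my own):

```python
import numpy as np

rng = np.random.default_rng(3)
n, d = 6, 3
X = rng.normal(size=(n, d))
Y = rng.normal(size=(n, d))

# f(x, y) = (x . y)^2 and g(x, y) = x . y per example, so analytically
# df/dx = 2 (x . y) y  and  dg/dy = x.
s = np.einsum('nd,nd->n', X, Y)  # x_i . y_i for each example
df_dx = 2 * s[:, None] * Y       # (n, d)
dg_dy = X                        # (n, d)

# M_xy = <df_dx dg_dy^T> - <df_dx> <dg_dy>^T, averaging over the minibatch.
first = np.einsum('na,nb->ab', df_dx, dg_dy) / n
M = first - np.outer(df_dx.mean(axis=0), dg_dy.mean(axis=0))
assert M.shape == (d, d)
```

The first term is exactly the piece that needs per-example gradients; the second factorizes into two aggregated gradients.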


@yaroslavvb yaroslavvb commented May 26, 2019

@mgbukov using tf.gradients for this in TensorFlow will be inefficient; more generally, autograd in all major neural network frameworks is specialized to the "single output summed over examples" case. When I needed access to individual gradients I had to derive the formulas by hand in terms of einsum ops. JAX has a better, automatic way of doing it through vmap.


@agarwal-ashish agarwal-ashish commented Jun 17, 2019

tf.vectorized_map can be used to efficiently compute per-example gradients.


@ndrmnl ndrmnl commented Dec 13, 2019

Just to add a remark: our team at Owkin has studied this and done some benchmarks; please see the report on arXiv and sample code at this repo on GitHub. Our analysis is focused on CNNs; we propose a new approach which combines Goodfellow's trick with a PyTorch peculiarity (the groups argument in the convolution), and show that it is sometimes faster than simply recreating pointers to the original model (the approach proposed by @goodfeli above, and what tf.vectorized_map is effectively doing). Unfortunately, it is unclear how much of it applies to TensorFlow, as there's no groups argument there! Please let us know if you have any feedback.


@agarwal-ashish agarwal-ashish commented Dec 14, 2019

It is not obvious to me whether "multi" in your paper is the same as what a tf.vectorized_map-based graph rewrite would translate to, so it would be worth benchmarking that as well. It has been tested on complex networks and provides order-of-magnitude speedups due to vectorization-based rewrites.

The problem, as you hint at, is that we have not implemented an efficient vectorization of Conv2DBackpropFilter, which is what is required for your case to run fast. See
https://github.com/tensorflow/tensorflow/blob/master/tensorflow/python/ops/parallel_for/pfor.py#L1768
which falls back to a while loop.

Creating a new kernel for this (or perhaps mapping it to an existing kernel) should do the trick, and might be akin to adding the groups argument you mentioned.


@ndrmnl ndrmnl commented Dec 16, 2019

Thanks for the comments @agarwal-ashish!

> It is not obvious to me if "multi" in your paper is the same as what tf.vectorized_map based graph rewrite would translate to. So it will be worth benchmarking that as well.

That's a very good point; I imagine there are some optimizations in tf.vectorized_map since it's graph-based. We did all our benchmarks in PyTorch, but it is definitely a good idea to do cross PyTorch/TensorFlow benchmarks at some point.

> The problem, as you hint at, is that we have not implemented an efficient vectorization of Conv2DBackpropFilter, which is what is required for your case to run fast.

Indeed, that is the main issue in porting it to TensorFlow. We might look into implementing a custom kernel for this; thanks for the tip :)
