
Provide unaggregated gradients tensors #4897

Closed
seerdecker opened this Issue Oct 11, 2016 · 26 comments

@seerdecker

seerdecker commented Oct 11, 2016

As described here, TF is inflexible when it comes to access to the gradients:
http://stackoverflow.com/questions/35731506/unaggregated-gradients-gradients-per-example-in-tensorflow?rq=1

Please provide a method where the user can retrieve the raw gradients, not the averaged gradients. Requiring the user to compute their own gradients is impractical -- the framework should work for the user, not the other way around.

Use case: this is needed for reinforcement learning, where the gradients of one net need to be backpropagated through another net (in separate steps).

@yaroslavvb

Contributor

yaroslavvb commented Oct 11, 2016

TensorFlow is just a regular automatic differentiation system: it gives you the gradient that you ask for. If you ask for the gradient of the mean loss over all examples, the gradient is aggregated; if you ask for the gradient of the loss on a single example, it gives you the "unaggregated" gradient for that particular datapoint. Naively, if you have a batch of size k, you could make k tf.gradients calls, and that would give you the gradient for each of the k examples. You can make this more efficient by reformulating your task; here's a post from @goodfeli. For networks with conv layers it's trickier.
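A small NumPy sketch of the distinction (hypothetical linear model and data, not TensorFlow API): the aggregated gradient of the mean loss is just the mean of the per-example gradients.

```python
import numpy as np

# Per-example squared-error loss for a linear model y_i = x_i . w
rng = np.random.RandomState(0)
X = rng.rand(4, 3)            # batch of k=4 examples
w = rng.rand(3)
y = rng.rand(4)

# gradient of the per-example loss 0.5*(x_i.w - y_i)^2 w.r.t. w,
# one row per example -- the "unaggregated" gradients
per_example_grads = (X @ w - y)[:, None] * X      # shape (4, 3)

# gradient of the mean loss -- what a single tf.gradients call on
# the mean would return -- is the mean of the per-example gradients
mean_loss_grad = per_example_grads.mean(axis=0)
assert np.allclose(mean_loss_grad, ((X @ w - y) @ X) / 4)
```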

@seerdecker

seerdecker commented Oct 11, 2016

I saw the article. This isn't what I'm looking for, since I'd have to do all the computations by hand. It's difficult and error-prone. Doing "k" tf.gradients() calls is slow.

Obtaining the batch of gradients is required for many algorithms. It really should be something that's easy to do.

@yaroslavvb

Contributor

yaroslavvb commented Oct 11, 2016

[tf.gradients(cost....) for cost in example_costs] is the way to get gradients for multiple costs. If that's too slow, it might be useful to provide timelines to see where the slowness happens

@seerdecker

seerdecker commented Oct 11, 2016

If I follow what you're saying, example_costs would be defined at graph instantiation time, right? So that forces a fixed batch size.

@yaroslavvb

Contributor

yaroslavvb commented Oct 11, 2016

example_costs in my example is a Python list of costs you want to differentiate with respect to, and you can change this list at runtime. For efficiency, you may want a small number of per-example-cost gradient graphs and reuse them at runtime.

Generally though, the "retrieve raw gradient" request is ill-specified -- there's no place in TensorFlow where "per-example gradients" are added together, since there's no notion of "example" when differentiating. I.e., you can encode your examples as rows of a data matrix, or as columns, or in some other way, and that would give you the same gradient but different sets of "per-example gradients". The AD system of TensorFlow gives you the freedom to choose your example encoding and does not need to know about your choice, which I think is what's needed from a general computation framework.

That said, it would be useful to see implementations of methods like Ian's or others that give ways of computing per-example gradients efficiently

@seerdecker

seerdecker commented Oct 11, 2016

Thanks for the answer. Yes, some examples of implementations would come in handy.

I don't understand how TF is providing freedom here. The implicit assumption that the first tensor dimension corresponds to the batch instances seems to be built into the design. For instance, how could you pack your data in rows in a matrix and pass that through a convolutional layer? The convolutional layer imposes semantics on the content of your tensors.

The notion of a batch of gradients is well-defined in general (not speaking about TF specifically). You have a batch of inputs. The output of the network is a vector containing the loss value for each input. The batch of gradients is a tensor containing one entry per (input, loss) pair.

In TF, in terms of semantics, I think all you'd have to do is to allow the user to specify a vector loss instead of a scalar loss in tf.gradients(), and then retrieve the requested batch of gradients.

In practice, TensorFlow is a deep learning framework, and the AD system is subservient to that goal. If it's hard to retrieve something as fundamental as a batch of gradients, it seems to me that there's something missing in the API.

@goodfeli

goodfeli commented Oct 11, 2016

On the contrary, it's because it's an AD framework rather than a DL framework that it is somewhat difficult to get the batch gradients. The gradient of an individual example is a deep learning concept. As a pure math engine, TensorFlow just exposes the concept of the gradient with respect to a specific tensor. Because there's no tensor representing the parameters as used on a single example, the AD engine doesn't have a concept for the DL value you want.

This actually does have a reasonably efficient solution that doesn't require n tf.gradients calls. Do n calls to convolution, wrapping the same parameters in a different identity op each time. Then do one tf.gradients call getting the gradient with respect to each of the identity copies.
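In NumPy terms, here's what the identity-copy trick recovers for a hypothetical dense layer with a squared-error cost (a sketch of the math, not the graph construction): each weight copy receives only the gradient contributed by its example, and the copies sum to the ordinary aggregated gradient.

```python
import numpy as np

rng = np.random.RandomState(1)
examples = rng.rand(4, 3)          # analogue of tf.split(batch): 4 rows
W = rng.rand(3, 2)
target = rng.rand(4, 2)

# one "identity copy" of W per example; each copy's gradient is the
# outer product of its input row with its backpropagated error row
grads_per_copy = []
for x, t in zip(examples, target):
    err = x @ W - t                      # dL/d(xW) for this example
    grads_per_copy.append(np.outer(x, err))

# the sum over copies is what tf.gradients(cost, weights) would return
total = sum(grads_per_copy)
assert np.allclose(total, examples.T @ (examples @ W - target))
```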

@seerdecker

seerdecker commented Oct 11, 2016

Alright, thanks for the help.

@yaroslavvb

Contributor

yaroslavvb commented Oct 11, 2016

Baking in the NN assumptions too deeply is why DistBelief ended up so inflexible. In TensorFlow you can minimize a loss like sum(XWX'), where X is a single matrix-valued example, so that's your per-example gradient. Allowing tf.gradients to differentiate a vector-valued tensor seems like a good idea; that would make it simpler to compute Hessians as well as per-example gradients.

@asimshankar

Member

asimshankar commented Oct 21, 2016

Closing this due to inactivity (and it seems that there was a healthy discussion). Feel free to re-open if you think something needs to change here.

@jpiabrantes

jpiabrantes commented Nov 1, 2016

When I ask TensorFlow for the gradient of a function that returns a scalar for every example, it just gives me one gradient instead of a gradient per example. My "loss" function is not a sum or average, so I can't understand this behaviour. Simple example:

import numpy as np
import tensorflow as tf

x = tf.placeholder(tf.float32, [None, 3], name="input")
W1 = tf.get_variable("W1", shape=[3, 1],
                     initializer=tf.contrib.layers.xavier_initializer())
output = tf.nn.relu(tf.matmul(x, W1))
grads = tf.gradients(output, [W1])
sess = tf.InteractiveSession()
init = tf.initialize_all_variables()
sess.run(init)
print sess.run(grads, feed_dict={x: np.random.rand(10, 3)})
>>[array([[ 0.83393538],
       [ 0.16146532],
       [ 0.43589821]], dtype=float32)]
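What's happening above: for a non-scalar output tensor, tf.gradients differentiates its sum, so the single result is the batch-summed gradient. A NumPy check of that identity for the relu/matmul graph (hypothetical data, not the TF session):

```python
import numpy as np

rng = np.random.RandomState(2)
x = rng.rand(10, 3)
W1 = rng.rand(3, 1)

# d(sum(relu(x @ W1)))/dW1: each row of x with an active relu
# contributes x_i, so the result is one summed (3, 1) gradient
active = (x @ W1 > 0).astype(float)        # (10, 1) relu mask
summed_grad = x.T @ active                 # (3, 1)

# the per-example gradients that the summed result hides
per_example = x[:, :, None] * active[:, None, :]   # (10, 3, 1)
assert np.allclose(per_example.sum(axis=0), summed_grad)
```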
@jmacglashan

jmacglashan commented Nov 22, 2016

I agree that it would be useful for various algorithms to be able to get each individual gradient per batch, and it doesn't seem easy to do this. As jpiabrantes commented above, the issue is not limited to asking for the gradient of a tensor that does a batch reduction. In fact, it's clear that the gradient of an aggregation operation should itself be aggregated. The problem is that tf.gradients also aggregates even when the tensor you are differentiating doesn't aggregate.

I understand why it is the way it is for mathematical generality, but the very fact that tf.gradients has a flag for how aggregation is done suggests that we should be able to specify a batch mode where it does not aggregate the gradients and instead returns an NxMx... tensor, where N is the number of gradients over which it would normally aggregate and Mx... is the shape of the variable.

@goodfeli

goodfeli commented Apr 2, 2017

@altosaar

altosaar commented Apr 25, 2017

Thank you @goodfeli !

It would be very useful to support this in TensorFlow for calculating second-order gradients (e.g. Fisher information).

Any hope for official support or is this the recommended solution?

cc @datang1992

@goodfeli

goodfeli commented Apr 25, 2017

@yaroslavvb

Contributor

yaroslavvb commented Apr 25, 2017

@altosaar I recently had to compute per-example gradients, and ended up with the implementation here. To make things fast I had to drop tf.gradients and compute things manually. There's a two-orders-of-magnitude speed-up if you use specialized structures -- specifically, the "khatri-rao" product, which gives you a batch of matmul gradients given a batch of activations plus a batch of backprops. It can be computed efficiently through einsum.
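The einsum form of that product for a single matmul layer, in NumPy (a sketch under the assumption of a dense layer, with made-up shapes): one outer product per example, and summing over the batch axis recovers the ordinary aggregated gradient.

```python
import numpy as np

rng = np.random.RandomState(3)
acts = rng.rand(8, 4)        # batch of activations entering a matmul
backprops = rng.rand(8, 5)   # batch of backpropagated errors leaving it

# batch of matmul weight-gradients: one (4, 5) outer product per example
per_example_grads = np.einsum('bi,bo->bio', acts, backprops)

# the ordinary aggregated gradient is the sum over the batch axis
assert np.allclose(per_example_grads.sum(axis=0), acts.T @ backprops)
```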

@altosaar

altosaar commented Apr 25, 2017

Awesome, thank you @goodfeli. @yaroslavvb that implementation is extremely helpful - exactly what we're trying to do 👍 👍 👍

Also cc @ebrevdo based on past discussions!

@shoyer

Member

shoyer commented May 6, 2017

See #675 (comment) for a proposal for tf.jacobian.

I'm not sure this is exactly the issue discussed here, but I'll just note how I solved my use case for "per example gradients" using the helper function in the linked comment. With scalar input/output, jacobian(outputs, inputs) calculates the right result. For batched inputs/outputs, jacobian(tf.reduce_sum(outputs, 0), inputs) gives me the result I'm looking for, because each inputs example only affects the corresponding entry in outputs, so the other entries drop out when differentiating the sum.
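The block-diagonal structure that makes the reduce_sum trick work, checked in NumPy for a hypothetical per-example map out_i = x_i . w: the off-diagonal Jacobian blocks are zero, so collapsing the summed axis yields exactly the per-example gradients.

```python
import numpy as np

rng = np.random.RandomState(4)
x = rng.rand(6, 3)     # batched inputs, one row per example
w = rng.rand(3)
outputs = x @ w        # outputs[i] depends only on x[i]

# full Jacobian d(outputs)/d(x): shape (6, 6, 3), block-diagonal
jac = np.zeros((6, 6, 3))
for i in range(6):
    jac[i, i, :] = w   # d(outputs[i])/d(x[j]) is w if i == j, else 0

# jacobian(reduce_sum(outputs, 0), x) collapses the first axis;
# since off-diagonal blocks are zero, what remains is row i's own
# gradient -- the per-example gradients, shape (6, 3)
summed_jac = jac.sum(axis=0)
assert np.allclose(summed_jac, np.tile(w, (6, 1)))
```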

@sitzikbs

sitzikbs commented Jun 25, 2017

@goodfeli, how would the code given above change if I wish to get the gradient within a specific layer? (That is, between the output of the layer and the weights at the input of the layer.)

examples = tf.split(batch, batch_size)  # batch_size fixed at graph construction
weight_copies = [tf.identity(weights) for x in examples]
output = tf.stack([f(x, w) for x, w in zip(examples, weight_copies)])
cost = cost_function(output)
per_example_gradients = tf.gradients(cost, weight_copies)

@goodfeli

goodfeli commented Jun 28, 2017

The mathematical concept of a gradient is defined only when the output is a single scalar. (If you give TensorFlow an output that isn't a scalar, it will give you the gradient of the sum). You probably want a Jacobian. So you'd need to make a loop that computes the gradient once per entry in the output.

@sitzikbs

sitzikbs commented Jun 29, 2017

Thank you for the reply.
I think my follow-up question is more relevant on Stack Overflow, so I posted it there. If you get a chance, I would really appreciate the help, since no one has answered it in a while. (The basic idea is that I wish to use the gradients of one layer as inputs to another layer, like Fisher Vectors, but I can't figure out how.)
Here is the link to my question there.

@heimanba89

heimanba89 commented Apr 12, 2018

Hi @goodfeli @yaroslavvb, I have a problem that is very similar to per-sample gradients. The current tf.gradients() outputs the sum of gradients over the batch. Is it possible to output the sum of element-wise powers of the gradients, i.e., the sum over the batch of (dy/dx)^2, instead of the current sum over the batch of dy/dx?
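For a matmul layer, the same einsum trick mentioned earlier in the thread answers this: since each per-example gradient is an outer product, squaring the two factors before contracting gives the batch sum of squared gradients directly (NumPy sketch, hypothetical shapes):

```python
import numpy as np

rng = np.random.RandomState(5)
acts = rng.rand(8, 4)        # batch of activations
backprops = rng.rand(8, 5)   # batch of backpropagated errors

# sum over the batch of squared per-example gradients; since each
# per-example gradient entry is a_bi * c_bo, its square is
# a_bi^2 * c_bo^2, so square the factors before contracting over b
sum_sq = np.einsum('bi,bo->io', acts**2, backprops**2)

# check against the explicit per-example route
per_example = np.einsum('bi,bo->bio', acts, backprops)
assert np.allclose(sum_sq, (per_example**2).sum(axis=0))
```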

@joshuawchen

joshuawchen commented May 14, 2018

@goodfeli As far as I understand, one needs to split up the input variable into 'batch_size' number of tensors. However, one would need to know batch_size (as an integer) a priori when constructing this with tf.split. Is there an easy way to use your pseudocode, since we usually don't have the batch_size a priori?

@pkadambi

pkadambi commented Oct 17, 2018

@sitzikbs Did you end up using something similar to the pseudocode that @goodfeli suggested? If so, was this fast, and does it work on batch level ops like batch norm?

@sitzikbs

sitzikbs commented Oct 18, 2018

> @sitzikbs Did you end up using something similar to the pseudocode that @goodfeli suggested? If so, was this fast, and does it work on batch level ops like batch norm?

@pkadambi I ended up computing the gradients manually in an external function; it's in the 3dmfv function in the tf_utils file at this repo: link
