Provide unaggregated gradients tensors #4897
Comments
TensorFlow is just a regular automatic differentiation system: it gives the gradient that you ask for. So if you ask for the gradient of the mean loss over all examples, the gradient is aggregated, but if you ask for the gradient of the loss over a single example, it gives you the "unaggregated" gradient for that particular datapoint. Naively, if you have a batch of size k, you could make k separate tf.gradients() calls, one per example.
I saw the article. This isn't what I'm looking for, since I'd have to do all the computations by hand, which is difficult and error-prone. Doing k tf.gradients() calls is slow. Obtaining the batch of gradients is required by many algorithms, so it really should be something that's easy to do.
If I follow what you're saying, example_costs would be defined at graph-instantiation time, right? So that forces a fixed batch size.
Generally though, the "retrieve raw gradient" request is ill-specified: there's no place in TensorFlow where "per-example gradients" are added together, since there's no notion of "example" when differentiating. I.e., you can encode your examples as rows of a data matrix, or as columns, or in some other way, and that would give you the same gradient but different sets of "per-example gradients". The AD system of TensorFlow gives you the freedom to choose the example encoding and does not need to know about your choice, which I think is what's needed from a general computation framework. That said, it would be useful to see implementations of methods like Ian's, or others that give ways of computing per-example gradients efficiently.
Thanks for the answer. Yes, some example implementations would come in handy.

I don't understand how TF is providing freedom here. The implicit assumption that the last matrix rank corresponds to the batch instances seems to be built into the design. For instance, how could you pack your data as rows in a matrix and pass that through a convolutional layer? The convolutional layer imposes semantics on the content of your tensors.

The notion of a batch of gradients is well-defined in general (not speaking about TF specifically). You have a batch of inputs. The output of the network is a vector containing the loss value for each input. The batch of gradients is a tensor containing one entry per (input, loss) pair. In TF, in terms of semantics, I think all you'd have to do is allow the user to specify a vector loss instead of a scalar loss in tf.gradients(), and then retrieve the requested batch of gradients.

In practice TensorFlow is a deep learning framework, and the AD system is subservient to that goal. If it's hard to retrieve something as fundamental as a batch of gradients, it seems to me that something is missing from the API.
On the contrary, it's because it's an AD framework rather than a DL framework that it is somewhat difficult to get the batch gradients. The gradient of an individual example is a deep learning concept. As a pure math engine, TensorFlow just exposes the concept of the gradient with respect to a specific tensor. Because there's no tensor representing the parameters as used on a single example, the AD engine doesn't have a concept for the DL value you want. This actually has a reasonably efficient solution that doesn't require n tf.gradients calls: do n calls to convolution, wrapping the same parameters in a different identity op each time, then do one tf.gradients call getting the gradient with respect to each of the identity copies.
Alright, thanks for the help.
Baking in the NN assumptions too deeply is why DistBelief ended up so inflexible. In TensorFlow you can minimize whatever loss you define.
Closing this due to inactivity (and it seems that there was a healthy discussion). Feel free to re-open if you think something needs to change here. |
When I ask TensorFlow for the gradient of a function that returns a scalar for every example, tf just gives me one gradient instead of a gradient per example. My "loss" function is not a sum or an average, so I can't understand this behaviour. Simple example:

import numpy as np
import tensorflow as tf

x = tf.placeholder(tf.float32, [None, 3], name="input")
W1 = tf.get_variable("W1", shape=[3, 1],
                     initializer=tf.contrib.layers.xavier_initializer())
output = tf.nn.relu(tf.matmul(x, W1))
grads = tf.gradients(output, [W1])

sess = tf.InteractiveSession()
init = tf.initialize_all_variables()
sess.run(init)
print sess.run(grads, feed_dict={x: np.random.rand(10, 3)})

>> [array([[ 0.83393538],
           [ 0.16146532],
           [ 0.43589821]], dtype=float32)]
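For the snippet above, what tf.gradients returns can be reproduced by hand: the per-example gradient of relu(x_i · W1) with respect to W1 is x_i (as a column) whenever the unit is active, and tf.gradients hands back the sum of those. A NumPy sanity check (toy shapes matching the snippet, not TF code):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.random((10, 3))            # batch of 10 examples, as in the snippet
W = rng.standard_normal((3, 1))

pre = x @ W                        # shape (10, 1): one scalar output per example
active = (pre > 0).astype(float)   # relu derivative, per example

# Per-example gradient of output_i w.r.t. W: x_i as a column if the unit is
# active for that example, zero otherwise.  Shape (10, 3, 1).
per_example = active[:, None, :] * x[:, :, None]

# tf.gradients(output, [W1]) returns the SUM of these, shape (3, 1):
aggregated = per_example.sum(axis=0)
```

Comparing `aggregated` against a finite-difference gradient of `sum(relu(x @ W))` confirms that the single tensor returned really is the batch sum.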
I agree that it would be useful for various algorithms to be able to get each individual gradient per batch, and it doesn't seem easy to do this. As jpiabrantes commented above, the issue is not limited to asking for the gradient of a tensor that does a batch reduction. It's clear that the gradient of an aggregation operation should itself be aggregated; the problem is that tf.gradients also aggregates even when the output tensor you are differentiating doesn't aggregate. I understand why it is this way for mathematical generality, but the very fact that tf.gradients has a flag for how aggregation is done suggests that we should be able to specify a batch mode where it does not aggregate the gradients and instead returns an NxMx... tensor, where N is the number of gradients over which it would normally aggregate and Mx... is the shape of the variable.
This is only pseudocode, but the basic idea is:

examples = tf.split(batch, batch_size)  # one tensor per example
weight_copies = [tf.identity(weights) for _ in examples]
output = tf.stack([f(x, w) for x, w in zip(examples, weight_copies)])
cost = cost_function(output)
per_example_gradients = tf.gradients(cost, weight_copies)
…On Sat, Apr 1, 2017 at 1:47 PM, jjough ***@***.***> wrote:
Do n calls to convolution, wrapping the same parameters in a different
identity op each time. Then do one tf.gradients call getting the gradient
with respect to each of the identity copies.
@goodfeli <https://github.com/goodfeli> could you please explain this
further? Would love to see the trick, but not really sure what you mean.
I'm manually calculating a full Jacobian by making a bunch of calls to
tf.gradients, and it's very slow.
def tf_jacobian(tensor2, tensor1, feed_dict):
    """
    Computes the tensor d(tensor2)/d(tensor1) recursively.
    Note: relies on a global `sess` being in scope.
    """
    shape = list(sess.run(tf.shape(tensor2), feed_dict))
    if shape:
        # recurse over the first dimension, one slice at a time
        return tf.stack([tf_jacobian(tf.squeeze(M, squeeze_dims=[0]), tensor1, feed_dict)
                         for M in tf.split(split_dim=0, num_split=shape[0], value=tensor2)])
    else:
        grad = tf.gradients(tensor2, tensor1)
        if grad[0] is not None:
            return tf.squeeze(grad, squeeze_dims=[0])
        else:
            return tf.zeros_like(tensor1)
I've used it and it was fast.
…On Sun, Apr 2, 2017 at 10:44 AM, jjough ***@***.***> wrote:
Thanks Ian - this makes sense, trying it out now. Is it obvious that this
would be faster than the naive method?
Thank you @goodfeli ! It would be very useful to support this in TensorFlow for calculating second-order gradients (e.g. Fisher information, etc.). Any hope for official support, or is this the recommended solution? cc @datang1992
I think this is the recommended solution for now
…On Tue, Apr 25, 2017 at 5:47 PM, Jaan Altosaar ***@***.***> wrote:
Thank you @goodfeli <https://github.com/goodfeli> !
It would be very useful to support this in tensorflow for calculating
second-order gradients (e.g. fisher information, etc).
Any hope for official support or is this the recommended solution?
cc @datang1992 <https://github.com/datang1992>
@altosaar I recently had to compute per-example gradients, and ended up with the implementation here. To make things fast I had to drop
Awesome, thank you @goodfeli. @yaroslavvb that implementation is extremely helpful - exactly what we're trying to do 👍 👍 👍 Also cc @ebrevdo based on past discussions! |
See #675 (comment) for a proposal. I'm not sure this is exactly the issue discussed here, but I'll just note how I solved my use-case for "per-example gradients" using the helper function in the linked comment, with scalar input/output.
@goodfeli, how would the code given above change if I wish to get the gradient in a specific layer (that is, between the output of the layer and the weights at the input of the layer)?
The mathematical concept of a gradient is defined only when the output is a single scalar. (If you give TensorFlow an output that isn't a scalar, it will give you the gradient of the sum.) You probably want a Jacobian. So you'd need to make a loop that computes the gradient once per entry in the output.
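To make the "one gradient per output entry" idea concrete: row i of the Jacobian is the gradient of output entry i. A NumPy sketch using finite differences (purely illustrative; not the tf.gradients loop itself):

```python
import numpy as np

def numerical_jacobian(f, x, eps=1e-6):
    """Jacobian of f at x; row i is the (numerical) gradient of output i."""
    y = f(x)
    jac = np.zeros((y.size, x.size))
    for j in range(x.size):          # perturb each input coordinate in turn
        xp = x.copy()
        xp[j] += eps
        jac[:, j] = (f(xp) - y) / eps
    return jac

# For a linear map f(v) = A @ v, the Jacobian is just A:
A = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
jac = numerical_jacobian(lambda v: A @ v, np.array([1.0, -2.0]))
```

With autodiff, the same shape falls out of looping a reverse-mode gradient call over the output entries, which is exactly why it is slow for large outputs.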
Thank you for the reply.
Hi @goodfeli @yaroslavvb, I have a problem that is very similar to per-sample gradients. The current tf.gradients() outputs the sum of gradients over the batch; is it possible to output the sum of the element-wise squares of the gradients, i.e., the sum over the batch of (dy/dx)^2, instead of the current sum over the batch of (dy/dx)?
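tf.gradients itself only returns the summed gradient, but once per-example gradients are in hand (for instance via the identity-op trick above), the squared sum is a one-liner. A NumPy sketch with a hypothetical per-example gradient array:

```python
import numpy as np

rng = np.random.default_rng(1)
per_example = rng.standard_normal((10, 3))  # hypothetical per-example dy/dx

sum_of_grads = per_example.sum(axis=0)          # what tf.gradients aggregates to
sum_of_squares = (per_example ** 2).sum(axis=0)  # sum over the batch of (dy/dx)^2
```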
@goodfeli As far as I understand, one needs to split up the input variable into batch_size tensors. However, one would need to know the batch_size (as an integer) a priori when constructing this with tf.split. Is there an easy way to use your pseudocode, since we usually don't have the batch_size a priori?
@pkadambi I ended up computing the gradients manually in an external function; it's in the 3dmfv function in the tf_utils file at this repo: link
Can somebody please elaborate the pseudocode given by @goodfeli a little bit more? A small example for a tiny MLP would be very nice!
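A minimal sketch of the identity-op trick for a tiny MLP, written against TensorFlow 2.x eager mode with a GradientTape (the original pseudocode was graph-mode; all shapes here are made up for illustration):

```python
import tensorflow as tf  # assumes TensorFlow 2.x is installed

# Tiny MLP, 3 -> 4 -> 1, on a batch of 5 examples (illustrative shapes).
tf.random.set_seed(0)
W1 = tf.Variable(tf.random.normal([3, 4]))
W2 = tf.Variable(tf.random.normal([4, 1]))
x = tf.random.normal([5, 3])
y = tf.random.normal([5, 1])

examples = tf.split(x, 5)  # batch size must be known statically
targets = tf.split(y, 5)
with tf.GradientTape() as tape:
    # Wrap the same weights in a fresh identity op per example, so each copy
    # participates in exactly one example's computation.
    w1_copies = [tf.identity(W1) for _ in examples]
    w2_copies = [tf.identity(W2) for _ in examples]
    losses = []
    for xi, yi, w1, w2 in zip(examples, targets, w1_copies, w2_copies):
        h = tf.nn.relu(tf.matmul(xi, w1))
        losses.append(tf.reduce_sum((tf.matmul(h, w2) - yi) ** 2))
    total = tf.add_n(losses)

# One gradient call; the gradient w.r.t. each copy is that example's gradient.
per_example_g1 = tape.gradient(total, w1_copies)  # 5 tensors of shape [3, 4]
```

Summing the five per-example tensors reproduces the usual aggregated gradient of `total` with respect to `W1`.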
Is there an efficient tflow way to compute the following covariance matrix M_{xy} of derivatives: M_{xy} = Cov(\partial_x f(x,y), \partial_y g(x,y)) = <\partial_x f(x,y) \partial_y g(x,y)> - <\partial_x f(x,y)> <\partial_y g(x,y)>, where f(x,y), g(x,y) are differentiable functions of some vectors x and y, \partial_x is the gradient along x, and <> is the minibatch average. The second part of the expression, <\partial_x f(x,y)> <\partial_y g(x,y)>, can be readily computed using ordinary tf.gradients calls; the first part requires per-example gradients. From the discussion above it sounds like this type of operation is what tensorflow should be able to do.
@mgbukov using tf.gradients for this in TensorFlow will be inefficient; more generally, autograd in all major neural network frameworks is specialized to the "single output summed over examples" case. When I needed access to individual gradients I had to derive the formulas by hand in terms of einsum ops. JAX has a better automatic way of doing it through vmap.
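For the covariance question above, once per-example gradients are available (however obtained), the estimate reduces to two contractions. A NumPy sketch with hypothetical per-example gradient arrays gx and gy:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 8
gx = rng.standard_normal((n, 3))  # hypothetical per-example gradients d_x f
gy = rng.standard_normal((n, 4))  # hypothetical per-example gradients d_y g

# M_xy = <(d_x f)(d_y g)> - <d_x f><d_y g>, with <> the minibatch average
M = np.einsum('ni,nj->ij', gx, gy) / n - np.outer(gx.mean(axis=0), gy.mean(axis=0))
```

The einsum contracts over the batch index n, which is exactly the "sum of per-example outer products" that aggregated gradients cannot provide.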
tf.vectorized_map can be used to efficiently compute per-example gradients.
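A short sketch of that approach (TensorFlow 2.x; the toy model and shapes are made up): write a function that computes the gradient for a single example, then map it over the batch, letting tf.vectorized_map vectorize it instead of running a Python loop.

```python
import tensorflow as tf  # assumes TensorFlow 2.x is installed

W = tf.Variable(tf.reshape(tf.range(1.0, 4.0), [3, 1]))  # toy weights
x = tf.random.normal([8, 3])                             # batch of 8 examples

def grad_for_one_example(xi):
    # xi has shape [3]: compute one example's loss and its gradient w.r.t. W
    with tf.GradientTape() as tape:
        loss = tf.reduce_sum(tf.matmul(xi[None, :], W) ** 2)
    return tape.gradient(loss, W)

per_example_grads = tf.vectorized_map(grad_for_one_example, x)  # shape [8, 3, 1]
```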
Just to add a remark: our team at Owkin has studied this and done some benchmarks; please see the report on arXiv and sample code at this repo on GitHub. Our analysis is focused on CNNs; we propose a new approach which combines Goodfellow's trick with a PyTorch peculiarity (the group argument of its convolution operator).
It is not obvious to me if "multi" in your paper is the same as what a tf.vectorized_map based graph rewrite would translate to, so it will be worth benchmarking that as well. It has been tested on complex networks and provides order-of-magnitude speedups due to vectorization-based rewrites. The problem, as you hint at, is that we have not implemented an efficient vectorization of Conv2DBackpropFilter, which is what is required for your case to run fast. Creating a new kernel for this (or perhaps mapping this to an existing kernel) should do the trick, and might be akin to adding the group argument you mentioned.
Thanks for the comments @agarwal-ashish!
That's a very good point, I imagine there are some optimizations in
Indeed that is the main issue in porting it to TensorFlow. We might look into implementing a custom kernel for this, thanks for the tip :)
As described here, TF is inflexible when it comes to access to the gradients:
http://stackoverflow.com/questions/35731506/unaggregated-gradients-gradients-per-example-in-tensorflow?rq=1
Please provide a method where the user can retrieve the raw gradients, not the averaged gradients. Requiring the user to compute their own gradients is impractical -- the framework should work for the user, not the other way around.
Use case: this is needed for reinforcement learning, where the gradients of one net need to be backpropagated through another net (in separate steps).