
tf.gradients() gives the conjugate of what is expected #3348

Closed
whdc opened this issue Jul 17, 2016 · 16 comments

Comments


@whdc whdc commented Jul 17, 2016

tf.gradients(), when used on complex numbers, erroneously flips the sign of the imaginary part:

>>> x = tf.Variable(0. + 0.j)
>>> sess.run(tf.gradients(x*x, x), feed_dict={x:0.1j})
[-0.20000000000000001j]
>>> sess.run(tf.gradients(tf.exp(x), x), feed_dict={x:0.1j})
[(0.99500416527802571-0.099833416646828155j)]

I expect 0.2j and 0.99500416527802571+0.099833416646828155j.

I'm running version 0.9.0, CPU only, on OS X.
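For readers without a TF build handy, the arithmetic behind the report can be checked in plain Python (a sketch of the math only, not TF itself):

```python
# For f(z) = z*z the complex derivative at z = 0.1j is f'(z) = 2z = 0.2j,
# and its conjugate is -0.2j: exactly the value tf.gradients returned
# above. The same holds for exp, whose derivative is exp itself.
import cmath

z = 0.1j
assert 2 * z == 0.2j                      # the value the reporter expected
assert (2 * z).conjugate() == -0.2j       # the value tf.gradients returned

expected = 0.99500416527802571 + 0.099833416646828155j
assert abs(cmath.exp(z) - expected) < 1e-12
assert abs(cmath.exp(z).conjugate() - expected.conjugate()) < 1e-12
```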


@jmchen-g jmchen-g commented Jul 18, 2016

@girving Could you take a look at this please?


@girving girving commented Jul 18, 2016

The gradient of a holomorphic function is the conjugate of its complex derivative.

@girving girving closed this Jul 18, 2016

@corcra corcra commented Aug 16, 2016

@girving can you explain this statement?

Using the complex analogue of the definition of the derivative, for f(z) = z*z, f'(z) = 2z as one would expect (for z complex).

For the derivative of f(z) in terms of its partial derivatives with respect to the real and imaginary components of z (referring to these partial derivatives as f_x and f_y), you get f'(z) = 0.5*(f_x - j*f_y), which sort of looks like a conjugate, but still would return 0.2j in the first example.

What am I missing here?


@girving girving commented Aug 16, 2016

I don't have time to show you the proof, but you can check yourself that if w = f(z), then dL/dz = conj(f'(z)) dL/dw for gradients w.r.t. a loss L and f'(z) the complex derivative.
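The identity girving states can be checked numerically with finite differences (a sketch, assuming a holomorphic f = exp and a real-valued loss L(w) = |w|^2, and taking "gradient" of a real loss to mean dL/dx + i*dL/dy; `grad_real` is a hypothetical helper, not a TF function):

```python
# Finite-difference check of dL/dz = conj(f'(z)) * dL/dw for holomorphic f
# and a real-valued loss L.
import cmath

def grad_real(L, z, h=1e-6):
    """Gradient of a real-valued L at complex z, as dL/dx + i*dL/dy."""
    dLdx = (L(z + h) - L(z - h)) / (2 * h)
    dLdy = (L(z + 1j * h) - L(z - 1j * h)) / (2 * h)
    return dLdx + 1j * dLdy

f = cmath.exp                        # holomorphic, with f'(z) = exp(z)
L = lambda w: abs(w) ** 2            # real-valued loss on the output

z0 = 0.3 + 0.4j
lhs = grad_real(lambda z: L(f(z)), z0)            # dL/dz
rhs = f(z0).conjugate() * grad_real(L, f(z0))     # conj(f'(z0)) * dL/dw
assert abs(lhs - rhs) < 1e-4
```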


@charmasaur charmasaur commented Jul 17, 2019

I'll attempt to clarify for any future readers who also get confused by this.

tl;dr

I'm reasonably confident the "gradient" returned by tf.gradients is conj(df/dz + dconj(f)/dz) (which reduces to conj(df/dz) for holomorphic f).

More details

The "gradient" mentioned by girving as the conjugate of the derivative is (related to) the gradient of the corresponding real map, when we express that gradient as a complex value. To expand on that, by definition of the Wirtinger derivative we have df/dz = 0.5 * (df/dx - i df/dy) (where x and y are the real and imaginary parts of z). If f is real-valued (e.g. a loss), then we have conj(df/dz) = 0.5 * (df/dx + i df/dy), and df/dx + i df/dy in Cartesian coordinates is (df/dx, df/dy), which is the gradient of f when considered as a map between real spaces.
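This Wirtinger identity can be verified numerically (a finite-difference sketch in plain Python; `wirtinger` is a hypothetical helper, not part of TF):

```python
# For the real-valued, non-holomorphic f(z) = |z|^2 we have df/dz = conj(z),
# so conj(df/dz) = z = 0.5*(df/dx + i*df/dy).
def wirtinger(f, z, h=1e-6):
    """df/dz = 0.5*(df/dx - i*df/dy) via central differences."""
    dfdx = (f(z + h) - f(z - h)) / (2 * h)
    dfdy = (f(z + 1j * h) - f(z - 1j * h)) / (2 * h)
    return 0.5 * (dfdx - 1j * dfdy)

f = lambda z: abs(z) ** 2
z0 = 1.0 + 2.0j
assert abs(wirtinger(f, z0) - z0.conjugate()) < 1e-4       # df/dz = conj(z)
assert abs(wirtinger(f, z0).conjugate() - z0) < 1e-4       # conj(df/dz) = z
```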

That's not the usual definition of "gradient" of a complex map though--in my experience "gradient" is synonymous with "derivative". The terms certainly seem to be used interchangeably in the tf.gradients docs (e.g. "gradients() adds ops to the graph to output the derivatives of ys with respect to xs").

I imagine the reason for using that definition of "gradient" is because TF is generally concerned with gradients in order to find the direction of maximum increase, and the direction of maximum increase is (df/dx, df/dy). FWIW I would have thought it was more reasonable to define the "gradient" as df/dz + dconj(f)/dz, and then take the conjugate in the optimiser when deciding on the direction of maximum increase, but I expect that ship has sailed.

Even more details

To expand further, for anybody who's interested, another way to write the computed quantity is dR(f)/dx + i * dR(f)/dy, or in Cartesian coordinates, (dR(f)/dx, dR(f)/dy). That is, it's the gradient of the real part of the function, when viewed as a function of real variables. If you run an optimisation using a complex function, therefore, what you'll end up optimising is the real part of the function.

The operator D(f) := conj(df/dz + dconj(f)/dz) turns out to obey pretty standard chain and product rules, so the auto-differentiation still goes through correctly once the gradient functions are modified appropriately. Specifically, defining D(C, f) := conj(conj(C) * df/dz + C * dconj(f)/dz), which represents the "gradient" of f (as defined before) with accumulation C (which is the form of the TF gradient methods, where C is grad and f is the function whose gradient is being defined), one can easily verify:

D(1, f) = D(f) [the initial state of the auto-differentiation]
D(C, z) = C [the end/base state]
D(C, f o g) = D(conj(conj(C) * df/dg + C * dconj(f)/dg), g) [chain rule]

Implementing one of the gradient functions therefore boils down to evaluating that expression in the chain rule for the function in question. For example, for z -> conj(z) we have

conj(conj(C) * dconj(z)/dz + C * dconj(conj(z))/dz) = conj(C),

consistent with the code.

For a holomorphic function (multiplication, exp, etc...) the expression simply reduces to C * conj(df/dg) (because dconj(f)/dg vanishes for holomorphic f), which is the reason for all the conjugation introduced here.
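The operator above can be sketched with finite differences standing in for TF's machinery (`D` here is a hypothetical helper mirroring the definition in the text, not a TF API):

```python
# Finite-difference sketch of D(C, f) = conj(conj(C)*df/dz + C*dconj(f)/dz).
def D(C, f, z, h=1e-6):
    def dz(g):  # Wirtinger d/dz via central differences
        gx = (g(z + h) - g(z - h)) / (2 * h)
        gy = (g(z + 1j * h) - g(z - 1j * h)) / (2 * h)
        return 0.5 * (gx - 1j * gy)
    fbar = lambda w: f(w).conjugate()
    return (C.conjugate() * dz(f) + C * dz(fbar)).conjugate()

# Holomorphic f(z) = z*z at z = 0.1j: D(1, f) = conj(2z) = -0.2j,
# matching the tf.gradients output in the original report.
assert abs(D(1 + 0j, lambda z: z * z, 0.1j) - (-0.2j)) < 1e-4

# f(z) = conj(z): D(C, f) = conj(C), consistent with the linked code.
C = 0.3 + 0.7j
assert abs(D(C, lambda z: z.conjugate(), 0.1j) - C.conjugate()) < 1e-4
```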


@NEGU93 NEGU93 commented Sep 24, 2019

> [quotes @charmasaur's explanation above in full]

Thanks for that detailed explanation!
How did you find out this definition (conj(df/dz + dconj(f)/dz)) was the one used by tensorflow? Couldn't find it anywhere and I've been searching/asking around a lot.
Do you have any clue on WHY this definition of the gradient? Any paper which says for non-holomorphic functions this mathematical definition will give you what you want?


@girving girving commented Sep 24, 2019

If anyone is curious: @charmasaur suggests that it would have been "more correct" to define gradients the other way. This isn't correct: the easiest way to see this is to imagine that the inputs to a network are real, the outputs are real, and in the middle of the network is a holomorphic function. In this case, the optimizer has no idea that there's a holomorphic function involved in the middle of a computation: it sees a bunch of real stuff, and would have to do a full graph traversal to notice the issue.

A better option would be to define the gradient to be mathematically correct, and as @charmasaur has helpfully shown this is what TensorFlow does.


@girving girving commented Sep 24, 2019

@martinwicke I gave a talk once inside Google with a slide showing the proof that the gradient should be the conjugate. It's in the TensorFlow documentation talk about gradients. Could you take a picture of that slide and attach it here?


@NEGU93 NEGU93 commented Sep 24, 2019

> @martinwicke I gave a talk once inside Google with a slide showing the proof that the gradient should be the conjugate. It's in the TensorFlow documentation talk about gradients. Could you take a picture of that slide and attach it here?

I didn't understand: did you give the talk, or did @martinwicke? In any case, a link to this talk would be much appreciated if you could provide it. Also references, if any.

Thank you.


@martinwicke martinwicke commented Sep 24, 2019

This one?
[attached: Screen Shot 2019-09-24 at 13 49 24]


@charmasaur charmasaur commented Sep 25, 2019

Thanks for the replies!

> If anyone is curious: @charmasaur suggests that it would have been "more correct" to define gradients the other way. This isn't correct: the easiest way to see this is to imagine that the inputs to a network are real, the outputs are real, and in the middle of the network is a holomorphic function. In this case, the optimizer has no idea that there's a holomorphic function involved in the middle of a computation: it sees a bunch of real stuff, and would have to do a full graph traversal to notice the issue.

In that example (and any other situation involving real inputs and outputs), wouldn't the eventual gradient be real anyway, so taking the conjugate would be a no-op?

In any case, to me it just seems like a matter of convention/definition: without the conjugate, your "gradient" is the coefficient of the complex-linear function that approximates your function (well, approximates in real parts in the case of non-holomorphic functions); with the conjugate, your "gradient" is the direction of maximum increase. No matter which you choose, you can always get the other just by taking a conjugate.

> How did you find out this definition (conj(df/dz + dconj(f)/dz)) was the one used by tensorflow? Couldn't find it anywhere and I've been searching/asking around a lot.

I found out that was the definition by evaluating a bunch of gradients and spotting a pattern, then reading the code to convince myself that what was being computed was consistent with that pattern.

> Do you have any clue on WHY this definition of the gradient? Any paper which says for non-holomorphic functions this mathematical definition will give you what you want?

As discussed above, I think this is the definition of gradient because it's the gradient of the real part of your function when viewed as a map of real variables. That means you can plug it into a standard gradient descent optimizer and the result will be to optimize the real part of your function (which seems like a decent interpretation of "please optimize this complex-valued function").

As to your second question, to my knowledge there's not really any well-accepted "gradient of a general complex-valued function of complex variables". For holomorphic functions you have the Wirtinger derivative, which is the coefficient of the complex-linear approximation to your function, and is well-accepted. For non-holomorphic functions though, there is no complex-linear approximation (by definition), so it's not even clear what you want when you ask for a derivative or gradient. Your best bet is probably to view your function as a map between real spaces and look at a Jacobian.

In general I think it really boils down to a question of what you're actually doing with the gradient -- that will determine exactly which quantity you want to calculate. Once you know that quantity, which will almost certainly be some linear combination of df/dz and dconj(f)/dz, TF will be able to calculate it by choosing an appropriate grad_ys input to tf.gradients. Specifically, in the operator D(C, f) := conj(conj(C) * df/dz + C * dconj(f)/dz) I mentioned in my earlier reply, C corresponds to grad_ys.

For example, if you wanted to compute df/dz for some non-holomorphic f, you could use grad_ys=[1] to get conj(df/dz + dconj(f)/dz), then use grad_ys=[1j] to get conj(-i df/dz + i dconj(f)/dz). From those two quantities you can get df/dz.
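The two-evaluation trick can be sketched numerically (finite differences playing the role of tf.gradients, with `D(C, ...)` standing in for a call with grad_ys=[C]; `D` is hypothetical helper code, not a TF API):

```python
# Recover df/dz for non-holomorphic f from two "gradient" evaluations.
def D(C, f, z, h=1e-6):
    """conj(conj(C)*df/dz + C*dconj(f)/dz) via central differences."""
    def dz(g):
        gx = (g(z + h) - g(z - h)) / (2 * h)
        gy = (g(z + 1j * h) - g(z - 1j * h)) / (2 * h)
        return 0.5 * (gx - 1j * gy)
    fbar = lambda w: f(w).conjugate()
    return (C.conjugate() * dz(f) + C * dz(fbar)).conjugate()

f = lambda z: z * z.conjugate()        # non-holomorphic: |z|^2
z0 = 1.0 + 2.0j
g1 = D(1 + 0j, f, z0)                  # like grad_ys=[1]
gj = D(1j, f, z0)                      # like grad_ys=[1j]
# Solve the two linear equations for df/dz:
dfdz = 0.5 * (g1.conjugate() + 1j * gj.conjugate())
assert abs(dfdz - z0.conjugate()) < 1e-4   # df/dz of |z|^2 is conj(z)
```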

Does that help?


@refraction-ray refraction-ray commented Sep 25, 2019

As a side note to the above discussion, I believe this technical report is great material on the definition of derivatives and gradients of complex functions in a more mathematically rigorous way. Specifically, in chapter 4, the author shows exactly why the "gradient" of a real-valued function of complex variables is (two times) the complex conjugate of the corresponding partial derivative (note that no derivative can be well defined for a non-holomorphic function, only partial derivatives can).


@girving girving commented Sep 25, 2019

@martinwicke Yep. Presumably that will resolve the confusion. :)


@NEGU93 NEGU93 commented Sep 25, 2019

> I found out that was the definition by evaluating a bunch of gradients and spotting a pattern, then reading the code to convince myself that what was being computed was consistent with that pattern.

YOU ARE THE BOSS!!! I was trying to reverse engineer it myself. I had some theories, but they were all ruled out at some point. I was expecting the definition to be 2*df/dconj(z), based on a paper on complex-valued neural networks by Akira Hirose, which did work for f : C -> R but not for f : C -> C (always using Wirtinger calculus, of course).

One last question, maybe off topic. I tried reading the code myself, but I'm more of an electronics engineer than a software engineer. I got to the point where the Python code calls into a C/C++ API for the gradient, but couldn't really find where to continue from there. Since you have read the code, maybe you can point me in the right direction: do you know where to start reading in the C/C++ code? I would like to read it too, to understand it further.
For what I do, I think I will have to understand it eventually anyway.

> As a side note to the above discussion, I believe this technical report is great material on the definition of derivatives and gradients of complex functions in a more mathematically rigorous way. Specifically, in chapter 4, the author shows exactly why the "gradient" of a real-valued function of complex variables is (two times) the complex conjugate of the corresponding partial derivative (note that no derivative can be well defined for a non-holomorphic function, only partial derivatives can).

Good, I will give it a look. Thank you!


@charmasaur charmasaur commented Sep 25, 2019

> One last question, maybe off topic. I tried reading the code myself, but I'm more of an electronics engineer than a software engineer. I got to the point where the Python code calls into a C/C++ API for the gradient, but couldn't really find where to continue from there. Since you have read the code, maybe you can point me in the right direction: do you know where to start reading in the C/C++ code? I would like to read it too, to understand it further. For what I do, I think I will have to understand it eventually anyway.

I won't be much help on this one, unfortunately. For my investigation I just dug around until I found the Python code defining gradients (math_grad.py, linked earlier), and used that in isolation. I've never actually tried to trace through all the way from a gradients call to the C++ code.

Maybe one of the TF folks can point to a resource for learning more about that?


@NEGU93 NEGU93 commented Sep 27, 2019

Just a thought I have had lately. Another (more compressed) way to write TensorFlow's definition of the gradient is: 2*dreal(f)/dconj(z).
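This identity can also be checked with finite differences (a plain-Python sketch; `partials` is a hypothetical helper, and the sample f is an arbitrary non-holomorphic function):

```python
# Check that conj(df/dz + dconj(f)/dz) equals 2 * d Re(f) / dconj(z),
# i.e. dRe(f)/dx + i*dRe(f)/dy, for a non-holomorphic f.
import cmath

def partials(g, z, h=1e-6):
    """(dg/dx, dg/dy) at complex z via central differences."""
    gx = (g(z + h) - g(z - h)) / (2 * h)
    gy = (g(z + 1j * h) - g(z - 1j * h)) / (2 * h)
    return gx, gy

f = lambda z: cmath.exp(z) + z.conjugate() ** 2   # non-holomorphic
z0 = 0.3 - 0.5j

fx, fy = partials(f, z0)
dfdz = 0.5 * (fx - 1j * fy)                       # df/dz
bx, by = partials(lambda z: f(z).conjugate(), z0)
dfbardz = 0.5 * (bx - 1j * by)                    # dconj(f)/dz
lhs = (dfdz + dfbardz).conjugate()

rx, ry = partials(lambda z: f(z).real, z0)
rhs = rx + 1j * ry                                # 2 * dRe(f)/dconj(z)
assert abs(lhs - rhs) < 1e-4
```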
