# tf.gradients() gives the conjugate of what is expected #3348

Closed · opened this issue Jul 17, 2016 · 16 comments

### whdc commented Jul 17, 2016 • edited

`tf.gradients()`, when used on complex numbers, erroneously flips the sign of the imaginary part:

```python
>>> x = tf.Variable(0. + 0.j)
>>> sess.run(tf.gradients(x*x, x), feed_dict={x: 0.1j})
[-0.20000000000000001j]
>>> sess.run(tf.gradients(tf.exp(x), x), feed_dict={x: 0.1j})
[(0.99500416527802571-0.099833416646828155j)]
```

I expect `0.2j` and `0.99500416527802571+0.099833416646828155j`. I'm running version 0.9.0, CPU only, on OS X.

### jmchen-g commented Jul 18, 2016

 @girving Could you take a look at this please?

### girving commented Jul 18, 2016 • edited

 The gradient of a holomorphic function is the conjugate of its complex derivative.
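This statement can be checked without TensorFlow. Below is a minimal sketch in plain Python using central finite differences (the helper name `real_map_gradient` is mine, not a TF API): packing the gradient of `Re(f)` over the real variables `(x, y)` as a complex number reproduces exactly the values reported in the issue, which are the conjugates of `f'(z)`.

```python
import cmath

def real_map_gradient(f, z, h=1e-6):
    """Finite-difference gradient of Re(f) at z, packed as dRe/dx + i*dRe/dy."""
    dre_dx = (f(z + h).real - f(z - h).real) / (2 * h)
    dre_dy = (f(z + 1j * h).real - f(z - 1j * h).real) / (2 * h)
    return dre_dx + 1j * dre_dy

z0 = 0.1j
# f(z) = z*z, f'(z) = 2z: the issue sees -0.2j, i.e. conj(2 * 0.1j)
g1 = real_map_gradient(lambda z: z * z, z0)
# f(z) = exp(z), f'(z) = exp(z): the issue sees conj(exp(0.1j))
g2 = real_map_gradient(cmath.exp, z0)
```

Here `g1` comes out close to `-0.2j` and `g2` close to `conj(cmath.exp(0.1j))`, matching the issue's output.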

### corcra commented Aug 16, 2016

@girving can you explain this statement? Using the complex analogue of the definition of the derivative, for `f(z) = z*z`, `f'(z) = 2z` as one would expect (for `z` complex). For the derivative of `f(z)` in terms of its partial derivatives with respect to the real and imaginary components of `z` (referring to these partial derivatives as `f_x` and `f_y`), you get `f'(z) = 0.5*(f_x - j*f_y)`, which sort of looks like a conjugate, but it would still return `0.2j` in the first example. What am I missing here?

### girving commented Aug 16, 2016

 I don't have time to show you the proof, but you can check yourself that if `w = f(z)`, then `dL/dz = conj(f'(z)) dL/dw` for gradients w.r.t. a loss `L` and `f'(z)` the complex derivative.
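The identity can be verified numerically. Here is a hedged sketch (plain Python, finite differences; `grad_real` is a made-up helper, not a TF function) for the real loss `L(w) = |w|^2` composed with the holomorphic `f = exp`, writing the gradient of a real-valued function of one complex variable as `dL/dx + i*dL/dy`:

```python
import cmath

def grad_real(g, z, h=1e-6):
    """dg/dx + i*dg/dy for a real-valued g of one complex variable."""
    dx = (g(z + h) - g(z - h)) / (2 * h)
    dy = (g(z + 1j * h) - g(z - 1j * h)) / (2 * h)
    return dx + 1j * dy

f = cmath.exp                 # holomorphic, f'(z) = exp(z)
L = lambda w: abs(w) ** 2     # real-valued loss

z0 = 0.3 + 0.4j
w0 = f(z0)

lhs = grad_real(lambda z: L(f(z)), z0)             # gradient through the chain
rhs = cmath.exp(z0).conjugate() * grad_real(L, w0)  # conj(f'(z)) * gradient at w
```

The two quantities agree to finite-difference accuracy, consistent with `dL/dz = conj(f'(z)) dL/dw`.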

### charmasaur commented Jul 17, 2019

I'll attempt to clarify for any future readers who also get confused by this.

### tl;dr

I'm reasonably confident the "gradient" returned by `tf.gradients` is `conj(df/dz + dconj(f)/dz)` (which reduces to `conj(df/dz)` for holomorphic `f`).

### More details

The "gradient" mentioned by girving as the conjugate of the derivative is (related to) the gradient of the corresponding real map, when we express that gradient as a complex value. To expand on that, by definition of the Wirtinger derivative we have `df/dz = 0.5 * (df/dx - i df/dy)` (where `x` and `y` are the real and imaginary parts of `z`). If `f` is real-valued (e.g. a loss), then we have `conj(df/dz) = 0.5 * (df/dx + i df/dy)`, and `df/dx + i df/dy` in Cartesian coordinates is `(df/dx, df/dy)`, which is the gradient of `f` when considered as a map between real spaces.
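The Wirtinger derivative in that paragraph is easy to compute by finite differences. A small sketch (the `wirtinger` helper is my own naming): for a holomorphic function it recovers the usual complex derivative, and for a real-valued loss its conjugate is half the real gradient packed as a complex number.

```python
def wirtinger(f, z, h=1e-6):
    """Finite-difference df/dz = 0.5 * (df/dx - i*df/dy)."""
    df_dx = (f(z + h) - f(z - h)) / (2 * h)
    df_dy = (f(z + 1j * h) - f(z - 1j * h)) / (2 * h)
    return 0.5 * (df_dx - 1j * df_dy)

z0 = 0.3 + 0.4j
# Holomorphic: recovers the complex derivative, e.g. d(z*z)/dz = 2z.
d_hol = wirtinger(lambda z: z * z, z0)
# Real-valued "loss" |z|^2: df/dz = conj(z), so conj(df/dz) = z,
# i.e. 0.5*(df/dx + i*df/dy), half the real gradient as a complex number.
d_loss = wirtinger(lambda z: abs(z) ** 2, z0)
```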

That's not the usual definition of "gradient" of a complex map though--in my experience "gradient" is synonymous with "derivative". The terms certainly seem to be used interchangeably in the `tf.gradients` docs (e.g "`gradients()` adds ops to the graph to output the derivatives of `ys` with respect to `xs`").

I imagine the reason for using that definition of "gradient" is because TF is generally concerned with gradients in order to find the direction of maximum increase, and the direction of maximum increase is `(df/dx, df/dy)`. FWIW I would have thought it was more reasonable to define the "gradient" as `df/dz + dconj(f)/dz`, and then take the conjugate in the optimiser when deciding on the direction of maximum increase, but I expect that ship has sailed.

### Even more details

To expand further, for anybody who's interested, another way to write the computed quantity is `dR(f)/dx + i * dR(f)/dy`, or in Cartesian coordinates, `(dR(f)/dx, dR(f)/dy)`. That is, it's the gradient of the real part of the function, when viewed as a function of real variables. If you run an optimisation using a complex function, therefore, what you'll end up optimising is the real part of the function.
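This claim can be checked numerically for a non-holomorphic example. A sketch (plain Python, finite differences; the `wirtinger` helper is my own) comparing `conj(df/dz + dconj(f)/dz)` against the real-map gradient of `Re(f)` for `f(z) = z^2 + conj(z)`:

```python
def wirtinger(f, z, h=1e-6):
    """Finite-difference Wirtinger derivative df/dz = 0.5*(df/dx - i*df/dy)."""
    df_dx = (f(z + h) - f(z - h)) / (2 * h)
    df_dy = (f(z + 1j * h) - f(z - 1j * h)) / (2 * h)
    return 0.5 * (df_dx - 1j * df_dy)

f = lambda z: z * z + z.conjugate()   # non-holomorphic
z0 = 0.3 + 0.4j

# The quantity claimed to be what tf.gradients computes:
tf_style = (wirtinger(f, z0)
            + wirtinger(lambda z: f(z).conjugate(), z0)).conjugate()

# The gradient of Re(f) as a map of the real variables (x, y):
h = 1e-6
dre_dx = (f(z0 + h).real - f(z0 - h).real) / (2 * h)
dre_dy = (f(z0 + 1j * h).real - f(z0 - 1j * h).real) / (2 * h)
grad_re = dre_dx + 1j * dre_dy
```

Both come out as `2*conj(z0) + 1`, i.e. `1.6 - 0.8j` at this point.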

The operator `D(f) := conj(df/dz + dconj(f)/dz)` turns out to obey pretty standard chain and product rules, so the auto-differentiation still goes through correctly once the gradient functions are modified appropriately. Specifically, defining `D(C, f) := conj(conj(C) * df/dz + C * dconj(f)/dz)`, which represents the "gradient" of `f` (as defined before) with accumulation `C` (which is the form of the TF gradient methods, where `C` is `grad` and `f` is the function whose gradient is being defined), one can easily verify:

```
D(1, f) = D(f)  [the initial state of the auto-differentiation]
D(C, z) = C  [the end/base state]
D(C, f o g) = D(conj(conj(C) * df/dg + C * dconj(f)/dg), g)  [chain rule]
```
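The chain rule above can be verified numerically. A hedged sketch (plain Python emulation of `D`, with a finite-difference `wirtinger` helper of my own naming), composing the holomorphic `f(w) = w*w` with the non-holomorphic `g(z) = conj(z)`:

```python
def wirtinger(f, z, h=1e-6):
    # df/dz = 0.5 * (df/dx - i*df/dy), by central differences
    return 0.5 * ((f(z + h) - f(z - h)) / (2 * h)
                  - 1j * (f(z + 1j * h) - f(z - 1j * h)) / (2 * h))

def D(C, f, z):
    """D(C, f) := conj(conj(C) * df/dz + C * dconj(f)/dz), evaluated at z."""
    df = wirtinger(f, z)
    dfbar = wirtinger(lambda w: f(w).conjugate(), z)
    return (C.conjugate() * df + C * dfbar).conjugate()

f = lambda w: w * w          # holomorphic
g = lambda z: z.conjugate()  # non-holomorphic
C, z0 = 0.5 + 0.25j, 0.3 + 0.4j

# Direct "gradient" of the composition:
direct = D(C, lambda z: f(g(z)), z0)
# Chain rule: new accumulation, then D(..., g):
inner = (C.conjugate() * wirtinger(f, g(z0))
         + C * wirtinger(lambda w: f(w).conjugate(), g(z0))).conjugate()
chained = D(inner, g, z0)
```

Both routes give `conj(2*C*z0)`, i.e. `0.1 - 0.55j` for these values.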

Implementing one of the gradient functions therefore boils down to evaluating that expression in the chain rule for the function in question. For example, for `z -> conj(z)` we have

```
conj(conj(C) * dconj(z)/dz + C * dconj(conj(z))/dz) = conj(C),
```

consistent with the code.

For a holomorphic function (multiplication, exp, etc.) the expression simply reduces to `C * conj(df/dg)` (because `dconj(f)/dg` vanishes for holomorphic `f`), which is the reason for all the conjugation introduced here.
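The holomorphic reduction is also easy to check numerically. A sketch (plain Python; finite-difference `wirtinger` helper of my own naming) comparing the full expression against `C * conj(f'(z))` for `f = exp`:

```python
import cmath

def wirtinger(f, z, h=1e-6):
    # df/dz = 0.5 * (df/dx - i*df/dy), by central differences
    return 0.5 * ((f(z + h) - f(z - h)) / (2 * h)
                  - 1j * (f(z + 1j * h) - f(z - 1j * h)) / (2 * h))

C, z0 = 0.5 + 0.25j, 0.1j
# Full expression conj(conj(C)*df/dz + C*dconj(f)/dz); the second term
# vanishes numerically because exp is holomorphic.
full = (C.conjugate() * wirtinger(cmath.exp, z0)
        + C * wirtinger(lambda z: cmath.exp(z).conjugate(), z0)).conjugate()
reduced = C * cmath.exp(z0).conjugate()
```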

### NEGU93 commented Sep 24, 2019

> I'm reasonably confident the "gradient" returned by `tf.gradients` is `conj(df/dz + dconj(f)/dz)` (which reduces to `conj(df/dz)` for holomorphic `f`).

Thanks for that detailed explanation!
How did you find out this definition (`conj(df/dz + dconj(f)/dz)`) was the one used by TensorFlow? I couldn't find it anywhere, and I've been searching/asking around a lot.
Do you have any clue WHY this definition of the gradient was chosen? Is there any paper which says that, for non-holomorphic functions, this mathematical definition will give you what you want?

### girving commented Sep 24, 2019

 If anyone is curious: @charmasaur suggests that it would have been "more correct" to define gradients the other way. This isn't correct: the easiest way to see this is to imagine that the inputs to a network are real, the outputs are real, and in the middle of the network is a holomorphic function. In this case, the optimizer has no idea that there's a holomorphic function involved in the middle of a computation: it sees a bunch of real stuff, and would have to do a full graph traversal to notice the issue. A better option would be to define the gradient to be mathematically correct, and as @charmasaur has helpfully shown this is what TensorFlow does.

### girving commented Sep 24, 2019

 @martinwicke I gave a talk once inside Google with a slide showing the proof that the gradient should be the conjugate. It's the TensorFlow documentation talk about gradients. Could you take a picture of that slide and attach it here?

### NEGU93 commented Sep 24, 2019

> @martinwicke I gave a talk once inside Google with a slide showing the proof that the gradient should be the conjugate. It's the TensorFlow documentation talk about gradients. Could you take a picture of that slide and attach it here?

I didn't understand: did you give the talk, or did @martinwicke? In any case, a link to this talk would be much appreciated if you could provide it, along with any references. Thank you.

### charmasaur commented Sep 25, 2019

Thanks for the replies!

> If anyone is curious: @charmasaur suggests that it would have been "more correct" to define gradients the other way. This isn't correct: the easiest way to see this is to imagine that the inputs to a network are real, the outputs are real, and in the middle of the network is a holomorphic function. In this case, the optimizer has no idea that there's a holomorphic function involved in the middle of a computation: it sees a bunch of real stuff, and would have to do a full graph traversal to notice the issue.

In that example (and any other situation involving real inputs and outputs), wouldn't the eventual gradient be real anyway, so taking the conjugate would be a no-op?

In any case, to me it just seems like a matter of convention/definition: without the conjugate, your "gradient" is the coefficient of the complex-linear function that approximates your function (well, approximates in real parts in the case of non-holomorphic functions); with the conjugate, your "gradient" is the direction of maximum increase. No matter which you choose, you can always get the other just by taking a conjugate.

> How did you find out this definition (`conj(df/dz + dconj(f)/dz)`) was the one used by tensorflow? Couldn't find it anywhere and I've been searching/asking around a lot.

I found out that was the definition by evaluating a bunch of gradients and spotting a pattern, then reading the code to convince myself that what was being computed was consistent with that pattern.

> Do you have any clue on WHY this definition of the gradient? Any paper which says for non-holomorphic functions this mathematical definition will give you what you want?

As discussed above, I think this is the definition of gradient because it's the gradient of the real part of your function when viewed as a map of real variables. That means you can plug it into a standard gradient descent optimizer and the result will be to optimize the real part of your function (which seems like a decent interpretation of "please optimize this complex-valued function").

As to your second question, to my knowledge there's not really any well-accepted "gradient of a general complex-valued function of complex variables". For holomorphic functions you have the Wirtinger derivative, which is the coefficient of the complex-linear approximation to your function, and is well-accepted. For non-holomorphic functions, though, there is no complex-linear approximation (by definition), so it's not even clear what you want when you ask for a derivative or gradient. Your best bet is probably to view your function as a map between real spaces and look at a Jacobian.

In general I think it really boils down to a question of what you're actually doing with the gradient -- that will determine exactly which quantity you want to calculate. Once you know that quantity, which will almost certainly be some linear combination of `df/dz` and `dconj(f)/dz`, TF will be able to calculate it by choosing an appropriate `grad_ys` input to `tf.gradients`. Specifically, in the operator `D(C, f) := conj(conj(C) * df/dz + C * dconj(f)/dz)` I mentioned in my earlier reply, `C` corresponds to `grad_ys`. For example, if you wanted to compute `df/dz` for some non-holomorphic `f`, you could use `grad_ys=` to get `conj(df/dz + dconj(f)/dz)`, then use `grad_ys=[1j]` to get `conj(-i df/dz + i dconj(f)/dz)`. From those two quantities you can get `df/dz`.

Does that help?
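The two-evaluation trick at the end can be sketched in plain Python by emulating `D(C, f)` with finite differences (the `wirtinger` and `D` helpers are my own naming, standing in for `tf.gradients` with the corresponding `grad_ys`), here recovering `df/dz` for the non-holomorphic `f(z) = z * conj(z)`:

```python
def wirtinger(f, z, h=1e-6):
    # df/dz = 0.5 * (df/dx - i*df/dy), by central differences
    return 0.5 * ((f(z + h) - f(z - h)) / (2 * h)
                  - 1j * (f(z + 1j * h) - f(z - 1j * h)) / (2 * h))

def D(C, f, z):
    """Emulates what tf.gradients(f(z), z, grad_ys=[C]) computes, per the analysis above."""
    df = wirtinger(f, z)
    dfbar = wirtinger(lambda w: f(w).conjugate(), z)
    return (C.conjugate() * df + C * dfbar).conjugate()

f = lambda z: z * z.conjugate()   # |z|^2, non-holomorphic
z0 = 0.3 + 0.4j

A = D(1 + 0j, f, z0)   # conj(df/dz + dconj(f)/dz)
B = D(1j, f, z0)       # conj(-i*df/dz + i*dconj(f)/dz)
# Solve the two linear equations for df/dz:
df_dz = 0.5 * (A.conjugate() + 1j * B.conjugate())
```

For this `f`, `df/dz = conj(z)`, and the recovered value matches.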

### refraction-ray commented Sep 25, 2019

As a side note for the above discussion, I believe this technical report is a great resource for defining derivatives and gradients of complex functions in a more mathematically rigorous way. Specifically, in chapter 4, the author shows exactly why the "gradient" of a real-valued function of complex variables is (two times) the complex conjugate of the corresponding partial derivative (note that no derivative can be well defined for a non-holomorphic function; only partial derivatives can).

### girving commented Sep 25, 2019

 @martinwicke Yep. Presumably that will resolve the confusion. :)

### NEGU93 commented Sep 25, 2019

> I found out that was the definition by evaluating a bunch of gradients and spotting a pattern, then reading the code to convince myself that what was being computed was consistent with that pattern.

YOU ARE THE BOSS!!! I was trying to reverse engineer it myself. I had some theories, and all were taken down at some point. I was expecting the definition to be `2*df/dconj(z)`, based on a paper on CVNNs (Akira Hirose), which did work for `f : C -> R` but not for `f : C -> C` (always using Wirtinger calculus, of course).

One last question, maybe off topic: I tried reading the code myself, but I'm more of an electronics engineer than a software engineer. I got to the point where the Python code calls an API of C/C++ code for the gradient, but couldn't really find where to continue from there. Since you have read the code, can you maybe point me in the right direction? Do you know where to start reading in the C code? I would like to read it too, to further understand it; for what I do, I think I will have to understand it eventually anyway.

> As a side note for the above discussion, I believe this technical report is a great resource for defining derivatives and gradients of complex functions in a more mathematically rigorous way. Specifically, in chapter 4, the author shows exactly why the "gradient" of a real-valued function of complex variables is (two times) the complex conjugate of the corresponding partial derivative.

Good, I will give it a look. Thank you!

### charmasaur commented Sep 25, 2019

> One last question, maybe off topic: I tried reading the code myself, but I'm more of an electronics engineer than a software engineer. I got to the point where the Python code calls an API of C/C++ code for the gradient, but couldn't really find where to continue from there. Since you have read the code, can you maybe point me in the right direction? Do you know where to start reading in the C code?

I won't be much help on this one, unfortunately. For my investigation I just dug around until I found the Python code defining gradients (math_grad.py, linked earlier), and used that in isolation. I've never actually tried to trace through all the way from a `gradients` call to the C++ code. Maybe one of the TF folks can point to a resource for learning more about that?

### NEGU93 commented Sep 27, 2019

Just some thoughts I have had lately. Another (more compact) way to write TensorFlow's definition of the gradient is `2*dreal(f)/dconj(z)`.
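This compact form can be checked numerically against the earlier expression. A sketch (plain Python; the `wirtinger` and `wirtinger_bar` helpers are my own naming) comparing `conj(df/dz + dconj(f)/dz)` with `2 * dRe(f)/dconj(z)` for the non-holomorphic `f(z) = z^2 + conj(z)`:

```python
def wirtinger(f, z, h=1e-6):
    """df/dz = 0.5 * (df/dx - i*df/dy), by central differences."""
    return 0.5 * ((f(z + h) - f(z - h)) / (2 * h)
                  - 1j * (f(z + 1j * h) - f(z - 1j * h)) / (2 * h))

def wirtinger_bar(f, z, h=1e-6):
    """df/dconj(z) = 0.5 * (df/dx + i*df/dy), by central differences."""
    return 0.5 * ((f(z + h) - f(z - h)) / (2 * h)
                  + 1j * (f(z + 1j * h) - f(z - 1j * h)) / (2 * h))

f = lambda z: z * z + z.conjugate()
z0 = 0.3 + 0.4j

lhs = (wirtinger(f, z0) + wirtinger(lambda z: f(z).conjugate(), z0)).conjugate()
rhs = 2 * wirtinger_bar(lambda z: f(z).real, z0)
```

Both give `2*conj(z0) + 1`, i.e. `1.6 - 0.8j` here, so the two definitions agree at least on this example.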