# tf.gradients() gives the conjugate of what is expected #3348

Closed · opened this issue Jul 17, 2016 · 36 comments

Labels: stat:awaiting tensorflower (Status - Awaiting response from tensorflower)

tf.gradients(), when used on complex numbers, erroneously flips the sign of the imaginary part:

```python
>>> x = tf.Variable(0. + 0.j)
>>> sess.run(tf.gradients(x*x, x), feed_dict={x: 0.1j})
[-0.20000000000000001j]
>>> sess.run(tf.gradients(tf.exp(x), x), feed_dict={x: 0.1j})
[(0.99500416527802571-0.099833416646828155j)]
```

I expect 0.2j and 0.99500416527802571+0.099833416646828155j. I'm running version 0.9.0, CPU only, on OS X.
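As later comments in this thread explain, the returned value is conj(f'(z)), the conjugate of the complex derivative. A quick numpy check (not TF itself) that this matches the reported numbers:

```python
import numpy as np

# The values reported above match conj(f'(z)) evaluated at z = 0.1j.
z = 0.1j

# f(z) = z*z has f'(z) = 2z, so conj(f'(z)) = conj(0.2j) = -0.2j
print(np.conj(2 * z))

# f(z) = exp(z) has f'(z) = exp(z)
print(np.conj(np.exp(z)))  # ~0.995004 - 0.0998334j
```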

### jmchen-g commented Jul 18, 2016

 @girving Could you take a look at this please?

added the stat:awaiting tensorflower label Jul 18, 2016

### girving commented Jul 18, 2016

The gradient of a holomorphic function is the conjugate of its complex derivative.

closed this as completed Jul 18, 2016

### corcra commented Aug 16, 2016

 @girving can you explain this statement? Using the complex analogue of the definition of the derivative, for f(z) = z*z, f'(z) = z as one would expect (for z complex). For the derivative of f(z) in terms of its partial derivatives with respect to the real and imaginary components of z (referring to these partial derivatives as f_x and f_y), you get f'(z) = 0.5*(f_x - j*f_y), which sort of looks like a conjugate, but still would return 0.2j in the first example. What am I missing here?

### girving commented Aug 16, 2016

 I don't have time to show you the proof, but you can check yourself that if w = f(z), then dL/dz = conj(f'(z)) dL/dw for gradients w.r.t. a loss L and f'(z) the complex derivative.

### charmasaur commented Jul 17, 2019

I'll attempt to clarify for any future readers who also get confused by this.

### tl;dr

I'm reasonably confident the "gradient" returned by tf.gradients is conj(df/dz + dconj(f)/dz) (which reduces to conj(df/dz) for holomorphic f).

### More details

The "gradient" mentioned by girving as the conjugate of the derivative is (related to) the gradient of the corresponding real map, when we express that gradient as a complex value. To expand on that, by definition of the Wirtinger derivative we have df/dz = 0.5 * (df/dx - i df/dy) (where x and y are the real and imaginary parts of z). If f is real-valued (e.g. a loss), then we have conj(df/dz) = 0.5 * (df/dx + i df/dy), and df/dx + i df/dy in Cartesian coordinates is (df/dx, df/dy), which is the gradient of f when considered as a map between real spaces.
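The Wirtinger relation above is easy to check numerically with finite differences; a minimal numpy sketch (the loss |z|^2 is my own example, not from the thread):

```python
import numpy as np

def wirtinger_dz(f, z, h=1e-6):
    # df/dz = 0.5 * (df/dx - i * df/dy), via central differences
    dfdx = (f(z + h) - f(z - h)) / (2 * h)
    dfdy = (f(z + 1j * h) - f(z - 1j * h)) / (2 * h)
    return 0.5 * (dfdx - 1j * dfdy)

# Real-valued "loss" f(z) = |z|^2: df/dx = 2x, df/dy = 2y, so
# conj(df/dz) = 0.5 * (df/dx + i * df/dy) = x + i*y = z.
f = lambda z: abs(z) ** 2
z = 0.3 + 0.4j
dfdz = wirtinger_dz(f, z)
print(np.conj(dfdz))  # ~ 0.3 + 0.4j, i.e. z itself
```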

That's not the usual definition of "gradient" of a complex map though--in my experience "gradient" is synonymous with "derivative". The terms certainly seem to be used interchangeably in the tf.gradients docs (e.g. "gradients() adds ops to the graph to output the derivatives of ys with respect to xs").

I imagine the reason for using that definition of "gradient" is because TF is generally concerned with gradients in order to find the direction of maximum increase, and the direction of maximum increase is (df/dx, df/dy). FWIW I would have thought it was more reasonable to define the "gradient" as df/dz + dconj(f)/dz, and then take the conjugate in the optimiser when deciding on the direction of maximum increase, but I expect that ship has sailed.

### Even more details

To expand further, for anybody who's interested, another way to write the computed quantity is dR(f)/dx + i * dR(f)/dy, or in Cartesian coordinates, (dR(f)/dx, dR(f)/dy). That is, it's the gradient of the real part of the function, when viewed as a function of real variables. If you run an optimisation using a complex function, therefore, what you'll end up optimising is the real part of the function.
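That claim can itself be checked numerically: conj(df/dz + dconj(f)/dz) agrees with dR(f)/dx + i * dR(f)/dy. A numpy sketch with a made-up non-holomorphic f:

```python
import numpy as np

def partials(f, z, h=1e-6):
    # df/dx and df/dy via central differences
    fx = (f(z + h) - f(z - h)) / (2 * h)
    fy = (f(z + 1j * h) - f(z - 1j * h)) / (2 * h)
    return fx, fy

# Hypothetical non-holomorphic C -> C function: |z|^2 + i*z
f = lambda z: z * np.conj(z) + 1j * z
z = 0.7 - 0.2j

fx, fy = partials(f, z)
dfdz = 0.5 * (fx - 1j * fy)                         # Wirtinger df/dz
dconjf_dz = 0.5 * (np.conj(fx) - 1j * np.conj(fy))  # d(conj f)/dz
tf_grad = np.conj(dfdz + dconjf_dz)

# Gradient of the real part, packed into a complex number
rx, ry = partials(lambda w: f(w).real, z)
print(tf_grad, rx + 1j * ry)  # the two agree (here, 1.4 - 1.4j)
```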

The operator D(f) := conj(df/dz + dconj(f)/dz) turns out to obey pretty standard chain and product rules, so the auto-differentiation still goes through correctly once the gradient functions are modified appropriately. Specifically, defining D(C, f) := conj(conj(C) * df/dz + C * dconj(f)/dz), which represents the "gradient" of f (as defined before) with accumulation C (which is the form of the TF gradient methods, where C is grad and f is the function whose gradient is being defined), one can easily verify:

D(1, f) = D(f) [the initial state of the auto-differentiation]
D(C, z) = C [the end/base state]
D(C, f o g) = D(conj(conj(C) * df/dg + C * dconj(f)/dg), g) [chain rule]
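These identities can be verified numerically. A numpy sketch of the chain rule, with made-up f, g, and accumulator C (finite-difference Wirtinger derivatives stand in for the exact ones):

```python
import numpy as np

def wirt(f, z, h=1e-6):
    # returns (df/dz, dconj(f)/dz) via central differences
    fx = (f(z + h) - f(z - h)) / (2 * h)
    fy = (f(z + 1j * h) - f(z - 1j * h)) / (2 * h)
    return 0.5 * (fx - 1j * fy), 0.5 * (np.conj(fx) - 1j * np.conj(fy))

def D(C, f, z):
    # D(C, f) = conj(conj(C) * df/dz + C * dconj(f)/dz)
    dfdz, dcfdz = wirt(f, z)
    return np.conj(np.conj(C) * dfdz + C * dcfdz)

g = lambda z: z * np.conj(z)   # non-holomorphic inner function
f = lambda w: w * w            # holomorphic outer function
C = 0.5 - 1.5j                 # hypothetical accumulated gradient
z = 0.8 + 0.3j

lhs = D(C, lambda t: f(g(t)), z)   # D(C, f o g) directly
dfdw, dcfdw = wirt(f, g(z))        # derivatives of f at w = g(z)
rhs = D(np.conj(np.conj(C) * dfdw + C * dcfdw), g, z)  # chain rule
print(lhs, rhs)  # the two agree
```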


Implementing one of the gradient functions therefore boils down to evaluating that expression in the chain rule for the function in question. For example, for z -> conj(z) we have

conj(conj(C) * dconj(z)/dz + C * dconj(conj(z))/dz) = conj(C),


consistent with the code.

For a holomorphic function (multiplication, exp, etc...) the expression simply reduces to C * conj(df/dg) (because dconj(f)/dg vanishes for holomorphic f), which is the reason for all the conjugation introduced here.

### NEGU93 commented Sep 24, 2019

> *(quoting @charmasaur's explanation above in full)*

Thanks for that detailed explanation!
How did you find out this definition (conj(df/dz + dconj(f)/dz)) was the one used by tensorflow? Couldn't find it anywhere and I've been searching/asking around a lot.
Do you have any clue on WHY this definition of the gradient? Any paper which says for non-holomorphic functions this mathematical definition will give you what you want?

### girving commented Sep 24, 2019

 If anyone is curious: @charmasaur suggests that it would have been "more correct" to define gradients the other way. This isn't correct: the easiest way to see this is to imagine that the inputs to a network are real, the outputs are real, and in the middle of the network is a holomorphic function. In this case, the optimizer has no idea that there's a holomorphic function involved in the middle of a computation: it sees a bunch of real stuff, and would have to do a full graph traversal to notice the issue. A better option would be to define the gradient to be mathematically correct, and as @charmasaur has helpfully shown this is what TensorFlow does.

### girving commented Sep 24, 2019

@martinwicke I gave a talk once inside Google with a slide showing the proof that the gradient should be the conjugate. It was in the TensorFlow documentation talk, about gradients. Could you take a picture of that slide and attach it here?

### NEGU93 commented Sep 24, 2019

> @martinwicke I gave a talk once inside Google with a slide showing the proof that the gradient should be the conjugate. It's the TensorFlow documentation talk about gradients. Could you take a picture of that slide and attach it here?

I didn't understand: did you give the talk, or did @martinwicke? In any case, this talk would be much appreciated if you could provide the link. Also references, if any. Thank you.

### martinwicke commented Sep 24, 2019

 This one?

### charmasaur commented Sep 25, 2019

Thanks for the replies!

> If anyone is curious: @charmasaur suggests that it would have been "more correct" to define gradients the other way. This isn't correct: the easiest way to see this is to imagine that the inputs to a network are real, the outputs are real, and in the middle of the network is a holomorphic function. In this case, the optimizer has no idea that there's a holomorphic function involved in the middle of a computation: it sees a bunch of real stuff, and would have to do a full graph traversal to notice the issue.

In that example (and any other situation involving real inputs and outputs), wouldn't the eventual gradient be real anyway, so taking the conjugate would be a no-op?

In any case, to me it just seems like a matter of convention/definition: without the conjugate, your "gradient" is the coefficient of the complex-linear function that approximates your function (well, approximates the real part in the case of non-holomorphic functions); with the conjugate, your "gradient" is the direction of maximum increase. No matter which you choose, you can always get the other just by taking a conjugate.

> How did you find out this definition (conj(df/dz + dconj(f)/dz)) was the one used by tensorflow? Couldn't find it anywhere and I've been searching/asking around a lot.

I found out that was the definition by evaluating a bunch of gradients and spotting a pattern, then reading the code to convince myself that what was being computed was consistent with that pattern.

> Do you have any clue on WHY this definition of the gradient? Any paper which says for non-holomorphic functions this mathematical definition will give you what you want?

As discussed above, I think this is the definition of gradient because it's the gradient of the real part of your function when viewed as a map of real variables. That means you can plug it into a standard gradient descent optimizer and the result will be to optimize the real part of your function (which seems like a decent interpretation of "please optimize this complex-valued function").

As to your second question, to my knowledge there's not really any well-accepted "gradient of a general complex-valued function of complex variables". For holomorphic functions you have the Wirtinger derivative, which is the coefficient of the complex-linear approximation to your function, and is well-accepted. For non-holomorphic functions, though, there is no complex-linear approximation (by definition), so it's not even clear what you want when you ask for a derivative or gradient. Your best bet is probably to view your function as a map between real spaces and look at a Jacobian.

In general I think it really boils down to a question of what you're actually doing with the gradient -- that will determine exactly which quantity you want to calculate. Once you know that quantity, which will almost certainly be some linear combination of df/dz and dconj(f)/dz, TF will be able to calculate it by choosing an appropriate grad_ys input to tf.gradients. Specifically, in the operator D(C, f) := conj(conj(C) * df/dz + C * dconj(f)/dz) I mentioned in my earlier reply, C corresponds to grad_ys. For example, if you wanted to compute df/dz for some non-holomorphic f, you could use grad_ys=[1] to get conj(df/dz + dconj(f)/dz), then use grad_ys=[1j] to get conj(-i df/dz + i dconj(f)/dz). From those two quantities you can get df/dz.

Does that help?
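The grad_ys trick described above can be sketched without TF, modelling tf.gradients by the operator D(C, f) (the function f below is my own non-holomorphic example):

```python
import numpy as np

def wirt(f, z, h=1e-6):
    # (df/dz, dconj(f)/dz) via central differences
    fx = (f(z + h) - f(z - h)) / (2 * h)
    fy = (f(z + 1j * h) - f(z - 1j * h)) / (2 * h)
    return 0.5 * (fx - 1j * fy), 0.5 * (np.conj(fx) - 1j * np.conj(fy))

def D(C, f, z):
    # stand-in for tf.gradients(f(z), z, grad_ys=[C])
    dfdz, dcfdz = wirt(f, z)
    return np.conj(np.conj(C) * dfdz + C * dcfdz)

f = lambda z: np.conj(z) * z * z   # non-holomorphic: df/dz = 2*z*conj(z)
z = 0.4 + 0.9j

g1 = D(1.0, f, z)   # conj(df/dz + dconj(f)/dz)
gj = D(1j, f, z)    # conj(-i*df/dz + i*dconj(f)/dz)

# Solve the two equations for df/dz:
dfdz_rec = 0.5 * (np.conj(g1) + 1j * np.conj(gj))
print(dfdz_rec, 2 * z * np.conj(z))  # both ~ 1.94
```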

### refraction-ray commented Sep 25, 2019

 As a side note for the above discussion, I believe this technical report is a great material for derivatives and gradients definition of complex function in a more mathematical rigorous way. Specifically, in chapter 4, the author shows exactly why "gradient" of real valued complex variable functions is (two times) the complex conjugate of the corresponding partial derivatives (note no derivative can be well defined for non-holomorphic function, only partial derivatives can).

### girving commented Sep 25, 2019

 @martinwicke Yep. Presumably that will resolve the confusion. :)

### NEGU93 commented Sep 25, 2019

> I found out that was the definition by evaluating a bunch of gradients and spotting a pattern, then reading the code to convince myself that what was being computed was consistent with that pattern.

YOU ARE THE BOSS!!! I was trying to reverse engineer it myself. I had some theories and all were taken down at some point. I was expecting the definition to be 2*df/dconj(z) based on a paper on CVNNs (Akira Hirose), which did work for f: C -> R but not for f: C -> C. (Always using Wirtinger calculus, of course.)

One last question, maybe off topic. I tried reading the code myself, but I'm more an electronics engineer than a software engineer. I got to the point where the Python code calls an API of C/C++ code for the gradient, but couldn't really find where to continue from there. You who have read the code can maybe point me in the right direction. Do you know where to start reading in the C code? I would like to read it too to further understand it. For what I do, I think I will have to understand it eventually anyway.

> As a side note for the above discussion, I believe this technical report is a great material for derivatives and gradients definition of complex function in a more mathematical rigorous way. [...]

Good, I will give it a look. Thank you!

### charmasaur commented Sep 25, 2019

> One last question maybe off topic. [...] Do you know where to start reading from the C code?

I won't be much help on this one, unfortunately. For my investigation I just dug around until I found the Python code defining gradients (math_grad.py, linked earlier), and used that in isolation. I've never actually tried to trace through all the way from a gradients call to the C++ code. Maybe one of the TF folks can point to a resource for learning more about that?

### NEGU93 commented Sep 27, 2019

 Just some thought I have been given lately. Another (more compressed) way to write tensorflow's definition of the gradient is: 2*dreal(f) / dconj(z).
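That identity, conj(df/dz + dconj(f)/dz) = 2*dreal(f)/dconj(z), can be checked numerically; a numpy sketch with an arbitrary non-holomorphic f of my choosing:

```python
import numpy as np

def partials(f, z, h=1e-6):
    # df/dx and df/dy via central differences
    fx = (f(z + h) - f(z - h)) / (2 * h)
    fy = (f(z + 1j * h) - f(z - 1j * h)) / (2 * h)
    return fx, fy

f = lambda z: np.exp(z) + np.conj(z) ** 2   # non-holomorphic example
z = 0.2 + 0.5j

fx, fy = partials(f, z)
dfdz = 0.5 * (fx - 1j * fy)
dconjf_dz = 0.5 * (np.conj(fx) - 1j * np.conj(fy))
form_a = np.conj(dfdz + dconjf_dz)   # TF's "gradient"

ux, uy = partials(lambda w: f(w).real, z)
form_b = 2 * 0.5 * (ux + 1j * uy)    # 2 * dreal(f)/dconj(z)
print(form_a, form_b)  # the two agree
```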

### ziofil commented Feb 27, 2020

I need to implement a function f: C --> C with a tf.custom_gradient decorator, so I'm trying to understand TensorFlow's behaviour with non-holomorphic functions. The total differential of f is df = (J_z)dz + (J_z*)dz* (eq. 3.7 in the technical report), where J_z = partial f/partial z and J_z* = partial f/partial z*. How does this reduce to the cases in use by TensorFlow?

Also, I seem to have found that the seemingly correct way to implement the gradient when using tf.custom_gradient is

```python
def grad(dy):
    return tf.reduce_sum(dy * tf.math.conj(J_z) + tf.math.conj(dy) * J_zc)
```

with J_z and J_zc defined above. However, there seem to be cases where this doesn't work. Is it wrong?

I wrote the equation of TensorFlow here, based again on @charmasaur's findings (THANKS!). Basically the equation for the function will be: [equation image]

I am still not sure how to get the definition listed in your paper.

PS: I am not 100% into the case of f: C --> C, as I work with neural networks (the cost function is real), but I am very interested in the topic anyway (cracking the TF gradient), please keep me informed of your findings.

PS 2: I have seen you work just next to me. I am currently doing my PhD in CentraleSupelec. We can get in touch to discuss the topic in person if you want. Contact me if you are interested, my info is in my GitHub profile.

### ziofil commented Feb 28, 2020

 I also work with neural networks with a real loss function, but I also have components in the middle that have complex inputs and complex outputs. The definition that you wrote, which apparently TF is using, matches (I think) the following: which works for f: C -> R. My C to C function is not a loss function, but just an intermediate component, that's why I only want to propagate the gradient correctly without losing any part.

### charmasaur commented Feb 28, 2020

 @ziofil at first glance that custom grad looks reasonable to me, at least in terms of the conjugates and derivatives being taken. Could you give an example of where it doesn't work?

### ziofil commented Feb 28, 2020

 Not without copypasting a ton of code+math... which makes me suspect that the way I'm computing J_z and J_zc might be buggy. At least, as you endorse the custom grad, I can concentrate on tracking down the bug in the math or in its implementation. Thank you both! (I'll get back here if I find out the custom grad was wrong)

> Maybe one of the TF folks can point to a resource for learning more about that?

I actually need a formal verification that the equation given by @charmasaur is indeed the one used by TensorFlow's tf.gradients. Can someone within TensorFlow assert this is indeed the equation, or at least give me a reference for it? Or shall I read the code and do the verification myself, as @charmasaur did? This may also interest you @sylviemonet.

> I wrote the equation of Tensorflow in here but based again in @charmasaur findings (THANKS!). Basically the equation for the function will be:

At first, I have a small question about this equation: what does [symbol image] mean in this equation? Then I want to know how to write this down in the form df = XX dz + XX dz*. If I simply change it to the form df = ( XXXX ) dz [or replace dz by dz*? I don't know], it would look like: [equation image]. In the same tutorial @ziofil mentioned before, it looks like: [equation image]. Here I want to know how to explain these two expressions. (Note: I've also tried to calculate them by replacing the partial derivatives by the Wirtinger derivatives; they are not equal.)

All the equations above are about a complex-valued function f. Next, consider the case where f is a real-valued function. I agree with this result in the case that the definition of the gradient is [equation image]. Here we use the same idea to change it to the form df = ( XXXX ) dz, which would give: [equation image]. We can also quote the equation from the tutorial to compare the two equations, given that f is real-valued: [equation image]. The REAL operator applies to all parts, not only the function f. Here I don't know how to explain this; or am I wrong in some definitions or concepts?

Unfortunately, I have no answers about that, @sylviemonet. However, I don't agree on [equation image]: that would assume the gradient is df/dz, which is not right.

This thread seems to be having some interest. I asked the question on stackoverflow because Google asks us to do that (https://www.tensorflow.org/community). Shall we move the discussion there? Or create a new issue on GitHub (although I believe TensorFlow encourages using StackOverflow for this kind of question)? Or even a new question on StackOverflow?

### refraction-ray commented Mar 3, 2020

To customize the gradient of f: C --> C in TensorFlow, follow the procedure below. Always imagine a final real output L, so that the forward pass looks like y = f(x), L = g(y), where x and y can be complex vectors. The full differential would be: [equation image] where \bar{x} is defined as \partial L/\partial x, and the same for \bar{y}.

We can express the differential dy as dy(dx) from y = f(x), then plug dy(dx) into the above formula. Since dx and dx^* are independent, and dx appears on both sides of the equation, the coefficient before dx on the right-hand side is just $\bar{x}$ expressed in terms of $\bar{y}$, and $\bar{x}(\bar{y})$ is the derivative back-propagation formula the user should customize in autograd. For TensorFlow, since it back-propagates gradients instead of derivatives, one should make the conjugate of $\bar{x}(\bar{y})$ the customized gradient primitive. This gradient is consistent and as expected when the final output is real.

Maybe the above clarification resolves some confusions :)

### NEGU93 commented Mar 3, 2020

> To customize gradient for f: C-->C in tensorflow, follows the procedures below. [...]

Can you cite the source of how you can assert that, @refraction-ray?

### refraction-ray commented Mar 3, 2020

@NEGU93, for the general approach to deriving back-propagation formulas, any paper on automatic differentiation of linear algebra can serve as a reference, e.g. 3.5.2 in https://arxiv.org/pdf/1701.00392.pdf. For why such a back-propagation formula (actually its conjugate) is utilized in TF custom gradients, I have no specific references. I am sure it is true since I have contributed gradients for some operations to TensorFlow. And you can double-check it by comparison with numerical derivatives.

### NEGU93 commented Mar 11, 2020

> @NEGU93 , for the general approach to get back propagation formula, any paper on auto differentiation of linear algebra can serve as a reference, eg. 3.5.2 in https://arxiv.org/pdf/1701.00392.pdf.

Wow, that paper was indeed great! Helped a lot, thanks!

I hope the following can help those in need of an explanation.

## Parameter update rule

The gradient descent update of a complex parameter needs the gradient of the loss function with respect to the conjugate of the parameter (proof): [equation image]

Clearly this update rule falls back to the usual update rule in case of a real parameter.

## Computation of the update

There are two rules to remember when computing the gradient for the update:

1. Treat variables and their complex conjugates as independent. This allows you to handle non-holomorphic functions. For example, suppose there is a non-holomorphic function f: C -> C between z and the loss; then the chain rule is dL/dz* = (dL/df)(df/dz*) + (dL/df*)(df*/dz*).

2. Numerators and denominators of partial derivative expressions behave "independently" with respect to complex conjugation, so the following identities hold (in the first one, remember that L is real): dL/dz* = conj(dL/dz) and df*/dz* = conj(df/dz).

In Tensorflow when we write a new expression and we want to customize its gradient, we need to use the @tf.custom_gradient decorator and within the scope of the function we define a gradient function and return it after the output of the expression. This function is then used internally by TF, which calls it and passes the upstream gradient to it. So for example, if we want to customize the function f in the example above, we would write

```python
@tf.custom_gradient
def f(z):
    # code to compute output = f(z)
    def grad(dy):
        # see below
        return dL
    return output, grad
```

The role of grad(dy) is to propagate the gradient of the final value (i.e. the value of the loss function L) backwards, past the expression that we are customizing. So grad is given the upstream gradient dy = dL/df* (notice that TF diligently supplies the gradient with respect to the conjugate of our complex function) and grad has to return dL/dz*.

Why doesn't TF supply two gradients given all this fuss about treating variables and their conjugate independently? Because dL/dz* and dL/dz are not independent: they are the conjugate of each other (L is real), and TF always assumes that at the end of the line there is a real loss function to optimize.

So in the body of grad(dy) we should compute df1 = df/dz and df2 = df/dz* (this time independently because f is complex!) and combine them with dy and its conjugate as prescribed by the chain rule:

```python
def grad(dy):
    # code to compute df1 = df/dz and df2 = df/dz*
    # recall that dy = dL/df*
    dL = dy * tf.math.conj(df1) + tf.math.conj(dy) * df2
    return dL
```
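This chain rule can be sanity-checked without TF by wiring up a toy pipeline in numpy: a made-up non-holomorphic f, the loss L = |f|^2 (so the upstream gradient is dy = dL/df* = f), and finite differences for the Wirtinger derivatives:

```python
import numpy as np

def partials(f, z, h=1e-6):
    # df/dx and df/dy via central differences
    fx = (f(z + h) - f(z - h)) / (2 * h)
    fy = (f(z + 1j * h) - f(z - 1j * h)) / (2 * h)
    return fx, fy

def wirt(f, z):
    # (df/dz, df/dz*) from the real partials
    fx, fy = partials(f, z)
    return 0.5 * (fx - 1j * fy), 0.5 * (fx + 1j * fy)

f = lambda z: z * z + 2 * np.conj(z)   # hypothetical non-holomorphic layer
L = lambda z: abs(f(z)) ** 2           # real loss L = f * conj(f)
z = 0.6 - 0.3j

df1, df2 = wirt(f, z)   # df/dz and df/dz*
dy = f(z)               # upstream gradient dL/df* for this loss

dL = dy * np.conj(df1) + np.conj(dy) * df2   # the chain rule above

# Compare with dL/dz* computed directly from the real loss
Lx, Ly = partials(L, z)
print(dL, 0.5 * (Lx + 1j * Ly))  # the two agree
```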

### Special cases

If f: R->C (i.e. if z is a real variable) then df1 and df2 are the conjugate of each other. So we can simplify the chain rule and write dL = 2*tf.math.real(dy*tf.math.conj(df1))

If f:C->C is holomorphic, then f doesn't depend on z*, which means that df2 = 0. So we can simplify the chain rule and write dL = dy*tf.math.conj(df1)
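The holomorphic special case is easy to confirm numerically: for f = exp, df/dz* vanishes, so the full chain rule collapses to dy*conj(df1). A numpy sketch (dy is an arbitrary made-up upstream gradient):

```python
import numpy as np

def wirt(f, z, h=1e-6):
    # (df/dz, df/dz*) via central differences
    fx = (f(z + h) - f(z - h)) / (2 * h)
    fy = (f(z + 1j * h) - f(z - 1j * h)) / (2 * h)
    return 0.5 * (fx - 1j * fy), 0.5 * (fx + 1j * fy)

dy = 0.3 + 0.8j   # arbitrary upstream gradient
z = 0.1 + 0.7j

df1, df2 = wirt(np.exp, z)   # exp is holomorphic, so df2 ~ 0
print(abs(df2))              # ~ 0

full = dy * np.conj(df1) + np.conj(dy) * df2
short = dy * np.conj(df1)
print(full, short)           # the two agree
```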

### Multiple variables

The shape of dy is the same as the output of f (i.e. if f outputs a tensor of shape (a,b,c...) then dy will also be a tensor of the same shape). On the other hand, the return values of grad, (being the gradients of the loss with respect to the various inputs of f) have to match the shape of each input.

For example, if f takes two complex vectors of shape (a,) and (b,) as input variables and returns a matrix of shape (c,d), then dy has shape (c,d) and grad has to return two vectors of shape (a,) and (b,).


### ezyang commented Jul 22, 2020

I want to point out that Tensorflow disagrees with JAX on what the grad should be:

```python
from jax import grad

def f(z):
    return z * z

z = 0.1j
print(grad(f, holomorphic=True)(z))
```

prints 0.2j


> I want to point out that Tensorflow disagrees with JAX on what the grad should be: [...] prints 0.2j

I don't know about JAX, but if that's the case, then JAX is computing the derivative of f w.r.t. z. When you compute a gradient to minimize a function, you should compute the derivative w.r.t. the conjugate of z. Therefore, the gradient of f is NOT equal to the derivative w.r.t. z. Here there's a good explanation of why this is the case. Conclusion: TensorFlow is computing what should be expected.

### ezyang commented Jul 23, 2020

> This isn't correct: the easiest way to see this is to imagine that the inputs to a network are real, the outputs are real, and in the middle of the network is a holomorphic function. In this case, the optimizer has no idea that there's a holomorphic function involved in the middle of a computation: it sees a bunch of real stuff, and would have to do a full graph traversal to notice the issue.

I'll also chime in that this reasoning isn't correct, @girving. The optimizer doesn't need to traverse the entire function. This is because when you use the d/dz Wirtinger derivative, you have to conjugate the adjoint when you view real-as-complex or complex-as-real (which is how you ostensibly embedded the holomorphic function inside an R-to-R function). (None of this changes the fact that which derivative definition you use is a matter of convention, and there are reasons to prefer one or the other.)

### ziofil commented Jul 23, 2020

Jax computes complex derivatives in a different way than TF. Take a function f(z): C -> C. Jax breaks down the input into real and imaginary parts, z = x + iy, and defines the function as f(z) = u(x,y) + i v(x,y), where u and v are real functions with real arguments. Then, Jax defines the Jacobian of f at z as a 2x2 matrix (two functions, two variables). Now, in your example f is holomorphic, so its Jacobian has a special structure (it's a rescaled version of a rotation matrix). In turn this means that you can ignore the imaginary part v(x,y), and everything still works as expected. The derivative that Jax computes is then given by the vector-Jacobian product, and it's 2z.
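The "rescaled rotation" structure is just the Cauchy-Riemann equations, and can be seen numerically; a numpy sketch for f(z) = z*z at z = 0.1j:

```python
import numpy as np

def real_jacobian(f, z, h=1e-6):
    # 2x2 Jacobian [[ux, uy], [vx, vy]] of f = u + i*v at z = x + i*y
    x, y = z.real, z.imag
    fx = (f((x + h) + 1j * y) - f((x - h) + 1j * y)) / (2 * h)
    fy = (f(x + 1j * (y + h)) - f(x + 1j * (y - h))) / (2 * h)
    return np.array([[fx.real, fy.real],
                     [fx.imag, fy.imag]])

J = real_jacobian(lambda z: z * z, 0.1j)
print(J)  # ~ [[0, -0.2], [0.2, 0]]: a scaled rotation (Cauchy-Riemann)

# The complex derivative read off the Jacobian is 2z:
print(J[0, 0] + 1j * J[1, 0])  # ~ 0.2j
```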

### anjali411 commented Aug 14, 2020

```python
>>> a = tf.Variable(1+1j)
>>> @tf.function
... def example(a):
...     b = tf.math.conj(a)
...     return tf.gradients(b, a, grad_ys=tf.constant(1+1j))
...
>>> example(a)
[]
```

(AutoGraph warns that it could not transform the function and will run it as-is, because source code is not available for functions defined in the interactive Python shell.)

Shouldn't the gradient returned be 1+1j according to the tensorflow gradient convention?

Nope: you are passing the upstream gradient as dL/db* = 1+1j, therefore

dL/da* = (dL/db*)(db*/da*) + (dL/db)(db/da*) = (1+1j) x 0 + (1-1j) x 1 = 1-1j

because b = a* (your function), and TF gives you the gradient for the update of a, which is dL/da* (i.e. the gradient with respect to the conjugate of a). Note that the upstream gradient is also interpreted as dL/db* and not dL/db. (Here L is a hypothetical real loss function which TF assumes to exist at the end of the computation.)
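The same answer falls out of the D(C, f) operator from earlier in the thread, modelled in numpy with finite differences:

```python
import numpy as np

def wirt(f, z, h=1e-6):
    # (df/dz, dconj(f)/dz) via central differences
    fx = (f(z + h) - f(z - h)) / (2 * h)
    fy = (f(z + 1j * h) - f(z - 1j * h)) / (2 * h)
    return 0.5 * (fx - 1j * fy), 0.5 * (np.conj(fx) - 1j * np.conj(fy))

def D(C, f, z):
    # stand-in for tf.gradients(f(z), z, grad_ys=[C])
    dfdz, dcfdz = wirt(f, z)
    return np.conj(np.conj(C) * dfdz + C * dcfdz)

# b = conj(a) with grad_ys = 1+1j gives 1-1j, as derived above
result = D(1 + 1j, np.conj, 1 + 1j)
print(result)  # ~ (1-1j)
```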
