In [1]:
import tensorflow as tf
print(tf.__version__)

2.0.0-beta1


In [2]:
# Approximating partial derivatives with finite differences (small step eps)
def f(w1, w2):
    return 3 * w1 ** 2 + 2 * w1 * w2
w1, w2 = 5, 3
eps = 1e-6

print("partial derivative wrt w1, [(w1, w2) = (5, 3)]: ", (f(w1 + eps, w2) - f(w1, w2)) / eps)

partial derivative wrt w1, [(w1, w2) = (5, 3)]:  36.000003007075065


In [3]:
print("partial derivative wrt w2, [(w1, w2) = (5, 3)]: ", (f(w1, w2 + eps) - f(w1, w2)) / eps)

partial derivative wrt w2, [(w1, w2) = (5, 3)]:  10.000000003174137
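The forward difference used above has an error roughly proportional to `eps`, which is why the first result is off in the sixth decimal place. A central difference is a common alternative that usually reduces this error. The sketch below is only an illustration: `central_diff` is a hypothetical helper, not part of the original notebook.

```python
def f(w1, w2):
    return 3 * w1 ** 2 + 2 * w1 * w2

def central_diff(func, args, index, eps=1e-6):
    """Approximate the partial derivative of func with respect to args[index]."""
    hi = list(args)
    lo = list(args)
    hi[index] += eps   # step forward along one parameter
    lo[index] -= eps   # step backward along the same parameter
    return (func(*hi) - func(*lo)) / (2 * eps)

dw1 = central_diff(f, (5.0, 3.0), 0)
dw2 = central_diff(f, (5.0, 3.0), 1)
print("dw1 ~", dw1, "\t dw2 ~", dw2)
```

Note that this still needs two calls to `f` for every parameter, so the cost grows with the number of parameters.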


In [15]:
# Using GradientTape

w1, w2 = tf.Variable(5.), tf.Variable(3.)
with tf.GradientTape() as tape:
    z = f(w1, w2)

gradients = tape.gradient(z, [w1, w2])

print("dw1 : {} \t dw2 : {} ".format(gradients[0].numpy(), gradients[1].numpy()))

dw1 : 36.0 	 dw2 : 10.0 


Not only is the result accurate (the precision is limited only by floating-point
errors), but the ```gradient()``` method goes through the recorded computations only
once (in reverse order), no matter how many variables there are, so it is
incredibly efficient. It’s like magic!

Only put the strict minimum inside the ```tf.GradientTape()``` block,
to save memory. Alternatively, you can pause recording by creating
a ```with tape.stop_recording()``` block inside the ```tf.GradientTape()``` block.
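As a minimal sketch of pausing the tape, the cell below computes one value inside a `stop_recording()` block and one outside it. The variable `w` and the values are made up for illustration, and `persistent=True` is used here only so that `gradient()` can be called twice.

```python
import tensorflow as tf

w = tf.Variable(3.0)

# persistent=True only so that gradient() can be called twice below
with tf.GradientTape(persistent=True) as tape:
    y = w ** 2                    # recorded by the tape
    with tape.stop_recording():
        side = w * 100.0          # computed, but not recorded

dy_dw = tape.gradient(y, w)
dside_dw = tape.gradient(side, w)  # no recorded path from side back to w
del tape

print("dy_dw :", dy_dw)
print("dside_dw :", dside_dw)
```

Since the multiplication by 100 was never recorded, the tape has no way to connect `side` to `w`.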

**The tape is automatically erased immediately after you call its gradient() method, so
you will get an exception if you try to call gradient() twice.**

In [19]:
with tf.GradientTape() as tape:
    z = f(w1, w2)
dz_dw1 = tape.gradient(z, w1) # => tensor 36.0
print("dz_dw1 :", dz_dw1)
dz_dw2 = tape.gradient(z, w2) # RuntimeError!
print("dz_dw2: ", dz_dw2) # never reached

dz_dw1 : tf.Tensor(36.0, shape=(), dtype=float32)


RuntimeError: GradientTape.gradient can only be called once on non-persistent tapes.

**If you need to call ```gradient()``` more than once, you must make the tape persistent,
and delete it when you are done with it to free resources:**

In [20]:
with tf.GradientTape(persistent=True) as tape:
    z = f(w1, w2)
dz_dw1 = tape.gradient(z, w1) # => tensor 36.0
print("dz_dw1 :", dz_dw1)
dz_dw2 = tape.gradient(z, w2) # => tensor 10.0, works fine now
print("dz_dw2: ", dz_dw2)
del tape

dz_dw1 : tf.Tensor(36.0, shape=(), dtype=float32)
dz_dw2:  tf.Tensor(10.0, shape=(), dtype=float32)


**By default, the tape will only track operations involving variables, so if you try to
compute the gradient of z with regard to anything other than a variable, the result will
be None.**

However, you can force the tape to watch any tensors you like, to record every operation
that involves them. You can then compute gradients with regard to these tensors,
as if they were variables.

In [22]:
c1, c2 = tf.constant(5.), tf.constant(3.)
with tf.GradientTape() as tape:
    z = f(c1, c2)
gradients = tape.gradient(z, [c1, c2]) # returns [None, None]
print(gradients)

[None, None]


In [25]:
with tf.GradientTape() as tape:
    tape.watch(c1)
    tape.watch(c2)
    z = f(c1, c2)
gradients = tape.gradient(z, [c1, c2])
print("dc1 : {} \t dc2 : {} ".format(gradients[0].numpy(), gradients[1].numpy()))

dc1 : 36.0 	 dc2 : 10.0 


This can be useful in some cases, for example if you want to implement a regularization
loss that penalizes activations that vary a lot when the inputs vary little: the loss
will be based on the gradient of the activations with regard to the inputs. Since the
inputs are not variables, you would need to tell the tape to watch them.
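One possible way to sketch such a penalty is with two nested tapes: the inner tape gives the gradient of the activations with regard to the inputs, and the outer tape differentiates the resulting penalty with regard to the weights. Everything here is an assumption made for illustration (the tiny `tanh` layer, the weights `w`, the inputs `x`, and the squared-gradient penalty); it is not taken from the original text.

```python
import tensorflow as tf

x = tf.constant([[1.0, 2.0]])        # inputs: a constant, not a variable
w = tf.Variable([[0.5], [-0.3]])     # hypothetical layer weights

with tf.GradientTape() as outer_tape:
    with tf.GradientTape() as inner_tape:
        inner_tape.watch(x)          # inputs must be watched explicitly
        activations = tf.tanh(tf.matmul(x, w))
    # Gradient of the activations with regard to the inputs
    dact_dx = inner_tape.gradient(activations, x)
    # Penalize activations that change quickly when the inputs change
    penalty = tf.reduce_sum(tf.square(dact_dx))

# Gradient of the penalty with regard to the weights, usable by an optimizer
dpenalty_dw = outer_tape.gradient(penalty, [w])
print("penalty :", penalty.numpy())
print("dpenalty_dw :", dpenalty_dw[0].numpy())
```

The inner `gradient()` call is made inside the outer tape's block so that the outer tape records it.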

**In some rare cases you may want to stop gradients from backpropagating through
some part of your neural network. To do this, you must use the ```tf.stop_gradient()``` 
function: it just returns its inputs during the forward pass (like ```tf.identity()```), 
but it does not let gradients through during backpropagation (it acts like a constant).**

For example:

In [26]:
def f(w1, w2):
    return 3 * w1 ** 2 + tf.stop_gradient(2 * w1 * w2)

with tf.GradientTape() as tape:
    z = f(w1, w2) # same result as without stop_gradient()
    
gradients = tape.gradient(z, [w1, w2]) # => returns [tensor 30., None]

print(gradients)

[<tf.Tensor: id=1083, shape=(), dtype=float32, numpy=30.0>, None]


You may occasionally run into numerical issues when computing gradients.
For example, if you compute the gradients of the my_softplus() function for
large inputs, the result will be NaN.
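The notebook does not show my_softplus() at this point, so the cell below assumes the naive definition `log(exp(z) + 1)` and a float32 input of 100, which is large enough for `exp` to overflow. Treat it as a sketch of the problem rather than the original code.

```python
import tensorflow as tf

def my_softplus(z):
    # Assumed definition: naive softplus, log(exp(z) + 1)
    return tf.math.log(tf.exp(z) + 1.0)

x = tf.Variable([100.0])
with tf.GradientTape() as tape:
    z = my_softplus(x)

gradients = tape.gradient(z, [x])
print("z :", z.numpy())
print("gradients :", gradients[0].numpy())
```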

This is because computing the gradients of this function using autodiff leads to some
numerical difficulties: due to floating point precision errors, autodiff ends up computing infinity divided by infinity (which returns NaN).

Fortunately, we can analytically find that the derivative of the softplus
function is just $1 / (1 + 1 / \exp(x))$, which is numerically stable.
Next, we can tell TensorFlow to use this stable function when
computing the gradients of the my_softplus() function, by decorating it with
```@tf.custom_gradient``` and making it return both its normal output and the function
that computes the derivatives. Note that this function receives as input the gradients
that were backpropagated so far, down to the softplus function, and according to the
chain rule we should multiply them by this function’s gradients.

In [27]:
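@tf.custom_gradient
def my_better_softplus(z):
    exp = tf.exp(z)
    def my_softplus_gradients(grad):
        # Chain rule: upstream gradient times the stable derivative 1 / (1 + 1 / exp(z))
        return grad / (1 + 1 / exp)
    # Return the normal output together with the function that computes the gradients
    return tf.math.log(exp + 1), my_softplus_gradients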
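To check the custom gradient, the sketch below repeats the definition so that the cell is self-contained and evaluates it at a large input. The input value of 100 and the use of a watched constant are choices made for this illustration.

```python
import tensorflow as tf

@tf.custom_gradient
def my_better_softplus(z):
    exp = tf.exp(z)
    def my_softplus_gradients(grad):
        return grad / (1 + 1 / exp)
    return tf.math.log(exp + 1), my_softplus_gradients

x = tf.constant([100.0])
with tf.GradientTape() as tape:
    tape.watch(x)                # x is a constant, so it must be watched
    z = my_better_softplus(x)

gradients = tape.gradient(z, [x])
print("z :", z.numpy())
print("gradients :", gradients[0].numpy())
```

The gradient is now finite, but the forward output still overflows to infinity for an input this large, because `tf.exp(100.)` does not fit in float32. The custom gradient only repairs the backward pass.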