# <p style="text-align:center;">TensorFlow - IV</p>
---
*<p style="text-align:right;">Reference: TensorFlow Official Docs</p>*


In [1]:
import tensorflow as tf
import numpy as np
import matplotlib.pyplot as plt

# Introduction to Gradients and Automatic Differentiation

Automatic differentiation (AD) is useful for implementing machine learning algorithms such as backpropagation for training neural networks.

>AD exploits the fact that every computer program, no matter how complicated, executes a sequence of elementary arithmetic operations (addition, subtraction, multiplication, division, etc.) and elementary functions (exp, log, sin, cos, etc.). By applying the chain rule repeatedly to these operations, derivatives of arbitrary order can be computed automatically, accurately to working precision, and using at most a small constant factor more arithmetic operations than the original program.

In this guide, you will explore ways to compute gradients with TensorFlow, especially in eager execution. Eager execution provides an imperative interface to TensorFlow. 

### Eager Execution
With eager execution enabled, TensorFlow functions execute operations immediately (as opposed to adding to a graph to be executed later in a `tf.compat.v1.Session`) and return concrete values (as opposed to symbolic references to a node in a computational graph). Eager execution is enabled by default in TensorFlow 2, so `tf.executing_eagerly()` returns `True` in most cases. It can still return `False` in some situations, for example when it is called inside a `tf.function`.
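As a minimal sketch of this difference, the snippet below checks `tf.executing_eagerly()` at the top level and again while a `tf.function` is being traced. The names `traced` and `seen_inside` are illustrative and not part of the original notebook, and the sketch assumes a default TensorFlow 2 setup in which `tf.config.run_functions_eagerly(True)` has not been called.

```python
import tensorflow as tf

# At the top level of a TF 2 program, ops run immediately and return values.
eager_outside = tf.executing_eagerly()
total = tf.add(1, 2)  # a concrete tensor whose value can be read right away
print(eager_outside, total.numpy())

seen_inside = []  # hypothetical helper list, used only for this illustration

@tf.function
def traced():
    # This Python line runs while TensorFlow traces the function into a graph.
    seen_inside.append(tf.executing_eagerly())
    return tf.constant(0)

traced()
print(seen_inside)
```

The value recorded inside `traced` reflects graph mode, because the body of a `tf.function` is traced into a graph rather than executed eagerly.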

### Graph Execution
Since eager execution runs all operations one-by-one in Python, it cannot take advantage of potential acceleration opportunities. Graph execution extracts tensor computations from Python and builds an efficient graph before evaluation. Graphs, or `tf.Graph` objects, are special data structures that contain `tf.Operation` and `tf.Tensor` objects. While `tf.Operation` objects represent units of computation, `tf.Tensor` objects represent units of data. Graphs can be saved, run, and restored without the original Python code, which provides extra flexibility for cross-platform applications. With a graph, you can use your model in mobile, embedded, and backend environments where Python is unavailable.
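To make the idea of a graph more concrete, here is a small sketch that wraps a computation in `tf.function` and inspects the resulting `tf.Graph`. The function name `affine` and the input values are assumptions chosen for illustration.

```python
import tensorflow as tf

@tf.function
def affine(x, w, b):
    # Traced into a graph the first time it is called with these input types.
    return tf.matmul(x, w) + b

x = tf.constant([[1.0, 2.0, 3.0]])
w = tf.ones((3, 2))
b = tf.zeros(2)

eager_result = tf.matmul(x, w) + b   # runs op-by-op in Python
graph_result = affine(x, w, b)       # runs the traced graph

# Retrieve the traced graph and list the types of the operations it contains.
graph = affine.get_concrete_function(x, w, b).graph
op_types = [op.type for op in graph.get_operations()]
print(op_types)
```

Both code paths compute the same values; only the execution strategy differs.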

## 1. Computing Gradients

To differentiate automatically, TensorFlow needs to remember what operations happened in what order during the *forward pass*. Then, during the *backward pass*, TensorFlow traverses this list of operations in reverse order to compute gradients.
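The record-then-replay idea can be sketched in plain Python. This is a toy illustration only, not how TensorFlow is implemented: the `Tape`, `Value`, and `gradient` names are hypothetical, and only addition and multiplication of scalars are supported.

```python
class Tape:
    """Toy tape: stores (output, [(input, local_derivative), ...]) in execution order."""
    def __init__(self):
        self.records = []


class Value:
    """A scalar that records every operation applied to it on a tape."""
    def __init__(self, data, tape):
        self.data = data
        self.tape = tape

    def __add__(self, other):
        out = Value(self.data + other.data, self.tape)
        # d(a + b)/da = 1 and d(a + b)/db = 1
        self.tape.records.append((out, [(self, 1.0), (other, 1.0)]))
        return out

    def __mul__(self, other):
        out = Value(self.data * other.data, self.tape)
        # d(a * b)/da = b and d(a * b)/db = a
        self.tape.records.append((out, [(self, other.data), (other, self.data)]))
        return out


def gradient(tape, target, sources):
    """Backward pass: walk the records in reverse and apply the chain rule."""
    grads = {id(target): 1.0}
    for out, parents in reversed(tape.records):
        upstream = grads.get(id(out))
        if upstream is None:
            continue  # this op does not influence the target
        for parent, local in parents:
            grads[id(parent)] = grads.get(id(parent), 0.0) + upstream * local
    # Sources that never contributed to the target have no entry and yield None.
    return [grads.get(id(s)) for s in sources]


tape = Tape()
x = Value(3.0, tape)
w = Value(2.0, tape)
y = x * x + w * x            # forward pass: y = x^2 + w*x
dy_dx, dy_dw = gradient(tape, y, [x, w])
print(dy_dx, dy_dw)          # dy/dx = 2x + w, dy/dw = x
```

The forward pass appends one record per operation; the backward pass visits those records last-to-first, accumulating each input's contribution.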

## 2. Gradient Tapes

TensorFlow provides the `tf.GradientTape` API for automatic differentiation; that is, computing the gradient of a computation with respect to some inputs, usually `tf.Variable`s. TensorFlow "records" relevant operations executed inside the context of a `tf.GradientTape` onto a "tape". TensorFlow then uses that tape to compute the gradients of a "recorded" computation using reverse mode differentiation.

Here is a simple example:

In [2]:
x = tf.Variable(3.0)

with tf.GradientTape() as tape:
    y = x**2

Once you've recorded some operations, use `GradientTape.gradient(target, sources)` to calculate the gradient of some target (often a loss) relative to some source (often the model's variables):

In [3]:
dy_dx = tape.gradient(y, x)  # gradient(target, source): computes dy/dx at x = 3.0
dy_dx.numpy()

6.0

The above example uses scalars, but `tf.GradientTape` works as easily on any tensor, as shown below. By default, the resources held by a `GradientTape` are released as soon as the `GradientTape.gradient()` method is called. To compute multiple gradients over the same computation, create a persistent gradient tape with `persistent=True`. This allows multiple calls to the `gradient()` method, because the resources are then released only when the tape object is garbage collected.

In [4]:
w = tf.Variable(tf.random.normal((3,2)), name='w')
b = tf.Variable(tf.zeros(2, dtype = tf.float32), name = 'b')

x = [[1.,2.,3.]]

with tf.GradientTape(persistent=True) as tape:
    y = x@w + b
    loss = tf.reduce_mean(y**2)

To get the gradient of `loss` with respect to both variables, you can pass both as sources to the `gradient` method. The tape is flexible about how sources are passed and will accept any nested combination of lists or dictionaries and return the gradient structured the same way (see `tf.nest`). 

In [5]:
#derivative of loss w.r.t. 'w' and 'b'
[dl_dw,dl_db] = tape.gradient(loss,[w,b])

The gradient w.r.t each source has the shape of the source:

In [6]:
print(w.shape)
print(dl_dw.shape)

(3, 2)
(3, 2)


Here is the gradient calculation again, this time passing a dictionary of variables:

In [7]:
my_vars = {'w':w,'b':b}

grad = tape.gradient(loss, my_vars)
grad['b']

<tf.Tensor: shape=(2,), dtype=float32, numpy=array([-6.2749133 ,  0.24908018], dtype=float32)>

## 3. Gradients w.r.t a model
It's common to collect `tf.Variable`s into a `tf.Module` or one of its subclasses (`layers.Layer`, `keras.Model`) for checkpointing and exporting.

In most cases, you will want to calculate gradients with respect to a model's trainable variables. Since all subclasses of `tf.Module` aggregate their variables in the `Module.trainable_variables` property, you can calculate these gradients in a few lines of code:

In [8]:
layer = tf.keras.layers.Dense(2,activation = 'relu')
x = tf.constant([[1.,2.,3.]])

with tf.GradientTape() as tape:
    y = layer(x)
    loss = tf.reduce_mean(y**2)
    
grad = tape.gradient(loss, layer.trainable_variables)

In [9]:
for var,g in zip(layer.trainable_variables, grad):
    print(f'{var.name}, shape:{g.shape}')

dense/kernel:0, shape:(3, 2)
dense/bias:0, shape:(2,)
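The same pattern works for a plain `tf.Module`. The sketch below defines a hypothetical `Linear` module with fixed initial values (so the numbers are reproducible), computes gradients with respect to `trainable_variables`, and applies one hand-written gradient-descent step. The learning rate of `0.01` is an arbitrary assumption.

```python
import tensorflow as tf

class Linear(tf.Module):  # hypothetical module, for illustration only
    def __init__(self):
        super().__init__()
        self.w = tf.Variable(tf.ones((3, 2)), name='w')
        self.b = tf.Variable(tf.zeros(2), name='b')

    def __call__(self, x):
        return x @ self.w + self.b

def compute_loss(model, x):
    return tf.reduce_mean(model(x) ** 2)

model = Linear()
x = tf.constant([[1., 2., 3.]])
learning_rate = 0.01  # assumed value

with tf.GradientTape() as tape:
    loss_before = compute_loss(model, x)

# Gradients come back in the same order as `trainable_variables`.
grads = tape.gradient(loss_before, model.trainable_variables)

# One plain gradient-descent step: var <- var - learning_rate * grad
for var, g in zip(model.trainable_variables, grads):
    var.assign_sub(learning_rate * g)

loss_after = compute_loss(model, x)
print(loss_before.numpy(), loss_after.numpy())
```

In practice you would usually hand the gradients to an optimizer instead of updating the variables by hand; the manual loop is only meant to show what the gradients are used for.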


## 4. Controlling What the Tape Watches

The default behavior is to record all operations after accessing a trainable `tf.Variable`. The reasons for this are:

* The tape needs to know which operations to record in the forward pass to calculate the gradients in the backwards pass.

* The tape holds references to intermediate outputs, so you don't want to record unnecessary operations.

* The most common use case involves calculating the gradient of a loss with respect to all of a model's trainable variables.

For example, the following fails to calculate a gradient because the `tf.Tensor` is not "watched" by default, and the `tf.Variable` is not trainable:

In [10]:
x0 = tf.Variable(3.0, name = 'x0') #a trainable var
x1 = tf.Variable(3.0, name = 'x1', trainable = False) #non-trainable
x2 = tf.Variable(2.0, name = 'x2') + 1.0 #var + tensor returns tensor and tensor is not watched by default
x3 = tf.constant(3.0, name = 'x3') #not a var

In [11]:
with tf.GradientTape() as tape:
    y = (x0**2) + (x1**2) + (x2**2)
    
grad = tape.gradient(y, [x0,x1,x2,x3])

In [12]:
for g in grad:
    print(g)

tf.Tensor(6.0, shape=(), dtype=float32)
None
None
None


You can list the variables being watched by the tape using the `GradientTape.watched_variables` method:

In [13]:
[var.name for var in tape.watched_variables()]

['x0:0']

`tf.GradientTape` provides hooks that give the user control over what is or is not watched.

To record gradients with respect to a `tf.Tensor`, you need to call `GradientTape.watch(x)`:

In [14]:
x = tf.constant(3.0)
with tf.GradientTape() as tape:
    tape.watch(x)
    y = x**2

dy_dx = tape.gradient(y,x)
print(dy_dx.numpy())

6.0
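The hook also works in the other direction. As a short sketch with illustrative variable values, passing `watch_accessed_variables=False` when creating the tape turns off the automatic watching of trainable variables, so that only what you explicitly `watch` is tracked:

```python
import tensorflow as tf

x0 = tf.Variable(1.0)
x1 = tf.Variable(10.0)

with tf.GradientTape(watch_accessed_variables=False) as tape:
    tape.watch(x1)            # only x1 is watched; x0 is ignored by the tape
    y = x0**2 + x1**2

grad = tape.gradient(y, {'x0': x0, 'x1': x1})

print(grad['x0'])             # no gradient, because x0 was never watched
print(grad['x1'].numpy())     # dy/dx1 = 2 * x1
```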


## 5. Intermediate Results

You can also request gradients of the output with respect to intermediate values computed inside the `tf.GradientTape` context.

In [15]:
x = tf.constant(3.0)

with tf.GradientTape() as tape:
    tape.watch(x)
    y = x*x
    z = y*y
# Use the tape to compute the gradient of z with respect to the
# intermediate value y.
# dz_dy = 2 * y and y = x ** 2 = 9
print(tape.gradient(z,y).numpy())

18.0


By default, the resources held by a `GradientTape` are released as soon as the `GradientTape.gradient` method is called. To compute multiple gradients over the same computation, create a gradient tape with `persistent=True`. This allows multiple calls to the gradient method as resources are released when the tape object is garbage collected. For example:

In [16]:
x = tf.constant([1,3.0])

with tf.GradientTape(persistent=True) as tape:
    tape.watch(x)
    y = x*x
    z = y*y

print(tape.gradient(z,x).numpy())
print(tape.gradient(y,x).numpy())

[  4. 108.]
[2. 6.]


In [17]:
del tape # drop the reference
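For contrast, here is a small sketch of what happens without `persistent=True`: the first call to `gradient` succeeds, and a second call on the same tape raises a `RuntimeError`. The variable names are illustrative.

```python
import tensorflow as tf

x = tf.constant(3.0)

with tf.GradientTape() as tape:   # non-persistent by default
    tape.watch(x)
    y = x * x

first = tape.gradient(y, x)       # works: the tape is consumed here
print(first.numpy())

second_call_failed = False
try:
    tape.gradient(y, x)           # the tape's resources were already released
except RuntimeError as e:
    second_call_failed = True
    print(f'{type(e).__name__}: {e}')
```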

## 6. Notes on Performance

* There is a tiny overhead associated with doing operations inside a gradient tape context. For most eager execution this will not be a noticeable cost, but you should still use a tape context only around the areas where it is required.

* Gradient tapes use memory to store intermediate results, including inputs and outputs, for use during the backwards pass.

For efficiency, some ops (like ReLU) don't need to keep their intermediate results and they are pruned during the forward pass. However, if you use `persistent=True` on your tape, nothing is discarded and your peak memory usage will be higher.

## 7. Control Flow

Because a gradient tape records operations as they are executed, Python control flow is naturally handled (for example, `if` and `while` statements).

Here a different variable is used on each branch of an `if`. The gradient only connects to the variable that was used:

In [18]:
x = tf.constant(1.0)

v0 = tf.Variable(2.0)
v1 = tf.Variable(2.0)

with tf.GradientTape(persistent=True) as tape:
    tape.watch(x)
    if x>0.0:
        result = v0
    else:
        result = v1**2

dv0,dv1 = tape.gradient(result,[v0,v1])

print(dv0)
print(dv1)

tf.Tensor(1.0, shape=(), dtype=float32)
None


Just remember that the control statements themselves are not differentiable, so they are invisible to gradient-based optimizers.

Depending on the value of `x` in the above example, the tape either records `result = v0` or `result = v1**2`. The gradient with respect to `x` is always `None`.
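Loops behave the same way. In this minimal sketch (the values and the name `n_steps` are assumptions for illustration), a Python `for` loop multiplies by `x` four times inside the tape, so the tape simply sees four multiplications and differentiates `x**4`:

```python
import tensorflow as tf

x = tf.Variable(2.0)
n_steps = 4  # assumed number of loop iterations

with tf.GradientTape() as tape:
    y = tf.constant(1.0)
    for _ in range(n_steps):
        y = y * x             # each iteration is recorded as a separate op

dy_dx = tape.gradient(y, x)   # d(x^4)/dx = 4 * x^3
print(y.numpy(), dy_dx.numpy())
```

The loop itself leaves no trace on the tape; only the operations executed in its body are recorded.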

## 8. Cases when gradient returns None

When a target is not connected to a source, `gradient` will return `None`:

In [19]:
x = tf.Variable(2.)
y = tf.Variable(3.)

with tf.GradientTape() as tape:
    z = y * y
print(tape.gradient(z,x))

None


Here `z` is obviously not connected to `x`, but there are several less-obvious ways that a gradient can be disconnected.

### 8.1. Replaced a variable with a tensor

In the section on "controlling what the tape watches" you saw that the tape will automatically watch a `tf.Variable` but not a `tf.Tensor`.

One common error is to inadvertently replace a `tf.Variable` with a `tf.Tensor`, instead of using `Variable.assign` to update the `tf.Variable`. Here is an example:

In [20]:
x = tf.Variable(2.0)

for epoch in range(2):
    with tf.GradientTape() as tape:
        y = x+1
        
    print(type(x).__name__, ":", tape.gradient(y, x))
    x = x + 1   # This should be `x.assign_add(1)`

ResourceVariable : tf.Tensor(1.0, shape=(), dtype=float32)
EagerTensor : None
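A corrected version of the loop, sketched below, updates the variable in place with `Variable.assign_add`, so `x` remains a `tf.Variable` and the gradient is available in every epoch. The `grads` list is only there to collect the results for inspection.

```python
import tensorflow as tf

x = tf.Variable(2.0)
grads = []

for epoch in range(2):
    with tf.GradientTape() as tape:
        y = x + 1

    grads.append(tape.gradient(y, x))
    x.assign_add(1.0)   # in-place update: x is still a tf.Variable

print([g.numpy() for g in grads])
print(x.numpy())
```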


### 8.2. Did calculations outside of TensorFlow

The tape can't record the gradient path if the calculation exits TensorFlow. For example:


In [21]:
x = tf.Variable([[1.0,2.0],[3.0,4.0]], dtype = tf.float32)

with tf.GradientTape() as tape:
    x2 = x**2
    y = np.mean(x2, axis = 0) #done in numpy
    # Like most ops, reduce_mean will cast the NumPy array to a constant tensor
    # using `tf.convert_to_tensor`.
    y = tf.reduce_mean(y, axis = 0)
    
print(tape.gradient(y,x))

None
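Keeping the whole computation in TensorFlow restores the gradient path. The following sketch repeats the calculation with `tf.reduce_mean` in place of `np.mean`:

```python
import tensorflow as tf

x = tf.Variable([[1.0, 2.0], [3.0, 4.0]], dtype=tf.float32)

with tf.GradientTape() as tape:
    x2 = x**2
    y = tf.reduce_mean(x2, axis=0)   # stays inside TensorFlow
    y = tf.reduce_mean(y, axis=0)

# y is the mean of all four squared entries, so dy/dx = 2 * x / 4 = x / 2.
grad = tape.gradient(y, x)
print(grad.numpy())
```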


### 8.3. Took gradients through an integer or string

Integers and strings are not differentiable. If a calculation path uses these data types there will be no gradient.

Nobody expects strings to be differentiable, but it's easy to accidentally create an `int` constant or variable if you don't specify the `dtype`.

In [22]:
x = tf.constant(10)

with tf.GradientTape() as g:
    g.watch(x)
    y = x*x
    
print(g.gradient(y,x))

None


TensorFlow doesn't automatically cast between types, so, in practice, you'll often get a type error instead of a missing gradient.
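One way to avoid the problem, sketched here, is to state the floating-point `dtype` explicitly when creating the constant:

```python
import tensorflow as tf

x = tf.constant(10, dtype=tf.float32)   # explicit float dtype

with tf.GradientTape() as g:
    g.watch(x)
    y = x * x

grad = g.gradient(y, x)   # dy/dx = 2 * x
print(grad.numpy())
```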

### 8.4. Took gradients through a stateful object

State stops gradients. When you read from a stateful object, the tape can only observe the current state, not the history that led to it.

A `tf.Tensor` is immutable. You can't change a tensor once it's created. It has a value, but no state. All the operations discussed so far are also stateless: the output of a `tf.matmul` only depends on its inputs.

A `tf.Variable` has internal state—its value. When you use the variable, the state is read. It's normal to calculate a gradient with respect to a variable, but the variable's state blocks gradient calculations from going farther back. For example:

In [23]:
x0 = tf.Variable(3.0)
x1 = tf.Variable(0.0)

with tf.GradientTape() as tape:
    # Update x1 = x1 + x0.
    x1.assign_add(x0)
    # The tape starts recording from x1.
    y = x1**2

# This doesn't work.
print(tape.gradient(y,x0))

None


Similarly, `tf.data.Dataset` iterators and `tf.queue`s are stateful, and will stop all gradients on tensors that pass through them.
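If you do need the gradient with respect to `x0` in the variable example, one option is to perform the sum as a stateless tensor operation inside the tape and only write the result back to the variable afterwards. This is a sketch of that idea; `x1_new` is a name introduced here for illustration.

```python
import tensorflow as tf

x0 = tf.Variable(3.0)
x1 = tf.Variable(0.0)

with tf.GradientTape() as tape:
    # Stateless addition: the tape records how x1_new depends on x0 and x1.
    x1_new = x1 + x0
    y = x1_new**2

dy_dx0, dy_dx1 = tape.gradient(y, [x0, x1])
print(dy_dx0.numpy(), dy_dx1.numpy())

# Update the stateful variable after the gradient has been computed.
x1.assign(x1_new)
```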

## 9. No gradient registered

Some `tf.Operation`s are **registered as being non-differentiable** and will return `None`. Others have **no gradient registered**.

The `tf.raw_ops` page shows which low-level ops have gradients registered.

If you attempt to take a gradient through a float op that has no gradient registered, the tape will throw an error instead of silently returning `None`. This way you know something has gone wrong.

For example, the `tf.image.adjust_contrast` function wraps `raw_ops.AdjustContrastv2`, which could have a gradient but the gradient is not implemented:

In [24]:
image = tf.Variable([[[0.5, 0.0, 0.0]]])
delta = tf.Variable(0.1)

with tf.GradientTape() as tape:
    new_image = tf.image.adjust_contrast(image, delta)

try:
    print(tape.gradient(new_image, [image, delta]))
    assert False   # This should not happen.
except LookupError as e:
    print(f'{type(e).__name__}: {e}')

LookupError: gradient registry has no entry for: AdjustContrastv2


If you need to differentiate through this op, you'll either need to implement the gradient and register it (using `tf.RegisterGradient`) or re-implement the function using other ops.
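As a sketch of the second option, the contrast adjustment can be rebuilt from ordinary differentiable ops. The function name `adjust_contrast_differentiable` is hypothetical, and the code assumes a single float image of shape `[height, width, channels]` and the per-channel formula `(x - mean) * contrast_factor + mean` described in the `tf.image.adjust_contrast` documentation.

```python
import tensorflow as tf

def adjust_contrast_differentiable(image, contrast_factor):
    """Hypothetical re-implementation using only differentiable ops.

    Assumes a float image of shape [height, width, channels].
    """
    # Per-channel mean over the two spatial dimensions.
    mean = tf.reduce_mean(image, axis=[0, 1], keepdims=True)
    return (image - mean) * contrast_factor + mean

image = tf.Variable([[[0.0], [1.0]], [[2.0], [3.0]]])  # shape (2, 2, 1)
factor = tf.Variable(2.0)

with tf.GradientTape() as tape:
    new_image = adjust_contrast_differentiable(image, factor)
    loss = tf.reduce_sum(new_image**2)

# Both gradients are now available because every op has a registered gradient.
d_image, d_factor = tape.gradient(loss, [image, factor])
print(d_image.shape, d_factor.numpy())
```

Because the function is composed of `tf.reduce_mean`, subtraction, multiplication, and addition, the tape can differentiate through it without any custom gradient registration.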

## 10. Zeros instead of `None`

In some cases it would be convenient to get 0 instead of `None` for unconnected gradients. You can decide what to return when you have unconnected gradients using the `unconnected_gradients` argument:

In [25]:
x = tf.Variable([2., 2.])
y = tf.Variable(3.)

with tf.GradientTape() as tape:
    z = y**2
print(tape.gradient(z, x, unconnected_gradients=tf.UnconnectedGradients.ZERO))

tf.Tensor([0. 0.], shape=(2,), dtype=float32)
