### **Automatic Differentiation**

**10. Autograd in PyTorch**

* As nested functions grow more complex, working out their derivatives by hand and coding them becomes difficult.
* A neural network is essentially a deeply nested function, and backpropagation needs the derivative of the loss with respect to every parameter. Deriving these by hand is impractical for all but the smallest networks.
* Autograd is a core component of PyTorch that provides automatic differentiation for tensor operations. It computes the gradients needed to train machine learning models with optimization algorithms such as gradient descent.

In [17]:
import torch as tr
import tensorflow as tf
import numpy as np

# import plotting libraries
import matplotlib.pyplot as plt
import matplotlib_inline
matplotlib_inline.backend_inline.set_matplotlib_formats('svg')

**10.01. Computing gradient for scalars**

In [18]:
x = tr.tensor(3.0, requires_grad=True) # requires gradient (default - False)

In [19]:
y = x**2
z = 2*x
k = 9/x
l = 8-x
m = 11+x
print(x)
print(y)
print(z)
print(k)
print(l)
print(m)

tensor(3., requires_grad=True)
tensor(9., grad_fn=<PowBackward0>)
tensor(6., grad_fn=<MulBackward0>)
tensor(3., grad_fn=<MulBackward0>)
tensor(5., grad_fn=<RsubBackward1>)
tensor(14., grad_fn=<AddBackward0>)


* When a tensor is created with `requires_grad=True`, PyTorch records every operation applied to it in a computation graph behind the scenes.<br>
  x --> (pow) --> y
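As a rough illustration of that graph, the sketch below inspects the bookkeeping autograd attaches to each tensor. The names `a`, `b`, and `c` are placeholders chosen so the snippet does not overwrite the notebook's `x` and `y`. The exact node names printed may differ between PyTorch versions.

```python
import torch as tr

a = tr.tensor(3.0, requires_grad=True)  # leaf: created directly by the user
b = a**2                                # non-leaf: produced by pow
c = tr.sin(b)                           # non-leaf: produced by sin

print(a.is_leaf, b.is_leaf, c.is_leaf)
print(a.grad_fn)                 # leaves have no backward node
print(c.grad_fn)                 # backward node recorded for sin
print(c.grad_fn.next_functions)  # edge(s) pointing back toward the pow node
```

Walking `next_functions` from the output back to the leaves is essentially the path that `backward()` follows.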

In [20]:
y.backward() # needs to be run to compute gradient
x.grad

tensor(6.)

In [21]:
x = tr.tensor(3.0, requires_grad=True)
y = x**2
z = tr.sin(y)
print(x)
print(y)
print(z)

tensor(3., requires_grad=True)
tensor(9., grad_fn=<PowBackward0>)
tensor(0.4121, grad_fn=<SinBackward0>)


x-->(pow)-->y-->(sin)-->z

In [23]:
def dz_dx(x):
    return 2*x*tr.cos(x**2)

def dz_dy(y):
    return tr.cos(y)

print(dz_dx(x))
print(dz_dy(y))

tensor(-5.4668, grad_fn=<MulBackward0>)
tensor(-0.9111, grad_fn=<CosBackward0>)


In [24]:
z.backward()
print(y.grad)
print(x.grad)

None
tensor(-5.4668)




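* By default, PyTorch stores `.grad` only for leaf tensors, which is why `y.grad` is `None` above. To keep the gradient of a non-leaf tensor such as `y`, call `.retain_grad()` on it before `backward()`.
* `backward()` can be called without arguments only on a scalar output. For a non-scalar output, an upstream `gradient` tensor must be passed explicitly.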
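The scalar-only restriction can be seen in a minimal sketch like the one below. The tensors `v` and `w` are illustrative names, not part of the notebook. Calling `backward()` on a vector output without an upstream gradient is expected to raise a `RuntimeError`, while reducing to a scalar first works.

```python
import torch as tr

v = tr.tensor([1.0, 2.0, 3.0], requires_grad=True)
w = v**2  # vector-valued output

try:
    w.backward()  # no upstream gradient supplied for a non-scalar output
    raised = False
except RuntimeError as err:
    raised = True
    print("RuntimeError:", err)

# Reducing to a scalar first makes the call well defined.
(v**2).sum().backward()
print(v.grad)  # gradient of sum(v**2) with respect to v, i.e. 2*v
```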

In [29]:
x = tr.tensor(3.0, requires_grad=True)
y = x**2
y.retain_grad() # retains gradient
z = tr.sin(y)
print(x)
print(y)
print(z)

tensor(3., requires_grad=True)
tensor(9., grad_fn=<PowBackward0>)
tensor(0.4121, grad_fn=<SinBackward0>)


In [30]:
z.backward()

print(y.grad)  # should now print the gradient of z w.r.t y
print(x.grad)  # should print the gradient of z w.r.t x

tensor(-0.9111)
tensor(-5.4668)


**10.02. Computing gradients for vectors**

In [None]:
import torch as tr
import numpy as np

x = tr.tensor(np.linspace(0, 3.0, 20), dtype=tr.float32, requires_grad=True)
y = x**2
y.retain_grad()
z = tr.sin(y)

print("x:", x)
print("y:", y)
print("z:", z)

# Backpropagate using a scalar-valued function of z
z.sum().backward() # the sum is a scalar; since z_i depends only on x_i, x.grad[i] = ∂z_i/∂x_i

print("y.grad:", y.grad)  # ∂(sum sin(y)) / ∂y = cos(y)
print("x.grad:", x.grad)  # ∂(sum sin(x^2)) / ∂x = cos(x^2) * 2x


x: tensor([0.0000, 0.1579, 0.3158, 0.4737, 0.6316, 0.7895, 0.9474, 1.1053, 1.2632,
        1.4211, 1.5789, 1.7368, 1.8947, 2.0526, 2.2105, 2.3684, 2.5263, 2.6842,
        2.8421, 3.0000], requires_grad=True)
y: tensor([0.0000, 0.0249, 0.0997, 0.2244, 0.3989, 0.6233, 0.8975, 1.2216, 1.5956,
        2.0194, 2.4931, 3.0166, 3.5900, 4.2133, 4.8864, 5.6094, 6.3823, 7.2050,
        8.0776, 9.0000], grad_fn=<PowBackward0>)
z: tensor([ 0.0000,  0.0249,  0.0996,  0.2225,  0.3884,  0.5837,  0.7818,  0.9397,
         0.9997,  0.9011,  0.6040,  0.1246, -0.4336, -0.8780, -0.9849, -0.6239,
         0.0989,  0.7967,  0.9751,  0.4121], grad_fn=<SinBackward0>)
y.grad: tensor([ 1.0000,  0.9997,  0.9950,  0.9749,  0.9215,  0.8120,  0.6236,  0.3421,
        -0.0248, -0.4337, -0.7970, -0.9922, -0.9011, -0.4786,  0.1732,  0.7815,
         0.9951,  0.6044, -0.2217, -0.9111])
x.grad: tensor([ 0.0000,  0.3157,  0.6284,  0.9236,  1.1640,  1.2821,  1.1815,  0.7563,
        -0.0626, -1.2326, -2.5168, -3.4466, -3.
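An alternative to summing first is to pass the upstream gradient explicitly. The sketch below, with placeholder names `p` and `q` and a shorter 5-point grid chosen only for brevity, suggests that `backward(gradient=ones)` yields the same result as `sum().backward()` and agrees with the analytic derivative `2x·cos(x²)`.

```python
import torch as tr

p = tr.linspace(0.0, 3.0, 5, requires_grad=True)

# Variant 1: pass a vector of ones as the upstream gradient.
q = tr.sin(p**2)
q.backward(gradient=tr.ones_like(q))
grad_vjp = p.grad.clone()

# Variant 2: reduce to a scalar first (reset the accumulated gradient before reuse).
p.grad.zero_()
tr.sin(p**2).sum().backward()
grad_sum = p.grad.clone()

# Analytic derivative for comparison.
analytic = 2 * p.detach() * tr.cos(p.detach()**2)

print(tr.allclose(grad_vjp, grad_sum))
print(tr.allclose(grad_vjp, analytic))
```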

**10.03. Clearing gradients**

* If `backward()` is called multiple times, the gradients are accumulated (added) into `.grad`; they are not cleared automatically.
* Hence, when running multiple passes over the data, clear the gradients before each new backward pass.

In [None]:
x = tr.tensor(2.0, requires_grad=True)
y = x**2 # forward pass

y.backward()

In [57]:
x.grad

tensor(4.)

In [58]:
y = x**2

y.backward()
x.grad

tensor(8.)

In [59]:
x.grad.zero_() # inplace operation which assigns zero to gradient of x

tensor(0.)

In [60]:
y = x**2

y.backward()
x.grad

tensor(4.)
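To show why clearing matters in practice, here is a minimal hand-written gradient descent loop. The objective `(w - 5)**2`, the learning rate, and the step count are arbitrary choices for illustration. The update is wrapped in `tr.no_grad()` so that it is not recorded in the graph, and the gradient is zeroed after every step.

```python
import torch as tr

w = tr.tensor(0.0, requires_grad=True)
lr = 0.1  # learning rate (illustrative value)

for step in range(100):
    loss = (w - 5.0)**2   # forward pass
    loss.backward()       # adds d(loss)/dw into w.grad
    with tr.no_grad():
        w -= lr * w.grad  # parameter update, not tracked by autograd
    w.grad.zero_()        # clear before the next backward pass

print(w)  # should end up close to the minimizer at 5
```

Without the `zero_()` call, each step would use the sum of all earlier gradients rather than the current one.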

**10.04. Disable gradient tracking**

In [None]:
# 1. setting requires_grad_(False)
x = tr.tensor(2.0)
x.requires_grad_(False) # a trailing underscore marks an in-place method
y = x**2 # forward pass
print(x,y)

tensor(2.) tensor(4.)


In [65]:
# 2. using detach() --> returns a new tensor object that shares storage with x but is cut off from the graph
x = tr.tensor(2.0,requires_grad=True)
y = x**2
z = x.detach() 
y1 = z**2
print(id(x)==id(z))
print(x)
print(y)
print(z)
print(y1)

False
tensor(2., requires_grad=True)
tensor(4., grad_fn=<PowBackward0>)
tensor(2.)
tensor(4.)


In [None]:
# 3. using the no_grad() context manager
x = tr.tensor(2.0,requires_grad=True)

with tr.no_grad():
    y = x**2

print(x,y)

tensor(2., requires_grad=True) tensor(4.)
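The following sketch checks a few properties of these approaches. The names `a`, `d`, `b`, and `e` are placeholders. It suggests that `detach()` returns a different Python object that still points at the same memory, and that results computed under `no_grad()` carry no graph.

```python
import torch as tr

a = tr.tensor([1.0, 2.0], requires_grad=True)

d = a.detach()
print(d.requires_grad)               # detached tensor is not tracked
print(d.data_ptr() == a.data_ptr())  # same underlying storage as a

with tr.no_grad():
    b = a * 2
print(b.requires_grad, b.grad_fn)    # nothing was recorded inside the block

e = a * 2
print(e.requires_grad)               # tracking resumes outside the block
```

Because the storage is shared, an in-place change to `d` would also change the values seen through `a`.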


**11. GradientTape in TensorFlow**

* `tf.GradientTape` records operations and computes gradients via reverse-mode automatic differentiation.
* All computations to be differentiated must run inside the `with tf.GradientTape(...)` block.
* The tape automatically watches trainable `tf.Variable`s. To compute gradients with respect to any other tensor, register it explicitly with `tape.watch(...)`.
* The gradient with respect to a `tf.constant` is `None` unless the constant is watched, because constants are not tracked by default. The workaround is to call `tape.watch(x)` inside the block.
* A `tf.Variable` is trainable when it is created with `trainable=True`, which is the default: `x = tf.Variable(3.0, trainable=True)`
* Set `persistent=True` if:
  * You need multiple gradients from the same graph.
  * You're debugging or inspecting intermediate layers.
  * You're implementing custom training loops involving multiple partial derivatives.
* If `gradient()` is called only once, a persistent tape is not needed.
* If you do use `persistent=True`, explicitly delete the tape (`del tape`) when done.
* This frees memory, because a persistent tape holds on to its recorded resources until it is deleted, whereas a non-persistent tape releases them after the first `gradient()` call.
* `tape.gradient` accepts a non-scalar target without an explicit reduction such as sum or mean; the result is the gradient of the sum of the target's elements.
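The difference between a default and a persistent tape can be sketched as follows. The variable `u` and the tape names are illustrative. A second `gradient()` call on a non-persistent tape is expected to raise a `RuntimeError`, while a persistent tape allows several calls.

```python
import tensorflow as tf

u = tf.Variable(3.0)

# Default (non-persistent) tape: resources are released after one gradient() call.
with tf.GradientTape() as tape:
    y = u**2
    z = tf.sin(y)

dz_du = tape.gradient(z, u)
try:
    tape.gradient(z, y)  # second call on the same non-persistent tape
    second_call_failed = False
except RuntimeError:
    second_call_failed = True
print("second call failed:", second_call_failed)

# Persistent tape: several gradients can be taken from the same recording.
with tf.GradientTape(persistent=True) as ptape:
    y = u**2
    z = tf.sin(y)

g_u = ptape.gradient(z, u)
g_y = ptape.gradient(z, y)
del ptape  # release the recorded resources explicitly
print(g_u.numpy(), g_y.numpy())
```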

**11.01. Computing gradient for scalars**

In [None]:
# 1. for tf.Variable()
x = tf.Variable(3.0, trainable = True) # trainable=True is the default


# start recording gradients
with tf.GradientTape(persistent=True) as tape:
    # tape.watch(x) # if trainable is False, we need to watch the Variable
    y = x**2
    z = tf.sin(y)

# computing gradients
dz_dy = tape.gradient(z,y)
dz_dx = tape.gradient(z,x)

print(dz_dy)
print(dz_dx)


tf.Tensor(-0.91113025, shape=(), dtype=float32)
tf.Tensor(-5.4667816, shape=(), dtype=float32)


In [74]:
# 2. for tf.constant()
x = tf.constant(3.0) 

# start recording gradients
with tf.GradientTape(persistent=True) as tape:
    tape.watch(x)  # manually tell tape to watch this constant
    y = x**2
    z = tf.sin(y)

# if the constant is not watched, both gradients below are None

# computing gradients
dz_dy = tape.gradient(z,y)
dz_dx = tape.gradient(z,x)

print(dz_dy)
print(dz_dx)


tf.Tensor(-0.91113025, shape=(), dtype=float32)
tf.Tensor(-5.4667816, shape=(), dtype=float32)


In [None]:
del tape

**11.02. Computing gradients for vectors**

In [None]:
x = tf.Variable(tf.linspace(0,3,10), trainable = True) # trainable=True is the default


# start recording gradients
with tf.GradientTape(persistent=True) as tape:
    # tape.watch(x) # if trainable is False, we need to watch the Variable
    y = x**2
    z = tf.sin(y)

# computing gradients
dz_dy = tape.gradient(z,y)
dz_dx = tape.gradient(z,x)

print(dz_dy)
print(dz_dx)


tf.Tensor(
[ 1.          0.99383351  0.90284967  0.54030231 -0.20550672 -0.93454613
 -0.65364362  0.6683999   0.67640492 -0.91113026], shape=(10,), dtype=float64)
tf.Tensor(
[ 0.          0.66255567  1.20379956  1.08060461 -0.54801792 -3.11515378
 -2.61457448  3.11919955  3.60749292 -5.46678157], shape=(10,), dtype=float64)
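To see what `tape.gradient` returns for a vector target, it can be compared with the full Jacobian from `tape.jacobian`. This is a small sketch with an arbitrary 3-element variable `u`. Summing the Jacobian over the output index should reproduce the gradient, and for this elementwise function the Jacobian should be diagonal.

```python
import tensorflow as tf

u = tf.Variable([0.5, 1.0, 1.5])

with tf.GradientTape(persistent=True) as tape:
    v = tf.sin(u**2)

grad = tape.gradient(v, u)  # shape (3,): gradient of sum(v)
jac = tape.jacobian(v, u)   # shape (3, 3): jac[i, j] = dv_i / du_j
del tape

summed = tf.reduce_sum(jac, axis=0)  # sum over the output index i
print(grad.numpy())
print(summed.numpy())
```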


**Notes:**
* For a vector-valued `z`, `tape.gradient(z, x)` returns the gradient of the sum of the elements of `z`, i.e. Σᵢ ∂zᵢ/∂xⱼ for each xⱼ, unless you supply the upstream gradient through `output_gradients`.
* PyTorch's `backward()` expects a scalar output by default and computes ∂(scalar)/∂x. This follows from how backpropagation is typically used: differentiating a scalar-valued loss with respect to the parameters. PyTorch lets you override this by passing an upstream gradient: <br>
`z.backward(gradient=some_vector) # Explicit vector-Jacobian product`
* PyTorch tracks gradients automatically if you opt in per tensor with `requires_grad=True`.
* TensorFlow only tracks gradients when you explicitly record them using `tf.GradientTape()`.
* In TensorFlow, gradients are computed afresh on every call:
   * Gradients do not accumulate.
   * Each call to `tape.gradient(...)` returns a new set of gradients.
   * There is no `.grad` attribute on variables.
   * You apply the returned gradients to variables explicitly, typically through an optimizer, rather than reading them from an attribute on the variables.
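As a counterpart to the PyTorch behaviour, the sketch below runs a hand-written gradient descent loop in TensorFlow. The objective, learning rate, and step count are arbitrary, and `assign_sub` stands in for an optimizer to keep the example short. A fresh tape is opened on each step, so there is nothing to zero out.

```python
import tensorflow as tf

w = tf.Variable(0.0)
lr = 0.1  # learning rate (illustrative value)

for step in range(100):
    with tf.GradientTape() as tape:  # new recording on every step
        loss = (w - 5.0)**2
    grad = tape.gradient(loss, w)    # freshly computed, never accumulated
    w.assign_sub(lr * grad)          # explicit update of the variable

print(w.numpy())  # should end up close to the minimizer at 5
```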