# Automatic Differentiation with GradientTape

## Learning Objectives
By the end of this notebook, you will understand:
- What automatic differentiation is and why it's crucial for deep learning
- How to use TensorFlow's GradientTape for computing gradients
- The difference between gradients with respect to scalars and with respect to tensors
- How to compute gradients of complex functions
- Best practices for working with GradientTape

## What is Automatic Differentiation?

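**Automatic differentiation (AutoDiff)** computes derivatives of a program by recording the elementary operations it performs and applying the chain rule to them. The result is exact up to floating-point rounding, with no finite-difference approximation involved. In deep learning:
- We need gradients to update model parameters during training
- GradientTape "records" operations on tensors so it can compute gradients later
- Replaying that record backwards (reverse-mode AutoDiff) is what backpropagation does in neural networks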
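
To make the "recording" idea concrete, here is a minimal pure-Python sketch of reverse-mode differentiation. The `Value` class and its `backward` method are hypothetical names invented for illustration; this is not how TensorFlow is implemented, only the same chain-rule idea at toy scale (scalars, addition and multiplication only).

```python
# Toy reverse-mode autodiff for scalars (illustrative sketch, not TensorFlow's code).
class Value:
    """A scalar that remembers which values it was computed from."""

    def __init__(self, data, parents=()):
        self.data = data
        self.grad = 0.0
        # Each entry is (parent, local derivative of this value w.r.t. that parent).
        self.parents = parents

    def __add__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        return Value(self.data + other.data, ((self, 1.0), (other, 1.0)))

    def __mul__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        return Value(self.data * other.data,
                     ((self, other.data), (other, self.data)))

    # Addition and multiplication are commutative, so the reflected
    # versions can reuse the same implementations.
    __radd__ = __add__
    __rmul__ = __mul__

    def backward(self):
        # Post-order traversal: every value appears after the values it depends on.
        order, seen = [], set()

        def visit(node):
            if id(node) not in seen:
                seen.add(id(node))
                for parent, _ in node.parents:
                    visit(parent)
                order.append(node)

        visit(self)
        self.grad = 1.0
        # Walk from the output back to the inputs, applying the chain rule.
        for node in reversed(order):
            for parent, local in node.parents:
                parent.grad += local * node.grad


x = Value(3.0)
y = x * x        # f(x) = x^2
y.backward()
print(x.grad)    # prints 6.0, matching df/dx = 2x at x = 3
```

The forward pass builds the record (the `parents` links), and `backward` consumes it in reverse, which is the same division of labor as recording inside a tape and then asking it for a gradient.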

In [None]:
import tensorflow as tf
import numpy as np
import matplotlib.pyplot as plt

print(f"TensorFlow version: {tf.__version__}")

## Basic GradientTape Usage

**GradientTape** is TensorFlow's tool for automatic differentiation. Think of it as a recorder that "watches" operations and can compute gradients.

Key concepts:
- Use `with tf.GradientTape() as tape:` to start recording
- The tape automatically watches trainable `tf.Variable` objects
- Use `tape.watch()` to track any other tensor, such as a `tf.constant`
- Call `tape.gradient(target, sources)` to compute gradients
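
A gradient returned by the tape can be sanity-checked against a numerical estimate. The sketch below uses plain Python and a central difference; `numerical_derivative` is a hypothetical helper name, and the step size `eps` is an arbitrary choice. It uses the cubic polynomial that appears in the next code cell.

```python
def numerical_derivative(f, x, eps=1e-4):
    """Central-difference estimate of df/dx at x."""
    return (f(x + eps) - f(x - eps)) / (2.0 * eps)


def f(x):
    # Same polynomial as in the GradientTape example: 3x^3 + 2x^2 + x + 1
    return 3 * x**3 + 2 * x**2 + x + 1


def f_prime(x):
    # Hand-derived derivative: 9x^2 + 4x + 1
    return 9 * x**2 + 4 * x + 1


estimate = numerical_derivative(f, 2.0)
exact = f_prime(2.0)
print(f"numerical: {estimate:.6f}, analytic: {exact:.6f}")
```

The two numbers should agree to several decimal places. A large mismatch between a finite-difference estimate and a tape gradient usually points to an operation that was not recorded.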

In [None]:
# Basic gradient computation
# Let's compute the derivative of f(x) = x^2 at x = 3
# We know mathematically that df/dx = 2x, so at x=3, df/dx = 6

# Create a variable (GradientTape watches variables by default)
x = tf.Variable(3.0, name='x')
print(f"Initial x: {x}")

# Use GradientTape to record operations
with tf.GradientTape() as tape:
    # Define the function f(x) = x^2
    y = x ** 2
    print(f"y = x^2 = {y}")

# Compute the gradient dy/dx
gradient = tape.gradient(y, x)
print(f"Gradient dy/dx at x=3: {gradient}")
print(f"Expected (2*3): 6")
print()

# Let's try a more complex function: f(x) = 3x^3 + 2x^2 + x + 1
# The derivative is: df/dx = 9x^2 + 4x + 1
x = tf.Variable(2.0)
with tf.GradientTape() as tape:
    y = 3 * x**3 + 2 * x**2 + x + 1
    print(f"f(2) = 3(2)^3 + 2(2)^2 + 2 + 1 = {y}")

gradient = tape.gradient(y, x)
print(f"Gradient at x=2: {gradient}")
print(f"Expected (9*4 + 4*2 + 1): {9*4 + 4*2 + 1}")

## Watching Constants and Multiple Variables

By default, GradientTape only watches trainable `tf.Variable` objects. To compute gradients with respect to a constant (or any other tensor), call `tape.watch()` on it inside the tape context.
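
The sketch below shows what happens when a source is not watched, assuming TensorFlow 2.x with eager execution. The variable names are illustrative. It also uses the `watch_accessed_variables=False` argument of `tf.GradientTape`, which turns off automatic watching so that only explicitly watched tensors are tracked.

```python
import tensorflow as tf

# An unwatched constant: the tape records nothing for it.
c = tf.constant(3.0)
with tf.GradientTape() as tape:
    y = c ** 2
grad_unwatched = tape.gradient(y, c)

# A non-trainable variable is not watched automatically either.
v = tf.Variable(3.0, trainable=False)
with tf.GradientTape() as tape:
    y = v ** 2
grad_frozen = tape.gradient(y, v)

# Explicit watching, with automatic watching switched off.
with tf.GradientTape(watch_accessed_variables=False) as tape:
    tape.watch(c)
    y = c ** 2
grad_watched = tape.gradient(y, c)

print(grad_unwatched, grad_frozen, float(grad_watched))
```

The first two results are `None` because neither source was watched; only the explicitly watched constant yields a numeric gradient.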

In [None]:
# Watching constants
# GradientTape doesn't watch constants by default
x = tf.constant(3.0)  # This is a constant, not a variable

with tf.GradientTape() as tape:
    tape.watch(x)  # Explicitly tell the tape to watch this constant
    y = x ** 2

gradient = tape.gradient(y, x)
print(f"Gradient of constant: {gradient}")
print()

# Multiple variables
# Let's compute gradients of f(x, y) = x^2 + 3xy + y^3
x = tf.Variable(2.0, name='x')
y = tf.Variable(3.0, name='y')

with tf.GradientTape() as tape:
    # f(x,y) = x^2 + 3xy + y^3
    z = x**2 + 3*x*y + y**3
    print(f"f(2,3) = 2^2 + 3(2)(3) + 3^3 = {z}")

# Compute gradients with respect to both variables
gradients = tape.gradient(z, [x, y])
print(f"Gradient with respect to x: {gradients[0]}")  # df/dx = 2x + 3y
print(f"Gradient with respect to y: {gradients[1]}")  # df/dy = 3x + 3y^2
print(f"Expected df/dx at (2,3): {2*2 + 3*3}")
print(f"Expected df/dy at (2,3): {3*2 + 3*3**2}")

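# You can also pass a dictionary of sources; the result mirrors its structure.
# The tape above is already consumed, so record the computation again first.
with tf.GradientTape() as tape:
    z = x**2 + 3*x*y + y**3
variables = {'x': x, 'y': y}
gradients_dict = tape.gradient(z, variables)
print(f"Gradients as dict: {gradients_dict}")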

## Persistent GradientTape

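By default, a GradientTape releases its resources as soon as `tape.gradient()` is called, so it can be used only once. If you need to compute several gradients from the same recorded computation, create the tape with `persistent=True`.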

**Important**: Remember to delete persistent tapes to free memory!
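
A persistent tape is also handy for inspecting the chain rule, because it can return gradients with respect to intermediate tensors. The following is a small sketch assuming TensorFlow 2.x; the function `z = (x^2)^2` is chosen only for illustration.

```python
import tensorflow as tf

x = tf.Variable(3.0)

with tf.GradientTape(persistent=True) as tape:
    y = x * x      # y = x^2
    z = y * y      # z = y^2 = x^4

dz_dx = tape.gradient(z, x)   # 4x^3
dz_dy = tape.gradient(z, y)   # 2y, gradient w.r.t. an intermediate tensor
dy_dx = tape.gradient(y, x)   # 2x

# Release the resources held by the persistent tape.
del tape

print(f"dz/dx = {float(dz_dx)}, dz/dy * dy/dx = {float(dz_dy) * float(dy_dx)}")
```

Both printed values should match, since dz/dx = dz/dy · dy/dx.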

In [None]:
# Persistent GradientTape allows multiple gradient computations
x = tf.Variable(2.0)
y = tf.Variable(3.0)

with tf.GradientTape(persistent=True) as tape:
    # Define multiple functions
    z1 = x**2 + y**2  # First function
    z2 = x*y          # Second function
    z3 = tf.sin(x) + tf.cos(y)  # Third function with trig functions

# Now we can compute gradients for all functions
grad_z1 = tape.gradient(z1, [x, y])
grad_z2 = tape.gradient(z2, [x, y])
grad_z3 = tape.gradient(z3, [x, y])

print(f"Function 1: z1 = x^2 + y^2 = {z1}")
print(f"Gradients of z1: {grad_z1}")
print(f"Expected: [2x, 2y] = [{2*2}, {2*3}]")
print()

print(f"Function 2: z2 = xy = {z2}")
print(f"Gradients of z2: {grad_z2}")
print(f"Expected: [y, x] = [{y.numpy()}, {x.numpy()}]")
print()

print(f"Function 3: z3 = sin(x) + cos(y) = {z3}")
print(f"Gradients of z3: {grad_z3}")
print(f"Expected: [cos(x), -sin(y)] = [{tf.cos(x).numpy():.4f}, {-tf.sin(y).numpy():.4f}]")

# Important: Delete the persistent tape to free memory
del tape
print("\nPersistent tape deleted to free memory")

## Higher-Order Derivatives

You can compute second derivatives (and higher) by nesting GradientTapes. This is useful for advanced optimization techniques and some physics-informed neural networks.
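
The next cell differentiates `e^x * sin(x)` twice but prints no expected values. As a cross-check, the closed forms can be worked out by hand with the product rule: f'(x) = e^x (sin x + cos x) and f''(x) = 2 e^x cos x. The sketch below evaluates them at x = 1 using only the standard library; the function names are illustrative.

```python
import math


def f(x):
    return math.exp(x) * math.sin(x)


def f_prime(x):
    # Product rule: e^x * sin(x) + e^x * cos(x)
    return math.exp(x) * (math.sin(x) + math.cos(x))


def f_double_prime(x):
    # Differentiating f_prime; the sin terms cancel, leaving 2 * e^x * cos(x)
    return 2.0 * math.exp(x) * math.cos(x)


x0 = 1.0
print(f"f(x0)   = {f(x0):.4f}")
print(f"f'(x0)  = {f_prime(x0):.4f}")
print(f"f''(x0) = {f_double_prime(x0):.4f}")
```

The nested-tape results should agree with these values up to float32 precision.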

In [None]:
# Computing higher-order derivatives
# Let's compute the second derivative of f(x) = x^4
# f(x) = x^4
# f'(x) = 4x^3
# f''(x) = 12x^2

x = tf.Variable(2.0)

# Nest GradientTapes for higher-order derivatives
with tf.GradientTape() as tape2:  # Outer tape for second derivative
    with tf.GradientTape() as tape1:  # Inner tape for first derivative
        y = x**4
    
    # Compute first derivative
    first_derivative = tape1.gradient(y, x)
    print(f"Function: y = x^4 = {y}")
    print(f"First derivative: dy/dx = {first_derivative}")
    print(f"Expected (4x^3): {4 * 2**3}")

# Compute second derivative
second_derivative = tape2.gradient(first_derivative, x)
print(f"Second derivative: d²y/dx² = {second_derivative}")
print(f"Expected (12x^2): {12 * 2**2}")
print()

# Example with a more complex function
x = tf.Variable(1.0)

with tf.GradientTape() as tape2:
    with tf.GradientTape() as tape1:
        # f(x) = e^x * sin(x)
        y = tf.exp(x) * tf.sin(x)
    
    first_deriv = tape1.gradient(y, x)

second_deriv = tape2.gradient(first_deriv, x)

print(f"Complex function: y = e^x * sin(x) = {y:.4f}")
print(f"First derivative: {first_deriv:.4f}")
print(f"Second derivative: {second_deriv:.4f}")

## Gradients with Respect to Tensors

So far we've computed gradients with respect to scalar variables. In deep learning, we usually need the gradient of a scalar loss with respect to tensor parameters (like weights and biases); each gradient has the same shape as its parameter.
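
For a linear model with a mean-squared-error loss, the gradients can also be written down by hand, which gives something to compare the tape's output against. The NumPy sketch below reuses the inputs, weights, bias and targets from the next cell. It assumes the loss averages over all `N` entries of the prediction matrix (as `tf.reduce_mean` does), so dL/dW = (2/N) Xᵀ(XW + b − T) and dL/db = (2/N) Σ_rows (XW + b − T).

```python
import numpy as np

X = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0]])          # (batch=2, features=3)
W = np.array([[0.1, 0.2],
              [0.3, 0.4],
              [0.5, 0.6]])               # (features=3, outputs=2)
b = np.array([0.1, 0.2])                 # (outputs=2,)
T = np.array([[1.0, 0.0],
              [0.0, 1.0]])               # dummy targets

residual = X @ W + b - T                 # predictions minus targets, shape (2, 2)
n_entries = residual.size                # the mean runs over all 4 entries

loss = np.mean(residual ** 2)
grad_W = (2.0 / n_entries) * X.T @ residual        # same shape as W
grad_b = (2.0 / n_entries) * residual.sum(axis=0)  # same shape as b

print(f"loss = {loss:.4f}")
print(f"grad_W =\n{grad_W}")
print(f"grad_b = {grad_b}")
```

Note that each gradient has the shape of the parameter it belongs to, which is what makes the update `param - lr * grad` well defined.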

In [None]:
# Gradients with respect to tensors (like neural network weights)

# Simulate a simple linear model: y = W*x + b
# where W is a weight matrix and b is a bias vector

# Create some sample data
x = tf.constant([[1.0, 2.0, 3.0],    # Input features (batch_size=2, features=3)
                 [4.0, 5.0, 6.0]])

# Create model parameters
W = tf.Variable([[0.1, 0.2],         # Weight matrix (features=3, outputs=2)
                 [0.3, 0.4],
                 [0.5, 0.6]])

b = tf.Variable([0.1, 0.2])          # Bias vector (outputs=2)

print(f"Input shape: {x.shape}")     # (2, 3)
print(f"Weight shape: {W.shape}")    # (3, 2)
print(f"Bias shape: {b.shape}")      # (2,)
print()

with tf.GradientTape() as tape:
    # Forward pass: compute predictions
    predictions = tf.matmul(x, W) + b  # Linear transformation
    print(f"Predictions shape: {predictions.shape}")  # (2, 2)
    print(f"Predictions:\n{predictions}")
    
    # Compute a loss (let's use mean squared error with dummy targets)
    targets = tf.constant([[1.0, 0.0],   # Dummy target values
                          [0.0, 1.0]])
    
    loss = tf.reduce_mean(tf.square(predictions - targets))
    print(f"Loss: {loss}")

# Compute gradients of loss with respect to parameters
gradients = tape.gradient(loss, [W, b])
grad_W, grad_b = gradients

print(f"\nGradient of loss w.r.t. weights (dL/dW):")
print(f"Shape: {grad_W.shape}")  # Same shape as W: (3, 2)
print(grad_W)

print(f"\nGradient of loss w.r.t. bias (dL/db):")
print(f"Shape: {grad_b.shape}")  # Same shape as b: (2,)
print(grad_b)

# These gradients tell us how to update W and b to reduce the loss
print("\nOne gradient-descent step would give:")
learning_rate = 0.01
print(f"New W = W - lr * grad_W =\n{(W - learning_rate * grad_W).numpy()}")
print(f"New b = b - lr * grad_b = {(b - learning_rate * grad_b).numpy()}")

## Practical Example: Training a Simple Model

Let's put it all together and train a simple linear model using gradients. This shows the foundation of how neural networks are trained.
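
Linear regression with a squared-error loss also has a closed-form solution, so the parameters found by gradient descent can be checked independently. The sketch below regenerates the data with the same seed and the same data-generation lines as the next cell (an assumption: the results only match if those lines are identical) and solves the least-squares problem with `np.linalg.lstsq`. The names `design`, `w_ls` and `b_ls` are illustrative.

```python
import numpy as np

# Same synthetic data as the training cell: y = 3x + 2 + noise
np.random.seed(42)
n_samples = 100
x_data = np.random.uniform(-1, 1, (n_samples, 1)).astype(np.float32)
y_data = 3 * x_data + 2 + 0.1 * np.random.randn(n_samples, 1).astype(np.float32)

# Design matrix with a column of ones so the intercept is fitted too.
design = np.hstack([x_data, np.ones_like(x_data)])
solution, *_ = np.linalg.lstsq(design, y_data, rcond=None)
w_ls, b_ls = solution.ravel()

print(f"Least-squares fit: w = {w_ls:.3f}, b = {b_ls:.3f}")
```

Gradient descent should approach these values as training proceeds; they differ slightly from 3 and 2 because of the noise in the data.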

In [None]:
# Practical example: Training a linear model with gradient descent

# Generate some synthetic data
# True relationship: y = 3x + 2 + noise
np.random.seed(42)
n_samples = 100
x_data = np.random.uniform(-1, 1, (n_samples, 1)).astype(np.float32)
y_data = 3 * x_data + 2 + 0.1 * np.random.randn(n_samples, 1).astype(np.float32)

# Convert to TensorFlow tensors
x_train = tf.constant(x_data)
y_train = tf.constant(y_data)

print(f"Training data shapes: x={x_train.shape}, y={y_train.shape}")
print(f"True relationship: y = 3x + 2")
print()

# Initialize model parameters
# Our model: y_pred = w * x + b
w = tf.Variable(tf.random.normal([1, 1]), name='weight')  # Start with random weight
b = tf.Variable(tf.zeros([1]), name='bias')              # Start with zero bias

print(f"Initial parameters: w={w.numpy().flatten()}, b={b.numpy().flatten()}")

# Training parameters
learning_rate = 0.1
epochs = 100
losses = []  # To track training progress

# Training loop
for epoch in range(epochs):
    with tf.GradientTape() as tape:
        # Forward pass: compute predictions
        y_pred = tf.matmul(x_train, w) + b
        
        # Compute loss (mean squared error)
        loss = tf.reduce_mean(tf.square(y_pred - y_train))
    
    # Compute gradients
    gradients = tape.gradient(loss, [w, b])
    grad_w, grad_b = gradients
    
    # Update parameters (gradient descent)
    w.assign_sub(learning_rate * grad_w)  # w = w - lr * grad_w
    b.assign_sub(learning_rate * grad_b)  # b = b - lr * grad_b
    
    # Track progress
    losses.append(loss.numpy())
    
    # Print progress every 20 epochs
    if epoch % 20 == 0:
        print(f"Epoch {epoch:3d}: Loss = {loss:.4f}, w = {w.numpy().flatten()[0]:.3f}, b = {b.numpy().flatten()[0]:.3f}")

print(f"\nFinal parameters: w={w.numpy().flatten()[0]:.3f}, b={b.numpy().flatten()[0]:.3f}")
print(f"True parameters: w=3.000, b=2.000")
print(f"Final loss: {losses[-1]:.6f}")

# Plot the training progress
plt.figure(figsize=(12, 4))

plt.subplot(1, 2, 1)
plt.plot(losses)
plt.title('Training Loss')
plt.xlabel('Epoch')
plt.ylabel('Mean Squared Error')
plt.grid(True)

plt.subplot(1, 2, 2)
# Plot data and learned line
plt.scatter(x_data, y_data, alpha=0.5, label='Data')
x_line = np.linspace(-1, 1, 100)
y_line = w.numpy().flatten()[0] * x_line + b.numpy().flatten()[0]
y_true_line = 3 * x_line + 2
plt.plot(x_line, y_line, 'r-', label=f'Learned: y = {w.numpy().flatten()[0]:.2f}x + {b.numpy().flatten()[0]:.2f}')
plt.plot(x_line, y_true_line, 'g--', label='True: y = 3x + 2')
plt.title('Linear Regression Result')
plt.xlabel('x')
plt.ylabel('y')
plt.legend()
plt.grid(True)

plt.tight_layout()
plt.show()

print("\nDone. The learned w and b should be close to the true values (3 and 2).")

## Common Gotchas and Best Practices

Here are some important things to remember when working with GradientTape:
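
One related detail: when a source is not connected to the target, `tape.gradient` returns `None` for it by default. If zeros are more convenient (for example, when summing gradients), the `unconnected_gradients` argument can request them instead. A small sketch, assuming TensorFlow 2.x; the variables are illustrative.

```python
import tensorflow as tf

x = tf.Variable(2.0)
y = tf.Variable(5.0)   # never used in the computation below

with tf.GradientTape(persistent=True) as tape:
    z = x ** 2

# Default behaviour: the unconnected source yields None.
default_grads = tape.gradient(z, [x, y])

# Ask for zeros instead of None.
zero_grads = tape.gradient(
    z, [x, y], unconnected_gradients=tf.UnconnectedGradients.ZERO
)
del tape

print(default_grads)
print(zero_grads)
```

Requesting zeros can hide a genuinely broken gradient path, so it is probably best reserved for cases where a missing connection is expected.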

In [None]:
# Common gotchas and best practices

print("=== Common GradientTape Gotchas ===")
print()

# 1. Tape can only be used once (unless persistent=True)
print("1. Tape consumption:")
x = tf.Variable(2.0)

with tf.GradientTape() as tape:
    y = x**2

# First call works
grad1 = tape.gradient(y, x)
print(f"First gradient call: {grad1}")

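# A second call on a non-persistent tape raises RuntimeError (the tape is consumed)
try:
    grad2 = tape.gradient(y, x)
    print(f"Second gradient call: {grad2}")
except RuntimeError as err:
    print(f"Second gradient call failed: {err}")
print("Solution: Use persistent=True if you need multiple gradient calls")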
print()

# 2. Operations outside the tape are not recorded
print("2. Operations must be inside the tape:")
x = tf.Variable(3.0)
y = x**2  # This operation is OUTSIDE the tape

with tf.GradientTape() as tape:
    z = y + 1  # Only this operation is recorded

grad = tape.gradient(z, x)
print(f"Gradient when y=x^2 is outside tape: {grad}")
print("This is None because the tape didn't see the x^2 operation!")
print()

# Correct version - all operations inside tape
with tf.GradientTape() as tape:
    y = x**2  # Now inside the tape
    z = y + 1

grad = tape.gradient(z, x)
print(f"Gradient when all operations inside tape: {grad}")
print()

# 3. Gradient of non-differentiable operations
print("3. Non-differentiable operations:")
x = tf.Variable(2.0)

with tf.GradientTape() as tape:
    # Casting to an integer dtype is not differentiable, so the gradient path is cut here
    y = tf.cast(x, tf.int32)
    z = tf.cast(y, tf.float32) + 1.0

grad = tape.gradient(z, x)
print(f"Gradient through type casting: {grad}")
print("Casting to integer and back breaks gradient flow!")
print()

# Best practices
print("=== Best Practices ===")
print("1. Keep every operation you want to differentiate inside the tape context")
print("2. Use persistent=True only when necessary (and remember to delete)")
print("3. Watch constants explicitly if you need their gradients")
print("4. Be careful with operations that break gradient flow (integer casts, tf.argmax, NumPy round trips)")
print("5. Check that gradients are not None before using them")

# Example of checking gradients
x = tf.Variable(1.0)
with tf.GradientTape() as tape:
    y = x**2

grad = tape.gradient(y, x)
if grad is not None:
    print(f"\nGradient is valid: {grad}")
    # Safe to use gradient
else:
    print("\nWarning: Gradient is None!")
    # Handle this case appropriately

## Advanced Example: Custom Loss Function with Regularization

Let's implement a more complex example that shows how gradients work with custom loss functions and regularization terms.
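
The regularization term used below is λ · Σ p², whose gradient with respect to each parameter p is 2λp. The pure-Python sketch here shows that piece in isolation; the helper names are hypothetical, and the parameter values are simply the target coefficients of the polynomial, used as sample inputs.

```python
def l2_penalty(params, l2_weight):
    """L2 penalty: l2_weight times the sum of squared parameters."""
    return l2_weight * sum(p * p for p in params)


def l2_penalty_grad(params, l2_weight):
    """Gradient of the penalty: 2 * l2_weight * p for each parameter p."""
    return [2.0 * l2_weight * p for p in params]


params = [0.5, -2.0, 1.0, 1.0]   # sample values (w3, w2, w1, b)
l2_weight = 0.001
learning_rate = 0.01

penalty = l2_penalty(params, l2_weight)
grads = l2_penalty_grad(params, l2_weight)

# A gradient step on the penalty alone pulls every parameter toward zero.
shrunk = [p - learning_rate * g for p, g in zip(params, grads)]

print(f"penalty = {penalty:.5f}")
print(f"grads   = {grads}")
print(f"shrunk  = {shrunk}")
```

In the full training loop the tape adds this term to the gradient of the data loss automatically, so no manual derivative is needed; the sketch only shows what that extra term contributes.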

In [None]:
# Advanced example: Custom loss with L2 regularization

# Create synthetic data for a polynomial regression problem
np.random.seed(42)
n_samples = 50
x_data = np.linspace(-2, 2, n_samples).reshape(-1, 1).astype(np.float32)
# True function: y = 0.5x^3 - 2x^2 + x + 1 + noise
y_data = (0.5 * x_data**3 - 2 * x_data**2 + x_data + 1 + 
          0.3 * np.random.randn(n_samples, 1)).astype(np.float32)

x_train = tf.constant(x_data)
y_train = tf.constant(y_data)

# Create a polynomial model: y = w3*x^3 + w2*x^2 + w1*x + b
w3 = tf.Variable(tf.random.normal([1, 1], stddev=0.1), name='w3')
w2 = tf.Variable(tf.random.normal([1, 1], stddev=0.1), name='w2')
w1 = tf.Variable(tf.random.normal([1, 1], stddev=0.1), name='w1')
b = tf.Variable(tf.zeros([1]), name='bias')

parameters = [w3, w2, w1, b]

def polynomial_model(x):
    """Polynomial model: y = w3*x^3 + w2*x^2 + w1*x + b"""
    x2 = tf.square(x)
    x3 = tf.multiply(x2, x)
    return tf.matmul(x3, w3) + tf.matmul(x2, w2) + tf.matmul(x, w1) + b

def custom_loss(y_true, y_pred, parameters, l2_weight=0.01):
    """Custom loss with L2 regularization"""
    # Mean squared error
    mse_loss = tf.reduce_mean(tf.square(y_true - y_pred))
    
    # L2 regularization (sum of squares of all parameters)
    l2_loss = tf.add_n([tf.reduce_sum(tf.square(param)) for param in parameters])
    
    # Total loss
    total_loss = mse_loss + l2_weight * l2_loss
    
    return total_loss, mse_loss, l2_loss

# Training parameters
learning_rate = 0.01
epochs = 200
l2_weight = 0.001  # Regularization strength

# Track training progress
total_losses = []
mse_losses = []
l2_losses = []

print(f"Training polynomial regression with L2 regularization (λ={l2_weight})")
print(f"Target function: y = 0.5x³ - 2x² + x + 1")
print()

# Training loop
for epoch in range(epochs):
    with tf.GradientTape() as tape:
        # Forward pass
        predictions = polynomial_model(x_train)
        
        # Compute loss
        total_loss, mse_loss, l2_loss = custom_loss(y_train, predictions, parameters, l2_weight)
    
    # Compute gradients
    gradients = tape.gradient(total_loss, parameters)
    
    # Check if any gradient is None
    if any(grad is None for grad in gradients):
        print("Warning: Some gradients are None!")
        break
    
    # Update parameters
    for param, grad in zip(parameters, gradients):
        param.assign_sub(learning_rate * grad)
    
    # Track progress
    total_losses.append(total_loss.numpy())
    mse_losses.append(mse_loss.numpy())
    l2_losses.append(l2_loss.numpy())
    
    # Print progress
    if epoch % 50 == 0:
        print(f"Epoch {epoch:3d}: Total Loss = {total_loss:.4f}, "
              f"MSE = {mse_loss:.4f}, L2 = {l2_loss:.4f}")
        print(f"  Coefficients: w3={w3.numpy().flatten()[0]:.3f}, "
              f"w2={w2.numpy().flatten()[0]:.3f}, "
              f"w1={w1.numpy().flatten()[0]:.3f}, "
              f"b={b.numpy().flatten()[0]:.3f}")

print(f"\nFinal coefficients:")
print(f"  w3 = {w3.numpy().flatten()[0]:.3f} (target: 0.5)")
print(f"  w2 = {w2.numpy().flatten()[0]:.3f} (target: -2.0)")
print(f"  w1 = {w1.numpy().flatten()[0]:.3f} (target: 1.0)")
print(f"  b  = {b.numpy().flatten()[0]:.3f} (target: 1.0)")

# Plot results
plt.figure(figsize=(15, 5))

# Plot 1: Training curves
plt.subplot(1, 3, 1)
plt.plot(total_losses, label='Total Loss')
plt.plot(mse_losses, label='MSE Loss')
plt.plot([l * l2_weight for l in l2_losses], label='L2 Loss (scaled)')
plt.title('Training Curves')
plt.xlabel('Epoch')
plt.ylabel('Loss')
plt.legend()
plt.grid(True)

# Plot 2: Data and fitted curve
plt.subplot(1, 3, 2)
plt.scatter(x_data, y_data, alpha=0.6, label='Data')
x_test = np.linspace(-2, 2, 100).reshape(-1, 1)
y_pred = polynomial_model(tf.constant(x_test.astype(np.float32))).numpy()
y_true = 0.5 * x_test**3 - 2 * x_test**2 + x_test + 1
plt.plot(x_test, y_pred, 'r-', linewidth=2, label='Fitted')
plt.plot(x_test, y_true, 'g--', linewidth=2, label='True')
plt.title('Polynomial Fit')
plt.xlabel('x')
plt.ylabel('y')
plt.legend()
plt.grid(True)

# Plot 3: Residuals
plt.subplot(1, 3, 3)
residuals = y_data - polynomial_model(x_train).numpy()
plt.scatter(x_data, residuals, alpha=0.6)
plt.axhline(y=0, color='r', linestyle='--')
plt.title('Residuals')
plt.xlabel('x')
plt.ylabel('Residual')
plt.grid(True)

plt.tight_layout()
plt.show()

print("\nTraining complete! Compare the fitted coefficients with the targets above.")
print("The L2 term pulls the coefficients slightly toward zero; with λ this small, its effect on the fit is expected to be minor.")

## Summary and Key Takeaways

In this notebook, you learned:

1. **Automatic Differentiation**: TensorFlow can automatically compute gradients of complex functions
2. **GradientTape**: The tool for recording operations and computing gradients
3. **Basic Usage**: Computing gradients with respect to scalars and tensors
4. **Persistent Tapes**: When you need multiple gradient computations
5. **Higher-order derivatives**: Computing second derivatives and beyond
6. **Practical Applications**: Using gradients for training models
7. **Common Pitfalls**: What to watch out for when using GradientTape
8. **Advanced Examples**: Custom loss functions with regularization

## Next Steps

Now that you understand tensors and automatic differentiation, you're ready to learn about:
- Building neural network layers with Keras
- Creating and training complete models
- Working with real datasets

The gradient computation you learned here is the foundation of all neural network training!