<a href="https://colab.research.google.com/github/saptarshimazumdar/deep-learning-concepts/blob/main/backpropagation/gradient-validation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

##  **Gradient Checking for Neural Network Training**

You need to implement a gradient checking procedure to verify the correctness of backpropagation in a multilayer perceptron (MLP). Consider the following neural network architecture for regression:

* Input layer: $x \in \mathbb{R}^2$
* Hidden layer: 5 neurons with sigmoid activation
* Output layer: 1 neuron with linear activation

The network equations are:
$$h = \sigma(W^{(1)}x + b^{(1)})$$
$$\hat{y} = W^{(2)}h + b^{(2)}$$

The loss function is:
$$L = \frac{1}{2}(y - \hat{y})^2$$

You must generate a small synthetic dataset consisting of 20 samples where:
$$x_1, x_2 \sim \text{Uniform}(-1, 1)$$
and the target output is:
$$y = x_1^2 + x_2^2$$

1. Implement forward propagation and manual backpropagation using NumPy.
2. Implement gradient checking using finite difference approximation:
$$\frac{\partial L}{\partial \theta} \approx \frac{L(\theta + \epsilon) - L(\theta - \epsilon)}{2\epsilon}$$
where $\epsilon = 10^{-5}$ and $\theta$ represents any network parameter.
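
To make the finite difference formula concrete, here is a minimal sketch on a one-parameter toy problem. The function names (`loss`, `central_difference`) and the values `x=2.0`, `y=1.0`, `w0=0.3` are illustrative assumptions, not part of the exercise; the toy loss has the same squared-error form as $L$, so its exact derivative is known in closed form.

```python
def loss(w, x=2.0, y=1.0):
    # Toy squared-error loss with a single weight: L(w) = 0.5 * (y - w*x)^2
    return 0.5 * (y - w * x) ** 2

def central_difference(f, theta, epsilon=1e-5):
    # Central (two-sided) finite difference approximation of df/dtheta
    return (f(theta + epsilon) - f(theta - epsilon)) / (2 * epsilon)

w0 = 0.3
numeric = central_difference(loss, w0)
# Exact derivative: dL/dw = -(y - w*x) * x
analytic = -(1.0 - w0 * 2.0) * 2.0

print(f"numeric = {numeric:.10f}, analytic = {analytic:.10f}")
```

The two values should agree to many decimal places. The central difference has truncation error of order $\epsilon^2$, which is why it is preferred over the one-sided difference for gradient checking.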

### **Solution**

#### Mathematical Derivations
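**Forward Pass** (batched over $N$ samples, one per row of $X$, so each weight matrix is the transpose of the one in the problem statement):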
* $Z^{(1)} = X W^{(1)} + b^{(1)}$
* $H = \sigma(Z^{(1)})$
* $Z^{(2)} = H W^{(2)} + b^{(2)}$
* $\hat{y} = Z^{(2)}$ (Linear activation)
* Batch Loss: $L = \frac{1}{N} \sum_{i=1}^N \frac{1}{2}(\hat{y}_i - y_i)^2$
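
As a quick shape check of the batched forward pass above, the following sketch uses random placeholder weights. The variable names and the `default_rng(0)` seed are assumptions for illustration only and are independent of the model defined later in the notebook.

```python
import numpy as np

rng = np.random.default_rng(0)  # illustrative seed, not the notebook's
N = 20

X = rng.uniform(-1, 1, size=(N, 2))   # one sample per row
W1 = rng.normal(size=(2, 5))          # input -> hidden
b1 = np.zeros((1, 5))
W2 = rng.normal(size=(5, 1))          # hidden -> output
b2 = np.zeros((1, 1))

Z1 = X @ W1 + b1                      # bias broadcasts across the N rows
H = 1.0 / (1.0 + np.exp(-Z1))         # sigmoid activation
y_hat = H @ W2 + b2                   # linear output

print(Z1.shape, H.shape, y_hat.shape)  # (20, 5) (20, 5) (20, 1)
```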

**Backward Pass (Chain Rule):**
* Derivative of the loss w.r.t. the predictions: $d\hat{y} = \frac{\partial L}{\partial \hat{y}} = \frac{1}{N} (\hat{y} - y)$
* Gradients for Output Layer ($W^{(2)}, b^{(2)}$):
  * $dW^{(2)} = H^T d\hat{y}$
  * $db^{(2)} = \sum_{i=1}^N d\hat{y}_i$
* Gradients for Hidden Layer ($W^{(1)}, b^{(1)}$):
  * $dH = d\hat{y} (W^{(2)})^T$
  * $dZ^{(1)} = dH \odot H \odot (1 - H)$ <sup>( using the derivative of the sigmoid function $\sigma'(x) = \sigma(x)(1 - \sigma(x))$ )</sup>
  * $dW^{(1)} = X^T dZ^{(1)}$
  * $db^{(1)} = \sum_{i=1}^N dZ^{(1)}_i$
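
The sigmoid derivative identity used in $dZ^{(1)}$ can itself be checked numerically. The sketch below is a standalone illustration; the grid of test points `z` is an arbitrary choice.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

z = np.linspace(-4.0, 4.0, 9)   # arbitrary test points
eps = 1e-5

# Closed form used in the backward pass: sigma'(z) = sigma(z) * (1 - sigma(z))
analytic = sigmoid(z) * (1.0 - sigmoid(z))
# Central difference approximation of the same derivative
numeric = (sigmoid(z + eps) - sigmoid(z - eps)) / (2 * eps)

max_abs_diff = np.max(np.abs(analytic - numeric))
print(f"max |analytic - numeric| = {max_abs_diff:.2e}")
```

A small maximum difference here supports using $H \odot (1 - H)$ in place of recomputing $\sigma'(Z^{(1)})$.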

In [None]:
import numpy as np

#### **Dataset Generation**

In [None]:
np.random.seed(42)
N = 20
X = np.random.uniform(-1, 1, size=(N, 2))
# Target: y = x1^2 + x2^2 (Shape: (20, 1))
y = (X[:, 0]**2 + X[:, 1]**2).reshape(N, 1)

In [None]:
class MultilayerPerceptron:
  def __init__(self, input_dim=2, hidden_dim=5, output_dim=1):
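    # Initialize weights from a zero-mean normal with std = 2/(fan_in + fan_out).
    # Note: np.random.normal takes the standard deviation as its second argument,
    # so this is a smaller scale than Xavier/Glorot normal, which uses
    # std = sqrt(2/(fan_in + fan_out)).
    # Initialize biases to zero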
    self.W1 = np.random.normal(0, 2/(input_dim + hidden_dim), (input_dim, hidden_dim))
    self.b1 = np.zeros((1, hidden_dim))
    self.W2 = np.random.normal(0, 2/(hidden_dim + output_dim), (hidden_dim, output_dim))
    self.b2 = np.zeros((1, output_dim))

  def sigmoid(self, z):
    return 1 / (1 + np.exp(-z))

  def forward(self, X):
    self.X = X

    # Hidden layer
    self.Z1 = np.dot(X, self.W1) + self.b1
    self.H = self.sigmoid(self.Z1)

    # Output layer
    self.Z2 = np.dot(self.H, self.W2) + self.b2
    self.y_hat = self.Z2

    return self.y_hat

  def compute_loss(self, y_hat, y):
    # Batch Mean Squared Error: 1/N * sum( 1/2 * (y_hat - y)^2 )
    N = y.shape[0]
    return np.sum(0.5 * (y_hat - y)**2) / N

  def backward(self, y):
    N = y.shape[0]

    # dL/dŷ
    dy_hat = (self.y_hat - y) / N

    # Output layer gradients
    dW2 = np.dot(self.H.T, dy_hat)
    db2 = np.sum(dy_hat, axis=0, keepdims=True)

    # Hidden layer gradients
    dH = np.dot(dy_hat, self.W2.T)
    dZ1 = dH * self.H * (1 - self.H) # dH * sigmoid_derivative

    dW1 = np.dot(self.X.T, dZ1)
    db1 = np.sum(dZ1, axis=0, keepdims=True)

    return {'W1': dW1, 'b1': db1, 'W2': dW2, 'b2': db2}

In [None]:
def gradient_check(model, X, y, epsilon=1e-5, error_cutoff=1e-7):
  print("--- Starting Gradient Check ---\n")

  # 1. Run standard forward and backward pass
  y_hat = model.forward(X)
  analytic_grads = model.backward(y)

  parameters = ['W1', 'b1', 'W2', 'b2']

  for p_name in parameters:
    param_array = getattr(model, p_name)
    analytic_grad = analytic_grads[p_name]
    numeric_grad = np.zeros_like(param_array)

    # Iterate through each element in the parameter array
    it = np.nditer(param_array, flags=['multi_index'], op_flags=['readwrite'])
    while not it.finished:
      idx = it.multi_index
      orig_val = param_array[idx]

      # Compute L(theta + epsilon)
      param_array[idx] = orig_val + epsilon
      y_hat_plus = model.forward(X)
      loss_plus = model.compute_loss(y_hat_plus, y)

      # Compute L(theta - epsilon)
      param_array[idx] = orig_val - epsilon
      y_hat_minus = model.forward(X)
      loss_minus = model.compute_loss(y_hat_minus, y)

      # Restore the original parameter value
      param_array[idx] = orig_val

      # Compute the numerical gradient for this specific parameter element
      numeric_grad[idx] = (loss_plus - loss_minus) / (2 * epsilon)
      it.iternext()

    # After all elements of this parameter have been perturbed, compute the
    # relative error: ||analytic - numeric|| / (||analytic|| + ||numeric||)
    numerator = np.linalg.norm(analytic_grad - numeric_grad)
    denominator = np.linalg.norm(analytic_grad) + np.linalg.norm(numeric_grad)
    rel_error = numerator / (denominator + 1e-15)

    print(f"Parameter {p_name}: relative error = {rel_error:.4e}" + (
        f"  --> Warning: Gradient for {p_name} might be incorrect!"
        if rel_error > error_cutoff else
        f"  --> Success: Gradients for {p_name} match."
    ))

In [None]:
model = MultilayerPerceptron()
gradient_check(model, X, y)

--- Starting Gradient Check ---

Parameter W1: relative error = 1.8299e-10  --> Success: Gradients for W1 match.
Parameter b1: relative error = 3.4248e-11  --> Success: Gradients for b1 match.
Parameter W2: relative error = 5.8844e-12  --> Success: Gradients for W2 match.
Parameter b2: relative error = 1.0700e-12  --> Success: Gradients for b2 match.
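
To see how the relative error behaves when backpropagation is wrong, the sketch below runs the same kind of check on a small linear model with a deliberately broken gradient (the $1/N$ factor is dropped). Everything here is a hypothetical toy setup: `relative_error`, `numeric_gradient`, the linear model, and the `default_rng(0)` seed are assumptions for illustration and are not part of the notebook's MLP.

```python
import numpy as np

def relative_error(a, b, tiny=1e-15):
    # ||a - b|| / (||a|| + ||b||), guarded against division by zero
    return np.linalg.norm(a - b) / (np.linalg.norm(a) + np.linalg.norm(b) + tiny)

def numeric_gradient(loss_fn, w, epsilon=1e-5):
    # Central difference, perturbing one element of w at a time
    grad = np.zeros_like(w)
    for i in range(w.size):
        orig = w.flat[i]
        w.flat[i] = orig + epsilon
        loss_plus = loss_fn(w)
        w.flat[i] = orig - epsilon
        loss_minus = loss_fn(w)
        w.flat[i] = orig  # restore the original value
        grad.flat[i] = (loss_plus - loss_minus) / (2 * epsilon)
    return grad

rng = np.random.default_rng(0)  # illustrative seed
X_toy = rng.uniform(-1, 1, size=(20, 2))
y_toy = (X_toy ** 2).sum(axis=1, keepdims=True)
w = rng.normal(size=(2, 1))

def loss_fn(w):
    # Same batch loss form as above: 1/N * sum(1/2 * (y_hat - y)^2)
    return np.sum(0.5 * (X_toy @ w - y_toy) ** 2) / len(X_toy)

residual = X_toy @ w - y_toy
correct_grad = X_toy.T @ residual / len(X_toy)
buggy_grad = X_toy.T @ residual  # bug: missing the 1/N factor

num_grad = numeric_gradient(loss_fn, w)
err_ok = relative_error(correct_grad, num_grad)
err_bug = relative_error(buggy_grad, num_grad)
print(f"correct gradient: {err_ok:.2e}")
print(f"buggy gradient:   {err_bug:.2e}")
```

With the scaling bug, the gradient is 20 times too large, so the relative error is close to $19/21 \approx 0.9$, far above the `1e-7` cutoff. The correct gradient should land many orders of magnitude below it.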
