### Introduction and Setup
We’ll create a minimal automatic differentiation system, similar to [micrograd](https://github.com/karpathy/micrograd), to demonstrate how backpropagation works under the hood. 
The idea is to build a `Value` class that stores the value and gradient of a single node in the computation graph and supports basic arithmetic operations. We will then implement a backward pass that applies the chain rule to compute gradients through the graph, and use those gradients to perform gradient descent updates.

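As a running example, we will minimize the following loss over $w_1$ and $w_2$:

$$ \text{loss} = \frac{1}{2}\left((w_1 - 1)^2 + (w_2 - 5)^2\right) $$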
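
For reference, differentiating this loss by hand gives the gradients that the automatic backward pass should reproduce. This is a worked derivation added for orientation, not something the code relies on:

$$ \frac{\partial\,\text{loss}}{\partial w_1} = w_1 - 1, \qquad \frac{\partial\,\text{loss}}{\partial w_2} = w_2 - 5 $$

Both vanish at $w_1 = 1$, $w_2 = 5$, which is the minimum that gradient descent should approach.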

In [16]:
# Let's start by defining a Value class to represent nodes in our computational graph.
class Value:
    def __init__(self, data, _children=(), _op=''):
        """
        Initialize a Value object.
        data: the numeric value (scalar) this node holds.
        _children: the nodes that produced this value (for building the graph).
        _op: the operation that produced this value (for debug/tracing purposes).
        """
        self.data = data                  # the actual scalar value
        self.grad = 0.0                   # gradient of the loss w.r.t. this value (to be computed in backprop)
        self._prev = set(_children)       # input nodes of the operation that produced this node
        self._op = _op                    # op name (optional, useful for debug)
        self._backward = lambda: None     # function to backpropagate gradient from this node to its _prev
    
    def __repr__(self):
        # For convenience, when we print a Value it will show its data
        return f"Value(data={self.data})"

    def __add__(self, other):
        # Support addition: Value + Value or Value + scalar
        other = other if isinstance(other, Value) else Value(other)
        out = Value(self.data + other.data, (self, other), _op='+')
        # Define the backward function for addition
        def _backward():
            # Gradient of the output w.r.t. each input is 1 (∂(a+b)/∂a = 1, ∂(a+b)/∂b = 1)
            self.grad += out.grad
            other.grad += out.grad
        out._backward = _backward
        return out

    def __radd__(self, other):
        # Ensure commutativity: allows scalar + Value to use __add__
        return self + other

    def __mul__(self, other):
        # Support multiplication: Value * Value or Value * scalar
        other = other if isinstance(other, Value) else Value(other)
        out = Value(self.data * other.data, (self, other), _op='*')
        def _backward():
            # ∂(a*b)/∂a = b, ∂(a*b)/∂b = a
            self.grad += other.data * out.grad
            other.grad += self.data * out.grad
        out._backward = _backward
        return out

    def __rmul__(self, other):
        # Ensure commutativity for scalar * Value
        return self * other

    def __sub__(self, other):
        # Define subtraction in terms of addition: a - b = a + (-b)
        return self + (-1 * other)

    def __pow__(self, exponent):
        # Only support exponent as int or float (scalar exponent)
        assert isinstance(exponent, (int, float)), "Only supporting int/float exponents for simplicity."
        out = Value(self.data ** exponent, (self,), _op=f'**{exponent}')
        def _backward():
            # ∂(a^k)/∂a = k * a^(k-1)
            self.grad += exponent * (self.data ** (exponent - 1)) * out.grad
        out._backward = _backward
        return out
    
    def backward(self):
        # Compute the gradient of this Value (self) w.r.t. every node in its graph.
        # 1. Topologically sort the graph of dependencies
        topo = []
        visited = set()
        def build_topo(v):
            if v not in visited:
                visited.add(v)
                for child in v._prev:
                    build_topo(child)
                topo.append(v)
        build_topo(self)
        # 2. Initialize the output node's gradient
        self.grad = 1.0
        # 3. Traverse nodes in reverse topological order and propagate gradients
        for node in reversed(topo):
            node._backward()
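
To see the ordering that `backward` relies on in isolation, here is a minimal sketch of the same depth-first topological sort on a plain dictionary. The `graph` dictionary, its node names, and the `topo_order` function are hypothetical stand-ins for the `_prev` links of the `Value` class; the graph corresponds to `((w1 - 1)**2 + (w2 - 5)**2) * 0.5`, with the constant nodes left out for brevity.

```python
# Hypothetical stand-in for the Value graph: each node name maps to the
# names of the nodes it was computed from (its `_prev` set in the class above).
graph = {
    "loss": ["s"],
    "s": ["b", "d"],
    "b": ["a"],
    "d": ["c"],
    "a": ["w1"],
    "c": ["w2"],
    "w1": [],
    "w2": [],
}

def topo_order(root):
    """Post-order DFS: a node is appended only after all of its inputs."""
    order, visited = [], set()

    def visit(name):
        if name not in visited:
            visited.add(name)
            for parent in graph[name]:
                visit(parent)
            order.append(name)

    visit(root)
    return order

forward_order = topo_order("loss")
backward_order = list(reversed(forward_order))
print(forward_order)   # inputs come before the nodes that use them
print(backward_order)  # the output node comes first
```

Walking the reversed list guarantees that a node's `grad` has received every contribution from the nodes that consume it before its own `_backward` runs.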

In [None]:
# Quick forward computation test:
w1 = Value(2.0)
w2 = Value(3.0)

# Construct an expression: ((w1 - 1)**2 + (w2 - 5)**2) * 0.5
a = w1 - 1
b = a**2
c = w2 - 5
d = c**2
s = b + d
loss = s * 0.5
print(loss)  # This is the loss given the current w1, w2

In [None]:
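# NOTE: draw_dot is not defined in this section. It is assumed to be a
# Graphviz-based visualization helper like the one in micrograd's
# trace_graph.ipynb notebook.
draw_dot(loss)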

### Forward Pass

1. **Inputs:**
   - $ w_1 = 2.0 $
   - $ w_2 = 3.0 $

2. **Intermediate Computations:**
   - **For $ w_1 $:**
     - $ a = w_1 - 1 = 2.0 - 1 = 1.0 $
     - $ b = a^2 = 1.0^2 = 1.0 $
   - **For $ w_2 $:**
     - $ c = w_2 - 5 = 3.0 - 5 = -2.0 $
     - $ d = c^2 = (-2.0)^2 = 4.0 $
   - **Combine and Scale:**
     - $ s = b + d = 1.0 + 4.0 = 5.0 $
     - $ \text{loss} = 0.5 \times s = 0.5 \times 5.0 = 2.5 $
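
The same forward pass can be reproduced with plain Python floats, which is a handy sanity check before trusting the `Value` graph. This is only a sketch; the variable names simply mirror the steps above.

```python
# Forward pass with plain floats, mirroring the steps listed above.
w1, w2 = 2.0, 3.0

a = w1 - 1      # 1.0
b = a ** 2      # 1.0
c = w2 - 5      # -2.0
d = c ** 2      # 4.0
s = b + d       # 5.0
loss = 0.5 * s  # 2.5

print(loss)
```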

---

### Backward Pass

Start by setting $\frac{d\,\text{loss}}{d\,\text{loss}} = 1$.

1. **Loss to Sum $ s $:**
   - $\frac{d\,\text{loss}}{d\,s} = 0.5$ since $\text{loss} = 0.5 \times s$.

2. **Gradients Through the Sum:**
   - Since $ s = b + d $:
     - $\frac{d\,s}{d\,b} = 1$
     - $\frac{d\,s}{d\,d} = 1$
   - Thus:
     - $\frac{d\,\text{loss}}{d\,b} = 0.5$
     - $\frac{d\,\text{loss}}{d\,d} = 0.5$

3. **Backpropagation to $ w_1 $:**
   - For $ b = a^2 $ where $ a = w_1 - 1 $:
     - $\frac{d\,b}{d\,a} = 2a$
     - At $ a = 1.0 $, this is $ 2 \times 1.0 = 2 $
     - So, $\frac{d\,\text{loss}}{d\,a} = \frac{d\,\text{loss}}{d\,b} \times \frac{d\,b}{d\,a} = 0.5 \times 2 = 1.0$
   - For $ a = w_1 - 1 $:
     - $\frac{d\,a}{d\,w_1} = 1$
     - Hence, $\frac{d\,\text{loss}}{d\,w_1} = 1.0 \times 1 = 1.0$

4. **Backpropagation to $ w_2 $:**
   - For $ d = c^2 $ where $ c = w_2 - 5 $:
     - $\frac{d\,d}{d\,c} = 2c$
     - At $ c = -2.0 $, this is $ 2 \times (-2.0) = -4 $
     - So, $\frac{d\,\text{loss}}{d\,c} = \frac{d\,\text{loss}}{d\,d} \times \frac{d\,d}{d\,c} = 0.5 \times (-4) = -2.0$
   - For $ c = w_2 - 5 $:
     - $\frac{d\,c}{d\,w_2} = 1$
     - Thus, $\frac{d\,\text{loss}}{d\,w_2} = -2.0 \times 1 = -2.0$
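
The chain-rule steps above can be written out by hand with plain floats. This is a sketch for checking the arithmetic, not how the `Value` class stores things; the `dloss_d*` names are illustrative only.

```python
# Forward values needed by the local derivatives.
w1, w2 = 2.0, 3.0
a = w1 - 1
c = w2 - 5

# Backward pass: multiply each upstream gradient by the local derivative.
dloss_dloss = 1.0
dloss_ds = 0.5 * dloss_dloss      # loss = 0.5 * s
dloss_db = 1.0 * dloss_ds         # s = b + d
dloss_dd = 1.0 * dloss_ds         # s = b + d
dloss_da = (2 * a) * dloss_db     # b = a**2
dloss_dc = (2 * c) * dloss_dd     # d = c**2
dloss_dw1 = 1.0 * dloss_da        # a = w1 - 1
dloss_dw2 = 1.0 * dloss_dc        # c = w2 - 5

print(dloss_dw1, dloss_dw2)
```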

---

### Final Summary

- **Forward Computation:**
  - $ w_1 = 2.0 $, $ w_2 = 3.0 $
  - $ a = 1.0 $, $ b = 1.0 $
  - $ c = -2.0 $, $ d = 4.0 $
  - $ s = 5.0 $
  - $ \text{loss} = 2.5 $

- **Backward Computation:**
  - $\frac{d\,\text{loss}}{d\,w_1} = 1.0$
  - $\frac{d\,\text{loss}}{d\,w_2} = -2.0$

Thus, the gradient with respect to $ w_1 $ is **1.0**, and the gradient with respect to $ w_2 $ is **-2.0**.
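
One independent way to check these numbers is a finite-difference estimate, which nudges each input slightly and measures how the loss changes. The sketch below uses central differences on plain floats; `loss_fn`, `central_difference`, and the step size `h` are illustrative choices, not part of the notebook's code.

```python
def loss_fn(w1, w2):
    """The loss from this section, evaluated on plain floats."""
    return 0.5 * ((w1 - 1) ** 2 + (w2 - 5) ** 2)

def central_difference(f, w1, w2, h=1e-6):
    """Approximate (df/dw1, df/dw2) by nudging each input by +/- h."""
    g1 = (f(w1 + h, w2) - f(w1 - h, w2)) / (2 * h)
    g2 = (f(w1, w2 + h) - f(w1, w2 - h)) / (2 * h)
    return g1, g2

g1, g2 = central_difference(loss_fn, 2.0, 3.0)
print(g1, g2)  # close to 1.0 and -2.0, up to floating-point error
```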




In [None]:
# Perform backpropagation on the computational graph
loss.backward()
# After running backward, w1.grad and w2.grad should be populated with ∂loss/∂w1 and ∂loss/∂w2


print(f"∂loss/∂w1 = {w1.grad}")  # expected 1.0 (since w1-1 = 1)
print(f"∂loss/∂w2 = {w2.grad}")  # expected -2.0 (since w2-5 = -2)

In [None]:
s.grad  # ∂loss/∂s, expected 0.5 since loss = 0.5 * s

In [None]:
draw_dot(loss)

In [None]:
# Initialize parameters
w1 = Value(2.0)
w2 = Value(3.0)
params = [w1, w2]

learning_rate = 0.1  # choose a learning rate
for i in range(100):
    # 1. Forward pass: compute the loss for current w1, w2
    loss = 0.5 * ((w1 - 1)**2 + (w2 - 5)**2)
    print(f"Iteration {i}: loss = {loss.data:.4f}")
    # 2. Backward pass: compute gradients
    loss.backward()
    # 3. Gradient descent update: w <- w - α * grad
    for p in params:
        p.data -= learning_rate * p.grad
        # Reset the gradient to 0 for the next iteration (backward() accumulates into .grad with +=)
        p.grad = 0.0
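
As a rough check on what the loop should converge to, the same updates can be run on plain floats using the hand-derived gradients $w_1 - 1$ and $w_2 - 5$. This is a sketch under that assumption, reusing the learning rate and iteration count from the loop above; because the gradient is linear in the distance to the minimum, each step shrinks that distance by a factor of `1 - learning_rate`.

```python
learning_rate = 0.1
steps = 100
w1, w2 = 2.0, 3.0

for _ in range(steps):
    # Analytic gradients of the loss: d/dw1 = w1 - 1, d/dw2 = w2 - 5.
    g1, g2 = w1 - 1, w2 - 5
    w1 -= learning_rate * g1
    w2 -= learning_rate * g2

# Each step multiplies the distance to the minimum by (1 - learning_rate).
shrink = (1 - learning_rate) ** steps
print(w1, 1 + shrink * (2.0 - 1))  # iterated value vs. closed form
print(w2, 5 + shrink * (3.0 - 5))
```

If the `Value`-based loop is working, `w1.data` and `w2.data` should end up very close to 1 and 5 respectively.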