# CSE 25 – Introduction to Artificial Intelligence  
## Worksheet 10: Hidden Layers and Nonlinearity

**Context (from last class)**

Previously, we learned how to compute gradients efficiently using **backpropagation** for a single-layer model (sigmoid + binary cross-entropy).

We saw:

- Why finite-difference gradient computation is inefficient  
- How the chain rule enables efficient gradient computation  
- How gradients are used to update parameters and reduce loss  

However, our model was still a single linear layer.

Today, we go one step further. In this worksheet, we will:

- Extend the binary classification model to **multi-class classification** using softmax  
- Understand why a single linear layer is insufficient for some problems (XOR)  
- Introduce hidden layers and activation functions  
- Analyze how backpropagation works in a deeper network
- Trace gradients through a multi-layer computation graph


**Learning Objectives**

By the end of today, you should be able to:

- Explain how **softmax** generalizes sigmoid to multiple classes
- Compute multi-class cross-entropy loss using one-hot encoding
- Explain why XOR cannot be solved by a single linear model
- Describe how hidden layers increase model expressivity
- Apply the chain rule to deeper computation graphs
- Explain how gradients propagate through multiple layers

**Instructions:**

Create a copy of this notebook and complete it during class.  
Work through the cells below **in order**.

You may discuss with your neighbors, but make sure you understand  
what each step is doing and why.

**Submission**

When finished, download the notebook as a PDF and upload it to Gradescope under  
`In-Class – Week 6 Thursday`.

To download as a PDF on DataHub:  
`File -> Save and Export Notebook As -> PDF`


The next cell generates diagrams for the different models discussed so far. You do not need to understand or modify the code; just run the cell to visualize the model structures.

In [None]:
from graphviz import Digraph

def val(dot, name, label=None):
    dot.node(name, label if label else name, shape="box")

def fn(dot, name, label):
    dot.node(name, label, shape="circle")

def connect(dot, src, dst, weight=None):
    # Draw an edge, labelling it with the weight when one is given
    if weight:
        dot.edge(src, dst, label=weight)
    else:
        dot.edge(src, dst)


def add_neuron(dot, suffix, activation, inputs, box_neuron=False, cluster_label="Neuron"):
    """Creates Σ -> z -> act structure."""
    s_id, z_id, a_id = f"sum{suffix}", f"z{suffix}", f"act{suffix}"
    z_lab = f"z{suffix}" if suffix else "z"
    
    if box_neuron:
        with dot.subgraph(name=f"cluster_{suffix}") as c:
            c.attr(label=cluster_label, style="dashed")
            fn(c, s_id, "Σ")
            val(c, z_id, z_lab)
            fn(c, a_id, activation)
            connect(c, s_id, z_id)
            connect(c, z_id, a_id)
    else:
        fn(dot, s_id, "Σ")
        val(dot, z_id, z_lab)
        fn(dot, a_id, activation)
        connect(dot, s_id, z_id)
        connect(dot, z_id, a_id)

    for src, weight in inputs:
        connect(dot, src, s_id, weight)
        
    return a_id


def common_graph(loss_label, activation_label, output_label, box_neuron=False):
    dot = Digraph(graph_attr={"rankdir": "LR"})
    
    val(dot, "one", "1"); val(dot, "x1", "x₁"); val(dot, "x2", "x₂"); val(dot, "y", "y")
    
    last_node = add_neuron(dot, "", activation_label, 
                           [("one", "* b"), ("x1", "* w₁"), ("x2", "* w₂")], box_neuron)
    
    val(dot, "out", output_label); fn(dot, "loss", loss_label); val(dot, "L", "L")
    connect(dot, last_node, "out"); connect(dot, "out", "loss")
    connect(dot, "y", "loss"); connect(dot, "loss", "L")
    return dot

# Standard Models
def general_linear_model_graph(box_neuron=False): return common_graph("L(.)", "f(.)", "out", box_neuron)
def linear_regression_graph(box_neuron=False): return common_graph("MSE", "id", "ŷ", box_neuron)
def perceptron_graph(box_neuron=False): return common_graph("P_Loss", "sign", "ŷ", box_neuron)
def logistic_regression_graph(box_neuron=False): return common_graph("BCE", "σ", "p", box_neuron)

We have covered three foundational linear models so far:

- **Linear Regression**: Uses the identity activation function and is trained with Mean Squared Error (MSE) loss. This model is suitable for regression tasks where the output is continuous.

- **Perceptron**: Uses the sign (or step) activation function and is trained using the perceptron update rule, which can be interpreted as minimizing the perceptron loss. This model is designed for binary classification with hard decision boundaries.

- **Logistic Regression**: Uses the sigmoid activation function and is trained with Binary Cross-Entropy (BCE) loss. This model outputs probabilities for binary classification.

  *(Note: We did not formally use this term earlier, but the sigmoid + BCE single-layer model is commonly known as Logistic Regression.)*

Although these models differ in activation functions and loss functions, they all share the same underlying structure: a single weighted sum of the inputs. For the two classifiers, this means a single linear decision boundary.

In [None]:
linear_regression_graph()

In [None]:
perceptron_graph()

In [None]:
logistic_regression_graph()

#### A Common Computational Structure

If we look at the three models we just saw, we notice that they all share the same core structure:

1. Compute a weighted sum: $ z = \sum_{i=1}^{n} w_i x_i + b $

2. Apply an activation function: $out = f(z)$

3. Compute a loss: $L(out, y)$

We can therefore represent all three models in a unified form:

x → (weighted sum) → z → (activation f(.)) → out → (loss L(.))

Where:

- f(z) could be:
  - identity (linear regression)
  - sign (perceptron)
  - sigmoid (logistic regression)

- L(out, y) could be:
  - Mean Squared Error (MSE)
  - perceptron loss
  - binary cross-entropy
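
The shared structure can be sketched in a few lines of Python. This is a minimal illustration, not code from an earlier worksheet: the helper names (`neuron_forward`, `identity`, `sign`, `sigmoid`, `mse`, `bce`) and the example numbers are assumptions chosen for this sketch.

```python
import math

def identity(z):
    return z

def sign(z):
    return 1 if z >= 0 else -1

def sigmoid(z):
    return 1 / (1 + math.exp(-z))

def neuron_forward(x, w, b, f):
    """Weighted sum followed by an activation: out = f(sum_i w_i * x_i + b)."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return z, f(z)

def mse(out, y):
    return (out - y) ** 2

def bce(p, y):
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

# Illustrative values (assumed for this sketch)
x, w, b = [2.0, 1.0], [0.5, -1.0], 0.25

z, y_hat = neuron_forward(x, w, b, identity)   # linear regression
_, label = neuron_forward(x, w, b, sign)       # perceptron
_, p = neuron_forward(x, w, b, sigmoid)        # logistic regression

print(z, y_hat, label, p)
print(mse(y_hat, 1.0), bce(p, 1))
```

Only the activation `f` and the loss change between the three models; the weighted sum is identical.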


In [None]:
general_linear_model_graph()

#### The Core Computational Unit: A Neuron

The combination of:

- a weighted sum
- followed by an activation function

is often called a **neuron** (or perceptron in a broader sense).

Visually, we can group:

x → (weighted sum) → z → (activation) → a

into a single computational block.


##### Biological Inspiration

The term *neuron* comes from biology.

A biological neuron:

- Receives signals through dendrites  
- Aggregates those signals  
- Generates an output signal when its internal activation exceeds a threshold  

The artificial neuron abstracts this idea:

- Inputs $x_i$ correspond to incoming signals  
- Weights $w_i$ represent synaptic strength  
- The weighted sum aggregates signals  
- The activation function determines whether and how strongly the unit "fires"

This abstraction dates back to the seminal 1943 paper by W. McCulloch and W. Pitts, *A Logical Calculus of the Ideas Immanent in Nervous Activity*.

Their model formalized the idea of weighted input aggregation followed by a threshold function, laying the foundation for modern neural networks.

Modern neural networks are inspired by biology, but they are highly simplified mathematical abstractions rather than detailed models of real neurons.


In [None]:
general_linear_model_graph(box_neuron=True)

This boxed structure is the fundamental building block of neural networks. A neural network consists of many such units connected together.


Before we connect multiple neurons into layered networks, we first extend our single-layer model to handle multiple classes.

#### Multi-Class Classification

In binary classification, we used the sigmoid function:

$$
p = \sigma(z) = \frac{1}{1 + e^{-z}}
$$

This produces a single probability $p$, and we use binary cross-entropy loss:

$$
L = -\big[y \log p + (1-y)\log(1-p)\big]
$$

##### Softmax

For multi-class classification, instead of producing one score $z$, the model produces a list of scores:

$$
z_1, z_2, \dots, z_K
$$

The **Softmax** function generalizes the sigmoid to the multi-class setting. It takes the raw scores and converts them into a probability distribution:

$$
p_i = \frac{e^{z_i}}{\sum_{j=1}^{K} e^{z_j}}
$$

Softmax is used because we want the outputs to form a single probability distribution over the $K$ classes, rather than $K$ independent sigmoid outputs.

By dividing each exponentiated score by the sum of all exponentiated scores, Softmax ensures that:

- $p_i \ge 0$
- $\sum_{i=1}^{K} p_i = 1$
- A larger $z_i$ leads to a larger $p_i$
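
A minimal NumPy sketch of softmax, assuming the scores arrive as a plain list or array. Subtracting the maximum score before exponentiating is a common numerical-stability trick that is not part of the formula above; it leaves the result unchanged because the constant cancels in the ratio. The example scores are made up.

```python
import numpy as np

def softmax(z):
    """Convert a vector of scores into probabilities that sum to 1."""
    z = np.asarray(z, dtype=float)
    shifted = z - np.max(z)       # same result as the formula, but avoids overflow
    exp_z = np.exp(shifted)
    return exp_z / exp_z.sum()

scores = [2.0, 1.0, 0.1]          # illustrative scores for K = 3 classes
probs = softmax(scores)
print(probs, probs.sum())
```

The three properties listed above can be checked directly on `probs`: every entry is positive, the entries sum to 1, and the largest score receives the largest probability.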

##### Cross-Entropy Loss

Previously, we used **Binary Cross-Entropy (BCE)**:
$$L = -\big[y \log p + (1-y)\log(1-p)\big]$$

For multiple classes, this generalizes to **Categorical Cross-Entropy**. We sum the error across all $K$ classes:

$$
L = -\sum_{i=1}^{K} y_i \log p_i
$$

**One-hot encoding**

If we represent our classes using *one-hot encoding*, then the true label is written as:

$$y = [0, \dots, 1, \dots, 0]$$

This is a list (or vector) where the entry for the correct class is $1$ and all other entries are $0$.

**The Simplified Loss**

Because $y_i = 0$ for every incorrect class, those terms are multiplied by zero and vanish from the summation. This leaves us with a simplified calculation that only cares about the probability assigned to the correct class:

$$L = -\log(p_{\text{correct}})$$

- If the model is confident in the right answer ($p_{\text{correct}} \approx 1$), the loss is near **0**.
- If the model assigns a low probability to the right answer, the loss becomes very large.
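
The simplification can be checked numerically. The sketch below assumes a three-class problem with made-up probabilities; `one_hot` and `categorical_cross_entropy` are illustrative helper names, not functions defined elsewhere in this worksheet.

```python
import numpy as np

def one_hot(label, num_classes):
    """Vector with a 1 at the index of the correct class and 0 elsewhere."""
    y = np.zeros(num_classes)
    y[label] = 1.0
    return y

def categorical_cross_entropy(p, y):
    """L = -sum_i y_i * log(p_i) for a probability vector p and one-hot y."""
    return -np.sum(y * np.log(p))

p = np.array([0.7, 0.2, 0.1])     # illustrative softmax output
y = one_hot(0, 3)                 # the correct class is class 0

full_loss = categorical_cross_entropy(p, y)
short_loss = -np.log(p[0])        # -log(p_correct)
print(full_loss, short_loss)

# If the correct class were class 2 instead, the loss would be larger
worse_loss = categorical_cross_entropy(p, one_hot(2, 3))
print(worse_loss)
```

The full sum and the one-term shortcut agree because every other term is multiplied by zero.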

In [None]:
def softmax_logistic_regression_graph(box_neuron=False):
    dot = Digraph(graph_attr={"rankdir": "LR", "labelloc": "t", "label": "Softmax Logistic Regression"})
    
    # Declare inputs first
    val(dot, "one1", "1"); val(dot, "one2", "1")
    val(dot, "x1", "x₁"); val(dot, "x2", "x₂")
    val(dot, "y", "y")

    # Define Neuron 1 (Top)
    n1 = add_neuron(dot, "1", "id", [("x1", "* w₁₁"), ("x2", "* w₂₁"), ("one1", "* b₁")], 
                    box_neuron, "Neuron (Class 1)")
    
    # Define Neuron 2 (Bottom)
    n2 = add_neuron(dot, "2", "id", [("x1", "* w₁₂"), ("x2", "* w₂₂"), ("one2", "* b₂")], 
                    box_neuron, "Neuron (Class 2)")

    fn(dot, "softmax", "softmax")
    connect(dot, n1, "softmax"); connect(dot, n2, "softmax")
    
    val(dot, "p1", "p₁"); val(dot, "p2", "p₂")
    connect(dot, "softmax", "p1"); connect(dot, "softmax", "p2")
    
    fn(dot, "loss", "CCE")
    connect(dot, "p1", "loss"); connect(dot, "p2", "loss"); connect(dot, "y", "loss")
    val(dot, "L", "L"); connect(dot, "loss", "L")
    
    return dot


In [None]:
softmax_logistic_regression_graph(box_neuron=True)

**NOTE: This is still a single-layer model.**

Even though the diagram shows **two neurons** (one per class) and a softmax step, this is still a **single-layer model**.

- Each class score is just a **weighted sum of the inputs**:

$$
z_1 = w_{11}x_1 + w_{21}x_2 + b_1, \qquad
z_2 = w_{12}x_1 + w_{22}x_2 + b_2
$$

- There are **no hidden layers**.
- Softmax only converts scores into probabilities. It does **not** add depth or extra learning capacity.

Because of this, the model can only learn **linear decision boundaries**.  It cannot represent curved or complex boundaries (for example, XOR).
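
For the two-class case, a short derivation makes the linearity explicit. It uses only the softmax definition and the scores $z_1$, $z_2$ defined above:

$$
p_1 = \frac{e^{z_1}}{e^{z_1} + e^{z_2}} = \frac{1}{1 + e^{-(z_1 - z_2)}} = \sigma(z_1 - z_2)
$$

$$
z_1 - z_2 = (w_{11} - w_{12})\,x_1 + (w_{21} - w_{22})\,x_2 + (b_1 - b_2)
$$

So two-class softmax is a sigmoid applied to a single weighted sum of the inputs. The boundary $p_1 = 0.5$ is the set of points where $z_1 = z_2$, which is a straight line in the $(x_1, x_2)$ plane.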

#### The XOR Problem: A Challenge for Single-Layer Linear Models

The XOR (exclusive OR) function outputs 1 only when its two binary inputs differ.

| x1 | x2 | XOR |
|----|----|-----|
| 0  | 0  |  0  |
| 0  | 1  |  1  |
| 1  | 0  |  1  |
| 1  | 1  |  0  |

If we plot the four possible input combinations, we will see that no single straight line can separate the points where XOR is 1 from those where it is 0.
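
One way to build intuition before experimenting is a brute-force search. The sketch below tries every combination of `w1`, `w2`, and `b` on a coarse grid and reports the best number of correctly classified XOR points. The grid range and step size are arbitrary choices for illustration, and `num_correct` is a hypothetical helper name.

```python
import itertools

X_xor = [(0, 0), (0, 1), (1, 0), (1, 1)]
y_xor = [-1, 1, 1, -1]

def num_correct(w1, w2, b):
    """Count the XOR points that a single sign-activation neuron gets right."""
    correct = 0
    for (x1, x2), y in zip(X_xor, y_xor):
        pred = 1 if w1 * x1 + w2 * x2 + b >= 0 else -1
        correct += int(pred == y)
    return correct

# Values from -5 to 5 in steps of 0.5 (an assumed grid, not an exhaustive proof)
grid = [k / 2 for k in range(-10, 11)]
best = max(num_correct(w1, w2, b)
           for w1, w2, b in itertools.product(grid, repeat=3))
print(best)   # 3: no setting on this grid classifies all four points
```

A grid search is not a proof, but the algebra agrees. Classifying all four points would require $b < 0$, $w_1 + b \ge 0$, $w_2 + b \ge 0$, and $w_1 + w_2 + b < 0$. Adding the two middle inequalities gives $w_1 + w_2 + b \ge -b > 0$, which contradicts the last one.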

In the next cell, try changing the perceptron parameters (`w1`, `w2`, and `b`). Can you find any set of parameters that perfectly separates the two classes (the × and ○ markers)?


In [None]:
import numpy as np
import matplotlib.pyplot as plt
from ipywidgets import interact, FloatSlider, Dropdown, VBox, HBox

# XOR dataset
X_xor = [
    [0, 0],
    [0, 1],
    [1, 0],
    [1, 1]
]
y_xor = [-1, 1, 1, -1]  # XOR labels for perceptron

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def plot_xor_model(w1=0.0, w2=0.0, b=0.0, activation="sign"):
    
    w = [w1, w2]

    # Separate points by class (based on original XOR labels)
    X_pos = [x for x, y in zip(X_xor, y_xor) if y == 1]
    X_neg = [x for x, y in zip(X_xor, y_xor) if y == -1]

    plt.figure(figsize=(4, 4))
    plt.scatter([x[0] for x in X_pos], [x[1] for x in X_pos],
                marker='x', s=100, label='Class +1')
    plt.scatter([x[0] for x in X_neg], [x[1] for x in X_neg],
                marker='o', s=100, label='Class -1')

    # Decision boundary: w1*x1 + w2*x2 + b = 0
    x1_vals = np.linspace(-0.5, 1.5, 100)

    if w1 == 0 and w2 == 0:
        plt.text(0.5, 0.5, "No decision boundary\n(w1=0, w2=0)",
                 fontsize=12, ha='center', va='center',
                 transform=plt.gca().transAxes)
    elif w2 != 0:
        x2_vals = [-(w1/w2)*x1 - b/w2 for x1 in x1_vals]
        plt.plot(x1_vals, x2_vals, linestyle='--', label='Decision boundary')
    elif w1 != 0:
        plt.axvline(x=-b/w1, linestyle='--', label='Decision boundary')

    # Predictions
    for x, y_true in zip(X_xor, y_xor):
        score = w[0] * x[0] + w[1] * x[1] + b

        if activation == "sign":
            y_pred = 1 if score >= 0 else -1
            label_text = f"Pred: {y_pred}"

        elif activation == "sigmoid":
            p = sigmoid(score)
            y_pred = 1 if p >= 0.5 else -1
            label_text = f"p={p:.2f}"

        plt.text(x[0] + 0.05, x[1] + 0.05,
                 label_text, fontsize=9)

    plt.xlabel("x1")
    plt.ylabel("x2")
    plt.title(f"XOR with {activation} activation")
    plt.xlim(-0.5, 1.5)
    plt.ylim(-0.5, 1.5)
    plt.legend(bbox_to_anchor=(1.05, 1), loc='upper left')
    plt.grid(True)
    plt.show()


interact(
    plot_xor_model,
    w1=FloatSlider(value=1, min=-5, max=5, step=0.1, description='w1'),
    w2=FloatSlider(value=1, min=-5, max=5, step=0.1, description='w2'),
    b=FloatSlider(value=0, min=-5, max=5, step=0.1, description='b'),
    activation=Dropdown(
        options=["sign", "sigmoid"],
        value="sign",
        description="Activation"
    )
)

print("Try switching between sign and sigmoid to compare perceptron vs logistic regression behavior.")

#### Multi-Layer Perceptron (MLP) or Feedforward Neural Network

A **hidden layer** is a layer of neurons whose outputs are not the final prediction.
Hidden layers sit between the inputs and the output.

- The input layer contains the input values $x_1, x_2, \dots, x_n$.
- A hidden layer produces intermediate outputs.
- The output layer produces the final prediction.

> *NOTE: The term **Multi-Layer Perceptron (MLP)** is historical. The original perceptron (Rosenblatt, 1958) was a single-layer model with a threshold (step) activation function. Earlier work by McCulloch and Pitts (1943) introduced a similar thresholded neuron model without a learning rule. Modern MLPs stack multiple layers and typically use differentiable activation functions such as sigmoid, tanh, or ReLU.*

##### What changes when we add a hidden layer?

We introduce new intermediate outputs computed from the inputs.

Instead of:

x → (weighted sum) → output

We now have:

x → (weighted sum) → activation → (weighted sum of those intermediate outputs) → output

The model is now a composition of functions. The output depends on the weights through these intermediate values.

To compute gradients efficiently in this composed structure, we apply backpropagation.

#### Solving XOR with a Small Neural Network (2–2–1)

XOR cannot be solved with a single linear decision boundary.

To represent XOR, we need a model with a **hidden layer** that creates intermediate outputs, and an output neuron that combines them.

A common minimal example is a **2–2–1** feedforward neural network:

- **2 inputs**: $x_1, x_2$
- **2 hidden neurons**
- **1 output neuron**

The key idea is that the hidden layer can create an intermediate representation that makes the classes separable for the final output neuron.
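
As a concrete sketch, here is one hand-picked (not learned) set of weights for a 2–2–1 network with step activations. The specific numbers are just one choice that happens to work; many others do too. The function name `xor_net` is hypothetical.

```python
def step(z):
    return 1 if z >= 0 else 0

def xor_net(x1, x2):
    # Hidden layer: two neurons looking at the same inputs
    h1 = step(1.0 * x1 + 1.0 * x2 - 0.5)    # fires if at least one input is 1 (OR)
    h2 = step(1.0 * x1 + 1.0 * x2 - 1.5)    # fires only if both inputs are 1 (AND)
    # Output neuron: "OR but not AND"
    return step(1.0 * h1 - 2.0 * h2 - 0.5)

for x1, x2 in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(x1, x2, xor_net(x1, x2))
```

In the hidden space, the four inputs map to $(h_1, h_2) = (0,0), (1,0), (1,0), (1,1)$. These points are linearly separable, so a single output neuron can finish the job.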


In [None]:
def xor_mlp_graph(loss_label="BCE", box_neuron=False, activation_label="f(.)"):
    dot = Digraph(graph_attr={"rankdir": "LR"})

    val(dot, "one1", "1"); val(dot, "one2", "1"); val(dot, "one3", "1")
    val(dot, "x1", "x₁"); val(dot, "x2", "x₂"); val(dot, "y", "y")

    # Hidden Layer (with 2 Neurons)
    # Hidden Neuron 1 -> h1
    h1_act = add_neuron(dot, "1", activation_label, 
                        [("x1", "* w₁₁"), ("x2", "* w₁₂"), ("one1", "* b₁")], 
                        box_neuron, "Hidden Neuron 1")
    val(dot, "h1", "h₁")
    connect(dot, h1_act, "h1")

    # Hidden Neuron 2 -> h2
    h2_act = add_neuron(dot, "2", activation_label, 
                        [("x1", "* w₂₁"), ("x2", "* w₂₂"), ("one2", "* b₂")], 
                        box_neuron, "Hidden Neuron 2")
    val(dot, "h2", "h₂")
    connect(dot, h2_act, "h2")

    # Output Layer (with 1 Neuron)
    # Output Neuron -> p
    p_act = add_neuron(dot, "3", activation_label, 
                       [("h1", "* v₁"), ("h2", "* v₂"), ("one3", "* b₃")], 
                       box_neuron, "Output Neuron")
    val(dot, "p", "p")
    connect(dot, p_act, "p")

    # Loss & Final Scalar
    fn(dot, "loss", loss_label)
    val(dot, "L", "L")
    
    connect(dot, "p", "loss")
    connect(dot, "y", "loss")
    connect(dot, "loss", "L")

    return dot

In [None]:
xor_mlp_graph(loss_label="L(.)", box_neuron=True, activation_label="f(.)")

In [None]:
import numpy as np
import matplotlib.pyplot as plt
import ipywidgets as widgets
from IPython.display import display

# XOR dataset
X = np.array([[0,0],[0,1],[1,0],[1,1]], dtype=float)
y = np.array([0,1,1,0], dtype=int)

def step(z):
    return (z >= 0).astype(float)

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

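def activation_fn(z, activation):
    if activation == "sign":
        return step(z)      # step outputs in {0, 1}
    elif activation == "sigmoid":
        return sigmoid(z)   # outputs in (0, 1)
    else:
        raise ValueError(f"Unknown activation: {activation}")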

def plot_two_hidden(w11=1.0, w12=1.0, b1=-0.5,
                    w21=1.0, w22=1.0, b2=-1.5,
                    activation="sign"):

    threshold = 0.5

    plt.figure(figsize=(6,5))

    # Grid
    x1_vals = np.linspace(-0.5, 1.5, 300)
    x2_vals = np.linspace(-0.5, 1.5, 300)
    X1, X2 = np.meshgrid(x1_vals, x2_vals)

    Z1 = w11*X1 + w12*X2 + b1
    Z2 = w21*X1 + w22*X2 + b2

    A1 = activation_fn(Z1, activation)
    A2 = activation_fn(Z2, activation)

    H1 = (A1 >= threshold).astype(int)
    H2 = (A2 >= threshold).astype(int)

    region_code = H1*2 + H2

    plt.contourf(
        X1, X2, region_code,
        levels=[-0.1, 0.5, 1.5, 2.5, 3.5],
        alpha=0.15
    )

    # XOR points
    plt.scatter(X[y==1,0], X[y==1,1], s=120, marker="x", label="Class 1")
    plt.scatter(X[y==0,0], X[y==0,1], s=120, marker="o", label="Class 0")

    xs = np.linspace(-0.5, 1.5, 200)

    # Hidden 1 boundary
    if abs(w12) > 1e-9:
        plt.plot(xs, -(w11/w12)*xs - b1/w12, "--", linewidth=2, color="black", label="Hidden 1 (z1=0)")
    elif abs(w11) > 1e-9:
        plt.axvline(x=-b1/w11, linestyle="--", linewidth=2, color="black", label="Hidden 1 (z1=0)")

    # Hidden 2 boundary
    if abs(w22) > 1e-9:
        plt.plot(xs, -(w21/w22)*xs - b2/w22, "--", linewidth=2, color="gray", label="Hidden 2 (z2=0)")
    elif abs(w21) > 1e-9:
        plt.axvline(x=-b2/w21, linestyle="--", linewidth=2, color="gray", label="Hidden 2 (z2=0)")

    # Show activations at XOR points
    for (x1, x2) in X:
        z1 = w11*x1 + w12*x2 + b1
        z2 = w21*x1 + w22*x2 + b2
        a1 = float(activation_fn(z1, activation))
        a2 = float(activation_fn(z2, activation))
        h1 = int(a1 >= threshold)
        h2 = int(a2 >= threshold)

        if activation == "sign":
            txt = f"(h1,h2)=({h1},{h2})"
        else:
            txt = f"p=({a1:.2f},{a2:.2f})"

        plt.text(x1-0.22, x2-0.22, txt, fontsize=8)

    plt.xlim(-0.5, 1.5)
    plt.ylim(-0.5, 1.5)
    plt.xlabel("x1")
    plt.ylabel("x2")
    plt.title(f"Two Hidden Neurons ({activation})")
    plt.legend(bbox_to_anchor=(1.05, 1), loc="upper left")
    plt.grid(True, alpha=0.25)
    plt.show()

# Sliders
w11_slider = widgets.FloatSlider(value=0.5, min=-3, max=3, step=0.25, description="w11", continuous_update=False)
w12_slider = widgets.FloatSlider(value=1.0, min=-3, max=3, step=0.25, description="w12", continuous_update=False)
b1_slider  = widgets.FloatSlider(value=-0.5, min=-3, max=3, step=0.25, description="b1",  continuous_update=False)

w21_slider = widgets.FloatSlider(value=1.5, min=-3, max=3, step=0.25, description="w21", continuous_update=False)
w22_slider = widgets.FloatSlider(value=2.0, min=-3, max=3, step=0.25, description="w22", continuous_update=False)
b2_slider  = widgets.FloatSlider(value=-1.5, min=-3, max=3, step=0.25, description="b2",  continuous_update=False)

activation_dd = widgets.Dropdown(
    options=[("sign (step)", "sign"), ("sigmoid", "sigmoid")],
    value="sign",
    description="activation",
)

out = widgets.interactive_output(
    plot_two_hidden,
    {
        "w11": w11_slider, "w12": w12_slider, "b1": b1_slider,
        "w21": w21_slider, "w22": w22_slider, "b2": b2_slider,
        "activation": activation_dd,
    }
)

display(activation_dd,
        widgets.HBox([
            widgets.VBox([widgets.HTML("<b>Hidden Neuron 1</b>"),
                          w11_slider, w12_slider, b1_slider]),
            widgets.VBox([widgets.HTML("<b>Hidden Neuron 2</b>"),
                          w21_slider, w22_slider, b2_slider])
        ]),
        out)


**Note:** The visualization above shows only the **two hidden neurons** and how they partition the input space. In a full 2–2–1 network, an additional **output neuron** would take $(h_1, h_2)$ as inputs to produce the final XOR prediction.


In [None]:
xor_mlp_graph(loss_label="L(.)", box_neuron=True, activation_label="f(.)")

In [None]:
# XOR dataset
X = np.array([[0,0],[0,1],[1,0],[1,1]], dtype=float)
y = np.array([0,1,1,0], dtype=int)

def step(z):
    return (np.asarray(z) >= 0).astype(float)

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def act(z, activation):
    if activation == "sign":
        return step(z)      # {0,1}
    elif activation == "sigmoid":
        return sigmoid(z)   # (0,1)
    else:
        raise ValueError("Unknown activation")

def plot_three_neurons(w11=1.0, w12=1.0, b1=-0.5,
                       w21=1.0, w22=1.0, b2=-1.5,
                       v1=1.0, v2=-2.0, b3=-0.5,
                       hidden_activation="sign",
                       output_activation="sign"):

    plt.figure(figsize=(6,5))

    # Grid
    x1_vals = np.linspace(-0.5, 1.5, 300)
    x2_vals = np.linspace(-0.5, 1.5, 300)
    X1, X2 = np.meshgrid(x1_vals, x2_vals)

    # Hidden pre-activations
    Z1 = w11*X1 + w12*X2 + b1
    Z2 = w21*X1 + w22*X2 + b2

    # Hidden activations
    A1 = act(Z1, hidden_activation)
    A2 = act(Z2, hidden_activation)

    # Output pre-activation and activation
    Z3 = v1*A1 + v2*A2 + b3
    Yout = act(Z3, output_activation)

    # Decision: default threshold=0.5 for both
    Ypred = (Yout >= 0.5).astype(int)

    # Background regions
    plt.contourf(
        X1, X2, Ypred,
        levels=[-0.1, 0.5, 1.1],
        alpha=0.25,
        colors=["red", "blue"]
    )

    # XOR points
    plt.scatter(X[y==1,0], X[y==1,1], s=120, marker="x", label="Class 1")
    plt.scatter(X[y==0,0], X[y==0,1], s=120, marker="o", label="Class 0")

    xs = np.linspace(-0.5, 1.5, 200)

    # Hidden boundaries (z=0 lines)
    if abs(w12) > 1e-9:
        plt.plot(xs, -(w11/w12)*xs - b1/w12, "--", linewidth=2, color="black", label="Hidden 1 (z1=0)")
    elif abs(w11) > 1e-9:
        plt.axvline(x=-b1/w11, linestyle="--", linewidth=2, color="black", label="Hidden 1 (z1=0)")

    if abs(w22) > 1e-9:
        plt.plot(xs, -(w21/w22)*xs - b2/w22, "--", linewidth=2, color="gray", label="Hidden 2 (z2=0)")
    elif abs(w21) > 1e-9:
        plt.axvline(x=-b2/w21, linestyle="--", linewidth=2, color="gray", label="Hidden 2 (z2=0)")

    # Annotate XOR points
    for (x1, x2), yi in zip(X, y):
        z1 = w11*x1 + w12*x2 + b1
        z2 = w21*x1 + w22*x2 + b2
        a1 = float(act(z1, hidden_activation))
        a2 = float(act(z2, hidden_activation))

        z3 = v1*a1 + v2*a2 + b3
        yout = float(act(z3, output_activation))
        ypred = int(yout >= 0.5)

        if hidden_activation == "sign":
            htxt = f"h=({int(a1)},{int(a2)})"
        else:
            htxt = f"a=({a1:.2f},{a2:.2f})"

        if output_activation == "sign":
            otxt = f"out={int(yout)}, pred={ypred}"
        else:
            otxt = f"p={yout:.2f}, pred={ypred}"

        plt.text(x1+0.04, x2-0.18, htxt, fontsize=9)
        plt.text(x1+0.04, x2-0.32, otxt, fontsize=9)

    plt.xlim(-0.5, 1.5)
    plt.ylim(-0.5, 1.5)
    plt.xlabel("x1")
    plt.ylabel("x2")
    plt.title(f"Hidden: {hidden_activation} | Output: {output_activation}")
    plt.legend(bbox_to_anchor=(1.05, 1), loc="upper left")
    plt.grid(True, alpha=0.25)
    plt.show()

# Dropdowns
act_options = [("sign (step)", "sign"), ("sigmoid", "sigmoid")]

hidden_act_dd = Dropdown(options=act_options, value="sign", description="hidden act")
output_act_dd = widgets.Dropdown(options=act_options, value="sign", description="output act")

# Sliders
w11_slider = FloatSlider(value=0.5, min=-3, max=3, step=0.25, description="w11", continuous_update=False)
w12_slider = FloatSlider(value=1.0, min=-3, max=3, step=0.25, description="w12", continuous_update=False)
b1_slider  = FloatSlider(value=-0.5, min=-3, max=3, step=0.25, description="b1",  continuous_update=False)

w21_slider = FloatSlider(value=1.5, min=-3, max=3, step=0.25, description="w21", continuous_update=False)
w22_slider = FloatSlider(value=2.0, min=-3, max=3, step=0.25, description="w22", continuous_update=False)
b2_slider  = FloatSlider(value=-1.5, min=-3, max=3, step=0.25, description="b2",  continuous_update=False)

v1_slider  = FloatSlider(value=-1.0, min=-3, max=3, step=0.25, description="v1", continuous_update=False)
v2_slider  = FloatSlider(value=2.0, min=-3, max=3, step=0.25, description="v2", continuous_update=False)
b3_slider  = FloatSlider(value=0.5, min=-3, max=3, step=0.25, description="b3", continuous_update=False)

hidden1_box = widgets.VBox([widgets.HTML("<b>Hidden Neuron 1</b>"), w11_slider, w12_slider, b1_slider])
hidden2_box = widgets.VBox([widgets.HTML("<b>Hidden Neuron 2</b>"), w21_slider, w22_slider, b2_slider])
output_box  = widgets.VBox([widgets.HTML("<b>Output Neuron</b>"), v1_slider, v2_slider, b3_slider])

out = widgets.interactive_output(
    plot_three_neurons,
    {
        "w11": w11_slider, "w12": w12_slider, "b1": b1_slider,
        "w21": w21_slider, "w22": w22_slider, "b2": b2_slider,
        "v1": v1_slider,  "v2": v2_slider,  "b3": b3_slider,
        "hidden_activation": hidden_act_dd,
        "output_activation": output_act_dd,
    }
)

display(widgets.HBox([hidden_act_dd, output_act_dd]),
        widgets.HBox([hidden1_box, hidden2_box, output_box]),
        out)


## Common Activation Functions

In neural networks, the activation function introduces nonlinearity.
Without it, stacking layers would still produce a linear model.
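
The collapse of stacked linear layers can be verified numerically. The sketch below uses random weights with assumed shapes (2 inputs, 3 hidden units, 1 output) and no activation between the layers.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two layers with NO activation function in between (shapes are illustrative)
W1, b1 = rng.normal(size=(3, 2)), rng.normal(size=3)   # 2 inputs -> 3 hidden units
W2, b2 = rng.normal(size=(1, 3)), rng.normal(size=1)   # 3 hidden units -> 1 output

x = rng.normal(size=2)

two_layers = W2 @ (W1 @ x + b1) + b2

# The same computation as ONE linear layer with merged parameters
W = W2 @ W1
b = W2 @ b1 + b2
one_layer = W @ x + b

print(two_layers, one_layer)
```

Because $W_2(W_1 x + b_1) + b_2 = (W_2 W_1)x + (W_2 b_1 + b_2)$, the two results match for any input. The extra layer adds parameters but no expressive power until a nonlinear activation is placed between the layers.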

Below are three commonly used activation functions:

**Sigmoid**
$$
\sigma(z) = \frac{1}{1 + e^{-z}}
$$
- Output range: (0, 1)
- Often used for binary classification

**Tanh**

$$
\tanh(z) = \frac{e^{z} - e^{-z}}{e^{z} + e^{-z}}
$$

- Output range: (-1, 1)
- Zero-centered

**ReLU (Rectified Linear Unit)**
$$
\text{ReLU}(z) = \max(0, z)
$$
- Commonly used in hidden layers
- Computationally simple


In [None]:
import numpy as np
import matplotlib.pyplot as plt

# Activation functions
def sigmoid_plot(z):
    return 1 / (1 + np.exp(-z))

def tanh_plot(z):
    return np.tanh(z)

def relu_plot(z):
    return np.maximum(0, z)

def leaky_relu_plot(z, alpha=0.1):
    return np.where(z >= 0, z, alpha * z)

z = np.linspace(-5, 5, 400)

fig, axes = plt.subplots(2, 2, figsize=(15, 8))

# Sigmoid
axes[0, 0].plot(z, sigmoid_plot(z))
axes[0, 0].set_title("Sigmoid\nσ(z) = 1 / (1 + e^{-z})")
axes[0, 0].set_xlabel("z")
axes[0, 0].set_ylabel("σ(z)")
axes[0, 0].axhline(0,color='black', linewidth=1)
axes[0, 0].axvline(0,color='black', linewidth=1)
axes[0, 0].grid(True, alpha=0.3)

# Tanh
axes[0, 1].plot(z, tanh_plot(z))
axes[0, 1].set_title("Tanh\n tanh(z) = (e^z - e^{-z}) / (e^z + e^{-z})")
axes[0, 1].set_xlabel("z")
axes[0, 1].set_ylabel("tanh(z)")
axes[0, 1].axhline(0,color='black', linewidth=1)
axes[0, 1].axvline(0,color='black', linewidth=1)
axes[0, 1].grid(True, alpha=0.3)

# ReLU
axes[1, 0].plot(z, relu_plot(z))
axes[1, 0].set_title("ReLU\nReLU(z) = max(0, z)")
axes[1, 0].set_xlabel("z")
axes[1, 0].set_ylabel("ReLU(z)")
axes[1, 0].axhline(0,color='black', linewidth=1)
axes[1, 0].axvline(0,color='black', linewidth=1)
axes[1, 0].grid(True, alpha=0.3)

# Leaky ReLU
axes[1, 1].plot(z, leaky_relu_plot(z))
axes[1, 1].set_title("Leaky ReLU\nLeakyReLU(z) = max(0.1z, z)")
axes[1, 1].set_xlabel("z")
axes[1, 1].set_ylabel("Leaky ReLU(z)")
axes[1, 1].axhline(0,color='black', linewidth=1)
axes[1, 1].axvline(0,color='black', linewidth=1)
axes[1, 1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()


In practice:

- Sigmoid is typically used in the output layer for binary classification.
- Softmax is used in the output layer for multi-class classification.
- ReLU (or variants such as Leaky ReLU) is most commonly used in hidden layers.

In PA2, you will implement ReLU inside a multi-layer perceptron.

#### Autodiff

As we add hidden layers, the number of parameters grows quickly.

Each weight now influences the final output through multiple intermediate computations.
Computing all derivatives by hand becomes impractical.

In the single-layer case, we derived gradients explicitly.
For example, we computed derivatives such as:

$$
\frac{\partial L}{\partial w_i}
$$

In multi-layer networks, the output depends on parameters through
many intermediate variables:

$$
x \rightarrow z_1 \rightarrow a_1 \rightarrow z_2 \rightarrow \hat{y} \rightarrow L
$$

Keeping track of all these dependencies by hand quickly becomes tedious.

To handle this systematically, we use **backpropagation** implemented via
**automatic differentiation (autodiff)**.

Most modern deep learning libraries compute these gradients automatically.



**Important note:**

- During training, we apply backpropagation starting from the loss $L$ and set $L.\text{grad} = 1$.
- In this demo, we instead start from the prediction $p$ and set $p.\text{grad} = 1$ to illustrate how gradients flow through the computation graph.

In other words, we are computing: $\frac{\partial p}{\partial (\text{earlier values})}$

The same backpropagation mechanics apply when the final node is $L$ instead of $p$. Only the starting node changes.
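
For a single sigmoid neuron, starting from $p.\text{grad} = 1$ can be written out by hand. The sketch below uses plain floats rather than graph nodes, and the input and weight values are made up. It applies the chain rule step by step and compares one result against a finite-difference estimate.

```python
import math

def sigmoid(z):
    return 1 / (1 + math.exp(-z))

def forward(x1, x2, w1, w2, b):
    return sigmoid(w1 * x1 + w2 * x2 + b)

# Illustrative values (assumed for this sketch)
x1, x2, w1, w2, b = 2.0, -1.0, 0.5, 1.5, 0.1

p = forward(x1, x2, w1, w2, b)

# Backward pass, starting from the prediction p
p_grad = 1.0                       # dp/dp
z_grad = p_grad * p * (1 - p)      # dp/dz, the sigmoid derivative
w1_grad = z_grad * x1              # dz/dw1 = x1
w2_grad = z_grad * x2              # dz/dw2 = x2
b_grad = z_grad * 1.0              # dz/db = 1

# Finite-difference check for w1
eps = 1e-6
w1_numeric = (forward(x1, x2, w1 + eps, w2, b)
              - forward(x1, x2, w1 - eps, w2, b)) / (2 * eps)

print(w1_grad, w1_numeric)
```

If the starting node were the loss $L$ instead, every gradient here would simply be multiplied by $\frac{\partial L}{\partial p}$.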


In [None]:
import math
from graphviz import Digraph

# Helper functions to build and visualize the computational graph for autograd
def trace(root):
    nodes, edges = set(), set()
    def build(v):
        if id(v) in nodes:
            return
        nodes.add(id(v))
        for child in v["_prev"]:
            edges.add((id(child), id(v)))
            build(child)
    build(root)
    return nodes, edges

def find(root, target):
    stack=[root]
    seen=set()
    while stack:
        v=stack.pop()
        if id(v)==target:
            return v
        if id(v) in seen:
            continue
        seen.add(id(v))
        stack.extend(v["_prev"])
    return None


def draw(root, title=""):
    dot = Digraph(graph_attr={"rankdir":"LR", "label":title, "labelloc":"t"})
    nodes, edges = trace(root)

    for vid in nodes:
        v=find(root,vid)
        dot.node(
            name=str(vid),
            label=f"{{{v['label']}|data={v['data']:.4f}|grad={v['grad']:.4f}}}",
            shape="record"
        )
        if v["_op"]:
            op_id=str(vid)+v["_op"]
            dot.node(op_id, label=v["_op"], shape="circle")
            dot.edge(op_id, str(vid))

    for c,p in edges:
        parent=find(root,p)
        if parent["_op"]:
            dot.edge(str(c), str(p)+parent["_op"])
        else:
            dot.edge(str(c), str(p))
    return dot


In [None]:
# Autodiff implementation with addition, multiplication, and sigmoid operations.

# The create_value function initializes a value dictionary that represents a node in the computational graph.
# It takes the numerical data and an optional label, and sets up the structure for storing gradients, operation type, and previous nodes.
def create_value(data, label=""):
    return {
        "data": float(data),
        "grad": 0.0,
        "label": label,
        "_op": "",
        "_prev": [],
    }

# Add Operation: The add function creates a new value that represents the sum of two input values.
def add(a, b, label=""):
    out = create_value(a["data"] + b["data"], label)
    out["_op"] = "+"
    out["_prev"] = [a, b]
    return out

# Mul Operation: The mul function creates a new value that represents the product of two input values.
def mul(a, b, label=""):
    out = create_value(a["data"] * b["data"], label)
    out["_op"] = "*"
    out["_prev"] = [a, b]
    return out

# Sigmoid Operation: The sigmoid_func function creates a new value that represents the sigmoid activation of the input value z.
def sigmoid_func(z, label="p"):
    p = 1/(1+math.exp(-z["data"]))
    out = create_value(p, label)
    out["_op"] = "σ"
    out["_prev"] = [z]
    return out

**Autodiff Demo: Forward Pass for a Single Neuron**

We begin with the forward pass of a single neuron with sigmoid activation.

- Inputs: $x_1$, $x_2$
- Weights: $w_1$, $w_2$
- Bias: $b$

The neuron computes:

$$
z = w_1 x_1 + w_2 x_2 + b
$$

Then applies the sigmoid activation:

$$
p = \sigma(z) = \frac{1}{1 + e^{-z}}
$$
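
As a quick sanity check on the two formulas, the sketch below evaluates them with plain Python floats, using the same input and parameter values as the graph built in the next code cell. The `_val` names are illustrative, chosen so they do not overwrite the graph nodes.

```python
import math

# Same numbers as the graph example, stored as plain floats
x1_val, x2_val = 2.0, -1.0
w1_val, w2_val = 0.7, -1.3
b_val = 0.1

# Linear part of the neuron
z_val = w1_val * x1_val + w2_val * x2_val + b_val

# Sigmoid activation
p_val = 1.0 / (1.0 + math.exp(-z_val))

print(f"z = {z_val:.4f}")   # z = 2.8000
print(f"p = {p_val:.4f}")   # p = 0.9427
```

These values should match the `data` fields of the `z` and `p` nodes in the diagram.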

The diagram below shows the computation graph for this forward computation.  
At this stage, we are only computing values — no gradients have been propagated yet.


In [None]:
# Input Values
x1 = create_value(2.0,"x1")
x2 = create_value(-1.0,"x2")

# Set parameters
w1 = create_value(0.7,"w1")
w2 = create_value(-1.3,"w2")
b  = create_value(0.1,"b")

# FORWARD PASS
w1x1 = mul(w1,x1,"w1*x1")
w2x2 = mul(w2,x2,"w2*x2")
z_partial = add(w1x1, w2x2, "z_partial")
z = add(z_partial, b, "z")
p  = sigmoid_func(z,"p")

draw(p,"Forward pass (no gradients yet)")

Each rectangle in the diagram above represents a variable in the computation graph:

- **Label**: The name of the variable (e.g., `x1`, `w1`, `z`, `p`)
- **data**: The value of the variable computed during the forward pass
- **grad**: The gradient of the final node with respect to this variable (initialized to 0.0 before backpropagation)

Each circle represents an operation (e.g., addition `+`, multiplication `*`, or sigmoid `σ`) that combines or transforms variables.

The arrows show how values flow from inputs through operations to produce the final output.


Recall how we backpropagated the loss through the graph in the previous class.
We computed the gradient for BCE + sigmoid and then derived the gradients
with respect to the weights and biases.

Another way to compute these gradients is through **automatic differentiation (autodiff)**.
Modern libraries such as PyTorch, TensorFlow, and JAX use autodiff to compute
gradients efficiently.

Autodiff works by computing gradients locally at each operation and propagating
them backward through the computation graph using the chain rule.

In the next few cells, we will manually simulate reverse-mode autodiff:

1. Set the gradient of the final node to 1 (since $\frac{dp}{dp} = 1$).
2. Move backward through the graph, applying the chain rule at each operation.
3. Distribute or scale gradients according to the operation (addition, multiplication, activation).
4. Continue until all input variables have accumulated their gradients.

This shows how autodiff systematically computes gradients for all parameters in the model.


In [None]:
# Step 1: dp/dp = 1.0

p["grad"]=1.0
draw(p,"Step 1: dp/dp = 1")

In the diagram above, we manually set the gradient of the output $p$ to 1. This represents $\frac{dp}{dp} = 1$, which is always true for the output node in backpropagation.

The next step is to compute the gradient of the sigmoid activation. 

For $p = \sigma(z)$, the derivative is $p(1-p)$. 

This gives us $\frac{dp}{dz} = p(1-p)$.

To propagate the gradient back to $z$, we multiply the gradient at $p$ by $\frac{dp}{dz}$:
$$
z.\text{grad} = p.\text{grad} \times \frac{dp}{dz}
$$
Since $p.\text{grad}$ is 1, $z.\text{grad}$ becomes $p(1-p)$.

This process illustrates the chain rule in action, moving gradients backward through the computation graph.
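
One way to convince yourself that $p(1-p)$ is the right local derivative is to compare it with a finite-difference estimate. The sketch below does this with plain floats; the step size `eps` and the `_val` names are illustrative choices.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

z_val = 2.8
p_val = sigmoid(z_val)

# Analytic derivative from the formula above
analytic = p_val * (1 - p_val)

# Central finite-difference estimate of dp/dz
eps = 1e-6
numeric = (sigmoid(z_val + eps) - sigmoid(z_val - eps)) / (2 * eps)

print("analytic:", analytic)
print("numeric: ", numeric)
```

The two numbers should agree to many decimal places.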

In [None]:
# Step 2: Move back one edge: p to z
# dp/dz = p(1-p)
dp_dz = # YOUR CODE HERE
z["grad"] = p["grad"] * dp_dz

draw(p,"Step 2: dp/dz = dp/dp * dp/dz = 1 * p(1-p)")

In the next cell, we propagate the gradient from $z$ to its inputs.

Since $z$ was computed as

$$
z = z_{\text{partial}} + b,
$$

the gradient at $z$ is copied to both $z_{\text{partial}}$ and $b$.

For an addition operation $f(a, b) = a + b$, the partial derivatives are

$$
\frac{\partial f}{\partial a} = 1
\qquad \text{and} \qquad
\frac{\partial f}{\partial b} = 1.
$$

This means that a small change in either input causes the same change in the output.
During backpropagation, the gradient at the output is therefore copied directly to each input — no scaling is required.

This property holds for addition regardless of the values of $a$ and $b$.
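
The "copy" behavior of addition can be checked numerically. In this sketch, `upstream` stands for the gradient arriving at the output of the `+` node; its value, like the other numbers and names here, is only illustrative.

```python
def f_add(a, b):
    return a + b

a_val, b_val = 2.7, 0.1
eps = 1e-6

# Finite-difference estimates of the two partial derivatives
df_da = (f_add(a_val + eps, b_val) - f_add(a_val - eps, b_val)) / (2 * eps)
df_db = (f_add(a_val, b_val + eps) - f_add(a_val, b_val - eps)) / (2 * eps)

# Chain rule: each input receives upstream * (local derivative)
upstream = 0.05
grad_a = upstream * df_da
grad_b = upstream * df_db

print(df_da, df_db)     # both close to 1
print(grad_a, grad_b)   # both close to upstream
```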

The diagram will show the updated gradients for these nodes.


In [None]:
# Step 3: '+' copies gradient
# b.grad = z.grad
# z_partial.grad = z.grad

b["grad"]= # YOUR CODE HERE
z_partial["grad"]= # YOUR CODE HERE

draw(p,"Step 3: '+' copies gradient")

In [None]:
# Step 4: '+' copies gradient again
# w1x1.grad = z_partial.grad
# w2x2.grad = z_partial.grad

w1x1["grad"]= # YOUR CODE HERE
w2x2["grad"]= # YOUR CODE HERE

draw(p,"Step 4: '+' copies gradient again")

When backpropagating through a multiplication operation, the gradient is scaled by the value of the other input. 

For a node $f = a \times b$, the gradients are:

- $\frac{\partial f}{\partial a} = b$
- $\frac{\partial f}{\partial b} = a$

So, the gradient arriving at the output $f$ is multiplied by $b$ to update $a$'s gradient, and by $a$ to update $b$'s gradient. This is the chain rule in action for multiplication.
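
The same kind of numerical check works for multiplication. The sketch below uses illustrative values for the two inputs and estimates each partial derivative with a finite difference.

```python
def f_mul(a, b):
    return a * b

a_val, b_val = 0.7, 2.0
eps = 1e-6

# Finite-difference estimates of the two partial derivatives
df_da = (f_mul(a_val + eps, b_val) - f_mul(a_val - eps, b_val)) / (2 * eps)
df_db = (f_mul(a_val, b_val + eps) - f_mul(a_val, b_val - eps)) / (2 * eps)

print(df_da)   # close to b_val
print(df_db)   # close to a_val
```

Each input's derivative is the value of the other input, which is why a large input value produces a large gradient on the weight it multiplies.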

In [None]:
# Step 5: '*' scales gradient by the other input
# w1.grad = w1x1.grad * x1.data
# x1.grad = w1x1.grad * w1.data
# w2.grad = w2x2.grad * x2.data
# x2.grad = w2x2.grad * w2.data

w1["grad"]=# YOUR CODE HERE
x1["grad"]=# YOUR CODE HERE

w2["grad"]=# YOUR CODE HERE
x2["grad"]=# YOUR CODE HERE

draw(p,"Step 5: '*' scales gradient")

In [None]:
# Topological sort of the graph (from inputs to output)
# The following is a depth-first search that builds a list of nodes in topological order (inputs before outputs).
# We use a set to track visited nodes so that a node reachable by several paths is added only once.
def topo_sort(root):
    topo = []
    visited = set()

    def build(v):
        vid = id(v)
        if vid in visited:
            return
        visited.add(vid)
        for child in v["_prev"]:
            build(child)
        topo.append(v)
    build(root)
    return topo

# Backpropagation: traverse the graph in reverse topological order and call _backward() at each node
def backward(root):
    topo = topo_sort(root)
    root["grad"] = 1.0
    for v in reversed(topo):
        # print(v["label"])
        # Leaf nodes made by create_value have no _backward, so skip them
        if "_backward" in v:
            v["_backward"]()
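
To see the visiting order that reverse topological traversal produces, the sketch below repeats the same depth-first idea on a tiny graph. `make_node` and `topo_sort_demo` are hypothetical stand-ins that keep only the fields the sort needs, so the cell runs on its own.

```python
def make_node(label, prev=()):
    # Hypothetical minimal node: just a label and its input nodes
    return {"label": label, "_prev": list(prev)}

def topo_sort_demo(root):
    topo, visited = [], set()
    def build(v):
        if id(v) in visited:
            return
        visited.add(id(v))
        for child in v["_prev"]:
            build(child)
        topo.append(v)   # appended only after all of its inputs
    build(root)
    return topo

s = make_node("s", [make_node("t1"), make_node("t2")])
out_node = make_node("out", [s, s])   # s is used twice as an input

order = [v["label"] for v in topo_sort_demo(out_node)]
print(order)                   # ['t1', 't2', 's', 'out']
print(list(reversed(order)))   # ['out', 's', 't2', 't1']
```

In the reversed order, every node appears before its inputs, so a node's gradient is complete before it is passed further back. Note that `s` appears only once even though it feeds `out` twice.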

#### Exercise:
Update the `_backward()` functions for `add`, `mul`, and `sigmoid_func` **in the next cell** so they correctly propagate gradients using the chain rule.



In [None]:
# Add Operation: The add function creates a new value that represents the sum of two input values.
def add(a, b, label=""):
    out = create_value(a["data"] + b["data"], label)
    out["_op"] = "+"
    out["_prev"] = [a, b]

    # The backward function for addition is straightforward: 
    # the gradient just flows back equally to both inputs.
    # use += to accumulate gradients in case of multiple paths
    # a.grad += out.grad and b.grad += out.grad
    def _backward():
        a["grad"] += out["grad"]
        # YOUR CODE HERE
        pass
    out["_backward"] = _backward
    return out

# Mul Operation: The mul function creates a new value that represents the product of two input values.
def mul(a, b, label=""):
    out = create_value(a["data"] * b["data"], label)
    out["_op"] = "*"
    out["_prev"] = [a, b]

    # The backward function for multiplication uses the product rule: 
    # the gradient with respect to a is b * grad_out, and the gradient with respect to b is a * grad_out.
    # use += to accumulate gradients in case of multiple paths
    # a.grad += b.data * out.grad and b.grad += a.data * out.grad
    def _backward():
        # YOUR CODE HERE
        pass
    out["_backward"] = _backward
    return out

# Sigmoid Operation: The sigmoid_func function creates a new value that represents the sigmoid activation of the input value z.
def sigmoid_func(z, label="p"):
    p = 1/(1+math.exp(-z["data"]))
    out = create_value(p, label)
    out["_op"] = "σ"
    out["_prev"] = [z]

    # The backward function for sigmoid uses the derivative of the sigmoid function: 
    # grad_z = p * (1 - p) * grad_out, where p is the output of the sigmoid function.
    # use += to accumulate gradients in case of multiple paths
    # z.grad += p * (1-p) * out.grad
    def _backward():
        # YOUR CODE HERE
        pass
    out["_backward"] = _backward
    return out

In [None]:
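# Rebuild the graph with the updated operations so each operation node has a _backward function
z = add(add(mul(w1,x1,"w1*x1"), mul(w2,x2,"w2*x2"), "z_partial"), b, "z")
p = sigmoid_func(z,"p")

# Reset the gradients to zero before backpropagation
for v in topo_sort(p):
    v["grad"] = 0.0

draw(p,"Forward pass (no gradients yet)")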

In [None]:
# BACKWARD PASS (autodiff)
backward(p)
draw(p, "After backward() - gradients computed")

In [None]:
# temp feeds into mul twice, so the gradients from both uses must accumulate (+=)
temp = add(create_value(2.0,"t1"), create_value(1.0,"t2"), "t1+t2")
print(temp["data"])
out = mul(temp, temp, "(t1+t2)^2")

draw(out, "Graph for (t1+t2)^2")

In [None]:
backward(out)
draw(out, "Graph for (t1+t2)^2")
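
This last example is a useful test of gradient accumulation, because the node `t1+t2` is used twice by the multiplication. As a hedged check, the sketch below estimates the expected gradients with finite differences on plain floats; the function `f` and the `_val` names are illustrative and independent of the graph code.

```python
def f(t1, t2):
    return (t1 + t2) ** 2

t1_val, t2_val = 2.0, 1.0
eps = 1e-6

# Central finite-difference estimates of the two partial derivatives
df_dt1 = (f(t1_val + eps, t2_val) - f(t1_val - eps, t2_val)) / (2 * eps)
df_dt2 = (f(t1_val, t2_val + eps) - f(t1_val, t2_val - eps)) / (2 * eps)

print(round(df_dt1, 4), round(df_dt2, 4))   # 6.0 6.0
```

Analytically, the derivative is $2(t_1 + t_2) = 6$. Assuming your `_backward` functions accumulate with `+=`, the diagram should show a gradient of 6 on `t1+t2`, `t1`, and `t2`. If `mul` assigned with `=` instead, the second use of `t1+t2` would overwrite the first and the graph would report 3.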