# CSE 25 – Introduction to Artificial Intelligence
## Worksheet 9: Backpropagation and Efficient Learning

**Context (from last class):**  
Previously, we learned how to use loss functions and gradients to train linear models, and saw how finite differences can estimate gradients.

In this worksheet, we will:
- Understand why finite-difference gradient computation is slow for large models
- Learn the intuition behind **backpropagation** and the chain rule
- Implement efficient gradient computation for a simple linear model using backpropagation
- Compare backpropagation gradients to finite-difference gradients

**Learning Objectives:**
- Explain why backpropagation is more efficient than finite differences
- Apply the chain rule to compute gradients in a computation graph
- Implement backpropagation for a sigmoid + binary cross-entropy model
- Use gradients to update model parameters and reduce loss


**Instructions:**

Create a copy of this notebook and complete it during class.  
Work through the cells below **in order**.

You may discuss with your neighbors, but make sure you understand  
what each step is doing and why.

**Submission**

When finished, download the notebook as a PDF and upload it to Gradescope under  
`In-Class – Week 6 Tuesday`.

To download as a PDF on DataHub:  
`File -> Save and Export Notebook As -> PDF`

In [1]:
import math

# Load the toy data again for the next section
X_toy = [
    [1.5, 4],
    [1, 2],
    [2, 1],
    [3, 5],
    [4, 2],
    [0, 0],
    [1.5, -0.5]
]
y_toy = [1, 1, 0, 1, 0, 0, 0]

def stable_sigmoid(z):
    """
    This implementation avoids overflow issues by handling large positive and negative values of z separately.

    z: a numeric value.

    if z >= 0: use the 'standard' formula: 1/(1 + exp(-z))
    if z < 0: use the alternative formula to avoid overflow: exp(z) / (1 + exp(z))
    """
    if z >= 0:
        return 1 / (1 + math.exp(-z))
    else:
        ez = math.exp(z)
        return ez / (1 + ez)


def stable_binary_cross_entropy(p, y):
    """
    Clips p to avoid (math) issues, then computes binary cross-entropy loss.
    
    p: model confidence (between 0 and 1)
    y: true label (0 or 1)
    """
    eps = 1e-8

    if p < eps:
        p = eps
    if p > 1 - eps:
        p = 1 - eps

    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

## Reducing the loss

Earlier in the course, we reduced error by asking:

- If we change a parameter slightly, how does the error change?
- In which direction should we move the parameter to reduce error?
- How sensitive is the error to that parameter?

We answered these questions by treating error as a function of the parameters and using **gradients** to understand how changes in the parameters affect the error.

[SIDE QUEST](https://docs.google.com/presentation/d/1g1uaurl-GckLb47FskBPzgtsrQJPiAQUhdVML_eOI_8/edit?usp=sharing): Play the slideshow and watch the videos to get some intuition about gradients. I have time-marked them in the slides, but please feel free to watch the full videos.

Here, we use the same idea, but with a more precise quantity called a **loss**.

The goal is the same:
understand how the loss changes with respect to the parameters so that we can update $w$ and $b$ to reduce it.

Our model has three parameters:
- $w_1$: weight for feature $x_1$
- $w_2$: weight for feature $x_2$
- $b$: bias

Computation Graph: 

$x \xrightarrow{\,w,b\,} z \xrightarrow{\,\sigma(.)\,} p \xrightarrow{\,y\,} L$
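
To make the computation graph concrete, here is a minimal sketch that traces one example through each stage. The input, label, and parameter values are assumptions chosen only for illustration, and the sigmoid is written in its plain (non-stabilized) form because the score here is small.

```python
import math

# Illustrative values (assumed, not taken from a trained model)
x1, x2, y = 1.5, 4.0, 1
w1, w2, b = 0.3, -0.2, 0.1

# x --(w, b)--> z : linear score
z = w1 * x1 + w2 * x2 + b

# z --(sigma)--> p : confidence between 0 and 1
p = 1 / (1 + math.exp(-z))

# p --(y)--> L : binary cross-entropy for this single example
L = -(y * math.log(p) + (1 - y) * math.log(1 - p))

print(f"z = {z:.4f}, p = {p:.4f}, L = {L:.4f}")
```

Each printed quantity depends only on the one computed just before it, which is the property backpropagation relies on later in this worksheet.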

#### Indexing Convention

We will use **two different indices** to distinguish between the feature index and the example index: 

- **Feature index**: $i = 1, \dots, n$  
  Refers to *which input feature* ($x_1, x_2, \dots$).

- **Example index**: $j = 1, \dots, N$  
  Refers to *which training example* in the dataset.

We will write it as:

$$
x_i^{(j)} \quad \text{= feature } i \text{ of training example } j
$$

In Python, indexing starts at **0**, so this corresponds to:

$$
x_i^{(j)} \;\longleftrightarrow\; \texttt{X[j-1][i-1]}
$$

In practice, when we loop over the dataset in code, we usually let `j` already range from `0` to `N-1`.  
In that case, `X[j]` corresponds to example $j+1$, and we only need to remember that feature indices are also shifted by one.
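
As a quick sketch of this convention, the snippet below reads the same entries using both the math-style indices and the loop-style indices. The small dataset `X` and the variable names `value_math`, `values_loop`, and `j_code` are made up for this illustration.

```python
# Tiny dataset with N = 3 examples and n = 2 features (illustrative values)
X = [
    [1.5, 4.0],
    [1.0, 2.0],
    [2.0, 1.0],
]

# Math notation: feature i = 2 of training example j = 3
i, j = 2, 3
value_math = X[j - 1][i - 1]

# Loop style: j_code already runs from 0 to N-1, so only the feature index is shifted
values_loop = [X[j_code][i - 1] for j_code in range(len(X))]

print(value_math)   # feature 2 of example 3
print(values_loop)  # feature 2 of every example
```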


In [None]:
# The forward pass to get the confidence values
def forward(X, w1, w2, b):
    """
    Inputs: 
    X: list of data points, where each point is a list of 2 features [x1, x2]
    w1, w2: weights for the two features
    b: bias term

    Output: A list of p (confidence values between 0 and 1) for each data point
    """
    
    # Initialize an empty list to store confidence values
    p_list = []

    # Iterate through each data point in X
    for j in range(len(X)):
        x1 = X[j][0]
        x2 = X[j][1]
        
        # Calculate the linear score z = w1*x1 + w2*x2 + b
        z = w1 * x1 + w2 * x2 + b
        
        # Use the score to get the sigmoid confidence
        p = stable_sigmoid(z)

        # Append the confidence value to the list
        p_list.append(p)
        
    return p_list

In [None]:
# The Loss Function
def get_avg_binary_cross_entropy(p_list, y):
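    """
    Computes the average binary cross-entropy loss over the dataset.

    p_list: list of predicted confidence values (between 0 and 1)
    y: list of true labels (0 or 1)

    Output: the mean of the per-example losses (a single number)
    """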
    total_loss = 0.0
    for j in range(len(p_list)):
        p = p_list[j]
        y_true = y[j]
        
        # Safety: stable_binary_cross_entropy clips p internally to avoid log(0)
        loss_j = stable_binary_cross_entropy(p, y_true)
        total_loss += loss_j
        
    return total_loss / len(p_list)

### Calculating the Gradient with Respect to Multiple Parameters

When our model has more than one input feature, such as
$$
z = w_1 x_1 + w_2 x_2 + b,
$$
the loss depends on all the parameters: $w_1$, $w_2$, and $b$. In general, models can have many weights ($w_1, w_2, \dots, w_n$) and a bias $b$.

The loss $\mathcal{L}(p, y)$ measures how well the model's prediction $p$ matches the true label $y$ for a single data point. We cannot change $y$, so we focus on how changes in the prediction $p$ affect the loss.

Since $p$ depends on $z$, which in turn depends on the parameters of our model, the **dataset-level loss** becomes a function of the model parameters:

$$
\mathcal{L}(w_1, w_2, b)
= \frac{1}{N} \sum_{j=1}^N \mathcal{L}\!\left(p^{(j)}, y^{(j)}\right),
$$

where
$$
p^{(j)} = \sigma\!\left(z^{(j)}\right),
\qquad
z^{(j)} = w_1 x_1^{(j)} + w_2 x_2^{(j)} + b.
$$

This is the **average loss over all $N$ training examples**. When we change a parameter (for example, $w_1$), every prediction $p^{(j)}$ may change. As a result, the loss for each data point may change, and the gradient captures the effect of that parameter on the **average loss across the dataset**. This is why, during optimization, we treat the loss as a function of the model parameters.

To optimize these parameters, we compute the **partial derivatives** of the average loss with respect to each parameter:

- $\frac{\partial \mathcal{L}}{\partial w_1}$: how the average loss changes as $w_1$ changes, holding the other parameters fixed.
- $\frac{\partial \mathcal{L}}{\partial w_2}$: how the average loss changes as $w_2$ changes, holding the other parameters fixed.
- $\frac{\partial \mathcal{L}}{\partial b}$: how the average loss changes as $b$ changes.

One way to approximate these derivatives is the **finite difference method**, which measures how the average dataset loss changes when we slightly nudge one parameter:

$$
\frac{\partial \mathcal{L}}{\partial w_1}
\approx
\frac{\mathcal{L}(w_1+\epsilon, w_2, b) - \mathcal{L}(w_1, w_2, b)}{\epsilon}
$$

$$
\frac{\partial \mathcal{L}}{\partial w_2}
\approx
\frac{\mathcal{L}(w_1, w_2+\epsilon, b) - \mathcal{L}(w_1, w_2, b)}{\epsilon}
$$

$$
\frac{\partial \mathcal{L}}{\partial b}
\approx
\frac{\mathcal{L}(w_1, w_2, b+\epsilon) - \mathcal{L}(w_1, w_2, b)}{\epsilon}
$$

Here, $\epsilon$ is a small number. This idea works for any number of parameters: we nudge one parameter at a time and observe how the **average loss over the dataset** changes. In practice, backpropagation computes these gradients efficiently without rerunning the model separately for each parameter.
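
Before applying this to the loss, it may help to see the same recipe on a function whose derivative we already know. The sketch below uses $f(w) = w^2$, whose exact derivative is $2w$; the helper name `finite_difference` and the test function `square` are hypothetical and exist only for this illustration.

```python
def finite_difference(f, w, eps=1e-4):
    """Forward-difference estimate of df/dw at w: nudge w by eps and measure the change."""
    return (f(w + eps) - f(w)) / eps

def square(w):
    # Simple test function with a known derivative: d/dw (w^2) = 2w
    return w * w

w = 3.0
approx = finite_difference(square, w)
exact = 2 * w

print(f"finite difference: {approx:.6f}")
print(f"exact derivative:  {exact:.6f}")
```

The estimate is close to the exact value but not identical, because $\epsilon$ is small rather than infinitely small. Each estimate also costs one extra evaluation of the function.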


In [None]:
# The gradient calculation using the finite difference method
# To see how (w1, w2, b) affect the loss L, we nudge one parameter at a time and look at how that affects the loss.

def get_gradients(X, y, w1, w2, b):
    eps = 0.0001
    
    # STEP 0: Baseline - Where are we now?
    # We must Run Forward -> Then Calculate Loss
    base_conf = forward(X, w1, w2, b)
    base_loss  = get_avg_binary_cross_entropy(base_conf, y)
    
    # STEP 1: Get gradient for w1
    # Nudge w1 -> Re-run Forward -> Re-calculate Loss
    w1_conf = forward(X, w1 + eps, w2, b)
    w1_loss  = get_avg_binary_cross_entropy(w1_conf, y)
    grad_w1  = (w1_loss - base_loss) / eps
    
    # STEP 2: Get gradient for w2
    # Nudge w2 -> Re-run Forward -> Re-calculate Loss
    w2_conf = forward(X, w1, w2 + eps, b)
    w2_loss  = get_avg_binary_cross_entropy(w2_conf, y)
    grad_w2  = (w2_loss - base_loss) / eps
    
    # STEP 3: Get gradient for bias
    # Nudge b -> Re-run Forward -> Re-calculate Loss
    b_conf = forward(X, w1, w2, b + eps)
    b_loss  = get_avg_binary_cross_entropy(b_conf, y)
    grad_b  = (b_loss - base_loss) / eps
     
    return grad_w1, grad_w2, grad_b

In [None]:
# The Training Loop

# Initialize Parameters
w1 = 0.0
w2 = 0.0
b  = 0.0

# Initialize hyperparameters
learning_rate = 0.1
epochs = 500

print(f"Initial Loss: {get_avg_binary_cross_entropy(forward(X_toy, w1, w2, b), y_toy):.4f}")

for i in range(epochs):
    
    # 1. Calculate Gradients (This runs the model 4 times!)
    dw1, dw2, db = get_gradients(X_toy, y_toy, w1, w2, b)
    
    # 2. Update Weights
    w1 = w1 - learning_rate * dw1
    w2 = w2 - learning_rate * dw2
    b  = b  - learning_rate * db
    
    if i % 50 == 0:
        # Check progress
        curr_preds = forward(X_toy, w1, w2, b)
        curr_loss = get_avg_binary_cross_entropy(curr_preds, y_toy)
        print(f"Iter {i}: w1={w1:.2f}, w2={w2:.2f}, b={b:.2f} | Loss={curr_loss:.4f}")

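print("\nFinal Result:")
# Recompute the loss with the final parameters (curr_loss was last updated at iteration 450)
final_loss = get_avg_binary_cross_entropy(forward(X_toy, w1, w2, b), y_toy)
print(f"Final loss: {final_loss:.4f}")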

print(f"w1: {w1:.4f}, w2: {w2:.4f}, b: {b:.4f}")

We just trained our model by repeating the following steps:

- Running the model forward to get the confidence values $p$
- Computing the loss using $p$ and $y$
- Nudging one parameter at a time to see how the loss changes
- Updating the parameters using those estimates


**Q. This works, but it is slow and inefficient. Why?**

*Hint:*  Look closely at `get_gradients(X, y, w1, w2, b)`. How many times does the model run forward and recompute the loss during **one training step**?


`YOUR ANSWER HERE`

<!-- When using the finite difference method in the training loop above, we must run the model forward and recompute the loss separately for each parameter (once for the baseline, and once for each parameter nudge). For a model with $k$ parameters, this means $k+1$ forward passes per training step. As the number of parameters grows (for example, in deep neural networks with thousands or millions of weights), this approach becomes extremely slow and computationally expensive. This inefficiency is why we need a better method, **backpropagation**, which computes all gradients in a single backward pass, regardless of the number of parameters. -->

### Backpropagation Idea

Each step depends only on the one before it:

- The loss $L$ depends on $p$ and $y$ (fixed)
- The confidence $p$ depends on $z$
- The score $z$ depends on $x$ (fixed), $w$ and $b$

So instead of changing one parameter at a time and rerunning everything, we can:

1. Start at the loss
2. Measure how sensitive the loss is to its input
3. Pass that sensitivity backward through the computation

This is called **backpropagation**.


$$
x \;\xrightarrow{w,b}\; z \;\xrightarrow{\sigma}\; p \;\xrightarrow{y}\; L
$$

For the computation graph above, backpropagation answers the following questions:

1. If the value of $p$ changes slightly, how much does the loss $L$ change? $\big(\frac{\partial L}{\partial p}\big)$
2. If the score $z$ changes slightly, how much does the prediction $p$ change? $\big(\frac{\partial p}{\partial z}\big)$
3. If a parameter ($w$ or $b$) changes slightly, how much does the score $z$ change? $\big(\frac{\partial z}{\partial w_i}, \frac{\partial z}{\partial b}\big)$

Once we know these **local effects**, we multiply them together.

This is the **chain rule**, applied to a computation graph.

For any parameter $w_i$, the chain rule gives:

$$
\frac{\partial L}{\partial w_i}=
\frac{\partial L}{\partial p}
\cdot
\frac{\partial p}{\partial z}
\cdot
\frac{\partial z}{\partial w_i}
$$

Similarly, for the bias:

$$
\frac{\partial L}{\partial b}=
\frac{\partial L}{\partial p}
\cdot
\frac{\partial p}{\partial z}
\cdot
\frac{\partial z}{\partial b}
$$

#### Gradients with respect to the score

For **one example** $(x, y)$, the gradient of the loss with respect to the score $z$ is:

$$
\frac{\partial L}{\partial z} = \frac{\partial L}{\partial p}
\cdot
\frac{\partial p}{\partial z}
= p - y
$$

You are **not expected** to derive this result in this course. We will use it directly.
If you are interested in the derivation, you can find it [here](https://drive.google.com/file/d/1XSSyNVk8HMhs4gKi4odK_hX7BBldkYzb/view?usp=sharing).
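
If you would like to check the result numerically rather than derive it, the sketch below multiplies the two local derivatives for one example and compares the product to $p - y$. It assumes the standard formulas $\frac{\partial L}{\partial p} = -\frac{y}{p} + \frac{1-y}{1-p}$ and $\frac{\partial p}{\partial z} = p(1-p)$, which are not stated in this worksheet; the values of `z` and `y` are chosen only for illustration.

```python
import math

def sigmoid(z):
    # Plain sigmoid; fine here because z is small
    return 1 / (1 + math.exp(-z))

# Illustrative score and label (assumed values)
z, y = 0.7, 1
p = sigmoid(z)

# Local derivative of binary cross-entropy with respect to p (assumed standard formula)
dL_dp = -(y / p) + (1 - y) / (1 - p)

# Local derivative of the sigmoid with respect to z (assumed standard formula)
dp_dz = p * (1 - p)

# Chain rule: multiply the local effects
dL_dz = dL_dp * dp_dz

print(f"chain rule product: {dL_dz:.6f}")
print(f"p - y:              {p - y:.6f}")
```

The two printed numbers should agree up to floating-point rounding, and you can change `z` and `y` to try other cases.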


#### Understanding the signal `p - y`

Recall that gradient descent updates parameters by moving in the **opposite**
direction of the gradient.

1. If $y = 1$ and $p = 0.8$:
   - Is the gradient positive or negative?
   - Will the update push the score $z$ **up** or **down**?

   `YOUR ANSWER HERE`

2. If $y = 1$ and $p = 0.2$:
   - Is the gradient positive or negative?
   - Will the update push the score $z$ **up** or **down**?

   `YOUR ANSWER HERE`

3. If $y = 0$ and $p = 0.8$:
   - Is the gradient positive or negative?
   - Will the update push the score $z$ **up** or **down**?

   `YOUR ANSWER HERE`


#### Gradients with respect to the parameters

$$
\frac{\partial z}{\partial w_i} = x_i,
\qquad
\frac{\partial z}{\partial b} = 1
$$

Using the chain rule, the gradients for **one example** are:

$$
\frac{\partial L}{\partial w_i} =
\frac{\partial L}{\partial p}
\cdot
\frac{\partial p}{\partial z}
\cdot
\frac{\partial z}{\partial w_i}
= (p - y)\,x_i
$$

$$
\frac{\partial L}{\partial b}=
\frac{\partial L}{\partial p}
\cdot
\frac{\partial p}{\partial z}
\cdot
\frac{\partial z}{\partial b} = p - y
$$

- If the model predicts **too high** ($p > y$), the gradient is positive -> decrease the score next time.
- If the model predicts **too low** ($p < y$), the gradient is negative -> increase the score next time.
- If the prediction is correct ($p \approx y$), the gradient is near zero.
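
As a small worked example of these rules, the sketch below computes the per-example gradients for a single point and takes one gradient descent step. The example point, the all-zero starting parameters, and the learning rate are assumptions chosen so the arithmetic is easy to follow by hand.

```python
import math

def sigmoid(z):
    return 1 / (1 + math.exp(-z))

# Illustrative single example and starting parameters (assumed values)
x1, x2, y = 1.5, 4.0, 1
w1, w2, b = 0.0, 0.0, 0.0
learning_rate = 0.1

z_before = w1 * x1 + w2 * x2 + b
p = sigmoid(z_before)  # with all parameters at zero, the score is 0 and p is 0.5

# Per-example gradients from the formulas above
signal = p - y          # negative here: the model predicts too low
grad_w1 = signal * x1
grad_w2 = signal * x2
grad_b = signal

# One gradient descent step: move opposite to the gradient
w1 -= learning_rate * grad_w1
w2 -= learning_rate * grad_w2
b -= learning_rate * grad_b

z_after = w1 * x1 + w2 * x2 + b

print(f"p = {p}, gradients = ({grad_w1}, {grad_w2}, {grad_b})")
print(f"score before = {z_before}, score after = {z_after:.4f}")
```

Because the prediction was too low for a positive example, the gradients are negative and the update raises the score, which matches the second bullet above.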


### Exercise: Implement Backpropagation Gradients

Fill in the `YOUR CODE HERE` sections in the function below to compute the gradients of the average dataset loss with respect to `w1`, `w2`, and `b` using the backpropagation rules for sigmoid + binary cross-entropy.

*Hint: Review the formulas in the previous markdown cell.*

After you complete this, run the next cell to compare your gradients to the finite-difference method!

In [None]:
def get_backprop_grads_dataset(X, y, w1, w2, b):
    """
    Gradients of the AVERAGE dataset loss w.r.t. (w1, w2, b)
    for sigmoid + BCE using the simplified rule:

        dL/dz = p - y

    where p = sigmoid(z) and z = w1*x1 + w2*x2 + b.

    For one example:
        dL/dw1 = (p - y) * x1
        dL/dw2 = (p - y) * x2
        dL/db  = (p - y)

    X: list of data points, where each point is a list of 2 features [x1, x2]
    y: list of true labels (0 or 1)
    w1, w2: weights for the two features
    b: bias term

    Return: dL/dw1, dL/dw2, dL/db
    """
    
    grad_w1 = 0.0
    grad_w2 = 0.0
    grad_b  = 0.0

    N = len(X)
    
    # Run forward once to get all the confidence values (p) for the dataset
    p_list = forward(X, w1, w2, b)
    
    for j in range(N):
        x1 = X[j][0]
        x2 = X[j][1]
        yj = y[j]
        pj = p_list[j]

        # Gradient of loss w.r.t. z (the linear score)
        # For sigmoid + BCE: dL/dz = p - y
        dL_dz = ... # YOUR CODE HERE

        # Gradient of loss w.r.t. parameters (using the simplified backprop rule)
        # Accumulate gradients over the dataset by summing contributions from each example
        grad_w1 += ... # YOUR CODE HERE
        grad_w2 += ... # YOUR CODE HERE
        grad_b  += ... # YOUR CODE HERE

    # Divide the summed gradients by N to get the gradients of the AVERAGE loss
    grad_w1 /= N
    grad_w2 /= N
    grad_b  /= N

    return grad_w1, grad_w2, grad_b


In [None]:
# Compare finite-difference gradients vs backprop gradients at some test params
w1_test, w2_test, b_test = 0.3, -0.2, 0.1

fd = get_gradients(X_toy, y_toy, w1_test, w2_test, b_test)
bp = get_backprop_grads_dataset(X_toy, y_toy, w1_test, w2_test, b_test)

print("Finite diff:", fd)
print("Backprop:   ", bp)
print("Abs diff:   ", (abs(fd[0]-bp[0]), abs(fd[1]-bp[1]), abs(fd[2]-bp[2])))


### Dataset loss

Our training loop minimizes the **average loss over the entire dataset**:

$$
L_{\text{avg}}(w_1, w_2, b) = \frac{1}{N}\sum_{j=1}^N L(p_j, y_j)
$$

where

$$
p_j = \sigma(z_j), \qquad z_j = w_1 x_{1}^{(j)} + w_2 x_{2}^{(j)} + b
$$

So the gradients we want are:

$$
\frac{\partial L_{\text{avg}}}{\partial w_1},\quad
\frac{\partial L_{\text{avg}}}{\partial w_2},\quad
\frac{\partial L_{\text{avg}}}{\partial b}
$$

Backpropagation will compute the gradient contribution for each example and then we will **average** them.

Note that backprop still needs the forward predictions $p$. The speedup is that we no longer rerun the model once per parameter nudge.



In [None]:
# The Training Loop

# Initialize Parameters
w1 = 0.0
w2 = 0.0
b  = 0.0

# Initialize hyperparameters
learning_rate = 0.1
epochs = 500

print(f"Initial Loss: {get_avg_binary_cross_entropy(forward(X_toy, w1, w2, b), y_toy):.4f}")

for i in range(epochs):
    
    dw1, dw2, db = get_backprop_grads_dataset(X_toy, y_toy, w1, w2, b)
    
    w1 = w1 - learning_rate * dw1
    w2 = w2 - learning_rate * dw2
    b  = b  - learning_rate * db
    
    if i % 50 == 0:
        # Check progress
        curr_preds = forward(X_toy, w1, w2, b)
        curr_loss = get_avg_binary_cross_entropy(curr_preds, y_toy)
        print(f"Iter {i}: w1={w1:.2f}, w2={w2:.2f}, b={b:.2f} | Loss={curr_loss:.4f}")

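print("\nFinal Result:")
# Recompute the loss with the final parameters (curr_loss was last updated at iteration 450)
final_loss = get_avg_binary_cross_entropy(forward(X_toy, w1, w2, b), y_toy)
print(f"Final loss: {final_loss:.4f}")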

print(f"w1: {w1:.4f}, w2: {w2:.4f}, b: {b:.4f}")