# Gradients and Initialization

In the previous chapter we discussed an iterative algorithm for finding the parameters that minimize the loss function.<br>
The basic idea is to initialize the parameters randomly and then decrease the average loss through a series of small updates.<br>
Each update is based on the gradient of the loss with respect to the parameters at the current position.<br>

In this chapter we'll:

1. Discuss how to calculate gradients efficiently
2. Discuss how to initialize the parameters

NOTE: It may help to work through the `TensorDerivatives` notebook before reading this chapter. It reviews derivatives of quantities with various dimensions, which makes several concepts here easier to grasp.

## Problem definitions

Consider a network $f[x, \phi]$ with multivariate input $x$, parameters $\phi$, and three hidden layers $h_1, h_2, h_3$:

$$\begin{align}
h_1 &= a[\beta_0 + \Omega_0x] \\
h_2 &= a[\beta_1 + \Omega_1h_1] \\
h_3 &= a[\beta_2 + \Omega_2h_2] \\
f[x, \phi] &= \beta_3 + \Omega_3h_3
\end{align}$$

Based on the SGD algorithm we (generally) apply the following update rule: 

$$\phi_{t+1} = \phi_{t} -\alpha \sum_{i \in B_t}\frac{\partial l_i[\phi_t]}{\partial \phi}$$

$\alpha$ is the learning rate <br>
$B_t$ is the set of batch indices at iteration $t$ <br>

To compute this update we need two sets of derivatives:

$$\frac{\partial l_i}{\partial \beta_k} \qquad \text{and} \qquad \frac{\partial l_i}{\partial \Omega_k} \quad \forall k \in \{0, 1, \dots, K\}$$
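Once those per-example derivatives are available, the update itself is a single line. The snippet below is a minimal sketch, assuming NumPy and a flat parameter vector; `sgd_step` and all numeric values are hypothetical and chosen only for illustration.

```python
import numpy as np

def sgd_step(phi, per_example_grads, alpha):
    """One SGD update: phi <- phi - alpha * (sum of per-example gradients)."""
    batch_grad = np.sum(per_example_grads, axis=0)
    return phi - alpha * batch_grad

# Hypothetical values, for illustration only.
phi = np.array([1.0, -2.0])        # current parameters phi_t
per_example_grads = np.array([     # one row dl_i/dphi per index i in the batch B_t
    [0.5, 1.0],
    [1.5, -1.0],
])
phi_next = sgd_step(phi, per_example_grads, alpha=0.1)
print(phi_next)
```

The rest of the chapter is about producing the rows of `per_example_grads` efficiently.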



## Computing Derivatives

The derivative of the loss informs us how the loss changes when we make small changes to the parameters. 

The $\textcolor{lightblue}{backpropagation \ algorithm}$ computes these derivatives for us.

**Observation 1**

Each weight (an element of $\Omega_k$) multiplies the activation at a hidden unit and adds the result to a hidden unit in the next layer, and this continues until the output layer. <br>
The effect of a small change to a weight is therefore amplified or attenuated by the activation it multiplies.<br> Running the network for each data instance and storing the activations of all the hidden units is known as the $\textcolor{lightblue}{forward \ pass}$.

**Observation 2**

A small change to a weight or bias ripples through every subsequent layer of the deep, interconnected network. At the end, the network produces an output, for which we measure the loss.<br> To understand how changing a parameter modifies the loss, we need to know how a change at each layer affects the layers after it, which we work out by stepping backward through the network from the loss.<br> This is known as the $\textcolor{lightblue}{backward \ pass}$. 

Once we have calculated the derivatives, we update the parameters and repeat the forward pass and backward pass.


<div align="center">

<img  src="../images/chap6/forwardpass.png" alt="2-Layer Net" width="700" />

</div>

<div align="center">

### Example
</div>

$\text{Let our model be }$ $$f[x, \phi] = \beta_3 + \omega_3 \cdot \cos[\beta_2 + \omega_2 \cdot \exp(\beta_1 + \omega_1 \cdot \sin[\beta_0 + \omega_0 \cdot x])]$$

$\text{Our parameters }$ $$\phi = \{\beta_0, \omega_0, \beta_1, \omega_1, \beta_2, \omega_2, \beta_3, \omega_3\}$$

$\text{Our Loss function }$ $$L[ \phi ] = \sum_i l_i \quad | \quad l_i = (f[x_i, \phi] - y_i)^2$$

Based on the above we wish to determine how the loss is affected by a change in **each** of the parameters at each layer of the network: 

$$ \frac{\partial l_i}{\partial \beta_0}, \ \frac{\partial l_i}{\partial \omega_0}, \ \frac{\partial l_i}{\partial \beta_1}, \ \frac{\partial l_i}{\partial \omega_1}, \ \frac{\partial l_i}{\partial \beta_2}, \frac{\partial l_i}{\partial \omega_2}, \frac{\partial l_i}{\partial \beta_3}, \frac{\partial l_i}{\partial \omega_3}$$

### Naïve Approach

Compute directly using the chain rule...
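
As an illustration, writing the chain rule out in full for $\omega_0$ alone gives:

$$\frac{\partial l_i}{\partial \omega_0} = 2(f[x_i, \phi] - y_i) \cdot \omega_3 \cdot \left(-\sin[\beta_2 + \omega_2 \cdot \exp(\beta_1 + \omega_1 \cdot \sin[\beta_0 + \omega_0 \cdot x_i])]\right) \cdot \omega_2 \cdot \exp(\beta_1 + \omega_1 \cdot \sin[\beta_0 + \omega_0 \cdot x_i]) \cdot \omega_1 \cdot \cos[\beta_0 + \omega_0 \cdot x_i] \cdot x_i$$

Each of the other seven parameters needs a similar expression, and most of the factors are shared between them.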


<div style="background-color:#fff3cd; color:#000; border-left:5px solid #ffc107; padding:10px; margin:10px 0;">

**⚠️ Problems with the Naïve Approach:**

- Requires computing the full chain rule for each parameter separately  
- Many computations are repeated (e.g., $(f[x_i, \phi] - y_i)$ appears in every derivative)  
- Computationally expensive for deep networks  
- Intermediate terms like $\cos[\beta_0 + \omega_0 \cdot x_i]$ and $\exp(\beta_1 + \omega_1 \cdot \sin[\beta_0 + \omega_0 \cdot x_i])$ are recalculated multiple times  

</div>

**This motivates the need for the backpropagation algorithm**, which efficiently computes all derivatives by:
1. Computing intermediate values once during the forward pass
2. Reusing these values during a single backward pass
3. Storing gradients at each layer and propagating them backwards


### Backpropagation Approach

Let's divide our above function into linear components denoted by $f_k$ and non-linear components denoted by $h_k$: 

$$\begin{align}
f_0 &= \beta_0 + \omega_0 \cdot x_i \\
h_1 &= \sin(f_0) \\ 
f_1 &= \beta_1 + \omega_1 \cdot h_1 \\
h_2 &= \exp[f_1] \\
f_2 &= \beta_2 + \omega_2 \cdot h_2 \\
h_3 &= \cos(f_2) \\
f_3 &= \beta_3 + \omega_3 \cdot h_3 \\
l_i &= (f_3 - y_i)^2
\end{align}$$
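
The decomposition maps directly onto code. The following is a minimal sketch using only Python's `math` module; the function name `forward`, the ordering of `phi`, and all numeric values are assumptions made for illustration.

```python
import math

def forward(x, phi):
    """Forward pass for the toy model; returns every intermediate value."""
    b0, w0, b1, w1, b2, w2, b3, w3 = phi
    f0 = b0 + w0 * x
    h1 = math.sin(f0)
    f1 = b1 + w1 * h1
    h2 = math.exp(f1)
    f2 = b2 + w2 * h2
    h3 = math.cos(f2)
    f3 = b3 + w3 * h3
    return f0, h1, f1, h2, f2, h3, f3

# Hypothetical parameters (b0, w0, b1, w1, b2, w2, b3, w3) and data point.
phi = (0.1, 0.5, -0.2, 0.3, 0.4, -0.6, 0.2, 0.7)
x_i, y_i = 1.5, 0.3

f0, h1, f1, h2, f2, h3, f3 = forward(x_i, phi)
loss = (f3 - y_i) ** 2
print(f3, loss)
```

Returning all seven intermediate values, rather than only `f3`, is what later lets the backward pass reuse them.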

<div align="center">

<img  src="../images/chap6/fwdpass.png" alt="2-Layer Net" width="700" />

</div>

- We aim to minimize the number of computations.
- Reuse derivatives where possible.
- Compute each derivative as an isolated, local calculation.
- Obtain the same result as the direct long-chain approach.

#### The Chain Rule 

$\text{Suppose } y \text{ is a function of } u \text{, and } u \text{ is a function of } x: \quad y = f(u), \quad u = g(x).$ <br>
$\text{Then the derivative of } y \text{ with respect to } x \text{ is: }$

$$\boxed{\frac{\partial y}{\partial x} = \frac{\partial f}{\partial u}\cdot\frac{\partial g}{\partial x}}$$


$\text{For a function composed of many nested functions (such as a deep NN) the chain rule generalizes to:}$

$$\boxed{\frac{\partial l}{\partial \phi} = \frac{\partial l}{\partial z_n} \cdot \frac{\partial z_n}{\partial z_{n-1}} \cdot \frac{\partial z_{n-1}}{\partial z_{n-2}} \cdots \frac{\partial{z_1}}{\partial \phi}}$$
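
For instance, applying this to the decomposition above for the parameter $\omega_1$, the chain runs from the loss back to $f_1$, the first quantity that depends on $\omega_1$ directly:

$$\frac{\partial l_i}{\partial \omega_1} = \frac{\partial l_i}{\partial f_3} \cdot \frac{\partial f_3}{\partial h_3} \cdot \frac{\partial h_3}{\partial f_2} \cdot \frac{\partial f_2}{\partial h_2} \cdot \frac{\partial h_2}{\partial f_1} \cdot \frac{\partial f_1}{\partial \omega_1}$$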






This means that we can sequentially apply the computation as we move backward along the graph.

We'll work backwards from the loss, computing and reusing intermediate derivatives.

**Recall our decomposition:**
$$\begin{align}
f_0 &= \beta_0 + \omega_0 \cdot x_i \\
h_1 &= \sin(f_0) \\ 
f_1 &= \beta_1 + \omega_1 \cdot h_1 \\
h_2 &= \exp(f_1) \\
f_2 &= \beta_2 + \omega_2 \cdot h_2 \\
h_3 &= \cos(f_2) \\
f_3 &= \beta_3 + \omega_3 \cdot h_3 \\
l_i &= (f_3 - y_i)^2
\end{align}$$

---


#### Step 0: Compute the derivative of the loss

$$\frac{\partial l_i}{\partial f_3} = 2(f_3 - y_i)$$

**Store this value!** We'll reuse it for all subsequent calculations.

---


#### Layer 3: Computing $\frac{\partial l_i}{\partial \beta_3}$ and $\frac{\partial l_i}{\partial \omega_3}$

**For $\beta_3$:**

Since $f_3 = \beta_3 + \omega_3 \cdot h_3$, we have:
$$\frac{\partial f_3}{\partial \beta_3} = 1$$

By the chain rule:
$$\boxed{\frac{\partial l_i}{\partial \beta_3} = \frac{\partial l_i}{\partial f_3} \cdot \frac{\partial f_3}{\partial \beta_3} = 2(f_3 - y_i) \cdot 1 = 2(f_3 - y_i)}$$

**For $\omega_3$:**

$$\frac{\partial f_3}{\partial \omega_3} = h_3$$

$$\boxed{\frac{\partial l_i}{\partial \omega_3} = \frac{\partial l_i}{\partial f_3} \cdot \frac{\partial f_3}{\partial \omega_3} = 2(f_3 - y_i) \cdot h_3}$$

---


#### Moving to Layer 2: Compute $\frac{\partial l_i}{\partial h_3}$

We need this for the next layer:
$$\frac{\partial l_i}{\partial h_3} = \frac{\partial l_i}{\partial f_3} \cdot \frac{\partial f_3}{\partial h_3} = 2(f_3 - y_i) \cdot \omega_3$$

**Store this value!**

---

#### Compute $\frac{\partial l_i}{\partial f_2}$

Since $h_3 = \cos(f_2)$:
$$\frac{\partial h_3}{\partial f_2} = -\sin(f_2)$$

$$\frac{\partial l_i}{\partial f_2} = \frac{\partial l_i}{\partial h_3} \cdot \frac{\partial h_3}{\partial f_2} = 2(f_3 - y_i) \cdot \omega_3 \cdot (-\sin(f_2))$$

**Store this value!**

---


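#### Layer 2: Computing $\frac{\partial l_i}{\partial \beta_2}$ and $\frac{\partial l_i}{\partial \omega_2}$

**For $\beta_2$:**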

Since $f_2 = \beta_2 + \omega_2 \cdot h_2$:
$$\frac{\partial f_2}{\partial \beta_2} = 1$$

$$\boxed{\frac{\partial l_i}{\partial \beta_2} = \frac{\partial l_i}{\partial f_2} \cdot \frac{\partial f_2}{\partial \beta_2} = -2\omega_3(f_3 - y_i) \sin(f_2)}$$

**For $\omega_2$:**

$$\frac{\partial f_2}{\partial \omega_2} = h_2$$

$$\boxed{\frac{\partial l_i}{\partial \omega_2} = \frac{\partial l_i}{\partial f_2} \cdot \frac{\partial f_2}{\partial \omega_2} = -2\omega_3(f_3 - y_i) h_2 \sin(f_2)}$$

---

#### Moving to Layer 1: Compute $\frac{\partial l_i}{\partial h_2}$

$$\frac{\partial l_i}{\partial h_2} = \frac{\partial l_i}{\partial f_2} \cdot \frac{\partial f_2}{\partial h_2} = -2\omega_3(f_3 - y_i) \sin(f_2) \cdot \omega_2$$

**Store this value!**

---

#### Compute $\frac{\partial l_i}{\partial f_1}$

Since $h_2 = \exp(f_1)$:
$$\frac{\partial h_2}{\partial f_1} = \exp(f_1) = h_2$$

$$\frac{\partial l_i}{\partial f_1} = \frac{\partial l_i}{\partial h_2} \cdot \frac{\partial h_2}{\partial f_1} = -2\omega_3\omega_2(f_3 - y_i) h_2 \sin(f_2)$$

**Store this value!**

---

#### Layer 1: Computing $\frac{\partial l_i}{\partial \beta_1}$ and $\frac{\partial l_i}{\partial \omega_1}$

**For $\beta_1$:**

Since $f_1 = \beta_1 + \omega_1 \cdot h_1$:
$$\frac{\partial f_1}{\partial \beta_1} = 1$$

$$\boxed{\frac{\partial l_i}{\partial \beta_1} = \frac{\partial l_i}{\partial f_1} \cdot \frac{\partial f_1}{\partial \beta_1} = -2\omega_3\omega_2(f_3 - y_i) h_2 \sin(f_2)}$$

**For $\omega_1$:**

$$\frac{\partial f_1}{\partial \omega_1} = h_1$$

$$\boxed{\frac{\partial l_i}{\partial \omega_1} = \frac{\partial l_i}{\partial f_1} \cdot \frac{\partial f_1}{\partial \omega_1} = -2\omega_3\omega_2(f_3 - y_i) h_1 h_2 \sin(f_2)}$$

---

#### Moving to Layer 0: Compute $\frac{\partial l_i}{\partial h_1}$

$$\frac{\partial l_i}{\partial h_1} = \frac{\partial l_i}{\partial f_1} \cdot \frac{\partial f_1}{\partial h_1} = -2\omega_3\omega_2\omega_1(f_3 - y_i) h_2 \sin(f_2)$$

**Store this value!**

---

#### Compute $\frac{\partial l_i}{\partial f_0}$

Since $h_1 = \sin(f_0)$:
$$\frac{\partial h_1}{\partial f_0} = \cos(f_0)$$

$$\frac{\partial l_i}{\partial f_0} = \frac{\partial l_i}{\partial h_1} \cdot \frac{\partial h_1}{\partial f_0} = -2\omega_3\omega_2\omega_1(f_3 - y_i) h_2 \sin(f_2) \cos(f_0)$$

**Store this value!**

---

#### Layer 0: Computing $\frac{\partial l_i}{\partial \beta_0}$ and $\frac{\partial l_i}{\partial \omega_0}$

**For $\beta_0$:**

Since $f_0 = \beta_0 + \omega_0 \cdot x_i$:
$$\frac{\partial f_0}{\partial \beta_0} = 1$$

$$\boxed{\frac{\partial l_i}{\partial \beta_0} = \frac{\partial l_i}{\partial f_0} \cdot \frac{\partial f_0}{\partial \beta_0} = -2\omega_3\omega_2\omega_1(f_3 - y_i) h_2 \sin(f_2) \cos(f_0)}$$

**For $\omega_0$:**

$$\frac{\partial f_0}{\partial \omega_0} = x_i$$

$$\boxed{\frac{\partial l_i}{\partial \omega_0} = \frac{\partial l_i}{\partial f_0} \cdot \frac{\partial f_0}{\partial \omega_0} = -2\omega_3\omega_2\omega_1 x_i(f_3 - y_i) h_2 \sin(f_2) \cos(f_0)}$$

---


### Key Observations

1. **Efficiency**: Each intermediate derivative (like $\frac{\partial l_i}{\partial f_3}$, $\frac{\partial l_i}{\partial f_2}$, etc.) is computed **only once** and reused for multiple parameter gradients.

2. **Sequential Computation**: We move backwards through the network, computing gradients layer by layer.

3. **Reuse**: Notice how $2(f_3 - y_i)$ appears in all gradients, $\sin(f_2)$ appears in all gradients from layer 2 backwards, etc.

4. **Storage**: During the forward pass, we store $f_0, h_1, f_1, h_2, f_2, h_3, f_3$. During the backward pass, we compute and store intermediate gradients.

This is the essence of **backpropagation**: compute once, reuse everywhere!
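
The whole derivation can be checked numerically. Below is a minimal sketch, assuming only Python's `math` module: `backward` follows the steps above one by one, and `finite_difference_grads` is a hypothetical helper that approximates the same derivatives by central differences. All parameter values are made up for illustration.

```python
import math

def forward(x, phi):
    """Forward pass: compute and return every intermediate value."""
    b0, w0, b1, w1, b2, w2, b3, w3 = phi
    f0 = b0 + w0 * x
    h1 = math.sin(f0)
    f1 = b1 + w1 * h1
    h2 = math.exp(f1)
    f2 = b2 + w2 * h2
    h3 = math.cos(f2)
    f3 = b3 + w3 * h3
    return f0, h1, f1, h2, f2, h3, f3

def backward(x, y, phi):
    """Backward pass: each stored derivative is computed once and reused."""
    b0, w0, b1, w1, b2, w2, b3, w3 = phi
    f0, h1, f1, h2, f2, h3, f3 = forward(x, phi)

    dl_df3 = 2.0 * (f3 - y)              # Step 0
    dl_db3, dl_dw3 = dl_df3, dl_df3 * h3

    dl_dh3 = dl_df3 * w3
    dl_df2 = dl_dh3 * (-math.sin(f2))    # h3 = cos(f2)
    dl_db2, dl_dw2 = dl_df2, dl_df2 * h2

    dl_dh2 = dl_df2 * w2
    dl_df1 = dl_dh2 * h2                 # h2 = exp(f1), so dh2/df1 = h2
    dl_db1, dl_dw1 = dl_df1, dl_df1 * h1

    dl_dh1 = dl_df1 * w1
    dl_df0 = dl_dh1 * math.cos(f0)       # h1 = sin(f0)
    dl_db0, dl_dw0 = dl_df0, dl_df0 * x

    # Same ordering as phi: (b0, w0, b1, w1, b2, w2, b3, w3)
    return [dl_db0, dl_dw0, dl_db1, dl_dw1, dl_db2, dl_dw2, dl_db3, dl_dw3]

def loss(x, y, phi):
    return (forward(x, phi)[-1] - y) ** 2

def finite_difference_grads(x, y, phi, eps=1e-6):
    """Central differences, used only to sanity-check the backward pass."""
    grads = []
    for k in range(len(phi)):
        up, down = list(phi), list(phi)
        up[k] += eps
        down[k] -= eps
        grads.append((loss(x, y, up) - loss(x, y, down)) / (2 * eps))
    return grads

phi = [0.1, 0.5, -0.2, 0.3, 0.4, -0.6, 0.2, 0.7]   # hypothetical values
x_i, y_i = 1.5, 0.3                                 # hypothetical data point
analytic = backward(x_i, y_i, phi)
numeric = finite_difference_grads(x_i, y_i, phi)
max_gap = max(abs(a - n) for a, n in zip(analytic, numeric))
print(max_gap)
```

The finite-difference version needs two forward passes per parameter, whereas `backward` obtains all eight derivatives from one forward pass and one backward sweep.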


<div align="center">

<img  src="../images/chap6/bckwadpass.png" alt="2-Layer Net" width="700" />

</div>

## Backpropagation Algorithm 


<div align="center">

<img  src="../images/chap6/forwardpass.png" alt="2-Layer Net" width="700" />

</div>

**Strong Recommendation: Try to solve this on your own before reading on**

### Network Architecture

- **Input:** $\mathbf{x} \in \mathbb{R}^{3}$ (3-dimensional input vector)
- **Layer 1:** 4 hidden units → $\mathbf{h}_1 \in \mathbb{R}^{4}$
- **Layer 2:** 2 hidden units → $\mathbf{h}_2 \in \mathbb{R}^{2}$
- **Layer 3:** 3 hidden units → $\mathbf{h}_3 \in \mathbb{R}^{3}$
- **Output Layer:** $\mathbf{f} \in \mathbb{R}^{2}$ (2-dimensional output)
- **Loss:** $L = \sum_i l_i$ where $l_i = \|\mathbf{f}(\mathbf{x}_i, \phi) - \mathbf{y}_i\|^2$

---


### Forward Pass Equations

**Layer 0 → Layer 1:**
$$\mathbf{f}_0 = \boldsymbol{\beta}_0 + \boldsymbol{\Omega}_0 \mathbf{x}$$
$$\mathbf{h}_1 = a[\mathbf{f}_0]$$

where:
- $\boldsymbol{\beta}_0 \in \mathbb{R}^{4}$ (bias vector)
- $\boldsymbol{\Omega}_0 \in \mathbb{R}^{4 \times 3}$ (weight matrix)
- $\mathbf{f}_0 \in \mathbb{R}^{4}$ (pre-activation)
- $\mathbf{h}_1 \in \mathbb{R}^{4}$ (activation)

**Layer 1 → Layer 2:**
$$\mathbf{f}_1 = \boldsymbol{\beta}_1 + \boldsymbol{\Omega}_1 \mathbf{h}_1$$
$$\mathbf{h}_2 = a[\mathbf{f}_1]$$

where:
- $\boldsymbol{\beta}_1 \in \mathbb{R}^{2}$ 
- $\boldsymbol{\Omega}_1 \in \mathbb{R}^{2 \times 4}$
- $\mathbf{f}_1 \in \mathbb{R}^{2}$
- $\mathbf{h}_2 \in \mathbb{R}^{2}$

**Layer 2 → Layer 3:**
$$\mathbf{f}_2 = \boldsymbol{\beta}_2 + \boldsymbol{\Omega}_2 \mathbf{h}_2$$
$$\mathbf{h}_3 = a[\mathbf{f}_2]$$

where:
- $\boldsymbol{\beta}_2 \in \mathbb{R}^{3}$
- $\boldsymbol{\Omega}_2 \in \mathbb{R}^{3 \times 2}$
- $\mathbf{f}_2 \in \mathbb{R}^{3}$
- $\mathbf{h}_3 \in \mathbb{R}^{3}$

**Layer 3 → Output:**
$$\mathbf{f}_3 = \boldsymbol{\beta}_3 + \boldsymbol{\Omega}_3 \mathbf{h}_3$$

where:
- $\boldsymbol{\beta}_3 \in \mathbb{R}^{2}$
- $\boldsymbol{\Omega}_3 \in \mathbb{R}^{2 \times 3}$
- $\mathbf{f}_3 \in \mathbb{R}^{2}$ (final output)

**Loss:**
$$l_i = \|\mathbf{f}_3 - \mathbf{y}_i\|^2 = (\mathbf{f}_3 - \mathbf{y}_i)^T(\mathbf{f}_3 - \mathbf{y}_i)$$
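
These equations can be sketched in NumPy to confirm the shapes. The snippet assumes $\tanh$ for the activation $a$ (the text leaves $a$ unspecified) and random placeholder parameters; storing the input as `hs[0]` is a convenience of this sketch, not part of the notation above.

```python
import numpy as np

rng = np.random.default_rng(0)
sizes = [3, 4, 2, 3, 2]  # input, h1, h2, h3, output

# Hypothetical random parameters: Omega_k has shape (sizes[k + 1], sizes[k]).
Omegas = [rng.normal(size=(sizes[k + 1], sizes[k])) for k in range(4)]
betas = [rng.normal(size=sizes[k + 1]) for k in range(4)]

def forward(x, betas, Omegas, a=np.tanh):
    """Return pre-activations [f_0..f_3] and activations [x, h_1, h_2, h_3]."""
    fs, hs = [], [x]
    last = len(Omegas) - 1
    for k in range(last + 1):
        f = betas[k] + Omegas[k] @ hs[k]
        fs.append(f)
        if k < last:              # no activation after the output layer
            hs.append(a(f))
    return fs, hs

x = rng.normal(size=3)   # hypothetical input
y = rng.normal(size=2)   # hypothetical target
fs, hs = forward(x, betas, Omegas)
loss = float((fs[-1] - y) @ (fs[-1] - y))
print([f.shape for f in fs], loss)
```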

---
---

### Backward Pass: Computing Gradients

#### Step 0: Gradient of Loss w.r.t. Output

$$\frac{\partial l_i}{\partial \mathbf{f}_3} = 2(\mathbf{f}_3 - \mathbf{y}_i)$$

**Shape:** $\mathbb{R}^{2}$

**Store this!**

---

#### Layer 3: Gradients w.r.t. $\boldsymbol{\beta}_3$ and $\boldsymbol{\Omega}_3$

**For $\boldsymbol{\beta}_3$:**

Since $\mathbf{f}_3 = \boldsymbol{\beta}_3 + \boldsymbol{\Omega}_3 \mathbf{h}_3$:

$$\frac{\partial \mathbf{f}_3}{\partial \boldsymbol{\beta}_3} = \mathbf{I}_{2 \times 2}$$

$$\boxed{\frac{\partial l_i}{\partial \boldsymbol{\beta}_3} = \frac{\partial l_i}{\partial \mathbf{f}_3} = 2(\mathbf{f}_3 - \mathbf{y}_i)}$$

**Shape:** $\mathbb{R}^{2}$

**For $\boldsymbol{\Omega}_3$:**

$$\frac{\partial \mathbf{f}_3}{\partial \boldsymbol{\Omega}_3} = \mathbf{h}_3^T$$

$$\boxed{\frac{\partial l_i}{\partial \boldsymbol{\Omega}_3} = \frac{\partial l_i}{\partial \mathbf{f}_3} \mathbf{h}_3^T = 2(\mathbf{f}_3 - \mathbf{y}_i) \mathbf{h}_3^T}$$

**Shape:** $\mathbb{R}^{2 \times 3}$ (outer product)
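
A small numeric illustration of this outer product, assuming NumPy and made-up values: entry $(r, c)$ of the result is the upstream gradient for output $r$ times activation $c$, which is exactly how $\boldsymbol{\Omega}_3[r, c]$ enters $\mathbf{f}_3$.

```python
import numpy as np

# Hypothetical values, for illustration only.
f3 = np.array([1.0, -0.5])        # network output
y = np.array([0.0, 0.5])          # target
h3 = np.array([0.2, -1.0, 3.0])   # activations of the last hidden layer

dl_df3 = 2.0 * (f3 - y)             # shape (2,)
dl_dOmega3 = np.outer(dl_df3, h3)   # shape (2, 3), same shape as Omega_3
print(dl_dOmega3)
```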

---

#### Propagate to Layer 3: $\frac{\partial l_i}{\partial \mathbf{h}_3}$

$$\frac{\partial l_i}{\partial \mathbf{h}_3} = \boldsymbol{\Omega}_3^T \frac{\partial l_i}{\partial \mathbf{f}_3} = 2\boldsymbol{\Omega}_3^T(\mathbf{f}_3 - \mathbf{y}_i)$$

**Shape:** $\mathbb{R}^{3}$

**Store this!**

---

#### Compute $\frac{\partial l_i}{\partial \mathbf{f}_2}$

Since $\mathbf{h}_3 = a[\mathbf{f}_2]$:

$$\frac{\partial \mathbf{h}_3}{\partial \mathbf{f}_2} = \text{diag}(a'[\mathbf{f}_2])$$

$$\frac{\partial l_i}{\partial \mathbf{f}_2} = \text{diag}(a'[\mathbf{f}_2]) \cdot \frac{\partial l_i}{\partial \mathbf{h}_3}$$

Or element-wise:
$$\frac{\partial l_i}{\partial \mathbf{f}_2} = a'[\mathbf{f}_2] \odot \frac{\partial l_i}{\partial \mathbf{h}_3}$$

where $\odot$ denotes element-wise multiplication.

**Shape:** $\mathbb{R}^{3}$

**Store this!**
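
The equivalence of the diagonal-matrix form and the element-wise form can be checked directly. This is a minimal sketch assuming NumPy and $\tanh$ as the activation (so $a'[z] = 1 - \tanh^2 z$); the vectors are arbitrary.

```python
import numpy as np

def a_prime(f):
    """Derivative of tanh, the activation assumed in this sketch."""
    return 1.0 - np.tanh(f) ** 2

# Hypothetical values, for illustration only.
f2 = np.array([0.3, -1.2, 2.0])
dl_dh3 = np.array([0.5, -0.1, 1.5])

via_diag = np.diag(a_prime(f2)) @ dl_dh3   # builds a 3x3 matrix first
via_hadamard = a_prime(f2) * dl_dh3        # element-wise, no matrix needed
print(np.allclose(via_diag, via_hadamard))
```

The element-wise form is the one normally used in practice, since it avoids building a square matrix that is zero off the diagonal.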

---


#### Layer 2: Gradients w.r.t. $\boldsymbol{\beta}_2$ and $\boldsymbol{\Omega}_2$

**For $\boldsymbol{\beta}_2$:**

$$\boxed{\frac{\partial l_i}{\partial \boldsymbol{\beta}_2} = \frac{\partial l_i}{\partial \mathbf{f}_2}}$$

**Shape:** $\mathbb{R}^{3}$

**For $\boldsymbol{\Omega}_2$:**

$$\boxed{\frac{\partial l_i}{\partial \boldsymbol{\Omega}_2} = \frac{\partial l_i}{\partial \mathbf{f}_2} \mathbf{h}_2^T}$$

**Shape:** $\mathbb{R}^{3 \times 2}$

---

#### Propagate to Layer 2: $\frac{\partial l_i}{\partial \mathbf{h}_2}$

$$\frac{\partial l_i}{\partial \mathbf{h}_2} = \boldsymbol{\Omega}_2^T \frac{\partial l_i}{\partial \mathbf{f}_2}$$

**Shape:** $\mathbb{R}^{2}$

**Store this!**

---

#### Compute $\frac{\partial l_i}{\partial \mathbf{f}_1}$

$$\frac{\partial l_i}{\partial \mathbf{f}_1} = a'[\mathbf{f}_1] \odot \frac{\partial l_i}{\partial \mathbf{h}_2}$$

**Shape:** $\mathbb{R}^{2}$

**Store this!**

---

#### Layer 1: Gradients w.r.t. $\boldsymbol{\beta}_1$ and $\boldsymbol{\Omega}_1$

**For $\boldsymbol{\beta}_1$:**

$$\boxed{\frac{\partial l_i}{\partial \boldsymbol{\beta}_1} = \frac{\partial l_i}{\partial \mathbf{f}_1}}$$

**Shape:** $\mathbb{R}^{2}$

**For $\boldsymbol{\Omega}_1$:**

$$\boxed{\frac{\partial l_i}{\partial \boldsymbol{\Omega}_1} = \frac{\partial l_i}{\partial \mathbf{f}_1} \mathbf{h}_1^T}$$

**Shape:** $\mathbb{R}^{2 \times 4}$

---

#### Propagate to Layer 1: $\frac{\partial l_i}{\partial \mathbf{h}_1}$

$$\frac{\partial l_i}{\partial \mathbf{h}_1} = \boldsymbol{\Omega}_1^T \frac{\partial l_i}{\partial \mathbf{f}_1}$$

**Shape:** $\mathbb{R}^{4}$

**Store this!**

---

#### Compute $\frac{\partial l_i}{\partial \mathbf{f}_0}$

$$\frac{\partial l_i}{\partial \mathbf{f}_0} = a'[\mathbf{f}_0] \odot \frac{\partial l_i}{\partial \mathbf{h}_1}$$

**Shape:** $\mathbb{R}^{4}$

**Store this!**

---

#### Layer 0: Gradients w.r.t. $\boldsymbol{\beta}_0$ and $\boldsymbol{\Omega}_0$

**For $\boldsymbol{\beta}_0$:**

$$\boxed{\frac{\partial l_i}{\partial \boldsymbol{\beta}_0} = \frac{\partial l_i}{\partial \mathbf{f}_0}}$$

**Shape:** $\mathbb{R}^{4}$

**For $\boldsymbol{\Omega}_0$:**

$$\boxed{\frac{\partial l_i}{\partial \boldsymbol{\Omega}_0} = \frac{\partial l_i}{\partial \mathbf{f}_0} \mathbf{x}^T}$$

**Shape:** $\mathbb{R}^{4 \times 3}$

---

### Summary: Backpropagation Algorithm

1. **Forward Pass:** Compute and store $\mathbf{f}_0, \mathbf{h}_1, \mathbf{f}_1, \mathbf{h}_2, \mathbf{f}_2, \mathbf{h}_3, \mathbf{f}_3$

2. **Backward Pass:** 
   - Start from $\frac{\partial l_i}{\partial \mathbf{f}_3}$
   - For each layer $k = 3, 2, 1, 0$:
     - Compute $\frac{\partial l_i}{\partial \boldsymbol{\beta}_k}$ and $\frac{\partial l_i}{\partial \boldsymbol{\Omega}_k}$
     - Propagate gradient backwards (for $k > 0$): $\frac{\partial l_i}{\partial \mathbf{h}_k} = \boldsymbol{\Omega}_k^T \frac{\partial l_i}{\partial \mathbf{f}_k}$
     - Apply activation derivative (for $k > 0$): $\frac{\partial l_i}{\partial \mathbf{f}_{k-1}} = a'[\mathbf{f}_{k-1}] \odot \frac{\partial l_i}{\partial \mathbf{h}_k}$

3. **Update Parameters:** Use computed gradients in SGD update rule
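
The three steps above can be sketched as a short loop. This is a minimal illustration, assuming NumPy, $\tanh$ as the activation $a$, and random placeholder parameters; the function names are hypothetical. A finite-difference check on one weight is included as a sanity test.

```python
import numpy as np

def a(f):
    return np.tanh(f)                 # assumed activation

def a_prime(f):
    return 1.0 - np.tanh(f) ** 2

def forward(x, betas, Omegas):
    """Store f_0..f_K and the layer inputs hs = [x, h_1, ..., h_K]."""
    fs, hs = [], [x]
    K = len(Omegas) - 1
    for k in range(K + 1):
        fs.append(betas[k] + Omegas[k] @ hs[k])
        if k < K:
            hs.append(a(fs[k]))
    return fs, hs

def backward(x, y, betas, Omegas):
    fs, hs = forward(x, betas, Omegas)
    K = len(Omegas) - 1
    d_betas, d_Omegas = [None] * (K + 1), [None] * (K + 1)
    dl_df = 2.0 * (fs[K] - y)                      # dl/df_K
    for k in range(K, -1, -1):
        d_betas[k] = dl_df                         # dl/dbeta_k
        d_Omegas[k] = np.outer(dl_df, hs[k])       # dl/dOmega_k
        if k > 0:
            dl_dh = Omegas[k].T @ dl_df            # dl/dh_k
            dl_df = a_prime(fs[k - 1]) * dl_dh     # dl/df_{k-1}
    return d_betas, d_Omegas

def loss(x, y, betas, Omegas):
    fs, _ = forward(x, betas, Omegas)
    return float(np.sum((fs[-1] - y) ** 2))

rng = np.random.default_rng(0)
sizes = [3, 4, 2, 3, 2]
Omegas = [rng.normal(size=(sizes[k + 1], sizes[k])) for k in range(4)]
betas = [rng.normal(size=sizes[k + 1]) for k in range(4)]
x, y = rng.normal(size=3), rng.normal(size=2)

d_betas, d_Omegas = backward(x, y, betas, Omegas)

# Central-difference check on a single entry of Omega_0.
eps = 1e-6
up = [W.copy() for W in Omegas]
dn = [W.copy() for W in Omegas]
up[0][1, 2] += eps
dn[0][1, 2] -= eps
numeric = (loss(x, y, betas, up) - loss(x, y, betas, dn)) / (2 * eps)
print(numeric, d_Omegas[0][1, 2])
```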

---

### Key Pattern: The Backpropagation Recursion

For layer $k$ (with $\mathbf{h}_0 = \mathbf{x}$):

$$\frac{\partial l_i}{\partial \boldsymbol{\beta}_k} = \frac{\partial l_i}{\partial \mathbf{f}_k}$$

$$\frac{\partial l_i}{\partial \boldsymbol{\Omega}_k} = \frac{\partial l_i}{\partial \mathbf{f}_k} \mathbf{h}_k^T$$

$$\frac{\partial l_i}{\partial \mathbf{h}_k} = \boldsymbol{\Omega}_k^T \frac{\partial l_i}{\partial \mathbf{f}_k}$$

$$\frac{\partial l_i}{\partial \mathbf{f}_{k-1}} = a'[\mathbf{f}_{k-1}] \odot \frac{\partial l_i}{\partial \mathbf{h}_k}$$

This pattern repeats for every layer!




## Backpropagation in Branched Networks

<div align="center">

### Branched Network Architecture Example

</div>

<div align="center">

<img  src="../images/chap6/branched_net.png" alt="2-Layer Net" width="700" />

</div>



**Strong Recommendation: Try to solve this on your own before reading on**

This example demonstrates backpropagation in a network that **branches and then rejoins**, which is common in modern architectures like ResNets, Inception networks, and U-Nets.

---

### Network Architecture

- **Input:** $\mathbf{x} \in \mathbb{R}^{3}$ (3-dimensional input vector)
- **Layer 1:** 4 hidden units → $\mathbf{h}_1 \in \mathbb{R}^{4}$
- **Branch A:** 2 hidden units → $\mathbf{h}_A \in \mathbb{R}^{2}$
- **Branch B:** 3 hidden units → $\mathbf{h}_B \in \mathbb{R}^{3}$
- **Join Layer:** 2 hidden units → $\mathbf{h}_J \in \mathbb{R}^{2}$
- **Output Layer:** $\mathbf{f} \in \mathbb{R}^{2}$ (2-dimensional output)
- **Loss:** $l = \|\mathbf{f} - \mathbf{y}\|^2$

**Key Insight:** After Layer 1, the computation splits into two parallel branches (A and B), which are then concatenated and fed into the Join Layer.

---

### Forward Pass Equations

**Layer 1:**
$$\mathbf{f}_1 = \boldsymbol{\beta}_1 + \boldsymbol{\Omega}_1 \mathbf{x}$$
$$\mathbf{h}_1 = a_1[\mathbf{f}_1]$$

where:
- $\boldsymbol{\beta}_1 \in \mathbb{R}^{4}$ (bias vector)
- $\boldsymbol{\Omega}_1 \in \mathbb{R}^{4 \times 3}$ (weight matrix)
- $\mathbf{f}_1 \in \mathbb{R}^{4}$ (pre-activation)
- $\mathbf{h}_1 \in \mathbb{R}^{4}$ (activation)

---

**Branch A:**
$$\mathbf{f}_A = \boldsymbol{\beta}_A + \boldsymbol{\Omega}_A \mathbf{h}_1$$
$$\mathbf{h}_A = a_A[\mathbf{f}_A]$$

where:
- $\boldsymbol{\beta}_A \in \mathbb{R}^{2}$
- $\boldsymbol{\Omega}_A \in \mathbb{R}^{2 \times 4}$
- $\mathbf{f}_A \in \mathbb{R}^{2}$
- $\mathbf{h}_A \in \mathbb{R}^{2}$

---

**Branch B (parallel to Branch A):**
$$\mathbf{f}_B = \boldsymbol{\beta}_B + \boldsymbol{\Omega}_B \mathbf{h}_1$$
$$\mathbf{h}_B = a_B[\mathbf{f}_B]$$

where:
- $\boldsymbol{\beta}_B \in \mathbb{R}^{3}$
- $\boldsymbol{\Omega}_B \in \mathbb{R}^{3 \times 4}$
- $\mathbf{f}_B \in \mathbb{R}^{3}$
- $\mathbf{h}_B \in \mathbb{R}^{3}$

---

**Join Layer (concatenate branches):**
$$\mathbf{f}_J = \boldsymbol{\beta}_J + \boldsymbol{\Omega}_J \begin{bmatrix} \mathbf{h}_A \\ \mathbf{h}_B \end{bmatrix}$$
$$\mathbf{h}_J = a_J[\mathbf{f}_J]$$

where:
- $\begin{bmatrix} \mathbf{h}_A \\ \mathbf{h}_B \end{bmatrix} \in \mathbb{R}^{5}$ (concatenation of Branch A and B outputs)
- $\boldsymbol{\beta}_J \in \mathbb{R}^{2}$
- $\boldsymbol{\Omega}_J \in \mathbb{R}^{2 \times 5}$
- $\mathbf{f}_J \in \mathbb{R}^{2}$
- $\mathbf{h}_J \in \mathbb{R}^{2}$

---

**Output Layer:**
$$\mathbf{f} = \boldsymbol{\beta}_O + \boldsymbol{\Omega}_O \mathbf{h}_J$$

where:
- $\boldsymbol{\beta}_O \in \mathbb{R}^{2}$
- $\boldsymbol{\Omega}_O \in \mathbb{R}^{2 \times 2}$
- $\mathbf{f} \in \mathbb{R}^{2}$ (final output)

**Loss:**
$$l = \|\mathbf{f} - \mathbf{y}\|^2 = (\mathbf{f} - \mathbf{y})^T(\mathbf{f} - \mathbf{y})$$
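
A minimal NumPy sketch of this forward pass, assuming $\tanh$ for every activation ($a_1, a_A, a_B, a_J$ are left unspecified in the text) and random placeholder parameters. The dictionary keys are hypothetical names chosen to mirror the symbols above.

```python
import numpy as np

rng = np.random.default_rng(0)
a = np.tanh  # assumed activation for all layers

# Hypothetical random parameters with the shapes listed above.
shapes = {"1": (4, 3), "A": (2, 4), "B": (3, 4), "J": (2, 5), "O": (2, 2)}
p = {}
for name, (rows, cols) in shapes.items():
    p["Omega_" + name] = rng.normal(size=(rows, cols))
    p["beta_" + name] = rng.normal(size=rows)

def forward(x, p):
    """Return a dict holding every intermediate value of the branched net."""
    c = {}
    c["f1"] = p["beta_1"] + p["Omega_1"] @ x
    c["h1"] = a(c["f1"])
    c["fA"] = p["beta_A"] + p["Omega_A"] @ c["h1"]   # branch A reads h1
    c["hA"] = a(c["fA"])
    c["fB"] = p["beta_B"] + p["Omega_B"] @ c["h1"]   # branch B reads the same h1
    c["hB"] = a(c["fB"])
    c["hAB"] = np.concatenate([c["hA"], c["hB"]])    # join: shape (5,)
    c["fJ"] = p["beta_J"] + p["Omega_J"] @ c["hAB"]
    c["hJ"] = a(c["fJ"])
    c["f"] = p["beta_O"] + p["Omega_O"] @ c["hJ"]
    return c

x = rng.normal(size=3)   # hypothetical input
y = rng.normal(size=2)   # hypothetical target
c = forward(x, p)
loss = float((c["f"] - y) @ (c["f"] - y))
print(c["hAB"].shape, c["f"].shape, loss)
```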

---
---

### Backward Pass: Computing Gradients

#### Step 0: Gradient of Loss w.r.t. Output

$$\frac{\partial l}{\partial \mathbf{f}} = 2(\mathbf{f} - \mathbf{y})$$

**Shape:** $\mathbb{R}^{2}$

**Store this!**

---

#### Output Layer: Gradients w.r.t. $\boldsymbol{\beta}_O$ and $\boldsymbol{\Omega}_O$

**For $\boldsymbol{\beta}_O$:**

$$\boxed{\frac{\partial l}{\partial \boldsymbol{\beta}_O} = \frac{\partial l}{\partial \mathbf{f}} = 2(\mathbf{f} - \mathbf{y})}$$

**Shape:** $\mathbb{R}^{2}$

**For $\boldsymbol{\Omega}_O$:**

$$\boxed{\frac{\partial l}{\partial \boldsymbol{\Omega}_O} = \frac{\partial l}{\partial \mathbf{f}} \mathbf{h}_J^T = 2(\mathbf{f} - \mathbf{y}) \mathbf{h}_J^T}$$

**Shape:** $\mathbb{R}^{2 \times 2}$ (outer product)

---

#### Propagate to Join Layer: $\frac{\partial l}{\partial \mathbf{h}_J}$

$$\frac{\partial l}{\partial \mathbf{h}_J} = \boldsymbol{\Omega}_O^T \frac{\partial l}{\partial \mathbf{f}} = 2\boldsymbol{\Omega}_O^T(\mathbf{f} - \mathbf{y})$$

**Shape:** $\mathbb{R}^{2}$

**Store this!**

---

#### Compute $\frac{\partial l}{\partial \mathbf{f}_J}$

Since $\mathbf{h}_J = a_J[\mathbf{f}_J]$:

$$\frac{\partial l}{\partial \mathbf{f}_J} = a_J'[\mathbf{f}_J] \odot \frac{\partial l}{\partial \mathbf{h}_J}$$

**Shape:** $\mathbb{R}^{2}$

**Store this!**

---

#### Join Layer: Gradients w.r.t. $\boldsymbol{\beta}_J$ and $\boldsymbol{\Omega}_J$

**For $\boldsymbol{\beta}_J$:**

$$\boxed{\frac{\partial l}{\partial \boldsymbol{\beta}_J} = \frac{\partial l}{\partial \mathbf{f}_J}}$$

**Shape:** $\mathbb{R}^{2}$

**For $\boldsymbol{\Omega}_J$:**

$$\boxed{\frac{\partial l}{\partial \boldsymbol{\Omega}_J} = \frac{\partial l}{\partial \mathbf{f}_J} \begin{bmatrix} \mathbf{h}_A \\ \mathbf{h}_B \end{bmatrix}^T}$$

**Shape:** $\mathbb{R}^{2 \times 5}$ (outer product with concatenated vector)

---

#### Propagate to Concatenated Branches: $\frac{\partial l}{\partial \begin{bmatrix} \mathbf{h}_A \\ \mathbf{h}_B \end{bmatrix}}$

$$\frac{\partial l}{\partial \begin{bmatrix} \mathbf{h}_A \\ \mathbf{h}_B \end{bmatrix}} = \boldsymbol{\Omega}_J^T \frac{\partial l}{\partial \mathbf{f}_J}$$

**Shape:** $\mathbb{R}^{5}$

**Critical Step:** Split this gradient into two parts:

$$\frac{\partial l}{\partial \mathbf{h}_A} = \text{first 2 elements of } \boldsymbol{\Omega}_J^T \frac{\partial l}{\partial \mathbf{f}_J}$$

$$\frac{\partial l}{\partial \mathbf{h}_B} = \text{last 3 elements of } \boldsymbol{\Omega}_J^T \frac{\partial l}{\partial \mathbf{f}_J}$$

**Store both!**
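
In code the split is plain slicing. The sketch below assumes NumPy and made-up values; it also shows that slicing the gradient is the same as using the matching column blocks of $\boldsymbol{\Omega}_J$.

```python
import numpy as np

# Hypothetical values, for illustration only.
Omega_J = np.arange(10, dtype=float).reshape(2, 5)   # join-layer weights
dl_dfJ = np.array([1.0, -1.0])                       # upstream gradient

dl_dhAB = Omega_J.T @ dl_dfJ     # gradient w.r.t. the concatenated vector, shape (5,)
dl_dhA = dl_dhAB[:2]             # first 2 elements belong to Branch A
dl_dhB = dl_dhAB[2:]             # last 3 elements belong to Branch B

# Equivalent view: each branch only sees its own columns of Omega_J.
same_A = np.allclose(dl_dhA, Omega_J[:, :2].T @ dl_dfJ)
same_B = np.allclose(dl_dhB, Omega_J[:, 2:].T @ dl_dfJ)
print(dl_dhA.shape, dl_dhB.shape, same_A, same_B)
```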

---

#### Branch A: Compute $\frac{\partial l}{\partial \mathbf{f}_A}$

$$\frac{\partial l}{\partial \mathbf{f}_A} = a_A'[\mathbf{f}_A] \odot \frac{\partial l}{\partial \mathbf{h}_A}$$

**Shape:** $\mathbb{R}^{2}$

**Store this!**

---

#### Branch A: Gradients w.r.t. $\boldsymbol{\beta}_A$ and $\boldsymbol{\Omega}_A$

**For $\boldsymbol{\beta}_A$:**

$$\boxed{\frac{\partial l}{\partial \boldsymbol{\beta}_A} = \frac{\partial l}{\partial \mathbf{f}_A}}$$

**Shape:** $\mathbb{R}^{2}$

**For $\boldsymbol{\Omega}_A$:**

$$\boxed{\frac{\partial l}{\partial \boldsymbol{\Omega}_A} = \frac{\partial l}{\partial \mathbf{f}_A} \mathbf{h}_1^T}$$

**Shape:** $\mathbb{R}^{2 \times 4}$

---

#### Branch A: Propagate to $\mathbf{h}_1$

$$\frac{\partial l}{\partial \mathbf{h}_1}^{(A)} = \boldsymbol{\Omega}_A^T \frac{\partial l}{\partial \mathbf{f}_A}$$

**Shape:** $\mathbb{R}^{4}$

**Store this - but DON'T use it yet!**

---

#### Branch B: Compute $\frac{\partial l}{\partial \mathbf{f}_B}$

$$\frac{\partial l}{\partial \mathbf{f}_B} = a_B'[\mathbf{f}_B] \odot \frac{\partial l}{\partial \mathbf{h}_B}$$

**Shape:** $\mathbb{R}^{3}$

**Store this!**

---

#### Branch B: Gradients w.r.t. $\boldsymbol{\beta}_B$ and $\boldsymbol{\Omega}_B$

**For $\boldsymbol{\beta}_B$:**

$$\boxed{\frac{\partial l}{\partial \boldsymbol{\beta}_B} = \frac{\partial l}{\partial \mathbf{f}_B}}$$

**Shape:** $\mathbb{R}^{3}$

**For $\boldsymbol{\Omega}_B$:**

$$\boxed{\frac{\partial l}{\partial \boldsymbol{\Omega}_B} = \frac{\partial l}{\partial \mathbf{f}_B} \mathbf{h}_1^T}$$

**Shape:** $\mathbb{R}^{3 \times 4}$

---

#### Branch B: Propagate to $\mathbf{h}_1$

$$\frac{\partial l}{\partial \mathbf{h}_1}^{(B)} = \boldsymbol{\Omega}_B^T \frac{\partial l}{\partial \mathbf{f}_B}$$

**Shape:** $\mathbb{R}^{4}$

**Store this!**

---

#### **Critical Step: Merge Gradients at $\mathbf{h}_1$**

Since $\mathbf{h}_1$ feeds into **both** Branch A and Branch B, we must **sum** the gradients from both paths:

$$\boxed{\frac{\partial l}{\partial \mathbf{h}_1} = \frac{\partial l}{\partial \mathbf{h}_1}^{(A)} + \frac{\partial l}{\partial \mathbf{h}_1}^{(B)}}$$

**Shape:** $\mathbb{R}^{4}$

**This is the key difference from sequential networks!**

---

#### Compute $\frac{\partial l}{\partial \mathbf{f}_1}$

$$\frac{\partial l}{\partial \mathbf{f}_1} = a_1'[\mathbf{f}_1] \odot \frac{\partial l}{\partial \mathbf{h}_1}$$

**Shape:** $\mathbb{R}^{4}$

**Store this!**

---

#### Layer 1: Gradients w.r.t. $\boldsymbol{\beta}_1$ and $\boldsymbol{\Omega}_1$

**For $\boldsymbol{\beta}_1$:**

$$\boxed{\frac{\partial l}{\partial \boldsymbol{\beta}_1} = \frac{\partial l}{\partial \mathbf{f}_1}}$$

**Shape:** $\mathbb{R}^{4}$

**For $\boldsymbol{\Omega}_1$:**

$$\boxed{\frac{\partial l}{\partial \boldsymbol{\Omega}_1} = \frac{\partial l}{\partial \mathbf{f}_1} \mathbf{x}^T}$$

**Shape:** $\mathbb{R}^{4 \times 3}$

---

### Summary: Backpropagation in Branched Networks

1. **Forward Pass:** Compute and store $\mathbf{f}_1, \mathbf{h}_1, \mathbf{f}_A, \mathbf{h}_A, \mathbf{f}_B, \mathbf{h}_B, \mathbf{f}_J, \mathbf{h}_J, \mathbf{f}$

2. **Backward Pass:** 
   - Start from $\frac{\partial l}{\partial \mathbf{f}}$
   - Propagate back through Output Layer and Join Layer
   - **Split gradient** at the concatenation point into $\frac{\partial l}{\partial \mathbf{h}_A}$ and $\frac{\partial l}{\partial \mathbf{h}_B}$
   - Compute gradients for Branch A and Branch B **independently**
   - **Sum the gradients** that flow back to $\mathbf{h}_1$: $\frac{\partial l}{\partial \mathbf{h}_1} = \frac{\partial l}{\partial \mathbf{h}_1}^{(A)} + \frac{\partial l}{\partial \mathbf{h}_1}^{(B)}$
   - Continue backpropagation through Layer 1

3. **Key Rule for Branching:** When a layer's output feeds into multiple paths, **sum all gradients** flowing back from those paths.
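
The steps above can be sketched end to end. This is a minimal illustration, assuming NumPy, $\tanh$ for every activation, and random placeholder parameters; the dictionary keys and function names are hypothetical. The finite-difference check at the end targets a Layer 1 weight, whose gradient depends on the summed contribution from both branches.

```python
import numpy as np

def a(f):
    return np.tanh(f)                 # assumed activation

def a_prime(f):
    return 1.0 - np.tanh(f) ** 2

def forward(x, p):
    c = {}
    c["f1"] = p["beta_1"] + p["Omega_1"] @ x
    c["h1"] = a(c["f1"])
    c["fA"] = p["beta_A"] + p["Omega_A"] @ c["h1"]
    c["hA"] = a(c["fA"])
    c["fB"] = p["beta_B"] + p["Omega_B"] @ c["h1"]
    c["hB"] = a(c["fB"])
    c["hAB"] = np.concatenate([c["hA"], c["hB"]])
    c["fJ"] = p["beta_J"] + p["Omega_J"] @ c["hAB"]
    c["hJ"] = a(c["fJ"])
    c["f"] = p["beta_O"] + p["Omega_O"] @ c["hJ"]
    return c

def backward(x, y, p):
    c, g = forward(x, p), {}
    dl_df = 2.0 * (c["f"] - y)
    g["beta_O"], g["Omega_O"] = dl_df, np.outer(dl_df, c["hJ"])
    dl_dfJ = a_prime(c["fJ"]) * (p["Omega_O"].T @ dl_df)
    g["beta_J"], g["Omega_J"] = dl_dfJ, np.outer(dl_dfJ, c["hAB"])
    dl_dhAB = p["Omega_J"].T @ dl_dfJ
    dl_dfA = a_prime(c["fA"]) * dl_dhAB[:2]          # split: Branch A part
    dl_dfB = a_prime(c["fB"]) * dl_dhAB[2:]          # split: Branch B part
    g["beta_A"], g["Omega_A"] = dl_dfA, np.outer(dl_dfA, c["h1"])
    g["beta_B"], g["Omega_B"] = dl_dfB, np.outer(dl_dfB, c["h1"])
    # Merge: h1 feeds both branches, so the two contributions are summed.
    dl_dh1 = p["Omega_A"].T @ dl_dfA + p["Omega_B"].T @ dl_dfB
    dl_df1 = a_prime(c["f1"]) * dl_dh1
    g["beta_1"], g["Omega_1"] = dl_df1, np.outer(dl_df1, x)
    return g

def loss(x, y, p):
    f = forward(x, p)["f"]
    return float(np.sum((f - y) ** 2))

rng = np.random.default_rng(0)
shapes = {"1": (4, 3), "A": (2, 4), "B": (3, 4), "J": (2, 5), "O": (2, 2)}
p = {}
for name, (rows, cols) in shapes.items():
    p["Omega_" + name] = rng.normal(size=(rows, cols))
    p["beta_" + name] = rng.normal(size=rows)
x, y = rng.normal(size=3), rng.normal(size=2)

g = backward(x, y, p)

# Central-difference check on one entry of Omega_1.
eps = 1e-6
p_up = {k: v.copy() for k, v in p.items()}
p_dn = {k: v.copy() for k, v in p.items()}
p_up["Omega_1"][0, 0] += eps
p_dn["Omega_1"][0, 0] -= eps
numeric = (loss(x, y, p_up) - loss(x, y, p_dn)) / (2 * eps)
print(numeric, g["Omega_1"][0, 0])
```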

---

### Why Sum the Gradients?

By the **multivariate chain rule**, when a variable $\mathbf{h}_1$ affects the loss through multiple paths:

$$\frac{\partial l}{\partial \mathbf{h}_1} = \frac{\partial l}{\partial \mathbf{h}_1}^{(\text{via Branch A})} + \frac{\partial l}{\partial \mathbf{h}_1}^{(\text{via Branch B})}$$

This is because:
- Loss depends on $\mathbf{h}_A$, which depends on $\mathbf{h}_1$
- Loss also depends on $\mathbf{h}_B$, which depends on $\mathbf{h}_1$
- Total change in loss = sum of changes through all paths

This principle generalizes to any acyclic computational graph!
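
Written out for this network, where $\mathbf{h}_1$ enters both $\mathbf{f}_A$ and $\mathbf{f}_B$ linearly, the two terms of the sum are:

$$\frac{\partial l}{\partial \mathbf{h}_1} = \left(\frac{\partial \mathbf{f}_A}{\partial \mathbf{h}_1}\right)^T \frac{\partial l}{\partial \mathbf{f}_A} + \left(\frac{\partial \mathbf{f}_B}{\partial \mathbf{h}_1}\right)^T \frac{\partial l}{\partial \mathbf{f}_B} = \boldsymbol{\Omega}_A^T \frac{\partial l}{\partial \mathbf{f}_A} + \boldsymbol{\Omega}_B^T \frac{\partial l}{\partial \mathbf{f}_B}$$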