### CS231n
#### Optimization 2
## Intuitive Summary of Partial Derivatives


### 1. Definition of a Partial Derivative

A partial derivative measures how a function changes as **one variable** changes while all others are held constant. For a function \(f(x, y)\), the partial derivative with respect to \(x\) is:

$$
\frac{\partial f(x, y)}{\partial x}
= \lim_{h \to 0} \frac{f(x + h,\, y) - f(x, y)}{h}
$$

- **Intuition**  
  - If you increase \(x\) by a tiny amount \(h\), the partial derivative tells you **how much** and **in which direction** \(f\) will change.  
  - **Sign**: positive means \(f\) increases, negative means \(f\) decreases.  
  - **Magnitude**: the sensitivity of \(f\) to \(x\).
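
The limit definition can be approximated numerically by choosing a small but finite \(h\). The sketch below is only an illustration: the helper names `numerical_partial` and `product` are hypothetical, and it uses a centered difference (bumping the variable both up and down) rather than the one-sided difference in the definition, because the centered form is usually more accurate.

```python
def numerical_partial(f, args, i, h=1e-5):
    """Centered-difference estimate of the partial derivative of f w.r.t. args[i]."""
    up = list(args)
    down = list(args)
    up[i] += h      # bump only the i-th variable; all others stay fixed
    down[i] -= h
    return (f(*up) - f(*down)) / (2 * h)


def product(x, y):
    return x * y


# At (x, y) = (4, -3) the exact partials are df/dx = y = -3 and df/dy = x = 4.
dfdx = numerical_partial(product, (4.0, -3.0), 0)
dfdy = numerical_partial(product, (4.0, -3.0), 1)
print(dfdx, dfdy)
```

The sign of each estimate says whether \(f\) rises or falls when that input is nudged upward, and the magnitude says how strongly.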

---

### 2. Addition Function

Consider

$$
f(x,y) = x + y.
$$

Then

$$
\frac{\partial f}{\partial x} = 1, 
\quad
\frac{\partial f}{\partial y} = 1.
$$

- **Intuition**  
  - Bumping either \(x\) or \(y\) by \(h\) raises \(f\) by exactly \(h\).

---

### 3. Max Function (Subgradient)

Consider

$$
f(x,y) = \max(x,y).
$$

Then

$$
\frac{\partial f}{\partial x}
=
\begin{cases}
1, & x \ge y,\\
0, & x < y,
\end{cases}
\quad
\frac{\partial f}{\partial y}
=
\begin{cases}
1, & y > x,\\
0, & y \le x.
\end{cases}
$$

- **Intuition**  
  - Only the larger input “wins” and gets gradient 1; the other gets 0.
  - At a tie (\(x = y\)) the max is not differentiable. The cases above pick one subgradient by routing the whole gradient to \(x\).
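
As a rough sketch of this routing behaviour, a max gate can forward the upstream gradient to whichever input was larger. The function name `max_forward_backward` and the `upstream` parameter are hypothetical, and sending ties to `x` is an assumption of this example (one convention among several valid subgradients).

```python
def max_forward_backward(x, y, upstream=1.0):
    """Return max(x, y) and the gradients routed to x and y."""
    out = max(x, y)
    # The larger input receives the full upstream gradient; the other gets 0.
    # Assumption: on a tie (x == y) the gradient goes to x.
    dx = upstream if x >= y else 0.0
    dy = upstream if y > x else 0.0
    return out, dx, dy


print(max_forward_backward(4.0, 2.0))        # (4.0, 1.0, 0.0)
print(max_forward_backward(2.0, 4.0, 3.0))   # (4.0, 0.0, 3.0)
```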



## Forward & Backward Pass: End-to-End Flow

### 1. Forward Pass
1. **Inputs**  
   \(x, y, z\)  (e.g., \(x=-2,\;y=5,\;z=-4\))
2. **Step 1: Addition**  
   $$
   q = x + y
   \quad\Longrightarrow\quad
   q = -2 + 5 = 3
   $$
3. **Step 2: Multiplication**  
   $$
   f = q \times z
   \quad\Longrightarrow\quad
   f = 3 \times (-4) = -12
   $$
4. **Output**  
   \(f = -12\)

---

### 2. Backward Pass
1. **Start at the output**  
   $$
   \frac{\partial f}{\partial f} = 1
   $$
2. **Backprop through** \(f = q \times z\)  
   $$
   \frac{\partial f}{\partial q} = z = -4,
   \quad
   \frac{\partial f}{\partial z} = q = 3
   $$
3. **Backprop through** \(q = x + y\)  
   $$
   \frac{\partial q}{\partial x} = 1,
   \quad
   \frac{\partial q}{\partial y} = 1
   $$
4. **Apply the chain rule**  
   $$
   \frac{\partial f}{\partial x}
   = \frac{\partial f}{\partial q}\,\frac{\partial q}{\partial x}
   = (-4)\times1 = -4,
   $$
   $$
   \frac{\partial f}{\partial y}
   = \frac{\partial f}{\partial q}\,\frac{\partial q}{\partial y}
   = (-4)\times1 = -4,
   $$
   $$
   \frac{\partial f}{\partial z}
   = 3
   $$
5. **Resulting gradients**  
   $$
   \frac{\partial f}{\partial x} = -4,
   \quad
   \frac{\partial f}{\partial y} = -4,
   \quad
   \frac{\partial f}{\partial z} = 3.
   $$
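
The same forward and backward pass can be written out as a few lines of Python. This is a minimal sketch using the example values above; the variable names (`dfdq`, `dfdx`, and so on) are chosen here for illustration.

```python
# Inputs from the worked example
x, y, z = -2.0, 5.0, -4.0

# Forward pass
q = x + y          # q = 3
f = q * z          # f = -12

# Backward pass, in reverse order of the forward pass
dfdq = z           # from f = q * z
dfdz = q           # from f = q * z
dfdx = dfdq * 1.0  # chain rule through q = x + y
dfdy = dfdq * 1.0  # chain rule through q = x + y

print(f, dfdx, dfdy, dfdz)  # -12.0 -4.0 -4.0 3.0
```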

---

### 3. Goal of Backpropagation
- **Objective**: Compute how **sensitive** the loss function \(L\) is to each parameter \(w_i\), i.e.
  $$
  \frac{\partial L}{\partial w_i}.
  $$
- **How**:
  1. Define **local derivatives** at each node in the computation graph.  
  2. From the final loss, **propagate gradients backward** using the **chain rule**.  
  3. Accumulate each parameter’s gradient, summing the contributions from every path that reaches it.

This entire reverse-gradient process is called **backpropagation**; calling `loss.backward()` in a deep learning framework such as PyTorch **automates** it.
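
To make the three steps concrete, here is a toy scalar autodiff sketch. The `Value` class and its attributes are hypothetical names invented for this example; it is not how any particular framework implements `backward()`, only an illustration of storing local derivatives and propagating gradients in reverse.

```python
class Value:
    """Scalar graph node that remembers its inputs and local derivatives."""

    def __init__(self, data, parents=(), local_grads=()):
        self.data = data
        self.grad = 0.0
        self.parents = parents          # nodes this value was computed from
        self.local_grads = local_grads  # d(self)/d(parent), one per parent

    def __add__(self, other):
        return Value(self.data + other.data, (self, other), (1.0, 1.0))

    def __mul__(self, other):
        return Value(self.data * other.data, (self, other),
                     (other.data, self.data))

    def backward(self):
        # Order the graph so that every node comes after its inputs.
        order, seen = [], set()

        def visit(node):
            if id(node) not in seen:
                seen.add(id(node))
                for parent in node.parents:
                    visit(parent)
                order.append(node)

        visit(self)
        self.grad = 1.0  # gradient of the output with respect to itself
        # Walk from the output back to the inputs.
        for node in reversed(order):
            for parent, local in zip(node.parents, node.local_grads):
                parent.grad += local * node.grad  # chain rule, accumulated


x, y, z = Value(-2.0), Value(5.0), Value(-4.0)
f = (x + y) * z
f.backward()
print(x.grad, y.grad, z.grad)  # -4.0 -4.0 3.0
```

The `+=` in the inner loop is the accumulation step: a value used in several places receives one contribution per use.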

---

### 4. Key Takeaway
- **Forward**: Compute values from inputs to loss via each operation.  
- **Backward**: Starting at loss, propagate gradients backward through each operation node by node.  
- **Backpropagation** = chaining local gradients via the chain rule to obtain \(\frac{\partial L}{\partial w_i}\) for every parameter.

## Local Derivatives & Chain Rule

### 1. Local derivative “formulas”
Each operation (gate) knows how to compute its own partial derivatives:

- **Addition**  
  $q = x + y$  
  $ \partial q / \partial x = 1 $,  
  $ \partial q / \partial y = 1 $

- **Multiplication**  
  $f = q \times z$  
  $ \partial f / \partial q = z $,  
  $ \partial f / \partial z = q $

- **ReLU**  
  $g(z) = \max(0, z)$  
  $ g'(z) = 1 $ if $z>0$, else $g'(z) = 0$
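
These local rules translate almost directly into small backward functions. The sketch below is illustrative only: the function names and the `upstream` argument (the gradient arriving from later in the graph) are assumptions of this example.

```python
def add_backward(x, y, upstream):
    """q = x + y: both inputs receive the upstream gradient unchanged."""
    return upstream * 1.0, upstream * 1.0


def mul_backward(q, z, upstream):
    """f = q * z: each input receives upstream times the *other* input."""
    return upstream * z, upstream * q


def relu_backward(z, upstream):
    """g = max(0, z): the gradient passes only where z > 0."""
    return upstream * (1.0 if z > 0 else 0.0)


print(add_backward(1.0, 2.0, 5.0))   # (5.0, 5.0)
print(mul_backward(3.0, -4.0, 1.0))  # (-4.0, 3.0)
print(relu_backward(-1.0, 0.5))      # 0.0
```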

---

### 2. Chain Rule
To compute the gradient of a composite function, multiply local derivatives along the path:

If $f(x,y,z) = (x + y)\,z$, then

- $ \partial f / \partial x = (\partial f / \partial q)\times(\partial q/\partial x) = z \times 1 $
- $ \partial f / \partial y = z \times 1 $
- $ \partial f / \partial z = q \times 1 $

---

### 3. Backpropagation = Chain Rule on the Graph
1. Define local derivatives at each node.  
2. Start from the loss output and propagate gradients backward by multiplying local derivatives at each step.  
3. Frameworks’ `loss.backward()` automates this process across the entire network.

## Sigmoid Neuron: Forward & Backward Setup

### 1. Neuron definition
- **Weights**: $w = [w_0, w_1, w_2]$ where $w_2$ is the bias  
- **Inputs**: $x = [x_0, x_1]$  

The neuron computes:
1. **Affine (dot) layer**  
   $$
     \text{dot} \;=\; w_0\,x_0 \;+\; w_1\,x_1 \;+\; w_2
   $$
2. **Sigmoid activation**  
   $$
     f \;=\; \sigma(\text{dot})
     \;=\;
     \frac{1}{1 + e^{-\text{dot}}}
   $$
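
The backward pass through this neuron relies on the derivative of the sigmoid. As a short worked derivation (writing $z$ for the sigmoid's input, which is $\text{dot}$ here):

$$
\frac{d\sigma(z)}{dz}
= \frac{e^{-z}}{\left(1 + e^{-z}\right)^{2}}
= \frac{1}{1 + e^{-z}} \cdot \frac{\left(1 + e^{-z}\right) - 1}{1 + e^{-z}}
= \sigma(z)\,\bigl(1 - \sigma(z)\bigr)
$$

So once the forward pass has produced $f = \sigma(\text{dot})$, the local gradient is simply $f\,(1 - f)$ and needs no further exponentials.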

---

### 2. Forward pass example
Let
$$
  w = [2,\,-3,\,-3], \quad
  x = [-1,\,-2].
$$

1. **Compute dot**  
   $$
     \text{dot}
     = 2\times(-1) \;+\; (-3)\times(-2) \;+\; (-3)
     = -2 \;+\; 6 \;-\; 3
     = 1.
   $$
2. **Apply sigmoid**  
   $$
     f = \frac{1}{1 + e^{-1}}
       \approx 0.731.
   $$

In code, the forward and backward pass for this neuron:

```python
import math

# Example weights (w[2] is the bias) and inputs
w = [2, -3, -3]
x = [-1, -2]

# forward pass
dot = w[0]*x[0] + w[1]*x[1] + w[2]   # dot = 1
f = 1.0 / (1 + math.exp(-dot))       # sigmoid function, f ≈ 0.731

# backward pass through this neuron (backpropagation)
ddot = f * (1 - f)                           # ∂f/∂dot, using the sigmoid derivative σ(1 − σ)
dx = [w[0] * ddot, w[1] * ddot]              # ∂f/∂x: each input's gradient is its weight times ddot
dw = [x[0] * ddot, x[1] * ddot, 1.0 * ddot]  # ∂f/∂w: the bias w[2] has local derivative 1
```


## Backprop in Practice: Staged Computation

Consider the function
$$
  f(x, y) \;=\; \frac{x + \sigma(y)}{\;\sigma(x)\;+\;(x + y)^2\;},
$$
where $\sigma(z)=1/(1+e^{-z})$ is the sigmoid.


```python
import math

x = 3  # example values
y = -4

# forward pass
sigy = 1.0 / (1 + math.exp(-y))  # sigmoid in numerator    #(1)
num = x + sigy                   # numerator               #(2)
sigx = 1.0 / (1 + math.exp(-x))  # sigmoid in denominator  #(3)
xpy = x + y                      #                         #(4)
xpysqr = xpy**2                  #                         #(5)
den = sigx + xpysqr              # denominator             #(6)
invden = 1.0 / den               #                         #(7)
f = num * invden                 # done                    #(8)
print(f)                         # 1.5456448841066441

# backprop: visit the forward stages in reverse order
# --- Stage (8): f = num * invden ---
# Pass the gradient from f to both num and invden.
dnum = invden      # ∂f/∂num    = invden
dinvden = num      # ∂f/∂invden = num

# --- Stage (7): invden = 1/den ---
# Local derivative: ∂(1/den)/∂den = -1/den²
dden = (-1.0 / (den**2)) * dinvden

# --- Stage (6): den = sigx + xpysqr ---
# Addition passes dden unchanged to both sigx and xpysqr.
dsigx = 1 * dden
dxpysqr = 1 * dden

# --- Stage (5): xpysqr = xpy**2 ---
# Local derivative of the square: ∂(xpy²)/∂xpy = 2*xpy
dxpy = (2 * xpy) * dxpysqr

# --- Stage (4): xpy = x + y ---
# Addition passes dxpy unchanged to both x and y.
dx = 1 * dxpy
dy = 1 * dxpy

# --- Stage (3): sigx = σ(x) ---
# Local derivative of the sigmoid: ∂σ/∂x = σ(x)*(1 − σ(x)).
# dx already holds the gradient from the xpy path, so accumulate with +=.
dx += ((1 - sigx) * sigx) * dsigx

# --- Stage (2): num = x + sigy ---
# Another addition gate: x is reached a third time, so accumulate again.
dx += 1 * dnum     # ∂num/∂x    = 1
dsigy = 1 * dnum   # ∂num/∂sigy = 1

# --- Stage (1): sigy = σ(y) ---
# Finally, propagate through the sigmoid in the numerator.
dy += ((1 - sigy) * sigy) * dsigy
```
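
A common way to gain confidence in a hand-written backward pass like this is to compare it with a numerical gradient. The sketch below is one possible check, not part of the original notes: `sigmoid`, `f`, `staged_grad`, and `numerical_grad` are hypothetical helper names, and `staged_grad` is a condensed restatement of the staged backward pass for the same function.

```python
import math


def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))


def f(x, y):
    return (x + sigmoid(y)) / (sigmoid(x) + (x + y) ** 2)


def staged_grad(x, y):
    """Analytic gradient of f via the staged backward pass (condensed)."""
    sigy, sigx, xpy = sigmoid(y), sigmoid(x), x + y
    num = x + sigy
    den = sigx + xpy ** 2
    dnum = 1.0 / den                  # through f = num * (1/den)
    dden = (-1.0 / den ** 2) * num    # through invden = 1/den
    dxpy = (2 * xpy) * dden           # through xpysqr = xpy**2
    dx = dxpy + (1 - sigx) * sigx * dden + dnum  # three paths reach x
    dy = dxpy + (1 - sigy) * sigy * dnum         # two paths reach y
    return dx, dy


def numerical_grad(x, y, h=1e-6):
    """Centered-difference gradient of f."""
    dx = (f(x + h, y) - f(x - h, y)) / (2 * h)
    dy = (f(x, y + h) - f(x, y - h)) / (2 * h)
    return dx, dy


analytic = staged_grad(3.0, -4.0)
numeric = numerical_grad(3.0, -4.0)
print(analytic, numeric)
```

If the two pairs agree to several decimal places, the staged derivatives and the `+=` accumulations are very likely correct.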


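Matrix-matrix multiply gradient:

```python
import numpy as np

# forward pass
W = np.random.randn(5, 10)   # weights, shape (5, 10)
X = np.random.randn(10, 3)   # inputs/activations, shape (10, 3)
D = W.dot(X)                 # output, shape (5, 3)

# backward pass, given the upstream gradient dD = ∂L/∂D
# (random values stand in for it here)
dD = np.random.randn(*D.shape)  # same shape as D: (5, 3)

dW = dD.dot(X.T)   # (5×3)·(3×10) → (5×10), same shape as W
dX = W.T.dot(dD)   # (10×5)·(5×3) → (10×3), same shape as X
```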
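
The two matrix formulas can be checked numerically as well. In the sketch below, the scalar loss `L = sum(D * dD)` is an assumption chosen only so that ∂L/∂D equals `dD`; the helper name `numerical_grad` and the fixed random seed are also choices made for this example.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((5, 10))
X = rng.standard_normal((10, 3))
dD = rng.standard_normal((5, 3))   # stands in for the upstream gradient


def loss(W, X):
    # Assumed toy loss whose gradient with respect to D = W.dot(X) is dD.
    return np.sum(W.dot(X) * dD)


# Analytic gradients from the backward-pass formulas
dW = dD.dot(X.T)
dX = W.T.dot(dD)


def numerical_grad(fn, A, h=1e-6):
    """Centered-difference gradient of fn() with respect to each entry of A."""
    grad = np.zeros_like(A)
    for idx in np.ndindex(*A.shape):
        old = A[idx]
        A[idx] = old + h
        up = fn()
        A[idx] = old - h
        down = fn()
        A[idx] = old               # restore the original entry
        grad[idx] = (up - down) / (2 * h)
    return grad


dW_num = numerical_grad(lambda: loss(W, X), W)
dX_num = numerical_grad(lambda: loss(W, X), X)
print(np.allclose(dW, dW_num, atol=1e-6), np.allclose(dX, dX_num, atol=1e-6))
```

A quick sanity rule that follows from this: each gradient must have the same shape as the variable it belongs to, which is often enough to work out where the transposes go.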