# Foundation

---

## Derivative 
**Derivative** of a univariate function: 
$$
\frac{df(x)}{dx} = \lim_{\Delta x\ \to 0} \frac{f(x + \Delta x) - f(x)}{\Delta x}.
$$
The derivative tells us how sensitive $f(x)$ is to changes in $x$.  
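
The limit above can be approximated numerically by using a small but finite $\Delta x$. The sketch below is only an illustration: the function `f`, the point `x0` and the step size are arbitrary choices, not part of the notes.

```python
# Forward-difference approximation of df/dx (illustrative sketch).
def f(x):
    return x ** 2  # hypothetical example function; its exact derivative is 2x

def forward_difference(func, x, dx=1e-6):
    # Mirrors the limit definition, with a small finite step instead of a limit.
    return (func(x + dx) - func(x)) / dx

x0 = 3.0
approx = forward_difference(f, x0)
exact = 2 * x0
print(approx, exact)  # the two values should agree to several decimal places
```
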
**Partial Derivative** of a multivariate function:
$$
\frac{\partial f(x,y)}{\partial x} = \lim_{\Delta x\ \to 0} \frac{f(x + \Delta x, y) - f(x, y)}{\Delta x}.
$$
The partial derivative with respect to $x$ tells us how sensitive $f(x,y)$ is to changes in $x$ while $y$ is held fixed at a given value.  
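
The same finite-step idea applies to partial derivatives: perturb one argument and hold the other fixed. In this sketch the function `f` and the point `(x0, y0)` are assumed purely for illustration.

```python
# Numerical partial derivatives of a two-variable function (illustrative sketch).
def f(x, y):
    return x * x * y + y  # hypothetical example; df/dx = 2xy, df/dy = x^2 + 1

def partial_x(func, x, y, dx=1e-6):
    # Only x is perturbed; y keeps its given value.
    return (func(x + dx, y) - func(x, y)) / dx

def partial_y(func, x, y, dy=1e-6):
    # Only y is perturbed; x keeps its given value.
    return (func(x, y + dy) - func(x, y)) / dy

x0, y0 = 2.0, 3.0
dfdx_num = partial_x(f, x0, y0)  # close to 2 * x0 * y0
dfdy_num = partial_y(f, x0, y0)  # close to x0**2 + 1
print(dfdx_num, dfdy_num)
```
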
**Chain Rule**   
Univariate case  
Suppose $u=\phi(x)$ and $v=\psi(x)$. If $\phi$ and $\psi$ are differentiable at $x$ and $f$ is differentiable at $(u,v)$, then
$$
\frac{df(\phi(x),\psi(x))}{dx}=
\frac{\partial f(u,v)}{\partial u}\frac{d\phi(x)}{dx}+
\frac{\partial f(u,v)}{\partial v}\frac{d\psi(x)}{dx}.
$$
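
As a quick check of the univariate chain rule, the sketch below compares the formula against a central finite difference. The choices $\phi(x)=\sin x$, $\psi(x)=x^2$ and $f(u,v)=uv$ are assumptions made only for this example.

```python
import math

# Hypothetical inner and outer functions for the check.
def phi(x):
    return math.sin(x)

def psi(x):
    return x ** 2

def f(u, v):
    return u * v

def chain_rule_derivative(x):
    u, v = phi(x), psi(x)
    dfdu = v               # partial of f = u*v with respect to u
    dfdv = u               # partial of f = u*v with respect to v
    dphidx = math.cos(x)   # derivative of sin(x)
    dpsidx = 2 * x         # derivative of x^2
    return dfdu * dphidx + dfdv * dpsidx

def numeric_derivative(x, h=1e-6):
    # Central difference applied directly to the composite function.
    return (f(phi(x + h), psi(x + h)) - f(phi(x - h), psi(x - h))) / (2 * h)

x0 = 1.3
analytic = chain_rule_derivative(x0)
numeric = numeric_derivative(x0)
print(analytic, numeric)  # the two values should nearly coincide
```
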
Multivariate case  
Suppose $u=\phi(x,y)$ and $v=\psi(x,y)$. If $\phi$ and $\psi$ have partial derivatives with respect to $x$ at $(x,y)$ and $f$ is differentiable at $(u,v)$, then
$$
\frac{\partial f(\phi(x,y),\psi(x,y))}{\partial x}=
\frac{\partial f(u,v)}{\partial u}\frac{\partial \phi(x,y)}{\partial x}+
\frac{\partial f(u,v)}{\partial v}\frac{\partial \psi(x,y)}{\partial x}.
$$
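
The multivariate formula can be checked the same way. In this sketch, $\phi(x,y)=xy$, $\psi(x,y)=x+y$ and $f(u,v)=u^2+3v$ are hypothetical choices; only the partial derivative with respect to $x$ is verified.

```python
# Check the multivariate chain rule for the x-partial (illustrative sketch).
def phi(x, y):
    return x * y            # u = phi(x, y), hypothetical choice

def psi(x, y):
    return x + y            # v = psi(x, y), hypothetical choice

def f(u, v):
    return u ** 2 + 3 * v   # hypothetical outer function

def chain_rule_dx(x, y):
    u = phi(x, y)
    dfdu = 2 * u            # partial of f with respect to u
    dfdv = 3.0              # partial of f with respect to v
    dphidx = y              # partial of phi with respect to x
    dpsidx = 1.0            # partial of psi with respect to x
    return dfdu * dphidx + dfdv * dpsidx

def numeric_dx(x, y, h=1e-6):
    # Central difference in x with y held fixed.
    return (f(phi(x + h, y), psi(x + h, y)) - f(phi(x - h, y), psi(x - h, y))) / (2 * h)

x0, y0 = 1.5, -2.0
analytic_dx = chain_rule_dx(x0, y0)
numeric = numeric_dx(x0, y0)
print(analytic_dx, numeric)
```
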
A special case: suppose $u=\phi(x,y)$, $v=\psi(x)$ and $w=\omega(y)$. Under the analogous conditions, and since $\omega(y)$ does not depend on $x$,
$$
\frac{\partial f(\phi(x,y),\psi(x),\omega(y),x,y)}{\partial x}=
\frac{\partial f(u,v,w,x,y)}{\partial u}\frac{\partial \phi(x,y)}{\partial x}+
\frac{\partial f(u,v,w,x,y)}{\partial v}\frac{d\psi(x)}{dx}+
\frac{\partial f(u,v,w,x,y)}{\partial x}.
$$
Note that although $f(\phi(x,y),\psi(x),\omega(y),x,y)$ and $f(u,v,w,x,y)$ take the same value, their partial derivatives with respect to $x$ differ.
$\partial f(\phi(x,y),\psi(x),\omega(y),x,y)/\partial x$ holds only $y$ constant, so $x$ also acts through $\phi$ and $\psi$, while $\partial f(u,v,w,x,y)/\partial x$ holds $u,v,w$ and $y$ all constant.   
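
The distinction between the two partial derivatives is easy to see numerically. The sketch below assumes $\phi(x,y)=xy$, $\psi(x)=x^2$, $\omega(y)=3y$ and $f(u,v,w,x,y)=uv+wx+y$; these functions are invented for the illustration.

```python
# Two different "partial derivatives with respect to x" (illustrative sketch).
def phi(x, y):
    return x * y

def psi(x):
    return x ** 2

def omega(y):
    return 3 * y

def f(u, v, w, x, y):
    return u * v + w * x + y

def composite(x, y):
    return f(phi(x, y), psi(x), omega(y), x, y)

x0, y0 = 2.0, 1.0
h = 1e-6
u0, v0, w0 = phi(x0, y0), psi(x0), omega(y0)

# (a) Only y is held constant: u and v change together with x.
total = (composite(x0 + h, y0) - composite(x0 - h, y0)) / (2 * h)

# (b) u, v, w and y are all frozen: only the explicit x argument moves.
explicit = (f(u0, v0, w0, x0 + h, y0) - f(u0, v0, w0, x0 - h, y0)) / (2 * h)

# Chain rule: df/du * dphi/dx + df/dv * dpsi/dx + explicit df/dx.
chain = v0 * y0 + u0 * (2 * x0) + w0

print(total, explicit, chain)  # (a) matches the chain rule, (b) is only its last term
```
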
Suppose $\theta = (\theta_1,\cdots,\theta_n)$. We may apply the chain rule recursively to get the partial derivative $\partial f(\theta)/\partial \theta_i$ for all $i\in \{1,\cdots,n\}$.  
Below is an example of using the chain rule.

## Backpropagation
### A simple example
Suppose $q=\phi(x,y)=x+y$ and $f(\phi(x,y),z)=\phi(x,y)*z=(x+y)*z$, so $f(\phi(x,y),z)=f(q,z)=q*z$. The partial derivatives with respect to $x, y$ and $z$ are
$$
\frac{\partial f(\phi(x,y),z)}{\partial x} = \frac{\partial f(q,z)}{\partial q}
\frac{\partial \phi(x,y)}{\partial x} = z\times 1 = z,\\
\frac{\partial f(\phi(x,y),z)}{\partial y} = \frac{\partial f(q,z)}{\partial q}
\frac{\partial \phi(x,y)}{\partial y} = z\times 1 = z,\\
\frac{\partial f(\phi(x,y),z)}{\partial z} = \frac{\partial f(q,z)}{\partial z}=q.
$$

```python
# set some inputs
x = -2; y = 5; z = -4

# perform the forward pass
q = x + y # q becomes 3
f = q * z # f becomes -12

# perform the backward pass (backpropagation) in reverse order:
# first backprop through f = q * z
dfdz = q # df/dz = q, so gradient on z becomes 3
dfdq = z # df/dq = z, so gradient on q becomes -4
# now backprop through q = x + y
dfdx = 1.0 * dfdq # dq/dx = 1. And the multiplication here is the chain rule!
dfdy = 1.0 * dfdq # dq/dy = 1
print([dfdx, dfdy, dfdz])
```

Output:

```
[-4.0, -4.0, 3]
```


The computation above can be visualized with a circuit diagram:
![circuit-simple](./figures/fd/circuit-simple.png)

**Intuition of backpropagation**  
After performing the forward pass, we compute the partial derivative with respect to $q$ and reuse this result to compute the partial derivatives with respect to $x$ and $y$ via the chain rule. Backpropagation is a procedure for computing partial derivatives that starts at the output layer and works backward, layer by layer, to the input layer.

**Intuition of parallel computing**  
After performing the forward pass, the partial derivatives with respect to $q$ and $z$ are independent of each other and can be computed in parallel. Once the partial derivative with respect to $q$ is known, the partial derivatives with respect to $x$ and $y$ are likewise independent and can be computed in parallel.
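
One way to sanity-check a hand-written backward pass like the one above is to compare it with central finite differences of the forward pass. The helper names `forward`, `backward` and `numeric_gradient` below are hypothetical and introduced only for this sketch.

```python
# Gradient check for f = (x + y) * z (illustrative sketch).
def forward(x, y, z):
    q = x + y
    return q * z

def backward(x, y, z):
    q = x + y
    dfdq = z              # gradient flowing into q from f = q * z
    dfdz = q
    dfdx = 1.0 * dfdq     # chain rule through q = x + y
    dfdy = 1.0 * dfdq
    return [dfdx, dfdy, dfdz]

def numeric_gradient(x, y, z, h=1e-5):
    args = [x, y, z]
    grads = []
    for i in range(3):
        plus = list(args)
        minus = list(args)
        plus[i] += h      # perturb one input at a time
        minus[i] -= h
        grads.append((forward(*plus) - forward(*minus)) / (2 * h))
    return grads

analytic = backward(-2.0, 5.0, -4.0)
numeric = numeric_gradient(-2.0, 5.0, -4.0)
print(analytic, numeric)
```

If the two lists disagree by more than a small tolerance, the backward pass most likely contains a mistake.
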

### Sigmoid Example  
Suppose we have the sigmoid function below:
$$
f_x(w) = \frac{1}{1+e^{-(w_0x_0 + w_1x_1 + w_2)}},
$$
where $x$ is the input and $w$ is the parameter vector; we want the partial derivative of $f_x(w)$ with respect to each element of $w$.  
Let 
$$
u = w_0x_0 + w_1x_1 + w_2,
$$
and
$$
g_x(u) = \frac{1}{1+e^{-u}} = f_x(w).
$$
Using the chain rule, the partial derivative with respect to $w_0$ is
$$
\frac{\partial f_x(w)}{\partial w_0}=\frac{\partial g_x(u(w))}{\partial w_0}=
\frac{\partial g_x(u)}{\partial u}\frac{\partial u(w)}{\partial w_0}.
$$
The partial derivative of $g_x(u)$ with respect to $u$ is
$$
\frac{\partial g_x(u)}{\partial u} = (1-g_x(u))g_x(u),
$$
and the partial derivative of $u(w)$ with respect to $w_0$ is simply $x_0$.
The partial derivatives with respect to $w_1$ and $w_2$ follow in the same way.
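
For completeness, the sigmoid derivative and the three resulting partial derivatives can be written out as follows, using only the definitions of $u$ and $g_x(u)$ given earlier:
$$
\frac{\partial g_x(u)}{\partial u} = \frac{e^{-u}}{(1+e^{-u})^2}
= \frac{1}{1+e^{-u}}\cdot\frac{e^{-u}}{1+e^{-u}} = g_x(u)\,(1-g_x(u)),
$$
so that
$$
\frac{\partial f_x(w)}{\partial w_0} = (1-g_x(u))g_x(u)\,x_0,\qquad
\frac{\partial f_x(w)}{\partial w_1} = (1-g_x(u))g_x(u)\,x_1,\qquad
\frac{\partial f_x(w)}{\partial w_2} = (1-g_x(u))g_x(u).
$$
The last factor is $1$ because $w_2$ enters $u$ as a plain additive term.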

```python
import math

w = [2,-3,-3] # assume some random weights and data
x = [-1, -2]

# forward pass
u = w[0]*x[0] + w[1]*x[1] + w[2]
g = 1.0 / (1 + math.exp(-u)) # sigmoid function

# backward pass through the neuron (backpropagation)
dgdu = (1 - g) * g # gradient on u, using the sigmoid gradient derivation
dgdw = [x[0] * dgdu, x[1] * dgdu, 1.0 * dgdu] # backprop into w
# we're done! we have the partial derivatives with respect to the weights w
print(dgdw)
```

Output:

```
[-0.19661193324148185, -0.3932238664829637, 0.19661193324148185]
```
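
The identity used for `dgdu` can itself be checked numerically. The sketch below evaluates it at $u=1$, the value of $u$ produced by the weights and data above; the function names `sigmoid` and `sigmoid_grad` are hypothetical.

```python
import math

def sigmoid(u):
    return 1.0 / (1.0 + math.exp(-u))

def sigmoid_grad(u):
    # Closed form (1 - g) * g from the derivation.
    g = sigmoid(u)
    return (1.0 - g) * g

u0 = 1.0   # 2*(-1) + (-3)*(-2) + (-3) from the example weights and data
h = 1e-6
numeric = (sigmoid(u0 + h) - sigmoid(u0 - h)) / (2 * h)  # central difference
analytic = sigmoid_grad(u0)
print(numeric, analytic)
```
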


### Staged Computation  
To be continued

### Vectorized Operation  
To be continued

## Gradient
Let $e_l=(\cos\alpha,\cos\beta)$ be the **unit vector** in the direction $l$ on the 2-D plane $(x,y)$, where $\alpha$ and $\beta$ are the angles that $l$ makes with the $x$- and $y$-axes. The **directional derivative** of $f(x,y)$ along $l$ is
$$
\frac{\partial f}{\partial l} = \lim_{t\ \to 0^+} \frac{f(x + t\cos\alpha, y+t\cos\beta) - f(x, y)}{t}.
$$
The directional derivative tells us how sensitive $f(x,y)$ is to a move in the direction $l$ defined by $e_l=(\cos\alpha,\cos\beta)$.  
If $f$ is differentiable at $(x,y)$, then
$$
\frac{\partial f}{\partial l} = \frac{\partial f(x,y)}{\partial x}\cos\alpha+
\frac{\partial f(x,y)}{\partial y}\cos\beta
$$
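
The formula above can be compared with the one-sided limit definition on a concrete case. In this sketch the function $f(x,y)=x^2+3xy$, the point $(1,2)$ and the angle $\alpha=\pi/3$ are assumptions chosen for illustration; in 2-D, $\beta=\pi/2-\alpha$.

```python
import math

def f(x, y):
    return x ** 2 + 3 * x * y   # hypothetical example function

def directional_derivative(func, x, y, e, t=1e-6):
    # One-sided difference quotient, mirroring the limit t -> 0+.
    return (func(x + t * e[0], y + t * e[1]) - func(x, y)) / t

x0, y0 = 1.0, 2.0
alpha = math.pi / 3
beta = math.pi / 2 - alpha
e_l = (math.cos(alpha), math.cos(beta))   # unit vector of direction l

numeric = directional_derivative(f, x0, y0, e_l)

# Partial derivatives of the example function, computed by hand.
dfdx = 2 * x0 + 3 * y0
dfdy = 3 * x0
formula = dfdx * e_l[0] + dfdy * e_l[1]

print(numeric, formula)  # the two values should nearly coincide
```
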
The **gradient** $\nabla f$ of $f(x,y)$ is the vector of partial derivatives
$$
\nabla f = \left(\frac{\partial f(x,y)}{\partial x}, \frac{\partial f(x,y)}{\partial y}\right),
$$
and then
$$
\frac{\partial f}{\partial l}=\nabla f\cdot e_l= |\nabla f||e_l|\cos\theta=|\nabla f|\cos\theta,
$$
where $\theta$ is the angle between $\nabla f$ and $e_l$.   
So when $\theta = 0$, that is, when $e_l$ points in the same direction as $\nabla f$, the directional derivative is maximized, which motivates **gradient-based learning methods**.
Locally, the gradient is the direction of steepest ascent of $f$ and its negative is the direction of steepest descent, which is why gradient methods step along it when searching for an optimal point.
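
To see the maximization concretely, the sketch below scans unit directions in one-degree steps and records where $\nabla f\cdot e_l$ is largest. The gradient used is that of the hypothetical function $f(x,y)=x^2+3xy$ at $(1,2)$, an assumption made only for this example.

```python
import math

# Gradient of the hypothetical example f(x, y) = x**2 + 3*x*y at (1, 2).
x0, y0 = 1.0, 2.0
grad = (2 * x0 + 3 * y0, 3 * x0)
grad_norm = math.hypot(grad[0], grad[1])

best_angle_deg, best_value = None, -math.inf
for angle_deg in range(360):
    a = math.radians(angle_deg)
    e_l = (math.cos(a), math.sin(a))             # unit direction; sin(a) equals cos(beta)
    value = grad[0] * e_l[0] + grad[1] * e_l[1]  # directional derivative along e_l
    if value > best_value:
        best_angle_deg, best_value = angle_deg, value

grad_angle_deg = math.degrees(math.atan2(grad[1], grad[0]))
print(best_angle_deg, grad_angle_deg)  # best scanned angle vs. angle of the gradient
print(best_value, grad_norm)           # largest directional derivative vs. |grad f|
```

Up to the one-degree resolution of the scan, the best direction coincides with the gradient direction and the largest directional derivative approaches $|\nabla f|$.
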