# Chain Rule and Jacobian Matrix
Each layer $f$ in a neural network is a function $f: \mathbb{R}^m \rightarrow \mathbb{R}^n$, so differentiating a network amounts to applying the chain rule to a composition of such functions.
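
As a concrete illustration of a layer as a map from $\mathbb{R}^m$ to $\mathbb{R}^n$, here is a minimal sketch with $m = 3$ and $n = 2$. The `dense_layer` helper, the `tanh` activation, and the weights `W` and bias `b` are all hypothetical choices made for this example, not something prescribed by the text.

```python
import math

def dense_layer(x, W, b):
    """Hypothetical layer: y_j = tanh(sum_k W[j][k] * x[k] + b[j])."""
    return [
        math.tanh(sum(w * xi for w, xi in zip(row, x)) + bj)
        for row, bj in zip(W, b)
    ]

# Assumed example values: W has n = 2 rows and m = 3 columns.
W = [[0.5, -1.0, 0.25],
     [1.5, 0.0, -0.5]]
b = [0.1, -0.2]
x = [1.0, 2.0, 3.0]        # a point in R^3

y = dense_layer(x, W, b)   # a point in R^2
print(len(x), "->", len(y))
```

The input has three coordinates and the output has two, so this particular layer is a function from $\mathbb{R}^3$ to $\mathbb{R}^2$.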

## Scalar version
Without loss of generality, consider the composite function $f(x(t), y(t))$ of a scalar $t$. Writing $f'(t)$ for its derivative with respect to $t$, we can show:
$$
\begin{aligned}
f'(t) =& \lim_{\Delta t \to 0} \frac{f(x(t+\Delta t), y(t+\Delta t)) - f(x(t), y(t))}{\Delta t} \\
      =& \lim_{\Delta t \to 0} \frac{  f(x(t+\Delta t), y(t+\Delta t)) - f(x(t+\Delta t), y(t))  + f(x(t+\Delta t), y(t)) - f(x(t), y(t))  }{\Delta t} \\
      =& \lim_{\Delta t \to 0} \frac{f(x(t+\Delta t), y(t+\Delta t)) - f(x(t+\Delta t), y(t))}{\Delta t}  + \lim_{\Delta t \to 0} \frac{f(x(t+\Delta t), y(t)) - f(x(t), y(t))}{\Delta t} \\
      =& \lim_{\Delta t \to 0} \frac{f(x(t+\Delta t), y(t+\Delta t)) - f(x(t+\Delta t), y(t))}  {y(t+\Delta t) - y(t)} \cdot  \frac{y(t+\Delta t) - y(t)}{\Delta t} + \\
       & \lim_{\Delta t \to 0} \frac{f(x(t+\Delta t), y(t)) - f(x(t), y(t))}  {x(t+\Delta t) - x(t)} \cdot  \frac{x(t+\Delta t) - x(t)}{\Delta t} \\
 \doteq& \lim_{\Delta t \to 0} \frac{f(x(t+\Delta t), y(t) + \Delta y) - f(x(t+\Delta t), y(t))}  {\Delta y} \cdot  \frac{y(t+\Delta t) - y(t)}{\Delta t} + \\
       & \lim_{\Delta t \to 0} \frac{f(x(t) + \Delta x, y(t)) - f(x(t), y(t))}  {\Delta x} \cdot  \frac{x(t+\Delta t) - x(t)}{\Delta t} \\
 \doteq& \frac{\partial}{\partial y} f(x, y) \cdot \frac{d}{d t} y(t) + \frac{\partial}{\partial x} f(x, y) \cdot \frac{d}{d t} x(t) \\
\end{aligned}
$$ 

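where $\Delta x \doteq x(t+\Delta t) - x(t)$ and $\Delta y \doteq y(t+\Delta t) - y(t)$. The last two steps are valid provided $\Delta t \rightarrow 0$ implies $\Delta x \rightarrow 0$ and $\Delta y \rightarrow 0$, that is, provided $x$ and $y$ are continuous at $t$; [Lipschitz continuity](./lipschitz.ipynb) of $x$ and $y$ is a sufficient condition. The argument also assumes $\Delta x, \Delta y \neq 0$ and that the partial derivatives of $f$ are continuous, so that evaluating $\partial f / \partial y$ at $x(t+\Delta t)$ rather than at $x(t)$ does not change the limit.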
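
The scalar chain rule can be sanity-checked numerically. The sketch below assumes a hypothetical choice of functions, $f(x, y) = x^2 y$, $x(t) = \sin t$ and $y(t) = e^t$ (none of these come from the derivation above), and compares the chain-rule expression against a central finite difference of the composite $t \mapsto f(x(t), y(t))$.

```python
import math

def f(x, y):
    # Hypothetical outer function: f(x, y) = x^2 * y
    return x * x * y

def x_of(t):
    # Hypothetical inner function x(t) = sin(t)
    return math.sin(t)

def y_of(t):
    # Hypothetical inner function y(t) = exp(t)
    return math.exp(t)

def chain_rule_derivative(t):
    """df/dt = (df/dy) * y'(t) + (df/dx) * x'(t), with hand-derived partials."""
    x, y = x_of(t), y_of(t)
    df_dx = 2.0 * x * y      # partial of x^2 * y with respect to x
    df_dy = x * x            # partial of x^2 * y with respect to y
    dx_dt = math.cos(t)
    dy_dt = math.exp(t)
    return df_dy * dy_dt + df_dx * dx_dt

def finite_difference_derivative(t, h=1e-6):
    """Central difference of the composite t -> f(x(t), y(t))."""
    forward = f(x_of(t + h), y_of(t + h))
    backward = f(x_of(t - h), y_of(t - h))
    return (forward - backward) / (2.0 * h)

t0 = 0.7
analytic = chain_rule_derivative(t0)
numeric = finite_difference_derivative(t0)
print("chain rule:       ", analytic)
print("finite difference:", numeric)
print("absolute gap:     ", abs(analytic - numeric))
```

The two values should agree up to finite-difference and rounding error. The step size `h` is an arbitrary choice for this sketch; much smaller values tend to amplify floating-point cancellation.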

## Multivariate version

In the more general case, consider $f(x(t))$ where $t \in \mathbb{R}^m$, $x: \mathbb{R}^m \rightarrow \mathbb{R}^n$ and $f: \mathbb{R}^n \rightarrow \mathbb{R}$.
The partial derivative with respect to the $i$-th coordinate of $t$ is

$$
\frac{\partial}{\partial t_i} f(x(t)) =
    \begin{bmatrix}
        \frac{\partial f(x)}{\partial x_1} & \cdots & \frac{\partial f(x)}{\partial x_n}
    \end{bmatrix}
    \cdot
    \begin{bmatrix}
        \frac{\partial x_1}{\partial t_i} \\
        \vdots \\
        \frac{\partial x_n}{\partial t_i}
    \end{bmatrix}
\doteq
    \nabla_x^T f (x)
    \cdot
    \begin{bmatrix}
        \frac{\partial x_1}{\partial t_i} \\
        \vdots \\
        \frac{\partial x_n}{\partial t_i}
    \end{bmatrix},
$$ 

Therefore, stacking these partial derivatives for $i = 1, \dots, m$ into a row vector gives

$$
\begin{aligned}
\nabla_t^T f(x(t))
\doteq
\begin{bmatrix}
    \frac{\partial f(x(t))}{\partial t_1} & \cdots & \frac{\partial f(x(t))}{\partial t_m}
\end{bmatrix} &=
\nabla_x^T f (x)
\cdot
\begin{bmatrix}
    \partial x_1 / \partial t_1 & \partial x_1 / \partial t_2 & \cdots & \partial x_1 / \partial t_m \\
    \partial x_2 / \partial t_1 & \partial x_2 / \partial t_2 & \cdots & \partial x_2 / \partial t_m \\
    \vdots & \vdots & \ddots & \vdots \\
    \partial x_n / \partial t_1 & \partial x_n / \partial t_2 & \cdots & \partial x_n / \partial t_m \\
\end{bmatrix}\\
&\doteq \nabla_x^T f (x) \cdot J_t x
\end{aligned}
$$ 
where the $n \times m$ matrix on the right-hand side, whose $(k, i)$ entry is $\partial x_k / \partial t_i$, is called the *Jacobian matrix* $J_t x$.
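
The gradient-times-Jacobian identity can be checked the same way. This is a minimal sketch assuming hypothetical functions with $m = 2$ and $n = 3$: $x(t) = (t_1 t_2,\; t_1 + t_2^2,\; e^{t_1})$ and $f(x) = x_1 x_2 + \sin x_3$. The gradient and Jacobian are written out by hand, and the product is compared against central finite differences of $t \mapsto f(x(t))$.

```python
import math

def x_of(t):
    # Hypothetical inner map x: R^2 -> R^3
    t1, t2 = t
    return [t1 * t2, t1 + t2 ** 2, math.exp(t1)]

def f(x):
    # Hypothetical outer function f: R^3 -> R
    x1, x2, x3 = x
    return x1 * x2 + math.sin(x3)

def grad_f(x):
    """Gradient of f with respect to x, derived by hand (length n = 3)."""
    x1, x2, x3 = x
    return [x2, x1, math.cos(x3)]

def jacobian_x(t):
    """Jacobian of x with respect to t: n = 3 rows, m = 2 columns."""
    t1, t2 = t
    return [
        [t2, t1],            # d x_1 / d t_1, d x_1 / d t_2
        [1.0, 2.0 * t2],     # d x_2 / d t_1, d x_2 / d t_2
        [math.exp(t1), 0.0], # d x_3 / d t_1, d x_3 / d t_2
    ]

def chain_rule_gradient(t):
    """Row vector grad_f(x(t))^T times the Jacobian, giving length m."""
    g = grad_f(x_of(t))
    J = jacobian_x(t)
    return [sum(g[k] * J[k][i] for k in range(len(g))) for i in range(len(t))]

def finite_difference_gradient(t, h=1e-6):
    """Central differences of t -> f(x(t)), one coordinate of t at a time."""
    grads = []
    for i in range(len(t)):
        t_plus = list(t)
        t_minus = list(t)
        t_plus[i] += h
        t_minus[i] -= h
        grads.append((f(x_of(t_plus)) - f(x_of(t_minus))) / (2.0 * h))
    return grads

t0 = [0.3, -1.2]
analytic = chain_rule_gradient(t0)
numeric = finite_difference_gradient(t0)
print("gradient via Jacobian:  ", analytic)
print("gradient via differences:", numeric)
```

Each entry of `analytic` is the dot product of the gradient of $f$ with one column of the Jacobian, which is exactly the per-coordinate formula for $\partial f / \partial t_i$ given earlier.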