**First Order ODE**
- 1st order DE: any equation of the form
\begin{gather*}
    \frac{dy}{dt} = f(y, t)
\end{gather*}
- We do not know $y$ itself, but we know how to calculate its rate of change
- Many types of 1st-order DEs
    - 1st order, linear DE $\rightarrow y' + p(t)y = g(t) \Rightarrow y' = g(t) -p(t)y$
    - Separable DE $\rightarrow p(y)y' = g(t) \Rightarrow y' = g(t)/p(y)$
    - Bernoulli DE $\rightarrow y' + p(t)y = y^n \Rightarrow y' = y^n - p(t)y$

**ODE Example**: Free-falling Object
- Want to know the velocity of a falling object at time $t$
- Newton's law of motion: $F = ma = m \frac{dv}{dt}$
- So we have $ m \frac{dv}{dt} = F(t, v)$
- Without air friction: $m \frac{dv}{dt} = mg$
- With air friction: $m \frac{dv}{dt} = mg- \gamma v \Rightarrow \frac{dv}{dt} = g - \frac{\gamma v}{m}$
- We can solve this analytically, but we need an initial condition to determine the constant of integration
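
As a worked example, the air-friction equation is first-order linear, so it has a closed-form solution. The initial velocity $v(0) = v_0$ is a symbol introduced here for illustration, and $g$, $\gamma$, $m$ are assumed constant:
\begin{gather*}
    \frac{dv}{dt} = g - \frac{\gamma}{m} v, \quad v(0) = v_0 \\
    v(t) = \frac{mg}{\gamma} + \left(v_0 - \frac{mg}{\gamma}\right) e^{-\gamma t / m}
\end{gather*}
The single constant of integration is fixed by $v_0$, and as $t \rightarrow \infty$ the velocity approaches the terminal value $mg/\gamma$.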

**Initial Value Problem** $:=$ Differential Equation + Initial Condition

**How to solve ODE**
- Numerical Solution: What if we cannot solve the DE analytically?
    - We can solve it numerically

**Euler's Method**
- Finds numerical solution to IVPs
- Given initial value problem $\frac{dy}{dt} = f(t, y)$ and $y(t_0) = y_0$
- Slope of the solution at time $t = t_0$ is defined as
\begin{gather*}
    \left.\frac{dy}{dt}\right|_{t = t_0} = f(t_0, y_0)
\end{gather*}
- Tangent line to the solution at $t = t_0$
\begin{gather*}
    y = y_0 + f(t_0, y_0) (t-t_0)
\end{gather*}
- Take a small step along the tangent line, and approximate $y_1$ (just like gradient descent)
- The estimate $\hat{y}_1$ differs from the true value $y_1$ by an error of $|\hat{y}_1 - y_1|$
- The error accumulates as the number of steps grows
    - It can be reduced by using smaller time steps

The algorithm is as follows:
1. Define $f(t, y)$ 
2. Input $t_0$ and $y_0$
3. Input step size $h$ and number of steps $n$
4. for $j$ from $1$ to $n$ do
    - $m = f(t_0, y_0)$
    - $y_1 = y_0 + h \cdot m$
    - $t_1 = t_0 + h$
    - Print $t_1$ and $y_1$
    - $t_0 = t_1$
    - $y_0 = y_1$
5. end
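
The steps above can be sketched in Python as follows. This is a minimal sketch, not a reference implementation: the function names and the parameter values for the falling-object test problem are illustrative assumptions, and the results are collected in lists instead of printed.

```python
import math

def euler(f, t0, y0, h, n):
    """Integrate dy/dt = f(t, y) from (t0, y0) using n Euler steps of size h."""
    ts, ys = [t0], [y0]
    t, y = t0, y0
    for _ in range(n):
        m = f(t, y)        # slope of the tangent line at the current point
        y = y + h * m      # step along the tangent line
        t = t + h
        ts.append(t)
        ys.append(y)
    return ts, ys

# Test problem: falling object with air friction, dv/dt = g - (gamma / m) * v.
# The numeric values below are arbitrary illustrative assumptions.
G, GAMMA, M = 9.81, 0.5, 2.0

def falling(t, v):
    return G - GAMMA * v / M

def exact_velocity(t, v0=0.0):
    """Closed-form solution of the linear ODE above with v(0) = v0."""
    v_inf = M * G / GAMMA
    return v_inf + (v0 - v_inf) * math.exp(-GAMMA * t / M)

def error_at(T, n):
    """Absolute error of Euler's method at time T when using n steps."""
    _, vs = euler(falling, 0.0, 0.0, T / n, n)
    return abs(vs[-1] - exact_velocity(T))

if __name__ == "__main__":
    for n in (20, 200, 2000):
        print(n, error_at(2.0, n))
```

Running the loop at the bottom shows the error at a fixed end time shrinking as the step size decreases, which matches the remark that smaller time steps lower the error.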

As the step size goes to $0$, the accumulated updates become an integral, and we can write:
\begin{gather*}
    y_T = y_0 + \int_{0}^T f(y_t, t)dt = \text{ODESolve}(y_0, f, [0, T], \Delta T)
\end{gather*}

**Runge-Kutta (RK4) Method**
- Better precision than Euler's method
- Given IVP, $y_{n+1} = y_n + \frac{1}{6}h(k_1 + 2k_2 + 2k_3 + k_4)$ where $k_i$ depend on $f$ with different inputs and previous $k_{i-1}$
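
For concreteness, the classical RK4 scheme uses $k_1 = f(t_n, y_n)$, $k_2 = f(t_n + \frac{h}{2}, y_n + \frac{h}{2}k_1)$, $k_3 = f(t_n + \frac{h}{2}, y_n + \frac{h}{2}k_2)$ and $k_4 = f(t_n + h, y_n + h k_3)$. The sketch below implements one such step for a scalar ODE and compares it with Euler's method; the function names and the test problem $y' = -y$ are assumptions chosen for illustration.

```python
import math

def rk4_step(f, t, y, h):
    """One classical RK4 step for dy/dt = f(t, y)."""
    k1 = f(t, y)
    k2 = f(t + h / 2, y + h * k1 / 2)
    k3 = f(t + h / 2, y + h * k2 / 2)
    k4 = f(t + h, y + h * k3)
    return y + h * (k1 + 2 * k2 + 2 * k3 + k4) / 6

def euler_step(f, t, y, h):
    """One Euler step, for comparison."""
    return y + h * f(t, y)

def integrate(step, f, t0, y0, h, n):
    """Apply a one-step method n times and return the final value."""
    t, y = t0, y0
    for _ in range(n):
        y = step(f, t, y, h)
        t += h
    return y

def decay(t, y):
    # Illustrative test problem: y' = -y, y(0) = 1, exact solution exp(-t).
    return -y

if __name__ == "__main__":
    exact = math.exp(-1.0)
    print("Euler error:", abs(integrate(euler_step, decay, 0.0, 1.0, 0.1, 10) - exact))
    print("RK4 error:  ", abs(integrate(rk4_step, decay, 0.0, 1.0, 0.1, 10) - exact))
```

With the same step size, RK4 spends four evaluations of $f$ per step instead of one, but its error on this problem is several orders of magnitude smaller.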

**ODE Solvers**
- Long history in mathematics and physics
- Fixed step size solvers
    - Euler
    - Midpoint
    - Runge-Kutta
    - Adams-Bashforth
- Adaptive step size solvers
    - Dormand-Prince
    - Dormand-Prince-Shampine
    - Bogacki-Shampine
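
To illustrate what "adaptive step size" means, here is a rough scalar sketch of the Bogacki-Shampine pair: a 3rd-order and a 2nd-order estimate are computed from the same function evaluations, and their difference is used as an error estimate to accept or reject the step and rescale $h$. The step-size controller (safety factor 0.9, growth clamped to the range 0.2 to 5) and all names are assumptions made for this example; production solvers use more careful control logic.

```python
import math

def bogacki_shampine(f, t0, y0, t_end, tol=1e-6, h=0.1):
    """Adaptive integration of a scalar ODE dy/dt = f(t, y) on [t0, t_end].

    Returns the final value and the number of accepted steps.
    """
    t, y, accepted = t0, y0, 0
    while t_end - t > 1e-12:
        h = min(h, t_end - t)                      # never step past t_end
        k1 = f(t, y)
        k2 = f(t + h / 2, y + h * k1 / 2)
        k3 = f(t + 3 * h / 4, y + 3 * h * k2 / 4)
        y_high = y + h * (2 * k1 + 3 * k2 + 4 * k3) / 9            # 3rd order
        k4 = f(t + h, y_high)
        y_low = y + h * (7 * k1 / 24 + k2 / 4 + k3 / 3 + k4 / 8)   # 2nd order
        err = abs(y_high - y_low)                  # local error estimate
        if err <= tol:                             # accept the step
            t, y, accepted = t + h, y_high, accepted + 1
        # Rescale h for the next attempt, whether accepted or rejected.
        scale = 0.9 * (tol / err) ** (1 / 3) if err > 0 else 5.0
        h *= min(5.0, max(0.2, scale))
    return y, accepted

if __name__ == "__main__":
    for tol in (1e-3, 1e-6):
        y_end, n = bogacki_shampine(lambda t, y: -y, 0.0, 1.0, 2.0, tol=tol)
        print(tol, n, abs(y_end - math.exp(-2.0)))
```

A looser tolerance lets the solver take fewer, larger steps, while a tighter one forces smaller steps, which is the trade-off a fixed-step solver cannot make automatically.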

**2nd order DE**
- General form:
\begin{gather*}
    p(y, t)y'' + q(y, t)y' + r(y, t)y = g(y, t)
\end{gather*}
- Usually constant coefficients, possibly with a non-constant term
- Can be solved numerically
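
One common way to solve a 2nd-order DE numerically is to introduce $v = y'$ and rewrite it as a system of two 1st-order equations, which any of the solvers above can handle. The sketch below does this for the illustrative constant-coefficient equation $y'' + y = 0$ with $y(0) = 1$, $y'(0) = 0$ (exact solution $\cos t$); the equation, the names, and the choice of RK4 are assumptions made for this example.

```python
import math

def rk4_system(f, t0, state0, h, n):
    """Classical RK4 for a system; the state is a list of floats."""
    def shifted(s, k, a):
        # Returns s + a * k, element-wise.
        return [si + a * ki for si, ki in zip(s, k)]

    t, s = t0, list(state0)
    for _ in range(n):
        k1 = f(t, s)
        k2 = f(t + h / 2, shifted(s, k1, h / 2))
        k3 = f(t + h / 2, shifted(s, k2, h / 2))
        k4 = f(t + h, shifted(s, k3, h))
        s = [si + h * (a + 2 * b + 2 * c + d) / 6
             for si, a, b, c, d in zip(s, k1, k2, k3, k4)]
        t += h
    return s

def oscillator(t, s):
    """y'' + y = 0 written as the system y' = v, v' = -y."""
    y, v = s
    return [v, -y]

if __name__ == "__main__":
    y_end, v_end = rk4_system(oscillator, 0.0, [1.0, 0.0], 0.01, 314)
    print(y_end, math.cos(3.14))
```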

**Neural ODE**

**ResNet**
- $\mathbf{h}_{t+1} = \text{ReLU}(W_t \mathbf{h}_t + b_t) + \mathbf{h}_t$
- General form: $\mathbf{h}_{t+1} = f(\mathbf{h}_t, \theta_t) + \mathbf{h}_t$
- Rewrite:
\begin{gather*}
    \mathbf{h}_{t+1} = f(\mathbf{h}_t, \theta_t) + \mathbf{h}_t \iff \mathbf{h}_{t+1} - \mathbf{h}_t = f(\mathbf{h}_t, \theta_t) \\
    \iff \frac{\mathbf{h}_{t+1} - \mathbf{h}_t}{1} = f(\mathbf{h}_t, \theta_t) \\
    \iff \left.\frac{\mathbf{h}_{t+\Delta} - \mathbf{h}_t}{\Delta}\right|_{\Delta = 1} = f(\mathbf{h}_t, \theta_t) \\
    \Rightarrow \lim_{\Delta \rightarrow 0} \frac{\mathbf{h}_{t+\Delta} - \mathbf{h}_t}{\Delta} = f(\mathbf{h}_t, t, \theta) \\
    \Rightarrow \frac{d\mathbf{h}_t}{dt} = f(\mathbf{h}_t, t,\theta)
\end{gather*}
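
The rewrite above says that a stack of residual blocks is an Euler discretization with step size $\Delta = 1$. A small sketch can make this concrete. For simplicity it assumes a scalar hidden state and arbitrary per-layer parameters $(w, b)$; these values and all function names are illustrative, not taken from any real model.

```python
def relu(x):
    return max(0.0, x)

def f(h, theta):
    """Residual branch of one block, ReLU(w * h + b), for a scalar state h."""
    w, b = theta
    return relu(w * h + b)

def resnet_forward(h, thetas):
    """Apply the residual update h_{t+1} = f(h_t, theta_t) + h_t layer by layer."""
    for theta in thetas:
        h = f(h, theta) + h
    return h

def euler_forward(h, thetas, delta=1.0):
    """Euler steps h_{t+delta} = h_t + delta * f(h_t, theta_t)."""
    for theta in thetas:
        h = h + delta * f(h, theta)
    return h

# Arbitrary illustrative parameters for three layers.
thetas = [(0.5, 0.1), (-0.3, 0.2), (0.8, -0.1)]

if __name__ == "__main__":
    print(resnet_forward(1.0, thetas))      # residual network output
    print(euler_forward(1.0, thetas, 1.0))  # same value: Euler with delta = 1
```

Setting `delta` below 1 in `euler_forward` corresponds to taking smaller steps of the same dynamics, which is the direction the limit $\Delta \rightarrow 0$ pushes toward.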

ResNet properties:
- $L$ discrete layers
- Latent state changes discretely
- Latent state dynamics controlled by $L$ functions

Continuous Neural ODE properties:
- Infinite layers
- Latent state changes continuously
- Latent state dynamics controlled by one function

We can apply similar logic to RNNs

Now, in Neural ODE, $f(\mathbf{h}(t), t, \theta)$ is the neural network in:
\begin{gather*}
    \frac{d\mathbf{h}(t)}{dt} = f(\mathbf{h}(t), t, \theta)
\end{gather*}
- We have an ODE problem
- We do not know $y'$ (here, $\frac{d\mathbf{h}(t)}{dt}$)
- We want to learn $y'$ from data via a NN and backprop

NeuralODE Forward Propagation
- Input state $\mathbf{h}(0)$
- State dynamics
\begin{gather*}
    \frac{d\mathbf{h}(t)}{dt} = f_{\theta}(\mathbf{h}(t), t)
\end{gather*}
    - $f_{\theta}$ is typically just an MLP with some hidden layers
- Output state:
\begin{gather*}
    \mathbf{h}(T) = \mathbf{h}(0) + \int_{0}^Tf_{\theta} (\mathbf{h}(t), t) dt \\
    \mathbf{h}(T) = \text{ODESolve}(\mathbf{h}(0), f, [0, T], \theta)
\end{gather*}
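
The forward pass can be sketched without any deep learning library. The example below assumes a tiny tanh MLP for $f_{\theta}$ that takes the state concatenated with the time as input, random untrained weights, and a fixed-step Euler integrator standing in for ODESolve. All of these are illustrative choices, and any other solver could be substituted.

```python
import math
import random

def init_mlp(dim, hidden, seed=0):
    """Random (untrained) weights for a one-hidden-layer MLP; illustrative only."""
    rng = random.Random(seed)
    W1 = [[rng.uniform(-0.5, 0.5) for _ in range(dim + 1)] for _ in range(hidden)]
    b1 = [0.0] * hidden
    W2 = [[rng.uniform(-0.5, 0.5) for _ in range(hidden)] for _ in range(dim)]
    b2 = [0.0] * dim
    return W1, b1, W2, b2

def f_theta(h, t, theta):
    """State dynamics dh/dt = f_theta(h, t): an MLP applied to [h, t]."""
    W1, b1, W2, b2 = theta
    x = h + [t]  # concatenate state and time
    z = [math.tanh(sum(w * xi for w, xi in zip(row, x)) + b)
         for row, b in zip(W1, b1)]
    return [sum(w * zi for w, zi in zip(row, z)) + b
            for row, b in zip(W2, b2)]

def odesolve(h0, f, t_span, theta, n_steps=100):
    """Fixed-step Euler stand-in for ODESolve(h(0), f, [0, T], theta)."""
    t0, t1 = t_span
    dt = (t1 - t0) / n_steps
    h, t = list(h0), t0
    for _ in range(n_steps):
        dh = f(h, t, theta)
        h = [hi + dt * di for hi, di in zip(h, dh)]
        t += dt
    return h

theta = init_mlp(dim=2, hidden=8)
h_T = odesolve([1.0, -1.0], f_theta, (0.0, 1.0), theta)

if __name__ == "__main__":
    print(h_T)  # output state h(T)
```

With `n_steps=1` the solver performs a single update $\mathbf{h}(0) + f_{\theta}(\mathbf{h}(0), 0)$, which has the same form as one residual block.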

NeuralODE Backward Propagation
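- Backpropagating directly through the operations of an ODE solver is not a good idea: every intermediate solver step has to be stored, so memory grows with the number of steps.
- Instead, we want to obtain the gradients without backpropagating through the solver.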

**Adjoint Sensitivity Method**
- Main contribution of this paper
- Necessary gradients
    - $\frac{\partial \mathcal{L}}{\partial \mathbf{h}(0)}$: Gradients of the loss with respect to input state
    - $\frac{\partial \mathcal{L}}{\partial \mathbf{h}(1)}, \frac{\partial \mathcal{L}}{\partial \mathbf{h}(2)}, \dots$: (in time series) Gradients of the loss with respect to intermediate states
    - $\frac{\partial \mathcal{L}}{\partial \theta}$: Gradients of the loss with respect to the dynamics function params
- Use **adjoint sensitivity method**
- Simply put, solve ODE backwards to obtain $\frac{\partial \mathcal{L}}{\partial \mathbf{h}(t)}$ given $\frac{\partial \mathcal{L}}{\partial \mathbf{h}(T)}$
- We use the solver backwards! (reuse the solver, the same ODESolve)
- Solve $\frac{\partial \mathcal{L}}{\partial \theta}$ and $\frac{\partial \mathcal{L}}{\partial \mathbf{h}(t)}$ **together at the same time**
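
For reference, the quantities that are solved backwards in time can be written out explicitly. This is the standard continuous adjoint formulation, stated here in the $\mathbf{h}(t)$ notation of this section, with $\mathbf{a}(t)$ denoting the adjoint state and $\mathbf{a}(T) = \frac{\partial \mathcal{L}}{\partial \mathbf{h}(T)}$ as the starting value:
\begin{gather*}
    \mathbf{a}(t) = \frac{\partial \mathcal{L}}{\partial \mathbf{h}(t)}, \qquad
    \frac{d\mathbf{a}(t)}{dt} = -\mathbf{a}(t)^T \frac{\partial f(\mathbf{h}(t), t, \theta)}{\partial \mathbf{h}} \\
    \frac{\partial \mathcal{L}}{\partial \theta} = -\int_{T}^{0} \mathbf{a}(t)^T \frac{\partial f(\mathbf{h}(t), t, \theta)}{\partial \theta} dt
\end{gather*}
Both the adjoint ODE and the integral for $\frac{\partial \mathcal{L}}{\partial \theta}$ run from $T$ back to $0$, so they can be stacked with $\mathbf{h}(t)$ into one augmented state and handled by a single solver call.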

Overall:
- 1 ODESolve for FP
- 1 ODESolve (for both gradients at the same time) for BP
- No need to backprop through the solver, just keep calling ODESolve
- At the end, you get the desired gradients of the loss, then update the parameters with gradient descent
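
The "one solve forward, one solve backward" recipe can be checked on a toy problem where the gradients are known in closed form. The sketch below assumes scalar dynamics $\frac{dz}{dt} = \theta z$ and loss $L = \frac{1}{2}z(T)^2$, for which $\frac{\partial L}{\partial \theta} = T z(T)^2$ and $\frac{\partial L}{\partial z(0)} = z(T)e^{\theta T}$. The problem, the fixed-step RK4 solver, and all names are illustrative assumptions, not the paper's implementation.

```python
import math

def odesolve(f, state, t0, t1, n_steps=200):
    """Fixed-step RK4 from t0 to t1; t1 < t0 integrates backwards in time."""
    def shifted(s, k, a):
        return [si + a * ki for si, ki in zip(s, k)]

    h = (t1 - t0) / n_steps  # negative when solving backwards
    t, s = t0, list(state)
    for _ in range(n_steps):
        k1 = f(t, s)
        k2 = f(t + h / 2, shifted(s, k1, h / 2))
        k3 = f(t + h / 2, shifted(s, k2, h / 2))
        k4 = f(t + h, shifted(s, k3, h))
        s = [si + h * (a + 2 * b + 2 * c + d) / 6
             for si, a, b, c, d in zip(s, k1, k2, k3, k4)]
        t += h
    return s

def adjoint_gradients(theta, z0, T):
    """Return (loss, dL/dz(0), dL/dtheta) using one forward and one backward solve."""
    # Forward pass: z(T) from z(0).
    (zT,) = odesolve(lambda t, s: [theta * s[0]], [z0], 0.0, T)
    loss = 0.5 * zT ** 2
    aT = zT  # dL/dz(T) for this loss

    def augmented(t, s):
        z, a, g = s
        # dz/dt = f, da/dt = -a * df/dz, dg/dt = -a * df/dtheta
        return [theta * z, -a * theta, -a * z]

    # Backward pass: same solver, run from T to 0 on the augmented state.
    _, a0, grad_theta = odesolve(augmented, [zT, aT, 0.0], T, 0.0)
    return loss, a0, grad_theta

if __name__ == "__main__":
    theta, z0, T = 0.3, 1.5, 2.0
    loss, grad_z0, grad_theta = adjoint_gradients(theta, z0, T)
    zT = z0 * math.exp(theta * T)
    print(grad_theta, T * zT ** 2)               # adjoint vs analytic
    print(grad_z0, zT * math.exp(theta * T))     # adjoint vs analytic
```

The backward call reconstructs $z(t)$ alongside the adjoint instead of reading it from stored forward activations, which is what allows the method to skip backpropagation through the solver's steps.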

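With the adjoint $a(t) = \frac{\partial L}{\partial z(t)}$, the chain rule gives:
\begin{gather*}
    \frac{\partial L(z(t))}{\partial t} = \frac{\partial L(z(t))}{\partial z(t)} \frac{\partial{z}(t)}{\partial t} = a(t)^T \frac{\partial\left({z}(t_0) + \int_{t_0}^t f(z(s), s, \theta)ds\right)}{\partial t} = a(t)^T f(z(t), t, \theta)
\end{gather*}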