# online nets

Goal: evaluate a deep learning alternative capable of true online learning. Solution requirements:

1. catastrophic forgetting should be impossible
2. all data is integrated into sufficient statistics of fixed dimension
3. have predictive power comparable to deep learning

## alternative model

Least squares regression (LSR) meets solution requirements 1 and 2. To achieve requirement 3, we'll structure our model similarly to deep learning by having LSR models depend on other LSR models, effectively producing a Gaussian Bayes net with nonlinear activations.
Unfortunately, Bayes nets' latent variables result in computationally intractable integrals, so we'll use a heuristic to "observe" all variables.

Our LSR model can be written as $X_n\beta_n = Y_n$, where
- $X_n \in \mathbb{R}^{n \times p}$ is a matrix of regressor columns and $x_i \in \mathbb{R}^{1 \times p}$ observation rows.
- $Y_n \in \mathbb{R}^{n \times q}$ is a matrix of dependent variable columns and $y_i \in \mathbb{R}^{1 \times q}$ observation rows.
- $\beta_n \in \mathbb{R}^{p \times q}$ is our matrix of regression weights.

First, we'll rephrase our problems as time series problems, so that we may assume our input and output dimensions are equal, i.e. $p = q$. While this category of problems clearly covers reinforcement learning, it can also cover more general problems.
For example, given the right network topology and interpreting $X_{n+k}$ as the $n^{th}$ sample's output, we cover feed-forward classification problems as well.
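
As a rough illustration of the time series framing, the sketch below pairs each row of a sequence with the row $k$ steps later, so that inputs and outputs share the same dimension $p$. The helper name `make_lagged_pairs` and the fixed lag `k` are assumptions made for this example, not part of the model definition.

```python
import numpy as np

def make_lagged_pairs(series, k=1):
    """Split a (T, p) sequence into inputs and outputs that are k steps apart.

    Hypothetical helper: row i of the first array is the regressor x_i and
    row i of the second array is the row observed k steps later.
    """
    series = np.asarray(series)
    if not 0 < k < len(series):
        raise ValueError("k must satisfy 0 < k < number of rows")
    return series[:-k], series[k:]

# Toy sequence with T = 5 rows and p = 2 columns.
series = np.arange(10.0).reshape(5, 2)
X, Y = make_lagged_pairs(series, k=2)
```

Both `X` and `Y` have $p$ columns here, which is the $p = q$ assumption used in the rest of the section.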

## numerical considerations

We'll use the Sherman-Morrison formula (SMF) to derive our online updates in a computationally tractable way.

$$(A + uv^T)^{-1} = A^{-1} - \frac{A^{-1} uv^T A^{-1}}{1 + v^TA^{-1}u}$$
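
A quick numerical check of the rank-one inverse update, with $u$ and $v$ treated as column vectors, might look like the following. The function name and the small hand-picked matrices are illustrative assumptions.

```python
import numpy as np

def sherman_morrison_inverse(A_inv, u, v):
    """Return (A + u v^T)^{-1} given A^{-1}; u and v are 1-D vectors."""
    A_inv_u = A_inv @ u              # A^{-1} u
    vT_A_inv = v @ A_inv             # v^T A^{-1}
    denom = 1.0 + v @ A_inv_u        # scalar 1 + v^T A^{-1} u, must be nonzero
    return A_inv - np.outer(A_inv_u, vT_A_inv) / denom

# Small hand-picked example (here 1 + v^T A^{-1} u = 0.75).
A = np.diag([2.0, 3.0, 4.0])
u = np.array([1.0, 0.0, 2.0])
v = np.array([0.5, 1.0, -1.0])

updated = sherman_morrison_inverse(np.linalg.inv(A), u, v)
direct = np.linalg.inv(A + np.outer(u, v))
max_abs_error = np.max(np.abs(updated - direct))
```

The update costs $O(p^2)$ per step, compared with roughly $O(p^3)$ for recomputing the inverse from scratch.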

We will later define our problems in terms of recurrent nets and time series, where input and output dimensions are equal, so let $p = q$.

With regularization term $\lambda > 0$, the L2-regularized estimate of $\beta$ is $\left(\sum_{i=1}^n x_i^Tx_i + \lambda I \right)^{-1}\sum_{i=1}^n x_i^Ty_i$. However, we'll need an additional inverse, so we must add further regularization.
Take our $\beta$ estimate to be $\hat{\beta}_n = \left(\sum_{i=1}^n x_i^Tx_i + \lambda I \right)^{-1}\left(\sum_{i=1}^n x_i^Ty_i + \lambda I \right)$, where $I = I_{p \times p}$. We'll derive our SMF-inverse updates with these definitions:
- $A_n := \sum_{i=1}^n x_i^Tx_i + \lambda I_{p \times p}$
- $A_0 := \lambda I_{p \times p}$
- $B_n := \sum_{i=1}^n x_i^T y_i + \lambda I_{p \times p}$
- $B_0 := \lambda I_{p \times p}$

With these definitions, we have that $\hat{\beta}_n = A_n^{-1} B_n$ and also that $\hat{\beta}_{n+1} = \left(A_n + x_{n+1}^Tx_{n+1} \right)^{-1} \left( B_n + x_{n+1}^Ty_{n+1} \right)$. Applying SMF with $u = v = x_{n+1}^T$ for $A$, and with $u = x_{n+1}^T$, $v = y_{n+1}^T$ for $B$, we get these identities:
$$A_{n+1}^{-1} = A_n^{-1} - \frac{A_n^{-1} x_{n+1}^Tx_{n+1} A_n^{-1}}{1+x_{n+1}A_n^{-1}x_{n+1}^T}$$
$$B_{n+1}^{-1} = B_n^{-1} - \frac{B_n^{-1} x_{n+1}^Ty_{n+1} B_n^{-1}}{1+y_{n+1}B_n^{-1}x_{n+1}^T}$$

So, we have derived our computationally tractable inverse updates.
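
To make the recursion concrete, here is a minimal sketch that folds observations one at a time into $A_n^{-1}$ and $B_n$ and compares $\hat{\beta}_n = A_n^{-1} B_n$ with the batch solution. It assumes $A_0 = B_0 = \lambda I_{p \times p}$ and $p = q$; the function names, the random toy data, and the choice to track $B_n$ rather than $B_n^{-1}$ are assumptions made for this example.

```python
import numpy as np

def init_state(p, lam):
    """Return (A_0^{-1}, B_0) for A_0 = B_0 = lam * I."""
    return np.eye(p) / lam, lam * np.eye(p)

def update(A_inv, B, x, y):
    """Fold one observation (x, y), both 1-D of length p, into the state."""
    # A_inv is symmetric, so x A^{-1} equals (A^{-1} x^T)^T.
    A_inv_x = A_inv @ x
    A_inv = A_inv - np.outer(A_inv_x, A_inv_x) / (1.0 + x @ A_inv_x)
    B = B + np.outer(x, y)           # B_{n+1} = B_n + x^T y
    return A_inv, B

rng = np.random.default_rng(0)
n, p, lam = 50, 3, 0.1
X = rng.normal(size=(n, p))          # toy regressors
Y = rng.normal(size=(n, p))          # toy targets, p = q

A_inv, B = init_state(p, lam)
for x, y in zip(X, Y):
    A_inv, B = update(A_inv, B, x, y)
beta_online = A_inv @ B

# Batch estimate from the same definitions, for comparison.
I = np.eye(p)
beta_batch = np.linalg.solve(X.T @ X + lam * I, X.T @ Y + lam * I)
```

The state stays at two $p \times p$ matrices however many observations arrive, which is the fixed-dimension sufficient statistic from requirement 2.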
