# online nets

Goal: evaluate a deep learning alternative capable of true online learning. Solution requirements:

1. catastrophic forgetting should be impossible
2. all data is integrated into sufficient statistics of fixed dimension
3. have predictive power comparable to deep learning

## alternative model

Least squares regression (LSR) meets solution requirements 1 and 2. To achieve requirement 3, we'll structure our model similarly to deep learning by having LSR models depend on other LSR models, effectively producing a Gaussian Bayes net with nonlinear activations.
Unfortunately, Bayes nets' latent variables result in computationally intractable integrals, so we'll use a heuristic to "observe" all variables.

Our LSR model can be written as $X_n\beta_n = Y_n$, where
- $X_n \in \mathbb{R}^{n \times p}$ is a matrix of regressor columns and $x_i \in \mathbb{R}^{1 \times p}$ observation rows.
- $Y_n \in \mathbb{R}^{n \times q}$ is a matrix of dependent variable columns and $y_i \in \mathbb{R}^{1 \times q}$ observation rows.
- $\beta \in \mathbb{R}^{p \times q}$ is our matrix of regression weights.
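
To make requirement 2 concrete, here is a minimal sketch, assuming NumPy and synthetic data (the sizes, seed, and `beta_true` are illustrative assumptions). It accumulates $\sum_i x_i^T x_i$ and $\sum_i x_i^T y_i$ one row at a time and compares the resulting fit with an ordinary batch least squares fit.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, q = 200, 4, 3                      # illustrative sizes (assumption)
X = rng.normal(size=(n, p))
beta_true = rng.normal(size=(p, q))      # hypothetical ground-truth weights
Y = X @ beta_true + 0.01 * rng.normal(size=(n, q))

# Sufficient statistics of fixed dimension: p x p and p x q, whatever n is.
XtX = np.zeros((p, p))
XtY = np.zeros((p, q))
for x_i, y_i in zip(X, Y):
    x_i = x_i[None, :]                   # observation row, shape (1, p)
    y_i = y_i[None, :]                   # observation row, shape (1, q)
    XtX += x_i.T @ x_i
    XtY += x_i.T @ y_i

beta_stream = np.linalg.solve(XtX, XtY)              # fit from the running sums
beta_batch, *_ = np.linalg.lstsq(X, Y, rcond=None)   # fit from all rows at once
```

The two estimates agree up to floating point error, and the running sums never grow with $n$, which is the property requirement 2 asks for.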

First, we'll rephrase our problems as time series problems, so that we may assume our input and output dimensions are equal, i.e. $p = q$. While this category of problems clearly covers reinforcement learning, it can also cover more general problems.
For example, given the right network topology and interpreting $X_{n+k}$ as the $n^{th}$ sample's output, we cover feed-forward classification problems as well. Our network will be used as a recurrent net, so we'll use this additional constraint:
- $X_{n+1} := \sigma(Y_n)$ where $\sigma: \mathbb{R}^p \to \mathbb{R}^p$ is an invertible activation function.

To accommodate both observed and latent variables, let
- $O_n \in \mathbb{R}^{p - l}$ be our observed variables
- $L_n \in \mathbb{R}^l$ be our latent variables
- $X_n^T = [O_n^T \; | \; L_n^T]$ be a concatenation of both observed and latent variables.

Given $(O_n, L_n, O_{n+1})$ observed but $L_{n+1}$ not observed, we now construct a heuristic to guess $L_{n+1}$. Rather simply, we'll run the LSR backwards: $\hat{X}_n := Y_n \beta_n^{-1}$. We'll then use the $\hat{L}_n$ portion of $\hat{X}_n$ as $L_{n+1}$.
Then, we'll fit our model to $Y_n^T =  \left[ \sigma^{-T}(O_{n+1}) \; \big| \; \hat{L}_n^T \right]$.
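
The backward step can be sketched as follows, assuming NumPy. The leaky ReLU is only one possible invertible $\sigma$, and `beta`, `y_prov`, and `o_next` are random placeholders (assumptions) standing in for $\beta_n$, a provisional $Y_n$, and $O_{n+1}$. How the provisional $Y_n$ is obtained is not fixed by the text above, so it is simply given here.

```python
import numpy as np

def leaky_relu(x, slope=0.1):
    """An invertible activation on all of R^p (one possible choice of sigma)."""
    return np.where(x > 0, x, slope * x)

def leaky_relu_inv(y, slope=0.1):
    """Inverse of leaky_relu."""
    return np.where(y > 0, y, y / slope)

def backward_latent_guess(y, beta, l):
    """Run the regression backwards, x_hat = y @ beta^{-1}, and keep the last l entries."""
    x_hat = np.linalg.solve(beta.T, y.T).T   # equals y @ inv(beta) for square beta
    return x_hat[:, x_hat.shape[1] - l:]

def build_target(o_next, l_hat, sigma_inv=leaky_relu_inv):
    """Concatenate sigma^{-1}(O_{n+1}) with the guessed latents to form the row Y_n."""
    return np.hstack([sigma_inv(o_next), l_hat])

rng = np.random.default_rng(1)
p, l = 5, 2
beta = rng.normal(size=(p, p)) + 3 * np.eye(p)   # placeholder weights (assumption)
y_prov = rng.normal(size=(1, p))                 # placeholder provisional Y_n (assumption)
o_next = rng.normal(size=(1, p - l))             # placeholder O_{n+1} (assumption)

l_hat = backward_latent_guess(y_prov, beta, l)
y_target = build_target(o_next, l_hat)
```

Slicing with `x_hat.shape[1] - l` rather than `-l` keeps the function correct when $l = 0$, where the latent block is empty.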

## motivating the heuristic

The advantage of this heuristic over backpropagation is that it supports mathematically guaranteed online learning. However, we must motivate using $\hat{L}_n$ as $L_{n+1}$ in $Y_n$.
The essential idea is that we produce an approximate backpropagation equivalent by allowing $O_{n+1}$ to inform what $L_n$ should have been to produce $O_{n+1}$.
Of course, having $L_{n+1}$ updated requires another iteration to inform $O_{n+2}$, so solutions may require several iterations to converge.

This heuristic is indeed approximate and lacks the mathematical rigour guaranteeing deep learning's success, but it is worth trying because deep learning's computational cost is approaching infeasibility.

## numerical considerations

We'll use the Sherman-Morrison formula (SMF) to derive $\hat{\beta}_n^{-1}$ in a computationally tractable way.

$$(A + uv^T)^{-1} = A^{-1} - \frac{A^{-1} uv^T A^{-1}}{1 + v^TA^{-1}u}$$
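
As a quick numerical sanity check of SMF, here is a small sketch assuming NumPy. The matrix `A` and the vectors `u`, `v` are random and purely illustrative, scaled so that the denominator stays away from zero.

```python
import numpy as np

def sherman_morrison(A_inv, u, v):
    """Return (A + u v^T)^{-1} given A^{-1}, for column vectors u and v."""
    Au = A_inv @ u                        # shape (p, 1)
    vA = v.T @ A_inv                      # shape (1, p)
    denom = 1.0 + (v.T @ Au).item()       # scalar 1 + v^T A^{-1} u
    return A_inv - (Au @ vA) / denom

rng = np.random.default_rng(2)
p = 4
A = 5.0 * np.eye(p) + 0.1 * rng.normal(size=(p, p))   # illustrative, well-conditioned
u = 0.5 * rng.normal(size=(p, 1))
v = 0.5 * rng.normal(size=(p, 1))

updated = sherman_morrison(np.linalg.inv(A), u, v)    # O(p^2) rank-one update
direct = np.linalg.inv(A + u @ v.T)                   # O(p^3) fresh inverse
```

Both routes give the same matrix up to rounding, but the update only needs matrix-vector products.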

As above, our problems are phrased in terms of recurrent nets and time series. Input and output dimensions are thus equal, so let $p = q$.

With regularization term $\lambda > 0$, the L2-regularized estimate of $\beta$ is $\left(\sum_{i=1}^n x_i^Tx_i + \lambda I_{p \times p} \right)^{-1}\sum_{i=1}^n x_i^Ty_i$. However, we'll also need to invert $\hat{\beta}_n$ itself, so we must add further regularization.
Take our $\beta$ estimate to be $\hat{\beta}_n = \left(\sum_{i=1}^n x_i^Tx_i + \lambda I_{p \times p} \right)^{-1}\left(\sum_{i=1}^n x_i^Ty_i + \lambda I_{p \times p} \right)$. We'll derive our SMF-inverse updates with these definitions:
- $A_n := \sum_{i=1}^n x_i^Tx_i + \lambda I_{p \times p}$
- $A_0 := \lambda I_{p \times p}$
- $B_n := \sum_{i=1}^n x_i^T y_i + \lambda I_{p \times p}$
- $B_0 := \lambda I_{p \times p}$

With these definitions, we have that $\hat{\beta}_n = A_n^{-1} B_n$ and also that $\hat{\beta}_{n+1} = \left(A_n + x_{n+1}^Tx_{n+1} \right)^{-1} \left( B_n + x_{n+1}^Ty_{n+1} \right)$. Applying SMF to each factor, we get these identities:
$$A_{n+1}^{-1} = A_n^{-1} - \frac{A_n^{-1} x_{n+1}^Tx_{n+1} A_n^{-1}}{1+x_{n+1}A_n^{-1}x_{n+1}^T}, \; \; \; B_{n+1}^{-1} = B_n^{-1} - \frac{B_n^{-1} x_{n+1}^Ty_{n+1} B_n^{-1}}{1+y_{n+1}B_n^{-1}x_{n+1}^T}$$

So, we have derived our computationally tractable inverse updates for $A_n$ and $B_n$. Now, since $\hat{\beta}_n = A_n^{-1}B_n$ it is trivial to calculate $\hat{\beta}_n^{-1} = B_n^{-1} A_n$.
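
The two rank-one updates can be checked numerically with a sketch like the one below, assuming NumPy. The data are synthetic, and `beta_true` is deliberately chosen near the identity (an assumption made only for this example) so that $B_n$ stays comfortably invertible. In general nothing guarantees that $1 + y B_n^{-1} x^T$ stays away from zero.

```python
import numpy as np

def rank_one_inverse_update(M_inv, u, v):
    """Sherman-Morrison inverse of (M + u^T v) for row vectors u, v of shape (1, p)."""
    Mu = M_inv @ u.T                      # shape (p, 1)
    vM = v @ M_inv                        # shape (1, p)
    return M_inv - (Mu @ vM) / (1.0 + (v @ Mu).item())

rng = np.random.default_rng(3)
p, lam, n = 4, 1.0, 50                    # illustrative values (assumption)
beta_true = np.eye(p) + 0.05 * rng.normal(size=(p, p))   # hypothetical weights

A = lam * np.eye(p)
A_inv = np.eye(p) / lam                   # A_0^{-1} = lambda^{-1} I
B = lam * np.eye(p)
B_inv = np.eye(p) / lam                   # B_0^{-1} = lambda^{-1} I
for _ in range(n):
    x = rng.normal(size=(1, p))           # observation row
    y = x @ beta_true                     # synthetic target row
    A_inv = rank_one_inverse_update(A_inv, x, x)   # A_{n+1} = A_n + x^T x
    B_inv = rank_one_inverse_update(B_inv, x, y)   # B_{n+1} = B_n + x^T y
    A += x.T @ x
    B += x.T @ y

beta_hat = A_inv @ B                      # estimate of beta
beta_hat_inv = B_inv @ A                  # its inverse, with no fresh inversion
```

Every step costs $O(p^2)$, and `beta_hat_inv` comes for free from the four tracked matrices.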

## the heuristic algorithm 

1. Choose $\lambda > 0$, $p \in \mathbb{N}$, $l \in \{0, 1, \ldots, p-1\}$, sampling distribution $(X_{i+1} \; | \; X_i) \sim F(x_{i+1} \; | \; x_i)$ & $X_0 \sim F(x_0)$, and activation function $\sigma: \mathbb{R}^p \to \mathbb{R}^p$.
2. Set $n \gets 0$, $A_0 \gets \lambda I_{p \times p}$, $A_0^{-1} \gets \lambda^{-1}I_{p \times p}$, $B_0 \gets \lambda I_{p \times p}$, and $B_0^{-1} \gets \lambda^{-1} I_{p \times p}$.
3. Sample $X_n, O_n, O_{n+1}$.
4. Calculate $\hat X_n \gets Y_n \hat \beta_n^{-1} = Y_n B_n^{-1} A_n$.
5. Set $\hat Y_n^T \gets \left[ \sigma^{-T}(O_{n+1}) \; \big|\; \hat L_n^T \right]$ where $\hat X_n^T = \left[ \hat O_n^T \; \big|\; \hat L_n^T \right]$.
6. Given $(X_n, Y_n, A_n, A_n^{-1}, B_n, B_n^{-1})$ now well-defined, calculate $(Y_{n+1}, A_{n+1}, A_{n+1}^{-1}, B_{n+1}, B_{n+1}^{-1})$.
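
The steps above can be sketched end to end as follows, assuming NumPy. Several pieces are assumptions that the steps leave open:

- the leaky ReLU stands in for $\sigma$;
- the toy drifting series stands in for $F$;
- the provisional $Y_n$ used in step 4 is taken to be $\sigma^{-1}(O_{n+1})$ joined with the forward-predicted latents;
- $\hat L_n$ is carried forward unchanged as $L_{n+1}$.

No safeguard against an ill-conditioned $B_n$ is included.

```python
import numpy as np

def leaky_relu_inv(y, slope=0.1):
    """Inverse of a leaky ReLU, used here as sigma^{-1} (assumed activation)."""
    return np.where(y > 0, y, y / slope)

def rank_one_inverse_update(M_inv, u, v):
    """Sherman-Morrison inverse of (M + u^T v) for row vectors u, v."""
    Mu = M_inv @ u.T
    vM = v @ M_inv
    return M_inv - (Mu @ vM) / (1.0 + (v @ Mu).item())

def heuristic_step(state, x, o_next, l, sigma_inv=leaky_relu_inv):
    """One pass through steps 4-6 for a single row x = [o_n | l_n]."""
    A, A_inv, B, B_inv = state
    p = x.shape[1]
    target_obs = sigma_inv(o_next)
    # ASSUMPTION: provisional Y_n = observed target joined with forward-predicted latents.
    y_prov = np.hstack([target_obs, (x @ A_inv @ B)[:, p - l:]])
    # Step 4: run the regression backwards, X_hat = Y B^{-1} A.
    x_hat = y_prov @ B_inv @ A
    l_hat = x_hat[:, p - l:]
    # Step 5: target row is [sigma^{-1}(O_{n+1}) | L_hat].
    y = np.hstack([target_obs, l_hat])
    # Step 6: rank-one updates of the statistics and their inverses.
    A_inv = rank_one_inverse_update(A_inv, x, x)
    B_inv = rank_one_inverse_update(B_inv, x, y)
    A = A + x.T @ x
    B = B + x.T @ y
    return (A, A_inv, B, B_inv), l_hat

# Steps 1-2: choose hyperparameters and initialise (values are illustrative).
rng = np.random.default_rng(4)
lam, p, l = 1.0, 5, 2
state = (lam * np.eye(p), np.eye(p) / lam, lam * np.eye(p), np.eye(p) / lam)

# Step 3: hypothetical toy series standing in for the sampling distribution F.
o = rng.normal(size=(1, p - l))
latent = 0.1 * rng.normal(size=(1, l))    # nonzero start so the latents can move
for n in range(20):
    o_next = 0.9 * o + 0.1 * rng.normal(size=(1, p - l))
    x = np.hstack([o, latent])
    state, latent = heuristic_step(state, x, o_next, l)   # L_{n+1} <- L_hat_n
    o = o_next

A, A_inv, B, B_inv = state
beta_hat = A_inv @ B
```

On the very first step $\hat\beta_0 = I$, so the guessed latents equal the incoming ones. Any learning of the latent block only starts once the observed targets have moved $B_n$ away from $\lambda I$.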


# first experiment: mnist classification

In [None]:
## TODO