# Regularizers as memory

The greatest challenge in modern AI research is the diminishing returns on computational work.
While the greatest advances can only be afforded by large technology firms, even they
cannot afford to push results much further. A clear plateau has developed.
Insufficient computational efficiency motivates a return to theory and raises a question:
_where is waste most significant?_

The phenomenon of catastrophic forgetting shows that raw, non-compressed information must be continually re-applied
if it is not to be forgotten. If important information could be preserved, even partially, we would expect
more efficient computation. In short, _this work targets the realization of memory_.

Regularized likelihood equations have a Lagrangian form, so they implicitly describe geometric constraints on estimates.
For example, here is an estimate constrained to an L2-sphere in $\Theta$-space.

$$ \hat \theta = \text{arg max}_{\theta \in \Theta} n^{-1} \sum_{i=1}^n \log f_X(X_i;\theta) - \lambda \| \theta \|_2^2 $$
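To make the Lagrangian reading concrete, here is a short worked sketch. The symbols $\ell_n$ (the average log-likelihood) and $c$ (a squared radius) are introduced only for this illustration. Consider the constrained problem

$$ \max_{\theta \in \Theta} \ell_n(\theta) \quad \text{subject to} \quad \| \theta \|_2^2 \le c, \qquad \ell_n(\theta) = n^{-1} \sum_{i=1}^n \log f_X(X_i;\theta) $$

With multiplier $\lambda \ge 0$, its Lagrangian is

$$ \mathcal{L}(\theta, \lambda) = \ell_n(\theta) - \lambda \left( \| \theta \|_2^2 - c \right) $$

The term $\lambda c$ does not depend on $\theta$, so for fixed $\lambda$ maximizing $\mathcal{L}$ over $\theta$ is the same as maximizing the penalized likelihood. In the other direction, a penalized estimate $\hat \theta$ solves the constrained problem with $c = \| \hat \theta \|_2^2$, because any feasible $\theta$ satisfies

$$ \ell_n(\theta) \le \ell_n(\hat \theta) + \lambda \left( \| \theta \|_2^2 - \| \hat \theta \|_2^2 \right) \le \ell_n(\hat \theta) $$

so the estimate sits on the boundary sphere of radius $\sqrt{c}$.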

In this work, we'll generalize the regularizer $\| \cdot \|_2^2$ to support alternative geometries,
in an attempt to construct numerically convenient memory approximations.
In particular, we'll seek to approximate the following equation.
Note that it introduces parabolic geometric constraints on the estimate.

$$ \hat \theta = \text{arg max}_{\theta \in \Theta} n^{-1} \sum_{i=1}^n \log f_X(X_i;\theta) - \frac{\lambda}{2} (\theta - \theta_0)^T \mathcal{I}_{\theta_0} (\theta - \theta_0) $$
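As a rough illustration of how this quadratic penalty can act as memory, the sketch below solves the equation in closed form for a toy model that is assumed here, not taken from this work: $X \sim N(\theta, I)$ with identity covariance. The function name `anchored_gaussian_mean` and the matrix `info` (standing in for $\mathcal{I}_{\theta_0}$) are hypothetical. Under that assumption the first-order condition is linear in $\theta$.

```python
import numpy as np

def anchored_gaussian_mean(x, theta0, info, lam):
    """Maximize mean_i log N(x_i; theta, I) - lam/2 (theta-theta0)^T info (theta-theta0).

    Toy model assumed for illustration only: unit-covariance Gaussian.
    """
    x = np.asarray(x, dtype=float)
    d = x.shape[1]
    xbar = x.mean(axis=0)
    # First-order condition: (xbar - theta) - lam * info @ (theta - theta0) = 0,
    # i.e. (I + lam * info) theta = xbar + lam * info @ theta0.
    lhs = np.eye(d) + lam * info
    rhs = xbar + lam * info @ theta0
    return np.linalg.solve(lhs, rhs)

rng = np.random.default_rng(0)
x_new = rng.normal(loc=[2.0, 2.0], scale=1.0, size=(500, 2))  # new data centered at (2, 2)
theta0 = np.zeros(2)                                          # earlier estimate
info = np.diag([100.0, 0.0])  # hypothetical: earlier data pinned down coordinate 0 only
theta_hat = anchored_gaussian_mean(x_new, theta0, info, lam=1.0)
```

In this setup the first coordinate of `theta_hat` stays close to `theta0` (it is "remembered"), while the second coordinate, which carries no information in `info`, follows the sample mean of the new data.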

## RL-reweighted estimates 

This first estimate is designed to improve sampling efficiency in reinforcement learning (RL) by
up-weighting more important observations. We'll use this estimate:

$$ \hat \theta_n = \text{arg max}_\theta \left( \sum_{i=1}^n \| \hat r_i \|_2^{2b} \right)^{-1} \sum_{i=1}^n \| \hat r_i \|_2^{2b} \log f_X(X_i; \theta) - \frac{\lambda}{2} (\theta - \hat \theta_{n-1})^T H_i (\theta - \hat \theta_{n-1})  $$

where 
- $\hat r_i$ is the estimated reward for observation $i$, 
- $b > 0$ is a bias term that controls how strongly high-reward observations are up-weighted,
- $H_i = \sum_{j=1}^i \left( \nabla_\theta \log f_X(X_j; \theta)|_{\theta = \hat \theta_{j-1}} \right) \left( \nabla_\theta \log f_X(X_j; \theta)|_{\theta = \hat \theta_{j-1}} \right)^T$ is a Hessian approximation built from outer products of gradients, and
- $\hat \theta_0$ is an initial guess for the estimate, as commonly required by optimization routines.
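The recursion can be sketched in a few lines for a toy case. Everything model-specific here is an assumption made for illustration: the model is $X \sim N(\theta, I)$, so the gradient of the log-density at $\hat \theta_{j-1}$ is $X_j - \hat \theta_{j-1}$ and each step has a closed form. The penalty matrix at step $n$ is read as the sum accumulated through observation $n$, which is one possible reading of the index on $H_i$. The names `reward_weights` and `rl_reweighted_path` are hypothetical.

```python
import numpy as np

def reward_weights(r_hat, b):
    """Normalized weights ||r_i||_2^(2b) / sum_j ||r_j||_2^(2b).

    Assumes at least one reward is nonzero, otherwise the sum is zero.
    """
    r = np.asarray(r_hat, dtype=float).reshape(len(r_hat), -1)
    w = np.linalg.norm(r, axis=1) ** (2.0 * b)
    return w / w.sum()

def rl_reweighted_path(x, r_hat, theta0, lam, b):
    """Return theta_1..theta_N for the assumed toy model X ~ N(theta, I)."""
    x = np.asarray(x, dtype=float)
    r_hat = np.asarray(r_hat, dtype=float)
    n_obs, d = x.shape
    theta_prev = np.asarray(theta0, dtype=float)
    H = np.zeros((d, d))
    path = []
    for n in range(1, n_obs + 1):
        # Gradient of log N(x_n; theta, I) evaluated at the previous estimate.
        score = x[n - 1] - theta_prev
        H += np.outer(score, score)          # running outer-product sum
        w = reward_weights(r_hat[:n], b)     # reward-based weights over X_1..X_n
        xbar_w = w @ x[:n]                   # reward-weighted sample mean
        # First-order condition: (xbar_w - theta) - lam * H (theta - theta_prev) = 0.
        theta = np.linalg.solve(np.eye(d) + lam * H, xbar_w + lam * H @ theta_prev)
        path.append(theta)
        theta_prev = theta
    return np.array(path)

rng = np.random.default_rng(0)
x = rng.normal(loc=1.0, size=(50, 2))
r_hat = rng.uniform(0.1, 1.0, size=50)       # synthetic rewards, strictly positive
path = rl_reweighted_path(x, r_hat, theta0=np.zeros(2), lam=0.5, b=1.0)
```

With `lam=0.0` the penalty vanishes and each estimate reduces to the reward-weighted sample mean, which gives a simple sanity check. For models without a closed-form step, the linear solve would be replaced by a numerical optimizer started at $\hat \theta_{n-1}$.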

Later, we'll experiment with recency bias and rank reductions for $H_i$.