# memory as geometry 

Under asymptotic analysis and mild regularity assumptions, all statistical models have a Lagrangian-form regularizer equivalent to memory \[1\]. 
We will mathematically generalize this concept to show a breadth of geometries equivalent to memory. 
As a corollary, we'll construct universal sufficient statistics. 
Finally, given large models' recent immense productivity, it is natural to expect hardware limitations to soon motivate miniaturization (TODO CITE). 
We'll leverage memory's geometric properties to construct miniaturization techniques for deep learning, including 
memory merges, regularizers as memory, and a _frontal lobe_ concept. 

## result: Lagrangian-regularized estimates are MLEs 

The MLE's powerful theoretical guarantees provide an important foundation for Deep Learning's success. 
So, we should be curious as to when our new estimation paradigms are equivalent to MLEs. 
Theory is a guide to the statistical researcher. 

**LEMMA 1:** 

Let the following solution exist uniquely. 

$$ \hat \theta_L = \arg \max_{\theta \in \Theta} n^{-1} \log f_X(X;\theta) - \lambda g(\theta) $$

Then there exists a linear subspace $H \subset \Theta$ 
such that $\hat \theta_L$ is the solution to the following optimization program. 

$$ \hat \theta_L = \arg \max_{\theta \in H} \log f_X(X; \theta) $$

**PROOF:** 

Define $\hat{\mathcal{L}} := n^{-1} \log f_X(X;\theta) - \lambda g(\theta)$. 
Since $\hat \theta_L = \arg \max_{\theta \in \Theta} \hat{\mathcal{L}}$ exists uniquely 
and $\log f_X$ and $g$ are differentiable, we are guaranteed that 

1. $\nabla_\theta \hat{\mathcal{L}} = 0$ and 
2. $\nabla_\lambda \hat{\mathcal{L}} = 0$. 

There exists $c = g(\hat \theta_L)$ 
and $\mathcal{L} := n^{-1} \log f_X(X;\theta) - \lambda( g(\theta) - c )$ such that 

3. $\nabla_\theta \hat{\mathcal{L}} = \nabla_\theta \mathcal{L} = 0$ and 
4. $\nabla_\lambda \mathcal{L} = -\left( g(\theta) - c \right) = 0$. 

By (4), $\theta \in g^{-1}(c)$. 
For $\theta$ sufficiently near $\hat \theta_L$, $g^{-1}(\theta) \approx \left( J_{g, \hat \theta_L} \right)^{-1} \left(\theta - \hat \theta_L \right)$ by the inverse function theorem (TODO: check calcs, and use Constant Rank Theorem instead). 
Define this local linear subspace as $H := \left\{ \eta \; : \; \exists \theta \; \text{s.t.} \; \eta = \left( J_{g, \hat \theta_L} \right)^{-1} \left(\theta - \hat \theta_L \right) \right\}$, or $H$-space. 

Since $\nabla_\theta \hat{\mathcal{L}} = 0$ also holds, 
$\mathcal{L}$ is a Lagrangian constraining $\theta$ to the linear subspace $H \subset \Theta$. 

$\square$
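The Lagrangian step of the proof can be checked numerically. The following is a minimal sketch, not the author's method: it assumes a Gaussian linear model with unit noise variance and the ridge penalty $g(\theta) = \theta^T \theta$ (both chosen here purely for illustration), and it checks only the level-set constraint $g(\theta) = c$, not the local linearization to $H$. All variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, lam = 200, 3, 0.5

# Assumed model for illustration: y = X theta + unit-variance Gaussian noise.
X = rng.normal(size=(n, p))
theta_true = np.array([1.0, -2.0, 0.5])
y = X @ theta_true + rng.normal(size=n)

def avg_loglik(theta):
    """n^{-1} log f_X(X; theta), up to an additive constant."""
    resid = y - X @ theta
    return -0.5 * np.mean(resid ** 2)

def g(theta):
    """Assumed regularizer: squared Euclidean norm."""
    return theta @ theta

# Penalized estimate: maximize avg_loglik(theta) - lam * g(theta).
# First-order condition: X^T (y - X theta) / n - 2 * lam * theta = 0.
theta_L = np.linalg.solve(X.T @ X / n + 2.0 * lam * np.eye(p), X.T @ y / n)
c = g(theta_L)

# Stationarity of the Lagrangian: grad loglik = lam * grad g at theta_L.
grad_loglik = X.T @ (y - X @ theta_L) / n
grad_g = 2.0 * theta_L
stationarity_gap = np.linalg.norm(grad_loglik - lam * grad_g)

# Random points on the level set g(theta) = c never beat theta_L
# on the unpenalized log-likelihood.
directions = rng.normal(size=(5000, p))
unit = directions / np.linalg.norm(directions, axis=1, keepdims=True)
on_level_set = np.sqrt(c) * unit
best_on_level_set = max(avg_loglik(t) for t in on_level_set)

print("stationarity gap:", stationarity_gap)
print("loglik at theta_L:", avg_loglik(theta_L))
print("best loglik found on level set:", best_on_level_set)
```

In this concave setting the penalized estimate also maximizes the unpenalized likelihood over $g^{-1}(c)$, which is the constrained program the proof then linearizes.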

**RESULT 1:** 

If $\hat \theta_{MLE}:= \arg\max_{\theta \in \Theta} \log f_X(X;\theta) $, 
then $\hat \theta_L$ is an MLE. 

**PROOF:** 

Since $\hat \theta_{MLE}$ is an MLE, $f_X$ is a likelihood function. 
By LEMMA 1, the estimation program is constrained to a well-defined linear subspace $H \subset \Theta$. 
Hence $\hat \theta_L$ is indeed an MLE. 

$\square$ 

TODO: Clarify $J$. 

**COROLLARY 1:** 

Define $J := \left[ \partial \theta_0 / \partial \eta_j \right]$, 
$J^+$ to be the pseudo-inverse matrix, 
and $\mathcal{I}_H := J^+ \mathcal{I}_{\theta_0} J^{+T}$. 

If $X_i \sim  f_X(x; \theta_0)$ and $\hat \theta_L$ is sufficiently near $\hat \theta_{MLE}$ and $n$ sufficiently large, 
then $\hat \theta_L$ has a corresponding estimate $\hat \eta$ in the $H$-space basis,  
and $\sqrt{n} \left( \hat \eta - J^+ \theta_0 \right) \sim N(0, \mathcal{I}_H )$. 

**PROOF:** 

$0 = \nabla_\theta \log f_X(X; \theta) $

$\approx \nabla_\theta \log f_X(X; \theta_0) - \left( \theta - \theta_0 \right)^T \nabla_\theta^T \nabla_\theta \log f_X(X; \theta_0) $ by Taylor expansion 

$ \Rightarrow \sqrt{n}^{-1} \nabla_\theta \log f_X(X; \theta_0) \approx \frac{\sqrt{n}}{n} \left( \theta - \theta_0 \right)^T \nabla_\theta^T \nabla_\theta \log f_X(X; \theta_0) $

$ \Rightarrow N \mathcal{I}_{\theta_0}^{1/2} =_d \sqrt{n} \left( \theta - \theta_0 \right)^T \mathcal{I}_{\theta_0} $ 
where $=_d$ is equivalence in distribution and $N \sim N\left(0, I_{\dim(\Theta)} \right)$ 

Apply LEMMA 1 by parameterizing $H$-space via $\theta \gets J \eta$.

$ \Rightarrow \sqrt{n} \left( J \eta - \theta_0 \right) \sim N(0, \mathcal{I}_{\theta_0}) $

$ \Rightarrow \sqrt{n} \left( \eta - J^+ \theta_0 \right) \sim N(0, J^+ \mathcal{I}_{\theta_0} J^{+T} ) $ 

$\square$ 
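A small simulation can illustrate the statement. This is a sketch under strong assumptions that are not in the text: the model is $X_i \sim N(\theta_0, I_3)$, so that $\mathcal{I}_{\theta_0}$ is the identity, and $J$ is a hypothetical $3 \times 2$ full-column-rank matrix with $\theta_0 = J \eta_0$. Under this model the MLE constrained to $\theta = J \eta$ is the least-squares solution $\hat \eta = J^+ \bar X$.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical basis for H-space (full column rank), chosen for illustration.
J = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0]])
J_pinv = np.linalg.pinv(J)

eta_0 = np.array([0.5, -1.0])
theta_0 = J @ eta_0  # true parameter lies in the span of J

n, reps = 50, 20000

# Assumed model: X_i ~ N(theta_0, I_3), so the Fisher information is I_3.
X = theta_0 + rng.normal(size=(reps, n, 3))
x_bar = X.mean(axis=1)  # one sample mean per replication

# Constrained MLE in H-space coordinates: eta_hat = J^+ x_bar.
eta_hat = x_bar @ J_pinv.T

# Scaled estimation error, sqrt(n) * (eta_hat - J^+ theta_0).
scaled_error = np.sqrt(n) * (eta_hat - J_pinv @ theta_0)
empirical_cov = np.cov(scaled_error, rowvar=False)

# Covariance given by the corollary, with identity Fisher information.
fisher = np.eye(3)
predicted_cov = J_pinv @ fisher @ J_pinv.T

print("empirical covariance:\n", empirical_cov)
print("J^+ I J^{+T}:\n", predicted_cov)
```

The two matrices should agree up to Monte Carlo error. The check says nothing about models whose Fisher information is not the identity.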

Since $J^+$ is also a projection matrix, the distribution of $\hat \eta$ is equivalent to the $\hat \theta_{MLE}$ projection, 
$\sqrt{n} J^+ \left( \hat \theta_{MLE} - \theta_0 \right) \sim N(0, \mathcal{I}_H )$. 

**COROLLARY 2:** 

$\text{rank}(J^+) > 0 \Rightarrow \mathcal{I}_H < \mathcal{I}_{\theta_0}$ 

**PROOF:** 

Since $\mathcal{I}_{\theta_0} > 0$ (positive definite), there is a matrix $A$ such that $\mathcal{I}_{\theta_0} = AA^T $. 
Since $J^+$ is a projection matrix, we have the following.

$ \mathcal{I}_{\theta_0} = \mathcal{I}_H + (I - J^+)AA^TJ^{+T} + J^+AA^T(I - J^+)^T + (I - J^+)AA^T(I - J^+)^T$ 

$ = \mathcal{I}_H + 0 + 0 + (I - J^+)AA^T(I - J^+)^T $ by linear independence of complementary projections 

Recognizing $(I - J^+)AA^T(I - J^+)^T > 0$ completes the proof. 

$\square$ 

If $\hat \theta_L \not= \hat \theta_{MLE}$ for all $n$, then $\theta_0 \not \in H$, 
and we may interpret $H$ as the tangent space of a biased sub-manifold in $\Theta$. 
However, since dimensional reductions cause $\mathcal{I}_H < \mathcal{I}_{\theta_0}$, 
a sufficiently-unbiased manifold produces more-efficient estimators, 
particularly so if $\theta_0 \in \lim_{n \to \infty} H$. 

## result: elliptical sub-manifolds in $\Theta$-space are memories

Define a _memory_ as all information associated with samples $X_1, X_2, \ldots, X_n$ under a model. 
Concretely, we'll use approximate MLE sufficient statistics $\left( \hat \theta, \hat{\mathcal{I}} \right)$.
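As a concrete illustration of such a pair, the sketch below computes $\left( \hat \theta, \hat{\mathcal{I}} \right)$ for a logistic regression model fit by Newton's method. The model, the simulated data, and the function name `fit_memory` are assumptions made for this example only, and $\hat{\mathcal{I}}$ is taken to be the observed information summed over the sample.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_memory(X, y, n_iter=25):
    """Return (theta_hat, info_hat) for logistic regression via Newton's method.

    info_hat is the observed information at theta_hat, summed over samples.
    """
    theta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        prob = sigmoid(X @ theta)
        score = X.T @ (y - prob)                      # gradient of log-likelihood
        weights = prob * (1.0 - prob)
        info = X.T @ (X * weights[:, None])           # negative Hessian
        theta = theta + np.linalg.solve(info, score)  # Newton step
    prob = sigmoid(X @ theta)
    info = X.T @ (X * (prob * (1.0 - prob))[:, None])
    return theta, info

# Simulated data, for illustration only.
rng = np.random.default_rng(0)
n, p = 500, 3
X = rng.normal(size=(n, p))
theta_true = np.array([0.5, -1.0, 0.25])
y = (rng.uniform(size=n) < sigmoid(X @ theta_true)).astype(float)

theta_hat, info_hat = fit_memory(X, y)
final_score = X.T @ (y - sigmoid(X @ theta_hat))

print("theta_hat:", theta_hat)
print("info_hat:\n", info_hat)
```

The pair `(theta_hat, info_hat)` is what the rest of this section treats as a memory of the samples.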

WLOG, assume $\mathcal{I}_{\theta_0} > 0$. 
If not, perfect correlations exist, so the model can be reparameterized to a lower dimension. 
$\mathcal{I}_{\theta_0} > 0 \Rightarrow \mathcal{I}_{\theta_0} = AA^T \; \& \; \exists A^{-1}$. 
For all $B \in \mathbb{R}^{p \times q}$ where $p = \dim(\Theta)$ and $q \leq p$, 
there exists $J = A^{-1}B$ mapping from an unbiased, linear $H$-space to $\Theta$-space, 
and $(\theta - \theta_0)^T AA^T (\theta - \theta_0) = (\eta - \eta_0)^T J^T AA^T J (\eta - \eta_0) = (\eta - \eta_0)^T BB^T (\eta - \eta_0)$.
This is true for any choice of $B$, so it covers all PSD matrices $BB^T \geq 0$. 

Choosing $g(\theta) = (\theta - \theta_0)^T BB^T (\theta - \theta_0)$ and applying LEMMA 1, 
we recognize the existence of an elliptical sub-manifold in $\Theta$-space corresponding to the Lagrangian-form regularizer $g(\theta)$. 
Hence, there exists an MLE with $\mathcal{I} = BB^T$ when estimation is constrained to a subspace with $J = A^{-1}B$. 
So, for every elliptical sub-manifold with Riemannian metric $\mathcal{I} = BB^T$, there exists another linear sub-manifold $H$ 
such that $\arg\max_{\theta \in H} \log f_X(X;\theta)$ has approximate sufficient statistics $(\hat \theta, BB^T)$. 

Since we interpret $(\hat \theta, BB^T)$ as a memory, 
every elliptical sub-manifold $(\eta - \eta_0)^T BB^T (\eta - \eta_0)$ 
has an associated memory. 

TODO: look for a 1-1 correspondence, tying each manifold to a unique memory. 
Example consideration: $(\hat \eta, BB^T)$ and $(\eta - \hat \eta)^T BB^T (\eta - \hat \eta)$. 

## frontal lobe 

In humans, the frontal lobe operates on a longer-term timeline than hippocampus-based short-term memory, 
but on a shorter horizon than the remaining cortical regions, 
so it functions as a mid-term horizon or _working space_ (TODO: CITE!). 
Miniaturization requirements on deep learning impose parameter space restrictions, 
which rule out unbounded growth strategies like `net2net` (TODO: CITE!). 
Instead, we'll allocate dimensional parameter space to a flex capacity, 
which operates in a similar capacity to a biological frontal lobe. 
Mathematically, we'll achieve this by filling all dimensional space with information, 
then periodically reducing information stored in the frontal lobe parameters. 
The net result is a temporary increase in dimensional space available to the model, 
which routinely moves information into long-term memory. 
This results in the following two-step optimization procedure.

1. $\hat \theta = \arg \max_\theta \; n_B^{-1} \log f_X(X; \theta) - n_A^{-1} (\theta - \hat \theta_A)^T \hat{\mathcal{I}}_{\hat \theta_A} (\theta - \hat \theta_A)$
2. $\hat \theta = \arg \max_\theta \; n_B^{-1} \log f_X(X; \theta) - n_A^{-1} (\theta - \hat \theta_A)^T \hat{\mathcal{I}}_{\hat \theta_A} (\theta - \hat \theta_A) - \lambda \| \hat{\mathcal{I}}_\theta \|_{FL}$

(TODO: check sample sizes)
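Step 1 can be sketched in closed form for a simple model. The example below assumes a Gaussian linear model with unit noise variance, an earlier sample $A$ summarized by the memory $(\hat \theta_A, \hat{\mathcal{I}}_{\hat \theta_A})$, and a new sample $B$. Taking $\hat{\mathcal{I}}_{\hat \theta_A} = X_A^T X_A$ (information summed over sample $A$) is an assumption, as are all variable names. Step 2 is not sketched because the $\| \cdot \|_{FL}$ norm is not yet defined here.

```python
import numpy as np

rng = np.random.default_rng(2)
p = 3
theta_true = np.array([1.0, -0.5, 2.0])

def simulate(n):
    """Assumed model: y = X theta + unit-variance Gaussian noise."""
    X = rng.normal(size=(n, p))
    y = X @ theta_true + rng.normal(size=n)
    return X, y

n_A, n_B = 400, 40
X_A, y_A = simulate(n_A)
X_B, y_B = simulate(n_B)

# Memory of sample A: MLE and (assumed) summed information matrix.
info_A = X_A.T @ X_A
theta_A = np.linalg.solve(info_A, X_A.T @ y_A)

def objective(theta):
    """Step 1 objective: n_B^{-1} log f(X_B; theta) minus the memory penalty."""
    loglik_B = -0.5 * np.sum((y_B - X_B @ theta) ** 2)
    delta = theta - theta_A
    return loglik_B / n_B - delta @ info_A @ delta / n_A

# The objective is a concave quadratic, so its maximizer solves the
# first-order condition as a linear system.
lhs = X_B.T @ X_B / n_B + 2.0 * info_A / n_A
rhs = X_B.T @ y_B / n_B + 2.0 * info_A @ theta_A / n_A
theta_hat = np.linalg.solve(lhs, rhs)

# Gradient of the objective at theta_hat, which should vanish.
grad = (X_B.T @ (y_B - X_B @ theta_hat) / n_B
        - 2.0 * info_A @ (theta_hat - theta_A) / n_A)

print("theta_A:", theta_A)
print("theta_hat:", theta_hat)
print("gradient norm at theta_hat:", np.linalg.norm(grad))
```

For models without a closed form, the same objective could be passed to a generic gradient-based optimizer.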

As we'll illustrate (TODO!), a frontal lobe is practically motivated in deep learning, 
because freezing all parameters in memory effectively consumes all dimensional space. 
By regularly relieving dimensional space, a necessary working space can be retained. 

## references 

\[1\] Kirkpatrick et al. (2017) "Overcoming catastrophic forgetting in neural networks", PNAS