# Sufficiency and Efficiency in Deep Reinforcement Learning

_W. Evan Durno, 2024_

This work produces miniaturization and optimal efficiency results for reinforcement learning (RL) with deep neural networks. 
Miniaturization is achieved by modernizing the definition of sufficient statistics to accommodate the needs of deep learning, 
thus producing theoretically infinite storage under the right circumstances. 
The optimal efficiency result is achieved by adjusting the Cramer-Rao Lower Bound (CRLB) away from classical statistics 
and toward the needs of RL, where sampling distributions can change. 
Efficiency is important, because it means we're getting the most out of our data. 
Results are theoretically general in that they are applicable to much of deep learning, 
and supported by observations from two different games. 

The key mathematical premise is that RL agents change their sampling distributions as they learn. 
This violates the simple random sampling assumptions of classical statistics, 
thereby making old results less applicable. 
To account for this change, this work introduces the concept of a sufficient statistic regularizer (SSR), 
an approximately correct sufficient statistic applicable to deep learning. 
Further, we adjust the CRLB to account for continuously deforming sampling distributions, 
and discover SSRs are used to yield optimally efficient data utilization in RL. 

## Introduction 

The success of modern deep learning depends substantially on massive compute tasks. 
This can be seen in multi-million dollar model-fitting costs \[7\] of large language models (LLMs). 
With compute at this scale, the next frontier of progress is likely to lie in the pursuit of efficiency. 
This work contributes to efficiency by revisiting older statistical concepts in need of modernization: 
sufficient statistics and statistical efficiency via the Cramer Rao Lower Bound (CRLB). 
Sufficient statistics provide miniaturization benefits by bounding computational 
processing and storage requirements. 
Statistical efficiency is optimized by analyzing the CRLB in the Reinforcement Learning (RL) context, 
ensuring we get the best possible value from our data. 

This work's applied targets focus on RL 
because it's our current best bet in delivering artificial general intelligence (AGI) \[9\]. 
Further, RL \[12: chapter 11 "Markov Decision Processes"\] is new enough that there is room for impactful theoretical development. 
This is indeed the case, as following the needs of RL challenges foundational assumptions of statistics, 
and invites development of an AI-flavored theory of statistics. 

In 1922, a world before an explosion of computational resources, 
R. A. Fisher described the chief task of statistics as "reduction of data," 
and promptly introduced his formalization of _sufficient statistics_ as 
one of a few primary tools for achieving this \[8\].
This work brings sufficient statistics into the modern era through approximate 
methods making them computationally tractable for deep learning models. 
The result carries Fisher's vision forward by ultimately setting limits 
on the amount of data that must be stored and processed in the handling 
of modern models. 
I call this method the _sufficient statistic regularizer_ (SSR). 
The method isn't entirely novel \[10, 11\], 
but recognizing its approximate sufficiency is. 
Exponential family sufficient statistics are known to produce finite-dimensional 
storage opportunities \[13\] which are yielded here, 
so a miniaturization opportunity exists - this is how theoretically infinite storage is achieved. 
However, the bounds are large, so applications may be patchy or require 
more-intensive data processing before they are useful. 
Regardless, developing SSRs sets a useful framing for describing my 
immediately-applicable contributions to the CRLB. 

The CRLB \[14, 15\] was developed in a statistical era focussed on experimentalist science, 
and so related theory depends on a simple random sample assumption, 
frequently called the _independent and identically distributed_ (_iid_) assumption. 
The iid assumption has served us well as a foundation for building powerful statistical theories 
ultimately allowing modern AI to occur. 
The maximum likelihood estimate (MLE) is a good example of this \[8, 16\]. 
However, the further we follow AI's arc, the further we stray from simple random samples. 
RL agents learn as samples are generated and thus change their sampling distributions as sampling occurs. 
So, we must contend with drifting measures $\mathbb{P}_t, \mathbb{P}_{t+1}, \ldots$, 
and reconcile their differences usefully. 
Despite explicitly violating the _iid assumption_, 
we explore how to combine datasets usefully and optimally. 
The result is an RL-adjusted version of the CRLB, 
built on independence assumptions but _less identical_ distributions. 

**Formalizing transitions via a mixture model** 

This work ultimately updates the CRLB for RL by accommodating the need for continuously deforming distributions, 
$\mathbb{P}_t, \mathbb{P}_{t+1}, \ldots$.
To achieve this, we reasonably assume our deep learning model can accommodate the complexity of each measure $\mathbb{P}_t$, 
so we instead work with a single measure $\mathbb{P}$ and a series of parameter points $\theta_t, \theta_{t+1}, \ldots$ in our parameter space $\Theta \subset \mathbb{R}^p$. 
To simplify analysis, we'll only focus on a single transition at a time $(\theta_t, \theta_{t+1}) = (\theta_A, \theta_B)$. 
We assume we are free to sample from each point as needed, without a change of distribution.  

To coherently apply the theory of experimentalist statistics to RL's changing distributions, 
we use a mixture model. 
For each observation $X_i$, we assume it is selected randomly from either $A$ or $B$, 
despite the model sampling sequentially from $A$ then $B$. 
This hand-off facilitates a change of distribution while using older statistical machinery which 
assumes a single distribution. 

1. $X_i = M_i X_{B_i} + (1 - M_i) X_{A_i}$
2. $i \in \{1, 2, \ldots, n\}$
3. $\mathbb{P}[M_i = 1] = \pi = 1 - \mathbb{P}[M_i = 0]$
4. $X_{A_i} \sim f_{X_1} (x ; \theta_A) $
5. $X_{B_i} \sim f_{X_1} (x ; \theta_B) $

Despite being a mixture model, our sampling procedure reveals $M_i$. 
We choose when to update our model and when to sample, 
so we know which version generated which data. 
So, our analysis is effectively under the $\mathbb{P}_\theta[\; \cdot \;| M]$ measure. 
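
A minimal simulation sketch of this mixture with revealed labels may help fix ideas. It assumes, purely for illustration, a unit-variance Gaussian for $f_{X_1}$ and made-up values for $\theta_A$, $\theta_B$, and $\pi$; the function name `sample_mixture` is hypothetical.

```python
import numpy as np

def sample_mixture(n, theta_a, theta_b, pi, rng):
    """Draw X_i = M_i * X_B_i + (1 - M_i) * X_A_i and return the labels M_i too."""
    m = rng.random(n) < pi                      # M_i = 1 with probability pi
    x_a = rng.normal(theta_a, 1.0, size=n)      # X_A_i, assumed Gaussian for this sketch
    x_b = rng.normal(theta_b, 1.0, size=n)      # X_B_i, assumed Gaussian for this sketch
    x = np.where(m, x_b, x_a)
    return x, m.astype(int)

rng = np.random.default_rng(0)
x, m = sample_mixture(10_000, theta_a=0.0, theta_b=0.5, pi=0.2, rng=rng)

# Because M is observed, each version of the model is estimated from its own samples.
theta_a_hat = x[m == 0].mean()
theta_b_hat = x[m == 1].mean()
n_b = int(m.sum())
```

Conditioning on $M$ is what lets each subsample be treated as _iid_ from its own parameter point.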

This estimation paradigm accommodates the non-_iid_ nature of RL 
by assuming _iid_ sampling per version of the model. 
This opens up questions around how to optimally combine data. 
In RL applications, it'll be typical to have a lot of data, 
but each $n_t$ will be small. 
For example, if $n_B$ is small, 
then any unbiased estimator will have variance of at least $\mathcal{I}^{-1}/n_B$, 
so it may be preferable to accommodate a little bias by leveraging all prior data. 
We call the combined estimate $\hat \theta_{AB}$. 
Since $n_B$ is small and $n$ quite large, 
we should expect that $\text{MSE}(\hat \theta_B) > \text{MSE}(\hat \theta_{AB})$ holds generally, 
provided $\theta_A$ and $\theta_B$ are not too far apart. 
Below, I build out the theory proving optimality of $\hat \theta_{AB}$. 

**Structure of this document** 

This document has opened with a description of the problem 
and relevant literature, and has described our fundamental 
mathematical framework. 

The following section describes a modernized perspective on sufficient statistics, via SSRs. 
SSRs bring sufficient statistic theory into the deep learning era via 
an approximate framework, instead of the usual exact one. 
SSRs are built-out because they provide an appropriate framing for my efficiency result. 

After describing SSRs, the CRLB is applied to our mixture process, 
thereby building a framework for optimally efficient use of data in RL. 
The result is derived mathematically, then verified with two experiments. 
The first experiment applies this RL-adjusted CRLB to a very simple game: 
Cart Pole with a 4-dimensional state space. 
The second experiment uses offline RL and real-world robotics. 

The discussion shares a few perspectives, 
and follow-up research opportunities. 

The appendix covers a limited-memory version of the Lanczos algorithm, 
which can be an eigenvector algorithm with small modifications. 
It calculates a basis of an $r$-dimensional Krylov space for a covariance matrix 
$\hat\Sigma \in \mathbb{R}^{p \times p}$ without actually storing the whole matrix in-memory. 
For $n$ sample vectors estimating $\hat \Sigma$, the algorithm can calculate the basis 
in $O(nrp)$ time and using about $O(rp)$ space. 
This gets around the typically prohibitive $O(p^2)$ space requirement of deep 
learning models. 
I use this algorithm to provide computationally tractable Fisher Information matrices. 
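
As a rough illustration of the memory argument (and not the Appendix A algorithm itself), the sketch below builds an orthonormal Krylov basis using only matrix-vector products against $\hat\Sigma = n^{-1} \sum_i G_i G_i^T$, never forming the $p \times p$ matrix. The names `cov_vec_product` and `krylov_basis` are hypothetical, the sketch uses plain Gram-Schmidt rather than the Lanczos recurrence, and it holds the sample vectors in one array only for brevity.

```python
import numpy as np

def cov_vec_product(grads, v):
    """Compute (n^{-1} sum_i G_i G_i^T) v without forming the p x p matrix."""
    return grads.T @ (grads @ v) / grads.shape[0]

def krylov_basis(grads, r, rng):
    """Orthonormal basis of span{v, Sv, ..., S^{r-1} v} using O(rp) extra space."""
    p = grads.shape[1]
    q = rng.normal(size=p)
    basis = [q / np.linalg.norm(q)]
    for _ in range(r - 1):
        w = cov_vec_product(grads, basis[-1])
        for b in basis:                 # modified Gram-Schmidt against earlier vectors
            w -= (w @ b) * b
        basis.append(w / np.linalg.norm(w))
    return np.stack(basis)              # shape (r, p)

rng = np.random.default_rng(0)
grads = rng.normal(size=(100, 12))      # stand-in sample vectors, n = 100, p = 12
Q = krylov_basis(grads, r=4, rng=rng)
```

Each of the $r$ products costs $O(np)$, which is where the $O(nrp)$ time figure comes from.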

## SSRs 

**Sufficient statistics as regularizers** 

Data reduction is the purpose of sufficient statistics. 
This becomes particularly apparent when studying their behavior under maximum likelihood estimation. 
The Fisher-Neyman definition of a sufficient statistic $T(x)$ is that there 
exist $g$ and $h$ for density $f$ such that $f(x; \theta) = h(x) g(T(x) ; \theta)$. 
Notice that under maximum likelihood estimation, the $h$ term becomes irrelevant, leaving only $T(x)$, 
thereby providing an opportunity to reduce dimensionality of all data stored. 

$ \hat \theta = \arg \max_\theta f(x; \theta) = \arg \max_\theta \log f(x; \theta) $
$ = \arg \max_\theta \log h(x) + \log g(T(x); \theta) = \arg \max_\theta \log g(T(x); \theta)$ 

For example, dataset $x$ may require $O(n)$ space to store, but for $\theta \in \mathbb{R}^p$, 
it's common that $T$ only requires $O(p)$ storage space. 
So, for a model with fixed parameter dimension, 
it is reasonably possible to have theoretically infinite data storage in a finite space. 
Of course, the amount of information truly stored will be practically bounded. 
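
To make the storage claim concrete, here is a small sketch for a univariate Gaussian, where $T(x)$ is the count, sum, and sum of squares. The class name `GaussianSufficientStat` is hypothetical and the Gaussian model is an assumption chosen only because its sufficient statistic is exact.

```python
import numpy as np

class GaussianSufficientStat:
    """Running T(x) = (count, sum, sum of squares) for a univariate Gaussian."""
    def __init__(self):
        self.n = 0
        self.s1 = 0.0
        self.s2 = 0.0

    def update(self, batch):
        batch = np.asarray(batch, dtype=float)
        self.n += batch.size
        self.s1 += batch.sum()
        self.s2 += np.square(batch).sum()

    def mle(self):
        mean = self.s1 / self.n
        var = self.s2 / self.n - mean ** 2   # MLE variance (divides by n)
        return mean, var

rng = np.random.default_rng(1)
data = rng.normal(2.0, 3.0, size=5_000)

stat = GaussianSufficientStat()
for chunk in np.array_split(data, 50):       # chunks arrive, are absorbed, then can be discarded
    stat.update(chunk)

mean_hat, var_hat = stat.mle()
```

The three stored numbers reproduce the MLE from the full dataset no matter how many chunks are absorbed.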

For the purposes of this work, it is convenient to view sufficient statistics as regularizers. 
In situations where new data is added to old, like RL, this is particularly relevant. 
For example, with old data $X_A$ packed into $T_A = T(X_A)$ supplanting $\hat \theta_A$, 
we may add data $X_B$ ultimately producing $T_B = T(X_B)$ sufficiently estimating $\theta_B$. 

$\hat \theta_B = \arg\max_\theta f_X(X; \theta) = \arg \max_\theta \log f_X(X_B;\theta) + \log f_X(X_A;\theta) $
$ = \arg\max_\theta \log f_X(X_B;\theta) + \log g(T(X_A); \theta)$

Here, $X$ is the vector containing both $X_A$ and $X_B$, concatenated. 

If we then insert a scalar multiple $\lambda$, we recover the familiar regularizer form. 

$\hat \theta_B = \arg\max_\theta \log f_X(X_B;\theta) + \lambda \log g(T(X_A); \theta) $

Let this form of regularizer be the SSR. 
Like other regularizers, $\lambda \log g(T(X_A); \theta)$ causes $\theta$ to 
stay near some point, here $\theta_A$. 
For example and without loss of generality (WLOG), $\lambda \| \theta \|^2$ centers $\theta$ on zero. 

**Universal SSRs via approximation** 

Data reduction via sufficient statistics is like compression. 
A breadth of theory has been developed for lossless compression. 
Here, we'll highlight the benefits of lossy compression 
by relaxing our sufficient statistic definition to accommodate an approximation. 
Define $X_n \approx_{a.s.} Y$ to mean $\lim_{n \to \infty} X_n = Y \; a.s.$, 
and $X_n \approx_{\mathbb{P}} Y$ to mean $\lim_{n \to \infty} X_n = Y$ in $\mathbb{P}$.
Then, instead of defining $T$ sufficient if $f_X(x; \theta) = h(x) g(T(x); \theta)$, 
we'll accept $f_X(x; \theta) \approx_{\mathbb{P}} h(x) g(T(x); \theta)$ for large sample size $n$. 

With this relaxation, we are free to create approximate universal sufficient statistics, 
requiring only enough regularity that the central limit theorem (CLT) 
applies and makes the log likelihood approximately normally distributed. 

First, recognize that the following Taylor expansion is accurate for $\theta$ near $\theta_A$. 

$n^{-1} \log f_X(X;\theta) \approx n^{-1}\log f_X(X; \theta_A) $
$ + (\theta - \theta_A)^T n^{-1} \nabla_\theta \log f_X(X; \theta_A)$
$ + 2^{-1}(\theta - \theta_A)^T n^{-1} \left( \nabla_\theta^2 \log f_X(X; \theta_A) \right) (\theta - \theta_A) $  

$\approx_{a.s.} \mathbb{E} \log f_X(X_1; \theta_A) + (\theta - \theta_A)^T \mathbb{E} \nabla_\theta \log f_X(X_1; \theta_A)$ 
$ + 2^{-1} (\theta - \theta_A)^T \left( \mathbb{E} \nabla_\theta^2 \log f_X(X_1;\theta_A) \right) (\theta - \theta_A) $ 
(apply the strong law of large numbers (SLLN))

$ = \mathbb{E} \log f_X(X_1; \theta_A) + 0 - 2^{-1} (\theta - \theta_A)^T \mathcal{I}_{\theta_A} (\theta - \theta_A) $

Here, we utilize further approximations:
- $\hat{\mathcal{I}}(X_A) = \hat{\mathcal{I}} = n^{-1}\sum_{i=1}^n G_i G_i^T \approx_{a.s.} \mathcal{I}$ 
where $G_i = \nabla_\theta \log f_X(X_i; \theta_A) $ 
- $\hat \theta_A(X_A) = \hat \theta_A = \arg \max_\theta f_X(X_A;\theta) \approx_{\mathbb{P}} \theta_A$

Then, with $n_A$ the sample size of $X_A$, multiplying by $n_A$ to undo the $n^{-1}$ normalization gives the following. 

$\log f_X(X_A; \theta) \approx n_A \mathbb{E} \log f_X(X_1; \theta_A) - n_A 2^{-1} (\theta - \theta_A)^T \mathcal{I}_{\theta_A} (\theta - \theta_A)$

$\approx_{\mathbb{P}} n_A \mathbb{E} \log f_X(X_1; \theta_A) - n_A 2^{-1} (\theta - \hat \theta_A(X_A))^T \hat{\mathcal{I}}(X_A) (\theta - \hat \theta_A(X_A))$

This approximation is not yet useful, as it still depends on $\theta_A$ through its constant term. 
So, we must apply it in maximum likelihood estimation to drop the $n_A \mathbb{E} \log f_X(X_1; \theta_A)$ term. 
Here, we add new data $X_B$ to the sample, while approximately retaining all information of $X_A$ 
in $T(X_A) = \left(\hat \theta_A(X_A), \hat{\mathcal{I}}(X_A) \right)$.

$\hat \theta_B = \arg\max_\theta f_X(X; \theta) = \arg\max_\theta \log f_X(X_B; \theta) + \log f_X(X_A;\theta) $

$ \approx_{\mathbb{P}} \arg\max_\theta \log f_X(X_B; \theta) $
$+ n_A \mathbb{E} \log f_X(X_1; \theta_A) - n_A 2^{-1} (\theta - \hat \theta_A(X_A))^T \hat{\mathcal{I}}(X_A) (\theta - \hat \theta_A(X_A)) $

$ = \arg\max_\theta \log f_X(X_B; \theta) - n_A 2^{-1} (\theta - \hat \theta_A(X_A))^T \hat{\mathcal{I}}(X_A) (\theta - \hat \theta_A(X_A)) $

We thus identify the universal SSR $2^{-1} (\theta - \hat \theta_A(X_A))^T \hat{\mathcal{I}}(X_A) (\theta - \hat \theta_A(X_A))$ 
and its regularization parameter _natural value_ $\lambda = n_A$. 

Through the lens of SSRs, we find the regularization parameter $\lambda$ is just a ratio of sample sizes. 

Note that the discovery of this particular regularizer is not novel \[6\], 
but recognizing that it approximates a sufficient statistic is.
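
The following sketch checks the natural value $\lambda = n_A$ on a toy problem. It assumes a Gaussian mean model with known unit variance, so the penalized objective has a closed-form maximizer, and it plugs in the exact Fisher information $1/\sigma^2$ rather than the empirical $\hat{\mathcal{I}}$. The function name `ssr_estimate` is hypothetical.

```python
import numpy as np

def ssr_estimate(x_b, theta_a_hat, fisher, lam, sigma2=1.0):
    """Maximize log f(x_b; theta) - lam/2 * fisher * (theta - theta_a_hat)^2
    for a Gaussian mean model with known variance sigma2 (closed form)."""
    n_b = x_b.size
    precision_b = n_b / sigma2                 # curvature of the X_B log likelihood
    return (x_b.sum() / sigma2 + lam * fisher * theta_a_hat) / (precision_b + lam * fisher)

rng = np.random.default_rng(2)
x_a = rng.normal(1.0, 1.0, size=400)
x_b = rng.normal(1.0, 1.0, size=50)

theta_a_hat = x_a.mean()       # first component of T(X_A)
fisher = 1.0                   # exact Fisher information 1/sigma^2 for this toy model
theta_ssr = ssr_estimate(x_b, theta_a_hat, fisher, lam=x_a.size)
theta_pooled = np.concatenate([x_a, x_b]).mean()   # MLE with all raw data retained
```

In this toy case the quadratic expansion is exact, so the SSR estimate with $\lambda = n_A$ coincides with the pooled MLE; for deep models the match would only be approximate.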

**Miniaturization** 

Storing $\hat \theta(X_A)$ takes $O(p)$ space and $\hat{\mathcal{I}}(X_A)$ takes $O(p^2)$. 
So, while technically finite relative to $O(n)$, the sheer size of modern deep learning models 
makes any $O(p^2)$ storage requirement infeasible. 
So, this work leverages approximations to $\mathcal{I}$ 
that remain practically effective while fitting in $O(p)$ space. 
- The simplest approximation is only storing diagonal terms, and zeroing all others. 
- A better (but usually unnecessary) approach is leveraging a Krylov estimate to 
approximate the major eigenvectors of $\mathcal{I}$. 
These are calculated via a modified version of the Lanczos algorithm described in Appendix A.
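
A minimal sketch of the diagonal approximation, assuming per-sample gradients $G_i$ are available one at a time. The function name `diagonal_fisher` is hypothetical, and random vectors stand in for real gradients.

```python
import numpy as np

def diagonal_fisher(per_sample_grads):
    """Diagonal of n^{-1} sum_i G_i G_i^T, accumulated in O(p) memory."""
    diag = np.zeros(per_sample_grads.shape[1])
    for g in per_sample_grads:          # one gradient at a time
        diag += g * g
    return diag / per_sample_grads.shape[0]

rng = np.random.default_rng(3)
grads = rng.normal(size=(200, 6))       # stand-in for G_i with n = 200, p = 6
fisher_diag = diagonal_fisher(grads)
fisher_full = grads.T @ grads / grads.shape[0]   # O(p^2); only feasible for tiny p
```

In practice the gradients would be streamed rather than stored in one array, so that only the $O(p)$ accumulator persists.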

Sufficient statistics have always been known to have this theoretically infinite storage property. 
This work merely modernizes the concept for today's massive models. 

These low-dimensional Fisher Information approximations provide a way to again yield sufficient statistics 
for deep learning in a computationally tractable way. 
However, because storage is cheap, this technology is valuable only in specific contexts:
- when miniaturization is important, 
- and when input data is larger than the $O(rp)$ sufficient statistic storage requirement. 

While SSR miniaturization is not immediately valuable to all deep learning approaches, 
it provides a framing which is useful for optimal efficiency in most deep RL applications. 

## RL-adjusted CRLB

As an RL agent learns from its environment, it tries new things and produces new kinds of data. 
Accepting that RL doesn't sample from identical distributions invites us to find an optimal transition paradigm. 
Interpreting _optimality_ as _efficiency_, this means working with the CRLB. 
Particularly, since maximum likelihood estimates (MLEs) from different distributions appear biased relative to one another, 
we'll work with the mean squared error (MSE) form of the CRLB. 
The strategy is simple: calculate the MSE under biased MLE deformations, and minimize it with respect to $\lambda$. 

To recover an optimal $\lambda$, we use the following definitions. 
- $X_A$ is the vector (or matrix) of observations having density $f_X(x; \theta_A)$. 
- $X_B$ similarly has density $f_X(x; \theta_B)$. 
- $\hat \theta_A = \arg\max_\theta f_X(X_A; \theta)$ is a simple MLE. 
- $\hat \theta_B = \arg\max_\theta f_X(X_B; \theta)$
- $\hat \theta_{AB} = \arg\max_\theta \log f_X(X_B; \theta) - \lambda 2^{-1} (\theta - \theta_A)^T \mathcal{I}_A (\theta - \theta_A) $ 
- $\theta_{AB} = \arg\max_\theta - \pi 2^{-1} (\theta - \theta_B)^T \hat{\mathcal{I}} (\theta - \theta_B) - (1 - \pi) 2^{-1} (\theta - \theta_A)^T \hat{\mathcal{I}} (\theta - \theta_A)  $
- $n_A = \#A, n_B = \#B, n = n_A + n_B$
- $\pi = \lim_{n \to \infty} n_B/n$. WLOG, we assume this series converges so we may have $\pi \approx n_B/n$. 

WLOG, assume $\mathcal{I} > 0$.

We'll further assume $f_X$ is smooth in $\theta$, and that $\theta_A$ and $\theta_B$ are sufficiently near that they share approximately equivalent Hessians. 
So, $\mathcal{I} \approx -n_B^{-1} \nabla_\theta^2 \log f_X(X_B; \theta_B) \approx -n_A^{-1} \nabla_\theta^2 \log f_X(X_A; \theta_A) $. 
After building out Lemma 1, we'll see how this produces natural values for $\mathcal{I}_B = \pi \mathcal{I}$ and $\mathcal{I}_A = (1-\pi) \mathcal{I}$.

Lemmas 1 and 4 will likely be considered trivial by some readers, but I'm keeping them to engage a wider audience. 

**Lemma 1: Double Taylor Series**

**Lemma 1:** For smooth $f$, $g$, 
and $\theta_A$ sufficiently near $\theta_B$, $f(\theta) + g(\theta) \approx T_k(f, \theta_A, \theta) + T_k(g, \theta_B, \theta)$, 
where $T_k(h, x, y)$ is a $k$-term Taylor series expansion around $x$ as a function of $y$. 

**Proof:** Since $f$ is smooth, for all $\varepsilon > 0$ 
there exists an open neighborhood $\Theta_A \subset \Theta$ around $\theta_A$ 
such that $\| T_k(f, \theta_A, \theta) - f(\theta) \| < \varepsilon/2$ for all $\theta \in \Theta_A$, 
by Taylor's Theorem. 

Since $\theta_B$ is arbitrarily near $\theta_A$, choose $\theta_B \in \Theta_A$. 
Again by Taylor's Theorem, there exists $\Theta_B \subset \Theta_A$ such that $\theta_B \in \Theta_B$, 
and $\| T_k(g, \theta_B, \theta) - g(\theta) \| < \varepsilon/2$ for all $\theta \in \Theta_B$. 

So, $ \| f(\theta) + g(\theta) - T_k(f, \theta_A, \theta) - T_k(g, \theta_B, \theta) \|$
$ \leq \| f(\theta) - T_k(f, \theta_A, \theta) \| + \| g(\theta) - T_k(g, \theta_B, \theta) \| $
$ \leq \varepsilon/2 + \varepsilon/2 $.

Thus, for all $\theta \in \Theta_B$, our total approximation error does not exceed $\varepsilon$, 
so $f(\theta) + g(\theta) \approx T_k(f, \theta_A, \theta) + T_k(g, \theta_B, \theta)$. 

$\square$

To motivate our $\mathcal{I}_{A,B} \approx \pi \mathcal{I} + (1-\pi)\mathcal{I} = \mathcal{I}_B + \mathcal{I}_A$ assumption, 
start by recognizing that $\mathcal{I} = \mathcal{I}_{A,B} = \mathcal{I}_B + \mathcal{I}_A$ is generally true by independence of $X_B$ and $X_A$. 
However, discovering natural values for $\mathcal{I}_B$ and $\mathcal{I}_A$ requires we
apply the Double Taylor Series expansion to the log likelihood estimate.

$\arg\max_\theta \log f_X(X;\theta) $ 
$= \arg\max_\theta n^{-1} \log f_X(X;\theta) $ 
$= \arg\max_\theta n^{-1} \log f_X(X_B;\theta) + n^{-1} \log f_X(X_A;\theta) $

Apply Lemma 1 around $\theta_B$ and $\theta_A$. 

$ \approx \arg\max_\theta n^{-1} \log f_X(X_B;\theta_B) + n^{-1} (\theta - \theta_B)^T \nabla_\theta \log f_X(X_B; \theta_B) $
$ + n^{-1} 2^{-1} (\theta - \theta_B)^T \left( \nabla_\theta^2 \log f_X(X_B; \theta_B) \right) (\theta - \theta_B) $
$ + n^{-1} \log f_X(X_A;\theta_A) + n^{-1} (\theta - \theta_A)^T \nabla_\theta \log f_X(X_A; \theta_A) $
$ + n^{-1} 2^{-1} (\theta - \theta_A)^T \left( \nabla_\theta^2 \log f_X(X_A; \theta_A) \right) (\theta - \theta_A) $

Multiply by $1 = n_A/n_A = n_B/n_B$ to line up limits.

$ \approx \arg\max_\theta n_B n^{-1} n_B^{-1} \log f_X(X_B;\theta_B) + n_B n^{-1} n_B^{-1} (\theta - \theta_B)^T \nabla_\theta \log f_X(X_B; \theta_B) $
$ + n_B n^{-1} n_B^{-1} 2^{-1} (\theta - \theta_B)^T \left( \nabla_\theta^2 \log f_X(X_B; \theta_B) \right) (\theta - \theta_B) $
$ + n_A n^{-1} n_A^{-1} \log f_X(X_A;\theta_A) + n_A n^{-1} n_A^{-1} (\theta - \theta_A)^T \nabla_\theta \log f_X(X_A; \theta_A) $
$ + n_A n^{-1} n_A^{-1} 2^{-1} (\theta - \theta_A)^T \left( \nabla_\theta^2 \log f_X(X_A; \theta_A) \right) (\theta - \theta_A) $

Apply the SLLN. The key observation is that we've assumed all Hessians average to the same value $\mathcal{I}$, approximately.

$ \approx_{a.s.} \arg\max_\theta \pi \mathbb{E} \log f_X(X_{B_1};\theta_B) + 0 $
$ - \pi 2^{-1} (\theta - \theta_B)^T \mathcal{I} (\theta - \theta_B) $
$ + (1-\pi) \mathbb{E} \log f_X(X_{A_1};\theta_A) + 0 $
$ - (1-\pi) 2^{-1} (\theta - \theta_A)^T \mathcal{I} (\theta - \theta_A) $

$ = \arg\max_\theta - 2^{-1} (\theta - \theta_B)^T \pi \mathcal{I} (\theta - \theta_B) $
$ - 2^{-1} (\theta - \theta_A)^T (1-\pi) \mathcal{I} (\theta - \theta_A) $

$ = \arg\max_\theta - 2^{-1} (\theta - \pi\theta_B - (1-\pi)\theta_A)^T \left( \pi \mathcal{I} + (1-\pi) \mathcal{I} \right) (\theta - \pi\theta_B - (1-\pi)\theta_A) $

In the natural place of $\mathcal{I}_B$ we find $\pi \mathcal{I}$. 
Similarly, we are free to take $\mathcal{I}_A = (1-\pi)\mathcal{I}$. 

**Lemma 2: The small differences assumption**

$\theta_A$ and $\theta_B$ need to be quite close for our analysis to hold rigorously. 
They must be closer than our sample size is large. 
Mathematically, we assume WLOG that $\theta_B - \theta_A = O(n^{-s}), s > 1/2$, 
similar to Hölder continuity. 
While technically useful, this assumption is counter-intuitive. 
After all, our parameters are not functions of our sample size. 
Fortunately, in application, $n$ is fixed by the time estimation occurs. 
So, we are free to assume that parameters are indeed progressing along some limit, 
but all sampling has simply occurred at the $n^{th}$ step, 
causing estimation to remain undisturbed.  

This strange analytic tool has a practical interpretation: 
not only must $\theta_t$ progress continuously in $t$, 
but also change slowly relative to $\sqrt{n}$. 

**Lemma 2:** If $\theta_B - \theta_A = O(n^{-s}), s > 1/2$, then $\sqrt{n}(\hat \theta_{AB} - \pi\theta_B - (1-\pi)\theta_A) \to_D \cdot$ 
and $\sqrt{n}(\hat \theta_{AB} - \theta_A) \to_D \cdot$. 
So, these limits exist, converging in distribution. 

**Proof:** First, we derive the limiting expected value. 

$\hat \theta_{AB} \approx \arg\max_\theta n^{-1} \log f_X(X_B; \theta) + n^{-1} \log f_X(X_A; \theta) $

Apply Lemma 1 and the SLLN. 

$ \approx \arg\max_\theta -\pi 2^{-1}(\theta - \theta_B)^T \mathcal{I} (\theta - \theta_B) - (1-\pi) 2^{-1}(\theta - \theta_A)^T \mathcal{I} (\theta - \theta_A) $

Optimized at $ 0 = \nabla_\theta \left( - 2^{-1} (\theta - \theta_B)^T \pi \mathcal{I} (\theta - \theta_B) - 2^{-1} (\theta - \theta_A)^T (1-\pi) \mathcal{I} (\theta - \theta_A) \right) $

$ = - (\theta - \theta_B)^T \pi \mathcal{I} - (\theta - \theta_A)^T (1-\pi) \mathcal{I} $

$ \Leftrightarrow 0 = 0 \mathcal{I}^{-1} = - (\theta - \theta_B)^T \pi - (\theta - \theta_A)^T (1-\pi) $
$ = - \theta^T + \pi \theta_B^T + (1-\pi) \theta_A^T $

$ \Rightarrow \theta = \pi \theta_B + (1-\pi) \theta_A$.

So, $\mathbb{E}\hat \theta_{AB} \approx \pi \theta_B + (1-\pi) \theta_A$ for $n$ large. 

Next, we use the exact definition of $\hat \theta_{AB}$ to recognize it's just an MLE constrained to an elliptical sub-manifold of $\Theta$. 

$\hat \theta_{AB} = \arg\max_\theta \log f_X(X_B; \theta) - \lambda 2^{-1}(\theta - \theta_A)^T \mathcal{I} (\theta - \theta_A)$ 
has a solution satisfying 
$0 = \nabla_\theta \log f_X(X_B; \theta) - \lambda \nabla_\theta 2^{-1}(\theta - \theta_A)^T \mathcal{I} (\theta - \theta_A) + 0 $
$ = \nabla_\theta \log f_X(X_B; \theta) - \lambda \nabla_\theta \left( 2^{-1}(\theta - \theta_A)^T \mathcal{I} (\theta - \theta_A) - c \right) $, 
for some $c \in \mathbb{R}_{\geq 0}$. 

Define $\mathcal{L}(\theta, \lambda') := \log f_X(X_B; \theta) - \lambda' \left( 2^{-1}(\theta - \theta_A)^T \mathcal{I} (\theta - \theta_A) - c \right) $.  
Notice that $\lambda$ is a fixed value during estimation, so $\lambda'$ is a parameter.

Recognize that by construction, $\nabla_\theta \mathcal{L}\left(\hat \theta_{AB}, \lambda \right) $ 
$ = \frac{\partial}{\partial \lambda'} \mathcal{L}\left(\hat \theta_{AB}, \lambda \right) = 0 $. 

Thus, $\mathcal{L}$ is the Lagrangian of an elliptically constrained parameter. 
Specifically, the sub-manifold is $H := \{ \theta \in \Theta \; : \; c = 2^{-1}(\theta - \theta_A)^T \mathcal{I} (\theta - \theta_A) \}$. 

So, $\hat \theta_{AB} = \arg\max_{\theta \in H} \log f_X(X_B; \theta) $ is just an MLE.

Hence, $\sqrt{n}(\hat \theta_{AB} - \mathbb{E}\hat \theta_{AB})$ converges in distribution. 

Further, $\sqrt{n}(\hat \theta_{AB} - \theta_A) $
$ = \sqrt{n}(\hat \theta_{AB} - \theta_A - \mathbb{E}\hat \theta_{AB} + \mathbb{E}\hat \theta_{AB})$
$ \approx \sqrt{n}(\hat \theta_{AB} - \theta_A - \mathbb{E}\hat \theta_{AB} + \pi \theta_B + (1-\pi) \theta_A) $
$ = \sqrt{n}(\hat \theta_{AB} - \mathbb{E}\hat \theta_{AB}) + \pi \sqrt{n}(\theta_B - \theta_A) $

Apply the _small differences assumption_.

$ = \sqrt{n}(\hat \theta_{AB} - \mathbb{E}\hat \theta_{AB}) + \pi O(n^{1/2-s}) $ which converges in distribution because $O(n^{1/2-s}) \to 0$ for $s > 1/2$.

$\square$
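
The limiting mean found in the proof can be checked numerically. The sketch below assumes an arbitrary positive definite stand-in for $\mathcal{I}$ and made-up parameter values, then solves the first-order condition as a linear system.

```python
import numpy as np

rng = np.random.default_rng(4)
p, pi = 5, 0.3
a = rng.normal(size=(p, p))
fisher = a @ a.T + p * np.eye(p)        # arbitrary positive definite stand-in for I
theta_a = rng.normal(size=p)
theta_b = theta_a + 0.01 * rng.normal(size=p)

# Setting the gradient of the two quadratic terms to zero gives the linear system
# (pi*I + (1-pi)*I) theta = pi*I theta_B + (1-pi)*I theta_A.
lhs = pi * fisher + (1 - pi) * fisher
rhs = pi * fisher @ theta_b + (1 - pi) * fisher @ theta_a
theta_star = np.linalg.solve(lhs, rhs)
theta_mix = pi * theta_b + (1 - pi) * theta_a
```

Because both quadratics share the same $\mathcal{I}$, the solution does not depend on $\mathcal{I}$ at all, only on $\pi$.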

**Lemma 3: Asymptotic behavior of $\hat \theta_{AB}$** 

The closed-form expected value and variance will be used in proving and demonstrating efficiency. 

**Lemma 3:** As $n \to \infty$, $\hat \theta_{AB} \sim \mathcal{N} \left(\pi \theta_B + (1-\pi) \theta_A, \pi \mathcal{I}^{-1}/n \right)$, 
provided $\lambda = n_A$ and the _small differences assumption_ is met. 

**Proof:** Our proof strategy is very similar to that of MLE asymptotic normality. 
However, we'll need to construct a few additional limits converging to $\pi$. 
As with other MLE asymptotic normality proofs, we start by recognizing $0 = \nabla_\theta \log f_X(X_B; \theta) - \lambda (\theta - \theta_A)^T \mathcal{I} $.

$0 = n^{-1/2} \nabla_\theta \log f_X(X_B; \theta) - n^{-1/2} \lambda (\theta - \theta_A)^T \mathcal{I} $

Apply Taylor series expansion of $\nabla_\theta \log f_X(X_B; \theta)$ around $\theta_B$. 

$ \approx n^{-1/2} \nabla_\theta \log f_X(X_B; \theta_B) + n^{-1/2} (\theta -\theta_B)^T \nabla_\theta^2 \log f_X(X_B; \theta_B) $
$ - n^{-1/2} \lambda (\theta - \theta_A)^T \mathcal{I} $

Prepare limits by multiplying by 1 and setting $\lambda = n_A$. 

$ = \sqrt{\frac{n_B}{n}} n_B^{-1/2} \nabla_\theta \log f_X(X_B; \theta_B) + \sqrt{n} \frac{n_B}{n} (\theta -\theta_B)^T n_B^{-1} \nabla_\theta^2 \log f_X(X_B; \theta_B) $
$ - \sqrt{n} \frac{n_A}{n} (\theta - \theta_A)^T \mathcal{I} $

Take $n \to \infty$ to produce 4 kinds of limit simultaneously: 
- $n_B/n \to_{a.s.} \pi$. 
- $n_B^{-1/2} \nabla_\theta \log f \to_D N_B \mathcal{I}^{1/2}$ via CLT, where $N_B$ is a normally distributed variable. 
- $n_B^{-1} \nabla_\theta^2 \log f \to_{a.s.} -\mathcal{I} $ via SLLN. 
- $\sqrt{n}(\theta - \theta_B) \to_D \cdot$, by Lemma 2.

All 4 of these limits can be combined with Slutsky's Theorem to produce convergence in distribution. 

$ \Rightarrow \lim_{n \to \infty} 0 $
$ \approx \left( \lim_{n \to \infty} \sqrt{\frac{n_B}{n}} \right) $
$ \left( \lim_{n \to \infty} n_B^{-1/2} \nabla_\theta \log f_X(X_B; \theta_B) \right) $
$ + \left( \lim_{n \to \infty} \frac{n_B}{n} \right) $
$ \left( \lim_{n \to \infty} \sqrt{n} (\theta -\theta_B)^T \right) $
$ \left( \lim_{n \to \infty} n_B^{-1} \nabla_\theta^2 \log f_X(X_B; \theta_B) \right) $
$ - \left( \lim_{n \to \infty} \frac{n_A}{n} \right) $
$ \left( \lim_{n \to \infty} \sqrt{n} (\theta - \theta_A)^T \right) \mathcal{I} $ 

$ \approx_D \sqrt{\pi} N_B \mathcal{I}^{1/2} - \lim_{n \to \infty} \sqrt{n} \pi (\theta - \theta_B)^T \mathcal{I} - \lim_{n \to \infty} \sqrt{n} (1-\pi) (\theta - \theta_A)^T \mathcal{I} $

$ \Rightarrow \lim_{n \to \infty} \sqrt{n} (\theta - \pi \theta_B - (1-\pi) \theta_A) \approx_D \sqrt{\pi} N_B \mathcal{I}^{-1/2} $

$ \square $
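
A small Monte Carlo sketch of Lemma 3, assuming a Gaussian mean model with known variance so that $\hat \theta_{AB}$ has a closed form and $\mathcal{I} = 1/\sigma^2$. All numeric values are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(5)
theta_a, theta_b, sigma2 = 0.0, 0.05, 1.0
n_a, n_b, reps = 80, 20, 20_000
n = n_a + n_b
pi = n_b / n

# Closed-form maximizer of log f(X_B; theta) - n_A/2 * (theta - theta_A)^2 / sigma2,
# i.e. the penalized estimate with lambda = n_A and I = 1 / sigma2.
x_b = rng.normal(theta_b, np.sqrt(sigma2), size=(reps, n_b))
theta_ab_hat = (x_b.sum(axis=1) + n_a * theta_a) / n

mc_mean, mc_var = theta_ab_hat.mean(), theta_ab_hat.var()
lemma_mean = pi * theta_b + (1 - pi) * theta_a
lemma_var = pi * sigma2 / n             # pi * I^{-1} / n
```

The simulated mean and variance can then be compared against `lemma_mean` and `lemma_var`; in this toy model the agreement is exact up to Monte Carlo error.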

**Lemma 4: Multivariate CRLB**

**Lemma 4:** For $\hat \theta \in \mathbb{R}^p, \text{Cov}(\hat \theta) \geq \mathcal{I}^{-1}/n$.

**Proof:** WLOG, take $n = 1$. 

Let $\text{Cov}(\hat \theta) = \Sigma = P \Lambda P^T$, 
where $P$ orthogonal and $\Lambda$ diagonal. 

Take $\eta: \Theta \to H \subset \mathbb{R}^p$ as $\eta(\theta) = P^{-1} \theta = P^T \theta$, 
so $\eta^{-1}(\eta) = \theta(\eta) = P \eta$. 

Use $\eta$ to map $\hat \theta$ from $\Theta$ to an uncorrelated space $H$. 

Thus, $\text{Cov}(\eta(\hat \theta)) = \text{Cov}(P^T \hat \theta) = P^T \text{Cov}(\hat \theta) P = P^T P \Lambda P^T P = \Lambda $. 

Now apply the univariate CRLB element-wise to the diagonal of $\Lambda$ to get the minimum variance of our uncorrelated estimate $\eta(\hat \theta)$. 
So, $\Lambda \geq \mathcal{I}^{-1}_H$. 

To get the associated information matrix in $\Theta$-space, apply the inverse transform on $H$-space. 
This transforms the Hessian $\mathcal{I}_H$ with Jacobian $J = P$.

$ \mathcal{I}^{-1} = \mathcal{I}_\Theta^{-1} = \left( J \mathcal{I}_H J^T \right)^{-1} $
$ = \left( P \mathcal{I}_H P^T \right)^{-1} = P \mathcal{I}_H^{-1} P^T $
$ \leq \text{Cov}\left( \theta( \eta( \hat \theta)) \right) = \text{Cov}(\hat \theta)$. 

$ \square $

**RL-adjusted CRLB (RL-CRLB)** 

**Result:** $\mathbb{E}\| \hat \theta_{AB} - \theta_B \|^2 \gtrsim \pi \text{tr}[\mathcal{I}^{-1}]/n + (1-\pi)^2 \| \theta_B - \theta_A \|^2$, 
minimized over $\pi$ when $ n_A \approx n(1-\pi) = \text{tr}\left[\mathcal{I}^{-1}\right]/ \left(2 \| \theta_B - \theta_A \|^2 \right) $. 

**Proof:** Before studying the mean squared error (MSE) $\mathbb{E}\| \hat \theta_{AB} - \theta_B \|^2$, we start with the matrix form, say $E$ for "error". 

$ E = \mathbb{E}(\hat \theta_{AB} - \theta_B)(\hat \theta_{AB} - \theta_B)^T $

Add $0 = \mathbb{E} \hat \theta_{AB} - \mathbb{E} \hat \theta_{AB}$

$ = \text{Cov}\hat \theta_{AB} + 0 + (\mathbb{E} \hat \theta_{AB} - \theta_B)(\mathbb{E} \hat \theta_{AB} - \theta_B)^T $

Apply the Multivariate CRLB (Lemma 4) on transformed space $\mathbb{E} \hat \theta_{AB} (\Theta)$.

$ \geq \left( \nabla_{\theta_B} \mathbb{E}\hat \theta_{AB} \right)^T \mathcal{I}_B^{-1} n^{-1} \left( \nabla_{\theta_B} \mathbb{E}\hat \theta_{AB} \right) $
$ + (\mathbb{E} \hat \theta_{AB} - \theta_B)(\mathbb{E} \hat \theta_{AB} - \theta_B)^T$

By assumption of $\theta_B$ near $\theta_A$, $\mathcal{I}_B \approx \pi \mathcal{I}$.

$ \approx \left( \nabla_{\theta_B} \mathbb{E}\hat \theta_{AB} \right)^T \pi^{-1} \mathcal{I}^{-1} n^{-1} \left( \nabla_{\theta_B} \mathbb{E}\hat \theta_{AB} \right) $
$ + (\mathbb{E} \hat \theta_{AB} - \theta_B)(\mathbb{E} \hat \theta_{AB} - \theta_B)^T$

Apply $\mathbb{E} \hat \theta_{AB} \approx \pi \theta_B + (1-\pi) \theta_A$ from Lemma 3.

$ \approx \pi \mathcal{I}^{-1} n^{-1} + (1-\pi)^2 (\theta_B - \theta_A)(\theta_B - \theta_A)^T $

$ \Rightarrow \mathbb{E}\| \hat \theta_{AB} - \theta_B \|^2 = \text{tr}[E] $
$ \gtrsim \pi \text{tr}[\mathcal{I}^{-1}]/n + (1-\pi)^2 \| \theta_B - \theta_A \|^2 $

Find the critical point in $\pi$ for optimality over $\pi$. 

$ 0 = \frac{\partial}{\partial \pi} \left( \pi \text{tr} \left[ \mathcal{I}^{-1} \right]/n + (1-\pi)^2 \| \theta_B - \theta_A \|^2  \right) = \text{tr} \left[ \mathcal{I}^{-1} \right]/n - 2(1-\pi) \| \theta_B - \theta_A \|^2 $

$ \Rightarrow  1-\pi = \text{tr}\left[\mathcal{I}^{-1}\right]/(2 n \| \theta_B - \theta_A \|^2) $. 

$ \Rightarrow \lambda = n_A \approx n(1-\pi) = \text{tr}\left[\mathcal{I}^{-1}\right]/(2 \| \theta_B - \theta_A \|^2) $

$\square$

Leveraging Lemma 3, notice that this makes $\hat \theta_{AB}$ efficient under the RL-adjusted paradigm of estimation. 
Specifically, $\mathbb{E} \| \hat \theta_{AB} - \theta_B \|^2 \approx \pi \text{tr}[\mathcal{I}^{-1}]/n + (1-\pi)^2 \| \theta_B - \theta_A \|^2 $. 
Of course, normally the definition of efficiency assumes consistency, 
but in this context of mixing distributions, bias should be accommodated. 
So, we interpret "efficient" as "minimizing MSE".
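
As a quick numerical check, the bound and its critical point in $\pi$ can be evaluated directly. The following is a minimal sketch, not the experimental code: the function names and the input values (`trace_inv_fisher`, `n`, `sq_dist`) are hypothetical and chosen only for illustration. It compares the closed-form critical point against a brute-force grid search over $\pi \in [0, 1]$.

```python
def rl_crlb_bound(pi, trace_inv_fisher, n, sq_dist):
    """Approximate MSE bound: pi * tr[I^-1] / n + (1 - pi)^2 * ||theta_B - theta_A||^2."""
    return pi * trace_inv_fisher / n + (1.0 - pi) ** 2 * sq_dist


def optimal_pi(trace_inv_fisher, n, sq_dist):
    """Critical point of the bound: 1 - pi = tr[I^-1] / (2 n ||theta_B - theta_A||^2)."""
    return 1.0 - trace_inv_fisher / (2.0 * n * sq_dist)


# Hypothetical inputs, for illustration only.
trace_inv_fisher = 4.0   # tr[I^-1]
n = 100                  # total sample size
sq_dist = 0.1            # ||theta_B - theta_A||^2

pi_star = optimal_pi(trace_inv_fisher, n, sq_dist)
lambda_star = n * (1.0 - pi_star)  # equals tr[I^-1] / (2 ||theta_B - theta_A||^2)

# Brute-force check: minimize the bound over a grid of pi values in [0, 1].
grid = [i / 1000 for i in range(1001)]
pi_grid = min(grid, key=lambda p: rl_crlb_bound(p, trace_inv_fisher, n, sq_dist))

print(pi_star, lambda_star, pi_grid)
```

Note that the closed form only yields a valid mixing weight when $\text{tr}[\mathcal{I}^{-1}]/(2 n \| \theta_B - \theta_A \|^2) \leq 1$; otherwise the minimum over $[0, 1]$ sits on the boundary.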

**Bring it all together**

This work derives SSRs before producing the RL-CRLB, 
because SSRs provide an applied context for the bound. 
The RL-CRLB only provides an optimal value for $\pi$, say $\hat \pi$. 
This optimality can drive impact by informing us of the optimal value for our regularization term $\lambda$. 
After all, SSR derivation shows $\lambda$ acts equivalently to $n_A$. 
Thus, we are free to choose a $\hat \lambda$ which corresponds to $\hat \pi$ 
by taking $\hat \lambda = n(1 - \hat \pi) = \text{tr}\left[\mathcal{I}^{-1}\right]/\left(2 \| \hat\theta_B - \hat\theta_A \|^2 \right) $. 
Assuming our sample sizes $n_A$ and $n_B$ are sufficiently large that $\hat \theta_A$, $\hat \theta_B$, and 
$\hat{\mathcal{I}}$ are accurate, 
we are free to adjust $\pi$ to optimal by using $\hat \lambda$. 
So, choosing $\lambda = \hat \lambda$ 
causes $\hat \theta_{AB}$ to have optimally minimal MSE. 

## Experiments

TODO: re-run experiments with correct $\hat \lambda$ value. 

**Experimental result 1: Q-Learning Cart Pole**

This initial experiment demonstrates the effectiveness of SSRs, 
and studies the behavior of $\| \theta_A - \theta_B \|^2_F$ near the solution point. 
The experiment is kept purposefully simple to isolate experimental effects. 
We apply a [small deep net](https://github.com/wdurno/notebooks/commit/8e20a0ec7a5c2376b954d2a04a930064e8e77f69#diff-9067e1aea905e3792706d011ba5a0007ba0147bbc9f8b02e76a0b9b07acbd13d) 
to [Cart Pole with a 4-dimensional state space](https://www.gymlibrary.dev/environments/classic_control/cart_pole/). 

Experimental conditions are 

1. SSRs are applied with $\lambda = n_B$. The observation cache is entirely cleared after each memorization event.
2. SSRs are applied with approximately optimal $\lambda = \hat \lambda = n_B (\theta_A - \theta_B)^T \hat{\mathcal{I}} (\theta_A - \theta_B)$. Since $\theta_A$ is never known, we use the $\theta_A - \theta_B$ difference from the prior memorization event. The observation cache is entirely cleared after each memorization event.
3. SSRs are applied with $\lambda = \hat \lambda / 10$. The observation cache is entirely cleared after each memorization event.
4. SSRs are applied with $\lambda = \hat \lambda * 10$. The observation cache is entirely cleared after each memorization event. 
5. Control: No SSRs are applied, but the observation cache is cleared at moments coinciding with each memorization event. 
6. Control: No SSRs are applied, and observation caches are entirely retained. 

**TODO:** clean-up this graphic: axes, title, number conditions 

In the graphic below, the x-axis tracks the game iteration, 
and the y-axis tracks average game score over 1000 agents. 
Agents restart their games when they lose and during memorization events. 
Three memorization events can clearly be seen in the graph.

![cart pole](./data/df-experiment-22.png)


This graphic supports the optimality of the SSR method in RL applications. 
The observed lift is consistent with the method using its data more efficiently. 
Better yet, the method enjoys the miniaturization benefits of storing any amount of data in an $O(p)$ space. 

It can be seen that the lagged estimate of $\hat \lambda$ is decently accurate, 
but so is taking $\hat \lambda = 1$ even during early gameplay. 

Also note that miniaturization benefits are evident. 
The agent with SSR (no memory) and discarding data is the lowest performer. 
In this way, we see the benefit of a long memory. 
Further, we see the SSR (memory) providing its mathematically guaranteed memory equivalent, 
but in a miniaturized form. 

**Experimental result 2: robotics**

This work's sufficiency and efficiency results are asymptotic results, 
so are theoretically valid for a large breadth of models. 
So, the sufficiency and efficiency should be demonstrated in another game beyond Cart Pole.
Further, since sufficiency and efficiency are particularly useful in robotics, 
this new game is played by a real-world robot. 

The robot is a PiCar V \[1\] wheeled robot with articulated camera, 
capable of driving forward & backward, turning left & right, 
and looking up, forward, left & right. 
Video processing and servo articulation are handled by an on-board Raspberry Pi 4B.

![robot](./data/robot.jpg)

This PiCar is customized to minimize on-board processing, 
instead running [this](https://github.com/wdurno/picar-v-rl-env/blob/1d5f5e15b390ef7cceae3fbe8875f71fbc76995f/run_api.py) server, offloading deep net processing to 
[this](https://github.com/wdurno/notebooks/blob/814f7cfed779fc4990f770b20a5ad924fbd7c693/regularizers-as-memory/car.py)
GPU-accelerated client. 

Similar to the out-of-the-box [software](https://github.com/sunfounder/SunFounder_PiCar-V/blob/master/ball_track/ball_tracker.py), 
the RL game is to chase a red ball and have the agent get close to it, 
confirming its presence visually. 
Experiments were conducted in the same household office, 
under near-consistent lighting conditions. 

Three experiments were run.

1. (Control) standard Q-Learning, retaining all data in the memory buffer 
2. (Experimental 1) Using memorization, clearing the memory buffer after each memorization, 
using $\hat \lambda_t = (\hat \theta_{t} - \hat \theta_{t-1})^T \hat{\mathcal{I}}(\hat \theta_{t} - \hat \theta_{t-1}) $, where $\hat{\mathcal{I}}$ is estimated with a Krylov rank of 10 (see Appendix A).
Memorization occurs 40 times.
3. (Experimental 2) Same as experimental 1, but with $\hat \lambda = 1$ for every iteration.

Data were all sampled by an agent following experimental condition 2.
All three experimental conditions used the same datasets, 
with the agent being exposed to data in the same order it was sampled. 
Thus, this was an offline RL experiment. 
Since optimal offline RL metrics are unclear \[2\], 
we simply use expected discounted reward per batch at each time step. 

![robot metrics](./data/df-experiment-24.png)

As can be seen, this new game has exceptional results when 
SSRs are applied and $\hat \lambda$ is projected. 
This illustrates the potential for universal effectiveness of SSR efficiency. 
Further,  $\hat \lambda = 1 = n_A/n_B$ continues to be an adequate value. 

**Experimental result 3: actor-critic Cart Pole**

Actor-critic methods feature actor and critic deep nets which co-evolve, 
effectively causing each model's regression target to move slowly. 
Interpreting this slow movement as continuity, 
we are free to apply the RL-CRLB. 

In this experiment, we modify Cart Pole to have a continuous action space 
thereby making actor-critic methods applicable. 
Instead of having the agent choose between two actions (left or right), 
we have the actor choose $\mathbb{P}[\text{left}] = 1 - \mathbb{P}[\text{right}]$. 
As programmed [here](https://github.com/wdurno/notebooks/blob/ccb0b401d4111e711868f997624442f07c8ef92b/regularizers-as-memory/regmem_ac.py), 
- in the _experimental condition_ both models get their own SSR which is updated every 500 game steps; 
- in the _control condition_, no SSRs were used; 
- 20000 game steps were observed, restarting games whenever they are done; 
- scores represent the average of 1000 control and 1000 experimental agents. 

The data illustrate how efficient RL estimates with RL-CRLB regularization generate metric lift through optimally efficient utilization of data. 

![actor critic cart pole](./data/df-experiment-27.png)

**Experimental result 4: GPT-2 Tuning**

TODO Now that I've proven RL-CRLB applies to actor-critic methods, I am comfortable proceeding into this experiment.  

This approach should involve a continuous transition from a language model loss to a reinforcement learning loss. 
My theory shows this transition to be optimal at each infinitesimal step in the path. 
Since the action space is large, an Actor-Critic (AC) approach is needed. 

As is, the AC paradigm is not sufficiently built out from a probabilistic perspective to 
immediately activate classical statistical theory \[18\]. Let's quickly solve that 
problem here.

AC models accommodate a large action space by adjusting probabilities $p$ of actions 
$a_t$ given state $s_t$, written as $p(a_t|s_t; \theta)$. 
The chosen action is selected at random, according to its probability. 
Further, to enable Bellman equation estimation, the estimated state value 
$\hat V(s_t; \theta)$ is fit to the observed, discounted value 
$V(s_t) = \sum_{k=t}^T \gamma^{k-t} r_k$. 

The loss is $\ell = \ell_{actor} + \ell_{critic}$, where 
- $\ell_{actor} = - \sum_t^T (V(s_t) - \hat V(s_t)) \log p(a_t|s_t; \theta) $, where $V(s_t) = V(s_t; \theta)$ is at a fixed value of $\theta$. 
- $\ell_{critic} = L(V(s_t), \hat V(s_t; \theta))$, where $L$ is any convex loss with a unique, unbiased minimum. 

Assume $\hat \theta = \arg\min_\theta \ell(s_t;\theta)$ exists uniquely, 
$s_t$ are Markovian over $t$, 
and $C = \int_{s_t} e^{-\ell(s_t;\theta)} \in \mathbb{R}_{>0}$.
Then $\hat \theta = \arg\max_\theta e^{-\ell(s_t;\theta)}C^{-1} $ is an MLE
for $s_t \sim e^{-\ell(s_t;\theta)}C^{-1}$.
So, by applying minor regularity assumptions, 
our loss is equivalent to a log likelihood, thereby activating my RL-CRLB theory. 
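
The loss $\ell = \ell_{actor} + \ell_{critic}$ defined above can be sketched numerically as follows. This is a minimal illustration rather than the code used in the experiments: the function names are hypothetical, squared error stands in for the convex critic loss $L$, and the discount exponent is assumed to be measured from the current step $t$. The difference $V(s_t) - \hat V(s_t)$ is treated as a plain number in the actor term, which mirrors holding it at a fixed value of $\theta$; in an autodiff framework one would stop gradients through it.

```python
import math


def discounted_returns(rewards, gamma):
    """Observed value V(s_t) = sum_{k >= t} gamma^(k - t) * r_k, via one backward pass."""
    returns, running = [], 0.0
    for r in reversed(rewards):
        running = r + gamma * running
        returns.append(running)
    return returns[::-1]


def actor_critic_loss(rewards, values_hat, action_probs, gamma=0.99):
    """
    rewards:      observed rewards r_t
    values_hat:   critic estimates V_hat(s_t; theta)
    action_probs: p(a_t | s_t; theta) for the actions actually taken
    Returns (total, actor, critic). Squared error is an assumed choice of L.
    """
    returns = discounted_returns(rewards, gamma)
    advantages = [v - v_hat for v, v_hat in zip(returns, values_hat)]
    actor = -sum(adv * math.log(p) for adv, p in zip(advantages, action_probs))
    critic = sum(adv ** 2 for adv in advantages)
    return actor + critic, actor, critic


# Hypothetical three-step episode, for illustration only.
rewards = [1.0, 1.0, 1.0]
values_hat = [1.0, 1.0, 1.0]
action_probs = [0.5, 0.5, 0.5]
total, actor, critic = actor_critic_loss(rewards, values_hat, action_probs, gamma=0.5)
print(total, actor, critic)
```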

Since $V(s_t)$ is not a function of action $a_t$, it is really the projected value 
of the chosen action. Of course, the chosen action 
changes as the model is fit, so sampled data ages poorly. Ultimately, data re-use 
doesn't work for AC models. This requires data not be re-used after models use it for 
fitting. This challenges SSR calculation, which works best by recalculating gradients 
at $\hat \theta_B$. So, SSR calculation must be done near-continuously. Unfortunately, 
I've failed to do this so far, despite trying. I fear that applying AC and thus 
meaningfully tuning LLMs with RL will depend on me solving this problem first. 
So, I hope to solve my challenges in this order:

1. Solve continuous SSR calculation for Q-Learning. 
2. Demonstrate continuous SSR calculation with AC Cart Pole. 
3. Apply SSRs to AC-tuned LLMs. 

## Discussion 

**When is RL-CRLB efficiency useful?** 

My efficiency result is likely immediately applicable to all RL applications, 
and any situations where sampling distributions drift slowly and continuously. 
This represents an opportunity to use data optimally in any RL application with enough data.

This all depends heavily on the assumption that $\text{MSE}(\hat \theta_{AB}) < \text{MSE}(\hat \theta_B)$, 
which is very likely to occur when incremental sample sizes $n_B$ are small relative to the aggregate sample. 
This is what makes RL a prime target for application. 

**What do likelihoods have to do with deep nets?**

TODO a lot, lol 

**How efficient do we need to be?**

TODO: Simulations vs real world

**Are these approximations rigorous?** 

TODO 

**When will SSR miniaturization be useful?**

The miniaturization benefits of SSRs in deep learning will not likely be useful until 
we start pursuing more-ambitious forms of data. 
Right now, we're spending millions on fitting language models on text data, 
so it is likely that miniaturization will be necessary for similarly-effective results 
on more-complex data. 
Imagine the sheer networking logistics required to handle exabytes of high-resolution 
video data. 
In that context, the $O(p)$ storage requirement, while definitely massive, 
is absolutely far more feasible than the $O(n)$ alternative without sufficient statistics. 
In this work, I've managed with $O(p) \approx 10p$. 
So, when our datasets become 10 times larger than our models, 
we may perhaps begin considering SSRs in practical application.

If you are unsure whether this will ever occur, consider the alternative. 
If $p \gtrsim O(n)$ for all $n$, then we are forever fitting more parameters to fewer data. 
This is likely to result in over-fit and no interpolation. 
Such agents will never have any depth of understanding. 

**Why only optimize $\pi$?** 

There is so much more work to do in the vast complexity of $\mathcal{I}$. 
In the RL-CRLB equation $\pi \text{tr}[\mathcal{I}^{-1}]/n + (1-\pi)^2 \| \theta_B - \theta_A \|^2$, 
my choice to target $\pi$ is largely one of convenience. 
Information Geometry \[17\] invites a deep exploration of metrics and model selection. 
Alternatively, one might explore strategies to optimally transport information 
from $\theta_A$ to $\theta_B$ over larger neighborhoods. 
Large neighborhoods are valuable, because they may free us from the _small differences assumption_, 
and allow this work's theory to contribute to _transfer learning_ meaningfully. 
For example, information may be optimally shared between two very different models. 

Further, it is likely that Information Geometry is the natural language of RL, 
because estimate covariance is a function of the statistical manifold's local tangent space, 
and this work phrases RL as traversal of paths on the manifold. 
The essential task is ensuring that traversal occurs most-efficiently. 

**$n_A \approx \text{tr}\left[\mathcal{I}^{-1}\right]/\left(2 \| \hat\theta_B - \hat\theta_A \|^2 \right)$ isn't an integer**

In the era of big data, sample size matters far less, and information geometry matters far more. 
This work optimally combines information with no regard to the actual observed sample sizes. 
It is merely assumed that sample sizes are large enough to meet asymptotic normality requirements. 

As a motivating thought experiment, consider the following normally distributed data and imagine our task is to estimate $\mu$. 

$$ \begin{bmatrix} X \\ Y \end{bmatrix} \sim \mathcal{N}\left( \begin{bmatrix} \mu \\ \mu \end{bmatrix}, \begin{bmatrix} 1 & \rho \\ \rho & 1 \end{bmatrix} \right) $$

The Fisher Information $\mathcal{I}_{\mu \mu} = 2(1 - \rho)/(1 - \rho^2)$ indicates that 
an efficient estimate will have variance $\text{Var}(\hat \mu) = n^{-1}2^{-1}(1 - \rho^2)/(1-\rho)$. 
Take note of these 3 informative values:
- $\text{Var}(\hat \mu) \to 0$ as $\rho \to -1$. So, only a single sample is needed to produce an accurate estimate when $\rho \approx -1$. 
- $\text{Var}(\hat \mu) = (2n)^{-1}$ when $\rho = 0$. So, only half as much data is needed to produce an accurate estimate when our dimensions are uncorrelated. 
- $\text{Var}(\hat \mu) \to 1/n$ as $\rho \to 1$. So, there is no efficiency premium when $\rho \approx 1$. 
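
These three values can be reproduced by computing the Fisher Information for $\mu$ directly as $\mathbf{1}^T \Sigma^{-1} \mathbf{1}$. The sketch below is illustrative only, with hypothetical function names; it inverts the $2 \times 2$ covariance by hand and compares the resulting efficient variance with the simplified form $(1 + \rho)/(2n)$, which follows from $(1 - \rho^2) = (1 - \rho)(1 + \rho)$.

```python
def fisher_info_mu(rho):
    """Fisher Information for mu: 1^T Sigma^{-1} 1 with Sigma = [[1, rho], [rho, 1]]."""
    det = 1.0 - rho ** 2
    sigma_inv = [[1.0 / det, -rho / det],
                 [-rho / det, 1.0 / det]]
    return sum(sum(row) for row in sigma_inv)


def efficient_variance(rho, n):
    """Variance of an efficient estimate of mu from n paired observations."""
    return 1.0 / (n * fisher_info_mu(rho))


n = 10
for rho in (-0.99, 0.0, 0.99):
    # Near -1 the variance approaches 0; at 0 it is 1/(2n); near 1 it approaches 1/n.
    print(rho, efficient_variance(rho, n), (1.0 + rho) / (2.0 * n))
```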

Notice how $\mathcal{I}$ structure strongly dictates sampling needs. 
By being sensitive to these opportunities, we can drop sampling requirements.
If we are interested in using our data more efficiently, 
we can leverage opportunities like this more. 
This work takes a step in that direction by entirely disregarding actual sample sizes $n_A$ and $n_B$, 
and instead choosing such values optimally. 
It's not about sample size; it's about the shape of our information. 

**Better ways to estimate $\hat{\mathcal{I}}$**

I'm using a method-of-moments estimate for my sufficient statistic $\hat{\mathcal{I}}$. 
There's a good chance you can do better. 
Your estimate need only be positive semi-definite. 
Conveniently, this can be obtained by estimating any $LL^T + \Lambda$ form. 
For example, perhaps there is an ideal neural network architecture for estimating $L$. 
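
To illustrate why the $LL^T + \Lambda$ form is convenient, the sketch below evaluates matrix-vector products and quadratic forms without ever building the $p \times p$ matrix. It is a hedged example, not the estimator used in this work: the function names, dimensions, and random inputs are hypothetical. Positive semi-definiteness is visible in the quadratic form, which is a sum of squares plus a non-negative diagonal term.

```python
import numpy as np


def lowrank_matvec(L, lam, v):
    """(L L^T + diag(lam)) v in O(p r) time and space."""
    return L @ (L.T @ v) + lam * v


def lowrank_quadratic_form(L, lam, v):
    """v^T (L L^T + diag(lam)) v = ||L^T v||^2 + sum_i lam_i v_i^2, always >= 0 for lam >= 0."""
    proj = L.T @ v
    return float(proj @ proj + np.sum(lam * v * v))


# Hypothetical sizes and random inputs, for illustration only.
rng = np.random.default_rng(0)
p, r = 50, 3
L = rng.normal(size=(p, r))
lam = rng.uniform(0.0, 1.0, size=p)   # non-negative diagonal
v = rng.normal(size=p)

print(lowrank_quadratic_form(L, lam, v))
```

A quadratic form of this kind is what appears when $\hat{\mathcal{I}}$ is applied to a parameter difference such as $\theta_A - \theta_B$.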

Similarly, my $(\theta_B - \theta_A)$ projection methods have been incredibly naive. 
There is plenty of room here to treat SSR estimation as an entirely parallel problem 
to model fitting. 

**Why not continuously integrate data into memory?**

In experiment 1, we saw data memorized infrequently at clear punctuated times. 
The result is a very "choppy" progression of metrics. 
I do this because theoretical deduction is my guide in picking high-value experimental targets. 
I've stuck close to my theory and only aggregated information once asymptotic sample sizes were clearly met. 
This doesn't mean opportunities don't exist. 
This is an opportunity for both empirical and theoretical study. 

On the empirical front, it is worthwhile to study practical ways to continuously integrate data into memory (SSRs). 
The challenge will be maintaining competitive metrics, 
because deviation from asymptotic assumptions will cause results to degrade. 

On the theoretical front, it is worth asking why I only combine two points in $\Theta$ at a time. 
For example, we may construct a mixture of 3 points $\hat \theta_{ABC} = \pi_A \theta_A + \pi_B \theta_B + \pi_C \theta_C$. Or, perhaps we pursue truly continuous integration and choose infinitely many 
points with a stochastic integral $\hat \theta = \int \hat \theta_t \pi_t dt$.
Again, the challenge will be in ensuring that the distribution of $\hat \theta_t$ is truly known. 
Are the samples truly asymptotic in size? 
What does that even mean if every observation is immediately integrated?

**Transfer learning**

Assuming $\theta_A$ and $\theta_B$ are sufficiently near, 
this work provides theory for optimally combining information between two models. 
This is transfer learning on an infinitesimal scale. 
Reapplying the method along path $\Theta_T = \{\theta_t\}_{t \in [0,1]} $, 
we obtain an optimized version of transfer learning as it's normally understood. 
There is likely more work required to efficiently transport information 
along non-infinitesimal paths. 
However, since this is a noisy walk on a sometimes billion-dimensional manifold, 
we may need to settle for heuristics. 

TODO finish experiment 4

## Conclusion

This work has derived the sufficient statistic regularizer (SSR) as a means to miniaturize models 
leveraging truly massive datasets, 
and has derived the reinforcement learning-adjusted Cramer Rao Lower Bound (RL-CRLB) 
which proves how to optimally utilize data in a reinforcement learning (RL) context. 
While the SSR miniaturization result is likely to only yield impact in the future, 
the RL-CRLB efficiency result is likely to deliver immediate impact to RL. 
Results are derived mathematically, then demonstrated experimentally. 
The paradigm deviates significantly from the classical statistics foundations of AI, 
violating the identical distribution assumption, 
yet still provides a mathematically coherent approach to combining such data. 
The overall result is a step away from experimentalist statistics as a foundation for AI, 
and toward the mathematical needs of AI. 
Particularly, this work accommodates the AI agent's ability to learn and change its sampling distribution accordingly. 

## References 

\[1\] https://www.sunfounder.com/products/smart-video-car

\[2\] Romain Deffayet, Thibaut Thonet, Jean-Michel Renders, & Maarten de Rijke (2022) "Offline Evaluation for Reinforcement Learning-based Recommendation: A Critical Issue and Some Alternatives", ACM SIGIR Forum, Vol. 56 No. 2. 

\[3\] Alexei Krylov (1931). On the numerical solution of equations whose solution determine the frequency of small vibrations of material systems. _Izv. Akad. Nauk. SSSR Otd Mat. Estest, 1, 491-539._ 

\[4\] Cornelius Lanczos (1950). An iteration method for the solution of the eigenvalue problem of linear differential and integral operators. _Journal of Research of the National Bureau of Standards_. 45 (4): 255–282. doi:10.6028/jres.045.026 

\[5\] Ryo Karakida, Shotaro Akaho,  Shun-ichi Amari (2019). Universal Statistics of Fisher Information in Deep Neural Networks: Mean Field Approach. _arXiv_: 1806.01316v3.

\[6\] Kirkpatrick et al. (2017) "Overcoming catastrophic forgetting in neural networks", PNAS, Vol. 114, No. 13.

\[7\] Knight, Will. "OpenAI's CEO Says the Age of Giant AI Models Is Already Over". Wired. www.wired.com.

\[8\] R. A. Fisher (1922) "On the mathematical foundations of theoretical statistics", The Royal Society, ISSN 0264-3952, eISSN 2053-9258. 
Retrieved from [https://royalsocietypublishing.org/doi/10.1098/rsta.1922.0009](https://royalsocietypublishing.org/doi/10.1098/rsta.1922.0009)
on 1 Jan 2024. 

\[9\] David Silver, Satinder Singh, Doina Precup, Richard S. Sutton (2021) "Reward is Enough", Artificial Intelligence, https://doi.org/10.1016/j.artint.2021.103535.

\[10\] James Kirkpatrick, Razvan Pascanu, Neil Rabinowitz, Raia Hadsell, et al. (2017) "Overcoming catastrophic forgetting in neural networks", PNAS, 114 (13), 3521-3526.

\[11\] Yann LeCun, John Denker, Sara Solla (1990) "Optimal Brain Damage", Advances in Neural Information Processing Systems 2 (NIPS 1989).

\[12\] Richard E. Bellman (1957) "Dynamic Programming", Princeton University Press, Princeton, New Jersey.

\[13\] D. A. S. Fraser (1963) "On Sufficiency and the Exponential Family", Vol. 25, No. 1, pp. 115-123.

\[14\] Harald Cramér (1946) "Mathematical Methods Of Statistics", Princeton University Press, Princeton. 

\[15\] C. R. Rao (1945) "Information and the accuracy attainable in the estimation of statistical parameters", Bull. Calcutta Math. Soc. 37, 81-91.

\[16\] S. S. Wilks (1938) "The Large-Sample Distribution of the Likelihood Ratio for Testing Composite Hypotheses", Annals of Mathematical Statistics 9(1): 60-62. DOI: 10.1214/aoms/1177732360

\[17\] Shun-ichi Amari, Hiroshi Nagaoka (2000), "Translations of Mathematical Monographs: Methods of Information Geometry", American Mathematical Society, Oxford University Press, Volume 191.

\[18\] Vijay Konda, John Tsitsiklis (1999), "Actor-Critic Algorithms", NIPS 12.

## Appendix A: Limited-Memory Lanczos algorithm 

Production-quality deep learning models will have intractably-large Fisher Information matrices.
For a model with $p$ parameters, $\mathcal{I}$ has $O(p^2)$ values.
To overcome this, we'll use low rank approximation $LL^T + \Lambda \approx \mathcal{I}$, 
$L \in \mathbb{R}^{p \times r}$ and $\Lambda \in \text{diag}(\mathbb{R}_{\geq 0}^{p \times p})$, 
with $r$ significantly smaller than $p$. 
To estimate the $LL^T$ part, 
I've modified a Krylov method \[3\] to provide algorithmically-efficient updates to our approximation. 
We may justify our approximation through a combination of 
supporting experimental evidence in our robotics experiment 
and theory arguing only for a sharp drop-off of eigenvalues \[5\].

Unfortunately, we cannot use pre-existing software, 
because most eigenpair algorithms are designed to have all of $\mathcal{I}$ as input.
Since $\mathcal{I}$ will not fit in memory, we have a challenge. 
Our only advantage is that we observe gradients $G_i \sim_{iid} N_p(0, \mathcal{I} )$. 
Fortunately, the _Lanczos algorithm_ \[4\] only requires we calculate $\mathcal{I}v$, not that we actually store $\mathcal{I}$. 

**Limited-Memory Lanczos Algorithm**

The Lanczos algorithm is a _Krylov_ method, built around _Krylov subspace_ $\text{span}\left\{ \mathcal{I}v, \mathcal{I}^2v, \ldots, \mathcal{I}^rv \right\}$.

The key observation is this: we can calculate the _Krylov vectors_ $\mathcal{I}v, \mathcal{I}^2v, \ldots, \mathcal{I}^rv$ with computational 
efficiency when using estimate $ \hat{\mathcal{I}}  = n^{-1}\sum_i^n G_i G_i^T$. With $G_i$ and $v$ in $\mathbb{R}^{p \times 1}$, 
we can expand recursively as follows.

$$ \hat{\mathcal{I}}^m v = \hat{\mathcal{I}}^{m-1} \left( n^{-1} \sum_{i=1}^n G_i G_i^T \right) v = \hat{\mathcal{I}}^{m-1} \left( n^{-1} \sum_{i=1}^n G_i \left( G_i^T v \right) \right) $$

Notice how we now work with $O(nr)$ $O(p)$-time operations (totalling $O(nrp)$-time) and no $O(p^2)$-space operations. 
This is all possible because we only ever work with vector-vector operations; matrices are never used in forming the Krylov vectors. 
So, we should enjoy computational feasibility, if $nr$ is significantly smaller than $p^2$, which is expected in a deep learning context. 
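
The recursion above can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation used in the experiments: the function names are hypothetical, the gradients are held in a re-iterable list, and the $n^{-1}$ factor comes from the definition of $\hat{\mathcal{I}}$. Each product $\hat{\mathcal{I}} w$ uses only the scalar $G_i^T w$ and a scaled copy of $G_i$, so no $p \times p$ matrix is formed.

```python
import numpy as np


def fisher_matvec(gradients, v):
    """I_hat v = n^{-1} sum_i G_i (G_i^T v), using only vector-vector operations."""
    acc = np.zeros(v.shape, dtype=float)
    n = 0
    for g in gradients:
        acc += g * (g @ v)   # (g @ v) is a scalar, so this is an O(p) update
        n += 1
    return acc / n


def krylov_vectors(gradients, v, r):
    """Return [I_hat v, I_hat^2 v, ..., I_hat^r v] by repeated matrix-free products."""
    vectors = []
    w = np.asarray(v, dtype=float)
    for _ in range(r):
        w = fisher_matvec(gradients, w)
        vectors.append(w)
    return vectors


# Hypothetical sizes and random gradients, for illustration only.
rng = np.random.default_rng(0)
n, p, r = 20, 8, 3
gradients = [rng.normal(size=p) for _ in range(n)]
v = rng.normal(size=p)
krylov = krylov_vectors(gradients, v, r)
print(len(krylov), krylov[0].shape)
```

This sketch leaves the Krylov vectors unnormalized for clarity. A practical Lanczos iteration would normalize and orthogonalize at each step, since raw powers of $\hat{\mathcal{I}}$ quickly align with the leading eigenvector and can overflow or underflow.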