# Sufficiency and Efficiency in Deep RL

_W. Evan Durno, 2023_

**TODO:** the whole document contains a mixture of multivariate and univariate representations of $\theta$. Switch-over to multivariate.  

As a reinforcement learning (RL) agent explores its environment, 
it learns and tries new things, and thus changes its sampling distribution. 
Thus, the RL paradigm poorly aligns with the foundational identical distribution assumptions 
underpinning the statistical theories allowing AI to function correctly. 
For example, it can be useful to discard old observations during fitting. 
Similarly, we ought to ask, how can we optimally fit our models while adding new samples from new distributions? 

In this work, we introduce the concept of a sufficient statistic regularizer (SSR), 
and prove the existence of an optimal regularizing value $\lambda$ to assist in fitting 
our models while distributions change smoothly and locally. 
Further, by relaxing the sufficient statistic definition to accommodate a reasonable approximation, 
we prove the existence of asymptotically universal sufficient statistics applicable in and beyond RL, 
assuming reasonable regularity assumptions. 
The net result is an RL paradigm which 
(1) incorporates new, non-identically distributed data optimally, and 
(2) achieves miniaturization by storing data in an $O(p)$-sized space instead of $O(n)$. 

## Introduction 

TODO: Describe the problem precisely 

TODO: Cover similar research 

## SSRs 

**Sufficient statistics as regularizers** 

Data reduction is the purpose of sufficient statistics. 
This becomes particularly apparent when studying their behavior under maximum likelihood estimation. 
The Fisher-Neyman definition of a sufficient statistic $T(x)$ is that there 
exists $g$ and $h$ for density $f$ such that $f(x; \theta) = h(x) g(T(x) ; \theta)$. 
Notice that under maximum likelihood estimation, the $h$ term becomes irrelevant, leaving only $T(x)$, 
thereby providing an opportunity to reduce dimensionality of all data stored. 

$ \hat \theta = \arg \max_\theta f(x; \theta) = \arg \max_\theta \log f(x; \theta) $
$ = \arg \max_\theta \log h(x) + \log g(T(x); \theta) = \arg \max_\theta \log g(T(x); \theta)$ 

For example, dataset $x$ may require $O(n)$ space to store, but for $\theta \in \mathbb{R}^p$, 
it's common that $T$ only requires $O(p)$ storage space. 
So, for a model with fixed parameter dimension, 
it is reasonably possible to have theoretically infinite data storage in a finite space. 
Of course, the amount of information truly stored will be practically bounded. 
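
As a minimal sketch of this data reduction, assume a Gaussian model with unknown mean $\theta$ and known unit variance (an illustrative choice, not one made in this document). Then $T(x) = (\sum_i x_i, n)$ is sufficient, and the MLE computed from those two stored numbers matches the MLE computed from the full sample. All function names below are hypothetical.

```python
import random

def sufficient_stat(xs):
    """Pack a sample into T(x) = (sum of x, n): two numbers regardless of n."""
    return (sum(xs), len(xs))

def merge(t_a, t_b):
    """Combine two packed samples without revisiting the raw data."""
    return (t_a[0] + t_b[0], t_a[1] + t_b[1])

def mle_from_stat(t):
    """MLE of the mean under N(theta, 1), using only T(x)."""
    total, n = t
    return total / n

random.seed(0)
x_a = [random.gauss(2.0, 1.0) for _ in range(1000)]   # old data
x_b = [random.gauss(2.0, 1.0) for _ in range(500)]    # new data

t_ab = merge(sufficient_stat(x_a), sufficient_stat(x_b))
theta_packed = mle_from_stat(t_ab)                    # uses O(1) stored numbers
theta_full = sum(x_a + x_b) / len(x_a + x_b)          # uses all O(n) observations
```

Under these assumptions, `theta_packed` and `theta_full` agree up to floating-point rounding, even though `t_ab` never grows with the sample size.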

For the purposes of this work, it is convenient to view sufficient statistics as regularizers. 
In situations where new data is added to old, like RL, this is particularly relevant. 
For example, with old data $X_A$ packed into $T_A = T(X_A)$ in place of the full sample, 
we may add new data $X_B$ and estimate $\theta_B$ without revisiting $X_A$. 

$\hat \theta_B = \arg\max_\theta f_X(X; \theta) = \arg \max_\theta \log f_X(X_B;\theta) + \log f_X(X_A;\theta) $
$ = \arg\max_\theta \log f_X(X_B;\theta) + \log g(T(X_A); \theta)$

Here, $X$ is the vector containing both $X_A$ and $X_B$, concatenated. 

If we then insert a scalar multiple $\lambda$, we recover the familiar regularizer form. 

$\hat \theta_B = \arg\max_\theta \log f_X(X_B;\theta) + \lambda \log g(T(X_A); \theta) $

Let this form of regularizer be the SSR. 
Like other regularizers, $\lambda \log g(T(X_A); \theta)$ causes $\theta$ to 
stay near some point $\theta_A$. 
For example and without loss of generality (WLOG), $\lambda \| \theta \|^2$ centers $\theta$ on zero. 

**Universal SSRs via approximation** 

Data reduction via sufficient statistics is like compression. 
A breadth of theory has been developed for lossless compression. 
Here, we'll highlight the benefits of lossy compression 
by relaxing our sufficient statistic definition to accommodate an approximation. 
Define $X_n \approx_{a.s.} Y$ to mean $\lim_{n \to \infty} X_n = Y \; a.s.$, 
and $X_n \approx_{\mathbb{P}} Y$ to mean $\lim_{n \to \infty} X_n = Y$ in $\mathbb{P}$.
Then, instead of defining $T$ sufficient if $f_X(x; \theta) = h(x) g(T(x); \theta)$, 
we'll accept $f_X(x; \theta) \approx_{\mathbb{P}} h(x) g(T(x); \theta)$ for large sample size $n$. 

With this relaxation, we are free to create approximate universal sufficient statistics, 
requiring only regularity assumptions strong enough for the central limit theorem (CLT) 
to apply, so that the log likelihood is approximately normally distributed. 

First, recognize that the following Taylor expansion is accurate for $\theta$ near $\theta_A$. 

$n^{-1} \log f_X(X;\theta) \approx n^{-1}\log f_X(X; \theta_A) $
$ + (\theta - \theta_A) n^{-1}\frac{\partial}{\partial \theta} \log f_X(X; \theta_A)$
$ + 2^{-1}(\theta - \theta_A)^2 n^{-1} \frac{\partial^2}{\partial \theta^2} \log f_X(X; \theta_A) $  

$\approx_{a.s.} \mathbb{E} \log f_X(X_1; \theta_A) + (\theta - \theta_A) \mathbb{E} \frac{\partial}{\partial \theta} \log f_X(X_1; \theta_A)$ 
$ + 2^{-1} (\theta - \theta_A)^2 \mathbb{E} \frac{\partial^2}{\partial \theta^2} \log f_X(X_1;\theta_A) $ (apply the strong law of large numbers (SLLN))

$ = \mathbb{E} \log f_X(X_1; \theta_A) + 0 - 2^{-1} (\theta - \theta_A)^2 \mathcal{I}_{\theta_A} $

Here, we utilize further approximations:
- $\hat{\mathcal{I}}(X_A) = \hat{\mathcal{I}} = n^{-1}\sum_{i=1}^n G_i G_i^T \approx_{a.s.} \mathcal{I}$ 
where $G_i = \nabla_\theta \log f_X(X_i; \theta_A) $ 
- $\hat \theta_A(X_A) = \hat \theta_A = \arg \max_\theta f_X(X_A;\theta) \approx_{\mathbb{P}} \theta_A$

Multiplying through by the sample size to return to the unnormalized log likelihood of $X_A$, we realize the following. 

$n_A \mathbb{E} \log f_X(X_1; \theta_A) - n_A 2^{-1} (\theta - \theta_A)^2 \mathcal{I}_{\theta_A}$

$\approx_{\mathbb{P}} n_A \mathbb{E} \log f_X(X_1; \theta_A) - n_A 2^{-1} (\theta - \hat \theta_A(X_A))^2 \hat{\mathcal{I}}(X_A)$

Here, $n_A$ is the sample size of $X_A$. 

This approximation is not yet useful, since it still depends on the unknown $\theta_A$ through the $\mathbb{E} \log f_X(X_1; \theta_A)$ term. 
So, we apply it in maximum likelihood estimation, where that term is constant in $\theta$ and drops out. 
Here, we add new data $X_B$ to the sample, while approximately retaining all information of $X_A$ 
in $T(X_A) = \left(\hat \theta_A(X_A), \hat{\mathcal{I}}(X_A) \right)$.

$\hat \theta_B = \arg\max_\theta f_X(X; \theta) = \arg\max_\theta \log f_X(X_B; \theta) + \log f_X(X_A;\theta) $

$ \approx_{\mathbb{P}} \arg\max_\theta \log f_X(X_B; \theta) $
$+ n_A \mathbb{E} \log f_X(X_1; \theta_A) - n_A 2^{-1} (\theta - \hat \theta_A(X_A))^2 \hat{\mathcal{I}}(X_A) $

$ = \arg\max_\theta \log f_X(X_B; \theta) - n_A 2^{-1} (\theta - \hat \theta_A(X_A))^2 \hat{\mathcal{I}}(X_A) $

We thus identify the universal SSR $2^{-1} (\theta - \hat \theta_A(X_A))^2 \hat{\mathcal{I}}(X_A)$ 
with its natural regularization parameter value $\lambda = n_A$. 
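
To make the universal SSR concrete, here is a minimal sketch assuming a univariate Gaussian model $N(\theta, 1)$, chosen only because the penalized maximizer then has a closed form. It evaluates the scores at $\hat \theta_A$ rather than at the unknown $\theta_A$, keeps only $(\hat \theta_A, \hat{\mathcal{I}}, n_A)$ from $X_A$, and compares the SSR estimate against the MLE on all raw data. The function names are hypothetical.

```python
import random

def pack(x_a):
    """Reduce X_A to (theta_hat_A, I_hat, n_A) under the N(theta, 1) model."""
    n_a = len(x_a)
    theta_hat_a = sum(x_a) / n_a                  # MLE of the mean
    scores = [x - theta_hat_a for x in x_a]       # d/dtheta log f, at theta_hat_A
    i_hat = sum(g * g for g in scores) / n_a      # mean squared score
    return theta_hat_a, i_hat, n_a

def ssr_estimate(x_b, theta_hat_a, i_hat, lam):
    """Maximize sum_B log f(x; theta) - lam/2 * I_hat * (theta - theta_hat_A)^2.

    For N(theta, 1) the objective is quadratic in theta, so solving
    sum_B (x - theta) - lam * I_hat * (theta - theta_hat_A) = 0 is exact.
    """
    n_b = len(x_b)
    return (sum(x_b) + lam * i_hat * theta_hat_a) / (n_b + lam * i_hat)

random.seed(1)
x_a = [random.gauss(1.5, 1.0) for _ in range(2000)]   # old data, later discarded
x_b = [random.gauss(1.5, 1.0) for _ in range(500)]    # new data

theta_hat_a, i_hat, n_a = pack(x_a)
theta_ssr = ssr_estimate(x_b, theta_hat_a, i_hat, lam=n_a)   # natural lambda = n_A
theta_pooled = sum(x_a + x_b) / (len(x_a) + len(x_b))        # MLE on all raw data
```

In this setting the two estimates differ only through the sampling error in $\hat{\mathcal{I}}$. Shrinking `lam` below `n_a` moves `theta_ssr` toward the mean of `x_b` alone.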

**Miniaturization** 

Storing $\hat \theta_A(X_A)$ takes $O(p)$ space and $\hat{\mathcal{I}}(X_A)$ takes $O(p^2)$. 
So, while independent of $n$, the sheer size of modern deep learning models 
makes any $O(p^2)$ storage requirement infeasible. 
So, this work will leverage approximations to $\mathcal{I}$ that remain practically effective 
while requiring only $O(p)$ storage. 
- The simplest approximation is only storing diagonal terms, and zeroing all others. 
- A better (but usually unnecessary) approach is leveraging a Krylov estimate to 
approximate the major eigenvectors of $\mathcal{I}$. 
These are calculated via a modified version of the Lanczos algorithm. 

For experimental validation showing these approximations are effective, see Appendix B.
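
The diagonal approximation can be sketched as follows. This is an illustrative, pure-Python version that assumes per-sample gradients are available as plain lists. The function names are hypothetical, and a real deep learning setting would accumulate squared gradients in the training framework instead.

```python
def diagonal_fisher(grads):
    """Mean of squared per-sample gradients, one entry per parameter: O(p) storage.

    grads: list of per-sample gradient vectors, each a list of p floats.
    """
    n = len(grads)
    p = len(grads[0])
    diag = [0.0] * p
    for g in grads:
        for k in range(p):
            diag[k] += g[k] * g[k] / n
    return diag

def diagonal_ssr(theta, theta_hat_a, diag, lam):
    """Quadratic penalty lam/2 * sum_k diag[k] * (theta[k] - theta_hat_a[k])^2."""
    return 0.5 * lam * sum(
        d * (t - a) ** 2 for d, t, a in zip(diag, theta, theta_hat_a)
    )

grads = [[1.0, 2.0], [3.0, 0.0]]      # toy gradients with n = 2, p = 2
diag = diagonal_fisher(grads)          # [5.0, 2.0]
penalty = diagonal_ssr([1.0, 1.0], [0.0, 0.0], diag, lam=2)   # 7.0
```

Only `diag` and `theta_hat_a` need to be kept, so storage is $2p$ numbers rather than $p + p^2$.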

## Optimal MLE efficiency under deforming distributions 

As an RL agent learns from its environment, it tries new things and produces new kinds of data. 
Accepting that RL doesn't sample from identical distributions invites us to find an optimal transition paradigm. 
Interpreting _optimality_ as _efficiency_, this means working with the Cramer-Rao lower bound (CRB). 
Particularly, since maximum likelihood estimates (MLEs) from different distributions appear biased relative to one another, 
we'll work with the mean squared error (MSE) form of the CRB. 
The strategy is simple: calculate the MSE under biased MLE deformations, and minimize it with respect to $\lambda$. 

To recover an optimal $\lambda$, we use the following definitions. 
- $X_A$ is the vector (or matrix) of observations having density $f_X(x; \theta_A)$. 
- $X_B$ similarly has density $f_X(x; \theta_B)$. 
- $\hat \theta_A = \arg\max_\theta f_X(X_A; \theta)$ is a simple MLE. 
- $\hat \theta_B = \arg\max_\theta f_X(X_B; \theta)$
- $\hat \theta_{AB} = \arg\max_\theta \log f_X(X_A ; \theta) + \log f_X(X_B; \theta) $ 
$ = \arg\max_\theta \sum_{i \in A} \log f_X(X_i ; \theta) + \sum_{i \in B} \log f_X(X_i; \theta) $ 
- $n_A = \#A, n_B = \#B, n = n_A + n_B$
- $p = \lim_{n \to \infty} n_B/n$. We assume this sequence converges so we may have $p \approx n_B/n$. In this section, $p$ denotes this mixing proportion, not the parameter dimension. 

With these definitions in place, we can write $\hat \theta_{AB}$ in a form whose MSE can be optimized in $p$, determining the optimal $\lambda$. 

$\hat \theta_{AB} = \arg\max_\theta \log f_X(X_A ; \theta) + \log f_X(X_B; \theta) $

$ \approx_{\mathbb{P}} \arg\max_\theta \log f_X(X_B; \theta) - n_A 2^{-1} (\theta - \hat \theta_A)^T \hat{\mathcal{I}} (\theta - \hat \theta_A)$

$ \approx \arg\max_\theta n^{-1} \log f_X(X_B; \theta) - (1-p) 2^{-1} (\theta - \hat \theta_A)^T \hat{\mathcal{I}} (\theta - \hat \theta_A)$

**CRB for mixed data** 

The CRB was designed with independent and identically distributed (iid) observations in mind. 
Given our core observation that RL deforms distributions during sampling, 
we'll apply some regularity assumptions on our distributions to keep the CRB relevant. 
Assume that: 
- $f_X(x; \theta)$ is smooth in $\theta$ near $\theta_B$. 
- $\theta_A$ is sufficiently near to $\theta_B$ that $\log f_X(x; \theta_A) \approx \log f_X(x; \theta_B) + (\theta_A - \theta_B)^T \nabla_\theta \log f_X(x; \theta_B) $
$+ 2^{-1} (\theta_A - \theta_B)^T \left( \nabla^2_\theta \log f_X(x; \theta_B) \right) (\theta_A - \theta_B) $ holds.

This implies the log likelihood has approximately the same expected Hessian, $-\mathcal{I}$, at both $\theta_A$ and $\theta_B$. 
Leveraging equivalent Hessians, the CRB can be applied to minimize the MSE. 

For bias term $b = \mathbb{E} \hat \theta_{AB} - \theta_B$ and $b' = \nabla_{\theta_B}b$, the CRB gives the following lower bound on the MSE of a biased estimate.

$\mathbb{E}(\hat \theta_{AB} - \theta_B)^2 \geq (1 + b')^2 \mathcal{I}^{-1} + b^2$

Recognizing $\mathbb{E}\hat \theta_{AB} \approx_{\mathbb{P}} p \theta_B + (1-p) \theta_A$, so that $b = (1-p)(\theta_A - \theta_B)$ and $b' = p - 1$, we recover this inequality. 

$\mathbb{E}(\hat \theta_{AB} - \theta_B)^2 \geq (1 - 1 + p)^2/(p\mathcal{I} + (1-p) \mathcal{I}) + (1-p)^2(\theta_A - \theta_B)^2 $
$ = p^2 \mathcal{I}^{-1} + (1-p)^2 (\theta_A - \theta_B)^2$

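Differentiating by $p$ and setting to zero gives $2 p \mathcal{I}^{-1} - 2 (1-p) (\theta_A - \theta_B)^2 = 0$, from which we recover an optimal approximate value for $\lambda$. 

$\frac{n_B}{n_A} \approx \frac{p}{1-p} = (\theta_A - \theta_B)^2 \mathcal{I} $

Replacing the natural value $\lambda = n_A$, we recover $\hat \lambda = n_B / \left( (\theta_A - \theta_B)^2 \mathcal{I} \right) $. 
So, the old data is down-weighted as $\theta_A$ moves away from $\theta_B$. 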
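
As a numerical check on the stationarity condition, the following sketch minimizes the bound $p^2 \mathcal{I}^{-1} + (1-p)^2 (\theta_A - \theta_B)^2$ over $p$ by brute-force grid search and compares the minimizer's odds $p/(1-p)$ with $(\theta_A - \theta_B)^2 \mathcal{I}$. The values of `fisher` and `gap_sq` are arbitrary illustrative choices, and the function names are hypothetical.

```python
def mse_bound(p, fisher, gap_sq):
    """Lower bound p^2 / I + (1 - p)^2 * (theta_A - theta_B)^2."""
    return p * p / fisher + (1.0 - p) ** 2 * gap_sq

def argmin_on_grid(fisher, gap_sq, steps=100000):
    """Brute-force minimizer of the bound over a midpoint grid on (0, 1)."""
    grid = [(k + 0.5) / steps for k in range(steps)]
    return min(grid, key=lambda p: mse_bound(p, fisher, gap_sq))

fisher, gap_sq = 4.0, 0.5                 # arbitrary illustrative values
p_star = argmin_on_grid(fisher, gap_sq)   # close to 2/3 for these values
odds = p_star / (1.0 - p_star)            # compare with gap_sq * fisher = 2.0
```

Setting the derivative of the bound to zero gives the closed form $p = (\theta_A - \theta_B)^2 \mathcal{I} / (1 + (\theta_A - \theta_B)^2 \mathcal{I})$, which the grid search reproduces up to its resolution.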

Unfortunately, $\hat \lambda$ is used to estimate $\theta_B$, so we do not yet know $(\theta_A - \theta_B)^2$. 
However, in RL, $(\theta_A - \theta_B)^2$ is observed recursively, so it may be projected via a moving average. 
So, $\hat \lambda$ is practically estimable. 
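
One way to project the squared gap is an exponential moving average over the squared steps between successive fits. This is a sketch under assumptions: the text above does not specify the averaging scheme, and the smoothing weight `alpha` and the function name are hypothetical.

```python
def ema_squared_step(theta_hats, alpha=0.5):
    """Exponential moving average of (theta_k - theta_{k-1})^2 over past fits.

    theta_hats: successive scalar estimates, oldest first (at least two).
    alpha: weight on the most recent squared step (an assumed smoothing choice).
    """
    steps_sq = [(b - a) ** 2 for a, b in zip(theta_hats, theta_hats[1:])]
    ema = steps_sq[0]
    for s in steps_sq[1:]:
        ema = alpha * s + (1.0 - alpha) * ema
    return ema

# Squared steps are 1.0 and 4.0, so the average is 0.5 * 4.0 + 0.5 * 1.0 = 2.5.
projected_gap_sq = ema_squared_step([0.0, 1.0, 3.0])
```

The returned value stands in for the not-yet-observed $(\theta_A - \theta_B)^2$ when choosing the regularization weight for the next fit.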

Under this paradigm, old data remains useful even when $n_A$ is arbitrarily large. 
However, using $\lambda = \hat \lambda$ in place of $n_A$ virtually re-weights $X_A$ relative to $X_B$, in effect up-sampling one of them. 

**$\theta_t$ as diffusion $W_t$**

In applying the above theory, estimating $(\theta_A - \theta_B)^2$ via moving average invites viewing 
the path between $\theta_A$ and $\theta_B$ as a diffusion. 
Here, we'll formalize this intuition. 
For all $s \in [0, t]$, let $\theta_s = \sigma W_s$, where $W_s$ is a standard Brownian motion and $\sigma > 0$ is a scale. 
WLOG, assume $\theta_0 = W_0 = 0$. 

We assume that large samples may be taken at any individual $\theta_s$, 
enabling $\hat \theta_s \approx_{\mathbb{P}} \theta_s$ for each $s$, 
via MLE consistency. 
Notice that we do not sample continuously and we are indeed discretizing our sampling points. 
So, for each $\theta_s$, we'll have observations $X_s$ with sample size $n$. 
For simplicity's sake, continuous sampling, instead of batches at $\theta_s$ points, will not be considered here.
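
The discretized diffusion can be simulated directly. The sketch below draws $\theta_{tj/m} = \sigma W_{tj/m}$ at $m$ equally spaced points and checks that the mean squared step recovers $\sigma^2 t/m$, which is the quantity a moving average of squared gaps would track. The large `m` is used only to make this check stable, the parameter values are arbitrary, and the function name is hypothetical.

```python
import math
import random

def sample_path(sigma, t, m, rng):
    """Return theta_{tj/m} = sigma * W_{tj/m} for j = 0..m, with theta_0 = 0."""
    dt = t / m
    path = [0.0]
    for _ in range(m):
        # Brownian increments over dt are independent N(0, dt).
        path.append(path[-1] + sigma * rng.gauss(0.0, math.sqrt(dt)))
    return path

rng = random.Random(0)
sigma, t, m = 0.3, 1.0, 20000             # arbitrary illustrative values
path = sample_path(sigma, t, m, rng)

mean_sq_step = sum((b - a) ** 2 for a, b in zip(path, path[1:])) / m
sigma_sq_hat = mean_sq_step / (t / m)     # estimates sigma^2 = 0.09
```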

Let our estimate be $\hat \theta = p \hat \theta_t + (1-p) (m-1)^{-1} \sum_{j=1}^{m-1} \hat \theta_{tj/m}$ and $p \in (0, 1)$. 
So, we explicitly up-sample $\theta_t$. 
In application, it is essential to keep $m$ somewhat small 
because computational resources must be spent keeping $n$ for all of the $m$ sample batches. 
Sufficiently large $n$ allows $\hat \theta_s \approx_{\mathbb{P}} \theta_s$, 
which our following derivation relies on. 
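
A minimal sketch of this estimate, assuming the weights are $p$ on the newest batch and $1-p$ spread evenly over the $m-1$ earlier batches, so that they form a convex combination. The function name and the input values are hypothetical.

```python
def combined_estimate(theta_hat_t, earlier_theta_hats, p):
    """p * theta_hat_t + (1 - p) * mean of the m - 1 earlier batch estimates."""
    if not 0.0 < p < 1.0:
        raise ValueError("p must lie in (0, 1)")
    history_mean = sum(earlier_theta_hats) / len(earlier_theta_hats)
    return p * theta_hat_t + (1.0 - p) * history_mean

# Newest batch estimate 2.0; earlier estimates average to 1.0.
estimate = combined_estimate(2.0, [0.5, 1.0, 1.5], p=0.75)   # 0.75 * 2.0 + 0.25 * 1.0
```

Choosing `p` above $1/m$ is what up-samples the newest batch relative to an unweighted mean over all $m$ batches.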

Fascinatingly, if $m$ could meaningfully be allowed to become large while maintaining $\hat \theta_s \approx_{\mathbb{P}} \theta_s$, 
our estimate approaches a time integral of the diffusion, $\hat \theta \approx p \hat \theta_t + (1-p) t^{-1} \int_0^t \theta_s ds$ with $\theta_s = \sigma W_s$. 
This is an interesting pure exploration we'll leave out-of-scope for this work. 

To study efficiency in this context, we must now reconstruct the CRB from its Cauchy-Schwarz (CS) foundations.
For notational brevity, 
- let $T(X_j) = \hat \theta_{tj/m}$, and 
- $V_j = \frac{\partial}{\partial \theta_{tj/m}} \log f_X(X_j; \theta_{tj/m})$.  

Also, assume all $\theta_{tj/m}$ are near $\theta_t$, and that $\log f_X(x; \theta)$ is smooth, 
so that each batch has approximately the same Hessian, so $\mathcal{I} = \mathcal{I}_{tj/m}$ for each $j$. 
These definitions and assumptions give us the following CS components. 

$\text{Var}V = p \mathcal{I} + (1-p) (m-1)^{-1} \sum_{j=1}^{m-1} \mathcal{I} = \mathcal{I} $ by independence of samples $X_j$. 

$\forall j \in \{0, 1, 2, \ldots, m\}, \mathbb{E}V_j = 0$ 
$ \Rightarrow \forall j, \text{Cov}(T(X_j), V_j) = \mathbb{E}T(X_j) V_j - 0$ 
$ = \mathbb{E} T(X_j) \frac{\partial}{\partial \theta_{tj/m}} \log f_X(X_j; \theta_{tj/m}) $
$ = \frac{\partial}{\partial \theta_{tj/m}} \mathbb{E} T(X_j) $ 
$ = \frac{\partial}{\partial \theta_{tj/m}} \psi_j $, where $\psi_j = \mathbb{E} T(X_j)$. 
The usual CRB derivation depends on this exchange of integral and derivative. 

In our case, the bias is $b = \mathbb{E}\hat \theta - \theta_t \approx_{\mathbb{P}} \mathbb{E}[p \theta_t + (1-p)(m-1)^{-1}\sum_{j=1}^{m-1}\theta_{tj/m}] - \theta_t $

TODO: Experimental results illustrating findings 

## Alternative perspective: submanifolds versus data 

TODO: Data effectively reduces parameter dimension 

TODO: Reducing MSE with submanifolds 

## Discussion 

TODO: SSRs provide a universal miniaturization paradigm 

TODO: RL lift should be expected by accounting for deforming distributions 

TODO: Submanifolds should be utilized wherever possible to reduce parameter dimension 

## References 

TODO: add content here, as needed 

## Appendix A: modified Lanczos algorithm 

TODO: my limited-memory Lanczos algorithm 

## Appendix B: Effectiveness of $O(p)$ Hessian approximations 

TODO: Show results 