# SVI and gradient estimators

### Setup

We've defined a Pyro model with observations ${\bf x}$ and latents ${\bf z}$ of the form $p_{\theta}({\bf x}, {\bf z}) = p_{\theta}({\bf x}|{\bf z}) p_{\theta}({\bf z})$. We've also defined a Pyro guide (i.e. a variational distribution) of the form $q_{\rm \phi}({\bf z})$. Here ${\theta}$ and $\phi$ are variational parameters for the model and guide, respectively. (In particular these are _not_ random variables that call for a Bayesian treatment).

We'd like to maximize the log evidence $\log p_{\theta}({\bf x})$ by maximizing the ELBO (the evidence lower bound) given by 

${\rm ELBO} \equiv \mathbb{E}_{q_{\phi}({\bf z})} \left [ 
\log p_{\theta}({\bf x}, {\bf z}) - \log q_{\phi}({\bf z})
\right]$
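
To make the definition concrete, here is a minimal sketch of a Monte Carlo ELBO estimate in plain Python. The toy model (${\bf z} \sim \mathcal{N}(0, 1)$, ${\bf x} | {\bf z} \sim \mathcal{N}({\bf z}, 1)$), the Gaussian guide, and the helper names `log_normal` and `elbo_estimate` are all assumptions made for illustration; none of this is Pyro code. The model is chosen because its log evidence is known in closed form, so the bound can be checked directly.

```python
import math
import random


def log_normal(value, loc, scale):
    """Log density of a univariate normal N(loc, scale^2) at `value`."""
    return (-0.5 * math.log(2.0 * math.pi) - math.log(scale)
            - 0.5 * ((value - loc) / scale) ** 2)


def elbo_estimate(x, guide_loc, guide_scale, num_samples, rng):
    """Monte Carlo estimate of E_q[log p(x, z) - log q(z)] for the toy model."""
    total = 0.0
    for _ in range(num_samples):
        z = rng.gauss(guide_loc, guide_scale)  # z ~ q(z)
        log_joint = log_normal(z, 0.0, 1.0) + log_normal(x, z, 1.0)
        total += log_joint - log_normal(z, guide_loc, guide_scale)
    return total / num_samples


rng = random.Random(0)
x_obs = 1.0
# Exact log evidence for this toy model: marginally x ~ N(0, 2).
log_evidence = log_normal(x_obs, 0.0, math.sqrt(2.0))
# A deliberately poor guide gives an ELBO below the log evidence.
elbo_loose = elbo_estimate(x_obs, 0.0, 1.0, 10000, rng)
# The exact posterior N(x/2, 1/2) as guide makes the bound tight.
elbo_tight = elbo_estimate(x_obs, x_obs / 2.0, math.sqrt(0.5), 10000, rng)
print(log_evidence, elbo_loose, elbo_tight)
```

When the guide equals the exact posterior, every sample of $\log p({\bf x}, {\bf z}) - \log q({\bf z})$ equals $\log p({\bf x})$, so the estimate has zero variance.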

To do this we're going to take (stochastic) gradient steps on the ELBO in the parameter space $\{ \theta, \phi \}$ (see reference [1] for early work on this approach). So we need to be able to compute unbiased estimates of 

$\nabla_{\theta,\phi} {\rm ELBO} = \nabla_{\theta,\phi}\mathbb{E}_{q_{\phi}({\bf z})} \left [ 
\log p_{\theta}({\bf x}, {\bf z}) - \log q_{\phi}({\bf z})
\right]$

How do we do this for general stochastic functions `model()` and `guide()`? To simplify notation let's generalize our discussion a bit and ask how we can compute gradients of expectations of an arbitrary cost function $f({\bf z})$. Let's also drop any distinction between $\theta$ and $\phi$. So we want to compute

$\nabla_{\phi}\mathbb{E}_{q_{\phi}({\bf z})} \left [
f_{\phi}({\bf z}) \right]$

Let's start with the easiest case.

### The easiest case: reparameterizable random variables

Suppose that we can reparameterize things such that 

$\mathbb{E}_{q_{\phi}({\bf z})} \left [f_{\phi}({\bf z}) \right]
=\mathbb{E}_{q({\bf \epsilon})} \left [f_{\phi}(g_{\phi}({\bf \epsilon})) \right]$

Crucially we've moved all the $\phi$ dependence inside of the expectation; $q({\bf \epsilon})$ is a fixed distribution with no dependence on $\phi$. This kind of reparameterization can be done for many distributions (e.g. the normal distribution); see reference [2] for a discussion. In this case we can pass the gradient straight through the expectation to get

$\nabla_{\phi}\mathbb{E}_{q({\bf \epsilon})} \left [f_{\phi}(g_{\phi}({\bf \epsilon})) \right]=
\mathbb{E}_{q({\bf \epsilon})} \left [\nabla_{\phi}f_{\phi}(g_{\phi}({\bf \epsilon})) \right]$

Assuming $f(\cdot)$ and $g(\cdot)$ are sufficiently smooth, we can now get unbiased estimates of the gradient of interest by taking a Monte Carlo estimate of this expectation.
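
As a minimal sketch of this estimator, assume $q_{\phi}$ is a unit-variance normal with mean `mu` and take the cost $f({\bf z}) = {\bf z}^2$ (both are choices made for illustration, as is the function name `reparam_gradient`). Then $g_{\phi}(\epsilon) = \mu + \epsilon$ and the exact gradient is $2\mu$, which gives something to compare against.

```python
import random


def reparam_gradient(mu, num_samples, rng):
    """Estimate d/dmu E_{z ~ N(mu, 1)}[z^2] using z = g_mu(eps) = mu + eps."""
    total = 0.0
    for _ in range(num_samples):
        eps = rng.gauss(0.0, 1.0)  # eps ~ q(eps), which does not depend on mu
        z = mu + eps               # z = g_mu(eps)
        total += 2.0 * z           # d/dmu f(g_mu(eps)) = 2 * (mu + eps)
    return total / num_samples


rng = random.Random(0)
mu = 1.5
grad_reparam = reparam_gradient(mu, 20000, rng)
grad_exact = 2.0 * mu  # because E[z^2] = mu^2 + 1
print(grad_reparam, grad_exact)
```

In practice the derivative inside the loop would come from automatic differentiation rather than being written by hand.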

### The trickier case: non-reparameterizable random variables

What if we can't do the above reparameterization? Unfortunately this is the case for many distributions of interest, for example all discrete distributions. In this case our estimator takes a somewhat more complicated form.

We begin by expanding the gradient of interest as

$\nabla_{\phi}\mathbb{E}_{q_{\phi}({\bf z})} \left [
f_{\phi}({\bf z}) \right]= 
\nabla_{\phi} \int d{\bf z} \; q_{\phi}({\bf z}) f_{\phi}({\bf z})$

and use the product rule to write this as 

$ \int d{\bf z} \; \left \{ (\nabla_{\phi}  q_{\phi}({\bf z})) f_{\phi}({\bf z}) + q_{\phi}({\bf z})(\nabla_{\phi} f_{\phi}({\bf z}))\right \} $

At this point we run into a problem. We know how to take samples from $q(\cdot)$ (we just run the guide forward) but $\nabla_{\phi}  q_{\phi}({\bf z})$ isn't even a valid probability density. So we need to massage this formula so that it's in the form of an expectation w.r.t. $q(\cdot)$. This is easily done using the identity

$ \nabla_{\phi}  q_{\phi}({\bf z}) = 
q_{\phi}({\bf z})\nabla_{\phi} \log q_{\phi}({\bf z})$

which allows us to rewrite the gradient of interest as 

$\mathbb{E}_{q_{\phi}({\bf z})} \left [
(\nabla_{\phi} \log q_{\phi}({\bf z})) f_{\phi}({\bf z}) + \nabla_{\phi} f_{\phi}({\bf z})\right]$

This form of the gradient estimator, variously known as the REINFORCE estimator, the score function estimator, or the likelihood ratio estimator, is amenable to simple Monte Carlo estimation.

Note that one way to package this result (which is convenient for implementation) is to introduce a surrogate loss function

${\rm surrogate \;loss} \equiv
\log q_{\phi}({\bf z}) \overline{f_{\phi}({\bf z})} + f_{\phi}({\bf z})$

Here the bar indicates that the term is held constant (i.e. it is not to be differentiated w.r.t. $\phi$). To get a (single-sample) Monte Carlo gradient estimate, we sample the latent random variables, compute the surrogate loss, and differentiate. The result is an unbiased estimate of $\nabla_{\phi}\mathbb{E}_{q_{\phi}({\bf z})} \left [
f_{\phi}({\bf z}) \right]$.
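
The recipe can be sketched for a single Bernoulli latent, which cannot be reparameterized. The logit parameterization $q_{\phi}({\bf z}=1) = \sigma(\phi)$, the particular cost, and the function names below are assumptions chosen for illustration. The cost here has no direct dependence on $\phi$, so the second term of the surrogate loss contributes nothing to the gradient.

```python
import math
import random


def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))


def cost(z):
    """An arbitrary cost f(z) with no direct dependence on phi."""
    return (z - 0.3) ** 2


def score_function_gradient(phi, num_samples, rng):
    """Estimate d/dphi E_{z ~ Bernoulli(sigmoid(phi))}[f(z)]."""
    p = sigmoid(phi)
    total = 0.0
    for _ in range(num_samples):
        z = 1.0 if rng.random() < p else 0.0  # run the "guide" forward
        score = z - p  # d/dphi log q_phi(z) for the logit parameterization
        # Gradient of the surrogate loss log q(z) * f_bar(z) + f(z):
        # the first term gives score * f(z), the second gives zero here.
        total += score * cost(z)
    return total / num_samples


rng = random.Random(0)
phi = 0.4
grad_score = score_function_gradient(phi, 50000, rng)
p = sigmoid(phi)
grad_exact = p * (1.0 - p) * (cost(1.0) - cost(0.0))
print(grad_score, grad_exact)
```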

### Variance or Why I Wish I Was Doing MLE Deep Learning

We now have a general recipe for an unbiased gradient estimator of expectations of cost functions. Unfortunately, in the more general case where our $q(\cdot)$ includes non-reparameterizable random variables, this estimator tends to have high variance. Indeed in many cases of interest the variance is so high that the estimator is effectively unusable. So we need strategies to reduce variance (for a discussion see reference [3]). We're going to pursue two strategies. The first strategy takes advantage of the particular structure of the cost function $f(\cdot)$. The second strategy effectively introduces a way to reduce variance by using information from previous estimates of 
$\mathbb{E}_{q_{\phi}({\bf z})} [ f_{\phi}({\bf z})]$. As such it is somewhat analogous to using momentum in stochastic gradient descent. 
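
To get a feel for the variance gap, the sketch below compares the two estimators on a problem where both apply: a unit-variance normal with mean `mu` and cost $f({\bf z}) = {\bf z}^2$. The setup and names are assumptions for illustration. Both estimators target the same gradient $2\mu$; for $\mu = 1$ the per-sample variances work out analytically to 4 for the reparameterized estimator and 30 for the score function estimator.

```python
import random
import statistics


def gradient_samples(mu, num_samples, rng):
    """Per-sample gradient estimates of d/dmu E_{z ~ N(mu, 1)}[z^2]."""
    reparam, score = [], []
    for _ in range(num_samples):
        z = rng.gauss(mu, 1.0)
        reparam.append(2.0 * z)          # reparameterized: d/dmu (mu + eps)^2
        score.append((z - mu) * z ** 2)  # score function: grad log q(z) * f(z)
    return reparam, score


rng = random.Random(0)
reparam_samples, score_samples = gradient_samples(1.0, 50000, rng)
print("means:", statistics.mean(reparam_samples), statistics.mean(score_samples))
print("variances:", statistics.variance(reparam_samples),
      statistics.variance(score_samples))
```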

### Reducing variance by paying attention to dependency structure

In the above discussion we stuck to a general cost function $f_{\phi}({\bf z})$. We could continue in this vein (the approach we're about to discuss is applicable in the general case) but for concreteness let's zoom back in. In the case of stochastic variational inference, we're interested in a particular cost function of the form

$\log p_{\theta}({\bf x} | {\rm Pa}_p ({\bf x})) +
\sum_i \log p_{\theta}({\bf z}_i | {\rm Pa}_p ({\bf z}_i)) 
- \sum_i \log q_{\phi}({\bf z}_i | {\rm Pa}_q ({\bf z}_i))$

where we've broken the log ratio $\log p_{\theta}({\bf x}, {\bf z})/q_{\phi}({\bf z})$ into a sum over the different latent random variables $\{{\bf z}_i \}$. We've also introduced the notation 
${\rm Pa}_p (\cdot)$ and ${\rm Pa}_q (\cdot)$ to denote the parents of a particular random variable in the model and in the guide, respectively. (The reader might worry what the appropriate notion of dependency would be in the case of general stochastic functions; here we simply mean regular ol' dependency within a single execution trace). The point is that different terms in the cost function have different dependencies on the random variables $\{ {\bf z}_i \}$ and this is something we can take advantage of.

To make a long story short, for any non-reparameterizable latent random variable ${\bf z}_i$ the surrogate loss is going to have a term 

$\log q_{\phi}({\bf z}_i) \overline{f_{\phi}({\bf z})} $

It turns out that we can remove some of the terms in $\overline{f_{\phi}({\bf z})}$ and still get an unbiased gradient estimator; furthermore, doing so will generally decrease the variance. In particular (see reference [3] for details) we can remove any terms in $\overline{f_{\phi}({\bf z})}$ that are not downstream of the latent variable ${\bf z}_i$ (downstream w.r.t. the dependency structure of the guide). 
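
A toy sketch of the effect, under assumptions chosen for illustration: two independent Bernoulli latents `z1` and `z2` with logits `phi1` and `phi2`, and a cost that is a sum of one term depending only on `z1` and one depending only on `z2`. The second term is not downstream of `z1`, so it can be dropped from the factor multiplying $\nabla_{\phi_1} \log q(z_1)$. This is hand-written Python, not what Pyro does internally.

```python
import math
import random
import statistics


def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))


def phi1_gradient_samples(phi1, phi2, num_samples, rng):
    """Per-sample estimates of the gradient w.r.t. phi1, with and without pruning."""
    p1, p2 = sigmoid(phi1), sigmoid(phi2)
    full, pruned = [], []
    for _ in range(num_samples):
        z1 = 1.0 if rng.random() < p1 else 0.0
        z2 = 1.0 if rng.random() < p2 else 0.0
        term1 = 2.0 * z1   # cost term that depends on z1
        term2 = 10.0 * z2  # cost term that depends only on z2
        score1 = z1 - p1   # d/dphi1 log q(z1)
        full.append(score1 * (term1 + term2))  # keeps the irrelevant term
        pruned.append(score1 * term1)          # irrelevant term removed
    return full, pruned


rng = random.Random(0)
full_samples, pruned_samples = phi1_gradient_samples(0.0, 0.0, 50000, rng)
grad_exact = 0.25 * 2.0  # p1 * (1 - p1) * (term1 at z1=1 minus term1 at z1=0)
print("means:", statistics.mean(full_samples), statistics.mean(pruned_samples))
print("variances:", statistics.variance(full_samples),
      statistics.variance(pruned_samples))
```

Both estimators agree in expectation, but the pruned one has far lower variance because the dropped term only ever contributed noise.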

In Pyro, all of this logic is taken care of automatically by the `SVI` class. In particular as long as we switch on `trace_graph=True`, Pyro will keep track of the dependency structure within the execution traces of the model and guide and construct a surrogate loss that has all the unnecessary terms removed:

    elbo = SVI(model, guide, optimizer, "ELBO", trace_graph=True)

Note that leveraging this dependency information requires extra computation, so `trace_graph=True` should only be used when your model has non-reparameterizable random variables. 


### Aside: Dependency tracking in Pyro

Finally, a word about dependency tracking. Tracking dependency within a stochastic function that includes arbitrary Python code is a bit tricky. The approach currently implemented in Pyro is analogous to the one used in WebPPL (cf. reference [4]). Briefly, a conservative notion of dependency is used that relies on sequential ordering. If random variable ${\bf z}_2$ follows ${\bf z}_1$ in a given stochastic function then ${\bf z}_2$ _may be_ dependent on ${\bf z}_1$ and therefore _is_ assumed to be dependent. To mitigate the overly coarse conclusions drawn by this kind of dependency, Pyro includes constructs for declaring things as independent, namely `irange` and `iarange` [**SEE LINK**]. For use cases with non-reparameterizable variables, it is therefore important for the user to make use of these constructs to take full advantage of the variance reduction provided by `SVI`. In some cases it may also pay to consider reordering random variables within a stochastic function (if possible). It's also worth noting that we expect to add finer notions of dependency tracking in a future version of Pyro.
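
The conservative, order-based notion of dependency can be illustrated with a toy function. This is only a sketch of the idea and not Pyro's implementation: the function name `conservative_downstream`, the site names, and the `independent_groups` argument (a stand-in for an explicit independence declaration) are all hypothetical.

```python
def conservative_downstream(site_order, independent_groups=()):
    """Map each site to the later sites that are assumed to depend on it.

    Every later site is treated as downstream unless both sites belong to
    the same group that was explicitly declared mutually independent.
    """
    group_of = {}
    for index, group in enumerate(independent_groups):
        for name in group:
            group_of[name] = index
    downstream = {}
    for i, name in enumerate(site_order):
        later = site_order[i + 1:]
        downstream[name] = [
            other for other in later
            if not (name in group_of and group_of.get(other) == group_of[name])
        ]
    return downstream


order = ["z_1", "z_2", "z_3"]
# Purely sequential view: everything later is assumed to be downstream.
sequential = conservative_downstream(order)
# Declaring z_2 and z_3 independent of each other removes one assumed edge.
declared = conservative_downstream(order, independent_groups=[["z_2", "z_3"]])
print(sequential)
print(declared)
```

Fewer assumed downstream sites means more cost terms can be dropped for a given latent variable, which is where the variance reduction comes from.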

### Reducing variance with data-dependent baselines

The second strategy for reducing variance in our ELBO gradient estimator goes under the name of baselines (see e.g. reference [5]). It actually makes use of the same bit of math that underlies the variance reduction strategy discussed above, except now instead of removing terms we're going to add terms. Basically, instead of removing terms with zero expectation that tend to _contribute_ to the variance, we're going to add specially chosen terms with zero expectation that work to _reduce_ the variance.
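
As a minimal sketch of the idea, consider the simplest kind of baseline: a decaying average of previously observed cost values, subtracted from the cost before it multiplies the score. This is not the data-dependent (e.g. neural) baseline of reference [5]; the Bernoulli guide, the cost, the `decay` parameter, and the function names are assumptions made for illustration. Because the baseline is built only from earlier samples, the extra term has zero expectation and the estimator stays unbiased.

```python
import math
import random
import statistics


def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))


def baseline_gradient_samples(phi, num_samples, decay, rng):
    """Score function estimates with and without a decaying-average baseline."""
    p = sigmoid(phi)
    baseline = 0.0
    plain, with_baseline = [], []
    for _ in range(num_samples):
        z = 1.0 if rng.random() < p else 0.0
        f = 10.0 + z   # cost with a large constant offset
        score = z - p  # d/dphi log q_phi(z)
        plain.append(score * f)
        with_baseline.append(score * (f - baseline))
        # Update after use, so the baseline never depends on the current sample.
        baseline = decay * baseline + (1.0 - decay) * f
    return plain, with_baseline


rng = random.Random(0)
plain_samples, baseline_samples = baseline_gradient_samples(0.0, 100000, 0.9, rng)
grad_exact = 0.25  # p * (1 - p) * (f(1) - f(0)) at phi = 0
print("means:", statistics.mean(plain_samples), statistics.mean(baseline_samples))
print("variances:", statistics.variance(plain_samples),
      statistics.variance(baseline_samples))
```

The large constant offset in the cost is what makes the plain estimator so noisy; the baseline removes most of it.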


## References

[1] `Black Box Variational Inference`,<br/>&nbsp;&nbsp;&nbsp;&nbsp;
Rajesh Ranganath, Sean Gerrish, David M. Blei

[2] `Auto-Encoding Variational Bayes`,<br/>&nbsp;&nbsp;&nbsp;&nbsp;
Diederik P Kingma, Max Welling

[3] `Gradient Estimation Using Stochastic Computation Graphs`,<br/>&nbsp;&nbsp;&nbsp;&nbsp;
John Schulman, Nicolas Heess, Theophane Weber, Pieter Abbeel

[4] `Deep Amortized Inference for Probabilistic Programs`,<br/>&nbsp;&nbsp;&nbsp;&nbsp;
Daniel Ritchie, Paul Horsfall, Noah D. Goodman

[5] `Neural Variational Inference and Learning in Belief Networks`,<br/>&nbsp;&nbsp;&nbsp;&nbsp;
Andriy Mnih, Karol Gregor