In this notebook we explore how to implement black-box variational inference (BVI). The technique is general and is not bound specifically to the one introduced in [Black Box Variational Inference, R. Ranganath, et al.](https://arxiv.org/abs/1401.0118). From now on we will use BVI to refer to variational inference methods that work on any latent variable model without model-specific derivations.

# How it works

Probabilistic models with latent variables have many applications in machine learning. Latent variable models allow us to reason about the hidden structure among the latent variables given data. The main problem when working with these models is inferring the posterior distribution of the latent variables, which is what allows us to say something about them. Unfortunately, computing the posterior distribution exactly is intractable for most models of interest. This forces practitioners to resort to approximate methods.

There are two main approximation schemes: MCMC and variational inference (VI). Variational inference casts posterior approximation as an optimization problem, which allows it to scale to large datasets. On the other hand, the optimization procedure forces practitioners to derive the gradient formulas for each model by hand. BVI promises a way to estimate the gradient so that the practitioner is freed from this tedious task and can focus on evaluating different statistical models.

BVI works as follows. Suppose we have a model $p(x,z)$, where $x$ and $z$ denote the observed and latent variables, respectively. (Normally we would use boldface to indicate that $x$ and $z$ are collections of variables.) The goal is to compute $p(z|x)$, which is hard for most models of interest. VI approximates this posterior with a variational distribution $q(z|\lambda)$. It picks the optimal value $\lambda^{*}$ by maximizing an objective function called the ELBO (evidence lower bound). More details about the ELBO are presented later.

In order to maximize the ELBO using numerical optimization, we need to compute its gradient. It is sometimes impossible to derive a closed-form formula for the gradient. BVI overcomes this problem by estimating the gradient using Monte Carlo methods.

That is pretty much all there is to BVI. There are more details to take into account when implementing it, but the idea is just as we discussed.

# ELBO

The first thing to remember is that maximizing ELBO(q) is equivalent to minimizing KL(q||p), the KL divergence from the variational distribution to the true posterior distribution. This holds because $\log p(x) = \mathcal{L}(\lambda) + KL(q(z|\lambda)\,||\,p(z|x))$ and $\log p(x)$ does not depend on $\lambda$. This gives the justification for BVI, and for VI in general. ELBO(q) is defined as follows:
$$\mathcal{L}(\lambda) = E_{q(z|\lambda)}\big[ \log p(x,z) - \log q(z|\lambda)\big]$$
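
To make the definition concrete, here is a minimal sketch of a Monte Carlo estimate of the ELBO. Everything in it is an assumption chosen for illustration and not part of the original text: the toy model ($z \sim N(0,1)$, $x_i | z \sim N(z,1)$), the Gaussian variational family with $\lambda = (\mu, \log\sigma)$, and the helper names `log_normal`, `log_joint` and `elbo_estimate`. The toy model is conjugate, so the exact posterior is known and we can check that the ELBO is larger there than at a shifted $q$.

```python
import numpy as np

def log_normal(v, mean, std):
    """Log density of N(mean, std^2) evaluated elementwise at v."""
    return -0.5 * np.log(2 * np.pi) - np.log(std) - 0.5 * ((v - mean) / std) ** 2

def log_joint(x, z):
    """log p(x, z) for the assumed toy model z ~ N(0, 1), x_i | z ~ N(z, 1).

    x has shape (N,), z has shape (S,); the result has shape (S,).
    """
    log_prior = log_normal(z, 0.0, 1.0)
    log_lik = log_normal(x[None, :], z[:, None], 1.0).sum(axis=1)
    return log_prior + log_lik

def elbo_estimate(x, mu, log_sigma, num_samples, rng):
    """Monte Carlo estimate of E_q[log p(x, z) - log q(z | lambda)]."""
    sigma = np.exp(log_sigma)
    z = rng.normal(mu, sigma, size=num_samples)  # z_s ~ q(z | lambda)
    return np.mean(log_joint(x, z) - log_normal(z, mu, sigma))

rng = np.random.default_rng(0)
x = rng.normal(1.0, 1.0, size=20)  # synthetic data, for illustration only

# Exact posterior of this conjugate model: N(sum(x) / (N + 1), 1 / (N + 1)).
post_mean = x.sum() / (len(x) + 1)
post_std = np.sqrt(1.0 / (len(x) + 1))

elbo_at_posterior = elbo_estimate(x, post_mean, np.log(post_std), 5000, rng)
elbo_shifted = elbo_estimate(x, post_mean + 1.0, np.log(post_std), 5000, rng)
print(elbo_at_posterior, elbo_shifted)
```

When $q$ equals the exact posterior, the integrand is constant in $z$ and equal to $\log p(x)$. Shifting the mean of $q$ lowers the ELBO by the KL divergence between the two Gaussians.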

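We want to compute the gradient $\nabla_{\lambda} \mathcal{L}$. Using the identity $\nabla_{\lambda} q(z|\lambda) = q(z|\lambda)\nabla_{\lambda} \log q(z|\lambda)$, we can rewrite $\nabla_{\lambda} \mathcal{L}$ as an expectation with respect to the variational distribution. This expectation can then be estimated by sampling from the variational distribution. That is: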

$$\nabla_{\lambda} \mathcal{L} \approx \frac{1}{S}\sum_{s=1}^{S} \nabla_{\lambda}\log q(z_s|\lambda)\,\big(\log p(x, z_s) - \log q(z_s |\lambda)\big),$$

where $z_s \sim q(z|\lambda).$
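
The estimator can be sketched in a few lines of NumPy. This is an illustrative sketch under assumptions that do not come from the text above: a toy model $z \sim N(0,1)$, $x_i | z \sim N(z,1)$, a Gaussian $q(z|\lambda)$ with $\lambda = (\mu, \log\sigma)$, hand-picked data, and the hypothetical helper names `log_normal`, `log_joint` and `score_function_gradient`. For this particular model the ELBO gradient also has a closed form, which we use only as a sanity check.

```python
import numpy as np

def log_normal(v, mean, std):
    """Log density of N(mean, std^2) evaluated elementwise at v."""
    return -0.5 * np.log(2 * np.pi) - np.log(std) - 0.5 * ((v - mean) / std) ** 2

def log_joint(x, z):
    """log p(x, z) for the assumed toy model z ~ N(0, 1), x_i | z ~ N(z, 1)."""
    log_prior = log_normal(z, 0.0, 1.0)
    log_lik = log_normal(x[None, :], z[:, None], 1.0).sum(axis=1)
    return log_prior + log_lik

def score_function_gradient(x, mu, log_sigma, num_samples, rng):
    """Estimate the ELBO gradient w.r.t. lambda = (mu, log_sigma)."""
    sigma = np.exp(log_sigma)
    z = rng.normal(mu, sigma, size=num_samples)      # z_s ~ q(z | lambda)
    # f(z_s) = log p(x, z_s) - log q(z_s | lambda)
    f = log_joint(x, z) - log_normal(z, mu, sigma)
    # Gradient of log q(z_s | lambda) w.r.t. mu and log_sigma.
    score_mu = (z - mu) / sigma**2
    score_log_sigma = ((z - mu) / sigma) ** 2 - 1.0
    scores = np.stack([score_mu, score_log_sigma], axis=1)  # shape (S, 2)
    return np.mean(scores * f[:, None], axis=0)

x = np.array([0.5, 1.0, 1.5, 0.8, 1.2])  # made-up observations
mu, log_sigma = 0.0, 0.0
rng = np.random.default_rng(0)
grad_estimate = score_function_gradient(x, mu, log_sigma, 200_000, rng)

# Closed-form gradient, available only because the toy model is conjugate.
n = len(x)
grad_exact = np.array([x.sum() - (n + 1) * mu,
                       1.0 - (n + 1) * np.exp(2 * log_sigma)])
print(grad_estimate, grad_exact)
```

Note that the function only needs to evaluate $\log p(x, z)$ and to sample from and differentiate $\log q$. Nothing in it is derived from the structure of the model, which is what makes the method black-box. The large number of samples used here hints at how noisy the raw estimator is.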

With this gradient estimator, we can perform stochastic optimization.
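
As a rough illustration, the loop below runs stochastic gradient ascent on the ELBO with a score-function gradient estimate. The setup is assumed for the sake of the example and is not taken from the text: the toy model $z \sim N(0,1)$, $x_i | z \sim N(z,1)$, a Gaussian $q$ with $\lambda = (\mu, \log\sigma)$, the decaying step size $0.5 / (t + 10)$, and the helper names. Other step-size schedules would work as well; this one is simply easy to read.

```python
import numpy as np

def log_normal(v, mean, std):
    """Log density of N(mean, std^2) evaluated elementwise at v."""
    return -0.5 * np.log(2 * np.pi) - np.log(std) - 0.5 * ((v - mean) / std) ** 2

def log_joint(x, z):
    """log p(x, z) for the assumed toy model z ~ N(0, 1), x_i | z ~ N(z, 1)."""
    log_prior = log_normal(z, 0.0, 1.0)
    log_lik = log_normal(x[None, :], z[:, None], 1.0).sum(axis=1)
    return log_prior + log_lik

def score_function_gradient(x, mu, log_sigma, num_samples, rng):
    """Monte Carlo estimate of the ELBO gradient w.r.t. (mu, log_sigma)."""
    sigma = np.exp(log_sigma)
    z = rng.normal(mu, sigma, size=num_samples)
    f = log_joint(x, z) - log_normal(z, mu, sigma)
    scores = np.stack([(z - mu) / sigma**2,
                       ((z - mu) / sigma) ** 2 - 1.0], axis=1)
    return np.mean(scores * f[:, None], axis=0)

def fit(x, num_steps=2000, num_samples=500, seed=0):
    """Stochastic gradient ascent on the ELBO, starting from q = N(0, 1)."""
    rng = np.random.default_rng(seed)
    lam = np.array([0.0, 0.0])  # lambda = (mu, log_sigma)
    for t in range(num_steps):
        grad = score_function_gradient(x, lam[0], lam[1], num_samples, rng)
        step = 0.5 / (t + 10.0)   # assumed decaying step-size schedule
        lam = lam + step * grad   # ascent step, since we maximize the ELBO
    return lam

x = np.array([0.5, 1.0, 1.5, 0.8, 1.2])  # made-up observations
mu_hat, log_sigma_hat = fit(x)

# Exact posterior of the conjugate toy model, used only for comparison.
post_mean = x.sum() / (len(x) + 1)
post_std = np.sqrt(1.0 / (len(x) + 1))
print(mu_hat, np.exp(log_sigma_hat), post_mean, post_std)
```

Because the Gaussian family contains the exact posterior of this toy model, the fitted $\mu$ and $\sigma$ should land close to the posterior mean and standard deviation.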

# Blackbox Variational Inference

The optimization still relies on a noisy estimate of the gradient, so we need to keep the variance of this estimate under control. There are two methods:

- Rao-Blackwellization: The idea is to replace a function whose expectation is being estimated by Monte Carlo with another function that has the same expectation but smaller variance. That is, to estimate $E_{q}[f]$ via Monte Carlo, we compute the empirical average of $\hat{f}$, where $\hat{f}$ is picked so that $E_q[f] = E_q[\hat{f}]$ and $Var_q[f] > Var_q[\hat{f}]$.
- Control Variates: This is to be filled later.
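
Both methods share the same goal: keep the expectation and shrink the variance. The sketch below illustrates that goal with a simple control variate and is not a full treatment of either method. It relies on the fact that the score function $\nabla_{\lambda}\log q(z|\lambda)$ has zero expectation under $q$, so subtracting a multiple of it from each gradient term leaves the mean unchanged. The toy model ($z \sim N(0,1)$, $x_i | z \sim N(z,1)$), the fixed $q = N(0,1)$, the pilot-sample estimate of the scaling `baseline`, and all helper names are assumptions made for this example. Only the gradient with respect to $\mu$ is shown.

```python
import numpy as np

def log_normal(v, mean, std):
    """Log density of N(mean, std^2) evaluated elementwise at v."""
    return -0.5 * np.log(2 * np.pi) - np.log(std) - 0.5 * ((v - mean) / std) ** 2

def log_joint(x, z):
    """log p(x, z) for the assumed toy model z ~ N(0, 1), x_i | z ~ N(z, 1)."""
    log_prior = log_normal(z, 0.0, 1.0)
    log_lik = log_normal(x[None, :], z[:, None], 1.0).sum(axis=1)
    return log_prior + log_lik

x = np.array([0.5, 1.0, 1.5, 0.8, 1.2])  # made-up observations
mu, sigma = 0.0, 1.0                     # q(z | lambda) = N(0, 1)
rng = np.random.default_rng(0)

def sample_terms(num_samples):
    """Return f(z_s) and the score w.r.t. mu for fresh samples z_s ~ q."""
    z = rng.normal(mu, sigma, size=num_samples)
    f = log_joint(x, z) - log_normal(z, mu, sigma)
    score = (z - mu) / sigma**2          # E_q[score] = 0
    return f, score

# Estimate the scaling from a separate pilot sample so that the main
# estimate below stays unbiased.
f_pilot, h_pilot = sample_terms(2000)
baseline = np.cov(h_pilot * f_pilot, h_pilot)[0, 1] / np.var(h_pilot, ddof=1)

f, h = sample_terms(100_000)
naive = h * f                     # per-sample terms of the plain estimator
controlled = h * f - baseline * h # same expectation, smaller variance

print(naive.mean(), controlled.mean())
print(naive.var(), controlled.var())
```

The two sample means should agree up to Monte Carlo noise, while the variance of the controlled terms should be noticeably smaller. That is exactly the property $E_q[f] = E_q[\hat{f}]$, $Var_q[f] > Var_q[\hat{f}]$ described above.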