# Introduction
We want to maximize the log-likelihood of the data, $\log p_{\theta}(x)$, **indirectly** by maximizing its lower bound, the ELBO, instead.
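
As a reminder of why the ELBO is a lower bound, the standard decomposition of the log-likelihood can be written as follows. It is stated in the notation of this note, and it assumes that $q_{\phi}(z)$ approximates the posterior $p_{\theta}(z|x)$:

$$\log p_{\theta}(x) = ELBO + KL\big(q_{\phi}(z)\,\|\,p_{\theta}(z|x)\big) \geq ELBO$$

The inequality holds because the KL divergence is non-negative, so raising the ELBO either raises $\log p_{\theta}(x)$ or moves $q_{\phi}(z)$ closer to the posterior.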

Recall that 
$$\begin{align}ELBO &= E_{q_{\phi}(z)}\big[\log p_{\theta}(x,z) - \log q_{\phi}(z) \big]\\
&= E_{q_{\phi}(z)}\big[\log \big(p_{\theta}(x|z)\,p(z)\big) - \log q_{\phi}(z) \big]\\
&= E_{q_{\phi}(z)}\big[\log p_{\theta}(x|z) + \log p(z) - \log q_{\phi}(z) \big]
\end{align}$$
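
The expectation in the ELBO can be approximated by averaging over samples $z \sim q_{\phi}(z)$. The sketch below illustrates this on a toy model that is an assumption made for illustration only: $z \sim N(0, 1)$, $x|z \sim N(z, 1)$, and a Gaussian guide. The function names `normal_log_pdf` and `elbo_estimate` are hypothetical helpers, not Pyro APIs.

```python
import math
import random


def normal_log_pdf(value, mean, std):
    """Log density of a univariate Normal(mean, std) at value."""
    return (-0.5 * math.log(2.0 * math.pi) - math.log(std)
            - 0.5 * ((value - mean) / std) ** 2)


def elbo_estimate(x, guide_mean, guide_std, num_samples, rng):
    """Monte Carlo estimate of E_q[log p(x, z) - log q(z)].

    Assumed toy model: z ~ N(0, 1), x | z ~ N(z, 1).
    Assumed guide: q(z) = N(guide_mean, guide_std).
    """
    total = 0.0
    for _ in range(num_samples):
        z = rng.gauss(guide_mean, guide_std)  # z ~ q(z)
        log_joint = normal_log_pdf(z, 0.0, 1.0) + normal_log_pdf(x, z, 1.0)
        log_q = normal_log_pdf(z, guide_mean, guide_std)
        total += log_joint - log_q
    return total / num_samples


rng = random.Random(0)
x = 1.5

# For this toy model the evidence is available exactly: x ~ N(0, sqrt(2)).
log_evidence = normal_log_pdf(x, 0.0, math.sqrt(2.0))

# The exact posterior is N(x / 2, sqrt(1 / 2)); using it as the guide makes the bound tight.
elbo_exact_guide = elbo_estimate(x, x / 2.0, math.sqrt(0.5), 1000, rng)

# A mismatched guide gives a strictly smaller ELBO.
elbo_poor_guide = elbo_estimate(x, -1.0, 1.0, 1000, rng)

print(log_evidence, elbo_exact_guide, elbo_poor_guide)
```

When the guide equals the exact posterior, every sample contributes exactly $\log p(x)$, so the estimate matches the evidence up to floating-point error. A mismatched guide falls below it by roughly the KL divergence between the guide and the posterior.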

To optimize the ELBO we need to compute its gradient with respect to the parameters $\phi, \theta$. If both the model and the guide belong to the exponential family, we can derive this gradient in closed form, so standard gradient-based optimization applies directly.

Unfortunately, for a general model and guide, the gradient has no closed form. To overcome this, we use Monte Carlo sampling to compute an unbiased estimate of the ELBO gradient.

$$\nabla_{\phi, \theta}ELBO = \nabla_{\phi, \theta} E_{q_{\phi}(z)}\big[\log p_{\theta}(x,z) - \log q_{\phi}(z) \big]$$

There are two main questions to ask here:
1. How can we quickly compute the log-likelihood term $\log p_{\theta}(x|z)$ when the dataset could contain millions of data points?
2. Writing $\phi$ for $\theta, \phi$ collectively, we ask a more generic question: how do we estimate $\nabla_\phi E_{q_\phi(z)}\big[f_{\phi}(z) \big]$?

# Fast Log-likelihood Evaluation

By exploiting the conditional independence of the data points given $z$, we can estimate this term as:

$$\sum_{i=1}^{N}\log p(x_i | z) \approx \frac{N}{M}\sum_{i \in I_M}\log p(x_i|z),$$

where $I_M$ is a set of $M$ mini-batch indices sampled from the dataset.
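
The scaling factor $N/M$ is what keeps the mini-batch estimate unbiased. The following standard-library sketch checks this numerically for a Bernoulli likelihood. The data, the parameter value `f = 0.6`, and the helper names are all assumptions made for illustration, and no Pyro call is involved.

```python
import math
import random


def bernoulli_log_prob(x, f):
    """Log probability of a single 0/1 observation under Bernoulli(f)."""
    return math.log(f) if x == 1 else math.log(1.0 - f)


def full_log_likelihood(data, f):
    """Exact sum of log p(x_i | f) over all N data points."""
    return sum(bernoulli_log_prob(x, f) for x in data)


def minibatch_log_likelihood(data, f, batch_size, rng):
    """(N / M) * sum over a random mini-batch of M indices, drawn without replacement."""
    n = len(data)
    indices = rng.sample(range(n), batch_size)
    return (n / batch_size) * sum(bernoulli_log_prob(data[i], f) for i in indices)


rng = random.Random(0)
data = [1 if rng.random() < 0.7 else 0 for _ in range(1000)]  # synthetic coin flips
f = 0.6  # assumed value of the latent variable

exact = full_log_likelihood(data, f)

# Averaging many mini-batch estimates should approach the exact sum.
estimates = [minibatch_log_likelihood(data, f, 50, rng) for _ in range(2000)]
average = sum(estimates) / len(estimates)

print(exact, average)
```

Each individual estimate is noisy, but it only touches $M$ data points instead of $N$, which is the trade-off stochastic variational inference relies on.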

To indicate conditional independence specifically, we can use ``pyro.plate``: 

```python
import pyro
import pyro.distributions as dist

def model(data):
    # sample f from the beta prior (alpha0 and beta0 are prior hyperparameters defined elsewhere)
    f = pyro.sample("latent_fairness", dist.Beta(alpha0, beta0))
    # pyro.plate marks the observations as conditionally independent given f
    for i in pyro.plate("data_loop", len(data)):
        # observe datapoint i using the bernoulli likelihood
        pyro.sample("obs_{}".format(i), dist.Bernoulli(f), obs=data[i])
```

Or we can do it in vectorized fashion:

```python
# f and data are the same as in the model above
with pyro.plate('observe_data'):
    pyro.sample('obs', dist.Bernoulli(f), obs=data)
```

Subsampling then comes for free:

```python
# ind holds the indices of the 5 data points subsampled from the 10 available
with pyro.plate('observe_data', size=10, subsample_size=5) as ind:
    pyro.sample('obs', dist.Bernoulli(f),
                obs=data.index_select(0, ind))
```

More details can be found in the [Pyro SVI Part II tutorial](http://pyro.ai/examples/svi_part_ii.html).

# ELBO Gradient Estimation

# References

1. [Pyro](http://pyro.ai/examples/svi_part_i.html#)