In [1]:
%autosave 0
%matplotlib inline

import os, sys
sys.path.insert(0, os.path.expanduser('~/git/github/pymc-devs/pymc3'))

import matplotlib
import matplotlib.pyplot as plt
from matplotlib import rc
import daft
import pymc3 as pm
import numpy as np
import seaborn as sns

# Import both names once; a later `import IPython.core.display as display`
# would rebind `display` to the module and shadow the function.
from IPython.display import display, HTML



<br />

<div style="text-align: center;">
<font size="7"><b>Variational inference in </b></font>
<div style="display: inline-block; vertical-align: middle;">
<img src="./pymc3_logo.jpg" width=200 height=200 /></div> 
</div>
<br />
<br />
<br />
<br />
<div style="text-align: right;">
<font size="6">Taku Yoshioka</font>
</div>

<br />

# PyMC3
* Python library for Bayesian inference

    * Define probabilistic models consisting of random variables (RVs)
    * Draw samples of RVs from the posterior distribution with MCMC
    * Fit parametrized approximate posterior to data with variational inference

# Agenda
* Bayesian inference and variational inference
* Stochastic gradient for optimization
* Example: Bayesian neural network
* Autoencoding Variational Bayes (AEVB)
* Example: Convolutional variational autoencoder

# Probabilistic model
* Decomposition of the joint probability of all random variables (RVs)

$$
p(\mathbf{x},\mathbf{z}) = p(\mathbf{x}|\mathbf{z})p(\mathbf{z})
$$
* $p(\mathbf{x},\mathbf{z})$: joint distribution of $\mathbf{x}$ and $\mathbf{z}$
* $p(\mathbf{z})$: prior distribution on $\mathbf{z}$
* $p(\mathbf{x}|\mathbf{z})$: likelihood of data $\mathbf{x}$ given $\mathbf{z}$

# Bayesian inference
* Infer the posterior distribution $p(\mathbf{z}|\mathbf{x})$, i.e. the distribution of the unknown RVs $\mathbf{z}$ given the observations $\mathbf{x}$ (which are also RVs), using Bayes' theorem:

$$
p(\mathbf{z}|\mathbf{x}) = \frac{p(\mathbf{x},\mathbf{z})}{p(\mathbf{x})}=\frac{p(\mathbf{x}|\mathbf{z})p(\mathbf{z})}{p(\mathbf{x})}
$$
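
As a minimal sketch of the formula above, the posterior can be computed exactly when $\mathbf{z}$ takes only a few values. The coin-bias model, the three candidate values of `z`, and the observed flips are all hypothetical, chosen only for illustration.

```python
# Hypothetical setup: z is a coin's heads probability, restricted to three values.
z_values = [0.2, 0.5, 0.8]
prior = [1 / 3, 1 / 3, 1 / 3]      # p(z)
x = [1, 1, 0, 1]                   # observed flips (1 = heads), made-up data

def likelihood(x, z):
    """p(x | z) for i.i.d. Bernoulli flips with heads probability z."""
    p = 1.0
    for xi in x:
        p *= z if xi == 1 else 1.0 - z
    return p

# Joint p(x, z) = p(x | z) p(z), evaluated at the observed x for each z.
joint = [likelihood(x, z) * pz for z, pz in zip(z_values, prior)]
# Evidence p(x) = sum_z p(x, z).
evidence = sum(joint)
# Posterior p(z | x) = p(x, z) / p(x).
posterior = [j / evidence for j in joint]
```

With continuous or high-dimensional $\mathbf{z}$ the sum in `evidence` becomes an intractable integral, which is what motivates approximate methods.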

# Variational inference
* Approximate $p(\mathbf{z}|\mathbf{x})$ by a variational distribution $q(\mathbf{z})$
* Maximizing the evidence lower bound (ELBO) ${\cal L}(q)$ w.r.t. $q$ is equivalent to minimizing $KL[q(\mathbf{z})||p(\mathbf{z}|\mathbf{x})]$

\begin{eqnarray}
{\cal L}(q) & = & \mathbb{E}_{q(\mathbf{z})}\left[\log p(\mathbf{x},\mathbf{z}) - \log q(\mathbf{z})\right] \\
           & = & \mathbb{E}_{q(\mathbf{z})}\left[\log p(\mathbf{x}|\mathbf{z})\right] - KL\left[q(\mathbf{z})||p(\mathbf{z})\right] \\
           & = & \log p(\mathbf{x}) - KL[q(\mathbf{z})||p(\mathbf{z}|\mathbf{x})]
\end{eqnarray}

* Pros: A tractable solution can be obtained with a factorized $q(\mathbf{z})$ and conjugate priors
* Cons: The class of usable probabilistic models is restricted by this tractability requirement
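
The relation between the ELBO, the log evidence, and the KL divergence to the posterior can be checked numerically on a small discrete example. This is only a sketch: the joint probabilities and the choice of `q` below are made-up numbers, not taken from any model in this talk.

```python
import math

# Hypothetical values of p(x, z) for the observed x and three possible z.
joint = [0.10, 0.25, 0.05]
evidence = sum(joint)                        # p(x)
posterior = [j / evidence for j in joint]    # p(z | x)

def elbo(q):
    """E_q[log p(x, z) - log q(z)] for a discrete q."""
    return sum(qz * (math.log(j) - math.log(qz)) for qz, j in zip(q, joint))

def kl(q, p):
    """KL[q || p] for discrete distributions with full support."""
    return sum(qz * math.log(qz / pz) for qz, pz in zip(q, p))

q = [0.3, 0.5, 0.2]                          # an arbitrary variational distribution
gap = math.log(evidence) - elbo(q)           # equals KL[q || p(z | x)]
```

Because the KL term is non-negative, the ELBO never exceeds $\log p(\mathbf{x})$, and the bound is tight exactly when `q` equals the posterior.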

# Stochastic gradient
* Use a factorized normal distribution as $q(\mathbf{z})$, with variational parameters $\gamma$

$$
q_{\gamma}(\mathbf{z}) = \prod_{d=1}^{D}N(z_{d};\mu_{d},\sigma_{d}^{2}), \quad \gamma=\left\{\mu_{d}, \sigma_{d}\right\}_{d=1}^{D}
$$

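* Maximize the ELBO w.r.t. $\gamma$ by stochastic gradient ascent
* The reparametrization $\mathbf{z} = \mathbf{\sigma}\odot\mathbf{\epsilon}+\mathbf{\mu}$, with $\mathbf{\epsilon}\sim N(\mathbf{0},\mathbf{I})$, moves the dependence on $\gamma$ inside the expectation and makes the gradients easy to compute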
    
$$
\begin{eqnarray}
\mathbb{E}_{q}\left[\log p(\mathbf{x},\mathbf{z})\right] & = & \int p(\mathbf{\epsilon})\log p(\mathbf{x},\mathbf{\sigma}\odot\mathbf{\epsilon}+\mathbf{\mu}) d\mathbf{\epsilon}, \\
\partial_{\gamma}\mathbb{E}_{q}\left[\log p(\mathbf{x},\mathbf{z})\right] & = & \int p(\mathbf{\epsilon})\partial_{\gamma}\log p(\mathbf{x},\mathbf{\sigma}\odot\mathbf{\epsilon}+\mathbf{\mu}) d\mathbf{\epsilon}
\end{eqnarray}
$$

* Monte Carlo sampling of $\mathbf{\epsilon}$ gives an unbiased estimate of the gradient (its expectation is the true gradient)
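
The reparametrized gradient estimator can be sketched in one dimension. The quadratic log joint $\log p(x,z) = -(z-a)^{2}/2$ (up to a constant) and the values of `a`, `mu`, and `sigma` are assumptions picked so that the exact gradients are known in closed form: $\partial_{\mu} = -(\mu - a)$ and $\partial_{\sigma} = -\sigma$.

```python
import random

A = 2.0  # hypothetical constant in the toy log joint

def dlogp_dz(z):
    """Derivative of the toy log joint log p(x, z) = -(z - A)^2 / 2 + const."""
    return -(z - A)

def reparam_grad(mu, sigma, n_samples, rng):
    """Monte Carlo estimate of d/d(mu, sigma) of E_q[log p(x, z)]."""
    g_mu = g_sigma = 0.0
    for _ in range(n_samples):
        eps = rng.gauss(0.0, 1.0)      # eps ~ N(0, 1)
        z = sigma * eps + mu           # reparametrized sample from q
        d = dlogp_dz(z)
        g_mu += d                      # chain rule: dz/dmu = 1
        g_sigma += d * eps             # chain rule: dz/dsigma = eps
    return g_mu / n_samples, g_sigma / n_samples

rng = random.Random(0)
mu, sigma = 0.5, 1.5
g_mu, g_sigma = reparam_grad(mu, sigma, 100_000, rng)
exact_mu, exact_sigma = -(mu - A), -sigma
```

This covers only the $\mathbb{E}_{q}\left[\log p(\mathbf{x},\mathbf{z})\right]$ term; for a normal $q$ the entropy term of the ELBO has a closed-form gradient and needs no sampling.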

# Three steps for VI in PyMC3
1. Define model
    * aaa


2. Run optimization
    * bbb


3. Draw samples from variational posterior
    * ccc
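
The three steps can be illustrated without any library. The sketch below deliberately does not use PyMC3's API; it is a plain-Python stand-in on a hypothetical conjugate model ($z \sim N(0,1)$, $x_{i}|z \sim N(z,1)$) with made-up data, chosen because the exact posterior is available for comparison. The step size, iteration count, and number of Monte Carlo samples are arbitrary choices.

```python
import math
import random

# Step 1: define the model (here only through the gradient of its log joint).
x = [1.2, 0.8, 1.5, 0.9, 1.1]  # hypothetical observations

def dlogp_dz(z):
    """d/dz of log p(x, z) = log N(z; 0, 1) + sum_i log N(x_i; z, 1)."""
    return -z + sum(xi - z for xi in x)

# Step 2: run the optimization (stochastic gradient ascent on the ELBO).
rng = random.Random(0)
mu, rho = 0.0, 0.0             # q(z) = N(mu, sigma^2) with sigma = exp(rho)
lr, n_steps, n_mc = 0.01, 2000, 20
for step in range(n_steps):
    sigma = math.exp(rho)
    g_mu = g_rho = 0.0
    for _ in range(n_mc):
        eps = rng.gauss(0.0, 1.0)
        d = dlogp_dz(mu + sigma * eps)   # reparametrized sample
        g_mu += d                        # dz/dmu = 1
        g_rho += d * eps * sigma         # dz/drho = eps * sigma
    mu += lr * g_mu / n_mc
    # The entropy of q is rho + const, which adds 1 to the rho-gradient.
    rho += lr * (g_rho / n_mc + 1.0)
sigma = math.exp(rho)

# Step 3: draw samples from the fitted variational posterior.
samples = [rng.gauss(mu, sigma) for _ in range(5000)]

# Exact posterior from conjugacy, used only as a reference.
exact_mean = sum(x) / (len(x) + 1)
exact_std = (1.0 / (len(x) + 1)) ** 0.5
```

In this conjugate case the normal family contains the true posterior, so `mu` and `sigma` should end up close to `exact_mean` and `exact_std`; for non-conjugate models the fit is only an approximation.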

# Example: Bayesian neural network

# Latent variables for observations
* Consider the case where each of the i.i.d. observations $\mathbf{x}_{i}$ has its own latent variables $\mathbf{z}_{i}$

$$
\log p(\mathbf{X},\mathbf{Z},\Theta) = \log p(\Theta) + \sum_{i=1}^{N}\log p(\mathbf{x}_{i},\mathbf{z}_{i}|\Theta)
$$

* Two types of latent variables

    * $\mathbf{z}_{i}$: Local random variables, one set per sample
    * $\Theta$: Global random variables, shared by all samples

# Autoencoding variational Bayes
* Approximate $p(\Theta,\mathbf{Z}|\mathbf{X})$ by $q(\Theta)\prod_{i=1}^{N}q(\mathbf{z}_{i})$
* Assigning separate variational parameters to each $\mathbf{z}_{i}$ does not generalize to new data
* Instead, estimate the parameters of $q(\mathbf{z}_{i})$ (means and standard deviations) from $\mathbf{x}_{i}$ with a neural network with parameters $\nu$
* The likelihood $p(\mathbf{x}_{i}|\mathbf{z}_{i},\Theta)$ can also be parametrized (by $\eta$)
* Optimize the ELBO w.r.t. $\gamma$, $\nu$, and $\eta$

\begin{eqnarray}
{\cal L}(\gamma,\nu,\eta) & = &
    \mathbf{c}_{o}\mathbb{E}_{q(\Theta)}\left[
        \sum_{i=1}^{N}\mathbb{E}_{q(\mathbf{z}_{i})}\left[
            \log p(\mathbf{x}_{i}|\mathbf{z}_{i},\Theta,\eta)
        \right]
    \right]
    - \mathbf{c}_{g}KL\left[q(\Theta)||p(\Theta)\right]
    - \mathbf{c}_{l}\sum_{i=1}^{N}KL\left[q(\mathbf{z}_{i})||p(\mathbf{z}_{i})\right]
\end{eqnarray}
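
The amortization idea, predicting the parameters of $q(\mathbf{z}_{i})$ from $\mathbf{x}_{i}$ with shared parameters $\nu$, can be sketched with a one-weight linear "encoder" in place of a neural network. Everything here is an assumption made for illustration: the toy model ($z_{i} \sim N(0,1)$, $x_{i}|z_{i} \sim N(z_{i},1)$), the absence of global variables $\Theta$ and likelihood parameters $\eta$, the simulated data, and the helper name `encode`.

```python
import math
import random

rng = random.Random(0)

# Simulated data from the toy model: z_i ~ N(0, 1), x_i | z_i ~ N(z_i, 1).
data = [rng.gauss(0.0, 1.0) + rng.gauss(0.0, 1.0) for _ in range(50)]

def encode(x, w, rho):
    """Amortized q(z_i) = N(w * x_i, exp(rho)^2); nu = (w, rho) is shared by all i."""
    return w * x, math.exp(rho)

w, rho = 0.0, 0.0
lr = 0.01
for step in range(2000):
    g_w = g_rho = 0.0
    for x in data:
        mu, sigma = encode(x, w, rho)
        eps = rng.gauss(0.0, 1.0)
        z = mu + sigma * eps               # reparametrized sample of z_i
        d = (x - z) - z                    # d/dz [log p(x | z) + log p(z)]
        g_w += d * x                       # dz/dw = x
        g_rho += d * eps * sigma + 1.0     # dz/drho = eps * sigma; +1 from entropy
    w += lr * g_w / len(data)
    rho += lr * g_rho / len(data)

# The exact posterior of this toy model is N(x_i / 2, 1 / 2), so the encoder
# should approach w = 0.5 and sigma = sqrt(0.5).
new_mu, new_sigma = encode(3.0, w, rho)    # unseen observation, no refitting
```

The last line is the point of amortization: a new observation gets its approximate posterior from a single encoder evaluation rather than from a fresh optimization over per-sample variational parameters.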

# Example: Convolutional variational autoencoder

# Future work
* Normalizing flows ([#1490](https://github.com/pymc-devs/pymc3/pull/1490))
* Stein variational gradient descent ([#1549](https://github.com/pymc-devs/pymc3/pull/1549))