# Operator Variational Inference (OPVI)

Inspired by

    Rajesh Ranganath, Jaan Altosaar, Dustin Tran, David M. Blei
    "Operator Variational Inference", https://arxiv.org/abs/1610.09033


Not much time had passed after the paper above was published before OPVI became the paradigm behind all variational inference (VI) in PyMC3. Why is it expressive enough to cover so many methods from the literature? The idea behind OPVI is modularity.

It generalizes variational inference so that the problem is built from blocks. The first and essential block is the **Model** itself. The second is the **Approximation**; in some cases $\log Q(D)$ is not really needed. Whether it is needed depends on the third and fourth parts of that black box, the **Operator** and the **Test Function** respectively.

The Operator is the approach we use: it constructs the loss from the given Model, Approximation and Test Function. The Test Function is not needed if we minimize the KL divergence from Q to the posterior, but as a drawback we then need to compute $\log Q(D)$. Sometimes the approximation family is intractable and $\log Q(D)$ is not available; this is where the *LS (Langevin Stein) Operator* with a set of test functions, or *KSD*, comes in to provide the gradients needed for inference.
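
To see why the KL operator needs the log-density of the approximation, here is a minimal standard-library sketch that is not PyMC3 code. It assumes a toy target $p = N(0, 1)$ and an approximation $q = N(0.5, 0.8^2)$; the helper names are hypothetical. The Monte Carlo estimate averages $\log q(z) - \log p(z)$ over draws from $q$, so it cannot be computed when $\log q$ is unavailable.

```python
import math
import random


def log_normal_pdf(z, mu, sd):
    """Log-density of a univariate normal N(mu, sd^2) at z."""
    return -0.5 * math.log(2 * math.pi) - math.log(sd) - 0.5 * ((z - mu) / sd) ** 2


def kl_monte_carlo(mu_q, sd_q, n_samples=100_000, seed=0):
    """Estimate KL[q || p] for p = N(0, 1) by averaging log q(z) - log p(z), z ~ q."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_samples):
        z = rng.gauss(mu_q, sd_q)
        # log q(z) appears explicitly: this is the term an implicit family lacks
        total += log_normal_pdf(z, mu_q, sd_q) - log_normal_pdf(z, 0.0, 1.0)
    return total / n_samples


def kl_closed_form(mu_q, sd_q):
    """Analytic KL[N(mu_q, sd_q^2) || N(0, 1)], used only as a reference value."""
    return 0.5 * (sd_q ** 2 + mu_q ** 2 - 1.0) - math.log(sd_q)


kl_mc = kl_monte_carlo(0.5, 0.8)
kl_exact = kl_closed_form(0.5, 0.8)
print(kl_mc, kl_exact)
```

In a real model the target is the unnormalized posterior, so the estimate is only known up to an additive constant; the toy target is normalized to make the comparison with the closed form possible.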

The Test Function has a less intuitive meaning. It is usually used with the LS operator and represents everything we want from our approximate distribution. For any given vector-valued function of $z$, the LS operator yields a function with zero mean under the posterior, so $\log Q(D)$ is no longer needed. That opens the door to rich approximation families such as neural networks.
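
The zero-mean property can be checked numerically. The sketch below is an illustration under assumptions, not PyMC3 code: it takes a one-dimensional target $p = N(0, 1)$, the test function $f(z) = \sin z$, and the one-dimensional Langevin-Stein form $\nabla_z \log p(z)\, f(z) + f'(z)$. The average is close to zero when samples come from the target and clearly nonzero when they come from a shifted distribution.

```python
import math
import random


def langevin_stein(f, df, grad_log_p, z):
    """1-D Langevin-Stein operator applied to f at z: grad_log_p(z) * f(z) + f'(z)."""
    return grad_log_p(z) * f(z) + df(z)


def operator_mean(mu, sd, n_samples=200_000, seed=0):
    """Average the operator output over samples z ~ N(mu, sd^2)."""
    rng = random.Random(seed)
    grad_log_p = lambda z: -z  # score function of the assumed target N(0, 1)
    total = 0.0
    for _ in range(n_samples):
        z = rng.gauss(mu, sd)
        total += langevin_stein(math.sin, math.cos, grad_log_p, z)
    return total / n_samples


mean_at_target = operator_mean(0.0, 1.0)   # samples drawn from the target itself
mean_off_target = operator_mean(1.0, 1.0)  # samples drawn from a shifted normal
print(mean_at_target, mean_off_target)
```

Only the score of the target and samples from the approximation are used, which is why no log-density of the approximation is required.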

[ADVI](https://arxiv.org/abs/1506.03431) and [Langevin Stein Operator](https://arxiv.org/abs/1610.09033) VI are not the only methods that fit into the OPVI framework. [Normalizing Flows](https://arxiv.org/abs/1505.05770), [SVGD](https://arxiv.org/abs/1608.04471) and [ASVGD](http://bayesiandeeplearning.org/papers/BDL_21.pdf) fit it well too.

There are many recent papers, such as the following, that are very interesting to read and can inspire further contributions.

* [Gradient Estimators for Implicit Models](https://arxiv.org/abs/1705.07107)
* [Improving Variational Auto-Encoders using Householder Flow](https://arxiv.org/abs/1611.09630)
* [Black-box Importance Sampling](https://arxiv.org/abs/1610.05247)
* [Learning to Draw Samples: With Application to Amortized MLE for Generative Adversarial Learning](https://arxiv.org/abs/1611.01722)

Moreover, there is [Auto-Encoding Variational Bayes](https://arxiv.org/abs/1312.6114) (AEVB), an approach that amortizes inference over local variables with an encoder network.

Taking it all together, the list of methods is not that long:

| Method                     | pymc3       |
|----------------------------|-------------|
| ADVI                       | +           |
| FullRankADVI               | +           |
| Normalizing Flows          | coming soon |
| Langevin Stein Operator VI | -           |
| SVGD                       | +           |
| ASVGD                      | +           |

As you can see, PyMC3 has great support for variational inference, with both scalable (ADVI) and accurate (SVGD) methods. The internals needed to make it all work are tricky.

## Model

As mentioned before, OPVI has a Model, Test Functions, an Operator and an Approximation. The essential part is, of course, the `Model`.

In [1]:
import pymc3 as pm

In [2]:
# here it is
pm.Model

pymc3.model.Model

So there is nothing tricky here. We all use this class to define models

In [3]:
pm.Model.logpt

<property at 0x1146c1ef8>

This property returns the model's joint log-probability as a symbolic tensor, which we can later use to construct loss functions in VI.

## Operators
The essential code for OPVI is in the variational module.

In [4]:
pm.variational.opvi

<module 'pymc3.variational.opvi' from '/Users/ferres/dev/pymc3/pymc3/variational/opvi.py'>

Here you can find the base classes for Approximations, Operators and Test Functions.

In [5]:
print(pm.variational.opvi.Operator.__doc__)

Base class for Operator

    Parameters
    ----------
    approx : :class:`Approximation`
        an approximation instance

    Notes
    -----
    For implementing Custom operator it is needed to define :func:`Operator.apply` method
    


In [6]:
print(pm.variational.opvi.Operator.apply.__doc__)

Operator itself

        .. math::

            (O^{p,q}f_{\theta})(z)

        Parameters
        ----------
        f : :class:`TestFunction` or None
            function that takes `z = self.input` and returns
            same dimensional output

        nmc : n
            monte carlo samples to use

        Returns
        -------
        `TensorVariable`
            symbolically applied operator
        


PyMC3 has three of them:

In [7]:
print(pm.variational.operators.KL.__doc__)


    Operator based on Kullback Leibler Divergence

    .. math::

        KL[q(v)||p(v)] = \int q(v)\log\frac{q(v)}{p(v)}dv
    


In [8]:
print(pm.variational.operators.KSD.__doc__)


    Operator based on Kernelized Stein Discrepancy

    *Input:* A target distribution with density function :math:`p(x)`
        and a set of initial particles :math:`\{x^0_i\}^n_{i=1}`

    *Output:* A set of particles :math:`\{x_i\}^n_{i=1}` that approximates the target distribution.

    .. math::

        x_i^{l+1} \leftarrow \epsilon_l \hat{\phi}^{*}(x_i^l) \\
        \hat{\phi}^{*}(x) = \frac{1}{n}\sum^{n}_{j=1}[k(x^l_j,x) \nabla_{x^l_j} logp(x^l_j)+ \nabla_{x^l_j} k(x^l_j,x)]

    Parameters
    ----------
    approx : :class:`Empirical`
        Empirical Approximation used for inference

    References
    ----------
    -   Qiang Liu, Dilin Wang (2016)
        Stein Variational Gradient Descent: A General Purpose Bayesian Inference Algorithm
        arXiv:1608.04471
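
The particle update can be sketched in plain Python for a one-dimensional problem. This is an illustration under assumptions, not the PyMC3 implementation: the target is taken to be $N(2, 1)$, the kernel is an RBF with a fixed bandwidth of 1, and each particle is moved by $x_i \leftarrow x_i + \epsilon\,\hat{\phi}^{*}(x_i)$ as in Liu and Wang (2016). The function names are hypothetical.

```python
import math
import random


def svgd_direction(particles, grad_log_p, h=1.0):
    """phi(x) = 1/n * sum_j [k(x_j, x) * grad_log_p(x_j) + d/dx_j k(x_j, x)]."""
    n = len(particles)
    phi = []
    for x in particles:
        acc = 0.0
        for xj in particles:
            k = math.exp(-((xj - x) ** 2) / (2.0 * h ** 2))
            grad_k = -(xj - x) / h ** 2 * k  # derivative of k(xj, x) w.r.t. xj
            acc += k * grad_log_p(xj) + grad_k  # attraction term + repulsion term
        phi.append(acc / n)
    return phi


def run_svgd(n_particles=30, n_steps=500, step=0.1, seed=0):
    """Move particles initialised far from the target towards N(2, 1)."""
    rng = random.Random(seed)
    particles = [rng.gauss(-3.0, 1.0) for _ in range(n_particles)]
    grad_log_p = lambda x: -(x - 2.0)  # score of the assumed target N(2, 1)
    for _ in range(n_steps):
        phi = svgd_direction(particles, grad_log_p)
        particles = [x + step * d for x, d in zip(particles, phi)]
    return particles


particles = run_svgd()
mean = sum(particles) / len(particles)
std = math.sqrt(sum((x - mean) ** 2 for x in particles) / len(particles))
print(mean, std)
```

The first term pulls particles towards high-density regions, while the kernel gradient pushes them apart so that they do not all collapse onto the mode.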
    


In [9]:
print(pm.variational.operators.AKSD.__doc__)


    Amortized Stein Variational Gradient Descent

    This inference is based on Kernelized Stein Discrepancy
    it's main idea is to move initial noisy particles so that
    they fit target distribution best.

    Algorithm is outlined below

    *Input:* Parametrized random generator :math:`R_{\theta}`

    *Output:* :math:`R_{\theta^{*}}` that approximates the target distribution.

    .. math::

        \Delta x_i &= \hat{\phi}^{*}(x_i) \\
        \hat{\phi}^{*}(x) &= \frac{1}{n}\sum^{n}_{j=1}[k(x_j,x) \nabla_{x_j} logp(x_j)+ \nabla_{x_j} k(x_j,x)] \\
        \Delta_{\theta} &= \frac{1}{n}\sum^{n}_{i=1}\Delta x_i\frac{\partial x_i}{\partial \theta}

    References
    ----------
    -   Dilin Wang, Yihao Feng, Qiang Liu (2016)
        Learning to Sample Using Stein Discrepancy
        http://bayesiandeeplearning.org/papers/BDL_21.pdf

    -   Dilin Wang, Qiang Liu (2016)
        Learning to Draw Samples: With Application to Amortized MLE for Generative Adversarial Learning
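
The amortized variant can be sketched in the same spirit. The code below is a toy illustration, not PyMC3 code: the generator is assumed to be $R_{\theta}(\xi) = \text{loc} + e^{\text{log\_scale}}\,\xi$ with standard normal noise $\xi$, the target is assumed to be $N(2, 1)$, and the kernel bandwidth is fixed at 1. The particle directions $\Delta x_i$ are pushed back to the generator parameters through $\partial x_i / \partial \theta$, as in the last line of the formula above.

```python
import math
import random


def svgd_direction(xs, grad_log_p, h=1.0):
    """SVGD direction for every particle, using an RBF kernel with bandwidth h."""
    n = len(xs)
    phi = []
    for x in xs:
        acc = 0.0
        for xj in xs:
            k = math.exp(-((xj - x) ** 2) / (2.0 * h ** 2))
            acc += k * grad_log_p(xj) + (x - xj) / h ** 2 * k
        phi.append(acc / n)
    return phi


def fit_generator(n_particles=30, n_steps=600, step=0.05, seed=0):
    """Train theta = (loc, log_scale) so that generated samples match N(2, 1)."""
    rng = random.Random(seed)
    loc, log_scale = -1.0, math.log(0.5)
    grad_log_p = lambda x: -(x - 2.0)  # score of the assumed target
    for _ in range(n_steps):
        noise = [rng.gauss(0.0, 1.0) for _ in range(n_particles)]
        scale = math.exp(log_scale)
        xs = [loc + scale * e for e in noise]  # fresh particles from the generator
        phi = svgd_direction(xs, grad_log_p)
        # chain rule: dx/dloc = 1 and dx/dlog_scale = scale * noise
        d_loc = sum(phi) / n_particles
        d_log_scale = sum(p * scale * e for p, e in zip(phi, noise)) / n_particles
        loc += step * d_loc
        log_scale += step * d_log_scale
    return loc, math.exp(log_scale)


loc, scale = fit_generator()
print(loc, scale)
```

Unlike plain SVGD, nothing is stored per particle: only the generator parameters are kept, and new samples can be drawn from the generator after training.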
      

## Approximations

Approximations are another core part of OPVI. To make them flexible and scalable we use a modular structure there too.

In [10]:
print(pm.variational.opvi.Approximation.__doc__)

Base class for approximations.

    Parameters
    ----------
    local_rv : dict[var->tuple]
        mapping {model_variable -> local_variable (:math:`\mu`, :math:`\rho`)}
        Local Vars are used for Autoencoding Variational Bayes
        See (AEVB; Kingma and Welling, 2014) for details
    model : :class:`Model`
        PyMC3 model for inference
    cost_part_grad_scale : float or scalar tensor
        Scaling score part of gradient can be useful near optimum for
        archiving better convergence properties. Common schedule is
        1 at the start and 0 in the end. So slow decay will be ok.
        See (Sticking the Landing; Geoffrey Roeder,
        Yuhuai Wu, David Duvenaud, 2016) for details
    scale_cost_to_minibatch : bool, default False
        Scale cost to minibatch instead of full dataset
    random_seed : None or int
        leave None to use package global RandomStream or other
        valid value to create instance specific one

    Notes
    -----
    Defining a

Then we subclass and define more approximations in a separate module.

In [11]:
print(pm.variational.approximations.__all__)

['MeanField', 'FullRank', 'Empirical', 'sample_approx']
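
The difference between a mean-field and a full-rank Gaussian approximation can be illustrated with reparameterized sampling. This is a standard-library sketch under assumptions, not the PyMC3 implementation: the standard deviation is assumed to be obtained from $\rho$ with a softplus, the full-rank covariance is assumed to be parameterized by a lower-triangular Cholesky factor, and the function names are hypothetical.

```python
import math
import random


def softplus(rho):
    """Map an unconstrained rho to a positive standard deviation (assumed transform)."""
    return math.log1p(math.exp(rho))


def sample_mean_field(mu, rho, rng):
    """z = mu + sd * eps with independent noise in every dimension."""
    return [m + softplus(r) * rng.gauss(0.0, 1.0) for m, r in zip(mu, rho)]


def sample_full_rank(mu, chol, rng):
    """z = mu + L @ eps with a lower-triangular L, so dimensions can be correlated."""
    eps = [rng.gauss(0.0, 1.0) for _ in mu]
    return [m + sum(chol[i][j] * eps[j] for j in range(i + 1)) for i, m in enumerate(mu)]


def covariance_01(samples):
    """Sample covariance between the first two coordinates."""
    n = len(samples)
    m0 = sum(s[0] for s in samples) / n
    m1 = sum(s[1] for s in samples) / n
    return sum((s[0] - m0) * (s[1] - m1) for s in samples) / n


rng = random.Random(0)
mf = [sample_mean_field([0.0, 1.0], [0.0, 0.0], rng) for _ in range(20_000)]
fr = [sample_full_rank([0.0, 1.0], [[1.0, 0.0], [0.8, 0.6]], rng) for _ in range(20_000)]
cov_mf = covariance_01(mf)  # near 0: mean field cannot represent correlation
cov_fr = covariance_01(fr)  # near 0.8 = L[0][0] * L[1][0]
print(cov_mf, cov_fr)
```

An empirical approximation, by contrast, stores a set of samples directly instead of distribution parameters.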


## Test Functions

Not all of our operators use test functions, so for now there is only one: a kernel for SVGD/ASVGD.

In [12]:
print(pm.variational.test_functions.__all__)

['rbf']
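
For intuition, an RBF kernel with a data-dependent bandwidth can be written in a few lines. This sketch is not the PyMC3 `rbf` implementation. It assumes one-dimensional points, the kernel $k(x, y) = \exp(-(x - y)^2 / h)$, and a median-heuristic bandwidth in the spirit of Liu and Wang (2016); the exact bandwidth formula used here, with $\log(n + 1)$ in the denominator, is an assumption.

```python
import math
import statistics


def rbf_with_median_bandwidth(xs):
    """Return the kernel matrix, its gradient w.r.t. the first argument, and h."""
    n = len(xs)
    sq_dists = [[(a - b) ** 2 for b in xs] for a in xs]
    med_sq = statistics.median(d for row in sq_dists for d in row)
    h = med_sq / math.log(n + 1)  # assumed variant of the median heuristic
    kernel = [[math.exp(-d / h) for d in row] for row in sq_dists]
    # grad[i][j] = d k(x_i, x_j) / d x_i
    grad = [
        [-2.0 * (xs[i] - xs[j]) / h * kernel[i][j] for j in range(n)]
        for i in range(n)
    ]
    return kernel, grad, h


xs = [0.0, 1.0, 3.0]
kernel, grad, h = rbf_with_median_bandwidth(xs)
print(h)
```

Both the kernel values and the kernel gradients are needed, because the SVGD direction combines a kernel-weighted score with a repulsive kernel-gradient term.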


## Inference

Combining it all we can construct an inference method.

In [13]:
from pymc3.variational.inference import Inference
from pymc3.variational.operators import KL, KSD, AKSD
from pymc3.variational.approximations import MeanField, FullRank, Empirical

with pm.Model() as model:
    norm = pm.Normal('N')

In [14]:
help(Inference)

Help on class Inference in module pymc3.variational.inference:

class Inference(builtins.object)
 |  Base class for Variational Inference
 |  
 |  Communicates Operator, Approximation and Test Function to build Objective Function
 |  
 |  Parameters
 |  ----------
 |  op : Operator class
 |  approx : Approximation class or instance
 |  tf : TestFunction instance
 |  local_rv : dict
 |      mapping {model_variable -> local_variable}
 |      Local Vars are used for Autoencoding Variational Bayes
 |      See (AEVB; Kingma and Welling, 2014) for details
 |  model : Model
 |      PyMC3 Model
 |  op_kwargs : dict
 |      kwargs passed to :class:`Operator`
 |  kwargs : kwargs
 |      additional kwargs for :class:`Approximation`
 |  
 |  Methods defined here:
 |  
 |  __init__(self, op, approx, tf, local_rv=None, model=None, op_kwargs=None, **kwargs)
 |      Initialize self.  See help(type(self)) for accurate signature.
 |  
 |  fit(self, n=10000, score=None, callbacks=None, progressbar=True, 

One can construct inference by hand

In [15]:
inference = Inference(
    model=model,
    op=KL,
    approx=MeanField,
    tf=None
)
inference.objective

<pymc3.variational.opvi.ObjectiveFunction at 0x11cf1e6d8>

In [16]:
inference.fit()

Average Loss = 0.0024892: 100%|██████████| 10000/10000 [00:00<00:00, 11669.79it/s] 
Finished [100%]: Average Loss = 0.0025053


<pymc3.variational.approximations.MeanField at 0x117708198>

Alternatively, we can build the objective function itself in the same way. Recall the formula from the `Operator.apply` docstring:
$$\text{objective} = (O^{p,q}f_{\theta})(z)$$

In [17]:
with model:
    objective = KL(MeanField())(None)

In [18]:
objective

<pymc3.variational.opvi.ObjectiveFunction at 0x11ddf6940>

In [19]:
objective(nmc=10).eval()

array(0.09470486640930176, dtype=float32)

As you can see, this modularity makes it possible to write separable code for OPVI.
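
The composition pattern `Operator(Approximation)(TestFunction)` can be mimicked without PyMC3 to show how the pieces stay independent. Everything in this sketch is hypothetical and only mirrors the structure: the `Toy*` classes are not PyMC3 classes, the target is assumed to be $N(0, 1)$, and the objective is a plain Monte Carlo average rather than a symbolic tensor.

```python
import math
import random


class ToyNormalApprox:
    """Hypothetical stand-in for an Approximation: q = N(mu, sd^2)."""

    def __init__(self, mu, sd, seed=0):
        self.mu, self.sd = mu, sd
        self.rng = random.Random(seed)

    def sample(self):
        return self.rng.gauss(self.mu, self.sd)

    def logq(self, z):
        return (-0.5 * math.log(2 * math.pi) - math.log(self.sd)
                - 0.5 * ((z - self.mu) / self.sd) ** 2)


def target_logp(z):
    """Log-density of the assumed target N(0, 1)."""
    return -0.5 * math.log(2 * math.pi) - 0.5 * z * z


class ToyOperator:
    """Hypothetical base class: subclasses only define `apply`."""

    def __init__(self, approx):
        self.approx = approx

    def apply(self, f, z):
        raise NotImplementedError

    def __call__(self, f=None):
        def objective(nmc=1):
            # Monte Carlo average of the applied operator over nmc draws from q
            return sum(self.apply(f, self.approx.sample()) for _ in range(nmc)) / nmc
        return objective


class ToyKL(ToyOperator):
    def apply(self, f, z):
        return self.approx.logq(z) - target_logp(z)  # needs log q, ignores f


class ToyLangevinStein(ToyOperator):
    def apply(self, f, z):
        func, dfunc = f  # test function and its derivative
        return -z * func(z) + dfunc(z)  # needs f, never touches log q


kl_objective = ToyKL(ToyNormalApprox(0.5, 0.8))(None)
stein_objective = ToyLangevinStein(ToyNormalApprox(0.0, 1.0))((math.sin, math.cos))
kl_value = kl_objective(nmc=50_000)
stein_value = stein_objective(nmc=50_000)
print(kl_value, stein_value)
```

Swapping the operator or the approximation does not require touching the other classes, which is the property the OPVI design relies on.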