# SVI Part I: An Introduction to Stochastic Variational Inference in Pyro

Pyro has been designed with particular attention paid to supporting stochastic variational inference as a general purpose inference algorithm.  Let's see how we go about doing variational inference in Pyro.

## Setup

We're going to assume we've already defined our model in Pyro (for more details on how this is done see [Intro Part I](intro_part_i.ipynb)).
As a quick reminder, the model is given as a stochastic function `model(*args, **kwargs)`, which in the general case takes arguments. The different pieces of `model()` are encoded via the mapping:

1. observations $\Longleftrightarrow$ `pyro.sample` with the `obs` argument
2. latent random variables $\Longleftrightarrow$ `pyro.sample`
3. parameters $\Longleftrightarrow$ `pyro.param`


Now let's establish some notation. The model has observations ${\bf x}$ and latent random variables ${\bf z}$ as well as parameters $\theta$. It has a joint probability density of the form 

$$p_{\theta}({\bf x}, {\bf z}) = p_{\theta}({\bf x}|{\bf z}) p_{\theta}({\bf z})$$

We assume that the various probability distributions $p_i$ that make up $p_{\theta}({\bf x}, {\bf z})$ have the following properties:

1. we can sample from each $p_i$
2. we can compute the pointwise log pdf $\log p_i$
3. $p_i$ is differentiable w.r.t. the parameters $\theta$

## Model Learning

In this context our criterion for learning a good model will be maximizing the log evidence, i.e. we want to find the value of $\theta$ given by

$$\theta_{\rm{max}} = \underset{\theta}{\operatorname{argmax}} \log p_{\theta}({\bf x})$$

where the log evidence $\log p_{\theta}({\bf x})$ is given by

$$\log p_{\theta}({\bf x}) = \log \int\! d{\bf z}\; p_{\theta}({\bf x}, {\bf z})$$

In the general case this is a doubly difficult problem. First, even for a fixed $\theta$, the integral over the latent random variables $\bf z$ is often intractable. Second, even if we knew how to calculate the log evidence for all values of $\theta$, maximizing it as a function of $\theta$ would in general be a difficult non-convex optimization problem.

In addition to finding $\theta_{\rm{max}}$, we would like to calculate the posterior over the latent variables $\bf z$:

$$ p_{\theta_{\rm{max}}}({\bf z} | {\bf x}) = \frac{p_{\theta_{\rm{max}}}({\bf x} , {\bf z})}{
\int \! d{\bf z}\; p_{\theta_{\rm{max}}}({\bf x} , {\bf z}) } $$

Note that the denominator of this expression is the (usually intractable) evidence. Variational inference offers a scheme for finding $\theta_{\rm{max}}$ and computing an approximation to the posterior $p_{\theta_{\rm{max}}}({\bf z} | {\bf x})$. Let's see how that works.
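To make the role of the evidence concrete, here is a minimal sketch for a toy model in which the latent variable is discrete, so the integral over ${\bf z}$ reduces to a two-term sum. The model (a choice between a fair and a biased coin, followed by a single observed flip) and all of its numbers are assumptions made up for illustration; they are not part of the tutorial's later example.

```python
import math

# Hypothetical toy model: z selects one of two coins, x is one observed flip.
prior = {"fair": 0.5, "biased": 0.5}        # p(z)
heads_prob = {"fair": 0.5, "biased": 0.9}   # p(x = 1 | z)

x = 1  # we observed heads

# Joint p(x, z) = p(x | z) p(z) for each value of the latent variable.
joint = {}
for z in prior:
    likelihood = heads_prob[z] if x == 1 else 1.0 - heads_prob[z]
    joint[z] = likelihood * prior[z]

# The evidence p(x) sums (integrates) the joint over z.
evidence = sum(joint.values())
log_evidence = math.log(evidence)

# The posterior p(z | x) is the joint divided by the evidence.
posterior = {z: joint[z] / evidence for z in joint}

print("log evidence:", log_evidence)
print("posterior:", posterior)
```

With continuous or high-dimensional latent variables no such exhaustive sum is available, which is why an approximation scheme is needed.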


## Guide

The basic idea is that we introduce a parameterized distribution $q_{\phi}({\bf z})$, where $\phi$ are known as the variational parameters. This distribution is called the variational distribution in much of the literature, and in the context of Pyro it's called the **guide** (one syllable instead of nine!). The guide will serve as an approximation to the posterior.

Just like the model, the guide is encoded as a stochastic function `guide()` that contains `pyro.sample` and `pyro.param` statements. It does _not_ contain observed data, since the guide needs to be a properly normalized distribution. Note that Pyro enforces that `model()` and `guide()` have the same call signature, i.e. both callables should take the same arguments. 


Since the guide is an approximation to the posterior $p_{\theta_{\rm{max}}}({\bf z} | {\bf x})$, the guide needs to provide a valid joint probability density over all the latent random variables in the model. Recall that when random variables are specified in Pyro with the primitive statement `pyro.sample()` the first argument denotes the name of the random variable. These names will be used to align the random variables in the model and guide. To be very explicit, if the model contains a random variable `z_1`

```python
def model():
    pyro.sample("z_1", ...)
```

then the guide needs to have a matching `sample` statement

```python
def guide():
    pyro.sample("z_1", ...)
```

The distributions used in the two cases can be different, but the names must line up one-to-one.

Once we've specified a guide (we give some explicit examples below), we're ready to proceed to inference.
Learning will be set up as an optimization problem where each iteration of training takes a step in $\theta-\phi$ space that moves the guide closer to the exact posterior.
To do this we need to define an appropriate objective function. 



## ELBO

A simple derivation (for example see reference [1]) yields what we're after: the evidence lower bound (ELBO). The ELBO, which is a function of both $\theta$ and $\phi$, is defined as an expectation w.r.t. samples from the guide:

$${\rm ELBO} \equiv \mathbb{E}_{q_{\phi}({\bf z})} \left [ 
\log p_{\theta}({\bf x}, {\bf z}) - \log q_{\phi}({\bf z})
\right]$$

By assumption we can compute the log probabilities inside the expectation. And since the guide is assumed to be a parametric distribution we can sample from, we can compute Monte Carlo estimates of this quantity. Crucially, the ELBO is a lower bound to the log evidence, i.e. for all choices of $\theta$ and $\phi$ we have that 

$$\log p_{\theta}({\bf x}) \ge {\rm ELBO} $$
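The bound can be checked numerically on a model simple enough that the log evidence is known in closed form. The sketch below is an illustration under stated assumptions, not Pyro code: it uses a hypothetical model $z \sim \mathcal{N}(0, 1)$, $x | z \sim \mathcal{N}(z, 1)$ with a single observation $x = 1$, a Gaussian guide, and only the Python standard library. The helper names `log_normal` and `elbo_estimate` are made up for this example.

```python
import math
import random

X_OBS = 1.0  # the single observation (an arbitrary choice)

def log_normal(value, mean, std):
    """Log density of Normal(mean, std) evaluated at value."""
    return (-0.5 * math.log(2.0 * math.pi * std ** 2)
            - (value - mean) ** 2 / (2.0 * std ** 2))

def elbo_estimate(mu_q, sigma_q, num_samples, rng):
    """Monte Carlo average of log p(x, z) - log q(z) with z drawn from the guide."""
    total = 0.0
    for _ in range(num_samples):
        z = rng.gauss(mu_q, sigma_q)
        log_joint = log_normal(z, 0.0, 1.0) + log_normal(X_OBS, z, 1.0)
        total += log_joint - log_normal(z, mu_q, sigma_q)
    return total / num_samples

# For this model x is marginally Normal(0, sqrt(2)), so the evidence is exact.
log_evidence = log_normal(X_OBS, 0.0, math.sqrt(2.0))

rng = random.Random(0)
# A deliberately poor guide: the estimate falls well below the log evidence.
elbo_poor = elbo_estimate(-1.0, 1.0, 5000, rng)
# The exact posterior Normal(x / 2, sqrt(1 / 2)): the bound becomes tight.
elbo_exact = elbo_estimate(0.5 * X_OBS, math.sqrt(0.5), 10, rng)

print("log evidence       :", log_evidence)
print("ELBO, poor guide   :", elbo_poor)
print("ELBO, exact guide  :", elbo_exact)
```

When the guide equals the exact posterior, the quantity inside the expectation is the same for every sample, so even a handful of samples reproduces the log evidence up to floating-point error.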


So if we take (stochastic) gradient steps to maximize the ELBO, we will also be pushing the log evidence higher (in expectation). Furthermore, it can be shown that the gap between the ELBO and the log evidence is given by the KL divergence between the guide and the posterior:

$$ \log p_{\theta}({\bf x}) - {\rm ELBO} = 
\rm{KL}\!\left( q_{\phi}({\bf z}) \lVert p_{\theta}({\bf z} | {\bf x}) \right) $$

This KL divergence is a particular (non-negative) measure of 'closeness' between two distributions. So, for a fixed $\theta$, as we take steps in $\phi$ space that increase the ELBO, we decrease the KL divergence between the guide and the posterior, i.e. we move the guide towards the posterior. In the general case we take gradient steps in both $\theta$ and $\phi$ space simultaneously so that the guide and model play chase, with the guide tracking a moving posterior $p_{\theta}({\bf z} | {\bf x})$. Perhaps somewhat surprisingly, despite the moving target, this optimization problem can be solved (to a suitable level of approximation) for many different problems.
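For readers who want to see where the gap formula comes from, here is one short way to obtain it (a sketch using only the definitions above). Substituting the factorization $p_{\theta}({\bf x}, {\bf z}) = p_{\theta}({\bf z} | {\bf x})\, p_{\theta}({\bf x})$ into the ELBO and noting that $\log p_{\theta}({\bf x})$ does not depend on ${\bf z}$ gives

$$\begin{aligned}
{\rm ELBO} &= \mathbb{E}_{q_{\phi}({\bf z})} \left[ \log p_{\theta}({\bf x}) + \log p_{\theta}({\bf z} | {\bf x}) - \log q_{\phi}({\bf z}) \right] \\
&= \log p_{\theta}({\bf x}) - \mathbb{E}_{q_{\phi}({\bf z})} \left[ \log q_{\phi}({\bf z}) - \log p_{\theta}({\bf z} | {\bf x}) \right] \\
&= \log p_{\theta}({\bf x}) - \rm{KL}\!\left( q_{\phi}({\bf z}) \lVert p_{\theta}({\bf z} | {\bf x}) \right)
\end{aligned}$$

Since the KL divergence is non-negative, the lower bound $\log p_{\theta}({\bf x}) \ge {\rm ELBO}$ follows immediately, with equality exactly when the guide matches the posterior.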

So at a high level variational inference is easy: all we need to do is define a guide and compute gradients of the ELBO. Actually, computing gradients for general model and guide pairs leads to some complications (see the tutorial [SVI Part III](svi_part_iii.ipynb) for a discussion). For the purposes of this tutorial, let's consider that a solved problem and look at the support that Pyro provides for doing variational inference.


## `SVI` Class

In Pyro the machinery for doing variational inference is encapsulated in the `SVI` class. (At present `SVI` only provides support for the ELBO objective, but in the future Pyro will provide support for alternative variational objectives.)

The user needs to provide three things: the model, the guide, and an optimizer. We've discussed the model and guide above and we'll discuss the optimizer in some detail below, so let's assume we have all three ingredients at hand. To construct an instance of `SVI` that will do optimization via the ELBO objective, the user writes

```python
import pyro
from pyro.infer import SVI, Trace_ELBO
svi = SVI(model, guide, optimizer, loss=Trace_ELBO())
```

The `SVI` object provides two methods, `step()` and `evaluate_loss()`, that encapsulate the logic for variational learning and evaluation:

1. The method `step()` takes a single gradient step and returns an estimate of the loss (i.e. minus the ELBO). If provided, the arguments to `step()` are piped to `model()` and `guide()`. 

2. The method `evaluate_loss()` returns an estimate of the loss _without_ taking a gradient step. Just like for `step()`, if provided, arguments to `evaluate_loss()` are piped to `model()` and `guide()`.

For the case where the loss is the ELBO, the number of samples used to compute the loss (in the case of `evaluate_loss`) and the loss and gradient (in the case of `step`) is controlled by the optional argument `num_particles`, which is passed to the loss constructor, e.g. `Trace_ELBO(num_particles=10)`.


## Optimizers

In Pyro, the model and guide are allowed to be arbitrary stochastic functions provided that

1. `guide` doesn't contain `pyro.sample` statements with the `obs` argument
2. `model` and `guide` have the same call signature

This presents some challenges because it means that different executions of `model()` and `guide()` may have quite different behavior, with e.g. certain latent random variables and parameters only appearing some of the time. Indeed parameters may be created dynamically during the course of inference. In other words the space we're doing optimization over, which is parameterized by $\theta$ and $\phi$, can grow and change dynamically.

In order to support this behavior, Pyro needs to dynamically generate an optimizer for each parameter the first time it appears during learning. Luckily, PyTorch has a lightweight optimization library (see [torch.optim](http://pytorch.org/docs/master/optim.html)) that  can easily be repurposed for the dynamic case. 

All of this is controlled by the `optim.PyroOptim` class, which is basically a thin wrapper around PyTorch optimizers. `PyroOptim` takes two arguments: a constructor for PyTorch optimizers `optim_constructor` and a specification of the optimizer arguments `optim_args`. At a high level, in the course of optimization, whenever a new parameter is seen, `optim_constructor` is used to instantiate a new optimizer of the given type with arguments given by `optim_args`.


Most users will probably not interact with `PyroOptim` directly and will instead interact with the aliases defined in `optim/__init__.py`. Let's see how that goes. There are two ways to specify the optimizer arguments. In the simpler case, `optim_args` is a _fixed_ dictionary that specifies the arguments used to instantiate PyTorch optimizers for _all_ the parameters:

```python
from pyro.optim import Adam

adam_params = {"lr": 0.005, "betas": (0.95, 0.999)}
optimizer = Adam(adam_params)
```

The second way to specify the arguments allows for a finer level of control. Here the user must specify a callable that will be invoked by Pyro upon creation of an optimizer for a newly seen parameter. This callable must accept the following arguments:

1. `module_name`: the Pyro name of the module containing the parameter, if any
2. `param_name`: the Pyro name of the parameter


This gives the user the ability to, for example, customize learning rates for different parameters. For an example where this level of control is useful, see the [discussion of baselines](svi_part_iii.ipynb). Here's a simple example to illustrate the API:

```python
from pyro.optim import Adam

def per_param_callable(module_name, param_name):
    if param_name == 'my_special_parameter':
        return {"lr": 0.010}
    else:
        return {"lr": 0.001}

optimizer = Adam(per_param_callable)
```

This simply tells Pyro to use a learning rate of `0.010` for the Pyro parameter `my_special_parameter` and a learning rate of `0.001` for all other parameters.
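To illustrate the idea of creating optimizer settings lazily, the first time a parameter is seen, here is a toy stand-in written in plain Python. It is emphatically not Pyro's implementation: the class `LazyPerParamSGD`, its methods, and the parameter name `late_parameter` are hypothetical, and plain gradient descent replaces Adam to keep the sketch short. Like the wrapper described above, it accepts either a fixed dictionary or a callable with the `(module_name, param_name)` signature.

```python
class LazyPerParamSGD:
    """Toy optimizer that resolves per-parameter settings on first use."""

    def __init__(self, optim_args):
        # Either a fixed dict, or a callable (module_name, param_name) -> dict.
        self.optim_args = optim_args
        self.per_param_args = {}  # filled in lazily, one entry per parameter

    def _args_for(self, module_name, param_name):
        if param_name not in self.per_param_args:
            if callable(self.optim_args):
                args = self.optim_args(module_name, param_name)
            else:
                args = self.optim_args
            self.per_param_args[param_name] = dict(args)
        return self.per_param_args[param_name]

    def step(self, params, grads):
        """Apply one gradient descent update to every parameter with a gradient."""
        for name, grad in grads.items():
            # This toy has no notion of modules, so module_name is None.
            lr = self._args_for(None, name)["lr"]
            params[name] -= lr * grad


def per_param_callable(module_name, param_name):
    if param_name == "my_special_parameter":
        return {"lr": 0.010}
    return {"lr": 0.001}


optimizer = LazyPerParamSGD(per_param_callable)
params = {"my_special_parameter": 1.0}
optimizer.step(params, {"my_special_parameter": 2.0})

# A parameter that only appears later gets its settings when first seen.
params["late_parameter"] = 0.0
optimizer.step(params, {"late_parameter": 1.0})
print(params)
```

The point of the design is that nothing about the set of parameters has to be known when the optimizer is constructed.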


## A simple example

We finish with a simple example. You've been given a two-sided coin. You want to determine whether the coin is fair or not, i.e. whether it falls heads or tails with the same frequency. You have a prior belief about the likely fairness of the coin based on two observations:

- it's a standard quarter issued by the US Mint
- it's a bit banged up from years of use

So while you expect the coin to have been quite fair when it was first produced, you allow for its fairness to have since deviated from a perfect 1:1 ratio. You wouldn't be surprised if it turned out that the coin preferred heads over tails at a ratio of 11:10. By contrast, you would be very surprised if it turned out that the coin preferred heads over tails at a ratio of 5:1 (it's not _that_ banged up).

To turn this into a probabilistic model we encode heads and tails as `1`s and `0`s. We encode the fairness of the coin as a real number $f$, where $f$ satisfies $f \in [0.0, 1.0]$ and $f=0.50$ corresponds to a perfectly fair coin. Our prior belief about $f$ will be encoded by a beta distribution, specifically $\rm{Beta}(10,10)$, which is a symmetric probability distribution on the interval $[0.0, 1.0]$ that is peaked at $f=0.5$. 
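As a quick sanity check that $\rm{Beta}(10,10)$ encodes the beliefs described above, the following sketch evaluates the prior density at the fairness values implied by heads-to-tails ratios of 1:1, 11:10 and 5:1. It uses only the standard library; the helper `beta_log_pdf` is written for this example and is not a Pyro function.

```python
import math

def beta_log_pdf(f, alpha, beta):
    """Log density of Beta(alpha, beta) at f, for 0 < f < 1."""
    log_norm = math.lgamma(alpha) + math.lgamma(beta) - math.lgamma(alpha + beta)
    return ((alpha - 1.0) * math.log(f)
            + (beta - 1.0) * math.log(1.0 - f)
            - log_norm)

# A heads:tails ratio of h:t corresponds to fairness f = h / (h + t).
for label, f in [("1:1", 1 / 2), ("11:10", 11 / 21), ("5:1", 5 / 6)]:
    density = math.exp(beta_log_pdf(f, 10.0, 10.0))
    print("heads:tails = %-5s  f = %.3f  prior density = %.3f" % (label, f, density))
```

The density at 11:10 is nearly as high as at the peak, while the density at 5:1 is roughly 200 times smaller, which matches the intuition that a mild bias is plausible and a severe one is not.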

To learn something about the fairness of the coin that is more precise than our somewhat vague prior, we need to do an experiment and collect some data. Let's say we flip the coin 10 times and record the result of each flip. In practice we'd probably want to do more than 10 trials, but hey this is a tutorial.

Assuming we've collected the data in a list `data`, the corresponding model is given by

```python
import torch
import pyro
import pyro.distributions as dist

def model(data):
    # define the hyperparameters that control the beta prior
    alpha0 = torch.tensor(10.0)
    beta0 = torch.tensor(10.0)
    # sample f from the beta prior
    f = pyro.sample("latent_fairness", dist.Beta(alpha0, beta0))
    # loop over the observed data
    for i in range(len(data)):
        # observe datapoint i using the bernoulli
        # likelihood Bernoulli(f)
        pyro.sample("obs_{}".format(i), dist.Bernoulli(f), obs=data[i])
```


Here we have a single latent random variable (`'latent_fairness'`), which is distributed according to $\rm{Beta}(10, 10)$. Conditioned on that random variable, we observe each of the datapoints using a Bernoulli likelihood. Note that each observation is assigned a unique name in Pyro.

Our next task is to define a corresponding guide, i.e. an appropriate variational distribution for the latent random variable $f$. The only real requirement here is that $q(f)$ should be a probability distribution over the range $[0.0, 1.0]$, since $f$ doesn't make sense outside of that range. A simple choice is to use another beta distribution parameterized by two trainable parameters $\alpha_q$ and $\beta_q$. Actually, in this particular case this is the 'right' choice, since conjugacy of the Bernoulli and beta distributions means that the exact posterior is a beta distribution. In Pyro we write:

```python
import torch
import torch.distributions.constraints as constraints
import pyro
import pyro.distributions as dist

def guide(data):
    # register the two variational parameters with Pyro.
    alpha_q = pyro.param("alpha_q", torch.tensor(15.0),
                         constraint=constraints.positive)
    beta_q = pyro.param("beta_q", torch.tensor(15.0),
                        constraint=constraints.positive)
    # sample latent_fairness from the distribution Beta(alpha_q, beta_q)
    pyro.sample("latent_fairness", dist.Beta(alpha_q, beta_q))
```


There are a few things to note here:

- We've taken care that the names of the random variables line up exactly between the model and guide.
- `model(data)` and `guide(data)` take the same arguments.
- The variational parameters are `torch.tensor`s. The `requires_grad` flag is automatically set to `True` by `pyro.param`.
- We use `constraint=constraints.positive` to ensure that `alpha_q` and `beta_q` remain positive during optimization.
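One common way to keep a parameter positive is to take gradient steps on an unconstrained value $u$ and recover the parameter as $\alpha = \exp(u)$. The sketch below illustrates that idea in plain Python with made-up numbers; the gradient value and learning rate are hypothetical, and this is an illustration of the general technique rather than a description of Pyro's internals.

```python
import math

lr = 0.5
alpha = 15.0
grad_alpha = 40.0  # hypothetical gradient of some loss w.r.t. alpha

# Stepping directly on alpha can leave the valid region.
naive_alpha = alpha - lr * grad_alpha

# Stepping on u = log(alpha) instead: by the chain rule,
# d(loss)/du = d(loss)/d(alpha) * d(alpha)/du = grad_alpha * alpha.
u = math.log(alpha)
grad_u = grad_alpha * alpha
u_new = u - lr * grad_u
alpha_new = math.exp(u_new)  # positive for any real u_new

print("naive update       :", naive_alpha)
print("log-space update   :", alpha_new)
```

Even with a step this aggressive, the log-space update yields a tiny but strictly positive value, whereas the naive update produces an invalid negative concentration.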

Now we can proceed to do stochastic variational inference. 

```python
# set up the optimizer
adam_params = {"lr": 0.0005, "betas": (0.90, 0.999)}
optimizer = Adam(adam_params)

# setup the inference algorithm
svi = SVI(model, guide, optimizer, loss=Trace_ELBO())

n_steps = 5000
# do gradient steps
for step in range(n_steps):
    svi.step(data)
```    

Note that in the `step()` method we pass in the data, which then get passed to the model and guide. 

The only thing we're missing at this point is some data. So let's create some data and assemble all the code snippets above into a complete script:

```python
from __future__ import print_function
import math
import os
import torch
import torch.distributions.constraints as constraints
import pyro
from pyro.optim import Adam
from pyro.infer import SVI, Trace_ELBO
import pyro.distributions as dist

# this is for running the notebook in our testing framework
smoke_test = ('CI' in os.environ)
n_steps = 2 if smoke_test else 2000

# enable validation (e.g. validate parameters of distributions)
assert pyro.__version__.startswith('0.3.1')
pyro.enable_validation(True)

# clear the param store in case we're in a REPL
pyro.clear_param_store()

# create some data with 6 observed heads and 4 observed tails
data = []
for _ in range(6):
    data.append(torch.tensor(1.0))
for _ in range(4):
    data.append(torch.tensor(0.0))

def model(data):
    # define the hyperparameters that control the beta prior
    alpha0 = torch.tensor(10.0)
    beta0 = torch.tensor(10.0)
    # sample f from the beta prior
    f = pyro.sample("latent_fairness", dist.Beta(alpha0, beta0))
    # loop over the observed data
    for i in range(len(data)):
        # observe datapoint i using the bernoulli likelihood
        pyro.sample("obs_{}".format(i), dist.Bernoulli(f), obs=data[i])

def guide(data):
    # register the two variational parameters with Pyro
    # - both parameters will have initial value 15.0.
    # - because we invoke constraints.positive, the optimizer
    # will take gradients on the unconstrained parameters
    # (which are related to the constrained parameters by a log)
    alpha_q = pyro.param("alpha_q", torch.tensor(15.0),
                         constraint=constraints.positive)
    beta_q = pyro.param("beta_q", torch.tensor(15.0),
                        constraint=constraints.positive)
    # sample latent_fairness from the distribution Beta(alpha_q, beta_q)
    pyro.sample("latent_fairness", dist.Beta(alpha_q, beta_q))

# setup the optimizer
adam_params = {"lr": 0.0005, "betas": (0.90, 0.999)}
optimizer = Adam(adam_params)

# setup the inference algorithm
svi = SVI(model, guide, optimizer, loss=Trace_ELBO())

# do gradient steps
for step in range(n_steps):
    svi.step(data)
    if step % 100 == 0:
        print('.', end='')

# grab the learned variational parameters
alpha_q = pyro.param("alpha_q").item()
beta_q = pyro.param("beta_q").item()

# here we use some facts about the beta distribution
# compute the inferred mean of the coin's fairness
inferred_mean = alpha_q / (alpha_q + beta_q)
# compute inferred standard deviation
factor = beta_q / (alpha_q * (1.0 + alpha_q + beta_q))
inferred_std = inferred_mean * math.sqrt(factor)

print("\nbased on the data and our prior belief, the fairness " +
      "of the coin is %.3f +- %.3f" % (inferred_mean, inferred_std))
```

### Sample output:

```
based on the data and our prior belief, the fairness of the coin is 0.532 +- 0.090
```

This estimate is to be compared to the exact posterior mean, which in this case is given by $16/30 = 0.5\bar{3}$.
Note that the final estimate of the fairness of the coin is in between the fairness preferred by the prior (namely $0.50$) and the fairness suggested by the raw empirical frequencies ($6/10 = 0.60$).
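The exact posterior quoted above follows from Beta-Bernoulli conjugacy: observing heads adds to the first concentration parameter and observing tails adds to the second. The short check below computes the exact posterior mean and standard deviation with the standard library, using the prior and data from this tutorial; the variable names are chosen for this sketch.

```python
import math

alpha0, beta0 = 10.0, 10.0  # Beta prior used in the model
heads, tails = 6, 4         # counts in the tutorial's data

# Conjugate update: the posterior is Beta(alpha0 + heads, beta0 + tails).
alpha_post = alpha0 + heads
beta_post = beta0 + tails

total = alpha_post + beta_post
exact_mean = alpha_post / total
exact_std = math.sqrt(alpha_post * beta_post / (total ** 2 * (total + 1.0)))

print("exact posterior: Beta(%g, %g), mean %.3f, std %.3f"
      % (alpha_post, beta_post, exact_mean, exact_std))
```

The exact standard deviation is about 0.090, so both the mean and the spread reported in the sample output are close to the true posterior values.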

## References

[1] `Automated Variational Inference in Probabilistic Programming`,<br/>&nbsp;&nbsp;&nbsp;&nbsp;
David Wingate, Theo Weber

[2] `Black Box Variational Inference`,<br/>&nbsp;&nbsp;&nbsp;&nbsp;
Rajesh Ranganath, Sean Gerrish, David M. Blei

[3] `Auto-Encoding Variational Bayes`,<br/>&nbsp;&nbsp;&nbsp;&nbsp;
Diederik P Kingma, Max Welling