<h2 id="probabilistic-models">Probabilistic Models</h2>
<p>A probabilistic model asserts how observations from a natural phenomenon arise. The model is a <em>joint distribution</em> <span class="math display">\[\begin{aligned}
  p(\mathbf{x}, \mathbf{z})\end{aligned}\]</span> of observed variables <span class="math inline">\(\mathbf{x}\)</span> corresponding to data, and latent variables <span class="math inline">\(\mathbf{z}\)</span> that provide the hidden structure used to generate <span class="math inline">\(\mathbf{x}\)</span>. The joint distribution factorizes into two components, <span class="math inline">\(p(\mathbf{x}, \mathbf{z}) = p(\mathbf{x} \mid \mathbf{z})\, p(\mathbf{z})\)</span>.</p>
<p>The <em>likelihood</em> <span class="math display">\[\begin{aligned}
  p(\mathbf{x} \mid \mathbf{z})\end{aligned}\]</span> is a probability distribution that describes how any data <span class="math inline">\(\mathbf{x}\)</span> depend on the latent variables <span class="math inline">\(\mathbf{z}\)</span>. The likelihood posits a data generating process, where the data <span class="math inline">\(\mathbf{x}\)</span> are assumed drawn from the likelihood conditioned on a particular hidden pattern described by <span class="math inline">\(\mathbf{z}\)</span>.</p>
<p>The <em>prior</em> <span class="math display">\[\begin{aligned}
  p(\mathbf{z})\end{aligned}\]</span> is a probability distribution that describes the latent variables present in the data. It posits a generating process of the hidden structure.</p>
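<p>As a minimal sketch of this two-stage generating process, the snippet below draws from a toy joint distribution in plain Python. The particular choices (a standard normal prior, a unit-variance normal likelihood, and the helper name <code>sample_joint</code>) are assumptions made for illustration and are not part of Edward.</p>

```python
import random

def sample_joint(rng):
    """Draw one (z, x) pair: z from the prior, then x from the likelihood."""
    z = rng.gauss(0.0, 1.0)   # prior p(z) = Normal(0, 1), assumed for illustration
    x = rng.gauss(z, 1.0)     # likelihood p(x | z) = Normal(z, 1), also assumed
    return z, x

rng = random.Random(0)
draws = [sample_joint(rng) for _ in range(5)]
for z, x in draws:
    print("z = %.3f, x = %.3f" % (z, x))
```

<p>In practice only the <code>x</code> values would be observed; the <code>z</code> values are the hidden structure that inference tries to recover.</p>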
<p>For details on how to specify a model in Edward, see the <a href="/api/model">model API</a>. We describe several examples in detail in the <a href="/tutorials/">tutorials</a>.</p>

<h2 id="inference-of-probabilistic-models">Inference of Probabilistic Models</h2>
<p>This tutorial asks the question: what does it mean to do inference of probabilistic models? This sets the stage for understanding how to design inference algorithms in Edward.</p>
<h3 id="the-posterior">The posterior</h3>
<p>How can we use a model <span class="math inline">\(p(\mathbf{x}, \mathbf{z})\)</span> to analyze some data <span class="math inline">\(\mathbf{x}\)</span>? In other words, what hidden structure <span class="math inline">\(\mathbf{z}\)</span> explains the data? We seek to infer this hidden structure using the model.</p>
<p>One method of inference leverages Bayes’ rule to define the <em>posterior</em> <span class="math display">\[\begin{aligned}
  p(\mathbf{z} \mid \mathbf{x})
  &=
  \frac{p(\mathbf{x}, \mathbf{z})}{\int p(\mathbf{x}, \mathbf{z}) \text{d}\mathbf{z}}.\end{aligned}\]</span> The posterior is the distribution of the latent variables <span class="math inline">\(\mathbf{z}\)</span>, conditioned on some (observed) data <span class="math inline">\(\mathbf{x}\)</span>. Drawing analogy to representation learning, it is a probabilistic description of the data’s hidden representation.</p>
<p>From the perspective of inductivism, as practiced by classical Bayesians (and implicitly by frequentists), the posterior is our updated hypothesis about the latent variables. From the perspective of hypothetico-deductivism, as practiced by statisticians such as Box, Rubin, and Gelman, the posterior is simply a fitted model to data, to be criticized and thus revised <span class="citation" data-cites="box1982apology gelman2013philosophy">(Box, 1982; Gelman &amp; Shalizi, 2013)</span>.</p>
<h3 id="inferring-the-posterior">Inferring the posterior</h3>
<p>Now we know what the posterior represents. How do we calculate it? This is the central computational challenge in inference.</p>
<p>The posterior is difficult to compute because of its normalizing constant, which is the integral in the denominator. This is often a high-dimensional integral that lacks an analytic (closed-form) solution. Thus, calculating the posterior means <em>approximating</em> the posterior.</p>
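<p>To make the role of the normalizing constant concrete, here is a small sketch for a one-dimensional latent variable, where the integral can still be approximated on a grid. The model (a flat prior on a coin's bias with 7 heads and 3 tails observed) and the function name <code>joint</code> are assumptions chosen for illustration.</p>

```python
def joint(z, heads=7, tails=3):
    """Joint density p(x, z): flat prior on [0, 1] times a Bernoulli likelihood."""
    return z ** heads * (1.0 - z) ** tails

# The midpoint rule on a grid approximates the evidence p(x) = integral of p(x, z) dz.
n = 1000
width = 1.0 / n
grid = [(i + 0.5) * width for i in range(n)]
evidence = sum(joint(z) for z in grid) * width

# Bayes' rule: posterior density = joint / evidence, evaluated on the grid.
posterior = [joint(z) / evidence for z in grid]
posterior_mean = sum(z * p for z, p in zip(grid, posterior)) * width
print(evidence, posterior_mean)
```

<p>A grid with 1000 points per dimension needs <span class="math inline">\(1000^d\)</span> evaluations for a <span class="math inline">\(d\)</span>-dimensional <span class="math inline">\(\mathbf{z}\)</span>, which is why this direct approach does not scale and approximation is needed.</p>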
<p>For details on how to specify inference in Edward, see the <a href="/api/inference">inference API</a>. We describe several examples in detail in the <a href="/tutorials/">tutorials</a>.</p>

<h2 id="variational-inference">Variational Inference</h2>
<p>Variational inference is an umbrella term for algorithms which cast posterior inference as optimization <span class="citation" data-cites="hinton1993keeping waterhouse1996bayesian jordan1999introduction">(Hinton &amp; Camp, 1993; Jordan, Ghahramani, Jaakkola, &amp; Saul, 1999; Waterhouse, MacKay, &amp; Robinson, 1996)</span>.</p>
<p>The core idea involves two steps:</p>
<ol>
<li>posit a family of distributions <span class="math inline">\(q(\mathbf{z}\;;\;\lambda)\)</span> over the latent variables;</li>
<li>match <span class="math inline">\(q(\mathbf{z}\;;\;\lambda)\)</span> to the posterior by optimizing over its parameters <span class="math inline">\(\lambda\)</span>.</li>
</ol>
<p>This strategy converts the problem of computing the posterior <span class="math inline">\(p(\mathbf{z} \mid \mathbf{x})\)</span> into an optimization problem: minimize a divergence measure <span class="math display">\[\begin{aligned}
  \lambda^*
  &=
  \arg\min_\lambda \text{divergence}(
  p(\mathbf{z} \mid \mathbf{x})
  ,
  q(\mathbf{z}\;;\;\lambda)
  ).\end{aligned}\]</span> The optimized distribution <span class="math inline">\(q(\mathbf{z}\;;\;\lambda^*)\)</span> is used as a proxy to the posterior <span class="math inline">\(p(\mathbf{z}\mid \mathbf{x})\)</span>.</p>
<p>Edward takes the perspective that the posterior is (typically) intractable, and thus we must build a model of latent variables that best approximates the posterior. It is analogous to the perspective that the true data generating process is unknown, and thus we build models of data to best approximate the true process.</p>

<h2 id="textklqp-minimization"><span class="math inline">\(\text{KL}(q\|p)\)</span> Minimization</h2>
<p>One form of variational inference minimizes the Kullback-Leibler divergence <strong>from</strong> <span class="math inline">\(q(\mathbf{z}\;;\;\lambda)\)</span> <strong>to</strong> <span class="math inline">\(p(\mathbf{z} \mid \mathbf{x})\)</span>, <span class="math display">\[\begin{aligned}
  \lambda^*
  &=
  \arg\min_\lambda \text{KL}(
  q(\mathbf{z}\;;\;\lambda)
  \;\|\;
  p(\mathbf{z} \mid \mathbf{x})
  )\\
  &=
  \arg\min_\lambda\;
  \mathbb{E}_{q(\mathbf{z}\;;\;\lambda)}
  \big[
  \log q(\mathbf{z}\;;\;\lambda)
  -
  \log p(\mathbf{z} \mid \mathbf{x})
  \big].\end{aligned}\]</span> The KL divergence is a non-symmetric, information-theoretic measure of dissimilarity between two probability distributions <span class="citation" data-cites="hinton1993keeping waterhouse1996bayesian jordan1999introduction">(Hinton &amp; Camp, 1993; Jordan, Ghahramani, Jaakkola, &amp; Saul, 1999; Waterhouse, MacKay, &amp; Robinson, 1996)</span>.</p>
<h3 id="the-evidence-lower-bound">The Evidence Lower Bound</h3>
<p>The above optimization problem is intractable because it directly depends on the posterior <span class="math inline">\(p(\mathbf{z} \mid \mathbf{x})\)</span>. To tackle this, consider the property <span class="math display">\[\begin{aligned}
  \log p(\mathbf{x})
  &=
  \text{KL}(
  q(\mathbf{z}\;;\;\lambda)
  \;\|\;
  p(\mathbf{z} \mid \mathbf{x})
  )\\
  &\quad+\;
  \mathbb{E}_{q(\mathbf{z}\;;\;\lambda)}
  \big[
  \log p(\mathbf{x}, \mathbf{z})
  -
  \log q(\mathbf{z}\;;\;\lambda)
  \big]\end{aligned}\]</span> where the left hand side is the logarithm of the marginal likelihood <span class="math inline">\(p(\mathbf{x}) = \int p(\mathbf{x}, \mathbf{z}) \text{d}\mathbf{z}\)</span>, also known as the model evidence. (Try deriving this using Bayes’ rule!)</p>
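<p>One way to carry out this derivation, sketched here for reference: substitute Bayes’ rule <span class="math inline">\(p(\mathbf{z} \mid \mathbf{x}) = p(\mathbf{x}, \mathbf{z}) / p(\mathbf{x})\)</span> into the definition of the divergence, and use the fact that <span class="math inline">\(\log p(\mathbf{x})\)</span> does not depend on <span class="math inline">\(\mathbf{z}\)</span>, <span class="math display">\[\begin{aligned}
  \text{KL}(
  q(\mathbf{z}\;;\;\lambda)
  \;\|\;
  p(\mathbf{z} \mid \mathbf{x})
  )
  &=
  \mathbb{E}_{q(\mathbf{z}\;;\;\lambda)}
  \big[
  \log q(\mathbf{z}\;;\;\lambda)
  -
  \log p(\mathbf{z} \mid \mathbf{x})
  \big]\\
  &=
  \mathbb{E}_{q(\mathbf{z}\;;\;\lambda)}
  \big[
  \log q(\mathbf{z}\;;\;\lambda)
  -
  \log p(\mathbf{x}, \mathbf{z})
  \big]
  +
  \log p(\mathbf{x}).\end{aligned}\]</span> Moving the expectation to the other side of the equation recovers the property.</p>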
<p>The evidence is a constant with respect to the variational parameters <span class="math inline">\(\lambda\)</span>, so we can minimize <span class="math inline">\(\text{KL}(q\|p)\)</span> by instead maximizing the Evidence Lower BOund, <span class="math display">\[\begin{aligned}
  \text{ELBO}(\lambda)
  &=\;
  \mathbb{E}_{q(\mathbf{z}\;;\;\lambda)}
  \big[
  \log p(\mathbf{x}, \mathbf{z})
  -
  \log q(\mathbf{z}\;;\;\lambda)
  \big].\end{aligned}\]</span> In the ELBO, both <span class="math inline">\(p(\mathbf{x}, \mathbf{z})\)</span> and <span class="math inline">\(q(\mathbf{z}\;;\;\lambda)\)</span> are tractable. The optimization problem we seek to solve becomes <span class="math display">\[\begin{aligned}
  \lambda^*
  &=
  \arg \max_\lambda \text{ELBO}(\lambda).\end{aligned}\]</span> As per its name, the ELBO is a lower bound on the evidence, and optimizing it tries to maximize the probability of observing the data. What does maximizing the ELBO do? Splitting the ELBO reveals a trade-off <span class="math display">\[\begin{aligned}
  \text{ELBO}(\lambda)
  &=\;
  \mathbb{E}_{q(\mathbf{z} \;;\; \lambda)}[\log p(\mathbf{x}, \mathbf{z})]
  - \mathbb{E}_{q(\mathbf{z} \;;\; \lambda)}[\log q(\mathbf{z}\;;\;\lambda)],\end{aligned}\]</span> where the first term represents an energy and the second term (including the minus sign) represents the entropy of <span class="math inline">\(q\)</span>. The energy encourages <span class="math inline">\(q\)</span> to focus probability mass where the joint density <span class="math inline">\(p(\mathbf{x}, \mathbf{z})\)</span> is high. The entropy encourages <span class="math inline">\(q\)</span> to spread probability mass rather than concentrate it in one location.</p>
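<p>The bound can be checked numerically on a toy model where every term has a closed form. The sketch below assumes a prior <span class="math inline">\(z \sim \text{Normal}(0, 1)\)</span>, a likelihood <span class="math inline">\(x \mid z \sim \text{Normal}(z, 1)\)</span>, and a variational distribution <span class="math inline">\(q(z) = \text{Normal}(m, s^2)\)</span>; these choices and the function names are illustrative assumptions, not Edward code.</p>

```python
import math

def elbo(m, s, x):
    """Closed-form ELBO for prior z ~ Normal(0, 1), likelihood x | z ~ Normal(z, 1),
    and variational distribution q(z) = Normal(m, s^2)."""
    log_2pi = math.log(2.0 * math.pi)
    expected_log_prior = -0.5 * log_2pi - 0.5 * (m ** 2 + s ** 2)
    expected_log_lik = -0.5 * log_2pi - 0.5 * ((x - m) ** 2 + s ** 2)
    entropy = 0.5 * log_2pi + 0.5 + math.log(s)
    # energy term plus entropy term, matching the split of the ELBO
    return expected_log_prior + expected_log_lik + entropy

def log_evidence(x):
    """log p(x) for this model: marginally, x ~ Normal(0, 2)."""
    return -0.5 * math.log(2.0 * math.pi * 2.0) - x ** 2 / 4.0

x = 1.0
print(log_evidence(x))                     # the log evidence
print(elbo(0.0, 1.0, x))                   # a poor q: strictly below the log evidence
print(elbo(x / 2.0, math.sqrt(0.5), x))    # q equal to the exact posterior: the bound is tight
```

<p>For this model the exact posterior is <span class="math inline">\(\text{Normal}(x/2, 1/2)\)</span>, so the last line shows the gap closing when <span class="math inline">\(q\)</span> matches the posterior.</p>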
<p>Edward uses two generic strategies to obtain gradients for optimization.</p>
<ul>
<li>Score function gradient;</li>
<li>Reparameterization gradient.</li>
</ul>
<h2 id="score-function-gradient">Score function gradient</h2>
<p>Gradient descent is a standard approach for optimizing complicated objectives like the ELBO. The idea is to calculate its gradient <span class="math display">\[\begin{aligned}
  \nabla_\lambda\;
  \text{ELBO}(\lambda)
  &=
  \nabla_\lambda\;
  \mathbb{E}_{q(\mathbf{z}\;;\;\lambda)}
  \big[
  \log p(\mathbf{x}, \mathbf{z})
  -
  \log q(\mathbf{z}\;;\;\lambda)
  \big],\end{aligned}\]</span> and update the current set of parameters proportional to the gradient.</p>
<p>The score function gradient estimator leverages a property of logarithms to write the gradient as <span class="math display">\[\begin{aligned}
  \nabla_\lambda\;
  \text{ELBO}(\lambda)
  &=\;
  \mathbb{E}_{q(\mathbf{z}\;;\;\lambda)}
  \big[
  \nabla_\lambda \log q(\mathbf{z}\;;\;\lambda)
  \:
  \big(
  \log p(\mathbf{x}, \mathbf{z})
  -
  \log q(\mathbf{z}\;;\;\lambda)
  \big)
  \big].\end{aligned}\]</span> The gradient of the ELBO is an expectation over the variational model <span class="math inline">\(q(\mathbf{z}\;;\;\lambda)\)</span>; the only new ingredient it requires is the <em>score function</em> <span class="math inline">\(\nabla_\lambda \log q(\mathbf{z}\;;\;\lambda)\)</span> <span class="citation" data-cites="paisley2012variational ranganath2014black">(Paisley, Blei, &amp; Jordan, 2012; Ranganath, Gerrish, &amp; Blei, 2014)</span>.</p>
<p>We can use Monte Carlo integration to obtain noisy estimates of both the ELBO and its gradient. The basic procedure follows these steps:</p>
<ol>
<li>draw <span class="math inline">\(S\)</span> samples <span class="math inline">\(\{\mathbf{z}_s\}_1^S \sim q(\mathbf{z}\;;\;\lambda)\)</span>,</li>
<li>evaluate the argument of the expectation using <span class="math inline">\(\{\mathbf{z}_s\}_1^S\)</span>, and</li>
<li>compute the empirical mean of the evaluated quantities.</li>
</ol>
<p>A Monte Carlo estimate of the gradient is then <span class="math display">\[\begin{aligned}
  \nabla_\lambda\;
  \text{ELBO}(\lambda)
  &\approx\;
  \frac{1}{S}
  \sum_{s=1}^{S}
  \big[
  \big(
  \log p(\mathbf{x}, \mathbf{z}_s)
  -
  \log q(\mathbf{z}_s\;;\;\lambda)
  \big)
  \:
  \nabla_\lambda \log q(\mathbf{z}_s\;;\;\lambda)
  \big].\end{aligned}\]</span> This is an unbiased estimate of the actual gradient of the ELBO.</p>
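<p>The three steps above can be sketched in plain Python for a toy model. We assume a prior <span class="math inline">\(z \sim \text{Normal}(0, 1)\)</span>, a likelihood <span class="math inline">\(x \mid z \sim \text{Normal}(z, 1)\)</span>, and <span class="math inline">\(q(z) = \text{Normal}(m, s^2)\)</span>, and estimate the gradient with respect to the mean <span class="math inline">\(m\)</span> only. The score function is derived by hand here, whereas Edward would obtain it by automatic differentiation; all names are hypothetical.</p>

```python
import math
import random

def log_joint(x, z):
    """log p(x, z) for prior z ~ Normal(0, 1) and likelihood x | z ~ Normal(z, 1)."""
    return -math.log(2.0 * math.pi) - 0.5 * z ** 2 - 0.5 * (x - z) ** 2

def log_q(z, m, s):
    """log q(z; m, s) for q = Normal(m, s^2)."""
    return -0.5 * math.log(2.0 * math.pi * s ** 2) - 0.5 * ((z - m) / s) ** 2

def score_function_gradient(x, m, s, num_samples, rng):
    """Monte Carlo estimate of d ELBO / d m using the score function estimator."""
    total = 0.0
    for _ in range(num_samples):
        z = rng.gauss(m, s)                  # step 1: z_s ~ q(z; lambda)
        score = (z - m) / s ** 2             # d/dm log q(z_s; m, s), derived by hand
        total += (log_joint(x, z) - log_q(z, m, s)) * score   # step 2
    return total / num_samples               # step 3: empirical mean

rng = random.Random(0)
estimate = score_function_gradient(x=1.0, m=0.0, s=1.0, num_samples=100000, rng=rng)
print(estimate)   # the exact gradient for this model is x - 2 * m = 1.0
```

<p>The estimate fluctuates noticeably from seed to seed even with many samples, which reflects the high variance this estimator is known for.</p>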
<h2 id="reparameterization-gradient">Reparameterization gradient</h2>
<p>If the model has differentiable latent variables, then it is generally advantageous to leverage gradient information from the model in order to better traverse the optimization space. One approach to doing this is the reparameterization gradient <span class="citation" data-cites="kingma2014auto rezende2014stochastic">(Kingma &amp; Welling, 2014; Rezende, Mohamed, &amp; Wierstra, 2014)</span>.</p>
<p>Some variational distributions <span class="math inline">\(q(\mathbf{z}\;;\;\lambda)\)</span> admit useful reparameterizations. For example, we can reparameterize a normal distribution <span class="math inline">\(\mathbf{z} \sim \text{Normal}(\mu, \Sigma)\)</span> as <span class="math inline">\(\mathbf{z} \sim \mu + L \text{Normal}(0, I)\)</span> where <span class="math inline">\(\Sigma = LL^\top\)</span>. In general, write this as <span class="math display">\[\begin{aligned}
  \epsilon &\sim q(\epsilon)\\
  \mathbf{z} &= \mathbf{z}(\epsilon \;;\; \lambda),\end{aligned}\]</span> where <span class="math inline">\(\epsilon\)</span> is a random variable that does <strong>not</strong> depend on the variational parameters <span class="math inline">\(\lambda\)</span>. The deterministic function <span class="math inline">\(\mathbf{z}(\cdot;\lambda)\)</span> encapsulates the variational parameters instead, and following the process is equivalent to directly drawing <span class="math inline">\(\mathbf{z}\)</span> from the original distribution.</p>
<p>The reparameterization gradient leverages this property of the variational distribution to write the gradient as <span class="math display">\[\begin{aligned}
  \nabla_\lambda\;
  \text{ELBO}(\lambda)
  &=\;
  \mathbb{E}_{q(\epsilon)}
  \big[
  \nabla_\lambda
  \big(
  \log p(\mathbf{x}, \mathbf{z}(\epsilon \;;\; \lambda))
  -
  \log q(\mathbf{z}(\epsilon \;;\; \lambda) \;;\;\lambda)
  \big)
  \big].\end{aligned}\]</span> The gradient of the ELBO is an expectation over the base distribution <span class="math inline">\(q(\epsilon)\)</span>, and the gradient can be applied directly to the inner expression.</p>
<p>We can use Monte Carlo integration to obtain noisy estimates of both the ELBO and its gradient. The basic procedure follows these steps:</p>
<ol>
<li>draw <span class="math inline">\(S\)</span> samples <span class="math inline">\(\{\epsilon_s\}_1^S \sim q(\epsilon)\)</span>,</li>
<li>evaluate the argument of the expectation using <span class="math inline">\(\{\epsilon_s\}_1^S\)</span>, and</li>
<li>compute the empirical mean of the evaluated quantities.</li>
</ol>
<p>A Monte Carlo estimate of the gradient is then <span class="math display">\[\begin{aligned}
  \nabla_\lambda\;
  \text{ELBO}(\lambda)
  &\approx\;
  \frac{1}{S}
  \sum_{s=1}^{S}
  \big[
  \nabla_\lambda
  \big(
  \log p(\mathbf{x}, \mathbf{z}(\epsilon_s \;;\; \lambda))
  -
  \log q(\mathbf{z}(\epsilon_s \;;\; \lambda) \;;\;\lambda)
  \big)
  \big].\end{aligned}\]</span> This is an unbiased estimate of the actual gradient of the ELBO. Empirically, it exhibits lower variance than the score function gradient, leading to faster convergence in a large set of problems.</p>
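<p>The procedure above can be sketched for a toy model in plain Python. We assume a prior <span class="math inline">\(z \sim \text{Normal}(0, 1)\)</span>, a likelihood <span class="math inline">\(x \mid z \sim \text{Normal}(z, 1)\)</span>, and <span class="math inline">\(q(z) = \text{Normal}(m, s^2)\)</span> reparameterized as <span class="math inline">\(z = m + s\,\epsilon\)</span>. The derivatives are written out by hand for this model, whereas Edward would obtain them by automatic differentiation; the function name is hypothetical.</p>

```python
import random

def reparameterization_gradient(x, m, s, num_samples, rng):
    """Monte Carlo estimate of (d ELBO / d m, d ELBO / d s) for prior z ~ Normal(0, 1),
    likelihood x | z ~ Normal(z, 1), and q(z) = Normal(m, s^2) with z = m + s * eps."""
    grad_m = 0.0
    grad_s = 0.0
    for _ in range(num_samples):
        eps = rng.gauss(0.0, 1.0)            # step 1: eps_s ~ q(eps) = Normal(0, 1)
        z = m + s * eps                      # deterministic transform z(eps; lambda)
        dlogp_dz = -z + (x - z)              # d/dz log p(x, z), derived by hand
        # log q(z(eps); m, s) = -log(s) - eps^2 / 2 + const, so its gradient
        # is 0 with respect to m and -1 / s with respect to s.
        grad_m += dlogp_dz * 1.0             # chain rule: dz/dm = 1
        grad_s += dlogp_dz * eps + 1.0 / s   # chain rule: dz/ds = eps
    return grad_m / num_samples, grad_s / num_samples

rng = random.Random(0)
grad_m, grad_s = reparameterization_gradient(x=1.0, m=0.0, s=1.0, num_samples=100000, rng=rng)
print(grad_m, grad_s)   # exact values for this model: x - 2 * m = 1.0 and 1 / s - 2 * s = -1.0
```

<p>Note that the samples of <code>eps</code> never depend on <code>m</code> or <code>s</code>; all dependence on the variational parameters flows through the transform, which is what lets the gradient move inside the expectation.</p>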

<h2 id="textklpq-minimization"><span class="math inline">\(\text{KL}(p\|q)\)</span> Minimization</h2>
<p>One form of variational inference minimizes the Kullback-Leibler divergence <strong>from</strong> <span class="math inline">\(p(\mathbf{z} \mid \mathbf{x})\)</span> <strong>to</strong> <span class="math inline">\(q(\mathbf{z}\;;\;\lambda)\)</span>, <span class="math display">\[\begin{aligned}
  \lambda^*
  &=
  \arg\min_\lambda \text{KL}(
  p(\mathbf{z} \mid \mathbf{x})
  \;\|\;
  q(\mathbf{z}\;;\;\lambda)
  )\\
  &=
  \arg\min_\lambda\;
  \mathbb{E}_{p(\mathbf{z} \mid \mathbf{x})}
  \big[
  \log p(\mathbf{z} \mid \mathbf{x})
  -
  \log q(\mathbf{z}\;;\;\lambda)
  \big].\end{aligned}\]</span> The KL divergence is a non-symmetric, information-theoretic measure of dissimilarity between two probability distributions.</p>
<h3 id="minimizing-an-intractable-objective-function">Minimizing an intractable objective function</h3>
<p>The <span class="math inline">\(\text{KL}(p\|q)\)</span> objective we seek to minimize is intractable; it directly involves the posterior <span class="math inline">\(p(\mathbf{z} \mid \mathbf{x})\)</span>. Ignoring this for the moment, consider its gradient <span class="math display">\[\begin{aligned}
  \nabla_\lambda\;
  \text{KL}(
  p(\mathbf{z} \mid \mathbf{x})
  \;\|\;
  q(\mathbf{z}\;;\;\lambda)
  )
  &=
  -
  \mathbb{E}_{p(\mathbf{z} \mid \mathbf{x})}
  \big[
  \nabla_\lambda\;
  \log q(\mathbf{z}\;;\;\lambda)
  \big].\end{aligned}\]</span> Both <span class="math inline">\(\text{KL}(p\|q)\)</span> and its gradient are intractable because of the posterior expectation. We can use importance sampling to both estimate the objective and calculate stochastic gradients <span class="citation" data-cites="oh1992adaptive">(Oh &amp; Berger, 1992)</span>.</p>
<h3 id="adaptive-importance-sampling">Adaptive importance sampling</h3>
<p>First rewrite the expectation to be with respect to the variational distribution, <span class="math display">\[\begin{aligned}
  -
  \mathbb{E}_{p(\mathbf{z} \mid \mathbf{x})}
  \big[
  \nabla_\lambda\;
  \log q(\mathbf{z}\;;\;\lambda)
  \big]
  &=
  -
  \mathbb{E}_{q(\mathbf{z}\;;\;\lambda)}
  \Bigg[
  \frac{p(\mathbf{z} \mid \mathbf{x})}{q(\mathbf{z}\;;\;\lambda)}
  \nabla_\lambda\;
  \log q(\mathbf{z}\;;\;\lambda)
  \Bigg].\end{aligned}\]</span></p>
<p>We then use importance sampling to obtain a noisy estimate of this gradient. The basic procedure follows these steps:</p>
<ol>
<li>draw <span class="math inline">\(S\)</span> samples <span class="math inline">\(\{\mathbf{z}_s\}_1^S \sim q(\mathbf{z}\;;\;\lambda)\)</span>,</li>
<li>evaluate <span class="math inline">\(\nabla_\lambda\; \log q(\mathbf{z}_s\;;\;\lambda)\)</span>,</li>
<li>compute the normalized importance weights <span class="math display">\[\begin{aligned}
    w_s
    &=
    \frac{p(\mathbf{z}_s \mid \mathbf{x})}{q(\mathbf{z}_s\;;\;\lambda)}
    \Bigg/
    \sum_{s=1}^{S}
    \frac{p(\mathbf{z}_s \mid \mathbf{x})}{q(\mathbf{z}_s\;;\;\lambda)}
  \end{aligned}\]</span></li>
<li>compute the weighted sum.</li>
</ol>
<p>The key insight is that we can use the joint <span class="math inline">\(p(\mathbf{x},\mathbf{z})\)</span> instead of the posterior when estimating the normalized importance weights <span class="math display">\[\begin{aligned}
  w_s
  &=
  \frac{p(\mathbf{z}_s \mid \mathbf{x})}{q(\mathbf{z}_s\;;\;\lambda)}
  \Bigg/
  \sum_{s=1}^{S}
  \frac{p(\mathbf{z}_s \mid \mathbf{x})}{q(\mathbf{z}_s\;;\;\lambda)} \\
  &=
  \frac{p(\mathbf{x}, \mathbf{z}_s)}{q(\mathbf{z}_s\;;\;\lambda)}
  \Bigg/
  \sum_{s=1}^{S}
  \frac{p(\mathbf{x}, \mathbf{z}_s)}{q(\mathbf{z}_s\;;\;\lambda)}.\end{aligned}\]</span> This follows from Bayes’ rule <span class="math display">\[\begin{aligned}
  p(\mathbf{z} \mid \mathbf{x})
  &=
  p(\mathbf{x}, \mathbf{z}) / p(\mathbf{x})\\
  &=
  p(\mathbf{x}, \mathbf{z}) / \text{a constant function of }\mathbf{z}.\end{aligned}\]</span></p>
<p>Importance sampling thus gives the following biased yet consistent gradient estimate <span class="math display">\[\begin{aligned}
\nabla_\lambda\;
  \text{KL}(
  p(\mathbf{z} \mid \mathbf{x})
  \;\|\;
  q(\mathbf{z}\;;\;\lambda)
  )
  &\approx
  -
  \sum_{s=1}^S
  w_s
  \nabla_\lambda\; \log q(\mathbf{z}_s\;;\;\lambda).\end{aligned}\]</span> The objective <span class="math inline">\(\text{KL}(p\|q)\)</span> can be calculated in a similar fashion. The only new ingredient for its gradient is the score function <span class="math inline">\(\nabla_\lambda \log q(\mathbf{z}\;;\;\lambda)\)</span>. Edward uses automatic differentiation, specifically with TensorFlow’s computational graphs, making this gradient computation both simple and efficient to distribute.</p>
<p>Adaptive importance sampling follows this gradient to a local optimum using stochastic optimization. It is adaptive because the variational distribution <span class="math inline">\(q(\mathbf{z}\;;\;\lambda)\)</span> iteratively gets closer to the posterior <span class="math inline">\(p(\mathbf{z} \mid \mathbf{x})\)</span>.</p>
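<p>A minimal sketch of one such gradient estimate follows, again for an assumed toy model with prior <span class="math inline">\(z \sim \text{Normal}(0, 1)\)</span>, likelihood <span class="math inline">\(x \mid z \sim \text{Normal}(z, 1)\)</span>, and <span class="math inline">\(q(z) = \text{Normal}(m, s^2)\)</span>. The weights are computed from the joint density in log space and then normalized; the helper names are hypothetical and the score function is derived by hand.</p>

```python
import math
import random

def log_joint(x, z):
    """log p(x, z) for prior z ~ Normal(0, 1) and likelihood x | z ~ Normal(z, 1)."""
    return -math.log(2.0 * math.pi) - 0.5 * z ** 2 - 0.5 * (x - z) ** 2

def log_q(z, m, s):
    """log q(z; m, s) for q = Normal(m, s^2)."""
    return -0.5 * math.log(2.0 * math.pi * s ** 2) - 0.5 * ((z - m) / s) ** 2

def klpq_gradient(x, m, s, num_samples, rng):
    """Self-normalized importance sampling estimate of d KL(p || q) / d m."""
    zs = [rng.gauss(m, s) for _ in range(num_samples)]      # step 1: z_s ~ q
    scores = [(z - m) / s ** 2 for z in zs]                 # step 2: d/dm log q(z_s)
    # step 3: weights use the joint, since p(x) cancels in the normalization
    log_w = [log_joint(x, z) - log_q(z, m, s) for z in zs]
    shift = max(log_w)                                      # subtract the max for stability
    w = [math.exp(lw - shift) for lw in log_w]
    total = sum(w)
    w = [wi / total for wi in w]
    return -sum(wi * gi for wi, gi in zip(w, scores))       # step 4: weighted sum

rng = random.Random(0)
estimate = klpq_gradient(x=1.0, m=0.0, s=1.0, num_samples=100000, rng=rng)
print(estimate)   # the exact gradient here is -(x / 2 - m) / s^2 = -0.5
```

<p>Because the posterior mean of this model is <span class="math inline">\(x/2\)</span>, stepping against this gradient moves <span class="math inline">\(m\)</span> toward the posterior mean.</p>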

<h2 id="maximum-a-posteriori-estimation">Maximum a Posteriori Estimation</h2>
<p>Maximum a posteriori (MAP) estimation is a form of approximate posterior inference. It uses the mode as a point estimate of the posterior distribution, <span class="math display">\[\begin{aligned}
  \mathbf{z}_\text{MAP}
  &=
  \arg \max_\mathbf{z}
  p(\mathbf{z} \mid \mathbf{x})\\
  &=
  \arg \max_\mathbf{z}
  \log p(\mathbf{z} \mid \mathbf{x}).\end{aligned}\]</span> In practice, we work with logarithms of densities to avoid numerical underflow issues <span class="citation" data-cites="murphy2012machine">(Murphy, 2012)</span>.</p>
<p>The MAP estimate is the most likely configuration of the hidden patterns <span class="math inline">\(\mathbf{z}\)</span> under the model. However, we cannot directly solve this optimization problem because the posterior is typically intractable. To circumvent this, we use Bayes’ rule to optimize over the joint density, <span class="math display">\[\begin{aligned}
  \mathbf{z}_\text{MAP}
  &=
  \arg \max_\mathbf{z}
  \log p(\mathbf{z} \mid \mathbf{x})\\
  &=
  \arg \max_\mathbf{z}
  \log p(\mathbf{x}, \mathbf{z}).\end{aligned}\]</span> This is valid because <span class="math display">\[\begin{aligned}
  \log p(\mathbf{z} \mid \mathbf{x})
  &=
  \log p(\mathbf{x}, \mathbf{z}) - \log p(\mathbf{x})\\
  &=
  \log p(\mathbf{x}, \mathbf{z}) - \text{constant in terms of } \mathbf{z}.\end{aligned}\]</span> MAP estimation includes the common scenario of maximum likelihood estimation as a special case, <span class="math display">\[\begin{aligned}
  \mathbf{z}_\text{MAP}
  &=
  \arg \max_\mathbf{z}
  p(\mathbf{x}, \mathbf{z})\\
  &=
  \arg \max_\mathbf{z}
  p(\mathbf{x}\mid \mathbf{z}),\end{aligned}\]</span> where the prior <span class="math inline">\(p(\mathbf{z})\)</span> is flat, placing uniform probability over all values in the support of <span class="math inline">\(\mathbf{z}\)</span>. Placing a nonuniform prior can be thought of as regularizing the estimation: it pulls the estimate away from the value that maximizes the likelihood alone, which can overfit. For example, a normal prior on <span class="math inline">\(\mathbf{z}\)</span> corresponds to <span class="math inline">\(\ell_2\)</span> penalization, also known as ridge regression, and a Laplace prior corresponds to <span class="math inline">\(\ell_1\)</span> penalization, also known as the LASSO.</p>
<p>Maximum likelihood is also known as cross entropy minimization. For a data set <span class="math inline">\(\mathbf{x}=\{x_n\}\)</span>, <span class="math display">\[\begin{aligned}
  \mathbf{z}_\text{MAP}
  &=
  \arg \max_\mathbf{z}
  \log p(\mathbf{x}\mid \mathbf{z})
  \\
  &=
  \arg \max_\mathbf{z}
  \sum_{n=1}^N \log p(x_n\mid \mathbf{z})
  \\
  &=
  \arg \min_\mathbf{z}
  -\frac{1}{N}\sum_{n=1}^N \log p(x_n\mid \mathbf{z}).\end{aligned}\]</span> The last expression can be thought of as an approximation to the cross entropy between the true data distribution and <span class="math inline">\(p(\mathbf{x}\mid \mathbf{z})\)</span>, using a set of <span class="math inline">\(N\)</span> data points.</p>
<h3 id="gradient-descent">Gradient descent</h3>
<p>To find the MAP estimate of the latent variables <span class="math inline">\(\mathbf{z}\)</span>, we use the gradient of the log joint density <span class="math display">\[\begin{aligned}
  \nabla_\mathbf{z}
  \log p(\mathbf{x}, \mathbf{z})\end{aligned}\]</span> and follow it to a (local) optimum. Edward uses TensorFlow’s automatic differentiation, making this gradient computation both simple and efficient to distribute.</p>
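<p>As a sketch of this procedure outside of Edward, the snippet below runs gradient ascent on the log joint density of an assumed toy model: prior <span class="math inline">\(z \sim \text{Normal}(0, 1)\)</span> and likelihood <span class="math inline">\(x_n \mid z \sim \text{Normal}(z, 1)\)</span>. The gradient is written by hand, and the function names, step size, and data values are illustrative assumptions.</p>

```python
def grad_log_joint(z, data):
    """d/dz log p(x, z) for prior z ~ Normal(0, 1) and likelihood x_n | z ~ Normal(z, 1)."""
    return -z + sum(x_n - z for x_n in data)

def map_estimate(data, step_size=0.05, num_steps=500):
    """Follow the gradient of the log joint density from z = 0 to a local optimum."""
    z = 0.0
    for _ in range(num_steps):
        z += step_size * grad_log_joint(z, data)   # ascent step: we maximize log p(x, z)
    return z

data = [1.2, 0.7, 1.9, 1.4]
z_map = map_estimate(data)
print(z_map, sum(data) / (len(data) + 1))   # gradient ascent vs. the closed-form mode
```

<p>For this model the mode is available in closed form as <span class="math inline">\(\sum_n x_n / (N + 1)\)</span>, which is the sample mean shrunk toward the prior mean of zero; this is the regularizing effect of the prior.</p>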
<p>Edward currently does not support MAP for discrete latent variables. This amounts to discrete optimization, which is difficult.</p>

<h2 id="laplace-approximation">Laplace Approximation</h2>
<p>(This tutorial follows the <a href="/tutorials/map">Maximum a posteriori estimation</a> tutorial.)</p>
<p>Maximum a posteriori (MAP) estimation approximates the posterior <span class="math inline">\(p(\mathbf{z} \mid \mathbf{x})\)</span> with a point mass (delta function) by simply capturing its mode. MAP is attractive because it is fast and efficient. How can we use MAP to construct a better approximation to the posterior?</p>
<p>The Laplace approximation <span class="citation" data-cites="laplace1986memoir">(Laplace, 1986)</span> is one way of improving a MAP estimate. The idea is to approximate the posterior with a normal distribution centered at the MAP estimate, <span class="math display">\[\begin{aligned}
  p(\mathbf{z} \mid \mathbf{x})
  &\approx
  \text{Normal}(\mathbf{z}\;;\; \mathbf{z}_\text{MAP}, \Lambda^{-1}).\end{aligned}\]</span> This requires computing a precision matrix <span class="math inline">\(\Lambda\)</span>. Derived from a Taylor expansion, the Laplace approximation uses the Hessian of the negative log joint density at the MAP estimate. It is defined component-wise as <span class="math display">\[\begin{aligned}
  \Lambda_{ij}
  &=
  -\frac{\partial^2}{\partial z_i \partial z_j} \log p(\mathbf{x}, \mathbf{z}) \,\bigg|_{\mathbf{z} = \mathbf{z}_\text{MAP}}.\end{aligned}\]</span> For flat priors (which reduce MAP to maximum likelihood), the precision matrix is known as the observed Fisher information <span class="citation" data-cites="fisher1925theory">(Fisher, 1925)</span>. Edward uses TensorFlow’s automatic differentiation, making this second-order gradient computation both simple and efficient to distribute.</p>
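<p>A minimal sketch of the approximation for a scalar latent variable follows. We assume a prior <span class="math inline">\(z \sim \text{Normal}(0, 1)\)</span> and a likelihood <span class="math inline">\(x_n \mid z \sim \text{Bernoulli}(\text{sigmoid}(z))\)</span>, find the mode with Newton steps, and read off the precision as the second derivative of the negative log joint at that mode. The derivatives are written by hand, and the function names and data are illustrative assumptions.</p>

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def neg_log_joint(z, data):
    """-log p(x, z) up to a constant, for prior z ~ Normal(0, 1) and
    likelihood x_n | z ~ Bernoulli(sigmoid(z))."""
    log_lik = sum(math.log(sigmoid(z)) if x_n == 1 else math.log(1.0 - sigmoid(z))
                  for x_n in data)
    return 0.5 * z ** 2 - log_lik

def laplace_approximation(data, num_steps=50):
    """Return (z_map, precision) of the normal approximation to the posterior."""
    z = 0.0
    for _ in range(num_steps):
        p = sigmoid(z)
        gradient = z - sum(x_n - p for x_n in data)    # first derivative of -log p(x, z)
        curvature = 1.0 + len(data) * p * (1.0 - p)    # second derivative of -log p(x, z)
        z -= gradient / curvature                      # Newton step toward the mode
    p = sigmoid(z)
    return z, 1.0 + len(data) * p * (1.0 - p)          # precision evaluated at the mode

data = [1, 1, 0, 1, 1, 1, 0, 1]
z_map, precision = laplace_approximation(data)
print(z_map, precision)   # approximate posterior: Normal(z_map, 1 / precision)
```

<p>Unlike the MAP point estimate alone, the resulting normal distribution carries a variance <span class="math inline">\(1/\Lambda\)</span> that shrinks as more data are observed.</p>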

<h2 id="model-criticism">Model Criticism</h2>
<p>We can never validate whether a model is true. In practice, “all models are wrong” <span class="citation" data-cites="box1976science">(Box, 1976)</span>. However, we can try to uncover where the model goes wrong. Model criticism helps justify the model as an approximation or point to good directions for revising the model.</p>
<p>Model criticism typically analyzes the posterior predictive distribution, <span class="math display">\[\begin{aligned}
  p(\mathbf{x}_\text{new} \mid \mathbf{x})
  &=
  \int
  p(\mathbf{x}_\text{new} \mid \mathbf{z})
  p(\mathbf{z} \mid \mathbf{x})
  \text{d} \mathbf{z}.\end{aligned}\]</span> The model’s posterior predictive can be used to generate new data given past observations and can also make predictions on new data given past observations. It is formed by calculating the likelihood of the new data, averaged over every set of latent variables according to the posterior distribution.</p>
<p>A helpful utility function to form the posterior predictive is <code>copy</code>. For example, assume the model defines a likelihood <code>x</code> connected to a prior <code>z</code>. The posterior predictive distribution is</p>
<pre class="python" data-language="Python"><code>x_post = ed.copy(x, {z: qz})</code></pre>
<p>Here, we copy the likelihood node <code>x</code> in the graph and replace dependence on the prior <code>z</code> with dependence on the inferred posterior <code>qz</code>. We describe several techniques for model criticism.</p>
<h3 id="point-evaluation">Point Evaluation</h3>
<p>A point evaluation is a scalar-valued metric for assessing trained models <span class="citation" data-cites="winkler1994evaluating gneiting2007strictly">(Gneiting &amp; Raftery, 2007; Winkler, 1994)</span>. For example, we can assess models for classification by predicting the label for each observation in the data and comparing it to their true labels. Edward implements a variety of metrics, such as classification error and mean absolute error.</p>
<p>The <code>ed.evaluate()</code> method takes as input a set of metrics to evaluate, and a data dictionary. As with inference, the data dictionary binds the observed random variables in the model to realizations: in this case, it is the posterior predictive random variable of outputs <code>y_post</code> to <code>y_train</code> and a placeholder for inputs <code>x</code> to <code>x_train</code>.</p>
<pre class="python" data-language="Python"><code>ed.evaluate(&#39;categorical_accuracy&#39;, data={y_post: y_train, x: x_train})
ed.evaluate(&#39;mean_absolute_error&#39;, data={y_post: y_train, x: x_train})</code></pre>
<p>Point evaluation also applies to unsupervised tasks. For example, we can evaluate the likelihood of observing the data.</p>
<pre class="python" data-language="Python"><code>ed.evaluate(&#39;log_likelihood&#39;, data={x_post: x_train})</code></pre>
<p>It is common practice to criticize models with data held-out from training. To do this, we must first perform inference over any local latent variables of the held-out data, fixing the global variables; we demonstrate this below. Then we make predictions on the held-out data.</p>
<pre class="python" data-language="Python"><code>import edward as ed
import tensorflow as tf
from edward.models import Categorical

# create local posterior factors for test data, assuming test data
# has N_test many data points and K categories
qz_test = Categorical(logits=tf.Variable(tf.zeros([N_test, K])))

# run local inference conditional on global factors
# (ed.Inference stands in for a concrete algorithm, e.g. ed.KLqp)
inference_test = ed.Inference({z: qz_test}, data={x: x_test, beta: qbeta})
inference_test.run()

# build posterior predictive on test data
x_post = ed.copy(x, {z: qz_test, beta: qbeta})
ed.evaluate(&#39;log_likelihood&#39;, data={x_post: x_test})</code></pre>
<p>Point evaluations are formally known as scoring rules in decision theory. Scoring rules are useful for model comparison, model selection, and model averaging.</p>
<p>See the <a href="/api/criticism">criticism API</a> for further details. An example of point evaluation is in the <a href="/tutorials/supervised-regression">supervised learning (regression)</a> tutorial.</p>
<h3 id="posterior-predictive-checks">Posterior predictive checks</h3>
<p>Posterior predictive checks (PPCs) analyze the degree to which data generated from the model deviate from data generated from the true distribution. They can be used either numerically to quantify this degree, or graphically to visualize this degree. PPCs can be thought of as a probabilistic generalization of point evaluation <span class="citation" data-cites="box1980sampling rubin1984bayesianly meng1994posterior gelman1996posterior">(Box, 1980; Gelman, Meng, &amp; Stern, 1996; Meng, 1994; Rubin, 1984)</span>.</p>
<p>The simplest PPC works by applying a test statistic on new data generated from the posterior predictive, such as <span class="math inline">\(T(\mathbf{x}_\text{new}) = \max(\mathbf{x}_\text{new})\)</span>. Applying <span class="math inline">\(T(\mathbf{x}_\text{new})\)</span> to new data over many data replications induces a distribution. We compare this distribution to the test statistic on the real data <span class="math inline">\(T(\mathbf{x})\)</span>.</p>

<p>In the figure, <span class="math inline">\(T(\mathbf{x})\)</span> falls in a low probability region of this reference distribution: if the model were true, the probability of observing the test statistic is very low. This indicates that the model fits the data poorly according to this check; this suggests an area of improvement for the model.</p>
<p>More generally, the test statistic can be a function of the model’s latent variables <span class="math inline">\(T(\mathbf{x}, \mathbf{z})\)</span>, known as a discrepancy function. Examples of discrepancy functions are the metrics used for point evaluation. We can now interpret the point evaluation as a special case of PPCs: it simply calculates <span class="math inline">\(T(\mathbf{x}, \mathbf{z})\)</span> over the real data and without a reference distribution in mind. A reference distribution allows us to make probabilistic statements about the point, in reference to an overall distribution.</p>
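<p>To show the mechanics of a PPC without Edward, here is a sketch in plain Python. It assumes a prior <span class="math inline">\(z \sim \text{Normal}(0, 1)\)</span> and a likelihood <span class="math inline">\(x_n \mid z \sim \text{Normal}(z, 1)\)</span>, for which the posterior is known in closed form, and uses the maximum as the test statistic. The data values and function name are made up for illustration.</p>

```python
import math
import random

def posterior_predictive_check(data, statistic, num_replications, rng):
    """Return T(x) and the replicated statistics T(x_new) for prior z ~ Normal(0, 1)
    and likelihood x_n | z ~ Normal(z, 1), whose posterior is known in closed form."""
    n = len(data)
    post_mean = sum(data) / (n + 1.0)
    post_std = math.sqrt(1.0 / (n + 1.0))
    replicated = []
    for _ in range(num_replications):
        z = rng.gauss(post_mean, post_std)              # z ~ p(z | x)
        x_new = [rng.gauss(z, 1.0) for _ in range(n)]   # x_new ~ p(x_new | z)
        replicated.append(statistic(x_new))
    return statistic(data), replicated

rng = random.Random(0)
data = [0.3, -0.8, 1.1, 0.2, 6.5, -0.4, 0.9, -1.2]   # made-up data with one outlier
observed, replicated = posterior_predictive_check(data, max, 2000, rng)
tail = sum(1 for t in replicated if t >= observed) / float(len(replicated))
print(observed, tail)   # a small tail fraction means T(x) sits far in the tail
```

<p>The list <code>replicated</code> plays the role of the reference distribution, and <code>tail</code> summarizes where the observed statistic falls within it. Here the outlier makes the observed maximum far larger than the model tends to replicate.</p>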
<p>The <code>ed.ppc()</code> method provides a scaffold for studying various discrepancy functions. It takes as input a user-defined discrepancy function, and a data dictionary.</p>
<pre class="python" data-language="Python"><code>ed.ppc(lambda xs, zs: tf.reduce_mean(xs[x_post]), data={x_post: x_train})</code></pre>
<p>The discrepancy can also take latent variables as input, which we pass into the PPC.</p>
<pre class="python" data-language="Python"><code>ed.ppc(lambda xs, zs: tf.reduce_max(zs[z]),
       data={y_post: y_train, x_ph: x_train},
       latent_vars={z: qz, beta: qbeta})</code></pre>
<p>See the <a href="/api/criticism">criticism API</a> for further details.</p>
<p>PPCs are an excellent tool for revising models—simplifying or expanding the current model as one examines its fit to data. They are inspired by classical hypothesis testing; these methods criticize models under the frequentist perspective of large sample assessment.</p>
<p>PPCs can also be applied to tasks such as hypothesis testing, model comparison, model selection, and model averaging. It’s important to note that while PPCs can be applied as a form of Bayesian hypothesis testing, hypothesis testing is generally not recommended: binary decision making from a single test is not as common a use case as one might believe. We recommend performing many PPCs to get a holistic understanding of the model fit.</p>