<p>While there are many potential themes of probabilistic models we might explore, we'll herein focus on two: <strong>generative vs. discriminative models</strong>, and <strong>"fully Bayesian" vs. "lowly point estimate" learning</strong>. We will stick to the supervised setting as well.</p>
<p>Finally, our pool ring is not a godhead — we are not nautical missionaries brandishing a divine statistical truth, demanding that each model we encounter implement this truth in a rigid, bottom-up fashion. Instead, we'll explore the unique goals, formulations and shortcomings of each, and fall back on Bayes' theorem to bridge the gaps between. Without it, we'd quickly start sinking.</p>
<h1>Discriminative vs. generative models</h1>
<p>The goal of a supervised model is to compute the distribution over outcomes <span class="math">\(y\)</span> given an input <span class="math">\(x\)</span>, written <span class="math">\(P(y\vert x)\)</span>. If <span class="math">\(y\)</span> is discrete, this distribution is a probability mass function, e.g. a multinomial or binomial distribution. If continuous, it is a probability density function, e.g. a Gaussian distribution.</p>
<h2>Discriminative models</h2>
<p>In discriminative models, we immediately direct our focus to this output distribution. Taking an example from the <a href="https://cavaunpeu.github.io/2017/05/18/minimizing_the_negative_log_likelihood_in_english/">previous post</a>, let's assume a softmax regression which receives some data <span class="math">\(x\)</span> and predicts a multi-class label <code>red or green or blue</code>. The model's output distribution is therefore multinomial; a multinomial distribution requires as a parameter a vector <span class="math">\(\pi\)</span> of respective outcome probabilities, e.g. <code>{red: .27, green: .11, blue: .62}</code>. We can compute these individual probabilities via the softmax function, where:</p>
<ul>
<li><span class="math">\(\pi_k = \frac{e^{\eta_k}}{\sum\limits_{j=1}^K e^{\eta_j}}\)</span></li>
<li><span class="math">\(\eta_k = \theta_k^Tx\)</span></li>
<li><span class="math">\(\theta\)</span> is a matrix of weights which we must infer, and <span class="math">\(x\)</span> is our input.</li>
</ul>
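<p>As a minimal sketch of the three bullets above, here is the softmax computation in plain Python. The weight matrix <code>theta</code> and the input <code>x</code> are made-up values for illustration only; nothing here is fitted to data.</p>

```python
import math


def linear_scores(theta, x):
    """Compute eta_k = theta_k^T x for each row theta_k of the weight matrix."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in theta]


def softmax(eta):
    """Map real-valued scores eta_k to outcome probabilities pi_k."""
    # Subtracting the max does not change the result, but avoids overflow in exp.
    shift = max(eta)
    exps = [math.exp(e - shift) for e in eta]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical weights (one row per class: red, green, blue) and a 2-feature input.
theta = [[0.5, -1.0], [0.1, 0.3], [-0.2, 0.8]]
x = [1.0, 2.0]

pi = softmax(linear_scores(theta, x))
print(dict(zip(["red", "green", "blue"], pi)))
```

<p>The resulting vector <code>pi</code> is non-negative and sums to 1, which is exactly what the multinomial distribution requires of its parameter.</p>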
<h3>Inference</h3>
<p>Typically, we perform inference by taking the <em>maximum likelihood estimate</em>: "which parameters <span class="math">\(\theta\)</span> most likely gave rise to the observed data pairs <span class="math">\(D = ((x^{(1)}, y^{(1)}), ..., (x^{(m)}, y^{(m)}))\)</span> via the relationships described above?" We compute this estimate by maximizing the log-likelihood function with respect to <span class="math">\(\theta\)</span>, or equivalently by minimizing the negative log-likelihood, the latter better known as a "loss function" in machine learning parlance.</p>
<p>Unfortunately, the maximum likelihood estimate includes no information about the plausibility of the chosen parameter value itself. As such, we often place a <em>prior</em> on our parameter and take the <a href="https://en.wikipedia.org/wiki/Arg_max">"argmax"</a> over their product. This gives the <em>maximum a posteriori</em> estimate, or MAP.</p>
<div class="math">$$
\begin{align*}
\theta_{MAP}
&= \underset{\theta}{\arg\max}\ \log \bigg(P(\theta)\prod\limits_{i=1}^{m} P(y^{(i)}\vert x^{(i)}; \theta)\bigg)\\
&= \underset{\theta}{\arg\max}\ \bigg(\sum\limits_{i=1}^{m} \log{P(y^{(i)}\vert x^{(i)}; \theta)}\bigg) + \log{P(\theta)}\\
\end{align*}
$$</div>
<p>The <span class="math">\(\log{P(\theta)}\)</span> term can be easily rearranged into what is better known as a <em>regularization term</em> in machine learning, where the type of prior distribution we place on <span class="math">\(\theta\)</span> gives the type of regularization term.</p>
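<p>To make this concrete, here is a short worked example under one assumed prior: a zero-mean Gaussian with a shared variance <span class="math">\(\sigma^2\)</span> on every weight. The choice of prior and the symbol <span class="math">\(\sigma^2\)</span> are assumptions for illustration. Taking its log and dropping terms that do not depend on <span class="math">\(\theta\)</span> gives the familiar L2 penalty:</p>
<div class="math">$$
\begin{align*}
P(\theta) &= \mathcal{N}(\theta; 0, \sigma^2 I)\\
\log{P(\theta)} &= -\frac{1}{2\sigma^2}\Vert\theta\Vert_2^2 + \text{const}\\
\theta_{MAP} &= \underset{\theta}{\arg\min}\ -\sum\limits_{i=1}^{m} \log{P(y^{(i)}\vert x^{(i)}; \theta)} + \frac{1}{2\sigma^2}\Vert\theta\Vert_2^2
\end{align*}
$$</div>
<p>A tighter prior (smaller <span class="math">\(\sigma^2\)</span>) means a stronger penalty. Swapping in a Laplace prior would instead yield a penalty proportional to <span class="math">\(\Vert\theta\Vert_1\)</span>, i.e. L1 regularization.</p>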
<p>The argmax finds the point(s) <span class="math">\(\theta\)</span> at which the given function attains its maximum value. As such, the typical discriminative model — softmax regression, logistic regression, linear regression, etc. — returns a single, lowly point estimate for the parameter in question.</p>
<h3>How do we compute this value?</h3>
<p>In the trivial case where <span class="math">\(\theta\)</span> is 1-dimensional, we can take the derivative of the function in question with respect to <span class="math">\(\theta\)</span>, set it equal to 0, then solve for <span class="math">\(\theta\)</span>. (Additionally, in order to verify that we have indeed obtained a maximum, we should compute a second derivative and assert that its value is negative.)</p>
<p>In the more realistic case where <span class="math">\(\theta\)</span> is a high-dimensional vector or matrix, we can compute the argmax by way of an optimization routine like stochastic gradient ascent or, as is more common, the argmin by way of stochastic gradient descent.</p>
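<p>Both routes can be sketched on a toy problem: a single Bernoulli parameter, fit to 7 successes in 10 trials. The closed-form answer comes from setting the derivative to zero; the iterative answer comes from plain (non-stochastic) gradient ascent. Optimizing over the logit <code>z</code> rather than the probability itself is a design choice made here to keep the estimate inside <span class="math">\((0, 1)\)</span>; the learning rate and step count are arbitrary.</p>

```python
import math


def log_likelihood(p, heads, n):
    """Bernoulli log-likelihood of `heads` successes in `n` trials."""
    return heads * math.log(p) + (n - heads) * math.log(1 - p)


def fit_by_gradient_ascent(heads, n, lr=0.1, steps=500):
    """Maximize the log-likelihood over z, where p = sigmoid(z)."""
    z = 0.0
    for _ in range(steps):
        p = 1 / (1 + math.exp(-z))
        grad = heads - n * p  # derivative of the log-likelihood w.r.t. z
        z += lr * grad        # ascent: step in the direction of the gradient
    return 1 / (1 + math.exp(-z))


# Closed form: d/dp log L = heads/p - (n - heads)/(1 - p) = 0  =>  p = heads / n
p_closed_form = 7 / 10
p_ascent = fit_by_gradient_ascent(heads=7, n=10)
print(p_closed_form, p_ascent)
```

<p>In a real model we'd flip the sign and hand the negative log-likelihood to a stochastic gradient descent routine, but the idea is the same.</p>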
<h3>What if we're uncertain about our parameter estimates?</h3>
<p>Consider the following three scenarios — taken from Daphne Koller's <a href="https://www.coursera.org/learn/probabilistic-graphical-models-3-learning/home/welcome">Learning in Probabilistic Graphical Models</a>.</p>
<blockquote>
<p>Two teams play 10 times, and the first wins 7 of the 10 matches.</p>
</blockquote>
<p>&gt; <em>Infer that the probability of the first team winning is 0.7.</em></p>
<p>Seems reasonable, right?</p>
<blockquote>
<p>A coin is tossed 10 times, and comes out <code>heads</code> on 7 of the 10 tosses.</p>
</blockquote>
<p>&gt; <em>Infer that the probability of observing <code>heads</code> is 0.7.</em></p>
<p>Changing only the analogy, this now seems wholly unreasonable — right?</p>
<blockquote>
<p>A coin is tossed 10000 times, and comes out <code>heads</code> on 7000 of the 10000 tosses.</p>
</blockquote>
<p>&gt; <em>Infer that the probability of observing <code>heads</code> is 0.7.</em></p>
<p>Finally, increasing the observed counts, the previous scenario now seems plausible.</p>
<p>I find this a terrific succession of examples with which to convey the notion of <em>uncertainty</em> — that the more data we have, the less uncertain we are about what's really going on. This notion is at the heart of Bayesian statistics and is extremely intuitive to us as humans. Unfortunately, when we compute "lowly point estimates," i.e. the argmin of the loss function with respect to our parameters <span class="math">\(\theta\)</span>, we are discarding this uncertainty entirely. Should our model be fit with <span class="math">\(n\)</span> observations where <span class="math">\(n\)</span> is not a large number, our estimate would amount to that of Example #2: <em>a coin is tossed <span class="math">\(n\)</span> times, and comes out <code>heads</code> on <code>int(.7n)</code> of <code>n</code> tosses — infer that the probability of observing <code>heads</code> is squarely, unflinchingly, <code>0.7</code>.</em></p>
<h3>What does including uncertainty look like?</h3>
<p>It looks like a <em>distribution</em> — a range of possible values for <span class="math">\(\theta\)</span>. Further, these values are of varying plausibility as dictated by the data we've observed. In Example #2, while we'd still say that <span class="math">\(\Pr(\text{heads}) = .7\)</span> is the parameter value <em>most likely</em> to have generated our data, we'd additionally maintain that other values in <span class="math">\((0, 1)\)</span> are plausible, albeit less so, as well. Again, this logic should be simple to grasp: it comes easy to us as humans.</p>
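<p>For the coin examples this distribution can be written down exactly, if we assume a uniform prior on <span class="math">\(\Pr(\text{heads})\)</span>: the posterior is then a Beta distribution. The uniform prior is an assumption made for this sketch, not something the examples themselves dictate.</p>

```python
import math


def beta_posterior_summary(heads, tosses, prior_a=1.0, prior_b=1.0):
    """Summarize the Beta posterior over Pr(heads) under a Beta(prior_a, prior_b) prior."""
    a = prior_a + heads
    b = prior_b + (tosses - heads)
    mean = a / (a + b)
    mode = (a - 1) / (a + b - 2)  # most plausible single value (valid for a, b > 1)
    sd = math.sqrt(a * b / ((a + b) ** 2 * (a + b + 1)))
    return mean, mode, sd


small = beta_posterior_summary(heads=7, tosses=10)
large = beta_posterior_summary(heads=7000, tosses=10000)
print("10 tosses:    mode=%.3f sd=%.4f" % (small[1], small[2]))
print("10000 tosses: mode=%.3f sd=%.4f" % (large[1], large[2]))
```

<p>Both posteriors peak at <code>0.7</code>, but the spread differs enormously: a standard deviation of roughly 0.13 after 10 tosses versus roughly 0.005 after 10000. The point estimate is identical; the uncertainty is not.</p>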
<h3>Prediction</h3>
<p>With the parameter <span class="math">\(\theta\)</span> in hand, prediction is simple: just plug it back into our original function <span class="math">\(P(y\vert x)\)</span>. With a point estimate for <span class="math">\(\theta\)</span>, we compute but a single value for <span class="math">\(y\)</span>.</p>
<h2>Generative models</h2>
<p>In generative models, we compute <em>component parts</em> of the desired output distribution <span class="math">\(P(y\vert x)\)</span> instead of directly computing <span class="math">\(P(y\vert x)\)</span> itself. To examine these parts, we'll turn to Bayes' theorem:</p>
<div class="math">$$
P(y\vert x) = \frac{P(x\vert y)P(y)}{P(x)}
$$</div>
<p>The numerator posits a generative mechanism for the observed data pairs <span class="math">\(D = ((x^{(1)}, y^{(1)}), ..., (x^{(m)}, y^{(m)}))\)</span> in idiomatic terms; it states that each pair was generated by:</p>
<ol>
<li>Selecting a label <span class="math">\(y^{(i)}\)</span> from <span class="math">\(P(y)\)</span>. If our model is predicting <code>red or green or blue</code>, <span class="math">\(P(y)\)</span> is likely a multinomial distribution.<ul>
<li>If our observed label counts are <code>{'red': 20, 'green': 50, 'blue': 30}</code>, we would retrodictively believe this multinomial distribution to have had a parameter vector near <span class="math">\(\pi = [.2, .5, .3]\)</span>.</li>
</ul>
</li>
<li>Given a label <span class="math">\(y^{(i)}\)</span>, select a value <span class="math">\(x^{(i)}\)</span> from <span class="math">\(P(x\vert y)\)</span>. Trivially, this means that we are positing <em>three distinct distributions</em> of this form: <span class="math">\(P(x\vert y=\text{red}), P(x\vert y=\text{green}), P(x\vert y=\text{blue})\)</span>.<ul>
<li>For example, if <span class="math">\(y^{(i)} = \text{red}\)</span>, draw <span class="math">\(x^{(i)}\)</span> from <span class="math">\(P(x\vert y=\text{red})\)</span>, and so forth.</li>
</ul>
</li>
</ol>
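<p>The two steps above can be sketched as a tiny sampler. The label probabilities follow the <span class="math">\(\pi = [.2, .5, .3]\)</span> example; the class-conditional distributions are assumed to be one-dimensional Gaussians with made-up means and standard deviations, purely for illustration.</p>

```python
import random

rng = random.Random(0)  # seeded so the sketch is reproducible

labels = ["red", "green", "blue"]
pi = [0.2, 0.5, 0.3]  # P(y)

# Hypothetical P(x | y): a (mean, standard deviation) pair per label.
class_conditionals = {"red": (-2.0, 1.0), "green": (0.0, 1.0), "blue": (3.0, 1.0)}


def sample_pair():
    """Generate one (x, y) pair: first draw y from P(y), then x from P(x | y)."""
    y = rng.choices(labels, weights=pi, k=1)[0]
    mean, std = class_conditionals[y]
    x = rng.gauss(mean, std)
    return x, y


data = [sample_pair() for _ in range(1000)]
counts = {label: sum(1 for _, y in data if y == label) for label in labels}
print(counts)
```

<p>The label counts land near 200, 500 and 300, which is the "retrodictive" reasoning above run in the forward direction.</p>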
<h3>Inference</h3>
<p>The inference task is to compute <span class="math">\(P(y)\)</span> and each distinct <span class="math">\(P(x\vert y_k)\)</span>. In a classification setting, the former is likely a multinomial distribution. The latter might be a multinomial distribution or a set of binomial distributions in the case of discrete-feature data, or a set of Gaussian distributions in the case of continuous-feature data. In fact, these distributions can be whatever you'd like, dictated by the idiosyncrasies of the problem at hand.</p>
<p>Finally, we can compute these distributions as per normal: via a maximum likelihood estimate, a MAP estimate, etc.</p>
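<p>As a rough sketch of the maximum likelihood route, assuming one continuous feature and Gaussian class-conditionals: <span class="math">\(P(y)\)</span> is estimated from label frequencies, and each <span class="math">\(P(x\vert y_k)\)</span> from the mean and variance of the inputs carrying that label. The data pairs below are invented for the example.</p>

```python
from collections import defaultdict


def fit_generative(pairs):
    """Maximum likelihood estimates of P(y) and Gaussian P(x | y) from (x, y) pairs."""
    by_label = defaultdict(list)
    for x, y in pairs:
        by_label[y].append(x)

    m = len(pairs)
    prior = {y: len(xs) / m for y, xs in by_label.items()}

    conditionals = {}
    for y, xs in by_label.items():
        mean = sum(xs) / len(xs)
        # The MLE of the variance divides by n, not n - 1.
        var = sum((x - mean) ** 2 for x in xs) / len(xs)
        conditionals[y] = (mean, var)
    return prior, conditionals


# Hypothetical observations.
pairs = [
    (-2.1, "red"), (-1.9, "red"),
    (0.2, "green"), (-0.2, "green"), (0.3, "green"), (-0.3, "green"),
    (2.5, "blue"), (3.5, "blue"),
]
prior, conditionals = fit_generative(pairs)
print(prior)
print(conditionals)
```

<p>A MAP estimate would follow the same shape, with pseudo-counts or prior terms added to these statistics.</p>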
<h3>Prediction</h3>
<p>To compute <span class="math">\(P(y\vert x)\)</span> we return to Bayes' theorem:</p>
<div class="math">$$
P(y\vert x) = \frac{P(x\vert y)P(y)}{P(x)}
$$</div>
<p>We have both terms of the numerator in hand: <span class="math">\(P(y)\)</span> and the three distinct conditional distributions <span class="math">\(P(x\vert y=\text{red}), P(x\vert y=\text{green})\)</span> and <span class="math">\(P(x\vert y=\text{blue})\)</span>. What about the denominator?</p>
<h3>Conditional probability and marginalization</h3>
<p>The axiom of conditional probability allows us to write <span class="math">\(P(B\vert A)P(A) = P(B, A)\)</span>, i.e. the <em>joint probability</em> of <span class="math">\(B\)</span> and <span class="math">\(A\)</span>. This is a simple algebraic manipulation. As such, we can rewrite Bayes' theorem in its more compact form.</p>
<div class="math">$$
P(y\vert x) = \frac{P(x, y)}{P(x)}
$$</div>
<p>Another manipulation of probability distributions is the <em>marginalization</em> operator, which allows us to write:</p>
<div class="math">$$
\int P(x, y)dy = P(x)
$$</div>
<p>As such, we can <em>marginalize <span class="math">\(y\)</span> out of the numerator</em> so as to obtain the denominator we require. This denominator is often called the "evidence."</p>
<h3>Marginalization example</h3>
<p>Marginalization took me a while to understand. Imagine we have the following joint probability distribution out of which we'd like to marginalize <span class="math">\(A\)</span>.</p>
<table class="table-striped table-hover table">
<thead>
<tr>
<th><span class="math">\(A\)</span></th>
<th><span class="math">\(B\)</span></th>
<th><span class="math">\(p\)</span></th>
</tr>
</thead>
<tbody>
<tr>
<td><span class="math">\(a^1\)</span></td>
<td><span class="math">\(b^7\)</span></td>
<td><span class="math">\(.03\)</span></td>
</tr>
<tr>
<td><span class="math">\(a^2\)</span></td>
<td><span class="math">\(b^7\)</span></td>
<td><span class="math">\(.14\)</span></td>
</tr>
<tr>
<td><span class="math">\(a^3\)</span></td>
<td><span class="math">\(b^7\)</span></td>
<td><span class="math">\(.09\)</span></td>
</tr>
<tr>
<td><span class="math">\(a^1\)</span></td>
<td><span class="math">\(b^8\)</span></td>
<td><span class="math">\(.34\)</span></td>
</tr>
<tr>
<td><span class="math">\(a^2\)</span></td>
<td><span class="math">\(b^8\)</span></td>
<td><span class="math">\(.23\)</span></td>
</tr>
<tr>
<td><span class="math">\(a^3\)</span></td>
<td><span class="math">\(b^8\)</span></td>
<td><span class="math">\(.17\)</span></td>
</tr>
</tbody>
</table>
<p>The result of this marginalization is <span class="math">\(P(B)\)</span>, i.e. "what is the probability of observing each of the distinct values of <span class="math">\(B\)</span>?" In this example there are two: <span class="math">\(b^7\)</span> and <span class="math">\(b^8\)</span>. To marginalize over <span class="math">\(A\)</span>, we simply:</p>
<ol>
<li>Delete the <span class="math">\(A\)</span> column.</li>
<li>"Collapse" the remaining columns (in this case, <span class="math">\(B\)</span>) by summing the probabilities of identical rows.</li>
</ol>
<p>Step 1 gives:</p>
<table class="table-striped table-hover table">
<thead>
<tr>
<th><span class="math">\(B\)</span></th>
<th><span class="math">\(p\)</span></th>
</tr>
</thead>
<tbody>
<tr>
<td><span class="math">\(b^7\)</span></td>
<td><span class="math">\(.03\)</span></td>
</tr>
<tr>
<td><span class="math">\(b^7\)</span></td>
<td><span class="math">\(.14\)</span></td>
</tr>
<tr>
<td><span class="math">\(b^7\)</span></td>
<td><span class="math">\(.09\)</span></td>
</tr>
<tr>
<td><span class="math">\(b^8\)</span></td>
<td><span class="math">\(.34\)</span></td>
</tr>
<tr>
<td><span class="math">\(b^8\)</span></td>
<td><span class="math">\(.23\)</span></td>
</tr>
<tr>
<td><span class="math">\(b^8\)</span></td>
<td><span class="math">\(.17\)</span></td>
</tr>
</tbody>
</table>
<p>Step 2 gives:</p>
<table class="table-striped table-hover table">
<thead>
<tr>
<th><span class="math">\(B\)</span></th>
<th><span class="math">\(p\)</span></th>
</tr>
</thead>
<tbody>
<tr>
<td><span class="math">\(b^7\)</span></td>
<td><span class="math">\(.03 + .14 + .09 = .26\)</span></td>
</tr>
<tr>
<td><span class="math">\(b^8\)</span></td>
<td><span class="math">\(.34 + .23 + .17 = .74\)</span></td>
</tr>
</tbody>
</table>
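<p>In code, "delete and collapse" is a group-by-and-sum. The sketch below uses a separate, hypothetical joint distribution (its values are not those of the table above) stored as a dictionary from value configurations to probabilities.</p>

```python
from collections import defaultdict

# A hypothetical joint distribution P(A, B): (a, b) -> probability.
joint = {
    ("a1", "b1"): 0.10, ("a2", "b1"): 0.25, ("a3", "b1"): 0.05,
    ("a1", "b2"): 0.20, ("a2", "b2"): 0.15, ("a3", "b2"): 0.25,
}


def marginalize_out_first(joint):
    """Sum out the first variable of a two-variable joint distribution."""
    marginal = defaultdict(float)
    for (_, b), p in joint.items():  # "delete" A by ignoring it ...
        marginal[b] += p             # ... and "collapse" rows that share a value of B
    return dict(marginal)


p_b = marginalize_out_first(joint)
print(p_b)
```

<p>The result is a distribution over the values of <span class="math">\(B\)</span> alone, and it still sums to 1.</p>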
<h3>The denominator</h3>
<p>In the context of our generative model with a given input <span class="math">\(x\)</span>, the result of this marginalization is a <em>scalar</em> — not a distribution. To see why, let's construct the joint distribution — the numerator — then marginalize:</p>
<p><span class="math">\(P(x, y)\)</span>:</p>
<table class="table-striped table-hover table">
<thead>
<tr>
<th><span class="math">\(y\)</span></th>
<th><span class="math">\(X\)</span></th>
<th><span class="math">\(P(y, X)\)</span></th>
</tr>
</thead>
<tbody>
<tr>
<td><span class="math">\(\text{red}\)</span></td>
<td><span class="math">\(x\)</span></td>
<td><span class="math">\(P(y = \text{red}, x)\)</span></td>
</tr>
<tr>
<td><span class="math">\(\text{green}\)</span></td>
<td><span class="math">\(x\)</span></td>
<td><span class="math">\(P(y = \text{green}, x)\)</span></td>
</tr>
<tr>
<td><span class="math">\(\text{blue}\)</span></td>
<td><span class="math">\(x\)</span></td>
<td><span class="math">\(P(y = \text{blue}, x)\)</span></td>
</tr>
</tbody>
</table>
<p><span class="math">\(\int P(x, y)dy = P(x)\)</span>:</p>
<table class="table-striped table-hover table">
<thead>
<tr>
<th><span class="math">\(X\)</span></th>
<th><span class="math">\(P(X)\)</span></th>
</tr>
</thead>
<tbody>
<tr>
<td><span class="math">\(x\)</span></td>
<td><span class="math">\(P(y = \text{red}, x) + P(y = \text{green}, x) + P(y = \text{blue}, x)\)</span></td>
</tr>
</tbody>
</table>
<p>The resulting probability distribution is over a single value: it is a scalar. This scalar <em>normalizes</em> the respective numerator terms such that:</p>
<div class="math">$$
\frac{P(y = \text{red}, x)}{P(x)} +
\frac{P(y = \text{green}, x)}{P(x)} +
\frac{P(y = \text{blue}, x)}{P(x)}
= 1
$$</div>
<p>This gives <span class="math">\(P(y\vert x)\)</span>: a valid probability distribution over the class labels <span class="math">\(y\)</span>.</p>
<h3>Partition function</h3>
<p><span class="math">\(P(x)\)</span> often takes another name and even another variable: <span class="math">\(Z\)</span>, the <em>partition function</em>. The stated purpose of this function is to normalize the numerator such that the above summation-to-1 holds. This normalization is necessary because the numerators typically will not sum to 1 themselves, which follows logically from the fact that:</p>
<div class="math">$$
\begin{align*}
\sum\limits_{k = 1}^K P(y = k) = 1 \tag{1}
\end{align*}
$$</div>
<div class="math">$$
\begin{align*}
P(x\vert y = k) \neq 1 \tag{2}
\end{align*}
$$</div>
<p>Since <span class="math">\((1)\)</span> is always true, the numerators would sum to 1 only if the "<span class="math">\(\neq\)</span>" in <span class="math">\((2)\)</span> became an "<span class="math">\(=\)</span>" for every <span class="math">\(k\)</span>, such that:</p>
<div class="math">$$
\sum\limits_{k = 1}^K P(y = k)P(x\vert y = k) = 1
$$</div>
<p>Unfortunately, <span class="math">\(P(x\vert y = k) = 1\)</span> is rarely if ever the case.</p>
<p>As you'll now note, the <span class="math">\(x\)</span>-specific partition function gives a result equivalent to that of the marginalized-over-<span class="math">\(y\)</span> joint distribution: a scalar value <span class="math">\(P(x)\)</span> with which to normalize the numerator. However, crucially, please keep in mind:</p>
<ul>
<li><em>The partition function is a specific component of a probabilistic model. It always yields a scalar</em>.</li>
<li><em>Marginalization is a <strong>much more general</strong> operation performed on a probability distribution, which yields a scalar only when the remaining variable(s) are homogeneous, i.e. each remaining column contains a single distinct value.</em></li>
<li>In the majority of cases, marginalization will simply yield a reduced probability distribution over many value configurations, similar to the <span class="math">\(P(B)\)</span> example above.</li>
</ul>
<h3>In practice, this is superfluous</h3>
<p>If we neglect to compute <span class="math">\(P(x)\)</span>, i.e. if we don't normalize our joint distributions <span class="math">\(P(x, y = k)\)</span>, we'll be left with an invalid probability distribution <span class="math">\(\tilde{P}(y\vert x)\)</span> whose values do not sum to 1. This distribution might look like <code>P(y|x) = {'red': .00047, 'green': .0011, 'blue': .0000853}</code>. <em>If our goal is to simply compute the most likely label, taking the argmax of this unnormalized distribution works just fine.</em> This follows trivially from our Bayesian pool ring:</p>
<div class="math">$$
\underset{y}{\arg\max}\ \frac{P(x, y)}{P(x)} = \underset{y}{\arg\max}\ P(x, y)
$$</div>
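<p>A quick check of this claim, using the unnormalized values quoted above and treating them as the joint probabilities <span class="math">\(P(x, y = k)\)</span> for one fixed input <span class="math">\(x\)</span>:</p>

```python
# Unnormalized values P(x, y = k) for a single, fixed input x.
unnormalized = {"red": 0.00047, "green": 0.0011, "blue": 0.0000853}

# The evidence P(x): marginalize y out of the joint.
evidence = sum(unnormalized.values())

# Dividing by the evidence yields a valid distribution P(y | x).
posterior = {label: p / evidence for label, p in unnormalized.items()}

best_unnormalized = max(unnormalized, key=unnormalized.get)
best_normalized = max(posterior, key=posterior.get)
print(best_unnormalized, best_normalized, posterior)
```

<p>Dividing every entry by the same positive scalar rescales the values without reordering them, so both argmaxes pick the same label.</p>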
<h1>"Fully Bayesian learning"</h1>
<p>We previously lamented the shortcomings of "lowly point estimates" and sang the praises of inferring the full distribution instead. Unfortunately, this is often a computationally-hard thing to do.</p>
<p>To see why, let's revisit Bayes' theorem. Assume we are estimating the parameters <span class="math">\(\theta\)</span> of a softmax regression model and have placed a prior on <span class="math">\(\theta\)</span>. In concrete terms, this estimate can be written as <span class="math">\(P(\theta\vert D = ((x^{(1)}, y^{(1)}), ..., (x^{(m)}, y^{(m)})))\)</span>: the distribution over our belief in the true value of <span class="math">\(\theta\)</span> given the data we've observed. Bayes' theorem allows us to expand this quantity into:</p>
<div class="math">$$
P(\theta\vert D) = \frac{P(D\vert\theta)P(\theta)}{P(D)}
$$</div>
<p>Previously, we computed a "lowly point estimate" for this distribution — the MAP — as:</p>
<div class="math">$$
\begin{align*}
\theta_{MAP}
&= \underset{\theta}{\arg\max}\ \log \bigg(P(\theta)\prod\limits_{i=1}^{m} P(y^{(i)}\vert x^{(i)}; \theta)\bigg)\\
&= \underset{\theta}{\arg\max}\ \log P(\theta\vert D)\\
\end{align*}
$$</div>
<p>While <span class="math">\(P(\theta)\prod\limits_{i=1}^{m} P(y^{(i)}\vert x^{(i)}; \theta) = P(D\vert\theta)P(\theta) \neq P(\theta\vert D)\)</span>, the two differ only by the factor <span class="math">\(1/P(D)\)</span>, which does not depend on <span class="math">\(\theta\)</span>; their argmaxes <em>are</em> therefore equal. For this reason, we were able to compute a point estimate for <span class="math">\(P(\theta\vert D)\)</span>, i.e. a "summarization" of <span class="math">\(P(\theta\vert D)\)</span> in a single value, without ever computing the denominator <span class="math">\(P(D)\)</span>.</p>
<p>(As a brief aside, please note that we could summarize <span class="math">\(P(\theta\vert D)\)</span> with <em>any</em> single value from this distribution. We often select the maximum likelihood estimate — the single value of <span class="math">\(\theta\)</span> that most likely gave rise to our data, or the MAP — the single value of <span class="math">\(\theta\)</span> that both most likely gave rise to our data and most plausibly occurred itself.)</p>
<p>To compute <span class="math">\(P(\theta\vert D)\)</span> — trivially, a full distribution as the term suggests — we will need to compute <span class="math">\(P(D)\)</span> after all. As before, this can be accomplished via marginalization:</p>
<div class="math">$$
\begin{align*}
P(\theta\vert D)
&= \frac{P(D\vert\theta)P(\theta)}{P(D)}\\
&= \frac{P(D, \theta)}{P(D)}\\
&= \frac{P(D, \theta)}{\int P(D, \theta)d\theta}\\
\end{align*}
$$</div>
<p>Since <span class="math">\(\theta\)</span> takes continuous values, we can no longer employ the "delete and collapse" method of marginalization in discrete distributions. Furthermore, in all but trivial cases, <span class="math">\(\theta\)</span> is a high-dimensional vector or matrix, leaving us to compute a "high-dimensional integral that lacks an analytic (closed-form) solution — the central computational challenge in inference."<sup id="fnref-1"><a class="footnote-ref" href="#fn-1">1</a></sup></p>
<p>As such, computing the full distribution <span class="math">\(P(\theta\vert D)\)</span> in practice means <em>approximating</em> it. To this end, we'll introduce two new families of algorithms.</p>
<h2>Markov chain Monte Carlo</h2>
<p>In small to medium-sized models, we often take a different approach to approximating <span class="math">\(P(\theta\vert D)\)</span>: instead of computing the distribution itself (i.e. the canonical parameters of a gory algebraic expression which control its shape), we produce <em>samples</em> from it. Roughly speaking, the aggregate of these samples then gives, retrodictively, the distribution itself. The general family of these methods is known as Markov chain Monte Carlo, or MCMC.</p>
<p>In simple terms, MCMC inference for a given parameter <span class="math">\(\phi\)</span> works as follows:</p>
<ol>
<li>Initialize <span class="math">\(\phi\)</span> to some value <span class="math">\(\phi_{\text{current}}\)</span>.</li>
<li>Compute the prior probability of <span class="math">\(\phi_{\text{current}}\)</span> and the probability of having observed our data under <span class="math">\(\phi_{\text{current}}\)</span>: <span class="math">\(P(\phi_{\text{current}})\)</span> and <span class="math">\(P(D\vert \phi_{\text{current}})\)</span>, respectively. Their product gives <span class="math">\(P(D, \phi_{\text{current}})\)</span>, the joint probability of this parameter value and our observed data.</li>
<li>Add <span class="math">\(\phi_{\text{current}}\)</span> to a big green plastic bucket of "accepted values."</li>
<li>Propose moving to a new, nearby value <span class="math">\(\phi_{\text{proposal}}\)</span>. This value is drawn from an entirely separate <em>sampling distribution</em> which bears no influence on our prior <span class="math">\(P(\phi)\)</span> nor likelihood function <span class="math">\(P(D\vert \phi)\)</span>. Repeat Step 2 using <span class="math">\(\phi_{\text{proposal}}\)</span> instead of <span class="math">\(\phi_{\text{current}}\)</span>.</li>
<li>Walk the following tree:<ul>
<li>If <span class="math">\(P(D, \phi_{\text{proposal}}) \gt P(D, \phi_{\text{current}})\)</span>:<ul>
<li>Set <span class="math">\(\phi_{\text{current}} = \phi_{\text{proposal}}\)</span>.</li>
<li>Move to Step 3.</li>
</ul>
</li>
<li>Else:<ul>
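<li>With probability <span class="math">\(P(D, \phi_{\text{proposal}}) / P(D, \phi_{\text{current}})\)</span> (the Metropolis rule, which assumes a symmetric proposal distribution):<ul>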
<li>Set <span class="math">\(\phi_{\text{current}} = \phi_{\text{proposal}}\)</span>.</li>
<li>Move to Step 3.</li>
</ul>
</li>
<li>Else:<ul>
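<li>Keep <span class="math">\(\phi_{\text{current}}\)</span> unchanged and move to Step 3, so that the current value is recorded once more.</li>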
</ul>
</li>
</ul>
</li>
</ul>
</li>
</ol>
<p>After collecting a few thousand samples — and discarding the first few hundred, in which we drunkenly amble towards the region of high joint probability (a quantity <em>proportional</em> to the posterior probability) — we now have a bucket of samples from our desired posterior distribution. Nota bene: we never had to touch the high-dimensional integral <span class="math">\(\int P(D, \theta)d\theta\)</span>.</p>
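<p>Here is a minimal sketch of one member of this family, the Metropolis algorithm, applied to the coin from earlier: 7 heads in 10 tosses, with an assumed uniform prior on <span class="math">\(\phi = \Pr(\text{heads})\)</span>. The Gaussian proposal, its step size, the number of samples and the burn-in length are all arbitrary choices made for illustration. In this sketch a proposal is accepted with probability equal to the ratio of joint probabilities (capped at 1), and a rejected proposal puts the current value into the bucket again.</p>

```python
import math
import random


def log_joint(phi, heads=7, tosses=10):
    """log P(D, phi) up to an additive constant, with a uniform prior on (0, 1)."""
    if not 0.0 < phi < 1.0:
        return -math.inf  # the prior assigns zero probability outside (0, 1)
    return heads * math.log(phi) + (tosses - heads) * math.log(1.0 - phi)


def metropolis(n_samples=20000, step=0.2, seed=0):
    rng = random.Random(seed)
    current = 0.5
    current_lp = log_joint(current)
    bucket = []
    for _ in range(n_samples):
        # Propose a nearby value from a symmetric Gaussian proposal distribution.
        proposal = current + rng.gauss(0.0, step)
        proposal_lp = log_joint(proposal)
        # Accept with probability min(1, P(D, proposal) / P(D, current)).
        accept_prob = math.exp(min(0.0, proposal_lp - current_lp))
        if rng.random() < accept_prob:
            current, current_lp = proposal, proposal_lp
        # Record the current value, whether or not we moved.
        bucket.append(current)
    return bucket


samples = metropolis()[2000:]  # discard the early, "drunken amble" samples
posterior_mean = sum(samples) / len(samples)
print(posterior_mean)
```

<p>Only the <em>ratio</em> of joint probabilities ever appears, so the normalizing integral cancels. The sample mean should sit close to <span class="math">\(8/12\)</span>, the mean of the exact Beta posterior for this toy problem.</p>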
<h2>Variational inference</h2>
<p>In large-scale models, MCMC methods are often too slow. Variational inference, by contrast, casts the problem of posterior approximation as one of <em>optimization</em>, which is typically far faster than a sampling-based approach. This yields an <em>analytical</em> approximation to <span class="math">\(P(\theta\vert D)\)</span>. The following explanation of variational inference is taken largely from a previous post of mine: <a href="https://cavaunpeu.github.io/2017/05/08/transfer-learning-flight-delay-prediction/">Transfer Learning for Flight Delay Prediction</a>.</p>
<p>For our approximating distribution we'll choose one that is simple, parametric and familiar: the normal (Gaussian) distribution, parameterized by some set of parameters <span class="math">\(\lambda\)</span>.</p>
<div class="math">$$q_{\lambda}(\theta\vert D)$$</div>
<p>Our goal is to force this distribution to closely resemble the original; the <a href="https://en.wikipedia.org/wiki/Kullback%E2%80%93Leibler_divergence">KL divergence</a> quantifies their difference:</p>
<div class="math">$$KL(q_{\lambda}(\theta\vert D)\Vert P(\theta\vert D)) = \int{q_{\lambda}(\theta\vert D)\log\frac{q_{\lambda}(\theta\vert D)}{P(\theta\vert D)}d\theta}$$</div>
<p>To this end, we compute its argmin with respect to <span class="math">\(\lambda\)</span>:</p>
<div class="math">$$q_{\lambda}^{*}(\theta\vert D) = \underset{\lambda}{\arg\min}\ \text{KL}(q_{\lambda}(\theta\vert D)\Vert P(\theta\vert D))$$</div>
<p>Expanding the divergence, we obtain:</p>
<div class="math">$$
\begin{align*}
KL(q_{\lambda}(\theta\vert D)\Vert P(\theta\vert D))
&= \int{q_{\lambda}(\theta\vert D)\log\frac{q_{\lambda}(\theta\vert D)}{P(\theta\vert D)}d\theta}\\
&= \int{q_{\lambda}(\theta\vert D)\log\frac{q_{\lambda}(\theta\vert D)P(D)}{P(\theta, D)}d\theta}\\
&= \int{q_{\lambda}(\theta\vert D)\bigg(\log{q_{\lambda}(\theta\vert D) -\log{P(\theta, D)} + \log{P(D)}}\bigg)d\theta}\\
&= \int{q_{\lambda}(\theta\vert D)\bigg(\log{q_{\lambda}(\theta\vert D)} -\log{P(\theta, D)}}\bigg)d\theta + \log{P(D)}\int{q_{\lambda}(\theta\vert D)d\theta}\\
&= \int{q_{\lambda}(\theta\vert D)\bigg(\log{q_{\lambda}(\theta\vert D)} -\log{P(\theta, D)}}\bigg)d\theta + \log{P(D)} \cdot 1
\end{align*}
$$</div>
<p>Since only the integral depends on <span class="math">\(\lambda\)</span>, minimizing the entire expression with respect to <span class="math">\(\lambda\)</span> amounts to minimizing this term. Incidentally, the opposite (negative) of this term is called the <a href="https://www.cs.princeton.edu/courses/archive/fall11/cos597C/lectures/variational-inference-i.pdf">ELBO</a>, or the "evidence lower bound."</p>
<div class="math">$$
ELBO(\lambda) = -\int{q_{\lambda}(\theta\vert D)\bigg(\log{q_{\lambda}(\theta\vert D)} -\log{P(\theta, D)}}\bigg)d\theta
$$</div>
<p>To see why, let's plug the ELBO into the equation above and solve for <span class="math">\(\log{P(D)}\)</span>:</p>
<div class="math">$$\log{P(D)} = ELBO(\lambda) + KL(q_{\lambda}(\theta\vert D)\Vert P(\theta\vert D))$$</div>
<p>In English: "the log of the evidence equals the evidence lower bound plus the divergence from our (variational) approximation of the posterior <span class="math">\(q_{\lambda}(\theta\vert D)\)</span> to our true posterior <span class="math">\(P(\theta\vert D)\)</span>." Since a KL divergence is never negative, the ELBO can never exceed the log of the evidence, hence "lower bound."</p>
<p>As such, minimizing this divergence is equivalent to <em>maximizing</em> the ELBO, as:</p>
<div class="math">$$
KL(q_{\lambda}(\theta\vert D)\Vert P(\theta\vert D)) = -ELBO(\lambda) + \log{P(D)}
$$</div>
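<p>This identity is easy to verify numerically on a toy problem where everything can be computed exactly. The sketch below restricts <span class="math">\(\theta\)</span> to three candidate coin biases so that integrals become sums; the grid, the uniform prior and the approximation <code>q</code> are all made up for illustration.</p>

```python
import math

# Hypothetical discrete setup: three candidate biases, uniform prior, 7 heads in 10 tosses.
thetas = [0.3, 0.5, 0.7]
prior = [1 / 3, 1 / 3, 1 / 3]
heads, tosses = 7, 10

likelihood = [
    math.comb(tosses, heads) * t**heads * (1 - t) ** (tosses - heads) for t in thetas
]
joint = [lik * p for lik, p in zip(likelihood, prior)]  # P(theta, D)
evidence = sum(joint)                                   # P(D)
posterior = [j / evidence for j in joint]               # P(theta | D)

# An arbitrary (deliberately imperfect) approximation q(theta).
q = [0.1, 0.3, 0.6]

elbo = -sum(qi * (math.log(qi) - math.log(ji)) for qi, ji in zip(q, joint))
kl = sum(qi * math.log(qi / pi) for qi, pi in zip(q, posterior))

print(math.log(evidence), elbo + kl)
```

<p>The two printed numbers agree up to floating-point error. Because the log evidence is fixed, any change to <code>q</code> that raises the ELBO must lower the divergence by the same amount.</p>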
<h3>Optimization</h3>
<p>Let's restate the equation for the ELBO and rearrange further:</p>
<div class="math">$$
\begin{align*}
ELBO(\lambda)
&= -\int{q_{\lambda}(\theta\vert D)\bigg(\log{q_{\lambda}(\theta\vert D)} -\log{P(\theta, D)}\bigg)d\theta}\\
&= -\int{q_{\lambda}(\theta\vert D)\bigg(\log{q_{\lambda}(\theta\vert D)} -\log{P(D\vert \theta)} - \log{P(\theta)}\bigg)d\theta}\\
&= -\int{q_{\lambda}(\theta\vert D)\bigg(\log{q_{\lambda}(\theta\vert D)} - \log{P(\theta)}\bigg)d\theta} + \int{q_{\lambda}(\theta\vert D)\log{P(D\vert \theta)}d\theta}\\
&= -\int{q_{\lambda}(\theta\vert D)\log{\frac{q_{\lambda}(\theta\vert D)}{P(\theta)}}d\theta} + \mathop{\mathbb{E}}_{q_{\lambda}(\theta\vert D)}[\log{P(D\vert \theta)}]\\
&= \mathop{\mathbb{E}}_{q_{\lambda}(\theta\vert D)}[\log{P(D\vert \theta)}] -KL(q_{\lambda}(\theta\vert D)\Vert P(\theta))\\
\end{align*}
$$</div>
<p>Note that <span class="math">\(\log{P(D\vert \theta)}\)</span> depends on <span class="math">\(\theta\)</span>, so it cannot be pulled out of the integral; it remains as an expectation under <span class="math">\(q_{\lambda}(\theta\vert D)\)</span>. Again, our goal is to maximize this expression or minimize its opposite:</p>
<div class="math">$$
-\mathop{\mathbb{E}}_{q_{\lambda}(\theta\vert D)}[\log{P(D\vert \theta)}] + KL(q_{\lambda}(\theta\vert D)\Vert P(\theta))
$$</div>
<p>One step further, we obtain:</p>
<div class="math">$$
\begin{align*}
&= -\mathop{\mathbb{E}}_{q_{\lambda}(\theta\vert D)}[\log{P(D\vert \theta)}] + \int{q_{\lambda}(\theta\vert D)\bigg(\log{q_{\lambda}(\theta\vert D)} - \log{P(\theta)}\bigg)d\theta}\\
&= \mathop{\mathbb{E}}_{q_{\lambda}(\theta\vert D)}[-\log{P(D\vert \theta)} +\log{q_{\lambda}(\theta\vert D)} - \log{P(\theta)}]\\
&= \mathop{\mathbb{E}}_{q_{\lambda}(\theta\vert D)}[-\big(\log{P(D,  \theta)} -\log{q_{\lambda}(\theta\vert D)}\big)]\\
&= -\mathop{\mathbb{E}}_{q_{\lambda}(\theta\vert D)}[\log{P(D,  \theta)}] + \mathop{\mathbb{E}}_{q_{\lambda}(\theta\vert D)}[\log{q_{\lambda}(\theta\vert D)}]\\
\end{align*}
$$</div>
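<p>In machine learning parlance: "minimize the expected negative log joint probability of our data and parameter <span class="math">\(\theta\)</span> (the quantity a MAP estimate minimizes at a single point) minus the entropy of our variational approximation." The second term, <span class="math">\(\mathop{\mathbb{E}}_{q_{\lambda}(\theta\vert D)}[\log{q_{\lambda}(\theta\vert D)}]\)</span>, is the <em>negative</em> entropy. As a <em>higher</em> entropy is desirable (an approximation which distributes its mass in a <em>conservative</em> fashion), this minimization is a balancing act between the two terms.</p>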
<p>For a more in-depth discussion of both entropy and KL-divergence please see <a href="https://cavaunpeu.github.io/2017/05/18/minimizing_the_negative_log_likelihood_in_english/">Minimizing the Negative Log-Likelihood, in English</a>.</p>
<h1>Posterior predictive distribution</h1>
<p>With our estimate for <span class="math">\(\theta\)</span> as a full distribution, we can now make a new prediction as a full distribution as well.</p>
<div class="math">$$
\begin{align*}
P(y\vert x, D)
&= \int P(y\vert x, D, \theta)P(\theta\vert x, D)d\theta\\
&= \int P(y\vert x, \theta)P(\theta\vert D)d\theta\\
\end{align*}
$$</div>
<ul>
<li>The right term under the integral is the posterior distribution of our parameter <span class="math">\(\theta\)</span> given the "training" data, <span class="math">\(P(\theta\vert D)\)</span>. Since it does not depend on a new input <span class="math">\(x\)</span> we have removed <span class="math">\(x\)</span>.</li>
<li>The left term under the integral is our likelihood function: given an <span class="math">\(x\)</span> and a <span class="math">\(\theta\)</span>, it produces a <span class="math">\(y\)</span>. While this function does depend on <span class="math">\(\theta\)</span> — whose values are pulled from our posterior <span class="math">\(P(\theta\vert D)\)</span> — it does not depend on <span class="math">\(D\)</span> itself. As such, we have removed <span class="math">\(D\)</span>.</li>
</ul>
<p>Integrating over <span class="math">\(\theta\)</span> yields a distribution over <span class="math">\(y\)</span>: we've now captured not just the uncertainty in <em>inference</em>, but also the corresponding uncertainty in our <em>predictions</em>.</p>
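<p>When we hold samples from the posterior rather than a formula for it, this integral becomes an average: evaluate the likelihood under each sampled <span class="math">\(\theta\)</span>, then take the mean. A rough sketch for the coin (7 heads in 10 tosses, assumed uniform prior, so the posterior is <code>Beta(8, 4)</code>) follows. The coin has no input <span class="math">\(x\)</span>, so the likelihood of <code>heads</code> under a given <span class="math">\(\theta\)</span> is simply <span class="math">\(\theta\)</span> itself; that simplification is specific to this toy example.</p>

```python
import random

rng = random.Random(0)  # seeded so the sketch is reproducible

# Samples from the posterior P(theta | D); here drawn directly from Beta(8, 4).
posterior_samples = [rng.betavariate(8, 4) for _ in range(50000)]


def likelihood_heads(theta):
    """P(y = heads | theta) for a coin with bias theta."""
    return theta


# Monte Carlo estimate of P(y = heads | D): average the likelihood over the posterior.
predictive_heads = sum(likelihood_heads(t) for t in posterior_samples) / len(
    posterior_samples
)

plug_in_heads = 0.7  # what the point estimate alone would predict
print(predictive_heads, plug_in_heads)
```

<p>The posterior predictive probability lands near <span class="math">\(8/12\)</span>, pulled away from the point estimate of <code>0.7</code> toward one half, which reflects how little 10 tosses really tell us.</p>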
<h1>What do these distributions actually do for me?</h1>
<p>Said differently, "why is it important to quantify uncertainty?"</p>
<p>I think we, as humans, are exceptionally qualified to answer this question: we need to look no further than ourselves, our choices, our environment.</p>
<ul>
<li>The cross-walk says "go." Do I:<ul>
<li>Close my eyes, lie down for a 15-second nap in the middle of the road, then walk backwards the rest of the way?</li>
<li>Quickly look both ways then walk leisurely across the road, keeping an eye out for cyclists at the same time.</li>
</ul>
</li>
<li>A company emails to say "we'd like to discuss the possibility of a full-time role." Do I:<ul>
<li>Respond saying "Great! Let's chat further" while continuing to speak with other companies.</li>
<li>Respond saying "Great! Let's chat further" and promptly sever all contact with other companies.</li>
</ul>
</li>
<li>An extremely reliable lifelong friend calls to say they've found me a beautiful studio in Manhattan for $600/month, and would need to confirm in the next 24 hours if I'd like to take it. Do I:<ul>
<li>Take it.</li>
<li>Call three friends to ask if they think that this makes sense.</li>
</ul>
</li>
<li>An extremely sketchy real estate broker calls to say they've found me a beautiful studio in Manhattan for $600/month, and would need to confirm in the next 24 hours if I'd like to take it. Do I:<ul>
<li>Take it.</li>
<li>Call three friends to ask if they think that this makes sense.</li>
</ul>
</li>
</ul>
<p>The notion is the same in probabilistic modeling. Furthermore, we often build models with "not big data," and therefore have a substantially non-zero amount of uncertainty in our parameter estimates and subsequent predictions.</p>
<p>Finally, with distributional estimates in hand, we can begin to make more robust, measured and logical decisions. We can do this because, while point estimates give a quick summary of the dynamics of our system, distributions tell the full, thorough story: where the peaks are, their width and height, their distance from one another, etc. For an excellent exploration of what we can do with posterior distributions, check out Rasmus Bååth's <a href="http://www.sumsar.net/blog/2015/01/probable-points-and-credible-intervals-part-two/">Probable Points and Credible Intervals, Part 2: Decision Theory</a>.</p>
<p>Many thanks for reading, and to our pool ring Bayes'.</p>