<h2 id="example-banana-shaped-distribution">Example: Banana-shaped distribution</h2>

<p>Consider the <em>banana-shaped distribution</em>, a commonly used testbed for adaptive
MCMC methods<sup class="footnote-ref" id="fnref:haario1999adaptive"><a href="#fn:haario1999adaptive">2</a></sup>.
Denote the density of this distribution as $p_{Y}(\mathbf{y})$.
To illustrate, 1k samples randomly drawn from this distribution are shown below:</p>

<p><img src="images/banana_samples.svg" alt="Banana distribution samples" /></p>

<p>The underlying process that generates samples
$\tilde{\mathbf{y}} \sim p_{Y}(\mathbf{y})$ is simple to describe,
and is of the general form,</p>

<p>$$
\tilde{\mathbf{y}} \sim p_{Y}(\mathbf{y}) \quad
\Leftrightarrow \quad
\tilde{\mathbf{y}} = G(\tilde{\mathbf{x}}),
\quad \tilde{\mathbf{x}} \sim p_{X}(\mathbf{x}).
$$</p>

<p>In other words, a sample $\tilde{\mathbf{y}}$ is the output of a transformation
$G$, given a sample $\tilde{\mathbf{x}}$ drawn from some underlying
base distribution $p_{X}(\mathbf{x})$.</p>
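<p>As a minimal sketch of this recipe, the NumPy snippet below pushes samples from
a standard normal base distribution through $G(x) = \exp(x)$, which yields
log-normal samples. The choice of $G$ and the names <code>G</code>,
<code>x_samples</code> and <code>y_samples</code> are assumptions made for
illustration; they are not part of the banana example.</p>

```python
import numpy as np

rng = np.random.default_rng(seed=0)

def G(x):
    # Illustrative transformation (an assumption for this sketch): exp is
    # differentiable and invertible on the real line.
    return np.exp(x)

# Draw from the base distribution p_X, here a standard normal.
x_samples = rng.standard_normal(size=1000)

# Push the base samples through G to obtain samples from p_Y.
y_samples = G(x_samples)
```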

<p>However, it is not as straightforward to compute an analytical expression for
the density $p_{Y}(\mathbf{y})$.
In fact, this is only possible if $G$ is a <em>differentiable</em> and <em>invertible</em>
transformation (a <em>diffeomorphism</em><sup class="footnote-ref" id="fnref:1"><a href="#fn:1">3</a></sup>), and if there is an analytical
expression for $p_{X}(\mathbf{x})$.</p>

<p>Transformations that fail to satisfy these conditions (which includes something
as simple as a multi-layer perceptron with non-linear activations) give rise to
<em>implicit distributions</em>, and will be the subject of many posts to come.
But for now, we will restrict our attention to diffeomorphisms.</p>

<h3 id="base-distribution">Base distribution</h3>

<p>Following on with our example, the base distribution $p_{X}(\mathbf{x})$ is
given by a two-dimensional Gaussian with unit variances and covariance
$\rho = 0.95$:</p>

<p>$$
p_{X}(\mathbf{x}) = \mathcal{N}(\mathbf{x} | \mathbf{0}, \mathbf{\Sigma}),
\qquad
\mathbf{\Sigma} =
\begin{bmatrix}
  1    & 0.95 \newline
  0.95 & 1
\end{bmatrix}
$$</p>

<p>This can be encapsulated by an instance of
<a href="https://www.tensorflow.org/api_docs/python/tf/contrib/distributions/MultivariateNormalTriL" target="_blank">MultivariateNormalTriL</a>,
which is parameterized by a lower-triangular scale matrix, here the Cholesky
factor of $\mathbf{\Sigma}$.
First let&rsquo;s import TensorFlow Distributions:</p>

<pre><code class="language-python">import tensorflow as tf
import torch
import tensorflow.distributions as tfd
import torch.distributions as dist
import numpy as np
</code></pre>


<p>Then we create the lower-triangular matrix and instantiate the distribution:</p>

<pre><code class="language-python">rho = 0.95
Sigma = torch.tensor([[1, rho], [rho, 1]])
Sigma
</code></pre>

<pre><code>tensor([[1.0000, 0.9500],
        [0.9500, 1.0000]])
</code></pre>

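<pre><code class="language-python"># TensorFlow equivalent:
# p_x = tfd.MultivariateNormalTriL(scale_tril=tf.cholesky(Sigma))
scale_tril = torch.linalg.cholesky(Sigma)  # lower-triangular Cholesky factor
p_x = dist.MultivariateNormal(loc=torch.zeros(2), scale_tril=scale_tril)
scale_tril
</code></pre>

<pre><code>tensor([[1.0000, 0.0000],
        [0.9500, 0.3122]])
</code></pre>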
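<p>The factor printed above can be checked independently of either framework. The
NumPy sketch below (variable names are assumptions) compares the numerical
Cholesky factor with the closed form for a $2 \times 2$ correlation matrix, whose
lower-right entry is $\sqrt{1 - \rho^2}$:</p>

```python
import numpy as np

rho = 0.95
Sigma = np.array([[1.0, rho],
                  [rho, 1.0]])

# Lower-triangular Cholesky factor L such that L @ L.T == Sigma.
L = np.linalg.cholesky(Sigma)

# For a 2x2 correlation matrix the factor has a simple closed form.
L_closed_form = np.array([[1.0, 0.0],
                          [rho, np.sqrt(1.0 - rho**2)]])
```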

<p>As with all subclasses of <code>tfd.Distribution</code>, we can evaluate the probability
density function of this distribution by calling the <code>p_x.prob</code> method.
Evaluating this on a uniformly spaced grid yields the equiprobability contour
plot below:</p>

<p><img src="images/banana_base_density.svg" alt="Base density" /></p>
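<p>For readers who want to reproduce such a grid evaluation without TensorFlow, here
is a rough NumPy sketch that evaluates the same Gaussian density in closed form.
The grid range of $[-3, 3]$, the resolution and the helper name
<code>base_density</code> are assumptions, not values taken from the figure.</p>

```python
import numpy as np

rho = 0.95
Sigma = np.array([[1.0, rho], [rho, 1.0]])
Sigma_inv = np.linalg.inv(Sigma)
norm_const = 1.0 / (2.0 * np.pi * np.sqrt(np.linalg.det(Sigma)))

def base_density(x):
    # Zero-mean bivariate Gaussian density; x has shape (..., 2).
    quad = np.einsum("...i,ij,...j->...", x, Sigma_inv, x)
    return norm_const * np.exp(-0.5 * quad)

# Uniformly spaced grid over [-3, 3] x [-3, 3] (the range is an assumption).
x1, x2 = np.meshgrid(np.linspace(-3.0, 3.0, 61), np.linspace(-3.0, 3.0, 61))
grid = np.stack([x1, x2], axis=-1)   # shape (61, 61, 2)
density = base_density(grid)         # shape (61, 61)
```

<p>The arrays <code>x1</code>, <code>x2</code> and <code>density</code> can then be handed to
any contour-plotting routine.</p>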

<h3 id="forward-transformation">Forward Transformation</h3>

<p>The required transformation $G$ is defined as:</p>

<p>$$
G(\mathbf{x}) =
\begin{bmatrix}
  x_1 \newline
  x_2 - x_1^2 - 1 \newline
\end{bmatrix}
$$</p>

<p>We implement this in the <code>_forward</code> function below<sup class="footnote-ref" id="fnref:2"><a href="#fn:2">4</a></sup>:</p>

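<pre><code class="language-python">def _forward(x):
    y_0 = x[..., 0:1]
    y_1 = x[..., 1:2] - y_0**2 - 1
    # Pass any dimensions beyond the first two through unchanged.
    y_tail = x[..., 2:]
    return tf.concat([y_0, y_1, y_tail], axis=-1)
</code></pre>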

<p>We can now use this to generate samples from $p_{Y}(\mathbf{y})$.
To do this we first sample from the base distribution $p_{X}(\mathbf{x})$ by
calling <code>p_x.sample</code>. For this illustration, we generate 1k samples, which is
specified through the <code>sample_shape</code> argument. We then transform these samples
through $G$ by calling <code>_forward</code> on them.</p>

<pre><code class="language-python">x_samples = p_x.sample(sample_shape=(1000,))
y_samples = _forward(x_samples)
</code></pre>

<p>The figure below contains scatterplots of the 1k samples <code>x_samples</code> (left)
and the transformed <code>y_samples</code> (right):</p>

<p><img src="images/banana_base_samples.svg" alt="Banana and base samples" /></p>
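<p>A quick sanity check on the transformed samples is to look at the mean of the
second coordinate. Since $\mathbb{E}[x_2] = 0$ and $\mathbb{E}[x_1^2] = 1$, we
expect $\mathbb{E}[y_2] = -2$. The NumPy sketch below (sample size and variable
names are assumptions) checks this empirically:</p>

```python
import numpy as np

rng = np.random.default_rng(seed=0)

rho = 0.95
Sigma = np.array([[1.0, rho], [rho, 1.0]])

# Base samples x ~ N(0, Sigma), shape (100000, 2).
x = rng.multivariate_normal(mean=np.zeros(2), cov=Sigma, size=100_000)

# Banana transformation G: y_1 = x_1, y_2 = x_2 - x_1^2 - 1.
y = np.stack([x[:, 0], x[:, 1] - x[:, 0] ** 2 - 1.0], axis=-1)

# Since E[x_2] = 0 and E[x_1^2] = 1, the mean of y_2 should be close to -2.
y2_mean = y[:, 1].mean()
```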

<h3 id="instantiating-a-transformeddistribution-with-a-bijector">Instantiating a <code>TransformedDistribution</code> with a <code>Bijector</code></h3>

<p>Having specified the forward transformation and the underlying distribution, we
have now fully described the sample generation process, which is the bare
minimum necessary to define a probability distribution.</p>

<p>The forward transformation is also the <em>first</em> of <strong>three</strong> operations needed to
fully specify a <code>Bijector</code>, which can be used to instantiate a
<code>TransformedDistribution</code> that encapsulates the banana-shaped distribution.</p>

<h4 id="creating-a-bijector">Creating a <code>Bijector</code></h4>

<p>First, let&rsquo;s subclass <code>Bijector</code> to define the <code>Banana</code> bijector and implement
the forward transformation as an instance method:</p>

<h3 id="probability-density-function">Probability Density Function</h3>

<p>Although we can now sample from this distribution, we have yet to define the
operations necessary to evaluate its probability density function. These are the
remaining <em>two</em> of the <strong>three</strong> operations needed to fully specify a <code>Bijector</code>.</p>

<p>Indeed, calling <code>p_y.prob</code> at this stage would simply raise a
<code>NotImplementedError</code> exception. So what else do we need to define?</p>

<p>Recall that the probability density of $p_{Y}(\mathbf{y})$ is given by:</p>

<p>$$
p_{Y}(\mathbf{y}) = p_{X}(G^{-1}(\mathbf{y})) \left\lvert \mathrm{det}
\left ( \frac{\partial}{\partial\mathbf{y}} G^{-1}(\mathbf{y}) \right ) \right\rvert
$$</p>

<p>Hence we need to specify the inverse transformation $G^{-1}(\mathbf{y})$ and the
absolute value of its Jacobian determinant,
$\left\lvert \mathrm{det} \left ( \frac{\partial}{\partial\mathbf{y}} G^{-1}(\mathbf{y}) \right ) \right\rvert$.</p>
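<p>To see why the Jacobian term matters, consider a one-dimensional illustration
with a transformation that is <em>not</em> volume-preserving. The sketch below
assumes $G(x) = \exp(x)$ with a standard normal base distribution (an assumption
made for this example, not the banana map) and integrates the resulting density
numerically, with and without the Jacobian factor. All helper names are
hypothetical.</p>

```python
import numpy as np

def p_x(x):
    # Standard normal base density.
    return np.exp(-0.5 * x**2) / np.sqrt(2.0 * np.pi)

def G_inv(y):
    # Inverse of the illustrative transformation G(x) = exp(x).
    return np.log(y)

def abs_det_jacobian_G_inv(y):
    # d/dy log(y) = 1/y, which is positive for y > 0.
    return 1.0 / y

def trapezoid(f, t):
    # Trapezoidal rule on a possibly non-uniform grid t.
    return np.sum(0.5 * (f[1:] + f[:-1]) * np.diff(t))

# Grid covering almost all of the probability mass.
y = np.exp(np.linspace(-8.0, 8.0, 20001))

with_jacobian = trapezoid(p_x(G_inv(y)) * abs_det_jacobian_G_inv(y), y)
without_jacobian = trapezoid(p_x(G_inv(y)), y)
```

<p>With the Jacobian factor the result integrates to approximately one, as a
density should. Without it, the integral comes out close to
$e^{1/2} \approx 1.65$.</p>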

<p>For numerical stability, the <code>Bijector</code> API requires that this be defined in
log-space. Hence, it is useful to recall that the forward and inverse log
determinant Jacobians differ only in their signs<sup class="footnote-ref" id="fnref:3"><a href="#fn:3">5</a></sup>,</p>

<p>$$
\begin{align}
  \log \left\lvert \mathrm{det} \left ( \frac{\partial}{\partial\mathbf{y}} G^{-1}(\mathbf{y}) \right ) \right\rvert
  &amp; = - \log \left\lvert \mathrm{det} \left ( \frac{\partial}{\partial\mathbf{x}} G(\mathbf{x}) \right ) \right\rvert,
\end{align}
$$</p>
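<p>This sign relationship can be checked numerically. The sketch below uses a
hypothetical two-dimensional map that is not volume-preserving, approximates both
Jacobians with central finite differences, and compares the two log-determinants
at corresponding points $\mathbf{y} = G(\mathbf{x})$. The map and the helper
<code>numerical_jacobian</code> are assumptions for illustration only.</p>

```python
import numpy as np

def G(x):
    # Illustrative map that is not volume-preserving (an assumption).
    return np.array([2.0 * x[0], x[1] * np.exp(x[0])])

def G_inv(y):
    # Analytical inverse of the illustrative map above.
    return np.array([y[0] / 2.0, y[1] * np.exp(-y[0] / 2.0)])

def numerical_jacobian(f, v, eps=1e-6):
    # Central finite differences; column j holds the derivative w.r.t. v[j].
    v = np.asarray(v, dtype=float)
    cols = []
    for j in range(v.size):
        step = np.zeros_like(v)
        step[j] = eps
        cols.append((f(v + step) - f(v - step)) / (2.0 * eps))
    return np.stack(cols, axis=-1)

x = np.array([0.3, -1.2])
y = G(x)

# slogdet returns (sign, log|det|); we keep the log-magnitude.
_, forward_ldj = np.linalg.slogdet(numerical_jacobian(G, x))
_, inverse_ldj = np.linalg.slogdet(numerical_jacobian(G_inv, y))
```

<p>Here <code>forward_ldj</code> and <code>inverse_ldj</code> agree up to sign, within
finite-difference error.</p>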

<p>which gives us the option of implementing either (or both).
However, do note the following from the official
<a href="https://www.tensorflow.org/api_docs/python/tf/contrib/distributions/bijectors/Bijector" target="_blank">tf.contrib.distributions.bijectors.Bijector</a> API docs:</p>

<blockquote>
<p>Generally its preferable to directly implement the inverse Jacobian
determinant. This should have superior numerical stability and will often share
subgraphs with the <code>_inverse</code> implementation.</p>
</blockquote>

<h3 id="inverse-transformation">Inverse Transformation</h3>

<p>So let&rsquo;s implement the inverse transform $G^{-1}$, which is given by:</p>

<p>$$
G^{-1}(\mathbf{y}) =
\begin{bmatrix}
  y_1 \newline
  y_2 + y_1^2 + 1 \newline
\end{bmatrix}
$$</p>

<p>We define this in the <code>_inverse</code> function below:</p>

<pre><code class="language-python">def _inverse(y):

    x_0 = y[..., 0:1]
    x_1 = y[..., 1:2] + x_0**2 + 1
    # Pass any dimensions beyond the first two through unchanged.
    x_tail = y[..., 2:]

    return tf.concat([x_0, x_1, x_tail], axis=-1)
</code></pre>
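<p>A useful habit when writing a forward/inverse pair is to check that they really
undo each other. The NumPy sketch below mirrors the two transformations (the
function names <code>forward</code> and <code>inverse</code> are stand-ins, not the
TensorFlow functions) and tests the round trip on random inputs with three
columns, so that the pass-through of extra dimensions is exercised as well:</p>

```python
import numpy as np

def forward(x):
    # y_1 = x_1, y_2 = x_2 - x_1^2 - 1; any further dimensions pass through.
    y_0 = x[..., 0:1]
    y_1 = x[..., 1:2] - y_0**2 - 1
    return np.concatenate([y_0, y_1, x[..., 2:]], axis=-1)

def inverse(y):
    # x_1 = y_1, x_2 = y_2 + y_1^2 + 1; any further dimensions pass through.
    x_0 = y[..., 0:1]
    x_1 = y[..., 1:2] + x_0**2 + 1
    return np.concatenate([x_0, x_1, y[..., 2:]], axis=-1)

rng = np.random.default_rng(seed=0)
x = rng.standard_normal(size=(5, 3))  # 3 columns to exercise the pass-through

round_trip_ok = np.allclose(inverse(forward(x)), x)
```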

<h3 id="jacobian-determinant">Jacobian determinant</h3>

<p>Now we compute the log determinant of the Jacobian of the <em>inverse</em>
transformation.
In this simple example, the transformation is <em>volume-preserving</em>, meaning its
Jacobian determinant is equal to 1.</p>

<p>This is easy to verify:</p>

<p>$$
\begin{align}
  \mathrm{det} \left ( \frac{\partial}{\partial\mathbf{y}} G^{-1}(\mathbf{y}) \right )
  &amp; = \mathrm{det}
  \begin{pmatrix}
    \frac{\partial}{\partial y_1} y_1             &amp; \frac{\partial}{\partial y_2} y_1 \newline
    \frac{\partial}{\partial y_1} (y_2 + y_1^2 + 1) &amp; \frac{\partial}{\partial y_2} (y_2 + y_1^2 + 1) \newline
  \end{pmatrix} \newline
  &amp; = \mathrm{det}
  \begin{pmatrix}
    1     &amp; 0 \newline
    2 y_1 &amp; 1 \newline
  \end{pmatrix}
  = 1
\end{align}
$$</p>
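<p>The same result can be confirmed numerically. The short NumPy sketch below builds
the Jacobian matrix derived above for several values of $y_1$ (the helper name and
the chosen values are assumptions) and computes its determinant:</p>

```python
import numpy as np

def inverse_jacobian(y1):
    # Jacobian of G^{-1} at a point with first coordinate y1:
    # [[1, 0], [2 * y1, 1]]. It does not depend on y_2.
    return np.array([[1.0, 0.0],
                     [2.0 * y1, 1.0]])

y1_values = np.linspace(-3.0, 3.0, 7)
dets = np.array([np.linalg.det(inverse_jacobian(v)) for v in y1_values])
log_dets = np.log(np.abs(dets))
```

<p>Every determinant equals one because the matrix is lower-triangular with ones on
its diagonal, so the log-determinants are all zero.</p>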

<p>Hence, the log determinant Jacobian is given by zeros shaped like input <code>y</code>, up
to the last <code>inverse_min_event_ndims=1</code> dimensions:</p>

<pre><code class="language-python">def _inverse_log_det_jacobian(y):

    return tf.zeros(shape=y.shape[:-1])
</code></pre>

<p>Since the log determinant Jacobian is constant, i.e. independent of the input,
we can just specify it for one input by setting the flag <code>is_constant_jacobian=True</code><sup class="footnote-ref" id="fnref:4"><a href="#fn:4">6</a></sup>,
and the <code>Bijector</code> class will handle the necessary shape inference for us.</p>

<p>Putting it all together in the <code>Banana</code> bijector subclass, we have:</p>
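<p>As a framework-free illustration of how the three operations fit together, here
is a rough NumPy sketch. It is <em>not</em> the <code>Bijector</code> subclass
itself: the class name <code>BananaSketch</code> and the helpers
<code>base_log_prob</code> and <code>banana_log_prob</code> are hypothetical, and
the class only mimics the forward, inverse and inverse log determinant Jacobian
operations. The final lines check by a Riemann sum that the resulting density
integrates to roughly one.</p>

```python
import numpy as np

class BananaSketch:
    """Framework-free stand-in for the three operations (hypothetical name)."""

    def forward(self, x):
        y_0 = x[..., 0:1]
        y_1 = x[..., 1:2] - y_0**2 - 1
        return np.concatenate([y_0, y_1, x[..., 2:]], axis=-1)

    def inverse(self, y):
        x_0 = y[..., 0:1]
        x_1 = y[..., 1:2] + x_0**2 + 1
        return np.concatenate([x_0, x_1, y[..., 2:]], axis=-1)

    def inverse_log_det_jacobian(self, y):
        # Volume-preserving map: log|det| is zero for every input.
        return np.zeros(y.shape[:-1])

rho = 0.95
Sigma = np.array([[1.0, rho], [rho, 1.0]])
Sigma_inv = np.linalg.inv(Sigma)
log_norm = -np.log(2.0 * np.pi) - 0.5 * np.log(np.linalg.det(Sigma))

def base_log_prob(x):
    # Log-density of the zero-mean Gaussian base distribution.
    return log_norm - 0.5 * np.einsum("...i,ij,...j->...", x, Sigma_inv, x)

banana = BananaSketch()

def banana_log_prob(y):
    # Change of variables:
    # log p_Y(y) = log p_X(G^{-1}(y)) + log|det J_{G^{-1}}(y)|.
    return base_log_prob(banana.inverse(y)) + banana.inverse_log_det_jacobian(y)

# Riemann-sum check that the density integrates to roughly one.
step = 0.05
y1, y2 = np.meshgrid(np.arange(-5.0, 5.0, step), np.arange(-35.0, 5.0, step))
grid = np.stack([y1, y2], axis=-1)
total_mass = np.exp(banana_log_prob(grid)).sum() * step**2
```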


<p>Finally, we can instantiate the distribution <code>p_y</code> by calling
<code>tfd.TransformedDistribution</code> as we did before and, <em>et voilà</em>,
we can now simply call <code>p_y.prob</code> to evaluate the probability density function.</p>

<p>Evaluating this on the same uniformly-spaced grid as before yields the following
equiprobability contour plot:</p>

<p><img src="images/banana_density.svg" alt="Banana density" /></p>

<h1 id="summary">Summary</h1>

<p>In this post, we showed that using diffeomorphisms (mappings that are
differentiable and invertible), it is possible to transform standard distributions
into interesting and complicated distributions, while still being able to
compute their densities analytically.</p>

<p>The <code>Bijector</code> API provides an interface that encapsulates the basic properties
of a diffeomorphism needed to transform a distribution. These are: the
forward transform itself, its inverse and the determinant of their Jacobians.</p>

<p>Using this, <code>TransformedDistribution</code> <em>automatically</em> implements perhaps the two
most important methods of a probability distribution: sampling (<code>sample</code>), and
density evaluation (<code>prob</code>).</p>

<p>Needless to say, this is a very powerful combination.
Through the <code>Bijector</code> API, the number of possible distributions that can be
implemented and used directly with other functionalities in the TensorFlow
Probability ecosystem effectively becomes <em>endless</em>.</p>