<h4>Background</h4>
<p>We consider a general feed-forward BNN with <span class="math">\(L\)</span> total layers (<span class="math">\(L-1\)</span> hidden layers plus the output layer), indexed <span class="math">\(1, 2, \ldots, \ell, \ldots, L\)</span>. It takes as input a vector <span class="math">\(x\)</span> and produces a real-valued scalar output <span class="math">\(f(x) \in \mathbb{R}\)</span>.</p>
<p>The first layer takes as input a <span class="math">\(D\)</span>-dimensional data vector: <span class="math">\(x = [x_1, x_2, \ldots, x_D]\)</span>. Each of the <span class="math">\(J^{(1)}\)</span> hidden units, indexed by <span class="math">\(j\)</span>, produces a scalar value by multiplying the input vector <span class="math">\(x\)</span> by a weight vector, adding a bias, and feeding the resulting scalar through an activation function:</p>
<div class="math">$$
h^{(1)}_{j}(x, w, b) = \text{activation}(b^{(1)}_{j} + \sum_{d=1}^D w^{(1)}_{j,d} x_{d} )
$$</div>
<p>If there are 2 or more hidden layers, we'll write that layer <span class="math">\(\ell\)</span> has <span class="math">\(J^{(\ell)}\)</span> units. Each unit produces a scalar in the same fashion: taking as input the <span class="math">\(J^{(\ell-1)}\)</span>-length vector <span class="math">\(h^{(\ell-1)}\)</span> produced by the previous layer, multiplying by unit-specific weights, adding a unit-specific bias, and applying an activation function:
</p>
<div class="math">$$
h^{(\ell)}_{j}(x, w, b) = \text{activation}(b^{(\ell)}_{j} + \sum_{k=1}^{J^{(\ell-1)}} w^{(\ell)}_{j,k} h^{(\ell-1)}_{k}(x,w,b) )
$$</div>
<p>At the final layer <span class="math">\(L\)</span>, we produce a scalar value <span class="math">\(f(x, w, b)\)</span> via:
</p>
<div class="math">$$
f(x, w, b) = b^{(L)}_{1} + \sum_{k=1}^{J^{(L-1)}} w^{(L)}_{1,k} h^{(L-1)}_{k}(x,w,b)
$$</div>
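<p>As a rough illustration of the three equations above, here is a minimal NumPy sketch of the forward pass. The function name <code>predict_f</code> and the list-of-dicts parameter layout are assumptions made for this example, not something the assignment prescribes. Each weight array is stored with shape (inputs, units), which is the transpose of the <span class="math">\(w^{(\ell)}_{j,k}\)</span> indexing used above.</p>

```python
import numpy as np

def predict_f(x_ND, params, activation=np.tanh):
    """Evaluate f(x, w, b) for each row of the N x D array x_ND.

    params is a hypothetical layout: a list with one dict per layer,
    each holding 'w' (shape J_prev x J) and 'b' (shape J,).
    The last dict is the output layer, which has a single unit.
    """
    h = x_ND
    n_layers = len(params)
    for layer_id, layer in enumerate(params):
        # Affine step: bias plus weighted sum of the previous layer's outputs
        z = np.dot(h, layer['w']) + layer['b']
        if layer_id < n_layers - 1:
            h = activation(z)  # hidden layers apply the activation
        else:
            h = z              # output layer is linear
    return h[:, 0]

# Tiny example: D=1 input, one hidden layer with 2 units
params = [
    dict(w=np.array([[1.0, -1.0]]), b=np.array([0.0, 0.0])),
    dict(w=np.array([[2.0], [3.0]]), b=np.array([0.5])),
]
x_ND = np.array([[0.0], [1.0]])
f_N = predict_f(x_ND, params)
```

<p>Passing a list that contains only the output-layer dict corresponds to a network with no hidden layers.</p>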
<h4>Possible Activation Functions</h4>
<ul>
<li>Tanh: Hyperbolic tangent function -- see <a href="https://docs.scipy.org/doc/numpy/reference/generated/numpy.tanh.html">numpy's np.tanh</a></li>
<li>ReLU: <span class="math">\(\text{relu}(z) = \text{max}(0, z)\)</span></li>
<li>SquaredExponential (aka 'RBF'): <span class="math">\(\text{sqexp}(z) = \exp(-z^2)\)</span></li>
</ul>
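<p>The three activation functions listed above could be written in NumPy roughly as follows. The helper names <code>relu</code> and <code>sqexp</code> and the <code>ACTIVATIONS</code> lookup table are illustrative choices for this sketch only.</p>

```python
import numpy as np

def relu(z):
    # relu(z) = max(0, z), applied elementwise
    return np.maximum(0.0, z)

def sqexp(z):
    # sqexp(z) = exp(-z^2), applied elementwise
    return np.exp(-np.square(z))

# Hypothetical lookup table so an activation can be selected by name
ACTIVATIONS = dict(tanh=np.tanh, relu=relu, sqexp=sqexp)

z = np.array([-1.0, 0.0, 2.0])
out = {name: fn(z) for name, fn in ACTIVATIONS.items()}
```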
<h4>Possible Architectures</h4>
<p>A general neural network for regression with L total layers will have L-1 hidden layers, each possibly with a different number of hidden units. We can specify the sizes of the hidden layers as a list of integers, like this:</p>
<ul>
<li><code>[]</code> means there are no hidden layers and no hidden units (equivalent to linear regression)</li>
<li><code>[5]</code> means there is one hidden layer with 5 units</li>
<li><code>[5, 3]</code> means there are two hidden layers: the first has 5 units and the second has 3 units</li>
</ul>
<p>In general, the number of hidden layers (L-1) is equal to the length of the list, and the size of the <span class="math">\(\ell\)</span>-th layer is given by the integer at position <span class="math">\(\ell\)</span> in the list.</p>
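<p>To make the list notation concrete, the sketch below turns a list of hidden-layer sizes into the (inputs, units) shape of each layer's weight array and counts the total number of weights and biases. The function names and the default of one input dimension and one output are assumptions for this example.</p>

```python
def make_layer_shapes(n_hiddens_per_layer, n_dims_input=1, n_dims_output=1):
    """Return one (n_in, n_out) pair per layer, including the output layer."""
    n_units = [n_dims_input] + list(n_hiddens_per_layer) + [n_dims_output]
    return [(n_in, n_out) for n_in, n_out in zip(n_units[:-1], n_units[1:])]

def count_params(shapes):
    """Total number of scalar weights and biases across all layers."""
    return sum(n_in * n_out + n_out for n_in, n_out in shapes)

shapes_linear = make_layer_shapes([])        # no hidden layers
shapes_one = make_layer_shapes([5])          # one hidden layer, 5 units
shapes_two = make_layer_shapes([5, 3])       # two hidden layers
```

<p>With a 1-dimensional input, the empty list gives a single (1, 1) layer with two parameters in total: one weight and one bias.</p>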
<h4>Likelihood model</h4>
<p>We can use the scalar value produced by the feed-forward neural network as the mean of a Gaussian <em>likelihood</em> distribution that explains observed input/output training data pairs <span class="math">\(x_n, y_n\)</span>:</p>
<div class="math">\begin{align}
p(y|x, w, b) &amp;= \prod_{n=1}^N p(y_n | x_n, w, b)
\\
    &amp;= \prod_{n=1}^N \mathcal{N}(y_n \mid f(x_n, w, b), \sigma^2)
\end{align}</div>
<p>Where <span class="math">\(\sigma\)</span> is a given standard deviation hyperparameter.</p>
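<p>In practice it is usually more stable to work with the logarithm of this product, which turns it into a sum over the <span class="math">\(N\)</span> training pairs. A minimal sketch, assuming the network outputs <span class="math">\(f(x_n, w, b)\)</span> have already been computed and stored in an array (the function name <code>calc_log_likelihood</code> is hypothetical):</p>

```python
import numpy as np

def calc_log_likelihood(y_N, f_N, sigma=0.1):
    """Sum over n of log Normal(y_n | f_n, sigma^2)."""
    y_N = np.asarray(y_N, dtype=float)
    f_N = np.asarray(f_N, dtype=float)
    n_examples = y_N.size
    # Normalization term of the Gaussian, repeated once per example
    log_norm_const = -0.5 * n_examples * np.log(2.0 * np.pi * sigma ** 2)
    # Squared-error term scaled by the noise variance
    sq_err = np.sum(np.square(y_N - f_N))
    return log_norm_const - 0.5 * sq_err / sigma ** 2

val = calc_log_likelihood([1.0, 0.0], [0.0, 0.0], sigma=1.0)
```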
<h2><a name="problem-1">Problem 1: Sampling from a BNN Prior</a></h2>
<p>Write Python code to sample function values produced by the output of a general BNN regression architecture, where the weight parameters and biases at every layer each have an independent Gaussian prior -- Normal(mean=0, variance=1).</p>
<p>You should sample the function values that correspond to a set of at least 200 evenly-spaced test points <span class="math">\(\{x_i\}\)</span> between -20 and 20. One way to generate a 1D array of <span class="math">\(G\)</span> points would be: <code>x_grid_G = np.linspace(-20, 20, G)</code>. </p>
<p>To demonstrate your implementation, you'll make plots of sampled function values (just like in HW1). Each individual plot should show a line plot of the test grid points <span class="math">\(x_i\)</span> and the corresponding sampled function values <span class="math">\(f_i = f(x_i)\)</span>. Use a matplotlib line style '-' to emphasize the connecting lines between the specific <span class="math">\(\{x_i, f_i\}\)</span> pair values (showing the specific dots themselves can make it tough to observe qualitative patterns).</p>
<p>For Problem 1, your report PDF should include:</p>
<p>a. 4 row x 3 column grid of plots, where each panel shows 5 samples from the prior </p>
<ul>
<li>For the rows, try 4 different architectures: [2], [10], [2,2], [10, 10]</li>
<li>For the columns, try 3 different activation functions: ReLU, tanh, and squared exponential (aka 'RBF')</li>
</ul>
<p>b. Short text description of the qualitative trends you observe. How does a deeper network impact the function shape? How does the activation function impact function shape? A few short but complete sentences.</p>
<h2><a name="problem-2">Problem 2: Sample from Posterior using your own HMC implementation </a></h2>
<p>Consider the following training data with <span class="math">\(N=6\)</span> example pairs of <span class="math">\(x\)</span> and <span class="math">\(y\)</span> values (as in HW1):</p>
<div class="highlight"><pre><span></span>x_train_N = np.asarray([-2.,    -1.8,   -1.,  1.,  1.8,     2.])
y_train_N = np.asarray([-3.,  0.2224,    3.,  3.,  0.2224, -3.])
</pre></div>


<p>Your goal is to write an implementation of HMC that can sample from the posterior given this data. <strong>Hint:</strong> the pseudocode algorithm on Page 14 of Neal's Handbook of MCMC Chapter on HMC is a helpful resource: <a href="https://arxiv.org/pdf/1206.1901.pdf#page=14">https://arxiv.org/pdf/1206.1901.pdf#page=14</a></p>
<p>You may use automatic differentiation tools (like autograd, PyTorch, or Tensorflow) as you wish. We recommend autograd as a simple option that has been demonstrated in class. See 'Part 5' of the Jupyter Notebook we worked on in class for examples of useful NN data structures and gradient descent training of NNs with autograd: <a href="https://github.com/tufts-ml/comp150_bdl_2018f_public/blob/master/notebooks/intro_to_autograd_and_neural_net_training.ipynb">intro_to_autograd_and_neural_net_training.ipynb</a>.</p>
<p>You should think carefully about how you set the step_size and the number of leapfrog steps for an HMC proposal. You may need to try several values and find the one that performs best. </p>
<p>You may refer to existing HMC implementations you find online for high-level understanding, but you <em>must</em> write your own code and be able to defend any code you submit as your original work. Working with a small set of fellow students from this class (at most 2 partners) is encouraged, provided you abide by the collaboration policy.</p>
<p><em>Implementation Details</em>:</p>
<ul>
<li>Fix the BNN architecture to 1 hidden layer of 10 units, with <code>tanh</code> as the activation function.</li>
<li>Use a Normal(0, 1) prior for all weights and biases (as in Problem 1).</li>
<li>Use <code>sigma=0.1</code> for the likelihood's Gaussian noise level.</li>
<li>Run at least 3 chains, each for at least 2000 total iterations (remember to account for burn-in).</li>
</ul>
<p><em>Instructions:</em> For Problem 2, your report PDF should include:</p>
<p>a. Plot of the "potential energy" (aka negative log joint probability) vs. iteration, for each of your chains. This should be one plot with multiple lines.</p>
<p>b. Plot of sampled function values from the "posterior", for multiple chains. Show 10 samples per plot. Avoid showing samples in the transient burn-in phase of the sampler.</p>
<p>c. Plot of the empirical mean of many samples from the "posterior". Also show +/- 2 standard deviations (see matplotlib's <a href="https://matplotlib.org/gallery/recipes/fill_between_alpha.html">fill_between</a> function). Avoid showing samples in the transient burn-in phase of the sampler.</p>
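<p>One possible way to compute the quantities for plot (c) is sketched below. It assumes the sampled function values have been stacked into an array with one row per HMC iteration and one column per grid point. The function name <code>summarize_samples</code> and the <code>n_burnin</code> argument are illustrative, and the right number of burn-in iterations to discard depends on your own chains.</p>

```python
import numpy as np

def summarize_samples(f_SG, n_burnin=0):
    """Return mean and +/- 2 std band over samples, per grid point.

    f_SG : array with shape (n_samples, n_grid_points)
    n_burnin : number of leading samples to discard
    """
    kept_SG = np.asarray(f_SG, dtype=float)[n_burnin:]
    mean_G = kept_SG.mean(axis=0)
    std_G = kept_SG.std(axis=0)
    return mean_G, mean_G - 2.0 * std_G, mean_G + 2.0 * std_G

# Synthetic example: first row plays the role of a burn-in sample
f_SG = np.array([[100.0, 100.0], [1.0, 2.0], [3.0, 2.0]])
mean_G, lower_G, upper_G = summarize_samples(f_SG, n_burnin=1)
```

<p>The returned arrays could then be drawn with <code>plt.plot(x_grid_G, mean_G)</code> and <code>plt.fill_between(x_grid_G, lower_G, upper_G, alpha=0.3)</code>.</p>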
<h3><a name="template-code">Template Python Code</a></h3>
<p>The following two function templates might help you solve Problem 2. You are not required to use either of them, but they may be helpful.</p>
<p>See also the pseudocode found on Page 14 of Neal's paper: <a href="https://arxiv.org/pdf/1206.1901.pdf#page=14">https://arxiv.org/pdf/1206.1901.pdf#page=14</a>.</p>
<p>Brief template code intro: we assume we have defined Python functions that can</p>
<ul>
<li>Calculate the kinetic energy, given momentum values</li>
<li>Calculate the potential energy of BNN regression, given some bnn parameter values (aka 'position')</li>
<li>Calculate the <em>gradient</em> of the potential energy (perhaps via autograd)</li>
</ul>
<p>(You'll need to write each of these functions).</p>
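<p>As a hint for the simplest of the three, here is a sketch of a kinetic energy function. It assumes the momentum is stored as a flat NumPy vector and that every momentum variable has unit mass (the common default choice in Neal's chapter); if you store parameters in a nested structure, you would need to adapt it.</p>

```python
import numpy as np

def calc_kinetic_energy(momentum_vec):
    """Kinetic energy K(p) = 0.5 * sum of p_i^2, assuming unit masses."""
    return 0.5 * np.sum(np.square(momentum_vec))

def calc_grad_kinetic_energy(momentum_vec):
    """Gradient of K with respect to p, which is simply p for unit masses."""
    return np.asarray(momentum_vec, dtype=float)

p = np.array([3.0, 4.0])
K = calc_kinetic_energy(p)
grad_K = calc_grad_kinetic_energy(p)
```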
<p>Given these functions, we can build an HMC sampler using the <code>run_HMC_sampler</code> and <code>make_proposal_via_leapfrog_steps</code> functions defined below.</p>
<h4>Template: Run HMC Sampler for many iterations</h4>

<div class="highlight"><pre><span></span>import time

import numpy as np


def run_HMC_sampler(
        init_bnn_params=None,
        n_hmc_iters=100,
        n_leapfrog_steps=1,
        step_size=1.0,
        random_seed=42,
        calc_potential_energy=None,
        calc_kinetic_energy=None,
        calc_grad_potential_energy=None,
        ):
    """ Run HMC sampler for many iterations (many proposals)

    Returns
    -------
    bnn_samples : list
        List of samples of NN parameters produced by HMC
        Can be viewed as 'approximate' posterior samples if chain runs to convergence.
    info : dict
        Tracks energy values at each iteration and other diagnostics.

    References
    ----------
    See Neal's pseudocode algorithm for a single HMC proposal + acceptance:
    https://arxiv.org/pdf/1206.1901.pdf#page=14

    This function repeats many HMC proposal steps.
    """
    # Create random-number-generator with specific seed for reproducibility
    prng = np.random.RandomState(int(random_seed))

    # Record the start time so diagnostics can report elapsed seconds
    start_time_sec = time.time()

    # Set initial bnn params
    cur_bnn_params = init_bnn_params
    cur_potential_energy = calc_potential_energy(cur_bnn_params)

    bnn_samples = list()
    # TODO make lists to track energies over iterations

    n_accept = 0
    for t in range(n_hmc_iters):
        # Draw momentum for CURRENT configuration
        cur_momentum_vec = None  # TODO draw momentum using prng

        # Create PROPOSED configuration
        prop_bnn_params, prop_momentum_vec = make_proposal_via_leapfrog_steps(
            cur_bnn_params, cur_momentum_vec,
            n_leapfrog_steps=n_leapfrog_steps,
            step_size=step_size,
            calc_grad_potential_energy=calc_grad_potential_energy)

        # TODO Compute probability of accept/reject for proposal
        # TODO You'll need to use kinetic and potential energy functions
        accept_proba = 0.0 # (Placeholder)

        # Draw random value from (0,1) to determine if we accept or not
        if prng.rand() &lt; accept_proba:
            # If here, we accepted the proposal
            n_accept += 1

            # TODO what current state needs to be updated?

        # Update list of samples from "posterior"
        bnn_samples.append(cur_bnn_params)
        # TODO update energy tracking lists

        # Print diagnostics for the first 5 iters, then every 50 iters
        if t &lt; 5 or ((t+1) % 50 == 0) or (t+1) == n_hmc_iters:
            accept_rate = float(n_accept) / float(t+1)
            print("iter %6d/%d after %7.1f sec | accept_rate %.3f" % (
                t+1, n_hmc_iters, time.time() - start_time_sec, accept_rate))

    return (
        bnn_samples,
        dict(
            n_accept=n_accept,
            n_hmc_iters=n_hmc_iters,
            accept_rate=accept_rate),
        )
</pre></div>

<h4>Template: Construct a Single HMC Proposal</h4>

<div class="highlight"><pre><span></span>import copy


def make_proposal_via_leapfrog_steps(
        cur_bnn_params, cur_momentum_vec,
        n_leapfrog_steps=1,
        step_size=1.0,
        calc_grad_potential_energy=None):
    """ Construct one HMC proposal via leapfrog integration

    Returns
    -------
    prop_bnn_params : same type/size as cur_bnn_params
    prop_momentum_vec : same type/size as cur_momentum_vec

    """
    # Initialize proposed variables as copies of current values
    prop_bnn_params = copy.deepcopy(cur_bnn_params)
    prop_momentum_vec = copy.deepcopy(cur_momentum_vec)

    # TODO: half step update of momentum
    # This will use the grad of potential energy (use provided function)

    for step_id in range(n_leapfrog_steps):
        # TODO: full step update of 'position' (aka bnn_params)
        # This will use the grad of kinetic energy (has simple closed form)

        if step_id &lt; (n_leapfrog_steps - 1):
            # TODO: full step update of momentum
            pass  # placeholder so the template parses; replace with your update
        else:
            # Special case for final step
            # TODO: half step update of momentum
            pass  # placeholder so the template parses; replace with your update

    # TODO: don't forget to flip sign of momentum (ensure symmetry)

    return prop_bnn_params, prop_momentum_vec
</pre></div>

<h2><a name="debugging-tips">Debugging Tips</a></h2>
<p>Here are some tricks to simplify your life if you are having trouble:</p>
<ul>
<li>
<p>Before you try HMC, see if you can get a simpler random walk Metropolis-Hastings sampler to work. This should be quite easy (just a few lines of code), but would let you verify that you understand how to evaluate acceptance probabilities for the BNN model.</p>
</li>
<li>
<p>Before you try HMC on a network with hidden units, try it with 0 hidden layers (equivalent to Bayesian linear regression). This should be a much simpler model with just two scalar random variables (weight and bias). </p>
</li>
</ul>
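<p>The two tips above can be combined into one small experiment: a random walk Metropolis sampler for the model with 0 hidden layers. The sketch below is only one way to set this up. It assumes the Problem 2 training data, a Normal(0, 1) prior on the single weight and bias, and <code>sigma=0.1</code>. The function names, the proposal standard deviation of 0.05, and the all-zeros initialization are arbitrary choices for illustration, and the potential energy drops additive constants that cancel in the acceptance ratio.</p>

```python
import numpy as np

x_train_N = np.asarray([-2., -1.8, -1., 1., 1.8, 2.])
y_train_N = np.asarray([-3., 0.2224, 3., 3., 0.2224, -3.])

def calc_potential_energy(params, sigma=0.1):
    """Negative log joint of linear model f(x) = b + w x, up to a constant."""
    w, b = params
    f_N = b + w * x_train_N
    neg_log_prior = 0.5 * (w ** 2 + b ** 2)
    neg_log_lik = 0.5 * np.sum(np.square(y_train_N - f_N)) / sigma ** 2
    return neg_log_prior + neg_log_lik

def run_random_walk_sampler(n_iters=2000, prop_std=0.05, random_seed=42):
    prng = np.random.RandomState(random_seed)
    cur_params = np.zeros(2)
    cur_energy = calc_potential_energy(cur_params)
    samples = np.zeros((n_iters, 2))
    n_accept = 0
    for t in range(n_iters):
        # Symmetric Gaussian proposal centered at the current state
        prop_params = cur_params + prop_std * prng.randn(2)
        prop_energy = calc_potential_energy(prop_params)
        # Accept with probability min(1, exp(cur_energy - prop_energy))
        if np.log(prng.rand()) < cur_energy - prop_energy:
            cur_params, cur_energy = prop_params, prop_energy
            n_accept += 1
        samples[t] = cur_params
    return samples, n_accept / float(n_iters)

samples, accept_rate = run_random_walk_sampler()
```

<p>Because this model is Bayesian linear regression, the posterior is Gaussian and can be worked out in closed form, which gives you something concrete to compare the sampled weight and bias against before moving on to HMC.</p>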