<h2>Background</h2>
<p>We have now learned about black-box variational inference (BBVI); see the [<a href="https://arxiv.org/pdf/1401.0118.pdf#page=3">Alg. 1 pseudocode on Page 3 of the BBVI arXiv paper</a>]. In this HW, we will implement our own version of BBVI and use it to fit a neural network univariate regression model.</p>
<p>We assume the NN has <span class="math">\(L\)</span> total layers (including the output layer and <span class="math">\(L-1\)</span> hidden layers). Each layer <span class="math">\(\ell\)</span> of the network has a set of <span class="math">\(J^{(\ell)}\)</span> weights <span class="math">\(w^{(\ell)}\)</span> and a set of <span class="math">\(K^{(\ell)}\)</span> scalar biases <span class="math">\(b^{(\ell)}\)</span>.</p>
<p>As in the previous homework, we assume the network takes an input <span class="math">\(x_n\)</span> (we'll assume it is univariate), and produces a predicted scalar output via a feed-forward function <span class="math">\(f(x_n, w, b) \in \mathbb{R}\)</span>.</p>
<h4>Prior</h4>
<p>Each weight and bias has an independent univariate Gaussian prior:</p>
<div class="math">\begin{align}
p(w) &amp;= 
    \prod_{\ell=1}^{L} \prod_{j=1}^{J^{(\ell)}} \mathcal{N}(w_j | 0, 1)
\\
p(b) &amp;= 
    \prod_{\ell=1}^{L} \prod_{k=1}^{K^{(\ell)}} \mathcal{N}(b_k | 0, 1)
\end{align}</div>
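<p>As a minimal sketch of how this prior could be evaluated in code, the snippet below sums independent standard Normal log densities over flat arrays of weights and biases. The function names and the flat-array layout are assumptions made for illustration, not a required interface.</p>

```python
import numpy as np

def calc_log_pdf_std_normal(z):
    """Elementwise log density of a standard Normal(0, 1) evaluated at z."""
    return -0.5 * np.log(2.0 * np.pi) - 0.5 * np.square(z)

def calc_log_prior(w_J, b_K):
    """Sum of independent N(0, 1) log densities over all weights and biases."""
    return (np.sum(calc_log_pdf_std_normal(w_J))
            + np.sum(calc_log_pdf_std_normal(b_K)))

# Example: with every parameter at zero, each of the 5 terms
# contributes exactly -0.5 * log(2 * pi).
log_prior = calc_log_prior(np.zeros(3), np.zeros(2))
```

<p>Working with sums of log densities (rather than products of densities) avoids numerical underflow when there are many parameters.</p>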
<h4>Likelihood</h4>
<p>We explain the <span class="math">\(N\)</span> total examples we observe -- each indexed by <span class="math">\(n\)</span> and consisting of a pair of output/input <span class="math">\(\{y_n, x_n\}\)</span> values -- via a Gaussian likelihood, with mean given by the scalar network output value <span class="math">\(f(x_n, w, b)\)</span> and standard deviation given by known hyperparameter <span class="math">\(\sigma &gt; 0\)</span>:</p>
<div class="math">$$
p(y| x, w, b) = \prod_{n=1}^N \mathcal{N}(y_n | f(x_n, w, b), \sigma^2)
$$</div>
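<p>The log of this likelihood can be sketched as follows, assuming the network outputs <span class="math">\(f(x_n, w, b)\)</span> have already been computed and stored in an array. The name <code>calc_log_likelihood</code> and its arguments are hypothetical.</p>

```python
import numpy as np

def calc_log_likelihood(y_N, f_N, sigma):
    """Sum over examples of log N(y_n | f_n, sigma^2)."""
    resid_N = y_N - f_N
    return np.sum(
        -0.5 * np.log(2.0 * np.pi)
        - np.log(sigma)
        - 0.5 * np.square(resid_N / sigma))

# Example: when predictions match the observations exactly,
# only the normalization constants remain.
y_N = np.asarray([0.0, 1.0, 2.0])
log_lik = calc_log_likelihood(y_N, f_N=y_N, sigma=0.1)
```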
<h4>Target Posterior</h4>
<p>Our goal is to estimate the posterior over the weights and biases given the training data:</p>
<div class="math">$$
p(w, b | y, x)
$$</div>
<p>Knowing this posterior would let us make useful estimates, such as the value of <span class="math">\(y_*\)</span> at a new test input <span class="math">\(x_*\)</span>.</p>
<p>Sadly, we don't have a way to compute this posterior in closed form. It is <em>intractable</em> when the model contains hidden layers with non-linear activations.</p>
<h4>Approximate Posterior</h4>
<p>To gain tractability, we'll try to solve a <em>simpler</em> problem. We will look at possible <strong>approximate posteriors</strong> that have a more manageable form.</p>
<p>Let us assume each scalar weight and bias parameter has an independent Normal approximate posterior with a mean value <span class="math">\(m_j \in \mathbb{R}\)</span> and a <strong>log-standard-deviation</strong> value <span class="math">\(s_j \in \mathbb{R}\)</span> that are free parameters.</p>
<div class="math">$$
q(w, b | m, s) = \prod_{\ell=1}^L
    \Big(
        \prod_{j=1}^{J^{\ell}} \mathcal{N}(
            w_j \mid \tilde{m}^{\ell}_j, (e^{\tilde{s}^{\ell}_j})^2 )
    \Big)
    \Big(
        \prod_{k=1}^{K^{\ell}} \mathcal{N}(
            b_k \mid \bar{m}^{\ell}_k, (e^{\bar{s}^{\ell}_k})^2 )
    \Big)
$$</div>
<p>We use bars over variables to indicate those that belong to biases, and tildes to denote the variables that belong to weights. For convenience, when it is clear from context we'll just use <span class="math">\(m = [\tilde{m}, \bar{m}]\)</span> to denote the means of all weights and biases, and <span class="math">\(s = [\tilde{s}, \bar{s}]\)</span> to denote the LOG standard deviations for all weights and biases.</p>
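<p>To make the log-standard-deviation parameterization concrete, here is a small sketch that draws samples from <span class="math">\(q\)</span> and evaluates <span class="math">\(\log q\)</span>, treating all weights and biases as one flat vector of size <code>D</code>. The function names and the example values of <code>m_D</code> and <code>s_D</code> are illustrative assumptions.</p>

```python
import numpy as np

def sample_from_q(m_D, s_D, n_samples, prng):
    """Draw n_samples vectors from q: independent N(m_d, exp(s_d)^2)."""
    eps_SD = prng.standard_normal(size=(n_samples, m_D.size))
    return m_D + np.exp(s_D) * eps_SD

def calc_log_q(z_SD, m_D, s_D):
    """Log density of each sampled vector under q, summed over dimensions."""
    log_pdf_SD = (
        -0.5 * np.log(2.0 * np.pi)
        - s_D                      # log of the standard deviation exp(s_d)
        - 0.5 * np.square((z_SD - m_D) / np.exp(s_D)))
    return np.sum(log_pdf_SD, axis=1)

prng = np.random.default_rng(0)
m_D = np.asarray([1.0, 0.0])             # illustrative means
s_D = np.log(np.asarray([0.1, 0.1]))     # illustrative LOG standard deviations
z_SD = sample_from_q(m_D, s_D, n_samples=5, prng=prng)
log_q_S = calc_log_q(z_SD, m_D, s_D)
```

<p>Because <span class="math">\(s\)</span> can be any real number while <span class="math">\(e^{s}\)</span> is always positive, gradient steps on <span class="math">\(s\)</span> never produce an invalid standard deviation.</p>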
<p>At any specific <span class="math">\(m\)</span> and <span class="math">\(s\)</span> values, we have a valid distribution over weights and biases. Certain values of <span class="math">\(m,s\)</span> will give us distributions that are closer to the true posterior than others. Our goal is to find the values that bring us as close as possible.</p>
<h4>VI Objective Function to Maximize</h4>
<div class="math">\begin{align}
\mathcal{L}(m,s) &amp;= \mathbb{E}_{q(w,b|m,s)}
\Big[
    \log p(y | x, w, b) + \log p(w,b) - \log q(w,b|m,s)
\Big]
\end{align}</div>
<p>The function <span class="math">\(\mathcal{L}(m,s)\)</span> takes in possible mean and log-std-dev parameters <span class="math">\(m,s\)</span>, and produces a scalar real value. </p>
<p>The readings discuss how this function is a "lower bound" on the log marginal likelihood <span class="math">\(\log p(y | x) = \log \int \int p(y, w, b | x) \, dw \, db\)</span>. We wish to maximize this lower bound (make it as tight as possible), by finding the best possible values of the parameters <span class="math">\(m,s\)</span> for our approximate posterior.</p>
<h4>VI Loss Function (Objective to Minimize)</h4>
<p>Often, we are interested in framing inference as a minimization problem (not maximization). Especially in deep learning, minimization is the common goal of optimization toolboxes. </p>
<p>We can thus define our "VI loss function" (what we want to minimize) as just -1.0 times the lower bound objective above.</p>
<p>"VI loss function": <span class="math">\(-1.0 \cdot \mathcal{L}(m,s)\)</span></p>
<h4>Training Optimization Problem</h4>
<p>Our goal is to learn a set of <span class="math">\(m,s\)</span> values which make the approximate posterior as close as possible to our true intractable posterior. Our readings show this is equivalent to solving the following optimization problem:</p>
<div class="math">\begin{align}
\min_{m,s}
\quad &amp; - \mathcal{L}(m,s)
\end{align}</div>
<p>We can then use:</p>
<ul>
<li>Monte Carlo methods to evaluate the expectation</li>
<li>Score Function trick to evaluate the <strong>gradient</strong> of the expectation</li>
</ul>
<p>These two ideas, together, allow us to do BBVI inference.</p>
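<p>To illustrate both ideas on a toy problem (not the BNN objective itself), the sketch below estimates <span class="math">\(\mathbb{E}_{q(z|m,s)}[z^2]\)</span> and its derivative with respect to <span class="math">\(m\)</span>. It relies on the score function identity <span class="math">\(\nabla_m \mathbb{E}_{q}[g(z)] = \mathbb{E}_{q}[g(z) \nabla_m \log q(z|m,s)]\)</span>, which holds when <span class="math">\(g\)</span> itself does not depend on <span class="math">\(m\)</span>. The choice <span class="math">\(g(z) = z^2\)</span> and all names are assumptions picked because the exact answers, <span class="math">\(m^2 + e^{2s}\)</span> and <span class="math">\(2m\)</span>, are known in closed form.</p>

```python
import numpy as np

def estimate_value_and_grad_wrt_m(m, s, n_samples, prng):
    """MC estimates of E_q[z^2] and d/dm E_q[z^2], with q = N(m, exp(s)^2)."""
    std = np.exp(s)
    z_S = m + std * prng.standard_normal(n_samples)   # samples from q
    g_S = np.square(z_S)                              # toy integrand g(z) = z^2
    score_m_S = (z_S - m) / std**2                    # d/dm log q(z | m, s)
    est_value = np.mean(g_S)
    est_grad_m = np.mean(g_S * score_m_S)             # score function estimator
    return est_value, est_grad_m

prng = np.random.default_rng(0)
m, s = 1.0, np.log(0.1)
est_value, est_grad_m = estimate_value_and_grad_wrt_m(
    m, s, n_samples=200000, prng=prng)

exact_value = m**2 + np.exp(2.0 * s)   # closed form for E_q[z^2]
exact_grad_m = 2.0 * m                 # closed form for its derivative in m
```

<p>Even on this toy problem, the gradient estimate is far noisier than the value estimate at the same number of samples, which is worth keeping in mind when choosing how many samples to use.</p>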
<h2><a name="problem-1">Problem 1: Estimating expectations and gradients of expectations with Monte Carlo and the Score Function Trick </a></h2>
<p>For this problem only, consider a simple linear regression model, equivalent to a BNN with zero hidden layers and zero hidden units. The random variables are one weight parameter <span class="math">\(w\)</span> and one bias parameter <span class="math">\(b\)</span>, both scalars.</p>
<div class="math">$$
    f(x_n, w, b) \triangleq w x_n + b
$$</div>
<p>Thus, our approximate posterior will have four parameters, a pair <span class="math">\(\tilde{m},\tilde{s}\)</span> for <span class="math">\(q(w)\)</span> and a pair <span class="math">\(\bar{m}, \bar{s}\)</span> for <span class="math">\(q(b)\)</span>.</p>
<p>Consider the dataset of <span class="math">\(N=5\)</span> example points:</p>
<div class="highlight"><pre><span></span>x_train_N = np.asarray([-5.0,  -2.50, 0.00, 2.50, 5.0])
y_train_N = np.asarray([-4.91, -2.48, 0.05, 2.61, 5.09])
</pre></div>


<p>Observe that our proposed linear model works very well for this problem if the slope is near 1 and the bias is near 0. This leads us to the following "near-ideal" values for the approximate posteriors <span class="math">\(q(w)\)</span> and <span class="math">\(q(b)\)</span>.</p>
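<p>As a quick sanity check of this observation (separate from BBVI itself), one could fit an ordinary least-squares line to the data with <code>np.polyfit</code> and confirm that the slope is close to 1 and the intercept is close to 0.</p>

```python
import numpy as np

x_train_N = np.asarray([-5.0,  -2.50, 0.00, 2.50, 5.0])
y_train_N = np.asarray([-4.91, -2.48, 0.05, 2.61, 5.09])

# Degree-1 polynomial fit returns coefficients as [slope, intercept]
slope, intercept = np.polyfit(x_train_N, y_train_N, deg=1)
```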
<h5>Near-Ideal parameter values for Problem 1</h5>
<p>You can assume that <span class="math">\(q(w)\)</span> and <span class="math">\(q(b)\)</span> are close to ideal if we set parameters such that:</p>
<ul>
<li><span class="math">\(\tilde{m}\)</span>, mean of <span class="math">\(w\)</span> = 1.0</li>
<li><span class="math">\(\tilde{s}\)</span>, log stddev of <span class="math">\(w = \log(0.1)\)</span></li>
<li><span class="math">\(\bar{m}\)</span>, mean of <span class="math">\(b\)</span> = 0.0</li>
<li><span class="math">\(\bar{s}\)</span>, log stddev of <span class="math">\(b = \log(0.1)\)</span></li>
</ul>
<h5>Functions needed for Problem 1</h5>
<p>Write a function to estimate the VI loss function defined above, <span class="math">\(-1.0 \cdot \mathcal{L}(\tilde{m},\tilde{s}, \bar{m}, \bar{s})\)</span>, using a given number of Monte Carlo (MC) samples.</p>
<p>Write another function to estimate the <em>gradient</em> of the VI loss function using the score function trick and a given number of MC samples.</p>
<p>We recommend making these functions general-purpose (not restricted to linear models), so you can use them later in Problem 2. But if it helps, you can first write them for the linear case.</p>
<h5>Instructions for Problem 1</h5>
<p>Your PDF report should include the following labeled parts:</p>
<p>a. 1 row x 4 column plot of the estimated VI loss</p>
<ul>
<li>Each column should show results using 1, 10, 100, 1000 samples</li>
<li>Each panel should show the estimated value of <span class="math">\(-\mathcal{L}(\tilde{m},\tilde{s},\bar{m}, \bar{s})\)</span> across a list of possible input parameters:
<ul>
<li>Fix <span class="math">\(\tilde{s}, \bar{m}, \bar{s}\)</span> to their Near-Ideal values listed above</li>
<li>Let <span class="math">\(\tilde{m}\)</span> (the mean of <span class="math">\(w\)</span> under <span class="math">\(q\)</span>) vary across a grid of 20 possible values <code>np.linspace(-3.0, 5.0, 20)</code></li>
</ul>
</li>
</ul>
<p>b. Short Answer: Describe any trends between number of samples and accuracy. How trustworthy is the 1-sample estimate?</p>
<p>c. 1 row x 4 column plot of the estimated <em>gradient</em> of the VI loss w.r.t <span class="math">\(\tilde{m}\)</span></p>
<ul>
<li>Each column should show results using 1, 10, 100, 1000 samples</li>
<li>Each panel should show the estimated value of the <em>gradient</em> of 
<span class="math">\(-\mathcal{L}\)</span> with respect to the scalar <span class="math">\(\tilde{m}\)</span>, across the same range of parameters in part a.</li>
</ul>
<p>d. Short Answer: Does the plot from Part 1c look plausible as the gradient of the plots in Part 1a? Describe any trends between number of samples and accuracy for the Part 1c plot. How trustworthy is the 1-sample estimate? About how many samples does it take for the gradient to appear accurate (and thus trustworthy)?</p>
<h2><a name="problem-2">Problem 2: Fit an Approximate Posterior using your own BBVI implementation </a></h2>
<p>Consider the following training data with <span class="math">\(N=6\)</span> example pairs of <span class="math">\(x\)</span> and <span class="math">\(y\)</span> values (as in HW2):</p>
<div class="highlight"><pre><span></span>x_train_N = np.asarray([-2.,    -1.8,   -1.,  1.,  1.8,     2.])
y_train_N = np.asarray([-3.,  0.2224,    3.,  3.,  0.2224, -3.])
</pre></div>


<p>Your goal is to write an implementation of Black-box Variational Inference (BBVI) that fits an approximate posterior over the weights and biases of a Bayesian neural network (BNN) given this data.</p>
<p>You may use automatic differentiation tools (like autograd, PyTorch, or Tensorflow) as you wish. We recommend autograd as a simple option that has been demonstrated in class.</p>
<p>You should think carefully about how you set hyperparameters such as the step size (aka learning rate) and number of Monte Carlo samples.</p>
<p>You may refer to existing BBVI implementations you find online for high-level understanding, but you <em>must</em> write your own code and be able to defend any code you submit as your original work. Working with a small set of fellow students from this class (at most 2 partners) is encouraged, provided you abide by the collaboration policy.</p>
<p><em>Model Implementation Details</em>:</p>
<ul>
<li>Fix the BNN architecture to 1 hidden layer of 10 units, with <code>tanh</code> as the activation function.</li>
<li>Use a Normal(0, 1) prior for all weights and biases.</li>
<li>Use <code>sigma=0.1</code> for the final output layer's Gaussian noise level.</li>
</ul>
<p><em>Approximate Posterior Family Details</em>:</p>
<ul>
<li>As indicated in the 'Background' above, you should learn a mean and log-standard-deviation parameter for each weight and bias.</li>
<li>Initialize weight means to draws from a standard normal distribution</li>
<li>Initialize bias means to draws from a standard normal distribution</li>
<li>Initialize weight standard deviations to 1</li>
<li>Initialize bias standard deviations to 1</li>
</ul>
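<p>One possible way to set up this initialization is sketched below. It assumes all weights and biases are stored in a single flat vector (weights first, then biases) and that the output layer has its own bias, as in the Background. The function name and layout are hypothetical.</p>

```python
import numpy as np

def init_variational_params(n_hidden=10, prng=None):
    """Initialize means and log-std-devs for a 1-hidden-layer BNN with scalar input/output."""
    prng = np.random.default_rng() if prng is None else prng
    n_weights = 1 * n_hidden + n_hidden * 1   # input-to-hidden plus hidden-to-output
    n_biases = n_hidden + 1                   # hidden biases plus one output bias
    n_params = n_weights + n_biases
    m_D = prng.standard_normal(n_params)      # means drawn from a standard normal
    s_D = np.zeros(n_params)                  # log(1.0) = 0, so std dev starts at 1
    return m_D, s_D

m_D, s_D = init_variational_params(n_hidden=10, prng=np.random.default_rng(101))
```

<p>Using a different seed for each run gives the independent random initializations requested in the Algorithm Details.</p>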
<p><em>Algorithm Details</em>:</p>
<ul>
<li>Run at least 3 separate runs from independent random initializations, each for at least 2000 total iterations (or until convergence).</li>
<li>You can use a constant step size (aka learning rate). This needs some tuning, but something around 1e-5 might be OK. Probably 1e-8 is too small, and 0.01 is too big. </li>
</ul>
<p><em>Instructions:</em> For Problem 2, your report PDF should include:</p>
<p>a. Plot of the Monte-Carlo-estimated optimization objective <span class="math">\(\mathcal{L}(m,s)\)</span> (aka evidence lower bound aka 'ELBO') vs. iteration, for each of your runs. This should be one plot with multiple lines. Write a clear caption indicating the number of Monte Carlo samples and step size used, as well as 1-2 sentences summarizing your strategy for selecting these values.</p>
<p>b. Make a 1x3 grid of plots, each one showing sampled function values <span class="math">\(f\)</span> from each of your independent runs. Show 10 samples per plot.</p>
<blockquote>
<p>You can draw samples from your posterior <span class="math">\(w^s,b^s \sim q(w,b)\)</span>, and use these to produce function values <span class="math">\(f(x_g, w^s, b^s)\)</span>. Use an evaluation grid of <span class="math">\(x_g\)</span> values such that <code>x_grid_G = np.linspace(-20, 20, 200)</code>, like in previous homeworks.</p>
</blockquote>
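<p>A sketch of how sampled function values could be produced on the grid is shown below. It assumes a flat parameter vector laid out as input-to-hidden weights, hidden-to-output weights, hidden biases, then the output bias. The values of <code>m_D</code> and <code>s_D</code> here are placeholders rather than a fitted posterior, and <code>predict_f_G</code> is a hypothetical helper.</p>

```python
import numpy as np

def predict_f_G(x_G, params_D, n_hidden=10):
    """Forward pass: 1 input -> n_hidden tanh units -> 1 output."""
    H = n_hidden
    w1_H = params_D[0:H]              # input-to-hidden weights
    w2_H = params_D[H:2 * H]          # hidden-to-output weights
    b1_H = params_D[2 * H:3 * H]      # hidden biases
    b2 = params_D[3 * H]              # output bias
    hidden_GH = np.tanh(x_G[:, np.newaxis] * w1_H + b1_H)
    return np.dot(hidden_GH, w2_H) + b2

x_grid_G = np.linspace(-20, 20, 200)
prng = np.random.default_rng(0)

# Placeholder variational parameters (NOT a fitted posterior)
m_D = np.zeros(31)
s_D = np.log(0.1) * np.ones(31)

# Draw 10 parameter vectors from q and evaluate f on the grid for each
f_SG = np.stack([
    predict_f_G(x_grid_G, m_D + np.exp(s_D) * prng.standard_normal(31))
    for _ in range(10)])
```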
<p>c. Make a 1x3 grid of plots, each one showing a filled posterior plot of the posterior (mean +/- 2 std dev) from each of your independent runs. Show one plot per run/chain.</p>
<blockquote>
<p>Filled plots have a line for the learned mean, and show fill at +/- 2 standard deviations. See matplotlib's <a href="https://matplotlib.org/gallery/recipes/fill_between_alpha.html">fill_between</a> function.</p>
</blockquote>
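<p>One way to obtain the mean and band for such a plot is to summarize many sampled functions at each grid point, as sketched below. This assumes an array <code>f_SG</code> of sampled function values with one row per sample; the array here is a random stand-in, and the band reflects only the spread of <span class="math">\(f\)</span>, which is one possible interpretation of the plot.</p>

```python
import numpy as np

def summarize_band(f_SG, n_std=2.0):
    """Pointwise mean and mean +/- n_std standard deviations across samples."""
    mean_G = np.mean(f_SG, axis=0)
    std_G = np.std(f_SG, axis=0)
    return mean_G, mean_G - n_std * std_G, mean_G + n_std * std_G

x_grid_G = np.linspace(-20, 20, 200)
prng = np.random.default_rng(0)
f_SG = prng.standard_normal(size=(500, x_grid_G.size))   # stand-in samples only

mean_G, lower_G, upper_G = summarize_band(f_SG)
```

<p>The three arrays can then be drawn on a matplotlib axis <code>ax</code> with <code>ax.plot(x_grid_G, mean_G)</code> and <code>ax.fill_between(x_grid_G, lower_G, upper_G, alpha=0.3)</code>.</p>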
<p>d. Short answer (1-3 sentences): How does this result compare to the HMC posterior from HW2? What is different? Which one is a better fit? Which method is more reliable?</p>
<h2><a name="debugging-tips">Debugging Tips</a></h2>
<p>Here are some tricks to simplify your life if you are having trouble:</p>
<ul>
<li>
<p>Before you try fitting BBVI on a network with hidden units, try it with 0 hidden layers (equivalent to Bayesian linear regression). </p>
</li>
<li>
<p>Before you optimize all the weight means and weight log-std-devs, consider trying to optimize <em>only</em> the weight means or <em>only</em> the log-std-devs, holding others fixed. You can easily modify your gradient descent to do this just by masking the right gradient values with zeros.</p>
</li>
</ul>
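<p>The masking trick from the second tip might look like the sketch below, where a 0/1 mask decides which entries of the parameter vector are updated. The ordering <code>[m_w, s_w, m_b, s_b]</code> and the gradient values are made up for illustration.</p>

```python
import numpy as np

def masked_gradient_step(params_D, grad_D, mask_D, step_size):
    """One gradient descent step that only updates entries where mask_D is 1."""
    return params_D - step_size * mask_D * grad_D

# Hypothetical ordering: [m_w, s_w, m_b, s_b]
params_D = np.asarray([1.0, -2.0, 0.0, -2.0])
grad_D = np.asarray([0.5, 0.5, 0.5, 0.5])     # made-up gradient values
mask_D = np.asarray([1.0, 0.0, 1.0, 0.0])     # update means only, freeze log-std-devs

new_params_D = masked_gradient_step(params_D, grad_D, mask_D, step_size=0.1)
```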
<h2>Bonus: Real-world Datasets</h2>
<p>Here are some possible datasets you could play with if you are curious. Each one is a 'real' dataset with a univariate input <span class="math">\(x\)</span> and univariate output <span class="math">\(y\)</span>, so you could easily fit a BNN to the dataset using your existing code.</p>
<h5>Carbon dioxide over time</h5>
<p>We measure CO2 over time at Mauna Loa observatory. The input <span class="math">\(x\)</span> is the time since 1958, the output <span class="math">\(y\)</span> is the CO2 concentration in parts-per-million.</p>
<ul>
<li>[<a href="http://scrippsco2.ucsd.edu/data/atmospheric_co2/primary_mlo_co2_record">Dataset description</a>]</li>
<li>[<a href="http://scrippsco2.ucsd.edu/assets/data/atmospheric/stations/in_situ_co2/weekly/weekly_in_situ_co2_mlo.csv">Dataset file (.csv)</a>]</li>
<li>Key question: What values should we forecast in year 2020?</li>
</ul>
<h5>NBA Basketball heights and weights</h5>
<p>Height (<span class="math">\(x\)</span>) vs. weight (<span class="math">\(y\)</span>) for Professional Basketball (NBA) Players</p>
<ul>
<li>[<a href="http://users.stat.ufl.edu/~winner/data/nba_ht_wt.dat">Dataset file (.dat)</a>]</li>
<li>Key question: Given a new player's height, how accurately can we predict weight?</li>
</ul>
<h5>Revenue of Harry Potter films over time.</h5>
<p>At each week (<span class="math">\(x\)</span>) since the film's release, we have the total revenue in dollars <span class="math">\(y\)</span>.</p>
<ul>
<li>[<a href="http://users.stat.ufl.edu/~winner/data/harrypotter.txt">Dataset description</a>]</li>
<li>[<a href="http://users.stat.ufl.edu/~winner/data/harrypotter.csv">Dataset file (.csv)</a>]</li>
<li>Key question: Given data from weeks 1-15, how accurately can we predict revenue in week 16 or 17? How much worse would we do if we only had weeks 1-10?</li>
</ul>
<h2>Bonus: Models and Algorithms</h2>
<p>Here are some ideas you could play with if you have leftover time:</p>
<ul>
<li>
<p>Is there a smart way to initialize that might lead to faster convergence / better fits?</p>
</li>
<li>
<p>Try using a Robbins-Monro sequence of step sizes. Does this help? Do you notice any differences?</p>
</li>
<li>
<p>Could you use some version of line search to pick the step size at each step in a smart way? (Google: Backtracking line search).</p>
</li>
</ul>
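<p>For the Robbins-Monro idea, one common family of decaying step sizes is sketched below. A Robbins-Monro sequence needs step sizes whose sum diverges while the sum of their squares converges, which this form satisfies when the decay exponent lies in (0.5, 1]. The function name and default values are assumptions and would need tuning.</p>

```python
import numpy as np

def robbins_monro_step_size(t, init_step=1e-3, delay=10.0, decay=0.7):
    """Step size at iteration t (t = 0, 1, 2, ...): init_step * (1 + t / delay)^(-decay)."""
    if not (0.5 < decay <= 1.0):
        raise ValueError("decay must lie in (0.5, 1] for a Robbins-Monro sequence")
    return init_step * (1.0 + t / delay) ** (-decay)

# Step sizes for the first 2000 iterations
step_sizes_T = np.asarray([robbins_monro_step_size(t) for t in range(2000)])
```

<p>The <code>delay</code> value controls how long the step size stays near its initial value before the decay takes over.</p>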