
<h2>Software Notes</h2>
<p>Use the starter code: <a href="https://github.com/tufts-ml/comp150_bdl_2018f_public/tree/master/homeworks/hw4_starter">https://github.com/tufts-ml/comp150_bdl_2018f_public/tree/master/homeworks/hw4_starter</a></p>
<p>Make sure you have the latest version of the starter code.</p>
<p>Latest updates (with timestamp):</p>
<ul>
<li>20181004 20:00 (fix import statement in hw4starter_ae.py)</li>
</ul>
<p>You'll need to fill in all the TODO statements to make it work!</p>
<h4>Requires: PyTorch</h4>
<p>You'll need to install <em>PyTorch</em> to use the starter code. See the instructions here for a complete conda environment: <a href="../setup_python_env.html#pytorch">setup_python_env.html#pytorch</a>, or visit the PyTorch website: <a href="https://pytorch.org/">https://pytorch.org/</a> and find your own way.</p>
<h2>Background</h2>
<p>In class, we have learned about approximating posteriors with variational inference, using the reparameterization trick for VI (e.g. in the Bayes-by-Backprop algorithm), and deep generative models for images using variational autoencoders. It's now time to try to bring these ideas together.</p>
<p>In HW4, you will combine these ideas to train models that can generate images of handwritten digits from the MNIST dataset. First, you'll directly train autoencoders for images via maximum likelihood methods. Next, you'll compare these results to a more Bayesian approach, the VAE.</p>
<h3>Data</h3>
<p>We consider the MNIST dataset of carefully cropped images of hand-written digits. We will use a version where each image is a 28 x 28 <strong>binary</strong> image. For more background, see: <a href="https://en.wikipedia.org/wiki/MNIST_database">https://en.wikipedia.org/wiki/MNIST_database</a>.</p>
<p>We observe <span class="math">\(N\)</span> total training examples of images of hand-written digits. We'll index each image with <span class="math">\(n\)</span>.</p>
<p>Each image can be defined by a 784-dimensional <em>binary</em> vector called <span class="math">\(x_n\)</span>.
To make things more general, we can say that <span class="math">\(x_n\)</span> has size <span class="math">\(P\)</span>, where <span class="math">\(P=784\)</span> in the case of MNIST and in general counts the number of pixels in the image.</p>
<p>FYI, we can always reshape the vector <span class="math">\(x_n\)</span> into a 28x28 binary image for display purposes. But it'll be easiest to work with a flat vector.</p>
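<p>As a minimal sketch of that reshaping, the snippet below uses NumPy and a randomly generated stand-in vector (not real MNIST data). The row-major pixel order is an assumption about how the images were flattened.</p>

```python
import numpy as np

P = 28 * 28  # number of pixels per image

# Hypothetical stand-in for one binary image vector x_n (random bits, not real MNIST).
rng = np.random.default_rng(0)
x_n = rng.integers(0, 2, size=P).astype(np.float32)

# Reshape the flat vector into a 28x28 grid for display purposes.
img = x_n.reshape(28, 28)

# Flatten the grid again to recover the original 784-dim vector.
x_back = img.reshape(-1)
```

<p>An array like <code>img</code> can be handed to an image-display routine, while <code>x_back</code> is the flat form that the networks consume.</p>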
<h3>Neural Networks : Ingredients for Probabilistic Modeling</h3>
<h4>Decoder Neural Network: From Code Vector to Image</h4>
<p>We will use a <em>decoder</em> feed-forward neural network to map each possible 2D code <span class="math">\(z_n\)</span> to a 784-dim vector of probabilities (each entry is a real value between 0 and 1). </p>
<div class="math">\begin{align}
\text{decode}(z_n, \theta) : \mathbb{R}^2 \rightarrow (0,1)^{784}
\end{align}</div>
<p>We will always denote the parameters (weights and biases) of the decoder as a vector <span class="math">\(\theta\)</span>. The size of this vector depends on the architecture.</p>
<h5>Possible Decoder Architectures</h5>
<p>Decoder Arch. D1-X:</p>
<ul>
<li>1 Hidden Layer with X units + ReLU activation</li>
<li>1 Output Layer with 784 units + sigmoid activation (so each entry is between 0.0 and 1.0)</li>
</ul>
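<p>One way arch. D1-X might be written in PyTorch is sketched below. The class name <code>DecoderD1</code> and its argument names are hypothetical, and the starter code may organize its layers differently.</p>

```python
import torch
from torch import nn


class DecoderD1(nn.Module):
    """Sketch of arch. D1-X: 2D code, then X hidden units (ReLU), then 784 probabilities."""

    def __init__(self, n_hidden=32, n_code=2, n_pixels=784):
        super().__init__()
        self.hidden = nn.Linear(n_code, n_hidden)
        self.output = nn.Linear(n_hidden, n_pixels)

    def forward(self, z):
        h = torch.relu(self.hidden(z))
        # Sigmoid squashes each output entry into the interval (0, 1).
        return torch.sigmoid(self.output(h))


torch.manual_seed(0)
decoder = DecoderD1(n_hidden=32)
z = torch.randn(5, 2)      # a batch of 5 codes
xproba = decoder(z)        # shape (5, 784), one probability per pixel

# Total number of entries in the parameter vector theta for this architecture.
n_params = sum(p.numel() for p in decoder.parameters())
```

<p>With X=32 hidden units, this sketch has 2*32 + 32 + 32*784 + 784 = 25,968 parameters, which illustrates how the size of <span class="math">\(\theta\)</span> depends on the architecture.</p>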
<h4>Encoder Neural Network: From Image to Code Vector</h4>
<p>We will use an <em>encoder</em> feed-forward neural network to map each possible 784-dimensional binary image vector to a specific 2D code vector <span class="math">\(z_n\)</span>.</p>
<div class="math">\begin{align}
\text{encode}(x_n, \phi) : \{0,1\}^{784} \rightarrow \mathbb{R}^2
\end{align}</div>
<h5>Possible Encoder Architectures</h5>
<p>Encoder Arch. E1-X:</p>
<ul>
<li>1 Hidden Layer with X units + ReLU activation</li>
<li>1 Output Layer with 2 units + no activation (so output is real-valued)</li>
</ul>
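<p>Arch. E1-X could be sketched in PyTorch along the following lines. The class name <code>EncoderE1</code> and the random stand-in batch are hypothetical, not taken from the starter code.</p>

```python
import torch
from torch import nn


class EncoderE1(nn.Module):
    """Sketch of arch. E1-X: 784 pixels, then X hidden units (ReLU), then a 2D code."""

    def __init__(self, n_hidden=32, n_code=2, n_pixels=784):
        super().__init__()
        self.hidden = nn.Linear(n_pixels, n_hidden)
        self.output = nn.Linear(n_hidden, n_code)

    def forward(self, x):
        h = torch.relu(self.hidden(x))
        # No activation on the output layer, so each code entry is any real value.
        return self.output(h)


torch.manual_seed(0)
encoder = EncoderE1(n_hidden=32)
x = torch.bernoulli(torch.full((5, 784), 0.3))  # stand-in batch of 5 binary images
z = encoder(x)                                   # shape (5, 2)

# Total number of entries in the parameter vector phi for this architecture.
n_params = sum(p.numel() for p in encoder.parameters())
```

<p>For X=32 the parameter count is 784*32 + 32 + 32*2 + 2 = 25,186.</p>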
<h3>Goals of Model Fitting</h3>
<p>Our goal in fitting an autoencoder to data is twofold. First, we want to train decoder parameters <span class="math">\(\theta\)</span> and encoder parameters <span class="math">\(\phi\)</span> to have accurate reconstructions. Second, we wish to build a <em>probabilistic model</em> on top of an autoencoder, so that we can reason about our <em>uncertainty</em> over the code space.</p>
<p>For goal 1, we will simply produce a point estimate of the encoder parameters <span class="math">\(\phi\)</span> and decoder parameters <span class="math">\(\theta\)</span> (following the principle of minimizing reconstruction error).</p>
<p>For goal 2, we'd like to be "more Bayesian", so we'll assume a full generative model for our data. This model has a prior on code values and a likelihood of producing a data vector given a code vector. Given observed data, in the usual way we can write a <em>posterior</em> over the code values, and then approximate this posterior using variational inference. In this way, both our encoder and decoder are viewed as "stochastic" operations.</p>
<h3>Goal 1 (Problem 1): Fitting an AE to Minimize Reconstruction Error</h3>
<p>For this first goal, we will ignore the prior. We wish to find the decoder and encoder parameters that, for the training data at hand, minimize reconstruction error.</p>
<h5>Reconstructed probability vector <span class="math">\(\tilde{x}\)</span></h5>
<p>Given an input data vector <span class="math">\(x_n\)</span>, we can "reconstruct" it by encoding it to a code vector, then decoding that code vector. In math:</p>
<div class="math">\begin{align}
\tilde{x}_n = \text{decode}(\text{encode}(x_n, \phi), \theta)
\end{align}</div>
<p>The result, <span class="math">\(\tilde{x}_n\)</span>, is still a vector the same size as <span class="math">\(x_n\)</span>. Under our chosen architectures, it is not binary but instead contains real values between 0.0 and 1.0.</p>
<h5>Minimize Binary Cross Entropy (our chosen reconstruction error)</h5>
<p>We wish to minimize the cross entropy, which is a cost function well-justified by information theory that you can read about on <a href="https://en.wikipedia.org/wiki/Cross_entropy">Wikipedia</a>. The formula is:</p>
<div class="math">\begin{align}
\min_{\theta, \phi} 
    \underbrace{- \Big(
    \sum_{n=1}^N \sum_{p=1}^P
        x_{np} \log \tilde{x}_{np}(x_n, \theta, \phi)
        + (1-x_{np}) \log (1 - \tilde{x}_{np}(x_n, \theta, \phi))
    \Big)}_{\text{binary_cross_entropy}(x, \tilde{x})}
\end{align}</div>
<p>This is a measure of <em>error</em>: a good autoencoder will have low "bce".</p>
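<p>As a quick check of the formula, the sketch below computes the summed binary cross entropy two ways: by transcribing the double sum directly, and with PyTorch's <code>binary_cross_entropy</code> using <code>reduction='sum'</code>. The data and reconstructions are random stand-ins, not output from a trained model.</p>

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
N, P = 4, 784
x = torch.bernoulli(torch.full((N, P), 0.3))   # hypothetical binary data
xproba = torch.sigmoid(torch.randn(N, P))      # hypothetical reconstructions in (0, 1)

# Direct transcription: negative sum over all examples n and pixels p.
bce_manual = -torch.sum(
    x * torch.log(xproba) + (1 - x) * torch.log(1 - xproba))

# Same quantity via the built-in function, summed instead of averaged.
bce_builtin = F.binary_cross_entropy(xproba, x, reduction='sum')
```

<p>The two values agree up to floating point rounding. Note the argument order of the built-in: predicted probabilities first, binary targets second.</p>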
<h5>Maximize Bernoulli log likelihood</h5>
<p>In fact, the above is equivalent to maximizing the logpdf of a Bernoulli likelihood that produces a positive binary value <span class="math">\(x_{np}\)</span> with probability <span class="math">\(\tilde{x}_{np}\)</span>:</p>
<div class="math">\begin{align}
\max_{\theta, \phi} \sum_{n=1}^N \sum_{p=1}^P \log \mbox{Bern}
    \Big(
        x_{np} \mid \tilde{x}_{np}(x_n, \theta, \phi) \Big)
\end{align}</div>
<p>Note that this "likelihood" is not a valid probabilistic model (we cannot have <span class="math">\(\tilde{x}_n\)</span> depend on <span class="math">\(x_n\)</span> while <span class="math">\(x_n\)</span> depends on <span class="math">\(\tilde{x}_n\)</span>). But, it's still a way to set up an optimization problem.</p>
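<p>The equivalence between the cross entropy and the Bernoulli log likelihood can be checked by expanding the log of the Bernoulli probability mass function for a single pixel:</p>
<div class="math">\begin{align}
\log \text{Bern}(x_{np} \mid \tilde{x}_{np})
    &amp;= \log \Big( \tilde{x}_{np}^{x_{np}} (1 - \tilde{x}_{np})^{1 - x_{np}} \Big)
    \\
    &amp;= x_{np} \log \tilde{x}_{np} + (1 - x_{np}) \log (1 - \tilde{x}_{np})
\end{align}</div>
<p>Summing this expression over all examples and pixels gives exactly the negative of the binary cross entropy, so minimizing the cross entropy and maximizing the Bernoulli log likelihood select the same parameters.</p>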
<p>We can solve the reconstruction optimization problem using modern gradient descent tools. In problem 1, you'll try out fitting AE models to minimize reconstruction error and see what tradeoffs might exist in terms of architectures and learning rates.</p>
<h3>Goal 2 (Problem 2): Full Generative Model with a VAE Approximate Posterior</h3>
<h4>Prior on Codes <span class="math">\(z\)</span> (Generative Model Step 1)</h4>
<p>We assume that each image has a latent code vector <span class="math">\(z_n\)</span> of size 2, so <span class="math">\(z_n \in \mathbb{R}^2\)</span>. The prior distribution over this vector's values is just a standard Normal:</p>
<div class="math">\begin{align}
p(z_n) = \text{Normal}(z_n | 0, I)
\end{align}</div>
<h4>Likelihood of Data given Code (Generative Model Step 2)</h4>
<p>Given a decoder, we can write a probabilistic model for generating a data vector <span class="math">\(x_n\)</span> given its code vector <span class="math">\(z_n\)</span>:</p>
<div class="math">\begin{align}
p(x_n | z_n, \theta) = \text{Bern}(x_n | \text{decode}(z_n, \theta) )
\end{align}</div>
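<p>The two generative steps can be sketched as ancestral sampling: draw a code from the prior, decode it to pixel probabilities, then draw each binary pixel. The decoder below is a hypothetical untrained stand-in with arch. D1-32, so its samples will look like noise rather than digits.</p>

```python
import torch
from torch import nn

torch.manual_seed(0)

# Hypothetical stand-in decoder (arch. D1-32) with random, untrained weights.
decoder = nn.Sequential(
    nn.Linear(2, 32), nn.ReLU(),
    nn.Linear(32, 784), nn.Sigmoid())

n_samples = 25
with torch.no_grad():
    z = torch.randn(n_samples, 2)     # step 1: z_n drawn from Normal(0, I)
    xproba = decoder(z)               # pixel probabilities decode(z_n, theta)
    x = torch.bernoulli(xproba)       # step 2: x_n drawn from Bern(decode(z_n, theta))

# Reshape each flat sample into a 28x28 image for display.
images = x.reshape(n_samples, 28, 28)
```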
<p>The ideal result of goal 2 would be a way to estimate the code-given-data posterior: <span class="math">\(p(z_n | x_n)\)</span>. </p>
<p>However, this posterior is difficult to compute, even if we have good estimates of <span class="math">\(\theta\)</span>! Why? The likelihood depends on <span class="math">\(z_n\)</span> through a <strong>non-linear</strong> decoder, so the posterior does not belong to any known density family with easy-to-derive formulas for posterior parameter values. </p>
<p>To gain tractability, we'll try to solve a <em>simpler</em> problem. We will look at possible <strong>approximate posteriors</strong> that have a more manageable form.</p>
<p>We will assume that there is an independent Normal <span class="math">\(q\)</span> distribution for each example <span class="math">\(n\)</span>. We will assume this Normal has a known fixed variance <span class="math">\(\sigma^2\)</span> for all examples, and a mean parameter that is determined by an <strong>encoder</strong> neural network that is shared across all examples. </p>
<div class="math">\begin{align}
q(z_n | x_n) = \text{Normal}( z_n | \text{encode}(x_n, \phi), \sigma^2 I )
\end{align}</div>
<p>FYI this is a <em>simpler</em> approximation than in Kingma &amp; Welling's paper (where the per-example variance depends on an additional 'encoder'). </p>
<p>Here, the parameters <span class="math">\(\phi\)</span> of the encoding network will be trained to provide the best possible encoding (bring the <span class="math">\(q(z_n | x_n)\)</span> distribution as close as possible to the true posterior <span class="math">\(p(z_n | x_n)\)</span>). </p>
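<p>A sample from this approximate posterior can be written as the encoder's mean plus scaled standard Normal noise, which keeps the sample differentiable with respect to <span class="math">\(\phi\)</span>. The sketch below assumes a hypothetical untrained encoder with arch. E1-32 and an arbitrary value <code>sigma = 0.1</code> for the fixed standard deviation. Neither choice comes from the starter code.</p>

```python
import torch
from torch import nn

torch.manual_seed(0)

# Hypothetical stand-in encoder (arch. E1-32) with random, untrained weights.
encoder = nn.Sequential(
    nn.Linear(784, 32), nn.ReLU(),
    nn.Linear(32, 2))
sigma = 0.1  # assumed fixed standard deviation, shared by all examples

x = torch.bernoulli(torch.full((8, 784), 0.3))  # stand-in batch of binary images

mu = encoder(x)               # mean of q(z_n | x_n), shape (8, 2)
eps = torch.randn_like(mu)    # noise drawn from Normal(0, I), independent of phi
z = mu + sigma * eps          # sample from Normal(mu, sigma^2 I)

# Because z is a deterministic function of mu and eps, gradients reach phi.
z.sum().backward()
```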
<h4>Variational Objective Function to Maximize</h4>
<div class="math">\begin{align}
\mathcal{L}(\theta, \phi) &amp;= 
\sum_{n=1}^N
\mathbb{E}_{q(z_n | x_n)}
\Big[
    \log p(x_n | z_n, \theta) + \log p(z_n) - \log q(z_n | x_n, \phi)
\Big]
\end{align}</div>
<p>This objective function <span class="math">\(\mathcal{L}(\theta, \phi)\)</span> takes in the encoder parameters <span class="math">\(\phi\)</span> and decoder parameters <span class="math">\(\theta\)</span>, and produces a scalar real value.</p>
<p>The readings discuss how this function is a "lower bound" on the marginal likelihood (the "evidence") <span class="math">\(\log p(x | \theta) = \log \int_z p(x, z| \theta) dz\)</span>. We wish to find parameters that maximize this evidence lower bound (ELBO).
The parameters include the generative model's parameters <span class="math">\(\theta\)</span> and the parameters <span class="math">\(\phi\)</span> for our approximate posterior.</p>
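<p>A standard way to read this objective (offered here as an aside, and not necessarily how the starter code computes it) is to group the prior and approximate posterior terms into a KL divergence. For the fixed-variance Normal <span class="math">\(q\)</span> and standard Normal prior used here, with <span class="math">\(\mu_n = \text{encode}(x_n, \phi)\)</span>, that KL term has a closed form:</p>
<div class="math">\begin{align}
\mathcal{L}(\theta, \phi) &amp;=
\sum_{n=1}^N \Big(
    \mathbb{E}_{q(z_n | x_n)} \big[ \log p(x_n | z_n, \theta) \big]
    - \text{KL}\big( q(z_n | x_n) \,\|\, p(z_n) \big)
\Big)
\\
\text{KL}\big( q(z_n | x_n) \,\|\, p(z_n) \big) &amp;=
\frac{1}{2} \sum_{k=1}^{2} \Big( \sigma^2 + \mu_{nk}^2 - 1 - \log \sigma^2 \Big)
\end{align}</div>
<p>The first term rewards accurate reconstructions, while the KL term penalizes encodings whose means drift far from the prior's mean of zero.</p>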
<h4>VI Loss Function (Objective to Minimize)</h4>
<p>Often, we are interested in framing inference as a minimization problem (not maximization). Especially in deep learning, minimization is the common goal of optimization toolboxes. For example, PyTorch expects a loss function to minimize.</p>
<p>We can thus define that our "VI loss function" (what we want to minimize) is just -1.0 times the evidence lower bound objective above.</p>
<p>"VI loss function": <span class="math">\(-1.0 \cdot \mathcal{L}(\theta, \phi)\)</span></p>
<h4>Training Optimization Problem for Goal 2</h4>
<p>Our goal is to learn values for <span class="math">\(\theta, \phi\)</span> that make the objective to minimize as small as possible. Here's the optimization problem:</p>
<div class="math">\begin{align}
\min_{\theta, \phi}
\quad &amp; - \mathcal{L}(\theta, \phi)
\end{align}</div>
<p>We can then use:</p>
<ul>
<li>Monte Carlo methods to evaluate the objective function <span class="math">\(\mathcal{L}\)</span></li>
<li>Reparameterization trick methods to estimate the gradient of <span class="math">\(\mathcal{L}\)</span> with respect to <span class="math">\(\phi\)</span> and <span class="math">\(\theta\)</span></li>
</ul>
<p>These two ideas, together, allow us to fit this VAE model to data.</p>
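<p>A minimal sketch of how the two ideas might combine is shown below. It assumes hypothetical untrained networks with archs. E1-32 and D1-32 and an arbitrary fixed <code>sigma = 0.1</code>. The function name <code>vi_loss_estimate</code> is made up for illustration and is not the starter code's API.</p>

```python
import torch
import torch.nn.functional as F
from torch import nn
from torch.distributions import Normal

torch.manual_seed(0)

# Hypothetical untrained stand-in networks (archs. E1-32 and D1-32).
encoder = nn.Sequential(nn.Linear(784, 32), nn.ReLU(), nn.Linear(32, 2))
decoder = nn.Sequential(
    nn.Linear(2, 32), nn.ReLU(), nn.Linear(32, 784), nn.Sigmoid())
sigma = 0.1  # assumed fixed standard deviation of q


def vi_loss_estimate(x, n_mc_samples=1):
    """Monte Carlo estimate of the VI loss (-ELBO), summed over the batch x."""
    mu = encoder(x)
    elbo = 0.0
    for _ in range(n_mc_samples):
        # Reparameterization: z is a deterministic function of mu and noise.
        z = mu + sigma * torch.randn_like(mu)
        # log p(x | z, theta): Bernoulli log likelihood = negative summed bce.
        log_lik = -F.binary_cross_entropy(decoder(z), x, reduction='sum')
        # log p(z): standard Normal prior, summed over examples and code dims.
        log_prior = Normal(0.0, 1.0).log_prob(z).sum()
        # log q(z | x): Normal centered at the encoder output.
        log_q = Normal(mu, sigma).log_prob(z).sum()
        elbo = elbo + (log_lik + log_prior - log_q) / n_mc_samples
    return -elbo


x = torch.bernoulli(torch.full((8, 784), 0.3))  # stand-in batch of binary images
loss = vi_loss_estimate(x, n_mc_samples=5)
loss.backward()  # gradients flow to both phi (encoder) and theta (decoder)
```

<p>In a training loop, a call like <code>loss.backward()</code> would be followed by an optimizer step. Using more Monte Carlo samples reduces the noise of the estimate at the cost of more decoder evaluations.</p>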
<h2><a name="problem-1">Problem 1: Fitting Autoencoders to MNIST to Maximize Likelihood </a></h2>
<p>You can use the <a href="https://github.com/tufts-ml/comp150_bdl_2018f_public/blob/master/homeworks/hw4_starter/hw4starter_ae.py"><code>hw4starter_ae.py</code> script</a> to perform <em>maximum likelihood</em> estimation of the parameters of the encoder and decoder networks. This code uses <strong>PyTorch</strong>, unlike past homeworks, so you'll need to install a new conda environment: <code>bdl_pytorch_env</code>.</p>
<p>You'll need to train 3 separate models with 32, 128, and 512 hidden units (these size specifications are used by both encoder and decoder in the released code). </p>
<p>You'll need to adjust the learning rate <code>--lr</code> and potentially other keyword arguments as well.</p>
<p>Train for at least 200 epochs (or more if you don't see convergence).</p>
<h5><strong>Instructions</strong> for Problem 1</h5>
<p>Your PDF report should include the following labeled parts:</p>
<p>a. 1 row x 3 col plot (with caption): Plot binary cross entropy (y-axis) on both <em>train</em> and <em>test</em> sets versus  the number of training iterations (x-axis). Show training with a solid blue line, and test with a dashed red line (always include a legend).</p>
<p>Your binary cross entropy calculation should be a "per-pixel error", so you should normalize by the number of examples <span class="math">\(N\)</span> and pixels <span class="math">\(P\)</span> in each dataset:
</p>
<div class="math">$$
\text{per_pixel_cross_entropy}(x, x') = -\frac{1}{NP}\sum_{n=1}^N \sum_{p=1}^P \Big[ x_{np} \log x'_{np} + (1-x_{np}) \log (1-x'_{np}) \Big]
$$</div>
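<p>For a sense of scale, the tiny hand-checkable example below (with made-up values) computes the per-pixel cross entropy for an uninformative reconstruction that assigns probability 0.5 to every pixel. This is only a sketch of the normalization, not the starter code's evaluation routine.</p>

```python
import math
import torch
import torch.nn.functional as F

# Made-up data: N=2 examples, P=3 pixels each.
x = torch.tensor([[1.0, 0.0, 1.0],
                  [0.0, 0.0, 1.0]])
xproba = torch.full_like(x, 0.5)   # uninformative guess for every pixel

N, P = x.shape
# Summed binary cross entropy divided by the number of examples and pixels.
per_pixel_bce = F.binary_cross_entropy(xproba, x, reduction='sum') / (N * P)
print(float(per_pixel_bce))        # approximately 0.6931, which is log(2)
```

<p>A value near log(2), about 0.693, therefore corresponds to a model that has learned nothing about individual pixels, and lower values indicate better reconstructions.</p>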
<p>The 3 columns should show performance with 3 different encoder/decoder architectures: 32, 128, or 512 hidden units.</p>
<p>b. 1 row x 3 col plot (with caption): Plot L1 reconstruction error (y-axis) on both <em>train</em> and <em>test</em> sets versus  the number of training iterations (x-axis). Show training with a solid blue line, and test with a dashed red line.</p>
<p>The 3 columns should show performance with 3 different encoder/decoder architectures: 32, 128, or 512 hidden units.</p>
<p>Your L1 error calculation should be "per-pixel" error, so you should normalize by the number of examples <span class="math">\(N\)</span> and pixels <span class="math">\(P\)</span> in each dataset:
</p>
<div class="math">$$
\text{per_pixel_L1_error}(x, x') = \frac{1}{NP}\sum_{n=1}^N \sum_{p=1}^P | x_{np} - x'_{np}|
$$</div>
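<p>The per-pixel L1 error is simply the mean absolute difference over every entry. The sketch below uses made-up values small enough to verify by hand.</p>

```python
import torch

# Made-up data: N=2 examples, P=2 pixels each.
x = torch.tensor([[1.0, 0.0],
                  [0.0, 1.0]])
xproba = torch.tensor([[0.9, 0.2],
                       [0.4, 0.5]])

# Absolute differences are 0.1, 0.2, 0.4, 0.5; their mean is 1.2 / 4.
per_pixel_l1 = torch.mean(torch.abs(x - xproba))
print(float(per_pixel_l1))   # approximately 0.3
```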
<p>c. Short answer: Comment on the relative values of the metrics in 1a and 1b between train and test sets. Is there overfitting? Underfitting?</p>
<p>d. 1 row x 3 col plot (with caption): Show 3 panels, each one with 25 sampled images drawn from the generative model using your final trained decoder.</p>
<p>e. Short answer: Comment on the relationship between visual quality and error metrics (binary cross entropy or L1 error). Is the magnitude of visual improvement from 32 to 128 to 512 reflected in these metrics? What might be missing or what might we do instead?</p>
<p>f. (Bonus +5) 1 row x 3 col plot (with caption): Show 3 panels, each one with a 2D visualization of the "encoding" of test images. Color each point by its class label (digit 0 gets one color, digit 1 gets another color, etc). Show at least 100 examples per class label.</p>
<h2><a name="problem-2">Problem 2: Fitting VAEs to MNIST to Minimize the VI Loss</a></h2>
<p>You can use the <a href="https://github.com/tufts-ml/comp150_bdl_2018f_public/blob/master/homeworks/hw4_starter/hw4starter_vae.py"><code>hw4starter_vae.py</code> script</a> to perform <em>variational inference</em> to estimate the encoder, while also performing maximum likelihood estimation of the decoder parameters.</p>
<p>This code uses <strong>PyTorch</strong>, unlike past homeworks, so you'll need to install a new conda environment: <code>bdl_pytorch_env</code>.</p>
<p>You'll need to train 3 separate models with 32, 128, and 512 hidden units (these size specifications are used by both encoder and decoder in the released code). </p>
<p>You'll need to adjust the learning rate <code>--lr</code> and potentially other keyword arguments as well.</p>
<p>Train for at least 200 epochs (or more if you don't see convergence).</p>
<h5><strong>Instructions</strong> for Problem 2</h5>
<p>Your PDF report should include the following labeled parts:</p>
<p>a. 1 row x 3 col plot (with caption): Plot binary cross entropy (y-axis) on both <em>train</em> and <em>test</em> sets versus the number of training iterations (x-axis). Use the same line styles and metric definitions as in Problem 1a.</p>
<p>The 3 columns should show performance with 3 different <strong>VAE</strong> encoder/decoder architectures: 32, 128, or 512 hidden units.</p>
<p>b. 1 row x 3 col plot (with caption): Plot L1 reconstruction error (y-axis) on both <em>train</em> and <em>test</em> sets versus  the number of training iterations (x-axis).  Use the same line styles and metric definitions as in Problem 1b.</p>
<p>The 3 columns should show performance with 3 different <strong>VAE</strong> encoder/decoder architectures: 32, 128, or 512 hidden units.</p>
<p>c. Short answer: Comment on the relative values of the metrics in 2a and 2b between train and test sets. Is there overfitting? Underfitting? What are the major differences (if any) from Problem 1a and 1b? Why might the behavior be different?</p>
<p>d. 1 row x 3 col plot (with caption): Show 3 panels (one per arch.), each one with 25 sampled images drawn from the generative model using your final trained VAE decoder.</p>
<p>e. Short answer: Compare 1d to 2d. Does the VAE differ noticeably in the quality of its sampled images?</p>
<p>f. (Bonus +5) 1 row x 3 col plot (with caption): Show 3 panels (one per arch.), each one with a 2D visualization of the VAE's "encoding" of test images. Color each point by its class label (digit 0 gets one color, digit 1 gets another color, etc). Show at least 100 examples per class label.</p>