### Summary
* Great models can only be achieved by iterative development.

* Iterate quickly by building a pipeline that is robust to code changes.

* Start with a simple model and mean-field inference.

* Avoid NaNs by intelligently initializing and .clamp()ing.

* Reparametrize the model to improve geometry.

* Create a custom variational family by combining AutoGuides or EasyGuides.
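As a small illustration of the .clamp() half of the NaN advice above, the following sketch shows how a single underflowed probability can turn a multinomial-style log-likelihood term into NaN, and how clamping before the log avoids it. The tensors and values are made up for illustration and are not taken from the tutorial's model.

```python
import torch

# Hypothetical logits; the middle entry is so negative that its softmax
# probability underflows to exactly 0 in float32.
logits = torch.tensor([0.0, -200.0, 5.0])
probs = torch.softmax(logits, dim=-1)

# Hypothetical observed counts; the zero count lines up with the zero probability.
counts = torch.tensor([3.0, 0.0, 7.0])

# Naive version: log(0) = -inf and 0 * -inf = nan, which poisons the sum.
naive = (counts * probs.log()).sum()

# Clamped version: bound the probabilities away from 0 before taking the log.
tiny = torch.finfo(probs.dtype).tiny
safe = (counts * probs.clamp(min=tiny).log()).sum()

print("naive:", naive.item())
print("safe: ", safe.item())
```

The clamp only changes values that would otherwise be exactly zero, so it leaves the well-behaved terms of the sum untouched.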

### Table of contents
* Overview

* Running example: SARS-CoV-2 strain prediction

    * Clean the data

    * Create a generative model

    * Sanity check using mean-field inference

    * Create an initialization heuristic

    * Reparametrize the model

    * Customize the variational family: autoguides, easyguides, custom guides

### Overview 

Consider the problem of sampling from the posterior distribution of a probabilistic model with a very large number of continuous latent variables (about 500,000 in the running example below), but whose data fits entirely in memory. (For larger datasets, consider amortized variational inference.) Inference in such high-dimensional models can be challenging even when posteriors are known to be unimodal or even log-concave, due to correlations among latent variables.

To perform inference in such high-dimensional models in Pyro, we have evolved a workflow to incrementally build data analysis pipelines combining variational inference, reparametrization effects, and ad-hoc initialization strategies. Our workflow is summarized as a sequence of steps, where validation after any step might suggest backtracking to change design decisions at a previous step.

The crux of an efficient workflow is to ensure that changes don’t break your pipeline. That is, after you build a number of pipeline stages, validate results, and decide to change one component in the pipeline, you’d like to minimize the code changes needed in other components. The remainder of this tutorial describes these steps individually, then describes nuances of interactions among stages, then provides an example.

### Running example: SARS-CoV-2 strain prediction

The running example in this tutorial will be a model (Obermeyer et al. 2022) of the relative growth rates of different strains of the SARS-CoV-2 virus, based on open data counting different PANGO lineages of viral genomic samples collected at different times around the world. There are about 2 million sequences in total.

The model is a high-dimensional regression model with around 1000 coefficients, a multivariate logistic growth function (using a simple torch.softmax()) and a Multinomial likelihood. While the number of coefficients is relatively small, there are about 500,000 local latent variables to estimate, and plate structure in the model should lead to an approximately block diagonal posterior covariance matrix. For an introduction to simple logistic growth models using this same dataset, see the logistic growth tutorial.
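To make the "multivariate logistic growth plus Multinomial likelihood" structure concrete, here is a toy sketch in plain PyTorch. It is not the model from the paper: the sizes, the `init` and `rate` values, and the fixed `total_count` are all hypothetical, and the regression of growth rates on coefficients is left out.

```python
import torch
from torch.distributions import Multinomial

torch.manual_seed(0)

# Hypothetical toy sizes: 6 time steps and 3 strains.
T, S = 6, 3
time = torch.arange(T, dtype=torch.float32)

# Hypothetical per-strain parameters (in a real model these would be latent).
init = torch.tensor([0.0, -2.0, -4.0])  # log initial prevalence
rate = torch.tensor([0.0, 0.5, 1.0])    # relative growth rate per time step

# Multivariate logistic growth: linear in log space, then softmax over strains.
logits = init + rate * time[:, None]    # shape (T, S)
probs = torch.softmax(logits, dim=-1)   # each row sums to 1

# Multinomial likelihood over strain counts at each time step.
likelihood = Multinomial(total_count=100, probs=probs)
counts = likelihood.sample()                  # shape (T, S), simulated data
log_lik = likelihood.log_prob(counts).sum()   # scalar log-likelihood

print(probs)
print(counts)
print(log_lik.item())
```

In this toy setup the strain with the largest growth rate starts rare and overtakes the others by the last time step, which is the qualitative behavior a relative growth rate model is meant to capture.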
