# SVI Part IV: Tips and Tricks

The three SVI tutorials leading up to this one ([Part I](http://pyro.ai/examples/svi_part_i.html), [Part II](http://pyro.ai/examples/svi_part_ii.html), & [Part III](http://pyro.ai/examples/svi_part_iii.html)) go through
the various steps involved in using Pyro to do variational
inference.
Along the way we defined models and guides (i.e. variational distributions),
set up variational objectives (in particular [ELBOs](https://docs.pyro.ai/en/dev/inference_algos.html?highlight=elbo#module-pyro.infer.elbo)), 
and constructed optimizers ([pyro.optim](http://docs.pyro.ai/en/dev/optimization.html)). 
The effect of all this machinery is to cast Bayesian inference as a *stochastic optimization problem*. 

This is all very useful, but in order to arrive at our ultimate goal—learning model parameters, inferring approximate posteriors, making predictions with the posterior predictive distribution, etc.—we need to successfully solve this optimization problem. 
Depending on the details of the particular problem—for example the dimensionality of the latent spaces, whether we have discrete latent variables, and so on—this can be easy or hard. 
In this tutorial we cover a few tips and tricks we expect to be generally useful for users doing variational inference in Pyro. *ELBO not converging!? Running into NaNs!?* Look below for possible solutions!  

### 1. Start with a small learning rate

While large learning rates might be appropriate for some problems, it's usually good practice to start with small learning rates like $10^{-3}$
or $10^{-4}$:
```python
optimizer = pyro.optim.Adam({"lr": 0.001})
```
This is because ELBO gradients are *stochastic*, and potentially high variance, so large learning rates can quickly lead to regions of model/guide parameter space that are numerically unstable or otherwise undesirable.

You can try a larger learning rate once you have achieved stable
ELBO optimization using a smaller learning rate. 
This is often a good idea because excessively small learning rates can lead to poor optimization. 
In particular small learning rates can lead to getting stuck in poor local optima of the ELBO.

### 2. Make sure your model and guide distributions have the same support

Suppose you have a distribution in your `model` with constrained support, e.g. a LogNormal distribution, which has support on the positive real axis:
```python
def model():
    pyro.sample("x", dist.LogNormal(0.0, 1.0))
``` 
Then you need to ensure that the accompanying `sample` site in the `guide` has the same support:
```python
def good_guide():
    loc = pyro.param("loc", torch.tensor(0.0))
    pyro.sample("x", dist.LogNormal(loc, 1.0))
``` 
If you fail to do this and use, for example, the following inadmissible guide:
```python
def bad_guide():
    loc = pyro.param("loc", torch.tensor(0.0))
    pyro.sample("x", dist.Normal(loc, 1.0))
```
you will likely run into NaNs very quickly. This is because the `log_prob` of a LogNormal distribution evaluated at a sample `x` that satisfies `x<0` is undefined, and the `bad_guide` is likely to produce such samples.


### 3. Constrain parameters that need to be constrained
In a similar vein, you need to make sure that the parameters used to instantiate distributions are valid; otherwise you will quickly run into NaNs. For example the `scale` parameter of a Normal distribution needs to be positive. Thus the following `bad_guide` is problematic:
```python
def bad_guide():
    scale = pyro.param("scale", torch.tensor(1.0))
    pyro.sample("x", dist.Normal(0.0, scale))
``` 
while the following `good_guide` correctly uses a constraint to ensure positivity:
```python
from torch.distributions import constraints

def good_guide():
    scale = pyro.param("scale", torch.tensor(0.05),
                       constraint=constraints.positive)
    pyro.sample("x", dist.Normal(0.0, scale))
``` 

### 4. If you are having trouble constructing a custom guide, use an AutoGuide
In order for a model/guide pair to lead to stable optimization a number of conditions need to be satisfied, some of which we have covered above. 
Sometimes it can be difficult to diagnose the reason for numerical instability or poor convergence. Among other reasons this is because the fundamental issue could arise in a number of different places: in the model, in the guide, or in the choice of optimization algorithm/hyperparameters. 

Sometimes the problem is actually in your model even though you think it's in the guide. 
Conversely, sometimes the problem is in your guide even though you think it's in the model or somewhere else. 
For these reasons it can be helpful to reduce the number of moving parts while you try to identify the underlying issue.
One convenient way to do this is to replace your custom guide with an autoguide from [pyro.infer.autoguide](http://docs.pyro.ai/en/stable/infer.autoguide.html#module-pyro.infer.autoguide). 
For example, if all the latent variables in your model are continuous, you can try a [pyro.infer.autoguide.AutoNormal](http://docs.pyro.ai/en/stable/infer.autoguide.html#autonormal) guide.


### 4. If you are having trouble constructing a custom guide, try MAP inference

If all the latent variables in your model are continuous, you can use MAP inference instead of full-blown variational inference. See the [MLE/MAP](http://pyro.ai/examples/mle_map.html) tutorial for further details. Once you have MAP inference working, there's good reason to believe that your model is set up correctly (at least as far as basic numerical stability is concerned). If you're interested in obtaining approximate posterior distributions, you can now follow up with full-blown SVI.

### 5. Parameter initialization matters: initialize guide distributions to have low variance

Initialization in optimization problems can make all the difference between finding a good solution and catastrophic failure. 
It is difficult to come up with a comprehensive set of good practices for initialization, as good initialization schemes are often very problem dependent. In the context of Stochastic Variational Inference it is generally a good idea to initialize your guide distributions so that they have **low variance**. This is because the ELBO gradients you use to optimize the ELBO are stochastic. If the ELBO gradients you get at the beginning of ELBO optimization exhibit high variance, you may be led into numerically unstable or otherwise undesirable regions of parameter space. One way to guard against this potential hazard is to pay close attention to parameters in your guide that control variance. 
For example we would generally expect this to be a reasonably initialized guide:
```python
from torch.distributions import constraints

def good_guide():
    scale = pyro.param("scale", torch.tensor(0.05),
                       constraint=constraints.positive)
    pyro.sample("x", dist.Normal(0.0, scale))
``` 
while the following high-variance guide is very likely to lead to problems:
```python
def bad_guide():
    scale = pyro.param("scale", torch.tensor(12345.6),
                       constraint=constraints.positive)
    pyro.sample("x", dist.Normal(0.0, scale))
``` 

### 6. Consider normalizing your ELBO

By default Pyro computes an unnormalized ELBO, i.e. it computes the quantity that is a lower bound to the log evidence computed on the full set of data that is being conditioned on. For large datasets this can be a number of large magnitude. Since computers use finite precision (e.g. 32-bit floats) to do arithmetic, large numbers can be problematic for numerical stability, since they can lead to loss of precision, under/overflow, etc.
For this reason it can be helpful in many cases to normalize your ELBO so that it is roughly order one. This can also be helpful for getting a rough feeling for how good your ELBO numbers are. For example if we have $N$ datapoints of dimension $D$ (e.g. $N$ real-valued vectors of dimension $D$) then we generally expect a reasonably well optimized ELBO to be of order $N \times D$. Thus if we divide our ELBO by a factor of $N \times D$ we expect an ELBO of order one. While this is just a rough rule of thumb, if we use this kind of normalization and obtain ELBO values like $-123.4$ or $1234.5$ then something is probably wrong: perhaps our model is terribly mis-specified, or perhaps our initialization is catastrophically bad. For details on how you can scale your ELBO by a normalization constant see [this tutorial](http://pyro.ai/examples/custom_objectives.html#Example:-Scaling-the-Loss).

### 7. Pay attention to scales

Scales matter. 
They matter for at least two important reasons: 
i) scales can make or break a particular initialization scheme; 
ii) as discussed in the previous section, scales can have an impact on numerical precision and stability.

To make this concrete suppose you are doing linear regression, i.e.
you're learning a linear map of the form $Y = W @ X$. Often the data comes with particular units. 
For example some of the components of the covariate $X$ may be in units of dollars (e.g. house prices), while others may be in units of density (e.g. residents per square mile). 
Perhaps the first covariate has typical values like $10^5$, while the second covariate has typical values like $10^2$. 
You should always pay attention when you encounter numbers that range across many orders of magnitude. 
In many cases it makes sense to normalize things so that they are order unity. 
For example you might measure house prices in units of \$100,000.
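
Changing units is one option; another common choice, sketched below, is to standardize each covariate to zero mean and unit standard deviation. The synthetic covariates are assumptions for illustration.

```python
import torch

torch.manual_seed(0)

# Hypothetical covariates: house prices in dollars (order 1e5) and
# residents per square mile (order 1e2).
prices = 3.0e5 + 1.0e5 * torch.randn(500)
density = 2.0e2 + 1.0e2 * torch.randn(500)
X = torch.stack([prices, density], dim=-1)  # shape (500, 2)

# Standardize each column; keep the statistics so that the same
# transformation can be applied to new data at prediction time.
X_mean = X.mean(dim=0, keepdim=True)
X_std = X.std(dim=0, keepdim=True)
X_normalized = (X - X_mean) / X_std
```

Whichever transformation you choose, remember to apply the identical one (with the stored `X_mean` and `X_std`) to any held-out data.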

These sorts of data transformations can have a number of benefits for downstream modeling and inference. 
For example if you've normalized all of your covariates appropriately, it may be reasonable to set a simple 
isotropic prior on your weights

```python
pyro.sample("W", dist.Normal(torch.zeros(2), torch.ones(2)))
```
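instead of having to specify a different prior scale for each covariate
```python
# Larger covariates call for smaller weights: a covariate of order 1e5
# needs a weight of order 1e-5 to contribute an order-one term.
prior_scale = torch.tensor([1.0e-5, 1.0e-2])
pyro.sample("W", dist.Normal(torch.zeros(2), prior_scale))
```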
There are other benefits too. 
It now becomes easier to initialize appropriate parameters for your guide. 
It is also now much more likely that the default initializations used by an autoguide from [pyro.infer.autoguide](http://docs.pyro.ai/en/stable/infer.autoguide.html#module-pyro.infer.autoguide) will work for your problem.