## TL;DR
Bayesian analysis gives us back an entire distribution summarizing all there is to know about the possible value of each parameter. If we're only interested in a particular distillation of that distribution (e.g. its mode or mean) we can save on computation.

Additionally, providing a prior enables Bayes to give credible intervals, which are different from confidence intervals and generally nicer.

## Summarizing the Posterior
Bayesian analyses give us a (joint) posterior over all parameters in the model. [Which can be used to form posterior predictive distributions as well, but we'll ignore those here].

Even if we marginalize the joint posterior down to a 1-dimensional distribution, we still have a whole range of plausible values for that parameter. If we need to report something simple to our bosses, we're left with the familiar task of summarizing a distribution down to something simpler.

We'll discuss three common methods of distilling the posterior:

1. Taking the mode. (Maximum a Posteriori or MAP)
2. Taking the mean. (Posterior mean)
3. Building an interval (Credible Intervals)

### MAP (Maximum a posteriori)
MAP produces a point estimate of the parameter value(s). It just takes the most probable combination of parameter values under the posterior.

The costs of distilling to the MAP are 1) throwing away a lot of information and 2) potentially overfitting, but the benefit is that MAP can be found more or less locally (calculate the posterior at particular points and simply hill climb or do simulated annealing). For this reason MAP estimates tend to have lower computational cost than other summaries.

#### Math
The goal is to find the mode of the posterior distribution.

This corresponds to:

$$
\begin{eqnarray}
 \theta_{{\rm MAP}} &=& \arg \max_{\theta} \, p(\theta \vert D)  \nonumber \\ 
                               & =& \arg \max_{\theta}  \frac{\mathcal{L}(\theta|D) \, p(\theta)}{p(D)}  \nonumber \\ 
                               & =& \arg \max_{\theta}  \, \mathcal{L}(\theta|D) \, p(\theta) \nonumber \\ 
\end{eqnarray}
$$

Thus MAP is closely related to maximum likelihood estimation. The difference is that the prior we set over the parameters also influences the estimate. [Maximum likelihood is MAP with a uniform prior, i.e. without regularization]
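To make the comparison with maximum likelihood concrete, here is a minimal sketch on a made-up example: 7 heads in 10 coin flips with a Beta(2, 2) prior on the heads probability. The data, the prior, and the brute-force grid search are all assumptions chosen for illustration; a real model would more likely use hill climbing or a numerical optimizer.

```python
import math

# Hypothetical data: 7 heads in 10 coin flips, with a Beta(2, 2) prior on theta.
heads, flips = 7, 10
prior_a, prior_b = 2.0, 2.0

def log_likelihood(theta):
    # Bernoulli log-likelihood, dropping the constant binomial coefficient.
    return heads * math.log(theta) + (flips - heads) * math.log(1.0 - theta)

def log_prior(theta):
    # Beta(a, b) log-density up to an additive constant.
    return (prior_a - 1.0) * math.log(theta) + (prior_b - 1.0) * math.log(1.0 - theta)

# Evaluate on an interior grid (endpoints excluded to avoid log(0)).
grid = [i / 1000.0 for i in range(1, 1000)]

# Maximum likelihood ignores the prior; MAP adds the log-prior to the objective.
theta_mle = max(grid, key=log_likelihood)
theta_map = max(grid, key=lambda t: log_likelihood(t) + log_prior(t))

# Closed form for this conjugate pair, used only as a sanity check.
theta_map_exact = (heads + prior_a - 1.0) / (flips + prior_a + prior_b - 2.0)

print("MLE:", theta_mle, "MAP:", theta_map, "MAP (closed form):", theta_map_exact)
```

Note that $p(D)$ never appears: since it does not depend on $\theta$, the arg max is unchanged by dropping it. The prior here pulls the MAP estimate away from the maximum likelihood estimate and toward 0.5, which is the regularizing effect mentioned above.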

### Posterior Mean

This method is still a point estimate, but a little richer than MAP. (Just like giving a distribution's mean is often richer than giving its mode).

We calculate the mean of the posterior distribution:

$$\text{Summary} = E_{\theta \sim \text{posterior}}[\theta] = \int \theta\, p(\theta|D)\, d\theta$$

The posterior mean is more informative than the MAP, but more costly. Often it is found by simulating from the posterior. In this light, the posterior mean requires fewer samples to approximate than would be required to explore the full posterior. [The mean of a bunch of values has lower variance than the individual values do]
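As a rough sketch of the simulation approach, suppose the posterior happens to be Beta(9, 5) (what a Beta(2, 2) prior becomes after 7 heads and 3 tails). That posterior is an assumption picked because we can sample from it directly and check the answer against the exact mean; in a harder model the draws would come from something like MCMC.

```python
import random

random.seed(0)  # fixed seed so the sketch is reproducible

# Assumed posterior: Beta(9, 5), chosen so the exact mean is known.
post_a, post_b = 9.0, 5.0

# Draw directly from the posterior; a harder model would need MCMC draws here.
samples = [random.betavariate(post_a, post_b) for _ in range(20000)]

# Monte Carlo estimate of the integral of theta * p(theta | D).
posterior_mean_mc = sum(samples) / len(samples)

# Exact mean of a Beta(a, b) distribution, for comparison.
posterior_mean_exact = post_a / (post_a + post_b)

print("Monte Carlo:", posterior_mean_mc, "exact:", posterior_mean_exact)
```

The error of the Monte Carlo estimate shrinks roughly like one over the square root of the number of draws, which is why a decent estimate of the mean is cheaper than a faithful picture of the whole posterior.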

### Credible Interval
A credible interval summarizes the posterior down to an interval (or set of intervals), rather than a single point. We'll stick with tradition and only discuss the 1-D case, since the issues only get thornier in more dimensions.

The idea here is to provide an interval where e.g. 90% of the posterior mass falls. The issue is that there are LOTS of such intervals. You could move the left endpoint to lose 1% of the mass and move the right endpoint to gain 1% of the mass. We need to add more criteria.

Options:
1. Specify that the interval should contain the mode, and grow it outward from the mode by gobbling up whichever endpoint has the higher probability density. For a unimodal posterior, this procedure gives the contiguous interval that is as small as possible.
2. Drop a horizontal line down from infinity. When the regions where the density sits above the line hold 90% of the mass (leaving 10% outside them), stop, and report the endpoints of those regions. This might produce multiple intervals instead of just one, but guarantees the total length of the intervals is as small as possible.
3. Specify that the interval should be centered on the mean or mode and extend an equal amount to either side until 90% of the mass is inside the interval. This could be grossly misleading for asymmetric distributions.
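Options 1 and 3 are easy to sketch when the posterior is represented by draws. The example below assumes a skewed Beta(2, 8) posterior (a stand-in chosen for illustration) and finds the shortest contiguous interval by sliding a window over the sorted draws, rather than by literally growing outward from the mode; for a unimodal posterior the two should agree up to sampling noise.

```python
import math
import random

random.seed(1)  # fixed seed so the sketch is reproducible

# Assumed skewed posterior, Beta(2, 8), represented by sorted draws.
samples = sorted(random.betavariate(2.0, 8.0) for _ in range(20000))
n = len(samples)
k = math.ceil(0.90 * n)  # number of draws the interval must contain

# Option 1: shortest contiguous interval. Slide a window of k sorted draws
# and keep the narrowest one.
start = min(range(n - k + 1), key=lambda i: samples[i + k - 1] - samples[i])
shortest = (samples[start], samples[start + k - 1])

# Option 3: center on the mean and extend equally on both sides until
# k draws are inside (the k-th smallest absolute deviation).
mean = sum(samples) / n
half_width = sorted(abs(s - mean) for s in samples)[k - 1]
centered = (mean - half_width, mean + half_width)

def coverage(interval):
    # Fraction of posterior draws falling inside the interval.
    lo, hi = interval
    return sum(lo <= s <= hi for s in samples) / n

for name, interval in (("shortest", shortest), ("centered", centered)):
    print(name, interval, "width:", interval[1] - interval[0],
          "coverage:", coverage(interval))
```

Both intervals hold about 90% of the draws, but on a skewed posterior the centered one has to be wider to do so, because it spends length on the thin side of the distribution.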

No matter how they are found, the intervals described are what we'd hoped confidence intervals would be. If you believe the prior and the model, you should believe that any interval constructed as above contains the correct value of $\theta$ with 90% probability.


### Contrast with confidence intervals
Confidence intervals and credible intervals coincide in particular cases, but usually give different values.

Confidence intervals take in a dataset and a model to produce endpoints. Credible intervals additionally take in a prior distribution. For either one, the interval is entirely determined by the dataset (and prior). If you see the same dataset a second time, each interval returns the same thing it said previously.

Confidence intervals are built so that, _over multiple datasets built with any given value of $\theta^*$_, 90% of the intervals constructed will contain $\theta^*$. As you might imagine, building a confidence interval recipe is a serious mathematical feat. In some sense, confidence intervals take a point prediction and hedge against "the data might have been different".

Credible intervals are built so that if nature re-rolled the true value of $\theta^*$ according to the prior and fed us a new dataset built under that $\theta^*$, then 90% of the credible intervals we build would contain the value of $\theta^*$ used. In some sense, credible intervals hedge against uncertainty in our knowledge of $\theta^*$, and against the data being different. Because they have MORE knowledge of the uncertainty of $\theta^*$ (we wrote it down in a prior instead of shrugging and saying "$\theta^*$ has a value, but we don't know what"), credible intervals are able to make a better hedge.

#### Comparison
To put a fine point on things: frequentist intervals promise that for any fixed parameter value, 90% of the intervals produced will contain the true value. However, if you grab any particular dataset you have no guarantee about that interval. It's possible the interval doesn't include the parameter, or is way, way off. It's even possible that the interval you're told to draw for that dataset has width zero or is otherwise degenerate. If you happen to be in the world where that's the dataset you got, tough luck. The confidence interval procedure is allowed to fail as badly as it wants on 10% of datasets as long as it works for the others. It's only across all datasets that the 90% coverage emerges.

The Bayesian framework makes an extra assumption (that nature follows the prior distribution) and gets a stronger result. Bayesian credible intervals work across all datasets and even on the level of individual datasets. No matter whether we get to a particular dataset via a rare fluke or if the dataset is entirely typical for that parameter setting, the interval we're told to draw for that dataset and prior contains the true parameter 90% of the time, _assuming that the true parameter is drawn from the prior distribution we wrote down_.

That said, if you grab a particular parameter value, the Bayesian method has no guarantees. It's possible that datasets generated under that parameter value produce intervals that are badly wrong for that value and only very rarely contain the given parameter. If you happen to be in the world where that's the true parameter value, tough luck. The credible interval procedure is allowed to decide that some parameter values are rare and it's okay to consistently exclude them in order to contain more common parameter values the right amount of the time. [Though with larger datasets, the percentage of intervals (across datasets) that work for any fixed parameter value will approach 90%, just like the frequentist method does]
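The two guarantees can be checked with a small simulation. This is a sketch under assumptions chosen for convenience, not a general recipe: data are $N(\theta, 1)$, the prior is $N(0, 1)$ (so the posterior has a closed form), each dataset has a single observation, and the "rare" fixed parameter value is 3.

```python
import random
from statistics import NormalDist

random.seed(2)  # fixed seed so the sketch is reproducible

z = NormalDist().inv_cdf(0.95)  # two-sided 90% interval uses the 95th percentile
n = 1           # observations per dataset (hypothetical)
trials = 20000  # number of simulated datasets per experiment

def draw_xbar(theta):
    # The mean of n draws from N(theta, 1) is distributed N(theta, 1/n).
    return random.gauss(theta, 1.0 / n ** 0.5)

def confidence_interval(xbar):
    # Standard 90% interval for a normal mean with known unit variance.
    half = z / n ** 0.5
    return xbar - half, xbar + half

def credible_interval(xbar):
    # Conjugate update for a N(0, 1) prior: posterior is N(n*xbar/(n+1), 1/(n+1)).
    center = n * xbar / (n + 1)
    half = z / (n + 1) ** 0.5
    return center - half, center + half

def coverage(make_interval, draw_theta):
    # Fraction of simulated datasets whose interval contains the theta used.
    hits = 0
    for _ in range(trials):
        theta = draw_theta()
        lo, hi = make_interval(draw_xbar(theta))
        hits += lo <= theta <= hi
    return hits / trials

def from_prior():
    # Nature re-rolls theta from the prior for every dataset.
    return random.gauss(0.0, 1.0)

def fixed_rare():
    # Theta pinned at a value the prior considers unlikely.
    return 3.0

cred_prior = coverage(credible_interval, from_prior)
conf_prior = coverage(confidence_interval, from_prior)
cred_fixed = coverage(credible_interval, fixed_rare)
conf_fixed = coverage(confidence_interval, fixed_rare)

print("theta drawn from prior: credible", cred_prior, "confidence", conf_prior)
print("theta fixed at 3:       credible", cred_fixed, "confidence", conf_fixed)
```

With $\theta^*$ drawn from the prior, both procedures should cover about 90% of the time. With $\theta^*$ pinned at the rare value, the confidence interval should keep its 90% while the credible interval covers far less often, because the prior keeps dragging the interval back toward zero. Raising `n` should shrink that gap, in line with the bracketed remark above.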