# Overview

We start with a review of Bayes Rule and then a motivating story for the Bayesian approach to inference.

## Review of Bayes Rule

Let us return to our example application of Bayes Rule from a previous chapter in which we tried to see if it was going to rain on Sam's wedding. For review,

> Sam is getting married tomorrow in an outdoor ceremony in the desert. In recent years, it has only rained 5 days per year. Unfortunately, the meteorologist has predicted rain for tomorrow. Should Sam rent a tent for the ceremony?

> We can solve this problem using Bayes Rule which, remember, is:

> $$P(A|B) = \frac{P(B|A)P(A)}{P(B)}$$

> But instead what we want is:

> $$P(W|F) = \frac{P(F|W)P(W)}{P(F)}$$

> where $W$ is weather (rain or shine) and $F$ is forecast (rain or shine). 

In Bayesian terms, the different values for weather, rain or shine, are *models*. We have one model that says "it rains" and we have another model that says "it shines".

> Remember that $P(W)$ in the numerator is our *prior* probability. What *is* our prior probability? Well, it only rains 5 days a year on average:

> | rain | shine |
> |:----:|:-----:|
> | 5/365 = 0.0137 | 360/365 = 0.9863 |

> I think this is what Sam had in mind when he planned his wedding.

The prior is how much we believe or how much we find credible each model. We find the "rain" model 1.37% credible and the "shine" model, 98.63% credible.

> But now he needs to take new evidence into account: a forecast of rain. The likelihood $P(F|W)$ is essentially the probability of the meteorologist being correct: given that it rained, what is the probability that it was forecast? Sam looks this up on the Internet.

> | forecast $F$ | $W$ = rain | $W$ = shine |
> |:---:|:----:|:-----:|
> | rain |  0.8 | 0.2   |
> | shine | 0.2 | 0.8   |

> What does this mean? *Given* that it rained, there is an 80% chance there was a forecast of rain:

> $P(F=rain|W=rain) = 0.8$

This is the consistency of the data with each possible model. In Bayesian statistics, if you don't know something, you condition on it. This has the strange effect of telling you what you want to know. Here we know that there's a forecast for rain but we don't know if it will really rain, so we condition on it and ask: what's the probability that we have a forecast for rain tomorrow given that it really rains tomorrow?

It turns out that this is an easier question to answer than the posterior itself.

> Because it *will* be confusing, we do not take shortcuts here. We will use the longhand notation, F=rain and W=rain, to distinguish the two events. Up above, we had Bayes Rule defined over entire random variables. 

> What does it look like for the specific outcome we're interested in?

> $$P(W=rain|F=rain) = \frac{P(F=rain|W=rain)P(W=rain)}{P(F=rain)}$$

> We have everything we need except the denominator. We can use total probability for it, though:

> $P(F=rain) = P(F=rain|W=rain)P(W=rain) + P(F=rain|W=shine)P(W=shine)$

> $0.8 \times 0.0137 + 0.2 \times 0.9863 = 0.208$

> and now we have:

> $P(W=rain|F=rain) = \frac{0.8 \times 0.0137}{0.208} = 0.053$

The resulting posterior is the probability of the model (rain) given a forecast for rain tomorrow. Or, the probability of rain, given the data (F=rain) and the prior.
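The arithmetic above can be reproduced in a few lines. This is only a sketch of the calculation; the variable names are illustrative and the numbers are the ones given in the example.

```python
# Prior P(W): it rains about 5 days out of 365 in the desert.
prior = {"rain": 5 / 365, "shine": 360 / 365}

# Likelihood P(F=rain | W): chance of a rain forecast under each model.
likelihood_forecast_rain = {"rain": 0.8, "shine": 0.2}

# Total probability of the data, P(F=rain).
evidence = sum(likelihood_forecast_rain[w] * prior[w] for w in prior)

# Bayes Rule applied to each model.
posterior = {
    w: likelihood_forecast_rain[w] * prior[w] / evidence for w in prior
}

print(f"P(F=rain) = {evidence:.3f}")
print(f"P(W=rain | F=rain) = {posterior['rain']:.3f}")
```

Note that the same denominator normalizes both models, so the posterior for "shine" comes out of the same calculation for free.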

Here we only have two models, though. What if we have more?

Consider the case where the meteorologist doesn't predict rain or shine but says only that there's a 5%, 10%, 15%, etc., to 95% chance of rain tomorrow. The first thing to note is that this means we have 19 models. The other interesting thing is that the meteorologist's forecast is never right or wrong. If there's a 5% chance of rain and it rains...the forecast is right. If there is a 5% chance of rain and it doesn't...the forecast is still right. What we want now is something like the accuracy of the meteorologist. In those cases where the meteorologist forecasts a 5% chance of rain, ideally, we want it to only rain 5% of the time.

Even so, there's nothing that stops us from having prior beliefs about the credibility of P(5%), P(10%), P(15%), etc. And there's nothing that stops us from conditioning our data on those models so that we have P(rain|5%), P(rain|10%), P(rain|15%), etc.

What we end up with is something like P(rain|5%) or the consistency of the forecasts. That's ok, too.
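To make the idea of many models concrete, here is a rough sketch of one update over the 19 models. Several things here are assumptions for illustration and not part of the example above: every model starts out equally credible, a model labelled "k%" assigns probability k/100 to rain, and the data is a single observed rainy day. The helper `update` is hypothetical.

```python
# The 19 models: a chance of rain of 5%, 10%, ..., 95% (stored as percents).
models = list(range(5, 100, 5))

# Assumed prior: every model is equally credible before seeing data.
prior = {m: 1 / len(models) for m in models}

def update(prior, rained):
    """Return the posterior over models after observing one day."""
    # Likelihood of the observation under model m is m/100 for rain,
    # and 1 - m/100 for shine.
    unnormalized = {
        m: (m / 100 if rained else 1 - m / 100) * p for m, p in prior.items()
    }
    evidence = sum(unnormalized.values())
    return {m: u / evidence for m, u in unnormalized.items()}

posterior = update(prior, rained=True)
for m in (5, 50, 95):
    print(f"model {m}%: prior {prior[m]:.3f} -> posterior {posterior[m]:.3f}")
```

Under these assumptions a rainy day shifts credibility toward the models that gave rain a higher chance, which is the same mechanism as in the two-model case.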

What we really want to drive home is that we can have more than two models. In fact, we can have a *continuous* model of P(1%), P(2%), P(3%), if we want but the math gets very complicated. This actually stymied Bayesian techniques for several centuries...until the invention of the computer.

In order to make clear that Bayesian Inference is just Bayes Rule, we return to the well-trod example of a coin toss and place it in the context of the previous chapters.

## Inference, Deduction, and Bayesian Inference

Inference begins to bring the previous chapters together and demonstrate that it is the fundamental problem of science and, hence, data science. In keeping with tradition, we're going to start with the coin example.

1. There is a process that gives rise to an observable event (data) which we call "flipping a coin". We can describe the elements of that process even if we can't create a simulation of the event or measure all the variables involved. One way to handle this is to make a qualitative model (Systems Theory, Causal Loop Diagram).
2. Because all we can observe about the process is the result, we model this process with a parameter $p$, the rate of success, and associate $p$ with the probability of heads. If we know $p$, the probability distribution over the events $\{heads, tails\}$, we can make all kinds of *deductions* about outcomes. One such deduction might be, what is the probability of 10 heads and 12 tails if $p = 0.42$? Note that deductions are always true given their assumptions. (Probability).
3. It turns out that there are many processes like our coin with only two outcomes and that we often ask similar questions of such processes (Mathematical Distributions). These questions might include "what is the probability of 10 successes and 5 failures in a particular order?" (Bernoulli Distribution), "what is the probability of 10 successes in 15 attempts?" (Binomial Distribution), and "how many trials do we need in order to see the first success?" (Geometric Distribution).
4. Finally, having observed 6 heads and 4 tails, what is $p$?

This last question is a question of inference. It is intimately connected with the process of science in general as we are moving from data (specifics) to theories and models ($p$). Admittedly, $p$ is not the Theory of Gravity or Evolution, but it is still a model of the real world--our process of interest. And that's what data science is for the most part, building models about the *everyday*. After all, if you build models about biology, you're a biologist.

There are a lot of ways to think about what $p$ "really" is. If you have some familiarity with statistics, you'll have heard $p$ referred to as a *population* parameter or the true value and the question is couched as, how do we discover the population parameter from a *sample*, a finite set of observations that includes 6 heads and 4 tails?

Instead, I would like you to think of $p$ as just our model of the real world process and we just want to know how credible the possible values of $p$ are given the data we've seen. We want good models. So how do we come up with a good model for $p$, based on our data?

So we know that if we have a fair coin ($p=0.5$) we can use *deduction* to answer the question, what is the probability of seeing 6 heads and 4 tails in 10 tosses. This is just a joint probability distribution of 10 independent outcomes. The Binomial distribution will tell us that the probability of this event is 20.5%.
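As a small sketch of deduction in the forward direction, the Binomial probability can be computed directly. The helper name `binomial_pmf` is just for illustration.

```python
from math import comb

def binomial_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

# Fair coin: 6 heads and 4 tails in 10 tosses.
print(f"{binomial_pmf(6, 10, 0.5):.4f}")  # 0.2051

# The earlier deduction: 10 heads and 12 tails when p = 0.42.
print(f"{binomial_pmf(10, 22, 0.42):.4f}")
```

Given $p$, the answer is fully determined; nothing about the data is uncertain except which outcome we happen to see.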

If, on the other hand, we'd like to know what the value of $p$ is when we observe 6 heads and 4 tails in 10 tosses, our task is not as easy. Why? Because many values of $p$ are consistent with 6 heads and 4 tails in 10 tosses. The following table shows the probability of that event for different values of $p$ in 0.1 increments:

| $p$ | probability |
|:--------:|------------:|
| 0.0      | 0.00%       |
| 0.1      | 0.01%       |
| 0.2      | 0.55%       |
| 0.3      | 3.68%       |
| 0.4      | 11.15%      |
| 0.5      | 20.51%      |
| 0.6      | 25.08%      |
| 0.7      | 20.01%      |
| 0.8      | 8.81%       |
| 0.9      | 1.12%       |
| 1.0      | 0.00%       |

So which $p$ is the "real" one? We don't know for certain. Actually, we **cannot** know for certain merely from data.
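The table can be regenerated with a short loop, which also makes it easy to try a different increment. This is a sketch using the Binomial formula; printed values should agree with the table up to rounding in the last digit.

```python
from math import comb

heads, tails = 6, 4
n = heads + tails

# Candidate values of p in 0.1 increments: 0.0, 0.1, ..., 1.0.
grid = [i / 10 for i in range(11)]

# Probability of the observed data under each candidate p.
likelihood = [comb(n, heads) * p**heads * (1 - p) ** tails for p in grid]

for p, like in zip(grid, likelihood):
    print(f"p = {p:.1f}  probability = {like:.2%}")

# Each row is a probability of the data under a different p, so the
# column itself is not a probability distribution and need not sum to 1.
print(f"sum of column = {sum(likelihood):.4f}")
```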

If we use each possible value of $p$ as a model of the actual process and we assume that all values were initially equally likely to be true (or we had no information that would lead us to favor one of the values--models--over another), then the table above shows us, based on the data, how much we should believe in each model relative to the others: the one where $p=0.2$, the one where $p=0.3$, the one where $p=0.4$, etc. Once the values are rescaled so that they sum to 1, this probability distribution becomes a model of our uncertainty over the possible models, $p$. We use probability to model probability.

Suppose now we have a question about the probability of throwing 3 heads in a row. What model do we pick? There are a variety of answers. We could pick the most probable model where $p=0.6$. Or we could use all of the models simultaneously, weighting each prediction by the probability of the $p$ used.
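Both options can be sketched on the same 0.1 grid. The equal prior is the assumption stated above; the variable names are illustrative.

```python
from math import comb

heads, tails = 6, 4
grid = [i / 10 for i in range(11)]

# Equal prior credibility for every candidate value of p.
prior = [1 / len(grid)] * len(grid)
likelihood = [
    comb(heads + tails, heads) * p**heads * (1 - p) ** tails for p in grid
]

# Bayes Rule on the grid: multiply, then normalize to sum to 1.
unnormalized = [like * pr for like, pr in zip(likelihood, prior)]
evidence = sum(unnormalized)
posterior = [u / evidence for u in unnormalized]

# Option 1: use only the most probable model.
best_p = grid[posterior.index(max(posterior))]
three_heads_best = best_p**3

# Option 2: average the prediction over all models, weighted by posterior.
three_heads_weighted = sum(p**3 * w for p, w in zip(grid, posterior))

print(f"most probable p = {best_p:.1f}, P(3 heads) = {three_heads_best:.3f}")
print(f"posterior-weighted P(3 heads) = {three_heads_weighted:.3f}")
```

The two answers differ somewhat because the weighted version also gives a vote to the models with larger $p$, whose predictions for three heads in a row are much higher.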

This is the essence of the Bayesian approach to *statistical* inference. It basically says, in the forward direction, we have a model (probability) and we deduce data (events) based on those probabilities. In the reverse or *inverse* direction, we have data and we infer probabilities over the various possible models. 

In the forward direction, I have a **model** of $p$ and I deduce events given that model's probability distribution. In the reverse direction, I have events and infer a probability distribution over **models** of $p$.

It's probability all the way down. The main challenge is to see everything we calculate as a model.

## Another View

So how do we do this *inverse probability* thing?

We start with Bayes Rule:

$P(\theta|D) = \frac{P(D|\theta)P(\theta)}{P(D)}$

where $\theta$ represents one or more (and sometimes many) "models" of our process and then $P(\theta|D)$ is our *posterior* distribution over $\theta$, our Bayesian model of our models. There's nothing actually mysterious about the approach. Sometimes we use the term *hypothesis* instead of model.
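It may help to see the denominator written out, since it is the same total probability step used in the weather example. For a finite set of models $\theta_1, \ldots, \theta_k$ (the index notation is introduced here only for illustration):

$$P(\theta_i|D) = \frac{P(D|\theta_i)P(\theta_i)}{\sum_{j=1}^{k} P(D|\theta_j)P(\theta_j)}$$

and when $\theta$ is continuous, the sum becomes an integral:

$$P(\theta|D) = \frac{P(D|\theta)P(\theta)}{\int P(D|\theta')P(\theta')\,d\theta'}$$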

In the Elvis problem, we have a model of twins and two hypotheses: Elvis's twin was a fraternal twin and Elvis's twin was an identical twin. We started with prior probabilities for each of these hypotheses and then using data we adjusted our probabilities according to Bayes Rule.

In the M&M's problem, we had two models: the first bag was from 1994 and the second from 1996 or the first bag was from 1996 and the second was from 1994. We started out with a 50/50 chance for each model and when we were done, our probabilities for each model (or hypothesis) were updated.

And, as we just saw, in the Weather problem, we had two models: rain or shine. We had prior probabilities for these models, and then data (a forecast of rain). We then reasoned out the posterior probabilities of the models based on the data and our priors.

The main difference between what we did then and what we'll do now is that now our models are almost always over continuous numerical values. The coin's "success rate", $p$, can take on infinitely many values. Even if we limit the precision to two decimal places, there are 101 possible values. More generally, we will be interested in making inferences not just over discrete, categorical events like which M&M bag is which but over an infinite number of possible means and standard deviations.

We're still going to use Bayes Rule; it's just a bit harder. There are several different ways to do this:

1. Grid Method
2. Exact Method
3. Monte Carlo Method
4. Bootstrap Methods

The Exact Method involves parameterizing mathematical functions that represent the likelihood and prior and multiplying them together exactly as described in Bayes Rule. The math often becomes intractable except for the simplest of problems because, with continuous variables especially, "counting" means "integral calculus". The integrals involved in $P(D|\theta)$, $P(D)$ and $P(\theta)$ do not often compose well. This was one of the great obstacles to the practical application of Bayesian statistical inference.
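The coin happens to be one of the simple problems where the exact approach works. The sketch below assumes a Beta prior, which is not named in the text: with a Beta(1, 1) prior (uniform over $p$) and a Binomial likelihood, the posterior is again a Beta distribution whose parameters are the prior parameters plus the observed counts. The helper `beta_pdf` is illustrative.

```python
from math import gamma

def beta_pdf(p, a, b):
    """Density of the Beta(a, b) distribution at p."""
    normalizer = gamma(a + b) / (gamma(a) * gamma(b))
    return normalizer * p ** (a - 1) * (1 - p) ** (b - 1)

# Assumed prior: Beta(1, 1), i.e. uniform over p in [0, 1].
prior_a, prior_b = 1, 1
heads, tails = 6, 4

# Conjugate update: add the observed counts to the prior parameters.
post_a, post_b = prior_a + heads, prior_b + tails

posterior_mean = post_a / (post_a + post_b)
posterior_mode = (post_a - 1) / (post_a + post_b - 2)

print(f"posterior: Beta({post_a}, {post_b})")
print(f"mean = {posterior_mean:.3f}, mode = {posterior_mode:.3f}")
print(f"density at p = 0.6: {beta_pdf(0.6, post_a, post_b):.3f}")
```

No integral has to be evaluated numerically here because the normalizing constant is known in closed form; that convenience is exactly what is lost in most realistic problems.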

The Grid Method involves picking a variety of individual values of $\theta$ and doing the calculations of Bayes Rule "by hand" as it were. This approach can actually be pretty effective but the quality of your results depends on how many values of $\theta$ you're willing to do calculations for...the size of your grid. We will start here because it's easiest to explain and develops the main intuition about the process.

With the advent of the computer, the Exact and Grid approaches are no longer the only ways to perform these calculations. Why? With Markov Chain Monte Carlo methods we can use numerical approaches to estimate the integrals that give the Exact method such problems.

The Monte Carlo Method involves parameterizing mathematical functions as with the Exact method but instead of trying to find an analytical solution for the posterior distribution, we *simulate* events from these distributions and construct a posterior distribution for our parameter which we can analyze.
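As a rough illustration, here is a sketch of one particular Markov Chain Monte Carlo algorithm, a random-walk Metropolis sampler, applied to the coin. The choice of algorithm, the uniform prior, the step size, the burn-in length, and the helper names are all assumptions made for this example.

```python
import random

heads, tails = 6, 4

def unnormalized_posterior(p):
    """Likelihood times a uniform prior; zero outside [0, 1]."""
    if p < 0.0 or p > 1.0:
        return 0.0
    return p**heads * (1 - p) ** tails

def metropolis(n_samples, step=0.2, start=0.5, seed=42):
    """Random-walk Metropolis sampler for p (illustrative helper)."""
    rng = random.Random(seed)
    current = start
    samples = []
    for _ in range(n_samples):
        proposal = current + rng.gauss(0.0, step)
        # Accept with probability min(1, ratio of posterior densities).
        ratio = unnormalized_posterior(proposal) / unnormalized_posterior(current)
        if rng.random() < ratio:
            current = proposal
        samples.append(current)
    return samples

samples = metropolis(50_000)[5_000:]  # drop an initial burn-in portion
mean_p = sum(samples) / len(samples)
print(f"posterior mean of p is roughly {mean_p:.2f}")
```

Only ratios of the unnormalized posterior are used, so $P(D)$ never has to be computed. The samples themselves stand in for the posterior distribution and can be summarized however we like.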

With a renewed interest in Bayesian estimation, research has deepened our understanding of statistical inference, which led to the development of Bootstrap Methods (both parametric and non-parametric) which use simulation and the data themselves to estimate the posterior distribution directly. This result follows mostly from the Law of Large Numbers. This is the approach we will focus on in the remainder of the text because it's easy. Nevertheless, it almost seems like magic.
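A minimal sketch of the non-parametric version on the coin data might look like the following. The number of resamples, the seed, and the helper name are arbitrary choices for illustration.

```python
import random

rng = random.Random(0)

# The observed data: 6 heads (1) and 4 tails (0).
data = [1] * 6 + [0] * 4

def bootstrap_estimates(data, n_resamples, rng):
    """Resample the data with replacement and re-estimate p each time."""
    estimates = []
    for _ in range(n_resamples):
        resample = rng.choices(data, k=len(data))
        estimates.append(sum(resample) / len(resample))
    return estimates

estimates = sorted(bootstrap_estimates(data, 10_000, rng))

# Summarize the spread of the estimates with the middle 95%.
lower = estimates[int(0.025 * len(estimates))]
upper = estimates[int(0.975 * len(estimates))]
mean_estimate = sum(estimates) / len(estimates)

print(f"mean = {mean_estimate:.2f}, 95% interval = ({lower:.1f}, {upper:.1f})")
```

There is no likelihood function or prior written down anywhere in this code; the resampled data do all the work, which is why the approach is easy to apply.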

We will discuss each of these methods in turn but in this course, we will rely mostly on the Bootstrap in later chapters. You will certainly want to know about the Monte Carlo method if you venture into Bayesian modeling or use actual "Big Data".