<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">

# Introduction to Bayesian Statistics

_Authors: Matt Brems (DC), Kiefer Katovich (SF)_

---

### Learning Objectives
- Review the axioms and properties of probability
- Cover the formula for Bayes rule
- Learn the diachronic interpretation of Bayes rule
- Gain an intuition for the different components of the formula
- Tackle the Monty Hall problem with Bayesian statistics
- Complete some additional Bayesian statistics problems

### Lesson Guide
- [Review of probability](#review)
    - [Axioms of probability](#axioms)
    - [Properties of probability](#properties)
- [Bayes rule](#bayes-rule)
    - [The "diachronic" interpretation](#diachronic)
- [Frequentist vs. Bayesian probability](#freq-vs-bayes)
- [Bayes rule in parts](#parts)
- [The Monty Hall problem](#monty-hall)
- [Additional Bayesian statistics problems](#additional)

<a id='review'></a>
## Review of probability

---

### The sample space, event space, and probability function

With $S$ denoting the "sample space" (the set of all possible outcomes) and $F$ denoting the "event space" (the set of all possible events, where each event is a subset of $S$), we have a probability function $P$ defined as:

### $$ P: F \rightarrow [0, 1]$$

The probability function maps every event in the event space to a number in the interval from 0 to 1.


---
<a id='axioms'></a>
### Axioms of probability

**Nonnegativity**

For any event $A$, the probability of the event must be greater than or equal to zero.

### $$ 0 \le P(A) $$

**Unit measure**

The probability of the entire sample space is 1.

### $$ P(S) = 1 $$

**Additivity**

For mutually exclusive (in other words, "disjoint") events $E_1, E_2, \ldots$, the probability of any of the events occurring is equal to the sum of their probabilities.

### $$ P\left(\cup_{i=1}^{\infty}\; E_i \right) = \sum_{i=1}^{\infty} P(E_i) $$
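As a quick check, the three axioms can be verified on a small finite example. The sketch below assumes a fair six-sided die (an example chosen for illustration) and a hypothetical helper `prob`. Exact fractions are used to avoid floating-point rounding.

```python
from fractions import Fraction

# Assumed example: a fair six-sided die, each outcome with probability 1/6.
sample_space = {1, 2, 3, 4, 5, 6}
pmf = {outcome: Fraction(1, 6) for outcome in sample_space}

def prob(event):
    """Probability of an event (a subset of the sample space)."""
    return sum((pmf[outcome] for outcome in event), Fraction(0))

# Nonnegativity: every single-outcome event has probability >= 0.
nonnegative = all(prob({outcome}) >= 0 for outcome in sample_space)

# Unit measure: the whole sample space has probability 1.
unit_measure = prob(sample_space) == 1

# Additivity: for disjoint events, P(E1 or E2) = P(E1) + P(E2).
low, high = {1, 2}, {5, 6}
additive = prob(low | high) == prob(low) + prob(high)

print(nonnegative, unit_measure, additive)
```

The additivity axiom is stated for countably many events; this sketch only checks the finite, two-event case.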

---
<a id='properties'></a>
### Properties of probability

**The probability of no event**

The probability of the empty set, denoted $\emptyset$, is zero.

### $$ P\left(\emptyset \right) = 0 $$

**The probability of A or B occurring (union)**

The probability of event $A$ or event $B$ occurring is equal to the sum of their individual probabilities minus the probability of their intersection (the probability that they both occur).

### $$ P(A \cup B) = P(A) + P(B) - P(A \cap B)$$
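To see why the intersection is subtracted, here is a minimal sketch with an assumed fair die. The events `A` and `B` are made up for illustration. Outcomes in both events would otherwise be counted twice.

```python
from fractions import Fraction

# Assumed example: one roll of a fair six-sided die.
outcomes = {1, 2, 3, 4, 5, 6}
A = {2, 4, 6}   # the roll is even
B = {4, 5, 6}   # the roll is greater than 3

def prob(event):
    """Equally likely outcomes: P(event) = |event| / |sample space|."""
    return Fraction(len(event), len(outcomes))

direct = prob(A | B)                          # count the union directly
by_formula = prob(A) + prob(B) - prob(A & B)  # subtract the double-counted overlap

print(direct, by_formula)
```

Both values come out to 2/3, because the outcomes 4 and 6 belong to both events and must only be counted once.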

**Conditional probability**

The probability of an event conditional on another event is written using a vertical bar between the two events. The probability of event $A$ occurring _given_ that event $B$ occurs is calculated as:

### $$ P(A | B) = \frac{P(A \cap B)}{P(B)} $$

In words: the probability of both $A$ and $B$ occurring, divided by the probability that $B$ occurs at all. This is only defined when $P(B) > 0$.

**Joint probability**

The joint probability of two events $A$ and $B$ is a reformulation of the above equation.

### $$ P(A \cap B) = P(A|B) \; P(B) $$

Verbally, if we want to know the probability that both $A$ and $B$ happen, we can multiply the probability that $B$ happens by the probability that $A$ happens given $B$ happens.
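The two formulas above can be checked against each other on a small example. This sketch assumes a fair six-sided die and two illustrative events. Conditioning on $B$ amounts to restricting attention to the outcomes in $B$.

```python
from fractions import Fraction

# Assumed example: one roll of a fair six-sided die.
outcomes = {1, 2, 3, 4, 5, 6}
A = {2, 4, 6}   # the roll is even
B = {4, 5, 6}   # the roll is greater than 3

def prob(event):
    """Equally likely outcomes: P(event) = |event| / |sample space|."""
    return Fraction(len(event), len(outcomes))

# Conditional probability: P(A | B) = P(A and B) / P(B).
p_a_given_b = prob(A & B) / prob(B)

# Joint probability recovered from the conditional: P(A and B) = P(A | B) P(B).
p_joint = p_a_given_b * prob(B)

print(p_a_given_b, p_joint)
```

Here two of the three outcomes in `B` are even, so $P(A|B) = 2/3$, and multiplying back by $P(B) = 1/2$ returns the joint probability $1/3$.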

**The law of total probability**

Let's say we want to know the probability of the event $B$ occurring across _all_ different events $A$. For example, let's say that we are a judge presiding over a murder trial. $B$ is the event that the suspect's wallet was found at the scene of the murder. We have many hypotheses or possible scenarios in which the wallet is found at the murder scene, one of which is that the suspect was actually at the scene of the crime at the time of the murder.

These different events $A_i$, our scenarios, are disjoint, and together they cover every possibility. The _total probability_ of $B$ is the probability across all of these scenarios that the wallet is found at the murder scene. In other words, regardless of which scenario $A_i$ is true, what is the overall probability that the wallet is at the murder scene?

### $$ P(B) = \sum_{i=1}^n P(B \cap A_i) = \sum_{i=1}^n P(B|A_i)\;P(A_i) $$

![total probability](./assets/images/output_27_0.png)
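With made-up numbers for the courtroom scenarios, the sum can be computed directly. Each term is written as $P(B \cap A_i) = P(B|A_i)\;P(A_i)$ using the joint probability formula above. The scenario names and every probability below are assumptions chosen purely for illustration.

```python
import math

# Hypothetical scenarios A_i with made-up probabilities P(A_i).
# They are disjoint and cover every possibility, so they sum to 1.
p_scenario = {
    "suspect was at the scene": 0.30,
    "wallet was stolen and planted": 0.10,
    "suspect was never near the scene": 0.60,
}

# Made-up probabilities P(B | A_i) that the wallet is found under each scenario.
p_wallet_given = {
    "suspect was at the scene": 0.50,
    "wallet was stolen and planted": 0.90,
    "suspect was never near the scene": 0.01,
}

assert math.isclose(sum(p_scenario.values()), 1.0)

# P(B and A_i) = P(B | A_i) P(A_i), then sum over all scenarios.
p_joint = {name: p_wallet_given[name] * p_scenario[name] for name in p_scenario}
p_wallet = sum(p_joint.values())

print(f"P(wallet found) = {p_wallet:.3f}")
```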

<a id='bayes-rule'></a>
## Bayes rule

---

Bayes Rule relates the probability of $A$ given $B$ to the probability of $B$ given $A$. This rule is critical for performing statistical inference, as we shall see shortly. It is formulated as:

### $$ P(A|B) = \frac{P(B|A)\;P(A)}{P(B)} $$

Let's return to the courtroom example.

Say $A$ is the event that the suspect is guilty.

$B$ is the event that the suspect's wallet was found at the scene of the crime.

Using Bayes rule, we phrase this as: the probability that the suspect is guilty given the suspect's wallet was found at the scene of the crime is equivalent to the probability that the suspect's wallet was found there given the suspect is guilty, times the probability that the suspect is guilty (without evidence) and divided by the total probability that the wallet is found at the scene of the crime.
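A minimal numeric sketch of this calculation follows. All three input probabilities are invented for illustration, and the total probability of the evidence is expanded over just two hypotheses: guilty and not guilty.

```python
from fractions import Fraction

# Made-up numbers for illustration only.
p_guilty = Fraction(3, 10)                 # prior P(A), before any evidence
p_wallet_given_guilty = Fraction(1, 2)     # P(B | A)
p_wallet_given_innocent = Fraction(1, 20)  # P(B | not A)

# Total probability of the evidence: P(B) = P(B|A)P(A) + P(B|not A)P(not A).
p_wallet = (p_wallet_given_guilty * p_guilty
            + p_wallet_given_innocent * (1 - p_guilty))

# Bayes rule: P(A | B) = P(B | A) P(A) / P(B).
p_guilty_given_wallet = p_wallet_given_guilty * p_guilty / p_wallet

print(p_guilty_given_wallet, float(p_guilty_given_wallet))
```

Under these assumed numbers, finding the wallet raises the probability of guilt from 3/10 to 30/37, because the wallet is far more likely to be found if the suspect is guilty than if not.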

<a id='diachronic'></a>
### The "diachronic" interpretation of Bayes Rule

We can rewrite the formula for Bayes Rule in the context of hypotheses and data, as we have already been doing with the courtroom example. The diachronic interpretation concerns the probability of events _over time_: the probability we assign to a hypothesis changes as we collect new data.

In this case we have a model or a statistic, and we are asking for the probability of our model given the data that we have observed.

### $$P\left(model\;|\;data\right) = \frac{P\left(data\;|\;model\right)}{P(data)}\; P\left(model\right)$$


<a id='freq-vs-bayes'></a>
## Frequentist vs. Bayesian probability

---

### Frequentism

Frequentists believe the "true" value of a statistic about a population (for example, the mean) is fixed (and not known). We can infer more about this "true" value by engaging in sampling, testing for effects, and studying relevant parameters of the population.

Say we are flipping a coin and want to know the probability of heads. Frequentists formulate the probability of heads as a limit, defining the true probability of heads as the proportion of heads in an infinite number of flips of that coin.

### $$P(\text{heads}) = \lim_{\text{# of coin flips} \to \infty} \frac{\text{# of heads}}{\text{# of flips}}$$

More generally, we can write this as the proportion of times any event $A$ occurs across an infinite number of observations/experiments (random samples from the sample space).

### $$P(A) = \lim_{\text{# of experiments} \to \infty} \frac{\text{# of occurrences of A}}{\text{# of experiments}} $$
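We cannot flip a coin infinitely many times, but a simulation gives a feel for the limit. The sketch below assumes a fair coin (`true_p_heads = 0.5`) and records the running proportion of heads at a few arbitrary checkpoints.

```python
import random

random.seed(0)  # fixed seed so the run is repeatable

true_p_heads = 0.5  # assumed: a fair coin
checkpoints = {10, 100, 1_000, 10_000, 100_000}

heads = 0
running = {}  # number of flips -> proportion of heads so far
for n in range(1, 100_001):
    heads += random.random() < true_p_heads  # True counts as 1
    if n in checkpoints:
        running[n] = heads / n

for n, proportion in running.items():
    print(f"{n:>7} flips: proportion of heads = {proportion:.4f}")
```

The proportion tends to wander for small numbers of flips and settle near the assumed value as the number of flips grows.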

### Bayesianism

Bayesians believe that data informs us about the distribution of a statistic or event, and as we receive more data our view of the distribution can be updated, further confirming or denying our previous beliefs (but never with total certainty).

For the coin flip example above, we would write out the probability of heads as our belief in the probability of getting heads given the evidence we have from observing coin flips.

### $$ P(\text{heads} \;|\; \text{# of heads observed}) = \frac{P(\text{# of heads observed} \;|\; \text{heads})}{P(\text{# of heads observed})} P(\text{heads}) $$

Here we are representing our updated belief in the probability of flipping heads with:

Our **prior** belief, before observing flips, of the probability of flipping heads: $P(\text{heads})$

The **likelihood** of the data we observe given the chance to flip heads: $P(\text{# of heads observed} \;|\; \text{heads})$

The **total probability** of observing that many heads in coin flips regardless of weighting (or rather, across all coin weightings): $P(\text{# of heads observed})$
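The three pieces above can be combined in a small sketch. It assumes we observed 7 heads in 10 flips, considers eleven candidate coin weightings from 0.0 to 1.0, and starts from a uniform prior over them. The data, the grid, and the prior are all illustrative assumptions.

```python
from math import comb, isclose

# Assumed data: 7 heads observed in 10 flips.
n_flips, n_heads = 10, 7

# Candidate coin weightings and a uniform prior belief over them.
weightings = [i / 10 for i in range(11)]
prior = [1 / len(weightings)] * len(weightings)

# Likelihood of the data under each weighting (binomial probability).
likelihood = [
    comb(n_flips, n_heads) * p**n_heads * (1 - p) ** (n_flips - n_heads)
    for p in weightings
]

# Total probability of the data, summed across all weightings.
total = sum(like * pri for like, pri in zip(likelihood, prior))

# Updated belief in each weighting after seeing the data.
posterior = [like * pri / total for like, pri in zip(likelihood, prior)]

best = weightings[posterior.index(max(posterior))]
print(f"most probable weighting: {best}")
print(f"posterior sums to 1: {isclose(sum(posterior), 1.0)}")
```

With a uniform prior, the most probable weighting on this grid is the observed proportion of heads, 0.7.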

<a id='parts'></a>
## Bayes rule in parts
---

Using the diachronic interpretation of Bayes Rule, we can describe each part with its label, as in our coin flip example above.

### $$P\left(model\;|\;data\right) = \frac{P\left(data\;|\;model\right)}{P(data)}\; P\left(model\right)$$

**The prior**

### $$ \text{prior} = P\left(model\right) $$

The prior is our belief in the model given no additional information. This "model" could be as simple as a statistic like the mean we are measuring, or a complex regression. 

**The likelihood**

### $$ \text{likelihood} = P\left(data\;|\;model\right) $$

The likelihood is the probability of observing our data given the model. For example, assuming that a coin is biased towards heads with a mean rate of heads of 0.9, what is the likelihood that we observe 10 tails and 2 heads in 12 coin flips?

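The likelihood is in fact what many frequentist statistical methods work with: maximum likelihood estimation, for example, chooses the model that makes the observed data most probable.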
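The coin example can be worked out with the binomial formula. In this sketch the helper `binomial_likelihood` is a hypothetical name, and the fair coin is added only as a point of comparison.

```python
from math import comb

def binomial_likelihood(n_heads, n_flips, p_heads):
    """P(exactly n_heads heads in n_flips flips | probability of heads = p_heads)."""
    n_tails = n_flips - n_heads
    return comb(n_flips, n_heads) * p_heads**n_heads * (1 - p_heads) ** n_tails

# Likelihood of 2 heads and 10 tails under the biased-coin model (p = 0.9).
biased = binomial_likelihood(2, 12, 0.9)

# The same data under a fair-coin model, for comparison.
fair = binomial_likelihood(2, 12, 0.5)

print(f"biased coin: {biased:.3e}")
print(f"fair coin:   {fair:.3e}")
```

The data are millions of times more likely under the fair-coin model, which is the kind of comparison the likelihood makes possible.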

**The marginal probability or total probability of the data**

### $$ \text{marginal probability of data} = P(data) $$

The marginal probability of the data is the probability that our data is observed regardless of what model we choose or believe in. We divide the likelihood by this value to ensure that we are only talking about our model within the context of the data occurring. More technically, we divide by this value to ensure that what we get out on the other side is a true probability distribution, one that sums to 1 across all models (more on this later).
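A small sketch shows the normalizing role of the marginal. It assumes only two candidate models (a fair coin and a coin with a 0.9 rate of heads), equal prior belief in each, and the 2-heads-in-12-flips data from the likelihood example. All names here are hypothetical.

```python
from math import comb, isclose

def binomial_likelihood(n_heads, n_flips, p_heads):
    """P(exactly n_heads heads in n_flips flips | probability of heads = p_heads)."""
    n_tails = n_flips - n_heads
    return comb(n_flips, n_heads) * p_heads**n_heads * (1 - p_heads) ** n_tails

# Assumed: two candidate models with equal prior belief.
models = {"fair (p=0.5)": 0.5, "biased (p=0.9)": 0.9}
prior = {name: 0.5 for name in models}

# Likelihood times prior for each model: not yet a probability distribution.
unnormalized = {
    name: binomial_likelihood(2, 12, p) * prior[name] for name, p in models.items()
}

# Marginal probability of the data: sum over every model we consider.
marginal = sum(unnormalized.values())

# Dividing by the marginal makes the beliefs sum to 1.
posterior = {name: value / marginal for name, value in unnormalized.items()}

print(f"sum before dividing: {sum(unnormalized.values()):.6f}")
print(f"sum after dividing:  {sum(posterior.values()):.6f}")
```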

**The posterior**

### $$ \text{posterior} = P\left(model\;|\;data\right) $$

The posterior is our _updated_ belief in the model given the new data we have observed. Bayesian statistics is all about updating a prior belief we have about the world with the data we observe, and so we are transforming our _prior_ belief about the world into this new _posterior_ belief about the world.
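The prior-to-posterior transformation can be repeated as data arrive, with each posterior serving as the prior for the next observation. The sketch below assumes two candidate coin models, an equal starting prior, and a made-up sequence of flips. The function `update` is a hypothetical helper.

```python
from math import isclose

models = {"fair": 0.5, "biased": 0.9}  # assumed P(heads) under each model
belief = {"fair": 0.5, "biased": 0.5}  # assumed starting prior
flips = "HHTHHHHH"                     # made-up data: H = heads, T = tails

def update(prior, flip):
    """Apply Bayes rule for a single coin flip and return the posterior."""
    likelihood = {m: (p if flip == "H" else 1 - p) for m, p in models.items()}
    marginal = sum(likelihood[m] * prior[m] for m in models)
    return {m: likelihood[m] * prior[m] / marginal for m in models}

for flip in flips:
    belief = update(belief, flip)  # the posterior becomes the next prior
    print(flip, {m: round(b, 3) for m, b in belief.items()})

# Updating flip by flip agrees with one update on all the data at once.
batch_biased = (0.9**7 * 0.1) / (0.9**7 * 0.1 + 0.5**8)
print(isclose(belief["biased"], batch_biased))
```

Belief in the biased model rises with each head and drops sharply on the single tail, which that model considers unlikely.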

<a id='monty-hall'></a>

## The Monty Hall problem
---

The Monty Hall problem is a famous probability problem with an unintuitive solution. Framing it in a Bayesian context makes it clear!

[Open up the Monty Hall notebook and tackle the problem.](./monty-hall.ipynb)

<a id='additional'></a>
## Additional Bayesian statistics problems
---

As independent practice, you can tackle some more Bayesian statistics problems:
- Pregnancy screening problem
- Cookie Jar problem
- The German Tank problem
- Dungeons & Dragons dice problems
- M&M's problem

[The questions can be found in this notebook.](bayes-problems.ipynb)