### Credits

Content taken from Skylar Versage and rearranged.

## Objectives

- Describe the difference between frequentist and Bayesian approaches to experiments
- State Bayes' rule and apply it in two examples
- Define posterior, prior, likelihood, evidence
- Solve a discrete Bayes problem by hand

## Introduction

**Example 1**: A highly trained conductor says she can tell the difference between Mozart and Beethoven. We randomly sample excerpts and play them. She guesses 10/10 correctly. 

**Example 2**: A drunk observer then claims he can guess the outcome of a coin flip while the coin is in mid-air. We test him, and he guesses 10/10 correctly.

A **frequentist** might say: “They are both so skilled, I have as much confidence in the conductor’s ability as in the drunk man’s ability to predict the coin toss.”

A **Bayesian** would say: “I’m not so convinced by the drunk guy.”

The Bayesian approach incorporates **prior knowledge** into the interpretation of the experiment’s results.

## 1. Conditional Probabilities [Review]

The probability of an event B given an event A is defined as follows:

> The conditional probability of an event B is the probability that the event will occur given the knowledge that an event A has already occurred. This probability is written

>$$ P(B \mid A) $$

> notation for the probability of B given A. [\[link\]](http://www.stat.yale.edu/Courses/1997-98/101/condprob.htm) [also see All of Statistics p 28].

$$ P(B \mid A) = \frac {P(A \cap B)} {P(A)} $$

You can picture this formula with sets of possible outcomes: among the outcomes in A, we keep the share that also belongs to B.

<img src="images/set_intersection.svg" alt="set intersection" style="width: 300px;"/>

**Exercise**: what's the probability that a die roll gives a value less than 4, knowing that the value is odd?
- identify what A and B are in the problem, to plug them into the formula above,
- identify the possible outcomes for each component of that formula,
- compute the answer to the problem.

<br/>
<details>
<summary>Click here to see the solution below</summary>
- B is "rolling a die with a value less than 4", as a set of possible outcomes it is $ \{ 1, 2, 3 \} $<br/>
- A is "rolling a die with a value that is odd", as a set of possible outcomes it is $ \{ 1, 3, 5 \} $<br/>
- $ A \cap B $, as the intersection of those sets is $ \{ 1, 3 \} $ <br/>
- $ P(B \mid A) = \frac {P(A \cap B)} {P(A)} = \frac {|A \cap B|} {|A|} = \frac {2} {3} $
</details>
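
The same exercise can be checked by brute-force enumeration. This is only a minimal sketch: it assumes a fair six-sided die (all outcomes equally likely), and the helper name `prob` is made up for this illustration.

```python
from fractions import Fraction

# Sample space of a fair six-sided die: every outcome is equally likely (assumption).
omega = {1, 2, 3, 4, 5, 6}

A = {x for x in omega if x % 2 == 1}  # "the value is odd"
B = {x for x in omega if x < 4}       # "the value is less than 4"

def prob(event):
    """Probability of an event when all outcomes of omega are equally likely."""
    return Fraction(len(event), len(omega))

# Definition of conditional probability: P(B | A) = P(A and B) / P(A)
p_b_given_a = prob(A & B) / prob(A)
print(p_b_given_a)  # 2/3
```

Using `Fraction` keeps the result exact, so it matches the hand computation instead of printing a rounded float.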

## 2. Bayes Rule

### 2.1. Law of total probability

Given a partition $ A_1, ..., A_n $ of the sample space $ \Omega $, the probability of an event B can be written as:

$$ P(B) = \sum_i P( B \mid A_i ) \: P (A_i) $$

A special case: the partition is made of an event A and its complement (A and not A, written $ \neg A $):

$$ P(B) = P( B \mid A ) \: P (A) + P( B \mid \neg A ) \: P (\neg A)$$
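
To see the law of total probability at work on numbers, here is a small sketch that reuses the die from the review exercise. The partition (odd values / even values) and the helper names `prob` and `cond_prob` are choices made for this illustration only.

```python
from fractions import Fraction

omega = {1, 2, 3, 4, 5, 6}          # fair six-sided die (assumption)
B = {1, 2, 3}                       # "the value is less than 4"
partition = [{1, 3, 5}, {2, 4, 6}]  # odd / even: disjoint sets that cover omega

def prob(event):
    """Probability of an event when all outcomes are equally likely."""
    return Fraction(len(event), len(omega))

def cond_prob(event, given):
    """P(event | given) = P(event and given) / P(given)."""
    return prob(event & given) / prob(given)

# Sum of P(B | A_i) * P(A_i) over the partition
total = sum(cond_prob(B, A_i) * prob(A_i) for A_i in partition)
print(total, prob(B))  # 1/2 1/2
```

Both numbers agree: summing over the pieces of the partition gives back the direct probability of B.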

### 2.2. Bayes Rule


$$ P(A \mid B) = \frac { P(B \mid A) \: P(A) } {P(B)} $$

- $ P(A \mid B) $ is called the posterior probability
- $ P(B \mid A) $ is called the likelihood
- $ P(A) $ is called the prior probability
- $ P(B) $ is the probability of B, often referred to as the evidence or the normalizing constant (we'll talk about that later)
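
As a rough sketch, the formula translates directly into a tiny function. The name `bayes_rule` and its parameter names are made up here; the numbers come from the die example of the review section (A is "the value is odd", B is "the value is less than 4").

```python
from fractions import Fraction

def bayes_rule(likelihood, prior, evidence):
    """Return the posterior P(A|B) = P(B|A) * P(A) / P(B)."""
    if evidence == 0:
        raise ValueError("P(B) must be non-zero")
    return likelihood * prior / evidence

# Die example: P(B|A) = 2/3, P(A) = 1/2, P(B) = 1/2
posterior = bayes_rule(likelihood=Fraction(2, 3),
                       prior=Fraction(1, 2),
                       evidence=Fraction(1, 2))
print(posterior)  # 2/3
```

You can confirm the result by counting: among the values less than 4, which are 1, 2 and 3, two are odd.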

### 2.3. Let's apply it right away...

**Exercise**: planning a picnic

You are planning a picnic today in summer in Seattle, but the morning is cloudy. What is the chance that it will rain during the day, knowing that:

- 50% of all rainy days start off cloudy
- cloudy mornings are common in Seattle (40% of days start cloudy)
- this month of summer is usually dry (only 3 days out of 30 tend to be rainy).

What is the chance that it will rain during the day?
- identify the terms in the Bayes Rule: what is A, what is B?
- identify which probabilities you know, write them in terms of A and B, and plug them into the formula
- compute the result

<br/>
<details>
<summary>Click here to see the solution below</summary>
What we want to know is
$$P(A \mid B)$$
$$P(rainy \: day \mid cloudy \: morning)$$<br/>

- A is "today is a rainy day"<br/>
- B is "today starts with a cloudy morning"<br/>
<br/>
What we know is:<br/>
<br/>
_"50% of all rainy days start off cloudy"_ translates as $ P(cloudy \: morning \mid rainy \: day) = 0.5 $<br/>
<br/>
_"40% of days start cloudy"_ translates as $ P(cloudy \: morning) = 0.4 $<br/>
<br/>
_"only 3 days out of 30 tend to be rainy"_ translates as $ P(rainy \: day)= 3/30 $<br/>
<br/>

$$ P(rainy \: day \mid cloudy \: morning) = \frac { P(cloudy \: morning \mid rainy \: day) \: P(rainy \: day) } {P(cloudy \: morning)} $$

<br/>

$$ P(rainy \: day \mid cloudy \: morning) = \frac { 0.5 \times 3/30 } { 0.4 } = 0.125 $$

</details>
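
If you want to double-check the arithmetic, the picnic exercise fits in a few lines. The variable names below are made up for this sketch; the three numbers are the ones given in the exercise.

```python
# Known quantities from the exercise
p_cloudy_given_rain = 0.5   # 50% of all rainy days start off cloudy
p_rain = 3.0 / 30           # 3 rainy days out of 30
p_cloudy = 0.4              # 40% of days start cloudy

# Bayes rule: P(rain | cloudy) = P(cloudy | rain) * P(rain) / P(cloudy)
p_rain_given_cloudy = p_cloudy_given_rain * p_rain / p_cloudy
print(round(p_rain_given_cloudy, 3))  # 0.125
```

A cloudy morning raises the chance of rain from the prior of 10% to 12.5%, which is still low.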

**Exercise**: cancer screening

- $ P(cancer) = .01 $
- $ P(tested \: positive \mid cancer) = .9 $
- $ P(tested \: negative \mid no \: cancer) = .9 $

What’s the probability we have cancer given we’ve been tested positive?
- identify the terms in the Bayes Rule: what is A, what is B?
- identify which probabilities you know, write them in terms of A and B, and plug them into the formula
- compute the result

<br/>
<details>
<summary>Click here</summary>
What we want to know is
$$P(A \mid B)$$
$$P(cancer \mid tested \: positive)$$<br/>

- B is "tested positive"<br/>
- A is "cancer"<br/>
<br/>
What we know is:<br/>
<br/>
- $ P(cancer) = .01 $<br/>
- $ P(tested \: positive \mid cancer) = .9 $<br/>
- $ P(tested \: negative \mid no \: cancer) = .9 $<br/>
<br/>

$$ P(cancer \mid tested \: positive) = \frac { P(tested \: positive \mid cancer) \: P(cancer) } {P(tested \: positive)} $$

<br/>

$$ P(cancer \mid tested \: positive) = \frac { 0.9 \times 0.01 } {P(tested \: positive)} $$

</details>

How do you obtain:

$$ P(tested \: positive) $$

<details>
<summary>Click here</summary>
<br/>
$ P(tested \: positive) = $<br/>
$ P(tested \: positive \mid cancer) \: P (cancer) + P(tested \: positive \mid no \: cancer) \: P (no \: cancer) $

</details>
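
Putting the two steps together, a short sketch can finish the computation. It assumes that $ P(tested \: positive \mid no \: cancer) $ is the complement of $ P(tested \: negative \mid no \: cancer) $; the variable names are made up for this illustration.

```python
# Known quantities from the exercise
p_cancer = 0.01
p_pos_given_cancer = 0.9
p_neg_given_healthy = 0.9

# Complements
p_healthy = 1 - p_cancer
p_pos_given_healthy = 1 - p_neg_given_healthy  # false positive rate

# Law of total probability: P(tested positive)
p_pos = p_pos_given_cancer * p_cancer + p_pos_given_healthy * p_healthy

# Bayes rule: P(cancer | tested positive)
p_cancer_given_pos = p_pos_given_cancer * p_cancer / p_pos

print(round(p_pos, 3))               # 0.108
print(round(p_cancer_given_pos, 3))  # 0.083
```

Even with a positive test, the probability of cancer is only about 8%, because the disease is rare and false positives outnumber true positives.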


### 2.4. How to obtain it (or recover it from your long term memory)

<img src="images/set_intersection.svg" alt="set intersection" style="width: 300px;"/>

Remember conditional probability?

$$ P(B \mid A) = \frac {P(A \cap B)} {P(A)} $$

This can also be rewritten as:

$$ P(B \mid A) \: P(A) = P(A \cap B) $$

Now, $ A \cap B = B \cap A $, right? so...

$$ P(B \mid A) \: P(A) = P(A \cap B) = P(B \cap A) = P(A \mid B) \: P(B) $$

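Let's put them together:

$$ P(B \mid A) \: P(A) = P(A \mid B) \: P(B) $$

Dividing both sides by $ P(B) $ (assuming $ P(B) > 0 $) gives Bayes Rule:

$$ P(A \mid B) = \frac { P(B \mid A) \: P(A) } {P(B)} $$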


### 2.5. Components revisited

**Prior Probability**:
- A PMF / PDF representing your initial beliefs about the parameter(s)
- The initial belief is less represented in the posterior as more data is incorporated
- How do we obtain it? Previous studies, the researcher’s intuition, expert opinion... If all else fails, use an uninformative prior

**Likelihood**:
- The probability of observing the data given the parameter(s)
- e.g. What is the likelihood of 3 heads in a row given that the probability of heads is 0.7?

**Posterior Probability**:
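- Proportional to the product of prior and likelihood; dividing by the normalizing constant turns it into a probability (Bayesian update)
- The posterior probability becomes the prior of the next Bayesian update

**Normalizing Constant** (also called the evidence):
- The probability of observing the data, summed over all hypotheses
- In Bayesian analysis, this term ensures the posterior probabilities sum to 1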
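
To make the likelihood bullet concrete, here is a minimal sketch for the "3 heads in a row" question. It assumes the flips are independent, so the probabilities simply multiply; the variable names are made up.

```python
p_heads = 0.7   # the parameter: probability of heads on a single flip
n_heads = 3     # the data: three heads in a row

# Likelihood of the data given the parameter, assuming independent flips
likelihood = p_heads ** n_heads
print(round(likelihood, 3))  # 0.343
```

Note that this is a statement about the data given the parameter, not about the parameter itself. Getting from one to the other is exactly what the prior and the normalizing constant are for.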


### 2.6. The "diachronic" interpretation

In [Think Bayes, p. 5](http://www.greenteapress.com/thinkbayes/thinkbayes.pdf) we find:

> There is another way to think of Bayes’s theorem: it gives us a way to update the probability of a hypothesis, H, in light of some body of data, D.
This way of thinking about Bayes’s theorem is called the diachronic interpretation. “Diachronic” means that something is happening over time; in this case the probability of the hypotheses changes, over time, as we see new data.

>Rewriting Bayes’s theorem with H and D yields:

> $$ P(H \mid D) = \frac {P(H) \: P(D \mid H)} {P(D)}$$

> In this interpretation, each term has a name:

>- $P(H)$ is the probability of the hypothesis before we see the data, called the prior probability, or just prior.
>- $P(H \mid D)$ is what we want to compute, the probability of the hypothesis after we see the data, called the posterior.
>- $P(D \mid H)$ is the probability of the data under the hypothesis, called the likelihood.
>- $P(D)$ is the probability of the data under any hypothesis, called the normalizing constant.



If we set aside the normalizing constant, which is the same for every hypothesis, what remains is a proportionality between prior, likelihood and posterior:

$$ posterior \propto likelihood \times prior $$
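
In practice this means you can compute likelihood times prior for every hypothesis and normalize at the end. Here is a sketch using the numbers of the cancer screening exercise; the dictionary layout and names are just one possible way to write it, and the 0.1 for "no cancer" assumes a false positive rate of 1 - 0.9.

```python
# Hypotheses and their priors
priors = {"cancer": 0.01, "no cancer": 0.99}
# P(tested positive | hypothesis); 0.1 assumes the complement of the 0.9 true negative rate
likelihoods = {"cancer": 0.9, "no cancer": 0.1}

# Step 1: likelihood x prior, not yet a probability distribution
unnormalized = {h: likelihoods[h] * priors[h] for h in priors}

# Step 2: the normalizing constant is the sum over all hypotheses
evidence = sum(unnormalized.values())

# Step 3: divide so that the posteriors sum to 1
posteriors = {h: unnormalized[h] / evidence for h in unnormalized}

for h in sorted(posteriors):
    print(h, round(posteriors[h], 3))
# cancer 0.083
# no cancer 0.917
```

The normalizing constant never has to be known in advance: it falls out of the unnormalized values.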


## 3. Prior in Action [example]

Let's take an example.

You have a drawer of 100 coins, 10 of which are biased.

- $ P ( head \mid fair ) = 0.5 $
- $ P ( head \mid biased ) = 0.25 $

You randomly choose a coin and flip it once. It comes up heads.
1. What is $ P(fair \mid head) $ ?
2. What if you flip it a second time and it comes up heads again?


### 3.1. First loop: prior from the drawer

After the first flip, $D = [head]$

H = "fair coin"

$ P (fair \mid head) = ... $

<br/>
<details>
<summary>Click here to see the solution below</summary>
<br/>
$ P (fair \mid head) = \frac { P (head \mid fair) \: P (fair) } { P (head \mid fair) \: P (fair) + P (head \mid biased) \: P (biased) } $<br/>
<br/>
$ P (fair \mid head) = \frac { 0.5 \times 0.9 } { 0.5 \times 0.9 + 0.25 \times 0.1 } $<br/>
</details>

In [1]:
P_fair_head = (0.5 * 0.9) / (0.5 * 0.9 + 0.25 * 0.1)
print(P_fair_head)

0.947368421053


H = "biased coin"

$ P (biased \mid head) = ... $

<br/>
<details>
<summary>Click here to see the solution below</summary>
<br/>
$ P (biased \mid head) = \frac { P (head \mid biased) \: P (biased) } { P (head \mid fair) \: P (fair) + P (head \mid biased) \: P (biased) } $<br/>
<br/>
$ P (biased \mid head) = \frac { 0.25 \times 0.1 } { 0.5 \times 0.9 + 0.25 \times 0.1 } $<br/>
</details>

In [2]:
P_biased_head = (0.25 * 0.1) / (0.5 * 0.9 + 0.25 * 0.1)
print(P_biased_head)

0.0526315789474


### 3.2. New evidence! (Second loop)

After a second flip, $D = [head, head]$

We'll use our previous results as priors:

$$ P (fair) = 0.947 $$
$$ P (biased) = 0.053 $$

And consider that we've just observed one new piece of evidence, $ D=[head] $

H = "fair coin"

$ P (fair \mid head) = ... $

<br/>
<details>
<summary>Click here to see the solution below</summary>
<br/>
$ P (fair \mid head) = \frac { P (head \mid fair) \: P (fair) } { P (head \mid fair) \: P (fair) + P (head \mid biased) \: P (biased) } $<br/>
<br/>
$ P (fair \mid head) = \frac { 0.5 \times 0.947 } { 0.5 \times 0.947 + 0.25 \times 0.053 } $<br/>
</details>

In [3]:
P_fair_head = (0.5 * 0.947) / (0.5 * 0.947 + 0.25 * 0.053)
print(P_fair_head)

0.972778633796


H = "biased coin"

$ P (biased \mid head) = ... $

<br/>
<details>
<summary>Click here to see the solution below</summary>
<br/>
$ P (biased \mid head) = \frac { P (head \mid biased) \: P (biased) } { P (head \mid fair) \: P (fair) + P (head \mid biased) \: P (biased) } $<br/>
<br/>
$ P (biased \mid head) = \frac { 0.25 \times 0.053 } { 0.5 \times 0.947 + 0.25 \times 0.053 } $<br/>
</details>

In [4]:
P_biased_head = (0.25 * 0.053) / (0.5 * 0.947 + 0.25 * 0.053)
print(P_biased_head)

0.0272213662044


### 3.3. Loop N times for head

If I go on drawing heads for a long time...

In [5]:
p_head_fair = 0.5
p_head_biased = 0.25

p_fair_prior = 0.9
p_biased_prior = 0.1

The update we're building here uses the likelihood of each observation to turn the prior into a posterior probability.

In [6]:
# we draw head
p_fair_posterior = ((p_head_fair * p_fair_prior)
                    / (p_head_fair * p_fair_prior
                        + p_head_biased * p_biased_prior) )

print("p(fair|D) = {}".format(p_fair_posterior))

p_biased_posterior = ((p_head_biased * p_biased_prior)
                    / (p_head_fair * p_fair_prior
                        + p_head_biased * p_biased_prior) )

print("p(biased|D) = {}".format(p_biased_posterior))


p(fair|D) = 0.947368421053
p(biased|D) = 0.0526315789474


In [7]:
p_fair_prior = p_fair_posterior
p_biased_prior = p_biased_posterior
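
To see what happens if we keep drawing heads, the update above can be wrapped in a loop. This is only a sketch: the number of flips (5) and the `history` list are arbitrary choices for this illustration.

```python
p_head_fair = 0.5
p_head_biased = 0.25

# Start from the drawer: 90 fair coins, 10 biased ones
p_fair = 0.9
p_biased = 0.1

history = []
for flip in range(1, 6):
    # Probability of drawing head under the current beliefs
    evidence = p_head_fair * p_fair + p_head_biased * p_biased
    # Posterior of this flip, which becomes the prior of the next one
    p_fair, p_biased = (p_head_fair * p_fair / evidence,
                        p_head_biased * p_biased / evidence)
    history.append(p_fair)
    print("after {} heads: p(fair|D) = {:.4f}".format(flip, p_fair))
```

Each head is twice as likely under "fair" as under "biased" (0.5 vs 0.25), so the odds in favor of the fair coin double at every flip and p(fair|D) creeps towards 1.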

### 3.4. Can we design a generic Bayesian update loop?

In [8]:
p_head_fair = 0.5
p_head_biased = 0.25

p_tail_fair = 1 - p_head_fair
p_tail_biased = 1 - p_head_biased

p_fair_prior = 0.9
p_biased_prior = 0.1

# evidence of previous example
E = ['h','h']

# Fake evidence for fair
#E = ['h','t','h','t','h','t','h','t','h','t','h','t','h','t']

# Fake evidence for biased
#E = ['t','t','t','h','t','t','t','h','t','t','t','h']

In [9]:
for el in E:
    print("Drew {}".format(el))
    
    if el == 'h':
        # we draw head
        p_fair_posterior = ((p_head_fair * p_fair_prior)
                            / (p_head_fair * p_fair_prior
                                + p_head_biased * p_biased_prior) )
        p_biased_posterior = ((p_head_biased * p_biased_prior)
                            / (p_head_fair * p_fair_prior
                                + p_head_biased * p_biased_prior) )
    elif el == 't':
        # we draw tail
        p_fair_posterior = ((p_tail_fair * p_fair_prior)
                            / (p_tail_fair * p_fair_prior
                                + p_tail_biased * p_biased_prior) )
        p_biased_posterior = ((p_tail_biased * p_biased_prior)
                            / (p_tail_fair * p_fair_prior
                                + p_tail_biased * p_biased_prior) )

    print("   p(fair|D) = {}".format(p_fair_posterior))
    print("   p(biased|D) = {}".format(p_biased_posterior))

    p_fair_prior = p_fair_posterior
    p_biased_prior = p_biased_posterior

Drew h
   p(fair|D) = 0.947368421053
   p(biased|D) = 0.0526315789474
Drew h
   p(fair|D) = 0.972972972973
   p(biased|D) = 0.027027027027
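
The four branches of the loop all follow the same pattern, so they can be folded into one function that works for any number of hypotheses. This is a possible refactoring, not the only one: `bayes_update`, `likelihood_table` and `beliefs` are names invented for this sketch.

```python
def bayes_update(priors, likelihoods):
    """One Bayesian update.

    priors:      dict mapping hypothesis -> P(hypothesis)
    likelihoods: dict mapping hypothesis -> P(observation | hypothesis)
    Returns a dict mapping hypothesis -> P(hypothesis | observation).
    """
    unnormalized = {h: likelihoods[h] * priors[h] for h in priors}
    evidence = sum(unnormalized.values())  # normalizing constant
    return {h: value / evidence for h, value in unnormalized.items()}

# P(observation | hypothesis) for each possible observation
likelihood_table = {
    'h': {'fair': 0.5, 'biased': 0.25},
    't': {'fair': 0.5, 'biased': 0.75},
}

beliefs = {'fair': 0.9, 'biased': 0.1}
for el in ['h', 'h']:
    # The posterior of this step is the prior of the next one
    beliefs = bayes_update(beliefs, likelihood_table[el])
    print("Drew {}: {}".format(el, beliefs))
```

Because the denominator is computed once from the same terms as the numerators, there is no way for the two to get out of sync, which is an easy mistake to make when the formula is copied by hand for each branch.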


## What's next?

Imagine we have hypotheses about all the possible probabilities of the coin we're flipping, and we want to estimate how unfair it is: what is its probability of heads/tails?

Let's imagine all possible hypotheses and assign priors to them.

Let's "feed" evidence into a loop using:

$$ posterior \propto likelihood \times prior $$

![](images/bayes-update.png)
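
As a teaser, here is a rough sketch of that idea. Everything in it is an assumption made for illustration: the grid of 11 candidate values for P(heads), the uniform prior, and the made-up evidence `E`.

```python
# Candidate values for P(heads): 0.0, 0.1, ..., 1.0 (an arbitrary grid)
hypotheses = [i / 10.0 for i in range(11)]

# Uniform prior: every candidate starts out equally plausible (assumption)
beliefs = {p: 1.0 / len(hypotheses) for p in hypotheses}

# Made-up evidence: 3 heads and 2 tails
E = ['h', 't', 'h', 'h', 't']

for el in E:
    # Likelihood of the observation under hypothesis p: p for head, 1 - p for tail
    unnormalized = {p: (p if el == 'h' else 1 - p) * beliefs[p] for p in beliefs}
    evidence = sum(unnormalized.values())
    beliefs = {p: value / evidence for p, value in unnormalized.items()}

best = max(beliefs, key=beliefs.get)
print("most probable P(heads): {}".format(best))  # most probable P(heads): 0.6
```

With 3 heads out of 5 flips and a flat prior, the most probable candidate is 0.6, the observed frequency. A different prior would pull the answer elsewhere, at least until enough evidence comes in.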