# Probability and Bayes

In [None]:
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
from src.piApprox import pi_approx
import src.monty_hall as mh

%matplotlib inline

## Calculating Probabilities

Note: some of this text comes from [_OpenIntro Statistics_](https://www.openintro.org/book/os/), Chapter 3.

When all outcomes are equally likely, calculating a probability is a matter of dividing the number of outcomes that make up the event you're exploring by the number of all possible outcomes:

$$\large P(Event) = \frac{|Event|}{|Sample\ Space|} $$
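As a small illustration of this counting formula, the sketch below enumerates the sample space for one roll of a fair six-sided die and computes the probability of an even roll. The names `sample_space` and `event` are illustrative only, and the approach assumes every outcome is equally likely.

```python
from fractions import Fraction

# Sample space for one roll of a fair six-sided die (outcomes assumed equally likely)
sample_space = {1, 2, 3, 4, 5, 6}

# Event of interest: the roll is even
event = {outcome for outcome in sample_space if outcome % 2 == 0}

# P(Event) = |Event| / |Sample Space|
p_event = Fraction(len(event), len(sample_space))
print(p_event)
```

Using `Fraction` keeps the result exact instead of introducing floating-point rounding.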

We are often interested in calculating the probability that *either* event $A$ or event $B$ occur, or the probability that *both* events $A$ and $B$ occur.

In describing these sorts of probabilities, we generally borrow some notation from set theory and use the set "union" symbol ($\cup$) for "or" and the set "intersection" symbol ($\cap$) for "and".

(If I collect the $A$ and $B$ possibilities into a set, then to ask about the probability of $A$ **or** $B$ is to ask about the probability of an event occurring from the **union** of $A$ and $B$, $A\cup B$. Similarly, to ask about the probability of $A$ **and** $B$ is to ask about the probability of an event occurring from the **intersection** of $A$ and $B$, $A\cap B$.)

### General Addition Rule

The probability that either $A$ or $B$ will occur can be calculated by adding each individual probability, and then subtracting the probability that both occur together:

$$\large P(A \cup B) = P(A) + P(B) - P(A \cap B)$$

Remember, $P(A \cap B)$ expresses the overlap between the two events - if you don't subtract that overlap, then you double count the instances when **both** $A$ and $B$ occur!
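The addition rule can be checked by brute-force counting. The sketch below uses two illustrative events on a single fair die, "the roll is even" and "the roll is greater than 4", and compares the direct count of the union against the rule. The events and the helper `prob` are assumptions chosen for the demonstration.

```python
from fractions import Fraction

sample_space = set(range(1, 7))                # one fair six-sided die
A = {x for x in sample_space if x % 2 == 0}    # even roll
B = {x for x in sample_space if x > 4}         # roll greater than 4

def prob(event):
    """Probability of an event by counting equally likely outcomes."""
    return Fraction(len(event), len(sample_space))

direct = prob(A | B)                           # count the union directly
by_rule = prob(A) + prob(B) - prob(A & B)      # general addition rule
print(direct, by_rule)
```

The outcome 6 belongs to both events, so leaving out the subtraction would count it twice.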

### Multiplication Rule for Independent Processes

A special case arises when the outcome of $A$ has no bearing on the outcome of $B$. We say these two events are **independent** (e.g. rolling a die and tossing a coin).

If $A$ and $B$ represent events from two different and **independent** processes, then the probability
that both $A$ and $B$ occur can be calculated as the product of their separate probabilities:

$$\large P(A \cap B) = P(A) * P(B)$$
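A quick simulation can make the multiplication rule concrete. The sketch below simulates a fair die and a fair coin as two independent processes and compares the observed frequency of "even roll and heads" with the product of the separate frequencies. The seed, the sample size, and the choice of events are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

die = rng.integers(1, 7, size=n)    # faces 1..6
coin = rng.integers(0, 2, size=n)   # 0 = tails, 1 = heads

even = die % 2 == 0
heads = coin == 1

p_joint = (even & heads).mean()           # observed P(even and heads)
p_product = even.mean() * heads.mean()    # P(even) * P(heads)
print(p_joint, p_product)
```

Because this is a simulation, the two numbers agree only approximately; the gap shrinks as `n` grows.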

### 🧠 Knowledge Check

### 1) AND Question:

What is the probability of rolling a 5 on a fair die _and_ getting a tails on a fair coin toss?

**Your answer here**:

- 



<details>
    <summary>Answer</summary>

We're checking for the intersection of these sets. Of the six possible outcomes on a die roll, only one (the 5) will do. So the chance of getting a 5 on a die is 1/6. Of the two possible outcomes on a coin toss, again only one (tails) will do. So the chance of getting tails on a coin toss is 1/2.

So the calculation is: $$\large P(5 \cap tails) = \left(\frac{1}{6}\right)*\left(\frac{1}{2}\right) = \frac{1}{12}$$
</details>

### 2) OR Question: 

What is the probability of rolling a 5 on a die _or_ getting a tails on a coin toss?

**Your answer here**:

- 



<details>
    <summary>Answer</summary>
    
   We're now checking for the union of these sets. Here we want to count all the die-coin combinations where we have a 5 on the die AND all the die-coin combinations where we have a tails on the coin. 

$$\large P(5 \cup tails) $$

**BUT:**

If the die is 5, that includes two possibilities: 5-heads and **5-tails**.

If the coin is tails, that includes six possibilities: 1-tails, 2-tails, 3-tails, 4-tails, **5-tails**, and 6-tails.

But then we've counted the combination where **both** the 5 and the tails occur **twice**.

So the correct calculation is the sum of the individual probabilities **less the probability of their intersection**:

$$\large P(5 \cup tails) = \left(\frac{1}{6}\right) + \left(\frac{1}{2}\right) - \left(\left(\frac{1}{6}\right)*\left(\frac{1}{2}\right)\right) = \frac{7}{12} $$
    </details>
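Once you have attempted both questions, you can verify your answers by enumerating every die-coin combination. This is a minimal sketch; the variable names are illustrative, and it assumes all twelve combinations are equally likely.

```python
from fractions import Fraction
from itertools import product

# All 12 equally likely (die, coin) pairs
outcomes = list(product(range(1, 7), ["heads", "tails"]))

five = {o for o in outcomes if o[0] == 5}            # die shows 5
tails = {o for o in outcomes if o[1] == "tails"}     # coin shows tails

p_and = Fraction(len(five & tails), len(outcomes))   # intersection
p_or = Fraction(len(five | tails), len(outcomes))    # union
print(p_and, p_or)
```

Because sets never store duplicates, the union counts the 5-tails combination only once, which is exactly the correction the addition rule makes.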

## Enough Talk - Let's Explore in Python!

### Mushroom dataset

Let's look at a modified version of the Mushroom dataset from UCI [here](https://archive.ics.uci.edu/ml/datasets/Mushroom). Each row in this dataset corresponds to one observation (one mushroom). 

In [None]:
df = pd.read_csv('data/Mushrooms_cleaned.csv')
df.head()

In [None]:
df.describe()

#### 1) If you picked a row from this dataset at random, what is the probability it corresponds to a bruised mushroom? 

In other words, find $P(bruised)$

In [None]:
df['bruised'].value_counts(normalize=True)

In [None]:
print(len(df.index))
print(len(df.loc[df['bruised'] == True].index))

len(df.loc[df['bruised'] == True]) / len(df)

In [None]:
# Another way
p_bruised = df[df['bruised'] == True].shape[0]/df.shape[0]
p_bruised

In [None]:
# Let's see...
df.sample(1)

#### 2) What is the probability you pick a row corresponding to a mushroom that is bruised _AND_ edible?

$P(edible \cap bruised)$

BUT! Are they independent events ...?

In [None]:
p_bruised_and_edible = df[(df['bruised'] == True) & (df['edible-poisonous'] == 'edible')].shape[0]/df.shape[0]

p_bruised_and_edible

Are being bruised and being edible independent of each other?

> Formally, $A$ and $B$ are *independent* if and only if the probability that *both* $A$ *and* $B$ happen is:
> 
> $$\large P(A \cap B) = P(A) * P(B)$$

In [None]:
p_bruised

In [None]:
p_edible = len(df[df['edible-poisonous'] == 'edible'])/len(df)
p_edible

In [None]:
p_bruised * p_edible

In [None]:
# Compare with a tolerance: exact equality between floats is fragile
np.isclose(p_bruised_and_edible, p_bruised * p_edible)
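To see the independence check in isolation, here is a sketch on a tiny made-up table. The data and the column names `bruised` and `edible` are invented for illustration and are not the mushroom dataset. It assumes boolean columns, so that `.mean()` gives the proportion of `True` values.

```python
import numpy as np
import pandas as pd

# Toy data invented for illustration; NOT the mushroom dataset
toy = pd.DataFrame({
    "bruised": [True, True, False, False, True, False, True, False],
    "edible":  [True, False, True, False, True, False, True, False],
})

p_a = toy["bruised"].mean()                     # P(bruised)
p_b = toy["edible"].mean()                      # P(edible)
p_ab = (toy["bruised"] & toy["edible"]).mean()  # P(bruised and edible)

# Independence would require P(A and B) to equal P(A) * P(B)
looks_independent = np.isclose(p_ab, p_a * p_b)
print(p_ab, p_a * p_b, looks_independent)
```

In this toy table the joint proportion differs from the product, so the two columns are not independent.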

## Enter: Conditional Probability

### When do we compute conditional probabilities? 

We need to compute conditional probabilities when the outcome of an event depends on the outcome of another event (**dependent** events). The conditional probability of an event is the probability of that event *given* that another event has occurred.


When events _are_ independent, the rule for probabilistic AND (I'll use '$\cap$' below) is simple:

$$\large P(A\cap B) = P(A) * P(B)$$

But the more general rule, which includes non-independent events, is:

$$\large P(A\cap B) = P(A | B) * P(B)$$

> (this is the general multiplication rule; the law of total probability, which we'll revisit later, is built from it!)

In fact, this is the definition of conditional probability. Rearranging:

$$\large P(A | B) = \frac{P(A\cap B)}{P(B)}$$

The `|` here should be read as "given". We are given some information, $B$, and thus it reduces our sample space!

In [None]:
edible = df[df['edible-poisonous'] == 'edible']
edible['bruised'].value_counts(normalize=True)
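The cell above computes a conditional probability by restricting the data to the rows where the condition holds. The sketch below shows, on a small made-up table, that this gives the same number as the definition $P(A | B) = P(A\cap B) / P(B)$. The data and the column names are invented for illustration.

```python
import pandas as pd

# Toy data invented for illustration; NOT the mushroom dataset
toy = pd.DataFrame({
    "bruised": [True, True, False, False, True, False, True, False],
    "edible":  [True, False, True, False, True, False, True, False],
})

# Route 1: the definition, P(A | B) = P(A and B) / P(B)
p_b = toy["edible"].mean()
p_ab = (toy["bruised"] & toy["edible"]).mean()
by_definition = p_ab / p_b

# Route 2: shrink the sample space to the rows where B holds, then count A
by_filtering = toy.loc[toy["edible"], "bruised"].mean()

print(by_definition, by_filtering)
```

Both routes agree because dividing by $P(B)$ is the same as renormalizing over the reduced sample space.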

## Bayesianism

The Bayesian treats probabilities as **subjective**, and in particular as rational **degrees of belief** in states of affairs. And the classic use case for Bayesian reasoning is in **updating** one's subjective probability about something, **conditional on** some new evidence that comes in.

### A Famous Example

Suppose some rare disease affects 1 in 100,000 people. There is a test for it, though it is imperfect: 5% of the people who have the disease will test negative and 4% of the people who don't have the disease will test positive for it. You take the test and test positive. Before the test the probability that you had the disease was only 1 in 100,000. But now, with this new information of the positive test, how should you judge the probability that you have the disease?

We can use **Bayes's Theorem**:

$\large P(h | e) = \frac{P(e | h)P(h)}{P(e)}$.

Terminology:
<table>
    <tr>
        <th>term</th>
        <th>name</th>
    </tr>
    <tr>
        <td>P(h)</td>
        <td>prior</td>
    </tr>
    <tr>
        <td>P(h|e)</td>
        <td>posterior</td>
    </tr>
    <tr>
        <td>P(e|h)</td>
        <td>likelihood</td>
    </tr>
    <tr>
        <td>P(e)</td>
        <td>scaler</td>
    </tr>
</table>

That is, for our problem: the (new) probability that I have the disease (the posterior) equals the unconditional probability that I have the disease (the prior) multiplied by the probability that someone who has the disease tests positive (the likelihood), divided by the overall probability of testing positive (the scaler, also called the evidence or normalizing constant).

To calculate the denominator, we'll need to make use of the **Law of Total Probability**.

Remember that the Law of Total Probability tells us how to calculate an unconditional probability given a partitioning collection of conditional probabilities. In a Bayesian context, we can think of this law as saying: Suppose there are $n$-many hypotheses $h_1, ... , h_n$, that could explain some bit of evidence $e$. If I know all of the **conditional** probabilities $P(e | h_1), ... , P(e | h_n)$, then I can use the Law of Total Probability to calculate the **unconditional probability** $P(e)$.
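As a sketch of the Law of Total Probability for $n$ hypotheses, the helper below sums $P(h_i) * P(e | h_i)$ over a partition. The function name `total_probability` and the three-urn numbers are hypothetical, chosen only to show the mechanics.

```python
import numpy as np

def total_probability(priors, likelihoods):
    """Return P(e) = sum_i P(h_i) * P(e | h_i) for hypotheses that partition the sample space."""
    priors = np.asarray(priors, dtype=float)
    likelihoods = np.asarray(likelihoods, dtype=float)
    if not np.isclose(priors.sum(), 1.0):
        raise ValueError("priors must sum to 1 (the hypotheses must form a partition)")
    return float(np.sum(priors * likelihoods))

# Hypothetical example: one of three urns is chosen with these probabilities,
# and each urn has its own chance of yielding a red ball
p_red = total_probability(priors=[0.5, 0.3, 0.2], likelihoods=[0.1, 0.6, 0.9])
print(p_red)
```

The priors must sum to 1 because the hypotheses partition the possibilities; the likelihoods carry no such constraint.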

In our case, there are only two possible hypotheses: Either I have the disease or I do not.

So if we put the Law of Total Probability together with Bayes's Theorem here we get:

$\large P(h_1|e) = \frac{P(h_1)*P(e|h_1)}{P(h_1)*P(e|h_1)+P(h_2)*P(e|h_2)}$, where:

$h_1$: I have the disease

$h_2$: I do not have the disease

$e$: I test positive

In [None]:
(0.00001 * 0.95) / (0.00001 * 0.95 + 0.99999 * 0.04)

# Notice that the likelihoods (P(e | h_1) and P(e | h_2)) do NOT sum to 1!

The probability that I have the disease is still less than 1 in 4000 (roughly 1 in 4,200), even after a positive test!
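The same calculation can be written with named quantities, which makes it easier to see where each number from the problem statement enters. This is a minimal sketch; the function name `posterior` and the variable names are illustrative.

```python
def posterior(prior, p_e_given_h, p_e_given_not_h):
    """Bayes's Theorem for two hypotheses: h and not-h."""
    # Law of Total Probability gives the denominator P(e)
    evidence = prior * p_e_given_h + (1 - prior) * p_e_given_not_h
    return prior * p_e_given_h / evidence

prior = 1 / 100_000            # P(h_1): the disease affects 1 in 100,000 people
sensitivity = 1 - 0.05         # P(e | h_1): 5% of people with the disease test negative
false_positive_rate = 0.04     # P(e | h_2): 4% of people without the disease test positive

p_disease = posterior(prior, sensitivity, false_positive_rate)
print(p_disease, 1 / p_disease)
```

The posterior stays small because the false positives among the many healthy people far outnumber the true positives among the few who have the disease.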

## Level Up: Analytically Intractable Problems and Indeterministic Approximation Methods

In general, there will be infinitely many probability distributions that could have produced the observations we've made, so we cannot check them all one by one.

A Bayesian might make use of *Monte Carlo sampling* here:

(i) choose some distribution-defining parameter values at random; <br/>
(ii) calculate the likelihood of the data having come from the distribution those parameters define; <br/>
(iii) "step" randomly to a new set of parameter values and repeat (ii); <br/>
(iv) if the newly calculated likelihood is higher than the old one, take the next "step" *from* those new parameter values; otherwise, step again from the old ones; <br/>
(v) repeat until the likelihood stops increasing.
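The steps above can be sketched as a random-walk search for the mean of a normal model. This is a simplified illustration under stated assumptions: the data are synthetic, the standard deviation is treated as known, and the loop only ever accepts uphill steps, whereas full Markov chain Monte Carlo methods such as Metropolis-Hastings also accept some downhill steps. All names here are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic observations; the "true" mean of 3.0 is an assumption for the demo
data = rng.normal(loc=3.0, scale=1.0, size=200)

def log_likelihood(mu, data, sigma=1.0):
    """Log-likelihood of the data under Normal(mu, sigma), up to an additive constant."""
    return -0.5 * np.sum((data - mu) ** 2) / sigma**2

# (i) start from a randomly chosen parameter value
mu = rng.uniform(-10, 10)
# (ii) score it (the log is monotone, so ranking by log-likelihood is equivalent)
best = log_likelihood(mu, data)

for _ in range(5000):
    # (iii) step randomly to a nearby candidate and score it
    candidate = mu + rng.normal(scale=0.5)
    score = log_likelihood(candidate, data)
    # (iv) move only if the candidate is more likely than the current value
    if score > best:
        mu, best = candidate, score
# (v) a fixed iteration budget stands in for "until the likelihood stops increasing"

print(mu, data.mean())
```

For this model the likelihood peaks at the sample mean, so the search should end up close to `data.mean()`.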

Monte Carlo sampling is often used to approximate values of definite integrals that have no analytic solution. The classic example here is [the estimation of pi](https://en.wikipedia.org/wiki/Monte_Carlo_integration#Example).

### Let's use `piApprox.py`!

In [None]:
pi_approx(n=10)

In [None]:
guesses = []
for _ in range(100):
    guesses.append(pi_approx(100))
np.mean(guesses)

In [None]:
guesses = []
for _ in range(1000):
    guesses.append(pi_approx(1000))
np.mean(guesses)
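The source of `pi_approx` is not shown here, so the following is only a sketch of how such an estimator might work, not the actual contents of `piApprox.py`. The function `estimate_pi` is a hypothetical stand-in: it scatters points uniformly over the unit square and uses the fraction that land inside the quarter circle of radius 1, which approximates $\pi/4$.

```python
import numpy as np

def estimate_pi(n, rng=None):
    """Hypothetical stand-in for pi_approx: Monte Carlo estimate of pi from n random points."""
    rng = np.random.default_rng() if rng is None else rng
    x = rng.random(n)                       # uniform on [0, 1)
    y = rng.random(n)
    inside = (x**2 + y**2) <= 1.0           # point falls inside the quarter circle
    return 4 * inside.mean()                # area ratio is pi / 4

rng = np.random.default_rng(42)
for n in (100, 10_000, 1_000_000):
    print(n, estimate_pi(n, rng))
```

The estimates should tighten around $\pi$ as `n` grows, which matches the pattern in the cells above where more points give a better average.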

## Level Up: Objective and Subjective Probability

Philosophers wonder about the best way to understand what probabilities are. The main division is between those who want to understand probabilities _objectively_ and those who want to understand probabilities _subjectively_.

### Historical Relevance
In the early twentieth century, the quantum theory being developed by physicists held that the location (etc.) of a particle could be represented by a probabilistic wave function that gave probabilities for the particle to be in one place rather than another. This raised the question of how to understand these probabilities. Albert Einstein argued that they could only be interpreted subjectively, but the dominant interpretation today is that there is a kind of indeterminacy in the universe itself.

### Objective Probability
The paradigmatic theory of _objective_ probability is frequentism, which says that probabilities are a measure of the long-run behavior of physical systems. To say that a die has a 1/6 chance of coming up "6" when tossed, for example, is to say that, in the long run as the number of tosses increases without bound, the number of "6"s rolled will constitute one sixth of all tosses.
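The long-run idea can be illustrated with a simulation. The sketch below rolls a simulated fair die many times and tracks the running proportion of "6"s; the seed and the number of rolls are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
rolls = rng.integers(1, 7, size=100_000)    # fair die, faces 1..6

# Running proportion of sixes after each toss
running_freq = np.cumsum(rolls == 6) / np.arange(1, rolls.size + 1)

for n in (10, 100, 1_000, 10_000, 100_000):
    print(n, running_freq[n - 1])
```

Early proportions can sit far from 1/6, while later ones settle near it; a finite simulation can only suggest the limiting behavior that the frequentist definition refers to.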

On this point of view, **we cannot speak meaningfully of the probability of a single event**. Once a die has been rolled, there is no non-trivial probability of its having come up "6" or not. Either it did (in which case the probability is 1) or it did not (in which case the probability is 0).

Similarly, **we cannot speak meaningfully of the probability of a parameter having a certain value, or of a hypothesis being true**. The frequentist will reject the idea of a (meaningful) probability of a die being unfairly weighted. Either it is or it is not.


### Subjective Probability
The paradigmatic theory of _subjective_ probability is Bayesianism, which says that probabilities are better understood as rational _degrees of belief_. The standard of rationality is necessary here to ensure that these degrees of belief conform to the probability calculus.

If probabilities are degrees of belief, then it _does_ make sense to apply them to parameters or to hypotheses. The probability of a die being unfairly weighted would simply represent what it would be rational to believe about the die with respect to its being weighted or not.

Now: Crucially, what it is rational to believe about the die with respect to its being weighted or not _is a function of what we know about the die!_

In particular, if we gain the evidence (or knowledge) that the die has been rolled 100 times and come up "5" 90 times, then this would have (or, rather, *ought, rationally, to have*) a significant impact on our degree of belief with respect to the weightedness of the die. This is the sort of idea that Thomas Bayes had.
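To make the die example concrete, here is a sketch of such an update under strong simplifying assumptions that are not part of the text above: there are only two hypotheses (the die is fair, or it is weighted so that "5" comes up 90% of the time), and the prior degree of belief that the die is weighted is 1%. The names and numbers are hypothetical.

```python
from math import comb

def binomial_likelihood(k, n, p):
    """Probability of exactly k successes in n independent trials with success chance p."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

n_rolls, n_fives = 100, 90

# Assumed priors: we start out fairly confident the die is fair
prior_fair, prior_weighted = 0.99, 0.01

# Likelihood of the evidence under each hypothesis
like_fair = binomial_likelihood(n_fives, n_rolls, 1 / 6)    # fair die: P(5) = 1/6
like_weighted = binomial_likelihood(n_fives, n_rolls, 0.9)  # assumed weighting: P(5) = 0.9

# Bayes's Theorem, with the Law of Total Probability in the denominator
evidence = prior_fair * like_fair + prior_weighted * like_weighted
posterior_weighted = prior_weighted * like_weighted / evidence
print(posterior_weighted)
```

Even with a prior that strongly favors a fair die, 90 fives in 100 rolls is so improbable under fairness that the posterior belief in a weighted die ends up very close to 1.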