## Probability

We will cover basic probability theory:

- Basics
- Disjoint probability
- Dependent probability

## Basics

The Sample Space, or S, is the set of all possible outcomes.  For example, in rolling a die, S would be {1, 2, 3, 4, 5, 6}.

An event is a subset of the sample space, often written as A.  For example, the event of rolling an even value on a die is
A = {2, 4, 6}.  It is important to note that when we say "A happens", we mean that _any_ element within A is the outcome.

> The Sample Space of the sum of two d6 is {2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12}, and the event A such that the sum of 
> 2d6 is greater than 8 is A = {9, 10, 11, 12}.  Note that these sums are not equally likely.
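
The example above can be sketched in code. This is a minimal illustration, assuming two fair six-sided dice; the names `outcomes`, `sums`, and `event_a` are made up for this sketch. It counts ordered pairs rather than sums because the ordered pairs are equally likely, which makes the probability a simple ratio.

```python
from fractions import Fraction
from itertools import product

# Every ordered (first die, second die) pair: 36 equally likely outcomes.
outcomes = list(product(range(1, 7), repeat=2))

# The sample space expressed as sums, as in the example above.
sums = sorted({a + b for a, b in outcomes})

# Event A: the sum of the two dice is greater than 8.
event_a = [(a, b) for a, b in outcomes if a + b > 8]

# With equally likely outcomes, P(A) = |A| / |S|.
p_a = Fraction(len(event_a), len(outcomes))
print(sums)
print(len(event_a), p_a)
```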

### Probability Function

A probability function (or just "probability") is a function $ P: F \to \mathbb{R} $, where F is the collection of events (subsets of S), such that:

- The probability that A happens is never negative: $ P(A) \geq 0, \forall{A} \in F $.
- $ P(S) = 1 $ 

## Probability of Disjoint events

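We must carefully distinguish when events A and B are disjoint (ie, the events are mutually exclusive and cannot both 
happen). For disjoint events, the probability of A or B happening is

$ \large{P(A \cup B) = P(A) + P(B)} $

The probability of A **and** B both happening is zero, because disjoint events share no outcomes:

$ \large{P(A \cap B) = 0} $

The product rule $ P(A \cap B) = P(A) \cdot P(B) $ is a different property. It holds for _independent_ events, such as two separate die rolls, not for disjoint ones.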

## Probability of Non-disjoint events

Things get more complicated when events A and B are not disjoint.  The probability of A or B happening then 
becomes:

$ \large{P(A \cup B) = P(A) + P(B) - P(A \cap B)} $

This is almost the same as a union of disjoint events, except that we have to subtract the probability of the outcomes 
that are in both A and B, otherwise they would be counted twice.

> Note that this rule applies to disjoint events too.  If the events are disjoint then:  
> $ P(A \cap B) = 0 $
>
> Since there are no events which are in both A and B
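
As a quick check of the union rule, the sketch below uses a single fair die with A as "even roll" and a second, hypothetical event B as "roll greater than 3". The helper `prob` is an illustrative name, and it assumes every outcome in S is equally likely.

```python
from fractions import Fraction

S = {1, 2, 3, 4, 5, 6}
A = {2, 4, 6}   # even roll
B = {4, 5, 6}   # roll greater than 3 (hypothetical second event)

def prob(event):
    """Probability of an event when every outcome in S is equally likely."""
    return Fraction(len(event), len(S))

# Left side: count the union directly.
lhs = prob(A | B)
# Right side: add both events, then remove the double-counted overlap {4, 6}.
rhs = prob(A) + prob(B) - prob(A & B)
print(lhs, rhs)
```

Both sides come out to the same fraction, while `prob(A) + prob(B)` alone would give 1, which overstates the union.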

### Intersection?

But what about the intersection of A and B?  What is the probability of the intersection of two non-disjoint events?  In
order to answer this, we need to know about conditional probabilities first.

## Conditional Probability

A conditional probability is given as $ P(A|B) $ which is read as "The probability of event A, given event B happens".

$$
\Large P(A | B) = \frac{P(A \cap B)}{P(B)} 
$$

> What is the probability of getting a sum of 8 on 2 dice, when the first die is greater than 4?  This can be written as:  
> $ A = \text{sum is 8} $  
> $ B = \text{first die} > 4 $  
> $ P(A | B) $

If we did not have the given condition, then

$$
\Large P(A) = \frac{5}{36} 
$$

However, since it is given that the first die is > 4, we reduce our sample space.  Instead of all 36 combinations, we 
only look at the 12 combinations where the first die is 5 or 6.  Within those, there are only two cases where the sum is 
8: (5, 3) and (6, 2). Therefore 

$$
\Large P(A|B) = \frac{P(A \cap B)}{P(B)} = \frac{2/36}{12/36} = \frac{2}{12} = \frac{1}{6} 
$$
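
The definition of conditional probability can be checked by brute-force enumeration. The sketch below assumes two fair dice and uses illustrative names (`outcomes`, `prob`, `A_and_B`); it applies the formula $ P(A|B) = P(A \cap B) / P(B) $ directly and prints each piece as an exact fraction.

```python
from fractions import Fraction
from itertools import product

# All 36 equally likely (first die, second die) pairs.
outcomes = list(product(range(1, 7), repeat=2))

def prob(event):
    """Probability of an event given as a list of outcomes."""
    return Fraction(len(event), len(outcomes))

A = [o for o in outcomes if sum(o) == 8]   # the sum is 8
B = [o for o in outcomes if o[0] > 4]      # the first die is greater than 4
A_and_B = [o for o in A if o in B]         # both at once

p_a_given_b = prob(A_and_B) / prob(B)
print(prob(A), prob(B), prob(A_and_B), p_a_given_b)
```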

## Conditional Probability: Intersection

Now that we know conditional probability, we can write the intersection of two events as

$ \large{P(A \cap B) = P(A) \cdot P(B|A)} $

> When events are independent, then  
> $ P(B|A) = P(B) $  
> because knowing that A happened tells us nothing about B, so the intersection reduces to $ P(A) \cdot P(B) $.  Disjoint events are the opposite case: if A happens then B cannot, so $ P(B|A) = 0 $
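
The intersection formula can be verified by counting. This sketch assumes two fair dice; the helpers `prob` and `cond` and the three event functions are illustrative names. It checks the formula once, then compares $ P(B|A) $ with $ P(B) $ for two different pairs of events.

```python
from fractions import Fraction
from itertools import product

outcomes = list(product(range(1, 7), repeat=2))

def prob(pred):
    """P(event) by counting equally likely outcomes that satisfy pred."""
    return Fraction(sum(1 for o in outcomes if pred(o)), len(outcomes))

def cond(pred_b, pred_a):
    """P(B | A): count B inside the reduced sample space where A holds."""
    in_a = [o for o in outcomes if pred_a(o)]
    return Fraction(sum(1 for o in in_a if pred_b(o)), len(in_a))

def first_even(o):
    return o[0] % 2 == 0

def second_even(o):
    return o[1] % 2 == 0

def sum_is_8(o):
    return sum(o) == 8

# Intersection formula: P(A and B) = P(A) * P(B | A)
joint = prob(lambda o: first_even(o) and sum_is_8(o))
print(joint, prob(first_even) * cond(sum_is_8, first_even))

# The second die ignores the first, so conditioning changes nothing.
print(cond(second_even, first_even), prob(second_even))

# The sum depends on the first die, so conditioning changes the answer.
print(cond(sum_is_8, first_even), prob(sum_is_8))
```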

## Random Variables

A Random Variable is simply a function that maps an outcome in the sample space to a real number.  The example given is  
usually a die, but that example is limiting since the die face is already a number.  Another example could be 

```python
events = {"salt": 1, "pepper": 2, "thyme": 3, "basil": 4}
def rand_var(spice: str) -> int:
    return events[spice] if spice in events else 0
```

Note that RVs can be discrete, as with the spices, or continuous.  The only requirement is that the function returns a real number.

## Distributions

A distribution shows the possible values a variable can take, and how often they occur.  In a fair die, there would be  
6 possible values, all equally likely.  Other distributions may have other "shapes".  A gaussian or normal distribution  
has a bell like shape.  Other distributions might look like a U.

Distributions can be characterized by several _moments_, such as the mean (the average value) and the variance (a measure of  
the spread).
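
For a concrete case, the mean and variance of a fair die can be computed straight from its distribution. This is a small sketch using exact fractions; `pmf` is an illustrative name for the mapping from each face to its probability.

```python
from fractions import Fraction

# A fair d6: six possible values, all equally likely.
pmf = {face: Fraction(1, 6) for face in range(1, 7)}

# Mean: each value weighted by its probability.
mean = sum(x * p for x, p in pmf.items())

# Variance: the probability-weighted squared distance from the mean.
variance = sum((x - mean) ** 2 * p for x, p in pmf.items())

print(mean, variance)
```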

Distributions are therefore a way to look at certain characteristics of a Random Variable.  When we use Bayesian  
statistics, the posterior, likelihood and prior are actually distributions.

The `prior` is really just a distribution that is selected before we have any actual evidence (data), based on knowledge  
or estimation we have for the domain.

The `posterior` is likewise a distribution. It is calculated by multiplying the `likelihood` by the `prior` and then normalizing the result so that it sums (or integrates) to 1.
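
One way to see the prior, likelihood, and posterior as distributions is a grid approximation. The sketch below is an illustration under made-up assumptions: the unknown is the probability of heads for a coin, the prior is flat over a grid of 101 candidate values, and the data (7 heads, 3 tails) is invented for the example.

```python
# Candidate values for the unknown probability of heads.
grid = [i / 100 for i in range(101)]

# Flat prior: every candidate is equally plausible before seeing data.
prior = [1 / len(grid)] * len(grid)

# Made-up data for illustration.
heads, tails = 7, 3

# Likelihood of the data for each candidate value.
likelihood = [p**heads * (1 - p)**tails for p in grid]

# Posterior is likelihood times prior, rescaled to sum to 1.
unnormalized = [lk * pr for lk, pr in zip(likelihood, prior)]
evidence = sum(unnormalized)
posterior = [u / evidence for u in unnormalized]

# The most plausible candidate after seeing the data.
best = grid[posterior.index(max(posterior))]
print(best)
```

With a flat prior, the peak of the posterior sits at the observed fraction of heads. A different prior would pull the peak toward the values it favors.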

## Parameters

When we use Bayesian models, we usually have some unknown that we want to quantify.  For example, if we ask "is this  
service actually passing?", the answer is a parameter that we are trying to provide a probability distribution for.  In this  
case, the possible values are "pass" or "fail".  We update our estimate as new data (ie test results) arrives, and we may  
also revise our prior.

## How to interpret probabilities

Have you ever stopped to think about what a probability represents?  There are two primary ways to think about it.

### Frequentist

Imagine the probability of rain on any given January day in Oklahoma is .125.  Then if we looked at 8 days, we would expect to see 1  
day with rain on average.  This is the Frequentist view. It is perhaps the most common and intuitive way to think about  
probabilities.  Basically we count all the times we see a certain event and divide by the total number of times we looked.
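
The counting idea can be simulated. In this sketch, `p_rain` is the .125 from the example, the seeded generator `rng` is only there to make runs repeatable, and `observed_frequency` is an illustrative helper. The observed fraction is noisy for a handful of days and settles near .125 as the number of days grows.

```python
import random

rng = random.Random(0)   # seeded so the run is repeatable
p_rain = 0.125

def observed_frequency(days):
    """Fraction of simulated days on which it rained."""
    rainy = sum(rng.random() < p_rain for _ in range(days))
    return rainy / days

for days in (8, 80, 8_000, 100_000):
    print(days, observed_frequency(days))
```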

### Bayesian

Another way is to interpret them in the Bayesian model.  We can look at probabilities as a measure of our uncertainty about events, 
reflecting our knowledge of the world/system.  They do not need to be based on repeated trials.  Imagine if you were
asked "What is the probability that it rained in Tulsa on Jan 10th?".  You can't run repeated trials because it happened (or didn't)  
in the past.

## Probability Functions

There are three major kinds of probability functions that we are concerned about in probability theory:

- PMF: Probability Mass Functions map each specific discrete value to a probability
    - "What is the probability that X takes the value 3?"
- PDF: Probability Density Functions describe continuous values, where probability is found over a range of values.
    - "What is the probability that the value will be between 5 and 10?" 
    - The area under the curve within this interval is the probability (the definite integral)
- CDF: Cumulative Distribution Functions give the probability that the value is less than or equal to x

> You can think of a PDF as the instantaneous rate of change (derivative) of the CDF at a specific value of x, and the  
> probability of a range as the area under the PDF between two points (the definite integral).  Density is mass / volume;  
> here the "volume" is a length along the x axis, and the probability mass is the area under the curve.  A PMF, by contrast, returns a probability directly.

Related to these functions is the notion of a `distribution`, which is a way to describe all the possible events and  
the probability of each event happening.
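
The three kinds of function can be illustrated with the standard library. This is a sketch under stated assumptions: the discrete case is a fair d6, and the continuous case is a normal distribution whose parameters (`mu=7.5`, `sigma=2.0`) are made up so that the "between 5 and 10" question has something to work with.

```python
from itertools import accumulate
from statistics import NormalDist

# Discrete: PMF of a fair d6, and its CDF as a running total of the PMF.
faces = range(1, 7)
pmf = {x: 1 / 6 for x in faces}
cdf = dict(zip(faces, accumulate(pmf.values())))
print(pmf[3])   # P(X = 3)
print(cdf[3])   # P(X <= 3)

# Continuous: a normal distribution with made-up parameters.
dist = NormalDist(mu=7.5, sigma=2.0)

# The PDF value is a density, not a probability.
print(dist.pdf(7.5))

# P(5 <= X <= 10) is the area under the PDF, i.e. a difference of CDF values.
p_between = dist.cdf(10) - dist.cdf(5)
print(p_between)
```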

### The Binomial Distribution PMF

A very important PMF is the binomial distribution.  It is used to calculate the probability of a certain number of
successes in a fixed number of independent trials.  Each trial can only have two possible states: the event happened, or it didn't (hence the binomial).  For  
example, "Getting two heads in a toss of 4 coins" or "getting at least 3 rolls of 8+ on 5d10".  The PMF itself gives the  
probability of _exactly_ k successes; an "at least" question is answered by summing the PMF over every qualifying k.

A binomial distribution has 3 parameters:

- k: The number of successful outcomes we want
- n: The number of trials
- p: The probability of a successful outcome in a single trial

So for example, in "Getting two heads in a toss of 4 coins" the values would be:

- k = 2 because we are looking for 2 heads
- n = 4 because we have 4 attempts
- p = 1/2 since that is the probability of getting heads on a single toss

In the example of "getting at least 3 rolls of 8+ on 5d10"

- k = 3 because we want at least 3 out of 5; since 4 or 5 successes also count, we add up the results for k = 3, 4 and 5
- n = 5 because we are rolling 5 dice
- p = 3/10 since P(8+ on 1d10) = P(8 on 1d10) + P(9 on 1d10) + P(10 on 1d10) = 1/10 + 1/10 + 1/10

The Binomial Distribution function can be written as $ B(k; n, p) $ and it is equal to 

$$
\Large B(k; n, p) = N_{outcomes} * P(\text{Desired Outcome}) 
$$ 

But how do we calculate $ N_{outcomes} $ and $ P(\text{Desired Outcome}) $?

#### Binomial Coefficient for outcomes

The $ N_{outcomes} $ can be calculated using the well known binomial coefficient $ \binom{n}{k} $ ("n choose k"), which is calculated as

$ \huge{\binom{n}{k} = \frac{n!}{k!(n-k)!}} $

So, let's solve the first problem (two heads in four tosses).  The number of possible outcomes is

$ \huge{\binom{4}{2} = \frac{4!}{2!(2!)} = \frac{24}{4} = 6} $
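
The count of 6 can be confirmed by listing the arrangements. This sketch picks which 2 of the 4 toss positions land heads, using `itertools.combinations`, and compares the number of arrangements with `math.comb`; the names `head_positions` and `sequences` are illustrative.

```python
import math
from itertools import combinations

n, k = 4, 2

# Choose which k of the n toss positions come up heads.
head_positions = list(combinations(range(n), k))

# Turn each choice into a readable sequence such as "HHTT".
sequences = [
    "".join("H" if i in positions else "T" for i in range(n))
    for positions in head_positions
]

print(sequences)
print(len(sequences), math.comb(n, k))
```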

#### Calculating P(Desired Outcome)

Now all we need to do is calculate the probability of the desired outcome.

$ \huge{p^k * (1 - p)^{n-k}} $

How did we get this?  Imagine you want the case where you get three rolls of 8 or better on a roll of 5d10.  Let's look at the  
simplest example of 8,8,8,1,1.  We define $ P(A) $ as _the probability of getting an 8 or better on a d10 roll_, so 
$ P(\neg A) = 1 - P(A) $ is the probability of rolling less than 8.

Because the dice are independent, the probability of this specific sequence is 

$ \huge{P(A) * P(A) * P(A) * P(\neg A) * P(\neg A)} $
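
As a worked instance with the numbers from the 5d10 example (p = 3/10, k = 3, n = 5), one specific arrangement of three successes and two failures has probability

$$
\Large p^3 \cdot (1 - p)^2 = 0.3^3 \cdot 0.7^2 = 0.027 \cdot 0.49 = 0.01323
$$

and multiplying by the number of arrangements gives the probability of exactly three successes:

$$
\Large B(3; 5, 0.3) = \binom{5}{3} \cdot 0.01323 = 10 \cdot 0.01323 = 0.1323
$$

This covers only the "exactly 3" case. The "at least 3" question also needs the terms for k = 4 and k = 5.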


```python
import math

def binomial_pmf(
    k: int,
    n: int,
    p: float
):
    """Probability of exactly k successes in n independent trials."""
    num_outcomes = math.comb(n, k)
    desired_outcome = math.pow(p, k) * math.pow(1-p, n-k)
    print(f"{k=}, {n=}, {p=}, {num_outcomes=}, {desired_outcome=}")
    return num_outcomes * desired_outcome

def sum_binom(
    k: int,
    n: int,
    p: float
):
    """Probability of at least k successes: the PMF summed from k to n."""
    return sum(binomial_pmf(i, n, p) for i in range(k, n + 1))
```

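```python
# P(at least 3 rolls of 8+ on 5d10), with p = 3/10 for a single die
sum_binom(3, 5, .3)
```

```python
from random import randint
from dataclasses import dataclass

def die(size: int):
    return randint(1, size)

@dataclass
class Pool:
    amount: int
    size: int
    
    def roll(self):
        return sorted(die(self.size) for _ in range(self.amount))
    
p = Pool(5, 10)
trials = 1000
gte_8 = 0
for i in range(trials):
    # Count the trial when at least 3 of the 5 dice show 8 or better.
    if len([b for b in p.roll() if b >= 8]) >= 3:
        gte_8 += 1

# The simulated fraction should land close to the sum_binom result.
print(gte_8 / trials)
```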


## Beta Distribution

Where the binomial distribution is a PMF, the beta distribution is a PDF.  A PMF, like the binomial distribution, 
is for discrete values.  A PDF, like the beta distribution, is for continuous values over a range; for the beta 
distribution that range is 0 to 1.

In the binomial distribution, we wanted `k` outcomes in `n` trials.  In a beta distribution, we have $ \alpha $, which 
is how many times we observed what we are interested in, and $ \beta $, the number of times we didn't.
The difference is subtle but important: $ \alpha $ plays the role of `k`, but $ \beta $ is not `n`.  It counts only the 
misses, so it corresponds to `n - k`, and the total number of trials is $ \alpha + \beta $.

The beta distribution's density function is defined as follows (written as Beta rather than B to keep it distinct from the binomial $ B(k; n, p) $):

$$ 
\huge Beta(p; \alpha, \beta) = \frac{p^{\alpha-1} \cdot (1-p)^{\beta-1}}{beta(\alpha,\beta)} 
$$

Where:
- p: the probability of an event
- $ \alpha $: how many times the event of interest occurs
- $ \beta $: how many times the event of interest did not occur


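The $ beta(\alpha,\beta) $ term in the denominator is called the `beta function` and needs a little explanation.  It is 
defined as:

$$ 
\huge beta(\alpha,\beta) = \int_0^1 p^{\alpha-1} \cdot (1-p)^{\beta-1} \, dp
$$

It is the normalizing constant that makes the total area under the density equal to 1.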
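
The integral can be approximated numerically and compared against a closed form. The sketch below is an illustration, not a production routine: `beta_function_numeric` uses a simple midpoint rule, and `beta_function_gamma` relies on the standard identity that the beta function equals $ \Gamma(\alpha)\Gamma(\beta) / \Gamma(\alpha + \beta) $. Both function names are made up for this example.

```python
import math

def beta_function_numeric(alpha, beta, steps=100_000):
    """Midpoint-rule approximation of the beta function integral over [0, 1]."""
    width = 1 / steps
    total = 0.0
    for i in range(steps):
        p = (i + 0.5) * width   # midpoint of the i-th slice
        total += p ** (alpha - 1) * (1 - p) ** (beta - 1)
    return total * width

def beta_function_gamma(alpha, beta):
    """Closed form via the gamma-function identity."""
    return math.gamma(alpha) * math.gamma(beta) / math.gamma(alpha + beta)

numeric = beta_function_numeric(3, 4)
closed_form = beta_function_gamma(3, 4)
print(numeric, closed_form)
```

For $ \alpha = 3 $ and $ \beta = 4 $ the closed form is $ 2! \cdot 3! / 6! = 1/60 $, and the numeric estimate agrees to several decimal places.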



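```python
import math

def betad(
    p: float,
    alpha: int,
    beta: int
):
    """Beta density at p. One possible implementation of the formula above."""
    # The beta function is computed with the identity
    # beta(a, b) = gamma(a) * gamma(b) / gamma(a + b), instead of integrating.
    beta_fn = math.gamma(alpha) * math.gamma(beta) / math.gamma(alpha + beta)
    return math.pow(p, alpha - 1) * math.pow(1 - p, beta - 1) / beta_fn
```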

## Bayes Theorem

If we know the probability that we will see our data given our hypothesis, the probability of our hypothesis without any  
conditions, and the probability of seeing our data without any other information, then we can calculate the probability of our hypothesis given the data:

$$
\Large P(H|E) = \frac{P(E|H) \cdot P(H)}{P(E)}
$$


### Posterior, Likelihood, and Prior

Bayes' formula can be broken into the following parts:

- H is our hypothesis, eg, _"the service is good"_
- E is the evidence, eg, _"the test result"_
- P(H|E) _posterior distribution_ (or posterior)
- P(E|H) _likelihood_ 
- P(H) _prior_ or our prior belief 
- P(E) _evidence_ (or marginal likelihood), which normalizes the posterior
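
A small numeric sketch of the formula, using the service example. All three input numbers are invented for illustration, and `posterior` is a hypothetical helper. It expands P(E) with the law of total probability, so the probability of the evidence when the hypothesis is false must also be supplied.

```python
def posterior(prior_h, p_e_given_h, p_e_given_not_h):
    """P(H | E) from Bayes' theorem, with P(E) expanded over H and not-H."""
    p_e = p_e_given_h * prior_h + p_e_given_not_h * (1 - prior_h)
    return p_e_given_h * prior_h / p_e

# Made-up numbers for illustration.
prior_good = 0.5          # P(H): belief that the service is good, before testing
p_pass_given_good = 0.95  # P(E|H): a good service passes the test
p_pass_given_bad = 0.20   # P(E|not H): a bad service passes anyway

# Belief after one passing test.
after_one = posterior(prior_good, p_pass_given_good, p_pass_given_bad)

# A second passing test: the previous posterior becomes the new prior.
# This assumes the test runs are independent given the state of the service.
after_two = posterior(after_one, p_pass_given_good, p_pass_given_bad)

print(after_one, after_two)
```

Each passing test moves the belief further toward "good", which is the updating process described in the Parameters section.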

## Parameter estimation

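Beta-binomial distribution examples, using PreliZ to draw samples

```python
import preliz as pz

# Draw a single sample from a beta-binomial distribution
pz.BetaBinomial(alpha=10, beta=10, n=6).rvs()
```

```python
import matplotlib.pyplot as plt

# Histogram of 100 samples
plt.hist(pz.BetaBinomial(alpha=3, beta=4, n=6).rvs(100))
#pz.BetaBinomial(3,4,6).plot_pdf()
```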

## The Birthday Problem

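This is a counter-intuitive problem: how many people do you need in a group before it is more likely than not that two of 
them share a birthday?  Intuitively, you may think the chance of a match is about 1/365, since there are 365 days in the 
year.  However, that would be the chance of matching one specific date.  We want to see if any two dates pulled at random 
match, and the number of pairs to compare grows quickly with the size of the group.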

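```python
def bday():
    """Yield P(no shared birthday) for groups of 2, 3, 4, ... people."""
    i = 1
    prob = 1
    while i < 365:
        # The next person must avoid the i birthdays already taken.
        prob = prob * (1 - i/365)
        yield prob
        i += 1
```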
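```python
# The first value yielded by bday() is for a group of 2 people.
for people, p_no_match in enumerate(bday(), start=2):
    # Stop at the first group size where a shared birthday is more likely than not.
    if p_no_match < .50:
        break
print(people)
```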
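
The calculation can be cross-checked with a simulation. This is a rough sketch under the same simplifying assumption as above (365 equally likely birthdays, no leap years); `has_shared_birthday` and `estimate` are illustrative names, and the seed only makes the run repeatable.

```python
import random

def has_shared_birthday(people, rng):
    """True if at least two of the randomly drawn birthdays coincide."""
    birthdays = [rng.randrange(365) for _ in range(people)]
    return len(set(birthdays)) < people

def estimate(people, trials=20_000, seed=0):
    """Fraction of simulated groups that contain a shared birthday."""
    rng = random.Random(seed)
    hits = sum(has_shared_birthday(people, rng) for _ in range(trials))
    return hits / trials

for group_size in (10, 23, 40):
    print(group_size, estimate(group_size))
```

For a group of 23 the estimate should hover around one half, in line with the exact calculation.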