## Basic form

`p(A|B) = p(A, B) / p(B)`

### Example: buying from your website

A = {Buys, Does not Buy}, B = {USA, Canada, Mexico}

![buys](img/country_buy.png)

Suppose we want to find p(Buy | Country)

**Marginal Probability**

`p(country = Mexico)= 210/(210+550+320) = 0.19
p(country = US) = 550/(210+550 + 320) = 0.51
p(country = Canada) = 320 / (210 + 550 + 320) = 0.30`

**Joint probability**
- How many probabilities are we looking for here?
    - Buy = 2 possibilities
    - Country = 3 possibilities
    - Total = 2*3 = 6 possibilities
- In general, the total = |RV1| * |RV2| * |RV3| * ... * |RVn|
    - and this number grows exponentially as we add more variables => **curse of dimensionality**
    
    
`p(buy = 1, country = CA) = 20/1080 = 0.019
p(buy = 0, country = CA) = 300/1080 = 0.28
p(buy = 1, country = US) = 50/1080 = 0.046
p(buy = 0, country = US) = 500/1080 = 0.46
p(buy = 1, country = MX) = 10/1080 = 0.0093
p(buy = 0, country = MX) = 200/1080 = 0.19`

_They all have to sum to 1!!_
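As a minimal sketch, the marginal and joint probabilities above can be reproduced from the raw counts. The counts are the numerators used in the joint-probability lines; the names `counts`, `joint`, and `p_country` are illustrative only.

```python
# Counts taken from the joint-probability lines above: (buy, country) -> visitors.
counts = {
    (1, "CA"): 20, (0, "CA"): 300,
    (1, "US"): 50, (0, "US"): 500,
    (1, "MX"): 10, (0, "MX"): 200,
}
total = sum(counts.values())  # 1080 visitors in total

# Joint distribution: p(buy, country) = count / total.
joint = {key: n / total for key, n in counts.items()}

# Marginal distribution: sum the joint over the variable we are not interested in.
p_country = {}
for (buy, country), p in joint.items():
    p_country[country] = p_country.get(country, 0.0) + p

print(round(sum(joint.values()), 10))  # the six joint probabilities sum to 1
for country, p in p_country.items():
    print(country, round(p, 2))
```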


**!NOTE**

As the number of random variables grows, the joint probabilities become tiny, and the computer will eventually round them down to 0
- this is called the **"underflow" problem**, which is common in probability
- we work with log probabilities instead: the log turns a product of many small numbers into a sum of moderately sized negative numbers, which floating point represents without trouble
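A small sketch of the underflow problem, assuming a made-up case of 100 independent events that each have probability 1e-5. The direct product collapses to zero in 64-bit floats, while the sum of logs stays well within range.

```python
import math

# Hypothetical example: 100 independent events, each with probability 1e-5.
probs = [1e-5] * 100

product = 1.0
for p in probs:
    product *= p  # the true value is 1e-500, far below the smallest positive float

# Log probability of the same joint event: a sum instead of a product.
log_prob = sum(math.log(p) for p in probs)

print(product)   # 0.0 (underflow)
print(log_prob)  # roughly -1151.29
```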

**Conditional Probabilities**

`
p(buy = 1 | country = CA) = (20/1080) / (320/1080) = 20/320 = 0.06
p(buy = 0 | country = CA) = (300/1080) / (320/1080) = 300/320 = 0.94
p(buy = 1 | country = US) = (50/1080) / (550/1080) = 50/550 = 0.09
p(buy = 0 | country = US) = (500/1080) / (550/1080) = 500/550 = 0.91
p(buy = 1 | country = MX) = (10/1080) / (210/1080) = 10/210 = 0.05
p(buy = 0 | country = MX) = (200/1080) / (210/1080) = 200/210 = 0.95`


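_Taken together, these six values no longer sum to 1! They sum to 3 (up to rounding), because each country has its own conditional distribution over Buy, and each of those sums to 1._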
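The conditional probabilities can be sketched the same way, dividing each count by the total for its country (which is the same as dividing the joint by the marginal). The counts are again the numerators of the joint-probability lines; the variable names are illustrative.

```python
# Counts taken from the joint-probability lines: (buy, country) -> visitors.
counts = {
    (1, "CA"): 20, (0, "CA"): 300,
    (1, "US"): 50, (0, "US"): 500,
    (1, "MX"): 10, (0, "MX"): 200,
}

# Visitors per country (the un-normalized marginal).
country_totals = {}
for (buy, country), n in counts.items():
    country_totals[country] = country_totals.get(country, 0) + n

# p(buy | country) = count(buy, country) / count(country).
conditional = {
    (buy, country): n / country_totals[country]
    for (buy, country), n in counts.items()
}

for country in country_totals:
    row_sum = conditional[(1, country)] + conditional[(0, country)]
    print(country, round(conditional[(1, country)], 4), round(row_sum, 10))
```

Each country's pair of values sums to 1, while all six together sum to 3.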

If the conditional probability does not depend on the random variable it is conditioned on, it reduces to the **marginal probability**.

This happens when A and B are independent, i.e.

`p(A, B) = p(A)*p(B)`

so, if Buy and Country are independent (let's say if the buying percentage is the same across countries), then:

`p(Buy | Country) = p(Buy, Country) / p(Country) 
                  = p(Buy)*p(Country) / p(Country)
                  = p(Buy) `

### Manipulating Bayes Rule

` p(A|B) = p(B | A)* p(A) / p(B)`
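As a hedged illustration of Bayes rule on the website example, the sketch below flips the conditioning: it starts from p(Buy | Country) and p(Country) and recovers p(Country | Buy = 1). The denominator p(Buy = 1) is obtained by summing over countries. The numbers come from the counts used earlier; the dictionary names are illustrative.

```python
# Marginal p(country) and conditional p(buy = 1 | country), from the example counts.
p_country = {"CA": 320 / 1080, "US": 550 / 1080, "MX": 210 / 1080}
p_buy_given_country = {"CA": 20 / 320, "US": 50 / 550, "MX": 10 / 210}

# Denominator p(buy = 1): sum p(buy = 1 | country) * p(country) over all countries.
p_buy = sum(p_buy_given_country[c] * p_country[c] for c in p_country)

# Bayes rule: p(country | buy = 1) = p(buy = 1 | country) * p(country) / p(buy = 1).
p_country_given_buy = {
    c: p_buy_given_country[c] * p_country[c] / p_buy for c in p_country
}

for c, p in p_country_given_buy.items():
    print(c, round(p, 3))
```

Here p(country = US | buy = 1) works out to 50/80, which matches counting buyers directly.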

## Monty Hall problem

[Explanation](https://betterexplained.com/articles/understanding-the-monty-hall-problem/)

### Setup
- You can choose one of three doors (one has a car, two have a goat)
- The host opens another door, which always has a goat
- You have the option to switch or to stay with your door. 

### Explanation
The best strategy is to switch doors: once the host has opened another door to reveal a goat, the chance that the car is behind the remaining closed door is about 67%. Why?

- Assume we always choose door #1
- C: the door where the car really is
- p(C = 1) = p(C = 2) = p(C = 3) = 1/3
- H: the door that Monty Hall opens
- Assume he opens door #2; we lose no generality because the problem is symmetric:

`p(H = 2 | C = 1) = 0.5` If the car is behind our door #1, both other doors hide goats, so he opens one of them at random

`p(H = 2 | C = 2) = 0` If the car is behind door #2, there is 0 probability Monty Hall would open that door (which is the one he did open)

`p(H = 2 | C = 3) = 1` If the car is actually behind door #3, there is 100% chance Monty Hall would open door 2. 

#### What probability do we want?

`p(C = 1|H = 2), p(C = 3| H = 2)`


Using Bayes Rule, we calculate that 

`p(C = 3 | H = 2) = p(H = 2 | C = 3) * p(C = 3) / [p(H = 2 | C = 1) * p (C = 1) + p(H = 2| C = 2) * p (C = 2) + p(H = 2 | C = 3)*p(C = 3)]`

==>

`p(C = 3 | H = 2) = 1* 1/3 / (1/2 * 1/3 + 0*1/3 + 1*1/3) = 2/3`

==>

`p(C = 1|H = 2)` = 1/3

`p(C = 3|H = 2)` = 2/3
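The Bayes rule calculation above can be checked with exact fractions. This is a minimal sketch under the same assumptions (we picked door #1, Monty opened door #2); `prior`, `likelihood`, and `posterior` are illustrative names.

```python
from fractions import Fraction

# Prior: the car is equally likely to be behind each door.
prior = {door: Fraction(1, 3) for door in (1, 2, 3)}

# Likelihood p(H = 2 | C = door), given that we picked door #1.
likelihood = {1: Fraction(1, 2), 2: Fraction(0), 3: Fraction(1)}

# Denominator p(H = 2): sum over every possible car location.
evidence = sum(likelihood[d] * prior[d] for d in prior)

# Posterior p(C = door | H = 2).
posterior = {d: likelihood[d] * prior[d] / evidence for d in prior}

for door, p in posterior.items():
    print(door, p)
```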

Which means that we should always switch.
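A quick simulation gives the same answer empirically. This is only a sketch: the function `play`, the fixed seed, and the number of trials are arbitrary choices made for illustration.

```python
import random

def play(switch, rng):
    """Play one round; return True if the final pick wins the car."""
    doors = (1, 2, 3)
    car = rng.choice(doors)
    pick = 1  # we always start with door #1
    # Monty opens a door that is neither our pick nor the car.
    opened = rng.choice([d for d in doors if d != pick and d != car])
    if switch:
        # Move to the only door that is neither our pick nor the opened one.
        pick = next(d for d in doors if d != pick and d != opened)
    return pick == car

rng = random.Random(0)  # fixed seed so the run is repeatable
trials = 100_000
switch_wins = sum(play(True, rng) for _ in range(trials)) / trials
stay_wins = sum(play(False, rng) for _ in range(trials)) / trials

print("switch:", switch_wins)  # close to 2/3
print("stay:  ", stay_wins)    # close to 1/3
```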

## Gaussian example

- Suppose we have collected one data point from a source of Gaussian-distributed data, and let's call it "x"
- What is the probability density of that one data point?

![gaussian_single_point](img/single_datapoint_probability.png)

In real life we will collect multiple samples
- Typically the samples are IID - independent and identically distributed

#### Joint probability density
![joint_probability](img/joint_probability.png)

Because the samples are independent, I can multiply the probability density of each individual sample to get the joint probability density of all the samples
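A minimal sketch of the joint density for IID Gaussian samples, assuming the standard Gaussian density formula. The sample values and the parameters `mu` and `sigma` are made up for illustration. It also shows that the log of the product equals the sum of the logs.

```python
import math

def gaussian_pdf(x, mu, sigma):
    """Density of a Gaussian with mean mu and standard deviation sigma at x."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2 * math.pi))

samples = [4.8, 5.1, 5.3, 4.9]  # hypothetical IID samples
mu, sigma = 5.0, 0.5            # assumed parameters

# Independence: the joint density is the product of the individual densities.
joint_density = math.prod(gaussian_pdf(x, mu, sigma) for x in samples)

# The same quantity in log space is a sum.
log_joint = sum(math.log(gaussian_pdf(x, mu, sigma)) for x in samples)

print(joint_density)
print(log_joint)
```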

#### Mean of the Gaussian: Data likelihood

We write the "probability of the data given the parameters", which is called the **likelihood** (viewed as a function of the parameters)
- The parameters depend on what the model is (e.g. Gaussian, Beta, Gamma, etc)

![likelihood](img/likelihood_given_params.png)


### Maximum Likelihood

"What is the best setting of mu, such that the likelihood is maximized?"
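- To maximize the likelihood function with respect to _mu_, we use calculus (set dP/d _mu_ = 0, solve for _mu_)
- Taking the log first is useful: it avoids the **underflow problem** and gets rid of the exponent, and because log is monotonically increasing, the _mu_ that maximizes the log-likelihood also maximizes the likelihood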
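For the Gaussian mean, solving the calculus problem gives the sample mean. The sketch below checks this numerically by scanning a grid of candidate values for mu. The samples, the fixed `sigma = 1.0`, and the grid resolution are assumptions made for illustration.

```python
import math

def log_likelihood(mu, samples, sigma=1.0):
    """Gaussian log-likelihood of IID samples for a given mean mu."""
    n = len(samples)
    squared_error = sum((x - mu) ** 2 for x in samples)
    return -0.5 * n * math.log(2 * math.pi * sigma**2) - squared_error / (2 * sigma**2)

samples = [2.1, 1.9, 2.4, 2.0, 1.6]  # hypothetical data

# Closed-form maximum likelihood estimate: the sample mean.
mu_hat = sum(samples) / len(samples)

# Numerical check: pick the best mu on a grid from 0 to 4 in steps of 0.001.
grid = [i / 1000 for i in range(0, 4001)]
best = max(grid, key=lambda mu: log_likelihood(mu, samples))

print(mu_hat, best)
```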

## Maximum Likelihood: Click Through Rate

This is different from the Gaussian case: each impression is either a click or not, so it follows the Bernoulli distribution


### Problem set up 
- H = Click, T = No click, H + T = total # impressions
- Also IID
- Let's call p(H) = p, so p(T) = 1 - p
- Bernoulli only has 1 parameter (Gaussian has 2)
- Suppose we flip 2H, 3T - what is the total likelihood?

Likelihood function: `L(N_h, N_t) = p^N_h * (1-p)^N_t`, so for 2H, 3T it is `p^2 * (1-p)^3`

So what is the maximum likelihood estimate of p?

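Taking the log-likelihood and setting its derivative with respect to p to zero gives the solution:

`log L = N_h * log(p) + N_t * log(1-p)`

`d(log L)/dp = N_h / p - N_t / (1-p) = 0`

`p = N_h / (N_h + N_t)`, which is simply how often heads appears over the total number of tosses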
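For the 2H, 3T case, the closed-form estimate can be compared against a brute-force search over p. This is a sketch; the grid resolution and names are illustrative.

```python
import math

def log_likelihood(p, n_h, n_t):
    """Bernoulli log-likelihood for n_h clicks (H) and n_t non-clicks (T)."""
    return n_h * math.log(p) + n_t * math.log(1 - p)

n_h, n_t = 2, 3

# Closed-form maximum likelihood estimate.
p_hat = n_h / (n_h + n_t)

# Likelihood at the estimate: p^N_h * (1-p)^N_t.
likelihood_at_p_hat = p_hat**n_h * (1 - p_hat) ** n_t

# Numerical check on a grid of p values strictly between 0 and 1.
grid = [i / 1000 for i in range(1, 1000)]
best = max(grid, key=lambda p: log_likelihood(p, n_h, n_t))

print(p_hat, best, likelihood_at_p_hat)
```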

## Confidence Intervals

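- The confidence level is the complement of the significance level: confidence level = 1 - _alpha_ (e.g. _alpha_ = 0.05 gives a 95% confidence level)
- A confidence interval is a range, computed from the sample, that most likely contains the real population _mu_ ==> we can (almost) say "_mu_ is probably here"
- What it **doesn't mean** is "_mu_ is in the interval with 95% probability": _mu_ is fixed, and it is the interval that varies from sample to sample. A 95% confidence level means that about 95% of intervals built this way would contain _mu_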
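The interpretation above can be illustrated with a simulation: draw many samples from a Gaussian with a known mean, build a 95% interval from each, and count how often the interval contains the true mean. This is a sketch under stated assumptions: the standard deviation is treated as known, the interval is the usual `mean ± 1.96 * sigma / sqrt(n)`, and all parameter values are made up.

```python
import math
import random

def mean_ci(samples, sigma, z=1.96):
    """Approximate 95% interval for the mean, assuming sigma is known."""
    n = len(samples)
    mean = sum(samples) / n
    half_width = z * sigma / math.sqrt(n)
    return mean - half_width, mean + half_width

rng = random.Random(42)  # fixed seed so the run is repeatable
true_mu, sigma, n, trials = 10.0, 2.0, 30, 2000

hits = 0
for _ in range(trials):
    samples = [rng.gauss(true_mu, sigma) for _ in range(n)]
    low, high = mean_ci(samples, sigma)
    hits += low <= true_mu <= high  # True counts as 1

coverage = hits / trials
print(coverage)  # close to 0.95
```

The true mean never changes between trials; only the intervals move, and roughly 95% of them cover it.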