# 2. Probability

## 2.1 Introduction

Two perspectives:

1. Frequentist interpretation: probabilities represent long-run frequencies of events.
2. Bayesian interpretation: probability is used to quantify our uncertainty about something.

Advantage of the Bayesian definition of probability: it can be used to model our uncertainty about events that do not have long-term frequencies.

## 2.2 A brief review of probability theory

### 2.2.1 Discrete random variables

* $p(A)$: the probability that the event $A$ is true
* $p(\bar{A}) = 1-p(A)$: the probability that the event $A$ does not occur (the event "not $A$")

**Requirement:**

* $0 \leq p(A) \leq 1$:
    * 0: the event definitely will not happen
    * 1: the event definitely will happen

We can extend the notion of binary events by defining a **discrete random variable** $X$, which can take on any value from a finite or countably infinite set $\varkappa$. We denote the probability of the event that $X=x$ by $p(X=x)$, or $p(x)$ for short. Here $p()$ is called a **probability mass function (pmf)**. It satisfies $0 \leq p(x) \leq 1$ and $\sum_{x \in \varkappa} p(x) = 1$.

* $\varkappa$: state space

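Uniform distribution: on a finite state space every state is equally likely, so $p(x) = \dfrac{1}{|\varkappa|}$ for each $x \in \varkappa$.

$\Bbb{I}()$ is the binary **indicator function**: it equals 1 when its argument is true and 0 otherwise.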
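The pmf requirements and the indicator function can be sketched in a few lines of Python. This is only an illustration: the state space `{1, 2, 3, 4}` and the helper names `indicator` and `uniform_pmf` are assumptions made for the example, not part of the notes.

```python
from fractions import Fraction

def indicator(condition):
    """Binary indicator function: 1 if the condition is true, else 0."""
    return 1 if condition else 0

def uniform_pmf(x, states):
    """pmf of a uniform distribution over a finite state space `states`."""
    return Fraction(indicator(x in states), len(states))

# Hypothetical state space chosen only for illustration.
states = {1, 2, 3, 4}

# A pmf must sum to 1 over the state space.
total = sum(uniform_pmf(x, states) for x in states)

print(uniform_pmf(2, states))  # probability of a state inside the space
print(uniform_pmf(7, states))  # probability of a value outside the space
print(total)
```

Exact fractions are used so the sum-to-one check does not depend on floating-point rounding.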


### 2.2.2 Fundamental Rules

#### Probability of a union of two events
$$\begin{aligned}p(A\cup B) &= p(A) + p(B) - p(A\cap B)\\ 
&= p(A) + p(B) \quad \text{if } A \text{ and } B \text{ are mutually exclusive} \end{aligned}$$
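A quick numeric check of the union rule may help: outcomes that lie in both events must be counted only once, so the overlap is subtracted. The fair six-sided die and the two events below are assumptions chosen for illustration.

```python
from fractions import Fraction

# Assumed sample space: one roll of a fair six-sided die.
outcomes = set(range(1, 7))
A = {x for x in outcomes if x % 2 == 0}  # "the roll is even"  -> {2, 4, 6}
B = {x for x in outcomes if x > 3}       # "the roll exceeds 3" -> {4, 5, 6}

def prob(event):
    """Probability of an event under equally likely outcomes."""
    return Fraction(len(event), len(outcomes))

lhs = prob(A | B)                        # p(A or B), computed directly
rhs = prob(A) + prob(B) - prob(A & B)    # union rule with the overlap removed

print(lhs, rhs)
```

Here `A & B = {4, 6}` is not empty, so simply adding `prob(A)` and `prob(B)` would overcount.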

####  Joint probabilities
$$p(A,B) = p(A\cap B) = p(A \vert B)p(B)$$

the marginal distribution, obtained with the sum rule / the rule of total probability:

$$p(A) = \sum_b p(A, B=b) = \sum_b p(A \vert B=b)p(B=b)$$

the chain rule of probability, obtained by applying the product rule repeatedly:

$$p(X_{1:D}) = p(X_1)p(X_2 \vert X_1)p(X_3 \vert X_2, X_1) \cdots p(X_D \vert X_{1:D-1})$$

#### Conditional probability
$$p(A \vert B) = \dfrac{p(A, B)}{p(B)} \quad \text{if } p(B) > 0$$
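The relations between joint, marginal, and conditional probabilities can be checked on a small table. The joint distribution below is a made-up example over two binary variables, and the helper names are hypothetical.

```python
from fractions import Fraction

# Hypothetical joint distribution p(A=a, B=b), keyed by (a, b).
joint = {
    (0, 0): Fraction(3, 10),
    (0, 1): Fraction(1, 10),
    (1, 0): Fraction(2, 10),
    (1, 1): Fraction(4, 10),
}

def marginal_a(a):
    """p(A=a): sum the joint over all values of B."""
    return sum(p for (a_, _), p in joint.items() if a_ == a)

def marginal_b(b):
    """p(B=b): sum the joint over all values of A."""
    return sum(p for (_, b_), p in joint.items() if b_ == b)

def conditional_a_given_b(a, b):
    """p(A=a | B=b) = p(A=a, B=b) / p(B=b), defined when p(B=b) > 0."""
    return joint[(a, b)] / marginal_b(b)

# Marginal of A computed directly from the joint ...
p_a1 = marginal_a(1)
# ... and via sum_b p(A=1 | B=b) p(B=b).
p_a1_total = sum(conditional_a_given_b(1, b) * marginal_b(b) for b in (0, 1))

# Product rule: p(A=1, B=1) = p(A=1 | B=1) p(B=1).
product_rule = conditional_a_given_b(1, 1) * marginal_b(1)

print(p_a1, p_a1_total, product_rule)
```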

### 2.2.3 Bayes rule (Bayes' theorem)
$$p(X=x \vert Y=y) = \dfrac{p(X=x, Y=y)}{p(Y=y)} = \dfrac{p(Y=y \vert X=x)p(X=x)}{\sum_{x'}p(Y=y \vert X=x')p(X=x')}$$

Example: medical diagnosis

> mammogram: a medical test for breast cancer, here taken by a woman in her 40s
>
> The test has a sensitivity of 80%, which means that if you have cancer, the test will be positive with probability 0.8:
>
> $$p(x=1 \vert y=1)=0.8$$
>
> * $x = 1$ is the event the mammogram is positive, $y = 1$ is the event you have breast cancer
>
> The prior probability of having breast cancer is $p(y=1) = 0.004$. Ignoring this prior is called the **base rate fallacy**.
>
> The test may also produce a false positive (false alarm):
>
> $$p(x=1 \vert y=0)=0.1$$
>
> We want to calculate the probability of having breast cancer given that the test is positive. Using Bayes rule:
>
> $$\begin{aligned} p(y=1 \vert x=1) &= \dfrac{p(x=1 \vert y=1)p(y=1)}{p(x=1 \vert y=1)p(y=1) + p(x=1 \vert y=0)p(y=0)} \\
&= \dfrac{0.8 \times 0.004}{0.8\times 0.004 + 0.1 \times 0.996} \\ &= 0.031\end{aligned}$$


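```python
# Mammogram example: posterior probability of cancer given a positive test
p_y1 = 0.004    # prior p(y=1)
p_x1y1 = 0.8    # sensitivity p(x=1 | y=1)
p_x1y0 = 0.1    # false positive rate p(x=1 | y=0)

# Bayes rule: p(y=1 | x=1)
p_y1x1 = (p_x1y1 * p_y1) / (p_x1y1*p_y1 + p_x1y0*(1-p_y1))
print(p_y1x1)  # 0.0311284046692607
```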


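Generative classifiers: we can classify a feature vector $x$ by applying Bayes rule to the class-conditional density $p(x \vert y=c, \theta)$ and the class prior $p(y=c \vert \theta)$: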

$$p(y=c \vert x, \theta) = \dfrac{p(x \vert y=c, \theta)p(y=c \vert \theta)}{\sum_{c'}p(x \vert y=c', \theta)p(y=c' \vert \theta)}$$
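As a rough sketch, the class posterior is just the normalized product of likelihood and prior for each class. The function name `class_posterior` and the dictionary layout are assumptions for illustration; the parameters $\theta$ are assumed to be already folded into the numbers passed in. The example reuses the mammogram figures from the medical diagnosis example.

```python
def class_posterior(likelihoods, priors):
    """Return p(y=c | x) for every class c via Bayes rule.

    likelihoods[c] is p(x | y=c) evaluated at the observed x,
    priors[c] is p(y=c). Both are plain dicts keyed by class label.
    """
    # Numerator of Bayes rule for each class: p(x | y=c) p(y=c).
    joint = {c: likelihoods[c] * priors[c] for c in priors}
    # Denominator: sum over all classes c'.
    evidence = sum(joint.values())
    return {c: joint[c] / evidence for c in joint}

# Observed x: a positive mammogram. Numbers taken from the example above.
posterior = class_posterior(
    likelihoods={"cancer": 0.8, "healthy": 0.1},
    priors={"cancer": 0.004, "healthy": 0.996},
)

print(posterior["cancer"], posterior["healthy"])
```

With two classes this reduces to the same calculation as the medical diagnosis example; the dictionary form only makes the sum over $c'$ explicit.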

### 2.2.4 Independence and conditional independence

$X\perp Y \Leftrightarrow p(X, Y)=p(X)p(Y)$: unconditionally independent / marginally independent

![](./figs/02_independent.png)

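Let $X$ and $Y$ be discrete random variables. For example, let $X$ be a six-sided die and $Y$ a five-sided die. The joint distribution over the $6 \times 5 = 30$ outcomes needs $30 - 1 = 29$ free parameters; the $-1$ comes from the sum-to-one constraint, which fixes the last probability once the others are known. If $X$ and $Y$ are independent, we only need the two marginals, i.e. $(6-1) + (5-1) = 9$ parameters.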
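The parameter counts for the two dice can be reproduced with a short script. It assumes both dice are fair, which the notes do not state; the counts themselves do not depend on that choice.

```python
from fractions import Fraction
from itertools import product

x_states = range(1, 7)  # six-sided die
y_states = range(1, 6)  # five-sided die

# Assumed fair dice; any valid marginals would give the same counts.
p_x = {x: Fraction(1, 6) for x in x_states}
p_y = {y: Fraction(1, 5) for y in y_states}

# Under independence the joint is the product of the marginals.
joint = {(x, y): p_x[x] * p_y[y] for x, y in product(x_states, y_states)}

# One probability in each table is fixed by the sum-to-one constraint.
free_params_joint = len(joint) - 1
free_params_independent = (len(p_x) - 1) + (len(p_y) - 1)

# Recover the marginals from the joint and confirm it factorizes.
marg_x = {x: sum(joint[(x, y)] for y in y_states) for x in x_states}
marg_y = {y: sum(joint[(x, y)] for x in x_states) for y in y_states}
is_independent = all(
    joint[(x, y)] == marg_x[x] * marg_y[y]
    for x, y in product(x_states, y_states)
)

print(free_params_joint, free_params_independent, is_independent)
```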

$X$ and $Y$ are conditionally independent (CI) given $Z$ iff the conditional joint can be written as a product of conditional marginals:

$$X\perp Y \vert Z \Leftrightarrow p(X, Y \vert Z)= p(X \vert Z)p(Y \vert Z)$$

#### Theorem 2.2.1
$X\perp Y \vert Z$ iff there exist functions $g$ and $h$ such that $p(x, y \vert z)= g(x, z)h(y, z)$ for all $x,y,z$ such that $p(z)>0$.
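Conditional independence does not imply marginal independence, and a small constructed example makes this concrete. The numbers below are invented: the joint is built as $p(z)\,p(x \vert z)\,p(y \vert z)$ over three binary variables, so it is conditionally independent by construction, and the script checks both properties.

```python
from fractions import Fraction
from itertools import product

# Hypothetical model: Z is a fair coin; X and Y each depend only on Z.
p_z = {0: Fraction(1, 2), 1: Fraction(1, 2)}
p_x1_given_z = {0: Fraction(1, 10), 1: Fraction(9, 10)}  # p(X=1 | Z=z)
p_y1_given_z = {0: Fraction(2, 10), 1: Fraction(8, 10)}  # p(Y=1 | Z=z)

def bern(p_one, value):
    """Probability of `value` in {0, 1} when p(value=1) = p_one."""
    return p_one if value == 1 else 1 - p_one

# Joint p(x, y, z) = p(z) p(x | z) p(y | z), keyed by (x, y, z).
joint = {
    (x, y, z): p_z[z] * bern(p_x1_given_z[z], x) * bern(p_y1_given_z[z], y)
    for x, y, z in product((0, 1), repeat=3)
}

def prob(**fixed):
    """Sum the joint over all entries consistent with the fixed values."""
    return sum(
        p for (x, y, z), p in joint.items()
        if all({"x": x, "y": y, "z": z}[k] == v for k, v in fixed.items())
    )

# Check p(x, y | z) == p(x | z) p(y | z) for every assignment.
ci_holds = all(
    prob(x=x, y=y, z=z) / prob(z=z)
    == (prob(x=x, z=z) / prob(z=z)) * (prob(y=y, z=z) / prob(z=z))
    for x, y, z in product((0, 1), repeat=3)
)

# Check p(x, y) == p(x) p(y) for every assignment.
marginally_independent = all(
    prob(x=x, y=y) == prob(x=x) * prob(y=y)
    for x, y in product((0, 1), repeat=2)
)

print(ci_holds, marginally_independent)
```

In this example $p(X=1, Y=1) = 37/100$ while $p(X=1)p(Y=1) = 1/4$, so $X$ and $Y$ are dependent once $Z$ is summed out.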

### 2.2.5 Continuous random variables

Suppose $X$ is some uncertain continuous quantity. The probability that $X$ lies in any interval $a \leq X \leq b$ can be computed as follows. Define the events $A=(X\leq a), B=(X \leq b), W=(a < X \leq b)$. 

We have that $B = A \cup W$, and since $A$ and $W$ are mutually exclusive, the sum rule gives $p(B) = p(A) + p(W)$, hence $p(W) = p(B) - p(A)$.

Define the function $F(q) \triangleq p(X \leq q)$, called the **cumulative distribution function (cdf)** of $X$. This is obviously a monotonically increasing function. Using this notation we have

$$p(a < X \leq b) = F(b) - F(a)$$

Now define $f(x) = \frac{d}{dx}F(x)$ (we assume this derivative exists); this is called the **probability density function (pdf)**. Given a pdf, we can compute the probability of a continuous variable being in a finite interval as follows:

$$P(a < X \leq b)=\int_a^b f(x) dx$$
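The two ways of computing an interval probability, as a difference of cdf values and as an integral of the pdf, can be compared numerically. The sketch below assumes a standard normal distribution purely as a convenient example and uses `statistics.NormalDist` from the Python standard library; the interval endpoints and the midpoint rule are illustrative choices.

```python
from statistics import NormalDist

# Assumed example distribution: standard normal.
dist = NormalDist(mu=0.0, sigma=1.0)
a, b = -1.0, 2.0

# P(a < X <= b) as a difference of cdf values.
via_cdf = dist.cdf(b) - dist.cdf(a)

# The same probability as a numerical integral of the pdf (midpoint rule).
n = 10_000
width = (b - a) / n
via_pdf = sum(dist.pdf(a + (i + 0.5) * width) for i in range(n)) * width

print(via_cdf, via_pdf)
```

The two values agree up to the discretization error of the midpoint rule.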

As the size of the interval gets smaller, we can write (using $p(x)$ to denote the density $f(x)$ from here on)

$$P(x \leq X \leq x + dx) \approx p(x) dx$$

* we require $p(x) \geq 0$
* but it is possible that $p(x) > 1$ for a given $x$, so long as the density integrates to 1
    * consider the **uniform distribution** Unif(a,b): $Unif(x \vert a, b) = \dfrac{1}{b-a} \Bbb{I}(a \leq x \leq b)$
    * if we set $a=0, b=\frac{1}{2}$ we have $p(x)=2$ for any $x \in [0, \frac{1}{2}]$
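A short check of the uniform example: the density value is 2 everywhere on $[0, \frac{1}{2}]$, yet the total area under it is still 1. The helper name `unif_pdf` and the midpoint-rule integration are illustrative choices.

```python
def unif_pdf(x, a, b):
    """Density of Unif(a, b): 1/(b-a) inside [a, b], 0 outside."""
    return 1.0 / (b - a) if a <= x <= b else 0.0

a, b = 0.0, 0.5

# The density exceeds 1 at every point of the interval.
density = unif_pdf(0.25, a, b)

# Numerically integrate the density over [a, b] with the midpoint rule.
n = 1000
width = (b - a) / n
area = sum(unif_pdf(a + (i + 0.5) * width, a, b) for i in range(n)) * width

print(density, area)
```

A density is a probability per unit length, so only its integral over an interval has to stay below 1.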

### 2.2.6 Quantiles

Since the cdf $F$ is a monotonically increasing function, it has an inverse; let us denote this by $F^{-1}$. If $F$ is the cdf of $X$, then $F^{-1}(\alpha)$ is the value $x_{\alpha}$ such that $P(X \leq x_{\alpha}) = \alpha$; this is called the $\alpha$ **quantile** of $F$.

We can also use the inverse cdf to compute **tail area probabilities**
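For instance, the inverse cdf gives the cut-off points that leave a chosen amount of probability in each tail. The sketch below assumes a standard normal distribution and a total tail mass of $\alpha = 0.05$; both are illustrative choices, and `inv_cdf` is the inverse cdf method of `statistics.NormalDist` in the Python standard library.

```python
from statistics import NormalDist

# Assumed example distribution: standard normal.
std_normal = NormalDist()
alpha = 0.05  # total probability placed in the two tails

# Quantiles that cut off alpha/2 in each tail.
lower = std_normal.inv_cdf(alpha / 2)
upper = std_normal.inv_cdf(1 - alpha / 2)

# Probability mass left in the central interval (lower, upper].
central_mass = std_normal.cdf(upper) - std_normal.cdf(lower)

print(lower, upper, central_mass)
```

The central interval comes out as roughly $(-1.96, 1.96)$ and contains about 95% of the probability mass.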

### 2.2.7 Mean and variance

**mean / expected value** ($\mu$) 

$$\begin{cases} \Bbb{E}[X] \triangleq \sum_{x\in \varkappa} x\, p(x) & \text{for discrete rv's} \\
\Bbb{E}[X] \triangleq \int_{\varkappa} x\, p(x)\, dx & \text{for continuous rv's}
\end{cases}$$

If this integral is not finite, the mean is not defined.

The **variance** ($\sigma^2$) is a measure of the “spread” of a distribution:

$$\begin{aligned} var[X] &\triangleq \Bbb{E}[(X - \mu)^2] = \int (x-\mu)^2 p(x) dx \\
&= \int x^2 p(x) dx + \mu^2 \int p(x) dx - 2\mu \int x p(x) dx = \Bbb{E}[X^2] - \mu^2
\end{aligned}$$

* so, $\Bbb{E}[X^2] = \mu^2 + \sigma^2$
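Both forms of the variance can be checked on a discrete example, where the integrals become sums. The fair six-sided die below is an assumption made for illustration.

```python
from fractions import Fraction

# Assumed example: pmf of a fair six-sided die.
pmf = {x: Fraction(1, 6) for x in range(1, 7)}

# Mean: E[X] = sum_x x p(x).
mean = sum(x * p for x, p in pmf.items())

# Second moment: E[X^2] = sum_x x^2 p(x).
second_moment = sum(x**2 * p for x, p in pmf.items())

# Variance from the definition E[(X - mu)^2] ...
var_definition = sum((x - mean) ** 2 * p for x, p in pmf.items())
# ... and from the shortcut E[X^2] - mu^2.
var_shortcut = second_moment - mean**2

print(mean, second_moment, var_definition, var_shortcut)
```

For the die, $\mu = 7/2$ and $\Bbb{E}[X^2] = 91/6$, so both expressions give $\sigma^2 = 35/12$.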

## 2.3 Some Common discrete distributions