# 2. Probability

## 2.1 Introduction

Two interpretations:

1. Frequentist interpretation: probabilities represent long-run frequencies of events.
2. Bayesian interpretation: probability is used to quantify our uncertainty about something.

Advantage of the Bayesian interpretation: it can be used to model our uncertainty about events that do not have long-term frequencies.

## 2.2 A brief review of probability theory

### 2.2.1 Discrete random variables

$p(A)$: the probability that the event $A$ is true

$p(\bar{A}) = 1-p(A)$: the probability of the event not $A$

**Requirement:**

* $0 \leq p(A) \leq 1$: 
    * 0: the event definitely will not happen
    * 1: the event definitely will happen

We can extend the notion of binary events by defining a **discrete random variable** $X$, which can take on any value from a finite or countably infinite set $\varkappa$. We denote the probability of the event that $X=x$ by $p(X=x)$, or just $p(x)$ for short. Here $p()$ is called a **probability mass function (pmf)**. It satisfies $0 \leq p(x) \leq 1$ and $\sum_{x \in \varkappa} p(x) = 1$.

* $\varkappa$: state space

Uniform distribution:

$\Bbb{I}()$ is the binary **indicator function**. 


### 2.2.2 Fundamental Rules

#### Probability of a union of two events
$$\begin{aligned}p(A\cup B) &= p(A) + p(B) - p(A\cap B)\\ 
&= p(A) + p(B) \quad \text{if } A \text{ and } B \text{ are mutually exclusive} \end{aligned}$$

####  Joint probabilities
$$p(A,B) = p(A\cap B) = p(A \vert B)p(B)$$

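The **marginal distribution** is obtained by summing over all states of $B$; this is the sum rule / the rule of total probability:

$$p(A) = \sum_b p(A, B=b) = \sum_b p(A \vert B=b)p(B=b)$$

Applying the product rule $p(A, B) = p(A \vert B)p(B)$ repeatedly gives the **chain rule** of probability:

$$p(X_{1:D}) = p(X_1)p(X_2 \vert X_1)p(X_3 \vert X_2, X_1) \cdots p(X_D \vert X_{1:D-1})$$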

#### Conditional probability
$$p(A \vert B) = \dfrac{p(A, B)}{p(B)} \quad \text{if } p(B) > 0$$
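The rules in this subsection can be checked on a small table. The following is a minimal sketch, assuming a made-up joint distribution over two binary variables (the numbers and the helper names `marginal_a`, `marginal_b`, and `conditional_a_given_b` are illustrative, not from the text). It uses the union rule in its standard form, with the intersection subtracted.

```python
# Hypothetical joint distribution p(A=a, B=b) over two binary variables.
joint = {
    (0, 0): 0.3, (0, 1): 0.2,
    (1, 0): 0.1, (1, 1): 0.4,
}

def marginal_a(a):
    # Sum rule: p(A=a) = sum_b p(A=a, B=b)
    return sum(p for (ai, _), p in joint.items() if ai == a)

def marginal_b(b):
    # Sum rule: p(B=b) = sum_a p(A=a, B=b)
    return sum(p for (_, bi), p in joint.items() if bi == b)

def conditional_a_given_b(a, b):
    # Conditional probability: p(A=a | B=b) = p(A=a, B=b) / p(B=b)
    return joint[(a, b)] / marginal_b(b)

# Product rule: p(A=1, B=1) = p(A=1 | B=1) p(B=1)
product = conditional_a_given_b(1, 1) * marginal_b(1)

# Union: p(A=1 or B=1) = p(A=1) + p(B=1) - p(A=1, B=1)
union = marginal_a(1) + marginal_b(1) - joint[(1, 1)]

print(product, union)
```

The union can also be read off directly as one minus the probability that neither event occurs, `1 - joint[(0, 0)]`, which gives a quick sanity check.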

### 2.2.3  Bayes rule(Bayes Theorem)
$$p(X=x \vert Y=y) = \dfrac{p(X=x, Y=y)}{p(Y=y)} = \dfrac{p(Y=y \vert X=x)p(X=x)}{\sum_{x'}p(Y=y \vert X=x')p(X=x')}$$

Example: medical diagnosis

> mammogram: a medical test for breast cancer, here taken by a woman in her 40s 
>
> the test has a sensitivity of 80%, which means, if you have cancer, the test will be positive with probability 0.8
>
> $$p(x=1 \vert y=1)=0.8$$
>
> * $x = 1$ is the event the mammogram is positive, $y = 1$ is the event you have breast cancer
>
> the prior probability of having breast cancer is $p(y=1) = 0.004$ (ignoring this prior is the **base rate fallacy**)
>
> the false positive (false alarm) rate of the test is
>
> $$p(x=1 \vert y=0)=0.1$$
>
> We want the probability of having breast cancer given that the test is positive. Using Bayes rule:
>
> $$\begin{aligned} p(y=1 \vert x=1) &= \dfrac{p(x=1 \vert y=1)p(y=1)}{p(x=1 \vert y=1)p(y=1) + p(x=1 \vert y=0)p(y=0)} \\
> &= \dfrac{0.8 \times 0.004}{0.8\times 0.004 + 0.1 \times 0.996} \\ &= 0.031\end{aligned}$$


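```python
# Posterior probability of cancer given a positive mammogram (Bayes rule)
p_y1 = 0.004    # prior p(y=1)
p_x1y1 = 0.8    # sensitivity p(x=1 | y=1)
p_x1y0 = 0.1    # false positive rate p(x=1 | y=0)
p_y1x1 = (p_x1y1 * p_y1) / (p_x1y1*p_y1 + p_x1y0*(1-p_y1))
print(p_y1x1)   # 0.0311284046692607
```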


Generative classifiers:

$$p(y=c \vert x, \theta) = \dfrac{p(x \vert y=c, \theta)p(y=c \vert \theta)}{\sum_{c'}p(x \vert y=c', \theta)p(y=c' \vert \theta)}$$

### 2.2.4 Independence and conditional independence

$X\perp Y \Leftrightarrow p(X, Y)=p(X)p(Y)$: unconditionally independent / marginally independent

![](./figs/02_independent.png)

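Let $X$ and $Y$ be discrete random variables. For example, if $X$ is a 6-sided die and $Y$ is a 5-sided die, the joint distribution $p(X, Y)$ has $6 \times 5 = 30$ entries and therefore $30 - 1 = 29$ free parameters (the last entry is fixed by the constraint that the probabilities sum to 1). If $X$ and $Y$ are independent, we only need $(6-1) + (5-1) = 9$ parameters, since each marginal is specified separately and each loses one parameter to its own sum-to-one constraint.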
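The parameter counting for the two-dice example can be reproduced in a few lines. This is a sketch that assumes both dice are fair; `is_independent` and the `coupled` table are hypothetical names introduced only for illustration.

```python
from fractions import Fraction
from itertools import product

# Assumed example: a fair 6-sided die X and a fair 5-sided die Y.
p_x = {x: Fraction(1, 6) for x in range(1, 7)}
p_y = {y: Fraction(1, 5) for y in range(1, 6)}

# Under independence the joint is the product of the marginals.
joint = {(x, y): p_x[x] * p_y[y] for x, y in product(p_x, p_y)}

def is_independent(table):
    """Check p(x, y) == p(x) p(y) in every cell, using the table's own marginals."""
    xs = {x for x, _ in table}
    ys = {y for _, y in table}
    marg_x = {x: sum(table[(x, y)] for y in ys) for x in xs}
    marg_y = {y: sum(table[(x, y)] for x in xs) for y in ys}
    return all(table[(x, y)] == marg_x[x] * marg_y[y] for x, y in table)

# Free parameters: one entry is always fixed by the sum-to-one constraint.
params_full_joint = len(joint) - 1                      # 30 - 1 = 29
params_independent = (len(p_x) - 1) + (len(p_y) - 1)    # 5 + 4 = 9

# A dependent counterexample: two binary variables that always agree.
coupled = {(0, 0): Fraction(1, 2), (0, 1): Fraction(0),
           (1, 0): Fraction(0), (1, 1): Fraction(1, 2)}

print(params_full_joint, params_independent)
print(is_independent(joint), is_independent(coupled))
```

Exact fractions are used so that the equality test for independence does not suffer from floating-point rounding.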

$X$ and $Y$ are conditionally independent (CI) given $Z$ iff the conditional joint can be written as a product of conditional marginals:

$$X\perp Y \vert Z \Leftrightarrow p(X, Y \vert Z)= p(X \vert Z)p(Y \vert Z)$$

#### Theorem 2.2.1
$X\perp Y \vert Z$ iff there exist functions $g$ and $h$ such that $p(x, y \vert z)= g(x, z)h(y, z)$ for all $x,y,z$ such that $p(z)>0$.

### 2.2.5 Continuous random variables

Suppose $X$ is some uncertain continuous quantity. The probability that $X$ lies in any interval $a \leq X \leq b$ can be computed as follows. Define the events $A=(X\leq a), B=(X \leq b), W=(a < X \leq b)$. 

We have that $B = A \cup W$, and since $A$ and $W$ are mutually exclusive, the sum rule gives $p(B) = p(A) + p(W)$, hence $p(W) = p(B) - p(A)$.

Define the function $F(q) \triangleq p(X \leq q)$, called the **cumulative distribution function (cdf)** of $X$. This is a monotonically increasing function. Using this notation we have

$$p(a < X \leq b) = F(b) - F(a)$$

Now define $f(x) = \frac{d}{dx}F(x)$ (we assume this derivative exists); this is called the **probability density function (pdf)**. Given a pdf, we can compute the probability of a continuous variable being in a finite interval as follows:

$$P(a < X \leq b)=\int_a^b f(x) dx$$

As the size of the interval gets smaller, we can write (denoting the density $f(x)$ by $p(x)$ from here on)

$$P(x \leq X \leq x + dx) \approx p(x) dx$$

* we require $p(x) \geq 0$
* but it is possible that $p(x) > 1$ for a given $x$, as long as the density integrates to 1
    * consider the **uniform distribution** Unif(a,b): $Unif(x \vert a, b) = \dfrac{1}{b-a} \Bbb{I}(a \leq x \leq b)$
    * if we set $a=0, b=\frac{1}{2}$ we have $p(x)=2$ for any $x \in [0, \frac{1}{2}]$
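A quick numerical check of the point above: a density may exceed 1 pointwise while still integrating to 1. The sketch below uses a simple midpoint Riemann sum; `unif_pdf` and `integrate` are illustrative helper names, not part of the original notes.

```python
def unif_pdf(x, a, b):
    """Density of Unif(a, b): 1/(b-a) inside [a, b], 0 outside."""
    return 1.0 / (b - a) if a <= x <= b else 0.0

def integrate(f, lo, hi, n=10_000):
    """Midpoint Riemann sum of f over [lo, hi] with n equal slices."""
    h = (hi - lo) / n
    return sum(f(lo + (i + 0.5) * h) for i in range(n)) * h

density = unif_pdf(0.25, 0.0, 0.5)                            # pointwise value
area = integrate(lambda x: unif_pdf(x, 0.0, 0.5), 0.0, 0.5)   # total mass

print(density, area)
```

Here the density equals 2 everywhere on $[0, \frac{1}{2}]$, yet the area under it is 1 (up to floating-point error).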

### 2.2.6 Quantiles

Since the cdf $F$ is a monotonically increasing function, it has an inverse; let us denote this by $F^{-1}$. If $F$ is the cdf of $X$, then $F^{-1}(\alpha)$ is the value $x_{\alpha}$ such that $P(X \leq x_{\alpha}) = \alpha$; this is called the $\alpha$ **quantile** of $F$.

We can also use the inverse cdf to compute **tail area probabilities**
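As an illustration of quantiles and tail areas, the sketch below uses the standard normal distribution as an assumed example (the Gaussian itself is only introduced in Section 2.4.1) via Python's standard-library `statistics.NormalDist`, which provides `cdf` and `inv_cdf`.

```python
from statistics import NormalDist

std_normal = NormalDist(mu=0.0, sigma=1.0)

alpha = 0.05
# Quantiles that cut off alpha/2 of the mass in each tail.
lower = std_normal.inv_cdf(alpha / 2)
upper = std_normal.inv_cdf(1 - alpha / 2)

# Probability mass in the central interval (lower, upper].
central_mass = std_normal.cdf(upper) - std_normal.cdf(lower)

print(lower, upper, central_mass)
```

The two quantiles are roughly $\pm 1.96$, so the central interval contains $1 - \alpha = 95\%$ of the probability mass.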

### 2.2.7 Mean and variance

**mean / expected value** ($\mu$) 

$$\begin{cases} \Bbb{E}[X] \triangleq \sum_{x\in \varkappa} x p(x) \quad \text{for discrete rv's} \\
\Bbb{E}[X] \triangleq \int_{\varkappa} x p(x) dx \quad \text{for continuous rv's}
\end{cases}$$

If this integral is not finite, the mean is not defined.

The **variance** ($\sigma^2$) is a measure of the “spread” of a distribution:

$$\begin{aligned} var[X] &\triangleq \Bbb{E}[(X - \mu)^2] = \int (x-\mu)^2 p(x) dx \\
&= \int x^2 p(x) dx + \mu^2 \int p(x) dx - 2\mu \int x p(x) dx = \Bbb{E}[X^2] - \mu^2
\end{aligned}$$

* so, $\Bbb{E}[X^2] = \mu^2 + \sigma^2$
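The identity $\Bbb{E}[X^2] = \mu^2 + \sigma^2$ can be verified exactly on a small discrete example. The sketch below assumes a fair six-sided die and uses exact fractions so that the comparison is not affected by rounding.

```python
from fractions import Fraction

# Assumed example: pmf of a fair six-sided die.
pmf = {x: Fraction(1, 6) for x in range(1, 7)}

mean = sum(x * p for x, p in pmf.items())                      # E[X]
second_moment = sum(x ** 2 * p for x, p in pmf.items())        # E[X^2]
variance = sum((x - mean) ** 2 * p for x, p in pmf.items())    # E[(X - mu)^2]

print(mean, second_moment, variance)
```

For this pmf the mean is $7/2$, the second moment is $91/6$, and the variance is $35/12$, which equals $91/6 - (7/2)^2$.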

## 2.3 Some Common discrete distributions

### 2.3.1 The binomial and Bernoulli distributions

$$X \sim Bin(n, \theta)$$

pmf:

$$Bin(k \vert n, \theta) \triangleq \dbinom{n}{k} \theta^{k} (1-\theta)^{n-k}$$

* $\dbinom{n}{k} = \dfrac{n!}{(n-k)!k!}$: binomial coefficient, the number of ways to choose $k$ items from $n$

When $n=1$, it becomes Bernoulli distribution.

$$Bern(x \vert \theta) = \theta^{\Bbb{I}(x=1)}(1-\theta)^{\Bbb{I}(x=0)}$$
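A minimal sketch of both pmfs, using the binomial coefficient $\binom{n}{k} = \frac{n!}{(n-k)!k!}$ from `math.comb`. The function names `binom_pmf` and `bern_pmf` are chosen here for illustration.

```python
from math import comb

def binom_pmf(k, n, theta):
    """Probability of k successes in n independent trials with success rate theta."""
    return comb(n, k) * theta ** k * (1 - theta) ** (n - k)

def bern_pmf(x, theta):
    """Bernoulli pmf: theta if x == 1, 1 - theta if x == 0."""
    return theta if x == 1 else 1 - theta

# The pmf sums to 1 over k = 0..n.
total = sum(binom_pmf(k, 10, 0.3) for k in range(11))

# With n = 1 the binomial reduces to the Bernoulli.
reduces = all(abs(binom_pmf(x, 1, 0.3) - bern_pmf(x, 0.3)) < 1e-12 for x in (0, 1))

print(binom_pmf(2, 4, 0.5), total, reduces)
```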

### 2.3.2 The multinomial and multinoulli distributions

pmf:

$$Mu(x \vert n, \theta) \triangleq \begin{pmatrix} n \\ x_1 \cdots x_K \end{pmatrix} \prod_{j=1}^K \theta_j^{x_j}$$

* $x = (x_1, \cdots, x_K)$: random vector, where $x_j$ is the number of times class $j$ occurs in the $n$ trials
* $\begin{pmatrix} n \\ x_1 \cdots x_K \end{pmatrix} = \dfrac{n!}{x_1!\cdots x_K!}$

When $n = 1$, we can use **dummy encoding(one hot encoding)**, $x$ becomes $x = [\Bbb{I}(x = 1), \cdots ,\Bbb{I}(x = K)]$, and pmf becomes:

$$Mu(x \vert 1, \theta)=\prod_{j=1}^K \theta_j^{\Bbb{I}(x_j=1)}$$

![](./figs/02_summary_binomial.png)

### 2.3.3 The Poisson distribution

$$X \sim Poi(\lambda), \quad \lambda >0, X \in \{0, 1, 2 \cdots\}$$

pmf:

$$Poi(x \vert \lambda) = e^{-\lambda} \dfrac{\lambda^x}{x!}$$

* $e^{-\lambda}$ : the normalization constant, required to ensure the distribution sums to 1

The Poisson distribution is often used as a model for counts of rare events like radioactive decay and traffic accidents.
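The role of $e^{-\lambda}$ as a normalization constant can be checked numerically. The sketch below truncates the infinite sum at 100 terms, which is an assumption that is harmless for a small rate such as $\lambda = 3$; `poisson_pmf` is an illustrative name.

```python
from math import exp, factorial

def poisson_pmf(x, lam):
    """Poi(x | lam) = exp(-lam) * lam**x / x!"""
    return exp(-lam) * lam ** x / factorial(x)

lam = 3.0
support = range(100)   # truncated support; the remaining tail mass is negligible here

total = sum(poisson_pmf(x, lam) for x in support)
mean = sum(x * poisson_pmf(x, lam) for x in support)

print(total, mean)
```

The truncated sum is 1 up to floating-point error, and the mean comes out equal to $\lambda$, a standard property of the Poisson distribution.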

### 2.3.4  The empirical distribution

Given a set of data, $D=\{x_1, \cdots, x_N \}$, define the **empirical distribution (empirical measure)** as follows:

$$p_{emp}(A) \triangleq \dfrac{1}{N} \sum_{i=1}^N \delta_{x_i}(A)$$

* $\delta_{x_i}(A) = \begin{cases} 0 \quad \text{if } x_i \notin A\\ 1 \quad \text{if } x_i \in A\end{cases}$: Dirac measure

In general, we can associate "weights" with each sample:

$$p(x) = \sum_{i=1}^N w_i \delta_{x_i}(x)$$

where we require $0 \leq w_i \leq 1$ and $\sum_{i=1}^N w_i = 1$.
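A small sketch of the empirical measure and its weighted generalization. The set $A$ is represented as a predicate, and the data, the weights, and the function names are made up for illustration.

```python
def empirical_prob(data, event):
    """p_emp(A): fraction of samples x_i that fall in the set A."""
    return sum(1 for x in data if event(x)) / len(data)

def weighted_prob(data, weights, event):
    """Weighted version: sum of w_i over the samples that fall in A."""
    assert all(0 <= w <= 1 for w in weights)
    assert abs(sum(weights) - 1.0) < 1e-12
    return sum(w for x, w in zip(data, weights) if event(x))

data = [1, 2, 2, 3, 5]                   # hypothetical samples
in_a = lambda x: x <= 2                  # the set A = {x : x <= 2}

p_uniform = empirical_prob(data, in_a)   # 3 of the 5 samples lie in A
p_weighted = weighted_prob(data, [0.1, 0.1, 0.1, 0.2, 0.5], in_a)

print(p_uniform, p_weighted)
```

With equal weights $w_i = 1/N$ the weighted form reduces to the plain empirical distribution.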



## 2.4 Some common continuous distributions

### 2.4.1 Gaussian (normal) distribution

$$X \sim N(\mu, \sigma^2)$$

pdf:

$$N(x \vert \mu, \sigma^2) \triangleq \dfrac{1}{\sqrt{2\pi \sigma^2} } e^{-\frac{1}{2\sigma^2}(x-\mu)^2 }$$

* $\mu = \Bbb{E}[X]$
* $\sigma^2 = var[X]$
* $\sqrt{2\pi \sigma^2}$: normalization constant needed to ensure the density integrates to 1

When $\mu=0, \sigma^2 = 1$, then $X$ follows a **standard normal distribution**

**precision** of a Gaussian: $\lambda = \dfrac{1}{\sigma^2}$

cdf:

$$\Phi(x; \mu, \sigma^2) \triangleq \int_{-\infty}^x N(z \vert \mu, \sigma^2) dz$$

The Gaussian distribution is the most widely used distribution in statistics. There are several reasons for this. 

1. First, it has two parameters which are easy to interpret, and which capture some of the most basic properties of a distribution, namely its mean and variance. 
2. Second, the central limit theorem tells us that sums of independent random variables have an approximately Gaussian distribution, making it a good choice for modeling residual errors or “noise”. 
3. Third, the Gaussian distribution makes the least number of assumptions (has maximum entropy), subject to the constraint of having a specified mean and variance, as we show in Section 9.2.6; this makes it a good default choice in many cases. 
4. Finally, it has a simple mathematical form, which results in easy to implement, but often highly effective, methods, as we will see. 

### 2.4.2 Degenerate pdf

In the limit $\sigma^2 \rightarrow 0$, the Gaussian becomes an infinitely tall and infinitely thin "spike" centered at $\mu$:

$$\underset{\sigma^2 \rightarrow 0}{\lim} N(x \vert \mu, \sigma^2) = \delta(x-\mu)$$

*  Dirac delta function: $\delta(x) = \begin{cases} \infty \quad if\ x = 0 \\ 0 \quad if\ x \not = 0\end{cases}$, such that $\int_{-\infty}^{\infty} \delta(x)dx = 1$

A useful property of delta functions is the sifting property, which selects out a single term from a sum or integral:

$$\int_{-\infty}^{\infty} f(x)\delta(x-\mu) dx = f(\mu)$$

since the integrand is only non-zero if $x-\mu = 0$.

One problem with the Gaussian distribution is that it is sensitive to outliers, since the log-probability only decays quadratically with distance from the center. A more robust distribution is the **Student t distribution**:

$$T(x \vert \mu, \sigma^2, v) \propto \lbrack 1 + \dfrac{1}{v}(\dfrac{x-\mu}{\sigma})^2 \rbrack^{-\frac{v+1}{2}}$$

* $v>0$: degrees of freedom
* mean = mode = $\mu$, variance = $\dfrac{v\sigma^2}{(v-2)}$
* The variance is only defined if $v > 2$. The mean is only defined if $v > 1$
* it is more robust at lower $v$, because it has fatter tails than the Gaussian

If $v = 1$, this distribution is known as the **Cauchy or Lorentz** distribution. This is notable for having such heavy tails that the integral that defines the mean does not converge.

For $v \gg 5$, the Student distribution rapidly approaches a Gaussian distribution and loses its robustness properties.

### 2.4.3 The Laplace distribution

Another distribution with heavy tails is the **Laplace distribution(double sided exponential distribution)**

pdf:

$$Lap(x \vert \mu, b) \triangleq \dfrac{1}{2b} \exp(-\dfrac{\vert x-\mu \vert}{b})$$

* $\mu$: location parameter
* $b > 0$: scale parameter.
* mean = mode = $\mu$, variance = $2b^2$
* puts more probability density at 0 than the Gaussian

![](./figs/02_robustness.png)

### 2.4.4 The gamma distribution

A flexible distribution for positive real-valued rv’s, $T > 0$.

$$Ga(T \vert shape=a, rate=b) \triangleq \dfrac{b^a}{\Gamma(a)}T^{a-1}e^{-Tb}$$

* where $\Gamma(a) \triangleq \int_0^{\infty} u^{a-1}e^{-u}du$ is the gamma function
* shape: $a > 0$ 
* rate: $b > 0$
* mean = $\dfrac{a}{b}$, mode = $\dfrac{a-1}{b}$, var = $\dfrac{a}{b^2}$

Special cases:

* **Exponential distribution**: $Expon(x \vert \lambda) \triangleq Ga(x \vert 1, \lambda)$. This distribution describes the times between events in a Poisson process, i.e. a process in which events occur continuously and independently at a constant average rate $\lambda$.
* **Erlang distribution**: a gamma distribution whose shape $a$ is an integer. It is common to fix $a=2$, giving $Erlang(x \vert \lambda) = Ga(x \vert shape=2, rate=\lambda)$.
* **Chi-squared distribution**: $\chi^2(x \vert v) \triangleq Ga(x \vert \frac{v}{2}, \frac{1}{2})$ This is the distribution of the sum of squared Gaussian random variables. $Z_i \sim N(0,1)$ and $S=\sum_{i=1}^v Z_i^2$ then $S \sim \chi_v^2$
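The chi-squared special case can be checked by simulation: sum $v$ squared standard normal draws and compare the sample moments with the gamma moments $a/b$ and $a/b^2$ for $a = v/2$, $b = 1/2$. This is a rough sketch with an arbitrary seed and sample size, so the match is only approximate.

```python
import random

def sample_chi_squared(v, rng):
    """Draw S = sum of v squared standard normal variables."""
    return sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(v))

rng = random.Random(0)   # fixed seed so the run is repeatable
v = 3
samples = [sample_chi_squared(v, rng) for _ in range(20_000)]

sample_mean = sum(samples) / len(samples)
sample_var = sum((s - sample_mean) ** 2 for s in samples) / len(samples)

# Moments predicted by Ga(shape=v/2, rate=1/2)
a, b = v / 2, 1 / 2
gamma_mean = a / b        # equals v
gamma_var = a / b ** 2    # equals 2v

print(sample_mean, gamma_mean)
print(sample_var, gamma_var)
```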

**inverse gamma** distribution: if $X \sim Ga(a, b)$, then

$$\dfrac{1}{X} \sim IG(a, b)$$

$$IG(x \vert shape=a, scale=b) \triangleq \dfrac{b^a}{\Gamma(a)}x^{-(a+1)}e^{-b/x}$$

* mean = $\dfrac{b}{a-1}$, mode = $\dfrac{b}{a+1}$, var = $\dfrac{b^2}{(a-1)^2(a-2)}$
* The mean only exists if $a > 1$. The variance only exists if $a > 2$.

### 2.4.5 The beta distribution

**beta distribution** has support over the interval $[0, 1]$

$$Beta(x \vert a, b) = \dfrac{1}{B(a,b)} x^{a-1}(1-x)^{b-1}$$

* $B(a,b) \triangleq \dfrac{\Gamma(a)\Gamma(b)}{\Gamma(a+b)}$
* $a, b > 0$ to ensure the distribution is integrable 
* mean = $\dfrac{a}{a+b}$, mode = $\dfrac{a-1}{a+b-2}$, var = $\dfrac{ab}{(a+b)^2(a+b+1)}$

![](./figs/02_beta.png)

### 2.4.6 Pareto distribution

It is used to model the distribution of quantities that exhibit **long tails (heavy tails)**.

example: Zipf’s law

pdf:

$$Pareto(x \vert k, m)=km^kx^{-(k+1)}\Bbb{I}(x \geq m)$$

* This density asserts that $x$ must be greater than some constant $m$, but not too much greater, where $k$ controls what is “too much”.
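* As $k \rightarrow \infty$, the distribution approaches $\delta(x - m)$

On a log-log scale the pdf is a straight line for $x \geq m$ (a **power law**):

$$\log p(x) = a \log x + c$$

with $a = -(k+1)$ and $c = \log k + k \log m$.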

* mean = $\dfrac{km}{k-1}, \quad k>1$, mode = $m$, var = $\dfrac{m^2k}{(k-1)^2(k-2)}, \quad if\ k>2$

## 2.5 Joint probability distributions

### 2.5.1 Covariance and correlation

The **covariance** between two rv’s $X$ and $Y$ measures the degree to which $X$ and $Y$ are (linearly) related.

$$cov[X, Y] \triangleq \Bbb{E}[(X - \Bbb{E}[X])(Y - \Bbb{E}[Y])] = \Bbb{E}[XY] - \Bbb{E}[X]\Bbb{E}[Y]$$

If $x$ is a $d$-dimensional random vector, its **covariance matrix** is defined to be the following symmetric, positive semi-definite matrix:

$$\begin{aligned} cov[x] 
&\triangleq \Bbb{E}[(x - \Bbb{E}[x])(x - \Bbb{E}[x])^T] \\
&= \begin{pmatrix} 
var[x_1] & cov[x_1, x_2] & \cdots & cov[x_1, x_d] \\
cov[x_2, x_1] & var[x_2] & \cdots & cov[x_2, x_d] \\
\vdots & \vdots & \ddots & \vdots \\
cov[x_d, x_1] & cov[x_d, x_2] & \cdots & var[x_d]
\end{pmatrix}
\end{aligned}$$

**(Pearson) correlation coefficient**:

$$corr[X, Y] = \dfrac{cov[X, Y]}{\sqrt{var[X]var[Y]} }$$

correlation matrix:

$$\begin{pmatrix} 
corr[x_1, x_1] & corr[x_1, x_2] & \cdots & corr[x_1, x_d] \\
corr[x_2, x_1] & corr[x_2, x_2] & \cdots & corr[x_2, x_d] \\
\vdots & \vdots & \ddots & \vdots \\
corr[x_d, x_1] & corr[x_d, x_2] & \cdots & corr[x_d, x_d]
\end{pmatrix}$$

* $-1 \leq corr[X, Y] \leq 1$
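The sketch below computes covariance and Pearson correlation from their definitions on a tiny made-up data set, treating the data as the whole population (dividing by $N$). It also shows why the text says "linearly" related: $y = x^2$ is completely determined by $x$, yet its correlation with $x$ is zero on a symmetric range.

```python
def mean(v):
    return sum(v) / len(v)

def cov(x, y):
    """Population covariance E[(X - E[X])(Y - E[Y])]."""
    mx, my = mean(x), mean(y)
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / len(x)

def corr(x, y):
    """Pearson correlation coefficient cov[X, Y] / sqrt(var[X] var[Y])."""
    return cov(x, y) / (cov(x, x) * cov(y, y)) ** 0.5

x = [-2, -1, 0, 1, 2]
linear = [3 * v + 1 for v in x]    # exact linear relation
squared = [v ** 2 for v in x]      # depends on x, but not linearly

corr_linear = corr(x, linear)
corr_squared = corr(x, squared)

print(corr_linear, corr_squared)
```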

### 2.5.2 The multivariate Gaussian

$$N(x \vert \mu, \Sigma) \triangleq \dfrac{1}{(2\pi)^{D/2}\vert \Sigma \vert^{1/2} } \exp [-\dfrac{1}{2}(x-\mu)^T \Sigma^{-1} (x-\mu)]$$

* mean vector: $\mu = \Bbb{E}[x] \in \Bbb{R}^D$ 
* covariance matrix: $\Sigma = cov[x]$, a $D \times D$ matrix
* the covariance matrix has $D(D + 1)/2$ free parameters, since $\Sigma$ is symmetric

**precision matrix (concentration matrix)**:

$\Lambda = \Sigma^{-1}$

* $(2\pi)^{-D/2}\vert \Lambda \vert^{1/2}$ ensures that the pdf integrates to 1.

### 2.5.3 Multivariate Student t distribution

$$\begin{aligned} T(x \vert \mu, \Sigma, v) &= \dfrac{\Gamma(v/2 + D/2)}{\Gamma(v/2)} \dfrac{\vert \Sigma \vert^{-1/2} }{v^{D/2} \pi^{D/2} } \times [1 + \dfrac{1}{v} (x-\mu)^T \Sigma^{-1} (x-\mu)]^{-(\frac{v+D}{2}) } \\
&= \dfrac{\Gamma(v/2 + D/2)}{\Gamma(v/2)} \vert \pi V \vert^{-1/2} \times [1 + (x-\mu)^T V^{-1} (x-\mu)]^{-(\frac{v+D}{2}) }
\end{aligned}$$

* $\Sigma$: scale matrix
* $V = v\Sigma$
* This has fatter tails than a Gaussian. The smaller $v$ is, the fatter the tails.
* mean = mode = $\mu$, Cov = $\dfrac{v}{v-2}\Sigma$ 

### 2.5.4 Dirichlet distribution

A multivariate generalization of the beta distribution is the **Dirichlet distribution**, which has support over the **probability simplex**

$$S_K = \{x: 0 \leq x_k \leq 1, \sum_{k=1}^K x_k=1 \}$$

pdf:

$$Dir(x \vert \alpha) \triangleq \dfrac{1}{B(\alpha)} \prod_{k=1}^K x_k^{\alpha_k -1} \Bbb{I}(x \in S_K)$$

* $B(\alpha_1, \cdots , \alpha_K) \triangleq \dfrac{\prod_{k=1}^K \Gamma(\alpha_k)}{\Gamma(\alpha_0)}$ where $\alpha_0 \triangleq \sum_{k=1}^K \alpha_k$: the natural generalization of the beta function to $K$ variables
* $\Bbb{E}[x_k] = \dfrac{\alpha_k}{\alpha_0}$, $mode[x_k] = \dfrac{\alpha_k -1}{\alpha_0 - K}$, $var[x_k] = \dfrac{\alpha_k(\alpha_0-\alpha_k)}{\alpha_0^2(\alpha_0+1)}$


## 2.6 Transformations of random variables
If $x \sim p()$ is some random variable, and $y = f(x)$, what is the distribution of $y$? This is the question we address in this section.

### 2.6.1 Linear transformations

Suppose $x$ has mean $\mu = \Bbb{E}[x]$ and covariance $\Sigma = cov[x]$, and let $y = f(x) = Ax + b$. Then:

* $\Bbb{E}[y] = A\mu + b$ 
* $cov[y] = A \Sigma A^T$

### 2.6.2 General transformations

If $X$ is a discrete rv, we can derive the pmf for $y$ by simply summing up the probability mass for all the $x$’s such that $f(x) = y$:

$$p_y(y) = \underset{x:f(x)=y}{\sum} p_x(x)$$

If $X$ is a continuous rv, we work with the cdf instead:

$$P_y(y) \triangleq P(Y \leq y) = P(f(X) \leq y) = P(X \in \{x \vert f(x) \leq y\})$$

We can derive the pdf of $y$ by differentiating the cdf.

In the case of monotonic and hence invertible functions (taking $f$ to be increasing), we can write

$$P_y(y) \triangleq P(Y \leq y) = P(X \leq f^{-1}(y)) = P_x(f^{-1}(y))$$

$$p_y(y) \triangleq \dfrac{d}{dy} P_y(y) = \dfrac{d}{dy} P_x(f^{-1}(y)) = \dfrac{dx}{dy} \dfrac{d}{dx} P_x(x) = \dfrac{dx}{dy} p_x(x)$$

where $x=f^{-1}(y)$.

Since the sign of this change is not important, we take the absolute value to get the general expression, the **change of variables** formula:

$$p_y(y) = p_x(x)\vert \dfrac{dx}{dy} \vert$$
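As a worked check of the change of variables formula, the sketch below assumes $X \sim Unif(0, 1)$ and the increasing map $y = x^2$, so that $x = \sqrt{y}$ and $\frac{dx}{dy} = \frac{1}{2\sqrt{y}}$. It compares the formula against a numerical derivative of the cdf of $Y$. All function names are illustrative.

```python
from math import sqrt

# Assumed example: X ~ Unif(0, 1) and the increasing map y = f(x) = x**2.
def p_x(x):
    return 1.0 if 0.0 <= x <= 1.0 else 0.0

def cdf_x(x):
    return min(max(x, 0.0), 1.0)

def f_inv(y):
    return sqrt(y)

def dx_dy(y):
    return 1.0 / (2.0 * sqrt(y))

def p_y(y):
    """Change of variables: p_y(y) = p_x(f^{-1}(y)) * |dx/dy|."""
    return p_x(f_inv(y)) * abs(dx_dy(y))

def cdf_y(y):
    """P(Y <= y) = P(X <= f^{-1}(y)) for an increasing f."""
    return cdf_x(f_inv(y))

# The pdf should match a central-difference derivative of the cdf.
y0, h = 0.25, 1e-6
numeric_pdf = (cdf_y(y0 + h) - cdf_y(y0 - h)) / (2 * h)
formula_pdf = p_y(y0)

print(numeric_pdf, formula_pdf)
```

At $y = 0.25$ both values are close to 1, since $p_y(y) = \frac{1}{2\sqrt{y}}$ on $(0, 1]$ for this example.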


### 2.6.3 Central limit theorem

Now consider $N$ random variables with pdf’s (not necessarily Gaussian) $p(x_i)$, each with mean $\mu$ and variance $\sigma^2$. We assume each variable is **independent and identically distributed(iid)**. Let $S_N = \sum_{i=1}^N X_i$ be the sum of the rv’s. This is a simple but widely used transformation of rv’s. One can show that, as $N$ increases, the distribution of this sum approaches

$$p(S_N=s) = \dfrac{1}{\sqrt{2\pi N \sigma^2} } \exp(-\dfrac{(s-N\mu)^2}{2N \sigma^2})$$

Hence the distribution of the quantity $Z_N \triangleq \dfrac{S_N-N\mu }{\sigma \sqrt{N} } = \dfrac{\bar{X}-\mu }{\sigma / \sqrt{N} }$ converges to the standard normal, where $\bar{X} = \dfrac{1}{N} \sum_{i=1}^N X_i$ is the sample mean.
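The convergence can be observed in a small simulation. The sketch below assumes the $X_i$ are iid $Unif(0, 1)$ (so $\mu = 1/2$, $\sigma^2 = 1/12$) and uses an arbitrary seed, $N = 30$, and 20,000 repetitions; the results are therefore only approximate.

```python
import random
from math import sqrt

rng = random.Random(0)          # fixed seed so the run is repeatable
N, trials = 30, 20_000
mu, sigma = 0.5, sqrt(1 / 12)   # mean and standard deviation of Unif(0, 1)

def standardized_sum(rng):
    """Z_N = (S_N - N*mu) / (sigma * sqrt(N)) for N iid Unif(0, 1) draws."""
    s = sum(rng.random() for _ in range(N))
    return (s - N * mu) / (sigma * sqrt(N))

z = [standardized_sum(rng) for _ in range(trials)]

z_mean = sum(z) / trials
z_var = sum((v - z_mean) ** 2 for v in z) / trials
# For a standard normal, about 95% of the mass lies in [-1.96, 1.96].
frac_within = sum(1 for v in z if abs(v) <= 1.96) / trials

print(z_mean, z_var, frac_within)
```

The sample mean and variance of $Z_N$ come out near 0 and 1, and the fraction within $\pm 1.96$ is near 0.95, as expected for a standard normal.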

## 2.7 Monte Carlo approximation
Computing the distribution of a function of an rv using the change of variables formula can be difficult. A simple alternative is the **Monte Carlo approximation**:

1. generate $S$ samples from the distribution: $x_1, \cdots x_S$
2. Given the samples, we can approximate the distribution of $f(X)$ by using the empirical distribution of $\{ f(x_s) \}_{s=1}^S$

Monte Carlo integration:

$\Bbb{E}[f(X)] = \int f(x)p(x)dx \approx \dfrac{1}{S} \sum_{s=1}^S f(x_s)$ where $x_s \sim p(X)$
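A minimal sketch of Monte Carlo integration, assuming $X \sim Unif(-1, 1)$ and $f(x) = x^2$, for which the exact answer is $\int_{-1}^{1} x^2 \cdot \frac{1}{2} dx = \frac{1}{3}$. The helper `mc_expectation`, the seed, and the sample size are illustrative choices.

```python
import random

def mc_expectation(f, sampler, num_samples, rng):
    """Approximate E[f(X)] by averaging f over samples x_s ~ p(X)."""
    return sum(f(sampler(rng)) for _ in range(num_samples)) / num_samples

rng = random.Random(0)   # fixed seed so the run is repeatable
estimate = mc_expectation(
    f=lambda x: x ** 2,
    sampler=lambda r: r.uniform(-1.0, 1.0),
    num_samples=100_000,
    rng=rng,
)
exact = 1 / 3   # integral of x^2 * (1/2) over [-1, 1]

print(estimate, exact)
```

The accuracy depends on the number of samples $S$ rather than on the dimensionality of $x$: the standard error of the estimate shrinks like $1/\sqrt{S}$.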


## 2.8 Information theory

### 2.8.1 Entropy

$\Bbb{H}(X)$($\Bbb{H}(p)$) : The entropy of a random variable $X$ with distribution $p$, a measure of its uncertainty

In particular, for a discrete variable with $K$ states: 

$$\Bbb{H}(X) \triangleq - \sum_{k=1}^K p(X=k) \log_2 p(X=k)$$

* base = 2: **bits**, base = $e$: **nats**

binary entropy function: $X \in \{0, 1\}$, $p(X=1)=\theta, p(X=0)=1-\theta$

$$\begin{aligned}\Bbb{H} &= -[p(X=1)\log_2p(X=1)+p(X=0)\log_2p(X=0)] \\
&= -[\theta \log_2\theta+(1-\theta)\log_2(1-\theta)]\end{aligned}$$
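The entropy formulas translate directly into code. The sketch below measures entropy in bits and uses the usual convention $0 \log 0 = 0$ by skipping zero-probability states; `entropy` and `binary_entropy` are illustrative names.

```python
from math import log2

def entropy(p):
    """Entropy in bits of a discrete distribution given as a list of probabilities."""
    return -sum(pk * log2(pk) for pk in p if pk > 0)

def binary_entropy(theta):
    """Entropy in bits of a binary variable with p(X=1) = theta."""
    return entropy([theta, 1 - theta])

h_fair = binary_entropy(0.5)        # maximal uncertainty for a binary variable
h_biased = binary_entropy(0.9)      # more predictable, so lower entropy
h_certain = binary_entropy(1.0)     # no uncertainty at all
h_uniform4 = entropy([0.25] * 4)    # uniform over K = 4 states

print(h_fair, h_biased, h_certain, h_uniform4)
```

The binary entropy peaks at 1 bit for $\theta = 0.5$ and falls to 0 as $\theta$ approaches 0 or 1; a uniform distribution over $K$ states has entropy $\log_2 K$.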

### 2.8.2 KL divergence

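Kullback-Leibler divergence (KL divergence, relative entropy): a measure of the dissimilarity of two probability distributions, $p$ and $q$. For discrete distributions over $K$ states:

$$\Bbb{KL}(p \Vert q) \triangleq \sum_{k=1}^K p_k \log \dfrac{p_k}{q_k}$$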
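A short sketch of the KL divergence for discrete distributions, using the standard definition $\sum_k p_k \log \frac{p_k}{q_k}$ with base-2 logarithms (bits). It assumes $q_k > 0$ wherever $p_k > 0$. The two example distributions are made up for illustration.

```python
from math import log2

def kl_divergence(p, q):
    """KL(p || q) in bits; assumes q_k > 0 wherever p_k > 0."""
    return sum(pk * log2(pk / qk) for pk, qk in zip(p, q) if pk > 0)

p = [0.5, 0.5]
q = [0.9, 0.1]

kl_pq = kl_divergence(p, q)
kl_qp = kl_divergence(q, p)
kl_pp = kl_divergence(p, p)

print(kl_pq, kl_qp, kl_pp)
```

The output illustrates two basic properties: the divergence is zero when the distributions are identical, and it is not symmetric, so $\Bbb{KL}(p \Vert q)$ and $\Bbb{KL}(q \Vert p)$ generally differ.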

### 2.8.3 Mutual information