$\newcommand{\pr}{\textrm{Pr}}$
$\newcommand{\l}{\left}$
$\newcommand{\r}{\right}$
$\newcommand\given[1][]{\:#1\vert\:}$
$\newcommand{\var}{\textrm{Var}}$
$\newcommand{\mc}{\mathcal}$
$\newcommand{\lp}{\left(}$
$\newcommand{\rp}{\right)}$
$\newcommand{\lb}{\left\{}$
$\newcommand{\rb}{\right\}}$
$\newcommand{\iid}{\textrm{i.i.d. }}$

# 2.1 Belief functions and probabilities

Probabilities are a way to numerically express rational beliefs.
This first section shows that probabilities satisfy some general
features that we would expect a measure of "belief" to have, so
it seems reasonable to use probabilities to represent our belief
in something.


# 2.2 Events, partitions, and Bayes' rule

**Bayes' rule**:
\begin{align}
\pr\left(H_j \given E\right) &= \frac{\pr\l(E \given H_j\r)\pr\l(H_j\r)}{\pr\l(E\r)} \\
&= \frac{\pr\l(E \given H_j\r)\pr\l(H_j\r)}{\sum_{k=1}^K \pr\l(E \given H_k\r)\pr\l(H_k\r)}
\end{align}

**Bayes factors**: $\l\{H_1, \cdots, H_K\r\}$ often refer to disjoint hypotheses and $E$ refers
to data. To compare hypotheses $H_i$ and $H_j$ post-experimentally, we often calculate the ratio:
\begin{align}
\frac{\pr\l(H_i \given E\r)}{\pr\l(H_j \given E\r)} &= 
\frac{\pr\l(E \given H_i\r)}{\pr\l(E \given H_j\r)} \times \frac{\pr\l(H_i\r)}{\pr\l(H_j\r)} \\
&= \textrm{"Bayes factor"}\times \textrm{"prior beliefs"}
\end{align}

This quantity reminds us that Bayes' rule does not determine what our beliefs should be after we 
see data; it only tells us how they should change from our prior beliefs.
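
As a quick numerical check of the two formulas above, here is a minimal sketch in Python. The three priors and likelihoods are made-up numbers chosen only for illustration, and the function name `posterior` is not from the text.

```python
def posterior(priors, likelihoods):
    """Apply Bayes' rule over a partition H_1, ..., H_K.

    priors[k] is Pr(H_k) and likelihoods[k] is Pr(E | H_k).
    """
    joint = [p * l for p, l in zip(priors, likelihoods)]
    evidence = sum(joint)  # Pr(E), by the law of total probability
    return [j / evidence for j in joint]


# Hypothetical numbers, chosen only for illustration.
priors = [0.5, 0.3, 0.2]
likelihoods = [0.1, 0.4, 0.7]

post = posterior(priors, likelihoods)

# Compare H_1 against H_2 (list indices 0 and 1).
bayes_factor = likelihoods[0] / likelihoods[1]
prior_odds = priors[0] / priors[1]
posterior_odds = post[0] / post[1]

print("posterior:", post)
print("posterior odds:", posterior_odds)
print("Bayes factor x prior odds:", bayes_factor * prior_odds)
```

The last two printed values agree up to floating-point error, and $\pr\l(E\r)$ cancels out of the ratio.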


# 2.4 Random variables

A random variable is an unknown numerical quantity about which we make probability statements.


## 2.4.3 Descriptions of distributions


### Mean, mode, median

The mean is a good quantity to look at because

1. The mean is a scaled value of the total of $\l\{Y_1, \cdots, Y_n\r\}$, and the total
is often a quantity of interest.
2. If you were forced to guess the value of $Y$, guessing the mean would minimize your expected error
if the error is measured as $\l(Y - y_{guess}\r)^2$.
3. In some simple models, the mean contains all of the information about the population that 
can be obtained from the data.

### Variance
 * The variance is the average squared distance that a sample value $Y$ will be from 
the population mean $E\l[Y\r]$. 
 * The standard deviation is the square root of the variance
and is on the same scale as the mean.
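
The squared-error property of the mean and the definition of the variance can be checked together on a small example. The sketch below assumes a made-up discrete distribution (`values`, `probs`) and a coarse grid of candidate guesses; none of these come from the text.

```python
import math

# Hypothetical discrete distribution for Y, for illustration only.
values = [0, 1, 2, 5]
probs = [0.1, 0.4, 0.3, 0.2]

mean = sum(v * p for v, p in zip(values, probs))


def expected_sq_error(guess):
    """E[(Y - guess)^2] under the distribution above."""
    return sum(p * (v - guess) ** 2 for v, p in zip(values, probs))


# Search a grid of guesses between 0 and 5 in steps of 0.01.
candidates = [i / 100 for i in range(0, 501)]
best_guess = min(candidates, key=expected_sq_error)

# The expected squared error at the mean is the variance.
variance = expected_sq_error(mean)
std_dev = math.sqrt(variance)

print("mean:", mean, "best guess on grid:", best_guess)
print("variance:", variance, "standard deviation:", std_dev)
```

The best guess on the grid coincides with the mean, and the smallest achievable expected squared error is the variance.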

# 2.5 Joint distributions

### Discrete distributions

The **joint pdf** or **joint density** of discrete random variables $Y_1$ and $Y_2$ is defined as 
$$p\l(y_1, y_2\r) = \pr\l(\l\{Y_1 = y_1\r\} \cap \l\{Y_2 = y_2\r\}\r)$$
for $y_1 \in \mc{Y}_1$, $y_2 \in \mc{Y}_2$.

The **marginal density** of $Y_1$ can be computed from the joint density of $Y_1$ and $Y_2$:
\begin{align}
p\l(y_1\r) &= \sum_{y_2 \in \mc{Y}_2}p\l(y_1, y_2\r)
\end{align}

The **conditional density** of $Y_2$ given $\l\{Y_1 = y_1\r\}$ can be computed from the 
joint density and the marginal density:
\begin{align}
p\l(y_2 \given y_1\r) &= \frac{p\l(y_1, y_2\r)}{p\l(y_1\r)}
\end{align}
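
A small worked example may help here. The sketch below stores a made-up joint density for two binary variables in a dictionary and computes the marginal and conditional densities from it; the table entries and function names are assumptions for illustration.

```python
# Hypothetical joint density p(y1, y2) for two binary variables.
joint = {
    (0, 0): 0.10, (0, 1): 0.20,
    (1, 0): 0.30, (1, 1): 0.40,
}


def marginal_y1(y1):
    """p(y1): sum the joint density over all values of y2."""
    return sum(p for (a, _), p in joint.items() if a == y1)


def conditional_y2_given_y1(y2, y1):
    """p(y2 | y1) = p(y1, y2) / p(y1)."""
    return joint[(y1, y2)] / marginal_y1(y1)


print("p(y1=0):", marginal_y1(0))
print("p(y2=1 | y1=0):", conditional_y2_given_y1(1, 0))
```

For each fixed $y_1$, the conditional density sums to one over $y_2$, which is a useful sanity check on any table like this.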

### Continuous joint distributions

If $Y_1$ and $Y_2$ are continuous, the marginal density of $Y_1$ is given by
\begin{align}
p\l(y_1\r) = \int_{-\infty}^\infty p\l(y_1,y_2\r)dy_2
\end{align}
and the conditional density is given by
\begin{align}
p\l(y_2 \given y_1\r) = p\l(y_1, y_2\r) / p\l(y_1\r)
\end{align}

### Mixed continuous and discrete variables

Let $Y_1$ be discrete and $Y_2$ be continuous. Suppose we know $p\left(y_1\right)$ and
$p\left(y_2 \given y_1\right)$.

The joint density of $Y_1$ and $Y_2$ is then 
$$p\l(y_1, y_2\r) = p\l(y_1\r) \times p\l(y_2 \given y_1\r)$$

and has the property that
$$\pr\l(Y_1 \in A, Y_2 \in B\r) = \int_{y_2 \in B} \left\{\sum_{y_1 \in A} p_{Y_1Y_2}\l(y_1, y_2\r)\r\}dy_2$$

In other words, we can combine summation (over the discrete variable) and integration (over the continuous variable) to calculate joint probabilities.
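
To illustrate summing over the discrete variable and integrating over the continuous one, here is a minimal sketch. It assumes a binary $Y_1$ and a unit-variance normal for $Y_2$ given $Y_1$, with made-up probabilities and means; the integral is approximated with a simple midpoint rule and compared against the normal CDF.

```python
import math

# Assumed model, for illustration only:
# Y1 is binary, and Y2 | Y1 = y1 is Normal(means[y1], 1).
p_y1 = {0: 0.4, 1: 0.6}
means = {0: -1.0, 1: 2.0}


def normal_pdf(x, mu):
    return math.exp(-0.5 * (x - mu) ** 2) / math.sqrt(2 * math.pi)


def normal_cdf(x, mu):
    return 0.5 * (1 + math.erf((x - mu) / math.sqrt(2)))


def joint_density(y1, y2):
    """p(y1, y2) = p(y1) * p(y2 | y1)."""
    return p_y1[y1] * normal_pdf(y2, means[y1])


def prob(A, lo, hi, n=2000):
    """Approximate Pr(Y1 in A, lo < Y2 < hi) with a midpoint rule."""
    width = (hi - lo) / n
    total = 0.0
    for i in range(n):
        y2 = lo + (i + 0.5) * width
        total += sum(joint_density(y1, y2) for y1 in A) * width
    return total


A = (0, 1)
approx = prob(A, 0.0, 1.0)
exact = sum(
    p_y1[y1] * (normal_cdf(1.0, means[y1]) - normal_cdf(0.0, means[y1]))
    for y1 in A
)
print("midpoint rule:", approx)
print("via normal CDF:", exact)
```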

### Bayes' rule and parameter estimation

Let $\theta$ be a continuous parameter we want to estimate and let $Y$ be a discrete 
data measurement. Having observed $\l\{Y = y\r\}$, we need to compute our updated beliefs about $\theta$:
$$p\l(\theta | y\r) = \frac{p\l(\theta, y\r)}{p\lp y\rp} = \frac{p\lp \theta \rp p\lp y \given \theta \rp}{p\lp y \rp}$$

**This conditional density is called the posterior density of $\theta$.** If $\theta_a$ and $\theta_b$
are two estimates of $\theta$, the posterior probability (density) of $\theta_a$ relative to $\theta_b$,
conditional on $Y=y$, is 
\begin{align}
\frac{p\lp \theta_a \given y \rp}{p \lp \theta_b \given y \rp} &= 
\frac{p\lp \theta_a \rp p\lp y | \theta_a \rp}{p\lp \theta_b \rp p\lp y | \theta_b \rp}
\end{align}

To evaluate the **relative** posterior odds of $\theta_a$ and $\theta_b$, we do not need to evaluate
the marginal density $p\lp y \rp$. We see that
$$p\lp \theta | y \rp \propto p\lp \theta \rp p\lp y \given \theta \rp$$

The constant of proportionality is $1 / p\lp y \rp$ which we *could* calculate using
$$p\lp y \rp = \int_\Theta p\lp y, \theta \rp d\theta = \int_\Theta p\lp y \given \theta \rp p\lp \theta \rp d\theta$$

Later we will see that the numerator is more important than this denominator.
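
The proportionality statement can be made concrete with a grid approximation. The sketch below assumes a binomial sampling model with a uniform prior on $\theta$ and made-up data ($y = 3$ successes in $n = 10$ trials); this model is an illustrative choice, not something specified above. Under these assumptions the exact answers are known ($p\lp y \rp = 1/(n+1)$ and posterior mean $(y+1)/(n+2)$), which gives something to compare against.

```python
import math


def binomial_pmf(y, n, theta):
    """p(y | theta) for y successes in n trials."""
    return math.comb(n, y) * theta ** y * (1 - theta) ** (n - y)


# Hypothetical data and a uniform prior on (0, 1).
n, y = 10, 3
grid_size = 1000
width = 1.0 / grid_size
thetas = [(i + 0.5) * width for i in range(grid_size)]
prior = [1.0 for _ in thetas]

# Numerator of Bayes' rule: p(theta) * p(y | theta).
unnormalized = [p * binomial_pmf(y, n, t) for p, t in zip(prior, thetas)]

# Denominator: p(y), approximated by a midpoint-rule integral.
marginal_y = sum(u * width for u in unnormalized)

posterior = [u / marginal_y for u in unnormalized]
posterior_mean = sum(t * d * width for t, d in zip(thetas, posterior))

# Relative posterior density of two values does not depend on p(y).
a, b = 300, 600  # grid indices for theta_a and theta_b
ratio_posterior = posterior[a] / posterior[b]
ratio_unnormalized = unnormalized[a] / unnormalized[b]

print("p(y) on grid:", marginal_y, "exact:", 1 / (n + 1))
print("posterior mean on grid:", posterior_mean, "exact:", (y + 1) / (n + 2))
print("ratios:", ratio_posterior, ratio_unnormalized)
```

The two ratios agree because $p\lp y \rp$ cancels, which is why the unnormalized numerator is enough for comparing values of $\theta$.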


# 2.7 Exchangeability

The observations in a data set $Y_1, \cdots, Y_N$ are exchangeable if the subscript labels convey no information about the outcomes. In other words, you could repeatedly permute the data and calculate the joint density and it wouldn't change.

Theorem: If $Y_1, \cdots, Y_N$ are conditionally i.i.d. given $\theta$ and $\theta \sim p\lp \theta \rp$,
then $Y_1, \cdots, Y_N$ are exchangeable.
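
The theorem can be checked by brute force on a small case. The sketch below assumes binary outcomes and, to keep the sum exact, a two-point prior on $\theta$; both are illustrative assumptions. It computes the joint probability of every permutation of one sequence and also shows that the outcomes are not marginally independent.

```python
from itertools import permutations

# Hypothetical two-point prior on theta: Pr(theta = 0.2) = Pr(theta = 0.7) = 0.5.
theta_prior = {0.2: 0.5, 0.7: 0.5}


def joint_prob(ys):
    """p(y_1, ..., y_n) = sum over theta of p(theta) * prod_i p(y_i | theta)."""
    total = 0.0
    for theta, weight in theta_prior.items():
        likelihood = 1.0
        for y in ys:
            likelihood *= theta if y == 1 else 1 - theta
        total += weight * likelihood
    return total


ys = (1, 0, 0, 1)
probs = [joint_prob(perm) for perm in permutations(ys)]
spread = max(probs) - min(probs)
print("largest difference across permutations:", spread)

# Exchangeable does not mean independent once theta is averaged out.
p_one = joint_prob((1,))
p_one_one = joint_prob((1, 1))
print("p(1, 1):", p_one_one, "p(1)^2:", p_one ** 2)
```

The spread is zero up to floating-point rounding, while $p(1, 1)$ differs from $p(1)^2$: the observations are exchangeable but marginally dependent.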

# 2.8 de Finetti's theorem

We know that if $Y_1, \cdots, Y_n \given \theta$ are i.i.d. and $\theta \sim p\lp \theta \rp$, then
$Y_1, \cdots, Y_n$ are exchangeable. de Finetti's theorem says that the converse also holds when
exchangeability holds for all $n$, so the statement becomes an "if and only if":

$$Y_1, \cdots, Y_n \given \theta \textrm{ are i.i.d. and }\theta \sim p\lp \theta \rp \iff
Y_1, \cdots, Y_n \textrm{ are exchangeable for all } n$$

The question is: when are $Y_1, \cdots, Y_n$ exchangeable for all $n$? For this to be true,
we need both exchangeability and repeatability. Exchangeability holds when the labels
carry no information. Repeatability holds in any of the following situations:

* $Y_1, \cdots, Y_n$ are outcomes of a repeatable experiment
* $Y_1, \cdots, Y_n$ are sampled from a finite population *with* replacement
* $Y_1, \cdots, Y_n$ are sampled from an infinite population *without* replacement

If $Y_1, \cdots, Y_n$ are exchangeable and sampled *without* replacement from a finite population
of size $N \gg n$, then they can be modeled as approximately conditionally i.i.d.
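
To get a feel for how good this approximation is, the sketch below compares the exact probability of one binary sequence under sampling without replacement to the i.i.d. probability under sampling with replacement. The population sizes, the fraction of ones, and the sequence are all made-up values for illustration.

```python
def prob_without_replacement(ys, ones, size):
    """Probability of drawing the sequence ys, in order, without replacement
    from a population of `size` units, `ones` of which equal 1."""
    zeros = size - ones
    prob = 1.0
    for y in ys:
        remaining = ones + zeros
        if y == 1:
            prob *= ones / remaining
            ones -= 1
        else:
            prob *= zeros / remaining
            zeros -= 1
    return prob


def prob_with_replacement(ys, ones, size):
    """Probability of the same sequence under i.i.d. draws with theta = ones / size."""
    theta = ones / size
    prob = 1.0
    for y in ys:
        prob *= theta if y == 1 else 1 - theta
    return prob


def relative_gap(ys, ones, size):
    iid = prob_with_replacement(ys, ones, size)
    return abs(prob_without_replacement(ys, ones, size) - iid) / iid


ys = (1, 0, 1, 1, 0)  # n = 5

gap_large = relative_gap(ys, ones=3000, size=10000)  # N >> n
gap_small = relative_gap(ys, ones=6, size=20)        # N comparable to n

print("relative gap, N = 10000:", gap_large)
print("relative gap, N = 20:", gap_small)
```

With $N = 10000$ the two probabilities nearly coincide, while with $N = 20$ the gap is clearly visible, which matches the $N \gg n$ condition.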