$\newcommand{\pr}{\textrm{Pr}}$
$\newcommand{\l}{\left}$
$\newcommand{\r}{\right}$
$\newcommand\given[1][]{\:#1\vert\:}$
$\newcommand{\var}{\textrm{Var}}$
$\newcommand{\mc}{\mathcal}$
$\newcommand{\lp}{\left(}$
$\newcommand{\rp}{\right)}$
$\newcommand{\lb}{\left\{}$
$\newcommand{\rb}{\right\}}$
$\newcommand{\iid}{\textrm{i.i.d. }}$

# 2.1 Belief functions and probabilities

Probabilities are a way to numerically express rational beliefs.
This first section shows that probabilities satisfy some general
features that we would expect a measure of "belief" to have, so
it seems reasonable to use probabilities to represent our belief
in something.

Probability axioms:

**P1**: $0 = \pr\l(\textrm{not } H \given H\r) \leq \pr\l(F \given H\r) \leq \pr\l(H \given H\r) = 1$

**P2**: $\pr\l(F \cup G \given H\r) = \pr\l(F \given H\r) + \pr\l(G \given H\r)$ if $F \cap G = \varnothing$

**P3**: $\pr\l(F \cap G \given H\r) = \pr\l(G \given H\r)\pr\l(F \given G \cap H\r)$


# 2.2 Events, partitions, and Bayes' rule

**Definition 1 (Partition)**: A collection of sets $\left\{H_1, \cdots, H_K\right\}$ is a partition
of another set $\mathcal{H}$ if 

1. the sets are disjoint: $H_i \cap H_j = \varnothing$ for $i \not= j$
2. the union of the sets is $\mathcal{H}$: $\bigcup_{k=1}^K H_k = \mathcal{H}$

If $\mathcal{H}$ is the set of all possible truths and $\left\{H_1, \cdots, H_K\right\}$ is a partition of 
$\mathcal{H}$, then exactly one of $\left\{H_1, \cdots, H_K\right\}$ contains the truth.

Suppose $\left\{H_1, \cdots, H_K\right\}$ is a partition of $\mathcal{H}$, $\textrm{Pr}\left(\mathcal{H}\right) = 1$,
and $E$ is some specific event. The axioms of probability imply:

**Rule of total probability**: $$\sum_{k=1}^{K} \pr\left(H_k\right) = 1$$

**Rule of marginal probability**: 
\begin{align}
\pr\left(E\right) &= \sum_{k=1}^K \pr\left(E \cap H_k\right) \\
&= \sum_{k=1}^K \pr\left(E \given H_k\right)\pr\left(H_k\right)
\end{align}

**Bayes' rule**:
\begin{align}
\pr\left(H_j \given E\right) &= \frac{\pr\l(E \given H_j\r)\pr\l(H_j\r)}{\pr\l(E\r)} \\
&= \frac{\pr\l(E \given H_j\r)\pr\l(H_j\r)}{\sum_{k=1}^K \pr\l(E \given H_k\r)\pr\l(H_k\r)}
\end{align}

We would say that $H_k$ has been marginalized out in the denominator.

I think it's worth looking at the formula for Bayes' rule here. The left hand side, 
$\pr\left(H_j \given E\right)$, represents the probability that $H_j$ is true given that
event $E$ happened. For us, $E$ is generally data and $H_j$ will be a value of a parameter we are 
interested in estimating. On the right side, we have $\pr\l(E \given H_j\r)$,
which is the probability of observing the event/data under the parameterization $H_j$.
This is multiplied by $\pr\l(H_j\r)$, which is our prior probability for $H_j$.
The denominator is a little more tricky. $\pr\l(E\r)$ doesn't mean much on its own. However,
we can rewrite the denominator using the rule of marginal probability as

$$\pr\l(E\r) = \sum_{k=1}^K\pr\l(E \cap H_k\r) = \sum_{k=1}^K \pr\l(E \given H_k\r)\pr\l(H_k\r)$$

Since we think $E$ depends on the $H_k$, we can treat $\pr\l(E\r)$ as a 
**marginal probability**: it is the sum of the joint probabilities $\pr\l(E \cap H_k\r)$ over the partition (see 2.5 below for more information).
We know $\pr\l(E \given H_k\r)$ and $\pr\l(H_k\r)$ so we can evaluate this expression.

I kind of think of
the numerator as the strength of my belief and the denominator acts to normalize that value
based on how strongly I believe all of the $H_k$.
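To make the normalization concrete, here is a minimal sketch of Bayes' rule over a small partition. The three hypotheses and all of the numbers are made up for illustration; they are not from the text.

```python
# Hypothetical partition {H_1, H_2, H_3} with made-up probabilities.
prior = [0.5, 0.3, 0.2]           # Pr(H_k); must sum to 1
likelihood = [0.10, 0.40, 0.70]   # Pr(E | H_k)

# Rule of marginal probability: Pr(E) = sum_k Pr(E | H_k) Pr(H_k)
pr_E = sum(l * p for l, p in zip(likelihood, prior))

# Bayes' rule: Pr(H_j | E) = Pr(E | H_j) Pr(H_j) / Pr(E)
posterior = [l * p / pr_E for l, p in zip(likelihood, prior)]

print("Pr(E) =", pr_E)
print("posterior =", posterior)
```

The numerators alone do not sum to 1; dividing by $\pr\l(E\r)$ is what turns them into a proper set of posterior probabilities.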

**Bayes factors**: $\l\{H_1, \cdots, H_K\r\}$ often refer to disjoint hypotheses and $E$ refers
to data. To compare hypotheses post-experimentally, we often calculate the ratio:
\begin{align}
\frac{\pr\l(H_i \given E\r)}{\pr\l(H_j \given E\r)} &= 
\frac{\pr\l(E \given H_i\r)}{\pr\l(E \given H_j\r)} \times \frac{\pr\l(H_i\r)}{\pr\l(H_j\r)} \\
&= \textrm{"Bayes factor"}\times \textrm{"prior beliefs"}
\end{align}

This quantity reminds us that Bayes' rule does not determine what our beliefs should be after we 
see data; it only tells us how they should change. 
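A small numerical sketch of the ratio above, with made-up priors and likelihoods. For the direct check I assume the two hypotheses make up the whole partition, although $\pr\l(E\r)$ cancels in the ratio regardless.

```python
# Hypothetical prior probabilities and likelihoods for two hypotheses.
prior_i, prior_j = 0.2, 0.8   # Pr(H_i), Pr(H_j)
lik_i, lik_j = 0.6, 0.1       # Pr(E | H_i), Pr(E | H_j)

bayes_factor = lik_i / lik_j
prior_odds = prior_i / prior_j
posterior_odds = bayes_factor * prior_odds

# Direct computation through Bayes' rule; Pr(E) cancels in the ratio.
pr_E = lik_i * prior_i + lik_j * prior_j
direct_odds = (lik_i * prior_i / pr_E) / (lik_j * prior_j / pr_E)

print(bayes_factor, prior_odds, posterior_odds, direct_odds)
```

Here the data favor $H_i$ by a factor of about 6, but because the prior odds were 1 to 4 against it, the posterior odds end up only about 1.5 to 1 in its favor.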

# 2.3 Independence

**Definition 2 (Independence)**: Two events $F$ and $G$ are conditionally independent given
$H$ if $\pr\l(F \cap G \given H\r) = \pr\l(F \given H\r)\pr\l(G \given H\r)$.

If $F$ and $G$ are conditionally independent given $H$, then once we know $H$, also knowing $G$ tells us nothing more about $F$: $\pr\l(F \given H \cap G\r) = \pr\l(F \given H\r)$.

# 2.4 Random variables

A random variable is an unknown numerical quantity about which we make probability statements.

## 2.4.1 Discrete random variables

Let $Y$ be a random variable and let $\mathcal{Y}$ be the set of all possible values of $Y$.
We say $Y$ is discrete if $\mathcal{Y}$ is countable: $\mathcal{Y} = \l\{y_1, y_2, \cdots\r\}$.

For short, we will write $\pr\l(Y=y\r) = p\l(y\r)$ where $p$ is the probability density function (pdf).
The pdf has the following properties:

1. $0 \leq p\l(y\r) \leq 1$ for all $y \in \mathcal{Y}$
2. $\sum_{y \in \mathcal{Y}} p\left(y\right) = 1$

## 2.4.2 Continuous random variables

If the sample space $\mathcal{Y}$ is roughly equal to $\mathbb{R}$, then we often define
probability distributions for random variables in terms of a cumulative distribution function
(cdf): $F\l(y\r) = \pr\l(Y \leq y\r)$. Note that

* $F\l(\infty\r) = 1$
* $F\l(-\infty\r) = 0$
* $F\l(b\r) \leq F\l(a\r)$ if $b<a$
* $\pr\l(Y > a\r) = 1-F\l(a\r)$
* $\pr\l(a < Y \leq b\r) = F\l(b\r) - F\l(a\r)$

If $Y$ is a continuous random variable, then there exists a pdf $p$ such that
$$F\l(a\r) = \int_{-\infty}^a p\l(y\r)dy$$

The continuous pdf has analogous characteristics to the discrete pdf:

1. $0 \leq p\l(y\r)$ for all $y \in \mathcal{Y}$ (unlike the discrete case, a density value can be larger than 1)
2. $\int_{y \in \mathbb{R}} p\left(y\right)dy = 1$
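The relationship between the pdf and the cdf can be checked numerically. This is only a sketch: I picked the standard normal density as the example, the helper names are my own, and $-10$ and $10$ stand in for $\mp\infty$.

```python
import math

def normal_pdf(y):
    """Standard normal density p(y)."""
    return math.exp(-0.5 * y * y) / math.sqrt(2.0 * math.pi)

def normal_cdf_exact(a):
    """Closed-form standard normal cdf, written with the error function."""
    return 0.5 * (1.0 + math.erf(a / math.sqrt(2.0)))

def integrate(f, lo, hi, n=20000):
    """Trapezoid rule on n equal subintervals of [lo, hi]."""
    h = (hi - lo) / n
    total = 0.5 * (f(lo) + f(hi))
    for i in range(1, n):
        total += f(lo + i * h)
    return total * h

a = 1.0
cdf_numeric = integrate(normal_pdf, -10.0, a)      # approximates F(a)
total_mass = integrate(normal_pdf, -10.0, 10.0)    # approximates the integral over R

print(cdf_numeric, normal_cdf_exact(a), total_mass)
```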

## 2.4.3 Descriptions of distributions


### Mean, mode, median

The **mean** or **expectation** of an unknown quantity $Y$ is given by

* $E\l[Y\r] = \sum_{y\in \mathcal{Y}}yp\l(y\r)$ if $Y$ is discrete
* $E\l[Y\r] = \int_{y\in \mathbb{R}}yp\l(y\r)dy$ if $Y$ is continuous

The mean is the center of mass of the distribution. It is generally not equal to

* the **mode**: the most probable value of $Y$
* the **median**: the value of $Y$ in the middle of the distribution

The mean is a good quantity to look at because

1. The mean of $\l\{Y_1, \cdots, Y_n\r\}$ is a scaled version of the total, and the total
is often a quantity of interest.
2. If you were forced to guess the value of $Y$, guessing the mean would minimize your expected error
if error is measured as $\l(Y - y_{guess}\r)^2$.
3. In some simple models, the mean contains all of the information about the population that 
can be obtained from the data.
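The second point can be checked by brute force. In this sketch the discrete distribution is made up, and the grid of candidate guesses is just a convenient way to search.

```python
# Hypothetical discrete distribution on {0, 1, 2, 3}.
values = [0, 1, 2, 3]
probs = [0.1, 0.2, 0.3, 0.4]

mean = sum(y * p for y, p in zip(values, probs))
mode = values[probs.index(max(probs))]

def expected_sq_error(guess):
    """E[(Y - guess)^2] under the distribution above."""
    return sum(p * (y - guess) ** 2 for y, p in zip(values, probs))

# Search guesses 0.00, 0.01, ..., 3.00 for the smallest expected squared error.
grid = [i / 100 for i in range(301)]
best_guess = min(grid, key=expected_sq_error)

print("mean =", mean, "mode =", mode, "best guess =", best_guess)
```

The best guess on the grid lands on the mean, while the mode of this distribution sits elsewhere, which is the sense in which the mean and mode generally differ.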

### Variance

The variance is a measure of the spread:

\begin{align}
\var\l[Y\r] &= E\l[\l(Y - E\l[Y\r]\r)^2\r] \\
&= E\l[Y^2\r] - E\l[Y\r]^2
\end{align}

The variance is the average squared distance that a sample value $Y$ will be from 
the population mean $E\l[Y\r]$. The standard deviation is the square root of the variance
and is on the same scale as the mean.
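As a quick sanity check that the two variance formulas agree, here is a sketch using a fair six-sided die (my choice of example) with exact rational arithmetic.

```python
from fractions import Fraction

# Fair six-sided die: each face has probability 1/6.
faces = range(1, 7)
p = Fraction(1, 6)

mean = sum(y * p for y in faces)                         # E[Y]
var_definition = sum(p * (y - mean) ** 2 for y in faces) # E[(Y - E[Y])^2]
var_shortcut = sum(p * y * y for y in faces) - mean ** 2 # E[Y^2] - E[Y]^2

std_dev = float(var_definition) ** 0.5

print(mean, var_definition, var_shortcut, std_dev)
```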

For a continuous, increasing cdf $F$, the $\alpha$ **quantile** is the $y_\alpha$ such that
$F\left(y_\alpha\right) \equiv \pr\l(Y \leq y_\alpha\r) = \alpha$. The **interquartile** range
is the interval $\l(y_{0.25}, y_{0.75}\r)$ which contains 50% of the mass of the distribution.

# 2.5 Joint distributions

### Discrete distributions

Let 

* $\mathcal{Y}_1$, $\mathcal{Y}_2$ be two countable sample spaces
* $Y_1$, $Y_2$ be two random variables, taking values in $\mathcal{Y}_1$, $\mathcal{Y}_2$ respectively

The **joint pdf** or **joint density** of $Y_1$ and $Y_2$ is defined as 
$$p_{Y_1 Y_2}\l(y_1, y_2\r) = \pr\l(\l\{Y_1 = y_1\r\} \cap \l\{Y_2 = y_2\r\} \r)$$
for $y_1 \in \mc{Y}_1$, $y_2 \in \mc{Y}_2$.

The **marginal density** can be computed from the joint density:
\begin{align}
p_{Y_1}\l(y_1\r) &\equiv \pr\l(Y_1 = y_1\r) \\
&= \sum_{y_2 \in \mc{Y}_2} \pr\l(\l\{Y_1 = y_1\r\} \cap \l\{Y_2 = y_2\r\} \r) \\
&\equiv \sum_{y_2 \in \mc{Y}_2}p_{Y_1Y_2} \l(y_1, y_2\r)
\end{align}

The **conditional density** of $Y_2$ given $\l\{Y_1 = y_1\r\}$ can be computed from the 
joint density and the marginal density of $Y_1$:
\begin{align}
p_{Y_2 \given Y_1}\l(y_2 \given y_1\r) &= \frac{\pr\l(\l\{Y_1 = y_1\r\} \cap \l\{Y_2 = y_2\r\}\r)}{\pr\l(Y_1 = y_1\r)} \\
&= \frac{p_{Y_1Y_2}\l(y_1, y_2\r)}{p_{Y_1}\l(y_1\r)} \\
&= \frac{p_{Y_1Y_2}\l(y_1, y_2\r)}{\sum_{y_2 \in \mc{Y}_2}p_{Y_1Y_2} \l(y_1, y_2\r)}
\end{align}

We often drop the subscripts on the pdf's such that $p_{Y_1}\l(y_1\r)$ becomes $p\l(y_1\r)$ etc.
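The marginal and conditional formulas are easy to try on a small table. The joint pmf below is made up for illustration, and the variable names are my own.

```python
# Hypothetical joint pmf p(y1, y2) with Y1 in {0, 1} and Y2 in {0, 1, 2}.
joint = {
    (0, 0): 0.10, (0, 1): 0.20, (0, 2): 0.10,
    (1, 0): 0.30, (1, 1): 0.10, (1, 2): 0.20,
}
y1_values = sorted({y1 for y1, _ in joint})
y2_values = sorted({y2 for _, y2 in joint})

# Marginal density of Y1: sum the joint density over y2.
marginal_y1 = {
    y1: sum(joint[(y1, y2)] for y2 in y2_values) for y1 in y1_values
}

# Conditional density of Y2 given Y1 = y1: joint divided by marginal.
conditional_y2_given_y1 = {
    y1: {y2: joint[(y1, y2)] / marginal_y1[y1] for y2 in y2_values}
    for y1 in y1_values
}

print(marginal_y1)
print(conditional_y2_given_y1)
```

Each conditional distribution sums to 1 over $y_2$, because dividing by the marginal is exactly the normalization that makes it a valid density.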

### Continuous joint distributions

If $Y_1$ and $Y_2$ are continuous with joint cdf $F_{Y_1Y_2}\l(a, b\r) \equiv 
\pr\l(\l\{Y_1 \leq a\r\} \cap \l\{Y_2 \leq b\r\}\r)$, then there is a function $p_{Y_1Y_2}$ such that
$$F_{Y_1Y_2}\l(a,b\r) = \int_{-\infty}^a \int_{-\infty}^b p_{Y_1Y_2}\l(y_1, y_2\r)dy_2dy_1$$

The function $p_{Y_1Y_2}$ is the joint density of $Y_1$ and $Y_2$. As in the discrete case, 
we have

* $p_{Y_1}\l(y_1\r) = \int_{-\infty}^\infty p_{Y_1Y_2}\l(y_1,y_2\r)dy_2$
* $p_{Y_2 \given Y_1}\l(y_2 \given y_1\r) = p_{Y_1Y_2}\l(y_1, y_2\r) / p_{Y_1}\l(y_1\r)$

### Mixed continuous and discrete variables

Let $Y_1$ be discrete and $Y_2$ be continuous. Suppose we have

* a marginal density $p_{Y_1}$ from our beliefs $\pr\l(Y_1=y_1\r)$
* a conditional density $p_{Y_2 \given Y_1}\left(y_2\given y_1\r)$ from
$\pr\l(Y_2 \leq y_2 \given Y_1 = y_1\r) \equiv F_{Y_2 \given Y_1}\l(y_2 \given y_1\r)$

The joint density of $Y_1$ and $Y_2$ is then 
$$p_{Y_1Y_2}\l(y_1, y_2\r) = p_{Y_1}\l(y_1\r) \times p_{Y_2 \given Y_1}\l(y_2 \given y_1\r)$$

and has the property that
$$\pr\l(Y_1 \in A, Y_2 \in B\r) = \int_{y_2 \in B} \left\{\sum_{y_1 \in A} p_{Y_1Y_2}\l(y_1, y_2\r)\r\}dy_2$$

In other words, we can compute joint probabilities from the joint density by summing over the discrete variable and integrating over the continuous one.

### Bayes' rule and parameter estimation

Let $\theta$ be a continuous parameter we want to estimate and let $Y$ be a discrete 
data measurement. Bayesian estimation of $\theta$ derives from the calculation of 
$p\l(\theta \given y\r)$, where $y$ is the observed value of $Y$. This calculation 
first requires the joint density of $\theta$ and $Y$. We can construct the joint density from

* $p\l(\theta\r)$, beliefs about $\theta$
* $p\l(y \given \theta\r)$, beliefs about $Y$ for each value of $\theta$

Having observed $\l\{Y = y\r\}$, we need to compute our updated beliefs about $\theta$:
$$p\l(\theta \given y\r) = p\l(\theta, y\r) / p\lp y\rp = p\lp \theta \rp p\lp y \given \theta \rp / p\lp y \rp$$

**This conditional density is called the posterior density of $\theta$.** If $\theta_a$ and $\theta_b$
are two estimates of $\theta$, the posterior probability (density) of $\theta_a$ relative to $\theta_b$,
conditional on $Y=y$, is 
\begin{align}
\frac{p\lp \theta_a \given y \rp}{p \lp \theta_b \given y \rp} &= 
\frac{p\lp \theta_a \rp p\lp y | \theta_a \rp / p\lp y \rp}{p\lp \theta_b \rp p\lp y | \theta_b \rp / p\lp y \rp} \\
&= \frac{p\lp \theta_a \rp p\lp y | \theta_a \rp}{p\lp \theta_b \rp p\lp y | \theta_b \rp}
\end{align}

To evaluate the **relative** posterior odds of $\theta_a$ and $\theta_b$, we do not need to evaluate
the marginal density $p\lp y \rp$. We see that
$$p\lp \theta | y \rp \propto p\lp \theta \rp p\lp y \given \theta \rp$$

The constant of proportionality is $1 / p\lp y \rp$ which we *could* calculate using
$$p\lp y \rp = \int_\Theta p\lp y, \theta \rp d\theta = \int_\Theta p\lp y \given \theta \rp p\lp \theta \rp d\theta$$

Later we will see that the numerator is more important than this denominator.
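Here is a rough grid sketch of this calculation. Everything specific in it is my own assumption for illustration: a binomial sampling model with $n = 10$ trials and $y = 3$ successes, a uniform prior on $\theta \in (0, 1)$, and a midpoint grid of 1000 points to approximate the integral.

```python
import math

n, y = 10, 3  # hypothetical data: 3 successes in 10 trials

# Midpoint grid over (0, 1) used to approximate integrals over theta.
num_points = 1000
width = 1.0 / num_points
grid = [(i + 0.5) * width for i in range(num_points)]

def prior(theta):
    """Uniform prior density on (0, 1) (an assumption for this sketch)."""
    return 1.0

def likelihood(theta):
    """Binomial sampling model p(y | theta)."""
    return math.comb(n, y) * theta ** y * (1.0 - theta) ** (n - y)

# Numerator of Bayes' rule, evaluated on the grid.
unnormalized = [prior(t) * likelihood(t) for t in grid]

# p(y) = integral of p(y | theta) p(theta) d theta, approximated by a sum.
p_y = sum(unnormalized) * width

# Posterior density values and the posterior mean.
posterior = [u / p_y for u in unnormalized]
posterior_mean = sum(t * d for t, d in zip(grid, posterior)) * width

print(p_y, posterior_mean)
```

The shape of the posterior is already fixed by the unnormalized numerator; dividing by $p\lp y \rp$ only rescales it so that it integrates to 1.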

# 2.6 Independent random variables

Suppose $Y_1, \cdots, Y_n$ are random variables and that $\theta$ is a parameter describing
the conditions under which the random variables are generated. We say that $Y_1, \cdots, Y_n$
are **conditionally independent** given $\theta$ if for every collection of $n$ sets $\lb A_1, \cdots, A_n \rb$ 
we have 
$$\pr\lp Y_1 \in A_1, \cdots, Y_n \in A_n \given \theta \rp = 
\pr\lp Y_1 \in A_1 \given \theta\rp \times \cdots \times \pr \lp Y_n \in A_n \given \theta \rp$$

This tells us that knowing $Y_j$ gives us no further information about $Y_i$ beyond what $\theta$
gives us. Under independence, the joint density of the $Y_i$ conditioned on $\theta$ is 
$$p\lp y_1, y_2, \cdots, y_n \given \theta \rp = \prod_{i=1}^n p_{Y_i} \lp y_i \given \theta \rp$$

If, in addition, the $Y_i$ are all generated from a common process, so that every marginal density
$p_{Y_i}$ equals the same density $p$, the product becomes $\prod_{i=1}^n p\lp y_i \given \theta \rp$.
In this case, we say that $Y_1, \cdots, Y_n$ are **conditionally independent and identically
distributed (i.i.d.)**: $Y_1, \cdots, Y_n \sim \iid p\lp y \given \theta \rp$.

# 2.7 Exchangeability

**Definition 3 (Exchangeable)**: Let $p\lp y_1, \cdots, y_n\rp$ be the joint density of $Y_1, \cdots, Y_n$.
If $p\lp y_1, \cdots, y_n\rp = p\lp y_{\pi_1}, \cdots, y_{\pi_n}\rp$ for all permutations $\pi$ of 
$\lb 1, \cdots, n \rb$, then $Y_1, \cdots, Y_n$ are exchangeable. This basically means that 
$Y_1, \cdots, Y_n$ are exchangeable if the subscript labels convey no information about the outcomes.

**Theorem**: If $\theta \sim p\lp \theta \rp$ and $Y_1, \cdots, Y_n$ are conditionally i.i.d. given $\theta$,
then marginally (unconditionally on $\theta$), $Y_1, \cdots, Y_n$ are exchangeable.
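A small sketch of this result for binary outcomes. I assume a uniform prior on $\theta$ and conditionally i.i.d. Bernoulli($\theta$) outcomes, in which case integrating $\theta^s \lp 1 - \theta \rp^{n-s}$ over $(0, 1)$ gives the marginal probability $s!\,(n-s)!/(n+1)!$ for a sequence with $s$ ones. The function name is my own.

```python
from fractions import Fraction
from itertools import permutations
from math import factorial

def marginal_prob(seq):
    """Marginal probability of a binary sequence when theta ~ uniform(0, 1)
    and the outcomes are conditionally i.i.d. Bernoulli(theta):
    integral of theta^s (1 - theta)^(n - s) = s! (n - s)! / (n + 1)!"""
    n, s = len(seq), sum(seq)
    return Fraction(factorial(s) * factorial(n - s), factorial(n + 1))

# Exchangeable: every reordering of the sequence has the same probability.
seq = (1, 0, 0, 1, 1)
distinct_probs = {marginal_prob(perm) for perm in permutations(seq)}

# But not marginally independent: p(1, 1) differs from p(1) * p(1).
p_11 = marginal_prob((1, 1))
p_1 = marginal_prob((1,))

print(distinct_probs, p_11, p_1 * p_1)
```

The marginal probability depends on the sequence only through the number of ones, so the labels carry no information, yet seeing one success still makes the next one more likely.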

# 2.8 de Finetti's theorem

We know that if $Y_1, \cdots, Y_n \given \theta$ are i.i.d. and $\theta \sim p\lp \theta \rp$, then
$Y_1, \cdots, Y_n$ are exchangeable. Can we say something about the other direction?

**Theorem 1 (de Finetti)**: Let $\lb Y_1, Y_2, \cdots \rb$ be a potentially
infinite sequence of random variables all having a common sample space $\mc{Y}$, 
so that $Y_i \in \mc{Y}$ for all $i \in \lb 1, 2, \cdots \rb$. Suppose that for 
any $n$ our belief model for $Y_1, \cdots, Y_n$ is exchangeable:
$$p \lp y_1, \cdots, y_n \rp = p\lp y_{\pi_1}, \cdots, y_{\pi_n} \rp$$
for all permutations $\pi$ of $\lb 1, \cdots, n \rb$. Then our model can be written as 
$$p\lp y_1, \cdots, y_n \rp = \int \lb \prod_{i=1}^n p \lp y_i \given \theta \rp \rb p \lp \theta \rp d \theta$$
for some parameter $\theta$, some prior distribution on $\theta$ ($p \lp \theta \rp$), and some sampling
model $p\lp y \given \theta \rp$. The prior and sampling model depend on the form of the belief
model $p\lp y_1, \cdots, y_n \rp$.

The probability distribution $p\lp \theta \rp$ represents our beliefs about the outcomes
of $\lb Y_1, Y_2, \cdots \rb$, induced by our belief model $p\lp y_1, y_2, \cdots \rp$.
More precisely,

* $p\lp \theta \rp$ represents our beliefs about $\lim_{n\rightarrow \infty} \sum_{i=1}^n Y_i / n$ in the binary case
* $p\lp \theta \rp$ represents our beliefs about $\lim_{n\rightarrow \infty} \sum_{i=1}^n \mathbf{1}\lp Y_i \leq c \rp / n$ 
for each $c$ in the general case, where $\mathbf{1}\lp \cdot \rp$ is 1 when its argument is true and 0 otherwise

We can summarize as
$$Y_1, \cdots, Y_n \given \theta \textrm{ are i.i.d. and }\theta \sim p\lp \theta \rp \iff
Y_1, \cdots, Y_n \textrm{ are exchangeable for all } n$$

The question is when are $Y_1, \cdots, Y_n$ exchangeable for all $n$? For this to be true,
we need both exchangeability and repeatability. Exchangeability is true when the labels
have no meaning. Repeatability is true when

* $Y_1, \cdots, Y_n$ are outcomes of a repeatable experiment
* $Y_1, \cdots, Y_n$ are sampled from a finite population *with* replacement
* $Y_1, \cdots, Y_n$ are sampled from an infinite population *without* replacement

If $Y_1, \cdots, Y_n$ are exchangeable and sampled *without* replacement from a finite population
of size $N \gg n$, then they can be modeled as approximately conditionally i.i.d.
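One way to see the approximation is to compare the number of ones in a sample drawn without replacement (hypergeometric) with what an i.i.d. model would give (binomial). This is only a sketch for binary outcomes; the population sizes, the fraction of ones, and the function names are all made up for illustration.

```python
from math import comb

def hypergeom_pmf(y, n, K, N):
    """Pr(y ones in n draws without replacement) from a population of
    size N that contains K ones."""
    return comb(K, y) * comb(N - K, n - y) / comb(N, n)

def binom_pmf(y, n, theta):
    """Pr(y ones in n i.i.d. draws), each a one with probability theta."""
    return comb(n, y) * theta ** y * (1.0 - theta) ** (n - y)

def max_gap(n, K, N):
    """Largest pointwise difference between the two pmfs over y = 0..n."""
    theta = K / N
    return max(
        abs(hypergeom_pmf(y, n, K, N) - binom_pmf(y, n, theta))
        for y in range(n + 1)
    )

# Same fraction of ones (30%) and same sample size, different population sizes.
gap_small_pop = max_gap(n=10, K=6, N=20)
gap_large_pop = max_gap(n=10, K=30000, N=100000)

print(gap_small_pop, gap_large_pop)
```

When the population is only twice the sample size the two models disagree noticeably, but when $N$ is much larger than $n$ the gap becomes negligible.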