## Probability Theory

Probability theory provides a consistent framework for the quantification and manipulation of uncertainty.

Consider two random variables $X$ and $Y$. Suppose that $X$ can take any of the values $x_i$ where $i=1,...,M$ and $Y$ can take the values $y_j$ where $j=1,...,L$. Consider a total of $N$ trials in which we sample both $X$ and $Y$, and let the number of such trials in which $X=x_i$ and $Y=y_j$ be $n_{ij}$. Also let the number of trials in which $X$ takes the value $x_i$ (irrespective of the value $Y$ takes) be denoted by $c_i$, and similarly let the number of trials in which $Y$ takes the value $y_j$ be denoted by $r_j$.

The probability that $X$ will take the value $x_i$ and $Y$ will take the value $y_j$ is written $p(X=x_i, Y=y_j)$ and is called the *joint* probability of $X=x_i$ and $Y=y_j$. It is given by the fraction of trials in which both events occur:
$$
p(X=x_i, Y=y_j)=\frac{n_{ij}}{N}
$$
where we implicitly consider the limit $N\rightarrow \infty$.

The probability that $X$ takes the value $x_i$ irrespective of the value of $Y$ is written as $p(X=x_i)$ and is given by
$$
p(X=x_i)=\frac{c_i}{N}
$$
Because $c_i$ counts every trial in which $X=x_i$, whatever value $Y$ takes, we have $c_i=\sum_j n_{ij}$ and therefore
$$
p(X=x_i)=\sum^L_{j=1}p(X=x_i, Y=y_j)
$$
which is the *sum rule* of probability. $p(X=x_i)$ is sometimes called the *marginal* probability, because it is obtained by marginalising, or summing out, the other variables.
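
As a small numerical check of the definitions above, the following sketch builds a table of counts $n_{ij}$ and computes the joint and marginal probabilities as relative frequencies. The counts are made up for illustration (with $M=2$ and $L=3$), the function names `joint` and `marginal_x` are hypothetical, and the values are finite-$N$ estimates rather than the limit $N\rightarrow\infty$.

```python
from fractions import Fraction

# Hypothetical counts n[i][j]: trials with X = x_i and Y = y_j (M = 2, L = 3).
n = [
    [3, 1, 2],
    [4, 6, 4],
]
N = sum(sum(row) for row in n)  # total number of trials


def joint(i, j):
    """Relative frequency n_ij / N, the finite-N estimate of p(X=x_i, Y=y_j)."""
    return Fraction(n[i][j], N)


def marginal_x(i):
    """Relative frequency c_i / N, where c_i = sum_j n_ij."""
    c_i = sum(n[i])
    return Fraction(c_i, N)


# Sum rule: p(X=x_i) equals the joint probability summed over all values of Y.
for i in range(len(n)):
    assert marginal_x(i) == sum(joint(i, j) for j in range(len(n[i])))
```

Exact fractions are used instead of floats so that the sum rule can be checked with equality rather than a tolerance.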

If we consider only the instances for which $X=x_i$, then the fraction of such instances for which $Y=y_j$ is written $p(Y=y_j\vert X=x_i)$ and is called the *conditional* probability of $Y=y_j$ given $X=x_i$. It is given by
$$
p(Y=y_j\vert X=x_i)=\frac{n_{ij}}{c_i}
$$
We can derive the following relationship
$$
\begin{aligned}
p(X=x_i,Y=y_j) &= \frac{n_{ij}}{N} = \frac{n_{ij}}{c_i}\cdot\frac{c_i}{N} \\
&= p(Y=y_j\vert X=x_i)\,p(X=x_i)
\end{aligned}
$$
which is the *product rule* of probability.
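
The product rule can be verified on a concrete table of counts. The sketch below uses made-up counts $n_{ij}$ and a hypothetical helper `conditional_y_given_x`; it computes $n_{ij}/c_i$ and checks that multiplying by $c_i/N$ recovers $n_{ij}/N$ for every pair $(i, j)$.

```python
from fractions import Fraction

# Hypothetical counts n[i][j]: trials with X = x_i and Y = y_j.
n = [
    [3, 1, 2],
    [4, 6, 4],
]
N = sum(sum(row) for row in n)  # total number of trials
c = [sum(row) for row in n]     # c[i]: trials with X = x_i


def conditional_y_given_x(j, i):
    """n_ij / c_i: fraction of the X = x_i trials in which Y = y_j."""
    return Fraction(n[i][j], c[i])


# Product rule: p(X=x_i, Y=y_j) = p(Y=y_j | X=x_i) * p(X=x_i).
for i in range(len(n)):
    for j in range(len(n[i])):
        lhs = Fraction(n[i][j], N)
        rhs = conditional_y_given_x(j, i) * Fraction(c[i], N)
        assert lhs == rhs
```

Note that for a fixed $x_i$ the conditional probabilities sum to one over $j$, since they are fractions of the same $c_i$ trials.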

### The Rules of Probability

**sum rule**:
$$
p(X) = \sum_Y p(X,Y)
$$
**product rule**:
$$
p(X,Y) = p(Y\vert X)p(X)
$$

From the product rule, and the symmetry property $p(X,Y)=p(Y,X)$, we obtain the following relationship between conditional probabilities
$$
p(Y\vert X)=\frac{p(X\vert Y)p(Y)}{p(X)}
$$
which is called *Bayes' theorem*. Using the sum rule, the denominator in Bayes' theorem can be expressed in terms of quantities appearing in the numerator
$$
p(X)=\sum_Y p(X\vert Y)p(Y)
$$
We can view the denominator in Bayes' theorem as the normalization constant that ensures the conditional probability $p(Y\vert X)$ on the left-hand side sums to one over all values of $Y$.
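
To make the role of the denominator concrete, here is a minimal sketch of Bayes' theorem for a discrete $Y$ and a single observed value of $X$. The probabilities are invented for illustration, and the names `p_y`, `p_x_given_y`, and `bayes` are hypothetical; the function computes $p(X)$ via the sum and product rules and then divides by it.

```python
from fractions import Fraction

# Hypothetical inputs for one fixed, observed value of X.
p_y = {"y1": Fraction(1, 5), "y2": Fraction(4, 5)}          # p(Y)
p_x_given_y = {"y1": Fraction(3, 4), "y2": Fraction(1, 4)}  # p(X | Y)


def bayes(p_y, p_x_given_y):
    """Return p(Y | X) for every value of Y using Bayes' theorem."""
    # Denominator: p(X) = sum_Y p(X | Y) p(Y).
    p_x = sum(p_x_given_y[y] * p_y[y] for y in p_y)
    return {y: p_x_given_y[y] * p_y[y] / p_x for y in p_y}


p_y_given_x = bayes(p_y, p_x_given_y)

# Dividing by p(X) is what makes the result sum to one over Y.
assert sum(p_y_given_x.values()) == 1
```

With these numbers $p(X) = 7/20$, so the unnormalized products $3/20$ and $4/20$ become $3/7$ and $4/7$.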