## Probability Theory

Probability theory provides a consistent framework for the quantification and manipulation of uncertainty.

Consider two random variables $X$ and $Y$. Suppose that $X$ can take any of the values $x_i$ where $i=1,...,M$ and $Y$ can take the values $y_j$ where $j=1,...,L$. Consider a total of $N$ trials in which we sample both $X$ and $Y$, and let the number of such trials in which $X=x_i$ and $Y=y_j$ be $n_{ij}$. Also let the number of trials in which $X$ takes the value $x_i$ (irrespective of the value $Y$ takes) be denoted by $c_i$, and similarly let the number of trials in which $Y$ takes the value $y_j$ be denoted by $r_j$.

The probability that $X$ will take the value $x_i$ and $Y$ will take the value $y_j$ is written $p(X=x_i, Y=y_j)$ and is called the *joint* probability of $X=x_i$ and $Y=y_j$. It is the fraction of trials in which both events occur:

$$
p(X=x_i, Y=y_j)=\frac{n_{ij}}{N}
$$
where we implicitly consider the limit $N\rightarrow \infty$.

The probability that $X$ takes the value $x_i$ irrespective of the value of $Y$ is written as $p(X=x_i)$ and is given by

$$
p(X=x_i)=\frac{c_i}{N}
$$
Because $c_i$ counts the trials with $X=x_i$ regardless of the value of $Y$, we have $c_i=\sum_j n_{ij}$ and therefore

$$
p(X=x_i)=\sum^L_{j=1}p(X=x_i, Y=y_j)
$$
which is the *sum rule* of probability. $p(X=x_i)$ is sometimes called the *marginal* probability, because it is obtained by marginalising, or summing out, the other variables.

If we consider only the instances for which $X=x_i$, then the fraction of such instances for which $Y=y_j$ is written $p(Y=y_j\vert X=x_i)$ and is called the *conditional* probability of $Y=y_j$ given $X=x_i$. It is given by
$$
p(Y=y_j\vert X=x_i)=\frac{n_{ij}}{c_i}
$$
We can derive the following relationship
$$
\begin{aligned}
p(X=x_i,Y=y_j) & = \frac{n_{ij}}{N} = \frac{n_{ij}}{c_i}\cdot\frac{c_i}{N} \\
& = p(Y=y_j\vert X=x_i)\,p(X=x_i)
\end{aligned}
$$
which is the *product rule* of probability.
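
The counting definitions above can be checked on a small table. The following is a minimal sketch in plain Python; the counts in `n` are made up for illustration (with $M=2$ and $L=3$), and all variable names are assumptions rather than notation from the text.

```python
# Hypothetical counts n[i][j]: number of trials with X = x_i and Y = y_j.
n = [[3, 1, 6],
     [2, 4, 4]]
N = sum(sum(row) for row in n)  # total number of trials

# Joint probability: p(X=x_i, Y=y_j) = n_ij / N
joint = [[n_ij / N for n_ij in row] for row in n]

# c_i = sum_j n_ij, and the sum rule gives p(X=x_i) = sum_j p(X=x_i, Y=y_j)
c = [sum(row) for row in n]
p_x = [sum(row) for row in joint]

# Conditional probability: p(Y=y_j | X=x_i) = n_ij / c_i
cond_y_given_x = [[n_ij / c_i for n_ij in row] for row, c_i in zip(n, c)]

# Product rule: p(X=x_i, Y=y_j) = p(Y=y_j | X=x_i) p(X=x_i) for every cell
product_rule_holds = all(
    abs(joint[i][j] - cond_y_given_x[i][j] * p_x[i]) < 1e-12
    for i in range(len(n))
    for j in range(len(n[0]))
)
print(p_x, product_rule_holds)
```

With finite counts these ratios are only estimates of the probabilities; the identities themselves hold exactly for any table because they are algebraic consequences of the definitions.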

### The Rules of Probability

**sum rule**:
$$
p(X) = \sum_Y p(X,Y)
$$
**product rule**:
$$
p(X,Y) = p(Y\vert X)p(X)
$$

From the product rule, and the symmetry property $p(X,Y)=p(Y,X)$, we obtain the following relationship between conditional probabilities

$$
p(Y\vert X)=\frac{p(X\vert Y)p(Y)}{p(X)}
$$
which is called *Bayes' theorem*. Using the sum rule, the denominator in Bayes' theorem can be expressed in terms of quantities appearing in the numerator

$$
p(X)=\sum_Y p(X\vert Y)p(Y)
$$
We can view the denominator in Bayes' theorem as the normalization constant required to ensure that the conditional probability on the left-hand side, summed over all values of $Y$, equals one.
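
As a small numerical illustration of Bayes' theorem and of the denominator's role as a normalizer, the sketch below uses a two-valued $Y$. The prior and likelihood values are invented for the example, and the labels `"y1"` and `"y2"` are placeholders.

```python
# Hypothetical prior p(Y) and likelihood p(X = x | Y) for one fixed observed x.
prior = {"y1": 0.4, "y2": 0.6}
likelihood = {"y1": 0.25, "y2": 0.75}

# Denominator from the sum and product rules: p(X) = sum_Y p(X | Y) p(Y)
evidence = sum(likelihood[y] * prior[y] for y in prior)

# Bayes' theorem: p(Y | X) = p(X | Y) p(Y) / p(X)
posterior = {y: likelihood[y] * prior[y] / evidence for y in prior}

print(evidence, posterior, sum(posterior.values()))
```

Dividing by `evidence` is exactly what makes the posterior values sum to one over $Y$.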

### Probability Densities

We also want to consider probabilities with respect to continuous variables. If the probability of a real-valued variable $x$ falling in the interval $(x, x+\delta x)$ is given by $p(x)\delta x$ for $\delta x \rightarrow 0$, then $p(x)$ is called the *probability density* over $x$. The probability that $x$ will lie in an interval $(a,b)$ is then given by

$$
p(x\in (a,b)) = \int_a^b p(x)~\mathrm{d}x
$$
Because probabilities are nonnegative, and because the value of $x$ must lie somewhere on the real axis, the probability density $p(x)$ must satisfy two conditions

$$
p(x) \geqslant 0
$$

$$
\int_{-\infty}^\infty p(x)~\mathrm{d}x = 1
$$
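
These two conditions, and the interval probability, can be checked numerically for a concrete density. The sketch below assumes a standard Gaussian as the example density (the text does not fix one) and uses a simple midpoint rule; the finite range $(-10, 10)$ stands in for the whole real axis.

```python
import math

def p(x):
    """Standard Gaussian density, chosen only as an example."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def integrate(f, a, b, steps=20_000):
    """Midpoint-rule approximation of the integral of f over (a, b)."""
    h = (b - a) / steps
    return h * sum(f(a + (k + 0.5) * h) for k in range(steps))

# Nonnegativity, checked at a few sample points.
nonnegative = all(p(x) >= 0.0 for x in (-3.0, -1.0, 0.0, 2.5))

# Normalization: the tails beyond |x| = 10 are negligible for this density.
total = integrate(p, -10.0, 10.0)

# Probability of x lying in the interval (a, b) = (-1, 1).
prob_ab = integrate(p, -1.0, 1.0)

print(nonnegative, total, prob_ab)
```

For this particular density the interval probability has the closed form $\mathrm{erf}(1/\sqrt{2})$, which gives a convenient cross-check on the numerical integral.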

Under a nonlinear change of variable, a probability density transforms differently from a simple function, due to the Jacobian factor. Under a change of variables $x=g(y)$, a function $f(x)$ becomes $\tilde{f}(y) = f(g(y))$. Now consider a probability density $p_x(x)$ that corresponds to a density $p_y(y)$ with respect to the new variable $y$. Observations falling in the range $(x, x+\delta x)$ will, for small values of $\delta x$, be transformed into the range $(y,y+\delta y)$, where $p_x(x)\delta x \simeq p_y(y)\delta y$, and hence

$$
\begin{aligned}
p_y(y) & = p_x(x)\bigg\vert\frac{\mathrm{d}x}{\mathrm{d}y}\bigg\vert\\
& = p_x(g(y))\vert g'(y)\vert
\end{aligned}
$$
One consequence of this property is that the concept of the maximum of a probability density depends on the choice of variable.
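
To make the dependence of the maximum on the choice of variable concrete, here is a minimal sketch under assumed choices: $p_x$ is a Gaussian with mean `MU` and standard deviation `SIGMA`, and the change of variable is $x=g(y)=\ln y$ for $y>0$. Neither choice comes from the text. The maximum of $p_y$ is located by a grid search and compared with the point obtained by naively mapping the maximum of $p_x$ through $g$.

```python
import math

MU, SIGMA = 0.0, 1.0  # assumed parameters of the example density over x

def p_x(x):
    """Gaussian density over x."""
    z = (x - MU) / SIGMA
    return math.exp(-0.5 * z * z) / (SIGMA * math.sqrt(2.0 * math.pi))

def g(y):
    """Change of variable x = g(y) = ln(y), defined for y > 0."""
    return math.log(y)

def g_prime(y):
    return 1.0 / y

def p_y(y):
    """Transformed density: p_y(y) = p_x(g(y)) |g'(y)|."""
    return p_x(g(y)) * abs(g_prime(y))

# Grid search for the maximum of p_y on (0, 3].
ys = [k / 10_000 for k in range(1, 30_001)]
y_mode = max(ys, key=p_y)

# Naive guess: the y whose image g(y) is the maximum of p_x (x = MU).
y_naive = math.exp(MU)

print(y_mode, y_naive)
```

The two values differ: with these settings the grid search lands near $e^{-1}$, while the naive mapping gives $1$. The factor $\vert g'(y)\vert$ is what moves the maximum.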

The probability that $x$ lies in the interval $(-\infty, z)$ is given by the *cumulative distribution function* defined by

$$
P(z) = \int_{-\infty}^z p(x)~\mathrm{d}x
$$
which satisfies $P'(x)=p(x)$.
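
The relation $P'(x)=p(x)$ can be verified numerically with a finite difference. The sketch below assumes the logistic distribution as an example, because its cumulative distribution function and density both have simple closed forms; the evaluation point `z` and step size `h` are arbitrary choices.

```python
import math

def P(z):
    """CDF of the standard logistic distribution (an assumed example)."""
    return 1.0 / (1.0 + math.exp(-z))

def p(x):
    """Matching density: p(x) = P(x) (1 - P(x))."""
    s = P(x)
    return s * (1.0 - s)

def derivative(f, z, h=1e-5):
    """Central finite-difference approximation of f'(z)."""
    return (f(z + h) - f(z - h)) / (2.0 * h)

z = 0.7
slope = derivative(P, z)   # numerical estimate of P'(z)
density = p(z)             # value of the density at the same point

# Probability of an interval as a difference of CDF values.
prob_interval = P(1.0) - P(-1.0)

print(slope, density, prob_interval)
```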

If we have several continuous variables $x_1,...,x_D$, denoted collectively by the vector $\mathbf{x}$, then we can define a joint probability density $p(\mathbf{x})=p(x_1,...,x_D)$ such that the probability of $\mathbf{x}$ falling in an infinitesimal volume $\delta\mathbf{x}$ containing the point $\mathbf{x}$ is given by $p(\mathbf{x})\delta\mathbf{x}$. This multivariate probability density must satisfy

$$
\begin{aligned}
p(\mathbf{x}) & \geqslant 0 \\
\int p(\mathbf{x})~\mathrm{d}\mathbf{x} & = 1
\end{aligned}
$$
in which the integral is taken over the whole of $\mathbf{x}$ space. We can also consider joint probability distributions over a combination of discrete and continuous variables. If $x$ is a discrete variable, then $p(x)$ is called a *probability mass function*.

The sum and product rules of probability, as well as Bayes' theorem, apply equally to the case of probability densities, or to combinations of discrete and continuous variables. If $x$ and $y$ are two real variables, then the sum and product rules take the form

$$
\begin{aligned}
p(x) & = \int_{}^{}p(x,y)~\mathrm{d}y \\
p(x,y) & = p(y\vert x)p(x)
\end{aligned}
$$
and Bayes' theorem is given by

$$
p(y\vert x) = \frac{p(x\vert y)p(y)}{\int p(x,y)~\mathrm{d}y}
$$
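
The continuous forms of the rules can be exercised on a grid. The following sketch assumes a specific model that is not part of the text: a prior $p(y)$ that is a zero-mean, unit-variance Gaussian and a conditional $p(x\vert y)$ that is a unit-variance Gaussian centred on $y$. The observed value `x_obs` and the integration range are likewise illustrative choices.

```python
import math

def gauss(v, mean, var):
    """Univariate Gaussian density with the given mean and variance."""
    return math.exp(-0.5 * (v - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

x_obs = 1.0                          # hypothetical observed value of x
lo, hi, steps = -10.0, 10.0, 20_000  # finite range standing in for the real axis
h = (hi - lo) / steps
ys = [lo + (k + 0.5) * h for k in range(steps)]  # midpoints of the grid cells

# Product rule: p(x, y) = p(x | y) p(y)
joint = [gauss(x_obs, y, 1.0) * gauss(y, 0.0, 1.0) for y in ys]

# Sum rule: p(x) = integral of p(x, y) over y (midpoint rule)
evidence = h * sum(joint)

# Bayes' theorem: p(y | x) = p(x | y) p(y) / p(x)
posterior = [j / evidence for j in joint]

post_mass = h * sum(posterior)                             # should be close to 1
post_mean = h * sum(y * q for y, q in zip(ys, posterior))  # mean of p(y | x)

print(evidence, post_mass, post_mean)
```

For this Gaussian pair the exact answers are known in closed form: $p(x)$ is a zero-mean Gaussian with variance $2$, and $p(y\vert x)$ has mean $x/2$, so the printed values can be compared against those.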