## Random Variables

To link events such as $E$, and the *sample space* $\Omega$ that contains them, to data, we must introduce the concept of random variables. The following definition is taken from Larry Wasserman's *All of Statistics*.

**Definition**. A random variable is a mapping

$$ X: \Omega \rightarrow \mathbb{R}$$

that assigns a real number $X(\omega)$ to each outcome $\omega$. $\Omega$ is the sample space. Points
$\omega$ in $\Omega$ are called sample outcomes, realizations, or elements. Subsets of
$\Omega$ are called events. For example, if $\omega = HHTTTTHTT$ and $X$ is defined as the number of heads in the sequence $\omega$, then $X(\omega) = 3$.
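
As a minimal sketch of this definition, a random variable can be treated as an ordinary function from outcomes to numbers. The function name `num_heads` and the encoding of outcomes as strings of `H` and `T` are illustrative assumptions, not part of Wasserman's text.

```python
def num_heads(omega: str) -> int:
    """Random variable X: maps an outcome (a string of 'H'/'T') to a number."""
    return omega.count("H")

omega = "HHTTTTHTT"   # one sample outcome from the sample space
x = num_heads(omega)  # X(omega)
print(x)              # 3
```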

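We will assign a real number $P(A)$ to every event $A$, called the probability of
$A$. We also call $P$ a probability distribution or a probability measure.
To qualify as a probability, $P$ must satisfy three axioms: $P(A) \ge 0$ for every event $A$; $P(\Omega)=1$; and the probabilities of disjoint events add, that is, $P(\bigcup_i A_i) = \sum_i P(A_i)$ whenever $A_1, A_2, \ldots$ are disjoint.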
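
The three axioms can be checked numerically on a small finite sample space. The sketch below assumes two fair coin flips with the uniform measure $P(A) = |A|/|\Omega|$; the names `omega_space` and `prob` are hypothetical and chosen only for illustration.

```python
from fractions import Fraction
from itertools import product

# Assumed sample space: all outcomes of two coin flips
omega_space = {"".join(flips) for flips in product("HT", repeat=2)}

def prob(event):
    """Uniform measure on omega_space: P(A) = |A| / |Omega|."""
    return Fraction(len(event & omega_space), len(omega_space))

A = {"HH"}        # both flips are heads
B = {"HT", "TH"}  # exactly one head; disjoint from A

print(prob(A) >= 0)                      # non-negativity
print(prob(omega_space) == 1)            # normalization
print(prob(A | B) == prob(A) + prob(B))  # additivity for disjoint events
```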

## Marginals and conditionals, and Bayes Theorem

The diagram below, taken from Bishop, may be used to illustrate the concepts of conditionals and marginals. Consider two random variables: $X$, which takes the values $\{x_i\}$ where
$i = 1,...,M$, and $Y$, which takes the values $\{y_j\}$ where $j = 1,...,L$. The number of instances for which $X = x_i$ and $Y = y_j$ is $n_{ij}$. The number of points in column $i$, where $X=x_i$, is $c_i$, and the number in row $j$, where $Y = y_j$, is $r_j$.


![m:bishopprob](./images/bishop-prob.png)

Then the **joint probability** $p(X = x_i, Y= y_j)$ is, in the frequency sense of probability and in the asymptotic limit of large $N$, equal to $n_{ij}/N$, where $N$ is the total number of instances. The $X$ **marginal**, $p(X=x_i) = c_i/N$, can be obtained by summing over all the cells in the $i$th column:

$$p(X=x_i) = \sum_j p(X=x_i, Y=y_j)$$

Let's next consider only those instances for which $X=x_i$. This means that we are limiting our analysis to the $i$th column. Then, we write the **conditional probability** of $Y = y_j$ given $X = x_i$ as $p(Y = y_j \mid X = x_i)$. This is the asymptotic fraction of these instances where $Y = y_j$, and is obtained by dividing the instances in the cell by those in the column:

$$p(Y = y_j \mid X = x_i) = \frac{n_{ij}}{c_i}.$$

A little algebraic rearrangement gives:

$$p(Y = y_j \mid X = x_i) = \frac{n_{ij}}{c_i} = \frac{n_{ij}}{N} / \frac{c_i}{N},$$

or:

$$p(Y = y_j \mid X = x_i) \times p(X=x_i) =  p(X=x_i, Y=y_j).$$

This is the product rule of probability with conditionals involved.
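
The counting argument above can be sketched with a small table. The counts below are made up for illustration, and the layout `counts[j][i] = n_ij` (rows index $y_j$, columns index $x_i$) is an assumption matching the row and column convention in the text.

```python
from fractions import Fraction

# Hypothetical counts: counts[j][i] = n_ij (rows index y_j, columns index x_i)
counts = [
    [3, 1, 2],
    [1, 4, 1],
]
L, M = len(counts), len(counts[0])
N = sum(sum(row) for row in counts)  # total number of instances

# Column totals c_i and the marginal p(X = x_i) = c_i / N
c = [sum(counts[j][i] for j in range(L)) for i in range(M)]
p_x = [Fraction(c[i], N) for i in range(M)]

# Joint p(X = x_i, Y = y_j) = n_ij / N and conditional p(Y = y_j | X = x_i) = n_ij / c_i
joint = [[Fraction(counts[j][i], N) for i in range(M)] for j in range(L)]
cond_y_given_x = [[Fraction(counts[j][i], c[i]) for i in range(M)] for j in range(L)]

# Product rule: p(y | x) * p(x) == p(x, y) for every cell
product_rule_holds = all(
    cond_y_given_x[j][i] * p_x[i] == joint[j][i]
    for j in range(L) for i in range(M)
)
print(product_rule_holds)
```

Using `Fraction` keeps the arithmetic exact, so the equality check is not affected by floating-point rounding.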

Let us simplify the notation by dropping the $X=$ and $Y=$.

Then we can write the marginal probability of x as a sum over the joint distribution of x and y where we sum over all possibilities of y,

$$p(x) = \sum_y p(x,y).$$

We can rewrite a joint distribution as a product of a conditional and marginal probability,

$$ p(x,y) = p(y\mid x) p(x) $$

The product rule is applied repeatedly to give expressions for the joint
probability involving more than two variables. For example, the joint distribution over three
variables can be factorized into a product of conditional probabilities:

$$ p(x,y,z) = p(x|y,z) \, p(y,z) = p(x |y,z) \, p(y|z) p(z) $$
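
As a quick numerical check of this factorization, the sketch below builds a joint distribution over three binary variables and verifies that the product of conditionals recovers it. The weights are arbitrary positive numbers chosen for illustration, and all function names are hypothetical.

```python
from fractions import Fraction
from itertools import product

vals = (0, 1)
# Hypothetical positive weights, normalized into a joint p(x, y, z)
weights = {(x, y, z): 1 + x + 2 * y + 3 * z + x * z
           for x, y, z in product(vals, repeat=3)}
total = sum(weights.values())
p_xyz = {k: Fraction(w, total) for k, w in weights.items()}

def p_z(z):
    # Marginal p(z): sum the joint over x and y
    return sum(p_xyz[(x, y, z)] for x in vals for y in vals)

def p_yz(y, z):
    # Marginal p(y, z): sum the joint over x
    return sum(p_xyz[(x, y, z)] for x in vals)

def p_y_given_z(y, z):
    return p_yz(y, z) / p_z(z)

def p_x_given_yz(x, y, z):
    return p_xyz[(x, y, z)] / p_yz(y, z)

# Check p(x, y, z) == p(x | y, z) * p(y | z) * p(z) for every assignment
chain_rule_holds = all(
    p_x_given_yz(x, y, z) * p_y_given_z(y, z) * p_z(z) == p_xyz[(x, y, z)]
    for x, y, z in product(vals, repeat=3)
)
print(chain_rule_holds)
```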

### Bayes rule

Observe that 

$$ p(x,y) = p(y\mid x) p(x) = p(x\mid y)p(y).$$

Given the product rule, dividing through by $p(x)$ (assumed nonzero) gives Bayes rule, which plays a central role in much of what we will be talking about:

$$ p(y\mid x) = \frac{p(x\mid y) \, p(y) }{p(x)} = \frac{p(x\mid y) \, p(y) }{\sum_{y'} p(x,y')} = \frac{p(x\mid y) \, p(y) }{\sum_{y'} p(x\mid y')p(y')}$$
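
A minimal numerical sketch of this formula follows. The prior and likelihood values are invented for illustration: $y$ takes two hypothetical values `"a"` and `"b"`, and `likelihood[y]` stands for $p(x \mid y)$ at one fixed observed $x$.

```python
from fractions import Fraction

# Hypothetical numbers, chosen only to illustrate the formula
prior = {"a": Fraction(1, 100), "b": Fraction(99, 100)}    # p(y)
likelihood = {"a": Fraction(9, 10), "b": Fraction(1, 20)}  # p(x | y) for the observed x

# Denominator: p(x) = sum over y' of p(x | y') p(y')
evidence = sum(likelihood[y] * prior[y] for y in prior)

# Bayes rule: p(y | x) = p(x | y) p(y) / p(x)
posterior = {y: likelihood[y] * prior[y] / evidence for y in prior}
print(posterior["a"])  # 2/13
```

Even though $p(x \mid y)$ is much larger for `"a"`, the small prior keeps the posterior for `"a"` at 2/13, which shows how the prior and the likelihood trade off in the numerator.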

### Independence

Two variables are said to be independent if their joint distribution factorizes into a product of two marginal probabilities:

$$ p(x,y) = p(x) \, p(y) $$ 

A consequence of independence is that if $x$ and $y$ are independent, the conditional probability of $x$ given $y$ is just the probability of $x$:

$$ p(x|y) = p(x) $$

In other words, by conditioning on a particular $y$, we have learned nothing about $x$ because of independence. Two variables $x$ and $y$ are said to be conditionally independent given $z$ if the following holds:

$$ p(x,y|z) = p(x|z) p(y|z) $$

Therefore, once we learn $z$, $x$ and $y$ become independent. Another way to write that $x$ and $y$ are conditionally independent given $z$ is

$$ p(x| z, y) = p(x|z) $$

In other words, if we condition on $z$, and now also learn about $y$, this is not going to change the probability of $x$. It is important to realize that conditional independence between $x$ and $y$ given $z$ does not imply independence between $x$ and $y$.
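
The last point can be illustrated with a small constructed example. The sketch below assumes a binary $z$ that sets a shared bias for two binary variables $x$ and $y$, which are drawn independently once $z$ is fixed; all numbers and names are hypothetical.

```python
from fractions import Fraction
from itertools import product

vals = (0, 1)
prior_z = {0: Fraction(1, 2), 1: Fraction(1, 2)}  # p(z)
bias = {0: Fraction(1, 10), 1: Fraction(9, 10)}   # p(x=1 | z) = p(y=1 | z)

def bern(v, q):
    """Probability that a binary variable with P(1) = q takes the value v."""
    return q if v == 1 else 1 - q

# Joint built so that x and y are independent given z
joint = {(x, y, z): prior_z[z] * bern(x, bias[z]) * bern(y, bias[z])
         for x, y, z in product(vals, repeat=3)}

def p_xy(x, y):
    return sum(joint[(x, y, z)] for z in vals)

def p_x(x):
    return sum(p_xy(x, y) for y in vals)

def p_y(y):
    return sum(p_xy(x, y) for x in vals)

def p_z(z):
    return sum(joint[(x, y, z)] for x in vals for y in vals)

def p_xy_given_z(x, y, z):
    return joint[(x, y, z)] / p_z(z)

def p_x_given_z(x, z):
    return sum(p_xy_given_z(x, y, z) for y in vals)

def p_y_given_z(y, z):
    return sum(p_xy_given_z(x, y, z) for x in vals)

# p(x, y | z) == p(x | z) p(y | z) for every assignment
cond_indep = all(
    p_xy_given_z(x, y, z) == p_x_given_z(x, z) * p_y_given_z(y, z)
    for x, y, z in product(vals, repeat=3)
)
# p(x, y) == p(x) p(y) for every assignment
marg_indep = all(p_xy(x, y) == p_x(x) * p_y(y) for x, y in product(vals, repeat=2))

print(cond_indep, marg_indep)  # True False
```

Here $p(x{=}1, y{=}1) = 41/100$ while $p(x{=}1)\,p(y{=}1) = 1/4$: marginally, observing $y$ carries information about $z$ and therefore about $x$.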