# I—Probability spaces, Events & Random Variables (Week 1)

## I.1—PROBABILITY SPACES

A *finite probability space* (often just called a probability space or a probability model) is the most basic data structure used throughout this course for modeling uncertainty.

A *finite probability space* consists of two ingredients:

- a **sample space** $\Omega$ consisting of a finite (i.e., not infinite) number of collectively exhaustive and mutually exclusive possible outcomes.

How we specify a sample space is usually not unique: for instance, we can add extraneous outcomes that all have probability 0, or extraneous information that doesn't matter. Generally speaking, it's best to choose a sample space that is as simple as possible for modeling what we care about solving. (For example, if we were rolling a six-sided die, and we actually only care about whether the face shows up at least 4 or not, then it's sufficient to just keep track of two outcomes, "at least 4" and "less than 4".)

- an **assignment of probabilities** $\mathbb {P}$: for each possible outcome $\omega \in \Omega$, we assign a probability $\mathbb {P}(\text {outcome }\omega )\in[0, 1]$ (that is, at least 0 and at most 1), where we require that the probabilities across all the possible outcomes in the sample space add up to 1: $\sum _{\omega \in \Omega }\mathbb {P}(\text {outcome }\omega )=1.$

**Notation**: As shorthand we occasionally use the tuple “$(\Omega ,\mathbb {P})$" to refer to a finite probability space to remind ourselves of the two ingredients needed, sample space $\Omega$ and an assignment of probabilities $\mathbb {P}$.

In Python code, a probability space can be written as a dictionary that encodes both the sample space and the probability table in one structure:

In [4]:
prob_space = {'sunny': 1/2, 'rainy': 1/6, 'snowy': 1/3}
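
The two requirements on $\mathbb{P}$ (each probability in $[0,1]$, and the probabilities summing to 1) can be checked mechanically. The following is a minimal sketch; the helper name `is_valid_prob_space` and the tolerance `tol` are assumptions introduced here for illustration, and the tolerance is only needed because floats such as `1/6` are not stored exactly.

```python
import math

def is_valid_prob_space(prob_space, tol=1e-9):
    """Return True if every probability lies in [0, 1] and they sum to 1."""
    # Each outcome's probability must be at least 0 and at most 1.
    in_range = all(0 <= p <= 1 for p in prob_space.values())
    # The probabilities must add up to 1 (up to floating-point rounding).
    sums_to_one = math.isclose(sum(prob_space.values()), 1.0, abs_tol=tol)
    return in_range and sums_to_one

prob_space = {'sunny': 1/2, 'rainy': 1/6, 'snowy': 1/3}
print(is_valid_prob_space(prob_space))

# A hypothetical table whose entries add up to 1.1 is rejected.
print(is_valid_prob_space({'heads': 0.5, 'tails': 0.6}))
```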

## I.2—EVENTS

An event is a subset of the sample space $\Omega$. In our table representation for a probability space, an event could thus be thought of as a subset of the rows, and the probability of the event is just the sum of the probability values in those rows!

The probability of an event $\mathcal{A}\subseteq \Omega$ is the sum of the probabilities of the possible outcomes in $\mathcal{A}$:

$$\mathbb {P}(\mathcal{A})\triangleq \sum _{\omega \in \mathcal{A}}\mathbb {P}(\text {outcome }\omega ),$$
 
where “$\triangleq$" means “defined as".

We can translate the above equation into Python code. In particular, we can compute the probability of an event encoded as a Python set `event`, where the probability space is encoded as a Python dictionary `prob_space`:

In [5]:
def prob_of_event(event, prob_space):
    total = 0
    for outcome in event:
        total += prob_space[outcome]
    return total
# Here's an example of how to use the above function:

prob_space = {'sunny': 1/2, 'rainy': 1/6, 'snowy': 1/3}
rainy_or_snowy_event = {'rainy', 'snowy'}
print(prob_of_event(rainy_or_snowy_event, prob_space))

0.5


## I.3—RANDOM VARIABLES

**Definition of a “finite random variable" (in this course, we will just call this a “random variable")**: Given a finite probability space $(\Omega ,\mathbb {P})$, a *finite random variable* $X$ is a mapping from the sample space $\Omega$ to a set of values $\mathcal{X}$ that random variable $X$ can take on. (We will often call $\mathcal{X}$ the “alphabet" of random variable $X$.):
$$ X:\Omega\to\mathcal{X}$$

For example, random variable $W$ takes on values in the alphabet $\{ \text {sunny},\text {rainy},\text {snowy}\}$, and random variable $I$ takes on values in the alphabet $\{ 0,1\}$.

**Quick summary**: There's an underlying experiment corresponding to probability space $(\Omega ,\mathbb {P})$. Once the experiment is run, let $\omega \in \Omega$ denote the outcome of the experiment. Then the random variable takes on the specific value of $X(\omega )\in \mathcal{X}$.

**Technical note**: Even though the formal definition of a finite random variable doesn't actually make use of the probability assignment $\mathbb {P}$, the probability assignment will become essential as soon as we talk about how probability works with random variables. 

**Explanation using a Python example**:

In [None]:
import comp_prob_inference  # helper module provided with the course

prob_space = {'sunny': 1/2, 'rainy': 1/6, 'snowy': 1/3}

random_outcome = comp_prob_inference.sample_from_finite_probability_space(prob_space)

W = random_outcome

if random_outcome == 'sunny':
    I = 1
else:
    I = 0

What happens in this code:

1. First, there is an underlying probability space $(\Omega , \mathbb {P})$, where $\Omega = \{ \text {sunny}, \text {rainy}, \text {snowy}\}$, and 
$$ \begin{eqnarray}
\mathbb{P}(\text{sunny}) &=& 1/2, \\
\mathbb{P}(\text{rainy}) &=& 1/6, \\
\mathbb{P}(\text{snowy}) &=& 1/3.
\end{eqnarray}$$

2. A random outcome $\omega \in \Omega$ is sampled using the probabilities given by the probability space $(\Omega ,\mathbb {P})$. This step corresponds to an underlying experiment happening.

3. Two random variables are generated:
    - $W$ is set to be equal to $\omega$. As an equation: $$\begin{eqnarray}W(\omega) &=&\omega\quad\text{for }\omega\in\{\text{sunny},\text{rainy},\text{snowy}\}.\end{eqnarray}$$ This step perhaps seems entirely unnecessary, as you might wonder “Why not just call the random outcome $W$ instead of $\omega$?" Indeed, this step isn't actually necessary for this particular example, but the formalism for random variables has this step to deal with what happens when we encounter a random variable like $I$.
    - $I$ is set to 1 if $\omega =\text {sunny}$, and 0 otherwise. As an equation: $$\begin{eqnarray} I(\omega) &=&
\begin{cases}
  1 & \text{if }\omega=\text{sunny}, \\
  0 & \text{if }\omega\in\{\text{rainy},\text{snowy}\}.
\end{cases}
\end{eqnarray}$$ Importantly, multiple possible outcomes (rainy or snowy) get mapped to the same value 0 that $I$ can take on. 

We see that random variable $W$ maps the sample space $\Omega =\{ \text {sunny},\text {rainy},\text {snowy}\}$ to the same set $\{ \text {sunny},\text {rainy},\text {snowy}\}$. Meanwhile, random variable $I$ maps the sample space $\Omega =\{ \text {sunny},\text {rainy},\text {snowy}\}$ to the set $\{ 0,1\}$.

We can pictorially see what's going on by looking at the probability tables for: the original probability space, the random variable $W$, and the random variable $I$:
![alt text](https://d37djvu3ytnwxt.cloudfront.net/assets/courseware/v1/cdb0d997cac4daf2d612e86780ce72bf/asset-v1:MITx+6.008.1x+3T2016+type@asset+block/images_sec-random-variables-main.png =250x)
These tables make it clear that a “random variable" really is just reassigning/relabeling what the values are for the possible outcomes in the underlying probability space (given by the top left table):

- In the top right table, random variable $W$ does not do any sort of relabeling so its probability table looks the same as that of the underlying probability space.

- In the bottom left table, the random variable $I$ relabels/reassigns “sunny" to 1, and both “rainy" and “snowy" to 0. Intuitively, since two of the rows now have the same label 0, it makes sense to just combine these two rows, adding their probabilities ($\frac{1}{6}+\frac{1}{3}=\frac{1}{2}$). This results in the bottom right table.

Specify a Random Variable in Python:

In [None]:
# underlying probability space
prob_space = {'sunny': 1/2, 'rainy': 1/6, 'snowy': 1/3}

# map from the sample space to the alphabet
W_mapping = {'sunny': 'sunny', 'rainy': 'rainy', 'snowy': 'snowy'}
I_mapping = {'sunny': 1, 'rainy': 0, 'snowy': 0}

# generate a random sample/draw for random variables
random_outcome = comp_prob_inference.sample_from_finite_probability_space(prob_space)
W = W_mapping[random_outcome]
I = I_mapping[random_outcome]
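
The relabel-and-combine step described for the probability tables can also be computed directly rather than by sampling. The sketch below is an illustration under stated assumptions: the helper name `pmf_of_random_variable` is introduced here and is not part of the course code, and `fractions.Fraction` is used only so that the sums come out exact.

```python
from collections import defaultdict
from fractions import Fraction

def pmf_of_random_variable(prob_space, mapping):
    """Build the probability table of a random variable.

    Every outcome is relabeled through `mapping`, and outcomes that
    receive the same label have their probabilities added together.
    """
    pmf = defaultdict(int)
    for outcome, prob in prob_space.items():
        pmf[mapping[outcome]] += prob
    return dict(pmf)

prob_space = {'sunny': Fraction(1, 2),
              'rainy': Fraction(1, 6),
              'snowy': Fraction(1, 3)}
W_mapping = {'sunny': 'sunny', 'rainy': 'rainy', 'snowy': 'snowy'}
I_mapping = {'sunny': 1, 'rainy': 0, 'snowy': 0}

p_W = pmf_of_random_variable(prob_space, W_mapping)
p_I = pmf_of_random_variable(prob_space, I_mapping)
print(p_W)  # same table as the underlying probability space
print(p_I)  # 'rainy' and 'snowy' are merged under the label 0
```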

### Random Variables Notation and Terminology

In this course, we denote random variables with capital/uppercase letters, such as $X, W, I$, etc. We use the phrases “probability table", “probability mass function" (abbreviated as PMF), and “probability distribution" (often simply called a distribution) to mean the same thing, and in particular we denote the probability table for $X$ to be $p_ X$ or $p_ X(\cdot )$.

We write $p_ X(x)$ to denote the entry of the probability table that has label $x \in \mathcal{X}$ where $\mathcal{X}$ is the set of values that random variable $X$ takes on. Note that we use lowercase letters like $x$ to denote variables storing nonrandom values. We can also look up values in a probability table using specific outcomes, e.g., from earlier, we have $p_ W(\text {rainy}) = 1/6$ and $p_ I(1)=1/2$.

Note that we use the same notation as in math where a function $f$ might also be written as $f(\cdot )$ to explicitly indicate that it is the function of one variable. Both $f$ and $f(\cdot )$ refer to a function whereas $f(x)$ refers to the value of the function $f$ evaluated at the point $x$.

As an example of how to use all this notation, recall that a probability table consists of nonnegative entries that add up to 1. In fact, each of the entries is at most 1 (otherwise the numbers would add to more than 1). For a random variable $X$ taking on values in $\mathcal{X}$, we can write out these constraints as:
$$0 \le p_ X(x) \le 1\quad \text {for all }x\in \mathcal{X}, \qquad \sum _{x \in \mathcal{X}} p_ X(x) = 1.$$

Often in the course, if we are making statements about all possible outcomes of $X$, we will omit writing out the alphabet $\mathcal{X}$ explicitly. For example, instead of the above, we might write the following equivalent statement:
$$0 \le p_ X(x) \le 1\quad \text {for all }x, \qquad \sum _ x p_ X(x) = 1.$$

# II — Jointly Distributed Random Variables (Week 2)

## II.1—Relating Two Random Variables

At the most basic level, inference refers to using an observation to reason about some unknown quantity. In this course, the observation and the unknown quantity are represented by random variables. The main modeling question is: How do these random variables relate?

Let's build on our earlier weather example, where now another outcome of interest appears, the temperature, which we quantize into two possible values “hot" and “cold". Let's suppose that we have the following probability space:
![proba space](./images/images_sec-joint-rv-prob-space.png)
You can check that the nonnegative entries do add to 1. If we let random variable $W$ be the weather (sunny, rainy, snowy) and random variable $T$ be the temperature (hot, cold), then notice that we could rearrange the table in the following fashion:
![rearranged table](./images/images_sec-joint-rv-rearrange-table.png)
When we talk about two separate random variables, we could view them either as a single “super" random variable that happens to consist of a pair of values (the first table; notice the label for each outcome corresponds to a pair of values), or we can view the two separate variables along their own different axes (the second table).

The first table tells us what the underlying probability space is, which includes what the sample space is (just read off the outcome names) and what the probability is for each of the possible outcomes for the underlying experiment at hand.

The second table is called a *joint probability table* $p_{W,T}$ for random variables $W$ and $T$, and we say that random variables $W$ and $T$ are jointly distributed with the above distribution. Since this table is a rearrangement of the earlier table, it also consists of nonnegative entries that add to 1.

The joint probability table gives probabilities in which $W$ and $T$ co-occur with specific values. For example, in the above, the event that “$W=\text {sunny}$" and the event that “$T=\text {hot}$" co-occur with probability 3/10. Notationally, we write
$$p_{W,T}(\text {sunny},\text {hot})=\mathbb {P}(W=\text {sunny},T=\text {hot})=\frac{3}{10}.$$
**Conceptual note**: Given the joint probability table, we can easily go backwards and write out the first table above, which is the underlying probability space.

**Preview of inference**: Inference is all about answering questions like “if we observe that the weather is rainy, what is the probability that the temperature is cold?" Let's take a look at how one might answer this question.

First, if we observe that it is rainy, then we know that “sunny" and “snowy" didn't happen so those rows aren't relevant anymore. So the space of possible realizations of the world has shrunk to two options now: ($W=\text {rainy},T=\text {hot}$) or ($W=\text {rainy},T=\text {cold}$). But what about the probabilities of these two realizations? It's not just 1/30 and 2/15 since these don't sum to 1 — by observing things, adjustments can be made to the probabilities of different realizations but they should still form a valid probability space.

Why not just scale both 1/30 and 2/15 by the same constant so that they sum to 1? This can be done by dividing 1/30 and 2/15 by their sum:
$$\text {hot:}\quad \frac{\frac{1}{30}}{\frac{1}{30}+\frac{2}{15}}=\frac{1}{5},\qquad \text {cold}:\quad \frac{\frac{2}{15}}{\frac{1}{30}+\frac{2}{15}}=\frac{4}{5}.$$
	 

Now they sum to 1. It turns out that, given that we've observed the weather to be rainy, these are the correct probabilities for the two options “hot" and “cold". Let's formalize the steps. We work backwards, first explaining where the denominator “$\frac{1}{30}+\frac{2}{15}=\frac{1}{6}$" above comes from.

In [1]:
# Representing a Joint Probability Table in Code
# Approach 1: Use dictionaries within a dictionary
prob_W_T_dict = {}
for w in {'sunny', 'rainy', 'snowy'}:
    prob_W_T_dict[w] = {}

prob_W_T_dict['sunny']['hot'] = 3/10
prob_W_T_dict['sunny']['cold'] = 1/5
prob_W_T_dict['rainy']['hot'] = 1/30
prob_W_T_dict['rainy']['cold'] = 2/15
prob_W_T_dict['snowy']['hot'] = 0
prob_W_T_dict['snowy']['cold'] = 1/3

prob_W_T_dict['rainy']['cold'] # Test

# Approach 2: Use a numpy 2D array
import numpy as np
prob_W_T_rows = ['sunny', 'rainy', 'snowy']
prob_W_T_cols = ['hot', 'cold']
prob_W_T_array = np.array([[3/10, 1/5], [1/30, 2/15], [0, 1/3]])
prob_W_T_row_mapping = {label: index for index, label in enumerate(prob_W_T_rows)}
prob_W_T_col_mapping = {label: index for index, label in enumerate(prob_W_T_cols)}

prob_W_T_array[prob_W_T_row_mapping['rainy'], prob_W_T_col_mapping['cold']] # Test

0.13333333333333333

**Remarks on these two approaches:**
1. Dictionaries within a dictionary representation:
    - retrieving a row is easy but retrieving a column is not => summing a column's probabilities is more cumbersome than summing a row's probabilities
    - this representation can store only the nonzero table entries => more efficient for sparse tables
2. Numpy 2D array representation:
    - easy to work with when it comes to basic operations like summing rows, and retrieving a specific row or column.
    - need to store the whole array, and if the alphabet sizes of the random variables are very large, then storing the array will take a lot of space!


## II.2—Marginalization: Summarizing Randomness
Given a joint probability table, often we'll want to know what the probability table is for just one of the random variables. We can do this by just summing or “marginalizing" out the other random variables. For example, to get the probability table for random variable $W$, we do the following:
![marginalization](./images/images_sec-joint-rv-marg-rows.png)
We take the joint probability table (left-hand side) and compute out the row sums (which we've written in the margin).

The right-hand side table is the probability table $p_{W}$ for random variable $W$; we call this resulting probability distribution the marginal distribution of $W$ (put another way, it is the distribution obtained by marginalizing out the random variables that aren't $W$).

In terms of notation, the above marginalization procedure whereby we used the joint distribution of $W$ and $T$ to produce the marginal distribution of $W$ is written:
$$p_{W}(w)=\sum _{t\in \mathcal{T}}p_{W,T}(w,t),$$
where $\mathcal{T}$ is the set of values that random variable $T$ can take on. In fact, throughout this course, we will often omit explicitly writing out the alphabet of values that a random variable takes on, e.g., writing instead
$$p_{W}(w)=\sum _{t}p_{W,T}(w,t).$$

It's clear from context that we're summing over all possible values for $t$, which is going to be the values that random variable $T$ can possibly take on.

As a specific example,
$$p_{W}(\text {rainy})=\sum _{t}p_{W,T}(\text {rainy},t)=\underbrace{p_{W,T}(\text {rainy},\text {hot})}_{1/30}+\underbrace{p_{W,T}(\text {rainy},\text {cold})}_{2/15}=\frac{1}{6}.$$

We could similarly marginalize out random variable $W$ to get the marginal distribution $p_{T}$ for random variable $T$:
![margi T](./images/images_sec-joint-rv-marg-cols.png)
(Note that whether we write a probability table for a single variable horizontally or vertically doesn't actually matter.)

As a formula, we would write:
$$p_{T}(t)=\sum _{w}p_{W,T}(w,t).$$
	 

For example,
$$p_{T}(\text {hot})=\sum _{w}p_{W,T}(w,\text {hot})=\underbrace{p_{W,T}(\text {sunny},\text {hot})}_{3/10}+\underbrace{p_{W,T}(\text {rainy},\text {hot})}_{1/30}+\underbrace{p_{W,T}(\text {snowy},\text {hot})}_{0}=\frac{1}{3}.$$
	 
In general:

**Marginalization**: Consider two random variables $X$ and $Y$ (that take on values in the sets $\mathcal{X}$ and $\mathcal{Y}$) with joint probability table $p_{X,Y}$. For any $x\in \mathcal{X}$, the *marginal probability* that $X=x$ is given by
$$p_{X}(x)=\sum _{y}p_{X,Y}(x,y).$$
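
With the NumPy 2D array representation from earlier, marginalization is a sum along one axis. The following is a minimal sketch using the weather and temperature table; the variable names `prob_W` and `prob_T` are introduced here for illustration.

```python
import numpy as np

prob_W_T_rows = ['sunny', 'rainy', 'snowy']
prob_W_T_cols = ['hot', 'cold']
prob_W_T_array = np.array([[3/10, 1/5], [1/30, 2/15], [0, 1/3]])

# Summing across each row (over t) marginalizes out T and leaves p_W.
prob_W = prob_W_T_array.sum(axis=1)
# Summing down each column (over w) marginalizes out W and leaves p_T.
prob_T = prob_W_T_array.sum(axis=0)

print(dict(zip(prob_W_T_rows, prob_W)))  # approximately 1/2, 1/6, 1/3
print(dict(zip(prob_W_T_cols, prob_T)))  # approximately 1/3, 2/3
```

Both marginals are themselves probability tables, so each should again add up to 1 (up to floating-point rounding).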

### Marginalization for Many Random Variables

In general, for three random variables $X$, $Y$, and $Z$ with joint probability table $p_{X,Y,Z}$, we have
$$\begin{eqnarray}
p_{X,Y}(x,y)
&=&
\sum_{z} p_{X,Y,Z}(x,y,z), \\
p_{X,Z}(x,z)
&=&
\sum_{y} p_{X,Y,Z}(x,y,z), \\
p_{Y,Z}(y,z)
&=&
\sum_{x} p_{X,Y,Z}(x,y,z).
\end{eqnarray}$$

Note that we can marginalize out different random variables in succession. For example, given joint probability table $p_{X,Y,Z}$, if we wanted the probability table $p_ X$, we can get it by marginalizing out the two random variables $Y$ and $Z$:
$$p_ X(x) = \sum _{y} p_{X,Y}(x,y) = \sum _{y} \Big( \sum _{z} p_{X,Y,Z}(x,y,z) \Big).$$
	 

Even with more than three random variables, the idea is the same. For example, with four random variables $W$, $X$, $Y$, and $Z$ with joint probability table $p_{W,X,Y,Z}$, if we want the joint probability table for $X$ and $Y$, we would do the following:
$$p_{X,Y}(x, y) = \sum _ w \Big( \sum _ z p_{W,X,Y,Z}(w,x,y,z) \Big).$$
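
As a sketch of the four-variable case, the joint table can be stored as a 4D NumPy array with one axis per random variable. The table below is randomly generated purely for illustration (the alphabet sizes 2, 3, 4, 5 and the seed are arbitrary assumptions); the point is that summing over the axes of $W$ and $Z$ in one call agrees with marginalizing them out one after the other.

```python
import numpy as np

# Hypothetical joint table p_{W,X,Y,Z}; axes are ordered (w, x, y, z).
rng = np.random.default_rng(0)
p_WXYZ = rng.random((2, 3, 4, 5))
p_WXYZ /= p_WXYZ.sum()  # normalize so the entries add up to 1

# Marginalize out W (axis 0) and Z (axis 3) in a single call.
p_XY = p_WXYZ.sum(axis=(0, 3))

# Equivalent: marginalize out Z first, then W, in succession.
p_XY_stepwise = p_WXYZ.sum(axis=3).sum(axis=0)

print(p_XY.shape)                        # one axis left for x, one for y
print(np.allclose(p_XY, p_XY_stepwise))  # both routes agree
```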

## II.3—Conditioning: Randomness of a Variable Given that Another Variable Takes on a Specific Value

When we observe that a random variable takes on a specific value (such as $W=\text {rainy}$ from earlier for which we say that we condition on random variable $W$ taking on the value “rainy"), this observation can affect what we think are likely or unlikely values for another random variable.

**Conditioning**: Consider two random variables $X$ and $Y$ (that take on values in the sets $\mathcal{X}$ and $\mathcal{Y}$) with joint probability table $p_{X,Y}$ (from which by marginalization we can readily compute the marginal probability table $p_{Y}$). For any $x\in \mathcal{X}$ and $y\in \mathcal{Y}$ such that $p_{Y}(y)>0$, the conditional probability of event $X=x$ given event $Y=y$ has happened is
$$p_{X\mid Y}(x\mid y)\triangleq \frac{p_{X,Y}(x,y)}{p_{Y}(y)}.$$
	 
**Computational interpretation**: To compute $p_{X\mid Y}(x\mid y)$, take the entry $p_{X,Y}(x,y)$ in the joint probability table corresponding to $X=x$ and $Y=y$, and then divide the entry by $p_{Y}(y)$, which is an entry in the marginal probability table $p_{Y}$ for random variable $Y$.
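
The computational interpretation can be sketched with the dictionaries-within-a-dictionary representation. Here the temperature $T$ plays the role of $X$ and the weather $W$ plays the role of $Y$; the helper name `condition_on_row` is an assumption made for this example, and `Fraction` is used so that the results match the hand computation exactly.

```python
from fractions import Fraction as F

prob_W_T_dict = {
    'sunny': {'hot': F(3, 10), 'cold': F(1, 5)},
    'rainy': {'hot': F(1, 30), 'cold': F(2, 15)},
    'snowy': {'hot': F(0), 'cold': F(1, 3)},
}

def condition_on_row(joint, w):
    """Return the conditional table p_{T|W}(. | w) from the joint table."""
    # Marginal probability p_W(w): the row sum of the joint table.
    p_w = sum(joint[w].values())
    if p_w == 0:
        raise ValueError('cannot condition on a zero-probability value')
    # Divide every joint entry in the row by the marginal.
    return {t: p / p_w for t, p in joint[w].items()}

p_T_given_rainy = condition_on_row(prob_W_T_dict, 'rainy')
print(p_T_given_rainy)  # hot: 1/5, cold: 4/5
```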

### Moving Toward a More General Story for Conditioning: Conditioning on Events

Jointly distributed random variables play a central role in this course. Remember that we will model observations as random variables and the quantities we want to infer also as random variables. When these random variables are jointly distributed so that we have a probabilistic way to describe how they relate (through their joint probability table), then we can systematically and quantitatively produce inferences.

We just saw how to condition on a random variable taking on a specific value. What if we wanted to condition on a random variable taking on any one of many values rather than just one specific value? To answer this question, we look at a more general story of conditioning, which is in terms of events.

Suppose we have a probability model all built up, $(\Omega, \mathbb{P})$, and two events $\mathcal{A}$ and $\mathcal{B}$ that may intersect (depending on where the outcome falls, we say event $\mathcal{A}$ occurs, event $\mathcal{B}$ occurs, both occur, or neither occurs).

Now, suppose that we have an observation that $\mathcal{A}$ occurred ($\omega \in \mathcal{A}$).
What does that mean? Set $\mathcal{A}$ now becomes our new sample space. But we still have a set of possible outcomes in $\mathcal{A}$ and we don't know exactly where the outcome would be. Therefore, there should be a new model with
- a new sample space: $\mathcal{A}$ and
- a new probability assignment: we'll call it $\mathbb{P}(\cdot \mid \mathcal{A})$

How do we compute the new probability assignment $\mathbb{P}(\cdot \mid \mathcal{A})$? Assuming $\mathbb{P}(\mathcal{A})>0$, there are two cases depending on $\omega$:
- if $\omega\notin \mathcal{A}$: $\mathbb{P}(\omega \mid \mathcal{A}) = 0$
- if $\omega\in \mathcal{A}$: $\mathbb{P}(\omega \mid \mathcal{A}) = \frac{\mathbb{P}(\omega)}{ \mathbb{P}(\mathcal{A})}$

Why do we scale by $\frac{1}{\mathbb{P}(\mathcal{A})}$? To guarantee that the new probabilities sum to 1: $\sum_{\omega\in \mathcal{A}}{\mathbb{P}(\omega \mid \mathcal{A})}=\frac{1}{\mathbb{P}(\mathcal{A})}\sum_{\omega\in \mathcal{A}}\mathbb{P}(\omega)=\frac{\mathbb{P}(\mathcal{A})}{\mathbb{P}(\mathcal{A})}=1$.

**General definition**:
We can deduce a general definition of the conditional probability of $\mathcal{B}$ given $\mathcal{A}$ (for $\mathbb{P}(\mathcal{A})>0$):
$$ \mathbb{P}(\mathcal{B} \mid \mathcal{A}) \triangleq \frac{\mathbb{P}(\mathcal{A} \cap \mathcal{B})}{ \mathbb{P}(\mathcal{A})}$$

We can deduce from that the commonly used *product rule*:
$$ \mathbb{P}(\mathcal{A} \cap \mathcal{B}) = \mathbb{P}(\mathcal{A})\,\mathbb{P}(\mathcal{B} \mid \mathcal{A})$$
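
Conditioning on an event can be sketched as building a new probability table in which outcomes outside $\mathcal{A}$ get probability 0 and outcomes inside $\mathcal{A}$ are rescaled by $1/\mathbb{P}(\mathcal{A})$. The helper name `condition_on_event` and the two example events are assumptions chosen for illustration on the weather probability space.

```python
from fractions import Fraction as F

prob_space = {'sunny': F(1, 2), 'rainy': F(1, 6), 'snowy': F(1, 3)}

def condition_on_event(prob_space, event_a):
    """Return the probability assignment P(. | A) as a dictionary."""
    p_a = sum(prob_space[outcome] for outcome in event_a)
    if p_a == 0:
        raise ValueError('cannot condition on a zero-probability event')
    # Outcomes in A are rescaled by 1 / P(A); all others get probability 0.
    return {outcome: (p / p_a if outcome in event_a else F(0))
            for outcome, p in prob_space.items()}

A = {'rainy', 'snowy'}   # event "not sunny"
B = {'sunny', 'rainy'}   # event "no snow"

conditioned = condition_on_event(prob_space, A)
p_B_given_A = sum(conditioned[outcome] for outcome in B)

print(conditioned)   # sunny: 0, rainy: 1/3, snowy: 2/3
print(p_B_given_A)   # 1/3
```

Summing the conditioned table over $\mathcal{B}$ gives the same number as $\mathbb{P}(\mathcal{A}\cap\mathcal{B})/\mathbb{P}(\mathcal{A}) = \frac{1/6}{1/2}$, which is the general definition.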

### Bayes' Theorem for Events

Given two events $\mathcal{A}$ and $\mathcal{B}$ (both of which have positive probability), Bayes' theorem, also called Bayes' rule or Bayes' law, gives a way to compute $\mathbb {P}(\mathcal{A} | \mathcal{B})$ in terms of $\mathbb {P}(\mathcal{B} | \mathcal{A})$. This result turns out to be extremely useful for inference because often times we want to compute one of these, and the other is known or otherwise straightforward to compute.

Bayes' theorem is given by
$$\mathbb {P}(\mathcal{A} | \mathcal{B}) = \frac{\mathbb {P}(\mathcal{B} | \mathcal{A}) \mathbb {P}(\mathcal{A})}{\mathbb {P}(\mathcal{B})}.$$
	 

The proof of why this is the case is a one liner:
$$\mathbb {P}(\mathcal{A} | \mathcal{B}) \overset {(a)}{=} \frac{\mathbb {P}(\mathcal{A} \cap \mathcal{B})}{\mathbb {P}(\mathcal{B})} \overset {(b)}{=} \frac{\mathbb {P}(\mathcal{B} | \mathcal{A}) \mathbb {P}(\mathcal{A})}{\mathbb {P}(\mathcal{B})},$$
	 

where step (a) is by the definition of conditional probability for events, and step (b) is due to the product rule for events (which follows from rearranging the definition of conditional probability for $\mathbb {P}(\mathcal{B} | \mathcal{A})$).
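
As a numerical sanity check of Bayes' theorem for events, the sketch below computes both sides of the formula on the weather probability space. The helper names `prob` and `cond_prob` and the choice of events are assumptions made for this example.

```python
from fractions import Fraction as F

prob_space = {'sunny': F(1, 2), 'rainy': F(1, 6), 'snowy': F(1, 3)}

def prob(event):
    """Probability of an event (a set of outcomes)."""
    return sum(prob_space[outcome] for outcome in event)

def cond_prob(event, given):
    """Conditional probability P(event | given), by the definition."""
    return prob(event & given) / prob(given)

A = {'rainy', 'snowy'}   # event "not sunny"
B = {'sunny', 'rainy'}   # event "no snow"

lhs = cond_prob(A, B)                       # P(A | B) computed directly
rhs = cond_prob(B, A) * prob(A) / prob(B)   # right-hand side of Bayes' theorem

print(lhs, rhs)  # both are 1/4
```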

### Law of total probability

If we can break up the sample space into a partition $(\mathcal{B_1}, ..., \mathcal{B_n})$ of $n$ subsets, we can sometimes simplify our computation.

To be a partition of $\Omega$, a collection of subsets $(\mathcal{B_1}, ..., \mathcal{B_n})$ has to satisfy two rules:
- The union of the subsets covers the entire sample space: $\bigcup_{i=1}^n{\mathcal{B_i}}=\Omega$, and
- The subsets are pairwise disjoint: $\forall i \neq j, \mathcal{B_i}\cap\mathcal{B_j}=\emptyset$.

Then, the *law of total probability* can be stated as:
$$ \mathbb{P}(\mathcal{A}) = \sum_{i=1}^n{\mathbb {P}(\mathcal{A} \cap \mathcal{B_i})}$$

One useful application of the law of total probability is to partition $\Omega$ into an *interesting event* $\mathcal{A}$ and its complement $\mathcal{A}^c$, and then apply the product rule to each term in order to write the probability of an event $\mathcal{B}$ as:
$$\mathbb{P}(\mathcal{B}) = \mathbb{P}(\mathcal{B}\mid\mathcal{A})\mathbb{P}(\mathcal{A}) + \mathbb{P}(\mathcal{B}\mid\mathcal{A}^c)(1-\mathbb{P}(\mathcal{A}))$$
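
The two-set version of the law of total probability can be checked on the weather and temperature probability space, whose outcomes are (weather, temperature) pairs. In this sketch the event $\mathcal{A}$ is "the weather is sunny" and $\mathcal{B}$ is "the temperature is hot"; the helper name `prob` and these event choices are assumptions made for illustration.

```python
from fractions import Fraction as F

prob_space = {
    ('sunny', 'hot'): F(3, 10), ('sunny', 'cold'): F(1, 5),
    ('rainy', 'hot'): F(1, 30), ('rainy', 'cold'): F(2, 15),
    ('snowy', 'hot'): F(0),     ('snowy', 'cold'): F(1, 3),
}

def prob(event):
    """Probability of an event (a set of outcomes)."""
    return sum(prob_space[outcome] for outcome in event)

omega = set(prob_space)
A = {o for o in omega if o[0] == 'sunny'}   # weather is sunny
A_c = omega - A                             # complement of A
B = {o for o in omega if o[1] == 'hot'}     # temperature is hot

p_B_given_A = prob(A & B) / prob(A)         # 3/5
p_B_given_Ac = prob(A_c & B) / prob(A_c)    # 1/15

total = p_B_given_A * prob(A) + p_B_given_Ac * (1 - prob(A))
print(total, prob(B))  # both are 1/3
```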

### Relating Conditioning on Events Back to Random Variables
We can come back to random variables from our discussion on events just by remembering that a random variable $X$ taking a particular value $x$ is an event. It corresponds to all of the outcomes $\omega\in\Omega$ for which the random variable $X$ is assigned this value $x$.

Let's think of this as event $\mathcal{A}(x) = \{\omega\in\Omega \mid X(\omega) = x\}$.

Each value $y$ taken by another random variable $Y$ can likewise be associated with an event $\mathcal{B}(y) = \{\omega\in\Omega \mid Y(\omega) = y\}$.

Starting from the definition of conditional probability for events, $$ \mathbb{P}(\mathcal{B} \mid \mathcal{A}) \triangleq \frac{\mathbb{P}(\mathcal{A} \cap \mathcal{B})}{ \mathbb{P}(\mathcal{A})},$$ and taking $\mathcal{A}=\mathcal{A}(x)$ and $\mathcal{B}=\mathcal{B}(y)$, we can thus deduce the definition of conditional probability for random variables in terms of events:
$$ \mathbb{P}(Y=y \mid X = x) = \frac{\mathbb{P}(X=x,Y=y)}{ \mathbb{P}(X=x)}$$

If we fix the value of $x$ and just change the value of $y$, this defines a new distribution $p_{Y\mid X}(\cdot\mid x)$ (viewed as a function of $y$ only) that can be written in terms of the probability tables of the random variables as:
$$p_{Y\mid X}(y \mid x) = \frac{p_{X,Y}(x,y)}{ p_X(x)}$$
where we can see that $\sum_y{p_{Y\mid X}(y \mid x)}=1$.

# III—Inference with Bayes' Theorem for Random Variables (Week 3)

We now return to random variables and build up to Bayes' theorem for random variables. This machinery will be extremely important as it will be how we automate inference for much larger problems in the later sections of the course, where we can have a large number of random variables at play, and a large amount of observations that we need to incorporate into our inference.

## III.1—The Product Rule for Random Variables

In many real world problems, we aren't given what the joint distribution of two random variables is although we might be given other information from which we can compute the joint distribution. Often times, we can compute out the joint distribution using what's called the *product rule* (often also called the chain rule). This is precisely the random variable version of the product rule for events.

As we saw from before, we were able to derive Bayes' theorem for events using the product rule for events: $\mathbb {P}(\mathcal{A} \cap \mathcal{B}) = \mathbb {P}(\mathcal{A}) \mathbb {P}(\mathcal{B} \mid \mathcal{A})$. The random variable version of the product rule is derived just like the event version of the product rule, by rearranging the equation for the definition of conditional probability.

For two random variables $X$ and $Y$ (that take on values in sets $\mathcal{X}$ and $\mathcal{Y}$ respectively), the *product rule* for random variables says that
$$p_{X,Y}(x,y)=p_{Y}(y)p_{X\mid Y}(x\mid y)\qquad \text {for all }x\in \mathcal{X},y\in \mathcal{Y}\text { such that }p_{Y}(y)>0.$$
	 

**Interpretation**: If we have the probability table for $Y$, and separately the probability table for $X$ conditioned on $Y$, then we can come up with the joint probability table (i.e., the joint distribution) of $X$ and $Y$.

**What happens when $p_{Y}(y)=0$ ?** Even though $p_{X\mid Y}(x\mid y)$ isn't defined in this case, one can readily show that $p_{X,Y}(x,y)=0$ when $p_{Y}(y)=0$: Suppose that random variables $X$ and $Y$ have joint probability table $p_{X,Y}$ and take on values in sets $\mathcal{X}$ and $\mathcal{Y}$ respectively. Suppose that for a specific choice of $y\in \mathcal{Y}$, we have $p_{Y}(y)=0$. Then
$$p_{X,Y}(x,y)=0\qquad \text {for all }x\in \mathcal{X}.$$

*Proof*: Let $y\in \mathcal{Y}$ satisfy $p_{Y}(y)=0$. Recall that we relate marginal distribution $p_{Y}$ to joint distribution $p_{X,Y}$ via marginalization:
$$0=p_{Y}(y)=\sum _{x\in \mathcal{X}}p_{X,Y}(x,y).$$
If a sum of nonnegative numbers (such as probabilities) equals 0, then each of the numbers being summed up must also be 0 (otherwise, the sum would be positive!). Hence, it must be that each number being added up in the right-hand side sum is 0, i.e.,
$$p_{X,Y}(x,y)=0\qquad \text {for all }x\in \mathcal{X}.$$

Thus, in general:
$$\begin{eqnarray}
p_{X,Y}(x,y)
&=&
\begin{cases}
p_{Y}(y)p_{X\mid Y}(x\mid y) & \text{if }p_{Y}(y)>0,\\
0 & \text{if }p_{Y}(y)=0.
\end{cases}
\end{eqnarray}$$

**Important convention for this course**: For notational convenience, throughout this course, we will often just write $p_{X,Y}(x,y)=p_{Y}(y)p_{X\mid Y}(x\mid y)$ with the understanding that if $p_{Y}(y)=0$, even though $p_{X\mid Y}(x\mid y)$ is not actually defined, $p_{X,Y}(x,y)$ just evaluates to 0 anyways.
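
The product rule, including the zero-probability convention, can be sketched as follows. Here the weather plays the role of $Y$ and the temperature the role of $X$. The value `'foggy'` with probability 0 is a hypothetical addition that is not part of the weather example; it is included only to exercise the case $p_Y(y)=0$, for which no conditional table is defined. The helper name `joint_from_product_rule` is likewise an assumption.

```python
from fractions import Fraction as F

# Marginal p_Y; 'foggy' is a hypothetical zero-probability value.
p_Y = {'sunny': F(1, 2), 'rainy': F(1, 6), 'snowy': F(1, 3), 'foggy': F(0)}

# Conditional tables p_{X|Y}(. | y), defined only where p_Y(y) > 0.
p_X_given_Y = {
    'sunny': {'hot': F(3, 5), 'cold': F(2, 5)},
    'rainy': {'hot': F(1, 5), 'cold': F(4, 5)},
    'snowy': {'hot': F(0), 'cold': F(1)},
}

def joint_from_product_rule(p_Y, p_X_given_Y, x_values):
    """Build p_{X,Y}(x, y) = p_Y(y) * p_{X|Y}(x | y), with 0 when p_Y(y) = 0."""
    joint = {}
    for y, p_y in p_Y.items():
        for x in x_values:
            if p_y == 0:
                # The conditional is undefined here, but the joint entry is 0.
                joint[(x, y)] = F(0)
            else:
                joint[(x, y)] = p_y * p_X_given_Y[y][x]
    return joint

p_XY = joint_from_product_rule(p_Y, p_X_given_Y, ['hot', 'cold'])
print(p_XY[('hot', 'sunny')])   # 3/10
print(p_XY[('cold', 'rainy')])  # 2/15
print(p_XY[('hot', 'foggy')])   # 0
```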

**The product rule is symmetric**: We can use the definition of conditional probability with $X$ and $Y$ swapped, and rearranging factors, we get:
$$p_{X,Y}(x,y)=p_{X}(x)p_{Y\mid X}(y\mid x)\qquad \text {for all }x\in \mathcal{X},y\in \mathcal{Y}\text { such that }p_{X}(x)>0,$$

and so similarly we could show that
$$\begin{eqnarray}
p_{X,Y}(x,y)
&=&
\begin{cases}
p_{X}(x)p_{Y\mid X}(y\mid x) & \text{if }p_{X}(x)>0,\\
0 & \text{if }p_{X}(x)=0.
\end{cases}
\end{eqnarray}$$

**Many random variables**: If we have many random variables, say, $X_1$, $X_2$, up to $X_N$ where $N$ is not a random variable but is a fixed constant, then we have
$$\begin{eqnarray}
&&p_{X_1, X_2, \dots ,X_N}(x_1, x_2, \dots, x_N) \\
&&=
  p_{X_1}(x_1)
  p_{X_2 \mid X_1}(x_2 \mid x_1)
  p_{X_3 \mid X_1, X_2}(x_3 \mid x_1, x_2) \\
  % p_{X_4 \mid X_1, X_2, X_3}(x_4 \mid x_1, x_2, x_3)
&&\quad
  \cdots
  p_{X_N \mid X_1, X_2, \dots, X_{N-1}}(x_N \mid x_1, x_2, \dots, x_{N-1}).
\end{eqnarray}$$

Again, we write this to mean that this holds for every possible choice of $x_1, x_2, \dots , x_ N$ for which we never condition on a zero probability event. Note that the above factorization always holds without additional assumptions on the distribution of $X_1, X_2, \dots , X_N$.

Note that the product rule could be applied in arbitrary orderings. In the above factorization, you could think of it as introducing random variable $X_1$ first, and then $X_2$, and then $X_3$, etc. Each time we introduce another random variable, we have to condition on all the random variables that have already been introduced.

Since there are $N$ random variables, there are $N!$ different orderings in which we can write out the product rule. For example, we can think of introducing the last random variable $X_N$ first and then going backwards until we introduce $X_1$ at the end. This yields the equally valid factorization
$$\begin{eqnarray}
&& p_{X_1, X_2, \dots ,X_N}(x_1, x_2, \dots, x_N) \\
&&=
  p_{X_N}(x_N)
  p_{X_{N-1} \mid X_N}(x_{N-1} \mid x_N)
  p_{X_{N-2} \mid X_{N-1}, X_N}(x_{N-2} \mid x_{N-1}, x_N) \\
  % p_{X_4 \mid X_1, X_2, X_3}(x_4 \mid x_1, x_2, x_3)
&&\quad
  \cdots
  p_{X_1 \mid X_2, X_3, \dots, X_N}(x_1 \mid x_2, \dots, x_N).
\end{eqnarray}$$
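
For $N=3$, both orderings of the product rule can be verified numerically. The joint table below is randomly generated for illustration only (the alphabet sizes and seed are arbitrary assumptions); the sketch also assumes every entry is strictly positive, so that no conditional distribution is undefined.

```python
import numpy as np

# Hypothetical joint table p_{X1,X2,X3}; axes are ordered (x1, x2, x3).
rng = np.random.default_rng(0)
p = rng.random((2, 3, 4))
p /= p.sum()

# Forward ordering: introduce X1, then X2, then X3.
p_1 = p.sum(axis=(1, 2))               # p_{X1}
p_12 = p.sum(axis=2)                   # p_{X1,X2}
p_2_given_1 = p_12 / p_1[:, None]      # p_{X2|X1}
p_3_given_12 = p / p_12[:, :, None]    # p_{X3|X1,X2}
forward = p_1[:, None, None] * p_2_given_1[:, :, None] * p_3_given_12

# Backward ordering: introduce X3, then X2, then X1.
p_3 = p.sum(axis=(0, 1))               # p_{X3}
p_23 = p.sum(axis=0)                   # p_{X2,X3}
p_2_given_3 = p_23 / p_3[None, :]      # p_{X2|X3}
p_1_given_23 = p / p_23[None, :, :]    # p_{X1|X2,X3}
backward = p_3[None, None, :] * p_2_given_3[None, :, :] * p_1_given_23

print(np.allclose(forward, p), np.allclose(backward, p))
```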

## III.2—Bayes' Theorem for Random Variables

In inference, what we want to reason about is some unknown random variable $X$, where we get to observe some other random variable $Y$, and we have some model for how $X$ and $Y$ relate. Specifically, suppose that we have:
- **a “prior" distribution** $p_{X}$ for $X$: this prior distribution encodes what we believe to be likely or unlikely values that $X$ takes on, before we actually have any observations.
- **a “likelihood" distribution** $p_{Y\mid X}$ which encodes what are the likely values of $Y$ given a value of $X$.

After observing that $Y$ takes on a specific value $y$, our “belief" about $X$ given $Y=y$ is now given by what's called the **“posterior" distribution** $p_{X\mid Y}(\cdot \mid y)$.

Put another way, we keep track of a probability distribution that tells us how plausible we think different values $X$ can take on are. When we observe data $Y$ that can help us reason about $X$, we proceed to either upweight or downweight how plausible we think different values $X$ can take on are, making sure that we end up with a probability distribution giving us our updated belief of what $X$ can be.

Thus, once we have observed $Y=y$, our belief of what $X$ is changes from the prior $p_{X}$ to the posterior $p_{X\mid Y}(\cdot \mid y)$.

Bayes' theorem (also called Bayes' rule or Bayes' law) for random variables explicitly tells us how to compute the posterior distribution $p_{X\mid Y}(\cdot \mid y)$, i.e., how to weight each possible value that random variable $X$ can take on, once we've observed $Y=y$. Bayes' theorem is the main workhorse of numerous inference algorithms and will show up many times throughout the course.

**Bayes' theorem**: Suppose that $y$ is a value that random variable $Y$ can take on, and $p_{Y}(y)>0$. Then
$$p_{X\mid Y}(x\mid y)=\frac{p_{X}(x)p_{Y\mid X}(y\mid x)}{\sum _{ x'}p_{X}( x')p_{Y\mid X}(y\mid x')}$$
	 
for all values $x$ that random variable $X$ can take on.

**Important**: Remember that $p_{Y\mid X}(\cdot \mid x)$ could be undefined, but this isn't an issue since this happens precisely when $p_{X}(x)=0$, and we know that $p_{X,Y}(x,y)=0$ (for every $y$) whenever $p_{X}(x)=0$. In that case we simply treat the product $p_{X}(x)p_{Y\mid X}(y\mid x)$ as 0.

**Proof**: We have
$$p_{X\mid Y}(x\mid y)\overset {(a)}{=}\frac{p_{X,Y}(x,y)}{p_{Y}(y)}\overset {(b)}{=}\frac{p_{X}(x)p_{Y\mid X}(y\mid x)}{p_{Y}(y)}\overset {(c)}{=}\frac{p_{X}(x)p_{Y\mid X}(y\mid x)}{\sum _{ x'}p_{X,Y}( x',y)}\overset {(d)}{=}\frac{p_{X}(x)p_{Y\mid X}(y\mid x)}{\sum _{ x'}p_{X}( x')p_{Y\mid X}(y\mid x')},$$
	 
where step (a) uses the definition of conditional probability (this step requires $p_{Y}(y)>0$), step (b) uses the product rule (recall that for notational convenience we're not separately writing out the case when $p_{X}(x)=0$), step (c) uses the formula for marginalization, and step (d) uses the product rule (again, for notational convenience, we're not separately writing out the case when $p_{X}( x')=0$).

## III.3— Bayes' Theorem for Random Variables: A Computational View

Computationally, Bayes' theorem can be thought of as a two-step procedure to update $p_ X(x)$ once we have observed $Y=y$ :

#### Step 1:
For each value $x$ that random variable $X$ can take on, initially we believed that $X=x$ with a score of $p_ X(x)$, which could be thought of as how plausible we thought ahead of time that $X=x$. 

However now that we have observed $Y=y$, we weight the score $p_{X}(x)$ by a factor $p_{Y\mid X}(y\mid x)$, so our new belief for how plausible $X=x$ is, is given by: $\quad \alpha (x\mid y)\triangleq p_{X}(x)p_{Y\mid X}(y\mid x)$.

Here we have defined a new table $\alpha (\cdot \mid y)$ which is not a probability table, since when we put in the weights, the new beliefs are no longer guaranteed to sum to 1 (i.e., $\sum _{x}\alpha (x\mid y)$ might not equal 1)! $\alpha (\cdot \mid y)$ is an unnormalized posterior distribution!

Also, if $p_{X}(x)$ is already 0, then as we already mentioned a few times, $p_{Y\mid X}(y\mid x)$ is undefined, but this case isn't a problem: no weighting is needed since an impossible outcome stays impossible.

#### Step 2 :
We fix the fact that the unnormalized posterior table $\alpha (\cdot \mid y)$ isn't guaranteed to sum to 1 by renormalizing:
$$p_{X\mid Y}(x\mid y)=\frac{\alpha (x\mid y)}{\sum _{ x'}\alpha ( x'\mid y)}=\frac{p_{X}(x)p_{Y\mid X}(y\mid x)}{\sum _{ x'}p_{X}( x')p_{Y\mid X}(y\mid x')}.$$

**An important note**: Sometimes we won't actually care about doing this second renormalization step because we will only be interested in which value of $X$ is more plausible relative to the others.

If we just want to see which value of $x$ yields the highest entry in the unnormalized table $\alpha (\cdot \mid y)$, we could find this value of $x$ without renormalizing! 
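The two steps above can be sketched in Python with the dictionary representation of probability tables. This is only an illustrative sketch: the likelihood numbers and the `'wet'`/`'dry'` observation labels are made up, and `bayes_update` is a hypothetical helper name that does not appear elsewhere in these notes.

```python
def bayes_update(prior, likelihood, y):
    """Return (alpha, posterior) for X after observing Y = y.

    prior:      dict mapping x -> p_X(x)
    likelihood: dict mapping x -> dict mapping y -> p_{Y|X}(y | x)
    """
    # Step 1: weight each prior score by the likelihood of the observation.
    # If p_X(x) = 0 no weighting is needed: an impossible value stays impossible.
    alpha = {x: (p * likelihood[x][y] if p > 0 else 0.0)
             for x, p in prior.items()}

    # Step 2: renormalize so that the table sums to 1 (this needs p_Y(y) > 0).
    total = sum(alpha.values())
    if total == 0:
        raise ValueError("the observation has probability 0 under this model")
    posterior = {x: a / total for x, a in alpha.items()}
    return alpha, posterior


# Made-up model: X is the weather, Y records whether the ground is wet.
prior = {'sunny': 1/2, 'rainy': 1/6, 'snowy': 1/3}
likelihood = {
    'sunny': {'wet': 0.1, 'dry': 0.9},
    'rainy': {'wet': 0.9, 'dry': 0.1},
    'snowy': {'wet': 0.6, 'dry': 0.4},
}

alpha, posterior = bayes_update(prior, likelihood, 'wet')
print(alpha)      # unnormalized table, does not sum to 1
print(posterior)  # normalized table, sums to 1 up to rounding
```

Note that the largest entry of `alpha` and the largest entry of `posterior` sit at the same label, which is why the renormalization can be skipped when only the ranking matters.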

## III.4—Maximum A Posteriori (MAP) Estimation

For a hidden random variable $X$ that we are inferring, and given observation $Y = y$, we have been talking about computing the posterior distribution $p_{X \mid Y}(\cdot | y)$ using Bayes' rule.  The posterior is a distribution for what we are inferring. 

Often, we want to report which particular value of $X$ actually achieves the highest posterior probability, i.e., the most probable value $x$ that $X$ can take on given that we have observed $Y=y$.

The value that $X$ can take on that maximizes the posterior distribution is called the *maximum a posteriori (MAP)* estimate of $X$ given $Y = y$. We denote the MAP estimate by $\widehat{x}_{\text {MAP}}(y)$, where we make it clear that it depends on what the observed $y$ is. Mathematically, we write
$$\widehat{x}_{\text {MAP}}(y) = \arg \max _ x p_{X \mid Y}(x | y).$$
	 

Note that if we didn't include the “arg" before the “max", then we would just be finding the highest posterior probability rather than which value–or “argument"–$x$ actually achieves the highest posterior probability.

In general, there could be ties, i.e., multiple values that $X$ can take on are able to achieve the best possible posterior probability. 
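As a minimal sketch of the difference between "max" and "arg max", and of how ties can be reported, consider the following Python snippet. The table entries are made-up numbers and `map_estimates` is a hypothetical helper name. The sketch works on an unnormalized table, since dividing every entry by the same positive constant does not change which label is largest.

```python
def map_estimates(table):
    """Return every label that achieves the largest score in `table`."""
    best = max(table.values())          # the "max": the highest score itself
    return [x for x, score in table.items() if score == best]  # the "arg max"


# Made-up unnormalized posterior table alpha(. | y).
alpha = {'sunny': 0.05, 'rainy': 0.15, 'snowy': 0.20}
print(max(alpha.values()))     # highest score
print(map_estimates(alpha))    # label(s) achieving it

# A table with a tie: both labels are MAP estimates.
tied = {'heads': 0.5, 'tails': 0.5}
print(map_estimates(tied))
```

The exact equality test is fine for this illustration, but with floating-point scores that come out of long computations a small tolerance may be a safer way to detect ties.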

## III.5—Important remark: Complexity of Computing Bayes' Theorem for Random Variables
It can be very expensive to compute a posterior distribution when we have many quantities we want to infer.

Consider when we have $N$ random variables $X_1, \dots , X_N$ with joint probability distribution $p_{X_1, \dots , X_ N}$, and where we have an observation $Y$ related to $X_1, \dots , X_N$ through the known conditional probability table $p_{Y\mid X_1, \dots , X_ N}$. Treating $X = (X_1, \dots , X_ N)$ as one big random variable, we can apply Bayes' theorem to get
$$\begin{eqnarray}
&& p_{X_1, X_2, \dots, X_N \mid Y}(x_1, x_2, \dots, x_N \mid y) \\
&&
= \frac{p_{X_1, X_2, \dots, X_N}(x_1, x_2, \dots, x_N)
        p_{Y\mid X_1, X_2, \dots, X_N}(y\mid x_1, x_2, \dots, x_N)}
       {\sum_{x_1'}
        \sum_{x_2'}
        \cdots
        \sum_{x_N'}
          p_{X}(x_1',
                x_2',
                \dots,
                x_N')
          p_{Y\mid X_1, X_2, \dots, X_N}(y\mid x_1',
                x_2',
                \dots,
                x_N')}.
\end{eqnarray}$$

Suppose that each $X_ i$ takes on one of $k$ values. Then the denominator sums together $k^N$ terms!

The number of terms being summed grows exponentially in $N$, the number of variables we are inferring. Without any sort of additional structure in the distribution, it turns out that we cannot hope to escape this exponential cost in computing the posterior distribution.

This is a disaster! In many problems we care about, $N$ will be very, very large! For example, if $X_1, \dots , X_ N$ represents values that different pixels in an image take, then nowadays images taken for example on a mobile phone often have easily well over 10 million pixels. So $N$ could be 10 million, and even if each $X_ i$ took on $k=2$ values, the number of terms we would have to sum over in the denominator is already greater than the number of atoms in the known, observable universe (which is estimated to be somewhere between $10^{78}$ and $10^{82}$).
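To get a feel for these magnitudes, here is a small Python sketch. The helper name `log10_num_terms` is hypothetical, and the figure $10^{82}$ is the upper estimate for the number of atoms quoted above.

```python
import math


def log10_num_terms(k, n):
    """Base-10 logarithm of k**n, the number of terms in the denominator."""
    return n * math.log10(k)


# With N = 20 binary variables the sum is still manageable.
print(2 ** 20)

# With N = 10 million binary variables, k**N has millions of decimal digits.
print(log10_num_terms(2, 10_000_000))

# Smallest N for which 2**N exceeds 10**82.
n_min = math.ceil(82 / math.log10(2))
print(n_min, 2 ** n_min > 10 ** 82)
```

So even a few hundred binary variables are already enough to exceed the atom count, long before we reach image-sized problems.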

Structure in distributions will help us escape from this exponential cost in $N$.

# IV— Introduction to Independence

With a fair coin, let's say that we just tossed it five times and tails turned up all five times. Is it more likely now that we'll see heads?

The answer is no, because the outcomes of the previous tosses don't tell us anything about the outcome of a new toss. This concept is referred to as “independence".

We have actually already encountered independent events, for example when we talked about two coin flips or two dice rolls. Now we develop theory for independent events and then, very soon, independent random variables. We build up to a concept called conditional independence, where two random variables become independent only after we condition on the value of a third random variable.

Not only is independence an important phenomenon to understand and to help us reason about many scenarios, it will also play a pivotal role in how we can represent very large probabilistic models with very little space on a computer.

## IV.1— Independent Events and Random Variables
The outcome of a coin flip or a die roll isn't going to tell you anything about the outcome of a new coin toss or die roll unless you have some very peculiar coins or dice.

We can formalize this by saying that two events $\mathcal{A}$ and $\mathcal{B}$ are independent, which we'll denote as $\mathcal{A} \perp \!\!\! \perp \mathcal{B}$, if the probability of both $\mathcal{A}$ and $\mathcal{B}$ occurring is the same as the probability of $\mathcal{A}$ times the probability of $\mathcal{B}$:
$$\mathbb{P}(\mathcal{A}\cap\mathcal{B})=\mathbb{P}(\mathcal{A})\mathbb{P}(\mathcal{B}).$$

If $\mathbb{P}(\mathcal{A})>0$, we can use the product rule to write:
$$ \mathbb{P}(\mathcal{B}\mid \mathcal{A})\mathbb{P}(\mathcal{A})=\mathbb{P}(\mathcal{A})\mathbb{P}(\mathcal{B})$$
and thus 
$$ \mathbb{P}(\mathcal{B}\mid \mathcal{A})=\mathbb{P}(\mathcal{B})$$

In other words, knowing $\mathcal{A}$ doesn't tell us anything new about event $\mathcal{B}$. The same can be written by exchanging $\mathcal{A}$ and $\mathcal{B}$ of course.

In the case of random variables, we say that $X$ and $Y$ are independent ($X \perp \!\!\! \perp Y$) if and only if, for all $x$ and $y$,
$$ p_{X, Y}(x, y)=p_X(x)p_Y(y).$$
If $p_Y(y)>0$, applying the product rule then gives the following result:
$$ p_{X\mid Y}(x\mid y)=p_X(x)$$

## IV.2— Mutual vs Pairwise Independence
How do we extend the independence story to more than two variables?

#### Strongest way: "mutual independence"
$X$, $Y$, and $Z$ are *mutually independent* if we can write the joint distribution for all three of them as simply the product of the three individual distributions:
$$ p_{X, Y, Z}(x, y, z)=p_X(x)p_Y(y)p_Z(z)$$
This is the strongest independence statement we can make. It says they're completely independent, they're all separate.

#### Less strong way: "pairwise independence"
*Pairwise independence* means that for any two of the variables, for instance $X$ and $Y$, we can write $p_{X,Y}(x, y)= p_X(x)p_Y(y)$.
If I know any one of them, it doesn't tell me anything about any one of the others.
This is not the same as mutual independence. It's not as strong.

**Example of why**: $X, Y$ are independent fair coin flips with values 0, 1. If we define $Z=X\oplus Y$ (the exclusive-or of $X$ and $Y$), we can show that $X \perp \!\!\! \perp Y$, $Z \perp \!\!\! \perp X$, $Z \perp \!\!\! \perp Y$.
These three random variables are pairwise independent but they are not mutually independent! If I know any two of them, for instance $X$ and $Y$, then I know exactly what the third one is going to be.
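The exclusive-or example can be checked numerically. The following Python sketch builds the joint table of $(X, Y, Z)$ and tests both kinds of independence with exact fractions. The helper names `marginal` and `pair_is_independent` are made up for this illustration.

```python
from fractions import Fraction
from itertools import combinations, product

bits = (0, 1)

# Joint table of (X, Y, Z): X and Y are fair independent bits, Z = X xor Y.
joint = {(x, y, x ^ y): Fraction(1, 4) for x in bits for y in bits}


def marginal(keep):
    """Sum out every coordinate except the positions listed in `keep`."""
    table = {}
    for outcome, p in joint.items():
        key = tuple(outcome[i] for i in keep)
        table[key] = table.get(key, 0) + p
    return table


singles = [marginal((i,)) for i in range(3)]


def pair_is_independent(i, j):
    """Check p(a, b) == p(a) p(b) for the variables at positions i and j."""
    pair = marginal((i, j))
    return all(pair.get((a, b), 0) == singles[i][(a,)] * singles[j][(b,)]
               for a, b in product(bits, repeat=2))


pairwise = all(pair_is_independent(i, j)
               for i, j in combinations(range(3), 2))
mutual = all(
    joint.get((x, y, z), 0)
    == singles[0][(x,)] * singles[1][(y,)] * singles[2][(z,)]
    for x, y, z in product(bits, repeat=3))

print(pairwise, mutual)  # True False
```

Mutual independence fails because, for example, the outcome $(0, 0, 0)$ has probability $1/4$ in the joint table, whereas the product of the three marginals is $1/8$.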

**Mutual vs Pairwise Independence Terminology Remark**: Throughout this course, if we say that many random variables are independent (without saying which specific kind of independence), then we mean mutual independence, which we often also call **marginal independence**.

## IV.3— Conditional Independence
Two random variables $X$ and $Y$ are conditionally independent given a third random variable $Z$ if we can write the conditional distribution for both of them as the product of the individual conditionals:
$$ p_{X, Y\mid Z}(x, y\mid z) = p_{X\mid Z}(x\mid z)p_{Y\mid Z}(y\mid z)$$

Intuitively, this means that once we know Z, knowing something about Y doesn't tell you anything about X. And vice versa.

**Marginal independence and conditional independence are not the same thing**: if you have marginal independence, that does not necessarily imply conditional independence, and the reverse implication does not hold either. *Sometimes, we can have marginal independence without conditional independence and sometimes we can have conditional independence without marginal independence.*

**Counter-example**: Suppose we have three random variables $R$, $S$, and $T$.
1. Suppose that $p_{R, S, T}(r, s, t) = p_R(r)p_{S\mid R}(s\mid r)p_{T\mid R}(t\mid r)$, i.e., $S$ and $T$ both depend only on $R$.
    - $S \perp \!\!\! \perp T$ ? **No**, not in general!
    $$p_{S, T}(s, t)=\sum_r p_R(r)p_{S\mid R}(s\mid r)p_{T\mid R}(t\mid r) \neq p_S(s)p_T(t) \quad \text{(in general)}$$
    - $S \perp \!\!\! \perp T\mid R$ ? **Yes** !
    $$p_{S, T\mid R}(s, t\mid r)=\frac{p_{R, S, T}(r, s, t)}{p_{R}(r)}=p_{S\mid R}(s\mid r)p_{T\mid R}(t\mid r)$$
2. Now suppose that $p_{R, S, T}(r, s, t) = p_S(s)p_{T}(t)p_{R\mid S, T}(r\mid s, t)$ i.e. $R$ depends on both $S$ and $T$  
    - $S \perp \!\!\! \perp T$ ? **Yes** !
    $$p_{S, T}(s, t)=\sum_r p_S(s)p_{T}(t)p_{R\mid S, T}(r\mid s, t) = p_S(s)p_T(t)$$
    - $S \perp \!\!\! \perp T\mid R$ ? **No**, not in general!
    $$p_{S, T\mid R}(s, t\mid r)=\frac{p_{R, S, T}(r, s, t)}{p_{R}(r)}=p_S(s)p_{T}(t)\frac{p_{R\mid S, T}(r\mid s, t)}{p_{R}(r)} \neq p_{S\mid R}(s\mid r)p_{T\mid R}(t\mid r) \quad \text{(in general)}$$
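A concrete instance of the first case can be checked numerically. In the Python sketch below, $R$ is a fair bit and $S$ and $T$ are each a "noisy copy" of $R$; the 9/10 copy probability and the helper names are made up for illustration. The conditional independence test is multiplied through by $p_R(r)^2$ so that no division is needed.

```python
from fractions import Fraction as F
from itertools import product

bits = (0, 1)
p_r = {0: F(1, 2), 1: F(1, 2)}


def noisy_copy(value, r):
    """Made-up conditional table: value equals r with probability 9/10."""
    return F(9, 10) if value == r else F(1, 10)


# Joint table built from p_R(r) p_{S|R}(s|r) p_{T|R}(t|r).
joint = {(r, s, t): p_r[r] * noisy_copy(s, r) * noisy_copy(t, r)
         for r, s, t in product(bits, repeat=3)}


def marginal(keep):
    """Marginalize the joint table down to the positions listed in `keep`."""
    table = {}
    for outcome, p in joint.items():
        key = tuple(outcome[i] for i in keep)
        table[key] = table.get(key, 0) + p
    return table


p_s, p_t, p_st = marginal((1,)), marginal((2,)), marginal((1, 2))
p_rs, p_rt = marginal((0, 1)), marginal((0, 2))

# Marginal independence: is p_{S,T}(s,t) == p_S(s) p_T(t) for all s, t?
marginally_independent = all(
    p_st[(s, t)] == p_s[(s,)] * p_t[(t,)]
    for s, t in product(bits, repeat=2))

# Conditional independence given R, written without division:
# p_{R,S,T}(r,s,t) p_R(r) == p_{R,S}(r,s) p_{R,T}(r,t) for all r, s, t?
conditionally_independent = all(
    joint[(r, s, t)] * p_r[r] == p_rs[(r, s)] * p_rt[(r, t)]
    for r, s, t in product(bits, repeat=3))

print(marginally_independent, conditionally_independent)  # False True
```

Here $p_{S,T}(0,0) = 41/100$ while $p_S(0)p_T(0) = 1/4$, so $S$ and $T$ are dependent until we condition on $R$.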

## IV.4— Markov Chain: Conditional Independence for many variables

Suppose $X_0, \dots , X_{100}$ are random variables whose joint distribution has the following factorization:
$$p_{X_0, \dots , X_{100}}(x_0, \dots , x_{100}) = p_{X_0}(x_0) \cdot \prod _{i=1}^{100} p_{X_ i | X_{i-1}}(x_ i | x_{i-1})$$

This factorization is what's called a Markov chain. We'll be seeing Markov chains a lot more later on in the course.

As an exercise, we want to show that $X_{50} \perp X_{52} | X_{51}$.

We notice that we can marginalize out $x_{100}$ as such:
$$p_{X_0, \dots , X_{99}}(x_0, \dots , x_{99}) = p_{X_0}(x_0) \cdot \prod _{i=1}^{99} p_{X_ i | X_{i-1}}(x_ i | x_{i-1}) \cdot \underbrace{\sum _{x_{100}} p_{X_{100}|X_{99}}(x_{100}|x_{99})}_{= 1}$$
	 

Now we can repeat the same marginalization procedure to get (2.5):
$$p_{X_0, \dots , X_{50}}(x_0, \dots , x_{50}) = p_{X_0}(x_0) \cdot \prod _{i=1}^{50} p_{X_ i | X_{i-1}}(x_ i | x_{i-1}) $$

In essence, we have shown that the given joint distribution factorization applies not just to the last random variable ($X_{100}$), but also up to any point in the chain.

For brevity, we will now use $p(x_{i}^{j})$ as a shorthand for $p_{X_ i, \dots , X_{j}}(x_ i, \dots , x_{j})$. We want to exploit what we have shown to rewrite $p(x_{50}^{52})$
$$
\begin{eqnarray}
		p(x_{50}^{52})
        &=& \sum_{x_{0} \dots x_{49}} \sum_{x_{53} \dots x_{100}} p(x_{0}^{100}) \\
		&=& \sum_{x_{0} \dots x_{49}} \sum_{x_{53} \dots x_{100}} \left[p(x_{0}) \prod_{i=1}^{50} p(x_{i}|x_{i-1})\right] \cdot p(x_{51}|x_{50}) \cdot p(x_{52}|x_{51}) \cdot \prod_{i=53}^{100} p(x_{i}|x_{i-1}) \\
		&=& \sum_{x_{0} \dots x_{49}} \sum_{x_{53} \dots x_{100}} p(x_{0}^{50}) \cdot p(x_{51}|x_{50}) \cdot p(x_{52}|x_{51}) \cdot \prod_{i=53}^{100} p(x_{i}|x_{i-1}) \\
		&=& p(x_{51}|x_{50}) \cdot p(x_{52}|x_{51}) \cdot \sum_{x_{0} \dots x_{49}} p(x_{0}^{50}) \underbrace{\sum_{x_{53} \dots x_{100}} \prod_{i=53}^{100} p(x_{i}|x_{i-1})}_{=1} \\
		&=& p(x_{51}|x_{50}) \cdot p(x_{52}|x_{51}) \cdot \sum_{x_{0} \dots x_{49}} p(x_{0}^{50}) \\
		&=& p(x_{50}) \cdot p(x_{51}|x_{50}) \cdot p(x_{52}|x_{51})
    \end{eqnarray}$$

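where we used (2.5) for the 3rd equality and the same marginalization trick for the 5th equality. We have just shown the Markov chain property for $X_{50}, X_{51}, X_{52}$, from which the conditional independence follows. Since $p(x_{50}) \cdot p(x_{51}|x_{50}) = p(x_{50}, x_{51}) = p(x_{51}) \cdot p(x_{50}|x_{51})$, dividing by $p(x_{51})$ (assumed positive) gives
$$p(x_{50}, x_{52} | x_{51}) = \frac{p(x_{50}^{52})}{p(x_{51})} = p(x_{50}|x_{51}) \cdot p(x_{52}|x_{51}),$$
which is exactly the statement $X_{50} \perp X_{52} | X_{51}$.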

## IV.5— Bernoulli and Binomial Random Variables

We introduce two of the most common random variables that people use in probabilistic models:
- the Bernoulli random variable (a biased coin flip), and 
- the Binomial random variable (counting the number of heads for $n$ biased coin flips).

These two distributions appear all the time in many, many application domains that use inference! We introduce them now to equip you with some vocabulary and also to let you see our first example of a random variable whose probability table can be described by only a few numbers even if the number of entries in the table can be much larger!

### IV.5.1— Bernoulli random variable
A Bernoulli random variable is like a biased coin flip where the probability of heads is $p$. A Bernoulli random variable is 1 with probability $p$, and 0 with probability $1-p$.

If a random variable $X$ has this particular distribution, then we write $X \sim \text {Bernoulli}(p)$, where “$\sim$" can be read as “is distributed as" or “has distribution".

Some people like to abbreviate $\text {Bernoulli}(p)$ by writing $\text {Bern}(p)$, $\text {Ber}(p)$, or even just $B(p)$.

### IV.5.2— Binomial random variable
A Binomial random variable can be thought of as counting the number of heads in $n$ independent coin flips, each with probability $p$ of heads.

For a random variable $S$ that has this Binomial distribution with parameters $n$ and $p$, we denote it as $S \sim \text {Binomial}(n, p)$, read as “S is distributed as Binomial with parameters $n$ and $p$".

Some people might also abbreviate and instead of writing $\text {Binomial}(n, p)$, they write $\text {Binom}(n, p)$ or $\text {Bin}(n, p)$.

We can see that if $Y \sim \text {Binomial}(1, p)$, then $Y$ is actually a Bernoulli random variable. When there's only a single flip, counting the number of heads yields precisely the Bernoulli distribution. In particular, we have $Y \sim \text {Bernoulli}(p)$.

### IV.5.3— Probability table of a Binomial random variable

For $n > 1$, the number of ways to see $s$ heads in $n$ tosses is ${n \choose s}$. In general, for random variable $S \sim \text {Binomial}(n, p)$, the probability that $S=s$ for $s\in \{ 0, 1, \dots , n\}$ is given by
$$
p_ S(s) = {n \choose s} p^ s (1-p)^{n-s}.
$$

Let's take an example by considering the random variable $S \sim \text {Binomial}(10, 0.6)$.

A value of $S = 4$ refers to the event that we see exactly 4 heads. (Note that HTHTTTTTHH and HHHHTTTTTT are different outcomes of the underlying experiment of coin flipping). The number of ways to see 4 heads in 10 tosses is precisely the number of ways to choose 4 items out of 10, given by the choose operator: ${10 \choose 4} = \frac{10!}{4! 6!} = 210$. Each way is equally likely with probability given by $0.6^4 \times 0.4^6$, so summing up the probabilities across all these ways, we have $210 \times 0.6^4 \times 0.4^6$.

It might not be a priori obvious why the probability table of the Binomial variable should sum to 1.

To show this result, we can use the Binomial Theorem, which says that for any two numbers $x$ and $y$, and any nonnegative integer $n$,
$$
(x + y)^ n = \sum _{k=0}^ n {n \choose k} x^ k y^{n-k}.
$$

Plugging in $x = p$ and $y = 1 - p$, we see that the right-hand side directly corresponds to summing across all the entries of probability table $p_ S$, and the left-hand side is $(p + (1-p))^ n = 1^ n = 1$. 
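The Binomial probability table is easy to build in Python from the formula above. The sketch below uses `math.comb` (available from Python 3.8) for the choose operator; `binomial_pmf` is a hypothetical helper name.

```python
from math import comb


def binomial_pmf(n, p):
    """Probability table of S ~ Binomial(n, p) as a dict s -> p_S(s)."""
    return {s: comb(n, s) * p**s * (1 - p)**(n - s) for s in range(n + 1)}


table = binomial_pmf(10, 0.6)

print(comb(10, 4))          # number of ways to see 4 heads in 10 tosses: 210
print(table[4])             # 210 * 0.6**4 * 0.4**6, roughly 0.1115
print(sum(table.values()))  # 1 up to floating-point rounding
```

The table has $n+1$ entries, yet it is fully described by just the two numbers $n$ and $p$.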

# V—Notation Summary (Up Through Week 3)

Typically we use a capital letter like $X$ to denote a random variable, a script (or calligraphic) letter $\mathcal{X}$ to denote a set (or an event), and a lowercase letter like $x$ to refer to a nonrandom variable. Occasionally we will also use capital letters to refer to a constant that is not varying throughout the problem (in contrast to using a lowercase letter like x that can be a “dummy" variable such as in a summation $\sum _{x}p_{X}(x)$, for which lowercase $x$ refers to a specific constant value but we are varying what $x$ is and it is effectively a temporary variable that we do not need after computing the summation).

| Symbol        | Interpretation|
| -------------:|:--------------|
| $p_{X}$ or $p_{X}(\cdot )$ | probability table/probability mass function (PMF)/probability distribution/marginal distribution of random variable $X$|
| $p_{X}(x)$ or $\mathbb {P}(X=x)$ | probability that random variable $X$ takes on value $x$|
| $p_{X,Y}$ or $p_{X,Y}(\cdot ,\cdot )$ | joint probability table/joint PMF/joint probability distribution of random variables $X$ and $Y$|
| $p_{X,Y}(x,y)$ or $\mathbb {P}(X=x,Y=y)$ | probability that $X$ takes on value $x$ and $Y$ takes on value $y$|
| $p_{X\mid Y}(\cdot \mid y)$ | conditional probability table/conditional PMF/conditional probability distribution of $X$ given $Y$ takes on value $y$|
| $p_{X\mid Y}(x\mid y)$ or $\mathbb {P}(X=x\mid Y=y)$ | probability that $X$ takes on value $x$ given that $Y$ takes on value $y$|
| $X\sim p$ or $X\sim p(\cdot )$ | $X$ is distributed according to distribution $p$|
| $X\perp Y$ | $X$ and $Y$ are independent|
| $X\perp Y\mid Z$ | $X$ and $Y$ are independent given $Z$|

We will also of course be dealing with many events or many random variables. For example, $\mathbb {P}(\mathcal{A},\mathcal{B},\mathcal{C}\mid \mathcal{D},\mathcal{E})$
would be the probability that events $\mathcal{A}$, $\mathcal{B}$, and $\mathcal{C}$ all occur, given that both events $\mathcal{D}$ and $\mathcal{E}$
occur, which by the definition of conditional probability would be
$$\mathbb {P}(\mathcal{A},\mathcal{B},\mathcal{C}\mid \mathcal{D},\mathcal{E})=\frac{\mathbb {P}(\mathcal{A},\mathcal{B},\mathcal{C},\mathcal{D},\mathcal{E})}{\mathbb {P}(\mathcal{D},\mathcal{E})}.$$
	 

Similarly, $p_{X,Y,Z\mid V,W}$ would refer to a joint conditional distribution of random variables $X$, $Y$, and $Z$ given both $V$ and $W$ taking on specific values together:
$$p_{X,Y,Z\mid V,W}(x,y,z\mid v,w)=\frac{p_{X,Y,Z,V,W}(x,y,z,v,w)}{p_{V,W}(v,w)}.$$
	 

When we have a collection of random variables, e.g., $W,X,Y,Z$, if we say that they are independent (without specifying what type of independence), then what we mean is mutual independence, which means that the joint distribution factorizes into the marginal distributions:
$$p_{W,X,Y,Z}(w,x,y,z)=p_{W}(w)p_{X}(x)p_{Y}(y)p_{Z}(z)\qquad \text {for all }w,x,y,z.$$
	 

# VI— Introduction to Decision Making and Expectations (week 4)
We now know the basics of working with probabilities. But how do we incorporate probabilities into making decisions?

The main tool we now introduce is what's called the *expected value* of a random variable. In making decisions that account for randomness, it often makes sense to account for an “average" scenario that we should expect. Expectation is about taking an average, accounting for how likely different outcomes are.

## VI.1—Expected Value of a Random Variable

Consider, for example, the mean of three values: 3, 5, and 10. It can be computed as follows:
$$\frac{3+5+10}{3}=3\cdot \frac{1}{3}+5\cdot \frac{1}{3}+10\cdot \frac{1}{3}=6.$$
Notice, on the right-hand side, that we are adding 3, 5, and 10 each weighted by $\frac{1}{3}$. Concretely, consider a random variable $X$ given by the probability table $p_X(x) = 1/3$ for all $x\in\{3, 5, 10\}$.
Then the “expected value" of $X$ is given by
$$3\cdot p_{X}(3)+5\cdot p_{X}(5)+10\cdot p_{X}(10)=3\cdot \frac{1}{3}+5\cdot \frac{1}{3}+10\cdot \frac{1}{3}=\frac{18}{3}=6.$$
But what if, for instance, we think that 3 is actually much more plausible than 5 or 10? Then what we could do is have the weight on 3 be higher than $\frac{1}{3}$, for instance $p_X(3)=2/3$ and $p_X(x) = 1/6$ for all $x\in\{5, 10\}$.
Then the expected value of $X$ is given by
$$3\cdot p_{X}(3)+5\cdot p_{X}(5)+10\cdot p_{X}(10)=3\cdot \frac{2}{3}+5\cdot \frac{1}{6}+10\cdot \frac{1}{6}=\frac{9}{2}.$$

Using probability, we now formalize the concept of expected value of a random variable. As you can see, all we are doing is taking the sum of the labels in the probability table, where we weight each label by the probability of the label. *Importantly, the **labels are numbers** so that it's clear what adding them means!*

Now, for the formal definition:

### VI.1.1 Definition of expected value

Consider a real-valued random variable $X$ that takes on values in a set $\mathcal{X}$. Then the expected value of $X$, denoted as $\mathbb {E}[X]$, is
$$\mathbb {E}[X]\triangleq \sum _{x\in \mathcal{X}}x\cdot p_{X}(x).$$

Having the random variable be real-valued makes it so that we can add up the labels with weights!

Also, note that whereas $X$ can be represented as a probability table, its expectation $\mathbb {E}[X]$ is **just a single number**. The expected value is the sum of the values in the set $\mathcal{X}$, weighted by the probabilities of each of the values. The ordinary mean is simply the expected value when each of the values in the set $\mathcal{X}$ has the same (uniform) probability.
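With the dictionary representation of a probability table, the definition translates almost word for word into Python. The sketch below uses exact fractions so the results match the two hand computations on the values 3, 5, and 10; `expectation` is a hypothetical helper name.

```python
from fractions import Fraction


def expectation(table):
    """E[X] for a probability table mapping numeric labels x -> p_X(x)."""
    return sum(x * p for x, p in table.items())


uniform = {3: Fraction(1, 3), 5: Fraction(1, 3), 10: Fraction(1, 3)}
skewed = {3: Fraction(2, 3), 5: Fraction(1, 6), 10: Fraction(1, 6)}

print(expectation(uniform))  # 6
print(expectation(skewed))   # 9/2
```

The input is a whole table, while the output is a single number.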

Notice that how we came up with the expectation of a random variable $X$ just relied on the probability table for $X$.

In fact, for any other probability table whose labels are numbers, we can compute an expectation in exactly the same way! Two important examples are below.

###  VI.1.2 Conditional Expectation
As a first example, suppose we have two random variables $X$ and $Y$ where we know (or we have already computed) $p_{X\mid Y}(\cdot \mid y)$ for some fixed value $y$, and $X$ is real-valued. Then we can readily compute the expectation for this probability table by multiplying each value $x$ in the alphabet of random variable $X$ by $p_{X\mid Y}(x\mid y)$ and summing these up to get a weighted average. This yields what is called the conditional expectation of $X$ given $Y=y$, denoted as
$$\mathbb {E}[X\mid Y=y]=\sum _{x\in \mathcal{X}}x\cdot p_{X\mid Y}(x\mid y).$$
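As a small sketch of this computation, the Python snippet below starts from a joint table, conditions on $Y=y$, and then takes the weighted average. The joint table entries are made-up numbers and `conditional_expectation` is a hypothetical helper name.

```python
from fractions import Fraction as F

# Made-up joint table p_{X,Y}: keys are (x, y) pairs, x is numeric.
joint = {
    (1, 'a'): F(1, 8), (2, 'a'): F(3, 8),
    (1, 'b'): F(1, 4), (2, 'b'): F(1, 4),
}


def conditional_expectation(joint, y):
    """E[X | Y = y] computed from a joint table of (x, y) pairs."""
    p_y = sum(p for (_, y2), p in joint.items() if y2 == y)
    if p_y == 0:
        raise ValueError("cannot condition on an event of probability 0")
    # Each term is x * p_{X|Y}(x | y), with p_{X|Y}(x | y) = p_{X,Y}(x, y) / p_Y(y).
    return sum(x * p / p_y for (x, y2), p in joint.items() if y2 == y)


print(conditional_expectation(joint, 'a'))  # 7/4
print(conditional_expectation(joint, 'b'))  # 3/2
```

Each value of $y$ gives its own number, so the conditional expectation depends on which observation we condition on.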
	 
###  VI.1.3 Expectation of the Function of a Random Variable
As another example, suppose we have a (possibly not real-valued) random variable $X$ with probability table $p_{X}$, and we have a function $f$ such that $f(x)$ is real-valued for all $x$ in the alphabet $\mathcal{X}$ of $X$. Then $f(X)$ has a probability table where the labels are all numbers, and so we can compute $\mathbb {E}[f(X)]$.

Let's work out the math here. First, let's determine the probability table for $f(X)$. To make the notation here easier to parse, let random variable $Z=f(X)$. Note that $Z$ has alphabet $\mathcal{Z}=\{ f(x)\; :\; x\in \mathcal{X}\}$. Then the probability table for $f(X)$ can be written as $p_{Z}$. In terms of the probability table, to compute $p_{Z}(z)$, we first look at every label in table $p_{X}$ that gets mapped to $z$, i.e., the set $\{ x\in \mathcal{X}\: :\; f(x)=z\}$. Then we sum up the probabilities of these labels to get the probability that $Z=z$, i.e., 
$$p_{Z}(z)=\sum _{x\in \mathcal{X}\text { such that }f(x)=z}p_{X}(x).$$

That's quite cumbersome notation, so let's introduce a new piece of notation called an indicator function $\mathbf{1}\{ \cdot \}$ that takes as input a statement $\mathcal{S}$ and outputs:
$$
\begin{eqnarray}
\mathbf{1}\{\mathcal{S}\}=\begin{cases}
1 & \text{if }\mathcal{S}\text{ happens},\\
0 & \text{otherwise}.
\end{cases}
\end{eqnarray}
$$

Then the probability that $Z=z$ can be written
$$ 
\begin{aligned}
p_{Z}(z) & = \sum _{x\in \mathcal{X}\text { such that }f(x)=z}p_{X}(x)\\
& = \sum _{x\in \mathcal{X}}\mathbf{1}\{ f(x)=z\} p_{X}(x).
\end{aligned}
$$
	 	 

Next, we compute the expectation of $Z=f(X)$:
$$ 
\begin{aligned}
\mathbb {E}[Z]& = \sum _{z\in \mathcal{Z}}zp_{Z}(z)\\
& = \sum _{z\in \mathcal{Z}}z\bigg[\sum _{x\in \mathcal{X}}\mathbf{1}\{ f(x)=z\} p_{X}(x)\bigg]\\
& = \sum _{x\in \mathcal{X}}\underbrace{\sum _{z\in \mathcal{Z}}z\mathbf{1}\{ f(x)=z\} p_{X}(x)}_{\text {there is only 1 nonzero term here: when }z=f(x)}\\
& = \sum _{x\in \mathcal{X}}f(x)p_{X}(x).
\end{aligned}
$$

Hence, since $Z=f(X)$, we can write
$$\mathbb {E}[f(X)]=\sum _{x\in \mathcal{X}}f(x)p_{X}(x).$$
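The derivation says that two routes give the same number: building the table of $Z=f(X)$ first, or summing $f(x)p_X(x)$ directly. The Python sketch below checks this on a made-up table with $f(x)=x^2$, chosen so that several labels of $X$ map to the same label of $Z$.

```python
from fractions import Fraction as F

# Made-up probability table for X.
p_x = {-2: F(1, 4), -1: F(1, 4), 1: F(1, 4), 2: F(1, 4)}


def f(x):
    return x * x


# Route 1: build the table of Z = f(X) by summing the probabilities of all
# labels x that map to the same z, then take the expectation of Z.
p_z = {}
for x, p in p_x.items():
    p_z[f(x)] = p_z.get(f(x), 0) + p
via_table = sum(z * p for z, p in p_z.items())

# Route 2: sum f(x) p_X(x) directly, without ever building p_Z.
direct = sum(f(x) * p for x, p in p_x.items())

print(p_z)
print(via_table, direct)  # 5/2 5/2
```

The second route is usually more convenient because it avoids constructing the table for $Z$ altogether.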

###  VI.1.4 Expectations of Multiple Random Variables
#### Linearity of expectations

Let's look at the case of two random variables $X$ and $Y$ with alphabets $\mathcal{X}$ and $\mathcal{Y}$ respectively. Expectation for multiple random variables is defined as follows: For any function $f:\mathcal{X}\times \mathcal{Y}\rightarrow \mathbb {R}$,
$$\mathbb {E}[f(X, Y)] = \sum _{x \in \mathcal{X}, y\in \mathcal{Y}} f(x, y) p_{X,Y}(x, y).$$
(Note that “$\sum _{x \in \mathcal{X}, y\in \mathcal{Y}}$" can also be written as “$\sum _{x \in \mathcal{X}}\sum _{y \in \mathcal{Y}}$".)

For example:
$$\mathbb {E}[X + Y] = \sum _{x \in \mathcal{X}, y\in \mathcal{Y}} (x + y) p_{X,Y}(x, y).$$
	 

Let's look at some properties of the expected value of the sum of multiple random variables.

First, we can easily show that:
$$\mathbb {E}[X + Y] = \mathbb {E}[X] + \mathbb {E}[Y].$$

This equality is called *linearity of expectation* and it holds **regardless of whether X and Y are independent.**

#### Expectation and variance of independent variables
If $X$ and $Y$ are independent then
$$\mathbb {E}[XY] = \mathbb {E}[X]\mathbb {E}[Y]$$
This is easily shown using $p_{X, Y}(x, y)=p_X(x)p_Y(y)$.

We can also show that the variances are then additive (by using linearity of expectation and the above result on the expectation of a product):
$$\text {var}(X + Y) = \text {var}(X) + \text {var}(Y)$$
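These three properties can be compared on two made-up joint tables: one where $X$ and $Y$ are independent fair flips, and one where $Y$ is simply a copy of $X$. The Python sketch below uses hypothetical helper names, and computes the variance as $\mathbb{E}[(W - \mathbb{E}[W])^2]$ for $W = f(X, Y)$.

```python
from fractions import Fraction as F


def expect(joint, f):
    """E[f(X, Y)] for a joint table mapping (x, y) -> probability."""
    return sum(f(x, y) * p for (x, y), p in joint.items())


def variance(joint, f):
    """var(f(X, Y)) = E[(f(X, Y) - E[f(X, Y)])^2]."""
    mean = expect(joint, f)
    return expect(joint, lambda x, y: (f(x, y) - mean) ** 2)


def first(x, y):
    return x


def second(x, y):
    return y


def total(x, y):
    return x + y


def times(x, y):
    return x * y


tables = {
    "independent": {(x, y): F(1, 4) for x in (0, 1) for y in (0, 1)},
    "dependent": {(0, 0): F(1, 2), (1, 1): F(1, 2)},  # Y is a copy of X
}

results = {}
for name, joint in tables.items():
    linear = expect(joint, total) == expect(joint, first) + expect(joint, second)
    product_rule = (expect(joint, times)
                    == expect(joint, first) * expect(joint, second))
    additive = (variance(joint, total)
                == variance(joint, first) + variance(joint, second))
    results[name] = (linear, product_rule, additive)
    print(name, linear, product_rule, additive)
```

Linearity holds for both tables, whereas the product rule and the additivity of variances hold for the independent table and fail for the dependent one.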

## VI.2—Variance and Standard Deviation

Variance is an important concept which measures how much a random variable deviates from its expectation. This can be thought of as a measure of uncertainty. Higher variance means more uncertainty.

The variance of a real-valued random variable $X$ is defined as
$$\text {var}(X) \triangleq \mathbb {E}[ (X - \mathbb {E}[X])^2 ].$$

Note that as we saw previously, $\mathbb {E}[X]$ is just a single number. To compute the variance of $X$, you could first compute the expectation of $X$, and then compute the expectation of the squared deviation $(X - \mathbb {E}[X])^2$.

What units is variance in? Variance is the expectation of a squared deviation of $X$, so it is in squared units of $X$.
Sometimes, people prefer keeping the units the same as the original units (i.e., without squaring), which you can get by computing what's called the standard deviation of a real-valued random variable $X$:
$$\text {std}(X) \triangleq \sqrt {\text {var}(X)}.$$

## VI.3— Some properties of Binomial and Bernoulli random variables
We previously saw that a binomial random variable with parameters $n$ and $p$ can be thought of as how many heads there are in $n$ tosses where the probability of heads is $p$.

A different way to view a binomial random variable is that it is the sum of $n$ i.i.d. Bernoulli random variables each of parameter $p$. As a reminder, a Bernoulli random variable is 1 with probability $p$ and 0 otherwise. Suppose that $X_1, X_2, \dots , X_ n$ are i.i.d. $\text {Bernoulli}(p)$ random variables, and $S = \sum _{i=1}^ n X_ i$.

#### Expectations 
We have $\mathbb {E}[X_ i] = 1 \cdot p + 0 \cdot (1 - p) = \boxed {p}.$

We also have $\mathbb {E}[S] = \mathbb {E}\Big[\sum _{i=1}^ n X_ i\Big] = \sum _{i=1}^ n \mathbb {E}[X_ i] = \sum _{i=1}^ n p = \boxed {np}$ by linearity of expectation.

#### Variance
What is the variance of a $\text {Ber}(p)$ random variable?
For $X_ i \sim \text {Ber}(p)$, we can show that
$$\mathbb {E}[(X_ i-\underbrace{\mathbb {E}[X_ i]}_{p})^{2}] =  \boxed {p(1-p)}.$$

Recall that a binomial random variable $S$ that has a $\text {Binomial}(n, p)$ distribution can be written as $S = \sum _{i=1}^ n X_ i$ where the $X_i$'s are i.i.d. $\text {Ber}(p)$ random variables. Thus
$$\text {var}(S) =\text {var}\Big(\sum _{i=1}^ n X_ i\Big) =\sum _{i=1}^ n \text {var}(X_ i) =\boxed {n p (1-p)}.$$
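The boxed formulas can be checked directly against the Binomial probability table. The Python sketch below uses exact fractions with the example parameters $n=10$ and $p=3/5$; the variable names are chosen for this illustration.

```python
from fractions import Fraction as F
from math import comb

n, p = 10, F(3, 5)

# Probability table of S ~ Binomial(n, p), with exact fractions.
pmf = {s: comb(n, s) * p**s * (1 - p)**(n - s) for s in range(n + 1)}

mean = sum(s * q for s, q in pmf.items())
var = sum((s - mean) ** 2 * q for s, q in pmf.items())

print(mean, n * p)            # both equal 6
print(var, n * p * (1 - p))   # both equal 12/5
```

Computing the mean and variance from the table takes a sum over $n+1$ entries, while the formulas $np$ and $np(1-p)$ give the same values at once.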


## VI.4— The Law of Total Expectation

Remember the law of total probability? For a set of events $\mathcal{B}_{1},\dots ,\mathcal{B}_{n}$ that partition the sample space $\Omega$ (so the $\mathcal{B}_{i}$'s don't overlap and together they fully cover the full space of possible outcomes),
$$\mathbb {P}(\mathcal{A})=\sum _{i=1}^{n}\mathbb {P}(\mathcal{A}\cap \mathcal{B}_{i})=\sum _{i=1}^{n}\mathbb {P}(\mathcal{A}\mid \mathcal{B}_{i})\mathbb {P}(\mathcal{B}_{i}),$$
where the second equality uses the product rule.

A similar statement is true for the expected value of a random variable, called the *law of total expectation*: for a random variable $X$ (with alphabet $\mathcal{X}$) and a partition $\mathcal{B}_1,\dots ,\mathcal{B}_ n$ of the sample space,
$$\mathbb {E}[X]=\sum _{i=1}^{n}\mathbb {E}[X\mid \mathcal{B}_{i}]\mathbb {P}(\mathcal{B}_{i}),$$
where
$$\mathbb {E}[X\mid \mathcal{B}_{i}] = \sum _{x\in \mathcal{X}}xp_{X\mid \mathcal{B}_{i}}(x) = \sum _{x\in \mathcal{X}}x\frac{\mathbb {P}(X=x,\mathcal{B}_{i})}{\mathbb {P}(\mathcal{B}_{i})}.$$
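As a sanity check, the law of total expectation can be verified on a fair six-sided die. In the Python sketch below, the partition into $\{1,2\}$ and $\{3,4,5,6\}$ is an arbitrary choice made for illustration, and the helper names are hypothetical.

```python
from fractions import Fraction as F

# Fair six-sided die: outcomes 1..6, and X is the face value itself.
prob_space = {w: F(1, 6) for w in range(1, 7)}


def X(w):
    return w


# An arbitrary partition of the sample space into two events.
partition = [{1, 2}, {3, 4, 5, 6}]


def prob(event):
    """P(B): sum of the probabilities of the outcomes in the event."""
    return sum(prob_space[w] for w in event)


def conditional_expectation(event):
    """E[X | B] = sum over outcomes in B of X(w) P(w) / P(B)."""
    return sum(X(w) * prob_space[w] for w in event) / prob(event)


via_partition = sum(conditional_expectation(B) * prob(B) for B in partition)
direct = sum(X(w) * p for w, p in prob_space.items())

print([conditional_expectation(B) for B in partition])
print(via_partition, direct)  # 7/2 7/2
```

The conditional expectations are $3/2$ and $9/2$, weighted by $1/3$ and $2/3$ respectively, which recovers the overall expectation $7/2$.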

We will be using this result in the section “Towards Infinity in Modeling Uncertainty".

# VII—Introduction to Information-Theoretic Measures of Randomness

We just saw some basics for decision making under uncertainty and expected values of random variables. One way we saw for measuring uncertainty was variance. Now we look at a different way of measuring uncertainty or randomness using some ideas from information theory.

In this section, we answer the following questions in terms of bits (as in bits on a computer; everything stored on a computer is actually 0's and 1's each of which is 1 bit):
- How do we measure how random an event is?
- How do we measure how random a random variable or a distribution is?
- How do we measure how different two distributions are?
- How much information do two random variables share? 

For now, this material may seem like a bizarre exercise relating to expectation of random variables, but as we will see in the third part of the course on learning probabilistic models, information theory provides perhaps the cleanest derivations for some of the learning algorithms we will derive!

More broadly but beyond the scope of 6.008.1x, information theory is often used to show what the best possible performance we should even hope an inference algorithm can achieve such as fundamental limits to how accurate we can make a prediction. And if you can show that your inference algorithm's performance meets the fundamental limit, then that certifies that your inference algorithm is optimal! Inference and information theory are heavily intertwined!

## VII.1—Shannon Information Content: Measuring Randomness in an Event

First, let's consider storing an integer that isn't random. Let's say we have an integer from $0,1,\dots ,63$. Then the number of bits needed to store this integer is $\log _{2}(64)=6$ bits: you tell me 6 bits and I can tell you exactly what the integer is. You can think of this as asking: how many yes/no questions do I have to ask to figure out where I am in this range of values? The first question we ask is: is it at least 32, i.e., is it in the top half? If so, we recurse and do a binary search in the top half, asking: is it at least 48? And so on. After asking 6 such questions, we will have figured out exactly which number is being stored.
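The yes/no questioning strategy can be sketched as a binary search that counts how many "is it at least ...?" questions it asks. The function name `count_questions` is a hypothetical helper introduced only for this illustration.

```python
def count_questions(secret, low=0, high=63):
    """Locate `secret` in [low, high] using questions of the form
    'is it at least mid?'. Returns (value found, number of questions)."""
    questions = 0
    while low < high:
        mid = (low + high + 1) // 2  # first value of the top half
        questions += 1
        if secret >= mid:            # answer "yes": keep the top half
            low = mid
        else:                        # answer "no": keep the bottom half
            high = mid - 1
    return low, questions

print(count_questions(0))   # (0, 6)
print(count_questions(37))  # (37, 6)
print(count_questions(63))  # (63, 6)
```

Because 64 is a power of two, every integer in the range takes exactly 6 questions.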

A different way to think about this result is that we don't a priori know which of the 64 outcomes is going to be stored, and so each outcome is equally likely with probability $\frac{1}{64}$. Then the number of bits needed to store an event $\mathcal{A}$ is given by what's called the “Shannon information content" (also called **self-information**):
$$\log _{2}\frac{1}{\mathbb {P}(\mathcal{A})}.$$
In particular, for an integer $x\in \{ 0,1,\dots ,63\}$, the Shannon information content of observing $x$ is
$$\log _{2}\frac{1}{\mathbb {P}(\text {integer is }x)}=\log _{2}\frac{1}{1/64}=\log _{2}64=6\text { bits}.$$

If instead, the integer was deterministically 0 and never equal to any of the other values $1,2,\dots ,63$, then the Shannon information content of observing integer 0 is
$$\log _{2}\frac{1}{\mathbb {P}(\text {integer is }0)}=\log _{2}\frac{1}{1}=0\text { bits}.$$

This is not surprising in that an outcome that we deterministically always observe tells us no new information. Meanwhile, for each integer $x\in \{ 1,2,\dots ,63\}$,
$$\log _{2}\frac{1}{\mathbb {P}(\text {integer is }x)}=\log _{2}\frac{1}{0}=\infty \text { bits}.$$

How could observing one of the integers $\{ 1,2,\dots ,63\}$ tell us infinite bits of information?! Well, this isn't an issue since the event that we observe any of these integers has probability 0 and is thus impossible. An interpretation of Shannon information content is how surprised we would be to observe an event. In this sense, observing an impossible event would be infinitely surprising.

It is possible to have the Shannon information content of an event be some fractional number of bits (e.g., 0.7 bits). The interpretation is that from many repeats of the underlying experiment, the average number of bits needed to store the event is given by the Shannon information content, which can be fractional. 
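A minimal sketch of Shannon information content as a Python function, reproducing the three cases discussed above plus one fractional case. The function name `information_content` is an illustrative choice, and returning `inf` for a probability-0 event is a convention adopted here to mirror the text.

```python
from math import log2, inf

def information_content(prob):
    """Shannon information content, in bits, of an event with probability `prob`."""
    if prob == 0:
        return inf  # an impossible event would be infinitely surprising
    return log2(1 / prob)

print(information_content(1 / 64))  # 6.0
print(information_content(1))       # 0.0
print(information_content(0))       # inf
print(information_content(0.6))     # a fractional number of bits
```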

## VII.2—Shannon Entropy: Measuring Randomness in a Distribution/Random Variable

To go from the number of bits contained in an event to the number of bits contained in a random variable, we simply take the expectation of the Shannon information content across the possible outcomes. The resulting quantity is called the entropy of a random variable:
$$H(X)=\sum _{x}p_{X}(x)\underbrace{\log _{2}\frac{1}{p_{X}(x)}}_{\text {Shannon information content of event }X=x}.$$
	 
The interpretation is that on average, the number of bits needed to encode each i.i.d. sample of a random variable $X$ is $H(X)$. In fact, if we sample $n$ times i.i.d. from $p_{X}$, then two fundamental results in information theory that are beyond the scope of this course state that, for large $n$: (a) there's an algorithm that is able to store these $n$ samples in roughly $nH(X)$ bits, and (b) we can't possibly store the sequence in substantially fewer than $nH(X)$ bits!

**Example**: If $X$ is a fair coin toss “heads" or “tails" each with probability 1/2, then
$$
\begin{eqnarray}
H(X)
&=& p_X(\text{heads}) \log_2 \frac1{p_X(\text{heads})}
+ p_X(\text{tails}) \log_2 \frac1{p_X(\text{tails})} \\
&=& \frac12 \cdot \underbrace{\log_2 \frac1{\frac{1}{2}}}_1
+ \frac12 \cdot \underbrace{\log_2 \frac1{\frac{1}{2}}}_1 \\
&=& 1 \text{ bit}.
\end{eqnarray}
$$

**Example**: If $X$ is a biased coin toss where heads occurs with probability 1 then
$$
\begin{eqnarray}
H(X)
&=& p_X(\text{heads}) \log_2 \frac1{p_X(\text{heads})}
+ p_X(\text{tails}) \log_2 \frac1{p_X(\text{tails})} \\
&=& 1 \cdot \underbrace{\log_2 \frac11}_0
+ \underbrace{0 \cdot \log_2 \frac10}_0 \\
&=& 0 \text{ bits},
\end{eqnarray}
$$
where $0 \log _2 \frac10 = 0 \log _2 1 - 0 \log _2 0 = 0$ using the convention that $0 \log _2 0 \triangleq 0$. (Note: You can use l'Hôpital's rule from calculus to show that $\lim _{x\rightarrow 0^+} x \log x = 0$ and $\lim _{x\rightarrow 0^+} x \log \frac1x = 0$.)

**Notation**: Note that entropy $H(X) = \sum _ x p_ X(x) \log _2 \frac{1}{p_ X(x)}$ is in the form of an expectation! So in fact, we can write it as an expectation:
$$H(X) = \mathbb {E}\Big[\log _2 \frac{1}{p_ X(X)}\Big].$$
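The entropy formula translates directly into code when a distribution is stored as a dictionary, as in Section I.1. The sketch below uses a hypothetical helper `entropy` and skips zero-probability entries, which implements the convention $0 \log_2 \frac10 = 0$.

```python
from math import log2

def entropy(pmf):
    """Shannon entropy, in bits, of a distribution given as {value: probability}."""
    # Zero-probability entries are skipped: convention 0 * log2(1/0) = 0.
    return sum(p * log2(1 / p) for p in pmf.values() if p > 0)

fair = {'heads': 1/2, 'tails': 1/2}
always_heads = {'heads': 1, 'tails': 0}
prob_space = {'sunny': 1/2, 'rainy': 1/6, 'snowy': 1/3}

print(entropy(fair))          # 1.0
print(entropy(always_heads))  # 0.0
print(entropy(prob_space))    # strictly between 1 and log2(3) bits
```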

## VII.3—Information Divergence: Measuring How Different Two Distributions Are

Information divergence (also called “Kullback-Leibler divergence" or “KL divergence" for short, or also “relative entropy") is a measure of how different two distributions $p$ and $q$ (over the same alphabet) are.

To come up with information divergence, first, note that entropy of a random variable with distribution $p$ could be thought of as the expected number of bits needed to encode a sample from $p$ using the information content according to distribution $p$:
$$
H(X)=\underbrace{\sum _{x}p(x)}_{\text {we take a sample from }p}\underbrace{\log _{2}\frac{1}{p(x)}}_{\text {information content according to }p}\triangleq \mathbb {E}_{X \sim p}\Big[\log _{2}\frac{1}{p(X)}\Big].
$$

Here, we have introduced a new notation: $\mathbb {E}_{X \sim p}$ means that we are taking the expectation with respect to random variable $X$ drawn from the distribution $p$. If it's clear which random variable we are taking the expectation with respect to, we will often just abbreviate the notation and write $\mathbb {E}_ p$ instead of $\mathbb {E}_{X \sim p}$.

If instead we look at the information content according to a different distribution $q$, we get
$$
\underbrace{\sum _{x}p(x)}_{\text {we take a sample from }p}\underbrace{\log _{2}\frac{1}{q(x)}}_{\text {information content according to }q}\triangleq \mathbb {E}_{X \sim p}\Big[\log _{2}\frac{1}{q(X)}\Big].
$$

It turns out that if we are actually sampling from $p$ but encoding samples as if they were from a different distribution $q$, then on average we never need fewer bits, and we need strictly more whenever $q$ differs from $p$! This isn't terribly surprising in light of the fundamental result we alluded to that entropy of a random variable with distribution $p$ is the minimum number of bits needed to encode samples from $p$.

Information divergence is the price you pay in bits for trying to encode a sample from $p$ using information content according to $q$ instead of according to $p$:
$$
D(p\parallel q)=\mathbb {E}_{X \sim p}\Big[\log _{2}\frac{1}{q(X)}\Big]-\mathbb {E}_{X \sim p}\Big[\log _{2}\frac{1}{p(X)}\Big].
$$

Information divergence is *always at least 0*, and *when it is equal to 0, then this means that $p$ and $q$ are the same distribution* (i.e., $p(x) = q(x)$ for all $x$). This property is called **Gibbs' inequality**.

Gibbs' inequality makes information divergence seem a bit like a distance. However, information divergence is not like a distance in that *it is not symmetric*: in general, $D(p \parallel q) \ne D(q \parallel p)$.

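Oftentimes, the equation for information divergence is written more concisely by combining the two expectations and using $\log _2 \frac{1}{q(x)} - \log _2 \frac{1}{p(x)} = \log _2 \frac{p(x)}{q(x)}$:
$$D(p\parallel q) = \sum _ x p(x) \log _2 \frac{p(x)}{q(x)}.$$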

**Example**: Suppose $p$ is the distribution for a fair coin flip:
$$
\begin{eqnarray}
p(x)
&=&
\begin{cases}
\frac12 & \text{if }x=\text{heads}, \\
\frac12 & \text{if }x=\text{tails}. \\
\end{cases}
\end{eqnarray}
$$
Meanwhile, suppose $q$ is a distribution for a biased coin that always comes up heads (perhaps it's double-headed):
$$\begin{eqnarray}
q(x)
&=&
\begin{cases}
1 & \text{if }x=\text{heads}, \\
0 & \text{if }x=\text{tails}. \\
\end{cases}
\end{eqnarray}$$

Then
$$
\begin{eqnarray}
D(p \parallel q)
&=&
  p(\text{heads}) \log_2 \frac{p(\text{heads})}{q(\text{heads})}
+ p(\text{tails}) \log_2 \frac{p(\text{tails})}{q(\text{tails})} \\
&=&
  \frac12\log_2 \frac{\frac12}1
+ \underbrace{\frac12\log_2 \frac{\frac12}0}_{\infty} \\
&=&
  \infty\text{ bits}.
\end{eqnarray}
$$

This is not surprising: If we are sampling from $p$ (for which we could get tails) but trying to encode the sample using $q$ (which cannot possibly encode tails), then if we get tails, we are stuck: we can't store it! This incurs a penalty of infinitely many bits.

Meanwhile,
$$
\begin{eqnarray}
D(q \parallel p)
&=&
  q(\text{heads}) \log_2 \frac{q(\text{heads})}{p(\text{heads})}
+ q(\text{tails}) \log_2 \frac{q(\text{tails})}{p(\text{tails})} \\
&=&
  1 \log_2 \frac1{\frac12}
+ \underbrace{0 \log_2 \frac0{\frac12}}_0 \\
&=&
  1\text{ bit}.
\end{eqnarray}
$$

When we sample from $q$, we always get heads. In fact, as we saw previously, the entropy of the distribution for an always-heads coin flip is 0 bits since there's no randomness. But here we are sampling from $q$ and storing the sample using distribution $p$. For a fair coin flip, encoding using distribution $p$ would store each sample using on average 1 bit. Thus, even though a sample from $q$ is deterministically heads, we store it using 1 bit. This is the penalty we pay for storing a sample from $q$ using distribution $p$.

Notice that in this example, $D(p \parallel q) \ne D(q \parallel p)$. *They aren't even close: one is infinite and the other is finite!*
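The two divergences in this coin example can be reproduced with a short function. This is a sketch under the assumption that both distributions are dictionaries over the same alphabet; the name `kl_divergence` is an illustrative choice.

```python
from math import log2, inf

def kl_divergence(p, q):
    """D(p || q) in bits, for distributions given as {value: probability}."""
    total = 0.0
    for x, px in p.items():
        if px == 0:
            continue        # convention: 0 * log2(0 / q(x)) = 0
        qx = q.get(x, 0)
        if qx == 0:
            return inf      # p can produce x, but q cannot encode it
        total += px * log2(px / qx)
    return total

p = {'heads': 1/2, 'tails': 1/2}  # fair coin
q = {'heads': 1, 'tails': 0}      # always-heads coin

print(kl_divergence(p, q))  # inf
print(kl_divergence(q, p))  # 1.0
print(kl_divergence(p, p))  # 0.0
```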

**Example**: Using KL-divergence to prove that the uniform distribution is the distribution with maximal entropy.

Let random variable $U$ have what's called a uniform distribution over alphabet $\mathcal{X}$, meaning that
$$p_ U(x) = \frac{1}{|\mathcal{X}|} \qquad \text {for all }x\in \mathcal{X}.$$
	 
Notationally, we can write $U \sim \text {Uniform}(\mathcal{X})$.

We can see that:
$$
\begin{eqnarray}
D(p_X \parallel p_U)
&=& \sum_{x\in\mathcal{X}} p_X(x) \log_2 \frac{p_X(x)}{p_U(x)} \\
&=& \sum_{x\in\mathcal{X}} p_X(x) \log_2 \frac{1}{p_U(x)}
    + \sum_{x\in\mathcal{X}} p_X(x) \log_2 p_X(x) \\
&=& \sum_{x\in\mathcal{X}} p_X(x) \log_2 |\mathcal{X}|
    + \sum_{x\in\mathcal{X}} p_X(x) \log_2 p_X(x) \\
&=& (\log_2 |\mathcal{X}|) \underbrace{\sum_{x\in\mathcal{X}} p_X(x)}_1
    + \sum_{x\in\mathcal{X}} p_X(x) \log_2 p_X(x) \\
&=& \underbrace{\log_2 |\mathcal{X}|}_{H(U)}
    - \sum_{x\in\mathcal{X}} p_X(x) \log_2 \frac{1}{p_X(x)} \\
&=& H(U) - H(X).
\end{eqnarray}
$$

By Gibbs' inequality, information divergence is always nonnegative, which means that we must have $H(U) - H(X) \ge 0$, which means that $H(X) \le H(U)$ for any distribution $p_X$. 
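The identity $D(p_X \parallel p_U) = H(U) - H(X)$ can be checked numerically on the weather distribution from Section I.1. The helper names below are illustrative, and `kl_divergence` here assumes $q(x) > 0$ wherever $p(x) > 0$, which always holds when $q$ is uniform.

```python
from math import log2

def entropy(pmf):
    """Shannon entropy in bits of {value: probability}."""
    return sum(p * log2(1 / p) for p in pmf.values() if p > 0)

def kl_divergence(p, q):
    """D(p || q) in bits; assumes q[x] > 0 wherever p[x] > 0."""
    return sum(px * log2(px / q[x]) for x, px in p.items() if px > 0)

p_X = {'sunny': 1/2, 'rainy': 1/6, 'snowy': 1/3}
p_U = {x: 1 / len(p_X) for x in p_X}  # uniform over the same alphabet

divergence = kl_divergence(p_X, p_U)
gap = entropy(p_U) - entropy(p_X)
print(divergence, gap)  # equal up to rounding, and nonnegative
```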

### Proof of Gibbs' inequality

There are various ways to prove Gibbs' inequality. We'll be using a way that relies on the fact that $\ln x\le x-1$ for all $x>0$, with equality if and only if $x=1$ (easily demonstrated by studying the global minimum of $f(x)=x-1-\ln x$).

**Gibbs' inequality**: For any two distributions $p$ and $q$ defined over the same alphabet, we have $D(p\parallel q)\ge 0$, where equality holds if and only if $p$ and $q$ are the same distribution, i.e., $p(x)=q(x)$ for all $x$.

**Proof**: Recall that changing the base of a log just changes the log by a constant factor: $\log _{2}x=\frac{\ln x}{\ln 2}.$	 

Let $\mathcal{X}$ be the alphabet of distribution $p$ restricted to where the probability is positive, i.e., $\mathcal{X}=\{ a\text { such that }p(a)>0\}$. (There is no need to look at values $a$ for which $p(a)=0$.) If $q(a)=0$ for any $a\in \mathcal{X}$, then $D(p\parallel q)=\infty$, so trivially $D(p\parallel q)>0$.

What's left to consider is when $q(a)>0$ for every $a\in \mathcal{X}$. Then
$$
\begin{eqnarray}
D(p\parallel q)
& =& \sum _{a\in \mathcal{X}}p(a)\log _{2}\frac{p(a)}{q(a)}\\
& =& \frac{1}{\ln 2}\sum _{a\in \mathcal{X}}p(a)\ln \frac{p(a)}{q(a)}\\
& =& -\frac{1}{\ln 2}\sum _{a\in \mathcal{X}}p(a)\ln \frac{q(a)}{p(a)}.
\end{eqnarray}
$$ 	 

Next, we use the fact that $\ln x\le x-1$ for all $x>0$, accounting for the minus sign outside the summation. We also use the fact that $\sum _{a\in \mathcal{X}}q(a)\le 1$, since $\mathcal{X}$ only contains the values where $p$ is positive and $q$ may place probability elsewhere:
$$
\begin{eqnarray}
D(p\parallel q)
& =& -\frac{1}{\ln 2}\sum _{a\in \mathcal{X}}p(a)\ln \frac{q(a)}{p(a)}\\
& \ge & -\frac{1}{\ln 2}\sum _{a\in \mathcal{X}}p(a)\Big(\frac{q(a)}{p(a)}-1\Big)\\
& =& -\frac{1}{\ln 2}\sum _{a\in \mathcal{X}}\big (q(a)-p(a)\big )\\
& =& -\frac{1}{\ln 2}\big (\underbrace{\sum _{a\in \mathcal{X}}q(a)}_{\le 1}-\underbrace{\sum _{a\in \mathcal{X}}p(a)}_{1}\big )\\
& \ge & 0.
\end{eqnarray}
$$

Recall that inequality $\ln x\le x-1$ becomes an equality if and only if $x=1$. Thus, the inequality above becomes an equality if and only if, for all $a\in \mathcal{X}$, we have $\ln \frac{q(a)}{p(a)}=\frac{q(a)}{p(a)}-1$, which holds if and only if $\frac{q(a)}{p(a)}=1$ (in which case $\sum _{a\in \mathcal{X}}q(a)=1$ as well). Thus $D(p\parallel q)=0$ if and only if $p(a)=q(a)$ for all $a\in \mathcal{X}$. This finishes the proof. $\square$

## VII.4—Mutual Information

For two discrete random variables $X$ and $Y$, the mutual information between $X$ and $Y$, denoted as $I(X;Y)$, measures how much information they share. Specifically,
$$I(X;Y)\triangleq D(p_{X,Y}\parallel p_{X}p_{Y}),$$

where $p_{X}p_{Y}$ is the distribution we get if $X$ and $Y$ were actually independent (i.e., if $X$ and $Y$ were actually independent, then we know that the joint probability table would satisfy $\mathbb {P}(X=x,Y=y)=p_{X}(x)p_{Y}(y)$).

The mutual information could be thought of as how far $X$ and $Y$ are from being independent, since if indeed they were independent, then $I(X;Y)=0$.

On the opposite extreme, consider when $X=Y$. Then we would expect $X$ and $Y$ to share the most possible amount of information. In this scenario, we can write $p_{X,Y}(x,y)=p_{X}(x)\mathbf{1}\{ x=y\}$, and so
$$
\begin{eqnarray}
I(X;Y)
&=&D(p_{X,Y}\parallel p_{X}p_{Y})\\
&=&\sum_{x}\sum_{y}p_{X,Y}(x,y)\log_{2}\frac{p_{X,Y}(x,y)}{p_{X}(x)p_{Y}(y)}\\
&=&\sum_{x}\sum_{y}p_{X}(x)\mathbf{1}\{x=y\}\log_{2}\frac{p_{X}(x)\mathbf{1}\{x=y\}}{p_{X}(x)p_{Y}(y)}\\
&=&\sum_{x}p_{X}(x)\log_{2}\frac{1}{p_{X}(x)}\\
&=&H(X).
\end{eqnarray}
$$

This is not surprising: if $X$ and $Y$ are the same, then the number of bits they share is exactly the average number of bits needed to store $X$ (or $Y$), namely $H(X)$ bits. 
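Both extremes can be verified with a short sketch that computes mutual information from a joint probability table. The representation of the joint table as a dictionary keyed by `(x, y)` pairs and the name `mutual_information` are assumptions made for this illustration.

```python
from math import log2

def mutual_information(joint):
    """I(X;Y) in bits for a joint table given as {(x, y): probability}."""
    # Marginalize the joint table to get p_X and p_Y.
    p_x, p_y = {}, {}
    for (x, y), p in joint.items():
        p_x[x] = p_x.get(x, 0) + p
        p_y[y] = p_y.get(y, 0) + p
    # D(p_{X,Y} || p_X p_Y), skipping zero-probability entries.
    return sum(p * log2(p / (p_x[x] * p_y[y]))
               for (x, y), p in joint.items() if p > 0)

# Two independent fair coins.
independent = {(x, y): 1/4 for x in 'HT' for y in 'HT'}
# X = Y, where X is a fair coin, so H(X) = 1 bit.
identical = {('H', 'H'): 1/2, ('T', 'T'): 1/2}

print(mutual_information(independent))  # 0.0
print(mutual_information(identical))    # 1.0
```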

# VIII— Infinite Outcomes

What if we want an infinite number of outcomes? For example, consider an underlying experiment where we keep flipping a coin until we see the first heads. We might need an arbitrarily large number of tosses! Here, the sample space would consist of getting heads for the first time after 1 toss, after 2 tosses, and so forth, ad infinitum.

On a computer, we can't actually store an arbitrary probability table with an infinite number of entries.

A few workarounds:

- Approximate the probability distribution with a finite probability space. For example, truncate the alphabet at some cutoff and lump all the possible outcomes beyond that cutoff into a single outcome.

- For very specific probability distributions, there can be a way for us to represent the distribution “in closed form" meaning that if we just keep track of a few numbers, these few numbers are enough to tell us what the probability is for any possible outcome in an infinitely large sample space. The binomial distribution is an example of this: with just two numbers, we can query any entry in a probability table with far more than just two entries!

## VIII.1— The Geometric Distribution

The geometric distribution is an example of a distribution with an infinite alphabet size that has only 1 parameter. Thus, storing 1 number tells you what all the probability table entries are, even though there are an infinite number of entries!

The Geometric distribution, $X \sim \text {Geo}(p)$, can be written as
$$p_ X(x) = (1-p)^{x-1} p\qquad \text {for }x=1, 2, \dots$$

It has a single parameter $p$.

**Example**: Let's say that I have a biased coin, where if I toss it, the probability of heads is $p$. Let random variable $X$ be the number of tosses until we see the first heads. Then $X$ follows a geometric distribution with parameter $p$.
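A simulation sketch of this coin-tossing experiment, comparing empirical frequencies with the formula for $p_X(x)$. The parameter value `p = 0.3`, the number of trials, the seed, and the helper names are all illustrative assumptions.

```python
import random

def geometric_pmf(x, p):
    """p_X(x) = (1 - p)^(x - 1) * p for x = 1, 2, ..."""
    return (1 - p) ** (x - 1) * p

def tosses_until_heads(p, rng):
    """Toss a coin with P(heads) = p until the first heads; return the toss count."""
    tosses = 1
    while rng.random() >= p:  # this branch is "tails", with probability 1 - p
        tosses += 1
    return tosses

rng = random.Random(0)  # fixed seed so the run is reproducible
p = 0.3
trials = 100_000

counts = {}
for _ in range(trials):
    x = tosses_until_heads(p, rng)
    counts[x] = counts.get(x, 0) + 1

# Empirical frequency next to the geometric probability for small x.
for x in range(1, 6):
    print(x, counts.get(x, 0) / trials, geometric_pmf(x, p))
```

The empirical frequencies should be close to, but not exactly equal to, the table entries.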

### VIII.1.1— Properties of the Geometric Distribution

Each of the table entries $p_ X(x)$ is nonnegative for $x = 1, 2, \dots.$

Proof: Note that $1-p\ge 0$, $x-1\ge 0$, and $p\ge 0$, and so $(1-p)^{x-1}p\ge 0$ for all $p\in (0,1)$ and $x=1,2,3,\dots$.

The sum of all the table entries is 1, i.e., $\sum _{x=1}^\infty p_ X(x) = 1$.

Proof: We know from calculus that, for $r\in (-1,1)$,
$$
\sum _{i=0}^{\infty }r^{i}=\frac{1}{1-r}.
$$

Then, for $p\in (0,1)$,
$$
\sum _{x=1}^{\infty }p_{X}(x)=\sum _{x=1}^{\infty }(1-p)^{x-1}p\overset {(a)}{=}p\sum _{i=0}^{\infty }(1-p)^{i}\overset {(b)}{=}p\cdot \frac{1}{1-(1-p)}=p\cdot \frac{1}{p}=1,
$$
where step (a) substitutes $i=x-1$, and step (b) uses the result above from calculus. 
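The convergence of the table entries to a total of 1 can also be seen numerically. The sketch below sums the first $N$ entries and compares against $1-(1-p)^N$, the closed form of a truncated geometric series; the value `p = 0.3` and the cutoffs are illustrative choices.

```python
def geometric_pmf(x, p):
    """p_X(x) = (1 - p)^(x - 1) * p for x = 1, 2, ..."""
    return (1 - p) ** (x - 1) * p

p = 0.3  # illustrative parameter

# Sum of the first N table entries for a few cutoffs N.
partial_sums = {
    N: sum(geometric_pmf(x, p) for x in range(1, N + 1))
    for N in (1, 5, 20, 100)
}

for N, s in partial_sums.items():
    # The leftover mass (1 - p)^N is what a truncated table would lump together.
    print(N, s, 1 - (1 - p) ** N)
```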

### VIII.1.2— Expectation of the Geometric Distribution
The expectation of a geometric distribution is $\mathbb {E}[X] = \boxed {\frac1p}.$

We can demonstrate this by using the law of total expectation:
$$\mathbb {E}[X]=\sum _{i=1}^{n}\mathbb {E}[X\mid \mathcal{B}_{i}]\mathbb {P}(\mathcal{B}_{i}),$$
where
$$\mathbb {E}[X\mid \mathcal{B}_{i}] = \sum _{x\in \mathcal{X}}xp_{X\mid \mathcal{B}_{i}}(x) = \sum _{x\in \mathcal{X}}x\frac{\mathbb {P}(X=x,\mathcal{B}_{i})}{\mathbb {P}(\mathcal{B}_{i})}.$$

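We apply this with the two-event partition $\mathcal{B}, \mathcal{B}^c$, where $\mathcal{B}$ is the event that we get heads in 1 try, so that $\mathbb {P}(\mathcal{B}) = p$:
$$\mathbb {E}[X] = \mathbb {E}[X \mid \mathcal{B}]\mathbb {P}(\mathcal{B}) + \mathbb {E}[X \mid \mathcal{B}^ c](1 - \mathbb {P}(\mathcal{B})).$$

If the first toss is heads, then $X=1$, so $\mathbb {E}[X \mid \mathcal{B}] = 1$. If the first toss is tails, we have used up one toss and the remaining tosses look exactly like a fresh start, so $\mathbb {E}[X \mid \mathcal{B}^ c] = 1 + \mathbb {E}[X]$. Substituting, $\mathbb {E}[X] = p + (1-p)(1+\mathbb {E}[X])$, and solving for $\mathbb {E}[X]$ gives $\mathbb {E}[X] = \frac1p$.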
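As a numerical sanity check of $\mathbb {E}[X] = \frac1p$, the sketch below approximates the infinite sum $\sum_x x\,p_X(x)$ by truncating it at a large cutoff. The function name `truncated_mean`, the cutoff of 2000, and the tested values of $p$ are assumptions chosen so that the dropped tail is negligible.

```python
def geometric_pmf(x, p):
    """p_X(x) = (1 - p)^(x - 1) * p for x = 1, 2, ..."""
    return (1 - p) ** (x - 1) * p

def truncated_mean(p, cutoff=2000):
    """Approximate E[X] by summing x * p_X(x) for x = 1..cutoff (tail dropped)."""
    return sum(x * geometric_pmf(x, p) for x in range(1, cutoff + 1))

for p in (0.1, 0.5, 0.9):
    print(p, truncated_mean(p), 1 / p)  # the last two columns should nearly agree
```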

## VIII.2— Discrete Probability Spaces and Random Variables

In the case of the geometric distribution, the sample space $\Omega$ is what's called “countably infinite". This means that it has an infinite (rather than finite) number of entries, and that there's actually a way for us to arrange the elements so that there's a 1st element, 2nd, 3rd, and so forth off into infinity. Note that the set of real numbers is not countable.

Before this section, every time we used the phrases “probability space" and “random variable", we actually meant “finite probability space" and “finite random variable".

More general than the finite probability space and finite random variable are the discrete probability space and discrete random variable:

**Definition of a “discrete probability space"**: A discrete probability space ($\Omega ,\mathbb {P}$) is the same thing as a finite probability space except that the sample space $\Omega$ is allowed to be either finite or countably infinite. In particular, a discrete probability space consists of two ingredients:

- a finite or countably infinite sample space $\Omega$ that is the collectively exhaustive, mutually exclusive set of all possible outcomes

- an assignment of probability $\mathbb {P}$, where for any outcome $\omega \in \Omega$, we have $\mathbb {P}(\text {outcome }\omega )$ be a number at least 0 and at most 1, and
$$\sum _{\omega \in \Omega }\mathbb {P}(\text {outcome }\omega )=1.$$

**Discrete random variables**: Given a discrete probability space ($\Omega ,\mathbb {P}$), a random variable $X$ that maps $\Omega$ to a set of values $\mathcal{X}$ that the random variable can take on is an example of what is called a discrete random variable. In this example, as in the finite random variable case, we can think of random variable $X$ as generated from a two step procedure: some possible outcome $\omega$ is sampled from the discrete probability space ($\Omega ,\mathbb {P}$), and then $X$ takes on the value given by $X(\omega )$. In general, a discrete random variable takes on values in an alphabet that is either finite in size or countably infinite, but the probability space that the random variable is associated with can actually be more general than a discrete probability space!

Formally defining probability spaces that are more general than discrete probability spaces requires more sophisticated mathematical machinery that is beyond the scope of 6.008.1x.

## VIII.3— Consecutive Sixes example
On average, how many times do you have to roll a fair six-sided die before getting two consecutive sixes?

We can draw a tree of consecutive states as follows:
![14763403811819446](./images/14763403811819446.jpg)

Let $X$ be the number of rolls needed and let $m = \mathbb {E}[X]$. Consider the possible ways the first rolls can go:
- If the first two rolls are both sixes (the two upper branches of the tree), which happens with probability $\frac16\cdot \frac16$, we are done after 2 rolls.
- If the first roll is a six and the second roll is not (one of 1 to 5), which happens with probability $\frac16\cdot \frac56$, we have spent 2 rolls and must start over from the very beginning. Since the rolls are independent, the process is memoryless, just like the geometric distribution: from this point, the expected number of additional rolls is again $m$. Call this a fresh start. The expected total along this path is therefore $m+2$, accounting for the 2 rolls already spent (the middle path in the diagram).
- If the first roll is not a six, which happens with probability $\frac56$, we have spent 1 roll and again make a fresh start. The expected total along this path is $m+1$ (the bottom path).

Now, by the law of total expectation, $m = \frac{1}{36}\cdot 2 + \frac{5}{36} \cdot (m+2) + \frac{5}{6} \cdot (m+1)$. Solving for $m$ gives $m = 42$.

**Another way to see it**:
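Let $X$ be the random variable for the number of throws needed to get two consecutive sixes. Let $A$ be the event that the first throw is a six and let $B$ be the event that the second throw is a six. Then
$$
\begin{eqnarray}
\mathbb {E}[X] &=& \mathbb {E}[X\mid A]\mathbb {P}(A)+\mathbb {E}[X\mid A^c]\mathbb {P}(A^c) \\
&=& \big (\mathbb {E}[X\mid A,B]\, \mathbb {P}(B\mid A) + \mathbb {E}[X\mid A,B^c]\, \mathbb {P}(B^c\mid A)\big )\mathbb {P}(A)+\mathbb {E}[X\mid A^c]\mathbb {P}(A^c) \\
&=& \Big (2\cdot \frac16+(2+\mathbb {E}[X])\cdot \frac56\Big )\cdot \frac16+(1+\mathbb {E}[X])\cdot \frac56.
\end{eqnarray}
$$

Rearranging the terms gives the same answer, $\mathbb {E}[X]=42$.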
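Both the algebra and the answer can be checked in code. The sketch below first solves the linear equation for $m$ exactly with fractions, then estimates the same expectation by simulation. The function name `rolls_until_double_six`, the seed, and the number of trials are illustrative assumptions.

```python
import random
from fractions import Fraction

# Rewrite m = 1/36 * 2 + 5/36 * (m + 2) + 5/6 * (m + 1) as m = const + coeff * m.
const = Fraction(1, 36) * 2 + Fraction(5, 36) * 2 + Fraction(5, 6) * 1
coeff = Fraction(5, 36) + Fraction(5, 6)
m_exact = const / (1 - coeff)
print(m_exact)  # 42

def rolls_until_double_six(rng):
    """Roll a fair die until two sixes appear in a row; return the number of rolls."""
    rolls, streak = 0, 0
    while streak < 2:
        rolls += 1
        # A six extends the current streak; anything else resets it (fresh start).
        streak = streak + 1 if rng.randint(1, 6) == 6 else 0
    return rolls

rng = random.Random(0)  # fixed seed so the run is reproducible
trials = 20_000
estimate = sum(rolls_until_double_six(rng) for _ in range(trials)) / trials
print(estimate)  # should land near 42, with some sampling noise
```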