# ECE493: Probabilistic Reasoning and Reinforcement Learning

# Introduction

**Probabilistic graphical modeling** is a branch of machine learning that studies how to use probability distributions to describe the world and make useful predictions about it.

The simplest model would be a linear equation of the form:

$y = \beta^Tx$

* $y$ is the outcome variable that we want to predict
* $x$ is the vector of factors that affect the outcome
* $\beta$ is the vector of parameters

However, the real world involves _uncertainty_, so we usually model it with a probability distribution instead:

$p(x, y)$

## Difficulties of Probabilistic Modeling

Suppose we have a binary classifier that determines whether an email is spam. The classifier works from a large list of words: which of these words appear in an email determines whether the email is classified as spam. However, if the list contains $n$ words, a fully general joint distribution over the $n$ binary word indicators and the label has $2^{n+1}$ entries, and we would have to write down every one of these values.

This process can be simplified by assuming _conditional independence_ among the variables: given the label $y$, the words $x_i$ are independent of one another. This is called the **Naive Bayes** assumption. Given this assumption, we can write the joint probability as a product of factors:

$p(y, x_1, x_2, ..., x_n) = p(y)\prod_{i=1}^n p(x_i|y)$

In this case, each factor $p(x_i|y)$ can be completely described by a small number of parameters (4 parameters with 2 degrees of freedom). This entire distribution is parametrized by $O(n)$ parameters which we can tractably estimate from data and make predictions.
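As a rough illustration of writing the joint distribution as a product of factors, the sketch below builds it for $n = 3$ binary words. The probability values and the names `p_y`, `p_x_given_y`, and `joint` are made up for this example. It also compares the number of free parameters with that of an unrestricted joint table.

```python
from itertools import product

# Hypothetical parameters for n = 3 binary word indicators (illustrative values only).
p_y = {0: 0.7, 1: 0.3}            # p(y): 0 = not spam, 1 = spam
p_x_given_y = [                   # p(x_i = 1 | y) for each word i
    {0: 0.1, 1: 0.8},
    {0: 0.3, 1: 0.6},
    {0: 0.5, 1: 0.5},
]

def joint(y, xs):
    """Naive Bayes joint: p(y) times the product over i of p(x_i | y)."""
    prob = p_y[y]
    for x_i, table in zip(xs, p_x_given_y):
        p_one = table[y]
        prob *= p_one if x_i == 1 else 1.0 - p_one
    return prob

n = len(p_x_given_y)
# Summing the joint over every assignment should give 1.
total = sum(joint(y, xs) for y in (0, 1) for xs in product((0, 1), repeat=n))

naive_bayes_params = 1 + 2 * n         # one free value for p(y), two per word
full_table_params = 2 ** (n + 1) - 1   # free values in an unrestricted joint table
print(total, naive_bayes_params, full_table_params)
```

The parameter count grows linearly in $n$ for the factored model and exponentially for the full table.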

## Graphical Representation
<img src="images/naive_bayes.png" width="50%">

## Overview of the Course
### Representation
This is how to specify a model.

This is a difficult process with lots of input parameters, and it will use a lot of graph theory.

### Inference
Given a model, how do we extract useful information? There are two kinds of inference:
* _Marginal Inference_: what is the probability of a given variable in our model after we sum everything else out? An example query would be to determine the probability that a random house has more than three bedrooms.

$p(x_1) = \sum_{x_2}\sum_{x_3}...\sum_{x_n}p(x_1, x_2, ..., x_n)$

* _Maximum a posteriori (MAP) inference_: asks for the most likely assignment of variables. For example, we may try to determine the most likely spam message by solving the problem:

$\max_{x_1, ..., x_n} p(x_1, ..., x_n, y = 1)$
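Both kinds of query can be answered by brute force when the model is tiny. The sketch below uses a made-up joint table over three binary variables (the weights are arbitrary, and `marginal` and `map_assignment` are hypothetical helper names). For simplicity, the MAP query here has no fixed label $y$.

```python
from fractions import Fraction
from itertools import product

# Hypothetical joint table over (x1, x2, x3); arbitrary weights, normalized to sum to 1.
weights = {xs: 1 + xs[0] + 2 * xs[1] * xs[2] for xs in product((0, 1), repeat=3)}
z = sum(weights.values())
joint = {xs: Fraction(w, z) for xs, w in weights.items()}

def marginal(index, value):
    """Marginal inference: sum the joint over all other variables."""
    return sum(p for xs, p in joint.items() if xs[index] == value)

def map_assignment():
    """MAP inference: the single assignment with the highest joint probability."""
    return max(joint, key=joint.get)

print(marginal(0, 1))      # p(x1 = 1)
print(map_assignment())
```

Enumeration touches all $2^n$ assignments, so it only works for very small $n$.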

### Learning

Learning refers to fitting a model to a dataset, which could be, for example, a large number of labeled examples of spam. By looking at the data, we can infer useful patterns (e.g. which words are found more frequently in spam emails), which we can then use to make predictions about the future. However, we will see that learning and inference are also inherently linked in a more subtle way, since inference will turn out to be a key subroutine that we repeatedly call within learning algorithms. Also, the topic of learning will feature important connections to the field of computational learning theory, which deals with questions such as generalization from limited data and overfitting, as well as to Bayesian statistics.

# Probability Review

## 1. Elements of Probability

**Sample Space $\Omega$:** the set of all outcomes of a random experiment. Here, each outcome $\omega\in\Omega$ can be thought of as a complete description of the state of the real world at the end of the experiment.

**Set of Events (or event space) $F$:** A set whose elements $A\in F$ (called events) are subsets of $\Omega$ (i.e. $A\subseteq\Omega$ is a collection of possible outcomes of an experiment).

**Probability Measure:** A function $P:F\to\mathbb{R}$ that satisfies the following properties:
* $P(A)\geq0$ for all $A\in F$
* If $A_1, A_2, ...$ are disjoint events (i.e. $A_i\cap A_j = \emptyset$ whenever $i\neq j$), then $P(\cup_iA_i) = \sum_iP(A_i)$
* $P(\Omega) = 1$

These three properties are called the *Axioms of Probability*.

**Example:** Consider the experiment of tossing a six-sided die. The sample space is $\Omega = \{1, 2, 3, 4, 5, 6\}$. We can define different event spaces on this sample space. For example, the simplest event space is the trivial event space $F = \{\emptyset, \Omega\}$. Another event space is the set of all subsets of $\Omega$.

For the first event space, the unique probability measure satisfying the requirements above is given by $P(\emptyset) = 0$, $P(\Omega) = 1$. For the second event space, one valid measure is to assign each set in the event space the probability $\frac{i}{6}$, where $i$ is the number of elements in that set. For example, $P(\{1, 2, 3, 4\}) = \frac{4}{6}$.
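The die example is small enough to check the axioms by enumeration. The following is a minimal sketch, assuming the event space is the full power set of $\Omega$ and the measure counts elements. `power_set` and `P` are illustrative names, and additivity is checked pairwise only, which suffices for a finite sample space.

```python
from fractions import Fraction
from itertools import combinations

omega = frozenset(range(1, 7))

def power_set(s):
    """All subsets of s, returned as frozensets."""
    items = sorted(s)
    return [frozenset(c) for r in range(len(items) + 1) for c in combinations(items, r)]

events = power_set(omega)

def P(event):
    """Probability of an event: (number of outcomes in the event) / 6."""
    return Fraction(len(event), len(omega))

axiom_nonneg = all(P(a) >= 0 for a in events)
axiom_total = P(omega) == 1
# Additivity for every pair of disjoint events.
axiom_additive = all(
    P(a | b) == P(a) + P(b) for a in events for b in events if not (a & b)
)
print(axiom_nonneg, axiom_total, axiom_additive)
```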

### _Properties_
* $A \subseteq B \implies P(A) \leq P(B)$
* $P(A\cap B) \leq \min(P(A), P(B))$
* Union Bound: $P(A\cup B) \leq P(A) + P(B)$
* $P(\Omega \setminus A) = 1-P(A)$
* Law of Total Probability: If $A_1, ..., A_k$ are a set of disjoint events such that $\cup_{i=1}^k A_i = \Omega$, then $\sum_{i=1}^k P(A_i) = 1$

### 1.1 Conditional Probability
Let $B$ be an event with non-zero probability. The conditional probability of any event $A$ given $B$ is:

$P(A|B) = \frac{P(A \cap B)}{P(B)}$

### 1.2 Chain Rule
Let $S_1, ..., S_k$ be events, $P(S_i) > 0$. Then the chain rule states that:

$P(S_1\cap S_2\cap...\cap S_k) = P(S_1)P(S_2|S_1)P(S_3|S_2\cap S_1)...P(S_k|S_1\cap S_2\cap...\cap S_{k-1})$ 

Note that for $k=2$ events, this is just the definition of conditional probability:

$P(S_1\cap S_2) = P(S_1)P(S_2|S_1)$

**Example:**

$P(S_1\cap S_2 \cap S_3 \cap S_4)$

$= P(S_1 \cap S_2 \cap S_3)P(S_4 | S_1 \cap S_2 \cap S_3)$

$= P(S_1 \cap S_2)P(S_3 | S_1 \cap S_2)P(S_4 | S_1 \cap S_2 \cap S_3)$

$= P(S_1)P(S_2|S_1)P(S_3 | S_1 \cap S_2)P(S_4 | S_1 \cap S_2 \cap S_3)$
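A quick numeric check of the chain rule, using a card-drawing scenario that is assumed for illustration: let $S_i$ be the event that the $i$-th card drawn without replacement from a standard 52-card deck is an ace. The product of the conditional factors should match the direct count $1/\binom{52}{4}$.

```python
from fractions import Fraction
from math import comb

# Chain-rule factors: P(S_1), P(S_2 | S_1), P(S_3 | S_1, S_2), P(S_4 | S_1, S_2, S_3).
# After i aces have been drawn, 4 - i aces remain among 52 - i cards.
factors = [Fraction(4 - i, 52 - i) for i in range(4)]

chain = Fraction(1)
for f in factors:
    chain *= f

# Direct count: exactly one of the C(52, 4) equally likely 4-card sets is all aces.
direct = Fraction(1, comb(52, 4))
print(chain, direct)
```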

### 1.3 Independence

Two events are called **independent** if $P(A\cap B) = P(A)P(B)$ or, equivalently when $P(B) > 0$, $P(A|B) = P(A)$. Intuitively, saying that $A$ and $B$ are independent means that observing $B$ does not have any effect on the probability of $A$.
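Conditional probability and independence can both be checked by enumeration on a small sample space. The sketch below assumes two fair dice and three events chosen for illustration (`first_even`, `sum_is_7`, `sum_is_8`). The helper names are hypothetical.

```python
from fractions import Fraction
from itertools import product

omega = list(product(range(1, 7), repeat=2))   # all 36 outcomes of two dice

def P(event):
    """Probability of an event given as a predicate on outcomes."""
    return Fraction(sum(1 for w in omega if event(w)), len(omega))

def conditional(a, b):
    """P(A | B) = P(A and B) / P(B); assumes P(B) > 0."""
    return P(lambda w: a(w) and b(w)) / P(b)

def independent(a, b):
    return P(lambda w: a(w) and b(w)) == P(a) * P(b)

first_even = lambda w: w[0] % 2 == 0
sum_is_7 = lambda w: w[0] + w[1] == 7
sum_is_8 = lambda w: w[0] + w[1] == 8

print(conditional(first_even, sum_is_7), independent(first_even, sum_is_7))
print(conditional(first_even, sum_is_8), independent(first_even, sum_is_8))
```

Observing a sum of 7 leaves the probability that the first die is even unchanged, while observing a sum of 8 changes it.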

## 2. Random Variables

Consider an experiment where we flip 5 coins and we want to know the number of heads. Here the elements of the sample space $\Omega$ are 5-length sequences of heads and tails. In practice, we care about real-valued functions of outcomes, such as the number of heads that appear among 5 tosses or the length of the longest run of tails. These functions, under some technical conditions, are known as **random variables**.

More formally, a random variable **X** is a function $X:\Omega\to\mathbb{R}$. Typically, we denote random variables using upper case letters $X(\omega)$ or simply $X$ (where the dependence on the random outcome $\omega$ is implied). We denote the value that a random variable may take on using lower case letters $x$. Thus, $X = x$ means that the random variable $X$ takes on the value $x\in\mathbb{R}$.

**Example:** In the experiment above, suppose that $X(\omega)$ is the number of heads which occur in the sequence of tosses $\omega$. Given that only 5 coins are tossed, $X(\omega)$ can take only a finite number of values, so it is known as a **discrete random variable**. Here the probability of the set associated with a random variable $X$ taking on some specific value $k$ is:

$P(X=k) := P(\{\omega : X(\omega) = k\})$

**Example:** Suppose that $X(\omega)$ is a random variable indicating the amount of time it takes for a radioactive particle to decay. In this case, $X(\omega)$ takes on an infinite number of possible values, so it is called a **continuous random variable**. We denote the probability that $X$ takes on a value between two real constants $a$ and $b$ (where $a<b$) as:

$P(a \leq X \leq b) := P(\{\omega : a \leq X(\omega) \leq b\})$
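For the 5-coin experiment, a random variable is literally a function on the 32 outcomes, and $P(X = k)$ is the probability of the set of outcomes mapped to $k$. Here is a small sketch, assuming fair coins. `X`, `longest_tail_run`, and `prob` are illustrative names.

```python
from fractions import Fraction
from itertools import product

omega = list(product("HT", repeat=5))   # 32 equally likely sequences

def X(w):
    """Number of heads in the sequence w."""
    return w.count("H")

def longest_tail_run(w):
    """Length of the longest run of consecutive tails in w."""
    best = run = 0
    for c in w:
        run = run + 1 if c == "T" else 0
        best = max(best, run)
    return best

def prob(rv, k):
    """P(rv = k) = P({w : rv(w) = k}) under the uniform measure on omega."""
    return Fraction(sum(1 for w in omega if rv(w) == k), len(omega))

pmf_heads = {k: prob(X, k) for k in range(6)}
print(pmf_heads)
```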

----

To specify the probability measures used when dealing with random variables, it is often convenient to specify alternative functions (CDFs, PDFs, and PMFs) from which the probability measure governing an experiment immediately follows.

### 2.1 Cumulative Distribution Functions

<img src="images/cdf.png" width="50%"/>

A Cumulative Distribution Function (CDF) is a function $F_X : \mathbb{R}\to[0, 1]$ which specifies a probability measure as:

$F_X(x) = P(X\leq x)$.

By using this function, one can calculate the probability of any event.

#### _Properties:_
* $0\leq F_X(x) \leq 1$
* $\lim_{x\to-\infty} F_X(x) = 0$
* $\lim_{x\to+\infty} F_X(x) = 1$
* $x \leq y \implies F_X(x) \leq F_X(y)$

### 2.2 Probability Mass Functions

<img src="images/pmf.png" width="50%"/>

When a random variable $X$ takes on a finite set of possible values (i.e. $X$ is a discrete random variable), a simpler way to represent the probability measure associated with a random variable is to directly specify the probability of each value that the random variable can assume. In particular, a probability mass function (PMF) is a function $p_X : Val(X)\to[0, 1]$ such that $p_X(x) = P(X=x)$.

In the case of a discrete random variable, we use the notation $Val(X)$ for the set of possible values that the random variable $X$ may assume. For example, if $X(\omega)$ is a random variable indicating the number of heads out of 5 coin tosses, then $Val(X) = \{0, 1, 2, 3, 4, 5\}$.

#### _Properties:_
* $0 \leq p_X(x) \leq 1$
* $\sum_{x\in Val(X)}p_X(x) = 1$
* $\sum_{x\in A}p_X(x) = P(X\in A)$

### 2.3 Probability Density Functions

<img src="images/pdf.png" width="50%"/>

For some continuous random variables, the cumulative distribution function $F_X(x)$ is differentiable everywhere. In these cases, we define the Probability Density Function (PDF) as the derivative of the CDF, ie.

$f_X(x) = \frac{dF_X(x)}{dx}$

Note that the PDF for a continuous random variable may not always exist (i.e. if $F_X(x)$ is not differentiable everywhere).

According to the properties of differentiation, for very small $\delta x$:

$P(x \leq X \leq x + \delta x) \approx f_X(x)\delta x$

Both CDFs and PDFs (when they exist) can be used for calculating the probabilities of different events. But it should be emphasized that the value of the PDF at any given point $x$ is not the probability of that event, i.e. $f_X(x)\neq P(X=x)$. For example, $f_X(x)$ can take on values larger than one (but the integral of $f_X(x)$ over any subset of $\mathbb{R}$ will be at most one).

#### _Properties:_
* $f_X(x) \geq 0$
* $\int_{-\infty}^{\infty}f_X(x)dx = 1$
* $\int_{x\in A}f_X(x)dx = P(X\in A)$
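To see concretely that a density can exceed one while still integrating to one, the sketch below uses a uniform density on $[0, 0.5]$, a special case of the Uniform distribution listed in Section 2.6. It integrates the density numerically with a simple midpoint rule. The interval endpoints, the step count, and the helper name `integrate` are assumptions made for this example.

```python
def uniform_pdf(x, a=0.0, b=0.5):
    """Density of Uniform(a, b): 1 / (b - a) on [a, b], zero elsewhere."""
    return 1.0 / (b - a) if a <= x <= b else 0.0

def integrate(f, lo, hi, steps=100_000):
    """Midpoint-rule approximation of the integral of f over [lo, hi]."""
    h = (hi - lo) / steps
    return sum(f(lo + (i + 0.5) * h) for i in range(steps)) * h

density_at_quarter = uniform_pdf(0.25)             # a density value, not a probability
total_mass = integrate(uniform_pdf, -1.0, 1.0)     # approximates the integral over the real line
mass_first_tenth = integrate(uniform_pdf, 0.0, 0.1)  # approximates P(0 <= X <= 0.1)
print(density_at_quarter, total_mass, mass_first_tenth)
```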

### 2.4 Expectation

The **expectation** of a function $g(X)$ can be thought of as a "weighted average" of the values that $g(x)$ takes on for different values of $x$. Suppose that $X$ is a discrete random variable with PMF $p_X(x)$ and $g: \mathbb{R}\to\mathbb{R}$ is an arbitrary function. In this case, $g(X)$ can be considered a random variable, and we define the **expectation** or expected value of $g(X)$ as:

$\mathbb{E}[g(X)] = \sum_{x\in Val(X)} g(x)p_X(x)$

If $X$ is a continuous random variable with PDF $f_X(x)$, then the expected value of $g(X)$ is defined as:

$\mathbb{E}[g(X)] = \int_{-\infty}^{+\infty} g(x)f_X(x)dx$

#### _Properties:_
* $\mathbb{E}[a] = a$ for any constant $a\in\mathbb{R}$
* $\mathbb{E}[af(X)] = a\mathbb{E}[f(X)]$ for any constant $a\in\mathbb{R}$
* (Linearity of Expectation): $\mathbb{E}[f(X) + g(X)] = \mathbb{E}[f(X)] + \mathbb{E}[g(X)]$
* For a discrete random variable $X, \mathbb{E}[1\{X=k\}] = P(X=k)$

### 2.5 Variance

The variance of a random variable $X$ is a measure of how concentrated the distribution of a random variable $X$ is around its mean. Formally, the variance of a random variable $X$ is defined as $Var[X] = \mathbb{E}[(X-\mathbb{E}[X])^2]$.

Using properties in the previous section, we can derive an alternate expression for the variance:

$\mathbb{E}[(X-\mathbb{E}[X])^2]$

$= \mathbb{E}[X^2-2\mathbb{E}[X]X+\mathbb{E}[X]^2]$

$= \mathbb{E}[X^2] - 2\mathbb{E}[X]\mathbb{E}[X]+\mathbb{E}[X]^2$

$= \mathbb{E}[X^2]-\mathbb{E}[X]^2$

#### _Properties:_
* $Var[a] = 0$ for any constant $a\in \mathbb{R}$
* $Var[af(X)] = a^2Var[f(X)]$ for any constant $a\in \mathbb{R}$
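The two expressions for the variance, and the scaling property, can be verified exactly on a small example. This sketch assumes a fair six-sided die and uses a hypothetical helper `expectation` that implements the discrete sum $\sum_x g(x)p_X(x)$.

```python
from fractions import Fraction

pmf = {x: Fraction(1, 6) for x in range(1, 7)}   # fair die

def expectation(g, pmf):
    """E[g(X)] for a discrete X given as a dict {value: probability}."""
    return sum(g(x) * p for x, p in pmf.items())

mean = expectation(lambda x: x, pmf)
var_definition = expectation(lambda x: (x - mean) ** 2, pmf)       # E[(X - E[X])^2]
var_shortcut = expectation(lambda x: x * x, pmf) - mean ** 2       # E[X^2] - E[X]^2
var_scaled = expectation(lambda x: (3 * x - 3 * mean) ** 2, pmf)   # Var[3X]
print(mean, var_definition, var_shortcut, var_scaled)
```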

### 2.6 Some Common Random Variables

#### Discrete Random Variables
* $X\sim Bernoulli(p)$ (where $0 \leq p \leq 1)$: the outcome of a coin flip (H = 1, T= 0) for a coin that comes up heads with probability p.

$p(x)= \begin{cases} 
          p, & \text{if } x=1 \\
          1-p, & \text{if } x=0
       \end{cases}$
       
* $X\sim Binomial(n, p)$ (where $0 \leq p \leq 1$): the number of heads in $n$ independent flips of a coin with heads probability $p$.

$p(x)= {n\choose x}p^x(1-p)^{n-x}$

* $X\sim Geometric(p)$ (where $0 < p \leq 1$): the number of flips of a coin until the first heads, for a coin that comes up heads with probability $p$.

$p(x) = p(1-p)^{x-1}$

* $X\sim Poisson(\lambda)$ (where $\lambda > 0$): a probability distribution over the non-negative integers used for modeling the frequency of rare events.

$p(x) = e^{-\lambda}\frac{\lambda^x}{x!}$

#### Continuous Random Variables
* $X\sim Uniform(a, b)$ (where $a < b$): equal probability density to every value between $a$ and $b$ on the real line.

$f(x) = \begin{cases} 
          \frac{1}{b-a}, & \text{if } a \leq x \leq b \\
          0, & \text{otherwise}
       \end{cases}$
       
* $X\sim Exponential(\lambda)$ (where $\lambda > 0$): decaying probability density over the non-negative reals.

$f(x) = \begin{cases} 
          \lambda e^{-\lambda x}, & \text{if } x\geq 0 \\
          0, & \text{otherwise}
       \end{cases}$
       
* $X\sim Normal(\mu, \sigma^2)$: also known as the Gaussian distribution

$f(x) = \frac{1}{\sqrt{2\pi}\sigma}e^{-\frac{(x-\mu)^2}{2\sigma^2}}$
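As a sanity check on the discrete PMFs listed above, the sketch below evaluates them numerically and confirms that each sums to approximately one. The infinite sums are truncated, and the parameter values are arbitrary choices for illustration.

```python
from math import comb, exp, factorial

def binomial_pmf(x, n, p):
    return comb(n, x) * p**x * (1 - p) ** (n - x)

def geometric_pmf(x, p):
    # x = 1, 2, ... counts flips up to and including the first heads.
    return p * (1 - p) ** (x - 1)

def poisson_pmf(x, lam):
    return exp(-lam) * lam**x / factorial(x)

binomial_total = sum(binomial_pmf(x, 10, 0.3) for x in range(11))
binomial_mean = sum(x * binomial_pmf(x, 10, 0.3) for x in range(11))
geometric_total = sum(geometric_pmf(x, 0.25) for x in range(1, 200))   # truncated sum
poisson_total = sum(poisson_pmf(x, 4.0) for x in range(100))           # truncated sum
print(binomial_total, binomial_mean, geometric_total, poisson_total)
```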

## 3. Two Random Variables

There might be cases where we are interested in knowing more than one quantity during a random experiment. For example, if we flip a coin multiple times, we may care about both $X(\omega)$ = the number of heads that come up and $Y(\omega)$ = the length of the longest run of consecutive heads.

### 3.1 Joint and Marginal Distributions

Suppose that we have two random variables $X$ and $Y$. One way to work with these two random variables is to consider each of them separately. If we do that, we will only need $F_X(x)$ and $F_Y(y)$. But if we want to know about the values that $X$ and $Y$ assume simultaneously during outcomes of a random experiment, we require a more complicated structure known as the joint cumulative distribution function of $X$ and $Y$, defined by:

$F_{XY}(x, y) = P(X\leq x, Y \leq y)$.

It can be shown that by knowing the joint cumulative distribution function, the probability of any event involving $X$ and $Y$ can be calculated.

The joint CDF $F_{XY}(x, y)$ and the cumulative distribution functions $F_X(x)$ and $F_Y(y)$ of each variable separately are related by:

$F_X(x) = \lim_{y\to\infty} F_{XY}(x, y)$

$F_Y(y) = \lim_{x\to\infty} F_{XY}(x, y)$

Here, we call $F_X(x)$ and $F_Y(y)$ the **marginal cumulative distribution functions** of $F_{XY}(x, y)$.

### 3.2 Joint and Marginal PMF

If $X$ and $Y$ are discrete random variables, then the joint probability mass function $p_{XY}: Val(X) \times Val(Y) \to [0, 1]$ is defined by:

$p_{XY}(x, y) = P(X=x, Y=y)$

Here, $0 \leq p_{XY}(x, y) \leq 1$ for all $x, y$, and $\sum_{x\in Val(X)}\sum_{y\in Val(Y)} p_{XY}(x, y) = 1$.

The **marginal probability mass function** of $X$ is defined as:

$p_X(x) = \sum_yp_{XY}(x, y)$.

This is also the case with $p_Y(y)$.
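A joint PMF and its marginals can be tabulated directly for a small experiment. The sketch below assumes three fair coin flips, with $X$ the number of heads and $Y$ the length of the longest run of consecutive heads. The helper names are illustrative.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

def longest_head_run(w):
    """Length of the longest run of consecutive heads in w."""
    best = run = 0
    for c in w:
        run = run + 1 if c == "H" else 0
        best = max(best, run)
    return best

omega = list(product("HT", repeat=3))   # 8 equally likely sequences

# Joint PMF p_XY(x, y): count the outcomes mapped to each (x, y) pair.
counts = Counter((w.count("H"), longest_head_run(w)) for w in omega)
p_xy = {xy: Fraction(c, len(omega)) for xy, c in counts.items()}

def marginal_x(x):
    """p_X(x): sum p_XY(x, y) over y."""
    return sum(p for (xv, _), p in p_xy.items() if xv == x)

def marginal_y(y):
    """p_Y(y): sum p_XY(x, y) over x."""
    return sum(p for (_, yv), p in p_xy.items() if yv == y)

print(p_xy)
print(marginal_x(2), marginal_y(1))
```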

### 3.3 Joint and Marginal PDF

If $X$ and $Y$ are continuous random variables with joint distribution function $F_{XY}$, and $F_{XY}(x, y)$ is differentiable everywhere in both $x$ and $y$, then the joint probability density function is:

$f_{XY}(x, y) = \frac{\partial^2F_{XY}(x, y)}{\partial x \partial y}$

Also:

$\int\int_{(x,y)\in A} f_{XY}(x, y)dxdy = P((X, Y) \in A)$.

Note that the values of the probability density function $f_{XY}(x, y)$ are always non-negative, but they may be greater than 1. Nonetheless, it must be the case that $\int_{-\infty}^\infty\int_{-\infty}^\infty f_{XY}(x, y)dxdy = 1$.

Analogous to the discrete case, we define **marginal probability density function** (or marginal density) of X as:

$f_X(x) = \int_{-\infty}^\infty f_{XY}(x,y)dy$

This is also the case with $f_Y(y)$.

### 3.4 Conditional Distributions

What is the probability distribution over $Y$, when we know that $X$ must take on a certain value $x$?

In the discrete case, the conditional PMF of $Y$ given $X$ is simply:

$p_{Y|X}(y|x) = \frac{p_{XY}(x, y)}{p_X(x)}$ 

assuming that $p_X(x) \neq 0$.

In the continuous case, it is more complicated as the probability that a continuous random variable $X$ takes on a specific value $x$ is equal to zero. Ignoring this technical point, we simply define the conditional PDF of $Y$ given $X = x$ as:

$f_{Y|X}(y|x) = \frac{f_{XY}(x, y)}{f_X(x)}$

assuming $f_X(x) \neq 0$.

### 3.5 Chain Rule

The chain rule derived earlier for events also applies to random variables:

$p_{X_1,...,X_n}(x_1, ..., x_n)$

$= p_{X_1}(x_1)p_{X_2|X_1}(x_2|x_1)...p_{X_n|X_1, ..., X_{n-1}}(x_n|x_1,..., x_{n-1})$

### 3.6 Bayes' Rule

For discrete random variables $X$ and $Y$:

$p_{Y|X}(y|x) = \frac{p_{XY}(x,y)}{p_X(x)} = \frac{p_{X|Y}(x|y)p_Y(y)}{\sum_{y'\in Val(Y)}p_{X|Y}(x|y')p_Y(y')}$

For continuous random variables $X$ and $Y$:

$f_{Y|X}(y|x) = \frac{f_{XY}(x, y)}{f_X(x)} = \frac{f_{X|Y}(x|y)f_Y(y)}{\int_{-\infty}^{\infty}f_{X|Y}(x| y')f_Y(y')dy'}$
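The discrete form of the rule is easy to evaluate by hand or in code. The sketch below assumes a toy spam setting with made-up numbers: $Y = 1$ means spam and $X = 1$ means that a particular word appears. `posterior` is a hypothetical helper that implements the right-hand side of the discrete formula.

```python
from fractions import Fraction

# Hypothetical prior p_Y and likelihood p_{X|Y}; the likelihood is keyed by (x, y).
p_y = {1: Fraction(3, 10), 0: Fraction(7, 10)}
p_x_given_y = {
    (1, 1): Fraction(4, 5), (0, 1): Fraction(1, 5),     # word present/absent given spam
    (1, 0): Fraction(1, 10), (0, 0): Fraction(9, 10),   # word present/absent given not spam
}

def posterior(y, x):
    """p_{Y|X}(y | x) via Bayes' rule, normalizing over all values y' of Y."""
    evidence = sum(p_x_given_y[(x, yp)] * p_y[yp] for yp in p_y)
    return p_x_given_y[(x, y)] * p_y[y] / evidence

print(posterior(1, 1))   # probability of spam given that the word appears
```

With these numbers, seeing the word raises the probability of spam from the prior $3/10$ to $24/31$.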

### 3.7 Independence

Two random variables are independent if $F_{XY}(x, y) = F_X(x)F_Y(y)$ for all values of x and y. Equivalently:

* For discrete RV, $p_{XY}(x, y) = p_X(x)p_Y(y)$ for all $x\in Val(X), y\in Val(Y)$
* For discrete RV, $p_{Y|X}(y|x) = p_Y(y)$ for all $y\in Val(Y)$ whenever $p_X(x)\neq 0$
* For continuous RV, $f_{XY}(x, y) = f_X(x)f_Y(y)$ for all $x, y \in \mathbb{R}$
* For continuous RV, $f_{Y|X}(y|x) = f_Y(y)$ for all $y\in\mathbb{R}$ whenever $f_X(x) \neq 0$

Informally, two variables are independent if knowing the value of one of them does not change the conditional probability distribution of the other variable.

#### Lemma 3.1:

If $X$ and $Y$ are independent then for any subsets $A, B \subset \mathbb{R}$, we have:

$P(X\in A, Y\in B) = P(X\in A)P(Y\in B)$

Using this lemma, one can prove that if $X$ is independent of $Y$, then any function of $X$ is independent of any function of $Y$.

### 3.8 Expectation and Co-variance

Suppose that we have two discrete random variables $X$, $Y$ and that $g:\mathbb{R}^2\to\mathbb{R}$ is a function of these two random variables. Then the expected value of $g$ is defined as:

$\mathbb{E}[g(X, Y)] = \sum_{x \in Val(X)}\sum_{y\in Val(Y)} g(x, y)p_{XY}(x, y)$

For continuous random variables $X, Y$, the analogous expression is:

$\mathbb{E}[g(X, Y)] = \int_{-\infty}^{\infty}\int_{-\infty}^{\infty}g(x, y)f_{XY}(x, y)dxdy$.

We can use the concept of expectation to study the relationship of the two random variables with each other. In particular, the co-variance of the two random variables can be defined as:

$Cov[X,Y] = \mathbb{E}[(X-\mathbb{E}[X])(Y-\mathbb{E}[Y])]$

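Using a similar argument as before, we can rewrite this as:

$Cov[X,Y] = \mathbb{E}[(X-\mathbb{E}[X])(Y-\mathbb{E}[Y])]$

$= \mathbb{E}[XY - X\mathbb{E}[Y] - Y\mathbb{E}[X] + \mathbb{E}[X]\mathbb{E}[Y]]$

$= \mathbb{E}[XY] - \mathbb{E}[X]\mathbb{E}[Y] - \mathbb{E}[Y]\mathbb{E}[X] + \mathbb{E}[X]\mathbb{E}[Y]$

$= \mathbb{E}[XY]-\mathbb{E}[X]\mathbb{E}[Y]$.

Here, the key step in showing the equality of the two forms of covariance is in the third equality, where we use the fact that $\mathbb{E}[X]$ and $\mathbb{E}[Y]$ are actually constants which can be pulled out of the expectation. When $Cov[X,Y] = 0$, we say that $X$ and $Y$ are uncorrelated.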

#### _Properties:_

* (Linearity of expectation) $\mathbb{E}[f(X,Y) + g(X, Y)] = \mathbb{E}[f(X,Y)] + \mathbb{E}[g(X, Y)]$
* $Var[X + Y] = Var[X] + Var[Y] + 2Cov[X, Y]$
* If $X$ and $Y$ are independent, then $Cov[X, Y] = 0$
* If $X$ and $Y$ are independent, then $\mathbb{E}[f(X)g(Y)] = \mathbb{E}[f(X)]\mathbb{E}[g(Y)]$
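These properties can be checked exactly by enumeration. The following sketch assumes two independent fair dice, with $X$ the first die, $Y$ the second, and $S = X + Y$ (which is not independent of $X$). The helpers `E`, `cov`, and `var` are illustrative names.

```python
from fractions import Fraction
from itertools import product

omega = list(product(range(1, 7), repeat=2))   # 36 equally likely outcomes
p = Fraction(1, len(omega))

def E(g):
    """Expectation of g(w) under the uniform measure on omega."""
    return sum(g(w) * p for w in omega)

def cov(f, g):
    """Cov[f, g] = E[fg] - E[f]E[g]."""
    return E(lambda w: f(w) * g(w)) - E(f) * E(g)

def var(f):
    return cov(f, f)

X = lambda w: w[0]
Y = lambda w: w[1]
S = lambda w: w[0] + w[1]

print(cov(X, Y))                                   # independent, so zero
print(var(S), var(X) + var(Y) + 2 * cov(X, Y))     # Var[X + Y] identity
print(cov(X, S))                                   # dependent, so nonzero here
```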