# Likelihood: estimating the parameters of a parametric model from observed data

When we describe a phenomenon with a probability model (a probability distribution), we usually do not know all of the model's parameters. We estimate these unknown parameters from observed data. The likelihood function answers the question: for which parameter value does the observed data have the highest probability?

Example: we have a biased coin, and we model it as landing "heads" with an unknown probability $p$. We toss the coin 100 times and get 55 heads. What should $p$ be so that the observed data is most probable?

Under this probability model, $P(55 \text{ heads}) = \binom{100}{55}p^{55}(1-p)^{45}$. Viewed as a function of $p$, this is the likelihood, and it is maximized at $p = 55/100 = 0.55$.
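The coin example can be checked numerically. The sketch below is only an illustration: it replaces the calculus with a brute-force search over a grid of candidate values, and the names `likelihood`, `grid` and `p_hat` are made up for this note.

```python
from math import comb

def likelihood(p: float, n: int = 100, heads: int = 55) -> float:
    """Binomial probability of observing `heads` heads in `n` tosses."""
    return comb(n, heads) * p**heads * (1 - p) ** (n - heads)

# Candidate values 0.00, 0.01, ..., 1.00 (the grid spacing is an arbitrary choice).
grid = [i / 100 for i in range(101)]

# The grid point at which the observed data is most probable.
p_hat = max(grid, key=likelihood)

print(p_hat)  # 0.55
```

A finer grid or a numerical optimizer would be needed when the maximizer does not happen to lie on the grid.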

As another example of a parametric model, the exponential distribution has one parameter $\lambda$, with pdf $f_X(x)=\lambda e^{-\lambda x}$ for $x \ge 0$.
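As a worked instance of maximizing a likelihood, suppose we observe $n$ samples $x_1,\dots,x_n$ and assume (this is an assumption of this sketch) that they are independent draws from an exponential distribution with pdf $\lambda e^{-\lambda x}$, $x \ge 0$. A sketch of the derivation:

$$L(\lambda)=\prod_{i=1}^{n}\lambda e^{-\lambda x_i}=\lambda^{n}e^{-\lambda\sum_{i=1}^{n}x_i}$$
$$\log L(\lambda)=n\log\lambda-\lambda\sum_{i=1}^{n}x_i$$
$$\frac{d}{d\lambda}\log L(\lambda)=\frac{n}{\lambda}-\sum_{i=1}^{n}x_i=0\implies\hat{\lambda}=\frac{n}{\sum_{i=1}^{n}x_i}=\frac{1}{\bar{x}}$$

So, under these assumptions, the most likely rate is the reciprocal of the sample mean $\bar{x}$.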

# $P(A|BC)P(B|C)=P(AB|C)$, $P(A|BC)=\frac{P(B|AC)P(A|C)}{P(B|C)}$
$P(A|BCD)P(B|CD)=P(AB|CD)$
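These identities can be sanity-checked on a small example. The sketch below builds a made-up joint distribution over three binary variables (the weights are arbitrary) and verifies the conditional product rule and the conditional Bayes rule with exact fractions. The helpers `prob` and `cond` are hypothetical names for this note.

```python
from fractions import Fraction
from itertools import product

# Hypothetical joint distribution over three binary variables (A, B, C).
# The weights are arbitrary; dividing by their sum turns them into probabilities.
weights = [1, 2, 3, 4, 5, 6, 7, 8]
total = sum(weights)
joint = {abc: Fraction(w, total)
         for abc, w in zip(product([0, 1], repeat=3), weights)}

NAMES = ("a", "b", "c")

def prob(**fixed):
    """Probability that the named variables take the given values."""
    return sum(p for abc, p in joint.items()
               if all(abc[NAMES.index(k)] == v for k, v in fixed.items()))

def cond(event, given):
    """P(event | given), both passed as dicts such as {"a": 1}."""
    return prob(**event, **given) / prob(**given)

# Conditional product rule: P(A|BC) P(B|C) = P(AB|C)
lhs = cond({"a": 1}, {"b": 1, "c": 1}) * cond({"b": 1}, {"c": 1})
rhs = cond({"a": 1, "b": 1}, {"c": 1})

# Conditional Bayes rule: P(A|BC) = P(B|AC) P(A|C) / P(B|C)
bayes = (cond({"b": 1}, {"a": 1, "c": 1}) * cond({"a": 1}, {"c": 1})
         / cond({"b": 1}, {"c": 1}))

print(lhs == rhs, bayes == cond({"a": 1}, {"b": 1, "c": 1}))  # True True
```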

# Marginalization, Conditioning

p 512 - AIAM

# i.i.d. = independent and identically distributed
Example of identically distributed but non-independent events

Consider an urn with two balls in it, one black and one white. We reach into it and draw out the two balls one after the other, choosing the first one at random (and this of course determines the color of the next ball). Thus, the two equally likely outcomes of the experiment are (White, Black) and (Black, White), and we see that the first ball is equally likely to be Black or White, and so is the second ball. In other words, the events

{first ball drawn is Black}  and  {second ball drawn is Black}

certainly are identically distributed, but they are definitely not independent events. Indeed, if we know that the first event has occurred, we know for sure that the second cannot occur. Thus, while our initial evaluation of the probability of the second event is 1/2, once we know that the first event has occurred, we must revise our assessment of the probability that the second ball drawn is Black from 1/2 to 0.
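The urn argument can be reproduced by enumerating the two outcomes. This is a minimal sketch; the helper names (`prob`, `first_black`, `second_black`) are made up for illustration.

```python
from fractions import Fraction

# The two equally likely outcomes, written as (first ball, second ball).
outcomes = {
    ("White", "Black"): Fraction(1, 2),
    ("Black", "White"): Fraction(1, 2),
}

def prob(event):
    """Total probability of the outcomes for which `event` returns True."""
    return sum((p for outcome, p in outcomes.items() if event(outcome)),
               Fraction(0))

def first_black(outcome):
    return outcome[0] == "Black"

def second_black(outcome):
    return outcome[1] == "Black"

p_first = prob(first_black)
p_second = prob(second_black)
p_both = prob(lambda o: first_black(o) and second_black(o))

# Conditional probability that the second ball is Black given the first is.
p_second_given_first = p_both / p_first

print(p_first, p_second)             # 1/2 1/2  (identically distributed)
print(p_both == p_first * p_second)  # False    (not independent)
print(p_second_given_first)          # 0
```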

# Bayes network stores cond prob to calculate joint prob

When you have a large number of random variables, how can you represent the full joint probability distribution? A full table needs a huge amount of space: $2^n - 1$ numbers for $n$ binary variables.

Answer: a Bayes network. A Bayes network is a graph that represents conditional probabilities; from a Bayes network, one can calculate any joint probability.

A Bayes network is a directed *ACYCLIC* graph (DAG):

- Nodes are random variables $X_1,X_2,\dots$
- An edge from $X_1$ to $X_2$ means $X_1$ directly influences $X_2$, i.e. $X_2$ depends conditionally on $X_1$. If there is no edge between two nodes, neither influences the other directly; they may still be dependent through other nodes.
- Each node $X$ contains a table of conditional probabilities of the form $P(X=x|parent(X)=y)$

***Bayes assumption***:

Each variable is conditionally independent of all its non-descendants in the graph given the value of all its parents. For any non-descendant $Y$ of $X$:
$$X\perp Y|parent(X)\implies P(X|parent(X),Y) = P(X|parent(X))$$
or equivalently $$P(X,Y|parent(X)) = P(X|parent(X))P(Y|parent(X))$$
A consequence of this assumption is that a node is conditionally independent of all other nodes in the network given its parents, children, and children's parents, that is, given its Markov blanket.

***Bayes network property***: $$P(X_1,X_2,\dots,X_n)=\prod_{i=1}^{n}P(X_i|parent(X_i))$$
This property shows how a Bayes network encodes the full joint probability compactly.
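To make the factorization concrete, here is a small sketch with a hypothetical three-node network (Rain -> WetGrass <- Sprinkler). The structure, the probabilities and the names `parents`, `cpt` and `joint` are all invented for illustration. The network stores 1 + 1 + 4 = 6 numbers, and every entry of the full joint table is recovered as a product of table lookups.

```python
from itertools import product

# Hypothetical network: rain -> wet <- sprinkler. All numbers are made up.
parents = {"rain": (), "sprinkler": (), "wet": ("rain", "sprinkler")}

# cpt[node][parent values] = P(node = True | parents = parent values)
cpt = {
    "rain": {(): 0.2},
    "sprinkler": {(): 0.1},
    "wet": {(True, True): 0.99, (True, False): 0.9,
            (False, True): 0.8, (False, False): 0.05},
}

def joint(assignment):
    """P(X_1, ..., X_n) as the product of P(X_i | parent(X_i))."""
    result = 1.0
    for node, value in assignment.items():
        parent_values = tuple(assignment[p] for p in parents[node])
        p_true = cpt[node][parent_values]
        result *= p_true if value else 1.0 - p_true
    return result

# Summing the factorized joint over all 2^3 assignments should give 1.
total = sum(joint(dict(zip(parents, values)))
            for values in product([True, False], repeat=len(parents)))

# P(rain, no sprinkler, wet) = 0.2 * (1 - 0.1) * 0.9
print(joint({"rain": True, "sprinkler": False, "wet": True}))
print(total)
```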

Each node $X$ contains a table of conditional probabilities of the form $P(X=x|parent(X)=y)$ which quantifies the influence of $parent(X)$ on $X$

***The difficulty is to calculate the posterior $P(parent(X)=y|X=x)$***

Note that from the table of $X$ and the distribution of $parent(X)$, the joint probability $P(X=x,parent(X)=y)=P(X=x|parent(X)=y)P(parent(X)=y)$ can be calculated.
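For a single parent with few values, the posterior can be obtained by computing the joint and normalizing it (Bayes' rule). The sketch below uses a hypothetical two-node network Y -> X with invented numbers. In a large network the normalizing sum ranges over exponentially many configurations, which is where the difficulty comes from.

```python
# Hypothetical two-node network Y -> X with binary variables; numbers are made up.
p_y = {True: 0.3, False: 0.7}           # prior P(parent(X) = y)
p_x_given_y = {True: 0.9, False: 0.2}   # P(X = True | parent(X) = y)

def joint(x, y):
    """P(X = x, parent(X) = y) = P(X = x | parent(X) = y) P(parent(X) = y)."""
    p_x_true = p_x_given_y[y]
    return (p_x_true if x else 1.0 - p_x_true) * p_y[y]

def posterior(y, x):
    """P(parent(X) = y | X = x), obtained by normalizing the joint over y."""
    evidence = sum(joint(x, y_value) for y_value in p_y)
    return joint(x, y) / evidence

# P(Y = True | X = True) = 0.27 / (0.27 + 0.14)
print(posterior(True, True))
```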

# Markov chain

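Formal def (not accurate yet)
- a stochastic process is a family of random variables defined on the same sample space.
- a Markov chain is a sequence of random variables $\{X_i, i=0,1,2,\dots\}$ in which $X_{n+1}$ depends on the past only through $X_n$: $P(X_{n+1}|X_n,\dots,X_0)=P(X_{n+1}|X_n)$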

A Markov chain is a process with a finite number of states in which the probability of being in a particular state at step n + 1 depends only on the state occupied at step n.

![title](ProbStat_res/markov.png)

![title](ProbStat_res/markov2.png)

So $\vec{p}_n = P^n \vec{p}_0$

We also call the probability vectors $\vec{p}_0,\vec{p}_{n},\vec{p}_{n+1}$ probability distributions.

The chain becomes stationary at step n if the probability distribution stops changing: $\vec{p}_n = \vec{p}_{n+1} = \vec{p}_{n+2} = \dots$

(in other words, $P\vec{p}_n = \vec{p}_n$)
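The relation $\vec{p}_n = P^n \vec{p}_0$ and the stationarity condition can be tried out on a toy chain. The sketch below assumes the convention implied by $P\vec{p}_n$: distributions are column vectors and each column of $P$ sums to 1. The 2-state matrix and the helper names are invented for illustration.

```python
# Hypothetical 2-state chain. Column j holds the probabilities of leaving
# state j, so each column sums to 1 and p_{n+1} = P p_n.
P = [[0.9, 0.5],
     [0.1, 0.5]]

def step(P, p):
    """One step of the chain: the matrix-vector product P p."""
    return [sum(P[i][j] * p[j] for j in range(len(p))) for i in range(len(P))]

def distribution_after(P, p0, n):
    """p_n = P^n p_0, computed by applying `step` n times."""
    p = list(p0)
    for _ in range(n):
        p = step(P, p)
    return p

p0 = [1.0, 0.0]                        # start in state 0 with certainty
p100 = distribution_after(P, p0, 100)

print(p100)            # close to the stationary distribution [5/6, 1/6]
print(step(P, p100))   # one more step leaves it essentially unchanged
```

For this particular matrix the stationary distribution can be checked by hand: $0.9 \cdot \frac{5}{6} + 0.5 \cdot \frac{1}{6} = \frac{5}{6}$.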

# Markov chain Monte Carlo for approximating posterior probability

In Bayesian inference, given data $X$ and parameters $\theta$, the posterior

$$p(\theta \mid X)=\frac{p(X \mid \theta)\,p(\theta)}{\int p(X \mid \theta)\,p(\theta)\,d\theta}$$

is generally unavailable in closed form, and we must rely on other methods to perform inference. MCMC methods approach this problem by simulating a Markov chain whose stationary distribution is the desired posterior, $p(\theta \mid X)$.
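As one concrete instance of the idea, the sketch below runs a random-walk Metropolis sampler (one of several MCMC algorithms, chosen here only for brevity) on the coin data of 55 heads and 45 tails. It assumes a uniform prior on $\theta$. The step size, seed, burn-in length and function names are arbitrary choices for illustration. Only the numerator $p(X \mid \theta)\,p(\theta)$ is evaluated; the integral in the denominator cancels in the acceptance ratio.

```python
import math
import random

def log_unnormalized_posterior(theta, heads=55, tails=45):
    """log of p(X | theta) p(theta) for a coin, with a uniform prior on (0, 1)."""
    if not 0.0 < theta < 1.0:
        return -math.inf
    return heads * math.log(theta) + tails * math.log(1.0 - theta)

def metropolis(n_samples, step_size=0.1, start=0.5, seed=0):
    """Random-walk Metropolis sampler targeting the posterior of theta."""
    rng = random.Random(seed)
    theta = start
    log_p = log_unnormalized_posterior(theta)
    samples = []
    for _ in range(n_samples):
        proposal = theta + rng.gauss(0.0, step_size)
        log_p_new = log_unnormalized_posterior(proposal)
        # Accept with probability min(1, p_new / p_old).
        if rng.random() < math.exp(min(0.0, log_p_new - log_p)):
            theta, log_p = proposal, log_p_new
        samples.append(theta)
    return samples

samples = metropolis(20000)
kept = samples[2000:]                  # discard early samples as burn-in
posterior_mean = sum(kept) / len(kept)
print(posterior_mean)                  # estimate of the posterior mean of theta
```

With a uniform prior this posterior happens to be available in closed form (a Beta(56, 46) distribution with mean $56/102$), which makes it a convenient test case for the sampler.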

# Markov blanket, Markov boundary

In statistics and machine learning, when one wants to infer a random variable from a set of other variables, a subset is usually enough, and the remaining variables add no further information. Such a subset that contains all the useful information is called a Markov blanket. If a Markov blanket is minimal, meaning that no variable can be dropped from it without losing information, it is called a Markov boundary.

In a Bayesian network, the Markov boundary of node A includes its parents, children and the other parents of all of its children.

![title](ProbStat_res/Markov_blanket.png)
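The rule "parents, children, and the other parents of the children" can be computed directly from the graph. The sketch below assumes the DAG is stored as a dictionary that maps each node to the list of its parents. The example graph and the function name `markov_blanket` are hypothetical.

```python
def markov_blanket(parents, node):
    """Parents, children, and the children's other parents of `node`.

    `parents` maps every node of a DAG to the list of its parents.
    """
    children = [child for child, ps in parents.items() if node in ps]
    blanket = set(parents[node]) | set(children)
    for child in children:
        blanket |= set(parents[child])
    blanket.discard(node)              # a node is not part of its own blanket
    return blanket

# Hypothetical DAG with edges B -> A, A -> C, D -> C, C -> E.
dag = {"A": ["B"], "B": [], "C": ["A", "D"], "D": [], "E": ["C"]}

print(sorted(markov_blanket(dag, "A")))  # ['B', 'C', 'D']
```

Here `D` belongs to the blanket of `A` only because the two share the child `C`.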