# Bayesian Machine Learning

---

## Prior and Posterior Probability
- Two different approaches in statistics: Bayesianism and Frequentism
- Bayesian modeling encodes initial beliefs as probabilities and updates them as new data is observed. It is useful when data is limited, and it assigns probabilities to hypotheses based on belief.
- Frequentist statistics treats the probability of an event as its long-run frequency over many repeated trials, with no prior beliefs
- Prior: beliefs about a hypothesis before observing $x$
- Posterior: probability computed after observing $x$
- Stochastic event: a random variable $x$ is a function of both time and outcome

## Bayes’ Rule
- Bayes’ rule: inference about a hypothesis from data
- The hypothesis space is the set of all concepts or functions that the learning algorithm is allowed to select as the solution to the problem, e.g. “coin is fair”...
- When making prediction for new data, consider effect of measured data on hypothesis about data:

$$ P(Hypothesis|Data) = \frac{P(Data|Hypothesis)P(Hypothesis)}{P(Data)} $$
where $P(Hypothesis)$ is the **Prior** and $P(Hypothesis|Data)$ is the **Posterior**


## Prior and Parametric Models
- The prior in Bayesian statistics is $P(hypothesis)$, where the hypothesis is controlled by a parameter vector $\theta$
- Parametric models in ML assume some finite set of parameters $\theta$ that capture everything there is to know about the data

$$P(X|\theta, \{x_1, \dots, x_m\}) = P(X|\theta) = P(Data|Hypothesis)$$

- The goal is to find the posterior distribution $P(Hypothesis|Data)$

## Likelihood
- What if multiple hypotheses (each with its own prior) are consistent with the evidence?
- The likelihood is defined as $p(X|\theta)$
- Example: $X=[16,8,2,64]$ given two hypotheses: “power of two” and “even numbers”

$$p(X|\theta) = \left(\frac{1}{\text{size}(\text{hypothesis})}\right)^n = \left(\frac{1}{|\theta|}\right)^n$$
where $n$ is the number of samples and $|\theta|$ is the number of values the hypothesis allows

- Models favor the simplest hypothesis consistent with the data (Occam’s razor principle)
    - "Power of two" is the smaller hypothesis, so it assigns the observed data a higher likelihood than "even numbers" and is the one favored here
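
The likelihood formula above can be checked numerically. The sketch below assumes the numbers are drawn uniformly from 1 to 100; that range is not stated in the notes and is chosen only for illustration, as are the names `DOMAIN`, `hypotheses`, and `likelihood`.

```python
from fractions import Fraction

# Assumption (not in the notes): samples come from the integers 1..100.
DOMAIN = range(1, 101)

hypotheses = {
    # n is a power of two exactly when it has a single bit set.
    "power of two": {n for n in DOMAIN if (n & (n - 1)) == 0},
    "even numbers": {n for n in DOMAIN if n % 2 == 0},
}


def likelihood(data, members):
    """p(X | theta) = (1 / |theta|)^n if every sample fits the hypothesis, else 0."""
    if not all(x in members for x in data):
        return Fraction(0)
    return Fraction(1, len(members)) ** len(data)


X = [16, 8, 2, 64]
scores = {name: likelihood(X, members) for name, members in hypotheses.items()}

for name, score in scores.items():
    size = len(hypotheses[name])
    print(f"{name}: |theta| = {size}, likelihood = {float(score):.3e}")
```

Under this assumed range the "power of two" hypothesis has 7 members and "even numbers" has 50, so the smaller hypothesis receives the larger likelihood for the same four samples.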

## Parametric Models
- Parametric models assume some finite set of parameters $\theta$. Given the parameters, future predictions, $x$, are independent of the observed data, $X$:

$$P(x|\theta, X, m) = P(x|\theta, m)$$

- Therefore $\theta$ captures everything there is to know about the data.
- So the complexity of the model is bounded even if the amount of data is unbounded. This makes parametric models **not** very flexible.

## Bayesian Machine Learning
- $P(\theta|m)$ is the prior probability of $\theta$ in model $m$
- $P(X|\theta, m)$ is the likelihood of parameters $\theta$ in model $m$
- The posterior probability of $\theta$ given data $X$ is:

$$ P(\theta|X,m) = \frac{P(X|\theta,m)P(\theta|m)}{P(X|m)}$$

- Prediction:

$$P(x|X,m) = \int P(x|\theta,X,m)P(\theta|X,m)d\theta$$

## Example...
- Suppose that we have two bags each containing black and white balls.
- One bag contains three times as many white balls as blacks. The other bag contains three times as many black balls as white.
- Suppose we choose one of these bags at random. For this bag we select five balls at random, replacing each ball after it has been selected. The result is that we find 4 white balls and one black.
- What is the probability that we were using the bag with mainly white balls?

$$ P(A|B) = \frac{P(B|A)P(A)}{\sum_i P(B|A_i)P(A_i)} $$
where
$$ P(B) = \sum_i P(B, A_i) = \sum_i P(B|A_i)P(A_i) $$

## Solution 
- Let $A$ be the random variable "bag chosen", so $A \in \{a_1, a_2\}$ where $a_1$ represents "bag with mostly white balls" and $a_2$ represents "bag with mostly black balls".
- We know that $P(a_1)=P(a_2)=1/2$ since we choose the bag at random.
- Let $B$ be the event "4 white balls and one black ball chosen from 5 selections".
- For the bag with mostly white balls, the probability of a ball being white is 3⁄4 and the probability of a ball being black is 1⁄4 (and the reverse for the other bag). Because each ball is replaced, the draws are independent and we can use the **binomial distribution** to compute $P(B|a_1)$ and $P(B|a_2)$ as:

$$ P(B|a_1) = \binom{5}{4} \left(\frac{3}{4}\right)^4 \left(\frac{1}{4}\right)^1 = \frac{405}{1024} $$

$$ P(B|a_2) = \binom{5}{4} \left(\frac{1}{4}\right)^4 \left(\frac{3}{4}\right)^1 = \frac{15}{1024} $$

- Then calculate $P(a_1|B)$ from Bayes’ rule (the equal priors of $\frac{1}{2}$ cancel):

$$ P(a_1|B) = \frac{\frac{405}{1024}\cdot\frac{1}{2}}{\frac{405}{1024}\cdot\frac{1}{2}+\frac{15}{1024}\cdot\frac{1}{2}} = \frac{405}{420} \approx 0.964 $$
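
The arithmetic in this solution can be reproduced with exact fractions. This is a minimal sketch using only the Python standard library; the helper name `binomial_pmf` and the dictionary layout are illustrative choices, not part of the original notes.

```python
from fractions import Fraction
from math import comb


def binomial_pmf(k, n, p):
    """Probability of k successes in n independent draws with success probability p."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


# Equal priors: the bag is chosen at random.
prior = {"a1": Fraction(1, 2), "a2": Fraction(1, 2)}
# Probability of drawing a white ball from each bag.
p_white = {"a1": Fraction(3, 4), "a2": Fraction(1, 4)}

# P(B | a_i): 4 white balls out of 5 draws with replacement.
likelihood = {bag: binomial_pmf(4, 5, p) for bag, p in p_white.items()}

# P(B) = sum_i P(B | a_i) P(a_i)
evidence = sum(likelihood[bag] * prior[bag] for bag in prior)

# Bayes' rule for each bag.
posterior = {bag: likelihood[bag] * prior[bag] / evidence for bag in prior}

print(posterior["a1"], float(posterior["a1"]))
```

Keeping the values as `Fraction` objects avoids rounding, so the result can be compared directly with the hand calculation.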

---

## Bayesian Statistics
- In general, before observing the data we represent our knowledge of $\theta$ using the prior probability distribution $p(\theta)$, typically chosen with high entropy to reflect a high degree of uncertainty
- For a set of data samples $X = \{x^{(1)}, x^{(2)}, \dots, x^{(m)}\}$, consider their effect on the hypothesis about $\theta$ by combining the data likelihood $p(x^{(1)}, x^{(2)}, \dots, x^{(m)}|\theta)$ with the prior via Bayes’ rule:

$$ p(\theta|x^{(1)} , x^{(2)}, ..., x^{(m)}) = \frac{p(x^{(1)} , x^{(2)}, ..., x^{(m)}|\theta)p(\theta)}{p(x^{(1)} , x^{(2)}, ..., x^{(m)})} $$

- After observing $m$ examples the predicted distribution over the next data sample $x^{(m+1)}$ is:

$$ p(x^{(m+1)} | x^{(1)}, x^{(2)}, \dots, x^{(m)}) = \int p(x^{(m+1)}|\theta)\, p(\theta| x^{(1)}, x^{(2)}, \dots, x^{(m)})\, d\theta $$
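
To make the predictive integral concrete, here is a hedged sketch that approximates it on a grid. It assumes a coin-flip (Bernoulli) model with unknown heads probability $\theta$ and a flat prior on $(0, 1)$; neither the model nor the function name `posterior_predictive_heads` comes from the notes.

```python
def posterior_predictive_heads(heads, flips, grid_size=10_000):
    """Approximate p(next flip is heads | data) by summing over a grid of theta values."""
    # Midpoints of equal-width cells covering (0, 1).
    thetas = [(i + 0.5) / grid_size for i in range(grid_size)]
    # Unnormalized posterior: likelihood times the assumed flat prior p(theta) = 1.
    weights = [t**heads * (1 - t) ** (flips - heads) for t in thetas]
    total = sum(weights)
    # Integral of p(x = heads | theta) * p(theta | data) d(theta), with p(heads | theta) = theta.
    return sum(t * w for t, w in zip(thetas, weights)) / total


pred = posterior_predictive_heads(heads=4, flips=5)
print(f"p(heads next | 4 heads in 5 flips) ~ {pred:.4f}")
```

For this particular model the integral also has a closed form, $(k+1)/(m+2)$ for $k$ heads in $m$ flips, which gives $5/7$ here and provides a check on the grid approximation.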

## Max A-Posteriori Estimation
- The MAP estimate chooses the point of maximal posterior probability (or maximal probability density in the more common case of continuous $\theta$)

$$\theta_{MAP} = \arg\max_{\theta}\, p(\theta|x) = \arg\max_{\theta} \left[ \log p(x|\theta) + \log p(\theta) \right]$$

- Here $\log p(x|\theta)$ is the standard log-likelihood term and $\log p(\theta)$ corresponds to the prior distribution
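
As an illustration of the MAP objective, the sketch below maximizes $\log p(x|\theta) + \log p(\theta)$ by brute-force search over a grid. It assumes a coin-flip likelihood and a Beta$(a, b)$ prior on $\theta$; both are illustrative assumptions, and `map_estimate` is a hypothetical helper name.

```python
import math


def map_estimate(heads, flips, a=3.0, b=3.0, grid_size=999):
    """Grid search for argmax_theta [log p(x | theta) + log p(theta)]."""
    best_theta, best_score = None, -math.inf
    for i in range(1, grid_size + 1):
        theta = i / (grid_size + 1)  # stays strictly inside (0, 1)
        # Bernoulli log-likelihood of the observed flips.
        log_lik = heads * math.log(theta) + (flips - heads) * math.log(1 - theta)
        # Beta(a, b) log-prior, up to a constant that does not affect the argmax.
        log_prior = (a - 1) * math.log(theta) + (b - 1) * math.log(1 - theta)
        score = log_lik + log_prior
        if score > best_score:
            best_theta, best_score = theta, score
    return best_theta


theta_map = map_estimate(heads=4, flips=5)               # Beta(3, 3) prior
theta_ml = map_estimate(heads=4, flips=5, a=1.0, b=1.0)  # flat prior: MAP equals ML
print(f"MAP: {theta_map:.3f}, ML: {theta_ml:.3f}")
```

With the assumed Beta(3, 3) prior the estimate is pulled from the maximum-likelihood value $4/5$ toward $1/2$, landing near $2/3$.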

## Probabilistic Models
- They incorporate random variables and produce probability distributions as their outcome
- They faithfully represent uncertainty in model structure and parameters, as well as noise in the data

## Gaussian Mixture Model (GMM)
- Mixture of distributions: $P(x) = \sum_i P(c=i)P(x|c=i)$, where $P(c)$ is a multinoulli (categorical) distribution over component identities
- In a Gaussian Mixture Model the components $p(x|c=i)$ are Gaussians; each component has a separately parametrized mean $\mu^{(i)}$ and covariance matrix $\Sigma^{(i)}$ (capital sigma), which reduces to a scalar variance in one dimension
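
The mixture formula can be evaluated directly in one dimension. The following is a minimal sketch with made-up mixing weights, means, and variances (none of these numbers come from the notes); it evaluates the density and checks numerically that it integrates to about 1.

```python
import math


def gaussian_pdf(x, mean, variance):
    """Density of a one-dimensional Gaussian."""
    return math.exp(-((x - mean) ** 2) / (2 * variance)) / math.sqrt(2 * math.pi * variance)


def gmm_pdf(x, weights, means, variances):
    """P(x) = sum_i P(c = i) * p(x | c = i) with Gaussian components."""
    return sum(w * gaussian_pdf(x, m, v) for w, m, v in zip(weights, means, variances))


# Hypothetical two-component mixture; weights play the role of P(c = i).
weights = [0.3, 0.7]
means = [-2.0, 1.0]
variances = [0.5, 1.5]

# Riemann sum over [-15, 15] as a sanity check that the density is normalized.
step = 0.01
area = sum(gmm_pdf(-15 + step * i, weights, means, variances) * step for i in range(3001))
print(f"density at 0: {gmm_pdf(0.0, weights, means, variances):.4f}, area: {area:.4f}")
```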

## Probabilistic Graphical Model
- Probabilistic graphical models represent large joint distributions compactly using a set of “local” relationships specified by a graph.
- Each random variable in our model corresponds to a graph node.
- There are directed/undirected edges between the nodes which tell us qualitatively about the factorization of the joint probability.
- There are functions stored at the nodes which tell us the quantitative details of the pieces into which the distribution factors.
- Directed graphical models are also known as Bayes(ian) (Belief) Net(work)s

## Directed Graphical Model
- Consider directed acyclic graphs over $n$ variables.
- Each node $i$ has a set of parents $\pi_i$
- Each node maintains a function $f_i(x_i; x_{\pi_i})$ such that
$$ f_i \ge 0 \quad \text{and} \quad \sum_{x_i} f_i(x_i; x_{\pi_i}) = 1 $$
- Define the joint probability to be
$$ P(x_1, x_2, \dots, x_n) = \prod_i f_i(x_i; x_{\pi_i}) = \prod_i P(x_i|x_{\pi_i}) $$
- This factorizes the joint in terms of local conditional probabilities. The representation size is exponential in the “fan-in” (number of parents) of each node instead of in the total number of variables $n$
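
As a small illustration of this factorization, the sketch below builds a three-node directed graph (rain and sprinkler are both parents of wet grass) and multiplies the local conditional probabilities. The graph, the node names, and every probability value are hypothetical and chosen only for the example.

```python
from itertools import product

# Parents of each node in a hypothetical DAG: rain -> wet <- sprinkler.
parents = {"rain": (), "sprinkler": (), "wet": ("rain", "sprinkler")}

# cpt[node][parent values] = P(node = True | parents); numbers are made up.
cpt = {
    "rain": {(): 0.2},
    "sprinkler": {(): 0.1},
    "wet": {
        (False, False): 0.01,
        (False, True): 0.8,
        (True, False): 0.9,
        (True, True): 0.99,
    },
}


def joint(assignment):
    """P(x_1, ..., x_n) as the product of local conditionals P(x_i | parents of x_i)."""
    prob = 1.0
    for node, pa in parents.items():
        p_true = cpt[node][tuple(assignment[p] for p in pa)]
        prob *= p_true if assignment[node] else 1.0 - p_true
    return prob


nodes = list(parents)
total = sum(
    joint(dict(zip(nodes, values)))
    for values in product([False, True], repeat=len(nodes))
)
print(f"sum over all assignments: {total:.6f}")
```

Here the full joint over three binary variables would need 7 free numbers, while the factorized form stores 1 + 1 + 4 = 6; the saving grows quickly as more variables with few parents are added.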

## Goal of Graphical Model
- Graphical models aim to provide compact factorizations of large joint probability distributions.
- These factorizations are achieved using local functions which exploit conditional independencies in the models.
- The graph tells us a basic set of conditional independencies that must be true.
- These independencies are crucial to developing efficient algorithms valid for all numerical settings of the local functions.
- Local functions tell us the quantitative details of the distribution.

---

## Classification
- Given: examples of a discrete class label $y$ and some features $x$.
- Goal: compute label $(y)$ for new inputs $x$.
- Two approaches:
    - Generative: model $p(x, y) = p(y)p(x|y)$; use Bayes’ rule to infer conditional $p(y|x)$.
    - Discriminative: model discriminants $f(y|x)$ directly and take max.
- The generative approach is related to conditional density estimation, while the discriminative approach is closer to regression.

## Probabilistic Classification: Bayes Classifiers
- Generative model: $p(x, y) = p(y)p(x|y)$
    - $p(y)$ are called class priors.
    - $p(x|y)$ are called class conditional feature distributions.
- For the prior we use a Bernoulli or multinomial:
    - $p(y = k|\pi) = \pi_k \quad \text{with} \quad \sum_k \pi_k = 1$

## Bayes Classifiers
- Classification rules:
    - ML: $\arg\max_y p(x|y)$ (can behave badly if the class priors are skewed)
    - MAP: $\arg\max_y p(y|x) = \arg\max_y [\log p(x|y) + \log p(y)]$ (safer)
- Fitting:
    1. Sort data into batches by class label.
    2. Estimate $p(y)$ by counting size of batches (plus regularization).
    3. Estimate $p(x|y)$ separately within each batch using ML.
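
The fitting steps and the two classification rules can be sketched as follows. This example assumes a single real-valued feature with a Gaussian class conditional per class and add-one smoothing on the class counts; these modeling choices, the toy data, and the function names are illustrative assumptions rather than anything specified in the notes.

```python
import math
from collections import defaultdict


def fit(xs, ys, alpha=1.0):
    """Return {label: (prior, mean, variance)} estimated per class batch."""
    batches = defaultdict(list)
    for x, y in zip(xs, ys):  # step 1: sort data into batches by label
        batches[y].append(x)
    n, k = len(xs), len(batches)
    model = {}
    for label, batch in batches.items():
        prior = (len(batch) + alpha) / (n + alpha * k)  # step 2: smoothed counts
        mean = sum(batch) / len(batch)                  # step 3: ML Gaussian fit
        var = sum((x - mean) ** 2 for x in batch) / len(batch)
        model[label] = (prior, mean, var)
    return model


def log_gaussian(x, mean, var):
    return -0.5 * math.log(2 * math.pi * var) - (x - mean) ** 2 / (2 * var)


def classify(model, x, use_prior=True):
    """MAP rule when use_prior is True, ML rule otherwise."""
    def score(label):
        prior, mean, var = model[label]
        s = log_gaussian(x, mean, var)
        return s + math.log(prior) if use_prior else s
    return max(model, key=score)


# Hypothetical skewed data: class "a" is ten times more common than class "b".
xs = [-1.0, -0.5, 0.0, 0.5, 1.0] * 4 + [2.0, 3.0]
ys = ["a"] * 20 + ["b"] * 2
model = fit(xs, ys)
print("ML:", classify(model, 1.6, use_prior=False), "MAP:", classify(model, 1.6))
```

At the test point 1.6 the two rules disagree on this toy data: ignoring the skewed priors favors the rare class, while the MAP rule accounts for how uncommon it is.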

## Three Key Regularization Ideas
- To avoid overfitting, put priors on the parameters of the class and class conditional feature distributions.
- Tie some parameters together so that fewer of them are estimated using more data.
- Make factorization or independence assumptions about the distributions. In particular, for the class conditional distributions assume the features are fully dependent, partly dependent, or independent

## Naive Bayes Classifier
- Assumption: conditioned on the class, the attributes are independent: $P(x_1, x_2, \dots, x_n|y) = \prod_i P(x_i|y)$
- Algorithm: sort data cases into bins according to their label $y_n$. Compute the marginal probabilities $p(y = c)$ using frequencies.
- For each class, estimate the distribution of the $i^{th}$ variable: $p(x_i|y = c)$.
- At test time, compute $c(x) = \arg\max_c p(c|x) = \arg\max_c [\log p(x|c) + \log p(c)]$, where $\log p(x|c) = \sum_i \log p(x_i|c)$
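
A compact sketch of the algorithm above, assuming binary (0/1) features with a Bernoulli distribution per feature and add-one smoothing on the feature counts. The smoothing, the toy data, the labels, and the function names are assumptions made for the example.

```python
import math
from collections import Counter, defaultdict


def fit_naive_bayes(X, y, alpha=1.0):
    """Estimate log p(y = c) and p(x_i = 1 | y = c) from binary feature rows."""
    class_counts = Counter(y)
    n_features = len(X[0])
    ones = defaultdict(lambda: [0] * n_features)
    for row, label in zip(X, y):  # bin the data cases by class label
        for i, value in enumerate(row):
            ones[label][i] += value
    log_prior = {c: math.log(count / len(y)) for c, count in class_counts.items()}
    # Smoothed per-class, per-feature Bernoulli parameters.
    p_one = {
        c: [(ones[c][i] + alpha) / (class_counts[c] + 2 * alpha) for i in range(n_features)]
        for c in class_counts
    }
    return log_prior, p_one


def predict(model, row):
    """argmax_c [log p(c) + sum_i log p(x_i | c)]."""
    log_prior, p_one = model

    def score(c):
        return log_prior[c] + sum(
            math.log(p if v else 1 - p) for v, p in zip(row, p_one[c])
        )

    return max(log_prior, key=score)


# Hypothetical toy data: three binary features per example.
X = [[1, 1, 0], [1, 0, 0], [1, 1, 1], [0, 0, 1], [0, 1, 1], [0, 0, 0]]
y = ["spam", "spam", "spam", "ham", "ham", "ham"]
model = fit_naive_bayes(X, y)
print(predict(model, [1, 1, 0]), predict(model, [0, 0, 1]))
```

Working in log space turns the product over features into a sum, which avoids numerical underflow when there are many features.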

---