# 27. Lecture 15. Generative Models

## 27.1. Objectives

* Understand what Generative Models are and how they work
* Understand estimation and prediction phases of generative models
* Derive a relation connecting generative and discriminative models
* Derive **Maximum Likelihood Estimates (MLE)** for multinomial and Gaussian generative models

## 27.2. Generative vs Discriminative models


Discriminative models | Generative models
--- | ---
learn the **decision boundary** between the classes | model the **probability distribution** of each class
need labeled data, more data | unlabeled data
Ex: classification, find a separator (decision boundary) | Find structure in a group of data, from a probabilistic view
. | KNN is an example of structured data

![Discriminative vs Generative](img/27.L15.1.png)

Two types of generative models:
1. Multinomial models
2. Gaussian models

Two questions about the models:
- **Estimation**: How to estimate the model?
- **Prediction**: ..

Fit a probability distribution to each of the + and - classes.

## 27.3. Simple Multinomial Generative model

Example: text documents.  
The model generates a document by picking one word at a time, rather than producing a vector of fixed length.




Let:
* $M$: a multinomial model (used to generate the text of documents)
* $W$: a vocabulary
* $p(w|\theta) = \theta_w$: the probability of selecting the word $w$ out of all the words in $W$, where $\theta$ is the parameter of the model

To have a valid probability distribution, the constraints are:
* $ \theta_w \ge 0 $  
  (each probability is greater than or equal to 0; it can't be negative)
* $ \sum_{w \in W} \theta_w = 1 $  
  (the sum of all probabilities is equal to 1)
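
As a rough illustration of "picking a word at a time", the sketch below checks the two constraints and then samples a short document. The function names `validate` and `generate_document`, the two-word vocabulary and the parameter values are all hypothetical, chosen only for this example.

```python
import random

def validate(theta, tol=1e-9):
    """Check the two constraints: non-negative entries that sum to 1."""
    if any(p < 0 for p in theta.values()):
        raise ValueError("every theta_w must be >= 0")
    if abs(sum(theta.values()) - 1.0) > tol:
        raise ValueError("the theta_w must sum to 1")

def generate_document(theta, length, seed=None):
    """Draw `length` words independently, each with probability theta[w]."""
    validate(theta)
    rng = random.Random(seed)
    words = list(theta)
    weights = [theta[w] for w in words]
    return rng.choices(words, weights=weights, k=length)

theta = {"cat": 0.3, "dog": 0.7}   # hypothetical parameter values
doc = generate_document(theta, length=5, seed=0)
print(doc)
```

Each call to `generate_document` draws the words independently, so the order of the words carries no information in this model.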


Notes:
* Why is this model called "multinomial" generative model?
  * because of the number of outcomes. If there are two outcomes it is binomial. In the context of the example, there are many words (more than two), so it's called multinomial. 

## 27.4. Likelihood function

The probability $p$ of generating a document $D$ can be written in two equivalent ways:  
1. as the product of the probabilities of picking each of the $n$ words of $D$
$$ p(D | \theta) = \prod_{i=1}^n \theta_{w_i} $$  
2. or as the product, over the vocabulary, of each word's probability raised to the number of times the word occurs in $D$
$$ p(D | \theta) = \prod_{w \in W} \theta_{w}^{count(w)} $$

#TODO what happens if prob = 0: if $\theta_w = 0$ for a word that occurs in $D$, the whole product is 0, so $p(D | \theta) = 0$

#### Example

Let:
* The vocabulary $W = \{ "cat", "dog" \}$
* The model $M_1$ with parameter $\theta_1: \theta_{cat}=0.3, \theta_{dog}=0.7 $
* The model $M_2$ with parameter $\theta_2: \theta_{cat}=0.9, \theta_{dog}=0.1 $

A document $D = \{ cat, cat, dog \}$ 

Compute the probability that each model generates the document (a product, not a sum):
* $ p(D | \theta_1) = .3^2 \times .7 = 0.063 $
* $ p(D | \theta_2) = .9^2 \times .1 = 0.081 $

$M_2$ is therefore more likely than $M_1$ to have generated $D$.
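
The count-based product form of the likelihood can be sketched in a few lines. The function name `likelihood` is a hypothetical helper, and the models are represented as plain dictionaries for convenience.

```python
from collections import Counter
from math import prod

def likelihood(document, theta):
    """p(D | theta) = product over w of theta[w] ** count(w)."""
    counts = Counter(document)
    return prod(theta[w] ** c for w, c in counts.items())

D = ["cat", "cat", "dog"]
theta_1 = {"cat": 0.3, "dog": 0.7}
theta_2 = {"cat": 0.9, "dog": 0.1}

p1 = likelihood(D, theta_1)   # 0.3**2 * 0.7
p2 = likelihood(D, theta_2)   # 0.9**2 * 0.1
print(p1, p2)
```

Up to floating-point rounding this gives about 0.063 for the first model and 0.081 for the second.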


Note:
* In the example above, each word is generated independently of all the other words.
* For natural languages (like English), this assumption is not realistic: the probability that a word appears depends on the surrounding words, the rules of grammar, sentence structure, etc.

## 27.5. Maximum Likelihood Estimate

How can we use the training data to find the parameter that best fits the data?

Find the $\theta$ that maximises the likelihood:
$$\max_{\theta} \ p(D|\theta)$$ 
$$\max_{\theta} \prod_{w \in W} \theta_w^{count(w)}$$ 

This is equivalent to (and easier than) maximising the log of the product, which becomes a sum:

$$\max_{\theta} \sum_{w \in W} \log(\theta_w^{count(w)}) $$
$$\max_{\theta} \sum_{w \in W} count(w) * \log(\theta_w) $$


Note:
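* Here, $\log$ denotes the natural logarithm (base $e$), also written $\ln$: $\ln x = \log_e x$. The derivatives below rely on this.
* In other contexts, log often refers to the logarithm of base 10, or common logarithm. The choice of base does not change the maximising $\theta$, since logarithms in different bases differ only by a positive constant factor.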
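
A small sketch of the log form, using a hypothetical helper `log_likelihood` and illustrative two-word models. It is meant to show that taking the log keeps the ranking of the models and stays usable on long documents, where the plain product would underflow.

```python
from collections import Counter
from math import log, exp

def log_likelihood(document, theta):
    """log p(D | theta) = sum over w of count(w) * log(theta[w])."""
    counts = Counter(document)
    # Note: log(0) raises an error, so every word in the document needs theta[w] > 0.
    return sum(c * log(theta[w]) for w, c in counts.items())

D = ["cat", "cat", "dog"]
theta_1 = {"cat": 0.3, "dog": 0.7}
theta_2 = {"cat": 0.9, "dog": 0.1}

ll1 = log_likelihood(D, theta_1)
ll2 = log_likelihood(D, theta_2)
# exp() recovers the plain likelihood; the ordering of the two models is unchanged.
print(ll1, exp(ll1))
print(ll2, exp(ll2))

# On a 3000-word document the plain product is far below the smallest float,
# while the sum of logs is an ordinary negative number.
long_doc = D * 1000
ll_long = log_likelihood(long_doc, theta_1)
print(ll_long)
```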




### Case n=2

Let's first solve a simple case, where the vocabulary $W$ has only 2 words (symbols).

$ W = \{0,1\} $, (n = 2)

Then:
* $\theta_0 = \theta $ and $ \theta_1 = 1 - \theta $

$\max_{\theta}$ of $[count(w=0) * \log(\theta_0) + count(w=1) * \log(\theta_1)]$  
$\max_{\theta}$ of: $count(0) * \log(\theta) + count(1) * \log(1 - \theta)$

To find the maximum, we set the derivative with respect to $\theta$ to zero:

$$ {d \over d \theta} \left[ count(0) * \log(\theta) + count(1) * \log(1 - \theta) \right] = 0 $$

$$ { count(0) \over \theta} + (-1) *{count(1) \over (1-\theta)} = 0 $$

$$ (1-\theta) * count(0) - \theta * count(1) = 0 $$

$$ \theta = { count(0) \over (count(0) + count(1)) } $$
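
As a sanity check on the closed form, the sketch below compares it with a brute-force search over a grid of $\theta$ values. The counts (7 and 3) and the grid resolution are arbitrary choices for illustration.

```python
from math import log

def log_likelihood(theta, count0, count1):
    """count(0) * log(theta) + count(1) * log(1 - theta)."""
    return count0 * log(theta) + count1 * log(1 - theta)

count0, count1 = 7, 3            # hypothetical counts
closed_form = count0 / (count0 + count1)

# Grid search over theta in (0, 1), excluding the endpoints where log is undefined.
grid = [i / 1000 for i in range(1, 1000)]
best = max(grid, key=lambda t: log_likelihood(t, count0, count1))

print(closed_form, best)
```

The grid maximiser should land on the closed-form value, here the fraction of symbols equal to 0.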


#### Note:  
The minimal number of parameters needed to define the model is $|W| - 1$, the vocabulary size minus one:
* For $|W| = 2$, only $\theta_1$ is needed, because $\theta_2 = 1 -\theta_1$
* For $|W| = 3$, only $\theta_1$ and $\theta_2$ are needed, because $\theta_3 = 1 -\theta_1 -\theta_2$
* etc...

* Maximum Likelihood Estimate (MLE) is a very general method that can be applied to both continuous and discrete distributions, such as the Poisson distribution.

## 27.6. MLE for Multinomial Distribution

### Case: n > 2

Vocabulary of any size:

$$ \theta_w = { count(w) \over \sum_{w' \in W} count(w') } $$
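
The notes state this result without a derivation. One standard way to obtain it, sketched here as an assumption about the intended argument, is to maximise the log-likelihood under the constraint $\sum_{w \in W} \theta_w = 1$ with a Lagrange multiplier $\lambda$ (a symbol introduced only for this sketch):

$$ L(\theta, \lambda) = \sum_{w \in W} count(w) * \log(\theta_w) + \lambda \left( 1 - \sum_{w \in W} \theta_w \right) $$

$$ {\partial L \over \partial \theta_w} = { count(w) \over \theta_w } - \lambda = 0 \quad \Rightarrow \quad \theta_w = { count(w) \over \lambda } $$

$$ \sum_{w \in W} \theta_w = 1 \quad \Rightarrow \quad \lambda = \sum_{w' \in W} count(w') $$

Substituting $\lambda$ back gives $\theta_w = count(w) / \sum_{w' \in W} count(w')$, the fraction of all observed words that are equal to $w$.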


This technique also applies to a collection of documents $D_1, ..., D_n$, by concatenating all the documents $D_i$.  

The assumption is that the words are generated independently.
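
The estimate over a collection of documents can be sketched as follows. The function name `mle_multinomial` and the two training documents are hypothetical.

```python
from collections import Counter

def mle_multinomial(documents):
    """theta_w = count(w) / total number of words, over all documents."""
    counts = Counter()
    for doc in documents:
        counts.update(doc)            # same as counting the concatenation
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

docs = [["cat", "cat", "dog"], ["dog", "cat"]]   # hypothetical training documents
theta_hat = mle_multinomial(docs)
print(theta_hat)
```

A vocabulary word that never appears in the training documents gets no entry here, which corresponds to an estimated probability of 0.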



## 27.7. Predictions

$$ \log \left( {p(+) \over p(-)} \right) $$

$$ ...$$

Linear classifier!
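
The derivation is elided in these notes, so the following is only a sketch under stated assumptions: each class has its own multinomial model (called `theta_pos` and `theta_neg` here, both hypothetical), the classes are treated as equally likely a priori, and a document is assigned to the class whose model gives it the higher likelihood. Under those assumptions the log ratio is a weighted sum of the word counts, which is one way to read the "linear classifier" remark.

```python
from collections import Counter
from math import log

def log_ratio(document, theta_pos, theta_neg):
    """sum over w of count(w) * log(theta_pos[w] / theta_neg[w])."""
    counts = Counter(document)
    # Linear in the counts: each word w has a fixed weight log(theta_pos[w] / theta_neg[w]).
    return sum(c * log(theta_pos[w] / theta_neg[w]) for w, c in counts.items())

def predict(document, theta_pos, theta_neg):
    """Return '+' when the log ratio is non-negative, '-' otherwise."""
    return "+" if log_ratio(document, theta_pos, theta_neg) >= 0 else "-"

theta_pos = {"cat": 0.9, "dog": 0.1}   # hypothetical model for class +
theta_neg = {"cat": 0.3, "dog": 0.7}   # hypothetical model for class -

label_a = predict(["cat", "cat", "dog"], theta_pos, theta_neg)
label_b = predict(["dog", "dog", "cat"], theta_pos, theta_neg)
print(label_a, label_b)
```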

## 27.8. Prior, Posterior and Likelihood

## 27.9. Gaussian Generative models

## 27.10. MLE for Gaussian Distribution