# Non Parametric Models  

Parametric
- A Gaussian distribution is determined by its mean and covariance matrix parameters


Non Parametric
- No pre-specified shape for the distribution
    - KDE
    - Histogram

KDE can capture bimodal data (two peaks), whereas a single Gaussian fit always returns exactly one mode
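A minimal sketch of this contrast, assuming a synthetic bimodal sample and a hand-rolled Gaussian-kernel KDE with an arbitrary bandwidth of 0.5 (the data, the helper names `gaussian_pdf` and `kde_pdf`, and the bandwidth are illustrative assumptions, not from the notes):

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic bimodal sample (assumed for illustration): two well-separated clusters.
data = np.concatenate([rng.normal(-3.0, 1.0, 500), rng.normal(3.0, 1.0, 500)])

def gaussian_pdf(x, mu, sigma):
    """Density of a univariate Gaussian N(mu, sigma^2)."""
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))

def kde_pdf(x, samples, h):
    """Gaussian-kernel density estimate with bandwidth h, evaluated at points x."""
    # One kernel per data point, averaged over the sample.
    return gaussian_pdf(x[:, None], samples[None, :], h).mean(axis=1)

# Compare the two fits at the cluster centers (-3, 3) and at the gap between them (0).
points = np.array([-3.0, 0.0, 3.0])
print("single Gaussian:", gaussian_pdf(points, data.mean(), data.std()))
print("KDE:            ", kde_pdf(points, data, h=0.5))
```

The single Gaussian should put its highest density near 0, where there is almost no data, while the KDE should be low at 0 and high at both cluster centers.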


### Parametric Models

Can be described by a fixed number of parameters

- Discrete
    - Bernoulli distribution
        - One parameter, $\theta \in [0, 1]$, which generates a family of models
- Continuous
    - Gaussian distribution
        - $(\mu, \sigma)$
- Probabilistic graph models
    - Probability relationship between variables
    - Dependence of these random variables

### Non Parametric Models

- Smooth density pdf

- Histogram
- KDE

Non-parametric does NOT mean there are no parameters
- The model cannot be described by a fixed number of parameters
    - A multivariate Gaussian has a fixed number of parameters
    - In a non-parametric model, we do not fix the number of parameters in advance
        - The degrees of freedom are not fixed
        - The models are quite flexible


## MLE

Simple and has good statistical properties. 

Data $D = \{x^1, x^2, \dots, x^m\}$ drawn iid from some unknown distribution $P^{*}(x)$

- iid means independent and identically distributed: the samples are independent of each other and all drawn from the same distribution

Want to fit the data with a model $P(x|\theta)$

$\hat\theta = \arg\max_{\theta} \log P(D|\theta)$
$= \arg\max_{\theta} \log \prod_{i=1}^{m} P(x^i|\theta)$

You want to find the $\theta$ that maximizes the probability of your data.
Because the samples are iid, that probability is the product of the individual probabilities $P(x^i|\theta)$.


Use the log so that the product becomes a sum of logs.
- Maximize a sum of terms instead of a product; the maximizer $\hat\theta$ is the same because the log is monotonically increasing
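One practical reason for working with the sum of logs, not stated in the notes but easy to check, is numerical: a product of many probabilities underflows in floating point while the sum of logs stays finite. A small sketch, assuming made-up per-sample likelihood values:

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical per-sample likelihoods P(x^i | theta), each strictly inside (0, 1).
probs = rng.uniform(0.1, 0.9, size=2000)

product = np.prod(probs)         # product of 2000 small numbers: underflows in float64
log_sum = np.sum(np.log(probs))  # same quantity on the log scale: stays finite

print("product of likelihoods:", product)
print("sum of log-likelihoods:", log_sum)
```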





Example

- Estimate the probability $\theta$ of landing in heads for a biased coin

Given a sequence of $m$ *i.i.d.* flips

$$D = \{x^1, x^2, \dots, x^m\} = \{1, 0, 1, \dots, 0\}, \quad x^i \in \{0, 1\}$$

### Two equivalent ways to write the model

Model: $P(x| \theta) = \theta^x (1 - \theta)^{1 - x}$

- Compact expression

$P(x|\theta)=  \left\{
\begin{array}{ll}
      1 - \theta, x=0 \\
      \theta, x = 1 \\
\end{array} 
\right.  $

- Piecewise



- Likelihood of a single observation $x^i$

$P(x^i | \theta) = \theta^{x^i}(1 - \theta)^{1 - x^i}$

$\theta$ is the probability of getting heads

### MLE for biased coin

- objective function: log likelihood

- The log-likelihood is the sum, over all samples, of the log-likelihood of each individual sample

- The log brings the exponent $x^i$ down: $\log \theta^{x^i} = x^i \log \theta$

$l(\theta; D) = \log P(D | \theta) = \log \theta^{n_h}(1 - \theta)^{n_t} = n_h \log \theta + (m - n_h) \log(1 - \theta)$

$n_h$: number of heads, $n_t$: number of tails

- maximize $l(\theta; D)$ w.r.t. $\theta$

- Take derivatives w.r.t. $\theta$

$\frac{\partial l}{\partial \theta} = \frac{n_h}{\theta} - \frac{(m - n_h)}{1 - \theta} = 0$

$\Rightarrow n_h (1 - \theta) = (m - n_h)\theta \Rightarrow n_h = m\theta$

$\Rightarrow \hat\theta_{MLE} = \frac{n_h}{m} = \frac{1}{m}\sum_i x^i$

Take the derivative of $l$ and set it to zero to find the maximum.
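The closed-form result can be checked numerically. The sketch below simulates flips with an assumed `true_theta = 0.7` (a made-up value used only to generate data) and compares $n_h / m$ against a brute-force grid search over the log-likelihood:

```python
import numpy as np

rng = np.random.default_rng(0)
true_theta = 0.7                            # assumed value, for simulation only
flips = rng.binomial(1, true_theta, 1000)   # x^i in {0, 1}, 1 = heads

m = flips.size
n_h = int(flips.sum())

def log_likelihood(theta):
    """l(theta; D) = n_h log(theta) + (m - n_h) log(1 - theta)."""
    return n_h * np.log(theta) + (m - n_h) * np.log(1.0 - theta)

# Closed-form MLE from setting the derivative to zero.
theta_closed_form = n_h / m

# Numerical check: evaluate l on a grid in (0, 1) and take the maximizer.
grid = np.linspace(0.001, 0.999, 999)
theta_grid = grid[np.argmax(log_likelihood(grid))]

print("closed form:", theta_closed_form)
print("grid search:", theta_grid)
```

The two estimates should agree up to the grid resolution, and both are simply the fraction of heads in the sample.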

## Gaussian

Gaussian distribution in $\mathbb{R}$

$p(x | \mu, \sigma) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(x - \mu)^2}{2\sigma^2}\right)$

## Homework

- Histograms
    - Too many bins in high-dimensional data
    - The number of bins grows exponentially with the dimension, so most bins will be empty
    - We will not have a meaningful estimate

    - Output depends on where you put the bins: estimates are **noisy**
    - Arbitrary bin sizes produce different histograms
    
    
- KDE
    - Replaces the box-shaped functions of a histogram with smooth ones to approximate the density
    - Gives a smooth density where the boxes were located
    - Place one smoothing kernel centered at each data point
        - After you place all the kernels, sum them together to interpolate between the data points

    - Kernel choices
        - gaussian
        - tophat (not popular)
        - epanechnikov
        - exponential
        - linear
        - cosine

    - Kernel bandwidth
        - too large: too much interpolation, blurs out everything
        - too small: captures too many modes
       
How to choose kernel bandwidth?

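If you are using a Gaussian kernel, a rule of thumb is $h = 1.06 \hat\sigma m^{-1/5}$, where $\hat\sigma$ is the sample standard deviation and $m$ is the number of samples

OR

A better approach is to cross-validate: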

- randomly split the data
- obtain kernel density estimate using one set
- measure the likelihood of the second set
- repeat over many splits and average
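A rough sketch of both options, assuming a synthetic standard-normal sample, 50/50 random splits, and a hand-picked candidate grid (the function name `cv_log_likelihood`, the split ratio, and the candidates are illustrative choices, not prescribed by the notes):

```python
import numpy as np

rng = np.random.default_rng(2)
data = rng.normal(0.0, 1.0, 300)  # synthetic sample, assumed for illustration

def kde(x, samples, h):
    """Gaussian-kernel density estimate with bandwidth h, evaluated at points x."""
    z = (x[:, None] - samples[None, :]) / h
    return (np.exp(-0.5 * z**2) / (h * np.sqrt(2.0 * np.pi))).mean(axis=1)

def cv_log_likelihood(samples, h, n_splits=20, seed=0):
    """Average held-out log-likelihood over random 50/50 splits."""
    split_rng = np.random.default_rng(seed)
    half = samples.size // 2
    scores = []
    for _ in range(n_splits):
        perm = split_rng.permutation(samples.size)
        train, held_out = samples[perm[:half]], samples[perm[half:]]
        density = kde(held_out, train, h)                  # fit on one set
        scores.append(np.mean(np.log(density + 1e-300)))   # score the other; guard log(0)
    return float(np.mean(scores))

m = data.size
h_rule = 1.06 * data.std(ddof=1) * m ** (-1 / 5)  # rule of thumb for a Gaussian kernel

candidates = np.array([0.05, 0.1, 0.2, 0.3, 0.5, 0.8, 1.5, 3.0])
scores = np.array([cv_log_likelihood(data, h) for h in candidates])
h_cv = candidates[np.argmax(scores)]

print("rule of thumb:  ", h_rule)
print("cross-validated:", h_cv)
```

The same seed is reused for every candidate so that all bandwidths are scored on identical splits, which keeps the comparison fair.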

Drawbacks of KDE
- To represent the density function, you have to keep all the data in memory
- If you have a lot of data, it is better to summarize the data
- Evaluation is the most expensive part: every density query sums over all data points

# EM

$\tau_1^i$ is the probability that the $i^{th}$ data point comes from Gaussian component 1

Latent variable $z^i$: a hidden variable that randomly chooses a mixture component.
- Afterwards, sample the actual value of $x^i$ from the Gaussian distribution $N(x | \mu_{z^i}, \Sigma_{z^i})$
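This two-stage generative process can be sketched as follows. The mixing weights, means, and covariances are made-up values for a hypothetical two-component mixture in 2-D, and `sample_gmm` is an illustrative helper name:

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical two-component mixture in 2-D; all values are made up for illustration.
pi = np.array([0.3, 0.7])                         # mixing weights, p(z = k) = pi_k
mus = np.array([[-2.0, 0.0], [3.0, 1.0]])         # component means mu_k
sigmas = np.array([np.eye(2), 0.5 * np.eye(2)])   # component covariances Sigma_k

def sample_gmm(m):
    """Draw m points: first z^i ~ Categorical(pi), then x^i ~ N(mu_{z^i}, Sigma_{z^i})."""
    z = rng.choice(len(pi), size=m, p=pi)          # latent component for each point
    x = np.empty((m, mus.shape[1]))
    for k in range(len(pi)):
        idx = np.flatnonzero(z == k)
        x[idx] = rng.multivariate_normal(mus[k], sigmas[k], size=idx.size)
    return x, z

x, z = sample_gmm(5000)
print("empirical mixing weights:", np.bincount(z, minlength=len(pi)) / z.size)
```

In a real dataset only `x` is observed; `z` is returned here only because the data are simulated.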

We don't know the latent vector
- Impute the missing information by taking the expectation with respect to the unknown latent factors



## Expectation Maximization

1. Compute E: take the expectation over the posterior conditioned on the data; this forms a lower bound $f(\theta)$
2. Maximize: $\theta^{t+1} = \arg\max_\theta f(\theta)$

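Iterating:

1. Start with an initialization of $\theta$
2. Use it to estimate the posterior and form the lower bound
3. Maximize the lower bound to get a new $\theta$
4. Re-estimate the posterior with the new $\theta$
5. Improve again by maximizing the new lower bound, and repeat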

$$P(z|x) = \frac{P(x|z)P(z)}{P(x)}$$

$$= \frac{P(x,z)}{\sum_{z'} P(x, z')}$$

$$\text{Posterior} = \frac{\text{likelihood} \times \text{prior}}{\text{normalization constant}}$$

$$\text{Prior}: p(z) = \pi_z$$

- Marginal distribution of $z$

$$\text{Likelihood}: p(x|z) = N(x|\mu_z, \Sigma_z)$$

- Told what $z$ is, what should the distribution of $x$ be

$$\text{Posterior}: p(z|x) = \frac{\pi_z N(x|\mu_z, \Sigma_z)}{\sum_{z'} \pi_{z'} N(x|\mu_{z'}, \Sigma_{z'})}$$

- Probability distribution of $z$ given $x$: if I know $x$, what is my best guess about $z$

#### Notes

$$p(x, z) = p(x|z)p(z) = p(z|x)p(x)$$

$$p(z|x) = \frac{p(x|z)p(z)}{p(x)}$$

$p(x) = \sum_{z'} p(x, z')$ is the marginal distribution of $x$; dividing by it ensures that the posterior sums to 1 over $z$

# E-step: find posterior

$$q(z^1, z^2, \dots, z^m) = \prod_{i=1}^{m} p(z^i | x^i, \theta^t)$$

For each data point $x^i$, compute $p(z^i = k|x^i, \theta^t)$ for each $k$

$$\tau_k^i = p(z^i = k|x^i, \theta^t) = \frac{p(x^i|z^i=k)p(z^i=k)}{\sum_{k' = 1}^{K} p(z^i = k', x^i)}$$

$$ = \frac{\pi_k N(x^i|\mu_k, \Sigma_k)}{\sum_{k' = 1}^{K} \pi_{k'} N(x^i|\mu_{k'}, \Sigma_{k'})}$$
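The responsibilities $\tau_k^i$ can be computed directly from this formula. The sketch below assumes a two-component mixture with made-up parameters standing in for $\theta^t$; `gaussian_pdf` and `e_step` are illustrative helper names:

```python
import numpy as np

def gaussian_pdf(x, mu, sigma):
    """Multivariate normal density N(x | mu, Sigma) for each row of x."""
    d = mu.size
    diff = x - mu
    solved = np.linalg.solve(sigma, diff.T).T    # Sigma^{-1} (x - mu), row by row
    quad = np.sum(diff * solved, axis=1)         # (x - mu)^T Sigma^{-1} (x - mu)
    norm = np.sqrt((2.0 * np.pi) ** d * np.linalg.det(sigma))
    return np.exp(-0.5 * quad) / norm

def e_step(x, pi, mus, sigmas):
    """Return tau with tau[i, k] = p(z^i = k | x^i, theta^t)."""
    # Numerator: pi_k N(x^i | mu_k, Sigma_k), i.e. the joint p(x^i, z^i = k).
    joint = np.column_stack(
        [pi[k] * gaussian_pdf(x, mus[k], sigmas[k]) for k in range(len(pi))]
    )
    # Denominator: sum of the joint over k', i.e. the marginal p(x^i).
    return joint / joint.sum(axis=1, keepdims=True)

# Made-up current parameters theta^t and three query points.
pi = np.array([0.5, 0.5])
mus = np.array([[-2.0, 0.0], [2.0, 0.0]])
sigmas = np.array([np.eye(2), np.eye(2)])
x = np.array([[-2.0, 0.0], [0.0, 0.0], [2.0, 0.0]])

tau = e_step(x, pi, mus, sigmas)
print(tau)
```

Each row of `tau` sums to 1, and the point halfway between the two means is split evenly between the components. For points far from every component the densities can underflow, so a more careful version would work with log-densities.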



# E-step: compute expectation

$$f(\theta) := E_{q(z^1, z^2, \dots, z^m)} \left[\log \prod_{i=1}^{m} p(x^i, z^i | \theta)\right] = \sum_{i=1}^{m} E_{p(z^i|x^i, \theta^t)}[\log p(x^i, z^i|\theta)]$$

Expand the log of the Gaussian density $\log N(x^i | \mu_{z^i}, \Sigma_{z^i})$

$$ f(\theta) = \sum_{i=1}^{m} E_{p(z^i|x^i, \theta^t)} \left[\log\pi_{z^i} - \frac{1}{2}(x^i - \mu_{z^i})^T \Sigma_{z^i}^{-1} (x^i - \mu_{z^i}) - \frac{1}{2}\log|\Sigma_{z^i}| + c\right] $$

where $c$ is a constant that does not depend on $\theta$.

$\prod_{i=1}^{m} p(x^i, z^i | \theta)$
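Since $z^i$ is discrete, the expectation in $f(\theta)$ is a weighted sum over components with weights $\tau_k^i$. A sketch of evaluating it, assuming made-up data and parameters (the helper names `log_gaussian` and `expected_complete_log_likelihood` are illustrative):

```python
import numpy as np

def log_gaussian(x, mu, sigma):
    """log N(x | mu, Sigma) for each row of x."""
    d = mu.size
    diff = x - mu
    quad = np.sum(diff * np.linalg.solve(sigma, diff.T).T, axis=1)
    _, logdet = np.linalg.slogdet(sigma)
    return -0.5 * (quad + logdet + d * np.log(2.0 * np.pi))

def expected_complete_log_likelihood(x, tau, pi, mus, sigmas):
    """f(theta) = sum_i sum_k tau[i, k] * (log pi_k + log N(x^i | mu_k, Sigma_k))."""
    log_joint = np.column_stack(
        [np.log(pi[k]) + log_gaussian(x, mus[k], sigmas[k]) for k in range(len(pi))]
    )
    return float(np.sum(tau * log_joint))

# Made-up data and current parameters theta^t.
rng = np.random.default_rng(4)
x = rng.normal(size=(50, 2))
pi = np.array([0.4, 0.6])
mus = np.array([[-1.0, 0.0], [1.0, 0.0]])
sigmas = np.array([np.eye(2), 2.0 * np.eye(2)])

# Responsibilities tau: the posterior p(z^i = k | x^i, theta^t), computed in log space.
log_joint = np.column_stack(
    [np.log(pi[k]) + log_gaussian(x, mus[k], sigmas[k]) for k in range(len(pi))]
)
tau = np.exp(log_joint - np.logaddexp.reduce(log_joint, axis=1, keepdims=True))

print(expected_complete_log_likelihood(x, tau, pi, mus, sigmas))
```

Here $f$ is evaluated at the same parameters used to compute `tau`; in the M-step, `tau` would be held fixed while $\theta$ varies.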