# Contents
* Reviews
* The Log-Likelihood Gradient
* Stochastic Maximum Likelihood and Contrastive Divergence
* Pseudolikelihood
* Score Matching and Ratio Matching
* Denoising Score Matching
* Noise-Contrastive Estimation
* Estimating the Partition Function
  * Annealed Importance Sampling
  * Bridge Sampling

# Reviews

* Probability vs Compatibility
  * $p(x) = \cfrac{1}{Z}\tilde{p}(x)$
  * where $Z = \int \tilde{p}(x)\,dx$
  * $Z$ is the partition function
* Energy-based model
<img  src="SRLTQV05FB5UBWRN9UDSJELJ1DRUI1FC.png" width=600/>
  * $\tilde{p}(\mathbf{x}) = \exp(-E(\mathbf{x}))$
    * The exponential guarantees $\forall{\mathbf{x}},\tilde{p}(\mathbf{x}) > 0$
    * $E(\mathbf{x})$ is the energy function
    * $\tilde{p}(\mathbf{x})$ in an EBM is an example of a Boltzmann distribution.

* MCMC
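  * Once the chain reaches its steady state (stationary distribution), we can sample from the model even from a random initialization
  * The samples give a Monte Carlo approximation of intractable integrals (expectations)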
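
To make the review concrete, here is a minimal sketch on a model small enough to enumerate. The 3-variable binary model, its weights `W`, `b`, and the helper names are all made up for illustration. It computes $Z$ exactly and then checks that Gibbs sampling from a random initialization approaches the same distribution.

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy EBM: E(x) = -0.5 x^T W x - b^T x, with x in {0,1}^3
W = np.array([[0.0, 0.8, -0.5],
              [0.8, 0.0, 0.3],
              [-0.5, 0.3, 0.0]])   # symmetric, zero diagonal
b = np.array([0.2, -0.1, 0.4])

def energy(x):
    return -0.5 * x @ W @ x - b @ x

def p_tilde(x):
    return np.exp(-energy(x))      # always > 0

# Exact partition function: only feasible because there are just 2^3 states
states = np.array(list(itertools.product([0, 1], repeat=3)), dtype=float)
unnorm = np.array([p_tilde(s) for s in states])
Z = unnorm.sum()
p_exact = unnorm / Z

def gibbs_sweep(x):
    """Resample each x_i from p(x_i | x_-i) in turn."""
    x = x.copy()
    for i in range(len(x)):
        logit = W[i] @ x + b[i]    # E(x_i=0) - E(x_i=1), since W_ii = 0
        x[i] = float(rng.random() < 1.0 / (1.0 + np.exp(-logit)))
    return x

x = rng.integers(0, 2, size=3).astype(float)   # random initialization
for _ in range(100):                            # burn-in
    x = gibbs_sweep(x)

n_samples = 20000
counts = np.zeros(len(states))
for _ in range(n_samples):
    x = gibbs_sweep(x)
    counts[int(4 * x[0] + 2 * x[1] + x[2])] += 1   # index into `states`
p_gibbs = counts / n_samples

print(np.round(p_exact, 3))
print(np.round(p_gibbs, 3))
```

For realistic models the enumeration over `states` is impossible, which is why the rest of the chapter looks for ways around $Z$.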

# 18.1 Log-likelihood Gradient

* Gradient
  * <img  src="PJ5U2NA13VAVD6U2LJOYR8B2534R9VFR.png" width= 400/>
* Partition Z
  * <img  src="LMVG828BPQNWEWC9KXUH3V3MJQR23WNK.png" width=600/>
  * <img  src="CLGAS72B313BQOLV3OD06NL0A8CAAOKR.png" width=300/>
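
The identity $\nabla_\theta \log Z = \mathbb{E}_{x\sim p(x)}[\nabla_\theta \log \tilde{p}(x)]$ can be checked numerically on a tiny model. This is a sketch under assumptions: the 3-variable binary model and its parameters are hypothetical, and only the bias `b` is treated as $\theta$, so that $\nabla_b \log \tilde{p}(x) = x$.

```python
import itertools
import numpy as np

# Hypothetical toy model: log p~(x) = 0.5 x^T W x + b^T x, x in {0,1}^3
states = np.array(list(itertools.product([0, 1], repeat=3)), dtype=float)
W = np.array([[0.0, 0.8, -0.5],
              [0.8, 0.0, 0.3],
              [-0.5, 0.3, 0.0]])

def log_p_tilde(b):
    """log p~ for all 8 states at once."""
    return 0.5 * np.einsum("si,ij,sj->s", states, W, states) + states @ b

def log_Z(b):
    return np.log(np.exp(log_p_tilde(b)).sum())

b = np.array([0.2, -0.1, 0.4])

# Right-hand side: E_{x~p}[grad_b log p~(x)] = E_{x~p}[x]
p = np.exp(log_p_tilde(b) - log_Z(b))
grad_expectation = p @ states

# Left-hand side: grad_b log Z by central finite differences
eps = 1e-6
grad_fd = np.array([(log_Z(b + eps * e) - log_Z(b - eps * e)) / (2 * eps)
                    for e in np.eye(3)])

# Log-likelihood gradient for one data point: positive term minus negative term
x_data = np.array([1.0, 0.0, 1.0])
grad_log_p = x_data - grad_expectation

print(grad_expectation, grad_fd)
```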

# 18.2 Stochastic Maximum Likelihood and Contrastive Divergence

* How to use `gibbs_update` (an MCMC step that resamples part of the variables conditioned on the rest)
* How to make it Faster
* Positive Phase vs Negative Phase
  * $\nabla_\theta \log p(x)=\nabla_\theta \log \tilde{p}(x) - \mathbb{E}_{x\sim p(x)} [\nabla_\theta \log \tilde{p}(x)]$
    * Positive: $\nabla_\theta \log \tilde{p}(x)$
    * Negative: $\mathbb{E}_{x\sim p(x)} [\nabla_\theta \log \tilde{p}(x)]$
  * Practical Calculation
    * Gradient estimate $=\cfrac{1}{m}\sum_{x\sim \text{data}} \nabla_\theta \log \tilde{p}(x) - \cfrac{1}{m}\sum_{x\sim \text{MCMC}} \nabla_\theta \log \tilde{p}(x)$, using $m$ samples in each sum

<img  src="YBQ30HJSELRUVFPCSE9OXC6SA4TP3PA7.png" width=700/>

## Naive Algorithm 
<img  src="H21Y4NO2ADFIO7MN7DRKRTSC93GNY52I.png" width=700/>

## Contrastive Divergence Algorithm (CD)

* Motivation
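  * Running `gibbs_update` for $k=100$ steps from a random initialization at every gradient step is too expensive computationally.
  * Initialize the chain from the data minibatch instead, which lets $k$ drop to 1-20 for a model on a small image patch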
<img  src="R90F96NWSLVOTN1948YETJSVXXBX57X6.png" width=700/>

<img  src="W1EWXN4R96DR6C2K6RQD1I9FT1G9NOCR.png" width=700/>
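
A minimal CD-$k$ sketch for a binary RBM, assuming the energy $E(v,h) = -b^\top v - c^\top h - v^\top W h$. The toy data, layer sizes, learning rate and function names (other than `gibbs_update`, which follows the notes) are illustrative choices, not taken from the algorithm figure.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def gibbs_update(v, W, b, c):
    """One block-Gibbs step v -> h -> v' for a binary RBM."""
    h = (rng.random((len(v), W.shape[1])) < sigmoid(v @ W + c)).astype(float)
    return (rng.random(v.shape) < sigmoid(h @ W.T + b)).astype(float)

def grad_log_p_tilde(v, W, b, c):
    """Minibatch mean of d log p~(v) / d(W, b, c), hidden units summed out."""
    ph = sigmoid(v @ W + c)
    return v.T @ ph / len(v), v.mean(axis=0), ph.mean(axis=0)

def cd_step(data, W, b, c, k=1, lr=0.1):
    pos = grad_log_p_tilde(data, W, b, c)   # positive phase (data)
    v = data.copy()                         # CD: start the chain at the data
    for _ in range(k):
        v = gibbs_update(v, W, b, c)
    neg = grad_log_p_tilde(v, W, b, c)      # negative phase (MCMC samples)
    W += lr * (pos[0] - neg[0])             # in-place gradient ascent
    b += lr * (pos[1] - neg[1])
    c += lr * (pos[2] - neg[2])

# Hypothetical toy data: two 6-pixel patterns, minibatch of 32
patterns = np.array([[1, 1, 1, 0, 0, 0], [0, 0, 0, 1, 1, 1]], dtype=float)
data = patterns[rng.integers(0, 2, size=32)]

n_visible, n_hidden = 6, 4
W = 0.01 * rng.standard_normal((n_visible, n_hidden))
b = np.zeros(n_visible)
c = np.zeros(n_hidden)
W_init = W.copy()

for _ in range(200):
    cd_step(data, W, b, c, k=1)
```

The only difference from the naive algorithm is the line `v = data.copy()`: the chain starts near the data, so few Gibbs steps are needed.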

## Stochastic Maximum Likelihood Algorithm (SML)

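* Also called Persistent Contrastive Divergence (PCD)
* Reduce k to 1 for a model on a small image patch
* How
  * Keep the contrastive (negative-phase) samples persistent: continue the Markov chains from the previous gradient step instead of restarting them (so simple!)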
<img  src="Y9JCHDQI6LH0XOKFNVYRVY5PW3RNPSBG.png" width=700/>

* There is an additional approach to accelerate SML, called fast PCD (FPCD).
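
A minimal SML/PCD sketch for a binary RBM, under the same kind of assumptions as any toy example: the data, sizes, learning rate and helper names are made up for illustration. The point to notice is that `chains` is initialized once and then carried across gradient steps.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def gibbs_update(v, W, b, c):
    """One block-Gibbs step v -> h -> v' for a binary RBM."""
    h = (rng.random((len(v), W.shape[1])) < sigmoid(v @ W + c)).astype(float)
    return (rng.random(v.shape) < sigmoid(h @ W.T + b)).astype(float)

def grad_log_p_tilde(v, W, b, c):
    """Minibatch mean of d log p~(v) / d(W, b, c), hidden units summed out."""
    ph = sigmoid(v @ W + c)
    return v.T @ ph / len(v), v.mean(axis=0), ph.mean(axis=0)

def sml_step(data, chains, W, b, c, k=1, lr=0.05):
    pos = grad_log_p_tilde(data, W, b, c)        # positive phase
    for _ in range(k):
        chains = gibbs_update(chains, W, b, c)   # continue the old chains
    neg = grad_log_p_tilde(chains, W, b, c)      # negative phase
    W += lr * (pos[0] - neg[0])
    b += lr * (pos[1] - neg[1])
    c += lr * (pos[2] - neg[2])
    return chains                                # persists to the next step

# Hypothetical toy data: two 6-pixel patterns, minibatch of 32
patterns = np.array([[1, 1, 1, 0, 0, 0], [0, 0, 0, 1, 1, 1]], dtype=float)
data = patterns[rng.integers(0, 2, size=32)]

n_visible, n_hidden = 6, 4
W = 0.01 * rng.standard_normal((n_visible, n_hidden))
b = np.zeros(n_visible)
c = np.zeros(n_hidden)

# Chains are randomly initialized once, outside the training loop
chains = rng.integers(0, 2, size=(32, n_visible)).astype(float)
for _ in range(200):
    chains = sml_step(data, chains, W, b, c, k=1)
```

Because the parameters move only a little per step, the previous samples are already close to the new model distribution, which is why a single Gibbs step can be enough. A small learning rate helps keep that assumption reasonable.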

# 18.3 Pseudolikelihood

* Set $\log p(\vec{x}) \approx \sum_d \log p(x_d|\vec{x}_{-d})$
* Use Conditional Probability property
  * $p(x_1|x_2)=\cfrac{p(x_1,x_2)}{p(x_2)}=\cfrac{\tilde{p}(x_1,x_2)}{\tilde{p}(x_2)}=\cfrac{\tilde{p}(x_1,x_2)}{\int \tilde{p}(x_1,x_2)\,dx_1}$
* e.g.
  * If we want to calculate $p([1,1,1,1])$ where each variable takes a value in $\{1,2,3\}$ ($K=3$, $D=4$).
    * We have to evaluate $p(x_1=1|x_2=1, x_3=1, x_4=1)$
      * By evaluating the unnormalized model $\tilde{p}$ at
        * $\tilde{p}(x_1=1, x_2=1, x_3=1, x_4=1)$
        * $\tilde{p}(x_1=2, x_2=1, x_3=1, x_4=1)$
        * $\tilde{p}(x_1=3, x_2=1, x_3=1, x_4=1)$
      * $p(x_1=1|x_2=1, x_3=1, x_4=1) = \cfrac{\tilde{p}(x_1=1, x_2=1, x_3=1, x_4=1)}{\sum_k \tilde{p}(x_1=k, x_2=1, x_3=1, x_4=1)}$
    * Repeat for each of the $D$ dimensions
    * Finally
      * $p([1,1,1,1]) \approx p(x_1=1|[1, 1, 1]) \times p(x_2=1|[1, 1, 1]) \times p(x_3=1|[1, 1, 1]) \times p(x_4=1|[1, 1, 1])$
  * Cost: $K^D$ evaluations of $\tilde{p}$ for the exact $Z$ -> $K\times D$ evaluations for the pseudolikelihood
* This may look like an unprincipled hack, but it can be proven that estimation by maximizing the pseudolikelihood is asymptotically consistent
* "Generalized" Pseudolikelihood
  * Make dimensions into $m$ Partitions
    * If $m=1$
      * No partition
      * Exact likelihood
    * If $m=D$
      * One partition per individual dimension
      * Same as Pseudolikelihood
    * If $1<m<D$
      * $\log p(x) \approx \sum_{i=1}^{m} \log p(\vec{x}_{S^{(i)}}|\vec{x}_{-S^{(i)}})$, where $S^{(i)}$ is the $i$-th set of dimensions
<img  src="OHIU5CVP5SD67HQM6ESBBI0TDKS2W7S1.png"/>

* It can perform better than maximum likelihood for "tasks that require only the conditional distributions used during training"
  * such as filling in small amounts of missing values.


* Generalized pseudolikelihood techniques are especially powerful
  * If designed to capture the most important correlations
    * For example, in natural images, pixels that are widely separated in space have only weak correlation, so each partition can be a small, spatially localized window
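
The $K=3$, $D=4$ example from this section can be sketched in code. The pairwise score table `theta` is a made-up model and `counter` is only there to count evaluations of $\tilde{p}$. State 1 in the notes corresponds to index 0 here.

```python
import itertools
import numpy as np

K, D = 3, 4
rng = np.random.default_rng(0)

# Hypothetical model: theta[i, j, a, b] scores the pair (x_i = a, x_j = b)
theta = rng.normal(scale=0.5, size=(D, D, K, K))
counter = {"evals": 0}

def p_tilde(x):
    counter["evals"] += 1
    score = sum(theta[i, j, x[i], x[j]]
                for i in range(D) for j in range(i + 1, D))
    return np.exp(score)

def log_pseudolikelihood(x):
    total = 0.0
    for d in range(D):                  # repeat for each dimension
        vals = []
        for k in range(K):              # K evaluations of p~ per dimension
            y = list(x)
            y[d] = k
            vals.append(p_tilde(y))
        total += np.log(vals[x[d]] / sum(vals))   # log p(x_d | x_-d)
    return total

x = [0, 0, 0, 0]

counter["evals"] = 0
log_pl = log_pseudolikelihood(x)
pl_evals = counter["evals"]             # K * D evaluations

counter["evals"] = 0
Z = sum(p_tilde(s) for s in itertools.product(range(K), repeat=D))
exact_evals = counter["evals"]          # K ** D evaluations
log_p = np.log(p_tilde(x) / Z)

print(pl_evals, exact_evals, log_pl, log_p)
```

The two log values generally differ: pseudolikelihood is a different objective that shares the same maximizer asymptotically, not an approximation that matches the likelihood value on each point.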

# 18.4 Score Matching and Ratio Matching

## Score Matching

* Like pseudolikelihood, score matching avoids estimating the partition function $Z$
  * $Z$ does not depend on $x$, so $\nabla_x \log p(x) = \nabla_x \log \tilde{p}(x)$
* Score
  * $\nabla_x \log p(x)$
<img  src="VLA45G8R0SY3XCMM1PL56VM7MD9N3N0Q.png" width=600/>
* Approximation

<img  src="NR7TGUVPMVQ17W8F87PQ058HQJUB5UK8.png" width=600/>
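
A small worked example, assuming the usual sample form of the objective, $\tilde{L}(x,\theta) = \sum_j \left( \frac{\partial^2}{\partial x_j^2} \log p(x) + \frac{1}{2} \left( \frac{\partial}{\partial x_j} \log p(x) \right)^2 \right)$. The model is a 1-D Gaussian with mean `mu` and precision `lam`, chosen only because its derivatives are easy to write by hand. The data and the grid are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=1.5, size=5000)   # synthetic data

def score_matching_loss(mu, lam, x):
    # Unnormalized model: log p~(x) = -0.5 * lam * (x - mu)^2  (Z never needed)
    score = -lam * (x - mu)      # d/dx log p~(x)
    dscore = -lam                # d^2/dx^2 log p~(x)
    return np.mean(dscore + 0.5 * score ** 2)

# For this model the loss is minimized over mu at the sample mean
mu_hat = x.mean()

# Minimize over the precision with a simple grid search
lams = np.linspace(0.05, 2.0, 400)
losses = np.array([score_matching_loss(mu_hat, lam, x) for lam in lams])
lam_hat = lams[np.argmin(losses)]

print(mu_hat, lam_hat, 1.0 / np.var(x))
```

For this Gaussian the score matching estimate coincides with maximum likelihood (`lam_hat` is close to `1 / np.var(x)`), even though the normalizing constant was never computed.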

## Ratio Matching

* A more successful approach to extending the basic ideas of score matching to discrete data 
* Ratio?
<img  src="VIRU0L40JP9E43MKIKBHN9M30WY72EGG.png" width=400/>
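
A minimal sketch of the per-example objective, assuming the usual form $L^{(RM)}(x,\theta) = \sum_{j=1}^{n} \left( \cfrac{1}{1 + \frac{\tilde{p}(x)}{\tilde{p}(f(x,j))}} \right)^2$, where $f(x,j)$ is $x$ with bit $j$ flipped. The fully visible binary model and its parameters are hypothetical. Only ratios of $\tilde{p}$ appear, so $Z$ cancels.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 4

# Hypothetical fully visible binary model: log p~(x) = 0.5 x^T W x + b^T x
A = rng.normal(scale=0.5, size=(n, n))
W = np.triu(A, 1) + np.triu(A, 1).T     # symmetric, zero diagonal
b = rng.normal(scale=0.5, size=n)

def log_p_tilde(x):
    return 0.5 * x @ W @ x + b @ x

def ratio_matching_loss(x, log_p):
    loss = 0.0
    for j in range(len(x)):
        flipped = x.copy()
        flipped[j] = 1.0 - flipped[j]              # f(x, j): flip bit j
        log_ratio = log_p(x) - log_p(flipped)      # log of p~(x) / p~(f(x, j))
        loss += (1.0 / (1.0 + np.exp(log_ratio))) ** 2
    return loss

x = np.array([1.0, 0.0, 1.0, 1.0])
loss = ratio_matching_loss(x, log_p_tilde)

# Adding a constant to log p~ (i.e. rescaling Z) leaves the loss unchanged
loss_shifted = ratio_matching_loss(x, lambda v: log_p_tilde(v) + 5.0)
print(loss, loss_shifted)
```

Each term needs one extra evaluation of $\tilde{p}$ per bit, so the cost per example grows linearly with $n$.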