# 9. Generative Models 1

## 9.1 Introduction

> What does it mean to learn a **generative model**?

<br>

## 9.2 Learning a Generative Model

- Suppose we are given images of dogs.

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;
<img src='https://drive.google.com/uc?id=1_kgakInGQBBg6IPNOfbZuO4Sp4BiBd3R' width=800/>

<br>

### 9.2.1 Learn $p(x)$

- We want to learn a probability distribution $p(x)$ that supports
  - Generation
  - Density estimation
  - Unsupervised representation learning

<br>

#### 9.2.1.1 Generation

- If we sample $x_{\text{new}} \sim p(x)$, then $x_{\text{new}}$ should look like a dog.
- **sampling**

<br>

#### 9.2.1.2 Density estimation

- $p(x)$ should be high if $x$ looks like a dog, and low otherwise.
- **anomaly detection**
- Models that provide such a density are also known as **explicit** models.

<br>

#### 9.2.1.3 Unsupervised representation learning

- We should be able to learn what these images have in common, e.g., ears, tail, etc.
- **feature learning**

<br>

### 9.2.2 **represent** $p(x)$

- Then, how can we **represent** $p(x)$?

<br>

## 9.3 Basic Discrete Distributions

### 9.3.1 Bernoulli distribution

- (biased) coin flip
- $D=\{$ Heads, Tails $\}$
- Specify $P(X=$ Heads $)=p$. Then $P(X=$ Tails $)=1-p$
- Write: $X \sim \operatorname{Ber}(p)$

<br>

### 9.3.2 Categorical distribution

- (biased) $m$-sided die
- $D = \{ 1, \dots, m \}$
- Specify $P(Y = i) = p_i$, such that $\sum_{i=1}^m p_i = 1$
- Write: $Y \sim \operatorname{Cat}(p_1, \dots, p_m)$
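
As a minimal sketch, both distributions can be sampled with Python's standard `random` module. The function names, the encoding Heads = 1 and Tails = 0, and the example probabilities are illustrative assumptions, not part of these notes.

```python
import random

def sample_bernoulli(p, rng):
    """Return 1 (Heads) with probability p, else 0 (Tails)."""
    return 1 if rng.random() < p else 0

def sample_categorical(probs, rng):
    """Return an outcome in {1, ..., m} with P(Y = i) = probs[i - 1]."""
    u = rng.random()
    cumulative = 0.0
    for i, p_i in enumerate(probs, start=1):
        cumulative += p_i
        if u < cumulative:
            return i
    return len(probs)  # guard against floating-point round-off

rng = random.Random(0)  # fixed seed so the run is repeatable
coin_flips = [sample_bernoulli(0.7, rng) for _ in range(10_000)]
die_rolls = [sample_categorical([0.5, 0.3, 0.2], rng) for _ in range(10_000)]
heads_freq = sum(coin_flips) / len(coin_flips)  # should be close to p = 0.7
```

With enough samples, the empirical frequencies should approach the specified parameters.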

<br>

## 9.4 Example

### 9.4.1 RGB image

- Modeling an RGB joint distribution (of a single pixel)

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;
<img src='https://drive.google.com/uc?id=1RfoFXWwkV-hSx7Byc6KEk5fhYHbkVz2r' width=300/>

- $(r, g, b) \sim p(R, G, B)$
- Number of cases?
  - $256 \times 256 \times 256$
- How many parameters do we need to specify?
  - $256 \times 256 \times 256 - 1$

<br>

### 9.4.2 Binary image

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;
<img src='https://drive.google.com/uc?id=13q5GXS8WRCNUNB5YK09PcxFJ0muyPlAy' width=300/>

- Suppose we have $X_{1}, \ldots, X_{n}$ of $n$ binary pixels (a binary image).
- How many possible states?
  - $2 \times 2 \times \dots \times 2 = 2^n$
- Sampling from $p(x_1, \dots, x_n)$ generates an image.
- How many parameters to specify $p(x_1, \dots, x_n)$?
  - $2^n - 1$
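
The two counts in this section follow the same pattern: one probability per joint state, minus one because the probabilities must sum to 1. The sketch below works this out with Python's arbitrary-precision integers. The helper names are hypothetical, and $n = 784$ (a $28 \times 28$ image) is an assumed example size.

```python
def num_states(values_per_variable, num_variables):
    """Number of joint configurations of the variables."""
    return values_per_variable ** num_variables

def num_free_parameters(values_per_variable, num_variables):
    """One probability per state, minus one for the sum-to-one constraint."""
    return num_states(values_per_variable, num_variables) - 1

rgb_params = num_free_parameters(256, 3)     # single RGB pixel: 256^3 - 1
binary_params = num_free_parameters(2, 784)  # 28 x 28 binary image: 2^784 - 1

print(rgb_params)               # 16777215
print(len(str(binary_params)))  # number of decimal digits in 2^784 - 1
```

For the binary image the count has well over 200 decimal digits, so a full table of the joint distribution cannot be stored.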

<br>

## 9.5 Structure Through Independence

- If $X_1, \dots, X_n$ are independent, then
  - $p\left(x_{1}, \ldots, x_{n}\right)=p\left(x_{1}\right) p\left(x_{2}\right) \cdots p\left(x_{n}\right)$
- How many possible states?
  - $2^n$
- How many parameters to specify $p(x_1, \dots, x_n)$?
  - $n$

<br>

- $2^n$ entries can be described by just $n$ numbers!
- But this **independence** assumption is too strong to model useful distributions.
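
To make the factorization concrete, here is a small sketch for $n = 4$ binary pixels. The parameter values in `p` are arbitrary assumptions, and `independent_joint_prob` is a hypothetical helper name. It evaluates the product $p(x_1) \cdots p(x_n)$ and checks that the $2^n$ resulting probabilities form a valid distribution.

```python
import itertools
import math

def independent_joint_prob(x, p):
    """p(x_1, ..., x_n) = prod_i p(x_i), where p[i] = P(X_i = 1)."""
    return math.prod(p_i if x_i == 1 else 1.0 - p_i for x_i, p_i in zip(x, p))

p = [0.9, 0.2, 0.5, 0.7]  # n = 4 parameters, chosen arbitrarily
states = list(itertools.product([0, 1], repeat=len(p)))
total = sum(independent_joint_prob(x, p) for x in states)

print(len(states), len(p))  # 16 states described by 4 numbers
```

The sum `total` should equal 1 up to floating-point error, even though only $n$ numbers were specified.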

<br>

## 9.6 Conditional Independence

- Three important rules
  - Chain rule
  - Bayes' rule
  - Conditional independence

<br>

### 9.6.1 Chain rule

$$p\left(x_{1}, \ldots, x_{n}\right)=p\left(x_{1}\right) p\left(x_{2} \mid x_{1}\right) p\left(x_{3} \mid x_{1}, x_{2}\right) \cdots p\left(x_{n} \mid x_{1}, \cdots, x_{n-1}\right)$$

<br>

### 9.6.2 Bayes' rule

$$
p(x \mid y)=\frac{p(x, y)}{p(y)}=\frac{p(y \mid x) p(x)}{p(y)}
$$
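
Bayes' rule can be checked numerically on a small example. The joint table below is a made-up assumption over two binary variables. Exact fractions are used so that both sides of the identity can be compared without rounding.

```python
from fractions import Fraction

# Hypothetical joint table p(x, y) over two binary variables.
joint = {
    (0, 0): Fraction(1, 10), (0, 1): Fraction(2, 10),
    (1, 0): Fraction(3, 10), (1, 1): Fraction(4, 10),
}

def p_x(x):
    """Marginal p(x), obtained by summing the joint over y."""
    return sum(v for (xi, _), v in joint.items() if xi == x)

def p_y(y):
    """Marginal p(y), obtained by summing the joint over x."""
    return sum(v for (_, yi), v in joint.items() if yi == y)

def p_x_given_y(x, y):
    return joint[(x, y)] / p_y(y)

def p_y_given_x(y, x):
    return joint[(x, y)] / p_x(x)

x, y = 1, 1
lhs = p_x_given_y(x, y)                    # p(x, y) / p(y)
rhs = p_y_given_x(y, x) * p_x(x) / p_y(y)  # p(y | x) p(x) / p(y)
print(lhs, rhs)                            # 2/3 2/3
```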

<br>

### 9.6.3 Conditional independence

$$
\text { If } x \perp y \mid z \text { , then } p(x \mid y, z)=p(x \mid z)
$$

<br>

### 9.6.4 Using the chain rule

- Using the chain rule,
  - $p\left(x_{1}, \ldots, x_{n}\right)=p\left(x_{1}\right) p\left(x_{2} \mid x_{1}\right) p\left(x_{3} \mid x_{1}, x_{2}\right) \cdots p\left(x_{n} \mid x_{1}, \cdots, x_{n-1}\right)$
- How many parameters?
  - $p(x_1)$: 1 parameter
  - $p(x_2 | x_1)$: 2 parameters
    - one per $p(x_2|x_1 = 0)$ and one per $p(x_2|x_1 = 1)$
  - $p(x_3 | x_1, x_2)$: 4 parameters
  - Hence, $1 + 2 + 2^2 + \cdots + 2^{n-1} = 2^n - 1$, which is the same as before.
- Why?
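
One way to see why the count is unchanged: the chain rule is an exact identity that holds for any joint distribution, so it introduces no assumption that could remove parameters. The sketch below, using a hypothetical helper name, checks the geometric sum for a few values of $n$.

```python
def chain_rule_parameter_count(n):
    """Sum over i of 2^(i-1): one parameter per configuration of x_1..x_{i-1}."""
    return sum(2 ** (i - 1) for i in range(1, n + 1))

for n in (1, 2, 3, 10):
    # Matches the size of the full joint table minus one.
    assert chain_rule_parameter_count(n) == 2 ** n - 1
```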

<br>

### 9.6.5 Markov assumption

- Now, suppose $X_{i+1} \perp X_{1}, \ldots, X_{i-1} \mid X_{i}$ (Markov assumption), then
  - $p\left(x_{1}, \ldots, x_{n}\right)=p\left(x_{1}\right) p\left(x_{2} \mid x_{1}\right) p\left(x_{3} \mid x_{2}\right) \cdots p\left(x_{n} \mid x_{n-1}\right)$
- How many parameters?
  - $2n - 1$
- Hence, by leveraging the Markov assumption, we get an exponential reduction in the number of parameters.
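
The count $2n - 1$ comes from one parameter for $p(x_1)$ plus two for each of the $n - 1$ transitions $p(x_{i+1} \mid x_i)$. A minimal sketch under assumed, arbitrary parameter values (the names `p_first` and `p_next` are hypothetical):

```python
import itertools

def markov_joint_prob(x, p_first, p_next):
    """p(x_1) * prod_i p(x_{i+1} | x_i) for a binary sequence x.

    p_first      = P(first variable = 1)
    p_next[i][v] = P(x[i + 1] = 1 | x[i] = v), using 0-based Python indices
    """
    prob = p_first if x[0] == 1 else 1.0 - p_first
    for i in range(len(x) - 1):
        p_one = p_next[i][x[i]]
        prob *= p_one if x[i + 1] == 1 else 1.0 - p_one
    return prob

n = 4
p_first = 0.6
p_next = [[0.1, 0.8], [0.3, 0.5], [0.9, 0.2]]  # n - 1 transitions, 2 numbers each
num_parameters = 1 + 2 * len(p_next)            # 1 + 2(n - 1) = 2n - 1
total = sum(markov_joint_prob(x, p_first, p_next)
            for x in itertools.product([0, 1], repeat=n))
```

Here 7 numbers define a distribution over 16 states, and `total` should equal 1 up to floating-point error.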

<br>

- **Auto-regressive models** leverage this conditional independence.

<br>

## 9.7 Auto-regressive Model

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;
<img src='https://drive.google.com/uc?id=1JaCmxJYv_rEv1dTOjKoi1QXhA9woJkdx' width=600/>

- Suppose we have $28 \times 28$ binary pixels.
- Our goal is to learn $p(x)=p\left(x_{1}, \ldots, x_{784}\right)$ over $x \in\{0,1\}^{784}$.
- How can we parametrize $p(x)$?
  - Let's use the **chain rule** to factor the joint distribution.
  - $p\left(x_{1: 784}\right)=p\left(x_{1}\right) p\left(x_{2} \mid x_{1}\right) p\left(x_{3} \mid x_{1: 2}\right) \cdots$
  - This is called an **autoregressive model**.
  - Note that we need **an ordering** of all random variables.
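
The factorization suggests how both sampling and density evaluation proceed: visit the variables in the chosen order and use one conditional at a time. The sketch below illustrates this with a made-up conditional, `toy_conditional`, which is purely an assumption standing in for a learned model.

```python
import math
import random

def toy_conditional(prefix):
    """Hypothetical P(X_i = 1 | x_1..x_{i-1}): depends on the fraction of ones so far."""
    if not prefix:
        return 0.5
    return 0.1 + 0.8 * sum(prefix) / len(prefix)

def sample(n, conditional, rng):
    """Draw x_1, then x_2 | x_1, and so on, following the fixed ordering."""
    x = []
    for _ in range(n):
        x.append(1 if rng.random() < conditional(x) else 0)
    return x

def log_prob(x, conditional):
    """log p(x) = sum_i log p(x_i | x_1..x_{i-1})."""
    total = 0.0
    for i, x_i in enumerate(x):
        p_one = conditional(x[:i])
        total += math.log(p_one if x_i == 1 else 1.0 - p_one)
    return total

rng = random.Random(0)
image = sample(784, toy_conditional, rng)  # one flattened 28 x 28 binary image
print(log_prob(image, toy_conditional))
```

Sampling is inherently sequential because each pixel needs the previously sampled ones, whereas the log-density is a plain sum over the conditionals.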

<br>

## 9.8 NADE: Neural Autoregressive Density Estimator

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;
<img src='https://drive.google.com/uc?id=1XDQtG3RqWQRf_3QX82LbAcvHH7aryaMh' width=800/>

- The conditional distribution of the $i$-th pixel is
  - $p\left(x_{i} \mid x_{1: i-1}\right)=\sigma\left(\alpha_{i} \mathbf{h}_{i}+b_{i}\right)$ where $\mathbf{h}_{i}=\sigma\left(W_{<i} x_{1: i-1}+\mathbf{c}\right)$

- **NADE** is an **explicit** model that can compute the **density** of the given inputs.
- How can we compute the **density** of the given image?
  - Suppose we have a binary image with 784 binary pixels, $\{x_1, x_2, \dots, x_{784}\}$
  - Then, the joint probability is computed by
    - $p\left(x_{1}, \ldots, x_{784}\right)=p\left(x_{1}\right) p\left(x_{2} \mid x_{1}\right) \cdots p\left(x_{784} \mid x_{1: 783}\right)$
    - where each conditional probability $p\left(x_{i} \mid x_{1: i-1}\right)$ is computed independently.
- When modeling continuous random variables, a **mixture of Gaussians** can be used.
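
The density computation can be sketched in plain Python. This is a minimal illustration, not a reference implementation: it assumes $\sigma$ is the logistic sigmoid, reads the output $\sigma(\alpha_i \mathbf{h}_i + b_i)$ as $P(x_i = 1 \mid x_{1:i-1})$, and uses random rather than trained weights. Tiny sizes are chosen so that all $2^D$ inputs can be enumerated.

```python
import itertools
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def nade_log_prob(x, W, c, alpha, b):
    """log p(x) = sum_i log p(x_i | x_{1:i-1}) for a binary vector x.

    W:     H x D matrix (list of rows); W_{<i} is its first i-1 columns
    c:     length-H hidden bias, shared by all i
    alpha: D x H matrix, row i holds the output weights alpha_i
    b:     length-D output bias
    """
    H, D = len(c), len(x)
    log_p = 0.0
    for i in range(D):
        # h_i = sigma(W_{<i} x_{1:i-1} + c); only earlier pixels are used
        h = [sigmoid(c[k] + sum(W[k][j] * x[j] for j in range(i))) for k in range(H)]
        # Assumed reading: P(x_i = 1 | x_{1:i-1}) = sigma(alpha_i . h_i + b_i)
        p_one = sigmoid(b[i] + sum(alpha[i][k] * h[k] for k in range(H)))
        log_p += math.log(p_one if x[i] == 1 else 1.0 - p_one)
    return log_p

rng = random.Random(0)
D, H = 4, 3  # tiny sizes so all 2^D inputs can be enumerated
W = [[rng.gauss(0, 1) for _ in range(D)] for _ in range(H)]
c = [rng.gauss(0, 1) for _ in range(H)]
alpha = [[rng.gauss(0, 1) for _ in range(H)] for _ in range(D)]
b = [rng.gauss(0, 1) for _ in range(D)]

total = sum(math.exp(nade_log_prob(x, W, c, alpha, b))
            for x in itertools.product([0, 1], repeat=D))
```

Because every conditional is a valid Bernoulli distribution, `total` should equal 1 up to floating-point error for any choice of weights.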

<br>

## 9.9 Pixel RNN

- We can also use **RNNs** to define an auto-regressive model.
- For example, for an $n \times n$ RGB image,

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;
<img src='https://drive.google.com/uc?id=1xAszapTNxUesQ7FPyiWMuV4dZR-ZI0gn' width=600/>
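
As a rough sketch of what such an ordering looks like, assume pixels are visited in raster-scan order (row by row, left to right) and that within each pixel the channels are generated as R, then G, then B. Both the ordering and the helper name `generation_order` are assumptions made for illustration.

```python
def generation_order(n):
    """Yield (row, col, channel) in the assumed order: raster scan, then R, G, B."""
    for row in range(n):
        for col in range(n):
            for channel in ("R", "G", "B"):
                yield (row, col, channel)

order = list(generation_order(2))  # 2 x 2 image: 12 variables in total
# Each variable is conditioned on everything that precedes it in `order`.
context_of_last = order[:-1]
```

Under this assumption, an $n \times n$ RGB image gives a chain of $3n^2$ conditionals.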

<br>

- There are two model architectures in Pixel RNN, based on the **ordering** of the chain:
  - Row LSTM
  - Diagonal BiLSTM

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;
<img src='https://drive.google.com/uc?id=1USBlyWv3tiQn1TIouTayi73SBswNuost' width=600/>