# DA323 Project: Paper Summary and Explanation
## Zero-Shot Text-to-Image Generation


## Contents

- About the Paper  
- Motivation  
- Previous Approaches  
- Method  
- Datasets  
- Significance and Results  
- Future Impact  


## About the Paper

**Paper Title:**  
*Zero-Shot Text-to-Image Generation*  

**Authors:**  
Aditya Ramesh, Mikhail Pavlov, Gabriel Goh, Scott Gray,  
Chelsea Voss, Alec Radford, Mark Chen, Ilya Sutskever (OpenAI)  

**Published:**  
arXiv preprint: [https://arxiv.org/abs/2102.12092](https://arxiv.org/abs/2102.12092)  
Date: February 2021  

**Core Contribution:**  
A scalable, autoregressive transformer-based model that treats text and image tokens as a single sequence.  
Capable of generating realistic images from text prompts **without any task-specific fine-tuning**.


## Motivation for this Paper

Earlier models like **GANs** and **VAEs** struggle with generalization — that is, they perform poorly on unseen data.

Major issues with GANs include:
- **Mode collapse**  
  The generator ends up producing samples from only a few "modes" of the data distribution, so its outputs lack the variety of the real, multi-modal data.
  
- **Nonconvergence and instability**  
  This can arise due to:
  - Inappropriate design of network architecture  
  - Poor choice of objective function  
  - Suboptimal optimization algorithms

Training instability is a critical problem. For example, if the discriminator can easily distinguish between fake and real images, its gradient vanishes, and the generator no longer receives a useful training signal.


<img src="https://raw.githubusercontent.com/zombie-programmer-code/DA323_project/main/project_images/mode_collapse.png" alt="Mode collapse" width="600"/>





Traditionally, **text-to-image synthesis** has been approached by improving modeling assumptions for training on a fixed dataset.

These approaches often rely on:
- Complex architectures  
- Auxiliary losses  
- Object part labels  
- Dataset-specific design choices  

Such methods can **hurt generalization**, making them less effective on unseen prompts or domains.

---

To counter these limitations, the authors propose a **simple and scalable approach**:
- Use an **autoregressive transformer**  
- Model **text and image tokens as a single stream of data**  

With enough data, this method proves **competitive with task-specific models** when evaluated in a **zero-shot** setting.


## Generating Images from Captions with Attention

In **Mansimov et al., 2015**, the authors extended the **Deep Recurrent Attentive Writer (DRAW)** technique to train a model that:

- Iteratively draws patches on a canvas  
- Attends to relevant words in the text description at each step  

### Key Mechanism:
- The current hidden state for image generation is computed using:
  - The **previous hidden state**, and  
  - An **alignment score** between:
    - The hidden states from the text  
    - The previous image hidden state  

- This current hidden state is then used to **generate the image at the current iteration**
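
The alignment step described above can be sketched in a few lines. This is a minimal illustration, not the formulation from Mansimov et al.: it assumes a plain dot product as the alignment score (the original work uses a learned alignment function), and the names `attend` and `softmax` are hypothetical helpers introduced here.

```python
import math
from typing import List

Vector = List[float]


def softmax(scores: List[float]) -> List[float]:
    """Turn raw alignment scores into weights that sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(text_states: List[Vector], prev_image_state: Vector) -> Vector:
    """Weight each word's hidden state by its alignment with the image state.

    Assumption: the alignment score is a dot product. This is a
    simplification chosen for the sketch.
    """
    scores = [
        sum(t * h for t, h in zip(state, prev_image_state))
        for state in text_states
    ]
    weights = softmax(scores)
    dim = len(text_states[0])
    # Context vector: weighted average of the text hidden states.
    return [
        sum(w * state[d] for w, state in zip(weights, text_states))
        for d in range(dim)
    ]


# Two "words" with orthogonal hidden states; the image state favors the first.
context = attend([[1.0, 0.0], [0.0, 1.0]], [10.0, 0.0])
print(context)
```

The resulting context vector would then be combined with the previous hidden state to produce the current one.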

This was one of the **first deep learning-based approaches** for image synthesis from natural language descriptions.


<img src="https://raw.githubusercontent.com/zombie-programmer-code/DA323_project/main/project_images/variational_auto_encoder.png" alt="Conventional variational autoencoder" width="800"/>


### Figure: Conventional variational autoencoder

## StackGAN++ (Zhang et al., 2018)

**StackGAN-v1** uses a **2-stage Generative Adversarial Network** for text-to-image synthesis:

- **Stage I**:
  - Sketches primitive features like **color** and **shape** based on the text description.
  - Produces **low-resolution images**.

- **Stage II**:
  - Takes Stage I output **and** the original text as input.
  - Generates **high-resolution images** by refining the initial sketch.

---

An extended approach (**StackGAN++**) uses:
- **Multiple generators and discriminators** arranged in a **tree structure**.
- Images at **multiple scales** for the same scene are generated from **different branches** of the tree.


<img src="https://raw.githubusercontent.com/zombie-programmer-code/DA323_project/main/project_images/stackgan-v1.png" alt="StackGAN-v1 architecture" width="800"/>


### Method described in the paper

## Key Innovation and Method

The goal of this paper is to train a **transformer** that **autoregressively models both images and text as a single data stream**.

However, directly using pixel values requires **too much memory**, especially for high-resolution images.  
To overcome this, the authors **compress images into tokens**.

---

### Main Components of the Architecture:

- **Discrete Variational Autoencoder (dVAE):**  
  Compresses each **256×256 RGB image** into a **32×32 grid of image tokens**.

- Each token is drawn from a discrete vocabulary of **8192 possible values**.

- This reduces the **context length for the transformer by a factor of 192**,  
  with **minimal degradation in visual quality** (as demonstrated in the next section).
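
The factor of 192 follows directly from the sizes listed above. The short calculation below assumes that, without compression, every colour channel of every pixel would occupy its own sequence position.

```python
# Sequence length if every pixel channel were its own token.
IMAGE_SIDE = 256
CHANNELS = 3
pixel_positions = IMAGE_SIDE * IMAGE_SIDE * CHANNELS

# Sequence length after dVAE compression to a 32x32 token grid.
GRID_SIDE = 32
token_positions = GRID_SIDE * GRID_SIDE

# Each token indexes a vocabulary of 8192 = 2**13 entries.
VOCAB_SIZE = 8192
bits_per_token = VOCAB_SIZE.bit_length() - 1

reduction_factor = pixel_positions // token_positions

print(pixel_positions)   # 196608
print(token_positions)   # 1024
print(reduction_factor)  # 192
print(bits_per_token)    # 13
```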


<img src="https://raw.githubusercontent.com/zombie-programmer-code/DA323_project/main/project_images/dVAE1.png" alt="dVAE Diagram" width="400"/>


## The Second Stage of the Model

- Up to **256 BPE-encoded text tokens** are concatenated with the **32 × 32 = 1024 image tokens**.
- The **autoregressive transformer** is trained to model the **joint distribution over both text and image tokens**.
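
As a rough sketch of how such a combined sequence could be assembled: the code below pads the caption to 256 positions and appends the 1024 image tokens, giving 1280 positions in total. The text vocabulary size, the single `PAD_ID`, and the offsetting of image ids into a shared id space are assumptions made for this illustration; they are not details taken from this summary.

```python
from typing import List

MAX_TEXT_TOKENS = 256        # upper bound on BPE text tokens
NUM_IMAGE_TOKENS = 32 * 32   # 1024 dVAE tokens per image
IMAGE_VOCAB_SIZE = 8192      # dVAE vocabulary size

# Assumptions for this sketch only:
TEXT_VOCAB_SIZE = 16384      # assumed BPE vocabulary size
PAD_ID = 0                   # hypothetical single padding id


def build_sequence(text_ids: List[int], image_ids: List[int]) -> List[int]:
    """Concatenate padded text tokens with offset image tokens."""
    if len(image_ids) != NUM_IMAGE_TOKENS:
        raise ValueError("expected exactly 1024 image tokens")
    if any(not 0 <= t < IMAGE_VOCAB_SIZE for t in image_ids):
        raise ValueError("image token id out of range")
    # Truncate long captions, then pad short ones to a fixed length.
    text = text_ids[:MAX_TEXT_TOKENS]
    text = text + [PAD_ID] * (MAX_TEXT_TOKENS - len(text))
    # Shift image ids so they do not collide with text ids.
    image = [TEXT_VOCAB_SIZE + t for t in image_ids]
    return text + image


sequence = build_sequence([5, 6, 7], [1] * NUM_IMAGE_TOKENS)
print(len(sequence))  # 1280
```

Because the text always comes first, an autoregressive model reading this sequence left to right has seen the whole caption before it predicts the first image token.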

---

### Objective Function and Interpretation:

The model aims to learn the joint probability of:
- **Text prompt** $y$
- **Image** $x$
- **Latent image tokens** $z$ (from the discrete VAE)

This joint distribution is modeled as:

**$p_{\theta, \psi}(x, y, z) = p_{\theta}(x \mid y, z) \cdot p_{\psi}(y, z)$**

Where:
- $p_{\theta}(x \mid y, z)$: Conditional image generation given text and image tokens  
- $p_{\psi}(y, z)$: Prior over text and image tokens


## Objective Function

Because computing the likelihood $p_{\theta, \psi}(x, y)$ requires summing over every possible token grid $z$, it cannot be optimized directly.  
Instead, the authors **maximize the Evidence Lower Bound (ELBO):**

**ELBO:**

$\ln p_{\theta, \psi}(x, y) \geq \mathbb{E}_{z \sim q_{\phi}(z \mid x)} \left[ \ln p_{\theta}(x \mid y, z) \right] - \beta \, D_{\mathrm{KL}} \left( q_{\phi}(y, z \mid x) \,\|\, p_{\psi}(y, z) \right)$

---

### Explanation of Terms:

- $x$ — Input RGB image  
- $y$ — Text caption  
- $z$ — Discrete latent tokens from the dVAE  
- $q_{\phi}(z \mid x)$ — Distribution over image tokens generated by the dVAE encoder  
- $p_{\theta}(x \mid y, z)$ — Distribution over RGB images generated by the dVAE decoder  
- $p_{\psi}(y, z)$ — Prior modeled by the transformer over text and image tokens  
- $\beta$ — KL divergence weight
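
To make the two ELBO terms concrete, here is a toy calculation for a single categorical token. It is a sketch under strong simplifications: the real model has 1024 tokens per image, and the reconstruction log-likelihood here is just a number passed in by hand. The function names are hypothetical.

```python
import math
from typing import Sequence


def kl_categorical(q: Sequence[float], p: Sequence[float]) -> float:
    """KL(q || p) for two categorical distributions on the same support.

    Assumes p is non-zero wherever q is non-zero.
    """
    return sum(qi * math.log(qi / pi) for qi, pi in zip(q, p) if qi > 0.0)


def elbo(recon_log_lik: float, q: Sequence[float],
         p: Sequence[float], beta: float) -> float:
    """Reconstruction term minus the beta-weighted KL term."""
    return recon_log_lik - beta * kl_categorical(q, p)


# Hypothetical numbers: an encoder distribution over 3 token values,
# a uniform prior, and a made-up reconstruction log-likelihood.
q = [0.7, 0.2, 0.1]
prior = [1 / 3, 1 / 3, 1 / 3]
print(elbo(recon_log_lik=-1.5, q=q, p=prior, beta=1.0))
```

Raising `beta` penalizes encoder distributions that stray from the prior more heavily, at the cost of reconstruction quality.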


## Interpretation of the Objective Function

Optimizing the term $\ln p_{\theta}(x \mid y, z)$ helps improve the **reconstruction quality** of the dVAE.  
This ensures that the **image token compression retains as much information** from the original image as possible.

---

The KL divergence term:

$D_{\mathrm{KL}} \left( q_{\phi}(y, z \mid x) \,\|\, p_{\psi}(y, z) \right)$

acts as a **regularizer**, similar to its role in standard variational autoencoders:

- It keeps the encoder's token distribution **close to the prior**, so that tokens sampled from the prior decode to sensible images
- It discourages the encoder from **memorizing input images**
- It encourages the encoder to **learn shared structure** across different images in the dataset


## Architecture in Brief: dVAE

The **dVAE encoder and decoder** are convolutional (LeCun et al., 1998) **ResNets** (He et al., 2016) with bottleneck-style residual blocks.

---

### Key Architecture Details:

- Both encoder and decoder use primarily **3 × 3 convolutions**
- **1 × 1 convolutions** are used in **skip connections** where the number of feature maps changes
- The **first convolution of the encoder** is **7 × 7**
- The **final encoder layer** is a **1 × 1 convolution** that produces a **32 × 32 × 8192** output (used as logits for the image token categorical distribution)
- The **first and last layers of the decoder** are also **1 × 1 convolutions**

---

### Downsampling & Upsampling:

- The **encoder** uses **max-pooling** to downsample feature maps  
  *(empirically found to yield better ELBO than average pooling)*
- The **decoder** uses **nearest-neighbor upsampling**
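
The two resampling operations can be illustrated on a small grid. This is a pure-Python sketch, assuming non-overlapping 2×2 pooling windows and a 2× upsampling factor; the window size and factor are assumptions for the example, not details stated above. Note that three successive halvings would take a side length of 256 down to 32.

```python
from typing import List

Grid = List[List[float]]


def max_pool_2x2(grid: Grid) -> Grid:
    """Downsample by taking the max over non-overlapping 2x2 windows.

    Assumes the grid has an even number of rows and columns.
    """
    rows, cols = len(grid), len(grid[0])
    return [
        [
            max(grid[r][c], grid[r][c + 1],
                grid[r + 1][c], grid[r + 1][c + 1])
            for c in range(0, cols, 2)
        ]
        for r in range(0, rows, 2)
    ]


def nearest_upsample_2x(grid: Grid) -> Grid:
    """Upsample by repeating every value in a 2x2 block."""
    out: Grid = []
    for row in grid:
        wide = [v for v in row for _ in range(2)]  # repeat along columns
        out.append(wide)
        out.append(list(wide))                     # repeat along rows
    return out


feature_map = [
    [1, 2, 3, 4],
    [5, 6, 7, 8],
    [9, 10, 11, 12],
    [13, 14, 15, 16],
]
pooled = max_pool_2x2(feature_map)
print(pooled)                       # [[6, 8], [14, 16]]
print(nearest_upsample_2x(pooled))
```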

---

### Optimization:

- Parameters are updated using the **AdamW optimizer**


## Architecture in Brief: Transformer

The authors present a **decoder-only transformer** designed to model the **joint distribution of text and image tokens**.

---

### Model Overview:

- A **12-billion parameter** transformer  
- Processes sequences of:
  - Up to **256 BPE-encoded text tokens**
  - Followed by **1,024 image tokens** from the dVAE

---

### Attention Mechanism:

- Uses **causal self-attention**, so each image token can attend to:
  - All **text tokens**, which precede it in the sequence
  - **Preceding image tokens**, as permitted by the layer's attention mask

This enables the generation of **coherent images conditioned on textual descriptions**.

---

### Modality-Specific Attention Masks:

- **Text-to-text:** Standard **causal masks**
- **Image-to-image:**  
  Uses specialized masks like:
  - **Row masks**
  - **Column masks**
  - **Convolutional masks**
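
The sketch below builds simplified versions of two of these masks for a tiny sequence. It is an illustration only: here a "row" mask lets an image token see earlier tokens in its own grid row, and a "column" mask lets it see earlier tokens in its own grid column. The paper's actual masks differ in their details, and the convolutional mask is omitted. `build_mask` is a hypothetical helper.

```python
from typing import List

Mask = List[List[bool]]


def build_mask(num_text: int, grid: int, kind: str) -> Mask:
    """Return mask[i][j], True when position i may attend to position j.

    Positions 0..num_text-1 are text; the rest are a grid x grid image
    in raster order. Simplified masks, assumed for this sketch.
    """
    if kind not in ("row", "column"):
        raise ValueError(f"unknown mask kind: {kind}")
    total = num_text + grid * grid
    mask = [[False] * total for _ in range(total)]
    for i in range(total):
        for j in range(i + 1):  # causal: never look at later positions
            if j < num_text:
                # Text keys are visible to every later text or image query.
                mask[i][j] = True
                continue
            # j is an image position and j <= i, so i is one as well.
            qi, kj = i - num_text, j - num_text
            if kind == "row":
                mask[i][j] = qi // grid == kj // grid
            else:
                mask[i][j] = qi % grid == kj % grid
    return mask


# 2 text tokens followed by a 2x2 image grid.
for row in build_mask(num_text=2, grid=2, kind="row"):
    print("".join("x" if allowed else "." for allowed in row))
```

Alternating mask types across layers lets information travel between distant image positions over several layers, even though each single layer attends sparsely.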

---

### Key Property:

This design allows the transformer to perform **zero-shot image generation** from text prompts —  **no task-specific fine-tuning** is required.


## Dataset

To scale the model to **12 billion parameters**, the authors collected a much larger dataset of  
**250 million image–text pairs from the internet**, comparable in scale to **JFT-300M**.

- While **MS-COCO** was not explicitly included, some validation images (not captions) were present due to **overlap with YFCC100M**, one of the sources used.

---

### Filtering Techniques Used to Ensure Quality:

- Captions that were **too short** or **not in English** (detected using `cld3`) were removed  
- Captions with **generic boilerplate** text (e.g., "photographed on \<date\>") were discarded  
- Images with **extreme aspect ratios** (outside the range [1/2, 2]) were excluded  
  - This helps avoid cropping out important visual content during training
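
A filter along these lines might look like the sketch below. The aspect-ratio range comes from the list above; everything else is an assumption for illustration: the minimum caption length, the single boilerplate pattern, and the `is_english` callable, which stands in for a language detector such as `cld3` so that the example has no external dependencies.

```python
import re
from typing import Callable

MIN_CAPTION_WORDS = 3  # hypothetical threshold; no number is given above
# Illustrative pattern for one kind of boilerplate caption.
BOILERPLATE = re.compile(r"^photographed on\b", re.IGNORECASE)


def keep_pair(caption: str, width: int, height: int,
              is_english: Callable[[str], bool]) -> bool:
    """Return True if an image-text pair passes all filters."""
    if width <= 0 or height <= 0:
        return False
    if len(caption.split()) < MIN_CAPTION_WORDS:
        return False  # caption too short
    if not is_english(caption):
        return False  # non-English caption
    if BOILERPLATE.search(caption.strip()):
        return False  # generic boilerplate
    aspect = width / height
    return 0.5 <= aspect <= 2.0  # keep aspect ratios in [1/2, 2]


def always_english(text: str) -> bool:
    """Stand-in language detector used only for this demo."""
    return True


print(keep_pair("a red bird on a branch", 640, 480, always_english))
print(keep_pair("photographed on 2012-05-01 in Paris", 640, 480, always_english))
print(keep_pair("a very wide panorama of hills", 3000, 1000, always_english))
```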

---

These preprocessing steps aim to keep the resulting dataset **clean** while preserving its **scale and diversity**, supporting effective training of the large-scale transformer.


## Significance and Results

The dataset used in this paper — **250 million image–text pairs** — is significantly larger than previously used datasets such as **MS-COCO** and **CUB-200**.

---

### Key Insight:

> **More data + larger model = better generalization**

---

### Zero-Shot Evaluation:

The model was evaluated in a **zero-shot setting** against three prior state-of-the-art approaches:

- **AttnGAN** (Xu et al., 2018)  
- **DM-GAN** (Zhu et al., 2019)  
- **DF-GAN** (Tao et al., 2020)  
  - Of the three, DF-GAN reports the best **Inception Score (IS)** and **Fréchet Inception Distance (FID)** on MS-COCO  
  - Metrics based on:  
    - Inception Score: *Salimans et al., 2016*  
    - FID: *Heusel et al., 2017*

---

### Human Evaluation:

A human preference study, similar to **Koh et al. (2021)**, was conducted to compare the model's performance to DF-GAN.  



<img src="https://raw.githubusercontent.com/zombie-programmer-code/DA323_project/main/project_images/results.png" alt="Results" width="500"/>


### Results on CUB, a specialized dataset

<img src="https://raw.githubusercontent.com/zombie-programmer-code/DA323_project/main/project_images/cub_results.png" alt="CUB results" width="600"/>


The model performs **significantly worse on the CUB dataset**, with nearly a **40-point gap in FID** compared to the leading prior approach.

- Although about **12% of the CUB images overlapped** with the training data,  
  the authors observed **no significant difference in results** after removing these overlapping images.

---

### Key Observation:

- The **zero-shot approach** tends to struggle on **specialized distributions** like CUB.
- This suggests that **general-purpose models may underperform** on domain-specific datasets without adaptation.

---

### Future Direction:

- The authors identify **fine-tuning** as a **promising path for improving performance** on more niche or fine-grained datasets like CUB — and **leave this investigation to future work**.


## Future Impact

### This Paper Sparked a Revolution

It was followed by a wave of influential text-to-image models:

| **Model**            | **Key Innovation**                                                         |
|----------------------|----------------------------------------------------------------------------|
| **DALL·E 2**         | CLIP image embeddings + diffusion decoder                                  |
| **GLIDE**            | Text-conditional diffusion with CLIP guidance or classifier-free guidance  |
| **Imagen**           | Frozen large language model (T5) text encoder + cascaded diffusion         |
| **Parti**            | Autoregressive transformer over discrete image tokens                      |
| **Stable Diffusion** | Open-source latent diffusion model                                         |

> *Table: Notable text-to-image models that followed Zero-Shot Text-to-Image Generation*

