# Zero-Shot Text-to-Image Generation


## Contents

- About the Paper  
- Motivation  
- Previous Approaches  
- Method  
- Datasets  
- Significance and Results  
- Future Impact  


## About the Paper

**Paper Title:**  
*Zero-Shot Text-to-Image Generation*  

**Authors:**  
Aditya Ramesh, Mikhail Pavlov, Gabriel Goh, Scott Gray,  
Chelsea Voss, Alec Radford, Mark Chen, Ilya Sutskever (OpenAI)  

**Published:**  
arXiv preprint: [https://arxiv.org/abs/2102.12092](https://arxiv.org/abs/2102.12092)  
Date: February 2021  

**Core Contribution:**  
A scalable autoregressive transformer that models text and image tokens as a single sequence.  
The model generates realistic images from text prompts **without any task-specific fine-tuning**.


## Motivation for this Paper

Earlier generative models such as **GANs** and **VAEs** struggle with generalization; that is, they perform poorly on unseen data.

Major issues with GANs include:
- **Mode collapse**  
  The generator ends up producing samples from only a few types (the “modes”) of the data distribution, which reduces the variety of its outputs.
  
- **Nonconvergence and instability**  
  This can arise due to:
  - Inappropriate design of network architecture  
  - Poor choice of objective function  
  - Suboptimal optimization algorithms

Training instability is a critical problem. For example, if the discriminator can easily distinguish fake images from real ones, the gradient it passes back to the generator vanishes, leaving the generator with little signal to improve.
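
The vanishing-gradient effect can be checked numerically. The sketch below assumes the original minimax generator objective, `log(1 - D(G(z)))`, with a sigmoid discriminator output; the function names are illustrative and not taken from the paper. Under that assumption, the gradient of the loss with respect to the discriminator's logit on a fake image is `-sigmoid(logit)`.

```python
import math

def sigmoid(a):
    """Logistic function mapping a discriminator logit to P(real)."""
    return 1.0 / (1.0 + math.exp(-a))

def generator_grad(fake_logit):
    """Gradient of log(1 - sigmoid(a)) with respect to the logit a.

    Assumes the original minimax generator objective; the derivative
    simplifies to -sigmoid(a).
    """
    return -sigmoid(fake_logit)

# More negative logits mean the discriminator rejects the fake more confidently.
for logit in (0.0, -2.0, -5.0, -10.0):
    print(f"logit={logit:6.1f}  |grad|={abs(generator_grad(logit)):.6f}")
```

The printed gradient magnitude shrinks toward zero as the discriminator grows more confident, which is the sense in which the generator loses its training signal.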


![Mode collapse](Desktop/mode_collapse.png)



Traditionally, **text-to-image synthesis** has been approached by improving modeling assumptions for training on a fixed dataset.

These approaches often rely on:
- Complex architectures  
- Auxiliary losses  
- Object part labels  
- Dataset-specific design choices  

Such methods can **hurt generalization**, making them less effective on unseen prompts or domains.

---

To counter these limitations, the authors propose a **simple and scalable approach**:
- Use an **autoregressive transformer**  
- Model **text and image tokens as a single stream of data**  

With enough data, this method proves **competitive with task-specific models** when evaluated in a **zero-shot** setting.
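
As a rough illustration of the "single stream" idea, the sketch below concatenates text tokens and image tokens into one sequence and builds next-token prediction pairs from it. The toy vocabulary size, the id-offset scheme, and the function names are assumptions made for this example, not details taken from the paper.

```python
def build_stream(text_tokens, image_tokens, text_vocab_size):
    """Concatenate text and image tokens into one sequence.

    Image ids are shifted by text_vocab_size so the two vocabularies
    occupy disjoint id ranges (an illustrative choice).
    """
    return list(text_tokens) + [text_vocab_size + t for t in image_tokens]

def next_token_pairs(stream):
    """Autoregressive training pairs: predict token i from all tokens before it."""
    return [(stream[:i], stream[i]) for i in range(1, len(stream))]

# Toy example: 3 text tokens followed by 4 image tokens.
stream = build_stream([5, 2, 9], [0, 3, 1, 3], text_vocab_size=16)
for context, target in next_token_pairs(stream):
    print(context, "->", target)
```

Because the image tokens come after the text tokens, every image token is predicted with the full caption in its context, so one next-token objective covers the conditioning on text.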


## Generating Images from Captions with Attention

In **Mansimov et al., 2015**, the authors extended the **Deep Recurrent Attentive Writer (DRAW)** technique to train a model that:

- Iteratively draws patches on a canvas  
- Attends to relevant words in the text description at each step  

### Key Mechanism:
- The current hidden state for image generation is computed using:
  - The **previous hidden state**, and  
  - An **alignment score** between:
    - The hidden states from the text  
    - The previous image hidden state  

- This current hidden state is then used to **generate the image at the current iteration**
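
The mechanism above can be sketched as one attention step over the text hidden states. This is a minimal illustration, not the authors' implementation: a plain dot product stands in for the model's learned alignment function, a `tanh` mix stands in for its recurrent cell, and all function names are hypothetical.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(text_states, prev_image_state):
    """Score each text hidden state against the previous image hidden state.

    Returns the normalized alignment weights and the weighted sum of
    text states (the context vector).
    """
    scores = [dot(h, prev_image_state) for h in text_states]
    weights = softmax(scores)
    dim = len(text_states[0])
    context = [sum(w * h[j] for w, h in zip(weights, text_states))
               for j in range(dim)]
    return weights, context

def step(prev_image_state, context):
    """Toy recurrent update combining the previous state with the text context."""
    return [math.tanh(p + c) for p, c in zip(prev_image_state, context)]

# Two words with 2-d hidden states; the previous image state favors the first.
text_states = [[1.0, 0.0], [0.0, 1.0]]
prev_state = [2.0, 0.0]
weights, context = attend(text_states, prev_state)
new_state = step(prev_state, context)
print(weights, new_state)
```

Repeating `attend` and `step` across iterations lets each drawing step focus on different words of the description.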

This was one of the **first deep learning-based approaches** for image synthesis from natural language descriptions.
