Sana is a text-to-image framework developed by NVLabs that can efficiently generate images up to 4096 × 4096 resolution. Sana synthesizes high-resolution, high-quality images with strong text-image alignment at a remarkably fast speed, and is deployable on a laptop GPU. Core designs include:
- Deep compression autoencoder: unlike traditional AEs, which compress images only 8×, the authors trained an AE that can compress images 32×, effectively reducing the number of latent tokens (see the token-count sketch after this list).
- Linear DiT: the authors replaced all vanilla attention in DiT with linear attention, which is more efficient at high resolutions without sacrificing quality.
- Decoder-only text encoder: T5 is replaced with a modern decoder-only small LLM as the text encoder, and complex human instructions with in-context learning are designed to enhance image-text alignment.
- Efficient training and sampling: the authors proposed Flow-DPM-Solver to reduce sampling steps, together with efficient caption labeling and selection to accelerate convergence.
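To make the effect of the deeper autoencoder concrete, the sketch below compares the number of latent tokens produced by a conventional 8× AE and a 32× AE for a 4096 × 4096 image. This is illustrative arithmetic only and assumes one token per latent position; the exact patchification inside the Sana DiT may differ.

```python
# Rough token-count comparison (illustrative only): the number of latent
# tokens a DiT has to process scales with (resolution / compression) ** 2.
resolution = 4096

for compression in (8, 32):
    latent_side = resolution // compression
    tokens = latent_side * latent_side
    print(f"{compression}x AE: {latent_side}x{latent_side} latent grid -> {tokens} tokens")

# 8x  AE: 512x512 latent grid -> 262144 tokens
# 32x AE: 128x128 latent grid -> 16384 tokens (16x fewer)
```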
More details about the model can be found in the paper, model page and original repo. In this tutorial, we consider how to run the Sana model using OpenVINO.
In this demonstration, you will learn how to perform text-to-image generation using Sana and OpenVINO.
Example of the model output:
Input prompt: a cyberpunk cat with a neon sign that says "Sana"
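As a reference point, the same prompt can be run with the original PyTorch pipeline from diffusers before any OpenVINO conversion. This is a minimal sketch; the checkpoint id below is an assumption, and the notebook may use a different Sana variant or generation settings.

```python
# Minimal PyTorch reference run with diffusers
# (assumed checkpoint id; the notebook may use a different Sana variant).
import torch
from diffusers import SanaPipeline

pipe = SanaPipeline.from_pretrained(
    "Efficient-Large-Model/Sana_600M_1024px_diffusers",  # assumed model id
    torch_dtype=torch.float32,
)

image = pipe(
    prompt='a cyberpunk cat with a neon sign that says "Sana"',
    num_inference_steps=20,
    guidance_scale=4.5,
).images[0]
image.save("sana_cat.png")
```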
The tutorial consists of the following steps:
- Install prerequisites
- Collect PyTorch model pipeline
- Convert model to OpenVINO intermediate representation (IR) format
- Compress model weights using NNCF
- Prepare OpenVINO Inference pipeline
- Run Text-to-Image generation
- Launch interactive demo
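The sketch below illustrates how the conversion and weight-compression steps typically look with `openvino` and `nncf`. The example input shapes, the choice of the transformer submodel, and the file paths are assumptions, not the notebook's exact code; the actual notebook wraps the converted models into a dedicated inference pipeline.

```python
# Hedged sketch of the convert / compress / compile steps (shapes, argument
# names and paths are assumptions, not the notebook's exact code).
import torch
import openvino as ov
import nncf

# 1. Convert a PyTorch submodel (e.g. the Sana transformer) to OpenVINO IR.
#    `pipe` is the diffusers SanaPipeline loaded earlier.
latent = torch.randn(2, 32, 32, 32)        # assumed latent shape
timestep = torch.tensor([999.0, 999.0])
text_emb = torch.randn(2, 300, 2304)       # assumed text-embedding shape
ov_transformer = ov.convert_model(
    pipe.transformer,
    example_input={
        "hidden_states": latent,
        "timestep": timestep,
        "encoder_hidden_states": text_emb,
    },
)

# 2. Compress weights with NNCF to shrink the model and speed up inference.
ov_transformer = nncf.compress_weights(ov_transformer)

# 3. Save the IR and compile it for the selected device.
ov.save_model(ov_transformer, "sana_transformer.xml")
core = ov.Core()
compiled_transformer = core.compile_model("sana_transformer.xml", "CPU")
```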
This is a self-contained example that relies solely on its own code.
We recommend running the notebook in a virtual environment. You only need a Jupyter server to start.
For further details, please refer to Installation Guide.