📃 Paper • 📦 Checkpoint
This repo contains PyTorch model definitions, pre-trained weights, and sampling code for our Flexible Vision Transformer (FiT). FiT is a diffusion-transformer-based model that can generate images at unrestricted resolutions and aspect ratios.
The core features will include:
- Pre-trained class-conditional FiT-XL-2-16 (1800K) model weights trained on ImageNet ($H\times W \le 256\times256$).
- PyTorch sampling code for running pre-trained FiT-XL/2 models to generate images at unrestricted resolutions and aspect ratios.
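As a rough illustration of the training constraint $H\times W \le 256\times256$, the sketch below checks whether a target resolution falls inside that budget and estimates how many tokens it would produce. The function names are hypothetical and not part of this repo, and the 16-pixel stride per token (an 8x latent downsampling combined with a patch size of 2) is an assumption that this README does not state.

```python
def fits_training_budget(height: int, width: int, max_side: int = 256) -> bool:
    """Return True if height * width <= max_side * max_side."""
    return height * width <= max_side * max_side


def token_count(height: int, width: int, stride: int = 16) -> int:
    """Count tokens, ASSUMING each token covers a stride x stride pixel region.

    The default stride of 16 is an assumption for illustration only.
    """
    if height % stride or width % stride:
        raise ValueError("height and width must be multiples of the stride")
    return (height // stride) * (width // stride)


if __name__ == "__main__":
    # A few example resolutions with different aspect ratios.
    for h, w in [(256, 256), (160, 320), (512, 128), (320, 320)]:
        status = "within budget" if fits_training_budget(h, w) else "extrapolation"
        print(f"{h}x{w}: {token_count(h, w)} tokens, {status}")
```

Under these assumptions, resolutions such as 160x320 or 512x128 stay within the training budget despite their aspect ratios, while 320x320 exceeds it and would count as resolution extrapolation.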
Why do we need FiT?
- 🧐 Nature is infinitely resolution-free. FiT, like Sora, was trained on images of unrestricted resolution and aspect ratio, so it can generate images at unrestricted resolutions and aspect ratios.
- 🤗 FiT exhibits remarkable flexibility in resolution extrapolation, that is, generating images at resolutions beyond its training range.
Stay tuned for this project! 😆
This codebase borrows from DiT.
@article{Lu2024FiT,
title={FiT: Flexible Vision Transformer for Diffusion Model},
author={Zeyu Lu and Zidong Wang and Di Huang and Chengyue Wu and Xihui Liu and Wanli Ouyang and Lei Bai},
year={2024},
journal={arXiv preprint arXiv:2402.12376},
}