
MakeLongVideo - Pytorch

An implementation of long video generation based on a diffusion model.

"Ironman is surfing"	"a car is racing"	"a cat eating food of a bowl, in von Gogh style"	"a giraffe underneath the microwave"

"a glass bead falling into water with huge splash"	"a video of Earth rotating in space"	"A teddy bear running in New York City"	"A stunning aerial drone footage time lapse of El Capitan in Yosemite National Park at sunset"

Change Logs

[07/23/2023] LAION400M did not help much, so I collected another 100M video/text pairs in addition to the 2M WebVid dataset. Some of them are watermark-free. After 2~3 months of training, the results seem decent. I will release a watermark-free checkpoint soon. Training a video generation model on two RTX 3090 GPUs is really a pain.

Setup

Requirements

python3 -m pip install -r requirements.txt

Training

Prepare Stable Diffusion v1-4 pretrained weights

Download the weights from Hugging Face and put them in the directory 'checkpoints', which is the location configured in configs/makelongvideo.yaml.
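
Before training, it can help to confirm that the download is complete. The sketch below checks a local folder for subfolders that the diffusers-format release of Stable Diffusion v1-4 ships with. The path `checkpoints/stable-diffusion-v1-4` and the helper name `missing_components` are assumptions for illustration; use whatever path your configs/makelongvideo.yaml actually points to.

```python
from pathlib import Path

# Subfolders found in the diffusers-format Stable Diffusion v1-4 release.
# Which of them train.py really loads is an assumption; adjust as needed.
EXPECTED_COMPONENTS = ("scheduler", "text_encoder", "tokenizer", "unet", "vae")


def missing_components(checkpoint_dir):
    """Return the expected subfolders that are absent from checkpoint_dir."""
    root = Path(checkpoint_dir)
    return [name for name in EXPECTED_COMPONENTS if not (root / name).is_dir()]


if __name__ == "__main__":
    # Hypothetical location; match it to the path set in configs/makelongvideo.yaml.
    missing = missing_components("checkpoints/stable-diffusion-v1-4")
    if missing:
        print("Missing components:", ", ".join(missing))
    else:
        print("All expected components found.")
```

An empty result only means the folders exist; it does not verify that the weight files inside them are intact.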

Download webvid dataset

Download the WebVid dataset into the directory 'data/webvid' using the https://github.com/m-bain/webvid repo. Then prepare the dataset with the command

python3 genvideocap.py

Download LAION400M dataset

Download LAION400M into the directory 'data/laion400m'.

Train

First, train at resolution 128x128:

accelerate launch --config_file ./configs/multigpu.yaml train.py --config configs/makelongvideo.yaml

Then fine-tune at resolution 256x256. Before launching, modify the last line of configs/makelongvideo256x256.yaml so that it points to your local epoch checkpoint:

accelerate launch --config_file ./configs/multigpu.yaml train.py --config configs/makelongvideo256x256.yaml
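
To find the checkpoint path to put in the last line of configs/makelongvideo256x256.yaml, a small helper like the one below may be convenient. It is only a sketch: it assumes checkpoints are saved as `checkpoint-<number>` folders (the naming seen in the unwrap command in this README) under `./outputs/makelongvideo`, and `latest_checkpoint` is a hypothetical helper, not part of the repository.

```python
import re
from pathlib import Path

# Assumed naming scheme: checkpoint-<number>, e.g. checkpoint-5200.
CHECKPOINT_PATTERN = re.compile(r"^checkpoint-(\d+)$")


def latest_checkpoint(output_dir):
    """Return the checkpoint-<number> subfolder with the highest number, or None."""
    root = Path(output_dir)
    if not root.is_dir():
        return None
    best_step, best_path = -1, None
    for child in root.iterdir():
        match = CHECKPOINT_PATTERN.match(child.name)
        if child.is_dir() and match and int(match.group(1)) > best_step:
            best_step, best_path = int(match.group(1)), child
    return best_path


if __name__ == "__main__":
    # Prints None if the folder does not exist or holds no checkpoints.
    print(latest_checkpoint("./outputs/makelongvideo"))
```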

Inference

Pretrained weights: https://huggingface.co/xiexiecn/MakeLongVideo

# unwrap checkpoint first
TORCH_DISTRIBUTED_DEBUG=DETAIL accelerate launch train.py --config configs/makelongvideo.yaml --unwrap ./outputs/makelongvideo/checkpoint-5200

Run inference directly:

python3 infer.py  --width 256 --height 256 --prompt "a panda is surfing"
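
To generate videos for several prompts in a row, the command above can be assembled in a loop. The following is a minimal sketch that uses only the `--width`, `--height`, and `--prompt` flags shown in this README; `build_infer_command` is a hypothetical helper and the second prompt is just an example.

```python
import shlex
import sys


def build_infer_command(prompt, width=256, height=256):
    """Build the infer.py argument list for one prompt."""
    return [
        sys.executable, "infer.py",
        "--width", str(width),
        "--height", str(height),
        "--prompt", prompt,
    ]


if __name__ == "__main__":
    prompts = ["a panda is surfing", "a car is racing"]
    for prompt in prompts:
        # Print the shell-quoted command instead of running it.
        print(shlex.join(build_infer_command(prompt)))
```

Each returned list can be passed to `subprocess.run(cmd, check=True)` from the repository root to actually run inference.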

Run inference with latents initialized from a sample video:

python3 infer.py  --width 256 --height 256 --prompt "a panda is surfing" --sample_video_path your_sample_video

Run inference with a frame sampling rate of 6 (the actual frame rate is 24/6 == 4):

python3 infer.py  --width 256 --height 256 --prompt "a panda is surfing" --speed 6
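
The frame rate arithmetic above can be written out as a short sketch. It assumes that `--speed N` corresponds to keeping one frame out of every N from a 24 fps source, which is how the 24/6 == 4 note reads; how infer.py uses the value internally is not confirmed here, and both function names are hypothetical.

```python
def effective_fps(source_fps, speed):
    """Frame rate that results from keeping every `speed`-th frame."""
    if speed <= 0:
        raise ValueError("speed must be a positive integer")
    return source_fps / speed


def sampled_frame_indices(num_frames, speed, start=0):
    """Indices of the source frames kept when taking every `speed`-th frame."""
    return [start + i * speed for i in range(num_frames)]


if __name__ == "__main__":
    print(effective_fps(24, 6))         # 4.0
    print(sampled_frame_indices(4, 6))  # [0, 6, 12, 18]
```

Under this assumption, a larger `--speed` value covers more elapsed time per generated frame, so motion appears faster.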

Todo

References

Make-A-Video: https://github.com/lucidrains/make-a-video-pytorch
Tune-A-Video: https://github.com/showlab/Tune-A-Video
diffusers: https://github.com/huggingface/diffusers

Citations

@misc{Singer2022,
    author  = {Uriel Singer},
    url     = {https://makeavideo.studio/Make-A-Video.pdf}
}

@article{wu2022tuneavideo,
    title   = {Tune-A-Video: One-Shot Tuning of Image Diffusion Models for Text-to-Video Generation},
    author  = {Wu, Jay Zhangjie and Ge, Yixiao and Wang, Xintao and Lei, Stan Weixian and Gu, Yuchao and Hsu, Wynne and Shan, Ying and Qie, Xiaohu and Shou, Mike Zheng},
    journal = {arXiv preprint arXiv:2212.11565},
    year    = {2022},
    note    = {under review}
}
