## CogVideoX Text-to-Video

This notebook demonstrates how to run [CogVideoX-2b](https://huggingface.co/THUDM/CogVideoX-2b) and [CogVideoX-5b](https://huggingface.co/THUDM/CogVideoX-5b) with 🧨 Diffusers on a free-tier Colab GPU.

Additional resources:
- [Docs](https://huggingface.co/docs/diffusers/en/api/pipelines/cogvideox)
- [Quantization with TorchAO](https://github.com/sayakpaul/diffusers-torchao/)
- [Quantization with Quanto](https://gist.github.com/a-r-r-o-w/31be62828b00a9292821b85c1017effa)

Note: If, for whatever reason, you randomly get an OOM error, give it a try on Kaggle T4 instances instead. I've found that Colab free-tier T4 can be unreliable at times. Sometimes, the notebook will run smoothly, but other times it will crash with an error 🤷🏻‍♂️

#### Install the necessary requirements

In [12]:
%pip install diffusers transformers hf_transfer

Note: you may need to restart the kernel to use updated packages.


In [13]:
%pip install git+https://github.com/huggingface/accelerate
%pip install accelerate==0.33.0

Collecting git+https://github.com/huggingface/accelerate
  Cloning https://github.com/huggingface/accelerate to /tmp/pip-req-build-6ab7uhac
  Running command git clone --filter=blob:none --quiet https://github.com/huggingface/accelerate /tmp/pip-req-build-6ab7uhac
  Resolved https://github.com/huggingface/accelerate to commit 8ade23cc6aec7c3bd3d80fef6378cafaade75bbe
  Installing build dependencies ... [?25ldone
[?25h  Getting requirements to build wheel ... [?25ldone
[?25h  Preparing metadata (pyproject.toml) ... [?25ldone
Building wheels for collected packages: accelerate
  Building wheel for accelerate (pyproject.toml) ... [?25ldone
[?25h  Created wheel for accelerate: filename=accelerate-1.1.0.dev0-py3-none-any.whl size=333339 sha256=8fee06936c9b64e1e91a214f6337b27be50235a9261f980b856d9e17fccd0244
  Stored in directory: /tmp/pip-ephem-wheel-cache-zbnu_1s7/wheels/f6/c7/9d/1b8a5ca8353d9307733bc719107acb67acdc95063bba749f26
Successfully built accelerate
Installing collected pack

#### Import required libraries

The following block is optional but if enabled, downloading models from the HF Hub will be much faster

In [14]:
import os
os.environ["HF_HUB_ENABLE_HF_TRANSFER"] = "1"

In [15]:
import torch
from diffusers import AutoencoderKLCogVideoX, CogVideoXPipeline, CogVideoXTransformer3DModel
from diffusers.utils import export_to_video
from transformers import T5EncoderModel

#### Load models and create pipeline

Note: `bfloat16`, which is the recommended dtype for running "CogVideoX-5b" will cause OOM errors due to lack of efficient support on Turing GPUs.

Therefore, we must use `float16`, which might result in poorer generation quality. The recommended solution is to use Ampere or above GPUs, which also support efficient quantization kernels from [TorchAO](https://github.com/pytorch/ao) :(

In [16]:
# Models: "THUDM/CogVideoX-2b" or "THUDM/CogVideoX-5b"
model_id = "THUDM/CogVideoX-5b"

In [17]:
# Thank you [@camenduru](https://github.com/camenduru)!
# The reason for using checkpoints hosted by Camenduru instead of the original is because they exported
# with a max_shard_size of "5GB" when saving the model with `.save_pretrained`. The original converted
# model was saved with "10GB" as the max shard size, which causes the Colab CPU RAM to be insufficient
# leading to OOM (on the CPU)

transformer = CogVideoXTransformer3DModel.from_pretrained("camenduru/cogvideox-5b-float16", subfolder="transformer", torch_dtype=torch.float16)
text_encoder = T5EncoderModel.from_pretrained("camenduru/cogvideox-5b-float16", subfolder="text_encoder", torch_dtype=torch.float16)
vae = AutoencoderKLCogVideoX.from_pretrained(model_id, subfolder="vae", torch_dtype=torch.float16)

Downloading shards: 100%|██████████| 3/3 [00:00<00:00, 2424.92it/s]
Loading checkpoint shards: 100%|██████████| 3/3 [00:00<00:00,  4.67it/s]


In [21]:
# Create pipeline and run inference
pipe = CogVideoXPipeline.from_pretrained(
    model_id,
    text_encoder=text_encoder,
    transformer=transformer,
    vae=vae,
    torch_dtype=torch.float16,
)

Loading pipeline components...:  40%|████      | 2/5 [00:00<00:00, 2105.05it/s]


ImportError: 
T5Tokenizer requires the SentencePiece library but it was not found in your environment. Checkout the instructions on the
installation page of its repo: https://github.com/google/sentencepiece#installation and follow the ones
that match your environment. Please note that you may need to restart your runtime after installation.


#### Enable memory optimizations

Note that sequential cpu offloading is necessary for being able to run the model on Turing or lower architectures. It aggressively maintains everything on the CPU and only moves the currently executing nn.Module to the GPU. This saves a lot of VRAM but adds a lot of overhead for inference, making generations extremely slow (1 hour+). Unfortunately, this is the only solution for running the model on Colab until efficient kernels are supported.

In [None]:
pipe.enable_sequential_cpu_offload()
# pipe.vae.enable_tiling()

#### Generate!

In [None]:
prompt = (
    "A weary soldier, clad in a dusty, camouflage uniform, stands solemnly in front of the camera, his eyes reflecting a deep sadness and resignation. His face, marked by the grime of battle and the weight of impending conflict, conveys a poignant awareness that the war is imminent. The background is a blur of military activity, hinting at the chaos about to unfold. His posture is rigid yet somehow defeated, as he clutches his helmet in one hand, a symbol of the protection and burden he carries. The somber lighting casts shadows over his features, emphasizing the heavy toll of his duty and the somber realization that his time to face the horrors of war has arrived."
)

In [None]:
video = pipe(prompt=prompt, guidance_scale=6, use_dynamic_cfg=True, num_inference_steps=50).frames[0]

  0%|          | 0/50 [00:00<?, ?it/s]

In [None]:
export_to_video(video, "output.mp4", fps=30)