
Constraints on Parallel Settings for CogVideoX and Performance Issue #265

Closed
feifeibear opened this issue Sep 13, 2024 · 2 comments

feifeibear commented Sep 13, 2024

xDiT currently implements the sequence parallel (SP) version of CogVideoX. However, there are restrictions when using it:

  1. head_num (30 for CogVideoX) must satisfy head_num % ulysses_degree == 0.
  2. height must satisfy height % sp_degree == 0.
  3. With --height 640 --width 720 and sp_degree = 8 (ulysses = 2, ring = 4), the VAE decoder throws an error:
[rank7]:   File "/cfs/fjr2/xDiT/xfuser/model_executor/pipelines/pipeline_cogvideox.py", line 372, in __call__
[rank7]:     video = self.decode_latents(latents)
[rank7]:   File "/home/pjz/miniconda3/envs/fjr/lib/python3.10/site-packages/diffusers/pipelines/cogvideo/pipeline_cogvideox.py", line 360, in decode_latents
[rank7]:     frames = self.vae.decode(latents).sample
[rank7]:   File "/home/pjz/miniconda3/envs/fjr/lib/python3.10/site-packages/diffusers/utils/accelerate_utils.py", line 46, in wrapper
[rank7]:     return method(self, *args, **kwargs)
[rank7]:   File "/home/pjz/miniconda3/envs/fjr/lib/python3.10/site-packages/diffusers/models/autoencoders/autoencoder_kl_cogvideox.py", line 1153, in decode
[rank7]:     decoded = self._decode(z).sample
[rank7]:   File "/home/pjz/miniconda3/envs/fjr/lib/python3.10/site-packages/diffusers/models/autoencoders/autoencoder_kl_cogvideox.py", line 1123, in _decode
[rank7]:     z_intermediate = self.decoder(z_intermediate)
[rank7]:   File "/home/pjz/miniconda3/envs/fjr/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
[rank7]:     return self._call_impl(*args, **kwargs)
[rank7]:   File "/home/pjz/miniconda3/envs/fjr/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
[rank7]:     return forward_call(*args, **kwargs)
[rank7]:   File "/home/pjz/miniconda3/envs/fjr/lib/python3.10/site-packages/diffusers/models/autoencoders/autoencoder_kl_cogvideox.py", line 851, in forward
[rank7]:     hidden_states = self.conv_in(sample)
[rank7]:   File "/home/pjz/miniconda3/envs/fjr/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
[rank7]:     return self._call_impl(*args, **kwargs)
[rank7]:   File "/home/pjz/miniconda3/envs/fjr/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
[rank7]:     return forward_call(*args, **kwargs)
[rank7]:   File "/home/pjz/miniconda3/envs/fjr/lib/python3.10/site-packages/diffusers/models/autoencoders/autoencoder_kl_cogvideox.py", line 144, in forward
[rank7]:     output = self.conv(inputs)
[rank7]:   File "/home/pjz/miniconda3/envs/fjr/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
[rank7]:     return self._call_impl(*args, **kwargs)
[rank7]:   File "/home/pjz/miniconda3/envs/fjr/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
[rank7]:     return forward_call(*args, **kwargs)
[rank7]:   File "/home/pjz/miniconda3/envs/fjr/lib/python3.10/site-packages/diffusers/models/autoencoders/autoencoder_kl_cogvideox.py", line 64, in forward
[rank7]:     return super().forward(input)
[rank7]:   File "/home/pjz/miniconda3/envs/fjr/lib/python3.10/site-packages/torch/nn/modules/conv.py", line 608, in forward
[rank7]:     return self._conv_forward(input, self.weight, self.bias)
[rank7]:   File "/home/pjz/miniconda3/envs/fjr/lib/python3.10/site-packages/torch/nn/modules/conv.py", line 603, in _conv_forward
[rank7]:     return F.conv3d(
[rank7]: RuntimeError: Given groups=1, weight of size [512, 16, 3, 3, 3], expected input[1, 10, 5, 82, 92] to have 16 channels, but got 10 channels instead
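The first two constraints are simple divisibility checks that could be validated before launch. A minimal pre-flight sketch (the function name and signature are hypothetical, not part of the xDiT API) might look like this; note that the failing configuration above actually passes both checks (640 % 8 == 0), so the 16-vs-10-channel mismatch in the VAE comes from how the latents are split, not from these surface constraints:

```python
# Hypothetical pre-flight check for the two divisibility constraints above.
# head_num, ulysses_degree, ring_degree, and height are taken from the issue;
# the function itself is illustrative, not xDiT code.
def validate_sp_config(head_num: int, ulysses_degree: int,
                       ring_degree: int, height: int) -> None:
    sp_degree = ulysses_degree * ring_degree
    if head_num % ulysses_degree != 0:
        raise ValueError(
            f"head_num ({head_num}) must be divisible by "
            f"ulysses_degree ({ulysses_degree})")
    if height % sp_degree != 0:
        raise ValueError(
            f"height ({height}) must be divisible by "
            f"sp_degree ({sp_degree})")

# CogVideoX has 30 heads; ulysses=2, ring=4 gives sp_degree=8 and 640 % 8 == 0,
# so this configuration passes the check yet still crashes in the VAE decode.
validate_sp_config(30, 2, 4, 640)
```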
feifeibear commented:

In addition to the parallel-degree constraints, the SP implementation has performance issues: on an L40 machine, two GPUs are slower than one.

1 GPU
output saved to results/cogvideox_dp1_cfg1_ulysses1_ring1_tp1_pp1_patchNone_720x640.mp4
epoch time: 2.42 sec, memory: 28.733202944 GB

2 GPU
output saved to results/cogvideox_dp1_cfg1_ulysses2_ring1_tp1_pp1_patchNone_720x640.mp4
epoch time: 2.58 sec, memory: 29.172894208 GB
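The numbers above can be turned into a speedup figure directly; anything below 1.0x means the 2-GPU Ulysses run is a net slowdown:

```python
# Reproduce the comparison from the timings reported above.
t_1gpu = 2.42   # epoch time in seconds, 1 GPU
t_2gpu = 2.58   # epoch time in seconds, 2 GPUs (ulysses=2)
speedup = t_1gpu / t_2gpu
print(f"2-GPU speedup: {speedup:.2f}x")  # below 1.0x, i.e. a slowdown
```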

feifeibear changed the title from "Constraints on Parallel Settings for CogVideoX" to "Constraints on Parallel Settings for CogVideoX and Performance Issue" on Sep 13, 2024
xibosun commented Sep 29, 2024

Given the restrictions on the choice of SP degree, we developed CFG parallelism to improve inference efficiency. We also ran experiments on our final version; the updated results can be found in docs/performance/cogvideo.md, where the parallel versions show reasonable speedups.
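For context, the CFG-parallel idea can be sketched as follows (this is an illustration of the general technique, not xDiT's actual implementation): classifier-free guidance runs the denoiser twice per step, once on the conditional prompt and once on the unconditional one, so those two branches can be placed on separate ranks. This parallelizes over the CFG dimension rather than the sequence, avoiding the head_num and height divisibility constraints entirely; only the final combination step needs the two outputs together:

```python
# Standard classifier-free-guidance combination, applied element-wise.
# In a CFG-parallel run, `uncond` and `cond` would each be produced on
# a different rank and exchanged before this step; here they are plain
# lists for illustration.
def cfg_combine(uncond, cond, guidance_scale):
    return [u + guidance_scale * (c - u) for u, c in zip(uncond, cond)]

# With guidance_scale = 1.0 the result is exactly the conditional branch.
print(cfg_combine([0.0, 0.0], [1.0, 2.0], 1.0))  # -> [1.0, 2.0]
```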

@xibosun xibosun closed this as completed Sep 29, 2024