
Constraints on Parallel Settings for CogVideoX and Performance Issue #265

Closed
feifeibear opened this issue Sep 13, 2024 · 2 comments

feifeibear commented Sep 13, 2024

xDiT currently implements the sequence parallel (SP) version of CogVideoX. However, there are restrictions when using it:

  1. head_num (30 for CogVideoX) must satisfy head_num % ulysses_degree == 0.
  2. height must satisfy height % sp_degree == 0.
  3. With --height 640 --width 720 and sp_degree = 8 (ulysses = 2, ring = 4), the VAE decoder throws an error:
[rank7]:   File "/cfs/fjr2/xDiT/xfuser/model_executor/pipelines/pipeline_cogvideox.py", line 372, in __call__
[rank7]:     video = self.decode_latents(latents)
[rank7]:   File "/home/pjz/miniconda3/envs/fjr/lib/python3.10/site-packages/diffusers/pipelines/cogvideo/pipeline_cogvideox.py", line 360, in decode_latents
[rank7]:     frames = self.vae.decode(latents).sample
[rank7]:   File "/home/pjz/miniconda3/envs/fjr/lib/python3.10/site-packages/diffusers/utils/accelerate_utils.py", line 46, in wrapper
[rank7]:     return method(self, *args, **kwargs)
[rank7]:   File "/home/pjz/miniconda3/envs/fjr/lib/python3.10/site-packages/diffusers/models/autoencoders/autoencoder_kl_cogvideox.py", line 1153, in decode
[rank7]:     decoded = self._decode(z).sample
[rank7]:   File "/home/pjz/miniconda3/envs/fjr/lib/python3.10/site-packages/diffusers/models/autoencoders/autoencoder_kl_cogvideox.py", line 1123, in _decode
[rank7]:     z_intermediate = self.decoder(z_intermediate)
[rank7]:   File "/home/pjz/miniconda3/envs/fjr/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
[rank7]:     return self._call_impl(*args, **kwargs)
[rank7]:   File "/home/pjz/miniconda3/envs/fjr/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
[rank7]:     return forward_call(*args, **kwargs)
[rank7]:   File "/home/pjz/miniconda3/envs/fjr/lib/python3.10/site-packages/diffusers/models/autoencoders/autoencoder_kl_cogvideox.py", line 851, in forward
[rank7]:     hidden_states = self.conv_in(sample)
[rank7]:   File "/home/pjz/miniconda3/envs/fjr/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
[rank7]:     return self._call_impl(*args, **kwargs)
[rank7]:   File "/home/pjz/miniconda3/envs/fjr/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
[rank7]:     return forward_call(*args, **kwargs)
[rank7]:   File "/home/pjz/miniconda3/envs/fjr/lib/python3.10/site-packages/diffusers/models/autoencoders/autoencoder_kl_cogvideox.py", line 144, in forward
[rank7]:     output = self.conv(inputs)
[rank7]:   File "/home/pjz/miniconda3/envs/fjr/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
[rank7]:     return self._call_impl(*args, **kwargs)
[rank7]:   File "/home/pjz/miniconda3/envs/fjr/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
[rank7]:     return forward_call(*args, **kwargs)
[rank7]:   File "/home/pjz/miniconda3/envs/fjr/lib/python3.10/site-packages/diffusers/models/autoencoders/autoencoder_kl_cogvideox.py", line 64, in forward
[rank7]:     return super().forward(input)
[rank7]:   File "/home/pjz/miniconda3/envs/fjr/lib/python3.10/site-packages/torch/nn/modules/conv.py", line 608, in forward
[rank7]:     return self._conv_forward(input, self.weight, self.bias)
[rank7]:   File "/home/pjz/miniconda3/envs/fjr/lib/python3.10/site-packages/torch/nn/modules/conv.py", line 603, in _conv_forward
[rank7]:     return F.conv3d(
[rank7]: RuntimeError: Given groups=1, weight of size [512, 16, 3, 3, 3], expected input[1, 10, 5, 82, 92] to have 16 channels, but got 10 channels instead
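The first two constraints are simple divisibility checks that could be validated before launch. A minimal pre-flight sketch (the function name and signature are hypothetical, not part of the xDiT API) might look like this; note that the failing configuration above actually passes both checks (640 % 8 == 0), so the 16-vs-10-channel mismatch in the VAE comes from how the latents are split, not from these surface constraints:

```python
# Hypothetical pre-flight check for the two divisibility constraints above.
# head_num, ulysses_degree, ring_degree, and height are taken from the issue;
# the function itself is illustrative, not xDiT code.
def validate_sp_config(head_num: int, ulysses_degree: int,
                       ring_degree: int, height: int) -> None:
    sp_degree = ulysses_degree * ring_degree
    if head_num % ulysses_degree != 0:
        raise ValueError(
            f"head_num ({head_num}) must be divisible by "
            f"ulysses_degree ({ulysses_degree})")
    if height % sp_degree != 0:
        raise ValueError(
            f"height ({height}) must be divisible by "
            f"sp_degree ({sp_degree})")

# CogVideoX has 30 heads; ulysses=2, ring=4 gives sp_degree=8 and 640 % 8 == 0,
# so this configuration passes the check yet still crashes in the VAE decode.
validate_sp_config(30, 2, 4, 640)
```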
feifeibear commented:

In addition to the parallel-degree constraints, the SP implementation has performance issues: on an L40 machine, two GPUs are slower than one.

1 GPU
output saved to results/cogvideox_dp1_cfg1_ulysses1_ring1_tp1_pp1_patchNone_720x640.mp4
epoch time: 2.42 sec, memory: 28.733202944 GB

2 GPU
output saved to results/cogvideox_dp1_cfg1_ulysses2_ring1_tp1_pp1_patchNone_720x640.mp4
epoch time: 2.58 sec, memory: 29.172894208 GB
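The numbers above can be turned into a speedup figure directly; anything below 1.0x means the 2-GPU Ulysses run is a net slowdown:

```python
# Reproduce the comparison from the timings reported above.
t_1gpu = 2.42   # epoch time in seconds, 1 GPU
t_2gpu = 2.58   # epoch time in seconds, 2 GPUs (ulysses=2)
speedup = t_1gpu / t_2gpu
print(f"2-GPU speedup: {speedup:.2f}x")  # below 1.0x, i.e. a slowdown
```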

feifeibear changed the title from "Constraints on Parallel Settings for CogVideoX" to "Constraints on Parallel Settings for CogVideoX and Performance Issue" on Sep 13, 2024
xibosun commented Sep 29, 2024

Given the restrictions on the choice of SP degree, we developed CFG parallelism to improve inference efficiency. We also ran experiments on our final version; the updated results can be found in docs/performance/cogvideo.md, where the parallel versions show reasonable speedups.
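For context, the CFG-parallel idea can be sketched as follows (this is an illustration of the general technique, not xDiT's actual implementation): classifier-free guidance runs the denoiser twice per step, once on the conditional prompt and once on the unconditional one, so those two branches can be placed on separate ranks. This parallelizes over the CFG dimension rather than the sequence, avoiding the head_num and height divisibility constraints entirely; only the final combination step needs the two outputs together:

```python
# Standard classifier-free-guidance combination, applied element-wise.
# In a CFG-parallel run, `uncond` and `cond` would each be produced on
# a different rank and exchanged before this step; here they are plain
# lists for illustration.
def cfg_combine(uncond, cond, guidance_scale):
    return [u + guidance_scale * (c - u) for u, c in zip(uncond, cond)]

# With guidance_scale = 1.0 the result is exactly the conditional branch.
print(cfg_combine([0.0, 0.0], [1.0, 2.0], 1.0))  # -> [1.0, 2.0]
```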

@xibosun xibosun closed this as completed Sep 29, 2024