
[WIP]Add gradient checkpointing support for AutoencoderKLWan #11105

Open · victolee0 wants to merge 2 commits into base: main
Conversation

@victolee0 (Contributor)

What does this PR do?

Fixes #11071 (issue)

Before submitting

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

@a-r-r-o-w

victolee0 changed the title from "Add gradient checkpointing support for AutoencoderKLWan" to "[WIP]Add gradient checkpointing support for AutoencoderKLWan" on Mar 18, 2025
@a-r-r-o-w (Member) left a comment

Thanks, very nice!

@a-r-r-o-w (Member)

@bot /style

@victolee0 (Contributor, Author) commented Mar 19, 2025

> Thanks, very nice!

My PR fails one of the tests; could you take a look?

Failing test: test_effective_gradient_checkpointing
Test file: test_models_autoencoder_wan.py

@quickdahuk

@victolee0 I implemented it the same way you did. It works fine for the forward pass, but during the backward pass, the cache index gets mixed up and goes out of bounds. I might need to use a dictionary for the cache mechanism instead of an index-based list.
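
One way to make the cache robust to the double execution that checkpointing introduces is to key it by a stable layer name instead of a running index. A minimal hypothetical sketch (not code from this PR; the class and names are illustrative):

import torch

class NamedFeatCache:
    # Cache keyed by layer name, so lookups do not depend on call order
    # and cannot go out of bounds when the forward is re-run in backward.
    def __init__(self):
        self._store = {}

    def get(self, name):
        return self._store.get(name)

    def put(self, name, tensor):
        self._store[name] = tensor

cache = NamedFeatCache()
cache.put("decoder.block0.conv1", torch.randn(1, 4, 2, 8, 8))
print(cache.get("decoder.block0.conv1").shape)  # torch.Size([1, 4, 2, 8, 8])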

@victolee0 (Contributor, Author) commented Mar 29, 2025

> @victolee0 I implemented it the same way you did. It works fine for the forward pass, but during the backward pass, the cache index gets mixed up and goes out of bounds. I might need to use a dictionary for the cache mechanism instead of an index-based list.

@quickdahuk
When I add exception handling as shown below, I get a different error.

Code

def forward(self, x, feat_cache=None, feat_idx=[0]):
    # Apply shortcut connection
    h = self.conv_shortcut(x)

    # First normalization and activation
    x = self.norm1(x)
    x = self.nonlinearity(x)

    # exception handling
    if feat_cache is not None and len(feat_cache) < feat_idx[0]:
        idx = feat_idx[0]
        cache_x = x[:, :, -CACHE_T:, :, :].clone()
        if cache_x.shape[2] < 2 and feat_cache[idx] is not None:
            cache_x = torch.cat([feat_cache[idx][:, :, -1, :, :].unsqueeze(2).to(cache_x.device), cache_x], dim=2)

        x = self.conv1(x, feat_cache[idx])
        feat_cache[idx] = cache_x
        feat_idx[0] += 1
    else:
        x = self.conv1(x)

    # Second normalization and activation
    x = self.norm2(x)
    x = self.nonlinearity(x)

    # Dropout
    x = self.dropout(x)

    # exception handling
    if feat_cache is not None and len(feat_cache) < feat_idx[0]:
        idx = feat_idx[0]
        cache_x = x[:, :, -CACHE_T:, :, :].clone()
        if cache_x.shape[2] < 2 and feat_cache[idx] is not None:
            cache_x = torch.cat([feat_cache[idx][:, :, -1, :, :].unsqueeze(2).to(cache_x.device), cache_x], dim=2)

        x = self.conv2(x, feat_cache[idx])
        feat_cache[idx] = cache_x
        feat_idx[0] += 1
    else:
        x = self.conv2(x)

    # Add residual connection
    return x + h

test case error

E           torch.utils.checkpoint.CheckpointError: torch.utils.checkpoint: A different number of tensors was saved during the original forward and recomputation.
E           Number of tensors saved during forward: 8
E           Number of tensors saved during recomputation: 6
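
This error usually means the checkpointed function did not behave identically when it was re-run during backward: because feat_idx and feat_cache are mutated in the first forward, the recomputation sees different state, takes different branches, and autograd ends up saving a different number of tensors. A minimal illustration of that failure mode, unrelated to the PR's actual code:

import torch
from torch.utils.checkpoint import checkpoint

state = {"calls": 0}

def fn(x):
    # The branch depends on external state mutated by the first forward,
    # so the recomputation during backward follows a different code path.
    y = x * x if state["calls"] == 0 else x + 1.0
    state["calls"] += 1
    return y

x = torch.randn(3, requires_grad=True)
out = checkpoint(fn, x, use_reentrant=False)
out.sum().backward()  # raises CheckpointError: tensor counts differ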

@quickdahuk

@victolee0 I've implemented gradient checkpointing for the decoder; I don't need it for the encoder right now. Training works fine for me. I applied checkpointing per frame rather than inside the decoder's internal operations. I have the code here.
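
The linked code isn't reproduced in this thread; below is a rough sketch of what per-frame checkpointing around a decoder can look like (decode_frame, the shapes, and the chunking are assumptions, not the commenter's actual implementation):

import torch
from torch.utils.checkpoint import checkpoint

def decode_with_checkpointing(decode_frame, z):
    # z: latent video of shape (batch, channels, frames, height, width).
    # Each temporal slice is decoded under checkpoint, so its activations
    # are recomputed during backward instead of being stored.
    frames = []
    for t in range(z.shape[2]):
        chunk = z[:, :, t : t + 1]
        frames.append(checkpoint(decode_frame, chunk, use_reentrant=False))
    return torch.cat(frames, dim=2)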

@victolee0 (Contributor, Author)

@quickdahuk
After copying in the code you provided, I still get an error on the same test. The error message:

a = tensor([-4.6806e-05, -2.9469e-05,  1.2341e-04]), b = tensor([-3.7164e-06, -4.1561e-06,  9.7381e-06]), args = (), kwargs = {'atol': 5e-05}

    def torch_all_close(a, b, *args, **kwargs):
        if not is_torch_available():
            raise ValueError("PyTorch needs to be installed to use this function.")
        if not torch.allclose(a, b, *args, **kwargs):
>           assert False, f"Max diff is absolute {(a - b).abs().max()}. Diff tensor is {(a - b).abs()}."
E           AssertionError: Max diff is absolute 0.00011366714170435444. Diff tensor is tensor([4.3089e-05, 2.5313e-05, 1.1367e-04]).

src/diffusers/utils/testing_utils.py:111: AssertionError

@quickdahuk

@victolee0 The calculated gradients differ slightly (max diff <= 1.1367e-04). The difference may not be significant, but a much smaller one would be better. @a-r-r-o-w, do you see anything suspicious in our implementation?

@quickdahuk

@victolee0 I just found that with use_reentrant=False the gradients match perfectly. I've updated the code.
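
For reference, a self-contained sketch of the non-reentrant variant (the block and input here are placeholders):

import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

# use_reentrant=False recomputes activations via saved-tensor hooks and is
# the variant the PyTorch docs now recommend over the legacy reentrant one.
block = nn.Sequential(nn.Linear(8, 8), nn.GELU())
x = torch.randn(2, 8, requires_grad=True)
out = checkpoint(block, x, use_reentrant=False)
out.sum().backward()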

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@a-r-r-o-w (Member)

The differing result is definitely suspicious. I would actually prefer a refactor of the VAE so that we don't have to work with cache indexing the way it's done here, and instead have it behave similarly to what's done in CogVideoX and Mochi. If that's not possible without indexing, we could consider removing the cache completely.

From my past benchmarks with the CogVideoX VAE, the speed difference between using the cache and not using it is minimal. Removing it, given how complicated it is to support and how little time it probably saves here, is a tradeoff we could make. cc @hlky @yiyixuxu Would you be able to take a look?
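
For context, the CogVideoX-style pattern mentioned above threads a dictionary of per-layer caches through the forward and returns an updated copy instead of relying on a shared index counter. A loose sketch (names and shapes are illustrative, not the actual diffusers code; a real causal conv would consume the prepended frames as temporal padding):

import torch
import torch.nn as nn

class CachedBlock(nn.Module):
    # Toy block that keeps the trailing frames of its input for the next chunk.
    def __init__(self, channels, cache_t=2):
        super().__init__()
        self.conv = nn.Conv3d(channels, channels, kernel_size=1)
        self.cache_t = cache_t

    def forward(self, x, conv_cache=None, name="block"):
        conv_cache = conv_cache or {}
        new_cache = dict(conv_cache)
        if conv_cache.get(name) is not None:
            # Prepend the frames cached from the previous chunk.
            x = torch.cat([conv_cache[name].to(x.device), x], dim=2)
        new_cache[name] = x[:, :, -self.cache_t:].clone()
        return self.conv(x), new_cache

block = CachedBlock(channels=4)
z = torch.randn(1, 4, 3, 8, 8)
out, cache = block(z)                     # first chunk: nothing cached yet
out, cache = block(z, conv_cache=cache)   # next chunk reuses cached frames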

Linked issue: AutoencoderKLWan - support gradient_checkpointing (#11071)