
[rfc][compile] compile method for DiffusionPipeline #11705


Merged: 9 commits into huggingface:main on Jun 26, 2025

Conversation

@anijain2305 (Contributor) commented Jun 13, 2025

This PR adds a compile method to the DiffusionPipeline to make it really easy for diffusers users to apply torch.compile.

There are a couple of reasons why we want a deeper integration of compile with diffusion models:

  • Regional compile (instead of full-model compile) is a better fit for diffusion models, and it's not easy for the user to apply regional compile precisely.
  • A pipeline has many components, and many of them are not suitable for compile. Since we (or a new model author) know the architecture of the models, we can leverage that to apply compile to the most relevant portions of the pipeline.

Update - Jun 20

We went with a new Option 4 - adding a new method to ModelMixin, compile_repeated_blocks - to avoid any confusion and to get a sense of its adoption from the community. If we see positive feedback, we can move to a deeper integration like Option 1. Currently, there is no data to justify an aggressive approach like Option 1.
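
For illustration, usage might look like the following minimal sketch (the checkpoint and the forwarded kwargs are placeholders, not a prescription):

```python
import torch
from diffusers import DiffusionPipeline

pipe = DiffusionPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev", torch_dtype=torch.bfloat16
).to("cuda")

# Compile only the repeated transformer blocks instead of the full model;
# torch.compile kwargs are forwarded to each block's compilation.
pipe.transformer.compile_repeated_blocks(fullgraph=True)
```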


Option 1 - Pipeline has a compile method

From the user's standpoint, they only have to add this line:

pipe.compile(**compile_kwargs)

This will, by default, find the regions in the transformer model (more on how to find regions later) and apply regional compile to those regions. We can use this to give the user more control - e.g., if they want to apply compile to the pre- and post-transformer parts of the pipeline.

Pro

  • Easiest for the user

Cons

  • Might be considered a big change, and we might have to phase it out later.

Option 2 - ModelMixin overrides nn.Module compile method (this PR does Option 2)

The user does something like this:

pipe.transformer.compile(use_regional_compile=True/False, **kwargs)

This overrides nn.Module's compile and provides options for regional compile. This is one level deeper in the integration, so the user will have to know that the pipe has an attribute named transformer. If use_regional_compile is True, we will find the regions (more on that later) and apply torch.compile to them.

Pros

  • The ModelMixin-derived transformer is the most time-consuming part of the pipeline, so the abstraction seems right.

Cons

  • Users might already be using pipe.transformer.compile to enable full-model compilation. I DO want regional compile as the default (supported by the numbers below). But when users upgrade to a new diffusers version, they will see regional compile if we make it the default, and might be confused. Turning regional compile off by default solves this problem, but it is sub-par UX because of the high compile time without any real benefit.
  • Though minor, it's less obvious than pipe.compile.

Option 3 - Combine option 1 and option 2

This could be the best option. We have both Option 1 and Option 2. In Option 2, we keep regional compile OFF by default. But Option 1 - pipe.compile - has regional compile on by default and propagates it to pipe.transformer.compile. This way, the OOTB solution would be pipe.compile, and any more enthusiastic user can play with pipe.transformer.compile if they wish.

How to find regions?

Here, we can make a simple addition to the existing models and provide guidance for new ones. Most diffusion models are basically a sequence of Transformer blocks, and these blocks are obvious from the model's __init__ method. We can add a new class attribute to each model - _regions_for_compile - which points to the classes of those Transformer blocks. This is very precise and easy. Relatedly, I also found another field - _no_split_modules - which incidentally serves the same purpose. In the absence of the _regions_for_compile attribute, we can rely on _no_split_modules.
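
As a rough sketch of how such a helper could work (the helper name and exact behavior are illustrative, not the final API):

```python
import warnings

import torch
import torch.nn as nn


def compile_regions(model: nn.Module, **compile_kwargs) -> nn.Module:
    # Look up the block classes the model author marked for regional
    # compile, falling back to _no_split_modules as discussed above.
    targets = getattr(model, "_regions_for_compile", None) or getattr(
        model, "_no_split_modules", None
    )
    if not targets:
        # No regions declared: fall back to full-model compilation with a warning.
        warnings.warn("No compile regions declared; compiling the full model.")
        return torch.compile(model, **compile_kwargs)
    for submodule in model.modules():
        if submodule.__class__.__name__ in targets:
            # nn.Module.compile() compiles the module's forward in place.
            submodule.compile(**compile_kwargs)
    return model
```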

Why regional compile as default?

Regional compilation gives as good a speedup as full-model compilation for diffusion models, with 8x-10x smaller compilation time. More data supporting regional compilation as the default is below.

[Benchmark image: full-model vs. regional compilation times]

One experiment that I am missing is cudagraphs. I am not sure yet whether enabling cudagraphs makes full-model compile better than regional. I will rerun the experiments today with cudagraphs.

cc @sayakpaul

@sayakpaul (Member) left a comment

Thanks for starting this! LMK what you think of the comments.

Comment on lines 2032 to 2033
compile_regions_for_transformer: bool = True,
transformer_module_name: str = "transformer",
These two will have to change if we proceed to have this inside ModelMixin.

@sayakpaul requested review from DN6 and yiyixuxu on June 13, 2025
@yiyixuxu (Collaborator)

Thanks for the PR!
How do we determine the default compile_region_classes for each model?

@anijain2305 (Contributor, Author)

Thanks for the PR! How do we determine the default compile_region_classes for each model?

This is something that we will have to do manually; I deliberately made that choice. It should be fairly easy to find those classes, basically the transformer classes. In the absence of compile_region_classes, we can fall back to full-model compilation with a warning.

I think this is a fair compromise because it gives control to the model author, and with a little guidance it should be very easy.

@yiyixuxu (Collaborator) commented Jun 15, 2025

It should be fairly easy to find those classes, basically the transformer classes

Is it because they are repeated?

Can we add an enable_regional_compile() on ModelMixin, since this is the main purpose of it?

@sayakpaul (Member)

Can we add an enable_regional_compile() on ModelMixin, since this is the main purpose of it?

@yiyixuxu this is a good suggestion. But I wanted to discuss the developer experience a bit.

Our initial plan (@anijain2305 feel free to correct me) was to override compile() in ModelMixin. This overridden compile() method will have all the args and kwargs taken by torch's original compile() method (note that any nn.Module now comes with compile()). But it will additionally take an argument to enable regional compilation.

So, it would be something like:

model = FluxTransformer2DModel()
# `regional_compilation` is just an arg; the name can be changed/decided later.
model.compile(regional_compilation=True, ...)

With enable_regional_compile(), would the developer experience look like this?

model = FluxTransformer2DModel()
model.enable_regional_compile()
model.compile(...)

@anijain2305 (Contributor, Author)

@sayakpaul @yiyixuxu I updated the PR and also the original description with some discussion points (pros and cons). Let me know what you think.

@StrongerXi

But when users upgrade to a new diffusers version, they will see regional compile if we make it the default, and might be confused.

I thought the only thing users will see here is less compilation time, if they already have pipeline.transformer.compile(...)?

@yiyixuxu (Collaborator)

I see, maybe enable_ is indeed a bit confusing; maybe just model.regional_compile() then?

@yiyixuxu (Collaborator) commented Jun 16, 2025

One thing I want to understand is whether this pattern (regional compile) is only beneficial for diffusers models.
Should and will this be supported in PyTorch compile in the future (or maybe now)?

@anijain2305 (Contributor, Author) commented Jun 16, 2025

One thing I want to understand is whether this pattern (regional compile) is only beneficial for diffusers models. Should and will this be supported in PyTorch compile in the future (or maybe now)?

Regional compile is very useful for diffusion models because the transformer blocks are big enough that compiler optimizations inside a block yield the major speedup, and there is very little to be gained from optimizations across the transformer blocks.

Regional compile is not a feature per se; it's more like saying "compile responsibly" - there is no specific support required from the PyTorch side to enable it. For torch.compile, it's just another nn.Module (or function) that you are compiling. As users, we are compiling just small repeated regions to keep the compilation cost low.
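
As a self-contained illustration of that point - using a generic PyTorch block, not a diffusers model - regional compile is just torch.compile applied to the repeated sub-modules:

```python
import torch.nn as nn

# A stand-in for the repeated transformer blocks of a diffusion model.
blocks = nn.ModuleList(
    [nn.TransformerEncoderLayer(d_model=512, nhead=8) for _ in range(24)]
)

# Compile each block individually. Because every block runs the same
# forward code, the compiler can reuse the compiled artifact after the
# first block, so compile time scales with one block, not the whole stack.
for block in blocks:
    block.compile()
```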

For LLMs or other non-diffusion models, it is possible that you do want cross-transformer-layer optimizations to see compiler benefits (maybe the presence of a static KV cache changes the space of optimizations).

For diffusion models, we can do more testing to convince ourselves. One thing that I do want to test is cudagraphs.

I thought the only thing users will see here is less compilation time, if they already have pipeline.transformer.compile(...)?

@StrongerXi That will be true in almost all cases, but if a user specifically wants to compile the FULL model for some reason, they will be confused.

@anijain2305 (Contributor, Author)

I see, maybe enable_ is indeed a bit confusing; maybe just model.regional_compile() then?

If we go with Option 3 (which is Option 1 + Option 2), we might not need model.regional_compile. In Option 3, ModelMixin has a compile method that takes the additional regional-compile args, all defaulting to None. But the pipeline's compile method uses regional compile by default and sets up the regional kwargs while calling pipeline.transformer.compile. Let me know if you need to see the code.
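
For reference, a hypothetical sketch of that plumbing (all names and signatures are illustrative, not the final API):

```python
import torch.nn as nn


class ModelMixin(nn.Module):
    # Hypothetical attribute: class names of the repeated blocks.
    _regions_for_compile: list = []

    def compile(self, regional_compile=None, **kwargs):
        # Model-level default is None, i.e. plain full-model torch.compile.
        if regional_compile:
            for m in self.modules():
                if m.__class__.__name__ in self._regions_for_compile:
                    m.compile(**kwargs)
        else:
            super().compile(**kwargs)


class DiffusionPipeline:
    transformer: ModelMixin

    def compile(self, **kwargs):
        # The pipeline entry point defaults regional compile to ON and
        # propagates it to the transformer component.
        self.transformer.compile(regional_compile=True, **kwargs)
```

This keeps the model-level default conservative while making the out-of-the-box pipeline path regional.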

@yiyixuxu (Collaborator)

Regional compile is not a feature per se; it's more like saying "compile responsibly" - there is no specific support required from the PyTorch side to enable it. For torch.compile, it's just another nn.Module (or function) that you are compiling. As users, we are compiling just small repeated regions to keep the compilation cost low.

Thanks for the explanation! As I understand it, this works great in diffusers but is not unique to it, i.e. it can potentially benefit many non-diffusers models. I think it would not be good practice for different libraries to wrap the compile API in different ways, so I would like to upstream it if possible.

If it is not possible or does not make sense to add it to PyTorch, I would prefer to quickly support it with a separate method instead of wrapping the compile API - it is one less layer and less complex, and also easier for users to discover and use.

@yiyixuxu (Collaborator)

Regarding pipeline.compile(), I agree with @sayakpaul: I think we should focus on the model level first and eventually come to pipeline.compile() if it is meaningful.

@anijain2305 (Contributor, Author) commented Jun 17, 2025

@yiyixuxu I hear you. There is not really a way to upstream this, as it is completely model-dependent. In practice, library/model authors are the best placed to determine where to apply torch.compile (in diffusion models, it turns out to be simple).

I am actually OK with anything that makes it easy to give end users the best OOTB torch.compile experience. With that in mind, I am OK with a regional_compile API in ModelMixin, supplemented with good documentation.

Cc @sayakpaul to hear his thoughts.

@sayakpaul (Member)

@anijain2305 let's go with #11705 (comment)? Thanks for being willing to iterate. I will help with tests and documentation as well to see this through with you.

I think Yiyi's observations are quite aligned with what we see in the community of our users.

I would also be keen on seeing the results with CUDA graphs now that @a-r-r-o-w has made a nice observation using them (on Slack).

@DN6 (Collaborator) commented Jun 17, 2025

I'm with @yiyixuxu on the dedicated method for the initial version. If we go with ModelMixin.compile with the regional args turned off by default, then we're effectively the same as native torch compile. If we enable regional compile by default, then it might be confusing for users working with non-DiT models - e.g., would pipe.unet.compile see similar benefits?

We definitely want this feature @anijain2305 (thank you for taking the initiative here 💪🏽), and since the end-user API is still a bit fuzzy, having a dedicated method feels like the way to go IMO. We can always refactor later based on adoption/feedback from the community.

@anijain2305 (Contributor, Author)

I have not forgotten about this. Just been swamped by the impending branch cut for the PyTorch 2.8 release.

I agree with all of you, and I will add a new method to ModelMixin that does regional compile but does not override the compile method.

@anijain2305 force-pushed the compile_utils branch 4 times, most recently from 2a92c01 to 7d058d2 on June 20, 2025
@anijain2305 (Contributor, Author)

@yiyixuxu @sayakpaul @DN6 This is ready for the next round of review.

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@sayakpaul (Member) left a comment

This is off to a really good start. I don't have any major comments to share here.

For the TODOs, we have docs to add as well as some tests. For docs, we could add a section in https://huggingface.co/docs/diffusers/en/optimization/torch2.0#torchcompile. For tests, we could add them here.

@anijain2305 (Contributor, Author)

The tests are in.

@yiyixuxu (Collaborator) left a comment

Thanks @anijain2305!
Looking great - let's get this merged soon :)

I left some small comments. Mainly, I think we don't need a fallback at all and can just throw an error when _repeated_blocks is empty and someone tries to use it.

This simplifies the code and would encourage users to send a PR our way to set the attribute when it's not set.
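
Something along these lines, as a hypothetical sketch of that guard (message text is illustrative):

```python
import torch.nn as nn


class ModelMixin(nn.Module):
    # Empty by default; model classes override it with their block names.
    _repeated_blocks: list = []

    def compile_repeated_blocks(self, *args, **kwargs):
        if not self._repeated_blocks:
            # Error out instead of silently falling back, so users are
            # nudged to send a PR that sets the attribute.
            raise ValueError(
                "`_repeated_blocks` is empty. Set this class attribute to "
                "the model's repeated block class names first."
            )
```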

@anijain2305 force-pushed the compile_utils branch 2 times, most recently from 058cbc3 to 83ba712 on June 24, 2025
@sayakpaul (Member) left a comment

Excited to see this shipped!

Let's add the docs - we have great comments from @yiyixuxu as well!

can reduce end-to-end compile time substantially, while preserving the
runtime speed-ups you would expect from a full `torch.compile`.

The set of sub-modules to compile is discovered by the presence of

We could also include the doc link that we're going to add in this PR.


with (
torch._inductor.utils.fresh_inductor_cache(),
torch._dynamo.config.patch(recompile_limit=recompile_limit),
Neat!

Co-authored-by: Sayak Paul <spsayakpaul@gmail.com>
@sayakpaul (Member)

@bot /style


Style fixes have been applied. View the workflow run here.

@sayakpaul requested a review from yiyixuxu on June 24, 2025
@anijain2305 (Contributor, Author)

@yiyixuxu Just a gentle ping for another review.

@yiyixuxu (Collaborator) left a comment

thank you for adding this feature!

@sayakpaul (Member)

Will merge once this round of CI is through.

@sayakpaul (Member)

The failure is probably due to #11759 (comment). So, will merge accordingly.

@sayakpaul merged commit d93381c into huggingface:main on Jun 26, 2025
28 of 29 checks passed
@sayakpaul (Member)

@anijain2305 huge thanks for this feature!

@sayakpaul added the performance and torch.compile labels on Jul 2, 2025