Conversation

@StrongerXi
Contributor

This adds a benchmark for the Flux image generation pipeline. Specifically, it only benchmarks the diffusion transformer (and omits the text encoder and VAE, which account for little of the e2e generation time in Flux).

Needs pytorch/pytorch#168176 to run in the pytorch repo:
```
python ./benchmarks/dynamo/torchbench.py --accuracy --inference --backend=inductor --only flux
python ./benchmarks/dynamo/torchbench.py --performance --inference --backend=inductor --only flux
```
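For reference, here is a minimal hand-rolled sketch of roughly what gets measured. It is not the actual TorchBench wrapper (which follows the usual `BenchmarkModel` conventions); it assumes diffusers' `FluxPipeline`, access to the gated FLUX.1-dev weights, and a CUDA device, and it approximates "transformer only" by compiling just the transformer and timing the denoising loop:
```python
# Minimal sketch, not the TorchBench model code: compile only the diffusion
# transformer and time image generation. The text encoders and VAE stay in
# eager mode since they contribute little to e2e time.
import torch
from diffusers import FluxPipeline  # assumes a recent diffusers release

pipe = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev", torch_dtype=torch.bfloat16
).to("cuda")
pipe.transformer = torch.compile(pipe.transformer, mode="max-autotune")

def time_ms(prompt: str) -> float:
    torch.cuda.synchronize()
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    pipe(prompt, num_inference_steps=28, height=1024, width=1024)
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end)

time_ms("a photo of an astronaut riding a horse")  # warm-up / compilation
print(f"{time_ms('a photo of an astronaut riding a horse'):.1f} ms per image")
```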
@StrongerXi
Contributor Author

cc @sayakpaul: we are trying to add some diffusers models to our nightly benchmark, and this is a first attempt using Flux. Would love to get your thoughts on:

  1. What models to add. I'm thinking of focusing on categories first (e.g., Flux for txt2img, Qwen Edit for txt-img2img, Wan for txt2video, ...), then adding more popular models to improve coverage.
  2. Whether there are incentives to consolidate this and the diffusers benchmarks. At a high level, I think the pytorch benchmark will enable early detection of pytorch changes that break compile + diffusers (helping nightly users and avoiding bug congestion during release), and the diffusers benchmark will do the same, but for diffusers changes and users. If the benchmark infra are sufficiently different, maybe it's fine to have a bit of code duplication.
  3. What exactly to benchmark -- right now we are just benchmarking the denoising transformer. Are there incentives to benchmark the txt/img encoders and VAE as well, say if the number of steps is low for distilled model variants? There are also variables like image size, which actually affects the torch.compile speedup ratio, since larger sizes make it more compute bound e2e.

Also cc @anijain2305 @BoyuanFeng.

@@ -0,0 +1,72 @@
import torch
from torchbenchmark.tasks import COMPUTER_VISION
@@ -0,0 +1,10 @@
devices:
NVIDIA A100-SXM4-40GB:
Contributor Author

Not sure why it's all A100-40GB in this repo.

I got about a 1.3x inference speedup on an A100 80GB.

Contributor

@huydhn Nov 19, 2025

For more context, A100 40G was the SKU we had from AWS at the beginning, when A100 was still new. Nowadays the TorchInductor benchmark is run on H100, so there is no incentive to migrate A100 to the 80GB version on CI anymore.

Contributor Author

So are these annotations really used anywhere.....?

Contributor

Strictly speaking, https://github.com/pytorch/pytorch/actions/workflows/inductor-perf-test-nightly.yml is still running and failing without anyone caring. I think I should submit a PR to stop it, but that's the compiler team's decision.

@StrongerXi
Contributor Author

Got this from CI; might need some help from devinfra.

```
Access to model black-forest-labs/FLUX.1-dev is restricted and you are not in the authorized list. Visit https://huggingface.co/black-forest-labs/FLUX.1-dev to ask for access.
```

@huydhn
Contributor

huydhn commented Nov 20, 2025

> Access to model black-forest-labs/FLUX.1-dev is restricted and you are not in the authorized list. Visit https://huggingface.co/black-forest-labs/FLUX.1-dev to ask for access.

This issue has been fixed. Another one shows up though (pytorch/pytorch#167895), so let me skip the installation of stabilityai/stable-diffusion-2 on TorchBench for now.

@huydhn
Contributor

huydhn commented Nov 20, 2025

@StrongerXi Let's see if #2656 helps

@sayakpaul

> What models to add. I'm thinking of focusing on categories first (e.g., Flux for txt2img, Qwen Edit for txt-img2img, Wan for txt2video, ...), then adding more popular models to improve coverage.

Yeah, this list is perfect! Many models in the image and video gen space just come with a lot of promise and vanish quickly. Only a few stick around. The ones you mentioned are the most promising ones and have somewhat stood the "test of time". I would also add SDXL to this mix.

> Whether there are incentives to consolidate this and the diffusers benchmarks. At a high level, I think the pytorch benchmark will enable early detection of pytorch changes that break compile + diffusers (helping nightly users and avoiding bug congestion during release), and the diffusers benchmark will do the same, but for diffusers changes and users. If the benchmark infra are sufficiently different, maybe it's fine to have a bit of code duplication.

I am fine with that, I think that's a great idea! We could always test against the nightlies (both PyTorch and Diffusers) to ensure no performance regression. This (reference on AWS) is the infra on which our benchmark is run.

> What exactly to benchmark -- right now we are just benchmarking the denoising transformer. Are there incentives to benchmark the txt/img encoders and VAE as well, say if the number of steps is low for distilled model variants? There are also variables like image size, which actually affects the torch.compile speedup ratio, since larger sizes make it more compute bound e2e.

I don't think there are any incentives to benchmark the other components, TBH, as any gain in the denoiser is just a compounding factor in most cases. I think benchmarking for different batch sizes and for 2/3 different resolutions is a good idea. Also, there are multiple use cases beyond text-to-{image,video,audio} such as image-guided editing, etc. But I think it's fine to just keep it at the text-to-X level as it's easy to extrapolate to the other use cases.
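For illustration, a sweep along those two axes could look roughly like the sketch below; the model name, step count, and shape grid are placeholders rather than the actual benchmark configuration:
```python
# Rough sketch of a batch-size / resolution sweep; values are illustrative.
import itertools
import torch
from diffusers import FluxPipeline

pipe = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev", torch_dtype=torch.bfloat16
).to("cuda")
pipe.transformer = torch.compile(pipe.transformer)

def time_ms(fn) -> float:
    torch.cuda.synchronize()
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    fn()
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end)

prompt = "a photo of an astronaut riding a horse"
for batch, (h, w) in itertools.product([1, 4], [(512, 512), (1024, 1024)]):
    run = lambda: pipe([prompt] * batch, num_inference_steps=28, height=h, width=w)
    time_ms(run)  # warm-up; new shapes may also trigger recompilation
    print(f"batch={batch}, {h}x{w}: {time_ms(run):.1f} ms")
```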

Just as a note, in our benchmark, we also check for quantization backends and memory features like layerwise casting. I understand that this deviates from a core PyTorch benchmark, but these features are quite popular in the community, hence we benchmark them. Nothing to fret about, just noting.
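As a rough illustration of the memory feature mentioned above (assuming a recent diffusers release that exposes `enable_layerwise_casting`; this is not part of the PyTorch-side benchmark):
```python
# Sketch of layerwise casting on the Flux transformer: weights are stored in
# fp8 and upcast to bf16 per layer at compute time, cutting weight memory at a
# small speed cost. Assumes a recent diffusers release with this API.
import torch
from diffusers import FluxPipeline

pipe = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev", torch_dtype=torch.bfloat16
).to("cuda")
pipe.transformer.enable_layerwise_casting(
    storage_dtype=torch.float8_e4m3fn, compute_dtype=torch.bfloat16
)

image = pipe("a cat wearing sunglasses", num_inference_steps=28).images[0]
image.save("cat.png")
```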
