Add Flux benchmark #2654
base: main

Conversation
This adds a benchmark for the Flux image generation pipeline. Specifically, it only benchmarks the diffusion transformer (and omits the text encoder and VAE, which don't take up much time for the e2e generation in Flux). Needs pytorch/pytorch#168176 to run in the pytorch repo:

```
python ./benchmarks/dynamo/torchbench.py --accuracy --inference --backend=inductor --only flux
python ./benchmarks/dynamo/torchbench.py --performance --inference --backend=inductor --only flux
```
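For readers unfamiliar with the setup, here is a minimal sketch of what "benchmark only the diffusion transformer" means in diffusers terms. This is not the PR's code; the checkpoint name and step count are illustrative assumptions.

```python
# Minimal sketch (not the PR's code): compile only the Flux diffusion
# transformer and exercise it through the pipeline. The checkpoint name
# and step count are illustrative assumptions.
import torch
from diffusers import FluxPipeline

pipe = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev", torch_dtype=torch.bfloat16
).to("cuda")

# The text encoders and VAE run eagerly; only the transformer, which
# dominates e2e latency in Flux, is the compile/benchmark target.
pipe.transformer = torch.compile(pipe.transformer, backend="inductor")

image = pipe("a photo of an astronaut", num_inference_steps=28).images[0]
```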
cc @sayakpaul we are trying to add some diffusers models to our nightly benchmark; this represents a first attempt using Flux. Would love to get your thoughts on this.
Also cc @anijain2305 @BoyuanFeng.
@@ -0,0 +1,72 @@
import torch
from torchbenchmark.tasks import COMPUTER_VISION
These 3 files are mostly an adaptation of https://github.com/pytorch/benchmark/tree/f0b2a09591de5cf276bbfa7ce06a3d35f508da84/torchbenchmark/models/stable_diffusion_unet for Flux.
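As a rough illustration of that adaptation, a hypothetical skeleton is below. The base class, module path, and checkpoint name follow the linked stable_diffusion_unet layout and are assumptions, not the PR's exact code.

```python
# Hypothetical skeleton of torchbenchmark/models/flux/__init__.py,
# assuming it mirrors the stable_diffusion_unet layout linked above.
from torchbenchmark.tasks import COMPUTER_VISION
from torchbenchmark.util.framework.diffusers.model_factory import DiffuserModel  # assumed path


class Model(DiffuserModel):
    task = COMPUTER_VISION.GENERATION
    DEFAULT_EVAL_BSIZE = 1

    def __init__(self, test, device, batch_size=None, extra_args=[]):
        # "black-forest-labs/FLUX.1-dev" is an assumed checkpoint name.
        super().__init__(
            name="black-forest-labs/FLUX.1-dev",
            test=test,
            device=device,
            batch_size=batch_size,
            extra_args=extra_args,
        )
```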
@@ -0,0 +1,10 @@
devices:
  NVIDIA A100-SXM4-40GB:
Not sure why it's all A100-40GB in this repo. I got about a 1.3x inference speedup on an A100 80G.
For more context, A100 40G was the SKU we had from AWS at the beginning, when A100 was still new. Nowadays the TorchInductor benchmark runs on H100, so there is no incentive to migrate the A100 CI runners to the 80GB version anymore.
So are these annotations really used anywhere...?
Strictly speaking, https://github.com/pytorch/pytorch/actions/workflows/inductor-perf-test-nightly.yml is still running and failing without anyone caring. I think I should submit a PR to stop it, but that's the compiler team's decision.
Got this from CI; might need some help from devinfra.
This issue has been fixed. Another one shows up though (pytorch/pytorch#167895), so let me skip the installation of
@StrongerXi Let's see if #2656 helps
Yeah, this list is perfect! Many models in the image and video gen space just come with a lot of promise and vanish quickly. Only a few stick around. The ones you mentioned are the most promising ones and have somewhat stood the "test of time". I would also add SDXL to this mix.
I am fine with that; I think that's a great idea! We could always test against the nightlies (both PyTorch and Diffusers) to ensure no performance regression. This (a reference on AWS) is the infra on which our benchmark runs.
I don't think there are any incentives to benchmark the other components, TBH, as any gain in the denoiser is just a compounding factor in most cases. I think benchmarking different batch sizes and two or three different resolutions is a good idea.

Also, there are multiple use cases beyond text-to-{image,video,audio}, such as image-guided editing, but I think it's fine to keep it at the text-to-X level, as it's easy to extrapolate to the other use cases.

Just as a note: in our benchmark, we also check quantization backends and memory features like layerwise casting. I understand that this deviates from a core PyTorch benchmark, but these features are quite popular in the community, hence we benchmark them. Nothing to fret about, just noting.
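For context, here is a minimal sketch of what layerwise casting looks like in diffusers, assuming a recent release that exposes `enable_layerwise_casting` on its models; the checkpoint name is illustrative.

```python
# Minimal sketch of layerwise casting in diffusers, assuming a recent
# release that exposes ModelMixin.enable_layerwise_casting. The
# checkpoint name below is illustrative.
import torch
from diffusers import FluxPipeline

pipe = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev", torch_dtype=torch.bfloat16
).to("cuda")

# Store transformer weights in fp8, but upcast each layer to bf16 for
# compute: a large memory reduction for a small speed cost.
pipe.transformer.enable_layerwise_casting(
    storage_dtype=torch.float8_e4m3fn,
    compute_dtype=torch.bfloat16,
)

image = pipe("a photo of a cat", num_inference_steps=28).images[0]
```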