Cross-attention optimizations #109
Replies: 8 comments 39 replies
-
I noticed that Tiled VAE/Tiled Diffusion doesn't work unless I turn on xformers (which I installed along with Torch 2.0). Am I missing something in the UI config, maybe?
-
No, I didn't say you suggested that, but it was posted in a few places; one big study is here: https://pytorch.org/blog/accelerated-diffusers-pt-20/
The PyTorch devs are claiming these improvements, but honestly I don't see them.
P.S.: off-topic, but is there any way to customize the font in your fork? My eyes are getting tired of the boldish-looking default font :D
On 14 Apr 2023, at 02:53, Vladimir Mandic wrote:
I haven't said SDP is 50% faster than xformers, but it's much cleaner implementation-wise, with far fewer installation problems (xformers is quite often not available for this-or-that platform in binary form and needs to be recompiled separately for any minor change to torch, etc.).
So if SDP "just works", I'm happy.
xformers were amazing, but now I'd recommend them only for much older GPUs that don't have a sufficient pipeline; for example, the nVidia 1xxx series is better with xformers. The rest? Why bother, just my $0.02.
Also, xformers break quite a few things related to training, so you need to unwind them each time, etc. The end user may not care, but from a dev perspective, I hate doing that all the time.
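Anyone who wants to sanity-check the PyTorch 2.0 claims on their own card can time the fused SDP path against the plain math fallback. A minimal sketch, assuming CUDA and Torch 2.0 (shapes are made up for illustration; `torch.backends.cuda.sdp_kernel` is the Torch 2.0 context manager for picking a backend):

```python
# Rough timing sketch only: fused SDP backends vs. the plain math fallback.
# Shapes are illustrative, not taken from any particular SD model.
import time
import torch
import torch.nn.functional as F

q = torch.randn(2, 8, 4096, 64, device="cuda", dtype=torch.float16)  # (batch, heads, tokens, head_dim)
k, v = torch.randn_like(q), torch.randn_like(q)

def bench(label, **flags):
    with torch.backends.cuda.sdp_kernel(**flags):
        for _ in range(3):                                  # warmup
            F.scaled_dot_product_attention(q, k, v)
        torch.cuda.synchronize()
        t0 = time.perf_counter()
        for _ in range(20):
            F.scaled_dot_product_attention(q, k, v)
        torch.cuda.synchronize()
        print(f"{label}: {(time.perf_counter() - t0) / 20 * 1000:.2f} ms/iter")

bench("fused (flash / mem-efficient)", enable_flash=True, enable_mem_efficient=True, enable_math=False)
bench("math fallback", enable_flash=False, enable_mem_efficient=False, enable_math=True)
```

If the two numbers come out nearly identical, the fused kernels probably aren't kicking in on that GPU, which is roughly the situation being described for the 1xxx series.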
-
Not cleanly yet, it's on my todo. But for now, you can edit
-
When you say xformers is better than SDP on old GPUs, like my 1660 Ti, is the difference noticeable? Are there any measurements?
-
--xformers still works better than --opt-sdp-attention (see the "RTX 3090 vs RTX 3060 Ultimate Showdown for Stable Diffusion, ML, AI & Video Rendering Performance" comparison). You really should support --xformers along with --opt-sdp-attention.
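For reference on what the two options actually call under the hood: the xformers path routes cross-attention through `xformers.ops.memory_efficient_attention`, while the SDP path uses torch's built-in `scaled_dot_product_attention`. A rough side-by-side sketch (shapes are illustrative; note the two libraries expect different tensor layouts):

```python
# Illustrative sketch of the two attention calls being compared in this thread.
import torch
import torch.nn.functional as F
import xformers.ops as xops

q = torch.randn(2, 8, 4096, 64, device="cuda", dtype=torch.float16)  # (batch, heads, tokens, head_dim)
k, v = torch.randn_like(q), torch.randn_like(q)

# PyTorch 2.0 SDP expects (batch, heads, tokens, head_dim)
out_sdp = F.scaled_dot_product_attention(q, k, v)

# xformers expects (batch, tokens, heads, head_dim), so transpose first
q_x, k_x, v_x = (t.transpose(1, 2).contiguous() for t in (q, k, v))
out_xf = xops.memory_efficient_attention(q_x, k_x, v_x).transpose(1, 2)

# The two should agree up to numerical noise; any speed difference is purely the kernel
print(torch.allclose(out_sdp, out_xf, atol=1e-2))
```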
-
I tried both xformers and SDP in the vlad fork on an RTX 4070 but got exactly the same speed, so why is SDP better?
-
On my 3060 12 GB on Windows 10 I can't get the same speed as in A1111. Vlad fork, Torch 2.0 + cu118, SDP or xformers: 6.7 it/s (live previews optimized or turned off, it doesn't matter). So it is not possible to reach the same speeds, sadly.
-
On a 6900 XT I found that sub-quad attention performs the worst but has the lowest VRAM usage. So it's kind of useful to quickly switch to (thanks to it being moved to the options) if you want to render something at high resolution and avoid OOM.
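For anyone curious why sub-quad is slower but lighter on VRAM: instead of materializing the full tokens × tokens score matrix at once, it walks over the queries in chunks, so only a chunk × tokens slice exists at any time. A simplified sketch of the idea (the real sub-quadratic implementation also chunks the keys and uses an online softmax; `chunk_size` here is an arbitrary illustrative value):

```python
# Simplified sketch of query-chunked ("sub-quadratic" style) attention.
# Peak memory for the score matrix drops from O(n^2) to O(chunk_size * n).
import torch
import torch.nn.functional as F

def chunked_attention(q, k, v, chunk_size=512):
    # q, k, v: (batch, heads, tokens, head_dim)
    scale = q.shape[-1] ** -0.5
    out = []
    for i in range(0, q.shape[2], chunk_size):
        q_chunk = q[:, :, i:i + chunk_size]                  # (b, h, chunk, d)
        scores = q_chunk @ k.transpose(-2, -1) * scale       # (b, h, chunk, tokens)
        out.append(scores.softmax(dim=-1) @ v)               # (b, h, chunk, d)
    return torch.cat(out, dim=2)

q = torch.randn(1, 8, 2048, 64)
k, v = torch.randn_like(q), torch.randn_like(q)
ref = F.scaled_dot_product_attention(q, k, v)
print(torch.allclose(chunked_attention(q, k, v), ref, atol=1e-4))
```

The loop over chunks is where the speed goes: every pass re-reads all of k and v, which is why it loses to the fused kernels while winning on peak memory.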
-
As of recently, I've moved all command line flags regarding cross-optimization options to UI settings, so things like `--xformers` are gone.
The default method is scaled dot product from `torch` 2.0, and it's probably best unless you're running on a low-powered GPU (e.g. nVidia 1xxx), in which case `xformers` are still better.
Note that `xformers` are not released in binary form on Windows for Torch 2.0 and CUDA 11.8. So at the moment, the only way to get `xformers` on Windows is to either compile them manually (requires a compiler to be installed) or find a 3rd-party compiled build (I cannot guarantee the quality of any 3rd-party compiled code).
And AMD users report that sub-quad attention achieves the best results on some systems, so give that a try.