PSA - Wanted to share a few things I've learned to improve it/s. #1244

myndxero · 2023-06-01T07:07:58Z

myndxero
Jun 1, 2023

First of all, for Windows 11 users, this is important! On my RTX 3090 Ti, my rough estimate of the improvement was about 15-20% it/s. Requires a restart.

EDIT: Oh yeah, turn off hardware acceleration in your web browser settings!!!

Secondly, and this one may be much more obscure, it is important to BALANCE your prompt and negative prompt. Gains here, for me, were anywhere from 5-15% it/s in loose, eyeballed testing.

~~EX test prompt - tokens for both prompt and negative prompt are 74/74. To compare, simply unbalance it: clear the negative prompt entirely.~~ While the concept remains true, this isn't the prompt I was thinking of. I am trying to locate it. The prompt here is giving me 80/78. I could have sworn it was 74/74, which is how I have it in my notes, but that's incorrect.

Prompt: masterpiece, realistic, (((perfect))), photorealistic, ultrarealism, (official art), 1girl, bobcut, blue eyes, brunette, serious look on face, (sleeveless red dress), (white thighhighs), (hightops), hands on handlebars, ((riding on motorcyle)), speeding, looking at viewer, highway, scenery, dusk, raining, puddles, bokeh, intricate, highly detailed, motion lines"
Negative: nsfw, lowres, ((mangled)), inaccurate, mutated, mutant, tumor, growth, cancer, guro, injury, violence, (deformed), (bad anatomy), (disembodied limbs), bad hands, text, error, missing fingers, extra digit, fewer digits, cropped, worst quality, low quality, normal quality, jpeg artifacts, signature, watermark, username, blurry"

I cannot find where I picked up this prompt. It is not my own; it came from one of the heap of it/s improvement threads/comments/forums/reddits I have gone through in my searches.

I have a prompt that was 314/50-something and was generating at about 5.2 it/s. When I bumped the negative to 301, it ran at 5.88 it/s (315/301).
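For anyone who wants to put a number on that, here is a minimal sketch of the arithmetic using the two it/s figures quoted above. The function name is made up for illustration, and this is a single anecdotal data point, not a benchmark.

```python
def speedup_percent(before_its: float, after_its: float) -> float:
    """Relative change in iterations per second, as a percentage."""
    return (after_its - before_its) / before_its * 100.0


# Figures from the post: 5.2 it/s unbalanced, 5.88 it/s after padding the negative.
gain = speedup_percent(5.2, 5.88)
print(f"{gain:.1f}% faster")  # prints "13.1% faster"
```

That lands inside the 5-15% range mentioned earlier.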

TIPS on balancing

TL;DR -

- Using BREAKS pads the prompt out to the next 75 tokens (and improves your prompts when used effectively).
- In my example below of 'a man wearing a black shirt and driving a red car', simply use the negative prompt 'black car'.
- Token counts don't have to be a precise match, such as the above test prompt at ~~74/74~~, but the closer they are the better the it/s.
- Other things that can influence it/s are samplers, models, loras, controlnets, all that. I've gone back to using Euler and Euler a as my mains, backing off DPM++ 2M Karras.
- What the eff is UniPC?

Long-winded -
The trick I used to bump the tokens up in the negative was BREAKS. BREAKS are how my prompt crossed the 300 mark in the first place, due to the 'padding' that breaks put between tokens to isolate 'cross-contamination.' By cross-contamination I mean that if you specify 'a man is wearing a black shirt and driving a red car,' sometimes you'll get a black shirt and black car, or a red shirt and red car, from contamination between the tokens 'black' and 'red.' Putting a BREAK between them isolates them from each other a little better, improving adherence to your prompt. To do that, each BREAK pads the current chunk out to 75 tokens.

So to bump my 50 to 300 for my negative, I did 'BREAK blah.' I can't dbl check atm but will update - Vlad's token counter wasn't recognizing my BREAKS unless I followed it with something, in this case 'blah.' Anapnoe's counts every BREAK even if it's the only thing in the prompt. Not sure if there's a technical difference here as far as balancing prompts go, but something to maybe note.
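To make the padding arithmetic concrete, here is a rough sketch of how a counter might arrive at a number like 301. It is a simplified model, not the code of any particular UI: it assumes every chunk that ends in a BREAK is padded up to a multiple of 75 tokens and that the final chunk counts only its real tokens. As noted above, counters differ on whether a BREAK with nothing after it is counted, so treat the handling of empty chunks here as an assumption. The function name and the example chunk sizes are hypothetical.

```python
import math

CHUNK = 75  # tokens per chunk, as described in the post


def reported_tokens(chunk_token_counts: list[int]) -> int:
    """Token total for a prompt split into chunks by BREAK.

    Every chunk except the last is padded up to a multiple of 75
    (assumption: an empty chunk still occupies one full chunk);
    the last chunk contributes only its real tokens.
    """
    if not chunk_token_counts:
        return 0
    *padded, last = chunk_token_counts
    total = sum(max(1, math.ceil(n / CHUNK)) * CHUNK for n in padded)
    return total + last


# Hypothetical layout: a 50-token negative followed by "BREAK blah" repeated,
# where "blah" is assumed to be a single token.
print(reported_tokens([50, 1, 1, 1, 1]))  # 4 padded chunks + 1 token = 301
```

Under this model a single chunk with no BREAK just reports its own length, so `reported_tokens([74])` is 74.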

GO BACK TO USING EULER/EULER A!!!!!! IF it/s is super important to you. There was something important to mention about DDIM too but I can't remember, for the life of me, what it was. I'll edit if I remember. I feel like samplers are largely preference, I dunno. I know at one time the Ancestrals were considered more creative and non-convergent, i.e. you could put in any number of steps and the image would just keep developing up to your step count, for better or worse. I know the DPM++ Karras series and the modified series are popular, but really all they seem to do is double-sample per step or something? I'll need to refresh myself on the samplers. I still don't know the point of UniPC haha, except that it won't work with AND syntax in prompts.

Additional thoughts -

token padding in text encoder settings for Dreambooth training IS THIS. I know for awhile that seemed like a mystery too.
Anecdotal, but I'll still throw it out there: my series of BREAK blah additions to my negative prompt seem to have increased my generation fidelity? I will need to test more, but it could be interesting. Maybe the mere act of balancing your tokens is why /shrug.
I'm not big on running diagnostics and benchmarking, maybe a reader will feel motivated to do that, but these are simple things to easily achieve some gains.

Be interested to hear what other people might have to share or general thoughts on this.

ghost · 2023-06-01T07:59:15Z

ghost
Jun 1, 2023

I still don't know the point of UniPC haha, except that it won't work with AND syntax in prompts.

You are missing out. I wouldn't recommend Euler. I render with 7 UniPC steps for drafting without controlnet and get a glimpse of what SD thinks of my prompt in no time, and then I can continue or select seeds with higher steps. With controlnet I only need 3 UniPC steps in txt2img!! That's the minimum; okay quality begins at 4 steps, if controlnet is activated.

Tip for UniPC: the lower the steps, the more you profit from lower CFG. In my 3-step experiments I ended up with CFG at 1.7 to make it work (and it does not mess with the prompt interpretation). UniPC likes low CFG values at low step counts.

This sampler is blazingly fast if you know how to tweak the maximum out of it. I recommend setting UniPC variant to "bh2" and UniPC skip type to "time_quadratic" in settings under "Sampler parameters". This helps with artifacts at low steps. If you still get artifacting or blur, you need to lower CFG.

Comparison (4 Steps):

6 replies

ghost Jun 1, 2023

I never noticed any benefit of going above 50 steps. Have you compared a 50 step image with a 200 step image directly, and is it worth 4x the compute cost? Have to do some experiments myself on this

I haven't tried changing the minimum step number. I think 3 steps is extreme; I don't need 2 steps. I think img2img with UniPC can go below 3 steps, but I haven't looked into it too much. I wanted to see how fast I can go with SD while still retaining some quality.

myndxero Jun 1, 2023
Author

This may be anecdotal, but sometimes I feel like higher steps can be necessary, alongside higher CFG, to materialize a complex prompt. With UniPC, 50 steps would give me a blurry image. At 200 it would be crisp. But in my instance, just to remind, I am clearly using UniPC wrong haha.

With Ancestral samplers (denoted with 'a' in the name: Euler a, DPM++ 2S a), the output tends to react to your step count and details keep changing, whereas non-ancestral samplers more or less reach an end point and further steps are irrelevant. IIRC.

Then there's DPM Adaptive which more or less just does its own thing. Steps make a difference but Adaptive kinda has a mind of its own. DPM++ SDE seems like that too.

I am not super knowledgeable on samplers though. There are probably holes in my descriptions and such.

EDIT: As far as worth the compute cost? I guess that's subjective. I run everything on local hardware and am set up to do plenty of other things when waiting. If I was running on Runpod or something, that would be a whole different story.

myndxero Jun 1, 2023
Author

I'll run a good ol fashioned XY test whenever my current job finishes for funsies. I think > 50 steps can matter, and some of it continues to just be preference.

I know many used to say that for Euler a, 25-35 steps was the sweet spot and any more was a waste. That depends on one's definition of a waste, which makes it a matter of preference past a certain point.

EDIT: > 4 hours left on my current thing. 133k steps or so, 30 per image including hires.

Aptronymist Jun 1, 2023
Collaborator

You are missing out. I wouldn't recommend Euler. I render with 7 UniPC steps for drafting without controlnet and get a glimpse of what SD thinks of my prompt in no time, and then I can continue or select seeds with higher steps. With controlnet I only need 3 UniPC steps in txt2img!! That's the minimum; okay quality begins at 4 steps, if controlnet is activated.

What do you mean by "if controlnet is activated"? What mode would I have it in for that?
I really haven't taken the time to learn much about it or use it, beyond a brief tinkering with the new reference-only mode. I do sometimes use After Detailer, which utilizes controlnet's inpainting though.

This sampler is blazingly fast if you know how to tweak the maximum out of it. I recommend setting UniPC variant to "bh2" and UniPC skip type to "time_quadratic" in settings under "Sampler parameters". This helps with artifacts at low steps. If you still get artifacting or blur, you need to lower CFG.

I do use UniPC a lot now, it's my default sampler after reading Vlad praise it, and giving it a real shot myself, but I haven't messed with those settings because they are essentially unexplained, which seems to be something a lot of good extensions suffer from: tons of options, little to no documentation, much less tooltips.

Yodrik Jun 2, 2023

I guess for controlnet you don't really need a high step concise image to get model images? I've never thought about it like that.

You may wish to take a look at past tests on differences between UniPC and other samplers https://www.reddit.com/r/StableDiffusion/comments/11oke60/comparison_of_new_unipc_sampler_method_added_to/ It follows the general summary in the brief and paper written by the folk behind it ("Need less steps for success~").

brknsoul · 2023-06-01T09:48:48Z

brknsoul
Jun 1, 2023

Another tip to squeeze out every last megabyte of vRAM: open Task Manager, head over to the Details tab, right-click a column title and click Select Columns. Find Dedicated GPU Memory, tick it and click OK.

Now sort the list by Dedicated GPU Memory to see what's eating up your vRAM. Open that program and see if you can disable Hardware Acceleration (similar to the browser setting in the OP).

For example, Discord likes to nom on vRAM, so click the settings cog ⚙ near the bottom left, go to Advanced and disable Hardware Acceleration there.
Some IM programs (Telegram or Facebook Messenger clones) use Hardware Acceleration too.
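If you would rather measure the effect than eyeball Task Manager, a small script can read total VRAM usage before and after you disable hardware acceleration in an app. This is a sketch, assuming an NVIDIA card with `nvidia-smi` on the PATH. The query flags are standard `nvidia-smi` options, but the helper names and the sample numbers are made up for illustration. It reports the total per GPU, not per process, so use Task Manager to find the culprit and this to confirm the saving.

```python
import subprocess

# Standard nvidia-smi query flags; values come back in MiB without units.
QUERY = [
    "nvidia-smi",
    "--query-gpu=memory.used,memory.total",
    "--format=csv,noheader,nounits",
]


def parse_memory(csv_text: str) -> list[tuple[int, int]]:
    """Parse lines like '1234, 24576' into (used_mib, total_mib), one per GPU."""
    rows = []
    for line in csv_text.strip().splitlines():
        used, total = (int(part.strip()) for part in line.split(","))
        rows.append((used, total))
    return rows


def query_memory() -> list[tuple[int, int]]:
    """Run nvidia-smi and return (used_mib, total_mib) for each GPU."""
    out = subprocess.run(QUERY, capture_output=True, text=True, check=True).stdout
    return parse_memory(out)


def freed_mib(before: list[tuple[int, int]], after: list[tuple[int, int]]) -> int:
    """VRAM freed on the first GPU between two readings, in MiB."""
    return before[0][0] - after[0][0]


# Made-up sample readings, only to show the shape of the data:
sample_before = parse_memory("3000, 24576\n")
sample_after = parse_memory("2400, 24576\n")
print(freed_mib(sample_before, sample_after))  # 600
```

In practice you would call `query_memory()` once, change the setting and restart the app, then call it again and compare.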

1 reply

myndxero Jun 1, 2023
Author

Also reminds me, TURN OFF LIVE PREVIEW!! Since I am an xyz addict I used to require it to watch while I did other things... But putting the slider on -1 still saves me a little VRAM AND I will still see each image when it is generated along the way. I haven't looked directly at it/s to know if it helps there, but will take a bit off your VRAM usage!

I also like to run WebUI in a separate browser from my main one. My main one is usually closed altogether.

vladmandic · 2023-06-01T12:28:50Z

vladmandic
Jun 1, 2023
Maintainer

thanks for the write-up, i think this is useful for a lot of people!
i'd add few more things (more of clarifications)

- windows 11 gpu scheduling
  well, the answer is not just "disable it", it's more "it depends". the answer is similar to xformers vs sdp: they are nearly identical if your cpu and gpu are somewhat balanced. but if you pair a fast gpu with a slower cpu, you're better off with the gpu doing the scheduling, as the cpu won't be able to schedule enough tasks for the gpu to start with. and if you have a fast cpu and a slower gpu, you're better off using the cpu to schedule tasks so the gpu can focus purely on calculations. in the case of sd, xformers uses cpu-scheduling and is better for systems with an older gpu, while sdp uses gpu-scheduling and is better for systems with high-end gpus
- unipc scheduler
  it's different from any other scheduler in that it has separate and different initial and final steps. so you can't even run unipc with steps=1, it will always be a minimum of 3. but then those middle steps are much faster than with a typical scheduler and you need fewer of them (nearly 1/2 less). so for example, unipc with 10 steps is 1+8+1 and would be about the same as euler with 16 steps.
- windows applications that use hw acceleration
  windows is actually pretty smart when it comes to that, and unless an app is doing something bad, minimizing the app is normally enough, as it will not render non-visible things. this is definitely the case for the browser. but yes, discord is bad
- steps above 99
  this is actually bad, never use it. the reason is how samplers interpolate user-defined steps into so-called timesteps, so you end up with incorrectly interpolated values (and in the worst case, even runtime errors like index out of range). in the recent versions, i've actually locked ui controls so steps cannot be set above 99
- most schedulers (not all) have early-exit
  - meaning, the actual number of executed steps may be lower than what you've set if the sampler thinks it's done
  - this is especially true for dpm adaptive, but not limited to it
- number of token segments
  - yes, there is a big (~15%) performance hit if the number of segments in the positive and negative prompt is different
  - number of tokens doesn't matter, number of segments does
  - there are some improvements on the way that will auto-pad either prompt to always make them have the same segment count
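The UniPC step arithmetic from the list above can be written down as a tiny helper. This is only a rule-of-thumb sketch built from the numbers in that comment (one initial step, one final step, middle steps worth roughly two Euler steps each). It is not derived from the sampler code, and both function names are hypothetical.

```python
def unipc_layout(steps: int) -> tuple[int, int, int]:
    """Split a UniPC step count into (initial, middle, final) steps."""
    if steps < 3:
        raise ValueError("per the comment above, UniPC needs at least 3 steps")
    return (1, steps - 2, 1)


def rough_euler_equivalent(steps: int) -> int:
    """Very rough Euler step count of similar quality (assumes ~2x per middle step)."""
    _, middle, _ = unipc_layout(steps)
    return 2 * middle


print(unipc_layout(10))            # (1, 8, 1)
print(rough_euler_equivalent(10))  # 16
```

Treat the output as a starting point for your own XY comparison, nothing more.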

8 replies

vladmandic Jun 1, 2023
Maintainer

Unpopular opinion here, better than Matrix (and the first of many dramatic Sean Bean character deaths).

definitely not better than the matrix part one. the rest? well, yeah, possibly.

myndxero Jun 1, 2023
Author

definitely not better than the matrix part one. the rest? well, yeah, possibly.

More? I only remember the first 🗡️

JK, there was one moment that made Revolutions worth it to me, the whole entire thing, where they knocked it out of the park for me......

When Smith and Neo fought at the end and Smith asks, "why Mr Anderson, why do you keep fighting?" and Neo responds, "because I choose to." That was EVERYTHING in the entire trilogy right there. People say Neo had a God complex or he was like Jesus dying for sins or whatever, no no no. That's not it at all. There was no grand design, rather he OPPOSED the grand design, he represented the whole crux of the Machines' vision, HUMAN CHOICE and FREE WILL, that was the whole fucking point. Viewers lost sight of that in Revolutions, but it was the one saving grace of it for me.

Also, Navras is one of the greatest tracks in any movie ever, CHANGE MY MIND.

brknsoul Jun 1, 2023

number of segments in positive and negative prompt is different

So if my prompt is 150 or 225 tokens long (2 or 3 segments), then I should bump my rather short negative prompt with a BREAK or two to increase generation speed?

vladmandic Jun 1, 2023
Maintainer

correct.
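As a quick way to check whether two prompts are balanced, here is a minimal sketch of the segment arithmetic discussed above. It assumes segments are 75 tokens wide and that each added BREAK contributes one extra segment; the function names are made up for illustration and this is not the UI's own logic.

```python
import math

SEGMENT = 75  # tokens per segment


def segments(token_count: int) -> int:
    """Number of 75-token segments a prompt occupies (at least one)."""
    return max(1, math.ceil(token_count / SEGMENT))


def breaks_to_add(positive_tokens: int, negative_tokens: int) -> tuple[str, int]:
    """Which prompt to pad, and by how many BREAKs, to equalize segment counts."""
    pos, neg = segments(positive_tokens), segments(negative_tokens)
    if pos == neg:
        return ("neither", 0)
    if pos > neg:
        return ("negative", pos - neg)
    return ("positive", neg - pos)


print(breaks_to_add(314, 50))   # ('negative', 4)
print(breaks_to_add(315, 301))  # ('neither', 0)
```

The second call matches the 315/301 case from the original post: the token counts differ, but both prompts occupy five segments.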

derspanier1 Jun 7, 2023

grateful to stumble upon info like this. BREAK pad negative. noted. thank you

myndxero · 2023-06-01T16:38:14Z

myndxero
Jun 1, 2023
Author

~~EX test prompt - tokens for both prompt and negative prompt are 74/74. To compare, simply unbalance it: clear the negative prompt entirely.~~ While the concept remains true, this isn't the prompt I was thinking of. I am trying to locate it. The prompt here is giving me 80/78. I could have sworn it was 74/74, which is how I have it in my notes, but that's incorrect.

Prompt: masterpiece, realistic, (((perfect))), photorealistic, ultrarealism, (official art), 1girl, bobcut, blue eyes, brunette, serious look on face, (sleeveless red dress), (white thighhighs), (hightops), hands on handlebars, ((riding on motorcyle)), speeding, looking at viewer, highway, scenery, dusk, raining, puddles, bokeh, intricate, highly detailed, motion lines"
Negative: nsfw, lowres, ((mangled)), inaccurate, mutated, mutant, tumor, growth, cancer, guro, injury, violence, (deformed), (bad anatomy), (disembodied limbs), bad hands, text, error, missing fingers, extra digit, fewer digits, cropped, worst quality, low quality, normal quality, jpeg artifacts, signature, watermark, username, blurry"

@vladmandic if you have time to sus this out: my test prompt, without quotes in your UI, hits the 74/74 token count I noted in my notes and originally in this post. It still does, but in A1111 and Anapnoe, 80/78. My only guess, off the wall, maybe something to do with the token merging thing I've noticed in settings?

2 replies

vladmandic Jun 1, 2023
Maintainer

oh, that's totally expected - i rewrote prompt parser recently. old one is still in as compatibility option, but its disabled by default.
see #1005 (comment) for details.

myndxero Jun 3, 2023
Author

Such a Chad.

ghost · 2023-06-05T09:15:39Z

ghost
Jun 5, 2023

I cleaned up the SSD where my sdnext install lives (there were so many broken a1111 and sdnext folders....) and I gained on average 0.7 it/s.

That's a new record on my system (512x512, UniPC 20 steps, batch size 4, ToMe 0.5, RTX 2080S)

I was under the impression storage would contribute the least to performance but this really surprised me?

edit: and this is with the new broken nvidia drivers

0 replies
