Hypernetwork Style Training, a tiny guide #2670
72 comments · 381 replies
-
I find that my hypernetworks start to cook at 5e-6 somewhere after ~17k steps, so that might be a good stopping point.
-
Thanks for the guide! I find that hypernetworks work best when used after fine-tuning or merging a model. Trying to train things that are too far out of domain seems to go haywire. It makes sense: when you fine-tune a Stable Diffusion model, it will learn the concepts pretty well, but it becomes somewhat difficult to prompt-engineer what you've trained on. Hypernetworks seem to help alleviate this issue.
-
A few more examples of the NAI + Andreas Rocha hypernetwork, now that it is trained.
-
Thanks for this guide. I have been struggling to get an embedding of a particular artist's style sorted out, and this helped no end toward an acceptable result. I had 26 examples of the artist's work, which I manually resized/cropped to 512x512. Embeddings need a much bigger learning rate, and after some trial and error, I ended up with: Initialization text: * The nice thing about embeddings is that I can use a standard model and just add "painting by [artist-name]" to my prompts. I will try extending the final learning-rate stage out to a much bigger number of steps and see if more details appear.
-
Can you share the post-processed images that you used for the training, if possible? Just to have a better idea of what works.
-
Very good tutorial, although my VRAM currently isn't enough for me to use Train XD
-
I trained a Mob Psycho hypernetwork; here are the results at 26k steps. No Mob Psycho prompts were used to generate these images. Some extras:
-
I'm trying to train it on Mass Effect aliens. I know the SD 1.4/1.5 model has a vague idea of what they are, but training goes in circles. Is a hypernetwork the wrong tool for that?
-
Do you know where the default values for training are kept? I'd like to change the usual 0.005 to your recommended schedule. There's always one value I forget to set, and then I have to start all over.
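For what it's worth, the UI's default values are stored in ui-config.json in the webui root after the first launch, and edits there become the new defaults. The key names for the Train tab differ between versions, so here is a minimal sketch that searches for them rather than assuming exact names:

```python
import json

# Assumes ui-config.json sits in the webui root; adjust the path for your install.
with open("ui-config.json", encoding="utf-8") as f:
    ui_config = json.load(f)

# Key names differ between versions, so search for learning-rate-like fields
# instead of hard-coding them, then edit the matching "/value" entries by hand.
for key, value in ui_config.items():
    if "learn" in key.lower() and key.endswith("/value"):
        print(key, "=", value)
```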
-
For anyone wanting to test something, this is an annealing learning rate schedule I'm trying out: It would be better if we could put math expressions in the learning rate field instead.
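Since the learning rate field only takes a rate:step list, one workaround is to generate an annealed schedule offline and paste the resulting string in. A minimal sketch using a cosine-style decay; the values and the cosine_schedule helper are just an illustration, not the schedule from the comment above:

```python
import math

def cosine_schedule(lr_max, lr_min, total_steps, segments=10):
    """Approximate cosine annealing as a piecewise-constant 'rate:step' string."""
    parts = []
    for i in range(1, segments + 1):
        t = i / segments
        # cosine interpolation from lr_max down to lr_min
        lr = lr_min + 0.5 * (lr_max - lr_min) * (1 + math.cos(math.pi * t))
        parts.append(f"{lr:.2e}:{round(total_steps * t)}")
    return ", ".join(parts)

# Paste the printed string into the learning rate field.
print(cosine_schedule(5e-5, 5e-8, 20000))
```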
-
There is a PR for multilayer structure settings for hypernetworks, #3086. Does anyone have an idea of how this affects training?
-
What learning rate did you use?
On Thu, Oct 20, 2022 at 17:42, Pirate Kitty ***@***.*** wrote:
… I've been able to train faster with normalization, but increasing the neural network density only slowed training down without any perceivable gain, at least on a 100ish-picture dataset.
-
Yeah, my goal is good, not fast.
On Thu, Oct 20, 2022 at 19:43, Pirate Kitty ***@***.*** wrote:
… I'm currently using 5e-3:200, 5e-4:400, 5e-5:1000, 5e-6:2000, 5e-7:3000 for normalized, only training up to 3000 steps. But the results aren't good and they don't seem to get better with normalization. So it's only if you want something fast, I suppose.
-
Just a quick question on how to read "loss convergence". For example, can a loss going from a 0.30-0.10 range to a 0.25-0.15 range be interpreted as converging? Am I understanding this correctly?
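The raw per-step loss bounces around a lot, so one way to make the trend readable is to smooth it and watch whether the average drifts down. A minimal sketch, assuming you have a CSV of per-step losses from your run (the path and column name below are assumptions; adjust them to wherever your setup logs the loss):

```python
import csv

window = 200
losses = []

# Hypothetical log location and column name; point this at your own loss log.
with open("textual_inversion/hypernetwork_loss.csv", newline="") as f:
    for row in csv.DictReader(f):
        losses.append(float(row["loss"]))

# Moving average: if this keeps drifting downward over thousands of steps,
# the run is still converging; the raw min/max range alone says little.
smoothed = [
    sum(losses[max(0, i - window):i]) / (i - max(0, i - window))
    for i in range(1, len(losses) + 1)
]
print(f"first avg: {smoothed[0]:.4f}, last avg: {smoothed[-1]:.4f}")
```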
-
I can't get hypernetwork training "to work". Training a model on myself via Dreambooth creates great results, but when I try the same with a hypernetwork I look like a** 😂 Like a long-lost cousin or something. I tried a hypernetwork on a friend of mine, and no matter what I did he turned out as a good-looking Asian dude, and he is not Asian at all 😂 Lol... I wish I could use this training, since 10GB VRAM is too little to train Dreambooth locally.
-
So much has changed in the recent commits that I feel like most of the info in this thread is no longer relevant. Kinda feel like it's time to start a new one with revised research and findings.
-
Has anyone got good hypernetworks working on non-anime styles?
-
So, I was going to look into how weight decay affects training, but in the process I found a different issue with current training. The first thing I saw was the very smooth loss line, so I looked into the code. It turns out that when generating any images, we call this function: stable-diffusion-webui/modules/devices.py, lines 84 to 88 in 44c46f0. From what I can tell, this is also called in case of , which led to the second test.
Now, I'm still not sure which parts of HN training cause non-determinism, but I've set a fixed seed at the beginning of each step. So, yeah, the loss graphs are mostly identical. The thing that really sticks out is that the loss is very periodic: you see an identical pattern every 45 steps, but squashed a little bit (because it still converges). We can also add a loss graph of the same training, but without resetting the RNG every 45 steps, and yes, it diverges at 45 steps. Well, 44, but that's a different issue.
Now the question is: does this only affect the loss metrics, or does it affect training as well? Easy, just compare all of our hypernets against each other and the answer should be clear, right?
Three identical hypernets, 90 steps (10 epochs), batch size 1, fixed seed at the beginning of each step. Some results from here on may be NSFWish. Hypernets have a very loose idea of . There isn't even a question: 5-1 and 5-2 look like they're the same image on slightly different systems, while 5-3 is drastically different by comparison! So what does this all mean?
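A minimal sketch of the "fork the RNG for previews" idea, i.e. snapshotting and restoring the torch RNG state around preview generation so it cannot perturb the training stream; this is my own illustration, not the webui's actual code:

```python
import torch

def run_without_touching_rng(fn):
    """Call fn() and then restore the CPU/CUDA RNG state to what it was before."""
    cpu_state = torch.get_rng_state()
    cuda_states = torch.cuda.get_rng_state_all() if torch.cuda.is_available() else None
    try:
        return fn()
    finally:
        torch.set_rng_state(cpu_state)
        if cuda_states is not None:
            torch.cuda.set_rng_state_all(cuda_states)

# Usage (generate_preview is a hypothetical preview function):
# preview = run_without_touching_rng(lambda: generate_preview(prompt, seed=preview_seed))
```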
-
The training cannot be carried out due to the above error. Please help.
-
So, is there a universal guide on how to train art styles (character drawing style) using the monkeypatch method quickly, without blowing up the weights and destroying background details?
-
@Heathen, have you seen #4940? After reading here and getting some good results in ~5k steps, I achieved similar results in 500 steps with the recommendations there (layer structure (1, 0.1, 1), lr 1e-4, changing the optimizer, adding the if-loss check).
-
I don't know if this is still relevant, but I have found my biggest success with wide and deep bottlenecked networks. From my empirical testing:
Wide nets: (1, 3, 1) or (1, 4, 1)
Deep nets: (1, 1.5, 1.5, 1) or (1, 1.5, 1.5, 1.5, 1)
'Default' nets: (1, 2, 1)
Needless to say, all three architectures are far from ideal, since all of them have a very hard time generating images where the prompt is very far off from the trained data captions. So I went on and did some more testing:
Wide, deep and bottlenecked: (1, 3, 0.75, 0.75, 0.75, 3, 1) or (1, 4, 0.75, 0.75, 0.75, 4, 1)
The activation functions do not seem to play a huge part in the results. Sure, certain activation functions will allow the net to learn further or not explode, but I've been having decent results even with linear activation and normal initialization. It seems that the net architecture plays a much bigger role in making a good hypernet than the activation functions themselves.
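For anyone unsure what those tuples mean: as I read the webui's hypernetwork code, each entry is a multiplier on the module's input dimension, so the structure directly sets the widths of the linear layers. A rough sketch (the 320/640/768/1280 sizes are the usual cross-attention dims the hypernetwork hooks, but treat that list as an assumption about your version):

```python
def layer_widths(structure, dim):
    """Expand a layer_structure tuple into the resulting linear layer widths."""
    return [int(dim * mult) for mult in structure]

wide_deep_bottlenecked = (1, 3, 0.75, 0.75, 0.75, 3, 1)
for dim in (320, 640, 768, 1280):
    print(dim, layer_widths(wide_deep_bottlenecked, dim))
# e.g. 768 -> [768, 2304, 576, 576, 576, 2304, 768]
```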
-
I noticed that for plenty of artists, embeddings are not capable of recreating the art style (when it comes to recreating an art style rather than a subject). I did some testing with hypernetworks and they started "looking like something different" toward the art style; however, only with Dreambooth could I achieve such quality. But then the results were "mimicking" the inputs: even if I checked to train an art style, not a subject, Stable Diffusion always came up with something that looked like an "interpolation" of the original input instead of keeping the composition intact. The artists: Wayne Raynolds and Robbie Trevino.
-
And expanding a bit on #2670 (reply in thread): after I removed setting the seed at each step and forked the RNG for previews, I tried to produce this double descent I keep talking about, but it still didn't work out (admittedly the size of the network was kinda small). However, I made a network that produces "perfect copies" of the training image. Here is the exact prompt used during training, with a 4-batch generation: face, ears, hair, pupils, eyes, glasses, eyebrows, nose from front, closed mouth, art by artist name. Same image each time, regardless of seed. Now the same prompt, but I remove glasses. Now I add "from side". Now I say just "art by artist name". I added braid to the negative prompt; it didn't go away. Then I added black hair instead of hair; no black hair. And finally green hair, smug smile, pupils, eyes, eyebrows, from side.
I am tired, so I don't have any huge revelation about how tagging works, but it seems like the closer the prompt is to the tags you used, the more features will be overridden by the hypernetwork. Granted, this is a special case of a single image used for training, but I think this applies to all networks. So tagging even with obscure tags could be good; personally I thought it made sense just to tell the network which features it should learn from the image.
-
Another tutorial: https://civitai.com/models/4086/luisap-tutorial-hypernetwork-monkeypatch-method
-
I wrote down what I was working on for the past month or two: It would be nice if someone else gave it a shot and maybe got a better result with those settings, so I can finally stop trying to make it perfect.
-
How should I read the hypernetwork's log files? I used the extension to complete the training, ran the command tensorboard --logdir, and opened the link, but it shows that there is no data.
-
Hello,
-
The negative text preview during training appears to have been fixed a few patches ago; carry on.
tl;dr
Prep:
Training:
Learning rate: 5e-5:100, 5e-6:1500, 5e-7:10000, 5e-8:20000
Prompt template: a .txt with only [filewords] in it.
Longer explanation:
Select good images, quality over quantity
My best trained model was done using 21 images. Keep in mind that hypernetwork style transfer is highly dependent on content. If you pick an artist that only does cityscapes and then ask the AI to generate a character with his style, it might not give the results you expect. The hypernetwork intercepts the words used during training, so if there are no words describing characters, it doesn't know what to do. It might work, might not.
Train in 512x512, anything else can add distortion
I've tested other resolutions several times and haven't gotten good results out of them yet. So it's up to you.
Use BLIP and/or deepbooru to create labels AND examine every label, remove whatever is wrong, and add whatever is missing
It's tedious and might not be necessary; if you see BLIP and deepbooru are working well, you can leave it as is (a quick way to dump all the captions for review is sketched below). Either way, describing the images is important so the hypernetwork knows what it is trying to change to be more like the training image.
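If you want to review the labels in bulk, something like this prints every caption at once, assuming the usual preprocess layout where each image has a same-named .txt caption next to it (adjust the folder path for your setup):

```python
from pathlib import Path

# Hypothetical dataset folder; point this at your preprocessed training images.
dataset = Path("training/my-style-512")

for caption_file in sorted(dataset.glob("*.txt")):
    caption = caption_file.read_text(encoding="utf-8").strip()
    print(f"{caption_file.stem}: {caption}")
```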
Learning Rate:
5e-5:100, 5e-6:1500, 5e-7:10000, 5e-8:20000
They added a training scheduler a couple days ago. I've seen people recommending training fast and this and that. Well, this kind of does that. This schedule is quite safe to use. I haven't had a single model go bad yet at these rates and if you let it go to 20000 it captures the finer details of the art/style.
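For reference, the schedule string is a list of rate:step pairs, where each rate is used until its step count is reached. A rough sketch of how such a string can be read, written for illustration rather than taken from the webui's scheduler:

```python
def parse_schedule(text):
    """Parse 'rate:step, rate:step, ...' into (rate, last_step) pairs."""
    pairs = []
    for chunk in text.split(","):
        rate, step = chunk.strip().split(":")
        pairs.append((float(rate), int(step)))
    return pairs

def rate_at(step, pairs):
    """Return the learning rate in effect at a given global step."""
    for rate, last_step in pairs:
        if step <= last_step:
            return rate
    return pairs[-1][0]  # past the last milestone, keep the final rate

schedule = parse_schedule("5e-5:100, 5e-6:1500, 5e-7:10000, 5e-8:20000")
print(rate_at(50, schedule))    # 5e-05
print(rate_at(5000, schedule))  # 5e-07
```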
Prompt Template: a .txt with only [filewords] in it.
If your BLIP/booru labels are correct, this is all you need. You might want to use the regular hypernetwork .txt file if you want to remove photo/art/etc. bias from the model you're using. Up to you.
Steps: 20000 or less should be enough.
I'd say it's usable in the 5000-10000 range with my learning rate schedule up there. Buuut you will notice that in the 10000-20000 range, a lot of the finer details will show up. So, as The Rock would say, put in the work, put in the hours.
Final notes after The Rock intermission.
Examples:
Trained NAI for 6500 steps on Andreas Rocha style. I plan on letting it train to 20000 later. And done.
Vanilla NAI
RTX On, I mean, Style on
20000 steps
Vanilla NAI
Rocha ON
20000 Steps