
Got weird results, not sure if I missed a step? #35

Open
altryne opened this issue Aug 29, 2022 · 82 comments

@altryne

altryne commented Aug 29, 2022

Hey @rinongal thank you so much for this amazing repo.

I trained for over 10K steps, I believe, with around 7 images (trained on my face).
Using this colab

I then used those .pt files to run the SD version right in the colab, and a weird thing happens: when I mention * in my prompts, I get results that look identical to the photos in style, but it does try to ... draw the objects.

For example :
CleanShot 2022-08-29 at 14 04 48@2x

Prompt was portrait of joe biden with long hair and glasses eating a burger, detailed painting by da vinci
and
portrait of * with long hair and glasses eating a burger, detailed painting by da vinci

So SD added the glasses and the eating pose, but completely disregarded the detailed painting, da vinci, and the style.

What could be causing this? Any idea? 🙏

@rinongal
Owner

Hey!

The most likely candidate is just that our SD version isn't officially released yet because it's not behaving well under new prompts :) It's placing too much weight on the new embedding, and too little on other words. We're still trying to work that out, but it wasn't as simple a port from LDM as we hoped. If this is the issue, you can try to work around it by repeating the parts of the prompt that it ignores, for example by adding "In the style of da vinci" again at the end of the prompt.

With that said, if you want to send me your images, I'll try training a model and seeing if I can get it to behave better.

@altryne
Author

altryne commented Aug 30, 2022

Thank you! I'll try putting more weight on the later keywords!
Don't think my images are anything special or important to test with; I just took a few snapshots of myself and cropped them to 512x512.

@ExponentialML

One thing I found that helps and/or fixes this scenario is adding periods to your prompts, not commas like in the original SD repo. This may or may not be a bug.

So this: portrait of * with long hair and glasses eating a burger, detailed painting by da vinci

Should become this: portrait of * with long hair and glasses eating a burger. detailed painting by da vinci.

If you trained on one token, you could possibly add weight by doing something like portrait of * * ...rest as well, but you'll drift further away from the rest of your prompt.

@ThereforeGames

ThereforeGames commented Aug 30, 2022

If you're using the web UI (i.e. this repo: https://github.com/hlky/stable-diffusion-webui ), you can assign weight to certain tokens like so:

A photo of *:100 smiling.

I frequently have to do this with the finetuned object, sometimes using astronomical values like 1000+. This can greatly improve likeness. You may also need to adjust classifier guidance and denoise strength. All of these parameters do impact each other, and changing one often means needing to re-calibrate the rest.

Anyhow, you can try applying strength to the part of the prompt that SD is ignoring. Something like this:

portrait of * with long hair and glasses eating a burger, detailed painting:10 by da vinci:10

@altryne
Author

altryne commented Aug 30, 2022

If you're using the web UI

I'm one of the maintainers in charge of the frontend part, but TBH I haven't yet added my own checkpoints to the webui! Will do that tomorrow.

I will def try this! Thank you

@oppie85

oppie85 commented Aug 30, 2022

I've found limited success in "diluting" the new token by making the prompt more vague - for example, "a painting of *" results in pretty much the same image as just "*" on its own, but "a painting of a man who looks exactly like *" does (sometimes) succeed in applying a different style. Adding weights to the tokens as others have described also works, although it requires constant tweaking.

I don't know if it would be technically possible to test for style transfer during the training/validation phase; for example, on top of the 'preset' prompts that are used on the photos in the dataset, you would have a separate list of prompts like "A painting of *" that would be used to verify that an image generated with that prompt also scores high on the 'painting' token. In the DreamBooth paper, they describe that they combatted overfitting (which I guess is causing these issues) by also training 'negatively' - something I've tried to rudimentarily replicate by including prompts without the "*" in the list of predefined ones, but I don't think this actually does anything, since the mechanisms behind DreamBooth and Textual Inversion are very different.
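A rough sketch of what that style-transfer check could look like, using CLIP from huggingface transformers to score a validation sample against the style word. Nothing in this repo currently does this, and the model name, filename and prompts below are just examples:

    from PIL import Image
    from transformers import CLIPModel, CLIPProcessor

    # Score a sample generated from "A painting of *" against "a painting" vs "a photo";
    # a low "painting" probability would flag that the token has overfit to photo style.
    model = CLIPModel.from_pretrained("openai/clip-vit-large-patch14")
    processor = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14")

    image = Image.open("sample_from_a_painting_of_star.png")  # hypothetical validation sample
    inputs = processor(text=["a painting", "a photo"], images=image, return_tensors="pt", padding=True)
    probs = model(**inputs).logits_per_image.softmax(dim=-1)
    print(probs)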

@1blackbar

If you have a single instance of successfully finetuning a photo likeness of a human being into SD with this code, please share it. I've yet to see one, and I'm almost sure this code is not meant to "inject" your own face into the SD model the way people might think.

@ThereforeGames

ThereforeGames commented Sep 2, 2022

If you have a single instance of successfully finetuning a photo likeness of a human being into SD with this code, please share it

I won't be sharing my model at this time, but I can tell you that this method is indeed capable of pulling off a convincing headswap under the right conditions:

  • With photorealistic subjects (people), I have had better results when providing the model ~10 images and training longer than suggested (25-40k iterations). This could be a fluke - I'm sure the authors of the research paper know what they're talking about when they say 5 is the optimal number of images - but I'm not convinced it's always 5.
  • txt2img often produces mediocre and "samey" results with finetuned checkpoints. Try img2img instead. You'll get more variety in facial expression and surprisingly higher fidelity in the face itself. Using photos as your img2img input works better than using simple drawings or other kinds of illustrations. Denoise strength should be between 0.4 and 0.75, depending on how large the face is in your input image (larger face = go for higher denoise strength).
  • Play around a lot with CFG and prompt weights. You can crank CFG into the 10-20 range to improve likeness at the cost of potentially introducing visual artifacts (which can be counteracted to some extent by increasing inference steps). Likewise, you can apply more weight to your finetuned object by writing *:10 or *:100 etc. in your prompt.
  • The k_euler_a sampling method seems to be the best for photorealistic people.
  • For best results, take an image from SD and throw it into a traditional faceswap solution like SimSwap or sber-swap.

Hope that helps.

@1blackbar

1blackbar commented Sep 2, 2022

Well... I already heard that; it's not saying much without a comparison of the actual photo and the SD output. Even the paper doesn't have results with human subjects. Some people claimed to do it, but then I looked at the pics and the SD output was not the person that's in the training data images. Vaguely, yes, it was the same skin colour and a similar haircut, but the proportions of the face against the nose and lips... all mixed up from result to result.
So I stand by what I wrote: this method so far is not capable of finetuning a human likeness and synthesizing it in SD, until proven otherwise.
I don't mind training for a long time, I just want to know if I'll be wasting my time and blocking the GPU for nothing if I'll never be able to get at least 90% likeness.
Almost all, if not all, results I've seen look like derivatives/mutations of the subjects and not like the actual subject.
Identity loss is one of the biggest issues in face synthesis and restoration. Few have managed to solve it.
I trained 3-4 subjects with about 30k iterations each; the results were not successful (well, it did "learn" them, but they looked like mutations of the subjects), besides one bigger success with training a style. So for now I'd wait until I see someone pushing finetuning and proving it can be done and that you can synthesize a finetuned face that looks like the one in the original images.

@oppie85

oppie85 commented Sep 2, 2022

Here's what you can try to verify that textual inversion can create a convincing likeness:
First of all, train at 256x256 pixels with larger batch sizes; depending on your GPU you can easily train 4x as fast, so you'll see results sooner. The downside is that only the ddim sampler really works with the final result, but I feel that's an acceptable tradeoff if your main goal is just to check whether it's even possible. Also bump up num_vectors_per_token a bit; if you're not worried about overfitting you can even push it to ridiculous levels like 256 (edit: I've since learned that putting this higher than 77 is useless, because SD has a limit of 77 tokens per input). The result is that you'll get a convincing likeness much quicker, but it'll never deviate much from the original photos, and style transfer may be impossible.

I've fiddled a lot with all kinds of parameters and have gotten results that are all over the place; with the 256x256 method I can iterate pretty quickly but the end result is always overfitting. For example, most of the photos I used were in an outdoors setting and textual inversion thus inferred that being outdoors was such a key feature that it'd try to replicate the same outdoor settings for every generation. I thought that maybe adding A * man outdoors (and variations) would help in separating the location from the token, but I feel that it only reinforces it because now generated images that are in an outdoors setting score even higher on matching the prompt.

I think that's largely where the problem lies; apart from the initial embedding from the 'initializer word', there's no way to 'steer' training towards a particular subject. When using a conditioning prompt like A * man outdoors with a red shirt the conditioning algorithm doesn't know that it can disregard the "red shirt" part and that it should focus on the magic * that makes the difference between the encoding of a regular man and myself.
I don't know if it would be possible to basically train on two captions for each image; for example, we apply a * man outdoors in a red shirt and a man outdoors in a red shirt (without the *) and then take only the difference in the encoding instead of the entire thing.
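A minimal sketch of that two-caption idea, assuming model is the loaded LatentDiffusion model as in stable_txt2img.py. The subtraction is purely illustrative - the current training code does nothing like this:

    # Encode the caption with and without the placeholder and keep only the
    # difference, i.e. what "*" adds on top of the shared scene description.
    with_token = model.get_learned_conditioning(["a * man outdoors in a red shirt"])
    without_token = model.get_learned_conditioning(["a man outdoors in a red shirt"])
    identity_delta = with_token - without_token  # the contribution of "*" alone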

@ExponentialML

The two things that had the most success for me are:

  1. Replace the template string with a single {} (a quick sketch follows this list)
  2. Make sure you're using the sd-v1-4-full-ema.ckpt
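A rough sketch of what point 1 looks like in practice. The training captions come from a template list in ldm/data/personalized.py that gets .format()-ed with the placeholder token; exact variable names may differ between versions:

    import random

    # Cutting the caption templates down to a bare "{}" means the model only ever
    # sees the placeholder itself during training, instead of "a photo of a {}" etc.
    training_templates = [
        '{}',
    ]

    caption = random.choice(training_templates).format('*')  # -> '*'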

I'm almost positive that the reason for overfitting in SD is that the conditioning scheme is far too aggressive. Simply letting the model condition itself on the single init word alone is sufficient in my opinion, and has always led to better results for me.

What's funny is that you're staying close to Stable Diffusion's ethos of heavy prompting, because conditioning this way means you have to come up with the correct prompt during inference, rather than letting the conditioned templates do the work.

Even if you have low confidence in this method, I say it's most certainly worth looking into. I'm also certain that PTI integration will mitigate a lot of these issues (it's a very cool method for inversion if you haven't looked into it).

@1blackbar

1blackbar commented Sep 2, 2022

Well, I just fed it 2 pics of Stallone, and I'm closer than I've ever been with any face after 1500 iters, but that's at 256 size and 50 vectors, with two init words - face, photo.
So I have a plan: once it reaches the likeness of the reconstruction images, I will feed it 512 images. Can I swap the size like that when continuing finetuning, from 256 to 512?
samples_scaled_gs-002000_e-000010_b-000000_00000021

But I must say the reconstruction at 256 res is not looking too good though - it lost the likeness a bit; this one looks better at 512 res.
The image on the bottom is a reconstruction, not an actual sample - it's how the model interpreted the original image, and it trains from this:

reconstruction_gs-001000_e-000005_b-000000_00000008

@oppie85

oppie85 commented Sep 2, 2022

I'd say 2 photos is actually not enough for training a likeness; I use around 10-20 pictures for my experiments. For the 256x256 method it works best to mix in a few extreme closeups of the face so that the AI can learn the finer details. I don't actually know if starting at 256x256 and then resuming at 512x512 is possible - I think it should be, though, because that's how SD was trained in the first place. For init words, I don't think "photo" is very good - I'm using "man" and "face" for that purpose - because those are the things that I want the AI to learn. Nevertheless, 1500 iterations isn't very much. I usually get the best results at around 3000.

@1blackbar

1blackbar commented Sep 2, 2022

Yes, I'll try that. It's also strange that I can't have a batch size of 2 with 11GB of VRAM at 256 res.
Does batch size affect the training? I think if it sees more images at once it learns better? If that's the case I'd try it on Colab Pro.
I also tried "man face", but I wanted it to know that it's a photo version, a photo style, so maybe with that it could be edited more easily with styles.
I have a tight close up of the face (jaw to chin) so I can show it the likeness better at that res now. I noticed in SD you lose likeness in a medium shot, but in a macro close up you get the best likeness of a person.
Well... I'm quite impressed now - barely started and that's the result at epoch 4 and 1500 iters. How many epochs do you recommend?
Sorry to hijack like this, but I'm sure more people will come, so I think this could be useful for them to read.
samples_scaled_gs-001500_e-000003_b-000300_00000016

OK, so far from what I see... you should have mostly macro face close ups to get the best identity - no ears visible besides one image like that Stallone pic above; the rest should be very tight close ups of the face, probably even tighter than this one below.

s3

I'll try to resume and give it an even tighter one, or start over with only tight macro shots of the face, because I'm training mostly the face and 256 is a bit low.

Wow this is pretty good, way above my expectations
samples_scaled_gs-002000_e-000005_b-000000_00000021
Oh crap, this side shot looks too good; I wonder how editability will work.
samples_scaled_gs-004000_e-000010_b-000000_00000041
OK... I think that proves it: you can actually train a human face and retain identity. This result is beyond what I expected, and it has barely started finetuning.
samples_scaled_gs-002500_e-000001_b-001000_00000026
OK, so if anyone wants to get good results - drop the resolution to 256 at the bottom of the yaml file:
train:
  target: ldm.data.personalized.PersonalizedBase
  params:
    size: 256

Also use the init word "face", then actually give it face shots, not head shots. I got most images like this one; maybe two of the whole head, but the majority is framed from eyebrows to lower lip.
q8
OK, final result at 11k iterations - I almost fell off my chair when I saw this. Most of my images were from hairline to jawline, 2 images of the full head, 10 images overall.
samples_scaled_gs-006500_e-000004_b-000500_00000066

@ExponentialML

ExponentialML commented Sep 2, 2022

~~Well, this certainly is an interesting discovery.~~

~~So this could theoretically prove that you need to fine tune on the base resolution Stable Diffusion was trained on, and not the upscaled res (512). Either way this shouldn't have caused the issues people have been having at the higher resolution, so I wonder why this is? I'll have to read through the paper again to figure it out.~~

Edit: Tested this and figured I'm wrong here. It simply allows for better inversion, which the model is fully capable of. The real issue is adding prompts to the embeddings, which is still WIP.

@altryne
Author

altryne commented Sep 2, 2022

drop the resolution to 256 at the bottom of the yaml file
Should the provided training images also be resized to 256?

@1blackbar

1blackbar commented Sep 2, 2022

@altryne Is it because of the 50 vectors I used, or because of the 256 res drop? Which one is more responsible for this?
I restarted tuning. I had it at 1 vector; now, compared to 50 vectors, I'd say the vector count makes the biggest difference. But what's the downside of using so many vectors? What's the most sane amount I can use and still get reasonable editability?
You can pretty much see from the first 3 samples that you will get likeness. Now I'm trying 20 vectors.

@AUTOMATIC1111

So does anyone here know how to properly work with this? This is a [50, 768] tensor. All embeddings I've seen before are [1, 768]. Are you supposed to insert all 50 into the prompt, taking up 50 of the available 75 tokens? All the code I've seen fails to actually use this embedding, including this repository, failing with this error:

Traceback (most recent call last):
  File "stable_txt2img.py", line 287, in <module>
    main()
  File "stable_txt2img.py", line 241, in main
    uc = model.get_learned_conditioning(batch_size * [""])
  File "B:\src\stable_diffusion\textual_inversion\ldm\models\diffusion\ddpm.py", line 594, in get_learned_conditioning
    c = self.cond_stage_model.encode(c, embedding_manager=self.embedding_manager)
  File "B:\src\stable_diffusion\textual_inversion\ldm\modules\encoders\modules.py", line 324, in encode
    return self(text, **kwargs)
  File "B:\soft\Python38\lib\site-packages\torch\nn\modules\module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "B:\src\stable_diffusion\textual_inversion\ldm\modules\encoders\modules.py", line 319, in forward
    z = self.transformer(input_ids=tokens, **kwargs)
  File "B:\soft\Python38\lib\site-packages\torch\nn\modules\module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "B:\src\stable_diffusion\textual_inversion\ldm\modules\encoders\modules.py", line 297, in transformer_forward
    return self.text_model(
  File "B:\soft\Python38\lib\site-packages\torch\nn\modules\module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "B:\src\stable_diffusion\textual_inversion\ldm\modules\encoders\modules.py", line 258, in text_encoder_forward
    hidden_states = self.embeddings(input_ids=input_ids, position_ids=position_ids, embedding_manager=embedding_manager)
  File "B:\soft\Python38\lib\site-packages\torch\nn\modules\module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "B:\src\stable_diffusion\textual_inversion\ldm\modules\encoders\modules.py", line 183, in embedding_forward
    inputs_embeds = embedding_manager(input_ids, inputs_embeds)
  File "B:\soft\Python38\lib\site-packages\torch\nn\modules\module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "B:\src\stable_diffusion\textual_inversion\ldm\modules\embedding_manager.py", line 101, in forward
    embedded_text[placeholder_idx] = placeholder_embedding
RuntimeError: shape mismatch: value tensor of shape [50, 768] cannot be broadcast to indexing result of shape [0, 768]

I manually inserted those 50 embeddings into the prompt in order, and I am getting pictures of Stallone, but they all seem very same-y, which to me looks like overfitting - but I don't know if it's that or me working with those embeddings incorrectly.

Here are 9 pics all with different seeds:
grid-0000-3166629621

@oppie85

oppie85 commented Sep 2, 2022

You also have to update num_vectors_per_token in v1-inference.yaml to the same value you trained with.
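If you're not sure what an embedding was trained with, here's a quick sketch for inspecting the .pt file - the "string_to_param" key is what the EmbeddingManager in this repo saves, if I'm reading it right, and the file name is just an example:

    import torch

    # Print the shape of each learned embedding so num_vectors_per_token in
    # v1-inference.yaml can be set to match.
    data = torch.load("embeddings.pt", map_location="cpu")
    for placeholder, param in data["string_to_param"].items():
        print(placeholder, tuple(param.shape))  # e.g. ('*', (50, 768)) -> 50 vectors per token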

With 50 vectors per token, extreme overfitting is to be expected; I'm currently trying to find the right balance between the very accurate likeness with many vectors and the more varied results of fewer vectors. The codebase also contains an idea of 'progressive words' where new vectors get added as training progresses which might be interesting to explore.

Oh, also: a .pt file trained with 256x256 images only really works well with the ddim sampler; given enough vectors it'll look "acceptable" with k_lms, but if you want the same quality you got with the training samples, use ddim.

Another thing I've been experimenting with is different initializer words per vector - for example, I set num_vectors_per_token to 4 and then pass "face", "eyes", "nose", "mouth" as the initializer words, in the hope that each vector will focus on that one specific part of the likeness. So far I'm not sure I'd call it a success, but at this point I'm just throwing every random idea I get at it.

@AUTOMATIC1111

AUTOMATIC1111 commented Sep 2, 2022

Ah. That did the trick, thank you. If anyone cares, here are 9 images produced by this repo's code with the Stallone embedding:

res

DDIM. Previous pic I posted was using euler ancestral from k-diffusion.

I used just * as prompt in both cases.

@1blackbar

1blackbar commented Sep 2, 2022

You also have to update num_vectors_per_token in v1-inference.yaml to the same value you trained with.

With 50 vectors per token, extreme overfitting is to be expected; I'm currently trying to find the right balance between the very accurate likeness with many vectors and the more varied results of fewer vectors. The codebase also contains an idea of 'progressive words' where new vectors get added as training progresses which might be interesting to explore.

Oh, also: a .pt file trained with 256x256 images only really works well with the ddim sampler; given enough vectors it'll look "acceptable" with k_lms, but if you want the same quality you got with the training samples, use ddim.

Another thing I've been experimenting with is different initializer words per vector - for example, I set num_vectors_per_token to 4 and then pass "face", "eyes", "nose", "mouth" as the initializer words, in the hope that each vector will focus on that one specific part of the likeness. So far I'm not sure I'd call it a success, but at this point I'm just throwing every random idea I get at it.

I'm currently quick-testing whether I can still edit a style when using 5 vectors. Are the cloned heads the result of 256 training? Can I resume training and change it to 512, or will it start over from 0 after I change to 512?
Also, did spreading 4 vectors over 4 init words help? Maybe I made a mistake by using "face, photo" as init words and it pushed him deep into photo territory; I will try a vague "male".
OK, with 5 vectors I managed to rip the person out of the photo style into an anime style, but it's very hard - it needs more repetitions of the words "anime style" than usual. So I'd say 5 is already too much for editability, but with 5 the likeness is poor... so that's that.
I think with all 77 vectors you will get great likeness right away, but there won't be any room left for editability.
I'll try training for a short time with the highest vector count, then I'll try spreading the init words while using high vectors.
I will also try another method: using more precise init words like lips, cheeks, nose, nostrils, eyes, eyelids, chin, jawline, whatever I can find, plus high vectors - maybe it will spread into the details more and leave the style up to editing.

@1blackbar

1blackbar commented Sep 3, 2022

Overwhelmed the overfitting with the prompt. From what I see, if you use 50 vectors you've basically spent 50 words of your prompt on your subject being a photograph of a man, so you have about 27 left to skew it into a painting or a drawing? So you have to overwhelm it hard to change the style.
Or it might be that you have to use over 50 words to overwhelm it - there's definitely a ratio, because I can overwhelm low-vector results faster. This is 50 vectors:

00796-3972612447-ilya_repin!!_oil_painting,centered_macro_head_shot_of__character__as_soldier_character_by_ilya_repin,_painting,_hires,detail

00793-2276069365-ilya_repin!!_oil_painting,macro_head_shot_of__character__as_soldier_character_by_ilya_repin,_painting,hires,detailed__ilya

@altryne
Author

altryne commented Sep 3, 2022

Try playing with prompt weights in the webui?

@1blackbar

1blackbar commented Sep 3, 2022

Started over: 2 vectors, 256 res. It's at epoch 36 and 48k iters - will it be more editable than 50 vectors? We will see. I don't like the mirroring thing - how do I turn it off? His face is not identical when flipped.
samples_scaled_gs-047000_e-000036_b-000200_00000471
OK, after testing for editability, the 50-vector one is the better way: it takes about the same amount of overwhelming to edit the style of the 50-vector embedding as it does the 2-vector one, but it takes about an hour to train 50 vectors and about 8 hours to train 2 vectors to a satisfying identity of the subject on a 1080 Ti.
Training at 512 on an 11GB 1080 Ti is a waste of time - go with 256 res. Maybe it's a VRAM thing and a batch size thing, but you won't get likeness with 512, not in one day anyway.
I guess overfitting is just a thing we have to live with for now; identity preservation is way more important IMO.
00926-3513507886-classic_oil_painting_by_ilya_repin!!!_detailed_oil_painting__of_character_sly__as_rambo_in_the_stye_of__ilya_repin,intricate

@bmaltais

bmaltais commented Sep 4, 2022

This is really interesting... I would like to ask: how do you resume training? I have been looking around for how to do that and can't find the answer. An example would be appreciated.

EDIT: Found my answer here: #38

@1blackbar

1blackbar commented Sep 4, 2022

Got around the overfitting - that's not an issue anymore. Go with as many vectors as you like to speed up training. Got a new subject to train on; style change is not an issue at all, it adapts even to cartoon styles. Res 448, will do 512 later on. You can also control the emotions of the face to make it smile.
00103-1816228968-image_of_blee

00229-2336580884-image_of__blee
00345-1400431675-image_of__blee
00462-3494633885-johnny_blee
00253-3852576016-image_of__blee
00300-3887535150-image_of__blee
00167-189900867-image_of_smiling_happy_blee

@dboshardy

@1blackbar how did you resolve the overfitting?

@oppie85

oppie85 commented Sep 4, 2022

@1blackbar - looks great! Can you share what method you used to achieve this?

@rinongal
Owner

@hopibel probably hit the nail on the head. Huggingface uses more gradient accumulation steps, which means you're working with a larger batch size and are less likely to fall into minima like overfitting your tokens to the background of a specific image.

@1blackbar

Where can I change the steps in this repository?

@hopibel

hopibel commented Sep 12, 2022

Set accumulate_grad_batches at the end of the yaml config, right next to max_steps

@CodeExplode

Has anybody had any luck setting accumulate_grad_batches higher? With a value of, say, 4 I ran into issues like testing being delayed until 4 times as many iterations had passed, and then 4 rounds of testing being run in a row.

@1blackbar

I did. It helps with identity a lot - why it was removed from the yamls, I don't know. I'm still testing; 850 iters is already showing very good identity.
It takes about 4 times longer though:
benchmark: True
accumulate_grad_batches: 4
max_steps: 100000

@1blackbar

1blackbar commented Sep 13, 2022

OK, heads up: identity might be OK but stylisation is poor. I think this overfits way faster, even at low iterations. Not sure if there's a way to get editability with batches of 4 - maybe combined with a much slower learning rate, I don't know.
The thing is, I type in "frazetta painting of subject" and it changes to a painting, but it's in no way a Frazetta.

@hopibel

hopibel commented Sep 13, 2022

@1blackbar If you didn't change any other settings, you basically quadrupled the learning rate due to --scale_lr
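Roughly what main.py does when scale_lr is on - the numbers below are placeholders, check base_learning_rate in your own yaml:

    # The effective learning rate is scaled by accumulation steps, GPU count and
    # batch size, so accumulate_grad_batches: 4 quadruples it unless the base
    # rate is lowered to compensate.
    base_lr = 5.0e-03
    ngpu, batch_size, accumulate_grad_batches = 1, 1, 4

    effective_lr = accumulate_grad_batches * ngpu * batch_size * base_lr
    print(f"effective lr = {effective_lr:.2e}")  # 4x the base rate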

@ThereforeGames

OK, heads up: identity might be OK but stylisation is poor. I think this overfits way faster, even at low iterations. Not sure if there's a way to get editability with batches of 4 - maybe combined with a much slower learning rate, I don't know. The thing is, I type in "frazetta painting of subject" and it changes to a painting, but it's in no way a Frazetta.

Is this not an issue with huggingface? And I assume you changed the learning rate here to 5.0e-04 as well, yeah?

Sounds like there's still something different about their config.

@1blackbar

1blackbar commented Sep 13, 2022

From what I see, their learning rate changes over time, and the accumulation changes too.
Also, for some reason 3 female subjects finetuned fine on the huggingface version, but males not to that same level - go figure.
Also, the embeddings we finetune now work on the other 4GB ckpt files that people are training for themselves, so that's good.

@rinongal
Owner

@1blackbar Where are you seeing that their learning rate changes over time? They appear to set the LR scheduler to constant mode by default.

Accumulation steps is basically saying: "I can't fit the full batch size on the GPU, so instead of doing one batch of 4 images, I'll do 4 batches of 1 image and accumulate the results" - hence why it takes more "iterations". I wouldn't expect this to cause more overfitting, beyond any adjustments it makes to your LR.
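A generic PyTorch illustration of what accumulation does - PyTorch Lightning handles this internally when accumulate_grad_batches is set, so this is not the repo's code, just the concept:

    import torch

    # Gradients from several small batches are summed before a single optimizer
    # step, mimicking a larger batch without the extra GPU memory.
    model = torch.nn.Linear(768, 768)
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
    accum_steps = 4

    optimizer.zero_grad()
    for _ in range(accum_steps):
        x = torch.randn(1, 768)                      # one small batch at a time
        loss = model(x).pow(2).mean() / accum_steps  # scale so the sum matches one big batch
        loss.backward()                              # gradients accumulate in .grad
    optimizer.step()                                 # one update, as if the batch size were 4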

@ThereforeGames

ThereforeGames commented Sep 15, 2022

Hi all,

I wrote a new script that effectively circumvents overfitting from Textual Inversion:

https://github.com/ThereforeGames/txt2img2img

Combine it with prompt weighting for the best results.

Would love to know your thoughts. Thanks.

@hopibel

hopibel commented Sep 15, 2022

@ThereforeGames Would be interesting to see how this compares to prompt2prompt, which replaces concepts mid-generation

@ThereforeGames

@hopibel Agreed! I haven't had a chance to play with prompt2prompt yet, but I have a feeling there's probably a way to integrate it with txt2img2img for even better results.

prompt2prompt seems amazing for general subject replacement, but I'm wondering how it fares with "ridiculously overtrained" TI checkpoints.

@1blackbar

Currently I get the best results by inpainting a heavily overfit face onto a stylised result (img2img inpaint in the webui). Automating that would be interesting, but I feel that having more manual control over the result is just better - unless it can force styles into the overfit embedding so they don't all look like a photo likeness inpainted onto a cartoon version.

@CodeExplode

CodeExplode commented Sep 15, 2022

I can confirm that the above txt2img2img approach works better than anything else I've tried, such as inpainting, having played around with it for a while on Discord. It can do style, pose, and background changes on an embedding which would otherwise always overwhelm those prompts, and it makes it very easy.

It has an autoconfigure setting which I turned off and which didn't work well in one attempt, but the author has mentioned it being very powerful, so the script might be even better than what I've seen so far - which is already great.

@nerdyrodent

Nice. About to test it with my highly over-fitted embeddings which take ~10 mins to produce ;)

@CodeExplode

CodeExplode commented Sep 15, 2022

p.s. We've been talking non-stop about textual inversion in the #community-research channel on the Stable Diffusion Discord for days now, if anybody wants to join in. The script author gave me some tips on getting it working, which might be worth checking out if you have trouble.

https://discord.gg/stablediffusion

@ThereforeGames

ThereforeGames commented Sep 16, 2022

@hopibel I wrapped my head around Automatic's implementation of prompt2prompt - you can use it with txt2img2img now in the form of a custom prompt template.

So far I haven't figured out a way to use prompt2prompt that yields better results than my default prompt template. It often does a better job with background detail, and perhaps with editability, but likeness seems to suffer a bit.

Feel free to play around with it and let me know if you find a formula that works! The prompt templates are like a primitive scripting language, so you can do a lot with them - check the docs for more info.

@ExponentialML

@ThereforeGames Would be interesting to see how this compares to prompt2prompt, which replaces concepts mid-generation

I was going to make a separate issue about this, but Cross Attention Control and prompt2prompt are the solutions for the overfitting / editability of prompts. In my testing, I've had extremely good results (I primarily use the Dreambooth implementation with my custom script, but textual inversion works too).

What happens is that the newly trained word tends to dominate, so you start from the init token and have the trained word replace it at x steps during inference.

So if you have a Mustang 2024 trained, for instance, you could do something like a photo realistic art piece of a [car:*:0.2] driving down the road, high quality, where car is the init, * is the trained token, and the trained word replaces the init at 20% of the inference process (e.g. with 50 sampling steps, the swap happens after step 10). You usually have to scale the percentage up or down with the number of steps you choose.

It's the same idea as txt2img2img, but without the img2img process.

@ThereforeGames

It's the same idea as txt2img2img, but without the img2img process.

Sick, that actually works quite well. I need to perform more tests but so far it's more or less on par with my script - sometimes even better.

@ThereforeGames

ThereforeGames commented Sep 17, 2022

Okay, now that I've had more time to play around with prompt2prompt, I can say that it generally yields "higher quality" pictures, but the likeness isn't always as good as txt2img2img. Here's an example where I could not get a Sheik-looking Sheik out of prompt2prompt:

image

Versus txt2img2img:

image

In the first one, the facial features and expression aren't right. I played around with the ratio from 0.1 to 0.3, but couldn't get it looking much better. Tried CFG scales from 7 to 15. Seems that likeness goes down as prompt complexity goes up.

It might help if we could tune the CFG and prompt ratios automatically over the course of inference, but I'm not sure how to go about doing that. txt2img2img has the advantage of being able to look at the result of txt2img before processing img2img.

Would love to figure out a way to combine the high level of detail and speed from prompt2prompt with the consistency of txt2img2img!

@ThereforeGames

Here's another example - 1st is prompt2prompt and 2nd is txt2img2img:

download - 2022-09-16T223918 191

download - 2022-09-16T223921 583

If anything, the clothes might be better in prompt2prompt... but the face is way off!

@1blackbar

1blackbar commented Sep 17, 2022

prompt2prompt does work in the webui from AUTOMATIC1111; I think it works better with embeddings that were trained with fewer than 60 vectors.

a painting by greg rutkowski of close portrait shot of [sylvester stalone :slyf:0.5] on neon city background , by greg rutkowski

image
image

image
