Got weird results, not sure if I missed a step? #35
Comments
Hey! The most likely candidate is just that our SD version isn't officially released yet because it's not behaving well under new prompts :) It's placing too much weight on the new embedding, and too little on other words. We're still trying to work that out, but it wasn't as simple a port from LDM as we hoped. If this is the issue, you can try to work around it by repeating the parts of the prompt that it ignores, for example by adding "In the style of da vinci" again at the end of the prompt. With that said, if you want to send me your images, I'll try training a model and seeing if I can get it to behave better. |
Thank you! I'll try putting more weight on the later keywords! |
One thing I found that helps and/or fixes this scenario is using periods in your prompts rather than commas as in the original SD repo; this may or may not be a bug. A comma-separated prompt becomes the same prompt with the commas replaced by periods. If you trained on one token, you could possibly add weight to it as well, for example by repeating it in the prompt. |
If you're using the web UI (i.e. this repo: https://github.com/hlky/stable-diffusion-webui ), you can assign weights to specific tokens like so:
I frequently have to do this with the finetuned object, sometimes using astronomical values like 1000+. This can greatly improve likeness. You may also need to adjust classifier guidance and denoise strength. All of these parameters do impact each other, and changing one often means needing to re-calibrate the rest. Anyhow, you can try applying strength to the part of the prompt that SD is ignoring. Something like this:
|
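For illustration only (the exact weighting syntax varies between webui builds, and the token name and numbers below are made up), a prompt with per-token weights along these lines might look something like:

```
portrait of mytoken:1000 with long hair and glasses. detailed painting by da vinci:25
```

The large weight keeps the finetuned token recognisable, while the extra weight on the style terms pushes back against it, which is the kind of balancing act described above.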
I'm one of the maintainers in charge of the frontend part; I will definitely try this! Thank you |
I've found limited success in "diluting" the new token by making the prompt more vague - for example "a painting of *" results in pretty much the same image as just "*" on its own, but "a painting of a man who looks exactly like *" does (sometimes) work in successfully applying a different style. Adding weights to the tokens as others have described also works, although it requires constant tweaking. I don't know if it would be technically possible to test for style transfer during the training/validation phase; for example, on top of the 'preset' prompts that are used on the photos in the dataset, you would have a separate list of prompts like "A painting of *" that would be used to verify that an image generated with that prompt would also score high on the 'painting' token. In the DreamBooth paper, they describe that they combatted overfitting (which I guess is causing these issues) by also training 'negatively' - something which I've tried to rudimentarily replicate by including prompts without the "*" in the list of predefined ones, but I don't think this would actually do anything since the mechanisms for DreamBooth and Textual Inversion are very different. |
If you guys have a single instance of successfully finetuning a photo likeness of a human being into SD with this code, please share; I've yet to see that, and I'm almost sure that this code is not meant to "inject" your own face into the SD model as people might think. |
I won't be sharing my model at this time, but I can tell you that this method is indeed capable of pulling off a convincing headswap under the right conditions:
Hope that helps. |
Well... I already heard that; it's not saying much without a comparison of the actual photo and the SD output. Even the paper doesn't have results with human subjects. Some people claimed to do it, but then I looked at the pics and the SD output was not the person in the training data images. Vaguely, yes, it was the same skin colour and a similar haircut, but the proportions of the face against the nose and lips were all mixed up from result to result. |
Here's what you can try to verify that textual inversion can create a convincing likeness. I've fiddled a lot with all kinds of parameters and have gotten results that are all over the place; with the 256x256 method I can iterate pretty quickly, but the end result is always overfitting. For example, most of the photos I used were in an outdoor setting, and textual inversion thus inferred that being outdoors was such a key feature that it would try to replicate the same outdoor setting in every generation. I think that's largely where the problem lies: apart from the initial embedding from the 'initializer word', there's no way to 'steer' training towards a particular subject, even with a conditioning prompt. |
The two things that had the most success for me are:
I'm almost positive that the reason for overfitting in SD is that the conditioning scheme is far too aggressive. Simply letting the model condition itself on the single init word alone is sufficient in my opinion, and has always led to better results for me. What's funny is that you're staying close to Stable Diffusion's ethos of heavy prompting, because conditioning this way means you have to come up with the correct prompt during inference, rather than let the conditioned templates do the work. Even if you have low confidence in this method, I say it's most certainly worth looking into. I'm also certain that PTI integration will mitigate a lot of these issues (it's a very cool method for inversion if you haven't looked into it). |
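As a rough sketch of that "condition on the init word alone" idea (the file and variable names are assumed from this repo's ldm/data/personalized.py and may differ in your checkout), the change amounts to shrinking the training template list:

```python
# Minimal sketch: instead of the full CLIP-style template set used for
# conditioning during training, use (almost) nothing but the placeholder.
# Names below are assumptions, not the repo's exact code.
minimal_templates = [
    "{}",               # the placeholder/init word alone
    "a photo of a {}",  # optionally one neutral template
]
# then have the dataset in ldm/data/personalized.py draw its captions from
# minimal_templates instead of the long imagenet_templates_small list
```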
I'd say 2 photos is actually not enough for training a likeness; I use around 10-20 pictures for my experiments. For the 256x256 method it works best to mix in a few extreme closeups of the face so that the AI can learn the finer details. I don't actually know if starting on 256x256 and then resuming at 512x512 is possible - I think it should be, though, because that's how SD was trained in the first place. For init words, I don't think "photo" is very good - I'm using "man" and "face" for that purpose - because those are the things that I want the AI to learn. Nevertheless, 1500 iterations isn't very much. I usually get the best results at around 3000. |
~~Well, this certainly is an interesting discovery. So this could theoretically prove that you need to fine tune on the base resolution Stable Diffusion was trained on, and not the upscaled res (512). Either way this shouldn't have caused the issues people have been having at the higher resolution, so I wonder why this is? I'll have to read through the paper again to figure it out.~~ Edit: Tested this and figured I'm wrong here. It simply allows for better inversion, which the model is fully capable of. The real issue is adding prompts to the embeddings, which is still WIP. |
|
@altryne is it because of the 50 vectors that I used, or because of the 256 res drop? Which one is more responsible for this? |
So does anyone here know how to properly work with this? This is a [50, 768] tensor. All embeddings I've seen before are [1, 768]. Are you supposed to insert all 50 into the prompt, taking up 50 of the available 75 tokens? All the code that I've seen fails to actually use this embedding, including this repository, failing with this error:
I manually inserted those 50 embeddings into the prompt in order, and I am getting pictures of Stallone, but they all seem very same-y, which to me looks like overfitting; I don't know if it's that or me working with those embeddings incorrectly. |
You also have to update num_vectors_per_token in v1-inference.yaml to the same value you trained with. With 50 vectors per token, extreme overfitting is to be expected; I'm currently trying to find the right balance between the very accurate likeness of many vectors and the more varied results of fewer vectors. The codebase also contains an idea of 'progressive words' where new vectors get added as training progresses, which might be interesting to explore. Oh, also: a .pt file trained with 256x256 images only really works well with the ddim sampler; given enough vectors it'll look "acceptable" with k_lms, but if you want the same quality you got with the training samples, use ddim. Another thing I've been experimenting with is different initializer words per vector - for example, I set num_vectors_per_token to 4 and then pass "face", "eyes", "nose", "mouth" as the initializer words in the hopes that each of the vectors will focus on that one specific part of the likeness. So far I'm not sure if I'd call it a success, but at this point I'm just throwing random ideas at it. |
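If it helps with the shape confusion above, here is a minimal sketch for inspecting a trained embedding file; the key names are assumed from this repo's EmbeddingManager save format, so check them against your own .pt:

```python
import torch

# Load a trained embedding checkpoint and report how many vectors (and thus
# prompt tokens) each placeholder occupies. The key name "string_to_param" is
# an assumption; fall back to the raw dict if your file is laid out differently.
ckpt = torch.load("embeddings.pt", map_location="cpu")
params = ckpt.get("string_to_param", ckpt) if isinstance(ckpt, dict) else ckpt
for placeholder, emb in params.items():
    # e.g. '*' -> (50, 768): 50 vectors of dimension 768, eating 50 prompt tokens
    print(placeholder, tuple(emb.shape))
```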
I'm currently quick-testing whether I can still edit a style when using 5 vectors. Are the cloned heads the result of 256 training? Can I resume training and change it to 512, or will it start over from 0 after I change to 512? |
I overwhelmed the overfitting with the prompt. From what I see, if you use 50 vectors, you've just wasted 50 words. |
Try playing with prompt weights in the webui? |
This is really interesting... I would like to ask, how do you resume training? I have been looking around as to how to do that and can't find the answer. An example would be appreciated. EDIT: Found my answer here: #38 |
@1blackbar how did you resolve the overfitting? |
@1blackbar - looks great! Can you share what method you used to achieve this? |
@hopibel Probably hit the nail on the head. Huggingface uses more gradient accumulation steps, which means you're working with a larger effective batch size and are less likely to fall into minima like overfitting the background of a specific image with your tokens. |
Where can I change the steps in this repository? |
Set accumulate_grad_batches in the training yaml. |
Has anybody had any luck setting accumulate_grad_batches higher? With a value of, say, 4, I ran into issues like testing being delayed until 4 times as many iterations had passed, and then 4 rounds of testing being done in a row. |
I did; it helps with identity a lot. Why it was removed from the yamls, I don't know. I'm still testing; 850 iters are showing very good identity. |
OK, heads up: identity might be OK, but stylisation is crap. I think this overfits way faster, even with low iterations. Not sure if there's a way to get editability with batches of 4; maybe combined with a much slower learning rate, I don't know. |
@1blackbar If you didn't change any other settings, you basically quadrupled the learning rate due to --scale_lr |
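For anyone wondering where the 4x comes from: this is roughly the scaling the LDM-style main.py applies when --scale_lr is active (formula from memory, values hypothetical, so verify against your own copy):

```python
# Effective learning rate under --scale_lr, sketched with this thread's numbers.
base_lr = 5.0e-04            # base_learning_rate from the training yaml
accumulate_grad_batches = 4  # raised from 1 to 4
ngpu = 1                     # number of GPUs
batch_size = 1               # per-GPU batch size

effective_lr = accumulate_grad_batches * ngpu * batch_size * base_lr
print(f"{effective_lr:.1e}")  # 2.0e-03, i.e. 4x the base rate
```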
Is this not an issue with huggingface? And I assume you changed the learning rate here to 5.0e-04 as well, yeah? Sounds like there's still something different about their config. |
From what I see, their learning rate changes over time, and the accumulation changes too. |
@1blackbar Where are you seeing their learning rate changes with time? They appear to be setting the LR scheduler to constant mode by default. Accumulation steps is basically saying: "I can't fit the full batch size in the GPU, so instead of doing a batch of 4 images, I'll do 4 batches of 1 image and accumulate the results", hence why it takes more "iterations". I wouldn't expect this to cause you more overfitting, other than any adjustments it makes to your LR. |
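As a side note, here is a toy PyTorch sketch of that accumulation pattern (not this repo's training loop), just to show why 4 accumulation steps of batch size 1 behave like a single batch of 4 rather than extra training:

```python
import torch

model = torch.nn.Linear(8, 1)
opt = torch.optim.SGD(model.parameters(), lr=1e-3)
accum = 4  # accumulate_grad_batches

opt.zero_grad()
for step, x in enumerate(torch.randn(8, 8).split(1)):  # 8 "batches" of 1 sample
    loss = model(x).pow(2).mean() / accum  # scale so the summed grads average out
    loss.backward()                        # gradients accumulate in .grad
    if (step + 1) % accum == 0:
        opt.step()                         # one real update per `accum` micro-batches
        opt.zero_grad()
```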
Hi all, I wrote a new script that effectively circumvents overfitting from Textual Inversion: https://github.com/ThereforeGames/txt2img2img Combine it with prompt weighting for the best results. Would love to know your thoughts. Thanks. |
@ThereforeGames Would be interesting to see how this compares to prompt2prompt, which replaces concepts mid-generation |
@hopibel Agreed! I haven't had a chance to play with prompt2prompt yet, but I have a feeling there's probably a way to integrate it with txt2img2img for even better results. prompt2prompt seems amazing for general subject replacement, but I'm wondering how it fares with "ridiculously overtrained" TI checkpoints. |
Currently I have the best results by inpainting a heavily overfit face onto the stylised result (img2img inpaint in the webui). Automating that would be interesting, but I feel that having more manual control over the result is just better, unless it can force styles into the overfit embedding so they don't all look like an inpainted photo-likeness on a cartoon version. |
I can confirm that the above txt2img2img approach works better than anything I've tried such as inpainting, having played around with it for a while on discord. It can do style, pose, and background changes on an embedding which otherwise would always overwhelm those prompts, and makes it very easy. It has an autoconfigure setting which I turned off and which didn't work well in one attempt, but the author has mentioned it being very powerful and so the script might be even better than what I've seen so far, which is already great. |
Nice. About to test it with my highly over-fitted embeddings which take ~10 mins to produce ;) |
p.s. We've been talking non-stop about textual inversion in the #community-research channel on the stable diffusion discord for days now, if anybody wants to join in. The script author gave me some tips getting it working which might be worth checking out if you have trouble. |
@hopibel I wrapped my head around Automatic's implementation of prompt2prompt - you can use it with txt2img2img now in the form of a custom prompt template. So far I haven't figured out a way to use prompt2prompt that yields better results than my default prompt template. It often does a better job with background detail, and perhaps with editability, but likeness seems to suffer a bit. Feel free to play around with it and let me know if you find a formula that works! The prompt templates are like a primitive scripting language, so you can do a lot with them - check the docs for more info |
I was going to make a separate issue about this, but Cross Attention Control and prompt2prompt are the solutions for the overfitting / editability of prompts. In my testing, I've had extremely good results (I primarily use the Dreambooth implementation with my custom script, but textual inversion works too). What happens is that the newly trained word is often prioritized over the rest of the prompt, so if you have a Mustang 2024 trained, for instance, you can generate with a generic word first and swap the trained token in partway through. It's the same idea as txt2img2img, but without the img2img process. |
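For a concrete picture of the swap, a prompt along these lines might look like the following, assuming the bracketed prompt-editing syntax from AUTOMATIC1111's webui (the trained token name and the switch point are made up):

```
a photo of a [sports car:mymustang:0.2] parked on a beach at sunset
```

The generic phrase anchors the composition for the first 20% of the steps, and the trained token takes over afterwards, in the same 0.1-0.3 ratio range discussed below.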
Sick, that actually works quite well. I need to perform more tests but so far it's more or less on par with my script - sometimes even better. |
Okay, now that I've had more time to play around with prompt2prompt, I can say that it generally yields "higher quality" pictures, but the likeness isn't always as good as txt2img2img. Here's an example where I could not get a Sheik-looking Sheik out of prompt2prompt: Versus txt2img2img: In the first one, the facial features and expression aren't right. I played around with the ratio from 0.1 to 0.3, but couldn't get it looking much better. Tried CFG scales from 7 to 15. It seems that likeness goes down as prompt complexity goes up. It might help if we could tune the CFG and prompt ratios automatically over the course of inference, but I'm not sure how to go about doing that. txt2img2img has the advantage of being able to look at the result of txt2img before processing img2img. Would love to figure out a way to combine the high level of detail and speed of prompt2prompt with the consistency of txt2img2img! |
Hey @rinongal thank you so much for this amazing repo.
I trained with over 10K steps I believe, and around 7 images. (Trained on my face)
Using this colab
I then used those .pt files when running the SD version right in the colab, and a weird thing happens: when I mention
*
in my prompts, I get results that look identical to the photos in style, but it does try to ... draw the objects. For example:
![CleanShot 2022-08-29 at 14 04 48@2x](https://user-images.githubusercontent.com/463317/187288405-2990e489-eec3-4576-999e-16d039962d5a.jpg)
Prompt was
portrait of joe biden with long hair and glasses eating a burger, detailed painting by da vinci
and
portrait of * with long hair and glasses eating a burger, detailed painting by da vinci
So SD added the glasses and the eating pose, but completely disregarded the detailed painting, the da vinci, and the style.
What could be causing this? Any idea? 🙏