Cannot get satisfying reconstruction results. #61
Comments
Did the training stop automatically? You can try increasing max_steps (currently 6100) on the last line of v1-finetune.yaml to train further, and also try earlier checkpoints to see whether they give better results in case you overtrained.
"penguin" would probably have been a better init_word, though I'm not sure how much it matters |
Hi, it looks like your results are still on a positive improvement trajectory, so you could probably increase the number of iterations as @CodeExplode suggested. Another option:
If you want to force the shape, increase num_vectors_per_token to something like 10. The result will be much less editable, however, so you'll have to overwhelm it with more complex prompts at inference. The SD command is indeed the one you posted.
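For reference, a minimal sketch of where num_vectors_per_token sits in the finetuning config, assuming the layout of v1-finetune.yaml in this repo (surrounding keys omitted; exact names may differ in your copy):

```yaml
# configs/stable-diffusion/v1-finetune.yaml (sketch, not the full file)
model:
  params:
    personalization_config:
      target: ldm.modules.embedding_manager.EmbeddingManager
      params:
        placeholder_strings: ["*"]
        num_vectors_per_token: 10   # default is 1; more vectors fit the subject better but are less editable
```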
When I increased the gradient accumulation steps, it led to odd behaviour where the test image generations and checkpoint saves were all done at once in batches (e.g. with a gradient accumulation step of 4, instead of every 500 iterations, the previews and checkpoints would be generated at 2000 iterations, 4 times in a row). I did it by adding accumulate_grad_batches: 4 to the very last line of the yaml settings file, indented to match the max_steps setting. That being said, I also got the best results I've yet seen when doing that, so it seemed worth doing, as long as I can find a way to generate checkpoints on a proper schedule again. I wasn't aware that adjusting the LR might do the same thing, so maybe that's an option?
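For clarity, both of the settings mentioned so far live in the lightning trainer block at the end of the yaml; a rough sketch of what that edited block might look like (the values are illustrative, not recommendations):

```yaml
# end of configs/stable-diffusion/v1-finetune.yaml (sketch)
lightning:
  trainer:
    benchmark: True
    max_steps: 6100             # raise to train for more iterations
    accumulate_grad_batches: 4  # optional; note the batched preview/checkpoint behaviour described above
```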
@liuquande you can also edit ldm/data/personalized.py, where there's a list called imagenet_templates_small. The prompts in there are used to generate test images, where {} is used in place of the token which you're generating (so 'a photo of a {}', for example, would ideally generate photos like the ones you've provided). Changing those to describe other parts of the training images can help, e.g. 'a close-up photo of a {} on a curved white desk beside a ruler, in front of a green and white horizontally striped wall with a shelf' (though that one might be too long). That way it will tend to generate test images with most of the scene filled in by the prompt, and can then focus on what it needs to generate for {} to get closer to your provided images. Some people have also found that just clearing the list and having a single '{}' template works well. The mirrored training image with the scarf end on the opposite side will probably only confuse things; maybe use an inpainting tool to quickly remove it on one side.
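For illustration, a sketch of what a customized ldm/data/personalized.py might contain; the replacement prompts below are made-up examples following the suggestion above, not the repo's defaults:

```python
# ldm/data/personalized.py (sketch) -- the prompt templates used during finetuning,
# where {} stands in for the placeholder token being learned.
# The repo ships a long list of generic templates along the lines of
# 'a photo of a {}', 'a rendering of a {}', etc.; replacing them with prompts
# that describe everything in the training images *except* the object itself
# lets training focus on what {} should look like.
imagenet_templates_small = [
    'a close-up photo of a {} on a curved white desk beside a ruler',
    'a photo of a {} in front of a green and white striped wall with a shelf',
    '{}',  # some people report that a single bare template also works well
]
```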
do 40 vectors my man, good luck |
Thank you all for the reply. @1blackbar Hi bro, yeah, using more vectors per token works well; maybe it's hard to reconstruct the toy with just one vector. One more thing I would like to ask: I find that with the learned embeddings, the stable diffusion inference script only generates the content of the embeddings and ignores the other information in my prompt. For example, I noticed that your generated images in #35 are pretty good with the personalized prompt. Could you please give me some advice on how to solve this problem? Many thanks.
@liuquande - what you're experiencing is a form of overfitting where the training process has found vectors that perfectly recreate your training images but are so strong that any other prompt information is pretty much discarded. A simplified way of thinking about it is that you can equate each vector to a word in a prompt - with 40 vectors you're basically asking SD "which 40-word prompt leads to my image?", which is then 'compressed' into a single token. Even if the rest of your prompt describes something else, those 40 'words' will tend to drown it out. One way to counter this phenomenon is to counter-overwhelm the overfitting by repeatedly reinforcing the style you want in the rest of the prompt. It should theoretically be possible to train an embedding with just the information you want (and many people in the community have been doing experiments to get to this point, some with mixed success), but right now there's no universally accepted solution.
Something else which can help is putting the high-vector embedding token later in the prompt, since tokens closer to the start have higher weight. Though if your prompt is particularly long already, some of the vectors from the embedding might start getting cut off by the 77-token limit.
Hi @oppie85 and @CodeExplode, sure, using a larger number of vectors will lead to overfitting of the learned embeddings, which become so strong that they only memorize the content of the training data. But as @1blackbar introduced in #35, very good personalized results (shown below) were generated using a large number of vectors in the token. Looking forward to any suggestions and help!
Yeah, I've trained on a huge dataset with a high vector count, which introduces a lot of noise and corruption in the images, and solved the overfitting issue by using image2image in small steps from a reference starting point, masking so that the embedding term is only used inside the masked area, and using regular prompts outside of it. It's proven to be a fantastic workflow, because you basically need to use image2image, masking and small steps to work around all the general oddities of SD like extra limbs anyway.
Many thanks for the suggestion, @CodeExplode! It seems that the textual inversion repo does not provide an image2image script; may I ask which repo you used for img2img and masked generation?
AUTOMATIC1111's web UI is handy for masking and a bunch of other features: https://github.com/AUTOMATIC1111/stable-diffusion-webui
This script created today also works in that UI and has proven pretty amazing with overtrained embeddings: https://github.com/ThereforeGames/txt2img2img
Nice, I will try txt2img2img first! And thanks for sharing this information; I have joined the community-research channel you shared to learn more.
Dear authors,
Thanks for the amazing work!
I am trying to learn the embeddings from the following three figures (with the middle one repeated twice by the code), but the results are not good enough.
And here are the scaled_gs samples at 6000 iterations:
I use 'toy' as the initial word, and my training command is (no change in the config file):
python main.py --base configs/latent-diffusion/txt2img-1p4B-finetune.yaml -t --actual_resume ./models/ldm/text2img-large/model.ckpt -n penruin_toy --gpus 0, --data_root ./img/test/small/ --init_word 'toy'
Could you please give me some idea on how to improve the results?
Btw, if I would like to try textual inversion with Stable Diffusion (SD), how should I do it? May I directly load their released model in this codebase and replace the config file with the SD one, like below:
python main.py --base configs/stable-diffusion/v1-finetune.yaml -t --actual_resume ./models/ldm/stable-diffusion/sd-v1-3.ckpt -n penruin_toy --gpus 0, --data_root ./img/test/small/ --init_word toy
Many thanks for the help.