Cannot get satisfying reconstruction results. #61

Closed
liuquande opened this issue Sep 8, 2022 · 16 comments

Comments

@liuquande

Dear authors,

Thanks for the amazing work!

I am trying to learn an embedding from the following three figures (the middle one is repeated twice by the code), but the results are not good enough.

[image]
And here are the scaled_gs samples at 6000 iterations:
[image]

I use 'toy' as the initial word, and my training command is (no change in the config file):
python main.py --base configs/latent-diffusion/txt2img-1p4B-finetune.yaml -t --actual_resume ./models/ldm/text2img-large/model.ckpt -n penruin_toy --gpus 0, --data_root ./img/test/small/ --init_word 'toy'

Could you please give me some ideas on how to improve the results?

Btw, if I would like to try textual inversion with Stable Diffusion (SD), how should I do it? May I directly load their released model in this codebase and switch the config file to the SD one, like below:
python main.py --base configs/stable-diffusion/v1-finetune.yaml -t --actual_resume ./models/ldm/stable-diffusion/sd-v1-3.ckpt -n penruin_toy --gpus 0, --data_root ./img/test/small/ --init_word toy

Many thanks for the help.

@CodeExplode

Did the training stop automatically? You can try increasing max_steps: 6100 on the last line of v1-finetune.yaml to train further, and also try earlier checkpoints to see if they give better results in case you overtrained.
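For reference, a rough sketch of what that edit looks like at the bottom of the finetune yaml (the surrounding keys in your copy of the config may differ):

lightning:
  ...
  trainer:
    max_steps: 12000  # raised from the default 6100 to train for longer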

@liuquande
Author

Yes, the training stopped automatically.

I have checked the results at earlier iterations, from 2000 to 6000, but they are not very good; see below:
scaled_gs_3000: [image]
scaled_gs_4000: [image]
scaled_gs_5000: [image]
scaled_gs_6000: [image]

I'm not sure if the model was trained enough, so I will increase the training steps and report the results once I have them.

Thanks!

@hopibel

hopibel commented Sep 9, 2022

"penguin" would probably have been a better init_word, though I'm not sure how much it matters

@rinongal
Owner

rinongal commented Sep 9, 2022

Hi,

It looks like your results are still on a positive improvement trajectory, so you could probably increase the number of iterations as @CodeExplode suggested.

Other options:

  1. Slightly increase the LR, or increase the gradient accumulation steps if you're on a single GPU with low batch sizes (which will also lead to a higher effective LR).
  2. Try a different initial seed.

If you want to force the shape, increase the num_vectors_per_token to something like 10. This result will be much less editable, however. You'll have to overwhelm it with more complex prompts at inference.
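For reference, num_vectors_per_token is set in the personalization_config section of the finetune yaml; roughly like this (a sketch based on the default config, so the exact nesting and values in your copy may differ):

model:
  params:
    personalization_config:
      target: ldm.modules.embedding_manager.EmbeddingManager
      params:
        num_vectors_per_token: 10  # default is 1; more vectors reconstruct better but are less editable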

The SD command is indeed the one you posted.

@CodeExplode

When I increased the gradient accumulation steps, it led to odd behaviour where the test image generations and checkpoint saves were all done at once in batches (e.g. with a gradient accumulation step of 4, instead of every 500 iterations, the previews and checkpoints would be generated at 2000 iterations, 4 times in a row). I did it by adding accumulate_grad_batches: 4 to the very last line of the yaml settings file, indented to match the max_steps setting.
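For anyone else trying this, the end of the trainer section then looks roughly like this (a sketch; only the accumulate_grad_batches line is new, and the other keys in your config may differ):

  trainer:
    max_steps: 6100
    accumulate_grad_batches: 4  # added: accumulate gradients over 4 batches per optimizer step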

That being said, I also got the best results I've seen yet when doing that, so it seemed worth doing, as long as I could find a way to generate checkpoints on a proper schedule again. I wasn't aware that adjusting the LR might do the same thing, so maybe that's an option?

@CodeExplode

CodeExplode commented Sep 9, 2022

@liuquande you can also edit ldm/data/personalized.py, where there's a list called imagenet_templates_small.

The prompts in there are used to generate test images, where {} is used in place of the token you're learning (so 'a photo of a {}', for example, would ideally generate photos like the ones you've provided).

Changing those to describe other parts of the training images can help, e.g. 'a close-up photo of a {} on a curved white desk beside a ruler, in front of a green and white horizontally striped wall with a shelf' (though that one might be too long). That way the model will tend to generate test images with most of the scene filled in by the prompt, and can focus on solving what it needs to generate for {} to get closer to your provided images. Some people have also found that just clearing the list and leaving a single '{}' template works well. See the sketch below for what the edit could look like.
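A rough sketch of the edit (the real imagenet_templates_small list in the repo is much longer, and the template wording here is just illustrative):

# ldm/data/personalized.py
# Replace the default templates with prompts describing the non-subject
# parts of the training photos; {} is where the placeholder token goes.
imagenet_templates_small = [
    'a photo of a {}',
    'a close-up photo of a {} on a curved white desk beside a ruler',
    'a photo of a {} in front of a green and white striped wall',
]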

The mirrored training image with the scarf end on the opposite side will probably only confuse things. Maybe use an inpainting tool to quickly remove it on one side.

@1blackbar

do 40 vectors my man, good luck

@liuquande
Author

liuquande commented Sep 13, 2022

Thank you all for the reply.

@1blackbar Hi bro, yeah, using more vectors per token works well; maybe it's hard to reconstruct the toy with just one vector.

One more thing I would like to ask: I find that when using the learned embeddings, the stable diffusion inference script only generates the content of the embedding and ignores the other information in my prompt.

For example, I find the learned embedding correctly reconstructs the given training data.
Reconstructed image:
[image]
But when I use the prompt "A photo of * driving a motorbike" with the stable_txt2img.py script
(python scripts/stable_txt2img.py --n_samples 8 --n_iter 2 --scale 10.0 --ddim_steps 200 --embedding_path ./logs/lixiaolong2022-09-09T17-56-03_lixiaolong_2vectors/checkpoints/embeddings_gs-999.pt --ckpt ./models/ldm/stable-diffusion/sd-v1-3.ckpt --prompt "A photo of * driving a motorbike"),
here is the output:
[image]
Nothing special compared with the original embedding, and my prompt is totally ignored (could the reason be overfitting?).

I noticed that your generated images in #35 are pretty good with the personalized prompt.

Could you please give me some advice on how to solve this problem? Many thanks.

@liuquande
Author

Do I need to use this repo to generate with the learned embeddings, as nicolai suggested?
Very happy to learn from you!

[image]

@oppie85

oppie85 commented Sep 13, 2022

@liuquande - what you're experiencing is a form of overfitting where the training process has found vectors that perfectly recreate your training images but are so strong that any other prompt information is pretty much discarded.

A simplified way of thinking about it is that you can equate each vector to a word in a prompt - with 40 vectors you're basically asking SD "which 40-word prompt leads to my image?", which is then 'compressed' into a single token. Even if the ultimate prompt is a painting of *, it equates to something like a painting of <insert 40 words that describe your image>. These words may include the embedding for a black and white photo of, which already overrides your desired style. Of course, each new learned embedding doesn't necessarily equate to an existing word, which complicates the issue further (otherwise we could just find out which word/vector overrode the style and remove it); every single vector may or may not include a tiny bit of style information, and all of them put together completely overwhelm everything else.

One way to counter this phenomenon is to counter-overwhelm the overfitting by repeatedly reinforcing the style you want; for example, something along the lines of a painting of *, in the style of a painting. A painting painted by a painter who makes paintings (no joke, I've used prompts like this successfully) can steer the prompt back to the style you want, but it's hit or miss.

It should theoretically be possible to train an embedding with just the information you want (and many people in the community have been doing many experiments to get to this point, some with mixed success), but right now there's no universally accepted solution.

@CodeExplode

Something else which can help is putting the high-vector embedding token later in the prompt, since tokens closer to the start have higher weight. Though if your prompt is already particularly long, some of the vectors from the embedding might start getting cut off by the 77-token limit.

@liuquande
Author

Hi @oppie85 and @CodeExplode ,

Sure, using more vectors will lead to overfitting of the learned embedding, which becomes so strong that it only memorizes the content of the training data.

But as @1blackbar showed in #35 , very good personalized results (shown below) were generated using a large number of vectors in the token.
So I am very curious how we can achieve that.

[image]

Looking forward to any suggestions and help!

@CodeExplode

CodeExplode commented Sep 13, 2022

Yeah, I've trained on a huge dataset with a high vector count, which introduces a lot of noise and corruption in the images, and solved the overfitting issue by using image2image in small steps from a reference starting point, masking so the embedding term is only used when working in the masked area, and using regular prompts outside of it. It's proven to be a fantastic workflow, because you basically need to use image2image masking and small steps anyway to work around all the general oddities of SD like extra limbs.

@liuquande
Author

Many thanks for the suggestion! @CodeExplode

It seems that the textual inversion repo does not provide an image2image script; may I ask which repo you used for img2img and masked generation?

@CodeExplode

AUTOMATIC1111's web UI is handy for masking and a bunch of other features: https://github.com/AUTOMATIC1111/stable-diffusion-webui

This script, created today, also works in that UI and has proven pretty amazing with overtrained embeddings: https://github.com/ThereforeGames/txt2img2img

@liuquande
Author

@CodeExplode

Nice, I will try txt2img2img first!

And thanks for sharing this information; I have joined the community-research channel you shared to learn more.

rinongal closed this as completed on Oct 6, 2022