Cannot get satisfying reconstruction results. #61
Comments
Did the training stop automatically? You can try increasing max_steps (currently 6100) on the last line of v1-finetune.yaml to train further, and also try earlier checkpoints to see whether they give better results in case you overtrained.
"penguin" would probably have been a better init_word, though I'm not sure how much it matters |
Hi, it looks like your results are still on a positive improvement trajectory, so you could probably increase the number of iterations as @CodeExplode suggested. Another option:
If you want to force the shape, increase num_vectors_per_token to something like 10. The result will be much less editable, however, so you'll have to overwhelm it with more complex prompts at inference. The SD command is indeed the one you posted.
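For reference, a minimal sketch of where num_vectors_per_token sits in the finetuning config, assuming the layout of v1-finetune.yaml in this repo (surrounding keys omitted; exact names may differ in your copy):

```yaml
# configs/stable-diffusion/v1-finetune.yaml (sketch, not the full file)
model:
  params:
    personalization_config:
      target: ldm.modules.embedding_manager.EmbeddingManager
      params:
        placeholder_strings: ["*"]
        num_vectors_per_token: 10   # default is 1; more vectors fit the subject better but are less editable
```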
When I increased the gradient accumulation steps, it led to odd behaviour where the test image generations and checkpoint saves were all done at once in batches (e.g. with a gradient accumulation step of 4, instead of every 500 iterations, the previews and checkpoints would be generated at 2000 iterations, 4 times in a row). I did it by adding accumulate_grad_batches: 4 to the very last line of the yaml settings file, indented to match the max_steps setting. That being said, I also got the best results I've yet seen when doing that, so it seemed worth doing, as long as I can find a way to generate checkpoints on a proper schedule again. I wasn't aware that adjusting the LR might do the same thing, so maybe that's an option?
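For clarity, both of the settings mentioned so far live in the lightning trainer block at the end of the yaml; a rough sketch of what that edited block might look like (the values are illustrative, not recommendations):

```yaml
# end of configs/stable-diffusion/v1-finetune.yaml (sketch)
lightning:
  trainer:
    benchmark: True
    max_steps: 6100             # raise to train for more iterations
    accumulate_grad_batches: 4  # optional; note the batched preview/checkpoint behaviour described above
```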
@liuquande you can also edit ldm/data/personalized.py, where there's a list called imagenet_templates_small. The prompts in there are used to generate test images, where {} is used in place of the token which you're generating (so 'a photo of a {}', for example, would ideally generate photos like the ones you've provided). Changing those to describe other parts of the training images can help, e.g. 'a close-up photo of a {} on a curved white desk beside a ruler, in front of a green and white horizontally striped wall with a shelf' (though that one might be too long). That way it will tend to generate test images with most of the scene filled in by the prompt, and can then focus on what it needs to generate for {} to get closer to your provided images. Some people have also found that just clearing the list and having a single '{}' template works well. The mirrored training image with the scarf end on the opposite side will probably only confuse things; maybe use an inpainting tool to quickly remove it on one side.
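For illustration, a sketch of what a customized ldm/data/personalized.py might contain; the replacement prompts below are made-up examples following the suggestion above, not the repo's defaults:

```python
# ldm/data/personalized.py (sketch) -- the prompt templates used during finetuning,
# where {} stands in for the placeholder token being learned.
# The repo ships a long list of generic templates along the lines of
# 'a photo of a {}', 'a rendering of a {}', etc.; replacing them with prompts
# that describe everything in the training images *except* the object itself
# lets training focus on what {} should look like.
imagenet_templates_small = [
    'a close-up photo of a {} on a curved white desk beside a ruler',
    'a photo of a {} in front of a green and white striped wall with a shelf',
    '{}',  # some people report that a single bare template also works well
]
```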
do 40 vectors my man, good luck |
Thank you all for the reply. @1blackbar Hi bro, yeah, using more vectors per token works well; maybe it's hard to reconstruct the toy with just one vector. One more thing I would like to ask: I find that with the learned embeddings, the stable diffusion inference script only generates the content of the embeddings and ignores the other information in my prompt. For example, I noticed that your generated images in #35 are pretty good with the personalized prompt. Could you please give me some advice on how to solve this problem? Many thanks.
@liuquande - what you're experiencing is a form of overfitting where the training process has found vectors that perfectly recreate your training images but are so strong that any other prompt information is pretty much discarded. A simplified way of thinking about it is that you can equate each vector to a word in a prompt - with 40 vectors you're basically asking SD "which 40-word prompt leads to my image?", which is then 'compressed' into a single token. Even if the rest of your prompt describes something else, those 40 'words' will tend to drown it out. One way to counter this phenomenon is to counter-overwhelm the overfitting by repeatedly reinforcing the style you want in the rest of the prompt. It should theoretically be possible to train an embedding with just the information you want (and many people in the community have been doing experiments to get to this point, some with mixed success), but right now there's no universally accepted solution.
Something else which can help is putting the high-vector embedding token later in the prompt, since tokens closer to the start have higher weight. Though if your prompt is particularly long already, some of the vectors from the embedding might start getting cut off by the 77-token limit.
Hi @oppie85 and @CodeExplode, sure, using a larger number of vectors will lead to overfitting of the learned embeddings, which become so strong that they only memorize the content of the training data. But as @1blackbar introduced in #35, very good personalized results (shown below) were generated using a large number of vectors in the token. Looking forward to any suggestions and help!
Yeah, I've trained on a huge dataset with a high vector count, which introduces a lot of noise and corruption in the images, and solved the overfitting issue by using image2image in small steps from a reference starting point, masking so that the embedding term is only used inside the masked area, and using regular prompts outside of it. It's proven to be a fantastic workflow, because you basically need to use image2image, masking and small steps to work around all the general oddities of SD like extra limbs anyway.
Many thanks for the suggestion, @CodeExplode! It seems that the textual inversion repo does not provide an image2image script; may I ask which repo you used for img2img and masked generation?
AUTOMATIC1111's web UI is handy for masking and a bunch of other features: https://github.com/AUTOMATIC1111/stable-diffusion-webui
This script created today also works in that UI and has proven pretty amazing with overtrained embeddings: https://github.com/ThereforeGames/txt2img2img
Nice, I will try txt2img2img first! And thanks for sharing this information; I have joined the community-research channel you shared to learn more.
Dear authors,
Thanks for the amazing work!
I am trying to learn the embeddings from the following three figures (with the middle one repeated twice by the code), but the results are not good enough.
And here are the scaled_gs samples at 6000 iterations:
I use 'toy' as the initial word, and my training command is (no change in the config file):
python main.py --base configs/latent-diffusion/txt2img-1p4B-finetune.yaml -t --actual_resume ./models/ldm/text2img-large/model.ckpt -n penruin_toy --gpus 0, --data_root ./img/test/small/ --init_word 'toy'
Could you please give me some idea on how to improve the results?
Btw, if I would like to try textual inversion with Stable Diffusion (SD), how should I do it? May I directly load their released model in this codebase and replace the config file with the SD one, like below:
python main.py --base configs/stable-diffusion/v1-finetune.yaml -t --actual_resume ./models/ldm/stable-diffusion/sd-v1-3.ckpt -n penruin_toy --gpus 0, --data_root ./img/test/small/ --init_word toy
Many thanks for the help.