Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

About stable-diffusion-2-1 #14

Closed
Cococyh opened this issue Feb 6, 2023 · 14 comments
Closed

About stable-diffusion-2-1 #14

Cococyh opened this issue Feb 6, 2023 · 14 comments

Comments

@Cococyh
Copy link

Cococyh commented Feb 6, 2023

Hi,

I apply your method on stable-diffusion-1.5, it works.
But when I load the pre-trained model stable-diffusion-2-1, I meet this error, it seems like the pipeline is not work.

Seed: 0
0%| | 0/50 [00:01<?, ?it/s]
Traceback (most recent call last):
File "/home/ubuntu/12T_1/szs/code/stable_diffusion/stable-diffusion-webui/Attend-and-Excite-diffusers/run.py", line 90, in
main()
File "/home/ubuntu/12T_1/szs/code/stable_diffusion/stable-diffusion-webui/venv/lib/python3.10/site-packages/pyrallis/argparsing.py", line 158, in wrapper_inner
response = fn(cfg, *args, **kwargs)
File "/home/ubuntu/12T_1/szs/code/stable_diffusion/stable-diffusion-webui/Attend-and-Excite-diffusers/run.py", line 73, in main
image = run_on_prompt(prompt=config.prompt,
File "/home/ubuntu/12T_1/szs/code/stable_diffusion/stable-diffusion-webui/Attend-and-Excite-diffusers/run.py", line 44, in run_on_prompt
outputs = model(prompt=prompt,
File "/home/ubuntu/12T_1/szs/code/stable_diffusion/stable-diffusion-webui/venv/lib/python3.10/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
return func(*args, **kwargs)
File "/home/ubuntu/12T_1/szs/code/stable_diffusion/stable-diffusion-webui/Attend-and-Excite-diffusers/pipeline_attend_and_excite.py", line 506, in call
max_attention_per_index = self._aggregate_and_get_max_attention_per_token(
File "/home/ubuntu/12T_1/szs/code/stable_diffusion/stable-diffusion-webui/Attend-and-Excite-diffusers/pipeline_attend_and_excite.py", line 224, in _aggregate_and_get_max_attention_per_token
attention_maps = aggregate_attention(
File "/home/ubuntu/12T_1/szs/code/stable_diffusion/stable-diffusion-webui/Attend-and-Excite-diffusers/utils/ptp_utils.py", line 232, in aggregate_attention
out = torch.cat(out, dim=0)
RuntimeError: torch.cat(): expected a non-empty list of Tensors

@AttendAndExcite
Copy link
Collaborator

Hi @Cococyh, thanks for your interest!
looks like the attention maps are not being saved by the controller.
we have a pending pull request to upgrade the version of diffusers that may help solve your issue.
if the issue persists after the PR is merged, we will try to help investigating this issue :)

@Cococyh
Copy link
Author

Cococyh commented Feb 7, 2023

Hi @AttendAndExcite, thanks for your reply!
I update diffusers to 0.12.1, and I find out I was using the stabilityai/stable-diffusion-2-1-inpainting model, when I replace it to stabilityai/stable-diffusion-2-1, it works well! :)
When I set (width, height) as (256, 256), it still works, and the result picture is 256*256, but set (width, height) as (768, 768), I meet this error, it seems like a little different from the previously error.

Traceback (most recent call last):
File "/home/ubuntu/12T_1/szs/code/stable_diffusion/stable-diffusion-webui/Attend-and-Excite-diffusers/run.py", line 90, in
main()
File "/home/ubuntu/12T_1/szs/code/stable_diffusion/stable-diffusion-webui/venv/lib/python3.10/site-packages/pyrallis/argparsing.py", line 158, in wrapper_inner
response = fn(cfg, *args, **kwargs)
File "/home/ubuntu/12T_1/szs/code/stable_diffusion/stable-diffusion-webui/Attend-and-Excite-diffusers/run.py", line 73, in main
image = run_on_prompt(prompt=config.prompt,
File "/home/ubuntu/12T_1/szs/code/stable_diffusion/stable-diffusion-webui/Attend-and-Excite-diffusers/run.py", line 44, in run_on_prompt
outputs = model(prompt=prompt,
File "/home/ubuntu/12T_1/szs/code/stable_diffusion/stable-diffusion-webui/venv/lib/python3.10/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
return func(*args, **kwargs)
File "/home/ubuntu/12T_1/szs/code/stable_diffusion/stable-diffusion-webui/Attend-and-Excite-diffusers/pipeline_attend_and_excite.py", line 506, in call
max_attention_per_index = self._aggregate_and_get_max_attention_per_token(
File "/home/ubuntu/12T_1/szs/code/stable_diffusion/stable-diffusion-webui/Attend-and-Excite-diffusers/pipeline_attend_and_excite.py", line 224, in _aggregate_and_get_max_attention_per_token
attention_maps = aggregate_attention(
File "/home/ubuntu/12T_1/szs/code/stable_diffusion/stable-diffusion-webui/Attend-and-Excite-diffusers/utils/ptp_utils.py", line 232, in aggregate_attention
out = torch.cat(out, dim=0)
RuntimeError: torch.cat(): expected a non-empty list of Tensors

@XavierXiao
Copy link

XavierXiao commented Feb 7, 2023

I think this is because when you have 768-sized image, there is not an attention dimension 16, you need to adjust the attention dimension (to 24 I guess) correspondingly.

The forked codes is runnable for SDv2, but it seems like it is not effective. I found that the iterative refinement will always reach the max iter and the loss does not change much. I am trying to investigate what's wrong.

@AttendAndExcite
Copy link
Collaborator

Hi @XavierXiao, and @Cococyh.
@Cococyh, as mentioned by @XavierXiao, if your model does not have 16x16 attention maps you would need to adjust to work on the most semantic attention resolution. I’d recommend visualizing all the attention resolutions separately and determining which gives the maps that correspond to the generated objects best.

@XavierXiao, which prompts did you work on? It could be an issue of expressiveness of Stable Diffusion, have you tried the prompts from the paper?

@XavierXiao
Copy link

XavierXiao commented Feb 7, 2023

I am on a smaller GPU so I cannot run 768x768 generation. So I try stable diffusion 2-base which is a 512*512 model, with prompt A frog and a pink bench which is in the paper. But I found the gradient to be very small so the iterative refinement will always reach the max _iter, and the sample will look very similar to the original SD's sample.

Since it is still a 512 model, I use the default hyper parameters.

Here is a screenshot. As you can see, I print out the grad norm and it is small. The loss does not decrease much.
Screen Shot 2023-02-06 at 10 11 33 PM

@XavierXiao
Copy link

And the samples with (top) and without (bottom) A&E.

Maybe some configs need to be changed? You can have a try with the given SD version and prompt.

A frog and a pink bench_with ae
A frog and a pink bench

@AttendAndExcite
Copy link
Collaborator

AttendAndExcite commented Feb 7, 2023

Thanks for the additional information, @XavierXiao!
We will do our best to investigate the issue as well but in the meantime, here are some tips that may be useful in case you’d like to explore a solution:

  1. Since this is a newly trained model, I’d take a look at the visualizations of the attention maps at different resolutions (to make sure the 16x16 maps are indeed still the most semantic ones).
  2. I see that the model uses a different text encoder. We perform an additional Softmax operation to discard the <sot> token, it could be that this is not necessary with the ViT-H text encoder (not sure but the attention visualization should answer this question).
  3. Related to the previous point- the attention normalization should result in generated subject tokens obtaining a maximal attention value close to 1, if this is not the case for successful generations, it could indicate that the attention normalization does not fit the model.
  4. This is less likely, but it could be that you need to tune the hyperparameters for this case separately (you can try increasing the scale factor).

@Cococyh
Copy link
Author

Cococyh commented Feb 7, 2023

@AttendAndExcite @XavierXiao, As your say, I set attention dimension to 24 then it looks like I get a nice result, the result picture is 768*768.
The prompt is "a cat and a dog".

WX20230207-153134@2x

a cat and a dog

And this is the SD version result:

a cat and a dog_no_attent

@XavierXiao
Copy link

@Cococyh Can you try a more complicated example like "a frog and a pink bench"? As I say above the loss hardly decrease.

@Cococyh
Copy link
Author

Cococyh commented Feb 7, 2023

@AttendAndExcite ,when use sd2.1-768, the token word such as frog's max attention is too small, but use sd1.5, the number is close to 1. Then I change the scale_factor to 100, it's not effective.

@XavierXiao
image

a frog and a pink bench

@AttendAndExcite
Copy link
Collaborator

@XavierXiao @Cococyh the loss appears to be quite high for the cat and the dog too- this indicates that it may be an issue of the normalization of the probabilities. We will do our best to look into this ASAP :)

@XavierXiao
Copy link

XavierXiao commented Feb 7, 2023

I found that by changing this line to attention_for_text = attention_maps[:, :, 1:8] improves the convergence a lot on SD2.1. This is just a stupid way to hard-code, ideally the 8 should be len(non_padded_token_ids). This makes sense because out of 77 tokens, there is a sot, a eot and the remaining ones are all padded, which should not be included in the computation. I think this should be done in SD v1 as well.

@AttendAndExcite
Copy link
Collaborator

AttendAndExcite commented Feb 7, 2023

Thanks @XavierXiao, this actually makes a lot of sense, as I mentioned above, since the model uses a different text encoder the attention values may vary. It is entirely possible that the attention value of <eot> is high for SD 2.1 therefore removing it and normalizing without it helps the attention values be closer to 1 and then the optimization is easier.
This was not the case for SD 1.4, so it didn’t matter.
I’m leaving this issue open and we will modify the code accordingly once we officially add support for SD 2.1

@AttendAndExcite
Copy link
Collaborator

Hi @Cococyh, and @XavierXiao, thanks for the discussion! Our code now officially supports SD 2.1 via the sd_2_1 parameter (see our README and generate_images.ipynb notebook for details).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants