Support different-length pos/neg prompts for FLUX.1-schnell variants like Chroma #11120
base: main
Conversation
Thanks for your PR! Before we get to reviewing it, could you please provide some side-by-side results with Schnell and Chroma on the same inputs (including the seeds)?
There's some more context in this issue: #11010. P.S.: I tested V13.
Oh, sure thing, see links below for some comparison grids - but some quick notes:
Image Grids:
Also, if you skim the above grids, note that some of the images from Schnell look quite "clean" and coherent, and this is definitely an advantage that Schnell currently has - but in some cases Chroma should arguably win based on the style specified in the prompt ("courtroom sketch"):

Compared to Chroma:

And Chroma with aesthetic keywords added to try to emulate the aesthetic tuning that Schnell has:

You can see that although Schnell's is cleaner (and arguably slightly more coherent, though the sample size is a bit small here), Chroma is definitely more faithful to the style specified in the prompt. Also note that, as with other models that haven't had CFG baked in, you can get entirely different 'vibes' by tweaking Chroma's CFG - above I've used 5, with 20 steps (lodestone is currently doing some small-scale experiments as a precursor to a few-step LoRA for Chroma).

My experience so far with testing Chroma is that it has a lot more "soul" than Schnell and Dev - it's quite fun to play with.
Awesome. I looked at some of the outputs; the Anubis one is amazing, among many others.
Side note: Playing with the official Chroma ComfyUI workflow just now with v15, I noticed that there are some potential differences in quality/coherence compared to my diffusers code which generated the above images - e.g. notice the alignment to the "bored human judge" in this seed=0 image with Chroma, which was less evident in the above examples:

So please take the above example images with a pinch of salt - Chroma quality may be better than what I've shown here. It could be due to quantization, or subtleties around ComfyUI sampling. I'd need to take bigger sample sizes to know whether the ComfyUI outputs are actually better, but I'm going to sleep now :)
Thanks @josephrocca
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
I'm not sure about the conventions in diffusers, but since prompt truncation is equivalent to prompt masking, I wonder whether it'd be worth also/instead supporting masking for Flux? This is working code, inserted here in transformer_flux.py:

```python
if (
    joint_attention_kwargs is not None
    and "encoder_attention_mask" in joint_attention_kwargs
    and joint_attention_kwargs["encoder_attention_mask"] is not None
):
    encoder_attention_mask = joint_attention_kwargs.pop("encoder_attention_mask")
    max_seq_length = encoder_hidden_states.shape[1]
    seq_length = encoder_attention_mask.sum(dim=-1)
    batch_size = encoder_attention_mask.shape[0]

    # Unmask one padding token per prompt, so a single pad position remains attended to.
    encoder_attention_mask_with_padding = encoder_attention_mask.clone()
    for i in range(batch_size):
        current_seq_len = int(seq_length[i].item())
        if current_seq_len < max_seq_length:
            available_padding = max_seq_length - current_seq_len
            tokens_to_unmask = min(1, available_padding)  # unmask one of the padding tokens
            encoder_attention_mask_with_padding[i, current_seq_len : current_seq_len + tokens_to_unmask] = 1

    # Image (latent) tokens are always attended to, so append an all-ones mask for them.
    attention_mask = torch.cat(
        [
            encoder_attention_mask_with_padding,
            torch.ones([hidden_states.shape[0], hidden_states.shape[1]], device=encoder_attention_mask.device),
        ],
        dim=1,
    )

    # Outer product over the joint (text + image) sequence to get a (seq x seq) mask
    # (this folds the batch dimension into one shared mask), then broadcast it over
    # batch and attention heads.
    attention_mask = attention_mask.float().T @ attention_mask.float()
    attention_mask = (
        attention_mask[None, None, ...]
        .repeat(encoder_hidden_states.shape[0], self.config.num_attention_heads, 1, 1)
        .int()
        .bool()
    )
    joint_attention_kwargs["attention_mask"] = attention_mask
```
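For reference, here's a rough sketch (not part of this PR) of how a caller might build the T5 attention mask and route it to the snippet above. The `encoder_attention_mask` key is an assumption introduced by that snippet, and the tokenizer repo here is just an illustrative stand-in for the pipeline's own T5 tokenizer:

```python
from transformers import T5TokenizerFast

# Illustrative tokenizer; in practice the pipeline's own T5 tokenizer (tokenizer_2) would be used.
tokenizer = T5TokenizerFast.from_pretrained("google/t5-v1_1-xxl")

text_inputs = tokenizer(
    "a courtroom sketch of a bored human judge",
    padding="max_length",
    max_length=512,  # the full 512 T5 tokens mentioned in the PR description below
    truncation=True,
    return_tensors="pt",
)
encoder_attention_mask = text_inputs.attention_mask  # 1 = real token, 0 = padding

# Handed to the transformer through the pipeline's joint_attention_kwargs, e.g.:
# pipe(prompt=..., joint_attention_kwargs={"encoder_attention_mask": encoder_attention_mask})
```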
Can you demonstrate an example where zeroing the end of the prompt is equivalent to attention masking, where the softmax scores for the padding sequence are near -infinity?
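(For context, here's a minimal standalone sketch of the truncation-vs-masking equivalence referred to above - toy tensors only, not the PR's code. Giving the padding keys effectively -infinity pre-softmax scores produces the same attention output, for the real query positions, as dropping those keys entirely.)

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
real, pad, dim = 8, 3, 16                 # 8 real prompt tokens, 3 padding tokens, toy head dim
q = torch.randn(1, 1, real, dim)          # queries for the real tokens only
k = torch.randn(1, 1, real + pad, dim)
v = torch.randn(1, 1, real + pad, dim)

# (a) masking: padding keys are excluded via ~-inf pre-softmax scores (mask=False)
mask = torch.zeros(1, 1, real, real + pad, dtype=torch.bool)
mask[..., :real] = True                   # True = attend, False = masked out
masked_out = F.scaled_dot_product_attention(q, k, v, attn_mask=mask)

# (b) truncation: drop the padding keys/values entirely
truncated_out = F.scaled_dot_product_attention(q, k[..., :real, :], v[..., :real, :])

print(torch.allclose(masked_out, truncated_out, atol=1e-6))  # True
```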
What does this PR do?
Context:
Chroma is a large-scale Apache 2.0 fine-tune of FLUX.1 Schnell. It is currently one of the top trending text-to-image models, and has been for several days now:
Someone recently asked about diffusers support:
I've currently got it working in diffusers:
but as you can see from the comments at the top of that script, it requires a couple of changes to diffusers source code for it to work out of the box.
Changes:
One such change is needed because Chroma requires masking/truncation of prompts (all but the final padding token).
Currently diffusers requires that prompts are the same length, since it assumes that the full 512 T5 tokens will be used for both positive and negative prompts.
So `check_inputs` blocks it, and if we simply remove that check we then hit an error: we need to pass the negative prompt ids into the negative prompt forward pass, instead of passing the positive prompt ids into both. A rough sketch of the kind of call this enables is below.
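For illustration, here's a rough sketch of a call with a positive and a negative prompt that tokenize to very different lengths. The model id and the `negative_prompt`/`true_cfg_scale` arguments follow the existing FluxPipeline interface and are assumptions for illustration, not something this PR defines:

```python
import torch
from diffusers import FluxPipeline

pipe = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-schnell", torch_dtype=torch.bfloat16
).to("cuda")

image = pipe(
    prompt="a courtroom sketch of a bored human judge, detailed cross-hatching",
    negative_prompt="blurry, low quality",  # tokenizes to far fewer T5 tokens than the prompt
    num_inference_steps=20,
    true_cfg_scale=5.0,  # real CFG (assumed arg name), matching the CFG=5 used in the grids above
    generator=torch.Generator("cuda").manual_seed(0),
).images[0]
image.save("chroma_test.png")
```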
Who can review?
@yiyixuxu @sayakpaul @DN6