Thank you for your excellent work! I have some confusion about the training.
First, what's the difference between IPAdapterFull and IPAdapterPlus? As far as I can tell, only the image_proj_model differs. Does MLPProjModel extract more detailed features?
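For reference, here is my rough understanding of the two projection models (a minimal sketch only; TinyResampler is my own simplification of the repo's Resampler, and the dimensions are illustrative rather than the actual defaults):

```python
import torch
import torch.nn as nn

# IPAdapterFull: MLPProjModel maps the CLIP patch tokens straight through
# an MLP, producing one context token per input patch token.
class MLPProjModel(nn.Module):
    def __init__(self, cross_attention_dim=768, clip_embeddings_dim=1024):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(clip_embeddings_dim, clip_embeddings_dim),
            nn.GELU(),
            nn.Linear(clip_embeddings_dim, cross_attention_dim),
            nn.LayerNorm(cross_attention_dim),
        )

    def forward(self, image_embeds):      # (B, 257, clip_dim)
        return self.proj(image_embeds)    # (B, 257, cross_attention_dim)

# IPAdapterPlus: a perceiver-style Resampler compresses the patch tokens
# into a small, fixed number of learned query tokens via cross-attention.
class TinyResampler(nn.Module):
    def __init__(self, dim=768, clip_embeddings_dim=1024,
                 num_queries=16, heads=8):
        super().__init__()
        self.latents = nn.Parameter(torch.randn(1, num_queries, dim))
        self.proj_in = nn.Linear(clip_embeddings_dim, dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, image_embeds):      # (B, 257, clip_dim)
        x = self.proj_in(image_embeds)
        q = self.latents.expand(x.shape[0], -1, -1)
        out, _ = self.attn(q, x, x)       # queries attend to all patch tokens
        return self.norm(out)             # (B, num_queries, dim)
```

If that reading is right, IPAdapterFull keeps all patch tokens (more detail, more compute), while IPAdapterPlus distills them into a handful of tokens, so "more detailed features" would describe the Full variant rather than Plus. Is that correct?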
Second, when training with text-image pairs, are the noisy GT and the image prompt the same image? In general, I would expect the training sample to be a triplet of (image prompt, text prompt, GT image) if the model is meant to acquire multimodal-input capability. Meanwhile, you mention in the paper that it is also possible to train the model without a text prompt, since the image prompt alone is informative enough to guide the final generation. What is the GT image when training degrades to an img2img task without text prompts?
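To make the question concrete, here is how I currently picture one training step (a hypothetical sketch; clip_image_encoder, image_proj_model, unet, vae, and scheduler are stand-in callables, not the repo's actual objects or signatures):

```python
import torch
import torch.nn.functional as F

# Hypothetical sketch: the image prompt is the clean GT image itself, and
# the "noisy GT" is a noised latent of that same image.
def training_step(gt_image, text_embeds, clip_image_encoder,
                  image_proj_model, unet, vae, scheduler):
    # 1. Image prompt = the clean GT image, encoded by CLIP and projected.
    image_embeds = clip_image_encoder(gt_image)
    ip_tokens = image_proj_model(image_embeds)

    # 2. Noisy GT = the same image, encoded to latents and noised.
    latents = vae.encode(gt_image)
    noise = torch.randn_like(latents)
    t = torch.randint(0, scheduler.num_train_timesteps, (latents.shape[0],))
    noisy_latents = scheduler.add_noise(latents, noise, t)

    # 3. Condition on text tokens + image tokens; text_embeds could be the
    #    empty-prompt embedding when training without text (and either
    #    condition can be dropped for classifier-free guidance).
    cond = torch.cat([text_embeds, ip_tokens], dim=1)
    noise_pred = unet(noisy_latents, t, cond)

    # 4. Standard diffusion objective against the added noise.
    return F.mse_loss(noise_pred, noise)
```

Under that picture, the GT image in the image-only case is still the prompt image itself, which is why it looks like an img2img reconstruction task to me. Is that how it actually works?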
Thanks again for your excellent contributions!