The data goes through self-attention and then, after layer normalization, into cross-attention. Why, in the code, are the `encoder_hidden_states` for cross-attention the text and image embeddings rather than the output of the preceding self-attention layer?
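For context, here is a minimal sketch of the pattern being asked about (a hypothetical `DecoderBlock` in plain PyTorch, not this repository's actual code): in a standard cross-attention layer, the query comes from the block's own hidden states (the self-attention output), while the key and value come from `encoder_hidden_states` — the fixed text/image embeddings — so the model attends to the conditioning context rather than to itself.

```python
import torch
import torch.nn as nn

class DecoderBlock(nn.Module):
    """Hypothetical sketch of a self-attention -> cross-attention block."""

    def __init__(self, dim: int, heads: int):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, x: torch.Tensor, encoder_hidden_states: torch.Tensor):
        # Self-attention: tokens attend to each other.
        h, _ = self.self_attn(x, x, x)
        x = self.norm1(x + h)
        # Cross-attention: query = current hidden states (self-attention output),
        # key/value = encoder_hidden_states (the text/image embeddings).
        h, _ = self.cross_attn(x, encoder_hidden_states, encoder_hidden_states)
        x = self.norm2(x + h)
        return x
```

In this layout the self-attention output is not discarded: it flows into cross-attention as the *query*, while `encoder_hidden_states` supply only the key/value, which is the conventional way to condition on an external modality.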