This repository was archived by the owner on Jul 7, 2023. It is now read-only.

Q*(Q+K).t() #1731

@vainaixr

Description


Why is

Q*(K).t()

(t() means transpose) used in attention, and not

Q*(Q+K).t()

For example, suppose we have two pixels, black and white, and we want to represent each ordered combination of them differently:

single pixels:  black -> (Q),  white -> (K)
pairs:  black white -> (QK),  white black -> (KQ),  black black -> (QQ),  white white -> (KK)
Q*(K).t()

will give the same result for

black white

and

white black

whereas if we do,

Q*(Q+K).t()

then all four would be different; other options could be

Q*(Q-K)

but then

black black
white white

would be the same, or

Q*K*K

, but that would be more computationally expensive than

Q*(Q+K)

or

(Q+K)

but then,

black white
white black

would be the same

or

(Q-K)

but then,

white white
black black

would be the same

or only

Q

or only

K

but then all four would be the same,

or concatenate Q and K together, but since the vector size increases, carrying out this operation again would require more computation.
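The collision argument above can be checked numerically. A toy sketch, under the simplifying assumption that each pixel is represented by a single vector used directly as both its query and its key (the vectors are made up for illustration):

```python
# Toy check of the collision argument, assuming each pixel is one
# vector used as both query and key (made-up example vectors).
import numpy as np

black = np.array([1.0, 2.0])
white = np.array([3.0, -1.0])

def qk(a, b):       # Q * K.t()     -> plain dot product, symmetric in a, b
    return a @ b

def q_qk(a, b):     # Q * (Q+K).t() -> order-sensitive score
    return a @ (a + b)

def q_qmk(a, b):    # Q * (Q-K).t() -> zero whenever a == b
    return a @ (a - b)

# Q*K.t() cannot tell (black, white) from (white, black):
print(qk(black, white), qk(white, black))        # equal

# Q*(Q+K).t() gives four distinct values for the four ordered pairs:
pairs = [(black, white), (white, black), (black, black), (white, white)]
print([q_qk(a, b) for a, b in pairs])

# but Q*(Q-K).t() collapses both same-pixel pairs to zero:
print(q_qmk(black, black), q_qmk(white, white))  # both 0.0
```

With these generic vectors, the symmetric score collides on the two mixed pairs, the Q-K variant collides on the two same-pixel pairs, and only the Q+K variant separates all four ordered pairs, matching the cases listed above.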
