Why is Q t(K) (t() meaning transpose) done in attention, and not one of the simpler alternatives below?
For example, say we have two pixels, black and white, and want to represent each ordered combination of them differently:

black white -> (QK)
white black -> (KQ)
black black -> (QQ)
white white -> (KK)

If we instead assign each pixel a single vector, black -> (Q) and white -> (K), and score a pair with a dot product, it will give the same result for black white and white black, since the dot product commutes.
Whereas if we give each pixel both a query and a key, and score an ordered pair (i, j) as q_i · k_j, which is what Q t(K) computes, then all four would be different. Other options could be:

- using only queries, q_i · q_j, but then black white and white black would be the same,
- or learning a separate representation for every pair directly, but that would be more computationally expensive than Q t(K),
- or using only keys, k_i · k_j, but then black white and white black would again be the same,
- or a symmetric score like q_i · k_j + k_i · q_j, but then black white and white black would also be the same,
- or assigning only Q to every pixel, or only K, but then all four would be the same.
Or we could concat Q and K together, which keeps all four combinations distinct, but since the size doubles, carrying the operation out again on the larger representation would require more computation.
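To make the distinction I'm asking about concrete, here is a small numpy sketch (the pixel embeddings and the matrices Wq, Wk are made-up stand-ins for real inputs and learned projections): with one vector per pixel the two orders collide, while separate queries and keys keep all four ordered pairs distinct.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4
# made-up embeddings for the two pixels
x_black = rng.normal(size=d)
x_white = rng.normal(size=d)

# One vector per pixel, scored by a dot product:
# black-white and white-black are indistinguishable,
# because the dot product commutes.
assert np.isclose(x_black @ x_white, x_white @ x_black)

# Attention's scheme: every pixel gets a query AND a key
# (Wq, Wk stand in for the learned projection matrices),
# and an ordered pair (i, j) is scored as q_i . k_j.
Wq = rng.normal(size=(d, d))
Wk = rng.normal(size=(d, d))
q = {"black": Wq @ x_black, "white": Wq @ x_white}
k = {"black": Wk @ x_black, "white": Wk @ x_white}
scores = {(i, j): q[i] @ k[j] for i in q for j in k}

# with random projections, all four ordered combinations
# get distinct scores
assert len({round(v, 9) for v in scores.values()}) == 4
```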