Why is Q t(K) (t() meaning transpose) done in attention, and not one of the simpler alternatives below?
For example, say we have two pixels, black and white, and want to represent each ordered combination of them differently:

black white -> (QK)
white black -> (KQ)
black black -> (QQ)
white white -> (KK)

If we instead assign each pixel a single vector, black -> (Q) and white -> (K), and score a pair with a dot product, it will give the same result for black white and white black, since the dot product commutes.
Whereas if we give each pixel both a query and a key, and score an ordered pair (i, j) as q_i · k_j, which is what Q t(K) computes, then all four would be different. Other options could be:

- using only queries, q_i · q_j, but then black white and white black would be the same,
- or learning a separate representation for every pair directly, but that would be more computationally expensive than Q t(K),
- or using only keys, k_i · k_j, but then black white and white black would again be the same,
- or a symmetric score like q_i · k_j + k_i · q_j, but then black white and white black would also be the same,
- or assigning only Q to every pixel, or only K, but then all four would be the same.
Or we could concat Q and K together, which keeps all four combinations distinct, but since the size doubles, carrying the operation out again on the larger representation would require more computation.
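To make the distinction I'm asking about concrete, here is a small numpy sketch (the pixel embeddings and the matrices Wq, Wk are made-up stand-ins for real inputs and learned projections): with one vector per pixel the two orders collide, while separate queries and keys keep all four ordered pairs distinct.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4
# made-up embeddings for the two pixels
x_black = rng.normal(size=d)
x_white = rng.normal(size=d)

# One vector per pixel, scored by a dot product:
# black-white and white-black are indistinguishable,
# because the dot product commutes.
assert np.isclose(x_black @ x_white, x_white @ x_black)

# Attention's scheme: every pixel gets a query AND a key
# (Wq, Wk stand in for the learned projection matrices),
# and an ordered pair (i, j) is scored as q_i . k_j.
Wq = rng.normal(size=(d, d))
Wk = rng.normal(size=(d, d))
q = {"black": Wq @ x_black, "white": Wq @ x_white}
k = {"black": Wk @ x_black, "white": Wk @ x_white}
scores = {(i, j): q[i] @ k[j] for i in q for j in k}

# with random projections, all four ordered combinations
# get distinct scores
assert len({round(v, 9) for v in scores.values()}) == 4
```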