
Linear Attention Mechanism #2150

Closed
parmarsuraj99 opened this issue Sep 5, 2020 · 12 comments

Comments

@parmarsuraj99

parmarsuraj99 commented Sep 5, 2020

Describe the feature and the current behavior/state.

Are we going to add LinearAttention? If yes, I can start working on it

Relevant information

Which API type would this fall under (layer, metric, optimizer, etc.)? Layer.

Who will benefit from this feature? Anyone building Transformer blocks: linear attention runs in O(N) time, compared to the O(N²) cost of standard softmax dot-product attention.

Any other info.
Paper's website

@AakashKumarNain
Member

Please feel free to open a PR, @parmarsuraj99. Thank you!

@bhack
Contributor

bhack commented Sep 5, 2020

/cc @saberkun @tanzhenyu @dynamicwebpaige Do you have any internal plans for this?

@saberkun
Member

saberkun commented Sep 5, 2020

Looking at the PyTorch implementation, what's the difference from https://github.com/tensorflow/tensorflow/blob/master/tensorflow/python/keras/layers/dense_attention.py?

@parmarsuraj99
Author

In the PyTorch implementation, the authors have implemented many variants of attention: https://github.com/idiap/fast-transformers/tree/master/fast_transformers/attention

The one referenced above is this specific variant:
https://github.com/idiap/fast-transformers/blob/master/fast_transformers/attention/linear_attention.py

The major difference is in how the attention is computed. Instead of a softmax, they apply a kernel feature map (here, one based on elu) to Q and K, which according to them has shown some improvements.

The change is focused only on this part:

depth = tf.constant(self.head_size, dtype=tf.float32)
query /= tf.sqrt(depth)

# Calculate dot product attention
logits = tf.einsum("...NHO,...MHO->...HNM", query, key)

# Apply mask
if mask is not None:
    mask = tf.cast(mask, tf.float32)

    # Possibly expand on the head dimension so broadcasting works
    if len(mask.shape) != len(logits.shape):
        mask = tf.expand_dims(mask, -3)

    logits += -10e9 * (1.0 - mask)

attn_coef = tf.nn.softmax(logits)

Could we make the attention computation a callable that runs after the linear projections of the inputs? Something like:

query = tf.einsum("...NI , HIO -> ...NHO", query, self.query_kernel)
key = tf.einsum("...MI , HIO -> ...MHO", key, self.key_kernel)
value = tf.einsum("...MI , HIO -> ...MHO", value, self.value_kernel)

output, attn_coef = scaled_dot_product_attention(query, key, value, mask)

or

output, attn_coef = linearized_attention(query, key, value, mask)
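
For reference, a minimal sketch (not taken from the referenced repo) of what such a linearized_attention callable might look like, using the elu(x) + 1 feature map from the paper and the [..., N, H, O] / [..., M, H, O] projection shapes above. The key-padding mask handling (a mask over the M key positions) and the eps constant are assumptions for illustration.

import tensorflow as tf

def linearized_attention(query, key, value, mask=None, eps=1e-6):
    # query: [..., N, H, O]; key/value: [..., M, H, O]; mask: [..., M] with 1 = keep (assumed shape).
    # Positive feature map applied to queries and keys instead of a softmax.
    q = tf.nn.elu(query) + 1.0
    k = tf.nn.elu(key) + 1.0

    if mask is not None:
        # Zero out padded key/value positions before they enter the sums.
        k *= tf.cast(mask, k.dtype)[..., tf.newaxis, tf.newaxis]

    # phi(K)^T V, contracted over the key length M: [..., H, O, O]
    kv = tf.einsum("...MHO,...MHD->...HOD", k, value)
    # Normalizer phi(Q) . sum_M phi(K): [..., N, H]
    z = 1.0 / (tf.einsum("...NHO,...HO->...NH", q, tf.reduce_sum(k, axis=-3)) + eps)
    # Output phi(Q) (phi(K)^T V), rescaled: [..., N, H, O]
    output = tf.einsum("...NHO,...HOD->...NHD", q, kv) * z[..., tf.newaxis]
    # The linear form never materializes the N x M attention matrix, so no coefficients are returned.
    return output, None

Because the two einsums contract over M and over the head dimension separately, the N x M logits matrix is never formed; that reordering is where the O(N) scaling comes from.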

@tanzhenyu
Contributor

(quoting @parmarsuraj99's proposal above)

This seems to be something we could ask users to implement by subclassing?

@saberkun
Member

saberkun commented Sep 8, 2020

@parmarsuraj99 Thanks!

To make the subclassing idea a bit more concrete: there are many innovations in how attention is computed, so we are making the Keras MultiHeadAttention layer subclassable:
https://github.com/tensorflow/tensorflow/blob/master/tensorflow/python/keras/layers/multi_head_attention.py#L399
A few attention variants can be implemented by overriding the compute_attention method, and those subclass layers could be a good fit to host inside packages like Addons.
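
As an illustration, a linear-attention variant might look roughly like the sketch below. This assumes the override point is the private _compute_attention(query, key, value, attention_mask=None, training=None) method with [batch, length, num_heads, head_dim] tensors, which can differ between TF/Keras versions, and it only supports a key-padding mask of shape [batch, source_length]; the full [batch, target, source] mask cannot be expressed in the linear form.

import tensorflow as tf

class LinearMultiHeadAttention(tf.keras.layers.MultiHeadAttention):
    # Sketch only: replaces the softmax attention with the elu(x) + 1 linear variant.
    # Assumes the base class calls _compute_attention(query, key, value,
    # attention_mask=None, training=None) with shapes
    # query: [B, T, num_heads, key_dim], key: [B, S, num_heads, key_dim],
    # value: [B, S, num_heads, value_dim]; verify against the installed Keras version.

    def _compute_attention(self, query, key, value,
                           attention_mask=None, training=None):
        q = tf.nn.elu(query) + 1.0
        k = tf.nn.elu(key) + 1.0
        if attention_mask is not None:
            # Assumed to be a key-padding mask of shape [B, S] with 1 = keep.
            k *= tf.cast(attention_mask, k.dtype)[:, :, tf.newaxis, tf.newaxis]
        kv = tf.einsum("BSND,BSNE->BNDE", k, value)               # phi(K)^T V
        z = 1.0 / (tf.einsum("BTND,BND->BTN", q,
                             tf.reduce_sum(k, axis=1)) + 1e-6)    # normalizer
        output = tf.einsum("BTND,BNDE->BTNE", q, kv) * z[..., tf.newaxis]
        # No N x M attention matrix exists in the linear form, so no scores are returned.
        return output, None

It would be used like the stock layer, e.g. LinearMultiHeadAttention(num_heads=8, key_dim=64)(query, value), with return_attention_scores left at its default since there are no per-pair scores to return.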

@parmarsuraj99
Author

parmarsuraj99 commented Sep 8, 2020

(quoting @saberkun's comment above)

Thanks!

This is exactly the implementation I was referring to: subclassing MultiHeadAttention with a flexible attention computation.

@seanpmorgan
Member

Given the feedback from the TF team, if you would like to submit and maintain a PR subclassing the Keras MHA, that would be fine to proceed with.

@abhishek-niranjan
Contributor

@seanpmorgan I'd like to contribute to this too. @parmarsuraj99 let me know if you'd like to work on it together?

@parmarsuraj99
Author

@abhishek-niranjan Sure. I'd really love to collaborate

@claverru

How is this going?

@seanpmorgan
Member

TensorFlow Addons is transitioning to a minimal maintenance and release mode. New features will not be added to this repository. For more information, please see our public messaging on this decision:
TensorFlow Addons Wind Down

Please consider sending feature requests / contributions to other repositories in the TF community with similar charters to TFA:
Keras
Keras-CV
Keras-NLP
