
RFC: Multihead Attention and EinsumDense on Keras #260

Merged: 10 commits merged into tensorflow:master on Jul 20, 2020

Conversation

@saberkun (Member) commented Jun 17, 2020

RFC: Multihead Attention and EinsumDense on Keras

Status: Proposed
RFC #: 260
Author(s): Hongkun Yu (hongkuny@google.com), Mark Omernick (momernick@google.com)
Sponsor: Francois Chollet (fchollet@google.com)
Updated: 2020-06-16

Objective

Introduce the MultiHeadAttention layer and EinsumDense layer to tf.keras.

The feedback phase will be open for two weeks until Wednesday July 02, 2020.

# RFC: Multihead Attention and EinsumDense on Keras

| **Status**    | Proposed                                                 |
| :------------ | :------------------------------------------------------- |
| **RFC #**     | [260](https://github.com/tensorflow/community/pull/260)  |
| **Author(s)** | Hongkun Yu (hongkuny@google.com), Mark Omernick (momernick@google.com) |
| **Sponsor**   | Francois Chollet (fchollet@google.com)                  |
| **Updated**   | 2020-06-16                                              |

## Objective

Introduce the MultiHeadAttention layer and EinsumDense layer to tf.keras:
`tf.keras.layers.experimental.MultiHeadAttention` and
`tf.keras.layers.experimental.EinsumDense`.

* Supports projecting Q, K, V to different dimensions.
* Final outputs are projected to user-specified dimensions.
* Uses tf.einsum to express high-dimensional computation and adopts
[tf.keras.layers.experimental.EinsumDense](https://www.tensorflow.org/api_docs/python/tf/keras/layers/experimental/EinsumDense).
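As a quick illustration of the last bullet, here is a minimal sketch of how EinsumDense can express a per-position dense projection; the equation, shapes, and bias_axes value are illustrative choices, not taken from the RFC:

```python
import tensorflow as tf

# A dense projection of the last axis, written as an einsum:
# (batch, seq, 32) x (32, 64) -> (batch, seq, 64), with a bias on the output axis.
layer = tf.keras.layers.experimental.EinsumDense(
    equation="abc,cd->abd",
    output_shape=(None, 64),  # None keeps the sequence axis unchanged
    bias_axes="d",
)
inputs = tf.keras.Input(shape=(8, 32))
outputs = layer(inputs)
print(outputs.shape)  # (None, 8, 64)
```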
@terrytangyuan (Member) commented Jun 17, 2020:
What's the relationship between tfa.layers and tf.keras.layers.experimental? In particular, what is the relation to the existing tfa.layers.MultiHeadAttention? cc @seanpmorgan @karmel @bhack

@seanpmorgan (Member) commented Jun 17, 2020:
IIUC, layers that go in tf.keras.layers.experimental are slated to land in Keras, but the API is not set in stone.

Addons would be a place for contributions whose broad applicability is not yet clear, or that are mostly used by a smaller subset of the community (per the charter). This gets tricky, though, because MultiHeadAttention has proven its applicability but there was no installable implementation in the TF ecosystem. Perhaps we should bring all situations like this up to the Keras team beforehand? That is a subjective call for us to make, though, so a roadmap would be preferable.

@saberkun (Member Author) replied:
The addons implementation is a very good reference. We incorporated the features of the addons version and generalized them further to fit emerging needs. This design should cover tfa.layers.MultiHeadAttention. We hope this common layer can live inside tf.keras directly. The implementation started in the model garden back last year.

@bhack (Contributor) commented Jun 17, 2020:
.experimental was defined and approved in https://github.com/tensorflow/community/blob/master/governance/api-reviews.md#experimental-apis.
Sometimes, when an experimental namespace doesn't get much traction, it could have an opportunity to be downstreamed to tf.addons as an alternative to being removed (but this downstreaming step is not defined in that RFC).

@saberkun (Member Author) commented Jun 17, 2020:
For the MultiHeadAttention layer, it is relatively clear that we should eventually put it inside core Keras. Added a note to the RFC.

A reviewer commented:
@saberkun -- can you add to the Addons section below to detail the differences? The exact differences will be important for anyone migrating. Code samples demoing the migration would be useful too. @seanpmorgan -- the commitment in the Addons section below is somewhat vague as to who does what, and you should feel free to press for a specific commitment to deprecate the Addons version if you would like that to be handled by the authors here.

(Note: I have not read through the full RFC yet, so excuse me if I missed things that are already there.)

@seanpmorgan (Member) left a comment:
Would it be possible to have the core team help with the TFA deprecation and re-mapping as part of this RFC?


Add mark to authors.
Add plan for addons migration.
@bhack (Contributor) commented Jun 17, 2020:

I don't understand what the plan is here for the garden official models.
I have a strong feeling that at some point we will have the model garden official models' own implementation, the TFA implementation, and this one, which will mainly be used to develop keras-nlp.

@saberkun (Member Author) commented Jun 17, 2020:

> I don't understand what the plan is here for the garden official models. I have a strong feeling that at some point we will have the model garden official models' own implementation, the TFA implementation, and this one, which will mainly be used to develop keras-nlp.

Hi @bhack, the model garden implementation of MultiHeadAttention will be moved to tf.keras.experimental, in theory in the same commit. We have already done this migration for EinsumDense. The model garden does not keep duplicated implementations whenever there is a reliable source with all the needed features. +@rachellj218: in terms of how to maintain and build the model garden library, we would need further design, which is beyond the scope here.

>>> target = tf.keras.Input(shape=[8, 16])
>>> source = tf.keras.Input(shape=[4, 16])
>>> mask_tensor = tf.keras.Input(shape=[8, 4])
>>> output_tensor, weights = layer([target, source])
A member commented:
@tomerk, since we started the effort of treating all tensor inputs the same and not specializing the first call argument, should we encourage users not to pass in a list of tensors, and instead have individual kwargs for tensor inputs, like call(query, key, value=None, mask=None)?

I think explicit kwargs will be more readable, since a plain list doesn't carry any information about the individual tensors.

@saberkun (Member Author) replied:
I am OK with that, as Tomer has fixed that. But yeah, I think we should test real models with a mixed-precision policy.
If we use individual kwargs, the Attention layer interface, which takes a list, may need to be made consistent. Using inputs as a list was intentional, to be consistent with Attention and the TFA implementation. Do we want to change that?

A member replied:
I think the first input tensor was specially handled before Tomer's change, which means you had to use a list of tensors if there were multiple inputs. Now you can have individual tensors as kwargs. I think the kwargs will be more readable, and could set an example for users when there are multiple inputs. @fchollet and @tomerk, WDYT?

Another member commented:
IMO kwargs are much better, not only for readability, but also because they remove the burden on the user of checking the consistency of the positional arguments passed.

@saberkun (Member Author) commented Jun 20, 2020:
Sounds good. I also like kwargs. Will add this to the proposal and discuss.
Will also test it on our end. @qlzh727 @tomerk
I have tried it and see that the build() method only gets the first tensor_shape. It is a bit odd to not be able to infer all shapes. Let's follow up on the discussion internally.
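For concreteness, here is a sketch of the keyword-argument calling style being discussed, written against the layer as it later shipped in stable tf.keras (tf.keras.layers.MultiHeadAttention, which uses key_dim where the RFC draft used key_size); the argument names and shapes are illustrative:

```python
import tensorflow as tf

# Each tensor input is passed as a named argument instead of an element
# of a positional list such as layer([target, source]).
layer = tf.keras.layers.MultiHeadAttention(num_heads=2, key_dim=16)

target = tf.keras.Input(shape=(8, 16))   # query
source = tf.keras.Input(shape=(4, 16))   # value (key defaults to value)
mask = tf.keras.Input(shape=(8, 4))      # per-example (target_length, source_length) mask

output, scores = layer(
    query=target,
    value=source,
    attention_mask=mask,
    return_attention_scores=True,
)
print(output.shape, scores.shape)  # (None, 8, 16) (None, 2, 8, 4)
```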

`tf.keras.layers.experimental.EinsumDense`. When the APIs are stable and the
functionality is fully verified, the next step is to
graduate them to core Keras layers by removing the `experimental` scope.

A member commented:
@fchollet, for the existing keras.layers.Attention and keras.layers.AdditiveAttention, I think we should add some clarification so users do not confuse them with the new multi-head attention. I would expect most users to use multi-head attention.

@saberkun (Member Author) replied:
We will clarify that Attention and AdditiveAttention are really for the attention computation itself. They are subsets of MultiHeadAttention, which is commonly used in NLP and transformer-based vision models and shares the same scope as other ML libraries.


`tf.keras.layers.Attention`. In the reduced case of `tf.keras.layers.Attention`,
the shape is (batch_size, target_length, source_length). Whenever
`attention_mask` is specified, the `mask` argument may be skipped.
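As an illustration of that shape, here is a small sketch in plain TF ops (not from the RFC) that builds a (batch_size, target_length, source_length) boolean attention_mask from per-token padding masks:

```python
import tensorflow as tf

batch_size, target_length, source_length = 2, 8, 4
# Per-token validity masks, e.g. derived from padding.
target_mask = tf.ones([batch_size, target_length], dtype=tf.bool)  # (B, T)
source_mask = tf.ones([batch_size, source_length], dtype=tf.bool)  # (B, S)

# Position (b, t, s) may attend only when both tokens are valid.
attention_mask = tf.logical_and(
    target_mask[:, :, tf.newaxis],  # (B, T, 1)
    source_mask[:, tf.newaxis, :],  # (B, 1, S)
)
print(attention_mask.shape)  # (2, 8, 4)
```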

A reviewer commented:
I find this confusing. Can you be more explicit about the differences between the two attention layers? If this is a generalization, should there be an inheritance relationship here? If MultiHead is a generalization, why not just update the existing Attention layer to handle this case? We will need clear guidance for users as to when to use one versus the other.

@saberkun (Member Author) replied:
The MultiHeadAttention layer contains projection layers for the Q, K, V inputs and for the outputs. The attention computation itself is multi-headed dot-product attention. The proposal follows the same scope as other ML libraries; please check them out. The module is commonly used in NLP and in new vision research.
The current Keras Attention layer is the attention computation for a single head and does not use einsum. It can be used to implement the attention-computation part of MultiHeadAttention.
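To make the distinction concrete, here is a minimal sketch, in plain TF ops rather than the RFC's implementation, of the multi-headed dot-product attention computation that sits between the input and output projections; the existing keras.layers.Attention covers roughly the single-head version of these einsum/softmax steps:

```python
import tensorflow as tf

def multi_head_dot_product_attention(query, key, value, attention_mask=None):
    """Illustrative only. query: (B, T, N, H); key, value: (B, S, N, H);
    attention_mask: (B, T, S). B=batch, T/S=target/source length,
    N=num_heads, H=size per head."""
    scale = tf.math.rsqrt(tf.cast(tf.shape(query)[-1], query.dtype))
    # Per-head attention logits: (B, N, T, S).
    logits = tf.einsum("BTNH,BSNH->BNTS", query, key) * scale
    if attention_mask is not None:
        # Broadcast the (B, T, S) mask over heads; push masked logits far negative.
        mask = tf.cast(attention_mask[:, tf.newaxis, :, :], logits.dtype)
        logits += (1.0 - mask) * -1e9
    weights = tf.nn.softmax(logits, axis=-1)
    # Weighted sum over source positions, back to (B, T, N, H).
    return tf.einsum("BNTS,BSNH->BTNH", weights, value), weights
```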

@bhack (Contributor) commented:
Is the tf.keras.layers.Attention design flexible enough to support the research proliferation of dense-attention alternatives?

@saberkun (Member Author) replied:
@bhack has a good comment. We will meet with the team and check the plan for the keras.Attention layer.
For NLP research, people may prefer direct access to the attention computation with ops.

@ematejska added the RFC: Proposed (RFC Design Document) label on Jul 7, 2020
@ematejska changed the title from "# RFC: Multihead Attention and EinsumDense on Keras" to "RFC: Multihead Attention and EinsumDense on Keras" on Jul 8, 2020
@saberkun (Member Author) commented:
**Notes from the review meeting**

Q1: Is the TF Addons team OK with the RFC?
A: Yes, they are on board. Sean from TF Addons is OK with this. We will send a PR to TFA to deprecate or rewire to this new layer after it lands in tf-nightly.
We will keep engaging with the TF Addons community. Consider giving them early updates.

Q2: What is the relation between the existing Attention layer and MultiHeadAttention? Is it possible to make Attention part of MultiHeadAttention, or to make MultiHeadAttention inherit from Attention?
A: The existing Attention layer is only for the attention computation. Attention could take the inputs after the projections inside MHAttention and then do the computation. However, it doesn't fully support the computation needed in MHAttention. It also doesn't use einsum or support high-dimensional attention. To make it usable in MHAttention, we would need to rewrite the attention layer. MHAttention contains more logic for the transformer architecture and is not a superset of Attention. It's hard to deprecate Attention.
We will articulate clearly in the RFC the difference between attention and multi-headed attention, with docstrings and code examples. Users will be able to use Attention/AdditiveAttention inside the compute_attention method once we add the attention_mask argument (see the sketch after these notes).

Q3: Do we agree to use keyword arguments for tensor inputs?
A: Yes, we should use them for better usability. This will only work with the latest TF version, though.
Follow up with the Keras team to make sure keyword-argument layers are treated as first-class citizens in the TF ecosystem.
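To illustrate the point in Q2 about the existing layer covering only the attention computation, here is a minimal sketch that assumes the inputs have already been projected per head, the way MultiHeadAttention would project them internally; the shapes are illustrative:

```python
import tensorflow as tf

# Already-projected single-head tensors (what MHAttention would produce per head).
query = tf.keras.Input(shape=(8, 16))
value = tf.keras.Input(shape=(4, 16))
key = tf.keras.Input(shape=(4, 16))

# The existing layer handles only this dot-product attention step; the Q/K/V and
# output projections, multiple heads, and the (B, T, S) attention_mask argument
# are what MultiHeadAttention adds on top.
attended = tf.keras.layers.Attention()([query, value, key])
print(attended.shape)  # (None, 8, 16)
```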

@ematejska (Contributor) commented:
@saberkun Are we ready to merge or did you want to check anything else before I push all the merge buttons?

Update two proposed changes to the existing Attention layer
@saberkun (Member Author) replied:
> @saberkun Are we ready to merge or did you want to check anything else before I push all the merge buttons?

Yes, we are ready to merge. I have chatted with the team to follow up on the action items after the review. It is clear now. Thanks.

@ematejska added the RFC: Accepted (RFC Design Document: Accepted by Review) label and removed the RFC: Proposed (RFC Design Document) label on Jul 20, 2020
@ematejska merged commit c3d9d99 into tensorflow:master on Jul 20, 2020