
RFC: Multihead Attention and EinsumDense on Keras #260

Merged: 10 commits merged into tensorflow:master on Jul 20, 2020

Conversation

@saberkun (Member) commented Jun 17, 2020

RFC: Multihead Attention and EinsumDense on Keras

Status: Proposed
RFC #: 260
Author(s): Hongkun Yu (hongkuny@google.com), Mark Omernick (momernick@google.com)
Sponsor: Francois Chollet (fchollet@google.com)
Updated: 2020-06-16

Objective

Introduce the MultiHeadAttention layer and EinsumDense layer to tf.keras.

The feedback phase will be open for two weeks until Wednesday July 02, 2020.

# RFC: Multihead Attention and EinsumDense on Keras

| **Status**    | Proposed                                                 |
| :------------ | :------------------------------------------------------- |
| **RFC #**     | [260](https://github.com/tensorflow/community/pull/260)  |
| **Author(s)** | Hongkun Yu (hongkuny@google.com), Mark Omernick (momernick@google.com) |
| **Sponsor**   | Francois Chollet (fchollet@google.com)                  |
| **Updated**   | 2020-06-16                                              |

## Objective

Introduce the MultiHeadAttention layer and EinsumDense layer to tf.keras:
`tf.keras.layers.experimental.MultiHeadAttention` and
`tf.keras.layers.experimental.EinsumDense`.

* Supports projecting Q, K, V to different dimensions.
* Final outputs are projected to user-specified dimensions.
* Uses tf.einsum to express high-dimensional computation and adopts
[tf.keras.layers.experimental.EinsumDense](https://www.tensorflow.org/api_docs/python/tf/keras/layers/experimental/EinsumDense).
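As a quick illustration of the last bullet, here is a minimal sketch of how EinsumDense can express a per-position dense projection; the equation, shapes, and bias_axes value are illustrative choices, not taken from the RFC:

```python
import tensorflow as tf

# A dense projection of the last axis, written as an einsum:
# (batch, seq, 32) x (32, 64) -> (batch, seq, 64), with a bias on the output axis.
layer = tf.keras.layers.experimental.EinsumDense(
    equation="abc,cd->abd",
    output_shape=(None, 64),  # None keeps the sequence axis unchanged
    bias_axes="d",
)
inputs = tf.keras.Input(shape=(8, 32))
outputs = layer(inputs)
print(outputs.shape)  # (None, 8, 64)
```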
@terrytangyuan (Member) commented Jun 17, 2020:
What's the relationship between tfa.layers and tf.keras.layers.experimental? In particular, what is the relation to the existing tfa.layers.MultiHeadAttention? cc @seanpmorgan @karmel @bhack

@seanpmorgan (Member) commented Jun 17, 2020:
IIUC, layers that go in tf.keras.layers.experimental are slated to land in Keras, but the API is not set in stone.

Addons would be a place for contributions whose broad applicability is not yet clear, or that are mostly used by a smaller subset of the community (per the charter). This gets tricky, though, because MultiHeadAttention has proven its applicability but there was no installable implementation in the TF ecosystem. Perhaps we should bring all situations like this up to the Keras team beforehand? That is a subjective call for us to make, though, so a roadmap would be preferable.

@saberkun (Member Author) replied:
The addons implementation is a very good reference. We incorporated the features of the addons version and generalized them further to fit emerging needs. This design should cover tfa.layers.MultiHeadAttention. We hope this common layer can live inside tf.keras directly. The implementation started in the model garden back last year.

@bhack (Contributor) commented Jun 17, 2020:
.experimental was defined and approved in https://github.com/tensorflow/community/blob/master/governance/api-reviews.md#experimental-apis.
Sometimes, when an experimental namespace doesn't get much traction, it could have an opportunity to be downstreamed to tf.addons as an alternative to being removed (but this downstreaming step is not defined in that RFC).

@saberkun (Member Author) commented Jun 17, 2020:
For the MultiHeadAttention layer, it is relatively clear that we should eventually put it inside core Keras. Added a note to the RFC.

A reviewer commented:
@saberkun -- can you add to the Addons section below to detail the differences? The exact differences will be important for anyone migrating. Code samples demoing the migration would be useful too. @seanpmorgan -- the commitment in the Addons section below is somewhat vague as to who does what, and you should feel free to press for a specific commitment to deprecate the Addons version if you would like that to be handled by the authors here.

(Note: I have not read through the full RFC yet, so excuse me if I missed things that are already there.)

@seanpmorgan (Member) left a comment:
Would it be possible to have the core team help with the TFA deprecation and re-mapping as part of this RFC?


Add mark to authors.
Add plan for addons migration.
@bhack (Contributor) commented Jun 17, 2020:

I don't understand what the plan is here for the garden official models.
I have a strong feeling that at some point we will have the model garden official models' own implementation, the TFA implementation, and this one, which will mainly be used to develop keras-nlp.

@saberkun (Member Author) commented Jun 17, 2020:

> I don't understand what the plan is here for the garden official models. I have a strong feeling that at some point we will have the model garden official models' own implementation, the TFA implementation, and this one, which will mainly be used to develop keras-nlp.

Hi @bhack, the model garden implementation of MultiHeadAttention will be moved to tf.keras.experimental, in theory in the same commit. We have already done this migration for EinsumDense. The model garden does not keep duplicated implementations whenever there is a reliable source with all the needed features. +@rachellj218: in terms of how to maintain and build the model garden library, we would need further design, which is beyond the scope here.

>>> target = tf.keras.Input(shape=[8, 16])
>>> source = tf.keras.Input(shape=[4, 16])
>>> mask_tensor = tf.keras.Input(shape=[8, 4])
>>> output_tensor, weights = layer([target, source])
A member commented:
@tomerk, since we started the effort of treating all tensor inputs the same and not specializing the first call argument, should we encourage users not to pass in a list of tensors, and instead have individual kwargs for tensor inputs, like call(query, key, value=None, mask=None)?

I think explicit kwargs will be more readable, since a plain list doesn't carry any information about the individual tensors.

@saberkun (Member Author) replied:
I am OK with that, as Tomer has fixed that. But yeah, I think we should test real models with a mixed-precision policy.
If we use individual kwargs, the Attention layer interface, which takes a list, may need to be made consistent. Using inputs as a list was intentional, to be consistent with Attention and the TFA implementation. Do we want to change that?

A member replied:
I think the first input tensor was specially handled before Tomer's change, which means you had to use a list of tensors if there were multiple inputs. Now you can have individual tensors as kwargs. I think the kwargs will be more readable, and could set an example for users when there are multiple inputs. @fchollet and @tomerk, WDYT?

Another member commented:
IMO kwargs are much better, not only for readability, but also because they remove the burden on the user of checking the consistency of the positional arguments passed.

@saberkun (Member Author) commented Jun 20, 2020:
Sounds good. I also like kwargs. Will add this to the proposal and discuss.
Will also test it on our end. @qlzh727 @tomerk
I have tried it and see that the build() method only gets the first tensor_shape. It is a bit odd to not be able to infer all shapes. Let's follow up on the discussion internally.
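For concreteness, here is a sketch of the keyword-argument calling style being discussed, written against the layer as it later shipped in stable tf.keras (tf.keras.layers.MultiHeadAttention, which uses key_dim where the RFC draft used key_size); the argument names and shapes are illustrative:

```python
import tensorflow as tf

# Each tensor input is passed as a named argument instead of an element
# of a positional list such as layer([target, source]).
layer = tf.keras.layers.MultiHeadAttention(num_heads=2, key_dim=16)

target = tf.keras.Input(shape=(8, 16))   # query
source = tf.keras.Input(shape=(4, 16))   # value (key defaults to value)
mask = tf.keras.Input(shape=(8, 4))      # per-example (target_length, source_length) mask

output, scores = layer(
    query=target,
    value=source,
    attention_mask=mask,
    return_attention_scores=True,
)
print(output.shape, scores.shape)  # (None, 8, 16) (None, 2, 8, 4)
```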

`tf.keras.layers.experimental.EinsumDense`. When the APIs are stable and the
functionality is fully verified, the next step is to
graduate them to core Keras layers by removing the `experimental` scope.

A member commented:
@fchollet, for the existing keras.layers.Attention and keras.layers.AdditiveAttention, I think we should add some clarification so users do not confuse them with the new multi-head attention. I would expect most users to use multi-head attention.

@saberkun (Member Author) replied:
We will clarify that Attention and AdditiveAttention are really for the attention computation itself. They are subsets of MultiHeadAttention, which is commonly used in NLP and transformer-based vision models and shares the same scope as other ML libraries.


`tf.keras.layers.Attention`. In the reduced case of `tf.keras.layers.Attention`,
the shape is (batch_size, target_length, source_length). Whenever
`attention_mask` is specified, the `mask` argument may be skipped.
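As an illustration of that shape, here is a small sketch in plain TF ops (not from the RFC) that builds a (batch_size, target_length, source_length) boolean attention_mask from per-token padding masks:

```python
import tensorflow as tf

batch_size, target_length, source_length = 2, 8, 4
# Per-token validity masks, e.g. derived from padding.
target_mask = tf.ones([batch_size, target_length], dtype=tf.bool)  # (B, T)
source_mask = tf.ones([batch_size, source_length], dtype=tf.bool)  # (B, S)

# Position (b, t, s) may attend only when both tokens are valid.
attention_mask = tf.logical_and(
    target_mask[:, :, tf.newaxis],  # (B, T, 1)
    source_mask[:, tf.newaxis, :],  # (B, 1, S)
)
print(attention_mask.shape)  # (2, 8, 4)
```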

A reviewer commented:
I find this confusing. Can you be more explicit about the differences between the two attention layers? If this is a generalization, should there be an inheritance relationship here? If MultiHead is a generalization, why not just update the existing Attention layer to handle this case? We will need clear guidance for users as to when to use one versus the other.

@saberkun (Member Author) replied:
The MultiHeadAttention layer contains projection layers for the Q, K, V inputs and for the outputs. The attention computation itself is multi-headed dot-product attention. The proposal follows the same scope as other ML libraries; please check them out. The module is commonly used in NLP and in new vision research.
The current Keras Attention layer is the attention computation for a single head and does not use einsum. It can be used to implement the attention-computation part of MultiHeadAttention.
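To make the distinction concrete, here is a minimal sketch, in plain TF ops rather than the RFC's implementation, of the multi-headed dot-product attention computation that sits between the input and output projections; the existing keras.layers.Attention covers roughly the single-head version of these einsum/softmax steps:

```python
import tensorflow as tf

def multi_head_dot_product_attention(query, key, value, attention_mask=None):
    """Illustrative only. query: (B, T, N, H); key, value: (B, S, N, H);
    attention_mask: (B, T, S). B=batch, T/S=target/source length,
    N=num_heads, H=size per head."""
    scale = tf.math.rsqrt(tf.cast(tf.shape(query)[-1], query.dtype))
    # Per-head attention logits: (B, N, T, S).
    logits = tf.einsum("BTNH,BSNH->BNTS", query, key) * scale
    if attention_mask is not None:
        # Broadcast the (B, T, S) mask over heads; push masked logits far negative.
        mask = tf.cast(attention_mask[:, tf.newaxis, :, :], logits.dtype)
        logits += (1.0 - mask) * -1e9
    weights = tf.nn.softmax(logits, axis=-1)
    # Weighted sum over source positions, back to (B, T, N, H).
    return tf.einsum("BNTS,BSNH->BTNH", weights, value), weights
```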

@bhack (Contributor) commented:
Is the tf.keras.layers.Attention design flexible enough to support the research proliferation of dense-attention alternatives?

@saberkun (Member Author) replied:
@bhack has a good comment. We will meet with the team and check the plan for the keras.Attention layer.
For NLP research, people may prefer direct access to the attention computation with ops.

@ematejska added the RFC: Proposed (RFC Design Document) label on Jul 7, 2020
@ematejska changed the title from "# RFC: Multihead Attention and EinsumDense on Keras" to "RFC: Multihead Attention and EinsumDense on Keras" on Jul 8, 2020
@saberkun (Member Author) commented:
**Notes from the review meeting**

Q1: Is the TF Addons team OK with the RFC?
A: Yes, they are on board. Sean from TF Addons is OK with this. We will send a PR to TFA to deprecate or rewire to this new layer after it lands in tf-nightly.
We will keep engaging with the TF Addons community. Consider giving them early updates.

Q2: What is the relation between the existing Attention layer and MultiHeadAttention? Is it possible to make Attention part of MultiHeadAttention, or to make MultiHeadAttention inherit from Attention?
A: The existing Attention layer is only for the attention computation. Attention could take the inputs after the projections inside MHAttention and then do the computation. However, it doesn't fully support the computation needed in MHAttention. It also doesn't use einsum or support high-dimensional attention. To make it usable in MHAttention, we would need to rewrite the attention layer. MHAttention contains more logic for the transformer architecture and is not a superset of Attention. It's hard to deprecate Attention.
We will articulate clearly in the RFC the difference between attention and multi-headed attention, with docstrings and code examples. Users will be able to use Attention/AdditiveAttention inside the compute_attention method once we add the attention_mask argument (see the sketch after these notes).

Q3: Do we agree to use keyword arguments for tensor inputs?
A: Yes, we should use them for better usability. This will only work with the latest TF version, though.
Follow up with the Keras team to make sure keyword-argument layers are treated as first-class citizens in the TF ecosystem.
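To illustrate the point in Q2 about the existing layer covering only the attention computation, here is a minimal sketch that assumes the inputs have already been projected per head, the way MultiHeadAttention would project them internally; the shapes are illustrative:

```python
import tensorflow as tf

# Already-projected single-head tensors (what MHAttention would produce per head).
query = tf.keras.Input(shape=(8, 16))
value = tf.keras.Input(shape=(4, 16))
key = tf.keras.Input(shape=(4, 16))

# The existing layer handles only this dot-product attention step; the Q/K/V and
# output projections, multiple heads, and the (B, T, S) attention_mask argument
# are what MultiHeadAttention adds on top.
attended = tf.keras.layers.Attention()([query, value, key])
print(attended.shape)  # (None, 8, 16)
```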

@ematejska (Contributor) commented:
@saberkun Are we ready to merge or did you want to check anything else before I push all the merge buttons?

Update two proposed changes to the existing Attention layer
@saberkun (Member Author) replied:
> @saberkun Are we ready to merge or did you want to check anything else before I push all the merge buttons?

Yes, we are ready to merge. I have chatted with the team to follow up on the action items after the review. It is clear now. Thanks.

@ematejska added the RFC: Accepted (RFC Design Document: Accepted by Review) label and removed the RFC: Proposed (RFC Design Document) label on Jul 20, 2020
@ematejska merged commit c3d9d99 into tensorflow:master on Jul 20, 2020