RFC: Multihead Attention and EinsumDense on Keras #260
Conversation
The feedback phase will be open for two weeks, until Wednesday July 02, 2020.

# RFC: Multihead Attention and EinsumDense on Keras

| Status        | Proposed                                                 |
| :------------ | :------------------------------------------------------- |
| **RFC #**     | [NNN](https://github.com/tensorflow/community/pull/NNN) (update when you have community PR #) |
| **Author(s)** | Hongkun Yu (hongkuny@google.com)                          |
| **Sponsor**   | Francois Chollet (fchollet@google.com)                    |
| **Updated**   | 2020-06-16                                                |

## Objective

Introduce the MultiHeadAttention layer and EinsumDense layer to tf.keras, as `tf.keras.layers.experimental.MultiHeadAttention` and `tf.keras.layers.experimental.EinsumDense`.
> supports projecting Q, K, V to different dimensions.
> * Final outputs are projected to user-specified dimensions.
> * Using tf.einsum to express high-dimensional computation and adopting
>   [tf.keras.layers.experimental.EinsumDense](https://www.tensorflow.org/api_docs/python/tf/keras/layers/experimental/EinsumDense)
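For context on the EinsumDense-based projections mentioned above, here is a minimal sketch; the equation, shapes, and head count are illustrative and not taken from the RFC:

```python
import tensorflow as tf

# Illustrative only: project a (batch, seq, features) tensor to per-head
# sub-spaces, (batch, seq, num_heads, head_size), with a single einsum.
inputs = tf.keras.Input(shape=(8, 16))

projection = tf.keras.layers.experimental.EinsumDense(
    equation="abc,cde->abde",    # a=batch, b=seq, c=features, d=heads, e=head size
    output_shape=(None, 2, 32),  # None leaves the sequence length unspecified
    bias_axes="de",
)
projected = projection(inputs)   # shape: (None, 8, 2, 32)
```

Expressing the projection as an einsum equation is what lets the same layer handle inputs with extra dimensions, which is the motivation in the bullet above.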
What's the relationship between `tfa.layers` and `tf.keras.layers.experimental`? In particular, what is the relation to the existing `tfa.layers.MultiHeadAttention`? cc @seanpmorgan @karmel @bhack
IIUC, layers that go in `tf.keras.layers.experimental` are slated to land in Keras, but the API is not set in stone. Addons would be the place for contributions whose broad applicability is not yet clear, or that are mostly used by a smaller subset of the community (per the charter). This gets tricky, though, because MultiHeadAttention has proven its applicability but there was no installable implementation in the TF ecosystem. Perhaps we should bring all situations like this up to the Keras team beforehand? That is a subjective call for us to make, though, so a roadmap would be preferable.
The Addons implementation is a very good reference. We incorporate the features of the Addons version and generalize them further to fit emerging needs. This design should cover `tfa.layers.MultiHeadAttention`. We hope this common layer can live inside tf.keras directly. The implementation started in the model garden last year.
`.experimental` was defined and approved in https://github.com/tensorflow/community/blob/master/governance/api-reviews.md#experimental-apis. Sometimes, when an experimental namespace doesn't get much traction, it may have an opportunity to be downstreamed to tf.addons as an alternative to being removed (but this downstreaming step is not defined in the RFC).
For the MultiHeadAttention layer, it is relatively clear that we should eventually put it inside core Keras. Added a note inside the RFC.
@saberkun -- can you add to the Addons section below to detail the differences? The exact differences will be important for anyone migrating. Code samples demoing the migration would be useful too. @seanpmorgan -- the commitment in the Addons section below is somewhat vague as to who does what, and you should feel free to press for a specific commitment to deprecate the Addons version if you would like that to be handled by the authors here.
(Note: I have not read through the full RFC yet, so excuse me if I missed things that are already there.)
Would it be possible to have the core team help with the TFA deprecation and re-mapping as part of this RFC?
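For readers weighing the migration question above, a rough sketch of what the move might look like; this is not the authors' migration guide, the Addons call pattern is recalled from the tfa.layers documentation of that period, and the proposed layer's argument names (num_heads, key_size) are assumptions about the experimental API:

```python
import tensorflow as tf
import tensorflow_addons as tfa

query = tf.random.normal((3, 5, 16))   # (batch, target_length, features)
source = tf.random.normal((3, 6, 16))  # (batch, source_length, features)

# Before: TensorFlow Addons layer, configured by per-head size and head count,
# called with a list of tensors ([query, key]; value defaults to key).
tfa_layer = tfa.layers.MultiHeadAttention(head_size=32, num_heads=2)
tfa_output = tfa_layer([query, source])

# After: proposed experimental Keras layer (argument names assumed, not confirmed).
keras_layer = tf.keras.layers.experimental.MultiHeadAttention(num_heads=2, key_size=32)
keras_output = keras_layer([query, source])
```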
Add mark to authors. Add plan for addons migration.
I don't understand what the plan here is for the garden official models.
Hi @bhack, the model garden implementation of MultiHeadAttention will, in theory, be moved to tf.keras.experimental in the same commit. We have already done this migration for EinsumDense. The model garden does not keep duplicated implementations whenever there is a reliable source with the full set of features needed. +@rachellj218. In terms of how to maintain and build the model garden library, we would need a further design, which is beyond the scope here.
```python
>>> target = tf.keras.Input(shape=[8, 16])
>>> source = tf.keras.Input(shape=[4, 16])
>>> mask_tensor = tf.keras.Input(shape=[8, 4])
>>> output_tensor, weights = layer([target, source])
```
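The quoted snippet assumes a `layer` constructed earlier in the RFC; the following hedged reconstruction fills in that construction, with the understanding that the constructor arguments and the attention_mask call argument are assumptions about the proposed API rather than a confirmed signature:

```python
import tensorflow as tf

# Assumed construction of the proposed experimental layer; num_heads, key_size,
# and return_attention_scores are illustrative names, not a confirmed signature.
layer = tf.keras.layers.experimental.MultiHeadAttention(
    num_heads=2, key_size=16, return_attention_scores=True)

target = tf.keras.Input(shape=[8, 16])
source = tf.keras.Input(shape=[4, 16])
mask_tensor = tf.keras.Input(shape=[8, 4])
output_tensor, weights = layer([target, source], attention_mask=mask_tensor)
```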
@tomerk, since we started the effort of treating all tensor inputs the same and not specializing the first call arg, should we encourage users not to pass in a list of tensors, and instead have individual kwargs for tensor inputs, like call(query, key, value=None, mask=None)?
I think explicit kwargs will be more readable, since a plain list doesn't carry any information about the individual tensors.
I am OK with that, as Tom has fixed that. But yes, I think we should test real models with a mixed-precision policy.
If we use individual kwargs, the `Attention` layer interface, which takes a list, may need to be made consistent. Using inputs as a list is intentional, to be consistent with `Attention` and the TFA implementation. Do we want to change that?
I think the first input tensor was specially handled before Tomer's change, which meant you had to use a list of tensors if there were multiple inputs. Now you can have individual tensors as kwargs. I think the kwargs will be more readable, and could set an example for users when there are multiple inputs. @fchollet and @tomerk, WDYT?
IMO kwargs are much better, not only for readability, but also because they remove the burden on the user of checking the consistency of the positional arguments passed.
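To make the two conventions in this thread concrete, a hedged contrast using the tensors from the reconstruction above; the keyword names in variant (b) follow the call(query, key, value=None, mask=None) suggestion and are not a final signature:

```python
# (a) List-of-tensors convention, consistent with tf.keras.layers.Attention and
#     the TFA layer: the positions in the list carry the meaning.
outputs, weights = layer([target, source], attention_mask=mask_tensor)

# (b) Explicit keyword arguments, as suggested in this thread: each tensor's role
#     is named at the call site, so no positional bookkeeping is needed.
outputs, weights = layer(query=target, key=source, value=source, mask=mask_tensor)
```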
> `tf.keras.layers.experimental.EinsumDense`. When the APIs are stable and functionalities are fully verified, the next step is to graduate them as core Keras layers by removing the `experimental` scope.
@fchollet, for the existing keras.layers.Attention and keras.layers.AdditiveAttention, I think we should add some clarification so that users don't confuse them with the new multi-head attention. I expect most users will use multi-head attention.
We will clarify that Attention and AdditiveAttention are really just the attention computation. They are subsets of MultiHeadAttention, which is commonly used in NLP and transformer-based vision models and shares the same scope as other ML libraries.
> `tf.keras.layers.Attention`. In the reduced case of `tf.keras.layers.Attention`, the shape is (batch_size, target_length, source_length). Whenever `attention_mask` is specified, the `mask` argument can be skipped.
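As a concrete illustration of that mask shape, a small sketch of building a (batch_size, target_length, source_length) padding mask with plain TF ops; the lengths are made up for the example:

```python
import tensorflow as tf

# Two examples in the batch, with 4 and 2 valid source positions respectively.
source_lengths = tf.constant([4, 2])
target_length, source_length = 8, 4

# (batch, source_length) boolean mask of valid (non-padding) source positions...
source_valid = tf.sequence_mask(source_lengths, maxlen=source_length)
# ...broadcast to (batch, target_length, source_length): every target step may
# attend to every non-padding source step.
attention_mask = tf.tile(source_valid[:, tf.newaxis, :], [1, target_length, 1])
```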
I find this confusing. Can you be more explicit about the differences between the two attention layers? If this is a generalization, should there be an inheritance relationship here? If MultiHead is a generalization, why not just update the existing Attention layer to handle this case? We will need clear guidance for users as to when to use one versus the other.
The MultiHeadAttention layer contains projection layers for the Q, K, V inputs and the outputs. The attention computation is multi-headed dot-product attention. The proposal follows the same scope as other ML libraries; please check them out. The module is commonly used in NLP and new vision research.
The current Keras Attention layer is the single-head attention computation and does not use einsum. It can be used to implement the attention computation part of MultiHeadAttention (see the sketch below).
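To make that relationship concrete, a minimal sketch, not the proposed implementation, of how multi-head dot-product attention wraps EinsumDense projections around the same computation the single-head layer performs; all names, shapes, and the helper function are illustrative:

```python
import tensorflow as tf

num_heads, key_dim, model_dim = 2, 32, 16

# Per-head projections: (batch, seq, features) -> (batch, seq, heads, key_dim).
q_proj = tf.keras.layers.experimental.EinsumDense(
    "abc,cde->abde", output_shape=(None, num_heads, key_dim), bias_axes="de")
k_proj = tf.keras.layers.experimental.EinsumDense(
    "abc,cde->abde", output_shape=(None, num_heads, key_dim), bias_axes="de")
v_proj = tf.keras.layers.experimental.EinsumDense(
    "abc,cde->abde", output_shape=(None, num_heads, key_dim), bias_axes="de")
# Output projection back to the model dimension.
o_proj = tf.keras.layers.experimental.EinsumDense(
    "abde,dec->abc", output_shape=(None, model_dim), bias_axes="c")

def multi_head_attention(query, source):
    # Illustrative helper, not the proposed layer.
    q, k, v = q_proj(query), k_proj(source), v_proj(source)
    # Scaled dot-product attention per head: (batch, heads, target_len, source_len).
    scores = tf.einsum("abde,afde->adbf", q, k)
    scores /= tf.math.sqrt(tf.cast(key_dim, scores.dtype))
    weights = tf.nn.softmax(scores, axis=-1)
    # Weighted sum of the values: back to (batch, target_len, heads, key_dim).
    context = tf.einsum("adbf,afde->abde", weights, v)
    return o_proj(context), weights

target = tf.random.normal((3, 8, model_dim))   # (batch, target_length, features)
source = tf.random.normal((3, 4, model_dim))   # (batch, source_length, features)
outputs, attention_weights = multi_head_attention(target, source)
# outputs: (3, 8, 16); attention_weights: (3, 2, 8, 4)
```

The single-head Keras Attention layer corresponds to the softmax-weighted dot product in the middle of this sketch, without the learned per-head projections around it.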
Is the `tf.keras.layers.Attention` design flexible enough to support the research proliferation of dense-attention alternatives?
@bhack has a good comment. We will meet with the team and check the plan for the keras.Attention layer.
For NLP research, people may prefer direct access to the attention computation with ops.
Notes from the review meeting:
Q2: The relation between the existing Attention layer and MultiHeadAttention. Is it possible to make Attention part of MultiHeadAttention, or to make MultiHeadAttention inherit from Attention?
Q3: Do we agree to use keyword arguments for tensor inputs?
@saberkun Are we ready to merge, or did you want to check anything else before I push all the merge buttons?
Update two proposed changes to the existing Attention layer
Yes, we are ready to merge. I have chatted with the team to follow up on the AIs (action items) after the review. It is clear now. Thanks!