
[gpt2pre 4] GPT2Preprocessor Layer #7814

Merged — 7 commits merged into tensorflow:master from gpt2preprocessor on Jul 21, 2023

Conversation

@pforderique (Contributor) commented Jul 10, 2023

This PR implements the GPT2Preprocessor class with its call method.
Dependencies: #7791 #7806

[UPDATE 7-11-23]
After experimenting with adding generic input/output types to Layer, I found the refactoring would be more involved than expected, since many util methods expect Tensor|Tensor[] or SymbolicTensor|SymbolicTensor[] as their inputs. I have documented this at go/TFJS-layer-generics and will be using the original design outlined below.

[UPDATE 7-10-23]
After discussion with @mattsoulanille today (7-10-23), I will be exploring changes to the base Layer class to allow different input and output types for the call method. This would let us support string inputs for Tokenizers and any other special I/O needed by the other NLP layers.

[ORIGINAL]
Here is the original plan I had, as implemented in this PR:
I had to deviate from the reference implementation a bit more here because the types can get tricky. Since GPT2Preprocessor extends the Layer class (it really extends Preprocessor, which extends Layer), its overridden call() method must return Tensor|Tensor[].

However, the Keras implementation allows a bit more flexibility by returning the result of pack_x_y_sample_weight(), which returns a packed tuple of x, y, and sample_weight.
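For comparison, the contract of pack_x_y_sample_weight could be rendered in TS roughly as the following hypothetical type (the tuple arity depends on which arguments are present; this is for illustration only, not code from this PR):

```ts
import {Tensor} from '@tensorflow/tfjs';

// Hypothetical TS rendering of Keras' pack_x_y_sample_weight result:
// x alone, (x, y), or (x, y, sample_weight), depending on what's provided.
type PackedArgs<X> = X|[X, Tensor]|[X, Tensor, Tensor];
```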

Designs Considered:

  • I considered making the TS version, packXYSampleWeight(), return a single Tensor packing x, y, and sample_weight so that it could be called and returned directly from call(), as Keras does. But since x is often an object of type {tokenIds: Tensor|Tensor[], paddingMask: Tensor[]}, it cannot be wrapped in a tensor.
  • Flattening the results into one large Tensor[] is also an option, but since tokenIds will most likely be a Tensor[] itself (it's the result of calling tokenize() on a tensor of strings), call would have to support a return type of Tensor|Tensor[]|(Tensor[]|Tensor)[], which gets confusing. It would also be unclear which output lives at which position of the flattened array.
  • Create a new method callAndReturnPaddingMask that returns a tuple of tokenIds and paddingMask, have call return the first element (tokenIds), and skip returning y and sample_weight since this class doesn't mutate them. This option is more feasible; however, looking ahead, the GPT2CausalLMPreprocessor class does eventually change y and sample_weight, so it's worth figuring out how to support them now.
  • [DESIGN CHOSEN] I've opted to add a new method, callAndPackArgs, that returns an array of the arguments packed similarly to the Keras version; call simply returns a subset of these outputs to keep the class Layer-compatible.

Since the GPT2 layers always pass x as {tokenIds: Tensor|Tensor[], paddingMask: Tensor[]}, I've declared this shape as a PreprocessorOutput interface and removed the Array checking in packXYSampleWeight. A rough sketch of the chosen design is below.
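Here's a minimal sketch of the idea (not the actual PR code; the class name, kwargs names, and the tokenize/computeMask helpers are stand-ins for illustration):

```ts
import * as tf from '@tensorflow/tfjs';
import {Tensor} from '@tensorflow/tfjs';

// The structured x that the GPT2 layers pass around (per this PR).
interface PreprocessorOutput {
  tokenIds: Tensor|Tensor[];
  paddingMask: Tensor[];
}

class GPT2PreprocessorSketch extends tf.layers.Layer {
  static className = 'GPT2PreprocessorSketch';

  // Packs [x, y, sampleWeight] into a typed tuple, analogous to Keras'
  // pack_x_y_sample_weight. x is an object here, so it can't be a Tensor.
  callAndPackArgs(
      inputs: Tensor|Tensor[],
      kwargs: {y?: Tensor, sampleWeight?: Tensor}):
      [PreprocessorOutput, Tensor|undefined, Tensor|undefined] {
    const tokenIds = this.tokenize(inputs);
    const paddingMask = this.computeMask(tokenIds);
    return [{tokenIds, paddingMask}, kwargs.y, kwargs.sampleWeight];
  }

  // Layer-compatible call(): returns only the tokenIds subset.
  override call(inputs: Tensor|Tensor[],
                kwargs: {y?: Tensor, sampleWeight?: Tensor}):
      Tensor|Tensor[] {
    const [x] = this.callAndPackArgs(inputs, kwargs);
    return x.tokenIds;
  }

  // Stand-in for the real tokenizer from #7791/#7806.
  private tokenize(inputs: Tensor|Tensor[]): Tensor[] {
    return Array.isArray(inputs) ? inputs : [inputs];
  }

  // Stand-in mask: all ones, same shape as each tokenIds tensor.
  private computeMask(tokenIds: Tensor[]): Tensor[] {
    return tokenIds.map(t => tf.onesLike(t));
  }
}
```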

@pforderique pforderique changed the title [Draft: gpt2pre 4] GPT2Preprocessor Layer [gpt2pre 4] GPT2Preprocessor Layer Jul 11, 2023
@pforderique pforderique marked this pull request as ready for review July 11, 2023 17:03
@mattsoulanille (Member) left a comment:
Looks good. I just have some more tests to add.

@mattsoulanille (Member) left a comment:

LGTM

@mattsoulanille (Member) commented:

@Linchenn Please take a look when you get a chance. Thanks!

@Linchenn (Collaborator) left a comment:

LGTM!

@pforderique pforderique merged commit 81abd7b into tensorflow:master Jul 21, 2023
2 checks passed
@pforderique pforderique deleted the gpt2preprocessor branch July 21, 2023 22:52