
[gpt2pre 4] GPT2Preprocessor Layer #7814

Merged — 7 commits merged into tensorflow:master from gpt2preprocessor on Jul 21, 2023

Conversation

@pforderique (Contributor) commented Jul 10, 2023

This PR implements the GPT2Preprocessor class with its call method.
Dependencies: #7791 #7806

[UPDATE 7-11-23]
After experimenting with adding generic input/output types to Layer, I found the refactoring would be more involved than expected, since many util methods expect Tensor|Tensor[] or SymbolicTensor|SymbolicTensor[] as their inputs. I have documented this at go/TFJS-layer-generics and will be using the original design outlined below.

[UPDATE 7-10-23]
After discussion with @mattsoulanille today (7-10-23), I will be exploring changes to the base Layer class to allow different input and output types for the call method. This would let us support string inputs for Tokenizers and any other special I/O needed by the other NLP layers.

[ORIGINAL]
Here is the original plan I had, as implemented in this PR:
I had to deviate from the reference implementation a bit more here because the types can get tricky. Since GPT2Preprocessor extends the Layer class (it really extends Preprocessor, which extends Layer), its overridden call() method must return Tensor|Tensor[].

However, the Keras implementation allows a bit more flexibility by returning the result of pack_x_y_sample_weight(), which returns a packed tuple of x, y, and sample_weight.
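For comparison, the contract of pack_x_y_sample_weight could be rendered in TS roughly as the following hypothetical type (the tuple arity depends on which arguments are present; this is for illustration only, not code from this PR):

```ts
import {Tensor} from '@tensorflow/tfjs';

// Hypothetical TS rendering of Keras' pack_x_y_sample_weight result:
// x alone, (x, y), or (x, y, sample_weight), depending on what's provided.
type PackedArgs<X> = X|[X, Tensor]|[X, Tensor, Tensor];
```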

Designs Considered:

  • I considered making the TS version, packXYSampleWeight(), return a single Tensor packing x, y, and sample_weight so that it could be called and returned directly from call(), as Keras does. But since x is often an object of type {tokenIds: Tensor|Tensor[], paddingMask: Tensor[]}, it cannot be wrapped in a tensor.
  • Flattening the results into one large Tensor[] is also an option, but since tokenIds will most likely be a Tensor[] itself (it's the result of calling tokenize() on a tensor of strings), call would have to support a return type of Tensor|Tensor[]|(Tensor[]|Tensor)[], which gets confusing. It would also be unclear which output lives at which position of the flattened array.
  • Create a new method callAndReturnPaddingMask that returns a tuple of tokenIds and paddingMask, have call return the first element (tokenIds), and skip returning y and sample_weight since this class doesn't mutate them. This option is more feasible; however, looking ahead, the GPT2CausalLMPreprocessor class does eventually change y and sample_weight, so it's worth figuring out how to support them now.
  • [DESIGN CHOSEN] I've opted to add a new method, callAndPackArgs, that returns an array of the arguments packed similarly to the Keras version; call simply returns a subset of these outputs to keep the class Layer-compatible.

Since the GPT2 layers always pass x as {tokenIds: Tensor|Tensor[], paddingMask: Tensor[]}, I've declared this shape as a PreprocessorOutput interface and removed the Array checking in packXYSampleWeight. A rough sketch of the chosen design is below.
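Here's a minimal sketch of the idea (not the actual PR code; the class name, kwargs names, and the tokenize/computeMask helpers are stand-ins for illustration):

```ts
import * as tf from '@tensorflow/tfjs';
import {Tensor} from '@tensorflow/tfjs';

// The structured x that the GPT2 layers pass around (per this PR).
interface PreprocessorOutput {
  tokenIds: Tensor|Tensor[];
  paddingMask: Tensor[];
}

class GPT2PreprocessorSketch extends tf.layers.Layer {
  static className = 'GPT2PreprocessorSketch';

  // Packs [x, y, sampleWeight] into a typed tuple, analogous to Keras'
  // pack_x_y_sample_weight. x is an object here, so it can't be a Tensor.
  callAndPackArgs(
      inputs: Tensor|Tensor[],
      kwargs: {y?: Tensor, sampleWeight?: Tensor}):
      [PreprocessorOutput, Tensor|undefined, Tensor|undefined] {
    const tokenIds = this.tokenize(inputs);
    const paddingMask = this.computeMask(tokenIds);
    return [{tokenIds, paddingMask}, kwargs.y, kwargs.sampleWeight];
  }

  // Layer-compatible call(): returns only the tokenIds subset.
  override call(inputs: Tensor|Tensor[],
                kwargs: {y?: Tensor, sampleWeight?: Tensor}):
      Tensor|Tensor[] {
    const [x] = this.callAndPackArgs(inputs, kwargs);
    return x.tokenIds;
  }

  // Stand-in for the real tokenizer from #7791/#7806.
  private tokenize(inputs: Tensor|Tensor[]): Tensor[] {
    return Array.isArray(inputs) ? inputs : [inputs];
  }

  // Stand-in mask: all ones, same shape as each tokenIds tensor.
  private computeMask(tokenIds: Tensor[]): Tensor[] {
    return tokenIds.map(t => tf.onesLike(t));
  }
}
```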

@pforderique pforderique changed the title [Draft: gpt2pre 4] GPT2Preprocessor Layer [gpt2pre 4] GPT2Preprocessor Layer Jul 11, 2023
@pforderique pforderique marked this pull request as ready for review July 11, 2023 17:03
@mattsoulanille (Member) left a comment:
Looks good. I just have some more tests to add.

@mattsoulanille (Member) left a comment:

LGTM

@mattsoulanille (Member) commented:

@Linchenn Please take a look when you get a chance. Thanks!

@Linchenn (Collaborator) left a comment:

LGTM!

@pforderique pforderique merged commit 81abd7b into tensorflow:master Jul 21, 2023
2 checks passed
@pforderique pforderique deleted the gpt2preprocessor branch July 21, 2023 22:52