
[BPE PR 2.5] BytePairTokenizer tokenize function. #7780

Merged: 101 commits merged into tensorflow:master from the tokenize-bpe branch on Jul 5, 2023

Conversation

pforderique (Contributor)

This PR implements the tokenize functionality of BytePairTokenizer and associated test cases.

Depends on BPE PR #7770 and #7774.
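
For readers skimming the thread, a hypothetical usage sketch of the new tokenize path. The import path, constructor options, and vocabulary/merge values below are assumptions modeled on the upstream keras-nlp BytePairTokenizer, not the verified tfjs-layers API:

```ts
// Hypothetical usage sketch; the class options, import path, and toy values
// are assumptions modeled on keras-nlp's BytePairTokenizer, not verified API.
import * as tf from '@tensorflow/tfjs-core';
import { BytePairTokenizer } from './byte_pair_tokenizer';  // assumed path

// Toy vocabulary and merge rules, for illustration only.
const vocabulary = new Map<string, number>([['butter', 1], ['fly', 2]]);
const merges = ['b u', 't t', 'e r', 'bu tt', 'butt er', 'f l', 'fl y'];

const tokenizer = new BytePairTokenizer({vocabulary, merges});
const tokens = tokenizer.tokenize(tf.tensor(['butterfly']));
console.log(tokens);  // Expected to correspond to the ids for 'butter' and 'fly'.
```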

pforderique and others added 30 commits June 14, 2023 21:46
Co-authored-by: Matthew Soulanille <matthew@soulanille.net>
@Linchenn (Collaborator)

Overall LGTM, great work! Could you help add some comments for BPE, like https://github.com/keras-team/keras-nlp/blob/master/keras_nlp/tokenizers/byte_pair_tokenizer.py#L202?

Review thread on tfjs-layers/src/layers/nlp/tokenizers.ts (outdated, resolved)
Comment on lines 507 to 508
const hasUnseenWords = cacheMask.map(
(bool, idx) => bool && emptyFlatTokensMask[idx]).some(e => e);
mattsoulanille (Member)

Nit: A for loop is more efficient than creating an array of booleans and calling .some on it since it can break out of the loop early when it finds a true value. This might not matter since the array is likely small.

JS should really have a builtin way to chain functions like map and some as generators so that some can exit early and not call map's function on the rest of the elements of the array. Unfortunately, it doesn't, as far as I know.
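
For illustration, a minimal sketch of the early-exit loop the reviewer describes, assuming cacheMask and emptyFlatTokensMask are plain boolean arrays as in the snippet above; this shows the suggestion only and is not the change that was ultimately merged:

```ts
// Early-exit version of the check: scan both masks in a single loop and
// stop as soon as one unseen word is found, without building a mapped array.
let hasUnseenWords = false;
for (let i = 0; i < cacheMask.length; i++) {
  if (cacheMask[i] && emptyFlatTokensMask[i]) {
    hasUnseenWords = true;
    break;  // Stop scanning at the first match.
  }
}
```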

pforderique (Contributor, Author)

Good catch! Since some() can break early, I've rewritten this so the intermediate map calls are skipped.
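
One way to apply this, sketched under the same assumptions as above (the exact rewrite in the merged commit may differ), is to pass the predicate to some() directly, so no intermediate array is built and the scan still stops at the first match:

```ts
// Sketch: inline the predicate into some() so it can return early without
// first materializing a mapped boolean array. Not necessarily the merged change.
const hasUnseenWords = cacheMask.some(
    (bool, idx) => bool && emptyFlatTokensMask[idx]);
```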

Further review threads on tfjs-layers/src/layers/nlp/tokenizers.ts (resolved)
mattsoulanille mentioned this pull request on Jul 5, 2023
Linchenn self-requested a review on July 5, 2023, 18:49
@pforderique (Contributor, Author)

@mattsoulanille comments have been resolved. Thanks!

mattsoulanille (Member) left a comment:

LGTM

Review thread on tfjs-layers/src/layers/nlp/tokenizers_test.ts (outdated, resolved)
mattsoulanille enabled auto-merge (squash) on July 5, 2023, 22:02
mattsoulanille merged commit ee54f7e into tensorflow:master on Jul 5, 2023
2 checks passed
pforderique deleted the tokenize-bpe branch on July 5, 2023, 23:03