[BPE PR 2.5] BytePairTokenizer tokenize function. #7780
Co-authored-by: Matthew Soulanille <matthew@soulanille.net>
Overall LGTM, great work! Could you help add some comments for BPE, like https://github.com/keras-team/keras-nlp/blob/master/keras_nlp/tokenizers/byte_pair_tokenizer.py#L202?
const hasUnseenWords = cacheMask.map(
    (bool, idx) => bool && emptyFlatTokensMask[idx]).some(e => e);
Nit: A for loop is more efficient than creating an array of booleans and calling `.some` on it, since it can break out of the loop early when it finds a true value. This might not matter since the array is likely small.
JS should really have a builtin way to chain functions like `map` and `some` as generators so that `some` can exit early and not call `map`'s function on the rest of the elements of the array. Unfortunately, it doesn't, as far as I know.
Good catch! I've realized that `some()` can break early, so I've rewritten this to skip some of the middle `map` calls.
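For reference, here is a minimal sketch of the two early-exit options discussed in this thread. It assumes `cacheMask` and `emptyFlatTokensMask` are plain boolean arrays of the same length; the helper names `anyBothTrueLoop` and `anyBothTrueSome` are hypothetical and are not taken from the PR, whose actual rewrite may differ.

```typescript
// Hypothetical helper: explicit loop that returns as soon as an index
// is true in both masks, without allocating an intermediate array.
function anyBothTrueLoop(maskA: boolean[], maskB: boolean[]): boolean {
  const n = Math.min(maskA.length, maskB.length);
  for (let i = 0; i < n; i++) {
    if (maskA[i] && maskB[i]) {
      return true;  // Early exit: later elements are never inspected.
    }
  }
  return false;
}

// Hypothetical helper: the same check with a single some() call.
// some() passes the index to its callback and stops at the first
// truthy result, so the separate map() pass is not needed.
function anyBothTrueSome(maskA: boolean[], maskB: boolean[]): boolean {
  return maskA.some((value, idx) => value && maskB[idx]);
}
```

Under these assumptions, `anyBothTrueSome(cacheMask, emptyFlatTokensMask)` would yield the same boolean as the original `map(...).some(...)` chain while skipping the intermediate array.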
@mattsoulanille comments have been resolved. Thanks!
LGTM
This PR implements the tokenize functionality of BytePairTokenizer and associated test cases.
Depends on BPE PR #7770 and #7774.