BPE Detokenize #7788

Merged
merged 110 commits into from
Jul 6, 2023

pforderique
Contributor

Implements the detokenize method for BytePairEncoding.

Note:
Keras's detokenize() currently uses Python byte strings (b'my text here'), so its implementation differs somewhat from the one here. For now I've added a test case that works with both implementations and checks that detokenize(tokenize(input)) equals the original input, but I may be missing something because of the part of the implementation I omitted. That can be deferred to a future clean-up PR; this should be okay for now.
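
The round-trip property described in the note can be illustrated with a small, self-contained sketch. This is not the code from this PR and does not use the TensorFlow.js or Keras APIs: the vocabulary, the `tokenize` and `detokenize` functions, and the greedy longest-match strategy (a stand-in for real BPE merging) are all hypothetical, chosen only to show what detokenize(tokenize(input)) returning the original input looks like.

```typescript
// Hypothetical toy vocabulary. A real BPE vocabulary is learned from data.
const vocab: string[] = [
  'h', 'e', 'l', 'o', ' ', 'w', 'r', 'd', 'he', 'll', 'wor', 'ld',
];

// Reverse lookup from token string to token id (the array index).
const tokenToId = new Map<string, number>();
vocab.forEach((token, id) => tokenToId.set(token, id));

// Greedy longest-match tokenization. This is an assumption made for
// illustration; real BPE applies learned merge rules instead.
function tokenize(text: string): number[] {
  const ids: number[] = [];
  let pos = 0;
  while (pos < text.length) {
    let matched = false;
    // Try the longest candidate substring first, then shrink it.
    for (let end = text.length; end > pos; end--) {
      const id = tokenToId.get(text.slice(pos, end));
      if (id !== undefined) {
        ids.push(id);
        pos = end;
        matched = true;
        break;
      }
    }
    if (!matched) {
      throw new Error(`No token covers the character at position ${pos}`);
    }
  }
  return ids;
}

// Detokenization maps each id back to its token string and concatenates.
function detokenize(ids: number[]): string {
  return ids
    .map((id) => {
      const token = vocab[id];
      if (token === undefined) {
        throw new Error(`Unknown token id ${id}`);
      }
      return token;
    })
    .join('');
}

// Round-trip check: detokenize(tokenize(input)) should equal input.
const input = 'hello world';
const roundTrip = detokenize(tokenize(input));
console.log(roundTrip === input);
```

Because every token maps back to exactly the substring it was produced from, concatenating the decoded tokens reproduces the input. A byte-string implementation such as the one the note attributes to Keras would need an extra decoding step at this point, which is the difference the note flags.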

pforderique and others added 30 commits June 14, 2023 21:46
Co-authored-by: Matthew Soulanille <matthew@soulanille.net>
@mattsoulanille
Member

LGTM once #7780 is merged.

Collaborator

@Linchenn Linchenn left a comment

LGTM!

@pforderique pforderique merged commit 5f779ea into tensorflow:master Jul 6, 2023
2 checks passed
@pforderique pforderique deleted the detokenize-bpe branch July 6, 2023 17:35
3 participants