[REVIEW] Port of clx subword tokenizer to cudf #5511
Conversation
Please update the changelog in order to start CI tests. View the gpuCI docs here.
Codecov Report
@@            Coverage Diff             @@
##           branch-0.15   #5511      +/-   ##
===============================================
+ Coverage       85.87%   86.15%   +0.27%
===============================================
  Files              72       72
  Lines           12181    12926     +745
===============================================
+ Hits            10461    11136     +675
- Misses           1720     1790      +70
Continue to review full report at Codecov.
rerun tests
Great work. Thanks for answering all my comments.
Reference #4981
The tokenizer is added to nvtext.
C++ API interface
A second API allows the caller to load the hashed vocabulary in a separate call, so it can be reused to tokenize multiple batches with the same vocabulary token ids.
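The two-call pattern described above can be sketched as follows. This is an illustrative Python sketch, not the actual cudf/nvtext API: the names `load_vocabulary`, `HashedVocabulary`, and `subword_tokenize` are hypothetical, and the real implementation loads a hashed vocabulary file on the GPU rather than a plain dictionary.

```python
# Illustrative sketch of separating vocabulary loading from tokenization
# so one loaded vocabulary can serve many batches (names are hypothetical).
class HashedVocabulary:
    def __init__(self, mapping):
        # mapping: token string -> token id
        self.mapping = mapping

def load_vocabulary(mapping):
    # Call 1: load (and in the real library, hash) the vocabulary once.
    return HashedVocabulary(mapping)

def subword_tokenize(text, vocab, unk_id=0):
    # Call 2: tokenize a batch against the already-loaded vocabulary.
    return [vocab.mapping.get(w, unk_id) for w in text.split()]

vocab = load_vocabulary({"hello": 5, "world": 7})
ids_a = subword_tokenize("hello world", vocab)
ids_b = subword_tokenize("world hello", vocab)  # same vocab object reused
```

Keeping the vocabulary resident avoids re-parsing the vocabulary file for every batch, which is the motivation for the second API.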
The tokenization process occurs in two steps after the vocabulary file is loaded. The first step normalizes the data: optionally lower-casing characters, removing accents, mapping spaces, adding padding, and resolving some multi-byte characters. The second step identifies the tokens (words between whitespace) and maps them to token ids from the vocabulary file.
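A minimal CPU-side sketch of those two steps, assuming a plain dictionary vocabulary (the real implementation runs both steps on the GPU against a hashed vocabulary):

```python
import unicodedata

def normalize(text, do_lower=True):
    # Step 1: optionally lower-case, strip accents (NFD-decompose and
    # drop combining marks), and collapse runs of whitespace.
    if do_lower:
        text = text.lower()
    decomposed = unicodedata.normalize("NFD", text)
    no_accents = "".join(c for c in decomposed if not unicodedata.combining(c))
    return " ".join(no_accents.split())

def tokenize(text, vocab, unk_id=0):
    # Step 2: identify tokens (words between whitespace) and map each
    # to its id in the vocabulary, with a fallback id for unknowns.
    return [vocab.get(t, unk_id) for t in normalize(text).split()]

vocab = {"resume": 3, "review": 4}
print(tokenize("Résumé  review", vocab))  # [3, 4]
```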
The result is three columns of integers. The first is the set of token ids assigned to each word as found in the vocabulary file. The second is an attention mask identifying which token values are valid. The third is a vector of metadata per output row.
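To illustrate the shape of the three output columns, here is a small sketch for a single row. The fixed sequence length, padding id, and the particular metadata encoding (row index plus first and last valid token positions) are assumptions for illustration, not necessarily the exact layout the PR produces:

```python
def build_outputs(token_ids, max_len, row_index, pad_id=0):
    n = min(len(token_ids), max_len)
    # Column 1: token ids, truncated/padded to a fixed sequence length.
    ids = token_ids[:max_len] + [pad_id] * (max_len - n)
    # Column 2: attention mask, 1 where a real token is present.
    mask = [1] * n + [0] * (max_len - n)
    # Column 3: per-row metadata -- here, (input row index, first valid
    # token position, last valid token position); encoding is assumed.
    meta = [row_index, 0, n - 1]
    return ids, mask, meta

ids, mask, meta = build_outputs([5, 7, 9], max_len=5, row_index=0)
```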
This PR includes gtest and gbenchmark code that generates and uses a minimal vocabulary hash.
Also included is a starting set of Python/Cython interfaces to match the CLX version.