
[REVIEW] Port of clx subword tokenizer to cudf #5511

Merged — 68 commits merged into rapidsai:branch-0.15 on Jul 9, 2020

Conversation

@davidwendt (Contributor) commented Jun 18, 2020

Reference #4981

The tokenizer is added to nvtext.

C++ API interface

std::unique_ptr<TokenizerResult> subword_tokenize(
  cudf::strings_column_view const& sentences,     // input strings to tokenize
  std::string const& filename_hashed_vocabulary,  // path to the hashed vocabulary file
  uint32_t max_sequence_length,                   // maximum number of token ids per output row
  uint32_t stride,                                // token overlap between rows when a sentence is split
  bool do_lower,                                  // lower-case and strip accents during normalization
  bool do_truncate,                               // discard overflow tokens instead of splitting into new rows
  uint32_t max_num_sentences,                     // upper bound on input sentences (for memory sizing)
  uint32_t max_num_chars,                         // upper bound on total input characters
  uint32_t max_rows_tensor,                       // upper bound on output tensor rows
  rmm::mr::device_memory_resource* mr = rmm::mr::get_default_resource());

A second API allows the caller to load the hashed vocabulary in a separate call so that it can be reused to tokenize multiple batches with the same vocabulary token ids.
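The load-once, tokenize-many pattern can be sketched in pure Python (this is illustrative only, not the cudf/nvtext API; `load_vocabulary` and `tokenize_batch` are hypothetical names, and whitespace splitting stands in for real subword tokenization):

```python
def load_vocabulary(tokens):
    # hypothetical loader: token id = position in the vocabulary list
    return {tok: i for i, tok in enumerate(tokens)}

def tokenize_batch(batch, vocab, unk_id=0):
    # whitespace tokenization only; the real tokenizer also handles subwords
    return [[vocab.get(tok, unk_id) for tok in sentence.split()]
            for sentence in batch]

# the vocabulary is loaded once ...
vocab = load_vocabulary(["[UNK]", "hello", "world"])

# ... and reused across batches
first = tokenize_batch(["hello world"], vocab)   # [[1, 2]]
second = tokenize_batch(["world hello"], vocab)  # [[2, 1]]
```

Avoiding a vocabulary re-load per batch matters because the hashed vocabulary file read is pure overhead when the token ids do not change between batches.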

The tokenize process occurs in two steps after the vocabulary file is loaded. The first step normalizes the data: optionally lower-casing the characters, removing accents, mapping spaces, adding padding, and resolving some multi-byte characters. The second step identifies the tokens (words between whitespace) and maps them to token ids from the vocabulary file.
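The two steps can be sketched in pure Python (a minimal illustration of the normalize-then-lookup flow, not the GPU implementation; the vocabulary dict and function names here are invented for the example):

```python
import unicodedata

def normalize(text, do_lower=True):
    # step 1: optionally lower-case, strip accents, canonicalize whitespace
    if do_lower:
        text = text.lower()
    decomposed = unicodedata.normalize("NFD", text)
    stripped = "".join(c for c in decomposed
                       if unicodedata.category(c) != "Mn")  # drop accent marks
    return " ".join(stripped.split())

def to_token_ids(text, vocab, unk_id=0):
    # step 2: split on whitespace and map each token to its vocabulary id
    return [vocab.get(tok, unk_id) for tok in text.split()]

vocab = {"[UNK]": 0, "resume": 1, "review": 2}
ids = to_token_ids(normalize("Résumé   REVIEW"), vocab)  # [1, 2]
```

The real kernel additionally handles multi-byte character resolution and subword splitting against the hashed vocabulary, which this whitespace-only sketch omits.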

The result is three columns of integers. The first is the set of token ids assigned to each word as found in the vocabulary file. The second is an attention mask identifying which token values are valid. The third is a vector of metadata per output row.
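The relationship between the token-id column and the attention-mask column can be shown with a small sketch (illustrative only; `pad_and_mask` is an invented helper, and the pad id of 0 is an assumption):

```python
def pad_and_mask(token_ids, max_sequence_length, pad_id=0):
    # token ids are padded to a fixed row width; the attention mask marks
    # which positions hold real tokens (1) versus padding (0)
    ids = token_ids[:max_sequence_length]
    mask = [1] * len(ids) + [0] * (max_sequence_length - len(ids))
    ids = ids + [pad_id] * (max_sequence_length - len(ids))
    return ids, mask

ids, mask = pad_and_mask([5, 9, 3], max_sequence_length=5)
# ids  -> [5, 9, 3, 0, 0]
# mask -> [1, 1, 1, 0, 0]
```

Downstream models (e.g. BERT-style encoders) consume exactly this pair of fixed-width rows, which is why the tokenizer emits the mask alongside the ids.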

This PR includes gtest and gbenchmark code that runs against a minimal, generated vocabulary hash.
Also included is a starting set of Python/Cython interfaces to match the CLX version.

@davidwendt davidwendt self-assigned this Jun 18, 2020
@davidwendt davidwendt added the 2 - In Progress Currently a work in progress label Jun 18, 2020
@davidwendt davidwendt added this to PR-WIP in v0.15 Release via automation Jun 18, 2020
@GPUtester (Collaborator)

Please update the changelog in order to start CI tests.

View the gpuCI docs here.


codecov bot commented Jun 22, 2020

Codecov Report

Merging #5511 into branch-0.15 will increase coverage by 0.27%.
The diff coverage is 100.00%.


@@               Coverage Diff               @@
##           branch-0.15    #5511      +/-   ##
===============================================
+ Coverage        85.87%   86.15%   +0.27%     
===============================================
  Files               72       72              
  Lines            12181    12926     +745     
===============================================
+ Hits             10461    11136     +675     
- Misses            1720     1790      +70     
Impacted Files                           | Coverage Δ
python/cudf/cudf/core/column/string.py   | 85.91% <100.00%> (-0.29%) ⬇️
python/cudf/cudf/core/column/column.py   | 89.57% <0.00%> (+1.60%) ⬆️
python/cudf/cudf/core/indexing.py        | 98.02% <0.00%> (+2.12%) ⬆️

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 754f292...08a33ed.

Resolved review threads (outdated): python/cudf/cudf/core/column/string.py (1), python/cudf/cudf/tests/test_text.py (3)
@davidwendt (Contributor, Author) commented:

rerun tests

@harrism (Member) left a comment:

Great work. Thanks for answering all my comments.

v0.15 Release automation moved this from PR-Needs review to PR-Reviewer approved Jul 9, 2020
@kkraus14 kkraus14 added 5 - Ready to Merge Testing and reviews complete, ready to merge and removed 3 - Ready for Review Ready for review by team 4 - Needs Review Waiting for reviewer to review or respond labels Jul 9, 2020
@kkraus14 kkraus14 merged commit e81d6a1 into rapidsai:branch-0.15 Jul 9, 2020
v0.15 Release automation moved this from PR-Reviewer approved to Done Jul 9, 2020
@davidwendt davidwendt deleted the port-subword-tokenizer branch July 9, 2020 15:41
Labels
5 - Ready to Merge Testing and reviews complete, ready to merge libcudf Affects libcudf (C++/CUDA) code. strings strings issues (C++ and Python)
Projects
v0.15 Release — Done
Development

Successfully merging this pull request may close these issues.

None yet

7 participants