[REVIEW] Port of clx subword tokenizer to cudf #5511
Conversation
Please update the changelog in order to start CI tests. View the gpuCI docs here.
Codecov Report
@@            Coverage Diff             @@
##           branch-0.15   #5511      +/-   ##
===============================================
+ Coverage       85.87%   86.15%   +0.27%
===============================================
  Files              72       72
  Lines           12181    12926     +745
===============================================
+ Hits            10461    11136     +675
- Misses           1720     1790      +70
Continue to review full report at Codecov.
rerun tests
Great work. Thanks for answering all my comments.
Reference #4981
The tokenizer is added to nvtext.
C++ API interface
A second API allows the caller to load the hashed vocabulary in a separate call, so it can be reused to tokenize multiple batches with the same vocabulary token ids.
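The two-call pattern described above can be sketched as follows. This is an illustrative Python sketch, not the actual cudf/nvtext API: the names `load_vocabulary`, `HashedVocabulary`, and `subword_tokenize` are hypothetical, and the real implementation loads a hashed vocabulary file on the GPU rather than a plain dictionary.

```python
# Illustrative sketch of separating vocabulary loading from tokenization
# so one loaded vocabulary can serve many batches (names are hypothetical).
class HashedVocabulary:
    def __init__(self, mapping):
        # mapping: token string -> token id
        self.mapping = mapping

def load_vocabulary(mapping):
    # Call 1: load (and in the real library, hash) the vocabulary once.
    return HashedVocabulary(mapping)

def subword_tokenize(text, vocab, unk_id=0):
    # Call 2: tokenize a batch against the already-loaded vocabulary.
    return [vocab.mapping.get(w, unk_id) for w in text.split()]

vocab = load_vocabulary({"hello": 5, "world": 7})
ids_a = subword_tokenize("hello world", vocab)
ids_b = subword_tokenize("world hello", vocab)  # same vocab object reused
```

Keeping the vocabulary resident avoids re-parsing the vocabulary file for every batch, which is the motivation for the second API.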
The tokenization process occurs in two steps after the vocabulary file is loaded. The first step normalizes the data: optionally lower-casing characters, removing accents, mapping spaces, adding padding, and resolving some multi-byte characters. The second step identifies the tokens (words between whitespace) and maps them to token ids from the vocabulary file.
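A minimal CPU-side sketch of those two steps, assuming a plain dictionary vocabulary (the real implementation runs both steps on the GPU against a hashed vocabulary):

```python
import unicodedata

def normalize(text, do_lower=True):
    # Step 1: optionally lower-case, strip accents (NFD-decompose and
    # drop combining marks), and collapse runs of whitespace.
    if do_lower:
        text = text.lower()
    decomposed = unicodedata.normalize("NFD", text)
    no_accents = "".join(c for c in decomposed if not unicodedata.combining(c))
    return " ".join(no_accents.split())

def tokenize(text, vocab, unk_id=0):
    # Step 2: identify tokens (words between whitespace) and map each
    # to its id in the vocabulary, with a fallback id for unknowns.
    return [vocab.get(t, unk_id) for t in normalize(text).split()]

vocab = {"resume": 3, "review": 4}
print(tokenize("Résumé  review", vocab))  # [3, 4]
```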
The result is three columns of integers. The first is the set of token ids assigned to each word as found in the vocabulary file. The second is an attention mask identifying which token values are valid. The third is a vector of metadata per output row.
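To illustrate the shape of the three output columns, here is a small sketch for a single row. The fixed sequence length, padding id, and the particular metadata encoding (row index plus first and last valid token positions) are assumptions for illustration, not necessarily the exact layout the PR produces:

```python
def build_outputs(token_ids, max_len, row_index, pad_id=0):
    n = min(len(token_ids), max_len)
    # Column 1: token ids, truncated/padded to a fixed sequence length.
    ids = token_ids[:max_len] + [pad_id] * (max_len - n)
    # Column 2: attention mask, 1 where a real token is present.
    mask = [1] * n + [0] * (max_len - n)
    # Column 3: per-row metadata -- here, (input row index, first valid
    # token position, last valid token position); encoding is assumed.
    meta = [row_index, 0, n - 1]
    return ids, mask, meta

ids, mask, meta = build_outputs([5, 7, 9], max_len=5, row_index=0)
```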
This PR includes gtest and gbenchmark code that generates and uses a minimal vocabulary hash.
Also included is a starting set of Python/Cython interfaces to match the CLX version.