Add BytePairEncoder class to cuDF #13891

Merged Nov 14, 2023 (123 commits)
Changes from 117 commits
c53c47b
Add BytePairEncoder class to cuDF
davidwendt Aug 16, 2023
0783939
Merge branch 'branch-23.10' into bpe-python-api
davidwendt Aug 16, 2023
370086f
nvbench experiment
davidwendt Aug 16, 2023
7e0d32d
Merge branch 'branch-23.10' into bpe-python-api
davidwendt Aug 16, 2023
4603bc9
Merge branch 'branch-23.10' into bpe-python-api
davidwendt Aug 17, 2023
5fff5e9
fix merge conflicts
davidwendt Aug 17, 2023
c167369
add separator parameter
davidwendt Aug 17, 2023
5595818
Merge branch 'branch-23.10' into bpe-python-api
davidwendt Aug 17, 2023
de059a2
Merge branch 'branch-23.10' into bpe-python-api
davidwendt Aug 18, 2023
1803937
Merge branch 'branch-23.10' into bpe-python-api
davidwendt Aug 21, 2023
ecb9b55
Merge branch 'branch-23.10' into bpe-python-api
davidwendt Aug 21, 2023
84a7a25
Merge branch 'bpe-python-api' of github.com:davidwendt/cudf into bpe-…
davidwendt Aug 22, 2023
0423224
Merge branch 'branch-23.10' into bpe-python-api
davidwendt Aug 22, 2023
2eccaaa
Merge branch 'branch-23.10' into bpe-python-api
davidwendt Aug 23, 2023
c0f0447
deprecate loading merge-pairs from a file
davidwendt Aug 23, 2023
3400ce8
Merge branch 'branch-23.10' into bpe-python-api
davidwendt Aug 23, 2023
03892a8
Merge branch 'branch-23.10' into bpe-python-api
davidwendt Aug 24, 2023
e6a0848
use u_char for is-whitespace fn
davidwendt Aug 24, 2023
44a2340
Merge branch 'branch-23.10' into bpe-python-api
davidwendt Aug 24, 2023
54814bd
Merge branch 'branch-23.10' into bpe-python-api
davidwendt Aug 24, 2023
14d87c2
Merge branch 'branch-23.10' into bpe-python-api
davidwendt Aug 25, 2023
a1735ac
use separator in final encoding step
davidwendt Aug 25, 2023
69dad47
Merge branch 'branch-23.10' into bpe-python-api
davidwendt Aug 25, 2023
15083ad
Merge branch 'branch-23.10' into bpe-python-api
davidwendt Aug 28, 2023
d10e894
Merge branch 'bpe-python-api' of github.com:davidwendt/cudf into bpe-…
davidwendt Aug 28, 2023
f392617
remove whitespace checks
davidwendt Aug 28, 2023
82055c6
Merge branch 'branch-23.10' into bpe-python-api
davidwendt Aug 28, 2023
a508030
Merge branch 'branch-23.10' into bpe-python-api
davidwendt Aug 29, 2023
fe81461
Merge branch 'branch-23.10' into bpe-python-api
davidwendt Aug 29, 2023
cb02a25
more efficient pair lookup
davidwendt Aug 29, 2023
721b5b2
Merge branch 'branch-23.10' into bpe-python-api
davidwendt Aug 29, 2023
9eb6666
Merge branch 'branch-23.10' into bpe-python-api
davidwendt Aug 30, 2023
3f0aaf4
fix merge conflict
davidwendt Aug 31, 2023
fb11116
try adding zlib to dependencies.yaml
davidwendt Aug 31, 2023
ea4150d
add zlib to conda env yamls too
davidwendt Aug 31, 2023
346b0cb
undo temp changes
davidwendt Aug 31, 2023
6ad4187
use segmented reduce
davidwendt Aug 31, 2023
bfe9d9e
Merge branch 'branch-23.10' into bpe-python-api
davidwendt Aug 31, 2023
9999b84
Merge branch 'branch-23.10' into bpe-python-api
davidwendt Aug 31, 2023
2fe62a3
Merge branch 'bpe-python-api' of github.com:davidwendt/cudf into bpe-…
davidwendt Aug 31, 2023
9ad9865
Merge branch 'branch-23.10' into bpe-python-api
davidwendt Aug 31, 2023
d52ba46
Merge branch 'branch-23.10' into bpe-python-api
davidwendt Sep 1, 2023
b0fed2b
Merge branch 'branch-23.10' into bpe-python-api
davidwendt Sep 1, 2023
891af21
Merge branch 'branch-23.10' into bpe-python-api
davidwendt Sep 6, 2023
511a076
block per string
davidwendt Sep 7, 2023
5be1216
Merge branch 'branch-23.10' into bpe-python-api
davidwendt Sep 7, 2023
ddcf973
Merge branch 'bpe-python-api' of github.com:davidwendt/cudf into bpe-…
davidwendt Sep 10, 2023
4fe85ad
Merge branch 'branch-23.10' into bpe-python-api
davidwendt Sep 10, 2023
b91c13f
Merge branch 'branch-23.10' into bpe-python-api
davidwendt Sep 12, 2023
56c967f
add rerank working memory
davidwendt Sep 12, 2023
11730bc
Merge branch 'branch-23.10' into bpe-python-api
davidwendt Sep 12, 2023
ba638a3
Merge branch 'bpe-python-api' of github.com:davidwendt/cudf into bpe-…
davidwendt Sep 13, 2023
3e7a233
Merge branch 'branch-23.10' into bpe-python-api
davidwendt Sep 13, 2023
a716c02
Merge branch 'branch-23.10' into bpe-python-api
davidwendt Sep 18, 2023
038809f
fusing reduce and unfusing output size calc
davidwendt Sep 18, 2023
07ad295
Merge branch 'branch-23.10' into bpe-python-api
davidwendt Sep 18, 2023
3924d70
Merge branch 'branch-23.10' into bpe-python-api
davidwendt Sep 19, 2023
b14dcbc
fix merge conflict
davidwendt Sep 19, 2023
fb90f87
Merge branch 'bpe-python-api' of github.com:davidwendt/cudf into bpe-…
davidwendt Sep 19, 2023
4528e6f
move spaces init to main kernel
davidwendt Sep 19, 2023
6b87144
Merge branch 'bpe-python-api' of github.com:davidwendt/cudf into bpe-…
davidwendt Sep 20, 2023
9d177a7
Merge branch 'branch-23.10' into bpe-python-api
davidwendt Sep 20, 2023
5e27d5a
limit adjacent pair search
davidwendt Sep 20, 2023
05a0b60
Merge branch 'branch-23.10' into bpe-python-api
davidwendt Sep 20, 2023
4d1173b
Merge branch 'branch-23.10' into bpe-python-api
davidwendt Sep 22, 2023
ad32601
Merge branch 'branch-23.10' into bpe-python-api
davidwendt Sep 22, 2023
33f2bc7
Merge branch 'branch-23.10' into bpe-python-api
davidwendt Sep 22, 2023
ef498c5
Merge branch 'branch-23.10' into bpe-python-api
davidwendt Sep 24, 2023
66e42af
Merge branch 'branch-23.10' into bpe-python-api
davidwendt Sep 25, 2023
27af2a0
fix merge conflict
davidwendt Sep 25, 2023
5d73279
Merge branch 'branch-23.10' into bpe-python-api
davidwendt Sep 25, 2023
dbb011f
Merge branch 'bpe-python-api' of github.com:davidwendt/cudf into bpe-…
davidwendt Sep 26, 2023
6407245
Merge branch 'branch-23.10' into bpe-python-api
davidwendt Sep 26, 2023
75432d2
fix replace bug
davidwendt Sep 26, 2023
66a10cf
Merge branch 'branch-23.10' into bpe-python-api
davidwendt Sep 26, 2023
5aa4cff
Merge branch 'branch-23.10' into bpe-python-api
davidwendt Sep 27, 2023
d3ca6a0
Merge branch 'branch-23.12' into bpe-python-api
davidwendt Sep 27, 2023
1ce7df6
Merge branch 'branch-23.12' into bpe-python-api
davidwendt Sep 28, 2023
b3765cb
Merge branch 'branch-23.12' into bpe-python-api
davidwendt Sep 28, 2023
92fea3f
Merge branch 'branch-23.12' into bpe-python-api
davidwendt Oct 3, 2023
c0aad86
Merge branch 'branch-23.12' into bpe-python-api
davidwendt Oct 3, 2023
5a955ab
Merge branch 'bpe-python-api' of github.com:davidwendt/cudf into bpe-…
davidwendt Oct 4, 2023
521cd1b
exploit unpairable boundaries
davidwendt Oct 4, 2023
9ed9b98
Merge branch 'branch-23.12' into bpe-python-api
davidwendt Oct 4, 2023
32935b6
Merge branch 'branch-23.12' into bpe-python-api
davidwendt Oct 4, 2023
b978e01
fix sliced input offset parm
davidwendt Oct 5, 2023
e158312
remove re-rank re-init
davidwendt Oct 5, 2023
5f54b37
Merge branch 'branch-23.12' into bpe-python-api
davidwendt Oct 5, 2023
0f3bc94
minor rework: variables and comments
davidwendt Oct 6, 2023
6e859c7
Merge branch 'branch-23.12' into bpe-python-api
davidwendt Oct 6, 2023
747e10f
remove any duplicate tmp offsets
davidwendt Oct 11, 2023
e2c08a8
Merge branch 'branch-23.12' into bpe-python-api
davidwendt Oct 11, 2023
aaeea74
Merge branch 'branch-23.12' into bpe-python-api
davidwendt Oct 12, 2023
1917b17
Merge branch 'branch-23.12' into bpe-python-api
davidwendt Oct 13, 2023
b873e40
Merge branch 'branch-23.12' into bpe-python-api
davidwendt Oct 13, 2023
9cb3e64
Merge branch 'bpe-python-api' of github.com:davidwendt/cudf into bpe-…
davidwendt Oct 13, 2023
e8139a7
Merge branch 'branch-23.12' into bpe-python-api
davidwendt Oct 16, 2023
50f80df
Merge branch 'branch-23.12' into bpe-python-api
davidwendt Oct 16, 2023
10e197f
Merge branch 'branch-23.12' into bpe-python-api
davidwendt Oct 17, 2023
c3c20c9
Merge branch 'branch-23.12' into bpe-python-api
davidwendt Oct 20, 2023
3c2a866
fix tmp-offsets size
davidwendt Oct 23, 2023
8b4e448
Merge branch 'bpe-python-api' of github.com:davidwendt/cudf into bpe-…
davidwendt Oct 23, 2023
ee8b3b7
Merge branch 'branch-23.12' into bpe-python-api
davidwendt Oct 23, 2023
664b96a
Merge branch 'branch-23.12' into bpe-python-api
davidwendt Oct 23, 2023
a28305d
Merge branch 'branch-23.12' into bpe-python-api
davidwendt Oct 24, 2023
de945ab
Merge branch 'branch-23.12' into bpe-python-api
davidwendt Oct 25, 2023
40bd2ae
replaced some device utilities with thrust functions
davidwendt Oct 25, 2023
f81d881
Merge branch 'bpe-python-api' of github.com:davidwendt/cudf into bpe-…
davidwendt Oct 25, 2023
1ca9147
Merge branch 'branch-23.12' into bpe-python-api
davidwendt Oct 25, 2023
6738159
change custom kernel to transform
davidwendt Oct 26, 2023
2d43098
Merge branch 'branch-23.12' into bpe-python-api
davidwendt Oct 26, 2023
544a8e8
fix merge conflicts
davidwendt Nov 6, 2023
f24ce09
add pytest for BytePairEncoder
davidwendt Nov 6, 2023
ecd2062
Merge branch 'branch-23.12' into bpe-python-api
davidwendt Nov 7, 2023
36a331a
Merge branch 'branch-23.12' into bpe-python-api
davidwendt Nov 8, 2023
8d47d77
Merge branch 'branch-23.12' into bpe-python-api
davidwendt Nov 8, 2023
63624ea
Merge branch 'branch-23.12' into bpe-python-api
davidwendt Nov 8, 2023
6075435
Merge branch 'branch-23.12' into bpe-python-api
davidwendt Nov 9, 2023
b278d0f
Merge branch 'branch-23.12' into bpe-python-api
davidwendt Nov 13, 2023
335beb1
change BPE_Merge_Pairs to BPEMergePairs
davidwendt Nov 13, 2023
65cc91d
Merge branch 'bpe-python-api' of github.com:davidwendt/cudf into bpe-…
davidwendt Nov 13, 2023
333a812
Merge branch 'branch-23.12' into bpe-python-api
davidwendt Nov 13, 2023
261385d
Merge branch 'branch-23.12' into bpe-python-api
davidwendt Nov 14, 2023
24 changes: 24 additions & 0 deletions python/cudf/cudf/_lib/cpp/nvtext/byte_pair_encode.pxd
@@ -0,0 +1,24 @@
# Copyright (c) 2023, NVIDIA CORPORATION.

from libcpp.memory cimport unique_ptr
from libcpp.string cimport string

from cudf._lib.cpp.column.column cimport column
from cudf._lib.cpp.column.column_view cimport column_view
from cudf._lib.cpp.scalar.scalar cimport string_scalar


cdef extern from "nvtext/byte_pair_encoding.hpp" namespace "nvtext" nogil:

    cdef struct bpe_merge_pairs "nvtext::bpe_merge_pairs":
        pass

    cdef unique_ptr[bpe_merge_pairs] load_merge_pairs(
        const column_view &merge_pairs
    ) except +

    cdef unique_ptr[column] byte_pair_encoding(
        const column_view &strings,
        const bpe_merge_pairs &merge_pairs,
        const string_scalar &separator
    ) except +
4 changes: 2 additions & 2 deletions python/cudf/cudf/_lib/nvtext/CMakeLists.txt
@@ -13,8 +13,8 @@
 # =============================================================================

 set(cython_sources
-    edit_distance.pyx generate_ngrams.pyx jaccard.pyx minhash.pyx ngrams_tokenize.pyx normalize.pyx
-    replace.pyx stemmer.pyx subword_tokenize.pyx tokenize.pyx
+    byte_pair_encode.pyx edit_distance.pyx generate_ngrams.pyx jaccard.pyx minhash.pyx
+    ngrams_tokenize.pyx normalize.pyx replace.pyx stemmer.pyx subword_tokenize.pyx tokenize.pyx
 )
 set(linked_libraries cudf::cudf)
 rapids_cython_create_modules(
50 changes: 50 additions & 0 deletions python/cudf/cudf/_lib/nvtext/byte_pair_encode.pyx
@@ -0,0 +1,50 @@
# Copyright (c) 2023, NVIDIA CORPORATION.


from cudf.core.buffer import acquire_spill_lock

from libcpp.memory cimport unique_ptr
from libcpp.utility cimport move

from cudf._lib.column cimport Column
from cudf._lib.cpp.column.column cimport column
from cudf._lib.cpp.column.column_view cimport column_view
from cudf._lib.cpp.nvtext.byte_pair_encode cimport (
    bpe_merge_pairs as cpp_bpe_merge_pairs,
    byte_pair_encoding as cpp_byte_pair_encoding,
    load_merge_pairs as cpp_load_merge_pairs,
)
from cudf._lib.cpp.scalar.scalar cimport string_scalar
from cudf._lib.scalar cimport DeviceScalar


cdef class BPE_Merge_Pairs:
    cdef unique_ptr[cpp_bpe_merge_pairs] c_obj

    def __cinit__(self, Column merge_pairs):
        cdef column_view c_pairs = merge_pairs.view()
        with nogil:
            self.c_obj = move(cpp_load_merge_pairs(c_pairs))


@acquire_spill_lock()
def byte_pair_encoding(
    Column strings,
    BPE_Merge_Pairs merge_pairs,
    object separator
):
    cdef column_view c_strings = strings.view()
    cdef DeviceScalar d_separator = separator.device_value
    cdef const string_scalar* c_separator = <const string_scalar*>d_separator\
        .get_raw_ptr()
    cdef unique_ptr[column] c_result
    with nogil:
        c_result = move(
            cpp_byte_pair_encoding(
                c_strings,
                merge_pairs.c_obj.get()[0],
                c_separator[0]
            )
        )

    return Column.from_unique_ptr(move(c_result))
57 changes: 57 additions & 0 deletions python/cudf/cudf/core/byte_pair_encoding.py
@@ -0,0 +1,57 @@
# Copyright (c) 2023, NVIDIA CORPORATION.

from __future__ import annotations

import cudf
from cudf._lib.nvtext.byte_pair_encode import (
    BPE_Merge_Pairs as cpp_merge_pairs,
    byte_pair_encoding as cpp_byte_pair_encoding,
)


class BytePairEncoder:
    """
    Performs byte pair encoding on a strings Series
    using the merge pairs supplied at construction.

    Parameters
    ----------
    merges_pair : cudf.Series
        Strings column of merge pairs

    Returns
    -------
    BytePairEncoder
    """

    def __init__(self, merges_pair: "cudf.Series"):
        self.merge_pairs = cpp_merge_pairs(merges_pair._column)

    def __call__(self, text, separator: str = " "):
        """

        Parameters
        ----------
        text : cudf string series
            The strings to be encoded.
        separator : str
            String placed between the encoded tokens in the output.
            Default is a single space.

        Returns
        -------
        Encoded strings

        Examples
        --------
        >>> import cudf
        >>> from cudf.core.byte_pair_encoding import BytePairEncoder
        >>> mps = cudf.Series(["e n", "i t", "i s", "e s", "en t",
        ...                    "c e", "es t", "en ce", "T h", "Th is",
        ...                    "t est", "s ent", "t h", "th is"])
        >>> bpe = BytePairEncoder(mps)
        >>> str_series = cudf.Series(['This is the sentence', 'thisisit'])
        >>> bpe(str_series)
        0    This is a sent ence
        1             this is it
        dtype: object
        """
        sep = cudf.Scalar(separator, dtype="str")
        result = cpp_byte_pair_encoding(text._column, self.merge_pairs, sep)

        return cudf.Series(result)
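
For readers unfamiliar with how a merge-pairs table drives the encoding, here is a minimal CPU sketch of the textbook greedy BPE merge loop in plain Python. It is an illustration only, not the libcudf kernel: `build_ranks` and `bpe_encode` are hypothetical helper names, the rank of a pair is assumed to be its position in the merge list (earlier entries merge first), and whitespace in the input is not given any special treatment.

```python
from typing import Dict, List, Tuple

Ranks = Dict[Tuple[str, str], int]


def build_ranks(merge_pairs: List[str]) -> Ranks:
    """Map each "left right" entry to its priority (lower rank merges first)."""
    ranks: Ranks = {}
    for rank, entry in enumerate(merge_pairs):
        left, right = entry.split(" ")
        # Keep the first occurrence if an entry is repeated (an assumption).
        ranks.setdefault((left, right), rank)
    return ranks


def bpe_encode(text: str, ranks: Ranks, separator: str = " ") -> str:
    """Greedily merge the lowest-ranked adjacent pair until none remain."""
    tokens = list(text)  # start from single characters
    while len(tokens) > 1:
        best_rank, best_idx = None, -1
        for i in range(len(tokens) - 1):
            rank = ranks.get((tokens[i], tokens[i + 1]))
            # Strict "<" means the leftmost pair wins when ranks tie.
            if rank is not None and (best_rank is None or rank < best_rank):
                best_rank, best_idx = rank, i
        if best_rank is None:
            break  # no adjacent pair appears in the merge table
        merged = tokens[best_idx] + tokens[best_idx + 1]
        tokens[best_idx:best_idx + 2] = [merged]
    return separator.join(tokens)


# Merge pairs copied from the docstring example above.
mps = ["e n", "i t", "i s", "e s", "en t", "c e", "es t", "en ce",
       "T h", "Th is", "t est", "s ent", "t h", "th is"]
ranks = build_ranks(mps)
print(bpe_encode("thisisit", ranks))  # this is it
```

On `"thisisit"` the loop merges `i t`, then both `i s` pairs, then `t h`, and finally `th is`, which agrees with the second row of the docstring output.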
41 changes: 41 additions & 0 deletions python/cudf/cudf/tests/text/test_text_methods.py
@@ -7,6 +7,7 @@
import pytest

import cudf
from cudf.core.byte_pair_encoding import BytePairEncoder
from cudf.core.tokenize_vocabulary import TokenizeVocabulary
from cudf.testing._utils import assert_eq

@@ -1030,3 +1031,43 @@ def test_jaccard_index_random_strings():

    actual = str1.str.jaccard_index(str2, jaccard_width)
    assert_eq(expected, actual)


@pytest.mark.parametrize(
    "separator, input, results",
    [
        (" ", "thetestsentence", "the test sent ence"),
        ("_", "sentenceistest", "sent_ence_is_test"),
        ("$", "istestsentencehere", "is$test$sent$ence$he$r$e"),
    ],
)
def test_byte_pair_encoding(separator, input, results):
    pairs_table = cudf.Series(
        [
            "t he",
            "h e",
            "e n",
            "i t",
            "i s",
            "e s",
            "en t",
            "c e",
            "es t",
            "en ce",
            "t h",
            "h i",
            "th is",
            "t est",
            "s i",
            "s ent",
        ]
    )
    encoder = BytePairEncoder(pairs_table)

    strings = cudf.Series([input, None, "", input])

    expected = cudf.Series([results, None, "", results])

    actual = encoder(strings, separator)
    assert type(expected) == type(actual)
    assert_eq(expected, actual)
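
The expected strings in the parametrized cases can be cross-checked without a GPU. The sketch below is a plain-Python stand-in, not the cuDF code path: `encode_row` is a hypothetical helper that applies the textbook lowest-rank-first merge to one row, passes `None` through unchanged, and joins tokens with the given separator. It assumes a pair's rank is its index in the pairs table.

```python
from typing import Optional

# Same pairs table as the test above; index in the list is the merge rank.
PAIRS = ["t he", "h e", "e n", "i t", "i s", "e s", "en t", "c e",
         "es t", "en ce", "t h", "h i", "th is", "t est", "s i", "s ent"]
RANKS = {tuple(p.split(" ")): rank for rank, p in enumerate(PAIRS)}

CASES = [
    (" ", "thetestsentence", "the test sent ence"),
    ("_", "sentenceistest", "sent_ence_is_test"),
    ("$", "istestsentencehere", "is$test$sent$ence$he$r$e"),
]


def encode_row(text: Optional[str], separator: str) -> Optional[str]:
    """Encode one row; null rows stay null, empty rows stay empty."""
    if text is None:
        return None
    tokens = list(text)
    while True:
        # (rank, position) for every adjacent pair found in the table.
        candidates = [
            (RANKS[pair], i)
            for i, pair in enumerate(zip(tokens, tokens[1:]))
            if pair in RANKS
        ]
        if not candidates:
            return separator.join(tokens)
        # min() picks the lowest rank, then the leftmost position.
        _, i = min(candidates)
        tokens[i:i + 2] = [tokens[i] + tokens[i + 1]]


for sep, text, expected in CASES:
    rows = [text, None, "", text]  # mirrors the Series built in the test
    assert [encode_row(r, sep) for r in rows] == [expected, None, "", expected]
```

Agreement here only shows that the test expectations follow from the usual BPE merge rule; it says nothing about how the libcudf kernel reaches the same result.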