Tokenizes a tensor of UTF-8 string tokens into subword pieces.
Inherits From: TokenizerWithOffsets, Tokenizer, SplitterWithOffsets, Splitter, Detokenizer
text.FastWordpieceTokenizer(
    vocab=None,
    suffix_indicator='##',
    max_bytes_per_word=100,
    token_out_type=dtypes.int64,
    unknown_token='[UNK]',
    no_pretokenization=False,
    support_detokenization=False,
    model_buffer=None
)
It employs the linear (as opposed to quadratic) WordPiece algorithm (see the Fast WordPiece Tokenization paper: https://arxiv.org/abs/2012.15524).
Differences compared to the classic WordpieceTokenizer are as follows (as of 11/2021):

- `unknown_token` cannot be None or empty. That means if a word is too long or cannot be tokenized, FastWordpieceTokenizer always returns `unknown_token`. In contrast, the original WordpieceTokenizer would return the original word if `unknown_token` is empty or None.
- `unknown_token` must be included in the vocabulary.
- When `unknown_token` is returned by `tokenize_with_offsets()`, the resulting `end_offset` is set to the length of the original input word. In contrast, when the original WordpieceTokenizer returns `unknown_token`, the `end_offset` is set to the length of the `unknown_token` string.
- `split_unknown_characters` is not supported.
- `max_chars_per_token` is not used or needed.
- By default the input is assumed to be general text (i.e., sentences), and FastWordpieceTokenizer first splits it on whitespace and punctuation and then applies the WordPiece tokenization (see the parameter `no_pretokenization`). If the input already contains single words only, set `no_pretokenization=True` to be consistent with the classic WordpieceTokenizer (see the sketch after this list).
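A minimal sketch of both modes, assuming `tensorflow_text` has been imported as `text` (the vocabulary here is the one used in the `tokenize` examples below):

>>> vocab = ["they", "##'", "##re", "the", "great", "##est", "[UNK]",
...          "'", "re"]
>>> sentence_tokenizer = text.FastWordpieceTokenizer(
...     vocab, token_out_type=tf.string)
>>> sentence_tokenizer.tokenize(["they're the greatest"])
<tf.RaggedTensor [[b'they', b"'", b're', b'the', b'great', b'##est']]>
>>> word_tokenizer = text.FastWordpieceTokenizer(
...     vocab, token_out_type=tf.string, no_pretokenization=True)
>>> word_tokenizer.tokenize(["they're", "the", "greatest"])
<tf.RaggedTensor [[b'they', b"##'", b'##re'], [b'the'], [b'great', b'##est']]>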
detokenize(input)
Detokenizes a tensor of int64 or int32 subword ids into sentences.
Tokenizing an input string and then detokenizing the resulting ids returns the original string when the input is normalized and the tokenized wordpieces do not contain `unknown_token`.
>>> vocab = ["they", "##'", "##re", "the", "great", "##est", "[UNK]",
... "'", "re", "ok"]
>>> tokenizer = FastWordpieceTokenizer(vocab, support_detokenization=True)
>>> ids = tf.ragged.constant([[0, 1, 2, 3, 4, 5], [9]])
>>> tokenizer.detokenize(ids)
<tf.Tensor: shape=(2,), dtype=string,
numpy=array([b"they're the greatest", b'ok'], dtype=object)>
>>> ragged_ids = tf.ragged.constant([[[0, 1, 2, 3, 4, 5], [9]], [[4, 5]]])
>>> tokenizer.detokenize(ragged_ids)
<tf.RaggedTensor [[b"they're the greatest", b'ok'], [b'greatest']]>
Args | |
---|---|
`input` | An N-dimensional `Tensor` or `RaggedTensor` of int64 or int32. |
Returns | |
---|---|
A `RaggedTensor` of sentences that has N - 1 dimensions when N > 1. Otherwise, a string tensor. |
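A minimal sketch of the round-trip property noted above, assuming `tensorflow_text` is imported as `text`: a normalized input whose wordpieces avoid `unknown_token` is recovered exactly by detokenizing its token ids.

>>> vocab = ["they", "##'", "##re", "the", "great", "##est", "[UNK]",
...          "'", "re", "ok"]
>>> tokenizer = text.FastWordpieceTokenizer(vocab, support_detokenization=True)
>>> ids = tokenizer.tokenize(["the greatest ok"])
>>> ids
<tf.RaggedTensor [[3, 4, 5, 9]]>
>>> tokenizer.detokenize(ids)
<tf.Tensor: shape=(1,), dtype=string,
numpy=array([b'the greatest ok'], dtype=object)>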
split(input)

Alias for `Tokenizer.tokenize`.
split_with_offsets(input)

Alias for `TokenizerWithOffsets.tokenize_with_offsets`.
tokenize(input)
Tokenizes a tensor of UTF-8 string tokens further into subword tokens.
>>> vocab = ["they", "##'", "##re", "the", "great", "##est", "[UNK]"]
>>> tokenizer = FastWordpieceTokenizer(vocab, token_out_type=tf.string,
... no_pretokenization=True)
>>> tokens = [["they're", "the", "greatest"]]
>>> tokenizer.tokenize(tokens)
<tf.RaggedTensor [[[b'they', b"##'", b'##re'], [b'the'],
[b'great', b'##est']]]>
>>> vocab = ["they", "##'", "##re", "the", "great", "##est", "[UNK]",
... "'", "re"]
>>> tokenizer = FastWordpieceTokenizer(vocab, token_out_type=tf.string)
>>> tokens = [["they're the greatest", "the greatest"]]
>>> tokenizer.tokenize(tokens)
<tf.RaggedTensor [[[b'they', b"'", b're', b'the', b'great', b'##est'],
[b'the', b'great', b'##est']]]>
Args | |
---|---|
`input` | An N-dimensional `Tensor` or `RaggedTensor` of UTF-8 strings. |
Returns | |
---|---|
A `RaggedTensor` of tokens where `tokens[i, j]` is the j-th token (i.e., wordpiece) for `input[i]` (i.e., the i-th input word). This token is either the actual token string content, or the corresponding integer id, i.e., the index of that token string in the vocabulary. This choice is controlled by the `token_out_type` parameter passed to the initializer method. |
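For comparison with the string examples above, a minimal sketch with the default `token_out_type` of `tf.int64` (again assuming `tensorflow_text` is imported as `text`); each value is the index of the wordpiece in the vocabulary:

>>> vocab = ["they", "##'", "##re", "the", "great", "##est", "[UNK]"]
>>> id_tokenizer = text.FastWordpieceTokenizer(vocab, no_pretokenization=True)
>>> id_tokenizer.tokenize([["they're", "the", "greatest"]])
<tf.RaggedTensor [[[0, 1, 2], [3], [4, 5]]]>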
tokenize_with_offsets(input)
Tokenizes a tensor of UTF-8 string tokens further into subword tokens.
>>> vocab = ["they", "##'", "##re", "the", "great", "##est", "[UNK]"]
>>> tokenizer = FastWordpieceTokenizer(vocab, token_out_type=tf.string,
... no_pretokenization=True)
>>> tokens = [["they're", "the", "greatest"]]
>>> subtokens, starts, ends = tokenizer.tokenize_with_offsets(tokens)
>>> subtokens
<tf.RaggedTensor [[[b'they', b"##'", b'##re'], [b'the'],
[b'great', b'##est']]]>
>>> starts
<tf.RaggedTensor [[[0, 4, 5], [0], [0, 5]]]>
>>> ends
<tf.RaggedTensor [[[4, 5, 7], [3], [5, 8]]]>
>>> vocab = ["they", "##'", "##re", "the", "great", "##est", "[UNK]",
... "'", "re"]
>>> tokenizer = FastWordpieceTokenizer(vocab, token_out_type=tf.string)
>>> tokens = [["they're the greatest", "the greatest"]]
>>> subtokens, starts, ends = tokenizer.tokenize_with_offsets(tokens)
>>> subtokens
<tf.RaggedTensor [[[b'they', b"'", b're', b'the', b'great', b'##est'],
[b'the', b'great', b'##est']]]>
>>> starts
<tf.RaggedTensor [[[0, 4, 5, 8, 12, 17], [0, 4, 9]]]>
>>> ends
<tf.RaggedTensor [[[4, 5, 7, 11, 17, 20], [3, 9, 12]]]>
Args | |
---|---|
`input` | An N-dimensional `Tensor` or `RaggedTensor` of UTF-8 strings. |
Returns | |
---|---|
A tuple `(tokens, start_offsets, end_offsets)` where: | |
`tokens` | A `RaggedTensor` where `tokens[i, j]` is the j-th token (i.e., wordpiece) for `input[i]` (i.e., the i-th input word). This token is either the actual token string content or the corresponding integer id, i.e., the index of that token string in the vocabulary. This choice is controlled by the `token_out_type` parameter passed to the initializer method. |
`start_offsets` | A `RaggedTensor` where `start_offsets[i1...iN, j]` is the byte offset for the inclusive start of the j-th token in `input[i1...iN]`. |
`end_offsets` | A `RaggedTensor` where `end_offsets[i1...iN, j]` is the byte offset for the exclusive end of the j-th token in `input[i1...iN]` (i.e., the first byte after the end of the token). |
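Because the offsets are byte offsets into each input word, they can be used to recover the exact span each wordpiece came from. A minimal sketch (the Python slicing is illustrative, not part of the API; assumes eager execution and `tensorflow_text` imported as `text`):

>>> vocab = ["they", "##'", "##re", "the", "great", "##est", "[UNK]"]
>>> tokenizer = text.FastWordpieceTokenizer(vocab, token_out_type=tf.string,
...                                         no_pretokenization=True)
>>> word = "greatest"
>>> _, starts, ends = tokenizer.tokenize_with_offsets([word])
>>> [word[s:e] for s, e in zip(starts[0].numpy(), ends[0].numpy())]
['great', 'est']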