description: Tokenizes a tensor of UTF-8 string tokens into subword pieces.

text.FastWordpieceTokenizer

View source

Tokenizes a tensor of UTF-8 string tokens into subword pieces.

Inherits From: TokenizerWithOffsets, Tokenizer, SplitterWithOffsets, Splitter, Detokenizer

text.FastWordpieceTokenizer(
    vocab=None,
    suffix_indicator='##',
    max_bytes_per_word=100,
    token_out_type=dtypes.int64,
    unknown_token='[UNK]',
    no_pretokenization=False,
    support_detokenization=False,
    model_buffer=None
)

It employs the linear (as opposed to quadratic) WordPiece algorithm (see the paper).

Differences compared to the classic WordpieceTokenizer are as follows (as of 11/2021):

  • unknown_token cannot be None or empty. That means if a word is too long or cannot be tokenized, FastWordpieceTokenizer always returns unknown_token (see the sketch after this list). In contrast, the original WordpieceTokenizer would return the original word if unknown_token is empty or None.

  • unknown_token must be included in the vocabulary.

  • When unknown_token is returned by tokenize_with_offsets(), the result end_offset is set to the length of the original input word. In contrast, when unknown_token is returned by the original WordpieceTokenizer, the end_offset is set to the length of the unknown_token string.

  • split_unknown_characters is not supported.

  • max_chars_per_token is not used or needed.

  • By default the input is assumed to be general text (i.e., sentences), and FastWordpieceTokenizer first splits it on whitespace and punctuation and then applies the WordPiece tokenization (see the parameter no_pretokenization). If the input already contains single words only, please set no_pretokenization=True to be consistent with the classic WordpieceTokenizer.
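
To make the first and third points above concrete, here is a minimal sketch, assuming the same import context as the examples further below (TensorFlow as `tf`, and `FastWordpieceTokenizer` from TensorFlow Text). The expected results in the comments follow from the bullets above and are illustrative rather than verified output:

>>> vocab = ["they", "##'", "##re", "the", "great", "##est", "[UNK]"]
>>> tokenizer = FastWordpieceTokenizer(vocab, token_out_type=tf.string,
...                                    no_pretokenization=True)
>>> # "qwerty" cannot be tokenized from this vocab, so the whole word maps to [UNK].
>>> tokens, starts, ends = tokenizer.tokenize_with_offsets(["qwerty"])
>>> tokens  # expected: [[b'[UNK]']]
>>> ends    # expected: [[6]], i.e. the length of "qwerty", not of "[UNK]"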

Args

`vocab` (optional) The list of tokens in the vocabulary.
`suffix_indicator` (optional) The characters prepended to a wordpiece to indicate that it is a suffix to another subword.
`max_bytes_per_word` (optional) Max size of input token.
`token_out_type` (optional) The type of the token to return. This can be `tf.int64` or `tf.int32` IDs, or `tf.string` subwords.
`unknown_token` (optional) The string value to substitute for an unknown token. It must be included in `vocab`.
`no_pretokenization` (optional) By default, the input is split on whitespace and punctuation before applying the WordPiece tokenization. When true, the input is assumed to be pretokenized already.
`support_detokenization` (optional) Whether to make the tokenizer support detokenization. Setting it to true expands the size of the model flatbuffer. As a reference, when using the 120k multilingual BERT WordPiece vocab, the flatbuffer's size increases from ~5MB to ~6MB.
`model_buffer` (optional) Bytes object (or a uint8 `tf.Tensor`) that contains the wordpiece model in flatbuffer format (see fast_wordpiece_tokenizer_model.fbs). If not `None`, all other arguments (except `token_out_type`) are ignored.
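
For instance, with the default `token_out_type` (`tf.int64`) the tokenizer returns vocabulary indices rather than subword strings. A minimal construction sketch under the same assumptions as the examples below; the expected result in the comment is derived from the vocabulary order and is illustrative, not verified output:

>>> vocab = ["they", "##'", "##re", "the", "great", "##est", "[UNK]"]
>>> tokenizer = FastWordpieceTokenizer(vocab, no_pretokenization=True)
>>> tokenizer.tokenize(["they're", "greatest"])  # expected: [[0, 1, 2], [4, 5]]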

Methods

detokenize

View source

detokenize(
    input
)

Detokenizes a tensor of int64 or int32 subword ids into sentences.

Tokenizing a string and then detokenizing the resulting wordpieces returns the original string, provided the input string is normalized and the tokenized wordpieces do not contain the unknown token (a round-trip sketch follows the Returns section below).

Example:

>>> vocab = ["they", "##'", "##re", "the", "great", "##est", "[UNK]",
...          "'", "re", "ok"]
>>> tokenizer = FastWordpieceTokenizer(vocab, support_detokenization=True)
>>> ids = tf.ragged.constant([[0, 1, 2, 3, 4, 5], [9]])
>>> tokenizer.detokenize(ids)
<tf.Tensor: shape=(2,), dtype=string, numpy=array([b"they're the greatest", b'ok'], dtype=object)>
>>> ragged_ids = tf.ragged.constant([[[0, 1, 2, 3, 4, 5], [9]], [[4, 5]]])
>>> tokenizer.detokenize(ragged_ids)
<tf.RaggedTensor [[b"they're the greatest", b'ok'], [b'greatest']]>

Args

`input` An N-dimensional `Tensor` or `RaggedTensor` of int64 or int32.

Returns

A `RaggedTensor` of sentences that has N - 1 dimensions when N > 1; otherwise, a string tensor.
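
As a round-trip sketch of the property described above, continuing from the example (same `vocab`); `flat_values` flattens the ragged ids into a single sequence, and the expected results in the comments are illustrative, not verified output:

>>> tokenizer = FastWordpieceTokenizer(vocab, no_pretokenization=True,
...                                    support_detokenization=True)
>>> ids = tokenizer.tokenize(["they're", "the", "greatest"])  # expected: [[0, 1, 2], [3], [4, 5]]
>>> tokenizer.detokenize(ids.flat_values)  # expected: a string tensor holding b"they're the greatest"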

split

View source

split(
    input
)

Alias for Tokenizer.tokenize.

split_with_offsets

View source

split_with_offsets(
    input
)

Alias for TokenizerWithOffsets.tokenize_with_offsets.

tokenize

View source

tokenize(
    input
)

Tokenizes a tensor of UTF-8 string tokens further into subword tokens.

Example 1, single word tokenization:

>>> vocab = ["they", "##'", "##re", "the", "great", "##est", "[UNK]"]
>>> tokenizer = FastWordpieceTokenizer(vocab, token_out_type=tf.string,
...                                    no_pretokenization=True)
>>> tokens = [["they're", "the", "greatest"]]
>>> tokenizer.tokenize(tokens)
<tf.RaggedTensor [[[b'they', b"##'", b'##re'], [b'the'],
                   [b'great', b'##est']]]>

Example 2, general text tokenization (pre-tokenization on punctuation and whitespace followed by WordPiece tokenization):

>>> vocab = ["they", "##'", "##re", "the", "great", "##est", "[UNK]",
...          "'", "re"]
>>> tokenizer = FastWordpieceTokenizer(vocab, token_out_type=tf.string)
>>> tokens = [["they're the greatest", "the greatest"]]
>>> tokenizer.tokenize(tokens)
<tf.RaggedTensor [[[b'they', b"'", b're', b'the', b'great', b'##est'],
                   [b'the', b'great', b'##est']]]>

Args

`input` An N-dimensional `Tensor` or `RaggedTensor` of UTF-8 strings.

Returns

A `RaggedTensor` of tokens where `tokens[i, j]` is the j-th token (i.e., wordpiece) for `input[i]` (i.e., the i-th input word). Each token is either the actual token string or the corresponding integer id (i.e., the index of that token string in the vocabulary), depending on the `token_out_type` parameter passed to the initializer.
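
Since integer outputs (the default `token_out_type`) are indices into `vocab`, they can be mapped back to wordpiece strings outside the tokenizer, for example with `tf.gather`. This is a hedged sketch, not part of the FastWordpieceTokenizer API, and it assumes `tf.gather` accepts `RaggedTensor` indices; expected results are shown as comments, not verified output:

>>> vocab = ["they", "##'", "##re", "the", "great", "##est", "[UNK]",
...          "'", "re"]
>>> tokenizer = FastWordpieceTokenizer(vocab, no_pretokenization=True)
>>> ids = tokenizer.tokenize(["they're", "greatest"])  # expected: [[0, 1, 2], [4, 5]]
>>> tf.gather(vocab, ids)  # expected: [[b'they', b"##'", b'##re'], [b'great', b'##est']]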

tokenize_with_offsets

View source

tokenize_with_offsets(
    input
)

Tokenizes a tensor of UTF-8 string tokens further into subword tokens.

Example 1, single word tokenization:

>>> vocab = ["they", "##'", "##re", "the", "great", "##est", "[UNK]"]
>>> tokenizer = FastWordpieceTokenizer(vocab, token_out_type=tf.string,
...                                    no_pretokenization=True)
>>> tokens = [["they're", "the", "greatest"]]
>>> subtokens, starts, ends = tokenizer.tokenize_with_offsets(tokens)
>>> subtokens
<tf.RaggedTensor [[[b'they', b"##'", b'##re'], [b'the'],
                   [b'great', b'##est']]]>
>>> starts
<tf.RaggedTensor [[[0, 4, 5], [0], [0, 5]]]>
>>> ends
<tf.RaggedTensor [[[4, 5, 7], [3], [5, 8]]]>

Example 2, general text tokenization (pre-tokenization on punctuation and whitespace followed by WordPiece tokenization):

>>> vocab = ["they", "##'", "##re", "the", "great", "##est", "[UNK]",
...          "'", "re"]
>>> tokenizer = FastWordpieceTokenizer(vocab, token_out_type=tf.string)
>>> tokens = [["they're the greatest", "the greatest"]]
>>> subtokens, starts, ends = tokenizer.tokenize_with_offsets(tokens)
>>> subtokens
<tf.RaggedTensor [[[b'they', b"'", b're', b'the', b'great', b'##est'],
                   [b'the', b'great', b'##est']]]>
>>> starts
<tf.RaggedTensor [[[0, 4, 5, 8, 12, 17], [0, 4, 9]]]>
>>> ends
<tf.RaggedTensor [[[4, 5, 7, 11, 17, 20], [3, 9, 12]]]>

Args

`input` An N-dimensional `Tensor` or `RaggedTensor` of UTF-8 strings.

Returns

A tuple `(tokens, start_offsets, end_offsets)` where:

  • `tokens` is a `RaggedTensor` where `tokens[i, j]` is the j-th token (i.e., wordpiece) for `input[i]` (i.e., the i-th input word). Each token is either the actual token string or the corresponding integer id (i.e., the index of that token string in the vocabulary), depending on the `token_out_type` parameter passed to the initializer.

  • `start_offsets[i1...iN, j]` is a `RaggedTensor` of the byte offsets for the inclusive start of the j-th token in `input[i1...iN]`.

  • `end_offsets[i1...iN, j]` is a `RaggedTensor` of the byte offsets for the exclusive end of the j-th token in `input[i1...iN]` (i.e., the first byte after the end of the token).
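
To make the offset semantics concrete, the offsets from Example 2 above can be used to slice the original sentence directly (byte offsets coincide with character offsets here because the text is ASCII). This is plain Python shown only as an illustration, not part of the tokenizer API:

>>> sentence = "they're the greatest"
>>> sentence[0:4], sentence[4:5], sentence[17:20]
('they', "'", 'est')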