TensorFlow-Text 2.20.0 FastWordpieceTokenizer Bug Analysis
Bug Summary
FastWordpieceTokenizer fails with "Cannot find unk_token in the vocab!" when:
- Vocabulary size ≥ 7, AND
- The `unknown_token` is NOT the last element in the vocabulary
This happens even though the `unknown_token` is present in the vocabulary.
Reproduction
```python
import tensorflow_text as tf_text

# ✓ Works: 6 tokens
vocab = ['[UNK]', 'token1', 'token2', 'token3', 'token4', 'token5']
tokenizer = tf_text.FastWordpieceTokenizer(vocab=vocab, unknown_token='[UNK]', no_pretokenization=True)

# ✗ Fails: 7 tokens, unknown_token at position 0
vocab = ['[UNK]', 'token1', 'token2', 'token3', 'token4', 'token5', 'token6']
tokenizer = tf_text.FastWordpieceTokenizer(vocab=vocab, unknown_token='[UNK]', no_pretokenization=True)
# RuntimeError: Cannot find unk_token in the vocab!

# ✓ Works: 7 tokens, unknown_token at LAST position
vocab = ['token1', 'token2', 'token3', 'token4', 'token5', 'token6', '[UNK]']
tokenizer = tf_text.FastWordpieceTokenizer(vocab=vocab, unknown_token='[UNK]', no_pretokenization=True)
```
Root Cause
File: `tensorflow_text/core/kernels/string_vocab.cc`
Function: `StringVocab::StringVocab()` constructor
Lines: 20-25
```cpp
StringVocab::StringVocab(const std::vector<std::string>& vocab)
    : vocab_(vocab) {
  for (int i = 0; i < vocab.size(); ++i) {
    index_map_[vocab_[i]] = i;  // BUG: No reserve() before loop
  }
}
```
The Bug
- The constructor builds an `absl::flat_hash_map<absl::string_view, int>` by inserting vocabulary tokens one by one
- No `index_map_.reserve(vocab.size())` is called before the loop
- When the hash map reaches its load factor threshold (~0.875), it triggers a rehash/resize
- For vocabularies with size ≥ 7, this rehash typically occurs during insertion
- There appears to be a bug in how `absl::flat_hash_map` handles `string_view` keys during rehashing
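As a back-of-the-envelope illustration of why the threshold lands at 7 (a sketch only; the 8-slot capacity is an assumed value, not verified against the Abseil sources), the seventh insertion is the first one to reach the ~0.875 load factor cited above:

```python
# Rough illustration only: assumes the flat_hash_map holds 8 internal slots
# when the vocabulary is inserted (an assumption, not taken from Abseil).
ASSUMED_CAPACITY = 8       # hypothetical slot count before the resize
MAX_LOAD_FACTOR = 0.875    # threshold cited in the analysis above

for n_inserted in range(1, ASSUMED_CAPACITY + 1):
    load = n_inserted / ASSUMED_CAPACITY
    marker = "  <-- rehash would trigger here" if load >= MAX_LOAD_FACTOR else ""
    print(f"after {n_inserted} insertions: load factor {load:.3f}{marker}")

# The first flagged line is n_inserted == 7, matching the observed
# "vocab size >= 7" failure threshold.
```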
Why Position Matters
- Last position works: When `unknown_token` is at the last position, it is inserted AFTER any rehashing occurs
- Earlier positions fail: When `unknown_token` is inserted before the rehash point, something goes wrong during rehashing that causes subsequent lookups to fail
Evidence
Testing shows the exact pattern:
| Vocab Size | `unknown_token` Position | Result    |
|------------|--------------------------|-----------|
| 6          | first/middle/last        | ✓ SUCCESS |
| 7          | first/middle             | ✗ FAILED  |
| 7          | last                     | ✓ SUCCESS |
| 8          | first/middle             | ✗ FAILED  |
| 8          | last                     | ✓ SUCCESS |
The failure occurs precisely when:
- Vocabulary size >= 7 (triggers hash map rehashing)
- `unknown_token` is inserted before the rehash occurs
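The table above can be regenerated with a small sweep over vocabulary sizes and `[UNK]` positions. This is a minimal sketch assuming tensorflow-text 2.20.0 is installed; the `try_build` helper is illustrative, not part of any library API:

```python
import tensorflow_text as tf_text

def try_build(vocab_size, unk_position):
    """Attempt to build a FastWordpieceTokenizer and report success/failure."""
    tokens = [f'token{i}' for i in range(1, vocab_size)]
    # Place '[UNK]' at the requested index; -1 means append at the end.
    tokens.insert(unk_position if unk_position >= 0 else len(tokens), '[UNK]')
    try:
        tf_text.FastWordpieceTokenizer(
            vocab=tokens, unknown_token='[UNK]', no_pretokenization=True)
        return 'SUCCESS'
    except Exception as e:  # the report observes RuntimeError here
        return f'FAILED ({e})'

for size in (6, 7, 8):
    for name, pos in (('first', 0), ('middle', size // 2), ('last', -1)):
        print(f'size={size} unk={name}: {try_build(size, pos)}')
```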
Why It Works with no_pretokenization=True
Setting `no_pretokenization=True` disables the punctuation-skipping logic, allowing `"unk_token"` to be added to the trie.
Why Vocabulary Size Matters
Note: The exact reason for the vocabulary size threshold (≥7) is unclear without deeper C++ debugging, but there are several hypotheses:
- Different Code Paths: The C++ implementation may use different algorithms or optimizations based on vocabulary size, potentially affecting how token lookups are performed.
- Hash Map vs Linear Search: Smaller vocabularies might use linear search while larger ones use hash-based lookups, where string comparison or hashing behavior differs.
- Memory Layout Effects: With larger vocabularies, memory allocation patterns or string storage mechanisms might change, affecting how `"unk_token"` is stored or compared.
- Trie Construction Dependencies: The issue occurs during `vocab_->LookupId(unk_token_)`, which should query the original `StringVocab`, not the trie. However, there may be interdependencies where trie construction side effects influence the original vocabulary lookup mechanism.
- Internal Optimizations: The FastWordpiece implementation may have size-based optimizations that trigger different behavior patterns at the 7-token threshold.
Current Evidence:
- Works: vocabularies with ≤6 tokens
- Fails: vocabularies with ≥7 tokens containing `"unk_token"`
- The error occurs in `StringVocab::LookupId()`, not during trie operations
- Setting `no_pretokenization=True` bypasses the issue entirely
Impact
- KerasHub: Breaks WordPieceTokenizer for BERT/DistilBERT models when using tensorflow-text 2.20+
- LiteRT Export: Causes export tests to fail
- Workaround: Either:
  - Use `no_pretokenization=True` (not always appropriate)
  - Use a different token name without punctuation (e.g., `"<unk>"` instead of `"unk_token"`)
  - Skip tests gracefully on affected systems (current KerasHub approach; see the sketch below)
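A rough sketch of how the last two workarounds might look in a test suite. The names (`build_tokenizer`, the test body, the alternate vocabulary) are illustrative, not taken from KerasHub:

```python
import pytest
import tensorflow_text as tf_text

VOCAB = ['[UNK]', 'token1', 'token2', 'token3', 'token4', 'token5', 'token6']

# Workaround: an unknown token named without punctuation, per the list above
# (shown for reference only, not verified here).
# ALT_VOCAB = ['<unk>', 'token1', 'token2', 'token3', 'token4', 'token5', 'token6']

def build_tokenizer(vocab, unknown_token='[UNK]'):
    return tf_text.FastWordpieceTokenizer(
        vocab=vocab, unknown_token=unknown_token, no_pretokenization=True)

def test_wordpiece_tokenizer():
    try:
        tokenizer = build_tokenizer(VOCAB)
    except Exception as e:  # the report observes RuntimeError here
        # Workaround: skip gracefully on affected tensorflow-text builds.
        pytest.skip(f'FastWordpieceTokenizer rejected the vocab: {e}')
    tokens = tokenizer.tokenize(['token1token2'])
    assert tokens is not None
```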
Versions Affected
- tensorflow-text: 2.20.0
- tensorflow: 2.18.0+
Regression
This appears to be a regression introduced in tensorflow-text 2.20.0. Earlier versions did not exhibit this behavior.
Recommendation
Submit a bug report to the tensorflow-text repository with the minimal reproduction script.
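For reference, a self-contained version of the reproduction above that also prints version information, which could accompany the upstream report (the expected outcomes in the comments simply restate the observed pass/fail pattern):

```python
import tensorflow as tf
import tensorflow_text as tf_text

print('tensorflow:', tf.__version__)
print('tensorflow-text:', tf_text.__version__)

def build(vocab):
    """Return None on success, or the error message raised during construction."""
    try:
        tf_text.FastWordpieceTokenizer(
            vocab=vocab, unknown_token='[UNK]', no_pretokenization=True)
        return None
    except Exception as e:
        return str(e)

base = ['token1', 'token2', 'token3', 'token4', 'token5', 'token6']

print('6 tokens, [UNK] first :', build(['[UNK]'] + base[:5]))  # expected: None
print('7 tokens, [UNK] first :', build(['[UNK]'] + base))      # expected: error
print('7 tokens, [UNK] last  :', build(base + ['[UNK]']))      # expected: None
```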