FastWordpieceTokenizer fails with "Cannot find unk_token in the vocab!" when vocabulary size >= 7 despite unk_token being present #1462

@pctablet505

Description

TensorFlow-Text 2.20.0 FastWordpieceTokenizer Bug Analysis

Bug Summary

FastWordpieceTokenizer fails with "Cannot find unk_token in the vocab!" when:

  • Vocabulary size ≥ 7, AND
  • The unknown_token is NOT the last element in the vocabulary

Despite the unknown_token being present in the vocabulary.

Reproduction

import tensorflow_text as tf_text

# ✓ Works: 6 tokens
vocab = ['[UNK]', 'token1', 'token2', 'token3', 'token4', 'token5']
tokenizer = tf_text.FastWordpieceTokenizer(vocab=vocab, unknown_token='[UNK]', no_pretokenization=True)

# ✗ Fails: 7 tokens, unknown_token at position 0
vocab = ['[UNK]', 'token1', 'token2', 'token3', 'token4', 'token5', 'token6']
tokenizer = tf_text.FastWordpieceTokenizer(vocab=vocab, unknown_token='[UNK]', no_pretokenization=True)
# RuntimeError: Cannot find unk_token in the vocab!

# ✓ Works: 7 tokens, unknown_token at LAST position
vocab = ['token1', 'token2', 'token3', 'token4', 'token5', 'token6', '[UNK]']
tokenizer = tf_text.FastWordpieceTokenizer(vocab=vocab, unknown_token='[UNK]', no_pretokenization=True)

Root Cause

File: tensorflow_text/core/kernels/string_vocab.cc
Function: StringVocab::StringVocab() constructor
Lines: 20-25

StringVocab::StringVocab(const std::vector<std::string>& vocab)
    : vocab_(vocab) {
  for (int i = 0; i < vocab.size(); ++i) {
    index_map_[vocab_[i]] = i;  // BUG: No reserve() before loop
  }
}

The Bug

  1. The constructor builds an absl::flat_hash_map<absl::string_view, int> by inserting vocabulary tokens one by one.
  2. No index_map_.reserve(vocab.size()) is called before the loop.
  3. When the hash map reaches its maximum load factor (~0.875), it triggers a rehash/resize.
  4. For vocabularies with 7 or more tokens, this rehash is triggered partway through the insertion loop.
  5. There appears to be a bug in how absl::flat_hash_map handles its string_view keys during this rehash.

Why Position Matters

  • Last position works: When unknown_token is at the last position, it's inserted AFTER any rehashing occurs
  • Earlier positions fail: When unknown_token is inserted before the rehash point, something goes wrong during rehashing that causes subsequent lookups to fail

Evidence

Testing shows the exact pattern:

Vocab Size | unknown_token Position | Result
-----------|------------------------|----------
    6      | first / middle / last  | ✓ SUCCESS
    7      | first / middle         | ✗ FAILED
    7      | last                   | ✓ SUCCESS
    8      | first / middle         | ✗ FAILED
    8      | last                   | ✓ SUCCESS

The failure occurs precisely when both conditions hold (the sweep sketched below reproduces this pattern):

  1. Vocabulary size ≥ 7 (enough tokens to trigger a hash map rehash)
  2. The unknown_token is inserted before the rehash occurs
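
The pattern in the table can be reproduced with a short sweep over vocabulary sizes and unknown-token positions. This is a sketch using the same API as the reproduction above, not the exact script used to generate the table:

import tensorflow_text as tf_text

def builds_ok(vocab):
    """Return True if FastWordpieceTokenizer construction succeeds."""
    try:
        tf_text.FastWordpieceTokenizer(
            vocab=vocab, unknown_token='[UNK]', no_pretokenization=True)
        return True
    except RuntimeError:  # "Cannot find unk_token in the vocab!"
        return False

for size in (6, 7, 8):
    fillers = [f'token{i}' for i in range(1, size)]  # size - 1 filler tokens
    for label, pos in (('first', 0), ('middle', size // 2), ('last', size - 1)):
        vocab = fillers[:pos] + ['[UNK]'] + fillers[pos:]
        result = 'SUCCESS' if builds_ok(vocab) else 'FAILED'
        print(f'size={size}  [UNK] {label:<6} -> {result}')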

Why It Works with no_pretokenization=True

Setting no_pretokenization=True disables the punctuation-skipping pre-tokenization logic, allowing a token name that contains punctuation, such as "unk_token", to be added to the trie as a whole string. This only addresses the punctuation behavior, however: as the reproduction above shows, the size-dependent failure still occurs with no_pretokenization=True.
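
A minimal illustration of this point, assuming the behavior reported in the evidence above: the vocabulary stays below the 7-token threshold so the size-related failure does not interfere, and the unknown token is the punctuation-containing name 'unk_token'.

import tensorflow_text as tf_text

# Six tokens: small enough that the size-related failure is not triggered.
vocab = ['unk_token', 'hello', 'world', 'foo', 'bar', 'baz']

# With no_pretokenization=True, 'unk_token' is looked up as a whole string,
# so construction succeeds despite the underscore in the name.
tokenizer = tf_text.FastWordpieceTokenizer(
    vocab=vocab, unknown_token='unk_token', no_pretokenization=True)

print(tokenizer.tokenize(['hello']))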

Why Vocabulary Size Matters

Note: The exact reason for the vocabulary size threshold (≥7) is unclear without deeper C++ debugging, but there are several hypotheses:

  1. Different Code Paths: The C++ implementation may use different algorithms or optimizations based on vocabulary size, potentially affecting how token lookups are performed.

  2. Hash Map vs Linear Search: Smaller vocabularies might use linear search while larger ones use hash-based lookups, where string comparison or hashing behavior differs.

  3. Memory Layout Effects: With larger vocabularies, memory allocation patterns or string storage mechanisms might change, affecting how "unk_token" is stored or compared.

  4. Trie Construction Dependencies: The issue occurs during vocab_->LookupId(unk_token_), which should query the original StringVocab rather than the trie. However, there may be interdependencies where side effects of trie construction influence the original vocabulary lookup.

  5. Internal Optimizations: The FastWordpiece implementation may have size-based optimizations that trigger different behavior patterns at the 7-token threshold.

Current Evidence:

  • Works: vocabularies with ≤6 tokens
  • Fails: vocabularies with ≥7 tokens unless the unknown token is the last entry
  • The error is raised from StringVocab::LookupId(), not during trie operations
  • Setting no_pretokenization=True avoids the punctuation-related splitting described above, but not the size-dependent failure shown in the reproduction

Impact

  • KerasHub: Breaks WordPieceTokenizer for BERT/DistilBERT models when using tensorflow-text 2.20+
  • LiteRT Export: Causes export tests to fail
  • Workarounds (a sketch follows this list):
    1. Use no_pretokenization=True (not always appropriate)
    2. Use a different token name without punctuation (e.g., "<unk>" instead of "unk_token")
    3. Skip tests gracefully on affected systems (the current KerasHub approach)
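
A sketch of the third workaround (skipping gracefully when the bug is hit), loosely modelled on the KerasHub approach rather than copied from it; the test body and helper names are illustrative.

import pytest
import tensorflow_text as tf_text

VOCAB = ['[UNK]', 'token1', 'token2', 'token3', 'token4', 'token5', 'token6']

def make_tokenizer():
    return tf_text.FastWordpieceTokenizer(
        vocab=VOCAB, unknown_token='[UNK]', no_pretokenization=True)

def test_wordpiece_tokenizer():
    try:
        tokenizer = make_tokenizer()
    except RuntimeError as e:
        if 'Cannot find unk_token' in str(e):
            pytest.skip('affected by tensorflow-text FastWordpieceTokenizer bug (#1462)')
        raise
    # ... the real test body would exercise `tokenizer` here ...
    assert tokenizer is not None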

Versions Affected

  • tensorflow-text: 2.20.0
  • tensorflow: 2.18.0+
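
To confirm whether an installed environment falls in the affected range above, the package versions can be printed directly:

import tensorflow as tf
import tensorflow_text as tf_text

# Affected per this report: tensorflow-text 2.20.0 with tensorflow 2.18.0+.
print('tensorflow:', tf.__version__)
print('tensorflow-text:', tf_text.__version__)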

Regression

This appears to be a regression introduced in tensorflow-text 2.20.0. Earlier versions did not exhibit this behavior.

Recommendation

Submit a bug report to the tensorflow-text repository with the minimal reproduction script above.
