FastWordpieceTokenizer fails with "Cannot find unk_token in the vocab!" when vocabulary size >= 7 despite unk_token being present #1462

@pctablet505

Description

TensorFlow-Text 2.20.0 FastWordpieceTokenizer Bug Analysis

Bug Summary

FastWordpieceTokenizer fails with "Cannot find unk_token in the vocab!" when:

  • Vocabulary size ≥ 7, AND
  • The unknown_token is NOT the last element in the vocabulary

Despite the unknown_token being present in the vocabulary.

Reproduction

import tensorflow_text as tf_text

# ✓ Works: 6 tokens
vocab = ['[UNK]', 'token1', 'token2', 'token3', 'token4', 'token5']
tokenizer = tf_text.FastWordpieceTokenizer(vocab=vocab, unknown_token='[UNK]', no_pretokenization=True)

# ✗ Fails: 7 tokens, unknown_token at position 0
vocab = ['[UNK]', 'token1', 'token2', 'token3', 'token4', 'token5', 'token6']
tokenizer = tf_text.FastWordpieceTokenizer(vocab=vocab, unknown_token='[UNK]', no_pretokenization=True)
# RuntimeError: Cannot find unk_token in the vocab!

# ✓ Works: 7 tokens, unknown_token at LAST position
vocab = ['token1', 'token2', 'token3', 'token4', 'token5', 'token6', '[UNK]']
tokenizer = tf_text.FastWordpieceTokenizer(vocab=vocab, unknown_token='[UNK]', no_pretokenization=True)

Root Cause

File: tensorflow_text/core/kernels/string_vocab.cc
Function: StringVocab::StringVocab() constructor
Lines: 20-25

StringVocab::StringVocab(const std::vector<std::string>& vocab)
    : vocab_(vocab) {
  for (int i = 0; i < vocab.size(); ++i) {
    index_map_[vocab_[i]] = i;  // BUG: No reserve() before loop
  }
}

The Bug

  1. The constructor builds an absl::flat_hash_map<absl::string_view, int> by inserting vocabulary tokens one by one.
  2. No index_map_.reserve(vocab.size()) is called before the loop.
  3. When the hash map reaches its maximum load factor (~0.875), it triggers a rehash/resize.
  4. For vocabularies with 7 or more tokens, this rehash is triggered partway through the insertion loop.
  5. There appears to be a bug in how absl::flat_hash_map handles its string_view keys during this rehash.

Why Position Matters

  • Last position works: When unknown_token is at the last position, it's inserted AFTER any rehashing occurs
  • Earlier positions fail: When unknown_token is inserted before the rehash point, something goes wrong during rehashing that causes subsequent lookups to fail

Evidence

Testing shows the exact pattern:

Vocab Size | unknown_token Position | Result
-----------|------------------------|----------
    6      | first / middle / last  | ✓ SUCCESS
    7      | first / middle         | ✗ FAILED
    7      | last                   | ✓ SUCCESS
    8      | first / middle         | ✗ FAILED
    8      | last                   | ✓ SUCCESS

The failure occurs precisely when both conditions hold (the sweep sketched below reproduces this pattern):

  1. Vocabulary size ≥ 7 (enough tokens to trigger a hash map rehash)
  2. The unknown_token is inserted before the rehash occurs
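
The pattern in the table can be reproduced with a short sweep over vocabulary sizes and unknown-token positions. This is a sketch using the same API as the reproduction above, not the exact script used to generate the table:

import tensorflow_text as tf_text

def builds_ok(vocab):
    """Return True if FastWordpieceTokenizer construction succeeds."""
    try:
        tf_text.FastWordpieceTokenizer(
            vocab=vocab, unknown_token='[UNK]', no_pretokenization=True)
        return True
    except RuntimeError:  # "Cannot find unk_token in the vocab!"
        return False

for size in (6, 7, 8):
    fillers = [f'token{i}' for i in range(1, size)]  # size - 1 filler tokens
    for label, pos in (('first', 0), ('middle', size // 2), ('last', size - 1)):
        vocab = fillers[:pos] + ['[UNK]'] + fillers[pos:]
        result = 'SUCCESS' if builds_ok(vocab) else 'FAILED'
        print(f'size={size}  [UNK] {label:<6} -> {result}')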

Why It Works with no_pretokenization=True

Setting no_pretokenization=True disables the punctuation-skipping pre-tokenization logic, allowing a token name that contains punctuation, such as "unk_token", to be added to the trie as a whole string. This only addresses the punctuation behavior, however: as the reproduction above shows, the size-dependent failure still occurs with no_pretokenization=True.
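
A minimal illustration of this point, assuming the behavior reported in the evidence above: the vocabulary stays below the 7-token threshold so the size-related failure does not interfere, and the unknown token is the punctuation-containing name 'unk_token'.

import tensorflow_text as tf_text

# Six tokens: small enough that the size-related failure is not triggered.
vocab = ['unk_token', 'hello', 'world', 'foo', 'bar', 'baz']

# With no_pretokenization=True, 'unk_token' is looked up as a whole string,
# so construction succeeds despite the underscore in the name.
tokenizer = tf_text.FastWordpieceTokenizer(
    vocab=vocab, unknown_token='unk_token', no_pretokenization=True)

print(tokenizer.tokenize(['hello']))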

Why Vocabulary Size Matters

Note: The exact reason for the vocabulary size threshold (≥7) is unclear without deeper C++ debugging, but there are several hypotheses:

  1. Different Code Paths: The C++ implementation may use different algorithms or optimizations based on vocabulary size, potentially affecting how token lookups are performed.

  2. Hash Map vs Linear Search: Smaller vocabularies might use linear search while larger ones use hash-based lookups, where string comparison or hashing behavior differs.

  3. Memory Layout Effects: With larger vocabularies, memory allocation patterns or string storage mechanisms might change, affecting how "unk_token" is stored or compared.

  4. Trie Construction Dependencies: The issue occurs during vocab_->LookupId(unk_token_), which should query the original StringVocab rather than the trie. However, there may be interdependencies where side effects of trie construction influence the original vocabulary lookup.

  5. Internal Optimizations: The FastWordpiece implementation may have size-based optimizations that trigger different behavior patterns at the 7-token threshold.

Current Evidence:

  • Works: vocabularies with ≤6 tokens
  • Fails: vocabularies with ≥7 tokens unless the unknown token is the last entry
  • The error is raised from StringVocab::LookupId(), not during trie operations
  • Setting no_pretokenization=True avoids the punctuation-related splitting described above, but not the size-dependent failure shown in the reproduction

Impact

  • KerasHub: Breaks WordPieceTokenizer for BERT/DistilBERT models when using tensorflow-text 2.20+
  • LiteRT Export: Causes export tests to fail
  • Workarounds (a sketch follows this list):
    1. Use no_pretokenization=True (not always appropriate)
    2. Use a different token name without punctuation (e.g., "<unk>" instead of "unk_token")
    3. Skip tests gracefully on affected systems (the current KerasHub approach)
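
A sketch of the third workaround (skipping gracefully when the bug is hit), loosely modelled on the KerasHub approach rather than copied from it; the test body and helper names are illustrative.

import pytest
import tensorflow_text as tf_text

VOCAB = ['[UNK]', 'token1', 'token2', 'token3', 'token4', 'token5', 'token6']

def make_tokenizer():
    return tf_text.FastWordpieceTokenizer(
        vocab=VOCAB, unknown_token='[UNK]', no_pretokenization=True)

def test_wordpiece_tokenizer():
    try:
        tokenizer = make_tokenizer()
    except RuntimeError as e:
        if 'Cannot find unk_token' in str(e):
            pytest.skip('affected by tensorflow-text FastWordpieceTokenizer bug (#1462)')
        raise
    # ... the real test body would exercise `tokenizer` here ...
    assert tokenizer is not None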

Versions Affected

  • tensorflow-text: 2.20.0
  • tensorflow: 2.18.0+
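
To confirm whether an installed environment falls in the affected range above, the package versions can be printed directly:

import tensorflow as tf
import tensorflow_text as tf_text

# Affected per this report: tensorflow-text 2.20.0 with tensorflow 2.18.0+.
print('tensorflow:', tf.__version__)
print('tensorflow-text:', tf_text.__version__)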

Regression

This appears to be a regression introduced in tensorflow-text 2.20.0. Earlier versions did not exhibit this behavior.

Recommendation

Submit a bug report to the tensorflow-text repository with the minimal reproduction script above.
