🐛 Describe the bug
This is similar to "Running Qwen3 in Android LlamaDemo App shows an error while loading tokenizer" #11311, but hit while trying to run the LLM model (Qwen-2.5-0.5B) of OuteTTS v0.2 500M.
The tokenizer is a tokenizer.json file; I am using the one from the OuteTTS-0.2-500M HuggingFace repo: https://huggingface.co/OuteAI/OuteTTS-0.2-500M/blob/main/tokenizer.json
Using the ExecuTorch git main branch (commit deaf37f).
Error output, matching what was reported in #11311:
I tokenizers:regex.cpp:27] Registering override fallback regex
WARNING: All log messages before absl::InitializeLog() is called are written to STDERR
E0000 00:00:1756372464.977639 9813813 re2.cc:237] Error parsing '(\<\|2228\|\>|\<\|2229\|\>|\<\|2230\|\>|\<\|2231\|\>|\<\|2232\|\>|\<\|2233\|\>|\<\|2234\|\>|\<\|2235...': invalid UTF-8
E tokenizers:re2_regex.cpp:26] Failed to compile regex: (\<\|2228\|\>|\<\|2229\|\>|\<\|2230\|\>|\<\|2231\|\>|\<\|2232\|\>|\<\|2233\|\>|\<\|2234\|\>|\<\|2235\|\>|\<\|2236\|\>|\<\|2237\|\>|\<\|2238\|\>|\<\|2239\|\>|\<\|2240\|\>|\<\|2241\|\>|\<\|2242\|\>|\<\|2243\|\>|\<\|2244\|\>|\<\|224$
I tokenizers:regex_lookahead.cpp:27] Creating PCRE2 regex
E tokenizers:pcre2_regex.cpp:36] PCRE2 compilation failed at offset 275: UTF-8 error: isolated byte with 0x80 bit set
I tokenizers:regex_lookahead.cpp:40] PCRE2 failed to compile pattern, falling back to std::regex.
I tokenizers:hf_tokenizer.cpp:109] Setting up normalizer...
I tokenizers:normalizer.cpp:89] Using NFC normalizer. Please notice that our implementation may not handle all edge cases.
I tokenizers:hf_tokenizer.cpp:113] Normalizer set up
I tokenizers:hf_tokenizer.cpp:127] Setting up pretokenizer...
E0000 00:00:1756372465.044814 9813813 re2.cc:237] Error parsing '((?i:'s|'t|'re|'ve|'m|'ll|'d)|[^\r\n\p{L}\p{N}]?\p{L}+|\p{N}| ?[^\s\p{L}\p{N}]+[\r\n]*|\s*[\r\n]+|\s...': invalid perl operator: (?!
E tokenizers:re2_regex.cpp:26] Failed to compile regex: ((?i:'s|'t|'re|'ve|'m|'ll|'d)|[^\r\n\p{L}\p{N}]?\p{L}+|\p{N}| ?[^\s\p{L}\p{N}]+[\r\n]*|\s*[\r\n]+|\s+(?!\S)|\s+), error: invalid perl operator: (?!
I tokenizers:regex_lookahead.cpp:27] Creating PCRE2 regex
I tokenizers:hf_tokenizer.cpp:131] Pretokenizer set up
I tokenizers:hf_tokenizer.cpp:147] Loading BPE merges...
I tokenizers:hf_tokenizer.cpp:185] Loaded 151387 BPE merge rules
I tokenizers:hf_tokenizer.cpp:193] Built merge ranks map with 151387 entries
libc++abi: terminating due to uncaught exception of type std::__1::regex_error: The complexity of an attempted match against a regular expression exceeded a pre-set level.
zsh: abort ./cmake-out/examples/models/llama/llama_main --model_path=$OUTETTS_MODEL
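For context on the fallback chain in the log: RE2 rejects any lookahead construct such as `(?!`, PCRE2 is tried next, and when PCRE2 also fails the tokenizer drops to `std::regex`, which here exceeds its complexity limit. A minimal stdlib-only Python sketch of the lookahead in question (a hypothetical reduction of the GPT-2-style pretokenizer pattern, which additionally uses `\p{L}`/`\p{N}` classes that Python's stdlib `re` does not support):

```python
import re

# Hypothetical reduction of the tail of the pretokenizer pattern:
# "whitespace not followed by a non-space char, else any whitespace".
# RE2 refuses the (?!...) negative lookahead outright, which is what
# triggers the fallback chain seen in the log above.
pattern = r"\s+(?!\S)|\s+"
tok = re.compile(pattern)

# A trailing whitespace run is matched as one block...
assert tok.findall("hi   ") == ["   "]
# ...while whitespace before a word is not swallowed as one block.
assert tok.findall("hi  x") == [" ", " "]
```

Python's `re` accepts the lookahead without issue, which is consistent with the problem being engine support in the C++ runner rather than the pattern itself.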
Steps used to install ExecuTorch, following the examples/models/llama/README.md instructions:
# Download ExecuTorch
git clone https://github.com/pytorch/executorch.git && cd executorch
# Create a python virtual environment (python3.10) and activate it.
python3.10 -m venv .venv && source .venv/bin/activate
pip install --upgrade pip
# Update and pull the submodules
git submodule sync
git submodule update --init --recursive
# Run the setup script to install executorch. Install llama dependencies.
./install_executorch.sh
./examples/models/llama/install_requirements.sh
# Build ExecuTorch
cmake --preset llm -DCMAKE_BUILD_TYPE=Release -DCMAKE_INSTALL_PREFIX=cmake-out
cmake --build cmake-out -j9 --target install --config Release
# Build the llama runner, adding the flag -DSUPPORT_REGEX_LOOKAHEAD=ON
cmake -DCMAKE_INSTALL_PREFIX=cmake-out \
-DCMAKE_BUILD_TYPE=Release \
-DSUPPORT_REGEX_LOOKAHEAD=ON \
-Bcmake-out/examples/models/llama \
examples/models/llama
cmake --build cmake-out/examples/models/llama -j9 --config Release
Note: I do get a CMake Warning regarding the regex flag:
CMake Warning:
Manually-specified variables were not used by the project:
SUPPORT_REGEX_LOOKAHEAD
Command used to run the OuteTTS LLM model:
OUTETTS_MODEL="outetts_8da4wfp32_gs32_1024.pte"
TOKENIZER="OuteTTS-0.2-500M/tokenizer.json"
./cmake-out/examples/models/llama/llama_main --model_path=$OUTETTS_MODEL --tokenizer_path=$TOKENIZER --prompt="Welcome to Executorch framework."
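The first failure in the log is RE2 rejecting the special-token alternation (`\<\|2228\|\>|...`) as "invalid UTF-8". As a sanity check outside the C++ runner, the same kind of alternation can be rebuilt from a token list with Python's stdlib; the token values below are illustrative, not read from the actual tokenizer.json:

```python
import re

# Illustrative added tokens (the real tokenizer.json has thousands).
tokens = ["<|2228|>", "<|2229|>", "<|endoftext|>"]

# Escape each token and join with '|', mirroring in spirit the
# alternation the HF tokenizer loader builds for special-token splitting.
pattern = "|".join(re.escape(t) for t in tokens)
special = re.compile(pattern)

assert special.findall("a<|2228|>b<|endoftext|>") == ["<|2228|>", "<|endoftext|>"]
```

If rebuilding the full alternation from the real tokenizer.json's added tokens also fails to compile, that would point at the token data itself; if it compiles cleanly, the encoding problem is more likely introduced inside the C++ loader.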
Config file (.yaml) used to export the OuteTTS LLM model to .pte:
base:
  model_class: qwen2_5
  checkpoint: ../OuteTTS-0.2-500M/converted_weights.pth
  params: ../OuteTTS-0.2-500M/params.json
  metadata: '{"get_bos_id":151643, "get_eos_ids":151645}'
model:
  use_kv_cache: true
  use_sdpa_with_kv_cache: false
  dtype_override: fp32
export:
  max_seq_length: 1024
  max_context_length: 1024
  output_name: outetts_8da4wfp32_gs32_1024.pte
quantization:
  embedding_quantize: 8,0
  qmode: 8da4w
  group_size: 32
backend:
  xnnpack:
    enabled: true
    extended_ops: true
debug:
  verbose: true
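For a rough sense of what `qmode: 8da4w` with `group_size: 32` does to the linear weights (4-bit weights packed two per byte, plus one scale per 32-weight group), here is back-of-the-envelope arithmetic; the 2-byte-scale and no-zero-point assumptions are mine, not read from torchao:

```python
# Rough size estimate for groupwise int4 weights (the "4w" in 8da4w):
# two 4-bit weights per byte, plus one scale per group per output row.
def int4_groupwise_bytes(rows, cols, group_size=32, scale_bytes=2):
    packed = rows * cols // 2                            # packed 4-bit weights
    scales = rows * (cols // group_size) * scale_bytes   # one scale per group
    return packed + scales

fp32 = 1024 * 1024 * 4                  # a 1024x1024 linear layer in fp32
q = int4_groupwise_bytes(1024, 1024)    # same layer, 4-bit groupwise
assert q == 524288 + 65536              # packed weights + scales
assert fp32 // q == 7                   # roughly 7x smaller
```

The smaller the group size, the better the accuracy but the larger the scale overhead; at group_size 32 the scales add about 12% on top of the packed weights under these assumptions.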
Run command to convert the model:
python -m extension.llm.export.export_llm \
--config-path=$WORKSPACE/scripts \
--config-name=export_outetts
Thanks a lot for your help! Looking forward to hearing what I should be trying next!
Versions
Collecting environment information...
PyTorch version: 2.9.0.dev20250811
Is debug build: False
CUDA used to build PyTorch: None
ROCM used to build PyTorch: N/A
OS: macOS 15.6 (arm64)
GCC version: Could not collect
Clang version: 17.0.0 (clang-1700.0.13.3)
CMake version: version 3.31.6
Libc version: N/A
Python version: 3.10.18 (main, Jun 3 2025, 18:23:41) [Clang 17.0.0 (clang-1700.0.13.3)] (64-bit runtime)
Python platform: macOS-15.6-arm64-arm-64bit
Is CUDA available: False
CUDA runtime version: No CUDA
CUDA_MODULE_LOADING set to: N/A
GPU models and configuration: No CUDA
Nvidia driver version: No CUDA
cuDNN version: No CUDA
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True
CPU:
Apple M2 Pro
Versions of relevant libraries:
[pip3] executorch==0.8.0a0+3f81e81
[pip3] numpy==2.2.6
[pip3] pytorch_tokenizers==0.1.0
[pip3] torch==2.9.0.dev20250811
[pip3] torchao==0.13.0+git1526dfe50
[pip3] torchaudio==2.8.0.dev20250811
[pip3] torchdata==0.11.0
[pip3] torchsr==1.0.4
[pip3] torchtune==0.6.1
[pip3] torchvision==0.24.0.dev20250811
[conda] Could not collect
cc @larryliu0820 @mergennachin @cccclai @helunwencser @jackzhxng