Summary
This release backports OpenAI tiktoken 0.13.0 into tiktoken-rs. The main reason to upgrade is better alignment with upstream tokenization behavior, especially the upstream Rust core changes for large BPE pieces and error-aware encoding.
For most users who call the high-level model/token counting helpers, this should behave the same aside from the new Rust compiler requirement. Users who call lower-level CoreBPE encoding methods directly should review the breaking changes below.
What Changed
- Backported the vendored OpenAI tiktoken Rust core from 0.9.0 to 0.13.0.
- Added the upstream large-piece BPE merge path. Functionally, this improves behavior for very large or repetitive inputs that previously stressed the merge algorithm.
- Changed CoreBPE::encode to return Result<(Vec<Rank>, usize), EncodeError>, matching upstream. Regex/tokenization failures can now be reported instead of being hidden behind infallible APIs.
- Updated encode_as and count to return Result because they call encode.
- Re-exported EncodeError so callers can handle encode failures directly.
- Aligned the vendored core with Rust 2024 and raised the crate MSRV to Rust 1.85.
- Synced model-to-tokenizer mappings with upstream tiktoken 0.13.0 while keeping local extra prefixes isolated.
- Hardened asset downloads with SHA-256 checks and a repo-root-aware asset path.
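To give a rough sense of why very large or repetitive pieces can stress a BPE merge loop, here is a minimal sketch of a naive merge that rescans every adjacent pair after each merge, which makes the work grow roughly quadratically with piece length. This is an illustration only: naive_bpe_merge and its toy rank table are hypothetical, and neither the previous vendored code nor the upstream large-piece path is claimed to look like this.

```rust
use std::collections::HashMap;

type Rank = u32;

/// Naive byte-pair merge (illustrative, not the crate's implementation).
/// Repeatedly merges the adjacent pair with the lowest rank. Each merge
/// triggers a full rescan of the remaining parts, so long pieces cost
/// roughly O(n^2) pair lookups.
fn naive_bpe_merge(piece: &[u8], ranks: &HashMap<Vec<u8>, Rank>) -> Vec<Vec<u8>> {
    // Start with one part per byte.
    let mut parts: Vec<Vec<u8>> = piece.iter().map(|b| vec![*b]).collect();

    loop {
        // Find the leftmost adjacent pair with the lowest rank.
        let mut best: Option<(usize, Rank)> = None;
        for i in 0..parts.len().saturating_sub(1) {
            let mut candidate = parts[i].clone();
            candidate.extend_from_slice(&parts[i + 1]);
            if let Some(&rank) = ranks.get(&candidate) {
                if best.map_or(true, |(_, r)| rank < r) {
                    best = Some((i, rank));
                }
            }
        }

        match best {
            // Merge the winning pair in place, then rescan from scratch.
            Some((i, _)) => {
                let right = parts.remove(i + 1);
                parts[i].extend_from_slice(&right);
            }
            // No mergeable pair is left.
            None => break,
        }
    }
    parts
}

fn main() {
    // Toy rank table (hypothetical): lower rank merges first.
    let mut ranks: HashMap<Vec<u8>, Rank> = HashMap::new();
    ranks.insert(b"ab".to_vec(), 0);
    ranks.insert(b"abab".to_vec(), 1);

    let parts = naive_bpe_merge(b"ababab", &ranks);
    for part in &parts {
        println!("{}", String::from_utf8_lossy(part));
    }
}
```

A repetitive input such as a long run of "ab" keeps many equally ranked pairs alive at once, so a loop like this performs many full rescans before it finishes.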
Breaking Changes
If your code calls CoreBPE::encode, unwrap or propagate the result before using the tokens:

    let allowed = bpe.special_tokens();
    let (tokens, last_piece_token_len) = bpe.encode("hello <|endoftext|>", &allowed)?;

The generic helpers changed similarly:

    let (tokens, last_piece_token_len) = bpe.encode_as::<usize>(text, &allowed)?;
    let token_count = bpe.count(text, &allowed)?;

encode_ordinary, encode_ordinary_as, encode_with_special_tokens, and count_ordinary remain infallible.
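The migration for the fallible methods can be sketched without the crate itself. The example below uses stand-ins: MockBpe and the local EncodeError struct are hypothetical and only mimic the new return shape Result<(Vec<Rank>, usize), EncodeError>; the real EncodeError may carry different fields. It shows two common options: propagating the error with `?`, and handling it at the call site.

```rust
use std::collections::HashSet;
use std::fmt;

type Rank = u32;

// Hypothetical stand-in for the re-exported EncodeError.
#[derive(Debug)]
struct EncodeError {
    message: String,
}

impl fmt::Display for EncodeError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "{}", self.message)
    }
}

impl std::error::Error for EncodeError {}

// Hypothetical stand-in for CoreBPE, with the new fallible return shape.
struct MockBpe;

impl MockBpe {
    fn encode(
        &self,
        text: &str,
        _allowed_special: &HashSet<&str>,
    ) -> Result<(Vec<Rank>, usize), EncodeError> {
        // Simulate a tokenization failure on an arbitrary trigger.
        if text.contains('\0') {
            return Err(EncodeError {
                message: "simulated tokenization failure".to_string(),
            });
        }
        // Toy tokenization: one "token" per whitespace-separated word.
        let tokens: Vec<Rank> = text.split_whitespace().map(|w| w.len() as Rank).collect();
        let last_piece_token_len = if tokens.is_empty() { 0 } else { 1 };
        Ok((tokens, last_piece_token_len))
    }
}

// Option 1: propagate the error to the caller with `?`.
fn count_tokens(bpe: &MockBpe, text: &str) -> Result<usize, EncodeError> {
    let allowed: HashSet<&str> = HashSet::new();
    let (tokens, _last_piece_token_len) = bpe.encode(text, &allowed)?;
    Ok(tokens.len())
}

// Option 2: handle the error at the call site and report it.
fn try_count(bpe: &MockBpe, text: &str) -> Option<usize> {
    match count_tokens(bpe, text) {
        Ok(n) => Some(n),
        Err(err) => {
            eprintln!("encode failed: {err}");
            None
        }
    }
}

fn main() {
    let bpe = MockBpe;
    println!("{:?}", count_tokens(&bpe, "hello world"));
    println!("{:?}", try_count(&bpe, "bad\0input"));
}
```

Propagating with `?` keeps the failure visible to callers. Converting to Option at a boundary is reasonable only where a missing count is acceptable.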
Projects must now build with Rust 1.85 or newer.
Practical Impact
- Applications processing long repeated text should see more robust tokenization behavior.
- Code that only uses helpers like get_chat_completion_max_tokens, get_text_completion_max_tokens, bpe_for_model, or singleton tokenizer constructors should not need call-site changes.
- Code using low-level CoreBPE::encode, encode_as, or count needs a small migration to handle Result.
Links
- PR: #164
- Upstream tiktoken 0.13.0: https://github.com/openai/tiktoken/releases/tag/0.13.0
- Full changelog: v0.11.0...v0.12.0