UBPE -- Universal Byte-Pair Encoding. "Universal" means that it works not only with strings but with arbitrary sequences as well.
The package provides Universal Byte-Pair Encoding tokenizers:
- `UBPEClassic` -- an optimized version of the classic BPE algorithm
- `UBPE` -- a novel approach to BPE tokenization that lets you choose between multiple encoding variants, scored by a metric such as tf-idf; the most optimal encoding from this implementation was shorter than the encoding produced by the classic implementation
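To illustrate what "universal" means here, below is a minimal sketch of classic BPE fitting that operates on any sequence of hashable items, not just strings. This is an independent illustration, not the package's actual API or implementation; the function name and representation of merged tokens (as tuples) are assumptions made for the example.

```python
from collections import Counter

def fit_bpe(seq, num_merges):
    """Sketch of classic BPE fitting on an arbitrary hashable-item
    sequence. Each learned merge replaces the most frequent adjacent
    pair with a single new token (here, the pair tuple itself)."""
    seq = list(seq)
    merges = []
    for _ in range(num_merges):
        pairs = Counter(zip(seq, seq[1:]))
        if not pairs:
            break
        best, count = pairs.most_common(1)[0]
        if count < 2:  # no pair repeats; nothing useful to merge
            break
        merges.append(best)
        merged = []
        i = 0
        while i < len(seq):
            if i + 1 < len(seq) and (seq[i], seq[i + 1]) == best:
                merged.append(best)  # the pair becomes one new token
                i += 2
            else:
                merged.append(seq[i])
                i += 1
        seq = merged
    return merges, seq

# Works on plain integer sequences just as well as on characters:
merges, tokens = fit_bpe([1, 2, 1, 2, 3, 1, 2], 1)
# merges == [(1, 2)]; tokens == [(1, 2), (1, 2), 3, (1, 2)]
```

The novel `UBPE` tokenizer goes beyond this greedy scheme by keeping several candidate encodings and ranking them by a score, which is why it can produce shorter encodings than the classic algorithm.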
- Description of tokenizer fitting algorithms
- Description of encoding and decoding algorithms for classic and novel approaches
- Google Colab Demo for `ubpe` v0.2 (with precomputed cells)
- Google Colab Demo for `ubpe` v0.3 (with precomputed cells)
- Python native implementation
- Cython implementation with C++ backend
- Publish standalone C++ library (it is already usable)
- Other types than `uint32_t` as inner token type
- Rust backend with standalone package
- Subdocument tokenization (since v0.3)
- RegEx support
- Support for known word tokens in alphabet
- Ignored tokens
- Collaborative training
- Training checkpoints
- Training on large datasets
- Training on split datasets
- Other Features:
- One token -- Many subsequences
- Spelling correction support
- Vocabulary pruning
- Examples:
- Demo with visualization of the advantages of the novel UBPE algorithm
- Subdocument tokenization example
It is planned to deliver different implementations of the algorithm, so the package is divided into a general import package (this one) and implementations (for now, Python native and Cython with a C++20 backend). To install, use:
```
pip install ubpe[native]
```

Or:

```
pip install ubpe[cython]
```

Note
Starting with version 0.3, the C++ backend has become faster while the native one has become slower, so there's no reason to use the native backend except for educational purposes.
Note
While Google Colab is supported, interactive logging doesn't work in it due to complications with redirecting stderr to the cell output.
Warning
Encoding candidates from different backends of the novel tokenizer (UBPE) may differ in order, i.e. two encodings with the same length and weight may be returned in a different order, but both are still valid.
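If you compare outputs across backends (for example in tests), it is therefore safer to compare the candidate sets order-insensitively. A small sketch of such a comparison, using hypothetical candidate data (the `(tokens, weight)` pair shape here is illustrative, not the package's actual return format):

```python
# Hypothetical candidate lists as two backends might return them:
# each candidate is a (tokens, weight) pair with equal length and weight,
# so only the ordering differs.
cand_a = [((5, 9), 1.0), ((7, 3), 1.0)]
cand_b = [((7, 3), 1.0), ((5, 9), 1.0)]

def same_candidates(a, b):
    """Order-insensitive equality check for candidate encodings."""
    return sorted(a) == sorted(b)

assert same_candidates(cand_a, cand_b)      # same set, different order
assert cand_a != cand_b                     # naive comparison would fail
```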
If you find a bug that occurs under certain circumstances in some tests, please report it.
Bugfixes and optimizations are welcome!