Scurrra/ubpe

UBPE Tokenizer

UBPE -- Universal Byte-Pair Encoding. "Universal" means that it works not only with strings but with arbitrary sequences as well.

The package provides Universal Byte-Pair Encoding tokenizers:

  • UBPEClassic -- an optimized version of the classic BPE algorithm
  • UBPE -- a novel approach to BPE tokenization that lets you choose among several candidate encodings, ranked by a score such as tf-idf; in practice, the optimal encoding produced by this implementation was shorter than the one produced by the classic implementation
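For readers unfamiliar with BPE, the core training loop of the classic algorithm can be sketched in a few lines. This is an illustration of the general technique on an arbitrary sequence, not the package's actual implementation or API:

```python
from collections import Counter

def bpe_train(sequence, num_merges):
    """Learn merge rules by repeatedly fusing the most frequent adjacent pair.
    Works on any sequence of hashable items, not just characters."""
    seq = list(sequence)
    merges = []
    for _ in range(num_merges):
        pairs = Counter(zip(seq, seq[1:]))
        if not pairs:
            break
        (a, b), count = pairs.most_common(1)[0]
        if count < 2:
            break  # no pair repeats; further merges would not compress
        merges.append((a, b))
        merged, i = [], 0
        while i < len(seq):
            if i + 1 < len(seq) and (seq[i], seq[i + 1]) == (a, b):
                merged.append((a, b))  # fused token, represented as a tuple
                i += 2
            else:
                merged.append(seq[i])
                i += 1
        seq = merged
    return seq, merges

tokens, merges = bpe_train("abababcd", 2)
# first learned merge fuses the most frequent pair ('a', 'b')
```

The novel UBPE algorithm generalizes this idea by keeping multiple candidate encodings and scoring them, instead of committing to a single greedy merge order.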

Guides and theory

Roadmap

  • Python native implementation
  • Cython implementation with C++ backend
    • Publish standalone C++ library (it is already usable)
    • Other types than uint32_t as inner token type
  • Rust backend with standalone package
  • Subdocument tokenization (since v0.3)
    • RegEx support
    • Support for known word tokens in alphabet
    • Ignored tokens
  • Collaborative training
    • Training checkpoints
    • Training on large datasets
    • Training on split datasets
  • Other Features:
    • One token -- Many subsequences
    • Spelling correction support
    • Vocabulary pruning
  • Examples:
    • Demo with visualization of the advantages of the novel UBPE algorithm
    • Subdocument tokenization example

Installation

Several implementations of the algorithm are planned, so the package is split into a general import package (this one) and implementation packages (currently: native Python, and Cython with a C++20 backend). To install, use:

pip install ubpe[native]

Or,

pip install ubpe[cython]

Note

Starting with version 0.3, the C++ backend has become faster while the native one has become slower, so there's no reason to use the native backend except for educational purposes.

Note

While Google Colab is supported, interactive logging doesn't work in it due to complications with redirecting stderr to the cell output.

Warning

Encoding candidates from different backends of the novel tokenizer (UBPE) may differ in order: two encodings with the same length and weight may be returned in a different order, but both are still valid.
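A practical consequence: when comparing outputs across backends, compare candidates order-insensitively rather than element by element. A minimal sketch, assuming (hypothetically) that each candidate is a (tokens, weight) pair; the actual candidate type in the package may differ:

```python
def same_candidates(enc_a, enc_b):
    """Check that two backends returned the same set of candidate encodings,
    ignoring the order in which equally-scored candidates appear."""
    # sort by (length, weight) first, then by token content as a tie-breaker
    key = lambda cand: (len(cand[0]), cand[1], tuple(cand[0]))
    return sorted(enc_a, key=key) == sorted(enc_b, key=key)

# two backends return the same candidates, differently ordered
backend_a = [([1, 2], 0.5), ([3], 0.5)]
backend_b = [([3], 0.5), ([1, 2], 0.5)]
```

Here `same_candidates(backend_a, backend_b)` is true even though the lists disagree element-wise.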

Bug reports

If you find a bug, even one that occurs only under specific circumstances or intermittently in the tests, please report it.

Contribution

Bugfixes and optimizations are welcome!
