UBPE -- Universal Byte-Pair Encoding. "Universal" means that it works not only with strings but with arbitrary sequences as well.
The package provides Universal Byte-Pair Encoding tokenizers:
- `UBPEClassic` -- an optimized version of the classic BPE algorithm
- `UBPE` -- a novel approach to BPE tokenization that lets you choose between multiple encoding variants, scored by a metric such as tf-idf; the most optimal encoding from this implementation was shorter than the encoding produced by the classic implementation
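To illustrate what "universal" means here, below is a minimal sketch of classic BPE fitting that operates on any sequence of hashable items, not just strings. This is an independent illustration, not the package's actual API or implementation; the function name and representation of merged tokens (as tuples) are assumptions made for the example.

```python
from collections import Counter

def fit_bpe(seq, num_merges):
    """Sketch of classic BPE fitting on an arbitrary hashable-item
    sequence. Each learned merge replaces the most frequent adjacent
    pair with a single new token (here, the pair tuple itself)."""
    seq = list(seq)
    merges = []
    for _ in range(num_merges):
        pairs = Counter(zip(seq, seq[1:]))
        if not pairs:
            break
        best, count = pairs.most_common(1)[0]
        if count < 2:  # no pair repeats; nothing useful to merge
            break
        merges.append(best)
        merged = []
        i = 0
        while i < len(seq):
            if i + 1 < len(seq) and (seq[i], seq[i + 1]) == best:
                merged.append(best)  # the pair becomes one new token
                i += 2
            else:
                merged.append(seq[i])
                i += 1
        seq = merged
    return merges, seq

# Works on plain integer sequences just as well as on characters:
merges, tokens = fit_bpe([1, 2, 1, 2, 3, 1, 2], 1)
# merges == [(1, 2)]; tokens == [(1, 2), (1, 2), 3, (1, 2)]
```

The novel `UBPE` tokenizer goes beyond this greedy scheme by keeping several candidate encodings and ranking them by a score, which is why it can produce shorter encodings than the classic algorithm.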
- Description of tokenizer fitting algorithms
- Description of encoding and decoding algorithms for classic and novel approaches
- Google Colab Demo for `ubpe` v0.2 (with precomputed cells)
- Google Colab Demo for `ubpe` v0.3 (with precomputed cells)
- Python native implementation
- Cython implementation with C++ backend
- Publish standalone C++ library (it is already usable)
- Other types than `uint32_t` as inner token type
- Rust backend with standalone package
- Subdocument tokenization (since v0.3)
- RegEx support
- Support for known word tokens in alphabet
- Ignored tokens
- Collaborative training
- Training checkpoints
- Training on large datasets
- Training on split datasets
- Other Features:
- One token -- Many subsequences
- Spelling correction support
- Vocabulary pruning
- Examples:
- Demo with visualization of the advantages of the novel UBPE algorithm
- Subdocument tokenization example
It is planned to deliver different implementations of the algorithm, so the package is divided into a general import package (this one) and implementations (for now, Python native and Cython with a C++20 backend). To install, use:
```
pip install ubpe[native]
```

Or:

```
pip install ubpe[cython]
```

Note
Starting with version 0.3, the C++ backend has become faster while the native one has become slower, so there's no reason to use the native backend except for educational purposes.
Note
While Google Colab is supported, interactive logging doesn't work in it due to complications with redirecting stderr to the cell output.
Warning
Encoding candidates from different backends of the novel tokenizer (UBPE) may differ in order, i.e. two encodings with the same length and weight may be returned in a different order, but both are still valid.
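If you compare outputs across backends (for example in tests), it is therefore safer to compare the candidate sets order-insensitively. A small sketch of such a comparison, using hypothetical candidate data (the `(tokens, weight)` pair shape here is illustrative, not the package's actual return format):

```python
# Hypothetical candidate lists as two backends might return them:
# each candidate is a (tokens, weight) pair with equal length and weight,
# so only the ordering differs.
cand_a = [((5, 9), 1.0), ((7, 3), 1.0)]
cand_b = [((7, 3), 1.0), ((5, 9), 1.0)]

def same_candidates(a, b):
    """Order-insensitive equality check for candidate encodings."""
    return sorted(a) == sorted(b)

assert same_candidates(cand_a, cand_b)      # same set, different order
assert cand_a != cand_b                     # naive comparison would fail
```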
If you find a bug that occurs under certain circumstances in some tests, please report it.
Bugfixes and optimizations are welcome!