A multilingual morphological analysis library.
Updated May 2, 2024 - Rust
A grammar describes the syntax of a programming language, and might be defined in Backus-Naur form (BNF). A lexer performs lexical analysis, turning text into tokens. A parser takes those tokens and builds a data structure such as an abstract syntax tree (AST). The parser is concerned with context: does the sequence of tokens fit the grammar? A compiler typically starts with a lexer and a parser built for a specific grammar, then analyzes the resulting tree and translates it into another form, such as machine code.
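The lexer/parser split described above can be illustrated with a minimal sketch in Rust. It assumes a toy grammar (`expr := term ('+' term)*`, `term := factor ('*' factor)*`, `factor := NUM | '(' expr ')'`) and hypothetical names (`Token`, `Expr`, `lex`, `parse`) chosen only for this example; it is not taken from any of the libraries listed here.

```rust
// Tokens produced by the lexer (names are assumptions for this sketch).
#[derive(Debug, Clone, PartialEq)]
enum Token {
    Num(f64),
    Plus,
    Star,
    LParen,
    RParen,
}

// AST nodes built by the parser.
#[derive(Debug, PartialEq)]
enum Expr {
    Num(f64),
    Add(Box<Expr>, Box<Expr>),
    Mul(Box<Expr>, Box<Expr>),
}

// Lexer: turns text into tokens, skipping whitespace.
fn lex(src: &str) -> Result<Vec<Token>, String> {
    let chars: Vec<char> = src.chars().collect();
    let mut tokens = Vec::new();
    let mut i = 0;
    while i < chars.len() {
        let c = chars[i];
        match c {
            ' ' | '\t' | '\n' => i += 1,
            '+' => { tokens.push(Token::Plus); i += 1; }
            '*' => { tokens.push(Token::Star); i += 1; }
            '(' => { tokens.push(Token::LParen); i += 1; }
            ')' => { tokens.push(Token::RParen); i += 1; }
            _ if c.is_ascii_digit() => {
                // Consume a run of digits and dots, then let f64 parsing validate it.
                let start = i;
                while i < chars.len() && (chars[i].is_ascii_digit() || chars[i] == '.') {
                    i += 1;
                }
                let text: String = chars[start..i].iter().collect();
                let n = text
                    .parse::<f64>()
                    .map_err(|e| format!("bad number {}: {}", text, e))?;
                tokens.push(Token::Num(n));
            }
            _ => return Err(format!("unexpected character {:?} at {}", c, i)),
        }
    }
    Ok(tokens)
}

// Recursive-descent parser: one method per grammar rule.
struct Parser<'a> {
    tokens: &'a [Token],
    pos: usize,
}

impl<'a> Parser<'a> {
    fn peek(&self) -> Option<&Token> {
        self.tokens.get(self.pos)
    }

    // expr := term ('+' term)*
    fn expr(&mut self) -> Result<Expr, String> {
        let mut lhs = self.term()?;
        while self.peek() == Some(&Token::Plus) {
            self.pos += 1;
            lhs = Expr::Add(Box::new(lhs), Box::new(self.term()?));
        }
        Ok(lhs)
    }

    // term := factor ('*' factor)*
    fn term(&mut self) -> Result<Expr, String> {
        let mut lhs = self.factor()?;
        while self.peek() == Some(&Token::Star) {
            self.pos += 1;
            lhs = Expr::Mul(Box::new(lhs), Box::new(self.factor()?));
        }
        Ok(lhs)
    }

    // factor := NUM | '(' expr ')'
    fn factor(&mut self) -> Result<Expr, String> {
        match self.peek().cloned() {
            Some(Token::Num(n)) => {
                self.pos += 1;
                Ok(Expr::Num(n))
            }
            Some(Token::LParen) => {
                self.pos += 1;
                let inner = self.expr()?;
                if self.peek() != Some(&Token::RParen) {
                    return Err("expected ')'".to_string());
                }
                self.pos += 1;
                Ok(inner)
            }
            other => Err(format!("unexpected token {:?}", other)),
        }
    }
}

// Parser entry point: succeeds only if the whole token sequence fits the grammar.
fn parse(tokens: &[Token]) -> Result<Expr, String> {
    let mut p = Parser { tokens, pos: 0 };
    let ast = p.expr()?;
    if p.pos != tokens.len() {
        return Err(format!("trailing input at token {}", p.pos));
    }
    Ok(ast)
}
```

Calling `lex("1 + 2 * 3")` and passing the resulting slice to `parse` builds a tree in which the multiplication nests under the addition, because `term` binds tighter than `expr` in the assumed grammar. Input such as `1 + * 2` lexes successfully but is rejected by the parser, which is the "does the sequence of tokens fit the grammar?" check.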
🎤 vibrato: Viterbi-based accelerated tokenizer
Rust-tokenizer offers high-performance tokenizers for modern language models, including WordPiece, Byte-Pair Encoding (BPE) and Unigram (SentencePiece) models
🛥 Vaporetto: Very accelerated pointwise prediction based tokenizer
Rust re-implementation of OpenFST - library for constructing, combining, optimizing, and searching weighted finite-state transducers (FSTs). A Python binding is also available.
Chinese tokenizer for tantivy, based on jieba-rs
Viterbi-based accelerated tokenizer (Python wrapper)
Thai Natural Language Processing library in Rust, with Python and Node bindings.
The maeel programming language
🛥 Vaporetto is a fast and lightweight pointwise prediction based tokenizer. This is a Python wrapper for Vaporetto.
Rust wrapper for the BlingFire tokenization library
The Bytepiece Tokenizer Implemented in Rust.