tokenizer
A grammar describes the syntax of a programming language, and might be defined in Backus-Naur form (BNF). A lexer performs lexical analysis, turning text into tokens. A parser takes those tokens and builds a data structure such as an abstract syntax tree (AST). The parser is concerned with context: does the sequence of tokens fit the grammar? A compiler's front end typically combines a lexer and a parser, both built for a specific grammar.
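To make the lexer step concrete, here is a minimal sketch of lexical analysis for a tiny arithmetic language. The `Token` enum and the `tokenize` function are hypothetical names chosen for illustration; they are not taken from any repository listed under this topic, and a production lexer would also track source positions and handle a much larger character set.

```rust
// Hypothetical token type for a tiny arithmetic language (illustrative only).
#[derive(Debug, Clone, PartialEq)]
enum Token {
    Number(f64),
    Ident(String),
    Plus,
    Minus,
    Star,
    Slash,
    LParen,
    RParen,
}

// Turn source text into tokens, or report the first unexpected character.
// The lexer only groups characters; it does not check that the token
// sequence fits a grammar (that is the parser's job).
fn tokenize(src: &str) -> Result<Vec<Token>, String> {
    let mut tokens = Vec::new();
    let mut chars = src.chars().peekable();

    while let Some(&c) = chars.peek() {
        match c {
            // Whitespace separates tokens but produces none.
            c if c.is_whitespace() => {
                chars.next();
            }
            '+' => { chars.next(); tokens.push(Token::Plus); }
            '-' => { chars.next(); tokens.push(Token::Minus); }
            '*' => { chars.next(); tokens.push(Token::Star); }
            '/' => { chars.next(); tokens.push(Token::Slash); }
            '(' => { chars.next(); tokens.push(Token::LParen); }
            ')' => { chars.next(); tokens.push(Token::RParen); }
            // Numbers: a run of ASCII digits and dots, validated by f64 parsing.
            c if c.is_ascii_digit() => {
                let mut text = String::new();
                while let Some(&d) = chars.peek() {
                    if d.is_ascii_digit() || d == '.' {
                        text.push(d);
                        chars.next();
                    } else {
                        break;
                    }
                }
                let value = text
                    .parse::<f64>()
                    .map_err(|e| format!("bad number {:?}: {}", text, e))?;
                tokens.push(Token::Number(value));
            }
            // Identifiers: a letter or underscore, then letters, digits, underscores.
            c if c.is_alphabetic() || c == '_' => {
                let mut name = String::new();
                while let Some(&d) = chars.peek() {
                    if d.is_alphanumeric() || d == '_' {
                        name.push(d);
                        chars.next();
                    } else {
                        break;
                    }
                }
                tokens.push(Token::Ident(name));
            }
            other => return Err(format!("unexpected character {:?}", other)),
        }
    }
    Ok(tokens)
}

fn main() {
    let tokens = tokenize("x + 42 * (y - 3.5)").expect("input should lex");
    println!("{:?}", tokens);
}
```

Note that `tokenize` accepts input such as `) + (`, which is lexically valid but would be rejected by a parser checking the token sequence against a grammar.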
Here are 62 public repositories matching this topic...
A multilingual morphological analysis library.
Updated May 29, 2024 - Rust
Rust-tokenizer offers high-performance tokenizers for modern language models, including WordPiece, Byte-Pair Encoding (BPE) and Unigram (SentencePiece) models
Updated Oct 1, 2023 - Rust
Chinese tokenizer for tantivy, based on jieba-rs
Updated Nov 4, 2023 - Rust
Rust re-implementation of OpenFST - library for constructing, combining, optimizing, and searching weighted finite-state transducers (FSTs). A Python binding is also available.
Updated May 23, 2024 - Rust
🎤 vibrato: Viterbi-based accelerated tokenizer
Updated May 30, 2024 - Rust
🛥 Vaporetto: Very accelerated pointwise prediction based tokenizer
Updated May 31, 2024 - Rust
Thai Natural Language Processing library in Rust, with Python and Node bindings.
Updated Nov 26, 2023 - Rust
3D object recognition tool for WASM. HASH ID sustainable identity creation and its verification.
Updated Feb 27, 2024 - Rust
3D object recognition CLI tool for Linux
Updated Mar 4, 2024 - Rust
Rust wrapper for the BlingFire tokenization library
Updated Jun 23, 2020 - Rust
The maeel programming language
Updated Apr 16, 2024 - Rust
🛥 Vaporetto is a fast and lightweight pointwise prediction based tokenizer. This is a Python wrapper for Vaporetto.
Updated Sep 5, 2023 - Rust
Converts Lox source code into syntax tokens.
Updated Aug 23, 2018 - Rust