Byte-Pair Encoding tokenizer for large language models and huge datasets
Updated Jun 3, 2024 · Python
A grammar describes the syntax of a programming language, and might be defined in Backus-Naur form (BNF). A lexer performs lexical analysis, turning text into tokens. A parser takes tokens and builds a data structure such as an abstract syntax tree (AST). The parser is concerned with context: does the sequence of tokens fit the grammar? A compiler's front end combines a lexer and a parser built for a specific grammar; a full compiler adds further stages such as semantic analysis and code generation.
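The lexing step described above can be sketched in a few lines of Python. This is a minimal illustration, not any particular project's implementation: the token names and regular expressions are assumptions chosen for a toy arithmetic-style language.

```python
import re

# Illustrative token kinds and patterns; a real lexer would derive
# these from the grammar of the language being tokenized.
TOKEN_SPEC = [
    ("NUMBER", r"\d+"),          # integer literals
    ("IDENT",  r"[A-Za-z_]\w*"), # identifiers
    ("OP",     r"[+\-*/=]"),     # single-character operators
    ("SKIP",   r"\s+"),          # whitespace, discarded
]
MASTER = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in TOKEN_SPEC))

def lex(text):
    """Turn source text into (kind, value) tokens, skipping whitespace."""
    tokens = []
    for match in MASTER.finditer(text):
        kind = match.lastgroup
        if kind != "SKIP":
            tokens.append((kind, match.group()))
    return tokens

print(lex("x = 42"))  # [('IDENT', 'x'), ('OP', '='), ('NUMBER', '42')]
```

A parser would then consume this token stream and check it against the grammar, building an AST as it goes.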
Tools and resources for the computational processing of Nheengatu (Modern Tupi)
Simple multilingual lemmatizer for Python, especially useful for speed and efficiency
Retro-style tokenization for language models
DadmaTools is a Persian NLP toolkit developed by Dadmatech Co.
BLEU Score in Rust
Taiwanese Hokkien Transliterator and Tokeniser
A simple, consistent and extendable toolkit for IndicTrans2 tokenizer
Persian NLP Toolkit
Python port of Moses tokenizer, truecaser and normalizer
An Integrated Corpus Tool With Multilingual Support for the Study of Language, Literature, and Translation
IMDB Movie Reviews Sentiment Analysis using RNN
The collections of tools for testing and dumping LLMs
Bitextor generates translation memories from multilingual websites
Implementation of MambaByte from "MambaByte: Token-free Selective State Space Model" in PyTorch and Zeta