tokenizer
A grammar describes the syntax of a programming language, and might be defined in Backus-Naur form (BNF). A lexer performs lexical analysis, turning text into tokens. A parser takes tokens and builds a data structure like an abstract syntax tree (AST). The parser is concerned with context: does the sequence of tokens fit the grammar? The front end of a compiler typically combines a lexer and a parser built for a specific grammar; later stages analyze and translate the resulting tree.
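The lexer and parser roles described above can be sketched in a few dozen lines of C++ (C++11 or later). This is a minimal illustration under stated assumptions, not the approach of any repository listed here: the arithmetic grammar in the comments and the names `TokenKind`, `Token`, `tokenize`, `Node`, `makeNode`, and `Parser` are all hypothetical.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <memory>
#include <stdexcept>
#include <string>
#include <utility>
#include <vector>

// Hypothetical token kinds for a tiny arithmetic language.
enum class TokenKind { Number, Plus, Minus, Star, Slash, LParen, RParen, End };
struct Token { TokenKind kind; double value; };  // value is used by Number only

// Lexer: turns text into tokens, skipping whitespace.
std::vector<Token> tokenize(const std::string& text) {
    std::vector<Token> out;
    std::size_t i = 0;
    while (i < text.size()) {
        const unsigned char c = static_cast<unsigned char>(text[i]);
        if (std::isspace(c)) { ++i; continue; }
        if (std::isdigit(c)) {  // unsigned integer literal
            double v = 0.0;
            while (i < text.size() && std::isdigit(static_cast<unsigned char>(text[i])))
                v = v * 10 + (text[i++] - '0');
            out.push_back({TokenKind::Number, v});
            continue;
        }
        TokenKind kind;
        switch (c) {
            case '+': kind = TokenKind::Plus; break;
            case '-': kind = TokenKind::Minus; break;
            case '*': kind = TokenKind::Star; break;
            case '/': kind = TokenKind::Slash; break;
            case '(': kind = TokenKind::LParen; break;
            case ')': kind = TokenKind::RParen; break;
            default: throw std::runtime_error("unexpected character");
        }
        out.push_back({kind, 0.0});
        ++i;
    }
    out.push_back({TokenKind::End, 0.0});  // sentinel marking end of input
    return out;
}

// AST node: a Number leaf, or a binary operator with two children.
struct Node { TokenKind op; double value; std::unique_ptr<Node> lhs, rhs; };
using NodePtr = std::unique_ptr<Node>;
NodePtr makeNode(TokenKind op, double value, NodePtr l, NodePtr r) {
    return NodePtr(new Node{op, value, std::move(l), std::move(r)});
}

// Parser: checks that the tokens fit the grammar and builds the AST.
class Parser {
public:
    explicit Parser(std::vector<Token> tokens) : tokens_(std::move(tokens)) {
        assert(!tokens_.empty() && tokens_.back().kind == TokenKind::End);
    }
    NodePtr parse() {
        NodePtr tree = expr();
        if (peek() != TokenKind::End) throw std::runtime_error("trailing tokens");
        return tree;
    }
private:
    TokenKind peek() const { return tokens_[pos_].kind; }
    Token next() { return tokens_[pos_++]; }
    NodePtr expr() {  // expr := term (('+' | '-') term)*
        NodePtr left = term();
        while (peek() == TokenKind::Plus || peek() == TokenKind::Minus) {
            const TokenKind op = next().kind;
            NodePtr right = term();
            left = makeNode(op, 0.0, std::move(left), std::move(right));
        }
        return left;
    }
    NodePtr term() {  // term := factor (('*' | '/') factor)*
        NodePtr left = factor();
        while (peek() == TokenKind::Star || peek() == TokenKind::Slash) {
            const TokenKind op = next().kind;
            NodePtr right = factor();
            left = makeNode(op, 0.0, std::move(left), std::move(right));
        }
        return left;
    }
    NodePtr factor() {  // factor := NUMBER | '(' expr ')'
        const Token t = next();
        if (t.kind == TokenKind::Number) return makeNode(t.kind, t.value, nullptr, nullptr);
        if (t.kind != TokenKind::LParen) throw std::runtime_error("expected number or '('");
        NodePtr inner = expr();
        if (next().kind != TokenKind::RParen) throw std::runtime_error("expected ')'");
        return inner;
    }
    std::vector<Token> tokens_;
    std::size_t pos_ = 0;
};
```

With this sketch, `Parser(tokenize("2 + 3 * (4 - 1)")).parse()` returns a tree whose root is a `Plus` node and whose right child is a `Star` node, so operator precedence is encoded in the shape of the AST. Input that does not fit the grammar, such as `"2 +"`, is rejected with a `std::runtime_error`.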
Here are 69 public repositories matching this topic...
Fast and customizable text tokenization library with BPE and SentencePiece support
Updated Nov 10, 2023 - C++
High-Performance Stemmer, Tokenizer, and Spell Checker for R
Updated Oct 27, 2023 - C++
Juman++ (a Morphological Analyzer Toolkit)
Updated Oct 3, 2023 - C++
R package for Tokenization, Parts of Speech Tagging, Lemmatization and Dependency Parsing Based on the UDPipe Natural Language Processing Toolkit
Updated Mar 1, 2023 - C++
A Cython MeCab wrapper for fast, pythonic Japanese tokenization and morphological analysis.
Updated Apr 15, 2024 - C++
SQLite3 source code with an integrated FTS5 Chinese tokenizer
Updated Dec 31, 2017 - C++
Thot toolkit for statistical machine translation
Updated Nov 11, 2022 - C++
C++ Lexer Toolkit Library (LexerTk) https://www.partow.net/programming/lexertk/index.html
Updated Nov 21, 2020 - C++
Source code to go with my parser programming tutorial videos.
Updated Mar 6, 2022 - C++
A repository to bind mecab for Python 3.5+. Not using swig nor pybind. (Not Maintained Now)
Updated May 21, 2021 - C++
Smart Language Model
Updated Dec 21, 2022 - C++
Text segmenter and tokeniser for Danish, English and other languages. Reads an RTF or flat text file and outputs the text, one line per sentence & optionally tokenized.
Updated Dec 1, 2022 - C++
Dockerfile/docker-compose Elasticsearch with plugins elasticsearch-analysis-vietnamese and coccoc-tokenizer
Updated Nov 17, 2021 - C++
The AFP Library is a collection of C++11 header files that provides users with a flexible rapid prototyping tool to create general-purpose LL(k) parsers in C++.
Updated Jan 10, 2019 - C++
Updated Feb 28, 2020 - C++