tokenizers

Here are 38 public repositories matching this topic...

xebia-functional / xef

Building applications with LLMs through composability, in Kotlin

kotlin scala ai functional-programming embeddings artificial-intelligence openai multiplatform agents tokenizers llm chatgpt-api

Updated Oct 14, 2024
Kotlin

jshuadvd / LongRoPE

Implementation of the paper "LongRoPE: Extending LLM Context Window Beyond 2 Million Tokens"

nlp machine-learning natural-language-processing ai deep-learning transformers artificial-intelligence gpt language-model natural-language-inference natural tokenization natural-language-understanding attention-is-all-you-need attention-mechanisms transformer-architecture natural-language-procressing tokenizers llm

Updated Jul 20, 2024
Python

ImadSaddik / ElasticSearch_Python_Course

This repository is part of a course on Elasticsearch in Python. It includes notebooks that demonstrate its usage, along with a YouTube series to guide you through the material.

search-engine elasticsearch embeddings elastic semantic-search knn-algorithm hybrid-search tokenizers

Updated Jul 6, 2025
Jupyter Notebook

Arunprakash-A / DL-Pytorch-Workshop

Develop DL models using PyTorch and Hugging Face

workshop transformers pytorch datasets dl hf tokenizers

Updated Nov 30, 2024

Prismadic / magnet

The small distributed language model toolkit; fine-tune state-of-the-art LLMs anywhere, rapidly

Updated Oct 19, 2024
Python

sayakpaul / count-tokens-hf-datasets

This project shows how to derive the total number of training tokens from a large text dataset from 🤗 datasets with Apache Beam and Dataflow.

transformers dataflow apache-beam tokenizers hf-datasets unigram-tokenization

Updated Oct 20, 2022
Python

1kkiRen / Tokenizer-Changer

Python script for manipulating an existing tokenizer.

tokens delete tokenizers

Updated Jun 11, 2025
Python

megagonlabs / ginza-transformers

Use custom tokenizers in spacy-transformers

nlp natural-language-processing transformers spacy ginza spacy-transformers tokenizers sudachitra

Updated Aug 9, 2022
Python

gweidart / rs-bpe

A ridiculously fast Python BPE (Byte Pair Encoder) implementation written in Rust

python rust openai pypi-package bpe byte-pair-encoding huggingface tokenizers llm tiktoken bpe-tokenizer byte-pair-tokenizer

Updated Mar 19, 2025
Python

symanto-research / merge-tokenizers

Package to align tokens from different tokenizations.

distance transformers tokens tokenizers

Updated Mar 25, 2024
Python

Hugging-Face-Supporter / tftokenizers

Use Hugging Face Transformers and Tokenizers as reusable TensorFlow SavedModels

nlp natural-language-processing tensorflow tokenizer transformers bert tensorflow-hub tokenizers sentencepie

Updated Mar 29, 2022
Python

sappho192 / Tokenizers.DotNet

[Unofficial] Simple .NET wrapper for the Hugging Face Tokenizers library

rust library csharp dotnet nuget huggingface tokenizers

Updated Jul 1, 2025
C#

Anush008 / tokenizers

Multi-arch bindings for @huggingface/tokenizers.

huggingface tokenizers

Updated Sep 17, 2023
Rust

unfoldingWord / string-punctuation-tokenizer

Small library that provides functions to tokenize a string into an array of words with or without punctuation

javascript nlp segmentation nlp-library tokenizers scripture-open-components

Updated Aug 9, 2023
JavaScript

mickymultani / LLM-Architecture

Visualize some important concepts related to LLM architectures.

transformers attention-mechanism huggingface huggingface-transformers tokenizers llm llm-inference llm-architecture

Updated Oct 16, 2023
Jupyter Notebook

Beomi / megatronlm_dataset_autotokenizer

Megatron-LM/GPT-NeoX compatible Text Encoder with 🤗Transformers AutoTokenizer.

transformers gpt-neox tokenizers megatron-lm

Updated Nov 16, 2023
Python

arturom / search-analysis

A graphical user interface for the Elasticsearch Analyze API

react elasticsearch text-analysis filters analyzers tokenizers analyze-api

Updated Apr 1, 2025
JavaScript

wassemgtk / SuperTokenizer

A high-performance tokenizer built to rival GPT-4, trained on the C4 dataset.

tokenizer tokenizer-framework tokenizers

Updated Mar 25, 2025
Jupyter Notebook

cobanov / turkish-bpe-tokenizer

Byte Pair Encoding (BPE) tokenizer tailored for the Turkish language

tokenizer turkish bpe tokenizers turkish-tokenization

Updated Dec 10, 2024
Python

kojix2 / blingfire-crystal

crystal tokenizers

Updated Jan 17, 2025
Crystal
