HTGM.2 is a Hindi-first BPE tokenizer trained on a ~41 GB corpus using a streaming architecture, built for scalable Hindi LLMs, Devanagari NLP, and low-memory tokenizer engineering.
nlp machine-learning research artificial-intelligence devanagari streaming-tokenizer huggingface-tokenizers llm bpe-tokenizer hindi-ai hindi-llm hindi-tokenizer tokenizer-engineering
Updated May 17, 2026 - Python