Train and evaluate Malayalam subword tokenizers using BPE and Unigram algorithms, built on the HuggingFace tokenizers library.
Build the project:

```sh
cargo build --release
```

Train a BPE tokenizer:

```sh
make train-bpe-tokenizer CORPUS_DIR=/path/to/corpus VOCAB_SIZE=16000
```

Train a Unigram tokenizer:

```sh
make train-unigram-tokenizer CORPUS_DIR=/path/to/corpus VOCAB_SIZE=16000
```

Compare the trained tokenizers:

```sh
make compare-tokenizers
```

This produces a Markdown and a JSON report comparing the two tokenizers across nine intrinsic quality metrics, including Rényi entropy, fertility, morphological consistency, and OOV rate.
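To make two of the reported metrics concrete, here is a minimal Rust sketch (not part of this repo; function names are illustrative) of Rényi entropy over a token frequency distribution and fertility as tokens per word:

```rust
// Illustrative sketch, not repo code: two of the intrinsic metrics.
// Renyi entropy: H_alpha = ln(sum_i p_i^alpha) / (1 - alpha), alpha != 1.
fn renyi_entropy(counts: &[u64], alpha: f64) -> f64 {
    let total: f64 = counts.iter().sum::<u64>() as f64;
    let sum_p_alpha: f64 = counts
        .iter()
        .map(|&c| (c as f64 / total).powf(alpha))
        .sum();
    sum_p_alpha.ln() / (1.0 - alpha)
}

// Fertility: average number of subword tokens produced per word.
fn fertility(num_tokens: usize, num_words: usize) -> f64 {
    num_tokens as f64 / num_words as f64
}

fn main() {
    // A uniform distribution over 4 tokens has entropy ln(4) for any alpha.
    let uniform = [10u64, 10, 10, 10];
    println!("Renyi entropy (alpha=2.5): {:.4}", renyi_entropy(&uniform, 2.5));
    println!("Fertility: {:.2}", fertility(150, 100));
}
```

A lower fertility means the tokenizer splits words into fewer pieces, which is generally desirable for a morphologically rich language like Malayalam.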
To publish the tokenizers to the HuggingFace Hub:

```sh
uvx huggingface-cli login  # once
git tag -a v1.0.0 -m "Release v1.0.0"
make publish-all VERSION=v1.0.0
```

Run the test suite:

```sh
cargo test
```

License: MIT