smc/malayalam-tokenizer

malayalam-tokenizer

Train and evaluate Malayalam subword tokenizers using BPE and Unigram algorithms, built on the HuggingFace tokenizers library.

Trained tokenizers

Build

cargo build --release

Train

make train-bpe-tokenizer CORPUS_DIR=/path/to/corpus VOCAB_SIZE=16000
make train-unigram-tokenizer CORPUS_DIR=/path/to/corpus VOCAB_SIZE=16000
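
For orientation, here is a minimal Python sketch of BPE training with the HuggingFace tokenizers bindings. It is not the code behind the make targets above, whose internals are not shown here. The helper names `iter_corpus_lines` and `train_bpe`, the `*.txt` file layout, the whitespace pre-tokenizer, and the `[UNK]` special token are all assumptions made for illustration.

```python
from pathlib import Path

from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.pre_tokenizers import WhitespaceSplit
from tokenizers.trainers import BpeTrainer


def iter_corpus_lines(corpus_dir):
    """Yield non-empty lines from every *.txt file under corpus_dir.

    The *.txt layout is an assumption; adapt it to the actual corpus.
    """
    for path in sorted(Path(corpus_dir).glob("*.txt")):
        with open(path, encoding="utf-8") as handle:
            for line in handle:
                line = line.strip()
                if line:
                    yield line


def train_bpe(lines, vocab_size=16000):
    """Train a BPE tokenizer on an iterable of strings.

    vocab_size mirrors the VOCAB_SIZE value passed to make.
    """
    tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
    # Split on whitespace only, so Malayalam words stay intact
    # before subword merges are learned (an assumed choice).
    tokenizer.pre_tokenizer = WhitespaceSplit()
    trainer = BpeTrainer(
        vocab_size=vocab_size,
        special_tokens=["[UNK]"],
        show_progress=False,
    )
    tokenizer.train_from_iterator(lines, trainer=trainer)
    return tokenizer
```

A call such as `train_bpe(iter_corpus_lines("/path/to/corpus")).save("bpe-tokenizer.json")` would write the result to a JSON file; the output file name is hypothetical. The library provides analogous `Unigram` and `UnigramTrainer` classes for the Unigram algorithm.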

Evaluate

make compare-tokenizers

Produces a report in Markdown and JSON comparing the BPE and Unigram tokenizers across nine intrinsic quality metrics, including Rényi entropy, fertility, morphological consistency, and out-of-vocabulary (OOV) rate.
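
As a rough guide to two of these metrics, the following Python sketch computes fertility (average subword tokens per word) and the Rényi entropy of the token frequency distribution. It is not taken from this repository. The function names, the default order `alpha=2.5`, and the example segmentation are assumptions made for illustration, and the report may define or normalise these metrics differently.

```python
import math
from collections import Counter


def fertility(tokenized_words):
    """Average number of subword tokens per word.

    tokenized_words is a list of words, each given as a list of tokens.
    """
    if not tokenized_words:
        raise ValueError("need at least one word")
    return sum(len(tokens) for tokens in tokenized_words) / len(tokenized_words)


def renyi_entropy(tokens, alpha=2.5):
    """Rényi entropy (in nats) of the empirical token distribution.

    H_alpha = log(sum_i p_i ** alpha) / (1 - alpha); as alpha approaches 1
    this becomes the Shannon entropy, which is handled separately.
    """
    counts = Counter(tokens)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("need at least one token")
    probs = [count / total for count in counts.values()]
    if math.isclose(alpha, 1.0):
        return -sum(p * math.log(p) for p in probs)
    return math.log(sum(p ** alpha for p in probs)) / (1.0 - alpha)


# Hypothetical segmentation of three words, for illustration only.
segmented = [["മല", "യാളം"], ["ഭാഷ"], ["മല", "കൾ"]]
print(fertility(segmented))
print(renyi_entropy([token for word in segmented for token in word]))
```

Lower fertility means words are split into fewer pieces. For a fixed vocabulary size, higher Rényi entropy means token frequencies are spread more evenly across the vocabulary.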

Publish to HuggingFace

uvx huggingface-cli login        # once
git tag -a v1.0.0 -m "Release v1.0.0"
make publish-all VERSION=v1.0.0

Test

cargo test

License

MIT
