malayalam-tokenizer

Train and evaluate Malayalam subword tokenizers using BPE and Unigram algorithms, built on the HuggingFace tokenizers library.

Trained tokenizers

Build

cargo build --release

Train

make train-bpe-tokenizer CORPUS_DIR=/path/to/corpus VOCAB_SIZE=16000
make train-unigram-tokenizer CORPUS_DIR=/path/to/corpus VOCAB_SIZE=16000

Evaluate

make compare-tokenizers

Produces Markdown and JSON reports comparing the two tokenizers across nine intrinsic quality metrics, including Rényi entropy, fertility, morphological consistency, and out-of-vocabulary (OOV) rate.
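To give a feel for what three of these metrics measure, here is a minimal standalone sketch using only the Rust standard library. It is not the code this repository runs: the function names, the choice of the Rényi order `alpha`, the `[UNK]` token string, and the example segmentation are all assumptions made for illustration, and the definitions used by `make compare-tokenizers` may differ in detail.

```rust
use std::collections::HashMap;

/// Fertility: average number of subword tokens produced per word.
/// Each inner Vec holds the tokens of one word (hypothetical input layout).
fn fertility(tokenized_words: &[Vec<&str>]) -> f64 {
    if tokenized_words.is_empty() {
        return 0.0;
    }
    let total_tokens: usize = tokenized_words.iter().map(|w| w.len()).sum();
    total_tokens as f64 / tokenized_words.len() as f64
}

/// Rényi entropy of order `alpha` (in bits) of the empirical token distribution.
/// For alpha == 1 this falls back to the Shannon entropy, which is the limit case.
fn renyi_entropy(tokens: &[&str], alpha: f64) -> f64 {
    if tokens.is_empty() {
        return 0.0;
    }
    let mut counts: HashMap<&str, usize> = HashMap::new();
    for t in tokens {
        *counts.entry(*t).or_insert(0) += 1;
    }
    let n = tokens.len() as f64;
    if (alpha - 1.0).abs() < 1e-12 {
        -counts
            .values()
            .map(|&c| {
                let p = c as f64 / n;
                p * p.log2()
            })
            .sum::<f64>()
    } else {
        let power_sum: f64 = counts.values().map(|&c| (c as f64 / n).powf(alpha)).sum();
        power_sum.log2() / (1.0 - alpha)
    }
}

/// OOV rate: fraction of tokens equal to the unknown-token marker.
/// The marker string is passed in because it depends on the tokenizer config.
fn oov_rate(tokens: &[&str], unk: &str) -> f64 {
    if tokens.is_empty() {
        return 0.0;
    }
    let unknown = tokens.iter().filter(|t| **t == unk).count();
    unknown as f64 / tokens.len() as f64
}

fn main() {
    // Hypothetical segmentation of three words, not real tokenizer output.
    let words: Vec<Vec<&str>> = vec![
        vec!["മല", "യാളം"],
        vec!["ഭാഷ"],
        vec!["[UNK]"],
    ];
    let stream: Vec<&str> = words.iter().flatten().copied().collect();

    println!("fertility     = {:.3}", fertility(&words));
    println!("renyi (a=2.5) = {:.3}", renyi_entropy(&stream, 2.5));
    println!("oov rate      = {:.3}", oov_rate(&stream, "[UNK]"));
}
```

Lower fertility means words are split into fewer pieces, and a lower OOV rate means less of the text falls back to the unknown token. Rényi entropy summarizes how evenly the vocabulary is used, so it is best read as a comparison between the two tokenizers on the same corpus rather than as an absolute score.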

Publish to HuggingFace

uvx huggingface-cli login        # one-time authentication
git tag -a v1.0.0 -m "Release v1.0.0"
make publish-all VERSION=v1.0.0

Test

cargo test

License

MIT
