# Tokenization for Complex Scripts

This repository decomposes complex scripts (such as SignWriting and Chinese) into smaller units, and proposes more effective tokenization for NLP tasks.

## Usage

Install:

```bash
git clone https://github.com/sign-language-processing/complex-tokenization.git
cd complex-tokenization
pip install ".[dev]"
```

Pretokenize text using a Hugging Face `Tokenizer`-style implementation:

```python
from complex_tokenization.tokenizer import WordsSegmentationTokenizer

pretokenizer = WordsSegmentationTokenizer(max_bytes=16)
tokens = pretokenizer.tokenize("hello world! 我爱北京天安门 👩‍👩‍👧‍👦")
# ['hello ', 'world! ', '我', '爱', '北京', '天安门', ' ', '👩‍👩‍👧‍👦‍']
```
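Note in the output above that the segmenter is script-aware: the Chinese text is split into words (`北京`, `天安门`) rather than single characters. As a rough, hypothetical sketch of only the `max_bytes` idea — not the package's actual implementation, which does linguistic word segmentation — a pretokenizer with a byte budget could look like this:

```python
# Illustrative sketch only (NOT the library's implementation): split text
# into whitespace-delimited chunks, then break any chunk whose UTF-8
# encoding exceeds max_bytes into individual code points.
import re


def pretokenize(text: str, max_bytes: int = 16) -> list[str]:
    # Each chunk is a run of non-whitespace plus its trailing whitespace,
    # mirroring the 'hello ' / 'world! ' shape in the example above.
    chunks = re.findall(r"\S+\s*", text)
    out: list[str] = []
    for chunk in chunks:
        if len(chunk.encode("utf-8")) <= max_bytes:
            out.append(chunk)  # chunk fits the byte budget; keep whole
        else:
            out.extend(chunk)  # too large; fall back to code points
    return out


print(pretokenize("hello world! 我爱北京"))
# ['hello ', 'world! ', '我爱北京']
```

Here `我爱北京` is 12 UTF-8 bytes, so it fits the 16-byte budget and stays whole; a long emoji ZWJ sequence (25 bytes) would instead be split into its constituent code points. The real tokenizer additionally segments Chinese into words, which this sketch does not attempt.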

## Cite

If you use this code in your research, please consider citing the work:

```bibtex
@misc{moryossef2025complex,
  title={Tokenization for Complex Scripts},
  author={Moryossef, Amit},
  howpublished={\url{https://github.com/sign-language-processing/complex-tokenization}},
  year={2025}
}
```
