GitHub - sjais1337/tokenizers

This is a from-scratch implementation of several tokenizers, written as part of assignments in CS779.

Appropriate data structures have been used to optimize execution time, allowing fast and easy tokenization of a corpus.
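As an illustration of how the choice of data structure affects tokenizer training, here is a minimal sketch of byte-pair-encoding style merge learning. It is not taken from this repository: the function names `count_pairs`, `merge_pair`, and `train_bpe` are hypothetical, and the use of a frequency-weighted word table with a `Counter` of adjacent pairs is an assumption about one reasonable approach, not a description of what `bpe.py` does.

```python
from collections import Counter


def count_pairs(words):
    """Count adjacent symbol pairs, weighted by how often each word occurs."""
    pairs = Counter()
    for symbols, freq in words.items():
        for left, right in zip(symbols, symbols[1:]):
            pairs[(left, right)] += freq
    return pairs


def merge_pair(words, pair):
    """Return a new word table with every occurrence of `pair` fused into one symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged


def train_bpe(corpus, num_merges):
    """Learn up to `num_merges` merge rules from a whitespace-split corpus (hypothetical helper)."""
    # Store each distinct word once with its count, so repeated words are not rescanned.
    words = dict(Counter(tuple(word) for word in corpus.split()))
    merges = []
    for _ in range(num_merges):
        pairs = count_pairs(words)
        if not pairs:
            break  # every word is already a single symbol
        best = max(pairs, key=pairs.get)
        merges.append(best)
        words = merge_pair(words, best)
    return merges
```

Keeping one entry per distinct word means the cost of each merge step scales with the vocabulary size, not the raw corpus length. For example, `train_bpe("aab aab ab", 5)` stops after two merges, because every word has then collapsed to a single symbol.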

papers
.gitignore
README.md
bpe.py
sp_bpe.py
unigram.py
wp.py
