This is a from-scratch implementation of the following tokenizers, written as part of the assignments for CS779:
- WordPiece Tokenizer
- Unigram Tokenizer
- BPE Tokenizer
- SentencePiece BPE Tokenizer

Appropriate data structures are used to reduce execution time, allowing fast and easy tokenization of a corpus.
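
As an illustration of how a data structure can speed up tokenization, here is a minimal sketch of greedy longest-match WordPiece tokenization backed by a trie. It is not taken from this repository: the names `TrieNode`, `build_trie`, `build_tries`, and `wordpiece_tokenize`, the `##` continuation prefix, and the `[UNK]` token are assumptions chosen for the example. The trie lets each lookup walk the word character by character instead of testing every candidate substring against the vocabulary.

```python
class TrieNode:
    """One node of a character trie (hypothetical helper for this sketch)."""
    __slots__ = ("children", "is_token")

    def __init__(self):
        self.children = {}
        self.is_token = False


def build_trie(tokens):
    """Insert every token into a fresh trie and return its root."""
    root = TrieNode()
    for token in tokens:
        node = root
        for ch in token:
            node = node.children.setdefault(ch, TrieNode())
        node.is_token = True
    return root


def build_tries(vocab):
    """Build one trie for word-initial pieces and one for '##' continuations."""
    start = build_trie(t for t in vocab if not t.startswith("##"))
    cont = build_trie(t[2:] for t in vocab if t.startswith("##"))
    return start, cont


def wordpiece_tokenize(word, start_root, cont_root, unk="[UNK]"):
    """Greedy longest-match-first split of a single word."""
    tokens = []
    i = 0
    while i < len(word):
        # Word-initial pieces and continuation pieces live in separate tries.
        node = start_root if i == 0 else cont_root
        end = -1
        j = i
        # Walk as far as the trie allows, remembering the last full token seen.
        while j < len(word) and word[j] in node.children:
            node = node.children[word[j]]
            j += 1
            if node.is_token:
                end = j
        if end == -1:
            # No vocabulary piece matches here: the whole word is unknown.
            return [unk]
        piece = word[i:end]
        tokens.append(piece if i == 0 else "##" + piece)
        i = end
    return tokens


# Toy vocabulary, assumed purely for illustration.
vocab = ["u", "un", "##a", "##aff", "##able"]
start_root, cont_root = build_tries(vocab)
print(wordpiece_tokenize("unaffable", start_root, cont_root))
# → ['un', '##aff', '##able']
```

With this layout, finding the longest matching piece at a given position takes time proportional to the length of that piece, rather than requiring a separate vocabulary lookup for each candidate substring.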