I trained tokenizers on two corpora, the Four Great Classical Novels of Chinese literature and the collected novels of Jin Yong, using the notebook here. This is an exercise in understanding the BPE (byte pair encoding) tokenizer, a popular tokenizer used by many large language models. Many thanks to Andrej Karpathy's video, Let's build a GPT tokenizer, which gave me a much more intuitive sense of how it works.
Many of the visualizations aim to make the idea graspable at a glance.
I implemented the most basic version of byte pair encoding: the original corpus is encoded as UTF-8 bytes, so every token's ancestry can be traced back to pairs of base bytes.
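As a small illustration of the byte-level starting point, the sketch below shows how a single Chinese character becomes several base tokens before any merging happens. The variable names are only illustrative and are not taken from the notebook.

```python
# Illustrative only: how text becomes the base byte-level tokens.
text = "刀"  # a single Chinese character

# Encode to UTF-8; each byte is an integer in 0..255 and acts as a base token.
base_tokens = list(text.encode("utf-8"))
print(base_tokens)  # this character takes three bytes in UTF-8

# Decoding the same bytes gives the original text back.
round_trip = bytes(base_tokens).decode("utf-8")
print(round_trip)
```

Because most Chinese characters take three bytes in UTF-8, a byte-level tokenizer has to learn at least two merges before a single character becomes one token.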
Throughout the tree, each level combines a pair of tokens into one new token.
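The pairwise merging described above can be sketched roughly as follows. This is a minimal byte-level BPE training loop in the spirit of the basic algorithm, not the notebook's actual code; the function names `count_pairs`, `merge`, and `train_bpe` are hypothetical.

```python
from collections import Counter


def count_pairs(ids):
    """Count how often each adjacent pair of tokens occurs."""
    return Counter(zip(ids, ids[1:]))


def merge(ids, pair, new_id):
    """Replace every non-overlapping occurrence of `pair` with `new_id`."""
    out, i = [], 0
    while i < len(ids):
        if i + 1 < len(ids) and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out


def train_bpe(text, num_merges):
    """Return (merges, ids), where merges maps (left, right) -> new token id."""
    ids = list(text.encode("utf-8"))  # base tokens are the raw bytes 0..255
    merges = {}
    for step in range(num_merges):
        pairs = count_pairs(ids)
        if not pairs:
            break  # nothing left to merge
        pair = max(pairs, key=pairs.get)  # most frequent adjacent pair
        new_id = 256 + step               # new ids start after the byte range
        ids = merge(ids, pair, new_id)
        merges[pair] = new_id
    return merges, ids


merges, ids = train_bpe("aaabdaaabac", num_merges=3)
print(merges)
print(ids)
```

Each entry in `merges` is one edge pair in the tree: the key holds the two parent tokens and the value is the child token they form.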
You can try this token tree interface: for each token it displays its ancestry and its possible descendants.
You can choose which corpus the tokenizer was trained on, then enter a sentence (preferably one quoted from the texts themselves).
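To make the ancestry and descendant relations concrete, here is a minimal sketch of how they could be derived from a merge table. The `merges` table below is hand-written for illustration, and the helpers `ancestry`, `descendants`, and `token_bytes` are hypothetical names, not the interface's real implementation.

```python
# Hypothetical merge table: (left, right) -> new token id.
merges = {(97, 97): 256, (256, 97): 257, (257, 98): 258}

# Invert it so each merged token points back to its two parents.
parents = {new_id: pair for pair, new_id in merges.items()}


def ancestry(token_id, depth=0):
    """Return (depth, token_id) rows, walking up until base bytes are reached."""
    rows = [(depth, token_id)]
    for parent in parents.get(token_id, ()):
        rows.extend(ancestry(parent, depth + 1))
    return rows


def descendants(token_id):
    """Return ids of tokens built directly from `token_id`."""
    return sorted(new_id for pair, new_id in merges.items() if token_id in pair)


def token_bytes(token_id):
    """Expand a token back into the raw bytes it stands for."""
    if token_id < 256:
        return bytes([token_id])
    left, right = parents[token_id]
    return token_bytes(left) + token_bytes(right)


for depth, tid in ancestry(258):
    print("  " * depth + f"{tid} {token_bytes(tid)!r}")
print(descendants(97))
```

Walking `parents` goes upstream toward the base bytes, while scanning `merges` for pairs that contain a token goes one step downstream.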

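In the middle we can see how the sentence is actually split into tokens, with the upstream tokens shown above and the downstream tokens shown below.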
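A rough sketch of how a sentence might be split into tokens with a learned merge table is shown here. The tiny `merges` table is hypothetical (it only knows how to assemble the character 刀), and applying the earliest-learned merge first is a common choice that is assumed here rather than confirmed from the notebook.

```python
def encode(text, merges):
    """Apply learned merges to the UTF-8 bytes, earliest-learned merge first."""
    ids = list(text.encode("utf-8"))
    while len(ids) >= 2:
        pairs = set(zip(ids, ids[1:]))
        # Pick the pair that was learned earliest (lowest new id), if any.
        pair = min(pairs, key=lambda p: merges.get(p, float("inf")))
        if pair not in merges:
            break  # no known merge applies any more
        new_id, out, i = merges[pair], [], 0
        while i < len(ids):
            if i + 1 < len(ids) and (ids[i], ids[i + 1]) == pair:
                out.append(new_id)
                i += 2
            else:
                out.append(ids[i])
                i += 1
        ids = out
    return ids


def build_vocab(merges):
    """Map every token id to the bytes it represents."""
    vocab = {i: bytes([i]) for i in range(256)}
    for (left, right), new_id in sorted(merges.items(), key=lambda kv: kv[1]):
        vocab[new_id] = vocab[left] + vocab[right]
    return vocab


# Hypothetical table: two merges that assemble the three bytes of "刀".
merges = {(229, 136): 256, (256, 128): 257}
vocab = build_vocab(merges)

ids = encode("刀刀a", merges)
# errors="replace" keeps partial characters printable instead of raising.
pieces = [vocab[t].decode("utf-8", errors="replace") for t in ids]
print(ids)
print(pieces)
```

Decoding each token separately, as `pieces` does, is one simple way to see the granularity a tokenizer reaches on a given sentence.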

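On each token node's card, the number at the bottom left is the token id and the number at the bottom right is how often the token occurs: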

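Different tokenizers produce upstream and downstream trees of different sizes. For example, here are the relationship trees for the same character 刀 (knife) in Water Margin, Dream of the Red Chamber, and Romance of the Three Kingdoms: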

We can also observe at what granularity tokens appear across the complete works:





