BPE Tokenizers on Classical Romance Masterpieces

Custom BPE tokens / tokenizers for the Four Great Classical Novels + Jin Yong

We trained tokenizers on two corpora, the Four Great Classical Novels and the complete works of Jin Yong, using the notebook here. This is an exercise toward understanding the BPE tokenizer, a popular kind of tokenizer for many large language models. Many thanks to the Andrej Karpathy video "Let's build a GPT tokenizer"; I have a more intuitive sense of it now.

Many of the visualizations aim to give an intuitive, at-a-glance grasp of the idea.

Tokenizer

I implemented a very basic version of Byte Pair Encoding. The original corpus is encoded as UTF-8 bytes, so the ancestry of every token can be traced back to combinations of pairs of base bytes.
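To make the description above concrete, here is a minimal sketch of byte-level BPE training. It is not the code from this repository: the function names (`count_pairs`, `merge_pair`, `train_bpe`) are hypothetical, and the choices to give new tokens ids starting at 256 and to break ties by first occurrence are assumptions made for illustration.

```python
from collections import Counter


def count_pairs(ids):
    """Count how often each adjacent pair of token ids occurs."""
    return Counter(zip(ids, ids[1:]))


def merge_pair(ids, pair, new_id):
    """Replace every non-overlapping occurrence of `pair` with `new_id`."""
    out, i = [], 0
    while i < len(ids):
        if i + 1 < len(ids) and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out


def train_bpe(text, num_merges):
    """Learn up to `num_merges` merges; returns {(left, right): new_id}."""
    ids = list(text.encode("utf-8"))  # base tokens are the byte values 0..255
    merges = {}
    for step in range(num_merges):
        pairs = count_pairs(ids)
        if not pairs:
            break  # fewer than two tokens left, nothing to merge
        pair, _count = pairs.most_common(1)[0]  # most frequent adjacent pair
        new_id = 256 + step  # assumption: new ids follow the 256 byte ids
        ids = merge_pair(ids, pair, new_id)
        merges[pair] = new_id
    return merges


# Tiny demo corpus: the character 刀 repeated three times (3 bytes each).
merges = train_bpe("刀刀刀", 2)
print(merges)
```

Because a Chinese character usually takes three UTF-8 bytes, it typically needs two merges before it becomes a single token, which is why even single characters have an ancestry in this scheme.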

Across the entire tree, the rule is the same: at each level, a pair of two tokens is combined into one new token.
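The pairwise tree described above can be sketched as follows. The merge table here is hand-written for illustration (the ids 256 and 257 are hypothetical, not taken from the trained tokenizers), and the helper names `expand`, `ancestry`, and `descendants` are assumptions rather than functions from this repository.

```python
# Hypothetical merge table: (left_id, right_id) -> new_id.
# 229, 136, 128 are the UTF-8 bytes of the character 刀.
merges = {(229, 136): 256, (256, 128): 257}

# Reverse lookup: new_id -> the pair it was built from.
parents = {new_id: pair for pair, new_id in merges.items()}


def expand(token_id):
    """Recursively expand a token id into its raw bytes."""
    if token_id < 256:
        return bytes([token_id])
    left, right = parents[token_id]
    return expand(left) + expand(right)


def ancestry(token_id):
    """Nested tuple (id, left_subtree, right_subtree); bytes are leaves."""
    if token_id < 256:
        return token_id
    left, right = parents[token_id]
    return (token_id, ancestry(left), ancestry(right))


def descendants(token_id):
    """Ids of tokens that use `token_id` directly as one half of their pair."""
    return sorted(new_id for pair, new_id in merges.items() if token_id in pair)


print(expand(257).decode("utf-8"))  # the text this token stands for
print(ancestry(257))                # upstream: what it was built from
print(descendants(256))             # downstream: what is built from it
```

Every non-byte token has exactly two parents, so the upstream side is a binary tree, while the downstream side can fan out to any number of tokens.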

You can try this token tree interface. For each token it displays the ancestry (upstream tokens) and the possible descendants (downstream tokens).

You can choose which source corpus the tokenizer was trained on, then enter a sentence (preferably one quoted from the text).

The middle of the page shows the actual tokenization result, with the upstream tokens above it and the downstream tokens below it.

In each token node's card, the number at the bottom left is the token id and the number at the bottom right is the occurrence count.
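As a rough illustration of where a token id and an occurrence count could come from, the sketch below encodes a sentence with a learned merge table and counts the resulting tokens. The README does not say how the displayed frequency is computed, so counting tokens in the encoded text is only one plausible reading. The merge table is hand-written and hypothetical, and `encode` is an illustrative helper that applies merges in the order they were learned.

```python
from collections import Counter

# Hypothetical merge table: (left_id, right_id) -> new_id.
merges = {(229, 136): 256, (256, 128): 257}


def encode(text, merges):
    """Encode text to token ids, applying the earliest-learned merge first."""
    ids = list(text.encode("utf-8"))
    while len(ids) >= 2:
        pairs = set(zip(ids, ids[1:]))
        # Unknown pairs rank as infinity, so a known pair always wins.
        pair = min(pairs, key=lambda p: merges.get(p, float("inf")))
        if pair not in merges:
            break  # no remaining pair can be merged
        new_id = merges[pair]
        out, i = [], 0
        while i < len(ids):
            if i + 1 < len(ids) and (ids[i], ids[i + 1]) == pair:
                out.append(new_id)
                i += 2
            else:
                out.append(ids[i])
                i += 1
        ids = out
    return ids


tokens = encode("刀刀a", merges)
freq = Counter(tokens)  # token id -> number of occurrences in this text
print(tokens)
print(freq)
```

On a full novel the same counting would run over the encoded text of the whole corpus rather than a single sentence.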

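Different tokenizers produce upstream and downstream trees of different sizes. For example, the same character 刀 ("knife") has a different relationship tree in each of Water Margin, Dream of the Red Chamber, and Romance of the Three Kingdoms.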

Tokenizing Corpus

Here we can observe the granularity at which tokens appear across a complete work:

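For both the Four Great Classical Novels and Jin Yong, use the gear icon at the top right of the corpus page to select the work and chapter.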

Stats & Dashboard

Tokenizer statistics page
