raynardj/ciyuan

BPE Tokenizers on classical romance masterpieces

Custom BPE tokens / tokenizers for the Four Great Classical Novels (四大名著) + Jinyong (金庸)


We trained several tokenizers on the Four Great Classical Novels and the full Jinyong corpus, using the notebook here. This is an exercise toward understanding the BPE tokenizer, a popular tokenizer for many large language models. Many thanks to Andrej Karpathy's video "Let's build a GPT tokenizer"; I have a more intuitive sense of it now.

Many of the visualizations aim to give an intuitive, at-a-glance grasp of the idea.

Tokenizer

I implemented a very basic version of Byte Pair Encoding: the original corpus is converted to UTF-8 encoded bytes, so every token's ancestry can be traced back to combinations of pairs of base bytes.
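As a rough illustration of the idea above, here is a minimal sketch of byte-level BPE training. It is not the code from the repository's notebook; the function names `count_pairs`, `merge`, and `train_bpe`, and the convention that new token ids start at 256, are assumptions made for this example.

```python
from collections import Counter

def count_pairs(ids):
    """Count how often each adjacent pair of token ids occurs."""
    return Counter(zip(ids, ids[1:]))

def merge(ids, pair, new_id):
    """Replace every non-overlapping occurrence of `pair` with `new_id`."""
    out, i = [], 0
    while i < len(ids):
        if i + 1 < len(ids) and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out

def train_bpe(text, num_merges):
    """Learn up to `num_merges` merges from the UTF-8 bytes of `text`."""
    ids = list(text.encode("utf-8"))   # base tokens are the byte values 0..255
    merges = {}                        # (left id, right id) -> new token id
    for step in range(num_merges):
        pairs = count_pairs(ids)
        if not pairs:
            break                      # nothing left to merge
        pair, _ = pairs.most_common(1)[0]
        new_id = 256 + step            # assumed id scheme: new ids follow the bytes
        ids = merge(ids, pair, new_id)
        merges[pair] = new_id
    return merges, ids

merges, ids = train_bpe("话说天下大势，分久必合，合久必分。", num_merges=3)
print(merges)
```

Because every new token is defined by exactly two existing ids, the `merges` table is all that is needed to reconstruct the ancestry of any token.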

Throughout the entire tree, each level combines two tokens (a pair) into one new token.
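The pairwise structure means a token's ancestry is a binary tree whose leaves are single bytes. The sketch below walks such a tree; the `parents` table, the token ids, and the helper names `expand` and `ancestry` are hypothetical and chosen only for illustration.

```python
def expand(token_id, parents):
    """Recursively expand a token id into the base bytes it stands for."""
    if token_id < 256:
        return bytes([token_id])
    left, right = parents[token_id]
    return expand(left, parents) + expand(right, parents)

def ancestry(token_id, parents, depth=0):
    """Return an indented text outline of the token's merge tree."""
    lines = ["  " * depth + f"{token_id}: {expand(token_id, parents)!r}"]
    if token_id >= 256:
        for parent in parents[token_id]:
            lines.extend(ancestry(parent, parents, depth + 1))
    return lines

# Hypothetical merge table: new token id -> the pair it was built from.
b = list("曹".encode("utf-8"))   # a CJK character is three UTF-8 bytes
parents = {
    256: (b[0], b[1]),           # first two bytes of the character
    257: (256, b[2]),            # all three bytes: the full character
}
print("\n".join(ancestry(257, parents)))
```

The outline has one line per node: the merged token, its two parents, and the base bytes beneath them.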

You can try this tokenizing tree interface: for each token, it displays the token's ancestry (upstream tokens) and its possible descendants (downstream tokens).

You can choose which source text the tokenizer was trained on, then enter a sentence (preferably an exact quote from that text).

The middle of the view shows how the sentence is actually tokenized, with the upstream tokens above and the downstream tokens below.

In each token node's card, the number at the bottom left is the token id and the number at the bottom right is its occurrence frequency.

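Different tokenizers produce upstream and downstream trees of different sizes. For example, the same character has a different relationship tree in the Water Margin (水浒), Dream of the Red Chamber (红楼), and Romance of the Three Kingdoms (三国) tokenizers.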
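The downstream side of the tree can be derived from the same merge table by inverting it. The following sketch assumes a hypothetical `parents` mapping (new token id to the pair it was merged from) with made-up ids; it is not taken from the interface's implementation.

```python
from collections import defaultdict

def build_children(parents):
    """Map each token id to the ids of tokens that were merged from it."""
    children = defaultdict(list)
    for new_id, (left, right) in parents.items():
        children[left].append(new_id)
        if right != left:              # avoid listing a self-pair twice
            children[right].append(new_id)
    return children

def descendants(token_id, children):
    """Collect every token reachable downstream of `token_id`."""
    found, stack = [], [token_id]
    while stack:
        for child in children.get(stack.pop(), []):
            if child not in found:
                found.append(child)
                stack.append(child)
    return sorted(found)

# Hypothetical merge table with made-up ids.
parents = {256: (230, 155), 257: (256, 185), 258: (257, 257)}
children = build_children(parents)
print(descendants(230, children))
```

Running the same lookup against merge tables trained on different books would give trees of different sizes for the same starting byte or character, since each corpus produces its own merges.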


Tokenizing Corpus

Here we can observe the granularity at which tokens occur across the complete works.

For the Four Great Classical Novels and Jinyong, use the gear icon at the top right to select a work and a chapter.
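To make the notion of token granularity concrete, the sketch below encodes a short string with a small merge table and lists the bytes behind each resulting token. The merge table is hand-built for this example rather than trained, and the names `encode` and `pieces` are assumptions.

```python
def encode(text, merges):
    """Apply learned merges, earliest first, to the UTF-8 bytes of `text`."""
    ids = list(text.encode("utf-8"))
    for pair, new_id in sorted(merges.items(), key=lambda kv: kv[1]):
        out, i = [], 0
        while i < len(ids):
            if i + 1 < len(ids) and (ids[i], ids[i + 1]) == pair:
                out.append(new_id)
                i += 2
            else:
                out.append(ids[i])
                i += 1
        ids = out
    return ids

def pieces(ids, merges):
    """Return the byte string behind each token id."""
    parents = {new_id: pair for pair, new_id in merges.items()}

    def expand(token_id):
        if token_id < 256:
            return bytes([token_id])
        left, right = parents[token_id]
        return expand(left) + expand(right)

    return [expand(t) for t in ids]

# Hand-built (hypothetical) merges: one character, another character, then both.
c1 = list("曹".encode("utf-8"))
c2 = list("操".encode("utf-8"))
merges = {
    (c1[0], c1[1]): 256, (256, c1[2]): 257,   # 257 = first character
    (c2[0], c2[1]): 258, (258, c2[2]): 259,   # 259 = second character
    (257, 259): 260,                          # 260 = the two-character name
}
ids = encode("曹操说", merges)
print(ids)
print(pieces(ids, merges))
```

In this toy table the two-character name becomes a single token, while the third character, which has no merges, stays as three separate byte tokens. A tokenizer trained on a full book would show the same contrast between frequent and rare strings.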

Stats & Dashboard

The tokenizer statistics page.

About

Full collection of Jinyong corpus for tokenization study
