# Tokenize

- tokenizer rule
- train
- encode
- decode
- tokenizer process
    - pre-process
    - padding
    - truncation
- tokenizer IO
    - save
    - load
- `TokenizerBase` class
- `TokenizerWord` class

## Tokenizer Rule

Given the texts:

1. “我唱跳和rap有 2 年半。”
2. “I have 12 apples!”

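Token definition: the smallest element in the discrete sequence representation of a text. The granularity is a design choice: it can be a character, and a word is also a valid granularity.

For example, "live" can be a single token, or the sequence `l`, `i`, `v`, `e`.

What is a tokenizer?

Tokenization is the rule that turns text into a discrete representation, for example a character-level or a word-level splitting rule.

The tokens produced can be stored in a vocabulary.

A tokenizer is the set of operations built from "tokenization rule + vocabulary".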
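
As a rough illustration of the definition above, the sketch below applies a character-level rule and a simplified word-level rule to the same string and builds a vocabulary for each. The names `char_level`, `word_level`, and `build_toy_vocab` are made up for this example and are not used elsewhere in the notebook.

```python
from typing import Dict, List

def char_level(text: str) -> List[str]:
    # character-level rule: every character is a token
    return list(text)

def word_level(text: str) -> List[str]:
    # word-level rule (simplified assumption): split on single spaces
    return text.split(" ")

def build_toy_vocab(tokens: List[str]) -> Dict[str, int]:
    # assign ids in order of first appearance
    toy_vocab: Dict[str, int] = {}
    for tok in tokens:
        if tok not in toy_vocab:
            toy_vocab[tok] = len(toy_vocab)
    return toy_vocab

sample = "live and let live"
for rule in (char_level, word_level):
    toks = rule(sample)
    toy_vocab = build_toy_vocab(toks)
    # same text, different rule -> different tokens, vocabulary, and ids
    print(rule.__name__, toks, [toy_vocab[t] for t in toks])
```

The same text yields a small vocabulary with long id sequences under the character rule, and a larger vocabulary with short id sequences under the word rule.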

The expected token lists after tokenization are:
1. `我`,`唱`,`跳`,`和`,`rap`,`有`,` `,`2`,` `,`年`,`半`,`。`
2. `I`,` `,`have`,` `,`1`,`2`,` `,`apples`,`!`

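    Custom splitting rules

    1. First split out special tokens, punctuation, and Chinese characters. A special token such as `<EOS>` is a single unit and must be extracted first; otherwise the symbols `<` and `>` would be split apart.
    2. Digits are split individually, e.g. "12" -> `1`, `2`.
    3. The space ` ` is also a token of its own.
    4. Initialize the vocabulary with common symbols such as `a`, `!`, `%`. With all 26 letters in the vocabulary, every word that contains **no other symbols** can be spelled out from them.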
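
To see why rule 1 asks for special tokens to be extracted first, here is a minimal sketch with two toy patterns. They are simplified stand-ins chosen for this example, not the pattern used in the notebook.

```python
import re

toy_text = "<SOS>hi 12<EOS>"

# Special tokens listed first: each tag is matched as one token.
with_special = r"<SOS>|<EOS>|\d|[<> ]|[^<>\d ]+"
# No special-token alternative: '<' and '>' are split off as punctuation.
without_special = r"\d|[<> ]|[^<>\d ]+"

tokens_with = re.findall(with_special, toy_text)
tokens_without = re.findall(without_special, toy_text)

print(tokens_with)     # ['<SOS>', 'hi', ' ', '1', '2', '<EOS>']
print(tokens_without)  # ['<', 'SOS', '>', 'hi', ' ', '1', '2', '<', 'EOS', '>']
```

Regex alternation tries the alternatives from left to right at each position, so the special tokens have to come before the punctuation alternative.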

In [1]:
text = "<SOS>我唱跳和rap有 2 年半。<EOS><SOS>I have 12 apples!<PAD><PAD>"

In [2]:
import re
import string
zh_symbols = '，。！？；：“”‘’【】（）《》、'
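en_symbols = re.escape(string.punctuation)  # escape regex metacharacters
all_symbols = zh_symbols + en_symbols + ' '  # add space
print(all_symbols)

# Build the regular expression (you do not need to master this):
# 1. [{}] - matches any single punctuation character
# 2. \d    - matches any single digit
# 3. [\u4e00-\u9fa5] - matches any single Chinese character
# 4. [^{}\d\u4e00-\u9fa5]+ - matches a run of any other characters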
# pattern = f'[^{all_symbols}\d\u4e00-\u9fa5]+|[{all_symbols}]|\d|[\u4e00-\u9fa5]'

special_tokens = ['<SOS>', '<EOS>', '<PAD>', '<UNK>']

pattern = (
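    r'(?:' + '|'.join(special_tokens) + ')'   # non-capturing group: matches any of the fixed tags
    r'|[' + re.escape(all_symbols) + ']'  # a punctuation character (escaping the already-escaped string again is harmless inside [...])
    r'|\d'  # a single digit
    r'|[\u4e00-\u9fa5]'  # a single Chinese character
    r'|[^' + re.escape(all_symbols) + r'\d\u4e00-\u9fa5<>]+'  # a run of any other characters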
)

，。！？；：“”‘’【】（）《》、!"\#\$%\&'\(\)\*\+,\-\./:;<=>\?@\[\\\]\^_`\{\|\}\~ 


In [3]:
token_list = re.findall(pattern, text)
print(token_list)

['<SOS>', '我', '唱', '跳', '和', 'rap', '有', ' ', '2', ' ', '年', '半', '。', '<EOS>', '<SOS>', 'I', ' ', 'have', ' ', '1', '2', ' ', 'apples', '!', '<PAD>', '<PAD>']


The output contains repeated tokens, so building a vocabulary requires deduplication, for example with a set or a dict.
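
A minimal sketch of the deduplication step, using a short hand-written token list rather than the notebook's `token_list`. A dict is used here because it keeps insertion order (Python 3.7+), so the ids are reproducible; iterating over a set of strings gives no such guarantee.

```python
toy_tokens = ['<SOS>', 'I', ' ', 'have', ' ', '1', '2', ' ', 'apples', '!', '<PAD>', '<PAD>']

# dict.fromkeys keeps the first occurrence of each key, in insertion order
unique_tokens = list(dict.fromkeys(toy_tokens))
toy_vocab = {tok: idx for idx, tok in enumerate(unique_tokens)}

print(unique_tokens)
print(toy_vocab)
```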

## Corpus Example

For display only; you do not need to understand the text itself.

**Large Language Models (LLMs):**  
Modern LLMs, such as OpenAI's GPT-4 (2023) and Meta's LLaMA-3 (2024), leverage transformer architectures (Vaswani et al., 2017) with self-attention mechanisms:  

\begin{equation}
\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V
\end{equation}

These models scale to hundreds of billions of parameters (e.g., GPT-4: ~1.8T, LLaMA-3: ~400B), enabling state-of-the-art performance in NLP tasks. Training requires massive datasets (e.g., >1T tokens) and distributed computing frameworks.  

**大语言模型（LLM）技术：**  
现代大语言模型（如OpenAI的GPT-4（2023）和Meta的LLaMA-3（2024））基于Transformer架构（Vaswani等，2017），其自注意力机制公式为：  

\begin{equation}
\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V
\end{equation}

此类模型的参数量已达数千亿（如GPT-4约1.8万亿，LLaMA-3约4000亿），依赖超大规模训练数据（>1万亿词元）和分布式计算框架，推动NLP任务性能突破。  

 
**Scaling Laws & Trends:**  
**LLM Scaling Trends:**  
Empirical studies (Kaplan et al., 2020) show model performance scales as a power-law with compute budget ($C$), dataset size ($D$), and parameters ($N$):  

\begin{equation}
L(N, D) \approx \left(\frac{N_c}{N}\right)^{\alpha_N} + \left(\frac{D_c}{D}\right)^{\alpha_D} + L_\infty
\end{equation}

For example, Google's PaLM-2 (2023, 340B params) achieved 85% multilingual accuracy, while smaller models (e.g., Mistral-7B, 2024) optimize efficiency via sparse architectures.  

**大语言模型的扩展定律：**  
实证研究（Kaplan等，2020）表明，模型性能随算力（$C$）、数据量（$D$）和参数量（$N$）呈幂律增长：  

\begin{equation}
L(N, D) \approx \left(\frac{N_c}{N}\right)^{\alpha_N} + \left(\frac{D_c}{D}\right)^{\alpha_D} + L_\infty
\end{equation}

例如，谷歌的PaLM-2（2023，3400亿参数）实现了85%的多语言准确率，而小规模模型（如Mistral-7B，2024）通过稀疏架构提升效率。  

In [8]:
# Note: add a block with the 26 letters (both cases) and the 10 digits
text_init = """
 a b c d e f g h i j k l m n o p q r s t u v w x y z 
 A B C D E F G H I J K L M N O P Q R S T U V W X Y Z 
 0 1 2 3 4 5 6 7 8 9 10 
 <SOS> <EOS> <UNK> <PAD>
 ， 。 ！？；：“”‘’【】（）《》、!"\#\$%\&'\(\)\*\+,\-\./:;<=>\?@\[\\\]\^_`\{\|\}\~ 
"""

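# Caution: corpus_text is not a raw string, so \t, \f, \r, \a and \b (as in \text, \frac, \right, \alpha, \begin)
# are read as control characters; the space in "\ begin" and "\ text" keeps those backslashes literal.
corpus_text = """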

**Large Language Models (LLMs):**  
Modern LLMs, such as OpenAI's GPT-4 (2023) and Meta's LLaMA-3 (2024), leverage transformer architectures (Vaswani et al., 2017) with self-attention mechanisms:  

\ begin{equation}
\ text{Attention}(Q, K, V) = \ text{softmax}\left(\ frac{QK^T}{\sqrt{d_k}}\right)V
\end{equation}

These models scale to hundreds of billions of parameters (e.g., GPT-4: ~1.8T, LLaMA-3: ~400B), enabling state-of-the-art performance in NLP tasks. Training requires massive datasets (e.g., >1T tokens) and distributed computing frameworks.  

**大语言模型（LLM）技术：**  
现代大语言模型（如OpenAI的GPT-4（2023）和Meta的LLaMA-3（2024））基于Transformer架构（Vaswani等，2017），其自注意力机制公式为：  

\ begin{equation}
\ text{Attention}(Q, K, V) = \ text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V
\end{equation}

此类模型的参数量已达数千亿（如GPT-4约1.8万亿，LLaMA-3约4000亿），依赖超大规模训练数据（>1万亿词元）和分布式计算框架，推动NLP任务性能突破。  



**Scaling Laws & Trends:**  
**LLM Scaling Trends:**  
Empirical studies (Kaplan et al., 2020) show model performance scales as a power-law with compute budget ($C$), dataset size ($D$), and parameters ($N$):  

\ begin{equation}
L(N, D) \ approx \left(\frac{N_c}{N}\right)^{\alpha_N} + \left(\frac{D_c}{D}\right)^{\alpha_D} + L_\infty
\end{equation}

For example, Google's PaLM-2 (2023, 340B params) achieved 85% multilingual accuracy, while smaller models (e.g., Mistral-7B, 2024) optimize efficiency via sparse architectures.  

**大语言模型的扩展定律：**  
实证研究（Kaplan等，2020）表明，模型性能随算力（$C$）、数据量（$D$）和参数量（$N$）呈幂律增长：  

\begin{equation}
L(N, D) \approx \left(\frac{N_c}{N}\right)^{\alpha_N} + \left(\frac{D_c}{D}\right)^{\alpha_D} + L_\infty
\end{equation}

例如，谷歌的PaLM-2（2023，3400亿参数）实现了85%的多语言准确率，而小规模模型（如Mistral-7B，2024）通过稀疏架构提升效率。
"""

  ， 。 ！？；：“”‘’【】（）《》、!"\#\$%\&'\(\)\*\+,\-\./:;<=>\?@\[\\\]\^_`\{\|\}\~
  \ begin{equation}


## Train

In [9]:
token_init_list = re.findall(pattern, text_init)
token_corpus_list = re.findall(pattern, corpus_text)
print(token_init_list)
print(token_corpus_list[:100])
print(len(token_corpus_list))

['\n', ' ', 'a', ' ', 'b', ' ', 'c', ' ', 'd', ' ', 'e', ' ', 'f', ' ', 'g', ' ', 'h', ' ', 'i', ' ', 'j', ' ', 'k', ' ', 'l', ' ', 'm', ' ', 'n', ' ', 'o', ' ', 'p', ' ', 'q', ' ', 'r', ' ', 's', ' ', 't', ' ', 'u', ' ', 'v', ' ', 'w', ' ', 'x', ' ', 'y', ' ', 'z', ' ', '\n', ' ', 'A', ' ', 'B', ' ', 'C', ' ', 'D', ' ', 'E', ' ', 'F', ' ', 'G', ' ', 'H', ' ', 'I', ' ', 'J', ' ', 'K', ' ', 'L', ' ', 'M', ' ', 'N', ' ', 'O', ' ', 'P', ' ', 'Q', ' ', 'R', ' ', 'S', ' ', 'T', ' ', 'U', ' ', 'V', ' ', 'W', ' ', 'X', ' ', 'Y', ' ', 'Z', ' ', '\n', ' ', '0', ' ', '1', ' ', '2', ' ', '3', ' ', '4', ' ', '5', ' ', '6', ' ', '7', ' ', '8', ' ', '9', ' ', '1', '0', ' ', '\n', ' ', '<SOS>', ' ', '<EOS>', ' ', '<UNK>', ' ', '<PAD>', '\n', ' ', '，', ' ', '。', ' ', '！', '？', '；', '：', '“', '”', '‘', '’', '【', '】', '（', '）', '《', '》', '、', '!', '"', '\\', '#', '\\', '$', '%', '\\', '&', "'", '\\', '(', '\\', ')', '\\', '*', '\\', '+', ',', '\\', '-', '\\', '.', '/', ':', ';', '<', '=', '>', '\\', '?'

In [10]:
from typing import Dict
# Build the vocabulary {token: token_id}

# Concatenate the two token lists
token_all = token_init_list + token_corpus_list
print("Raw token list size:",len(token_all))

# token -> id, plus the reverse map id -> token
vocab : Dict[str, int] = {}
vocab_reverse: Dict[int, str] = {}
idx = 0
for value in token_all:
    if value not in vocab:
        vocab[value] = idx
        vocab_reverse[idx] = value
        idx += 1
        
print("Vocabulary size:", len(vocab))
print("Vocabulary:", vocab)  # e.g. {'a': 0, 'b': 1}
# print("Reverse vocabulary:", vocab_reverse)  # e.g. {0: 'a', 1: 'b'}

Raw token list size: 1138
Vocabulary size: 308
Vocabulary: {'\n': 0, ' ': 1, 'a': 2, 'b': 3, 'c': 4, 'd': 5, 'e': 6, 'f': 7, 'g': 8, 'h': 9, 'i': 10, 'j': 11, 'k': 12, 'l': 13, 'm': 14, 'n': 15, 'o': 16, 'p': 17, 'q': 18, 'r': 19, 's': 20, 't': 21, 'u': 22, 'v': 23, 'w': 24, 'x': 25, 'y': 26, 'z': 27, 'A': 28, 'B': 29, 'C': 30, 'D': 31, 'E': 32, 'F': 33, 'G': 34, 'H': 35, 'I': 36, 'J': 37, 'K': 38, 'L': 39, 'M': 40, 'N': 41, 'O': 42, 'P': 43, 'Q': 44, 'R': 45, 'S': 46, 'T': 47, 'U': 48, 'V': 49, 'W': 50, 'X': 51, 'Y': 52, 'Z': 53, '0': 54, '1': 55, '2': 56, '3': 57, '4': 58, '5': 59, '6': 60, '7': 61, '8': 62, '9': 63, '<SOS>': 64, '<EOS>': 65, '<UNK>': 66, '<PAD>': 67, '，': 68, '。': 69, '！': 70, '？': 71, '；': 72, '：': 73, '“': 74, '”': 75, '‘': 76, '’': 77, '【': 78, '】': 79, '（': 80, '）': 81, '《': 82, '》': 83, '、': 84, '!': 85, '"': 86, '\\': 87, '#': 88, '$': 89, '%': 90, '&': 91, "'": 92, '(': 93, ')': 94, '*': 95, '+': 96, ',': 97, '-': 98, '.': 99, '/': 100, ':': 101, ';': 102, '<': 103, '=': 104, '>': 105,

## encode

In [11]:
def encode(vocab, pattern, text):
    """
    Encode text into token ids with the vocabulary.
    """
    tokens = re.findall(pattern, text) # apply the tokenization rule
    token_ids = []
    for token in tokens:
        if token in vocab:
            token_ids.append(vocab[token])
        else: 
            token_ids.append(vocab['<UNK>'])
    return tokens, token_ids

# text = "<SOS>我唱跳和rap有 2 年半。<EOS><SOS>I have 12 apples!<PAD><PAD>"
# tokens_new, token_ids_new = encode(vocab, pattern, text)
# print(tokens_new)
# print(token_ids_new)

In [12]:
# Encode a sentence taken from the training corpus
text_tmp = '''Modern LLMs, such as OpenAI's GPT-4 (2023) and Meta's LLaMA-3 (2024), leverage transformer architectures (Vaswani et al., 2017) with self-attention mechanisms'''
tokens, token_ids = encode(vocab, pattern, text_tmp)
print(tokens)
print(token_ids)

['Modern', ' ', 'LLMs', ',', ' ', 'such', ' ', 'as', ' ', 'OpenAI', "'", 's', ' ', 'GPT', '-', '4', ' ', '(', '2', '0', '2', '3', ')', ' ', 'and', ' ', 'Meta', "'", 's', ' ', 'LLaMA', '-', '3', ' ', '(', '2', '0', '2', '4', ')', ',', ' ', 'leverage', ' ', 'transformer', ' ', 'architectures', ' ', '(', 'Vaswani', ' ', 'et', ' ', 'al', '.', ',', ' ', '2', '0', '1', '7', ')', ' ', 'with', ' ', 'self', '-', 'attention', ' ', 'mechanisms']
[66, 1, 121, 97, 1, 123, 1, 124, 1, 125, 92, 20, 1, 126, 98, 58, 1, 93, 56, 54, 56, 57, 94, 1, 127, 1, 128, 92, 20, 1, 129, 98, 57, 1, 93, 56, 54, 56, 58, 94, 97, 1, 130, 1, 131, 1, 132, 1, 93, 133, 1, 134, 1, 135, 99, 97, 1, 56, 54, 55, 61, 94, 1, 136, 1, 137, 98, 138, 1, 139]


## decode

In [14]:
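def decode(vocab_reverse, token_ids, skip_special_tokens = False):
    """Map token ids back to tokens; optionally drop special tokens."""
    decode_token = []
    for idx in token_ids:
        token = vocab_reverse[idx]
        # relies on the global `special_tokens` list defined with the pattern
        if skip_special_tokens and token in special_tokens:
            continue
        decode_token.append(token)
    return decode_token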

In [15]:
decode_token = decode(vocab_reverse, token_ids)
print(decode_token)
print(''.join(decode_token))

['<UNK>', ' ', 'LLMs', ',', ' ', 'such', ' ', 'as', ' ', 'OpenAI', "'", 's', ' ', 'GPT', '-', '4', ' ', '(', '2', '0', '2', '3', ')', ' ', 'and', ' ', 'Meta', "'", 's', ' ', 'LLaMA', '-', '3', ' ', '(', '2', '0', '2', '4', ')', ',', ' ', 'leverage', ' ', 'transformer', ' ', 'architectures', ' ', '(', 'Vaswani', ' ', 'et', ' ', 'al', '.', ',', ' ', '2', '0', '1', '7', ')', ' ', 'with', ' ', 'self', '-', 'attention', ' ', 'mechanisms']
<UNK> LLMs, such as OpenAI's GPT-4 (2023) and Meta's LLaMA-3 (2024), leverage transformer architectures (Vaswani et al., 2017) with self-attention mechanisms


## Analysis

In [16]:
text = "<SOS>我唱跳和rap有 2 年半。<EOS><SOS>I have 12 apples!<PAD><PAD>"
tokens, token_ids = encode(vocab, pattern, text)
print('\nOriginal text:', text)
print('\nTokens:',tokens)
print('\ntoken ids:',token_ids)
decode_token = decode(vocab_reverse, token_ids)
print('\ndecode ids:',decode_token)
print('\nDecoded text',''.join(decode_token))


Original text: <SOS>我唱跳和rap有 2 年半。<EOS><SOS>I have 12 apples!<PAD><PAD>

Tokens: ['<SOS>', '我', '唱', '跳', '和', 'rap', '有', ' ', '2', ' ', '年', '半', '。', '<EOS>', '<SOS>', 'I', ' ', 'have', ' ', '1', '2', ' ', 'apples', '!', '<PAD>', '<PAD>']

token ids: [64, 66, 66, 66, 188, 66, 66, 1, 56, 1, 66, 66, 69, 65, 64, 36, 1, 66, 1, 55, 56, 1, 66, 85, 67, 67]

decode ids: ['<SOS>', '<UNK>', '<UNK>', '<UNK>', '和', '<UNK>', '<UNK>', ' ', '2', ' ', '<UNK>', '<UNK>', '。', '<EOS>', '<SOS>', 'I', ' ', '<UNK>', ' ', '1', '2', ' ', '<UNK>', '!', '<PAD>', '<PAD>']

Decoded text <SOS><UNK><UNK><UNK>和<UNK><UNK> 2 <UNK><UNK>。<EOS><SOS>I <UNK> 12 <UNK>!<PAD><PAD>


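The token '我' above is not in the vocabulary, so the rule encodes it as '<UNK>', and the original text can no longer be recovered from the ids.

Improving the encoding algorithm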

In [17]:
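def encode_anything(vocab, pattern, text):
    """
    Encode text with the vocabulary, falling back to single characters
    for multi-character tokens that are not in the vocabulary.
    """
    tokens = re.findall(pattern, text) # apply the tokenization rule
    token_ids = []
    for token in tokens:
        if token in vocab:
            token_ids.append(vocab[token])
        else: 
            if len(token) == 1:
                token_ids.append(vocab['<UNK>'])
            else:
                for t in token:
                    # a character missing from the vocabulary becomes <UNK> instead of raising KeyError
                    token_ids.append( vocab.get(t, vocab['<UNK>']) )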
    return tokens, token_ids

In [18]:
text = "<SOS>我唱跳和rap有 2 年半。<EOS><SOS>I have 12 apples!<PAD><PAD>"
tokens, token_ids = encode_anything(vocab, pattern, text) # the new encoding function
print('\nOriginal text:', text)
print('\nTokens:',tokens)
print('\ntoken ids:',token_ids)
decode_token = decode(vocab_reverse, token_ids)
print('\ndecode ids:',decode_token)
print('\nDecoded text',''.join(decode_token))


Original text: <SOS>我唱跳和rap有 2 年半。<EOS><SOS>I have 12 apples!<PAD><PAD>

Tokens: ['<SOS>', '我', '唱', '跳', '和', 'rap', '有', ' ', '2', ' ', '年', '半', '。', '<EOS>', '<SOS>', 'I', ' ', 'have', ' ', '1', '2', ' ', 'apples', '!', '<PAD>', '<PAD>']

token ids: [64, 66, 66, 66, 188, 19, 2, 17, 66, 1, 56, 1, 66, 66, 69, 65, 64, 36, 1, 9, 2, 23, 6, 1, 55, 56, 1, 2, 17, 17, 13, 6, 20, 85, 67, 67]

decode ids: ['<SOS>', '<UNK>', '<UNK>', '<UNK>', '和', 'r', 'a', 'p', '<UNK>', ' ', '2', ' ', '<UNK>', '<UNK>', '。', '<EOS>', '<SOS>', 'I', ' ', 'h', 'a', 'v', 'e', ' ', '1', '2', ' ', 'a', 'p', 'p', 'l', 'e', 's', '!', '<PAD>', '<PAD>']

Decoded text <SOS><UNK><UNK><UNK>和rap<UNK> 2 <UNK><UNK>。<EOS><SOS>I have 12 apples!<PAD><PAD>


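In the text above, `rap`, `have`, and `apples` can now be encoded character by character,

e.g. `apples` -> `'a', 'p', 'p', 'l', 'e', 's'`.

Because `我` is not in the vocabulary, it can never be encoded. This is a limitation of any tokenizer whose vocabulary does not cover a symbol.

If the vocabulary had no `s`, the result would be `'a', 'p', 'p', 'l', 'e', '<UNK>'`.

One remedy is to train a separate vocabulary on a Chinese corpus and merge it into the existing one.

The rule is deterministic, so the encoding of a given text is unique.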
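
The merge idea can be sketched as follows. `merge_vocab` is a hypothetical helper, the two toy vocabularies are invented for the example, and the sketch assumes the existing ids are contiguous from 0 so that `len(merged)` is always the next free id.

```python
from typing import Dict, Iterable

def merge_vocab(base: Dict[str, int], new_tokens: Iterable[str]) -> Dict[str, int]:
    """Return a copy of `base` with unseen tokens appended after the existing ids."""
    merged = dict(base)
    for tok in new_tokens:
        if tok not in merged:
            # assumption: ids are 0..len-1, so len(merged) is unused
            merged[tok] = len(merged)
    return merged

base_vocab = {'<UNK>': 0, 'a': 1, 'p': 2, 'l': 3, 'e': 4, 's': 5, '和': 6}
zh_tokens = ['我', '唱', '跳', '和', '有', '年', '半']  # tokens from a hypothetical Chinese corpus

merged_vocab = merge_vocab(base_vocab, zh_tokens)
print(merged_vocab)
```

Existing ids are left untouched, so token ids produced before the merge still decode to the same tokens afterwards.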

## Tokenization Example 1

In [19]:
# English text

english_text ="""
### **Love Story**  
*By Taylor Swift*  

**[Verse 1]**  
We were both young when I first saw you  
I close my eyes and the flashback starts  
I'm standing there on a balcony in summer air  

See the lights, see the party, the ball gowns  
See you make your way through the crowd  
And say, "Hello, little did I know"  

**[Pre-Chorus]**  
That you were Romeo, you were throwing pebbles  
And my daddy said, "Stay away from Juliet"  
And I was crying on the staircase  
Begging you, "Please don't go"  

**[Chorus]**  
And I said,  
"Romeo, take me somewhere we can be alone  
I'll be waiting, all there's left to do is run  
You'll be the prince and I'll be the princess  
It's a love story, baby, just say 'Yes'"  

**[Verse 2]**  
So I sneak out to the garden to see you  
We keep quiet, 'cause we're dead if they knew  
So close your eyes, escape this town for a little while  

**[Pre-Chorus]**  
'Cause you were Romeo, I was a scarlet letter  
And my daddy said, "Stay away from Juliet"  
But you were everything to me  
I was begging you, "Please don't go"  

**[Chorus]**  
And I said,  
"Romeo, take me somewhere we can be alone  
I'll be waiting, all there's left to do is run  
You'll be the prince and I'll be the princess  
It's a love story, baby, just say 'Yes'  
Romeo, save me, they're trying to tell me how to feel  
This love is difficult, but it's real  
Don't be afraid, we'll make it out of this mess  
It's a love story, baby, just say 'Yes'"  

**[Bridge]**  
I got tired of waiting  
Wondering if you were ever coming around  
My faith in you was fading  
When I met you on the outskirts of town  

And I said,  
"Romeo, save me, I've been feeling so alone  
I keep waiting for you, but you never come  
Is this in my head? I don't know what to think"  
He knelt to the ground and pulled out a ring  

**[Final Chorus]**  
And said,  
"Marry me, Juliet, you'll never have to be alone  
I love you and that's all I really know  
I talked to your dad, go pick out a white dress  
It's a love story, baby, just say 'Yes'"  

**[Outro]**  
Oh, oh, oh  
Oh, oh, oh  
'Cause we were both young when I first saw you  
"""

tokens, token_ids = encode_anything(vocab, pattern, english_text) # the new encoding function
print('\nOriginal text:', english_text[:100])
print('\nTokens:',tokens[:100])
print('\ntoken ids:',token_ids[:100])
decode_token = decode(vocab_reverse, token_ids[:100])
print('\ndecode ids:',decode_token[:100])
print('\nDecoded text',''.join(decode_token[:100]))


Original text: 
### **Love Story**  
*By Taylor Swift*  

**[Verse 1]**  
We were both young when I first saw you  

Tokens: ['\n', '#', '#', '#', ' ', '*', '*', 'Love', ' ', 'Story', '*', '*', ' ', ' ', '\n', '*', 'By', ' ', 'Taylor', ' ', 'Swift', '*', ' ', ' ', '\n\n', '*', '*', '[', 'Verse', ' ', '1', ']', '*', '*', ' ', ' ', '\nWe', ' ', 'were', ' ', 'both', ' ', 'young', ' ', 'when', ' ', 'I', ' ', 'first', ' ', 'saw', ' ', 'you', ' ', ' ', '\nI', ' ', 'close', ' ', 'my', ' ', 'eyes', ' ', 'and', ' ', 'the', ' ', 'flashback', ' ', 'starts', ' ', ' ', '\nI', "'", 'm', ' ', 'standing', ' ', 'there', ' ', 'on', ' ', 'a', ' ', 'balcony', ' ', 'in', ' ', 'summer', ' ', 'air', ' ', ' ', '\n\nSee', ' ', 'the', ' ', 'lights', ',', ' ']

token ids: [0, 88, 88, 88, 1, 95, 95, 39, 16, 23, 6, 1, 46, 21, 16, 19, 26, 95, 95, 1, 1, 0, 95, 29, 26, 1, 47, 2, 26, 13, 16, 19, 1, 46, 24, 10, 7, 21, 95, 1, 1, 117, 95, 95, 108, 49, 6, 19, 20, 6, 1, 55, 109, 95, 95, 1, 1, 0, 50, 6, 1, 24, 6, 19, 6, 1, 3, 16, 21, 9,

In [20]:
## Tokenization example 2:

code_text = """

[LeetCode 53. Maximum Subarray]

from typing import List

class Solution:
    def maxSubArray(self, nums: List[int]) -> int:
        # 当前以 nums[i] 结尾的子数组最大和
        cur_max = nums[0]
        # 全局最大和
        global_max = nums[0]

        for i in range(1, len(nums)):
            # 要么接在前面的子数组后面，要么从当前位置重新开始
            cur_max = max(nums[i], cur_max + nums[i])
            # 更新全局最大值
            global_max = max(global_max, cur_max)
        return global_max
"""


tokens, token_ids = encode_anything(vocab, pattern, code_text) # the new encoding function
print('\nOriginal text:', code_text[:100])
print('\nTokens:',tokens[:100])
print('\ntoken ids:',token_ids[:100])
decode_token = decode(vocab_reverse, token_ids)
print('\ndecode ids:',decode_token[:100])
print('\nDecoded text',''.join(decode_token))


Original text: 

[LeetCode 53. Maximum Subarray]

from typing import List

class Solution:
    def maxSubArray(self

Tokens: ['\n\n', '[', 'LeetCode', ' ', '5', '3', '.', ' ', 'Maximum', ' ', 'Subarray', ']', '\n\nfrom', ' ', 'typing', ' ', 'import', ' ', 'List\n\nclass', ' ', 'Solution', ':', '\n', ' ', ' ', ' ', ' ', 'def', ' ', 'maxSubArray', '(', 'self', ',', ' ', 'nums', ':', ' ', 'List', '[', 'int', ']', ')', ' ', '-', '>', ' ', 'int', ':', '\n', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', '#', ' ', '当', '前', '以', ' ', 'nums', '[', 'i', ']', ' ', '结', '尾', '的', '子', '数', '组', '最', '大', '和', '\n', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', 'cur', '_', 'max', ' ', '=', ' ', 'nums', '[', '0', ']', '\n', ' ', ' ', ' ']

token ids: [117, 108, 39, 6, 6, 21, 30, 16, 5, 6, 1, 59, 57, 99, 1, 40, 2, 25, 10, 14, 22, 14, 1, 46, 22, 3, 2, 19, 19, 2, 26, 109, 0, 0, 7, 19, 16, 14, 1, 21, 26, 17, 10, 15, 8, 1, 10, 14, 17, 16, 19, 21, 1, 39, 10, 20, 21, 0, 0, 4, 13, 2, 20, 20, 1, 46, 16, 13, 22, 21, 10, 16, 15, 101,

## Batch Tokenize

- pre-process
- padding
- truncation

In [181]:
# special_tokens = ['<SOS>', '<EOS>', '<PAD>', '<UNK>']
# @dataclass
class SpecialToken:
    def __init__(self,):
        self.sos_token = '<SOS>'
        self.eos_token = '<EOS>'
        self.pad_token = '<PAD>'
        self.unk_token = '<UNK>'

special_token = SpecialToken()


print('\nsos token: ', vocab[special_token.sos_token], 
      '\neos token: ', vocab[special_token.eos_token], 
      '\npad token: ', vocab[special_token.pad_token], 
      '\nunk token: ', vocab[special_token.unk_token], )

# example
texts = ["Large Language Models (LLMs) are advanced AI systems trained on vast datasets to understand, generate, and manipulate human-like text. ", 
         "I have 12 apples!", 
         "large language model"]


sos token:  64 
eos token:  65 
pad token:  67 
unk token:  66


### Pre-processing

In [182]:
# example

texts = ["Large Language Models (LLMs) are advanced AI systems trained on vast datasets to understand, generate, and manipulate human-like text. ", 
         "I have 12 apples!", 
         "large language model"]

# There are two ways to pre-process: add special tokens at the text level, or add special token ids at the token level
# 1. pre process text
def text_pre_process(text, sos = '', eos = ''):
    return sos + text + eos

texts_pre = [ text_pre_process(text, special_token.sos_token, special_token.eos_token) for text in texts ]
token_ids_1 = [ encode_anything(vocab, pattern, text)[1] for text in texts_pre ]
print(texts_pre)
print(token_ids_1[1])

# 2. pre process token
def token_pre_process(tokens, sos = None, eos = None):
    if sos is not None:
        tokens = [sos] + tokens
    if eos is not None:
        tokens = tokens + [eos] 
    return tokens

token_ids = [ encode_anything(vocab, pattern, text)[1] for text in texts ]
# token_pre = [ token_pre_process(token, 
#                                sos = vocab[special_token.sos_token],
#                                eos = vocab[special_token.eos_token]) for token in token_ids ]
token_ids_2 = []
for tokens in token_ids:
    tmp = token_pre_process(tokens = tokens, sos = vocab[special_token.sos_token], eos = vocab[special_token.eos_token])
    token_ids_2.append(tmp)
print(token_ids_2[1])

input_ids = token_ids_2

['<SOS>Large Language Models (LLMs) are advanced AI systems trained on vast datasets to understand, generate, and manipulate human-like text. <EOS>', '<SOS>I have 12 apples!<EOS>', '<SOS>large language model<EOS>']
[64, 36, 1, 9, 2, 23, 6, 1, 55, 56, 1, 2, 17, 17, 13, 6, 20, 85, 65]
[64, 36, 1, 9, 2, 23, 6, 1, 55, 56, 1, 2, 17, 17, 13, 6, 20, 85, 65]


### Padding

In [189]:
import torch
# Use the "longest" padding strategy (pad to the longest sequence in the batch); right-padding is the default.
# After padding every sequence has the same length, so the batch can be stored in a tensor.
# Padding is applied at the token level; padding the raw text would be meaningless.

max_len = 512

def padding(input_ids, pad_token_id = None, padding_side = 'RIGHT'):
    if pad_token_id is None:
        return
    tokens_lens = [ len(ids) for ids in input_ids]
    tokens_lens = torch.tensor(tokens_lens, dtype = torch.long)
    tokens_max_len = torch.max(tokens_lens)
    padded_input_ids = torch.ones(len(input_ids), tokens_max_len, dtype = torch.long) * pad_token_id
    
    if padding_side == 'RIGHT':
        for i in range(len(input_ids)):
            padded_input_ids[i, :tokens_lens[i]] = torch.tensor(input_ids[i], dtype = torch.long)
    else: # any other value means left padding
        for i in range(len(input_ids)):
            padded_input_ids[i, -tokens_lens[i]:] = torch.tensor(input_ids[i], dtype = torch.long)
    return padded_input_ids
    

pad_input_ids = padding(input_ids,
            pad_token_id = vocab[ special_token.pad_token ],
            padding_side = 'Left')
print(pad_input_ids)


pad_input_ids = padding(input_ids,
            pad_token_id = vocab[ special_token.pad_token ],
            padding_side = 'RIGHT')
print(pad_input_ids)

tensor([[ 64, 117,   1, 118,   1, 119,   1,  93, 120,  94,   1,   2,  19,   6,
           1,   2,   5,  23,   2,  15,   4,   6,   5,   1,  28,  36,   1,  20,
          26,  20,  21,   6,  14,  20,   1,  21,  19,   2,  10,  15,   6,   5,
           1,  16,  15,   1,  23,   2,  20,  21,   1, 170,   1, 154,   1,  22,
          15,   5,   6,  19,  20,  21,   2,  15,   5,  97,   1,   8,   6,  15,
           6,  19,   2,  21,   6,  97,   1, 126,   1,  14,   2,  15,  10,  17,
          22,  13,   2,  21,   6,   1,   9,  22,  14,   2,  15,  98,  13,  10,
          12,   6,   1,  21,   6,  25,  21,  99,   1,  65],
        [ 67,  67,  67,  67,  67,  67,  67,  67,  67,  67,  67,  67,  67,  67,
          67,  67,  67,  67,  67,  67,  67,  67,  67,  67,  67,  67,  67,  67,
          67,  67,  67,  67,  67,  67,  67,  67,  67,  67,  67,  67,  67,  67,
          67,  67,  67,  67,  67,  67,  67,  67,  67,  67,  67,  67,  67,  67,
          67,  67,  67,  67,  67,  67,  67,  67,  67,  67,  67,  67,  6
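
Because the tensor output above is long, a small pure-Python sketch may make the difference between the two padding sides easier to see. `pad_batch` and the toy ids are assumptions made for this illustration; only the pad id 67 is taken from the notebook's vocabulary.

```python
from typing import List

def pad_batch(batch: List[List[int]], pad_id: int, side: str = "right") -> List[List[int]]:
    """Pad every sequence to the length of the longest one in the batch."""
    longest = max(len(seq) for seq in batch)
    padded = []
    for seq in batch:
        fill = [pad_id] * (longest - len(seq))
        # right padding appends the fill, left padding prepends it
        padded.append(seq + fill if side == "right" else fill + seq)
    return padded

toy_batch = [[64, 5, 6, 7, 65], [64, 8, 65], [64, 65]]
right_padded = pad_batch(toy_batch, pad_id=67, side="right")
left_padded = pad_batch(toy_batch, pad_id=67, side="left")
print(right_padded)
print(left_padded)
```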

### Truncation

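1. Truncate the already padded token sequences to the target length.
2. Truncate before padding.

When truncating after padding, the truncation side should match the padding side, e.g. left padding `[p, p, 2, 3]` truncated from the left gives `[p, 2, 3]`.

Which truncation side is more reasonable?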
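
One way to approach the question is to look at what each side discards. The sketch below uses a hypothetical `truncate` helper on a toy sequence and assumes `max_len >= 1`.

```python
from typing import List

def truncate(seq: List[int], max_len: int, side: str = "right") -> List[int]:
    """Drop tokens from the given side until at most max_len remain (max_len >= 1)."""
    if len(seq) <= max_len:
        return list(seq)
    return seq[:max_len] if side == "right" else seq[-max_len:]

toy_seq = [64, 10, 11, 12, 13, 65]  # 64 = <SOS>, 65 = <EOS> in this notebook's vocabulary

right_cut = truncate(toy_seq, 4, side="right")
left_cut = truncate(toy_seq, 4, side="left")
print(right_cut)  # [64, 10, 11, 12]  keeps the start, drops <EOS>
print(left_cut)   # [11, 12, 13, 65]  keeps the end, drops <SOS>
```

Right truncation keeps the beginning of the text and left truncation keeps the end, so the choice depends on which part of the sequence the downstream task needs.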

In [197]:
# Demonstrates approach 2: truncate first, then pad

# max_len = 4, max_seq_len = 5
#                  cut
# seq1: 1, 1, 1, 1, | 1
# seq2: 1, 1, 
# seq3: 1, 1, 1, 

def padding_max_length(input_ids, 
                       max_len = 32, 
                       pad_token_id = None, 
                       padding_side = 'RIGHT',
                       truncation_side = 'RIGHT'):
    if pad_token_id is None:
        return
    # target length: the longest sequence in the batch, capped at max_len
    tokens_max_len = min(max(len(ids) for ids in input_ids), max_len)

    if truncation_side == 'RIGHT':
        input_ids = [ ids[ : tokens_max_len] for ids in input_ids]   # keep the first tokens
    else:
        input_ids = [ ids[ -tokens_max_len : ] for ids in input_ids]  # keep the last tokens

    padded_input_ids = torch.ones(len(input_ids), tokens_max_len, dtype = torch.long) * pad_token_id
    if padding_side == 'RIGHT':
        for i in range(len(input_ids)):
            padded_input_ids[i, : len(input_ids[i])] = torch.tensor(input_ids[i], dtype = torch.long)
    else: # left padding
        for i in range(len(input_ids)):
            padded_input_ids[i, -len(input_ids[i]):] = torch.tensor(input_ids[i], dtype = torch.long)

    return padded_input_ids
    

pad_input_ids = padding_max_length(input_ids,
            max_len =  32,
            pad_token_id = vocab[ special_token.pad_token ],
            padding_side = 'RIGHT',
            truncation_side = 'RIGHT')
print(pad_input_ids)

tensor([[ 64, 117,   1, 118,   1, 119,   1,  93, 120,  94,   1,   2,  19,   6,
           1,   2,   5,  23,   2,  15,   4,   6,   5,   1,  28,  36,   1,  20,
          26,  20,  21,   6],
        [ 64,  36,   1,   9,   2,  23,   6,   1,  55,  56,   1,   2,  17,  17,
          13,   6,  20,  85,  65,  67,  67,  67,  67,  67,  67,  67,  67,  67,
          67,  67,  67,  67],
        [ 64,  13,   2,  19,   8,   6,   1,  13,   2,  15,   8,  22,   2,   8,
           6,   1, 244,  65,  67,  67,  67,  67,  67,  67,  67,  67,  67,  67,
          67,  67,  67,  67]])


## Tokenizer IO

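1. The vocabulary must be saved.
2. The special tokens must be saved.
3. A config dictionary is needed to manage these settings.
4. On load, the dictionary is converted back into a dataclass.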
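
The four points can be sketched as an in-memory round trip before any files are involved. `ToyConfig` is a hypothetical, cut-down stand-in for the tokenizer config, and the toy vocabulary is invented for the example.

```python
import json
from dataclasses import dataclass, asdict

@dataclass
class ToyConfig:  # hypothetical stand-in for the tokenizer config
    vocab_size: int = -1
    unk_token: str = '<UNK>'
    unk_token_id: int = -1

toy_vocab = {'<UNK>': 0, 'a': 1, 'b': 2}
toy_config = ToyConfig(vocab_size=len(toy_vocab), unk_token_id=toy_vocab['<UNK>'])

# save: dataclass -> dict -> JSON text
payload = json.dumps({'config': asdict(toy_config), 'vocab': toy_vocab})

# load: JSON text -> dict -> dataclass
loaded = json.loads(payload)
config_back = ToyConfig(**loaded['config'])
vocab_back = loaded['vocab']
# JSON object keys are always strings, so the id -> token map is rebuilt rather than stored
vocab_reverse_back = {idx: tok for tok, idx in vocab_back.items()}

print(config_back == toy_config, vocab_reverse_back)
```

Storing only the token-to-id map and rebuilding the reverse map on load avoids integer keys being turned into strings by JSON.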

In [236]:
from typing import Any
from dataclasses import dataclass, asdict

@dataclass
class TokenizerBaseConfig:
    vocab_size: int = -1
    class_name: str = 'TokenizerBase'
    sos_token: str = '<SOS>'
    sos_token_id: int = -1
    eos_token: str = '<EOS>'
    eos_token_id: int = -1
    pad_token: str = '<PAD>'
    pad_token_id: int = -1
    unk_token: str = '<UNK>'
    unk_token_id: int = -1
    pattern: str = ''
    

config = TokenizerBaseConfig()
print(config)

config_dict = {
    'vocab_size' : len(vocab),
    'class_name' : 'TokenizerBaseConfig',
    'sos_token' : special_token.sos_token,
    'sos_token_id' : vocab[special_token.sos_token],
    'eos_token' : special_token.eos_token,
    'eos_token_id' : vocab[special_token.eos_token],
    'pad_token' : special_token.pad_token,
    'pad_token_id'  : vocab[special_token.pad_token],
    'unk_token' : special_token.unk_token,
    'unk_token_id' : vocab[special_token.unk_token],
    'pattern' : pattern,
}
# print(config_dict)

config = TokenizerBaseConfig(**config_dict)
print(config)

TokenizerBaseConfig(vocab_size=-1, class_name='TokenizerBase', sos_token='<SOS>', sos_token_id=-1, eos_token='<EOS>', eos_token_id=-1, pad_token='<PAD>', pad_token_id=-1, unk_token='<UNK>', unk_token_id=-1, pattern='')
TokenizerBaseConfig(vocab_size=302, class_name='TokenizerBaseConfig', sos_token='<SOS>', sos_token_id=64, eos_token='<EOS>', eos_token_id=65, pad_token='<PAD>', pad_token_id=67, unk_token='<UNK>', unk_token_id=66, pattern='(?:<SOS>|<EOS>|<PAD>|<UNK>|\n)|[，。！？；：“”‘’【】（）《》、!"\\\\\\#\\\\\\$%\\\\\\&\'\\\\\\(\\\\\\)\\\\\\*\\\\\\+,\\\\\\-\\\\\\./:;<=>\\\\\\?@\\\\\\[\\\\\\\\\\\\\\]\\\\\\^_`\\\\\\{\\\\\\|\\\\\\}\\\\\\~\\ ]|\\d|[\\u4e00-\\u9fa5]|[^，。！？；：“”‘’【】（）《》、!"\\\\\\#\\\\\\$%\\\\\\&\'\\\\\\(\\\\\\)\\\\\\*\\\\\\+,\\\\\\-\\\\\\./:;<=>\\\\\\?@\\\\\\[\\\\\\\\\\\\\\]\\\\\\^_`\\\\\\{\\\\\\|\\\\\\}\\\\\\~\\ \\d\\u4e00-\\u9fa5<>]+')

### tokenizer save

In [237]:
!mkdir output

mkdir: output: File exists


In [247]:
# Save functions: store the vocab and the config
from dataclasses import asdict
import json
import os

def save_dict_to_json(filepath, data):
    """Save a dictionary as a JSON file."""
    with open(filepath, 'w', encoding='utf-8') as f:
        json.dump(data, f, ensure_ascii=True, indent=4)
    print(f"Dictionary saved as JSON file: {filepath}")

def save_pretrained(directory, vocab : Dict[str, int], config): 
    """
    Save the tokenizer: the vocabulary and the config.
    The config stores the tokenizer class name and the tokenization rule (the regex pattern).
    """
    if not os.path.exists(directory):
        os.makedirs(directory)
        print(f"Directory '{directory}' created")
    else:
        print(f"Directory '{directory}' already exists")
        # return False

    vocab_path = os.path.join(directory, 'vocab.json')
    config_path = os.path.join(directory, 'config.json')

    config_dict = asdict(config)

    save_dict_to_json(config_path, config_dict)
    save_dict_to_json(vocab_path, vocab)

In [248]:
save_pretrained('./output/tokenizer', vocab, config)

Directory './output/tokenizer' already exists
Dictionary saved as JSON file: ./output/tokenizer/config.json
Dictionary saved as JSON file: ./output/tokenizer/vocab.json


In [249]:
!cat ./output/tokenizer/config.json
# !cat ./output/tokenizer/vocab.json

{
    "vocab_size": 302,
    "class_name": "TokenizerBaseConfig",
    "sos_token": "<SOS>",
    "sos_token_id": 64,
    "eos_token": "<EOS>",
    "eos_token_id": 65,
    "pad_token": "<PAD>",
    "pad_token_id": 67,
    "unk_token": "<UNK>",
    "unk_token_id": 66,
    "pattern": "(?:<SOS>|<EOS>|<PAD>|<UNK>|\n)|[\uff0c\u3002\uff01\uff1f\uff1b\uff1a\u201c\u201d\u2018\u2019\u3010\u3011\uff08\uff09\u300a\u300b\u3001!\"\\\\\\#\\\\\\$%\\\\\\&'\\\\\\(\\\\\\)\\\\\\*\\\\\\+,\\\\\\-\\\\\\./:;<=>\\\\\\?@\\\\\\[\\\\\\\\\\\\\\]\\\\\\^_`\\\\\\{\\\\\\|\\\\\\}\\\\\\~\\ ]|\\d|[\\u4e00-\\u9fa5]|[^\uff0c\u3002\uff01\uff1f\uff1b\uff1a\u201c\u201d\u2018\u2019\u3010\u3011\uff08\uff09\u300a\u300b\u3001!\"\\\\\\#\\\\\\$%\\\\\\&'\\\\\\(\\\\\\)\\\\\\*\\\\\\+,\\\\\\-\\\\\\./:;<=>\\\\\\?@\\\\\\[\\\\\\\\\\\\\\]\\\\\\^_`\\\\\\{\\\\\\|\\\\\\}\\\\\\~\\ \\d\\u4e00-\\u9fa5<>]+"
}

### tokenizer load

In [263]:
def load_tokenizer(directory):
    
    vocab_path = os.path.join(directory, 'vocab.json')
    config_path = os.path.join(directory, 'config.json')
    
    if os.path.isfile(config_path):
        with open(config_path, encoding='utf-8') as f:
            config = json.load(f) # json.load returns a dict
            print('Loaded successfully:')
    else:
        # fail early: continuing would hit an undefined `config` below
        raise FileNotFoundError(f'[Error] file not found: {config_path}')

        
    if 'class_name' in config:
        cls = globals()[config['class_name']]  # look up the class object by name
        config = cls(**config)
    else:
        print('not specified tokenizer class name')

    
    if os.path.isfile(vocab_path):
        with open(vocab_path, encoding='utf-8') as f:
            vocab = json.load(f) # json.load returns a dict
            print('Loaded successfully:')
    else:
        raise FileNotFoundError(f'[Error] file not found: {vocab_path}')

    return config, vocab

In [265]:
load_path = './output/tokenizer'
config_load, vocab_load =load_tokenizer(load_path)
print(config_load) # a dataclass instance
print(vocab_load)

Loaded successfully:
Loaded successfully:
TokenizerBaseConfig(vocab_size=302, class_name='TokenizerBaseConfig', sos_token='<SOS>', sos_token_id=64, eos_token='<EOS>', eos_token_id=65, pad_token='<PAD>', pad_token_id=67, unk_token='<UNK>', unk_token_id=66, pattern='(?:<SOS>|<EOS>|<PAD>|<UNK>|\n)|[，。！？；：“”‘’【】（）《》、!"\\\\\\#\\\\\\$%\\\\\\&\'\\\\\\(\\\\\\)\\\\\\*\\\\\\+,\\\\\\-\\\\\\./:;<=>\\\\\\?@\\\\\\[\\\\\\\\\\\\\\]\\\\\\^_`\\\\\\{\\\\\\|\\\\\\}\\\\\\~\\ ]|\\d|[\\u4e00-\\u9fa5]|[^，。！？；：“”‘’【】（）《》、!"\\\\\\#\\\\\\$%\\\\\\&\'\\\\\\(\\\\\\)\\\\\\*\\\\\\+,\\\\\\-\\\\\\./:;<=>\\\\\\?@\\\\\\[\\\\\\\\\\\\\\]\\\\\\^_`\\\\\\{\\\\\\|\\\\\\}\\\\\\~\\ \\d\\u4e00-\\u9fa5<>]+')
{'\n': 0, ' ': 1, 'a': 2, 'b': 3, 'c': 4, 'd': 5, 'e': 6, 'f': 7, 'g': 8, 'h': 9, 'i': 10, 'j': 11, 'k': 12, 'l': 13, 'm': 14, 'n': 15, 'o': 16, 'p': 17, 'q': 18, 'r': 19, 's': 20, 't': 21, 'u': 22, 'v': 23, 'w': 24, 'x': 25, 'y': 26, 'z': 27, 'A': 28, 'B': 29, 'C': 30, 'D': 31, 'E': 32, 'F': 33, 'G': 34, 'H': 35, 'I': 36, 'J': 37, 'K': 38, 'L': 39, 'M': 4

## Tokenizer Base

In [266]:
from abc import ABC, abstractmethod
from typing import Any, Dict, List, Optional, Union

import torch


class TokenizerBase(ABC):
    def __init__(self):
        self.vocab : Dict[str, int] = {}
        self.vocab_reverse : Dict[int, str] = {}
        self.vocab_size : int = 0
        self.special_token : Dict[str, int] = {}

    @abstractmethod
    def init_vocab(self, vocab: Dict[str, int]) -> None:
        """
        Initialize the vocabulary, either from an existing dict
        or from a default set of basic characters.
        """
        pass

    @abstractmethod
    def train(self, text: Union[str, List[str]]) -> None:
        """
        Train on the input corpus.
        """
        pass

    @abstractmethod
    def add_special_token(self, token: Dict[str, str]) -> None:
        """
        Add special tokens and store them in the special-token table.
        """
        pass

    @abstractmethod
    def encode(self,
               input_list: List[str],
               padding: Union[bool, str] = False,
               padding_side: str = 'right',
               max_length: Union[int, str] = 'long',
               add_bos_token: bool = False,
               add_eos_token: bool = False,
               add_pad_token: bool = False,
               return_type: Optional[str] = None,  # 'pt': PyTorch tensor
               ) -> Union[torch.Tensor, List[List[int]]]:
        """
        Batch encoding.
        input: ["I have 12 apples!"]
        output: a list of 9 ids  # -> ['I', ' ', 'have', ' ', '1', '2', ' ', 'apples', '!']
        """
        pass

    @abstractmethod
    def decode(self,
               token_ids: List[List[int]],
               skip_special_token: bool = True,
               return_string: bool = True,
               ) -> Union[List[str], List[List[str]]]:
        """
        Batch decoding.
        """
        pass

    @abstractmethod
    def from_pretrained(self, filepath: str = './tokenizer') -> None:
        pass

    @abstractmethod
    def save_pretrained(self, filepath: str = './tokenizer') -> None:
        pass

    @abstractmethod
    def chat_template(self,
                      prompt: Optional[Union[str, List[str]]] = None,
                      response: Optional[Union[str, List[str]]] = None,
                      messages: Optional[List[Dict[str, Any]]] = None,
                      tokenize: bool = False,
                      add_response_prompt: bool = False,
                      ) -> Union[str, List[str], List[int], List[List[int]]]:
        pass